December 12, 2006 4:00 AM PST

Working with Office 2007 documents under Mac OS X: Extracting text from .docx files, more

by CNET staff

You may have a few Windows-using friends or co-workers who have already made the leap to Office 2007, and you probably know more who are planning to upgrade in the coming weeks. With its new file formats, Office 2007 creates documents that won't be readily accessible under any current version of Office for Mac OS X (v.X or 2004).

Microsoft has promised beta file conversion utilities in Spring 2007 that will allow you to open these files (dubbed "Open XML") in Office 2004 (and possibly Office v.X), but users seeking interoperability are largely left in the lurch until then. There are, however, a few promising means for exchanging documents with Office 2007 users, or at least extracting pertinent data from said documents.

Files created with the new Open XML format used by Office 2007 are actually ZIP packages that contain various XML files as well as images and other data. Since they are archived folders, the "meat" of any document is stored in a directory (or directories) within the document package. For instance, a .docx document (created by Word 2007) contains a directory named word that holds various XML documents with the actual text. For Excel workbooks (.xlsx), the equivalent directory is named xl; for PowerPoint presentations (.pptx), it is ppt.
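Because the package is an ordinary ZIP archive, its layout can be inspected without any Office software. The following is a minimal sketch in Python using only the standard library; the function names and the tiny in-memory sample package are assumptions made for illustration, not part of any Microsoft tool.

```python
import io
import zipfile


def list_package_entries(package):
    """Return the names of all entries stored in an Open XML package.

    `package` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(package) as archive:
        return archive.namelist()


def build_sample_package():
    """Build a tiny stand-in for a .docx file in memory (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
    return buffer


if __name__ == "__main__":
    # With a real file you would pass its path instead, e.g. "file.docx".
    for name in list_package_entries(build_sample_package()):
        print(name)
```

A real document will list many more entries, but the part holding the body text should appear under the word directory.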

Manual expansion/stripping

One brute-force alternative for extracting data under Mac OS X is to change the .docx extension of a received Office 2007 document to .zip (e.g. file.docx to file.zip), then double-click the file to expand it. You can then peer inside the expanded folder (contents of a Word .docx file shown to right). Again, the items you want are located in the /word directory (or the corresponding directory for other Office applications).

For Word documents, once you've opened the /word directory, you'll probably see a series of .xml files named as such:

  • document.xml
  • endnotes.xml
  • header1.xml

The names are self-explanatory: you'll generally find the body text within the document.xml file.
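The rename-and-expand steps can also be skipped by reading the parts straight out of the package. This is a rough sketch, assuming the parts are UTF-8 encoded (typical, but not guaranteed); `read_part`, `word_parts`, and `make_sample_docx` are hypothetical helper names, and the sample package is built in memory purely for demonstration.

```python
import io
import zipfile


def read_part(package, part="word/document.xml"):
    """Return one part of an Open XML package as text (assumes UTF-8)."""
    with zipfile.ZipFile(package) as archive:
        return archive.read(part).decode("utf-8")


def word_parts(package):
    """List the .xml parts stored directly under the word/ directory."""
    with zipfile.ZipFile(package) as archive:
        return [
            name
            for name in archive.namelist()
            if name.startswith("word/")
            and name.endswith(".xml")
            and name.count("/") == 1
        ]


def make_sample_docx():
    """Hypothetical stand-in for a received .docx, returned as bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<body>Hello</body>")
        archive.writestr("word/endnotes.xml", "<endnotes/>")
        archive.writestr("word/header1.xml", "<header/>")
        archive.writestr("word/media/image1.png", b"")
    return buffer.getvalue()


if __name__ == "__main__":
    sample = io.BytesIO(make_sample_docx())
    print(word_parts(sample))
    print(read_part(sample))
```

For a file on disk, pass its path (for example "file.docx") in place of the in-memory sample; no renaming to .zip is needed.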

Once you've found the appropriate file(s), you can either open them in any text editor and manually strip the XML, or you can use a tool like downCast to convert them (without great accuracy) to an RTF (rich-text format) document that can be opened in Word v.X or 2004.

BBEdit, for instance, has a function that will strip most XML tags from documents, leaving plain text. Open the .xml file in BBEdit, select the appropriate text, then go to the "Markup" menu, and select "Utilities" then "Remove Markup." Some other text editors have similar functionality.
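The same stripping can be approximated in a few lines of script. The sketch below is not how BBEdit works internally; it simply parses document.xml and keeps the contents of the text elements, grouped by paragraph. It assumes the WordprocessingML namespace used by released versions of Word 2007, and the `SAMPLE` string is a made-up fragment for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML elements (w:p, w:r, w:t).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(document_xml):
    """Return the text of each paragraph (w:p) as a list of strings."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join the w:t pieces.
        pieces = [node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return paragraphs


# Made-up fragment standing in for the contents of word/document.xml.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world.</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

if __name__ == "__main__":
    for line in extract_paragraphs(SAMPLE):
        print(line)
```

Unlike a blanket tag strip, this keeps paragraph boundaries intact, though it ignores tabs, line breaks within a paragraph, headers, and footnotes.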

You can practice with some sample files available from OpenXMLDeveloper.org.

docx-converter.com

A Web site dubbed docx-converter can translate a Microsoft Word 2007 .docx file into a simple HTML file. According to its creators, the tool "strips out some of the formatting, but now supports bold, italic, and underlined text. Left, right, center, and justified alignment. Unicode characters, and more!" This is a great interim solution that has the key advantage of retaining some formatting, but the site might buckle under heavy load.
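To give a sense of what such a conversion involves, here is a rough sketch that maps bold, italic, and underline run properties to HTML tags. It is not the code used by docx-converter, whose implementation is not described here; the function names and `SAMPLE` fragment are illustrative assumptions, and alignment, styles, and toggled-off properties (such as a bold element with a false value) are deliberately ignored.

```python
import html
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"

# WordprocessingML run-property element name -> HTML tag name.
STYLE_TAGS = (("b", "b"), ("i", "i"), ("u", "u"))


def run_to_html(run):
    """Convert one run (w:r) to an HTML fragment."""
    text = "".join(node.text or "" for node in run.iter(W + "t"))
    fragment = html.escape(text)
    props = run.find(W + "rPr")
    if props is not None:
        for prop_name, tag in STYLE_TAGS:
            # Presence of the element is treated as "on" (a simplification).
            if props.find(W + prop_name) is not None:
                fragment = f"<{tag}>{fragment}</{tag}>"
    return fragment


def document_to_html(document_xml):
    """Convert each paragraph (w:p) to an HTML <p> element."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for paragraph in root.iter(W + "p"):
        body = "".join(run_to_html(run) for run in paragraph.iter(W + "r"))
        paragraphs.append(f"<p>{body}</p>")
    return "\n".join(paragraphs)


# Made-up fragment standing in for the contents of word/document.xml.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    "<w:r><w:rPr><w:b/></w:rPr><w:t>Bold</w:t></w:r>"
    "<w:r><w:t> &amp; plain</w:t></w:r>"
    "</w:p></w:body></w:document>"
)

if __name__ == "__main__":
    print(document_to_html(SAMPLE))
```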

Windows-side saving

Though it's inconvenient and impractical in many real-world cases, one obvious solution to this problem (and the one suggested by Microsoft's Mac BU) is to ask Office 2007 users to save their documents in "Word/Excel/PowerPoint 97-2003" format (.doc, .xls, .ppt). Some document elements might be lost in the process, but this will ensure interoperability with Mac versions of Office.

Wait until January for a new OpenOffice.org release

Novell has stated that it is working on and supporting an open-source project to bring support for opening Office 2007 (Open XML) documents to a coming release of OpenOffice.org, the rival productivity suite that is available as an X11 application and can run under Mac OS X. A CNET article says: "By January, Novell said, users of the OpenOffice word processor will be able to read documents saved in the Office Open XML format, the default setting for Microsoft's recently released Office 2007 suite."

Feedback? Late-breakers@macfixit.com.

Comments (9)

    by Fingal December 12, 2006 6:37 AM PST
    All of the Mac users I know of who use OpenOffice regularly use the NeoOffice version (which doesn't depend on X11). In fact, I have a friend who runs his whole business with NeoOffice.

    by Billsey December 12, 2006 6:37 AM PST
    This is a reply to a previous comment by Fingal
    I am a Mac user, and I do not use NeoOffice, because it is always rather far behind OpenOffice.org version-wise.

    by Fingal December 12, 2006 6:37 AM PST
    This is a reply to a previous comment by Billsey
    Fair enough. I did notice that it took some time for NeoOffice to update to the 2.0 code base. It wasn't a big deal for me or the other people I know to wait but I'm sure there are situations where it would be an issue.
    by MacFixItUser December 12, 2006 11:23 AM PST
    MacLinkPlus Deluxe?
    by December 12, 2006 2:09 PM PST
    Here's another option:

    Return the funky document to the sender and ask them to use something other than a new, non-standard format for their file.
    by Uncle Asad December 12, 2006 11:47 PM PST
    Don't depend on these converters appearing in the "official" versions of OpenOffice.org released from www.openoffice.org; reading MS Office Open XML is a highly politicized issue, and there is a very large and vocal faction of the OpenOffice.org community vehemently opposed to applications that can read the Microsoft format. Just because Novell includes the converters in versions of OpenOffice.org it distributes to its customers does not mean that the code will be accepted into the main OpenOffice.org source code or that the converters will be included in official versions from Sun/www.openoffice.org.

    (The NeoOffice developers have indicated that they are looking into including the Novell code for reading MS Office Open XML, as well as Novell's code for VBA support, in future versions of NeoOffice.)
    by xbjllb December 13, 2006 3:47 AM PST
    Just great... a word document that has to be unzipped to be read. Can't wait to see what the virus writers do with this genius idea of a document format. Microsoft... dedicated to making virus writers' lives more fulfilling and fun-filled.
    by Fingal December 13, 2006 6:34 AM PST
    I wonder what it means that Apple, OpenOffice.org (Novell's contributors to it, at least) and DataViz (makers of MacLinkPlus) are all ahead of Microsoft in releasing translators. Maybe it's just that Microsoft is being conservative in making sure that their own official translator is well tested or maybe there's more to it than that. I suppose it might also be that Microsoft has been slow to catch onto XML and their developers have had to catch up with the basics of the technology before they could work on the translators. Apple has certainly gone whole-hog into XML since the introduction of OS X, considering the amount of XML content they use in the OS itself as well as applications like Keynote, Pages, etc.

    by toxdoc December 13, 2006 6:34 AM PST
    This is a reply to a previous comment by Fingal
    I looked at the Dataviz web site and can find no mention of support for the new Office 2007 formats. Is there a beta version somewhere?