Open Document Formats for All

30 November -0001

I've been giving a lot of thought to data formats recently. With an ever-increasing volume of data being stored digitally, and new requirements to retain digital information, formatting is becoming an ever bigger problem. I frequently run into format issues when moving data from one application to another, and often even from one version of a format to the next. It seems to me that much of this comes down to laziness on the developers' part. Sure, you want to store data in a way that's easily accessible and that looks nice, but in the end all data reduces to 0's and 1's. More to the point, most human-usable data can be broken down into text, that is, character data. There is no excuse for hiding this sort of data in a format that users cannot access.

I understand that developers often want to embed formatting information along with the data, but frequently the formatting and the actual content are so tightly coupled that the content is effectively scrambled. For instance, if I were to type this article up in a word processor, even without any formatting at all, save it, and then open that file in a text editor, it would be garbled. Adding formatting would only garble the data further, and this is the simplest case. Spreadsheets, calendaring software, and other applications all mangle the user's data to the point where the application is *required* to access it. This effectively locks the user into a platform: the platform in which they composed or deposited their data.

OpenOffice.org uses the new OpenDocument Format (ODF), an OASIS standard that grew out of its own XML file format. With conspicuous support from Google, this format might take off. The great thing about ODF is that it is all based on XML. This means that if you open up the contents of your ODF document with a text editor you'll see human readable characters. Now, your document might be cluttered up with extraneous XML tags, but it's easy to filter those out with a simple find and replace to distill the content of your document out of the markup. Additionally, since XML is an open standard, there are numerous applications capable of reading the data, and with an open standard you can even write your own application to view the document. Unfortunately, because an OpenOffice.org .odt file bundles several separate pieces of information (content, styles, metadata and so on), it is actually stored as a ZIP archive. So if I create a simple text document with the following text in it:

Hello World

and I bold the text, then save it as an OpenDocument Text (.odt) file (for instance as test.odt), I can't directly access the XML by opening the file with a text editor (if you try this you'll see a lot of gibberish). Instead you have to unzip the .odt file using a program like 7-Zip or Info-ZIP's unzip. Once you open the archive you'll see it contains several files, some of them in subfolders (mimetype, current.xml, content.xml, styles.xml, meta.xml, thumbnail.png, settings.xml and manifest.xml). If you open the content.xml file with a text editor you'll see the following:

<?xml version="1.0" encoding="UTF-8"?>
<office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0" xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0" xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0" xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0" xmlns:math="http://www.w3.org/1998/Math/MathML" xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0" xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:ooow="http://openoffice.org/2004/writer" xmlns:oooc="http://openoffice.org/2004/calc" xmlns:dom="http://www.w3.org/2001/xml-events" xmlns:xforms="http://www.w3.org/2002/xforms" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" office:version="1.0"><office:scripts/><office:font-face-decls><style:font-face style:name="Tahoma1" svg:font-family="Tahoma"/><style:font-face style:name="Times New Roman" svg:font-family="'Times New Roman'" style:font-family-generic="roman" style:font-pitch="variable"/><style:font-face style:name="Arial" svg:font-family="Arial" style:font-family-generic="swiss" style:font-pitch="variable"/><style:font-face style:name="Arial Unicode MS" svg:font-family="'Arial Unicode MS'" style:font-family-generic="system" style:font-pitch="variable"/><style:font-face style:name="MS Mincho" svg:font-family="'MS Mincho'" style:font-family-generic="system" 
style:font-pitch="variable"/><style:font-face style:name="Tahoma" svg:font-family="Tahoma" style:font-family-generic="system" style:font-pitch="variable"/></office:font-face-decls><office:automatic-styles><style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard"><style:text-properties fo:font-weight="bold" style:font-weight-asian="bold" style:font-weight-complex="bold"/></style:style></office:automatic-styles><office:body><office:text><office:forms form:automatic-focus="false" form:apply-design-mode="false"/><text:sequence-decls><text:sequence-decl text:display-outline-level="0" text:name="Illustration"/><text:sequence-decl text:display-outline-level="0" text:name="Table"/><text:sequence-decl text:display-outline-level="0" text:name="Text"/><text:sequence-decl text:display-outline-level="0" text:name="Drawing"/></text:sequence-decls><text:p text:style-name="P1">Hello World</text:p></office:text></office:body></office:document-content>

Granted, this is pretty gnarly, but you can easily read it (it's all character data), and it is valid XML. You can even easily spot the 'Hello World' content near the bottom.
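
If you'd rather not unzip and search by hand, the steps above can be sketched in a few lines of Python using only the standard library. This is a rough sketch rather than a proper ODF reader: the function names are my own invention, it assumes the archive has a content.xml at its top level (as the test.odt above does), and the tag-stripping version is exactly the crude find and replace it looks like (it won't decode entities such as &amp;amp;).

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Namespace URI of the text: prefix, copied from the content.xml shown above.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def list_members(odt_path):
    """Return the names of the files stored inside the .odt ZIP archive."""
    with zipfile.ZipFile(odt_path) as archive:
        return archive.namelist()


def read_content_xml(odt_path):
    """Return the raw bytes of content.xml from the .odt archive."""
    with zipfile.ZipFile(odt_path) as archive:
        return archive.read("content.xml")


def strip_tags(xml_text):
    """Crude find and replace: delete everything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", xml_text).strip()


def paragraphs(xml_bytes):
    """Parse the XML properly and return the text of each text:p element."""
    root = ET.fromstring(xml_bytes)
    return ["".join(p.itertext()) for p in root.iter("{%s}p" % TEXT_NS)]
```

Run against the test.odt from above, paragraphs(read_content_xml("test.odt")) should hand back just the Hello World paragraph with none of the markup, and list_members("test.odt") shows what else is packed into the archive.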

Using ODF means that no matter what changes happen to your word processing program, your data won't get locked away in a proprietary format. There is currently a plug-in for Microsoft Office that allows you to save Microsoft Word documents as ODF.

Although some people advocate using Adobe's PDF as a universal distribution format, it isn't suitable for archiving and truly open access. PDF, while readable on many different platforms, is not editable unless you pay for Adobe's software. Additionally the format, while accessible, is proprietary. You can get open source readers, but you'll never have the freedom that you do with ODF, because Adobe ultimately controls the format.

Unfortunately image data lags a little ways behind in terms of standardization. Luckily the world has evolved some de facto standards such as PNG, GIF, and JPEG. Hopefully some time soon there will be a standardization movement in the imaging world that mirrors the open document standards.