CambridgeDocs xDoc Word to XML Conversions

xDoc transforms your Microsoft Word documents using the Java Word Driver, a sophisticated and complete Java-based parser for binary Microsoft *.doc files. xDoc uses the Java Word Driver as part of its integrated multi-step approach to converting content.

The Java Word Driver reads in the binary Microsoft *.doc format and extracts as much information as possible, including text, formatting, styles, layout and graphical information.  The Java Word Driver outputs a complete stylistic XML rendering of the document.

The stylistic XML is "non-lossy": it gives you programmatic access to both the original Word content and its formatting, which you can then use for mapping to XML schemas / DTDs like DocBook and DITA. You can also use a Microsoft Word *.doc file as a template and fill it in with live data for multi-channel publishing, such as (re)converting the content to HTML, PDF, and RTF formats. You can do all of this on Solaris and Linux servers as well as on Windows machines, because the Java Word Driver is cross-platform and does not automate Microsoft Word in any way.
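
To illustrate what programmatic access to content and formatting can look like, here is a minimal sketch that uses the standard Java XPath API to pull out the text of every paragraph carrying a given style. The element and attribute names (`document`, `para`, `style`) and the class name `StyledParagraphFinder` are assumptions made for illustration only; they are not taken from the actual stylistic XML format, so adjust them to match the XML you get.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StyledParagraphFinder {

    /**
     * Returns the text of every <para> element whose "style" attribute
     * equals styleName. The <para>/style vocabulary is hypothetical.
     */
    public static List<String> textOfParagraphsWithStyle(String xml, String styleName) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select all paragraph elements anywhere in the document.
            NodeList paras = (NodeList) xpath.evaluate(
                "//para", new InputSource(new StringReader(xml)), XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < paras.getLength(); i++) {
                Element para = (Element) paras.item(i);
                // Compare the style name in Java to avoid building XPath strings by hand.
                if (styleName.equals(para.getAttribute("style"))) {
                    result.add(para.getTextContent());
                }
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample input standing in for converted Word content.
        String xml = "<document>"
            + "<para style=\"Heading1\">Overview</para>"
            + "<para style=\"Normal\">Body text.</para>"
            + "</document>";
        System.out.println(textOfParagraphsWithStyle(xml, "Heading1"));
    }
}
```

A lookup like this is a typical first step when deciding how Word styles should map onto elements of a target schema such as DocBook or DITA.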

Java Word Driver Benefits:
Parse and process Microsoft Word documents on Windows, Linux and Solaris machines
Transform Word content to standard and custom XML schemas / DTDs, such as DocBook and DITA
Create custom documents on a server using *.doc files as templates
Run conversions without Microsoft Word installed on the server
Republish programmatically modified Word documents in HTML, PDF, and RTF formats
Index Word content in a Documentum or Oracle Database

The list below is a sampling of the items that the xDoc Java Word Driver can identify, parse, and process:

Java Word Driver Features:
Paragraph Content
Paragraph Layout Information, including left-indent, right-indent, space-before, and space-after
Paragraph Format Information, including font, font-color, font-size, and weight
Paragraph Style Information
Style / Formatting Overrides for paragraph text-runs that deviate from the "Style" setting in Microsoft Word
Images, including both bitmap images as well as WMF files, along with their x,y page coordinates
Frames, including the text content of the frame and its x,y page coordinates
Lists, including both ordered and unordered lists
Tables, including rows / cells, background color, column-widths, row-height, colspan, rowspan, border, and border-color
Superscripted Text, including in-line superscripts, along with their programmatic footnote references
Word Fields, including field code as well as field content
Office Shapes
Page Breaks
Sections
Footnotes and Endnotes
Page Headers and Footers
Links
Tabs
Text Boxes

 

Java Word Driver FAQ

What Microsoft Word formats are supported?

Currently, the Java Word Driver can convert Microsoft Word 97, Word XP (2002) and Word 2003 binary *.doc files into XML.  xDoc also has limited support for RTF to XML and WordML to XML conversions.

What XML format can I convert my Word documents into?

The Java Word Driver initially converts the content into a stylistic XML output called ppXML, or preprocess XML, which is an "initial format" that xDoc uses as part of its integrated multi-step process in transforming documents.

You can then convert the ppXML into any further XML schema you like, including DocBook, DITA, LegalXML, or into your own custom DTD/schema.  This further conversion can be done using an XSLT stylesheet, or by using xDoc extraction and transformation rules.
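
As a rough sketch of the XSLT route, the example below applies a small stylesheet to an XML document using the standard JAXP transformation API (`javax.xml.transform`). The input vocabulary (`document`, `para`, `style`), the stylesheet itself, and the class name `PpXmlToDocBook` are hypothetical stand-ins; the real ppXML element names are not described here, so treat this as an outline to adapt rather than a working ppXML stylesheet.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class PpXmlToDocBook {

    // Hypothetical stylesheet: paragraphs styled "Heading1" become DocBook
    // <title> elements, all other paragraphs become DocBook <para> elements.
    public static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/document\">"
      + "<section><xsl:apply-templates select=\"para\"/></section>"
      + "</xsl:template>"
      + "<xsl:template match=\"para[@style='Heading1']\">"
      + "<title><xsl:value-of select=\".\"/></title>"
      + "</xsl:template>"
      + "<xsl:template match=\"para\">"
      + "<para><xsl:value-of select=\".\"/></para>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the given XSLT stylesheet to sourceXml and returns the result as a string. */
    public static String transform(String sourceXml, String stylesheet) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(sourceXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample input standing in for real ppXML.
        String input = "<document>"
            + "<para style=\"Heading1\">Overview</para>"
            + "<para style=\"Normal\">Body text.</para>"
            + "</document>";
        System.out.println(transform(input, STYLESHEET));
    }
}
```

The more specific template (`para[@style='Heading1']`) takes precedence over the generic `para` template under XSLT's default priority rules, which is what lets a style name drive the choice of target element.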

We also provide out-of-the-box conversion capabilities for transforming Microsoft Word *.doc files to XSL-FO.

What formats (PDF, HTML) can I publish a Word document into?

We provide out-of-the-box support for transforming a Word document into PDF, HTML, and RTF.

This can be done in a Windows or Unix server environment without requiring Microsoft Word to be installed, because the xDoc Java Word Driver is a pure Java application that operates against the source *.doc files.

Can I do a two-way conversion back into Word on a server?

Yes, you can do a two-way conversion: from Word into XML, and then from XML back into Word using our XSL-FO and RTF rendering capabilities.