CambridgeDocs Support / Frequently Asked Questions

General Questions:
Pricing Questions:
Technical Questions:

 

General Questions:


What is xDoc, the xDoc Converter, and xDoc Server?

"xDoc" is the generic term that we use to describe the architecture of our flagship product family, which consists of the "xDoc Converter Desktop" and "xDoc Server."  Both products use the same integrated multi-step approach to transforming content.

When we say that "xDoc is capable of something", we mean that either the xDoc Converter Desktop or the xDoc Server has that capability.


What file formats can I transform into XML using xDoc?

xDoc can convert content from a number of different file types, as highlighted here.  In addition to these formats, xDoc can also read and convert text content directly, and its rules can also be used to convert XML from one schema or DTD format into another.

Other file formats will be supported in the future. Please contact sales@cambridgedocs.com for more information as to their availability.


What defined XML formats (DTD or XSD) can I convert content into?

xDoc is capable of converting content into any XML schema or DTD by using an integrated multi-step process.

The xDoc Converter Desktop is a point and click interface for defining rules which map source content to a target/destination schema.  Once perfected using the xDoc Converter Desktop, the rules can also be deployed with the xDoc Server, for "lights-out" transformations.

By default, the xDoc Java Preprocessing Drivers convert content from its existing source format into ppXML (pre-processed XML), a stylistic XML format that is CambridgeDocs' internal XML format. This step does not require you to define any rules. For more on ppXML, see the xDoc documentation.

Currently, we include example MapTemplates for transforming HTML, PDF, Word, and FrameMaker documents into ppXML, DocBook and DITA. Additionally, clients have used the product to transform content into various other formats, including standards such as the Military’s 3001 and S1000D for maintenance, SPL (structured product labeling) for pharmaceutical submissions, and custom DTDs for their own proprietary applications.


What formats can I publish XML into? How about HTML and PDF?

xDoc Server can be used to publish XML into a number of different file formats, as highlighted here, by using our out-of-the-box stylesheets in a server environment. The stylesheets are based on XSL-FO. Clients have also used our stylesheets to create customized HTML, for example HTML 3.2 for the SEC.

To learn more about our publishing capabilities using XSL-FO, please contact sales@cambridgedocs.com.


What platforms are supported by xDoc?

Currently the xDoc Converter Desktop runs on Windows 2000 and Windows XP.  xDoc Server is cross-platform and can run on Windows 2000/XP, Linux, and Solaris.

Can I republish Microsoft Word and WordML documents on a Unix or Linux server?

Because of its unique design, xDoc Server can read Microsoft Word (*.doc) files in a cross-platform way and republish the Word document as PDF, RTF, and HTML, while preserving the formatting. Because xDoc Server is written in Java, this can be done on Unix servers without needing to have any Microsoft software installed.

Currently, this publishing capability has been developed for Microsoft Word *.doc files with very high fidelity. We are working on equivalent functionality for other Word formats, including WordML.


How about re-publishing PDF documents on a Unix or Linux server?

Similarly, our xDoc Java PDF Driver can read and manipulate PDF files in a server environment without needing to have any Adobe software installed. Our Java PDF Driver can automatically republish the pages of a PDF file into HTML, or transform PDF files into meaningful XML based upon your own extraction/conversion rules.

Furthermore, our JPDFToolkit and the PDFAssembly solutions can be used to merge, combine, and submit PDF files in a J2EE environment. For more information on these two solutions, please e-mail sales@cambridgedocs.com.


How is xDoc different from Microsoft’s XDocs and Office11?

Office 11 allows you to author documents according to whatever XML DTD or schema you describe to it in a way that is transparent to a business user.  There has been a significant push for such features for some time, and there are other programs available now that will let you do something similar.

What none of the Office 11 products allow you to do, however, is transform existing legacy content into defined XML DTDs or schemas. That is where CambridgeDocs xDoc comes in.

We are interested in enabling the release of information that has been authored in semi- or unstructured formats into meaningful XML formats. We have had, for example, clients use our product to convert existing technical manuals from Word into DocBook.

Given this focus, we have also facilitated the batch conversion of documents.  For instance, one might capture data from file servers full of insurance forms and convert it to an internal XML format.


 

Pricing Questions:


How are the xDoc Converter Desktop and xDoc Server priced?

CambridgeDocs has standalone xDoc Converter Desktop pricing for one-time conversions, and xDoc Server pricing for ongoing conversions. Please contact sales@cambridgedocs.com for the latest prices.

If the xDoc Server can be used standalone, what are the licensing issues and costs involved with integrating it into our product?

You need to have a special license in order to embed xDoc Server functionality in your own application.

If you are running on a single server, you will need a server embedded license. If you are embedding within a product that will be shipped to your customers, you will need an OEM embedded license.

We are very happy to explore OEM, VAR and other partnering opportunities. Please e-mail partners@cambridgedocs.com to discuss your specific business and how we might start working together.


Are special prices available for academic use?

Yes, please e-mail sales@cambridgedocs.com for the latest academic pricing information.

 

Technical Questions:


Where can I get information on how xDoc handles Microsoft Word documents? 

Our Word to XML Conversions page describes the means by which xDoc converts Microsoft Word documents.  There you'll find a general technical description, along with a dedicated FAQ.

In addition, our publishing from XML to RTF page describes the means by which xDoc publishes XML content into RTF format, which can then be opened in Microsoft Word.


Where can I get information on how xDoc works with PDF documents?

Our PDF to XML Conversions page describes the means by which xDoc converts Adobe PDF documents.  On that page, you'll find a general technical description and a dedicated FAQ.

We also have a page on publishing from XML to PDF, which describes how xDoc publishes XML content into PDF data.


Can xDoc run on more than one document at a time? Can it run in batch mode?

Yes. xDoc is meant to be run on multiple documents.

To do so, first use the xDoc Converter Desktop to define a MapTemplate, which contains rules for parsing and transforming documents into XML.  Then refine the MapTemplate by running it on sample content one file at a time.

When you have refined a MapTemplate, you can run it in batch mode against multiple files in the xDoc Converter Desktop simply by specifying a wildcard (“*.doc”, for example) for a particular directory.
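
As a rough illustration of wildcard-driven batch processing, the Python sketch below collects every file in a directory that matches a pattern such as *.doc and hands each one to a conversion callback. It does not use the xDoc API: run_batch and placeholder_convert are hypothetical names, and the callback only computes an output file name instead of performing a real conversion.

```python
import tempfile
from pathlib import Path


def run_batch(directory, pattern, convert):
    """Apply `convert` to every file in `directory` matching the wildcard."""
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            results[path.name] = convert(path)
    return results


def placeholder_convert(path):
    # Hypothetical stand-in for a real conversion step: it only reports
    # the file name an XML output would receive.
    return path.with_suffix(".xml").name


# Build a throwaway directory with two Word-style files and one other file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("manual.doc", "notes.doc", "readme.txt"):
        Path(tmp, name).write_text("sample")
    batch = run_batch(tmp, "*.doc", placeholder_convert)

print(batch)  # {'manual.doc': 'manual.xml', 'notes.doc': 'notes.xml'}
```

Only the two files matching the wildcard are processed; readme.txt is skipped.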

Once the MapTemplate has been developed and refined, you can then deploy it in your server-based environment with the xDoc Server, which can run on Windows, Linux, and Solaris.  You can use one of the xDoc Server APIs to transform either a single document or a list of documents.  Please see the API documentation and sample code included in the xDoc Server download package.


What is ppXML?

ppXML is a stylistic XML format that captures the content and the formatting information from source documents, and is output from each of the xDoc Java Conversion Drivers. ppXML is a normalized form, intended to make it easy to parse Microsoft Word, Adobe PDF, and Adobe FrameMaker documents without having to understand the specifications associated with those documents.

xDoc uses this format, which can be viewed in HTML, or easily transformed into PDF or RTF, as an "intermediate format" to normalize content coming from different source formats. ppXML is very easy to understand, and more importantly, it is much easier to parse than Microsoft Word files, PDF files, or even HTML files.

For example, in ppXML you can easily get the font or emphasis information for a piece of text. In HTML and XHTML you would have to "parse upwards": look at all of the parent nodes of a particular text item, and check each one for <font> tags or for any tag that has a style attribute. You might then need to parse a CSS file to find the definition of a style, see whether it has a parent style, and so on.

This requires writing in-depth parsing code. The xDoc Java Drivers do this work for you, and produce ppXML as the result from which you can define further transformations.
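
To give a sense of the work involved, here is a minimal Python sketch of the "parse upwards" lookup for a small XHTML fragment. It assumes well-formed XHTML and inline style attributes only; external CSS files and parent styles are ignored and would need further code. The sample markup and the helper names (inline_style, is_bold) are invented for illustration and are not part of xDoc.

```python
import xml.etree.ElementTree as ET

XHTML = """<div style="font-weight: bold">
  <p>Intro <span>nested text</span></p>
  <p style="font-weight: normal">Plain <i>note</i></p>
</div>"""

BOLD_TAGS = {"b", "strong"}
BOLD_WEIGHTS = {"bold", "bolder", "700", "800", "900"}


def inline_style(element):
    """Parse an inline style attribute into a {property: value} dict."""
    style = {}
    for declaration in element.get("style", "").split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            style[name.strip().lower()] = value.strip().lower()
    return style


def is_bold(element, parents):
    """Walk from the element up its ancestors until one decides boldness."""
    node = element
    while node is not None:
        if node.tag in BOLD_TAGS:
            return True
        weight = inline_style(node).get("font-weight")
        if weight is not None:
            return weight in BOLD_WEIGHTS
        node = parents.get(node)
    return False


root = ET.fromstring(XHTML)
# ElementTree keeps no parent pointers, so build a child -> parent map first.
parents = {child: parent for parent in root.iter() for child in parent}

span = root.find("./p/span")
note = root.find("./p/i")
print(is_bold(span, parents))  # True: inherited from the enclosing div
print(is_bold(note, parents))  # False: the parent paragraph resets the weight
```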


What are parsing rules? How can I extract meaningful content out of an unstructured document?

xDoc has very sophisticated parsing rules, which make it ideal for extracting meaningful values out of unstructured documents.  Unlike many parsers, which only allow regular-expression type parsing, xDoc contains an entire rules engine that allows for identifying and extracting content based on:
  • Formatting and stylistic information: If Styles were used in the original document, you can use this information to help identify meaningful elements. However, we know that most authors don't use styles, even though they do use formatting, so you can also use formatting clues (indenting, font, font-size, italics, bold, etc.) to help extract content.  More importantly, the formatting information is very easy to access - you don't have to write regular expressions or long XPath statements or subroutines to figure out if a paragraph, or a snippet of text, is italicized, for example.  You simply use a formatting rule.
  • Content rules: You can use the content itself to extract values - you can look for specific words and phrases as keys to help the parser recognize a product list, for example.  You can also use regular expressions or token-based grammatical rules to define specific types of passages that you are looking for.
  • Sequence rules: You can use rules based upon where content appears within a document. For example, you might say that the element “Product Description” consists of a product name, followed by one or more body paragraphs, followed by an image, and then either an ordered list or a price.
  • Combination rules: You can use rules in combination with each other using "AND", "OR", and "NOT" constructs. So you might say that a title can be followed by a subtitle OR it could be followed by a body paragraph AND it must have a certain formatting. With these constructs, you can create a single set of rules that account for many differences in similar source documents.
  • Encapsulated rules as objects: All parsing rules can be encapsulated in a MapObject, which is a re-usable component.  The power of xDoc can be unleashed by re-using rules within more complex rules. So you might define an AMERICAN_DATE MapObject, and a EUROPEAN_DATE MapObject, each of which can parse particular date formats. Then you could define a UNIVERSAL_DATE MapObject that consists of either the AMERICAN_DATE MapObject OR the EUROPEAN_DATE MapObject.  There is no limit to the sophistication you can achieve by re-using and combining MapObjects.
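
The idea of building a larger rule out of smaller reusable ones can be sketched in a few lines of Python. This is not xDoc syntax and says nothing about how MapObjects are implemented: the PatternRule and AnyOf classes are hypothetical, and the two date formats (month/day/year with slashes, day.month.year with dots) are assumptions chosen only to make the example concrete.

```python
import re


class PatternRule:
    """A leaf rule: matches text against one regular expression."""

    def __init__(self, name, regex):
        self.name = name
        self._regex = re.compile(regex)

    def matches(self, text):
        return self._regex.fullmatch(text) is not None


class AnyOf:
    """An OR combination: matches when at least one member rule matches."""

    def __init__(self, name, *rules):
        self.name = name
        self._rules = rules

    def matches(self, text):
        return any(rule.matches(text) for rule in self._rules)


DAY = r"(0?[1-9]|[12]\d|3[01])"
MONTH = r"(0?[1-9]|1[0-2])"
YEAR = r"\d{4}"

# Assumed formats: month/day/year for American, day.month.year for European.
AMERICAN_DATE = PatternRule("AMERICAN_DATE", rf"{MONTH}/{DAY}/{YEAR}")
EUROPEAN_DATE = PatternRule("EUROPEAN_DATE", rf"{DAY}\.{MONTH}\.{YEAR}")

# The composite rule reuses the two leaf rules instead of redefining them.
UNIVERSAL_DATE = AnyOf("UNIVERSAL_DATE", AMERICAN_DATE, EUROPEAN_DATE)

for sample in ("12/25/2004", "25.12.2004", "2004-12-25"):
    print(sample, UNIVERSAL_DATE.matches(sample))
```

The first two samples match the composite rule and the third does not, since neither leaf rule accepts a year-first format.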

How are these rules different from XSLT?  Can't I just parse ppXML with XSLT?

XSLT is great for tree-to-tree transformations of documents that start in XML.  However, XSLT was never meant to be a full parsing environment for unstructured content.  It is not ideal for parsing text or extracting meaningful tags out of the middle of an element's textual block.

You can integrate an XSLT stylesheet at any point in our transformation process, as part of the pre-processing or the post-processing, and in fact we recommend using the two together (if you are familiar with XSLT programming). We recommend it particularly for publishing to various formats.

There are several advantages to using the xDoc transformation rules in tandem with ppXML.  They were created with the mindset of making parsing much easier than writing code.

  • The first is the development environment, which provides a point and click interface for converting unstructured content into a defined XML schema.  Using XSLT, you're still writing and debugging code.
  • The point and click interface makes it easy to both manage and reuse individual parsing rules (MapObjects) and individual transformation templates (MapTemplates).
  • Just as importantly, you can view and debug the results - using our "document packet" concept. You can treat an HTML file and its associated images (which can be copied together) as a document packet. We believe that any automated process must account for the ability to see what happened along the way.
  • Another key advantage for parsing is that the xDoc rules let you take into account sequencing of content and elements - which is difficult to do with XSLT. For example, you might say that a CHAPTER is only matched if you find a "chapter_title" followed by one or more sections or block elements. Writing code for these types of sequences is only for the best XSLT programmers.
  • Finally, ppXML brings formatting information to the elements so that it is very easy to define a rule that looks for content that is “bold,” for example. As explained above, if you had to write code to do this, you might find yourself walking up the parent nodes, looking for tags and/or styles, and parsing CSS-type statements.   xDoc simplifies this process significantly.
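
For comparison, the CHAPTER sequence rule mentioned above ("chapter_title" followed by one or more sections or block elements) can be written by hand roughly as follows. This Python sketch assumes the document has already been reduced to a flat list of element labels; the label names other than "chapter_title" and the functions match_chapter and split_chapters are hypothetical, and the code is not how xDoc evaluates its rules.

```python
def match_chapter(labels, start=0):
    """Return the index just past a CHAPTER starting at `start`, or None."""
    if start >= len(labels) or labels[start] != "chapter_title":
        return None
    end = start + 1
    while end < len(labels) and labels[end] in ("section", "block"):
        end += 1
    # At least one section or block must follow the title.
    return end if end > start + 1 else None


def split_chapters(labels):
    """Collect every run of labels that satisfies the CHAPTER sequence."""
    chapters = []
    index = 0
    while index < len(labels):
        end = match_chapter(labels, index)
        if end is None:
            index += 1  # no chapter starts here; move on
        else:
            chapters.append(labels[index:end])
            index = end
    return chapters


labels = [
    "chapter_title", "section", "block",
    "chapter_title",           # a title with nothing after it: not a chapter
    "chapter_title", "block",
]
chapters = split_chapters(labels)
print(len(chapters))  # 2
```

Even this small rule needs explicit index bookkeeping, which is the kind of code a declarative sequence rule is meant to replace.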

Can I invoke outside parsing programs?

You can invoke executables and XSLT stylesheets as part of the transformation process.

Can I extract metadata from the documents?

By default, xDoc extracts metadata that was stored with the document. The type of metadata that is automatically extracted varies depending on the source documents (meta tags for HTML files, document properties for Word and PDF files, etc.). You can define rules within MapTemplates to extract additional metadata from within the text of documents as needed. 
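
As an illustration of what stored metadata looks like for HTML sources, the following Python sketch collects name/content pairs from meta tags using the standard library's html.parser module. It is independent of xDoc: the MetaTagCollector class and the sample page are invented for this example, and the set of fields xDoc actually extracts may differ.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Skip meta tags without a name/content pair, such as charset.
        if name and content is not None:
            self.metadata[name.lower()] = content


SAMPLE_HTML = """<html><head>
<title>Pump Manual</title>
<meta name="Author" content="J. Smith">
<meta name="keywords" content="pump, maintenance">
<meta charset="utf-8">
</head><body><p>Text</p></body></html>"""

collector = MetaTagCollector()
collector.feed(SAMPLE_HTML)
print(collector.metadata)
# {'author': 'J. Smith', 'keywords': 'pump, maintenance'}
```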