CambridgeDocs xDoc PDF to XML Conversions

xDoc uses its Java PDF Driver to convert your PDF files into XML as part of its integrated multi-step process for transforming content. The xDoc Java PDF Driver is the most complete, Java-based means of parsing PDF files available today, and gives you the capability of processing your PDF content from any J2EE application server like BEA WebLogic or IBM WebSphere.

The Driver reads in the binary PDF file, extracts its text, layout, formatting, vector graphics and images, and outputs a stylistic XML representation of the content. This stylistic XML gives you full programmatic access from Java to the otherwise unreadable binary content of the document, as well as to its formatting and key structural characteristics. You can then use this information to map the content to XML DTDs like DocBook or DITA, and for programmatic multi-channel publishing, such as modifying the PDF data and re-rendering the content in either PDF or HTML format.

Because the Java PDF Driver operates against the file content directly, and because it does not automate Adobe Acrobat in any way, you can parse and process your PDF content using multi-threaded technologies on any Linux, Solaris or Windows 2000/XP machine, thereby giving you maximal flexibility in server-based performance and cost.
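
Since the driver is described as usable from multi-threaded code, a batch of files could be spread over a thread pool. The following is a minimal sketch that uses only the standard java.util.concurrent classes. The converter argument is a hypothetical stand-in for whatever call performs the actual PDF to XML conversion; the xDoc API itself is not shown on this page, so nothing here should be read as its real interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Illustration only: the class and method names are hypothetical, not part of xDoc.
final class ParallelConversionSketch {

    // Applies the supplied converter to every input on a fixed-size thread pool
    // and returns the results in the same order as the inputs.
    static List<String> convertAll(List<String> inputs,
                                   Function<String, String> converter,
                                   int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String input : inputs) {
                tasks.add(() -> converter.apply(input));
            }
            List<String> results = new ArrayList<>();
            // invokeAll returns the futures in the same order as the tasks.
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while converting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A conversion task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder converter: it only renames the file and parses nothing.
        Function<String, String> placeholder = name -> name.replace(".pdf", ".xml");
        System.out.println(convertAll(Arrays.asList("a.pdf", "b.pdf", "c.pdf"), placeholder, 2));
    }
}
```

The pool size would normally be tuned to the number of cores and the memory each conversion needs.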

Java PDF Driver Benefits:
Parse and manipulate PDF documents on Windows, Linux and Solaris machines
Convert PDF content into standard and custom XML schemas or DTDs, such as DocBook and DITA XML
Does not require Adobe Acrobat to be installed on the server
Works with J2EE application servers like BEA WebLogic or IBM WebSphere
Works in any multi-threaded server environment
Index PDF content in a Documentum or Oracle Database
Republish programmatically modified PDF documents in HTML and PDF formats

PDF File Characteristics:

Unlike the other conversion formats supported by xDoc, pagination is key to the structure of a PDF. Each page generally contains:
Individual text runs, which are small segments of text
Bitmap type images, which are called raster images
Line drawings / fill information, which are referred to as vector graphics

Java PDF Driver Features:

When you convert your data into the initial stylistic XML, the Java PDF Driver performs the following extractions:
Individual pages are exported as <PAGE> elements, including their page height/width information
Contiguous lines of text are consolidated into <TEXT> elements, where each <TEXT> element retains its formatting (PDF font, font-size, font-color, font-weight, etc.) and positioning information (text run height and width, and x,y coordinates on the page)
Where applicable, individual <TEXT> elements are consolidated into <PARAGRAPH> elements, which may contain one or more text runs that follow the standard formatting for the paragraph, as well as text runs which deviate from it (bold, italic, font-size, etc.). The <PARAGRAPH> elements are assembled using proximity rules.
Raster images are exported as separate files, in either JPG or PNG format
Vector graphics are exported as part of an SVG (Scalable Vector Graphics) file, where currently one SVG file is produced per page
Makes a best-effort determination of tabular data*
Makes a best-effort determination of columnar data*

*Note that PDF files do not actually identify columns and tables internally. They are essentially just a collection of text runs positioned so that they look like a column or table when you view the data, hence the "best-effort" guess that the Java PDF Driver makes based on text run coordinate locations. For better accuracy, contact info@docscience.com about a CambridgeDocs plugin for Adobe Acrobat, which allows you to visually select and identify PDF content.
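
To illustrate why column detection from coordinates can only be a guess, here is a small sketch of one possible heuristic: sort the text runs by their left x coordinate and start a new column whenever the gap to the previous run exceeds a tolerance. This is not the algorithm the Java PDF Driver uses, which is not documented here; the class, the Run type and the tolerance value are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustration only: a naive heuristic, not the xDoc implementation.
final class ColumnGuessSketch {

    // A text run reduced to its text and its left x coordinate on the page.
    static final class Run {
        final String text;
        final double x;

        Run(String text, double x) {
            this.text = text;
            this.x = x;
        }
    }

    // Sorts the runs by x and opens a new column whenever the distance to the
    // previous run's x is larger than the tolerance. Returns the run texts
    // grouped by guessed column, left to right.
    static List<List<String>> guessColumns(List<Run> runs, double tolerance) {
        List<Run> sorted = new ArrayList<>(runs);
        sorted.sort(Comparator.comparingDouble(r -> r.x)); // stable sort
        List<List<String>> columns = new ArrayList<>();
        double previousX = 0.0;
        for (Run run : sorted) {
            if (columns.isEmpty() || run.x - previousX > tolerance) {
                columns.add(new ArrayList<>());
            }
            columns.get(columns.size() - 1).add(run.text);
            previousX = run.x;
        }
        return columns;
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
            new Run("Name", 72.0), new Run("Age", 300.0),
            new Run("Alice", 72.5), new Run("30", 301.0));
        System.out.println(guessColumns(runs, 10.0));
    }
}
```

A heuristic like this breaks down on indented text, centered headings or ragged tables, which is the kind of case where zoning the content by hand gives better results.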

Java PDF Driver FAQ

What XML format(s) can I convert my PDF documents into?

The Java PDF Driver initially converts PDF content into a stylistic XML output called ppXML, or preprocess XML, which is the "initial format" that xDoc uses as part of its integrated multi-step approach to transforming content.

Once the PDF data is in ppXML format, you can then transform it into any other XML DTD, including DocBook, DITA, LegalXML, or even your own custom DTD/schema. This additional conversion can be done using an XSLT stylesheet, or by using xDoc rules.
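
As a rough idea of what the XSLT route could look like, the sketch below applies a stylesheet with the standard javax.xml.transform API. Only the PAGE, PARAGRAPH and TEXT element names are taken from the feature list above; the DOCUMENT root element and the overall shape of the sample are assumptions, and the output uses DocBook-style article and para elements without claiming to be a valid DocBook document.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustration only: the sample input is a guess at what ppXML might look like.
final class PpXmlToDocBookSketch {

    // Hypothetical ppXML fragment; the DOCUMENT root is an assumption.
    static final String SAMPLE_PPXML =
        "<DOCUMENT><PAGE>"
        + "<PARAGRAPH><TEXT>First paragraph.</TEXT></PARAGRAPH>"
        + "<PARAGRAPH><TEXT>Second </TEXT><TEXT>paragraph.</TEXT></PARAGRAPH>"
        + "</PAGE></DOCUMENT>";

    // Maps every PARAGRAPH to a para element inside a single article element.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<article><xsl:apply-templates select='//PARAGRAPH'/></article>"
        + "</xsl:template>"
        + "<xsl:template match='PARAGRAPH'>"
        + "<para><xsl:value-of select='.'/></para>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the XML string and returns the result as text.
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(SAMPLE_PPXML, STYLESHEET));
    }
}
```

The text runs inside each PARAGRAPH are concatenated into one para; a real stylesheet would also need to decide what to do with formatting and position attributes.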

What formats can I publish a PDF document into?

We provide out-of-the-box support for transforming a PDF document into HTML. To render it in other formats, including RTF, you can first transform the PDF file into XML that conforms to a schema of your choice, and then use a stylesheet to produce XSL-FO or to go back to PDF.

On what platforms can I run PDF to XML and PDF to HTML conversions?

Because the xDoc Java PDF Driver is a pure Java application, it can convert source *.pdf files on Windows 2000/XP, Linux, or Solaris operating systems.

Can I extract only parts of a PDF document?

Yes, you can use xDoc rules to extract specific content from a PDF file. The content can be extracted based upon formatting and styling, upon x,y position on the page, or upon some other criteria.
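
The xDoc rules syntax is not shown here, so the sketch below uses plain XPath from the standard javax.xml.xpath API to show the same idea of selecting content by position or by formatting. The attribute names x, y and font-size, the DOCUMENT root and the sample values are assumptions about how ppXML might expose that information.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustration only: plain XPath over a guessed ppXML shape, not xDoc rules.
final class PositionFilterSketch {

    // Hypothetical ppXML fragment; the attribute names are assumptions.
    static final String SAMPLE_PPXML =
        "<DOCUMENT><PAGE width='612' height='792'>"
        + "<TEXT x='72' y='40' font-size='9'>Running header</TEXT>"
        + "<TEXT x='72' y='120' font-size='18'>Chapter 1</TEXT>"
        + "<TEXT x='72' y='160' font-size='11'>Body text.</TEXT>"
        + "</PAGE></DOCUMENT>";

    // Evaluates the XPath expression against the XML string and returns the
    // text content of each matching node, in document order.
    static List<String> select(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // By position: skip everything in the top 100 units of the page.
        System.out.println(select(SAMPLE_PPXML, "//TEXT[@y > 100]"));
        // By formatting: keep only large text, for example headings.
        System.out.println(select(SAMPLE_PPXML, "//TEXT[@font-size >= 18]"));
    }
}
```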

What if the PDF files are not consistently formatted or have different layouts?

As mentioned above, we have a plugin for Adobe Acrobat which allows you to “zone” certain areas of a PDF file using their x,y coordinates. These zones allow you to visually identify images, margins and columns, as well as any other custom characteristic.

How about tables of data in PDF files? Can I get access to them in the XML?

Since the x,y coordinates of all the paragraphs are available in the ppXML, you can re-display tables with their original look and feel without any difficulty.

However, if you want to extract meaningful content from the tables, there are several options, including allowing xDoc to try to find them for you, zoning them as images using the plugin mentioned above, or zoning the rows and columns using xDoc PDF Table Definition functionality. This zoning will give you precise control over how tables appear in your final XML. 

Please contact info@docscience.com if you have further questions about the Acrobat plugin or about the xDoc PDF Table Definition functionality.

What if I want to combine multiple PDF files into a single PDF file, or break an existing PDF file into multiple PDF files?

To do this you would use our PDFAssembly Solution, which is built on top of our JPDF Toolkit. For more information about this solution, please contact info@docscience.com.