CambridgeDocs xDoc PDF to XML Conversions
xDoc uses its Java PDF Driver to convert your PDF files into XML as part of its integrated multi-step process for transforming content. The xDoc Java PDF Driver is the most complete Java-based means of parsing PDF files available today, and it lets you process your PDF content from any J2EE application server, such as BEA WebLogic or IBM WebSphere.
The Driver reads in the binary PDF file, extracts its text, layout, formatting, vector graphics and images, and outputs a stylistic XML representation of the content. This stylistic XML gives you complete programmatic access from Java to the otherwise unreadable binary content of the document, as well as to its formatting and key structural characteristics.
You can then use this information to map the content to XML DTDs like DocBook or DITA as shown below, and for programmatic multi-channel publishing, such as modifying the PDF data and re-rendering the content in either PDF or HTML format.
Because the Java PDF Driver operates directly against the file content and does not automate Adobe Acrobat in any way, you can parse and process your PDF content using multi-threaded technologies on any Linux, Solaris or Windows 2000/XP machine, giving you maximum flexibility in server-based performance and cost.
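To illustrate what programmatic access to a stylistic XML representation could look like from Java, here is a minimal sketch that uses only the standard JAXP DOM parser. The element and attribute names (`document`, `page`, `text`, `x`, `y`, `font`, `size`) are hypothetical and chosen for illustration; they are not the actual xDoc output schema, and no xDoc API is used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class StylisticXmlSketch {

    // HYPOTHETICAL sample: these element and attribute names are assumptions
    // made for this sketch, not the real xDoc stylistic XML format.
    static final String SAMPLE =
        "<document>"
      + "<page number=\"1\">"
      + "<text x=\"72\" y=\"720\" font=\"Helvetica-Bold\" size=\"18\">Chapter 1</text>"
      + "<text x=\"72\" y=\"690\" font=\"Times-Roman\" size=\"11\">Body text.</text>"
      + "</page>"
      + "</document>";

    /** Returns the text of every run whose font size is at least minSize. */
    static List<String> largeRuns(String xml, double minSize) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // Collect all <text> elements regardless of which page they are on.
        NodeList runs = doc.getElementsByTagName("text");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < runs.getLength(); i++) {
            Element run = (Element) runs.item(i);
            double size = Double.parseDouble(run.getAttribute("size"));
            if (size >= minSize) {
                result.add(run.getTextContent());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Large runs are candidate headings when mapping to a structural DTD.
        System.out.println(largeRuns(SAMPLE, 14.0));
    }
}
```

Filtering on formatting attributes such as font size is one simple way such formatting information might be used to decide which runs become headings when mapping content to a structural DTD like DocBook.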
Unlike the other conversion formats supported by xDoc, pagination is key to the structure of a PDF. Each page generally contains:
*Note that PDF files do not actually identify columns and tables internally. They are essentially just a collection of text runs positioned so that they look like a column or table when viewed, which is why the Java PDF Driver can only make a "best-effort" guess based on the coordinates of the text runs. For better accuracy, contact info@docscience.com about a CambridgeDocs plugin to Adobe Acrobat, shown below, which allows you to visually select and identify PDF content.
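The idea of guessing columns from text run coordinates, as described in the note above, can be sketched as follows. This is a deliberately simple illustration and not the algorithm the Java PDF Driver uses: it assumes each text run is reduced to the x coordinate of its left edge, and it groups left edges that lie within a chosen tolerance of the first edge in their group. The class name, method name and tolerance value are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnGuessSketch {

    /**
     * Guesses column start positions from the left edges of text runs.
     * A new column starts whenever a left edge lies more than `tolerance`
     * units to the right of the start of the current column.
     */
    static List<Double> guessColumnEdges(double[] leftEdges, double tolerance) {
        double[] sorted = leftEdges.clone();
        Arrays.sort(sorted);

        List<Double> columns = new ArrayList<>();
        for (double x : sorted) {
            // Compare against the first edge recorded for the current column.
            if (columns.isEmpty()
                    || x - columns.get(columns.size() - 1) > tolerance) {
                columns.add(x);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        // Left edges of six runs: three near x=72 and three near x=305.
        double[] runs = {72.0, 73.5, 306.0, 72.0, 305.0, 306.5};
        System.out.println(guessColumnEdges(runs, 10.0)); // prints [72.0, 305.0]
    }
}
```

A heuristic like this can fail on indented paragraphs, centered text or ragged tables, which is one reason coordinate-based column and table detection can only ever be a best-effort guess.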
Java PDF Driver FAQ
What XML format(s) can I convert my PDF documents into?
What formats can I publish a PDF document into?
On what platforms can I run PDF to XML and PDF to HTML conversions?
Can I extract only parts of a PDF document?
What if the PDF files are not consistently formatted or have different layouts?
How about tables of data in PDF files? Can I get access to them in the XML?
What if I want to combine multiple PDF files into a single PDF file, or break an existing PDF file into multiple PDF files?
© 2002-2010 EMC Document Sciences. All Rights Reserved.