"Transforming Unstructured Content into Meaningful XML"

White Paper
By
Rizwan Virk,
Chief Technology Officer
CambridgeDocs - www.cambridgedocs.com
Table of Contents

Introduction: XML for Unstructured Content
The “Save as XML” Problem
What is Meaningful XML?
What are the methods of transforming it?
The need for a new platform
About CambridgeDocs

 Introduction: XML for Unstructured Content

A lot of attention has been paid to the benefits of having unstructured information (“documents”) stored in XML format. It’s viewed by many as the “holy grail” of unstructured content – a single storage format which can then be transformed into any type of needed output – HTML, PDF, WML, RTF, or into other XML formats.
This is a promising future for unstructured content, and eventually promises to make the lives of those engaged in content creation, management, and publishing much easier.
However, while a lot of attention has been paid to transforming XML into different outputs, far less has been paid to how existing content gets into an XML format that can then be transformed. For a content provider, or even an enterprise that has thousands of existing documents, this can be a major problem – one that’s often overlooked when jumping on the “XML bandwagon”.
While there are significant benefits to storing content in XML format, the conversion process can be tedious and excruciatingly painful, and it is often done manually. This paper explores the issues surrounding content conversion into XML format and some of the options for automated transformation.

 The “Save as XML” Problem

Most people who have thought about XML at a high level haven’t thought too much about the problems inherent in converting unstructured content into a meaningful XML format.
Data that comes from relational databases is highly structured and very easy to transform into XML, so it isn’t really a problem. A comma-separated file is also relatively easy to transform, as in the example below:
Sample comma separated file:

LastName, FirstName, Company, City, State
Doe, John, IBM, Rochester, New York

Resulting sample XML file:

<Record>
<LastName>Doe</LastName>
<FirstName>John</FirstName>
<Company>IBM</Company>
<City>Rochester</City>
<State>New York</State>
</Record>

In fact, many SQL databases can now return their result sets in XML format, so there isn’t even a need to transform the result set into XML. This is a great step forward for the interoperability of data. Though it will take years before many existing applications are rewritten to use XML, it’s a safe bet that most new applications will use XML for storage of certain kinds of data right off the bat.
The transformation is relatively simple because the structure of the XML document mirrors the structure of the record.
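To show how directly the record structure maps onto XML, here is a minimal Python sketch of the conversion above. The function name csv_to_xml and the wrapping Records element are my own assumptions for illustration; they are not part of any particular product.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(text):
    """Convert comma-separated text with a header row into an XML tree."""
    # skipinitialspace drops the blank that follows each comma in the sample.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    root = ET.Element("Records")  # hypothetical wrapper element
    for row in reader:
        record = ET.SubElement(root, "Record")
        for field, value in row.items():
            # Each column name becomes an element; each cell becomes its text.
            ET.SubElement(record, field.strip()).text = (value or "").strip()
    return root

sample = (
    "LastName, FirstName, Company, City, State\n"
    "Doe, John, IBM, Rochester, New York\n"
)
print(ET.tostring(csv_to_xml(sample), encoding="unicode"))
```

The header row supplies the element names and each data row becomes one Record, so no knowledge of the content itself is required.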
Let’s explore a similar situation for unstructured content, which might be stored in say, a Microsoft Word document or even an HTML file or a PDF page.
At first glance, it might appear relatively straightforward to transform an unstructured document into XML. After all, most programs that manage unstructured content, such as word processors, will eventually have a “Save As XML” option, won’t they? This is similar to how most word processors added a “Save as HTML” option within a few years of the web becoming popular.
But the reality is much trickier than that. The real trick is not transformation into just any kind of XML, but transformation into the kind of XML format that you would like to end up with. Simply storing an HTML press release as an XML document doesn’t accomplish much.
The main problem is not that most unstructured documents lack a meaningful structure. It is that most unstructured documents don’t have their metadata saved in logical fields.
As an example, let’s look at a press release. A useful XML structure for a press release might be something like this:
<PressRelease>
<Company>IBM</Company>
<Division>Global Service</Division>
<PRDate>12/15/2001</PRDate>
<Title>IBM buys XYZ Technologies</Title>
<Subtitle>Acquisition makes IBM a leader in unstructured content management</Subtitle>
<Location>Cambridge, MA</Location>
<Body>
<Comment>text of press release</Comment>
</Body>
<AboutCompany>
….IBM is a global company …
</AboutCompany>
<AboutCompany>
XYZ Technologies provides tools for managing, classifying, and aggregating unstructured content
</AboutCompany>
<ContactInformation>
<ContactPerson>…</ContactPerson>
</ContactInformation>
</PressRelease>

This type of XML schema would not only allow the press release to be transformed into different types of output for publishing, it would also allow it to be classified easily and assembled quickly. You can think of this as a semi-structured format (most of the text is still within the Body field), but the metadata has been extracted so that transformation of the press release is easy to do.
Unfortunately, it’s not straightforward or easy to convert existing press releases, which may be in Microsoft Word or HTML format, to this new XML format. You cannot use XSLT, because the document is not in an XML format to begin with.
One solution to the problem might be to “save as XML” so that the document starts out in XML, and then use XSLT to transform it into the target XML we want. In the case of Word, you could save as HTML first. Once it’s in HTML format, you could convert it to XHTML, which is HTML reformulated as well-formed XML; there are many tools that turn HTML into XHTML.
However, even this doesn’t solve the problem! The information that is to end up in certain XML fields is embedded within the text and not available for simple transformation. You still can’t use XSLT or XPath to get at that information. There is not an XML field called subtitle or date – this information is lost within the text of the unstructured part of the document.
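A short Python sketch makes the point concrete. The XHTML fragment below is a hypothetical rendering of the press release (the h1, h2, and p layout is my assumption); once parsed, the only element names available to query are structural ones.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML version of the press release, as a converter might emit it.
xhtml = """<html><body>
<h1>IBM buys XYZ Technologies</h1>
<h2>Acquisition makes IBM a leader in unstructured content management</h2>
<p>Cambridge, MA, 12/15/2001. text of press release</p>
</body></html>"""

doc = ET.fromstring(xhtml)

# Every element name in the document describes layout, not meaning.
structural_tags = sorted({element.tag for element in doc.iter()})

# Path queries for the fields we actually want find nothing.
semantic_hits = [doc.find(".//" + name) for name in ("Subtitle", "PRDate", "Location")]

print(structural_tags)
print(semantic_hits)
```

The subtitle, date, and location are all present in the text, but nothing in the markup says which string is which, so a path expression has nothing to select.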
This example, while simplified, demonstrates a key issue for XML adoption for unstructured content: it is non-trivial to transform existing unstructured content into meaningful XML.

 What is Meaningful XML?

Meaningful XML is any XML schema that accurately represents the meaning of the document, and not simply the structure of the document. This allows the document to be sorted, classified, taken apart, re-assembled, or transformed as appropriate.
The press release above is a good example of a meaningful XML document. A meaningful XML document for a resume might be something like this:
<Resume>
<Name>John Doe</Name>
<Objective>To get a position as a product manager</Objective>
<Experience>
<Job>
<Company>Lotus</Company>
<Title> Product Manager</Title>
<DateFrom>1/1/1999</DateFrom>
<DateTo>Present</DateTo>
</Job>
<Job>
</Job>
</Experience>
<Skills>
<Skill> Product Management</Skill>
</Skills>
</Resume>

Note that a meaningful XML document is more than just an XML version of a document.
It is an XML version of the document that has content sorted into fields that are meaningful for the application or person that needs to use the XML. In our resume example, this type of structure would allow searching across hundreds of resumes to find, say, the people who worked at Lotus in 1999.
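As a rough illustration of that kind of search, the Python sketch below checks each resume for a job at a given company that covers a given year. The helper names (year_of, worked_at) and the second resume (Jane Roe at Acme) are hypothetical, and the date handling assumes the M/D/YYYY and “Present” conventions used in the sample.

```python
import xml.etree.ElementTree as ET

def year_of(date_text, default):
    """Return the year of an M/D/YYYY string, or default for text like 'Present'."""
    last = date_text.strip().rsplit("/", 1)[-1]
    return int(last) if last.isdigit() else default

def worked_at(resume, company, year):
    """True if any Job in the resume is at the company and spans the year."""
    for job in resume.iter("Job"):
        if (job.findtext("Company") or "").strip() != company:
            continue
        start = year_of(job.findtext("DateFrom") or "", 0)
        end = year_of(job.findtext("DateTo") or "", 9999)  # open-ended job
        if start <= year <= end:
            return True
    return False

resumes = [ET.fromstring(text) for text in (
    "<Resume><Name>John Doe</Name><Experience><Job><Company>Lotus</Company>"
    "<Title> Product Manager</Title><DateFrom>1/1/1999</DateFrom>"
    "<DateTo>Present</DateTo></Job><Job></Job></Experience></Resume>",
    # Hypothetical second resume, included only to show a non-match.
    "<Resume><Name>Jane Roe</Name><Experience><Job><Company>Acme</Company>"
    "<DateFrom>6/1/2000</DateFrom><DateTo>5/31/2001</DateTo></Job>"
    "</Experience></Resume>",
)]

matches = [r.findtext("Name") for r in resumes if worked_at(r, "Lotus", 1999)]
print(matches)
```

The query is a few lines long only because Company, DateFrom, and DateTo exist as separate fields; against the original word processor files the same question would require reading each document.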
The truth is that there is always more than one meaningful XML schema for any given document. The structure that is best can only be determined by the uses of the application. However, like a well-designed relational database, a meaningful XML document will have enough fields so that it can be used for multiple purposes. In the resume example, this type of XML schema could be used both for searching and for publishing purposes.
There are a number of emerging standards for different types of unstructured content. Good examples are RIXML and IRXML, two XML standards for investment research. Most investment research is put into a PDF file or some other format, which doesn’t have all of the fields tagged that are necessary for an application to be able to use the information. A quick examination of RIXML or IRXML will show that there are dozens of fields to be filled in.
Meaningful XML is a hierarchical structure that represents an unstructured document in a way that it can be easily classified, searched, disassembled, assembled, and published in multiple formats. The fields that are in the Meaningful XML structure vary based upon the type of document.
Another reason to transform a document into meaningful XML is to be able to easily assemble new documents from meaningful sections of existing documents. For example, in the press release example, the “About IBM” and “Contact Information” sections rarely change, so there is no need to store them in every press release XML document. Instead, they can be assembled at transformation time. Similarly, sections of proposals can be stored as separate XML chunks and then assembled, but only if there is a meaningful XML structure.
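One way to picture assembly at transformation time is the Python sketch below. The shared_chunks store and the assemble function are hypothetical names, and the chunk text is a placeholder rather than real boilerplate.

```python
import copy
import xml.etree.ElementTree as ET

# Sections that rarely change are stored once (placeholder content).
shared_chunks = {
    "AboutCompany": ET.fromstring(
        "<AboutCompany>(About IBM text, stored once)</AboutCompany>"),
    "ContactInformation": ET.fromstring(
        "<ContactInformation>(contact details, stored once)</ContactInformation>"),
}

def assemble(release, chunks):
    """Return a copy of the release with the shared sections appended."""
    result = copy.deepcopy(release)  # leave the stored release untouched
    for chunk in chunks.values():
        result.append(copy.deepcopy(chunk))
    return result

# The stored press release holds only what is unique to it.
release = ET.fromstring(
    "<PressRelease><Title>IBM buys XYZ Technologies</Title>"
    "<Body>text of press release</Body></PressRelease>")

full = assemble(release, shared_chunks)
print([child.tag for child in full])
```

Because the boilerplate lives in one place, a change to the contact details shows up in every release the next time it is assembled.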
Yet another scenario is when different parts of existing documents have different security needs. One section might be marked “Top Secret” for example, and another section might be marked “Secret” and yet another section “Public”. By having meaningful XML, these segments can be assembled based upon the profile of a user.
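The same idea can be sketched for security profiles. In the Python example below, the Section element, its classification attribute, and the ordering of levels are all assumptions made for illustration; a section with no marking is treated as the most restricted.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical ordering of classification labels, lowest to highest.
LEVELS = {"Public": 0, "Secret": 1, "Top Secret": 2}

def view_for(document, clearance):
    """Build a copy of the document holding only sections the user may read."""
    allowed = LEVELS[clearance]
    view = ET.Element(document.tag)
    for section in document.findall("Section"):
        # Unmarked sections default to the most restricted level.
        label = section.get("classification", "Top Secret")
        if LEVELS[label] <= allowed:
            view.append(copy.deepcopy(section))
    return view

report = ET.fromstring(
    "<Report>"
    "<Section classification='Public'>Summary</Section>"
    "<Section classification='Secret'>Details</Section>"
    "<Section classification='Top Secret'>Sources</Section>"
    "<Section>Unmarked notes</Section>"
    "</Report>")

print([s.text for s in view_for(report, "Secret")])
```

A user cleared for “Secret” receives the first two sections only; the filtering is possible because the document is already divided into meaningful, individually marked segments.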

 What are the methods of transforming it?

Now that we have a good idea of what meaningful XML is and why it’s needed, let’s explore how we might get content into a meaningful XML format.
In a typical corporation there are easily thousands, if not hundreds of thousands or even millions, of existing documents stored on local hard drives, file servers, groupware databases, and emails. Obviously not all of it can be transformed into meaningful XML, nor should it be. However, transformation into XML is one of the key steps in an enterprise-wide unstructured content management platform.
The methods for transforming it are:
Manual Extraction. The most popular method for transforming existing content into XML is manual extraction. This is the most accurate but most labor-intensive effort. Because the values of the fields are embedded within the document, humans read the documents and “extract” the values of existing fields and then enter the values into a user interface. This UI often makes it somewhat easier than typing in straight XML, but this manual process is still difficult.
This method tends to work better for new documents that are being created. However, most standard content management processes do not take into account filling in all of the fields which would make for a meaningful XML document.
In many cases, this manual classification is done not when creating the document (which may still be done in Microsoft Word or some similar program) but when submitting the document into a portal or knowledge management system. What this means is that even new documents are often not being classified/categorized correctly. Usually, this classification is limited to submitting it to the correct place in a taxonomy. This could be a folder in the taxonomy called Resumes or Proposals. Or it could be attached to a record in a sales system.
When there are lots of documents, the manual method is clearly inadequate and should be used as a last resort.

Automated approaches to classification and metadata extraction. There are a number of automated methods for indexing and classifying unstructured content. Though these don’t necessarily extract XML from the documents, they can be used to infer the values of some of the fields in the XML document.
Full text indexing. The most common method for indexing documents is the full text index. While this can be effective for end users searching for documents about a particular topic, it’s not very effective for classifying documents or producing meaningful XML.
Concept based clustering. There are many searching and indexing technologies out there that are more advanced than full text indexing. The most popular are based on Bayesian logic and “cluster” related documents together. These technologies, from companies like Autonomy, use an existing document and then find other documents like it. The system gets trained over time by humans and gets better at clustering.
How useful is this technology for creating meaningful XML?
It falls short in that two documents that are “similar” to each other don’t necessarily have the same values for the fields in a meaningful XML schema. They may, however, have similar values for one or two fields in the schema. This method is quite imprecise and I wouldn’t recommend using it to generate sophisticated meaningful XML documents from existing unstructured content. It might be possible to use this method to populate one or two fields in the schema and then manually enter the rest.
Learning Based Extraction. That’s not to say that machine-learning algorithms don’t have a place in text processing. The key is to use the right tools for the right job - neural net algorithms tend to work best when there is a large training set of data but not a lot of explicit or implicit rules to follow. While this is useful for “recognizing” over time that two documents are similar (as in clustering algorithms), it’s not entirely useful for extracting a hierarchical XML structure out of an unstructured document.
Expert systems tend to work better when there are rules that can be defined and followed (see Automated, Rule Based Field Extraction below). Fuzzy Logic systems tend to work when the rules can be defined, but their application remains imprecise. For example in processing a press release, the rule might be “If text is NEAR TOP and is BIG, then it is somewhat likely to be the TITLE field”.
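The fuzzy rule quoted above can be sketched in a few lines of Python. The membership functions, their thresholds (top quarter of the document, 10 to 24 point type), and the use of the minimum as the fuzzy AND are my own illustrative choices, not values taken from any product.

```python
def near_top(line_index, total_lines):
    """NEAR TOP: 1.0 on the first line, falling to 0.0 a quarter of the way down."""
    cutoff = max(total_lines * 0.25, 1)
    return max(0.0, 1.0 - line_index / cutoff)

def big(font_size, body_size=10.0, max_size=24.0):
    """BIG: 0.0 at body size, rising linearly to 1.0 at max_size or above."""
    return min(1.0, max(0.0, (font_size - body_size) / (max_size - body_size)))

def title_likelihood(line_index, total_lines, font_size):
    """Fuzzy AND of the two memberships, taken here as their minimum."""
    return min(near_top(line_index, total_lines), big(font_size))

# Hypothetical layout: (text, font size in points) for each line of a release.
lines = [
    ("IBM buys XYZ Technologies", 24.0),
    ("Acquisition makes IBM a leader in unstructured content management", 14.0),
] + [("text of press release", 10.0)] * 6

scores = [title_likelihood(i, len(lines), size) for i, (_, size) in enumerate(lines)]
best_text = lines[scores.index(max(scores))][0]
print(best_text, scores[:3])
```

Rather than a yes or no answer, each line gets a degree of “title-ness”, which is what allows the result to be combined with other evidence or passed to a human for confirmation.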
Automated (Rule Based) Field Extraction. The next approach is to define a set of rules for extracting “virtual fields” within the unstructured document and then mapping those into “custom fields” in the document. Many portals and indexing tools have a limited means to do this. For example, in an HTML document, they are able to extract the META fields and place those into “virtual fields”.
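For HTML sources, this kind of META extraction needs very little code. The Python sketch below uses the standard library parser; the class name MetaFieldParser and the sample page (including its company and date META names) are hypothetical.

```python
from html.parser import HTMLParser

class MetaFieldParser(HTMLParser):
    """Collect <meta name=... content=...> pairs as virtual fields."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.fields[name.lower()] = content

# Hypothetical page; the META names are assumptions for illustration.
page = (
    '<html><head><title>IBM buys XYZ Technologies</title>'
    '<meta name="company" content="IBM">'
    '<meta name="date" content="12/15/2001">'
    '</head><body><p>text of press release</p></body></html>'
)

parser = MetaFieldParser()
parser.feed(page)
print(parser.fields)
```

This works only when the author has already supplied the metadata in META tags; anything that appears solely in the body text is out of reach of such a rule.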
This approach can be effective if there are lots of documents that follow a similar structure and if the rules can be sufficiently complex. However, most existing technology platforms provide only very simple rule-based extraction.
The first shortfall is that existing technology is usually limited to a “start text” and an “end text” for each field. However, within a particular kind of document (a resume or a press release), the start text and end text may or may not be similar, even for documents of similar structure. What is needed is a sophisticated rule generator that takes into account the placement of text within a document, the placement of text within sections, and the size and font of the text, and that can nest more sophisticated regular expressions.
The second, and perhaps larger, shortfall in generating meaningful XML is the inability to define a hierarchical schema that can be filled in this way. Simple rules can only work with very simple (one level deep) schemas. This might work with, say, press releases (maybe!), but most schemas require several levels of nesting. XML documents are tree-like in nature, and this requires having “rules” within rules.
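To illustrate what “rules within rules” might look like, here is a minimal Python sketch in which each rule captures a piece of text and hands it to its child rules. The Rule class, the regular expressions, and the plain-text resume layout are all assumptions made for this example; real documents would need far more robust patterns.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class Rule:
    tag: str                  # element to create
    pattern: str              # regex with one capture group
    children: list = field(default_factory=list)
    repeat: bool = False      # create one element per match

def apply_rule(rule, text, parent):
    """Apply a rule to text, then apply its children to what it captured."""
    for match in re.finditer(rule.pattern, text, re.MULTILINE):
        element = ET.SubElement(parent, rule.tag)
        captured = match.group(1).strip()
        if rule.children:
            for child in rule.children:
                apply_rule(child, captured, element)
        else:
            element.text = captured
        if not rule.repeat:
            break

def extract(root_tag, rules, text):
    root = ET.Element(root_tag)
    for rule in rules:
        apply_rule(rule, text, root)
    return root

# Hypothetical rules for a simple plain-text resume layout.
RESUME_RULES = [
    Rule("Name", r"\A([^\n]+)"),
    Rule("Objective", r"^Objective:\s*([^\n]+)"),
    Rule("Experience", r"^Experience\n([\s\S]*?)(?=^Skills$|\Z)", [
        Rule("Job", r"^([^\n]+)$", [
            Rule("Company", r"^([^,]+),"),
            Rule("Title", r"^[^,]+,\s*([^,]+),"),
            Rule("DateFrom", r",\s*([\d/]+)\s*-"),
            Rule("DateTo", r"-\s*(\S+)\s*$"),
        ], repeat=True),
    ]),
    Rule("Skills", r"^Skills\n([\s\S]*)\Z", [
        Rule("Skill", r"^([^\n]+)$", repeat=True),
    ]),
]

resume_text = (
    "John Doe\n"
    "Objective: To get a position as a product manager\n"
    "Experience\n"
    "Lotus, Product Manager, 1/1/1999 - Present\n"
    "Skills\n"
    "Product Management\n"
)

resume = extract("Resume", RESUME_RULES, resume_text)
print(ET.tostring(resume, encoding="unicode"))
```

The nesting of the rules mirrors the nesting of the target schema: the Experience rule isolates a section, the Job rule isolates a line within it, and the innermost rules pick individual fields out of that line.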
The third shortfall of existing technologies is the inability of the system to deduce which rules should be applied when. This leads to a manual application of rules on a single type of document, rather than a large body of rules which can be chosen selectively by the automated program based upon the type/context of the source documents.
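A very simple form of rule selection can be sketched as follows. The cue phrases, the scoring by number of matching cues, and the rule set contents are hypothetical; a production system would presumably use something richer than keyword matching.

```python
import re

# Hypothetical rule sets, named by the fields they would populate.
RULE_SETS = {
    "press_release": ["Company", "PRDate", "Title", "Subtitle", "Location", "Body"],
    "resume": ["Name", "Objective", "Experience", "Skills"],
}

# Hypothetical cue patterns that hint at each document type.
SELECTORS = {
    "press_release": [r"\bfor immediate release\b", r"\bpress release\b", r"^about \w+"],
    "resume": [r"^objective\b", r"^experience\b", r"^skills\b"],
}

def detect_type(text):
    """Return the document type whose cues match most often, or None."""
    flags = re.IGNORECASE | re.MULTILINE
    scores = {
        doc_type: sum(bool(re.search(p, text, flags)) for p in patterns)
        for doc_type, patterns in SELECTORS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def rules_for(text):
    """Pick the rule set to apply; unknown documents get no rules."""
    return RULE_SETS.get(detect_type(text), [])

resume_text = "John Doe\nObjective: To get a position\nExperience\nLotus\nSkills\n"
release_text = "For immediate release\nIBM buys XYZ Technologies\nAbout IBM\n"

print(detect_type(resume_text), detect_type(release_text))
```

The point is the separation of concerns: one step decides what kind of document it is looking at, and only then is the matching body of extraction rules applied.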
Rule based extraction shows significant promise if these shortfalls can be overcome.

Combination Approach. Often the most effective approach for any given unstructured content source is a combination of the methods listed. Because of the nature of hierarchical XML documents, it is inevitable that a rule-based approach is the starting point for extracting fields from an unstructured document.
This might include combining a rule-based approach with neuro-fuzzy methods to be used for certain fields, while using more standard rules for other fields in the hierarchy. Finally, whatever results are produced would need to be verified and corrected by humans. This would allow for refinement of the neuro-fuzzy methods and of the rules over time. In an ideal world, these rules could then be stored at a granular enough level that they could be reused on different types of documents.
Most importantly, the effort to build a rule-based system must be less than the effort to manually tag the existing document set. A simple rule-based system doesn’t take long to set up but isn’t able to do more sophisticated processing. A very complex rule-based system amounts to writing custom computer programs to do the parsing.

Figure 1: An overview of Content Conversion into XML, and then further transformation.

 The need for a new platform

It’s clear that accurately transforming existing unstructured content into Meaningful XML requires new platforms that go beyond simple “full text indexing” or even the use of neural net algorithms for “clustering” documents. While many companies have tried to tackle the problem of clustering documents into folders in a taxonomy, relatively few have produced robust tools for extracting and managing the XML.
Figure 1 shows an overview of what the new platform would do. It would take both internal and external unstructured content and crawl or retrieve that content into a meaningful intermediate XML format. This XML would be stored in a searchable, indexable “content warehouse”. In fact, one document might be transformed into more than one meaningful XML format. This is similar to the concept of a data warehouse or data mart, which normalizes data across different systems and stores it in a centralized location. In our case, we are storing only XML.
This intermediary format could then be used for content comparison and for assembly into XML documents that are meaningful for particular end users. These meaningful XML documents would then be transformed into the appropriate format (DOC, PDF, HTML, WML) for end users.
A new platform would have the following characteristics:
Apply sophisticated rules against source content and target schema. It would allow for the creation of rules to populate the elements of an XML schema. This rule mechanism would allow for the fact that real XML documents can have nested elements many levels deep. It would allow for the extraction of pieces of content, and then further parse these pieces of content for more specific data. This rule-based mechanism would be able to apply different rules based upon selection criteria.
In an ideal world, these rules could be applied dynamically and evolve using neuro-fuzzy machine learning methods rather than having to explicitly apply them each time. For example, the system could learn about the “structure of a press release” and be reasonably sure that a particular document is a press release, which would lead to the application of rules related to press releases.
In an ideal system, the rules would be easy to define and could range from very simple rules (which can be defined by point and click) to more sophisticated ones (which may require writing custom parsers in a language such as Perl, Java, or C/C++).
The system could apply similar rules across different source formats – including plain text, HTML, Microsoft Word, Lotus Notes, and PDF files – adapting the same high-level rules to each.
Need to have workflow for Human Intervention. It is important that any automated system allow for human intervention to correct “mistakes”.
Need to be able to scale and do it in real time. An ideal system would be able to scale and process thousands of documents relatively quickly. Text processing systems are notorious for being extremely slow. The ability to inject custom programs (written in 3GL and 4GL languages) could speed up the processing considerably. In some cases, the XML transformation would be used to transform a document and store it in a “content warehouse”; in other cases the XML might be used simply to extract metadata and make it available to a calling program. In the latter case, the system needs to perform well enough to convert documents to XML in real time.
There are, of course, other characteristics that an ideal system would have. A number of companies, including XYZ Technologies, are working on sophisticated platforms for extracting metadata and transforming existing documents into “Meaningful XML”. Because this is a particularly hard problem to solve, the more flexible the platforms are, the more applicable they will be to different kinds of unstructured content.
 
About CambridgeDocs

For more information on the CambridgeDocs platform and “Meaningful XML” conversion tools, please contact info@cambridgedocs.com, or visit our site at www.cambridgedocs.com