"Transforming Unstructured Content into Meaningful XML"
White Paper
By Rizwan Virk, Chief Technology Officer
CambridgeDocs - www.cambridgedocs.com
A lot
of attention has been paid to the benefits of having unstructured
information (“documents”) stored in XML format. It’s viewed by
many as the “holy grail” of unstructured content – a single
storage format which can then be transformed into any type of needed
output – HTML, PDF, WML, RTF, or into other XML formats.
This is
a promising future for unstructured content, and eventually promises to
make the lives of those engaged in content creation, management, and
publishing much easier.
However, while a lot of attention has been paid to transforming XML into
different outputs, far less has been paid to getting existing content
into an XML format that can then be transformed. For a content provider,
or even an enterprise that has thousands of existing documents, this can
be a major problem, one that’s often overlooked when jumping on the
“XML bandwagon”.
While there are significant benefits to storing content in XML format,
the conversion process can be tedious and painful, and it is often done
manually. This paper explores the issues surrounding content conversion
into XML format and some of the options for automated transformation.
Most
people who have thought about XML at a high level haven’t thought too
much about the problems inherent in converting unstructured content into
a meaningful XML format.
Data that comes from relational databases, which is highly structured,
is very easy to transform into XML, so this isn’t really a problem. A
comma-separated file is also relatively easy to transform, as in the
example below:
Sample comma-separated file:
LastName, FirstName, Company, City, State
Doe, John, IBM, Rochester, New York
Resulting sample XML file:
<Record>
  <LastName>Doe</LastName>
  <FirstName>John</FirstName>
  <Company>IBM</Company>
  <City>Rochester</City>
  <State>New York</State>
</Record>
In fact, many SQL databases can now return their result sets in XML
format, so there isn’t even a need to transform the result set into XML.
This is a great step forward for the interoperability of data. Though it
will take years before many existing applications are rewritten to use
XML, it’s a safe bet that most new applications will use XML for storage
of certain kinds of data from the start.
The
transformation is relatively simple because the structure of the XML
document mirrors the structure of the record.
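To make the point concrete, here is a minimal Python sketch of that record-to-XML mapping. It assumes the header names are valid XML element names, and the wrapper element Records and the function name csv_to_xml are illustrative choices rather than part of any standard.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text):
    """Turn each CSV row into a <Record> whose children are named after the header columns."""
    root = ET.Element("Records")  # wrapper element; the name is an assumption
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        record = ET.SubElement(root, "Record")
        for column, value in row.items():
            # Each column becomes one element: the XML mirrors the record.
            ET.SubElement(record, column.strip()).text = (value or "").strip()
    return root


sample = (
    "LastName, FirstName, Company, City, State\n"
    "Doe, John, IBM, Rochester, New York\n"
)
records = csv_to_xml(sample)
print(ET.tostring(records, encoding="unicode"))
```

No judgment is needed anywhere in this loop: every field already has a name and a boundary, which is exactly what unstructured documents lack.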
Let’s
explore a similar situation for unstructured content, which might be
stored in say, a Microsoft Word document or even an HTML file or a PDF
page.
At first glance, it might appear that it’s relatively straightforward to
transform an unstructured document into XML. After all, most programs
that manage unstructured content, such as word processors, will
eventually have a “Save As XML” option, won’t they? This is similar to
how most word processors added a “Save as HTML” option within a few
years of the web becoming popular.
But the
reality is much trickier than that. The real trick is not so much in the
transformation into any kind of XML, but in transformation to the kind
of XML format that you would like to end up with. Storing a press
release that is in HTML format as an XML document doesn’t do much for
the initiated.
The
main problem is not that most unstructured documents lack a meaningful
structure. The main problem is that most unstructured documents don’t
have metadata saved in logical fields.
As an
example, let’s look at a press release. A useful XML structure for a
press release might be something like this:
<PressRelease>
  <Company>IBM</Company>
  <Division>Global Service</Division>
  <PRDate>12/15/2001</PRDate>
  <Title>IBM buys XYZ Technologies</Title>
  <Subtitle>Acquisition makes IBM a leader in unstructured content management</Subtitle>
  <Location>Cambridge, MA</Location>
  <Body>
    <Comment>text of press release</Comment>
  </Body>
  <AboutCompany>
    ….IBM is a global company …
  </AboutCompany>
  <AboutCompany>
    XYZ Technologies provides tools for managing, classifying, and aggregating unstructured content
  </AboutCompany>
  <ContactInformation>
    <ContactPerson>…</ContactPerson>
  </ContactInformation>
</PressRelease>
This type of XML schema would not only allow the press release to be
transformed into different types of output for publishing, it would also
allow it to be classified easily and assembled quickly. You can think of
this as a semi-structured format (most of the text is still within the
Body field), but the metadata has been extracted so that transformation
of the press release is easy to do.
Unfortunately,
it’s not straightforward or easy to convert existing press releases,
which may be in Microsoft Word or HTML format, to this new XML format.
You cannot use XSLT, because the document is not in an XML format to
begin with.
One solution to the problem might be to “save as XML” so that the
document starts out in XML, and then use XSLT to transform it into the
target XML we want. In the case of Word, you could save as HTML first.
Once it’s in HTML format, you could convert it to XHTML, which is HTML
reformulated as well-formed XML; there are many tools that turn HTML
into XHTML.
However,
even this doesn’t solve the problem! The information that is to end up
in certain XML fields is embedded within the text and not available for
simple transformation. You still can’t use XSLT or XPath to get at
that information. There is not an XML field called subtitle or date –
this information is lost within the text of the unstructured part of the
document.
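A small Python sketch illustrates the gap. The XHTML fragment below is a hypothetical, simplified rendering of the press release (namespace omitted for brevity). A path query can reach the paragraphs, but nothing in the markup says which paragraph is the subtitle or where the date sits.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML produced by "Save as HTML" plus an HTML-to-XHTML tool.
xhtml = """<html><body>
<p><b>IBM buys XYZ Technologies</b></p>
<p>Acquisition makes IBM a leader in unstructured content management</p>
<p>Cambridge, MA, 12/15/2001: text of press release</p>
</body></html>"""

doc = ET.fromstring(xhtml)

# Structural queries work: we can list every paragraph.
paragraphs = ["".join(p.itertext()) for p in doc.iter("p")]

# Semantic queries do not: there is no Subtitle or PRDate element to select.
subtitle = doc.find(".//Subtitle")
pr_dates = doc.findall(".//PRDate")

print(len(paragraphs), subtitle, pr_dates)
```

The document is valid XML, yet the fields we care about exist only as runs of text inside generic p elements.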
This example, while simplified, demonstrates a key issue for XML
adoption for unstructured content: it is non-trivial to transform
existing unstructured content into meaningful XML.
Meaningful
XML is any XML schema that accurately represents the meaning of the
document, and not simply the structure of the document. This allows the
document to be sorted, classified, taken apart, re-assembled, or
transformed as appropriate.
The
example given above is a perfect example of a meaningful XML document
for a press release. A meaningful XML document for a resume might be
something like this:
<Resume>
  <Name>John Doe</Name>
  <Objective>To get a position as a product manager</Objective>
  <Experience>
    <Job>
      <Company>Lotus</Company>
      <Title>Product Manager</Title>
      <DateFrom>1/1/1999</DateFrom>
      <DateTo>Present</DateTo>
    </Job>
    <Job>
      …
    </Job>
  </Experience>
  <Skills>
    <Skill>Product Management</Skill>
  </Skills>
</Resume>
Note
that a meaningful XML document is more than just an XML version of a
document.
It is an XML version of the document that has content sorted into fields
that are meaningful for the application or person that needs to use the
XML. In our resume example, this type of structure would allow searching
across hundreds of resumes to find, say, the people who had worked at
Lotus in 1999.
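As a rough sketch of what such a search could look like, the Python below filters resumes in the format shown above. The second and third resumes, the helper names, and the rule that “Present” means the current year are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date


def year_of(date_text):
    """Read the year from an M/D/YYYY date; treat 'Present' as the current year."""
    text = date_text.strip()
    if text.lower() == "present":
        return date.today().year
    return int(text.rsplit("/", 1)[-1])


def worked_at(resume, company, year):
    """True if any <Job> at the given company spans the given year."""
    for job in resume.iterfind("./Experience/Job"):
        name = (job.findtext("Company") or "").strip()
        start, end = job.findtext("DateFrom"), job.findtext("DateTo")
        if name != company or not start or not end:
            continue  # skip other companies and incomplete job entries
        if year_of(start) <= year <= year_of(end):
            return True
    return False


JOB = ("<Job><Company>{0}</Company><DateFrom>{1}</DateFrom>"
       "<DateTo>{2}</DateTo></Job>")
RESUME = "<Resume><Name>{0}</Name><Experience>{1}</Experience></Resume>"

resumes = [
    RESUME.format("John Doe", JOB.format("Lotus", "1/1/1999", "Present")),
    # Hypothetical resumes added for contrast:
    RESUME.format("Jane Roe", JOB.format("Lotus", "1/1/1995", "12/31/1997")),
    RESUME.format("Sam Poe", JOB.format("IBM", "1/1/1998", "12/31/2000")),
]

matches = [
    r.findtext("Name")
    for r in map(ET.fromstring, resumes)
    if worked_at(r, "Lotus", 1999)
]
print(matches)
```

The query is only a few lines because the company and dates are already separate fields; the same question asked of the original word-processor files would require parsing free text.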
The truth is that there is always more than one meaningful XML schema
for any given document. The structure that is best can only be
determined by the uses of the application. However, like a well-designed
relational database, a meaningful XML document will have enough fields
so that it can be used for multiple purposes. In the resume example,
this type of XML schema could be used both for searching and for
publishing purposes.
There are a number of emerging standards for different types of
unstructured content. Good examples are RIXML and IRXML, two XML
standards for investment research. Most investment research is put into
a PDF file or some other format, which doesn’t have all of the fields
tagged that are necessary for an application to be able to use the
information. A quick examination of RIXML or IRXML will show that there
are dozens of fields to be filled in.
Meaningful
XML is a hierarchical structure that represents an unstructured document
in a way that it can be easily classified, searched, dis-assembled,
assembled, and published in multiple formats. The fields that are in the
Meaningful XML structure vary based upon the type of document.
Another reason to transform a document into meaningful XML is to be able
to easily assemble new documents from meaningful sections of existing
documents. For example, in the press release example, the “About IBM”
and “Contact Information” sections rarely change, so there is no need to
store them in every press release XML document. Instead, they can be
assembled at transformation time. Similarly, sections of proposals can
be stored as separate XML chunks and then assembled, but only if there
is a meaningful XML structure.
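One way to picture assembly at transformation time is sketched below in Python. The Include placeholder element, its ref attribute, and the chunk contents are hypothetical conventions invented for this illustration; they are not part of any schema discussed in this paper.

```python
import copy
import xml.etree.ElementTree as ET

# Shared sections stored once, outside any individual press release.
# Keys and chunk contents are illustrative placeholders.
SHARED_CHUNKS = {
    "about-ibm": "<AboutCompany>IBM is a global company</AboutCompany>",
    "contact": "<ContactInformation><ContactPerson>(stored once)"
               "</ContactPerson></ContactInformation>",
}


def assemble(release, chunks):
    """Return a copy of the release with each top-level <Include> replaced by its chunk."""
    result = copy.deepcopy(release)
    for index, child in enumerate(list(result)):
        if child.tag == "Include":
            chunk = ET.fromstring(chunks[child.get("ref")])
            chunk.tail = child.tail  # keep the surrounding whitespace
            result[index] = chunk
    return result


release = ET.fromstring(
    "<PressRelease>"
    "<Title>IBM buys XYZ Technologies</Title>"
    "<Include ref='about-ibm'/>"
    "<Include ref='contact'/>"
    "</PressRelease>"
)
assembled = assemble(release, SHARED_CHUNKS)
print(ET.tostring(assembled, encoding="unicode"))
```

Updating the shared chunk once changes every press release assembled afterwards, which is the practical payoff of storing sections separately.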
Yet
another scenario is when different parts of existing documents have
different security needs. One section might be marked “Top Secret”
for example, and another section might be marked “Secret” and yet
another section “Public”. By having meaningful XML, these segments
can be assembled based upon the profile of a user.
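The following Python sketch shows one possible form of profile-based assembly. It assumes each section carries a classification attribute and that unmarked sections are treated as the most restrictive level; both conventions, and the sample document, are assumptions for illustration.

```python
import copy
import xml.etree.ElementTree as ET

# Ordering of the markings named in the text, lowest to highest.
LEVELS = {"Public": 0, "Secret": 1, "Top Secret": 2}


def view_for(document, clearance):
    """Build a copy of the document containing only sections the user may see."""
    allowed = LEVELS[clearance]
    result = ET.Element(document.tag, dict(document.attrib))
    for section in document:
        # Unmarked sections default to the most restrictive level.
        level = section.get("classification", "Top Secret")
        if LEVELS[level] <= allowed:
            result.append(copy.deepcopy(section))
    return result


doc = ET.fromstring(
    "<Proposal>"
    "<Section classification='Public'>Overview</Section>"
    "<Section classification='Secret'>Pricing</Section>"
    "<Section classification='Top Secret'>Sources</Section>"
    "<Section>Unmarked notes</Section>"
    "</Proposal>"
)

secret_view = view_for(doc, "Secret")
print([s.text for s in secret_view])
```

Filtering of this kind is only possible when section boundaries and their markings are explicit in the XML rather than implied by formatting.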
Now
that we have a good idea of what meaningful XML is and why it’s
needed, let’s explore how we might get content into a meaningful XML
format.
In a typical corporation there are easily thousands, if not hundreds of
thousands (even millions) of existing documents stored on local hard
drives, file servers, groupware databases, and emails. Obviously not all
of it can be transformed into meaningful XML, nor should it be. However,
transformation into XML is one of the key steps in an enterprise-wide
unstructured content management platform.
The
methods for transforming it are:
The manual method tends to work better for new documents that are being
created. However, most standard content management processes do not take
into account filling in all of the fields which would make for a
meaningful XML document.
In many
cases, this manual classification is done not when creating the document
(which may still be done in Microsoft Word or some similar program) but
when submitting the document into a portal or knowledge management
system. What this means is that even new documents are often not being
classified/categorized correctly. Usually, this classification is
limited to submitting it to the correct place in a taxonomy. This could
be a folder in the taxonomy called Resumes or Proposals. Or it could be
attached to a record in a sales system.
When
there are lots of documents, the manual method is clearly inadequate and
should be used as a last resort.
How
useful is this technology for creating meaningful XML?
It falls short in that two documents that are “similar” to each other
don’t necessarily have the same values for the fields in a meaningful
XML schema. They may, however, have similar values for one or two fields
in the schema. This method is quite imprecise and I wouldn’t recommend
using it to generate sophisticated meaningful XML documents from
existing unstructured content. It might be possible to use this method
to populate one or two fields in the schema and then manually enter the
rest.
Expert
systems tend to work better when there are rules that can be defined and
followed (see Automated, Rule Based Field Extraction below). Fuzzy Logic
systems tend to work when the rules can be defined, but their
application remains imprecise. For example in processing a press
release, the rule might be “If text is NEAR TOP and is BIG, then it is
somewhat likely to be the TITLE field”.
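The rule quoted above can be sketched with simple fuzzy membership functions. In the Python below, the thresholds (the top quarter of the document, twice the body font size), the use of the minimum as the fuzzy AND, and the sample lines are all illustrative assumptions, not values taken from any particular product.

```python
def near_top(line_index, total_lines):
    """Degree (0..1) to which a line is NEAR TOP: 1.0 for the first line,
    falling linearly to 0.0 a quarter of the way down the document."""
    if total_lines <= 1:
        return 1.0
    fraction = line_index / (total_lines - 1)
    return max(0.0, 1.0 - fraction / 0.25)


def big(font_size, body_size=12.0):
    """Degree (0..1) to which text is BIG: 0.0 at body size,
    rising linearly to 1.0 at twice the body size."""
    return min(1.0, max(0.0, (font_size - body_size) / body_size))


def title_likelihood(line_index, total_lines, font_size):
    """Fuzzy AND of the two conditions, taken here as the minimum."""
    return min(near_top(line_index, total_lines), big(font_size))


# (text, font size in points); the sizes are made up for the example.
lines = [
    ("IBM buys XYZ Technologies", 24.0),
    ("Acquisition makes IBM a leader in unstructured content management", 16.0),
    ("Cambridge, MA", 12.0),
    ("text of press release", 12.0),
    ("Contact Information", 14.0),
]

scores = [title_likelihood(i, len(lines), size) for i, (_, size) in enumerate(lines)]
best_index = max(range(len(lines)), key=scores.__getitem__)
print(lines[best_index][0], scores)
```

The output is a degree of belief rather than a yes/no answer, which is why results from this kind of rule still need a threshold or human review before a field is populated.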
This
approach can be effective if there are lots of documents that follow a
similar structure and if the rules can be sufficiently complex. However,
most existing technology platforms provide only very simple rule-based
extraction.
The first shortfall is that existing technology is usually limited to
defining “start text” and “end text” for a field. However, within a
particular kind of document (a resume or a press release) the start text
and end text may or may not be similar, even for documents of similar
structure. What is needed is a sophisticated rule generator which can
take into account the placement of text within a document, the placement
of text within sections, and the size and font of the text, and which
can nest more sophisticated regular expressions.
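A minimal Python sketch of a start-text/end-text rule shows both how simple it is and where it breaks. The function name and the two sample resumes are hypothetical.

```python
import re


def extract_between(text, start_text, end_text):
    """Return the text found between the two literal markers, or None if either is missing."""
    pattern = re.escape(start_text) + r"(.*?)" + re.escape(end_text)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


resume_a = (
    "Objective: To get a position as a product manager\n"
    "Experience: Lotus, Product Manager\n"
)
# Same structure, different headings: the literal markers no longer match.
resume_b = (
    "Career Goal: To get a position as a product manager\n"
    "Work History: Lotus, Product Manager\n"
)

objective_a = extract_between(resume_a, "Objective:", "Experience:")
objective_b = extract_between(resume_b, "Objective:", "Experience:")
print(objective_a, objective_b)
```

The second resume carries the same information in the same place, yet the rule finds nothing, which is why position, section, and font cues are needed in addition to literal markers.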
The second, and perhaps larger, shortfall in generating meaningful XML
is the inability to define a hierarchical schema that can be filled in
this way. Simple rules can only work with very simple (one level deep)
schemas. This might work with, say, press releases (maybe!) but most
schemas require several levels of nesting. XML documents are tree-like
in nature and this requires having “rules” within rules.
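One way to picture “rules within rules” is a rule tree whose shape mirrors the target schema: each rule captures a span of text, and its child rules run only inside that span. The Python sketch below assumes a pipe-delimited job line and uses a made-up second employer; the Rule class and the patterns are illustrative, not a description of any existing product.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rule:
    tag: str                      # element to create
    pattern: str                  # regex with exactly one capturing group
    children: List["Rule"] = field(default_factory=list)
    repeat: bool = False          # create one element per match instead of the first only


def apply_rule(rule, text, parent):
    """Create elements under parent for matches of rule, recursing into the captured text."""
    for match in re.finditer(rule.pattern, text, flags=re.DOTALL):
        element = ET.SubElement(parent, rule.tag)
        captured = match.group(1).strip()
        if rule.children:
            for child in rule.children:
                apply_rule(child, captured, element)
        else:
            element.text = captured
        if not rule.repeat:
            break


job_rule = Rule("Job", r"([^\n]+)", repeat=True, children=[
    Rule("Company", r"^([^|]+)\|"),
    Rule("Title", r"\|([^|]+)\|"),
    Rule("DateFrom", r"\|\s*([\d/]+)\s+-"),
    Rule("DateTo", r"-\s+(\S+)$"),
])
experience_rule = Rule("Experience", r"Experience:\n(.*?)\nSkills:", children=[job_rule])

resume_text = (
    "Experience:\n"
    "Lotus | Product Manager | 1/1/1999 - Present\n"
    "Acme | Analyst | 6/1/1997 - 12/31/1998\n"   # hypothetical second job
    "Skills:\n"
    "Product Management\n"
)

resume = ET.Element("Resume")
apply_rule(experience_rule, resume_text, resume)
print(ET.tostring(resume, encoding="unicode"))
```

Because each child rule sees only the text its parent captured, the same Company or Title pattern can be reused for every job without matching text elsewhere in the document.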
The
third shortfall of existing technologies is the inability of the system
to deduce which rules should be applied when. This leads to a manual
application of rules on a single type of document, rather than a large
body of rules which can be chosen selectively by the automated program
based upon the type/context of the source documents.
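As a very rough sketch of automatic rule selection, the Python below guesses a document type from keyword signatures and then looks up the rule set registered for that type. The signatures, the type names, and the rule names are all hypothetical; a real system would need far more robust type detection.

```python
# Keyword signatures per document type (illustrative only).
SIGNATURES = {
    "press_release": ("for immediate release", "about ", "contact"),
    "resume": ("objective", "experience", "skills"),
}

# Rule sets registered per document type (names are placeholders).
RULE_SETS = {
    "press_release": ["Title", "Subtitle", "PRDate", "Location", "Body"],
    "resume": ["Name", "Objective", "Experience", "Skills"],
}


def detect_type(text):
    """Return the type whose signature keywords appear most often, or None if none appear."""
    lowered = text.lower()
    scores = {
        doc_type: sum(keyword in lowered for keyword in keywords)
        for doc_type, keywords in SIGNATURES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def select_rules(text):
    """Pick the rule set for the detected type; unknown types get no rules."""
    return RULE_SETS.get(detect_type(text), [])


resume_text = "Objective: product manager\nExperience: Lotus\nSkills: Product Management\n"
release_text = "FOR IMMEDIATE RELEASE\nIBM buys XYZ Technologies\nContact: press office\n"

print(detect_type(resume_text), select_rules(release_text))
```

Documents that match no signature fall through with an empty rule set, which is a natural point to route them to manual handling.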
Rule
based extraction shows significant promise if these shortfalls can be
overcome.
This might include combining a rule-based approach with neuro-fuzzy
methods to be used for certain fields, while using more standard rules
for other fields in the hierarchy. Finally, whatever results are
produced would need to be verified and corrected by humans. This would
allow for refinement of the neuro-fuzzy methods and of the rules over
time. In an ideal world, these rules could then be stored at a granular
enough level that they could be reused on different types of documents.
Most importantly, the effort to build rule-based systems must be less
than the effort to manually tag the existing document set. A simple
rule-based system doesn’t take long to set up but isn’t able to do more
sophisticated processing. A very complex rule-based system amounts to
writing custom computer programs to do the parsing.
Figure
1: An overview of Content Conversion into XML, and then further
transformation.
It’s clear that to accurately transform existing unstructured content
into Meaningful XML, new platforms are needed that go beyond simple
“full text indexing” or even the use of neural net algorithms for
“clustering” documents. While many companies have tried to tackle the
problem of clustering documents into folders in a taxonomy, relatively
few have produced robust tools for extracting and managing the XML.
Figure 1 shows an overview of what the new platform would do. It would
take both internal and external unstructured content, crawl or retrieve
it, and convert it into a meaningful intermediate XML format. This XML
would be stored in a searchable, indexable “content warehouse”. In fact,
one document might be transformed into more than one meaningful XML
format. This is similar to the concept of a data warehouse or data mart,
which normalizes data across different systems and stores it in a
centralized location. In our case, we are storing only the XML.
This intermediary format could then be used for content comparison and
for assembly into XML documents that are meaningful for particular end
users. These meaningful XML documents would then be transformed into the
appropriate format (DOC, PDF, HTML, WML) for end users.
A new
platform would have the following characteristics:
About
CambridgeDocs
For
more information on the CambridgeDocs platform and “Meaningful
XML” conversion tools, please contact
info@cambridgedocs.com, or visit our site at www.cambridgedocs.com
© 2002-2006 CambridgeDocs. All Rights Reserved.