"Why Convert Documents into XML?"
White Paper
By Rizwan Virk, Chief Technology Officer, CambridgeDocs -
www.cambridgedocs.com
Introduction
XML has become such an overused buzzword that it is difficult to tell
when it is appropriate and when it is not. In general, the main reason
for XML's popularity is that it provides an underlying technology that
gives "portability" of information across platforms, applications, and
organizations.
Much of the emphasis on XML has been on sending “structured” data
between companies. For example, if company A wants to send a purchase
order to company B, both need to agree on a formatting convention.
XML provides both a language for describing that formatting convention
and a convenient way to actually send the purchase order data.
While there are significant benefits to having inter-operable structured
data, we believe that an equally important use of XML is the
creation, storage, indexing, and publishing of documents - what is often
referred to as “unstructured content”. Unstructured
(and semi-structured) content today in corporations is kept in a number
of locations and typically makes up about 80% of a company's overall
data/information. Unlike structured data, which typically lives in
databases and is well-ordered, unstructured content lives on individual
file servers (as Microsoft Word or PDF files), in groupware databases
(like Lotus Notes), on web servers (as HTML documents) or in other
legacy systems.
This
article is about the reasons why XML is particularly well suited for
this task - the creation, storage, indexing, and publishing of
documents, and why it is cost effective to come up with a strategy for
converting a company's key unstructured assets into XML.
Why Create/Convert Documents to XML?
Allows Intelligent Queries of Content.
One of
the main reasons to get documents out of their existing formats is to be
able to search / index those documents in a meaningful way.
Say, for example, that your organization has one or more directories
full of resumes. Many resumes arrive by email or as Microsoft Word
(.DOC) files. This is not a particularly useful format for searching or
indexing. Suppose you wanted to run a query to find “all people
who worked for Lotus from 1998-2000”. It is difficult, if not
impossible, to find this information from a group of files sitting on a
file server. One approach has been to full-text index the
documents. This might help you find all people with the word Lotus
in their resume - but there is still no intelligence behind the indexing.
If the documents were broken into meaningful XML formats (such as
HR-XML), it would be much easier to do this type of querying,
because you would have turned your documents into a virtual database.
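As a rough illustration of the resume query above, the following sketch uses Python's standard xml.etree.ElementTree module. The element and attribute names (resume, job, employer, start, end) are invented for this example and are not the real HR-XML schema; the people and employers other than Lotus are made up as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified resume markup (NOT the actual HR-XML schema).
RESUMES_XML = """
<resumes>
  <resume>
    <name>Alice Example</name>
    <job employer="Lotus" start="1997" end="2001"/>
    <job employer="Acme" start="2001" end="2004"/>
  </resume>
  <resume>
    <name>Bob Example</name>
    <job employer="Lotus" start="2001" end="2003"/>
  </resume>
  <resume>
    <name>Carol Example</name>
    <job employer="Initech" start="1995" end="2002"/>
  </resume>
</resumes>
"""

def worked_at(root, employer, first_year, last_year):
    """Return names of people with a job at `employer` that covers
    the whole span first_year..last_year."""
    names = []
    for resume in root.iter("resume"):
        for job in resume.iter("job"):
            if (job.get("employer") == employer
                    and int(job.get("start")) <= first_year
                    and int(job.get("end")) >= last_year):
                names.append(resume.findtext("name"))
                break  # one matching job per person is enough
    return names

root = ET.fromstring(RESUMES_XML)
matches = worked_at(root, "Lotus", 1998, 2000)
print(matches)
```

A plain full-text search for the word Lotus would also return the second resume, even though that person joined after 2000. The structured query can tell the two apart because the dates are marked up as data.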
Similarly, if you were a mutual fund company, you might have a
collection of investment research gathered from a number of different
sources, sitting on file servers as PDF files. PDF files are
particularly difficult to work with because they aren't meant to be
edited - only read. However, you might want to query this body of
research to find all of those research reports which upgraded a stock
from a Buy to a Strong Buy. Again, if you were to convert these into a
meaningful XML format (such as RIXML), then you would be able to do
this type of querying against the source data because it would be
intelligently categorized.
This “intelligent” indexing can happen even if the documents stay as
individual XML files on the file server, or it could happen by moving
the XML into an XML database store.
Figure 1: XML allows you to write it once and publish it in many formats
Write Once, Publish Many Times.
Perhaps the most important reason to convert documents to XML is that
those documents need to be published. Corporations today have more than
one channel of information to their customers. These channels include
printed documents and manuals, electronic communication that is
emailed (brochures, email), and web sites (which are in HTML format).
Most
companies don't have a coherent strategy for external publishing - it is
done in different ways throughout the company. One group might use
Word Documents which are printed directly. Another might use a
content management system for the web site. Yet another might
convert to PDF for manuals.
The key with XML, as shown in Figure 1, is that it can be transformed
into the appropriate publishing format - Word (DOC/RTF), HTML (for web
sites), PDF (for printed documentation), DocBook (an XML standard for
storage and sharing of content), WML (for wireless devices), or any
other format which becomes available in the future. This saves
time and money because effort doesn't have to be repeated. With the
push of a button, the XML can be transformed using XSLT, or XML
stylesheets.
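To make the write-once, publish-many idea concrete, here is a minimal sketch in Python. It is not XSLT: Python's standard library has no XSLT processor, so two small hand-written renderers stand in for stylesheets. The release, title, and para element names and the sample text are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical single-source document; element names are illustrative only.
SOURCE_XML = """
<release>
  <title>Acme Ships Widget 2.0</title>
  <para>Widget 2.0 is faster &amp; smaller.</para>
  <para>It is available today.</para>
</release>
"""

def to_html(doc):
    """Render the source document as an HTML fragment."""
    parts = ["<h1>%s</h1>" % escape(doc.findtext("title"))]
    parts += ["<p>%s</p>" % escape(p.text) for p in doc.findall("para")]
    return "\n".join(parts)

def to_text(doc):
    """Render the same source document as plain text."""
    title = doc.findtext("title")
    lines = [title, "=" * len(title)]
    lines += [p.text for p in doc.findall("para")]
    return "\n".join(lines)

doc = ET.fromstring(SOURCE_XML)
html_output = to_html(doc)   # for the web site
text_output = to_text(doc)   # for a plain-text email
print(html_output)
print(text_output)
```

The content is written once in SOURCE_XML; adding another output channel means adding another renderer (or, in a real system, another stylesheet) rather than re-authoring the document.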
Custom-Assemble Documents for Customers, Business Partners.
Another key benefit that comes from having content stored in XML format
is that it can be “custom-assembled”. This means that customer
A, who might only be interested in research about two companies in the
semiconductor industry and three companies in software, can get a
research report that covers only those companies - rather than
having to go through dozens of companies in each industry. This is
possible because the content can be assembled on the fly, as shown in
Figure 2.
Figure 2: Investment research transformed into XML and custom assembled
for each client
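The custom assembly described above might look something like the following Python sketch. The research and report elements, the company attribute, and the company names are all hypothetical; a real research format such as RIXML is considerably richer.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical research store; names and attributes are illustrative only.
RESEARCH_XML = """
<research>
  <report company="ChipCo" industry="semiconductors">ChipCo outlook</report>
  <report company="WaferWorks" industry="semiconductors">WaferWorks outlook</report>
  <report company="SoftCorp" industry="software">SoftCorp outlook</report>
  <report company="DataSoft" industry="software">DataSoft outlook</report>
</research>
"""

def assemble_for(root, companies):
    """Build a new document containing only the reports a client asked for."""
    wanted = set(companies)
    custom = ET.Element("research")
    for report in root.findall("report"):
        if report.get("company") in wanted:
            # Copy so the master document is left untouched.
            custom.append(copy.deepcopy(report))
    return custom

root = ET.fromstring(RESEARCH_XML)
custom = assemble_for(root, ["ChipCo", "SoftCorp"])
print(ET.tostring(custom, encoding="unicode"))
```

Each client's list of companies produces a different assembled document from the same stored content.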
Saves Time and Money by Streamlining the Authoring Process.
Research has shown that as much as 50% of the time spent during the
authoring process goes to formatting. By having templates for documents
that are similar (which can be done using XSLT) and using an XML
authoring tool, the author only has to worry about the content. For
example, most press releases look the same, as do most product
brochures. Most proposals should look the same, but often don't.
Using XML as the mechanism for authoring and storing content can
enforce consistency in standards and free users from worrying
about the eventual formatting, which will be handled by the templates
and by validation files (DTDs or XML schemas).
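The kind of consistency check that a DTD or XML schema provides can be approximated with a short Python sketch. This is only a stand-in, since Python's standard library does not validate against DTDs or schemas, and the pressrelease structure and its required elements are assumptions invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical house rule: every press release must contain these elements.
REQUIRED = ["headline", "dateline", "body", "contact"]

def missing_elements(xml_text, required=REQUIRED):
    """Return the required child elements that the document lacks."""
    root = ET.fromstring(xml_text)
    return [name for name in required if root.find(name) is None]

good = ("<pressrelease><headline>H</headline><dateline>D</dateline>"
        "<body>B</body><contact>C</contact></pressrelease>")
draft = "<pressrelease><headline>H</headline><body>B</body></pressrelease>"

print(missing_elements(good))
print(missing_elements(draft))
```

The author supplies only content; a check like this flags an incomplete draft before it reaches the formatting templates.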
Encourages Reuse of Documents and Fragments.
XML
allows for the storage of “document fragments”, which encourages
reuse of existing content. This means that you will be able to
find document fragments and include them in new documents much more
easily.
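As one possible illustration of fragment reuse, the sketch below keeps fragments in a small library and pulls them into a new document by id. The fragment library, its ids, and the document and section element names are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical fragment library; ids and content are illustrative only.
FRAGMENTS_XML = """
<fragments>
  <fragment id="about-company"><para>Acme makes widgets.</para></fragment>
  <fragment id="legal-notice"><para>All rights reserved.</para></fragment>
  <fragment id="support"><para>Contact support by email.</para></fragment>
</fragments>
"""

def build_document(library, title, fragment_ids):
    """Create a new document that reuses stored fragments, in the order given."""
    by_id = {f.get("id"): f for f in library.findall("fragment")}
    doc = ET.Element("document")
    ET.SubElement(doc, "title").text = title
    for fid in fragment_ids:
        section = ET.SubElement(doc, "section", source=fid)
        # Copy the fragment's children so the library itself is not modified.
        section.extend([copy.deepcopy(child) for child in by_id[fid]])
    return doc

library = ET.fromstring(FRAGMENTS_XML)
brochure = build_document(library, "Brochure", ["about-company", "legal-notice"])
print(ET.tostring(brochure, encoding="unicode"))
```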
Distributed Authoring and Security.
XML is ideal for a content management system where dozens of people
need to contribute content. Existing authoring tools, such as Word and
other desktop editors, are not ideal for this type of environment.
Each section (or page, within a web site) may have one or more people
who are allowed to edit it. Storing pages in XML format allows each
to be treated as a separate object with separate permissions, so
authors can simultaneously edit different pages within the overall
document.
Another key benefit applies when end users are only allowed to view
certain parts of documents. In that case, a better way to distribute
documents is to assemble the final document based on the preferences
and permissions of the end user. Again, if
all the sections are in XML, this type of end user security becomes much
easier to enforce. If all the sections are stored in Word or PDF
files, this becomes a much more difficult task.
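A minimal sketch of this kind of per-user assembly follows. The audience attribute and its values (public, partner, staff) are assumptions invented for the example, not part of any standard.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical report whose sections carry an `audience` attribute.
REPORT_XML = """
<report>
  <section title="Summary" audience="public">Overview text</section>
  <section title="Pricing" audience="partner">Partner pricing</section>
  <section title="Internal Notes" audience="staff">Staff only</section>
</report>
"""

def view_for(root, allowed_audiences):
    """Assemble a copy of the report containing only the sections
    that the given set of audiences is allowed to see."""
    view = ET.Element(root.tag)
    for section in root.findall("section"):
        if section.get("audience") in allowed_audiences:
            view.append(copy.deepcopy(section))
    return view

report = ET.fromstring(REPORT_XML)
public_view = view_for(report, {"public"})
partner_view = view_for(report, {"public", "partner"})
print([s.get("title") for s in partner_view])
```

Because each section is a separate element, the filtering happens before the document is published, so restricted sections never reach the end user at all.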
Syndication of Content - Web Services.
XML is
the language of Web Services and of Syndication of Content. This
means that you can distribute your content (research reports, press
releases, product catalogs, brochures) to other web sites or companies
who may need to include your information on their site, but with some
changes. Syndication of Content is often used for aggregation of
content from different sources (for example, an industry site might want
to publish a press release that your company created). If the
information is provided in HTML, this is problematic because each source
site will have different formatting. However, if each source
company provides XML (even if they provide slightly differing XML), the
aggregation site can easily convert it into a consistent format.
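The aggregation case can be sketched as follows: two sources supply slightly different XML, and the aggregating site maps both onto common fields. The two source formats, the field mapping, and the sample headlines are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Two hypothetical sources that describe the same kind of item differently.
SOURCE_A = ("<pressrelease><headline>Acme Ships Widget 2.0</headline>"
            "<date>2002-05-01</date></pressrelease>")
SOURCE_B = ("<news><title>Initech Opens Boston Office</title>"
            "<published>2002-05-03</published></news>")

# For each source's root element: common field -> element name in that source.
FIELD_MAPS = {
    "pressrelease": {"title": "headline", "date": "date"},
    "news": {"title": "title", "date": "published"},
}

def normalize(xml_text):
    """Convert one source document into a dict with common field names."""
    root = ET.fromstring(xml_text)
    mapping = FIELD_MAPS[root.tag]
    return {field: root.findtext(source_name)
            for field, source_name in mapping.items()}

# Newest item first; ISO dates sort correctly as plain strings.
feed = sorted((normalize(s) for s in (SOURCE_A, SOURCE_B)),
              key=lambda item: item["date"], reverse=True)
print(feed)
```

Supporting an additional source means adding one more entry to FIELD_MAPS, which is far simpler than scraping each site's HTML layout.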
Web services are an emerging trend in which one server makes a request
for content from another server. This could be any type of content, or
could be more programmatic structured data. By converting your
documents into XML, you open up Web Services for documents, which
allows for better information sharing with customers, business partners,
and suppliers. For more on Web Services, see the upcoming white
paper, Web Services for Documents.
Portability of content.
Many
web content management systems provide distributed authoring,
reuse of fragments, etc., but do not store their content in an XML
format. This makes it very difficult to move off of that
particular content management system. If, however, the data is in
XML (or can be easily exported into XML), then the end user has the
flexibility to migrate the content easily into another system that
supports XML rather than being tied to a particular vendor.
In addition to all of these specific business benefits, XML is
particularly well suited technically for the storage of unstructured and
semi-structured content. This is because most documents have a
tree-like structure (title, heading 1, section 1, paragraph 1, etc.),
and XML has a tree-like structure as well. There is a lot of content that
has been published in HTML format over the last five years (millions of
pages) - and XML is a perfect format for distributing this information
between sites. That is because HTML and XML are both based on SGML,
which is a more generic language for defining documents.
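The tree-like structure mentioned above can be seen by parsing a small document and printing its outline. The document, section, heading, and para element names here are illustrative assumptions rather than any particular standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical document markup; element names are illustrative only.
DOC_XML = """
<document>
  <title>Why Convert Documents into XML?</title>
  <section>
    <heading>Introduction</heading>
    <para>First paragraph.</para>
    <para>Second paragraph.</para>
  </section>
  <section>
    <heading>Conclusion</heading>
    <para>Closing paragraph.</para>
  </section>
</document>
"""

def outline(element, depth=0):
    """Return one indented line per element, walking the tree depth-first."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(DOC_XML))
print("\n".join(tree_lines))
```

The nesting of the printed outline mirrors the nesting of the document itself, which is why a title, its sections, and their paragraphs map so naturally onto XML elements.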
Conclusion
Corporations
have a tremendous amount of information assets that exist today as
individual files in directories. This includes memos, reports,
proposals, brochures, white papers, documentation, research, intranet
sites, public web pages, etc. Because of its unstructured nature,
it has been difficult to leverage this information and to reduce both
the cost and complexity of managing this information. XML
is a powerful tool that simplifies the creation, storage, indexing,
categorization, and publishing of this content in complex environments.
By converting existing documents and new documents
into XML, organizations can achieve significant savings of both time and
money.
About CambridgeDocs
For more information on the CambridgeDocs platform and “Meaningful
XML” conversion tools, please contact info@cambridgedocs.com, or visit
our site at www.cambridgedocs.com
© 2002-2006 CambridgeDocs. All Rights Reserved.