Content in the Age of XML

Bruce Silver

Can you manage documents with the ease and automation of data? Is there a payoff in a structured approach? Will compliance demands usher in a new era? The answer to all these questions is yes, but complicated authoring tools and the burden of enterprisewide planning stand in the way of change. Here's how, and in which industries, management will adapt.


Not so long ago, information could be neatly separated into data and content. Data was "structured" and stored in relational databases, OLTP systems and hierarchical stores. Content was "unstructured" and consisted of, well, everything else. Data was what made the company go, so databases became a strategic part of the enterprise IT infrastructure. Everything else (at least the piece of it worth managing) was housed in special repositories called document or content management systems, typically deployed at the department level.

Today, as organizations struggle to keep information consistent and synchronized across print and Web media formats, global locations (and, thus, languages) and internal-, partner- and customer-facing business contexts, they're asking for content to be managed more like data. That means it must be easily reused, dynamically queried and custom-built, transformed into any presentation format and "processable" by application software. Whether they know it or not, businesses are asking for content expressed and managed as XML.


All types of content can be built on XML, which in turn is supported by the Internet's universal standards and low-cost infrastructure. It's far more IT-friendly than conventional content formats, which typically require information to be manually extracted into metadata to be machine processable.

Will XML's advantages, including content reuse, automation and compliance, lead it to become a pervasive form of source content? Initiatives under way in finance, pharmaceuticals and other sectors point the way to XML adoption, but it may be years before the technology breaks out of niche applications. In the companion article, "Reuse Content Without Starting From Scratch" (http://www.intelligententerprise.com/showArticle.jhtml?articleID=163105099), we'll explore how McDonald's, Hilton International and manufacturer Emerson Process Management are making the most of their content without XML-based management.

Executive Summary
XML brings structure to content and, with it, the advantages of fast machine processing and efficient reuse. When content is authored and managed as XML, you can slice, dice and query the information as you would data so you can find patterns and develop targeted collections without having to read through thousands of documents. Using XML stylesheets, you can easily transform content into any format, including HTML, PDF, Word files or data streams that are compatible with legacy systems. XML is also emerging as the basis for open standards such as Extensible Business Reporting Language (XBRL), used in banking and finance, and Structured Product Labeling (SPL), soon to be required by the FDA for all packaging information on prescription and over-the-counter drugs.
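To make "slice, dice and query" concrete, here is a minimal sketch in Python using only the standard library. The documents, element names and attribute names are invented for illustration and do not follow the real SPL or XBRL schemas; the point is only that a targeted collection (every warning section, in this case) can be pulled from many documents without reading any of them.

```python
import xml.etree.ElementTree as ET

# Hypothetical product-label documents. The element and attribute names
# are assumptions made for this sketch, not a real industry schema.
DOCS = [
    """<label product="Alpha">
         <section type="warning">Do not use with alcohol.</section>
         <section type="dosage">One tablet daily.</section>
       </label>""",
    """<label product="Beta">
         <section type="dosage">Two tablets daily.</section>
         <section type="warning">May cause drowsiness.</section>
       </label>""",
]


def collect_sections(docs, section_type):
    """Return (product, text) pairs for every section of the given type."""
    results = []
    for raw in docs:
        root = ET.fromstring(raw)
        # ElementTree supports a limited XPath subset, including
        # attribute predicates such as [@type='warning'].
        for section in root.findall(f".//section[@type='{section_type}']"):
            results.append((root.get("product"), section.text.strip()))
    return results


warnings = collect_sections(DOCS, "warning")
for product, text in warnings:
    print(f"{product}: {text}")
```

The same function would serve for dosage sections or any other type, which is the sense in which structured content can be queried like data.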

Despite the many potential benefits of XML-based content, however, its use is currently confined largely to scientific and technical publishing. To truly take advantage of component-based management, organizations need a disciplined, cross-enterprise approach that requires in-depth analysis and hard work. XML isn't a user-friendly authoring environment, so it may be some time before it's used pervasively. In the meantime, many companies are taking a less demanding approach to content reuse: improving enterprisewide access to information, developing dedicated libraries of approved content and breaking documents into components that can be shared.

Blurring the Line

Through waves of the software industry's consolidation, content management has remained stubbornly distinct from data management, rebuffing the DBMS giants' periodic attempts at absorption. That's because data and content are fundamentally different. Data is structured and self-describing (records and fields have defined names, types and relationships), and thus it's machine-processable. Using a universal language (SQL), data can be queried and retrieved at any level of granularity, from an entire table down to an individual field. Visual data presentation in report tables, charts and other views is an application-level function distinct from the data itself. Data is pure information.
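The granularity point can be illustrated with a short sketch using Python's built-in sqlite3 module. The table name, columns and rows are hypothetical; what matters is that the same query language retrieves either the whole table or a single field of a single record.

```python
import sqlite3

# In-memory database with a hypothetical customers table (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Acme Corp", "Chicago"), (2, "Globex", "Boston")],
)

# Coarse granularity: the entire table.
all_rows = conn.execute(
    "SELECT * FROM customers ORDER BY customer_id"
).fetchall()

# Fine granularity: one field of one record.
city = conn.execute(
    "SELECT city FROM customers WHERE customer_id = ?", (2,)
).fetchone()[0]

print(all_rows)
print(city)
conn.close()
```

Nothing in either query concerns how the result will be displayed, which reflects the separation of information from presentation described above.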

In contrast, content doesn't cleanly separate text information from presentation formats defined by applications such as Microsoft Word, Adobe Acrobat and HTML editors. Content objects are information views, optimized for processing by people, not machines. The unit of retrieval is a document or file. While you can search for content based on tiny information fragments, as small as a single word or character, what you retrieve is a list of documents, not the fragments.

Being unstructured, content isn't self-describing, so it must be linked to external structured elements called metadata to allow searching by title, author, creation date or application-specific elements such as customerID. You can query the metadata or text-search the content, but not both at once.

So what's the problem? Compared to data, content is hard to reuse, cumbersome to search and generally resistant to automated processing. To reassemble content fragments in new contexts — from financial statements to press releases, for example — you must find each fragment inside its respective source document, then cut, paste and reformat. Because content is managed at the document level and information is inextricably tangled up with presentation formatting, manipulation is largely manual and inefficient.
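For contrast, here is a rough sketch of what that reassembly might look like if the source were stored as identifiable XML components rather than as a formatted document. Everything in it is an assumption made for illustration: the report structure, the component ids, the pressRelease and para element names, and the build_press_release helper. It is not a description of any particular content management product.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document broken into addressable components.
SOURCE = """<report id="q3-financials">
  <component id="revenue">Revenue rose in the third quarter.</component>
  <component id="outlook">Management expects continued growth.</component>
  <component id="footnotes">See accompanying notes.</component>
</report>"""


def build_press_release(source_xml, component_ids):
    """Assemble a new document from selected components, in the order given."""
    root = ET.fromstring(source_xml)
    release = ET.Element("pressRelease", {"source": root.get("id")})
    for cid in component_ids:
        fragment = root.find(f"./component[@id='{cid}']")
        if fragment is None:
            raise KeyError(f"no component with id {cid!r}")
        # Each paragraph records which component it came from.
        para = ET.SubElement(release, "para", {"ref": cid})
        para.text = fragment.text
    return release


release = build_press_release(SOURCE, ["outlook", "revenue"])
print(ET.tostring(release, encoding="unicode"))
```

Because each fragment is addressed by id and carries no presentation formatting, the find, cut, paste and reformat steps reduce to a lookup, and the new document keeps a reference back to its source.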
