XML

XML (Extensible Markup Language)

Purpose

This article describes, explains, and suggests references for further study of XML Schema from the viewpoint of potential military users who do not already have a basic understanding of the uses of the Extensible Markup Language (XML) for data interoperability.

The main body of the article will provide a fundamental insight into the basic capabilities of XML Schema sufficient to form a general overview. Several of the appendices are presented in tutorial form for the reader who wants a “hands-on” exposure to XML. Other appendices go into greater detail, allowing the reader to be selective in pursuit of his/her objectives for further knowledge.

Upon reading the main body of the article, the reader will be able to appreciate the needs that XML Schema addressed when first introduced, the methodologies employed by the technology to resolve such needs, the track record to date, and some limitations of the technology.

Background on XML

XML is aptly named. The Extensible Markup Language is a markup language, and it is extensible. The original design goal of XML was and continues to be data interoperability, the ability to share data “seamlessly”.

Before XML, an important means of exchanging data was the ASCII text file. The text would be in a specific format such as “delimited” ASCII or “fixed-field” ASCII, and the files could be passed among UNIX, DOS, and mainframe environments and be read and processed interchangeably. The drawback in using ASCII is that the format and other details about the data must be specified outside the data itself.

The problem with an ASCII data exchange is that there are two packages, the data itself and the details about the data (the metadata). Moreover, one part is physically separate from the other. It is not self-contained.
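The two-package problem can be sketched in a few lines of Python. This is a minimal illustration, not a real interface: the payload, the pipe delimiter, and the field names are all assumptions invented for the example.

```python
import csv
import io

# Hypothetical delimited ASCII payload: the file carries values only.
payload = "Smith|John|19650412\nJones|Mary|19710130\n"

# The metadata travels separately (for example, in a written interface
# specification). These field names are assumed for illustration.
field_names = ["surname", "given_name", "birth_date"]


def read_delimited(text, names, delimiter="|"):
    """Pair each value with a field name supplied from outside the data."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [dict(zip(names, row)) for row in reader]


records = read_delimited(payload, field_names)
print(records[0]["surname"])  # Smith

# A receiver holding an outdated copy of the metadata misreads the same
# data, and nothing in the file itself reveals the mistake.
stale_names = ["given_name", "surname", "birth_date"]
misread = read_delimited(payload, stale_names)
print(misread[0]["surname"])  # John
```

The second call shows why the separation matters: the file is unchanged, yet its meaning depends entirely on which copy of the metadata the receiver happens to hold.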

There are major problems with that. Workarounds have to be devised to ensure the internal consistency of the data. Audit activities include record counts, checksums, and other techniques designed to make up for the lack of built-in integrity, all of which add processing time.

Metaphorically, what XML does is to take all the ASCII characters from a text file, roll them up in a ball, and tie them together with a string made out of the metadata. When the ball goes from one place to another, the string goes with it. It is a package: a self-describing clump of data that does in fact enhance interoperability. There is one slight hitch, however. There must be an agreement between the users of the ball, made in advance, that the string or metadata has a meaning that is understood and acceptable by each party. In other words, parties to the exchange need to agree to (or at least know about) the metadata when it's initially used and each time it's changed.
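As a rough sketch of the "ball and string" idea, the Python fragment below reads a small XML document with the standard library's ElementTree parser. The PEOPLE, PERSON, and GIVEN_NAME tag names are assumptions made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: each value is wrapped in a tag that names it.
# Note that the second PERSON lists its fields in a different order.
document = """
<PEOPLE>
  <PERSON>
    <SURNAME>Smith</SURNAME>
    <GIVEN_NAME>John</GIVEN_NAME>
  </PERSON>
  <PERSON>
    <GIVEN_NAME>Mary</GIVEN_NAME>
    <SURNAME>Jones</SURNAME>
  </PERSON>
</PEOPLE>
""".strip()

root = ET.fromstring(document)

# Values are looked up by tag name, not by position, so the field order
# within a record does not matter.
surnames = [person.findtext("SURNAME") for person in root.findall("PERSON")]
print(surnames)  # ['Smith', 'Jones']
```

The field names travel with the values, but the receiver still has to know in advance that SURNAME is the tag to ask for. That is the agreement the metaphor refers to.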

In the absence of a superior technology, the limitation just described may certainly be considered surmountable. It simply requires that parties to the exchange maintain lines of communication reliable and robust enough to ensure these objectives are accomplished.

Notwithstanding the inflexibility of the string, XML has taken off as a technology that accomplishes significant interoperability through the use of self-described data.

But how, specifically, does XML accomplish the self-description of data?

XML is a markup language that evolved from the same parentage as HTML, the Hypertext Markup Language. Both languages owe their heritage to SGML, the Standard Generalized Markup Language, an ISO standard that grew out of IBM’s Generalized Markup Language (GML). Neither XML nor HTML is as complex as SGML. Each was designed for a different mission, and the two can co-exist in the same document.

HTML allows text files to be “marked up” with special codes that control how the page looks when viewed through a Web browser such as Netscape or Internet Explorer. HTML tags say nothing about what the content means as data; their use is oriented to appearance.

Two major attributes distinguish XML from HTML: (1) whereas HTML is an appearance-driven language, XML is a data-driven language; and (2) whereas HTML is “just another” markup language that marks up ASCII text files with tags from a pre-defined, fixed set, XML is a meta-markup language, one that can be used to create other markup languages.

Both languages use “tags” to accomplish their markup objectives. A tag looks like this:

<tag> content </tag>

Its purpose is to describe content to a browser which presents it to a viewer. An HTML tag might look like:

<b> New York Times </b>

whereas XML tags might look like:

<NAME>

<SURNAME>Smith</SURNAME>

</NAME>

Please note the similarity in how each language uses the <> and </> bracket expressions to denote the beginning and end, respectively, of tagged content. Aside from noting the use of angle brackets and the slash, we won’t go into the details of how each markup language works. Just remember that HTML tags are significantly different from XML tags. HTML tags are selected by a document’s author from a pre-existing, pre-defined set of permissible tags developed for HTML use. XML tags, by contrast, are devised by the document’s author, who can use any names he or she wishes as long as the general pattern of angle brackets, slashes, and content location is followed.

It is the capability to limitlessly invent tags that confers extensibility to XML. Having no limits on such tags tends to promote specialization as we will see.
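The freedom to invent tags, and the one pattern that must still be followed, can be illustrated with a short Python sketch. The MISSION and CALLSIGN tag names are invented for this example, and the helper function is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Tag names invented on the spot are accepted, provided every opening
# tag is closed and the tags nest properly.
invented = "<MISSION><CALLSIGN>Viper 1</CALLSIGN></MISSION>"

# Same invented names, but the closing tags are out of order.
broken = "<MISSION><CALLSIGN>Viper 1</MISSION></CALLSIGN>"

print(is_well_formed(invented))  # True
print(is_well_formed(broken))    # False
```

The parser has no list of permitted tag names to consult. It checks only the general pattern, which is what leaves authors free to build specialized vocabularies.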

Background on XML Schema

XML Schema is one of several specialized applications of XML. Since XML is a meta-markup language, XML is itself used to create XML “applications”, i.e., specific specialty languages whose use is oriented to a particular purpose. As you would expect from a meta-language as flexible as XML, there are many applications, some of which have already faded from current use.

In alphabetic order, here are some notable XML applications:

• Channel Definition Format

• Chemical Markup Language

• Classic Literature

• Extensible Forms Description Language

• HR-XML

• MusicML

• Open Financial Exchange

• RDF (Resource Description Framework)

• XML for XML

We will discuss the last two applications further. RDF figures prominently in the use of ontologies that are discussed in another E-MAPS article.

“XML for XML” is where we find XML Schemas. The author of several XML books, Elliotte Rusty Harold, characterizes XML for XML as those applications of XML “…used for …further refinements of XML itself.”

Applications of this nature include:

• XSL, the Extensible Stylesheet Language

• XLinks, a more generalized kind of hyperlink for the XML environment.

• Schemas, the particular focus of this article.

Explanation of XML Schemas

We will explain XML Schemas by using definitions, examples, descriptions of limitations, and references.

1. Definitions

As used in contemporary literature, what is a schema? When originally used in the field of computer science, a schema had something to do with databases. It still does: a database schema describes the structural contents of a database in fairly specific terms. More generally speaking, however, “schema” can be used to characterize those documents that describe other documents in a particular way.

The word schema has evolved from the previously stated definition to a more generic meaning of any document that describes the permissible contents of other documents, especially if data typing is involved. Thus, you’ll hear about different kinds of schemas from different technologies, including vocabulary schemas, RDF schemas, organizational schemas, X.500 schemas, and of course, XML schemas (attribution to Elliotte Rusty Harold).

XML schemas are therefore…? XML schemas are documents that describe XML data structures/XML content in significant detail.

The documents are written in a particular “schema language”. There are many such languages including:

• DDML, the Document Definition Markup Language

• TREX, Tree Regular Expressions for XML

• Schematron

• RELAX

• DTD, Document Type Definition

• W3C XML Schema language, also known as XSD or XML Schema Definition

Of these, XSD is the most significant. As the schema language recommended by the W3C (World Wide Web Consortium), it has garnered the most support on a global basis.

From this point on, when we refer to “documents”, “schema”, and “schema language”, we refer to them in the context of XSD as specified by W3C.

According to a very effective W3C primer:

The purpose of a schema is to define a class of XML documents, and so the term "instance document" is often used to describe an XML document that conforms to a particular schema.

2. Examples

Click here to see a W3C sample XML document (or “instance document”) related to purchase orders, along with the schema that describes it.

What we see is (1) a slightly more complex representation of XML markup using tags similar to our previous examples and (2) an XSD file or schema. The primer cites the purpose of a schema as being “to define a class of XML documents”.

To summarize, an XML schema written in XSD is typically captured as a text file with a .xsd extension, and is frequently used to define and check other documents (or “document instances”) of XML content marked up in XML tags (as .xml text files).
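For readers who want something concrete, the Python sketch below pairs a very small schema with an instance document. Both are hypothetical and far simpler than the W3C purchase order example; the AGE element is an assumption added to show data typing. The code only reads the schema as an ordinary XML document and compares names. It does not perform XSD validation, which would require a schema-aware validator that Python's standard library does not include.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"  # namespace of the XSD vocabulary

# Hypothetical schema: a NAME element holding a SURNAME and an AGE.
schema_text = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="NAME">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="SURNAME" type="xs:string"/>
        <xs:element name="AGE" type="xs:positiveInteger"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
""".strip()

# Hypothetical instance document meant to conform to the schema above.
instance_text = "<NAME><SURNAME>Smith</SURNAME><AGE>42</AGE></NAME>"

# A schema is itself XML, so an ordinary XML parser can read it.
schema = ET.fromstring(schema_text)
declared = {
    element.get("name"): element.get("type")
    for element in schema.iter(f"{{{XS}}}element")
    if element.get("type") is not None
}
print(declared)  # {'SURNAME': 'xs:string', 'AGE': 'xs:positiveInteger'}

# Crude name comparison only: lists child tags the schema does not declare.
instance = ET.fromstring(instance_text)
undeclared = [child.tag for child in instance if child.tag not in declared]
print(undeclared)  # []
```

The point of the sketch is the relationship between the two files: the .xsd text names the permitted elements and their data types, and the .xml text is one member of the class of documents it defines.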

3. Limitations

The primary limitation of XML is its lack of built-in intelligence. The meaning of its tags needs to be defined and agreed upon as a condition precedent to the exchange of data.

Its ability to “figure out things” for itself is non-existent. There are no artificial intelligence influences applied to XML technology as it presently exists.
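A small Python sketch shows what this limitation looks like in practice. The agreed tag name, the renamed LAST_NAME tag, and the helper function are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Tag name the two parties are assumed to have agreed on in advance.
AGREED_TAG = "SURNAME"


def extract_surname(text):
    """Return the text of the agreed tag, or None if it is absent."""
    return ET.fromstring(text).findtext(AGREED_TAG)


# The sender follows the agreement.
agreed = "<NAME><SURNAME>Smith</SURNAME></NAME>"

# The sender renames the tag without telling the receiver.
changed = "<NAME><LAST_NAME>Smith</LAST_NAME></NAME>"

print(extract_surname(agreed))   # Smith
print(extract_surname(changed))  # None
```

Both documents are equally valid as XML, and a human reader sees at once that LAST_NAME and SURNAME mean the same thing. The receiving program does not, because nothing in XML itself lets it work that out.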

To visualize a contrasting and somewhat more intelligent process, consider two modems connecting with each other. They go through a handshaking process as they mutually explore their communications protocols and throughput capabilities until they reach agreement on a wide range of parameters associated with their exchange of data. They work out a mutually acceptable basis on which to exchange data without having known in advance what protocols or throughput rates to use.

The modems literally agree online, contemporaneously, to exchange data, in contrast to the XML exchange which is agreed to offline, a priori instead of contemporaneously.

As discussed in another E-MAPS article, somewhat more contemporary approaches offer a higher degree of automation and reliability by overcoming these limitations.

Conclusion

This was an overview of XML from 10,000 feet. Despite its limitations, XML is in very active use today, achieving high marks in terms of data interoperability and the loyalty of its numerous proponents.

We have intentionally avoided the complexity one can easily encounter at any time during a serious inquiry into XML. There are several factors contributing to this complexity: (1) the plethora of acronyms, (2) the fact that the technology tends to be used by programmers, (3) the convoluted path that the technology has taken during its evolution, and (4) the rapidity and robustness of the evolution itself.

Our approach in presenting this overview has attempted to be orderly and incremental. We think it’s important for readers to understand the role or place of a piece of technology in terms of the wider context of associated technologies.

However, there is one particular resource that, despite certain minor limitations, is compelling in the way it imparts knowledge about XML and should be reviewed before concluding this article. Click here to see it.

What we have presented is something akin to “the top ten hits of XML”, or the ten most important things to know about XML. It’s slightly unsystematic but highly enlightening reading.
