XML 101


By Jeff Jones

Whether you're discussing e-commerce, knowledge management, or the
Internet in general, you've likely seen or heard references to the
eXtensible Markup Language (XML). XML is, without a doubt, one of the
most heralded technologies to come across the wire in recent years (pun
intended).
What is XML? Why is it creating such a deluge of interest? What should
you know about XML, and perhaps more importantly, why should you even
care about it? In this article, I will provide a high-level description
of what XML is (and what it's not), discuss the key components of an XML
document, and provide a compelling argument for why it's well worth your
while to learn more about XML.

To understand XML, it's helpful to compare and contrast it with another
technology with which a great many of us are familiar - HyperText Markup
Language (HTML).

If you've used or read about HTML, you know that it was created so that
users could format and display information on the Web. HTML uses a
fixed and finite set of tags, elements, and attributes that allow it to
communicate to a user's browser how that browser should display the
document. We see HTML everywhere, and it has for some time served as the
lingua franca for displaying information on the Web. It is a proven
technology that serves its purpose well in most scenarios. What if,
however, the current version of HTML doesn't allow me to do something
that I want to do? I have two choices: I can either write my own
browser that understands my tags (bad idea) or I can put my project on
hold for a year or so and hope that the next version of HTML includes
the functionality that I need (even worse idea). Try selling either of
these options to your boss or client and see if you still have a job by
the time you end your discussion. So concludes our one-paragraph,
in-depth investigation of HTML.

Now, if I may lapse into my days of standardized test taking, HTML is to
displaying information as XML is to defining information. They both are
text-based, and they both consist of tags, elements, and attributes.
Unlike HTML, however, XML allows users to structure and define the
information in their documents. While technically it is a markup
language (it allows you to use tags to "mark up" the contents of your
document), it is more appropriately described as a meta-language. By
meta-language, I mean that it allows users to create their own
collection of tags, elements, and attributes as needed and in so doing
to accurately describe the actual contents of a document. Unlike HTML
with its
finite collection of tags, XML allows users to create their own to meet
their own requirements (thus the eXtensibility).

I've made several references to tags, elements, and attributes. These
are the core building blocks of an XML document. Consider the following
HTML fragment. It should be painfully familiar to anyone who's ever
looked at an HTML document and will prove useful in understanding XML
syntax.

<table border="0" cellpadding="0" cellspacing="0">
  <tr>
    <td width="50%">Here is the first group of text</td>
    <td width="50%">Here is the second group of text</td>
  </tr>
</table>

This document contains a table element ("<table>") with a table row
element ("<tr>"). The table row element, in turn, contains two table
cell elements ("<td>"). Each of these elements has both an opening tag
(such as "<table>") and a closing tag (such as "</table>"). While this
is fairly
straightforward, it also is somewhat inflexible. What if, for example,
I need to create a document that describes my company's employee roster
for the Annual InfoStrat softball tournament? With XML, it's as easy as
replacing the element and attribute names from the previous HTML
document with my own custom tags that describe my company and its
employees. Here is what such a document might look like:

<?xml version="1.0"?>
<company name="Information Strategies">
  <employees>
    <employee id="1">Hank Aaron</employee>
    <employee id="2">Babe Ruth</employee>
  </employees>
</company>

With this XML document, I have defined my company and two of its
employees and have described the relationship between company (parent)
and employees (children). I have shown that my company has two
employees, but I easily could add new employee elements to reflect new
hires that we bring on to ensure that we don't lose this year's
tournament:

<employee id="3">Mickey Mantle</employee>
<employee id="4">Ty Cobb</employee>
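If you would rather add those new hires from a program than by hand, the idea can be sketched in a few lines. This is only a minimal sketch, assuming Python and its standard xml.etree.ElementTree module; the add_employee helper is a hypothetical name of my own, not part of any library.

```python
import xml.etree.ElementTree as ET

# The roster document from the article.
ROSTER = """<?xml version="1.0"?>
<company name="Information Strategies">
  <employees>
    <employee id="1">Hank Aaron</employee>
    <employee id="2">Babe Ruth</employee>
  </employees>
</company>"""


def add_employee(root, emp_id, name):
    """Append a new <employee> under <employees> (hypothetical helper)."""
    employees = root.find("employees")
    # Attribute values are always text in XML, so the id is stored as a string.
    new_employee = ET.SubElement(employees, "employee", {"id": str(emp_id)})
    new_employee.text = name
    return new_employee


root = ET.fromstring(ROSTER)
add_employee(root, 3, "Mickey Mantle")
add_employee(root, 4, "Ty Cobb")

# Collect the names in document order to confirm the new hires were added.
names = [employee.text for employee in root.iter("employee")]
print(names)
```

The new elements land inside the existing "<employees>" element, so the parent/child relationship described above is preserved.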

After creating my XML document, I can display its contents in my format
of choice. With a suitable transformation (a stylesheet or a short
program), the same XML document could be rendered as HTML, a Microsoft
Word document, an Adobe PDF file, or even as text in the body of an
e-mail message. As long as the XML document is well formed (meaning that
it follows the appropriate XML format and syntax), you can choose your
method of preference (or necessity) for displaying its content.
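To make the "one document, many displays" idea concrete, here is a rough sketch that renders the roster two ways: as an HTML list and as plain text suitable for an e-mail body. It assumes Python's standard xml.etree.ElementTree and html modules; the to_html and to_text functions are names I made up for illustration, and a real project might well use a stylesheet instead.

```python
import xml.etree.ElementTree as ET
from html import escape

# The roster document from the article.
ROSTER = """<?xml version="1.0"?>
<company name="Information Strategies">
  <employees>
    <employee id="1">Hank Aaron</employee>
    <employee id="2">Babe Ruth</employee>
  </employees>
</company>"""


def to_html(root):
    """Render the roster as an HTML heading plus a bulleted list."""
    items = "".join(
        f"<li>{escape(employee.text)}</li>" for employee in root.iter("employee")
    )
    return f"<h1>{escape(root.get('name'))}</h1><ul>{items}</ul>"


def to_text(root):
    """Render the same roster as plain text, one employee per line."""
    lines = [root.get("name")]
    lines += [f"{e.get('id')}. {e.text}" for e in root.iter("employee")]
    return "\n".join(lines)


root = ET.fromstring(ROSTER)
html_view = to_html(root)
text_view = to_text(root)
print(html_view)
print(text_view)
```

Both views are produced from the same parsed document; only the presentation code differs, which is the separation between content and display that XML is meant to give you.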

Let's dissect the pieces of my company roster XML document to see each
piece's role and responsibility.

Header:

The header (formally called the XML declaration) tells whoever reads the
document that this is an XML document - using version 1.0 of the XML
specification in this case.

<?xml version="1.0"?>
<company name="Information Strategies">
  <employees>
    <employee id="1">Hank Aaron</employee>
    <employee id="2">Babe Ruth</employee>
  </employees>
</company>

Tags (brackets, greater than, less than):

Just like in HTML, tags are the pieces of markup enclosed between a less
than ("<") and a greater than (">") sign. You use them to indicate the
opening and closing of an element.

<?xml version="1.0"?>
<company name="Information Strategies">
  <employees>
    <employee id="1">Hank Aaron</employee>
    <employee id="2">Babe Ruth</employee>
  </employees>
</company>

Elements:

Elements are the basic building blocks of XML. They may contain text,
comments, or other elements, and consist of a start tag and an end tag.
Typically, XML elements are akin to nouns in the real world. They
represent people, places, or things.

<?xml version="1.0"?>
<company name="Information Strategies">
  <employees>
    <employee id="1">Hank Aaron</employee>
    <employee id="2">Babe Ruth</employee>
  </employees>
</company>

Note that in XML, every opening tag (e.g. "<company>") must have a
matching closing tag (e.g. "</company>"). The closing tag consists of
the name of the element, prefixed with a slash ("/"). XML is
case-sensitive. While "<company></company>" is well-formed,
"<COMPANY></company>" and "<Company></cOMPANY>" are not. Also, if the
element does not contain text or other elements, you may replace the
pair of tags with a single tag by adding a slash ("/") before the
closing bracket (e.g. "<company></company>" can be abbreviated as
"<company />"). In addition to the rules defining opening and closing
tags, it is important to note that in order to create a well-formed XML
document, you must properly nest all elements: an element opened inside
another element must also be closed inside it, and the document as a
whole must have exactly one outermost element. The previous document
properly nests the "<employee>" elements within the "<employees>"
element, but the following would not be acceptable as an XML document
because the second "<employee>" element exists outside of the
"<employees>" element, leaving two top-level elements instead of one:

<employees>
  <employee id="1">Hank Aaron</employee>
</employees>
<employee id="2">Babe Ruth</employee>
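One way to see these rules in action is to hand a few fragments to an XML parser and note which ones it rejects. The following is a small sketch, assuming Python's standard xml.etree.ElementTree module; is_well_formed is a helper name of my own, and the sample fragments are adapted from the examples above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = {
    "matching case": "<company></company>",
    "empty-element shorthand": "<company />",
    "mismatched case": "<COMPANY></company>",
    # Tags that overlap instead of nesting.
    "overlapping tags": "<employees><employee>Hank Aaron</employees></employee>",
    # Two elements side by side with no single enclosing element.
    "two top-level elements": "<employees></employees><employee>Babe Ruth</employee>",
}

results = {label: is_well_formed(xml) for label, xml in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well formed' if ok else 'not well formed'}")
```

Only the first two samples are accepted. The parser treats the case mismatch, the overlapping tags, and the missing single outermost element as errors rather than trying to guess what was meant, which is exactly the strictness that separates XML from everyday HTML.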

Attributes:

Where elements represent the nouns contained in an XML document,
attributes represent the adjectives that describe the elements. The
following document tells me that Hank Aaron's id is "1" and that Babe
Ruth's is "2". This helps to describe these two employees.

<?xml version="1.0"?>
<company name="Information Strategies">
  <employees>
    <employee id="1">Hank Aaron</employee>
    <employee id="2">Babe Ruth</employee>
  </employees>
</company>

Note that in order to be well formed, all attribute values must be
contained within quotation marks (either single or double). id="1" is
correct, while id=1 is not acceptable. This is a marked difference from
standard HTML formatting, which places much looser restrictions on what
is acceptable.
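The quoting rule is easy to check for yourself. Here is a minimal sketch, again assuming Python's standard xml.etree.ElementTree module; the variable names are mine and purely illustrative.

```python
import xml.etree.ElementTree as ET

quoted = '<employee id="1">Hank Aaron</employee>'
single_quoted = "<employee id='1'>Hank Aaron</employee>"
unquoted = "<employee id=1>Hank Aaron</employee>"

# A quoted attribute parses, and its value comes back as text.
employee = ET.fromstring(quoted)
employee_id = employee.get("id")
attrs = employee.attrib  # all attributes of the element, as a dict

# Single quotes are just as acceptable as double quotes.
single_id = ET.fromstring(single_quoted).get("id")

# An unquoted value is rejected outright by the parser.
try:
    ET.fromstring(unquoted)
    unquoted_ok = True
except ET.ParseError:
    unquoted_ok = False

print(employee_id, attrs, single_id, unquoted_ok)
```

Note that the id comes back as the string "1" rather than the number 1. XML itself treats every attribute value as text; deciding that an id is numeric is up to the application reading the document.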

Text/Content:

An element's content is the data that the element describes. This
information represents the entity itself in an XML document. In the
following document, the content of the first employee element tells me
that this employee is Hank Aaron; the content of the second tells me
that this employee is Babe Ruth.

<?xml version="1.0"?>
<company name="Information Strategies">
  <employees>
    <employee id="1">Hank Aaron</employee>
    <employee id="2">Babe Ruth</employee>
  </employees>
</company>
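To tie elements, attributes, and content together, the sketch below pairs each employee's id attribute with the element's text content. It assumes Python's standard xml.etree.ElementTree module, and the variable names (roster, container_text) are my own.

```python
import xml.etree.ElementTree as ET

# The roster document from the article.
ROSTER = """<?xml version="1.0"?>
<company name="Information Strategies">
  <employees>
    <employee id="1">Hank Aaron</employee>
    <employee id="2">Babe Ruth</employee>
  </employees>
</company>"""

root = ET.fromstring(ROSTER)

# Element name = the noun, attribute = the adjective, text = the content.
for employee in root.iter("employee"):
    print(employee.tag, employee.attrib, employee.text)

# Map each id attribute to the text content of its element.
roster = {e.get("id"): e.text for e in root.iter("employee")}

# <employees> is only a container: apart from its child elements, the only
# text it holds directly is the whitespace used for indentation.
container_text = root.find("employees").text.strip()
print(roster, repr(container_text))
```

The employee elements carry the names as their content, while the "<employees>" element carries no text of its own and exists purely to group its children.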

As you can see, XML and HTML are very similar in syntax. The main
difference is that XML is far less lenient when it comes to
case-sensitivity, using closing tags, quoting attribute values, and
properly nesting parent/child elements. This is excellent news for Web
developers everywhere as it ensures that if you write well-formed HTML,
you'll find the transition to XML virtually seamless.

To summarize, XML is a text-based meta-language that uses tags,
elements, and attributes to add structure and definition to documents.
It is similar to HTML in syntax and implementation, but different with
regard to functionality. Where HTML allows users to control how
documents are displayed, XML allows them to describe the actual contents
of the documents. It is a markup language because it uses tags to mark
up documents, and it is a meta-language because it lets users define tag
vocabularies of their own for describing those documents. XML is
extensible because it enables users to create their own collection of
tags (unlike HTML).

Now, why should you care about XML? If for no other reason, consider
that the World Wide Web Consortium (W3C), the body that develops
standards for the Web, has taken up a proposal to rewrite the HTML 4
language in XML 1.0. This proposal, known as XHTML, requires
well-formedness in every document written in it. As of the time this
article was written, XHTML had received endorsement by the director of
the W3C as a recommendation. The W3C is a neutral standards body
responsible for defining the future of the Web. It does not support
every new idea that comes along, and we should view its full support of
XML (or any technology) as a harbinger of where tomorrow's Internet will
take us. Ignore XML if you will, but know that it is most definitely a
legitimate technology that will revolutionize the way that we program
applications for the Web.

For more information on XHTML, XML, and the W3C, check out the W3C
website at http://www.w3.org.