Thursday, September 19, 2024

Validating XML: A Pretty Complete Primer

As the Internet moves forward, Extensible Markup Language, XML, is poised to become the method for interchanging information among all sorts of devices. For instance, a hand-held Global Positioning System device might be Internet-enabled to receive weather reports encoded in XML. This hypothetical device doesn’t have a lot of extra memory to do all the error-checking and “forgiving” that a browser can do with your HTML. This means that servers must ensure that the data is “good to go” before sending it to the device. XML Schema is a new method that the World Wide Web Consortium has come up with to help make sure your data is valid.

Before we can describe XML Schema, we have to discuss what we mean when we say “valid” and how documents are currently validated. Let’s look at a sample weather report, written in the just-invented EWEML (Eisenberg’s Weather Example Markup Language).

-==-

The first step in quality control is making sure that the document follows the basic rules of XML. Among these rules:

  1. opening tags must have closing tags
  2. tags must be nested properly
  3. attribute values in a tag must be enclosed in quote marks

A document that follows these “punctuation rules” is called well-formed. You don’t need any information other than the document itself to tell if all the tags are closed or nested correctly. There are, however, some questions you can’t answer just by looking at the document.
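As a rough illustration of a well-formedness check, here is a minimal Python sketch that uses the standard library’s xml.etree.ElementTree parser. The helper name is_well_formed and the sample fragments are assumptions made up for this example; they only borrow tag names from the weather report. The parser needs nothing but the text itself, which is the point of the paragraph above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical fragments, one per "punctuation rule" listed above.
fragments = {
    "good": '<wind><speed>5</speed><direction>NE</direction></wind>',
    "unclosed tag": '<wind><speed>5</wind>',
    "bad nesting": '<wind><speed>5</wind></speed>',
    "unquoted value": '<station abbrev=KSJC></station>',
}

for label, fragment in fragments.items():
    print(label, "->", is_well_formed(fragment))
```

Only the first fragment passes. Note that this check says nothing about whether the tags themselves are the right ones.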

Validity and the DTD

There is no way, just by looking at the document, that you can answer questions like:

  • Does the <speed> element belong inside the <wind> element or not?
  • Is the tag really <min>, or should it be <minimum>?
  • Is the fullname attribute required for a <station> element?

The only way to answer these questions and find out if the tags have been used in a valid manner is to have some other information that describes what combinations of tags, attributes, and values are correct. You need some sort of external specification, usually designed in great detail before one creates documents according to it. We could make a specification in English:

A weather <report> consists of a <datestamp>, <station>, <temperature>, and <wind> element (in that order).

The <station> element must have fullname and abbrev attributes, and contain both <latitude> and <longitude> elements.

The <temperature> element must contain (in this order) a <min>, <max>, <forecast-low> and <forecast-high>.

Finally, the <wind> element must contain a <speed> and may contain a <direction> element. (If the speed is zero, a direction is not necessary.)
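To see why a machine-readable form is worth having, here is a hedged sketch of what checking the English rules by hand might look like in Python. Everything in it is an assumption for illustration: the sample document (including the station name, abbreviation, and all values), the function names, and the choice that <speed> comes before <direction>. The English rules do not fix an order for <latitude> and <longitude>, so the sketch only checks that both are present.

```python
import xml.etree.ElementTree as ET

# Hypothetical weather report; all values are invented for this sketch.
SAMPLE = """<report>
  <datestamp>2000-06-20T12:00:00</datestamp>
  <station fullname="San Jose" abbrev="KSJC">
    <latitude>37.3618</latitude>
    <longitude>-121.9290</longitude>
  </station>
  <temperature>
    <min>12</min><max>25</max>
    <forecast-low>11</forecast-low><forecast-high>27</forecast-high>
  </temperature>
  <wind>
    <speed>0</speed>
  </wind>
</report>"""


def child_tags(elem):
    """Return the tag names of an element's direct children, in order."""
    return [child.tag for child in elem]


def check_report(root):
    """Return a list of problems found when applying the English rules."""
    if root.tag != "report":
        return ["root element must be <report>"]
    problems = []
    if child_tags(root) != ["datestamp", "station", "temperature", "wind"]:
        problems.append("<report> children are missing or out of order")
    station = root.find("station")
    if station is not None:
        for attr in ("fullname", "abbrev"):
            if attr not in station.attrib:
                problems.append(f"<station> is missing the {attr} attribute")
        for tag in ("latitude", "longitude"):
            if station.find(tag) is None:
                problems.append(f"<station> is missing <{tag}>")
    temperature = root.find("temperature")
    expected = ["min", "max", "forecast-low", "forecast-high"]
    if temperature is not None and child_tags(temperature) != expected:
        problems.append("<temperature> children are missing or out of order")
    wind = root.find("wind")
    # Assumption: <speed> comes first, and <direction> is optional.
    allowed = (["speed"], ["speed", "direction"])
    if wind is not None and child_tags(wind) not in allowed:
        problems.append("<wind> must contain <speed> and, optionally, <direction>")
    return problems


print(check_report(ET.fromstring(SAMPLE)))
```

Hand-written checks like this grow quickly and have to be rewritten for every new markup language, which is the motivation for a declarative specification.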

Sidebar: A design issue for this example markup language

A problem that often plagues beginning designers is whether some aspect of their design should be implemented as an element (tag) or as an attribute. A general rule to use is this:

If the item in question is a component or part of a whole, then make it an element. If the item characterizes an element and, potentially, all its sub-elements, then it should be an attribute.

In some cases, the decision is obvious. In HTML, <li> is an element, since a list item is clearly a part of the bigger <ol> or <ul>. Similarly, type= is an attribute, as the numbering or bullets characterize the list and all of its items.

For the weather report example given in this article, the question arose as to how to handle the <station> element. I thought of the fullname and abbrev as identifiers, so they naturally became attributes. I then rationalized that latitude and longitude should be elements, since:

  • The location is what really makes the station a unique station.
  • I needed an example of an element that had both attributes and sub-elements.

Of course, computers can’t scan this English specification, so we have to make a more rigorous, machine-readable version. The most common form of such a specification is called a DTD (Document Type Definition).

That’s the purpose of the

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

which you so often see in the source of HTML pages. It tells which version of the HTML Document Type Definition should be used when a validating program is checking the document.

A Document Type Definition lists all the elements and attributes in a document and the contexts in which they are valid. Here’s part of the DTD for the weather report markup language:

   <!ELEMENT station (latitude, longitude)>
   <!ATTLIST station
                fullname CDATA #REQUIRED
                abbrev CDATA #REQUIRED>
   <!ELEMENT temperature (min, max, forecast-low, forecast-high)>
   <!ELEMENT min (#PCDATA)>
   <!ELEMENT max (#PCDATA)>
   <!ELEMENT forecast-low (#PCDATA)>
   <!ELEMENT forecast-high (#PCDATA)>
   <!ELEMENT wind (speed, direction?)>
   <!ELEMENT speed (#PCDATA)>
   <!ELEMENT direction (#PCDATA)>

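Both #PCDATA and CDATA mean that the items in question consist of character data: #PCDATA (parsed character data) describes the text content of an element, while CDATA describes the string value of an attribute.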

Though not as easy to read as the English, the DTD is very compact, and does the job it’s intended to do; it lets computer programs verify that a document uses only an approved set of tags and that the tags are used in the proper context.

If the XML rules are considered as punctuation, then the DTD serves the function of a spelling list and grammar reference.

Of course, just because an English sentence has proper spelling, grammar, and punctuation doesn’t make it meaningful:

The brick astonished the sunlight.

Similarly, there are some questions about the weather report that the DTD can’t answer:

  • Does the wind <direction> have a valid value?
  • Is the <station>’s abbrev= attribute in the proper format?
  • Are the values for <latitude> and <longitude> in a valid range?
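To make these questions concrete, here is a small Python sketch of the value-level checks a DTD cannot express. The specific rules are assumptions chosen for illustration, not part of the article’s markup language: eight compass points for <direction>, and four uppercase letters for abbrev. The latitude and longitude ranges are the usual geographic limits. The function and constant names are hypothetical as well.

```python
import re
import xml.etree.ElementTree as ET

# Assumed rules, invented for this sketch.
COMPASS_POINTS = {"N", "NE", "E", "SE", "S", "SW", "W", "NW"}
ABBREV_PATTERN = re.compile(r"[A-Z]{4}")


def in_range(text, low, high):
    """Return True if text is a number between low and high, inclusive."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return False
    return low <= value <= high


def check_values(root):
    """Return a list of value-level problems in a <report> element."""
    problems = []
    direction = root.findtext("wind/direction")
    # <direction> is optional, so only check it when it is present.
    if direction is not None and direction.strip() not in COMPASS_POINTS:
        problems.append("wind direction is not a known compass point")
    station = root.find("station")
    if station is not None:
        if not ABBREV_PATTERN.fullmatch(station.get("abbrev", "")):
            problems.append("station abbrev is not four uppercase letters")
        if not in_range(station.findtext("latitude"), -90, 90):
            problems.append("latitude is outside -90..90")
        if not in_range(station.findtext("longitude"), -180, 180):
            problems.append("longitude is outside -180..180")
    return problems


# Hypothetical report with three deliberate value errors.
bad = ET.fromstring(
    '<report><station fullname="San Jose" abbrev="sjc">'
    "<latitude>137.3</latitude><longitude>-121.9</longitude></station>"
    "<wind><speed>5</speed><direction>UP</direction></wind></report>"
)
for problem in check_values(bad):
    print(problem)
```

Each of these checks passes a DTD unnoticed, because the DTD only sees character data.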

Clearly, a different notation is necessary for doing this level of validation. Ideally, this notation will itself be in XML, which will make it easier to read than a DTD. In the remainder of this article, we will explore two such validation markups.

The World Wide Web Consortium has created XML Schema. The material in this article that covers Schema is based on the excellent XML Schema primer available at the W3C website.

Mr. Makoto Murata has invented RELAX (REgular LAnguage description for XML), which will be submitted to the International Organization for Standardization (ISO) for adoption as a standard. The material in this article that covers RELAX is based on the How to RELAX tutorial.

J. David Eisenberg lives in San Jose, California with his exceptionally handsome cat, Marco Polo.
