Anomaly ~ G. Wade Johnson

August 03, 2006

Misunderstanding XML

Like many developers, I've been working with XML for many years now. My first XML-based program dates back to within six months of the W3C publishing the XML 1.0 Recommendation. I've already gone through the phases of:

  • XML looks promising,
  • XML is cool,
  • everything in XML, and
  • XML is just another tool

XML is widely enough used that most people doing active development should have some concept of the ground rules. I realize that some people have not worked with XML as extensively as others, but there are a few things that have really begun to get on my nerves. I apologize in advance for what will mostly be a rant.

XML is powerful. There are a large number of tools that work with XML. But, not every text file with angle brackets is XML. If you say you are working in XML, you have to follow the rules specified in the recommendation or it's not XML.

Well-formed XML

If the data is not well-formed, it is not XML. There is no such thing as lenient XML. If you call it "XML, except we have a few extensions...", it's not XML. If you have start tags with no end tags, it's not XML. If you have more than one root element, it's not XML. If you have attribute values that are not surrounded by quotes, it's not XML. This does not mean that the format cannot be useful. But if your data is not XML, don't expect an XML tool to handle it. That makes as much sense as expecting your MP3 player to read an MS Word document aloud.
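To make those cases concrete, here is a minimal sketch using the parser in Python's standard library. The sample strings and the `is_well_formed` helper are mine, made up for illustration; any conforming XML parser should reject the same three inputs.

```python
import xml.etree.ElementTree as ET

# Illustrative samples (not from any real project): one well-formed
# document and the three kinds of breakage described above.
samples = {
    "well-formed": "<doc><item id='1'>text</item></doc>",
    "missing end tag": "<doc><item>text</doc>",
    "two root elements": "<doc/><doc/>",
    "unquoted attribute": "<doc id=1/>",
}


def is_well_formed(text):
    """Return True if a conforming parser accepts the text as XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


results = {name: is_well_formed(text) for name, text in samples.items()}

for name, ok in results.items():
    print(f"{name}: {'XML' if ok else 'not XML'}")
```

Notice that the parser does not try to guess what you meant. It reports the error and stops, which is exactly what the recommendation requires.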

Over and over again through the years, someone will report that one XML tool or another is broken because it can't read their XML. Most of the time, the document turns out to be ill-formed. Sometimes, the questioner will say "thanks" and fix their XML. Most of the time, they begin suggesting that the tool be modified just a little to handle this case. Or they may complain that the authors of the tool should have made it capable of dealing with this case, since it is obviously useful.

These people are missing the point. They may have a useful format, but it is not XML. No XML tool should be expected to deal with it, any more than you should expect an XML tool to process a JPEG image. If you have questions about the definition of well-formed XML, there are many sources available. The definitive description, of course, is at Extensible Markup Language (XML) 1.0.

Valid XML

If an XML application is described by a schema of some type (DTD, XML Schema, RELAX NG, etc.), only documents matching that schema are valid. This is the definition of a valid XML document. If there is no schema, the document is neither valid nor invalid. However, if there is a defined schema for a particular XML application, documents that don't match the schema are invalid.

If your document does not contain required elements or attributes, it is invalid. If your document contains extra elements or attributes (unless they are in a different namespace), your document is invalid. Even elements and attributes from a different namespace are only valid if the definition of the XML application allows them. Your document is not almost valid or mostly valid. It is invalid.

I work a bit with SVG. This XML application is specified by schemas in several formats and processors are supposed to ignore elements and attributes from unknown namespaces. This makes SVG extremely flexible for many applications.

Unfortunately, people regularly complain on the SVG mailing lists that one viewer or another fails on their SVG when it reaches some non-standard element they have defined. How do these people expect that the viewer will render their <chair/> element or whatever it happens to be? (Usually they want the element to be ignored.) If you have non-standard elements in the SVG namespace (often the default namespace in SVG images), the document is not valid SVG. As such, a viewer should error out. The document is not SVG, but is pretending to be. The viewer cannot tell the difference between an invalid <chair/> element that you want it to ignore and a rectangle that you misspelled <Rect/>. Surely, you would want the viewer to report the second case so that you can correct the error.
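Here is a rough sketch of the distinction, again in Python. It is not a real SVG validator: the `KNOWN_SVG_ELEMENTS` set is a tiny hand-picked subset, and the `furn` namespace URI and the `unknown_svg_elements` helper are hypothetical names I made up for the example. The point is only that elements in a foreign namespace can be skipped safely, while unknown names in the SVG namespace cannot.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Deliberately tiny subset of SVG element names, for illustration only.
KNOWN_SVG_ELEMENTS = {"svg", "g", "rect", "circle", "line", "path", "text"}


def unknown_svg_elements(svg_text):
    """List local names of elements that claim the SVG namespace
    but are not in the known set. Foreign namespaces are ignored."""
    problems = []
    for elem in ET.fromstring(svg_text).iter():
        tag = elem.tag
        if not isinstance(tag, str):
            continue  # skip anything that is not an element
        if tag.startswith("{"):
            namespace, local = tag[1:].split("}", 1)
        else:
            namespace, local = "", tag
        if namespace == SVG_NS and local not in KNOWN_SVG_ELEMENTS:
            problems.append(local)
    return problems


# The furniture namespace URI below is a made-up placeholder.
doc = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:furn="http://example.com/furniture">
  <rect width="10" height="10"/>
  <Rect width="10" height="10"/>
  <chair/>
  <furn:chair/>
</svg>"""

print(unknown_svg_elements(doc))  # prints ['Rect', 'chair']
```

The check flags both `Rect` and the bare `chair`, because from the processor's side they look the same: unknown names claiming to be SVG. Only `furn:chair`, which lives in its own namespace, is clearly something to ignore.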

Conclusion

One purpose of XML is to provide a standardized, yet flexible, file format that can be processed in a standard way. Many tools have grown around the XML recommendation that allow consistent processing, independent of platform or programming language. These are major benefits. Those who have not spent time troubleshooting data in unusual, proprietary, binary formats may not fully appreciate the benefits of XML. With those benefits come a few restrictions.

  • XML must be well-formed.
  • If you are using a standardized XML application, your XML document must be valid.

These are not really horrible restrictions when you consider the benefits.

Posted by GWade at August 3, 2006 07:12 AM.