Anomaly ~ G. Wade Johnson

August 03, 2006

Misunderstanding XML

Like many developers, I've been working with XML for many years now. My first XML-based program dates back to within six months of the W3C publishing the XML 1.0 Recommendation. I've already gone through the phases of:

  • XML looks promising,
  • XML is cool,
  • everything in XML, and
  • XML is just another tool

XML is used widely enough that most people doing active development should have some concept of the ground rules. I realize that some people have not worked with XML as extensively as others, but there are a few things that have really begun to get on my nerves. I apologize in advance for what will mostly be a rant.

XML is powerful. There are a large number of tools that work with XML. But, not every text file with angle brackets is XML. If you say you are working in XML, you have to follow the rules specified in the recommendation or it's not XML.

Well-formed XML

If the data is not well-formed, it is not XML. There is no such thing as lenient XML. If you call it "XML, except we have a few extensions...", it's not XML. If you have start tags with no end tags, it's not XML. If you have more than one root element, it's not XML. If you have attribute values that are not surrounded by quotes, it's not XML. This does not mean that the format may not be useful. But, if your data is not XML, don't expect an XML tool to handle it. This makes as much sense as expecting your MP3 player to read an MS Word document aloud.

Over and over again through the years, someone will report that one XML tool or another is broken because it can't read their XML. Most of the time, the document turns out to be ill-formed. Sometimes, the questioner will say "thanks" and fix their XML. Most of the time, they begin suggesting that the tool be modified just a little to handle this case. Or they may complain that the authors of the tool should have made it capable of dealing with this case, since it is obviously useful.

These people are missing the point. They may have a useful format, but it is not XML. No XML tool should be expected to deal with it, any more than you should expect an XML tool to process a JPEG image. If you have questions about the definition of well-formed XML, there are many sources available. The definitive description, of course, is at Extensible Markup Language (XML) 1.0.
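As a rough illustration of the rules above, here is a minimal sketch using the parser in Python's standard library. The helper name `is_well_formed` and the sample documents are made up for this example; any conforming XML parser should reject the same three broken inputs.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents, one per rule mentioned in the post.
samples = {
    "well-formed": '<doc id="1"><item>text</item></doc>',
    "missing end tag": "<doc><item>text</doc>",
    "two root elements": "<doc/><doc/>",
    "unquoted attribute": "<doc id=1/>",
}

for label, text in samples.items():
    print(label, "->", is_well_formed(text))
```

Only the first sample is accepted. The parser does not try to guess what the other three meant, which is exactly the behavior to expect from an XML tool.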

Valid XML

If an XML application is described by a schema of some type (DTD, XML Schema, RELAX NG, etc.), only documents matching that schema are valid. This is the definition of a valid XML document. If there is no schema, the document is neither valid nor invalid. However, if there is a defined schema for a particular XML application, documents that don't match the schema are invalid.

If your document does not contain required elements or attributes, it is invalid. If your document contains extra elements or attributes (unless they are in a different namespace), your document is invalid. Even elements and attributes from a different namespace are only valid if the definition of the XML application supports it. Your document is not almost valid or mostly valid. It is invalid.

I work a bit with SVG. This XML application is specified by schemas in several formats and processors are supposed to ignore elements and attributes from unknown namespaces. This makes SVG extremely flexible for many applications.

Unfortunately, people regularly complain on the SVG mailing lists that one viewer or another fails on their SVG when it reaches some non-standard element they have defined. How do these people expect that the viewer will render their <chair/> element or whatever it happens to be? (Usually they want the element to be ignored.) If you have non-standard elements in the SVG namespace (often the default namespace in SVG images), the document is not valid SVG. As such, a viewer should error out. The document is not SVG, but is pretending to be. The viewer cannot tell the difference between an invalid <chair/> element that you want it to ignore and a rectangle that you misspelled <Rect/>. Surely, you would want the viewer to report the second case so that you can correct the error.
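The distinction can be sketched in a few lines of Python. This is not a real SVG validator: `KNOWN_SVG_ELEMENTS` is a deliberately tiny, illustrative subset of element names, and the furniture namespace URI is invented for the example. The sketch only shows that a foreign-namespace element can be skipped, while an unknown name in the SVG namespace looks the same whether it is an extension or a typo.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Illustrative subset only (an assumption for this sketch); the real list
# of element names comes from the SVG specification.
KNOWN_SVG_ELEMENTS = {"svg", "g", "rect", "circle", "path", "text"}


def unknown_svg_elements(svg_text):
    """Return local names of SVG-namespace elements not in the known set.

    Elements in any other namespace are skipped, not reported.
    """
    prefix = "{" + SVG_NS + "}"
    unknown = []
    for element in ET.fromstring(svg_text).iter():
        tag = element.tag
        if isinstance(tag, str) and tag.startswith(prefix):
            local = tag[len(prefix):]
            if local not in KNOWN_SVG_ELEMENTS:
                unknown.append(local)
    return unknown


# Hypothetical document: a misspelled Rect, a chair in the SVG namespace,
# and a chair in a separate (made-up) furniture namespace.
document = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:f="http://example.com/furniture">
  <rect width="10" height="10"/>
  <Rect width="10" height="10"/>
  <chair/>
  <f:chair/>
</svg>"""

print(unknown_svg_elements(document))  # ['Rect', 'chair']
```

The namespaced `f:chair` is left alone. The bare `chair` and the misspelled `Rect` are reported together, because nothing in the document distinguishes one from the other.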

Conclusion

One purpose of XML is to provide a standardized, yet flexible, file format that can be processed in a standard way. Many tools have grown around the XML recommendation that allow consistent processing, independent of platform or programming language. These are major benefits. Those who have not spent time troubleshooting data in unusual, proprietary, binary formats may not fully appreciate the benefits of XML. With those benefits come a few restrictions.

  • XML must be well-formed.
  • If you are using a standardized XML application, your XML document must be valid.

These are not really horrible restrictions when you consider the benefits.

Posted by GWade at 07:12 AM.

October 06, 2005

To XML or not to XML...

I've been seeing a lot of comments in various forums similar to this comment by Christopher Diggins: XML down a slippery slope. In most of them, there is the implied belief that XML is the solution to all problems and that anything that violates the spirit of XML is to be condemned.

I have personally been working with XML since just after XML 1.0 became a recommendation. I've been through the everything in XML phase and come through it alive. The most important issue that people need to understand is that XML is just a data (or document) format; it is not a religion. It is a very useful data format. It can be a powerful way to represent some kinds of data, but it is not always the best way.

For example, I doubt that anyone would really recommend that we use the following XML anywhere:

<number type="integer" sign="positive">
  <thousand>2</thousand>
  <hundred>0</hundred>
  <ten>0</ten>
  <ones>5</ones>
</number>

Obviously, 2005 is more useful, even though marking up the number as above would be more in the spirit of XML. After all, you might need to do something specific with all numbers that have a 2 in the thousands position and you'll need a micro-parser to deal with this embedded format if you don't mark it up.

Some may consider this to be a bit of a strawman argument, but I would like to propose that it is actually one point along a continuum. Individual numbers and words obviously do not (necessarily) need to be marked up in XML. Just as obviously, a complicated, nested data structure or document greatly benefits from XML markup or something similar.

Diggins references a previous article, XML.com: Painting by Numbers with SVG, which covers a discussion on some of the reasons the SVG recommendation uses the micro format for the path element. According to the SVG Working Group, using an element-based path format could easily result in documents that were twice the size of the chosen approach. The working group decided that this was not acceptable.

In fact, their foresight has paid off. One of the areas where SVG has done very well is in mapping. Maps tend not to have many regular shapes. They are mostly built from paths. On those kinds of documents, the increase in size may be a factor of three or higher. In addition, some of these maps are very large. A factor of three for a multi-megabyte file is much different than a factor of three for a 10K file.
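To get a feel for the size tradeoff, here is a small sketch that expands compact path data into a hypothetical one-element-per-command form and compares the lengths. The element names (`moveto`, `lineto`, `closepath`) and the `x`/`y` attributes are invented for this example and are not the alternative the SVG Working Group actually considered. The parser handles only space-separated absolute `M`, `L`, and `Z` commands.

```python
import xml.etree.ElementTree as ET

# Compact path data, as used in the d attribute of an SVG path element.
path_data = "M 10 10 L 20 20 L 30 10 L 40 20 Z"

# Hypothetical element names for the expanded form (not from any standard).
COMMAND_NAMES = {"M": "moveto", "L": "lineto", "Z": "closepath"}


def element_based(d):
    """Expand path data into a made-up one-element-per-command form."""
    root = ET.Element("path")
    tokens = d.split()
    i = 0
    while i < len(tokens):
        command = tokens[i]
        child = ET.SubElement(root, COMMAND_NAMES[command])
        i += 1
        if command != "Z":
            # M and L each take one coordinate pair.
            child.set("x", tokens[i])
            child.set("y", tokens[i + 1])
            i += 2
    return ET.tostring(root, encoding="unicode")


compact = '<path d="%s"/>' % path_data
expanded = element_based(path_data)

print(len(compact), "characters:", compact)
print(len(expanded), "characters:", expanded)
print("ratio: %.1f" % (len(expanded) / len(compact)))
```

Even for this short path the expanded form is more than twice as long, and the gap grows with every additional path segment.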

The important thing to remember is that XML is not a religion; it is only a tool we use in solving problems. Every tool involves tradeoffs. Sometimes we decide in the direction of purity of expression, sometimes we bow to practicality. While I personally don't particularly like the path element, I understand the tradeoffs involved. Just as importantly, I'm not sure I would have done differently in their shoes.

It may be useful to consider that the people who worked on these standards were not stupid. They thought and worked on the standard for a long time. Any compromises they made to purity were probably carefully considered. This does not mean that they are always right, but it does mean that second-guessing them without considering all of the use cases they considered might be a bit rash.

Posted by GWade at 07:14 AM.

May 26, 2004

Review of XML Bible

XML Bible
Elliotte Rusty Harold
IDG Books, 1999

As computer books go, this book is old. However, it does a spectacular job of handling all of the nitty-gritty details of XML. I first stumbled across this book when it was new. The only really definitive information you could find on XML was the W3C recommendation itself. This book does a very good job of explaining most aspects of the XML recommendation in a more conversational style.

Currently, most people who need to work with XML know (or believe they know) enough about it to work effectively. The W3C recommendation is still the place to go for definitive answers to clear up any confusion. But, if you need to be the expert, and you don't want to spend a lot of time reading the official version, this book will probably answer any question you have, without putting you to sleep.

Although the book is very complete on XML, it does not cover any of the more recent tools and specifications. This is understandable since it was written before they existed. In spite of that, I still highly recommend this book.

Posted by GWade at 10:11 PM.

Review of XML in a Nutshell

XML in a Nutshell
Elliotte Rusty Harold and W. Scott Means
O'Reilly, 2001

This is a good overview of XML-based technologies as of 2001. If you are looking for a deep understanding of the uses of XML and related technologies, this is not the right book for you. However, if you are looking for a high-level overview or a book to refresh your memory, this is the book to read. It provides a good overview of the major technologies available, including XSLT, XPath, CSS, SAX, and others.

Unlike many books and articles I've read, this book looks at the two main philosophies of XML in a relatively dispassionate way. One section of the book is devoted to XML as documents, which they call Narrative-Centric Documents. Another section is devoted to XML as Data, which they call Data-Centric Documents.

Unlike many writers, I find many good uses for both approaches. This book treats both approaches equally and covers how some of the related technologies relate to each.

Posted by GWade at 09:58 PM.

February 12, 2004

XML Living Up To Its Promise

XML.com: Opening Open Formats with XSLT [Feb. 04, 2004]

This article by Bob DuCharme is a great example of something we don't see enough of. He takes data from a defined XML application (OpenOffice.org Impress format). He uses standard tools (XSLT) to extract and format data useful to him.

This is not the normal "If I put my data in XML, everyone will be able to use it" message we see everywhere. It is also not an example of business-based conversion of some consortium-sponsored format into some other consortium-sponsored format. It's not even one of the (often) contrived examples from XML books of how XML makes our lives better.

This is a great example of someone with a very specific need that is solved by standard tools and XML. This specific need is not something that the OpenOffice.org team would want to dedicate the time to satisfy. Since they chose an open XML file format, they don't have to. The user owns his data in a more meaningful way than if the same data were in a proprietary format.

In the 5 years I've been working with XML, I've almost never seen this kind of example. What's more amusing is that this use has been promised all along. I've spent a lot of my time working with special purpose XML applications that I've crafted for a handful of uses. This article is the wakeup call I needed to remind me to look at some other formats again.

Posted by GWade at 06:52 AM.

February 05, 2004

XML Data Representation

I had an interesting thought in an email conversation with a friend yesterday. One problem many people have when using XML for data is a misunderstanding of what the XML is.

(If you don't believe in the data in XML approach, feel free to ignore me.<grin/>)

It's easy to make the mistake of treating the XML as if it is the data when you are first learning to use XML this way. But it is really important to realize that the XML represents the data; it is not the same as the data.

You would never have problems with the concept that a line chart or pie chart is not the data; it is just a representation of the data. XML is just another representation.

How does that help? Just as you decide to add or remove information from a line chart to make it serve its purpose better, you can do the same with XML. Let's look at some of the representation-only issues you consider when making a line chart. The most obvious information removed is the actual values. On a line chart, the trends and relative levels appear to be more important. On the other hand, on many line charts color is added to differentiate between different kinds of data or different levels. Error bars are sometimes added to enhance your understanding of the fuzziness of the data.

None of these changes actually changes the data; they just change the representation. In some cases, they might add implied information (error range, data grouping) or remove unneeded details (values). But, the data remains.

I have realized that the same is true of XML (when used for data). You may include structure or grouping in data that isn't evident in the raw values. You may add scaling or units that are implied in the original data. You can even add links to explanations of results. This allows for a richer representation of the data. So you really aren't limited to how you represented your data inside your application. I have sometimes marked up a data set with exactly those pieces of implied information that have always given me problems when communicating between programs. Since I was using XML as an interchange format, making the implied assumption explicit simplifies the overall project.
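A minimal sketch of the idea, assuming a made-up set of temperature readings: inside the program the data is a bare list of numbers, while the XML representation spells out the unit, sampling interval, and error range that the program only implies. The element and attribute names here are hypothetical, not taken from any real format.

```python
import xml.etree.ElementTree as ET

# Inside the application the readings are just bare numbers; the unit,
# sampling interval, and error range are implied (hypothetical values).
readings = [20.5, 21.0, 21.4]


def readings_to_xml(values, unit, interval_seconds, error):
    """Build an interchange document that makes the implied details explicit."""
    root = ET.Element("readings", {
        "unit": unit,
        "interval-seconds": str(interval_seconds),
        "error": str(error),
    })
    for value in values:
        ET.SubElement(root, "reading").text = str(value)
    return root


root = readings_to_xml(readings, unit="celsius", interval_seconds=60, error=0.5)
print(ET.tostring(root, encoding="unicode"))
```

The values themselves are unchanged. Only the representation has grown to carry what the receiving program would otherwise have to guess.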

Posted by GWade at 05:45 PM.

January 31, 2004

XML-Serialized Objects and Coupling

Although the debate continues to rage between the "XML as documents" camp and the "XML as data" camp, it seems reasonable to believe that both styles are here to stay. I have noticed one trend in XML data that strikes me as déjà vu all over again. There seem to be a large number of tools for automatically generating XML from objects, giving you the ability to serialize an object, send it somewhere, and possibly reconstitute it elsewhere.

These tools seem to make an annoying chore much easier. But, I have to wonder. These tools simplify applying a particular solution. But, is it the right solution?

I started working in XML long before most of these tools were available. In the early days (a few years ago <grin/>), we worked out the serialization by hand. You converted what needed conversion to XML and left the rest alone.

One problem I see with the current approach is an increased coupling between the initial object and the XML stream that comes from it. If you are guaranteed to have the same kind of object, in the same language, on both sides of your transfer, that might be an appropriate solution. But what if you don't have that guarantee? What if you are providing a service to someone else? What if you are providing an API over a network? (I didn't say a Web Service, because nowadays that implies a particular architecture.)

What happens as your service changes over time? Do you really want to change the whole interface because the object that generates it has been refactored? If not, then you either have to leave those objects alone, or drop the nice tool that helps you generate the XML.

Many years ago, before Web programming and before the mainstream use of OO languages, there was a simple concept in programming to describe this problem: coupling. Long ago we learned that inappropriate coupling was bad. The higher the coupling between two pieces of code, the harder it is to change one of them without changing the other. The whole concept of interfaces is based around the idea of reducing coupling.

My problem with these tools and the approach they simplify is that they may be increasing coupling unnecessarily. If both ends of the system must have identical object layouts in order to use the tool, then you are locking clients of the service into your way of looking at things. This makes it much more difficult for other people to use the service in ways you hadn't planned for. In fact, it makes it more difficult for you to use the service a year from now in a way that differs from how you see it today.

I built a Web Service a few years ago for use inside a company. This was before the proliferation of WSDL and UDDI. SOAP was still pretty new. We defined the system using XML over HTTP. We defined the XML to fit the data we needed to send, not the objects we generated it with. It was not perfect and we learned many things along the way. One of the more interesting things that came out of it was the fact that the generic XML could be consumed relatively easily by code written with different technologies, from ASP to Perl to Flash.
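Here is a minimal sketch of that kind of hand-written mapping, assuming a hypothetical `CustomerRecord` class and a made-up `customer` wire format. It is not the service described above. The point is only that the XML element names are chosen for the data being sent, so the internal field names can be refactored without touching the interface.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class CustomerRecord:
    # Hypothetical internal names that may be refactored at any time.
    cust_nm: str
    acct_bal_cents: int  # assumed non-negative in this sketch


def customer_to_xml(record):
    """Map the internal object onto a separately defined wire format."""
    customer = ET.Element("customer")
    ET.SubElement(customer, "name").text = record.cust_nm
    balance = ET.SubElement(customer, "balance", {"currency": "USD"})
    balance.text = "%d.%02d" % divmod(record.acct_bal_cents, 100)
    return ET.tostring(customer, encoding="unicode")


print(customer_to_xml(CustomerRecord("Example Co.", 123456)))
```

If `cust_nm` is later renamed, only the body of `customer_to_xml` changes. A client written in another language keeps reading `name` and `balance` exactly as before.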

I think the next time I build something like this, I will definitely do a few things differently. But the serialized objects approach is one thing I probably won't do. I don't think the increased coupling is worth the temporary gain.

Posted by GWade at 11:32 AM.

January 23, 2004

SVG and CSS

In most of the SVG I've seen, people prefer either to use the style attribute or to set the individual style attributes. I don't see much use of CSS classes and I wonder why.

Most of the criticisms I've seen of the use of CSS fall into four categories:

  1. It's not XML.
  2. It's too verbose to use CSS on all elements.
  3. It's not as easily scripted or animated.
  4. It's inconsistently supported.

It's not XML

Let's take these one at a time. The first is one of my favorite non-arguments. I don't use XML for every piece of data in my life. For example, most of the words I type do not use characters that are explicitly marked up. Even more amazing, the individual digits of most numbers I use aren't marked up in XML, either.

All sarcasm aside, XML is a good format for some things and a lousy one for others. Sometimes raw text is better. Sometimes a comma-separated values (CSV) file is better. And, sometimes XML is best. So I don't consider this to be a useful argument.

It's too verbose to use CSS on all elements.

Well, that was true of CSS and HTML as well. In fact, I remember people using that as a reason to go ahead and use <font> tags in HTML. (Which were even more verbose.)

People with a bit more experience, or people that are too lazy to style everything explicitly (like me), often use CSS classes to solve that problem. Instead of a large number of individual style properties, you only need one class attribute. This also simplifies changing the look of many elements by modifying the class they are associated with.
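The verbosity difference is easy to see by generating the same drawing both ways. This is a rough sketch with made-up helper names and a made-up `box` class. It builds the SVG as plain strings and only uses the parser to confirm that both versions are well-formed.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# The same look expressed as a CSS rule and as an inline style value.
CLASS_RULE = ".box { fill: none; stroke: blue; stroke-width: 2 }"
INLINE_STYLE = "fill:none;stroke:blue;stroke-width:2"


def boxes_with_inline_style(count):
    """Every rect repeats the full list of style properties."""
    rects = "".join(
        '<rect x="%d" width="10" height="10" style="%s"/>' % (i * 20, INLINE_STYLE)
        for i in range(count)
    )
    return '<svg xmlns="%s">%s</svg>' % (SVG_NS, rects)


def boxes_with_class(count):
    """One style rule up front, then a single class attribute per rect."""
    rects = "".join(
        '<rect x="%d" width="10" height="10" class="box"/>' % (i * 20)
        for i in range(count)
    )
    return '<svg xmlns="%s"><style type="text/css">%s</style>%s</svg>' % (
        SVG_NS, CLASS_RULE, rects)


inline_doc = boxes_with_inline_style(50)
class_doc = boxes_with_class(50)

# Both documents must at least be well-formed XML.
ET.fromstring(inline_doc)
ET.fromstring(class_doc)

print(len(inline_doc), "characters with inline styles")
print(len(class_doc), "characters with a CSS class")
```

Changing the stroke color in the class version means editing one rule. In the inline version it means touching all fifty elements.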

It's not as easily scripted or animated.

This is often true. I believe there are viewers and libraries that allow you to modify CSS on the fly, but I don't know that the methods are consistent across tools. The key is to not use CSS for things that you want to be dynamic. In many cases, a large number of the elements in the display don't change (or some of the styling is static even on elements that do change); style those with CSS classes or with direct styles. Put the things you plan to change in attributes.

In my experience, most of the things I animate or script, I do by changing either individual XML attributes or whole classes of styling. But, I don't usually directly modify styling information. In fact, changing one class attribute can result in a drastic change to an element, effectively modifying a large number of style properties all in one call.
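A small sketch of the class-swapping idea follows. In a viewer this would be a script call on the live document; here Python's ElementTree stands in for the viewer's DOM, and the `normal` and `alarm` class names and the `lamp` element are invented for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical document with two predefined looks for one element.
svg = ET.fromstring(
    '<svg xmlns="%s">'
    '<style type="text/css">'
    ".normal { fill: white; stroke: black; stroke-width: 1 } "
    ".alarm { fill: red; stroke: darkred; stroke-width: 3 }"
    "</style>"
    '<circle id="lamp" class="normal" cx="10" cy="10" r="5"/>'
    "</svg>" % SVG_NS
)

lamp = svg.find("{%s}circle" % SVG_NS)

# One attribute change switches fill, stroke, and stroke-width together.
lamp.set("class", "alarm")

print(lamp.get("class"))  # alarm
```

The three style properties never appear in the script. They were set up in advance in the stylesheet, which is the cost noted in the next paragraph.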

The one downside, of course, is that you need to set up the CSS classes in advance.

It's inconsistently supported.

This is definitely true. I have run across lax CSS features in ASV3 which cause difficulty displaying SVG written for ASV3 on other viewers. But, from what I've heard and seen of newer viewers, that situation seems to be improving. Of course, if we use that excuse we need to stop working on the web. Support and rendering of HTML is still inconsistent. And, in the past, I remember people using invalid HTML to get the visual effects they wanted on certain browsers.

My experience

I've tried to use CSS classes in SVG for most of my own work, and I find it works quite well. I also use the style attribute to override the class values for a few elements.

Finally, if I need to do a lot of scripting or animation on an element I tend to rely on the styling attributes.

In fact, each of these approaches has its own strengths and weaknesses. Playing with them allows you to develop a sense for when each could be the right tool for your particular job.

Posted by GWade at 11:31 PM.