January 20, 2004

A view on XML handling

  • Reported by liorean

There's been much discussion about XML handling lately, on Dive into Mark, Ongoing, Surfin' Safari and elsewhere. Some of the arguments I agree with, some not. This is my view on it.

The first thing I want to say is that views on XML vary widely, both on what it is and on what it is not. The way I see it, XML is not a data storage format. Data MAY be stored in XML format as a document, but that is not the real purpose of XML, nor is it something XML is particularly good at. The real purpose of XML is to be a data structure format: its primary purpose is to provide structure to data.

The data may look any way it likes; the structure is XML. That structure is a hierarchical tree of single-parent entries, or nodes, and it can be represented as a DOM object tree, as an XML document, as an XML fragment, or in a number of other formats such as SXML. In essence, what I want to point out is that XML is the structure, not the document, nor the data.
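To make the distinction between structure and document concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The element names (feed, entry, title) and the shape() helper are made up for illustration; the point is only that the same tree can exist as objects in memory or as a string of markup, and that a well-formed document converts back to the same tree.

```python
import xml.etree.ElementTree as ET

# Build the structure directly, with no document involved.
# (Element names are hypothetical, chosen only for the example.)
feed = ET.Element("feed")
entry = ET.SubElement(feed, "entry")
title = ET.SubElement(entry, "title")
title.text = "A view on XML handling"

# Serialize the structure into a document (a string of markup).
document = ET.tostring(feed, encoding="unicode")

# Parse the document back into a structure.
parsed = ET.fromstring(document)


def shape(node):
    """Return the tree as nested tuples: (tag, text, children)."""
    return (node.tag, node.text, tuple(shape(child) for child in node))


# The round trip preserves the tree: same tags, same text, same nesting.
same_structure = shape(parsed) == shape(feed)
```

Each node here has exactly one parent, which is what makes the tree and the nested markup interchangeable representations of one structure.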

The discussion on the above-mentioned blogs has been about XHTML, Atom, RSS and XML, and about liberal versus draconian parsing. The thing I want to point out is that XML is not primarily a data storage (document) format, but a data structure (metadata) format. The weak point is that the XML structure is stored in documents together with the data, and that the whole thing is transferred to the client as a document, because the conversion from structure to document and from document to structure is where things may go wrong. XML is the structure, and that structure must be unambiguous. The structure must be intact in the first place, and it must then be unambiguous in the document, because if it is not, it cannot be converted back to structure. That unambiguity is provided by well-formedness.

The purpose of XML parsers is to perform that conversion - from document to structure and data. If the document is malformed, it cannot be converted into structure and data reliably, and because of that it cannot be considered XML. Again, XML is structure, not markup in a document.
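As a rough illustration of that conversion step, the sketch below feeds one well-formed and one malformed document to Python's xml.etree.ElementTree, which is a draconian parser in the sense discussed here: it raises ParseError rather than guessing at a structure. The to_structure() helper and the sample markup are assumptions made for the example, not anything taken from the blogs mentioned above.

```python
import xml.etree.ElementTree as ET


def to_structure(document):
    """Return the root element, or None if the document is not well formed."""
    try:
        return ET.fromstring(document)
    except ET.ParseError:
        # No unambiguous structure can be recovered from this markup.
        return None


well_formed = "<entry><title>Hello</title></entry>"
malformed = "<entry><title>Hello</entry>"  # <title> is never closed

good = to_structure(well_formed)  # an Element tree
bad = to_structure(malformed)     # None: the parser refuses to guess
```

A liberal parser would instead have to decide where the missing close tag belongs, and two such parsers may decide differently, which is exactly the ambiguity that well-formedness rules out.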

All this essentially leads me to one conclusion: if your data must be readable with or without an intact structure, the question is whether XML is really the appropriate format. A prerequisite for XML is that the structure is intact, which in a document means it is well formed. If you cannot live with the data going unparsed when the markup is not well formed, your XML application should not be an XML application at all.
