Dan Maharry

Writing about web development since 1997

Discrepancies in the XML Schema Docs

All I did was try to write a small set of extension methods to check whether a given string is valid according to the built-in schema string types, and the editor in me came out and started nitpicking. The W3C Schema docs are very good, but sometimes they are annoyingly ambiguous unless you have a degree in lateral thinking.

Problem #1 : Is "" valid?

Section 3.2.1 says

The ·value space· of string is the set of finite-length sequences of characters (as defined in [XML 1.0 (Second Edition)]) that ·match· the Char production from [XML 1.0 (Second Edition)].

So, is the empty string valid then? Taking this definition on spec, the answer seems to depend on what 'finite-length' means. According to the dictionary finite means

1. having bounds or limits; not infinite; measurable.
2. Mathematics.

  • (of a set of elements) capable of being completely counted.
  • not infinite or infinitesimal.
  • not zero.

So maybe an empty string isn't valid then? The dictionary implies it. Alas, no. The XML Schema spec at the top of section 4 also states

Any property identified as a having a set, subset or ·list· value may have an empty value unless this is explicitly ruled out:this is not the same as absent.

OK, so the empty string is valid as a string, but could the W3C please link to this last note about sets containing the empty value from the many uses of the word 'set' around the document? Either that or define the phrase 'finite-length' in situ as 'zero or greater'.
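For what it's worth, here is a minimal sketch of that reading in Python. It assumes the check is simply "every character matches the Char production from XML 1.0", and the helper name is_schema_string is made up for illustration. Because the pattern allows zero or more characters, the empty string passes.

```python
import re

# XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The trailing * means "zero or more", which is what lets "" through.
_XML_CHARS = re.compile(
    r"[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*"
)


def is_schema_string(value: str) -> bool:
    """Hypothetical helper: True if every character matches Char."""
    return _XML_CHARS.fullmatch(value) is not None


print(is_schema_string(""))       # the empty string
print(is_schema_string("a\tb"))   # tab is an allowed control character
print(is_schema_string("\x00"))   # NUL is outside the Char production
```

Swapping the * for a + would give the 'not zero' dictionary reading instead, which is exactly the ambiguity in question.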

Problem #2 : In which string datatypes is "" invalid?

The problem with the note about sets is that it says a type must explicitly rule out the empty string before the empty string really is invalid. But what about the cases where that is only implied elsewhere and not spelled out in black and white as it is in, say, the value space of the NMTOKENS type?

NMTOKENS represents the NMTOKENS attribute type from [XML 1.0 (Second Edition)]. The ·value space· of NMTOKENS is the set of finite, non-zero-length sequences of ·NMTOKEN·s

Let's go one step back up the type hierarchy to the NMTOKEN type.

NMTOKEN represents the NMTOKEN attribute type from [XML 1.0 (Second Edition)]. The ·value space· of NMTOKEN is the set of tokens that ·match· the Nmtoken production in [XML 1.0 (Second Edition)].

No explicit mention of non-zero-length anything here. But the definition of Nmtoken in XML 1.0 says that it must consist of one or more characters.

NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
Nmtoken ::= (NameChar)+

By those rules, a valid NMTOKEN cannot be empty even if the writer of the schema sets minLength to 0. The same logic applies to the Language and Name string types in the schema definition as well, so if none of them can be empty, neither can NCName, ID, IDREF, IDREFS, ENTITY or ENTITIES, despite the fact that IDREFS and ENTITIES are the only ones of these whose definitions explicitly require non-zero length.
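A rough sketch of the Nmtoken rule in Python makes the point. It is an ASCII-only approximation: the real NameChar production also admits non-ASCII Letter, CombiningChar and Extender characters, which this pattern ignores, and the name is_nmtoken_ascii is hypothetical.

```python
import re

# ASCII-only approximation of: Nmtoken ::= (NameChar)+
# Assumption: Letter and Digit are narrowed to A-Z, a-z and 0-9;
# CombiningChar and Extender are left out entirely.
_NMTOKEN_ASCII = re.compile(r"[A-Za-z0-9._:\-]+")


def is_nmtoken_ascii(value: str) -> bool:
    """Hypothetical helper: True if value is a non-empty run of NameChars."""
    return _NMTOKEN_ASCII.fullmatch(value) is not None


print(is_nmtoken_ascii("-1.2"))  # may start with any NameChar
print(is_nmtoken_ascii("a b"))   # spaces are not NameChars
print(is_nmtoken_ascii(""))      # the + demands at least one character
```

The + in the production is the only thing ruling out the empty string, and it lives in the XML spec rather than in the schema spec's own wording.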

So then, what phrase is missing from "must explicitly rule the empty string as invalid"? Because it's definitely not all there.

Problem #3 : Colons or not?

The next issue spans three W3C recommendations and it's a question of colons. In the XML Schema document,

[the Name type is] the set of all strings which ·match· the Name production of [XML 1.0 (Second Edition)].

From the XML spec, the Name production looks like this

NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
Name ::= (Letter | '_' | ':') (NameChar)*

The Name type has several derived types (ID, IDREF and ENTITY), all of which are defined similarly and share the same ambiguity. Let's use IDREF.

IDREF represents the IDREF attribute type from [XML 1.0 (Second Edition)]. The ·value space· of IDREF is the set of all strings that ·match· the NCName production in [Namespaces in XML]. The ·lexical space· of IDREF is the set of strings that ·match· the NCName production in [Namespaces in XML].

From the [Namespaces in XML] spec then, the basic gist of the NCName production is that it's the same as the Name production in [XML 1.0 (Second Edition)] but without the colons

NCNameChar ::= Letter | Digit | '.' | '-' | '_' | CombiningChar | Extender
NCName ::= (Letter | '_') (NCNameChar)*

OK? Name with colons. NCName without. Now the XML spec defines the IDREF attribute type as follows

Values of type IDREF must match the Name production....

So then, values of the schema type IDREF which cannot have colons must be able to represent XML IDREF attributes which can have colons. Is it me or is there potential for a problem with that? I realise that 'represent' doesn't mean 'be the same as' but still.
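To see the mismatch concretely, here is a small Python sketch of the two productions side by side. Both patterns are ASCII-only approximations (non-ASCII Letter, CombiningChar and Extender characters are ignored), and the helper names are invented for this example.

```python
import re

# ASCII-only approximations of the two productions.
# Name   ::= (Letter | '_' | ':') (NameChar)*
# NCName ::= (Letter | '_') (NCNameChar)*
_NAME_ASCII = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*")
_NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._\-]*")


def is_name_ascii(value: str) -> bool:
    """Hypothetical helper: XML 1.0 Name, colons allowed."""
    return _NAME_ASCII.fullmatch(value) is not None


def is_ncname_ascii(value: str) -> bool:
    """Hypothetical helper: Namespaces in XML NCName, no colons."""
    return _NCNAME_ASCII.fullmatch(value) is not None


# A value that fits an XML IDREF attribute but not the schema IDREF type.
candidate = "chapter:one"
print(is_name_ascii(candidate), is_ncname_ascii(candidate))
```

Any value containing a colon lands in that gap: acceptable under the Name production, rejected under NCName.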

Problem #4 : Single spaces or more?

Last issue is another ambiguity which could be easily sorted if the W3C ever revised the Schema docs. At the bottom of the string type derivation tree are two 'plural' types, IDREFS and ENTITIES. Both are defined in the same way, so let's use IDREFS.

IDREFS represents the IDREFS attribute type from [XML 1.0 (Second Edition)]. The ·value space· of IDREFS is the set of finite, non-zero-length sequences of IDREFs. The ·lexical space· of IDREFS is the set of space-separated lists of tokens, of which each token is in the ·lexical space· of IDREF.

For me at least, the ambiguity is in the word "space-separated". How many spaces? Whitespace in general or literally just the space character, \x20? Again, we have to consult the XML specification to get the answer where we're told

values of type IDREFS must match [the] Names [production]

and [the] Names [production] reveals that each IDREF must be separated from the next by exactly one \x20 character; otherwise the string isn't a valid IDREFS value.

Names   ::=   Name (#x20 Name)*

So why can't the schema spec just say something like

The ·lexical space· of IDREFS is the set of lists of tokens each separated by a single \x20 character,....

and take the ambiguity out of the statement?
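The strict reading is easy enough to sketch in Python. This assumes the literal string is checked as-is, with no whitespace normalisation applied first, uses an ASCII-only approximation of NCName for each token, and the name is_idrefs_strict is hypothetical.

```python
import re

# ASCII-only approximation of NCName (non-ASCII name characters ignored).
_NCNAME = r"[A-Za-z_][A-Za-z0-9._\-]*"

# Shape of the Names production, with NCName tokens:
#   token (#x20 token)*
# One literal space between tokens; no tabs, newlines or doubled spaces.
_IDREFS_STRICT = re.compile(rf"{_NCNAME}(?: {_NCNAME})*")


def is_idrefs_strict(value: str) -> bool:
    """Hypothetical helper: tokens separated by exactly one #x20."""
    return _IDREFS_STRICT.fullmatch(value) is not None


print(is_idrefs_strict("a b c"))  # single spaces between tokens
print(is_idrefs_strict("a  b"))   # doubled space
print(is_idrefs_strict("a\tb"))   # tab instead of #x20
```

Under this reading a doubled space or a tab fails, and so does the empty string, since the production requires at least one token.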

Conclusion

Since the schema specification documents were last updated in October 2004, the XML Spec has undergone two more revisions and the Namespaces In XML spec has been revised once. All four problems remain. Hopefully they can all be addressed in the third edition of XML Schemas, but in the meantime, coders, watch, learn and be wary.
