XML Syntax Rules
April 5, 2002
In this section, we explain the various syntactical rules of XML.
Documents that follow these rules are called well-formed, but not
necessarily valid, as we'll see. If your document breaks any of
these rules, it will be rejected by most, if not all, XML parsers.
Well-Formedness
The minimal requirement for an XML document is that it be
well-formed, meaning that it adheres to a small number of
syntax rules,
which are summarized in Table 3-1 and explained in the following
sections. However, a document can abide by all these rules and
still be invalid. To be valid, a document must both be
well-formed and adhere to the constraints imposed by a DTD or XML
Schema.
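As a rough illustration of the distinction, the sketch below uses Python's standard xml.etree.ElementTree module, which checks well-formedness only and does no DTD or Schema validation. The helper name is_well_formed and the two sample documents are made up for this example; checking validity would require a separate validating parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Well-formed: one root element, proper nesting, quoted attribute value.
good = '<order id="42"><price>9.95</price></order>'

# Not well-formed: the <price> start tag has no matching end tag.
bad = '<order id="42"><price>9.95</order>'

print(is_well_formed(good))
print(is_well_formed(bad))
```

Note that a True result here says nothing about validity: the parser never consults a DTD or XML Schema.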
Table 3-1 XML Syntax Rules (Well-Formedness Constraints)
- The document must have a consistent, well-defined structure.
- All attribute values must be quoted (single or double quotes).
- White space in content, including line breaks, is significant.
- All start tags must have corresponding end tags (exception:
  empty elements).
- The root element must contain all others, which must nest properly
  by start/end tag pairing.
- Elements must not overlap; they may be nested, however. (This is
  also technically true for HTML. Browsers ignore overlapping in
  HTML, but not in XML.)
- Each element except the root element must have exactly one parent
  element that contains it.
- Element and attribute names are case-sensitive: Price
  and PRICE are different elements.
- Keywords such as DOCTYPE and ENTITY must always appear in
  uppercase; similarly for other DTD keywords such as ELEMENT and
  ATTLIST.
- Elements without content are called empty elements; when written
  as a single tag, that tag must end in "/>".
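Several of these rules can be tried out directly. The following is a minimal sketch, again assuming Python's standard xml.etree.ElementTree parser; the snippets and the helper name accepted are invented for illustration, and each snippet breaks exactly one rule from the table.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippets, each violating one well-formedness rule.
BROKEN = {
    "unquoted attribute value": '<item price=10></item>',
    "missing end tag": '<list><item></list>',
    "overlapping elements": '<b><i>text</b></i>',
    "case mismatch in tags": '<Price>10</price>',
    "more than one root element": '<a></a><b></b>',
    "lowercase DTD keyword": '<!doctype note SYSTEM "note.dtd"><note></note>',
}


def accepted(text: str) -> bool:
    """Return True if the parser accepts text, False if it rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


for rule, snippet in BROKEN.items():
    print(f"{rule}: accepted={accepted(snippet)}")

# Both spellings of an empty element are well-formed.
print(accepted('<list><item/><item></item></list>'))
```

Every snippet in BROKEN should be rejected, while the final document, which uses both forms of an empty element, should be accepted.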
Legal XML Name Characters
An XML Name (sometimes called simply a Name) is a token that
- begins with a letter, underscore, or colon (but not other
  punctuation)
- continues with letters, digits, hyphens, underscores, colons, or
  full stops (periods), known as name characters.
Names beginning with the string "xml", or any string
which would match ((`X'|`x')(`M'|`m')(`L'|`l')), are reserved.
Element and attribute names must be valid XML Names. (Attribute
values need not be.) An NMTOKEN (name token) is any
mixture of name characters (letters, digits, hyphens, underscores,
colons, and periods).
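The definitions above can be approximated with regular expressions. The sketch below is an ASCII-only simplification (the XML specification also permits many non-ASCII letters and digits, which are ignored here), and the function names is_name, is_nmtoken, and is_reserved are hypothetical.

```python
import re

# ASCII-only approximation of the Name and NMTOKEN definitions.
# Assumption: non-ASCII letters and digits are deliberately left out.
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")
NMTOKEN_RE = re.compile(r"[A-Za-z0-9_:.\-]+")
RESERVED_RE = re.compile(r"[Xx][Mm][Ll]")


def is_name(token: str) -> bool:
    """True if token starts with a letter, '_' or ':' and continues
    with name characters only."""
    return NAME_RE.fullmatch(token) is not None


def is_nmtoken(token: str) -> bool:
    """True if token is any non-empty mixture of name characters."""
    return NMTOKEN_RE.fullmatch(token) is not None


def is_reserved(token: str) -> bool:
    """True if token begins with 'xml' in any mixture of case."""
    return RESERVED_RE.match(token) is not None


for token in ("price", "7price", "discounted price", "xml-price"):
    print(token, is_name(token), is_nmtoken(token), is_reserved(token))
```

The token 7price shows the difference between the two definitions: it is an acceptable NMTOKEN but not a Name, because a Name may not start with a digit.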
Note: The Namespaces in XML Recommendation assigns a meaning to
names that contain colon characters. Therefore, authors should not
use the colon in XML names except for namespace purposes (e.g.,
xsl:template).
Listing 3-2 illustrates a number of legal XML Names, followed by
three that should be avoided but may or may not be identified as
illegal, depending on the XML parser you use, and four that are
definitely illegal. (This is file
name-tests.xml on the CD; you can try this
with your favorite parser, or with one of the ones provided on
the CD.)
Listing 3-2 Legal, Illegal, and Questionable XML Names
<?xml version = "1.0" encoding = "UTF-8" standalone = "yes"?>
<Test>
<!-- legal -->
<price />
<Price />
<pRice />
<_price />
<subtotal07 />
<discounted-price />
<discounted_price />
<discounted.price />
<discountedPrice />
<DiscountedPrice />
<DISCOUNTEDprice />
<kbs:DiscountedPrice />
<xlink:role />
<xsl:apply-templates />
<!-- discouraged -->
<xml-price />
<xml:price />
<discounted:price />
<!-- illegal -->
<7price />
<-price />
<.price />
<discounted price />
</Test>
From the legal examples, we see that any mixture of uppercase and
lowercase is fine, as are digits (anywhere but the first position)
and the punctuation characters named in the definition.
Since the last three examples in the first group use a colon, they
are assumed to be elements in the namespaces identified by the
prefixes "kbs", "xlink", and "xsl".
Of these, the last two refer to W3C-specified namespaces;
xlink:role is an attribute defined by the XLink
specification (it appears here as an element name only to test
the name itself) and xsl:apply-templates is an element
defined by the XSLT specification. The "kbs" prefix
refers to a hypothetical namespace, which I could have declared
(but didn't), since namespaces do not come only from the W3C.
(See chapter 5 for a thorough discussion of namespaces.)
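To see what declaring the prefix changes, here is a small sketch using Python's xml.etree.ElementTree, which is namespace-aware. The namespace URI http://example.com/kbs and the helper name parse_tag are assumptions made up for this example; the listing itself declares no namespace for "kbs".

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, used only for this illustration.
KBS_URI = "http://example.com/kbs"

undeclared = "<kbs:DiscountedPrice />"
declared = f'<kbs:DiscountedPrice xmlns:kbs="{KBS_URI}" />'


def parse_tag(text: str):
    """Return the root element's tag, or None if the parser rejects text."""
    try:
        return ET.fromstring(text).tag
    except ET.ParseError:
        return None


print(parse_tag(undeclared))
print(parse_tag(declared))
```

A namespace-aware parser such as this one rejects the undeclared prefix, whereas a parser that does no namespace processing treats the colon as just another name character.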
The three debatable examples are xml-price,
xml:price, and discounted:price. The
first two use the reserved letters "xml"; you shouldn't
use them, but most parsers won't reject them. The
discounted:price example uses a colon, which is
frowned upon if "discounted" is not meant to be a prefix
associated with a declared namespace.
The four illegal cases are much clearer. The first three,
7price, -price, and .price,
are illegal because the initial character is not a letter,
underscore, or colon. The fourth example is illegal because a
space character cannot occur in an XML Name. Most parsers will
think this is supposed to be the element named discounted
and the attribute named price, minus a required equals
sign and value.
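If you don't have the CD file at hand, a few names from each group can be tested one at a time. The sketch below assumes Python's xml.parsers.expat module, created without namespace processing so that undeclared prefixes are not themselves an error; the helper name parses and the group lists are hypothetical, and other parsers may treat the discouraged names differently.

```python
import xml.parsers.expat


def parses(name: str) -> bool:
    """Try name as an empty element inside a <Test> root element."""
    parser = xml.parsers.expat.ParserCreate()  # no namespace processing
    try:
        parser.Parse(f"<Test><{name} /></Test>", True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


LEGAL = ["price", "_price", "subtotal07", "discounted-price",
         "discounted.price", "kbs:DiscountedPrice"]
DISCOURAGED = ["xml-price", "xml:price", "discounted:price"]
ILLEGAL = ["7price", "-price", ".price", "discounted price"]

for group, names in (("legal", LEGAL),
                     ("discouraged", DISCOURAGED),
                     ("illegal", ILLEGAL)):
    for name in names:
        print(f"{group:12} {name:22} accepted={parses(name)}")
```

Testing each name in its own small document matters: a parser stops at the first well-formedness error, so a single file containing all four illegal names would only ever report the first one.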
Note: XML Names and NMTOKENs apply to elements, attributes,
processing instructions, and many other constructs where an
identifier is required, so it's important to understand what is
and what is not legal.