Markup, Character Data, and Parsing
March 29, 2002
An XML document contains text characters that fall into two
categories: either they are part of the document markup or part
of the data content, usually called character data, which
simply means all text that is not part of the markup. In other
words, XML text consists of intermingled character data and
markup. Let’s revisit an earlier fragment.
<Address>
<Street>123 Milky Way</Street>
<City>Columbia</City>
<State>MD</State>
<Zip>20777</Zip>
</Address>
The character data comprises the four strings “123 Milky Way”,
“Columbia”, “MD”, and “20777”; the markup comprises the start
and end tags for the five elements Address, Street, City,
State, and Zip. Note that this is similar, but not identical,
to what we previously called content. For example, although each
chunk of character data is the content of a particular element,
the content of the Address element is all of its child
elements. We can think of all the character data as belonging
both directly to the element that contains it and indirectly to
Address. (In fact, in some XML applications such as XSLT, if we
ask for the text content of Address, we’ll get the
concatenation of all the individual strings.)
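The distinction between an element's own character data and the text it contains indirectly can be sketched in Python. This is only an illustration, assuming the standard-library xml.etree.ElementTree module; the variable names are made up for the example, and XSLT itself is not involved.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<Address>
<Street>123 Milky Way</Street>
<City>Columbia</City>
<State>MD</State>
<Zip>20777</Zip>
</Address>"""

address = ET.fromstring(FRAGMENT)

# Character data held directly by each child element.
direct = {child.tag: child.text for child in address}

# Address itself directly holds only the white space between its children.
own_text = address.text

# itertext() walks the whole subtree, so it also yields the strings that
# belong to Address only indirectly; white-space-only pieces are dropped here.
strings = [t.strip() for t in address.itertext() if t.strip()]

print(direct)
print(strings)
```

Joining the items of strings gives the kind of concatenated text content described above, although the exact white space handling differs from tool to tool.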
The markup itself can be divided into a number of categories, as per section 2.4
of the XML 1.0 specification.
- start tags and end tags (e.g., <Address> and </Address>)
- empty-element tags (e.g., <Divider/>)
- entity references (e.g., &footer; or %otherDTD;)
- character references (e.g., &#60; or &#62;)
- comments (e.g., <!-- whatever -->)
- CDATA section delimiters (e.g., <![CDATA[ insert code here ]]>)
- document type declarations (e.g., <!DOCTYPE ....>)
- processing instructions (e.g., <?myJavaApp numEmployees="25" location="Columbia"....?>)
- XML declarations (e.g., <?xml version=....?>)
- text declarations (e.g., <?xml encoding=....?>)
- any white space at the top level (before or after the root element)
We will discuss each of these markup aspects in either this
chapter or the next. Note that for all types of markup, there
are some delimiters, most but not all of which are angle
brackets.
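To see several of these markup categories reported separately, here is a minimal sketch that assumes Python's standard xml.parsers.expat module. The sample document and the event labels are invented for the illustration; only the handler attribute names come from the module.

```python
import xml.parsers.expat

DOC = """<?xml version="1.0"?>
<!-- a comment -->
<?myJavaApp numEmployees="25"?>
<Address>
<Street>123 Milky Way</Street>
<Divider/>
<Note><![CDATA[1 < 2]]></Note>
</Address>"""

events = []  # (label, detail) tuples, in document order

parser = xml.parsers.expat.ParserCreate()
parser.XmlDeclHandler = (
    lambda version, encoding, standalone: events.append(("xml-decl", version)))
parser.CommentHandler = lambda data: events.append(("comment", data))
parser.ProcessingInstructionHandler = (
    lambda target, data: events.append(("pi", target)))
parser.StartElementHandler = lambda name, attrs: events.append(("start", name))
parser.EndElementHandler = lambda name: events.append(("end", name))
parser.StartCdataSectionHandler = lambda: events.append(("cdata-start",))
parser.EndCdataSectionHandler = lambda: events.append(("cdata-end",))

# The second argument marks this as the final (and only) chunk of input.
parser.Parse(DOC, True)

for event in events:
    print(event)
```

Note that this parser reports the empty-element tag <Divider/> as a start event immediately followed by an end event, so at this level it is indistinguishable from <Divider></Divider>.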
The specification states that all text that is not markup
constitutes the character data of the document. In other words,
if you stripped all markup from the document, the remaining
content would be the character data. Consider this example:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE Message SYSTEM "message.dtd">
<Message mime-type="text/plain">
<!--This is a trivial example.-->
<From>The Kenster</From>
<To>Silly Little Cowgirl</To>
<Body>
Hi, there. How is your gardening going?
</Body>
</Message>
The character data when the markup is removed would be:
The Kenster Silly Little Cowgirl Hi, there. How is your gardening
going?
In general, this is essentially the text between the start and
end tags, which we previously called the content of the
element, but there is a subtlety related to parsing. Depending
on parser details, the new lines after </From> and
</To> might be replaced by single spaces, as shown.
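Stripping the markup from the Message example can be sketched as follows, assuming Python's standard xml.etree.ElementTree module. Note that here the new lines are collapsed into single spaces by the explicit split and join step, not by the parser; that step is an assumption made to reproduce the result shown above.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE Message SYSTEM "message.dtd">
<Message mime-type="text/plain">
<!--This is a trivial example.-->
<From>The Kenster</From>
<To>Silly Little Cowgirl</To>
<Body>
Hi, there. How is your gardening going?
</Body>
</Message>"""

root = ET.fromstring(DOC)

# All character data in document order; tags, the comment, the declarations,
# and the attribute value are markup and do not appear here.
raw = "".join(root.itertext())

# Collapse each run of white space (including new lines) into one space.
normalized = " ".join(raw.split())

print(repr(raw))
print(normalized)
```

Printing raw with repr makes the preserved new lines visible, which is the parser-dependent detail the text refers to.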
Parsing is the process of splitting up a stream of
information into its constituent pieces (often called tokens).
In the context of XML, parsing refers to scanning an XML
document (which need not be a physical file—it can be a data
stream) in order to split it into its various markup and
character data, and more specifically, into elements and their
attributes. XML parsing reveals the structure of the information
since the nesting of elements implies a hierarchy. It is
possible for an XML document to fail to parse completely if it
does not follow the well-formedness rules described in the XML
1.0 Recommendation. A successfully parsed XML document may be
either well-formed (at a minimum) or valid, as discussed in
detail later in this chapter and the next.
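A parse failure caused by a well-formedness error can be observed directly. The helper below is a hypothetical convenience function, sketched on the assumption that Python's standard xml.etree.ElementTree module is the parser; it checks well-formedness only, not validity.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Properly nested elements parse successfully.
print(is_well_formed("<Address><City>Columbia</City></Address>"))

# The City start tag is never closed, so parsing fails.
print(is_well_formed("<Address><City>Columbia</Address>"))
```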
There is a subtlety about processing character data. During the
parsing process, if there is markup that contains entity
references, the markup will be converted into character data. A
typical example from XHTML would be:
<p>&quot;AT&amp;T is a winning company,&quot; he said.</p>
After the parser substitutes for the entities, the resultant
character data is:
"AT&T is a winning company," he said.
After the parser has substituted for entity references, the character data
that remains is parsed character data, which is referred to as #PCDATA in
DTDs and always refers to the textual content of elements. Character data
that is not parsed is called CDATA in DTDs; this relates exclusively to
attribute values.
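The substitution step can be watched in a small sketch, assuming Python's standard xml.etree.ElementTree module and assuming the paragraph is written with the predefined entity references &quot; and &amp;. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Source text as it would appear in the document, entity references included.
SOURCE = "<p>&quot;AT&amp;T is a winning company,&quot; he said.</p>"

paragraph = ET.fromstring(SOURCE)

# The parser has already replaced the references with the characters
# they stand for, so this is the parsed character data.
text = paragraph.text

print(text)
```

The string held in text contains a literal ampersand and literal quotation marks; the references exist only in the source form of the document.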
XML Family of Specifications: A Practical Guide