XML: Structuring Data for the Web: An Introduction
May 3rd 1998
Fixing the Web
If it ain't broke, don't fix it. That's what some people might
say with respect to the World Wide Web. After all, millions of people
surf the Web every day -- from homemakers searching for a recipe, to
investors seeking the latest stock quotes, to students researching the
assassination of Abraham Lincoln, to readers purchasing the latest
novel from an on-line e-commerce site. The Web works well for them.
Or does it? Let's kick the tires a bit and see if the Web can take it.
Here are a few of the problems that hinder the current Web and beg for
solutions with the Next Generation of Internet technologies.
- HTML standards change too slowly - During most of the Web's
history, there have been essentially only two versions of the HTML
specification, HTML 2.0 and HTML 3.2. (HTML 1.0 pre-dates most Web
sites, and HTML 4.0, although the current standard as of December 18,
1997, is only slowly appearing on popular sites.) When HTML 3.2 was
finally approved in January 1997, it was more of a rubber stamp of
then-current practices than an innovation, since nearly all of the
elements it defined had been in use unofficially for as long as a
year. It simply took too long for the World Wide Web Consortium (W3C)
to agree on the specification (presumably due largely to the
browser-specific extensions discussed next).
- Browser-specific tags ("extensions") - Prior to HTML 3.2,
Netscape and Microsoft began the unfortunate practice of introducing
their own extensions to the language. This was an endless cause of
headaches for content developers, who struggled to make their pages
accessible to all users while needing (wanting?) to use the latest
features introduced by the browser vendors. Less ambitious authors
succumbed to the "This site best viewed with {Netscape/Microsoft}"
virus, which has contributed to some truly horrible sites. These
authors forgot that the Web isn't truly "World Wide" if authors
entrench themselves in different camps and embrace extensions which
aren't universally supported.
- Can't mark up data in any meaningful way - HTML was
originally intended to provide a simple way to mark up any type of
document to reflect its structure (title, major headings, minor
headings, lists, and so on) as well as some stylistic aspects (bold,
italics, and so forth). Add to this the hypertext linking
capability HTML offered, as well as browser support for a long list
of MIME types, and it isn't hard to understand the phenomenal rate at
which the Web developed, especially since Web authoring fell within
the capabilities of grade school students. HTML was (and still is)
great for marking up documents. However, businesses and scientists
also need to exchange data. A new language is needed to express the
hierarchical relationship of data values, such as that which is
represented by database records and object hierarchies. HTML reflects
structure and presentation, but conveys nothing about the meaning
of the marked-up document.
- Browser paradigm is too constraining - With the advent of
Java and JavaScript, the Web browser quickly became far more than
merely a tool for surfing the Web; it became the launcher of
applications. However, the browser often gets in the way. Customers
want applications that look and feel more like their familiar desktop
applications, such as spreadsheets. While MIME type content handling
helps in this regard, there are times when the browser paradigm just
doesn't make sense. Even if you can "lose the chrome" (i.e.,
browser menus and controls), sometimes there is a need to pass
information between two or more cooperating applications. What we
really need are Web-enabled applications (programs that
understand common Internet protocols such as HTTP) so we can access
Web resources without using a browser at all. (This is not science
fiction; companies such as webMethods, Inc. have already achieved
this goal.)
- Search engines return far too many hits - Unless you become
a master of your favorite search engines by learning their similar
yet annoyingly different query syntax,
you'll undoubtedly receive hundreds or thousands more hits than you
have time or patience to examine. If you're incredibly lucky
(or skillful), the reference you're looking for may be in the first
page or two of results -- but don't count on it. The problem is that
search engines typically can index only word frequency, document
titles, and, in some cases, meta tags that describe the contents of a
page. What is needed is a way to mark up the significant portions of a
document and to convey the semantics of documents so search engines can
ignore all of the noise and focus instead on the signal. Sometimes
searches require a finer granularity of control than most search
engines permit. For example, how would you search for books written
by Paul McCartney, rather than books that refer to him, the
Beatles, or Wings? If the words "Paul McCartney" could be
tagged as <AUTHOR> to indicate a specific meaning, such
finely-tuned searches would become possible.
- Can't specify collections of related pages - It is often
the case that you encounter a Web page which is obviously part of a
larger collection. If you're lucky enough to find a link to a table
of contents, a home page, or some other means of listing the
collection, then you're halfway there. But how do you print the
collection? (Current answer: one HTML file at a time.) There has to be
a better way to express the interrelationship of a set of pages so
they can be processed as a group. We need to be able to attach
metadata ("information about information" or "machine understandable
information") to Web pages to express interrelationships.
- One-way linking is somewhat limited - Although the Web's
current one-way hypertext link capability has proven extremely useful,
did you know far more flexible schemes have existed for many years in
the publishing industry? Since 1992,
Hypermedia/Time-based Structuring Language (HyTime) and the
Text Encoding Initiative (TEI) have enabled publishers to express
complex link relationships, such as links with multiple targets,
multi-directional links, and automatically updated link databases.
We need a richer linking language for the Web.
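
To make the search-engine point above concrete, here is a minimal
sketch in Python of how tagging a name as <AUTHOR> lets a program
separate "books written by" from "books that mention". The catalog,
its element names other than AUTHOR (CATALOG, BOOK, TITLE, SUBJECT),
the placeholder titles, and both function names are assumptions made
up for this illustration; they do not come from any real vocabulary
or search engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog. Only the AUTHOR tag comes from the article;
# every other element name and all titles are illustrative placeholders.
CATALOG_XML = """
<CATALOG>
  <BOOK>
    <TITLE>Example Book One</TITLE>
    <AUTHOR>Paul McCartney</AUTHOR>
  </BOOK>
  <BOOK>
    <TITLE>Example Book Two</TITLE>
    <AUTHOR>Jane Doe</AUTHOR>
    <SUBJECT>Paul McCartney</SUBJECT>
  </BOOK>
</CATALOG>
"""


def titles_by_author(xml_text, author):
    """Return titles of books whose AUTHOR element equals `author`."""
    root = ET.fromstring(xml_text)
    return [
        book.findtext("TITLE")
        for book in root.iter("BOOK")
        if any(a.text == author for a in book.findall("AUTHOR"))
    ]


def titles_mentioning(xml_text, phrase):
    """Return titles of books containing `phrase` anywhere in their text.

    This imitates a plain word-based search that ignores the tags.
    """
    root = ET.fromstring(xml_text)
    return [
        book.findtext("TITLE")
        for book in root.iter("BOOK")
        if phrase in "".join(book.itertext())
    ]


if __name__ == "__main__":
    print(titles_by_author(CATALOG_XML, "Paul McCartney"))
    print(titles_mentioning(CATALOG_XML, "Paul McCartney"))
```

The tag-aware query matches only the first book, while the plain text
search also matches the second, where the name appears as a subject
rather than as the author.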
HTML 3.2 together with CGI scripts, Java applets, and JavaScript
(and its derivatives), plus plug-ins such as Shockwave, RealPlayer,
and QuickTime provide Web authors and commercial sites with a rich
array of techniques for displaying content that is visually
compelling and possibly even informative. However, these techniques
do little, if anything, for the representation of structured data
unless one introduces middleware solutions.
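
As a rough illustration of what "representation of structured data"
could look like, the following Python sketch turns a nested record
(the kind of hierarchy found in database records and object
hierarchies) into nested, self-describing elements. The record, its
field names, and the helper name to_element are all hypothetical,
chosen only for this example, and this is one possible mapping rather
than a prescribed one.

```python
import xml.etree.ElementTree as ET


def to_element(tag, value):
    """Convert a nested dict/list/scalar into an element tree.

    Dicts become child elements, lists become repeated elements with
    the same tag, and anything else becomes element text.
    """
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for child_tag, child_value in value.items():
            items = child_value if isinstance(child_value, list) else [child_value]
            for item in items:
                elem.append(to_element(child_tag, item))
    else:
        elem.text = str(value)
    return elem


# Hypothetical order record; all names and values are placeholders.
order = {
    "customer": {"name": "A. Example", "city": "Springfield"},
    "item": [
        {"sku": "X-100", "quantity": 2},
        {"sku": "Y-200", "quantity": 1},
    ],
}

root = to_element("order", order)
xml_text = ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    print(xml_text)
```

Because each value is wrapped in a tag that names it, a receiving
program can pull out, say, every sku without knowing anything about
how the data would be displayed.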