Similar Yet Un-related
June 26, 2000
Another very common problem results from sites that have a lot of pages
with similar content. There are many reasons why such sites exist, and
not all of them are attempts at spamdexing. Consider an online
discussion group. Some discussion groups create a separate page for
every single post, using the subject line of the post as the page
title. So if a discussion thread goes on for a hundred posts (as oh so
many do), there will be a hundred pages all with similar titles, each
containing a wee bit of information. And all hundred of those darn
pages will come up when you do a search, crowding out other sites that
may have a lot more information. Not that the discussion group posts
aren't worthwhile, but you don't need an individual link to each one
of them. What would really be useful is a single link to the home page
of the discussion group, so you could go and check out the whole
thread in its proper context.
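
To make the idea concrete, here is a minimal Python sketch of collapsing a pile of per-post results into one link per site. The function name and the example.com URLs are hypothetical, and the sketch assumes the discussion group's home page sits at the root of its host, which is not always true.

```python
from urllib.parse import urlsplit

def collapse_by_site(urls):
    """Keep one home-page link per host, in first-seen order."""
    seen = {}
    for url in urls:
        parts = urlsplit(url)
        # Assumption: the site's home page is the root of the host.
        seen.setdefault(parts.netloc, f"{parts.scheme}://{parts.netloc}/")
    return list(seen.values())

results = [
    "http://forum.example.com/post/1",
    "http://forum.example.com/post/2",
    "http://other.example.org/article",
]
print(collapse_by_site(results))
```

A hundred posts from the same discussion group would shrink to a single entry, leaving room for other sites in the results.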
The problem of pages showing up out of context is a multi-faceted one.
The home page of a site is no more likely to come up in search results
than is a minor page ten levels down. A lot of times a search engine
leads you to a page which is part of a larger site, but it's unclear
what site it is, or how to get to the rest of it. Poor site design
makes this problem a lot worse. Good designers generally use a uniform
navbar throughout a site, so that if someone chances upon any page, it
will be clear what site they're on, and they'll be able to click right
to the home page. But, to put it mildly, not all designers follow this
sound rule.
Yet another problem arises from the fact that a word or phrase often
refers to two or more unrelated topics. For example, "Kansas" is both
a US state and a musical group. If you search for "blues", how does a
search engine know whether you're interested in blues music, the
sports team called the "Blues", or the children's TV show
"Blue's Clues"? Or perhaps you "have the blues" and want to see a
shrink? Alas, current search algorithms have no way of dealing with
this, so any time that you search for a phrase that has more than one
meaning (which is often), a large portion of the results will be
completely irrelevant.
The lowly card catalog solves this problem easily. If you look up
"Kansas" in your local library, you'll find two entries:
Kansas (US State)
Kansas (Musical Group)
And maybe more besides.
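
What the card catalog does can be sketched in a few lines of Python. The entries and the look_up function below are made up for illustration; the point is only that each heading carries a qualifier, so one search term maps to several clearly labeled senses.

```python
# Hypothetical card-catalog headings: (term, qualifier) pairs.
CATALOG = [
    ("Kansas", "US State"),
    ("Kansas", "Musical Group"),
    ("Blues", "Music"),
    ("Blues", "Sports Team"),
]

def look_up(term):
    """Return every qualified heading whose term matches, ignoring case."""
    wanted = term.strip().lower()
    return [f"{t} ({q})" for t, q in CATALOG if t.lower() == wanted]

print(look_up("kansas"))
```

Instead of guessing which meaning you want, the catalog shows you the senses it knows about and lets you pick.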
Of course, the Internet is nowhere near as well organized as a decent
small-town library. Guess what? It's not organized at all! There's no
other medium that's as chaotic as the Internet. Books, sound recordings,
and all other media enjoy standardized systems of classification. The
Internet has none. For example, you can go to any library and search
for a book by title, author, subject matter, or publisher. On
Amazon.com or any other self-respecting book site, you can search by
many other criteria as well. If you really get stumped, there's a
well-known standard reference, called Books in Print, which
purports to list every single book there is.
Need to find a magazine? Try the Reader's Guide to Periodical
Literature. Looking for a sound recording? Go to any decent record
store and browse through the Phonolog, which indexes just about
every recording ever made, by artist, song title and publisher. Movies,
retail goods (UPC codes), military equipment (Jane's) and even
people (Who's Who) have all been meticulously classified,
indexed and analyzed to death by standardized systems that everyone
knows about, and accepts as the authority for a particular type of
media.
Of course, this is exactly what the major search sites should be, but
are not, alas. Neither a search engine nor a Web directory qualifies
as a true classification system. A search engine basically just
searches for keywords, which has all the problems that we've already
harped upon. A directory categorizes sites according to the (one hopes)
sound judgement of its editors - a better way, but too labor-intensive to
keep up with the swashbuckling Web. A real classification
system lets you search for sites by author, date, publisher, regional
focus, type of media, and other criteria. It also describes
relationships among documents, so that (for example) you can tell if a
particular Web page is part of a larger site.
Why, oh why doesn't the Web have such a classification system? Oh, woe is
us... But wait! There is such a thing. In fact, there's a W3C-approved
system that's been around for quite a while. The Resource Description
Framework (RDF), together with Dublin Core Metadata, provides a
powerful, flexible way of classifying Web pages (or just about
anything else), and neatly solves every single one of the search engine
problems we've discussed.
The Dublin Core is a standard set of elements that describe a document,
or to use their own description, it's "a metadata element set intended
to facilitate discovery of electronic resources." Using the appropriate
elements, details about the authorship and applicability of documents,
and the relationships among different documents, can be recorded in a
standardized way that is machine-searchable.
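
As a rough sketch of what such a record might look like, the Python below lists the fifteen Dublin Core element names and renders a record as HTML meta tags. The "DC." name prefix follows one common convention for embedding Dublin Core in HTML (RDF is another route); the dc_meta_tags function and the sample values are hypothetical.

```python
from html import escape

# The fifteen elements of the Dublin Core element set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

def dc_meta_tags(record):
    """Render a dict of Dublin Core values as a list of HTML <meta> tags."""
    unknown = set(record) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return [
        f'<meta name="DC.{name}" content="{escape(record[name], quote=True)}">'
        for name in DC_ELEMENTS
        if name in record
    ]

# Hypothetical record for a single discussion group post.
post = {
    "title": "Re: Best blues albums?",
    "date": "2000-06-26",
    "relation": "http://forum.example.com/",
    "coverage": "Kansas",
}
print("\n".join(dc_meta_tags(post)))
```

Here the relation element points back at the discussion group's home page, which is the kind of link between documents that a keyword index cannot see.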
Wouldn't you like to be able to search only for documents newer than a
certain date? Or restrict a search to sites that serve a particular
geographical area? Or, when a search turns up some long-ago discussion
group post, to instantly find the home page and FAQ list of the
discussion group? All of these things are completely impossible with
current search engine technology, but would be a snap if something
like the Dublin Core were widely supported.
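
Here is a minimal sketch of how those three searches could work if pages carried such metadata. The tiny index, the URLs, and the search and parent_of functions are all hypothetical; a real engine would be far more elaborate, but the filtering logic would be about this simple.

```python
from datetime import date

# Hypothetical index: one dict of Dublin Core style fields per page.
PAGES = [
    {"identifier": "http://forum.example.com/",
     "title": "Blues Forum Home", "date": date(1999, 1, 10),
     "coverage": "Kansas"},
    {"identifier": "http://forum.example.com/post/57",
     "title": "Re: Best blues albums?", "date": date(2000, 5, 2),
     "coverage": "Kansas", "relation": "http://forum.example.com/"},
    {"identifier": "http://hockey.example.org/",
     "title": "Blues season preview", "date": date(1998, 9, 1),
     "coverage": "Missouri"},
]

def search(pages, word, newer_than=None, coverage=None):
    """Keyword match on the title, narrowed by optional metadata filters."""
    hits = []
    for page in pages:
        if word.lower() not in page["title"].lower():
            continue
        if newer_than is not None and page["date"] <= newer_than:
            continue
        if coverage is not None and page.get("coverage") != coverage:
            continue
        hits.append(page)
    return hits

def parent_of(pages, page):
    """Follow the 'relation' field to the page it points at, if indexed."""
    target = page.get("relation")
    return next((p for p in pages if p["identifier"] == target), None)

recent = search(PAGES, "blues", newer_than=date(2000, 1, 1))
print([p["title"] for p in recent])
print(parent_of(PAGES, recent[0])["title"])
```

The date and coverage filters narrow a keyword search, and the relation field leads from a stray post straight to the home page of its discussion group.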