Web Developer's Virtual Library: Encyclopedia of Web Design Tutorials, Articles and Discussions


Similar Yet Unrelated

June 26, 2000

Another very common problem results from sites that have a lot of pages with similar content. There are many reasons why such sites exist, and not all of them are attempts at spamdexing. Consider an online discussion group. Some discussion groups create a separate page for every single post, using the subject line of the post as the page title. So if a discussion thread goes on for a hundred posts (as oh so many do), there will be a hundred pages all with similar titles, each containing a wee bit of information. And all hundred of those darn pages will come up when you do a search, crowding out other sites that may have a lot more information. Not that the discussion group posts aren't worthwhile, but you don't need an individual link to each one of them. What would really be useful is a single link to the home page of the discussion group, so you could go and check out the whole thread in its proper context.

The problem of pages showing up out of context is a multi-faceted one. The home page of a site is no more likely to come up in search results than is a minor page ten levels down. A lot of times a search engine leads you to a page which is part of a larger site, but it's unclear what site it is, or how to get to the rest of it. Poor site design makes this problem a lot worse. Good designers generally use a uniform navbar throughout a site, so that if someone chances upon any page, it will be clear what site they're on, and they'll be able to click right to the home page. But, to put it mildly, not all designers follow this sound rule.

Yet another problem arises from the fact that a word or phrase often refers to two or more unrelated topics. For example, "Kansas" is both a US state and a musical group. If you search for "blues", how does a search engine know whether you're interested in blues music, the sports team called the "Blues", or the children's TV show "Blue's Clues"? Or perhaps you "have the blues" and want to see a shrink? Alas, current search algorithms have no way of dealing with this, so any time that you search for a phrase that has more than one meaning (which is often), a large portion of the results will be completely irrelevant.

The lowly card catalog solves this problem easily. If you look up "Kansas" in your local library, you'll find two entries:

Kansas (US State)
Kansas (Musical Group)

And maybe more besides.
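To make the card catalog idea concrete, here is a minimal sketch in Python. It assumes a tiny, hand-built catalog in which every ambiguous heading maps to a list of qualified entries. The `CATALOG` contents and the `lookup` function are hypothetical names made up for illustration, and the "blues" entries are simply invented from the examples above. No real library system or search engine works from this exact structure.

```python
# Hypothetical catalog: each ambiguous heading maps to its qualified entries.
# The entries are illustrative, not taken from any real library catalog.
CATALOG = {
    "kansas": ["Kansas (US State)", "Kansas (Musical Group)"],
    "blues": ["Blues (Music)", "Blues (Sports Team)", "Blue's Clues (TV Show)"],
}


def lookup(term):
    """Return the qualified entries for a heading, or an empty list if none."""
    return CATALOG.get(term.strip().lower(), [])


if __name__ == "__main__":
    # The searcher sees every meaning up front and picks the one intended.
    for entry in lookup("Kansas"):
        print(entry)
```

The point of the sketch is that the qualifier in parentheses is stored with the entry itself, so the searcher chooses a meaning before any results are shown, rather than wading through irrelevant hits afterwards.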

Of course, the Internet is nowhere near as well organized as a decent small-town library. Guess what? It's not organized at all! There's no other medium that's as chaotic as the Internet. Books, sound recordings, and all other media enjoy standardized systems of classification. The Internet has none. For example, you can go to any library and search for a book by title, author, subject matter, or publisher. On Amazon.com or any other self-respecting book site, you can search by many other criteria as well. If you really get stumped, there's a well-known standard reference, called Books in Print, which purports to list every single book there is.

Need to find a magazine? Try the Reader's Guide to Periodical Literature. Looking for a sound recording? Go to any decent record store and browse through the Phonolog, which indexes just about every recording ever made, by artist, song title and publisher. Movies, retail goods (UPC codes), military equipment (Jane's) and even people (Who's Who) have all been meticulously classified, indexed and analyzed to death by standardized systems that everyone knows about, and accepts as the authority for a particular type of media.

Of course, this is exactly what the major search sites should be, but are not, alas. Neither a search engine nor a Web directory qualifies as a true classification system. A search engine basically just searches for keywords, which has all the problems that we've already harped upon. A directory categorizes sites according to the (one hopes sound) judgement of its editors, which is a better way, but too labor-intensive to keep up with the swashbuckling Web. A real classification system lets you search for sites by author, date, publisher, regional focus, type of media, and other criteria. It also describes relationships among documents, so that (for example) you can tell if a particular Web page is part of a larger site.

Why, oh why doesn't the Web have such a classification system? Oh woe is us... But wait! There is such a thing. In fact, there's a W3C-approved system that's been around for quite a while. The Resource Description Framework (RDF), together with Dublin Core Metadata, provides a powerful, flexible way of classifying Web pages (or just about anything else), and neatly solves every single one of the search engine problems we've discussed.

The Dublin Core is a standard set of elements that describe a document, or to use their own description, it's "a metadata element set intended to facilitate discovery of electronic resources." Using the appropriate elements, details about the authorship and applicability of documents, and the relationships among different documents, can be recorded in a standardized way that is machine-searchable.
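As a rough illustration of what such a description might look like, here is a small Python sketch. It holds a few Dublin Core elements (title, creator, date, coverage, relation) in a dictionary and renders them as HTML meta tags using the common "DC." name prefix. The record values, the example.com URL and the `to_meta_tags` helper are all assumptions invented for this example. Treat it as one plausible encoding, not as the definitive way to publish Dublin Core.

```python
from html import escape

# Hypothetical record for a single discussion-group post.
# Keys are Dublin Core element names; every value is made up.
record = {
    "title": "Re: Are search engines dead?",
    "creator": "A. Poster",
    "date": "2000-06-26",
    "coverage": "Kansas",
    "relation": "http://example.com/discussion/",
}


def to_meta_tags(record):
    """Render each element as an HTML meta tag with a 'DC.' name prefix."""
    return "\n".join(
        f'<meta name="DC.{escape(name)}" content="{escape(value)}">'
        for name, value in record.items()
    )


if __name__ == "__main__":
    print(to_meta_tags(record))
```

Because every page would use the same element names, a search tool could read the date, the author or the parent site out of any page without guessing at its layout.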

Wouldn't you like to be able to search only for documents newer than a certain date? Or restrict a search to sites that serve a particular geographical area? Or, when a search turns up some long-ago discussion group post, to instantly find the home page and FAQ list of the discussion group? All of these things are completely impossible with current search engine technology, but would be a snap if something like the Dublin Core were widely supported.
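Here is a hedged sketch of how those three searches might work once the metadata exists. It assumes each page has already been reduced to a dictionary of Dublin Core elements, that dates are written in ISO form (YYYY-MM-DD), and that the "relation" element holds the URL of the site the page belongs to. The `RECORDS` list and the `search` and `home_page` functions are hypothetical, and the sample data is invented. A real search engine would need far more than this.

```python
from datetime import date

# Hypothetical, hand-made metadata records; none of these pages exist.
RECORDS = [
    {"title": "Old post", "date": "1997-03-01", "coverage": "Kansas",
     "relation": "http://example.com/discussion/"},
    {"title": "Recent post", "date": "2000-05-15", "coverage": "Kansas",
     "relation": "http://example.com/discussion/"},
    {"title": "Recent article", "date": "2000-06-01", "coverage": "Ohio",
     "relation": "http://example.org/"},
]


def search(records, newer_than=None, coverage=None):
    """Keep records dated after newer_than and matching coverage, if given."""
    results = []
    for rec in records:
        if newer_than and date.fromisoformat(rec["date"]) <= newer_than:
            continue
        if coverage and rec.get("coverage") != coverage:
            continue
        results.append(rec)
    return results


def home_page(record):
    """Follow the 'relation' element to the site this page is part of."""
    return record.get("relation")


if __name__ == "__main__":
    for rec in search(RECORDS, newer_than=date(2000, 1, 1), coverage="Kansas"):
        print(rec["title"], "is part of", home_page(rec))
```

Each filter is a plain comparison on a named field, which is exactly what keyword matching on page text cannot offer.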




