Web Developer's Virtual Library: Encyclopedia of Web Design Tutorials, Articles and Discussions

The Web Librarian

Locating information on the web is becoming more and more problematic. Search engines overwhelm users with vast quantities of information, much or most of which is not precisely what was wanted; and browsable catalogs ('virtual libraries') take a lot of time to use, and now often index only a fraction of the relevant material.

An automated 'Web Librarian' (WLn) should help address these problems. Traditional notions of simple hierarchical classification need to be augmented or replaced with more powerful methods, e.g. concept analysis and faceted classification. Authors and publishers should be encouraged to provide resource meta-information.

A WLn is a synthesis of humans and robots, databases, FAQs, and smart software, designed to enable people to receive precise answers to precisely formulated queries.

Current Problems

Finding Stuff

Whenever we develop a new skill or extend an old one, we have to emphasize the relative importance of some aspects and features over others. We can then place these into neat levels only when we discover systematic ways to do so. Then our classifications can resemble level-schemes and hierarchies. But the hierarchies always end up getting tangled and disorderly because there are also exceptions and interactions to each classification scheme.
-- Marvin Minsky, The Society of Mind

If you want to find something on the web, what do you do: browse a catalog, use a search engine, or something else? I find I'm using Alta Vista increasingly - even for web development topics (my VL). Why? Because 1) it's fast, and 2) even at nearly 2,000 entries, my VL only catalogs a fraction of the relevant material available on the web. The drawback is that I may get swamped with results that have to be carefully screened for relevance.

For example, I tried to locate any web information on Faceted Classification (more on this later). First, I tried the WWW Virtual Library. I looked in "Information Science" - only 3 links! There is, apparently, no "Library Science" (!); "Libraries" just points into Libraries. I expected to find "Software Engineering" under something like "Computers" but no, it's under "Engineering": an example of the problem of strict hierarchy. I then found a couple of pointers to software reusability, but nothing on faceted classification. After 20 minutes I gave up.

Next, I entered "faceted classification" into Alta Vista (with the quotes). Within seconds I had 155 results, and a few minutes of checking through them confirmed I had got some good hits.

Of course, this may have been just an unfortunate example. But the point I want to make is that I believe the concept of the VL as a static browsable hierarchy needs serious rethinking. We've all had great fun putting our hotlists on public display, but web technology is superseding us. I know some will say that these catalogs are hand-crafted by domain experts and are therefore of very high quality. This has some merit - but it's not enough. We have a Library, but no Librarian. Users may browse, but they can't ask for help. There isn't even a card index.

User Entry of Classification Data

Some of the problems with user-entered URLs are:
  • Annotations given are too short, too long, or not very helpful.
  • People enter inappropriate URLs, not related to the catalog's subject matter (I reject 1/3rd of all entries for this alone).
  • People want their listing to appear in multiple categories. This is equivalent to wanting to associate multiple keywords with the entries - which would be a useful extension.
  • People classify their entries poorly. This depends to some extent on how clear and intuitive the classification system is, although some people apparently refuse to spend any time trying to understand it. The greatest number of mis-classifications are from commercial entities seeking to publicise their products and services.
In addition, other means of populating a catalog may also be used: spiders, surfing, reading newsgroups and mailing lists, etc. This could be partially automated by a program that extracts URLs from these sources and compares them with the catalog URLs; those not found can be added to a list for later investigation by a human.
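
The extract-and-compare step might look something like the following minimal sketch. The regular expression is a deliberately simple assumption, not a complete URL grammar, and the function names are hypothetical.

```python
import re

# Assumption: a simple matcher for http, https and ftp URLs; real-world
# text (newsgroup posts, mailing lists) would need a more careful pattern.
URL_PATTERN = re.compile(r"""(?:https?|ftp)://[^\s<>"')\]]+""")


def extract_urls(text):
    """Return the set of URLs found in a block of text."""
    # Trailing punctuation usually belongs to the sentence, not the URL.
    return {match.rstrip(".,;:!?") for match in URL_PATTERN.findall(text)}


def candidates_for_review(source_texts, catalog_urls):
    """Collect URLs seen in the sources that are not yet in the catalog."""
    seen = set()
    for text in source_texts:
        seen |= extract_urls(text)
    # Sorted so that the human reviewer gets a stable list.
    return sorted(seen - set(catalog_urls))
```

The result is only a list of candidates; deciding whether each URL belongs in the catalog is still left to a human.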

Classification Systems

There are three general types of classification schemes: enumerative, synthetic and analytico-synthetic. The enumerative scheme is based on the concept of a universe of knowledge which is divided into successively narrower and more specific subjects. Theoretically, all topics are to be represented. Library of Congress (LC) is an enumerative scheme. A synthetic scheme is one in which new class numbers can be developed for new topics not already listed. The Dewey Decimal Classification (DDC), although primarily enumerative, approaches a synthetic scheme with each revision.

Faceted Classification

Faceted classification is an analytico-synthetic scheme. It is analytic because it subdivides broader elements into single concepts that are clearly defined through facet analysis. It is synthetic in that new elements can be developed. The classification was originated by S.R. Ranganathan in the 1930s with the Colon Classification. Note that the process of facet analysis can also be used to construct thesauri. There is renewed interest in this system, because some believe that older systems such as DDC and LC do not provide enough detail to accurately describe all subjects in all media, may not meet the needs of the individual or special library, may not provide for enough coordination of terms, may require complex or lengthy notation, and are often difficult to use to locate materials.

Basically, the facet development process begins by defining the subject to be covered, by examining existing classifications or thesauri, or titles or objects in the prospective database. The derived topics are broken down into facets, each with a distinct label. Items are organized so that they are in homogeneous, mutually exclusive groups that differ from the main group by one characteristic. Within each facet, subfacets or more specific topics are listed. The breakdown continues into subfacets within subfacets. The items in each subfacet, in general, are ordered from more general to more specific, complex or concrete.

I don't think a hierarchical classification scheme is good enough for a modern web-based catalog of any substantial size. Entries rarely fit exactly into one leaf node. Ruben Prieto-Diaz has proposed "faceted classification" for a reusable software library - a concept he found in library science. In a faceted classification scheme, the facets may be considered to be dimensions in a Cartesian classification space, and the value of a facet is the position of the artifact in that dimension. For software, one might have facets such as "Operand", "Functionality", "Platform", "Language", .... Prieto-Diaz claims that a fixed (and small) number of facets is sufficient for classifying all software.
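
To make the "classification space" idea concrete, here is a minimal sketch in which each entry carries one value per facet and a query fixes a position along any subset of the dimensions. The facet names come from the text above; the entries, the facet values and the names Entry and facet_search are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    title: str
    # Maps a facet name (a dimension) to this entry's value in it.
    facets: dict = field(default_factory=dict)


def facet_search(entries, **wanted):
    """Return the entries that match every requested facet value."""
    return [
        entry for entry in entries
        if all(entry.facets.get(name) == value for name, value in wanted.items())
    ]


# Hypothetical catalog entries, for illustration only.
catalog = [
    Entry("Quicksort routine", {"Functionality": "sort", "Language": "C", "Platform": "Unix"}),
    Entry("Merge sort module", {"Functionality": "sort", "Language": "Pascal", "Platform": "DOS"}),
    Entry("String search", {"Functionality": "search", "Language": "C", "Platform": "Unix"}),
]

c_sorters = facet_search(catalog, Functionality="sort", Language="C")
```

Unlike a strict hierarchy, no facet has to be chosen as the "top" level: the same entry is reachable by language, by platform, or by any combination.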

Implementation of a Web Librarian

At the bare minimum, this is a classification system, database, and means to populate the database. But instead of blindly indexing all the words in a zillion web pages, it should distill or encapsulate domain intelligence and structure. A user's query should not just shoot keywords at an index, but should be "understood" by the librarian, sufficiently that it can direct you to the appropriate library section. The Librarian should be thesaurus-based so that it can suggest synonyms and related concepts. The Librarian should be an active participant in the user's exploration of the library.

Most web indexing systems don't have any provision for the author of a web resource to offer any guidance on how it should be indexed. The WLn system should not only allow, but positively encourage authors to provide some meta-information. Although full text searches of a Web archive are an important way of identifying relevant information, sometimes it can be very useful to base searches on document attributes such as author, keywords, language, etc. The HTML specification defines a special markup element for this purpose: the <META> tag. This tag can be used to augment documents with information that is not normally displayed by browsers. It also provides document authors with a mechanism for identifying information that should be included in the response headers for an HTTP request. The information is stored as attributes of the tag and is not displayed if the document is loaded into a browser. It can however be extracted by servers and clients for use in identifying, indexing, and cataloging documents.
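
As a rough illustration of how an indexer might pull this information out of a page, the sketch below uses the HTMLParser class from Python's standard library. The class and function names are hypothetical, and a real indexer would also have to cope with repeated or malformed tags.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect NAME/HTTP-EQUIV and CONTENT pairs from <META> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("http-equiv")
        content = attributes.get("content")
        if key and content is not None:
            # Later tags with the same key overwrite earlier ones.
            self.meta[key.lower()] = content


def extract_meta(html_text):
    """Return a dict of the meta-information found in an HTML document."""
    parser = MetaExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.meta
```

Feeding it a page whose head contains a tag such as <META NAME="keywords" CONTENT="..."> would yield a "keywords" entry that the catalog could store alongside the URL.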

The full system would be something like:

  • Internet resources (web, ftp, news, ...)
  • resource meta information
  • classification system(s)
  • databases
  • search engines
  • NL parser
  • learning system
  • expert system
  • network of WLns
  • human support

The fundamental architecture of this system would be based on a series of levels where a query might be resolved, rather like memory management levels:

  1. cache
  2. "FAQ"
  3. index db
  4. internet resources
  5. subdomain WLns
  6. human support

The basic algorithm then would be:

  1. Parse query
  2. Identify possible subject domains
  3. Pass on query to other WLns or human if inappropriate
  4. Search levels from top till query resolved
  5. Record answer in the levels above the one where it was resolved.
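
Steps 4 and 5 can be sketched as follows. This is only an illustration under simplifying assumptions: every level is modelled as an in-memory dictionary with hypothetical lookup and store methods, and query parsing, domain identification and forwarding to other WLns or humans (steps 1 to 3) are left out.

```python
class DictLevel:
    """A stand-in for one resolution level (cache, FAQ, index db, ...)."""

    def __init__(self, name, data=None):
        self.name = name
        self.data = dict(data or {})

    def lookup(self, query):
        # Returns None when this level cannot answer the query.
        return self.data.get(query)

    def store(self, query, answer):
        self.data[query] = answer


def resolve(query, levels):
    """Search the levels from the top; record the answer above the hit."""
    for depth, level in enumerate(levels):
        answer = level.lookup(query)
        if answer is not None:
            # Step 5: copy the answer into every level above this one,
            # so the same query is resolved sooner next time.
            for upper in levels[:depth]:
                upper.store(query, answer)
            return answer
    return None  # Unresolved: a real system would fall through to a human.
```

As with memory management, the benefit comes from repetition: a query answered once from the index is answered from the cache afterwards.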

Quality Control

The issues of quality control need consideration, e.g. criteria for acceptance into the catalog; and whether some kind of rating system would be useful. Related to this is the possibility of assigning multiple keywords to each entry, perhaps with relevance weightings so that search results could be sorted to help the user select the entries closest to their needs. Weeding the list must also be done to remove URLs that become misleading, obsolete, or are a lesser quality duplication of another URL.
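
One possible reading of "relevance weightings" is sketched below: each entry maps its keywords to a weight, and a query is scored by summing the weights of the keywords it mentions. The scoring rule and the name rank_entries are assumptions made for illustration, not a scheme specified above.

```python
def rank_entries(entries, query_keywords):
    """Rank catalog URLs by the summed weight of the matching keywords.

    entries maps each URL to a dict of {keyword: relevance weight}.
    URLs that match none of the query keywords are left out.
    """
    scores = {}
    for url, weights in entries.items():
        score = sum(weights.get(keyword, 0.0) for keyword in query_keywords)
        if score > 0:
            scores[url] = score
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))
```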

Navigation

Having collected a lot of good, up-to-date URLs, it is of course essential for users to be able to locate what they need very quickly with high precision and recall. The two main methods are by browsing and searching. The browse hierarchy is currently only two levels deep (excluding the root), and could be deeper. Alternatively, a faceted classification scheme (multiple keywords) could replace the hierarchy with a directed acyclic graph permitting multiple links from category parents, which would improve the likelihood that relevant entries are found from a given starting point.

Database Design

Information about the resources will be stored in a relational database. The following information may be used to search the database:
  • Title: The name of the object. This will normally be the Title as given in the HEAD of the HTML file.
  • URL: The content is a URL to fetch an instance of the resource. String or number used to uniquely identify this object. This is the key field and must be unique.
  • Author: The person(s) and/or organization(s) primarily responsible for the intellectual content of the work.
  • Abstract: A description or annotation of the object.
  • Publisher: The agent or agency responsible for making the object available.
  • Date: The date of publication.
  • Other Agent: Other person(s) and/or organization(s), such as editors, transcribers, sponsors, etc. who have made significant contributions to the work. Author and Publisher are special cases of OtherAgent.
  • Keywords: The abstract category of the object defined by a fixed set of keywords. The keywords are partitioned into the following facets:
      • Function: The main activity in which the object applies.
      • Context: The setting or environment in which the object is used.
      • Object: The object itself.
      • Medium: Stuff the object is built from.
  • Type: The particular manifestation or data representation of the object, such as PostScript file or Windows executable. For URCs, form will typically be specified as an Internet Media Type - formerly known as the MIME Content-type.
  • Relation: Relationship to other objects. This element should identify the role of the relationship, as well as the related objects.
  • Status: An indicator for the state of the object in the db, e.g. new entry; to be deleted; etc.
  • Language: Natural language of the intellectual content.
  • Email: Electronic address of the resource maintainer.
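
One way these fields could map onto relational tables is sketched below using SQLite. The table layout, column names and helper functions are assumptions made for illustration; in particular, putting keywords in a separate table keyed by URL and facet is just one way to allow several keywords per entry.

```python
import sqlite3

# Assumed layout: one row per resource, plus one row per (facet, keyword).
SCHEMA = """
CREATE TABLE resource (
    url         TEXT PRIMARY KEY,   -- the key field; must be unique
    title       TEXT,
    author      TEXT,
    abstract    TEXT,
    publisher   TEXT,
    date        TEXT,
    other_agent TEXT,
    type        TEXT,               -- e.g. an Internet Media Type
    relation    TEXT,
    status      TEXT,               -- e.g. 'new entry', 'to be deleted'
    language    TEXT,
    email       TEXT
);
CREATE TABLE keyword (
    url   TEXT NOT NULL REFERENCES resource(url),
    facet TEXT NOT NULL
          CHECK (facet IN ('Function', 'Context', 'Object', 'Medium')),
    value TEXT NOT NULL,
    PRIMARY KEY (url, facet, value)
);
"""


def create_catalog(path=":memory:"):
    """Open a database and create the catalog tables in it."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def find_by_facet(conn, facet, value):
    """Return (url, title) pairs for resources tagged with a facet value."""
    rows = conn.execute(
        "SELECT r.url, r.title FROM resource r "
        "JOIN keyword k ON k.url = r.url "
        "WHERE k.facet = ? AND k.value = ? ORDER BY r.url",
        (facet, value),
    )
    return rows.fetchall()
```

Because the URL is the primary key, an attempt to add the same resource twice is rejected by the database rather than silently duplicated.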

Bibliography


Web Librarian Puts Tools In Designers' Hands, in WebWeek.

Christian Neuss and Robert E. Kent, Conceptual Analysis of Resource Meta-information.

R. Prieto-Diaz and P. Freeman, Classifying Software for Reusability. IEEE Software, 4(1):6-16, January 1987.

Ron Daniel and Michael Mealling, An SGML-based URC Service.

D. Cohen, A Format for E-Mailing Bibliographic Records, RFC 1357.

Stuart Weibel, Jean Godby, Eric Miller, and Ron Daniel (eds), OCLC/NCSA Metadata Workshop Report.

