HyperText Transfer Protocol
The innovations that Berners-Lee added to the Internet to create the World Wide Web
had two fundamental dimensions: connectivity and interface. He invented a new protocol
for the computers to speak as they exchanged hypermedia documents. This Hypertext
Transfer Protocol (HTTP) made it very easy for any computer on the Internet to safely
offer up its collection of documents into the greater whole; using HTTP, a computer that
asked for a file from another computer would know, when it received the file, if it was a
picture, a movie, or a spoken word. With this feature of HTTP, the Internet began to reflect
an important truth - retrieving a file's data is almost useless unless you know what kind of data it is.
In a sea of Web documents, it's impossible to know in advance what a document is -
it could be almost anything - but the Web understands "data types" and passes that information along.
- Mark Pesce, "VRML - Browsing and Building Cyberspace", New Riders Publishing, 1995.
Although an understanding of HTTP is not strictly necessary for the
development of CGI applications, some appreciation of "what's under
the hood" will certainly help you to develop them with more fluency
and confidence. As with any field of endeavour, a grasp of the
fundamental underlying principles allows you to visualise the
structures and processes involved in the CGI transactions between
clients and servers - giving you a more
comprehensive mental model on which to base your programming.
Underlying the user interface presented by browsers is the network,
along with the protocols that travel the wires to the servers or "engines"
that process requests and return the various media. The protocol of
the web is known as HTTP, for
HyperText Transfer Protocol.
HTTP is the underlying mechanism on which CGI operates,
and it directly determines
what you can and cannot send or receive via CGI.
Tim Berners-Lee implemented the HTTP protocol in 1990-1 at CERN,
the European Center for High-Energy Physics in Geneva, Switzerland.
HTTP stands at the very core of the World Wide Web.
According to the HTTP 1.0 specification,
The Hypertext Transfer Protocol (HTTP) is an application-level protocol
with the lightness and speed necessary for distributed, collaborative,
hypermedia information systems. It is a generic, stateless,
object-oriented protocol which can be used for many tasks,
such as name servers and distributed object management systems,
through extension of its request methods (commands).
A feature of HTTP is the typing and negotiation of data representation,
allowing systems to be built independently of the data being
transferred.
HTTP Properties
A comprehensive addressing scheme
The HTTP protocol uses the concept of reference provided by the Universal
Resource Identifier
(URI) as a location (URL) or name (URN),
for indicating the resource on which a method is to be applied.
When an HTML hyperlink is
composed, the URL (Uniform Resource Locator) is of the general
form http://host:port-number/path/file.html.
More generally, a URL reference is of the type
service://host/file.file-extension and in this way, the
HTTP protocol can subsume the more basic Internet services.
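As a rough illustration of the general form above, the following Python sketch splits a URL into its parts with the standard library's urllib.parse. The host name, port, and path are made up for the example.

```python
from urllib.parse import urlsplit

# Hypothetical URL following the general form http://host:port-number/path/file.html
url = "http://www.example.com:8080/docs/index.html"

parts = urlsplit(url)
print(parts.scheme)    # the service, here "http"
print(parts.hostname)  # the host
print(parts.port)      # the port number (None if the URL omits it)
print(parts.path)      # the path and file on that host
```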
HTTP/1.0 is also used for communication between user agents and
various gateways,
allowing hypermedia access to existing Internet protocols like SMTP,
NNTP, FTP, Gopher, and WAIS.
HTTP/1.0 is designed to allow communication with such gateways,
via proxy servers,
without any loss of the data conveyed by those earlier protocols.
Client-Server Architecture
The HTTP protocol is based on a request/response paradigm.
The communication generally takes place over a TCP/IP connection
on the Internet.
The default port is 80, but other ports can be used.
This does not preclude the HTTP/1.0 protocol from being implemented
on top of any other protocol on the Internet,
so long as reliability can be guaranteed.
A requesting program (a client) establishes a connection
with a receiving program (a server) and sends a request to
the server in the form of a request method, URI, and protocol version,
followed by a message containing request modifiers, client
information, and possible body content.
The server
responds with a status line, including its protocol version and a success
or error code, followed by a message containing server information,
entity metainformation, and possible body content.
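The shape of a request described above can be sketched in a few lines of Python. This is only an illustration of the message layout: the function name build_request and the User-Agent value are assumptions made for the example, and nothing is actually sent over the network.

```python
def build_request(method, uri, headers, body=""):
    """Assemble an HTTP/1.0 request: request line, header lines, blank line, body."""
    lines = [f"{method} {uri} HTTP/1.0"]  # request method, URI, protocol version
    lines += [f"{name}: {value}" for name, value in headers]
    # An empty line separates the headers from the optional body.
    return "\r\n".join(lines) + "\r\n\r\n" + body


request = build_request(
    "GET",
    "/Tutorial/HTTP/index.html",
    [("User-Agent", "ExampleClient/0.1"), ("Accept", "text/html")],
)
print(request)
```

A server's response has the same layout, except that the first line is a status line rather than a request line.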
The HTTP protocol is connectionless
Although we have just said that the client establishes a connection
with a server, the protocol is called connectionless because
once the single request has been satisfied, the connection is dropped.
Other protocols typically keep the connection open, e.g. in an FTP
session you can move around in remote directories, and the server
keeps track of who you are, and where you are.
While this greatly simplifies the server construction and relieves it
of the performance penalties of session housekeeping, it means the protocol
itself gives the server no way to track user behaviour, e.g. navigation paths
between local documents. Many, if not most, web documents contain one or more
inline images, and these must be retrieved individually, incurring
the overhead of repeated connections.
The HTTP protocol is stateless
After the server
has responded to the client's request, the connection
between client and server is dropped and forgotten. There is no
"memory" between client connections. The pure HTTP server
implementation treats every request as
if it were brand new, i.e. without context.
CGI applications get around this by encoding the state, or a state identifier, in
hidden fields, in the path information, or in URLs within the form being returned
to the browser. With the first two methods, the state or its id comes back when
the user submits the form; state encoded into hyperlinks (URLs)
comes back only if the user follows the link and the link leads back to
the originating server.
It's often advisable to not encode the whole state but to save it,
e.g. in a file, and identify it by means of a unique identifier, such
as a sequential integer. Visitor counter programs can be adapted very
nicely for this - and thereby become useful. You then only have to
send the state identifier in the form, which is advisable if the state
vector becomes large - saving network traffic. However you then have to
take care of housekeeping the state files, e.g. by periodic clean-up tasks.
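A minimal Python sketch of this approach follows. The directory name, the function names, and the JSON file format are all assumptions made for the example, and a random identifier is used here in place of the sequential integer mentioned above, so that identifiers are harder to guess.

```python
import json
import os
import tempfile
import uuid

# Hypothetical directory for state files; a real application would choose its own.
STATE_DIR = os.path.join(tempfile.gettempdir(), "cgi_state_example")


def save_state(state):
    """Write the state to a file and return the identifier to embed in the form."""
    os.makedirs(STATE_DIR, exist_ok=True)
    state_id = uuid.uuid4().hex  # random id instead of a sequential integer
    with open(os.path.join(STATE_DIR, state_id + ".json"), "w") as f:
        json.dump(state, f)
    return state_id


def load_state(state_id):
    """Read back the state for an identifier returned by the browser."""
    # Accept only hex ids so a forged value cannot point outside STATE_DIR.
    if not state_id or any(c not in "0123456789abcdef" for c in state_id):
        raise ValueError("invalid state id")
    with open(os.path.join(STATE_DIR, state_id + ".json")) as f:
        return json.load(f)


state_id = save_state({"step": 2, "basket": ["book", "pen"]})
# Only the short identifier needs to travel in the form, e.g. as a hidden field.
print('<input type="hidden" name="state" value="%s">' % state_id)
print(load_state(state_id))
```

The periodic clean-up task is not shown; it could, for instance, delete files in the state directory that are older than some chosen age.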
An extensible and open representation for data types
HTTP uses Internet Media Types (formerly referred to as MIME Content-Types)
to provide open and extensible data typing and type negotiation.
For mail applications, where there is no type negotiation between
sender and receiver, it's reasonable to put strict limits on the
set of allowed media types. With HTTP, where the sender and recipient
can communicate directly, applications are allowed more freedom in the
use of non-registered types.
When the client sends a transaction to the server, headers are
attached that conform to standard Internet e-mail
specifications (RFC 822). Most client requests expect an answer
either in plain text or HTML. When the HTTP server transmits
information back to the client, it includes a MIME-like
(Multipurpose Internet Mail Extensions) header to inform the
client what kind of data follows the header.
Translation then depends on the client possessing the
appropriate utility (image viewer, movie player, etc.)
corresponding to that data type.
HTTP Header Fields
An HTTP transaction consists of a header followed optionally by an empty
line and some data. The header will specify such things as the action
required of the server, or the type of data being returned, or a status code.
The use of header fields sent in HTTP transactions gives the protocol
great flexibility. These fields allow descriptive information
to be sent in the transaction, enabling authentication, encryption, and/or
user identification. The header is a block of data preceding the
actual data, and is often referred to as meta information, because
it is information about information.
The header lines received from the client, if
any, are placed by the server into the CGI environment variables with the prefix HTTP_ followed by
the header name.
Any - characters in the header name are changed to _ characters.
The server may exclude any headers which it has already processed,
such as Authorization, Content-type, and Content-length.
If necessary, the server may choose to exclude any or all of these
headers if including them would exceed any system environment limits.
Two examples are the HTTP_ACCEPT variable, which carries the Accept header, and HTTP_USER_AGENT, which carries the User-Agent header.
-
HTTP_ACCEPT
The MIME types which the client will accept, as given by HTTP
headers. Other protocols may need to get this information from
elsewhere. Each item in this list should be separated by commas as
per the HTTP spec.
Format: type/subtype, type/subtype
-
HTTP_USER_AGENT
The browser the client is using to send the request.
General format: software/version library/version.
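The naming rule for these variables can be sketched in Python as follows. The function names are hypothetical. The upper-casing step is not stated in the text above but is implied by the variable names it lists (HTTP_ACCEPT, HTTP_USER_AGENT).

```python
def cgi_variable_name(header_name):
    """Map an HTTP header name to its CGI environment variable name."""
    # Prefix with HTTP_, upper-case the name, and change "-" to "_".
    return "HTTP_" + header_name.upper().replace("-", "_")


def accepted_types(http_accept):
    """Split an HTTP_ACCEPT value into its comma-separated type/subtype items."""
    return [item.strip() for item in http_accept.split(",") if item.strip()]


print(cgi_variable_name("User-Agent"))
print(accepted_types("text/html, image/gif"))
```

In a CGI program written in Python, the resulting name would be looked up in os.environ.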
The server sends back to the client:
- A status code that indicates whether the request was successful or not.
Typical error codes indicate that the requested file was not found,
that the request was malformed, or that authentication is required
to access the file.
- The data itself. Since HTTP is liberal about sending documents
of any format, it is ideal for transmitting multimedia such as graphics, audio, and video
files. This complete freedom to transmit data of any format is one of the most significant
advantages of HTTP and the Web.
-
It also sends back information about the object being returned.
Note that the following is not a complete list of header fields, and that some of them only make sense
in one direction.
The Content-Type header field indicates the media type of the
data sent to the recipient
or, in the case of the HEAD method, the media type that would
have been sent had the request
been a GET.
This field is used by browsers to know how
to deal with the data. The client uses this information to determine
how to handle a video file or an inline graphic.
An example:
Content-Type: text/html
The Date header represents the date and time at which the message
was originated.
An example is
Date: Tue, 15 Nov 1994 08:12:31 GMT
The Expires field gives the
date after which the information in the document ceases to be valid.
Caching clients,
including proxies, must not cache this copy of the resource beyond the date
given, unless its
status has been updated by a later check of the origin server.
Expires: Thu, 01 Dec 1994 16:00:00 GMT
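As an illustration of how a caching client might apply this rule, the sketch below compares an Expires value with the current time. The function name is_expired is an assumption made for the example, and the check covers only the date comparison, not revalidation with the origin server.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_expired(expires_header, now):
    """Return True if a cached copy is past the date given in its Expires header."""
    expires = parsedate_to_datetime(expires_header)
    return now > expires


header = "Thu, 01 Dec 1994 16:00:00 GMT"
print(is_expired(header, datetime(1994, 11, 30, tzinfo=timezone.utc)))
print(is_expired(header, datetime.now(timezone.utc)))
```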
The From header field, if given, should contain an Internet
e-mail address for the human user who controls the requesting user agent.
An example is:
From: Stars@WDVL.com
This header field may be used for logging purposes and as a means for identifying the source
of invalid or unwanted requests. It should not be used as an insecure form of access protection.
The interpretation of this field is that the request is being performed on behalf of the person
given, who accepts responsibility for the method performed. In particular, robot agents should
include this header so that the person responsible for running the robot can be contacted if
problems occur on the receiving end.
The If-Modified-Since header field is used with the
GET method to make it conditional: if the
requested resource has not been modified since the time specified in this
field, a copy of the
resource will not be returned from the server; instead, a 304 (not modified)
response will be returned without any data.
An example of the field is:
If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT
The Last-Modified header field indicates the date and time at which the sender believes the
resource was last modified.
The Last-Modified field is useful for clients that eliminate
unnecessary transfers by using caching.
The exact semantics of this field are defined in terms of how the
receiver should interpret it: if the receiver has a copy of this resource
which is older than the
date given by the Last-Modified field, that copy should be
considered stale.
An example of its use is
Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT
The Location response header field defines the exact location of the
resource that was identified by the request URI.
If the value is a full URL, the server returns a "redirect" to the
client to retrieve the specified object directly.
Location: http://WWW.Stars.com/Tutorial/HTTP/index.html
If you want to reference another file on your own server, you
should output a partial URL, such as the following:
Location: /Tutorial/HTTP/index.html
The server will act as if the client had not requested your script, but instead requested
http://yourserver/Tutorial/HTTP/index.html.
It will take care of all access control, determining the file's type,
etc.
In this case clients don't do the redirection, but the server does it "on the fly".
Important: Only a full URL in the
Location field can contain the #label part of a URL (i.e. the fragment),
because the fragment is meant only for the client side, and the server cannot
handle it in any way.
As an example of actual use, the "Ask Dr.Web" form has a Yes/No toggle after the question
"Did you search the library and read the FAQ?". The default is No, so if the user doesn't
reset this to Yes they will simply be redirected to the FAQ and their question will not be sent.
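# %input is assumed to hold the decoded form fields.
if ($input{'YN'} eq "No") {
    # Redirect to the FAQ; the question is not sent.
    print "Location: http://WWW.Stars.com/Dr.Web/FAQ.html\n\n";
}
else {
    # Normal response: emit the header, then call the Feedback subroutine.
    print "Content-type: text/html\n\n";
    &Feedback;
}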
The Referer request header field allows the client to specify,
for the server's benefit, the address
(URI) of the resource from which the request URI was obtained.
This allows a server to
generate lists of back-links to resources for interest, logging, optimized
caching, etc. It also
allows obsolete or mistyped links to be traced for maintenance.
Example:
Referer: http://WWW.Stars.com/index.html
If a partial URI is given, it should be interpreted relative to the
request URI. The URI must not include a fragment (#label within a document).
The Server response header field contains information about
the software used by the origin
server to handle the request. The field can contain multiple product tokens
and
comments identifying the server and any significant subproducts.
By convention, the product
tokens are listed in order of their significance for identifying the application.
Example:
Server: CERN/3.0 libwww/2.17
The User-Agent field contains information about the user agent
originating the request. This is
for statistical purposes, the tracing of protocol violations, and automated
recognition of user
agents for the sake of tailoring responses to avoid particular user agent
limitations - such as inability to support HTML tables.
By convention, the product tokens are listed in order of their significance
for identifying the application.
Example:
User-Agent: CERN-LineMode/2.15 libwww/2.17b3
HTTP Methods
HTTP/1.0 allows an open-ended set of methods to be used to indicate
the purpose of a request.
The three most often used methods are GET, HEAD, and POST.
The GET method
The GET method is used to ask for a specific document - when you click on a
hyperlink, GET is being used.
GET should probably be used when a URL access will not change
the state of a database (by, for example, adding or deleting information)
and POST should be used when an access will cause a change.
The semantics of the GET method change to a
"conditional GET" if the request message includes an
If-Modified-Since header field.
A conditional GET method requests that the identified
resource be transferred only if it has been modified since the date
given by the If-Modified-Since header.
The conditional GET method is intended to reduce network
usage by allowing cached entities to be refreshed without requiring
multiple requests or transferring unnecessary data.
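The decision a server makes for a conditional GET can be sketched as follows. The function name and the way the resource's modification time is supplied are assumptions made for the example; a real server would also have to cope with malformed dates.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_get_status(last_modified, if_modified_since=None):
    """Return 200 if the resource should be sent, 304 if the client's copy is current."""
    if if_modified_since is None:
        return 200  # plain GET: always send the resource
    since = parsedate_to_datetime(if_modified_since)
    return 200 if last_modified > since else 304


header = "Sat, 29 Oct 1994 19:43:31 GMT"
changed = datetime(1994, 11, 15, 12, 45, 26, tzinfo=timezone.utc)
unchanged = datetime(1994, 10, 1, 0, 0, 0, tzinfo=timezone.utc)
print(conditional_get_status(changed, header))    # modified after the given date
print(conditional_get_status(unchanged, header))  # not modified since the given date
```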
The HEAD method
The HEAD method is used to ask only for information about a
document, not for the document itself. HEAD is much faster than GET, as
a much smaller amount of data is transferred. It's often used by clients
that use caching, to see if the document has changed since it was last accessed.
If it has not, then the local copy can be reused; otherwise the
updated version must be retrieved with a GET.
The metainformation contained in the HTTP headers in response to a HEAD
request should be identical to the information sent in response to a GET request.
This method can be used for obtaining metainformation about the resource
identified by the request URI without transferring the data itself.
This method is often used for testing hypertext links for validity,
accessibility, and recent modification.
The POST method
The POST method is used to transfer data from the client to the server;
it's designed to allow a uniform method to cover functions like:
- annotation of existing resources;
- posting a message to a bulletin board, newsgroup, mailing list,
or similar group of articles;
- providing a block of data (usually a form) to a data-handling process;
- extending a database through an append operation.
POST /cgi-bin/post-query HTTP/1.0
Accept: text/html
Accept: video/mpeg
Accept: image/gif
Accept: application/postscript
User-Agent: Lynx/2.2 libwww/2.14
From: Stars@WDVL.com
Content-type: application/x-www-form-urlencoded
Content-length: 49
* a blank line *
org=CyberWeb%20SoftWare
&users=10000
&browsers=lynx
-
This is a "POST" query addressed for the program residing
in the file at "/cgi-bin/post-query",
that simply echoes the values it receives.
-
The client lists the MIME-types it is capable of accepting,
and identifies itself and the version of the WWW library it is using.
-
Finally, it indicates the MIME-type it has used to encode the
data it is sending, the number of characters included, and the
list of variables and their values it has collected from the
user.
-
MIME-type application/x-www-form-urlencoded means that the
variable name-value pairs will be encoded the same way a URL is
encoded.
Any special characters, including punctuation,
will be encoded as
%nn where nn
is the ASCII value for the character in hex.
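The encoding can be reproduced with Python's standard urllib.parse module, shown here as a small sketch using the field values from the POST example. Passing quote_via=quote makes the space come out as %20; by default urlencode would write it as "+", which is also valid in form data.

```python
from urllib.parse import parse_qs, quote, urlencode

# Field names and values taken from the POST example.
fields = {"org": "CyberWeb SoftWare", "users": "10000", "browsers": "lynx"}

body = urlencode(fields, quote_via=quote)
print(body)            # the encoded name-value pairs, joined with "&"
print(len(body))       # the character count a client would report in Content-length
print(parse_qs(body))  # decoding on the receiving side recovers the values
```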
HTTP Response
Here is an example of an HTTP response from a server to a client request:
HTTP/1.0 200 OK
Date: Wednesday, 02-Feb-95 23:04:12 GMT
Server: NCSA/1.3
MIME-version: 1.0
Last-modified: Monday, 15-Nov-93 23:33:16 GMT
Content-type: text/html
Content-length: 2345
* a blank line *
<HTML> ...
-
The server agrees to use HTTP version 1.0 for
communication and sends the status 200 indicating it has
successfully processed the client's request.
-
It then sends the date and identifies itself as an NCSA HTTP server.
-
It also indicates it is using MIME version 1.0 to describe
the information it is sending, and includes the MIME-type of the
information about to be sent in the "Content-type:" header.
-
Finally, it sends the number of characters it is going to send,
followed by a blank line and the data itself.
-
Client and server headers are RFC 822 compliant mail headers.
A Client may send any number of Accept: headers and the
server is expected to convert the data into a form the
client can accept.
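To show how a client might take such a response apart, here is a minimal Python sketch. The function name parse_response is hypothetical, the sample response is shortened from the one above, and its Content-length is set to match the short sample body.

```python
# A shortened sample response; the body and its length are made up for the example.
RESPONSE = (
    "HTTP/1.0 200 OK\r\n"
    "Server: NCSA/1.3\r\n"
    "Content-type: text/html\r\n"
    "Content-length: 16\r\n"
    "\r\n"
    "<HTML>...</HTML>"
)


def parse_response(raw):
    """Split a response into protocol version, status code, reason, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")  # the blank line ends the headers
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return version, int(code), reason, headers, body


version, code, reason, headers, body = parse_response(RESPONSE)
print(version, code, reason)
print(headers["content-type"], headers["content-length"])
print(body)
```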
The HyperText Transfer Protocol - Next Generation
The essential simplicity of HTTP has been a major factor in its rapid
adoption, but this very simplicity has become its main drawback; the
next generation of HTTP, dubbed "HTTP-NG", will
be a replacement for HTTP 1.0 with much higher performance and
some extra features needed for use in commercial applications.
It's designed to make it easy to implement the basic functionality
needed by all browsers, whilst making the addition of more
powerful features such as security and authentication much simpler.
The current HTTP 1.0
often causes performance problems on the server side, and on
the network, since it sets up a new connection for every
request. Simon Spero has published a progress report on what
the W3C calls "HTTP Next Generation", or HTTP-NG.
HTTP-NG "divides up the connection [between client and server]
into lots of different channels ...
each object is returned over its own channel."
HTTP-NG allows many different requests to be sent over a single
connection.
These requests are asynchronous - there's no need for
the client to wait for a response before sending out a new request.
The server can also respond to requests in any order it sees fit -
it can even interweave the data from multiple objects,
allowing several images to be transferred in "parallel".
To make these multiple data streams easy to work with,
HTTP-NG sends all its messages and data using a "session layer".
This divides the connection up into lots of different channels.
HTTP-NG sends all control messages (GET requests, meta-information etc)
over a control channel.
Each object is returned over its own channel.
This also makes redirection much more powerful - for example,
if the object is a video the server can return the meta-information
over the same connection, together with a URL pointing to a dedicated
video transfer protocol that will fetch the data for the relevant
object.
This becomes very important when working with multimedia aware
networking technologies, such as ATM or RSVP.
The HTTP-NG protocol will permit complex data
types such as video to redirect the URL to a video transfer
protocol and only then will the data be fetched for the client.