On The Net
Unlocking URLs: Extensions, Shortening
Options, and Other Oddities
By Greg R. Notess
Reference Librarian, Montana State University
Yes, I know, for
those on the Net for years, Uniform Resource Locators (URLs) seem pretty obvious
and self-explanatory. URLs are now so prevalent in all kinds of media that
it is easy to get rather blasé about them. Yet as the Web matures, the
sophistication of how URLs can be used has increased as well. With Web site
redesigns commonplace, extensions often change; citations using URLs continue
to be a problem; and URLs are getting longer.
The root addresses for Web sites are, after all, fairly straightforward.
Add the www. prefix to whatever brilliant host name someone dreamed up, and
you are at the Web site's home page and primary entry point. For individual
pages on the site, add a directory and/or a file name and voilà, there
is the standard URL, which might look like www.brilliant.gov/pubs/catalog.html.
[Editor's Note: Use caution when reviewing the URLs in this column.
Some are imaginary, created by Greg for illustrative purposes only.]
But once we get beyond the basics, there are all sorts of URL details that
can be useful to the information professional. The host name portion is often
the first place to look when evaluating the provenance and authority of specific
content. As pages change, understanding more than just the basics can help
track down new locations and find older information. In addition, as we ship
URLs to each other and use them in citations, unlocking some of their stranger
secrets, along with knowledge of URL shortening tools, can make it easier to
actually get to the appropriate Web page.
DEFAULTS AND CHANGING EXTENSIONS
First of all, let's take a look at default extensions and some of their many
permutations. Most sites are set up with a default directory index file name.
In other words, when a user requests a URL like www.somewhere.edu/dir/, the
Web server actually delivers a specific file in the dir/ directory. Most of
the time, the default directory index file name is index.html. Thus, the earlier
www.somewhere.edu/dir/ URL actually displays www.somewhere.edu/dir/index.html.
This is why we can use short URLs like yahoo.com with no specific file name,
because the Web server automatically delivers the index file.
There are plenty of other options for the default file. A Web server can
be configured to look for any file name as the default. So www.somewhere.edu
could actually retrieve the www.somewhere.edu/home.jsp file rather than www.somewhere.edu/index.html.
And, in most cases, the displayed URL will only show as www.somewhere.edu and
not display the actual file name.
The ability to change the default directory index file name can be a great
help when a Web site goes through a major transformation or moves to a new
content management system that uses different file extensions. While the original
incarnation of a site may have relied primarily on HTML files, the new version
using some database-driven system may use different file extensions. Some possibilities
are .htm, .php, .asp, .jsp, .cfm, and .shtml. The .htm is just a shortened
form of .html. The others may tell you something about the systems being used.
Typically .php pages are using the PHP scripting language and running on Linux,
while .asp pages (Active Server Pages) are probably running on a Microsoft
server and may contain some Visual Basic programming. The Java Server Pages
(.jsp) are using Java coding. ColdFusion sites often have .cfm, and .shtml
is used for Server Side Includes.
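As a rough illustration, the extension hints above can be written down as a simple lookup table. This is only a sketch: the function name extension_hint and the wording of each hint are made up for this example, and an extension is a clue about the system behind a page, never proof.

```python
from posixpath import splitext
from urllib.parse import urlsplit

# Hints drawn from the extensions discussed above; they are guesses, not facts.
EXTENSION_HINTS = {
    ".html": "static HTML",
    ".htm": "static HTML (short form of .html)",
    ".php": "PHP scripting language",
    ".asp": "Active Server Pages, probably a Microsoft server",
    ".jsp": "Java Server Pages",
    ".cfm": "ColdFusion",
    ".shtml": "Server Side Includes",
}


def extension_hint(url):
    """Return a rough guess about the technology behind a URL, or None."""
    # urlsplit only finds the host when a scheme is present, so add one if missing.
    if "://" not in url:
        url = "http://" + url
    path = urlsplit(url).path
    extension = splitext(path)[1].lower()
    return EXTENSION_HINTS.get(extension)


print(extension_hint("www.brilliant.gov/pubs/catalog.html"))
```

A URL that ends in a directory, such as www.somewhere.edu/dir/, has no extension at all, so the function returns None for it.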
So what is the use of knowing these options? A quick check on a domain at
Netcraft.com will identify the Web server software and operating system more
accurately than guessing based on extensions. But knowing the different options
can help in tracking down an errant Web site by guessing some common file names.
Or if an old, dead URL pointed to somewhere.edu/biosci/index.html, trying out
somewhere.edu/biosci/ or somewhere.edu/biosci/index.php may retrieve the information.
Sometimes, it even helps find pages that are in the process of changing from
one system to another.
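The guessing process described above can be sketched in a few lines. The function below builds a list of addresses to try for a dead link; it does not fetch anything. The name candidate_urls and the assumption that the default file is always called "index" are illustrative choices, since a server can be configured to use any name.

```python
from urllib.parse import urlsplit, urlunsplit

# Extensions mentioned in this column; "index" is assumed as the base name.
INDEX_EXTENSIONS = ["html", "htm", "php", "asp", "jsp", "cfm", "shtml"]


def candidate_urls(dead_url):
    """List the directory URL plus index.* variants to try for a dead link."""
    if "://" not in dead_url:
        dead_url = "http://" + dead_url
    parts = urlsplit(dead_url)
    # Keep everything up to and including the last slash of the path.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    candidates = [urlunsplit((parts.scheme, parts.netloc, directory, "", ""))]
    for extension in INDEX_EXTENSIONS:
        path = directory + "index." + extension
        url = urlunsplit((parts.scheme, parts.netloc, path, "", ""))
        # Skip the address that is already known to be dead.
        if url != dead_url:
            candidates.append(url)
    return candidates


for url in candidate_urls("somewhere.edu/biosci/index.html"):
    print(url)
```

The directory address comes first in the list because it lets the server pick whatever default file it currently uses.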
A Web site can be configured to display no default index file. In this case,
going to a root URL may simply result in an error message. Going to www.sundaybaroque.org one
day from a link on another page resulted in a Forbidden or Directory Listing
Denied error message. Guessing that the home file might be named index.html,
I tried www.sundaybaroque.org/index.html, which
did bring up real content, but it was 2 years old. After a search (where both
Google and AlltheWeb turned up www.sundaybaroque.org as the top hit even though
it didn't work), I found that the new URL I needed was http://www.sundaybaroque.org/flash/flash_index.htm.
The site had been converted to a Macromedia Flash site, and the old URL did
not redirect to the new site. While this was fixed several days later, it is
the kind of situation in which understanding default index file names can help.
A variety of free URL shortening services are available and rather popular
on the Net. Sites like TinyURL.com, SnipURL.com, MakeAShorterLink.com, and
Shorl.com can take a long URL and convert it to a much shorter one. These tools
are useful, especially when trying to e-mail a long URL that is likely to wrap
in the e-mail.
Let's say you wanted to e-mail the URL for the ACS Regional Meeting Calendar
to a colleague or client. The URL, which begins www.chemistry.org/portal/Chemistry?PID=acsdisplay.html&,
is likely too long for a single line in an e-mail message and will have a line
break inserted somewhere within the URL. When the recipient tries to click
on it, the partial URL is likely to result in an error message. A quick, free
visit to SnipURL.com turns that long URL into www.snurl.com/tcm,
which should be short enough not to wrap in an e-mail message.
One problem with URL shortening or redirection services is that they mask
the actual domain name of the site containing the information content. That
domain name can be very useful in determining the origin of the information,
the reliability of it, and even whether or not to actually click on it. One
approach is to include both in an e-mail, as in the following example:
See the ACS Regional Meeting Calendar at www.chemistry.org/portal/Chemistry?PID=acsdisplay.
Or, if that doesn't work, try www.snurl.com/tcm.
Sometimes there are ways to shorten URLs that will not mask the domain name
or require the services of a URL redirection service. Think back to the default
index file names. With the practice of defaulting to a specific file name,
see if you get the same content when the file name portion is removed, especially
if it is something like index.html. Often, but not always, the beginning www.
of a URL can be left off. And as long as it remains recognizable, the http://
portion can almost always be left behind.
Let's look at an example of this alternative shortening that maintains the
necessary information and yet is considerably shorter. A recent article in
a professional journal included a reference to the Merriam-Webster site and
used the 27-character URL of http://www.m-w.com/home.htm,
which does work, but it could have been shortened further. The home.htm works
(as does index.html), but it is unnecessary. The http:// and the www. could
also be dropped. Just entering m-w.com into the browser will get exactly the
same page. For ease of recognition, www.m-w.com or http://m-w.com would
make it clearer that it is a Web address, but the 7-character short form of
m-w.com does work.
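The trimming steps in the Merriam-Webster example can be sketched as follows. The function name trimmed_forms and the list of default file names are assumptions made for this illustration; a particular server may use other names, and every shortened form still needs to be checked in a browser before it is cited.

```python
# Assumed list of default file names; a given server may use something else.
DEFAULT_FILES = ("index.html", "index.htm", "home.htm", "home.html")


def trimmed_forms(url):
    """Return progressively shorter forms of a URL, longest first."""
    forms = [url]
    current = url
    # 1. Drop a trailing default file name such as home.htm (the slash stays).
    for name in DEFAULT_FILES:
        if current.endswith("/" + name):
            current = current[: -len(name)]
            forms.append(current)
            break
    # 2. Drop the http:// prefix.
    if current.startswith("http://"):
        current = current[len("http://"):]
        forms.append(current)
    # 3. Drop the www. prefix.
    if current.startswith("www."):
        current = current[len("www."):]
        forms.append(current)
    # 4. If only a bare host name and a final slash remain, drop the slash too.
    if current.count("/") == 1 and current.endswith("/"):
        current = current[:-1]
        forms.append(current)
    return forms


for form in trimmed_forms("http://www.m-w.com/home.htm"):
    print(len(form), form)
```

Starting from the 27-character address, the last form printed is the 7-character m-w.com.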
Bear in mind that these shortcuts may not always work. Leaving off the www.
may or may not get you to the same page. The Special Libraries Association
site currently loads at www.sla.org, but just using sla.org results in an "Under
Construction" message. (And that of course could all change depending on what
SLA decides to do with its name this June.)
Beyond these basic tricks, other URL shortening options are tied into understanding
more about unnecessary options that may be included in the URL. For that we
need to explore the variable extensions that URLs may have.
VARIABLE URL EXTENSIONS
Within extended URLs, there are several ways to track information, such as
session IDs, search submissions, or other user information. As more sites move
to managing content via a database-driven back-end, URLs are getting longer
and more complex. A trip to MapBlast to view a map of the Chicago O'Hare airport
and surrounding roads can generate a unique, 501-character URL. What are all
those characters representing? They encode the search parameters such as latitude
and longitude, zoom level, included landmarks, etc.
Sometimes a Web site adds variables or tracking information after a question
mark or related symbol. Leaving off some portions may still bring up the same
page. For example, from the main Yahoo! page, the link to Yahoo! Finance is www.yahoo.com/r/sq,
but the address you end up at is finance.yahoo.com/?u. Chop off the extra two
characters and the slash, and the exact same information content shows up using
finance.yahoo.com (but some of the text ads on the page vary). The first /r/sq
is probably used to help track where people click and the /r/ may stand for "redirect."
Sometimes a URL will have a redirection prefix which can be used to track
click-through traffic. AlltheWeb.com search results often have URLs such as http://click.alltheweb.com/go2/2/atw/
followed by a long string that ends in the destination address. In that case, all that is needed to cite the site is the last 11 characters: www.loc.gov.
In the same way, other URLs will have affiliate suffixes. A link to an Amazon
product might look like amazon.com/exec/obidos/ASIN/0910965471/localaffil.
When citing or linking to that page, just leave off the /localaffil unless
you want to help out that affiliate.
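Dropping an affiliate suffix like the one in the Amazon example can be sketched as below. The function name strip_affiliate is made up for this illustration, and it assumes the address follows the /ASIN/ pattern shown above, with the product ID as the very next path segment.

```python
def strip_affiliate(url):
    """Drop anything after the product ID in an /ASIN/<id>/<affiliate> URL."""
    segments = url.split("/")
    if "ASIN" in segments:
        position = segments.index("ASIN")
        # Keep everything through the product ID that follows "ASIN".
        return "/".join(segments[: position + 2])
    # Leave any other kind of URL untouched.
    return url


print(strip_affiliate("amazon.com/exec/obidos/ASIN/0910965471/localaffil"))
```

For the example above, this yields amazon.com/exec/obidos/ASIN/0910965471, with the /localaffil portion removed.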
The URL shortening services are a simpler version of the Persistent URL (PURL)
approach that OCLC introduced years ago and which is used on some government
sites and in a variety of library systems. Like the URL shortening programs,
PURLs have the problem of not displaying the actual host name of the originating
Web site. While that could be another whole topic for a column, the persistence
problem goes beyond the regular changes in host names, paths, file names, and
various dead links.
Some URLs are not persistent even for a few minutes. The Bureau of Economic
Analysis (BEA) site offers all sorts of economic statistics. Requesting per capita
personal income for California in 2000 from the BEA's Local Area Personal Income
section gives a nice table and the URL of www.bea.gov/bea/regional/reis/drill.cfm.
But change the request to a different date or state, and it gives the exact same URL.
Library catalogs and research databases may act in a similar way. Or a string
of apparent nonsense characters may be added that tracks a user's session.
A portion of a URL from one search in part includes the following:
The ? is followed by a search argument. Ampersands are used to separate search
variables. The SID could be a session identifier and the CNT specifies the
maximum number of records to display.
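To make the parts of such a URL concrete, here is a minimal sketch that splits a query string at the ? and the ampersands and then rebuilds the address without selected variables. The catalog address is hypothetical, invented purely for illustration, and the function name drop_parameters is likewise an assumption; the variable names SID and CNT simply follow the description above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_parameters(url, names):
    """Return url with the named query-string variables removed."""
    parts = urlsplit(url)
    # parse_qsl splits the text after the ? at each ampersand into (name, value) pairs.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(name, value) for name, value in pairs if name not in names]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Hypothetical catalog URL, invented for illustration only.
example = "http://catalog.example.edu/search?SEARCH=urls&SID=8f3a2c&CNT=25"
print(parse_qsl(urlsplit(example).query))
print(drop_parameters(example, {"SID"}))
```

Whether the shortened address still retrieves the same page depends entirely on the site, so the result has to be tested rather than trusted.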
With all of these oddities, can such URLs be used in a citation or as a link
from another page? Maybe. Sometimes, careful removal of variables like the
session ID and other extraneous and/or unused variables can create a shorter
URL that will still work. Or the entire long URL may persist after the session
is over. But how can you tell?
One way to verify if such a URL will get to the same information is to check
on another computer. But typing in a long URL in full can get tedious.
An easier approach, for those who have more than one browser available
on their computers, is to copy the URL, open the other browser, paste the URL,
and see if the same page is retrieved. If it does not work in the other browser,
it will probably not work for anyone else either. The BEA example above just
gives an error message when pasted into another browser. Note that just opening
another window of the same browser may not test as accurately. It is best to
switch completely from, say, Internet Explorer to Netscape or Mozilla or Opera.
Occasionally you may run across rather unusual URLs, such as one that is
all numbers with no periods: http://1117674563.
These are probably real addresses, but most who use this technique are spam
e-mailers who want to mask the full URL. And there are many techniques to mask
URLs, often called obfuscating or munging.
Most familiar URLs use a word-based address. The American Library Association's
Web site is at www.ala.org, of course. Computers prefer to deal with numbers
and translate words and letters into numbers. The dotted-quad notation for
an IP address is only the first step in converting an alphabetic address into
one used by the computers. The dotted-quad notation is still relatively familiar,
consisting of four numbers between 0 and 255 separated by dots, as in 66.158.92.67
for www.ala.org. But that is still not a binary number, and computers will convert
even the numeric IP address into other numeric formats.
Continuing with the ALA example, the binary version of 66.158.92.67 is 01000010100111100101110001000011,
which can be translated back into our regular decimal system as 1117674563.
In addition to binary and decimal, computers can also speak in octal and hex.
And each number in the dotted-quad notation could be expressed in any one of
those four formats.
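The arithmetic behind these conversions is short enough to sketch. The 32-bit binary string quoted above breaks into the four 8-bit numbers 66, 158, 92, and 67, so the example below starts from 66.158.92.67. The function name quad_to_forms is an illustrative choice, and it converts the address as a whole rather than each of the four numbers separately.

```python
def quad_to_forms(dotted_quad):
    """Express a dotted-quad IP address as binary, decimal, octal, and hex."""
    octets = [int(part) for part in dotted_quad.split(".")]
    # Each of the four numbers occupies 8 bits of a single 32-bit value.
    value = 0
    for octet in octets:
        value = value * 256 + octet
    return {
        "binary": format(value, "032b"),
        "decimal": str(value),
        "octal": format(value, "o"),
        "hex": format(value, "x"),
    }


for name, text in quad_to_forms("66.158.92.67").items():
    print(name, text)
```

The decimal form that comes out is 1117674563, the same all-digit number that appears in the unusual URL mentioned earlier.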
More details are available from the "Obfuscated URLs" page at www.markjamesmullins.com/antispam.html.
For a more extensive list of possible combinations (see the URL by mousing
over the link with the cursor and looking in the status bar) and to see which
work in your browser, try "URLs that access CCN's Home Page" at www.chebucto.ns.ca/~af380/CCN-URLs.html.
Which of these various obfuscated or munged URLs might work depends in part
on the browser you use and the Web server itself. Some forms work in Internet Explorer 5.5
but not in Internet Explorer 6. Because of this, spammers can
use the various versions to target particular browser users.
The key points to realize are that such strange-looking URLs can actually
work but that spammers are the ones most likely to use the technique.
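For anyone curious about where an all-digit address really points, the conversion can be run in reverse. This is a minimal sketch using Python's standard ipaddress module; the function name decode_numeric_host is made up for the example, and it handles only the plain decimal form, not the octal, hex, or mixed variants.

```python
import ipaddress
from urllib.parse import urlsplit


def decode_numeric_host(url):
    """Return the dotted-quad address hidden in an all-digit host, else None."""
    host = urlsplit(url).hostname
    if host and host.isdigit():
        number = int(host)
        # An IPv4 address is a 32-bit value, so larger numbers are not addresses.
        if number < 2 ** 32:
            return str(ipaddress.IPv4Address(number))
    return None


print(decode_numeric_host("http://1117674563"))
```

An ordinary word-based address such as http://www.ala.org is not all digits, so the function returns None for it.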
After investigating the details of long and unusual URLs, most of us might
prefer to lock them back up and ignore the more cryptic portions. But I find
that a quick look at any URL can convey a great deal about the information
I'm viewing. It helps identify satire sites that may look like the real thing
but come from a completely different organization than expected. It can expose
what kind of server and content management system or scripting language is
used. It may contain original publication dates for articles, author names,
ad companies, or search options.
Watch the URLs. Check them when evaluating Web content and be careful when
citing them. Make sure others will be able to see what you see. Reduce them
to their minimum functioning version when citing. Above all, mine the full
URLs for the information that they contain.
Greg R. Notess (email@example.com; www.notess.com)
is a reference librarian at Montana State University and founder of SearchEngineShowdown.com.
Comments? Email the editor at firstname.lastname@example.org.