Online Magazine
Vol. 27 No. 3 — May/June 2003
On The Net
Unlocking URLs: Extensions, Shortening Options, and Other Oddities
By Greg R. Notess
Reference Librarian, Montana State University

Yes, I know, for those on the Net for years, Uniform Resource Locators (URLs) seem pretty obvious and self-explanatory. URLs are now so prevalent in all kinds of media that it is easy to get rather blasé about them. Yet as the Web matures, the sophistication of how URLs can be used has increased as well. With Web site redesigns commonplace, extensions often change; citations using URLs continue to be a problem; and URLs are getting longer.

The root addresses for Web sites are, after all, fairly straightforward. Add the www. prefix to whatever brilliant host name someone dreamed up, and you are at the Web site's home page and primary entry point. For individual pages on the site, add a directory and/or a file name and voilà, there is the standard URL, which might look like www.brilliant.gov/pubs/catalog.html. [Editor's Note: Use caution when reviewing the URLs in this column. Some are imaginary, created by Greg for illustrative purposes only.]

But once we get beyond the basics, there are all sorts of URL details that can be useful to the information professional. The host name portion is often the first place to look when evaluating the provenance and authority of specific content. As pages change, understanding more than just the basics can help track down new locations and find older information. In addition, as we ship URLs to each other and use them in citations, unlocking some of their stranger secrets, along with knowledge of URL shortening tools, can make it easier to actually get to the appropriate Web page.

DEFAULTS AND CHANGING EXTENSIONS

First of all, let's take a look at default extensions and some of their many permutations. Most sites are set up with a default directory index file name. In other words, when a user requests a URL like www.somewhere.edu/dir/, the Web server actually delivers a specific file in the dir/ directory. Most of the time, the default directory index file name is index.html. Thus, the earlier www.somewhere.edu/dir/ URL actually displays www.somewhere.edu/dir/index.html. This is why we can use short URLs like yahoo.com with no specific file name, because the Web server automatically delivers the index file.

There are plenty of other options for the default file. A Web server can be configured to look for any file name as the default. So www.somewhere.edu could actually retrieve the www.somewhere.edu/home.jsp file rather than www.somewhere.edu/index.html. And, in most cases, the displayed URL will only show as www.somewhere.edu and not display the actual file name.

The ability to change the default directory index file name can be a great help when a Web site goes through a major transformation or moves to a new content management system that uses different file extensions. While the original incarnation of a site may have relied primarily on HTML files, the new version using some database-driven system may use different file extensions. Some possibilities are .htm, .php, .asp, .jsp, .cfm, and .shtml. The .htm is just a shortened form of .html. The others may tell you something about the systems being used. Typically .php pages are using the PHP scripting language and running on Linux, while .asp pages (Active Server Pages) are probably running on a Microsoft server and may contain some Visual Basic programming. The Java Server Pages (.jsp) are using Java coding. ColdFusion sites often have .cfm, and .shtml is used for Server Side Includes.

So what is the use of knowing these options? A quick check on a domain at Netcraft.com will identify the Web server software and operating system more accurately than guessing based on extensions. But knowing the different options can help in tracking down an errant Web site by guessing some common file names. Or if an old, dead URL pointed to somewhere.edu/biosci/index.html, trying out somewhere.edu/biosci/ or somewhere.edu/biosci/index.php may retrieve the information. Sometimes, it even helps find pages that are in the process of changing from one system to another.
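The guessing described above can be sketched in a few lines of Python. This is only an illustration: the function name `candidate_urls` and the file stems (`index`, `home`, `default`) are assumptions, while the extensions are the ones listed earlier in this column. The sketch only builds a list of URLs to try; it does not contact any server.

```python
from urllib.parse import urlsplit

# Extensions mentioned in this column; the file stems are illustrative guesses.
EXTENSIONS = [".html", ".htm", ".php", ".asp", ".jsp", ".cfm", ".shtml"]
STEMS = ["index", "home", "default"]


def candidate_urls(dead_url):
    """Return alternative URLs worth trying when dead_url no longer works."""
    if "://" not in dead_url:
        dead_url = "http://" + dead_url  # urlsplit needs a scheme to find the host
    parts = urlsplit(dead_url)
    # Drop everything after the last slash, treating it as a file name.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    base = f"{parts.scheme}://{parts.netloc}{directory}"
    candidates = [base]  # the bare directory, served by whatever the default is
    for stem in STEMS:
        for ext in EXTENSIONS:
            candidates.append(base + stem + ext)
    # Leave out the URL that is already known to be dead.
    original = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return [url for url in candidates if url != original]


for url in candidate_urls("somewhere.edu/biosci/index.html")[:4]:
    print(url)
```

Each candidate still has to be checked by hand, since a page found this way may be an outdated copy, as the next example shows.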

A Web site can be configured to display no default index file. In this case, going to a root URL may simply result in an error message. Going to www.sundaybaroque.org one day from a link on another page resulted in a Forbidden or Directory Listing Denied error message. Guessing that the home file might be named index.html, I tried www.sundaybaroque.org/index.html which did bring up real content, but it was 2 years old. After a search (where both Google and AlltheWeb turned up www.sundaybaroque.org as the top hit even though it didn't work), I found that the new URL I needed was http://www.sundaybaroque.org/flash/flash_index.htm. The site had been converted to a Macromedia Flash site, and the old URL did not redirect to the new site. While this was fixed several days later, it is the kind of situation in which understanding default index file names can help.

SHORTENING URLS

A variety of free URL shortening services are available and rather popular on the Net. Sites like TinyURL.com, SnipURL.com, MakeAShorterLink.com, and Shorl.com can take a long URL and convert it to a much shorter one. These tools are useful, especially when trying to e-mail a long URL that is likely to wrap in the e-mail.

Let's say you wanted to e-mail the URL for the ACS Regional Meeting Calendar to a colleague or client. The URL of www.chemistry.org/portal/Chemistry?PID=acsdisplay.html&DOC=meetings\regional\2003.html is likely too long for a single line in an e-mail message and will have a line break inserted somewhere within the URL. When the recipient tries to click on it, the partial URL is likely to result in an error message. A quick, free visit to SnipURL.com turns that long URL into www.snurl.com/tcm, which should be short enough not to wrap in an e-mail message.

One problem with URL shortening or redirection services is that they mask the actual domain name of the site containing the information content. That domain name can be very useful in determining the origin of the information, the reliability of it, and even whether or not to actually click on it. One approach is to include both in an e-mail, as in the following example:

See the ACS Regional Meeting Calendar at www.chemistry.org/portal/Chemistry?PID=acsdisplay.html&DOC=meetings\regional\2003.html. Or, if that doesn't work, try www.snurl.com/tcm.

ALTERNATIVE SHORTENING

Sometimes there are ways to shorten URLs that will not mask the domain name or require the services of a URL redirection service. Think back to the default index file names. With the practice of defaulting to a specific file name, see if you get the same content when the file name portion is removed, especially if it is something like index.html. Often, but not always, the beginning www. of a URL can be left off. And as long as it remains recognizable, the http:// portion can almost always be left behind.

Let's look at an example of this alternative shortening that maintains the necessary information and yet is considerably shorter. A recent article in a professional journal included a reference to the Merriam-Webster site and used the 27-character URL of http://www.m-w.com/home.htm, which does work, but it could have been shortened further. The home.htm works (as does index.html), but it is unnecessary. The http:// and the www. could also be dropped. Just entering m-w.com into the browser will get exactly the same page. For ease of recognition, www.m-w.com or http://m-w.com would make it clearer that it is a Web address, but the 7-character short form of m-w.com does work.

Bear in mind that these shortcuts may not always work. Leaving off the www. may or may not get you to the same page. The Special Libraries Association site currently loads at www.sla.org, but just using sla.org results in an "Under Construction" message. (And that of course could all change depending on what SLA decides to do with its name this June.)
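The shortcuts above can be expressed as a small Python sketch. The function name `shorter_forms` and the set of default file names are assumptions made for illustration. The function only proposes shorter forms; as the SLA example shows, each one must still be tried in a browser before it is cited.

```python
from urllib.parse import urlsplit

# File names that are often, but not always, the server's default (an assumption).
DEFAULT_FILES = {"index.html", "index.htm", "home.html", "home.htm"}


def shorter_forms(url):
    """Propose shorter forms of url, longest first. None are guaranteed to work."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    directory, _, filename = parts.path.rpartition("/")
    path = parts.path
    if filename.lower() in DEFAULT_FILES:
        path = directory + "/"  # drop the probable default file name
    if path == "/":
        path = ""  # a bare host name needs no trailing slash
    rest = path + ("?" + parts.query if parts.query else "")
    forms = [parts.netloc + rest]  # without http://
    if parts.netloc.lower().startswith("www."):
        forms.append(parts.netloc[4:] + rest)  # without www. as well
    return forms


print(shorter_forms("http://www.m-w.com/home.htm"))
```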

Beyond these basic tricks, other URL shortening options are tied into understanding more about unnecessary options that may be included in the URL. For that we need to explore the variable extensions that URLs may have.

VARIABLE URL EXTENSIONS

Within extended URLs, there are several ways to track information, such as session IDs, search submissions, or other user information. As more sites move to managing content via a database-driven back-end, URLs are getting longer and more complex. A trip to MapBlast to view a map of the Chicago O'Hare airport and surrounding roads can generate a unique, 501-character URL. What do all those characters represent? They encode the search parameters such as latitude and longitude, zoom level, included landmarks, etc.

Sometimes a Web site adds variables or tracking information after a question mark or related symbol. Leaving off some portions may still bring up the same page. For example, from the main Yahoo! page, the link to Yahoo! Finance is www.yahoo.com/r/sq, but the address you end up at is finance.yahoo.com/?u. Chop off the extra two characters and the slash, and the exact same information content shows up using finance.yahoo.com (but some of the text ads on the page vary). The first /r/sq is probably used to help track where people click and the /r/ may stand for "redirect."

Sometimes a URL will have a redirection prefix which can be used to track click-through traffic. AlltheWeb.com search results often have URLs such as http://click.alltheweb.com/go2/2/atw/1c4B8A2C54/MSxILHdlYg/http/www.loc.gov/ when all that is needed to cite the site is the last 11 characters of www.loc.gov. In the same way, other URLs will have affiliate suffixes. A link to an Amazon product might look like amazon.com/exec/obidos/ASIN/0910965471/localaffil. When citing or linking to that page, just leave off the /localaffil unless you want to help out that affiliate.
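Both kinds of trimming can be sketched in Python. The function names are hypothetical, and the assumption that the destination follows a `/http/` marker is inferred only from the single AlltheWeb example above; other redirect services are likely to embed the destination differently.

```python
def strip_redirect_prefix(url, marker="/http/"):
    """Return the destination embedded after marker, or url unchanged."""
    head, found, tail = url.partition(marker)
    if not found or not tail:
        return url  # no embedded destination recognized
    return "http://" + tail


def strip_suffix(url, suffix):
    """Remove a known trailing affiliate tag, if present."""
    if suffix and url.endswith(suffix):
        return url[: -len(suffix)]
    return url


print(strip_redirect_prefix(
    "http://click.alltheweb.com/go2/2/atw/1c4B8A2C54/MSxILHdlYg/http/www.loc.gov/"
))
print(strip_suffix("amazon.com/exec/obidos/ASIN/0910965471/localaffil", "/localaffil"))
```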

URL PERSISTENCE

The URL shortening services are a simple variation on the Persistent URL (PURL) approach that OCLC introduced years ago, which is used on some government sites and in a variety of library systems. Like the URL shortening programs, PURLs have the problem of not displaying the actual host name of the originating Web site. While that could be another whole topic for a column, the persistence problem goes beyond the regular changes in host names, paths, file names, and various dead links.

Some URLs are not persistent even for a few minutes. The Bureau of Economic Analysis' site offers all sorts of economic statistics. Requesting per capita personal income for California in 2000 from the BEA's Local Area Personal Income section gives a nice table and the URL of www.bea.gov/bea/regional/reis/drill.cfm. But change the request to a different date or state, and it gives the exact same URL.

Library catalogs and research databases may act in a similar way. Or a string of apparent nonsense characters may be added that tracks a user's session. A portion of a URL from one search in part includes the following:

/cgi-bin/s.cgi?Search_Arg=test&SID=7825&CNT=50

The ? is followed by a search argument. Ampersands are used to separate search variables. The SID could be a session identifier and the CNT specifies the maximum number of records to display.
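For readers who want to take such a URL apart programmatically, Python's standard `urllib.parse` module splits at the question mark and the ampersands in just this way. The sketch below uses the URL fragment quoted above; what each variable means remains a guess, as noted.

```python
from urllib.parse import urlsplit, parse_qs

fragment = "/cgi-bin/s.cgi?Search_Arg=test&SID=7825&CNT=50"

parts = urlsplit(fragment)
print(parts.path)   # the script being called, before the ?
print(parts.query)  # everything after the ?

# parse_qs splits the query at each & and returns a dict of lists,
# because a variable may legally appear more than once.
variables = parse_qs(parts.query)
for name, values in variables.items():
    print(name, "=", values)
```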

VERIFYING URLS

With all of these oddities, can such URLs be used in a citation or as a link from another page? Maybe. Sometimes, careful removal of variables like the session ID and other extraneous and/or unused variables can create a shorter URL that will still work. Or the entire long URL may persist after the session is over. But how can you tell?

One way to verify whether such a URL will get to the same information is to check on another computer. But typing in a long URL in full can get tedious. An easier approach, for those who have more than one browser available on their computers, is to copy the URL, open the other browser, paste the URL, and see if the same page is retrieved. If it does not work in the other browser, it will probably not work for anyone else either. The BEA example above just gives an error message when pasted into another browser. Note that just opening another window of the same browser may not be as accurate a test. It is best to switch completely from, say, Internet Explorer to Netscape or Mozilla or Opera.
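Before testing in a second browser, a trimmed candidate URL has to be produced. The following Python sketch removes named variables from a query string; the function name `drop_variables` is hypothetical, and treating `SID` as the session variable is only the guess made earlier in this column. The result is a candidate to verify, not a URL known to work.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def drop_variables(url, names=("SID",)):
    """Return url with the named query variables removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in names
    ]
    # urlencode rebuilds the query string; it may re-encode some characters
    # (a space becomes +, for example), so compare the result with the original.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(drop_variables("/cgi-bin/s.cgi?Search_Arg=test&SID=7825&CNT=50"))
```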

UNUSUAL URLS

Occasionally you may run across rather unusual URLs, such as one that is all numbers with no periods—http://1117674563. These are probably real addresses, but most who use this technique are spam e-mailers who want to mask the full URL. And there are many techniques to mask URLs, often called obfuscating or munging.

Most familiar URLs use a word-based address. The American Library Association's Web site is at www.ala.org, of course. Computers prefer to deal with numbers and translate words and letters into numbers. The dotted-quad notation for an IP address is only the first step in converting an alphabetic address into one used by the computers. The dotted-quad notation is still relatively familiar, consisting of four numbers between 0 and 255 separated by dots as in 66.158.92.67 for www.ala.org. But that is still not a binary number, and computers will convert even the numeric IP address into other numeric formats.

Continuing with the ALA example, the binary version of 66.158.92.67 is 01000010100111100101110001000011, which can be translated back into our regular decimal system as 1117674563. In addition to binary and decimal, computers can also speak in octal and hex. And each number in the dotted-quad notation could be expressed in any one of those four formats.
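The arithmetic can be checked with Python's standard `ipaddress` module. This is a minimal sketch using the address quoted above; the function name `numeric_forms` is an assumption, and whether a given browser accepts any of these forms is a separate question.

```python
import ipaddress


def numeric_forms(dotted_quad):
    """Return whole-number and per-octet forms of an IPv4 address."""
    address = ipaddress.IPv4Address(dotted_quad)
    value = int(address)  # the four octets read as one 32-bit number
    octets = list(address.packed)  # the four numbers between the dots
    return {
        "decimal": str(value),
        "binary": format(value, "032b"),
        "hex": format(value, "#x"),
        "hex_octets": ".".join(hex(octet) for octet in octets),
    }


forms = numeric_forms("66.158.92.67")
for name, text in forms.items():
    print(name, text)

# Going the other way, from the single decimal number back to dotted-quad:
print(ipaddress.IPv4Address(1117674563))
```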

More details are available from the "Obfuscated URLs" page at www.markjamesmullins.com/antispam.html. For a more extensive list of possible combinations (see the URL by mousing over the link with the cursor and looking in the status bar) and to see which work in your browser, try "URLs that access CCN's Home Page" at www.chebucto.ns.ca/~af380/CCN-URLs.html.

Which of these various obfuscated or munged URLs might work depends in part on the browser you use and the Web server itself. Some forms work in Internet Explorer 5.5 but not in Internet Explorer 6. Because of this, spammers can use the various versions to target particular browser users.

The key points to realize are that such strange-looking URLs can actually work but that spammers are the ones most likely to use the technique.

URL WATCHING

After investigating the details of long and unusual URLs, most of us might prefer to lock them back up and ignore the more cryptic portions. But I find that a quick look at any URL can convey a great deal about the information I'm viewing. It helps identify satire sites that may look like the real thing but come from a completely different organization than expected. It can expose what kind of server and content management system or scripting language is used. It may contain original publication dates for articles, author names, ad companies, or search options.

Watch the URLs. Check them when evaluating Web content and be careful when citing them. Make sure others will be able to see what you see. Reduce them to their minimum functioning version when citing. Above all, mine the full URLs for the information that they contain.


Greg R. Notess (greg@notess.com; www.notess.com) is a reference librarian at Montana State University and founder of SearchEngineShowdown.com.

Comments? Email the editor at marydee@infotoday.com

 

