| On The Net
 Unlocking URLs: Extensions, Shortening Options, and Other Oddities
 By Greg R. Notess
 Reference Librarian, Montana State University
 
 
  Yes, I know, for those on the Net for years, Uniform Resource Locators (URLs)
  seem pretty obvious
  and self-explanatory. URLs are now so prevalent in all kinds of media that
  it is easy to get rather blasé about them. Yet as the Web matures, the
  sophistication of how URLs can be used has increased as well. With Web site
  redesigns commonplace, extensions often change; citations using URLs continue
  to be a problem; and URLs are getting longer.
  The root addresses for Web sites are, after all, fairly straightforward.
  Add the www. prefix to whatever brilliant host name someone dreamed up, and
  you are at the Web site's home page and primary entry point. For individual
  pages on the site, add a directory and/or a file name and voilà, there
  is the standard URL that might look like www.brilliant.gov/pubs/catalog.html.
  [Editor's Note: Use caution when reviewing the URLs in this column.
  Some are imaginary, created by Greg for illustrative purposes only.]
  But once we get beyond the basics, there are all sorts of URL details that
  can be useful to the information professional. The host name portion is often
  the first place to look when evaluating the provenance and authority of specific
  content. As pages change, understanding more than just the basics can help
  track down new locations and find older information. In addition, as we ship
  URLs to each other and use them in citations, unlocking some of their stranger
  secrets, along with knowledge of URL shortening tools, can make it easier to
  actually get to the appropriate Web page.
  DEFAULTS AND CHANGING EXTENSIONS
  First of all, let's take a look at default extensions and some of their many
  permutations. Most sites are set up with a default directory index file name.
  In other words, when a user requests a URL like www.somewhere.edu/dir/, the
  Web server actually delivers a specific file in the dir/ directory. Most of
  the time, the default directory index file name is index.html. Thus, the earlier
  www.somewhere.edu/dir/ URL actually displays www.somewhere.edu/dir/index.html.
  This is why we can use short URLs like yahoo.com with no specific file name,
  because the Web server automatically delivers the index file.
  There are plenty of other options for the default file. A Web server can
  be configured to look for any file name as the default. So www.somewhere.edu
  could actually retrieve the www.somewhere.edu/home.jsp file rather than www.somewhere.edu/index.html.
  And, in most cases, the displayed URL will only show as www.somewhere.edu and
  not display the actual file name.
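  The lookup described above can be sketched as a small function. This is only
  a rough model, not how any particular Web server is implemented: the function
  name resolve, the set of existing files, and the index_names setting are all
  hypothetical stand-ins for a server's real configuration.

```python
def resolve(request_path, existing_files, index_names=("index.html",)):
    """Roughly mimic how a server maps a requested path to a file.

    existing_files and index_names are hypothetical stand-ins for the
    server's file system and its configured default index names.
    """
    if not request_path.endswith("/"):
        # A specific file was requested: deliver it only if it exists.
        return request_path if request_path in existing_files else None
    # A directory was requested: try each configured default name in order.
    for name in index_names:
        candidate = request_path + name
        if candidate in existing_files:
            return candidate
    # No default file found; a real server would answer with an error
    # message or a directory listing, depending on its configuration.
    return None


files = {"/dir/index.html", "/home.jsp"}
print(resolve("/dir/", files))                            # /dir/index.html
print(resolve("/", files, ("index.html", "home.jsp")))    # /home.jsp
print(resolve("/", files))                                # None
```

  The last call shows the case where no default file matches, so the root
  address has nothing to deliver.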
  The ability to change the default directory index file name can be a great
  help when a Web site goes through a major transformation or moves to a new
  content management system that uses different file extensions. While the original
  incarnation of a site may have relied primarily on HTML files, the new version
  using some database-driven system may use different file extensions. Some possibilities
  are .htm, .php, .asp, .jsp, .cfm, and .shtml. The .htm is just a shortened
  form of .html. The others may tell you something about the systems being used.
  Typically .php pages are using the PHP scripting language and running on Linux,
  while .asp pages (Active Server Pages) are probably running on a Microsoft
  server and may contain some Visual Basic programming. The Java Server Pages
  (.jsp) are using Java coding. ColdFusion sites often have .cfm, and .shtml
  is used for Server Side Includes.
  So what is the use of knowing these options? A quick check on a domain at
  Netcraft.com will identify the Web server software and operating system more
  accurately than guessing based on extensions. But knowing the different options
  can help in tracking down an errant Web site by guessing some common file names.
  Or if an old, dead URL pointed to somewhere.edu/biosci/index.html, trying out
  somewhere.edu/biosci/ or somewhere.edu/biosci/index.php may retrieve the information.
  Sometimes, it even helps find pages that are in the process of changing from
  one system to another.
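  The guessing game can be partly automated. The following is a minimal sketch
  that only builds a list of addresses to try by hand; it does not contact any
  server. The function name candidate_urls is hypothetical, the base name
  "default" is an added guess, and plain http:// is assumed when the address
  has no scheme.

```python
from urllib.parse import urlsplit, urlunsplit

# Extensions listed in this column; illustrative, not exhaustive.
EXTENSIONS = [".html", ".htm", ".php", ".asp", ".jsp", ".cfm", ".shtml"]
# "index" and "home" appear in this column; "default" is an added guess.
BASENAMES = ["index", "home", "default"]


def candidate_urls(dead_url):
    """Return alternative URLs worth trying by hand when dead_url fails."""
    if "://" not in dead_url:
        dead_url = "http://" + dead_url  # assumption: plain HTTP
    parts = urlsplit(dead_url)
    original = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    # Keep everything up to and including the last slash of the path.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    # First candidate: the bare directory, relying on the server's default.
    candidates = [urlunsplit((parts.scheme, parts.netloc, directory, "", ""))]
    for base in BASENAMES:
        for ext in EXTENSIONS:
            path = directory + base + ext
            candidates.append(
                urlunsplit((parts.scheme, parts.netloc, path, "", ""))
            )
    # The dead URL itself is not worth retrying.
    return [url for url in candidates if url != original]


for url in candidate_urls("somewhere.edu/biosci/index.html")[:3]:
    print(url)
```

  Any hit still has to be checked by eye, since a file that exists under a
  guessed name may hold outdated content.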
  A Web site can be configured to display no default index file. In this case,
  going to a root URL may simply result in an error message. Going to www.sundaybaroque.org one
  day from a link on another page resulted in a Forbidden or Directory Listing
  Denied error message. Guessing that the home file might be named index.html,
  I tried www.sundaybaroque.org/index.html which
  did bring up real content, but it was 2 years old. After a search (where both
  Google and AlltheWeb turned up www.sundaybaroque.org as the top hit even though
  it didn't work), I found that the new URL I needed was http://www.sundaybaroque.org/flash/flash_index.htm.
  The site had been converted to a Macromedia Flash site, and the old URL did
  not redirect to the new site. While this was fixed several days later, it is
  the kind of situation in which understanding default index file names can help.
  SHORTENING URLS
  A variety of free URL shortening services are available and rather popular
  on the Net. Sites like TinyURL.com, SnipURL.com, MakeAShorterLink.com, and
  Shorl.com can take a long URL and convert it to a much shorter one. These tools
  are useful, especially when trying to e-mail a long URL that is likely to wrap
  in the e-mail.
  Let's say you wanted to e-mail the URL for the ACS Regional Meeting Calendar
  to a colleague or client. The URL of www.chemistry.org/portal/Chemistry?PID=acsdisplay.html&DOC=meetings\regional\2003.html is
  likely too long for a single line in an e-mail message and will have a line
  break inserted somewhere within the URL. When the recipient tries to click
  on it, the partial URL is likely to result in an error message. A quick, free
  visit to SnipURL.com turns that long URL into www.snurl.com/tcm,
  which should be short enough not to wrap in an e-mail message.
  One problem with URL shortening or redirection services is that they mask
  the actual domain name of the site containing the information content. That
  domain name can be very useful in determining the origin of the information,
  the reliability of it, and even whether or not to actually click on it. One
  approach is to include both in an e-mail, as in the following example:
  See the ACS Regional Meeting Calendar at www.chemistry.org/portal/Chemistry?PID=acsdisplay.html&DOC=meetings\regional\2003.html.
  Or, if that doesn't work, try www.snurl.com/tcm.
  ALTERNATIVE SHORTENING
  Sometimes there are ways to shorten URLs that will not mask the domain name
  or require the services of a URL redirection service. Think back to the default
  index file names. With the practice of defaulting to a specific file name,
  see if you get the same content when the file name portion is removed, especially
  if it is something like index.html. Often, but not always, the beginning www.
  of a URL can be left off. And as long as it remains recognizable, the http://
  portion can almost always be left behind.
  Let's look at an example of this alternative shortening that maintains the
  necessary information and yet is considerably shorter. A recent article in
  a professional journal included a reference to the Merriam-Webster site and
  used the 27-character URL of http://www.m-w.com/home.htm,
  which does work, but it could have been shortened further. The home.htm works
  (as does index.html), but it is unnecessary. The http:// and the www. could
  also be dropped. Just entering m-w.com into the browser will get exactly the
  same page. For ease of recognition, www.m-w.com or http://m-w.com would
  make it clearer that it is a Web address, but the 7-character short form of
  m-w.com does work.
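  The Merriam-Webster example can be worked through step by step. This is a
  sketch under stated assumptions: the function name shorter_forms is
  hypothetical, and the pattern of default file names covers only index and
  home with the extensions mentioned in this column. It produces candidates
  only; each one still has to be tried in a browser.

```python
import re

# Assumption: only "index" and "home" are treated as default file names.
DEFAULT_FILE = re.compile(
    r"/(?:index|home)\.(?:html?|php|asp|jsp|cfm|shtml)$", re.IGNORECASE
)


def shorter_forms(url):
    """Return progressively shorter forms of url, longest first."""
    forms = [url]
    without_file = DEFAULT_FILE.sub("", url)                # drop /home.htm
    without_scheme = re.sub(r"^http://", "", without_file)  # drop http://
    without_www = re.sub(r"^www\.", "", without_scheme)     # drop www.
    for form in (without_file, without_scheme, without_www):
        if form != forms[-1]:  # skip steps that changed nothing
            forms.append(form)
    return forms


for form in shorter_forms("http://www.m-w.com/home.htm"):
    print(len(form), form)
# 27 http://www.m-w.com/home.htm
# 18 http://www.m-w.com
# 11 www.m-w.com
# 7 m-w.com
```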
  Bear in mind that these shortcuts may not always work. Leaving off the www.
  may or may not get you to the same page. The Special Libraries Association
  site currently loads at www.sla.org, but just using sla.org results in an "Under
  Construction" message. (And that of course could all change depending on what
  SLA decides to do with its name this June.)
  Beyond these basic tricks, other URL shortening options are tied into understanding
  more about unnecessary options that may be included in the URL. For that we
  need to explore the variable extensions that URLs may have.
  VARIABLE URL EXTENSIONS
  Within extended URLs, there are several ways to track information, such as
  session IDs, search submissions, or other user information. As more sites move
  to managing content via a database-driven back-end, URLs are getting longer
  and more complex. A trip to MapBlast to view a map of the Chicago O'Hare airport
  and surrounding roads can generate a unique, 501-character URL. What are all
  those characters representing? They encode the search parameters such as latitude
  and longitude, zoom level, included landmarks, etc.
  Sometimes a Web site adds variables or tracking information after a question
  mark or related symbol. Leaving off some portions may still bring up the same
  page. For example, from the main Yahoo! page, the link to Yahoo! Finance is www.yahoo.com/r/sq,
  but the address you end up at is finance.yahoo.com/?u. Chop off the extra two
  characters and the slash, and the exact same information content shows up using
  finance.yahoo.com (but some of the text ads on the page vary). The first /r/sq
  is probably used to help track where people click and the /r/ may stand for "redirect."
  Sometimes a URL will have a redirection prefix which can be used to track
  click-through traffic. AlltheWeb.com search results often have URLs such as http://click.alltheweb.com/go2/2/atw/1c4B8A2C54/MSxILHdlYg/http/www.loc.gov/ when
  all that is needed to cite the site is the last 11 characters of www.loc.gov.
  In the same way, other URLs will have affiliate suffixes. A link to an Amazon
  product might look like amazon.com/exec/obidos/ASIN/0910965471/localaffil.
  When citing or linking to that page, just leave off the /localaffil unless
  you want to help out that affiliate.
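  Both kinds of trimming can be sketched in a few lines. The two functions
  below are hypothetical and are modeled only on the two example URLs shown
  above: the first assumes the real address follows a /http/ marker, and the
  second assumes the affiliate name follows a 10-character product ID after
  /ASIN/. Other sites, or later versions of these sites, may use different
  patterns.

```python
import re


def strip_redirect_prefix(url):
    """Return the target address that follows '/http/' in a tracking URL."""
    marker = "/http/"
    position = url.find(marker)
    if position == -1:
        return url  # no recognizable redirect prefix; leave unchanged
    return url[position + len(marker):].rstrip("/")


def strip_affiliate_suffix(url):
    """Drop whatever follows the 10-character ID after '/ASIN/'."""
    match = re.match(r"^(.*?/ASIN/[0-9A-Za-z]{10})(?:/.*)?$", url)
    return match.group(1) if match else url


print(strip_redirect_prefix(
    "http://click.alltheweb.com/go2/2/atw/1c4B8A2C54/MSxILHdlYg/http/www.loc.gov/"
))  # www.loc.gov
print(strip_affiliate_suffix(
    "amazon.com/exec/obidos/ASIN/0910965471/localaffil"
))  # amazon.com/exec/obidos/ASIN/0910965471
```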
  URL PERSISTENCE
  The URL shortening services are a simpler take on the Persistent URL (PURL)
  approach that OCLC introduced years ago and that is used on some government
  sites and in a variety of library systems. Like the URL shortening programs,
  PURLs have the problem of not displaying the actual host name of the originating
  Web site. While that could be another whole topic for a column, the persistence
  problem goes beyond the regular changes in host names, paths, file names, and
  various dead links.
  Some URLs are not persistent even for a few minutes. The Bureau of Economic
  Analysis (BEA) site offers all sorts of economic statistics. Requesting per capita
  personal income for California in 2000 from the BEA's Local Area Personal Income
  section gives a nice table and the URL of www.bea.gov/bea/regional/reis/drill.cfm.
  But change the request to a different date or state, and it gives the exact
  same URL.
  Library catalogs and research databases may act in a similar way. Or a string
  of apparent nonsense characters may be added that tracks a user's session.
  A portion of a URL from one search in part includes the following:
  /cgi-bin/s.cgi?Search_Arg=test&SID=7825&CNT=50
  The ? is followed by a search argument. Ampersands are used to separate search
  variables. The SID could be a session identifier and the CNT specifies the
  maximum number of records to display.
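  Python's standard urllib.parse module can pull such a URL apart and rebuild
  it without a chosen variable. This is a minimal sketch: the function name
  drop_parameters is hypothetical, and treating SID as a disposable session
  identifier is the same guess made above. Whether the shorter URL still
  retrieves the same page has to be tested.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_parameters(url, names_to_drop=("SID",)):
    """Rebuild url without the named query-string variables."""
    parts = urlsplit(url)
    # parse_qsl splits the text after "?" into (name, value) pairs,
    # using the ampersands as separators.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(name, value) for name, value in pairs
            if name not in names_to_drop]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


url = "/cgi-bin/s.cgi?Search_Arg=test&SID=7825&CNT=50"
print(parse_qsl(urlsplit(url).query))
# [('Search_Arg', 'test'), ('SID', '7825'), ('CNT', '50')]
print(drop_parameters(url))
# /cgi-bin/s.cgi?Search_Arg=test&CNT=50
```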
  VERIFYING URLS
  With all of these oddities, can such URLs be used in a citation or as a link
  from another page? Maybe. Sometimes, careful removal of variables like the
  session ID and other extraneous and/or unused variables can create a shorter
  URL that will still work. Or the entire long URL may persist after the session
  is over. But how can you tell?
  One way to verify that such a URL will get to the same information is to check
  it on another computer. But typing in a long URL in full gets tedious. An
  easier approach, for those who have more than one browser available
  on their computers, is to copy the URL, open the other browser, paste the URL,
  and see if the same page is retrieved. If it does not work in the other browser,
  it will probably not work for anyone else either. The BEA example above just
  gives an error message when pasted into another browser. Note that just opening
  another window of the same browser may not test as accurately. It is best to
  switch completely from, say, Internet Explorer to Netscape or Mozilla or Opera.
  UNUSUAL URLS
  Occasionally you may run across rather unusual URLs, such as one that is
  all numbers with no periods: http://1117674563.
  These are probably real addresses, but most who use this technique are spam
  e-mailers who want to mask the full URL. And there are many techniques to mask
  URLs, often called obfuscating or munging.
  Most familiar URLs use a word-based address. The American Library Association's
  Web site is at www.ala.org, of course. Computers prefer to deal with numbers
  and translate words and letters into numbers. The dotted-quad notation for
  an IP address is only the first step in converting an alphabetic address into
  one used by the computers. The dotted-quad notation is still relatively familiar,
  consisting of four numbers between 0 and 255 separated by dots as in 66.158.92.67
  for www.ala.org. But that
  is still not a binary number, and computers will convert even the numeric IP
  address into other numeric formats.
  Continuing with the ALA example, the binary version of 66.158.92.67 is 01000010100111100101110001000011,
  which can be translated back into our regular decimal system as 1117674563.
  In addition to binary and decimal, computers can also speak in octal and hex.
  And each number in the dotted-quad notation could be expressed in any one of
  those four formats.
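  The arithmetic behind the ALA example can be checked with a few lines of
  Python. This sketch treats the four numbers as base-256 digits of a single
  32-bit value; the function name dotted_quad_to_int is hypothetical.

```python
def dotted_quad_to_int(address):
    """Convert a dotted-quad IP address into a single integer."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= octet <= 255 for octet in octets):
        raise ValueError("expected four numbers between 0 and 255")
    value = 0
    for octet in octets:
        # Each of the four numbers is one base-256 "digit".
        value = value * 256 + octet
    return value


value = dotted_quad_to_int("66.158.92.67")
print(value)                  # 1117674563
print(format(value, "032b"))  # 01000010100111100101110001000011
print(format(value, "#x"))    # 0x429e5c43
print(format(value, "#o"))    # the same value in octal
```

  The decimal result matches the all-numbers address shown earlier, which is
  why a browser that accepts this form can reach the same site.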
  More details are available from the "Obfuscated URLs" page at www.markjamesmullins.com/antispam.html.
  For a more extensive list of possible combinations (see the URL by mousing
  over the link with the cursor and looking in the status bar) and to see which
  work in your browser, try "URLs that access CCN's Home Page" at www.chebucto.ns.ca/~af380/CCN-URLs.html.
  Which of these various obfuscated or munged URLs might work depends in part
  on the browser you use and the Web server itself. Some forms work in Internet
  Explorer 5.5 but not in Internet Explorer 6. Because of this, spammers can
  use the various versions to target particular browser users.
  The key points to realize are that such strange-looking URLs can actually
  work but that spammers are the ones most likely to use the technique.
  URL WATCHING
  After investigating the details of long and unusual URLs, most of us might
  prefer to lock them back up and ignore the more cryptic portions. But I find
  that a quick look at any URL can convey a great deal about the information
  I'm viewing. It helps identify satire sites that may look like the real thing
  but come from a completely different organization than expected. It can expose
  what kind of server and content management system or scripting language is
  used. It may contain original publication dates for articles, author names,
  ad companies, or search options.
  Watch the URLs. Check them when evaluating Web content and be careful when
  citing them. Make sure others will be able to see what you see. Reduce them
  to their minimum functioning version when citing. Above all, mine the full
  URLs for the information that they contain.
 
  Greg R. Notess (greg@notess.com; www.notess.com) is a reference librarian at Montana State University and founder of SearchEngineShowdown.com. Comments? Email the editor at marydee@infotoday.com.
   
 |