On The Net: Fiddling with File Types
 By Greg R. Notess
 Reference Librarian, Montana State University
 
 It's obvious that information is stored and communicated in many
  ways, which has ramifications for its retrieval. In the computer age, each
  software program seems to have one or more distinct file types in which it
  saves individual documents. While the Web has pushed a single format, HTML,
  one of the Net's great strengths has been its ability to make many other file
  formats available as well. Images in GIF or JPG formats are just one example.
  For the information seeker, the textual file formats are usually the most
  desirable to find. Initially, Web search engines only indexed the text within
  an HTML page and ignored text within PDF or word-processing files. Fortunately
  for the information professional, this finally started to change in late January
  2001 when Google began to index text within PDF documents. In November of that
  year, Google expanded to include even more file types, such as PostScript and
  Microsoft Word.
  Before long, all the major search engines except Teoma had expanded their
  indexing to include at least PDF files. Some index PowerPoint, spreadsheet,
  and Flash files. Search engines have added special command line syntax for
  limiting to, or excluding, specific file formats and have integrated these
  documents into the search results listings.
  Since these files were not necessarily created specifically for the Web,
  for search engines, or for searchers, there are some unique issues to consider
  when trying to find and view them. Plus, the nature of the files makes for
  some unique search problems, and the commands and scope vary between search
  engines.
  FILE TYPE PRIMER
  Almost any kind of file type can be made available on the Web. There are
  hundreds of file extensions available and a somewhat smaller number of file
  types. Take a look at a list such as the one at www.webopedia.com/quick_ref/fileextensions.asp  to get a sense of the wide variety of files.
  The file name extension is typically used to identify the file type. So in
  a file named report.abc, the abc part is the extension and identifies the type
  of file. Common extensions include .doc for Microsoft Word documents, .pdf
  for Adobe Acrobat PDF, and .ps for PostScript. The default setting in recent
  versions of the Windows operating system will hide these file extensions from
  view, but on Web pages and in search engine results, these extensions are typically
  viewable.
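
  As a minimal sketch of the idea, the extension can be split off a file name and looked up in a small table. The function name and the table below are illustrative assumptions, and the table covers only the three extensions mentioned above.

```python
import os

# Extensions named in this column, mapped to the file type they usually signal.
KNOWN_TYPES = {
    "doc": "Microsoft Word",
    "pdf": "Adobe Acrobat PDF",
    "ps": "PostScript",
}

def file_type(name):
    """Return (extension, guessed type) for a file name."""
    # splitext keeps the leading dot, so strip it and normalize the case.
    ext = os.path.splitext(name)[1].lstrip(".").lower()
    return ext, KNOWN_TYPES.get(ext, "unknown")

print(file_type("report.abc"))
print(file_type("Annual.PDF"))
```

  An extension is only a label, so a lookup like this is a guess about the contents rather than a guarantee.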
  THE CONTROVERSY
  Surprisingly, I found some controversy regarding the indexing of other file
  types. Librarians and other information-oriented folks appreciate the information-rich
  content found within such files. Certainly PDFs and word-processing file formats
  tend to be longer than the standard Web page, besides being a popular way to
  post technical reports, periodicals, annual reports, press releases, and other
  important information content.
  Yet information professionals do not make up the majority of Internet users.
  Indeed, in browsing through online discussion forums, I discovered that many
  in the Webmaster and e-commerce communities, especially the search engine marketers,
  are downright hostile towards PDFs and other files. They would prefer to have
  them either excluded from the likes of Google or at least ranked quite low.
  Personally, my preference is to have these other file types ranked higher.
  Fortunately, using file type limiters, it is easy enough to pull up all the
  files in a certain format, as long as you know the search engine's syntax.
  Why even bother with the extra file types as separate searches? Sometimes
  such documents can contain very interesting information unavailable elsewhere.
  Spreadsheets are great sources for statistics. Limit to PowerPoint files to
  find recent conference presentations, especially on research that may not yet
  have been published elsewhere. Looking for samples of online tutorials for
  a topic? Add a Flash limit to the keywords for the topic to see what Flash
  tutorials the search engine can find. Mary Ellen Bates, in her April 2003 Tip
  of the Month [http://batesinfo.com/tip.html#April2003], mentioned several such
  uses with some great examples.
  SEARCHABLE FILE TYPES
  Of the hundreds of different file types available, only a few are commonly
  found on the Web. Of these, which are being indexed and are thus searchable,
  and by which search engines? Google and AlltheWeb have the most extensive coverage
  of file types, although PDFs are by far the most common and informationally
  significant of the other file types. AltaVista only indexes PDFs at this point,
  while the Inktomi database used at MSN Search and HotBot has PDF, PowerPoint,
  Word, and Excel capabilities.
  Google provides access to files in many formats, including at least all of
  those on the following list and probably some others as well. The common file
  extensions for each type are listed in parentheses.
 
    Adobe Portable Document Format (pdf)
    Adobe PostScript (ps)
    Corel WordPerfect (wpd, wp5, wp6, wp7)
    Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wks, wku)
    Lotus WordPro (lwp)
    MacWrite (mw)
    Microsoft Excel (xls)
    Microsoft PowerPoint (ppt)
    Microsoft Word (doc)
    Microsoft Works (wks, wps, wdb)
    Microsoft Write (wri)
    Rich Text Format (rtf)
    Text (ans, txt)

  AlltheWeb has a fair number as well, although it has only officially announced
  PDF, Microsoft Word, and Macromedia Flash. AlltheWeb includes some files from
  each of the following types and, like Google, probably covers even more that
  I have not discovered. The common extensions are listed for file types not
  included in Google's list.
 
    Adobe PDF
    Adobe PostScript
    Corel WordPerfect
    Lotus 1-2-3
    Macromedia Flash (swf)
    Microsoft Excel
    Microsoft PowerPoint
    Microsoft Word
    Rich Text Format
    Star Office (sdw, sdc, sdd)
    Text

  SEARCH SYNTAX
  How do you find such files? Because the search engines index these additional
  file types, nothing extra is needed to find them, sometimes. These files
  will simply show up in regular search results. For example, searching for SC00-2348
  (a Florida court docket number) will bring up PDF files in the top few hits
  at Google, AlltheWeb, MSN, and AltaVista.
  For the power searcher, the search engines do offer special features to look
  specifically for documents in a certain file format or to exclude such documents.
  The advanced search screens at Google, AlltheWeb, AltaVista, and MSN Search
  all have file type options to select. Yet the advanced search pages may only
  show a few of the file formats available. For Google and AlltheWeb in particular,
  the command line searching is much more powerful.
  For AltaVista, with only a PDF limit, the advanced page works well. AltaVista
  accepts a filetype:pdf command, but it will not work in the Boolean searching
  box. For MSN Search, the advanced search page is the only option for file type
  limiting since it does not yet have a command line option. HotBot, which does
  include some other file types, does not have any capability to limit file types,
  even on its advanced search page. HotBot's Page Content limit is similar, but
  it looks for pages that link to or embed such files rather than searching for
  the files themselves.
  The command line syntax is only of use at present at Google and AlltheWeb,
  but of course both use somewhat different syntax.
  THE GOOGLE VERSION
  Google's Advanced Search page only offers some of the most popular file type
  limits. These are under the label of "File Formats" and give six choices: PDF,
  PostScript, Word, Excel, PowerPoint, and Rich Text Format. The advanced page
  does give the option to either limit results to a particular format or to exclude
  all of a particular format, but multiple formats cannot be combined.
  The command line version (which can be used in the regular Google search
  box) uses the syntax of filetype: followed by the extension. This cannot be
  used alone and has to be combined with another search term. To search for an
  Excel spreadsheet that includes cognizant, use
  cognizant filetype:xls
  To search for Lotus 1-2-3 spreadsheets mentioning health, try a search like
  health filetype:wks OR filetype:wku OR filetype:wk5 OR filetype:wk4 OR filetype:wk3
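
  Stringing the OR terms together by hand is error-prone, so a small helper can assemble the query text. This is only a sketch: the function name is made up for illustration, and it builds the string to type into the search box rather than contacting Google.

```python
def google_filetype_query(keywords, extensions):
    """Build 'keywords filetype:a OR filetype:b' for the Google search box."""
    if not keywords.strip():
        # The filetype: limit has to be combined with another search term.
        raise ValueError("filetype: cannot be used alone")
    limits = " OR ".join(f"filetype:{ext}" for ext in extensions)
    return f"{keywords} {limits}".strip()

print(google_filetype_query("cognizant", ["xls"]))
print(google_filetype_query("health", ["wks", "wku", "wk5", "wk4", "wk3"]))
```

  The two calls reproduce the Excel and Lotus 1-2-3 searches shown above.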
  Google expands on its cached copy of Web pages to provide HTML versions of
  many of its separately indexed additional file type documents. Look for the "View
  as HTML" link in the search results list. This will show an HTML version of
  the file, which is especially useful for a quick look at the content or when
  you do not have the necessary viewer for that file type.
  THE ALLTHEWEB ALTERNATIVE
  The AlltheWeb Advanced Search page also uses the label of "File Formats" for
  the file type limit and has only three choices: Adobe PDF, Macromedia Flash,
  and Microsoft Word. Yet there are many more file types indexed by AlltheWeb.
  These are not officially released, and sometimes they behave rather strangely.
  The command syntax is similar at first to Google, using the filetype: prefix,
  but then rather than using the file extensions, AlltheWeb uses their MIME type
  designation.
 
    filetype:pdf
    filetype:flash
    filetype:msword
    filetype:rtf
    filetype:powerpoint
    filetype:excel
    filetype:postscript
    filetype:wordperfect
    filetype:staroffice
    filetype:lotus123
    filetype:text
    filetype:xml

  The advantage to this approach is that a single search for filetype:staroffice
  can find files with several StarOffice extensions, including sdw, sdc, and
  sdd, without stringing a long OR statement together. The disadvantage is that
  the syntax is different from Google and harder to remember, especially for
  the frequent Google user.
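
  The trade-off can be sketched by writing the same limit both ways. Only the staroffice grouping is stated in this column. The wordperfect and lotus123 groupings below are assumptions borrowed from Google's extension list, and the function names are hypothetical.

```python
# AlltheWeb type names paired with the extensions they are assumed to cover.
ALLTHEWEB_EXTENSIONS = {
    "staroffice": ["sdw", "sdc", "sdd"],            # stated in the column
    "wordperfect": ["wpd", "wp5", "wp6", "wp7"],    # assumption
    "lotus123": ["wk1", "wk2", "wk3", "wk4",
                 "wk5", "wki", "wks", "wku"],       # assumption
}

def alltheweb_limit(type_name):
    """One term stands for every extension grouped under the type name."""
    if type_name not in ALLTHEWEB_EXTENSIONS:
        raise ValueError(f"no extension list recorded for {type_name!r}")
    return f"filetype:{type_name}"

def extension_limit(type_name):
    """The same coverage spelled out one extension at a time."""
    extensions = ALLTHEWEB_EXTENSIONS[type_name]
    return " OR ".join(f"filetype:{ext}" for ext in extensions)

print(alltheweb_limit("staroffice"))
print(extension_limit("staroffice"))
```

  The first line printed is the single AlltheWeb term, and the second is the OR chain an extension-based syntax would need for the same files.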
  AlltheWeb has no View as HTML option, just as it has no cached copies of
  Web pages. Still, the availability of the Flash and StarOffice files
  provides access to content not on Google. AlltheWeb also finds some files
  in PDF and other common file types that Google does not.
  THE GIGABLAST ALTERNATIVE
  One lesser-known search engine, Gigablast, also indexes several file types (Word,
  Excel, PowerPoint, PostScript) and plain-text documents. Gigablast is significantly
  smaller at this point than the major search engines,
  with 200+ million indexed documents compared to the billions in the major search
  engines, but since Gigablast is planning on being able to expand up to 5 billion,
  it is well worth watching.
  No option for file type limits is available (at this point) on its advanced
  search page, so searchers have to use command syntax. Instead of filetype:, Gigablast uses type: followed by the extension. So it is more like Google
  than AlltheWeb. The extensions are the same as Google except for the plain
  text, which is "text" rather than just "txt."
 
    type:pdf
    type:doc
    type:xls
    type:ppt
    type:ps
    type:text

  Although Gigablast is small now, it at least contains a cached copy of Web
  pages and of the other file types as well. Rather than using Google's View
  as HTML, Gigablast just labels them "[cached]" like the Web pages. Again they
  are text versions of the files, but a great way to quickly view the information
  content.
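
  Since the Gigablast limit differs from Google's only in the prefix and the plain-text term, converting a Google-style extension is a short exercise. The sketch below assumes the six type: terms listed above are the full set. The function name is invented for illustration.

```python
# The six type: terms listed for Gigablast in this column.
GIGABLAST_TYPES = {"pdf", "doc", "xls", "ppt", "ps", "text"}

def gigablast_limit(extension):
    """Turn a Google-style extension into a Gigablast type: term."""
    # Plain text is the one exception: "text" rather than "txt".
    term = "text" if extension == "txt" else extension
    if term not in GIGABLAST_TYPES:
        raise ValueError(f"no Gigablast type: term known for {extension!r}")
    return f"type:{term}"

print(gigablast_limit("pdf"))
print(gigablast_limit("txt"))
```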
  FILE ACCESS
  While all it takes to put a file on the Web is to load the file on a Web
  server and create a link to it, that does not necessarily mean that the rest
  of us will be able to view the file. Take an Acrobat PDF as an example. To
  view a PDF, a searcher must have an Acrobat viewer in addition to the Web browser.
  While there is a free viewer available, it has to be installed and working
  to view the content within the PDF.
  For other file types, there may or may not be a free viewer available. Microsoft
  Office users should have no problem viewing Word, Excel, or PowerPoint files,
  but StarOffice, or even Microsoft Works files, may not be directly viewable.
  If the file does not load easily, look at the file extension for a clue as
  to the file type. Google and AlltheWeb will make some guess as to the file
  type, but knowing that it is a Microsoft Works or PostScript file will not
  help if you do not have a program to view such files. In addition, people can
  use unusual file extensions. If a Word document has a .pdf at the end of its
  file name, Acrobat will try to open it.
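
  When the extension cannot be trusted, the first few bytes of a saved file are often a better clue. The sketch below checks a handful of well-known file signatures. These signatures are not from this column, the list is far from complete, and the function name is hypothetical.

```python
# Leading bytes that commonly identify a format, whatever the file is named.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"%!PS", "postscript"),
    (b"{\\rtf", "rtf"),
    # Container used by older Word, Excel, and PowerPoint files.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ms-office"),
]

def sniff(head):
    """Guess a format from the first bytes of a file, or return None."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return None

# A Word document saved as "memo.pdf" still starts with the Office bytes.
print(sniff(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00"))
print(sniff(b"%PDF-1.4\n"))
```

  To try it on a real download, read the start of the file with open(path, "rb").read(8) and pass the result to sniff.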
  Sometimes one of these unusual file types will not load properly, will automatically
  prompt to save to disk, or will display on the screen as gibberish. This could
  be due to the remote Web server not being configured to recognize the file
  as the correct MIME type. This is where Google's View as HTML feature and Gigablast's
  cached copies can be so useful to view some of the content, if not the formatting.
  For other search engines, try saving the file by right-clicking the mouse
  and choosing to "Save target as. . . ."
  SEARCH CONSIDERATIONS
  To index the content of all these non-HTML files, the search engines have
  to find a way to transform the file into one with indexable text. They have
  to filter something that looks like
 
   %âãÏÓ
    157 0 obj
   <<
    /Linearized 1
    and strip out the codes to find the remaining text. That filtering process
  can lead to some strange interpretations of the text within the document. In
  PDF files particularly, initial letters may be separated from the rest of the
  word. Try a search on nalyze filetype:pdf to find all kinds of hits due to
  an extra space. For any keywords, especially those that might start a sentence,
  try leaving off the first letter.
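
  The first-letter trick is easy to apply mechanically. This sketch, with a made-up function name, produces the normal query plus the truncated one, using the Google filetype: syntax.

```python
def first_letter_variants(keyword, filetype="pdf"):
    """Return the usual query plus one with the first letter dropped."""
    queries = [f"{keyword} filetype:{filetype}"]
    # Very short words would truncate to almost nothing, so skip them.
    if len(keyword) > 2:
        queries.append(f"{keyword[1:]} filetype:{filetype}")
    return queries

for query in first_letter_variants("analyze"):
    print(query)
```

  Running both queries and comparing the results shows which documents were indexed with the stray space.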
  Many other strange things happen to these files when converted to an indexable
  format, especially for more graphically oriented files like Flash or PowerPoint.
  While these can be information-rich files, the filtered translation means that
  they may only be found with some creative guessing of words or word fragments
  found within the documents.
  The non-HTML files can be great sources of information and are now an important
  part of Web searching. They may appear on any search, even in the top 10 results.
  Knowing how to limit to specific file types, exclude others, combine several,
  and view them is not a search skill needed every day, but it is one more
  technique that for certain searches can help the professional find information
  that no one else can.
 
  Greg R. Notess (greg@notess.com; www.notess.com) is a reference librarian at Montana State University and founder of SearchEngineShowdown.com. Comments? Email the editor at marydee@infotoday.com.
   