On The Net - Fiddling with File Types

Magazines > Online > March/April 2004
Vol. 28 No. 2 — March/April 2004

On The Net
Fiddling with File Types
By Greg R. Notess
Reference Librarian, Montana State University

It's obvious that information is stored and communicated in many ways, which has ramifications for its retrieval. In the computer age, each software program seems to have one or more distinct file types in which it saves individual documents. While the Web has pushed a single format, HTML, one of the Net's great strengths has been its ability to make many other file formats available as well. Images in GIF or JPG formats are just one example.

For the information seeker, the textual file formats are usually the most desirable to find. Initially, Web search engines only indexed the text within an HTML page and ignored text within PDF or word-processing files. Fortunately for the information professional, this finally started to change in late January 2001 when Google began to index text within PDF documents. In November of that year, Google expanded to include even more file types, such as PostScript and Microsoft Word.

Before long, all the major search engines except Teoma had expanded their indexing to include at least PDF files. Some index PowerPoint, spreadsheet, and Flash files. Search engines have added special command line syntax for limiting to, or excluding, specific file formats and have integrated these documents into the search results listings.

Since these files were not necessarily created specifically for the Web, for search engines, or for searchers, there are some unique issues to consider when trying to find and view them. Plus, the nature of the files makes for some unique search problems, and the commands and scope vary between search engines.

FILE TYPE PRIMER

Almost any kind of file type can be made available on the Web. There are hundreds of file extensions available and a somewhat smaller number of file types. Take a look at a list such as the one at www.webopedia.com/quick_ref/fileextensions.asp to get a sense of the wide variety of files.

The file name extension is typically used to identify the file type. So in a file named report.abc, the abc part is the extension and identifies the type of file. Common extensions include .doc for Microsoft Word documents, .pdf for Adobe Acrobat PDF, and .ps for PostScript. The default setting in recent versions of the Windows operating system will hide these file extensions from view, but on Web pages and in search engine results, these extensions are typically viewable.
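As a rough illustration of the extension convention described above, here is a minimal Python sketch that pulls the extension out of a file name or URL and looks it up in a small table. The function names and the example URL are my own inventions for illustration, and the table covers only the three extensions just mentioned.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions named in the paragraph above; illustrative, not exhaustive.
KNOWN_TYPES = {
    "doc": "Microsoft Word",
    "pdf": "Adobe Acrobat PDF",
    "ps": "PostScript",
}


def extension_of(url_or_name: str) -> str:
    """Return the lowercase extension without the dot, or '' if there is none."""
    # urlparse drops any query string, so "file.pdf?x=1" still yields "pdf".
    path = urlparse(url_or_name).path
    return PurePosixPath(path).suffix.lstrip(".").lower()


def describe(url_or_name: str) -> str:
    """Map a file name or URL to a type label, based only on its extension."""
    return KNOWN_TYPES.get(extension_of(url_or_name), "unknown type")


if __name__ == "__main__":
    # Hypothetical inputs.
    print(extension_of("report.abc"))
    print(describe("http://example.com/docs/Annual.PDF?x=1"))
```

Keep in mind that this reads only the name, not the file itself, so a mislabeled file will be misidentified.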

THE CONTROVERSY

Surprisingly, I found some controversy regarding the indexing of other file types. Librarians and other information-oriented folks appreciate the information-rich content found within such files. Certainly PDFs and word-processing file formats tend to be longer than the standard Web page, besides being a popular way to post technical reports, periodicals, annual reports, press releases, and other important information content.

Yet information professionals do not make up the majority of Internet users. Indeed, in browsing through online discussion forums, I discovered that many in the Webmaster and e-commerce communities, especially the search engine marketers, are downright hostile towards PDFs and other files. They would prefer to have them either excluded from the likes of Google or at least ranked quite low.

Personally, my preference is to have these other file types ranked higher. Fortunately, using file type limiters, it is easy enough to pull up all the files in a certain format, as long as you know the search engine's syntax.

Why even bother with the extra file types as separate searches? Sometimes such documents can contain very interesting information unavailable elsewhere. Spreadsheets are great sources for statistics. Limit to PowerPoint files to find recent conference presentations, especially on research that may not yet have been published elsewhere. Looking for samples of online tutorials for a topic? Add a Flash limit to the keywords for the topic to see what Flash tutorials the search engine can find. Mary Ellen Bates, in her April 2003 Tip of the Month [http://batesinfo.com/tip.html#April2003], mentioned several such uses with some great examples.

SEARCHABLE FILE TYPES

Of the hundreds of different file types available, only a few are commonly found on the Web. Of these, which are being indexed and are thus searchable, and by which search engines? Google and AlltheWeb have the most extensive coverage of file types, although PDFs are by far the most common and informationally significant of the other file types. AltaVista only indexes PDFs at this point, while the Inktomi database used at MSN Search and HotBot has PDF, PowerPoint, Word, and Excel capabilities.

Google provides access to files in many formats, including at least all of those on the following list and probably some others as well. The common file extensions for each type are listed in parentheses.

• Adobe Portable Document Format (pdf)

• Adobe PostScript (ps)

• Corel WordPerfect (wpd, wp5, wp6, wp7)

• Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wks, wku)

• Lotus WordPro (lwp)

• MacWrite (mw)

• Microsoft Excel (xls)

• Microsoft PowerPoint (ppt)

• Microsoft Word (doc)

• Microsoft Works (wks, wps, wdb)

• Microsoft Write (wri)

• Rich Text Format (rtf)

• Text (ans, txt)

AlltheWeb has a fair number as well, although it has only officially announced PDF, Microsoft Word, and Macromedia Flash. AlltheWeb includes some files from each of the following types and, like Google, probably covers even more that I have not discovered. The common extensions are listed for file types not included in Google's list.

• Adobe PDF

• Adobe PostScript

• Corel WordPerfect

• Lotus 1-2-3

• Macromedia Flash (swf)

• Microsoft Excel

• Microsoft PowerPoint

• Microsoft Word

• Rich Text Format

• Star Office (sdw, sdc, sdd)

• Text

SEARCH SYNTAX

How do you find such files? Because the search engines index these additional file types, nothing extra is needed to find them—sometimes. These files will simply show up in regular search results. For example, searching for SC00-2348 (a Florida court docket number) will bring up PDF files in the top few hits at Google, AlltheWeb, MSN, and AltaVista.

For the power searcher, the search engines do offer special features to look specifically for documents in a certain file format or to exclude such documents. The advanced search screens at Google, AlltheWeb, AltaVista, and MSN Search all have file type options to select. Yet the advanced search pages may only show a few of the file formats available. For Google and AlltheWeb in particular, the command line searching is much more powerful.

For AltaVista, with only a PDF limit, the advanced page works well. AltaVista accepts a filetype:pdf command, but it will not work in the Boolean searching box. For MSN Search, the advanced search page is the only option for file type limiting since it does not yet have a command line option. HotBot, which does include some other file types, does not have any capability to limit file types, even on its advanced search page. HotBot's Page Content limit is similar, but it looks for pages that link to or embed such files rather than searching for the files themselves.

The command line syntax is only of use at present at Google and AlltheWeb, and of course each uses a somewhat different syntax.

THE GOOGLE VERSION

Google's Advanced Search page only offers some of the most popular file type limits. These are under the label of "File Formats" and give six choices: PDF, PostScript, Word, Excel, PowerPoint, and Rich Text Format. The advanced page does give the option to either limit results to a particular format or to exclude all of a particular format, but multiple formats cannot be combined.

The command line version (which can be used in the regular Google search box) uses the syntax of filetype: followed by the extension. This cannot be used alone and has to be combined with another search term. To search for an Excel spreadsheet that includes cognizant, use

cognizant filetype:xls

To search for Lotus 1-2-3 spreadsheets mentioning health, try a search like

health filetype:wks OR filetype:wku OR filetype:wk5 OR filetype:wk4 OR filetype:wk3
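Typing a long OR chain like the Lotus 1-2-3 search above invites stray spaces and typos. A small Python sketch can assemble the string instead. The function name is hypothetical, and the code only builds the query text in the form shown in this column; it does not contact Google.

```python
from typing import Iterable


def google_filetype_query(keywords: str, extensions: Iterable[str]) -> str:
    """Join keywords with one filetype: limit per extension, separated by OR."""
    terms = keywords.strip()
    if not terms:
        # The filetype: limit cannot be used alone, so require a search term.
        raise ValueError("filetype: must be combined with another search term")
    limits = " OR ".join(
        f"filetype:{ext.lower().lstrip('.')}" for ext in extensions
    )
    return f"{terms} {limits}".strip()


if __name__ == "__main__":
    print(google_filetype_query("cognizant", ["xls"]))
    print(google_filetype_query("health", ["wks", "wku", "wk5", "wk4", "wk3"]))
```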

Google expands on its cached copy of Web pages to provide HTML versions of many of its separately indexed additional file type documents. Look for the "View as HTML" link in the search results list. This will show an HTML version of the file, which is especially useful for a quick look at the content or when you do not have the necessary viewer for that file type.

THE ALLTHEWEB ALTERNATIVE

The AlltheWeb Advanced Search page also uses the label of "File Formats" for the file type limit and has only three choices: Adobe PDF, Macromedia Flash, and Microsoft Word. Yet there are many more file types indexed by AlltheWeb. These are not officially released, and sometimes they behave rather strangely.

The command syntax is similar at first to Google, using the filetype: prefix, but then rather than using the file extensions, AlltheWeb uses their MIME type designation.

• filetype:pdf

• filetype:flash

• filetype:msword

• filetype:rtf

• filetype:powerpoint

• filetype:excel

• filetype:postscript

• filetype:wordperfect

• filetype:staroffice

• filetype:lotus123

• filetype:text

• filetype:xml

The advantage to this approach is that a single search for filetype:staroffice can find files with several StarOffice extensions, including sdw, sdc, and sdd, without stringing a long OR statement together. The disadvantage is that the syntax is different from Google and harder to remember, especially for the frequent Google user.

AlltheWeb has neither the View as HTML option nor a cached copy of Web pages. Still, the availability of the Flash and StarOffice files provides access to content not on Google. AlltheWeb also finds some files in PDF and other common file types that Google does not.

THE GIGABLAST ALTERNATIVE

One lesser-known search engine, Gigablast, also indexes several file types (Word, Excel, PowerPoint, and PostScript) as well as plain-text documents. Gigablast is significantly smaller at this point than the major search engines, with 200+ million indexed documents compared to the billions in the major search engines, but since Gigablast is planning on being able to expand up to 5 billion, it is well worth watching.

No option for file type limits is available (at this point) on its advanced search page, so searchers have to use command syntax. Instead of filetype:, Gigablast uses type: followed by the extension. So it is more like Google than AlltheWeb. The extensions are the same as Google except for the plain text, which is "text" rather than just "txt."

• type:pdf

• type:doc

• type:xls

• type:ppt

• type:ps

• type:text

Although Gigablast is small now, it at least contains a cached copy of Web pages and of the other file types as well. Rather than using Google's View as HTML, Gigablast just labels them "[cached]" like the Web pages. Again they are text versions of the files, but a great way to quickly view the information content.
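Since the three engines spell the same limit differently, a lookup table can help keep them straight. The Python sketch below is only a memory aid built from the lists in this column. The format labels ("word", "excel", and so on) and the function name are my own, and it covers just the six formats listed for all three engines.

```python
# Engine -> (command prefix, {format label -> names that engine expects}).
# Values come from the lists in this column; the format labels are hypothetical.
ENGINES = {
    "google": ("filetype:", {
        "pdf": ["pdf"], "word": ["doc"], "excel": ["xls"],
        "powerpoint": ["ppt"], "postscript": ["ps"], "text": ["txt", "ans"],
    }),
    "alltheweb": ("filetype:", {
        "pdf": ["pdf"], "word": ["msword"], "excel": ["excel"],
        "powerpoint": ["powerpoint"], "postscript": ["postscript"],
        "text": ["text"],
    }),
    "gigablast": ("type:", {
        "pdf": ["pdf"], "word": ["doc"], "excel": ["xls"],
        "powerpoint": ["ppt"], "postscript": ["ps"], "text": ["text"],
    }),
}


def limited_query(engine: str, keywords: str, fmt: str) -> str:
    """Build a keyword search limited to one format, in the engine's syntax."""
    prefix, names = ENGINES[engine]
    limits = " OR ".join(prefix + name for name in names[fmt])
    return f"{keywords} {limits}"


if __name__ == "__main__":
    for engine in ENGINES:
        print(engine, "->", limited_query(engine, "budget", "word"))
```

The table makes the trade-off visible: where an engine uses one name per format, the query stays short, while an extension-based engine may need an OR chain.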

FILE ACCESS

While all it takes to put a file on the Web is to load the file on a Web server and create a link to it, that does not necessarily mean that the rest of us will be able to view the file. Take an Acrobat PDF as an example. To view a PDF, a searcher must have an Acrobat viewer in addition to the Web browser. While there is a free viewer available, it has to be installed and working to view the content within the PDF.

For other file types, there may or may not be a free viewer available. Microsoft Office users should have no problem viewing Word, Excel, or PowerPoint files, but StarOffice, or even Microsoft Works files, may not be directly viewable. If the file does not load easily, look at the file extension for a clue as to the file type. Google and AlltheWeb will make some guess as to the file type, but knowing that it is a Microsoft Works or PostScript file will not help if you do not have a program to view such files. In addition, people can use unusual file extensions. If a Word document has a .pdf at the end of its file name, Acrobat will try to open it.
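When the extension cannot be trusted, the first few bytes of a saved file are often a better clue. The following Python sketch checks a handful of widely used file signatures. The function name is hypothetical, the list is far from complete, and a match is only a strong hint rather than proof of the format.

```python
# (leading bytes, label) pairs for a few formats discussed in this column.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"%!PS", "PostScript"),
    (b"{\\rtf", "Rich Text Format"),
    # OLE2 container used by the binary Word, Excel, and PowerPoint formats.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "Microsoft Office (OLE2 container)"),
    (b"FWS", "Flash (uncompressed)"),
    (b"CWS", "Flash (compressed)"),
]


def sniff(data: bytes) -> str:
    """Guess a file type from its leading bytes, ignoring the file name."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unrecognized"


if __name__ == "__main__":
    # A Word document saved with a .pdf extension would still start
    # with the OLE2 signature, not with "%PDF-".
    print(sniff(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1" + b"\x00" * 8))
    print(sniff(b"%PDF-1.4\n"))
```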

Sometimes one of these unusual file types will not load properly, will automatically prompt to save to disk, or will display on the screen as gibberish. This could be due to the remote Web server not being configured to recognize the file as the correct MIME type. This is where Google's View as HTML feature and Gigablast's cached copies can be so useful for viewing some of the content, if not the formatting. For other search engines, try saving the file by right-clicking the link and choosing "Save Target As..."
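To see what a MIME mismatch looks like, this Python sketch compares the type a server reports with the type the file extension would suggest, using the standard library's mimetypes table. The function names and sample header values are hypothetical, and the extension table can vary from one system to another, so treat the result as a hint only.

```python
import mimetypes
from typing import Optional


def expected_mime(filename: str) -> Optional[str]:
    """Return the MIME type the extension suggests, or None if unknown."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed


def looks_misconfigured(filename: str, content_type_header: str) -> bool:
    """True when the served type disagrees with the extension's usual type."""
    expected = expected_mime(filename)
    # Drop parameters such as "; charset=..." before comparing.
    served = content_type_header.split(";")[0].strip().lower()
    return expected is not None and served != expected


if __name__ == "__main__":
    # A PDF served as plain text is likely to display as gibberish.
    print(looks_misconfigured("report.pdf", "text/plain; charset=ISO-8859-1"))
    print(looks_misconfigured("report.pdf", "application/pdf"))
```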

SEARCH CONSIDERATIONS

To index the content of all these non-HTML files, the search engines have to find a way to transform the file into one with indexable text. They have to filter something that looks like

%â??Ó

157 0 obj

<<

/Linearized 1

and strip out the codes to find the remaining text. That filtering process can lead to some strange interpretations of the text within the document. In PDF files particularly, initial letters may be separated from the rest of the word. Try a search on nalyze filetype:pdf to find all kinds of hits due to an extra space. For any keywords, especially those that might start a sentence, try leaving off the first letter.
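The first-letter trick is easy to automate. This short Python sketch produces both the normal query and the variant with the initial letter dropped, in the Google syntax used earlier. The function name is my own invention, and the sketch only builds query strings.

```python
def truncated_variants(keyword: str, extension: str = "pdf") -> list:
    """Return the keyword query plus a variant missing the first letter."""
    word = keyword.strip()
    variants = [word]
    # Very short words would shrink to almost nothing, so skip them.
    if len(word) > 2:
        variants.append(word[1:])
    return [f"{v} filetype:{extension}" for v in variants]


if __name__ == "__main__":
    for query in truncated_variants("analyze"):
        print(query)
```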

Many other strange things happen to these files when converted to an indexable format, especially for more graphically oriented files like Flash or PowerPoint. While these can be information-rich files, the filtered translation means that they may only be found with some creative guessing of words or word fragments found within the documents.

The non-HTML files can be great sources of information and are now an important part of Web searching. They may appear on any search, even in the top 10 results. Knowing how to limit to specific file types, exclude others, combine several, and view them is not a search skill needed every day, but it is one more technique that for certain searches can help the professional find information that no one else can.

Greg R. Notess (greg@notess.com; www.notess.com) is a reference librarian at Montana State University and founder of SearchEngineShowdown.com.

Comments? Email the editor at marydee@infotoday.com.
