ONLINE Magazine
THE LEADING MAGAZINE FOR 
INFORMATION PROFESSIONALS 
VOLUME 26 • NUMBER 2 • MARCH/APRIL 2002 
• FEATURE •
Searching the PCT Patent Files: Another Instance of Faux Full Text
by Stephen Adams

Over the last few years, there has been an explosion in access to online files containing the complete texts of patents. These types of files present some unique advantages to both regular and infrequent searchers, but they also have some significant pitfalls for the unwary. This short study considers the content of files containing the full texts of patent applications published under the Patent Co-operation Treaty (PCT).
 

OPERATION OF THE PATENT CO-OPERATION TREATY

Before discussing the files, you should know about the operation of the PCT in general terms. The PCT is administered by the International Bureau (IB) of the World Intellectual Property Organization (WIPO). Despite the fact that published PCT documents are often loosely referred to as "world patents," this term is highly misleading. It is not part of the role of the IB to grant patents at all—this is still handled by the various national and regional patent offices around the world. The function of the IB is to try to make the patent system more accessible to all countries, by providing a common "front-end" to the patenting process. Once the Bureau deals with the initial formalities, these so-called "international patent applications" may be forwarded to one or more national offices for the second stage, namely substantive examination that may lead to a granted patent.

One of the striking aspects of the PCT is its multilingualism. This is a deliberate policy, in order to help the would-be patentee to defer the considerable costs of translating an application into the languages required by the world's examining offices. 
 

FILINGS CHRONOLOGY

The normal sequence of events for filing an application under the PCT, using the so-called Chapter I procedure, is as follows:

1. Initial application—usually in the applicant's home country in their national language.

2. At 12 months—file an international patent application under the PCT, again in the national language, at a "national receiving office," usually a department within the national patent office.

3. At 18 months—the international patent application is published, in one of seven official languages (English, French, German, Japanese, Russian, Spanish, or Chinese).

4. At 20 or 21 months—transfer to national patent offices for further processing to grant.

Stages (1) and (2) are both entirely confidential, but stage (3) generates a published document, which is the source material for the full-text files under discussion here. They are numbered in an annual series, with a two-letter "WO" prefix identifying the publishing authority, WIPO—hence the first document published in 1999 was WO 99/00001. The WO documents are, broadly speaking, indicative of an applicant's desire to obtain patent protection for his or her invention in a large number of countries around the world. As such, they have become a highly significant source of early warning intelligence for the patent searcher.
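The numbering scheme just described can be illustrated with a short Python sketch. It assumes only the "WO yy/nnnnn" layout of the examples quoted in this article (a two-digit year and a five-digit running number); the names `WONumber` and `parse_wo_number` are hypothetical and are not part of any vendor's system.

```python
import re
from typing import NamedTuple


class WONumber(NamedTuple):
    year: int    # two-digit year as printed, e.g. 99 for 1999
    serial: int  # running number within that year's series


# Assumed layout: "WO", optional space, two-digit year, slash, five digits.
_WO_PATTERN = re.compile(r"^WO\s*(\d{2})/(\d{5})$")


def parse_wo_number(text: str) -> WONumber:
    """Split a publication number such as 'WO 99/00001' into year and serial."""
    match = _WO_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a WO publication number: {text!r}")
    return WONumber(year=int(match.group(1)), serial=int(match.group(2)))


first_of_1999 = parse_wo_number("WO 99/00001")
print(first_of_1999.year, first_of_1999.serial)
```

Because the serial restarts each year, the highest serial found for a year gives an upper bound on the number of documents published in that year, a point that matters when checking file completeness.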

There are two points in the process where an applicant may be required to file a translation of his or her patent application. This will depend upon whether the applicant has one of the seven PCT languages as his or her own national language. Consider, for example, a Swedish company that intends to obtain a patent in Japan via the PCT. At stage (1), its original ("priority") filing will be in Swedish. At stage (2), it is again permitted to use Swedish in order to lodge its international application, but by the time it reaches stage (3), it must provide a translation into one of the official publication languages; typically, this would be into English. Eventually, if it decides to proceed to full examination, it will be required to provide a further translation, this time into Japanese, so that the Japanese Patent Office can proceed with the case.

By contrast, the situation for applicants in states that already use one of the seven languages of the PCT as their official language is somewhat simpler. For example, an Austrian company would lodge a priority document in German (stage 1), followed by the same text at stage (2), with the same text publishing at stage (3), and it would only be faced with any translation costs when it came to stage (4) when, in common with the Swedish applicant, it would be obliged to provide a Japanese translation if it chose to proceed in Japan.

This disparity in processes makes it important to distinguish between the language used at stage (2), which is referred to as the "filing language" of the PCT application, and that used at stage (3), which is the "publication language" of the actual WO-prefix documents that enter the databases. Unfortunately, some of the earlier attempts to provide searchable files of PCT documents did not retain this distinction, and merely created a single LA (language) field that was effectively useless. There are several candidate files for this survey, referred to in the following discussion by the file number in the first column of the accompanying table, "Sources of PCT Full-Text Files."
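The difference between the two languages can be made concrete with a small Python sketch based on the Swedish and Austrian examples above. It is an illustration only: the `LanguagePlan` record and the `plan_languages` function are hypothetical names, and the default translation target of English simply reflects the "typically" in the Swedish example.

```python
from dataclasses import dataclass

# The seven official publication languages listed under stage (3).
OFFICIAL_PUBLICATION_LANGUAGES = frozenset(
    {"English", "French", "German", "Japanese", "Russian", "Spanish", "Chinese"}
)


@dataclass(frozen=True)
class LanguagePlan:
    filing_language: str        # language used at stage (2)
    publication_language: str   # language of the published WO document, stage (3)
    translated_for_publication: bool


def plan_languages(national_language: str,
                   translation_target: str = "English") -> LanguagePlan:
    """Work out the filing and publication languages for one applicant."""
    if national_language in OFFICIAL_PUBLICATION_LANGUAGES:
        # Same text at stages (1), (2), and (3); no translation before publication.
        return LanguagePlan(national_language, national_language, False)
    if translation_target not in OFFICIAL_PUBLICATION_LANGUAGES:
        raise ValueError(f"{translation_target} is not a publication language")
    return LanguagePlan(national_language, translation_target, True)


print(plan_languages("Swedish"))
print(plan_languages("German"))
```

Keeping the two values in separate fields, rather than in one LA field, is what allows a searcher to tell a Swedish-filed, English-published case from one filed in English throughout.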

In addition to MicroPatent, Dialog, STN, and Questel*Orbit, Aurigin Inc. has announced the availability of an intranet file of PCT specifications under its Aureka system, apparently covering from 1978 to the present. Further files are believed to be pending from Delphion and Univentio, but no further details are known at this time.
 

FILE PREPARATION 

A key issue in considering these files is the question of how complete they are for the claimed time span. This in turn relates to the method by which the file data are collected and turned into machine-readable files.

MicroPatent, Dialog, and STN all use a dataset originated by MicroPatent LLC, which is prepared by Optical Character Recognition (OCR) scanning of original paper texts. This has two immediate consequences. From the point of view of searchability of the resultant text files, OCR is a vastly improved technology, but is still far from 100% perfect. The most notable handicap is in the handling of tabular data. Patent specifications frequently use tables to illustrate test results or comparative examples, and it is quite common for the OCR version of these tables to be so mangled that the individual words within the table columns become unusable as discrete search terms. This is a pity, as column headings or cell entries may contain potentially valuable text, not used elsewhere in the specification.

The second principal cause for concern is the treatment of non-Roman scripts. As outlined above, some of the published PCT applications are issued in Japanese, Chinese, or Russian. As far as I know, MicroPatent does not attempt to include these texts in its file at all. The searchable elements for these records are thereby reduced to basic bibliographic fields, including a title and an abstract. All of the benefit of increased recall inherent in a full-text file is lost for these records. The resultant file contains only full-texts for documents in English, French, German, and Spanish. The Bluesheet for Dialog's load of the data claims that "additional content is provided by WIPO from 1997 forward," but gives no details about whether that content relates to additional full texts or simply abstracts and bibliographic records.

In July 2001, Aurigin announced that it had completed its own OCR version of the PCT specifications, back to 1978. A press release stated that the file would "extend full-text searchability to all English-language publications since 1978. Currently, Aureka provides full-text searchability to all (English) publications since 1993, and about 45% of pubs in the range 1978-1992." (my emphasis). From this, it is not clear whether any non-English specifications have been scanned at all. If not, it leaves substantial gaps in the file over both time periods.
 

FILE COMPLETENESS

It is by no means straightforward to derive information concerning how complete these files are. For example, the accompanying table "Publication Rates in 2000 According to WIPO" shows the official statistics issued by WIPO concerning publication under the PCT during 2000, which give a grand total of 79,947 documents.

However, examination of the paper PCT Gazette for 2000 clearly shows that the highest allocated number was WO 00/79858, i.e. a total of 79,858 published documents. What has happened to the missing 89 documents? It sometimes occurs that a WO publication number is allocated but that the document is withdrawn from publication at the last moment. If this publication number is included in the number sequence but classed as "non-published," it would result in the highest publication number being higher than the total of the publication statistics. But what is found is the reverse—the publication numbers are lower than the publication statistics. One possible explanation is that the publication statistics include a number of delayed search reports (carrying the Kind of Document code suffix A3 but the same publication number as the earlier corresponding specification). But this explanation is not supported by the WIPO statement that specifically refers to the figures as "WO pamphlets," i.e. the entire specifications. Despite approaches to WIPO, at the time of writing no satisfactory explanation for the missing 89 documents has been found.

However, the mystery deepens on examining the contents of the STN version of the MicroPatent file. The individual statistics by publication language were derived by a set of search statements "2000/PY AND CC/LA," in which "CC" was replaced by the corresponding two-letter code for publication language. Taken together, these add up to 78,492 documents (Grand Total 1, which is 1,455 short of the official WIPO total). However, using the single search term "2000/PY" results in yet another figure, 79,856 (Grand Total 2, still 91 short of the WIPO total, but only 2 different from the total derived from publication numbers). The most serious shortfall is clearly in the Japanese-language documents, where at least 1,250 appear not to have even a bibliographic entry in the file.

From the point of view of the searcher, it is bad enough that any documents are missing. The situation is then exacerbated by the treatment policy for non-Roman script documents, which are not processed to include full texts at all. This reduces the actual texts available for searching by a further 6,410 based on the missing Japanese, Chinese, and Russian texts, with one further English and five German texts also disappearing.
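The various shortfalls quoted above can be re-derived from the figures in the accompanying tables. The following Python sketch simply repeats that arithmetic; the variable names are illustrative, and the numbers are copied from the WIPO and MicroPatent/STN tables for publication year 2000.

```python
# Official WIPO publication statistics for 2000, by publication language.
wipo_official = {
    "English": 56_084, "German": 12_010, "Japanese": 7_057, "French": 3_654,
    "Russian": 496, "Spanish": 422, "Chinese": 224,
}
# Records found in the STN file by searching "2000/PY AND CC/LA".
stn_records = {
    "English": 56_019, "German": 11_990, "Japanese": 5_807, "French": 3_651,
    "Russian": 413, "Spanish": 422, "Chinese": 190,
}
# Records in the STN file that actually carry a full text.
stn_full_texts = {
    "English": 56_018, "German": 11_985, "Japanese": 0, "French": 3_651,
    "Russian": 0, "Spanish": 422, "Chinese": 0,
}

HIGHEST_WO_NUMBER = 79_858   # WO 00/79858, from the paper PCT Gazette
STN_PY_2000_HITS = 79_856    # Grand Total 2: searching "2000/PY" alone

wipo_total = sum(wipo_official.values())
grand_total_1 = sum(stn_records.values())
grand_total_3 = sum(stn_full_texts.values())

gazette_gap = wipo_total - HIGHEST_WO_NUMBER
language_sum_gap = wipo_total - grand_total_1
py_search_gap = wipo_total - STN_PY_2000_HITS
japanese_record_gap = wipo_official["Japanese"] - stn_records["Japanese"]

# Records present in the file but without a searchable full text.
lost_texts = {lang: stn_records[lang] - stn_full_texts[lang] for lang in stn_records}
non_roman_lost = sum(lost_texts[lang] for lang in ("Japanese", "Chinese", "Russian"))

print(wipo_total, grand_total_1, grand_total_3)
print(gazette_gap, language_sum_gap, py_search_gap, japanese_record_gap)
print(non_roman_lost, lost_texts["English"], lost_texts["German"])
```

Each printed figure corresponds to a number quoted in the text, so the sketch serves as a consistency check on the tables rather than as new data.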

If we assume that the explanation for the missing documents is based on problems at source, then this will affect all the file versions based on the MicroPatent dataset (MicroPatent, Dialog, and STN). 

The fourth file, based on EPO data, is derived differently. The selection policy for the file limits its content to English, French, or German source documents; the corresponding statistics for year 2000 publications are shown in the accompanying table of full texts available in the WOTEXT file. Clearly, the overall numbers are lower as a result of the missing Spanish and non-Roman text documents, but in addition it can be seen that only between 80 and 90% of all the potentially available documents have been selected for addition to the file. Questel*Orbit, to its credit, has never pretended that this file is a complete collection of PCT texts, and always advises that it is best used in conjunction with the bibliographic PCTPAT file. The Questel*Orbit print command is designed to operate from within the PCTPAT file in order to retrieve full texts on-the-fly from the WOTEXT file. If there is no corresponding full text, the search results display only the bibliographic entry from the PCTPAT file.
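The "between 80 and 90%" figure follows directly from dividing the WOTEXT counts by the official WIPO totals for the same languages. A minimal Python check, using only the numbers from the accompanying tables and a hypothetical helper named `coverage_percent`:

```python
# Year 2000 documents with full text in WOTEXT, and the official WIPO totals.
wotext_in_file = {"English": 46_985, "German": 10_573, "French": 3_252}
wipo_official = {"English": 56_084, "German": 12_010, "French": 3_654}


def coverage_percent(in_file: int, official: int) -> float:
    """Share of officially published documents that have a text in the file."""
    return round(100 * in_file / official, 1)


coverage = {
    lang: coverage_percent(wotext_in_file[lang], wipo_official[lang])
    for lang in wotext_in_file
}
file_total = sum(wotext_in_file.values())

for lang, pct in coverage.items():
    print(f"{lang}: {pct}% of the official WIPO total")
print("Texts in file:", file_total)
```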

The further missing 10-20% is due to the selection policy for the addition of texts, which is based upon that used for the European Patent Office's internal search files. This policy gives preference to the inclusion of an English-language full-text whenever such a member is available in the patent family. For example, if an English-language PCT application was the first member of a new family to be published (the "basic"), there is a high probability that the text of this document would be added to the EPO search files as the "master text." However, if there were a granted U.S. patent (in English) published quickly (say in 15 months), followed by a German-language PCT case (an "equivalent") at 18 months, the U.S. text would be adopted for the search files. Due to this policy, it is virtually impossible to predict what proportion of the WO texts in any given period will be chosen for inclusion in the database. The situation has changed since mid-2000, when WIPO started supplying the EPO with full-texts in XML format; from this point onwards, all texts in English, French or German were added to the internal EPO files, and carried through to the WOTEXT file.
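One way to picture this selection policy is as a choice among the members of a patent family. The Python sketch below is a simplified reading of the two examples in the paragraph above, not the EPO's actual procedure: it assumes that an English-language member is preferred whenever one exists and that ties are broken by earliest publication. The `FamilyMember` record, the `choose_master_text` function, and the family labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FamilyMember:
    label: str                  # hypothetical label, e.g. "US patent" or "WO application"
    language: str
    months_after_priority: int  # publication delay measured from the priority date


def choose_master_text(family: Sequence[FamilyMember]) -> FamilyMember:
    """Pick the member whose text would be loaded (assumed rule, see lead-in)."""
    if not family:
        raise ValueError("empty patent family")
    english = [m for m in family if m.language == "English"]
    candidates = english if english else list(family)
    return min(candidates, key=lambda m: m.months_after_priority)


# Second example from the text: a U.S. patent at 15 months, a German WO at 18.
family = [
    FamilyMember("WO application", "German", 18),
    FamilyMember("US patent", "English", 15),
]
print(choose_master_text(family).label)
```

Under this reading, whether a given WO text enters the file depends on what else has been published in the same family, which is why the proportion selected in any period is so hard to predict. The sketch ignores the timing of when each member becomes available to the database producer.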
 

CONSEQUENCES FOR THE SEARCHER

The purpose of this article is not simply to criticize the file producers, who are doing the best they can with a complex set of data, but to raise awareness of the problem for the unwary searcher. Anyone coming to these files who based his or her understanding of the content upon the publicity material distributed by the vendor will be hard-pressed to discern the danger of missing prior art.

Consider searchers who approach the MicroPatent file, either in its Web site version or via one of the commercial vendors. Before they have even touched a keyboard, they are limiting their search to approximately 90% of the documents that they might think are included (this is not good news: lest anyone consider otherwise, the 80/20 rule does not apply in patentability searching!). This 90% proportion, moreover, will only continue for as long as the Roman script languages (English, French, German and Spanish) maintain their dominance within the PCT. Some time ago, the South Korean patent office was accepted as a so-called International Preliminary Examination Authority under the PCT. In my opinion, the next logical step will be to accept Korean as the eighth official publication language. Given the prolific rate at which national Korean patent applications are published, this would be a popular move and create a considerable downward pressure on the dominance of English within the system. In turn, this will cause even bigger holes to appear in the full-text databases reliant upon Roman scripts.

The second hurdle arises if the unwary searcher forgets multilingualism entirely and uses only English search terms in his or her strategy. If the user chooses the MicroPatent file, the maximum possible recall drops immediately to around 56,019/79,858, or just over 70%. Should he or she choose the WOTEXT file (against advice), his or her recall plummets to 46,985/79,858, or 59% of the entire theoretical search file. For the reason given above, in future years users may not even be able to maintain that level, as English loses its dominance.
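These recall ceilings are simple ratios, and the calculation can be repeated for any file once the number of English-language texts is known. A short Python sketch, in which `recall_ceiling` is a hypothetical helper and the denominator is the total derived from the highest 2000 publication number:

```python
THEORETICAL_FILE_SIZE = 79_858  # highest WO number allocated in 2000


def recall_ceiling(english_texts: int, total: int = THEORETICAL_FILE_SIZE) -> float:
    """Best possible recall when only English-language texts can be matched."""
    return english_texts / total


micropatent_ceiling = recall_ceiling(56_019)  # English records, MicroPatent/STN file
wotext_ceiling = recall_ceiling(46_985)       # English texts, WOTEXT file

print(f"MicroPatent: {micropatent_ceiling:.1%}")
print(f"WOTEXT: {wotext_ceiling:.1%}")
```

These are upper limits only. Actual recall will be lower wherever OCR errors or mangled tables prevent a search term from matching.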

The third hurdle is that these files (excluding WOTEXT) include a document title and abstract in English, for all cases. It is therefore possible to retrieve certain non-English texts by searching in English, provided that your search terms happen to appear in the abstract and that the abstract is included in the basic index (or its equivalent). This can give the misleading impression that all the documents in the same language as the single fortuitous hit have in fact been searched. 

To summarize, there are dangers in treating these tools as comprehensive subject-matter sources for patentability searching, when reliant upon words. At the present time, there is only one reliable, language-independent method for searching these important texts by subject in their entirety—namely, patent classification. The files remain useful tools for searching other bibliographic elements, such as inventor or assignee, or when knowingly limiting a search to the (not-very-informative) applicant's abstracts and titles, but such a method negates the advantages of full text and can equally well be done in one of the several PCT bibliographic files, such as PCTPAT or PATOS-WO. The problem of multilingualism is not going to go away—if anything, it will get worse—and the challenges that this brings to all searchers, whether of patents or anything else, remain formidable.
 
Sources of PCT Full-Text Files
File Number  Data Source  Host         Access        Coverage
1            MicroPatent  MicroPatent  Internet      1983 - present
2            MicroPatent  Dialog       File 349      1983 - present
3            MicroPatent  STN          File PCTFULL  1983 - present
4            EPO          Questel      File WOTEXT   1978 - present

 
Publication Rates in 2000 According to WIPO
Publication Language  No. of Documents  Proportion (%)
English               56,084            70.2
German                12,010            15.0
Japanese              7,057             8.8
French                3,654             4.6
Russian               496               0.6
Spanish               422               0.5
Chinese               224               0.3
Grand total           79,947            100.0

 
Publication Rates in 2000 According to the MicroPatent/STN File
Publication Language  No. of Documents  Proportion (%) Based on Total 1  Full-texts Available
English               56,019            71.4                             56,018
German                11,990            15.3                             11,985
Japanese              5,807             7.4                              0
French                3,651             4.7                              3,651
Russian               413               0.5                              0
Spanish               422               0.5                              422
Chinese               190               0.2                              0
Grand total 1         78,492            -                                -
Grand total 2         79,856            -                                -
Grand total 3         -                 -                                72,076

 
Full-Texts Available in the WOTEXT File (Year 2000)
Publication Language  No. of Documents in File  Language Proportion (%) within File  Official WIPO Total  Document Proportion (%) of Official WIPO Total
English               46,985                    77.3                                 56,084               83.8
German                10,573                    17.4                                 12,010               88.0
French                3,252                     5.3                                  3,654                89.0
Grand total           60,810                    -                                    -                    -


Stephen Adams (stevea@magister.co.uk) is director for Magister Ltd., Reading, U.K. 
Comments? Email letters to the editor to marydee@infotoday.com.