Vol. 10 No. 9 October 2002 
Patinformatics: Identifying Haystacks from Space 
by Anthony J. Trippe Senior Staff Investigator, Vertex Pharmaceuticals
Generally, when individuals think about patent information they conjure up an image of a diligent searcher, poring over reams and reams of information, looking for the one reference out of hundreds, maybe thousands, that will satisfy their client. The idea of searching for a "needle in a haystack" comes readily to mind when referring to the activities in which these professionals commonly find themselves. More recently, however, information professionals find themselves being asked to look at the bigger picture. Instead of trying to identify a single grain of sand on a vast beach, business decision-makers more and more ask information professionals to identify trends and provide general overviews to put information in context when compared to a much larger collection of materials. Instead of finding a needle in a haystack, today's searchers are becoming analysts and being asked to identify haystacks from space and then forecast whether the haystack is the beginning of a new field or the remainder from last year's harvest.

The title of this article introduces the notion of "patinformatics." This term is borrowed from the more common fields of bioinformatics and cheminformatics. Bioinformatics, by definition, is the science of analyzing large amounts of biological data using computational methods. Researchers use genomic data, for example, to discover relationships or trends between different genes or biological pathways that could be missed when looking at smaller datasets. In a similar fashion, the term patinformatics describes the science of analyzing patent information to discover relationships and trends that would be difficult to see when working with patent documents one at a time. The term encompasses all forms of analyzing patent information, including the following:

  • Patent intelligence: the use of patent information to identify the technical capabilities of an organization and the use of that intelligence to develop a strategy for strategic technical planning

  • Patent mapping: sometimes described as white space mapping, the use of published patent data to create a graphical or physical representation of the relevant art pertaining to a particular subject area or novel invention

  • Patent citation analysis: the study of patent citations for potentially determining a patent's value or, perhaps more reliably, identifying potential licensing partners or leads based on the citation of an organization's patents by another company in the same or a completely different market space
Patinformatics can also cover additional applications of patent information involving a subsequent analysis step; that analysis step is the key property underlying each of these diverse areas. This article will focus primarily on the basic principles of patinformatics and will survey the field of tools, resources, and Web sites that have evolved in this space over the last few years.

One might imagine that the same rules which apply to conducting patinformatics also apply to patent searching. This is not entirely the case. Just as in physics, in which quantum mechanics outlines the principles for understanding the microscopic world, while Newtonian principles apply to the macroscopic world of large bodies in motion, one can distinguish two different approaches to patent information. Traditional patent searching deals with the micro level, in which very small changes become extremely important and details and precision are imperatives. Patinformatics, by comparison, deals with thousands or tens of thousands of documents and, since small details will not be seen across such a vast landscape, takes a more macroscopic view of the data, using different methods and reaching different conclusions.

The difference between patent searchers and analysts is one of perspective. Searchers are trained to find a needle in a haystack, while analysts want to identify haystacks from space. A comparison of the two approaches can illustrate how each practitioner will deal with the search, review/analysis, and final presentation of the data discovered. This description is only an example. I do not claim, by any means, that all patent searches (or searchers) perform these functions in the exact way described. This section simply illustrates how to employ a different approach to data gathering when doing analysis vs. focused searching.

The greatest difference between patent searchers and patent analysts occurs in how they approach a search. Patent searchers are generally concerned with absolute precision, sometimes at the expense of recall, especially with regard to data that is of only cursory interest to the subject of the search. Patent searchers will go to great efforts to find the exact references needed by their clients. Identifying a single document can sometimes be the goal of their search. In fact, finding no documents at all may constitute a satisfactory result. It is not uncommon for a searcher to spend days, weeks, or even months working on a single search, looking for a particular piece of information. Their search strategies will often be extremely complicated, involving large keyword hedges and the extensive use of database-specific indexing codes. Starting with a large collection of data, the searcher will progressively add layers of detail to the search in order to specifically narrow a dataset down to those on-target answers most likely to interest their clients.

By comparison, a patent analyst may also put together a complicated search strategy and try to be as directed as possible in their searching, but generally wants to create a comprehensive dataset for use as the basis for subsequent analytical steps. Analysts will also use large collections of keywords and database-specific indexing, but they will more likely keep their strategies broad rather than narrowing results to a fine point. As long as the data discovered is more or less on target, leaving some irrelevant answers in the set may not bother them, since small inconsistencies will not be seen above the baseline. Statistically speaking, analysis requires the presence of enough data to discover trends and relationships, so patent analysts prefer an overabundance of data as opposed to a lack of it. Making the search too specific may bias the data. It is important to let the data speak for itself, as opposed to having the analysis directed by the searcher's preconceived notions while building the dataset.

The single most difficult task for searchers to overcome as they start doing patent analysis may be learning to adjust their natural tendency toward directed, specific searches in order to produce datasets free from bias and subjectivity. Under these circumstances, datasets may grow to several thousand records. Searchers will ordinarily stay away from datasets this large, since so much information was previously difficult for end-users to grasp. With computerized analytical tools, however, working with large datasets has become much less complicated and should not deter an aggressive search strategy.

Philosophically, the differences between patent searchers and patent analysts in the data review and subsequent analysis phases are not as dramatic as in the searching phase. Searchers will typically review the document set that they've created before they package and send the information to an end-user. The searcher may simply take a quick glance at the data, looking for obvious false drops which can be deleted without fear that the client would miss them. In the same fashion, answers absolutely on target may be pulled from the remainder of the dataset and set aside for prominent placement in a different section of the search report. With large document sets, say 100 to 500 documents, the searcher might manually scan a list of titles and mark documents that they consider relevant to the needs of the client. The analysis aspect comes from the searcher's familiarity with the subject matter and their level of understanding client needs. 

Under these circumstances, it is important that the searcher has spent some time with the research team and has a clear understanding of the technical aspects of the project. The more familiar the searcher is with the needs of the client and the technical specifications of the project, the better job they will do in selecting appropriate documents. The analysis in this case is a judgment call on the part of the searcher in evaluating what information the end-user would find more relevant. Depending on the sensitivity of the project, the searcher may have more or less flexibility in practicing their good judgment. Some clients will simply ask the searcher to send the results without any review at all, while others will expect the searcher to screen the majority of the answers and send only the top answers for their perusal. In the case of searchers, the analysis and review are usually conducted as a single step. 

A patent analyst, on the other hand, will look at review and analysis as separate steps with different objectives and methods. The analyst has to look at the review step as if they're building a data warehouse, examining the integrity of the data and making certain that it is clean. The first part of this may in fact involve a relevance review not dissimilar to the one conducted by a searcher, only not as detailed, eliminating only results widely off topic. Once again, precision is not the issue here, so the review process goes fairly quickly. After the analyst is more or less convinced that they have accrued data generally on topic, they begin the process of building the data warehouse. This typically involves importing the data into a software tool and checking to make sure that the process has gone smoothly and that the data is ready for the subsequent analysis phase.

The analyst will scan the data warehouse, occasionally taking samples of the data, looking at it, and making certain the information has ended up in the proper fields and formatted correctly. Depending on the size of the dataset, this process may take quite some time. A few hundred documents may go quickly, but when the dataset expands to include several thousand documents, this can become very time consuming. After building the data warehouse, the review process is complete and the data analysis can begin. Specific details on performing patent analysis will be discussed in a subsequent section. The process has a great deal to do with having a clear understanding of the business objective and desired use of the intelligence produced by the analysis. It is less a judgment call based on the analyst's understanding of the subject matter than it is an experiment, with conclusions drawn based on the results.

Finally, searchers and patent analysts will present results back to their clients in dramatically different fashion. The main work product from a searcher is a collection of references or patent documents. Their search report will typically provide an overview of the objective of the search, the methods used to conduct the search, the databases used, the time coverage of these same databases, and finally, the references themselves. The report may classify results in different sections based on relevance, document type, or date of publication. The more relevancy analysis the searcher can provide, the more the report will reflect the professional skill of its creator.

When patent searchers present large collections of results, sometimes reports end up looking like raw data dumps. The end-user is left to wade through hundreds or perhaps even thousands of documents, all in reverse chronological order, without any way to distinguish the 5th answer from the 535th one. End-users may find it difficult to identify trends or patterns within the data, having a different perspective when looking at the 100th record than they did when looking at the 4th. It is also difficult for the human brain to keep track of several variables while examining hundreds of documents. A computer, on the other hand, can objectively weigh a set of variables, regardless of which document they came from, and identify patterns within the data.

The analyst will typically have a number of computational tools at their disposal designed to identify patterns and trends from their experiments. Information when analyzed becomes intelligence. Intelligence is the main work product from an analyst. Instead of delivering information, patent analysts will experiment with the data provided, draw conclusions based on analysis, and provide those conclusions to a business decision-maker. Analysts are generally much more integrally involved in the decision-making process and are seen as consultants rather than as intermediaries. In most cases, business decision-makers do not want a large collection of data. They want data to be compiled and analyzed, with different scenarios and their corresponding advantages and challenges laid out so the decision-makers can draw rapid conclusions and act on them. The analyst's results, therefore, are generally a few slides outlining the business need, the hypothesis under investigation, the results of the analysis, and, finally, some opinions on the potential conclusions of following different courses of action.

Having worked in both of these positions, I do not intend this comparison to imply that one role is more important or requires more intelligence and ability than the other, but simply to characterize the differences between the two roles and provide some insight to those individuals who seek to move from one role to the other.

Patinformatic Principles
When dealing with the more detailed analysis involved with patinformatics, we can divide the different types of analyses into two broad categories: data mining and text mining.

Data mining involves the extraction of fielded data and its analysis. Normally, this means analyzing the bibliographic information contained within patents. For example, someone might want to examine the relationship between patent assignees and International Patent Classification (IPC) codes for a specific area of technology. Mining or mapping this information can give an idea of the major players in a technology area and what type of work they generally focus on. When using Derwent data, a similar analysis could replace IPC codes with Derwent manual codes. 
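To make the idea concrete, the assignee/IPC analysis described above can be sketched in a few lines of Python. The records, field names, and code values below are purely hypothetical, invented for illustration rather than drawn from any real database:

```python
from collections import Counter

# Hypothetical records: each patent reduced to its assignee and IPC codes.
records = [
    {"assignee": "Company Y", "ipc": ["C07D", "A61K"]},
    {"assignee": "Company Y", "ipc": ["A61K"]},
    {"assignee": "Company Z", "ipc": ["G06F"]},
]

# Count how often each (assignee, IPC code) pair co-occurs across the set.
pairs = Counter(
    (rec["assignee"], code) for rec in records for code in rec["ipc"]
)

for (assignee, code), n in pairs.most_common():
    print(f"{assignee:10s} {code}: {n}")
```

Sorting the counts surfaces which classifications dominate each assignee's portfolio, which is exactly the "major players and what they focus on" view described above.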

Text mining or mapping typically involves clustering or categorizing documents based on the major concepts contained. The data source is unstructured text, it is not fielded, and the only structure within the material comes from what the author applied when writing the document and how they built relationships between different concepts and ideas. For example, you could collect patents from a specific patent assignee and analyze the text of those documents. In a cluster map, the software would extract the major concepts found and create clusters of documents concept by concept. The software would then visualize these clusters in some fashion, creating a map. By looking at the clusters created (and subsequently, the documents themselves, but now with an organized method), you can quickly get a general idea of the concepts that this organization is working on and how these concepts interrelate.
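Commercial clustering engines rely on far more sophisticated linguistics, but the basic mechanic of grouping documents concept by concept can be suggested in Python. Everything here, the stopword list, the toy abstracts, and the one-dominant-term-per-document simplification, is an illustrative assumption:

```python
from collections import Counter, defaultdict
import re

STOP = {"a", "an", "the", "of", "for", "and", "to", "in"}

def dominant_concept(text):
    """Crude stand-in for concept extraction: return the most
    frequent non-stopword term in the document."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]
    return Counter(words).most_common(1)[0][0]

# Hypothetical abstracts from a single assignee's patents.
docs = [
    "A catalyst for polymer synthesis using a novel polymer membrane",
    "Polymer blends and polymer processing for films",
    "A battery electrode and battery housing design",
]

# Group document indices under their dominant concept.
clusters = defaultdict(list)
for i, text in enumerate(docs):
    clusters[dominant_concept(text)].append(i)

print(dict(clusters))
```

The resulting groups play the role of the clusters on a map: each key is a concept, and its documents can now be reviewed in an organized fashion.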

Success in either data or text mining will often depend on the analyst's familiarity with the data source being analyzed and the methods used to prepare and analyze the data. A full discussion of the submethods and potential pitfalls of different mining exercises is, unfortunately, beyond the scope of this article.

With this general background, I would like to propose a law for the linear analysis of patent information. Here are the components of Trippe's Law of Linear Patent Analysis:

  • Create a tool kit of patinformatics tools.
  • Understand the business need and the need behind the need.
  • The need drives the question.
  • The question drives the data.
  • The data drives the tool.

Create a Patinformatics Tool Kit
As mentioned earlier, patinformatics can include patent mapping, citation analysis, co-occurrency analysis, thematic mapping, temporal visualization, and various other techniques beyond the scope of this article. Clearly no one tool will accomplish all of these types of analyses. In order to succeed in the overall field of patinformatics, the practitioner needs the maximum flexibility to pursue questions based on business needs; therefore, the patinformatics practitioner should invest in a collection of tools and resources. This approach can get expensive quickly. So one must understand the types of questions that are likely to be asked and arrange for tools that will satisfy the corresponding analysis needs.

Understand the Business Need and the Need Behind the Need
When it comes to starting an ad-hoc project, the analyst will typically start by understanding as much as possible about the analysis need at hand. As information professionals well know, it is often difficult to get a client to express their true need when making an information request. Frequently a client will say, "We need to know everything about Company Y." As strange as this might sound, the response to that request ought to be, "No, you don't, and if you did it would take a forklift to cart in all of the data. It would take 6 months for you to get through all of it. And, in the end, you might not be any closer to the intelligence you're seeking than when you first started." 

In patinformatics it is absolutely essential that the business need for intelligence is clearly understood before anything else begins. It is also critical to know all of the needs behind the need as well. Analysts need to understand how the data will be used and who will use it. They need to know what type of story to tell in order to represent their intelligence work in such a way that the person receiving it will understand it and will stand the greatest chance of putting it into business practice. While important to all information professionals, these principles are absolutely essential to analysts. Improper assumptions made up front about the scope and goal of the project can lead the analysis astray, producing inappropriate or, in extreme cases, misleading information in the context of the business decision at hand. The analyst assumes the role of a trusted advisor in these cases and needs to be as close to the decision-making process as possible, so they can integrate a thorough understanding of the business need into their work.

The Need Drives the Question
In a true linear sense, once the need is understood, then the analyst and client can work together to formulate questions to supply intelligence that will impact the underlying business decision. For example, a business may need to gain additional insight on how the research and development process works for a particular company (say Company Y again). In such a case, understanding what research and development projects Company Y conducts in its 10 different research facilities in the U.S. becomes an interesting question. Additional questions might include the following: Where do the inventors on their U.S. patents live? What patenting topics are closer to basic science? Which apply more to process technologies? By asking a number of compelling questions and compiling intelligence on each of them, an analyst can begin to paint a mosaic of the dynamics associated with the business need. Examining all the dynamics will lead the analyst to draw conclusions.

The Question Drives the Data
Once an analyst decides on the questions that need to be answered, they must begin collecting relevant data, just as a scientist investigates a scientific question. Referred to as the scientific method, this process involves the formation of a hypothesis, experimentation to determine the validity of the hypothesis, and verification of the validity of the experimentation and of the conclusions drawn based on experimental results. In the realm of patinformatics, the gathering of data is directly analogous to the idea of preparing an experiment to support or dispute a hypothesis. Selection of the appropriate tool is also important to the process.

The Data Drives the Tool
Some questions require very specific types of data. In these circumstances, the tool selected must not only allow for the analysis necessary to provide the insight, but must also work with the data source most appropriate for answering the questions. Continuing with the example initiated above, if the question posed asks where the inventors on Company Y's U.S. patents live, then the data will have to include the inventor's address information, which appears on the front page of all U.S. patents. Perhaps more importantly, this data must be available in an electronic format for importing into the appropriate analysis tool. If a tool cannot handle the data format for the file that includes inventor address data, then it cannot answer the question.

Once again, it is important to follow these steps in a linear fashion: deviation from this path will lead to a situation in which the questions asked are biased by the tools available to the analyst. If an organization focuses on a single analysis tool, then all subsequent analysis may be overshadowed by the strengths and weaknesses of that particular tool. As the old saying goes, "If all you have is a hammer, everything looks like a nail."

Another principle that should be applied during patinformatics exercises is the idea of Actionable Intelligence. This idea dictates that intelligence is only useful if it is applied to a business question and more importantly used to make a business decision. Analysis work should not be done for its own sake. If a report will simply collect dust on the decision-maker's desk, then it was not worth doing in the first place. Analysts must not get trapped in the novelty or cleverness of their work. They must stay focused on creating analysis that allows the decision-maker to definitively see the various options available to them and to deliver good enough intelligence on those options to support a clear and relatively unambiguous decision on a course of action. When intelligence is applied to a business decision, then it becomes actionable.

A Review of Patinformatics Tools, Resources, and Web Sites
Items discussed in this section fall into three different categories: tools (software driven by an internal corporate server or installed on a client's personal computer), resources (printed or reference materials that provide static statistics), and Web sites (analytical devices driven from an external server where clients typically purchase access to the service on a monthly or yearly basis).


Aurigin Systems Inc.
Aurigin's most well-established system is called the Aureka IPAM system. IPAM stands for Intellectual Property Asset Management and, as the name implies, this system allows you to organize and manage intellectual property (not just patents, but corporate documents as well). The system also contains tools for patent analysis as an integral part of smart IP management. While a very powerful and flexible platform, Aureka is a big-ticket item. There are substantial costs involved in purchasing a server to run the system and setting it up to work within an organization. It offers a great deal of power, flexibility, and security (since it is located behind your company's firewall), but it will take considerable commitment to establish it.

As an integrated system, the Aureka platform provides enough flexibility to incorporate a number of third-party applications to work within the framework. Aurigin partners with, or acquires, some of the best third-party analysis tool companies and integrates their systems into the Aurigin management system. Aurigin has incorporated both text and data-mining tools into the system and set them up so the tools all work together seamlessly.

Aurigin has pre-loaded its platform with patent data taken from the four major patent authorities (US, EP, JP, and PCT) and includes a search engine for identifying relevant references. These references can be saved, creating sets for further analysis and sharing with colleagues. Another nice feature of the Aureka platform is the ability to annotate documents. Since Aurigin began life as SmartPatents Inc., users have all of the annotation and viewing capabilities of SmartPatents accessible through the system. (In a bold move, Aurigin recently announced that SmartPatents would be given away to customers who have a subscription to the system). One of the key strengths of the IPAM system is the ability for individuals within an organization to create sets of patents, analyze them, annotate them, generally create intelligence from them, and save all of this knowledge in a single place.

One of the analytical tools built into the Aureka system is the ThemeScape thematic, text-mining tool. Originally marketed by a company called Cartia, Aurigin Systems acquired it in 2000 and integrated it into the Aureka platform. A detailed description of how the program works is beyond the scope of this article, but ThemeScape employs a concept mapping method of creating technology landscapes. The program reads full-text documents, identifies themes that occur throughout the references, and employs clustering algorithms to organize documents by co-occurrence of the identified themes. 

ThemeScape organizes this information by using a topographical map paradigm. In a ThemeScape map, each document is represented by a black dot; taller peaks are clusters that contain a higher number of documents on a particular subject; and the closer two documents appear to one another on a map, the more overlap they share in their themes. ThemeScape is a text-mining tool with a few built-in data-mining features that enhance the clustering aspect. It incorporates a data-mining aspect since you can ask it to identify a specific patent assignee on a map. This takes the form of small white dots, reflecting a concept area where that patent assignee is working.

Another analytical tool within the Aureka platform is the citation tool. Licensed from InXight, this technology incorporates a hyperbolic tree viewer. The citation tree tool creates a hyperbolic tree of citation information from within the U.S. patents covered in Aureka. Select a single U.S. patent and it will become the root of the tree, with subsequent citations to that document forming branches moving forward from one generation to the next. Backward citations can also be visualized in this tree format. One can label trees in a number of different ways, including by assignee, publication date, or inventor. Trees can also be colored based on date or assignee. Citation trees can support a rapid visual review of the citation history for a single U.S. patent.

This approach has one drawback: The branches of the citation tree are all formed in a linear fashion. Relationships between citing documents from one generation to the next cannot be seen since only linear lines are drawn. If the producers enhanced the system with the addition of interconnecting lines between documents that cross generations, this would allow the discovery of documents that appear to be cited by several different assignees throughout the entire citation history of the root document. 
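The underlying data structure is straightforward: a forward-citation index mapping each patent to the patents that cite it, walked one generation at a time. The patent numbers below are hypothetical placeholders, and the sketch reproduces only the linear branches described above, not the cross-generation links the tool lacks:

```python
# Hypothetical forward-citation index: patent number -> patents citing it.
# (Real citation data would come from the front pages of U.S. patents.)
citations = {
    "US1": ["US2", "US3"],
    "US2": ["US4"],
    "US3": [],
    "US4": [],
}

def build_tree(root):
    """Nest each patent's forward citations one generation at a time,
    mirroring the linear branches of the citation-tree viewer."""
    return {root: [build_tree(c) for c in citations.get(root, [])]}

tree = build_tree("US1")
print(tree)
```

Adding the interconnecting lines suggested above would mean turning this tree into a graph, with edges between citing documents across generations.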

The Aureka system also contains a reporting tool that supports statistical analysis of the patent data. The reporting module is broken into three broad sections: 

  • Key summary reports, which provide the top 10 data elements in a particular area, such as the top 10 assignees or the top 10 inventors within a collection.
  • Detailed reports, which include text-based reports on subjects such as the pace of invention, citation history, inventor and assignee reports, and matrices such as International Patent Classification vs. patent assignee reports.
  • Pivot tables, the most powerful of the reporting models, created from assignee, inventor, U.S. classification, citation, and several other data elements. Once collected, results are exported to Microsoft Excel for the creation of the table and subsequent manipulation by the analyst. Almost all the bibliographic fields within the patent are available to the analyst for combining, mixing, and matching in a pivot table. Incredibly complicated and elegant analyses can be created in this fashion.
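The pivot-table workflow this module automates can be approximated with the pandas library. The records and field names below are hypothetical, and the Excel hand-off in the final comment mirrors, rather than reproduces, the Aureka export step:

```python
import pandas as pd

# Hypothetical bibliographic records with a few fields of interest.
rows = [
    {"assignee": "Company Y", "us_class": "514", "year": 2000},
    {"assignee": "Company Y", "us_class": "424", "year": 2001},
    {"assignee": "Company Z", "us_class": "514", "year": 2001},
]
df = pd.DataFrame(rows)

# Count documents for each assignee / U.S. classification pair.
pivot = df.pivot_table(index="assignee", columns="us_class",
                       aggfunc="size", fill_value=0)
print(pivot)
# pivot.to_excel("report.xlsx")  # hand the table off to Excel for charting
```

Swapping the index and column fields, or adding the year as a second index level, gives the same mixing and matching of bibliographic fields described above.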
Aurigin recently announced the creation of an HTML-based version of its platform labeled the Aureka Online System (AOS) (see Figure 1 on page 33). Available as a fully hosted or on-site model, AOS brings an even more seamless integration of the management and analysis tools of the Aureka platform to an increasingly end-user-oriented audience. Innovations incorporated into AOS include licensing of the Vivisimo clustering tool for automatic categorization of patent documents, annotation of not just the documents but also the various data elements themselves, such as the citation trees and ThemeScape maps, and stratification of user levels based on user analytical needs and training. These levels are designated Gold, Silver, and Bronze. Gold users have access to all the capabilities within the system, Silver users have access to just about everything except ThemeScape, and Bronze users can access the search, view, and print capabilities of the system as well as collaborate on projects initiated by a Gold or Silver user. Additional information on Aurigin appears on its Web site.

Aurigin has had some recent financial difficulties, which led the company into Chapter 11 bankruptcy. The situation was resolved when MicroPatent acquired the company at auction. Since the acquisition, MicroPatent has delayed the launch of AOS 3.0, but plans to continue supporting existing customers and to continue offering access to the AOS 2.5 release. MicroPatent officials have said that it eventually plans to merge the two services.

ClearForest Inc.
The products from ClearForest Inc. are among the most powerful text-mining tools available. Most text-mining tools begin by performing what is called term extraction, the process whereby the application selects relevant terms from within the text and extracts them for subsequent analysis. Term extraction works similarly to the process used to create a full-text, inverted index of a particular document. Once extracted from the text, the terms can be analyzed in a number of ways. Information extraction extends the term extraction operation; it not only selects terms, but subsequently categorizes them automatically into pre-defined categories or taxonomies. It works on unstructured text. There are no inventor fields or assignee fields with data specifically tagged and classified. Information extraction techniques can analyze unstructured text and automatically extract and categorize such information as people's names, their positions, their companies, or various other attributes.
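A production system like ClearForest's builds its categories from linguistic association rules, but the general shape of information extraction can be suggested with ordinary regular expressions. The sentence, the patterns, and the two-category taxonomy below are toy assumptions, not ClearForest's actual rules:

```python
import re

text = ("Dr. Jane Smith, vice president of research at Acme Corp., "
        "and John Doe of Widget Inc. announced a joint program.")

# Toy taxonomy: crude patterns mapped to pre-defined categories.
rules = {
    "person": r"(?:Dr\. )?[A-Z][a-z]+ [A-Z][a-z]+(?= of|,)",
    "company": r"[A-Z][a-z]+ (?:Corp\.|Inc\.)",
}

# Extract and categorize entities from the unstructured text.
extracted = {cat: re.findall(pat, text) for cat, pat in rules.items()}
print(extracted)
```

Even this crude version shows the key point: the names and companies were never tagged in fields; the structure is imposed by the extraction rules.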

ClearForest has developed two different tools for managing the information extraction process: ClearStudio and ClearLab. ClearStudio uses a wizard-driven interface to allow the user to quickly design language association rules for automatic classification of information. ClearLab allows for the creation of association rules as well, but it is a C++-driven interface designed for use by people familiar with this programming language. For more information on information extraction based on association, contact the ClearForest staff.

The third component of the ClearForest suite is called ClearResearch. This application supports the analysis of classified information. The tool allows for a number of different analyses; one of the most powerful involves the use of circle graphs to visualize the relationships between one collection of taxonomies and another. A taxonomy, for instance, could cover all the companies named within a document collection. Imagine a circle displaying technological terms on the left-hand side and company names on the right. Lines of varying thickness drawn from one side of the circle to the other would represent relationships between a company and the technological terms associated with it. Variations in thickness and color of the lines represent the intensity of the relationship based on the total number of documents that support it. Practitioners have referred to circle graphs as balls of string based on their resemblance to the popular cat's toy. By double-clicking on a line, users see the documents. Double-clicking on an individual technology term or company name along the edge of the circle will cause a new window to open with the clicked-upon object as the center and the subsequent terms displayed around it as spokes. For instance, clicking on a company name will open a new window with the company named at the center and spokes leading off to the technology terms associated with the company. Right-clicking on one of the technology terms in this window will bring up a contextual menu that will allow an additional distribution on any of the taxonomies available to the analyst. In this fashion, one can distribute the company's inventors by the corresponding technology terms associated with them. 

The ClearForest suite contains many powerful text-mining features. Additional information can be found at the Web site:

Search Technology
Search Technology produces VantagePoint, a data-mining tool that, for the most part, deals with the statistical analysis of values within fielded data. If the field happens to contain written text, then the tool applies natural language processing algorithms to parse out topics. The first step in using VantagePoint involves importing and parsing data from online records. Using the import editor, fielded data from almost any source can be correctly parsed and imported into the system for analysis. After creating a database with the fielded values, the system provides tools for conducting list cleanup. Using fuzzy logic routines, the system can help the user identify values within the field that should probably be grouped together since the values are synonymous with one another. Two of the most common uses for this feature are the company name and inventor name cleanups. As mentioned, good statistical analysis needs good, clean data. List cleanup is often a time-consuming and laborious process, and the features in VantagePoint can make it easier. 
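VantagePoint's fuzzy logic routines are proprietary, but the general idea behind grouping near-synonymous field values can be sketched with a simple string-similarity heuristic. The names, threshold, and similarity measure below are illustrative assumptions, not the product's actual algorithm:

```python
from difflib import SequenceMatcher

def group_similar(names, threshold=0.85):
    """Group field values whose similarity to a group's first member
    exceeds the threshold; otherwise start a new group."""
    groups = []
    for name in names:
        for group in groups:
            if SequenceMatcher(None, name.lower(), group[0].lower()).ratio() >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Toy company-name field needing cleanup.
raw = ["IBM Corp", "IBM Corp.", "Xerox Corp", "Xerox Corp."]
print(group_similar(raw))  # → [['IBM Corp', 'IBM Corp.'], ['Xerox Corp', 'Xerox Corp.']]
```

A real cleanup would also normalize punctuation and corporate suffixes ("Corp.", "Inc.", "Ltd.") before comparing, but the grouping step is the same.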

The major statistical paradigm used by VantagePoint is the co-occurrence matrix. One attribute is placed on the Y-axis, while another goes on the X-axis (see Figure 2 on page 34). Numbers within the matrix indicate the number of documents that incorporate the corresponding values on the X- and Y-axes. Clicking on a cell produces a list of the titles of the documents that support this relationship. Synonymous values can be collected in a group and compared to another field within the matrix. For instance, organizations can be grouped by their general affiliation: industrial, governmental, or educational. The user can then compare the number of documents produced by each of the different organizational sectors within certain key technologies. Along with co-occurrence matrices, the system can also perform principal components decomposition and create factor maps for any of the fields. The system also provides pre-defined macros that allow the automatic selection and exporting of a matrix into Microsoft Excel for visualization using 3-D graphs, line graphs, and various other charts. See the Web site for additional information.
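The counting behind such a matrix is straightforward. A toy sketch, using invented assignees and technology terms, of how a co-occurrence matrix could be built from fielded records:

```python
from collections import Counter

# Toy patent records: each has an assignee and a set of technology terms.
records = [
    {"assignee": "Acme", "terms": {"fuel cell", "catalyst"}},
    {"assignee": "Acme", "terms": {"fuel cell"}},
    {"assignee": "Beta", "terms": {"catalyst", "polymer"}},
]

# Each cell counts the documents supporting an (assignee, term) pair.
matrix = Counter()
for rec in records:
    for term in rec["terms"]:
        matrix[(rec["assignee"], term)] += 1

print(matrix[("Acme", "fuel cell")])  # 2 documents support this cell
```

Clicking a cell in VantagePoint lists the supporting titles; the equivalent here would be keeping the document identifiers in each cell rather than just a count.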

Synthema
The Technology Watch tool is a data-mining product originally developed by IBM (now marketed by Synthema). Once the information has been fielded, Technology Watch will cluster documents based on co-occurrence of exact string-matched data using many-to-many relationships. In other words, it does not build silos using one-to-one relationships, whereby it would group all the documents containing a unique code and a document might end up in several different silos if it contained more than one value for a particular field. Instead, the program looks for documents that have a greater than 50 percent (this number can be adjusted) homogeneity in the field analyzed and groups those documents together in a bubble or cluster. The idea here is that if documents have a high degree of similarity in the fields under analysis, they will likely focus on a similar topic.
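The article does not define Technology Watch's homogeneity measure precisely; one plausible reading, the overlap of field values relative to the smaller code set, can be sketched as follows. The document IDs, codes, and greedy grouping strategy are invented for illustration:

```python
def homogeneity(codes_a, codes_b):
    """Fraction of shared codes relative to the smaller code set."""
    shared = len(codes_a & codes_b)
    return shared / min(len(codes_a), len(codes_b))

def cluster(docs, threshold=0.5):
    """Greedy grouping: a document joins the first cluster whose
    seed document exceeds the homogeneity threshold with it."""
    clusters = []
    for doc_id, codes in docs.items():
        for cl in clusters:
            if homogeneity(docs[cl[0]], codes) > threshold:
                cl.append(doc_id)
                break
        else:
            clusters.append([doc_id])
    return clusters

# Toy documents with invented Derwent-style manual codes.
docs = {
    "US1": {"B04-C", "B07-D", "A12-V"},
    "US2": {"B04-C", "B07-D"},   # all of its codes shared with US1
    "US3": {"X21-A", "X25-B"},   # no overlap with US1
}
print(cluster(docs))  # → [['US1', 'US2'], ['US3']]
```

The lines Technology Watch draws between bubbles would correspond here to pairs of clusters whose homogeneity falls below the threshold but above zero.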

Figure 3 on page 34 shows a finished Technology Watch map on which patents from a Cambridge, Massachusetts, pharmaceutical company called Vertex are clustered based on their sharing of similar Derwent Manual codes. The bubbles have been manually labeled, colored, and positioned by the analyst in order to demonstrate collections of patents on the same topic. Lines drawn between the bubbles indicate a relationship between the documents that had less than a 50 percent homogeneity. The program automatically created these lines.

Besides clustering, Technology Watch can also provide statistics on any single field and demonstrate which documents have a specified value for this field.

Invention Machine Corporation
Invention Machine Corporation produces a number of applications that assist in the computer-aided invention process. With regard to patent analysis, however, its two most relevant products are Co-Brain and Knowledgist. Both programs do basically the same thing: extract subject/action/object (SAO) functions from full-text data. The company has recently begun to refer to these functions as problem/solution paradigms. The idea behind this approach is that patents are designed to instruct readers on how to solve a practical problem. Think of the subject and action as the solution and the object as the problem. For example, if the object were clean clothes, then the subject and action, soap and washing, would provide the solution to the problem. Once the software has extracted the subject/action/object functions from documents, it puts together the problems and solutions, grouping similar problems together, so that users may compare different ways to solve a problem by viewing them next to one another (see Figure 4 on page 34).
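The extraction step itself requires full natural language processing, but the subsequent grouping of solutions under shared problems can be sketched simply. The triples below are invented examples, not output from the Invention Machine software:

```python
from collections import defaultdict

# Hypothetical SAO triples already extracted from patent sentences:
# (subject, action, object) ~ (solution agent, solution act, problem).
triples = [
    ("soap", "washes", "clothes"),
    ("detergent", "cleans", "clothes"),
    ("filter", "purifies", "water"),
]

# Group alternative solutions under the problem (object) they address,
# so competing approaches can be viewed side by side.
solutions_by_problem = defaultdict(list)
for subject, action, obj in triples:
    solutions_by_problem[obj].append(f"{subject} {action}")

for problem, solutions in solutions_by_problem.items():
    print(problem, "->", solutions)
```

The synonym tool mentioned below would collapse entries such as "washes" and "cleans" before grouping, shrinking the number of problem/solution sets.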

The two programs differ in their scope and scale. Knowledgist is a desktop application that can be used on personal datasets, while Co-Brain is designed to work from a corporate server and act as a corporate knowledge portal. Both systems come with a synonym tool that can greatly reduce the number of problem/solution sets created and greatly increase the system's ability to understand when two different solutions solve the same problem. Both programs can often create long lists of problem/solution functions that are difficult to navigate by scrolling up and down. To assist in identifying relevant functions, a search button allows the user to find problems or solutions quickly. Please see the Web site for additional details.

BizInt
BizInt produces SmartCharts for Patents, patent-tabulation software. The software allows a user to import patent data from the Derwent, IFI, and Chemical Abstracts files on STN and create tables of information (including many of the included images) from it. While not a text- or data-mining tool per se, the software works well for formatting patent data for end-user distribution. Tables are customizable, and one can even add columns to keep track of comments made by people working together on a project. An illustration of a portion of one of the SmartCharts tables appears in Figure 5 on page 39. For more information and additional examples of the tables go to

IDDEX Corporation
IDDEX Corporation is a relatively new organization that offers a form of electronic notebook for documenting innovation and invention disclosures. In the company's own words, it offers a software platform and tools to manage the innovation life cycle. As a software platform, one can use the system as an intranet solution; as an ASP model, customers can go to a secure Web site to interact with the system.

With a look and feel similar to commonly used e-mail software, users can create new projects or inventions and begin supporting these inventions with documents they have created or received from others. The system catalogs contributors and can help measure which individuals contributed what to the overall invention. The platform also allows the tracking of disclosures related to the invention and includes precise time stamping. Overall, the system maintains a clear and detailed evidentiary trail that could become invaluable if the invention were ever legally challenged.

After collecting the information, the platform also provides analysis capabilities that help licensing managers decide what portions of their portfolios to out-license and which need further development. See Figure 6 on page 39 for an example.

OmniViz
One of the earliest text-mining and visualization packages available was SPIRE from Battelle. Members of the SPIRE development team spun off to form the company Cartia. Cartia produced the ThemeScape tool discussed earlier in this article. Another group of scientists at Battelle recognized that the SPIRE technology could be used for more than straight text mining and began applying the tool to biological and chemical datasets. This work again spun off to create another new company, OmniViz.

At its core, OmniViz shares a number of similarities with ThemeScape, but the OmniViz staff have made a number of improvements on the work done by Cartia/Aurigin. With regard to text mining, OmniViz can import a large number of different text formats and styles and recognize fielded text. This matters because the system can distinguish key portions of the text when the analyst chooses which parts of a fielded record to use for a cluster analysis. For example, a fielded record might have fields for inventor, assignee, title, abstract, and year published. With the OmniViz system these different fields can be identified, the title and abstract used for conducting a cluster analysis, and the remaining fields used on the resulting visualization to call out interesting patterns, such as which assignees have similar documents based on the similarity of their titles and abstracts, or which subjects were published during which years. One could perform some of these activities previously, but not with the power and convenience found in OmniViz.

As mentioned, the OmniViz developers did not stop at analyzing text; they also added functionality to the system, enabling a biologist, for instance, to analyze a large collection of cell assay data, looking for drug candidates that share a similar assay profile even though they are structurally very different. The system also allows the linking of two or more analyses that have elements in common. This allows an analyst to identify trends using one type of data source and analysis, while observing whether a similar trend appears using a related source and method. 

Now while this may not seem like a valid application when discussing patinformatics, it actually raises some interesting possibilities. For instance, imagine if one analysis shows clusters of molecules that are all active against one member of a family of enzyme targets. A linked analysis of the drug candidates clustered based on their chemical similarity shows that two of the potential inhibitors have a similar structure, while a third is quite different. Now, add in a third linked cluster analysis containing clusters of patent and literature references that contain the drug candidates of interest. Since the three analyses are linked, one could easily see that the two similar compounds are actually covered in several relevant patents, while the third compound is discussed in a literature reference. The ability to collect and analyze data from biological, chemical, and text sources and look for trends across all three sources makes OmniViz a unique tool. A "Galaxy Map" from OmniViz appears in Figure 7 on page 39. For additional information, go to their Web site,

The Metrics Group/VxInsight
The Metrics Group, a consulting firm, specializes in patent analysis, particularly patent citation analysis (see later in this article), but it does other types of patent analysis projects as well. In April 2002, the firm announced a partnership with the makers of VxInsight that allows Metrics Group clients to use VxInsight to navigate citation analysis results from the Metrics Group for a period of 6 months.

VxInsight was developed at Sandia National Labs and provides visuals similar to those produced by Aurigin's ThemeScape and OmniViz's ThemeMap. The tool allows Metrics Group customers to see complicated co-citation links between a collection of patent documents. Heavily co-cited patent documents appear raised on the map for easy identification. The linkages between these documents and others are also obvious.


IFI U.S. Claims
IFI is the organization that media sources quote when ranking the companies with the highest number of granted U.S. patents per year. This type of data, along with a number of different types of statistics, appears in its Patent Intelligence and Technology report. It provides detailed statistics on over 1,600 companies, showing how their U.S. patenting activity (defined by broad U.S. and International Patent Classes) has changed over the years. The report also contains distributions of patents by company over the U.S. classification. IFI subscribers can access the document from the IFI Web site or purchase a downloadable version for subsequent uploading to a corporate intranet site. With the Web version, a user can quickly look up a U.S. patent class of interest and discover which organizations received the highest number of granted patents in that class over the past year. Versions of the report can also be generated with cumulative 5- and 10-year back files. For additional information, try the IFI Web site:

Current-Patents
The British company Current-Patents produces a number of different patent resources and publications. For the most part, clients can browse these publications on a weekly or monthly basis. In addition, the Drug Patents 2001 and Current Trends in Pharmaceutical Discovery reports also contain data-mining and statistical analyses of information from pharmaceutical patents. 

The crown jewel from Current-Patents is its new DOLPHIN database. DOLPHIN allows an analyst to work with pharmaceutical patents in several different ways: by searching for a patent number, by conducting a text search, or by looking up profiles based on a drug or company name. The analytical capabilities are pre-defined for each profile, but provide an interesting snapshot of how an organization may compare to its competition. Some of the analytics include a chart of drugs owned by the company with the highest patenting activity, drugs for which the company has filed patents other than product or composition of matter patents, patent classifications of the company vs. the industrial average, therapeutic areas of the company vs. the industrial average, and therapeutic areas by year and action. In the drugs owned by the organization with the highest patenting activity chart, different colors represent new use, component of combination, formulation, and various product or composition of matter categories. The charts are built in Macromedia Flash and thus are dynamic. When the user passes the cursor over a color on the bar, the corresponding values for that percentage appear. Users can also click on sections of the bars to go directly to those documents. For individual drugs, some nice charts display patent classes and the company that filed them. This view can give a user a quick overview of which companies work with a particular drug substance and how they're doing. Interested users can sign up for a demonstration account by registering at

Web Sites
A relative newcomer, the Web site offers an Intellectual Property Quotient (IPQ) on patent documents. The score can be thought of as an IP version of the well-known human Intelligence Quotient and is read in a similar way, with a score of 100 being average. The site's authors calculate this score by looking at patent metrics found to be statistically correlated with patent maintenance rates. In a number of countries, after a patent is granted, the assignee must pay maintenance fees on a fixed future schedule in order to keep the patent in force. The logic behind this product holds that patents with maintenance fees kept up-to-date must have a greater value than the ones allowed to expire for lack of payment. The owners would not continue to pay maintenance fees if the IP did not have sustained value. Using regression analysis and looking at over 40 individual patent metrics, the site's staff claim to have a model that accurately predicts patent value.
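The published details of the model are limited, but its general shape, a regression over patent metrics rescaled to an IQ-style score, might look like the following sketch. The weights, metric values, and scaling constants are all hypothetical:

```python
import math

def maintenance_probability(metrics, weights, bias):
    """Logistic model: probability the patent's fees will be kept up."""
    z = bias + sum(w * metrics[name] for name, w in weights.items())
    return 1 / (1 + math.exp(-z))

def ipq_score(prob, mean_prob=0.7, sd=0.15):
    """Rescale to an IQ-style scale where the population average is 100
    and one standard deviation is 15 points."""
    return 100 + 15 * (prob - mean_prob) / sd

# Hypothetical weights; the real model reportedly uses over 40 metrics.
weights = {"forward_citations": 0.08, "claims": 0.03, "family_size": 0.05}
metrics = {"forward_citations": 12, "claims": 20, "family_size": 4}
prob = maintenance_probability(metrics, weights, bias=-1.0)
print(round(ipq_score(prob), 1))
```

Fitting such weights against observed maintenance outcomes is what the regression analysis over a large patent sample would accomplish.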

Figure 8 on page 40 shows a probability distribution of observed fourth year maintenance rates vs. calculated IPQ scores for a sample population of about 100,000 utility patents issued in 1996. This illustration shows clearly that patents with an IPQ score of less than 60 have less than a 50 percent chance of being maintained at the first opportunity.

One of the newest patinformatics Web sites, created by The Metrics Group, focuses on patent citation analysis. The site provides two different product lines. Citation Bridge, a free service, is available when a client registers at the site. It allows a user to generate a forward or backward citation report for any U.S. patent back to 1980. The citation information displays as a text-based report, not in a tree format as in citation reports from Aurigin. The company is also considering generating forward and backward citation reports for documents from countries other than the U.S. It is not known at this time whether these citation reports will also be offered free of charge. Clearly, the owners of this site hope that the allure of free backward and forward citation reports will attract customers to pay for additional services. 
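Mechanically, a backward report simply lists a patent's references, while a forward report requires inverting the citation graph to find later patents that cite it. A minimal sketch with an invented three-patent graph:

```python
# Toy citation graph: cites[patent] = patents it cites (backward references).
cites = {
    "US6000001": ["US5000001", "US5000002"],
    "US6000002": ["US5000001"],
    "US6000003": ["US6000001"],
}

def backward_report(patent):
    """References the patent cites (applicant- or examiner-supplied)."""
    return cites.get(patent, [])

def forward_report(patent):
    """Later patents citing this one, found by inverting the graph."""
    return [p for p, refs in cites.items() if patent in refs]

print(backward_report("US6000001"))  # → ['US5000001', 'US5000002']
print(forward_report("US5000001"))   # → ['US6000001', 'US6000002']
```

Tree-style displays like Aurigin's come from applying these lookups recursively, one generation of citations per level.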

The second product line produces more value-added citation reports using the Citation Indicator Analytics (CIA) database, for which users pay a flat fee. The CIA database is Web-enabled, using a Microsoft SQL Server platform. The Metrics Group uses the CIA database internally to produce reports and generate Citation Alerts, but may provide Web-based access within the next few months. As opposed to Citation Bridge, which can only handle one patent at a time, CIA-based reports can cover an entire group of patents. Examples of the types of reports available include forward citation inventory, competitor impact report, patent cousins, and corporate innovations flow. Additional citation analyses the system can perform include a competitor historical citation grid, speed of knowledge capture from patents, and external and internal inventor historical citation grids. 

The usefulness of U.S. patent citations as an indicator of overall value has long been debated and is still highly in doubt. Clearly, however, the reports available from this site do not focus on the inherent value of a patent, but instead examine the relationships that patents have to one another and the implied relationships that organizations have to each other when either the organization or a patent examiner cites one document with respect to another.

M-CAM
DOORS is marketed as a tool for companies to help identify prior art and licensing opportunities for their portfolios. The system works by combining advanced semantic analysis with co-citation analysis. Documents that may be considered relevant prior art are selected based on patent citations they hold in common. In addition, the system utilizes Latent Semantic Filtering (LSF), a process in which documents with identical, or near-identical, concepts can be identified whether or not the same words appear in each occurrence. LSF uses word pairs and related nearby topics selected from the documents. Documents are compared to one another based not only on the shared word pairs, but also on the inclusion of similar nearby related topics. Combining the two techniques allows easier identification of highly related patent references. M-CAM employs a number of different visual displays to help analysts keep track of a collection of patents concurrently. 
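M-CAM's actual LSF implementation is proprietary; a rough sketch of the underlying idea, combining word-pair overlap with shared citations, might look like this. The texts, patent numbers, and weighting are purely illustrative:

```python
def word_pairs(text):
    """Adjacent word pairs, a crude stand-in for LSF's concept pairs."""
    words = text.lower().split()
    return {(words[i], words[i + 1]) for i in range(len(words) - 1)}

def similarity(doc_a, doc_b):
    """Combine word-pair overlap (Jaccard) with shared backward citations."""
    pairs_a, pairs_b = word_pairs(doc_a["text"]), word_pairs(doc_b["text"])
    pair_overlap = len(pairs_a & pairs_b) / max(1, len(pairs_a | pairs_b))
    shared_cites = len(set(doc_a["cites"]) & set(doc_b["cites"]))
    return pair_overlap + 0.1 * shared_cites  # weighting is an assumption

a = {"text": "fuel cell membrane electrode assembly", "cites": ["US5000001"]}
b = {"text": "membrane electrode assembly for a fuel cell", "cites": ["US5000001"]}
print(round(similarity(a, b), 2))
```

Even this toy version shows why word pairs beat single words: the two texts share the multi-word concepts "membrane electrode" and "fuel cell" regardless of word order elsewhere in the sentence.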

The primary mode of access is by subscription to the M-CAM Web site. Originally designed for use by large financial organizations, the powerful system is also quite expensive. Demonstrations can be arranged at the Web site:

Wisdomain
Wisdomain is a Korean company that has created a patent analysis site with three major components: a search module, a citation module, and an analysis module. The search module is populated with the databases normally expected in this type of service: patents from the U.S., Europe, WIPO, and Japan. Searching is straight Boolean, along with the ability to search selected fields and the option to save search sets for later review and retrieval. 

The citation module, as the name implies, allows the user to work with citation information from U.S. patents. The visualization allows for multiple nodes and the identification of inner relationships between them. This visualization helps to quickly identify core documents, those references that appear to be at the crux of several branches. The diagram is interactive, so users can drill down in patent nodes to quickly find additional information on the reference. 

The analysis module contains a number of pre-configured charts and graphs. An analyst can quickly get a top-level view of a document collection by seeing the patent count by assignee, the International Patent Classification codes by assignee, and a few additional charts and graphs.

Access to the system involves subscription payments. For subscription details, go to

Delphion and MicroPatent
Both of these vendors have traditionally been viewed as patent document delivery companies. Both are making strides toward integrating more analytics into their Web sites. Delphion has always had text clustering and basic patent analytic abilities, the first from its relationship with IBM, and the second based on a tool purchased from Wisdomain called Patent Lab II. Both Delphion and MicroPatent work with CHI Research to provide patent citation reports to clients. 

Delphion has recently released a citation analysis tool called Citation Link. While not strictly a hyperbolic tree, such as the citation tool available from Aurigin, the tool does allow a user to identify a root patent and visually represent backward and forward citation relationships to it.

Both organizations have also added features to allow users to easily export fielded data to Microsoft Excel for subsequent analysis and visualization. MicroPatent recently added analytical reports and charts based on its data, along with forward citation visualizations. MicroPatent has also acquired the assets of Aurigin Systems Inc., as mentioned previously.

To check out these respective Web sites, go to and

The patinformatics field is constantly shifting. New practitioners are joining the field and contributing to the development of new methods for gleaning value from patent data. Vendors already producing products and services within this field change rapidly as well. Traditional patent information providers are partnering with new companies, or developing new capabilities of their own, to prepare the value-added indexing they have spent years generating for use in detailed and extensive data- and text-mining experiments. The field is sure to grow and advance in the years to come. Future practitioners will certainly work with exciting new capabilities as the practice develops.

Anthony Trippe's e-mail address is
© 2002