[ONLINE] feature

Information Security and Sharing

Elizabeth Liddy

ONLINE, May 2001
Copyright © 2001 Information Today, Inc.


The role of information specialists has broadened in recent years, and, for many, it now includes shared responsibility for the security of the intellectual property of the organization. Information security is a responsibility that, broadly defined, includes two major areas: protection from intruders and protection from unwanted release of information. The first of these, protection from intruders, is usually the responsibility of the IT team alone, and focuses on three major concerns: interruption of service, protection of IP addresses, and intrusion detection. All these pertain to protecting an organization's intellectual property from assault from outside the organization.

There is a second, equally important area: protection of an organization from unwanted release of information to inappropriate recipients. While information specialists are frequently seen as providing the conduits for getting vital information to the right people at the right time, they also are increasingly being asked to participate in the task of ensuring that vital information is not released to those who do not have the right to it. They are protecting an organization's intellectual property from advertent or inadvertent breaches of security from within the organization.


The technical solution for preventing the unauthorized release of sensitive information is referred to as "boundary control." A boundary controller is a software solution designed to enforce the business rules of an organization in order to control the information flows between the organization and the outside world, as well as between internal units. A high-accuracy boundary controller is essential in supporting information sharing within an organization where there are groups that must share some, but not all, information.

Several real-world situations call for boundary controllers. One is a company that has decided to do research and development on a new approach to solving a common, expensive problem within the industry, and needs to protect this competitive advantage. Another is a financial services firm that must ensure outgoing email messages are in full compliance with securities laws and regulations. A third is an organization that is providing technology to a team within a multinational task force and needs to share necessary information with other teams, while national security requires that not all information be shared.


There are a number of possible approaches to the boundary control problem that have been utilized commercially, as well as in internally developed solutions. The main distinction is between keyword and conceptual approaches.

Keyword Approach

The keyword approach is what is also referred to as the "dirty word" approach, whereby an organization prepares a list of words that, if contained in an outgoing message or document, should raise a security concern for that organization. Technically simple, most keyword approaches tokenize outgoing messages and documents into words. The words are then compared against a pre-established dirty word list. If such a word were found, the email system or document server would either stop the message or document from being sent, or, alternatively, route it to a human reviewer. The reviewer would then make the decision whether the message or document really posed a security concern, and either block it or send it on to its intended recipient. Dirty words might include sensitive internal project names, technical terms being used in current research, or the names of competitors.
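The steps just described (tokenize, compare against the list, then stop or route for review) can be sketched in a few lines of Python. This is only an illustration: the word list, the function names, and the "review"/"send" labels are assumptions made for the example, not part of any particular product.

```python
import re

# Hypothetical dirty word list; a real one would be prepared by the organization.
DIRTY_WORDS = {"bluebird", "cryocoat", "acme"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def keyword_check(text, dirty_words=DIRTY_WORDS):
    """Return the sorted dirty words found in the text (empty list if none)."""
    return sorted(set(tokenize(text)) & dirty_words)

def route(text, dirty_words=DIRTY_WORDS):
    """Return 'review' if any dirty word is present, otherwise 'send'."""
    return "review" if keyword_check(text, dirty_words) else "send"
```

Calling route("The Bluebird specs are attached.") sends the message to a reviewer, while route("Lunch at noon?") lets it through.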

The keyword boundary controller approach is technologically the inverse of the approach utilized in most Web-based filters intended to prevent children from accessing mature Web sites, hence the name "dirty word list." These filters prevent Web sites with content matching the keywords from being accessed by a user's computer, while boundary controllers compare outgoing messages to a keyword list. The limitations endemic to the keyword approach when used in censorware, which have been widely reported in the press, apply equally to boundary control. First, since language is so ambiguous, a word can express multiple meanings: some of its occurrences will reflect the sense intended in the keyword list, while others will carry senses that are quite innocuous. A keyword system cannot tell the difference, and therefore many false hits are detected. Second, the same meaning can be expressed by different words, thereby requiring the developer of the keyword list to include every possible synonym in order for the list to be exhaustively protective. If the keyword list does not contain all such rephrasings, offending messages, documents, or Web sites are not accurately screened.
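Both limitations are easy to reproduce. In the sketch below, the project name "Jaguar" and the two sample sentences are invented for illustration; the matcher is a bare-bones stand-in for a keyword system.

```python
import re

def flagged(text, dirty_words):
    """True if any token of the text appears in the dirty word list."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return bool(tokens & dirty_words)

# Hypothetical list: "Jaguar" is assumed to be a sensitive internal project name.
dirty_words = {"jaguar"}

# Ambiguity: an innocuous sense of the word still triggers a hit.
false_hit = flagged("We saw a jaguar at the zoo on Saturday.", dirty_words)

# Synonymy: a rephrasing of the sensitive concept slips through.
miss = not flagged("Attached is the design for the big-cat project.", dirty_words)
```

Here both false_hit and miss come out true: the zoo message is stopped, and the rephrased project message is not.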

Conceptual Approach

The conceptual approach is based on a more sophisticated processing of both the organization's business rules and the content of outgoing messages or documents. First, the organization's business rules are run a priori through the Natural Language Processing (NLP) module to produce a rich semantic representation, versus an orthographic word-based representation. Then, when a user attempts to send a message or document, the same NLP is performed on the message to produce a higher-level, more abstract representation of its content. A semantic comparison of representations is then computed between the business rules and the outgoing message or document to determine whether it is releasable or unreleasable. There are two methods for conducting this semantic comparison, each using rich features of the texts. Both of these will be described after a brief introduction to NLP.


Natural Language Processing is a method for analyzing text that utilizes all levels of human language understanding in order to accurately represent the contents of the text. NLP applies algorithms that interpret the meaning conveyed both explicitly and implicitly in parts of words, words, phrases, syntax, various meanings of a single word, the flow and intent of spans of text, as well as how the language refers to entities in the real world. NLP is complex, but major advances have been made in recent years, to the point where the extra processing time and power required by NLP have been deemed worthwhile for a range of applications, given the improved quality of performance.

For example, a business rule that has been processed through a full NLP module will have its words morphologically analyzed, and the meanings of the roots, as well as the suffixes, stored in the representation. After this, each word will be tagged with its correct part of speech, phrasal concepts will be bracketed rather than stored as unrelated single words, the correct sense of each polysemous word will be selected, and synonymous words and phrases for each concept in the business rule will be appended to the stored representation. Entities, relations, and events will be understood. For example, a human name will be tagged as a person and assigned its syntactic role, such as subject of the sentence. If a word is tagged as a person and is the subject of an active verb, the module will indicate that person as the agent of the action described by the verb, the frame for the event instantiated by the verb will be activated, and the attached slots of the frame filled with semantic information from the sentence.

NLP's ability to produce a representation of business rules, as well as of outgoing messages and documents, with this degree of human-like knowledge enrichment is what makes the conceptual approach to boundary control so powerful. In the following, item 1 provides the semantic representation, and item 2 provides the logical representation, of the business rule that states, "Junior employees of the Acme Corporation must not describe specifications of company products in outgoing emails." For the sake of clarity, the logical representation does not contain all the semantic details of item 1.

  1. Semantic Representation:

    <Junior_employee (new_hire; level_1_to_6)|Person> of|PREP the|ART <Acme_Corporation|Company> must|MOD not|MOD <describe (tell; explain; discuss)> <specification (size)> of|PREP <company_product|ProdName> in|PREP <outgoing_email (message; posting)>.

  2. Logical Representation:

    If ISA (?X,junior_employee) and ISA (?Y, Acme_product) and ISA (?Z, email) and RCPT (?Z, ?P) and LOC (?P, outside_network) and CONT (?Z, 'ASSOC (?Y, ?A) & MEAS (?A, ?B)'), then CHRC (?Z, nonreleasable).
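To make the logical representation more concrete, here is one way the rule might be evaluated in Python. It is a sketch under several assumptions: the Email class, its field names, and the product list are hypothetical, the (product, attribute, measurement) triples are assumed to have been extracted by upstream NLP, and the link between the junior employee (?X) and the email (?Z) is assumed to be "sender," which the logical form leaves implicit.

```python
from dataclasses import dataclass, field

@dataclass
class Email:
    sender_role: str            # e.g. "junior_employee" (assumed sender link to ?X)
    recipient_location: str     # e.g. "outside_network" or "inside_network"
    # (product, attribute, measurement) triples, assumed to come from NLP
    content_facts: list = field(default_factory=list)

# Hypothetical product inventory standing in for ISA(?Y, Acme_product).
ACME_PRODUCTS = {"widget_x"}

def is_nonreleasable(email):
    """Return True when every condition of the rule holds for this email."""
    sender_is_junior = email.sender_role == "junior_employee"
    goes_outside = email.recipient_location == "outside_network"  # RCPT + LOC
    # CONT: the content associates a product with a measured attribute.
    describes_spec = any(
        product in ACME_PRODUCTS and measurement is not None
        for product, attribute, measurement in email.content_facts
    )
    return sender_is_junior and goes_outside and describes_spec
```

An email from a junior employee to an outside address containing ("widget_x", "length", "12 cm") would be characterized as nonreleasable; the same content sent inside the network would not.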

The two means by which conceptual boundary controllers can be implemented are the information retrieval approach and the categorization approach.

Information Retrieval Approach

The information retrieval (IR) approach utilizes conceptual representations of an organization's business rules as semantic releasability models, one for each rule or set of rules. The system then processes each outgoing message or document through the same NLP module. The resulting conceptual representation of the message is compared to all the business rule representations, and, if the intended missive meets a predetermined criterion of similarity to any business rule, the message or document is barred from release. This method basically simulates the routing query style of IR, where the representations of the business rules can be seen as functioning as standing queries against which the system matches the outgoing documents.
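A minimal sketch of this standing-query matching follows. It assumes that both rules and messages have already been reduced by the NLP module to lists of concept labels, and it uses cosine similarity with an arbitrary threshold of 0.5 as the similarity criterion; the article does not specify the actual measure or cutoff, so both are stand-ins.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two concept-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_barred(message_concepts, rule_models, threshold=0.5):
    """Bar the message if it is similar enough to any standing rule model."""
    message_vec = Counter(message_concepts)
    return any(
        cosine(message_vec, Counter(rule)) >= threshold for rule in rule_models
    )

# Hypothetical rule model, loosely based on the Acme rule discussed earlier.
rule_models = [
    ["junior_employee", "describe", "specification",
     "company_product", "outgoing_email"],
]
```

With this rule model, a message reduced to ["describe", "specification", "company_product"] is barred, while one reduced to ["lunch", "meeting"] is released.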

Categorization Approach

This approach begins with obtaining or producing a training set of outgoing messages and documents that are categorized according to whether they are releasable or, if not releasable, according to which of the business rules they breach. These training messages and documents are then run through the conceptual NLP-based processing described above for a rich marking of linguistic features. The training documents are processed by a classification engine which utilizes a machine learning algorithm that enables the boundary controller to produce, test, and retest the best vector for each rule. The vector contains the distinguishing terms, phrases, concepts, entities, relations, and events which characterize a message or document breaching that rule.

After the system has been trained, these vectors are used to categorize each new outgoing missive according to whether it is releasable or unreleasable because it violates a business rule.
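The train-then-categorize cycle can be illustrated with a small classifier. The sketch below uses multinomial naive Bayes with add-one smoothing, chosen only because it is compact; it is not a description of any commercial system's algorithm. The feature labels and the four training examples are invented, and features are assumed to be the concept labels produced by the NLP stage.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Build per-label feature counts from (features, label) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for features, label in examples:
        label_counts[label] += 1
        feature_counts[label].update(features)
        vocab.update(features)
    return label_counts, feature_counts, vocab

def classify(model, features):
    """Return the most probable label under naive Bayes with add-one smoothing."""
    label_counts, feature_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(feature_counts[label].values()) + len(vocab)
        for f in features:
            if f in vocab:  # features never seen in training are ignored
                score += math.log((feature_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set: each message is labeled releasable or by the rule it breaches.
training = [
    (["specification", "company_product", "size"], "breaches_rule_1"),
    (["company_product", "specification", "describe"], "breaches_rule_1"),
    (["lunch", "meeting", "schedule"], "releasable"),
    (["meeting", "agenda", "schedule"], "releasable"),
]
model = train(training)
```

After training, classify(model, ["describe", "specification"]) returns "breaches_rule_1" and classify(model, ["lunch", "agenda"]) returns "releasable".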


My group empirically tested the keyword and conceptual approaches as part of a federally funded DARPA research project [1]. The goal was to determine the relative merits of these two approaches, in terms of effectiveness and efficiency, in order to understand how best to combine the two for the greatest pay-off. The experiment compared the performance of a commercially available keyword boundary controller, used by many organizations, to the performance of a conceptual boundary controller, namely DataShield, solutions-united.com's NLP-based categorization-style boundary controller (http://www.solutions-united.com/datashield). DataShield utilizes a probabilistic text classifier and uncertainty sampling algorithm [2].

The test results, run on 112 messages plus attachments, showed that the conceptual approach significantly outperformed the keyword approach in terms of messages correctly blocked and messages correctly released. The keyword commercial system had scores of 70% precision and 38% recall, while the DataShield system had scores of 96% precision and 99% recall. In terms of efficiency, the keyword commercial system took 2.04 seconds and DataShield took 9.75 seconds.
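For readers less familiar with the two measures, the sketch below shows how precision and recall are computed when "blocking an unreleasable message" is treated as the positive outcome. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the experiment's data.

```python
def precision(true_pos, false_pos):
    """Share of blocked messages that really were unreleasable."""
    return true_pos / (true_pos + false_pos)

def recall(true_pos, false_neg):
    """Share of unreleasable messages that the system actually blocked."""
    return true_pos / (true_pos + false_neg)

# Hypothetical counts, for illustration only.
blocked_correctly = 45   # unreleasable and blocked
blocked_wrongly = 5      # releasable but blocked
missed = 15              # unreleasable but sent anyway

p = precision(blocked_correctly, blocked_wrongly)   # 45 / 50
r = recall(blocked_correctly, missed)               # 45 / 60
```

With these counts, precision is 0.90 and recall is 0.75: the system is usually right when it blocks, but it lets a quarter of the unreleasable messages out.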

The difference in effectiveness is significant. While the difference in efficiency may appear rather dramatic, it must be weighed against the performance results, which showed that the conceptual system stopped 99% of the unreleasable messages from being sent out, and only stopped 4% of the messages that should have been released. Although the conceptual system (then still a prototype) took nearly five times as long as the keyword system, this needs to be compared to trained human subjects who, in this experiment, took an average time of 560 seconds, or 9.33 minutes, to review and decide on the releasability of this same set of documents.
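The timing comparison can be checked with a few lines of arithmetic. The three timings are the figures reported above; the variable names are ours.

```python
keyword_seconds = 2.04      # keyword commercial system
conceptual_seconds = 9.75   # DataShield prototype
human_seconds = 560         # trained human reviewers (average)

conceptual_vs_keyword = conceptual_seconds / keyword_seconds
human_vs_conceptual = human_seconds / conceptual_seconds
human_minutes = human_seconds / 60

print(f"{conceptual_vs_keyword:.2f}")  # 4.78
print(f"{human_vs_conceptual:.1f}")    # 57.4
print(f"{human_minutes:.2f}")          # 9.33
```

On these figures the conceptual system takes a little under five times as long as the keyword system, and the human reviewers take more than fifty times as long as the conceptual system.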


Information specialists who share in the responsibility for the information security of their organization's intellectual property need to look for a conceptual boundary controller that both recognizes differences in "need to know" among internal departments and prevents advertent and inadvertent security slip-ups. The solutions available today are much better than what was available just a year or so ago, when the only choice was between hiring and training a human security monitor, who was effective but extremely inefficient, and relying on a keyword boundary controller, which was efficient but not sufficiently effective. The advancement of conceptual boundary controllers using NLP is making sophisticated, conceptual boundary control a reality.


[1] Filsinger, J., Morrison, W., Monteith, E., Gallaro, D., Paik, W., Hoffman, M., Smith, R. & Bebee, B. (2000). Advanced Boundary Control Capability Experiment Report. Interim Technical Report for Defense Advanced Research Project Agency (DARPA) Information Systems Office (NAI Labs #00-017), NAI Labs, Washington, DC.

[2] Lewis, D. & Gale, W. (1994). "A sequential algorithm for training text classifiers." Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval.

Elizabeth Liddy, Ph.D. (liddy@syr.edu) is director of the Center for Natural Language Processing, School of Information Studies, Syracuse University.

Comments? Email letters to the Editor at marydee@infotoday.com.
