Big Data and AI: Technology, Transparency, and Trust
By Marydee Ojala
Librarians are surrounded by Big Data. Our discovery services allow us to search across hundreds of millions of items. Individual publisher platforms can add millions more. Beyond traditional journal articles and ebooks, libraries house non-textual materials. Datasets created by researchers in the academic and corporate environments contribute petabytes more to our Big Data world. Videos, photographs, art works, and other types of images further add to the amount of data that information professionals struggle to make discoverable.
And then there’s that behemoth we call the internet. Web search engines, such as Google and Bing, scrape billions of websites to provide information in response to search queries. Niche search engines specialize in surfacing government data, patent and trademark records, music lyrics, film summaries, scientific studies, and a host of other very specific information. Add in the trillions of posts to social media, and you’re in the exabyte range.
Librarians faced Big Data long before the term became fashionable, but the volume, variety, and velocity are steadily increasing. This leads to concerns about validity, transparency, and truth. Can we trust Big Data technologies? Is there any way for information professionals to use a trust-but-verify approach?
Technologies related to Big Data bring both exciting opportunities for research and worrying prospects for misinformation, disinformation, and falsified information. Transformational technologies, roughly falling under the AI umbrella, can sift through massive amounts of data, spotting patterns and anomalies that the human brain can’t process.
IEEE (ieee.org) partnered with ip.com to create a new patent database—InnovationQ Plus (innovationqplus.ieee.org)—that uses semantic search to facilitate searches by concept rather than keyword. It built semantic relationships to equate alternative words and phrases that are returned when a search query is executed. Not a static system, it uses machine learning to increase the accuracy of its concept searching. Given that patents are not written to inform and enlighten, the ability to cut through the confusing and purposeful obfuscation of meaning in patent texts is a boon to patent researchers.
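The core idea behind concept search can be sketched in a few lines. This is a toy illustration, not InnovationQ Plus's actual implementation: the terms, vectors, and threshold below are invented for the example, whereas a real system would learn its vectors from millions of patent documents.

```python
import math

# Hypothetical embedding vectors, hand-picked for illustration only.
# In a real semantic search engine these are learned from the corpus.
EMBEDDINGS = {
    "fastener": [0.9, 0.1, 0.0],
    "screw":    [0.8, 0.2, 0.1],
    "rivet":    [0.7, 0.3, 0.1],
    "umbrella": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: how closely two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def concept_matches(query, threshold=0.9):
    """Return terms whose vectors sit close to the query term's vector."""
    q = EMBEDDINGS[query]
    return [t for t, v in EMBEDDINGS.items()
            if t != query and cosine(q, v) >= threshold]

print(concept_matches("fastener"))  # → ['screw', 'rivet']
```

Because "screw" and "rivet" sit near "fastener" in the vector space, a query for one retrieves the others even though no keyword matches, which is exactly how a patent drafted with deliberately obscure vocabulary can still be found.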
In the legal field, AI-enabled software lets lawyers extract key variables—such as dates, parties involved, key provisions, and specific clauses—from contracts. The result is higher consistency and fewer errors in drafting new contracts. AI companies such as Luminance (luminance.com), founded by University of Cambridge mathematicians in 2015, and iManage (imanage.com), which acquired RAVN in 2017, use machine learning and pattern recognition to scan vast collections of legal documents, pinpoint anomalous data, and aid lawyers in analyzing contracts, due diligence documents, e-discovery results, and other legal materials.
Predictive analytics can forecast judges’ rulings to help lawyers decide which court will probably provide the best outcome. Almost every major legal information vendor now offers a version of this, including Westlaw Edge, LexisNexis’ Ravel, Fastcase, and Bloomberg Law. In medicine, the possibility of scanning anonymized electronic health records to predict the efficacy of a particular drug could lead to more accurate prescribing by doctors.
Digital humanities and social science projects would not exist without new AI technologies. Researchers who extract new information from digitized primary sources, ancient manuscripts, or historical public records rely on software that can read vast amounts of data, pulling out pertinent information. Data mining and text mining are the backbone of modern scholarly research in these disciplines.
Microsoft Academic Search (academic.microsoft.com) uses semantic search technology to determine meaning beyond the terms entered in a search query box. It doesn’t merely match keywords; it looks for meaning and context. It doesn’t always get it right, but it comes close much of the time.
Do You Believe in Magic?
It sounds magical, doesn’t it? But because we have to accept the magic on faith, information professionals struggle with issues of transparency and trust. We can’t check the work of AI software. Let’s suppose a researcher wanted to count the number of times Shakespeare used the word “sky.” Would the software build equivalencies to “heavens,” “celestial,” and “welkin”? When does blue refer to the sky and when doesn’t it? How would AI handle a poetic allusion that a human would understand as referring to sky? Does AI understand poetic license? Does it have an imagination? If the software tells us that Shakespeare used “sky” 10,582 times, how would we know that’s true? It’s not as if we could manually duplicate the count; that’s why we’re using technology to begin with. By their very nature, AI technologies are the opposite of transparent.
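The equivalency problem can be made concrete. In the sketch below (a hypothetical illustration, not any real digital humanities tool), the "count of sky" depends entirely on which equivalence list someone, or some model, chose, and changing that list silently changes the answer:

```python
import re
from collections import Counter

# Hypothetical equivalence class -- the crux of the transparency problem:
# a different list produces a different "count of 'sky'", and the reader
# of the final number never sees which list was used.
SKY_EQUIVALENTS = {"sky", "skies", "heavens", "welkin"}

def count_sky(text, equivalents=SKY_EQUIVALENTS):
    """Count occurrences of any word the system treats as meaning 'sky'."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return sum(counts[w] for w in equivalents)

passage = "The welkin rings; the heavens weep; the sky is blue."
print(count_sky(passage))                       # 3 with the broad list
print(count_sky(passage, equivalents={"sky"}))  # 1 with a literal list
```

Neither 3 nor 1 is wrong; they answer different questions. The trouble is that a black-box system reports only the number, not the list, so the researcher cannot tell which question was actually answered.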
Consider some hypothetical scenarios. What happens if predictive analytics gets so good that a publisher could tell from the draft of a scholarly paper whether it will garner high impact? That would influence the publishing decision, and the author might never know why the paper was rejected. Here's another: A search engine notices that others in your organization have preferred results from a particular domain. Therefore, it doesn't show you results relevant to your interests from a different domain. Amazon got into the prediction game when it claimed it knew what books would interest you and would order them for you without your knowledge. That never happened.
When it comes to machine learning, a key question is training. Training isn’t magic, but it needs to be done carefully to make the magic work. Learning from past data is only as good as the data. Bias in that data leads to perpetuating the bias in predictive analytics and algorithmic search results. If legal contracts analyzed by a computer concern only financial services, that’s the expertise the machine learns. Throw at it a contract about facilities management or a due diligence document, and the result may not be spectacular. If those health records were only from one geographic area, they might not predict accurately in another place. If salary data for C-level jobs goes back decades, it’s likely to be biased toward higher salaries for men than for women. Machine bias is hard to recognize and to control, and it can sometimes be aligned with human bias—particularly confirmation bias—making it extremely difficult to fix.
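The salary example shows how directly bias passes from training data to prediction. The records and the "model" below are invented for illustration; no vendor's system works this crudely, but more sophisticated models inherit skewed data in exactly the same way:

```python
# Hypothetical historical salary records: decades of male-dominated
# C-suite data, invented for this illustration.
history = [
    {"gender": "M", "salary": 410_000},
    {"gender": "M", "salary": 395_000},
    {"gender": "M", "salary": 420_000},
    {"gender": "F", "salary": 310_000},
]

def predicted_offer(gender):
    """A naive 'model' that averages past salaries by gender.
    It faithfully reproduces whatever bias the training data contains."""
    salaries = [r["salary"] for r in history if r["gender"] == gender]
    return sum(salaries) / len(salaries)

print(predicted_offer("M"))  # averages to roughly 408,333
print(predicted_offer("F"))  # 310,000.0
```

The model has learned nothing about merit; it has memorized a historical imbalance and now projects it forward as a "prediction." That is the machine-bias trap in miniature.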
Giving Transparency Lots of Love
Information professionals adore the idea of transparency. We like to know how things work. The advent of algorithm-driven search results and machine learning can work against comprehensive information retrieval. It obscures our ability to guarantee that our research is relevant and complete.
Particularly with web searching, machine learning does not work in favor of information professionals. Applications such as legal research, data-mining projects in the humanities and social sciences, pattern recognition in patents, and text mining of the scientific literature confine AI technologies to a targeted group of documents. On the web, by contrast, available information is fragmented and eclectic, and information professionals are not the primary seekers of it. Instead, it's the general public—and searches frequently have shopping as their focus.
Thus, web search engines learn that searches directing people to shopping sites are extremely popular. They put retail sites at the top of the results, deeming them more relevant than a marketing research report, a scientific study, or a historical document. Web search engines do not return results tailored to the library community. We are simply not a large enough group to influence web search engines’ machine learning.
One area where search engines have learned reasonably well is medical topics, which are also popular with the general public. It's still problematic, though. Depending on the search terms, results from the Mayo Clinic or WebMD might top the list, but you're equally likely to hear that eating carrots will cure cancer or that juicing will prevent it.
Results from searching the web are guided by search algorithms. By design, these are not transparent. Think of them as a trade secret carefully guarded by search companies. We don’t know what algorithms search engines use, but we do know they change frequently. We can’t guarantee, with algorithms determining relevancy, that search results are neutral or unbiased. We don’t know whether other relevant datapoints were not revealed to us because algorithms blocked them. We have no idea if alternative points of view have been hidden from us. Web search companies’ personalization initiatives work well for shopping but not for professionally oriented research.
Information professionals face the prospect of doing research and guiding others in their research activities while flying blind. Even our discovery systems can mislead searchers because they don’t actually search everything the library subscribes to or owns. Traditional information industry companies are under pressure to look like Google. What happens if they begin to emulate its machine learning, predictive analysis, and personalization technologies? Could they decide to only show results from databases other people in the organization have favored?
It's not all bad. Take LexisNexis SmartIndexing Technology. This attempt to use rule-based technology to automatically weight indexing terms is laudable. Since the percentages assigned are transparent and searchers can employ the rankings in their search strategies, it's not the "black box" of web searching. It also represents an interesting approach to determining the relevancy of search results.
Another concern about trust and transparency: The information we find refuses to sit still. We live in a choose-your-own-adventure world. Movie studios routinely test-screen films and change the endings based on audience reaction. Could this happen to scholarly research? Is it already happening with preprints? You read a preprint and suggest an alternative conclusion. If enough others also think the conclusion should be different, then the final version of the paper could change it.
We expect to see news stories evolve over time as more facts become available, but are other information types equally ephemeral? When information fragments from the printed page to electronic resources, images, audio, video, and social media, does it become so fluid as to be non-preservable? Fragmentation of information can result in data silos and confirmation bias. It is a contributing factor to information overload. If the hallmark of information today is fluidity and fragmentation, we need to find new ways of approaching its preservation. We need to find ways to ensure that we, as information professionals, retain the trust of our clients.
Death of Transparency
Transparency marked the early versions of online bibliographic databases—and still largely does so. With web search engines, that’s not the case. Research and reference librarians, along with those who teach information literacy and bibliographic instruction sessions, are hard-pressed to determine why search results can vary so much from one searcher to another.
Algorithm-driven search results, machine learning that affects comprehensive information retrieval, redefinition of relevance, and information overload are the downside of AI and Big Data in library work. The disturbing rise of false information—not only fake news, but also fraudulent research, photoshopped images, and mislabeled data—is another indication of the death of transparency. AI technologies even power deep fakes, in which the spoken word itself can be manipulated, giving new meaning to the phrase "putting words in his mouth." Unchecked and unmonitored, machine learning can learn the wrong things, with disastrous results.
The upside—and it’s a very powerful one—is the ability AI technologies give us to sift through massive amounts of data, spot patterns and anomalies, and deliver new insights. This is the cornerstone of new research efforts in many disciplines. Big Data can, however, obscure transparency. On the one hand, it provides phenomenal opportunities for groundbreaking research and important insights. On the other hand, it fosters an environment in which misinformation and disinformation can flourish. It’s up to vigilant information professionals to be aware of the ramifications of the death of transparency and alert others about trust issues stemming from these new technologies.