Stormy Weather, Redundancy, and De-Duplication
By Marydee Ojala
Editor • ONLINE
It’s been a stormy summer here in the American Midwest. We’ve had massive thunderstorms and, in some areas, severe flooding. There’s also been the occasional tornado. Personally, I’ve been lucky: no floods at my house or office, although the road up to the office building was underwater one day, and no tornadoes. It’s mid-July as I write this, and we have yet to find a reason to water our lawn. Nature’s doing a very nice job without our intervention.
We seem to be the odd man out on this. Walking around the neighborhood, it’s amazing how many people feel the need to turn on sprinklers to water already wet grass. Sometimes the sprinklers run while it’s raining. There are times when redundancy is practical—bringing your conference presentation with you as well as emailing it and uploading it to a conference website comes to mind—but watering your lawn during a thunderstorm isn’t one of them.
In the online arena, an important development decades ago was the ability to de-duplicate search results, to eliminate the redundancies. This is a distinctive feature for aggregators, as they can use algorithms specific to structured fields. When it comes to unstructured data, found by web search or enterprise search engines, the situation changes. It’s much harder to build algorithms to identify redundant information. Rudimentary de-duping exists, but it’s nowhere near as sophisticated as what you find with Dialog, Factiva, or LexisNexis.
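To make the contrast concrete, here is a minimal Python sketch, not a description of how Dialog, Factiva, or LexisNexis actually de-duplicate. It assumes hypothetical records with title, source, and date fields: structured records can be matched on a normalized key built from those fields, while unstructured text can only be compared approximately, here with word-shingle overlap (Jaccard similarity). The `Record` class, the field names, and the shingle size are all illustrative assumptions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical structured record; the field names are assumptions."""
    title: str
    source: str
    date: str  # ISO date string, e.g. "2008-07-15"


def field_key(rec):
    """Build a comparison key from structured fields, ignoring case and punctuation."""
    title = re.sub(r"[^a-z0-9]+", " ", rec.title.lower()).strip()
    return (title, rec.source.lower().strip(), rec.date)


def dedupe_structured(records):
    """Keep the first record seen for each normalized key."""
    seen = set()
    unique = []
    for rec in records:
        key = field_key(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Similarity between two unstructured texts, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

With structured fields the match is an exact yes or no. With unstructured text the best this sketch offers is a score, and someone still has to decide how high a score counts as a duplicate, which is one reason de-duping free text is the harder problem.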
Do Web 2.0 tools and technologies help information professionals surmount this? The essence of Web 2.0 is sharing and user-contributed content. If anything, this adds to redundancy. News stories, blog posts, podcasts, and microblogging invite comments. A note on Twitter may be picked up by a blogger, then amplified at another blog, and possibly migrated to mainstream media. This ensures duplication and necessitates following numerous threads and hyperlinks, adding to research time.
Duplication across online services raises other research conundrums. The same journal may be online with Dialog, EBSCOhost, Factiva, LexisNexis, and ProQuest, but each service presents it differently and indexes it differently. Searching multiple aggregators retrieves duplicate information with no way to automate de-duping. A similar situation exists with web search. The same search on Google, Yahoo!, and Ask.com will retrieve both redundant and unique items. De-duping across search engines remains a manual task.
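For a searcher who has already copied result URLs out of several engines by hand, part of the manual comparison can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the engine labels and URLs are made up, no search engine API is called, and the normalization rules (ignoring the scheme, a leading "www.", trailing slashes, and "utm_" tracking parameters) are illustrative choices that will miss many real duplicates, such as the same story hosted at different addresses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumption: query parameters with these prefixes only track clicks
# and do not change which page is shown.
TRACKING_PREFIXES = ("utm_",)


def normalize_url(url):
    """Reduce a URL to a comparison key: (host, path, sorted query)."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    kept = sorted(
        (k, v) for k, v in parse_qsl(parts.query)
        if not k.startswith(TRACKING_PREFIXES)
    )
    return (host, path, urlencode(kept))


def merge_results(results_by_engine):
    """Merge per-engine URL lists, recording which engines returned each item.

    results_by_engine maps an engine label to a list of result URLs.
    Returns a list of dicts in first-seen order.
    """
    merged = {}
    for engine, urls in results_by_engine.items():
        for url in urls:
            key = normalize_url(url)
            entry = merged.setdefault(key, {"url": url, "engines": []})
            if engine not in entry["engines"]:
                entry["engines"].append(engine)
    return list(merged.values())


if __name__ == "__main__":
    # Hypothetical result lists, pasted in by hand.
    results = {
        "engine_a": ["http://www.example.com/story/", "http://example.org/x"],
        "engine_b": ["https://example.com/story?utm_source=feed"],
    }
    for item in merge_results(results):
        print(item["url"], item["engines"])
```

Items listed under more than one engine are the redundant ones; items listed under a single engine are unique to it.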
From an author’s perspective, particularly academic authors who want their writing widely cited, duplication is good. If their articles are on all aggregators and on the net, their findability escalates considerably. As scholars embrace online searching and electronic journals, redundancy in databases equates to greater visibility. Information professionals may consider these redundancies the counterpart of watering wet grass and wish for a tornado to whirl them away. It all depends on your perspective and ultimate objective.
Here’s hoping you’ve found ways to avoid drowning in the flood of information you’re encountering—and do remember to turn off the sprinklers if it’s raining.
Ojala is the editor of ONLINE. Comments? E-mail letters to the editor to