The Big Deal About Big Data
by Erik Arnold
While working on this article, I searched more than 2 million emails, millions of articles in my newsreader, and tens of thousands of documents, and I ran countless queries on the internet.
So consider this for a moment: Take a personal knowledgebase, such as the one I created, multiply it by everyone in your company, and then imagine that all of that knowledge were somehow linked together. A manager could find the answer to just about any question he could think of: How many articles on my company and our competitors do my employees read? How does research correlate to sales?
The big deal with Big Data is that it helps tech-savvy companies answer ad hoc business questions and immediately get usable answers. In fact, Big Data is big and getting bigger. The Big Data market is slated to be $50 billion by 2017, with several companies either above or closing in on the $100-million sales threshold, according to Wikibon, a professional community dedicated to solving technology and business problems via an open source collaboration of free advisory knowledge.
Information that may help drive new sales could reside in email, in a Word document, on Twitter, as a comment on a webpage, or in a log file. Maybe there is a correlation among four, three, two, or none of the above. In the new Big Data world, when information is funneled into one repository, the answers are easy to find.
Data storage and access have been problems that span all industries. Most organizations are still struggling with the most basic aspects of managing unstructured content, which includes free-form language, emails, and documents. Big Data not only includes important local information but also takes into account content from social media, the web, and log files.
It wasn’t long ago that computer systems functioned as digital support systems (aka a digital exoskeleton), according to Edd Dumbill, program chair for the O’Reilly Strata Conference and the Open Source Convention. Today, that digital exoskeleton is becoming the digital nervous system for enterprises. In an O’Reilly Strata blog post on Aug. 29, 2012, he wrote, “The drive of the web, social media, mobile, and the cloud is bringing more of each business into a data-driven world.” This growth of the digital nervous system, he writes, “makes the techniques and tools of big data relevant to us today. The challenges of massive data flows, and the erosion of hierarchy and boundaries, will lead us to the statistical approaches, systems thinking and machine learning we need to cope with the future we’re inventing.”
Where Did Big Data Come From?
While definitions may vary, Big Data can be defined as massive amounts of stored content (structured or unstructured) that can be easily analyzed in real time (a reasonable amount of time to get a useful answer). And we can thank Google for this technology. Google’s core innovation (a webpage’s popularity has more to do with the links to it than the actual content on the page) led to one basic question: How on Earth do we count all the links on the internet?
Answering this question wasn’t easy given the internet’s size and growth. Existing software solutions couldn’t handle the processing needed to calculate the number of links in the internet’s constantly changing content. Something new had to be created to measure this vast amount of information in real time or as close to real time as possible. Google’s solution was MapReduce, which is at the core of the Big Data revolution. MapReduce simply enabled Google to get an answer to that basic question of counting all those links, which is in play in Google Analytics.
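To make the idea concrete, here is a minimal sketch of the map-and-reduce pattern applied to the link-counting question. It is an illustration only, not Google's code: the four-page "web," the page names, and the function names are all invented for this example, and everything runs on one machine, whereas a real system would spread the same steps across many computers.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical toy "web": each page maps to the pages it links to.
PAGES = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example", "a.example"],
}


def map_links(page, outbound_links):
    """Map step: emit a (target, 1) pair for every link found on one page."""
    return [(target, 1) for target in outbound_links]


def shuffle(pairs):
    """Group the emitted values by target page between the two steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(target, ones):
    """Reduce step: add up the emitted 1s to get the inbound-link count."""
    return target, sum(ones)


def count_inbound_links(pages):
    """Run map, shuffle, and reduce in sequence over the whole toy web."""
    mapped = chain.from_iterable(
        map_links(page, links) for page, links in pages.items()
    )
    grouped = shuffle(mapped)
    return dict(reduce_counts(target, ones) for target, ones in grouped.items())


if __name__ == "__main__":
    for page, count in sorted(count_inbound_links(PAGES).items()):
        print(page, count)
```

Because each page is mapped on its own and each target is totaled on its own, both steps can in principle be split across as many machines as the data requires. That is why the approach suits a data set as large as the internet.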
But Google couldn’t keep its MapReduce technology a secret for long. By the mid-2000s, Doug Cutting, the creator of Lucene, who now works at Cloudera (a Big Data startup), had engineered Hadoop, an open source implementation of MapReduce. Yahoo!, which was flailing in its attempts to deal with its massive data at this same time, turned to Hadoop to solve its data problems, and the Big Data solution soon spread to other web companies.
The Value of Big Data
The value of Big Data lies in two main areas: 1) where/how to store data, and 2) how the data can be accessed in real time. Traditional companies usually deal with these problems as two separate issues. Hadoop is considered to be a “black box” since it deals with both issues at once: It pulls in data, stores it, compresses it, and enables access to it for queries. Compare this with the traditional storage approaches that involve placing data on a tape backup and putting it in a cave. And we all know that caves and technology aren’t concepts that go together naturally.
One of the big benefits of Hadoop is that it doesn’t require any additional hardware expenditures for an enterprise. Once a file is imported, parsed into bits, and stored in the system, it can then be accessed whenever needed.
In larger organizations today, business intelligence (BI) applications typically provide information that is at the core of their businesses. These apps address the organization’s most critical aspects, but they only represent a small part of the data that can be analyzed. BI vendors admit that there is more data yet to analyze outside these software systems, which they are now working to integrate and leverage.
In its recent 2012 report, “Big Data—Extracting Value From Your Digital Landfills,” AIIM Market Intelligence Industry Watch put it simply: Some organizations (26%) are still struggling to organize their content, and 30% of organizations admit to having poor reporting and BI capabilities.
The emergence of Big Data does not mean the end for any existing systems. It simply means that all of the data will stay, but it will be easier to access. Successful organizations will take advantage of new knowledge streams. However, those companies that do not integrate the pattern analysis and trends from Big Data applications will not succeed. The competitive disadvantage will just be too difficult for them to overcome.
The Next Steps
Companies such as Google, Facebook, and Twitter, which are built from the ground up, funnel their information through Big Data systems so they can spot trends among their users and their advertisers. Compare those systems with the situation in a traditional organization that has thousands of PCs with valuable information, silos of structured enterprise information, plus external research information related to its business. It would be impossible to query across all those data sets; there is simply no way to move quickly to a Big Data environment.
A Unisphere Research (a division of Information Today, Inc.) report titled “Big Data Is Real and It Is Here: 2012 Survey on Managing Big and Unstructured Data,” by Joseph McKendrick, pointed to the big issues confronting Big Data. Most survey respondents admitted that they do not consider their current IT infrastructure and database systems adequate for managing the amount of data they expect to ingest over the next 3 years. One respondent noted: “We are having difficulty dealing with analyzing unstructured data in scientific analysis primarily in the biological sciences. There is much more discovery work that could be done if we have better data handling/analysis algorithms.”
The value of Big Data lies in analyzing information that you could not previously access. Test it out for yourself by creating a Big Data pilot: Assemble a team in your organization that is interested in having additional analytic capability, find data sets of easily accessible information that haven’t been analyzed, and establish a Big Data repository.
Once you input a sample data collection into the repository, gather the team in the conference room, ask questions in real time, and show everyone how the system can provide immediate results. That should provide proof about the value of Big Data.
Key Players in Big Data
Big Data is not a silver bullet, however. In fact, Big Data is complicated. Companies will not be able to conduct a search on a job placement site and find people who specialize in Big Data. This is still too complex, too new, and too specialized.
Given that Big Data has been around for at least a decade, Big Data vendors already exist that are generating plenty of big business. A few Big Data-only companies generate hundreds of millions of dollars in revenue. Here are just a few leading service providers of open source Big Data technologies.
- Cloudera, Inc. (www.cloudera.com) offers enterprises a powerful new data platform built on the popular Apache Hadoop open source software package.
- Think Big Analytics (www.thinkbiganalytics.com) is a professional services firm for Big Data and advanced analytics.
- MapR Technologies, Inc. (www.mapr.com) is a provider of the open, enterprise-grade distribution for Apache Hadoop.
- Hortonworks, Inc. (www.hortonworks.com) offers the Hortonworks Data Platform (HDP), a 100% open source data platform based on Apache Hadoop.
Even the Government Is Doing It
Every part of the IT world is beginning to invest in Big Data, from the automobile sector to finance and healthcare. In fact, Big Data is even gaining traction with the notoriously slow-moving government IT environment. Bob Gourley, editor of CTOvision.com, is sponsoring the second annual Government Big Data Solutions Award with five honorees for 2012:
- USASearch: Hosted search services across more than 500 government sites; provides search and suggestion services and analytical-tool dashboards
- GCE Federal: Cloud-based financial management solutions
- PNNL Bioinformatics: Advancing understanding of health, biology, genetics, and computing
- SherpaSurfing: A cybersecurity solution that analyzes trends, finds malware, and writes alerts
- The U.S. Department of State, Bureau of Consular Affairs: Features a large data set with critically important applications for citizen service and national security
The Future of Big Data
Just as technologies and vendors may change, the way enterprises store valuable information is changing too, along with how information can boost the value of an organization’s general business goals. According to the AIIM report on Big Data, more than 60% of the survey respondents reported that they would find it very useful to correlate text-based data with transactional data, but only 2% are able to do so at present. However, technology already exists to get these numbers to 100%; the problem is how fast enterprises will move to get on board.
As Dumbill noted in his blog post of Aug. 29, “If you’re not contemplating the advantages of taking more of your operation digital, you can bet your competitors are.”