Today's companies are dealing with an avalanche of data from social media, search, and sensors as well as from traditional sources. According to one estimate, 2.5 quintillion bytes of data per day are generated around the world. Making sense of big data to improve decision making and business performance has become one of the primary opportunities for organizations of all shapes and sizes, but it also represents big challenges.
Green Mountain Coffee in Waterbury, Vermont, is analyzing both structured and unstructured audio and text data to learn more about customer behavior and buying patterns. The firm has 20 brands and more than 200 beverages and uses Calabrio Speech Analytics to glean insights from multiple interaction channels and data streams. In the past, Green Mountain was unable to use all the data it gathered when customers called its contact center. The company wanted to know more about how many people were asking for a specific product, which products generated the most questions, and which products and categories created the most confusion. By analyzing its big data, Green Mountain was able to gather information that was much more precise and use it to produce materials, web pages, and database entries to help representatives do their jobs more effectively. Management can now identify issues more rapidly before they create problems for customers.
A number of services have emerged to analyze big data to help consumers. There are now online services that enable consumers to check thousands of flight and hotel options and book their own reservations, tasks that travel agents previously handled. New mobile-based services make it even easier to compare prices and pick the best travel options. For instance, a mobile app from Skyscanner Ltd. shows deals from all over the web in one list sorted by price, duration, or airline so travelers don't have to scour multiple sites to book within their budget. Skyscanner uses information from more than 300 airlines, travel agents, and timetables and shapes the data into at-a-glance formats, with algorithms to keep pricing current and make predictions about who will have the best deal for a given market.
Big data is also providing benefits in law enforcement (see this chapter's Interactive Session on People), sports, education, science, and health care. A recent McKinsey Global Institute report estimated that the U.S. health care system could save $300 billion each year, or $1,000 per American, through better integration and analysis of the data produced by everything from clinical trials to health insurance transactions to smart running shoes. Health care companies are currently analyzing big data to determine the most effective and economical treatments for chronic illnesses and common diseases and provide personalized care recommendations to patients.
There are limits to using big data. A number of companies have rushed to start big data projects without first establishing a business goal for this new information. Swimming in numbers doesn't necessarily mean that the right information is being collected or that people will make smarter decisions.
Experts in big data analysis believe too many companies, seduced by the promise of big data, jump into big data projects and end up with nothing to show for their efforts. They start amassing and analyzing mountains of data with no clear objective or understanding of exactly how analyzing big data will achieve their goal or what questions they are trying to answer. Darian Shirazi, founder of Radius Intelligence Inc., likens this to haystacks without needles. Companies don't know what they're looking for because they think big data alone will solve their problem.
According to Michael Walker of Rose Business Technologies, which helps companies build big-data systems, a significant majority of big-data projects aren't producing any valuable, actionable results. A recent report from Gartner Inc. stated that through 2017, 60 percent of big-data projects will fail to go beyond piloting and experimentation and will eventually be abandoned. This is especially true for very large-scale big data projects. Companies are often better off starting with smaller projects with narrower goals.
Hadoop has emerged as a major technology for handling big data because it allows distributed processing of large unstructured as well as structured data sets across clusters of inexpensive computers. However, Hadoop is not easy to use, has a steep learning curve, and does not work well for all corporate big-data tasks. For example, when Bank of New York Mellon used Hadoop to locate glitches in a trading system, Hadoop worked well on a small scale, but it slowed to a crawl when many employees tried to access it at once. Very few of the company's 13,000 IT specialists had the expertise to troubleshoot this problem. David Gleason, the bank's chief data officer at the time, said he liked Hadoop but felt it still wasn't ready for prime time. According to Gartner Inc. research director for information management Neil Heudecker, technology originally built to index the web may not be sufficient for corporate big-data tasks.
Hadoop vendors are responding with improvements and enhancements. For example, Hortonworks produced a tool that lets other applications run on top of Hadoop. Other companies are offering tools as Hadoop substitutes. Databricks developed Spark open-source software that is more adept than Hadoop at handling real-time data, and the Google spinoff Metanautix is trying to supplant Hadoop entirely.
It is difficult to find enough technical IT specialists with expertise in big-data analytical tools such as Hive, Pig, Cassandra, MongoDB, and Hadoop. On top of that, many business managers lack the numerical and statistical skills required for finding, manipulating, managing, and interpreting data.
Even with big-data expertise, data analysts need some business knowledge of the problem they are trying to solve with big data. For example, if a pharmaceutical company monitoring point-of-sale data in real time sees a spike in aspirin sales in January, it might think that the flu season is intensifying. However, before pouring sales resources into a big campaign and increasing flu medication production, the company would do well to compare sales patterns to past years. People might also be buying aspirin to nurse their hangovers following New Year's Eve parties. In other words, analysts need to know the business and the right questions to ask of the data.
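The comparison to past years can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sales figures are invented, and the 10 percent threshold and the names `january_lift` and `is_unusual` are hypothetical choices, not anything the case describes.

```python
# Invented illustrative unit sales, NOT real data: (December, January) pairs.
past_years = {
    2011: (1000, 1300),
    2012: (1050, 1380),
    2013: (1100, 1420),
}
current_year = (1150, 1500)


def january_lift(december, january):
    """Ratio of January sales to the preceding December's sales."""
    return january / december


# How large has the January jump been in previous years?
past_lifts = [january_lift(dec, jan) for dec, jan in past_years.values()]
typical_lift = sum(past_lifts) / len(past_lifts)

# How does this year's jump compare with the usual seasonal jump?
current_lift = january_lift(*current_year)
excess = current_lift / typical_lift - 1

# Hypothetical rule of thumb: only treat the spike as news if it differs
# from the usual January jump by more than 10 percent.
is_unusual = abs(excess) > 0.10

print(f"typical January lift: {typical_lift:.2f}")
print(f"current January lift: {current_lift:.2f}")
print("unusual spike" if is_unusual else "in line with past Januaries")
```

With these made-up numbers the January jump looks like the ordinary seasonal pattern, which is the kind of check an analyst who knows the business would run before reacting to the raw spike.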
Just because something can be measured doesn't mean it should be measured. Suppose, for instance, that a large company wants to measure its website traffic in relation to the number of mentions on Twitter. It builds a digital dashboard to display the results continuously. In the past, the company had generated most of its sales leads and eventual sales from trade shows and conferences. Switching to Twitter mentions as the key metric to measure changes the sales department's focus. The department pours its energy and resources into monitoring website clicks and social media traffic, which produce many unqualified leads that never lead to sales.
Although big data is very good at detecting correlations, especially subtle correlations that an analysis of smaller data sets might miss, big data analysis doesn't necessarily show causation or which correlations are meaningful. For example, examining big data might show that from 2006 to 2011, the United States murder rate was highly correlated with the market share of Internet Explorer because both declined sharply. Nevertheless, that doesn't mean there is any meaningful connection between the two phenomena.
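A small Python sketch shows how easily two unrelated series can correlate. The two lists below are invented illustrative values, not real murder-rate or browser statistics; they share only a downward trend over six periods. The helper `pearson` is written out by hand so the example has no dependencies.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented illustrative values, NOT real statistics: two unrelated
# quantities that both happen to decline over the same six years.
series_a = [5.8, 5.7, 5.4, 5.0, 4.8, 4.7]        # hypothetical rate
series_b = [80.0, 75.0, 68.0, 60.0, 52.0, 45.0]  # hypothetical share, %

r = pearson(series_a, series_b)
print(f"correlation: {r:.2f}")
```

The coefficient comes out close to 1 even though nothing connects the two series except a shared trend, so a high correlation on its own says nothing about cause.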
Several years ago, Google developed what it thought was a leading-edge algorithm using data it collected from web searches to determine exactly how many people had influenza and how the disease was spreading. It tried to calculate the number of people with flu in the United States by relating people's location to flu-related search queries on Google. The service has consistently overestimated flu rates when compared to conventional data collected afterward by the U.S. Centers for Disease Control and Prevention (CDC). According to Google Flu Trends, nearly 11 percent of the U.S. population was supposed to have had influenza at the flu season's peak in mid-January 2013. However, an article in the science journal Nature stated that Google's results were about twice the number the CDC estimated, which had 6 percent of the population coming down with the disease. Why did this happen? Several scientists suggested that widespread media coverage of that year's severe flu season in the United States, which was further amplified by social media coverage, tricked Google. Google's algorithm only looked at numbers, not the context of the search results.
Big data can also provide a distorted picture of the problem. Boston's Street Bump app uses a smartphone's accelerometer to detect potholes without the need for city workers to patrol the streets. Users of this mobile app collect road condition data while they drive and automatically provide city government with real-time information to fix problems and plan long-term investments. However, what Street Bump actually produces is a map of potholes that favors young, affluent areas where more people own smartphones. The capability to record every road bump or pothole from every enabled phone is not the same as recording every pothole. Data contain systematic biases, and it takes careful thought to spot and correct for those biases.
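The bias can be made concrete with a rough Python sketch. Everything here is an assumption for illustration: the two neighborhoods, their pothole counts, the share of drivers running the app, and the simple model that a pothole is reported if at least one passing driver has the app. None of these figures come from Street Bump itself.

```python
def detection_probability(app_share, drivers_passing):
    """Chance that at least one of the passing drivers is running the app."""
    return 1 - (1 - app_share) ** drivers_passing


# Hypothetical neighborhoods with the SAME number of real potholes but
# different shares of drivers who run the app (invented figures).
neighborhoods = {
    "higher app adoption": {"potholes": 100, "app_share": 0.05, "drivers": 50},
    "lower app adoption": {"potholes": 100, "app_share": 0.01, "drivers": 50},
}

# Expected number of potholes that end up on the city's map.
expected_reports = {
    name: info["potholes"]
    * detection_probability(info["app_share"], info["drivers"])
    for name, info in neighborhoods.items()
}

for name, reports in expected_reports.items():
    print(f"{name}: about {reports:.0f} of 100 potholes reported")
```

Under these assumptions the map shows far more potholes where app use is higher, even though the roads are equally bad in both places. Correcting for this would require knowing, or estimating, how coverage varies across neighborhoods.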
And let's not forget that big data poses some challenges to information security and privacy. As Chapter 4 pointed out, companies are now aggressively collecting and mining massive data sets on people's shopping habits, incomes, hobbies, residences, and (through mobile devices) movements from place to place. They are using such big data to discover new facts about people, to classify them based on subtle patterns, to flag them as risks (for example, loan default risks or health risks), to predict their behavior, and to manipulate them for maximum profit.
When you combine someone's personal information with pieces of data from many sources, you can infer new facts about that person (such as the fact that they are showing early signs of Parkinson's disease, or are unconsciously drawn toward products that are colored blue or green). If asked, most people might not want to disclose such information, but they might not even know such information about them exists. Privacy experts worry that people will be tagged and suffer adverse consequences without due process, without the ability to fight back, and without even knowing that they have been discriminated against.
Case Study Questions
Please use the case study above to answer the questions below.
- What business benefits did the companies and services described in this case achieve by analyzing and using big data?
- Identify two decisions at the organizations described in this case that were improved by using big data and two decisions that big data did not improve.
- List and describe the limitations to using big data.
- Should all organizations try to analyze big data? Why or why not? What people, organization, and technology issues should be addressed before a company decides to work with big data?