Why Big Data is Better

Quick! What is America's favorite pie? It’s apple, of course. We know this because of data. Look at supermarket sales of frozen, 12-inch pies, and apple wins. No contest.

Written By Kenneth Cukier ’91 | Senior Editor at The Economist

But then supermarkets started selling smaller, 7-inch pies, and suddenly, apple fell to fourth or fifth place. What happened?

Think about it. When you buy a 12-inch pie, the whole family has to agree, and apple is everyone's second favorite. But when you buy an individual, 7-inch pie, you can buy the one that you want. You can get your first choice.

The point is that when you have more data, you can see things that you can’t see when you only have smaller amounts of it. More data doesn't just let us see more. More data lets us see new. It lets us see better. It lets us see different. In this case, it allows us to see what America's favorite pie is. Not apple.

The example comes from the late economist Walter Oi of the University of Rochester. He was a famous prankster and liked to occasionally “invent” facts just to keep things interesting. So the story may not even be true. But it captures how the “ground truth” of the world changes based on the data we collect.
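The logic of the pie story can be sketched in a few lines of Python. This is only an illustration: the households, the rankings and the rule that a family buys the pie with the best combined ranking are all invented assumptions, not real sales data.

```python
from collections import Counter

# Hypothetical households, invented for illustration. Each inner list is one
# person's ranking of the four pies, favorite first.
HOUSEHOLDS = [
    [
        ["cherry", "apple", "pecan", "pumpkin"],
        ["pecan", "apple", "pumpkin", "cherry"],
        ["pumpkin", "apple", "cherry", "pecan"],
    ],
    [
        ["cherry", "apple", "pumpkin", "pecan"],
        ["pecan", "apple", "pumpkin", "cherry"],
    ],
    [
        ["pecan", "apple", "cherry", "pumpkin"],
        ["cherry", "apple", "pumpkin", "pecan"],
        ["pumpkin", "apple", "pecan", "cherry"],
    ],
]


def family_pick(rankings):
    """Assumed compromise rule: buy the pie with the lowest total rank position."""
    pies = rankings[0]
    return min(pies, key=lambda pie: sum(r.index(pie) for r in rankings))


# One large pie per household: everyone has to agree.
large_pie_sales = Counter(family_pick(h) for h in HOUSEHOLDS)

# One small pie per person: everyone buys their first choice.
small_pie_sales = Counter(r[0] for h in HOUSEHOLDS for r in h)

print("12-inch sales:", large_pie_sales)
print("7-inch sales:", small_pie_sales)
```

In this toy data set, apple wins every household purchase, yet it is nobody's first choice. The coarser data hid that.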

We have all heard the term “big data.” In fact, we’re a bit sick of hearing the term “big data.” It’s true there is a lot of hype. This is unfortunate because big data is an important tool by which society is going to advance. The idea is basically this: we can do with a large body of data things that we fundamentally can’t do when we’re only working with smaller amounts of it. The change in scale leads to a change in state. A quantitative shift leads to a qualitative shift.

Big data is new and important. The only way this planet is going to deal with its global challenges—to feed people, give them medical care, supply them with energy and make sure they're not burnt to a crisp because of global warming—is through the effective use of data.

So what is so new about big data? To answer that question, consider what information physically looked like in the past.

In 1908 on the island of Crete in Greece, archaeologists discovered a small clay disc that they named the Phaistos disc. They dated it from 2000 B.C. There are inscriptions on it, but we don't know what they mean. Yet this is what information used to look like 4,000 years ago. It is how society stored and transmitted information.

Since then, society hasn't advanced that much. We still store information on discs—only now they’re computer disc drives. We can store a lot more. Searching it, copying it, sharing it and processing it are easier. We can reuse the information for purposes that were never imagined when it was first collected. In this respect, the data has gone from a stock to a flow, from something that is stationary and static to something that is fluid and dynamic. There is a “liquidity” to information.

The disc discovered on Crete is heavy. It doesn't store a lot of information. And the information is unchangeable. By contrast, all of the files that Edward Snowden took from the U.S. National Security Agency fit on a memory stick the size of a fingernail and can be shared at the speed of light.

Why do we have so much data today? One reason is we are collecting more information on things that we've always bothered to record. But another reason is that we’re taking things that have always been informational but never rendered into data before, and we are turning them into data.

Take, for example, location. Where someone is at any time is a matter of information. But it’s not a matter of data. So imagine that in 1776 I wanted to know where Paul Revere was. Is he in his silversmith shop, or is he in the pub? Where Paul Revere is, is a matter of information. But it’s not data. And if I wanted to record his location at all times, I’d need to constantly write it down with a feathery quill pen. Hard to do.

Now, think of our own lives. You know that somewhere there is a record of your location at all times, going back at least a decade. It’s in a mobile phone operator’s database. In this respect, location has been datafied.

Or, think of posture. The way that we all sit is different. It's a function of leg length and the contours of the back. And if one were to put 100 pressure sensors into a chair, it could create an index that's fairly unique, like a fingerprint. So what could we do with it?

Researchers in Tokyo are using it as a potential anti-theft device in cars. The idea is that the carjacker jumps behind the wheel, tries to speed off, but the car recognizes that a non-approved driver is behind the wheel, and maybe the engine stops unless you type a password into the dashboard. (Parents of teenagers can perhaps think of other uses of this very wise technology!)

What if every car in America had this installed? What could we do then? Maybe, if we aggregated the data, we could identify signs that best predict that a car accident is going to take place in the next five seconds.

What we would have datafied is driver fatigue. And the service would be that when the car detects that the person slumps into that position, it triggers an alarm to vibrate the steering wheel or honk inside the car to signal: "Hey! Wake up! Pay more attention to the road!" These are the sorts of things we can do when we datafy more aspects of our lives.
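The seat-sensor idea can be sketched as a simple matching problem. What follows is a minimal illustration, not the Tokyo researchers' method: the driver profiles, the four-sensor readings (standing in for 100) and the tolerance value are all hypothetical.

```python
import math

# Hypothetical enrolled drivers. Each profile is a vector of normalized
# pressure readings; four sensors stand in for the 100 in the example.
PROFILES = {
    "owner": [0.9, 0.7, 0.4, 0.2],
    "spouse": [0.5, 0.8, 0.6, 0.3],
}

# Assumed tolerance: how far a reading may drift from a stored profile
# and still count as the same person.
THRESHOLD = 0.15


def identify_driver(reading, profiles=PROFILES, threshold=THRESHOLD):
    """Return the closest enrolled driver, or None if nobody is close enough."""
    name, profile = min(
        profiles.items(), key=lambda item: math.dist(reading, item[1])
    )
    return name if math.dist(reading, profile) <= threshold else None


print(identify_driver([0.88, 0.72, 0.41, 0.19]))  # sits almost like the owner
print(identify_driver([0.1, 0.2, 0.9, 0.9]))      # matches no enrolled driver
```

A real system would need far more sensors and a more careful notion of similarity, but the principle is the same: once posture is a list of numbers, recognizing a driver becomes a comparison between lists.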

There are many technologies around big data. But one of the most impressive is an area called machine learning. It is a branch of artificial intelligence, which itself is a branch of computer science. But at its heart, it’s really just about basic statistics. The general idea is that instead of instructing a computer what to do, we simply throw data at the problem and tell the computer to figure it out for itself.

It will help to understand it by appreciating its origins. In the 1950s a computer scientist at IBM named Arthur Samuel liked to play checkers, so he wrote a computer program so he could play against the computer. He played and he always won—because the computer only knew what a legal move was. Arthur Samuel knew something else: strategy.

So he wrote a small sub-program that operated in the background. It scored the probability that a given board configuration would lead to a winning board rather than a losing one. He played the computer. He still won.

And then Arthur Samuel left the computer to play itself. It played itself; it collected more data. As it collected more data, it increased the accuracy of its prediction. Then he went back to the computer and played it, and he always lost. Arthur Samuel had created a machine that surpassed his ability in a task that he taught it.
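The self-play idea can be sketched with a game far simpler than checkers. This is not Samuel's program. It is a toy built on assumptions: a hypothetical stone-taking game (take one, two or three stones from a pile; whoever takes the last stone wins), random self-play, and a plain win-rate tally for each position.

```python
import random
from collections import defaultdict

MOVES = (1, 2, 3)  # a player may take 1, 2 or 3 stones per turn


def self_play(n_games, start=10, seed=0):
    """Play random games against itself and tally, for each pile size,
    how often the player about to move from it went on to win."""
    rng = random.Random(seed)
    wins = defaultdict(int)
    visits = defaultdict(int)
    for _ in range(n_games):
        stones, player = start, 0
        history = []  # (pile size, player to move) seen in this game
        while stones > 0:
            history.append((stones, player))
            stones -= rng.choice([m for m in MOVES if m <= stones])
            player = 1 - player
        winner = 1 - player  # the player who took the last stone
        for seen_stones, mover in history:
            visits[seen_stones] += 1
            wins[seen_stones] += mover == winner
    return {s: wins[s] / visits[s] for s in visits}


def best_move(stones, win_prob):
    """Leave the opponent in the position with the lowest estimated win rate.
    An empty pile counts as 0.0: the opponent has already lost."""
    legal = [m for m in MOVES if m <= stones]
    return min(legal, key=lambda m: win_prob.get(stones - m, 0.0))


win_prob = self_play(20000)
for pile in (5, 6, 7):
    print(pile, "stones: take", best_move(pile, win_prob))
```

Nobody told the program that leaving four stones is a trap for the opponent. With enough games, that pattern shows up in the tallies, and the program starts steering toward it.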

Machine learning is at the heart of many things we do every day: search engines, Amazon's personalization algorithm, computer translation, voice recognition systems. It’s the reason why we have self-driving cars.

It’s not because we are better at enshrining all the rules of the road into software. No. It's because we changed the nature of the problem from one in which we tried to explicitly teach the computer how to drive, to one in which we said, “Here’s a lot of data around the vehicle. You figure it out.” The computer makes hundreds of predictions a second (this is a stoplight, that is a bicycle rider), and now we have cars that drive themselves.

Researchers at Stanford and Harvard recently applied computer vision and machine learning to see if a machine could detect highly cancerous cells in biopsies, based on patients’ survival rates. Sure enough, the algorithm did better than the human pathologists. In fact, the algorithm was able to identify the 12 telltale signs that best predicted the biopsy was highly cancerous. The problem? The medical literature only knew nine of them. Three of the traits were ones that people didn't know to look for, but the algorithm spotted them.

Big data will improve our lives. But there are dark sides as well. Yes, privacy. However, just as worrying is “propensity”—the idea that there will be algorithms predicting what we do and we may be held accountable before we’ve acted. We’ll be punished for a prediction. Privacy was the central challenge in a small data era. In the age of big data, the challenge will be safeguarding free will, moral choice and human agency.

There is another problem: Big data is going to steal our jobs. Big data and algorithms are going to challenge white-collar, professional knowledge work in the 21st century in the same way that factory automation and the assembly line challenged blue-collar labor in the 20th century.

Think about the pathologist peering into a microscope at a biopsy to determine whether it's cancerous. The person went to university, buys property, votes. He or she is a stakeholder in society. And that person, as well as an entire fleet of professionals, is going to find that their jobs are radically changed or completely destroyed.

We like to think that technology creates more jobs over time, after a temporary period of dislocation. And that was true for the frame of reference with which we’re familiar, the Industrial Revolution. Farm jobs became factory jobs and then nicer office jobs.

But that analysis forgets that there are some categories of jobs that, when eliminated, never come back. For example, the Industrial Revolution wasn't very good if you were a horse. It just didn’t matter if the horse went to a wonderful liberal arts college in the Midwest: once the tractor and automobile arrived, there was less need for horses. Will today’s graduates go the way of horses as big data hits the office cubicle?

It’s too early to say. But we are going to need to be careful and apply big data to our needs. We need to work with the machine, and bring our very human traits: inquisitiveness, ambition, our sense of daring. We have to be the master of the technology, not its servant.

We are just at the outset of the big data era, and honestly, we are not very good at handling all the data that we can now collect. It's not just a problem for the National Security Agency. Businesses collect lots of data, and they misuse it too. We need to get better at this, and it will take time. It's like the challenge faced by early man and fire. This is a tool, but this is a tool that, unless we're careful, will burn us.

Big data is going to transform how we live, work and think. It is going to help us manage our careers and lead lives of satisfaction and hope, happiness and health.

In the past, we've often looked at information technology and our eyes have only seen the “T,” the technology, the hardware because that’s what is physical. We now need to recast our gaze at the “I,” the information, which is less apparent, but in some ways more important. Humanity can finally learn from the information that it can collect, as part of our timeless quest to understand the world and our place in it. And that's why big data is a big deal.

This essay was adapted from a TED Talk the author delivered in Berlin in June 2014.
