Mark Chu-Carroll is a Google software engineer, and in a personal blog post this week explained the concept of one of Google’s programming models: MapReduce, which splits a task onto many computers on Google’s server farm (server farm, or single super computer, depending on how you look at it) to be quickly crunched.
What is MapReduce? What does it do?
Suppose you’re at work, and you need to do something that’s going to take a long time to run on your computer. You don’t want to wait. But you don’t want to go out and spend a couple of million dollars buying a supercomputer. How do you make it run faster? One way is buy a whole bunch of cheap machines, and make it run on all of them at once. Another is to notice that your office has lots of computers – pretty much every office has a computer on the desk of every employee. And at any given moment, most of those computers aren’t doing much. So why not take advantage of that? When your machine isn’t doing much, you let you coworkers borrow the capability you’re not using; when you need to do something, you can borrow their machines. So when you need to run something big, you can easily find a pool of a dozen machines.
The problem with that approach is that most programs aren’t written to run on a dozen machines. They’re written to run on one machine. To split a hard task among a lot of computers is hard.
MapReduce is a library that lets you adopt a particular, stylized way of programming that’s easy to split among a bunch of machines. The basic idea is that you divide the job into two parts: a Map, and a Reduce. Map basically takes the problem, splits it into sub-parts, and sends the sub-parts to different machines – so all the pieces run at the same time. Reduce takes the results from the sub-parts and combines them back together to get a single answer.
The key to how MapReduce does things is to take input as, conceptually, a list of records. The records are split among the different machines by the map. The result of the map computation is a list of key/value pairs. Reduce takes each set of values that has the same key, and combines them into a single value. So Map takes a set of data chunks, and produces key/value pairs; reduce merges things, so that instead of a set of key/value pair sets, you get one result. You can’t tell whether the job was split into 100 pieces or 2 pieces; the end result looks pretty much like the result of a single map.
Mark adds that “The beauty of MapReduce is that it’s easy to write.” and that MapReduce (or “M/R”) programs are “really as easy as parallel programming ever gets.” For a more in-depth look at MapReduce and some actual source code, take a look at the Google research publication on the subject.
[Via Friendfeed. Mark’s post is Creative Commons-licensed.]