Today we’re having a look at MapReduce, a programming model you can use on Hadoop!

Demo

  1. Input records: (A,B,C,1), (A,B,C,2), (A,B,C,3)
  2. Map: pull out the (key, value) pairs we care about
  3. (B,1), (B,2), (B,3)
  4. Shuffle: group the mapped values by key
  5. (B, [1,2,3])
  6. Reduce: here, take the max for each key (see the sketch below)
  7. Output: (B,3)
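To make that concrete, here’s a tiny in-memory simulation of the pipeline. I’m assuming the reduce step is “take the max”, since that matches the (B,3) output:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy, single-process simulation of the demo: map output is grouped by key
// ("shuffle") and then reduced by taking the max value per key.
public class MiniMapReduce {
  public static void main(String[] args) {
    // Step 3: the mapped (key, value) pairs.
    List<Map.Entry<String, Integer>> mapped = List.of(
        Map.entry("B", 1), Map.entry("B", 2), Map.entry("B", 3));

    // Steps 4/5: shuffle — group values by key, giving (B, [1, 2, 3]).
    Map<String, List<Integer>> grouped = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> pair : mapped) {
      grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
    }

    // Steps 6/7: reduce — take the max per key, printing (B, 3).
    grouped.forEach((key, values) -> {
      int max = values.stream().mapToInt(Integer::intValue).max().getAsInt();
      System.out.println("(" + key + ", " + max + ")");
    });
  }
}
```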

It feels like the shuffle step is very hard. How does Hadoop get the right data to the right machines? Is it just hashed? I’m not too sure. 🤔
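From what I can tell, the answer is basically yes: by default Hadoop uses a HashPartitioner, which hashes each map output key and takes it modulo the number of reduce tasks, so every record with the same key lands on the same reducer. A sketch of that logic:

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of what the default HashPartitioner does: hash the map output key
// and bucket it by the number of reduce tasks, so all records sharing a key
// are routed to the same reducer.
public class SketchHashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask off the sign bit so the result is never negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```

If you need different routing you can plug in your own Partitioner instead.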

The command hadoop MyClass input.txt outputdir runs the job (there’s a sketch of what MyClass might look like below).

  • output dir cannot exist before running job
  • HADOOP_CLASSPATH env var can be used to add application classes to classpath
  1. Hadoop tries its best to run each map task on a node where the input data already resides
  2. Map tasks write to local disk, not to HDFS, since map output is just intermediate output
  3. All this intermediate output is sent to the reduce tasks
  4. Reduce tasks write their output to HDFS
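For reference, here’s a minimal sketch of what MyClass might look like, using the standard Mapper/Reducer/Job API. The max-temperature-per-station logic and the “stationId temperature” line format are just assumptions for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyClass {

  // Mapper: one input line in, a (stationId, temperature) pair out.
  public static class MaxMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws java.io.IOException, InterruptedException {
      String[] parts = line.toString().split("\\s+");  // assumed "stationId temperature" format
      if (parts.length == 2) {
        ctx.write(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
      }
    }
  }

  // Reducer: all temperatures for one station in, the max out.
  public static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws java.io.IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable v : values) {
        max = Math.max(max, v.get());
      }
      ctx.write(key, new IntWritable(max));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max temperature");
    job.setJarByClass(MyClass.class);
    job.setMapperClass(MaxMapper.class);
    job.setReducerClass(MaxReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input.txt
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // outputdir — must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Compile it, point HADOOP_CLASSPATH at your classes, and run hadoop MyClass input.txt outputdir as above, remembering that outputdir must not exist yet.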

You can optionally have a combiner function that combines map outputs before they are shuffled. For example, you could find the max temperature within the map output of a single split. Combiner functions are not guaranteed to be run, but when they do run they can save a lot of unnecessary data transfer!
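Continuing the sketch above: since “take the max” is associative, the reducer class can double as the combiner, and wiring it in is a single line in the driver:

```java
// In the driver: reuse the reducer as a combiner so each map task
// pre-aggregates a local max per station before anything hits the network.
job.setCombinerClass(MaxReducer.class);
```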

The other way you can use Hadoop is with streaming scripts! All you need to provide is a mapper and a reducer program that read from stdin and write to stdout. You can sanity-check them locally with plain unix pipes. That’s pretty beautiful!
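Streaming mappers and reducers are usually little scripts, but any executable that reads stdin and writes stdout will do, so to stay in Java here’s a sketch of a streaming-style reducer. The interesting bit is that, unlike the Java API, it receives one flat sorted stream of key&lt;TAB&gt;value lines and has to spot key boundaries itself (the matching mapper would just read raw lines and print key&lt;TAB&gt;value pairs):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Streaming-style reducer: reads sorted "key<TAB>value" lines from stdin and
// writes "key<TAB>max" lines to stdout. It has to detect key changes itself,
// because streaming hands it one continuous sorted stream rather than
// grouped (key, [values]) pairs.
public class StreamReducer {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String currentKey = null;
    int max = Integer.MIN_VALUE;
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.split("\t");
      String key = parts[0];
      int value = Integer.parseInt(parts[1]);
      if (currentKey != null && !currentKey.equals(key)) {
        System.out.println(currentKey + "\t" + max);  // key changed: emit the previous key's max
        max = Integer.MIN_VALUE;
      }
      currentKey = key;
      max = Math.max(max, value);
    }
    if (currentKey != null) {
      System.out.println(currentKey + "\t" + max);    // don't forget the last key
    }
  }
}
```

The unix-pipe sanity check is then something like cat input.txt | &lt;your mapper&gt; | sort | java StreamReducer.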