Who should read this?
This article is aimed at programmers like me: curious enough to find out how MapReduce works and keen to try it out, but too lazy to build a Hadoop cluster by hand.
In this article I use Python and the Hadoop Streaming API for the sake of simplicity. Hadoop itself is written in Java (don’t be scared), but Hadoop Streaming lets you write mappers and reducers in Python, with data passed in and out via STDIN and STDOUT. For more information, have a quick read of this link: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ By the way, all the code and sample data used in this article are borrowed from that link.
Quick intro to MapReduce and Hadoop
MapReduce is actually a fairly simple concept: it reads a big input in small chunks and gradually computes the result piece by piece.
For example, with Hadoop (the software I use in this demo), let’s say you have big files containing big data: the system chops the files into chunks, feeds the chunks into map tasks, and the map tasks in turn spit out mapped data to reducers, which compute the results. Clear as mud? Great, read on.
Let’s say you have a data file like this, and imagine there are a million lines of it:
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1
...
And you want to count the occurrences, so the output becomes:
foo 3
quux 2
labs 1
To begin with, you typically store the data file somewhere the cluster can read it. Hadoop has a file system called “HDFS”, so you store that file on HDFS. When your job starts, Hadoop splits the file(s) into chunks, typically along line breaks; e.g. a chunk might contain 400 lines.
Each chunk is then fed into the “mapper”, in this example via STDIN. A mapper is a piece of code that does most of the logic and ultimately spits out a key and a value. E.g. given a line ‘A,3,5,7’ where you are only interested in the first and third fields, in this case ‘A’ and ‘5’, your mapper will split the line by comma, pick out those values, then output ‘A<tab>5’. So Hadoop knows the key is ‘A’ and the value is ‘5’.
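To make this concrete, here is a minimal sketch of such a mapper. It is only a sketch: the file name mapper.py and the field positions follow the ‘A,3,5,7’ example above, not the word-count job we run later.

import sys

# mapper.py (hypothetical): pick the first and third comma-separated fields
for line in sys.stdin:
    fields = line.strip().split(',')    # 'A,3,5,7' -> ['A', '3', '5', '7']
    key = fields[0]                     # first field, e.g. 'A'
    value = fields[2]                   # third field, e.g. '5'
    # Hadoop Streaming treats everything before the tab as the key
    print('{}\t{}'.format(key, value))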
IMPORTANT: the results spat out by the mappers do not immediately end up in the reducer. There is a shuffle-and-sort layer that sorts the key-value pairs before feeding them into reducer instances. In other words, the mapper instances process all the data before the reducers start. Also, for your peace of mind, Hadoop makes sure the data passed to each reducer instance via STDIN has its keys in sorted order.
Eg.
foo\t1
foo\t1
foo\t1
labs\t1
So in your reducer logic, as you loop through the lines on STDIN, you can simply check whether the key has changed; if it has, you can safely output and reset the counter without worrying about the previous key showing up again.
eg.
import sys

key = None
counter = 0
for line in sys.stdin:
    _key, _counter = line.strip().split('\t')
    if key is not None and key != _key:
        # the key changed, so output the total for the previous key
        print('{}\t{}'.format(key, counter))
        counter = 0
    counter += int(_counter)
    key = _key
# don't forget to output the very last key after the loop ends
if key is not None:
    print('{}\t{}'.format(key, counter))
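Because both the mapper and the reducer just read STDIN and write STDOUT, you can sanity-check them locally before going anywhere near a cluster, for example by piping your sample file through the mapper, then sort, then the reducer (cat data.txt | python mapper.py | sort | python reducer.py); the sort stands in for Hadoop’s shuffle-and-sort phase. The file names here are my own; use whatever you named your scripts.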
Hands on, without getting hands dirty
So how do you run a MapReduce task without setting up a Hadoop cluster on your computer by hand? I am going to show you how to run a simple MapReduce task on Amazon Elastic MapReduce (EMR). The first thing you need is an Amazon Web Services account; put in your credit card and you are almost ready.
Remember I mentioned storing the data file on the Hadoop filesystem (HDFS)? EMR actually reads input from and saves output to an S3 bucket instead, so go here https://console.aws.amazon.com/s3/ to create yourself a bucket.
Upload files
First download the files, unzip them and upload them to the S3 bucket like this:
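If you prefer to script the upload instead of clicking through the S3 console, a rough sketch with boto3 might look like this (the bucket name, key prefix and file names below are made up; substitute your own):

import boto3

s3 = boto3.client('s3')
bucket = 'my-mapreduce-demo-bucket'  # hypothetical bucket name

# upload the sample data plus the mapper and reducer scripts
for filename in ['data.txt', 'mapper.py', 'reducer.py']:
    s3.upload_file(filename, bucket, 'wordcount/' + filename)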
Create EMR cluster
Then create the EMR cluster here http://console.aws.amazon.com/elasticmapreduce/home
- I chose “Core Hadoop”, as that’s the only thing I needed.
- I chose the cheapest instance type (m1.medium) to save money.
- For the EC2 key pair, just choose ‘not needed’, as we are not going to SSH into the instances.
- I left the instance count at the default of 3. (If you would rather create the cluster from code, see the boto3 sketch below.)
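For completeness, here is a rough boto3 sketch of the same cluster creation. Treat it as an illustration only: the cluster name, region, log location and release label are my assumptions, and newer EMR release labels may not accept an instance type as old as m1.medium, so check the current EMR documentation before running it.

import boto3

emr = boto3.client('emr', region_name='us-east-1')  # pick your own region

response = emr.run_job_flow(
    Name='wordcount-demo',                    # hypothetical cluster name
    ReleaseLabel='emr-5.20.0',                # assumption; use a current release label
    Applications=[{'Name': 'Hadoop'}],        # the "Core Hadoop" choice from the console
    Instances={
        'MasterInstanceType': 'm1.medium',    # as in the console walkthrough above
        'SlaveInstanceType': 'm1.medium',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': True,  # keep the cluster up so we can add a step later
    },
    JobFlowRole='EMR_EC2_DefaultRole',        # the default EMR roles
    ServiceRole='EMR_DefaultRole',
    LogUri='s3://my-mapreduce-demo-bucket/logs/',  # hypothetical bucket
)
print(response['JobFlowId'])  # you will need this id to add steps or terminate later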
Wait for the cluster to be created; it usually takes about 10 minutes. Then you will see this screen:
Add Step
Click on “Add step” and configure your new step.
It’s worth mentioning that:
- the input location can point to a file, a folder, or a wildcard.
- the output location should point to a folder that doesn’t exist yet, otherwise Hadoop will throw an error, the reason being that it doesn’t want to overwrite any data. In this case the ‘output’ directory doesn’t exist, so Hadoop will create it. (If you would rather add the step from code, see the boto3 sketch right after this list.)
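For reference, the same step can be added programmatically. The sketch below is my assumption of how the console form maps onto the API (the cluster id, bucket and paths are placeholders): it submits a Hadoop Streaming step whose mapper, reducer, input and output correspond to the fields you fill in on the “Add step” screen.

import boto3

emr = boto3.client('emr')
bucket = 's3://my-mapreduce-demo-bucket'  # hypothetical bucket

emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',  # placeholder; the id of the cluster created above
    Steps=[{
        'Name': 'wordcount streaming step',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',   # EMR's wrapper for running hadoop-streaming
            'Args': [
                'hadoop-streaming',
                '-files', bucket + '/wordcount/mapper.py,' + bucket + '/wordcount/reducer.py',
                '-mapper', 'mapper.py',
                '-reducer', 'reducer.py',
                '-input', bucket + '/wordcount/data.txt',
                '-output', bucket + '/wordcount/output',   # must not exist yet
            ],
        },
    }],
)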
Click on “Add”
Wait For Task To Run
Task Completed
Let’s Check Out The Output
As we can see, it created 3 data files, one per reducer task; with the 3 instances we created, the job ended up running 3 reducers. If I open one of them, I can see this inside the file:
bar 1
foo 3
Another file:
quux 2
Finish
So there you have it. Don’t forget to terminate the cluster, or you will end up receiving a big bill from Amazon.
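If you went the scripted route, you can terminate the cluster from code too; a one-line sketch with boto3 (the cluster id is a placeholder for the one returned when the cluster was created):

import boto3

emr = boto3.client('emr')
emr.terminate_job_flows(JobFlowIds=['j-XXXXXXXXXXXXX'])  # placeholder cluster id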