Hands on MapReduce without getting hands dirty

February 4, 2016

Who should read this?

This article is aimed at programmers like myself: curious enough to find out how MapReduce works and keen to try it out, but too lazy to build a Hadoop cluster by hand.

In this article I am using Python and the Hadoop streaming API for the sake of simplicity. Hadoop is written in Java (don't be scared), but the streaming API lets you write mappers and reducers in Python, with data passed in and out via STDIN and STDOUT. For more information, have a quick read of http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/. By the way, all code and sample data used in this article are borrowed from that link.

Quick intro to MapReduce and Hadoop

MapReduce is a fairly simple concept: it reads a big input in small chunks and gradually computes the result piece by piece.

For example, Hadoop is the software I use in this demo. Let's say you have big files containing big data: the system chops the files into chunks and feeds the chunks into map tasks, which in turn spit out mapped data to reducers that compute the results. Clear as mud? Great, read on.

Let's say you have a data file like this; imagine there are a million lines of it.

foo     1
foo     1
quux    1
labs    1
foo     1
bar     1
quux    1
...

And you want to count the occurrences so the result becomes:

foo  3
quux 2
labs 1
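
Just to be clear about what we are computing: on a single machine you could get the same result with a few lines of plain Python. The little script below, and the idea of feeding the file in via STDIN, is just my sketch and has nothing to do with Hadoop; MapReduce exists so you can get the same answer when the data is far too big for one machine.

from collections import Counter
import sys

counts = Counter()
for line in sys.stdin:          # e.g. python count.py < data.txt
    parts = line.split()        # 'foo     1' -> ['foo', '1']
    if len(parts) != 2:
        continue                # skip blank or malformed lines
    word, n = parts
    counts[word] += int(n)

for word, total in counts.items():
    print '{}\t{}'.format(word, total)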

To begin with, you typically store the data file somewhere the cluster can read it. Hadoop has a file system called “HDFS”, so you store that file on HDFS. When your job starts, it splits the file(s) into chunks, with the splits falling on line breaks; e.g. a chunk might contain 400 lines.

Each chunk is then fed into the “mapper”, in this example via STDIN. A mapper is a piece of code that does most of the logic and ultimately spits out a key and a value. E.g. given the line ‘A,3,5,7’, if you are only interested in the first and third fields, in this case ‘A’ and ‘5’, your mapper will split the line by commas, pick out those values, then output ‘A<tab>5’. That way Hadoop knows the key is ‘A’ and the value is ‘5’.
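
As a rough sketch, a streaming mapper for that comma-separated example could look like the following (the script itself, and its assumption that every useful line has at least three fields, are mine, not from the original tutorial):

#!/usr/bin/env python
import sys

# Hadoop streaming feeds each input line to the mapper via STDIN
for line in sys.stdin:
    fields = line.strip().split(',')    # 'A,3,5,7' -> ['A', '3', '5', '7']
    if len(fields) < 3:
        continue                        # ignore lines that are too short
    # emit key<TAB>value so Hadoop treats 'A' as the key and '5' as the value
    print '{}\t{}'.format(fields[0], fields[2])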

IMPORTANT: results spat out by the mapper don't immediately end up in the reducer; there is a shuffle-and-sort layer that sorts the key-value pairs and then feeds them into reducer instances. In other words, the mapper instances process all the data before the reducers start. Also, for your peace of mind, Hadoop makes sure the data passed to each reducer instance via STDIN has its keys in sorted order.

Eg.

foo\t1
foo\t1
foo\t1
labs\t1

So in your reducer logic, as you loop through the lines of STDIN, you can simply check whether the key has changed; if it has, you can safely emit the count and reset the counter, without worrying about the previous key showing up again.

eg.

import sys

key = None
counter = 0
for line in sys.stdin:
    # each line looks like 'foo\t1'
    _key, _counter = line.strip().split('\t')
    if key is not None and key != _key:
        # the key changed, so output the total for the previous key
        print '{}\t{}'.format(key, counter)
        counter = 0
    counter += int(_counter)
    key = _key

# don't forget to output the very last key
if key is not None:
    print '{}\t{}'.format(key, counter)
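
Before running anything on Hadoop, you can sanity-check the logic locally by piping the data through the scripts yourself, something like cat data.txt | python mapper.py | sort | python reducer.py (the file and script names here are placeholders; sort stands in for Hadoop's shuffle-and-sort phase).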

Hands on, without getting hands dirty

So how do you run a MapReduce task without setting up a Hadoop cluster on your computer by hand? I am going to show you how to run a simple MapReduce task on Amazon Elastic MapReduce (EMR). First you need an Amazon Web Services account; put in your credit card details and you are almost ready.

Remember I mentioned storing the data file on the Hadoop filesystem (HDFS)? EMR actually reads input from and saves output to an S3 bucket, so go to https://console.aws.amazon.com/s3/ and create yourself a bucket.
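
Your job's input and output locations will then be S3 paths, something like s3://my-emr-demo-bucket/input/ and s3://my-emr-demo-bucket/output/ (the bucket name here is made up; use whatever you just created).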

Upload files

First download the files, unzip them and upload them to the S3 bucket like this:

[Screenshots: the unzipped sample files uploaded to the S3 bucket]

Create EMR cluster

Then create the EMR cluster here http://console.aws.amazon.com/elasticmapreduce/home

[Screenshot: the EMR cluster creation page]

  • I chose “Core Hadoop”, as that's the only thing I needed.
  • I chose the cheapest instance type (m1.medium) to save money.
  • For the EC2 key pair, just choose ‘not needed’, as we are not going to SSH into the instances.
  • I left the instance count at the default of 3.

Wait for the cluster to be created; it usually takes about 10 minutes. Then you will see this screen:

[Screenshot: the cluster up and running]

Add Step

[Screenshot: the “Add step” button on the cluster page]

Click on “Add step”, and configure your new step

[Screenshot: the step configuration dialog]

It’s worth mentioning that:

  • the input location can point to a file, a folder, or a wildcard.
  • the output location should point to a folder that doesn't exist yet, otherwise Hadoop will throw an error, because it refuses to overwrite existing data. In this case the ‘output’ directory doesn't exist, and Hadoop will create it.

Click on “Add”

Wait For Task To Run

[Screenshot: the step running]

Task Completed

[Screenshot: the step completed]

Let’s Check Out The Output

[Screenshot: the output folder in the S3 bucket]

As we can see, it created 3 data files, one per reducer, which is a result of the 3 instances we created. If I open one of them, I can see this inside:

bar 1
foo 3

Another file:

quux 2

Finish

So there you have it. Don't forget to terminate the cluster, or you will end up receiving a big bill from Amazon.
