Raspberry PI 2 Hadoop 2 Cluster

By | May 16, 2016

Set up your own Raspberry PI 2 Hadoop 2 cluster (Raspberry PI 2 Model B and Hadoop 2.7.2) on Raspbian Linux. For some background and general information about Hadoop please see my previous post:

http://www.widriksson.com/raspberry-pi-hadoop-cluster/


The Setup

  • 3x Raspberry PI 2 Model B (quad-core CPU, 1GB RAM)
  • 3x 16GB MicroSDHC cards (SanDisk UHS-I 10x)
  • Raspbian Jessie Lite
    (Linux 4.1.17-v7+)
  • Oracle Java 1.8.0_65
  • Hadoop 2.7.2 (ARM native compiled from source)

Difference Hadoop 1.x vs. Hadoop 2.x

Some of the major differences are:

  • YARN (Yet Another Resource Negotiator) – Next generation MapReduce (MRv2)
    • Separation of the processing engine from resource management, which were combined in the Hadoop 1.x MapReduce framework
    • In Hadoop 1.x all processing was done through the MapReduce framework. With Hadoop 2.x the use of other data processing frameworks is possible
    • TaskTracker slots are replaced with containers which are more generic
    • Hadoop 2.x MapReduce programs are backward compatible with Hadoop 1.x MapReduce
    • Overall increased scalability and performance
  • HDFS Federation – Possible to use multiple namenode servers to manage the namespace, which allows for horizontal scaling

Hadoop 1 vs 2

Deprecated properties

See the following URL for properties that are deprecated from Hadoop 1.x:
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/DeprecatedProperties.html

Web Interface (default ports)

With YARN and MapReduce 2 (MR2) the HTTP ports for monitoring have changed:

TCP Port | Application | Configuration property
8032 | ResourceManager Client RPC | yarn.resourcemanager.address
8030 | ResourceManager Scheduler RPC | yarn.resourcemanager.scheduler.address
8033 | ResourceManager Admin RPC | yarn.resourcemanager.admin.address
8088 | ResourceManager Web UI and REST APIs | yarn.resourcemanager.webapp.address
8031 | ResourceManager Resource Tracker RPC | yarn.resourcemanager.resource-tracker.address
8040 | NodeManager Localizer RPC | yarn.nodemanager.localizer.address
8042 | NodeManager Web UI and REST APIs | yarn.nodemanager.webapp.address
10020 | Job History RPC | mapreduce.jobhistory.address
19888 | Job History Web UI and REST APIs | mapreduce.jobhistory.webapp.address
13562 | Shuffle HTTP | mapreduce.shuffle.port
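
Most of these interfaces can also be queried from the command line. As a quick sanity check once the cluster is up (a sketch, assuming the ResourceManager runs on the local machine with the default webapp port 8088), the REST APIs can be queried with curl:

curl http://localhost:8088/ws/v1/cluster/info     # ResourceManager cluster info
curl http://localhost:8088/ws/v1/cluster/metrics  # cluster-wide resource metrics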

Install Raspbian and prepare environment for Hadoop

Raspbian installation

Download Raspbian Jessie Lite:

https://www.raspberrypi.org/downloads/raspbian/

Write the image to the SD card using any tool of your choice; on Windows I use:

https://sourceforge.net/projects/win32diskimager/

Plug in the SD card and fire up your PI.

For initial configuration, run raspi-config:

  • Expand filesystem
  • Under 9 Advanced Options -> A3 Memory Split
    Choose 16MB to give as much RAM as possible for Hadoop

If you wish you may also try to overclock the PI a bit to improve performance. For this tutorial I use the default settings.
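
If you prefer not to use the interactive menu, the memory split can also be set directly in /boot/config.txt (a sketch; raspi-config writes the same setting):

# /boot/config.txt - give the GPU the minimum amount of RAM
gpu_mem=16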

Configure network

Login as pi (default password: raspberry)

Edit (as root with sudo -s):

 /etc/dhcpcd.conf

At the bottom of the file add:

interface eth0
static ip_address=192.168.0.110/24
# static ip_address=192.168.0.111/24
# static ip_address=192.168.0.112/24
static routers=192.168.0.1
static domain_name_servers=123.123.123.123 123.123.123.123

Change the domain name servers to match your environment. Note that we are not required to make any changes to /etc/network/interfaces in Raspbian Jessie.
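
To apply the new address without a full reboot you can restart the dhcpcd service and verify the result (a sketch, assuming the systemd service name dhcpcd used by Raspbian Jessie):

sudo systemctl restart dhcpcd
ip addr show eth0     # should list 192.168.0.110/24 on this node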

Update system and install Oracle Java

sudo apt-get update && sudo apt-get install oracle-java8-jdk

Run update-alternatives, ensure jdk-8-oracle-*** is selected:

sudo update-alternatives --config java
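
To verify that the expected JDK is active:

java -version     # should report java version "1.8.0_65" or similar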

Configure Hadoop user

Create a new user for use with Hadoop:

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo

Create an SSH key pair with a blank password. This will enable the nodes in the cluster to communicate with each other without passwords.

su hduser
mkdir ~/.ssh
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys

Login as hduser (answer yes when prompted to trust the host key, otherwise Hadoop will fail to log in later)

su hduser
ssh localhost
exit
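
If the key-based login still prompts for a password, the usual culprit is directory permissions; ssh requires ~/.ssh and authorized_keys to be restricted (a general sketch, not specific to this setup):

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys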

Compile Native Hadoop 2.7.2 for Raspberry PI (ARM)

Ensure you have logged out as hduser and are logged in as the pi user (so that the sudo commands below work properly).

Install protobuf 2.5.0

This is required to build Hadoop.

wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
tar xzvf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure --prefix=/usr
make
make check
sudo make install
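
Verify the installation before moving on:

protoc --version     # should print: libprotoc 2.5.0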

Install Hadoop 2.7.2

Download and build

wget http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.2/hadoop-2.7.2-src.tar.gz
tar xzvf hadoop-2.7.2-src.tar.gz

Java 8 enforces stricter Javadoc (doclint) checks than previous versions. We need to relax these checks to be able to compile Hadoop 2.7.2. To do this, edit:

hadoop-2.7.2-src/pom.xml

Between <properties></properties> tags insert the following:

<additionalparam>-Xdoclint:none</additionalparam>

For Hadoop 2.7.2 to build properly we also need to apply a patch.

cd hadoop-2.7.2-src/hadoop-common-project/hadoop-common/src
wget https://issues.apache.org/jira/secure/attachment/12570212/HADOOP-9320.patch
patch < HADOOP-9320.patch

cd ~/hadoop-2.7.2-src/
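
The build requires Apache Maven, which is not installed by default on Raspbian Jessie Lite. If mvn is not found, install it first (a sketch using the distribution package; the packaged version may vary):

sudo apt-get install maven
mvn -version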

To start building Hadoop, run the following (note: this may take around 1.5 hours):

sudo mvn package -Pdist,native -DskipTests -Dtar

Install

Copy compiled binaries to /opt

cd hadoop-dist/target/
sudo cp -R hadoop-2.7.2 /opt/hadoop

Give access to hduser

sudo chown -R hduser:hadoop /opt/hadoop/

Verify installation and native libraries

su hduser
cd /opt/hadoop/bin
hadoop checknative -a

16/03/24 20:20:03 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
16/03/24 20:20:03 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /opt/hadoop/lib/native/libhadoop.so.1.0.0
zlib: true /lib/arm-linux-gnueabihf/libz.so.1
snappy: true /usr/lib/libsnappy.so.1
lz4: true revision:99
bzip2: true /lib/arm-linux-gnueabihf/libbz2.so.1
openssl: true /usr/lib/arm-linux-gnueabihf/libcrypto.so

Check version

hadoop version

Hadoop 2.7.2
Subversion Unknown -r Unknown
Compiled by root on 2016-02-21T19:05Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-2.7.2.jar

Configure environment variables

In /etc/bash.bashrc, add to bottom of file:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:jre/bin/java::")
export HADOOP_INSTALL=/opt/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_HOME=$HADOOP_INSTALL

Edit and change the variables in Hadoop's hadoop-env.sh (/opt/hadoop/etc/hadoop/hadoop-env.sh).

Find out your Java home (readlink -f /usr/bin/java | sed "s:jre/bin/java::").

export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt

Enable the use of native hadoop library and IPv4 stack:

export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_INSTALL/lib/native -Djava.net.preferIPv4Stack=true"
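
Open a new shell (or source the file) and verify that the variables are picked up (a quick sketch):

source /etc/bash.bashrc
echo $HADOOP_HOME        # should print /opt/hadoop
which hadoop             # should print /opt/hadoop/bin/hadoop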

Hadoop 2.7.x YARN and MapReduce memory and resources configuration

One of the challenges of running Hadoop on the Raspberry PI is its limited resources. In order for it to run properly we will need to adjust the memory configuration of the YARN and MapReduce frameworks.

Slots in Hadoop 1.x vs. containers in Hadoop 2.x

In Hadoop 1.x each map or reduce task occupied a slot regardless of how much memory or CPU it used. In Hadoop 2.x YARN coordinates resources (such as RAM, CPU and disk) among the different applications running on the cluster, which allows for more dynamic and efficient use of resources. YARN allocates resources to applications (MapReduce, Spark, Impala etc.) in the form of containers. A container is a unit of processing capacity in YARN. Each application requests processing capacity from YARN, and the request states how much of each resource it would like to use; YARN then balances and negotiates these requests across the cluster and hands out containers accordingly. For the purpose of this tutorial we will look at the memory configuration.

YARN Configuration (yarn-site.xml)

For more details see: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

Property | Default | Ours | Description
yarn.nodemanager.resource.memory-mb | 8192 | 768 | Amount of physical memory, in MB, that can be allocated for containers
yarn.scheduler.minimum-allocation-mb | 1024 | 64 | Minimum amount of memory YARN will allocate for a container. An application should ask for at least this amount of memory. If it asks for more, YARN rounds the request up to the closest multiple of yarn.scheduler.minimum-allocation-mb. For example, with a minimum allocation of 2048MB, an application asking for 3072MB will receive a container of 4096MB
yarn.scheduler.maximum-allocation-mb | 8192 | 256 | Maximum amount of memory YARN will allocate for a container
yarn.nodemanager.vmem-check-enabled | true | true | true/false, whether the virtual memory limit will be enforced
yarn.nodemanager.vmem-pmem-ratio | 2.1 | 2.1 | Ratio of virtual to physical memory when applying the memory limit for containers
yarn.nodemanager.pmem-check-enabled | true | true | true/false, whether the physical memory limit will be enforced

If a container exceeds the physical/virtual memory limits you will receive errors similar to the one below when running MapReduce jobs:

16/05/12 10:07:02 INFO mapreduce.Job: Task Id : attempt_1463046730550_0002_m_000006_0, Status : FAILED
 Container [pid=20945,containerID=container_1463046730550_0002_01_000047] is running beyond virtual memory limits. Current usage: 38.2 MB of 128 MB physical memory used; 484.2 MB of 268.8 MB virtual memory used. Killing container.
 Dump of the process-tree for container_1463046730550_0002_01_000047 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 20945 20941 20945 20945 (bash) 1 2 3121152 493 /bin/bash -c /usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx404m -Djava.io.tmpdir=/hdfs/tmp/nm-local-dir/usercache/hduser/appcache/application_1463046730550_0002/container_1463046730550_0002_01_000047/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/opt/hadoop/logs/userlogs/application_1463046730550_0002/container_1463046730550_0002_01_000047 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.0.110 48748 attempt_1463046730550_0002_m_000006_0 47 1>/opt/hadoop/logs/userlogs/application_1463046730550_0002/container_1463046730550_0002_01_000047/stdout 2>/opt/hadoop/logs/userlogs/application_1463046730550_0002/container_1463046730550_0002_01_000047/stderr
 |- 20952 20945 20945 20945 (java) 668 36 504578048 9288 /usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx404m -Djava.io.tmpdir=/hdfs/tmp/nm-local-dir/usercache/hduser/appcache/application_1463046730550_0002/container_1463046730550_0002_01_000047/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/opt/hadoop/logs/userlogs/application_1463046730550_0002/container_1463046730550_0002_01_000047 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.0.110 48748 attempt_1463046730550_0002_m_000006_0 47

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

MapReduce 2 configuration properties (mapred-site.xml)

For more details see:
https://hadoop.apache.org/docs/r2.7.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

Property | Default | Ours | Description
mapreduce.framework.name | local | yarn | The runtime framework for executing MapReduce jobs. Can be one of local, classic or yarn
mapreduce.map.memory.mb | 1024 | 256 | The amount of memory to request from the scheduler for each map task
mapreduce.map.java.opts | -Xmx1024M | -Xmx204M | Java opts (max JVM heap size for a map task; should be less than mapreduce.map.memory.mb, common recommendation 0.8 * mapreduce.map.memory.mb)
mapreduce.map.cpu.vcores | 1 | 2 | Number of CPUs available for map tasks
mapreduce.reduce.memory.mb | 1024 | 128 | Max amount of memory for a reduce task
mapreduce.reduce.java.opts | -Xmx2560M | -Xmx102M | Java opts (max JVM heap size for a reduce task; should be less than mapreduce.reduce.memory.mb, common recommendation 0.8 * mapreduce.reduce.memory.mb)
mapreduce.reduce.cpu.vcores | 1 | 2 | Number of CPUs available for reduce tasks
yarn.app.mapreduce.am.resource.mb | 1536 | 128 | Max amount of memory for the App Master
yarn.app.mapreduce.am.command-opts | -Xmx1024m | -Xmx102M | Java opts (max JVM heap size for the App Master)
yarn.app.mapreduce.am.resource.cpu-vcores | 1 | 1 | Max number of CPUs for the App Master
mapreduce.job.maps | 2 | 2 | The default number of map tasks per job. Ignored when mapreduce.jobtracker.address is "local"
mapreduce.job.reduces | 1 | 2 | The default number of reduce tasks per job. Ignored when mapreduce.jobtracker.address is "local"

YARN and MapReduce memory configuration overview

With our setup using the values above, the YARN resource manager could potentially give us the scenarios shown below when running MapReduce jobs:

hadoop 2 yarn mapreduce 2 memory configuration 1

hadoop 2 yarn mapreduce 2 memory configuration 2

The idea is to utilize as much of the RAM and CPU power of our PIs as possible.
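
As a rough sanity check of the numbers above (a sketch using the values from the tables, ignoring vcore limits):

# Rough container math for one PI node
NM_MB=768      # yarn.nodemanager.resource.memory-mb
MAP_MB=256     # mapreduce.map.memory.mb
AM_MB=128      # yarn.app.mapreduce.am.resource.mb
echo $(( NM_MB / MAP_MB ))            # map containers on a node without the App Master -> 3
echo $(( (NM_MB - AM_MB) / MAP_MB ))  # map containers on the node running the App Master -> 2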

Configure Single Node

Edit configuration files in /opt/hadoop/etc/hadoop

core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hdfs/tmp</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>5242880</value>
  </property>
</configuration>

Property | Default | Ours | Description
dfs.replication | 3 | 1 | By default Hadoop replicates data on 3 nodes. Since we only have one node we set it to 1 and later switch it to 3.
dfs.blocksize | 134217728 | 5242880 | Default block size for new files, in bytes. The default value is 128MB. In this tutorial we will use files smaller than that, so in order for them to scale on the Raspberry PI we set this property to 5MB instead. This means that files will be split into chunks of 5MB when distributed among the nodes, and each of our MapReduce tasks will work on 5MB at a time (see the fsck sketch below).

For a complete list of properties see hdfs-default.xml: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
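
Later, once HDFS is running and you have uploaded a file (for example the smallfile.txt used further down), you can verify how it was split into blocks (a sketch; use whatever path you uploaded):

hdfs fsck /smallfile.txt -files -blocks   # shows the number of 5MB blocks and their replication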

slaves

Currently we only have one node. Keep default value of localhost. We will later add all our nodes here for a multi-node setup.

localhost

yarn-site.xml

<configuration>
	<property>
		<name>yarn.resourcemanager.resource-tracker.address</name>
		<value>localhost:8025</value>
	</property>
	<property>
		<name>yarn.resourcemanager.scheduler.address</name>
		<value>localhost:8035</value>
	</property>
	<property>
		<name>yarn.resourcemanager.address</name>
		<value>localhost:8050</value>
	</property>
	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>
	<property>
		<name>yarn.nodemanager.resource.cpu-vcores</name>
		<value>4</value>
	</property>
	<property>
		<name>yarn.nodemanager.resource.memory-mb</name>
		<value>768</value>
	</property>
	<property>
		<name>yarn.scheduler.minimum-allocation-mb</name>
		<value>64</value>
	</property>
	<property>
		<name>yarn.scheduler.maximum-allocation-mb</name>
		<value>256</value>
	</property>
	<property>
		<name>yarn.scheduler.minimum-allocation-vcores</name>
		<value>1</value>
	</property>
	<property>
		<name>yarn.scheduler.maximum-allocation-vcores</name>
		<value>4</value>
	</property>
	<property>
		<name>yarn.nodemanager.vmem-check-enabled</name>
		<value>true</value>
	</property>
	<property>
		<name>yarn.nodemanager.pmem-check-enabled</name>
		<value>true</value>
	</property>
	<property>
		<name>yarn.nodemanager.vmem-pmem-ratio</name>
		<value>2.1</value>
	</property>
</configuration>

mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>256</value>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx204m</value>
    </property>
    <property>
        <name>mapreduce.map.cpu.vcores</name>
        <value>2</value>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>128</value>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx102m</value>
    </property>
    <property>
        <name>mapreduce.reduce.cpu.vcores</name>
        <value>2</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>128</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.command-opts</name>
        <value>-Xmx102m</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.resource.cpu-vcores</name>
        <value>1</value>
    </property>
    <property>
        <name>mapreduce.job.maps</name>
        <value>4</value>
    </property>
    <property>
        <name>mapreduce.job.reduces</name>
        <value>4</value>
    </property>
</configuration>

Format HDFS filesystem

sudo mkdir -p /hdfs/tmp
sudo chown hduser:hadoop /hdfs/tmp
sudo chmod 750 /hdfs/tmp
hdfs namenode -format

Start Hadoop

Run the following commands as hduser:

start-dfs.sh
start-yarn.sh

Verify that all services started correctly:

jps

Result should show the following active processes:
1413 NodeManager
838 NameNode
1451 Jps
1085 SecondaryNameNode
942 DataNode
1311 ResourceManager

If not all processes are visible, review the log files in /opt/hadoop/logs.
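
For example, if the NameNode is missing from the jps output, its log usually tells you why (a sketch; the exact file names contain the user and hostname):

tail -n 50 /opt/hadoop/logs/hadoop-hduser-namenode-*.log
tail -n 50 /opt/hadoop/logs/yarn-hduser-resourcemanager-*.log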

Run hadoop hello world – wordcount

Download sample files (from my previous blog post)

Run all commands as hduser. Extract the files to home directory of hduser.

tar -xvzf hadoop_sample_txtfiles.tar.gz -C ~/

Upload files to HDFS:

hadoop fs -put smallfile.txt /smallfile.txt
hadoop fs -put mediumfile.txt /mediumfile.txt

Execute wordcount example from source code folder (~/hadoop-2.7.2-src/hadoop-mapreduce-project/hadoop-mapreduce-examples/target/):

time hadoop jar hadoop-mapreduce-examples-2.7.2.jar wordcount /smallfile.txt /smallfile-out
time hadoop jar hadoop-mapreduce-examples-2.7.2.jar wordcount /mediumfile.txt /mediumfile-out
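
When a job has finished you can inspect its output directly from HDFS (a sketch; part-r-00000 is the standard name of the first reducer's output file):

hadoop fs -ls /smallfile-out
hadoop fs -cat /smallfile-out/part-r-00000 | head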

Result:

File | Size | Raspberry PI 1 | Raspberry PI 2 (this setup)
smallfile.txt | 2MB | 2:17 | 1:47
mediumfile.txt | 35MB | 9:19 | 5:50

When you execute the job you will also see an HTTP URL for tracking its progress:

lynx http://node1:8088/proxy/application_1463048023890_0002 (the URL will be unique to your execution; see the output when you start the job. You may use any browser; in this case I used Lynx.)

Task Tracker

When this snapshot was taken we had one running map task and two running reduce tasks, which looks good given our anticipated memory allocation outcome.
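
You can also follow running applications from the command line instead of the web UI (a sketch):

yarn application -list    # running applications and their tracking URLs
mapred job -list          # running MapReduce jobs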

Configure Multiple Nodes

Clear HDFS filesystem

rm -rf /hdfs/tmp/*
hdfs namenode -format

Edit configuration files

/etc/hostname

node1 (use node2 and node3 on the other nodes respectively)

/etc/hosts

192.168.0.110 node1
192.168.0.111 node2
192.168.0.112 node3

/etc/dhcpcd.conf

interface eth0
static ip_address=192.168.0.110/24
# static ip_address=192.168.0.111/24
# static ip_address=192.168.0.112/24
static routers=192.168.0.1
static domain_name_servers=123.123.123.123 123.123.123.123

/opt/hadoop/etc/hadoop/slaves

node1
node2
node3

/opt/hadoop/etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hdfs/tmp</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node1:54310</value>
  </property>
</configuration>

/opt/hadoop/etc/hadoop/hdfs-site.xml

By changing this to 3 we ensure that data is stored locally on each node when we add new files (assuming we have 3 nodes in our cluster).

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
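
Note that dfs.replication only affects files written after the change; anything already in HDFS keeps its old replication factor. Existing files can be adjusted manually (a sketch):

hdfs dfs -setrep -w 3 /    # re-replicate everything under / and wait for completion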

/opt/hadoop/etc/hadoop/yarn-site.xml

<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>node1:8025</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>node1:8035</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>node1:8050</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>

Clone SD-Card

Use Win32 Disk Imager or any other software of your choice to clone the SD card. After cloning, configure each node with its new hostname and IP address.

In particular, make sure /etc/hostname and /etc/dhcpcd.conf on each node contain that node's own hostname and IP address, as sketched below.
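
As a sketch of the per-node changes after cloning (hypothetical example for the node that should become node2):

echo "node2" | sudo tee /etc/hostname

# In /etc/dhcpcd.conf, swap which static ip_address line is commented out:
#   # static ip_address=192.168.0.110/24
#   static ip_address=192.168.0.111/24

sudo reboot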

Log in with ssh to all nodes to fix host key verification

Log in with ssh to all nodes from the namenode (node1) as hduser. Answer yes when prompted to confirm each host's authenticity. This is important so that the namenode can later connect to the other nodes without interactive prompts.
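
A quick way to both trust the host keys and confirm passwordless login is to loop over the nodes as hduser (a sketch):

for h in node1 node2 node3; do ssh $h hostname; done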

Start services

Run the following commands as hduser:

start-dfs.sh
start-yarn.sh

Run jps command on all nodes to ensure all services started fine. On the data/processing nodes you should have the following services:

hduser@node2:~ $ jps
1349 DataNode
1465 NodeManager
1582 Jps

And on the namenode (node1) you should have the following:

hduser@node1:/home/pi $ jps
3329 ResourceManager
2851 NameNode
3435 NodeManager
3115 SecondaryNameNode
3742 Jps
2959 DataNode
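
You can also ask HDFS directly whether all three datanodes have registered (a sketch):

hdfs dfsadmin -report    # should list three live datanodes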

Performance comparison to Raspberry PI 1 Model B (512mb)

Run wordcount as done in the previous step. As we can see, there is a significant difference using the newer Raspberry PI 2:

File | Size | Raspberry PI 1 | Raspberry PI 2 (this setup, 3 nodes)
smallfile.txt | 2MB | 2:17 | 1:41
mediumfile.txt | 35MB | 9:19 | 2:22
(larger file) | 431MB | N/A | 17:10

For fun I also tried a larger file to see if the PIs could keep up with it. They did fine, although some containers were killed due to memory issues; that could probably be tweaked further.

Happy data crunching! 🙂

References

http://www.instructables.com/id/Native-Hadoop-260-Build-on-Pi/

http://stackoverflow.com/questions/32823375/compiling-hadoop-2-7-1-on-arm-raspbian

http://www.becausewecangeek.com/building-a-raspberry-pi-hadoop-cluster-part-1/

https://www.mapr.com/blog/managing-monitoring-and-testing-mapreduce-jobs-managing-jobs-and-tasks

http://arturmkrtchyan.com/how-to-setup-multi-node-hadoop-2-yarn-cluster

21 thoughts on “Raspberry PI 2 Hadoop 2 Cluster”

  1. Jones

    Hi,

    Thanks for your tutorial. I would like to create a Pi Zero cluster (only for learning purposes). Do you know if the process is the same? Suppose I'm connected by wifi.

    Thanks,

    BJ

    1. admin Post author

      Hi Jones!

      I have not tried the Zero but should be the same process. You might need to tweak the memory configuration in Hadoop to make it run properly on 512mb though.

      //Jonas

  2. Phil

    I have 3 new RPi3, which I was trying to setup with Hadoop 2.7.2 using your old V1 install blog, and working out what to do at each step of the way for differences of V2.7.2. I was glad to see that you posted this recently, saved me some work once I saw it, thanks.
    My 3 RPi3 run the mediumfile.txt processing in around 1m30sec.
    I didn’t install the different Java, just used the one that was pre-installed in the Raspbian Jessie distribution.
    I also did not compile Hadoop myself, I just got the package from http://apache.mirrors.spacedump.net/hadoop/core/stable/hadoop-2.7.2.tar.gz
    Is there a special reason that you got the Oracle Java, and compiled Hadoop locally?
    I put instructions of how I set mine up (much of your steps here with whatever I did a bit different) and copies of various config files and short-cut scripts into a GitHub repo at https://github.com/phil-davis/HadoopRPi3
    Anyone feel free to make use of any of that how you wish.

  3. Festivity

    Thanks for the guide! I just got hadoop running on a 4 node odroid cluster with some help from your tutorial and a few other sources. I used the jar version of hadoop instead of compiling it from source myself, but I might change that in the future. My current benchmark seems to be ~11.5 minutes to perform wordcount on a 431 MB file (which I created by cat-ing mediumfile.txt onto itself over and over again).

    I ran into some problems with home network configurations. I have a home router, a second router I use as a switch, and the switch I used for the hadoop cluster. The hadoop cluster’s switch likes registering itself as 192.168.0.1, which conflicted with the main router I connect directly to my modem, and when I set my nodes to use static IPs, they tried to use the main router as a gateway and got the switch instead.

    I spent a fair deal of time struggling with how I set up static IPs for my cluster. I think at some point I added my hostname as an entry in /etc/hosts assigned to 127.0.0.1, and then added it again when I moved from single node to cluster operation, so a lot of my services weren’t actually listening to network traffic.

    I ran into fairly extensive problems with memory management. Since the Odroid has 2 GB of memory instead of 1 GB on the Pi, I tried doubling most of the maximum and overall memory allocations, but there was a relationship between vmem-pmem ratio and overall pmem that I missed somehow, and I had a bunch of problems with overallocating virtual memory.

    I still seem to be having at least intermittent problems, but I’m having more trouble reproducing my problems now. When I have those problems, they’re usually still related to overallocated virtual memory.

  4. Lladislas

    Thanks for the tutorial.

    However, I am getting stuck on the step:

    sudo mvn package -Pdist,native -DskipTests -Dtar

    I am using a Pi 3, connected through WiFi, and when I type this command into terminal, I get the error:

    sudo: mvn: Command not found

    So, after some research, I was able to get it to work using

    sudo env "PATH=$PATH" mvn package -Pdist,native -DskipTests -Dtar

    Then maven starts to do its magic, but when Hadoop is being built, the entire process terminates, because Hadoop HTTPFS received an error.

    I can send you an eMail with the Screenshots from Terminal, if you would be willing to help me. It would be a great gesture, and I will be sure to pay it forward as well

  5. AC

    I followed the guide and it does not work for me. I successfully compiled, deployed the package, all ssh connectivity are tested for 2 RPi, hosts and hostname also validated, however when I run “hadoop fs -ls /”, I encountered failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused.

    Anyone can help out please?

    1. jos

      I’m experiencing the same problem as AC. jps shows all services are running.
      What strikes me as strange is that netstat -a does not show any hadoop service listening

    2. admin Post author

      Thanks for the comments. I saw that I accidentally kept node1 as hostname in core-site.xml and yarn-site.xml for the single node setup. In the single node setup use localhost for hostname references in core-site.xml and yarn-site.xml. I have updated the tutorial.

      //Jonas

      1. zaphod

        Hmm, I had already corrected that, but still got the authorization problem. B.t.w. the formatting of the 2 xml-files you mention is now off, they are almost unreadable now. You may want to correct that. Any other ideas where the authorization problem stems from?

  6. Stan

    Thanks for this great guide. I’m learning a ton. I’m running on a B+ model jesse 8.0 lite.

    On the ./hadoop checknative -a I’m getting a false on snappy and bzip2.

    bzip2 is installed (version 1.0.6)

    Does it matter? Can it be fixed?

    Thanks again!

    1. Stan

      Check that – I successfully compiled on a 2B with 1 GB.

      Now I’m trying to compile on a B+ with 512MB.

      I’m getting a: Java.lang.OutOfMemoryError: Java heap space
      [INFO] Compiling 852 source files to /home/pi/hadoop-2.7.2-src/hadoop-common-project/hadoop-common/target/classes
      [INFO] Apache Hadoop Main …………………………… SUCCESS [ 19.520 s]
      [INFO] Apache Hadoop Project POM …………………….. SUCCESS [ 17.692 s]
      [INFO] Apache Hadoop Annotations …………………….. SUCCESS [ 43.565 s]
      [INFO] Apache Hadoop Assemblies ……………………… SUCCESS [ 3.967 s]
      [INFO] Apache Hadoop Project Dist POM ………………… SUCCESS [ 18.825 s]
      [INFO] Apache Hadoop Maven Plugins …………………… SUCCESS [ 46.943 s]
      [INFO] Apache Hadoop MiniKDC ………………………… SUCCESS [ 39.553 s]
      [INFO] Apache Hadoop Auth …………………………… SUCCESS [ 51.357 s]
      [INFO] Apache Hadoop Auth Examples …………………… SUCCESS [ 54.245 s]
      [INFO] Apache Hadoop Common …………………………. SKIPPED
      [INFO] Final Memory: 55M/106M
      I set export MAVEN_OPTS="-Xms512m -Xmx1024m", but it didn’t seem to make a difference. Not sure what the values should be anyway.

      Thanks,

  7. Thom

    Not sure what I did wrong here…

    pi@controller:~ $ su hduser
    Password:
    hduser@controller:/home/pi $ mkdir ~/.ssh
    hduser@controller:/home/pi $ ssh-keygen -t rsa -P ""
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
    key_save_private: No such file or directory
    Saving the key failed: cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys.
    hduser@controller:/home/pi $ ssh-keygen -t rsa -P ""
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/hduser/.ssh/id_rsa): SSH_Key
    key_save_private: Permission denied
    Saving the key failed: SSH_Key.

  8. Stan

    I think you created the .ssh in the /home/pi dir instead of the /home/hduser.

    When you do the su command, put a - between su and hduser, like this: su - hduser. Using the hyphen changes your dir to the home dir of the user and sets that user's env variables.

    Hope this helps!

    Stan

  9. Vince

    Great write-up! I really appreciate the details.

    However when I run the "sudo update-alternatives --config java", it shows that nothing is configured as shown below. I’d appreciate any help so that I will have the jdk-8 as you suggested.
    Thanks,
    Vince
    ============================================

    pi@RPI3-node1:~ $ sudo update-alternatives --config java
    There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/jdk-7-oracle-arm-vfp-hflt/jre/bin/java
    Nothing to configure.

  10. Dave

    having problems when it comes to httpFS they pom.xml dont exist. has anyone come across this yet?

    Downloading: https://repository.apache.org/content/repositories/snapshots/com/googlecode/json-simple/json-simple/1.1/json-simple-1.1.pom

    Downloading: http://repository.jboss.org/nexus/content/groups/public/com/googlecode/json-simple/json-simple/1.1/json-simple-1.1.pom

    Downloading: https://repo.maven.apache.org/maven2/com/googlecode/json-simple/json-simple/1.1/json-simple-1.1.pom

  11. Tony Harvey

    Hi

    I recently (5-Dec-2016) started building using this pretty good blog. I soon found the links problem as mentioned by Hukly, no problem, the github link worked great.

    But … during the “Install protobuf 2.5.0” phase,
    wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz

    tar xzvf protobuf-2.5.0.tar.gz
    cd protobuf-2.5.0
    ./configure --prefix=/usr
    make
    make check …….

    the make check produced the following error messages .. I’ve edited the output to just show the
    problems area

    does anyone know why we’re getting these errors ?

    Tony

    [———-] 14 tests from DatabaseSource/AllowUnknownDependenciesTest
    [ RUN ] DatabaseSource/AllowUnknownDependenciesTest.PlaceholderFile/0
    [ OK ] DatabaseSource/AllowUnknownDependenciesTest.PlaceholderFile/0 (1 ms)
    [ RUN ] DatabaseSource/AllowUnknownDependenciesTest.PlaceholderFile/1
    [ OK ] DatabaseSource/AllowUnknownDependenciesTest.PlaceholderFile/1 (1 ms)
    [ RUN ] DatabaseSource/AllowUnknownDependenciesTest.PlaceholderTypes/0
    [ OK ] DatabaseSource/AllowUnknownDependenciesTest.PlaceholderTypes/0 (1 ms)
    [ RUN ] DatabaseSource/AllowUnknownDependenciesTest.PlaceholderTypes/1
    [ OK ] DatabaseSource/AllowUnknownDependenciesTest.PlaceholderTypes/1 (1 ms)
    [ RUN ] DatabaseSource/AllowUnknownDependenciesTest.CopyTo/0
    [ OK ] DatabaseSource/AllowUnknownDependenciesTest.CopyTo/0 (1 ms)
    [ RUN ] DatabaseSource/AllowUnknownDependenciesTest.CopyTo/1
    [ OK ] DatabaseSource/AllowUnknownDependenciesTest.CopyTo/1 (1 ms)
    [ RUN ] DatabaseSource/AllowUnknownDependenciesTest.CustomOptions/0
    [ OK ] DatabaseSource/AllowUnknownDependenciesTest.CustomOptions/0 (2 ms)
    [ RUN ] DatabaseSource/AllowUnknownDependenciesTest.CustomOptions/1
    [ OK ] DatabaseSource/AllowUnknownDependenciesTest.CustomOptions/1 (1 ms)
    [ RUN ] DatabaseSource/AllowUnknownDependenciesTest.UnknownExtendee/0
    [ OK ] DatabaseSource/AllowUnknownDependenciesTest.UnknownExtendee/0 (1 ms)
    [ RUN ] DatabaseSource/AllowUnknownDependenciesTest.UnknownExtendee/1
    [ OK ] DatabaseSource/AllowUnknownDependenciesTest.UnknownExtendee/1 (2 ms)
    [ RUN ] DatabaseSource/AllowUnknownDependenciesTest.CustomOption/0
    [ OK ] DatabaseSource/AllowUnknownDependenciesTest.CustomOption/0 (2 ms)
    [ RUN ] DatabaseSource/AllowUnknownDependenciesTest.CustomOption/1
    [ OK ] DatabaseSource/AllowUnknownDependenciesTest.CustomOption/1 (2 ms)
    [ RUN ] DatabaseSource/AllowUnknownDependenciesTest.UndeclaredDependencyTriggersBuildOfDependency/0
    [libprotobuf ERROR google/protobuf/descriptor.cc:2547] Invalid proto descriptor for file "invalid_file_as_undeclared_dep.proto":
    [libprotobuf ERROR google/protobuf/descriptor.cc:2550] undeclared.Quux.quux: Field number 1 has already been used in "undeclared.Quux" by field "qux".
    [libprotobuf INFO google/protobuf/descriptor_unittest.cc:2107] syntax = "proto2";

    message Corge {
    optional .undeclared.Quux quux = 1;
    }

    [ OK ] DatabaseSource/AllowUnknownDependenciesTest.UndeclaredDependencyTriggersBuildOfDependency/0 (2 ms)
    [ RUN ] DatabaseSource/AllowUnknownDependenciesTest.UndeclaredDependencyTriggersBuildOfDependency/1
    [libprotobuf ERROR google/protobuf/descriptor.cc:2547] Invalid proto descriptor for file "invalid_file_as_undeclared_dep.proto":
    [libprotobuf ERROR google/protobuf/descriptor.cc:2550] undeclared.Quux.quux: Field number 1 has already been used in "undeclared.Quux" by field "qux".
    [libprotobuf INFO google/protobuf/descriptor_unittest.cc:2107] syntax = "proto2";

    message Corge {
    optional .undeclared.Quux quux = 1;
    }

    [libprotobuf ERROR google/protobuf/descriptor.cc:2547] Invalid proto descriptor for file "invalid_file_as_undeclared_dep.proto":
    [libprotobuf ERROR google/protobuf/descriptor.cc:2550] undeclared.Quux.quux: Field number 1 has already been used in "undeclared.Quux" by field "qux".
    [ OK ] DatabaseSource/AllowUnknownDependenciesTest.UndeclaredDependencyTriggersBuildOfDependency/1 (3 ms)
    [———-] 14 tests from DatabaseSource/AllowUnknownDependenciesTest (23 ms total)

    [———-] Global test environment tear-down
    [==========] 875 tests from 113 test cases ran. (11312 ms total)
    [ PASSED ] 875 tests.
    PASS: protobuf-test

    ==================
    All 5 tests passed
    ==================
    make[3]: Leaving directory ‘/home/pi/protobuf-2.5.0/src’
    make[2]: Leaving directory ‘/home/pi/protobuf-2.5.0/src’
    make[1]: Leaving directory ‘/home/pi/protobuf-2.5.0/src’
    pi@pi01:~/protobuf-2.5.0 $

  12. Andy Burgin

    Jonas, just wanted to give a massive thanks for publishing this article, I’ve used it a the basis of building a 5 node hadoop cluster running a range of hadoop components with a hue front end.

    hadoop 2.6.4
    hue 3.11.0
    hbase 1.2.4
    pig 0.12.1
    hive 1.2.1
    spark 1.6.2
    livy 0.2.0
    oozie 4.3.0
    sqoop 1.99.4
    solr 4.10.4
    impala – not supported

    Under laboratory conditions it runs everything I’ve tried (although there are some oozie workflows I don’t have dependencies to try out). I’ve done it as a learning exercise but thought you’d want to know about one of the “spin off” projects your work has inspired/contributed to. Here’s the links:

    Chef code and config – https://github.com/andyburgin/hadoopi
    Tutorial video with step-by-step instructions – https://www.youtube.com/watch?v=YtI9LkIJ7Hc
    Blog post – http://data.andyburgin.co.uk/post/157450047463/running-hue-on-a-raspberry-pi-hadoop-cluster

  13. Bryan Nieznajko

    Need to update line:
    wget http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.2/hadoop-2.7.2-src.tar.gz
    to:
    wget http://apache.claz.org/hadoop/common/hadoop-2.7.2/hadoop-2.7.2-src.tar.gz
    as the original throws errors:
    gzip: stdin: not in gzip format
    tar: Child returned status 1
    tar: Exiting with failure status due to previous errors

    running “file” on hadoop-2.7.2-src.tar.gz shows:
    hadoop-2.7.2-src.tar.gz: HTML document, ASCII text, with very long lines
