Monday, September 2, 2013

Hadoop monitoring and poor man's profiling

Hadoop Monitoring
I wrote a simple script that monitors a Hadoop cluster for changes in the number of nodes. If you run it from an external tool such as Jenkins, you can have error emails sent to you whenever the script exits with error code 1. You can also extend the script to send mail itself if you don't want to depend on an external tool. Because the hadoop dfsadmin -report command fails when the namenode is down, the script also alerts you when the namenode is unhappy. There are many other ways to monitor a cluster, such as Cloudera Manager, but we decided to build our own tools for the time being. You can extend the script to check whatever you like on the nodes. I check the space left in the tmp directory, because it often fills up when bad queries are executed on Hive.

Here's the gist:
https://gist.github.com/yash-ranadive/6418644
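The gist is the script we actually use. Separately, here is a minimal Python sketch of the same idea, not a copy of the gist. It assumes the report contains a summary line in one of two forms ("Datanodes available: N (...)" on older releases, "Live datanodes (N):" on newer ones), so check that against the output of your Hadoop version. The function names, the 0.9 threshold and the 120-second timeout are made up for illustration.

```python
import re
import shutil
import subprocess

# Assumed summary-line formats; verify them against your Hadoop version.
#   older releases: "Datanodes available: 3 (3 total, 0 dead)"
#   newer releases: "Live datanodes (3):"
_LIVE_PATTERNS = (
    re.compile(r"Datanodes available:\s*(\d+)"),
    re.compile(r"Live datanodes\s*\((\d+)\)"),
)


def count_live_datanodes(report_text):
    """Return the live-datanode count found in a dfsadmin report, or None."""
    for pattern in _LIVE_PATTERNS:
        match = pattern.search(report_text)
        if match:
            return int(match.group(1))
    return None


def tmp_too_full(path="/tmp", max_used_fraction=0.9):
    """True if the filesystem holding `path` is fuller than the threshold."""
    usage = shutil.disk_usage(path)
    return usage.total > 0 and usage.used / usage.total > max_used_fraction


def check_cluster(expected_nodes, command=("hadoop", "dfsadmin", "-report")):
    """Return 0 if the cluster looks healthy, 1 otherwise (for Jenkins)."""
    try:
        result = subprocess.run(
            list(command), capture_output=True, text=True, timeout=120
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"could not run {' '.join(command)}: {exc}")
        return 1
    if result.returncode != 0:
        # The report command fails when the namenode is down.
        print("dfsadmin -report failed; is the namenode up?")
        return 1
    live = count_live_datanodes(result.stdout)
    if live != expected_nodes:
        print(f"expected {expected_nodes} live datanodes, found {live}")
        return 1
    return 0
```

A Jenkins job could call sys.exit(check_cluster(expected_nodes=5)), with 5 replaced by the size of your cluster, and mail you when the build fails. tmp_too_full only looks at the local disk, so it would have to run on each node, for example over SSH.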

Poor Man's Profiling
There are many ways to profile a Hadoop cluster to see what is causing slowness. One technique is to take thread dumps of the Java process and see which methods show up most frequently. Take a few (10-15) thread dumps at random intervals. If you see the same methods on the stack over and over again, you can infer where your app is spending most of its time.
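Reading 10-15 dumps by eye gets tedious, so the counting can be scripted. The sketch below is one possible way to do it, not something from our setup. It assumes HotSpot-style dump text, where each thread entry starts with a quoted thread name, has a "java.lang.Thread.State:" line, and lists frames as "at package.Class.method(...)". It counts only the topmost frame of threads in the RUNNABLE state. The function names are hypothetical.

```python
import re
from collections import Counter

# Matches a stack frame line such as "\tat java.lang.Foo.bar(Foo.java:12)".
_FRAME = re.compile(r"^\s*at\s+([\w.$<>]+)\(")


def top_frames(dump_text):
    """Yield the top stack frame of each RUNNABLE thread in the dump text."""
    runnable = False
    for line in dump_text.splitlines():
        if line.startswith('"'):
            # A quoted thread name starts a new thread entry.
            runnable = False
        elif "java.lang.Thread.State: RUNNABLE" in line:
            runnable = True
        elif runnable:
            match = _FRAME.match(line)
            if match:
                yield match.group(1)
                runnable = False  # only the first (topmost) frame counts


def hottest_methods(dump_text, limit=10):
    """Count how often each method sits on top of a runnable stack."""
    return Counter(top_frames(dump_text)).most_common(limit)
```

You can pass it the whole log file containing several dumps at once. Keep in mind that threads blocked in native I/O calls are also reported as RUNNABLE, so the top entries still need a sanity check by a human.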

To find what's

We did one such exercise when we saw our Hive queries getting stuck in the last reduce phase. A single reducer is almost always the bottleneck. To take the dumps:
1. Navigate to the ResourceManager (YARN) web UI and click on the query you are running.
2. Click on the ApplicationMaster link and then on the Map/Reduce links.
3. More likely than not, it is the reduce phase of the MapReduce job generated by Hive that runs slowly. Click on the reduce tasks and find the server on which a container has been running the longest.
4. SSH into that box and run the following several times at irregular intervals:

killall -QUIT java 

The QUIT signal makes each Java process write a thread dump to its STDOUT, which ends up in the logs. You can navigate to the logs for that box from the ResourceManager and look at the dumps. We found that the compression method of the ZLIB compression class appeared most frequently in the dumps. Again, this is poor man's profiling, and there are better ways to profile a Java process. But it worked for us, and we got a performance increase after switching to LZO.
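If you don't want to sit at the terminal re-running the command, the sampling can be scripted too. This is a rough sketch that sends the same killall -QUIT java at random intervals. The function names, the default of 10 samples and the 5 to 60 second range are my own choices for illustration. Note that killall signals every Java process on the box that you are allowed to signal, not just the slow container.

```python
import random
import subprocess
import time


def irregular_delays(count, min_seconds=5.0, max_seconds=60.0, rng=None):
    """Return `count` random delays so samples don't line up with periodic work."""
    rng = rng if rng is not None else random.Random()
    return [rng.uniform(min_seconds, max_seconds) for _ in range(count)]


def take_thread_dumps(count=10, command=("killall", "-QUIT", "java"),
                      delays=None, sleep=time.sleep, run=subprocess.run):
    """Run the dump command once per delay; return how many runs succeeded.

    If `delays` is given it is used as-is and `count` is ignored.
    """
    if delays is None:
        delays = irregular_delays(count)
    succeeded = 0
    for delay in delays:
        sleep(delay)
        # SIGQUIT asks a HotSpot JVM to print a thread dump to its stdout;
        # the process keeps running afterwards.
        if run(list(command)).returncode == 0:
            succeeded += 1
    return succeeded
```

Calling take_thread_dumps(count=12) on the box takes twelve samples over a few minutes. The sleep and run parameters are there only so the loop can be tried out without actually signalling anything.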
