Friday, October 10, 2014

Convert ISO-8601 time to UTC time in Hive

In order to convert an ISO-8601 datetime string, e.g. "2013-06-10T12:31:00+0700", into UTC time "2013-06-10T05:31:00Z", you can do the following:

select from_unixtime(iso8601_to_unix_timestamp('2013-06-10T12:31:00+0700'), 'yyyy-MM-dd-HH-mm-ss') from your_table limit 1;

For this to work you will need Simply Measured's Hive UDFs, and you will need to add the following jars:

hive> ADD JAR hdfs:///external-jars/commons-codec-1.9.jar;
hive> ADD JAR hdfs:///external-jars/joda-time-2.2.jar;
hive> ADD JAR hdfs:///external-jars/sm-hive-udf-1.0-SNAPSHOT.jar;

hive> select from_unixtime(iso8601_to_unix_timestamp('2013-06-10T12:31:00+0700'), 'yyyy-MM-dd-HH-mm-ss') from your_table limit 1;
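Outside Hive, you can sanity-check the same conversion with Ruby's standard time library:

```ruby
require 'time'

# Parse an ISO-8601 string with a numeric offset and print it in UTC.
t = Time.parse('2013-06-10T12:31:00+0700')
puts t.utc.iso8601  # => 2013-06-10T05:31:00Z
```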

Tuesday, February 25, 2014

Ruby Read Large files from the Network and write to File

If you are writing a large file to disk the traditional way (open-uri), you will notice that memory usage spikes just before the file is written: open-uri buffers the entire response in memory first.

The workaround is to use the Net::HTTP.start method and write chunks to disk as they are received:

require 'net/http'

Net::HTTP.start(end_point, use_ssl: true) do |http|
  http.request_get(resource) do |response|
    open(filename(date), 'wb') do |io|
      response.read_body do |chunk|
        io.write chunk
      end
    end
  end
end

Tuesday, February 18, 2014


I'm looking for a good hosted Markdown blogging solution. I've written my first blog post using Svbtle.

It was fairly easy to write, but the Svbtle interface takes some getting used to. Overall it looks good. Looking forward to using more of it and documenting my experiences.

Thursday, February 13, 2014

Don't put your passwords on the command line

Consider this shell command

$ mysql -u username -ppassword

Passwords on the command line are a really BAD idea. Here's why:

1. They are visible to anyone on the machine in the process list (just run ps)
2. They are saved in your shell's command history (just run history)

Remember, don't commit your passwords to version control systems like Git either; Git servers like GitHub are often published to a wider audience within an organization. Always use external configuration files or a configuration framework such as Configatron to supply passwords, usernames, keys, etc.
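As a sketch of the external-config approach in Ruby (the file name and keys here are illustrative, not a specific framework's API):

```ruby
require 'yaml'

# Load credentials from a YAML file that is kept out of version control
# (e.g. listed in .gitignore) instead of hard-coding them or passing
# them on the command line.
def load_db_config(path)
  YAML.safe_load(File.read(path))
end

# config/database.yml (gitignored) might contain:
#   username: app_user
#   password: s3cret
```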

Tuesday, February 11, 2014

Code Kata, Simple implementation of Bloom Filters

We are doing code kata at work this month. It is pretty exciting, as you spend around 30 minutes every day learning a new technique or stretching your coding abilities. I tried the Code Kata website for some fun exercises.

One thing that I've been particularly interested in over the last year or so is Bloom filters. Joins are so expensive! Bloom filters are simply amazing: they can tell you quickly that a value is definitely NOT in a particular set.

Here's my implementation of Bloom filters in Ruby. It is not perfect and could use a Ruby BitSet implementation to save some memory. It's also not tested very thoroughly, but you get the idea.


  class BloomFilter
    def initialize(bitmap_size)
      @bitmap = Array.new(bitmap_size, 0)
      @bitmap_max_size = bitmap_size
    end

    def hash_function_1(some_object)
      raw_val = some_object.inspect.each_byte.inject(0) do |sum, c|
        sum + c
      end
      raw_val % @bitmap_max_size
    end

    def hash_function_2(some_object)
      raw_val = some_object.inspect.each_byte.inject(0) do |sum, c|
        sum + c
      end
      (raw_val + raw_val.to_s.length**3) % @bitmap_max_size
    end

    def hash_function_3(some_object)
      raw_val = some_object.inspect.each_byte.inject(0) do |sum, c|
        sum + c
      end
      (raw_val + raw_val.to_s.split().last.to_i**8) % @bitmap_max_size
    end

    def put(put_object)
      @bitmap[hash_function_1(put_object)] = 1
      @bitmap[hash_function_2(put_object)] = 1
      @bitmap[hash_function_3(put_object)] = 1
    end

    def exists(put_object)
      @bitmap[hash_function_1(put_object)] == 1 &&
        @bitmap[hash_function_2(put_object)] == 1 &&
        @bitmap[hash_function_3(put_object)] == 1
    end
  end

a = BloomFilter.new(1000) # bitmap size chosen for the example
(1..1000).each {|x| a.put("test#{x}")}
(1..1000).each {|x| puts "#{x}" unless a.exists("test#{x}")}

This was quickly hacked, so let me know your comments on how this can be improved.
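One way to reason about sizing: the textbook false-positive estimate for a Bloom filter with m bits, k hash functions and n inserted items is (1 - e^(-kn/m))^k. A quick Ruby check with made-up parameters:

```ruby
# Approximate false-positive probability of a Bloom filter:
# p ≈ (1 - e^(-k*n/m))^k for m bits, k hash functions, n inserted items.
def bloom_fp_rate(m_bits, k_hashes, n_items)
  (1 - Math.exp(-k_hashes.to_f * n_items / m_bits))**k_hashes
end

puts bloom_fp_rate(10_000, 3, 1_000)  # roughly 0.017 with these numbers
```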

Wednesday, November 6, 2013

Presto - Facebook's Data Crunching Monster

I came to know about Facebook Presto for the first time a few months back at the "Analytics at Web Scale" conference at Facebook. Today they open-sourced Presto.

Really excited to see how this changes the big data landscape as analysts get hungrier for data and demand faster query execution.

Thursday, October 24, 2013

Copy gems from one server to another

First copy the gem list from the source box:
ssh account@sourcegembox 'gem list' > /tmp/gem-list

Now install the gems (note that this installs the latest version of each gem, since the version column is cut away):
cat /tmp/gem-list | cut -d " " -f 1 | xargs sudo gem install

Wednesday, October 16, 2013

Download Sqoop 2 from cloudera using apt-get install

Today I had problems downloading the Sqoop 2 server and client using apt-get install. The problem was that apt wasn't able to find the correct package. I tried to manually set up the .list file in the /etc/apt/sources.list.d directory according to the Cloudera documentation.

But that did not work either. Finally I was able to get it running by downloading and installing Cloudera's repository Debian package; for lucid systems:

sudo dpkg -i cdh4-repository_1.0_all.deb

Now try

sudo apt-get install sqoop2-server
sudo apt-get install sqoop2-client

Thursday, September 5, 2013

Some links for Hadoop Performance Tuning

Here's a list of links for Hadoop/Hive tuning techniques. This list was compiled by @OngEmil.

Monday, September 2, 2013

Hadoop monitoring and poor man's profiling

Hadoop Monitoring
I created a simple script that monitors the Hadoop cluster for changes in the number of nodes. If you run it with an external tool such as Jenkins, you can have error emails sent to you whenever the script exits with error code 1; you can also extend the script to do the mailing itself if you don't want to use an external tool. Since the hadoop dfsadmin -report command fails when the namenode is down, this script also alerts you when the namenode is unhappy. There are other ways to monitor your cluster, such as Cloudera Manager, but we decided to create our own tools for the time being. You can also extend the script to check for whatever you like on the nodes; I'm checking tmp directory space, as it often fills up the cluster when bad queries are executed on Hive.

Here's the gist:
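The idea can be sketched roughly as follows (this is not the original gist; the dfsadmin report format and the expected node count are assumptions):

```ruby
# Rough sketch of the node-count check, not the original gist.
# Parses the output of `hadoop dfsadmin -report`; the exact line
# ("Datanodes available: N (...)") varies by Hadoop version.
def live_datanodes(report)
  report[/Datanodes available:\s*(\d+)/, 1].to_i
end

# A Jenkins/cron driver would then do something like:
#   report = `hadoop dfsadmin -report`
#   exit 1 unless $?.success?                           # namenode down
#   exit 1 if live_datanodes(report) != EXPECTED_NODES  # node count changed
```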

Poor Man's Profiling
There are many ways to profile your Hadoop cluster to see what is causing slowness. One technique is to take dumps of the Java threads and see which methods appear most frequently. Take a few (10-15) thread dumps at random intervals; if you see the same methods being called over and over again, you can infer where your app is spending most of its time.


We did one such exercise when we saw our Hive queries were stuck in the last reduce phase. The single reducer is almost always the bottleneck. To take the dumps:
1. Navigate to the resource manager (YARN) and click on the query you are running.
2. Click on the application master link, then on the Map/Reduce links.
3. More likely than not it is the reduce phase of the Hive-generated MapReduce job that runs slow. Click on the reduce and find the server on which the container has been running the longest.
4. SSH into the box and run the following multiple times at irregular intervals:

killall -QUIT java 

This will dump the thread stacks to stdout, which ends up in the container logs. You can navigate to the logs for that box in the resource manager and look at the dumps. We realized that the compress method of the ZLIB compression class appeared most frequently in the dumps. Again, this is poor man's profiling and there are better ways to profile a Java process, but it worked for us: we got a performance increase after switching to LZO.
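The "same methods over and over" step can be automated. A small sketch that tallies stack frames across a saved dump file (the frame format is assumed to be the standard `at package.Class.method(...)` lines):

```ruby
# Count the most frequent "at ..." stack frames across thread dumps;
# the hottest frames suggest where the JVM is spending its time.
def top_frames(dump_text, n = 10)
  dump_text.scan(/^\s*at\s+(\S+)/)
           .flatten
           .tally                          # Ruby 2.7+
           .sort_by { |_frame, count| -count }
           .first(n)
end
```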