Amazon DynamoDB, EMR and Hive notes

First you need the EMR cluster running and you should have ssh connection to the master instance like described in the getting started tutorial.

Now it is possible to run Hive commands in few following ways:

  • Connect via ssh, launch hive and run commands interactively
  • Create a script file with commands, upload it to S3 and launch as a ERM 'Hive program' step
  • Run it from Hue web-interface (see below)

Connect to Hue

Hue is a Hadoop web interface. It is automatically installed when EMR cluster is launched. The recommended way to connect to Hue is via ssh tunnel, see also.

But there is a simpler way (but less secure) - open the 8888 port on the master EMR instance:

Analyze DynamoDB data with Hive

To analyze the DynamoDB data there are following options:

  • Create the external Hive table pointing to DynamoDB table and make queries against it (slow and consumes DynamoDB resources)
  • Export data from dynamo to the native Hive table then query this data off-line
  • Export data from dynamo to S3 and then query it, in this case queries are a bit slower than for a native Hive table, but data persists on S3, so it is possible to terminate the cluster and then launch it again when necessary

Example of the script to move data from dynamo to hive native table:

-- Here you can drop/create an external table at any time - this will not affect real data
CREATE EXTERNAL TABLE dynamo_table (hash string, range bigint, data string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "mytable",
"dynamodb.column.mapping" = "hash:hash,range:range,data:data");

CREATE TABLE hive_table (hash string, range bigint, data string);

SET dynamodb.throughput.read.percent=0.9;

INSERT OVERWRITE TABLE hive_table SELECT * FROM dynamo_table;

Example of the script to move data from dynamo to S3:

CREATE EXTERNAL TABLE dynamo_table (hash string, range bigint, data string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "mytable",
"dynamodb.column.mapping" = "hash:hash,range:range,data:data");

CREATE EXTERNAL TABLE s3_table(hash string, range bigint, data string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket-name/2015-01-24-dynamo-table-data/';

INSERT OVERWRITE TABLE s3_table SELECT * FROM dynamo_table;

Hive - timestamp to date conversions

Examples of some queries to convert the timestamp to date string:

-- Convert timestamp to date in UTC+2 timezone
select hash, stamp, from_unixtime(stamp, 'y-M-d H:m:sZ+0200') from hive_table limit 10;

-- Select data by date string
select hash, stamp from hive_table where stamp > unix_timestamp('2014-12-25 10:18:41+0200') limit 10;

Execute python script from Hive script

It is possible to transform data in Hive using external python script (see here and here). And using a fake table it is possible to run the python script from the hive script:

--
-- Run python script to create test tables on dynamo.
-- Script should be uploaded to tapway-scripts/ on s3
--
CREATE TABLE IF NOT EXISTS empty_src (id string);
add file s3://tapway-scripts/hive.copy.test.py;
select transform (id) using 'hive.copy.test.py' from empty_src;

Hive Language Manual

Hive Operators and User-Defined Functions

DynamoDB Guide: Hive Command Examples for Exporting, Importing, and Querying Data in DynamoDB

EMR Guide: Export, Import, Query, and Join Tables in DynamoDB Using Amazon EMR

Optimizing Performance for Amazon EMR Operations in DynamoDB

Using DynamoDB with Amazon Elastic MapReduce

Hive & DynamoDB Pitfalls

Stackoverflow: Amazon Elastic MapReduce - mass insert from S3 to DynamoDB is incredibly slow

Amazon DynamoDB, Apache Hive and Leaky Abstractions

Amazon AWS: Hive, EMR and DynamoDb

Exploring Dynamo DB