2013-06-11

Amazon NoSQL Solutions

Amazon provides the following NoSQL storage options:

- Amazon DynamoDB
- Amazon SimpleDB

Amazon's review of big data solutions

Amazon's review, Big Data on AWS, mentions only DynamoDB and Elastic MapReduce (based on Hadoop) as tools for big data management. One of the major DynamoDB benefits highlighted is that it uses solid state drives, but this option is also available for other technologies:

Solid state, at your service:
NoSQL data stores benefit greatly from the speed of solid state drives.
DynamoDB uses them by default, but if you are using alternatives from the AWS
Marketplace, such as Cassandra or MongoDB, accelerate your access with
on-demand access to terabytes of solid state storage, with the High I/O
instance class. Learn more about the options with EC2 instance types (http://aws.amazon.com/ec2/instance-types).

[Elastic Map Reduce](http://aws.amazon.com/elasticmapreduce/) is a computing service which can be used in conjunction with storage services to perform operations on large datasets (indexing, data mining, log file analysis, etc.). It enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It runs on Amazon Elastic Compute Cloud (EC2), which delivers scalable, pay-as-you-go compute capacity in the cloud.

[Hadoop](http://hadoop.apache.org/): The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Q: When would I use Amazon RDS vs. Amazon EC2 Relational Database AMIs vs. Amazon SimpleDB vs. Amazon DynamoDB? (http://aws.amazon.com/rds/faqs/#4)

Amazon Web Services provides a number of
database alternatives for developers. Amazon RDS enables you to run a fully
featured relational database while offloading database administration; Amazon
SimpleDB provides simple index and query capabilities with seamless
scalability; Amazon DynamoDB is a fully managed NoSQL database service that
offers fast and predictable performance with seamless scalability; and using
one of our many relational database AMIs on Amazon EC2 and Amazon EBS allows
you to operate your own relational database in the cloud. There are important
differences between these alternatives that may make one more appropriate for
your use case. See Running Databases on AWS for guidance on which solution is
best for you.

Amazon DynamoDB

General Info

Get Started with Amazon DynamoDB:

Amazon DynamoDB automatically spreads the data and traffic for the table over a
sufficient number of servers to handle the request capacity specified by the
customer and the amount of data stored, while maintaining consistent and fast
performance. All data items are stored on Solid State Disks (SSDs) and are
automatically replicated across multiple Availability Zones in a Region to
provide built-in high availability and data durability.

Read / write throughput limits:

Provisioned Throughput - When you create or update a table, you specify how
much provisioned throughput capacity you want to reserve for reads and writes.
Amazon DynamoDB will reserve the necessary machine resources to meet your
throughput needs while ensuring consistent, low-latency performance.  If your
application requirements change, simply update your table throughput capacity
using the AWS Management Console or the Amazon DynamoDB APIs. You are still
able to achieve your prior throughput levels while scaling is underway.
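For illustration, a throughput update could look like this; a minimal sketch using the Python boto3 SDK, assuming a hypothetical table named "events" (capacity values are illustrative):

```python
import boto3

client = boto3.client('dynamodb', region_name='us-east-1')

# reserve new read/write capacity; the table stays available
# while DynamoDB rebalances to the new throughput
client.update_table(
    TableName='events',
    ProvisionedThroughput={
        'ReadCapacityUnits': 200,
        'WriteCapacityUnits': 100,
    },
)
```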

When you create a table, you must provide a table name, its primary key and
your required read and write throughput values. Except for the required primary
key, an Amazon DynamoDB table is schema-less. Individual items in an Amazon
DynamoDB table can have any number of attributes, although there is a limit of
64 KB on the item size. A unit of read capacity represents one strongly
consistent read per second (or two eventually consistent reads per second) for
items as large as 4 KB. A unit of write capacity represents one write per
second for items as large as 1 KB. So the required read capacity units = (item
reads per second) × (item size rounded up to a multiple of 4 KB) / 4 KB, and
eventually consistent reads give you twice as many reads per second for the
same capacity; the required write capacity units = (item writes per second) ×
(item size rounded up to a multiple of 1 KB) / 1 KB.
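A quick worked example of that arithmetic (a sketch; the rates and item sizes are made up):

```python
import math

def read_capacity_units(reads_per_sec, item_size_kb, eventually_consistent=False):
    # one unit = one strongly consistent read/sec of up to 4 KB;
    # bigger items cost one unit per started 4 KB block
    units = reads_per_sec * math.ceil(item_size_kb / 4.0)
    return units / 2.0 if eventually_consistent else units

def write_capacity_units(writes_per_sec, item_size_kb):
    # one unit = one write/sec of up to 1 KB
    return writes_per_sec * math.ceil(item_size_kb / 1.0)

print(read_capacity_units(500, 3))        # 500.0 units (3 KB rounds up to 4 KB)
print(read_capacity_units(500, 3, True))  # 250.0 units with eventual consistency
print(write_capacity_units(100, 3))       # 300.0 units (three 1 KB blocks)
```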

Amazon DynamoDB supports the following two types of primary keys:

- Hash primary key: the primary key is a single hash attribute.
- Hash-and-range primary key: the primary key is a combination of a hash attribute and a range attribute.

Local Secondary Indexes:

When you create a table with a hash-and-range key, you
can optionally define one or more local secondary indexes on that table. A
local secondary index lets you query the data in the table using an alternate
range key, in addition to queries against the primary key.
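As a sketch of how this looks at table-creation time with boto3 (the "orders" table and "ship-date-index" names are hypothetical, matching the item-collection example below):

```python
import boto3

client = boto3.client('dynamodb', region_name='us-east-1')

# hash-and-range primary key (customer_id, order_ts) plus a local secondary
# index on (customer_id, ship_date); every key attribute must be declared
client.create_table(
    TableName='orders',
    AttributeDefinitions=[
        {'AttributeName': 'customer_id', 'AttributeType': 'S'},
        {'AttributeName': 'order_ts', 'AttributeType': 'N'},
        {'AttributeName': 'ship_date', 'AttributeType': 'S'},
    ],
    KeySchema=[
        {'AttributeName': 'customer_id', 'KeyType': 'HASH'},
        {'AttributeName': 'order_ts', 'KeyType': 'RANGE'},
    ],
    LocalSecondaryIndexes=[{
        'IndexName': 'ship-date-index',
        'KeySchema': [
            {'AttributeName': 'customer_id', 'KeyType': 'HASH'},
            {'AttributeName': 'ship_date', 'KeyType': 'RANGE'},
        ],
        'Projection': {'ProjectionType': 'ALL'},
    }],
    ProvisionedThroughput={'ReadCapacityUnits': 10, 'WriteCapacityUnits': 5},
)
```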

Amazon DynamoDB Data Types

Amazon DynamoDB supports the following data types:

- Scalar types: String, Number, Binary.
- Multi-valued set types: String Set, Number Set, Binary Set.

API / SDK:

Amazon DynamoDB is a web service that uses HTTP and HTTPS as a transport and
JavaScript Object Notation (JSON) as a message serialization format. Your
application code can make requests directly to the Amazon DynamoDB web
service; see the API Reference (http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/Welcome.html). There is also a PHP library as part of the Amazon PHP SDK (https://github.com/aws/aws-sdk-php).
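A minimal boto3 sketch of talking to that API (the "events" table and its attribute names are hypothetical; the SDK does the JSON serialization and HTTPS signing for you):

```python
import boto3

client = boto3.client('dynamodb', region_name='us-east-1')

# attribute values carry type descriptors ('S' string, 'N' number, ...)
# exactly as they appear in the JSON wire format
client.put_item(
    TableName='events',
    Item={
        'user_id':   {'S': 'user-1'},
        'timestamp': {'N': '1370908800'},
        'type':      {'S': 'click'},
    },
)

item = client.get_item(
    TableName='events',
    Key={'user_id': {'S': 'user-1'}, 'timestamp': {'N': '1370908800'}},
)['Item']
```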

DynamoDB as events storage and limitations

For example, we have a table to store events data with the following fields:

And we have a large amount of events for which we need to perform the following operations:

DynamoDB limitations:

Additional indexes are needed to perform data filtering (see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForLSI.html), but having many indexes is not recommended because they consume storage and provisioned throughput and make table operations slower.

For tables with local secondary indexes there is a size limit of 10 GB for data with the same hash key (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/LSI.html).

The maximum size of any item collection is 10 GB. This limit does not apply to
tables without local secondary indexes; only tables that have one or more local
secondary indexes are affected.
We recommend as a best practice that you instrument your application to monitor
the sizes of your item collections. One way to do so is to set the
ReturnItemCollectionMetrics parameter to SIZE whenever you use BatchWriteItem,
DeleteItem, PutItem or UpdateItem. Your application should examine the
ReturnItemCollectionMetrics object in the output and log an error message
whenever an item collection exceeds a user-defined limit (8 GB, for example).
Setting a limit that is less than 10 GB would provide an early warning system
so you know that an item collection is approaching the limit in time to do
something about it.
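A boto3 sketch of that monitoring advice (the "orders" table, the item, and the 8 GB threshold are illustrative):

```python
import boto3

client = boto3.client('dynamodb', region_name='us-east-1')

WARN_GB = 8  # early-warning threshold below the 10 GB hard limit

resp = client.put_item(
    TableName='orders',
    Item={
        'customer_id': {'S': 'X'},
        'order_ts': {'N': '1370908800'},
        'ship_date': {'S': '2013-06-10'},
    },
    ReturnItemCollectionMetrics='SIZE',
)

# metrics are only returned for tables that have local secondary indexes
metrics = resp.get('ItemCollectionMetrics')
if metrics:
    # DynamoDB reports a lower/upper estimate of the collection size in GB
    _, upper_gb = metrics['SizeEstimateRangeGB']
    if upper_gb > WARN_GB:
        print('item collection %s is approaching the 10 GB limit'
              % metrics['ItemCollectionKey'])
```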

An item collection is any group of items that have the same hash key, across a
table and all of its local secondary indexes. For instance, consider an
e-commerce application that stores customer order data in a DynamoDB table with
a hash-range schema of customer id / order timestamp. Without a local secondary
index, to answer the question “Display all orders made by Customer X with
shipping date in the past 30 days, sorted by shipping date”, you would have to
use the Query API to retrieve all the objects under the hash key “X”, sort the
results by shipment date and then filter out older records.
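With the local secondary index in place, that question becomes a single server-side query; a hedged boto3 sketch, reusing the hypothetical orders table and index from above:

```python
import boto3

client = boto3.client('dynamodb', region_name='us-east-1')

# the index lets DynamoDB filter and sort by ship_date on the server
resp = client.query(
    TableName='orders',
    IndexName='ship-date-index',
    KeyConditionExpression='customer_id = :cid AND ship_date >= :since',
    ExpressionAttributeValues={
        ':cid': {'S': 'X'},
        ':since': {'S': '2013-05-12'},  # "past 30 days", illustrative
    },
)
for item in resp['Items']:
    print(item)
```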

It is not possible to add secondary indexes to an existing table.

Existing indexes also cannot be changed or deleted.

Query results sorting:

Query results are always sorted by the range key. If the data type of the range
key is Number, the results are returned in numeric order; otherwise, the
results are returned in order of ASCII character code values. By default, the
sort order is ascending. To reverse the order, set the ScanIndexForward
parameter to false.
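For example, to return the newest items first (a boto3 sketch against the hypothetical events table, where the timestamp is the range key):

```python
import boto3

client = boto3.client('dynamodb', region_name='us-east-1')

# newest-first: range key descending instead of the default ascending
resp = client.query(
    TableName='events',
    KeyConditionExpression='user_id = :uid',
    ExpressionAttributeValues={':uid': {'S': 'user-1'}},
    ScanIndexForward=False,
)
```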

Get records count:

In a request, set the Count parameter to true if you want Amazon DynamoDB to
provide the total number of items that match the scan filter or query
condition, instead of a list of the matching items. In a response, Amazon
DynamoDB returns a Count value for the number of matching items in a request.
If the matching items for a scan filter or query condition exceed 1 MB, Count
contains a partial count of the total number of items that match the request.
To get the full count of items that match a request, use the LastEvaluatedKey
in a subsequent request, and repeat until Amazon DynamoDB no longer returns a
LastEvaluatedKey.
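A boto3 sketch of that counting loop (Select='COUNT' is boto3's spelling of the Count parameter; table and attribute names are hypothetical):

```python
import boto3

client = boto3.client('dynamodb', region_name='us-east-1')

# accumulate partial counts, paging with LastEvaluatedKey
total = 0
kwargs = {
    'TableName': 'events',
    'KeyConditionExpression': 'user_id = :uid',
    'ExpressionAttributeValues': {':uid': {'S': 'user-1'}},
    'Select': 'COUNT',
}
while True:
    resp = client.query(**kwargs)
    total += resp['Count']
    if 'LastEvaluatedKey' not in resp:
        break
    kwargs['ExclusiveStartKey'] = resp['LastEvaluatedKey']
print(total)
```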

Single operation size limit:

A single operation can retrieve up to 1 MB of data, which can comprise
as many as 100 items. BatchGetItem will return a partial result if the response
size limit is exceeded, the table's provisioned throughput is exceeded, or an
internal processing failure occurs. If a partial result is returned, the
operation returns a value for UnprocessedKeys. You can use this value to retry
the operation starting with the next item to get.  For example, if you ask to
retrieve 100 items, but each individual item is 50 KB in size, the system
returns 20 items (1 MB) and an appropriate UnprocessedKeys value so you can get
the next page of results. If desired, your application can include its own
logic to assemble the pages of results into one dataset.
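A boto3 sketch of that retry loop (hypothetical keys; production code would also back off exponentially between retries):

```python
import boto3

client = boto3.client('dynamodb', region_name='us-east-1')

keys = [
    {'user_id': {'S': 'user-1'}, 'timestamp': {'N': str(1370908800 + i)}}
    for i in range(100)
]

items = []
request = {'events': {'Keys': keys}}
while request:
    resp = client.batch_get_item(RequestItems=request)
    items.extend(resp['Responses'].get('events', []))
    # anything not returned (1 MB cap, throttling) comes back in
    # UnprocessedKeys in exactly the shape needed for the next call
    request = resp.get('UnprocessedKeys') or None
print(len(items))
```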

Export: data export can be implemented as an off-line operation using Elastic MapReduce. See Elastic MapReduce use cases:

To perform advanced queries on the data, it is also possible to copy it to Amazon Redshift:

Amazon Redshift complements Amazon DynamoDB with advanced business intelligence
capabilities and a powerful SQL-based interface. When you copy data from an
Amazon DynamoDB table into Amazon Redshift, you can perform complex data
analysis queries on that data, including joins with other tables in your Amazon
Redshift cluster.
In terms of provisioned throughput, a copy operation from an Amazon DynamoDB
table counts against that table's read capacity. After the data is copied, your
SQL queries in Amazon Redshift do not affect Amazon DynamoDB in any way. This
is because your queries act upon a copy of the data from DynamoDB, rather than
upon DynamoDB itself.
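A hedged sketch of such a copy, issuing Redshift's COPY ... FROM 'dynamodb://...' statement through Python's psycopg2 (the cluster endpoint, credentials, and READRATIO value are placeholders):

```python
import psycopg2

conn = psycopg2.connect(host='my-cluster.example.redshift.amazonaws.com',
                        port=5439, dbname='analytics',
                        user='admin', password='...')
cur = conn.cursor()

# READRATIO caps how much of the DynamoDB table's provisioned read
# capacity the copy operation may consume (here 50%)
cur.execute("""
    COPY events
    FROM 'dynamodb://events'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    READRATIO 50;
""")
conn.commit()
```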

Cost

DynamoDB pricing page and price calculator.

Assumptions (for the events table described above):

Parameters for cost calculation:

Cost (according to the price calculator) is about $6500 per month (for a single user's data).

Notes:

Other DynamoDB resources

Amazon Dynamo: The Next Generation Of Virtual Distributed Storage

Amazon DynamoDB – a Fast and Scalable NoSQL Database Service Designed for Internet Scale Applications

[Why My Team Went with DynamoDB Over MongoDB](http://slashdot.org/topic/bi/why-my-team-went-with-dynamodb-over-mongodb/) - limitations and possible solutions: a custom index (maybe at that time secondary indexes were not present); data as JSON + compression (another way is to store data as a file in S3).

DynamoDB shortcomings (and our workarounds) - they decided not to use DynamoDB for events ("SQL was a better tool in this case, so we decided not to use DynamoDB at all for storing events") and built a cache between DynamoDB and their app.

DYNAMODB IS AWESOME, BUT… - about limitations.

Amazon DynamoDB - about provisioned throughput, Query, Scan and indexing.

Expanding the Cloud: Faster, More Flexible Queries with DynamoDB - about secondary indexes

[My Disappointments with Amazon DynamoDB](http://whynosql.com/my-disappointments-with-amazon-dynamodb/)

[Amazon DynamoDB Part III: MapReducin’ Logs](http://www.newvem.com/amazon-dynamodb-part-iii-mapreducin-logs/)

Amazon forum threads:

DynamoDB mocks:

Amazon SimpleDB

General Info

SimpleDB description page.

Developer Guide.

PHP SDK and HTTP API.

Features

Complex Queries

One of the main uses for Amazon SimpleDB involves making complex queries
against your data set, so you can get exactly the data you need. For more
information, refer to the Select section of the Amazon SimpleDB Developer
Guide.

The Select operator supports comparison operators (=, !=, >, >=, <, <=), like and not like, between ... and, in, is null and is not null, every() for multi-valued attributes, as well as order by sorting, limit, and count(*) queries.
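For illustration, a select against a hypothetical "events" domain using the boto 2 SDK:

```python
import boto.sdb

# boto 2.x sketch; domain and attribute names are hypothetical
conn = boto.sdb.connect_to_region('us-east-1')
domain = conn.get_domain('events')

# note: an attribute used in ORDER BY must also be constrained
# in the WHERE clause
query = ("select * from `events` "
         "where `timestamp` > '2013-06-01' "
         "order by `timestamp` desc limit 100")
for item in domain.select(query):
    print(item)
```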

Data Storage and Performance

For information on how quickly stored data is recorded to Amazon SimpleDB,
refer to the Consistency section of the Amazon SimpleDB Developer Guide.

Limits and Restrictions

During development, it is important to understand Amazon SimpleDB's limits
when storing data, the amount of data Amazon SimpleDB can return from a
query, and what to do if the limits are exceeded. For more information, refer
to the Limits section of the Amazon SimpleDB Developer Guide.

Limits:

Partition data to domains strategy.

SimpleDB as events storage and limitations

A table to store events data has the following fields:

There is a large amount of data and we want to perform the following operations:

Limitations:

Cost

Cost calculator for SimpleDB (http://calculator.s3.amazonaws.com/calc5.html#s=SIMPLEDB).

If we have the same assumptions as for DynamoDB (see above), then we have:

Calculated cost: $173.

Amazon SimpleDB vs DynamoDB

Q: How does Amazon DynamoDB differ from Amazon SimpleDB? Which should I use? (http://aws.amazon.com/dynamodb/faqs/#How_does_Amazon_DynamoDB_differ_from_Amazon_SimpleDB_Which_should_I_use)

Both services are non-relational databases that remove the work of database
administration. Amazon DynamoDB focuses on providing seamless scalability and
fast, predictable performance. It runs on solid state disks (SSDs) for
low-latency response times, and there are no limits on the request capacity or
storage size for a given table. This is because Amazon DynamoDB automatically
partitions your data and workload over a sufficient number of servers to meet
the scale requirements you provide.
In contrast, a table in Amazon SimpleDB has a strict storage limitation of 10
GB and is limited in the request capacity it can achieve (typically under 25
writes/second); it is up to you to manage the partitioning and re-partitioning
of your data over additional SimpleDB tables if you need additional scale.
While SimpleDB has scaling limitations, it may be a good fit for smaller
workloads that require query flexibility. Amazon SimpleDB automatically indexes
all item attributes and thus supports query flexibility at the cost of
performance and scale.

Note: in DynamoDB there is also a 10 GB limitation for item collections (items with the same hash key) if we use local secondary indexes.

For events storage we have:

Additional resources

Quora: What is the difference between SimpleDB and DynamoDB?

Stackoverflow: Amazon SimpleDB vs Amazon DynamoDB

Stackoverflow: Amazon SimpleDB or DynamoDB

General Resources

Overview of Big Data and NoSQL Technologies as of January 2013

What The Heck Are You Actually Using NoSQL For?

Scaling Twitter: Making Twitter 10000 Percent Faster

Stackoverflow: How to store 7.3 billion rows of market data (optimized to be read)?

Running Databases on AWS

Running NoSQL Databases on AWS

Anti-RDBMS: A list of distributed key-value stores

CouchDB: Why NoSQL?

Thoughts on SimpleDB, DynamoDB and Cassandra

Some Other NoSQL solutions

MongoDB and MongoDB use cases: Storing log data.

MongoDB NoSQL Database on AWS.

Cassandra and Cassandra use cases.

Hypertable.

HBase.

CouchDB.

VoltDB.

Is VoltDB really as scalable as they claim?.

VoltDB Decapitates Six SQL Urban Myths And Delivers Internet Scale OLTP In The Process.
