Hadoop is offered by both Amazon and Microsoft through their cloud services. I spent some time today pricing out a reasonably sized Hadoop cluster on each, and here is how the monthly costs compare.
The specifications I used were:
- 1 Head Node running an Extra Large (A4) instance on Azure or the roughly equivalent c3.2xlarge on Amazon. Both VMs have 8 cores; Azure’s has 14 GB of RAM and Amazon’s has 15 GB.
- 10 Data Nodes running a Large (A3) instance on Azure or the roughly equivalent c3.xlarge on Amazon. Both VMs have 4 cores; Azure’s has 7 GB of RAM and Amazon’s has 7.5 GB.
- 50 TB of blob storage per month. I used Azure Locally Redundant Storage, since Amazon S3 standard storage is not geo-redundant as far as I can see.
- 10 TB of inbound data transfers per month.
- 5 TB of outbound data transfers per month.
- 10 Million Transactions per month. Each record you put into storage is considered a transaction.
Here are the results based on Azure and Amazon’s latest pricing (keep in mind these change quite often):
Price Per Day for VMs
Price Per Month for Storage
|Inbound Data Transfers||10 TB||Free||Free|
|Outbound Data Transfers||5 TB||$665.60||$614.40|
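As a sanity check on the outbound transfer row, the per-GB rate implied by each total can be backed out. This is a minimal sketch, assuming 1 TB = 1,024 GB (my assumption; it is the conversion under which the totals divide evenly) and taking the two dollar figures in the order they appear in the row. The function name is just for illustration.

```python
GB_PER_TB = 1024  # assumption: binary terabytes, not stated in the pricing above


def implied_rate_per_gb(monthly_total, terabytes):
    """Back out the per-GB price implied by a monthly total."""
    return monthly_total / (terabytes * GB_PER_TB)


outbound_tb = 5
for total in (665.60, 614.40):  # the two totals from the outbound row
    rate = implied_rate_per_gb(total, outbound_tb)
    print(f"${total:,.2f} for {outbound_tb} TB -> ${rate:.2f} per GB")
```

Under that assumption the two totals work out to about 13 and 12 cents per GB, so the gap on outbound transfers is roughly one cent per GB at this volume.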
In both scenarios, you can stand up a 10-node Hadoop cluster for roughly $3,000 a month. Amazon charges more for its VMs, and for Hadoop in particular it adds a secondary charge on top of the VM price for its Elastic MapReduce service. Curiously, it also charges significantly more for transactions (e.g. PUTs into storage).
Storage pricing depends on whether you choose locally redundant or geo-redundant Azure Blob storage. With geo-redundant storage, the cost goes from $1,244.16 per month to $2,488.32 per month for the same 50 TB.
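To make the redundancy trade-off concrete, here is a small sketch that backs out the per-GB rates implied by those two storage totals. It assumes 1 TB = 1,024 GB, which is my assumption rather than something taken from the price sheet, and the results are blended averages over the full 50 TB, not necessarily the published tier prices.

```python
GB_PER_TB = 1024  # assumption: binary terabytes

STORAGE_TB = 50
LRS_MONTHLY = 1244.16  # locally redundant total quoted above
GRS_MONTHLY = 2488.32  # geo-redundant total quoted above


def per_gb(monthly_total, terabytes=STORAGE_TB):
    """Blended per-GB monthly rate implied by a total."""
    return monthly_total / (terabytes * GB_PER_TB)


lrs_rate = per_gb(LRS_MONTHLY)
grs_rate = per_gb(GRS_MONTHLY)

print(f"LRS: ${lrs_rate:.4f} per GB per month")
print(f"GRS: ${grs_rate:.4f} per GB per month")
print(f"GRS is {GRS_MONTHLY / LRS_MONTHLY:.1f}x the LRS cost")
print(f"Extra per month for geo redundancy: ${GRS_MONTHLY - LRS_MONTHLY:,.2f}")
```

In other words, geo redundancy exactly doubles the storage bill at this size, adding about as much per month as the locally redundant storage costs on its own.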
Microsoft also provides a secondary head node free of charge to increase the availability of the service.
Amazon supports Hadoop 2.4 in production, while Microsoft currently has it only in preview. Both currently support Hadoop 2.2 in production.
The Advantages of HDInsight in an Existing Microsoft Environment
HDInsight provides additional libraries that are designed to allow for better integration between Hadoop and other Microsoft technologies. These include:
- PowerShell scripts and cmdlets for automating Hadoop cluster deployments
- An Avro library that provides data serialization across languages for processing complex data structures using C, C++, C#, Java, PHP, etc.
- Integration with Excel through Power Query
- Hive ODBC driver for querying data from Windows, SQL Server, .NET, etc.
If you are already used to Microsoft technologies and are running in a Microsoft environment, then these additional features provide an easier road to Hadoop as you integrate it with traditional SQL, Excel, SharePoint, etc.