Amazon DynamoDB Offers Big Data Cloud Processing With Managed Services

This week, Amazon Web Services announced the availability of Amazon DynamoDB, a fully managed cloud-based database service for Big Data processing. The announcement represents yet another move by Amazon Web Services to consolidate enterprise market share by providing an offering that can store massive amounts of data with ultra-fast, predictable performance and low latency. Amazon DynamoDB is a NoSQL database built for customers who do not require complex querying capabilities such as secondary indexes, transactions, or joins. DynamoDB constitutes a greatly enhanced version of Amazon SimpleDB. One of Amazon SimpleDB’s principal limitations is its 10 GB cap on the data stored within containers known as domains. Moreover, Amazon SimpleDB suffered from performance issues because it indexed every attribute of an object within a domain and committed to an extreme form of eventual consistency. Amazon DynamoDB builds upon the company’s prior experience with SimpleDB and Dynamo, the internal key-value store whose design helped inspire the NoSQL movement, by offering the following features:

• Managed services

Amazon DynamoDB’s managed services take care of processes such as provisioning servers, configuring clusters, and handling scaling, partitioning, and replication.

• No Upper Bound On Data

Customers can store as much data as they would like. Data will be spread out across multiple servers spanning multiple Availability Zones.

• Speed

The solid state drives on which Amazon DynamoDB is built help optimize performance and ensure low latencies. Applications running in the EC2 environment should expect to see latencies in the “single-digit millisecond range for a 1KB object.” Performance is further optimized by a design that declines to index every attribute.

• Flexible schemas and data models

Data need not adopt a particular schema and can have multiple attributes, including attributes that themselves have multiple values.
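This schemaless item model can be sketched in plain Python: items in the same table share only a key attribute, while every other attribute (including multi-valued set attributes) may vary item by item. The table, key, and attribute names below are illustrative inventions, not taken from the announcement.

```python
# Minimal sketch of DynamoDB's flexible item model using plain Python
# dictionaries. The key and attribute names are hypothetical examples.

# Each item must carry the table's key attribute ("customer_id" here);
# every other attribute is optional and may differ from item to item.
items = [
    {"customer_id": "c-100", "name": "Alice", "tier": "gold"},
    # A second item with a different attribute set, including a
    # multi-valued (set) attribute -- no schema migration required.
    {"customer_id": "c-101", "emails": {"a@example.com", "b@example.com"}},
]

def validate(item, key_attr="customer_id"):
    """Only the key attribute is mandatory; the rest of the schema is free."""
    return key_attr in item

assert all(validate(it) for it in items)
```

The point of the sketch is that adding a new attribute to one item never forces a schema change on the others.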

• Integration with Amazon Elastic MapReduce (Amazon EMR)

Because DynamoDB is integrated with the Hadoop-based Amazon Elastic MapReduce technology, customers can analyze data stored in DynamoDB and write the results to S3, thereby preserving the original dataset in DynamoDB.

• Low cost

Pricing starts at $1 per GB per month.

With this set of features, Amazon DynamoDB represents a dramatic entrant to the Big Data party that features Oracle, HP, Teradata, Splunk and others. The product underscores Amazon Web Services’s strategic investment in becoming a one-stop shop for cloud and Big Data processing. Moreover, the managed services component of Amazon DynamoDB represents a clear change of pace for the Amazon subsidiary, one that recognizes the value of managed services for enterprise technology deployments. Amazon DynamoDB’s managed services offering is expected to appeal to enterprises that would rather invest technical resources in innovation and software development than in the operational maintenance of a complex IT ecosystem. If AWS can quantify the degree to which the managed services offering drives DynamoDB sales, expect to see more managed service offerings from Amazon Web Services in both the cloud computing and Big Data verticals. Going forward, the technology community should also expect partnerships between Amazon Web Services and business intelligence vendors that mimic the deal between Jaspersoft and Red Hat’s OpenShift, given that Amazon Web Services appears intent on retaining customers within its ecosystem for all of their cloud hosting, Big Data and business intelligence analytics needs.


Fujitsu Reveals Cloud Based Platform For Big Data From Sensing Technologies

Fujitsu revealed a cloud-based platform for Big Data known as Data Utilization Platform Services on Monday. The platform enables the aggregation, exchange, manipulation and analysis of “massive quantities of sensing data” in a variety of formats. Fujitsu’s Data Utilization Platform Services features the following four components:

• Data Management & Integration Services

The platform provides an apparatus for the collection and categorization of massive volumes of sensor-driven data.

• Communications Control Services

In addition to collecting and categorizing data, Fujitsu’s platform can transmit data to other devices in order to automatically adjust data-driven equipment such as devices in home, automotive, factory or scientific environments.

• Data Collection and Detection Services

Fujitsu’s platform can apply rules to data derived from sensors to adjust machine behavior in real time using an iterative feedback loop. Rule-based decision making over sensing data may involve equipment in navigation, robotics or other fields in which real-time decisions depend on an up-to-date data store.
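The rule-driven feedback loop described here can be sketched generically: sensor readings stream in, rules map each reading to a control action, and the actions adjust the attached equipment. Fujitsu has not published its rule format, so the device, thresholds, and action names below are invented for illustration.

```python
# Generic sketch of a rule-based sensing feedback loop. The thresholds
# and action names are hypothetical; Fujitsu's actual rule engine and
# rule format are not described in the announcement.

def apply_rules(reading):
    """Map a temperature reading to a control action for a device."""
    if reading["temp_c"] > 30.0:
        return "cooling_on"
    if reading["temp_c"] < 18.0:
        return "heating_on"
    return "idle"

# Iterate over a stream of readings, adjusting equipment each cycle.
readings = [{"temp_c": 31.5}, {"temp_c": 22.0}, {"temp_c": 16.4}]
actions = [apply_rules(r) for r in readings]
assert actions == ["cooling_on", "idle", "heating_on"]
```

In a production deployment the loop would run continuously, with each action feeding back into the next batch of sensor readings.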

• Data Analysis Services

The platform contains a bevy of business intelligence tools that enable the production of actionable analytics to drive operational decisions.

A schematic of the architecture of the Data Utilization Platform Services is given below:

Fujitsu will also be offering a set of “Data Curation Services” comprising professional services and analytic tools that help customers tackle their Big Data challenges. Fujitsu did not elaborate on the underlying technology for either the cloud-based or Big Data components of its Data Utilization Platform Services, but a report in The Register speculates that “Hadoop, the open source MapReduce data muncher and its related Hadoop Distributed File System” constitute one of the platform’s key technologies. Absent details of its underlying technology, the most notable feature of Fujitsu’s cloud platform for Big Data is its distinct focus on data derived from sensing technologies in fields such as navigation, robotics and meteorology.

Ten Things You Should Know About Splunk And Its $125 Million IPO

Splunk Inc. filed for a $125 million IPO on Friday in what marks the first IPO in the rapidly growing Big Data technology space. Big Data technology refers to software that specializes in the analysis of massive amounts of structured and unstructured data. Splunk’s mission is “to make machine data accessible, usable and valuable to everyone in an organization.” Splunk produces software that analyzes operational machine data about customer transactions, user actions and security risks. The San Francisco-based company provides IT and business stakeholders with analytics that enable them to improve project delivery, cut costs, reduce security threats, demonstrate compliance with security regulations and derive actionable business intelligence insights.

Founded in 2004, Splunk capitalized on the market opportunity for actionable analytics on data derived from increasingly complex and heterogeneous enterprise IT environments featuring corporate data centers as well as cloud-based and virtualized application environments. Splunk’s software provides its users with a 360-degree view of enterprise operations by running against structured data sets as well as unstructured data that lacks a pre-defined schema. Here are ten things you should know about Splunk and its S-1 filing:

1. Splunk has over 3,300 customers including Bank of America, Zynga, and Comcast.

2. Splunk’s software can be downloaded and installed within hours and requires neither extensive customization nor professional services for setup. Splunk is currently developing Splunk Storm (Beta), a cloud-based version of its software that features a subset of its functionality.

3. Splunk recorded revenues of $18.2 million, $35.0 million and $66.2 million in fiscal 2009, 2010 and 2011, with losses of $14.8 million, $7.5 million and $3.8 million, respectively. Revenue grew at a rate of 93% for fiscal 2010 and 89% for fiscal 2011.

4. For the first nine months of fiscal 2011 and 2012, Splunk’s revenues were $43.5 million and $77.8 million, with losses of $2 million and $9.7 million, respectively. Revenue grew at a rate of 79% during this time period.

5. Splunkbase and Splunk Answers, Splunk’s online user communities, provide customers with an infrastructure by which to share apps and offer each other insights and support. Splunk believes that enriching these user communities constitutes a key component of its growth strategy.

6. More than 300 apps are available via the Splunkbase website. Over 100 apps were developed by third parties. Examples of Splunk apps include Splunk for Enterprise Security, Splunk for PCI Compliance and Splunk for VMware.

7. In fiscal 2011 and the first nine months of fiscal 2012, 21% and 24% of Splunk’s revenues, respectively, derived from international sales. The sizable percentage of Splunk’s customers outside the U.S. leaves the company vulnerable to risks specific to international sales transactions, such as exposure to global economic conditions, lengthier payment cycles and the additional managerial, legal and accounting costs of international business operations.

8. The IPO filing cited the following vendors as Splunk’s principal competition: (1) Web analytics vendors such as Adobe Systems, Google, IBM and Webtrends; (2) business intelligence vendors including IBM, Oracle, SAP and EMC; and (3) Big Data technologies such as Hadoop.

9. Godfrey Sullivan has served as Splunk’s CEO since 2008. Prior to Splunk, Sullivan was CEO of Hyperion Solutions Corp., which he helped sell to Oracle for $3.3 billion in 2007.

10. Three of Splunk’s key technologies are schema on the fly, machine data fabric and search engine capability for machine data. Schema on the fly refers to the ability to develop schemas that adjust to queries and relevant data sets instead of inserting data into a pre-defined schema. The result is a more flexible modality of tagging data that is well suited to unstructured data sets lacking a well-defined schema. Machine data fabric refers to the ability to access machine data in all its various forms; Splunk’s machine data fabric means that no data is left uncovered by its software. As noted in the S-1 filing, Splunk’s “software enables users to process machine data no matter the infrastructure topology, from a single machine to a globally distributed, virtualized IT infrastructure.” Search engine capability means that Splunk boasts a range of arithmetic and advanced statistical capabilities for searching and performing business intelligence analysis on machine data.
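The schema-on-the-fly idea can be illustrated with search-time field extraction: rather than forcing log lines into a predefined table at ingest, fields are pulled out by the query itself. This is a generic sketch of the technique, not Splunk’s implementation; the log format and field names are invented.

```python
import re

# Raw machine data is stored as-is; no schema is imposed at ingest time.
# (Hypothetical log lines for illustration.)
raw_events = [
    'status=200 user=alice latency_ms=12',
    'status=500 user=bob latency_ms=840',
]

def search(events, **criteria):
    """Extract key=value fields at query time ("schema on the fly")
    and filter on whatever fields this particular query cares about."""
    results = []
    for line in events:
        fields = dict(re.findall(r'(\w+)=(\S+)', line))
        if all(fields.get(k) == v for k, v in criteria.items()):
            results.append(fields)
    return results

errors = search(raw_events, status="500")
assert errors[0]["user"] == "bob"
```

Because the schema lives in the query rather than the store, a new field in tomorrow’s logs requires no migration of yesterday’s data.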

Splunk has yet to reveal the number of shares that will be offered as part of its $125 million IPO under the ticker symbol SPLK. Thus far, the company has raised $40 million in venture capital funding from August Capital, JK&B Capital, Ignition Partners and Sevin Rosen Funds. The IPO is led by Morgan Stanley. JPMorgan Chase & Co., Credit Suisse Group AG and Bank of America Corp. are also working with Morgan Stanley on the public offering. Rest assured that Splunk’s IPO will be watched very closely by all vendors in the Big Data space.

Oracle Partners With Cloudera For Newly Available Big Data Appliance

On Tuesday, Oracle announced the availability of the Big Data appliance that it introduced to the world at its October Oracle OpenWorld conference. The appliance runs on Linux and features Cloudera’s Distribution Including Apache Hadoop (CDH), Cloudera Manager for managing the Hadoop distribution, the Oracle NoSQL database and an open source distribution of R, the statistical software package. Oracle’s partnership with Cloudera in delivering its Big Data appliance goes beyond the latter’s selection as Hadoop distributor to include assistance with customer support: Oracle plans to deliver tier one customer support while Cloudera will provide assistance with tier two and tier three customer inquiries, including those beyond the domain of Hadoop.

Oracle will run its Big Data appliance on hardware featuring 864 GB of main memory, 216 CPU cores, 648 TB of raw disk storage, 40 Gb/s InfiniBand connectivity and 10 Gb/s Ethernet data center connectivity. Oracle also revealed details of four connectors to its appliance with the following functionality:

• Oracle Loader for Hadoop, which loads massive amounts of data into the appliance using MapReduce parallel processing.
• Oracle Data Integrator Application Adapter for Hadoop, which provides a graphical interface that simplifies the creation of Hadoop MapReduce programs.
• Oracle R Connector for Hadoop, which provides R users streamlined access to the Hadoop Distributed File System (HDFS).
• Oracle Direct Connector for Hadoop Distributed File System (ODCH), which enables Oracle Database SQL queries to access data stored in HDFS.

Oracle’s announcement of the availability of its Big Data appliance comes as the battle for Big Data market share takes shape in a landscape dominated by the likes of Teradata, Microsoft, IBM, HP, EMC, Informatica, MarkLogic and Karmasphere. Oracle’s selection of Cloudera as its Hadoop distributor indicates that it intends to make a serious move into the world of Big Data. First, the partnership with Cloudera gives Oracle increased access to Cloudera’s universe of customers. Second, the partnership enhances the credibility of Oracle’s Big Data offering given that Cloudera represents the most prominent distributor of Apache Hadoop in the U.S.

In October, Microsoft revealed plans for a Big Data appliance featuring Hadoop for Windows Server and Azure, and Hadoop connectors for SQL Server and SQL Parallel Data Warehouse. Whereas Oracle chose Cloudera for Hadoop distribution, Microsoft partnered with Yahoo spinoff Hortonworks to integrate Hadoop with Windows Server and Windows Azure. In late November, HP provided details of Autonomy IDOL (Integrated Data Operating Layer) 10, which features the ability to process large-scale structured data sets in addition to a NoSQL interface for loading and analyzing structured and unstructured data. In December, EMC released its Greenplum Unified Analytics Platform (UAP), marked by the ability to load structured data, enterprise-grade Hadoop for analyzing structured and unstructured data and Chorus, a collaboration and productivity software tool. Bolstered by its partnership with Cloudera, Oracle is set to compete squarely with HP’s Autonomy IDOL 10, EMC’s Greenplum Chorus and IBM’s BigInsights until Microsoft’s appliance officially enters the Big Data dohyō (土俵), the proverbial sumo ring, as well.

Apache Software Foundation Releases Hadoop Version 1.0

The Apache Software Foundation announced the release of Apache Hadoop version 1.0 on January 4. Hadoop, the software framework for analyzing massive amounts of data, graduated to the version 1.0 designation after six years of gestation. Hadoop’s principal attribute is its capability to process massive amounts of data in parallel on clusters of computing nodes. Hadoop spreads data across the computing nodes, distributes the processing task across those nodes, and then synthesizes the results of the parallel computations.
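The map-then-synthesize pattern at the heart of Hadoop can be sketched with a single-process word count: a map step emits key/value pairs from each input split, and a reduce step groups the pairs by key and synthesizes per-key totals. This toy runs locally; a real Hadoop job distributes the same two phases across cluster nodes, with a shuffle moving pairs between them.

```python
from collections import defaultdict

# Toy, single-process illustration of the MapReduce pattern that Hadoop
# distributes across a cluster. The input strings stand in for HDFS blocks.

def map_phase(split):
    """Emit (word, 1) pairs from one input split."""
    return [(word, 1) for word in split.split()]

def reduce_phase(pairs):
    """Group pairs by key and synthesize per-key totals."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

splits = ["big data big", "data big"]              # stand-ins for data blocks
pairs = [p for s in splits for p in map_phase(s)]  # map over each split
counts = reduce_phase(pairs)                       # shuffle + reduce
assert counts == {"big": 3, "data": 2}
```

In a real deployment each `map_phase` call would run on the node holding that block of data, which is what lets Hadoop scale to the node counts cited below.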

Version 1.0 represents a milestone in terms of scale and enterprise-readiness. The release already features deployments as large as 50,000 nodes and marks the culmination of six years of development, testing and feedback from users and data scientists. Key features of the 1.0 release include the following:

• The integration of Hadoop’s big data table, HBase
• Performance enhanced access to local files for HBase
• Kerberos-based authentication to ensure security
• Support for WebHDFS, a read/write HTTP (REST) access layer for HDFS
• Miscellaneous performance enhancements and bug fixes
• All version 0.20.205 and prior 0.20.2xx features
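Among these, WebHDFS exposes HDFS operations as plain HTTP requests against URLs of the form `http://<namenode>:<port>/webhdfs/v1/<path>?op=...`. The sketch below only constructs such a URL; the host, port, and file path are placeholders, and actually issuing the request would require a running Hadoop 1.0 cluster.

```python
# Sketch: building a WebHDFS read (OPEN) URL. Host, port, and file path
# are placeholder values; no HTTP request is actually sent here.

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS v1 REST URL for the given HDFS path and operation."""
    query = "&".join([f"op={op}"] + [f"{k}={v}" for k, v in params.items()])
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

url = webhdfs_url("namenode.example.com", 50070, "/user/demo/logs.txt", "OPEN")
assert url == ("http://namenode.example.com:50070"
               "/webhdfs/v1/user/demo/logs.txt?op=OPEN")
```

Because the interface is plain HTTP, any language with an HTTP client can read and write HDFS files without Hadoop's Java libraries.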

Hadoop was the Big Data story of 2011 insofar as it was incorporated into almost every Big Data product revealed that year by enterprises and startups alike. The release of Apache Hadoop 1.0 promises to propel Hadoop even closer to the enterprise and to transition the software framework from a tool for supporting web data into an enterprise-grade Big Data software infrastructure more generally.

Microsoft Announces Updates to Windows Azure

Microsoft today announced a set of updates to Windows Azure, its Platform as a Service (PaaS) cloud computing offering that was launched in February 2010. Microsoft grouped the updates into the categories of Ease of Use, Interoperability and Overall Value. Here are some highlights from the wide-ranging update:

• Node.js language libraries added to the Windows Azure software development kit (SDK)

Windows Azure now supports Node.js, the software platform known for its suitability for highly scalable network applications such as web servers. The Azure software development kit for Node.js includes libraries for blob, table and queue storage as well as PowerShell command-line tools for development.

• Apache Hadoop on Windows Azure – A Preview

Developers seeking to unlock the Big Data potential of Apache Hadoop can now obtain a preview of Hadoop on Windows Azure by taking advantage of a streamlined installation process that sets up Hadoop on Azure in hours as opposed to days.

• Upper Bound on Price for Large Azure Databases

The maximum Azure database size has been increased from 50 GB to 150 GB. Additionally, the maximum price for the 150 GB database has been set at $499.95, resulting in a 67% price decrease for customers using the largest size.

• Lowering of Data Transfer Prices

Data transfer fees in Zone 1 (North America and Europe) have been lowered from $0.15/GB to $0.12/GB. Data transfer fees in Zone 2 (Asia Pacific) have been lowered from $0.20/GB to $0.19/GB.

Additional updates include enhanced tools for Eclipse/Java, MongoDB, SQL Azure Federation, Solr/Lucene and Memcached as well as access to Windows Azure libraries for .NET, Java, and Node.js on GitHub via an Apache 2 software license.