Microsoft Azure recently announced news of the Azure Data Lake, a product that serves as a repository for “every type of data collected in a single place prior to any formal definition of requirements or schema.” As noted by Oliver Chiu in a blog post, Data Lakes allow organizations to store all data types regardless of data type and size on the theory that they can subsequently use advanced analytics to determine which data sources should be transferred to a data warehouse for more rigorous data profiling, processing and analytics. The Azure Data Lake’s compatibility with HDFS means that products with data stored in Azure HDInsight and infrastructures that use distributions such as Cloudera, Hortonworks and MapR can integrate with it, thereby allowing them to feed the Azure Data Lake with streams of Hadoop data from internal and third party data sources as necessary. Moreover, the Azure Data Lake supports massively parallel queries that allow for the execution of advanced analytics on massive datasets of the type envisioned for the Azure Data Lake, particularly given its ability to support unlimited data both in aggregate, and with respect to specific files as well. Built for the cloud, the Azure Data Lake gives enterprises a preliminary solution to the problem of architecting an enterprise data warehouse by providing a repository for all data that customers can subsequently use as a base platform from which to retrieve and curate data of interest.
The Azure Data Lake illustrates the way in which the economics of cloud storage redefines the challenges associated with creating an enterprise data warehouse by shifting the focus of enterprise data management away from master data management and data cleansing toward advanced analytics that can query and aggregate data as needed, thereby absolving organizations of the need to create elaborate structures for storing data. In much the same way that Gmail dispenses with files and folders for email storage and depends upon its search functionality to facilitate the retrieval of email-based data, data lakes take the burden of classifying and curating data away from customers but correspondingly place the emphasis on the analytic capabilities of organizations with respect to the ability to query and aggregate data. As such, the commercial success of the Azure Data Lake hinges on its ability to simplify the process of running ad hoc and repeatable analytics on data stored within its purview by giving customers a rich visual user interface and platform for constructing and refining analytic queries on Big Data.