June 1st 2020
by Monte Zweben & Syed Mahmood of Splice Machine
Apache Hadoop emerged on the IT scene in 2006, promising to give organizations the ability to store an unprecedented volume of data on commodity hardware. The promise addressed not only the size of the data sets but also the type of data, such as data generated by IoT devices, sensors, servers, and social media, which businesses were increasingly interested in analyzing. The combination of data volume, velocity, and variety was popularly known as Big Data.
Schema-on-read played a vital role in Hadoop’s popularity. Businesses thought they no longer had to worry about the tedious process of defining which tables contained what data and how they were connected to each other, a process that took months and had to be complete before a single data warehouse query could be executed. In this brave new world, businesses could store as much data as they could get their hands on in Hadoop-based repositories known as data lakes and worry later about how it would be analyzed.
Data lakes began to appear in enterprises. They were enabled by commercial Big Data distributions: a number of independent Open Source compute engines, supported together in one platform, that would power the data lake to analyze data in different ways. On top of that, because all of this was Open Source, it was free to try! What could go wrong?
Schema-on-Read Was a Mistake
As with so many things in life, the features of Hadoop that were touted as its advantages also turned out to be its Achilles’ heel. First, with the schema-on-write restriction lifted, terabytes of structured and unstructured data began to flow into the data lakes. With Hadoop’s data governance framework and capability still being defined, it became increasingly difficult for businesses to determine the contents of their data lake and the lineage of their data. Also, the data was not ready to be consumed. Businesses began to lose faith in the data that was in their data lakes and slowly these data lakes began to turn into data swamps. The “build it and they will come” philosophy of schema-on-read failed.
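To make the contrast concrete, here is a minimal sketch of the two approaches. The event records, field names, and the `REQUIRED` schema are all hypothetical, invented purely for illustration; a real pipeline would look different.

```python
import json

# Hypothetical raw events as they might land in a data lake: no schema enforced.
raw_events = [
    '{"device_id": "a1", "temp_c": 21.5}',
    '{"device": "b2", "temperature": "70F"}',  # different field names and units
    '{"device_id": "c3"}',                     # the reading is missing altogether
]

# Hypothetical schema: field name -> expected Python type.
REQUIRED = {"device_id": str, "temp_c": float}

def validate(record):
    """Return True only if every required field is present with the expected type."""
    return all(isinstance(record.get(key), typ) for key, typ in REQUIRED.items())

parsed = [json.loads(line) for line in raw_events]

# Schema-on-write: records that do not match are rejected at ingest time.
accepted = [r for r in parsed if validate(r)]
rejected = [r for r in parsed if not validate(r)]

# Schema-on-read: everything was stored, so every reader has to cope with the mess.
readable_temps = [r["temp_c"] for r in parsed if isinstance(r.get("temp_c"), float)]

print(len(accepted), "accepted,", len(rejected), "rejected at write time")
print("temperatures a reader can actually use:", readable_temps)
```

In the schema-on-read case nothing stops the second and third records from being stored, so the cleanup work is pushed onto every consumer of the data rather than being done once at ingest.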
Hadoop Complexity and Duct-Taped Compute Engines
Second, Hadoop distributions provided a number of Open Source compute engines, such as Apache Hive, Apache Spark, and Apache Kafka, to name just a few, but this turned out to be too much of a good thing. As a case in point, one commercial Hadoop platform consisted of 26 such separate engines. These compute engines were complex to operate, and duct-taping them together required specialized skills that were difficult to find in the market.
The Wrong Focus: The Data Lake versus The App
Third and most importantly, data lake projects began to fail because enterprises prioritized storing all enterprise data in a central location, with the goal of making it available to all developers (an uber data warehouse, if you will), rather than thinking about how applications would consume the data. As a result, Hadoop clusters often became the gateways of enterprise data pipelines that filter, process, and transform data, which is then exported to other databases and data marts for downstream reporting and almost never finds its way into a real business application in the operating fabric of the enterprise. The data lakes end up being a massive set of disparate compute engines, operating on disparate workloads, all sharing the same storage. This is very hard to manage. The resource isolation and management tools in this ecosystem are improving, but they still have a way to go. All this complexity, just for reports.
Enterprises, for the most part, were not able to shift from using their data lakes as inexpensive data repositories and processing pipelines to using them as platforms that consume data and power mission-critical applications. Case in point: Apache Hive and Apache Spark are among the most widely used compute engines for Hadoop data lakes. Both engines are used for analytical purposes, either to process SQL-like queries (Hive) or to perform SQL-like data transformations and build predictive models (Spark). These data lake implementations have not focused enough on how to use data operationally in applications.
Strategy Going Forward
So if your organization is concerned about recent developments in the Hadoop ecosystem and is increasingly under pressure to demonstrate the value of its data lake, start by focusing on the operational applications and then work your way back to the data.
By focusing on modernizing applications with data and intelligence, you will end up with apps that can use data to predict what might happen based on experience and make proactive, in-the-moment decisions that result in superior business outcomes. Here are five ingredients of a successful application modernization strategy:
- Pick an application to modernize: Rather than focusing your efforts on centralizing the data, first, pick an application that you would like to modernize. The prime candidate for this is one of many custom-built applications that have fallen behind in the marketplace and are in need of becoming more agile, intelligent and data-driven. Once you have identified the application that can deliver a competitive advantage to your organization then you can focus on sourcing the data required to power that application and whether that data can be made available from the data lake.
- Use scale-out SQL for your application modernization: SQL has been the workhorse of enterprise workloads for many years, and there are hundreds of developers, business analysts and IT personnel in your organization who are fully conversant in SQL. Do not incur the additional time, expense and risk of rewriting your original SQL application against a low-level NoSQL API. Select a platform that lets you keep the familiar patterns and powerful functionality of SQL while modernizing the application on an architecture that can elastically scale out on inexpensive infrastructure. Scale-out brings the power of an entire cluster to bear on a computation, making it much faster than older SQL systems that ran on a single centralized system. With scale-out you can also add capacity and take it away as workloads change.
- Adopt an ACID platform: ACID compliance is the mechanism through which transactions maintain integrity in the database; it allows users to perform actions such as commit and rollback. It is critical for powering operational applications because it ensures that the database does not make changes visible to others until a commit has been issued. Select a platform that provides ACID capability at the level of the individual transaction in the database. Otherwise, all of these consistency ramifications need to be handled in the application code. All traditional SQL systems were ACID compliant. The data lakes mistakenly discarded this, making applications very difficult to write.
- Combine The Analytics: According to a recent Gartner blog, there were historically good reasons to separate your IT infrastructure into operational (OLTP) and analytical (OLAP) components, but that is no longer the case. ETL kills our SLAs with latency. It used to be that operational and analytical workloads interfered with each other and had to be separated. Moreover, legacy data platforms performed so poorly that we had to transform the operational schema into star schemas or snowflake schemas that were better suited to analytical workloads. This ETL is no longer required: you can run analytics on the operational platform, often using the operational schema. By adopting such a platform, you ensure that your application runs on a platform that minimizes data movement and does not add latency to the application. It delivers your insights, reports and dashboards in the moment rather than on yesterday’s or last week’s data.
- Embed Native Machine Learning: One of the primary reasons for modernizing your application is to inject AI and ML into it so that it can learn from experience, dynamically adapt to changes, and make in-the-moment decisions. To make your application intelligent, it is critical to select a platform that has machine learning built in at the database level, so that up-to-date data is always available for experimenting with, training, and running models.
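Two of the ingredients above, ACID transactions and analytics on the operational schema, can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in, which is an assumption made for convenience: SQLite is a single-node embedded database, not a scale-out platform, and the `accounts` table and `transfer` function are hypothetical.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; undo everything on failure."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit above was rolled back.
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " id INTEGER PRIMARY KEY,"
    " region TEXT NOT NULL,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [(1, "east", 100), (2, "west", 50), (3, "east", 25)],
)
conn.commit()

ok = transfer(conn, 1, 2, 30)       # valid: both updates are committed together
failed = transfer(conn, 2, 1, 500)  # overdraft: neither update survives

# Operational read: current balances per account.
balances = dict(conn.execute("SELECT id, balance FROM accounts"))

# Analytical read on the same operational table, with no ETL step in between.
totals = dict(conn.execute("SELECT region, SUM(balance) FROM accounts GROUP BY region"))

print(ok, failed, balances, totals)
```

Because the database enforces the all-or-nothing behavior, the application code never has to compensate for a half-applied transfer, and the aggregate query always reads committed data from the same tables the application writes to.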
This is a fundamentally different approach from how you have used your data lake so far. It delivers tangible business value to the line of business faster, through an application that can now leverage the data lake.
This approach will ensure that, in addition to modernizing the applications that provide your business with competitive advantage, you also preserve the investment in your data lake.