From Data Mess to a Data Mesh
By Jarvin Mutatiina, Can Yurtseven and Ernst Blaauw
With the growing number of data sources and the need for agility, a decentralized data architecture concept called Data Mesh can be explored to enforce data quality and adherence to governance. Data Mesh achieves this by decentralizing data responsibility to the domain level and making high-quality, transformed data available only as a product.
Every year more data is produced globally. This also holds for companies: more details than ever are recorded about customers, partners, transactions, products and supply chains, resulting in more data. According to IDC, “the global datasphere will grow from 45 zettabytes in 2019 to 175 by 2025”. This data forms the raw material from which organizations draw valuable, actionable insights. But the collection, integration and governance of this data is still one of the main challenges and inhibitors, as established in recent research by Deloitte.
Many organizations are now looking at a relatively new concept called “Data Mesh” to overcome these challenges and inhibitors. They are realizing that flexible access to data, and with it a shorter time-to-market, can be achieved by focusing on domain-specific data products enabled by common support functions. Data Mesh borrows ideas from newer architectural approaches (e.g. service mesh) but focuses on data management rather than on connectivity and orchestration. So what is Data Mesh and what are the benefits?
Where data warehouses became congested and data lakes turned into swamps
The first paradigm for building a reliable, integrated and central data repository was the data warehouse. Data warehouses basically boiled down to copying operational data into a centralized, harmonized, well-defined data repository that was meant to serve as a “single source of truth”. That turned out to be mostly inflexible and not well suited to the era of “Big Data”, in which data grew in volume, variety and velocity. The data lake concept was invented to capture raw data from various sources in a single repository, on top of which various data layers could be built to suit multiple use cases. The data lake was better suited to supporting a variety of “big data” technologies (e.g. data streaming, NoSQL databases, etc.).
However, data lakes did not always deliver on their promise either. As they become increasingly complex with vast amounts of data, the process of creating new data products that adhere to company standards may take too much time. Business units found ways to circumvent the central IT organization so that their projects could continue. This resulted in non-compliant solutions, in other words shadow IT. Non-compliant solutions might provide initial results faster, but they are not sustainable in production environments and therefore inhibit the application of analytical insights at scale.
Data lakes and data warehouses share two properties: the data processing pipelines are mostly managed by centralized IT teams, and the data is stored in a centralized location. As data volumes grow, the complexity of the data landscape grows too, and centralized systems inevitably fail to cope with the organization’s drastically increased scalability and agility needs.
This model does not always translate well to a typical organization: the business domains know best what is in their data, yet the data is supposed to be managed centrally. Central IT teams are busy trying to keep up with all the requests from the company, but most of the time backlogs grow rather than shrink. Domain knowledge is not available when it is needed, which lowers the quality of deliveries. Here the concept of Data Mesh might be a way to address the disadvantages of data warehouses and data lakes without losing the investments made so far.
Why is it so popular now?
Data Mesh is a fairly new concept (it emerged around 2019) and it is gaining popularity. It is of particular interest to enterprises seeking a fast time-to-market while their data sources and volumes keep growing. This is achieved by decentralizing data responsibility to the domain level and making high-quality, transformed data available only as a product. Business domain knowledge is preserved while the data is also made available to the rest of the business. Data engineers do not have to sift through unfamiliar data, often dumped into data lakes from multiple sources. The proposed architecture aims to ease the often strained collaboration between data experts and data owners, given the growing domain-specific business acumen needed to bring value to data.
Data Mesh explained
The Data Mesh concept is a democratized approach to managing data in which different business domains operationalize their own data, backed by a central, self-service data infrastructure. The infrastructure comprises data pipeline engines, storage and computing capabilities that are bundled as illustrated in Figure 1.
Figure 1: Self-service data infrastructure layer (example)
Rather than looking at enterprise data as one huge data repository, Data Mesh considers it a set of repositories of data products. A business domain (e.g. “Finance”) provides data as a product that is ready to use for analysis, discoverable and reliable. This way, the data product owner is the actual business domain representative who has the deep domain knowledge. This is illustrated in the Data Product layer in Figure 2. As a result, no domain knowledge gets lost, as it could in the translation towards a data warehouse or data lake, and no bottleneck occurs at the central data engineering team.
Figure 2: Data product layer (example)
Different types of data consumers, like data scientists and business analysts, have direct access to relevant data product(s), on the basis of service level agreements.
The data products are also self-explanatory, in the sense that each product is discoverable and described, so it can be used in a “plug and play” fashion without the complex data transformation functions familiar from the data warehouse and data lake concepts. By ensuring that all data products have the same format, data governance guidelines are enforced across the domain data products within the mesh. The industry standards for governance are illustrated in the federated data governance layer in Figure 3.
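To make the idea of a described, uniformly formatted data product more concrete, here is a minimal Python sketch. It is an illustration only, not a reference implementation: the `DataProduct` fields, the `governance_violations` function and the specific rules are hypothetical and are not prescribed by the Data Mesh concept or taken from the figures. The sketch assumes that every domain publishes the same descriptor and that a shared governance check runs against it before a product is exposed in the mesh.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataProduct:
    """Hypothetical descriptor that every domain fills in the same way."""
    name: str
    domain: str
    owner: str              # accountable business domain representative
    description: str        # what makes the product self-explanatory
    schema: Dict[str, str]  # column name -> type name
    sla_hours: int          # agreed maximum data age, in hours


def governance_violations(product: DataProduct) -> List[str]:
    """Return the (illustrative) federated rules that the product breaks."""
    violations = []
    if not product.owner.strip():
        violations.append("missing owner")
    if not product.description.strip():
        violations.append("missing description")
    if not product.schema:
        violations.append("empty schema")
    if product.sla_hours <= 0:
        violations.append("sla_hours must be positive")
    return violations


# A product that satisfies every rule in this sketch.
finance_invoices = DataProduct(
    name="invoices",
    domain="Finance",
    owner="finance-data-owner",
    description="Posted customer invoices, one row per invoice line.",
    schema={"invoice_id": "string", "amount": "decimal"},
    sla_hours=24,
)

# A product that a domain tried to publish without an owner or a schema.
incomplete = DataProduct(
    name="leads",
    domain="Sales",
    owner="",
    description="Raw lead export.",
    schema={},
    sla_hours=24,
)

print(governance_violations(finance_invoices))
print(governance_violations(incomplete))
```

Because every domain uses the same descriptor, the check can be owned centrally while the products themselves stay with the domains, which is one possible reading of "federated" governance.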
Figure 3: Federated Data Governance
The three layers (distributed data product layer, federated data governance and self-service data infrastructure) interact to form the Data Mesh reference card shown in Figure 4:
Figure 4: Data Mesh reference card
There are direct benefits for an organization adopting this architectural concept:
- Agility and scalability; time-to-market, scalability and overall business domain agility improve significantly, and the IT backlog slims down. These gains come from decentralized data operations and from data infrastructure provisioned as a service, which let agile project teams operate independently and focus on the relevant data product(s).
- Strong central governance to control end-to-end compliance; with the fast-growing number of data sources and their varying data formats, traditional architectural setups with centralized data lakes fail to reconcile the semantics and volume of ingested data. Decentralizing data operations to the domains while enforcing global data governance guidelines promotes the delivery of quality data and easier access to it. There will be no more bulk data dumps into data lakes.
- Cross-functional domain teams; traditional data architecture approaches promote isolated skill teams that often have long backlogs. Data Mesh instead puts domain experts and owners in charge, which brings more domain knowledge into data work, moves business and IT teams closer together and enables agile virtual teams.
- Faster data delivery; setting up data infrastructure (e.g. data processing, data storage, logging, monitoring, identity management, etc.) is often a hindrance for data management. Data Mesh provides such infrastructure centrally, in a governable and self-service manner, with the underlying complexity hidden away so that data can be delivered faster.
Barriers to overcome
Despite the benefits that Data Mesh is expected to bring, it also introduces a couple of challenges, particularly because of its decentralized nature. Difficulties in managing the multiple data products and their corresponding metadata may very well lead to a mess of spaghetti data pipelines. Below are some of the potential pitfalls of Data Mesh:
- Duplication of data across different domains; as data is repurposed to serve a new domain’s business needs that differ from those of the source domain, redundancy ensues, which could affect resource utilization and data management cost.
- Enforcing federated data governance and quality adherence; with independent, co-existing data products and pipelines, quality principles can easily be neglected, leading to tremendous technical debt. These responsibilities and principles have to be appropriately identified and federated.
- Significant level of change management involved; adapting to the decentralized data operations of Data Mesh requires a significant change effort.
- Technology choices shape overall data capabilities of the data platform; technology choices that are both standardized across the organization and future-proof for all needed data capabilities need to be concretely addressed. Unfitting technology decisions could easily result in data products that accumulate technical debt over time.
- Cross-domain analytics; Data Mesh does not explicitly define an overarching, enterprise-wide data model for aggregating and consolidating the various data products into one report.
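The cross-domain analytics barrier can be illustrated with a small Python sketch. Everything in it is hypothetical: the two data products, their rows and the `consolidate` function are made up for illustration. The consolidation only works because both products are assumed to expose the same `customer_id` key, a convention that the mesh itself does not provide and that would have to be agreed on through federated governance.

```python
from collections import defaultdict

# Hypothetical output of a Finance data product: one row per payment.
finance_revenue = [
    {"customer_id": "C1", "revenue": 120.0},
    {"customer_id": "C2", "revenue": 80.0},
    {"customer_id": "C1", "revenue": 30.0},
]

# Hypothetical output of a Support data product: ticket counts per customer.
support_tickets = [
    {"customer_id": "C1", "tickets": 2},
    {"customer_id": "C3", "tickets": 1},
]


def consolidate(revenue_rows, ticket_rows):
    """Combine both products into one report keyed on the shared customer_id."""
    report = defaultdict(lambda: {"revenue": 0.0, "tickets": 0})
    for row in revenue_rows:
        report[row["customer_id"]]["revenue"] += row["revenue"]
    for row in ticket_rows:
        report[row["customer_id"]]["tickets"] += row["tickets"]
    # Sort by customer so that the report order is deterministic.
    return dict(sorted(report.items()))


report = consolidate(finance_revenue, support_tickets)
for customer_id, figures in report.items():
    print(customer_id, figures)
```

If the two domains had named or encoded the customer key differently, this consolidation would need an extra mapping step, which is exactly the kind of effort that a missing enterprise-wide data model pushes onto the consumer.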
Also published on Deloitte.