To have a good lake, you need an anchor

Posted by at 13:10h

Ever since enterprise adoption of Hadoop and so-called NoSQL databases began, there’s been a schism in the database world.  On the one hand, we have conventional relational database management systems (RDBMSes), which have been powering both operational databases and data warehouse systems for decades.  On the other hand, we have new data repositories, based on innovative but less battle-worn data lake technologies, for unstructured and big data.

Compared to their relational brethren, these newer platforms have fewer requirements around modeling the data and can offer much lower cost of storage.  The distinct offerings legitimately meet distinct needs.  But they have also bisected the database market and, unfortunately, cleaved customer data strategies along the way.

Tools, on two paths
Perpetuating the problem is a similar split in data management tools.  Take a look at the market and you’ll see that tools from the largest software vendors, as well as a few smaller but well-established ones, are geared to the RDBMS world.  These products have been used in production for many years and have a well-established pool of professionals to run and support them.

A separate set of specialized tools exists in the data lake realm of Hadoop, Spark, cloud storage and NoSQL.  In theory, these tools should help catalyze the adoption of the newer technologies, since they are starting to offer the same information management functionality (ETL, data quality and master data administration) that has existed on the RDBMS side for quite a while.

But the result has been that the industry, and customers’ data, have been stratified.  Highly curated data continues to go into relational databases, managed by older tools.  Less “formal” data, which is often very high in volume, goes to the data lake, managed by a new generation of products.

Anchored agility
Yes, data lakes facilitate the capture of data that might not otherwise be stored and analyzed; that’s the good news.  The bad news is that the bifurcation they have caused makes it very difficult to coordinate analysis and governance of what’s in the data lake with the more formal data in relational databases.

RDBMS sources contain data from operational applications, or from the data warehouses built on top of such data.  These applications and data warehouses are systems of record.  Their data is authoritative, definitive and universally adopted.  Data in the data lake is less transactional and more ephemeral, but it can provide keen insights into behavior and phenomena that relational data cannot.

Those insights lack gravitas, though, if they’re not correlated to the referenceable data of consensus that exists in relational sources.  Without that correlation, analysis of data lake data is not only complex and siloed, but also lacking in efficacy.  Relational data is an anchor in the data lake.  Anchorless analysis lacks specificity and authority – making it far less actionable.

From many, one
The worlds of relational data and data lakes need to come together for the latter to reach their full potential.  Finding relationships between data sets in the lake and tables in the RDBMS is not merely elegant, it’s imperative.  If you’re not unifying these two big worlds of data, then you’re not fully executing on a data-driven strategy for your business.  And without that, true digital transformation will be elusive.
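
To make the idea of correlating lake data with relational tables concrete, here is a minimal sketch rather than a recommended architecture. It assumes a hypothetical customers table in the system of record and hypothetical JSON-lines events standing in for files in the lake; every name and sample value (customers, customer_id, segment, events_by_segment) is invented for illustration. SQLite and an in-memory string substitute for a real RDBMS and real lake storage.

```python
import json
import sqlite3

# Hypothetical system of record: a customers table in a relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, segment TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Acme Corp", "enterprise"), (2, "Bolt LLC", "smb")],
)

# Hypothetical data lake content: loosely structured JSON-lines events.
lake_events = """\
{"customer_id": 1, "event": "page_view"}
{"customer_id": 1, "event": "download"}
{"customer_id": 2, "event": "page_view"}
{"customer_id": 99, "event": "page_view"}
"""


def events_by_segment(conn, raw_lines):
    """Count lake events per customer segment; unmatched IDs count as 'unknown'."""
    # The relational table supplies the trusted reference data (the "anchor").
    segments = dict(conn.execute("SELECT customer_id, segment FROM customers"))
    counts = {}
    for line in raw_lines.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        segment = segments.get(event.get("customer_id"), "unknown")
        counts[segment] = counts.get(segment, 0) + 1
    return counts


result = events_by_segment(conn, lake_events)
print(result)
```

Events whose customer_id has no match in the relational table land in an "unknown" bucket, which is one simple way to surface lake data that cannot yet be tied back to a system of record.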

Data lakes, despite sometimes being denigrated as “swamps,” are a true innovation in data analysis.  Their potential to help non-specialists realize the decades-long goal of data-focused decision support is huge.  But throwing business users head-first into the data lake will yield avoidably poor results.

Treat data lakes as the distinct repositories that they are, but correlate them with the systems of record business users already know, trust and understand how to use.  It’s the fusion of the warehouse and the lake that customers need.  Together, there’s deep insight to be had, anchored in a core business context.  Separately, the potential insights wander, rudderless in a stagnant body of data.