Machine learning-driven foundational data discovery: Adding efficacy to the data lake/data warehouse combination

Posted at 20:57h

Once, centrally administered data warehouses were the most authoritative and comprehensive repositories available for analyzing corporate information. With the passage of time and the advent of self-service BI tools like Microsoft's Power BI, Tableau and Qlik, people in the business units have become motivated to perform their own analysis work. And that motivation has extended to a desire to introduce their own data sets and perform similar analysis on them.

Growth begets growth
This increased BI adoption has led to growth in data volumes and the variety of data sources. This, in turn, has led to increased computing and storage demands and accompanying increases in cost. With those new demands and costs, legacy data warehouse systems have faced challenges in meeting the required workloads in a cost-effective manner.
Cloud computing, cloud storage and data lake technology have emerged as the responses to these challenges. The ability to store data in raw format and defer modeling it until the time of analysis, combined with the compelling economics of cloud storage and distributed file systems, has changed everything.

In with the new…but the old isn’t going out
As much as the data lake model has taken root, the data warehouse is still with us. The two need to coexist and our tools and platforms need to accommodate this new heterogeneity. That coexistence isn’t well accommodated by the tools market, which is largely split along data warehouse-data lake lines. Older, established tools that pre-date Hadoop and data lakes were designed to work with relational database management systems (RDBMSes). Newer tools that grew up in the Big Data era are more focused on managing individual data files kept in cloud storage systems like Amazon S3 or distributed file systems such as Hadoop’s HDFS.

Enterprises don’t want a broken tool chain, though; they want technologies that can straddle the line and work with platforms on either side of it. Customers may have two sets of data technologies, but the data in each must be leveraged together, for the benefit of the whole company. Precisely because RDBMSes and data lakes must coexist in equilibrium, the data within them must be queried and analyzed in a coordinated fashion. And that’s very hard to do when companies have separate tools for each broad class of data repository.

Versatility wins
In fact, the true realization of that harmonized orchestration doesn’t just come from tools that can work with either repository class. The winners are tools that work across both and can bring them together. This is true for query and analysis: BI tools that can fetch data from both repository types, join that data together, and then present visualized insights from that integration definitely qualify.
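To make that cross-repository join concrete, here is a minimal sketch of what a straddling tool does internally. It assumes a hypothetical setup: an in-memory SQLite database stands in for the warehouse, and an in-memory CSV stands in for a raw file sitting in the lake; the table and column names are purely illustrative.

```python
import io
import sqlite3

import pandas as pd

# "Warehouse" side: a modeled, governed table in an RDBMS
# (an in-memory SQLite database stands in for it here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "East"), (2, "West")])
conn.commit()
warehouse_df = pd.read_sql("SELECT * FROM customers", conn)

# "Lake" side: raw, un-modeled file data; schema is applied
# only now, at analysis time (schema-on-read).
lake_csv = io.StringIO("customer_id,clicks\n1,42\n2,17\n")
lake_df = pd.read_csv(lake_csv)

# The cross-repository join a BI tool performs before visualizing.
combined = warehouse_df.merge(lake_df, on="customer_id")
print(combined)
```

The same pattern scales up when the lake side is a Parquet file in S3 or HDFS and the warehouse side is a production RDBMS; only the connection details change.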

This is an issue for data management tools, too. If you’re going to manage a comprehensive data catalog, then data from both data warehouses and data lakes must be in it. If data sets from the data lake are not properly cataloged, the lake will quickly become mismanaged, earning the infamous distinction of data “swamp.” This is especially the case because of the physical format of a data lake: a collection of files in a folder structure.

Draining the swamp
The difference between lakes and swamps is much like the distinction between well-organized and disorganized hard disks: it’s one thing to throw a bunch of files in a single “Documents” folder; it’s quite another to have an intuitive and well-organized folder structure established and rigorously adhered to. Furthermore, filling out document properties, like title, tags and comments, makes your documents easier to find and share. If you do all this on your hard drive, you’ll be able to find what you need, when you need it, and quickly.

Similarly, having a well-organized data catalog, with lots of metadata applied to the data set files within it, makes those data sets more discoverable and the data lake, as a whole, more usable.
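A data structure can illustrate what that metadata buys you. The sketch below uses a hypothetical, minimal catalog entry; the class name, fields and locations are invented for illustration and don't correspond to any particular catalog product.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """A minimal, hypothetical catalog record for one data set."""
    name: str          # short, human-friendly identifier
    location: str      # URI in the lake, or table path in the warehouse
    owner: str
    title: str = ""
    tags: list = field(default_factory=list)
    description: str = ""

def search(entries, tag):
    """Discoverability in miniature: find data sets by tag."""
    return [e for e in entries if tag in e.tags]

# One catalog spanning both repository classes.
catalog = [
    CatalogEntry("orders", "s3://lake/sales/orders/", "sales-team",
                 title="Raw order events", tags=["sales", "raw"]),
    CatalogEntry("dim_customer", "warehouse.dbo.dim_customer", "dw-team",
                 title="Customer dimension", tags=["sales", "curated"]),
]
print([e.name for e in search(catalog, "sales")])
```

The point is less the code than the contract: once every data set, lake or warehouse, carries searchable metadata, users can find what they need without trawling folder structures.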

But beyond mere organization, for a catalog product to be at its best, data from distinct sources must truly coalesce; otherwise the integration at query time will be underserved. Can your data catalog see data in both your data warehouse and data lake? If the answer is yes, can the catalog tell you which data sets, from each, can or should be used together? In other words, do your governance tools chronicle the relationships within and flows between such heterogeneous data sets – do they perform foundational data discovery?

Automation fascination
If the answer to all these questions is yes, you’re in a good spot, but you’re not done yet. Even if relationships and flows can be documented in the catalog, you then need to determine whether that work must be done manually or whether the relationships and flows can instead be detected automatically. And even if automatic detection is supported, you need to determine whether it works in instances when there is no schema information documenting the relationships or flows.

Given the data volumes in today’s data lakes, both in terms of the number of data sets present and the size of each, discovering such relationships manually is sufficiently difficult as to be thoroughly impractical. Automation is your savior, and algorithmic detection of such relationships and flows can be achieved through analysis of data values, distributions, formats and so forth.
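One simple, purely algorithmic signal of that kind is value overlap: if two columns in two different data sets share most of their distinct values, they are likely related (for instance, as join keys). This is a sketch under assumed, made-up column data, using Jaccard similarity over the distinct values.

```python
def value_overlap(col_a, col_b):
    """Jaccard similarity of two columns' distinct values: a simple,
    purely algorithmic hint that two columns may be related keys."""
    a, b = set(col_a), set(col_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical columns: one from a warehouse table, one from a lake file.
warehouse_ids = [101, 102, 103, 104]
lake_ids = [102, 103, 104, 105]
unrelated = ["red", "green", "blue"]

print(value_overlap(warehouse_ids, lake_ids))   # high overlap: likely related
print(value_overlap(warehouse_ids, unrelated))  # no overlap: likely unrelated
```

Real systems refine this with sampling, sketches and distributional statistics so it scales to thousands of data sets, but the core idea is the same: let the values themselves reveal the relationships when no schema documents them.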

The case for machine learning
The solution here is Machine Learning (ML). Today’s ML technology hits its sweet spot when deployed to help automate a process that’s repetitive and procedural, and yet needs to accommodate manual intervention and human expertise.

Machine learning can use sophisticated heuristics to perform foundational data discovery predictively. This approach is much more contextually savvy than a purely algorithmic one. Better yet, as the “L” in “ML” would indicate, the models can learn from human experts – the users of the product, who audit the automated work – and the heuristics will improve over time. This adds a layer of customization and context specificity to the automated relationship and flow detection.
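That feedback loop can be sketched in a few lines. The following toy example (pure Python, no ML library, with entirely invented feature values) shows the shape of it: each candidate relationship is scored from features like value overlap and column-name similarity, human auditors confirm or reject suggestions, and perceptron-style weight updates fold their judgments back into the model.

```python
def score(features, weights):
    """Weighted sum of a candidate relationship's feature values."""
    return sum(f * w for f, w in zip(features, weights))

def train(examples, weights, lr=0.1, epochs=20):
    """Perceptron-style updates from human audit labels
    (1 = auditor confirmed the relationship, 0 = rejected)."""
    for _ in range(epochs):
        for features, label in examples:
            pred = 1 if score(features, weights) > 0 else 0
            err = label - pred
            weights = [w + lr * err * f for w, f in zip(weights, features)]
    return weights

# Features per candidate: [value_overlap, name_similarity, 1.0 (bias)].
audited = [
    ([0.9, 0.8, 1.0], 1),  # auditor confirmed this link
    ([0.7, 0.9, 1.0], 1),
    ([0.1, 0.2, 1.0], 0),  # auditor rejected this one
    ([0.2, 0.1, 1.0], 0),
]
weights = train(audited, [0.0, 0.0, 0.0])

def suggest(features):
    """Propose a relationship only if the learned model scores it positive."""
    return score(features, weights) > 0

print(suggest([0.85, 0.75, 1.0]))  # resembles the confirmed examples
```

Production systems would use richer features and real models, but the loop is the essence: automated suggestions, human audit, and weights that drift toward the organization's own notion of which relationships matter.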

ML-driven foundational data discovery makes governance of your data manageable. And manageable data governance means that data warehouses and data lakes can coexist, not just peacefully, but productively. With each analytic data repository handling its own appropriate workloads, organizations can better achieve what has seemed an elusive goal: being data-driven. From there, true digital transformation can take place.