Data Analytics: Should You Build a New Data Lake or ‘Drain the Swamp’?

Modern data analytics is rapidly transforming business across industries. Early movers are disrupting business models and reshaping customer expectations, while others scramble to catch up and avoid becoming dinosaurs. For example:

  • A large pharmaceutical company struggling with rapidly rising clinical trial costs doubles down on its data analytics spend, expecting each dollar spent on analytics to cut two dollars from the overall research budget.
  • Leading retailers are using data analytics in almost every part of the company value chain, from predicting customer demand to inventory management, back-end store operations, dynamic pricing models, and customer profiling to determine the optimal mix of cross-sell and up-sell offers.
  • A full-service wealth management firm increased its commitment to analytics and data-driven AI technology in order to identify potential high-net-worth clients early, so it can target them and minimize losses to low-cost competitors.

All of these companies have vast amounts of data in many different, incompatible systems that they want to analyze in order to reap data-enabled business insight and competitive advantage. To cope with rapid data growth and the increased variability of data, they have turned to the concept of the ‘data lake’ as a cost-effective and flexible solution.


What Is a Data Lake?

In layman’s terms, a data lake is a large unified storage repository that is able to store vast amounts of raw data in its native format. This means that if a company has many different siloed or disparate systems such as Oracle, SAP, or DB2, all of the information can be consolidated into a single ‘lake’ for storage and easier end-user access. The downside is that if this is not done properly, you can end up with an even bigger mess in which you don’t understand all the incompatible information within the lake, or how the pieces of information relate to one another.
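The ingestion side of this idea can be sketched in a few lines of Python. The lake layout, catalog file, and `ingest` helper below are illustrative assumptions, not a real product API; the point is simply that each raw object lands in the lake unchanged, alongside a basic metadata record.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

LAKE = Path("datalake")            # hypothetical lake root directory
CATALOG = LAKE / "_catalog.jsonl"  # one metadata record per ingested object

def ingest(source_file: str, source_system: str) -> dict:
    """Copy a raw file into the lake unchanged and record basic metadata."""
    src = Path(source_file)
    raw = src.read_bytes()
    object_id = hashlib.sha256(raw).hexdigest()[:16]  # content-based id
    dest = LAKE / "raw" / source_system / f"{object_id}{src.suffix}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)        # native format preserved, no transformation
    record = {
        "object_id": object_id,
        "source_system": source_system,  # e.g. "Oracle", "SAP", "DB2"
        "original_name": src.name,
        "lake_path": str(dest),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": len(raw),
    }
    LAKE.mkdir(parents=True, exist_ok=True)
    with CATALOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Even this minimal record (source system, original name, ingest time) is what later makes the difference between a lake and a swamp: without it, the object is just anonymous bytes.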

Hadoop is the most widely used data lake implementation technology, although other solutions from Amazon Web Services (AWS), Microsoft Azure and Google Cloud Platform services are also popular. In this post we will discuss the importance of organizing the data correctly so that the business goals of analytics can be met.

While the concept of a data lake suggests that it should be easy to store a lot of data of many different types without too much effort, many early data lake implementations have been too casual about how they store the information, making it hard to retrieve meaningful data later. Such failed data lake implementations are vividly referred to as ‘data swamps’.


Building The Data Lake

What is needed to create a pristine data lake in the first place? Is it possible to drain a swamp and turn it back into a beautiful lake? Fortunately, the thinking, processes, and tools that allow organizations to create successful data lakes can also be used to ‘drain the swamp’. Needless to say, preventing swamps is preferable, but we will discuss both situations below.

Before you create your first data lake, you need to make critical decisions about both your structured and unstructured data. In both cases you have to create and store “enough” metadata describing your data before you pour it into the lake, so that you have some sense of what it is. Structured data is typically relational. The relations may be either explicit or implicit: explicitly defined in the metadata, or merely implied in the data and recoverable only through analysis. In most cases, data retrieval and analysis will be easier and more successful if the vast majority of relations are known and stored as metadata in the lake. Unfortunately, it is likely that many of your relational systems only track 20-30% of all relations explicitly in the metadata, leaving the remaining 70-80% of your data ‘hidden’ or undocumented. If this is the case, at some point you will need a data discovery tool to find and understand the undocumented data elements and relationships if you want a complete understanding of your data for such things as data analytics, regulatory compliance, or system modernization. We would argue that you should discover the ‘hidden’, implicit relations in your data before you start putting it into the lake, as it:

  • Means you only need to discover the data once, up front
  • Allows you to retrieve information from the lake with higher precision

The additional storage required for metadata is small and should not discourage you from up-front discovery of your relational data.
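As a rough illustration of what up-front relation discovery involves, the sketch below flags candidate foreign-key relationships by checking how completely one column’s values are contained in another column that looks like a key (an inclusion-dependency heuristic). The `candidate_relations` function and its in-memory table format are hypothetical; real discovery tools use far more sophisticated profiling and machine learning, but the underlying question is the same.

```python
def candidate_relations(tables: dict, threshold: float = 0.9) -> list:
    """
    tables: {table_name: {column_name: list_of_values}}
    Flags (child, parent) column pairs where nearly every child value
    appears in the parent column -- a crude inclusion-dependency check
    for undocumented foreign-key relationships.
    """
    results = []
    for child_t, child_cols in tables.items():
        for parent_t, parent_cols in tables.items():
            if child_t == parent_t:
                continue
            for c_name, c_vals in child_cols.items():
                c_set = set(c_vals)
                if not c_set:
                    continue
                for p_name, p_vals in parent_cols.items():
                    # the parent side should look like a key: values unique
                    p_set = set(p_vals)
                    if len(p_set) != len(p_vals):
                        continue
                    overlap = len(c_set & p_set) / len(c_set)
                    if overlap >= threshold:
                        results.append((f"{child_t}.{c_name}",
                                        f"{parent_t}.{p_name}",
                                        round(overlap, 2)))
    return results
```

Relations found this way can then be written into the lake’s metadata once, rather than rediscovered by every analyst who pulls the data.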

Now, let’s turn to unstructured data. Most unstructured data has some structure, even though it is far less structured than, say, a general ledger. For example:

  • Free text is made up of paragraphs, headlines, sentences, words, and letters
  • Audio is made up of time-samples of frequency and amplitude
  • Images are made up of pixel matrices where each pixel color is encoded as a numeric

In fact, most popular analytical methods – e.g. supervised machine learning – would not work if the data were completely without structure. This kind of data also needs some metadata before it is stored in the lake, so that it can be understood later. This can be illustrated by analogy with a librarian: no librarian would put a new book on just any shelf without recording some basic information about it (e.g., title, author, shelf location). Similarly, an audio clip should not be dumped in the lake without metadata. The metadata (e.g. source, time of recording) together with the actual clip is the retrievable information that should go into the lake.
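The librarian analogy can be made concrete with a small sketch: store the clip and a ‘sidecar’ metadata file side by side, so the clip is retrievable by more than its filename. The `store_clip` helper and the sidecar naming convention are assumptions for illustration (and the recording time here is simply taken as the ingest time).

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def store_clip(clip_path: str, lake_dir: str, source: str) -> Path:
    """Store an audio clip together with a sidecar metadata file,
    so the clip can later be found by source, time, or format."""
    clip = Path(clip_path)
    dest_dir = Path(lake_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / clip.name
    dest.write_bytes(clip.read_bytes())      # the clip itself, unchanged
    metadata = {
        "source": source,                    # where the clip came from
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "format": clip.suffix.lstrip("."),
        "size_bytes": dest.stat().st_size,
    }
    sidecar = dest.parent / (dest.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return sidecar
```

The clip and its sidecar travel together, just as a book and its catalog card describe the same shelf location.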


Dealing with a ‘Swamp’

If you already have a lake where a lot of data has been stored without the appropriate metadata, you are probably finding it hard to retrieve the right data for analysis. Furthermore, if the data you retrieve from the lake lacks metadata, you will have to discover and recreate the metadata over and over again, which is obviously wasteful.

Your choice is either to start over and create a new lake, or to gradually fix the metadata deficit and turn your swamp back into a lake. In the latter case, you would use much the same thinking and tools described in the proactive case above. The key difference is that you will have to add metadata discovery and write-back to every retrieval from the lake until it has been restored to pristine condition.
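That discover-and-write-back loop might look like the following sketch, where `discover_metadata` stands in for whatever profiling or discovery tool you actually use: the first retrieval of an object pays the discovery cost and writes the result back, so every later retrieval finds the metadata already in place. The helpers and sidecar convention are illustrative assumptions.

```python
import json
from pathlib import Path

def discover_metadata(path: Path) -> dict:
    """Placeholder discovery step -- in practice a profiling or
    discovery tool would run here. We record only what can be
    derived from the object itself."""
    return {"format": path.suffix.lstrip("."),
            "size_bytes": path.stat().st_size}

def retrieve(path_str: str) -> tuple:
    """Retrieve an object; if its metadata is missing, discover it
    once and write it back so the next retrieval is free."""
    path = Path(path_str)
    sidecar = path.parent / (path.name + ".meta.json")
    if sidecar.exists():
        meta = json.loads(sidecar.read_text())  # this object is already 'clean'
    else:
        meta = discover_metadata(path)          # the expensive step, paid once
        sidecar.write_text(json.dumps(meta))    # write-back: drain the swamp
    return path.read_bytes(), meta
```

Over time, the objects that are actually used accumulate metadata, and the swamp gradually turns back into a lake, starting with the data people care about most.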


Complete, Understandable Data Is Essential for Successful Analytics

Having the right data is a critical foundation for successful analytics, or as the saying goes, ‘Garbage In, Garbage Out’. But finding the right data is impossible without a complete understanding of your data assets across the entire enterprise. With knowledge of the data landscape, data scientists and citizen analysts can find more relevant data and make better predictions that positively impact the firm’s bottom line. Unfortunately, it is not easy to discover data relations without tools that can automate large parts of the discovery process. Manual methods that rely on third-party consultants and internal SMEs are tedious, costly, and prone to error.

To solve this problem, ROKITT has developed a product called ROKITT ASTRA that performs automated data discovery across an enterprise using machine learning. ROKITT ASTRA goes well beyond the information found at the metadata level of a company’s databases and is able to discover the ‘hidden’, undocumented data that can make up as much as 80% of a company’s data assets, often residing in legacy or siloed systems.

Businesses that aggressively leverage data analytics must ensure that they have access to the most complete data available. The alternative is that decisions will be made on incomplete data, leading to possibly incorrect or misleading conclusions. Products like ROKITT ASTRA ensure that your investment in analytics and visualization solutions is fueled with the most complete data, helping deliver the best, most comprehensive analytical insights.