Of data governance, and minding the store

Data governance is a hot concept right now. Given the twin phenomena of frequent data breaches and ever more stringent data protection regulations, this should be no surprise. But as is often the case with hype-driven terms, not everyone knows what data governance actually means. Analogies can help with complex terms, and I think one would help here. The analogy I have in mind may seem a bit peculiar, because it involves a major metropolis and decades-old memories of what it was like. Bear with me, as I think it will clarify things.

The setting for the analogy is New York City, a city now littered with chain retail establishments. It wasn’t always that way though: a few decades ago, “mom and pop” stores (small privately-owned retail shops) were ubiquitous. I grew up in New York, and I remember lots of things about these stores: for example, the creative signage and the friendly owners who all knew their regular customers and treated them like friends and VIPs.  Beyond that though – and here’s the root of the analogy – some of these stores would occasionally suspend business for an entire day, posting a sign on the door that said “closed for inventory.”

Why the big deal?
This always struck me as odd. Why was taking inventory such a significant undertaking that the store had to close, and give up a day’s revenue?  Why was inventory not being kept on a continuous basis?  How hard could that be? And what in the world does it have to do with data governance?

As it turns out, the notion of inventory is very important in the world of data and analytics too. And, just as it was for those mom and pop shops, taking comprehensive inventory of data can be hard. Moreover, regulations imposed by various governments, most prominently the European Union’s General Data Protection Regulation (GDPR), mean that having such an inventory is critical to the legal well-being of companies subject to them. As such, the mapping between data governance and such an inventory is fairly direct and tight.

What’s on hand?
But what does inventorying your data mean, and how specific must this inventory be? Start by thinking about mom and pop’s inventory: they need to know what they have in stock and where it is, so they can restock their shelves and re-order when stock falls below a certain threshold. On the data inventory side, it’s important to know what databases, tables and file-based data sets you have, what they contain and where they are located.

And while data sets may not need to be replenished the way goods for sale must be, knowing which data sets you have, what’s in them and where they’re located is just as important to your efforts. This part of your inventory lets you pick the right data set for the right requirement, and know where to find it.
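To make the "what, what's in it, and where" idea concrete, here is a minimal sketch of a data set-level inventory in Python. Everything in it is an assumption for illustration: the `DataSetEntry` and `DataInventory` names, the fields, and the example locations are hypothetical, not part of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class DataSetEntry:
    """One line in a hypothetical data inventory."""
    name: str                                         # table or file name
    kind: str                                         # "database", "table" or "file"
    location: str                                     # where the data set lives
    columns: list[str] = field(default_factory=list)  # what it contains


class DataInventory:
    """Keeps entries by name and answers 'what do we have, and where?'."""

    def __init__(self) -> None:
        self._entries: dict[str, DataSetEntry] = {}

    def register(self, entry: DataSetEntry) -> None:
        self._entries[entry.name] = entry

    def locate(self, name: str) -> str:
        """Return the recorded location of a data set."""
        return self._entries[name].location

    def find_by_column(self, column: str) -> list[str]:
        """Return the names of all data sets that contain the given column."""
        return sorted(e.name for e in self._entries.values() if column in e.columns)


# Made-up example entries.
inventory = DataInventory()
inventory.register(DataSetEntry(
    "customers", "table", "postgres://crm/public", ["id", "email", "name"]))
inventory.register(DataSetEntry(
    "orders.csv", "file", "s3://example-bucket/exports/", ["order_id", "customer_id", "total"]))
```

Calling `inventory.locate("customers")` answers the "where is it?" question, and `inventory.find_by_column("customer_id")` answers "which data sets hold this?". A real inventory would record far more, but the shape is the same.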

Don’t just score; protect your own goal
While a data set-level understanding of an organization’s data is a good start, it in no way constitutes a full inventory. That’s because data inventory serves both proactive and defensive facets of data management.

The proactive side entails using data for insights and operational or commercial success or advantage.  Having a dataset-level inventory of the data serves that end well. It’s the same with a small retail business: the owners need to know what’s on-hand so they can promote and sell it, and they need to safeguard their goods-on-hand and keep the quantities sufficient to meet demand.

The defensive side, which entails protecting data and complying with regulations, is just as important. In fact, because of regulatory penalties for non-compliance (under GDPR, up to the larger of €20M or 4 percent of global annual revenue) and the catastrophic economic and reputational impact of a data breach, the defensive side of data management may even be the more important one.
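The "larger of" rule is easy to misread, so here is the arithmetic as a tiny sketch. The function name is my own, and the calculation only reflects the ceiling described above; it says nothing about what fine a regulator would actually impose.

```python
def gdpr_fine_ceiling_eur(global_annual_revenue_eur: float) -> float:
    """Upper bound described above: the larger of EUR 20M or 4% of revenue."""
    flat_cap = 20_000_000
    revenue_cap = global_annual_revenue_eur * 4 / 100
    return max(flat_cap, revenue_cap)
```

For a company with €100M in revenue, 4 percent is only €4M, so the €20M figure is the ceiling. At €1B in revenue, the 4 percent figure (€40M) takes over.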

Stand your ground
While the mom and pop shop may not have the fear of punitive measures as a motivator, its proprietors nonetheless need to protect their stock from theft and, if there’s a recall on any item, they need to know where all units are stored so they can return them to their distributor or vendor.

Likewise, on the data side, protection involves identifying and recording where sensitive data – including personally identifiable information (PII) – is stored, to prevent identity theft and subsequent fraud.  Identifying and knowing the location of redundant data – i.e. data sets that are significantly derivative or duplicative of others – is of paramount importance as well.  Command of redundant data is especially critical for GDPR and its concept of “the right to be forgotten.” After all, you can’t remove all data about a person unless you know where every instance or copy of that data lies.
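As a rough illustration of both ideas, the sketch below flags columns that look like they hold PII and then lists every data set in which one person's identifier appears. It is a deliberately naive approach under stated assumptions: the two regular expressions, the function names and the sample data are all hypothetical, and simple pattern matching will miss plenty of real-world sensitive data.

```python
import re

# Hypothetical patterns; real PII detection needs far more than two regexes.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def flag_pii_columns(rows):
    """Return {column: set of PII kinds} for columns whose values match a pattern."""
    flagged = {}
    for row in rows:
        for column, value in row.items():
            for kind, pattern in PII_PATTERNS.items():
                if pattern.search(str(value)):
                    flagged.setdefault(column, set()).add(kind)
    return flagged


def find_copies(data_sets, identifier):
    """Return the names of all data sets in which the identifier appears."""
    return sorted(
        name
        for name, rows in data_sets.items()
        if any(identifier in str(value) for row in rows for value in row.values())
    )


# Made-up data: the second data set is a redundant copy of customer contact data.
data_sets = {
    "crm.customers": [{"id": 1, "contact": "ada@example.com", "tax_id": "123-45-6789"}],
    "marketing.leads_copy": [{"lead": "ada@example.com", "score": 7}],
    "warehouse.products": [{"sku": "A-100", "price": 9.99}],
}
```

Here `flag_pii_columns(data_sets["crm.customers"])` marks the `contact` and `tax_id` columns, and `find_copies(data_sets, "ada@example.com")` returns both `crm.customers` and `marketing.leads_copy`. That second result is the point: an erasure request is only satisfied once every one of those locations has been dealt with.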

Relationships matter
In the world of physical inventory, parts belong to sub-assemblies which, in turn, belong to finished goods. And some parts are interchangeable. Depending on the type of business, knowing and understanding these peer and hierarchical relationships is crucial to having a truly comprehensive inventory. Merely being able to enumerate individual items is not enough.

By the same token, there’s one more facet of inventorying your data that may be the most important of all, and it is germane to both proactive and defensive work: understanding the relationships between your data sets, including relationships within or between databases or data lakes…or even between one of each.

Knowing these relationships helps your proactive data analysis, as you’ll be in a better position to cross-reference the data set you are focusing on with others, and to enhance it. It’s necessary on the defensive side too, both because regulations may outright require it and because understanding these relationships can help discern data lineage.

Relationship detection, also referred to as data discovery, is hard work if you’re doing it manually. It can work well, though, if automation – driven by Artificial Intelligence/Machine Learning (AI/ML) – is employed. Even in cases where the metadata is fairly complete, unanticipated relationships may exist, especially as data volumes grow, and automated discovery is therefore critical. Io-Tahoe’s CEO, Oksana Sokolovsky, speaks to the importance of automated data discovery, and its AI/ML underpinnings, in a recent blog post on VMblog.
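To give a feel for what detecting an undeclared relationship involves, here is a small sketch. To be clear, this is not AI/ML and not how any particular product works; it is a plain value-overlap heuristic I am substituting for illustration, with hypothetical function names and made-up columns. It treats column A as a likely reference to column B when nearly all of A's distinct values also appear in B.

```python
from itertools import permutations


def containment(child, parent):
    """Fraction of distinct child values that also appear in parent."""
    child_values, parent_values = set(child), set(parent)
    if not child_values:
        return 0.0
    return len(child_values & parent_values) / len(child_values)


def candidate_relationships(columns, threshold=0.9):
    """Return (child, parent, score) for ordered column pairs at or above threshold."""
    found = []
    for (child_name, child), (parent_name, parent) in permutations(columns.items(), 2):
        score = containment(child, parent)
        if score >= threshold:
            found.append((child_name, parent_name, score))
    return found


# Made-up columns: no declared foreign key links orders to customers.
columns = {
    "customers.id": [1, 2, 3, 4],
    "orders.customer_id": [1, 1, 3, 4],
    "products.sku": ["A-100", "B-200"],
}
```

With this data, `candidate_relationships(columns)` reports that `orders.customer_id` likely refers to `customers.id`, and nothing else. The heuristic also shows why manual or naive approaches struggle: it compares every pair of columns, and small integer columns overlap by coincidence all the time, which is where smarter, automated discovery earns its keep.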

Easier said than done, but do it you must
Having pondered all of this, I guess I understand why some of those mom and pop shops closed for an entire day to take their inventory.  It’s not a simple process, and its importance is not to be underestimated.

The big chain stores can’t afford to close for a full day though. Business must go on, and continuous inventory must be their modus operandi. The same is true for most organizations and their data inventory. Baseline discovery work needs to get done first. After that, it takes products and platforms that constantly monitor data to keep the inventory of data sets, sensitive data, redundant data and inter- and intra-database relationships up to date.
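The difference between "closed for inventory" and continuous inventory can be sketched as a comparison between two snapshots. This is only an illustration under my own assumptions: the `diff_inventory` function and the snapshot format (data set name mapped to its list of columns) are hypothetical.

```python
def diff_inventory(previous, current):
    """Compare two snapshots mapping data set name -> list of column names."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(
        name
        for name in previous.keys() & current.keys()
        if previous[name] != current[name]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Made-up snapshots: the baseline, then a later look at the same estate.
baseline = {
    "customers": ["id", "email"],
    "orders": ["order_id", "customer_id"],
}
latest = {
    "customers": ["id", "email", "phone"],  # a new, possibly sensitive column
    "leads_copy": ["email"],                # a new, possibly redundant data set
}
```

Running `diff_inventory(baseline, latest)` shows one data set added, one removed and one changed. Only those differences need re-examining for sensitive or redundant data, so the store never has to close.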