How to validate data?
Background: Enormous amounts of data are processed every day. In Q1 2015 alone, 8 million mobile payments were made at Starbucks; 3 to 5 million trades are completed at the NYSE on a typical day; and 26.2 billion credit card transactions were made in the United States in 2012.
When is it needed?
Unless we are talking about big data platforms (which many firms are beginning to adopt), the data mostly resides in relational databases and is propagated daily to a data warehouse. Every firm must produce documents with accurate information to satisfy its customers, business partners and auditors, so ensuring the correctness of these documents is of prime importance. Consider the familiar bank statement: it needs to be accurate and complete.
Things to consider when validating data
Business Logic / Functionality: Validate the business logic and conditions imposed on the data. These can come in many forms, such as:
- Default value – Certain columns in a table must have a default value. E.g., a column may be set to “Disabled” by default, and only when the user enables the functionality can it be turned on.
- Pattern based – Some columns must follow a certain pattern, and that pattern must be validated. E.g., a US ZIP code (5 digits, optionally followed by 4 digits) or a credit card number (16 digits).
- Referential integrity – Based on the design, certain columns may refer to columns in other tables and must maintain referential integrity. In these cases, every value in the referring column must exist in the referenced column.
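As a sketch, all three kinds of checks can be automated with a few queries. The schema below (customers, orders, an alerts_enabled flag) is invented purely for illustration:

```python
import re
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, zip TEXT,
                        alerts_enabled TEXT DEFAULT 'Disabled');
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO customers (id, zip) VALUES (1, '10001'), (2, '10001-1234');
INSERT INTO orders VALUES (100, 1), (101, 2), (102, 3);  -- customer 3 is an orphan
""")

# Default value: rows inserted without the flag should carry the default.
defaults = cur.execute("SELECT alerts_enabled FROM customers").fetchall()
assert all(v == ("Disabled",) for v in defaults)

# Pattern based: US ZIP is 5 digits, optionally '-' plus 4 more.
zip_re = re.compile(r"^\d{5}(-\d{4})?$")
bad_zips = [z for (z,) in cur.execute("SELECT zip FROM customers")
            if not zip_re.match(z)]

# Referential integrity: every orders.customer_id must exist in customers.
orphans = cur.execute("""
    SELECT o.id FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.id
    WHERE c.id IS NULL
""").fetchall()
print(bad_zips, orphans)  # [] [(102,)]
```

The same three query shapes (default scan, pattern scan, anti-join) apply to any relational schema; only the table and column names change.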
Validate every entry point – The ETL process transforms the data into its target state, and many intermediate steps occur between consuming the source data and producing the final output. A best practice is to validate the data at each intermediate step, to ensure correctness and discover issues as early as possible.
Inter- and intra-table validation – Once you understand the tables behind the application, you see the relationships between columns and tables. This comes in handy when doing inter- and intra-application checks. E.g., the total value of all trades conducted today by a customer must match the total value shown on a summary page; behind the scenes these are most likely two different tables. An automated check can be written to validate all such conditions and formulas.
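A reconciliation of this kind can often be expressed as a single query. In the sketch below the trades and daily_summary tables are hypothetical stand-ins for the detail and summary tables:

```python
import sqlite3

# Illustrative detail and summary tables; names and rows are made up.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE trades (customer_id INTEGER, value REAL);
CREATE TABLE daily_summary (customer_id INTEGER, total_value REAL);
INSERT INTO trades VALUES (1, 100.0), (1, 250.0), (2, 75.0);
INSERT INTO daily_summary VALUES (1, 350.0), (2, 80.0);  -- customer 2 is off by 5
""")

# Compare per-customer detail totals against the summary table;
# any row returned is a discrepancy to investigate.
mismatches = cur.execute("""
    SELECT s.customer_id, s.total_value, SUM(t.value) AS detail_total
    FROM daily_summary s JOIN trades t ON t.customer_id = s.customer_id
    GROUP BY s.customer_id
    HAVING s.total_value <> SUM(t.value)
""").fetchall()
print(mismatches)  # [(2, 80.0, 75.0)]
```

An empty result set means the two tables agree; a scheduled job can run such queries nightly and alert on any rows returned.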
Volume: The application must be able to handle large volumes of data. Performance usually degrades as volume grows. To test for volume, tests must be run with the maximum amount of data, with internal tables and files at full capacity.
Latency – How long does it take to fetch the data? Some may consider this part of performance testing rather than a data testing competency. However, I believe it belongs to data testing, as the tester needs to be intimately familiar with the data they are testing. Three major categories come to mind when we talk of latency:
- Real-time: Information is written with little or no delay. Online transactions, stock quotes etc.
- Near-time: Information is updated at a set, short interval.
- Batch: Information is processed after a longer interval, which could be nightly, weekly, monthly or quarterly.
Basics: During the ETL process one must:
- Identify the data sources
- Check row counts and run min and max functions before and after the load/processing
- Account for all rejected records and the reasons they were rejected
- Verify totals for numeric fields
- Perform boundary value analysis
- Check for nulls and precision
- Check for duplicates in the data
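Several of these basics can be scripted against a staged table. The staged table and its sample rows below are made up for illustration:

```python
import sqlite3

# Hypothetical source rows, including a duplicate key and a null amount.
source_rows = [(1, 10.5), (2, 20.0), (2, 20.0), (3, None)]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE staged (id INTEGER, amount REAL)")
cur.executemany("INSERT INTO staged VALUES (?, ?)", source_rows)

# Row count before vs after the load.
(row_count,) = cur.execute("SELECT COUNT(*) FROM staged").fetchone()
assert row_count == len(source_rows)

# Min, max and total for numeric fields.
lo, hi, total = cur.execute(
    "SELECT MIN(amount), MAX(amount), SUM(amount) FROM staged").fetchone()

# Null check.
(nulls,) = cur.execute(
    "SELECT COUNT(*) FROM staged WHERE amount IS NULL").fetchone()

# Duplicate check on the key column.
dupes = cur.execute(
    "SELECT id, COUNT(*) FROM staged GROUP BY id HAVING COUNT(*) > 1"
).fetchall()
print(lo, hi, total, nulls, dupes)  # 10.5 20.0 50.5 1 [(2, 2)]
```

Running the same script before and after each intermediate ETL step (as recommended above) turns the checklist into an automated regression check.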
How to get data for data testing?
One of the hurdles in data testing is getting test data in the first place that resembles most, if not all, data conditions as well as a sample production load. There are two ways to get your test data:
Create your own – When you cannot get access to production data, you must create your own.
- Ensures same input data every time
- Can create data conditions that are hard to find in real data
- For new applications and functionality (when no production data exists)
- No need for masking
- Requires understanding of the scenarios and the data structures to create test data
- If functional automation is used for data generation, it is a time-consuming process, and the applications must be set up in the test environment.
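One way to get the "same input data every time" benefit is to generate rows from a fixed random seed, deliberately including hard-to-find edge cases. The field names and edge cases below are invented for illustration:

```python
import random

def make_test_customers(n, seed=42):
    """Generate reproducible synthetic customer rows (illustrative fields)."""
    rng = random.Random(seed)  # fixed seed -> identical data on every run
    rows = []
    for i in range(1, n + 1):
        rows.append({
            "id": i,
            "zip": f"{rng.randint(0, 99999):05d}",
            "balance": round(rng.uniform(-1000, 1_000_000), 2),
        })
    # Deliberately append hard-to-find conditions: boundary and null values.
    rows.append({"id": n + 1, "zip": "00000", "balance": 0.0})
    rows.append({"id": n + 2, "zip": None, "balance": None})
    return rows

data = make_test_customers(3)
assert data == make_test_customers(3)  # same input data every time
```

Because the generator is seeded, failed test runs can be reproduced exactly, and new edge cases can be added in one place as the schema evolves.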
Get a subset of data from production – When you can access production data, you can utilize it for testing purposes.
- Good-quality, real data
- Data integrity is not an issue if you bring over all the related tables (in production all the systems are set up, which may not be the case in a test environment)
- Real user behavior pattern
- Sampling of data must be done right in order to get the best available set of data
- May require data masking, and thus the involvement of other groups
- Data masking increases the time it takes to make data available for testing
Conclusion – Huge amounts of data are collected and processed every day, and the data collected across organizations is growing exponentially. Managing, maintaining and processing this data is an arduous task. Validating data is more difficult than testing applications through user interfaces; data testing is a discipline in itself, requiring a deep understanding of the business and the tables behind the scenes. The benefits of data testing are clear – you ensure correctness for large volumes of data, increase test coverage and validate behind-the-scenes processes (such as data acquisition and transformation).