Data Validation - During Ingestion into Data Lake
Any enterprise that wants to harness the power of its data almost always begins by building a data lake.
By definition, a data lake is a centralized repository that stores all structured, semi-structured, and unstructured data whose value is yet to be discovered by downstream pipelines. The data in the data lake is raw in nature and is refined by subsequent data pipelines.
Many times I am asked: how do we validate the data ingested into the data lake? In this article, we will first see whether there is a need to validate at all. If yes, what types of data can be validated, and what exactly do we validate about the data? This is a question that many enterprises grapple with at the start of their data lake journey.
Also, I will talk about the two possible architectures for introducing data quality management into your data pipelines.
Some companies have enthusiastically started creating data lakes with all possible data generated by their systems - only to realize later that the data cannot be trusted at all. When they start trying to use it, they see many problems and discrepancies, and that casts a cloud of doubt on all the data in the data lake. The data may be corrupted, may have missing values where they are not expected, or may even contain duplicates that lead to wrong aggregations. Such a data lake is of no use at all.
There may be an argument that absolutely perfect data is not required. In the statistical world, the errors could also average out.
But let us take a look at who the typical users of data lake data are, and whether they would be happy with approximately correct data. Typically, the users are
Business Intelligence and Analytics Users
Big Data Engineers
Machine Learning Engineers
Do you believe any of these users would be fine with wrong data?
Imagine a data analyst trying to draw insights about sales over the last week by plotting trends. If he sees almost double the sales on a particular day of the week, it would intrigue him, and he would want to probe further into what triggered the trend. On probing, he realizes it is because duplicate data was received on that day! Does that help him trust the data in the data lake? Absolutely not.
Take the case of a model that a data scientist and an ML engineer have deployed to forecast sales for the next few weeks based on past sales. Because of this duplicate data, the model could predict that every Wednesday sees double the sales of other weekdays. This could have a snowball effect on the number of products stocked in the store, the quantities ordered, the wastage, and so on. The model would be questioned for accuracy. However, even the best of models can only do so much if the input data is bad. As we all know, garbage in leads to garbage out!
So, there is no question that data correctness is important within a data lake, and data validation is necessary to ensure it.
Every user of the data lake looks for a guarantee of data quality. Hence, a one-time validation of data when it is ingested into the data lake is essential.
But remember, not all data can be validated. Unstructured data like video, audio, and images cannot be validated, and most often does not need to be. In fact, data rigour is probably required only on structured data and, sometimes, semi-structured data.
Next arises the question: what exactly do you validate about the data, when you have no context for it?
What to Validate?
Two aspects can be validated, depending on the type of source.
To explain this, let me take a sample data pipeline on a big data platform as a reference for the data validation discussion. The points of interest in the diagram below are only the first two layers - the Landing Zone and the Staging Zone.
Typically, you keep the raw data, as obtained, in what may be called the 'Landing Zone'. You could then convert the data to a file format more appropriate to your platform for the optimal reads/writes you intend, along with applying a schema to it - if possible and required. That I would call the 'Staging Zone'. From here, depending on how the data is consumed, you could combine, join, aggregate, etc., and make the data appropriate for the consumer (the Consumption Zone). This pipeline is typical and works most of the time.
Typically validations can be done in the Landing Zone and the Staging Zone. In the Landing Zone, you have an opportunity to validate the "Exactness" of data with the source.
In the Staging Zone, you can validate data for its own correctness without a comparison with anything else.
Landing Zone checks - Exactness
In the Landing Zone, you have raw data and the guarantee you want here is that it is "exactly" the same as the data in the source - an 'Exactness' KPI (Key Performance Indicator). To ensure it is true to the source, the checks that can be done are:
Row counts, column counts, null counts, missing value counts, and min/max validations are all simple checks that you can run on both the source and the data in the landing zone to ensure you have received your data correctly. Simple and useful checks.
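As a minimal sketch of these exactness checks in plain Python - with in-memory rows standing in for queries against the actual source system and landing-zone files, and all function names hypothetical - comparing simple profile metrics could look like this:

```python
# Sketch: compare simple profile metrics between the source extract and the
# data landed in the lake. In practice, these numbers would come from queries
# against the source database and the landing-zone files.

def profile(rows, column):
    """Compute row count, null count, distinct count, and min/max for one column."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

source = [{"amount": 10}, {"amount": 25}, {"amount": None}]
landed = [{"amount": 10}, {"amount": 25}, {"amount": None}]

# Exactness: the landed profile must match the source profile.
assert profile(source, "amount") == profile(landed, "amount")
```

Because each metric is a single aggregate per column, both sides can compute it independently and only the small summaries need to be compared.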
Null checks especially have to be done in the landing zone, on the raw data, before any conversions or compressions, because during conversion some data could get forced to null - and that does not mean the data was not fetched correctly. So, if data has become corrupted or null during transfer, it is best caught in the raw data itself.
A checksum can be used to validate that the data in an entire column or row is exactly as it is in the source. However, it is an expensive operation and needs to be used judiciously. Also, if the checksum is computed over large decimal values, the last few digits can mismatch depending on rounding differences between the source and target systems. So, use this carefully and only if required.
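One way to sketch a column checksum - hypothetical, using Python's standard `hashlib`, and assuming values are canonicalized to sorted strings so row order does not matter - is:

```python
import hashlib

# Sketch: checksum a whole column by hashing its values in a canonical
# (sorted, string) form, so source and target can be compared without
# shipping the data itself. Note the caveat above: floating-point rounding
# differences between systems can still break string equality.

def column_checksum(values):
    canonical = "|".join(sorted(str(v) for v in values))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

source_col = [100, 200, 300]
landed_col = [300, 100, 200]          # same data, different order

assert column_checksum(source_col) == column_checksum(landed_col)
assert column_checksum(source_col) != column_checksum([100, 200, 301])
```

In a real pipeline, both sides would compute the digest locally and exchange only the two hash strings.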
Distinct counts validate that the number of unique values in a column is the same in the source and the raw zone. This, too, is an expensive operation and hence should be used judiciously.
Note that you do not have to check every column of data obtained from the source system. Select and validate only as much as is necessary to give you confidence that the data is "Exact". This ensures optimal utilization of the resources available to you and reduces the latency of your data pipelines.
Another point to note: I have been mentioning columns of data here, as this mostly applies to structured data sources. If you are ingesting from a database or a data warehouse, these kinds of checks at the raw stage turn out to be very useful.
If you are getting data from a legacy system that delivers files with headers or footers containing metadata - such as the number of rows, the number of columns, or the total bytes of data - then all you can validate after ingestion is what that header/footer metadata describes.
Staging Zone Checks
In the staging zone, you have various other checks that you can do that deal with data for what it is rather than comparing it with a source.
Typically, in the staging area, you would have converted your data into columnar or row file formats depending on your architecture and usage - say, into ORC, Parquet, or Avro files - and probably even applied some compression like Snappy. You could even build external Hive tables over these files with a defined schema so that the data can be queried.
In the case of semi-structured data in the landing zone, you might have culled out the relevant data and brought some structure to it in the staging zone.
At this stage, the following data quality KPIs would make a lot of sense: Completeness, Consistency, Validity, and Uniqueness.
The specifics of what each of these stands for are provided in the table below.
To elaborate on each of these:
Completeness checks consist of checking for null or missing values. The null checks here are different from the null checks you do at the raw stage: here you check whether there are null values in columns where you do not expect nulls to exist. This is very much possible when you convert raw data into a different file format and impose a schema on it. When the data is read, if the schema mismatches the data underneath, the value is returned as null. For example, if string data sits in a column typed as integer, it will be returned as null. This is a challenge unique to systems that apply schema-on-read rather than schema-on-write. These kinds of issues introduced by the staging layer can be caught here.
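The schema-on-read effect can be sketched in a few lines of plain Python - a hypothetical stand-in for what engines like Hive do when the declared type does not fit the underlying data:

```python
# Sketch of the schema-on-read effect: values that do not fit the declared
# type are silently coerced to None - exactly the kind of null a
# completeness check in the staging zone should catch.

def read_as_int(raw_values):
    out = []
    for v in raw_values:
        try:
            out.append(int(v))
        except (TypeError, ValueError):
            out.append(None)   # schema mismatch -> null, not an error
    return out

raw = ["42", "oops", "7"]       # "oops" cannot satisfy an integer schema
staged = read_as_int(raw)       # [42, None, 7]

# Completeness check: count nulls in a column where none are expected.
null_count = sum(1 for v in staged if v is None)
assert null_count == 1
```

The key point is that no error is raised at read time; only a completeness check run over the staged data surfaces the coerced null.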
Consistency checks verify datatype, data length, and restricted value sets. If a column is of type String, you can check that the values are indeed characters and not alphanumeric data. If a column has a fixed length, check that all the data has the defined fixed length. These are not assured by systems that apply schema-on-read. And if a column is supposed to contain only a set of predefined strings - for example, the days of the week or the months of the year - check that the unique values in the column match the predefined set of acceptable values.
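These three consistency checks might be sketched as simple column-level predicates (plain Python, hypothetical function names):

```python
# Sketch of consistency checks: datatype, fixed length, and a restricted
# value set, each applied to one column of values at a time.

def check_alpha(values):
    """Datatype check: every value is purely alphabetic (no digits)."""
    return all(v.isalpha() for v in values)

def check_fixed_length(values, length):
    """Length check: every value has the defined fixed length."""
    return all(len(v) == length for v in values)

def check_allowed(values, allowed):
    """Restricted-value check: unique values fall within the allowed set."""
    return set(values) <= allowed

DAYS = {"MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"}

col = ["MON", "WED", "FRI"]
assert check_alpha(col)
assert check_fixed_length(col, 3)
assert check_allowed(col, DAYS)
assert not check_allowed(["MON", "XYZ"], DAYS)   # "XYZ" is not a weekday code
```

Each predicate returns a single pass/fail, which is convenient for recording the result per column in a metadata store.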
Validity checks - There could be a few columns whose range or threshold is pre-defined by the business or functional use case. For example, the age of a registered customer has to be more than 18 years; anything less than 18 in this column is invalid. So, in validity checks, you are checking whether the data is within a valid range, or above or below a valid threshold value.
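A validity check of this kind could be sketched as a range filter that flags the offending rows (hypothetical names, plain Python):

```python
# Sketch of a validity check: flag rows whose value falls outside a
# business-defined range - e.g. registered customers must be 18 or older.

def invalid_rows(rows, column, minimum=None, maximum=None):
    bad = []
    for r in rows:
        v = r[column]
        if minimum is not None and v < minimum:
            bad.append(r)
        elif maximum is not None and v > maximum:
            bad.append(r)
    return bad

customers = [{"id": 1, "age": 34}, {"id": 2, "age": 17}, {"id": 3, "age": 18}]

# Only the under-age record is flagged as invalid.
assert invalid_rows(customers, "age", minimum=18) == [{"id": 2, "age": 17}]
```

Returning the offending rows (rather than a bare pass/fail) makes it easier to route them for correction or quarantine later.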
Uniqueness - Here, you check for entire records being duplicated, probably due to receiving the same data twice, or for duplicate primary keys carrying different data - which should not be allowed.
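Both flavours of the uniqueness check can be sketched with a counter over the rows (plain Python, hypothetical names):

```python
from collections import Counter

# Sketch of uniqueness checks: full-record duplicates (the same data
# received twice) and duplicate primary keys carrying different data.

def duplicate_records(rows):
    """Records that appear more than once in their entirety."""
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    return [dict(k) for k, n in counts.items() if n > 1]

def duplicate_keys(rows, key):
    """Primary-key values that occur on more than one row."""
    counts = Counter(r[key] for r in rows)
    return [k for k, n in counts.items() if n > 1]

rows = [
    {"id": 1, "amount": 10},
    {"id": 1, "amount": 10},   # same record received twice
    {"id": 2, "amount": 20},
    {"id": 2, "amount": 99},   # same key, different data
]
assert duplicate_records(rows) == [{"id": 1, "amount": 10}]
assert duplicate_keys(rows, "id") == [1, 2]
```

Note that id 2 is caught only by the key check: its two rows differ in content, so the full-record check does not see a duplicate.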
All of these validations will ensure good data quality and will give a huge boost to the confidence in using this data. This is mostly in the context of structured data or semi-structured that has been converted to structured in the staging zone.
What do you do if you find invalid data? You can deal with it in two ways, which are discussed in the next section.
Intrusive versus Non-Intrusive Data Quality Management
When you find errors in data, what you do with it depends on the appetite of your business to deal with erroneous data and its impact on the downstream systems.
If the data should be stopped from going forward down the pipeline because the downstream systems/users cannot tolerate the errors, then you build what is called an 'Intrusive' Data Quality Management system. However, if the downstream can manage a few hours or a day of errors that can be manually or systemically corrected over time, then you build a 'Non-Intrusive' data quality system.
What do they actually mean?
Non-Intrusive Data Quality Management
In a Non-Intrusive DQM, as the name suggests, you do not impact any of the pipelines or systems critical for your business. But you still capture the errors and take corrective action. This you achieve by ensuring that you do not build a dependency in your data pipeline on the validation process, as shown in the diagram below:
You see that the DQM is running as a pipeline parallel to your data pipeline and not interrupting it in any way.
The data in the Landing Zone continues into the Staging Zone uninterrupted. It does not wait for the DQM process to complete, nor check whether that process succeeded or threw errors.
A similar flow is maintained at the staging layer too. This keeps the process non-intrusive, while errors and successes are captured in a central metadata store by the DQM pipeline. The data in this store can then be used to trigger appropriate alerts, and even a full-time monitoring dashboard can be built on it.
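The essence of the non-intrusive pattern - validate, record, never block - can be sketched as follows (plain Python; the list standing in for the central metadata store and all names hypothetical):

```python
# Sketch of a non-intrusive DQM step: the validation runs alongside the
# pipeline, records its findings in a metadata store, and never raises -
# so the data always continues to the next zone.

metadata_store = []   # stand-in for a central metadata table

def run_non_intrusive(zone, table, rows, checks):
    for name, check in checks.items():
        try:
            passed = check(rows)
        except Exception:
            passed = False     # even a crashing check must not block the data
        metadata_store.append(
            {"zone": zone, "table": table, "check": name, "passed": passed}
        )
    return rows   # data flows on regardless; errors are only recorded

checks = {
    "non_empty": lambda rows: len(rows) > 0,
    "no_null_id": lambda rows: all(r["id"] is not None for r in rows),
}

rows = [{"id": None}]
out = run_non_intrusive("staging", "sales", rows, checks)
assert out == rows          # the pipeline was not interrupted
assert metadata_store[1]["passed"] is False   # but the failure was recorded
```

Alerting and dashboards then read only from `metadata_store`, fully decoupled from the data path.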
Intrusive Data Quality Management
In some cases, your stakeholders may not be fine with erroneous data. Then, you would want the data pipeline to stop execution and raise errors if validation fails. To achieve this, you have to choose gated quality management as shown in the diagram here:
Essentially, you make your validations part of the pipeline, and the data does not move to the next zone unless it passes the checks. This can be costly in terms of the latencies introduced and hence should be used judiciously. And when this method is used, the number of quality checks should also follow a minimalist principle, ensuring you get just the quality you really need.
I cannot stress this enough: the latencies introduced could be huge, and you could end up with a large set of unhappy stakeholders if they are looking for the data to be available as quickly as possible.
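By contrast, the intrusive (gated) pattern makes validation a blocking step. A minimal sketch, with hypothetical names:

```python
# Sketch of an intrusive (gated) DQM step: validation is part of the
# pipeline, and data is promoted to the next zone only if every check passes.

class DataQualityError(Exception):
    pass

def gate(rows, checks):
    failures = [name for name, check in checks.items() if not check(rows)]
    if failures:
        raise DataQualityError(f"failed checks: {failures}")
    return rows   # only clean data moves forward

checks = {"no_null_id": lambda rows: all(r["id"] is not None for r in rows)}

gate([{"id": 1}], checks)            # passes: data flows to the next zone
try:
    gate([{"id": None}], checks)     # fails: the pipeline stops here
except DataQualityError as e:
    print(e)                         # failed checks: ['no_null_id']
```

Every record now pays the latency of the checks before it can move on, which is exactly the trade-off discussed above.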
Once you know these are the validations you need, they will be largely the same for all data being ingested. The principles are the same, but the schema, the table structures, the data types, etc., vary from one data source to the next. Keeping this in mind, you can build this as a library or framework that is configuration-driven. If you can declare much of the metadata through configuration files and use the dependency injection principle within the framework for
defining your sources
defining the structure of source data and target data
defining the validations - types and on what columns
defining your metadata structures to capture the results for each table /data source
then you truly have a reusable framework that is quite valuable for your organization.
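The shape of such a configuration-driven framework can be sketched in a few lines - here the configuration is an in-memory dict, standing in for YAML/JSON config files, and all names are hypothetical:

```python
# Sketch of a configuration-driven validation framework: the checks, the
# columns they apply to, and their arguments are declared as data, and one
# generic engine runs them and captures the results per table/column.

CHECKS = {
    "not_null": lambda values, _: all(v is not None for v in values),
    "allowed":  lambda values, arg: set(values) <= set(arg),
    "min":      lambda values, arg: all(v >= arg for v in values if v is not None),
}

config = {
    "table": "customers",
    "validations": [
        {"column": "id",  "check": "not_null", "arg": None},
        {"column": "age", "check": "min",      "arg": 18},
    ],
}

def run_framework(rows, config):
    results = []
    for v in config["validations"]:
        values = [r[v["column"]] for r in rows]
        passed = CHECKS[v["check"]](values, v["arg"])
        results.append({"table": config["table"], "column": v["column"],
                        "check": v["check"], "passed": passed})
    return results

rows = [{"id": 1, "age": 34}, {"id": 2, "age": 17}]
results = run_framework(rows, config)
assert [r["passed"] for r in results] == [True, False]   # age 17 fails min=18
```

Onboarding a new data source then means writing a new config entry, not new code - which is what makes the framework reusable across the organization.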
Data quality checks are useful and required both for structured data and for semi-structured data that is given structure later. We have discussed at what stages and what types of validations can be done.
Once you have the quality framework developed with the right design principles, you have clean data in your data lake, a lot of peace of mind as the time spent managing quality issues comes down drastically, and an extensible library that will last for some time. All in all, a good investment.