Automated Data Warehouse Testing

How to create a DataOps process for Data Quality and Data Testing

In this article, we're going to take a look at the world of DataOps and explore an increasingly common challenge:
"How can we operationalize Data Quality and Data Testing in a complex Data Warehouse environment?"

At BiG EVAL, we've seen an increase in demand for Data Quality and Data Testing use cases within DataOps environments.

As teams grapple with the complexity of managing multiple analytics environments, it has become an urgent priority to devise a more innovative way to deliver large-scale, automated data testing.

What is DataOps?

Before we discuss DataOps Data Testing, we need to provide a brief overview of DataOps.

DataOps is a relatively young discipline that has gained traction as companies recognize the need for a more effective way to manage their data landscape complexity, particularly with business intelligence and data analytics.

Data warehousing has become a valuable deployment use case for DataOps due to the enormous volumes of data and the number of inbound/outbound pipelines that operational teams need to control when delivering a data analytics capability.

DataOps aims to simplify the organization and orchestration of core data functions in much the same way that DevOps did for controlling software.

The DevOps fundamentals of agile, lean and continuous delivery still apply to DataOps, but applying agile approaches to data raises the difficulty to a whole new level.

(Note: We provide links to useful articles/research on DataOps at the end of this article).

Why has DataOps Testing become a problem for Data Warehousing and Analytics?

Even a modest data warehouse operation still needs to coordinate thousands of changes every year to ensure business users have access to accurate and timely information.

To maintain trust in the data, data analytics operations teams need to ensure all changes get thoroughly tested across all environments (e.g. Dev, QA, Pre-Production and Production).

This requirement for data testing presents a challenge because:

  • Most testing solutions are great at testing software but were never designed to test data (so do a poor job)
  • Even simple changes to data processing logic can require regression testing of thousands of data quality rules (creating a scalability and resourcing headache)
  • Data warehouse environments require different teams to coordinate and collaborate. This is fine for software changes, where DevOps takes the strain, but for data testing there has historically been no way to bring these teams together to manage testing rules in a unified way

The Data Quality Dimension

A further challenge is that DataOps Data Testing doesn't just require a processing change to trigger a round of testing.

The DataOps team needs to ensure that any data flowing through the pipelines and systems of a data analytics environment is continuously monitored and assessed against a comprehensive library of data quality rules.

DataOps presents a different level of complexity compared to DevOps

The added complexity of data is what makes DataOps intrinsically more complex than DevOps:

  • In DevOps environments, the code base is static, changing only through approved releases. But with DataOps, the data is dynamic – it's changing every minute.
  • DevOps code is relatively well-contained, but with DataOps, your data pipelines and processes can span across the entire organization, adding complexity.

Building a Data Warehouse Testing capability that is fit for DataOps deployment

This next section of the article will discuss some of the most recent Data Warehousing use cases we've been involved with, highlighting the benefit of integrating a Data Testing capability within a DataOps environment.

What do you need from a DataOps Data Testing solution?

When we talk with clients, here are some of the typical challenges they're looking to solve:

  • Lots of (data) unit tests: Companies need a location for storing, managing, and executing a large volume of unit tests
  • Better QA reporting: QA teams need a simple reporting environment to observe and analyze data test results
  • Scalable Regression Testing: Our clients need a method for performing regression tests on their data whenever changes to the data (e.g. structure/process) or their code need to be promoted
  • Seamless orchestration integration: Organizations need a way to integrate data testing into their existing orchestration toolset without extensive coding and manipulation
  • Integration with Data Warehouse automation: Data testing can't be executed in isolation; it needs to be automated via modern warehouse platforms, such as WhereScape

Closing the gap between business needs and DataOps delivery

Most data teams face the perennial challenge of keeping up with business demand for analytics.

Unfortunately, most companies rely on outdated data testing practices. With the volume and frequency of data tests required in a busy data analytics environment, it's easy for cracks to appear.

The diagram below highlights a simplified Data Warehouse environment found in many data analytics operations:

Simple Data Warehouse configuration

As you can see, this is a standard Data Warehouse configuration that requires extensive data testing at various locations.

So let's walk through each stage of testing and explain how to approach it for optimal success.

Source and Target System Data Quality Assurance

You've probably heard the phrase garbage-in, garbage-out.

It's a sad reality that many Data Warehouse environments ignore the most critical data test of all – monitoring the quality of inbound and outbound data flowing from production systems into the analytics platform and onwards to further operational, analytical or reporting systems.

To address this, you need to deploy a data quality management testing/monitoring capability at the point of 'data ingestion' before the data enters your Data Warehouse.
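As an illustration, a data quality gate at the ingestion point can be sketched as a small routine that splits an inbound batch into clean and quarantined records. The rule names and record shapes below are illustrative assumptions, not BiG EVAL functionality:

```python
from dataclasses import dataclass
from typing import Callable

# A hypothetical quality rule: a name plus a predicate applied to each record.
@dataclass
class QualityRule:
    name: str
    check: Callable[[dict], bool]

def quality_gate(batch, rules):
    """Split an inbound batch into records that pass every rule and
    records rejected by at least one rule (kept with the failure reasons)."""
    clean, quarantined = [], []
    for record in batch:
        failures = [rule.name for rule in rules if not rule.check(record)]
        if failures:
            quarantined.append((record, failures))
        else:
            clean.append(record)
    return clean, quarantined

# Illustrative rules for a sales feed.
rules = [
    QualityRule("customer_id present", lambda r: bool(r.get("customer_id"))),
    QualityRule("amount non-negative", lambda r: r.get("amount", 0) >= 0),
]

batch = [
    {"customer_id": "C1", "amount": 100},
    {"customer_id": "", "amount": 50},     # fails the ID rule
    {"customer_id": "C3", "amount": -10},  # fails the amount rule
]
clean, quarantined = quality_gate(batch, rules)
```

Only records that pass every rule continue into the warehouse; the quarantined records, together with their failure reasons, can be reported back to the source-system owners.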

As you see in the diagram below, we deploy our BiG EVAL Data Quality Management tool within the client analytics environment to perform this function:

Simple Data Warehouse Configuration including BiG EVAL Quality Assurance Components

There are some profound benefits from deploying a tool such as BiG EVAL Data Quality Management in this context:

  • Data Quality gates: You need to guarantee the trust of any inbound and outbound data, which is why we recommend a 'gated' approach to data ingestion. As the data flows into your analytics operation, each data quality gate should trap and report poor-quality data before it impairs analytical insights and operational processes downstream.
  • Continuous validation and alerts: DataOps thrives on delivering processing changes made iteratively for maximum agility. This speed of change requires regular monitoring and testing, particularly within production environments. Therefore, your data quality monitoring solution must be capable of notifying technicians and operations staff whenever data defects are encountered.
  • Synchronization and integration: Many Data Warehouse environments rely on extracting and storing data offline before assessing data quality. Instead, we recommend building data quality tests that execute directly within your Data Analytics processes. Evaluating data quality 'in-flight' allows near real-time defect monitoring, which is why we opted for this approach with our BiG EVAL solution. Data quality testing rules in our solution can be executed manually, via data events, timing events, or with an API call from a DataOps orchestration workflow.

Data Test Automation

Data Warehouse environments are continually evolving and updating, hence the need for a DataOps capability to serve the insatiable analytics appetites of the modern enterprise.

Your data testing solution needs to fit into this paradigm.

With DataOps, each time a developer makes a feature change, they need to execute an associated set of data tests to confirm the change won't break anything when promoted to the production environment.

As each feature is added, the number of tests grows rapidly, because even a simple piece of code or logic can require a large number of data tests. If forced to execute these tests manually, the testing team soon becomes a bottleneck due to 'test saturation'.

This ability to align your data testing requirements to the scale and volume required for DataOps is one of the increasingly common use cases we see right now for Data Test Automation.

When you synchronize data testing with continuous development and integration, you also ensure that any feature releases are gated (in a similar way to the Data Quality gates earlier). Changes can't pass into production without the required data tests being executed and successfully achieving their predefined thresholds.
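The gating logic described above can be sketched as a simple threshold check before promotion. The suite names and pass-rate thresholds below are hypothetical examples:

```python
def promotion_allowed(test_results, thresholds):
    """Decide whether a release may be promoted.

    test_results maps suite name -> observed pass rate (0.0 to 1.0);
    thresholds maps suite name -> minimum required pass rate.
    Returns (allowed, reasons), where reasons lists any suites that fell short.
    """
    reasons = [
        f"{suite}: {test_results.get(suite, 0.0):.0%} < required {minimum:.0%}"
        for suite, minimum in thresholds.items()
        if test_results.get(suite, 0.0) < minimum
    ]
    return not reasons, reasons

# Illustrative results from an automated test run.
results = {"schema_tests": 1.0, "row_count_tests": 0.97, "dq_rules": 0.92}
thresholds = {"schema_tests": 1.0, "row_count_tests": 0.99, "dq_rules": 0.95}
allowed, reasons = promotion_allowed(results, thresholds)
```

Here two suites fall short of their thresholds, so the release is blocked and the reasons can be surfaced to the team.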

In the 'old world' of manual data testing, it was common for features to be released without appropriate testing, simply because testing resources were limited and the timescales required were impractical.

With the automated data testing approach we provide with BiG EVAL, everything changes. You can harmonize data testing with each sprint and feature drop coming out of your DataOps process.

Let's explore how we can approach this.

What are the DataOps foundations your Data Testing approach needs to align with?

Continuous Integration and Deployment

Operating within a continuous delivery pattern requires development teams to work with a trusted source of testing rules, just as they work with a single code source in the DevOps world.

With BiG EVAL, we've made it our mission to eliminate the traditional approach of 'scattered' data testing rules. Our solution is built around a managed repository of data testing rules shared amongst your operational environments whilst retaining a 'single source of truth'.
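As a rough sketch of the 'single source of truth' idea, the same rule definitions can be stored once and executed against every environment, so Dev, QA and Production all run identical checks. The rule names and sample records are illustrative assumptions:

```python
# A single shared repository of rule definitions: each rule maps a name
# to a predicate over a sample of rows from one environment.
RULES = {
    "no_null_customer_id": lambda rows: all(r["customer_id"] is not None for r in rows),
    "amounts_non_negative": lambda rows: all(r["amount"] >= 0 for r in rows),
}

def run_rules(rows, rules=RULES):
    """Run every shared rule against one environment's data sample
    and return a name -> pass/fail report."""
    return {name: check(rows) for name, check in rules.items()}

# The same definitions evaluated against two environments.
dev_rows = [{"customer_id": 1, "amount": 10.0}]
qa_rows = [{"customer_id": None, "amount": 5.0}]
dev_report = run_rules(dev_rows)
qa_report = run_rules(qa_rows)
```

Because the rules live in one place, a fix or addition is immediately in force everywhere, rather than drifting between per-environment copies.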

Data Pipeline Orchestration

Orchestration is delivered through a range of operational workflows that call and coordinate the many data pipeline functions required to move data through the analytics environment and its various processing stages.

As such, you could argue that orchestration provides the 'engine room' of your DataOps capability.

Your data quality and data testing routines need to be called from the orchestration workflows.
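A minimal sketch of calling data tests from an orchestration workflow might look like the following. The endpoint path and payload shape are assumptions for illustration, not a real BiG EVAL API:

```python
import json
import urllib.request

def build_trigger_payload(suite_id, environment, pipeline_run_id):
    """Assemble the JSON body an orchestration step would POST to
    a (hypothetical) test-execution endpoint."""
    return {
        "suite": suite_id,
        "environment": environment,
        "triggered_by": pipeline_run_id,
    }

def trigger_test_suite(base_url, suite_id, environment, pipeline_run_id):
    """POST the payload to the assumed endpoint and return the parsed response."""
    payload = build_trigger_payload(suite_id, environment, pipeline_run_id)
    request = urllib.request.Request(
        f"{base_url}/api/test-suites/{suite_id}/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

payload = build_trigger_payload("dwh_smoke", "qa", "run-42")
```

An orchestration task would call `trigger_test_suite` after a load step completes, then branch on the returned result before allowing the next stage to run.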

Testing

You can't deliver DataOps with manual testing, hence the need for automated testing.

Executing automated tests under a DataOps environment means that test scripts and specifications need to be created in advance with every new analytics feature or update.

With test automation, it's much easier to run your tests regularly and often. You're also not just testing code but the data itself.
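To illustrate testing the data itself rather than the code, here is a minimal data test that reconciles a source table against its warehouse target using a row count and a control total. The table and column names are illustrative, and SQLite stands in for the real source and warehouse databases:

```python
import sqlite3

def row_count(conn, table):
    """Count rows in a table."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

def total_amount(conn, table):
    """Sum the amount column as a simple control total."""
    return conn.execute(f"SELECT COALESCE(SUM(amount), 0) FROM {table}").fetchone()[0]

def reconcile(source_conn, target_conn, table):
    """Compare row counts and a control total between source and target."""
    checks = {
        "row_count": row_count(source_conn, table) == row_count(target_conn, table),
        "amount_total": total_amount(source_conn, table) == total_amount(target_conn, table),
    }
    return all(checks.values()), checks

# In-memory stand-ins for the production source and the warehouse target.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for conn in (source, target):
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])
target.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])
ok, details = reconcile(source, target, "orders")
```

Under automation, a check like this runs after every load, so drift between source and target is caught on the next cycle rather than when a business user notices a wrong number.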

The image below provides a quick overview of where BiG EVAL Data Test Automation is deployed, giving you a sense of the breadth of testing scenarios you need to support in a DataOps environment:

Data Warehouse Testing Areas BiG EVAL supports

Monitoring

With the focus on the increased deployment speed that DataOps must support, there is an added need to monitor the entire data analytics environment to ensure any defects and anomalies are trapped before they cause serious harm.

Monitoring and testing need to work in tandem. Once issues are detected, new data testing rules can be created to ensure future defects are trapped within the source data environment, and any code releases don't create defects.

Putting it all together

In the following diagram, we've highlighted where Data Test Automation and Data Quality Testing need to be executed to support DataOps in a standard Data Warehousing configuration.

Of course, we're biased at BiG EVAL in that our testing solutions operate exceptionally well in this configuration.

Still, whatever testing solution you opt for, you'll find that for your DataOps environment to function, you'll need to deploy this type of arrangement.

DataOps Architecture

Resources and Next Steps

If you want to learn more about automating data tests and data quality within a DataOps environment, the following resources may be helpful.