How to Create a DataOps Process for Data Testing with Data Quality Software

In this article, we're going to take a look at the world of DataOps and explore an increasingly common challenge: how to harness data quality software to automate data quality and testing in complex data warehouse environments.

At BiG EVAL, we've seen increased demand for data quality and data testing use cases within DataOps environments.

As business intelligence architects and operations teams grapple with the complexity of managing multiple analytics environments, devising more innovative ways to deliver large-scale, automated data testing has become an urgent priority.

What is DataOps?

Before we discuss DataOps data testing, we need to provide a brief overview of DataOps.

DataOps is a relatively young discipline that has gained traction as companies recognize the need for more effective ways to manage data landscape complexity, particularly with business intelligence and data analytics.

Data warehousing has become a valuable deployment use case for DataOps due to the enormous volumes of data and the number of inbound/outbound pipelines operational teams need to control when delivering a data analytics capability.

DataOps aims to simplify the organization and orchestration of core data functions in much the same way that DevOps did for controlling software.

The DevOps fundamentals of agile, lean, and continuous delivery still apply to DataOps, but applying agile approaches to data raises the difficulty to a whole new level.

(Note: We provide links to useful articles/research on DataOps at the end of this article.)

Why Has DataOps Testing Become a Problem for Data Warehousing and Analytics?

Even a modest data warehouse operation still needs to coordinate thousands of changes every year to ensure business users have access to accurate and timely information.

To maintain trust in the data, data analytics operations teams need to ensure all changes get thoroughly tested across all environments (e.g., dev, QA, pre-production, and production).

This requirement for data testing presents a challenge because:

  • Most testing solutions are great at testing software but were never designed to test data, so they do a poor job of it.
  • Even simple changes to data processing logic can require regression testing of thousands of data quality rules, creating a scalability and resourcing headache.
  • Data warehouse environments need different teams to coordinate and collaborate. That is manageable for software changes because DevOps takes the strain, but historically there was no way for these teams to manage data testing rules in a unified manner.

The Data Quality Dimension

A further challenge is that DataOps data testing doesn't just require a processing change to trigger a round of testing.

The DataOps team needs to ensure that any data flowing through the pipelines and systems of a data analytics environment are continuously monitored and assessed against a comprehensive library of data quality rules.
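
To make the idea of a rule library more concrete, here is a minimal sketch in Python. It is not how BiG EVAL stores or executes rules. The record fields (customer_id, amount, country) and the names Rule, RULES, and evaluate are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Rule:
    """One data quality rule: a name plus a check that returns True when a row passes."""
    name: str
    check: Callable[[dict], bool]


# Hypothetical rule library; the field names are illustrative assumptions.
RULES: List[Rule] = [
    Rule("customer_id is present", lambda r: r.get("customer_id") is not None),
    Rule(
        "amount is non-negative",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
    Rule(
        "country is a 2-letter code",
        lambda r: isinstance(r.get("country"), str) and len(r["country"]) == 2,
    ),
]


def evaluate(rows: Iterable[dict], rules: List[Rule] = RULES) -> Dict[str, int]:
    """Run every rule against every row and count the failing rows per rule."""
    failures = {rule.name: 0 for rule in rules}
    for row in rows:
        for rule in rules:
            if not rule.check(row):
                failures[rule.name] += 1
    return failures
```

Calling evaluate on each batch that flows through a pipeline yields a failure count per rule, which a monitoring job could then compare with whatever tolerance the team has agreed on.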

DataOps Presents a Different Level of Complexity Compared to DevOps

The added complexity of data is what makes DataOps intrinsically more complex than DevOps:

  • In DevOps environments, the code base is relatively static and changes only through approved releases. With DataOps, the data is dynamic and changes every minute.
  • DevOps code is relatively well-contained, but with DataOps, data pipelines and processes can span across the entire organization, adding complexity.

Building Data Warehouse Testing for DataOps Deployment with Data Quality Software

This next section of the article will discuss some of the most recent data warehousing use cases we've been involved with, highlighting the benefit of integrating a data testing capability within a DataOps environment.

What Do You Need from a DataOps Data Testing Solution?

When we talk with clients, here are some of the typical challenges they're looking to solve:

  • Lots of (data) unit tests: Companies need a location for storing, managing, and executing large volumes of unit tests.
  • Better QA reporting: QA teams need a simple reporting environment to observe and analyze data test results.
  • Scalable regression testing: Our clients need a method for performing regression tests on their data whenever changes to the data (e.g., structure/process) or code need to be promoted.
  • Seamless orchestration integration: Organizations need a way to integrate data test planning and execution into their existing orchestration toolset without extensive coding and manipulation.
  • Integration with data warehouse automation: Data testing can't be executed in isolation. It needs to be automated via modern warehouse platforms such as WhereScape.

Closing the Gap Between Business Needs and DataOps Delivery

Most data teams have a perennial challenge of keeping up with the business demand for analytics.

Unfortunately, most companies rely on outdated data testing practices. With the volume and frequency of data tests required in a busy data analytics environment, it's easy for cracks to appear.

The diagram below highlights a simplified data warehouse environment found in many data analytics operations:

[Figure: Flow chart of a typical data warehouse deployment, from the data location through multiple testing stages to the analytics user base.]

As you can see, this is a standard data warehouse configuration that requires extensive data testing at various locations.

Let's walk through each testing stage and explain how to approach it for optimal success.

Source and Target System Data Quality Assurance

You've probably heard the phrase, "Garbage in, garbage out."

It's a sad reality that many data warehouse environments ignore the most critical data test of all: monitoring the quality of inbound and outbound data flowing from production systems into the analytics platform and onward to further operational, analytical, or reporting systems.

To address this, you need to deploy a data quality management testing/monitoring capability at the point of data ingestion before the data enters your warehouse.

As you see in the diagram below, we deploy our BiG EVAL Data Quality Management tool within the client analytics environment to perform this function:

[Figure: Typical data warehouse deployment, from the data location through testing stages to the analytics user base, with BiG EVAL data quality gates and alerts.]

There are some profound benefits from deploying a tool such as BiG EVAL Data Quality Management in this context:

  • Data quality gates: You need to be able to trust any inbound and outbound data, which is why we recommend a gated approach to data ingestion. As the data flows into your analytics operation, each data quality gate should trap and report poor-quality data before it impairs analytical insights and operational processes downstream.
  • Continuous validation and alerts: DataOps thrives on delivering processing changes made iteratively for maximum agility. This speed of change requires regular monitoring and testing, particularly within production environments. Therefore, your data quality monitoring solution must be capable of notifying technicians and operations staff whenever it encounters data defects.
  • Synchronization and integration: Many data warehouse environments extract and store data offline before assessing data quality. Instead, we recommend building data quality tests that execute directly within your data analytics processes. Evaluating data quality in-flight allows near real-time defect monitoring, which is why we opted for this approach with our BiG EVAL solution. Our solution's data quality testing rules can be executed manually, via data and timing events, or with an API call from a DataOps orchestration workflow.
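
The gate-and-alert pattern described above can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the BiG EVAL implementation or its API. The function name quality_gate, the max_defect_rate parameter, and the alert callback are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def quality_gate(
    rows: Iterable[dict],
    is_valid: Callable[[dict], bool],
    alert: Callable[[str], None],
    max_defect_rate: float = 0.0,
) -> Tuple[bool, List[dict], List[dict]]:
    """Split rows into accepted and quarantined, and alert on any defect.

    Returns (gate_open, accepted, quarantined). The gate stays open only
    while the share of defective rows does not exceed max_defect_rate.
    """
    accepted: List[dict] = []
    quarantined: List[dict] = []
    for row in rows:
        (accepted if is_valid(row) else quarantined).append(row)

    total = len(accepted) + len(quarantined)
    defect_rate = len(quarantined) / total if total else 0.0

    if quarantined:
        # In practice the alert callback might send an email or a chat message.
        alert(
            f"{len(quarantined)} of {total} rows failed validation "
            f"(defect rate {defect_rate:.1%})"
        )
    return defect_rate <= max_defect_rate, accepted, quarantined
```

An orchestration workflow would call a function like this in-flight, load only the accepted rows, and stop or branch when the gate comes back closed.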

Data Test Automation

Data warehouse environments are continually evolving and updating, hence the need for a DataOps capability to serve the insatiable analytics appetites of the modern enterprise.

Your data testing solution needs to fit into this paradigm.

With DataOps, each time a developer makes a feature change, they need to execute an associated set of data tests to confirm they won't break anything when that feature is promoted to the production environment.

With each added feature, the number of tests grows rapidly because even a simple piece of code or logic can require a large number of data tests. If the testing team is forced to run test scripts and routines manually, it soon becomes a bottleneck due to test saturation.

This ability to align your data testing requirements to the scale and volume required for DataOps is one of the increasingly common use cases we see right now for data test automation.

When you synchronize data testing with continuous development and integration, you also ensure that any feature releases are gated (as with the data quality gates mentioned earlier). Changes can't pass into production without the required data tests being executed and successfully achieving their predefined thresholds.
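
As a rough illustration of such a release gate, the following Python sketch blocks a promotion when a required data test was not executed or fell short of its threshold. The names CheckResult and release_blockers, and the idea of expressing each result as a pass rate between 0 and 1, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class CheckResult:
    """Outcome of one data test (hypothetical shape)."""
    name: str
    observed: float   # e.g. share of rows passing, from 0.0 to 1.0
    threshold: float  # minimum acceptable value for a release


def release_blockers(results: Iterable[CheckResult], required: List[str]) -> List[str]:
    """Return the reasons a release must not be promoted.

    An empty list means every required test ran and met its threshold.
    """
    by_name = {result.name: result for result in results}
    blockers: List[str] = []
    for name in required:
        result = by_name.get(name)
        if result is None:
            blockers.append(f"{name}: not executed")
        elif result.observed < result.threshold:
            blockers.append(
                f"{name}: {result.observed:.3f} below threshold {result.threshold:.3f}"
            )
    return blockers
```

A continuous integration job could call release_blockers as its final step and fail the build whenever the returned list is not empty.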

In the old world of manual data testing, it was common to release features without appropriate testing simply because the testing resource was limited and the timescales required were impractical.

With the automated data testing approach we provide with BiG EVAL, everything changes. You can harmonize data testing with each sprint and feature drop coming out of your DataOps process.

Let's explore how we can approach this.

Automating the data warehouse deployment and maintenance process requires automation of quality assurance too.

With Which DataOps Foundations Does Your Testing Approach Need to Align?

Continuous Integration and Deployment

Operating within a continuous delivery pattern requires development teams to work with a trusted source of testing rules, just as they work with a single code source in the DevOps world.

With BiG EVAL, we've made it our mission to eliminate the traditional approach of scattered data testing rules. Our solution is built around a managed repository of data testing rules shared among your operational environments while retaining a single source of truth.
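
One simple way to picture a single source of truth is to define each rule once and let each environment supply only its own settings. The sketch below is a generic Python illustration and says nothing about how the BiG EVAL repository is built. The rule names, SQL templates, and schema names are hypothetical.

```python
from typing import Dict

# Each rule is defined exactly once. {schema} is filled in per environment.
RULE_REPOSITORY: Dict[str, str] = {
    "orders_have_customer": (
        "SELECT COUNT(*) FROM {schema}.orders WHERE customer_id IS NULL"
    ),
    "no_duplicate_orders": (
        "SELECT COUNT(*) FROM (SELECT order_id FROM {schema}.orders "
        "GROUP BY order_id HAVING COUNT(*) > 1) AS d"
    ),
}

# Environment settings hold only what differs between environments.
ENVIRONMENTS: Dict[str, Dict[str, str]] = {
    "dev": {"schema": "dwh_dev"},
    "qa": {"schema": "dwh_qa"},
    "prod": {"schema": "dwh_prod"},
}


def render_rules(environment: str) -> Dict[str, str]:
    """Bind the shared rule definitions to one environment's settings."""
    settings = ENVIRONMENTS[environment]
    return {
        name: template.format(**settings)
        for name, template in RULE_REPOSITORY.items()
    }
```

Because every environment renders its tests from the same definitions, a rule that is changed once is changed everywhere, and dev, QA, and production cannot drift apart.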

Data Pipeline Orchestration

Orchestration is delivered through various operational workflows that call and coordinate the many data pipeline functions required to successfully move data around the analytics environment or other data processing stages.

As such, you could argue that orchestration provides the "engine room" of your DataOps capability.

Your data quality and data testing routines need to be called from the orchestration workflows.
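
The sketch below shows the general shape of this idea in Python: a data test is just another step in the workflow, and a failing test stops every step after it. Real orchestration tools have their own ways of expressing this, so treat run_workflow, DataTestFailure, and make_demo_steps as hypothetical names for illustration only.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


class DataTestFailure(Exception):
    """Raised by a data test step to stop the workflow."""


def run_workflow(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first failing data test.

    Returns the names of the completed steps and an error message (or None).
    """
    completed: List[str] = []
    for name, step in steps:
        try:
            step()
        except DataTestFailure as exc:
            return completed, f"{name}: {exc}"
        completed.append(name)
    return completed, None


def make_demo_steps(staging_rows: List[dict]) -> Tuple[List[Step], List[dict]]:
    """Build a tiny two-step workflow: test the staging data, then publish it."""
    warehouse: List[dict] = []

    def test_staging() -> None:
        missing = [row for row in staging_rows if row.get("order_id") is None]
        if missing:
            raise DataTestFailure(f"{len(missing)} rows without order_id")

    def publish() -> None:
        warehouse.extend(staging_rows)

    return [("test_staging", test_staging), ("publish", publish)], warehouse
```

With this arrangement, the publish step never sees data that failed the staging test, which is the behavior you want from a test that is wired into the workflow rather than run on the side.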

DataOps Requires Test Automation

You can't deliver DataOps with manual testing, hence the need for automated testing.

Executing automated tests in a DataOps environment means that test scripts and specifications must be created in advance with every new analytics feature or update.

With test automation, it's much easier to run your tests regularly and often. You're also not just testing code but the data itself.
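
A classic example of testing the data rather than the code is a source-to-target reconciliation. The following sketch uses Python's built-in sqlite3 module and an in-memory database so that it runs anywhere. The table names, the amount_cents column, and the function names are assumptions for illustration, and a real warehouse would need its own connection and SQL dialect.

```python
import sqlite3
from typing import Dict


def build_demo_connection() -> sqlite3.Connection:
    """Create an in-memory database where the target is missing one source row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE src_orders (order_id INTEGER, amount_cents INTEGER);
        CREATE TABLE dwh_orders (order_id INTEGER, amount_cents INTEGER);
        INSERT INTO src_orders VALUES (1, 1000), (2, 2500), (3, 400);
        INSERT INTO dwh_orders VALUES (1, 1000), (2, 2500);
        """
    )
    return conn


def reconcile(
    conn: sqlite3.Connection, source: str, target: str, amount_column: str
) -> Dict[str, bool]:
    """Compare the row count and a column total between two tables.

    Table and column names are interpolated into the SQL, so they must come
    from trusted configuration and never from user input.
    """
    query = "SELECT COUNT(*), COALESCE(SUM({col}), 0) FROM {table}"
    src = conn.execute(query.format(col=amount_column, table=source)).fetchone()
    tgt = conn.execute(query.format(col=amount_column, table=target)).fetchone()
    return {
        "row_count_matches": src[0] == tgt[0],
        "total_matches": src[1] == tgt[1],
    }
```

Amounts are stored as integer cents here so that the totals can be compared exactly. With floating-point columns you would compare within a small tolerance instead.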

The image below provides a quick overview of where BiG EVAL Data Test Automation is deployed, giving you a sense of the breadth of testing scenarios you need to support in a DataOps environment:

[Figure: BI testing supported by BiG EVAL data quality software: data integration, analysis model, end-to-end, security testing, and performance testing.]

Monitoring

With the focus on the increased deployment speed that DataOps must support, there is an added need to monitor the entire data analytics environment to ensure any defects and anomalies are trapped before they cause serious harm.

Monitoring and testing need to work in tandem. Once issues are detected, new data testing rules can be created to ensure that future defects are trapped within the source data environment and that code releases don't introduce new ones.
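
To show how a monitoring finding can turn into a permanent test, here is a small Python sketch. It flags a daily row count that deviates strongly from the historical median and then records a new rule so the same defect is caught by testing next time. The 50% tolerance, the rule name, and the functions is_anomalous and add_rule_from_incident are illustrative assumptions, not recommendations.

```python
from statistics import median
from typing import Callable, Dict, List

# Hypothetical in-memory rule library; a real one would live in a shared repository.
rule_library: Dict[str, Callable[[int], bool]] = {}


def is_anomalous(history: List[int], today: int, tolerance: float = 0.5) -> bool:
    """Flag today's row count if it deviates from the historical median
    by more than the given share (0.5 means 50%)."""
    if not history:
        return False  # nothing to compare against yet
    baseline = median(history)
    if baseline == 0:
        return today != 0
    return abs(today - baseline) / baseline > tolerance


def add_rule_from_incident(name: str, check: Callable[[int], bool]) -> None:
    """Turn a monitoring finding into a permanent, named test rule."""
    rule_library[name] = check
```

For example, after monitoring flags a day with only 300 orders against a typical 1,000, the team might register a rule such as "orders_daily_count_at_least_500" so that future loads are tested against it automatically.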

Putting It All Together

The following diagram highlights where data test automation and quality testing need to be executed to support DataOps in a standard data warehousing configuration.

[Figure: Data test automation and quality testing in a standard data warehousing configuration for four data types.]

Of course, we're biased at BiG EVAL in that our testing solutions operate exceptionally well in this configuration.

Still, whatever testing solution you opt for, you'll find that for your DataOps environment to function, you'll need to deploy this type of arrangement.

Resources and Next Steps

If you want to learn more about automating data tests and data quality within a DataOps environment, the following resources may be helpful:

Take the first step! Get in touch with BiG EVAL...