How data warehouse testing helps prevent sudden load process breaks
Successfully loading data from source systems into a data warehouse depends heavily on the structure (metadata) of the source database or files. This structure acts as an interface for a data integration or ETL process and, as such, is an external dependency that the process relies on. If the interface is updated without the integration team knowing about it, the change may break the data integration process.
Source systems are often industry-standard products that do not require explicit testing on your side, because the vendor tests them. However, such systems are usually controlled and maintained by other teams or external parties. This means that the interface is outside the control of the team responsible for the data integration process. That team relies on updates being communicated properly so that it can react to changes immediately, and this communication is often where things go wrong.
However, when it comes to customized interfaces such as exports of flat files or web service APIs, there should be a clear definition of the interface. Testing is also required to ensure that there are no differences between the definition of the interface and its actual implementation. Otherwise, your data integration process could fail.
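As a rough illustration of testing a flat-file interface against its definition, the sketch below compares the header row of a delimited export with a list of expected columns. The column names, the semicolon delimiter, and the `check_header` helper are assumptions made for this example, not part of any specific tool.

```python
import csv
import io

# Hypothetical interface definition: the columns the export should deliver, in order.
EXPECTED_COLUMNS = ["customer_id", "name", "created_at"]


def check_header(file_obj, expected, delimiter=";"):
    """Return a list of differences between the file header and the definition."""
    reader = csv.reader(file_obj, delimiter=delimiter)
    actual = next(reader, [])  # an empty file yields an empty header
    problems = []
    for col in expected:
        if col not in actual:
            problems.append(f"missing column: {col}")
    for col in actual:
        if col not in expected:
            problems.append(f"unexpected column: {col}")
    # Only report ordering when the set of columns itself is correct.
    if not problems and actual != expected:
        problems.append("column order differs from definition")
    return problems


# Simulated export in which "name" was renamed to "full_name" without notice.
sample = io.StringIO("customer_id;full_name;created_at\n1;Ada;2024-01-01\n")
header_problems = check_header(sample, EXPECTED_COLUMNS)
for problem in header_problems:
    print(problem)
```

Running a check like this before the load step lets the process stop with a clear message instead of failing somewhere in the middle of a transformation.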
With that in mind, let us take a look at some common changes that may harm your data integration processes — and how to manage them.
Source Data Schema Issues
Data integrations often fail due to alterations in the schema of source data. Changes that may harm your data integration processes include:
- Changes in file names
- Changes in the names of tables or columns
- Changes in data types
- Removing or adding columns
- and many more
Why Do Such Changes Happen?
Changes in the schema of source data usually happen because the vendor or the development team applies system updates. This is not an issue in itself. However, things get problematic when the teams responsible for downstream systems, such as data warehouses, are not informed about those changes.
What Data Changes Need to Be Announced?
Adding a column or a table is usually not a problem. However, when columns or tables are removed or, even worse, data types are changed, load processes that depend on them can break. This is why you want to ensure such changes are always communicated to all stakeholders.
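The distinction between harmless and harmful changes can be sketched as a small schema comparison. The example assumes each schema is available as a mapping of column name to declared type. The `diff_schema` and `is_breaking` helpers and the sample columns are hypothetical.

```python
def diff_schema(old, new):
    """Compare two {column: type} mappings and classify the changes."""
    changes = {"added": [], "removed": [], "type_changed": []}
    for col in new:
        if col not in old:
            changes["added"].append(col)
    for col, old_type in old.items():
        if col not in new:
            changes["removed"].append(col)
        elif new[col] != old_type:
            changes["type_changed"].append((col, old_type, new[col]))
    return changes


def is_breaking(changes):
    """Treat removals and type changes as breaking; additions alone are not."""
    return bool(changes["removed"] or changes["type_changed"])


# Hypothetical schemas before and after a source system update.
old_schema = {"id": "INTEGER", "amount": "DECIMAL(10,2)", "note": "VARCHAR(50)"}
new_schema = {"id": "INTEGER", "amount": "VARCHAR(20)", "region": "VARCHAR(10)"}

changes = diff_schema(old_schema, new_schema)
print(changes)
print("breaking:", is_breaking(changes))
```

Whether an added column really is harmless depends on how the load process selects its columns, so the classification here is a starting point rather than a rule.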
5 Best Practices for Communicating Updates in Data Warehouse Testing
For optimal data warehouse quality assurance, consider these best practices:
1. Use Data Profiling To Get a Clear Picture Of Your Data
To start, you should ensure that the profile of the source data meets your needs. That includes:
- Age and Time Behavior
The better you understand the source data, the better you can monitor and detect any changes.
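A minimal profiling sketch, assuming the values of one source column have already been read into a Python list, might record a few basic statistics and the age of the newest record. The `profile_column` and `age_of_newest` helpers, the `loaded_at` sample values, and the fixed reference time are illustrative assumptions.

```python
from datetime import datetime


def profile_column(values):
    """Collect basic statistics for one column (None marks a missing value)."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_rate": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


def age_of_newest(timestamps, now):
    """Return how old the most recent non-null timestamp is relative to `now`."""
    newest = max(t for t in timestamps if t is not None)
    return now - newest


# Hypothetical timestamp column from a source table.
loaded_at = [
    datetime(2024, 1, 1),
    datetime(2024, 1, 3),
    None,
    datetime(2024, 1, 2),
]
now = datetime(2024, 1, 5)  # fixed reference time so the example is reproducible

profile = profile_column(loaded_at)
age = age_of_newest(loaded_at, now)
print(profile)
print("age of newest record:", age)
```

Storing such a profile for each load makes it possible to compare later deliveries against it and notice when the data starts to behave differently.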
2. Consolidate Sources and Avoid Hoarding Data
Companies often duplicate data sources and store them all, “just in case.” As the volume of information and the number of data sources grow, so do the complexity and effort required to keep up with changes and updates.
3. Catch and Communicate Quality Issues Early On
Data issues become costlier as you move further along the supply chain. If you catch an error late in the data’s journey, the data may have undergone multiple transformations that you would need to unravel to fix the problem.
To avoid that, implement quality controls as early in the supply chain as possible and always check the format and structure of source data.
4. Have Change Management Controls and Encourage Knowledge Transfer
To minimize data quality issues, take the time to develop and document clear procedures for monitoring, reporting, and implementing data changes. In addition, all stakeholders should have easy access to the information they need to implement those changes.
5. Set Up Automated Monitoring
Automated monitoring for changes in the schema of the source data can help you detect and address potential problems early on. This is important during development, as it can ensure that nothing goes wrong just before your release deadline, and it is even more important once the data warehouse is in production.
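One simple way to approximate such monitoring is to store a fingerprint of the source schema and compare it on every run. The sketch below uses an in-memory SQLite database as a stand-in for a real source system. The `orders` table and the `read_schema` and `fingerprint` helpers are assumptions for illustration, and a real source would need its own metadata query.

```python
import hashlib
import json
import sqlite3


def read_schema(conn, table):
    """Return {column_name: declared_type} using SQLite's PRAGMA table_info."""
    # The table name is interpolated directly, so it must come from trusted config.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return {row[1]: row[2] for row in rows}


def fingerprint(schema):
    """Hash the schema deterministically so two runs can be compared cheaply."""
    payload = json.dumps(schema, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

# First run: remember the known-good schema.
baseline = fingerprint(read_schema(conn, "orders"))

# Simulate an unannounced update in the source system.
conn.execute("ALTER TABLE orders ADD COLUMN region TEXT")

# Later run: compare against the stored baseline.
current_schema = read_schema(conn, "orders")
schema_changed = fingerprint(current_schema) != baseline
if schema_changed:
    print("Schema of 'orders' changed:", current_schema)
```

In a scheduled job, the baseline would be persisted between runs, and a detected difference would trigger a notification rather than a print statement.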
How Data Test Automation Can Help You
The software solutions from BiG EVAL cover two aspects of data testing. The first, the BiG EVAL Data Test Automation edition, provides robust testing during the development of data processing systems such as ETL processes. The other edition, BiG EVAL Data Quality Management, provides quality gating for source data interfaces in the live environment. Please book a meeting with a BiG EVAL expert or download a trial of BiG EVAL's solution to learn more.