Case Study: Automated Data Validation Framework
A Python-based solution that uses a centralised Excel data dictionary to automate data quality checks and generate actionable error reports.
Project Type
Data Governance & Automation
Core Technologies
Python, Pandas, JSON, Excel
Sector
Commercial / Govt.
Impact
Streamlined Data Quality Reporting
The Business Challenge
Maintaining data quality is a constant struggle for any data-driven organisation. When new data is ingested, it must be validated against a set of business rules to ensure its integrity. This process is often manual, with data managers visually inspecting files, an approach that does not scale and is highly susceptible to error.
The core challenge was to create a flexible and automated data validation process that could be easily managed by non-technical users. The system needed a central place to define data rules (a data dictionary) and a script to apply these rules to incoming data, flagging any issues for correction.
The Solution: A Dictionary-Driven Validation Script
I developed a Python-based solution that uses a simple, human-readable Excel file as a data dictionary to drive the validation logic. This approach empowers data managers to define and update data quality rules without touching any code.
The Four-Step Process:
1. The Excel Data Dictionary
A structured Excel file is created where each row defines the validation rules for a specific data field. This includes data type, min/max values, and a comma-separated list of allowable values or patterns.
2. Conversion to JSON
The Excel dictionary is converted into JSON, a machine-readable format that the Python script can easily parse and use as its set of instructions. A sketch of this conversion appears in the worked example below.
3. Automated Python Validation
A Python script reads the JSON dictionary and the data file to be validated. It then iterates through each row of the data, applying the rules defined in the dictionary to each field.
4. Actionable Error Reporting
The script generates a new Excel report that flags all data quality issues. Invalid cells are highlighted, and a separate sheet details all the rows that passed and failed validation, making it easy for data managers to identify and correct errors.
A Practical Example
Let's walk through the process with a simple dataset of employee information.
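For illustration, assume the input file employee_data.xlsx contains rows like these (hypothetical values, chosen so the errors in the final report at the end of this walkthrough fall out naturally):

| EmployeeID | Department | Salary |
|---|---|---|
| 1001 | Sales | 55000 |
| 1002 | Finance | 62000 |
| 1003 | IT | 48000 |
| 1004 | HR | 25000 |

With the header in row 1, the Finance entry sits in Excel row 3 and the 25000 salary in row 5.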
1. The Data Dictionary (Excel)
| Field Name | Data Type | Min Value | Max Value | Allowable Values |
|---|---|---|---|---|
| EmployeeID | integer | 1000 | 9999 | |
| Department | string | | | Sales,Marketing,IT,HR |
| Salary | integer | 30000 | 150000 | |
2. Converted to JSON
```json
{
  "EmployeeID": { "data_type": "integer", "min_value": 1000, "max_value": 9999 },
  "Department": { "data_type": "string", "allowable_values": ["Sales", "Marketing", "IT", "HR"] },
  "Salary": { "data_type": "integer", "min_value": 30000, "max_value": 150000 }
}
```
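The conversion itself can be a short script. Here is a minimal sketch, assuming the column headings from the dictionary table above; the file names are placeholders:

```python
import json
import pandas as pd

# Read the Excel data dictionary and build a {field: ruleset} mapping
dict_df = pd.read_excel('data_dictionary.xlsx')

rules = {}
for _, row in dict_df.iterrows():
    ruleset = {'data_type': row['Data Type']}
    if pd.notna(row['Min Value']):
        ruleset['min_value'] = int(row['Min Value'])
    if pd.notna(row['Max Value']):
        ruleset['max_value'] = int(row['Max Value'])
    if pd.notna(row['Allowable Values']):
        # Split the comma-separated cell into a clean Python list
        ruleset['allowable_values'] = [v.strip() for v in str(row['Allowable Values']).split(',')]
    rules[row['Field Name']] = ruleset

with open('data_dictionary.json', 'w') as f:
    json.dump(rules, f, indent=2)
```

Because the dictionary drives everything, adding or changing a rule is just an edit to the Excel file followed by a re-run of this conversion.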
3. The Python Validation Script
```python
import pandas as pd
import json

def validate_row(row, rules):
    """Applies a set of rules to a single row of data."""
    errors = []
    for column, ruleset in rules.items():
        value = row[column]
        # Check for allowable values
        if 'allowable_values' in ruleset and value not in ruleset['allowable_values']:
            errors.append(f"'{value}' is not an allowable value for {column}.")
        # Check for min/max values
        if 'min_value' in ruleset and value < ruleset['min_value']:
            errors.append(f"{column} value {value} is below minimum of {ruleset['min_value']}.")
        if 'max_value' in ruleset and value > ruleset['max_value']:
            errors.append(f"{column} value {value} is above maximum of {ruleset['max_value']}.")
    return errors if errors else None

# --- Main Script ---
# 1. Load dictionary and data
with open('data_dictionary.json', 'r') as f:
    validation_rules = json.load(f)
data_df = pd.read_excel('employee_data.xlsx')

# 2. Validate data
error_log = []
valid_rows_indices = []
for index, row in data_df.iterrows():
    row_errors = validate_row(row, validation_rules)
    if row_errors:
        for error in row_errors:
            # index + 2 converts the zero-based DataFrame index to the
            # one-based Excel row number, allowing for the header row
            error_log.append({'Row': index + 2, 'Error': error})
    else:
        valid_rows_indices.append(index)

# 3. Create report
error_df = pd.DataFrame(error_log)
valid_df = data_df.iloc[valid_rows_indices]
with pd.ExcelWriter('validation_report.xlsx') as writer:
    error_df.to_excel(writer, sheet_name='Errors', index=False)
    valid_df.to_excel(writer, sheet_name='Valid Data', index=False)
```
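The two-sheet report covers the pass/fail split. The cell highlighting described in step 4 needs one extra pass over the source workbook; here is a minimal sketch using openpyxl (an assumed choice of library), reusing data_df and validation_rules from the script above:

```python
from openpyxl import load_workbook
from openpyxl.styles import PatternFill

RED = PatternFill(start_color='FFC7CE', end_color='FFC7CE', fill_type='solid')

# Re-open the source workbook and map header names to column indices
wb = load_workbook('employee_data.xlsx')
ws = wb.active
col_index = {cell.value: cell.column for cell in ws[1]}

for index, row in data_df.iterrows():
    for column, ruleset in validation_rules.items():
        value = row[column]
        # Re-run the per-field checks to locate the offending cell
        failed = (
            ('allowable_values' in ruleset and value not in ruleset['allowable_values'])
            or ('min_value' in ruleset and value < ruleset['min_value'])
            or ('max_value' in ruleset and value > ruleset['max_value'])
        )
        if failed:
            # index + 2 skips the header row and converts to 1-based indexing
            ws.cell(row=index + 2, column=col_index[column]).fill = RED

wb.save('employee_data_highlighted.xlsx')
```

Writing the highlights to a copy (employee_data_highlighted.xlsx, a hypothetical name) keeps the original input untouched.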
4. The Final Validation Report (Excel Output)
The script would produce an Excel file. The "Errors" sheet would look like this, pinpointing the exact issues for data managers:
| Row | Error |
|---|---|
| 3 | 'Finance' is not an allowable value for Department. |
| 5 | Salary value 25000 is below minimum of 30000. |
Key Outcomes & Business Impact
- Empowerment of Non-Technical Users: Data managers can now control complex data validation rules from a simple Excel file, without any need for coding.
- Improved Data Quality: The automated process ensures that all incoming data is systematically checked, leading to a significant improvement in data integrity.
- Increased Efficiency: The script automates a previously manual and time-consuming task, allowing the data team to focus on more strategic initiatives.
- Clear and Actionable Reporting: The generated error reports make it easy to pinpoint and fix data quality issues, streamlining the data cleansing process.
Have a challenge for us?
Let's discuss how our expertise can be applied to solve your organisation's unique data and AI challenges. Schedule a complimentary call today.