Implementing Data Quality Check with AWS Data Glue by Tolulade AdemisoyeImplementing Data Quality Check with AWS Data Glue by Tolulade Ademisoye

Implementing Data Quality Check with AWS Data Glue

Tolulade Ademisoye

Database Engineer

Data Engineer

Amazon RDS

Amazon Redshift

PostgreSQL

Healthcare

Implementing Data Quality Check with AWS Data Glue

Tolulade Ademisoye

A data step-by-step guide

Enterprises rise and fall on the level and quality of information at their fingertips. In today’s data-driven world, poor data quality can lead to flawed insights, missed opportunities, and costly mistakes. That’s why implementing robust data quality checks isn’t just an option — it’s a necessity.

As a Data Solutions Architect, I help organisations ensure their data remains reliable, consistent, and actionable across all systems. But is data quality a one-time fix? No! It’s an ongoing process that requires strategic implementation at both upstream and downstream stages of the data pipeline.

Drawing from my experience consulting for global brands, I’ll share practical insights on how to build resilient data quality frameworks that drive better business decisions.

Data Quality Checks with AWS Glue

AWS Glue Data Quality is a tool from Amazon that helps organisations ensure the integrity of their data. This guide provides an overview of how to implement it within your company. If you need a consultant-led implementation, feel free to reach out.

To analyse data quality in databases or data warehouses like Amazon RDS and Amazon Redshift using AWS Glue Data Quality, follow these key steps:

Set Up AWS Glue Data Catalog: Make sure your data sources (Amazon RDS and Amazon Redshift) are registered in the AWS Glue Data Catalog.

Create Crawlers: Create crawlers in AWS Glue to scan your databases and populate the AWS Glue Data Catalog with the metadata of your tables and schemas.

Set Up Glue Connections: Set up connections in AWS Glue for both Amazon RDS and Amazon Redshift.

Create AWS Glue Data Quality Jobs: Create Glue Data Quality jobs to analyze the data quality of your tables.

Become a data mentor on Semis and host sessions! https://semis.reispartechnologies.com

Detailed step-by-step guide

Set Up AWS Glue Data Catalog

If you have not already done so, you’ll need to set up the Glue Data Catalog: - Navigate to AWS Glue in the AWS Management Console. - Ensure you have the necessary IAM roles and policies set up for Glue to access your databases.

Create Crawlers

Navigate to AWS Glue Console. 2. Create a New Crawler: — Go to Crawlers and click on “Add crawler”. — Give the crawler a name and provide the necessary details. 3. Specify Data Source: — Choose JDBC for Amazon RDS and Redshift. — Provide connection details (JDBC URL, username, password). 4. Select the IAM Role: — Choose or create an IAM role that has the necessary permissions to access the databases. 5. Schedule: — Schedule the crawler or set it to run on demand. 6. Configure Output: — Specify the Glue Data Catalog database where the metadata should be stored. 7. Run the Crawler: — Run the crawler to populate the Glue Data Catalog.

Set Up Glue Connections

1. Navigate to Connections in AWS Glue Console. 2. Create Connection: — Click on “Add connection”. — Choose JDBC as the connection type. — Enter connection details (JDBC URL, username, password) for both Amazon RDS and/or Amazon Redshift.

Create AWS Glue Data Quality Jobs

1. Navigate to AWS Glue Console. 2. Create Data Quality Job: — Go to the Jobs section and click on “Add job”. — Give the job a name and choose the IAM role. 3. Script to Load Data: — Write a script to load data from the Glue Data Catalog. — Perform data quality checks using the Glue Data Quality service. 4. Script Example:

Here’s a basic example of a Glue job script for data quality checks:

import sysfrom awsglue.transforms import *from awsglue.utils import getResolvedOptionsfrom pyspark.context import SparkContextfrom awsglue.context import GlueContextfrom awsglue.job import Jobargs = getResolvedOptions(sys.argv, ['JOB_NAME'])sc = SparkContext()glueContext = GlueContext(sc)spark = glueContext.spark_sessionjob = Job(glueContext)job.init(args['JOB_NAME'], args)# Load data from Glue Data Catalogdatasource = glueContext.create_dynamic_frame.from_catalog( database = "your_glue_catalog_database", table_name = "your_table_name")# Apply Data Quality Rulesdatasource = datasource.resolveChoice(specs=[('column_name', 'cast:double')])# Example data quality checkif datasource.count() == 0: raise ValueError("No data found")# Save the result (optional)datasink = glueContext.write_dynamic_frame.from_options( frame = datasource,  connection_type = "s3",  connection_options = {"path": "s3://your-output-bucket"},  format = "json")job.commit()

Running and Scheduling Jobs

— Run the Job: — Manually run the job from the Glue Console or schedule it to run periodically. - Monitor: — Use AWS Glue Console to monitor the job execution and review logs for data quality results.

Create DQDL (Data Quality Definition Language) Rules

AWS Data Quality will recommend some rules for you, which is okay. After the rules are recommended, you may insert and adjust them to suit your needs.

Sample data quality rules

Rules=[RowCount <=2315,IsComplete "id",StandardDeviation "text_id" between 3577669.81 and 3954266.63,Uniqueness "text_id" > 0.95,ColumnValues "text_id" between 186911368 and 201849865,StandardDeviation "invoice_id" between 412.73 and 456.17,ColumnValues "invoice_id" <= 1585,IsComplete "direction",ColumnValues "direction" in ["in"],ColumnLength "direction" = 2,Completeness "text_type" >= 0.99,ColumnValues "text_type" in ["flow","inbox"],ColumnValues "text_type" in ["flow"] with threshold >= 0.98,ColumnLength "text_type" <= 5,IsComplete "status",ColumnValues "status" in ["handled","queued"],ColumnValues "status" in ["handled"] with threshold >= 0.9,ColumnLength "status" between 5 and 8,IsComplete "visibility",ColumnLength "sales_id" <= 10]

Conclusion

By following these steps, you can set up AWS Glue to perform data quality analysis on your databases in Amazon RDS and Amazon Redshift. This involves using crawlers to populate the Glue Data Catalog, setting up connections, and creating and running Glue Data Quality jobs.

You may reach me on Semis, Linkedin or via email.

Become a data mentor on Semis and host sessions! https://semis.reispartechnologies.com

Like this project

Posted Sep 5, 2025

Implemented data quality checks using AWS Glue for improved data integrity.

Likes

Views