AWS Zero-ETL CDC Guide: Prevent Duplicate Records in S3 & Glue Data Catalog

If you’re using a Zero-ETL integration with DynamoDB as the source and seeing multiple versions of the same record in your data, you’re not alone: this is how the integration normally behaves. The Change Data Capture (CDC) process appends new change records to Amazon S3 rather than overwriting existing files. As a result, Athena queries that go through the Glue Data Catalog can return several versions of the same item, which look like duplicates.

To clean up this data and keep your dataset tidy, the best move is to set up a deduplication process. The most straightforward way is to create a Glue ETL job that runs regularly. This job pulls data from S3, looks at the timestamps or version numbers in each record (these come from the CDC process), and keeps only the latest version of each record. It then saves this cleaned data to a new location or table, making your queries quicker and more accurate.
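The core of that deduplication step can be sketched in plain Python. This is a minimal illustration, not the Glue job itself: the field names `order_id` and `updated_at` are hypothetical stand-ins for your table's primary key and whatever timestamp or version column your CDC records actually carry.

```python
from typing import Any, Dict, Hashable, Iterable, List, Tuple


def latest_versions(
    records: Iterable[Dict[str, Any]],
    key_fields: Tuple[str, ...],
    version_field: str,
) -> List[Dict[str, Any]]:
    """Keep only the record with the highest version_field for each key."""
    latest: Dict[Tuple[Hashable, ...], Dict[str, Any]] = {}
    for record in records:
        key = tuple(record[field] for field in key_fields)
        current = latest.get(key)
        # Use >= so that, on a tie, the record seen later in the input wins.
        if current is None or record[version_field] >= current[version_field]:
            latest[key] = record
    return list(latest.values())


# Hypothetical raw CDC rows: two versions of order A1, one of order B7.
raw_changes = [
    {"order_id": "A1", "status": "NEW", "updated_at": 100},
    {"order_id": "A1", "status": "SHIPPED", "updated_at": 250},
    {"order_id": "B7", "status": "NEW", "updated_at": 180},
]

curated = latest_versions(
    raw_changes, key_fields=("order_id",), version_field="updated_at"
)
```

In a Glue PySpark job, the same idea is usually expressed with a window function: number the rows per key in descending timestamp order and keep row 1.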

Glue’s data quality features can also help you spot duplicates at the file level, for example by checking whether duplicate files are stored in the same folder. For record-level duplicates, though, you’ll need to build custom logic into your ETL process.
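As a rough idea of what record-level custom logic can look like, the sketch below counts how many versions exist per key and reports the keys that appear more than once. The `order_id` field is a hypothetical primary key used only for illustration.

```python
from collections import Counter
from typing import Any, Dict, Hashable, Iterable, Tuple


def keys_with_multiple_versions(
    records: Iterable[Dict[str, Any]],
    key_fields: Tuple[str, ...],
) -> Dict[Tuple[Hashable, ...], int]:
    """Return each key that occurs more than once, with its occurrence count."""
    counts = Counter(
        tuple(record[field] for field in key_fields) for record in records
    )
    return {key: count for key, count in counts.items() if count > 1}


# Hypothetical rows: order A1 appears twice, order B7 once.
sample_rows = [
    {"order_id": "A1", "status": "NEW"},
    {"order_id": "A1", "status": "SHIPPED"},
    {"order_id": "B7", "status": "NEW"},
]

repeated = keys_with_multiple_versions(sample_rows, key_fields=("order_id",))
```

A check like this is handy as a sanity test after the deduplication job runs: on the curated table it should come back empty.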

Keep in mind that if the integration refreshes from your DynamoDB source less often than once a day (that is, the refresh interval is 24 hours or more), it follows a daily batch process rather than continuous updates. It waits until the full refresh interval has passed and then performs multiple exports, each covering one day’s worth of data, before processing the CDC records.
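To make the arithmetic concrete, here is a small sketch of how a long interval breaks down into day-sized windows. It only illustrates the idea; it is not how the service is implemented, and the function name and the 60-hour example are assumptions made for this illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def daily_windows(
    start: datetime,
    end: datetime,
    max_span: timedelta = timedelta(hours=24),
) -> List[Tuple[datetime, datetime]]:
    """Split [start, end) into consecutive windows of at most max_span each."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    windows = []
    cursor = start
    while cursor < end:
        upper = min(cursor + max_span, end)
        windows.append((cursor, upper))
        cursor = upper
    return windows


# Hypothetical 60-hour interval: yields windows of 24, 24, and 12 hours.
interval_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
interval_end = interval_start + timedelta(hours=60)
windows = daily_windows(interval_start, interval_end)
```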

Here are some best practices to consider:
– Run a Glue ETL job right after CDC updates to filter out outdated records, keeping only the most recent version based on primary keys and timestamps.
– Organize your S3 data by date or other categories to improve query speeds.
– Use the Data Catalog’s versioning features to monitor schema changes.
– Set your refresh intervals based on how fresh you need your data to be and the volume of updates.
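For the date-based layout mentioned in the list above, one common convention is Hive-style `key=value` folders, which Athena and Glue can treat as partitions. The sketch below builds such a prefix; the bucket name and base path are hypothetical, and the year/month/day scheme is just one reasonable choice.

```python
from datetime import datetime, timezone


def partition_prefix(base_prefix: str, event_time: datetime) -> str:
    """Build a Hive-style year/month/day prefix from a timezone-aware datetime."""
    if event_time.tzinfo is None:
        raise ValueError("event_time must be timezone-aware")
    # Normalize to UTC so that the same instant always maps to the same folder.
    utc_time = event_time.astimezone(timezone.utc)
    base = base_prefix.rstrip("/")
    return f"{base}/year={utc_time:%Y}/month={utc_time:%m}/day={utc_time:%d}/"


# Hypothetical curated location for the deduplicated output.
prefix = partition_prefix(
    "s3://example-bucket/curated/orders",
    datetime(2024, 3, 5, 23, 30, tzinfo=timezone.utc),
)
```

Queries that filter on `year`, `month`, and `day` can then skip the folders they don't need.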

As for Apache Iceberg, AWS Glue does support Iceberg tables, but upserting or merging DynamoDB changes into them may need additional configuration or a separate ETL step.
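If you do go the Iceberg route, that separate step is often a `MERGE INTO` statement run from Athena or Spark SQL. The helper below is only a sketch that assembles such a statement as a string. The table and column names are hypothetical, the identifiers are not escaped, and the staging table is assumed to be deduplicated already, since a merge typically fails when several source rows match the same target row.

```python
from typing import Sequence


def build_merge_sql(
    target: str,
    source: str,
    key_columns: Sequence[str],
    value_columns: Sequence[str],
) -> str:
    """Assemble a MERGE INTO statement that upserts source rows into target."""
    if not key_columns or not value_columns:
        raise ValueError("key_columns and value_columns must be non-empty")
    on_clause = " AND ".join(f"t.{col} = s.{col}" for col in key_columns)
    set_clause = ", ".join(f"{col} = s.{col}" for col in value_columns)
    all_columns = [*key_columns, *value_columns]
    insert_columns = ", ".join(all_columns)
    insert_values = ", ".join(f"s.{col}" for col in all_columns)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON ({on_clause})\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_columns}) VALUES ({insert_values})"
    )


# Hypothetical tables: a curated Iceberg table and a deduplicated staging table.
merge_sql = build_merge_sql(
    target="curated.orders",
    source="staging.orders_latest",
    key_columns=["order_id"],
    value_columns=["status", "updated_at"],
)
```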

In your setup, a good approach looks like this: DynamoDB feeds data into a Zero-ETL setup, which stores raw CDC data in S3. A Glue ETL job then processes this raw data to remove duplicates and stores the clean, curated data back in S3. Finally, the Data Catalog and Athena are used to run fast, accurate queries on the deduplicated data.

For detailed guidance, you can check the official AWS documentation on Zero-ETL integrations and data quality rules:
– Configuring a Zero-ETL integration with AWS Glue
– Using FileUniqueness in Glue for detecting duplicate files