Best Practices for Interacting with Veson Data Lake
Overview
The Veson Data Lake is a cloud-native, highly scalable solution that allows customers to interact with and download large amounts of data efficiently.
This article describes best practices that customers may choose to adopt when interacting with the Veson Data Lake. There are many ways to ingest and process data for reporting purposes; here we lay out one widely used approach.
Technologies and Patterns
This article assumes familiarity with the following technologies and patterns:
- AWS S3
- The AWS SDK
- The AWS CLI
- Microsoft SQL Server
- SQL Server Integration Services
- Relational database design concepts
Architecture Diagram
Downloading Data
Data can be downloaded from the data lake on a regular basis, either by scripting calls to the AWS CLI or by building a lightweight application that leverages the AWS SDK (the SDK approach is sketched below). Veson recommends running the download process at the same cadence as the data lake refresh interval. By default, the refresh interval is once per day; however, it can be adjusted to run as often as once per hour.
- Configure the download process to pull down all required files to a folder that matches the folder name observed in S3 (e.g. 2020-03-24)
- Programmatically unzip all of the downloaded zip files
- Programmatically delete the zip files to save space
The result should be a folder that contains an uncompressed copy of the latest flat files, ready for ingestion. Veson recommends logging the output of the download process. It is also worth setting up automated alerts for cases where downloads are interrupted or disk space runs low.
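As a rough illustration of the SDK approach, the Python sketch below uses boto3 to pull down a dated folder, unzip the archives, and delete them. The bucket name, prefix layout, and local paths are illustrative assumptions; the actual values come with your Veson Data Lake provisioning.

```python
import os
import zipfile
import boto3

# Hypothetical values -- substitute the bucket, prefix, and credentials
# provided with your Veson Data Lake subscription.
BUCKET = "example-veson-data-lake"
DATE_PREFIX = "2020-03-24"           # matches the folder name observed in S3
LOCAL_ROOT = os.path.join("downloads", DATE_PREFIX)

s3 = boto3.client("s3")
os.makedirs(LOCAL_ROOT, exist_ok=True)

# Download every object under the dated prefix into a matching local folder.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=DATE_PREFIX):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("/"):
            continue  # skip folder placeholder objects
        local_path = os.path.join(LOCAL_ROOT, os.path.basename(obj["Key"]))
        s3.download_file(BUCKET, obj["Key"], local_path)

# Unzip each archive, then delete it to save space.
for name in os.listdir(LOCAL_ROOT):
    if name.endswith(".zip"):
        archive = os.path.join(LOCAL_ROOT, name)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(LOCAL_ROOT)
        os.remove(archive)
```

Alternatively, the download step can be accomplished from the AWS CLI with a single `aws s3 sync` command wrapped in a scheduled script.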
Ingesting Data via ETL
Populate the Staging Tables
A recommended pattern for ingesting flat files into a data warehouse is to use Microsoft SQL Server Integration Services (SSIS) in conjunction with a staging schema and the "wipe and load" technique.
A staging schema is a set of tables that essentially mimic the structure of the flat files that are to be ingested. There should be one staging table for each file type.
The data in the staging tables is ephemeral: after ETL completes, the tables will be empty. For that reason, reports and other data warehouse consumers should not be permitted to access the staging tables directly. Permissions can be applied at the schema level to implement this restriction, as shown below.
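For example, assuming the staging tables live in a schema named `stg` and reports connect under a role named `reporting` (both names are illustrative), a schema-level DENY enforces the restriction:

```python
import pyodbc

# Hypothetical connection details, schema name, and role name.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-server;DATABASE=your-dw;Trusted_Connection=yes;"
)
with conn:  # pyodbc commits on clean exit from this block
    conn.execute("DENY SELECT ON SCHEMA::stg TO reporting;")
```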
- Create an SSIS package that:
- Wipes the staging tables via a TRUNCATE command
- Loads each flat file into the appropriate staging table
- Sends an alert if anything fails during the process (a scripted sketch of these steps follows)
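SSIS packages are authored in the Visual Studio designer rather than in code, so one cannot be reproduced inline; as a scripted equivalent of the same wipe-and-load logic, a Python sketch using pyodbc might look like the following. The `stg` schema, the file-name-to-table mapping, the CSV format options, and the folder path are all illustrative assumptions:

```python
import os
import pyodbc

# Illustrative assumptions: staging tables live in an "stg" schema and are
# named after the flat files (e.g. Voyage.csv -> stg.Voyage); the files sit
# in the dated folder produced by the download step. Note that BULK INSERT
# reads the path from the SQL Server machine, so the folder must be visible
# to the server (e.g. via a share), not just to the client running this script.
DATA_DIR = r"C:\datalake\downloads\2020-03-24"

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-server;DATABASE=your-dw;Trusted_Connection=yes;"
)
cur = conn.cursor()

try:
    for file_name in os.listdir(DATA_DIR):
        if not file_name.endswith(".csv"):
            continue
        table = "stg." + os.path.splitext(file_name)[0]
        path = os.path.join(DATA_DIR, file_name)
        # Wipe, then load.
        cur.execute(f"TRUNCATE TABLE {table};")
        cur.execute(
            f"BULK INSERT {table} FROM '{path}' "
            "WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n');"
        )
    conn.commit()
except Exception:
    conn.rollback()
    # Raise the alarm here -- the alerting mechanism is site-specific,
    # so none is shown.
    raise
```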
Update the Production Data Warehouse Tables
With the latest data now sitting in the staging tables, the penultimate step is to update the production tables in the data warehouse. There are several ways to do this; however, the goal is to ensure that any reports being executed are not interrupted. Wipe and load cannot be used on the production tables, because that would lead to an inconsistent experience for people running reports.
- Create an SSIS package that:
- Iterates over the list of staging tables
- For each staging table, executes an appropriate T-SQL statement to merge changes from the staging table into the production table
- Accounts for records that are absent from a staging table, which indicates they have been deleted since the last refresh (the MERGE sketch below handles this case)
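As one illustration, a T-SQL MERGE statement can upsert changed rows and delete rows that no longer appear in staging in a single pass. The sketch below embeds such a statement in the same Python/pyodbc approach; the table and column names are hypothetical:

```python
import pyodbc

# Hypothetical table and column names -- substitute your own schema.
MERGE_SQL = """
MERGE dbo.Voyage AS target
USING stg.Voyage AS source
    ON target.VoyageId = source.VoyageId
WHEN MATCHED THEN
    UPDATE SET target.VesselName = source.VesselName,
               target.Status     = source.Status
WHEN NOT MATCHED BY TARGET THEN
    INSERT (VoyageId, VesselName, Status)
    VALUES (source.VoyageId, source.VesselName, source.Status)
-- Rows absent from staging were deleted at the source since the last refresh.
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
"""

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-server;DATABASE=your-dw;Trusted_Connection=yes;"
)
with conn:  # pyodbc commits on clean exit from this block
    conn.execute(MERGE_SQL)
```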
Other Tips
- Consider where to place transactions in the process, so that reporting clients do not experience inconsistent reads across different entities
- Consider using snapshot isolation (see the sketch after this list)
- Monitor transaction log utilization for larger load operations
- Truncate the staging tables afterwards, to avoid confusion
- Instrument the ETL process with appropriate logging and alerting
- Centralize any transformation logic in the staging-to-production step; maintenance is generally easier when the staging tables match the flat files precisely
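Regarding the first two tips, snapshot isolation is enabled at the database level with ALTER DATABASE. A minimal sketch, assuming a database named YourDw and the same illustrative connection string as above:

```python
import pyodbc

# Placeholder connection details and database name.
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-server;DATABASE=YourDw;Trusted_Connection=yes;"
)

# ALTER DATABASE cannot run inside a user transaction, so open the
# connection in autocommit mode for this one-time setup.
conn = pyodbc.connect(CONN_STR, autocommit=True)
conn.execute("ALTER DATABASE YourDw SET ALLOW_SNAPSHOT_ISOLATION ON;")
# Optionally make versioned reads the default for readers. Note this
# statement waits until no other connections are using the database.
conn.execute("ALTER DATABASE YourDw SET READ_COMMITTED_SNAPSHOT ON;")
conn.close()
```

With snapshot isolation in place, grouping the per-table MERGE statements into a single transaction (one `with conn:` block in the earlier sketch) lets in-flight reports keep reading the pre-load version of every table until the whole load commits.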
Local Clean Up
After the flat files have been successfully ingested, be sure to programmatically clean up the download folder structure on your local file system. Veson recommends implementing a sliding-window delete mechanism, so that you retain n days' worth of data on your local system for diagnostic purposes.
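A minimal sketch of such a sliding-window delete, assuming dated YYYY-MM-DD folder names under an illustrative downloads root and a 7-day retention window:

```python
import shutil
from datetime import datetime, timedelta
from pathlib import Path

# Illustrative assumptions: dated folders named YYYY-MM-DD under this root,
# and a 7-day retention window.
DOWNLOAD_ROOT = Path("downloads")
RETENTION_DAYS = 7

cutoff = datetime.now() - timedelta(days=RETENTION_DAYS)
for folder in DOWNLOAD_ROOT.iterdir():
    if not folder.is_dir():
        continue
    try:
        folder_date = datetime.strptime(folder.name, "%Y-%m-%d")
    except ValueError:
        continue  # skip anything that is not a dated folder
    if folder_date < cutoff:
        shutil.rmtree(folder)
```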
Feedback
For any questions or feedback on this article, please contact our support team by submitting a case in the Help Center or by calling one of our regional offices.