Best Practices for Interacting with Veson Data Lake
Overview
The Veson Data Lake is a cloud-native, highly scalable solution that allows customers to interact with and download large amounts of data efficiently.
This article describes best practices that customers may choose to adopt when interacting with the Veson Data Lake. There are many ways to ingest and process data for reporting purposes; here we lay out one widely used approach.
Technologies and Patterns
This article assumes familiarity with the following technologies and patterns:
- AWS S3
- The AWS SDK
- The AWS CLI
- Microsoft SQL Server
- SQL Server Integration Services
- Relational database design concepts
Architecture Diagram
Downloading Data
Data can be downloaded from the data lake on a regular basis, either by scripting calls to the AWS CLI or by building a lightweight application that leverages the AWS SDK (the SDK approach is sketched below). Veson recommends running the download process at the same cadence as the data lake refresh interval. By default, the refresh interval is once per day; however, it can be adjusted to run as often as once per hour.
- Configure the download process to pull down all required files to a folder that matches the folder name observed in S3 (e.g. 2020-03-24)
- Programmatically unzip all of the downloaded zip files
- Programmatically delete the zip files to save space
The result should be a folder that contains an uncompressed copy of the latest flat files, ready for ingestion. Veson recommends logging the output of the download process. It is also worth setting up automated alerts for cases where downloads are interrupted or disk space runs low.
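As a rough illustration of the SDK approach, the Python sketch below uses boto3 to pull down a dated folder, unzip the archives, and delete them. The bucket name, prefix layout, and local paths are illustrative assumptions; the actual values come with your Veson Data Lake provisioning.

```python
import os
import zipfile
import boto3

# Hypothetical values -- substitute the bucket, prefix, and credentials
# provided with your Veson Data Lake subscription.
BUCKET = "example-veson-data-lake"
DATE_PREFIX = "2020-03-24"           # matches the folder name observed in S3
LOCAL_ROOT = os.path.join("downloads", DATE_PREFIX)

s3 = boto3.client("s3")
os.makedirs(LOCAL_ROOT, exist_ok=True)

# Download every object under the dated prefix into a matching local folder.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=DATE_PREFIX):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("/"):
            continue  # skip folder placeholder objects
        local_path = os.path.join(LOCAL_ROOT, os.path.basename(obj["Key"]))
        s3.download_file(BUCKET, obj["Key"], local_path)

# Unzip each archive, then delete it to save space.
for name in os.listdir(LOCAL_ROOT):
    if name.endswith(".zip"):
        archive = os.path.join(LOCAL_ROOT, name)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(LOCAL_ROOT)
        os.remove(archive)
```

Alternatively, the download step can be accomplished from the AWS CLI with a single `aws s3 sync` command wrapped in a scheduled script.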
Ingesting Data via ETL
Populate the Staging Tables
A recommended pattern for ingesting flat files into a data warehouse is to use Microsoft SQL Server Integration Services (SSIS) in conjunction with a staging schema and the "wipe and load" technique.
A staging schema is a set of tables that essentially mimic the structure of the flat files that are to be ingested. There should be one staging table for each file type.
The data in the staging tables is ephemeral: after ETL completes, the tables will be empty. For that reason, reports and other data warehouse consumers should not be permitted to access the staging tables directly. Permissions can be applied at the schema level to implement this restriction, as shown below.
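For example, assuming the staging tables live in a schema named `stg` and reports connect under a role named `reporting` (both names are illustrative), a schema-level DENY enforces the restriction:

```python
import pyodbc

# Hypothetical connection details, schema name, and role name.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-server;DATABASE=your-dw;Trusted_Connection=yes;"
)
with conn:  # pyodbc commits on clean exit from this block
    conn.execute("DENY SELECT ON SCHEMA::stg TO reporting;")
```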
- Create an SSIS package that:
- Wipes the staging tables via a TRUNCATE command
- Loads each flat file into the appropriate staging table
- Sends an alert if anything fails during the process (a scripted sketch of these steps follows)
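SSIS packages are authored in the Visual Studio designer rather than in code, so one cannot be reproduced inline; as a scripted equivalent of the same wipe-and-load logic, a Python sketch using pyodbc might look like the following. The `stg` schema, the file-name-to-table mapping, the CSV format options, and the folder path are all illustrative assumptions:

```python
import os
import pyodbc

# Illustrative assumptions: staging tables live in an "stg" schema and are
# named after the flat files (e.g. Voyage.csv -> stg.Voyage); the files sit
# in the dated folder produced by the download step. Note that BULK INSERT
# reads the path from the SQL Server machine, so the folder must be visible
# to the server (e.g. via a share), not just to the client running this script.
DATA_DIR = r"C:\datalake\downloads\2020-03-24"

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-server;DATABASE=your-dw;Trusted_Connection=yes;"
)
cur = conn.cursor()

try:
    for file_name in os.listdir(DATA_DIR):
        if not file_name.endswith(".csv"):
            continue
        table = "stg." + os.path.splitext(file_name)[0]
        path = os.path.join(DATA_DIR, file_name)
        # Wipe, then load.
        cur.execute(f"TRUNCATE TABLE {table};")
        cur.execute(
            f"BULK INSERT {table} FROM '{path}' "
            "WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n');"
        )
    conn.commit()
except Exception:
    conn.rollback()
    # Raise the alarm here -- the alerting mechanism is site-specific,
    # so none is shown.
    raise
```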
Update the Production Data Warehouse Tables
With the latest data now sitting in the staging tables, the penultimate step is to update the production tables in the data warehouse. There are several ways to do this; however, the goal is to ensure that any reports being executed are not interrupted. Wipe and load cannot be used on the production tables, because that would lead to an inconsistent experience for people running reports.
- Create an SSIS package that:
- Iterates over the list of staging tables
- For each staging table, executes an appropriate T-SQL statement to merge changes from the staging table into the production table
- Accounts for records that are absent from a staging table, which indicates they have been deleted since the last refresh (the MERGE sketch below handles this case)
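As one illustration, a T-SQL MERGE statement can upsert changed rows and delete rows that no longer appear in staging in a single pass. The sketch below embeds such a statement in the same Python/pyodbc approach; the table and column names are hypothetical:

```python
import pyodbc

# Hypothetical table and column names -- substitute your own schema.
MERGE_SQL = """
MERGE dbo.Voyage AS target
USING stg.Voyage AS source
    ON target.VoyageId = source.VoyageId
WHEN MATCHED THEN
    UPDATE SET target.VesselName = source.VesselName,
               target.Status     = source.Status
WHEN NOT MATCHED BY TARGET THEN
    INSERT (VoyageId, VesselName, Status)
    VALUES (source.VoyageId, source.VesselName, source.Status)
-- Rows absent from staging were deleted at the source since the last refresh.
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
"""

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-server;DATABASE=your-dw;Trusted_Connection=yes;"
)
with conn:  # pyodbc commits on clean exit from this block
    conn.execute(MERGE_SQL)
```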
Other Tips
- Consider where to place transactions in the process, so that reporting clients do not experience inconsistent reads across different entities
- Consider using snapshot isolation (see the sketch after this list)
- Monitor transaction log utilization for larger load operations
- Truncate the staging tables afterwards, to avoid confusion
- Instrument the ETL process with appropriate logging and alerting
- Centralize any transformation logic in the staging-to-production step; maintenance is generally easier when the staging tables match the flat files precisely
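Regarding the first two tips, snapshot isolation is enabled at the database level with ALTER DATABASE. A minimal sketch, assuming a database named YourDw and the same illustrative connection string as above:

```python
import pyodbc

# Placeholder connection details and database name.
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-server;DATABASE=YourDw;Trusted_Connection=yes;"
)

# ALTER DATABASE cannot run inside a user transaction, so open the
# connection in autocommit mode for this one-time setup.
conn = pyodbc.connect(CONN_STR, autocommit=True)
conn.execute("ALTER DATABASE YourDw SET ALLOW_SNAPSHOT_ISOLATION ON;")
# Optionally make versioned reads the default for readers. Note this
# statement waits until no other connections are using the database.
conn.execute("ALTER DATABASE YourDw SET READ_COMMITTED_SNAPSHOT ON;")
conn.close()
```

With snapshot isolation in place, grouping the per-table MERGE statements into a single transaction (one `with conn:` block in the earlier sketch) lets in-flight reports keep reading the pre-load version of every table until the whole load commits.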
Local Clean Up
After the flat files have been successfully ingested, be sure to programmatically clean up the download folder structure on your local file system. Veson recommends implementing a sliding-window delete mechanism, so that you retain n days' worth of data on your local system for diagnostic purposes.
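A minimal sketch of such a sliding-window delete, assuming dated YYYY-MM-DD folder names under an illustrative downloads root and a 7-day retention window:

```python
import shutil
from datetime import datetime, timedelta
from pathlib import Path

# Illustrative assumptions: dated folders named YYYY-MM-DD under this root,
# and a 7-day retention window.
DOWNLOAD_ROOT = Path("downloads")
RETENTION_DAYS = 7

cutoff = datetime.now() - timedelta(days=RETENTION_DAYS)
for folder in DOWNLOAD_ROOT.iterdir():
    if not folder.is_dir():
        continue
    try:
        folder_date = datetime.strptime(folder.name, "%Y-%m-%d")
    except ValueError:
        continue  # skip anything that is not a dated folder
    if folder_date < cutoff:
        shutil.rmtree(folder)
```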
Feedback
For any questions or feedback on this article, please contact our support team by submitting a case in the Help Center or by calling one of our regional offices.