This article describes some best practices that clients may choose to adopt when interacting with the Veson IMOS Platform (VIP) Data Lake. There are many ways to ingest and process data for reporting purposes; this article lays out one option that is widely utilized.:
Table of Content Zone |
---|
Technologies & PatternsThis article assumes familiarity with the following technologies and patterns:
Architecture DiagramDownloading DataData can be downloaded from the data lake Data Lake on a regular basis using either a scripted call to the AWS CLI or by building a lightweight application that leverages the AWS SDK. Veson Nautical recommends running the download process at the same cadence as the data lake Data Lake refresh interval. By default, the refresh interval is set to once per day. However, upon request, this can be adjusted to run as often as once per hour.
The result should be a folder that contains an uncompressed copy of the latest flat files, ready for ingestion. Veson Nautical recommends logging the output of the download process. It is also worthwhile setting to set up automated alerts for the scenario where downloads are interrupted or disk space problems occur. Ingesting Data via ETLPopulate the Staging TablesA recommended pattern for ingesting flat files into a data warehouse is to use Microsoft SSIS (SQL Server Integration Services) in conjunction with a staging schema and the "wipe and load" technique. A staging schema is a set of tables that essentially mimic the structure of the flat files that are to be ingested. There should be one staging table for each file type. The data in the staging tables are ephemeral – ; after ETL is complete, they will be empty. For that reason, reports or other data warehouse consumers should not be permitted to access the staging tables directly. Permissions can be applied at the schema level to implement this restriction. Create an SSIS package that:
Update the Production Data Warehouse TablesWith the latest data now sitting in the staging tables, the penultimate step is to update the production tables in the data warehouse. There are several ways to do this, ; however, the goal here is to ensure that any reports being executed are not interrupted. Wipe and load cannot be used on production tables because that would lead to an inconsistent experience for people running reports. Create an SSIS package that:
Other Tips
Local Clean UpAfter the flat files have been successfully ingested, be sure to programmatically clean up the download folder structure on your local file system. Veson Nautical recommends implementing a sliding window delete mechanism so that you have n days' worth of data on your local system for diagnostics purposes. |