Facilitating moving data to data lakes

Many of the bigger transactional databases are adding features that make it easier to move data to data lakes and/or warehouses for quicker reporting. Cloud-first DBs have been focusing on this a lot. AWS Aurora announced this recently, and Azure SQL Database & Azure Cosmos DB (a MongoDB competitor) both have it through the Azure Synapse Link feature.

What about on-prem databases?

Microsoft’s latest release of their on-prem database, SQL Server 2022, shipped a month or so ago with support for Azure Synapse Link. They bill it as a quick, easy way to pull data out of an on-prem database and into Azure Synapse Analytics. Being a SQL Server DBA, I wanted to dig into the details and figure out the nuances: how it works and what its limits are.

The good news is it’s VERY possible to use this to port data to any destination, not just Azure Synapse. I tested it out porting to Snowflake. However, there are some extra coding steps required to move the data somewhere other than Synapse. Not difficult, and definitely a lot easier than the traditional method of piping it via ETL processes (in my world, usually Azure Data Factory).
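To give a flavor of what those extra steps look like: Synapse Link lands change data as files in an Azure storage account, so getting it into Snowflake mostly means pointing Snowflake at that storage and loading from it. Here’s a minimal Python sketch that just builds the Snowflake load statement you’d run. The stage name, table name, and path are all made up for illustration, and the parquet landing-zone layout is an assumption you’d verify against your own environment.

```python
# Hypothetical sketch: building the Snowflake COPY INTO statement that would
# load Synapse Link change files (assumed parquet) from an external stage
# pointed at the Azure storage landing zone. All names are illustrative.

def copy_into_statement(table: str, stage: str, path: str) -> str:
    """Build a COPY INTO that loads parquet files from an external stage."""
    return (
        f"COPY INTO {table}\n"
        f"FROM @{stage}/{path}\n"
        f"FILE_FORMAT = (TYPE = PARQUET)\n"
        f"MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;"
    )

sql = copy_into_statement("STAGING.DBO_ORDERS", "LANDING_STAGE", "dbo/orders/")
print(sql)
```

In practice you’d wrap something like this in a scheduled task or Snowflake task, but the point is it’s string-building and a stage definition, not a full ETL pipeline.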

One other notable feature is that it works for loading DELTAS (just the data that has changed), even in databases that do not have good change keys. Basically, if there is no good way to identify what changes were made to a table, the ETL is coded as flush & fill: delete all the data on the destination and re-load it all. As you can imagine, on very large data sets this can take a long, long time and prevents anything close to real-time reporting. With this feature, reporting can be coded to be basically real-time (delayed by maybe a minute or so).
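To make the flush & fill vs. deltas difference concrete, here’s a toy Python sketch using plain dicts to stand in for a destination table keyed by primary key. The table names and change records are invented for illustration; the point is just that the delta path touches three rows instead of reloading everything.

```python
# Toy comparison: flush & fill reloads the whole destination table,
# while the delta approach applies only the changed rows.

def flush_and_fill(destination: dict, source: dict) -> dict:
    """Traditional approach: wipe the destination and reload every row."""
    destination.clear()
    destination.update(source)
    return destination

def apply_deltas(destination: dict, changes: list) -> dict:
    """Delta approach: apply only the rows that changed since the last sync."""
    for op, key, row in changes:
        if op in ("insert", "update"):
            destination[key] = row
        elif op == "delete":
            destination.pop(key, None)
    return destination

table = {1: {"qty": 5}, 2: {"qty": 3}}
changes = [("update", 1, {"qty": 7}),
           ("delete", 2, None),
           ("insert", 3, {"qty": 1})]
apply_deltas(table, changes)
print(table)  # -> {1: {'qty': 7}, 3: {'qty': 1}}
```

On a table with millions of rows and a handful of daily changes, that difference is what turns an hours-long reload into a near-real-time sync.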

I know, this is getting technical. The overall point is that the easier it is to pipe data from on-prem databases to analytical platforms like Snowflake, the better for us Snowflake investors!!