Loading Dynamic CSV Files With Snowflake

September 28, 2021 · 3 min read

Creator of Sling, Big Time Data Engineer

Now more than ever, CSV structure can change quickly according to business need. Learn how to set up ingestion logic to consume CSV files with an arbitrary number of columns.

in today’s data world, there is a vast array of solutions for loading data into a Data Warehouse platform such as Snowflake. Thankfully, Snowflake makes this easy with their COPY command. Having worked with Snowflake and Snowpipe with multiple clients, an increasingly common requirement is to allow loading of CSV files with a dynamic number of columns. Now more than ever, CSV structure, especially coming from external sources can change quickly according to business need, so having this capability is a must.

With Snowflake’s COPY INTO load functionality paired with transformation, it is possible to ingest an arbitrary number of CSV columns and prepare the data for flexible consumption downstream. Let’s go over how we can do this.

Assuming your data is staged in an accessible location such as AWS S3 or Google Cloud Storage, the next step is to properly define the COPY INTO statement. Let us use an AWS S3 example, where our CSV files are located in bucket path s3://big-time-data/files

We could simply load any CSV files with the command:

COPY INTO

For this to work properly, mytable must have the exact number of columns as the CSV files (not allowing flexibility). To allow any number of columns, we need to use the transformation feature. And in order to use the transformation feature, we need to define our S3 bucket as a named stage. The documentation to do this can be followed here.

Once your stage is mapped to the bucket root, we should be able to see our existing CSV files with LIST @mystage/files. We can now load our CSV files.

Let’s first create our needed table to receive our data:

Table 1

Now let’s run our COPY command. What we are doing here is constructing a JSON object to support up to 20 CSV columns (we can increase this to however many maximum number of columns we wish to support). If the CSV file has less than 20 columns, only the number of columns will be written in the object. We are also capturing the loaded_at timestamp, the file name and the row number which can be useful later on.

Copy 2

Now let's transform the raw lines to JSON records:

Table 2

With this, our data should be consumable by simply casting the values to their proper type. Say we have a column named account_id, we can query it with:

Select 1

The beauty of this method is the flexible nature, allowing CSV columns to be added / removed as necessary without having to update the Pipeline ingestion logic. We only need to adjust the downstream queries specifying whatever fields available in the JSON object.

Click here to see the code snippets.