Complete guide to loading data into Snowflake. Warehouses, stages, file formats, COPY INTO, and bulk loading best practices explained.

In Part 1 of our guide, we covered the revolutionary architecture of Snowflake. Now, it’s time to get hands-on. A data platform is only as good as the data within it, so understanding how to efficiently load data into Snowflake is a fundamental skill for any data professional.

This guide will walk you through the key concepts and practical steps for data ingestion, covering the role of virtual warehouses, the concept of staging, and the different methods for loading your data.

Step 1: Choose Your Compute – The Virtual Warehouse

Before you can load or query any data, you need compute power. In Snowflake, this is handled by a Virtual Warehouse. As we discussed in Part 1, this is an independent cluster of compute resources that you can start, stop, resize, and configure on demand.

Choosing a Warehouse Size

For data loading, the size of your warehouse matters.

For Bulk Loading: When loading large batches of data (gigabytes or terabytes) using the COPY command, using a larger warehouse (like a Medium or Large) can significantly speed up the process. The warehouse can process more files in parallel.
For Snowpipe: For continuous, micro-batch loading with Snowpipe, you don’t use your own virtual warehouse. Snowflake manages the compute for you on its own serverless resources.

Actionable Tip: Create a dedicated warehouse specifically for your loading and ETL tasks, separate from your analytics warehouses. You can name it something like ETL_WH. This isolates workloads and helps you track costs.

Step 2: Prepare Your Data – The Staging Area

You don’t load data directly from your local machine into a massive Snowflake table. Instead, you first upload the data files to a Stage. A stage is an intermediate location where your data files are stored before being loaded.

There are two main types of stages:

Internal Stage: Snowflake manages the storage for you. You use Snowflake’s tools (like the PUT command) to upload your local files to this secure, internal location.
External Stage: Your data files remain in your own cloud storage (AWS S3, Azure Blob Storage, or Google Cloud Storage). You simply create a stage object in Snowflake that points to your bucket or container.

Best Practice: For most production data engineering workflows, using an External Stage is the standard. Your data lake already resides in a cloud storage bucket, and creating an external stage allows Snowflake to securely and efficiently read directly from it.

Step 3: Load the Data – Snowpipe vs. COPY Command

Once your data is staged, you have two primary methods to load it into a Snowflake table.

A) The `COPY INTO` Command for Bulk Loading

The COPY INTO <table> command is the workhorse for bulk data ingestion. It’s a powerful and flexible command that you execute manually or as part of a scheduled script (e.g., in an Airflow DAG).

Use Case: Perfect for large, scheduled batch jobs, like a nightly ETL process that loads all of the previous day’s data at once.
How it Works: You run the command, and it uses the resources of your active virtual warehouse to load the data from your stage into the target table.

Example Code:SQL

-- This command loads all Parquet files from our external S3 stage
COPY INTO my_raw_table
FROM @my_s3_stage
FILE_FORMAT = (TYPE = 'PARQUET');

B) Snowpipe for Continuous Loading

Snowpipe is the serverless, automated way to load data. It uses an event-driven approach to automatically ingest data as soon as new files appear in your stage.

Use Case: Ideal for near real-time data from sources like event streams, logs, or IoT devices, where files are arriving frequently.
How it Works: You configure a PIPE object that points to your stage. When a new file lands in your S3 bucket, S3 sends an event notification that triggers the pipe, and Snowpipe loads the file.

Step 4: Know Your File Formats

Snowflake supports various file formats, but your choice has a big impact on performance and cost.

Highly Recommended: Use compressed, columnar formats like Apache Parquet or ORC. Snowflake is highly optimized to load and query these formats. They are smaller in size (saving storage costs) and can be processed more efficiently.
Good Support: Formats like CSV and JSON are fully supported. For these, Snowflake also provides a wide range of formatting options to handle different delimiters, headers, and data structures.
Semi-Structured Data: Snowflake’s VARIANT data type allows you to load semi-structured data like JSON directly into a single column and query it later using SQL extensions, offering incredible flexibility.

Conclusion for Part 2

You now understand the essential mechanics of getting data into Snowflake. The process involves:

Choosing and activating a Virtual Warehouse for compute.
Placing your data files in a Stage (preferably an external one on your own cloud storage).
Using the COPY command for bulk loads or Snowflake for continuous ingestion.

In Part 3 of our guide, we will explore “Transforming and Querying Data in Snowflake,” where we’ll cover the basics of SQL querying, working with the VARIANT data type, and introducing powerful concepts like Zero-Copy Cloning.

Load Data into Snowflake: Stages & File Formats

Step 1: Choose Your Compute – The Virtual Warehouse

Choosing a Warehouse Size

Step 2: Prepare Your Data – The Staging Area

Step 3: Load the Data – Snowpipe vs. COPY Command

A) The `COPY INTO` Command for Bulk Loading

B) Snowpipe for Continuous Loading

Step 4: Know Your File Formats

Conclusion for Part 2

More Articles

Load Data into Snowflake: Stages & File Formats

Step 1: Choose Your Compute – The Virtual Warehouse

Choosing a Warehouse Size

Step 2: Prepare Your Data – The Staging Area

Step 3: Load the Data – Snowpipe vs. COPY Command

A) The COPY INTO Command for Bulk Loading

B) Snowpipe for Continuous Loading

Step 4: Know Your File Formats

Conclusion for Part 2

More Articles

A) The `COPY INTO` Command for Bulk Loading