Load Data into Snowflake Using Python with Pandas


Loading data into Snowflake is a common need, and Python with pandas is the go-to solution for many data professionals. Whether you’re pulling data from a relational database, wrangling a CSV file, or prototyping a new pipeline, this combination leverages pandas’ intuitive data manipulation and Snowflake’s cloud-native scalability. But let’s be real—data loading isn’t always a simple task.

Files go missing, connections drop, and type mismatches pop up when you least expect them. That’s why robust error handling isn’t just nice-to-have; it’s essential for anything you’d trust in production. In this guide, we’ll walk through the fundamentals of getting data into Snowflake, explore practical examples with pandas and SQLAlchemy, and equip you with the tools to build a dependable, real-world-ready pipeline. Let’s dive in and make your data loading process as smooth as possible!

Understanding How Snowflake Handles Data

Before we dive into the code and examples, let’s chat about how Snowflake approaches data loading. Think of Snowflake like a really efficient postal service. Just as you might have different ways to send a package (regular mail, express delivery, or instant messenger), Snowflake offers various methods to load your data.

The “post office” in this case is what we call a staging area. You’ve got three types: Internal stages (like having a PO box at the post office) managed by Snowflake for simplified management; External stages that leverage cloud storage services like AWS S3, Azure Blob Storage, or GCP Cloud Storage; and User stages which provide personal file storage areas for individual users.

For File Processing: Snowflake automatically handles:

  • Format detection and validation
  • Compression/decompression (supported formats: GZIP, BZIP2, ZSTD)
  • Data parsing and type conversion
  • Micro-partitioning for optimal query performance

Snowflake’s automated type inference is powerful but needs careful attention. Initially, Snowflake analyzes a sample of rows to determine column data types. For example, for string columns, it considers the maximum length in the sample, not the entire dataset.

This is often good enough, but if the sample does not include the longest values in a set (e.g., one row with 200 characters among millions with 10), Snowflake might underestimate the required datatype length, leading to errors.

The best practice, whenever possible and particularly when the process will be repeated or automated (like a production load), is to explicitly define column types to avoid type inference issues.
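As a quick illustration of that practice on the pandas side, the dtype parameter of to_sql lets you declare column types up front instead of relying on inference. This is only a sketch: the table, columns, and snowflake_engine below are placeholders (an engine like this is created later in the article).

```python
import pandas as pd
from sqlalchemy.types import INTEGER, VARCHAR

df = pd.DataFrame({"customer_id": [1, 2], "note": ["short", "also short"]})

# snowflake_engine is assumed to be a SQLAlchemy engine connected to Snowflake.
# Declaring types here means a short sample of values can't lead to an
# undersized VARCHAR when the table is created.
df.to_sql(
    "customer_notes",
    snowflake_engine,
    if_exists="append",
    index=False,
    dtype={"customer_id": INTEGER(), "note": VARCHAR(2000)},
)
```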

Getting Your Hands Dirty: Different Ways to Load Data

Before we get into the weeds of setting up Python code, let’s explore different ways to load data into Snowflake based on common scenarios:

  • Relational Database Sources
    • Using database connectors (Oracle, PostgreSQL, MySQL, etc.)
    • Setting up incremental loads
    • Handling schema changes
  • File-Based Sources
    • CSV files (with various delimiters)
    • JSON documents
    • Parquet and other columnar formats
    • XML files
  • API and Streaming Sources
    • Real-time data ingestion
    • Message queue integration
    • Change data capture (CDC) feeds

In this article, I will demonstrate something I do almost every day: working with Python and pandas to load data into Snowflake using a few typical methods.

Using Python with pandas to load relational database data into Snowflake

First, let’s look at a scenario where you’re loading customer transaction data from your e-commerce relational database. This example also covers how to write a pandas DataFrame to Snowflake using SQLAlchemy, a Python SQL toolkit and Object Relational Mapper.

This is probably the most common scenario for data engineers, data scientists and analysts. This code snippet will feature:

  • Error Handling: Captures exceptions, logs them, saves failed records, and supports retries for transient failures (e.g., network timeouts).
  • Chunking: The chunksize=10000 parameter processes data in batches, reducing memory usage—crucial for larger datasets.
  • Logging: Tracks success and failure for auditing and debugging.
  • Flexibility: if_exists='append' adds new data without overwriting existing records.

In this example, we’ll extract yesterday’s transaction data from an e-commerce database (like MySQL or PostgreSQL) and load it into a Snowflake table called ‘daily_transactions’. The workflow includes handling potential errors, saving problematic records for further analysis, and implementing retry logic for temporary failures such as network timeouts.

The example assumes the source database has a ‘transactions’ table with a ‘date’ column containing date-only values (no time-of-day component) to filter by. The target Snowflake table structure will match the source data schema, as pandas will automatically create matching columns.

Let’s walk through this process step by step. First, we’ll connect to our source database and extract the previous day’s transaction data (this code can be downloaded from here):
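Since the downloadable script is the reference, here is a minimal sketch of what this step might look like, assuming a PostgreSQL source with a transactions table and a date column; the connection string and credentials are placeholders.

```python
from datetime import date, timedelta

import pandas as pd
from sqlalchemy import create_engine, text

# Placeholder connection string for the source database; swap in the
# driver and credentials for your own PostgreSQL, MySQL, Oracle, etc.
source_engine = create_engine(
    "postgresql://etl_user:etl_password@source-db:5432/ecommerce"
)

# Yesterday's date, matching the date-only column in the source table
yesterday = date.today() - timedelta(days=1)

# Pull the previous day's transactions into a pandas DataFrame
transactions_df = pd.read_sql(
    text("SELECT * FROM transactions WHERE date = :target_date"),
    source_engine,
    params={"target_date": yesterday},
)
print(f"Extracted {len(transactions_df)} transactions for {yesterday}")
```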

Next, we’ll set up our connection to Snowflake using SQLAlchemy. This provides a clean, standardized way to interact with Snowflake:
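A sketch of that connection using the snowflake-sqlalchemy URL helper might look like this; the account, credentials, and object names are placeholders.

```python
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine

# In practice, read these values from environment variables or a
# secrets manager rather than hard-coding them.
snowflake_engine = create_engine(URL(
    account="your_account",
    user="your_user",
    password="your_password",
    database="ECOMMERCE_DB",
    schema="PUBLIC",
    warehouse="COMPUTE_WH",
    role="LOADER_ROLE",
))
```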

Finally, we’ll load the data into Snowflake with comprehensive error handling:
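The downloadable script contains the full implementation; the sketch below captures its shape, with chunked loading via to_sql, a simple retry loop for transient failures, and failed records written out for inspection. The table and file names are placeholders.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("daily_load")

MAX_RETRIES = 3

for attempt in range(1, MAX_RETRIES + 1):
    try:
        transactions_df.to_sql(
            "daily_transactions",    # target table in Snowflake
            snowflake_engine,
            if_exists="append",      # add new rows without overwriting existing data
            index=False,
            chunksize=10000,         # load in batches to keep memory usage down
        )
        logger.info("Loaded %d rows into daily_transactions", len(transactions_df))
        break
    except Exception as exc:
        logger.error("Load attempt %d failed: %s", attempt, exc)
        if attempt == MAX_RETRIES:
            # Save the problematic records for later analysis, then give up
            transactions_df.to_csv("failed_daily_transactions.csv", index=False)
            raise
        time.sleep(30 * attempt)     # back off before retrying a transient failure
```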

The code first attempts to load the data into Snowflake inside the try block; if an error occurs, execution is directed to the except block.

This snippet pulls yesterday’s transactions, connects to Snowflake, and loads the data in batches. If something goes wrong—like a timeout or a schema mismatch—it logs the error, saves the problematic records, and even tries again if the issue is transient. It’s a solid foundation for daily ETL jobs or ad-hoc analysis.

For optimal performance and security, consider these additional tips:

  • Credentials Security: Store user, password, etc., in environment variables or a secure vault (e.g., AWS Secrets Manager).
  • Performance: For larger datasets, consider adjusting the chunksize parameter based on your data volume and available memory. For very large datasets, the native Snowflake Connector (explained in my next article) might be more efficient.
  • Validation: Pre-check transactions_df for nulls or duplicates before loading.

The most practical applications of this technique are:

  • Ad-hoc Analysis: Load small to medium datasets for quick exploration and analysis by data scientists.
  • ETL Pipelines: Extract from relational sources, transform with pandas, and load into Snowflake as part of a scheduled workflow.
  • Prototyping: Test data workflows before scaling to higher-volume batch or streaming solutions.

Load File-Based Sources into Snowflake using Pandas

Now, let’s shift gears to file-based data—like a CSV with customer info. This scenario demands more: validating the data, ensuring consistency, and tracking what’s loaded.

Let’s see an example of how to load CSV files, with custom delimiter detection, into Snowflake. The following class implements a comprehensive solution that:

  • Validates data before attempting to load it
  • Uses transactions to ensure data consistency
  • Implements proper error handling and logging
  • Verifies that the data was loaded correctly
  • Maintains audit records of each load operation

This pattern is particularly useful for production ETL processes where reliability and traceability are critical. The code is structured as a reusable class that can be integrated into larger data pipelines (The code module described can be downloaded whole from here):

Let’s begin by importing the Python modules we’ll need (a sample import block follows this list):

  • pandas – A popular library for working with data tables
  • snowflake.connector – The tool that lets Python talk to Snowflake
  • logging – For keeping track of what happens during execution
  • Other utilities for handling files, dates, and data validation
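A minimal version of those imports might look like this; the exact utility modules depend on what your validation and file handling need.

```python
import os
import hashlib
import logging
from datetime import datetime

import pandas as pd
import snowflake.connector
```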

Let’s now define a class, initialize it, and set up logging. This object will handle the entire process of loading CSV data into Snowflake. Think of it as a machine with specific parts (methods) working together. When you create a new instance of this “machine,” you provide configuration settings for connecting to Snowflake (username, password, and so on). The __init__ method saves those settings and sets up logging so you can track what happens. setup_logging creates a ‘logs’ folder if it doesn’t exist, then configures the logging system to write logs to both a file and the console, including timestamps and severity levels, using the current date in the log filename.
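The downloadable module is the reference implementation; here is a sketch of how such a class might begin, assuming the class name SnowflakeCSVLoader and a config dictionary holding the usual Snowflake connection keys.

```python
class SnowflakeCSVLoader:
    """Loads CSV files into Snowflake with validation, logging, and auditing."""

    def __init__(self, config: dict):
        # config holds the Snowflake connection settings
        # (user, password, account, warehouse, database, schema)
        self.config = config
        self.setup_logging()

    def setup_logging(self):
        # Create a 'logs' folder if it doesn't already exist
        os.makedirs("logs", exist_ok=True)
        log_file = f"logs/csv_loader_{datetime.now():%Y%m%d}.log"
        # Write to both a file and the console, with timestamps and severity levels
        logging.basicConfig(
            level=logging.INFO,
            format="%(asctime)s - %(levelname)s - %(message)s",
            handlers=[logging.FileHandler(log_file), logging.StreamHandler()],
        )
        self.logger = logging.getLogger(__name__)
```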

The following code sets up the methods you will call. It is broken up into sections for discussion here:

This next code block establishes a connection to Snowflake using the config details you provided. If it can’t connect (wrong password, network issue, etc.), it logs the error and raises an exception. Think of this as dialing a phone number to talk to Snowflake.
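One way that connection method might look, continuing the class sketched above:

```python
    def connect(self):
        # Open a connection to Snowflake using the supplied config; log and
        # re-raise any failure so the caller can decide what to do next.
        try:
            conn = snowflake.connector.connect(
                user=self.config["user"],
                password=self.config["password"],
                account=self.config["account"],
                warehouse=self.config["warehouse"],
                database=self.config["database"],
                schema=self.config["schema"],
            )
            self.logger.info("Connected to Snowflake")
            return conn
        except Exception as exc:
            self.logger.error(f"Failed to connect to Snowflake: {exc}")
            raise
```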

Before attempting to load data, let’s look at a method that checks whether the data meets basic requirements, such as:

  • Make sure the data file isn’t empty
  • Confirm required columns like ‘id’ and ‘name’ exist
  • Verify data types match expectations (e.g., ‘id’ should be an integer)

This is like checking ingredients before cooking to avoid problems later.
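A validation method along those lines might look like the following sketch; the required columns and types are placeholders to adapt to your own schema.

```python
    def validate_data(self, df: pd.DataFrame) -> bool:
        # Reject empty files outright
        if df.empty:
            self.logger.error("Validation failed: the file contains no rows")
            return False
        # Confirm the required columns exist (adjust to your schema)
        required_columns = {"id", "name"}
        missing = required_columns - set(df.columns)
        if missing:
            self.logger.error(f"Validation failed: missing columns {missing}")
            return False
        # Verify data types match expectations, e.g. 'id' should be an integer
        if not pd.api.types.is_integer_dtype(df["id"]):
            self.logger.error("Validation failed: 'id' is not an integer column")
            return False
        return True
```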

Let’s make sure the data’s integrity is intact. This creates a unique “fingerprint” of the data using a mathematical function. Later, you can use this fingerprint to verify whether the data was changed or corrupted during transfer. Read more here on checksums if you want to understand the basics.
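A simple way to implement that fingerprint is an MD5 hash over the data’s CSV representation, as in this sketch:

```python
    def calculate_checksum(self, df: pd.DataFrame) -> str:
        # Hash the CSV representation of the DataFrame; the same data always
        # produces the same value, so any change in transit is detectable.
        return hashlib.md5(df.to_csv(index=False).encode("utf-8")).hexdigest()
```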

Now, let’s start the main process: loading the data. This involves multiple steps: creating an orchestration method for the entire load, reading the CSV file, validating the CSV data, loading the data, and verifying the load, all with robust error handling along the way. Follow along with the comments in the code to understand the process.
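The downloadable module contains the full method; the sketch below shows how the pieces might fit together, assuming the target table and a hypothetical load_audit tracking table already exist in Snowflake.

```python
    def load_csv(self, file_path: str, table_name: str) -> bool:
        conn = None
        df = pd.DataFrame()
        try:
            # Read the CSV, parsing the 'date' column as real dates
            df = pd.read_csv(file_path, parse_dates=["date"])

            # Validate before touching Snowflake
            if not self.validate_data(df):
                raise ValueError("CSV validation failed")

            # Fingerprint the data for the audit record
            checksum = self.calculate_checksum(df)

            conn = self.connect()
            cursor = conn.cursor()
            # Treat the load, verification, and audit insert as one unit of work
            cursor.execute("BEGIN")

            # Load in chunks of 10,000 rows using parameterized inserts;
            # astype(object)/where() hands the connector plain Python values
            insert_sql = (
                f"INSERT INTO {table_name} ({', '.join(df.columns)}) "
                f"VALUES ({', '.join(['%s'] * len(df.columns))})"
            )
            for start in range(0, len(df), 10000):
                chunk = df.iloc[start:start + 10000]
                rows = chunk.astype(object).where(chunk.notna(), None).values.tolist()
                cursor.executemany(insert_sql, rows)

            # Verify the load by comparing counts (simplified check; assumes
            # the table holds only this load)
            cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
            if cursor.fetchone()[0] < len(df):
                raise ValueError("Row count in Snowflake is lower than the DataFrame")

            # Record the load in a tracking table, like keeping a shipping receipt
            cursor.execute(
                "INSERT INTO load_audit (file_name, row_count, checksum, loaded_at) "
                "VALUES (%s, %s, %s, CURRENT_TIMESTAMP)",
                (file_path, len(df), checksum),
            )
            conn.commit()
            self.logger.info(f"Loaded {len(df)} rows into {table_name}")
            return True
        except Exception as exc:
            if conn:
                conn.rollback()   # undo everything from this load
            self.logger.error(f"Load failed: {exc}")
            # Save the data that failed to load for later analysis
            os.makedirs("errors", exist_ok=True)
            df.to_csv(f"errors/failed_{datetime.now():%Y%m%d_%H%M%S}.csv", index=False)
            return False
        finally:
            if conn:
                conn.close()
```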

The first part of the method:

  • Opens and reads the CSV file into a pandas DataFrame (a table-like structure); the parse_dates parameter tells pandas to interpret the ‘date’ column as actual dates rather than text.
  • Uses the validation method to check that the data meets requirements.
  • Creates the unique fingerprint of the data for integrity checking.
  • Establishes a connection to Snowflake.
  • Starts a database transaction. This is like telling Snowflake “I’m about to do several things, but treat them as one action.”

The second part:

  • Loads the data into Snowflake in chunks of 10,000 rows at a time. Chunking is almost always more efficient for large datasets, and it pays to test different chunk sizes, especially for frequently repeated load processes.
  • Double-checks that all records were loaded by comparing the count in Snowflake to the original DataFrame size.
  • Adds an entry to a tracking table that records what was loaded, when, and with what checksum. This is like keeping a shipping receipt.
  • Finalizes (commits) all changes in Snowflake if everything went well.

If anything goes wrong during the process:

  • ROLLBACK cancels all changes (like undoing a transaction)
  • The error is logged
  • The data that failed to load is saved to a CSV file in an ‘errors’ folder for later analysis

Finally, the method always closes the database connection (regardless of success or failure) and returns whether the operation succeeded.

Let’s see an example of using the class created above to load CSV data into Snowflake.

This shows how to use the class we created:

  • Define connection settings
  • Create an instance of our loader
  • Call the load_csv method to load a specific file into a specific table
  • Check if it worked and print an appropriate message
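Put together, usage might look like the following; the environment variable names, warehouse, database, file, and table names are all placeholders.

```python
if __name__ == "__main__":
    # Connection settings: in practice, pull these from environment
    # variables or a secrets manager rather than hard-coding them.
    config = {
        "user": os.environ["SNOWFLAKE_USER"],
        "password": os.environ["SNOWFLAKE_PASSWORD"],
        "account": os.environ["SNOWFLAKE_ACCOUNT"],
        "warehouse": "COMPUTE_WH",
        "database": "ANALYTICS",
        "schema": "PUBLIC",
    }

    loader = SnowflakeCSVLoader(config)
    success = loader.load_csv("customers.csv", "customer_data")

    if success:
        print("CSV loaded successfully")
    else:
        print("Load failed: check the logs and the 'errors' folder")
```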

For optimal performance and security, consider these best practices:

  • Chunk-based loading for large files
  • Proper resource cleanup
  • Configurable logging
  • Type hints for better code maintainability

Wrapping Up

By now, you’ve seen how pandas and Python can make loading data into Snowflake both approachable and reliable. Whether you’re pulling from a database with SQLAlchemy or tackling CSV files with the Snowflake Connector, these methods balance ease of use with production-grade robustness. The error handling keeps things running smoothly when hiccups occur, while chunking and validation ensure scalability and accuracy.

Start with these examples for your next project: tweak them, watch the logs, and refine as you go. Remember that successful data loading is not just about getting data from point A to point B; it’s about ensuring data quality, maintaining system performance, and implementing proper error handling throughout the process. Adjust these patterns based on your specific use cases and requirements.

As your datasets grow or your needs evolve, you’ll be ready to scale up to batch processing or streaming with confidence. Happy loading, and may your pipelines always run clean!


About the author

Anil Kumar Moka


Anil Kumar Moka is a Polyglot Data Software Engineer and Visionary Technical Leader at a leading U.S. bank, bringing 17 years of expertise in data engineering, data security, software engineering and cloud solutions. His innovative enterprise-wide data remediation platforms safeguard over 100 million customers while achieving 95% faster processing times and multi-million dollar cost savings. An AWS Certified Solutions Architect and published researcher, Anil regularly authors influential white papers and technical articles on cloud computing and data engineering. His leadership in designing scalable solutions and driving technological excellence has established him as a thought leader in the industry. Anil holds a Bachelor of Technology in Electrical and Electronics Engineering, complemented by multiple professional certifications that underscore his commitment to technical mastery.