Introduction to Snowflake

Data Engineer’s Lunch #35: Introduction to Snowflake

In Data Engineer’s Lunch #35: Introduction to Snowflake, we introduce Snowflake and how we can use it for data engineering. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion and demo, is also embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!

In Data Engineer’s Lunch #35: Introduction to Snowflake, we introduce Snowflake and how we can use it for data engineering. Snowflake’s Data Cloud is powered by an advanced data platform provided as SaaS, which means no hardware (virtual or physical) to select, install, configure, or manage, virtually no software to install, configure, or manage, and ongoing maintenance, management, upgrades, and tuning are handled by Snowflake. Snowflake also enables data storage, processing, and analytic solutions that are faster, easier to use, and far more flexible than traditional offerings. It is not built on any existing database technology or “big data” software platforms such as Hadoop, rather it combines a completely new SQL query engine with an innovative architecture natively designed for the cloud

Snowflake runs completely on cloud infrastructure. All components of Snowflake’s service (other than optional command-line clients, drivers, and connectors), run in public cloud infrastructures. Snowflake uses virtual compute instances for its compute needs and a storage service for persistent storage of data. Snowflake cannot be run on private cloud infrastructures (on-premises or hosted). It also uses a central data repository for persisted data that is accessible from all compute nodes in the platform. Snowflake processes queries using MPP (massively parallel processing) compute clusters where each node in the cluster stores a portion of the entire data set locally.

Architecture

Diagram of Snowflake architecture.
Snowflake Architecture
  • Database Storage
    • Takes loaded data, reorganizes it into its internal optimized, compressed, columnar format, and stores the optimized data in cloud storage
  • Query Processing
    • Query execution is performed in the processing layer
    • Processes queries using “virtual warehouses”
    • Each virtual warehouse is an MPP compute cluster composed of multiple compute nodes allocated by Snowflake from a cloud provider
  • Cloud Services
    • cloud services layer is a collection of services that coordinate activities across Snowflake
    • These services tie together all of the different components of Snowflake in order to process user requests, from login to query dispatch
    • Services managed in this layer include:
      • Authentication
      • Infrastructure management
      • Metadata management
      • Query parsing and optimization
      • Access control

Connecting to Snowflake

  • Snowflake supports multiple ways of connecting to the service:
    • A web-based user interface from which all aspects of managing and using Snowflake can be accessed.
    • Command line clients (e.g. SnowSQL) which can also access all aspects of managing and using Snowflake.
    • ODBC and JDBC drivers that can be used by other applications (e.g. Tableau) to connect to Snowflake.
    • Native connectors (e.g. Python, Spark) that can be used to develop applications for connecting to Snowflake.
    • Third-party connectors that can be used to connect applications such as ETL tools (e.g. Informatica) and BI tools (e.g. ThoughtSpot) to Snowflake.
  • Logging into Snowflake

Supported Cloud Platforms

  • A Snowflake account can be hosted on AWS, GCP, and Azure
  • Differences in unit costs for credits and data storage are calculated by region on each cloud platform
  • Snowflake supports loading data from files staged in any of the following locations, regardless of the cloud platform for your Snowflake account:
    • Internal (i.e. Snowflake) stages
    • Amazon S3
    • Google Cloud Storage
    • Microsoft Azure blob storage
  • Snowflake supports both bulk data loading and continuous data loading (Snowpipe). Likewise, Snowflake supports unloading data from tables into any of the above staging locations
  • Current Limitations:

Snowflake Editions

Their feature edition matrix can be found here. In the walkthrough portion of Data Engineer’s Lunch #35: Introduction to Snowflake, we created a trial account using the standard edition and even more so, Snowflake did not require a credit card!

Data Lifecycle

Diagram of the Snowflake data lifecycle.
Snowflake Data Lifecycle
  • All user data in Snowflake is logically represented as tables that can be queried and modified through standard SQL interfaces. Each table belongs to a schema which in turn belongs to a database.
  • You can organize your data into databases, schemas, and tables. Snowflake does not limit the number of databases you can create or the number of schemas you can create within a database. Snowflake also does not limit the number of tables you can create in a schema.
  • You can insert data directly into tables. In addition, Snowflake provides DML for loading data into Snowflake tables from external, formatted files.
  • Once data is stored in a table, you can issue SELECT statements to query the data.
  • Once data is stored in a table, all standard DML operations can be performed on the data. In addition, Snowflake supports DDL actions such as cloning entire databases, schemas, and tables.
  • In addition to using the DML command, DELETE, to remove data from a table, you can truncate or drop an entire table. You can also drop entire schemas and databases.

Walkthrough

In the walkthrough portion of Data Engineer’s Lunch #35: Introduction to Snowflake, we mainly introduced the UI/UX and did some basic queries to show how to use Snowflake on a beginner level. Snowflake provides a large number of tutorials and resources to help users get started on their platform, so be sure to check those out as well. Click the video below to watch the walkthrough of Snowflake’s UI/UX.

Resources

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!