5-min Quickstart Guide#

This guide walks you through the steps to quickly get started with Ponder. Alternatively, check out the video below or follow along in the companion notebook (Open in Colab).

Step 0: Create an Account#

Before we get started, you first need a Ponder account. If you don’t already have a Ponder account, you can create a free account by signing up here.

Step 1: Setting up Ponder#

You can use Ponder by simply installing it as a library on your own machine. This flexible, lightweight approach lets you keep using Ponder within your own environment and your existing notebook/IDE setup.

To install the library, run the following command:

pip install ponder # Install Ponder dependencies and DuckDB (local execution)

If you intend to use Snowflake or BigQuery with Ponder, you will need to install Ponder with the relevant target:

pip install "ponder[snowflake]" # Install Ponder dependencies and Snowflake connector
pip install "ponder[bigquery]" # Install Ponder dependencies and BigQuery connector
pip install "ponder[all]" # Install Ponder dependencies and all supported database connectors

Step 2: Login to Authenticate#

Next, from your terminal, log in to register your product key.

ponder login

Go to your Account Settings and copy your product key.

[Screenshot: the product key shown on the Account Settings page]

When you are prompted to enter your product key, paste the key you copied and press Enter to proceed.

Alternatively, if you don’t have terminal access, you can pass your product key directly to ponder.init(your_ponder_key) in Step 3.
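As a sketch, passing the key directly might look like the following (the key string here is a placeholder, not a real product key):

```python
import ponder

# Pass your product key directly instead of running `ponder login` first.
# "YOUR-PRODUCT-KEY" is a placeholder; use the key from your Account Settings.
ponder.init("YOUR-PRODUCT-KEY")
```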

Step 3: Initialize Ponder#

Now we are ready to start using Ponder! To get started, you first need to initialize Ponder.

import ponder
ponder.init()

To learn more about what initializing Ponder does, check out this page!

Step 4: Configure your database connection#

Next, configure your connection to whichever database engine you’d like to work with. If you already have a cloud data warehouse, you can use your warehouse provider’s standard Python connection library. If you don’t currently use a cloud data warehouse, we encourage you to use DuckDB as the engine. Below we show you how to configure each:

  • Snowflake
  • BigQuery
  • DuckDB

To establish a connection to Snowflake, we leverage Snowflake’s Python connector.

import snowflake.connector

db_con = snowflake.connector.connect(
    user=****,
    password=****,
    account=****,
    role=****,
    database=****,
    schema=****,
    warehouse=****,
)

To establish a connection to BigQuery, we leverage Google Cloud’s Python client for Google BigQuery.

from google.cloud import bigquery
from google.cloud.bigquery import dbapi
from google.oauth2 import service_account

import json

db_con = dbapi.Connection(
            bigquery.Client(
               credentials=service_account.Credentials.from_service_account_info(
                  json.loads(open("my_service_account_key.json").read()),
                  scopes=["https://www.googleapis.com/auth/bigquery"]
               )
            )
         )

To establish a connection to DuckDB, all you need is duckdb.connect(), which creates an in-memory database.

import duckdb
db_con = duckdb.connect()

If you are looking for more information about how to set up the connection, please check out this guide.

Step 5: Selecting Your Data Source#

With Ponder, you can work with an existing table in your database using read_sql, and operate on CSV or Parquet files using read_csv and read_parquet; see this guide for more information.

If you already have your data in your warehouse, you can connect to the table by passing the database connection you configured to read_sql as follows:

import modin.pandas as pd
df = pd.read_sql("DBNAME.TABLENAME", con=db_con)

If you want to work with a CSV file instead: since pandas’ read_csv doesn’t take a database connection, we first need to configure Ponder to use the database connection that we established earlier.

ponder.configure(default_connection=db_con)

Note for BigQuery users: Google BigQuery users must also set a default GBQ dataset as part of the configuration step via the bigquery_dataset parameter. For more information on ponder.configure parameters, visit the documentation here.
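For BigQuery, the configuration call might look like the following sketch (the dataset name is a placeholder for your own GBQ dataset):

```python
import ponder

# `bigquery_dataset` sets the default GBQ dataset Ponder will use;
# "MY_DATASET" is a placeholder, and `db_con` is the BigQuery DBAPI
# connection created in Step 4.
ponder.configure(default_connection=db_con, bigquery_dataset="MY_DATASET")
```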

Then, we can use pandas’ read_csv command to load the CSV into your database for further processing.

import modin.pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/tpch/orders.csv")

Step 6: Starting Pondering 🎉#

Once the data is loaded, we can start hacking away with pandas! Note that any operations you perform here with pandas run directly in your database, rather than on the local CSV file.

df.describe()
df.groupby("O_ORDERSTATUS").mean()
pd.concat([df, df])
# .. and much more! 🧹📊🔍🧪

In this tutorial, we walked through a quick example of using pandas to work with data directly in your database. Next, we will look at the different ways you can work with a data source in Ponder. To learn more about how you can use Ponder, check out this tutorial series.