Welcome to Ponder 👋
Ponder is a scalable data science platform that lets you run your pandas workflows directly in your data warehouse. Ponder gives you the scalability and security benefits of your data warehouse, while still preserving the ease-of-use and flexibility of pandas.
Ponder builds on top of the open-source project Modin, adding support for data warehouses tailored to the needs of production-scale workloads in enterprise settings.
How it works
Ponder uses your data warehouse as its execution engine. The current version of Ponder supports cloud data warehouses (Snowflake and BigQuery) as well as a local execution mode (with DuckDB); support for additional databases and warehouses is coming soon. To get started, first initialize Ponder, then configure your database connection:
import ponder
ponder.init()
Next, you can connect to different database engines:
To establish a connection to Snowflake, we leverage Snowflake’s Python connector.
import snowflake.connector

db_con = snowflake.connector.connect(
    user=****,
    password=****,
    account=****,
    role=****,
    database=****,
    schema=****,
    warehouse=****,
)
To establish a connection to BigQuery, we leverage Google Cloud’s Python client for Google BigQuery. Here, we are connecting to the CUSTOMER dataset by authenticating via your BigQuery service account key.
from google.cloud import bigquery
from google.cloud.bigquery import dbapi
from google.oauth2 import service_account
import json

db_con = dbapi.Connection(
    bigquery.Client(
        credentials=service_account.Credentials.from_service_account_info(
            json.loads(open("my_serviceaccount_key.json").read()),
            scopes=["https://www.googleapis.com/auth/bigquery"],
        )
    )
)
If you do not already have your account key file or have not yet created a dataset, please follow our step-by-step guide here for more information.
To establish a connection to DuckDB, all you need to do is call duckdb.connect(), which creates an in-memory database.
import duckdb

db_con = duckdb.connect()
Once you have initialized Ponder and established the database connection, you can connect to a table via:
import modin.pandas as pd

# Connect to your table named "CUSTOMER" in Snowflake
df = pd.read_sql("CUSTOMER", db_con)
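pd.read_sql is the standard pandas entry point, and it also accepts a full SQL query. A stock-pandas sketch using Python’s built-in SQLite, just to illustrate the call pattern (not Ponder itself; the table and data are hypothetical):

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory table, only to demonstrate the read_sql pattern
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CUSTOMER (C_NAME TEXT, C_MKTSEGMENT TEXT)")
con.executemany(
    "INSERT INTO CUSTOMER VALUES (?, ?)",
    [("Alice", "BUILDING"), ("Bob", "MACHINERY")],
)
con.commit()

# With a plain DBAPI connection, pass a SQL query rather than a bare table name
df = pd.read_sql("SELECT * FROM CUSTOMER", con)
```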
Now you can start hacking away with pandas! 🐼
df.describe()
df.groupby("C_MKTSEGMENT").mean()
pd.concat([df, df])
# .. and much more! 🧹📊🔍🧪
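To see what these calls do, here is a stock-pandas sketch on a toy DataFrame (hypothetical sample data); with Ponder, the same code runs against the warehouse-backed df instead:

```python
import pandas as pd

# Hypothetical stand-in for the warehouse-backed CUSTOMER DataFrame
df = pd.DataFrame({
    "C_MKTSEGMENT": ["BUILDING", "BUILDING", "MACHINERY"],
    "C_ACCTBAL": [100.0, 300.0, 50.0],
})

summary = df.describe()                    # summary statistics for numeric columns
means = df.groupby("C_MKTSEGMENT").mean()  # mean account balance per market segment
stacked = pd.concat([df, df])              # stack the frame on top of itself
```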
You can find a list of pandas APIs we support here. To get started, check out this 10-minute quickstart guide.