Selecting Your Data Source#

Before we can start our analysis, we need to connect to a data source. Ponder currently supports read_csv for operating on CSV files, read_sql for operating on tables that are already stored in your data warehouse, and read_parquet for operating on Parquet files.

Note

Unlike in pandas, the data ingestion (read_*) commands in Ponder do not actually load the data into an in-memory dataframe. Instead, you can think of the Ponder DataFrame as a pointer to the table in your warehouse, which stores the data and performs the computation.

[Figure: Ponder architecture]

read_sql: Working with existing tables#

To work with data stored in an existing table in your warehouse, use the read_sql command: provide the name of the table (here, CUSTOMER) and pass your database connection object db_con to the con parameter.

df = pd.read_sql("CUSTOMER", con=db_con)

Now that we have a Ponder DataFrame that points to the CUSTOMER table in your data warehouse, you can work on df just like you would with any pandas dataframe, with all the computation happening in your database!
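Because Ponder mirrors the pandas API, the operations you already know work unchanged on df. The sketch below uses stock pandas on a small in-memory stand-in (the column names are hypothetical, loosely modeled on a TPC-H CUSTOMER table) just to illustrate the kinds of calls you could run; with Ponder, the same code would execute inside your warehouse.

```python
import pandas as pd

# A small in-memory stand-in for the CUSTOMER table; with Ponder,
# df would instead point to the table in your warehouse.
df = pd.DataFrame({
    "C_NAME": ["Alice", "Bob", "Carol", "Dan"],
    "C_NATIONKEY": [1, 2, 1, 2],
    "C_ACCTBAL": [100.0, 250.0, 50.0, 300.0],
})

# Familiar pandas operations -- filtering, aggregation -- look
# exactly the same on a Ponder DataFrame.
high_balance = df[df["C_ACCTBAL"] > 75.0]
avg_by_nation = df.groupby("C_NATIONKEY")["C_ACCTBAL"].mean()

print(len(high_balance))     # 3
print(avg_by_nation.loc[1])  # 75.0
```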

read_csv: Working with CSV files#

Going beyond read_sql: if a pandas command doesn't take a database connection as a parameter, as is the case for read_csv, we need to configure Ponder to use the database connection that we established earlier.

ponder.configure(default_connection=db_con)

Then, use the read_csv command and pass in the file path to the CSV file.

df = pd.read_csv("https://github.com/ponder-org/ponder-datasets/blob/main/tpch/orders.csv?raw=True", header=0)

Ponder will automatically process your CSV file and load it into a temporary table in your warehouse for analysis.
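The call above follows the standard pandas read_csv signature, so the sketch below illustrates it with stock pandas on a tiny locally written file (the file name and columns are made up for the example); with Ponder configured as above, the same read_csv call would load the file into a temporary warehouse table instead.

```python
import pandas as pd
from pathlib import Path

# Write a tiny CSV to read back; with Ponder, read_csv would load
# this file into a temporary table in your warehouse instead.
Path("orders_sample.csv").write_text(
    "O_ORDERKEY,O_TOTALPRICE\n1,100.50\n2,200.25\n"
)

# header=0 tells pandas (and Ponder) that the first row of the
# file holds the column names.
df = pd.read_csv("orders_sample.csv", header=0)

print(list(df.columns))          # ['O_ORDERKEY', 'O_TOTALPRICE']
print(df["O_TOTALPRICE"].sum())  # 300.75
```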

read_parquet: Working with Parquet files#

Likewise, since read_parquet doesn't take a database connection as a parameter, we need to configure Ponder to use the database connection that we established earlier.

ponder.configure(default_connection=db_con)

Then, use the read_parquet command and pass in the file path to the Parquet file.

df = pd.read_parquet("https://github.com/ponder-org/ponder-datasets/blob/main/userdatasample.parquet?raw=True")

Ponder will automatically process your Parquet file and load it into a temporary table in your warehouse for analysis.

Now that we have seen how pd.read_* works in Ponder, we will discuss how you can use pd.to_* to save your dataframes with Ponder.