Comparison with other DataFrame systems#

In this post, we discuss how Ponder differs from libraries such as PySpark, Snowpark, and Polars, to help practitioners understand the strengths and weaknesses of these tools and which one might work best for their use case.

System Overview#

Spark DataFrame API#

The Spark DataFrame API provides a convenient interface for working with collections of data with named columns in Spark. It is available in Scala, Java, Python, and R. In Python, Spark also provides the pyspark.pandas.DataFrame interface, which implements the pandas API on top of Spark DataFrames.

Snowpark DataFrame API#

The Snowpark DataFrame API is an interface for querying and working with data stored in Snowflake. It allows users to lazily construct their queries through its Python, Java, or Scala interface. Compute is then triggered on Snowflake when users invoke a method that accesses the results of the DataFrame.
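To make the idea of lazy query construction concrete, here is a minimal pure-Python sketch. The LazyFrame class below is a hypothetical toy written for illustration only. It is not Snowpark's API and does not talk to Snowflake. It only shows the general pattern: operations are recorded as they are chained, and nothing runs until a result-accessing method (here named collect) is called.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Row = Dict[str, int]


@dataclass(frozen=True)
class LazyFrame:
    """Hypothetical toy class for illustration; not part of any real library."""

    rows: List[Row]
    steps: Tuple = ()  # recorded operations, not yet executed

    def filter(self, predicate):
        # Record the filter instead of running it.
        return LazyFrame(self.rows, self.steps + (("filter", predicate),))

    def select(self, *columns):
        # Record the projection instead of running it.
        return LazyFrame(self.rows, self.steps + (("select", columns),))

    def collect(self):
        # Only now are the recorded steps executed, in order.
        result = self.rows
        for kind, arg in self.steps:
            if kind == "filter":
                result = [r for r in result if arg(r)]
            else:
                result = [{c: r[c] for c in arg} for r in result]
        return result


# Made-up sample data.
data = [
    {"id": 1, "amount": 50},
    {"id": 2, "amount": 500},
    {"id": 3, "amount": 900},
]

# Building the query does no work; it only records two steps.
query = LazyFrame(data).filter(lambda r: r["amount"] > 100).select("id")

# Execution happens here.
result = query.collect()
print(result)
```

In a real lazy system, deferring execution like this gives the engine a chance to optimize or push down the whole query before running it. The cost is that users must remember to ask for the results explicitly.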

Polars#

Polars is a DataFrame library written in Rust that leverages parallel in-memory execution to speed up query processing. Polars provides its own Rust and Python interfaces, which are designed to be different from the pandas API.

How is Ponder different from other DataFrame systems?#

A small example snippet to showcase the differences between the systems.

pandas API coverage differences#

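  • Lazy APIs such as Snowpark require an explicit call such as collect() to trigger execution and retrieve results, whereas pandas returns results eagerly.

  • Snowpark's DataFrame API was designed to resemble PySpark's rather than the pandas API.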

Dataframe Semantics#

  • Snowpark DataFrames have no notion of an index or of row order.

  • Order is important because users sometimes need to set a predefined row order and have it preserved.
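The index and ordering semantics described above can be sketched with plain pandas. This is a minimal illustration on made-up data, and it runs against pandas itself rather than Ponder. Positional access with iloc only makes sense because the rows have a defined order, and a user-chosen order stays in place once it is set.

```python
import pandas as pd

# Made-up sample data with an explicit label index.
df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Cairo"], "temp_c": [4, 19, 27]},
    index=["a", "b", "c"],
)

# Positional access: "the first row" is well defined because rows are ordered.
first_row_city = df.iloc[0]["city"]

# Label-based access through the index.
by_label = df.loc["b", "temp_c"]

# Impose a predefined row order; the frame keeps exactly that order.
custom_order = ["c", "a", "b"]
reordered = df.loc[custom_order]
last_city = reordered.iloc[-1]["city"]

print(first_row_city, by_label, list(reordered.index), last_city)
```

A system with no notion of row order cannot give a stable answer to "the first row" or "the last row" without an explicit sort, which is why these operations are a useful test of dataframe semantics.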

Backend-Agnostic#

Snowpark only runs on Snowflake. Ponder was designed to run in an infrastructure-agnostic manner, so if you want to run the same code on BigQuery, Redshift, or an on-premise cluster, Ponder will support that.

Snowpark offers a “Dataframe-like” experience. But dataframes, by their very nature, require tables to preserve order and support positional indexing (iloc), neither of which Snowpark supports. Our team developed the theory behind dataframes at UC Berkeley and built Modin, the open source distributed dataframe project. We mirror the pandas API so that Python users don’t have to learn a new syntax, and we stay true to the API by supporting indexing and preserving order across tables and operations.