Ducktales — Part 2: Pandas vs DuckDB — large dataset processing comparison

Diptiman Raichaudhuri
3 min readSep 12, 2023

--

Disclosure: All opinions expressed in this article are my own, and represent no one but myself and not those of my current or any previous employers.

I often joke with my good friend and colleague Shailesh Mishra (my go-to person for anything related to DB partition analysis for long queries, RDBMS log reading etc ..), that SQL has come a full circle with the rise of ELT !

For SQL engineers, tools like duckdb provide an immense opportunity to apply the same engineering principles on large semi-structured datasets for data engineering and ML EDA tasks.

DuckDB provides the vector runtime and aggregation efficiency on large datasets with pure SQL, which I will demonstrate in this article.

I took the Binance BTC raw.parquet which is around 1.78 GB in size (a semi-gargantuan large file !! ).

I ran the following notebook in my personal laptop with an AMD Ryzen 7, 32 GB RAM.

Here’s the file :

Large 1.78GB parquet file

I run it in a notebook virtual environment, with the usual setup :

  1. virtualenv duckdblargefiletest
  2. then install duckdb, pandas and notebook
pip install duckdb pandas notebook

3. start jupyter lab

I use timeit to measure the load and COUNT(*) on the dataset, which comprises of stock candlestick data.

import pandas as pd
import timeit

start = timeit.default_timer()
print("The start time is :", start)

df = pd.read_parquet("D:\\large_file\\raw.parquet")

print("The difference of time is :",timeit.default_timer() - start)

It took my laptop around 5s to load the 1.78 GB single large file in memory

The start time is : 10514.3752879
The difference of time is : 4.882038199999442

Then I ran a df.count() on the dataframe :

start = timeit.default_timer()
print("The start time is :", start)

print(df.count())

print("The difference of time is :",timeit.default_timer() - start)
The start time is : 10564.2045469
Open 46674268
High 46674268
Low 46674268
Close 46674268
Volume 46674268
Close_time 46674268
Quote_asset_volume 46674268
Number_of_trades 46674268
Taker_buy_base_asset_volume 46674268
Taker_buy_quote_asset_volume 46674268
symbol 46674268
time 46674268
dtype: int64
The difference of time is : 1.8905711999996129

Around, 1.9s. Not too bad, though, aggregation operations would take considerably more time with Pandas.

Now, I load the same file using DuckDB

import duckdb

conn = duckdb.connect()

start = timeit.default_timer()
print("The start time is :", start)

conn.execute("""
DROP VIEW IF EXISTS binance_btc;
""")
conn.execute("""
CREATE VIEW binance_btc AS
SELECT * FROM 'D:\\large_file\\raw.parquet'
""")

print("The difference of time is :",timeit.default_timer() - start)

Here’s how long it took :

The start time is : 10801.4392424
The difference of time is : 0.0012158000008639647

This was blazing fast ! More so, aggregations run fast as well !

start = timeit.default_timer()
print("The start time is :", start)

result = conn.execute("""
SELECT count(*) FROM binance_btc;
""").fetchall()
print("Count", result)

print("The difference of time is :",timeit.default_timer() - start)
The start time is : 3220.0213813
Count [(46674268,)]
The difference of time is : 0.017077200000130688
start = timeit.default_timer()
print("The start time is :", start)

result = conn.execute("""
SELECT avg(Number_of_trades) FROM binance_btcgroup by symbol;
""").fetchall()
#print("Count", result)

print("The difference of time is :",timeit.default_timer() - start)
The start time is : 11185.2873231
The difference of time is : 1.145269399999961

Applying the well-known SQL filters on large semi-structured datasets is an immense help for SQL engineers and I am sure, just like SQLite, duckdb would slowly get embedded within multiple frameworks/products to run blazing fast analytical queries on large semi-strcutured datasets from developer workstations.

So long !

--

--

No responses yet