Ducktales — Part 2: Pandas vs DuckDB — large dataset processing comparison
Disclosure: All opinions expressed in this article are my own, and represent no one but myself and not those of my current or any previous employers.
I often joke with my good friend and colleague Shailesh Mishra (my go-to person for anything related to DB partition analysis for long queries, RDBMS log reading, etc.) that SQL has come full circle with the rise of ELT!
For SQL engineers, tools like DuckDB are an immense opportunity to apply the same engineering principles to large semi-structured datasets in data engineering and ML EDA tasks.
DuckDB delivers a vectorized runtime and efficient aggregation over large datasets with pure SQL, which I will demonstrate in this article.
I took the Binance BTC raw.parquet file, which is around 1.78 GB in size (a semi-gargantuan file!).
I ran the following notebook on my personal laptop with an AMD Ryzen 7 and 32 GB of RAM.
Here’s the file:
I run it in a notebook virtual environment, with the usual setup:
1. Create a virtual environment: virtualenv duckdblargefiletest
2. Install duckdb, pandas and notebook: pip install duckdb pandas notebook
3. Start Jupyter Lab
I use timeit to measure the load and a COUNT(*) on the dataset, which consists of stock candlestick data.
import pandas as pd
import timeit

# Time the full in-memory load of the 1.78 GB Parquet file with Pandas
start = timeit.default_timer()
print("The start time is :", start)
df = pd.read_parquet("D:\\large_file\\raw.parquet")
print("The difference of time is :", timeit.default_timer() - start)
It took my laptop around 5 s to load the single 1.78 GB file into memory.
The start time is : 10514.3752879
The difference of time is : 4.882038199999442
Then I ran a df.count() on the DataFrame:
# Time the per-column count on the loaded DataFrame
start = timeit.default_timer()
print("The start time is :", start)
print(df.count())
print("The difference of time is :", timeit.default_timer() - start)
The start time is : 10564.2045469
Open 46674268
High 46674268
Low 46674268
Close 46674268
Volume 46674268
Close_time 46674268
Quote_asset_volume 46674268
Number_of_trades 46674268
Taker_buy_base_asset_volume 46674268
Taker_buy_quote_asset_volume 46674268
symbol 46674268
time 46674268
dtype: int64
The difference of time is : 1.8905711999996129
Around 1.9 s. Not too bad, though aggregation operations would take considerably more time with Pandas.
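For perspective, the counterpart aggregation in Pandas would look roughly like the sketch below, reusing the DataFrame and the column names shown in the df.count() output above; I have not reproduced its timing here.
# Sketch: average number of trades per symbol with Pandas,
# computed on the DataFrame already loaded above.
start = timeit.default_timer()
avg_trades = df.groupby("symbol")["Number_of_trades"].mean()
print(avg_trades)
print("The difference of time is :", timeit.default_timer() - start)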
Now, I load the same file using DuckDB:
import duckdb

conn = duckdb.connect()

# Recreate the view over the Parquet file; a view only records the query,
# the data itself is scanned when a query runs against it
start = timeit.default_timer()
print("The start time is :", start)
conn.execute("""
DROP VIEW IF EXISTS binance_btc;
""")
conn.execute("""
CREATE VIEW binance_btc AS
SELECT * FROM 'D:\\large_file\\raw.parquet'
""")
print("The difference of time is :", timeit.default_timer() - start)
Here’s how long it took:
The start time is : 10801.4392424
The difference of time is : 0.0012158000008639647
This was blazing fast! That is expected: creating the view does not read the data; DuckDB scans the Parquet file only when a query runs against the view. Even so, aggregations run fast as well!
# Time a COUNT(*) over the view
start = timeit.default_timer()
print("The start time is :", start)
result = conn.execute("""
SELECT count(*) FROM binance_btc;
""").fetchall()
print("Count", result)
print("The difference of time is :", timeit.default_timer() - start)
The start time is : 3220.0213813
Count [(46674268,)]
The difference of time is : 0.017077200000130688
# Time an average aggregation grouped by symbol over the view
start = timeit.default_timer()
print("The start time is :", start)
result = conn.execute("""
SELECT avg(Number_of_trades) FROM binance_btc GROUP BY symbol;
""").fetchall()
# print(result)
print("The difference of time is :", timeit.default_timer() - start)
The start time is : 11185.2873231
The difference of time is : 1.145269399999961
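Filters compose just as naturally with these aggregations. The sketch below runs a filtered min/max query against the same view; the symbol value 'BTCUSDT' is only an illustrative assumption about what the data contains.
# Sketch: filtered aggregation over the same DuckDB view.
# 'BTCUSDT' is an assumed symbol value, for illustration only.
start = timeit.default_timer()
result = conn.execute("""
SELECT symbol, max(High) AS max_high, min(Low) AS min_low
FROM binance_btc
WHERE symbol = 'BTCUSDT'
GROUP BY symbol;
""").fetchall()
print(result)
print("The difference of time is :", timeit.default_timer() - start)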
Applying the well-known SQL filters to large semi-structured datasets is an immense help for SQL engineers, and I am sure that, just like SQLite, DuckDB will slowly get embedded within multiple frameworks and products to run blazing-fast analytical queries on large semi-structured datasets from developer workstations.
So long!