3.3B Taxi Trips and the Unreasonable Effectiveness of Duckdb and Parquet

In an instant

| num_rows |
| --- |
| 3367004170 |

I was going to write a blog post comparing arrow-ballista and duckdb on a large task, but duckdb was so fast it was alarming. So instead I decided to write down some observations from using duckdb on the NYC Taxi Dataset.

Context

I used Duckdb v0.9.2 on a 96-core dual-Xeon workstation with ~5TB of disk space. I had good results on my 16-core workstation too, but markedly faster results on the big workstation. There were 439 parquet files as of January 2024. I put the parquet files in one directory and used read_parquet('taxi/*.parquet', filename=true, union_by_name=true), which lets duckdb read the files in parallel. I used the duckdb cli and gave it a database on disk, duckdb taxidb, but read from the parquet files throughout.
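As a rough sketch of the setup (the view name taxi_trips is just shorthand for this post, not something from the original session), this is how the files are exposed to queries and how the row count at the top is obtained:

```sql
-- Read all the parquet files in parallel; union_by_name reconciles column
-- differences across years, and filename=true adds the source file as a column.
CREATE OR REPLACE VIEW taxi_trips AS
SELECT *
FROM read_parquet('taxi/*.parquet', filename=true, union_by_name=true);

-- The headline number: total rows across every file.
SELECT count(*) AS num_rows FROM taxi_trips;
```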

Interesting Results

Messy data example

This is the record with max(total_amount). It looks like an outlier: the total should probably be 15.0, but an extra 10000000 appears to have been added in, giving 10000015.0.

| fare_amount | extra | mta_tax | tip_amount | tolls_amount | total_amount | filename |
| --- | --- | --- | --- | --- | --- | --- |
| 14.0 | 0.5 | 0.5 | 0.0 | 0.0 | 10000015.0 | taxi/yellow_tripdata_2014-08.parquet |
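A query along these lines surfaces that record (a sketch, not the exact query from the session; the column list is just the one shown in the table above):

```sql
-- Sketch: find the single trip with the largest total_amount.
SELECT fare_amount, extra, mta_tax, tip_amount, tolls_amount, total_amount, filename
FROM read_parquet('taxi/*.parquet', filename=true, union_by_name=true)
ORDER BY total_amount DESC
LIMIT 1;
```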

Trips by year

The annual trip count dropped sharply in 2020 with Covid-19 and has recovered somewhat since.

The trip count by year is calculated from the parquet filename; a sketch of such a query is shown below, followed by the results.
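This is not the exact SQL from the session; the regexp_extract pattern is an assumed way to pull the year out of names like taxi/yellow_tripdata_2014-08.parquet:

```sql
-- Sketch: derive the year from each file's name and count trips per year.
SELECT regexp_extract(filename, '(\d{4})-\d{2}', 1) AS file_year,
       count(*) AS trip_count
FROM read_parquet('taxi/*.parquet', filename=true, union_by_name=true)
GROUP BY file_year
ORDER BY file_year;
```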

| file_year | trip_count |
| --- | --- |
| 2009 | 170896055 |
| 2010 | 111529714 |
| 2011 | 176887259 |
| 2012 | 171359007 |
| 2013 | 171816340 |
| 2014 | 181284588 |
| 2015 | 228661528 |
| 2016 | 279631429 |
| 2017 | 317546944 |
| 2018 | 372645858 |
| 2019 | 368790969 |
| 2020 | 184638604 |
| 2021 | 221374980 |
| 2022 | 267424247 |
| 2023 | 142516648 |

Trips by hour of day

The number of trips varies by hour of day, but a large share of trips were missing this information; the last row counts trips with a NULL pickup hour.

| hour_of_pickup | count_star() |
| --- | --- |
| 0 | 50136451 |
| 1 | 36511245 |
| 2 | 26713761 |
| 3 | 19394015 |
| 4 | 14414029 |
| 5 | 13478009 |
| 6 | 29142824 |
| 7 | 49853450 |
| 8 | 61836553 |
| 9 | 63869199 |
| 10 | 63211677 |
| 11 | 65778781 |
| 12 | 69481912 |
| 13 | 69504611 |
| 14 | 72311337 |
| 15 | 69914744 |
| 16 | 62170489 |
| 17 | 72704518 |
| 18 | 85003150 |
| 19 | 85608774 |
| 20 | 79561917 |
| 21 | 78185955 |
| 22 | 74891862 |
| 23 | 64675234 |
| NULL | 1988649673 |
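A sketch of the kind of query behind this table; the pickup_datetime column name is an assumption (the real files use a few different names for the pickup timestamp across years), and the large final row is the NULL bucket:

```sql
-- Sketch: trips by hour of pickup. pickup_datetime is an assumed column name;
-- rows where it is NULL land in the final bucket.
SELECT hour(pickup_datetime) AS hour_of_pickup,
       count(*)
FROM read_parquet('taxi/*.parquet', filename=true, union_by_name=true)
GROUP BY hour_of_pickup
ORDER BY hour_of_pickup;
```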

Thoughts on duckdb

In the past I hit segfaults in duckdb with a dataset of long strings coming from an ndjson file, but that was several minor versions ago. This time I was really impressed with its stability while throwing a huge amount of parquet at it. The difference in query times between the 16-core workstation and the 96-core workstation is exactly what you want it to be: duckdb can use all of the compute resources you can throw at it.

Locality of compute, memory, and disk

Instead of throwing many (relatively) small computers at a problem, duckdb is software that makes good use of the resources of one big computer. The duckdb python package works well for writing processes that run automatically, but I love the duckdb cli for exploring a dataset.

Go Forth and Crunch data

Try out duckdb with the knowledge that you can throw huge amounts of data at it and it will perform.