In an instant
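A count over all of the parquet files, something like the query below, returns in an instant (the glob and `read_parquet` options are described in the Context section):

```sql
-- Count all rows across every parquet file in the taxi/ directory.
SELECT count(*) AS num_rows
FROM read_parquet('taxi/*.parquet', filename=true, union_by_name=true);
```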
num_rows |
---|
3367004170 |
I was going to write a blog post comparing arrow-ballista and duckdb on a large task, but duckdb was so fast it was alarming. So I decided to write down some of my observations from using duckdb on the NYC Taxi Dataset.
Context
I used DuckDB v0.9.2 on a 96-core dual-Xeon workstation with ~5 TB of disk space. I had good results on my 16-core workstation too, but markedly faster results on the big workstation. There were 439 parquet files as of January 2024. I put the parquet files in one directory and used `read_parquet('taxi/*.parquet', filename=true, union_by_name=true)`, allowing duckdb to read the files in parallel. I used the duckdb CLI and gave it a database on disk (`duckdb taxidb`), but read from the parquet files throughout.
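Inside the CLI, a quick way to see what columns the combined files expose is `DESCRIBE` over the same `read_parquet` call; a minimal sketch:

```sql
-- Show the unioned schema across all of the parquet files.
DESCRIBE SELECT *
FROM read_parquet('taxi/*.parquet', filename=true, union_by_name=true);
```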
Interesting Results
Messy data example
This is the record with `max(total_amount)`. It seems like an outlier: the total should be 15.0, but an accidental 10000000 was added in.
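Sorting by `total_amount` rather than just taking `max()` keeps the rest of the row attached; a query roughly like this finds it:

```sql
-- Pull the single trip with the largest total_amount, keeping the
-- surrounding columns so the bad value can be inspected in context.
SELECT fare_amount, extra, mta_tax, tip_amount, tolls_amount,
       total_amount, filename
FROM read_parquet('taxi/*.parquet', filename=true, union_by_name=true)
ORDER BY total_amount DESC NULLS LAST
LIMIT 1;
```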
fare_amount | extra | mta_tax | tip_amount | tolls_amount | total_amount | filename |
---|---|---|---|---|---|---|
14.0 | 0.5 | 0.5 | 0.0 | 0.0 | 10000015.0 | taxi/yellow_tripdata_2014-08.parquet |
Trips by year
Trip counts peaked in 2018 and 2019, dropped sharply in 2020 with Covid-19, and have increased some since.
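The year comes from the filename column that `filename=true` adds; a sketch, assuming the `yellow_tripdata_YYYY-MM.parquet` naming seen above:

```sql
-- Group trips by the year embedded in each parquet filename.
SELECT regexp_extract(filename, '(\d{4})-\d{2}', 1) AS file_year,
       count(*) AS trip_count
FROM read_parquet('taxi/*.parquet', filename=true, union_by_name=true)
GROUP BY file_year
ORDER BY file_year;
```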
file_year | trip_count |
---|---|
2009 | 170896055 |
2010 | 111529714 |
2011 | 176887259 |
2012 | 171359007 |
2013 | 171816340 |
2014 | 181284588 |
2015 | 228661528 |
2016 | 279631429 |
2017 | 317546944 |
2018 | 372645858 |
2019 | 368790969 |
2020 | 184638604 |
2021 | 221374980 |
2022 | 267424247 |
2023 | 142516648 |
Trips by hour of day
The number of trips varies by hour of day, but many trips didn't have this information (the NULL row at the bottom of the table).
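A sketch of the hourly query; the pickup column name here is an assumption, and rows from files that name the pickup timestamp differently would land in the NULL bucket under `union_by_name`:

```sql
-- Trips per hour of pickup. The column name is an assumption; files
-- that use a different name for the pickup timestamp contribute NULLs.
SELECT hour(tpep_pickup_datetime) AS hour_of_pickup,
       count(*)
FROM read_parquet('taxi/*.parquet', filename=true, union_by_name=true)
GROUP BY hour_of_pickup
ORDER BY hour_of_pickup;
```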
hour_of_pickup | count_star() |
---|---|
0 | 50136451 |
1 | 36511245 |
2 | 26713761 |
3 | 19394015 |
4 | 14414029 |
5 | 13478009 |
6 | 29142824 |
7 | 49853450 |
8 | 61836553 |
9 | 63869199 |
10 | 63211677 |
11 | 65778781 |
12 | 69481912 |
13 | 69504611 |
14 | 72311337 |
15 | 69914744 |
16 | 62170489 |
17 | 72704518 |
18 | 85003150 |
19 | 85608774 |
20 | 79561917 |
21 | 78185955 |
22 | 74891862 |
23 | 64675234 |
NULL | 1988649673 |
Thoughts on duckdb
In the past I had segfaults with DuckDB on a dataset of long strings coming from an ndjson file, but that was several minor versions ago. This time I was really impressed with its stability while throwing a huge amount of parquet at it. The difference in query time between the 16-core workstation and the 96-core workstation is exactly what you want it to be: duckdb can use all of the compute resources you throw at it.
Locality of compute, memory, and disk
Instead of throwing many (relatively) small computers at a problem, duckdb is software that can use the resources of one big computer. The duckdb Python package can be used for writing processes that run automatically, but I love the duckdb CLI for exploring a dataset.
Go Forth and Crunch data
Try out duckdb with the knowledge that you can throw huge amounts of data at it and it will perform.