Over the last year or so I have been off-and-on working on two tools for processing NDJSON and turning JSON into NDJSON. I am releasing one of them today, with a deb package, and cross-compiled macOS and Windows executables.
This blog post will explain the use case of this tool (short answer: you want to turn JSON into NDJSON or you want to do simple processing on NDJSON), give examples of usage and compare it to other tools that do similar things.
Summary of Use Case
There are very large JSON files out there. One of them happens to be the Consumer Financial Protection Bureau's Complaint Dataset on data.gov. It is 1.7 GB unzipped. It is possible to load this into memory on most computers, but why should you have to.
Maybe you are wondering how many total complaints there are. Maybe you are wondering the number of complaints per state.
You may think to reach for jq. The default for jq
is to load everything into memory, but it has a streaming mode.
WARNING: these commands will be really slow
Default: takes 1m18s on my workstation and uses a bunch of memory
cat complaints.json | jq -c '.[]' | wc -l
1833581
Streaming: takes 7m6s on my workstation and uses less memory:
cat complaints.json | jq -cn --stream 'fromstream(1|truncate_stream(inputs))' | wc -l
1833581
Introducing ndjson tool
Number of complaints in the dataset
takes 26s on my workstation
cat complaints.json | ndjson from-json d | wc -l
1833581
Number of complaints per state, 2 ways
1st way
takes 44s on my workstation
cat complaints.json | ndjson from-json d | ndjson pick-field d.state | sort | uniq -c | sort -n
...
...
243520 "CA"
2nd way
takes 2m56s on my workstation
cat complaints.json | ndjson from-json d | ndjson agg --group-by d.state --agg count d
...
{"_count":243520,"state":"CA"}
...
{"_count":1553,"state":"WY"}
In-depth on the JSON selector patterns
Every command takes a selector to target a part of the JSON/NDJON. These selectors work very much like d3 JSON selectors. Each selector starts with d
, to refer to the whole document (or for NDJSON this refers to 1 NDJSON element). Next is each JSON key or an index in a JSON array (d.key[5] for the 5th item in the array given by the key key
).
d.key[5]
refers to 5
.
{"key":[1,2,3,4,5]}
Suppose we have the following JSON document, and we want to process it into NDJSON:
{ "foo": [{ "bar": 1}, {"bar": 2}]}
Selector
d.foo
Keys can be combined
{"company_name": "Dunder Mifflin", "employees": [{"name": "Dwight", "age": 37}]}
Selector targeting 37
d.employees[0].age
ndjson tool commands
SUBCOMMANDS:
agg Aggregatation commands on a grouped-by key
filter returns only json that matches filter expression
from-json Converts json to ndjson
join joins json file to ndjson stream
pick-field picks a field from all of the ndjson objects
Caveats
This tool can't do everything jq
can do and it never will. It is supposed to be simple and fast. Also the agg
subcommand takes a lot of memory if the group-by
field is not already sorted.
Github repo
ndjson-spatial is the project that this tool is a part of.
Conclusion
For gigantic JSON files there is a new way to process them into NDJSON. This NDJSON can be filtered and processed in relatively simple ways to find basic answers about the data.