I wrote something similar at my last job, where we had to parse and query data from huge JSON files (50+ GB? I remember they didn't even fit on my laptop) stored in an S3 bucket.
We used the streaming parser to build a local index of the file, {json key: (byte offset, byte size)}, and then simply used HTTP range requests to fetch only the data we needed.
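Roughly, the idea looks like the sketch below. This is not the actual json-buffet code, just a minimal Python illustration under a few assumptions: the file is one big top-level JSON object, the scanner walks an in-memory bytes object for brevity (the same state machine could be fed chunks from a stream), and the names build_index, range_header and fetch_value are made up for this example.

```python
import json
import urllib.request

WHITESPACE = b" \t\r\n"


def build_index(data: bytes) -> dict:
    """Map each top-level key to (byte offset, byte size) of its value."""
    index = {}
    depth = 0            # nesting depth of {} and []
    in_string = False
    escaped = False
    string_start = 0
    key = None           # top-level key still waiting for its value
    value_start = None   # offset of the first byte of the current value
    value_end = 0        # offset just past the last byte of the current value

    for pos, byte in enumerate(data):
        ch = bytes([byte])
        if in_string:
            if escaped:
                escaped = False
            elif ch == b"\\":
                escaped = True
            elif ch == b'"':
                in_string = False
                if depth == 1 and key is None:
                    key = json.loads(data[string_start:pos + 1])
            if value_start is not None:
                value_end = pos + 1
            continue
        if ch in WHITESPACE:
            continue
        # A ',' or '}' at depth 1 closes the current top-level value.
        if depth == 1 and value_start is not None and ch in b",}":
            index[key] = (value_start, value_end - value_start)
            key = value_start = None
            if ch == b"}":
                depth -= 1
            continue
        # The first non-whitespace byte after the ':' starts the value.
        if depth == 1 and key is not None and value_start is None:
            if ch == b":":
                continue
            value_start = pos
        if ch == b'"':
            in_string = True
            string_start = pos
        elif ch in b"{[":
            depth += 1
        elif ch in b"}]":
            depth -= 1
        if value_start is not None:
            value_end = pos + 1
    return index


def range_header(offset: int, size: int) -> str:
    """HTTP Range values use inclusive start and end byte positions."""
    return f"bytes={offset}-{offset + size - 1}"


def fetch_value(url: str, offset: int, size: int):
    """Fetch one indexed value; `url` is a hypothetical object URL."""
    req = urllib.request.Request(url, headers={"Range": range_header(offset, size)})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Local demo: slicing the bytes stands in for the remote range request.
doc = b'{"a": {"x": [1, 2, "}"]}, "b\\u00e9": "h\xc3\xa9llo", "c": 42 }'
index = build_index(doc)
offset, size = index["a"]
value = json.loads(doc[offset:offset + size])
```

The index only has to be built once per file. After that, each lookup costs a single small range request instead of a download of the whole object.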
Here is the full write-up about it:
https://dinesh.cloud/2022/streaming-json-for-fun-and-profit/
And here is the open-sourced code:
https://github.com/multiversal-ventures/json-buffet