Hacker News | PolarizedPoutin's comments

Thank you for reading and for the links!

I'm trying out ClickHouse for the next post. Definitely excited for sub-24-hour data loading!

I hadn't heard of VictoriaMetrics, but that's some impressive performance. Will check it out!


The data is publicly available!

The data is freely available from the Climate Change Service [1], which has a nice API, but download speeds can be a bit slow. You'll have to sign up for this.

NCAR's Research Data Archive [2] provides some of the data (as pre-generated NetCDF files) at higher download speeds. No signup necessary.

It's not super well documented, but I've hosted the Python scripts I used to download the data in the accompanying GitHub repository [3].

[1]: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysi...

[2]: https://rda.ucar.edu/datasets/ds633-0/

[3]: https://github.com/ali-ramadhan/timescaledb-insert-benchmark...
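For reference, a CDS API request ends up looking roughly like this. This is just a sketch: the dataset and variable names here are illustrative, and the actual `cdsapi` call needs the signup above plus a `~/.cdsapirc` credentials file.

```python
# Sketch of a CDS API request for one month of ERA5 2 m temperature.
# Dataset/variable names are illustrative, not the exact ones I used.
request = {
    "product_type": "reanalysis",
    "variable": "2m_temperature",
    "year": "2023",
    "month": "01",
    "day": [f"{d:02d}" for d in range(1, 32)],      # "01" .. "31"
    "time": [f"{h:02d}:00" for h in range(24)],     # "00:00" .. "23:00"
    "format": "netcdf",
}

# With cdsapi installed and credentials set up, this would be submitted as:
# import cdsapi
# cdsapi.Client().retrieve("reanalysis-era5-single-levels", request, "era5.nc")
```

The download itself is queued server-side, which is part of why transfer speeds feel slow.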


What do you mean by the heat distribution of energy? Do you mean how much heat is received from the sun at a particular location and time? If so, temperature is only partly the result of this: other factors, like cloud cover, influence how much radiation actually reaches the surface.


GraphCast was trained on this exact same data!

From https://deepmind.google/discover/blog/graphcast-ai-model-for...

> Crucially, GraphCast and traditional approaches go hand-in-hand: we trained GraphCast on four decades of weather reanalysis data, from the ECMWF’s ERA5 dataset.


The full dataset is huge (~9 petabytes and growing), of which I'm using just ~8 terabytes.

The data is freely available from the Climate Change Service [1], which has a nice API, but download speeds can be a bit slow.

NCAR's Research Data Archive [2] provides some of the data (as pre-generated NetCDF files) at higher download speeds.

It's not super well documented, but I've hosted the Python scripts I used to download the data in the accompanying GitHub repository [3].

[1]: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysi...

[2]: https://rda.ucar.edu/datasets/ds633-0/

[3]: https://github.com/ali-ramadhan/timescaledb-insert-benchmark...


Yeah, I was thinking about this and hoped that Postgres had a `float2` data type, but `int2` would have to do. I could scale the numbers to fit them into 2 bytes with minimal loss of precision, but decided I'd rather take the storage space hit since TimescaleDB promises good compression. Still haven't measured this though, haha.
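For illustration, the kind of `int2` scaling I mean would look something like this. A sketch only, with made-up scale/offset values rather than anything the benchmark actually uses:

```python
import numpy as np

# Pack float32 temperatures (K) into int16 ("int2" in Postgres terms)
# with a fixed scale and offset. Values here are illustrative.
scale = 100          # 0.01 K resolution; int16 covers ~±327 K around the offset
offset = 273.15      # centre the encoding on 0 °C

temps = np.array([215.30, 273.15, 310.27], dtype=np.float32)
packed = np.round((temps - offset) * scale).astype(np.int16)
unpacked = packed.astype(np.float32) / scale + offset

# Round-trip error is at most half the resolution (plus float32 rounding).
assert np.max(np.abs(unpacked - temps)) <= 0.006
```

That halves the 4 bytes per value of `float4`, at the cost of a fixed dynamic range and an extra decode step on read.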


It's a good question! The output is indeed massive, I believe ~9 petabytes and growing. But running the model is extremely expensive: it runs on ECMWF's supercomputer. I'm not sure how many cores, but I'd guesstimate in the 10k–100k core range for a single instance. So computing the data on the fly would probably cost much more than just storing it.


I'm hoping to compare TimescaleDB and ClickHouse to see how big the difference is for different queries! My impression is that TimescaleDB gives you some columnar features, whereas ClickHouse is a true columnar database.
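To illustrate what I mean by the row vs. columnar difference (a toy sketch of the access pattern, nothing to do with either database's actual internals):

```python
import numpy as np

# Row store: a list of tuples, one per record.
rows = [(i, i * 0.5, i % 100) for i in range(100_000)]

# Column store: one contiguous array per column.
cols = {
    "id": np.arange(100_000, dtype=np.int64),
    "value": np.arange(100_000, dtype=np.float64) * 0.5,
    "bucket": np.arange(100_000, dtype=np.int64) % 100,
}

# An aggregate over one column touches every tuple in the row store...
row_sum = sum(r[1] for r in rows)
# ...but only one contiguous, cache-friendly array in the column store.
col_sum = cols["value"].sum()

assert row_sum == col_sum
```

Analytical queries like the ones in these benchmarks mostly aggregate a few columns over many rows, which is exactly where the columnar layout (and per-column compression) pays off.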


Had a read through parts 1 and 2, thank you for the engaging reads! I love how you've formatted your posts with the margin notes too. And thank you for providing the function to write numpy structured arrays to Postgres binary; I couldn't figure that out before.
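For anyone curious, the numpy-to-Postgres-binary trick boils down to something like this. This is my own rough sketch of the COPY BINARY wire format, not the author's code, and it only handles a few fixed-width dtypes:

```python
import struct
import numpy as np

# PostgreSQL COPY BINARY: 11-byte signature, int32 flags, int32 header
# extension length; then per row an int16 field count followed by
# (int32 length, big-endian payload) per field; int16 -1 as trailer.
PGCOPY_HEADER = b"PGCOPY\n\xff\r\n\x00" + struct.pack("!ii", 0, 0)
PGCOPY_TRAILER = struct.pack("!h", -1)

def structured_array_to_copy_binary(arr):
    out = [PGCOPY_HEADER]
    names = arr.dtype.names
    for row in arr:
        out.append(struct.pack("!h", len(names)))
        for name in names:
            val, dt = row[name], arr.dtype[name]
            if dt == np.int16:
                out.append(struct.pack("!ih", 2, val))
            elif dt == np.int32:
                out.append(struct.pack("!ii", 4, val))
            elif dt == np.float32:
                out.append(struct.pack("!if", 4, val))
            else:
                raise TypeError(f"unhandled dtype {dt}")
    out.append(PGCOPY_TRAILER)
    return b"".join(out)

arr = np.array([(1, 273.15)], dtype=[("id", np.int32), ("t2m", np.float32)])
buf = structured_array_to_copy_binary(arr)
assert buf.startswith(b"PGCOPY")
```

The resulting buffer can then be streamed to Postgres with something like psycopg2's `copy_expert("COPY ... FROM STDIN WITH (FORMAT binary)", io.BytesIO(buf))`, which is what makes bulk inserts so much faster than row-by-row `INSERT`s.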

