That aside, the training isn't blind; it's guided, and they likely use verified, correct sources of information to train for some things, such as medical diagnoses.
You may also be interested in Appendix A in the same document: "Details of Common Crawl Filtering".
[1] https://arxiv.org/pdf/2005.14165.pdf