
Common Crawl Foundation | REMOTE | Full and part-time | https://commoncrawl.org/ | web datasets

I'm the CTO at the Common Crawl Foundation, which has a 17-year-old, 8-petabyte crawl and archive of the web. Our open dataset has been cited in nearly 10,000 research papers, and it is the most-used dataset in the AWS Open Data program. Our organization is also very active in the open source community.

We are expanding our engineering team. We're looking for someone who is:

* Excited about our non-profit, open data mission

* Proficient with Python, and hopefully also some Java

* Proficient with cloud systems such as Spark/PySpark

* Willing to learn the rest: crawling, parsing, indexing, etc.

Contact me at jobs zat commoncrawl zot org.


Yeah, uh, I wouldn't.

There was a B- or C-list physics blogger a few years back whose graduate homework I used to grade. (I still remember this one, so that should tell you something.) He got very angry that I gave him zero credit for one particular question. But he:

- did not use the standard/expected approach to this problem

- did not explain what he was doing well enough for me to find him any partial credit (this is not easy!)

- had a pile of impenetrable, unnecessary, very complex alien math that I wasn't going to try to cut through, given that

- his final answer was very, very wrong

- in fact, it was wrong by 26 orders of magnitude

- and he didn't have the skill to notice something was wrong (and, yes, I was lenient with students who noticed final answers were weird even if they couldn't/didn't fix it up)

- also, he was a major asshole (no surprise given that he's complaining about this "indignity") who was

- somehow still causing #MeToo problems in the 21st century despite being under 30 (seriously??)

So if that's who gets held up as "authorities", even minor ones, forgive me if I don't listen too much. I'll choose who I trust.


Hey Drew, I'm a huge fan and have a lot of respect for your work.

I feel like I keep seeing this pattern on HN: people are disappointed in tools that come out of Google and other large engineering orgs when those tools don't work out well in orgs that aren't operating at the same scale. People have similar complaints about the complexity of other projects that come out of Google; K8s comes to mind as one such example. Oftentimes these tools must be robust to such a large variety of uses that they are simply overkill for smaller organizations. I'll readily admit that I could be wrong and Bazel is simply poorly designed, but it is perhaps worth considering that the build system used by an engineering team of 50 need not be as complex as the build system used by one of the largest engineering orgs in the world. My guess is we'd see a lot less backlash if people stepped off the hype train for a moment and critically evaluated whether they really need something like Bazel or K8s when something simpler would suffice.


The wake-up thing is the one issue.

Plus, the real-time push works differently… like when a phone call comes in, your phone rings that same second.


What was the content of the "handled" tweets, though? I think that matters a lot. For example, the RealJamesWood tweet (from [0]) seems to be a leaked nude of Hunter; it was most likely (safe to say) posted non-consensually, and I don't really see a big difference between that kind of tweet and revenge porn.

Having "stuff they dislike" removed would be one thing, but using the direct line at Twitter for reporting explicit ToS violations isn't a big deal.

[0] https://twitter.com/mtaibbi/status/1598828601268469760


Am I reading this right? This reads like the presence of a UI element holds the unlock state of the phone?

Maybe Amazon should make us-east-1's actual datacenter change depend on the customer, as they do with the AZs :P

Human beings are capable of all kinds of petty emotions for all kinds of petty reasons, most of which are irrelevant to the discussion at hand. As a society, we recognize that prejudice, especially against the vulnerable, is particularly bad.

I don't think the headline matches the post. Usually when we refer to systems being "insecure" and "compromised", we mean so with respect to a third party.

A better headline might be: ProtonMail supports full E2EE, provided you trust them with your keys (if not, why are you paying for their service?).


> Intel is the reason we don't have ECC RAM on desktops.

Intel has offered ECC support in a lot of their low-end i3 parts for a long time. They’re popular for budget server builds for this reason.

The real reason people don’t use ECC is because they don’t like paying extra for consumer builds. That’s all. ECC requires more chips, more traces, and more expense. Consumers can’t tell if there’s a benefit, so they skip it.

> AMD supports ECC on their consumer chips, but without Intel support it's never taken off

You’re blaming Intel’s CPU lineup for people not using ECC RAM on their AMD builds?

Let’s be honest: People aren’t interested in ECC RAM for the average build. I use ECC in my servers and workstations, but I also accept that I’m not the norm.


I certainly wouldn't describe it as “a pass” given how commonly people joke about things like “friends don't let friends use us-east-1”. There's also a reporting bias: because many places only use us-east-1, you're more likely to hear about it even if it only affects a fraction of customers, and many of those companies blame AWS publicly because that's easier than admitting that they were only using one AZ, etc.

These big outages are noteworthy because they _do_ affect people who correctly architected for reliability — and they're pretty rare. This one didn't affect one of my big sites at all; the other was affected by the S3 / Fargate issues but the last time that happened was 2017.

That certainly could be better, but so far it hasn't been enough to be worth the massive cost increase of using multiple providers, especially if you can have some basic functionality provided by a CDN when the origin is down (true for the kinds of projects I work on). GCP and Azure have had their share of extended outages too, so most of the major providers tend to be careful not to cast stones about reliability, and it's _much_ better than the median IT department can offer.


Remember, things are never as good or as bad as you think.

'Offline' internet can come back: we can put 32 GB microSD cards in special locations so we can share content and /etc/hosts files without anyone knowing, and maybe go online from time to time to get the newest /etc/hosts from your friends and the new locations of microSD cards near you.

Gopher is making a comeback, and gemini is growing.

A Pi Zero 2 W costs $10, and with a $40 screen you can watch the Feynman lectures with mplayer -vo fbdev; it boots into vim in 3 seconds (init straight into openvt -w vim, kinda thing).
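The "init into openvt -w vim" trick could be sketched as a classic sysvinit inittab entry. This is purely illustrative: the terminal number, runlevels, and paths are assumptions, and a systemd-based image would need a unit file instead.

```
# /etc/inittab (hypothetical entry: respawn vim on virtual terminal 1)
vt1:2345:respawn:/usr/bin/openvt -f -w -c 1 -- /usr/bin/vim
```

With `respawn`, init restarts vim whenever you quit it, so the console is never left without an editor.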

Soon the-eye.eu will be back (hopefully).

The new phrack is out.

The social networks are eating themselves, the same way Google is eating itself: the search is garbage, the feed is even more garbage, and 99% of the content is anxiety-inducing miasma.

The web is eating itself, with gazillions of GPT(ish) generated articles.

Let it go, life always finds a way.


> That is not entirely true, if you translate the C code to Rust, you get C code, in Rust, with similar issues (or possibly worse).

So it was basically true after all? Sure, Rust is Turing-complete, so you can simulate whatever C did, and thus technically you can translate anything C can do into Rust. But if the translation doesn't fix any of the problems, have you really translated it into Rust?
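To make that concrete, here's a minimal sketch (my own illustration, with made-up function names) of the difference between a literal, C-style port and an idiomatic one. The literal port keeps C's unchecked pointer arithmetic alive inside `unsafe`, so the original class of bugs survives the translation:

```rust
// Literal port: keeps C's raw-pointer interface, so the caller can still
// pass a dangling pointer or the wrong length, exactly as in the C original.
unsafe fn sum_c_style(ptr: *const i32, len: usize) -> i32 {
    let mut total = 0;
    for i in 0..len {
        total += unsafe { *ptr.add(i) }; // unchecked read, no bounds check
    }
    total
}

// Idiomatic port: a slice carries its own length, so the pointer/length
// mismatch can no longer happen at the call site.
fn sum_rust_style(data: &[i32]) -> i32 {
    data.iter().sum()
}

fn main() {
    let data = [1, 2, 3, 4];
    assert_eq!(unsafe { sum_c_style(data.as_ptr(), data.len()) }, 10);
    assert_eq!(sum_rust_style(&data), 10);
}
```

Both versions compute the same sum, but only the second one lets the compiler rule out the out-of-bounds case, which is arguably the point of porting in the first place.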


Macron said, "We are going, for the first time in decades, to relaunch the construction of nuclear reactors in our country." This is a blatant lie: France launched such a project years ago and never stopped trying to build. That effort, however, fell flat.

The last delivered reactor was Civaux-2 (generation II), in 1999. See https://en.wikipedia.org/wiki/Civaux_Nuclear_Power_Plant

Then in 2002 the Flamanville-3 project was launched (a generation III reactor, the "EPR", first of its kind), and the building phase started in 2007. It is a major failure: still not delivered, at least 11 years behind schedule, and projected to cost at least €19.1 billion (initial budget: €3.4 billion). See https://en.wikipedia.org/wiki/Flamanville_Nuclear_Power_Plan...


> That's why you do not control networks with signals on the network itself.

Other than, you know, the entire Internet (BGP) and pretty much every corporate WAN (OSPF, EIGRP, RIP) and LAN (spanning tree, ARP).

Having an out-of-band control plane is very much the exception. Now OOB emergency access, on the other hand...

edit okay, SS7 is out-of-band, but parent was talking about IP networks.


It seems like a bit of an overreaction. Is shipbuilding that important to France?

Julia Reda's analysis depends on the factual claim in this key passage:

> In a few cases, Copilot also reproduces short snippets from the training datasets, according to GitHub’s FAQ.

> This line of reasoning is dangerous in two respects: On the one hand, it suggests that even reproducing the smallest excerpts of protected works constitutes copyright infringement. This is not the case. Such use is only relevant under copyright law if the excerpt used is in turn original and unique enough to reach the threshold of originality.

That analysis may have been reasonable when the post was first written, but subsequent examples seem to show Copilot reproducing far more than the "smallest excerpts" of existing code. For example, the excerpt from the Quake source code[0] appears to easily meet the standard of originality.

[0]: https://news.ycombinator.com/item?id=27710287


If Cellebrite was disclosing these vulns when they found them, there would be no business, thus no Cellebrite, thus they wouldn’t have found them. “Destroy Cellebrite” is a possible outcome but “Have Cellebrite release 0days when they find them” isn’t.

I like the interactive visualizations a lot, and some of the setup (a camera picture is a thing that changes in certain ways as you fiddle with these three parameters, etc.). But I always have a hard time telling who the actual audience for this stuff is. If it's someone who has very little exposure to how cameras actually operate, is a Bayer filter really the second thing they need to be aware of? I don't really follow the pedagogical narrative/intent here.

It should be legal to sell bald eagle feathers, as well as to have any arbitrary sequence of bytes stored in a file on one's disk. I don't see why everyone in society should have their liberty arbitrarily restricted just because cops are terrible at their ostensible jobs.

It's not that enforcement is hard, it's that legal systems have decided that cops shouldn't have to be any good to do policing, and then they work backward from that. It's the wrong way around.

Attacking the leaf nodes of a societal problem lets it look like something is being done by police, when it is actually a huge injustice and a complete waste of time and resources.

You could spend billions to bust a million hookers and johns and never once make the tiniest dent in human trafficking.


> You see this recently with Wikipedia. Google's widgets have been reducing traffic to Wikipedia pretty dramatically.

Wikipedia visitors, edits, and revenue are all increasing, and the rate that they're increasing is increasing, at least in the last few years. Is this a claim about the third derivative?

> Enough so that Wikipedia is now pushing back with a product that the Googles of the world will have to pay for.

The Wikimedia Enterprise thing seems to have nothing to do with missing visitors; rather, companies ingesting raw Wikipedia edits are an opportunity to diversify revenue by offering paid structured APIs and service contracts. Kind of the traditional Red Hat approach to revenue in open source: https://meta.m.wikimedia.org/wiki/Wikimedia_Enterprise


Didn't Coinbase already survive that 2 years ago?

I agree with the other comments here saying that this is not a gap in open-source licensing. If Elastic wanted to force Amazon to contribute modifications back, they could have switched to the AGPL to do that. But they didn't — because, as the AWS blog points out, AWS was already contributing their changes back.

The problem wasn't that the spirit of open-source was violated. The problem was that AWS is better at acquiring customers for their competing hosting service. So Elastic switched the license to the non-open-source SSPL, which makes offering a hosting service essentially impossible (to satisfy the terms, Amazon would need to open-source effectively all of AWS).

This license switch is more antithetical to the spirit of open source — that is, user freedom — than what AWS was doing, which was offering a hosted version of open-source software that competed with Elastic's hosting business.

Ultimately I think selling hosting of a single piece of open-source software as a funding model for developing that same software (e.g. Mongo, Elastic) has proven not to be very sustainable; at least, not if you're hoping for VC-backed-company-sized returns. And... I can see why that doesn't feel great. At one point, hosting was believed to be the silver bullet model as compared to consulting. But ultimately I think it just hasn't worked out: hosting as a model makes sense for proprietary software, but for OSS, either:

* You pay developers to build the software and also pay more developers to build the hosting stack — but then your competition only pays for the latter and can beat you on price, or

* You only pay to build the hosting stack, but then — what's the point? You haven't solved funding the software development.


> It's interesting that a reasonably sized phone is now a selling point in itself.

Sony was the last one in the Android world to offer that to us. Now even they have chickened out.


There was the Apple GPU fiasco. Apple laptop GPUs were failing at an extraordinary rate in 2010-2011, and Apple was getting harangued in the media about it (which was justified). Apple eventually announced a program to cover replacement of the GPUs, but it was very delayed. I'm 100% convinced the delay was caused by Nvidia refusing to accept responsibility. When the program was released, in order to have the repair completed at no cost, your computer had to fail a very specific graphics diagnostic, and the failure would be linked to your serial number in Apple's repair CRM. At the time, no other Apple repair program had this requirement. Don't get me wrong, it makes sense to use diagnostics to verify hardware issues, but they aren't the only tool you can use. I'm convinced Nvidia would only reimburse Apple for computers that had a recorded failure of this specific test.

Since 2014, Apple has almost exclusively used AMD graphics cards in the computers that do have discrete GPUs, and I don't think it's unreasonable to suggest this was motivated by their terrible experience with Nvidia.

(I'm aware I'm not really backing this up with evidence, but I think the publicly available facts about the whole situation support this narrative)


People are always talking about switching to Blender (me included), but they don't. Because, as sad as it is, once you start to actually work with it on serious projects, you notice all the things that hobbyists who hail Blender as the second coming never care about. I wanted to give Blender another go at replacing Cinema 4D for us (since their new licensing). I got used to the UI, though it's not nearly as intuitive as C4D's, but when I started working with larger scenes, it turned out that undo steps in large scenes could easily take 20 seconds. I mean... wtf? That is probably one of the most essential functions you can have in an editor. Apparently they are working on fixing that. Eventually.

I use -m often, it's pretty much my default commit method. It doesn't break the flow, I can see the previous commands in the terminal above which is sometimes useful.

Besides it's trivial to change the commit in an editor if you're not happy with it with a simple "git commit --amend". It's even possible to change your mind in the middle of the commit command by adding '-e' as somebody else helpfully pointed out in this comment section.

I mean, I don't think my way is superior to yours, but I don't think it's inferior either. It's just a matter of taste and workflow I suppose. In particular I don't really see why editing text in vim is going to make you more or less likely to realize that you forgot to "tweak something or stage a change". And at any rate as long as you haven't pushed anything it's trivial to rewrite the commit.
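For what it's worth, the amend workflow described above can be sketched in a throwaway repo (the file name and messages are made up for illustration):

```shell
set -e
repo=$(mktemp -d)            # throwaway repo so nothing real is touched
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "You"

echo "hello" > file.txt
git add file.txt
git commit -q -m "initial commit"      # quick inline message with -m

# Changed your mind before pushing? Rewrite the last message in place.
git commit -q --amend -m "add file.txt with greeting"

git log -1 --format=%s                 # prints "add file.txt with greeting"
```

Adding `-e` to a `git commit -m` invocation drops you into the editor with the `-m` text pre-filled, which splits the difference between the two workflows.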


I've been developing on Macs for 10+ years. Recently I got myself a Windows desktop for testing stuff, gaming, etc. It turned into my main development machine thanks to WSL2. Everything works better and more stably, which is extremely ironic given that I switched to Mac a decade ago for the same reason. Also, I have many more hardware options (which is overwhelming).

Meanwhile, my MacBook Pro 16" has daily kernel panics when it is connected to an external display. It has been reported by many people on Macrumors forums and has been happening for months now. No fix...


The problem with physical programming books - and don't get me wrong, I love physical books to a fault - is that they're easily made outdated, either with new versions of the same software, or the software gets replaced, or new best practices arise. Then you have to buy another version of the book. In addition to this, which basically makes physical programming books a nonstarter already, you have the issue that books aren't easily searchable or copy-pasteable.

Really, there's not much of a reason to have physical programming books anymore; it's just not practical, whatever the level of quality you get elsewhere. And you seem to have left out official documentation and stuff, which is getting really good imo.

The only reason for having a physical programming book is for totally timeless ideas, like algorithms and data structures textbooks, or introductory books to concepts like compilers.


Huh. N26, a major online bank in Europe, is famous for some of its customers getting their accounts blocked every time it tweaks its ML model, and yet it is doing great financially despite the shitstorm it generates each time.

It's not like Google blocking your email or YouTube account; we're talking about your friggin' bank account here.

I don't know how they're still in business and growing with such a process in place.

