Thanks for the pointer, dang. Is it just me, or is it disturbing that the original article's referenced page is gone, and now we have to go to the Wayback Machine to get a copy?
A bit off topic, but are there any BitTorrent/IPFS efforts to archive archive.org?
> At this point, do we need to use a JS-disabled browser to really get privacy on the web?
My thoughts are that we need a distinction between web pages (no JS), which are minimally interactive documents that are safe to view, and web apps (sites as they exist now), which require considerable trust to allow on your device. Of course, looking at the average person's installed app list indicates that we have a long way to go culturally with regards to establishing a good sense of digital hygiene, even for native software.
It doesn't help that web browsers aren't even trying to help users make the distinction. They have an ever-growing list of features and permissions that sites can take advantage of, with no attempt to coalesce anything into a manageable user interface. Instead, it takes a hundred clicks to fully trust or distrust a site/app.
More UI/UX distinction is needed, just like the green lock for security! The browser should indicate the level of privacy of the page. If the page uses no JS or any GPU-compromising features (CSS, I'm looking at you), then it gets a green icon. For every privacy/security-compromising feature you add, the icon turns yellow. Once it starts to ask for WebUSB or MIDI, it should be in some kind of Native Mode. It's largely a UI/UX issue for the major browser makers!
The problem is that there is a lot of grey area between pure document-style pages and full-on apps (take online shops for example) and even for the former category of pages a lot of UI niceties are only possible with scripting.
Any other tracking methods are way more obvious, and way harder for the advertising industry to implement. We shouldn't think in black and white here: the more difficult it is to track a user, the less likely it is to be implemented. It is okay if 30% of tracking sites disappear because the cost/value ratio doesn't work for them. We don't have to sit in silence and do nothing just because we can't have 100% privacy.
I do think there is a point here: any technical means to block tracking is going to be overrun by technical means to overcome the anti-tracking tech. There are simply too many dollars at stake for anything else to happen. If anti-tracking stops some players, that just means the industry will consolidate into a few large and well-resourced players.
While I am all in favor of continuing the technical battle against tracking, it’s time to recognize that the war will only be won with legislation.
It’s an interesting question: is it possible for JavaScript to be Turing-complete, able to read/write the DOM, and somehow prevent fingerprinting / tracking?
My gut says no, not possible.
Maybe we need a much lighter way to express logic for UI interactions. Declarative is nice, so maybe CSS grows?
But I don’t see how executing server-controlled JS could ever protect privacy.
I've always thought there should be a way to use the browser like a condom. It should obfuscate all the things that make a user uniquely identifiable. Mouse movement/clicks/typing cadence should be randomized and sanitized a bit. And no website should have any authority whatsoever to identify your extensions or other tabs, or even whether or not your tab is open. And it certainly shouldn't allow a website to overrule your right click functionality, or zoom, or other accessibility features.
I don't know what it is called, but if you try to open a window from a setTimeout it won't work. The user has to click on something; the click event then grants the permission.
You could make something similar where fingerprint-worthy information can't be posted or used to build a URL. For example, you read the screen size, then add it to an array. The array is "poisoned" and can't be posted anymore. If you use the screen size for anything, those things and everything affected may stay readable but are poisoned too. New fingerprinting methods can be added as they are found. Complex calculations and downloads might temporarily make time into a sensitive value too.
In the old days, something similar to what you're calling "poisoned" was called "tainted" [0].
In those scenarios, tainted variables were ones which were read from untrusted sources, so could cause unexpected behaviour if made part of SQL strings, shell commands, or used to assemble html pages for users. Taint checking was a way of preventing potentially dangerous variables being sent to vulnerable places.
In your scenario, poisoned variables function similarly, but with "untrusted" and "vulnerable" being replaced with "secret" and "public" respectively. Variables read from privacy-compromising sources (e.g. screen size) become poisoned, and poisoned values can't be written to public locations like urls.
There's still some potential to leak information without using the poisoned variables directly, based on conditional behaviour - some variation on
if poisoned_screenwidth < poisoned_screenheight then load(mobile_css) else load(desktop_css)
is sufficient to leak some info about poisoned variables, without specifically building URLs with the information included.
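The taint-propagation idea above can be sketched in userland JavaScript. This is purely a hypothetical illustration: the class and function names are invented, and a real implementation would have to be enforced inside the JS engine so scripts couldn't simply unwrap the values.

```javascript
// Hypothetical sketch of taint tracking; a real version would live in the engine.
class Tainted {
  constructor(value) { this.value = value; }
  // Any computation on a tainted value yields another tainted value.
  map(fn) { return new Tainted(fn(this.value)); }
}

// Privacy-sensitive reads return tainted wrappers
// (a constant stands in for window.screen.width here).
function readScreenWidth() { return new Tainted(1920); }

// Network sinks refuse tainted data instead of exfiltrating it.
function buildUrl(base, param) {
  if (param instanceof Tainted) throw new Error("tainted value used in a URL");
  return `${base}?v=${param}`;
}

const w = readScreenWidth().map(px => px / 2); // derived value: still tainted
let blocked = false;
try { buildUrl("https://tracker.example", w); } catch (e) { blocked = true; }
console.log(blocked); // true: the derived value never reaches a URL

// Caveat from the thread: branching on a tainted comparison can still leak
// one bit per branch (e.g. choosing mobile vs desktop CSS).
```

Direct flows get blocked, but as the comment above notes, conditional behaviour remains a side channel that pure value-tainting can't close.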
Just create a _strict_ content security profile which doesn't allow any external requests (fetch) and only allows loading resources (CSS, images, whatever) from a predefined manifest.
An app cannot exfiltrate any data in that case.
You may add permission mechanisms of course (local disk, some cloud under user control, etc.).
That's a big challenge in standards, and I'm not sure anyone is working on such a strongly restricted profile for web/JS.
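Today's CSP can already approximate part of this (the directive values below are an illustration of the idea, not the proposed manifest-based profile, which goes beyond what CSP currently expresses):

```
Content-Security-Policy: default-src 'none';
                         script-src 'self';
                         style-src 'self';
                         img-src 'self';
                         connect-src 'none';
                         form-action 'none'
```

With `connect-src 'none'` and `form-action 'none'`, scripts can run but have no fetch/XHR/WebSocket or form-submission channel to phone home, and `img-src 'self'` cuts off image-URL beacons. What CSP can't yet say is "only resources listed in a signed manifest", which is the stronger guarantee the comment proposes.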
> It’s an interesting question: is it possible for JavaScript to be Turing-complete, able to read/write the DOM, and somehow prevent fingerprinting / tracking?
Yes, of course: restrict its network access. If JS can't phone home, it can't track you. This obviously lets you continue to write apps that play in a DOM sandbox (such as games) without network access.
You could also have an API whereby users can allow the JS application to connect to a server of the user's choosing. If that API works similarly to an open/save dialog (controlled entirely by the browser) then the app developer has no control over which servers the user connects to, thus cannot track the user unless they deliberately choose to connect to the developer's server.
This is of course how desktop apps worked back in the day. An FTP client couldn't track you. You could connect to whatever FTP server you wanted to. Only the server you chose to connect to has any ability to log your activity.
There's no point. If you disable JS, they can track you in other ways: fingerprint your DNS packets via timestamp clock skew and other things. With IPv6 they can assign you a unique IP address for a DNS lookup that can function like a cookie.
Don't want to be tracked? Don't go on the internet.
Websites can't fingerprint my dns packets by their clock skew, nor can they assign me a unique IP address for a dns lookup (what?). "Don't go on the internet" isn't a great starting point to improve things.
I used to fingerprint TCP packets when I built a large neobank. You could easily tell if someone was behind a proxy, falsifying their user agent, and more via SYN numbers. We used it to detect bots, but it could easily be used to fingerprint individual users. The DNS trick is already used for DNS-based CDNs; you can just keep refining it down to more specificity, a CDN edge for each individual user.
Why does it have to be a technological solution? That's what the media industry tried to do with DRM and it failed. The solution is legislation. We need the equivalent of DMCA for our privacy. Make it illegal to fingerprint.
I’m completely unsold on legislation. Another headline that recently hit the top of HN is about how Apple flagrantly ignored a court order. The judge has recommended the case for criminal contempt prosecution [1].
The comments on the story are completely unconvinced that anyone at Apple will ever be convicted. Any fines for the company are almost guaranteed to be a slap on the wrist since they stand to lose more money by complying with the law.
I think the same could be said about anti-cookie/anti-tracking legislation. This is an industry with trillions of dollars at stake. Who is going to levy the trillions of dollars in fines to rein it in? No one.
With a technological solution at least users stand a chance. A 3rd party browser like Ladybird could implement it. Or even a browser extension with the right APIs. Technology empowers users. Legislation is the tool of those already in power.
> The solution is legislation. We need the equivalent of DMCA for our privacy
And how does one know their privacy has been invaded? How would the user know to invoke a DMCA-equivalent privacy law?
I think the solution has to be technological. Just like encryption, we need some sort of standard to ensure all browsers are identical and unidentifiable (unless the user _chooses_ to be identified - like logging in). Tor-browser is on the right track.
Just tried this with Brave and it didn't seem to work, assuming the site working means that it can remember me in an incognito browser. I gave the site a name, and then opened it in incognito (still using brave), and it acts as if I visited the site for the first time.
For me it had the opposite effect of what was intended:
I opened the website in a non-anonymous Safari session: it asked my name. Then I opened another new non-anonymous window in the same browser: it showed my name as expected. I then opened the same browser in incognito mode: it asked my name again. I then opened Chrome (non-anonymous) and again it asked my name.
Exactly what I expected to see; everything seems to be working as intended. Anonymization online seems to be working perfectly fine.
They can track you just fine via CSS and countless other ways. They'll even fingerprint the subtle intricacies of your network stack.
What we need to do is turn the hoarding of personal information into a literal crime. They should be scrambling to forget all about us the second our business with them is concluded, not compiling dossiers on us as though they were clandestine intelligence agencies.
They run arbitrary code from sketchy servers called "websites" on people's hardware with way too many privileges. Meanwhile, free and open source standalone applications exist that use only minimal JS to access the same web resources, with a much better user experience: no trackers, no ads, no third parties.
I want a browser to be able to run arbitrary code. That's the whole point. I want to play a game or use a complex application in the browser without having to install anything.
I don’t mean to sound glib. But people derive a ton of utility from the web as it stands today. If they were asked if they supported the removal of web browsers they would absolutely say no. The privacy costs are worth the gains. If you want change you have to tackle that perception.
I've tried this recently and found it very difficult: Cloudflare bot protection is everywhere, plus other anti-scrape protections, many 'document' sites using JS to render with no fallback, basic forms requiring JS, authentication requiring JS, payments requiring JS, etc.
Not intending to sound snarky but do you just not use the web much? Or if you're adding allows all the time, what's the net gain?
I use the web fairly constantly, and yeah, if I'm visiting a new site and want to see the content there's a 50/50 chance I have to press a button in NoScript (like 2-3 clicks). But once you set up your initial allow list (usually takes me about a week), you'd be surprised how few net-new properties you allow in a week: maybe 100 or fewer?
I also set temporary permissions for any site I don't think I will be spending a lot of time on, because they might change what's running and I don't have any trust in, or insight into, their process. So I might authorize a site 3-4x a year before I say it can stay.
Unmodified request headers contain enough information for tracking even if JS is disabled. If you're keen to modify HTTP headers while browsing, then you could also modify any JS run on your system that snoops system information (or strip the info from any request sent to the server) and continue with JS enabled.
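The header-side half of that could look something like the sketch below, e.g. inside a filtering proxy or a browser extension. The function name and the list of risky headers are illustrative, not exhaustive.

```javascript
// Sketch of header minimisation: drop or normalise headers that feed a
// fingerprint before a request leaves the machine.
function stripIdentifyingHeaders(headers) {
  const risky = ["user-agent", "accept-language", "dnt", "referer"];
  const clean = {};
  for (const [name, value] of Object.entries(headers)) {
    if (!risky.includes(name.toLowerCase())) clean[name] = value;
  }
  // Replace rather than remove the User-Agent: a missing header is itself
  // an unusual, fingerprintable signal.
  clean["user-agent"] = "Mozilla/5.0 (generic)";
  return clean;
}

console.log(stripIdentifyingHeaders({
  Host: "example.com",
  "User-Agent": "MyBrowser/1.2 (Linux x86_64)",
  "Accept-Language": "de-DE,de;q=0.9",
}));
```

The catch, as noted in the thread: making everyone's headers identical only helps if many people do it, otherwise the sanitised profile is itself distinctive.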
IMO this service should straight up be made illegal. I love the tagline they have of supposedly "stopping fraud" or "bots", when it's obvious it's just privacy invasive BS that straight up shouldn't exist, least of all as an actual company with customers.
I have almost no hope that this is a matter that has a technical solution.
The GDPR shows that law - even if not global, and even if not widely enforced - is pretty good at getting people to act. And most importantly, it will make the largest players the most afraid as they have the most to lose. And if just a handful of the largest players online are looking after peoples privacy then that is a huge win for privacy.
Doing what this demo shows is clearly a violation of the GDPR, if it works the way I assume it does (via fingerprints stored server-side).
>- Instead, product was built the old-fashioned way - by talking to customers; quite often, customers would reach out to us! "Please build time-saving feature x", "support new medical procedure y", "help us publish more research by analyzing z".
I think that might be a plus. PM/PO culture has ruined the industry. This way at least one has a direct connection to the customers, which is something I can't say about large companies.
Hmm, I think we ought to judge on a case-by-case basis. However, for megacorps and especially banks that have access to capital at an almost 0-1% cost, vs. the rest of us at 20-30% (credit cards, loan sharks), there should be a different license. There should be a GPL-type license adjusted to the cost of capital.
There should not be any difference between small and large entities in how you deal with them as an open-source maintainer. Just because someone has more money (or less) should not automatically mean you treat them with more or less leniency.
You set up your standard, and stick to it whoever comes.
Companies are never just money. There is a monumental difference between:
1. A small company which is barely profitable but is building something which aligns with your values and you see as a positive to the world.
2. A massive mega corporation whose only purpose is profit, mistreats employees, and you view as highly unethical.
You shouldn’t treat those the same way. It’s perfectly ethical to offer your work for free to the first one (helping them succeed in creating a better world) and to charge up the wazoo for (or better yet, refuse to engage in any way with) the second one.
A company is not a person, and can literally have its entire staff changed in short order. Or be bought.
Companies have no morals. Sometimes people in companies do, but again, that person can vanish instantly.
You should treat a company as a person which may receive a brain transplant at any time. Most especially, when writing contracts or having any expectation of what that company will do.
A business that is privately owned, is run by its founders, and represents the lion's share of its officers' income and net worth can be dealt with like any other small business.
Some guy who makes bespoke firmware for industrial microcontrollers or very niche audio encoding software isn't Microsoft. You won't be able to do business with him in a useful way if you treat him like Microsoft.
There exist companies which have taken VC money, and others which haven’t. We’ve carved out one exception, but this doesn’t indicate that small personally-run companies can’t exist, right?
The key is contract. Casual chat with a corporate representative who isn’t selling you something about something you own requires some sort of contractual relationship and consideration.
If you want to be extreme, don't distribute it to them in the first place. Licenses do not come into effect until after distribution. So you could have a pay-to-download model that comes with a 100% discount if you're a lone developer or an organization under X amount of revenue. You wouldn't be able to stop someone redistributing it after the fact, but at least you're not engaging.
Unfortunately now that everything is based on automated pipelines, something that doesn't integrate well is not so good.
Although at work we have a provider of proprietary software that has an APT repository where the URL includes a secret token, so they can track from where it's being accessed.
Interacting with faceless entities with the power to buy multiple countries the same way you'd interact with some interested independent young person wanting to learn.
Interesting moral proposition, I doubt you'd get many followers. I think it's perfectly reasonable to treat people differently from corporations, and random small and medium corporations differently than huge megacorps without losing any sleep.
Especially in business, charging more to those who can pay more is a very common approach.
No, it's also because some consumers can't pay the "original" price. Steam in "developing" countries is a classic example — you as a game developer can ask a guy from my country $60 for a game (and some companies do try that), but he will simply go back to torrent trackers because $60 is a week's worth of living expenses.
gaben figured that out and successfully expanded into many markets that were considered basket cases for software licensing.
That's a really silly precommitment. If you were sensible, your actual commitment should be "help the next person who requires help, provided that help can be provided in the form of one dollar".
That's why the premise in the grandparent post is ridiculous.
But the license of a piece of software is not ridiculous: if you chose a very permissive license, you cannot then go and choose who should or shouldn't be profiting off your software. The license was a pre-commitment.
Yet lots of people make this pre-commitment and then make a moral/ethical judgement post facto when someone rich seems able to extract more value out of the software than what "they deserve", and complain about it.
"Permissive" licenses, in fields where abusive corporations are known to operate, are a really silly precommitment. Copyleft exists for a reason. But, even if you (foolishly) made that precommitment, that doesn't then mean you have to do free labour for the abusive corporations, out of some misguided ideological consistency. (Such consistency is the hobgoblin of little minds.)
I mean, the MIT license might be a “more permissive” license but it says very explicit things that Microsoft is explicitly ignoring. Your license choice doesn’t matter when they ignore the license anyway.
If a guy comes begging for money out of a Rolls-Royce, I guess they're either pretty bad at begging or have a pretty bad sense of humor. I wouldn't give money to them; it doesn't seem like it would help them regardless.
> You set up your standard, and stick to it whoever comes.
Why? Most businesses don't entertain standard rates, either. It's case-by-case negotiations ("call us", "request quote"). Why should I, as a private person putting stuff out there for free, set up "my standard" and stick to it?
Clearly you have yet to experience some of the less savoury behaviours of megacorp sharks. You're looking at people trying to make a name for themselves internally, and if that means being economical with attributions, it is the least they would do for their place in the California sun.
So what is the real comparison against DeepSeek R1? Would be good to know which is actually more cost-efficient and open (reproducible build) to run locally.
>I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.
I think it is NOT just you. Most companies with decent management also would not want their data going anywhere outside the physical servers they control. But yeah, most people just use an app and a hosted server. This is HN, though; there are people here hosting their own email servers, so it shouldn't be too hard to run an LLM locally.
Yeah, this has been confusing me a bit. I'm not complaining by ANY means, but why does it suddenly feel like everyone cares about data privacy in LLM contexts, way more than previous attitudes to allowing data to sit on a bunch of random SaaS products?
I assume it's because people assume the AI companies will train on your data, causing it to leak? But I thought all these services had enterprise tiers where they promise not to do that?
Again, I'm not complaining, it's good to see people caring about where their data goes. Just interesting that they care now, but not before. (In some ways LLMs should be one of the safer services, since they don't even really need to store any data, they can delete it after the query or conversation is over.)
Laundering of data through training makes it a more complicated case than a simple data theft or copyright infringement.
Leaks could be accidental, e.g. due to an employee logging in to their free-as-in-labor personal account instead of a no-training Enterprise account.
It's safer to have a complete ban on providers that may collect data for training.
Their entire business model is based on taking other people's stuff. I can't imagine someone would willingly drown with the sinking ship when the entire cargo hold is filled with lifeboats, just because they promised they would.
Being caught doing that would be wildly harmful to their business - billions of dollars harmful, especially given the contracts they sign with their customers. The brand damage would be unimaginably expensive too.
There is no world in which training on customer data without permission would be worth it for AWS.
One single random document, maybe, but as an aggregate? I understood some parties were trying to scrape indiscriminately, the "big data" way. And if some of that input is sensitive and is stored somewhere in the NN, it may come out in an output, in theory...
Actually, I never researched the details of the potential phenomenon (that anything personal may be stored, not just George III but Random Randy), but it seems possible.
There's a pretty common misconception that training LLMs is about loading in as much data as possible no matter the source.
That might have been true a few years ago but today the top AI labs are all focusing on quality: they're trying to find the best possible sources of high quality tokens, not randomly dumping in anything they can obtain.
> Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information. The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.
Obviously the training data should preferably be high quality, but there you have a (pseudo-)problem with "copyright" (pseudo-, as I insisted elsewhere, citing the right to have read whatever is in any public library).
If there exists some advantage in quantity though, then achieving high quality raises questions about tradeoffs and workflows: sources where authors are "free participants" could have odd data seep in.
And whether such data may be reflected in outputs remains an open question (probably tackled by some work I have not read... Ars longa, vita brevis).
In Scandinavia, financial-related servers must be in the country! That always sounded like a sane approach. The whole putting-your-data-on-SaaS-or-AWS thing just seems like the same "let's shift the responsibility to a big player".
Any important data should NOT be on devices that are NOT physically within our jurisdiction.
Or GitHub. I’m always amused when people don’t want to send fractions of their code to an LLM but happily host it on GitHub. All big LLM providers offer no-training-on-your-data business plans.
It's unlikely they think Microsoft or GitHub wants to steal it.
With LLMs, they're thinking of examples that regurgitated proprietary code, and contrary to everyday general observation, valuable proprietary code does exist.
But with GitHub, the thinking is generally the opposite: the worry is that the code is terrible, and seeing it would be like giant blinkenlights indicating the way in.
That's why AWS Bedrock, Google Vertex AI, and Azure AI model inference exist: they're all hosted LLM services that offer the same compliance guarantees you get from regular AWS-style hosting agreements.
AWS has a strong track record, a clear business model that isn’t predicated on gathering as much data as possible, and an awful lot to lose if they break their promises.
Lots of AI companies have some of these, but not to the same extent.
> "Most company with decent management also would not want their data going to anything outside the physical server they have in control of."
Most companies' physical and digital security controls are so much worse than anything from AWS or Google. Note I don't include Azure... but "a physical server they have control of" is a phrase that screams vulnerability.