So I tried these search engines. They're fun to try but kinda remind me of search before Google.
I'm frustrated with Google search results lately: they don't reach far enough back and are diluted with all sorts of crud.
But these alternative search engines kinda just remind me of why people started using Google. When I use them, I don't get spam, but I don't really get things I'm interested in either. It's like they don't understand what I'm really interested in and I either get no hits, or get hits on things that are just completely unrelated. It's like old-school chat software or something, complete with all the computer misunderstandings and whatnot.
Don't get me wrong, I'd love to see a lot of competition in search. I use DDG a lot. But I'm surprised at how positive people's comments are about these alt search engines, because to me they have problems on the opposite end of the spectrum. To me, what Google returns is a pretty good understanding of what I want, but corrupted by manipulative spam; the other ones generally return no spam, but also not at all what I want.
I'm trying https://kagi.com/ at the moment; it was mentioned here a few months back when it was still in beta, and I finally remembered to switch one of my browsers over to try it.
So far, pretty happy with it. Before that I'd been using DDG, but found myself reaching for !g so frequently that it was almost pointless.
DDG feels absolutely rubbish at local searches, but I may have just lost patience with it.
To be honest, I don't think DDG has a future purely because of the name. No way I'm telling non-techy friends that as I'll just get "what? Ducky go? Duck what? Are you being serious or is this a joke?"
Being able to make rank adjustments is an absolute GAME CHANGER. I didn't realise just how terrible my search results were until I had granular control over the poor-quality sites which kept appearing in my Google results. It's so clear to me now that Google is heavily prioritising advertising over user experience. The recent discussion about how many Google searches include "reddit" should serve as a warning for Google to swing that pendulum back toward the user experience, or they will lose people like me. I'll be paying for Kagi when it's out of beta.
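In case "rank adjustments" sounds abstract, here's a toy Python sketch of what it boils down to mechanically (my own illustration with made-up weights, not Kagi's actual implementation): re-scoring results with per-domain multipliers the user controls.

    from urllib.parse import urlparse

    # Hypothetical per-domain adjustments: >1 boosts, <1 demotes, 0 blocks.
    DOMAIN_WEIGHTS = {
        "en.wikipedia.org": 1.5,
        "contentfarm.example": 0.2,  # heavily demote
        "pinterest.com": 0.0,        # never show again
    }

    def rerank(results):
        """results: list of (url, base_score) pairs from an upstream engine."""
        adjusted = []
        for url, score in results:
            weight = DOMAIN_WEIGHTS.get(urlparse(url).netloc, 1.0)
            if weight > 0:
                adjusted.append((url, score * weight))
        return sorted(adjusted, key=lambda r: r[1], reverse=True)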
>To be honest, I don't think DDG has a future purely because of the name. No way I'm telling non-techy friends that as I'll just get "what? Ducky go? Duck what? Are you being serious or is this a joke?"
Just as a side note/FYI: naming something is tricky, particularly if it's meant to be international/multilingual. DuckDuckGo may well sound funny to your friends, but, as an example, "kagi" in Italy would be pronounced the same as "Cagi", a renowned historical maker of men's underwear (to the point that "a cagi" is sometimes used as a synonym for "a tank top"), and it would surely make some people think it's a joke.
I’m delighted with Kagi. It finds the regulars I want (StackOverflow, Github, Wikipedia) just as well as Google, but is less infested with SEO spam and has a better selection of indie sites. Images and maps aren’t really up to par yet, but I mostly use OSM for maps anyway. I’ll be very happy to pay $10-$15 a month when Kagi start charging.
> That you aren't willing to pay for search means it isn't very important to you.
Or that they don't have money to throw away on things that aren't necessary to live. Your statement is only true if the person you're responding to is quite wealthy, which is a leap.
I don’t know where you live or what your circumstances are, but $10-15/month is less than Netflix or Spotify. It’s less than YouTube Premium. All those things are essentially you paying for content in a way that results in you being less advertised to. It’s not a lot of money.
Wait, why is being 'quite wealthy' necessary? Their FAQ says $10 on the low end and $20-30 for an unlimited offering. People of varying economic positions spend that much monthly on a myriad of non-essential services. I also feel like you're playing a semantic game with 'importance', since in another comment you immediately jump to a starving person trying to choose between a premium search engine and food. We're more or less in a typical HN thread about a product's barrier to entry, and it feels like you're on a class crusade for equal search results for all. Not the same conversation.
Totally fair, but why not just provide a throwaway email if you're worried about spam or privacy? Is it the principle? I've never had a principle problem with having a login to a service. I would do so with DDG, for example, if I could customise my results better.
When it comes to these older sites, it's hard for an automated engine to discover what they're about, for two reasons: (1) they lack the kind of structured description that's far more common on modern sites (including, to some limited extent, spammy ones, though extensive structured descriptions would still tend to favor legitimate content) and that powers "smart" suggestions in search results; and (2) the web directories that would have provided an accurate description back when those sites were current are now dead. To improve the Web search ecosystem, we'd need "non-commercial", hobbyist sites to work on addressing both problems.
HTML has had metadata tags forever; it's just that search engines quickly stopped using them because they were so inaccurate and prone to abuse. Even now, a heavy presence of these tags is arguably a marker that a website is very interested in its Google ranking, and probably fairly spammy.
Any sort of description, tagging, keywords, or genre labeling needs third-party vetting to be of any use whatsoever. It's simply too profitable to misrepresent your website for it to be any other way.
Those metadata tags were just a simple textual description and a bunch of keywords with no reference to any controlled vocabulary. This is what made them so easy to abuse. Modern schema-based structured data is vastly different, and with a bit of human supervision (that's the "third party vetting") it's feasible to tell when the site is lying. (Of course, low-quality bulk content can also be given an accurate description. But this is good for users, who can then more easily filter out that sort of content.)
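To make the contrast concrete, a small sketch (all values made up): the old meta keywords are free text with no vocabulary, while schema.org JSON-LD commits the site to typed claims a crawler or a human vetter can actually check.

    import json

    # Old-style meta keywords: free text, no vocabulary, trivially stuffed.
    legacy = '<meta name="keywords" content="toaster, best toaster, cheap">'

    # schema.org JSON-LD: typed claims against a controlled vocabulary.
    structured = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": "Acme 2-Slice Toaster",
        "brand": "Acme",
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": "4.1",
            "reviewCount": "87",
        },
    }
    html_block = ('<script type="application/ld+json">'
                  + json.dumps(structured) + '</script>')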
One could even let this vetting happen in decentralized fashion, by extending Web Annotation standards to allow for claims of the sort "this page/site includes accurate/inaccurate structured content."
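For a rough idea of what such a claim could look like: the claim vocabulary below is hypothetical (that's the extension being proposed), but the envelope follows the W3C Web Annotation data model, where "assessing" is one of the standard motivations.

    import json

    # Sketch of a third party asserting a page's structured data is accurate.
    # "structured-data:accurate" is an invented claim format.
    claim = {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "assessing",
        "body": {"type": "TextualBody", "value": "structured-data:accurate"},
        "target": "https://example.com/some-product-page",
        "creator": "https://vetters.example/users/alice",
    }
    print(json.dumps(claim, indent=2))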
The thing is "a bit of human supervision" is difficult on a scale of ten thousand Wikipedias. It pretty much needs to be done completely automatically.
I just hope we can get a trust-based network combined with search. So if I trust some friends and search "best toaster", and one of my trusted friends has given a very high review score to a toaster, then I get that one as the top search result.
Extend this to online communities, and you can ask "what laptop would HN recommend for Linux?", etc.
Of course, there's a privacy issue to solve, but the functionality could be very useful compared to the crappy Google search results for commercial products.
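A back-of-the-envelope sketch of the ranking side (every name, score, and structure here is invented): boost results that people in your trust network have rated highly.

    # Invented data: trust per friend (0..1) and their review scores (0..5).
    TRUST = {"alice": 0.9, "bob": 0.4}
    REVIEWS = {
        "alice": {"https://toasters.example/acme": 5.0},
        "bob": {"https://toasters.example/acme": 3.0,
                "https://spamstore.example/supertoast": 5.0},
    }

    def trust_score(url):
        """Trust-weighted average of friends' reviews for this URL."""
        total = weight = 0.0
        for friend, trust in TRUST.items():
            if url in REVIEWS.get(friend, {}):
                total += trust * REVIEWS[friend][url]
                weight += trust
        return total / weight if weight else 0.0

    def rerank(results):
        """results: (url, engine_score) pairs; friends' ratings dominate."""
        return sorted(results, key=lambda r: (trust_score(r[0]), r[1]),
                      reverse=True)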
I've been having a similar idea, but with less user involvement: a browser extension that extracts keywords from the websites you actually visit and forms a p2p database with your followers. If you see spam or undesired ads in a search result, you can rate the content down, and the system automatically reduces the weighting of the peer that provided that history.
To avoid sharing sensitive pages/habits, maybe let the user review the list in batches and confirm before sharing it out.
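Something like this, as a very rough sketch (every structure and constant is invented):

    from collections import Counter

    peer_keywords = {}  # peer_id -> Counter of keywords from shared history
    peer_weight = {}    # peer_id -> reliability weight in [0, 1]

    def record_visit(peer_id, page_text):
        """Crude keyword extraction from a visited page; a real extension
        would filter stop words and let the user review the batch before
        it's shared, as suggested above."""
        words = [w.lower() for w in page_text.split() if len(w) > 4]
        peer_keywords.setdefault(peer_id, Counter()).update(words)
        peer_weight.setdefault(peer_id, 1.0)

    def rate_down(offending_peer):
        """User flagged a spammy result: halve that peer's influence."""
        peer_weight[offending_peer] *= 0.5

    def relevance(result_keywords):
        """Score a search result by overlap with trusted peers' histories."""
        return sum(peer_weight[p] * sum(hist[w] for w in result_keywords)
                   for p, hist in peer_keywords.items())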
>But these alternative search engines kinda just remind me of why people started using Google. When I use them, I don't get spam
In the past few months, at least half of my Google searches have had spam in the top 3 results: literally malware domains that community-made filter lists already know about, but that Google somehow chooses to serve anyway.
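Checking for them is trivial, too; a sketch assuming a hosts-style blocklist of the kind uBlock Origin or Pi-hole consume (the file path is hypothetical):

    from urllib.parse import urlparse

    def load_blocklist(path):
        """Parse a hosts-style list ('0.0.0.0 malware.example' per line)."""
        domains = set()
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#"):
                    domains.add(line.split()[-1])
        return domains

    def flag_results(urls, blocked):
        """Return the result URLs whose host is on the blocklist."""
        return [u for u in urls if urlparse(u).netloc in blocked]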
Yes! I was trying to find an article on Polish objections to Nord Stream 2 a week or two ago and couldn't find anything. I tried a million search combinations and tried date restricting, but it all favored recency. I ended up finding it as a footnote on Wikipedia.
With marginalia I think that is by design. They intentionally do not index any websites that are deemed "too modern", which includes most things one might be searching for. It does seem to be intended more for exploring unusual places.
It looks good, but perhaps it should have some more options (specifiable as part of the search query text), and documentation for these features. Some possibilities would include filters (by file format, domain name, scheme, etc.), sort order, exclusions, and so on.
There are two menus of options ("Popular Sites", "Blogocentric Eigenvector", "Both Algorithms", "Experimental", "Allow JS", "Deny JS", "Require JS"), but they aren't explained very well. (For example, I might want the search engine to not execute any scripts when extracting a page's text, but to still include pages in the results if their text works with scripts disabled, even when the page has scripts.)
Also, they have some documentation in Gemini format. I have a Gemini viewer on my computer, but it won't open the files because of the "Content-disposition" response header.
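Back on the inline-filter idea: parsing such tokens out of the query is straightforward. A sketch with an invented token syntax (not anything Marginalia actually supports):

    import re

    # Invented inline-filter tokens: site:, format:, scheme:, sort:,
    # each negatable with a leading "-".
    FILTER_RE = re.compile(r'(-?)(site|format|scheme|sort):(\S+)')

    def parse_query(q):
        """'retro computing site:example.com -format:pdf' ->
        (['retro', 'computing'], {'site': ['example.com']},
         {'format': ['pdf']})"""
        include, exclude = {}, {}
        def grab(m):
            bucket = exclude if m.group(1) else include
            bucket.setdefault(m.group(2), []).append(m.group(3))
            return ""
        terms = FILTER_RE.sub(grab, q).split()
        return terms, include, exclude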
You deserve every bit of it, and then some. The scope of Marginalia may be small, but it is one of the best examples of making the non-commercial web accessible to a much larger audience. I hope it inspires others to tackle similar projects.
I'll be able to support the project starting a few weeks from now, and would love some non-Patreon options.
I think what you're doing with Breeze seems interesting, but the value-add of making it a commercial offering isn't clear; what does it offer that anyone else can't easily replicate with Google Custom Search? I'm not saying that there isn't a value add, I'm only saying that it isn't obvious as a user.
Something has to be a scarce resource; my guess is that the resource here is "labor" in building the CSE parameters and finding the sites to add to collections. Perhaps the effort that went into this should be emphasized.
Are there plans to move things server-side? Making client-side requests to Google has privacy implications.
tl;dr -- working on improving client-side privacy protection & there are a couple of other options that may permit using Google with a slight bit of user config; the premium version is all server-side, since it's proxied via Bing, Gigablast &/or our index; there's an alt premium option that would basically proxy Google via a cloud browser, say Browserling or KASM or similar
1. the client-side will be set to no personalized ads -- just found out about a fairly hidden setting that permits that -- should be changed / live later today
2. Google's API precludes going server-side unless the user configs an account which we can then drop in (see the sketch after this list), since it's limited to 10K queries / day -- too few to do anything meaningfully custom or full-web; we're planning a blog post on that as an intermediate alt
3. we've started to instrument whether client-side calls sidestep any privacy protections -- those results / protections are necessarily limited, but we can perhaps uncover anything in their client-side code that is privacy-revealing & possibly mitigate some of it
4. premium is server-side with a proxy to {ahrefs*, Bing, Gigablast, etc.} OR our Breeze index -- we scrape inventory-sensitive / time-sensitive sites, e.g., used car dealer pages
5. given enough ad or other revenue, we could alternatively give everyone a Bing proxy like DDG and other services do -- bit too bootstrapped to do that out of the gate, plus Bing has more constraints on custom search, so that's a mixed bag of outcomes
6. another approach, also necessarily premium, would be to proxy Google searches through a service like Browserling, since the API doesn't permit it at scale
7. premium also includes alerts, especially for things like car dealer pages that are more time-sensitive and harder to config than what free Google Alerts do
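to make #2 concrete, here's roughly what dropping in a user-configured account looks like against Google's Custom Search JSON API -- a sketch, with error handling & paging omitted:

    import requests  # third-party: pip install requests

    def cse_search(query, api_key, engine_id):
        """Query Google's Custom Search JSON API with the user's own
        credentials; the per-key daily quota is what rules out a single
        shared server-side deployment."""
        resp = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": api_key, "cx": engine_id, "q": query},
            timeout=10,
        )
        resp.raise_for_status()
        return [item["link"] for item in resp.json().get("items", [])]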
happy to DM some examples of CSE configs, ranging from easy to hard
if I'd add anything else, it's that we're positioning Breeze more as a search client + search engine, e.g., we use Google, Bing, etc. for web-wide search (search client) whereas we scrape car dealer pages for real-time alerts on inventory (search engine)
that same search-client philosophy is also why we're adding a low-code query builder, so that anyone can build a really extended query, aka a custom search engine, and either keep it private for their own use or share it with the community, since we can't possibly build all the CSEs ourselves
in that sense, our long-term trajectory is about building what amounts to a deep Reddit, where people can search sites / topics of interest in very direct ways that are far more substantive than wrestling with Bing / Google
that's also why we're shifting away from /topics at the top of our site to integrating the branches / filters / custom queries directly into the search experience, e.g., blogs is the first one we've done that with that's not easily available elsewhere, and podcasts is likely next.
-- custom / topic search, or what we call branches --
tl;dr -- the free version makes it easy for anyone to build, use & share custom searches; the premium version is zero ads + better web alerts + easier access to alt web indexes
1. the average user is unlikely to go through the config of a custom search
2. even if they did, it's so poorly documented & finicky that they'd be unlikely to achieve a comparable outcome
3. Bing CSE is even worse -- requires an Azure account, limited to 400 up/down boosts, etc.
4. there are other technical reasons a user might not do more than a couple, along with the sheer scale issues also mentioned
5. we're refactoring how the topic searches are experienced, making the topics, which we call branches, a more natural part of the search experience
6. "blogs" is the first one done that way -- you can filter to just blogs after searching for X; any other branches / topics will be added in that way
7. e.g., podcasts & RSS feeds are coming out soon, along with more traditional filters such as shopping
8. that approach also makes it easy for us to expose what we're calling a low-code builder, letting anyone build a custom search and either share it publicly or keep it private (a rough sketch follows this list)
9. that includes all possible filters -- advanced keyword combos, site inclusion / exclusion, URL patterns, schema structure, etc.
10. premium for users includes zero ads, alerts, and some other features that are a mix of TBD or too early to build
11. premium for teams includes similar things, along with the ability to config dashboards of searches, e.g., an HR dashboard of relevant custom searches
12. we're navigating that labor balance -- e.g., the blogs filter is fairly basic atm, whereas filtering college scholarships was a bit more nuanced to get working
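and since the low-code builder came up a few times, a rough sketch of what a shareable branch definition + compiled query could look like -- every field name here is invented, just to show the shape:

    # Hypothetical branch definition a low-code builder might produce.
    scholarships_branch = {
        "name": "college-scholarships",
        "include_sites": ["fastweb.com", "scholarships.example"],
        "exclude_sites": ["essaymill.example"],
        "require_terms": ["scholarship"],
        "exclude_terms": ["loan", "sponsored"],
        "public": True,
    }

    def compile_query(user_terms, branch):
        """Compile a branch plus the user's terms into standard operators."""
        parts = list(user_terms) + branch["require_terms"]
        sites = " OR ".join("site:" + s for s in branch["include_sites"])
        parts.append("(" + sites + ")")
        parts += ["-site:" + s for s in branch["exclude_sites"]]
        parts += ["-" + t for t in branch["exclude_terms"]]
        return " ".join(parts)

    # compile_query(["chemistry"], scholarships_branch) ->
    # 'chemistry scholarship (site:fastweb.com OR site:scholarships.example)
    #  -site:essaymill.example -loan -sponsored'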
Teclis[0] and Wiby[1] are similar contenders in the non-commercial search space.
[0]: http://teclis.com/
[1]: https://wiby.me/
What the article mentioned about the influence of PageRank definitely rings true. An interesting variation is used by the Secret Search Engine Lab's CashRank algorithm[2].
[2]: http://www.secretsearchenginelabs.com/tech/cashrank.php
I listed a bunch of engines with their own indexes, FWIW:
https://seirdy.one/2021/03/10/search-engines-with-own-indexe...