Hacker News: shasheene's comments

There's no reason why a web browser bookmark action doesn't automatically create a WARC (web archive) format.

Heck, with the cost of storage so low, recording every webpage you ever visit in a searchable format is also very realistic. Imagine having the last 30 years of web browsing history saved on your local machine. This would be especially useful when in research mode and deep diving into a topic.

[1] https://github.com/machawk1/warcreate

[2] https://github.com/machawk1/wail

[3] https://github.com/internetarchive/warcprox

EDIT: I forgot to mention https://github.com/webrecorder/webrecorder (the best general purpose web recorder application I have used during my previous research into archiving personal web usage)
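
To give a feel for how little a bookmark hook would need to write, here is a minimal sketch of serializing one WARC/1.0 "resource" record using only the Python standard library. The function names and the idea of calling it from a bookmark action are my own assumptions for illustration; the tools linked above capture full request/response pairs and handle compression, which this does not.

```python
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(url: str, payload: bytes,
                               content_type: str = "text/html") -> bytes:
    """Serialize one WARC/1.0 'resource' record for an already-fetched body."""
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {url}",
        # WARC dates are UTC timestamps in ISO 8601 form.
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Record layout: header lines, blank line, content block, two CRLFs.
    return head + payload + b"\r\n\r\n"


def archive_bookmark(warc_path: str, url: str, payload: bytes) -> None:
    """Hypothetical bookmark hook: append one record to an uncompressed .warc file."""
    with open(warc_path, "ab") as f:
        f.write(build_warc_resource_record(url, payload))
```

Calling archive_bookmark("bookmarks.warc", url, html_bytes) once per bookmark would grow a single append-only archive file; the file name here is just an example.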


This was what made me convert from bookmarking to clipping pages into Evernote around 6-7 years ago. I realized I had this huge archive of reference bookmarks that were almost useless because 1) I could rarely find what I was looking for, if I even remembered I'd bookmarked something in the first place, and 2) if I did, it was likely gone anyway. With Evernote I can full text search anything I've clipped in the past (and also add additional notes or keywords to ease in finding or add reference info).

Since starting with replacing bookmarks, I've moved other forms of reference info in there, and now have a whole GTD setup there as well, which is extremely handy since I can search in one place for reference info and personal tasks (past and future). Only downside is I'm dependent on Evernote, but hopefully it manages to stick around in some form for a good while, and if it ever doesn't, I expect I'll be able to migrate to something similar.


Shout out to https://joplinapp.org/

I was an Evernote user when I was on macOS. When I switched to Linux, a proper web clipper was something I really missed. I'm now on Joplin and it does everything I used to use Evernote for and then some.

It even has vim bindings now!

As far as longevity goes, I think they got their archive / backup format right - it's just a tarball with markdown in it.


No need of proprietary code and apps why not build it into browsers. I have seen Firefox and Chrome can download web pages. So it will be nicer if they can download the bookmarked pages and store in a local html, css, image folder. I think it's pretty easy to achieve.

Also people need to move away from those esoteric reactjs, angular, vuejs and plethora of CMS as API or static site generators relying on some js framework which won't last even 2-3 years. Use a static site generator which can generate a plain html, like static site generators built on pandoc, python docutils or similar.

Personally I like restructuredText as the preferred format for content as its a complete specification and plain text. So the only thing in this article I will change is that content can also be in rst format and then generate html from it. Markdown is not a specification as each site implements their own markdown directives unlike restructuredtext specification and most of the parsers and tooling are little different from each other.


> Personally I like restructuredText as the preferred format for content as its a complete specification and plain text.

I have used rst intensively on a project. A few years later, I would be hard pressed to write anything in it and would need to start with a Quick Start tutorial. With all its faults, Markdown is simple enough that it can be (and is) used anywhere, so there is no danger of me forgetting its syntax (even if it wasn't much simpler to start with).

So personally I would prefer md over rst anytime.


> Markdown is not a specification

Not by that name... https://commonmark.org/


It is still not a specification like restructuredText[1]. Also wikiMarkup (which really started this markdown) is different from GitHub markdown, which is different from other markdown editors. Also many sites use their own markdown versions.

If you are in the restructuredText world there is one specification and all implementations adhere to it, be it pandoc, sphinx, pelican, or nikola. The beauty of it is that it has extension mechanisms which give each tool enough room to build on it, while the markup itself can still be parsed by any tool.

[1] https://docutils.sourceforge.io/docs/ref/rst/restructuredtex...


I don't know why markdown is so popular other than maybe "it was easy to get running" or "works for me".

It's better than "designed by a committee" standards, but it lacks elegance or maybe craftsmanship.


Because it's inherently appealing, close to what you wanted intuitively, and if you're only dealing with a single implementation of it, works fairly well.

You don't really get bit by its lack of a standard and extensibility until after you've bought in.

It's essentially designed by the opposite of a committee -- rather than including everything but the kitchen-sink, it contains support for almost no usecases except the one. Which is very appealing, when you only have the one usecase.


Well, rst has been better than markdown from day one. The only reason markdown became famous is thanks to wikimarkup.

So markdown needs to thank the popularity of Wikipedia for its success, as rst did not have any application like Wikipedia. But rst is still used widely enough thanks to its killer apps, Sphinx and readthedocs, and it's now kind of the de facto documentation markup in Python and much of the open source software world.


Because you can teach someone markdown in five minutes. And even if they don't know all the ins and outs, the basics are pretty foolproof (paragraphs, headings, bold and italic).


> No need of proprietary code and apps

Joplin is free and open source.


This... I just found a plugin for the static site generator Pelican that is 7 years old that still works. After running Pelican you get plain HTML that can be hosted anywhere. I like Netlify, but other options like GitHub pages are also great. The author recommends not putting on GitHub Pages because they haven't found a working business model and might not be here in the future. But... GitHub has been taken over by Microsoft which is most likely not going bankrupt soon and Microsoft loves their backward compatibility so I am confident they won't screw GitHub up too much.


You could have said the same about GeoCities when it was acquired by Yahoo. But it didn't last, and now the same is happening with Yahoo Groups. So I am not hopeful that Microsoft will keep GitHub if it becomes a liability.


Yahoo! and Microsoft have very different business models. One is inherently more sustainable (selling software and services).


> No need of proprietary code and apps

Joplin is open source, which is a big part of the sell to me. It definitely isn't the best of all possible note taking systems that could ever exist, but it's the best open source one I've found so far, and I don't have time to write a better one at the moment.

> why not build it into browsers. I have seen Firefox and Chrome can download web pages. So it will be nicer if they can download the bookmarked pages and store in a local html, css, image folder. I think it's pretty easy to achieve

This is solving a different problem though. WARC/MHT and other solutions can do this. Joplin is more of a note taking system that allows ingesting content from the web into one's own local notebook, which is relevant to what the GP post was talking about - Evernote.

However, it would seem that "the modern web" is the now popular standard. 10 years ago it might have been Flash or Java web applets or whatever. Now it's JS. I'm not convinced that JS is any better than what it has replaced. However, people keep paying developers to write them, so presumably someone likes them.

> Also people need to move away from those esoteric reactjs, angular, vuejs and plethora of CMS as API or static site generators relying on some js framework which won't last even 2-3 years. Use a static site generator which can generate a plain html, like static site generators built on pandoc, python docutils or similar.

Agreed, but that's also not a problem that Joplin, Evernote, or any other such tool is going to be able to solve. Unless you are complaining that Joplin is an Electron app? That's my biggest issue with it personally. It runs well enough, but is definitely the heaviest application I use regularly, which is a little sad for a note taking program. On the other hand, I haven't found a better open source replacement for _Evernote_. There are lots of other open source note-taking programs though.

> Personally I like restructuredText as the preferred format for content as its a complete specification and plain text. So the only thing in this article I will change is that content can also be in rst format and then generate html from it. Markdown is not a specification as each site implements their own markdown directives unlike restructuredtext specification and most of the parsers and tooling are little different from each other.

reST is indeed very nice. At one point, I kept my personal notes as a Sphinx wiki with everything stored in reST. I found this to be less ergonomic than Evernote/Joplin, although in principle it could do all the same things that Joplin can do, and then some.


> No need of proprietary code and apps why not build it into browsers.

Safari does this. Pages added to the reading list have their content archived for offline reading.


Joplin is open source.


Thanks a lot for the recommendation! I have been a little annoyed with Evernote not having an app for Ubuntu, which I recently started using quite heavily. So this looks very interesting!


The developer behind it is doing some awesome stuff so I decided to sponsor him on GitHub.


> Only downside is I'm dependent on Evernote, but hopefully it manages to stick around in some form for a good while, and if it ever doesn't, I expect I'll be able to migrate to something similar.

I have used Evernote and OneNote, but have finally, after a long interim period, resorted to using only markdown.

I have a "Notes" root folder and organize section groups and sections in subfolders. VSCode (or Emacs), with some tweaks, shortcuts, and extensions, provides a good-enough markdown editing experience. For example, one extension lets you paste in images, storing them in a resources folder in the note's current location (yes, I see small problems with this down the road when re-organizing, but nothing that can't be handled).

For Firefox, I use the markdown-clipper extension the few times I would like to copy a whole article, it works well enough. Or I copy/paste what I need; mostly, I take my own summarized notes.

For syncing, I use Nextcloud, which also makes the notes available both for reading and editing on Android and iOS (I use both).

Up until very recently, I used Joplin, which also uses markdown, but there were two things I could not live with: it does not store the markdown files with a readable filename (e.g., the note's title), and it ties you to a specific editor.

If you are mostly clipping and not writing your own notes, I can imagine my setup won't work well, or be very efficient.

I want to use a format that has longevity, and storing in a format that I cannot grep is out of the question.



https://archivebox.io/

You bookmark in Pinboard or Delicious and ArchiveBox saves the page. Handy.


>if I even remembered I'd bookmarked something in the first place

I had recently participated in a discussion on the problem of forgetting bookmarks[1].

Copying my workflow from there,

1. If the entire content should be easily viewed, then store via pocket extension.

2. If partial content should be easily viewed, i.e. some snippet with a link to the entire source, then store in notes (Apple).

3. If the content seems useful in the future, but it is okay to forget it, then I store it in the browser bookmarks.

But my workflow doesn't address the problem raised by Mr. Jeff Huang: if the Pocket app or Notes disappears, so do my archives. I think a self-hosted archive as mentioned by the parent is the way to go, but I don't think it's a seamless solution for a typical web browser user.

[1]https://needgap.com/problems/57-i-forget-my-web-bookmarks-qu...


My solution for a small subset of the forgetting problem:

I frequently see something and want to try it out the next time I want to do something else. So I emulate User Agent strings and append lots of "like [common thing I search for a lot]" to the bookmark. When I start typing into the search bar for those other things I'll be reminded of the bookmark.

For example, since file.io is semi-deprecated I decided to try out 0x0.st . But I kept forgetting when I actually needed to transfer a file, so I made a bookmark titled "0x0.st Like file io".

As a side note, I have a similar bash function called mean2use that I use to define aliases that wrap a command and ask me if I'd like to do it another way instead or if I'm sure I want to use the command. I've found this is a nice way to retrain my habits.


That was useful, can you add this to the original needgap thread I linked?

Disclaimer: needgap is a problem validation platform I built.


I'm glad you mention Evernote. I also use it for this, and also for many other purposes.

It is true that it is proprietary software, but it is worth mentioning that all the content can be exported as an .enex file, which is XML.

So, the data can be easily exported.


>easily exported

Have you actually looked at such an xml: https://gist.github.com/evernotegists/6116886

Exported sure, it's all there. But importing that into your new favorite notes application is not going to be trivial, especially not for regular users.

That's why I've decided to stick mostly to regular files in a filesystem.


Presumably "regular users" will not be individually writing XML parsing code to convert the notes. The developers of their "new favorite notes application" will do it (and if they can't be bothered, maybe it shouldn't be your "new favorite notes application").

Joplin, for example, can import notes exported from Evernote. It's just a menu option that even regular users should have no trouble employing.
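
For the curious, the developer side of this is not much code either. A minimal sketch with Python's standard library, assuming the export follows the usual ENEX layout (an en-export root containing note elements with title, content and tag children, where content holds the ENML body as CDATA). The function name is made up, and this is not how Joplin's importer works; check the element names against the gist linked above before relying on them.

```python
import xml.etree.ElementTree as ET


def list_enex_notes(enex_xml: str):
    """Return (title, tags, enml_body) for every <note> in an ENEX export."""
    root = ET.fromstring(enex_xml)
    notes = []
    for note in root.iter("note"):
        title = note.findtext("title", default="")
        # A note may carry zero or more <tag> children.
        tags = [tag.text or "" for tag in note.findall("tag")]
        # The body is ENML (an XHTML-like dialect) wrapped in CDATA.
        body = note.findtext("content", default="")
        notes.append((title, tags, body))
    return notes
```

Turning the ENML body into markdown or plain HTML is the genuinely fiddly part, which is probably why you want the application to do it for you.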


I store bookmarks (i.e. URLs) in a simple .txt file. My text editor lets me click on them to bring them up in a browser.

> Only downside is I'm dependent on Evernote

No special software nor database required.


What benefit does that provide compared to regular browser bookmarks? It doesn't seem to address either of the issues I mentioned.


1. it's independent of the browser

2. it works with any browser

3. I can move it to any machine and use it

4. It is not transmitted to the browser vendor

5. Being a text file, it is under my control

6. I can back it up

7. I don't need some database manager to access it

8. I can add notes and anything else to the file

9. It's stupid simple


What happens when the website itself becomes unavailable?


For me, this is the problem that Evernote solves - it saves the entire content of the page, images, text, and clickable links.


The link stops working.


I find Evernote's search isn't that good, at least in the free version. Often trying to remember keywords and using Google is faster.

I know about DevonThink; I've read good recommendations. But it's iOS/Mac only.

Any Evernote alternative for Win/Android with great search?


Another commenter suggested https://joplinapp.org/, it has a nice search feature and has apps for most platforms.


You just opened up a world for me I hadn't thought about!

So simple! Thank you!


You're welcome! If you're interested in getting into GTD in Evernote (which I highly recommend), I wrote a blog post a while back about my setup: https://www.tempestblog.com/2017/08/16/how-i-stay-organized-...


Nice article, but http://www.thesecretweapon.org/ isn't reachable anymore. The page didn't last...


Oh the irony. I'll update my link to an archive post or reproduce the important parts. Thanks for pointing that out!

Edit: doesn't appear to be down, they're just using a self-signed cert.


Hmm, you're right. I tried "continue with insecure certificate" yesterday, but maybe I was too impatient.


Is there a good end-to-end encrypted alternative to Evernote?


First I've heard of web clipping. Looks like OneNote has web-clipper extensions, too. This is so great.


>There's no reason why a web browser bookmark action doesn't automatically create a WARC (web archive) format.

Indeed.

And I still remember the modem days where I would download entire websites because the ISP charged by the hour, and I'd read them offline to save money.


I can't put my finger on it but this has a sort of Dickensian quality to me.

I think this says something kind of profound about information and capitalism and whatnot.


No, it hasn't. Technology just wasn't there back then which caused significant cost per time unit, which makes it only fair to charge per time unit.


Yeah. In the dialup days, layers 1 and 2 of a home internet connection were a long-running phone call between your own modem and a modem of an ISP. You paid via your phone bill, for the duration of the call.


It says scarce resources are pricier. Welcome to the real world!


Personal wayback machines should be standard computing kit. I have had one since around 2013. Very bare-bones demo: https://bpaste.net/show/3FBH6 (it does much more than that). file:// is supported, for example, so you can recursively import a folder tree and re-export it later if you want to.

Or in some random script: "from iridb import iri_index" "data = iri_index.get('https://some/url')" I'm skipping lots: you can ref by hash, url, or url+timestamp. It hands back a fh, since you don't know if the data you are reffing even fits in memory. Extensive caching, all the iri/url quirks, punycode, PSL etc.

Some random pdf in ~/Downloads, "import doc.pdf" and dmenu pops up, you type a tag, hit enter and the pdf disappears into the hash tree, tagged, and you never need to remember where you put it. Later on you only need to remember part of the tag, and a tag is just a sequence of unicode words.

Chunks are on my github (jakeogh/uhashfs, it's heka-outdated, don't use it yet); I'll be posting the full thing sometime soonish.


Just this week I asked the author of SingleFile if he could implement a save-on-bookmark feature for SingleFile, and he was amenable:

https://github.com/gildas-lormeau/SingleFile

https://github.com/gildas-lormeau/SingleFile/issues/320


Nice. FYI there's also SingleFileZ

> SingleFileZ is a fork of SingleFile that allows you to save a webpage as a self-extracting HTML file. This HTML file is also a valid ZIP file which contains the resources (images, fonts, stylesheets and frames) of the saved page.

https://github.com/gildas-lormeau/SingleFileZ
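
The "HTML file that is also a valid ZIP file" part works because ZIP readers locate the archive from the end of the file, so leading bytes are tolerated. A toy illustration of that general trick in Python follows; the function names are made up, and this is only a sketch of the idea, not SingleFileZ's actual file layout or extraction logic.

```python
import io
import zipfile


def make_html_zip_polyglot(html_prefix: bytes, resources: dict) -> bytes:
    """Return bytes that begin with HTML text and end with a ZIP archive."""
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w") as zf:
        for name, data in resources.items():
            zf.writestr(name, data)
    # Plain concatenation: the ZIP central directory stays at the very end.
    return html_prefix + zip_buf.getvalue()


def list_embedded_resources(blob: bytes) -> list:
    """List archive members, ignoring whatever precedes the ZIP data."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.namelist()


def read_embedded_resource(blob: bytes, name: str) -> bytes:
    """Read one member back out of the combined file."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.read(name)
```

Python's zipfile compensates for the prepended bytes when it opens the archive, which is the same property self-extracting archives rely on.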


I'll implement the "save bookmark page" feature in both extensions :)


Whoa. I just installed SingleFileZ for FF and it is working great. Before I was using wget and that was clunky. This is working great since I can just toss a single file up on my server and we are good to go. Thanks for this!


oh, hello! It's funny how this subject has popped up again.


Hi! I think it confirms that there's a real interest in this feature.


Off topic, but could I ask how you knew your software was being talked about? Did you just happen by or have you some monitoring agent looking for mentions? Just curious


Sorry, I didn't see your question. I check the posts on HN very regularly. The title of the post made me think that people might have been talking about SingleFile. Sometimes, friends of mine tell me someone on the Internets is talking about SingleFile :). I also sometimes use the integrated search engine.


WorldBrain's Memex (https://addons.mozilla.org/en-US/firefox/addon/worldbrain/) has an option to perform a full-text index (not archive) of bookmarks, or pages you visited for 5 seconds (default) down to 1 second (no option to index all pages). It stores this stuff into a giant Local Storage (etc) database, which Firefox implements as a sqlite file.


https://www.gwern.net/Archiving-URLs describes extracting browser history to create an archive via a batch job.


Firefox actually purges history automatically. For instance, the oldest history I have on this browser right now is from January 2018. I found out about this the hard way.


I noticed this behavior in Firefox too. So I started writing personal Python scripts to scrape FF's SQLite database where it stores all the browsing history information.
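
Roughly what such a script can look like, as a minimal sketch rather than my exact code. It assumes the history lives in the moz_places table of places.sqlite with url, title and last_visit_date columns, where last_visit_date is in microseconds since the Unix epoch; the function name is made up for illustration.

```python
import sqlite3
from datetime import datetime, timezone


def export_history(conn: sqlite3.Connection):
    """Return (iso_timestamp, url, title) rows for every visited page, oldest first."""
    rows = conn.execute(
        "SELECT url, title, last_visit_date FROM moz_places "
        "WHERE last_visit_date IS NOT NULL "
        "ORDER BY last_visit_date"
    ).fetchall()
    history = []
    for url, title, visited_us in rows:
        # Convert microseconds since the epoch to an aware UTC datetime.
        visited = datetime.fromtimestamp(visited_us / 1_000_000, tz=timezone.utc)
        history.append((visited.isoformat(), url, title or ""))
    return history
```

Open a copy of places.sqlite with sqlite3.connect() rather than the live file, since the database is usually locked while Firefox is running, and append the result to a plain text file on a schedule.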


Safari does the same, even though I tell it to never clear browsing history.


I think Chrome(ium) does as well. Very annoying tbh.


Chrome was the first browser I encountered that deletes history without being instructed to.


It looks like Firefox has been doing it since 2010[1]. I wonder how long Chrome has been doing it, since launch, 2008? Here's a Chrome bug discussing it[2].

[1] https://web.archive.org/web/20151229082536/http://blog.bonar...

[2] https://bugs.chromium.org/p/chromium/issues/detail?id=500239


Mosaic had full text history search.


You can increase the retention period to centuries via about:config.


I have this problem. Some bits of history are gone except from old backups of profile directories and profiles where I've already set places.history.expiration.max_pages to some absurdly high number.

I need to do a handful of experiments to see exactly how this interacts with Sync, even though I've (foolishly) already synced the important profiles. I'd hope that the cloud copies of the places database just keep growing, but in any case, I'd rather combine them all offline anyway.


Even if you set the setting, how can you be sure that it won't be reset on an upgrade or that you'll remember to set it if you need a new profile (perhaps your old one becomes buggy, crufty, corrupt, or all three)? I thought I had all my history retained until one day I couldn't find a website I knew I had visited years ago, and took a closer look at my history and was very unpleasantly surprised... What happened? I'll never know, but my suspicion is that Firefox reset the history retention setting at some point along the way. If you do any web dev, you know Firefox occasionally backstabs you and changes on updates. The only way to be sure over a decade-plus is to regularly export to a safe text file where the Mozilla devs can't mess with it. I can't undo my history loss, but I do know I have lost little history since.


I can't be sure. When I say 'combine them all offline', I mean using something like [1] which refuses to do anything for me because the Waterfox database version is a rather old Firefox version, and that seems to expect all the db's versions to be up-to-date and equal, which seems pointless. #include <sqlite3.h> was my next step-- only I don't walk very well, so that didn't happen "yet". Or I'm lazy, or distracted, or depressed, or something. When I recently got tired of realizing a thing was on the other machine, I bit the bullet and synced them, if only to see how well that worked.

Anyway, thanks for the guide.

[1] https://github.com/crazy-max/firefox-history-merger


I like the idea, but wanted to know how realistic it would be so I made a quick and dirty Python script to download all my bookmarks. If you want to make the same experiment, you can get it from here: https://gist.github.com/ksamuel/fb3af1345626cb66c474f38d8f03...

It requires Python 3.8 (just the stdlib) and wget.

I have 3633 bookmarks, for a total of 1.5 GB unzipped, 1.0 GB zipped (and we know we can get more from a better algorithm and from using links for files with the same checksum, like JS and CSS deps).
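
The checksum idea is easy to prototype. A minimal sketch, separate from the gist above and with a made-up function name, that groups downloaded files by SHA-256 so identical JS/CSS dependencies can be found; replacing the duplicates with hard links or symlinks is left out.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_files(root: str) -> dict:
    """Map SHA-256 digest -> sorted list of paths, for content stored more than once."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                # Hash in 1 MiB chunks so large files are never fully in memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            by_digest[digest.hexdigest()].append(path)
    # Keep only digests that occur at more than one path.
    return {d: sorted(paths) for d, paths in by_digest.items() if len(paths) > 1}
```

Every path after the first in each group is a candidate for replacement by a link to the first.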

This seems acceptable IMO, especially since I used to consider myself a heavy bookmarker and I was stunned by how few I actually had and how little disk they occupied. Here are the types of the files:

  31396 text/plain
   3034 application/octet-stream
   1316 text/x-c++
   1123 text/x-po
    865 text/x-python
    384 text/html
    227 application/gzip
    218 inode/x-empty
    178 text/x-pascal
    113 image/png
     44 application/zlib
     29 text/x-c
     28 text/x-shellscript
     14 application/xml
     13 application/x-dosexec
     12 text/troff
      5 text/x-makefile
      4 text/x-asm
      3 application/zip
      2 image/jpeg
      2 image/gif
      2 application/x-elc
      1 text/x-ruby
      1 text/x-diff
      1 text/rtf
      1 image/x-xcf
      1 image/x-icon
      1 image/svg+xml
      1 application/x-shockwave-flash
      1 application/x-mach-binary
      1 application/x-executable
      1 application/x-dbf
      1 application/pdf

It should probably be opt-in though, like a double click on the "save as bookmark" icon to download the entire page, with the star turning a different color. Mobile phones, Chromebooks and Raspberry Pis may not want to spend the space, not to mention there is some bookmark content that you don't want your OS to index and show you previews of in every search.

But it would be fantastic: by doing this experiment I noticed that many bookmarks were 404 now, and I will never get their content back. Besides, searching bookmarks and referencing them is a huge pain.

So definitely something I wish mozilla would consider.


> definitely something I wish mozilla would consider

There used to be this neat little extension called Read It Later that let you do just that. Bookmark and save it so you could read it when you were offline or the page disappeared. Later they changed their name and much later Mozilla bought it and added it to Firefox without a way to opt out. It was renamed to Pocket.


Pocket is not integrated with your bookmarks. For offline consultation, you need a separate app. Of course this app is not available on Linux, where you have to get some community provided tools.

Bookmark integration would mean one software, with the same UI, on every platform, and only one listing for your whole archive system.


I’ve been building an application to do this, except for everything on your computer! It’s called APSE[0], short for A Personal Search Engine.

[0] https://apse.io


Having to pay $15/month ($180/yr!) to be able to search stuff on my own computer for years seems awfully expensive. I'd rather depend on some simple open-source piece of software that I can understand and maintain if necessary.


Yeah, the sheer idea of paying a subscription for software that is running on my computer to index local resources is crazy. This kind of software should be sold as a one-time purchase license.


Decades ago, when I worked at Lotus, there was an amazing piece of software from them called Magellan. I remember the first time I saw someone search and find results in text documents, spreadsheets and many of the other common formats of the day.

That was in 1989 and today I mostly search my computer using find and grep commands, since that's what just keeps working.


I should try adopting find and grep, but on Windows I'm currently using this and I'm very happy with it: https://www.voidtools.com/downloads/


I use Void Tools Everything to find files by name, and AstroGrep for finding information in them.


Yup, I could see paying $180 one time for something like this. But at $180 a year for a self-hosted product... that's just very steep.


Google used to have a native Mac extension like a launch bar: Command-Space ... Enter would search all local files. It was really fast.



Well I used LaunchBar and then Quicksilver for many years. Spotlight has never been as nice and hackable as those.


You can ask Safari to do that by enabling 'Reading List: Save articles for offline reading automatically'. It's not WARC but it is an offline archive. The shortcut is cmd-shift-D which is almost the same as the bookmark one. It's also the only way I know of to get Safari to show you bookmarks in reverse chronological order. And it syncs to iOS devices.

This could be done in better and more specialized ways; one problem is that browser extension APIs don't provide very good access to the browser's webpage-saving features.


This problem has been solved a long time ago if you use Pinboard.

https://pinboard.in/

Just pay the yearly subscription so pinboard can cache your bookmarks.


I do not see how using a web thing is a solution to web things going away.


Pinboard happens to be a web service run along the same principles as the article we're discussing.

The bus-factor risk is high, but I suspect that Maciej has a plan that'll let us download our archives even if he does get grabbed by the mainland Chinese government, let alone in a foreseeable going-out-of-business scenario.


Then what do you propose the answer is? The blog post just proposes using “web things” differently


For archiving web pages? ArchiveBox is ok.


I've contemplated upgrading my Pinboard account many times. Finally bit the bullet.


Until pinboard goes OOB


Pinboard is a profitable online equivalent to a mom and pop shop. It’s sustainable and its founder isn’t chasing growth at all costs. It also has a cult following, so OOB is highly unlikely.


Does pinboard caching work with sites that are behind a login or paywall?


It does not. See FAQ here: https://pinboard.in/upgrade/


This is the advantage to Evernote. Since it’s a browser extension, it has access to anything you have access to.


The downside is, since it’s a browser extension, it has access to anything you have access to.


Agreed that there’s a tradeoff. I don’t think there’s really an alternative solution though.


I use the Zotero extension for this feature.

https://www.zotero.org/


came here to say this. zotero also saves metadata as an extra. as long as it is used from within a browser.


In practice how is this different from MHTML? I think most browsers have built-in support for MHTML so it should be possible to build that part easily.


The state of mhtml support is fairly pathetic at the moment. Firefox broke mhtml compatibility with the Quantum overhaul. Chrome's mht support has been hit and miss over the years, sometimes removing the GUI option entirely and requiring one to manually launch the browser with a special flag to enable it. The only browser with a history of consistent mhtml support happens to be... Internet Explorer, followed by a bunch of even more obscure vendors that nobody really uses.

I am currently dealing with the problem of parsing large mht files (several megabytes and up). A regular web browser would hang and crash upon opening these files and most ready made tools I could find struggle with the number of embedded images. It's very much a neglected format with very little support in 2019.
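
One thing that helps: an .mht file is essentially a MIME multipart/related message, so a mail parser can take it apart without a browser. A minimal sketch using Python's standard email package, assuming each part carries a Content-Location header (as browser-saved MHTML typically does). The function name is made up, and it reads the whole file into memory, which is fine for files of a few megabytes but not for huge ones.

```python
import email


def list_mhtml_parts(mhtml_bytes: bytes):
    """Return (content_type, content_location, decoded_size) for each leaf part."""
    msg = email.message_from_bytes(mhtml_bytes)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            # Skip the multipart container itself; only leaf parts carry data.
            continue
        # decode=True undoes the base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True) or b""
        parts.append((
            part.get_content_type(),
            part.get("Content-Location", ""),
            len(payload),
        ))
    return parts
```

From there, writing each decoded payload to disk under a name derived from its Content-Location gives a plain folder of html, css and images.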


According to the MHTML entry on Wikipedia, Chrome requires an extension, Firefox doesn't support it, and only Internet Explorer supports MHTML.


I mean, it's not any worse than WARC support…


Maybe MAFFs are best, as they use compression instead of base64 encoding: https://en.wikipedia.org/wiki/Mozilla_Archive_Format


Or SingleFileZ files which can be viewed without installing any extension https://github.com/gildas-lormeau/SingleFileZ.

Edit: it can also auto-save pages, like SingleFile.


I used the excellent unMHT plugin for Firefox, but it got dropped some time ago, failed to meet "enhanced security requirements" :(

I still keep an old ESR with this plugin for archiving, and accessing MHTs.


This is what started me clipping everything to OneNote instead of bookmarking. Unfortunately, it becomes difficult to maintain, the formatting is off, things subtly break, pages clipped on mobile use different fonts for god knows what reason, some content is discarded silently because the clipper deems it's not part of the main article, I could go on.

It's better than nothing but it's also increasingly frustrating to deal with.


I've actually been saving every page I visit for a good two years now and it has barely made a dent in my NAS storage space. As usual though, I wrote a crappy extension and Python script to do that because I never bothered to look online. Thanks for introducing me to warcprox - I'll probably be making the switch very soon.


That would be even more useful if a search warrant is ever executed on my house.

Not useful to me personally, but useful to someone!


I always save pages instead of only bookmarking them.

Most of the websites I used to visit during the demoscene's heyday are now gone.


Years ago I used the old Firefox addon Shelve to automatically archive the vast majority of web pages I visited.

http://shelve.sourceforge.net/

The main disadvantage was disk space. This is particularly true when some pages are 10 MB or larger. I would periodically prune the archive directory for this reason.

I stopped using Shelve when I started running out of disk space, and now I can't use Shelve because the addon is no longer supported. The author of Shelve has some suggestions for software with similar functionality:

https://groups.google.com/forum/#!topic/shelve-firefox-addon...


It installs and loads in Waterfox, but of course it still hasn't been touched in 3.5 years.

I used Scrapbook in the past (which also still works) but I usually just save random things in ~/webpages/ since (apparently) 2011. The earliest is a copy of the landing page at bigthink.com. Of course now almost every link is broken, excluding social media buttons, About Us, Contact Us, RSS, Privacy, Terms of Use, Login, and the header logo pointing at the same page.


It would be nice if we had browsers that were actually user-agents that allow full pluggable customizability for all cookie, header, UI, request, and history behavior. Then, this would just be a plugin that anyone could install.


And then people install garbage extensions that break the browser and people think "Wow, this Firefox browser is so buggy and slow" and switch to Chrome. And then your extensions break with every single browser update because they are tampering with internal code.

Everyone is free to fork a browser and apply any changes they want. Allowing extensions to change anything at all essentially is the same as forking and merging your changes with upstream every update.


Somehow that doesn’t seem to happen with editors; we have a ton of them and they are very customizable.

I guess “with a reasonably stable hook api” was supposed to be implicit in my statement.


How is viewing the WARC after? Is it the same quality as archive.is/ or archive.org/ ?


It's a pain, like most of the WARC ecosystem. It's been several months since I dug in, so maybe it's had some spitshine the last little bit, but I usually end up using combinations of wpull, grab-site, and a smattering of other utilities to reliably capture a page/set of pages, and have had to make some quick hacks as well as manually merging in some PRs to get things to work with Python3. Once I have the WARC, I typically end up using warcat to extract the contents into a local directory and explore that way.

WARC as a format seems promising, but at least last I checked, open-source tooling to make it a pleasant and/or transparent experience is not really there, and worse, at least as of several months ago, doesn't really seem actively worked on. Definitely an area you'd expect to be further along.
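
For poking at a capture without any of that tooling, the container itself is simple enough to read by hand: each record is a `WARC/x.y` version line, a block of `Name: value` headers, a blank line, then exactly `Content-Length` bytes of payload. A minimal sketch in standard-library Python; the function name `iter_warc_records` is made up here, and it skips everything real tools handle (digests, revisit records, parsing the HTTP response inside the payload):

```python
import gzip


def iter_warc_records(path):
    """Yield (headers, body) for each record in a .warc or .warc.gz file.

    headers is a dict with lower-cased field names; body is raw bytes.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                return  # end of file
            if not line.strip():
                continue  # blank separator lines between records
            if not line.startswith(b"WARC/"):
                raise ValueError(f"unexpected line: {line[:40]!r}")
            headers = {}
            while True:
                line = fh.readline()
                if line in (b"\r\n", b"\n", b""):
                    break  # blank line ends the header block
                name, _, value = line.decode("utf-8", "replace").partition(":")
                headers[name.strip().lower()] = value.strip()
            length = int(headers["content-length"])
            yield headers, fh.read(length)
```

Filtering on `headers.get("warc-type") == "response"` and looking at `warc-target-uri` is usually enough to find the page you were after.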


Pretty good, depending on the tooling. I’m having good luck with https://webrecorder.io/ and their related open-source tools.


archive.org uses WARC if I remember correctly, and offers guides on creating and reading the format.


Safari used to (and still does) do this automatically, but in a limited way. In the browsing history view (Command Y), you can search visited pages by their content, and this is extremely useful. But there's no way(†) to tell Safari to display that saved content. If you revisit a URL in the history, Safari fetches it again, losing the original saved content.

(†): short of direct plist manipulation


Does it actually? I thought history was stored in a SQLite database and only kept the URL and page title.


This comment misses the point.

The point is to make a webpage that lasts. So people can link to it and get the page. That means making a maintainable webpage and a url that does not change.

It is great that you can archive every page you visit for yourself, but that is not the same as making a lasting web.

Let's make something that others can use too.


Better than a bookmark action would be a commandline option, similar to Firefox's -screenshot, which will work without starting X11. Something like -archive:warc


Does this also strip the megabytes of superfluous tracking JS? It's probably what'll be the bulk of the size on the modern web, and I don't feel any particular need to store it.

(I believe that for historical purposes, enough complaining about ads and tracking will survive that future historians can easily deduce the existence of this practice)


I use DEVONThink on my mac, which has web archiving, full-text search, and auto-categorization.


"Imagine having the last 30 years of web browsing history saved on your local machine."

I believe the name for that experience was/is "Microsoft Windows".


>Heck, with the cost of storage so low, recording every webpage you ever visit in searchable format is also very realistic.

I tend to do that, I also save a lot of scientific papers, ebooks and personal notes. I've found that doing so does not help me at all. The main problem I have is that when I need to look something up (an article, a book, a bit of info) I reach for google first, usually end up finding the answer and go to save it, only to find that I had already found the answer beforehand (and perhaps already made clarifying notes to go along with it) and then forgot about it.

This, and not dead links, is the fundamental problem with bookmarks for me. Not only bookmarks, it extends to my physical notes and pretty much everything I do. If I haven't actively worked on something for a couple of months, I forget all about it and when I come back to it I usually have to start from scratch until I (hopefully) refresh my memory. Some of it is also usually outdated information.

I think this is a big, unsolved problem and I'm not even sure how to go about starting to solve it. I can envision some form of AI-powered research assistant, but only in abstract terms. I can't envision how it would actually work to make my life better or easier. It would need to be something that would help blur the line between things I know and things that are on my computer somehow. If I think of my brain like it has RAM and cache, things I'm working on right now are in the cache and things I've worked on recently or work on a lot are in RAM, but what's for me lacking is a way to easily move knowledge from my brain-RAM to long term storage and then move that knowledge back into working memory faster than I can do so now. I'm not even talking about brain uploading or mind-machine interfaces, but just something that can remind me of things I already know but forgot about faster than I can do so by myself.

I am convinced that figuring out how to do this will lead to the next leap in technological development speed and efficiency. Not quite the singularity that transhumanists like to talk about, but a substantial advancement.


I have exactly the same problem.

What I've found is that I need to spend more time deciding what is important, and less time consuming frivolous information. That's hardly a technology problem.

For things I really don't want to forget, I'm using Anki [0], a Spaced Repetition System (SRS). Anki is supremely good at knowing when you're about to forget an item and prompting you to review it.

Spaced practice and retrieval practice, both of which are used in SRS, are two learning techniques for which there is ample evidence that they actually work [1].

You still need to decide what is worth remembering, but that's something technology can't help with, I think.

[0] https://apps.ankiweb.net/

[1] https://www.learningscientists.org/
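
To make "knowing when you're about to forget" a bit more concrete, here is a rough sketch of the classic SM-2 scheduling rules that this family of tools descends from. It is not Anki's actual scheduler, which differs in its details; the `Card` class and `review` function are names invented for this example, and leaving the ease untouched after a failed review is one reading of the original SM-2 description.

```python
from dataclasses import dataclass


@dataclass
class Card:
    repetitions: int = 0   # consecutive successful reviews
    interval: int = 0      # days until the next review
    ease: float = 2.5      # SM-2 "easiness factor"


def review(card: Card, quality: int) -> Card:
    """Apply one SM-2 style review; quality is 0 (blackout) to 5 (perfect)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality < 3:
        # Failed recall: restart the sequence, keep the ease as it was.
        return Card(repetitions=0, interval=1, ease=card.ease)
    repetitions = card.repetitions + 1
    if repetitions == 1:
        interval = 1
    elif repetitions == 2:
        interval = 6
    else:
        # Later intervals grow multiplicatively with the ease factor.
        interval = round(card.interval * card.ease)
    ease = card.ease + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    return Card(repetitions, interval, max(1.3, ease))
```

The point is the shape of the curve: each successful recall pushes the next review further out, and a lapse pulls it back to tomorrow.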


Yes so much this.

There are a few issues to consider:

- Any comprehensive archive of your activity is itself going to be a tremendously "interesting" resource for others -- advertisers, law enforcement, business adversaries, and the like. Baking in strong crypto and privacy protections from the start would be exceedingly strongly advised.

- That's also an excellent reason to have this outside the browser, by default, or otherwise sandboxed.

- Back when I was foolish enough to think that making suggestions to Browser Monopoly #1 was remotely useful, I pointed out that the ability to search within the set of pages I currently have open or have visited would be immensely useful. It's (generally) a smaller set than the entire Web, and comprises a set of at least putatively known, familiar, and/or vetted references. I may as well have been writing in Linear A.

- Context of references matters a lot to me. A reason I have a huge number of tabs open, in Firefox, using Tree-Style Tabs, is that the arrangement and relationships between tabs (and windows) is itself significant information. This is of course entirely lost in traditional bookmarks.

- A classification language for categorising documents would be useful. I've been looking at various of these, including the Library of Congress Subject Headings. A way of automatically mapping 1-6 high-probability subjects to a given reference would be good, as well as, of course, tools for mapping between these.

- I've an increasing difference of opinion with the Internet Archive over both the utility and ultimately advisability of saving Web content in precisely the format originally published. Often this is fragile and idiosyncratic. Upconverting to a standardised representation -- say, a strictly semantic, minimal-complexity HTML5, Markdown, or LaTeX, is often superior. Both have their place.

On that last, I've been continuing to play with the suggestion a few days ago for a simplified Washington Post article scrubber, and now have a suite of simple scripts which read both WashPo articles and the homepage, fetching links from the homepage for local viewing. These tend to reduce the total page size to about 3-5% of the original, are easier to read than the source, and are much more robust.

I'm reading HN at the moment from w3m (which means I've got vim as my comment editor, yay!), and have found that passing the source to pandoc and regenerating HTML from that (scrubbing some elements) is actually much preferable, for the homepage. Discussion pages are ... more difficult to process, and the default view in w3m is unpleasant, though vaguely usable.

Upshot: saving a WARC strictly for archival purposes is probably useful, but generating useful formats as noted above would be generally preferable in addition.
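
As a rough illustration of the "upconvert to minimal HTML" idea (this is not the pandoc-based scripts mentioned above, just a standalone approximation), a scrubber can be little more than an allow-list of structural tags plus a drop-list of elements whose whole content is discarded. The `DROP` and `KEEP` sets and the `scrub` function are assumptions chosen for this sketch:

```python
from html import escape
from html.parser import HTMLParser

# Elements dropped together with everything inside them.
DROP = {"script", "style", "iframe", "noscript", "svg"}
# Elements kept (without attributes, except href on links).
KEEP = {"h1", "h2", "h3", "p", "a", "ul", "ol", "li",
        "blockquote", "pre", "em", "strong"}


class Scrubber(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in KEEP:
            href = dict(attrs).get("href") if tag == "a" else None
            if href:
                self.out.append(f'<a href="{escape(href)}">')
            else:
                self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in KEEP:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))


def scrub(html_text: str) -> str:
    """Return a stripped-down HTML fragment with only KEEP elements."""
    parser = Scrubber()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Text outside the kept elements still passes through, so real pages would need site-specific rules on top to isolate the article body.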

With the increasing untenability of mainstream Web design and practices, a Rococo catastrophe of mainstream browsers, the emergence of lightweight and alternative browsers and user-agents (though many based on mainstream rendering engines), the tyranny of the minimum viable user attacking any level of online informational access beyond simple push-stream based consumption, and more, it seems that at the very least there's a strongly favourable environment to rethinking what the Web is and what access methods it should support. Peaks in technological complexity tend to lead to a recapitulation phase in which former, simpler, ideas are resurrected, returned to, and become the basis of further development.


This refers to the CIVICUS Monitor rankings. See the interactive world map: https://monitor.civicus.org/


"Tesla Cybertruck (pressurized edition) will be official truck of Mars"

https://twitter.com/elonmusk/status/1197627433970589696


Here's the timestamp of the presentation: https://www.youtube.com/watch?v=SwvDOdBHYBw&t=7m22s


Thank you!


It surprises me that intelligent, reasonable people like Andrew Ng and Sam Altman were willing to help develop and invest in the Chinese Communist Party's AI/ML capabilities helping to create an authoritarian surveillance state used to repress minorities (Uighurs, Tibetans, Falun Gong) on a scale not seen since Adolf Hitler.

The CCP and private companies are hand-in-glove. The risks of working with state-owned enterprises are clear. The risks of working with private companies are also great, with "party cells" officially embedded into at least half of private companies.

Fortunately for the free world, China faces massive structural issues to growth over the next 40 years, which the CCP is unable or unwilling to deal with. I highly recommend [1] and [2] to understand the future of the brutal Chinese dictatorship.

[1] https://www.nytimes.com/interactive/2019/01/17/world/asia/ch...

[2] https://www.youtube.com/watch?v=_AvNT3vyzr0

EDIT: Instead of down voting, please articulate why you disagree.


Did not downvote, I hate all views equally.

Perhaps it is the enormous cognitive dissonance. Western companies also take military funding and their surveillance systems were always better and more invasive.

The US is using statutes for national security interests to ban Chinese companies for contributing to alleged human rights abuses. Meanwhile we have documented human rights abuses in the West, and a US that does not shy away from covering a trade war and skewed commercial motives with anti-China propaganda.

US companies and agencies surveil outside the US. Capitalism-commercialization gave our rights to the likes of Zuckerberg and Amazon, who will turn your doorbell into a police camera.

The Godwin was also a tad unnecessary. Perhaps you are a bit caught up in the heat of the moment. The propaganda machine is definitely turning up the heat, so I do not blame you. Try to pose your argument without claiming the moral high ground of the "free world". It is never fortunate that an entire country and its people struggle.

Accusing Ng and Altman of helping build a fascist surveillance state is a nasty accusation, which I feel you are not allowed to make.

Remember when the US had concentration camps for Japanese-Americans? Countries can change even without pressure or backlash (the US did not face any). Can you think of a modern group of people in the US being secluded from participation, facing state racism, and being locked away in ghettos, with no hope of social movement? Would China be justified in boycotting or sanctioning US companies for contributing to ICE, Guantanamo Bay, or police violence, or Flint's water, or Trump's racism?

Freedom of religion gave the US Scientology and TV pastors and schools preaching creationism. It gave Europe extremist Salafism and terrorist attacks and people using the Bible to go after immigrants. You have to see China in that context: they see religion as a dangerous memetic virus to indoctrinate youthful people into believing something not of their own choice. Countries have a right to make their own laws and their own tactics for improvement and combating extremism.

Edit: you may also receive downvotes from people who hate to hear negative stuff on China, or are paid 50 cents to do so.


Comparing Uighur concentration camps to Nazi Germany is valid and accurate, not a frivolous example of Godwin's law.

Your comment has a lot of whataboutism that isn't relevant to the discussion. I'll mention that US presidents have formally apologized several times for Japanese-American internment (which was 75 years ago), and that I would encourage anybody (and any government) to boycott and sanction companies contributing to human rights abuse.


This is great news. After a decade of stagnation [1], an exploit that in theory allows Linux (and Android) to be ported to iPhone, iPads, Apple Watches, and I believe, Apple TV.

[1] https://en.wikipedia.org/wiki/OpeniBoot


> After a decade of stagnation, an exploit that in theory allows Linux (and Android) to be ported to iPhone, iPads, Apple Watches, and I believe, Apple TV.

There was some discussion about this on r/jailbreak, and it comes down to whether the community is willing to reverse-engineer and write drivers for the various hardware:

https://www.reddit.com/r/jailbreak/comments/d9yyit/release_i...


Why wouldn't you just buy a rootable Android device in the first place? Cheaper, better phones and far less hassle to install your own OS on.


How are you qualifying better?

I had three Nexus Android phones go sideways on me in their first year, over a span of 4 years.

I have had two iPhones since 2014, and only because I dropped the first one.

If I have to spend $300-$400 on replacements every couple years, I’ll go with $800 every 4-5


Better value for money, bigger battery, better screen (Samsung), more memory (most devices), more storage, expandable storage, standard interface (all devices), no legal liabilities related to 'hacking' the device, better camera (Samsung, Pixel), more choice in form factor, need I go on?

What did you do with those phones that they 'went sideways'? I have a number of Motorola Defy phones which are between 8 and 9 years old, they still work fine. My daughter left one of those Defy's in her pocket when she put her jeans in the washing machine, it went through a full washing cycle and still worked except for the ear piece which I replaced at a total cost of $0.50 in parts (I bought 10 for $5 including shipping, anyone need a Defy earpiece?). I only ended up buying a newer device (Xiaomi Redmi Note 5 with many of the mentioned advantages) because the Swedish electronic ID supplier stopped supporting Android 4.4. I also have an Ainol Novo Advanced 8 Android tablet from 2010, still works fine albeit with a somewhat limited battery time.

Apple makes slick devices but the slickness comes with a downside: they are among the most vulnerable devices out there, usually ending up in the bottom tier when it comes to ability to survive rough treatment [1]. Repairs end up being extremely expensive due to the enforced single supplier rule - only Apple is 'allowed' to repair the device, iOS contains checks for 'unauthorised' repairs. For the price of a single screen repair on an iPhone X ($279) I can buy a new phone for myself and for my daughter (who has a Xiaomi Redmi 4X), 'other' repairs cost $549 which is enough for new devices for the whole family. In short, Apple is the more expensive choice. If you think they're worth their price you should buy them but that does not negate the fact that you're paying more for a more fragile device with limited repair options.

[1] https://macworld.idg.se/2.1038/1.692054/iphone-x-en-tickande...


> more memory

This is a largely useless comparison without qualification.


You're forgetting the context, being the selection of a device to run Linux (or anything else) on. Also, more memory is more memory, no qualification needed. It doesn't matter why the device has more memory, just as long as it does.


Yeah and it needs "more memory" because it runs Java. Modern iPhones have way more than enough memory to run one app at a time and a few in the background.


Who cares why the device 'needs more memory' (which it doesn't by the way, the excessive amounts of memory in recent Android devices is more of a marketing ploy than a necessity) when the goal is to select a device to run an alternative operating system on? That is, after all, the context of my reply.


But Android is already kinda Linux. You can even get shell and everything.


It is, just add `termux` and install whatever packages you want. That won't help those who want to install Sailfish or Ubuntu or their very own mobile Linux creation or whatever other option they might contemplate. For those applications it makes sense to get a device which is open to this type of tinkering.


The last Nexus phone was made in 2015, and they were known for being bad (bootloop). There are tons of really solid Android phones that work for years. Even the cheap Chinese ones from unknown brands are solid these days. I also replace them only because I broke the screen.

With Android you have a choice of the specs - bigger battery, better camera, tough build, fast charging. With the iPhone you get an average meh for not so average price.


Eh - anecdotal but my pixel just turned off and never turned on again about a year and a half into use (without water damage).

I still have my working iPod touch from like 2011.


Anecdotal but my MyTouch 3G (2009) and LG G2 (2013) still work just fine. Both are rooted with custom ROMs.


More anecdotes, still have my Nexus 4 kicking around. Just switched from my Pixel 1 after about 3 years since I cracked the screen.


Glad you found something that works for you.


To reuse/repurpose old devices you still have laying around. Which is much of the reason for Linux support on most unusual architectures, really.


Perhaps you wouldn't need to buy another separate Android device, since an iPhone dual-booting Android could make it the best supported Android device since the Pixel.


Nothing comes close to Apple's ARM CPUs.


Most of the performance of which is irrelevant, because the apps have to run on the lowest common denominator iPhone.

This is not like Desktop PCs where a game might run at less than 60fps at a certain monitor resolution; iPhone software is more like the console market, so it's like saying you have a new PS4 model with 2x the speed of last year's model. You're not likely going to notice the difference except at the edges, like apps launching slightly faster.

Apple's fan base, prior to the A6, used to say 'specs don't matter', but once Apple got the lead in CPU speed, now specs matter. I think for most people the former is probably true.

A faster phone doesn't make your Facebook, Snapchat, iMessage, Instagram, etc experience much better and let's be honest, people are spending the majority of time in those apps.


Performance isn't the only area where Apple chips are superior. Power efficiency is another, and that's a lot more noticeable to your general consumer.


You’ll notice when loading web pages.


Why wouldn't you want to run doom/linux on all the devices?


For fun!


I can't wait to see an iPhone X natively running DOOM.



Facebook's realistic VR avatar research: https://www.youtube.com/watch?v=RCB_mfGmh9w&t=1h47m12s


Facebook Horizon was announced at Oculus Connect Day 1 keynote yesterday.

I highly recommend people check out Michael Abrash's keynote discussing some jaw-dropping "social teleportation" research: https://www.youtube.com/watch?v=RCB_mfGmh9w&t=1h47m12s


They are building something that doesn't even need to exist. We can have fully remote work with today's tools that's _way_ better than the average office environment, certainly way more productive per hour spent. You don't need a roomful of GPUs to have it. All that's in the way is organizational inertia and managers who can't justify their own existence without butts in the chairs, herded into an "open" office like cattle.

Source: I work remotely.


Very cool, but inherently troubling that it's done on FB's money, with all the strings attached.


I haven't seen anyone mention it here, but this concept is known as a "terrarium".

There have been terrariums that have been sealed for 50+ years and are still thriving.


Yeah, here's a one-line udev rule to log USB removal and insertion into the journal.

> $ echo 'SUBSYSTEMS=="block", RUN+="/usr/bin/logger --tag=block-device-history -- %E{ACTION} | %E{DEVNAME} | %E{ID_MODEL_ID} | $attr{serial}"' | sudo tee /etc/udev/rules.d/10-block-device-history.rules

Activate the new rule by reloading and retriggering:

> $ sudo udevadm control --reload-rules && sudo udevadm trigger

Then in another terminal run:

> $ sudo journalctl -f

...Insert a USB drive and see information about it printed.

journalctl supports querying with time intervals (eg, journalctl --since "2018-01-10" --until "2019-08-01 23:59").


Slightly tweaked the above, in case anyone is interested. Copy the following to /etc/udev/rules.d/10-block-device-history.rules (udev only reads files ending in .rules):

> SUBSYSTEM=="block",ENV{DEVTYPE}=="disk", RUN+="/usr/bin/logger --tag=block-device-history -- '%E{ACTION} | %E{DEVNAME} | %E{ID_SERIAL}'"

Run

> $ sudo udevadm control --reload-rules && sudo udevadm trigger

> $ sudo journalctl -f SYSLOG_IDENTIFIER=block-device-history

...Then re-inserting a USB device produces output similar to the below:

> Jan 01 12:00:00 hostname block-device-history[12345]: add | /dev/sdY | TOSHIBA_TOSHIBA_USB_DRV_012345678900FF00-0:0

To customize the printed variables, have a look at:

> sudo udevadm info /dev/sdY
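
If you want to post-process that history rather than eyeball it, the pipe-separated message is easy to split back into fields. A small sketch, assuming the three-field `action | devname | serial` layout produced by the rule above; the `parse_event` helper and the regular expression are made up for this example and would need adjusting if you log different variables:

```python
import re

# Matches the message part written by the logger rule, e.g.
# "block-device-history[12345]: add | /dev/sdY | SOME_SERIAL"
LINE = re.compile(
    r"block-device-history(?:\[\d+\])?:\s*"
    r"(?P<action>\S+)\s*\|\s*(?P<devname>\S+)\s*\|\s*(?P<serial>.*)$"
)


def parse_event(line: str):
    """Return a dict with action, devname, serial, or None if no match."""
    match = LINE.search(line.rstrip("\n"))
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


sample = (
    "Jan 01 12:00:00 hostname block-device-history[12345]: "
    "add | /dev/sdY | TOSHIBA_TOSHIBA_USB_DRV_012345678900FF00-0:0"
)
print(parse_event(sample))
```

Feeding it the lines from the journalctl command above, one per call, gives a list of events you can group by serial or action.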

