
As a practicing scientist, I firmly believe the world would be much better off if we simply published version-controlled Jupyter notebooks on a free site, such as GitHub or ArXiv.


> As a practicing scientist

> version-controlled Jupyter notebooks

That's awfully field specific. It probably wouldn't work for most of STEM. Even for ML I shudder to imagine trying to make sense of the inevitable monstrosities. Writing a paper is part of the thinking process. It forces the author to sit down and work through things in an orderly manner and they're still often difficult to read.

I'm definitely in favor of all papers being accompanied by working source code when relevant though.


> It forces the author to sit down and work through things in an orderly manner and they're still often difficult to read.

As a former academic: Papers are difficult to read primarily because the academic community does not value making them easier to read - no other reason. You may hear things like "papers should be written for other experts", but even that doesn't hold up to scrutiny.

They typically spend 99% of their research time on the actual research, and less than 1% on writing the paper. They definitely can afford the time and energy to make the papers easier to read, but game theory holds sway: Why should a particular researcher use his/her time to do it, when his/her peers will not appreciate it? It's purely an internal, cultural problem. There are no external constraints leading to this.

I've seen referees send papers back saying they contained too much explanation, and suggest leaving out most of the details - just include the big picture methodology and show the results. I can guarantee most who will read the paper would not be able to reproduce those details, if they ever want to. Likewise, I've found papers where I couldn't reproduce the results, because the results were wrong - but since including a derivation of your final expressions is discouraged, no referee caught the errors.


> They typically spend 99% of their research time on the actual research, and less than 1%

This isn’t anywhere close to my experience. 2-3 days of writing per year seems like a wild underestimate for any academic I know. I’d maybe believe only 20% of time spent writing, but for some folks even that’s probably way too low


Are they writing papers or grant applications? I'm speaking specifically about papers.

And fine, even if 1% is an exaggeration, I doubt they spend more than 5% (about 18 days of the year).


As a former academic, I mostly disagree. In my experience, papers are difficult to read because they are as concise as possible, striving to refer to previous work for anything that's not original, and only elaborating on original things. This is done to make them as quick as possible to read by experts, which is pretty important given the immense volume of papers that appear in many fields.

I do think papers nowadays need to include a link to a zip file (or whatever other format - but it should be a boring old format unlikely to change or be abandoned, and not proprietary either) containing all data, code, and so on. This data is necessary to verify the paper's results, but it is not the results themselves.


> I mostly disagree. In my experience, papers are difficult to read because they are as concise as possible, striving to refer to previous work for anything that's not original, and only elaborating on original things. This is done to make them as quick as possible to read by experts, which is pretty important given the immense volume of papers that appear in many fields.

This is consistent with what I said: Papers are difficult to read by choice. Yes, they strive to make them as concise as possible, which translates to making them harder to read.

Where I would disagree is the claim that it is done to make papers as quick as possible for experts to read. In my field, the experts would skim papers quickly to get an idea, but if they then homed in on a paper to actually extract the meat of it so they could use it in their own work, it took a lot of time, and it was a pain.

I've heard from math professors that it takes about a day to read and digest one page of a journal article in their field.

Also disagree on it merely being a matter of consulting references. My work was theoretical/computational. It was common to see the final equations without any derivation, and many experts would not be able to reproduce them. There are lots of tricks that go into the derivation, but they are not provided, under the pretext that any expert should be able to solve the equations and derive them.

And in the age of digital media, it's quite trivial to write a paper the way you suggest and then put all the extra details in appendices. I guarantee that they will be read by most people who want to read the paper in detail, as opposed to merely skimming it.


Explorable explanations[1] is what you want, not Jupyter notebooks.

Explorables have all the same requirements - you still have to carefully think them through and prepare them for reading, with clarity of exposition - but they also have interactions that ease the introduction of concepts through a hands-on approach, rather than forcing readers to reverse-engineer the writer's thought process by running the examples of a non-interactive paper in their short-term memory.

[1] https://explorabl.es/


I'm pretty sure all of us have the experience of cramming most of the paper into the two weeks before the deadline. Expecting someone to firm up an interactive presentation under that kind of pressure, without the risk of finalizing something half-assed, is ... too much to ask of mere mortals.


So they would suffer the same decay all other web resources do? Broken links, no-longer-maintained tools, opinionated stack choices - or, if a set of stacks and data formats gets standardized, a loss of the flexibility that text + mathematical notation do have.

Nobody is preventing scientists from publishing code and data in addition to & before the paper, which imho itself should be as conservative in format as possible to provide the most universal baseline for understanding, reproducibility, and reliability.


> So they would suffer the same decay all other web resources do? Broken links, no-longer-maintained tools, opinionated stack choices - or, if a set of stacks and data formats gets standardized, a loss of the flexibility that text + mathematical notation do have.

Tools like Zenodo [1] are meant to solve this exact problem, and ensure these kinds of data don't suffer web decay.

[1] https://zenodo.org/
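
For what it's worth, archiving there can even be scripted against Zenodo's REST API. A rough Python sketch - the token and filename are placeholders, and the exact flow is documented at https://developers.zenodo.org/ :

  import requests

  # Placeholder token; generate a real one in your Zenodo account settings.
  TOKEN = {"access_token": "your-zenodo-token"}
  BASE = "https://zenodo.org/api/deposit/depositions"

  # 1. Create an empty deposition.
  dep = requests.post(BASE, params=TOKEN, json={})
  dep.raise_for_status()
  bucket = dep.json()["links"]["bucket"]

  # 2. Upload the artifact archive into the deposition's file bucket.
  with open("paper_artifacts.zip", "rb") as fp:
      r = requests.put(f"{bucket}/paper_artifacts.zip", data=fp, params=TOKEN)
      r.raise_for_status()

  # 3. Once metadata is filled in, publishing the deposition mints a DOI
  #    (POST to the deposition's actions/publish link).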


I think what you really want is a FOSS Mathematica.

It's sort of code, but more convenient for just getting work done. You can pass around the whole thing (data + code). No need to learn software development skills or set up a development environment to replicate results; it runs everywhere. Plenty of power to do advanced things. Already a standard in research.


Do you do so? If not, why not?


I'm not an academic, but a physicist working in R&D at a company. So my "papers" are only for internal consumption, and not earth shattering anyway. My colleagues and I are using Jupyter extensively.

My observation, from seeing papers that have been written in Jupyter, and observing how people work, is that Jupyter will first gain traction in disciplines that are already computation-heavy, and where open software is closer to the front end of the data pipeline.

For instance in my case, I develop measurement instruments, so everything I make is computerized, by me or my colleagues. While "raw data" may be in the form of things like voltages, they are almost immediately turned into a Python-friendly data format by code that I wrote myself. So I'm up to my armpits in data and code just to get my experiments even barely working in the first place. I have a computer with coding tools literally at every bench in the lab. Jupyter is my lab notebook, and often my "report" is just the same notebook, dressed up with some readable commentary.
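
To make that concrete, the glue layer is usually just a few lines; a toy sketch (the file names and sampling rate are invented):

  import numpy as np
  import pandas as pd

  # Hypothetical raw ADC dump from a bench instrument: a flat binary file
  # of float32 voltage samples. File names and the 1 MS/s rate are made up.
  raw = np.fromfile("channel0.bin", dtype=np.float32)

  df = pd.DataFrame({
      "t_s": np.arange(raw.size) / 1e6,  # sample index -> seconds at 1 MS/s
      "volts": raw,
  })
  df.to_parquet("run_0042.parquet")  # later, in Jupyter: pd.read_parquet(...)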

Now, contrast that with somebody like a synthetic bench chemist. The data that they get may be in computer readable form, but they rarely do any coding during the course of a project. For analysis, they're satisfied with the computations rolled into their instrument software, or Excel. And a fair amount of their analysis is in the form of explaining their way through an argument that connects data from disparate measurement techniques, using pictures and graphs. They don't program. The ones who can program have gone into software development. The ones who are using Jupyter are motivated to use it, as an end unto itself. Bringing that stuff together in Jupyter wouldn't help much. Many of their journals do require submission of raw data.

This is similar to questions about why so many people use Excel. I think you have to actually immerse yourself in the specific work environment and observe or even experience what people are experiencing, what they're actually studying, how they think, and so forth. There's a certain Chesterton's Fence aspect to discussions that start with the premise that some widespread activity is hopelessly broken beyond repair and must be immediately abolished.


I do share code that way, but the traditional ivory tower standards by which I am judged require "refereed journal publications" in high impact factor traditional journals. I'm trying to fight back against that, largely unsuccessfully.

What would help me is to have the old geezers consider GitHub issues, PRs, and commits as a type of citation and to have a better way of tracking when my code gets used by others that is more detailed than forks.

I also think citations of your work that find errors or correct things should count as a negative citation. Because otherwise you are incentivized to publish something early and wrong. Thus the references at the end of the paper should be split into two sections: stuff that was right and stuff that was wrong.


> I also think citations of your work that find errors or correct things should count as a negative citation.

Strong disagree. Given how much influence colleagues can have over one another's career prospects, how petty academic disagreements can get, admin focus on metrics like citation count, and how it's easier to prove someone else wrong than to do your own original work (both have value, one is just easier), it would end up with people ONLY publishing 'negative citations' (or at least the proportion would skyrocket). I think that would be bad for science and also REALLY bad for the public's ability to value and understand science.

> Thus the references at the end of the paper should be split into two sections: stuff that was right and stuff that was wrong.

This, on the other hand, is brilliant and I love it and want to reform all the citation styles to accommodate it.


Which is basically the Reddit up- & down-vote versus the HN upvote-only story, just on a broader scale, isn't it?


Superficially yes, but in actuality it would be very different due to the context surrounding academic papers vs. Reddit.

Organizationally speaking, Reddit is a dumpster fire; check out the 'search' function. (I'm just speaking from a taxonomical/categorization perspective; I can't speak to their dev practices.)

Academic papers aren't. (They're a dumpster fire in their own ways: The replication crisis and the lack of publishing negative results come to mind, but damn if they aren't all organized!)

There are two key differences:

1.) Academic papers have other supporting metadata that could combine with the more in-depth citation information to offer clear improvements to the discovery process. Imagine being able to click on a MeSH term and then see, in order, what every paper published on that topic in the past year recommends you read. I also think improving citation information would do a lot to make research more accessible for students.

2.) Reddit's system lets anybody with an account upvote or downvote. Given you don't even need an email address to make a Reddit account, there's functionally zero quality control for expressing an opinion. For academic publications, there is a quality control process (albeit an imperfect one). If only 5 people in the world understand a given topic, it's really helpful to be able to see THEIR votes: If they all 'downvote' a paper, that would suggest it's wrong.


You make a good point, and maybe there is a mathematical solution, like netting out mutually negative citations.
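
For instance, something like this toy rule (paper names invented): a correcting citation only counts against you if it isn't part of a mutual feud.

  from collections import defaultdict

  # Signed citation edges: (citing paper, cited paper, sign).
  # sign = +1 for a supporting citation, -1 for a correcting one.
  citations = [
      ("paperA", "paperB", -1),
      ("paperB", "paperA", -1),  # mutual negatives look like a feud
      ("paperC", "paperB", +1),
  ]

  def net_scores(citations):
      edges = {(src, dst): sign for src, dst, sign in citations}
      scores = defaultdict(int)
      for (src, dst), sign in edges.items():
          # Net out mutually negative pairs so feuding authors
          # can't tank each other's scores.
          if sign < 0 and edges.get((dst, src), 0) < 0:
              continue
          scores[dst] += sign
      return dict(scores)

  print(net_scores(citations))  # {'paperB': 1}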


Maybe! Probably not that solution since it would penalize academics who are having a long-standing but intellectually productive disagreement.

I think it'd be hard to math out right now since the skill doesn't exist in the departments who'd be doing the work, but in 20 years who knows?


> the references at the end of the paper should be split into two sections: stuff that was right and stuff that was wrong

I've seen stuff like this said before but I don't think it would work. Most citations are mixed in my experience. A few objections, a bunch of stuff you aren't commenting on, and some things you're building on. Or you agree with the raw data but completely disagree with the interpretation. Others are topical - see <work> for more information about <background>. Probably more patterns I'm not thinking of.


Yes. We should record all of this, and turn them into easily browsable graphs/hypertext to easily assemble sets of papers to read/look into. At the very least things like 'background reading', 'further reading', 'supporting evidence' and 'addressed arguments' would be useful.

'We' meaning the librarians and archivists. You guys actually researching have more than enough to do.
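
As a toy sketch of what 'browsable' might look like (paper names and labels invented), using networkx:

  import networkx as nx

  # A typed citation graph: edges carry the *kind* of citation
  # instead of a bare "cites" relation. All names are invented.
  G = nx.MultiDiGraph()
  G.add_edge("new_paper", "survey_2019", kind="background reading")
  G.add_edge("new_paper", "method_2021", kind="builds on")
  G.add_edge("new_paper", "rival_2020", kind="addressed argument")
  G.add_edge("new_paper", "dataset_2018", kind="supporting evidence")

  # Assemble a reading list for one purpose, e.g. what a newcomer reads first:
  background = [cited for _, cited, data in G.out_edges("new_paper", data=True)
                if data["kind"] == "background reading"]
  print(background)  # ['survey_2019']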


Actually I think that's an intriguing idea for how to improve citations. Instead of a single <work> citation, have multiple <work, subset> citations that include a region of text as well as basic categorization of how the citation is being used in that instance.
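
Concretely, each citation could become a small structured record rather than a bare reference; a sketch with invented field names:

  from dataclasses import dataclass

  @dataclass
  class Citation:
      work: str      # DOI or other stable identifier
      span: str      # the region of text or result being cited
      category: str  # e.g. "objection", "builds-on", "topical"

  c = Citation(
      work="10.1000/example-doi",  # placeholder DOI
      span="Sec. 3.2, Eq. (7)",
      category="builds-on",
  )
  print(c)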

I'm not sure if it would prove feasible in practice. It seems like it would aid the writing process in some cases by helping the author keep track of details. But in other cases maintaining all that metadata would become too much of a burden while writing, so it would get put off, and then it would all fall apart.

Very interesting to think about!


I was imagining a post-writing process akin to assigning a paper its DOI or a book its cataloguing info. Citations as they are can be done by researchers without impacting the process because they're binary: Something either is cited or it isn't; it either contributed to the creation of the research or it didn't. This richer scheme probably couldn't be done by the researchers themselves, but you identified why: The citations are data, but this would be metadata.

Definitely don't want to encourage papers to take even LONGER.


> What would help me is to have the old geezers consider GitHub issues, PRs, and commits as a type of citation

As a geezer myself, I am imagining what this type of request would look like less than 5 years from now.

"What would help me is to have the old geezers consider Tiktok videos and replies as a type of citation"

i.e. if you open this particular can of worms for a very restricted subset of users (not just programmers, but specifically programmers who also happen to use GitHub), you have to open it to everyone else. I am sure plenty of YouTube research would qualify as "citation" if you start counting GitHub commits.


So what?

Papers qua papers aren't the goal. The idea is to advance our collective understanding of a field. Papers are certainly a means to that end, but other things, like code, videos, and blog posts, can advance it too, even if they don't fit into the "6000 words and 6 figures" box.

I get that citations and citation metrics feel objective, but they emphatically aren't great measures of research/researcher "quality".


A good YouTube video can easily represent more work than a bad (published) paper, so why not.


TikTok videos are primarily used for entertainment, rather unlike Jupyter notebooks and source code repositories. Surely you have a more serious objection.


You don't read git repos for entertainment?!


For entertainment, I tend to read about things outside my field — things like In the Pipeline, where I learn about FOOF and chlorine trifluoride and freaky molecules like jawsamycin (insert shark theme here). I also watch Chemical Safety and Hazard Investigation Board videos on YouTube, like that time a refrigeration accident at a poultry plant caused hydraulic shock and released a massive cloud of ammonia, and ~150 contractors hanging out across the river working on Deepwater Horizon cleanup measures got sent to the hospital.



