csaid81's comments | Hacker News

Yes, but not only did they improve the memory of mouse models of Alzheimer's, they also improved the memory of older wild-type mice, which seems impressive to me. https://www.nature.com/articles/s41586-025-09335-x/figures/1...


I know a lot of legitimate research supports various versions of the amyloid hypothesis, but I don't buy that these likely fraudsters had minimal impact.

You said that Lesné was "not that highly cited". But his main fraudulent paper was cited 2,300 times, making it the fifth most highly cited Alzheimer's paper since 2006! [1]

Berislav Zlokovic's likely fraudulent papers were cited 11,500 times! [2]

It's hard to imagine these highly cited papers didn't redirect at least some scientists to do pointless followup studies. Of course, in the counterfactual world those scientists might still have been doing pointless studies, but we'll never know...

[1] https://www.science.org/content/article/potential-fabricatio... [2] https://www.science.org/content/article/misconduct-concerns-...


Thank you for bringing Berislav Zlokovic to my attention. I will bring him up in the office tomorrow. I have personally never heard of him. Neurodegeneration is a really, really large field.

Okay, citation count was a metric I should not have used and also I was just straight-up mistaken about his citation count on that paper. To be honest, citation count is also a lazy argument on my part. I like to think it matters when I think the paper is actually good and that it doesn't matter when I think the paper is bad. A lot of citations get racked up by medical reviews, which are often more highly cited than original work. Keep this in mind when you look at Berislav Zlokovic's "top" papers.

I genuinely do not think Lesne's work is that influential, though. First, if you look up "amyloid beta oligomer" on grantome (https://grantome.com/search?q=amyloid+beta+oligomer) and then look up something like "Alzheimer's amyloid" (to make sure you are not getting amyloids that are not associated with Alzheimer's, since I don't know what other diseases amyloid beta oligomers could possibly be associated with), you will see that amyloid beta oligomer research is actually not well funded at all compared to everything else in the field. Also, how many papers that cite his work actually work on the amyloid beta 56 oligomer in particular (not rhetorical, I haven't checked)? Maybe this would make for an interesting project for a high schooler or undergrad interested in web scraping and that kind of data analysis. Probably a question for economists or quantitative sociologists. Now is the time to mention a bit more about what trying to replicate his work would have looked like.

Lots of stuff is difficult to replicate not because people are committing fraud, but because people in academia actually "move fast and break things," i.e., write bad protocols and don't document them well because of pressure to publish quickly. Most people doing the work are also graduate students, who are still students. Wet lab work is very, very, very finicky (this is the reason I have gone into more computational projects; I don't possess the level of finesse or patience required). So some replication problems arise because things are poorly documented and because people don't follow protocols perfectly either. But regardless of the causes, here are some recent experiences that describe what the process would have looked like --

Someone in our lab recently tried to replicate a protocol for generating recombinant (not brain-extracted, but made from purified protein) amyloid fibers that look like disease-associated fibers. She had previously been one of the only people in our lab who managed to replicate another protocol from the same lab (she subsequently taught everyone else in our lab). She was not able to replicate this new one. It took her maybe a month, while she was also working on other things, to decide she was not going to be able to replicate it. There were some other tells that made us think they were a little lazy, maybe got lucky a few times, and published in an easy journal because other people had published far better work on the same topic, so they wanted to wrap up their work and move on while still publishing to keep their grants. Similarly, someone else in our lab went through a two-month-long process where someone who claimed to be able to make some kind of oligomer sent us stuff that turned out not to be an oligomer. First, they blamed the shipping and tried again. Well, next round, same thing. Next, they blamed us. We tried again, same thing. At that point, someone gave them an earful, and now we are kind of sketched out by them. There was some work expended, but again, we work on multiple things at once, since most things don't work. And in this case, the collaborators sent the samples.

So, to summarize, I highly doubt anyone was trying to replicate their results for years, that just isn't how science works. And I don't think amyloid beta oligomer research got that much public funding compared to other things. You do multiple, different experiments every week just to see what sticks. I'm sure plenty of people lost an experiment slot for a few weeks, though. Extremely annoying and part of the general demoralizing slog, but it's not the reason we have no cure.

Why might the Lesne paper have been highly cited? Because oligomers would be important drug targets if they polymerize into amyloid fibers and can be specifically targeted (they are less stable than amyloid fibers), and lots of people working on them more broadly than just this one species might have grabbed this citation because it was a Nature paper with a lot of marketing. From there, citation propagation kept it going. I try to only cite things that I have read now, but part of that is probably because my undergrad advisor had the eyes and memory of a hawk and would really hammer everyone on this. I notice bad examples of citation propagation semi-regularly.

Some more context for what oligomers really are and why they are so difficult to replicate if they exist -- a fiber consists of many, many proteins stacked. Well, how does the fiber begin? Presumably, you don't go from 0 to 100 units perfectly stacked; you go in small increments via pre-fibrillar intermediates. Well, that means you're describing a very transient species, so good luck extracting it from brains or making it recombinantly. Oh, and many amyloid-forming proteins are "intrinsically disordered," meaning they have no "native" structure, and they might not be that structured when you have only a few of them stuck together either.


He could be prosecuted under current fraud laws, but this hardly ever happens.

I wrote a blog post on how to make this easier, including a new criminal statute specifically tailored for scientific fraud. https://news.ycombinator.com/item?id=41672599



Blog post author here. The paper was the 4th most cited paper in Alzheimer's research since 2006. So I feel reasonably confident that if it had never been written, some researchers at the margin would have chosen to work on other hypotheses instead, and perhaps those other avenues would have been more fruitful.

How much time could have been saved towards an effective treatment? It could be as high as a decade, but of course more likely it was zero years. I averaged it out to 1 year.

Now suppose you think that 1 year is orders of magnitude too high, and that in expectation it averages out to a 1 day delay. Even then, I estimate 100,000 QALYs would be lost, making this a tragically high impact case of misconduct.
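
To make the arithmetic concrete, here is a rough back-of-the-envelope version; the patient count and per-patient benefit below are illustrative assumptions on my part, not figures from the post:

    # Back-of-the-envelope sketch of the QALY claim. The patient count and
    # per-patient benefit are assumptions for illustration only.
    patients = 50_000_000        # people living with Alzheimer's/dementia worldwide (assumed)
    qaly_gain_per_year = 0.75    # QALYs an effective treatment might add per patient-year (assumed)

    annual_benefit = patients * qaly_gain_per_year  # QALYs gained per year the treatment exists
    one_day_delay = annual_benefit / 365            # QALYs forfeited by a one-day delay
    print(f"{one_day_delay:,.0f} QALYs lost per day of delay")  # ~100,000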

--- Final point: Nobody doubts that science is error correcting. The point is that the errors are corrected far too slowly and many never get corrected at all. It's incredibly hard to develop good theories when you know that 30-50% of the results in your lit review are false.


> Nobody doubts that science is error correcting.

What is the precise, singular meaning (or implementation) of "science is error correcting"?

Is there one?


I think the answer is not "nope" but "probably nope". That video is perhaps overconfident in our ability to detect these storms.

From the article:

> One-third of major storms arrive unexpectedly, according to the SWPC's own 2010 analysis. And that's not just the small storms. According to a news article in Science, the SWPC might also be poor at identifying the characteristics of severe storms, since they are so rare.


Sometimes you get close enough to zero that for any reasonable application, the answer can be rounded to zero.


It's great that the Moore Foundation provided funding for open source data science tools in Python. Good for them!

That being said, I do wonder if numpy is the most appropriate recipient. In my experience with data science, the tool that would benefit the most is not numpy, but pandas. While data scientists rarely use numpy directly, every data scientist I know who uses pandas says they are constantly having to google how to do things due to a somewhat confusing and inconsistent API. I use pandas at work every day and I'm always looking stuff up, particularly when it comes to confusing multi-indexes. In contrast, I rarely use R's dplyr at work, but the API is so natural that I hardly ever need to look things up. I would love if pandas could make a full-throated commitment to a more dplyr-like API.
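
To give a concrete (toy) illustration of the kind of lookup I mean -- even a simple grouped aggregation hands you back a multi-index that you then have to flatten:

    import pandas as pd

    df = pd.DataFrame({
        "city": ["NYC", "NYC", "SF", "SF"],
        "year": [2015, 2016, 2015, 2016],
        "sales": [10, 12, 7, 9],
    })

    # A grouped aggregation returns a (multi-)indexed result...
    agg = df.groupby(["city", "year"])["sales"].agg(["mean", "count"])

    # ...which usually has to be flattened back into ordinary columns before the next step.
    flat = agg.reset_index()
    print(flat)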

Nothing against pandas -- I know the devs are selflessly working very hard. It's just that it seems there is more bang for the buck there.


If you look at the design documents for pandas 2, there is a good illustration of how a lot of the pain points in pandas 1 spring from numpy ( https://pandas-dev.github.io/pandas2/internal-architecture.h...). I think any significant development effort on numpy would probably greatly benefit both libraries.

Will have to check out dplyr :) I'd love to see how they handle the magic that is multi-indexes.


In many cases, the use of multi-indexes in Pandas is (I think) a result of culture/style, or of the expectation that the cells of a dataframe should have scalar values. If that were to change and it became common to have nested dataframes, the use of multi-indexes would diminish.

The tooling to support nested dataframes (and maybe even lists) is simple to create; it could even be a third-party library. I find that while multi-indices may be an accurate conceptual way of thinking about certain data, in practice they tend to be more inconvenient than nesting the dataframes. In all cases I have encountered, only a single level of nesting is required.
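
A minimal sketch of what I mean by nesting, using nothing more than a dict of sub-dataframes (a real implementation could of course live in a third-party library):

    import pandas as pd

    df = pd.DataFrame({"group": ["a", "a", "b", "b"], "x": [1, 2, 3, 4]})

    # Instead of a two-level index, keep one sub-dataframe per group.
    nested = {key: sub.reset_index(drop=True) for key, sub in df.groupby("group")}

    # Each "cell" is now a whole dataframe, and a single level of nesting suffices.
    print(nested["a"])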


If you're excited about non-scalar values in DataFrames, you should take a look at xarray (http://xarray.pydata.org), which implements a very similar idea in its Dataset class.


Thanks for the link! Good stuff.

By the way, dplyr doesn't use multi-indexes. I actually think this is one of the reasons (although not the biggest reason) dplyr is easier to use.


The funding source used by NumPy here is equally available to pandas developers. If someone with the experience to deliver wrote a good proposal I think there's a decent chance that it would be funded.


But... pandas uses numpy under the hood. If numpy is better and can offload some of the core functionality from pandas, that will also benefit pandas, right?


Right, but I'm talking about the pandas API. Stuff like how easy it is to remember exactly how to do aggregations, transformations, etc.


Here's some specific examples:

https://twitter.com/Chris_Said/status/715249097326768128

https://twitter.com/Chris_Said/status/861244045535756290

I could be wrong, but I'm pretty sure that these would be solved by pandas API design improvements, not with numpy improvements under the hood. (NB: As always, a big thanks to the developers for all their work.)
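
For what it's worth, here's a rough sketch of the more fluent, dplyr-style chaining I have in mind (just an illustration of the style, not a proposal for the actual API):

    import pandas as pd

    df = pd.DataFrame({
        "movie": ["A", "A", "B", "B", "B"],
        "rating": [4, 5, 2, 3, 3],
    })

    # dplyr-ish pipeline expressed as a single method chain:
    # group_by -> summarise -> rename -> filter
    result = (
        df.groupby("movie")["rating"]
          .agg(["mean", "count"])
          .rename(columns={"mean": "mean_rating", "count": "n_ratings"})
          .reset_index()
          .query("n_ratings >= 2")
    )
    print(result)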


I had a similar issue. I had to read a certain piece of R code that used a lot of dplyr; I read the dplyr documentation and immediately felt more comfortable manipulating data in R than in Python. Later on I created https://github.com/has2k1/plydata, a dplyr imitation.


A lot of people have already mentioned that pandas is built on top of numpy. Also, pandas and numpy are housed under the same non-profit: https://www.numfocus.org


First time I heard about NumFocus. Under their umbrella also sit iPython, Jupyter Notebook, Julia, Matplotlib, and a dozen more projects.


But isn't the issue here that completely redoing the API would break a lot of code? I don't see how throwing money at the problem would fix this. I don't use any of these libraries, so maybe I'm totally off base, but it sounds like it's more of a tech debt/design issue than an issue that requires the kind of programming hours that only money can buy.

On the other hand if lots of libraries use numpy, making it more efficient and/or capable would seem to give quite a lot of bang for the buck. And it sounds like that's the kind of problem that money can actually solve.


Pandas makes lots of backwards-incompatible changes. See for example these changes in the latest release

http://pandas.pydata.org/pandas-docs/version/0.18.0/whatsnew...

There have been a few independent attempts to add dplyr-like functionality to pandas without being backwards incompatible (e.g. dplython). I'd be very happy if the core pandas team went down this path.

That being said, I don't have a good understanding of how strong the distinction is between "design issues" and "issues where money helps". There must be some overlap.


I'll have to speak in generalities as I don't know enough about NumPy in particular to comment.

> That being said, I don't have a good understanding of how strong the distinction is between "design issues" and "issues where money helps". There must be some overlap.

That's true, but many projects have turned out badly no matter how much money was spent on them, compared to less expensive but better-run projects. See: design by committee. The design of an API obviously requires careful thought, which I suppose is work that could be paid for. But the issue of getting everyone to agree on a design isn't one that money can solve, and then you need to make some hard decisions about backward incompatibility. Perhaps you'd fund a fork of the project, splitting it into an old legacy one and a new, fancy version with a new API, but then you're committed to maintaining two projects, which is its own headache.

These are the kinds of things I mean by design issues. Problems that aren't necessarily hard because they require many people to work for many billable hours to solve them, but because finding acceptable compromises is a very human issue quite irrespective of the programming effort involved.

Many a software project has recognized that serious, backwards-incompatible changes would improve the project, and often there is even a working implementation, but these human and legacy support issues prevent widespread adoption and then the new implementation dies a quiet death because nobody is using it, so nobody finds it worth their time to work on it.

Perhaps what you really want is a new library, rather than trying to contort a different project into the shape you want. Which is of course something money helps with, but then when the money dries up the question of adoption is going to determine whether it lives or dies as an open source project.

Again, those were some general thoughts, I don't know much about this particular project, so maybe I'm way off base. Just offering an alternative POV regarding what exactly constitutes "getting your money's worth" with respect to choosing which OS projects to fund.


pandas is often used for one-off reports, where backwards compatibility is not as important. Production software relying on the API could always depend on previous versions if a new version brings a significantly improved API.

I'm a regular user of pandas, would definitely say it's my favorite Python library by far... but it is very hard to do certain operations with it (as the OP said, anything involving multiple indexes, and things like plotting multiple plots after a groupby, etc.)
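
The groupby-then-plot case is a good example: it usually ends up as an explicit loop rather than a one-liner. A rough sketch (toy data):

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.DataFrame({
        "movie": ["A"] * 3 + ["B"] * 3,
        "week": [1, 2, 3, 1, 2, 3],
        "rating": [3.0, 3.5, 4.0, 2.0, 2.5, 2.2],
    })

    # One line per group: no obvious one-liner, so loop over the groupby.
    fig, ax = plt.subplots()
    for name, grp in df.groupby("movie"):
        grp.plot(x="week", y="rating", ax=ax, label=name)
    ax.set_ylabel("rating")
    plt.show()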


Ok, I might very well be totally off base. Sorry for butting in on a subject that I don't know much about.


> every data scientist I know who uses pandas says they are constantly having to google how to do things due to a somewhat confusing and inconsistent API.

That's a design error, not necessarily something that money will fix for you. This is why you need to think really long and hard before deploying a public API; it is very hard to change one later.


Well, one could at least hope that some additional funding would improve the chances that these design errors are addressed, although I agree that it is no panacea.


Just because the API is bad doesn't mean we should throw money at it. I agree that NumPy might not be the best recipient either. It's hard telling, really.

Personally, I believe the biggest blocker for me is having good visualization tools. That's ultimately what gets me paid: showing other people my work and getting them to give me money to continue it.

In the core science stack, IMO, there's numpy, scipy, sympy, matplotlib, pandas and xarray. I probably use sympy next to least, but I really think it's the one that could benefit the most from some funding.


Do you not use Seaborn?


I can't speak to the reasons why pandas wasn't funded, but the team is looking for funding.

At the end of the day a lot of code uses NumPy and not Pandas.


Pandas is sponsored by AQR I thought?


Nope, just developed in-house there (and the original developer now works at Two Sigma).


Nope


This seems different and a bit lacking in detail (although I don't dispute that it could be useful). How exactly does one choose m and C? And what are the conditions under which it would reduce to the James-Stein / Buhlmann / BLUP model?


The choice of m and C need not be exact. It is enough to choose them so that

1. If there are no ratings, Bayesian average is close to overall mean, and

2. If there are many ratings (how many depends on how big the site is), C and m do not affect the result much.

You probably can do a little better if you have a lot of data and the ability to run A/B tests, but for the vast majority of cases pseudocounts work just fine.
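
In code, the pseudocount version is basically a one-liner; a quick sketch (m and C picked by hand, per the above):

    def bayesian_average(ratings, m, C):
        """Shrink an item's mean toward a prior mean m, weighted by C pseudo-ratings."""
        return (C * m + sum(ratings)) / (C + len(ratings))

    # No ratings -> exactly the prior mean; many ratings -> C and m barely matter.
    print(bayesian_average([], m=3.5, C=5))          # 3.5
    print(bayesian_average([5] * 1000, m=3.5, C=5))  # ~4.99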


Got it. Thanks for the clarification. In that case I would think that James-Stein / Buhlmann / BLUP is a better approach, since it is just as easy to implement and the amount of shrinkage is optimally chosen based on the data rather than on guesswork. In fact it may be even easier, because no guesswork is required.

It would be interesting though to have people try to guess suitable values of m and C and then see how close their MSEs get to the James-Stein MSE. I suspect that some people's guesses would be meaningfully off target.
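
For concreteness, here's a rough sketch of the kind of estimator I have in mind, where the shrinkage weight is estimated from the data rather than hand-picked. This is a simplified Buhlmann-style version (equal within-item variance, method-of-moments estimates), not the exact procedure from my post:

    import numpy as np

    def shrunken_means(groups):
        """Shrink each item's mean toward the grand mean, with the amount of
        shrinkage estimated from the data (credibility-style weighting).
        groups: dict mapping item -> array of ratings."""
        means = {k: np.mean(v) for k, v in groups.items()}
        counts = {k: len(v) for k, v in groups.items()}
        grand_mean = np.mean(np.concatenate(list(groups.values())))

        # Pooled within-item variance and method-of-moments between-item variance.
        within = np.mean([np.var(v, ddof=1) for v in groups.values() if len(v) > 1])
        between = max(np.var(list(means.values()), ddof=1)
                      - within / np.mean(list(counts.values())), 1e-9)

        k_ratio = within / between  # large when items barely differ -> more shrinkage
        return {k: (counts[k] * means[k] + k_ratio * grand_mean) / (counts[k] + k_ratio)
                for k in groups}

    ratings = {"A": np.array([5.0, 5.0]),
               "B": np.array([3.0, 4.0, 3.5, 4.0, 3.0])}
    print(shrunken_means(ratings))

Items with only a couple of ratings get pulled strongly toward the grand mean, while heavily rated items barely move -- and no hand-picked m or C is required.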


But that's not how you should measure it. Your goal is not to minimize MSE. Your goal is to rank movies in a way that users like.

So the test would be to randomly split users into test and control, show ranking based on Bayesian averaging to control, show ranking based on James-Stein or some other method to test, measure some metric of user happiness (a different hard problem, click rate on top titles?), then do the comparison.


Author here. Please see the section on mixed models in my post. As I mentioned there, I would love if an expert could expand on the relationship between mixed effects and Empirical Bayes.

Regarding MCMC, one of the things I try to emphasize throughout the post is that the best solution depends on your needs (for example if you want a full posterior). In fact, most of the post is devoted to quick and simple methods -- not MCMC -- because they are good enough for most purposes. I welcome your feedback though on how I could make this point clearer.


> Author here.

Alright, I'll put on my Reviewer Number 3 hat and say that I learned some neat things from your work, including that the National Swine Improvement Federation exists. I'll try to do a halfway decent job here.

> I would love if an expert could expand on the relationship between mixed effects and Empirical Bayes.

A real expert? Here you go:

http://statweb.stanford.edu/~ckirby/brad/LSI/monograph_CUP.p...

Read it, all of it, but particularly chapter 1, section 2.5, and chapters 8, 10, and 11. Why do testing, effect size estimation, and high-dimensional analysis have anything to do with anything? Because...

1) Independence is largely a myth.

2) You are likely to have multiple ratings per reviewer on your site, whether your generating distribution is nearly-continuous (0-10, mean-centered) or discrete (0/1, A/B/C). If you discard this, you are throwing away an enormous amount of information, and failing utterly to understand why a person would estimate not just the variance but the covariance even for a univariate response.

The second point is the one that matters.

Also, "empirical Bayes" is in modern parlance equivalent to "Bayes". What's the alternative? "Conjectural Bayes"? (Maybe I should quit while I'm ahead, pure frequentists may be lurking somewhere)

> I welcome your feedback though on how I could make this point clearer.

For starters, edit. Your post is too damned long.

Think about where you are getting diminishing returns and why. Is there ever a realistic situation where your ratings site would not keep track of who submitted the rating? (It's certainly not going to be an unbiased sample, if so; the ballot box will get stuffed) So if you have to keep track of who's voting, you automatically have information to decompose the covariance matrix, and everything else logically follows.

A univariate response with a multivariate predictor (say, rating ~ movie*rater) can have multiple sources of variance, and estimating these from small samples is hard. When you use a James-Stein estimator, you trade variance for bias. You're shrinking towards movie-specific variance estimates, but you almost certainly have enough information to shrink towards movie-centric and rater-centric estimates of fixed and random effects, tempered by the number of ratings per movie and the number of ratings per rater. (Obviously you should not have more than one rating per movie per rater, else your sample cannot be unbiased).
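
If it helps, here is roughly what that decomposition looks like in code: a toy sketch of crossed movie and rater random effects using statsmodels variance components within a single all-encompassing group. The simulated data, effect sizes, and the particular statsmodels incantation are mine, purely for illustration:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated long-format ratings: one row per (rater, movie) pair.
    rng = np.random.default_rng(0)
    n_raters, n_movies = 30, 15
    df = pd.DataFrame([(r, m) for r in range(n_raters) for m in range(n_movies)],
                      columns=["rater", "movie"])
    rater_eff = rng.normal(0, 0.5, n_raters)
    movie_eff = rng.normal(0, 1.0, n_movies)
    df["rating"] = 3.0 + rater_eff[df.rater] + movie_eff[df.movie] + rng.normal(0, 0.5, len(df))

    # Crossed random effects for movie and rater, expressed as variance components
    # within a single all-encompassing group.
    df["one"] = 1
    model = smf.mixedlm("rating ~ 1", df, groups="one", re_formula="0",
                        vc_formula={"movie": "0 + C(movie)", "rater": "0 + C(rater)"})
    result = model.fit()
    print(result.summary())  # variance components for movie and rater, plus residual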

I think you will return to this and write a much crisper, more concise, and more useful summary once this sinks in. I could be wrong. But you'll have learned something deeply useful even if I am. I do not think you can lose by it.


> Also, "empirical Bayes" is in modern parlance equivalent to "Bayes". What's the alternative? "Conjectural Bayes"?

My understanding of the difference, as a frequent user of empirical Bayes methods (mainly limma[1]), is that in "empirical Bayes" the prior is derived empirically from the data itself, so that it's not really a "prior" in the strictest sense of being specified a priori. I don't know whether this is enough of a difference in practice to warrant a different name, but my guess is that whoever coined the term did so to head off criticisms to the effect of "this isn't really Bayesian".

[1]: https://bioconductor.org/packages/release/bioc/html/limma.ht...


Do you have a webpage? I just helped my wife (a physician) with stats for a research presentation that sought to track infection spread in hospitals (location-specific, via room number) through the movement of tagged equipment and staff. They then PCR'd the strains to make sure it was the same one.

The experimental design was good; the stats person they had to help them decipher the results... left much to be desired.

Could you please be so kind as to email me at jpolak{at} the email service of a company where a guy named Kalashnikov worked.


Yup, I agree about throwing away rater information. The actual application at my company that motivated me to research this doesn't have rater information, which is why I didn't think to adjust for it. The movie case was just an example I used to motivate this post for which, yes, I agree, rater information would be quite useful.


It's an excellent blog post, although it's worth emphasizing that it is designed for the binomial case, where you wish to estimate a fraction of successes out of some number of trials, such as batting averages. For continuous variables, however, it makes more sense to use one of the methods described in the original post.

TL;DR: One blog post is for Rotten Tomatoes and the other is for Metacritic.
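
For anyone curious, the binomial version has the same flavor as the continuous one; a minimal sketch below, using a method-of-moments fit of the beta prior for brevity (David's post uses maximum likelihood):

    import numpy as np

    def eb_proportions(successes, trials):
        """Shrink raw proportions (e.g. batting averages) toward a beta prior
        estimated from the data itself (method-of-moments fit, for brevity)."""
        successes = np.asarray(successes, dtype=float)
        trials = np.asarray(trials, dtype=float)
        p = successes / trials

        mean, var = p.mean(), p.var(ddof=1)
        common = mean * (1 - mean) / var - 1  # crude: ignores unequal trial counts
        a, b = mean * common, (1 - mean) * common

        # Posterior mean for each item: a pseudo-successes and b pseudo-failures.
        return (successes + a) / (trials + a + b)

    print(eb_proportions([2, 30, 300], [4, 100, 1000]))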


Absolutely, and thanks for better defining the distinction.

I really just wanted to point out another solid Empirical Bayes resource, as there's not that many about. Yours and David's make a good combination covering different cases.

