For what it's worth, the article is the author arguing why they don't personally use blah-blah (Dependent Types) despite being a leading academic in the field (PLT) where blah-blah is frequently touted as the holy grail, and instead justifying their experience using blah-blah-2 (Higher Order Logic), a tried-and-true "sophomoric" choice that seems dusty and crusty by comparison (literally: PLT undergrads frequently learn how to formalize systems using a reduced form of blah-blah-2 in their sophomore year, as a way to learn SML). The rest of the article is really only interesting for the PLT/proof-automation community since it is pretty niche. The conclusion is that you don't need the shiny new blah-blah, which often makes things more complicated, when the older blah-blah-2 can do mostly the same work and has the benefit of simplicity and ease of automation.
Their in-context long-sequence understanding "benchmark" is pretty interesting.
There's a language called Kalamang with only 200 native speakers left. There's a set of grammar books for this language that adds up to ~250K tokens. [1]
They set up a test of in-context learning capabilities at long context: they asked 3 long-context models (GPT-4 Turbo, Claude 2.1, Gemini 1.5) to perform various Kalamang -> English and English -> Kalamang translation tasks. These are done either 0-shot (no prior training data for kgv in the models), half-book (half of the kgv grammar/wordlists - 125k tokens - fed into the model as part of the prompt), or full-book (the whole 250k tokens fed into the model). Finally, they had human raters check these translations.
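To make the three conditions concrete, here's a rough sketch of how such prompts could be assembled. This is not the paper's actual harness; the file name, the placeholder sentence, and `call_model` are all made up for illustration.

```python
# Sketch of the 0-shot / half-book / full-book prompting conditions described above.
# Everything here is hypothetical scaffolding, not the authors' evaluation code.

def call_model(prompt: str) -> str:
    """Placeholder for a call to a long-context model (GPT-4 Turbo,
    Claude 2.1, Gemini 1.5, ...); swap in a real API client here."""
    raise NotImplementedError

def build_prompt(reference_text: str, sentence: str, direction: str) -> str:
    """Assemble one translation prompt; reference_text is empty for 0-shot."""
    task = f"Translate the following sentence ({direction}):\n{sentence}\n"
    if reference_text:
        return ("Reference materials for Kalamang (grammar, wordlist, "
                "parallel sentences):\n\n" + reference_text + "\n\n" + task)
    return task

# Hypothetical file holding the ~250k tokens of grammar/wordlist material.
book = open("kalamang_resources.txt").read()
conditions = {
    "0-shot": "",
    "half-book": book[: len(book) // 2],
    "full-book": book,
}
for name, reference in conditions.items():
    prompt = build_prompt(reference, "<Kalamang sentence here>", "Kalamang -> English")
    # translation = call_model(prompt)   # outputs then go to human raters
```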
This is a really neat setup: it tests for various things (e.g. whether the model really "learned" anything from these massive grammar books) beyond just synthetic memorize-this-phrase-and-regurgitate-it-later tests.
It'd be great to make this and other reasoning-at-long-context benchmarks standard fare for evaluating context extension. I can't tell which of the many context-extension methods (PI, E2 LLM, PoSE, ReRoPE, SelfExtend, ABF, NTK-Aware ABF, NTK-by-parts, Giraffe, YaRN, Entropy ABF, Dynamic YaRN, Dynamic NTK ABF, CoCA, ALiBi, FIRE, T5 Rel-Pos, NoPE, etc.) is really SoTA, since they all use different benchmarks, meaningless benchmarks, or methodologies so drastically different that there's no fair comparison.
The available resources for Kalamang are: field linguistics documentation comprising a ∼500 page reference grammar, a ∼2000-entry bilingual wordlist, and a set of ∼400 additional parallel sentences. In total the available resources for Kalamang add up to ∼250k tokens.
That's not necessarily true across all "fast fields". There's definitely selection pressure against "moving fast" in fields that are sensitive to faulty experimentation. Computer Science is one of the bigger subjects within arXiv; retraction rates within the discipline are extremely low, and the field as a whole moves very rapidly. Having access to preprints makes it much more scalable for researchers to stay at the cutting edge.
There are already several extensions to Markdown with LaTeX support. For example, StackExchange supports flavored Markdown with a MathJax backend, as do most static Markdown renderers out there. Nevertheless, since Markdown isn't standardized across the field, we really can't complain that some renderers support LaTeX while others do not. Since it gets asked about in /github/markdown often enough, however, I believe it's worth some consideration.
I think the answer to your question is more situational. For example, the ML community routinely publishes supplementary material on Github. While the introductory section of these repositories should be accessible to everyone, they also go fairly deep into the technical details before pointing to the respective publication for more. The PL community also routinely uses Github as a distribution platform. For them, it's often much more compact to give the semantics of their language extension as a system of equations rather than as words. This doesn't absolve them of the obligation to give readable examples of the specification, but for some of these more technical repositories, that's more of a nice-to-have.
Having said all of this, the ML community has already thought up a rather clever workaround. GitHub renders ipynb notebooks that have LaTeX inside their Markdown cells, so many READMEs just reference a Jupyter notebook. It's just a different approach to the same problem.
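As a minimal sketch of that workaround (the file name and the formula are just placeholders I picked), you can generate such a notebook programmatically with nbformat, and GitHub will render the math in the Markdown cell:

```python
# Build a one-cell notebook whose Markdown contains LaTeX; GitHub's ipynb
# renderer displays the equation, unlike a plain README.md.
import nbformat

nb = nbformat.v4.new_notebook()
nb.cells.append(nbformat.v4.new_markdown_cell(
    "## Model\n\n"
    "The loss we minimize is\n\n"
    r"$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \big(y_i - f_\theta(x_i)\big)^2$$"
))
with open("README_math.ipynb", "w") as f:
    nbformat.write(nb, f)
```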
Thanks for taking the time to respond with your insights. Rendering math notation for screen display in an HTML browser is a genuinely difficult problem.
For a long time, TeX was the best-in-class document rendering software for mathematics. I believe TeX emits DVI (usually converted to PostScript or PDF), which works wonderfully for printed articles and books. Web developers can either try to leverage legacy software like TeX or take an entirely different approach. Can you point to a good summary article on this topic?
More parameters also means that the likelihood of overfitting (the training set) increases. Currently (and rather unintuitively, considering that ML is an applied optimization field, and optimization is usually concerned with underfitting), the bane of ML is overfitting. It's easy to supply a model with high representational capacity, but that alone doesn't mean it will learn anything interesting in a reasonable amount of time. You'll fit your training set perfectly because your model has enough degrees of freedom to fit a million points arbitrarily well, but that doesn't mean the resulting fit describes the data in a meaningful way. This is why a core tenet of ML is to prune parameters whenever possible. Neurogenesis increases representational capacity whenever it detects that the underlying model doesn't have enough capacity to fit the data; from this perspective, you start small (under capacity) and gradually increase capacity until you hit the optimal model. In other words, neurogenesis is also a way to minimize the number of free parameters (options).
On the other hand, giving the model more options than it actually needs and letting it decide what is important will usually backfire. Rather than learning a few meaningful/functional features, it can just go ahead and completely fit the training data from the very beginning. It will therefore decide that everything is important, because all those extraneous parameters let it squeeze that last 0.5% out of your training set.
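Here's a tiny illustration of that point (the data and polynomial degrees are arbitrary choices on my part): with enough parameters, the training error goes toward zero while the test error blows up.

```python
# Overfitting demo: degree-3 vs. degree-15 polynomial fits on noisy data.
# The high-capacity model fits the 20 training points almost exactly but
# generalizes far worse on held-out points.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 20))
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=x_train.size)
x_test = np.sort(rng.uniform(-1, 1, 200))
y_test = np.sin(3 * x_test) + 0.1 * rng.normal(size=x_test.size)

for degree in (3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)       # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```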
I think I can sort of see what's going on: we're trying to take GPS positions from a walk around the park or something similar, and fit them to some linear combination of a basis of "street functions" that minimizes the two-norm of the residual, where the "street function" corresponding to each street, evaluated at some point $(x, y)$, is some measure of how far that point is from the street.
For example, in the subproblem of only one-way streets, each of which we can assume to be composed of one or more straight-line segments, we take a simple street function $\phi_i(x, y)$ corresponding to street $i$ to be the distance from $(x, y)$ to its projection onto that street. We also add some "smoothing function" to the system to ensure that the overall shape of the final path is physically doable, for example by constraining the distance between successive points. Next, we solve the argmin of the norm equation for each point so that each point is moved onto some linear combination of the streets, truncate all but the most significant street basis elements, and rerun until we get to some acceptable tolerance.
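Roughly, the per-point piece of that might look like the sketch below (the toy streets and GPS points are invented by me; the smoothing term and the truncate-and-rerun loop are left out):

```python
# Per-point street-distance minimization under the assumption that each
# street is a single straight segment. Illustrative data only.
import numpy as np

def project_onto_segment(p, a, b):
    """Orthogonal projection of point p onto segment a-b, clamped to the segment."""
    ab = b - a
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return a + t * ab

def street_distance(p, street):
    """phi_i(x, y): distance from the point to its projection onto street i."""
    a, b = street
    return np.linalg.norm(p - project_onto_segment(p, a, b))

# Two toy streets (pairs of endpoints) and a few noisy GPS points.
streets = [
    (np.array([0.0, 0.0]), np.array([10.0, 0.0])),   # east-west street
    (np.array([5.0, -5.0]), np.array([5.0, 5.0])),   # north-south street
]
gps = np.array([[1.2, 0.3], [3.9, -0.2], [5.1, 1.8], [5.0, 3.9]])

# Snap each point to its nearest street; a smoothing term over successive
# points would be layered on top of this per-point minimization.
for p in gps:
    dists = [street_distance(p, s) for s in streets]
    i = int(np.argmin(dists))
    snapped = project_onto_segment(p, *streets[i])
    print(f"point {p} -> street {i}, snapped to {np.round(snapped, 2)}")
```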
o_O I don't know about you guys, but our CS degree requirements include at least one semester of functional programming (using ML rather than Haskell; I know for sure that CMU (ML) and Berkeley (Scheme/ML) have similar requirements too). Java's seen as the bane of the world over here, especially since none of our professors are convinced that OO is more than a little syntactic sugar coated over some generic imperative language (and as far as I can tell, I'm starting to agree with them).
AFAIK it is pretty common for universities in Germany to start CS programs with (S)ML, OCaml, Haskell, or some combination of these, depending upon the individual preferences of the responsible chair.
Java is (was?) pretty common for introductory OO classes. Advanced OO classes (outside of a "Software Engineering" context), though, often introduce Eiffel, Smalltalk, Python, C++, and C# as well.
Depending upon your particular choice of courses, specialization, etc., being exposed to Lisp, Prolog, Perl, C, and R isn't uncommon either, and from what I've heard, Scala is also starting to appear as a teaching language.
From what I've heard, the curricula in France, Switzerland, Austria, and Denmark are very similar. So I am a bit surprised by the notion that there are universities out there where you can get a CS degree without being exposed to any functional programming.