We often ignore the importance of using good baseline systems and jump to the latest shiny thing.
As I see it, you need a model you can train quickly so you can do tuning, model selection, and all that.
I have a BERT + SVM + Logistic Regression (for calibration) pipeline that can train 20 models for automatic model selection and calibration in about 3 minutes. I feel like I understand its behavior really well.
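For concreteness, a sketch of that kind of pipeline, assuming sentence-transformers for a frozen BERT-style encoder and scikit-learn for the SVM sweep and sigmoid (logistic) calibration; the model name, the hyperparameter grid, and load_texts_and_labels() are illustrative placeholders rather than the exact setup described:

    # Sketch: frozen BERT-style embeddings -> linear SVM sweep -> logistic calibration.
    # load_texts_and_labels() is a hypothetical data loader.
    import numpy as np
    from sentence_transformers import SentenceTransformer
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    texts, labels = load_texts_and_labels()
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # frozen representation layer
    X = encoder.encode(texts, batch_size=64)
    y = np.asarray(labels)

    # "Train 20 models" here means sweeping the SVM's C and keeping the best by CV score.
    candidates = [LinearSVC(C=c) for c in np.logspace(-3, 2, 20)]
    scores = [cross_val_score(m, X, y, cv=5).mean() for m in candidates]
    best = candidates[int(np.argmax(scores))]

    # Calibrate the winning SVM with a sigmoid (a logistic regression on its scores).
    model = CalibratedClassifierCV(best, method="sigmoid", cv=5)
    model.fit(X, y)
    probabilities = model.predict_proba(X[:5])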
I've tried fine-tuning BERT for the same task: the shortest model builds take 30 minutes, the training curves make no sense (back in the day I used to be able to train networks with early stopping and get a good one every time), and if I look at arXiv papers it is rare for anyone to have a disciplined model selection process at all; mostly people use a recipe that sorta-kinda seemed to work in some other paper. People scoff at you if you ask the engineering-oriented question "What training procedure can I use to get a good model consistently?"
That's the thing that blows my mind. Even if NNs are some percentage better, the training and deployment headaches are not worth it unless you have a billion users, where a 0.1% lift equates to millions of dollars.
It is pleasantly surprising to see how close your pipeline is to mine: essentially a good representation layer, usually based on BERT (like MiniLM or MPNet), followed by a calibrated linear SVM. Sometimes I replace the SVM with LightGBM if I have non-language features.
If I am building a set of models for a domain, I might fine-tune the representation layer. On a per-model basis I typically just train the SVM and calibrate it. For the amount of time this whole pipeline takes (not counting the occasions when I fine-tune), it works amazingly well.
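Roughly, the non-language-features variant has this shape (a sketch assuming sentence-transformers and the lightgbm package; load_domain_data() and the hyperparameters are placeholders, not an exact recipe):

    # Sketch: frozen text embeddings concatenated with tabular (non-language)
    # features, fed to LightGBM. load_domain_data() is a hypothetical loader.
    import numpy as np
    import lightgbm as lgb
    from sentence_transformers import SentenceTransformer

    texts, tabular_features, labels = load_domain_data()
    encoder = SentenceTransformer("all-mpnet-base-v2")  # or a domain fine-tuned encoder
    text_embeddings = encoder.encode(texts)

    X = np.hstack([text_embeddings, np.asarray(tabular_features)])
    clf = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05)
    clf.fit(X, labels)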
I spent a week learning enough ML to design a recommender system that worked well with my company's use case. I knew enough linear algebra to determine that collaborative filtering, with some specifically chosen dimensionality reduction and text vectorization algorithms and a strategy for scaling the models across multiple databases, would work well for us. The solution was tailored specifically to the type of data we were working with.
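For concreteness, a sketch of that kind of design, assuming scikit-learn and SciPy; the ratings matrix, item_descriptions, component count, and blend weights are illustrative placeholders rather than the actual proposal:

    # Sketch: item-item collaborative filtering on a dimensionality-reduced
    # user-item matrix, blended with TF-IDF text similarity.
    # `ratings` (users x items) and `item_descriptions` are hypothetical inputs.
    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    interactions = csr_matrix(ratings)
    item_factors = TruncatedSVD(n_components=64).fit_transform(interactions.T)

    tfidf = TfidfVectorizer(max_features=20000)
    item_text = tfidf.fit_transform(item_descriptions)

    # Blend behavioral and textual similarity for item-item recommendations.
    similarity = 0.7 * cosine_similarity(item_factors) + 0.3 * cosine_similarity(item_text)

    def recommend(item_index, k=10):
        return np.argsort(-similarity[item_index])[1:k + 1]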
When I presented the proposal, nobody had read it and the meeting immediately turned to the VP of Engineering and the CEO discussing neural networks and some other ML system they had read about on HN the day before. When I tried to bring collaborative filtering up again, the VP said "I don't know what that is", so obviously he hadn't read the doc I had been assigned to write over the previous week.
I had a similar experience a few years back when participating in ML competitions [1,2] for detecting and typing phrases in text. I submitted an approach based on Named Entity Recognition using a Conditional Random Field (CRF), which is robust and well known in the community, and my solution beat most of the tuned deep learning solutions by quite a large margin [1].
I think a lot of folks underestimate the complexity of using some of these models (DL, LLMs) and just throw them at the problem, or don't compare them properly against well-established baselines.
[1] https://scholar.google.com/citations?view_op=view_citation&h...
[2] https://scholar.google.com/citations?view_op=view_citation&h...
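A minimal sketch of the kind of CRF tagger described above, assuming sklearn-crfsuite and a toy BIO-tagged example; the feature template and data are illustrative, not the actual competition submission:

    # Sketch: CRF sequence tagger for NER with hand-crafted token features.
    import sklearn_crfsuite

    def word_features(sent, i):
        """Simple features for token i of a tokenized sentence."""
        word = sent[i]
        feats = {
            "word.lower": word.lower(),
            "word.istitle": word.istitle(),
            "word.isdigit": word.isdigit(),
            "suffix3": word[-3:],
            "BOS": i == 0,
            "EOS": i == len(sent) - 1,
        }
        if i > 0:
            feats["prev.lower"] = sent[i - 1].lower()
        if i < len(sent) - 1:
            feats["next.lower"] = sent[i + 1].lower()
        return feats

    def featurize(sent):
        return [word_features(sent, i) for i in range(len(sent))]

    # Toy training data in BIO format (placeholder for real competition data).
    train_sents = [["Barack", "Obama", "visited", "Paris", "."]]
    train_labels = [["B-PER", "I-PER", "O", "B-LOC", "O"]]

    crf = sklearn_crfsuite.CRF(
        algorithm="lbfgs", c1=0.1, c2=0.1,
        max_iterations=100, all_possible_transitions=True,
    )
    crf.fit([featurize(s) for s in train_sents], train_labels)
    print(crf.predict([featurize(["Angela", "Merkel", "flew", "to", "Berlin"])]))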