Everyone repeat after me: "we need a baseline model".
You should always try some "dumb" models first. You'd be surprised how hard it is to beat a historical-average model with a more sophisticated method (it depends on your KPIs, of course).
Not to mention the plethora of issues that arise from trying to fit an ARIMA onto an AR(1) process... It's weird that people reach for insanely complicated models right off the bat.
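To make the baseline concrete, here's a minimal sketch (the series, hold-out values, and the MAE metric are placeholders I made up, not from any real system): forecast every future point with the historical mean, and make any fancier model beat that first.

    import numpy as np

    # "Dumb" baseline: predict the historical average for every future point.
    history = np.array([102.0, 98.5, 101.2, 99.8, 100.4])  # placeholder series
    actuals = np.array([100.9, 99.5])                       # placeholder hold-out

    baseline = np.full_like(actuals, history.mean())
    mae = np.abs(actuals - baseline).mean()
    print(f"historical-average MAE: {mae:.3f}")  # beat this before shipping an ARIMA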
I've seen this play out in real time. I don't do statistics as part of my day job, but I've had enough experience, and keep up with the field enough, to know what I'm talking about. I've seen senior engineers try to ram in an overly specified ARIMA model just to claim that they'd "improved" the system. It performed far worse than whatever model we were using before (unfortunately I never got to look under the hood of that one), was prone to wild swings in its forecasts, and was eventually deprecated; we reverted to the old model.
I mean... you can always appeal to "old school" AI. Just dig into the old papers and use their words. Latent semantic analysis (LSA) is an example of a hard-to-beat baseline model for text:
“By inducing global knowledge indirectly from co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren.” (http://www.stat.cmu.edu/~cshalizi/350/2008/readings/Landauer...)
I once had a mentor with clout on a 9 figure investment committee tell me that maximum likelihood estimation is "the dumbest idea" he'd ever heard.
Words like "Cramér-Rao bound" didn't get through. What worked was saying "deep learning is usually just MLE with a mysteriously effective function approximation".
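If you want the one-screen demo of that claim, here's a toy sketch (all numbers invented): for Gaussian data the negative log-likelihood is just squared error up to constants, so minimizing it recovers the sample mean, i.e. the MLE. Swap the constant mean for a neural net and you have deep learning with an MSE loss.

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(loc=3.0, scale=1.0, size=1000)

    # Negative log-likelihood of N(mu, 1), up to constants, on a grid of candidate means
    mus = np.linspace(0.0, 6.0, 601)
    nll = np.array([0.5 * np.sum((y - mu) ** 2) for mu in mus])

    print(mus[np.argmin(nll)])  # ~= y.mean(): the MLE is the plain sample average
    print(y.mean())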
What I meant, more precisely, is hard to beat in terms of effort vs. quality of outcome: it's two lines of code in scikit-learn [CountVectorizer() + TruncatedSVD()] to go from raw text to document/word embeddings, and the result is often "good enough" depending on what you're trying to do. See the results on pg. 6 (note LSI == LSA): http://proceedings.mlr.press/v37/kusnerb15.pdf
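Roughly like this, as a sketch (the toy corpus and the number of latent dimensions are placeholder choices of mine, not anything from the paper):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["the cat sat on the mat",        # placeholder corpus
            "dogs chase cats",
            "stocks fell on earnings news"]

    counts = CountVectorizer().fit_transform(docs)  # documents x vocabulary counts
    svd = TruncatedSVD(n_components=2)              # latent dimension is a placeholder
    doc_embeddings = svd.fit_transform(counts)      # documents x latent dims
    word_embeddings = svd.components_.T             # vocabulary x latent dims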
Also, at least based on papers I've read recently, BERT doesn't work that well for producing word embeddings compared to word2vec and GloVe (which, like LSA, can be formulated as matrix factorization methods). See the table on pg. 6: https://papers.nips.cc/paper/9031-spherical-text-embedding.p...
Point being: mastering the old models gives you a solid foundation to build from.