they'd never been extended to work on more than 8 bits, even in the 8086; they only really existed for 8080 compatibility, and arguably the 8080 primarily had them to ease the path from earlier intel processors designed for, if i'm not mistaken, literal pocket calculators
in pocket calculators you have to display the result in decimal after performing a single arithmetic operation, so it's much more efficient to do the arithmetic in bcd than to convert from decimal to binary, perform the arithmetic operation, and then convert back
Agner says that AAA has latency 5 on Cannon Lake, so using that instruction is a bit faster than doing the operations manually. But if you vectorize (or use SWAR) I imagine you can start to beat the legacy instructions with larger numbers.
I can't imagine that has been efficient for decades tho