I tried to get ChatGPT (4) to extract some tax info from a PDF for me (because this one place doesn't give you a 1099-B in any more usable form) and it completely hallucinated the result. It doesn't seem to read PDFs well at all. Claude did it correctly, though.
It doesn't seem like there are any real optimization possibilities in the article, though. For household taxes there just aren't situations complicated enough to need optimizing, and the big things, like maxing out your 401k, you had to have done the previous year anyway.
Irritated by the UX and business model of web-based tax return preparation, I've looked into open source alternatives in the past. I hadn't come across the Open Tax Solver package mentioned in this article, which looks interesting. It seems like they all rely on printing and mailing the completed returns. Is there any hope of ever being able to e-file them? I don't understand why it couldn't be possible for Free Fillable Forms to support some sort of import format.
This looks great, but mainly because of tenforty's capabilities.
The US tax code looks like an absolute mess, but I've often thought about implementing something similar for more favourable jurisdictions.
There's definitely less taxation overall and fewer rules. Maybe it's not complex enough to warrant more than a few Excel formulas.
Coming back to tenforty: it's cool that the LLM lets you access it via natural language, but that makes it pretty unusable for me. There is way more value (and more work needed, as you noticed) in building a proper UI.
It'd be cool to explore generating UI using LLMs instead.
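For what it's worth, the kind of thing a proper UI would sit on top of is just a thin programmatic layer over the engine. A rough sketch, assuming tenforty exposes an evaluate_return-style entry point with keyword arguments roughly like these (the argument names and the shape of the result are my assumption, not taken from its docs, so check the project's README before trusting this):

```python
# Sketch of a minimal backend a form-based UI could call into.
# NOTE: the tenforty call below is assumed, not verified against the
# package's actual API -- adjust field names to match its README.
import tenforty


def estimate_return(form_inputs: dict) -> dict:
    """Map values collected from a web form onto a tax-engine call."""
    result = tenforty.evaluate_return(
        year=form_inputs["year"],                    # e.g. 2023
        state=form_inputs.get("state"),              # e.g. "CA"
        filing_status=form_inputs["filing_status"],  # e.g. "Single"
        w2_income=form_inputs["w2_income"],
    )
    # Hand back only what the UI needs to render (these attribute names
    # are placeholders, for the same reason as above).
    return {"total_tax": result.total_tax,
            "effective_rate": result.effective_tax_rate}


if __name__ == "__main__":
    print(estimate_return({
        "year": 2023,
        "state": "CA",
        "filing_status": "Single",
        "w2_income": 85_000,
    }))
```

Once that layer exists, the LLM piece is optional; a plain form can drive it just as well.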
That said, most countries do not have lobbying from tax preparation companies, so they have saner and easier-to-grasp tax codes, and tax returns are painless and/or automated for individuals.
The most important thing to consider when determining your residency status in Canada for income tax purposes is whether or not you maintain or establish significant residential ties with Canada.
Significant residential ties to Canada include:
- a home in Canada
- a spouse or common-law partner in Canada
- dependants in Canada
Secondary residential ties that may be relevant include:
- personal property in Canada, such as a car or furniture
- social ties in Canada, such as memberships in Canadian recreational or religious organizations
- economic ties in Canada, such as Canadian bank accounts or credit cards
- a Canadian driver's licence
- a Canadian passport
- health insurance with a Canadian province or territory
So if you are a Canadian with a passport, working in Japan for over a year with a car at your parents' house in Canada, you could count as a resident of Canada and have to file taxes.
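Just to make the shape of that ties-based determination concrete, here's a toy sketch (my own simplification for illustration; the CRA's actual determination is holistic and case by case, not a simple rule like this):

```python
# Toy illustration of the "significant vs. secondary ties" structure above.
# This is NOT how the CRA actually decides residency; it is only meant to
# show why the Japan example can still come out as "resident".

SIGNIFICANT_TIES = {"home_in_canada", "spouse_in_canada", "dependants_in_canada"}
SECONDARY_TIES = {"personal_property", "social_ties", "economic_ties",
                  "drivers_licence", "passport", "provincial_health_insurance"}


def likely_resident(ties: set) -> bool:
    """Crude heuristic: any significant tie, or a couple of secondary ties."""
    if ties & SIGNIFICANT_TIES:
        return True
    return len(ties & SECONDARY_TIES) >= 2


# The example from the comment: a Canadian passport plus a car left at the
# parents' house, while working in Japan for over a year.
print(likely_resident({"passport", "personal_property"}))  # True under this toy rule
```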
I don't want an accountant who's optimizing around what most people think would work.
Surely optimizing your taxes is almost the definition of a worst-case scenario for an LLM? There is not only an authority which has the final say, but serious penalties for hallucinating.
How about trying to think about what would make better sense to attempt? This seems like a poorly thought-out idea on its face. Unless it's LLM-generated: 'what would be a good article about what GPT can do, what are people really interested in'.
What do you believe was incorrect? I asked ChatGPT 3.5 and 4 this question verbatim, and it seems they both gave the correct answer: it is taxable income. 4 was very thorough; 3.5, not so much.
This is the answer I got: “Interest earned in Canadian Registered Retirement Savings Plan (RRSP) accounts is generally not taxable income in California for state tax purposes. California does not typically tax income earned in foreign retirement accounts like RRSPs. However, it's always a good idea to consult with a tax professional or accountant for personalized advice based on your specific situation.”
Treat an LLM as a confident smart person who isn’t an expert in anything, and doesn’t have access to resources to check their answer (unless you give it access).
If you assume the above, it’s unsurprising that it doesn’t get your question right.
What’s the correct answer to your question? If you say “research the question before answering” I bet it could probably solve it.
> Treat an LLM as a confident smart person who isn’t an expert in anything, and doesn’t have access to resources to check their answer (unless you give it access).
Sure, but I don't want that person helping me with my taxes.
Like, the idea that I get an answer and then I need to do research to figure out if it's correct... The whole point -- literally the entire point -- is that I don't want to do the research. If I wanted to do the research, I wouldn't use the LLM. I would just do the research. And now I'm using the LLM and I have to do the research anyway?
If we're talking about brainstorming or getting topic overviews or helping you start research, sure, I could see that. An LLM could be useful there. But asking direct questions and getting back plausible-sounding incorrect answers? This is a domain where an LLM just shouldn't be used at all.
I've brought up this analogy in the past, but it's like people proposing an LLM as a calculator, and when it gets answers wrong they say, "well, humans get math problems wrong too." Yes, they do. Why do you think we started using calculators instead of humans?
There is a reason why I don't go on Reddit to ask for tax advice. It's cool that people can make a computer that simulates that same use-case, but that is not a use-case that was useful to simulate. If I have to read the tax forms anyway, then I might as well just read the tax forms.
The LLM can do the research itself; current LLMs just often don't by default (and require a bit of prompting).
LLMs also use calculators (and again, if you prompt ChatGPT, it will use a calculator to generate the result).
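Roughly, the tool-use flow looks something like this (a minimal sketch with the OpenAI Python SDK's tool-calling interface; the calculator tool name and schema here are just an illustration, not anything ChatGPT itself ships with):

```python
# Minimal sketch of the "LLM calls a calculator" pattern.
# Assumes the v1 openai Python SDK and an OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Evaluate a basic arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

messages = [{"role": "user", "content": "What is 23.7% of $18,450?"}]
response = client.chat.completions.create(model="gpt-4", messages=messages, tools=tools)

# If the model chose to call the tool, do the arithmetic ourselves and hand
# the result back, instead of trusting generated digits.
call = (response.choices[0].message.tool_calls or [None])[0]
if call is not None:
    expression = json.loads(call.function.arguments)["expression"]
    print(expression, "=", eval(expression))  # toy evaluator; use a real parser in practice
```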
These things are solvable; we are still in the 'Commodore PET' phase of LLM development, i.e. impressive for today, but with lots more to do to make it more useful for most people.
The thing is, a confident smart person who isn’t an expert in anything, who DOES have access to resources to check their answers is still not someone I want doing my taxes.
The only way that an LLM doing my taxes for me is useful is if I trust it to the same level that I trust an expert; and that includes being able to very reliably know when that person is actually confident and when they're not. I haven't seen strong evidence that this is possible with the current LLM approach.
So if we're thinking of LLMs as a confident smart person who sometimes spouts bullcrap -- that's still just not a good fit for this kind of product category. It's a good fit for other tasks, just... not for these.
It's definitely not a good fit for putting in front of problems that are already pretty well solved. Calculators replaced human calculation; it's a solved problem, you put the numbers into the calculator. Even natural-language input into the calculator is a reasonably solved problem; it's not a task that really requires a full LLM at the scale of GPT-4. So is there value in introducing a human-like source of error into the middle of that process again? I feel like the answer is 'no'.
> These things are solvable; we are still in the 'Commodore PET' phase of LLM development
Sure, if we get to the point where an LLM can give such good tax information that there literally is no need to read the tax documents to confirm any of it, then great! But that's not even remotely where we are, and products like this straight-up shouldn't be launched until we're at that point.
This is sort of like selling me a scuba tank and saying, "it might not dispense oxygen, so make sure you hold your breath the entire time", and when I say, "well, what's the point of the tank then?", the reply I get back is, "well, eventually we're going to get it to reliably dispense oxygen!"
Okay, but the products are all launching now, I'm being asked to buy into this now, not eventually. The future doesn't change the fact that the current products that I'm supposed to get excited about are literally useless at a conceptual level -- that they do not even make sense given their caveats. I'm not complaining about AI chat bots or funny toys or even toy phone assistants, I'm complaining that companies like Intuit are literally launching tax assistants that give incorrect information and their response is, "well, double check what the bot says." Well that is not a tax assistant then, that's just work. The service that the chatbot is offering me is that I'll have to work more.
Fundamentally, there appears to be a misunderstanding from these companies about what a product is.
>Treat an LLM as a confident smart person who isn’t an expert in anything
When you say it that way, I'm reminded that IRL I find that sort of person to be just the worst. Largely because I've mostly encountered them in mailing lists and users group meetings, confidently leading other people down a blind alley. When it's just a 1-on-1, with me asking ChatGPT, I guess I'm more willing to accept it -- because the blast radius is limited.
LLMs don't have any clue what the concept "research the question before answering" means. Maybe if you asked it a series of leading questions to get some of the "research" in the context window first. Then you might as well just search. Otherwise it'll just confidently sound like it researched without any bearing on the answer.
LLMs are sloppy, and we shouldn't have to do "sociology" experiments to find out whether or not an LLM might give the right answer slightly more often if we trick it into adding more context to the answer.
> LLMs don't have any clue what the concept "research the question before answering" means.
GPT-4 in ChatGPT will absolutely do research if you ask it (in the practical sense that if you say “research first” like in my original post, it will search the web and interpret/synthesise the results, and then use the findings to inform its answer).
While something like GPT-4 cannot "do more research before answering a question", it is well established that telling an LLM it's an expert in something and asking it to spell out the steps in its decision-making process does make it more accurate in its responses.
That is allowing the LLM to place minor steps into its own context window, where a more difficult result is more likely to follow as the next series of words. That is using our knowledge of how answers are determined to manipulate the output. It doesn't say anything about the "intelligence" of an LLM. Telling it that it's an expert does not have it somehow change the rote algorithm or have it "pretend" anything. It's just not that complicated.
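Concretely, the pattern being described is just prompt plumbing, something like this (the wording is illustrative, not a recommended recipe; assumes the v1 openai Python SDK and an OPENAI_API_KEY in the environment):

```python
# Sketch of the "expert persona + spell out your steps" prompting pattern,
# which only puts intermediate steps into the context window -- it does not
# change anything about how the model works.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system",
     "content": "You are an experienced US tax preparer."},
    {"role": "user",
     "content": ("Before giving a final answer, list the specific rules you are "
                 "relying on, then walk through your reasoning step by step.\n\n"
                 "Question: Is interest earned inside a Canadian RRSP taxable "
                 "income on a California state return?")},
]

response = client.chat.completions.create(model="gpt-4", messages=messages)
print(response.choices[0].message.content)
```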
Based on my experience trying to get LLMs to write useful code, I don't think LLMs are the right tool to optimize a novel technical problem. The best they can really do is offer common, generic optimizations that might apply to similar situations, or might be completely made up. Useful for brainstorming, but not reliable.
The whole AI space seems to suffer from this. Remember in the late 90s, when big tech companies (particularly Microsoft and IBM) confidently declared that conversing naturally with a computer via voice recognition would be the primary way people use computers within years? Meanwhile, 25 years later, voice recognition is finally getting into the realm of being good enough for unimportant things where accuracy doesn't matter.
The only mainstream AI-adjacent thing that I can think of which ever actually became “good enough” is OCR, and that took about 50 years.
I’m somewhat surprised that people aren’t more cautious, honestly. We’ve had _two_ “this time it’s different” moments in a little over a decade, now (self-driving cars and generative AI); while self-driving cars are at this point squarely in the “okay, maybe it’s not so different” bucket, I think people underestimate the risk that generative AI will, ultimately, also underwhelm.
It's like the 90/10 problem, right? LLMs today do things that are pretty amazing, but they are missing something that seems incredibly simple to laymen but is, and has always been, the actual hard problem.
I think the comparison to self-driving cars is incredibly apt. A self-driving car can do incredible things at superhuman speeds, but the simple things it can't figure out are actually really hard problems. For instance, seeing a floating grocery bag and knowing that it's drifting in the wind with negligible mass and doesn't need to be slowed down for or avoided. Something that actually requires a real understanding of the world, rather than just being trained on the past.
A similar situation for LLMs seems to be these "hallucinations". The layman sees the vast knowledge they seem to hold and thinks: just tell it to stop lying and it's fixed. The problem is that telling the model not to lie is only simple if it has some understanding, which LLMs don't seem to possess. Their hallucinations, to them, are just as true as when they happen to tell the truth. They have no method of discerning between the two.
It's more like a 60/40 problem right now. Generating reasonable-sounding text was a huge problem for decades. That seems to be a relatively minor part of understanding what those words mean in more context than "the likely sequence that follows a prompt".
Although I am impressed with how well the models perform when trained purely in a Chinese Room format. I think they have gleaned some understanding of some systems beyond just a superpowered Markov chain.
I'll admit that's possible. But if it's coded in a set of weights that happen to generate a good answer, how can we tell? I honestly think explainability is the most important problem in AI right now. And not explainability in terms of the model generating a series of words, but introspective explainability. If the models get to superhuman levels, maybe we don't need that, but until one or the other happens, I don't know how we can demonstrate anything beyond a super Markov chain.
I get the fusion power thing. There has been serious research going into fusion power.
But with the space elevator... I don't think even serious materials research scientists and engineers think we are anywhere close to a space elevator. Yeah, graphene / carbon nanotubes are exciting, but no viable manufacturing technique for the long braided strands required has come anywhere close. Our materials science is just nowhere near there, and the people doing the research understand how far off we are.
I’d believe space elevators could be around the corner if we had smaller analogues in production right now. Like a 2 mile tall building supported with miles of carbon nanotube wires.
Well yeah exactly. And that's not considering boring stuff like weather, wear and tear, and what happens if the cable snaps and the bottom half comes crashing down to earth. Even a super-light wire would presumably cause a catastrophic impact (and that's not considering if whatever payload came crashing down with it).
People were talking about space elevators before graphene or carbon nanotubes existed. I think the original idea came from Heinlein, and when I was a kid there was no material that was even remotely feasible, but it was still a tech that people talked about as being possible in the "near" future.
Decades of optimizing compilers, static analyzers, and software engineering practices when the solution to quality was right under our noses all along.
> This lack of meta-cognition is a big reason why this first generation of LLM-based products – ChatGPT, Stable Diffusion, GitHub Copilot – are expressed as copilots, as opposed to say autonomous assistants that can be relied upon to perform a task correctly. We encountered the same issue in our previous post about doing scientific literature meta-analysis with LLMs.
Is there any indication that a next generation of LLM-based products can solve the meta-cognition problem?
Both Sam and Yann have expressed a desire to pursue the idea of asymmetric computational load, where "harder" problems emit or recurse through more tokens while "easier" problems (like retrieving the current date) use fewer tokens. This assumes there is a fixed amount of intelligence brought to bear for emitting each token.
>Humans can estimate the complexity of a task without performing the task.
I would argue this ability drops precipitously as the complexity grows. That's part of the reason cost and schedule estimates are notoriously unreliable on big projects.
Just a meta comment here. When does a metaphor stop being a metaphor and start becoming a lie? Does it have to do with the speaker's intention, or can it just get away from you and nobody notices?
Not really, and it shows that LLMs, just like prior attempts at generalized AI, are going to hit a wall. They're trying to brute-force the issue, which in my opinion won't work.