Can GPT optimize my taxes? An experiment in letting the LLM be the UX (finedataproducts.com)
194 points by mmacpherson 10 months ago | 80 comments



I tried to get ChatGPT (4) to extract some tax info from a PDF for me (because this one place doesn't give you a 1099-B in any more usable form) and it completely hallucinated the result. It doesn't seem to read PDFs well at all. Claude did it correctly, though.
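
If anyone wants to take the chat UI out of the loop, here's a rough sketch of the same extraction done locally with pypdf plus the OpenAI client. The model name and prompt are my own assumptions, and the output still has to be checked against the actual form.

  # Rough sketch: extract 1099-B text locally, then ask an LLM to structure it.
  # The model name and prompt are assumptions; verify every number against
  # the actual form before using it anywhere.
  from pypdf import PdfReader
  from openai import OpenAI

  reader = PdfReader("1099b.pdf")
  text = "\n".join(page.extract_text() or "" for page in reader.pages)

  client = OpenAI()  # expects OPENAI_API_KEY in the environment
  resp = client.chat.completions.create(
      model="gpt-4o",  # assumed; any capable model works here
      messages=[
          {"role": "system", "content": (
              "Extract every sale row from this 1099-B as JSON with fields: "
              "description, date_acquired, date_sold, proceeds, cost_basis. "
              "Reply with JSON only.")},
          {"role": "user", "content": text},
      ],
  )
  print(resp.choices[0].message.content)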

It doesn't seem like there are any real optimization possibilities in the article, though. For household taxes there just aren't situations complicated enough to need optimizing, and the big things, like maxing out your 401k, you had to have done the previous year anyway.


Irritated by the UX and business model of web-based tax return preparation, I've looked into open source alternatives in the past. I hadn't come across the Open Tax Solver package mentioned in this article, which looks interesting. It seems like they all rely on printing and mailing the completed returns. Is there any hope of ever being able to e-file them? I don't understand why it couldn't be possible for Free Fillable Forms to support some sort of import format.


eFile requires a preparer ID as well as the person’s TIN, so it can’t be done directly.

Free Fillable Forms is in the business of being as obtuse and unwieldy as possible, on purpose. Having an import format goes against those goals.

This is deliberate. Get mad at your government.


This looks great, but mainly because of tenforty's capabilities.

The US tax code looks like an absolute mess, but I've often thought about implementing something similar for more favourable jurisdictions. They definitely have less taxation overall and fewer rules. Maybe it's not complex enough to warrant more than a few Excel formulas.

Coming back to tenforty - cool, the LLM allows you to access it via language, but that makes it pretty unusable for me: there is way more value (and more work needed, as you noticed) in building a proper UI.
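
For comparison, calling tenforty straight from a notebook is already pretty terse. A minimal sketch is below, though the evaluate_return name and its keyword arguments are assumptions from memory, so check the package docs rather than taking this as the actual API.

  # Minimal sketch of using tenforty directly in a notebook.
  # The evaluate_return name and keyword arguments are assumptions,
  # not a verified API -- check the tenforty documentation.
  import tenforty

  result = tenforty.evaluate_return(
      year=2023,
      state="CA",
      filing_status="Single",
      w2_income=150_000,
  )
  print(result)  # expected: total tax, effective/marginal rates, etc.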

It'd be cool to explore generating UI using LLMs instead.


What does a "more favorable jurisdiction" look like? Federal tax code is the same throughout the US.


All the other countries with income taxes? https://en.wikipedia.org/wiki/Income_tax#Around_the_world

That said, most countries do not have lobbying from tax preparation companies, so they have saner, easier-to-grasp tax codes, and the tax returns are painless and/or automated for individuals.


I don't think the wiki page is accurate.

Canada will tax you on income earned outside of Canada if they consider you to be a resident of Canada: https://www.canada.ca/en/revenue-agency/services/tax/interna...

The most important thing to consider when determining your residency status in Canada for income tax purposes is whether or not you maintain or establish significant residential ties with Canada.

Significant residential ties to Canada include:

- a home in Canada

- a spouse or common-law partner in Canada

- dependants in Canada

Secondary residential ties that may be relevant include:

- personal property in Canada, such as a car or furniture

- social ties in Canada, such as memberships in Canadian recreational or religious organizations

- economic ties in Canada, such as Canadian bank accounts or credit cards

- a Canadian driver's licence

- a Canadian passport

- health insurance with a Canadian province or territory

So if you are a Canadian with a passport, working in Japan for over a year, with a car at your parents' house in Canada, you could count as a resident of Canada and have to file taxes.


I don't want a capable if mercurial accountant.

I don't want an accountant who's optimizing around what most people think would work.

Surely optimizing your taxes is almost the definition of a worst-case scenario for an LLM? There is not only an authority which has the final say, but serious penalties for hallucinating.

How about trying to think about what would make better sense to attempt? This seems like a poorly thought-out idea on its face. Unless it's LLM-generated: 'what would be a good article about what GPT can do, what are people really interested in?'


I still want to know how LLMs can help with dreams, as was once described in something I read.


Funny enough, just today I asked ChatGPT a tax question, which it very confidently gave me the wrong answer to.

For those curious, the question was: is interest earned in Canadian RRSP accounts considered taxable income in California for state tax purposes?


What do you believe was incorrect? I asked ChatGPT 3.5 and 4 this question verbatim; it seems they both gave the correct answer: it is taxable income. 4 was very thorough, 3.5 not so much.


This is exactly the problem. The answer is not deterministic.


This is the answer I got: “Interest earned in Canadian Registered Retirement Savings Plan (RRSP) accounts is generally not taxable income in California for state tax purposes. California does not typically tax income earned in foreign retirement accounts like RRSPs. However, it's always a good idea to consult with a tax professional or accountant for personalized advice based on your specific situation.”


Treat an LLM as a confident smart person who isn’t an expert in anything, and doesn’t have access to resources to check their answer (unless you give it access).

If you assume the above, it’s unsurprising that it doesn’t get your question right.

What’s the correct answer to your question? If you say “research the question before answering” I bet it could probably solve it.


> Treat an LLM as a confident smart person who isn’t an expert in anything, and doesn’t have access to resources to check their answer (unless you give it access).

Sure, but I don't want that person helping me with my taxes.

Like, the idea that I get an answer and then I need to do research to figure out if it's correct... The whole point -- literally the entire point -- is that I don't want to do the research. If I wanted to do the research, I wouldn't use the LLM. I would just do the research. And now I'm using the LLM and I have to do the research anyway?

If we're talking about brainstorming or getting topic overviews or helping you start research, sure, I could see that. An LLM could be useful there. But asking direct questions and getting back plausible-sounding incorrect answers? This is a domain where an LLM just shouldn't be used at all.

I've brought up this analogy in the past, but it's like people proposing an LLM as a calculator, and when it gets answers wrong they say, "well, humans get math problems wrong too." Yes, they do. Why do you think we started using calculators instead of humans?

There is a reason why I don't go on Reddit to ask for tax advice. It's cool that people can make a computer that simulates that same use-case, but that is not a use-case that was useful to simulate. If I have to read the tax forms anyway, then I might as well just read the tax forms.


The LLM can do the research itself; current LLMs just often don't by default (and require a bit of prompting).

LLMs also use calculators (and again, if you prompt ChatGPT, it will use a calculator to generate the result).
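
For what it's worth, the calculator part is plain function calling. Here's a minimal sketch of the wiring with the OpenAI client; the tool name, schema, and model are my own choices, and a real version would evaluate the expression and send the result back in a follow-up tool message.

  # Minimal sketch of exposing a calculator tool to the model instead of
  # letting it do arithmetic "in its head". The tool name, schema, and
  # model are arbitrary choices for illustration.
  import json
  from openai import OpenAI

  client = OpenAI()
  tools = [{
      "type": "function",
      "function": {
          "name": "calculate",
          "description": "Evaluate an arithmetic expression.",
          "parameters": {
              "type": "object",
              "properties": {"expression": {"type": "string"}},
              "required": ["expression"],
          },
      },
  }]

  resp = client.chat.completions.create(
      model="gpt-4o",  # assumed
      messages=[{"role": "user", "content": "What is 23.7% of 184,250?"}],
      tools=tools,
      tool_choice={"type": "function", "function": {"name": "calculate"}},
  )
  call = resp.choices[0].message.tool_calls[0]
  # A real loop would evaluate the expression and send the result back
  # to the model in a follow-up "tool" message.
  print(call.function.name, json.loads(call.function.arguments))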

These things are solvable; we are still in the 'Commodore PET' phase of LLM development, i.e. impressive for today, but there's lots more to do to make it useful for most people.


The thing is, a confident smart person who isn’t an expert in anything, who DOES have access to resources to check their answers is still not someone I want doing my taxes.

The only way that an LLM doing my taxes for me is useful is if I trust it to the same level that I trust an expert; and that includes being able to very reliably know when that person is actually confident and when they're not. I haven't seen strong evidence that this is possible with the current LLM approach.

So if we're thinking of LLMs as a confident smart person who sometimes spouts bullcrap -- that's still just not a good fit for this kind of product category. It's a good fit for other tasks, just... not for these.

It's definitely not a good fit for putting in front of problems that are already pretty-well solved. Calculators replaced human calculation, it's a solved problem, you put the numbers into the calculator. Even natural-language input into the calculator is a reasonably solved problem, it's not a task that really requires a full LLM of the scale of GPT-4. So is there value in introducing a human-like source of error into the middle of that process again? I feel like the answer is 'no'.

> These things are solvable, we are still in the 'Commodore-PET' phase of LLM's development

Sure, if we get to the point where an LLM can give such good tax information that there literally is no need to read the tax documents to confirm any of it, then great! But that's not even remotely where we are, and products like this straight-up shouldn't be launched until we're at that point.

This is sort of like selling me a scuba tank and saying, "it might not dispense oxygen so make sure you hold your breath the entire time", and when I say, "well, what's the point of the tank then?" the reply I get back is, "well, eventually we're going to get it to reliably dispense oxygen!"

Okay, but the products are all launching now, I'm being asked to buy into this now, not eventually. The future doesn't change the fact that the current products that I'm supposed to get excited about are literally useless at a conceptual level -- that they do not even make sense given their caveats. I'm not complaining about AI chat bots or funny toys or even toy phone assistants, I'm complaining that companies like Intuit are literally launching tax assistants that give incorrect information and their response is, "well, double check what the bot says." Well that is not a tax assistant then, that's just work. The service that the chatbot is offering me is that I'll have to work more.

Fundamentally, there appears to be a misunderstanding from these companies about what a product is.


>Treat an LLM as a confident smart person who isn’t an expert in anything

When you say it that way, I'm reminded that IRL I find that sort of person to be just the worst. Largely because I've mostly encountered them in mailing lists and users group meetings, confidently leading other people down a blind alley. When it's just a 1-on-1, with me asking ChatGPT, I guess I'm more willing to accept it -- because the blast radius is limited.


LLMs don't have any clue what the concept "research the question before answering" means. Maybe if you asked it a series of leading questions to get some of the "research" in the context window first. Then you might as well just search. Otherwise it'll just confidently sound like it researched without any bearing on the answer.

LLMs are sloppy, and we shouldn't have to do "sociology" experiments to find out whether or not an LLM might give the right answer slightly more often if we trick it into adding more context to the answer.


> LLMs don't have any clue what the concept "research the question before answering" means.

GPT-4 in ChatGPT will absolutely do research if you ask it (in the practical sense that if you say “research first” like in my original post, it will search the web and interpret/synthesise the results, and then use the findings to inform its answer).


While something like GPT-4 cannot "do more research before answering a question", it is well established that telling an LLM it's an expert in something and asking it to spell out the steps in its decision-making process does make it more accurate in its responses.
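
A tiny illustration of that pattern, with wording that's just one example rather than anything canonical:

  # Two prompts for the same question; the second uses the "expert persona,
  # spell out your steps" framing described above. Wording is illustrative only.
  naive = "Is RRSP interest taxable income in California?"

  framed = (
      "You are an expert US state tax accountant. Is interest earned inside "
      "a Canadian RRSP taxable income in California? Spell out each step of "
      "your reasoning, and which rule applies at each step, before giving a "
      "final answer."
  )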


That is allowing the LLM to place minor steps into its own context window, where a more difficult result is more likely to follow as the next series of words. That is using our knowledge of how answers are determined to manipulate the output. It doesn't say anything about the "intelligence" of an LLM. Telling it that it's an expert does not have it somehow change the rote algorithm or have it "pretend" anything. It's just not that complicated.


GPT-4 absolutely can do more research.

Adding this to a prompt will encourage GPT-4 to do web searches in the background to augment its answer.


Yikes. I'm definitely asking my CPA that question and not an LLM.


First result on DuckDuckGo, for now.

Interesting bit of tax law trivia.

https://www.kahntaxlaw.com/california-reporting-requirements...


Don't know why, but I found Gemini more accurate for this kind of stuff. It yaps a lot, though, so you have to keep pushing it toward brevity.


Oh my God, here we go again. WHICH VERSION of chatgpt did you use?


Based on my experience trying to get LLMs to write useful code, I don't think LLMs are the right tool to optimize a novel technical problem. The best they can really do is offer common, generic optimizations that might apply to similar situations, or might be completely made up. Useful for brainstorming, but not reliable.


Brainstorming has been my approach.

It’s the rubber ducky method, except the duck writes back.


It’s like self driving cars 10 years ago. We were almost there! Just another couple years and the driver is extinct!

It turns out that a nice demo doesn’t mean you’re about to revolutionize the world.


The whole AI space seems to suffer from this. Remember in the late '90s, when big tech companies (particularly Microsoft and IBM) confidently declared that conversing naturally with a computer via voice recognition would be the primary way people use computers within years? Meanwhile, 25 years later, voice recognition is finally getting into the realm of being good enough for unimportant things where accuracy doesn't matter.

The only mainstream AI-adjacent thing that I can think of which ever actually became “good enough” is OCR, and that took about 50 years.

I’m somewhat surprised that people aren’t more cautious, honestly. We’ve had _two_ “this time it’s different” moments in a little over a decade, now (self-driving cars and generative AI); while self-driving cars are at this point squarely in the “okay, maybe it’s not so different” bucket, I think people underestimate the risk that generative AI will, ultimately, also underwhelm.


> I think people underestimate the risk that generative AI will, ultimately, also underwhelm

That’s the true danger of AI.


I mean, joking aside, it is one; while I don't think we're quite there yet, there's the potential for a bigger version of the dot-com bubble.


It's like the 90/10 problem, right? LLMs today do things that are pretty amazing, but they are missing something that seems incredibly simple to laymen yet is, and has always been, the actual hard problem.

I think the comparison to self-driving cars is incredibly apt. A self-driving car can do incredible things at superhuman speeds, but the simple things it can't figure out are actually really hard problems. For instance, seeing a floating grocery bag and knowing that it's drifting in the wind with negligible mass and doesn't need to be slowed down for or avoided. That requires an actual understanding of the world, rather than just being trained on the past.

A similar situation for LLMs seems to be these "hallucinations". Laymen see the vast knowledge the models seem to hold and think: just tell it to stop lying and it's fixed. The problem is that telling the model not to lie is only simple if it has some understanding, which LLMs don't seem to possess. Their hallucinations, to them, are just as true as when they happen to tell the truth. They have no method of discerning between the two.


It's more like a 60/40 problem right now. Generating reasonable-sounding text was a huge problem for decades. But that seems to be a relatively minor part of understanding what those words mean, in more context than "the likely sequence that follows a prompt".


Although I am impressed with how well the models perform given that they're trained purely in a Chinese Room format. I think they have gleaned some understanding of some systems, beyond being just a super-powered Markov chain.


I'll admit that's possible. But if it's encoded in a set of weights that happen to generate a good answer, how can we tell? I honestly think explainability is the most important problem in AI right now. And not explainability in terms of the model generating a series of words, but introspective explainability. If the models get to superhuman levels, maybe we don't need that, but until one or the other, I don't know how we can demonstrate anything beyond a super Markov chain.


Yes, and space elevators and fusion power are also just around the corner. 10 years from now is going to be great.

I've been thinking that for the last 40 years.


I get the fusion power thing. There has been serious research going into fusion power.

But with the space elevator... I don't think even serious materials research scientists and engineers think we are anywhere close to a space elevator. Yeah, graphene / carbon nanotubes are exciting, but no viable manufacturing technique for the long braided strands required has come anywhere close. Our materials science is just nowhere near there, and the people doing the research understand how far off we are.


I’d believe space elevators could be around the corner if we had smaller analogues in production right now. Like a 2 mile tall building supported with miles of carbon nanotube wires.


Well yeah, exactly. And that's not considering boring stuff like weather, wear and tear, and what happens if the cable snaps and the bottom half comes crashing down to earth. Even a super-light wire would presumably cause a catastrophic impact (and that's before considering whatever payload came crashing down with it).


People were talking about space elevators before graphene or carbon nanotubes existed. I think the original idea came from Heinlein and when I was a kid there was no material that was even remotely feasible but it was still a tech that people talked about being possible in the "near" future.


> It will reliably produce apt python code using tenforty that you can drop into a notebook and take from there.

Reliably? And how do they know it's reliable?


Well, there weren't any compile errors.


If only my code reviews were this easy.


LGTM


Works on my machine


that is necessary, but not sufficient, for the program to be correct, aka “reliable“.


Not in python it isn't.


I was surprised the first time I learned that Python will actually refuse to compile:

  def fibonacci(n):
     return n
because it knows that the Fibonacci sequence doesn't work that way. It's a fantastic language.


If it’s valid utf8 I consider it ready for production.


Decades of optimizing compilers, static analyzers, and software engineering practices when the solution to quality was right under our noses all along.


Just a couple of warnings.

/s


It could just be saying that it produces code, as opposed to non-code.


And how do they know it _reliably_ produces code instead of non-code?

I could tell you I back out of my driveway without looking every day and I reliably don't run people over.


I have bad news for you about the average tax preparer.


Somewhat related: here's an open source llm expense labeler for small businesses. https://github.com/realityinspector/llm_expense_labeler/ [OC]


> This lack of meta-cognition is a big reason why this first generation of LLM-based products – ChatGPT, Stable Diffusion, GitHub Copilot – are expressed as copilots, as opposed to say autonomous assistants that can be relied upon to perform a task correctly. We encountered the same issue in our previous post about doing scientific literature meta-analysis with LLMs.

Is there any indication that a next generation of LLM-based products can solve the meta-cognition problem?


Both Sam and Yann have expressed a desire to pursue the idea of asymmetric computational load, where "harder" problems emit or recurse through more tokens while "easier" problems (like retrieving the current date) use fewer tokens. This assumes there is a fixed amount of intelligence brought to bear for emitting each token.


And that you can accurately determine difficulty and/or confidence.


Accurately or the way humans do it?


Seems like passing the input with a question of "where should this input be handled?" into the LLM would be a good first step.
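
A crude sketch of what that first pass could look like, where a cheap call classifies the question before anything expensive runs; the model names and the EASY/HARD scheme here are arbitrary assumptions, not anything the labs have described.

  # Crude routing sketch: a cheap triage call classifies the question,
  # then the request goes to a small or large model accordingly.
  # Model names and the EASY/HARD scheme are arbitrary assumptions.
  from openai import OpenAI

  client = OpenAI()

  def answer(question: str) -> str:
      triage = client.chat.completions.create(
          model="gpt-4o-mini",  # assumed cheap model
          messages=[{"role": "user", "content":
              "Reply with exactly EASY or HARD: how hard is this question "
              "to answer correctly?\n\n" + question}],
      ).choices[0].message.content.strip().upper()

      model = "gpt-4o" if "HARD" in triage else "gpt-4o-mini"  # assumed
      resp = client.chat.completions.create(
          model=model,
          messages=[{"role": "user", "content": question}],
      )
      return resp.choices[0].message.content

  print(answer("What is the standard deduction for a single filer in 2023?"))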


Why would the LLM be any good at determining where the input should be handled?


There's a long history of "this looks hard, I won't implement/fix it" in GitHub issues the LLM can train on.


And then the LLM can probably know whether an issue can be fixed or not. That’s far from generalizing the problem though.


LLMs have shown some limited ability to generalize.

(Opinion) I think internally, they record how closely 2 words are in meaning based on the training data.

If everyone uses similar language to describe different problems, then the LLM should be able to at least act like it's generalizing.


GPT-4 uses a "mixture of experts" system, which is already sort of kind of like doing this.


One of the first tasks of GPT was to evaluate the emotion of a sentence.

Like determining whether a comment/review is good or bad.


can probably easily be interpreted from the text?


Did we just discover a new halting problem?


It doesn't seem like this should be the case. Humans can estimate the complexity of a task without performing the task. Why couldn't an advanced LLM?


Isn't a large part of software development's difficulty that people can't actually do that?


>Humans can estimate the complexity of a task without performing the task.

I would argue this ability drops precipitously as the complexity grows. That's part of the reason cost and schedule estimates are notoriously unreliable on big projects.


Just a meta comment here. When does a metaphor stop being a metaphor and start becoming a lie? Does it have to do with the speaker's intention, or can it just get away from you without anybody noticing?

Anyways. This problem isn't going to be solved.


JEPA architecture seems promising, though my last review was about 4 months back.


This paper explores combining metacognition with LLMs at the application level: https://replicantlife.com/


Not yet.


Not really, and it shows that LLMs, just like prior attempts at generalized AI, are gonna hit a wall. They're trying to brute-force the issue, which in my opinion won't work.


Just be careful it doesn’t invent some new tax laws in the process.


Nice work. Now try to implement a custom agent architecture and it will perform 10 times faster, 10 times better, 10 times cheaper.





