Show HN: Sourcetable – AI Spreadsheet and Data Platform
188 points by mceoin 3 months ago | 88 comments
Hi HN! I’m Eoin, founder of Sourcetable (https://sourcetable.com).

Sourcetable is an AI-native spreadsheet that syncs with all your data. Users pair with an AI copilot that helps them do their spreadsheet work, as well as more database-centric analysis and SQL.

Sourcetable syncs with databases including Postgres, MySQL, and MongoDB, and 100+ business applications including Stripe, Zendesk, Hubspot, Quickbooks and Google Analytics. That data is available in a spreadsheet, and any models you build automatically update in near-real-time as new data flows in. The core primitives are AI + spreadsheet + data sync + storage + compute.

If you want to play with Sourcetable today, the easiest way is to upload a CSV and start asking questions.

Who is it for? Sourcetable is for analysts, operators and finance folk doing data-centric work in a spreadsheet. Sourcetable’s spreadsheet-based AI assistant understands workbook range selection and can adjust scope context to the datasets you are working with. You can talk directly to your database and SaaS integrations, which is great for analysis, data search and retrieval, SQL writing & editing (including writing joins across different datasets), and automatic chart creation.

Niching down, if you work in operations at a <50 person startup or SMB and your company relies on a Postgres or MySQL database, Sourcetable is an affordable reporting tool with turnkey data infrastructure that doesn’t require code or engineers to set up.

Spreadsheets are the most used analytical tool on the planet. AI is a platform shift with broad applications. We are staying open-minded about users and use cases since everything is so new.

Backstory: I spent ten years working in de-facto operations and technical roles at startups. Sourcetable draws from that experience of needing better data tooling inside spreadsheets, and constantly hacking ad hoc solutions to fill the gap. Andrew (CTO / co-founder) previously had a deep learning company and was initially drawn to the idea that Sourcetable could be an operating system for the web. We’re both Aussie expats in the Bay Area, which is how we met. Internally, we think of Sourcetable as an application platform, with AI applications being a useful and interesting place to focus.

Features & Use Cases:

- Talk to your CSV files, spreadsheets, integrations, and datasets using LLMs.

- AI + data work: Text-to-SQL, search and retrieval from databases, LLM-based data analysis. (This is an entirely different experience to what Copilot/Gemini & Excel/Sheets provide, since they are thin workbooks and not data platforms.)

- AI + spreadsheet work: formula assist, workbook analysis, data cleaning, chart creation, error handling, summarization, chat, etc.

- Automated reporting: data is synced, reports you build stay up to date.

- No-code data access: give the business team safe database access so they will leave you alone!

- Centralizing data for cross-channel reporting (e.g. Postgres + Stripe + Mailchimp).

- Analyzing large CSV files: Sourcetable can handle multi-gigabyte files. (Google Sheets can’t handle large data and the experience in Excel is rather cumbersome.)

Technical Details: Sourcetable was built to be fast. It was also built to scale.

AI: Llama 3 (via Groq), Claude, GPT-4o, LiteLLM, custom LLMs

Frontend: DuckDB, React, ShadCN, AntV / Bizcharts, Plotly, CodeMirror, Hookstate

Backend: DuckDB, Python, Cassandra, Redis, NGINX, Cloudflare

Data Eng & Transformations: Fivetran, dbt, Apache Arrow, SQLGlot

Distributed Computing & Scaling: Daft, Ray, CloudFormation

Other: Linux Namespaces, Dill (U.Queensland)

A huge thank you to the open source community, and a special shout-out to DuckDB for being so damn fast. Thank you also to Groq & Anthropic for the rate limit increases in time for this ShowHN post!

Feedback: Product feedback is welcome! eoin@sourcetable.com




This is incredible. I uploaded a CSV with ~6000 rows containing campaign finance data for a particularly corrupt local politician and asked "what was the total contributed amount in [year]". Not only did it produce the correct answer (in around the same amount of time it took me to calculate it on my end) but it also seemed to understand that the spreadsheet was related to campaign finance in the "summary" portion of the response.

The most useful aspect was that I could ask "what was the total contributed amount between January and June of 2020" and get an accurate answer for that as well. Since the date column is provided as an "MM/DD/YYYY" string, I would normally have to do some boilerplate work to sanitize this.
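For reference, the boilerplate in question might look something like the sketch below. It is only an illustration: the sample rows and the column names (`contributor`, `date`, `amount`) are made up, and it assumes every date parses cleanly as MM/DD/YYYY.

```python
import csv
import io
from datetime import date, datetime

# Hypothetical sample data; column names and values are assumptions.
SAMPLE_CSV = """contributor,date,amount
A. Smith,01/15/2020,250.00
B. Jones,06/30/2020,100.50
C. Lee,07/01/2020,75.00
A. Smith,03/02/2019,500.00
"""


def total_between(csv_text, start, end):
    """Sum the amount column for rows whose MM/DD/YYYY date falls in [start, end]."""
    total = 0.0
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Parse the string date into a real date so it can be compared.
        when = datetime.strptime(row["date"], "%m/%d/%Y").date()
        if start <= when <= end:
            total += float(row["amount"])
    return total


# "Total contributed amount between January and June of 2020"
total = total_between(SAMPLE_CSV, date(2020, 1, 1), date(2020, 6, 30))
print(total)
```

Real exports usually need extra handling for blank cells, currency symbols, and malformed dates, which is where most of the tedium comes from.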

For my particular use case, the charting aspect left a few things to be desired - once I grouped campaign donations by contributor, I could only see the first 10 rows in the AI response, with no option to expand the output. But overall I was truly blown away that something like this is even possible for a small team to build.


> For my particular use case, the charting aspect left a few things to be desired - once I grouped campaign donations by contributor, I could only see the first 10 rows in the AI response, with no option to expand the output.

Insert it as a table on the page (you should see a button), it will then print the whole table result from that query into the spreadsheet. Also, you can check the SQL first and validate it, then print to table after that.

Try a few million rows and see what happens!


Also keep an eye out on the limit - we default to 10,000 to keep it snappy but if you want to make it larger it's a click away. The "summarize table" button should auto limit to 1B+ rows.


Can Office 365 Copilot do this too?


We are generally a generation ahead of Sheets and Excel. You might be able to do some of the things in the older software but it won't be a button press. The ability to run data queries (and let the AI do it for you) next to traditional A1 notation is something we invented.


Interesting. I think you're on to something here. I fully agree that a combination of spreadsheets and SQL are the ideal tools for data analysis -- not a SaaS GUI.

> Niching down, if you work in operations at a <50 person startup or SMB and your company relies on a Postgres or MySQL database, Sourcetable is an affordable reporting tool with turnkey data infrastructure that doesn’t require code or engineers to set up.

With the rise of AI, companies like Tembo that help you set up all in one databases, and tools like this, I'm increasingly of the mind that many companies should start bringing things like analytics and observability in-house. I don't see the need to pay Mixpanel or Datadog thousands of dollars per month when a self-serve solution that relies on tried and true tech is more or less at your fingertips.


Minus the AI part tools like this have existed for decades.

And companies are not dumping their SaaS tools and switching to them en masse.

Because (a) data silos have dramatically increased pushing dreams of a unified data schema out of reach, (b) technology stacks have become far more complex necessitating tools like Datadog and (c) competition is stronger than ever meaning that skimping on paying for tools like MixPanel is often short sighted and counter productive.

Companies like this will do fine and there will always be demand for them, especially in the SMB space. But there simply isn't the business value in bringing a lot of analytics and observability in-house in almost all cases.


Not yet. But in the analytics case, suppose you could build a tool that collected data on your own infrastructure, allowed you to write plain SQL against a PostgreSQL database to get whatever analytics data you need, had an AI-driven text-to-SQL option so non-technical users could get whatever analytics data _they_ need, and output everything to a universal interface, i.e. a spreadsheet? No vendor flavored DSL, GUI, or workflows to learn. That product would be tough to beat. It wasn't built in the past because it was hard. But with AI and something like Tembo or Timescale, is it actually hard anymore?


Managed services are useful.


Agree. A general thesis I have is that the API-ification of the web fragmented business information, and with every new SaaS tool we fragment our company's data further. The trend at all company sizes is to be increasingly analytical, but for SMBs it's too hard to get access to your data (mainly due to technical limitations). So it makes sense to centralize data somewhere, and we think that somewhere is inside the data tool that everyone actually uses: the spreadsheet.

Many other advantages of this data centralization too. Data + spreadsheets + compute is a nice application base for agents.


> So it makes sense to centralize data somewhere

Modelling and integrating datasets that you don't own is extremely hard.

Shopify for example updates their API every 3 months.

How much time and money do you think an SMB can afford to spend on this before the ROI becomes so poor that they abandon it entirely.


There is a separate answer here which is many (most?) SMBs can't afford technical folk, so the ability to integrate data at all, talk to it and model it (using SQL or AI), is already a big step forward for them.

My personal use case tends to involve a lot of Postgres data and transaction events for my reporting. We see "simple" businesses like parts manufacturers, print shops, vineyards, etc. all doing something similar.


Yes some integrations are excellent (hey Stripe : ), some are terrible (no comment on who). We're finding that LLMs are increasingly able to fill the gap around organizing data schema for that initial data prep piece where someone has to build the data tables that others consume. To your specific question/problem set, when a schema updates you end up with a "fuzzy schema matching problem"; we are solving that separately anyway for a separate product feature requirement.

Strong note here that the current state of technology is much better for SMB scale data and not enterprise scale data with messy schemas.
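To make the "fuzzy schema matching problem" concrete, here is a minimal sketch using plain string similarity from Python's standard library. This is not how Sourcetable does it (the comment above points to LLMs); the column names and the `match_columns` helper are hypothetical, chosen only to show the shape of the problem when an upstream API renames fields.

```python
from difflib import get_close_matches


def match_columns(old_columns, new_columns, cutoff=0.6):
    """Map each old column name to its closest new name, or None if nothing is close."""
    mapping = {}
    for old in old_columns:
        # get_close_matches returns candidates ordered by similarity, best first.
        candidates = get_close_matches(old, new_columns, n=1, cutoff=cutoff)
        mapping[old] = candidates[0] if candidates else None
    return mapping


# Hypothetical schema before and after a vendor API update.
old = ["customer_id", "order_total", "created_at"]
new = ["customerId", "order_total_amount", "created_at", "currency"]

mapping = match_columns(old, new)
for before, after in mapping.items():
    print(f"{before} -> {after}")
```

String similarity breaks down quickly on real renames (for example `amount` becoming `gross_value`), which is presumably why a semantic approach is attractive here.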


It’s amazing that Microsoft - given their focus on AI and decades of experience in spreadsheets - doesn’t offer this type of functionality. Corporate bureaucracy vs startup agility!


At risk of poking the bear, they should have done this decades ago. Except for LLMs they have had everything they needed to bundle this stack into a single product solution; this would be much better for users.

And yes! We're definitely of the opinion that as a startup we can outcompete the two trillion-dollar death stars when it comes to product experience. AI is a platform shift!


When I was at Microsoft Research, our team was working on related efforts. But, yes, while pieces have shipped into products, MS never released a complete solution at the time.

Some links that might be of interest:

- Table semantics: https://www.microsoft.com/en-us/research/project/table-inter...

- Entity semantics (video): https://onedrive.live.com/?authkey=%21AMIdbT4yVFaw2Kk&cid=A6...

- Natural Language in Spreadsheets: https://www.microsoft.com/en-us/research/project/gridbook/


Thank you for sharing! Would love to grab a coffee if you're ever in San Francisco. eoin@sourcetable.com


Actually, Microsoft does now have Copilot and Python in Excel, released just last week. Maybe a bit slow.


I don't know if the Python in Excel architecture has changed, but last time I saw it, it was insane and unusable for me (data is sent to MS servers where a Linux container executes Python: you need both a subscription and for the data in question not to be regulated)


Platform wise, the equivalent would be if they combined Excel, PowerBI, Data Factory and Azure into a single tool.

Technically you can combine these, but it’s a cumbersome experience and difficult for most people. Vertically integrating their equivalents simplifies things a lot.

(Small note: we don’t currently offer Python to users but likely will at some point)


> Niching down, if you work in operations at a <50 person startup or SMB and your company relies on a Postgres or MySQL database, Sourcetable is an affordable reporting tool with turnkey data infrastructure that doesn’t require code or engineers to set up.

I'm already using Retool for these kinds of tasks- what does sourcetable do that I can't already do with Retool?

edit: also, did you build your own spreadsheet engine, or use an off-the-shelf one? (also will it be open source ;P)


Category Comparison (table-based solutions): "How are you different than Retool/Airtable/Coda/Notion/Zapier Tables, etc."

The primary difference vs table-based solutions is that Sourcetable is a spreadsheet in the common sense of the word, similar to Excel and Sheets. We have A1 notation and cell-based referencing. This is what most users expect, and this flexibility/familiarity has a big impact on the breadth of users and use cases within a team.

The formula referencing system of these table-based solutions is usually limited to columns/rows (not cells), and is built on a set of SQL-based queries, which are much more limited than the 500+ formulas and functions spreadsheet users commonly expect.

Retool specifically: I tend to think of Retool as a lightweight custom-ERP software system, whereas Sourcetable is more like Excel + PowerBI + Data Warehouse, so we will generally be much stronger for reporting and analysis. We definitely have some overlap in potential users since technical operators should like us both. FWIW - Retool is an excellent product.
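For anyone unfamiliar with the A1 notation mentioned above, the sketch below shows how a cell reference such as `B3` maps to a zero-based (row, column) position. It is a generic illustration of the convention, not Sourcetable's engine; the `a1_to_index` helper is hypothetical and ignores ranges, sheet names, and `$` anchors.

```python
import re

# Column letters followed by a 1-based row number, e.g. "A1" or "AA10".
_A1_PATTERN = re.compile(r"^([A-Z]+)([1-9][0-9]*)$")


def a1_to_index(ref):
    """Convert an A1-style cell reference to a zero-based (row, column) tuple."""
    match = _A1_PATTERN.match(ref.upper())
    if match is None:
        raise ValueError(f"not a valid A1 reference: {ref!r}")
    letters, digits = match.groups()
    column = 0
    for ch in letters:
        # Column letters behave like base-26 digits where A=1 ... Z=26.
        column = column * 26 + (ord(ch) - ord("A") + 1)
    return int(digits) - 1, column - 1


print(a1_to_index("A1"))
print(a1_to_index("B3"))
print(a1_to_index("AA10"))
```

Cell-level addressing like this is what lets a formula point at any individual value, rather than only at a whole column or row.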


Hi I'm Andy, Cofounder & CTO @ Sourcetable.

We use a heavily modified licensed engine that prevents us from open sourcing everything (for now). We have plans to open source our agentic/plugin framework, and other parts of the system. We also have a strong ethos of contributing back to open source where we can (contributed back to Arrow, DuckDB etc.).

I'd also add that while everyone knows how to use and work with spreadsheets, we also provide a SQL layer on top that you can use to query data sources as an advanced user (we developed a nomenclature to work within sheets/across sheets/files/our data-warehouse). This allows more technical users to work side-by-side in the same environment as non-technical users without crossing pythonic or reporting boundaries.

On top of this, the AI assistant can answer most of the questions you might have of all this data.

I think as ML gets more sophisticated, we will in general need to be less technical. The "tooling" might even disappear, but we will still need something to communicate important data centric decisions. Whether you like it or not spreadsheets are the foundation of human research and operations and have been for thousands of years, and I feel humanity will need less complicated "tools" and we will keep to our roots.


Will you be able to share the name of the engine?


Seems like a customized version of Luckysheet

https://dream-num.github.io/LuckysheetDemo/

https://github.com/dream-num/Luckysheet/issues/1454

I am not related to either sourcetable or luckysheet


or possibly FortuneSheet which is an actively maintained fork of the original LuckySheet

https://github.com/ruilisi/fortune-sheet


I always wonder where these spreadsheet/database apps will land. Usually it falls flat for one of a few reasons I’ve observed:

- Fundamental gap in skillset, in that if you want to have ultimate flexibility to slice and dice the data and report on whatever you’re seeking, you’ve ultimately needed SQL skills in the past (which isn’t rocket science, but also isn’t something most accounting users can run with on their own).

- Fundamental desire of users to work with unstructured data. This goes back at least as far as Excel vs Lotus Improv in the early 90’s. Joel Spolsky talked about this, how they were terrified that Lotus Improv was going to kill Excel, because Improv was built to work with structured data, which users could then query and ask questions of to get any answer they want. But it turned out, as they observed people using both apps, there were zero users that used 100% normalized, structured data.

- Imperfect translation between spreadsheet and database. I’ve seen these work well 99.9% of the time, but at some point a column gets added or something that throws off formulas. And 0.1% error is basically catastrophic in accounting.

Maybe LLMs help overcome these challenges. Wish you luck.


Agree with you, and we're definitely trying to thread the needle!

We're generating the SQL to answer natural language questions, so folks can just get answers and results tables if that's all they need, with the option for power users to fiddle with the SQL either directly or via a query editor GUI.

There's a ton of use cases for working with unstructured and semi-structured data and that's coming down the pipe!


This is 100% the correct insight in my experience.

TL;DR, most technical people massively overestimate the technical / data abilities of regular spreadsheet users. We find simple use cases are best, and with each new LLM release the UX around more complex data improves significantly.

The reason we chose to build as a full-blown spreadsheet instead of just a table-based solution was that we saw that most people want the flexibility of a regular spreadsheet, but access to their (structured) business data. Table-based solutions wedge you into AI and you can never get out of that.


You might want to check who is blacklisting you and request to unblock. AdGuard blocked sourcetable.com as "Scam".

https://www.dropbox.com/scl/fi/np92pyo0eb0zphysc9wwz/screens...


Thanks for reporting! Taking a look now.


Hey do you mind removing this comment? Seems it might have caused us to be blacklisted?


This post was briefly flagged after that initial comment, which is what dioptre was referring to, not AdGuard. Phrasing was too ambiguous, hope that clears it up!


I'm sorry, I've missed the "delete" window. But may I know how a comment here (after it being blacklisted) about it being blacklisted will be the reason to be blacklisted?


¯\_(ツ)_/¯ deciphering magic algorithms.

Very much appreciate the bug report. Thank you!


This makes zero sense


We thought there might be basic word filters that tripped the algorithm. “Scam” being the offending word here. (Turns out it was something else that tripped a flame war setting. Probably a comment that later got flagged by the mods.)

Anyway, fun fact: it turns out our domain used to be a scam erectile pills website!


This is amazing. I’ve been scouting for such a solution as we’ve outgrown excel. Giving it a spin


A very common use case we see is SMBs having outgrown their spreadsheet but not wanting to move to a full-blown BI tool. They want the power, but not the change in interface/medium.

I didn't go into details above but a nice thing is that we leverage cloud compute and storage, so you can query billion-row data in sub-second time. (Courtesy of Duck!)


This looks great! Well done. My concern is that there's not a single mention of data privacy, which is a red flag for anyone coming from an enterprise world. Get that sorted and I'd consider using your tool for actual work.


Hi - we encrypt everything at rest and use metadata to improve LLM performance.

We don't yet have enterprise-grade data permissioning or compliance certificates like SOC 2. Those will come in time.


Data schemas and questions asked about data are just as much a company's IP as the data itself. It frustrates me that startups suddenly draw a line here for their own convenience when tuning generative AI. If I (as an employee) publicly posted all our database schemas and report descriptions, I would obviously be violating IP laws. Yet vendors think this "metadata" is fine to use and potentially leak across users.


> we encrypt everything at rest

And where are the keys stored?


We use Amazon’s key management service.


Awesome, have you got any mining-specific worked examples or spatial examples? Thinking about lidar point clouds and running deltas for stockpile management. Looking at building a new mine, and typically at any mine site there are Excel macros embedded in the operations which might take an hour to run. Often developed by older engineers, who will default to Excel. Any suggestions on how best to drive technical user adoption (aside from dropping it on the kids in the engineering departments, can't wait that long)?


The underlying datatypes we support in our data-warehouse support 3d and 4d data. So we can do vector queries on these and do transformations over different spaces. I think given what you need we can put your data in our data-warehouse, and then present it to the older engineers in an excel format with 3d plotting. We might want to chat about the details though, give me a holler at andrew@sourcetable.com


Yes actually! My cousin is a mining engineer so I spent a bunch of time playing around with mining data during testing. Turns out all New South Wales government data is public. Right now you can talk to any CSV or database using LLMs. I've also played around with a bunch of marine biology datasets too!

(p.s. I think Andrew, CTO, is going to jump in here as he has more experience in this space.)


Can you email me -- eoin@sourcetable.com -- more about the Excel macros? This might be easy to help you out with agents. A lot of compute-intensive stuff that takes ages in Excel is nearly instant in Sourcetable because we are leveraging cloud compute, but it really depends on your use case.


Possibly off-topic:

If i want to enable a simple internal web application (say React) with ability for users to manage master data tables, their schemas, and PK-FK relationships using a simple lookup -- as close to a simple spreadsheet as possible (upload and download CSV or view/edit data in a spreadsheet view) ... what are some good components or libraries that I can utilize?


Brilliant work team, great to see this being launched.


Thank you!


That's a spicy example dataset!

I like that it's able to infer information from the context of the cells, e.g. being able to run a query across continents when the data only contains the country.

Being able to ask it to interpret the results is helpful, it would be cool if it automatically told you if there was enough data to have statistical significance in the conclusions it was presenting.


You may see that we try to suggest follow-up questions or question improvements where we think better context-in will result in a better result-out.

Curious what will happen if you modify the question to be more explicit?

I have seen that PMs and data-trained folk tend to be very articulate in asking for exactly what they want and that tends to lead to significantly better LLM responses.


Congrats on the launch! It's been great working with you from the Daft side


Thanks mate!


Cool.

How did you build so many integrations so fast?

Selfishly, would love to see Streak (CRM) integration as well.


Fivetran will build a Streak integration if you bring a customer who will use it (sometimes as little as one): https://fivetran.com/docs/by-request-program


Mostly Fivetran, a little Airbyte, and a few custom integrations. Would love to add Streak (can you get it into Fivetran? We can usually crank those integrations out within an hour.)


p.s. I was a massive Streak user at a previous (sales-driven) startup. Big fan!


An improved and more interactive version of Google Sheets' explore tab. Looks good!


You did it, you somehow made Excel even more error-prone.


If you found any bugs please let me know! andrew@sourcetable.com


Very cool. It would be great to have auto complete across cells.


Yes we don't yet have the full auto-suggest magic that Sheets offers, but you can click-drag for auto-complete the same way Excel offers.

We released Sourcetable today with the AI chatbot & AI data analysis features, but a very limited cell-based AI (only "summarize" and "fix formula"). We'll be releasing a big AI-based magic-autofill solution in the coming weeks.


What external checks are included to verify the chatbot output?


Wherever possible, the chatbot output is deterministic, in that to answer a query, we're realtime generating and running code or SQL against your data. Our LLM orchestrates that, and finally evaluates whether the output correctly and adequately answers the question.

We also extensively use synthetic data and examples to guide and constrain our models.

Another way we're ensuring good-quality output is to ensure good-quality _input_ -- by enriching the detail and specificity of the user's question, and asking the user to disambiguate when we determine the question is too broad.


None of that is external verification. You're using the generating tool to do the verification, leaving the door open for opaque errors.
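One way to read "external" here is a check that does not share a code path with the generator. The sketch below is only an illustration of that idea: `generated_sql` stands in for SQL a model might produce, the table and rows are invented, and the independent check is a hand-written aggregate in plain Python. It says nothing about how Sourcetable actually evaluates output.

```python
import sqlite3

# Invented sample rows: (ISO date, amount).
rows = [("2020-01-15", 250.0), ("2020-06-30", 100.5), ("2020-07-01", 75.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contributions (day TEXT, amount REAL)")
conn.executemany("INSERT INTO contributions VALUES (?, ?)", rows)

# Stand-in for model-generated SQL (hypothetical).
generated_sql = (
    "SELECT SUM(amount) FROM contributions "
    "WHERE day BETWEEN '2020-01-01' AND '2020-06-30'"
)
sql_total = conn.execute(generated_sql).fetchone()[0]

# Independent recomputation that never touches the generated query.
independent_total = sum(
    amount for day, amount in rows if "2020-01-01" <= day <= "2020-06-30"
)

agrees = abs(sql_total - independent_total) < 1e-9
print(sql_total, independent_total, agrees)
```

The obvious catch is that someone still has to write the independent check, so this only works for a fixed set of known-answer questions, not for arbitrary user queries.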


very nice app. just the front-end browser component alone is super-slick. but expecting users to bring their data to your platform is a barrier to adoption.


Are you open sourcing the product for non-commercial use?


Would love to but unfortunately there are pieces we can't open-source for various reasons. We'll open source bits and pieces over time, and generally are excited to start blogging about AI & technical learnings now that the product is out of stealth mode.

Small plug for the analytics tracker we are using which Andrew (CTO) built and is open source: https://github.com/sfproductlabs/tracker


Does this use function-calling on the backend?


We use our own implementation of function calling orchestrated by chain-of-thought. The CoT allows us more granular control over the function calls, rather than zero-shotting and hoping the LLM selects the right functions.


do you prompt chain-of-thought, use a model trained on chain-of-thought, or use o1?

I ask because I am interested in trying out function-calling (without the problem you mentioned of zero-shot sometimes getting it wrong, and then having to validate it and re-send it with a correction prompt if it's invalid)
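The validate-and-resend loop I have in mind looks roughly like the sketch below. It is a generic illustration, not tied to any vendor SDK: the `TOOLS` registry, the function names, and the JSON shape (`name` plus `arguments`) are all assumptions. The returned error strings are what would go into the correction prompt.

```python
import json

# Hypothetical tool registry: function name -> required argument types.
TOOLS = {
    "run_sql": {"query": str},
    "make_chart": {"kind": str, "column": str},
}


def validate_call(raw):
    """Return a list of problems with a model-produced function call (empty if valid)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["call must be a JSON object"]
    name = call.get("name")
    if name not in TOOLS:
        return [f"unknown function: {name!r}"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    errors = []
    for param, expected in TOOLS[name].items():
        if param not in args:
            errors.append(f"missing argument: {param}")
        elif not isinstance(args[param], expected):
            errors.append(f"argument {param} must be {expected.__name__}")
    for extra in sorted(set(args) - set(TOOLS[name])):
        errors.append(f"unexpected argument: {extra}")
    return errors


print(validate_call('{"name": "run_sql", "arguments": {"query": "SELECT 1"}}'))
print(validate_call('{"name": "make_chart", "arguments": {"kind": "bar"}}'))
```

This only catches structural mistakes; a call can pass validation and still pick the wrong function for the question, which is the part I am hoping chain-of-thought helps with.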


Looks interesting, commenting so that I can remember.


Huge congrats on the Launch ! You guys crushed it with all the thought and hustle behind creating such a valuable tool. Wishing you nothing but success on the ride ahead!


Thankyou!!!


do you use any agentic prompting techniques?


Yes definitely. We wrote a multi-step reasoning engine similar to o1. Depending on the route and depth in the chain of response, prompting techniques vary. The LLM router was pretty important to get right. On input: synthetic question enhancement has yielded positive results. Asking the right question is subtly important, and we find it is sometimes better to just slow a user down and help them improve their question instead of relying on multi-step reasoning from the get-go. On output: checking answers via mixture of experts and other techniques is often necessary -- "truthful but wrong" answers are where some tricky gotchas lie. Eval is hard.

One under-discussed feature of fast databases is that they are especially necessary in an agent-centric world. E.g. if you're running recursive SQL to pathfind towards an answer, it has to happen with minimal latency otherwise you break the user experience. Our interface looks like a spreadsheet and users mentally benchmark against spreadsheet-latency speed. They won't accept 20 second query response times they might get from their data warehouse.
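As a rough illustration of one literal reading of "recursive SQL", the sketch below runs a recursive common table expression that walks outward from a starting table along made-up foreign-key style links. It uses SQLite purely because it ships with Python; the `edges` table and its contents are invented, and the walk assumes the graph has no cycles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [
        ("orders", "customers"),
        ("customers", "regions"),
        ("orders", "products"),
        ("invoices", "orders"),
    ],
)

# Recursive CTE: start at one table and follow links one hop at a time.
# Assumes an acyclic graph; a cycle would keep producing deeper rows.
query = """
WITH RECURSIVE reachable(name, depth) AS (
    SELECT ?, 0
    UNION
    SELECT e.dst, r.depth + 1
    FROM edges AS e
    JOIN reachable AS r ON e.src = r.name
)
SELECT name, MIN(depth) AS hops
FROM reachable
GROUP BY name
ORDER BY hops, name
"""

result = conn.execute(query, ("orders",)).fetchall()
for name, hops in result:
    print(name, hops)
```

Each extra hop is more work for the engine, so when an agent issues many such queries in a row the per-query latency adds up quickly.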


great product. congrats on the launch


Thank you!


[flagged]


You can't post personal attacks to HN. We ban accounts that do, so please don't do it again.

We detached this subthread from https://news.ycombinator.com/item?id=41597837.


[flagged]


"Andrew was making a fool of himself" is a personal attack. This is not a good forum for this kind of information sharing. You can maybe do it, but only if you're ultra careful; if it's a personal critique of someone who might read the thread, and it's easy to write, it's probably problematic.


It was in reply to a dishonest comment that was taken down. It's normal to use elevated rhetoric when correcting the record about a false idea being put forth and made public.

Furthermore saying "probably because the CTO is a dick" to a random person online - in response to a question about why their startup is blacklisted - is definitely fair game. How many people have said "{insert CEO name} is a dick" on these forums?

https://hn.algolia.com/?q=is+a+dick

I think you guys are way too sensitive, unless "reputation management" is part of the YC offering it's a bad look to even be involved in my comments like this. Let people say what they want within reason.


I'm just a random guy here; if what I wrote wasn't helpful, disregard! Sorry about that.


TBF, however, random guys (and gals, and all that lieth betwixt and beyond) are who effect most of HN's moderation through votes and flags.

So if someone such as tptacek suggests that a comment goes beyond HN guidelines, it's generally worth consideration.


I'll just speak up so this comment isn't left unchallenged. I'm friends with the folks on the Sourcetable team and I think they're good people, including the CTO.


Congrats on the launch! It’s wild to see AI stepping into spreadsheets like this. Pretty soon there won’t be a part of our workflow AI hasn’t touched.


Thanks _hfqa! We think there's massive potential here. It's a big platform shift, and spreadsheets weren't really impacted by the mobile or cloud compute waves, so it's a space long-overdue for disruption. (The last shift was back when Google Sheets took spreadsheets to the browser 17 years ago!!)



