> Every cell can contain text, data or formulae; every cell, row and column may ...

aarondia · on April 29, 2022

Hey, I'm one of the founders of Mito (https://www.trymito.io/). This is a super interesting perspective. I agree with a lot of your thoughts and wanted to respond to a few in particular.

> They also happen to be utterly awful at anything even remotely large-scale.

I think there are a few reasons why spreadsheets struggle to scale to large datasets and complex analyses.

When it comes to data size, legacy spreadsheets like Excel were just built for an age with different data size expectations, and it's hard to upgrade that monstrous code base. That's why Mito uses Python to make all of the transformations. Python still has limitations, but it works for tens of millions of rows of data.

Complex analyses are the other big cause of pain when using spreadsheets. Specifically, spreadsheets can quickly get super messy when using a mix of tabular data and singular cell results. Once the structure of the spreadsheet loses consistency, it takes a lot more mental effort to untangle the spreadsheet.

These complexities arise because Excel is super un-opinionated about what types of analyses make sense for a spreadsheet and how those analyses should be structured. Because Mito is designed specifically for working with tabular data through pandas dataframes, we're able to make design decisions that enforce a bit more structure in the analysis. 1) All data in Mito must be tabular -- it both preserves the structure of the spreadsheet and fits the ideals of pandas dataframes. 2) Every edit you apply in Mito applies to the entire column (or dataframe for ops like filter, sort, pivot, etc.).

The result of 1 + 2 + the fact that Mito generates the equivalent pandas code for every edit makes it fairly easy to understand what transformations are applied to the data at any given time.
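To make that concrete, here is a rough sketch of the kind of pandas code that column-wide and dataframe-wide edits correspond to. It is an illustration only, not actual Mito output; the dataframe and the column names (`price`, `quantity`, `total`) are made up for the example.

```python
import pandas as pd

# Hypothetical tabular data; every column holds one kind of value.
df = pd.DataFrame({"price": [10.0, 25.0, 40.0], "quantity": [3, 1, 2]})

# A column edit applies to the entire column at once,
# not to individual cells.
df["total"] = df["price"] * df["quantity"]

# A filter applies to the whole dataframe.
df = df[df["total"] > 26]

# So does a sort.
df = df.sort_values(by="total", ascending=False)
```

Because each step is one line acting on a whole column or dataframe, reading the script top to bottom tells you which transformations have been applied so far.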

In practice, we see that complexity explosion is the result of combining data exploration and analysis. In the exploration phase users apply temporary filters, column transformations, etc. But they don't want to take those transformations with them. What counts as exploratory work versus analysis work is often not known until after the analysis, so it's a hard problem to design for, but it's something we spend a lot of time talking about. Our most recent work to address this area of complexity is optimizing the pandas code that we generate. We can use obvious cues, like the user deleting a column or dataframe that they had previously created, to tell us that the work was only exploratory and is no longer wanted. As a result, we can safely delete the Python code used to create those columns/dataframes.
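A minimal sketch of that idea, under loud assumptions: this is not Mito's actual optimizer, and `Step` and `prune_dead_columns` are hypothetical names invented for the illustration. It drops the steps that create a column which is later deleted, as long as no other step read that column in between.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded edit (hypothetical structure, for illustration only)."""
    kind: str          # "add_column", "delete_column", or "other"
    column: str        # column the step creates, deletes, or writes
    code: str          # the generated pandas line
    reads: tuple = ()  # columns the step depends on


def prune_dead_columns(steps):
    """Remove create/delete pairs for columns nothing else depended on."""
    keep = list(steps)
    for delete in [s for s in steps if s.kind == "delete_column"]:
        if delete not in keep:
            continue
        end = keep.index(delete)
        starts = [
            i for i, s in enumerate(keep[:end])
            if s.kind == "add_column" and s.column == delete.column
        ]
        if not starts:
            # The column came from the source data, so the delete must stay.
            continue
        start = starts[-1]
        window = keep[start:end + 1]
        # If another column was computed from this one, it was not dead work.
        used_elsewhere = any(
            delete.column in s.reads and s.column != delete.column
            for s in window
        )
        if used_elsewhere:
            continue
        survivors = [s for s in window if s.column != delete.column]
        keep = keep[:start] + survivors + keep[end + 1:]
    return keep


steps = [
    Step("add_column", "tmp", "df['tmp'] = df['price'] * 2", reads=("price",)),
    Step("other", "price", "df = df[df['price'] > 0]", reads=("price",)),
    Step("delete_column", "tmp", "df = df.drop(columns=['tmp'])"),
]
pruned = prune_dead_columns(steps)
```

Here only the filter line survives, since `tmp` was created and deleted without anything else reading it. A real implementation would also have to handle things this sketch ignores, such as filters that read the temporary column.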

> I want a tool I can open and use right now, not one where I have to make a whole new Python environment and notebook and so on just to do a simple calculation

I totally agree with this! Even as the creator of Mito, if I have to do some quick ad-hoc analysis, I'll end up opening Excel instead of launching Jupyter and then Mito. We're looking into ways of improving this though! One idea is to create a command like mito <file path> that automatically launches your Jupyter server and opens the file in Mito. Another is to add support for JupyterLab Desktop so you can get closer to launching with the click of a button.

Lastly, I'd love to engage with you more about this since you clearly have a lot of interesting thoughts. If you want, reach out to me at aaron <@> sagacollab (dot) com.

bradrn · on April 30, 2022

I completely agree with your assessment of why spreadsheets fail. Completely unstructured data plus a mixture of exploration and analysis is a recipe for disaster.

> These complexities arise because Excel is super un-opinionated about what types of analyses make sense for a spreadsheet and how those analyses should be structured. Because Mito is designed specifically for working with tabular data through pandas dataframes, we're able to make design decisions that enforce a bit more structure in the analysis. 1) All data in Mito must be tabular -- it both preserves the structure of the spreadsheet and fits the ideals of pandas dataframes. 2) Every edit you apply in Mito applies to the entire column (or dataframe for ops like filter, sort, pivot, etc.).

I tend to agree with this too, though there are cases where either (1) or (2) may need to be relaxed. Personally, I think static type checking will also turn out to be useful for structure enforcement: it’s nice to have things like built-in support for units, or defining enumerations for categorical data, or even just making sure that each column has the same type of data throughout. (This is also why I’m uncomfortable with building a spreadsheet on Python, for all the advantages such an approach has.)
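As a rough illustration of two of those checks (enumerations for categorical data, and one type per column), here is a small Python sketch. Note the caveat: these are runtime checks, so they only approximate what static type checking would give you, since a static checker would reject the bad data before anything runs. The names `Species`, `mistyped_rows` and `parse_categorical` are made up for the example.

```python
from enum import Enum


class Species(Enum):
    """Hypothetical enumeration for a categorical column."""
    CAT = "cat"
    DOG = "dog"


def mistyped_rows(values, expected_type):
    """Return the row indices whose value is not of the expected type."""
    return [i for i, v in enumerate(values) if not isinstance(v, expected_type)]


def parse_categorical(values, enum_cls):
    """Convert raw values to enum members; unknown categories raise ValueError."""
    return [enum_cls(v) for v in values]


weights = [4.2, 31.0, "n/a"]            # a stray string in a numeric column
bad_rows = mistyped_rows(weights, float)

species = parse_categorical(["cat", "dog"], Species)
```

Here `bad_rows` points at the offending row, and passing an unlisted category such as "bird" to `parse_categorical` fails loudly instead of silently creating a new category.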

> In practice, we see that complexity explosion is the result of combining data exploration and analysis. In the exploration phase users apply temporary filters, column transformations, etc. But they don't want to take those transformations with them. What counts as exploratory work versus analysis work is often not known until after the analysis, so it's a hard problem to design for, but it's something we spend a lot of time talking about.

Making a UI that is good for both data exploration and more in-depth analysis is an interesting problem, and I’m not convinced we’ve found a good solution yet. Spreadsheets are good for the former, but not for the latter; programming is good for the latter, but not the former. Inserting a spreadsheet into a notebook interface seems a reasonable compromise, but I’m sure it’s possible to find something better and more tightly integrated.

> Lastly, I'd love to engage with you more about this

Sure, thanks! I’ll send you an email now.