Saving the World from Spreadsheets [video]

bschne · on July 21, 2019

Here's the corresponding papers:

CheckCell - https://people.cs.umass.edu/~emery/pubs/CheckCell-preprint-O...

ExceLint - https://arxiv.org/pdf/1901.11100.pdf

wwarner · on July 21, 2019

Excelint https://github.com/plasma-umass/ExceLint-addin CheckCel https://github.com/plasma-umass/DataDebug

phonon · on July 21, 2019

What's the difference between https://github.com/plasma-umass/ExceLint-addin and https://github.com/ExceLint/ExceLint ?

emeryberger · on July 22, 2019

(ExceLint co-author and the speaker on this video here) The first one is the latest version which works in all modern Excel versions (Windows, Mac, Online). It’s a rewrite in TypeScript I did while on sabbatical this year at Microsoft Research (also it has faster and improved algorithms and more features to further improve its precision and usability). It is actively under development. Installation instructions are in the README. We are hoping to have it posted on the Microsoft store at some point. If you would like to see something like this in Excel, say so!

phonon · on July 22, 2019

Cool! Does it include all the capabilities of https://github.com/plasma-umass/DataDebug or should/could both be installed?

emeryberger · on July 22, 2019

Right now, ExceLint and CheckCell are not integrated. But this is something we would like to do!

DonHopkins · on July 21, 2019

I'd love to have something like ExceLint for Google Sheets, called Google ShizNits.

vogtb · on July 21, 2019

I have a habit of taking notes on these types of talks when I stumble upon them, so for anyone interested, here's a quick write up of the video. Some of it is paraphrased,.

* "[Microsoft estimates that 750M users of Excel. (%7 of the world population).]"

* Spreadsheet errors basically ruined the economy of Greece.

* "State of the art" is manually double checking your formulas. This is what the experts suggest…

* There's apparently an article out there from Forbes titled "Sorry, Your Spreadsheet Has Errors (Almost 90% Do)".

* Thomas Herndon is the guy that did manual spreadsheet verification to prove that there were errors.

* Talks about CheckCell, ExceLint.

* Input errors are a huge problem: "Roughly 1% of characters people mistype."

* "[1 out of 20 cells manually typed probably has an error.]" (Woah.)

* "[Users often add a digit or remove a digit, changing the order of magnitude]"

* 1) Manual data entry is hard to do correctly, 2) Writing formulas/code/Excel that uses that data is also hard to do correctly.

* One take away: Like code, if you're not testing it manually, and no one is testing it for you manually, and you're not writing tests, and the results aren't "gut-checked" or the results aren't used, why would it be correct? If a tree falls…

* "[A lot of public posted Excel sheets are filled with errors, or fudging.]" Look at the grades one that he shows around 19:36 to see what I mean.

* "The Bootstrap" - stats analysis using simulations. "[Resample samples]... random sample with replacement, repeatedly, to get distribution of output of calculation." Requires a homogenous range. Allows you to find "outliers" that drastically change the output. "What is the likelihood of observing one of the simulations under the null hypothesis, and if it's below [X] then we say it's unusual." Dude in audience at 30:29 describes it well.

* Formulas are easier to audit because they're usually named w/ column, etc. Data is hard.

* Goes through a long process of describing how they gather data, etc. Good stuff, but the short and the long of it is CheckCell is good.

* Loops back around to the global finance sheet that had a lot of errors: CheckCell worked on it.

* ExceLint - static analysis, ranks errors and their fixes. Can find off-by-one-like errors. Formulas using off-by-one ranges, etc.Excel has its own error finder, but it gives a lot of false positives and false negatives."Most errors are reference errors" - wrong row, wrong column, too short a range, too long a range, etc. "Looks for disruptions in rectangular regions." Not just outliers. Looking for irregularity, where regularity is basically low entropy. "[Capture the relationship of cells/ranges and their relationship to one another.]" Looking for relationships that minimize entropy ("[Because users aren't insane and they're putting things in a rectangular grid.]" Looks for every rectangle (i.e. range) that when merged with a neighbor, would remain rectangular. That is considered a potential fix. Then you can simulate the fix as if you already did it, and check the entropy on that.

* A lot of the errors, and their origins have to do with basic Excel features. Some of these features were outlined as best practices in Joel's "You suck at Excel" talk, which is kinda funny. Great power, Uncle Ben, etc.

* Dropped this one: "SUM is [something like 45% of formulas]".

(edit: spacing out list)

freqshow · on July 23, 2019

Thank you for posting this. I know that your taking a few moments to offer this bit of help to the community is appreciated by many more people than will ever reply to you.

Mvandenbergh · on July 21, 2019

I think this stuff is great. I'm a big fan of the real world and meeting people where they are. If your plan for saving the world from spreadsheets is to convince people to use Pandas instead (and there are certainly people on HN who think that way), you're not really serious about fixing things. Developing these kinds of checking tools, which could realistically be rolled out to many Excel users is a great step forward.

wwarner · on July 21, 2019

I don't know of anything similar for Jupyter notebooks or sql either. I guess testing with simulated data designed to yield expected results would be the way I test my stuff.

jnbiche · on July 21, 2019

> I don't know of anything similar for Jupyter notebooks or sql either

Exactly. It feels like we're falling into the exact same problem with Jupyter notebooks as with spreadsheets: they become increasingly used by professionals who code (but who aren't software devs) to create bug-ridden, unmaintainable, large-scale software because they become familiar with the tool they have at hand.

analog31 · on July 21, 2019

Yes, definitely. I'm guilty as charged to some extent. I'm a heavy user of Jupyter, and introduced Python to my workplace, where it is now used by a handful of scientists in R&D.

I'm hesitant to blame the tool. Instead, I think it's a matter of our exuberance and interest in producing new results that causes us to get ahead of our software engineering skills and build things that get out of hand. Also, the professional developers are simply not available to help us improve things. We're on our own.

hjk05 · on July 21, 2019

Seems that bootstrapping and outlier detection is the way to go independent of what language of “editor”(Eg notebook use) you are using.

The really cool thing here is that they parse the excel formula in order to automatically “figure out” how they can perform the bootstrapping.

baxtr · on July 21, 2019

and then people just love AirTable... especially the dev community I feel

chopete · on July 21, 2019

I tried using AirTable twice. It has a steep learning curve. The UI comes in the way often.

If they really have day-to-day users - they must be from the top down approach (somebody up the chain selected it) or forcefully committed ones or a have a perfect use-case.

It is certainly not for regular/most excel users.

verdverm · on July 21, 2019

I tried Airtable and became frustrated with their UX. Went back to the spreadsheet, also people prefer to receive them

larrydag · on July 21, 2019

I'm an R advocate. I call R the "Excel Buster". I think reproducible research is very important and tools should follow. Excel has its place in fast mathematics prototyping but for reproducible research it is quite lacking.

mycall · on July 21, 2019

Vast majority of Excel users would implode learning R.

FromHoiPolloi · on July 23, 2019

Can confirm. Source: am an excel user.

educationdata · on July 21, 2019

I actually think you are more possibly to make mistakes in R than in Excel. Because in Excel you always see the results directly, and you can even easily catch an anomaly in a single cell. But in R it takes one more step to see the results, and you probably won't see results of all the rows directly.

_Wintermute · on July 22, 2019

Especially when R really likes to carry on chugging along with your analysis spitting out nonsense values when it should have failed on something 50 lines ago.

    sum(1, 2, 3, 10)
    [1] 16 # great

    mean(1, 2, 3, 10)
    [1] 1 # wait, what?

on July 21, 2019

[dead]

emeryberger · on July 22, 2019

Please delete this. He is a grad student with a serious vision impairment.

bagrow · on July 21, 2019

He could have a vision impairment.

sbr464 · on July 21, 2019

Really? He clearly has vision trouble.