Ditching Excel for Python in a legacy industry (amypeniston.com)
313 points by afkmango on Dec 30, 2020 | 279 comments



I'm a research actuary working in reinsurance. Here is why I think Python creates more problems than it solves from the standpoint of most insurance business users:

1.) Environment management. There are many solutions for managing python dependencies, my favorite is Docker + pip. Good luck getting actuaries and underwriters to write Dockerfiles etc, and good luck getting I.T. to support Docker on Windows desktops. Like it or not, the best "feature" of Excel is that it is mostly the same on every corporate Windows machine.

2.) Unless you are using numpy / numba, Python isn't that much faster than VBA (if at all). Both are "compiled" to interpreter bytecode.

3.) Speed of development and traceability. Excel takes a lot of getting used to, but if you know the purpose of the spreadsheet (e.g. a reserve calculation), it's relatively easy to figure out what a mangled and convoluted formula is doing (Excel has a "debugger" that allows you to evaluate formulas by highlighting pieces).

4.) LAST BUT NOT LEAST. Many financial and actuarial (insurance) calculations are inherently recursive. Excel has built-in memoization (in the dynamic programming sense). It also has a reactive programming model. Good luck implementing that in Python without tripping up on the huge amount of function call overhead, even if you use a memoization decorator.


It is great that you respond with succinct reasons.

People who have used Python for some years seem to forget just how clunky it really is. I've been using Excel since 1990, and sure, it has its own warts, but Python is a very rudimentary tool compared to Excel.

Python is a machine shop. Excel is a car. It may be a lemon, but it's a functional car.

This is a great example of programmers not being able to see the forest for the trees. Reminds me of the "Once Linux gets a desktop it will take over the world" debate from circa 1997-today.


> Reminds me of the "Once Linux gets a desktop it will take over the world" debate from circa 1997-today.

Linux did take over the world, just not on the desktop. It was on servers and mobile, which now have more users than desktops or laptops (edit: servers via the web).

Technology gets its warts fixed when it grows along an explosive new market, especially if the market ends up being larger than the last.

Python is currently riding the data science wave, and that wave is growing. If that market expands to the point where large scale data-science type work wags the dog of VBA/excel, the clunkiness[1] will work itself out.

[1] - I don't actually understand what's clunky about Python in the context of the article. Seems like a reasonable direction in a complex market (reinsurance) driven by actuaries. I'd be surprised if newgrad actuaries/stats people aren't using Python?


Absolutely. I worked at Intel and our distributed computing pools went from a combination of SunOS and AIX machines to Linux in about 3 months, essentially overnight (back in the '90s). It was an astonishingly fast deployment.

Linux dominates the server world AND the entertainment device world (hello busybox & gstreamer!)

[1] Regarding clunkiness of Python: mostly it is the packages, installation, and 2.x vs 3.x nightmare that persists. Everyone seems to forget the initial pain getting the Python env to work, esp. when it comes to cython native compilation issues / arch wheels, unsupported packages, etc. The only issue I have with python is it is extremely challenging to make cross-platform deployments for single-executables. I've tried three different approaches and they were all trainwrecks. Once that is ironed out, I'll be switching from Electron to whatever Python offers.


I think the 2.x vs 3.x issues have mostly been resolved by now. I don't think I've hit one for a long time, and even StackOverflow answers are more likely to be Python 3 now.


macOS still ships with Python 2.7 and has dependencies on it, and npm-gyp only recently switched to 3.x. Same with Python SDR. It depends what you use: less popular packages are still languishing.

but that discounts the tens of thousands of projects that are already out there that are in use and need conversion.

It'll probably take 3-5 years for it to really go away.


Yes, that's true. And aren't some versions of RedHat still on 2.7 too?

But I class this as a packaging issue more than a 2.7 vs 3.x issue: you see the same problems with (as a random example...) different versions of OpenCV - people not using virtual environments have problems even if they are all on 3.7.

When I think of the "2.x vs 3.x problems", I'm thinking more of the language and core library level incompatibilities.


macOS ships with both py2.7 and 3.7; `python` calls 2.7, and `python3` calls 3.7


As of macOS 11.0 there is no python anymore. You have to download it.


Really? YAY!!! I upgraded to Big Sur and it is still there... but I don't know if it is a legacy since I've been TimeMachining my machines since 2012 and things persist.


Hmm apologies, I was wrong.

The Catalina release notes[1] said they would remove it and I was sure they mentioned it again in this year’s WWDC but apparently it’s still included after all.

[1] “Scripting language runtimes such as Python, Ruby, and Perl are included in macOS for compatibility with legacy software. Future versions of macOS won’t include scripting language runtimes by default, and might require you to install additional packages. If your software depends on scripting languages, it’s recommended that you bundle the runtime within the app.”


> Linux did take over the world, just not on the desktop.

I'm honestly not sad that it never happened.

It took over my desktop around ~1998 and I wonder if massive adoption of Linux on the desktop would have benefited me or been worse (from my perspective).

As it stands literally every single tool I want/need to do my job is already available for Linux and indeed many of those tools are simply better on Linux (docker on a mac is horrible, I have a work issued current gen macbook pro, I use it purely for testing docker set-ups and then it goes back in its case).

It's going to sound elitist, but not dumbing down the platform for the average user is a benefit to me.


Well, it would certainly make my life easier if we had Microsoft Office, the Affinity Suite, and Lightroom on Linux, which would probably have happened if Linux were the dominant desktop.


Depends really on your proficiency at Python and the task at hand. For many "small fry" tasks Excel is perfectly fine and faster. Good when the table data is meant to be in the foreground.

For things where data exploration and formulas should be in the foreground or where data and formulas should be strictly separated python (or a python/jupyter notebook) has tangible benefits (e.g. really good list syntax).


> "Once Linux gets a desktop it will take over the world" debate from circa 1997-today.

Pretty much all servers run on linux.

Linux also dominates smartphones in the form of Android phones.

Chrome OS is also Linux, and its market share is currently around 6%.

I mean, other than desktops Linux pretty much is everywhere.


> Linux also dominates smartphones in the form of Android phones.

Not really.

See any Linux specific APIs on the NDK official APIs?

https://developer.android.com/ndk/guides/stable_apis

Android is a mix of Java and Kotlin based frameworks, ISO C and C++, POSIX subset and a couple of additional libraries.

Whatever kernel gets used is an implementation detail for Google and Android device makers.

It could be completely replaced in Android 12, and the ecosystem would continue to work.

> Chrome OS is also linux, and it's market share is currently at around 6%.

Basically Android (already mentioned above) and Web stacks.

https://chromeos.dev/en/android-environment

https://chromeos.dev/en/web-environment

Ah, but it does expose Linux you say, https://chromeos.dev/en/linux

Indeed, except for the small detail that, as shown in the Google I/O talk, it is actually a design similar to WSL 2, running a second kernel in a hypervisor-based environment.

The real kernel powering ChromeOS doesn't get exposed to userspace and can also be replaced at any time, if Google so desires.

In fact, in the near future Android and ChromeOS could be running on top of Fuchsia and most consumers wouldn't even notice.


I wonder what the problem is with standardizing companywide around python@3.x, numpy@1.1x and pandas@1.x. At this point these can all be considered mature, and why on earth would an org which is not developing these packages, nor heavily consuming outside code (because they didn't with Excel either, in any sensible way), decide to jump on the "but we need rrrrrollling release"-fad bandwagon?
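For most shops that could be as small as one shared, pinned requirements file that every machine installs from identically (the pins below are purely illustrative, not a recommendation):

    # requirements.txt - the single blessed environment, bumped deliberately
    numpy==1.19.4
    pandas==1.1.5
    openpyxl==3.0.5    # xlsx interop with the existing workbooks

    # every machine installs the same thing:
    #   python -m pip install -r requirements.txt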


The problem isn't feasibility, it's resources. Building and rolling out a standardized environment, and maintaining it, will cost millions of dollars. It shouldn't, but it does. And for what added benefit? The end-users don't want it, you'd have to spend another couple million for a lateral move at best. More than likely, you'll end up with a pile of Python spaghetti code that runs slower than the spreadsheet (see point #4 about massively recursive calcs).


Why does building and rolling out a standardized environment cost so much? Could you break down the requisite steps and resources required to achieve this?

Thank you, I appreciate it :)


Up-front costs (mostly salaries, but all I.T. projects are "billable"):

1.) Getting buy-in from solutions architect, software architecture, information security, I.T. management. This will be a 6 month process.

2.) Getting buy-in from actuarial management and audit. Another 6 month process.

Recurring annual costs (over 10 years):

3.) Contractor at $150 an hour = $300K annually

4.) Contractor PM at $50 an hour = $100K annually

5.) Information security compliance hoops, getting it to play nicely with the myriad of endpoint security tools, etc.

6.) Ongoing maintenance and support (failed rollouts and upgrades, user desktop support, user training)


This isn't realistic at all. Assuming actuaries know Python like they know Excel, the only other added cost is someone technical to slap together a working environment, which isn't particularly hard when compared to the state of what Excel offers in a collaborative environment.

You just don't jump from "humans in Excel" to "CI/CD perfected pipeline" overnight, nor do you need it.

Excel shops still have costly expenses rewriting entire workflows/re-doing Excel files constantly as people come and go, it's not like there isn't already maintenance cost with the current method.


reminds me of standardization of the shipping container, not sure of all the details of how that push had happened though


Funny story about Excel on corporate machines. A couple of years ago the company I work for got bought by an Italian company. When we finally migrated the Windows users over to the corporate Office installs, a bunch of people found that Excel wouldn't work for them. Things like sum(A1:A20) were syntax errors.

After a bunch of digging I worked out that the localisation from corporate meant they suddenly had Italian function names, not English. Very confusing.

Excel is a program that is both incredible and terrifying to me. There are ways of building spreadsheets that are reliable and auditable. Then there's how 95% of people do it.

You can start out really quickly and make great progress. But it tends to grow and metastasize before you know it.


It must be easier to build an auditable and reliable solution using a high-level programming language and concepts like source control and automated testing.

Excel is only easier if you aren't interested in building an auditable and reliable solution that might have some hope of being maintained after you have left the company.


That's the thing, most Excel workbooks start out as a one-off then gradually get adapted and extended until they're load-bearing.

They're often built by specialists in another dept who definitely wouldn't consider themselves programmers.

Doing it 'properly' would probably mean having to spec out the problem, get a budget, maybe wait a few months for someone to look at it. And the same thing every time the requirements change.

Excel is available today and they can get started solving their immediate problem straight away.

After it's been in use for a couple of years and shown value someone takes a look and sees the Lovecraftian horror it's become.


> until they're load-bearing

This cannot be stressed enough. I've outlived generations of finance teams at many startups, and I've seen firsthand the masterpieces/abominations left behind in Excel. Imagine a dozen sheets with ad-hoc queried data copy/pasted from System A/B/C/D into Excel, with formulas that feed formulas that feed formulas. Sometimes columns are inputs (seasonality adjustments for monthly forecasts), sometimes they're outputs (modeled growth * last year * seasonality adjustment), and more often than not they're right next to each other, and maybe they have different cell background colors or a black separator line. Maybe.

And this is just finance. For many e-commerce businesses, planning is done in Excel with equal zeal.


I remember hearing about a mythical spreadsheet floating around for modelling something to do with our national grid a few years back.

It would take about 12 hours to calculate, and would error out before finishing about 30% of the time. It needed to be run once a day for something reasonably important.

I don't use Excel much these days, but I do point people to a video if they do plan on doing anything:

* [You suck at excel - Joel Spolsky](https://m.youtube.com/watch?v=0nbkaYsR94c)


Someone who does not use Excel well will probably write spaghetti code as well.


I'd rather reverse engineer spaghetti code than a spaghetti Excel spreadsheet.


Most probably because you do not know Excel!? For someone who does know Excel well and understands finance concepts, it'd be easier to understand the spaghetti Excel spreadsheet and find where it doesn't work properly!


You’ve been lucky with your spaghetti code experiences, then.


At least with spaghetti code I can set a breakpoint or add print statements and get an understanding of the execution path.


And then you find a bug that costs you $$$$.

I worked for a company that used an opaque Excel spreadsheet as part of its accounting system. Turns out there were bugs, and we found a massive shortfall, one of the contributing factors in the collapse of the company.


This is the real problem. It's not with excel per se, but the complete lack of automated testing and source control.


Python works with source control and tests so this would indeed be a problem with Excel per se.


But Python doesn't have to, I've seen plenty of python with 1000 line functions, no tests and no source control. It's particularly common in Jupyter Notebooks.


> There are ways of building spreadsheets that are reliable and auditable. Then there's how 95% of people do it

Do you have any pointers to learning materials on how to do this? Would be interested in reading more on it.


This Twitter thread is a good start: https://mobile.twitter.com/keith_ng/status/13079610874515251...

There's also the Joel Spolsky video I linked in another comment: https://youtu.be/0nbkaYsR94c

I don't actually use Excel much so others might have better resources.


Oh I've already seen the Spolsky video, but I seem to recall for the most part (except for using tables, which is great) it's about how to use Excel effectively, not about how to produce maintainable sheets. The Twitter thread looks good though. Thanks!


Localisation is the first thing to check when debugging Excel formula errors. "Is it a colon or a semicolon it responds to?" is my go-to way of approaching the issue.


This brings up a good point, which is that Excel supports localization, while Python just assumes you know English.


Hey fellow reinsurance actuary! I totally agree that Excel has its place in modeling, especially one-offs, and your criticisms make sense. That said, we have been moving a lot of our calculations to Python. We have had way too many rickety tools to move files or send emails (“first you open this spreadsheet and click this button, then you open this spreadsheet and click this button, then...”), and way too many version control issues over the years. Python solves those nicely.

I’m curious about docker + pip, why do you like that better than poetry or pipenv?


One reason why Python is so successful is that it plays very nicely with C code. Many of Python's libraries are thin wrappers around native DLLs.

For example, numpy is a wrapper around a BLAS DLL (e.g. Intel MKL). Pipenv manages the Python side of things, but doesn't exert control over the system DLLs (like Docker does). Anaconda gets very close to what Docker does (by managing DLLs). Have not used poetry, so can't comment.

Ultimately, like most dependency management issues, lacking a stable DLL environment won't be a problem until it is :)
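For what it's worth, numpy will tell you which BLAS/LAPACK it was built against, which is a quick way to see whether a given install is on MKL, OpenBLAS, or the reference BLAS:

    python -c "import numpy; numpy.show_config()"
    # prints sections like blas_opt_info / lapack_opt_info, e.g.
    # libraries = ['mkl_rt'] on an Anaconda/MKL install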


numpy is much more than a wrapper around a BLAS DLL. BLAS implements three sets of operations: Level 1: unary and binary vector-vector operations, and one transform. Level 2: matrix-vector operations. Level 3: matrix-matrix operations (most famously the dgemm routine).

Perhaps some blas implementations offer more features, but that would defeat the purpose of a standard interface.


Okay, I can totally see it now. We are still at the stage where people seem to think that I’m neurotic for worrying about the python side of things, so DLLs have not been on the radar. ;)


In Node.js, native libraries are a PITA. The compatibility API breaks on a schedule every 6 months, dependencies get updated by OS distros sometimes breaking stuff, and then you rely on the OS being able to compile the library. I hope Python has better native interop.


Much like the referenced Excel spreadsheets becoming unwieldy, so does a dev's machine [0].

0. https://xkcd.com/1987/


It would be considered incredibly bad practice for such a dev's machine to be used to perform almost any calculation of significant business importance. It's why mediocre Tech Executives are able to appear like they got some work done by focusing on "no access from dev to production."

With Excel there is no such separation, and when there is it would make a great punchline to an XKCD or Dilbert cartoon.


From the submitted link:

"The desire to price increasingly complex deals with increasingly large datasets"

Bingo! Most people use Excel when they actually should use a database. I am sure you can use Excel with a database like MS Access, but then again, who does?

To your arguments: 1. "Good luck getting I.T. to support Docker on Windows desktops." Yah. Great experience working with Excel on Linux.

2. You can always link compiled code for stuff that needs to be fast. But in the end most people won't use either Python or Excel for HFT.

3. " it's relatively easy to figure out what a mangled and convoluted formula is doing"

https://www.sciencemag.org/news/2016/08/one-five-genetics-pa...

https://www.washingtonpost.com/news/wonk/wp/2016/08/26/an-al...

https://www.sciencealert.com/excel-is-responsible-for-20-per...

4. Maybe. Not sure it is really an issue.


The case for using a database with Excel is at least as strong as for using Python or C# to model Excel data. There are excellent free adapters for MySQL and PostgreSQL.


I think these are all fair points and a good reflection of the downside of Python. But there are also some pretty huge upsides. My view is that for any model of significant complexity, the pros of Python outweigh the cons from a technical point of view.

- Abstraction. It's very difficult to effectively abstract parts of a model in Excel. It's a bit like a doctor having to 'model' a human being as a collection of atoms, rather than having abstractions like organs, cells etc. This makes it very hard to build re-usable components, so analysts end up reinventing the wheel. You also quickly hit a 'complexity ceiling' in Excel, above which mistakes and errors become much more likely, and complexity is very difficult to manage.

- Existing libraries provide a huge range of sophisticated calculations and operations for which we don't need to write any code.

- Separation of concerns - particularly separating data from model. Easy in Python, hard in Excel. Another aspect of this is that using data science software promotes the use of tidy data[0] (i.e. clear thinking about how data should be structured).

- Unit/integration tests. For complex models, these are essential. Users of Excel (even extremely clever/competent people) don't have a great reputation for producing error-free spreadsheets, and I think this is an important reason why, alongside copy-paste errors. The tools for testing in Excel/VBA are rudimentary.

- Version control. This is particularly important for historical reproducibility because it allows us to run past models, and also understand what has changed in the codebase since.

I appreciate some of the above is also possible in VBA, but if you're writing an entire model in code and not really using Excel at all, my view is it's better to use a more sophisticated programming language.

There is also an important cultural point of having to re-skill everyone, and I can see that in some context that means in the short run at least, Excel/VBA may still be better overall.

I've written a bit more about all of this here: https://www.robinlinacre.com/transforming_analytical_functio...

[0] https://vita.had.co.nz/papers/tidy-data.pdf


> I appreciate some of the above is also possible in VBA, but if you're writing an entire model in code and not really using Excel at all, my view is it's better to use a more sophisticated programming language.

Which is why many VBA experts eventually adopt VB.NET instead of jumping into a completely foreign language, with the benefit that it is actually compiled to native code (JIT/NGEN) if performance is ever an issue.


Yes, I was wondering why Python was the automatic choice considering C#.Net has typesafe native APIs for Excel on Windows.


I was more thinking about migrating away from Excel fully rather than interfacing with Excel from Python.

I agree that to interact with Excel programmatically VBA is a better choice (and no doubt C#/VB.NET as well, but I have no direct experience). For what it's worth, for interacting with Excel and Office more generally, I've always thought VBA is extremely well designed.


I use both Excel and Python, and like both. They solve different kinds of problems, even within the same context.

Excel is fantastic for what I would describe as linear modeling, building a graph of effects in single data models. I reach for Python when I need to fundamentally transform the data model at points to answer the desired question. That is difficult to the point of being impractical in Excel, especially if the data model is large or exploratory. Python is more programmable in this regard but also lacks the strong static typing that would be useful in such work.

I can’t imagine not using either.


If you have valid reasons to make the move from Excel to Python, why not consider Julia? Environment is easier to manage (Pkg.add), the language "looks like Python and walks like C", math-friendly style possible, just-ahead-of-time compilation resulting in high performance (enough to not need native code implementations), interactive development (Jupyter was named for Julia-Python-R after all), @memoize may be enough for you and Pluto gives you reactive notebooks.

Bonus - tools are also emerging to make stand alone distributables.

Disclaimer: I neither work for nor am I affiliated with Julialang. I just use it.


Why do actuaries refer to workstation/desktop computers with more than 16 cores as "super computers"? It's embarrassing, but sometimes I give in and say "the super computer" because I'm in a hurry and they'll give me a blank stare if I call it a workstation or anything like that.


They really are supercomputers though. Do you know how much faster a modern PC is compared to say a Cray-1? Especially if it has a decent graphics card.


So a typical actuary’s technological reference point is stuck in 1985? That explains a lot about excel and sas egp. But seriously, calling a fairly standard computer in the tech world a “supercomputer” is just another example of the underlying attitude in insurance that makes many actuaries recoil in horror about the thought of “programming” aka learning python or any programming best practices.


Yeah but at this point my phone is comparable. If being faster than a Cray-1 qualifies something as a supercomputer, the definition is meaningless now.


> problem #1: Environment management

Great observation. Python environment management is getting simpler, but is off-putting for people without a software background. It's unclear whether even CS majors get enough classroom exposure to package & dependency management to utilize Python efficiently.

I’m more optimistic about an on-prem deployment of Jupyter Notebooks or Sage Math Cloud as a way to hide a lot of the setup complexity. More like a wiki for math. Curious if anyone has stories/tips to share (good or bad)?


Quant here, at my firm we've deployed a JupyterHub server which provides users with a production docker image, so that analysts and portfolio managers can perform analyses without installing python, dependencies and sql drivers locally. It is working well and spurs interest in Python across the wider org - so we let everyone use it.

Similar to the OP's case, I think the real selling point of Python over Excel is advancing the capabilities and the scale of the business. Talk of different programming languages falls on deaf ears in finance - show what can be done instead. With Python, Zipline and notebooks I can manage a global equity portfolio, continuously adding active strategies and adapting to real-world changes and constraints. And backtest! Excel is great, but there is an upper bound to what can be reasonably done without a thriving open source community.


That's really interesting. I've been working on a JupyterHub / JupyterLab / Python based product for the insurance industry - choosing this for all the reasons you've cited. Would be really interested if there are any points you can share, e.g. are you using Kubernetes, and if so how have you found it?


We use Kubernetes, according to our IT/devops guys it was pretty straightforward to deploy in Azure with Jupyterhub's KubeSpawner module + documentation. A few people are quite eager to learn Python / code in general, so we try to make it convenient. One common use case would be to work with existing excel spreadsheets, so the notebook volume storage should be mountable in Windows. The file upload/management in notebook servers is quite obtuse IMO.
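For anyone wondering how much config that side of it takes, here's a minimal jupyterhub_config.py sketch along those lines (the image name and resource limits are made up for illustration, not our actual setup):

    # jupyterhub_config.py -- sketch only
    c.JupyterHub.spawner_class = 'kubespawner.KubeSpawner'
    c.KubeSpawner.image = 'registry.example.com/analytics:2020.12'  # prod image with pandas, SQL drivers, etc.
    c.KubeSpawner.cpu_limit = 2
    c.KubeSpawner.mem_limit = '8G'
    c.Spawner.default_url = '/lab'  # open straight into JupyterLab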

If a recurring task can be reasonably parameterized then a Streamlit app might be a better choice in some instances. I've developed a monitoring application for our portfolios where I can track daily asset weights, underlying data points, computations etc. Not displaying code ensures that the output can be consumed by a wider audience.


Thanks especially for the Streamlit suggestion.

We've tested JH with K8s in GCP, which was straightforward also. With a small team, though, we're tending towards a single VM deploy (based on "The Littlest JupyterHub"), which looks a lot easier to maintain.


Thank you for sharing.


Is this really an issue for the use case being described? Environment management is obviously a significant consideration for software developers who need to keep track of versions etc, but it sounds as if these users primarily want to use the fundamental numpy functions.

They could install one of the scientific python stacks (e.g. anaconda) or just install packages globally with pip.


Things always break over time. I have a very technical co-worker, a systems admin, with a broken Python stack on Windows. No idea what's wrong.


so he did a chmod -w on their folder and it broke. Well, that's a feat I still have to work out!


Anaconda switched its software license recently to require large companies to purchase a license, so now people must wade through the purchasing department before installation & use.


Agree completely on 1. but not sure on 2. and 4. - I think if you're using Python and need performance then you will be using Numpy etc - would be interested to hear if there are instances where this doesn't work.


numpy is great for vectorizable calculations, but many calcs (particularly for long-term life contingent risks, i.e. reserves), are not vectorizable except in the most simplistic cases.


Numba can usually speed up non-vectorizable things by JIT compiling and potentially parallelizing.

I mean, in an ideal world you'd use Julia or a Cython extension, but if you already have something in Python/numpy, numba only requires you add a decorator to your function and it gets jitted.
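To make that concrete, here's a toy sketch of the kind of time-sequential loop numba handles well (the calc itself is made up, not an actual reserve formula):

    from numba import njit
    import numpy as np

    @njit
    def project_fund(initial_prem, charges, int_rates):
        # forward-recursive in time: each month depends on the previous one,
        # so it can't be expressed as a single vectorized numpy operation
        n = charges.shape[0]
        fv = np.empty(n + 1)
        fv[0] = initial_prem
        for t in range(n):
            fv[t + 1] = fv[t] - charges[t] + fv[t] * int_rates[t]
        return fv

    # 30-year monthly projection with placeholder inputs
    fv = project_fund(10_000.0, np.full(360, 20.0), np.full(360, 0.004))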


Thanks - sorry, I'm struggling a bit - wouldn't they be vectorizable across the portfolio or across scenarios for stochastic calculations? Maybe it's because of different backgrounds (mine in the UK), but I can't recall seeing the deeply nested function calls that you're alluding to.


Think about the calculation of an insurance product with a Fund Value. Everything is forward recursive with respect to time. Been a while, so I might butcher some of this. It is likely that you'll want a 30 year projection, so you'll call fundValue(30 * 12)

fundValue(t+1) = if t > 0 fundValue(t) - charges(t) + intCred(t) else initialPrem

charges(t) = netAmtAtRisk(t) * costOfInsurance(t) + riderCosts(t) + policyFee(t)

netAmtAtRisk(t) = (FaceAmt - fundValue(t))

Now think layering on decrements

surrenderMargin(t) = lapseDecrement(t) * (surrenderCharge(t) * fundValue(t))

mortalityMargin(t) = mortalityDecrement(t) * netAmtAtRisk(t)

investmentMargin(t) = (earnedRate(t) - intCred(t)) * assetBase(t)

Now think layering on calcs necessary to calculate the assetBase (e.g. reserves + required capital)...
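If it helps, a rough sketch of what the memoized-Python version of the above looks like (function bodies are simplified placeholders, just so it runs; only the call structure matters):

    from functools import lru_cache

    INITIAL_PREM, FACE_AMT, MONTHS = 10_000.0, 100_000.0, 30 * 12

    @lru_cache(maxsize=None)
    def fund_value(t):
        # forward-recursive in time, like a fill-down column in Excel
        if t == 0:
            return INITIAL_PREM
        return fund_value(t - 1) - charges(t - 1) + int_cred(t - 1)

    @lru_cache(maxsize=None)
    def charges(t):
        return net_amt_at_risk(t) * cost_of_insurance(t) + policy_fee(t)

    def net_amt_at_risk(t):
        return FACE_AMT - fund_value(t)

    # placeholder assumptions so the sketch is runnable:
    def cost_of_insurance(t): return 0.0005
    def int_cred(t): return fund_value(t) * 0.004
    def policy_fee(t): return 5.0

    print(fund_value(MONTHS))  # ~360 stack frames deep before memoization kicks in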


That code looks very familiar! I see what you mean now. I don't think I've ever seen this implemented recursively though - can certainly see how this would end up being problematic if you tried to do this in Python!

ps Thanks so much for taking the time to set this out.

pps I've been working on something that implements a highly optimised version of this style of calculation - with a DSL to describe the calcs - it can do a 30 year cashflow projection for 1m contracts in about 1 min on a quad core laptop. UK focus initially but might have wider application?


Clojure now has libpython which would enable you to add the recursion.


The calculations typically performed by actuaries are individually all fairly simple but in aggregate without automated testing and version control it is reckless to use them in pricing or portfolio calculations.


For 1. I would say R is a good option. It works relatively well everywhere and has an ok IDE, lots of packages that make life easier (tidyverse). I also wouldn’t recommend python for exactly that reason.


I thought the point was to make a web interface in Python that queries some database and runs the calculations on a powerful server, in which case the database and server are for sure going to be faster than Excel (including the WITH RECURSIVE SQL statement), and only a browser is necessary for users, which makes the solution not only multi-platform but also remote-friendly. As such, your post really makes me wonder what they are trying to do at all.


Those are fair criticisms, but I think under Windows, Docker+pip is the worst way of managing it.

2 and 4 are surprising, it would be interesting to do a benchmark and maybe figure out the best way to do stuff in Python for your case

About 3, I suppose that's why developers should break up complex expressions (and not only in Python)


It's absolutely true that an untested 1000 line Python function is no better than an untested 1000 line VBA function.



I have wondered about this; the lambdas and tables could be of huge benefit against some of the most egregious Excel mistakes, but that isn't an argument for Excel's use.

The problem is not one of it not being possible to do automated testing or source control in Excel. VBA is Turing complete so anything is possible, it's more one of not thinking, or understanding why, those things are important. Once you do come to think of such things as important you will quickly never use Excel for anything but the most basic calculations.


Hi, this is a late reply but I am actually pulling this off. Your points above all make sense but are around the desktop paradigm. If you move to the cloud paradigm not only most of the points go away but you gain a lot by having data all in one place (S3) and strict collaboration (github). Specifically the Python env problem goes away if you ask analysts to work online with notebooks (ie jupyter hub).


Just because one can trace one's way through an Excel spreadsheet should be of negligible comfort in the aftermath of a multi-million dollar loss.


The article specifically mentions that the environment is the browser, using Python notebooks. In the author's use case there is no Docker, no pip, and no Windows desktop on which I.T. has to manage Python on client machines.

I also wonder, if I.T. support were to use docker, are they doing that for python, or would they still continue to use docker even if they move away from python?


I have been an actuary for 20 years and people have been trying to replace Excel for at least 15 of them. But Excel is not going anywhere. I think it will be even more popular with the recently introduced LAMBDA function. Excel formulas will be Turing complete and we won't have to use VBA anymore.


From an actuary's perspective: is Google Sheets ever entertained as an Excel alternative?


No, for many reasons: workbook calculation performance, incompatibility with advanced/complicated spreadsheets, worse UI/keyboard shortcuts, slow UI, and a different language than VBA for custom functions. (It isn't necessarily worse, but it's the same thing as trying to convince a department to switch from language X to Y, including rewriting every application that's written in X. Also, every person you've ever hired was familiar with X, and has no experience with Y.)

R/Python/SAS etc. are a much more compelling alternative to Excel than Google Sheets (to say nothing of the actuarial modeling software packages that are used already for more rigorous/complicated problems).

If an insurance company decided to move all of their MS Office users to Google Docs/Sheets etc, my money is on the actuarial department paying for Excel out of their budget without a moment's hesitation.


It's been a couple years since I've used it, and I didn't feel it was a comparable alternative. It's decent for about 80% of spreadsheet users, but the keyboard shortcuts were lacking and it was missing some functions that I rely on.

For keyboard shortcuts, most Excel power users don't use the mouse, so while it sounds trivial, it's really hard to feel productive when you have to hunt around for the right button to click.

From an enterprise perspective, Excel is so entrenched it would be a 5-10 year effort to port existing spreadsheets to sheets. Practically speaking, most companies wouldn't see the benefit.


> From an enterprise perspective, Excel is so entrenched it would be a 5-10 year effort to port existing spreadsheets to sheets. Practically speaking, most companies wouldn't see the benefit.

And at the end of the day, it would have worse performance than Excel both in calculation speed and _much_ worse UI. One of the reasons Excel is so much better than Sheets is speed. Insurance companies spend hundreds of thousands of dollars a year on actuaries. Even if Excel cost them $500/year/user, it would be easily worth it for actuarial departments.


The hundreds of thousands of dollars a year that insurance companies spend on actuaries is the real reason that no one will ever come off excel. The maths is not very complicated but the lack of source control and testing means each bugfix or added feature introduces another bug to fix next week or feature to add the week after.


Used GSuite (Google Workspace?): it's fine, and it is improving, but there's a lack of shortcuts even for basic tasks (I can change the font in Word with just a keyboard; try that in Docs without a mouse). The dealbreaker is the 5 million cell (not row nor column) limit, which is even lower than Microsoft's old limits (more than 15 million cells, which was increased in 2007 to you-have-a-serious-problem-if-you-somehow-fill-this-limit cells).


Not to mention that if you're using Power Query, Excel's 1 million row limit doesn't apply in the workbook queries either.


Alt+/ is the magical shortcut that makes this easy, you can simply search a font name and hit enter.


Not an actuary - but I think this crosses domains: the lack of comprehensive shortcuts makes Google Sheets DOA (dead on arrival) for my uses.


Uploading proprietary data to a cloud is a big no.


Companies are more worried about this than they probably should be.

The average enterprise network is nowhere near as secure as people act like it is.

Where do you think your email is hosted? With few exceptions, I'd expect it's provided by a cloud provider these days.


Would the people doing the implementation need to be able to choose and manage dependencies? Or could they do the work inside a prepared environment (comparable to Excel in some sense).


that costs millions!


I think you are mocking me, but I'll bite.

Insurance companies are contractor heavy. They bill at $150 an hour. That's $300K annually per head. It won't take long to get to a million when you add PM overhead, information security oversight and governance, etc. Again, it shouldn't cost that much, but it does.


From your POV, what would be a better replacement for Excel, provided it's open source?


This is the argument made against all software stack advancements. Nothing to do with industry. But when the benefits outweigh the hurdles, change happens. And if I was starting a new insurance company (which I've considered) I'd be doing our work in code not xls, and probably Python. Having RCS, Numpy, unlimited compute, unlimited storage, all gives me an advantage over my competition. :-)

As to the memoization, that is not hard to manage in Python.


"As to the memoization, that is not hard to manage in Python."

Yes it is. Recursive calls for financial calculations easily go hundreds of thousands of calls deep. This is why high-end actuarial modeling software either decomposes it into a dependency graph and unrolls function calls where possible, or just "brute-forces" it by being a thin wrapper over c++, i.e. using operator overloading on ::operator().

I've seen ill-fated efforts of capable software developers attempting to unroll the recursive function calls, and ending up with 2000 line functions that are impossible to maintain.


I can't visualize what you mean by deep recursive calls. What are the calculations that mean you can't just use fairly bog standard python for? I didn't realize there was "big data" in accounting.


Why doesn't annotating these functions with @functools.lru_cache(10000000) work?


First of all, let me say that I've tried it :)

Your recursion needs to "bottom out" in order for that to work. If you don't get a stack overflow / out of memory error, you're good. But bear in mind that there will be thousands of stack frames before you get to time=0 (the recursive base case) in a long-term liability actuarial calc.

The recursion isn't simple like the Fibonacci sequence. It's more like:

f(t+1) = if t > 0 (f(t) + g(t)) * h(t) else initial_constant

g(t) = f(t) + q(t) - d(t)

q(t) = ....

d(t) = ....


Although at first glance this formula is written recursively, one doesn't have to (and shouldn't) implement it using recursion, does one? Just making f, g, q, d arrays and then looping over t should be good, or is there more to this formula?
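Something like this sketch is what I have in mind (treating f(0) as the initial constant, and inventing q, d, h inputs just so it runs):

    import numpy as np

    T = 360
    initial_constant = 1.0
    # q, d, h would come from data; placeholders here
    q = np.full(T, 0.01)
    d = np.full(T, 0.005)
    h = np.full(T, 1.002)

    f = np.empty(T + 1)
    g = np.empty(T)
    f[0] = initial_constant
    for t in range(T):
        g[t] = f[t] + q[t] - d[t]          # g(t) = f(t) + q(t) - d(t)
        f[t + 1] = (f[t] + g[t]) * h[t]    # f(t+1) = (f(t) + g(t)) * h(t)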


Appreciate the curiosity. In this small trivial case, yes that works. But what happens when something in the logic changes?

You wind up needing to know the order of calculations since things are no longer lazily evaluated via recursion. This is a problem when you have dozens of "columns" (i.e. recursive functions or arrays as you are suggesting). Often times, the value in the array is NULL (or worse, leftover from a previous calculation). You are left to manually try and re-order the calculations, which is not trivial when there are hundreds of functions.

Excel takes care of these details for you automatically. Users program functionally and recursively (fill-down) without even thinking about it. Excel reactively updates when dependent values change (re-evaluates as necessary).

If power, speed, and scale are necessary, there are purpose-built systems (with Domain Specific Languages) which specifically solve this problem in the insurance domain (e.g. FIS Prophet, Risk Agility, AXIS, etc).


It is common knowledge that all recursive functions can be re-written using iteration (e.g. loops). See “ Recursion versus iteration” here https://en.m.wikipedia.org/wiki/Recursion_(computer_science). The assumption that only trivial calculations can occur using iteration, or that recursion alone allows for supportable code, I believe are very flawed assumptions.


Um, ever heard of a “for loop”? An obvious alternative to recursion.


Not sure why this is being downvoted since I don't think the OP has done a good job of showing evidence that this doesn't work for recursion. Your computer almost certainly WILL have enough space for all the stack frames necessary.


My guess is the numeric inputs would be changing significantly each call?


Agree 100% with this. Better analytics can be a key competitive advantage for insurers and modern tools / cloud offer potential to be much better than Excel.

I've been working on a product that turns JupyterLab into an IDE for life insurance calculations - Python API wrapped around an optimised C / GPU computation layer underneath, all integrated with key open source libraries.


When I worked at Uber, one of the big goals of my team was converting spreadsheets from the finance team to Python and Java. The second two problems that the author mentions (pulling in more data and software best practices) were two huge factors. In the former case, you simply cannot have an org where analysts have full read access to every data store to dump a CSV (of sensitive data collocated with lord knows what) at any time. It's a security nightmare. And in the latter case, when you've reached a point of sufficient complexity, you can no longer "roll out an update" to a team of more than a few people. Without versioning and source control, the model_v2_final_FINAL(1)(1).xlsx problem becomes extreme (even on cloud platforms). This leads to mistakes, and mistakes cost time and money.

Excel has other problems that aren't described in the article. First, it intermingles data and logic. If you're not especially careful and deliberate, running an experiment with multiple inputs means that you'll inevitably fuck up one of the inputs (or forget to change some data, or otherwise fail to do the steps necessary to reliably run the model again), leading to bad output. This is a reusability problem: you can do it right (one file per experiment, "template" spreadsheets, error handling logic), but in practice very few folks do this or even care.

Second, there's no meaningful way to test. If you've got critical logic, there's no way to write proper unit tests against the spreadsheet to ensure something hasn't broken. If I had a dollar for every improperly written linear regression in a spreadsheet... Conversely, writing spreadsheets as code means that you can rest assured that important units of logic are sound, which pays dividends when you're dealing with stuff used by a whole org.
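Even a trivial test is a big step up. A sketch of what I mean (the projection function here is a made-up stand-in for "critical logic"):

    import pytest

    def project_revenue(base, monthly_growth, months):
        # the kind of compounding formula that silently breaks when a cell reference shifts
        return base * (1 + monthly_growth) ** months

    def test_zero_growth_is_flat():
        assert project_revenue(100.0, 0.0, 12) == 100.0

    def test_one_percent_compounds_correctly():
        assert project_revenue(100.0, 0.01, 12) == pytest.approx(112.6825, rel=1e-4)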

Third, spreadsheets are really only useful as the "last step" in data processing. It's not good or easy to use a spreadsheet as input to something else. The inputs to the spreadsheet are usually manually updated (importing a CSV as a sheet), and then the output is graphical by default unless you're parsing the spreadsheet (good luck) or dumping it to CSV to import elsewhere (manual step with the risk of human error). In any business where the model you're dealing with pipes into other processes, there's almost always a manual step to get that data into "the next thing", be it another model, a dashboard, a database, etc. You can hack around this, but I've never seen a hack here that isn't incredibly brittle.

This isn't to say that Excel is bad, but when you use it "at scale" there are very rough edges that dramatically increase the ongoing costs of running a business built around it. When you're building a model, it's great. When you're running that model with different data more than a few dozen times a day and using the output in other systems, the costs quickly start to add up. That's the point where someone needs to step in and say "okay y'all, production use of this needs to run on a server". And if the production implementation is built well, you'll often find it simplifies the lives of the analysts, because they can download a blob of already- or partially-processed data to work with.


I'm pushing for our actuarial team to transition to more R + Git. After 3 years of preaching, most of the actuaries now use RStudio + git as their primary work tool. It is happening.

What we did :

1) Provide documentation on everything from install to using internal R libraries for ETL.

2) Provide mostly problem free, always updated VMs with RStudio Server/ Shiny Server.

3) Establish a hotline channel for instant help on R or git.

4) A couple of members on the team developed really close working relationships with IT, and we have great respect for each other's work.

What we provide is way better, and by being active we built users' trust in the tools.

We are phasing out SAS and proprietary modeling tools. Python never took hold even though we bought Anaconda Enterprise. Excel is there to stay for sure, but since actuarial students learn R in school, it is easier to onboard new hires.

If you want to go down this path and have a chat, hit me up. I'm in P&C. We use R both in development and production environments. We use it for pricing, spatial contractual obligation, claims assignment and a couple more models.


The current top comment (sibling to the one I’m replying to) argues that keeping Python environments across actuaries/users computers up to date is too difficult.

This is nicely solved by using R server.

I’ve worked in an R server shop, and the experience is really nice. You log on to the server in chrome or Firefox and the browser window basically becomes RStudio and all calculations are done on the server and all code and data also lives on the server which is a huge bonus in terms of data protection. No copies are floating around on peoples laptops and if Johnny is sick and forgot to push his code to git - no worries, it’s all on the r studio server.

I don't know of a nearly as good Python solution. I think Conda suggests using Jupyter Lab, and while that is a great environment, it's not great if it's all you can use.


The big problem with notebooks is that you don't have a real REPL. This prevents one from single step debugging and tracing. This is one area where RStudio is much, much better.

The trouble is that so many of the younger DS people are focused on Python, that it makes financial sense to just deal with all its problems. There's also a lot more programming tools (though less statistical modelling tools).


You do have access to a repl when using jupyter notebooks.

You can hook a notebook or a repl to an existing kernel. I always have a command line attached to my notebooks. When using Jupyter Lab I attach the built-in terminal and place it at the bottom. When using notebooks I attach it from my terminal.

The experience in Rstudio is still better imho. It’s also a more mature text editor and ide than jupyter.


Ok fair enough, I only used notebooks when I can't avoid it. I'm pretty sure you don't get a repl by default though, is there an involved set up in jupyter?


    jupyter console --existing
should start ipython in your terminal and connect to the last started kernel (e.g., the one in the notebook you just started)

https://stackoverflow.com/questions/22447572/connect-termina...

For jupyter lab, you just choose to start a repl from the gui and choose an existing kernel.


Thank you! (clearly I didn't spend a lot of time doing this, as I have an Emacs addiction ;) )


I'm an actuary with a strong interest in this area - would be very interested to hear more especially on your R vs Python experience.


It came down to IDE, workflow and data.table.

RStudio is an absolute killer solution from the get go. Package management in R is simple and robust. Shiny is the new Excel pivot table on performance enhancing code.

Python has more contributors, more users. It also creates a lot more noise. Business people may feel like it is a programmer's tool. R feels more approachable.

In the end, both are great solutions but we decided on R because we believe in the people contributing to the ecosystem, mostly RStudio. Somewhere down the line, there might be a transition to julia.


Thanks - really interesting, especially on the RStudio point.


I've used R (3 years) and Python (8+ years) in data science and much prefer Python, because it can do things that aren't just pure data analysis, and because pandas is so amazingly good compared to R's data matrix solutions, in my opinion. I believe that the algorithmic trading industry has gone fully into Python and away from R for these reasons.


R has data.table. It is the game changer as I agree base R data.frame do not cut it for performance. tibble will come close once they incorporate more of the data.table performance tricks.

https://h2oai.github.io/db-benchmark/


Does R have robust CSV parsing? I remember using the default and it'd be extremely finicky about getting the header and index flags right and wouldn't typecast numeric columns properly (instead they'd end up as factors and not play nice)


Python version of data.table has very fast CSV parsing (compared to Pandas), and it didn't have issues like those you mention. Even if data.table had issues with CSV parsing, you could probably use Apache Arrow to parse CSV into arrow table and then convert it to data.table (but that is probably suboptimal).



Personally have never had a problem with R csv parsing


It happens, but mostly because other formats don't produce usable CSV's. The biggest problem is if there are any free-entry text fields (common for customer/business name), and there isn't full quoting around these fields, base R will break.

I believe both fread and readr::read_csv do the right thing here, but the base-R perspective on data manipulation before read.csv is to use Perl (the R-core team are pretty old-school, to be fair).


h2o's data.table clone is fine

https://github.com/h2oai/datatable


I've been a heavy user of all 3, and pandas syntax is a nightmare compared to dplyr or data.table in R. That being said, I still use pandas because I prefer Python for non-analysis work.


I'm a CPA. When I started learning to code, I looked for whatever was most like a spreadsheet. R fit the bill, with built-in data frames.


Oh.. similar line for me, accounting/tax law. Excel is bread and butter because all year-end financials are prepared and finalised in Excel. Although I have used LibreOffice on my personal machine, and it also kinda works.

For a couple of years I have tried to macro myself a balance sheet template in Excel which does most of the copy-pasting from previous years, does bank interest calculations and all.

It would be interesting to know how a US CPA works, because here it's all accounting package > Excel > e-file.


I'm on mobile, but do also consider https://JuliaActuary.org (something that I personally have contributed to).


Looks really interesting thanks. I've seen some interesting insurance projects using Julia e.g.

https://www.youtube.com/watch?v=__gMirBBNXY


R is better if your raw data is already tabular. I prefer Python if the raw data is unstructured / semi-structured. You can make the case that once Python has converted the data to tabular then move to R, but at that point I like the soup to nuts to be in one language.


I’ll second klelatti’s question about R vs Python. From my perspective Python is just as practical for actuarial calcs and better for building general purpose tools. Is there a reason Anaconda didn’t click?


GPU integration was broken for a long time. Managing VMs / Environments. The absolutely horrible integration with git/Github.

Having to rebuild your environment from scratch when your workspace crashed. Imagine starting a notebook with a 45 minute compile time. No go.

One click deploy, let's just forget about it.


Thanks! That totally makes sense. If I had to pick a pain point for getting people started with Python tools it would be environments. Comments here make me think my team is working with a lot less data too.


I am putting together a course aimed at Python beginners in the enterprise; I too have experience in finance. If someone is interested I would love some early feedback; you can contact me, my email is in this profile bio.


Would love to have a chat about how you’re making R + git more accessible. How do I best reach out?


As the author states - this is an issue for some really complex models - where the complexity, reusability and iteration challenges approach those of code.

Most models do NOT take that many tabs, you can build a toy model near instantly - the production line from finished model and output to publishable material is a few shortcuts away.

Having an analyst write that same thing using Jupyter? From an accounts perspective? Man, I'd want to see it in a spreadsheet. It's just simpler, or more familiar, to debug accounting information in a spreadsheet.

The idea that we are going to see all those analysts pick up code - over excel - is possible, but I’d say less likely.

I'd suspect that the idea of Python inside of Excel is a winner. But given that Excel is working with its own data model and data tools with Power BI, or with their new LAMBDA function, I'd say they are also working to keep people happy within the Excel ecosystem.

Interestingly, this is a version of the Bloomberg terminal debate - the terminal does everything, any upstart can only do a small part of the BB offering, allowing BB to always be relevant if not dominant.


"analysts" - lol!

I know a HR director at a multi-national. He'd had enough of Excel and liked the look of this Python thing. I showed him R as well for balance but he wanted Python. I showed him how to install a Python distro and MS Code on his Windows machine, wired them up and off he went a few months back.

The board are in awe of his presentations. He is not an IT bod at all but a Uni. degree in Psycho. involves a fair amount of stats so a fair grounding there. He grabs huge data dumps from payroll etc and performs analyses that are complex but just work.

I think one of the benefits of using Python is that you instantly divorce input data, calcs and reporting. Fire up Excel and the first thing you often do is write a title. Using Excel properly requires a lot of discipline - I wrote a Finite Capacity Planner, with forecast and labour planner for a pie factory in Excel with quite a lot of VBA. It ran my P60 hard but did the job iteratively in about 2 to 5 minutes. Easter and Chrimbo needed a fair bit of tweaking by a Planner but most of the time my model told several supermarkets what they would be ordering back in the mid 1990s and they mostly faxed or EDId our forecast back as an order.

My brother (cough) is absolutely not an analyst in the normal sense. That a non-programmer can bolt together enough Python to perform analyses useful to his job is testament to the power of the libraries and examples and documentation available. I've seen his code: suck in data, process it, spit out results, report results. That's all he needs, and not an OO abstraction in sight.

My two examples (me and my FC Planner with Excel and an HR bod thrashing some data to a report with Python) are different things and each uses the opposite "tool for the job" discussed in the OP. However, it is how you use a tool that is important.


Thanks for some great anecdotes! A couple of thoughts:

- It's often not whether Python is a good fit for the task but whether there are Python libraries that are a good fit. If so, the actual Python code may be pretty trivial and the equivalent Excel a lot more complex.

- Writing good Excel is definitely possible but needs real discipline as you say - and bad Excel can be really bad!


All good anecdotes, but I’m still kind of stuck on the whole you got to work at a pie factory thing


(Sorry, misty eyed recollection alert)

I should point out that "pies" in the UK is a rather generic term. MBOs (mince beef and onion), sausage rolls, pasties, pork pies and quiche were all made in this south Devon-based factory, near Plymouth.

It was a good corp citizen thing to attend the 1100 "taste panel" which was part of the quality process. Obviously Product Dev, QA and the line crews could not mark their own work so office staff were expected to taste to standard. The idea is that you taste samples from the store that is post bake. This is a perishable product and there are stores (freezers, chills etc) to provide time buffers throughout production.

There are a lot of constraints. You always make to forecast. In this case, back then, you had to deliver to depot with seven plus days of shelf life. The product needs meat and dough prep, make, bake and wrap and shoving in the back of a trunker (lorry/truck). You need to ensure you've got all your raw ingredients available and most of those have a shelf life and somewhere to store. Your machines have a nominal 100% production rate and a defined servicing period, expected breakdown rate, need cleaning and more. Some machines will do the job end to end and some will only do part of the process. You have bakeries and stores with varying characteristics. Some products have special requirements.

It is clearly a "simple" job of defining, understanding and controlling your constraints and solving a few equations. I absolutely loved it as a challenge. This was 1995ish. I inherited a System 36 that was basically a glorified accounting system with some stock control and a few other things. If it got too warm in summer I used to put bags of solid CO2 that I could scrounge from Despatch (the whole factory panics when it gets really hot) in it.

I am quite partial to pork pies and pasties.


I helped to sell System 36s (the big ones that took half a room) back in early 1980s.

I remember one demo when a potential customer asked "If it's this slow with one user how slow is it with six?"

The person doing the demo "improvised" with "It's dynamic load time balancing." Which I'd never heard of before. Turns out neither had anyone else involved with the System 36.

I later came across a whole insurance company that was run on an S/36. It was replaced by a single 386 PC.


I had to tickle one of the fans to get it started otherwise the bloody thing would shut down about 40 mins after IPL started. One summer we put bags of solid CO2 in it to keep under the temp threshold.

I still giggle hysterically when someone deploys the "enterprise" keyword at me.


That was far better than I could have hoped for. Thank you.

I’m trying to understand what benighted people are not fond of pork pies and pastries. Let’s have a moment of silence for them and move on.


There are always the unenlightened. One day they shall see the light (or a bloody great steak and ale pie) and they shall wipe their mouths in righteousness.


this is an amazing story, thank you!


And I have seen a board member who wanted nothing but linear regression from his own data science team (if I remember well, they were around five PhDs or masters in stats) because he couldn't understand anything else. And that was in one of the largest organizations in the world.


Data scientists are useless if they don’t have the right communication skills to empower decision makers.

This is why linear regression will always be king and people who know how to turn complex problems into linear problems are worth millions.


What's a bod?


A “body”/a person. I suspect the OP is from the U.K. or Australia or similar, it’s slang there. I suppose it’s kind of a gender neutral version of the U.S. “guy”.


A "bod" is a person (body.) It is a colloquialism in the UK. You generally only use it in this case to denote an anonymous/generic person: "A bod did stuff".

However, bod is also used as a formal abbreviation for body: "You have a lovely bod". In this case you should be reasonably familiar with the object or you will get slapped!

Sorry, bod means a person.


Last time I checked, LibreOffice Calc had Python as an option for a scripting language. Why aren't more people using it?


I work in an excel heavy function (Corporate Finance). And while this sounds very exciting and fresh I am just not seeing it take hold in Fortune 500s I have/am working for. A few reasons:

1) Biggest gripe: I don’t have time to maintain and fix models after I move to a new role. If it’s a Python-based model I built, no one seems able to fix it when some tiny thing breaks six months later due to a change in the data. I’ve had to work weekends to help colleagues fix models that I don’t use anymore. I can hand Excel to a young or old worker and they always seem to be able to figure it out and take it over.

2) The tools seem limited when doing Python directly in Excel, like the one mentioned in the article. VBA kind of sucks in 2020, but until Excel natively accepts Python as part of its base, I don’t love being dependent on these third-party tools. VBA always works.

3) I’ve recently completed an MS in Data Science, so I am very familiar with Python and R. My company doesn’t need that level of model for most things. We are best in class in our industry and we get by using lots of Excel models. I mentioned in my first point that I have built a few things with Python. When one had to be fixed, I just rebuilt it in Excel and that was all I needed. When I kept fixing the Python code I always felt like I let folks down if I couldn’t fix their stuff right away. Yet our business makes money and we continue to do well without much Python.

I love Python. But until others start to see its value and a critical mass of individuals knows/supports/can implement Python, I will put emphasis on learning Excel tools or SQL first because those will always be supported.


>When I kept fixing the Python code I always felt like I let folks down if I couldn’t fix their stuff right away. Yet our business makes money and we continue to do well without much Python.

I'm all too familiar with this. I think you need to let go of those Python models. You need to let others fix them themselves, maybe with minimal guidance. That's the only way they have a chance to learn.


The big problem I've always had with "programming" in a spreadsheet is by nature everything is obfuscated and difficult to trace. Yes, you can inspect a cell and see what the source for that cell is, but that might be 10 other cells and you can only really review one cell at a time. It's like a programming language where you only see one line of code at a time. Worse, those references usually aren't named. What does "A1 + SomeOtherTab:B2" mean?

All of this really starts to fall apart when you have 10s of tabs with hundreds of rows of data which are often copy/ pasted. You won't even notice that some intern hard-coded one value into cell F75 until you actually drill down to that cell.

Spreadsheets are great until you hit a certain complexity, then they are unmanageable messes.


> What does "A1 + SomeOtherTab:B2" mean?

Excel offers the ability to name any cell or range of cells. Don't even have to search through menus or the ribbon, it's right there to the left of the formula bar.

I'm well aware that the vast majority of Excel spreadsheets don't use named cells/ranges, but you can't really blame Excel for that. It couldn't be much easier. Lots of Python programmers don't use comments or descriptive variable names either.


> Excel offers the ability to name any cell or range of cells

Except almost nobody does. It's not intuitive or the way it's taught.

> Lots of Python programmers don't use comments or descriptive variable names either.

You have to have variable names in Python. If you want to give them shitty names, that's your bag, but unlike Excel, it's not an extra step.

You also have to deal with the fact that every cell in a range has its own unique formula. It's like you have a special function for each and every cell. You have nice conveniences for it like copy/paste and dragging, but ultimately you are copying formulas all over the place. And it's all too easy to update all but one of those when you make a change.

Yes, you can create custom functions, but much like named ranges, it's not the default behavior, takes extra steps, and it isn't the way Excel is taught.

Spreadsheets are amazing for small to moderately complex things, but beyond a certain point, they are just an unmanageable mess regardless of who creates them.


Interesting, my comment was solely in response to your complaint about naming convention... you've expanded your criticism quite a bit. I already anticipated your issue with naming being that few people used named cells/ranges -- again, it's not even in a menu or ribbon, it's present at all times, what else do you want? Not "the way it's taught"? Well, blame your teacher.

> You have to have variable names in Python

sure, but `(i, j, k)` isn't any more descriptive than A1 or B7 or CQ85759. `intOrderTotal` may seem better initially, until the summer intern creates `intOrderFinalTotal` (after tax) and `intOrderAllInFinalTotal` (after shipping and tax)

> every cell in a range has its own unique formula

you could use array formulas, or were you never taught those either?

> it's not the default behavior, takes extra steps,

Creating a Python virtual environment is not the default behavior, and it takes extra steps. So does using any packages beyond the standard library. So does source control. Or running Jupyter. Using classes, or type hints, or imports, are all not the "default" of one long script in a single file.

Excel isn't superior to Python, or the best tool to solve every type of problem. Excel has its place, Python has its place. But your specific little nitpicks here are a reflection of the user (who I presume is you) not on the tool itself.


If the code were stored in source control with a separation between dev and production, the intern might be able to raise such a pull request, but it would never get approved without oversight from someone more senior. This code review process is almost completely impossible in Excel.


That is a big if.

Office has source control management via SharePoint integration.


... and then you have two problems. I have never seen an environment converted to SharePoint that didn't suffer badly from the conversion.

And the Excel vs <other language> source control issue isn't history, it's "go ahead and try to diff between two versions of an Excel sheet" vs "diff two versions of that source file".

Unless I've just never heard of a tool that can digest two Excel sheets and tell you which formulas or cells differ. Please correct me, anyone who knows of one.



Late response over the holidays, but honestly thank you. In years of Excel usage, I've never stumbled on that or had anyone mention it.

I will look into that as Excel 2016 is one of my current required work apps.


An excel tool that showed you formula differences sounds like a cracking idea for a startup.


> Interesting, my comment was solely in response to your complaint about naming convention... you've expanded your criticism quite a bit.

My point was always that spreadsheets are poorly structured for complex problems and that the logic is obfuscated. Just pointing out additional issues. Nor is my previous post exhaustive.

You are comparing worst case Python programming to best case spreadsheet designs. As soon as you compare a typical moderately complex Python program to a similarly complex spreadsheet, things fall apart.


You can press F9 on "A1 + SomeOtherTab:B2" (just that selection in the formula bar) and it'll calculate it for you. You can do the same for all the dependencies via Trace Precedents/Dependents.

I admit it's not perfect but I have found it much easier historically to follow a calculation through Excel than through untested pandas code (people inner join and drop rows; they groupby and lose null groups plus related rows; they filter string data without case insensitive matching, etc.).
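
To make those pandas pitfalls concrete, a tiny illustration with made-up data:

```python
import pandas as pd
import numpy as np

sales = pd.DataFrame({"region": ["N", "S", np.nan], "amount": [10, 20, 30]})

# groupby drops the NaN group by default: the 30 silently disappears
print(sales.groupby("region")["amount"].sum())                  # N 10, S 20
print(sales.groupby("region", dropna=False)["amount"].sum())    # keeps the NaN group

# an inner join quietly drops unmatched rows
targets = pd.DataFrame({"region": ["N"], "target": [15]})
print(sales.merge(targets, on="region"))                        # only the "N" row survives
print(sales.merge(targets, on="region", how="left"))            # keeps all sales rows
```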


In Excel there's a toolbar button to toggle showing formulae rather than their results. I realise this doesn't counter your overall objection, but it does mean chasing down logic isn't quite as bad as having to select individual cells one at a time.


I mean... sure? There are a fair number of ways you can mitigate these issues, but the way spreadsheets are structured does not lend itself to structured/ well managed code.


I fully agree, things can get out of hand in spreadsheets. I've built my fair share of such spreadsheets, and as penance, I'm learning R.

The thing with Excel is there's a low barrier to entry, but there are a lot of differences between a great spreadsheet and a bad one. Somewhat like a junior vs senior developer, the quality of code/spreadsheet depends on what they know, how well they can troubleshoot, and how good of a system they can imagine (to then replicate as much as possible).

For example, most people are entirely unaware that Tables exist in Excel. When you want the sum of a column, rather than writing =SUM(F7:F39) and cursing when you realize you added 10 more rows and that's why the sum is not updating, you can write =SUM(tbl_Sales[SalePrice]), and when you add 10 rows, the table will automatically expand. Suddenly your formulas are somewhat self-documenting, regardless of which sheet holds tbl_Sales. Ctrl+T when you've selected your data, or Insert -> Tables -> Table.

You can also make named ranges, which I would say is analogous to using well-named variables instead of {a, b, c, tempVar} in normal programming.

You can also trace dependents/precedents, showing arrows for how the data flows throughout the spreadsheet. Formula -> Trace Precedents/Dependents.


> structured/ well managed code

Funny comment in a Python discussion. No type enforcement, no requirement for class/object declarations, circular imports/dependencies allowed, threading/gevent/async messes, variable/class scope weakly enforced

Python is great for a lot of things, but the language is not a beacon of well-managed code. Good Python programmers write nice, easy-to-follow code, just as good Excel builders create very nice, easy-to-follow spreadsheets.


I mean... sure? I never said they did. I even specifically said I wasn't disagreeing with your overall point, in the vain hope of avoiding this redundant discussion. I was just correcting one specific inaccuracy.


Fair enough, my comment wasn't meant to mock yours.


I think there could be a middle ground: an Excel-like tool with some kind of typing. For example, I often get confused about which currencies particular cells are in, and mixing them up causes a lot of pain. Not sure why a hard-coded number can't just carry a "$", and every time it gets multiplied by a EUR/USD rate it changes to euros and is displayed as such. This little thing would save me so much time.
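
For illustration, a toy sketch of what "typed" currency values could look like in code; the Money class here is hypothetical, not a real library:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Money:
    amount: float
    currency: str

    def __add__(self, other):
        # refuse to silently mix currencies
        if self.currency != other.currency:
            raise ValueError(f"Cannot add {self.currency} to {other.currency}")
        return Money(self.amount + other.amount, self.currency)

    def convert(self, rate: float, to: str):
        return Money(self.amount * rate, to)

price = Money(100.0, "USD")
fee = Money(5.0, "EUR")
# price + fee                              # raises ValueError: Cannot add USD to EUR
total = price + fee.convert(1.09, "USD")   # explicit conversion; the rate is made up
print(total)
```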


How would the conversion rate between EUR and USD be determined?


My first job was cleaning up the mess that was caused by hard-coded Yahoo Finance URLs in spreadsheets. One of them died, no one noticed, and it cost the company millions of dollars in bad trades over three months.


You could automatically import a table from the web into the sheet, or alternatively create some custom data types for currency conversion.


Or you could act like you're responsible for tens of millions of dollars and hand it over to someone who can make sure it doesn't blow up.

"We don't need to hire an electrician to wire up the office, I did my garage using uninsulated wires and it works perfectly!"


The guys handling millions of dollars should be trained in finance / accounting, so they aren’t untrained.

Handing over to IT usually means their flexible spreadsheet that they can change as they require, turns into an expensive and inflexible black box that only IT can change and that doesn’t integrate with the rest of their decisions. Also the new solution also probably has errors and pulls currency info from the same endpoint. Excel isn’t perfect, but it’s used for a reason.

To use your analogy, you can wait for an electrician to change your lightbulb, but that means you're going to be working in the dark for longer.


Lightbulbs do not generally cost millions when they go out.


are you suggesting programmers don't hardcode URLs in quick-and-dirty (and sometimes, even production) Python scripts/code?


I'm suggesting they have logs.


logs of what? if a programmer hard-codes URLs, you really think they are following best practices elsewhere? further, how do logs help get back the millions of dollars that were lost?


Yes, logging with email alerts means that when a catastrophic error is encountered you'd know about it, instead of it being hidden for months.
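
For what it's worth, a minimal sketch of that kind of email-alert logging using only Python's standard library (the mail host and addresses are placeholders):

```python
import logging
import logging.handlers

logger = logging.getLogger("price_feed")
logger.setLevel(logging.INFO)

mail_handler = logging.handlers.SMTPHandler(
    mailhost=("smtp.example.com", 25),
    fromaddr="alerts@example.com",
    toaddrs=["trading-desk@example.com"],
    subject="Price feed failure",
)
mail_handler.setLevel(logging.ERROR)   # only catastrophic errors trigger an email
logger.addHandler(mail_handler)

try:
    raise ConnectionError("quote URL returned 404")   # stand-in for the dead feed
except ConnectionError:
    logger.exception("Failed to fetch quotes; halting downstream reports")
```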


LOL! as I already stated, a weak coder isn't going to implement best practices for logging. At best, he hard-coded his own email address to receive an alert, however, this also goes unnoticed, since, as you stated initially, he's dead.

Look, whatever company this was, was relying on YAHOO as its data source for making million+ dollar trades -- not Reuters, or Bloomberg, or JP Morgan, but YAHOO -- and then, for MONTHS -- not hours, or days, but MONTHS -- nobody in its finance department or trading desk or whatever happened to notice that the incoming data feeds were not matching up with quotes from counterparties, market makers, CNBC, colleagues, Wall St Journal? Does this company not have auditors? A CFO? Any IT oversight whatsoever? I'm sorry to say that this particular company's problems are rooted much, much deeper than the loss of a particular Excel guy.


Usually it’s pulled in via plug-in from a third party data source like Bloomberg or a cheaper alternative.


Yep, lack of reproducibility with complex spreadsheets is a nightmare.


That was a neat detour into reinsurance. I wonder how the main thrust of the article holds up generally. My experience in trading/finance is that Excel remains extremely preferred for last-mile use (i.e. for writing and reading reports). Whereas the data science ecosystem has been adopted to develop research platforms and manage the data into (what ultimately becomes, for most users in the organization) an Excel sheet.

I don't see this changing any time soon. I think Excel was always strongest for last-mile use. Excel is extremely powerful when you know how to use it correctly, and I routinely see people match or exceed the productivity of programmers using it for specialized use cases.


The author pretty accurately describes a business model for a startup in an enterprise company: offering a service that was once hidden in some excel sheets.

As a developer at heart turned Senior Manager, I find this article especially interesting. I stumble a lot over complaints like these in the enterprise company I am working for and truth to be told, I voiced many of these before myself.

Problems I see:

- What is the business problem the author is trying to solve? How, specifically, can a tool like Python help do it better?

- There are no specific measurements mentioned. How big is the data the author mentioned, how long does an analysis cycle take, how large are the teams, the affected people? What about maintaining the software stack? How many requests are there per year?

- What about cost savings? How could they help us compete with other companies? Lead cycles of even weeks may bother a developer but not the business.

It is not that I don't believe the suggestions. It is just that I don't see a point beyond "my favorite tool could do it, too." We could easily substitute Python with R, for example.

"The spreadsheet took 30+ seconds to open" I know this is an annoyance, but how often do you open it? One time a day? 20 times an hour?

"The new model logic is testable and can be upgraded independently" this is one of the most valuable points here, as long as you work in a larger environment. So context is needed here as well.

A colleague of mine is extremely well versed in Excel and has put a decent amount of magic into her sheets. However, even losing her and starting all over again is, from a business perspective, way cheaper than trying to put her solution behind a cloud service.

It would be fun to have a conversation with the author.


I agree. The article came across as a programmer griping about Excel and VBA, while praising his/her favorite tooling as the answer to some programmer-centric grievances.


You can reach out to her (or me) on email. Her email is all over her blog and mine is in my profile.

In my opinion the project that inspired this article was some of the most valuable work we did together and it was made more valuable by working directly in a pair (trio?)-programming context with the Underwriter the model was actually for.


XLL lets you also write .NET code, basically anything that can compile to a DLL, to make custom functions you can expose in Excel - and it is a pretty old supported integration method by Microsoft with Excel. COM enabled DLLs were another way to do this, but they ran slower.

Not that I have any issue with getting Python in my Excel, but people seem to forget that .NET is also an option.

Getting these capabilities enabled in a locked down corporate IT environment traditionally was difficult but I suspect that is changing.

I have also lived the whole, turning a model in Excel into an app exercise. At the time, we rewrote a fairly complex demand planning app from Excel/VBA to C# since the other dev team members were C# devs and could support the app.

However, during the project, I did a demo of how one could build a Winforms app in VB.NET also, to the developer who was the Excel/VBA guru. He'd had no idea that coding in VB.NET and Winforms was close enough that he nearly could have been doing that instead.

The compiled C# version of the model we built went from running a single instance of the model in 1 hour to under 1 minute. We could re-run their model for tens of thousands of instances daily, without breaking a sweat.

Ironically, the rewritten version in C# never saw the light of day as the project was canceled (corporate politics and wisdom). However, the simple optimizations we identified in the rewrite were given to the Excel guru who actually made improvements to his tool that let it run in more like 10 minutes.. and it was even object oriented and modular! He learned he could do a lot more in Excel/VBA that he didn't even know about.

Coding is coding.. just some tools make the jobs easier or harder.


This reminds me of Bill Gates' comment about secretaries writing VBA. Thankfully, that didn't happen. My friend rides herd on all the python code written at BigBank. He said they now have over 1MM python scripts, "Most of them written by people who had no business writing code in the first place." Python certainly lowers the barrier to writing code, but I seriously doubt many Excel power users will jump on the Python bandwagon. If you are not up to speed with the latest BI tools for Excel, you are missing out. Power Pivot is brilliant and dynamic ranges are the cat's meow. With LAMBDA there is no need to try cramming Python into Excel. Don't underestimate how much large corporations prefer to use products written by professionals instead of relying on a menagerie of ad hoc packages cooked up by well-meaning amateurs.


This is a great point but companies are using tools built by amateurs and just don't know it. I think of a lot of business users as the fine artisans you might have come and paint a fresco at the Sistine wall. Sure, they might be the best painter in the world, but if they are painting on a poorly constructed building there will be little-to-no long term sustainable value.


This is a really interesting article on some of the challenges facing enthusiasts for technologies such as Python in a "traditional" corporate environment.

I'm a really strong advocate for these technologies - and have been developing a notebook based product for use in the insurance sector - but there is a lot of resistance.

- There is very little awareness of the power of open source tools to do traditional data manipulation tasks;

- In addition to Excel there are both legacy and newer proprietary systems backed by consulting firms that have a strong hold over parts of the market.

On the other hand there are some areas where there is increasing adoption of Python and (especially I think) R to do statistical analyses that are difficult / impossible in Excel. Also DataScience tools and techniques are now being taught as part of standard actuarial courses.

Finally, firms are increasingly acutely aware of the risks of relying on Excel and are looking for tools with better control / testing environments.


I'm way more proficient as a programmer than as an Excel user, so my assumption was that all these marketing folks who constantly work with Excel can do wonders with it. I mean, they probably can, but recently I tried to use it instead of writing a Python or bash script as I usually would (actually, it was LibreOffice Calc, so that might be my problem, but I don't know if the difference is really that big) and was unpleasantly surprised by how complicated the stuff I would consider standard is: concatenating columns, grouping/counting unique values, semi-automated cleanup of dirty data and such. Everything required either multiple clicks through multiple menus, performing actions that appear to be very ill-composable in Calc, or writing a 50-line Basic script (against kinda ugly APIs) for something that I can do by simply converting everything to CSV and writing 5 lines of Python (or sometimes, literally, shell text-utils). My general impression was that it is not really a tool made with power-user efficiency in mind, and it ends up being not very efficient for anyone, since it isn't very intuitive software for an Excel noob to use anyway.
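
For the record, the kind of "5 lines of Python" being referred to might look like this, assuming the sheet has been exported to data.csv with hypothetical columns first_name, last_name and city:

```python
import pandas as pd

df = pd.read_csv("data.csv")
df["full_name"] = df["first_name"].str.strip() + " " + df["last_name"].str.strip()  # concatenate columns
counts = df["city"].value_counts()                                                  # count unique values
df["city"] = df["city"].str.title().replace({"Nyc": "New York"})                    # a bit of dirty-data cleanup
print(counts)
```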

So my question is, is this really the state of art for visual working with data-sheets, semi-manual data editing and such?

I was assuming that running a Jupyter/Pluto/RStudio and doing stuff in Python/Julia/R when you don't indent to do actual data-analysis/learning, but only something that seems like basic stuff/preprocessing is more of a bad habit, because Excel was actually made to work with tabular (DataFrame-like) structures, but I ended up feeling like there's no way I would actually prefer Excel for that.


> Python/Julia/R when you don't indent

This typo made me chuckle

> So my question is, is this really the state of art for visual working with data-sheets, semi-manual data editing and such?

I guess Excel is used because it's well established, and migrating will be very costly not to mention finding people in that field knowing Python/Julia/R


I work somewhere that was stuck in even more simple excel usage, like sorting a list and counting rows for each type of category instead of using a pivot table.

I've done some more advanced work with python and xgboost for some modeling, but the biggest improvements in terms of both time saving and regular use of data for informed decision making has been implementing basic reports and dashboards. So much so that sometimes I feel like I'm creating kindergarten doodles that get praised as amazing masterpieces, which is a weird sort of embarrassment. I jokingly describe my job as "I count stuff" because a big part of what I do is still working with departments on what they want counted and the most useful way of displaying it to them. Percentages and year-on-year comparisons are magic.

I'm not quite sure what qualifies as a "legacy industry", but just about any organization that's been around for 40+ years has the potential for massive improvements from taking advantage of advances made during the past 20 years.


I know your pain all too well. I hate when people say stuff like 'Did you go to Harvard?' after I help them print a document. I knew managers who didn't know what < and > meant. They think creating a pivot table is genius work. I actually heard someone say you don't have to put .au on the end of the email address if you live in Australia. I sat next to one guy in a meeting who was typing away on his laptop keyboard without looking at the screen for about 5 minutes and then used spell check to fix every second word. These are the people running these old corporations.


I work in a small R&D team within a larger engineering organization. I use Python, and it has spread to the rest of my team. However, I've tried to share tools that I've written in Python with the engineers. The problem is that I have to hand-hold them through the process of getting Python working on their computer at the level of detail of: Here is how you find the Python editor. Double click on it. Click on "open." Find the Python file. Click on "run." Click on "Run Module." Or you can press F5. No, you have to be in the editor when you press F5. Now do you see that a window just appeared? Look at the entries and buttons in the window...

It's really quite harrowing. Whereas if I put the same thing in an Excel file, they can bring it up themselves and I can quickly walk them through using it.

And I don't think my UIs are all that bad. But going from throwing together a simple Tkinter GUI to something that is totally user-proof and self-installing is actually quite a lot of work.


Pyinstaller + Gooey has been my go-to combo for sharing executable python with simple/intuitive UIs.

https://github.com/chriskiehl/Gooey
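
A rough sketch of that combo, based on the usage shown in the Gooey README; the script itself is hypothetical, and pyinstaller can then bundle it into a single executable:

```python
from gooey import Gooey, GooeyParser
import pandas as pd

@Gooey(program_name="Monthly Report Builder")
def main():
    parser = GooeyParser(description="Turn a raw CSV export into a summary workbook")
    parser.add_argument("input_csv", widget="FileChooser", help="Raw data export")
    parser.add_argument("output_xlsx", widget="FileSaver", help="Where to save the summary")
    args = parser.parse_args()

    df = pd.read_csv(args.input_csv)
    df.describe().to_excel(args.output_xlsx)   # stand-in for the real processing

if __name__ == "__main__":
    main()
```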



The solution for this is to make Flask or Django apps. It's easier to build user interfaces, and it solves the packaging / user experience problems.
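
A minimal sketch of that approach with Flask (the endpoint and field names are made up); users only need a browser, nothing installed locally:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/premium", methods=["GET", "POST"])
def premium():
    if request.method == "POST":
        exposure = float(request.form["exposure"])
        rate = float(request.form["rate"])
        return f"Indicative premium: {exposure * rate:,.2f}"
    return (
        '<form method="post">'
        'Exposure: <input name="exposure"> '
        'Rate: <input name="rate"> '
        '<button type="submit">Price</button></form>'
    )

if __name__ == "__main__":
    app.run(debug=True)
```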


I've had some success with WinPython, where I just install the whole kit and kaboodle on their computer. Before I share anything, I try running my code on a fresh install of WinPython.

Learning to distribute Python code is on my to-do list for next year. We now have a younger programmer on the team who is up to date on this stuff, and has agreed to train me.


> Learning to distribute Python code

Found pyinstaller very handy, in fact that's what I use to create releases for my side project[0]. And if you're creating CLIs, a sibling comment mentioned Gooey.

0: github.com/rmpr/atbswp


Imagine them the first time they used Excel


Have you tried pyinstaller? Works pretty well for simple cli tools or tkinter apps. There's already predefined docker images with all of the necessary wine hacks that just consume your code + requirements.txt and spit out a exe.


We have a few customers in reinsurance, and for the most part the goal is to do the opposite of what the python solutions try to do. Instead of integrating foreign stuff into existing workbooks, the goal is to retain the existing worksheets as source of truth and build modern tools around the files. The most common use case is building out a web interface to replicate the Excel formula engine.

In the Python space, there are libraries like openpyxl and xlrd, but the real hurdle is introducing Python into an ecosystem which otherwise has no natural knowledge of it. JavaScript is the language of choice for modern Excel add-ins, as Excel provides an actual API for it: https://docs.microsoft.com/en-us/office/dev/add-ins/referenc...
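
For anyone curious, a small sketch of reading values and formulas out of an existing workbook with openpyxl (the file, sheet and cell addresses are hypothetical):

```python
from openpyxl import load_workbook

wb_values = load_workbook("pricing_model.xlsx", data_only=True)    # last-calculated values
wb_formulas = load_workbook("pricing_model.xlsx")                  # formula strings

sheet = wb_values["Summary"]
print("Gross premium:", sheet["B12"].value)
print("Formula behind it:", wb_formulas["Summary"]["B12"].value)   # e.g. '=B10*B11'
```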


> The most common use case is building out a web interface to replicate the Excel formula engine.

This is one of the projects that I'd worked on. We implemented a pretty thorough version of the Excel engine in JS. Load data and expressions as 2d arrays and get a nice api for the output.

https://github.com/websheets


In principle I agree strongly with this approach, especially when a "working" solution already exists. However, some of the cargo-cultery that goes on almost defies belief.

I once heard of a company that created a database table with columns "workbook","sheet name","row","col","value" that they would extract all of their spreadsheets into as a "Database backend" for their spreadsheets.


If the industry needs a better data analysis solution, that's fine, find one or maybe start to build one. But saying "I'll just throw some Python at it" isn't a good solution. It's like saying you're gonna replace a lawnmower with tool steel and a welding torch. Not only is it not guaranteed to end up as a better solution, but you're setting yourself up for a lot more work than you think.


I think replacing excel with python is a bad idea.

Think of it as a developer: you write a small application, and everything - code, code dependencies, data, really everything - is in one file and can be sent over email. The program behaves the same everywhere. The code can be understood by people who have no coding literacy.

Good luck doing that in python.

But I guess I don't understand the fad with Python anyway.


Related note about something I've been working on: https://JuliaActuary.org

It's basically packages to support actuarial work written in Julia, which addresses a lot of the issues of Python/R (environment management, runtime speed, rich cross-package compatibility).


> At the end of the day, making a change in an Excel sheet is easy; understanding formulas is achievable; but learning to code is hard.

This is the crux of the matter. I would guess that there are more than an order of magnitude Excel users than Python programmers. Python is great if you already know programming, but expecting domain experts to learn Python in large numbers is going to be a daunting barrier.


This has frustrated me as a Python user for the last 7 years, working as the only Python user in business environments dominated by Excel. People will say things like, "if you leave, who can support this report you made in Python?" Well I say, who can support the bloated 40 MB spreadsheet that would take forever to unpick and figure out how to update with new data? No one can, because I've seen that people would rather rebuild their own spreadsheet from scratch than use the files they inherited from the last person.

If these tools are necessary to conduct business and they are so worried about being able to support it, why don't they use proper software for that process?

A lot of people who make these bloated spreadsheets are people with no education in computing, and don't think about the basics of how to store data that is easy to analyse later. If they are building a weekly report, they build the report and enter the data directly into the report structure, which then makes it almost impossible to analyse later. Next week they just copy the file, rename it and update the data. If you want to analyse that same data over a year, good luck! You can't even count on the data being in the same place over the 52 weeks, since they would have added and removed data points over time.

Once I got the process down in a jupyter notebook, handling all the oddities with the data coming from whichever website, CSV file, data warehouse report I need, I can just save it as a .py file and run it as a scheduled task on a virtual computer forever. The data is kept in a format that can be appended to with each update, and can be easily analysed later.

The most amazing thing with replacing excel with python is you don't need to manually perform the update process yourself. Which means it doesn't cost anything to run the process more often. Weekly reports can become daily, or even hourly email updates that are only sent when something interesting happens. People can start reacting to things shortly after they happen, rather than having to remember what happened a week or a month ago. The iteration on improving becomes so much faster. People spend more of their time discussing how to fix problems, rather than spending time building problem finders. You can even start to automate the fixing of the problem in python and people don't even have to spend time on that thing at all, ever again.
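
A stripped-down sketch of that pattern (the paths, columns and threshold are all invented): pull the latest extract, append it to one growing history file, and only shout when something interesting happens.

```python
import pandas as pd
from pathlib import Path

HISTORY = Path("claims_history.csv")

new = pd.read_csv("latest_extract.csv", parse_dates=["as_of"])
new["loaded_at"] = pd.Timestamp.now()

# keep one ever-growing, analysable table instead of 52 renamed report files
new.to_csv(HISTORY, mode="a", header=not HISTORY.exists(), index=False)

# only alert when something interesting happens
spikes = new[new["incurred"] > 1_000_000]
if not spikes.empty:
    print(f"{len(spikes)} large movements detected; sending alert email...")
```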


> You can't even count on the data being in the same place over the 52 weeks, since they would have added and removed data points over time.

> I can just save it as a .py file and run it as a scheduled task on a virtual computer forever.

This is rather naive and short-sighted. Do you think the spreadsheet guy is moving data points around because s/he's bored at work and is screwing around with no purpose? No, the business requirements change, so he needs to update the spreadsheet to incorporate the new rules and/or data.

Which is exactly what you'll need to do with your python program, otherwise it also will break and/or produce incorrect results.

Simple example: calculate available vacation days. Last year company policy was simple, use it or lose it. Just subtract days allotted minus days used in the calendar year. This year company policy allows for up to 5 to be rolled over. Now we also need to know how many were available last year, how many were used, how many could be rolled over. Your Python program importing from SQL query, CSV, data warehouse report... totally breaks now that the data source has 5 columns instead of 2.

Claiming you can build a program in Python or any other language and run it "forever," in the context of a business, makes the whole comment lose any credibility.


It sounds like you missed the point of GP's post.

He was talking about avoiding manual weekly data copy-paste errors by writing code to do it in a predictable format.

I think you assumed that they meant the code would never have to be changed again, when they were actually talking about being able to standardize the data update process.

I don't think they said anything about never having to change the code, just that running a software process saves each user from manually pasting data every week.


I'll stick with bash-j's own words, where he bragged about running a python script "forever", and not your interpretation of his words. I'm also going to dismiss what you "think [I] assumed that they meant"

However, if you, or bash-j, believe Excel is incapable of avoiding manual data updates, that's simply not correct. Excel provides at least 3 ways of bypassing manual data entry in favor of automated imports from an external data source -- VBA code, ODC (not to be confused with ODBC) data connection definitions, and PowerQuery (I think that is the current name, haven't used it in a long time).


I will grant that tracking vacation days in a 5 person company might be a valid use of excel.


I've used excel and python in lots of business contexts. For most tasks involving domain experts, excel usually wins hands down.

An excel spreadsheet is usually easily auditable. The visual presentation and layout lends itself to review by others. You can click and point at values. Python and other programming languages require an environment and tooling that can't be easily supported across the enterprise. It requires source control systems and code review.

Programming languages are also too "dynamic". Using excel I can bring in a hard-coded report and link to those values in another tab. In python I'll have to save those to another file or re-query the data source, which may have changed due to new values being retrospectively added.

Python is the right tool for lots of analysis tasks, but for most corporate reports it's hard to beat excel. Programmers are also more expensive than corporate analysts. So you would end up replacing teams of low-cost high-retention analysts with high-cost low-retention programmers.


how does Python require source control? I guess you can just drop your files on a sharedrive... just like ... EXCEL!

Also, you reference an existing report? What exactly is the problem with serializing your data after a run of your Python program? Actually, if you think this through, you would probably establish some pipeline architecture to just continuously integrate your results.


> An excel spreadsheet is usually easily auditable. The visual presentation and layout lends itself to review by others.

Tell that to the next poor soul who has to edit some organically grown spreadsheets, powered by VBA and malformed CSV files, which generate your company's financial reports.


> An excel spreadsheet is usually easily auditable.

What does this sentence mean? Virtually no part of excel is easily auditable, at best you can see "this file was changed by xyz at a:b:c" identifying the cells that have changed between versions wouldn't even be easy.


All versions of Excel from 2013 and onward come with a tool that lists out cell by cell differences between two spreadsheets, called Spreadsheet Compare [0].

It lists side-by-side differences in hardcoded values, formula changes, calculated value changes, and even changes in VBA code. The list of changes can then be dumped out to a text file.

Default Excel installations also include the Inquire add-in which allows you to perform the comparison within Excel itself.

[0] https://support.microsoft.com/en-us/office/compare-two-versi...


This is exactly the niche that Resolver One used to fill. It was basically an Excel-like spreadsheet but under the hood everything's Python.

Unfortunately it looks like they ceased the product in 2012 due to lack of sales. Perhaps they were too early.

https://youtu.be/u6EV2jiKRfc


Here is sort of a modern variation - upload your Excel workbook to the cloud. I don't have hands on experience or any connection but I think it translates to C# under the hood.

https://www.milliman.com/en/products/milliman-mind


I think there's a realization that some computation has become too complex or too risky to put in a spreadsheet. The reasons the author gave for moving to Python were less about the difficulty of coding and more that the other problems were bigger than learning to code.


This is exactly the point. Do you know of any companies that have had this realisation? If they are based in Bermuda I'd love to work with them.


Loved the post. I spent more than a year trying to pull a prominent reinsurer on the rock, out of spreadsheet hell, and into the modern age. Baby step #1 would have been to transfer critical data to a database environment and baby step #2 would have been to extract business logic and very-poorly written VBA code into an external code library (Python) that is source-controlled and auditable... I can still hear the Chikadees laughing at me. Last I heard said firm was still deep in spreadsheet hell.


Are you still on the rock? Send me an email (my address is in my profile) and we should get coffee.


> We used an Excel spreadsheet-based model with dozens of tabs containing complex formulas, endless pivot tables and unintelligible VBA code.

> The tangled mess of VBA was re-written into independent Python modules, each of which performs a distinct function.

Hardly a valid argument. I mean, VBA supports using modules too, so it basically comes down to an increased skill of the programmer and maybe more knowledge about the actual scope as it is a rewrite, but there is no reason why it could not be done with VBA.


I do a lot of both. Excel really is great for data with fewer than, say, 100k rows. It's just so easy to see exactly what you're doing and what the data looks like. If you have millions of records Python really does better, but I still find it frustrating to find a way to keep peeking behind the curtain.

Ideally I'd have a type safe language which can embed data the way excel does. If Excel had dotnet languages instead of VBA and could store data arrays in XLBs it'd slay.


Excel can easily handle tens of millions to hundreds of millions of rows of data.

Check out power query and power pivot.


Try ExcelDna it lets you hook up .net to Excel


Yes, I use Excel-DNA; it just doesn't give me the same flexibility of being able to email anyone a single spreadsheet file containing both code and data.


The title is a bit misleading: it's not that typical Excel users are being encouraged to use, or are experimenting with, Python instead.

Rather, these are people who need to "price complex deals with increasingly large datasets". They write pricing models and need to run them.


I’m surprised that there aren’t more comments about utilizing R AND Python together for analysis work. These two languages actually commingle fairly well; you can build in RStudio if you like that flavor and still import Python packages to use in R code.

We do a significant amount of modeling and analysis on large data sets from a variety of disparate sources, and by utilizing several different packages we have extended this out to standing up fully free (save for AWS hosting) environments that perform modeling, support reporting and dashboarding automation, and expose RESTful APIs for other services to call into.

I’d encourage anyone looking at making the jump from Excel to ‘X’ to checkout out some of the power of flex dashboards, R Shiny, Plumber and some of the different authentication mechanisms available.

Some elbow grease can create a wonderful environment.


This is a really good point - you can even use packages such as rpy2 if you want really close integration. A bit clunky but it works.


Yes! RStudio is an amazing product, and you truly can have the best of both worlds by using python in it.


This is actually about using a paid, closed-source add-in, PyXLL, which integrates Python into Excel, but only for the Windows version.

And costs 25 USD per month (but has a free trial).

Excel is never actually ditched.

edit: oh, that's only step 4. Step 5 is actually ditching Excel.


The main point is to try and separate view, data and calculation engine.

PyXLL is great for helping with this as you can move the calculations into Python and thus have them automatically tested and protected by source control.

I assume this is possible for VBA but have never seen it done in practice.


I've used xlwings at work, but I'm mired in packaging issues and things like the CMD window disappearing for some users but not others.


I don't claim that Excel is the best tool or that it should be used in reinsurance, but it seems that the author does not know how to use it properly.

It sounds more as if they knew Python, so they used Python. If the author knew Java, they would write that Java is better.

For example, combining data with logic does not need to happen in Excel. Someone who does this might carry the same bad practice into Python too.

Also, there is no acknowledgement that while there are many bad Excel workbooks, there are also many bad programs (in Python and every other language). There is some sort of magical thinking that a rookie who switches from Excel to Python will somehow not produce spaghetti code, which is not true at all. Those Excel workbooks are much easier for the business side to debug; any program is a black box to them.

The author makes empty claims that "Excel formulas are long" or that models have many tabs. If the Python program gets as big, it might also become a mess.

How many times have you heard a new programmer look at old code and say it is spaghetti? Nearly every time. Nearly every time they want a rewrite too. And the Excel just works.

I doubt the author has used Python programs made by others; there are no comments on that. Also, was the author's code reviewed by a real programmer? The author is self-taught, so odds are that they produce some really awful code and don't even know it.


There are some good criticisms here. However, the OP is getting at the fact that at least the tools exist for code review, source control and automated testing in Python.


I’m a diehard user of the “PyData” toolchain (pandas, numpy, seaborn, sklearn, etc). Models become too much for Excel when either (a) you want to incorporate live updating datasets, (b) you want multiple users to be able to query the model simultaneously without affecting each other, or (c) you want to incorporate probabilistic calculations i.e. your inputs and/or outputs are distributions. Python blows Excel out of the water in these cases, it’s not a question of speed or ease of use.

However, I think Python’s biggest current weakness is the lack of a general-purpose plotting library with good defaults or GUI-based tweaking. Matplotlib “can do anything”, provided you’re willing to google how to rotate axis labels and 20 other things to get legible styling. Seaborn is an improvement but still takes rewriting about 10-20 lines of code for each plot. As far as interactive libs go, I prefer Bokeh, but it’s still too low level and missing fundamental capabilities like histograms. Holoviews is an interesting wrapper but still suffers the same limitations. Plotly... is popular, which is about all I can say for it. I find that I hit random walls and inflexibilities often because it tries to be too one-size-fits-all. I understand ggplot from R is kind of the gold standard. Wish someone would do a carbon-copy port to Python.
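
For what it's worth, the kind of boilerplate being described: a quick seaborn histogram plus the usual matplotlib label fiddling (the data file and column names are made up).

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("losses.csv")

sns.set_theme(style="whitegrid")
ax = sns.histplot(data=df, x="loss_amount", hue="line_of_business", bins=40)
ax.set_xlabel("Loss amount (USD)")
plt.xticks(rotation=45)          # one of the "20 other things"
plt.tight_layout()
plt.savefig("loss_distribution.png", dpi=150)
```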

Final random thought: my feeling is that white collar industries like insurance that are built around a network of Excel jockeys are in for a major disruption. If you built these companies from the ground up with a software dev team and mindset you could probably cut headcount 5x. It might not make business sense for a company deeply rooted in Excel to make that transition, but then again, that’s exactly why and how disruption happens.


I’m an actuary working in life insurance and I’m so shocked to see how inefficient processes are in many insurance companies. Some models take hours to run in Excel whereas they could easily be run in a couple of minutes using Python or C#. Not to mention the importance of having a clear version history, which is just impossible with Excel.


Lots of discussion here of Python and R as alternatives to Excel. But no mention of drag and drop data processing tools such as Alteryx, Knime or Easy Data Transform. Are these not a more natural alternative to Excel for people who don't have a programming background?


I used Knime in my last company. Very interesting software. But I think the initial introduction can be tough to carry forward. You need a champion or two in your org to help everyone else. Excel is a tool enough people know to both work independently and be dangerous.


> The development environment is not user friendly, the syntax confusing, there’s no support for unit testing – I could go on.

The VBA IDE is pretty good IMO. The lack of unit test frameworks is a valid point, but it doesn't stop you from rolling your own.


Loved VB and VBA but it is too limiting when needing advanced numerical capabilities. Back in the day had to create add-ins making use of compiled Matlab code to get access to decent numerical routines. Eventually moved to Python and never looked back, although I still use Excel for certain tasks.

However I do miss the VBA GUI editor built into Excel. It allowed for relatively polished interfaces in record time.


What sort of capabilities are you looking for? Newton-Raphson (which Excel has with GOALSEEK)? Sensitivity analysis?
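
For comparison, a minimal sketch of that kind of goal-seek / Newton-Raphson solve in Python using scipy (the cashflows are made up):

```python
from scipy.optimize import newton

def npv(rate, cashflows=(-100.0, 30.0, 30.0, 30.0, 30.0)):
    # net present value of the (hypothetical) cashflows at a given rate
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

irr = newton(npv, x0=0.05)   # find the rate where NPV = 0, starting from a guess
print(f"IRR is roughly {irr:.4%}")
```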


And a stepping stone into VB.NET in case more power is needed.


People have been integrating web services with Excel for two decades. Previously you had to write a C XLL to do it properly, but COM add-ins have been supported for a long time (although they have oddities), and now you can load .NET assemblies into Excel. You can use the Kerberos baked into Windows for authentication, or just use HTTP basic auth, and control access to data sets like any other thing. So I guess I'm confused by the article, because a lot of the claims about how you have to do things in Excel don't seem to be quite right.


I like the idea but it’s a bit concerning that everyone would be expected to build models in Python. You’d think there would be enough reuse and structure that it wouldn’t be needed or an application could be built to simplify the construction of the models.

If it’s so complex that it needs to be coded up in Python and everyone is doing that bespoke each time it feels like alarm bells should be going off.


I think the OP is hoping that using Python makes reuse of robust libraries and solutions easier, so that ad-hoc reimplementation of common tasks happens less frequently, not just that it happens "in Python".


The problems with excel IMO are:

- large datasets are an issue

- it doesn't have some libraries without extra cost (and in my case, long-winded approvals). I'm specifically thinking about linear programming libraries here; see the sketch below.

- VBA is less easy to code in

Otherwise excel is great.
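
A sketch of the sort of linear programming the bullet above refers to, done with scipy (a toy product-mix problem; all numbers invented):

```python
from scipy.optimize import linprog

# maximise 3x + 5y  ->  minimise -(3x + 5y)
c = [-3, -5]
A_ub = [[1, 2],    # machine hours:  x + 2y <= 14
        [3, 1]]    # labour hours:  3x +  y <= 15
b_ub = [14, 15]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)   # optimal plan and the corresponding profit
```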

Side note - is anyone exploring Jai? It seems to be trying to solve the installation and compatibility issues that mire coding these days.


We are using notebooks in Visual Studio Code to process CSVs downloaded from a corporate BI tool and replace Excel analysis. As a programmer I dreaded questions like "do you know how to program macros in Excel?"


Slightly off topic but any suggestions for the best forum for issues relating to use of Python / R in this sort of environment (thinking insurance / finance specifically)?


Blockpad (http://www.blockpad.net) tries to ease some of these pain points but more in the engineering space.


I love Python. In investment banking I can see a use for recurring analyses of fundraising or grabbing data from the SEC. It would be next to impossible for per-client analyses, though.


I recently did a project where a financial model for IAS-19 built in Excel was migrated to a cloud based solution based on Python/Django/Pandas.


Actually, in the life sciences domain where I worked for a couple of years, I saw them adopting VB.NET alongside Excel; Python was not even on the radar.


I wish I could ditch Excel, but I use both; viewing the data and doing some ad hoc calcs is much easier for me in Excel.


How do you get compliance/IT to sign off on the inherently risky pip install mysterypackage though?


Better sell some cat bonds if you’re adopting Jupyter notebooks en masse ...


F# is a better Python - especially with Units of Measure


Does your name rhyme with "tennis tin"?


FWIW, I doubt he's Janis Joplin ;-)


Why use Python when MS built in JS?


Is this a joke? The PyXLL add on costs $25 a month. Hard pass.


I wish there was just a desktop program like Excel that actually fucking works well on Mac and PC. I like Google Sheets but it's way too integrated in the cloud for my tastes.

Excel has so many good features, but the core of it is so fucking buggy.

It sucks that I gotta bust out Jupyter and use Pandas to double check my work, especially dates, because I can't trust Excel.



