The chemfp project: problems selling free software

dalke · on July 21, 2020

I started the chemfp project in part to see if I could develop a self-funded free/open source product in my field, cheminformatics. (In short, storing and searching chemical information on a computer. Chemfp does very fast Jaccard-Tanimoto similarity search for "short"/O(1024 bit) bitstrings.)

The answer: no.

The section I linked to highlights some of the problems I had selling software under the principles of free software. For examples: How do I provide a demo if I always provide MIT licensed source code? Academics expect discounts, but they are also the ones most likely to redistribute the code. Which is not a wrong thing to do! But it affects the economics in a way I could never resolve, compared to proprietary/"software hoarding" licensing models.

As an HN note, I contracted a couple people to help improve the popcount implementations. HN user nkurz developed and tweaked the AVX2 implementation, and proof-read the paper. Thanks nkurz! As a result, chemfp is, I believe, the fastest single-threaded Tanimoto search implementation for CPUs available, and most likely memory bandwidth limited, not CPU limited.

(Note: the mods asked me to repost. My earlier post is at https://news.ycombinator.com/item?id=23598470 .)

maximente · on July 21, 2020

in your future ideas section you have listed a few features or directions that the software could go. what about selling those?

in my mind, in order to monetize a greenfields FLOSS project, it seems you need to basically create software so impactful that users are willing to pay for particular features or bug fixes that they need, and that don't already exist in the software. so basically, you have to not only get people to use this new thing, but also get them to be willing to pay for particular improvements to it! quite a task.

this first occurred to me when i heard Stallman talk about how one can monetize FLOSS projects by selling bugfixes or improvements. contrast the typical developer-driven setup where a team implements the feature then sells it by selling a new version, upgrade fee, or what have you.

so it's a tricky situation, in that one needs to offer a compelling product, but not something so good that individuals can take it from the shelf and never have to do anything to it. however, this is sort of helpful insofar as it acts as a way for you to ship something and then - hopefully - let users drive the product (we'd pay $X for a UI on top).

for cheminformatics, i'm not really sure what some killer features are, but you have at least some ideas as potential things to sell. perhaps the community would be interested in pooling together funds for some of those ideas, or a UI/UX, or whatever - but yeah, definitely seems more customer-driven then traditional software.

dalke · on July 21, 2020

Yes, those future ideas are also possible salable products, thought the market is smaller. I had hoped that the surplus from the core bitstring search would provide the seed money for their development.

Chemfp is (or rather was) at least an order of magnitude faster than its competitors. One of my early customers went from taking 5 days to cluster 1M fingerprints using their internal tool to several hours with a re-writes based on a chemfp core.

I figured that was sufficiently impactful. But that was in the early stage when people did fund me to add new features as an open source project. I think that once it got "good enough", people didn't want to fund the incrementally smaller improvements. The commercial version is only about 2x faster than the no-cost version. I would talk with potential customers and they already had the no-cost version installed in-house.

Now, there are a lot of other improvements in the commercial version, but it's a variation on the "offer a compelling product" - I was my own competition. That was the primary failure in the original business model, based on Ghostscript. (I didn't realize that hardware companies paid Aladdin to keep up with Adobe Postscript, which provided the reason to pay for the latest version rather a no-cost older version.)

I tried crowdfunding a project, ie, "pooling together funds", for another project. It worked, in that it made enough money to fund the software development and the marketing, but it also required a lot of time for the marketing (which I'm not good at), and I would have made more money doing straight consulting work instead of product development.

BTW, I focus on "classic" cheminformatics. Similarity search is from the 1980s, with roots in the 1970s. My thesis is that there people haven't looked into these topics in years and there might be orders of magnitude gains by developing better fits to modern architecture, that is, improve existing methods enough that UI/UX changes. Most other people's definition of "killer feature" is "an improveed method predict how a molecule will work in the body."

cowsandmilk · on July 21, 2020

In addition to the free version becoming good enough, pharma is generally a consolidating market. My experience has been the number of companies buying software has steadily gone down. I sold software with a subscription license and the primary driver of nonrenewal was acquisition. Cubist? Acquired. Dyax? Acquired. And on and on.

ssivark · on July 21, 2020

Very interesting experiment.

Did you consider having academics cite your software as an academic work, and then monetizing those citations to get grants from funding agencies who were funding the research that used your software?

Given this knowledge, can you think of other similar experiments worth performing? Alternatively, are there any likely changes that might lead to such approaches becoming more feasible?

I’m quite interested in this question and appreciate your comments. In case you’ve already answered these in the article, I apologize — the article is long and the HN thread will likely expire before I have the chance to peruse it thoroughly.

dalke · on July 21, 2020

I tried talking with funding agencies. As a self-employed/sole-proprietership, it proved to be difficult.

I conducted the "mmpdb crowdfunding project" at http://mmpdb.dalkescientific.com/ . Unfortunately, it's been on delay because of major changes to my work schedule once the coronovirus came. My conclusion from that was it's still easier for me to make money as a consultant than from selling a product.

One thing I didn't mention in the paper was that being a consultant on different projects makes it easy to present new talks at conferences, which leads to new work. Continued development of a single project with one focus leads to variations on a theme, feeling sometimes like "have you ever looked at your hand. I mean, really looked at your hand" introspection.

You can always email me as follow up. dalke at dalkescientific.

alicemaz · on July 21, 2020

have you considered offering it as a cloud platform? we're doing something along these lines, niche scientific software (biological modeling, bioinformatics) as a paid hosted service. still at the prototype stage! so I can't comment on how well the business model actually works yet lol

but the idea is our mathematician will be able to publish whatever novel math she develops, and we may eventually open source the math core as a reference impl, but we'll keep all the cluster management and other supporting infrastructure code proprietary. sort of a "if you want to run it on your desktop, go ahead! if you want to actually scale this up for big jobs, we've done all the legwork already so it's really in your best interests to just pay us." I think open source ideals are good and worthy but from a business perspective, you capture value by providing value that can't be got without you. relying on customer goodwill is particularly difficult because any large org, the people who will feel goodwill toward you and the people who can authorize purchases are in two different departments

also fwiw I think if you wanted to do the model you described in the paper unchanged, gpl is a much better choice than mit. copyleft actually serves as a wonderful poison pill: you can try us out for free, but if you want to ship us, you need to pay for a proprietary license or legal will nail you to the wall. whereas mit, there's no stick. I've seen affero used by several projects for this express purpose: you have to buy a proprietary license because agpl is so onerous you just can't use the code for commercial purposes at all

interesting project btw, I love seeing stuff like this!

dalke · on July 21, 2020

Thanks!

Yes, I've considered cloud platform. There are several big difficulties with that.

First, data. It's easy to grab public data from PubChem, ChEMBL, and a few other projects, and make a service. But why would anyone pay for it given that PubChem, ChEMBL, ChemSpider, and others already provide free search services of that data?

There's search-as-improved-sales, like how Sigma-Aldrich lets people do a substructure search to find chemicals available for sale.

There's value-add data. eMolecules includes data from multiple vendors, to help those who want to purchase compounds more cheaply.

Or there's ZINC, which already provides search for their data.

So you can see there's plenty of competition for no-cost search. I don't have the ability to add significantly new abilities that people are willing to pay for.

Note also there's a non-trivial maintenance cost to keep the data sets up-to-date.

Second, the queries themselves may be proprietary. I talked with one of the eMolecules people. Pharmaceutical companies will block network access to a public services to reduce the temptation of internal users to do a query using a potential $1 billion molecular structure (or potential $0 structure). eMolecules instead has NDAs with many pharmas which legal bind them. Managing these negotiations takes experience I don't have, and neither do I have the right contacts at those pharmas.

Sequences don't have quite the same connection between sequence and profit as molecules do.

BTW, part of the conclusion of my work is that people don't need a cluster for search - they can handle nearly all data sets on their laptop, so there shouldn't be a need to scale up any more. And small molecule data has a much smaller growth curve than sequence data, so Moore's Law is keeping up.

My first customer, who continues to be a customer, said outright that they would not buy if it were under GPL.

Since my paying customers are pharmaceutical companies who, as a near-rule, don't redistribute software, it doesn't really matter if they don't redistribute under MIT or don't redistribute under GPL.

I came into the project in part to see if FOSS could be self-supporting on it own. AGPL is often used as a stick to try to get people to use a commercial license - the implicit view of the two-license model is that FOSS is not sustainable. Which is now my conclusion, for this project and field.

fock · on July 21, 2020

not really into industry, but a) the pharma-companies using it are probably reluctant to give you their data and b) uni researchers are not overly fond of high-fee services and labor is cheap there.

zokier · on July 21, 2020

I think to truly appreciate FOSS as a model, one needs to shift away from thinking software as an asset to be monetized to more of a liability that needs to be managed and maintained. Then the benefit of FOSS becomes clear: by publishing your software there is possibility of sharing that burden with others instead of carrying it alone yourself.

dalke · on July 21, 2020

As a revenue-generating project I was able to share the burden by paying people to work on parts of it. I think that was a more effective way than waiting for others to join in.

Note that my paying commercial customers are pharmaceutical companies, where employees must present their presentation materials to their legal department before they give a talk.

In some cases I've heard that that extends to non-trivial code changes, which reduces the number of people who can help out.

Even setting that aside, I've been working on FOSS project for 20+ years. Outside code contribution is rare, and substantial code contribution is like hen's teeth. Bear in mind that most people in my field are chemists-who-program, not software developers.

One major exception is the RDKit, which was primarily funded by in-house R&D. Open sourcing it benefited the company because 1) it was primarily pre-competitive (the internal version included competitive features not in the FOSS version), 2) it lowers the price of commercial tools in the area, and 3) the differential cost of organizing the FOSS distribution was a net benefit.

This is similar to other successful FOSS projects. However, this model implies that one must be an employee of a large company in order to work on FOSS projects. Which is not for everyone. Clearly small profitable proprietary development shops exist, so why not FOSS ones?

taneq · on July 21, 2020

I agree. I can't see how you can 'sell free software' as a standalone product. You build and evangelise free software while selling feature requests, services and support to the users.

dalke · on July 21, 2020

You can sell a proprietary product, right? With a restrictive software license?

That means customers are willing to pay $base + $yearly renewal for the product.

Why aren't they willing to pay the same price for the same product but with an open source license?

I really don't understand why they don't.

I'll go one further - how much will people pay for an open source license over a source available license with a right to modify, no time limits/renewal requirement, but no distribution right?

Answer: all but one of my customers jumped at the chance to reduce the cost by switching from the MIT license to a not-quite-open-source license.

Which means they don't really value the redistribution right.

And I saw this at one small conference about industry use of open source. The organizers - who use chemfp! - stated at the start that the biggest reason they love open source is because it's "free" (meaning no cost), not the principles of software freedom nor the improved development methodology of open source.

I tried selling feature requests, services and support to the users. That was my original plan, and it worked so long as those feature requests were easy and there were enough of them.

But consider that the upgrade to Python 3 took two months. Who pays for that? The first customer who wants Python 3 support 5 years ago, who pays $20K for a feature request which everyone else gets for free? Then there's inventive to wait for a feature request in hopes that someone else will pay it. While the sales model - even as free software - lets me split the cost among multiple customers who need that feature, and across a few years.

I also pointed out that selling services is a disincentive to developing good document and good APIs. I feel like there's a sweet spot where if I were to skimp on the documentation some then there's an increased chance of getting consulting work.

imtringued · on July 21, 2020

>the biggest reason they love open source is because it's "free" (meaning no cost), not the principles of software freedom nor the improved development methodology of open source.

I'm a cheapskate but that's still pretty weird to me. Open source software is free because the entire idea behind it is users don't get excluded. It's more about being accessible than not charging money.

There was a dual licensed HTML component that I was going to use at work but the commercial licensing conditions (not the price) were pretty bad. Per user licensing with a strict upper limit for both active users and the number of apps even though we don't know how many people are going to use the software and most users are only going to use it for one hour per month and we would probably integrate it into a library that will be automatically included in every of our applications to maintain consistency even if the commercial component is not actively being used in every project.

Paying $100/month or maybe a little more for a commercial license with few restrictions that I can just plop in would have been a no brainer but since I'd have to constantly play license tetris it's going to cost my company more time than the product is worth in the long run. It's not a lack money that forced me to go with an open source project that also happens to be free. It's the massive headaches caused by the commercial one.

dalke · on July 21, 2020

My running hypothesis is that many people see open source as a way to avoid dealing with upstream developers.

If I "pip install" a package which brings in a lot of other packages, I don't need to have any relationship with any of those developers. It Just Works.

I don't have to know about their projects, find their web sites, read their calls for funding, learn their licensing options, etc. I don't have to worry about billing. It Just Works.

Even if the price is $100, the fact that it doesn't Just Work means the effective price is far higher.

I decided to focus on industrial customers who were used to software in the EUR ~5-20K/yr range (rather than the ~$1000/yr range) so the overhead costs are proportionally smaller. And why I try to make the code fit into the "Just Works" framework, eg, on Linux-based OSes:

    pip install chemfp -i https://chemfp.com/packages/

WJW · on July 21, 2020

> Open source software is free because the entire idea behind it is users don't get excluded. It's more about being accessible than not charging money.

The reason for the creator of the software to make it open source does not have to be the reason that the users decide to use it.

Even at work, I don't think a single person has ever had the source code of Redis, Postgres or even most of their NodeJS modules open on their machine. The reason they use it is because they can `apt/brew/npm install redis` and off they go. They wouldn't care at all if npm only installed binaries. Zero price enables this kind of easy distribution because every form of money transfer is more difficult (especially in corporate setting where you have to pay for it) than "not paying at all".

taneq · on July 21, 2020

I feel like you're kind of answering your own question here: Most people who use free software do so because it doesn't cost money, not for ideological reasons. So if it's open source, they won't pay for it if they can get it for free the moment someone else buys it.

Some projects allow users to crowdsource funding for new features. That seems to work reasonably well because the cost for a feature is split amongst a number of users who need it.

I agree 100% with your point about support contracts creating a perverse incentive against user-friendliness. I've often wondered how much this effect is responsible for the arcane user interfaces on many pieces of 'enterprise' software (OSS and otherwise).

dalke · on July 21, 2020

I prefer to think of it as replying to earlier papers. Quoting myself under "Funding open source", I wrote:

> Starting around 15 years ago a number of papers discussed the role of free and open source software (“FOSS”) in cheminformatics [49,50,51,52,53]. Most papers argued that FOSS was essential for scientific reproducibility and economically beneficial to organizations, but said little about how FOSS projects could be funded, or the effect of the funding model on the project. ... The rest of this section outlines the issues involved, in hopes of providing insights for future FOSS software projects.

My goal was to separate the "get it for no cost" from "get it as open source" to see how that changes the dynamic, and continue the conversation on open source in my field.

I tried crowdsourcing for a different project. My experience there is that while it made money, the financial risk was higher and profits lower than straight consulting/contract work on in-house software. How much am I willing to give up to do open source? And now that I'm the sole income source for a wife and two kids, that also changes the personal dynamics.

raphlinus · on July 21, 2020

Could you expand on what you mean by "reduce the cost by switching from the MIT license"? On your licensing page you also state "Open source licensing is still available, though it is the most expensive option by far." I think understand what you mean here, but still feel a bit confused.

dalke · on July 21, 2020

Sure, and I apologize for the confusion.

I have academic licensing at EUR 0 for the pre-built Python wheels packaged for "manylinux" and EUR 1000 for source code availability.

For companies have EUR $X for single geographical site, EUR $Y for multiple sites, and EUR $Z for MIT license.

$X < $Y < $Z. For purposes of this example, say that $X=EUR 5000, $Y=EUR 10000, and $Z=EUR 20000, and that license renewals are 20% of sale price. (These aren't the actual prices, but not unreasonably different.)

That means world-wide license renews for EUR 2 000 and MIT license renews for EUR 4 000.

I had several clients with MIT licensing who switched to my (IMO generous) proprietary license to save a few thousand euro. The primary difference is the lack of a redistribution right, which means their value on that right is less than EUR 2 000.

(I think the one company which continues to pay for MIT licensing does so because they see it as a way to provide extra funding to chemfp within their accounting structure, and not because they want the redistribution right.)

Does that clear things up?

raphlinus · on July 21, 2020

Ah, so with the MIT license option they have the right to redistribute the source code, but you're asking them not to do that, as it would undercut your business? And most of your customers comply, as they have their own interest in holding their code close to the vest?

This seems like a highly unusual structure, I can't think of a single other example where "made available under MIT license" does not also imply "code published." But I think I can see where it makes sense for you.

dalke · on July 21, 2020

Yes, though technically I didn't ask them to not buy the MIT license option, but rather offered them another option which was cheaper.

Personally, I would have preferred they stayed with MIT because it appeals to my desire to have people people support open source, and because I would have made more money.

I don't think my pharma customers have any desire to distribute the code, other than to their collaborators. It's not so much "close to the vest" but rather that they don't have any infrastructure support for that - no public repos, no public mailing lists, little career benefit for doing so, and little involvement in managing FOSS projects - and a risk that I would drop them as a support customer.

I'm more concerned about academic customers - people who don't want to pay money - who might release the software, and be able to offer "free"/grant-subsidized support, because that's what grad students do.

I agree that I'm in an unusual. Chemistry has always been more protective than, say, biology. I suspect it's because the connection between compound and new commercial product is more direct than between a biological sequence and a new commercial product. OTOH, there's a huge amount of public research money going into bioinformatics, so it's hard to compare things head-on.

I don't know what other fields are like though. If I developed an improved method to find oil, and sold it to petro companies under an expensive open source license, I don't think they would redistribute it to their competitors.

raphlinus · on July 21, 2020

Thanks for the answer. In practice, the "MIT" option sounds very similar to the much more standard "buy a license to integrate the code into proprietary software." As I'm sure you're aware, that's the model that Ghostscript has used for 30 years or whatnot. Is there a particular reason you're risking your customer doing a source release instead of taking this avenue?

My (incorrect) understanding of your MIT option is that it would essentially be a buyout, in other words compensating you for the lost revenue from a public release. I've seen that happen as well.

dalke · on July 21, 2020

To date me, I learned about the Ghostscript model from Michael Tiemann in late 1996, when it used a delayed release model - GNU Ghostscript versions were released approximately a year after the corresponding Aladdin Ghostscript version.

That model influenced my early thinking for chemfp. I thought there would be enough commercial interest to pay to develop the leading edge (and get a copy under an open source license), with a, say, 2-year delay for the no-cost open source version. There wasn't.

But unlike the Ghostscript of 30 years ago, I wanted chemfp to be a fully open source project. That is, checking now, https://web.archive.org/web/20070614092626/http://pages.cs.w... says "[Artifex Software Inc. is] the only entity legally authorized to distribute Ghostscript per se on any terms other than the GNU or Aladdin" - that doesn't sound like an OEM could re-sell the source code under the same terms as received by the OEM, since only Artifex was authorized to allow that distribution.

"Is there a particular reason .."

My general support for FOSS? My willingness to try an experiment, see how it turns out, and report the results, in order to further the discussion about open source software in my field and how to fund it? My understanding that my pharma clients, almost as a rule, don't do software releases? My expectation that I can always fall back to consulting? My annoyance that published papers on fast similarity search almost invariably showed an amazing performance boost compared to a slow reference baseline, so even if there was a buyout this way, it would still result in reaching my main goals of promoting my FPS format and setting a more honest baseline?

So no, no particular reason, but rather many reasons.

jobigoud · on July 21, 2020

The perverse incentive in the last point has been bugging for years and it's true of proprietary projects too. As soon as you start selling support as a separate product/service, anything that makes the software easier to use creates a conflict.

dalke · on July 21, 2020

Yeah, I fell bad about asking my customers to renew their support contract when they don't send me any support or feature requests during the year.

But they do, so I must be doing something right.

pabs3 · on July 22, 2020

Your example about Python 3 support reminded me of snowdrift:

https://snowdrift.coop/

EvanAnderson · on July 21, 2020

I agree too. Free software definitely doesn't work as a "product".

Software has always been a service. Software as a "product" was an aberration. That model isn't sustainable. The fact the largest software companies are moving to subscription-based models is proof-enough that software as a product is unsustainable.

dalke · on July 21, 2020

Why isn't it proof that a subscription model produces more revenue than a sale model, rather than proof the sale model is essentially unsustainable?

taneq · on July 21, 2020

I'd guess they were going with a theory such as "The subscription model out-competes the sales model on a few axes, therefore (all else being equal) a company relying on the sales model can't compete with a company relying on the subscription model. Sales companies fail, leaving only subscription companies."?

dalke · on July 21, 2020

I thought about it some more, and realized a different issue.

I charge a fee to get the software, and renewal fee for support and upgrades.

That's essentially a subscription, right?

The main difference is that once my customer has the software, even under the proprietary license, there's no time limit to their continued use of their old versions.

So is EvanAnderson's point that only time-based subscription renewals are sustainable?

taneq · on July 21, 2020

That's how I'd read it. For a while, computers and software were advancing so fast that anything over ~3 years old was obsolete and essentially useless anyway. The whole software industry is built around everyone throwing their computer in the bin every 3 years and starting over, buying all their software and hardware again.

Somewhere in the mid 2000s we reached a point where computers were "good enough" and so that upgrade cycle started stretching longer and longer. I'm still using a computer I built 5 years ago and (barring the spinning hard disk which died last year) there's no reason I shouldn't keep using it for another 5, along with all the software I run on it.

Hence the industry-wide switch to subscription models (and especially to cloud-based subscription models as they're essentially impossible to pirate). They had no other way to maintain their revenue streams.

dalke · on July 21, 2020

Thanks. There are a few comparisons to my field which stand out as being different than an end-user software sales model.

First, like I said, most companies in my field (including me) already do a software subscription model. You write "there's no reason I shouldn't keep using it for another 5", but there are reasons to pay for chemfp upgrades even nowadays:

1) I track support for underlying toolkits, where new features are added and APIs change. For example, the most recent release of chemfp added support for Open Babel's 3.0's new circular fingerprints, and for a number of new structure formats in OEChem and RDKit.

2) The commercial version adds Python 3 support - if you stay with Python 2 then you build up technical debt. (#1 and #2 depend on other software advancing enough to need support.)

3) I added improved performance, new APIs, and Zstandard compression support, which resulted in better I/O performance over network file systems than gzip.

So while you could wait 5 years - and it would be cheaper for you to only buy a new copy every 5 years than a support contract - there are advantages to spending the money. And I can still budget for a 5-year update period; while Microsoft has to meet market growth expectations on their revenue streams.

Second, there's plenty of scientific software in my field which started decades ago, and which haven't become "essentially useless" in the meanwhile, making them counter-examples to your broad description.

Third, pharmaceutical companies keep their data close, and prefer to analyze things internally rather than on the clouds. Even the search queries may contain sensitive chemical structure information, which causes some companies to block access to chemical search services unless an NDA is in place with the search provider.

EvanAnderson · on July 21, 2020

That's basically my thesis. Even entertainment software, which is, I think, the most amenable category to being a "product" is going the way of subscription services.

dalke · on July 21, 2020

I commented in the parallel thread at https://news.ycombinator.com/item?id=23905523 . Basically, commercial scientific software has long had a software product model based on yearly support/upgrades, scientific software has a long history where even decades-old code bases are useful, and pharmaceutical research has a practice of doing their database searches and research in-house, in order to protect trade secrets.

So I think scientific software is more amenable to this sort of product+support model than you do.

Or rather, what I think of as a product sale includes a support contract or occasional re-purchase at full price, rather than a pure fire-and-forget model like embedded software.

EvanAnderson · on July 21, 2020

That's my take on it. Recurring cash flow trumps one-time sales. When you consider that all of our computing platforms are in flux, and all of our software is effectively built on shifting sands, virtually every product will need some kind of long-term "maintenance", even if that just means the silly make-work of moving to new APIs / OS versions / CPU architectures for no benefit to the program's feature set. A subscription model is the only effective method I see to sustain that maintenance and keep the software relevant as time passes.

WJW · on July 21, 2020

In this viewpoint it should also become immediately clear why for-profit companies interact so poorly with FOSS communities: they are not interested in taking on unnecessary burdens and would prefer the original developer(s) contributing more free labour over having to pay for it.

dekhn · on July 21, 2020

It's funny just how much the implementations described in the paper map to how modern search engines implement retrieval. The same is true for BLAST and other search engines.

(it's a very readable paper and I enjoy the frank expression of view, even if I have a vastly different perspective on how to accelerate problems like this)

dalke · on July 21, 2020

There's a deep connection between what I do and text retrieval in general.

Take a look at the early work in IR in the 1940s and 1950s, at https://en.wikipedia.org/wiki/Information_retrieval#Timeline

1947, "Hans Peter Luhn (research engineer at IBM since 1941) began work on a mechanized punch card-based system for searching chemical compounds"

1950s, "invention of citation indexing (Eugene Garfield)" - Garfield's earlier work was with chemical structure data sets, and his PhD work was on the linguists of chemical compound names.

1950: "The term "information retrieval" was coined by Calvin Mooers." - that was presented at an American Chemical Society (ACS) meeting that year, and in the 1940s Mooers developed an early version of what is now called a connection table, hand-waved a substructure search algorithm which was implemented a few years later. (I'm a Mooers fanboy!)

Many of the early IR events were at ACS meetings - the concept of an "inverted index" was presented at one, as I recall.

This is because in the 1940s, chemical search was Big Data, with >1 million records containing many structured data search fields, and demand for chemical record search from many organizations.

So many of the core concepts are the same, though in cheminformatics we've switched to a lossy encoding of molecular features to a bitstring fingerprint since we tend to look more at global similarity than local similarity, and there are a lot of possible features to encode.

Thank you for writing that it was a very readable paper. I have received very little feedback of any sort about the publication, and have been worried that it was too verbose, pedantic, or turgid.

dekhn · on July 21, 2020

Its a bit verbose, and I really think it's several papers (the technical details of the package is one, the open source positioning is another). But it's readable- a person outside the field (say, a search engineer at Google) could sit down, read this and immediately recognize what you were trying to achieve ("implement popcnt" used to be a popular question), and then immediately suggest ways to get the output results faster by using a cluster :)

dalke · on July 21, 2020

Indeed, it is several papers. There are two journals in my field - one I can't read because it's behind a paywall and one that's expensive to publish in. I choose the latter, but couldn't afford multiple months of rent in order to publish several papers. :(

A blog post I wrote years ago use to part of the "implement popcnt" literature - http://www.dalkescientific.com/writings/diary/archive/2008/0... . It's now outdated, and actual low-level programmers have done better, but it still gets mentioned in-passing in postings like the one referenced on HN last year at https://news.ycombinator.com/item?id=20914479 .

dekhn · on July 21, 2020

It's really extraordinary how tightly coupled modern innovation in scientific fields is to processor implementations. I suspect you and I share a keen interest in the path by which we got to this enviable situation.

nkurz · on July 21, 2020

> even if I have a vastly different perspective on how to accelerate problems like this

Do tell more!

After the initial optimization, I did hint at an approach to Andrew that I thought could get a further large speedup. Essentially, the idea was to "rotate" all the stored data 90 degrees, so that instead of counting the features present in each compound you read lists of compounds that contain a given feature, storing the hits in some very fast custom data structure. He wasn't particularly interested, likely correctly realizing that it would be a lot of work for an uncertain amount of gain. The question wasn't really whether I could achieve further speedup (although there was question as to how much), rather (as alluded to in the paper) whether he would be able to sufficiently increase sales to justify the additional development cost and added complexity of the codebase.

dalke · on July 21, 2020

Towards the end of the writing the paper, another paper came out on "RISC: rapid inverted-index based search of chemical fingerprints", https://doi.org/10.1021/acs.jcim.9b00069 which does something along those lines.

It was close enough that I published the pre-print "RISC and dense fingerprints" at https://doi.org/10.26434/chemrxiv.8218517.v1 to examine its claims. I found that their RISC implementation was faster than chemfp for low bit densities (<~5%), which includes the popular 2048-bit ECFP/Morgan fingerprints for smaller radii, and uncommonly high similarity thresholds.

Otherwise, chemfp was faster.

So while there's certainly something to investigate there, I think it's better to focus that effort on truly sparse fingerprints and count fingerprints, rather than nominally dense bit fingerprints.

Just needs money and time. ;)

Plus, part of the focus was on making chemfp a really good baseline for these sorts of timing tests.

rurban · on July 21, 2020

Im sceptical that a good single CPU search can compete with massive parallel HW, like this one: https://www.graphcore.ai/posts/introducing-second-generation...

dalke · on July 21, 2020

Sure. It can't. Even GPUs will beat a CPU. In my paper I commented:

> GPU memory bandwidth is an order of magnitude higher than CPU bandwidth, so a GPU implementation of the Tanimoto search kernel should be about ten times faster. Chemfp has avoided GPU support so far because it’s not clear that the demand for similarity search justifies dedicated hardware, especially if the time to load the data into the GPU is slower than the time to search it on the CPU. GPUs are more likely to be appropriate for clustering mid-sized datasets where the fingerprints fit into GPU memory.

Corporate compound sets have ~5 million records. That can be searched on a laptop in about 50ms.

A large data set containing physically measured properties is ~100M records, which takes a bit over a second. The largest data sets people search, with synthetically generated compounds, is around 1G records. That requires distributed computing. But most people don't work with them.

They say the best camera is the one you have with you. Most people have a CPU with them. Fewer have massive parallel HW with them.