Hacker News
Ex-Googlers raise $40M to democratize natural-language AI (fastcompany.com)
201 points by aaronbrindle on Sept 7, 2021 | 122 comments



In the ML industry, there are two types of companies.

1/ Open source companies like huggingface (https://github.com/huggingface), explosion (https://github.com/explosion), and Fast.ai that democratized access to ML and provided a set of tools & language models for engineers. These companies didn't only manage to build a company and open source projects, they built a welcoming community, and it's impressive to see all these students use these tools to tackle bigger problems. That's what "democratized" means to an engineer[1].

2/ Media companies that built ML APIs to ease the use of such services, like OpenAI. I think this company falls under this category.

[1]: https://marksaroufim.substack.com/p/machine-learning-the-gre...


Do you know how Huggingface can pay their bills? How do you monetize an AI that everyone can freely download?


Yes. HF is built on a tiered subscription (SaaS) model. https://huggingface.co/pricing

You can refer to this discussion for more details: https://twitter.com/migueldeicaza/status/1285204129225281536


I believe huggingface mostly subsists on investments and the amount they earn directly is still pretty small compared to expenses. Is that no longer the case?


One of the founders said on twitter (don't remember when exactly, but less than a year ago I think) that they were cash flow positive.



Thank you :)

I guess I'm now "old" in the sense that it never occurred to me to search for this kind of info on Twitter.


Unfortunately the new models are very big, hence they cannot be released as smaller (open source) models. They must be served as an API.


> ... They must be served as an API.

I respectfully disagree with this statement; there is a large body of work focused on making this easier. The HW/SW stack varies from one company to another.


GPT-J is huge and completely open source


Is "democratize" the hip way to spell "sell" these days?


I've got some old clothes I'm planning to take to the recycle shop and Democratize over the weekend.


Let's democratize our democracy to the highest bidder!


Vote with your dollar!


And "AI".


[flagged]


> That's why suddenly so many care about the rights of minority groups, too.

Yes, as we all know, that idea was invented in 2004. Before that, it was unheard of to worry about the rights of minorities. The founding fathers of the US were famously not concerned about a tyranny of the majority.


Tuesdays are buzzword bingo nights


"Commoditize" not "democratize". If it were democratized I'd be able to spin up a local instance of their API. With the setup described in the article they are the unilateral arbiters of access to and usage of the technology.


Which is more "democratized": a large language model which can be downloaded, and accessed as a library, originally based on the work of the FAANG giants (e.g. huggingface transformers), or an API, where every invocation is a call that flows through Cohere's servers?


The former.

"Data dumps" are privacy-friendly. For example, a user can download Wikipedia's dumps and search through them to her heart's content, and never use the network. Zero data collection by third parties. All those observing the network can see is that she downloaded some data dumps.

"Web APIs" are "tech" company and surveillance-friendly. Network access is required and all activity is observed and recorded. Web APIs are also used as a means of controlling access to what is often publicly available data/info. The company does not own the data/info; it's a middleman. "Too many" requests, and the API user gets cut off.

Too often it's publicly available data/info that is being served by APIs. Hard to sell data dumps of public data as a "product". I sometimes see entities that provide "data dumps", e.g., a corpus, for free, who then try to restrict usage of it through a license, yet they do not themselves own the data. Whether they even have the rights to "license" it is debatable. They are never legally challenged so we cannot say for sure. The more interesting issue is whether they had the rights to collect the data that's in it. The so-called "web scraping" issue.


Reminds me of a joke.

"They say New Yorkers are selfish and unfriendly, but it's all untrue. When I visited, a guy overheard I was a tourist and came right up and offered me a great deal on Staten Island Ferry tickets. Just $7.50 for a round trip!"

(In case you don't know, the Staten Island Ferry is free).


Information arbitrage


"Democratized" in the context of a startup usually doesn't mean letting the public access something, it usually means letting a different group of investors (the funders of the startup) access a market (formerly controlled by a monopoly or oligopoly of established mega-firms).


This reminds me of the whole privatisation vs nationalisation debates in the UK. Labour claimed when it nationalised the railways and other parts of the economy it was giving them "back to the people" in a democratic fashion because they were now owned by the state, which it conceives of as an expression of the body politic, rather than corporations, which are seen as entirely separate from the people. The Tories when they privatised these things also claimed they were giving them "back to the people" in a democratic fashion as private individuals could now freely invest if they chose in these companies rather than them being controlled by the state, which it conceives of as something entirely separate from the people*.

"Democracy" and especially "the people" can mean lots of different and completely contradictory things to lots of different people. I'm always very cautious of anyone who invokes "democracy" and "the people" directly, I don't automatically consider them untrustworthy but I prefer a direct argument for a particular idea. If someone's truly a democrat, they'd be unafraid to make their point without such potentially dishonest tactics as people would democratically choose that idea anyway.

* I'm massively oversimplifying here - there's such things as Blairism and One-Nation Conservatism which blur these lines enormously.


> it usually means letting a different group of investors (the funders of the startup) access a market

I guess 'democratize' rolls off the tongue more easily than 'heteroligocratize'.


>Which is more "democratized": a large language model which can be downloaded, and accessed as a library, [...] or an API, where every invocation is a call that flows through Cohere's servers?

I do understand the point you're trying to make: local autonomy is superior to cloud access mediated by a private commercial entity.

However at this time, we may have a counterintuitive situation where API access is more "democratic" than downloading a huge model.

Based on various reports[1], the GPT-3 model was trained on ~45 terabytes of text corpus (Wikipedia + Web Common Crawl + book texts, etc) and the final runtime model (175 billion parameters) requires ~350 gigabytes of RAM. In that case, the model size is ~1% of the training set.

So "democratize" depends on how ambitious the user is. If you want to use a very large 350GB-RAM model, the cloud model with an API will be more accessible to the masses than running on local hardware. Last time I looked, an Intel Xeon motherboard has a max RAM of 128GB, so scaling up to 350GB RAM is not going to be cheap or trivial to build.

Let's further extrapolate to a future hypothetical GPT-4 using ~10x multiplier: train on 450 terabytes of text with a model requiring 3.5 terabytes of RAM. How do we make that future huge model accessible to the masses? Probably via a cloud API. Unfortunately, there's an unavoidable hardware capital expense barrier there.

[1] https://www.google.com/search?q=gpt-3+350gb+ram
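To put numbers on the RAM estimate: the 350GB figure follows from parameter count times bytes per parameter. A quick sketch, assuming fp16 storage (2 bytes per parameter); the actual serving precision isn't public, so treat this as back-of-the-envelope:

```python
# Rough memory-footprint estimate for large language model weights.
# Assumes fp16 (2 bytes/parameter); real deployments may use fp32,
# quantization, or sharding across machines, changing these numbers.

def model_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Return approximate weight memory in gigabytes."""
    return n_params * bytes_per_param / 1e9

gpt3 = model_memory_gb(175e9)          # 175B parameters
future_10x = model_memory_gb(1.75e12)  # hypothetical 10x model

print(f"GPT-3 (175B params): ~{gpt3:.0f} GB")
print(f"10x model (1.75T params): ~{future_10x:.0f} GB")
```

This reproduces the ~350GB figure and the ~3.5TB extrapolation above; activations and serving overhead would push real requirements higher still.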


> Last time I looked, an Intel Xeon motherboard has max ram of 128GB so scaling up to 350GB RAM is not going to be cheap or trivial to build.

Even a low-end tower server can be configured with multiple terabytes of RAM.

E.G. https://www.dell.com/en-us/work/shop/povw/poweredge-t640#tec...


This is a strawman argument. Publishing the code/weights is not mutually exclusive with providing an API.


>This is a strawman argument. Publishing the code/weights is not mutually exclusive to providing an API

You're misinterpreting my comment. I'm directly addressing this fragment by the gp: >which can be downloaded, and accessed as a library,

I'm not making any moral ideology statements about the model's "openness", "transparency", or "intellectual property".

As a person very interested in playing with something like GPT-3, I'm talking about practical concerns of even running the model. Some type of cloud API access lets me run experiments today. Hopefully the API cost is reasonable or free with limits. I believe that's true of most researchers because they can't afford the hardware in the near future to run a GPT-3-size model as a local library.


> Hopefully the API cost is reasonable or free with limits

If the model was open source you could have an API market where providers competed to build the most economic service just like virtual machine companies do with Linux. If there is only one API then everyone is stuck with them and just has to hope they don't change the prices.

There is practical concern for researchers within the possibility that what costs you $0.05 to run today will cost you $500 tomorrow (see Google Maps API for when this actually happened).


It seems like quite a simple calculus to me: if ever the cost of reasonably using the API exceeds the cost of buying and operating a machine with 350GB RAM, then people will switch to the latter. Either way, I don't see how adding a new option could make anyone's situation worse.
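That calculus can be sketched directly. All the prices below are hypothetical placeholders (not any vendor's actual rates), just to show the break-even structure:

```python
# Toy break-even comparison: metered API usage vs. owning hardware.
# Every constant here is a made-up placeholder; substitute real quotes.

API_COST_PER_1K_TOKENS = 0.06   # hypothetical $ per 1k tokens
SERVER_CAPEX = 40_000.0         # hypothetical one-time cost of a 350GB-RAM box
SERVER_OPEX_PER_MONTH = 500.0   # hypothetical power/colo/maintenance

def api_cost(tokens: float) -> float:
    """Total API spend for a given token volume."""
    return tokens / 1000 * API_COST_PER_1K_TOKENS

def hardware_cost(months: int) -> float:
    """Total cost of owning the machine over a horizon."""
    return SERVER_CAPEX + months * SERVER_OPEX_PER_MONTH

def breakeven_tokens_per_month(months: int) -> float:
    """Monthly token volume above which owning beats the API over `months`."""
    return hardware_cost(months) / months / API_COST_PER_1K_TOKENS * 1000

print(f"{breakeven_tokens_per_month(36):,.0f} tokens/month over 3 years")
```

Below the break-even volume the API wins; above it, local hardware does, which is the switching point the comment describes.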


> the cost of reasonably using the API exceeds the cost of buying and operating a machine with 350GB RAM

Capex and opex of these two are quite different.


What's wrong with both? A true democratisation of this kind of model would involve both an offline model for those with the resources to support such a thing as well as many different people hosting them and allowing access for a price via an API for those without those high end resources.

I think the issue people have with centralised cloud APIs for this sort of thing is that there's still a single gatekeeper with their finger on the off switch. In my opinion, instead of throwing out cloud APIs altogether a better scenario would be many gatekeepers with a deliberate diversity of socio-political backgrounds.


If OpenAI made the weights and the NN configuration freely available, I am sure that several organizations would offer cheap (or even free with strict rate limits) API access. And users would also be able to run the model locally, if they can afford the hardware.


This is a common misconception. GPT-3 was trained using a ~300B-token (~300GB) subset of Common Crawl and friends. The model is larger than the dataset.


When a startup says anything, you should hear "we plan to use this to get hella rich".

There is no such thing as a startup with altruistic intent.


To reductio ad absurdum that - every time you see any marketing, eg “Drink Coca-Cola because it’s refreshing”, you should hear “Drink Coca-Cola because it’ll make us money.”

To your point though: “democratize x” makes my eyes roll. It’s overused hip marketing speak.


That's accurate though; Coca-Cola doesn't care about whether you're refreshed unless it's profitable to them. The refreshment is the means towards the end of profit.


But it is also not zero sum. Long term viable companies provide a higher value to their customers than the price they charge.


Altruistic startup = a non-profit?


Non profits and charities can certainly be altruistic.


Not only that, democratize would imply that the general public has influence on the project through voting, for example.


I think there are degrees of democracy. There's the directly democratic model at one end and "not hidden away in a black box at $mega_corp" at the other. You can be in favour of democracy without advocating mob rule, for example.


If the former has equivalent performance (as in output quality) etc., then the former obviously.

If the former doesn't exist (at a comparable quality), then the latter is better than nothing.


In America, money is speech-- especially in politics, so unfortunately maybe this is a "fitting" use of the word.


The same day this made the front page of hacker news, one of the founders is also playing his first live concert with his band tonight: https://twitter.com/goodkidband/status/1435267485683568642.

The best part of the story for me is the work-life balance he's able to achieve.


All of us in the band are engineers (I work at Snapchat) and we take the band super seriously and make time for it


Ha, what a great find. First time listening, but liked their music [1].

[1] https://www.youtube.com/watch?v=2t9TxR4IolM


fast.ai are doing the best work democratizing AI (machine learning and neural networks), including natural language

Can't recommend the course enough and it's incredible that such a quality resource is available for essentially zero cost

https://github.com/fastai/course-nlp

https://www.youtube.com/playlist?list=PLtmWHNX-gukKocXQOkQju...


I am/was a huge fan and have done their courses but they've made so many weird decisions (like going all in on swift4tf for teaching newish people so early on or the dramas they involve themselves in) and have stopped doing as much that I'm not sure if they are as great of a current example anymore (but maybe they'll resurface stronger).

Huggingface are a better example closer to what Cohere do.


The “Why is this course taught in such a weird order?” section is really interesting.


"Democratize" has got to be the most abused new buzzword in tech for this decade. Right up there with "synergy" from yesteryear.


Cloud seems like it should be between those buzzwords, chronologically.


>For example, Cohere is providing the NLP capability for Ada, a recent unicorn company in the chatbot space. Ada has experimented with the Cohere natural-language models to match customer chat requests with available support information. Rather than trying to anticipate all the possible wordings of a request, Cohere’s model tries to understand the intent behind it, Gomez says.

>Cohere, he says, offers a platform containing a “full stack” of NLP functions, including sentiment classification, question answering, and text classification.

Google (Cloud Natural Language, Dialogflow), Microsoft (Azure LUIS, Bot Service), Amazon (Lex), and IBM (Watson Assistant, Watson Discovery) all offer Conversational AI and NLP APIs that do exactly what these guys are trying to do here. What is new or unique about them? I work in this space, and the Chatbot/Conversational AI market has been flooded with startups all doing the exact same thing for the last 3-4 years.


Ada is the company doing chatbots/conversational AI; Cohere is providing them with language models to do it. That's not the same thing as Cohere being a company in the chatbot field.


I understand that. However, the companies I listed are providing BOTH the raw NLP APIs as well as chatbot builder APIs.


This seems much more developed than whatever is being done here:

https://huggingface.co/transformers/index.html

I recommend it since they already are democratizing AI.


Wonderful concept but whenever I hear the term 'democratize' in a description I laugh a little inside. It's hugely overused, to the point where lazy marketers of all kinds are using it willy nilly.

We need better lingo.


How does the business model work here?

For a language processing SDK to be useful, it needs to work inside an offline-capable app on my phone. But doesn't that make it way too easy for someone to extract the network and use it to train their "own" AI with distillation learning?

And if they instead only run the AI on their servers so users need to connect through an API, how is that any different from the language AI APIs that Google, Amazon, Microsoft already offer?

And let's say they foot the bill to train a new type of language AI, what stops the big cloud providers from just training something similar? A 200-million non-recurring expense won't stop Google if you have proven that it's a viable business for you.

Asking because I'm pondering similar issues for my AI project.
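For anyone unfamiliar with the distillation risk mentioned above: the core signal is just training a student model to match the teacher's softened output distribution. A minimal numpy sketch; the temperature value and the toy logits are illustrative assumptions, not anyone's actual setup:

```python
import numpy as np

# Minimal sketch of knowledge distillation's training signal: the student
# is pushed toward the teacher's softened (high-temperature) outputs.

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return float(-(t * np.log(s + 1e-12)).sum())

teacher = [3.0, 1.0, 0.2]  # e.g. outputs harvested from a deployed model
student = [2.5, 1.2, 0.3]
print(distillation_loss(teacher, student))
```

An attacker with enough query access to an on-device or API model could harvest teacher outputs like these at scale, which is exactly why shipping the full network inside an offline app makes extraction easier.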


> But doesn't that make it way too easy for someone to extract the network and use it to train their "own" AI with distillation learning?

If you're targeting medium-to-large businesses, I suspect lots of them will buy a licence just to make sure they can't be sued (e.g. https://majadhondt.wordpress.com/2012/05/16/googles-9-lines/), even if it would be realistically near-impossible to detect if they 'stole' it.


"Democratize".

lol. It's now a weasel word used by start-ups. Merits an eye roll before moving along.


Is Ex-Googler still a thing when everyone is an Ex-Googler?


I was about to post that. Anyone who reads the little PDF Google gives you when you apply there knows pretty much exactly how the interview will go, and the questions they frequently ask are all over the internet. At that point it's just about cramming your brain for a couple of days.

Like, I don't know how to write an A* off the top of my head, but back when I was looking at Google a while ago, they always asked that question, and the prep PDF said so. Kind of hard to fail there...

Not everyone will be able to work there, but it's certainly not the status symbol it was a long time ago. Plenty of "Ex Googlers" are code monkeys like anyone else from any other company.


Haha, I see this even on some people's dating profiles ¯\_(ツ)_/¯


exercise for the reader: the shortest program / keystrokes that would convert:

"democratize X" -> "capitalize on X hype"

NB: the challenge is that X can be any important sounding combination of words


Just watched the WeWork documentary last night and it makes me question the validity of every money raise now.


What stops Google, Microsoft, etc from doing the same if Cohere's solution gains traction?


Don't they try this currently? https://cloud.google.com/natural-language/docs

I think during a gold rush, it's good to sell shovels even if you can't be the leading shovel sales outlet in town.

I think we're at a point where new NLP techniques will create a lot of value, but it's still hard to tell in advance which cases the current techniques can do well enough to be worthwhile, and in which cases they'll be kinda cool but not effective and so we'll continue to require humans to be involved catching errors. While we sort that out, a lot of companies will need to take a few stabs at trying to get some new model to work for their problems.


Bureaucratic inertia, sunk cost fallacy, interdepartmental resource hoarding, corporate bloat, and finally, HR and middle management all jealously guarding their petty kingdoms.

Oh, and the fact that Cohere is an expensive solution for a problem that's already been solved for free by open source groups.



Huggingface more generally, but I'd highlight https://huggingface.co/EleutherAI/gpt-j-6B , currently the largest model, which is almost as good as DaVinci in most respects.

It's not about particular models, though, but the ease of use in numerous cloud or self hosted scenarios. What's the value add for Cohere?


Oh, somehow I've missed that one. Thanks. Agreed about numerous scenarios. Personally I think there is value in managing the models, since getting the multiple TPUs/GPUs and setting up REST APIs is not trivial. Serverless-style GPU rental is a little tricky right now in my experience. But HuggingFace is doing something similar and the moat isn't too wide.


I wonder if either of the two companies have introduced many new products recently?

When you only have $1m you might as well poke around various ideas and see if you can get another million or ten. Now, if you're sitting on tens of billions of dollars your priorities change - it's far more profitable to grow a $10b pie by 10% than a $1m pie by 1000%. I'm pretty sure that's what happened to Oracle and IBM, hence we rarely hear about them anymore.


Nothing is democratized here, this is pure marketing.


I'll just say that the word democratize does not appear on Cohere's website.


Large language models are all the rage, it seems.

Can someone name a successful (not necessarily profitable) concrete application of these things, other than "gpt-3 wrote an article in the Guardian and said it wouldn't kill us"?


I used largish (GPT-2 and similar) models to build an app discovering Category Entry Points (a marketing thing around the things people are thinking about when they decide they need to buy a particular product) for specific product categories.

It was very successful.


We use specifically prompted gpt-3 to generate synthetic training examples (eg paraphrases, summaries, etc). We fine tune other (much smaller than gpt3 but still large-ish) language models for controllable language generation (often augmented with synthetic data from gpt3). As a comparison point, we did try GPT Neo and it did not provide sufficiently high quality synthetic data.

Transformers in general have lots of applications (machine translation, information retrieval/reranking, ner, etc).


In the narrow access window I was allowed to Philosopher AI, I found it incredibly helpful in brainstorming and bouncing ideas off. It helped me organize my project, and I even included the conversations in the repo.


AI Dungeon?


GitHub Copilot?


"democratize" is double-speak for "closed source".


Worse, it's double-speak for SaaS APIs, aka. "you have to enter into a business relationship with us in order to use it".


Is being an ex-Googler still a thing nowadays?


Argos Translate has open source neural machine translation https://github.com/argosopentech/argos-translate


When your product revolves around the fact that you are an ex-Googler...


It’s not like it’s even verifiable is it? Literally anyone can say that.


"democratize"

You keep using that word. I do not think it means what you think it means.

The word you are looking for is "mass-market" or perhaps "sell to small business".


What they say they're doing fits one of the definitions of democratize: "make (something) accessible to everyone".

More about the etymology here: https://www.etymonline.com/word/democratize


Commoditize as an alternative?


A commodity is a product that's not differentiated between suppliers (think silver, pork bellies, robusta coffee beans).

Examples of things that are democratized (accessible to virtually anyone) but not commoditized (not the same as what you can get from other providers):

- Coca cola

- iPhones

- Teslas

- Google search

- HN

If you're going to make money by creating an NLP service for wide adoption, you'd hope to create some competitive moat. Otherwise there's no way for you to earn profits to amortize your R&D.


So from your definition and example, it means that Coke wants to democratize soda, Apple wants to democratize the iPhone, and Tesla wants to democratize EVs. I don't think those sentences would have your expected meaning when someone else reads them.


Well, no, because the current state of affairs isn’t that very few people have access to smartphones / soda .


Lots of people have access to natural language AI — Google search, voice assistants, WolframAlpha, AI dungeon, GPT-J — so if you agree that Coke doesn’t get to say they’re “democratising” soda because that’s a misuse of terminology, you should also agree that commercial deployment of a natural language AI API doesn’t get to say they’re “democratising” AI because that’s also a misuse of terminology.


Yes.

(I mean, like coke, they could say it, but it wouldn’t make sense.)

Possibly one could claim something is democratizing access to some specific aspect?

I do think it probably best to tend away from using the term unless it is a particularly good fit though. (“Best” as in “what I would prefer”, not as in “most profitable”)


So Coke pioneered the democratization of soda.


All that source tells us is that the word has been wrongly used since the French Revolution. Democracy means governance by the masses. It has nothing to do with making things accessible to the masses.


> the word has been wrongly used since the french revolution

Words don't have to mean what their Ancient Greek component parts mean. Meaning = use.


Les docteurs du spin...


I think the word is monetize.


Or, for a darker spin, "absolve yourself of responsibility if anyone misuses your API". I actually think OpenAI gets this right with GPT-3, having fairly strict terms of use for their API, so I'm sad to see seemingly less responsible players enter the space.


EleutherAI won this fight already. The technology is freely available for anyone to use however they like, and they're responsible for whatever it is they do with it.

The 6 billion parameter model was released just last week with the huggingface API hooks. It's just a few percentage points less performant than GPT-3 DaVinci in most metrics. You can run it on a laptop with 40GB RAM, albeit slowly.

All that to say, the horse has left the barn. What openai is doing is just marketing and protecting an investment. They don't have an ethical or moral high ground.


What does democratize even mean?


de·moc·ra·tize /dəˈmäkrəˌtīz/

verb

introduce a democratic system or democratic principles to. "public institutions need to be democratized"

make (something) accessible to everyone. "mass production has not democratized fashion"

https://www.google.com/search?q=democratize

I suppose that would be the second meaning.


The real answer: Democratisation of software or systems is making Free and Open Source all the software and knowledge required to operate the software or system. Democratisation is also implementing structures that make the development process beholden and accountable to the users. Fundamentally it's about giving the users true ownership over the software and a voice in the development process and/or operation.

Now, how is it used here? It's more just a buzzword that suggests they are trying to make it more accessible by selling it as a SaaS. They seem to have a focus on ethically providing access to these more powerful ML systems, but what this'll actually mean beyond "cover our ass" is up in the air.


Like synergize, but with 20% more righteous and warm fuzzy.


Here, it means 'make available to everyone'.

It has at least one other meaning ('introduce democracy as a system of rule').


It's a trendy marketing word for selling to a lot of smaller businesses, in contrast with "normal" enterprise software sales which stereotypically focus on big contracts.


I think this kind of marketing language is aimed more at potential employees / investors than customers. Employees and investors often want to imagine themselves as part of a narrative where they're "democratizing" something.


They should call it something that makes it clear how “democratic” they are and the fact they are in AI. Hmmm, like Open but about AI. I give up.


Democratizing AI is passe. If they have no plans to decolonize AI then they're part of the problem.


>In the last year, critics of large NLP models, which are trained on huge amounts of text from the web, have raised concerns about the ways that the technology inadvertently picks up biases inherent to the people or viewpoints in this training data. Such critiques gained steam after Google controversially pushed out famed AI researcher Timnit Gebru, in part due to a paper she coauthored analyzing these risks. Cohere CEO Aidan Gomez says his company has developed new tools and invested a lot of time into making sure the Cohere models don’t ingest such bad data.

So it's censored. No thanks, not interested.


I have been thinking recently about the ethics of AI;

At what point do we 'allow' AI to determine actions based on perceived (programmed) *BIAS*? How can one prevent any bias in an AI's ability to be deterministic?


All of the conversations I've seen about AI bias recently seem to define "bias" as "any difference between the output and the particular rightthink ordained by whoever's speaking." Nobody cares about making the AI's output correct, they just want it to agree with them.

So the glib answer is "train it on unbiased data". Depending on your philosophy, this translates either to "manually 'fix' anything you see as 'bias' in the training data", or "use a sufficient amount of entirely unmodified raw data along with an algorithm sufficiently insightful to cancel out all of the sources of inaccuracy introduced by the various sources of data and extract ground truth."

But you know if anyone ever does manage the latter, its results will still be decried as 'biased' by anyone who disagrees with them.


I am now convinced this is what is meant by this:

https://en.wikipedia.org/wiki/Ouroboros

but instead it's a warning against letting AI iterate upon itself without external intervention...


>Nobody cares about making the AI's output correct, they just want it to agree with them.

that is the 4th industrial revolution: the "post-correct" world, where the abundance of information (and its consumers) and the speed of its production allow for, and result in, the co-existence of multiple truths (kind of like hyperbolic geometry, where one can draw through a point multiple different lines parallel to a given line), with the information space splitting into multiple feudal-era dukedoms.

Anyway, in general for biases I think we have a D. Rumsfeld situation: the known/expected biases are known, while the AI-driven world would most probably bring new biases that we don't even expect.


AI simply replicates the existing bias the data already has. The downside is that it could amplify the bias if not done carefully; the upside is that now we can analyze the algorithm's bias and fix it, which is much harder to do for people with biases.


But to do this you would need unbiased people. Since those don't exist, this correction would just be matching the bias to that of the bias adjuster.


Nah, you just need enough people to review the fixes. If at the end everyone is unhappy with the results, you're good to go.


There's an infinite number of wrong answers that also anger everyone, so that's not a sufficient metric for correctness.


This was exactly my point -- at what inflection point is the AI's decision sound, versus based on bias from its base creation code (whatever we call the substrate code that 'births' an AI)? As an AI is meant to iteratively evolve, at what point is it required to 'check in changes', such that if a rollback is required, it may be accomplished across that AI's reach...

We need a "product recall" method that doesnt involve Blade Runners and campy one-liners...

SERIOUSLY


What if an AI is unbiased but makes decisions people don't like?



