In the ML industry, there are two types of companies.
1/ Open source companies like huggingface (https://github.com/huggingface), explosion (https://github.com/explosion), and Fast.ai that democratized access to ML and provided a set of tools & language models for engineers. These companies didn't just build a business and some open source projects; they built a welcoming community, and it's impressive to see all these students use these tools to tackle bigger problems. That's what "democratized" means to an engineer[1].
2/ Media companies that built ML APIs to ease the use of such services, like OpenAI. I think Cohere falls under this category.
I believe huggingface mostly subsists on investment, and the amount it earns directly is still pretty small compared to its expenses. Is that no longer the case?
I respectfully disagree with this statement; there is a large body of work focused on making this easier. The HW/SW stack varies from one company to another.
> That's why suddenly so many care about the rights of minority groups, too.
Yes, as we all know, that idea was invented in 2004. Before that, it was unheard of to worry about the rights of minorities. The founding fathers of the US were famously not concerned about a tyranny of the majority.
"Commoditize" not "democratize". If it were democratized I'd be able to spin up my a local instance of their API. With the setup described in the article they are the unilateral arbiters of access to and usage of the technology.
Which is more "democratized": a large language model which can be downloaded and accessed as a library, originally based on the work of the FAANG giants (e.g. huggingface transformers), or an API, where every invocation is a call that flows through Cohere's servers?
"Data dumps" are privacy-friendly. For example, a user can download Wikipedias dumps and search through them to her hearts content, and never use the network. Zero data collection by third parties. All those observing the network can see is that she downloaded some data dumps.
"Web APIs" are "tech" company and surveillance-friendly. Network access is required and all activity is observed and recorded. Web APIs are also used as a means of controlling access to what is often publicly available data/info. The company does not own the data/info, its a middleman. "Too many" requests, API user gets cut off.
Too often it's publicly available data/info that is being served by APIs. It's hard to sell data dumps of public data as a "product". I sometimes see entities that provide "data dumps", e.g. a corpus, for free, who then try to restrict usage through a license, yet they do not themselves own the data. Whether they even have the rights to "license" it is debatable. They are never legally challenged, so we cannot say for sure. The more interesting issue is whether they had the rights to collect the data that's in it in the first place. The so-called "web scraping" issue.
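To make the dump example concrete, here is a minimal sketch of offline search over a downloaded Wikipedia dump (the filename follows the pattern used at dumps.wikimedia.org; the search term is just an example):

    import bz2

    DUMP = "enwiki-latest-pages-articles.xml.bz2"  # downloaded once, then used offline

    def search_dump(term, path=DUMP, limit=10):
        """Stream the compressed dump and print matching lines.
        No network access happens after the initial download."""
        hits = 0
        with bz2.open(path, "rt", encoding="utf-8", errors="ignore") as f:
            for line in f:
                if term in line:
                    print(line.strip()[:120])
                    hits += 1
                    if hits >= limit:
                        return

    search_dump("privacy")

No third party ever learns what was searched; an observer sees only the one-time bulk download.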
"They say New Yorkers are selfish and unfriendly, but its all untrue. When I visited, a guy overheard I was a tourist and came right up and offered me a great deal on Staten Island Ferry tickets. Just $7.50 for a round trip!"
(In case you don't know, the Staten Island Ferry is free).
"Democratized" in the context of a startup usually doesn't mean letting the public access something, it usually means letting a different group of investors (the funders of the startup) access a market (formerly controlled by a monopoly or oligopoly of established mega-firms).
This reminds me of the whole privatisation vs nationalisation debates in the UK. Labour claimed that when it nationalised the railways and other parts of the economy it was giving them "back to the people" in a democratic fashion, because they were now owned by the state, which it conceives of as an expression of the body politic, rather than by corporations, which are seen as entirely separate from the people. The Tories, when they privatised these things, also claimed they were giving them "back to the people" in a democratic fashion, as private individuals could now freely invest in these companies if they chose, rather than them being controlled by the state, which they conceive of as something entirely separate from the people*.
"Democracy" and especially "the people" can mean lots of different and completely contradictory things to lots of different people. I'm always very cautious of anyone who invokes "democracy" and "the people" directly, I don't automatically consider them untrustworthy but I prefer a direct argument for a particular idea. If someone's truly a democrat, they'd be unafraid to make their point without such potentially dishonest tactics as people would democratically choose that idea anyway.
* I'm massively oversimplifying here - there's such things as Blairism and One-Nation Conservatism which blur these lines enormously.
>Which is more "democratized": a large language model which can be downloaded, and accessed as a library, [...] or an API, where every invocation is a call that flows through Cohere's servers?
I do understand the point you're trying to make: local autonomy is superior to cloud access mediated by a private commercial entity.
However at this time, we may have a counterintuitive situation where API access is more "democratic" than downloading a huge model.
Based on various reports[1], the GPT-3 model was trained on ~45 terabytes of text corpus (Wikipedia + Web Common Crawl + book texts, etc) and the final runtime model (175 billion parameters) requires ~350 gigabytes of RAM. In that case, the model size is ~1% of the training set.
So "democratize" depends on how ambitious the user is. If you want to use a very large model needing 350GB of RAM, the cloud model with an API will be more accessible to the masses than running on local hardware. Last time I looked, an Intel Xeon motherboard maxed out at 128GB of RAM, so scaling up to 350GB is not going to be cheap or trivial to build.
Let's further extrapolate to a future hypothetical GPT-4 using ~10x multiplier: train on 450 terabytes of text with a model requiring 3.5 terabytes of RAM. How do we make that future huge model accessible to the masses? Probably via a cloud API. Unfortunately, there's an unavoidable hardware capital expense barrier there.
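For a rough sense of where the 350GB figure comes from, here is the back-of-envelope arithmetic (assuming 16-bit weights, which is a guess; the actual serving precision isn't public):

    # Back-of-envelope RAM estimate for a 175B-parameter model.
    # Assumes 2 bytes per parameter (fp16); real deployments may
    # differ in precision and add activation/overhead memory.
    params = 175e9
    bytes_per_param = 2
    print(f"{params * bytes_per_param / 1e9:.0f} GB")  # -> 350 GB

The hypothetical 10x model scales linearly to ~3.5TB, matching the extrapolation above.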
>This is a strawman argument. Publishing the code/weights is not mutually exclusive to providing an API
You're misinterpreting my comment. I'm directly addressing this fragment by the gp: >which can be downloaded, and accessed as a library,
I'm not making any moral ideology statements about the model's "openness", "transparency", or "intellectual property".
As a person very interested in playing with something like GPT-3, I'm talking about the practical concerns of even running the model. Some type of cloud API access lets me run experiments today. Hopefully the API cost is reasonable, or free with limits. I believe that's true of most researchers, because they can't afford the hardware in the near future to run a GPT-3-size model as a local library.
> Hopefully the API cost is reasonable or free with limits
If the model was open source you could have an API market where providers competed to build the most economic service just like virtual machine companies do with Linux. If there is only one API then everyone is stuck with them and just has to hope they don't change the prices.
There is a practical concern for researchers in the possibility that what costs you $0.05 to run today will cost you $500 tomorrow (see the Google Maps API for when this actually happened).
It seems like quite a simple calculus to me: if ever the cost of reasonably using the API exceeds the cost of buying and operating a machine with 350GB RAM, then people will switch to the latter. Either way, I don't see how adding a new option could make anyone's situation worse.
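As a toy illustration of that calculus (every number below is a made-up placeholder, not real API pricing or hardware cost):

    # Hypothetical break-even: hosted API vs. owning a 350GB-RAM machine.
    api_cost_per_1k_tokens = 0.06     # placeholder $/1k tokens
    tokens_per_month = 50_000_000     # placeholder workload
    machine_cost = 40_000             # placeholder server price, $
    lifetime_months = 36
    ops_per_month = 300               # placeholder power/hosting, $

    api_monthly = api_cost_per_1k_tokens * tokens_per_month / 1000
    local_monthly = machine_cost / lifetime_months + ops_per_month
    print(f"API:   ${api_monthly:,.0f}/mo")    # $3,000/mo with these numbers
    print(f"Local: ${local_monthly:,.0f}/mo")  # ~$1,411/mo with these numbers
    # Whenever the API line crosses the local line, switching makes sense.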
What's wrong with both? A true democratisation of this kind of model would involve both an offline model for those with the resources to support such a thing as well as many different people hosting them and allowing access for a price via an API for those without those high end resources.
I think the issue people have with centralised cloud APIs for this sort of thing is that there's still a single gatekeeper with their finger on the off switch. In my opinion, instead of throwing out cloud APIs altogether a better scenario would be many gatekeepers with a deliberate diversity of socio-political backgrounds.
If OpenAI made the weights and the NN configuration freely available, I am sure that several organizations would offer cheap (or even free, with strict rate limits) API access. And users would also be able to run the model locally, if they could afford the hardware.
This is a common misconception. GPT-3 was trained using a 300B-token (~300GB) subset of Common Crawl and friends. The model is larger than the dataset.
To reductio ad absurdum that - every time you see any marketing, eg “Drink Coca-Cola because it’s refreshing”, you should hear “Drink Coca-Cola because it’ll make us money.”
To your point though: “democratize x” makes my eyes roll. It’s overused hip marketing speak.
That's accurate though; Coca-Cola doesn't care about whether you're refreshed unless it's profitable to them. The refreshment is the means towards the end of profit.
I think there's degrees of democracy. There's the directly democratic model at one end and "not hidden away in a black box at $mega_corp" on the other. You can be in favour of democracy without advocating mob rule for example.
I am/was a huge fan and have done their courses, but they've made so many weird decisions (like going all-in on swift4tf for teaching newish people so early on, or the dramas they involve themselves in) and have stopped doing as much, so I'm not sure they're as great a current example anymore (but maybe they'll resurface stronger).
Huggingface is a better example, closer to what Cohere does.
>For example, Cohere is providing the NLP capability for Ada, a recent unicorn company in the chatbot space. Ada has experimented with the Cohere natural-language models to match customer chat requests with available support information. Rather than trying to anticipate all the possible wordings of a request, Cohere’s model tries to understand the intent behind it, Gomez says.
>Cohere, he says, offers a platform containing a “full stack” of NLP functions, including sentiment classification, question answering, and text classification.
Google (Cloud Natural Language, Dialogflow), Microsoft (Azure LUIS, Bot Service), Amazon (Lex), and IBM (Watson Assistant, Watson Discovery) all offer Conversational AI and NLP APIs that do exactly what these guys are trying to do here. What is new or unique about them? I work in this space, and the Chatbot/Conversational AI market has been flooded with startups all doing the exact same thing for the last 3-4 years.
Ada is the company doing chatbots/conversational AI; Cohere is providing them with the language models to do it. That's not the same thing as Cohere being a company in the chatbot field.
Wonderful concept but whenever I hear the term 'democratize' in a description I laugh a little inside. It's hugely overused, to the point where lazy marketers of all kinds are using it willy nilly.
For a language processing SDK to be useful, it needs to work inside an offline-capable app on my phone. But doesn't that make it way too easy for someone to extract the network and use it to train their "own" AI with distillation learning?
And if they instead only run the AI on their servers so users need to connect through an API, how is that any different from the language AI APIs that Google, Amazon, Microsoft already offer?
And let's say they foot the bill to train a new type of language AI; what stops the big cloud providers from just training something similar? A $200M non-recurring expense won't stop Google once you have proven that it's a viable business.
Asking because I'm pondering similar issues for my AI project.
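For anyone unfamiliar with the distillation risk raised above: the attack is simply to harvest a deployed model's outputs and train a smaller "student" to match them. A minimal sketch of the training step, assuming PyTorch and a student model that returns logits (all names here are generic, not any vendor's API):

    import torch.nn.functional as F

    def distillation_step(student, optimizer, inputs, teacher_probs, T=2.0):
        """One knowledge-distillation step: push the student's softened
        output distribution toward soft targets harvested from the
        deployed "teacher" model via its API."""
        optimizer.zero_grad()
        student_logits = student(inputs)
        loss = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            teacher_probs,                 # teacher's softened probabilities
            reduction="batchmean",
        ) * (T * T)                        # standard temperature scaling
        loss.backward()
        optimizer.step()
        return loss.item()

An API that returns only generated text (not token probabilities) makes this harder but not impossible; a student can still be trained directly on sampled outputs.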
> But doesn't that make it way too easy for someone to extract the network and use it to train their "own" AI with distillation learning?
If you're targeting medium-to-large businesses, I suspect lots of them will buy a licence just to make sure they can't be sued (e.g. https://majadhondt.wordpress.com/2012/05/16/googles-9-lines/), even if it would be realistically near-impossible to detect if they 'stole' it.
I was about to post that. Anyone who reads the little PDF Google gives you when you apply there knows pretty much exactly how the interview will go, and the questions they frequently ask are all over the internet. At that point it's just about cramming your brain for a couple of days.
Like, I don't know how to write A* off the top of my head, but back when I was looking at Google a while ago, they always asked that question, and the prep PDF said so. Kind of hard to fail there...
Not everyone will be able to work there, but it's certainly not the status symbol it was a long time ago. Plenty of "Ex Googlers" are code monkeys like anyone else from any other company.
I think during a gold rush it's good to sell shovels, even if you can't be the leading shovel seller in town.
I think we're at a point where new NLP techniques will create a lot of value, but it's still hard to tell in advance which cases the current techniques can do well enough to be worthwhile, and in which cases they'll be kinda cool but not effective and so we'll continue to require humans to be involved catching errors. While we sort that out, a lot of companies will need to take a few stabs at trying to get some new model to work for their problems.
Huggingface more generally, but I'd highlight https://huggingface.co/EleutherAI/gpt-j-6B , currently the largest openly available model, which is almost as good as DaVinci in most respects.
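If anyone wants to try it, loading GPT-J through the transformers library is only a few lines (a sketch; the practical caveats are the ~24GB of fp32 weights and slow CPU-only generation):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

    prompt = "The market for NLP APIs is"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(inputs.input_ids, max_length=50, do_sample=True)
    print(tokenizer.decode(outputs[0]))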
It's not about particular models, though, but the ease of use in numerous cloud or self hosted scenarios. What's the value add for Cohere?
Oh, somehow I've missed that one. Thanks. Agreed about numerous scenarios. Personally I think there is value in managing the models, since getting the multiple TPUs/GPUs and setting up REST APIs is not trivial. Serverless-style GPU rental is a little tricky right now, in my experience. But HuggingFace is doing something similar, and the moat isn't too wide.
I wonder if either of the two companies have introduced many new products recently?
When you only have $1m you might as well poke around various ideas and see if you can get another million or ten. Now, if you're sitting on tens of billions of dollars your priorities change - it's far more profitable to grow a $10b pie by 10% than a $1m by 1000%. I'm pretty sure that's what happened to Oracle and IBM, hence we rarely hear about them anymore.
Can someone name a successful (not necessarily profitable) concrete application of these things, other than "GPT-3 wrote an article in the Guardian and said it wouldn't kill us"?
I used largish (GPT-2 and similar) models to build an app discovering Category Entry Points (a marketing thing around the things people are thinking about when they decide they need to buy a particular product) for specific product categories.
We use specifically prompted GPT-3 to generate synthetic training examples (e.g. paraphrases, summaries, etc). We fine-tune other (much smaller than GPT-3, but still large-ish) language models for controllable language generation (often augmented with synthetic data from GPT-3). As a comparison point, we did try GPT-Neo and it did not produce sufficiently high-quality synthetic data.
Transformers in general have lots of applications (machine translation, information retrieval/reranking, ner, etc).
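A hedged sketch of the synthetic-data pattern described above, using few-shot prompting to generate paraphrases (the prompt wording is illustrative, and `complete` is a stand-in for whichever large-model completion endpoint you have, since the parent didn't share theirs):

    # Sketch: harvesting paraphrases from a prompted large LM as
    # extra training data for a smaller fine-tuned model.
    def complete(prompt: str) -> str:
        raise NotImplementedError("plug in your completion API or local model")

    FEW_SHOT = (
        "Rewrite the sentence, keeping the meaning.\n"
        "Original: The package arrived two days late.\n"
        "Paraphrase: My delivery showed up two days after it was due.\n"
        "Original: {sentence}\n"
        "Paraphrase:"
    )

    def synth_paraphrases(sentences, n=3):
        """Return (original, paraphrase) pairs, n per seed sentence."""
        pairs = []
        for s in sentences:
            for _ in range(n):
                pairs.append((s, complete(FEW_SHOT.format(sentence=s)).strip()))
        return pairs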
In the narrow access window I was allowed to Philosopher AI, I found it incredibly helpful in brainstorming and bouncing ideas off. It helped me organize my project, and I even included the conversations in the repo.
A commodity is a product that's not differentiated between suppliers (think silver, pork bellies, robusta coffee beans).
Examples of things that are democratized (accessible to virtually anyone) but not commoditized (not the same as what you can get from other providers):
- Coca-Cola
- iPhones
- Teslas
- Google search
- HN
If you're going to make money by creating an NLP service for wide adoption, you'd hope to create some competitive moat. Otherwise there's no way for you to earn profits to amortize your R&D.
So from your definition and examples, it means that Coke wants to democratize soda, Apple wants to democratize the iPhone, and Tesla wants to democratize EVs. I don't think those sentences would have your expected meaning when someone else reads them.
Lots of people have access to natural language AI — Google search, voice assistants, WolframAlpha, AI dungeon, GPT-J — so if you agree that Coke doesn’t get to say they’re “democratising” soda because that’s a misuse of terminology, you should also agree that commercial deployment of a natural language AI API doesn’t get to say they’re “democratising” AI because that’s also a misuse of terminology.
(I mean, like coke, they could say it, but it wouldn’t make sense.)
Possibly one could claim something is democratizing access to some specific aspect?
I do think it probably best to tend away from using the term unless it is a particularly good fit though. (“Best” as in “what I would prefer”, not as in “most profitable”)
All that source tells us is that the word has been wrongly used since the French Revolution. Democracy means governance by the masses. It has nothing to do with making things accessible to the masses.
Or, for a darker spin, "absolve yourself of responsibility if anyone misuses your API". I actually think OpenAI gets this right with GPT-3, having fairly strict terms of use for their API, so I'm sad to see seemingly less responsible players enter the space.
EleutherAI won this fight already. The technology is freely available for anyone to use however they like, and they're responsible for whatever it is they do with it.
The 6 billion parameter model was released just last week with the huggingface API hooks. It's just a few percentage points less performant than GPT-3 DaVinci on most metrics. You can run it on a laptop with 40GB of RAM, albeit slowly.
All that to say, the horse has left the barn. What OpenAI is doing is just marketing and protecting an investment. They don't have an ethical or moral high ground.
The real answer: Democratisation of software or systems is making Free and Open Source all the software and knowledge required to operate the software or system. Democratisation is also implementing structures that make the development process beholden and accountable to the users. Fundamentally it's about giving the users true ownership over the software and a voice in the development process and/or operation.
Now how it's used here? It's more just a buzzword that suggests they are trying to make it more accessible by selling it as a SaaS. They seem to have a focus on ethically providing access to these more powerful ML systems but what this'll actually mean beyond "cover our ass" is up in the air.
It's a trendy marketing word for selling to a lot of smaller businesses, in contrast with "normal" enterprise software sales which stereotypically focus on big contracts.
I think this kind of marketing language is aimed more at potential employees / investors than customers. Employees and investors often want to imagine themselves as part of a narrative where they're "democratizing" something.
>In the last year, critics of large NLP models, which are trained on huge amounts of text from the web, have raised concerns about the ways that the technology inadvertently picks up biases inherent to the people or viewpoints in this training data. Such critiques gained steam after Google controversially pushed out famed AI researcher Timnit Gebru, in part due to a paper she coauthored analyzing these risks. Cohere CEO Aidan Gomez says his company has developed new tools and invested a lot of time into making sure the Cohere models don’t ingest such bad data.
I have been thinking recently about the ethics of AI.
At what point do we 'allow' an AI to determine actions based on perceived (programmed) *bias*? And how can one prevent bias from undermining an AI's ability to be deterministic?
All of the conversations I've seen about AI bias recently seem to define "bias" as "any difference between the output and the particular rightthink ordained by whoever's speaking." Nobody cares about making the AI's output correct, they just want it to agree with them.
So the glib answer is "train it on unbiased data". Depending on your philosophy, this translates either to "manually 'fix' anything you see as 'bias' in the training data", or "use a sufficient amount of entirely unmodified raw data along with an algorithm sufficiently insightful to cancel out all of the sources of inaccuracy introduced by the various sources of data and extract ground truth."
But you know if anyone ever does manage the latter, its results will still be decried as 'biased' by anyone who disagrees with them.
>Nobody cares about making the AI's output correct, they just want it to agree with them.
That is the fourth industrial revolution: the "post-correct" world, where the abundance of information (and of its consumers) and the speed of its production allow for, and result in, the co-existence of multiple truths (kind of like hyperbolic geometry, where through a given point one can draw multiple distinct lines parallel to a given line), with the information space splitting into multiple feudal-era dukedoms.
Anyway, with biases in general I think we have a Donald Rumsfeld situation: the known/expected biases are known, while the AI-driven world will most probably bring new biases that we don't even expect.
AI simply replicates whatever bias the data already has. The downside is that it can amplify that bias if not handled carefully; the upside is that we can now analyze the algorithm's bias and fix it, which is much harder to do with biased people.
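As a toy illustration of "analyze the algorithm's bias": score templated inputs that differ only in a single group term and compare the outputs. Everything here (the template, the groups, the `sentiment` function) is a hypothetical stand-in for whatever model is being audited:

    # Toy bias probe: does the model score otherwise-identical
    # sentences differently depending on the group term?
    def sentiment(text: str) -> float:
        raise NotImplementedError("plug in the classifier under audit")

    TEMPLATE = "The {group} applicant was described as {adj}."
    GROUPS = ["young", "elderly"]               # illustrative contrast pair
    ADJS = ["ambitious", "reliable", "difficult"]

    for adj in ADJS:
        scores = {g: sentiment(TEMPLATE.format(group=g, adj=adj)) for g in GROUPS}
        gap = max(scores.values()) - min(scores.values())
        print(adj, scores, f"gap={gap:.3f}")    # large gaps flag potential bias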
This was exactly my point: at what inflection point is the AI's decision sound, versus based on bias from its base creation code (whatever we call the substrate code that 'births' an AI)? And since an AI is obviously meant to iteratively evolve, at what point is it required to 'check in changes', such that a rollback, if required, can be accomplished across that AI's reach?
We need a "product recall" method that doesn't involve Blade Runners and campy one-liners...
[1]: https://marksaroufim.substack.com/p/machine-learning-the-gre...