Hello. I made this. I'm a bit shocked at how many people are getting things wrong and taking things out of context, so perhaps I could answer some questions.
To clarify: I have not been with MIT in years. I was paid the minimum hourly rate (roughly $14 an hour) to work on a related project during my undergraduate years, which eventually evolved into this project years down the road. (In fact, I had to pay for my own compute to get my work started - MIT never offered me any credits.) Everything else has been paid out of pocket since then. Yes, it does indeed cost several thousands of dollars a month - that is not an exaggeration. This has been optimized so many times that the technology needs to improve first before I can cut down on costs.
The timer in the "TOS" was put there in hopes that people could understand where I was coming from in regards to misappropriating this kind of research. I did not expect people to get this riled up over a 10 second timer. (Especially not on Hacker News, of all places...)
Edit: I suppose it makes sense to include other information in this post so that others don't have to go hunting for my comments in this thread:
- >This is unrelated but what's with the fascination with HN users and My Little Pony? I've noticed this on a lot of posts in the past few months.
- Twilight Sparkle's voice is indispensable in getting emotional contextualizers to work properly. The logo and profile picture is an homage to that fact.
- >But seriously: how did you get the domain `15.ai`?
- I purchased it. It was definitely not cheap.
- >They named their tts "Deep Throat"? Why would you?
- It was a suggestion from a Twitter user, and I found it clever.
- >I heavily doubt it's "several thousands of dollars"...
- It is indeed several thousands of dollars a month. I can show you AWS invoices, if you're skeptical. Just send me an email and I'd be happy to show proof.
- >The disclaimer is a little ironic considering the site owner doesn’t own the model (MIT does) and doesn’t own the training data (the various shows and games do)
- I'm sorry to tell you that I do, in fact, own my own model. I have not been with MIT in years.
I have not been with MIT in years. I had a successful exit not too long after graduation, and I've been spending most of my earnings on this project.
As an undergrad, I was completely broke. I figured that keeping the project free to use was the best thing I could possibly do with my research as I continued to work on it.
>Can't you continue/do your research without a public website?
Yes, but the website has multiple purposes. It serves as a proof of concept of a platform that allows anyone to create content, even if they can't hire someone to voice their projects.
It also demonstrates the progress of my research in a far more engaging manner - by being able to use the actual model, you can discover things about it that even I wasn't aware of (such as getting characters to make gasping noises or moans by placing commas in between certain phonemes).
It also doesn't let me get away with picking and choosing the best results and showing off only the ones that work (which I believe is a big problem endemic in ML today - it's disingenuous and misleading). Being able to interact with the model with no filter allows the user to judge exactly how good the current work is at face value.
Despite what others here think, I personally find this admirable. I played with your models a while ago with some colleagues of mine, and we were all shocked at how good it was, and that it was free.
I'm no stranger to passion projects, I have a lot of respect for people like you. This is great stuff.
Thank you for the kind words. I know that HN is a tough crowd to please (myself included), so I hope that my next update will prove to be worth the work.
The Rise model in particular is amazingly good quality. I pranked a friend with a few generated lines of her a few minutes ago, and he chastised me for wasting my money on hiring a voice actor just to troll him.
Do you cache results from this (especially the random samples provided)? It seems to be regenerating those, which might be expensive if lots of people are using the same prompts (a rough sketch of the kind of caching I mean is below).
Edit: also wanted to thank you for the Chell voice, it sounds completely true to life to me! (minus some jumping noises)
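To illustrate the kind of caching meant above: a minimal sketch that keys the cache on the exact inference inputs, so identical prompts never hit the model twice. The `synthesize` function is a placeholder, since 15.ai's actual backend isn't public.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("tts_cache")
CACHE_DIR.mkdir(exist_ok=True)

def synthesize(character: str, text: str, emotion: str) -> bytes:
    # Placeholder for the expensive model call; real code would run TTS
    # inference here and return encoded audio bytes.
    return b"\x00" * 1024

def cache_key(character: str, text: str, emotion: str) -> str:
    """Derive a stable filename from the exact inference inputs."""
    payload = f"{character}|{emotion}|{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def get_or_synthesize(character: str, text: str, emotion: str = "neutral") -> bytes:
    """Return cached audio if this exact prompt was seen before, otherwise run inference once."""
    path = CACHE_DIR / f"{cache_key(character, text, emotion)}.wav"
    if path.exists():
        return path.read_bytes()  # cache hit: no GPU time spent
    audio = synthesize(character, text, emotion)
    path.write_bytes(audio)
    return audio

print(len(get_or_synthesize("GLaDOS", "The cake is a lie.")))
```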
Yes, I do. For the past three years, I have done nothing but work on this project nonstop. I've been working on massive improvements (that some have pointed out in this thread) that I've been stuck on for the past several months, but I'm getting close to finishing that up.
I don't feel comfortable publishing or releasing anything until I know for a fact that I can make no further improvements. It's not out of corporate greed or anything like that - I'm just really paranoid about getting out the best work possible.
Respectfully, the perfect is the enemy of the good, and it’s entirely reasonable to publish what you have now. If later you make further improvements, you can simply publish again.
You're completely correct, but I'm afraid this is more of a personal problem. I know I'll never be able to forgive myself if I figured out a solution to one of the more obvious problems with the model after I've already published it. I'd just be far more comfortable being happy with my own work before I release it to the wild. I know that this is selfish, and I apologize.
This is really cool. It's a text-to-speech system, and the gist seems to be that they synthesize voices from only a little audio.
The results are clearly synthetic and need work. However what's cool is that there are a ton of characters (from popular shows and video games) and there are useful statistics like inferred emotion (which is also in the output).
Honestly, it's a big problem how a lot of AIs are like "black boxes" where you really can't customize or see anything. Yeah, we have DALL-E and GPT, which can generate images and text, but the lack of customization or fine-tuning of the output afterwards severely hinders what's possible with them. Ultimately what you want is something interactive, where you can control how much or how little the AI generates, and give it really specific criteria.
But seriously: how did you get the domain `15.ai`?
I agree this is an amazing demonstration of what AI can do, but I think that the current method of "learn and repeat" that depends on having tons of computing resources available is still too inefficient in many ways. Personally I'm more interested in what parameterisable formant-based synths can do, since they are extremely efficient and can produce a theoretically infinite variation of voices, although the output quality is still not great. Example: https://news.ycombinator.com/item?id=31604299
.ai domains cost a couple hundred bucks a year, so they're widely available and not heavily used by domain squatters. (It's the country-code domain for the island of Anguilla, population ~15,000.)
In the case of text generation, we call this "Constrained Text Generation" and it is an active field of research. Without going into too many details (I have a paper out for review about this), it's pretty trivial to get "interactive control over how much or how little the AI generates" through a combination of filters on the LM's vocabulary and effective selection of the various hyperparameters in the decoder (top_p, top_k, temperature)...
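For instance, here is a minimal sketch of those knobs using the Hugging Face transformers library and GPT-2 (my choice of stack for illustration, not necessarily what the paper above uses): a crude vocabulary filter via `bad_words_ids`, plus top-k/top-p/temperature sampling and a hard length cap.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Constrain the vocabulary: forbid a few tokens outright.
banned = [tokenizer(w, add_special_tokens=False).input_ids for w in [" violence", " Violence"]]

prompt = "The pony cleared her throat and said:"
inputs = tokenizer(prompt, return_tensors="pt")

output = model.generate(
    **inputs,
    do_sample=True,        # sample instead of greedy decoding
    top_k=50,              # keep only the 50 most likely next tokens
    top_p=0.9,             # ...and only as many as cover 90% of the probability mass
    temperature=0.8,       # <1 sharpens the distribution, >1 flattens it
    max_new_tokens=40,     # hard cap on how much the model may generate
    bad_words_ids=banned,  # crude vocabulary filter
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```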
OpenAI itself is a black box. Until I can reproduce their models or download them myself, and have unfettered access to them, it's just gatekept magic behind an API. So much for democratizing machine learning.
Unless this is a recent change, their mission isn't that:
> OpenAI’s mission is to ensure that artificial general intelligence (AGI)—by which we mean highly autonomous systems that outperform humans at most economically valuable work—benefits all of humanity.
Fine tuning of GPT3 models is available via their public API. Costs credits, and you need to get their permission to use it in an actual application, but it’s not locked in a lab.
It’s not a matter of ‘figuring out’. The model supports fine tuning. It’s a core feature of the openai API. Running ‘fine tuned’ versions of GPT-3 that are created by customers is literally their SaaS model. They have examples in the documentation. Here: https://help.openai.com/en/articles/5528730-fine-tuning-a-cl...
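For reference, the flow with the 0.x-era openai Python package looked roughly like this (file names and model identifiers below are placeholders, and the fine-tuned model name format is only illustrative):

```python
import openai

openai.api_key = "sk-..."  # your API key

# 1. Upload a JSONL file of {"prompt": ..., "completion": ...} training pairs.
training_file = openai.File.create(
    file=open("my_training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Kick off a fine-tune against one of the base GPT-3 models.
job = openai.FineTune.create(training_file=training_file.id, model="curie")

# 3. Once the job finishes, the resulting model can be queried by name --
#    but only through the hosted API; the weights themselves never leave OpenAI.
completion = openai.Completion.create(
    model="curie:ft-your-org-2022-06-01",  # placeholder fine-tuned model name
    prompt="Write a cheerful greeting:",
    max_tokens=40,
)
print(completion.choices[0].text)
```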
I get that. But you don’t get the weights. Which means they limit what you can and can’t generate.
A friend of mine was going to build a writing assistant on top of GPT-3. She got a lot of encouragement from them. Then one day the social media storm hit OpenAI, and suddenly safety became a non-negotiable feature of their api. And along with that came the restriction “absolutely no products that can generate unlimited amounts of text, even if you can pay for the credits.”
Poof, no more business.
Imagine buying a keyboard that restricted what you could type.
All of these problems go away when you have access to the weights.
The whole history behind the project is fascinating: 4chan had a huge role in its development, and the project's work was stolen by an NFT company that a famous voice actor endorsed not too long ago.
Ah, I was wondering why they were so concerned about attribution.
The truth is that, today, if I were going to use a tool to generate voices (say, for YouTube), I wouldn't necessarily pick a small SaaS tool. I'd use Amazon Polly or some other cloud-platform voice creation tool. There are already a few products in the space, and their costs are so low as to be almost negligible (example: Polly, 5 million characters free; a minimal Polly call is sketched below). For a commercial project, I could probably stay on a free tier for a whole year.
With DALL-E, it seems like the only option, and it's such a superior option that a website could abuse it for commercial profit. But voice synthesis is already dirt cheap and commercially available without limitations.
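For scale, a Polly call with boto3 is only a few lines; the voice, region, and output path here are just examples, and you'd need AWS credentials configured:

```python
import boto3

polly = boto3.client("polly", region_name="us-east-1")

response = polly.synthesize_speech(
    Text="This is a quick test of a stock Polly voice.",
    OutputFormat="mp3",
    VoiceId="Joanna",   # one of Polly's stock voices
    Engine="neural",    # the higher-quality neural engine, where available
)

# The audio comes back as a streaming body; write it straight to disk.
with open("polly_sample.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```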
Unrelated, but years ago iOS introduced a new ML-based voice for Siri. It was quickly abused by people who found out how easy it was to make her say weird things [1]. Just by repeating the same character over and over, she would make weird sighs and other noises.
Seems this AI has the same 'problem'. lol
The copyright laws around this are fascinating. They're adamant it must be non-commercial, they must be credited, and it can't be mixed with any other generated content. Meanwhile their content is exclusively derived from popular commercial products. Oh and they also make money via Patreon donations.
I dunno. Feels a little gross to me. Eventually there is going to be a big copyright case about a model trained with copyrighted material. I have no idea how that will be resolved. Or maybe there will simply be new laws passed to make it either explicitly ok or explicitly not ok.
“Make money”? The creator loses several thousands of dollars a month hosting the site, and it’s done for free. The Patreon donations are all voluntary and only offer a pittance to the developer.
I highly suggest reading into the project first. The Wiki article I linked before (https://en.wikipedia.org/wiki/15.ai) answers all of your questions about copyright infringement.
Feel free to replace "make money" with "collect revenue". This is currently a research project (with funding). However, its long-term goal is to achieve commercial-quality voice acting and dubbing. It could be given away for free, sold directly, sold downstream, sold indirectly, or otherwise generate commercial value.
In terms of copyright infringement, your wiki link answers nothing. A court ruled that Google could use copyrighted book text to train an algorithm to improve search results because the copying was highly transformative and did not serve as a market substitute to the original work.
Meanwhile 15.ai is using copyrighted voice recordings to train an algorithm to synthesize new voice recordings that sound like they came from the original speaker. This is radically different from the Google case. Just because one instance of using copyrighted material to train an algorithm qualifies as fair use does not mean that all use of copyright material to train any algorithm also qualifies as fair use.
There is absolutely nothing about this that is settled law. In the next 20 years there are going to be lots of lawsuits, lots of settlements, possibly a few rulings, and maybe even a few new laws. I find the whole topic very interesting. YMMV.
Like you say, the law is not settled on this, but I assume if the author got a takedown request they would probably comply.
In many instances a policy of "ask for forgiveness rather than permission" can get you further, faster. While Nickelodeon are unlikely to grant you a license to the Spongebob voice because that has broader licensing and IP repercussions, they are likely to tolerate a research project using their characters (e.g. just as they have to-date tolerated The SpongeBob SquarePants Movie Rehydrated, which was a fan re-creation of one of their actual movies).
It is indeed several thousands of dollars a month. I can show you AWS invoices, if you're skeptical. Just send me an email and I'd be happy to show proof.
I would imagine it will end with a similar outcome to video game likenesses - a person owns their likeness, and you can't create products that include their likeness without their consent.
What would that mean for parodies, though? The death of satire? Can likenesses have fair use, or perhaps only for positive representations of the person?
Oh god, 50 shades of SpongePants. The future is wild in ways I never imagined. Star Trek style holodecks in what, 15 years?
So, creepy thought: should we be recording audio of our parents, so we can still "hear from them" once in a while after they die? People are going to want to reconstruct their lost loved ones with AI. This project seems to imply you only need an hour or so of audio.
After my dad died, we found that he had recorded every phone call he had with us. I thought about doing this combined with text generation to create plausible prompts but never got the guts to go through with it. He wouldn't care if I had done it, but it wouldn't ease the guilt from years of sighs and rolling my eyes when he called at always the wrong times.
There is a black mirror episode about this which then extends into a whole robotic replacement for a lost loved one. As with all black mirror episodes, it’s pretty dystopian.
I came back to the HN comments to ask the same thing, when I saw that.
I think this is kinda edgy humor that can work in small, select groups.
But maybe in most contexts in which people will be looking at AI method tech demos (e.g., within a company, or researchers at a university), we're still feeling the ongoing effects of multi-generational injustices. In such a context, no one wants to be invoking associations between women and some infamous '70s porn film. Doing that, and making light of it, seems like it'd rightly bother a lot of people.
When you're focused on a project, and maybe first discussing/showing in one small group, it can be easy to forget there are many additional things going on outside that group, some of which we also want to consider. I've made that mistake multiple times, including with the wrong humor for a context, still cringe when I think of instances of it, and this looks like that to me.
Maybe the developer will see this discussion, and decide to change some things, ASAP. It might still be relatively easy to change. (Maybe doable before Monday business hours, in the time zone of the university mentioned prominently.)
Imagine being so uptight that you see a harmless joke, get mad about it, make many assumptions about a person that you've never met, and not only tell them to change it but give them a deadline to do so.
I am just gonna take this opportunity to say: THANK YOU. For your work on the site and the joy it brings. And thank you for not censoring input. Your site is simply the best out there for making characters voice copypastas and such.
I think that depends on whether you subscribe to Chell not having a voice, or having a voice and simply not using it. The game is silent on whether Chell can't talk, or simply has nothing to say to the AI around her.
Interesting how there seems to be little correlation between source sample size and quality, e.g. the Portal sentry turret with 1.5 minutes of input vs. the 100+ minutes of the narrator from The Stanley Parable, which sounded like auto-tune had a stroke.
The AI seems to work best on high-pitched, female voices. The model seems to have improved in this regard since I last tried this website, but it still seems significantly biased towards female voices.
Much of it depends on refinement work on each specific model. Try the Daria voices, for example, which are easy to get results with that sound like they came straight out of the show.
I think it's because the underlying(?) TTS can't really portray how the narrator speaks, which is very exaggerated and varies a lot in tempo. The key idea of the app should be that we can easily transform the "voice AND emotional tone" of the underlying speech.
I have no involvement in either of these companies, but I'll mention that this seems like a beta version of Uberduck. Personally, I think Uberduck is awesome and probably worth a look.
The disclaimer is a little ironic considering the site owner doesn’t own the model (MIT does) and doesn’t own the training data (the various shows and games do)
MIT doesn’t own the model, where did you get that idea from? If you read through the website, it says that the developer alone owns everything related to the project, and the only funding he received from MIT was a small amount from the beginning.
It’s really strange reading these ignorant comments from HN…
The use of emojis to determine sentiment is incredibly clever.
> The DeepThroat model is able to generate voices of varying degrees of emotion despite never having been exposed to emotive data of the character during training. Furthermore, multiple characters can be trained simultaneously, significantly reducing the amount of time required compared to if one were to train the character models individually.
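My guess at how that might be wired up, purely as an illustration and not the author's actual architecture: treat the output distribution of an emoji/sentiment classifier as an emotion embedding and feed it into the text encoder alongside the text. A toy PyTorch sketch with made-up dimensions and module names:

```python
import torch
import torch.nn as nn

N_EMOJI = 64    # size of the emoji vocabulary the sentiment model predicts over
TEXT_DIM = 256  # dimensionality of the text encoder states
EMO_DIM = 32    # dimensionality of the learned emotion embedding

class EmojiConditioner(nn.Module):
    """Map an emoji probability distribution to a dense emotion embedding."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(N_EMOJI, EMO_DIM), nn.Tanh())

    def forward(self, emoji_probs: torch.Tensor) -> torch.Tensor:
        return self.proj(emoji_probs)  # (batch, EMO_DIM)

class ConditionedTextEncoder(nn.Module):
    """Toy text encoder whose every timestep sees the emotion embedding."""
    def __init__(self, vocab_size: int = 100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, TEXT_DIM)
        self.rnn = nn.GRU(TEXT_DIM + EMO_DIM, TEXT_DIM, batch_first=True)

    def forward(self, tokens: torch.Tensor, emotion: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)                                # (batch, T, TEXT_DIM)
        emo = emotion.unsqueeze(1).expand(-1, x.size(1), -1)  # broadcast over time
        out, _ = self.rnn(torch.cat([x, emo], dim=-1))
        return out  # a downstream decoder/vocoder would consume these states

# Fake probability distribution standing in for an upstream emoji classifier's output.
emoji_probs = torch.softmax(torch.randn(1, N_EMOJI), dim=-1)
emotion = EmojiConditioner()(emoji_probs)
states = ConditionedTextEncoder()(torch.randint(0, 100, (1, 12)), emotion)
print(states.shape)  # torch.Size([1, 12, 256])
```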
To be honest, I thought more people on HN would appreciate the little in-jokes that I added to the website (such as the Interjection copypasta and the multiple Rickrolls hidden throughout), but it's nice to see that some people do still share my lame attempt at humor :)
That's a lot of SpongeBob and My Little Pony characters. At this point, is it fair to say the attachment to kids' cartoons is a cultural (or pathological) phenomenon for under 30s?
I appreciate the intent, and I understand that many people will do the wrong thing so this was probably an attempt to get such folks to actually read and adhere to the TOS, but the obnoxious consent dialog with a mandatory countdown turned me off. It’s probably not effective, either.
On desktop, maybe I’d open dev tools and remove it. On mobile, I won’t be bothered. I hate that this is what the web has become and I choose to simply miss out on websites that behave this way.
Weird, I read through the text because I care about how I’m allowed to use the things people are giving me – and by the time I got to the Accept button, it was enabled.
Same here. And I figured if a hobby project had such a disclaimer it would be important and so I was interested to read what the rules were. But how spoiled and precious are we all today when we can’t even read a few paragraphs and accept some terms to be able to try something cool for free!
It kind of sucks: not accurate or convincing. If they spent less time making the disclaimer and more on the product…
Reminds me of the 90s when everyone had a secret weapon IP in the making, until open source showed by example how futile and silly that approach was. You want people to use your work, because then they need you.
Well, let's not confuse good intentions and hard circumstances with a bad product and poor execution. My criticism is of what I experienced as a user, and it really doesn't matter what the Hacker News community or I think; what the OP cares about is the user experience of a larger group. If you burn the user at the door, it sets the tone of the experience. And further, again, the product wasn't impressive and the IP isn't worth defending. I mean, where is this even going? If you train on actors' voices, they will come at you with an army of lawyers. SpongeBob SquarePants imitations aren't going to pay the rent. There's no game plan here.
I'm afraid it's a necessity after all the times that my work has been appropriated by companies and TikTokers/YouTubers. Yes, I am fully aware that most people will not read it. But at least I tried.
(I mean, what I was saying is by getting rid of the box in the DOM, you avoid the issue of the TOS altogether since you "walk around it" instead of agreeing to it.)
Aren't we all appropriating the work of Newton, Maxwell, Einstein, and others? It's not like Maxwell's equations are copyrighted.
You're entitled to your opinions, but as a PhD myself I'd rather my research get used by people than end up in a copyright junkyard of things people can't use.
I'd argue that there is a rather massive difference between invoking Maxwell's equations to invent GPS and literally plagiarizing my work by using it to broker partnerships with celebrities and subsequently selling my work as NFTs. (Yes, this really happened - I'm not making this up.)
Couldn't agree more. The web has become user-hostile, and I will boycott sites and their services and products if they disrespect me by wasting my time trying to trick me into agreeing to a list of small-print demands.
Oh yep. I'll be reading an article and a damn popup appears midway through a sentence. I usually just quit the website at that point. I hope their bounce detectors pick up on it.
> All code and models used for this website were written and trained as part of my research at the Massachusetts Institute of Technology (MIT). The code and models are privately owned and are not to be sold or distributed for unauthorized use.
Does anybody else find the irony in this statement absolutely amazing lol.
The author took someone else's IP as training data, trained a model on someone else's compute, and then gets extremely bent out of shape when others use the model without crediting them?
This entire thread is honestly so disturbing, this comment especially. Not only is it rife with misinformation (using copyrighted material for training is totally legal and the whole project is paid out of pocket), but is it really that big of a deal to want credit for the work they’ve done? The developer has had their work stolen by companies, influencers, and grifters, and people here are getting pissy that they can’t wait 10 seconds to wait for a popup.
I don’t know why, but I honestly expected more from HN.
You're right about the compute part being wrong. I never said it wasn't legal, just that they took someone else's work to train it. I would hope that voice synthesis is illegal without permission from the voice's owner, but I imagine it is untested so far.
But it's not just about the popup - it's more that when your work is fundamentally about reusing someone else's characters, it feels pretty hypocritical to be so focused on making sure you get credit.
If they are used in a tool that lets you generate someone's likeness as part of user-specified new content, yes. But unlike 15.ai that isn't their core purpose and no such tool exists.
The problem is that after having to wait for 10 seconds to reject their terms of service (which you should be able to reject right away) before even being able to see what the site is about, they are rickrolling you, effectively giving you the finger for not wanting to agree to their terms without context. That's quite unprofessional, counterproductive and antagonistic.
I share this sentiment entirely. There seems to be a growing trend on HN that negativity is popular. A project like this, to me at least, would seem to be right up HN's street.
Shame to see the toxicity over a passion project, whose creator generously went out of his way to answer the questions and ridiculous comments.
Making things up out of thin air like “the creator used someone else’s compute” goes beyond negativity because someone thinks the project is in the grey. That is just straight up disinformation.
"...MIT owns inventions made or created by MIT faculty, students, staff, and others participating in sponsored research projects or in MIT programs using significant MIT funds or facilities or those inventions developed pursuant to a written agreement with MIT..."
I got RickRolled as soon as arriving to the page. :-)
So is this a blanket approval for anyone with AI synthesis of voices, to sample hours of any _copyrighted content_ and come up with a TTS that is copyrighted to the new owner?
In other words, if I deep fake someone's photo on someone else's body, I own the rights of that 'model'?
This is from the MIT license, which is the school he's doing research for (emphasis mine):
> Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so
His TOS is literally the antithesis to the free nature of the MIT license :)
This is my fault. I do see how that part is worded ambiguously (I'll fix it later), but I have not been with MIT in years. Copy pasting what I've written in another comment:
To clarify: I have not been with MIT in years. I was paid the minimum hourly rate (roughly $14 an hour) to work on a related project during my undergraduate years, which eventually evolved into this project years down the road. (In fact, I had to pay for my own compute to get my work started - MIT never offered me any credits.)
And to address the philosophy behind the MIT license (also copy pasted from another comment):
For the past three years, I have done nothing but work on this project nonstop. I've been working on massive improvements (that some have pointed out in this thread) that I've been stuck on for the past several months, but I'm getting close to finishing that up.
I don't feel comfortable publishing or releasing anything until I know for a fact that I can make no further improvements. It's not out of corporate greed or anything like that - I'm just really paranoid about getting out the best work possible.
Gotcha, I have no problem with wanting to keep your personal work closed source. I was just under the impression that this had been created as research funded by MIT. If that's not the case, then sorry for the confusion :)
https://en.wikipedia.org/wiki/My_Little_Pony:_Friendship_Is_... explains in detail - between 2010 and ~2015 there was a massive overlap between millennial geek culture and unironic fandom of the rebooted My Little Pony show, especially among millennial men. One dedicated fan hub averaged almost 400k page views per day over its first 3.5 years of existence. And throughout it all, programming projects abounded, such as the delightful FiM++ esoteric language (https://esolangs.org/wiki/FiM%2B%2B) styled after the show's framing device. For many in tech now, it was an inescapable part of internet culture of the early 2010s, and a fond memory for many.
I mean, 15.ai started as a 4chan project for /mlp/ users to generate voice lines from official voice actors now that Friendship is Magic is over (google Pony Preservation Project). Honestly, the more impressive part is that a bunch of nobodies on an imageboard leapfrogged the rest of the industry and made a now-famous voice transformer model.
In the greater sense, though? Ponies have always been this weird relic of internet absurdity and bear-baiting. Some people rep it ironically, other people are dead-serious, but the community has significant overlap with the STEM field. As a result, a lot of pony-related stuff would end up propagating into the tech world, much like this very project.
Aside from the casual brony references, this project originally featured a lot of My Little Pony voices because it needed meticulously annotated transcriptions of the input audio to be trained well.
The extremely dedicated brony subculture voluntarily put in a lot of work to get a corpus for the AI to learn from.
There's also another factor at play: this AI works best with high-pitched voices, which My Little Pony is just full of. Not only did MLP provide such a generous source of training data, its results were also much more impressive than the dry dictation many other corpora would've produced, adding to its fame.
I personally haven't seen any significant rise in MLP references, though that could be because I don't know the show so I don't catch references to it. It's also very possible that you've caught the Baader-Meinhof phenomenon.
My ML professor at the university I went to was also weirdly obsessed with MLP.
Weeaboo/furry data scientists are always ahead of the industry - I seem to recall an effective decensoring model that was called "DeepCreamPy" and had almost 10K github stars before it was nuked and rehosted.
I'm convinced that learning statistics is a zero-sum game with social skills.
It's basically the same as unironic appreciation of various child-targeted-but-adult-friendly 'slice of life' anime, just more incongruous-seeming because of the 'pony' thing.
A lot of people in or around tech are furries, are into things like japanese animation, or are into My Little Pony. I don't consider myself one, but people often jokingly say that furries run the Internet.
And it's not really specific to HN. For instance you have well-known people in the community who do vaccine R&D, or cryptography, or contribute to the C/C++ standards at ISO, or several other STEM things that are pretty outspoken about their interests.
This is made more obvious on Twitter, where people tend to blur their personal and work identities a lot.
Twilight Sparkle's voice is indispensable in getting emotional contextualizers to work properly. The logo and profile picture is an homage to that fact.