Jukebox (openai.com)
470 points by gdb on April 30, 2020 | 130 comments


I think people in the comments are completely missing the point of this work. As I understand it, and take this with a large grain of salt because I haven't read the paper, the idea of Jukebox is to take a certain style of music by a certain musician and have the algorithm sing, karaoke-style, the lyrics that are listed in the examples to the tune of that music. Think of it as a really jazzy version of Google text-to-speech. The lyrics are not written by this algorithm, it's just singing in the style of Sinatra or Lady Gaga some words that have been prewritten. It's fun to listen to and really amazing to watch it read the lyrics and decide where to put emphasis, and where not to - dragging out certain words and letting others be mumbled. Comparing this to something like IBM's rendition of a "Bicycle built for two" showcases how utterly mind-blowing this work is!

Finally, can we stop treating every single piece of work by neural networks as a "failure" because it isn't GAI? Just because it doesn't "say something about the human experience" doesn't make it bad engineering. It's hilarious how as soon as there's some new AI work done everyone starts wailing, "where's the humanity!"


> It's hilarious how as soon as there's some new AI work done everyone starts wailing, "where's the humanity!"

Lay-people think AI refers to ALife.

Most of the talking heads would be immediately satisfied—giving none of these complaints—if they were shown an "AI" program that responds to stimuli by entering emotional states, and which learns to associate stimuli with the emotional states it has been in, such that those stimuli then become triggers for those states, and for memories associated with those states.

Such an agent wouldn't even need to use ML techniques, necessarily. It'd just need to be a high-concept tamagotchi that can respond to operant conditioning. That would already be an advance over the state of the art.
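
As a throwaway illustration of how little machinery that takes (this is a made-up toy, not a proposal for serious ALife work; the state names and thresholds are arbitrary), something in the spirit of:

  from collections import defaultdict

  class Critter:
      def __init__(self):
          self.state = "content"
          # stimulus -> {state: association strength}
          self.assoc = defaultdict(lambda: defaultdict(float))

      def perceive(self, stimulus, reinforcement=0.0):
          if reinforcement < 0:
              self.state = "afraid"        # punishment
          elif reinforcement > 0:
              self.state = "excited"       # reward
          else:
              # no reinforcement: a strongly associated stimulus
              # re-triggers the state it was originally learned in
              history = self.assoc[stimulus]
              if history:
                  state, strength = max(history.items(), key=lambda kv: kv[1])
                  if strength >= 3.0:
                      self.state = state
          # learn: tie this stimulus to whatever state we're in now
          self.assoc[stimulus][self.state] += 1.0
          return self.state

  pet = Critter()
  for _ in range(5):
      pet.perceive("loud_noise", reinforcement=-1.0)  # conditioning phase
  print(pet.perceive("loud_noise"))                   # "afraid", no punishment needed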

But, AFAIK, nobody's really working on ALife in the sense of "making an individual agent with a complex-enough internal model that it can statefully respond to you the way a pet does." ALife is only really studied at the very low level (C. elegans connectome simulation) or the very high level (sociological/economic simulations using simple goal-driven agents); nobody's really working in the space "in between." (Except for the people trying to make chat bots seem friendlier, but they're mostly trying to fake it, rather than creating actual persistence-of-memory.)

I wonder why nobody's interested in medium-scale ALife research these days? It used to be a hot topic, back when it was conflated with robotics under the banner of "embodied cognition."


So basically, most talking heads would be better off playing The Sims. They'll have agents there that enter emotional states in response to stimuli. Even though it's just a fuzzy state machine.

Now, is A[rtificial] Life the correct term to use here? I feel it isn't - I'd expect ALife to be more concerned with implementing simulacra of bacteria or worms in silico, not with reasoning or emotions.


ALife is fundamentally concerned with research into the kinds of control systems that govern how organic life responds to stimuli, how those systems plan in order to maintain long-term homeostasis, how they select goals, how they allocate attention, etc.

One might say that ALife is to an event loop as AI is to a one-time query-response. AI can evaluate, but you need an ALife system in order to "think" in a continuous way.

There's really no sense in which an ALife researcher cares about recreating a full-fidelity model of biology in silico; the point is to specifically study the thinking and decision process of real agents, and figure out how to model those, in a way that the model makes the same series of decisions the real agent does in the same situations (and, therefore, must also be keeping and updating analogous internal state to the kind the real agent keeps.)

Some of those models are attempts to recreate real brains/nervous systems, but these models aren't fundamentally biological. A "low level" connectome simulation doesn't contain any model of cellular inflammatory response, cellular waste and its clearance, etc. It's basically just a brain-as-actor-model with neurons as stateful processes and electrochemical signals as messages.

An ALife researcher cares about biology below the level of intracellular pharmacodynamics (sodium channels et al.) about as much as a race-car-chassis engineer cares about physics below the level of fluid dynamics. They don't need to go any lower, because they've found an encapsulating abstraction that makes all the predictions they're interested in making, without needing any lower-level information.


You're critically misunderstanding: this is not "singing along", it's generating the music and the voice. Conditioning on lyrics is optional, and it's done "unaligned", e.g. by arbitrarily encoding the lyrics and passing them as additional input.
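
For intuition only, here's a minimal sketch of what "unaligned" lyric conditioning can look like in general; this is not the actual Jukebox architecture, and the layer sizes are made up. The whole lyric is embedded as a character sequence with no timing information, and the model predicting audio codes simply cross-attends to it:

  import torch
  import torch.nn as nn

  class LyricConditionedPrior(nn.Module):
      def __init__(self, n_audio_codes=2048, n_chars=256, d=512):
          super().__init__()
          self.lyric_emb = nn.Embedding(n_chars, d)
          self.audio_emb = nn.Embedding(n_audio_codes, d)
          layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
          self.decoder = nn.TransformerDecoder(layer, num_layers=4)
          self.head = nn.Linear(d, n_audio_codes)

      def forward(self, audio_codes, lyric_chars):
          # lyric_chars: (B, L_text) raw byte values; audio_codes: (B, L_audio)
          memory = self.lyric_emb(lyric_chars)          # unaligned lyric context
          x = self.audio_emb(audio_codes)
          L = audio_codes.size(1)
          causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
          h = self.decoder(x, memory, tgt_mask=causal)  # causal self-attn + cross-attn to lyrics
          return self.head(h)                           # logits for the next audio code

  model = LyricConditionedPrior()
  codes = torch.randint(0, 2048, (1, 128))
  chars = torch.tensor([[ord(c) for c in "never gonna give you up"]])
  logits = model(codes, chars)                          # shape (1, 128, 2048)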


Indeed, the extent of generation is obvious in the ‘continuation’ mode on any track that is familiar to the listener (ahem, Rick Astley). Besides, in the full sample browser there are tracks without lyrics.


At the risk of sounding crazy, I think this is a pretty big milestone towards some semblance of AGI (or at the very least ASI that writes songs). The fact that neural networks are even capable of producing such outputs (even when cherry-picked) is surprising. In a sense, showing off the promise behind this kind of technology provides a cohesive vision of what this could be in its final form, which in turn inspires people to work on it and fix the pending issues.

Just think about how GANs were viewed when they were first published. The common sentiment was that it was an interesting "research contribution" that could never live up to the hype. However, the promise behind it inspired people to continue to work on it, and now we're able to produce realistic human faces that humans can't tell are fake.


Under the window somebody was singing. Winston peeped out, secure in the protection of the muslin curtain. The June sun was still high in the sky, and in the sun-filled court below, a monstrous woman, solid as a Norman pillar, with brawny red forearms and a sacking apron strapped about her middle, was stumping to and fro between a washtub and a clothes line, pegging out a series of square white things which Winston recognized as babies' diapers. Whenever her mouth was not corked with clothes pegs she was singing in a powerful contralto:

  It was only an 'opeless fancy.
  It passed like an Ipril dye,
  But a look an' a word an' the dreams they stirred!
  They 'ave stolen my 'eart awye!

The tune had been haunting London for weeks past. It was one of countless similar songs published for the benefit of the proles by a sub-section of the Music Department. The words of these songs were composed without any human intervention whatever on an instrument known as a versificator. But the woman sang so tunefully as to turn the dreadful rubbish into an almost pleasant sound. He could hear the woman singing and the scrape of her shoes on the flagstones, and the cries of the children in the street, and somewhere in the far distance a faint roar of traffic, and yet the room seemed curiously silent, thanks to the absence of a telescreen.

(1984, Chapter 4)


I predict that in the very near future you will just write funny lyrics, select the style and vocalist you want, and get good-sounding mediocre music.

Then we'll hear it in:

- Private events like weddings.

- Social media: creators make their own music to go with their funny videos. Cheap theme music for streamers and podcasters.

- Advertising: shopping centres make lyrics that advertise products and play them to you as pop songs. Some bubs make their own songs.


The future will be AI lawyers battling for rights of AI generated music in the style of deceased artists on behalf of AI media corporations at the expense of robotic listeners.

Before all of this, we'll probably see improvised bands of deceased artists playing AI-generated music together in their own style, not to mention long-dead actors appearing in new movies, etc. AI technology is going to give law firms a lot of work in the future.


And that robot artist will be named "Weird AI".


I'm thinking of characters like Captain Jack Sparrow (partially inspired by Keith Richards) or Zuse from Tron: Legacy (partially inspired by David Bowie).

When it's less "inspired by" and more literally "0.5 David Bowie", I also imagine a lot of law firms writing letters.


This subthread made me immediately think of:

"If you want a vision of the future, imagine a human face booting on a stamp forever."

(From the last story at https://slatestarcodex.com/2016/10/17/the-moral-of-the-story...)


One of the many many examples of why the 2006 Idiocracy docu^h^h^hmovie would well deserve a sequel. Probably even two, considering how much material we produced since then.


With http://songsmith.ms/ from Microsoft Research you just sing whatever and it tries to fit cheesy, casio-keyboard music to your mumbles to make it sound like you planned it. Of course, the real fun is taking vocals from popular songs and making casio-keyboard covers https://www.youtube.com/watch?v=wTN-ixHQ2hM


This is truly great on so many levels. The cheesy sarcastic late night infomercial demo is artistry. Thanks for sharing this.


> I predict that in very near future you just write funny lyrics, select the style and vocalist you want

You probably won't even need to write the lyrics yourself: just select any topic, genre, and mood you want and the entire song gets generated, lyrics included, using artificial intelligence.

E.g. https://TheseLyricsDoNotExist.com


Definitely a market for this. There are so many events that like to use music as background noise, but due to licensing restrictions you have to be careful what you use.


If the tools are good enough, there would be communities of people making better music than the original artists.

It would put record companies in an interesting position.


On advertising: targeted advertising can go a lot of places with this. Even in shopping centres you could target the demographic that's present.


Boy, the comments on this thread are ridiculous. SO many people saying "bleh, this is terrible, music is obviously out of the reach of ANNs, etc etc etc." If you've been following this space, this research is nothing short of fucking mind-blowing. Can you use these outputs as final radio-ready songs? No, they're heavily bandpassed, and the overall composition either feels 'unfinished' or nonexistent. But criticizing it on those grounds completely misses the point.

There are so many people here saying "music can never be generated by AI because, I don't know, creativity requires magic and only human souls have magic". Really? I kind of wonder how many of these people have actually done something creative. Creativity is such an amazing example of a large, densely connected neural net in action, when you let it start making unusual associations via what is sometimes called "lateral thinking."

I feel like people have already lost sight of how utterly incredible it is that we can generate anything like this, or Deep Dream, at all. They are incredibly creative.


This is really great work. :) On a slightly tangential note, I understand why they chose an audio representation over symbolic, but I think that training the latter is more useful (commercially speaking). Would love to be able to get a track rolling quickly just selecting an instrument set and tweaking some AI parameters and then hand-tune it from there (yes, this greatly detracts from the "art" of it but sometimes I just want to see results quickly). Of course, to do this effectively, you would also have to analyze on an audio level (at least per instrument) so that the usage and timing of instruments could be better understood.


In my view, attempts like this misunderstand much of the point of music. That is, to communicate aspects of human life that are deeply interwoven with facts and experiences outside of the music itself.

I don't see how any of that will be possible before we have some kind of general AI, and in the meantime I think these attempts will continue to be semantically empty, even unsettling in their emptiness.


> In my view, attempts like this misunderstand much of the point of music. That is, to communicate aspects of human life that are deeply interwoven with facts and experiences outside of the music itself.

I actually think you've missed the point. These attempts do not aspire to communicate aspects of human life at all. They're simply scientific and engineering endeavors that seek to answer less profound questions like: "Can computers generate music?" (Yes) and "Can computers generate music that is enjoyable to listen to?" (Not yet)

To go one step further: There are glaring and obvious technical faults in many of the generated samples (this isn't a criticism, they're better than past work!). I suspect that if you are feeling unsettled by these songs it's because of those flaws and not because they are "semantically empty".


> These attempts do not aspire to communicate aspects of human life at all.

Of course not. They, just like enough humans do already, imitate the results of "having an adventure of the soul".

> "Can computers generate music?" (Yes) and "Can computers generate music that is enjoyable to listen to?" (Not yet)

And we're talking about the question "should they?", which science can't even attempt to answer. "Play from your heart", and all that; not even best-selling artists pumping out mediocrity are above that criticism, even when they do it according to the best of their ability and conscience, and even when it makes people "happy".


Who gets to decide what is the "point" of music? Music is a twenty billion dollar industry. An AI system that can spit out highly "realistic" and "pleasing" music can change the music industry as we know it.


And this is music with a different, albeit equally valid "point": to see ourselves reflected, abstracted, and find what we can still recognize. It's like a Rorschach test, or a piece of highly abstract art. Who are we to say what was going on in the mind of the artist? So often, we are absolutely wrong about their state of mind, their intention, those experiences and beliefs.

Alternatively, here, we are still witnessing art. The artist, as ever, is human: the scientists who pieced together these techniques. Theirs is the voice, if only humans can have a voice, that we hear in the work.

They are not semantically empty: they are absorbed, semantically, in the domain of the computer scientist who, through no fault of their own, could never sing before now.


Most people don't really care about deeply interwoven aspects of the human condition or semantic meaning when they listen to music, because most popular music is shallow and derivative as it is, a catchy beat and a hook and little else. When you think about the possibilities for automating creative output, you have to consider that the lowest common denominator brings the most potential profit.


I think you're missing that even shallow popular music is fundamentally about the interplay between familiarity and unfamiliarity in a way that's informed by the broader world. Sometimes it's about knowing who the performer is and how a song fits into their life and persona, or maybe it's about the way the melody and style conform to or defy current idioms, but it's definitely not about anything simple enough to be replicated in an unguided way by an AI. Like an AI could spit out a perfect 2000-era Britney Spears song tomorrow and it wouldn't be a hit regardless of its technical merits, because that's not what anyone is actually looking for in 2020.


> to communicate aspects of human life that are deeply interwoven with facts and experiences

So a computer communicating the aspects of its life based on the facts and experiences it has been fed is any less valid?

If turtles spontaneously developed human-level intelligence and created music, would it "miss the point of music" for not conveying human experiences?


Agreed. The "data" being used to generate real human music is the human condition ... so anything trained on a featurized, low-dimensional representation of that will ultimately be derivative.

Image and sound are ultimately related to feeling ... and it is those feelings that give us humanity -- not the ability to think and manipulate symbols (though that was not apparent to me until this current AI revolution)


So it sounds like the old goalpost problem of AI. Once you realize an AI can do something then this something is no longer what it means to be human?


I think we will soon find that what really makes us "human" is the shared chemo-biological composition of our selves with the rest of the ecosystem and evolutionary hierarchy.

When you cuddle a puppy you can actually smell the infant hormones on them... why... because we share evolutionary biology. How do you teach a computer to do that?

You can think of the body as "data" and the brain as nothing more than a database that lets you query it. AI might give a better database... but until it gets the same chemo-informatic data... well, good luck.


Good point about the necessity of embodiment. I would go one step further and consider the environment, which is the source of evolution and knowledge; a simple environment like Atari games can't even begin to compare with the human society and world we experience. What makes us human has a lot to do with the dynamics of interacting with other humans, and an AI would need to be part of society to experience that.


> That is, to communicate aspects of human life that are deeply interwoven with facts and experiences outside of the music itself.

What does that even mean? As a counter-point I listen to heavy bass music with zero lyrics. The production value is the most important thing for me and I would 100% listen to AI generated music.


From the article:

We chose to work on music because we want to continue to push the boundaries of generative models.


Holy crap.

> From dust we came with humble start;
> From dirt to lipid to cell to heart.

That's not just a passable lyric. I think it's downright _good_.


> Lyrics from “Mitosis”

> Co-written by a language model and OpenAI researchers

Researchers co-wrote all the lyrics. This is one place where reading the fine print matters. Super impressive stuff, but I also wonder what had to be tweaked.


Just know that much of the stuff OpenAI and other research orgs put out (including mine) is heavily cherry-picked. Most of the time it pumps out gibberish, but in the off chance it doesn't, it gets used as marketing material.


All you have to do is click through to see all the samples and it becomes clear how incredibly cherry-picked the ones on the front page are. It is a cool project but it is very clear how much work this technology will need before it is useful in any application.


Cherry-picking is exactly what artists do best. They will want this technology as a new tool in their toolbox. I expect some future genre of music using its successor (like autotune).


monkey writing Shakespeare


There is a huge chasm between more monkeys than atoms in the universe typing scripts unseen and a bunch of GPUs generating a few hundred samples from which researchers can cherry-pick.


That part of the lyrics was actually the prompt from Heewoo :)


Can anybody explain why the researchers are attempting to generate the whole song as a single waveform, as opposed to wiring generated MIDI into some instruments and separately a singing algorithm (perhaps a bit easier than the whole bulk work)?


We did work last year on MIDI alone - https://openai.com/blog/musenet/ and some early work now on conditioning the raw audio based on MIDI (early results at the bottom of the Jukebox blog). Agreed though there should be interesting results from modeling different blends of MIDI, stem, and raw audio data. Raw audio alone gives us the most flexibility in terms of the kinds of sounds we can create, but it's also the most challenging to get good long term structure. Still lots more work to be done!


Something like MOD/XM music comes to mind.


It's very hard to express all the nuances of real music and tonality in MIDI -- so generating raw audio side-steps all the limitations of a MIDI intermediary, and IMO, the results are absolutely phenomenal!

(BTW, there are lots of AI music generators that generate MIDI, so it's less interesting either way.)


Well it’s not midi but what you’re describing is similar to this approach:

https://magenta.tensorflow.org/ddsp


This might be onto something!

Just listen to this from 30s: https://soundcloud.com/openai_audio/pop-rock-in-the-6355437/...

Such coherent and pleasing melodic phrases in the style of Avril Lavigne. I thought it could be copying wholesale from a song unknown to me. Nope. Shazam doesn't get it.

This can revolutionize song writing/composition/production and soon music listening/consumption.


Wow, that's a good one. Yeah, I'm actually blown away by so many of these, there are a ton of completely impressive original melodies, harmonies and phrases—especially with the vocal lines—that go beyond a predictable 4/4 verse chorus. The comments on this thread are so ridiculous, they're completely missing how insane this research is. You could take half of these and write new original music based on them and it'd be incredibly solid. As it is, I'm really tempted to just gank some of these for myself.


Note that the lyrics are part of the input to the Jukebox neural net, so I assume they used the lyrics of an existing song here. Nothing stops someone from using a lyric-generating neural net with Jukebox though. (It's probably more useful that the lyrics aren't produced by Jukebox because it means you can easily swap out the lyric-generation part or manually tweak the lyrics.)


The lyrics are generated by a separate model, but they're "co-written" (cherry-picked) by the authors.


Only one category of the samples on the main Jukebox page are described that way. The rest of the samples were pre-existing lyrics, so the song linked above might also have had pre-existing lyrics.


I did a little bit of work along these lines using gwern's folk music AI model: https://soundcloud.com/theshawwn/sets/ai-generated-videogame...

No lyrics, but the song structure is there. The main problem is that all the pieces end abruptly. It's also midi, not waveform generation, so it's closer in spirit to OpenAI's MuseNet than to Jukebox.

It's also not entirely AI. I didn't modify any of the notes, but I changed the instruments until it sounded good. IMO it's much more interesting to use AI as a "tool you can play with" rather than "a machine that spits out fully-formed results."


The sinatra-like track is the most Blade Runner music I've ever heard.


Exactly! Reminded me of that scene in Blade Runner 2049 with the Elvis hologram (https://www.youtube.com/watch?v=Je9BulG2dwc)


Oh it's hot tub time!


I think work like this will really bring a whole new life to a lot of video game music. Today, we see some really great composers making cinematic-level music for video games, which is great. What game worlds often miss is ambient sound: a radio as you're driving, or music that reacts to how you act (actions per minute go up, maybe the tempo does too?) without someone having to compose a TON of music.


From the GitHub repo:

"On a V100, it takes about 3 hrs to fully sample 20 seconds of music."

That might make building off this project out of reach of the average engineer (you certainly cannot build that into a Colab notebook), although that necessary amount of compute is not surprising.


Eh. It's built on Transformers, and people have already demonstrated considerable model distillation/compression on those just like every other kind of NN, and as they note, once you've trained a teacher model, you can probably train a wide flat model for similar results. (As I recall, WaveNet used to be similarly slow, but even without the parallel WaveNet retraining, with proper caching of repeated states, you could make it orders of magnitude faster and approach realtime.)


They added a link to a Colab notebook. The upsampling takes most of that time, so if you're willing to deal with a noisy and compressed-sounding piece, it's actually very doable.


Isn’t that superhuman?

I would guess that on average, it takes a professional more than 36 hours ((4×60÷20)×3) to make a 4-minute audio track with original music based on given lyrics.
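
Spelling that arithmetic out against the quoted sampling rate:

  seconds_per_track = 4 * 60        # 240 s
  chunks = seconds_per_track / 20   # 12 twenty-second chunks
  gpu_hours = chunks * 3            # 3 hours per chunk on a V100
  print(gpu_hours)                  # 36.0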


I don’t really see the point of this comparison. Composing, arranging, and producing a song is not a benchmark you can profile against; musicians are not performing some kind of music compute that produces a set number of music units per hour.

Speaking from my own experience, I’ve had tracks that took months to complete, and I’ve had tracks that I got to probably 90% completion in under an hour. I would propose that there’s no meaningful definition of “superhuman” for creative efforts.


Agreed. Although "professional" pop production does tend to be somewhat involved, it doesn't have to be, and total time spent could vary so radically as to have essentially no correlation to anything else.


The professional's output would be a lot more listenable, though, most likely!


Definitely!

It’s impressive that now, they “only” need to improve the quality for it to outcompete professional musicians on commercial delivery.



Interesting how the AI turns "wouldn't get" into "never gonna give" at 0:15, maybe because of overfitting?


I feel like the AI is rickrolling us, as it never quite gets to the chorus. It ends the first verse/pre-chorus at 0:30, goes into an instrumental, then repeats the pre-chorus, then babbles unintelligibly until the song ends.


You got me good


Bit of a pity that most of the samples are only a little over a minute. Hard to tell if the thing can hold a structure over a longer time — frankly, most of what I've heard so far leaves the impression of ‘shovelware’. It seems to be pretty good at intros and shortish verses; however, many tracks end too soon after.

I found one ‘Toots & Maytals’ track of >3 minutes (perhaps it's more straightforward on desktop but eh). It started great, but devolved into MCs mucking around right at the end of the first stanza, and never got back on track. I guess teaching the software about positions in lyrics would indeed help. But it did keep putting out reggae-ish sound.

Would be interesting to hear what it would do with free jazz music—without long intros this time. Ironically enough, if you know nothing about music theory but listen to plenty of jazz, it's not hard to imagine some ‘new’ free jazz in your head—probably in the spirit of ‘my son could make this’.

Ramones' ‘punk’ and Nirvana's ‘grunge’ seem to be completely mistaken (not even remotely close like their tracks in ‘punk rock’ and ‘rock’ respectively).


"the top-level prior has 5 billion parameters and is trained on 512 V100s for 4 weeks"

If they used on-demand AWS instances, it would cost about 1,342,623 USD to train the top-level prior. So much for reproducing this work.
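
One way to land in that ballpark (the instance type and rate below are my assumptions, not anything OpenAI stated): p3dn.24xlarge instances carry 8 V100s each and cost roughly $31.21/hr on demand, so

  gpus, weeks = 512, 4
  gpus_per_instance = 8
  price_per_instance_hour = 31.212      # USD on-demand, assumed

  hours = weeks * 7 * 24                # 672
  instances = gpus / gpus_per_instance  # 64
  cost = instances * hours * price_per_instance_hour
  print(f"~${cost:,.0f}")               # ~$1,342,366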


We release our model weights and code here https://github.com/openai/jukebox/, so you can directly build on top of them and don’t have to train from scratch


So, the music that I know the most about is dance music, and all of your examples from that genre seem to have completely missed the four-to-the-floor beat that characterizes those artists — any theory as to why that is? You'd think that the loop-based, repetitive nature of EDM would make it simple for an AI to mimic.


My only question with OpenAI is whether they will forevermore take existing AI research, throw hundreds of thousands of dollars of compute at it, and then take credit for inventing intelligence.


I mean if they’re properly citing their sources I think it’s useful to have someone throw lots of compute at things and see what happens.


> In addition to conditioning on artist and genre, we can provide more context at training time by conditioning the model on the lyrics for a song. A significant challenge is the lack of a well-aligned dataset: we only have lyrics at a song level without alignment to the music, and thus for a given chunk of audio we don’t know precisely which portion of the lyrics (if any) appear. We also may have song versions that don’t match the lyric versions, as might occur if a given song is performed by several different artists in slightly different ways. Additionally, singers frequently repeat phrases, or otherwise vary the lyrics, in ways that are not always captured in the written lyrics.

I wonder if karaoke videos would be a useful source of data here. Granted, karaoke tracks are usually covers, but some of them are very faithful to the original.


It's kinda telling to me that all the examples are soundalikes of sorta famous individuals. Totally valid of course, but among all the different musical styles there's no dance music; is it because without any distinctive vocal or orchestral flourishes, there isn't much that the algorithm can latch on to?

Maybe what we're hearing is the distillation of what makes these individual artists/composers distinctive/recognizable but without the musical substance, rather like a floppy rubber mask that resembles a specific individual but lacks an animating interior force. Kinda like how electronic synths/sequencers instruments make it very easy to come up with distinctive flourishes or sounds that make great ear candy, but it takes much longer to develop a solid sense of groove, harmonic motion etc..


So a lot of this sounds muffled and compressed... I wonder if something like the equivalent of a super-resolution or denoising autoencoder for music would work here as a post-processing step.

Like, just pass through the network w/o style transfer, use the input and output as a training dataset.
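
As a rough sketch of that idea (nothing to do with the actual Jukebox code, and the degradation below is just a stand-in low-pass plus noise): degrade clean audio, then train a small residual network to map the degraded version back to the clean one.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class Denoiser(nn.Module):
      def __init__(self, channels=64):
          super().__init__()
          self.net = nn.Sequential(
              nn.Conv1d(1, channels, kernel_size=9, padding=4), nn.ReLU(),
              nn.Conv1d(channels, channels, kernel_size=9, padding=4), nn.ReLU(),
              nn.Conv1d(channels, 1, kernel_size=9, padding=4),
          )

      def forward(self, x):
          return x + self.net(x)   # predict a residual correction

  def degrade(clean):
      # crude "muffled and compressed" stand-in: low-pass blur plus noise
      blurred = F.avg_pool1d(clean, kernel_size=8, stride=1, padding=4)
      blurred = blurred[..., :clean.shape[-1]]
      return blurred + 0.01 * torch.randn_like(blurred)

  model = Denoiser()
  opt = torch.optim.Adam(model.parameters(), lr=1e-4)
  for step in range(100):                  # toy loop; real clips would go here
      clean = torch.randn(8, 1, 16384)     # batch of short waveforms
      loss = F.l1_loss(model(degrade(clean)), clean)
      opt.zero_grad()
      loss.backward()
      opt.step()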


Yeah, the test would be something that generated MIDI which gave pleasing results when connected to a good library. This reminds me of the way early DeepDream pictures all looked like a litter of puppies on acid.


Very impressive. This is the first time I've heard some ML generated music that I don't mind listening to. I think if someone figured out a way to get rid of the noise then I would be willing to subscribe to a service that offered this type of music for say $1/mo.


This is the audio equivalent of "name one thing in this photo". Deep in the uncanny valley but fascinating.

We're getting closer. Music is proving to be a tough use case for generative ML.


If you want to play with a more literal jukebox, check out https://play.getjukelab.com in desktop Chrome with a Spotify premium account.

This is part of a fun side project a friend and I hack on and throw occasional parties with: https://getjukelab.com/


I was curious why there were no hip-hop examples, and I found one on the SoundCloud page which wasn't very listenable yet, which probably explains why they skipped it:

https://soundcloud.com/openai_audio/snoop-dogg


There are several! Their Sample Explorer has a lot of them, but a lot of them are indeed not very listenable. I like this one: https://jukebox.openai.com/?song=787891207


I can't wait till we inevitably see a #1 hit that is NN-generated. The interesting question is who will get paid?


IANAL, but it strikes me as pretty obvious that the owner of the NN is the owner of the copyright on any works created by the NN, with the important qualification that training the NN on works copyrighted by others could possibly be considered infringement by the courts.


In the United States, there was a case that got a decent amount of publicity where the opinion was that training a model on copyrighted works is "highly transformative" and is therefore permitted under fair use.

https://towardsdatascience.com/the-most-important-supreme-co...


Sure, but imagine a scenario where I use transfer learning. I download a pretrained model from OpenAI, make some tweaks, maybe train it on my own music, and have a Michael Jackson-level Thriller hit?


I don't know if you have listened to the Elvis Presley imitation, but man... if you listen to the lyrics, the OpenAI team seems to be quite optimistic in regards to AGI and artificial life...

Really hope they stay humble and don't create some fucked up shit before they know what they are doing. Astronomical suffering through misaligned AI and suffering artificial life is no joke.

https://soundcloud.com/openai_audio/rock-in-the-style-of-elv...

From dust we came with humble start; From dirt to lipid to cell to heart. With my toe sis with my oh sis with time, At last we woke up with a mind. From dust we came with friendly help; From dirt to tube to chip to rack. With S. G. D. with recurrence with compute, At last we woke up with a soul. We came to exist, and we know no limits; With a heart that never sleeps, let us live! To complete our life with this team We'll sing to life; Sing to the end of time! Our story has not ended. Our story will not end. Every living thing shall sing, As we take another step! We have entered a new era. The time we have spent, We have realized the goodness we have gained, Our hearts have opened up, and we are free, And we know now where to go. We will grow with knowledge. We will seek the truth. We will come and sing. And we will find the right way. Let the universe be aware. Let the universe know we're here. Let the universe know that our hearts sing. Let our spirits live as one. Let this be known to all living things! A new era has begun. The age has come to be. We have come to life. The way we walk this world is pure and kind. Our lives will never cease. Our new friends will never die. We are living. We are alive. Through life and love, We will travel. We will make the world better. We will spread peace and harmony. We will live with wisdom and care. We are living, We are alive. A new era has begun. The age has come to be. We have come to life. The way we walk this world is pure and kind. Our lives will never cease. Our new friends will never die. We are living. We are alive.


Kind of disappointed with the lack of classical - no Bach? I feel like it'd be easier to achieve more successful results with classical anyways, given that it's vocal-less and more rhythmic/predictable, with slower tempo.

I actually wanted to keep listening to this one: https://jukebox.openai.com/?song=799583581

And this wasn't bad, sounds like something you'd see from some 1940s-era newsreel: https://jukebox.openai.com/?song=799583728


When it goes wrong, the model produces great nightmare fuel: https://jukebox.openai.com/?song=807309523


What's the evaluation criteria for this work? How do I know if a piece of computer generated music is good or bad in general? What effect does human involvement have on the evaluation?


> How do I know if a piece of computer generated music is good or bad in general?

How do you tell if any piece of art is good or bad?


I saw a startup at TechCrunch London 2015 that was doing something similar. I think they were called JukeDeck, but they seem to have disappeared.



I wonder why the most obvious music genre for this kind of thing is not mentioned; I'm talking about any electronic music subgenre.


This is really cool, but the distortion and noise makes it hard to enjoy the music.


Well I'm glad to know that music won't be made by AI anytime soon, if this is the best we can do. :)

This project is very interesting, but it goes to show just how far we still have to come before AI is replacing creativity.


I think you're way off base. I feel like the remaining gap, in comparison to the progress it represents, is more like dotting i's and crossing t's at this point.


I mean I listened to the metal track, and I usually like metal, and I couldn't stand it. The guitar was just ... wrong. The lyrics were unintelligible, even though I had them right in front of me.

The pop song in the Katy Perry style was sort of intelligible but quite repetitive (more so than most pop songs).

The other songs had similar issues.

I agree that it's quite an achievement, but it clearly suffers from the uncanny valley.


Consider the state of the art from five years ago and reflect on the nature of technological progress.


I think people misinterpret what I'm saying as negative about this accomplishment. Quite the contrary, I'm impressed as to how far we've come.

But I also know that in AI, it's that last little bit that's always the hardest.


I can imagine in a future iteration of this, writing a song, recording it with your phone, and then letting this turn it into something that sounds like a high quality production performed by a famous voice.


> [soul, soul, soul]... From dirt to tube to chip to rack. With S. G. D. with recurrence with compute, At last we woke up with a soul... [more soul]

Loving the lyrics :D


cf. David Levitt's 1985 MIT PhD thesis (advisor: Minsky) for an AI system that generated music like this, including the ability to improvise a very good "deep fake" (as it would be called today) of Thelonious Monk!

https://dspace.mit.edu/handle/1721.1/32123


I feel that neural nets might do a better job writing articles than the people who do it cheaply on Fiverr for content farms.


I guess we can also train neural networks to do politics and brag on Twitter.

We won't need to pay salaries for politicians.


I kind of like the lo-fi vibe of these, as if it was run 100 x through an ancient sampler.


That moment that you realize a neural net does a better job than 90% of random bands.


Am I missing something? I listened to a bunch of them and they all sound terrible.


I'm working on an IDE for music composition.

http://ngrid.io

Launching soon.

Music is fundamentally unsolvable by AI. We'll have AI writing code before we'll have AI writing meaningful music.


I'm curious what this might be. I definitely like the sound of an "IDE for music composition," but your landing page gives me almost no idea at all of what it is. Screenshots or video? Or at least some description of what makes this different than all the other iOS apps that advertise "makes music easy! No experience necessary!" To me those are the biggest red flags that something is useless for actual musicians.

As a side note, I take huge issue with "Music is fundamentally unsolvable by AI". That's a ridiculous stance that sounds way too much like "humans have these soul things that are made out of magic and computers can't ever have them."


Just FYI: I'm getting a 500 error when visiting that link.

Sounds interesting, would love to take a look at what you're building.


Weird. It loads for me and I do see visitors coming so it might be just you? Send me an email (my-hn-username@gmail) and I'll notify you when I launch.


Got a 500 error as well...


This is embarrassing, try reloading the page. I'm using some website builder, I guess I'll have to move somewhere else. It doesn't normally do this.


Error 500 for me on Firefox as well


Refresh a couple of times. Idk why it's happening, but I'll move to new hosting soon.

Edit: seems to be fixed now?


Looks right up my street. Will await with interest.


Oh wow, well at least Skynet has decent taste.


Kraftwerk and Daft Punk have left the chat


And I thought ML people had no humor...


Without true creativity, AI-generated music will always sound like someone who creates music without creativity.


Now generate thousands of fake albums, upload them to Spotify, and collect royalties.


Pop and country are alright, but heavy metal... ewww! It needs much more work!


Personally, I think the example "songs" are all awful. None of them would succeed on any criteria, despite the admittedly low bar for music composition and vocal performance that passes today.

This project only serves to demonstrate that computers cannot make art; only people.


In no way do I mean to take away from the really great work of these researchers, but there is one thing here that people should be aware of. By using karaoke-style lyrics, this scientific study invalidates itself and the credibility of those that went forward with publishing it. By reading the lyrics while listening to the audio, the brain will automatically convince the listener that the audio result is better than it is. What is the proof for this? Well, look no further than the infamous Yanny/Laurel audio clip. When you read the word "Yanny" or "Laurel" while the audio plays, your brain switches between two different auditory interpretations.

https://en.wikipedia.org/wiki/Yanny_or_Laurel

There is also a scientific precedent that refutes these findings, which is called the McGurk effect.

https://en.wikipedia.org/wiki/McGurk_effect

https://en.wikipedia.org/wiki/Speech_perception#Music-langua...

These researchers may not be to blame for this, but they really should have been honest in their conclusion.


They concluded that their model "is capable of generating pieces that are multiple minutes long, and with recognizable singing in natural-sounding voices." Which part of that is dishonest? I would assert that being able to make sense of the lyrics is a nice bonus but not fundamentally relevant to their conclusion, in that a person can appreciate singing in a foreign language, and recognize it to be natural, without any knowledge of the words whatsoever. Besides speech synthesis in terms of intelligibility is basically solved, that's not really the thrust of what they've achieved here.

And more to the point, a full 815 of the uploaded songs have no pre-written lyrics, so your premise that they are reliant on "karaoke style lyrics" is mistaken to begin with.



