Wow. GPT2 is so, so, so much better than Markov chains. I'm reading these definitions, and the fact that the last few words of the sentence match the first few words subject-wise is pretty amazing. Just some random ones:
> denoting or relating to a word (e.g., al-Qadri), the first letter of which is preceded or followed by another letter
> a synthetic compound used in perfumery and cosmetic surgery to improve the appearance of skin tone and irritation
> a type of cookie made with dough, jelly, butter, or chocolate, often filled with extra flour
Pretty impressive. I've never seen fake text so real. (I mean none of these seem to quite make 100% logical sense, but if you were just skimming the sentence nothing would stand out as a red flag.)
I always like to point people to /r/SubSimulatorGPT2 [1] as a good example of what GPT2 is able to accomplish.
It's a subreddit filled entirely with bots; each user is trained on a specific subreddit's comments matching its username (so politicsGPT2Bot is trained on comments from the politics subreddit).
Go click through a few comment sections and see how mind-bendingly real some comment chains seem. They reply quoting other comments, they generate links (they almost always go to a 404 page, but they look real and are in a format that makes me think they're real every time I hover over one) entirely on their own, they have full conversations back and forth, they make jokes, they argue "opinions" (often across multiple comments back and forth keeping the context of which "side" each comment is on), and they vary from single word comments to multi-paragraph comments.
Take a look at this thread [2] specifically. The headline is made up, the link it goes to is made up, but the comments look insanely real at first glance. Some of them even seem to be quoting the contents of the article (which again, doesn't exist) in its comments!
If you threw something like 50% "real humans" in the mix, I genuinely don't think I'd be able to pick out the bots on my own.
This one is also fantastic, the post text itself is so goddamn convincing that I actually searched on youtube for the name of the show to see if it existed (it doesn't).
I particularly like the "edit" in light of the "mod comment".
The "edit": UPDATE: I just wanted to say that I'm a huge fan of all of you. The subreddit is extremely well run, and the amount of support and encouragement that I've been getting from the community is incredible.
The "mod comment": "Hi Iamacreamt! Your post has been removed because this topic isn't suitable for /r/IAmA."
These comments within that post were also really impressive.. it still seems off but the fact that it carries the topic between replies in a thread is so crazy to me:
> How do you feel about the internet arguments subreddit, r/iamverysmart?
> As a young professional, where do you prefer to have internet arguments with your friends? I've been thinking about leaving YouTube for Reddit, but I don't currently have much free time, and I want to have some quality back and forths with my friends.
---
> I love that subreddit. I love how open minded it is. The only reason why I didn't like it was because it's so popular. I'm glad that I'm not the only person who has this problem, because I never had any reason to like it.
---
> Thanks for the reply :) I had a bit of an argument with one of my friends over reddit, and I was wondering if you could give me insight on the way you handled it?
What stands out to me is that the comments are treating "arguments" as a passion, as if it were "apples" or "old cars".
"I had many arguments growing up"
"where do you prefer to have internet arguments with your friends"
It interpolates "arguments"-related stuff into the posts as with any other topic. It does it well, but seems to have missed that it's not a conversational topic like most others. We don't talk about arguments like we talk about basketball. But here's a peek into what it would look like if we did!
(I also enjoyed the bot signing both a question and the immediate reply with ~Alex)
That seems like an idiosyncratic but plausible thing to talk about as if it's a conversational topic like most others?
Although most people probably don't prefer to have internet arguments (especially with their friends who sign both a question and the immediate reply with ~Alex).
> I think the US has turned from a police state into the police state we see today. They're just using more tools to keep us safe in the eyes of the government. One major tool that I can think of is the TSA. The TSA is a tool to keep us safe, not to keep us safe. I believe the government and TSA have become a one party system. They use the TSA as a way to keep us safe, and then use the TSA as a weapon against us if we're too annoying. A lot of people do not understand the government or TSA. It's very easy to do what I mentioned above.
I subscribe to it so it’s mixed into my front page and every now and then, I read a post and get a good way into the comments before I realize what sub it is.
One fun part - we used the inline metadata trick to train a single GPT-2-1.5b to do all the different subreddits. It allows mutual transfer learning and saves an enormous amount of space & complexity compared to training separate models, and it's easy to add in any new subreddits one might want (just define a new keyword prefix and train some more). Not sure that trick is meaningful for Markov chains at all!
It's an old trick in generative models, I've been using it since 2015: https://www.gwern.net/RNN-metadata When you have categorical or other metadata, instead of trying to find some way to hardwire it into the NN by having a special one-hot vector or something, you simply inline it into the dataset itself, as a text prefix, and then let the model figure it out. If it's at all good, like a char-RNN, it'll learn what the metadata is and how to use it. So you get a very easy generic approach to encoding any metadata, which lets you extend it indefinitely without retraining from scratch (reusing models not trained with it in the first place, like OA's GPT-2-1.5b), while still controlling generation. Particularly with GPT-2, you see this used for (among others) Grover and CTRL, in addition to my own poetry/music/SubSim models.
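A minimal sketch of the inline-metadata idea described above, in Python. The bracketed prefix format and the function names are assumptions for illustration only; the actual SubSim training format isn't given in this thread.

```python
# Sketch of "inline metadata": the category is written into the text itself
# as a prefix, so one model can be trained on all categories at once.
# The "[subreddit] " prefix format is hypothetical, not the real SubSim format.

def inline_metadata(subreddit: str, text: str) -> str:
    """Return one training example with its category inlined as a text prefix."""
    return f"[{subreddit}] {text}"


def build_corpus(examples):
    """Join (subreddit, text) pairs into a single training corpus string."""
    return "\n".join(inline_metadata(sub, text) for sub, text in examples)


def make_prompt(subreddit: str) -> str:
    """At generation time, the same prefix steers the model toward a category."""
    return f"[{subreddit}] "


corpus = build_corpus([
    ("politics", "The new bill passed the committee today."),
    ("askscience", "Why is the sky blue at noon but red at sunset?"),
])
```

Adding a new subreddit then only means introducing a new prefix and training some more on examples that carry it.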
I almost got miffed about this one complete jerk. Then I remembered how it was generated and laughed through the whole thing - it is so uncanny. Genuine Reddit emotions were had.
There is a whole news agency that is built upon GPT-2. There is a social media influencer bot that also uses GPT-2 and also responds to comments and is mostly coherent.
One of the recurring tropes on Hacker News: whenever a text generator (either RNN or GPT-2 based) project is posted, there is inevitably a comment saying "this is indistinguishable from a Markov chain."
Oh, that second one is cute. I read up on bryophytes (moss and friends) for an exceedingly brief stint back in my undergrad days. Pronunciation similarity to bio made for many "bryo" puns.
Oh, I'm surprised the "blacklist" isn't just the standard English dictionary. I'm sure I'm just being naive though. Why not just blacklist any word that already exists in English?
There are some subtleties (e.g. hyphens, derived forms, bigrams, etc.) but the biggest problem is that most English dictionaries don't have entries for every scientific word / piece of internet slang. I ended up tokenizing Wikipedia for a blacklist and still missed a lot :(
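For illustration, here is a rough Python sketch of building such a blacklist from tokenized text while handling the subtleties mentioned (hyphens, derived forms, bigrams). The tokenizer regex, the suffix list, and the function names are all assumptions, not the site's actual code.

```python
import re

# Lowercase words, optionally hyphenated ("gloss-coat").
TOKEN_RE = re.compile(r"[a-z]+(?:-[a-z]+)*")


def tokens(text):
    return TOKEN_RE.findall(text.lower())


def build_blacklist(texts):
    """Collect known words, plus de-hyphenated and joined-bigram variants."""
    blacklist = set()
    for text in texts:
        toks = tokens(text)
        for tok in toks:
            blacklist.add(tok)
            blacklist.add(tok.replace("-", ""))  # "gloss-coat" -> "glosscoat"
        for a, b in zip(toks, toks[1:]):
            blacklist.add(a + b)  # "car seat" -> "carseat"
    return blacklist


def is_known(word, blacklist):
    """Check the word itself plus a few crudely de-suffixed forms."""
    w = word.lower()
    candidates = {w, w.replace("-", "")}
    for suffix in ("s", "es", "ed", "ing"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            candidates.add(w[: -len(suffix)])
    return any(c in blacklist for c in candidates)


blacklist = build_blacklist(["Gloss-coat paint dries fast.", "Check the car seat."])
```

Joining every adjacent pair over-blocks (it also adds nonsense like "paintdries"), so this trades some novel words for fewer embarrassing real ones.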
Words not on Wikipedia, found on other sources, listed by frequency (perhaps with a date-weighting of the source document to reduce rating of older sources), would be an interesting way to find holes in Wikipedia's coverage.
I like how you had information, made a sarcastic comment about it, but didn't share the actual information ... just in case your comment might prove helpful ...
Are you saying the URL of that Wikipedia page is “actual information” that patrickthebold failed to share?
I think that page doesn’t exist. patrickthebold wasn’t sarcastically mocking people who were too lazy to look up that page. He was just making the point that as soon as a hypothetical list like that was uploaded to Wikipedia, it should be deleted, since those words would then be words found on Wikipedia.
The blacklist is probably to avoid cases where it randomly generates a real word, like the two cases above, so the blacklist filter is probably applied after the ML stuff.
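If the filter does run after generation, it could look something like this Python sketch: keep sampling until a word is not on the blacklist. The `generate` callable stands in for the real model and `max_tries` is a made-up parameter; this is a guess at the shape, not the site's implementation.

```python
def sample_novel_word(generate, blacklist, max_tries=50):
    """Call generate() until it yields a word that is not blacklisted.

    `generate` is any zero-argument callable returning a candidate word;
    here it is a placeholder for the actual model's sampling step.
    Returns None if every attempt produced a known word.
    """
    for _ in range(max_tries):
        word = generate()
        if word.lower() not in blacklist:
            return word
    return None


# Stand-in "model" that emits two real words before an invented one.
candidates = iter(["Unspooled", "Hardstyle", "heelbark"])
known_words = {"unspooled", "hardstyle"}
result = sample_novel_word(lambda: next(candidates), known_words)
```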
Data scientist here. It's common to define boundaries for a machine learning algorithm by hand. Think of telling a chess AI that it can't move pieces off the board.
Unspooled and Hardstyle popped up for me. Perhaps you should do a google search for generated words before displaying them to prevent existing words from being shown.
That one's a little more fuzzy; intermodulate doesn't occur very much in discourse (e.g. not in the wiki article at all) even though it would naturally be related
Deflategate was a National Football League (NFL) controversy involving the allegation that New England Patriots quarterback Tom Brady ordered the deliberate deflation of footballs used in the Patriots' victory against the Indianapolis Colts in the 2014 American Football Conference (AFC) Championship Game.
Very impressive. It was reminiscent of the Polish sci-fi writer, Stanislaw Lem who invented more words than Shakespeare. That made for some tricky translations. Here's a list of new words in just one of the books alone:
This 1961 book predicted what Lem calls "opton", an electronic book reader with only one page between the covers (Kindle Opton special for Lem's 100 year anniversary next year would be nice!)
I wonder if someone has created sci-fi short stories with that data set yet.
I'm not sure that Shakespeare really invented those words. The Oxford English Dictionary tended to go for prominent writers in its first-use citations.
"Stanislaw Lem who invented more words than Shakespeare."
SL's inventiveness is beyond reproach but he lived in a time (died 2006) rather later than Mr (I have 50 spelings of my name) Shakesper. Lem spoke several languages and a cursory glance at your link seems to show a lot of drinks!
Shake a spear did invent a huge number of English words. Some of them were due to the rather random speling existent in the Elizabethan England of the 16th C. The rest were the result of a creative mind that needed to deploy ideas and concepts in ways that were not available at the time. The clever thing is that he created many words that seem so obvious in meaning - and are so obviously "English". He literally understood English to its core and was able to manipulate it effectively. Good skills that man.
Also - "robot" was not invented by Karel Čapek but by his brother Josef. Karel was looking for a good word to describe mechanical workers for his theatre play called RUR.
It would be neat to do this with different dictionary training sets such as a legal dictionary or biology glossary. I wonder if it could generate useful creative ideas in the mind of a professional.
Lem is such a witty writer and his short stories are so thoroughly enjoyable. He's very clever!
His work isn't like Terry Pratchett but I find that there is a commonality in their cleverness. Still can't get over Pratchett's 'ideon' (a fundamental particle that when it strikes the brain, creates an idea; hilarious!) or that weird story of Lem's where an abandoned traveler gene-chemical engineers a sapient civilization to build himself a ship so he can fly back.
Those cats are nightmare fuel. Too many tails, paws and everything is fur. Not to mention pose and body proportions being in uncanny valley territory even when it gets the number of appendages right...
eicoscience
eico·science
the branch of physics that deals with the behavior, physics, and the properties of living organisms
"I started thinking about the world thinking of me and I never really went deep into the eicoscience"
neurotheistic
relating to the theories of early visual culture,
particularly those which emerged out of attempts
by the Chinese to achieve a cosmopolitan consciousness
through the use of classical ideas
"the neurotheistic ideas of Mao Zedong"
I like the word, but the definition does not seem to do much with the obvious etymology of the word. One could, e.g., imagine using the word for people who have developed a religious attachment to deep learning networks.
Wow, even with definitions! That looks like a better version of a game I made years ago where you have to pick out the real word from four options. The three "non-words" are generated by Markov chains:
This game is great; I've been trying to think through ways of using these NLP models in a competitive game but the mechanics aren't obvious. It would be awesome to do something like an AI rap battle
Yes, I am here to launch my new cure-all "metasodium" its like sodium, but meta.
metasodium
meta·sodium
any of a group of silica compounds thought to act by the interaction of sodium with sodium, the latter of which has many physiological roles and is essential in the modulation of many physiological processes
"a polymeric metasodium oxide from which such compounds in the cell cycle are derived"
Good call, when you press "generate your own" it tries to detect these and displays a little warning but my blacklist is not complete. My model is bolted on top of GPT-2 which used the web as a training set which is probably why some of these are popping up.
That's not ideal and probably picking up some of the original training set from GPT-2 (this model is bolted on top of it)! How about DUOLINGOLOGY instead https://bit.ly/3fPGP8q
I'm using a blacklist to reject "real" words but it's surprisingly hard to build for rare words. I'm up to ~600K items after parsing Wikipedia tokens and it still doesn't capture everything.
airpods
air·pods
a large pair of wings and wings of a bird or other flying
animal, typically used as a guide for a figure skater and
paraglider
"a pair of airpods"
I've just realised this tool is great for inventing new species or spells for my DnD campaign. Things like "Lollyfish" "Bannabeat" and "Sanaf" sound like awesome little plants or creatures to decorate my world with!
Hah good idea. I’m using thispersondoesntexist to generate npc appearances ad-hoc in my campaign. I believe there is a potential for a generator that understands fantasy races.
patentless
patent·less
not having or requiring a license of a particular kind and without permission
"patentless wireless communications"
a word that does not exist; it was invented, defined and used by a machine learning algorithm.
neuterization
neu·ter·i·za·tion
the denial of a person's sexual identity and gender
identity to someone else
"she had undergone neuterization of her facial hair"
I'm not understanding the praise this is getting. The words I've seen are very clearly wrong and do not match how English words are made. Some examples:
> méxis: an obsessive or revelatory pursuit
No comment....
> heelbark: a red braid fastened to a man's hat so as to prevent heeling
Unless you put your hat on your shoes, you're on the wrong end of the body.
> transgate: raise the value of (something, especially money) by expanding its capacity to become transactions or funds.
What's that even mean?
>noress: a unit of electric charge equal to one nanosecond
Where's the Coulombs? Who is J̶o̶h̶n̶ ̶G̶a̶l̶t̶ Noress?
Additionally I'm seeing words that either exist or are natural permutations/misspellings. Example:
> monucleotides, but mononucleotides are a real thing.
Additionally, the example sentences are just as crazy. Maybe I'm having bad luck. There are some good hits, but the majority of them appear pretty trashy (this is a crazy difficult problem!)
Typically they have a root to them. There are words that don't and are made up, like yeet (which I'll consider a word because of its usage and common knowledge), but other words like "microscope" are derived from Latin or something else. The example here is from microscopium. There's a lineage and things modify more slowly (slang typically moves faster but also rarely stays in the lexicon long term). Many words are portmanteaus or compounds, like heelbark (heel + bark). How words are composed is called Morphology[0]. I mentioned morphemes in another comment. Let's look at transgate. We have trans+gate. Trans is a loan word from Latin meaning “across,” “beyond,” “through,” “changing thoroughly,” “transverse.” We know what a gate is, but it can also be like a block (gated) or in a circuit (which is like a door). Here the model is taking the morpheme "trans" and using it as if it is "transaction". But in "transaction" the word makes sense because it is through an action (the word started from the meaning to do business and because this often means exchanging money, that's how we now think of it).
So "transgate" also sounds weird because it has opposing ideas. "through" + "block". But we need to look at morphemes to see why. At least (IIRC) it made this word a verb.
I'm not a linguist, but typically words evolve as memes and/or follow etymological patterns made up of root words. It's very rare that they're plausible sounding gibberish attached to plausible arbitrary meanings. This generator seems like it's in the "uncanny valley".... They're all somewhat plausible imitations of words, but the fact that they're not natural can be felt.
I think people here are missing what you're saying because it is subtle. Which is correct. That "troy" is different from "tor•y". "y" should be the suffix. Just like how "fix•ed" would be different from "fi•xed". "y" is the suffix like "itch" vs "itchy".
What this means, building off of the evidence I gave, is that this model is not learning the morphemes (smallest root meaning). This exact characteristic is part of why these words sound weird. It is the same problem as the one brought up by tasogare.
You are on to something there. For the syllables I'm actually using a rule-based model from Python's "PyHyphen" library: https://pypi.org/project/PyHyphen/
I am not totally happy with the results but have not had a chance to train my own
I think this needs to do a google search for each word before assuming it doesn't exist. I got "glosscoat", which is a type of paint/coating (typically hyphenated, but non-hyphenated examples exist).
Wow, the links are just base64'ing the whole text because of course there's no way to trivially reproduce it...
Yep! It's a sampling procedure; I could fix a random seed or put the results somewhere. "somewhere" ended up being the browser URL here (with a signature to prevent tampering)
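A rough Python sketch of what a signed, base64-encoded URL payload could look like. The secret key, the JSON payload shape, and the `body.signature` layout are assumptions for illustration; the site's actual scheme isn't described beyond "base64 with a signature".

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"replace-me"  # hypothetical server-side secret


def _sign(body: str) -> str:
    return hmac.new(SECRET_KEY, body.encode("ascii"), hashlib.sha256).hexdigest()


def encode_payload(entry: dict) -> str:
    """Serialize an entry to URL-safe base64 and append an HMAC signature."""
    raw = json.dumps(entry, sort_keys=True).encode("utf-8")
    body = base64.urlsafe_b64encode(raw).decode("ascii")
    return f"{body}.{_sign(body)}"


def decode_payload(token: str):
    """Return the entry if the signature checks out, else None."""
    body, _, sig = token.rpartition(".")
    try:
        expected = _sign(body)
    except UnicodeEncodeError:
        return None  # body contains characters base64 never produces
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
        return None
    return json.loads(base64.urlsafe_b64decode(body))


token = encode_payload({"word": "heelbark", "definition": "a red braid"})
```

The server never has to store the generated entry: anyone can share the link, but editing the text inside it invalidates the signature.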
I know, I added it to that site to see what it would generate, as it only generates English words and tries to attribute unknown words to something most similar
headbutter. One who strikes other people with one's head. You are becoming known as a headbutter, so unless you want the league to suspend you, I suggest that you stop playing dirty!
Headbutter - Idioms by The Free Dictionary
https://idioms.thefreedictionary.com/headbutter
noun.
backpressure
back·pres·sure
the pressure exerted up against a fluid, caused by the flow of air or water through it, exerting great physical pressure on the body
"a low backpressure"
That reminds me of the Japanese word よし (yo shi) (which sounds like "Yosh"). Its meaning is very nearly identical: an expression of excitement or enthusiasm, equivalent to saying "all right!" or "okay!" in English. Was the model trained with the Japanese word and its definition?
Delightful; especially to create words for their sound rather than their meaning, which the machine declares for whatever reasons it has at the time. It interested me that I was sometimes disappointed with the supplied definition, and sometimes strangely pleased, even though I'd no meaning in mind when I made up words. This is sublime.
1. a piece of writing (usually one of short or noibid) expressing or expressing a person's view, especially the concept of something abstract or self-subsistent. "the main quintessay of feminist theory."
2. a word that does not exist; it was invented, defined and used by a machine learning algorithm.
The first two I generated were pretty good with the first being a very "true" sounding word - however I then got Spongen[1] - an aromatic berry of a variety with a bright red, yellow, or greenish taste "spongen, light white berries" which seems like a pretty big adjective fail.
it's quite good with jargon that sounds [to the layman] plausibly medical
cyphroglodystrophy
cyphroglodys·tro·phy
a form of muscular dystrophy of muscle, caused by compression of an amyloid cytochrome
"children with cyphroglodystrophy have unusually low blood pressure"
Mycogeny: the formation of a mycoplasma within a cell
"mycogeny was detected in liver urine and its recovery in lymph nodes remained unclear"
It’s not too far off Mycogen: As Asimov explains in Prelude to Foundation,[21] their name is formed from the Greek stems myco- (meaning 'yeast' or other types of fungi) and -gen (meaning 'maker' or 'producer').
I made a page some years ago with 13000+ nonexistent words. You get to choose your own meanings.
From memory, it picks each letter with the same probability of following the previous two letters as actual english words have. The more previous letters included in calculating probabilities, the more like actual words you get. My list is on the wild side. Not novel, but was fun to do. Good for writing Jabberwocky-type poetry.
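A small Python sketch of that approach: an order-2 character Markov chain, where each letter is drawn with the frequency it follows the previous two letters in a word list. The start/end markers, function names, and tiny word list are my own choices for illustration, not the original page's code.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # padding markers; assumed not to occur in real words


def train(words):
    """Count which character follows each two-character context."""
    counts = defaultdict(Counter)
    for word in words:
        padded = START * 2 + word.lower() + END
        for i in range(len(padded) - 2):
            counts[padded[i:i + 2]][padded[i + 2]] += 1
    return counts


def generate(counts, rng, max_len=12):
    """Sample letters one at a time until END is drawn or max_len is hit."""
    context, out = START * 2, []
    while len(out) < max_len:
        options = counts[context]
        letters = list(options)
        nxt = rng.choices(letters, weights=[options[c] for c in letters])[0]
        if nxt == END:
            break
        out.append(nxt)
        context = context[1] + nxt
    return "".join(out)


counts = train(["banana", "bandana", "cabana"])
word = generate(counts, random.Random(0))
```

Widening the context from two letters to three or four makes the output look more like real words, at the cost of reproducing the training words more often.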
noun [usually as modifier]
deflategate
de·flate·gate
a situation in which one side is unable to extricate itself from a dangerous dilemma, especially one involving civil disobedience or military attack
"a nuclear deflategate could doom the North Korea situation"
a word that does not exist; it was invented, defined and used by a machine learning algorithm.
This word was used for the Patriots scandal a few years ago.
Funny that in Dutch probably (at least) some of these words do exist, I mean stuff like "week broodje" (mushy little bread) and "weekbroodje" (bread of the week) have a very different meaning based on the use of a space or not. In fact, we have a nice website that focusses on the incorrect use of spaces: SOS (Signalering Onjuist Spatiegebruik) [0], notice that spatiegebruik is 1 word ("Space-use")!
Oops. It just served me up “undefined” as a word that doesn’t exist. (Not an error message - it literally gave me “undefined” complete with definition after generating a few good ones)
One of the words I got was "unsubsidized" which appears in pretty much every online dictionary I've looked at, although it is red squiggle underlined in the comment box.
Yeah my first word was "carseat" [1] with the definition "the seat of a car". While I guess "carseat" all one word isn't technically a word, I would say it's perfectly cromulent outside of an English class.
This reminds me so much of balderdash which is oddly made up of real words but seem just as likely as these. Would be great to have this model try to play balderdash, though!
For slang, I'm training another model using urban dictionary as the datasource. I hope to release it someday but there is a lot of work to be done to clean the data. The articles are huge, user-generated and full of racism.
This is brilliant, could barely be funnier if written by a human:
> adjective.
> nondegenective
> non·de·genec·tive
> (of a computer virus) preventing development of an infection in which the infection is found in the host, without warning or compromise to the computer
> "this virus is nondegenective against bovine cholera"
It's both impressive in its relevance and fluency of language, despite being nonsense, and hilarious. Had me in stitches at 'bovine cholera'.
procreationist
1. relating to or advocating the theory that sex is the only biological sex, or as opposed to that other sex identified with reproduction "a procreationist approach to reproductive science"
2. a word that does not exist; it was invented, defined and used by a machine learning algorithm.
mantula: a small parasitic stinging insect that feeds on ants, flies, and other small insects, native to leafy lawns and shrubs.
"mantulas are widely grown as food"
You could do some great auto-worldbuilding in a dwarf fortress type game with this. Maybe constraining the input data to "bio" and "historical" definitions.
I guess it's only to be expected that from time to time, this would generate an actual word, even if perhaps the definition doesn't match a word's actual definition.
The first time I loaded the page, it came up with "polypyrrole", an existing word.
This is brilliant. Authors could use this to create new words in their universe. Entrepreneurs could use this to get a unique and short dot com and product name. Someone could use this to create a "which word isn't a real word" quiz.
I got toxoplasm, also very near but definition is just hilarious:
toxoplasm
tox·o·plasm
(in humans) a microorganism of the alimentary canal, which forms a protective passage across nerve vessels adjacent to the colon, for example in the trachea
"there is evidence that hemostasis greatly increases the risk of malaria by toxoplasm"
The disease is called toxoplasmosis, caused by Toxoplasma gondii; I've heard about it, which is why I pasted this definition. This AI has a blacklist of valid words; toxoplasma is probably on the list, toxoplasm is not. I don't know that much about Latin, but toxoplasm looks similar to toxoplasma yet is something else (sounds like a tissue to me). Toxoplasm without the "a" at the end probably does not exist, but I may be very wrong, so take that with a grain of salt.
> "protective passage across nerve vessels adjacent to the colon, for example in the trachea"
colon in trachea and toxoplasm(a) forming a protective passage is the funny part for me.
a tense tense [sic] of a verb in de-escalation, usually after some verb has been removed or added
"she gave a nervous jibberish if I didn't get the last word"
the space beyond the earth's crust which is about 2.8 billion miles (4.8 km) across and includes the oceans and Mars, Uranus, Neptune, and Jupiter and beyond
"the space between the earth's crust and the leptosphere"
ku·ber·netes
- a swelling of the cornea beneath the eye, caused by fluid aspiration,
typically occurring as a form of secondary hairlike cloud formation during movement
The first word I got was disappointed (different definition/usage though), are you running these results through a real dictionary before serving them?
(in Hinduism) a state of complete consciousness and mind
"she had reached a mystical state of palumpolism"
%
shakura
a traditional Western-style ceremony performed by traditional Japanese people in which offerings, including candles, were offered to the dead before burial
"the shakuras return next year"
%
empaired
em·paired
(of a person) having; displaying irrationally
"he seemed to live vicariously through his empaired friends"
%
mousselike
mous·se·like
manifesting as lovable or strange, especially for unpleasant or stupid reasons
They should reach out to a few of the domain search sites, and flag the ones where the .com is available (and split the domain name registration commission...)
Haha, no but that's neat. The generation process is a sampling procedure so "anything goes" and it can stumble on unlikely samples from time to time. I should probably filter out recursive ones!
a dark seaweed with silvery-white fur, the male of which has plumes of cyanobacteria similar to those of the copperhead
"copperfish will hunt bivalve mollusks"
Yup, press the "Generate your own" button on the site. If you want to go the other direction (definition to made-up word) you can hit up my twitter bot @robo_define: https://twitter.com/robo_define
I was gonna say that it would be cool if you could ask for the meaning of your own made-up words, and you CAN! That's basically an infinite generator of Douglas Adams's "meaning of liff" style definitions.
liff
a short length of soft, warm linen for legs, worn especially during formal events
"a lace liff"
I wish there were one for 70s/80s album art, but last time I looked at these projects they are still out of reach for hobbyists. I could stare at those all day.
This project is one huge yak shave starting from that idea! I was trying to pick a company name in the AI space and thought it would be appropriate if it was generated by ML. I tried a few datasets for training, and using the Oxford English Dictionary led me here
Fascinating app. Btw postclassification
post·clas·si·fi·ca·tion
the action or fact of grouping something into one class
"his postclassification as a member of a Chinese Marxist party"
I don't have anything other than to say this is really cool! Congrats to the creator, I love useless, cool shit like this. I prefer seeing this kind of awesome stuff instead of another boring "How we upgraded our server" blog from Slack or Instagram.