Using an innocuous encoding of a word is a form of encryption. People who expect to be under surveillance agree on a set of code words to denote illegal things. Though it is hard, there are several ways to semi-automatically break such linguistic encryption.
Imputation. [1] Remove a word from a sentence, then try to predict it from its surrounding context: "when I get home tonight, i vape a ___ then space out". Assign predicted probabilities to candidate readings of the imputed word ":leaf emoji:": ["marijuana cigarette", "electronic cigarette", "cigar"].
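A minimal sketch of such imputation, using trigram counts over a tiny hand-made stand-in corpus (a real system would train on something like the benchmark corpus in [1]; all sentences here are invented for illustration):

```python
from collections import Counter

# Tiny stand-in corpus; a real system would use a large corpus
# such as the One Billion Word Benchmark cited in [1].
corpus = (
    "i smoke a cigarette then relax . "
    "i vape a cigarette then space out . "
    "he rolls a joint then spaces out . "
    "she lights a cigar then relaxes ."
).split()

# Count (left_word, word, right_word) trigram contexts.
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def impute(left, right, candidates):
    """Score each candidate word by how often it fills the left _ right slot."""
    scores = {c: trigrams[(left, c, right)] for c in candidates}
    total = sum(scores.values()) or 1
    return {c: scores[c] / total for c in candidates}

# Which word most plausibly fills "vape a ___ then"?
probs = impute("a", "then", ["cigarette", "joint", "cigar"])
```

In this toy corpus "cigarette" fills the slot twice while the alternatives fill it once each, so it gets the highest imputed probability.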
Active learning. Seed the algorithm with expert knowledge from law enforcement, drug users, and social workers who know the code words ("encryption keys") in use.
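A toy sketch of that loop, with made-up messages and made-up model scores: uncertainty sampling routes the message the model is least sure about to the human expert for labeling, so expert time goes where it helps most.

```python
# Model's current P(coded drug talk) per message -- all values invented.
messages = {
    "got some leaf for tonight": 0.55,
    "raking leaves in the yard": 0.10,
    "vape a leaf then space out": 0.52,
    "new e-cig flavors in stock": 0.30,
}

def most_uncertain(scored):
    # Uncertainty sampling: pick the example with probability closest to 0.5,
    # i.e. where the model is most unsure and an expert label is most useful.
    return min(scored, key=lambda m: abs(scored[m] - 0.5))

query = most_uncertain(messages)  # send this message to the human expert
```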
Anomaly detection. Though perhaps easily confused with other, innocuous usage, street slang is a distinct form of language with its own properties and patterns. Compared to common discourse it looks strange and unpredictable, and that statistical difference can be measured.
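One simple way to measure that strangeness, sketched here with a tiny invented background corpus: score each message by its average per-word surprisal under a smoothed unigram model of common discourse, and flag the high scorers.

```python
import math
from collections import Counter

# Background "common discourse" corpus (tiny stand-in for illustration).
background = ("i am going home tonight and then i will watch a movie "
              "when i get home i make dinner and then relax").split()
counts = Counter(background)
total = sum(counts.values())

def avg_surprisal(message, alpha=1.0):
    """Mean negative log2-probability per word, with add-alpha smoothing.
    High values mean the message looks unlike common discourse."""
    vocab = len(counts) + 1  # +1 slot for unseen words
    words = message.split()
    s = 0.0
    for w in words:
        p = (counts[w] + alpha) / (total + alpha * vocab)
        s += -math.log2(p)
    return s / len(words)

normal = avg_surprisal("when i get home tonight i relax")
slangy = avg_surprisal("vape a leaf then space out")  # mostly unseen words
```

The slang-laden message scores higher surprisal than the ordinary one, which is exactly the signal an anomaly detector would threshold on.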
Doing this rigorously, like building search engines for illegal drugs or human trafficking on the deep web, requires a lot of expert knowledge. [2] Maybe future deep learning can do this end-to-end on arbitrary domains? [3] Let's see.
Resulting models will likely be released as open source, and the techniques published in open journals.
The big money is in productionizing and running it on bigger datasets.
Kaggle competitions are not a data-science-product-for-hire kind of thing, like some logo design contest; they are a sport. Super-GM chess competitions see smaller prize pools.
Mediocre people, including me, would compete for free, just to get access to interesting data like this. The really talented people are driven not by money but by competition and fame.
Increasing the prize money is more of a marketing move and only attracts more low-to-mediocre people trying to get lucky in a lottery: it won't increase the quality of the top-10 solutions. And no computer vision PhD or professor is going to drop everything they are working on for a small chance to win 400k.
1. Google has been an AI company from the very beginning (information retrieval).
2. Google is investing in and doing generally useful applied AI, not an AGI moonshot.
3. Google's AI researchers are not 100% sure that AGI is just a few short years away.
4. Google's major source of income is advertising. A lot of non-technical people work on this, freeing others to do more research and improve search.
5. As said, AI has been in Google's DNA from the start. They are the biggest AI company in the world, and will die or be dethroned when they let AI research wither.
6. Avoiding blanket humility keeps hunger, innovation, and daring alive. "At Hooli, nothing is ever impossible".
Surely Google follows their own guidelines: you can't find Google's own search-result pages indexed on Google itself (or on any other search engine, with or without ads, for that matter). Google Search is more of an application than a content site.
Otherwise, when Google finds itself breaking the "rules", they act:
- Google banned the Chrome page for buying paid links.
- Google banned an acquired company (BeatThatQuote) for violating rules.
- Google penalized their Adwords FAQ pages for cloaking.
- Google reduced PageRank for Google Japan for buying links.
- Google removed Adwords support pages for keyword stuffing.
At some point the machine fact-checking relies on data input by humans. Or does the machine interpret data directly from cameras on the street to determine, e.g. that suspect A shot victim B with weapon C? Does it interpret a historical textbook and assess the veracity of its sources and claims? Or does it build a time machine to go into the past and acquire raw data to verify claimed facts?
> At some point the machine fact-checking relies on data input by humans.
But so does nearly every ML model? In the case of a spell checker it is using corpora made by humans. If the majority of humans start spelling words differently, then facts about the correct spellings change with them.
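A toy illustration of that point, with an invented mini-corpus: a frequency-based corrector's notion of the "correct" spelling is nothing but a statistic of human-written text, so when the writing population shifts its usage, the "fact" shifts with it.

```python
from collections import Counter

# A spell checker's "facts" are just word frequencies in human-written text.
corpus = "colour colour colour color favour favourite".split()
counts = Counter(corpus)

def most_likely(candidates):
    # A real corrector generates candidates by edit distance from the typo;
    # here they are simply given. Pick the most frequent spelling.
    return max(candidates, key=lambda w: counts[w])

before = most_likely(["colour", "color"])  # "colour" dominates (3 vs 1)

# The population starts spelling the word differently...
counts.update("color color color color".split())
after = most_likely(["colour", "color"])   # ...and the "correct" answer flips
```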
> Or does the machine interpret data directly from cameras on the street to determine, e.g. that suspect A shot victim B with weapon C?
If the military gets their way, this will happen sooner rather than later. It is not technically infeasible to do activity detection from drone footage.
> Does it interpret a historical textbook and assess the veracity of its sources and claims?
Yes. Just like a journalist would when fact checking an article about WWII.
> Or does it build a time machine to go into the past and acquire raw data to verify claimed facts?
Raw data is both an oxymoron and a bad idea. Data is brought into existence by human-made measuring devices.
AI hype journalists will find something to write about regardless of whether the industry makes its research accessible to the wider public.
Markov-chain generators have been around for a while, and have been used to throw off spam detectors. This should not stop research, but instead grow more research into adversarial usage of machine learning models.
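For reference, a bigram Markov-chain generator of the kind long used to pad spam fits in a few lines (the training string below is a made-up example):

```python
import random
from collections import defaultdict

def build_chain(text):
    """Bigram Markov chain: map each word to its observed successors."""
    words = text.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def generate(chain, start, n, seed=0):
    """Random walk over the chain, starting from `start`, for n steps."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        successors = chain.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return " ".join(out)

chain = build_chain("buy cheap pills now buy cheap watches now buy now")
sample = generate(chain, "buy", 5)  # locally plausible, globally meaningless
```

The output is locally fluent (every bigram occurred in the training text) yet carries no intent, which is precisely why it fools shallow spam statistics.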
Beating the state of the art with one-shot learning is not common. Transfer learning for NLP is also quite uncharted.
Also, the technique is quite novel: this is not nets pre-trained on labeled data, it is an unsupervised generative model.
Future research directions are exciting: Unsupervised prediction of the next frame in a video, and then being able to one-shot learn a wide range of visual tasks.
Is a mistake still a mistake when it is a profitable action?
If fitting to human irrationality increases generalization performance, then it does not matter if the "machines seem destined to repeat our mistakes", it is still a useful signal. If fitting to human irrationality decreases generalization performance, your algorithm is overfit to noise (and you have bigger fish to fry than human irrationality).
Overfitting to noise is perfectly avoidable, not predestined, when part of your data is noisy (and noisy data is the rule, not the exception).
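A small sketch of that distinction, assuming a linear true signal with zero-mean noise: a model flexible enough to interpolate the training points reproduces the noise exactly, while a simple model averages it away.

```python
import numpy as np

rng = np.random.default_rng(0)

# True signal is linear; observations carry zero-mean Gaussian noise.
n = 8
x = np.linspace(0.0, 1.0, n)
y_true = 2.0 * x + 1.0
y_noisy = y_true + rng.normal(0.0, 1.0, n)

# Simple model (degree 1) vs. a polynomial flexible enough to
# interpolate every noisy training point (degree n - 1).
simple = np.polyval(np.polyfit(x, y_noisy, 1), x)
overfit = np.polyval(np.polyfit(x, y_noisy, n - 1), x)

# Measure error against the noise-free signal: the interpolant has
# memorized the noise, the simple fit has mostly averaged it out.
mse_simple = float(np.mean((simple - y_true) ** 2))
mse_overfit = float(np.mean((overfit - y_true) ** 2))
```

Because the simple fit is a least-squares projection onto a small model class that contains the true signal, its error against the truth can only be a fraction of the noise the interpolant swallows whole.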
[1] https://arxiv.org/abs/1312.3005 "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling"
[2] http://www.darpa.mil/program/memex
[3] https://universe.openai.com/envs#world_of_bits