The paper shows that Stable Diffusion and Google's Imagen can regenerate individual images from their training sets. They show this is very rare, but that such memorized images can be found reliably.
No doubt, but how relevant? If you could somehow go through my brain you'd also find the occasional piece of art and literature - I've memorised a few good poems and songs. And I can recognise certain paintings on sight, which would be difficult if they weren't accurately encoded in my mind somewhere. The fact that I have memorised them doesn't mean I'd be violating anyone's copyright if I attempted to compose poems and songs of my own.
Why do people bring up humans and their brains, as if that changes how models create and store image data/derivative data in a very concrete form of bits on a disk? Just as a distraction?
Absolutely, but that is exactly what makes the relevance of this fact interesting. If you could load my brain into Python, would it then become copyright infringement, despite the content being logically unchanged?
Memorising a picture using a fleshy model is fine, so the raw fact that art has been found in a black box model here isn't necessarily relevant. Might be. Might not be.
I understand what you’re saying, but if you could load a brain into Python, I suspect the laws would quickly change to reflect that new reality.
The biological/evolutionary limits of humans are core assumptions of current laws, and I’d argue that the operating environment has changed enough to make those assumptions outdated.
> Memorising a picture using a fleshy model is fine
I imagine the fleshy model is fine not because it’s fleshy, but because it’s the model that people were targeting when writing current laws.
Even if no memorization occurred, there are still big questions about why such a model should be treated like anything other than just another computer program from a legal perspective.
Eh... The MP3 decoder can generate copyrighted music if you feed it the right inputs...
Likewise, in this work they prime the pump by using exact training prompts of highly duplicated training images. And then you have to generate 500 images from that prompt to find 10 duplications. You've really gotta want to find the duplicates, which indicates that these are going to be extremely rare in practice, and even more rare once the training data is hardened against the attack by deduplication.
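For concreteness, the procedure described above boils down to something like the sketch below. This is only an illustration, not the authors' actual pipeline: it assumes the Hugging Face diffusers StableDiffusionPipeline, and the checkpoint id, training caption, image path, and threshold are placeholders, with a crude per-pixel distance standing in for the paper's similarity test.

    # Illustration only: sample many images from a caption that is heavily
    # duplicated in the training set and flag generations that land very close
    # to the original training image. Checkpoint id, file path, and threshold
    # are placeholders; per-pixel RMSE stands in for the paper's similarity test.
    import numpy as np
    import torch
    from PIL import Image
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "caption copied verbatim from a heavily duplicated training image"
    reference = np.asarray(
        Image.open("training_image.png").convert("RGB").resize((512, 512)),
        dtype=np.float32,
    ) / 255.0

    near_copies = []
    for seed in range(500):  # many samples per prompt, as described above
        gen = torch.Generator(device="cuda").manual_seed(seed)
        img = pipe(prompt, generator=gen).images[0]
        arr = np.asarray(img.convert("RGB").resize((512, 512)), dtype=np.float32) / 255.0
        rmse = float(np.sqrt(((arr - reference) ** 2).mean()))
        if rmse < 0.1:  # arbitrary "near-copy" threshold
            near_copies.append(seed)

    print(f"{len(near_copies)} of 500 generations look like near-copies")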
> Eh... The MP3 decoder can generate copyrighted music if you feed it the right inputs...
Good analogy! An MP3 decoder takes an input and produces an output. If the output is copyrighted material, it's well understood that the input is simply a transformed version of that same copyrighted material and is similarly copyrighted.
The SD model is very much analogous. The prompt causes the algorithm to extract some output from the input model. If the output is copyrighted material, then similarly the input model must carry a transformed version of that same copyrighted material and is therefore also subject to copyright.
Right?
By the way, I pose this, but I highly doubt this is actually how the courts will rule. I think they'll find the model itself is fine, that the training is subject to a fair use defense, but that the outputs may be subject to copyright if there's substantial similarity to an existing work in the training set.
The probability of synthesizing a close-enough copy of a training sample, given that the dataset does not contain duplicates, is astronomically small. In this work they purposefully manipulated the dataset, used specific prompts, gave all sorts of advantages to the adversary, and generated many images per prompt in order to find such cases.
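To make the deduplication point concrete, here is a toy sketch of one way training data can be hardened: drop any image whose perceptual hash is within a small Hamming distance of one already kept. The imagehash/Pillow libraries, directory name, and distance threshold are assumptions on my part; real pipelines typically use embedding-based near-duplicate detection at far larger scale.

    # Toy near-duplicate filter over a folder of training images using
    # perceptual hashes. Directory name and Hamming-distance threshold are
    # illustrative only.
    from pathlib import Path
    from PIL import Image
    import imagehash

    kept_hashes = []
    kept_paths, dropped = [], 0
    for path in sorted(Path("training_images").glob("*.jpg")):  # hypothetical folder
        h = imagehash.phash(Image.open(path))
        # ImageHash subtraction gives the Hamming distance between two hashes
        if any(h - prev <= 4 for prev in kept_hashes):
            dropped += 1
            continue
        kept_hashes.append(h)
        kept_paths.append(path)

    print(f"kept {len(kept_paths)} images, dropped {dropped} near-duplicates")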
I imagine "reliably" means that with the exact prompt, seed, CFG scale, model checkpoint, and so on, and on the same hardware, they can keep getting an image that they consider close enough to the original.
Hi, I am the first author of this paper and I am happy to answer any questions. You can find the technical paper here: https://arxiv.org/abs/2205.09665.
Hey, this is cool. I do the NYT Crossword every day. A few questions:
1. You mention an 82% solve rate. The NYT puzzle gets "harder" each day Monday through Saturday. Do you track the days separately? If so, I'd be curious how much of the unsolved 18% ends up on Fridays and Saturdays. (For anyone who doesn't know, the Sunday puzzle is outside the Monday-Saturday range since it's a bigger puzzle.)
2. Related to the above, Thursday puzzles usually have "tricks" (skipped letters and whatnot) or require a rebus (multiple letters in one space) - do you handle these at all?
3. Is this building an ongoing model and getting better at solving? Or did you have to seed it with a set of solved puzzles and clues?
2. Our current system doesn't have any handling for rebuses or similar tricks, although Dr. Fill does. I think this is part of why Thursday is the hardest day for us, even though Saturday is usually considered the most difficult.
3. We trained it with 6.4M clues. As new crosswords get published, we could theoretically retrain our model with more data, but we aren't currently planning to do that.
I don't suppose you gave more weight to more recent puzzles? Is there a time period or puzzle setter that was harder to solve because they favored an unusual clue type?
We didn't give more weight to recent puzzles. In fact, we trained on pre-2020 data, validated on data from 2020, and evaluated on post-2020 data.
Our model seems to perform well despite this "time generalization" split, but there are a couple instances where it struggled with new words. For example, we got the answer "FAUCI" wrong in a puzzle from May 2021. Even though Fauci was in the news before 2020, I guess he wasn't famous enough to show up in crosswords, and therefore his name wasn't in our training data.
I think evaluating performance by constructor would be really interesting! But we haven't done that.
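For readers curious what the pre-2020 / 2020 / post-2020 split mentioned above looks like in practice, here is a minimal sketch. The file name and column names are hypothetical, not the paper's actual code:

    # Hypothetical "time generalization" split: train on clues published before
    # 2020, validate on 2020, evaluate on everything after.
    import pandas as pd

    clues = pd.read_csv("clues.csv", parse_dates=["puzzle_date"])  # hypothetical file

    train = clues[clues.puzzle_date.dt.year < 2020]
    valid = clues[clues.puzzle_date.dt.year == 2020]
    test = clues[clues.puzzle_date.dt.year > 2020]

    print(len(train), len(valid), len(test))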
For handling cross-reference clues, do you think it would be feasible in the future to feed the QA model a representation of the partially-filled puzzle (perhaps only in the refinement step - hard to do for the first step before you have any answers!), in order to give it a shot at answering clues that require looking at other answers?
It feels like the challenges might be that most clues are not cross-referential, and that even for those that are, most of the information in the puzzle is irrelevant - you only care about one answer among many, so it could be difficult to learn to find the information you need.
But maybe this sort of thing would also be helpful for theme puzzles, where answers might be united by the theme even if their clues are not directly cross-referential, and could give enough signal to teach the model to look at the puzzle context?
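One naive way to give a clue-answering model that kind of context would be to serialize the partial fill and the crossing entries into the input text. The sketch below is purely hypothetical - it is not how the paper's system works - and the classes, field names, and formatting are made up for illustration:

    # Hypothetical serialization of a clue plus partial-grid context into a
    # single model input string. Everything here (classes, format) is invented.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Entry:
        number: int
        direction: str              # "A" (across) or "D" (down)
        clue: str
        pattern: str                # e.g. "_O__" with "_" for unfilled squares
        crossing: List["Entry"] = field(default_factory=list)

    def build_query(entry: Entry) -> str:
        """Fold the clue, its current letter pattern, and crossing clues into one string."""
        parts = [f"{entry.number}{entry.direction}: {entry.clue}",
                 f"pattern: {entry.pattern}"]
        for c in entry.crossing:
            parts.append(f"cross {c.number}{c.direction}: {c.clue} = {c.pattern}")
        return " | ".join(parts)

    # Example: a cross-reference clue whose answer depends on another entry.
    anchor = Entry(17, "A", "See 5-Down", "_O__")
    cross = Entry(5, "D", "Italian farewell", "CIA_", crossing=[anchor])
    anchor.crossing.append(cross)
    print(build_query(anchor))
    # 17A: See 5-Down | pattern: _O__ | cross 5D: Italian farewell = CIA_

Whether a QA model could actually learn to pick out the one relevant crossing from all that context is exactly the open question raised above.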
One thing I was curious about - the ACPT is a crossword speed-solving competition, with solving time a major component of the total score. How did you approach leveling the playing field between the human and computer competitors?