
I posed a similar question to a postdoc at CSAIL: generative image models like DALL-E consistently screw up the number of eyes and fingers, so I wondered if you could use a knowledge graph in conjunction with the drawing algorithm to imbue the generation with more realism. At the time they said they weren't aware of research in that direction. Fair enough. I'm still interested in seeing whether these purely generative models can reference knowledge and apply it.
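Something like this is what I had in mind, as a hypothetical sketch (the FACTS table, augment_prompt, and the constraint phrasing are all invented for illustration; a real system would query an actual knowledge graph and condition the generator on the result):

    # Hypothetical sketch: look up anatomical counts in a tiny
    # "knowledge graph" and fold them into the prompt as explicit
    # constraints for the generator to (hopefully) respect.
    FACTS = {
        ("human", "fingers_per_hand"): 5,
        ("human", "eyes"): 2,
        ("sloth", "toes_per_foot"): 3,
    }

    def augment_prompt(prompt: str, subject: str) -> str:
        constraints = [
            f"{subject} has exactly {n} {attr.replace('_', ' ')}"
            for (s, attr), n in FACTS.items() if s == subject
        ]
        if not constraints:
            return prompt
        return prompt + " (" + "; ".join(constraints) + ")"

    print(augment_prompt("a portrait of a person waving", "human"))
    # a portrait of a person waving (human has exactly 5 fingers
    # per hand; human has exactly 2 eyes)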



Human features such as faces and fingers are easy for us humans to parse because our attention is instantly drawn to them, and we are highly sensitive to "noise" in that particular domain. An AI-generated image of a three-toed animal pictured with four toes, for example, might slip past a lot of people.

I unfortunately don't see how a vector database would be able to help with this. The knowledge bases they provide are meant to enable large-scale retrieval and vector search. One way to fix the problem you mention might be to increase the proportion of features that draw human attention (e.g. faces) in the training dataset.
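For instance, a rough PyTorch sketch of oversampling (the has_face flags are assumed to come from some off-the-shelf face detector, and dataset is whatever your training set is; none of this is from a real pipeline):

    import torch
    from torch.utils.data import DataLoader, WeightedRandomSampler

    # Oversample images containing attention-drawing features (faces)
    # so the model sees them more often during training.
    has_face = torch.tensor([1, 0, 0, 1, 0])  # toy flags for 5 images
    weights = 1.0 + 3.0 * has_face.float()    # 4x weight on face images
    sampler = WeightedRandomSampler(weights,
                                    num_samples=len(weights),
                                    replacement=True)
    # loader = DataLoader(dataset, batch_size=64, sampler=sampler)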


> faces and fingers are easy for us humans to parse because...

because we have dedicated neural hardware for faces: the "fusiform face area", a relatively small volume of the brain which is used by non-autistic humans for facial recognition and processing. A lot of our sensitivity to human faces is neural structure that's in our DNA, and yes, that is paired with a lifelong obsession with human faces and huge amounts of energy spent learning more about them. But after infant brain development we're not relying on a blank slate for human faces.

> One case study of agnosia provided evidence that faces are processed in a special way. A patient known as C. K., who suffered brain damage as a result of a car accident, later developed object agnosia. He experienced great difficulty with basic-level object recognition, also extending to body parts, but performed very well at recognizing faces. A later study showed that C. K. was unable to recognize faces that were inverted or otherwise distorted, even in cases where they could easily be identified by normal subjects. This is taken as evidence that the fusiform face area is specialized for processing faces in a normal orientation.


Is it known what autistic people use it for? Different for each person?


It seems like generative image models can't count very well, and we notice whenever the count of something is supposed to be fixed. They also screw up the black keys on a piano.


Generally, the models seem to struggle with fine details that correlate with broader features. Each finger typically looks OK; it's the hand that's off. Remote controls might have appropriately detailed buttons in a nonsense layout. I wonder if the training process could "weight" various visual features instead of optimizing loss equally over the whole image.
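The naive version would be a masked reconstruction loss; a sketch (the hand/face mask is assumed to come from a detector run over the training data, which is the hand-wavy part):

    import torch

    def weighted_mse(pred, target, mask, boost=10.0):
        # Non-uniform loss: pixels inside the mask (detected hands,
        # faces, button panels...) count `boost` times more than
        # the rest of the image.
        w = 1.0 + (boost - 1.0) * mask  # mask is 0/1, broadcastable
        return (w * (pred - target) ** 2).mean()

    pred = torch.rand(1, 3, 64, 64)
    target = torch.rand(1, 3, 64, 64)
    mask = torch.zeros(1, 1, 64, 64)
    mask[..., 20:44, 20:44] = 1.0       # pretend a hand is here
    loss = weighted_mse(pred, target, mask)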


You're close, but there's something even more fundamental that goes overlooked: the noise model, namely the use of Gaussian per-pixel noise. The spectrum of the noise distribution gets in the way of exactly the fine details before the model has any chance to learn them. It has no chance to learn the layout of TV remote buttons because at the first step of the diffusion process there's already noise at the same length scale as the buttons you want it to learn.
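A quick way to see it in one dimension (toy sketch; a random walk stands in for an image row, since both have power spectra that fall off with frequency):

    import numpy as np

    # White Gaussian noise has a flat power spectrum; natural images
    # concentrate power at low frequencies. So after even one noising
    # step, per-frequency SNR is worst exactly where fine detail lives.
    rng = np.random.default_rng(0)
    x = np.cumsum(rng.standard_normal(4096))  # random walk, ~1/f^2 power
    noise = rng.standard_normal(4096)         # one step of Gaussian noise

    def power(sig):
        return np.abs(np.fft.rfft(sig)) ** 2

    snr = power(x) / power(noise)
    print("SNR, lowest frequency bins :", snr[1:9].mean())   # huge
    print("SNR, highest frequency bins:", snr[-8:].mean())   # buried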


What is a finger?

What is two?

These categories don't exist.

Categories don't exist.


They don't exist but try navigating life without them.


I was being vague, but I meant those questions in the context of an ML model.

ML models are completely inference-based. They have zero symbolic definitions. They don't categorize subjects at all.

Because of that, you can't tell a model how many X to draw.



