
This is what I get out of Vicuna-13b:

Human: A glass door has 'push' written on it in mirror writing. To open the door should you 'push' or 'pull' it?

Assistant: Since the word "push" is written in mirror writing on the glass door, you should actually "pull" the door open instead of "push" it. Mirror writing is a writing method where the characters are reversed, so when you see the word "push" written in mirror writing, it is actually "pull" in the normal writing orientation.



The difference for me with GPT-4 is its ‘understanding’ of the scene and its explanation of WHY you should push or pull.

It talks about a door with people approaching from different directions. It has some idea of what those people would be thinking.

That seems different to just ‘mirror writing means do the opposite’.


I asked GPT-4 to draw a dog or a skull in OpenSCAD, and even though the end result was buggy, by commenting out parts of the code here and there and making some volumes transparent I could tell it got it roughly right. For instance, the dog had two eyes, two ears, and one long nose (potatoids). It understood the symmetry of both pairs but was unable to place them in the right spots. It's not that they were just misaligned; things were in the wrong planes, but they were there. As if it hadn't understood that a face is like one face of a cube.
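
For reference, a correctly composed potatoid dog head in OpenSCAD might look something like the sketch below. This is my own guess at the target, not GPT-4's actual output; all the numbers and feature placements are invented for illustration. The eye/ear pairs are mirrored across x = 0 by the loop over side = ±1, and every feature sits on or just outside the front (negative-y) face of the skull:

    // Hypothetical target shape, not GPT-4's code.
    module dog_head() {
        sphere(r = 20);                        // skull (potatoid)
        // Long snout: a tapered cylinder rotated to point forward (-y).
        translate([0, -15, -3]) rotate([90, 0, 0])
            cylinder(h = 18, r1 = 8, r2 = 4);
        for (side = [-1, 1]) {                 // mirrored pairs across x = 0
            translate([side * 8, -17, 7])      // eyes on the front face
                sphere(r = 3);
            translate([side * 13, -4, 16])     // ears on top of the head
                scale([1, 0.6, 2]) sphere(r = 5);
        }
    }
    dog_head();

The ‘wrong planes’ failure described above would correspond to the translate() offsets here pointing along the wrong axes, so a feature ends up on the side or top of the skull instead of its front face.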


I think things like this (or simpler things like asking ChatGPT for ASCII art of a circle) really show the difference between LLMs and humans. The issue is that it's a language model rather than an image one, so it doesn't understand the concept of ‘looks like a dog’.


Image models don't understand it either; they only know the typical "look" of something, not its correct proportions or number of parts. If the word "wheel" appears in the prompt, they might turn every circle-like shape in the image into a car wheel, because they cannot selectively apply parts of the prompt to parts of the image.

At least the few models I tinkered with all had this issue, and without some additional guidance that understands scene composition and anatomy/proportions in three dimensions this probably won't fundamentally improve.


I got it to extrude a cylinder into a sinusoidal shape, guiding it by feeding back screenshots of the scene converted to ASCII.
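
In OpenSCAD terms that could come out something like the sketch below (my reconstruction, assuming ‘extrude a cylinder into a sinusoidal shape’ means sweeping a circular cross-section along a sine curve; the variable names and values are mine):

    amplitude = 15;   // height of the sine wave
    radius    = 4;    // tube radius
    step      = 10;   // sample spacing in degrees (OpenSCAD's sin() takes degrees)

    // Sweep a circular profile along y = amplitude * sin(x) by hulling
    // consecutive samples into short solid segments.
    for (a = [0 : step : 360 - step])
        hull() {
            translate([a / 3, amplitude * sin(a), 0])
                sphere(r = radius);
            translate([(a + step) / 3, amplitude * sin(a + step), 0])
                sphere(r = radius);
        }

Hulling spheres gives round caps rather than a true cylinder cross-section, but it's the usual OpenSCAD idiom for sweeping along a curve, since the language has no built-in path extrusion.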


Maybe, but consider the post above where GPT-4 gets confused about the blind man on the other side of the door, while Vicuna-13b seems to figure it out. I accept that GPT-4 gave a better answer in this case, but its level of understanding of the scene under different scenarios still seems limited.


There's also a comment where GPT-4 was able to answer the question correctly. Seems like there's some randomness in play.



