The fact that I can do this on commodity hardware with a 4GB model, one that understands both text and images, absolutely blows my mind.
I almost feel like in the near future, a 100GB model may be able to handle speech -> text and video -> live scene graph entirely offline. That could give a robot a base-level physical understanding of our world, like a 4-year-old has (objects, their relationships to other objects, and their behaviors).