Could you provide more details on the multimodal data ingestion process? What types of data can R2R currently handle, and how are non-text data types embedded?
Can ingestion be streamed from logs?
There are a lot of good questions around ingestion today, so we will likely figure out how to expand the ingestion pipeline intelligently.
For MP3s we use Whisper to transcribe the audio. For videos we transcribe the audio with Whisper and sample frames to describe with a multimodal model. For images we again produce a thorough text description - https://r2r-docs.sciphi.ai/cookbooks/multimodal
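To make that concrete, here is a minimal sketch of that audio/video/image-to-text flow, not R2R's actual ingestion code. It assumes openai-whisper, opencv-python, and the openai client are installed; the `gpt-4o` model name and the `describe_frame` / `ingest_*` helpers are placeholders for illustration.

```python
# Rough sketch of the "normalize everything to text" approach described above.
import base64

import cv2
import whisper
from openai import OpenAI

client = OpenAI()
asr = whisper.load_model("base")  # any Whisper size works


def describe_frame(jpeg_bytes: bytes) -> str:
    """Ask a vision-capable chat model for a thorough description of one image."""
    b64 = base64.b64encode(jpeg_bytes).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; swap in whatever multimodal model you use
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image thoroughly."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content


def ingest_mp3(path: str) -> str:
    # Audio becomes plain text via Whisper, then flows through the normal text pipeline.
    return asr.transcribe(path)["text"]


def ingest_video(path: str, num_frames: int = 8) -> str:
    # Transcribe the audio track, then describe a handful of evenly spaced frames.
    transcript = asr.transcribe(path)["text"]
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    descriptions = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if not ok:
            break
        ok, jpeg = cv2.imencode(".jpg", frame)
        if ok:
            descriptions.append(describe_frame(jpeg.tobytes()))
    cap.release()
    return transcript + "\n" + "\n".join(descriptions)


def ingest_image(path: str) -> str:
    with open(path, "rb") as f:
        return describe_frame(f.read())
```

The payoff of this design is that once everything is text, the existing chunking and embedding pipeline applies unchanged; the cost is that retrieval quality depends on how good the descriptions are.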
We have been testing multimodal embedding models and open-source models for description generation. If anyone has suggestions on SOTA techniques that work well at scale, we would love to chat and work to implement them. Long term, we'd like the system to be able to handle multimodal data locally.
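For anyone curious what the local route could look like, a hedged sketch assuming Hugging Face transformers: BLIP for local caption generation and CLIP for embedding images directly into a joint text/image space. The specific checkpoints are just examples, not something R2R ships.

```python
# Illustrative only: two open-source routes, not a committed R2R feature.
import torch
from PIL import Image
from transformers import (
    BlipForConditionalGeneration,
    BlipProcessor,
    CLIPModel,
    CLIPProcessor,
)

image = Image.open("example.jpg").convert("RGB")

# Route 1: generate the description locally, then embed it like any other text.
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
caption_ids = blip.generate(**blip_processor(image, return_tensors="pt"), max_new_tokens=64)
print(blip_processor.decode(caption_ids[0], skip_special_tokens=True))

# Route 2: skip the description step and embed the image directly with a
# multimodal embedding model, so image and text vectors share one space.
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    image_embedding = clip.get_image_features(
        **clip_processor(images=image, return_tensors="pt")
    )
print(image_embedding.shape)  # (1, 512) for this checkpoint
```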