lol i thought about DM'ing you but i was worried you'd have seen it already
also techcrunch.com/2024/07/11/a... for some coverage (not very satisfying though)
The latest round of language models, like GPT-4o and Gemini 1.5 Pro, are touted as "multi-modal," able to understand images and audio as well as text...