Recent posts
Fixed vision encoders like DINO have driven impressive progress in more learnable representations for generative modeling - but there is no universal variant across modalities, and they do not scale with the generative model. We introduce our self-supervised framework, Self-Flow, that builds learnability directly into flow models, working in a unified and scalable way across image, video and audio. Particularly excited about the gains on video-action prediction: Beyond the overall success rate improving substantially, more complex tasks - like "Open and Place" - see some of the clearest gains. So many interesting research questions to explore to make 🤖 go brrr Super glad to be working with my amazing colleagues @hila_chefer, Dominik, @dustin_podell, Vikash, @Vinh_Suhi, Antonio and @robrombach - as well as the whole @bfl_ml team! arxiv: https://t.co/eP7ip58Tff project page: https://t.co/GNShpBMEQ1



New research from @bfl_ml 🥳 Meet Self-Flow: our self-supervised framework for image, audio, video & world models 🤖 http://bfl.ai/research/self-flow Do generative models really need DINO to learn strong representations? We propose teaching them directly via a joint framework instead 🧵

