Google’s CoCa: a new breakthrough
The Contrastive Captioner (CoCa) model is the successor to CLIP and SimVLM. It works with both visual and textual information and handles a wide range of tasks, from image captioning to answering questions about videos. The new approach to pre-training multimodal models allowed Google's developers to reach 91% top-1 accuracy on ImageNet image classification.
Michael’s comment
An elegant solution built from familiar pieces: it combines two paradigms, contrastive learning and the encoder-decoder approach. It seems ImageNet will soon become a training dataset on the level of MNIST, and we will have to look for something more serious.
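To make the combination concrete, here is a minimal sketch of how a contrastive (CLIP-style) objective and a captioning objective can be trained together. The encoders, dimensions, and the 1:1 loss weighting are illustrative assumptions, not the configuration from the CoCa paper.

```python
# Illustrative sketch only: a toy combination of a CLIP-style contrastive loss
# with a captioning (next-token cross-entropy) loss, in the spirit of CoCa.
# Dimensions and the 1:1 weighting are assumptions, not the paper's setup.
import jax.numpy as jnp
from jax.nn import log_softmax

def coca_style_loss(image_emb, text_emb, caption_logits, caption_tokens,
                    temperature=0.07, contrastive_weight=1.0, caption_weight=1.0):
    """image_emb, text_emb: (batch, dim); caption_logits: (batch, seq, vocab);
    caption_tokens: (batch, seq) integer target token ids."""
    # Symmetric image-text contrastive loss over the batch (matching pairs on the diagonal).
    image_emb = image_emb / jnp.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / jnp.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature
    labels = jnp.arange(logits.shape[0])[:, None]
    nll_i2t = -jnp.take_along_axis(log_softmax(logits, axis=-1), labels, axis=-1).mean()
    nll_t2i = -jnp.take_along_axis(log_softmax(logits.T, axis=-1), labels, axis=-1).mean()
    contrastive = (nll_i2t + nll_t2i) / 2

    # Captioning loss: standard next-token cross-entropy from the decoder head.
    log_probs = log_softmax(caption_logits, axis=-1)
    caption = -jnp.take_along_axis(log_probs, caption_tokens[..., None], axis=-1).mean()

    return contrastive_weight * contrastive + caption_weight * caption
```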
Flying over the Flamingo Nest
www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model
The new multimodal SotA model from DeepMind solves complex vision-and-language tasks given just a couple of task-specific examples, and without any additional training.
Beyond traditional uses such as classification, Flamingo can even hold a short dialogue about the content of an image, and it will refine its answers if you point out its mistakes.
Michael’s comment
Models like Flamingo can bring quite practical benefits to society. Multimodal solutions are needed for AI applications such as helping visually impaired people with everyday tasks or improving algorithms that search for dangerous content on the Internet.
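For a sense of what "a couple of specific examples" means in practice, here is a toy sketch of assembling a few-shot prompt by interleaving demonstration images with their expected answers. The Image placeholder and the prompt format are assumptions for illustration, not Flamingo's actual input pipeline.

```python
# Illustrative sketch only: building a few-shot prompt for a Flamingo-style
# model by interleaving example images with their expected text, followed by
# the query image. `Image` is just a stand-in type, not a real API.
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class Image:
    path: str  # placeholder for real pixel data

def build_few_shot_prompt(examples: List[Tuple[Image, str]],
                          query: Image) -> List[Union[Image, str]]:
    """Interleave (image, answer) demonstrations with the final query image."""
    prompt: List[Union[Image, str]] = []
    for image, answer in examples:
        prompt.append(image)
        prompt.append(f"Answer: {answer}")
    prompt.append(query)
    prompt.append("Answer:")  # the model continues generating from here
    return prompt

# A couple of demonstrations is often enough for simple classification-style queries.
prompt = build_few_shot_prompt(
    examples=[(Image("cat.jpg"), "a cat"), (Image("dog.jpg"), "a dog")],
    query=Image("unknown.jpg"),
)
```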
How to tame a transformer
github.com/google-research/big_vision
Google Research has opened access to its codebase for training large-scale computer vision models on Cloud TPU virtual machines. Interested users can now read the source code of ViT, LiT, MLP-Mixer and other large AI projects. This makes it possible to build new CV research on a reliable foundation and to launch training on almost any amount of hardware, from a single GPU to 2048 TPU cores.
Michael’s comment
Important news for big models in open source. Thanks to Big Vision, the community has new opportunities to experiment and to search for something fundamentally new, saving time on preparing the training pipeline.
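Big Vision is built on JAX, and the reason one codebase can run on a single GPU or a large TPU slice can be illustrated with generic JAX (this sketch is not code from the repository): a pmap'd training step is simply replicated across whatever accelerators are visible.

```python
# Generic JAX sketch, not code from the big_vision repository: the same
# training step runs on 1 GPU or a TPU slice because pmap replicates it
# across every accelerator JAX can see on the host.
import functools
import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    # Toy linear model; in big_vision the real models (ViT, MLP-Mixer, ...) sit here.
    preds = batch["x"] @ params["w"]
    return jnp.mean((preds - batch["y"]) ** 2)

@functools.partial(jax.pmap, axis_name="devices")
def train_step(params, batch):
    grads = jax.grad(loss_fn)(params, batch)
    grads = jax.lax.pmean(grads, axis_name="devices")  # average gradients across devices
    return jax.tree_util.tree_map(lambda p, g: p - 0.01 * g, params, grads)

n_dev = jax.local_device_count()
params = jax.device_put_replicated({"w": jnp.zeros((16, 1))}, jax.local_devices())
batch = {"x": jnp.ones((n_dev, 8, 16)), "y": jnp.ones((n_dev, 8, 1))}
params = train_step(params, batch)  # the call looks the same whether n_dev is 1 or 2048
```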
Can a robot write a symphony?
arxiv.org/pdf/2205.05448.pdf
Scientists from the AI department of China's Central Conservatory of Music presented SymphonyNet, a model designed to generate symphonic music. The approach is based on traditional text-generation solutions adapted to the specifics of the task. In particular, the developers propose a linear transformer for processing ultra-long sequences and a modified Byte Pair Encoding algorithm.
Michael’s comment
Besides generating music, such models could be used for more mundane tasks, for example analyzing and forecasting time series. After all, why isn't the stream of readings from hundreds of sensors a symphony?
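As a rough illustration of the Byte Pair Encoding idea applied to notes rather than characters, here is a bare-bones merge loop over a toy melody. SymphonyNet's actual Music BPE works on its own multi-track event representation, so the token names and merge criterion below are simplified assumptions.

```python
# Illustrative sketch only: a minimal BPE-style merge loop on a toy sequence
# of note tokens. Recurring adjacent pairs are fused into single vocabulary
# items, shortening very long symbolic scores for a transformer.
from collections import Counter

def learn_merges(tokens, num_merges):
    """Repeatedly merge the most frequent adjacent token pair into one symbol."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing worth merging any more
        merged = f"{a}+{b}"
        merges.append((a, b))
        # Rewrite the sequence with the new merged symbol.
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return merges, tokens

# A toy "melody": the repeated C4-E4-G4 motif collapses into one symbol.
notes = ["C4", "E4", "G4", "C4", "E4", "G4", "A4", "C4", "E4", "G4"]
merges, compressed = learn_merges(notes, num_merges=3)
print(merges)      # e.g. [('C4', 'E4'), ('C4+E4', 'G4')]
print(compressed)  # shorter sequence built from merged symbols
```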