Google’s CoCa: a new crush
The Contrastive Captioner (CoCa) model is the successor to CLIP and SimVLM. It handles both visual and textual information and solves a wide range of tasks, from generating image captions to answering questions about videos. A new approach to pre-training multimodal models let Google's developers reach 91% top-1 accuracy on ImageNet image classification.
An elegant solution built on familiar ideas, CoCa combines two paradigms: contrastive learning and the encoder-decoder approach. It seems ImageNet will soon become a warm-up dataset like MNIST, and researchers will have to look for something more serious.
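The combination of the two paradigms boils down to training with a weighted sum of two objectives: a CLIP-style contrastive loss over paired image/text embeddings and an autoregressive captioning (cross-entropy) loss. Below is a minimal NumPy sketch of that combined objective; the function names and the loss weights are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # Normalize embeddings to unit length, then compare all pairs.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (B, B) similarity matrix
    labels = np.arange(len(logits))         # matching pairs sit on the diagonal

    def xent(l):
        # Numerically stable cross-entropy against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))

def captioning_loss(token_logits, target_tokens):
    """Autoregressive cross-entropy over the caption's target tokens."""
    l = token_logits - token_logits.max(axis=-1, keepdims=True)
    logp = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(target_tokens)), target_tokens].mean()

def coca_style_loss(img_emb, txt_emb, token_logits, target_tokens,
                    w_con=1.0, w_cap=2.0):
    """Weighted sum of the two objectives (weights are an assumption here)."""
    return (w_con * contrastive_loss(img_emb, txt_emb)
            + w_cap * captioning_loss(token_logits, target_tokens))
```

Conceptually, the contrastive term gives the model CLIP-like aligned embeddings for retrieval and zero-shot classification, while the captioning term gives it SimVLM-like generative ability — which is exactly why one model covers tasks from captioning to question answering.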
Flying over the Flamingo Nest
The new multimodal SotA model from DeepMind solves complex vision-and-language problems from just a couple of task-specific examples, and without any additional training.
In addition to traditional uses such as classification, Flamingo can even hold a short dialogue about the content of an image, and it will refine its answers if you point out its mistakes.
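The "couple of examples, no additional training" trick is few-shot prompting: support examples that interleave images with their answers are placed before the query image, and the frozen model simply continues the sequence. A toy sketch of assembling such an interleaved prompt is below; the `<image:...>` placeholder and the prompt format are illustrative assumptions, not the model's actual input encoding.

```python
def build_few_shot_prompt(examples, answer_prefix="Answer:"):
    """Assemble a Flamingo-style interleaved image/text prompt.

    examples: list of (image_ref, answer_text) support pairs, ending with a
    (image_ref, None) query whose answer the model is expected to continue.
    """
    parts = []
    for image_ref, answer in examples:
        parts.append(f"<image:{image_ref}>")          # visual input placeholder
        if answer is not None:
            parts.append(f"{answer_prefix} {answer}")  # completed support example
        else:
            parts.append(answer_prefix)                # model continues from here
    return " ".join(parts)

prompt = build_few_shot_prompt([
    ("cat.jpg", "a cat"),
    ("dog.jpg", "a dog"),
    ("bird.jpg", None),  # the query: no answer yet
])
```

Because the task is specified entirely by the support examples in the prompt, switching from captioning to, say, counting objects only requires swapping those examples — no gradient updates involved.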
Models like Flamingo can bring very practical benefits to society. Multimodal solutions are needed for AI applications such as helping visually impaired people with everyday tasks or improving algorithms that detect dangerous content on the Internet.
How to tame a transformer
Google Research has open-sourced its codebase for training large-scale computer vision models on Cloud TPU virtual machines. Interested users can now inspect the source code of ViT, LiT, MLP-Mixer, and other large AI projects. This provides a reliable foundation for new CV research, with training able to run on almost any amount of hardware: from a single GPU to 2,048 TPU cores.
This is important open-source news about large models. Thanks to Big Vision, the community gains new opportunities to experiment and search for something fundamentally new, saving the time it would take to build a training pipeline from scratch.
Can a robot write a symphony?
Scientists from the AI Department of the Central Conservatory of China have presented SymphonyNet, a model designed to generate symphonic music. The approach is based on traditional text-generation methods adapted to the specifics of the task. In particular, the developers propose a linear transformer for processing ultra-long sequences and a modified Byte Pair Encoding (BPE) algorithm.
Beyond generating music, such models can be applied to more mundane tasks, such as analyzing and forecasting time series. After all, why shouldn't the stream of readings from hundreds of sensors count as a symphony?