Human perception is multidimensional: a balanced combination of hearing, vision, smell, touch, and taste. Recently, a lot of research has tried to improve machine perception by moving from single-modality learning to multimodal learning.

If you are wondering what a modality is: it is a single, independent channel of sensory input/output between a computer and a human (vision is one modality, audio is another). In this blog, we will talk about using audio and visual information, the two perceptual modalities most important in our daily lives, to make machine perception smarter without any labeled data (self-supervision).
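To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of audio-visual correspondence learning, not the exact architecture discussed later in this blog: one subnetwork embeds a video frame, another embeds an audio spectrogram, and a small head predicts whether the two come from the same clip. The class name `AVCNet`, the layer sizes, and the input shapes are illustrative assumptions.

```python
# A minimal sketch of the audio-visual correspondence (AVC) idea.
# Labels are "free": a frame paired with its own audio is a positive,
# a frame paired with audio from another clip is a negative,
# so no human annotation is needed (self-supervision).
import torch
import torch.nn as nn

class AVCNet(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Vision subnetwork: embeds a 3x224x224 RGB frame.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Audio subnetwork: embeds a 1xFxT log-mel spectrogram.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Fusion head: binary "correspond / don't correspond" logit.
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, frame, spectrogram):
        v = self.vision(frame)
        a = self.audio(spectrogram)
        return self.head(torch.cat([v, a], dim=1))

# Toy usage: aligned pairs get label 1, mismatched pairs get label 0.
model = AVCNet()
frames = torch.randn(4, 3, 224, 224)   # batch of video frames
specs = torch.randn(4, 1, 128, 100)    # batch of audio spectrograms
labels = torch.tensor([[1.], [0.], [1.], [0.]])
loss = nn.BCEWithLogitsLoss()(model(frames, specs), labels)
loss.backward()
```

The design choice here is the classic two-stream setup: each modality gets its own encoder, and correspondence is decided only after fusion, which lets both encoders learn useful representations from the pairing signal alone.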

