Semantic Segmentation in Computer Vision: A Comprehensive Overview

In computer vision systems, semantic segmentation is a difficult problem. It is central to a wide range of applications, including autonomous cars, human-computer interfaces, robotics, medical research, and agriculture, and a variety of techniques have been developed to address it. Many of these techniques are based on the deep learning paradigm, which has been demonstrated to be quite effective.

Dominic Feeney


Semantic Segmentation with TensorFlow Keras - Analytics India Magazine


Semantic segmentation laid down the fundamental path to advanced computer vision tasks such as object detection, shape recognition, autonomous driving, robotics, and virtual reality. Semantic segmentation can be defined as the process of pixel-level image classification into two or more object classes. It differs from image classification, which performs image-level classification. For instance, consider an image that consists mainly of a zebra, surrounded by grass fields, a tree and a flying bird. Image classification tells us that the image belongs to the ‘zebra’ class. It cannot tell where the zebra is or what its size or pose is. Semantic segmentation of that image, however, can tell us that there is a zebra, a grass field, a bird and a tree in the given image (it classifies parts of an image into separate classes), and it tells us which pixels in the image belong to which class.
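To make the difference concrete, here is a minimal NumPy sketch. The class list, the toy score array and the two helper functions are all made up for illustration, and averaging per-pixel scores is only a stand-in for a real image classifier.

```python
import numpy as np

# Hypothetical class names for the zebra example; the indices are arbitrary.
CLASSES = ["grass", "zebra", "tree", "bird"]

def image_label(scores):
    """Image classification: a single label for the whole image.

    `scores` has shape (H, W, C). Averaging the per-pixel scores is a
    simplification used here only to obtain one score per class.
    """
    return int(scores.mean(axis=(0, 1)).argmax())

def pixel_labels(scores):
    """Semantic segmentation: one label per pixel, shape (H, W)."""
    return scores.argmax(axis=-1)

# A toy 2x3 "image" with made-up class scores for each pixel.
scores = np.zeros((2, 3, len(CLASSES)))
scores[0, :, 1] = 1.0    # top row: zebra
scores[1, :2, 0] = 1.0   # bottom-left two pixels: grass
scores[1, 2, 2] = 1.0    # bottom-right pixel: tree

label = image_label(scores)   # one class index for the image
mask = pixel_labels(scores)   # a class index for every pixel
print(CLASSES[label])
print(mask)
```

The classifier returns one index for the whole image, while the segmentation output has the same height and width as the input and stores a class index at every position.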

In this article, we discuss semantic segmentation using TensorFlow Keras. Readers are expected to have a fundamental knowledge of deep learning, image classification and transfer learning. Nevertheless, the following articles might fulfil these prerequisites with a quick and clear understanding:

  1. Getting Started With Deep Learning Using TensorFlow Keras
  2. Getting Started With Computer Vision Using TensorFlow Keras
  3. Exploring Transfer Learning Using TensorFlow Keras

Let’s dive deeper into hands-on learning.

#developers corner #densenet #image classification #keras #object detection #object segmentation #pix2pix #segmentation #semantic segmentation #tensorflow #tensorflow 2.0 #unet

Julie Donnelly


Road Surface Semantic Segmentation

Hello there! This post is about a road surface semantic segmentation approach. The focus here is on road surface patterns: what kind of pavement the vehicle is driving on, whether there is any damage on the road, and also road markings, speed bumps, and other things that can be relevant for a vehicular navigation task.

Here I will show you the step-by-step approach based on the preprint paper available at ResearchGate [1]. The Ground Truth and the experiments were made using the RTK dataset [2], with images captured with a low-cost camera, containing images of roads with different types of pavement and different conditions of pavement quality.

It was fun to work on it and I’m excited to share it. I hope you enjoy it too. 🤗


The purpose of this approach is to verify the effectiveness of using passive vision (a camera) to detect different patterns on the road, for example to identify whether the road surface is asphalt, cobblestone or an unpaved (dirt) road. This may be relevant for an intelligent vehicle, whether it is an autonomous vehicle or an Advanced Driver-Assistance System (ADAS). Depending on the type of pavement, it may be necessary to adapt the way the vehicle is driven, whether for the safety of users, the conservation of the vehicle, or even the comfort of the people inside the vehicle.

Another relevant factor of this approach is the detection of potholes and water puddles, which can cause accidents and damage vehicles, and which can be quite common in developing countries. This approach can also be useful for departments or organizations responsible for maintaining highways and roads.

To achieve these objectives, Convolutional Neural Networks (CNNs) were used for the semantic segmentation of the road surface. I’ll talk more about that in the next sections.

Ground Truth

To train the neural network and to test and validate the results, a Ground Truth (GT) was created with 701 images from the RTK dataset. This GT is available on the dataset page and is composed of the following classes:

GT classes [1]

GT Samples [1]

The approach and setup

Everything here was done using Google Colab, a free Jupyter notebook environment that gives us free access to GPUs, is super easy to use, and is also very helpful for organization and configuration. The amazing deep learning library fastai [3] was also used. To be more precise, the step-by-step that I will present is very much based on one of the lessons given by Jeremy Howard in one of the courses about deep learning, in this case lesson3-camvid.

The CNN architecture used was U-Net [4], an architecture designed to perform semantic segmentation in medical images but successfully applied to many other problems. In addition, a ResNet [5] based encoder and a decoder are used. The experiments for this approach were done with resnet34 and resnet50.

For the data augmentation step, standard options from the fastai library were used, with horizontal rotations and perspective distortion being applied. fastai takes care of applying the same transformations from the data augmentation step to both the original and the mask (GT) images.
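The point about keeping image and mask in sync can be illustrated without fastai. The sketch below is not the fastai API: `random_hflip` is a hypothetical helper, and a plain horizontal flip stands in for the rotation and perspective transforms mentioned above.

```python
import numpy as np

def random_hflip(image, mask, rng, p=0.5):
    """Flip image and mask together with probability p.

    The random decision is drawn once and applied to both arrays, so
    every pixel keeps the label it had before the augmentation.
    """
    if rng.random() < p:
        return image[:, ::-1].copy(), mask[:, ::-1].copy()
    return image, mask

rng = np.random.default_rng(0)
image = np.arange(2 * 3 * 3).reshape(2, 3, 3)   # H=2, W=3, 3 channels
mask = np.array([[0, 1, 2],
                 [0, 0, 1]])                    # one class index per pixel

# p=1.0 forces the flip so the effect is visible in this example.
aug_image, aug_mask = random_hflip(image, mask, rng, p=1.0)
print(aug_mask)
```

If the flip were drawn separately for the image and for the mask, half of the training pairs would end up with labels that no longer match their pixels.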

A relevant point, which was of great importance for the definition of this approach, is that the classes of the GT are quite unbalanced, with far more pixels of background or surface types (e.g. asphalt, paved or unpaved) than of the other classes. Unlike an image classification problem, where replicating certain images from the dataset could perhaps help to balance the classes, here replicating an image would further increase the difference between the number of pixels of the largest and the smallest classes. So, in the defined approach, class weights were used for balancing. 🤔

Based on different experiments, it became clear that just applying the weights is not enough: when the accuracy of the classes with fewer pixels improved, the classes with more pixels (e.g. asphalt, paved and unpaved) lost quality in their accuracy results.

The best accuracy values, considering all classes and without losing much quality in the detection of surface types, came from the following configuration: first train a model without using weights, generating a model with good accuracy for the types of surface; then use that previously trained model as the basis for the next model, which uses the proportional weights for the classes. And that’s it!
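As a rough idea of what class weights could look like, the sketch below derives them from pixel counts in the masks. This is an assumption for illustration only: the post does not spell out the exact weighting formula here, and `inverse_frequency_weights` is a hypothetical helper that uses plain inverse frequency.

```python
import numpy as np

def inverse_frequency_weights(masks, num_classes):
    """Weight each class by the inverse of its pixel frequency.

    `masks` is an iterable of integer arrays holding class indices.
    Classes that never appear get weight 0. The result sums to 1.
    """
    counts = np.zeros(num_classes, dtype=np.float64)
    for m in masks:
        counts += np.bincount(m.ravel(), minlength=num_classes)
    freq = counts / counts.sum()
    # np.maximum guards the division for classes with zero pixels.
    weights = np.where(freq > 0, 1.0 / np.maximum(freq, 1e-12), 0.0)
    return weights / weights.sum()

# Toy GT mask: class 0 (say, a surface type) dominates the image.
masks = [np.array([[0, 0, 0, 0],
                   [0, 0, 1, 2]])]
weights = inverse_frequency_weights(masks, num_classes=3)
print(weights)
```

Weights like these would typically be passed to a weighted cross-entropy loss in the second training stage, after the unweighted model has been trained.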

You can check the complete code, that I will comment on throughout this post, on GitHub:

#semantic-segmentation #computer-vision #towards-data-science #potholes #data science

Olen Predovic


Smoothing Semantic Segmentation Edges

Credit to Eric VanBuhler for contributing the code corresponding to overlay_image and mask dilation.

In a previous post, I showed how to separate a person from a video stream and alter the background, creating a virtual green screen. In that post, the model that performed best was a coarse-grained semantic segmentation model that resulted in large, blocky segmentation edges. A more fine-grained segmentation model was not able to accurately track the person in the video stream, and using Gaussian smoothing on the more coarse-grained model blurred the entire image. In this tutorial, we’ll cover how to smooth out edges generated by coarse-grained semantic segmentation models without blurring the desired target objects.
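One common way to get this effect is to soften the mask instead of the image and then use it as an alpha channel. The sketch below illustrates that general idea in plain NumPy; it is not the tutorial's own code, and `box_blur` and `composite` are hypothetical helpers (a mean filter stands in for whatever smoothing the tutorial applies).

```python
import numpy as np

def box_blur(mask, radius):
    """Mean filter over a (2r+1)x(2r+1) window with edge padding."""
    k = 2 * radius + 1
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def composite(frame, background, mask, radius=1):
    """Blend frame over background using a softened mask as alpha.

    Only the mask is blurred, so pixels well inside the target keep
    their original values and only the boundary is feathered.
    """
    alpha = box_blur(mask.astype(np.float64), radius)[..., None]
    return alpha * frame + (1.0 - alpha) * background

frame = np.full((4, 4, 3), 200.0)       # stand-in for the camera frame
background = np.zeros((4, 4, 3))        # stand-in for the new background
mask = np.zeros((4, 4))
mask[:, :2] = 1                         # left half is the "person"

out = composite(frame, background, mask, radius=1)
print(out[0, :, 0])
```

In the output row, the leftmost pixel is still the untouched frame value and the rightmost is pure background, with a gradual transition in between.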

All of the code from this tutorial is available on GitHub. To run the final code, first sign up for an alwaysAI account (it’s free!) and get it set up on your machine (also free). However, you can use the smoothing code in any Python computer vision application!

This tutorial builds off OpenCV and the virtual green screen blog post. If you’d like to follow along, first clone this repo.

Let’s get started!

#green-screen #semantic-segmentation #artificial-intelligence #computer-vision #background-removal

Obie Rowe


Semantic Segmentation using a Django API


Image segmentation has been a hot topic for a while now. Various use cases involving segmentation have emerged in many different areas: machine vision, medical imaging, object detection, recognition tasks, traffic control systems, video surveillance, and a lot more. The intuition behind these intelligent systems is to capture the diverse components that form the image and thereby teach computer vision models to gain more insight into, and a better understanding of, the scene and the context.

Original Photo by Melody Jacob on Unsplash on the left, segmented version on the right

The two types of image segmentation commonly used are:

  • Semantic Segmentation: identifying the different classes in the image and segmenting it accordingly
  • Instance Segmentation: first determining the different classes in the image, then recognizing the number of instances each class contains. The image is decomposed into multiple labeled regions corresponding to the different class instances the model was trained on.

For this article, I will use the PyTorch implementation of the Google DeepLab V3 segmentation model to customize the background of an image. The intention is to segment the foreground and detach it from the rest, while replacing the remaining background with a completely different picture. The model will be served through a Django REST API.
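Once the model has produced a per-pixel class map, the background swap itself is a small array operation. The sketch below covers only that step and leaves out the model and the Django API. `replace_background` is a hypothetical helper, and the person class index 15 is an assumption based on the Pascal VOC label order.

```python
import numpy as np

PERSON_CLASS = 15  # assumption: Pascal VOC label order, where "person" is 15

def replace_background(image, new_background, class_map,
                       foreground_class=PERSON_CLASS):
    """Keep foreground pixels from `image`, take the rest from `new_background`.

    `image` and `new_background` have shape (H, W, 3); `class_map` has
    shape (H, W) and holds the predicted class index of each pixel.
    """
    keep = (class_map == foreground_class)[..., None]   # (H, W, 1), broadcasts
    return np.where(keep, image, new_background)

image = np.full((2, 2, 3), 10, dtype=np.uint8)
new_background = np.full((2, 2, 3), 99, dtype=np.uint8)
class_map = np.array([[15, 0],
                      [0, 15]])

result = replace_background(image, new_background, class_map)
print(result[..., 0])
```

In practice `class_map` would come from taking the argmax over the class dimension of the model output, and both pictures would need to be resized to the same height and width first.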


  1. A bit of background on DeepLab V3
  2. Use the DeepLab V3-Resnet101 implementation from PyTorch
  3. Set up the Django API
  4. Conclusion

You can check the entire code for this project under my GitHub repo.

1. A bit of background on DeepLab V3

Segmentation models use fully convolutional neural networks (**FCNN**) during a prior image detection stage where masks and boundaries are put in place. The inputs are then processed through a very deep network, where the accumulated convolutions and poolings substantially decrease the resolution and quality of the image, so results are produced with a high loss of information. DeepLab models address this challenge by leveraging atrous convolutions and Atrous Spatial Pyramid Pooling (ASPP) architectures.
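A tiny 1D example can show what "atrous" means: the kernel weights stay the same, but they are applied to samples spaced `rate` positions apart, so the receptive field grows without pooling the signal. This is an illustrative sketch, not DeepLab code, and `atrous_conv1d` is a hypothetical helper.

```python
import numpy as np

def atrous_conv1d(x, kernel, rate):
    """'Valid' 1D atrous (dilated) convolution in cross-correlation form.

    With rate=1 this is an ordinary convolution. With rate=r the kernel
    taps are r samples apart, covering (k - 1) * r + 1 input samples.
    """
    k = len(kernel)
    span = (k - 1) * rate + 1
    n_out = len(x) - span + 1
    return np.array([
        sum(kernel[j] * x[i + j * rate] for j in range(k))
        for i in range(n_out)
    ])

x = np.arange(10, dtype=float)
kernel = np.array([1.0, 0.0, -1.0])

dense = atrous_conv1d(x, kernel, rate=1)    # 3 weights over 3 samples
atrous = atrous_conv1d(x, kernel, rate=2)   # same 3 weights over 5 samples
print(dense)
print(atrous)
```

ASPP applies several such convolutions with different rates in parallel and combines the results, which lets the network look at context at multiple scales.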

#deeplab #computer-vision #deep-learning #artificial-intelligence #semantic-segmentation

Semantic Image Segmentation using Fully Convolutional Networks

Humans have the innate ability to identify the objects that they see in the world around them. The visual cortex in our brain can distinguish between a cat and a dog effortlessly, in almost no time. This is true not only of cats and dogs but of almost all the objects that we see. A computer, however, is not as smart as a human brain and cannot do this on its own. Over the past few decades, deep learning researchers have tried to bridge this gap between the human brain and the computer through a special type of artificial neural network called Convolutional Neural Networks (CNNs).

What is a Convolutional Neural Network?

After a lot of research on mammalian brains, researchers found that specific parts of the brain are activated by specific types of stimuli. For example, some parts of the visual cortex are activated when we see vertical edges, some when we see horizontal edges, and some others when we see specific shapes, colors, faces, etc. ML researchers imagined each of these parts as a layer of a neural network and considered the idea that a large network of such layers could mimic the human brain.

This intuition gave rise to the CNN, a type of neural network whose building blocks are convolutional layers. A convolutional layer is nothing but a set of weight matrices, called kernels or filters, which are used for the convolution operation on a feature matrix such as an image.

Convolution:

2D convolution is a fairly simple operation: you start with a kernel and ‘stride’ (slide) it over the 2D input data, performing an element-wise multiplication with the part of the input it is currently on, and then summing up the results into a single output cell. The kernel repeats this process for every location it slides over, converting a 2D matrix of features into another 2D matrix of features.

The step size by which the kernel slides over the input feature matrix is called the stride. In the animation below, the input matrix has been padded with an extra border of zeros on all four sides to ensure that the output matrix is of the same size as the input matrix. This is called (zero) padding.

2D Convolution: kernel size=3x3, padding=1 or ‘same’, stride=1
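The operation described above fits in a few lines of NumPy. This is a deliberately slow, loop-based sketch for illustration, and `conv2d` is a hypothetical helper rather than any library's API.

```python
import numpy as np

def conv2d(x, kernel, stride=1, padding=0):
    """Naive 2D convolution (cross-correlation form) with zero padding."""
    x = np.pad(x, padding, mode="constant")   # border of zeros on all sides
    kh, kw = kernel.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = x[r:r + kh, c:c + kw]
            # Element-wise multiply, then sum into a single output cell.
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))

same = conv2d(x, kernel, stride=1, padding=1)     # output is 4x4, like the input
strided = conv2d(x, kernel, stride=2, padding=0)  # output shrinks to 1x1
print(same.shape, strided.shape)
```

With a 3x3 kernel, padding=1 and stride=1, the output has the same height and width as the input, which is the ‘same’ setting in the caption above.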

Semantic Image Segmentation

Image segmentation is the task of partitioning a digital image into multiple segments (sets of pixels) based on some characteristics. The objective is to simplify or change the image into a representation that is more meaningful and easier to analyze.

Semantic segmentation refers to assigning a class label to each pixel in the given image. See the example below.

[Figure: semantic segmentation example]

Note that segmentation is different from classification. In classification, the complete image is assigned a class label, whereas in segmentation, each pixel in an image is classified into one of the classes.

1. Business Problem

Having a fair idea about convolutional networks and semantic image segmentation, let’s jump into the problem we need to solve.

Severstal is among the top 50 producers of steel in the world and Russia’s biggest player in efficient steel mining and production. One of the key products of Severstal is steel sheets. The production process of flat sheet steel is delicate. From heating and rolling, to drying and cutting, several machines touch flat steel by the time it’s ready to ship. To ensure quality in the production of steel sheets, today, Severstal uses images from high-frequency cameras to power a defect detection algorithm.

Through this competition, Severstal expects the AI community to improve the algorithm by localizing and classifying surface defects on a steel sheet.

Business objectives and constraints

  1. A defective sheet must be predicted as defective, since there would be serious concerns about quality if we misclassify a defective sheet as non-defective, i.e. a high recall value for each of the classes is needed.
  2. We need not give the results for a given image in the blink of an eye. (No strict latency concerns)

2. Machine Learning Problem

2.1. Mapping the business problem to an ML problem

Our task is to

  1. Detect/localize the defects in a steel sheet using image segmentation and
  2. Classify the detected defects into one or more classes from [1, 2, 3, 4]

To put it together, it is a semantic image segmentation problem.

2.2. Performance metric

The evaluation metric used is the mean Dice coefficient. The Dice coefficient can be used to compare the pixel-wise agreement between a predicted segmentation and its corresponding ground truth. The formula is given by:

$$\mathrm{Dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

where X is the predicted set of pixels and Y is the ground truth.
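For binary masks the Dice coefficient is straightforward to compute. The sketch below assumes X and Y are given as 0/1 arrays of the same shape. `dice_coefficient` is a hypothetical helper, and returning 1 when both masks are empty is a convention assumed here, needed because the formula is otherwise undefined in that case.

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice = 2 * |X ∩ Y| / (|X| + |Y|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0   # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred, truth).sum() / total)

pred = np.array([[1, 1],
                 [0, 0]])
truth = np.array([[1, 0],
                  [0, 0]])

score = dice_coefficient(pred, truth)   # 2 * 1 / (2 + 1)
print(score)
```

The mean Dice coefficient is then the average of this value over all image and class pairs being evaluated.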

Read more about Dice Coefficient here.

2.3. Data Overview

We have been given a zip folder of size 2GB which contains the following:

  • **train_images**: a folder containing 12,568 training images (.jpg files)
  • **test_images**: a folder containing 5,506 test images (.jpg files). We need to detect and localize defects in these images
  • **train.csv**: training annotations which provide segments for defects belonging to ClassId = [1, 2, 3, 4]
  • **sample_submission.csv**: a sample submission file in the correct format, with each ImageId repeated 4 times, one for each of the 4 defect classes.

More details about the data are discussed in the next section.

3. Exploratory Data Analysis

The first step in solving any machine learning problem should be a thorough study of the raw data. This gives a fair idea about what our approaches to solving the problem should be. Very often, it also helps us find some latent aspects of the data which might be useful to our models.

Let’s analyze the data and try to draw some meaningful conclusions.

3.1. Loading train.csv file

**_train.csv_** tells which type of defect is present at what pixel location in an image. It contains the following columns:

  • **ImageId**: image file name with .jpg extension
  • **ClassId**: type/class of the defect, one of [1, 2, 3, 4]
  • **EncodedPixels**: represents the range of defective pixels in an image in the form of run-length encoded pixels (the pixel number where the defect starts, followed by the pixel length of the defect).
  • e.g. ‘29102 12’ implies the defect is starting at pixel 29102 and running a total of 12 pixels, i.e. pixels 29102, 29103, …, 29113 are defective. The pixels are numbered from top to bottom, then left to right: 1 corresponds to pixel (1,1), 2 corresponds to (2,1), and so on.
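The encoding described above can be turned back into a binary mask with a short function. This is a minimal sketch: `rle_decode` is a hypothetical helper, and the 256 x 1600 image size used in the second example is an assumption not stated in this section.

```python
import numpy as np

def rle_decode(encoded, height, width):
    """Decode 'start length start length ...' into a (height, width) mask.

    Pixels are 1-indexed and numbered top to bottom, then left to right,
    which corresponds to column-major (Fortran) order.
    """
    flat = np.zeros(height * width, dtype=np.uint8)
    values = [int(v) for v in encoded.split()]
    starts, lengths = values[0::2], values[1::2]
    for start, length in zip(starts, lengths):
        flat[start - 1:start - 1 + length] = 1   # shift to 0-based indexing
    return flat.reshape((height, width), order="F")

# Small example: pixels 2, 3 and 4 of a 3x4 image are defective.
small = rle_decode("2 3", height=3, width=4)
print(small)

# The example from the text, on an assumed 256 x 1600 image.
mask = rle_decode("29102 12", height=256, width=1600)
print(int(mask.sum()))
```

In the small example, pixels 2 and 3 fall in the first column (rows 2 and 3) and pixel 4 wraps to the top of the second column, which matches the top-to-bottom, then left-to-right numbering.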

#convolutional-network #semantic-segmentation #computer-vision #dilated-convolution #deep-learning #deep learning