Billy Chandler

Billy Chandler

1574421845

Introduction to Face Processing with Computer Vision

Ever wonder how Facebook’s facial recognition or Snapchat’s filters work? Faces are a fundamental piece of photography, and building applications around them has never been easier with open-source libraries and pre-trained models. In this talk, we’ll help you understand some of the computer vision and machine learning techniques behind these applications. Then, we’ll use this knowledge to develop our own prototypes to tackle tasks such as face detection (e.g. digital cameras), recognition (e.g. Facebook Photos), classification (e.g. identifying emotions), manipulation (e.g. Snapchat filters), and more.

#MachineLearning #ComputerVision #OpenCV #Python #DataScience

What is GEEK

Buddha Community

Introduction to Face Processing with Computer Vision

A Lightweight Face Recognition and Facial Attribute Analysis

deepface

Deepface is a lightweight face recognition and facial attribute analysis (age, gender, emotion and race) framework for python. It is a hybrid face recognition framework wrapping state-of-the-art models: VGG-Face, Google FaceNet, OpenFace, Facebook DeepFace, DeepID, ArcFace and Dlib.

Experiments show that human beings have 97.53% accuracy on facial recognition tasks whereas those models already reached and passed that accuracy level.

Installation

The easiest way to install deepface is to download it from PyPI. It's going to install the library itself and its prerequisites as well. The library is mainly based on TensorFlow and Keras.

pip install deepface

Then you will be able to import the library and use its functionalities.

from deepface import DeepFace

Facial Recognition - Demo

A modern face recognition pipeline consists of 5 common stages: detect, align, normalize, represent and verify. While Deepface handles all these common stages in the background, you don’t need to acquire in-depth knowledge about all the processes behind it. You can just call its verification, find or analysis function with a single line of code.

Face Verification - Demo

This function verifies face pairs as same person or different persons. It expects exact image paths as inputs. Passing numpy or based64 encoded images is also welcome. Then, it is going to return a dictionary and you should check just its verified key.

result = DeepFace.verify(img1_path = "img1.jpg", img2_path = "img2.jpg")

Face recognition - Demo

Face recognition requires applying face verification many times. Herein, deepface has an out-of-the-box find function to handle this action. It's going to look for the identity of input image in the database path and it will return pandas data frame as output.

df = DeepFace.find(img_path = "img1.jpg", db_path = "C:/workspace/my_db")

Face recognition models - Demo

Deepface is a hybrid face recognition package. It currently wraps many state-of-the-art face recognition models: VGG-Face , Google FaceNet, OpenFace, Facebook DeepFace, DeepID, ArcFace and Dlib. The default configuration uses VGG-Face model.

models = ["VGG-Face", "Facenet", "Facenet512", "OpenFace", "DeepFace", "DeepID", "ArcFace", "Dlib"]

#face verification
result = DeepFace.verify(img1_path = "img1.jpg", img2_path = "img2.jpg", model_name = models[1])

#face recognition
df = DeepFace.find(img_path = "img1.jpg", db_path = "C:/workspace/my_db", model_name = models[1])

FaceNet, VGG-Face, ArcFace and Dlib are overperforming ones based on experiments. You can find out the scores of those models below on both Labeled Faces in the Wild and YouTube Faces in the Wild data sets declared by its creators.

ModelLFW ScoreYTF Score
Facenet51299.65%-
ArcFace99.41%-
Dlib99.38 %-
Facenet99.20%-
VGG-Face98.78%97.40%
Human-beings97.53%-
OpenFace93.80%-
DeepID-97.05%

Similarity

Face recognition models are regular convolutional neural networks and they are responsible to represent faces as vectors. We expect that a face pair of same person should be more similar than a face pair of different persons.

Similarity could be calculated by different metrics such as Cosine Similarity, Euclidean Distance and L2 form. The default configuration uses cosine similarity.

metrics = ["cosine", "euclidean", "euclidean_l2"]

#face verification
result = DeepFace.verify(img1_path = "img1.jpg", img2_path = "img2.jpg", distance_metric = metrics[1])

#face recognition
df = DeepFace.find(img_path = "img1.jpg", db_path = "C:/workspace/my_db", distance_metric = metrics[1])

Euclidean L2 form seems to be more stable than cosine and regular Euclidean distance based on experiments.

Facial Attribute Analysis - Demo

Deepface also comes with a strong facial attribute analysis module including age, gender, facial expression (including angry, fear, neutral, sad, disgust, happy and surprise) and race (including asian, white, middle eastern, indian, latino and black) predictions.

obj = DeepFace.analyze(img_path = "img4.jpg", actions = ['age', 'gender', 'race', 'emotion'])

Age model got ± 4.65 MAE; gender model got 97.44% accuracy, 96.29% precision and 95.05% recall as mentioned in its tutorial.

Streaming and Real Time Analysis - Demo

You can run deepface for real time videos as well. Stream function will access your webcam and apply both face recognition and facial attribute analysis. The function starts to analyze a frame if it can focus a face sequantially 5 frames. Then, it shows results 5 seconds.

DeepFace.stream(db_path = "C:/User/Sefik/Desktop/database")

Even though face recognition is based on one-shot learning, you can use multiple face pictures of a person as well. You should rearrange your directory structure as illustrated below.

user
├── database
│   ├── Alice
│   │   ├── Alice1.jpg
│   │   ├── Alice2.jpg
│   ├── Bob
│   │   ├── Bob.jpg

Face Detectors - Demo

Face detection and alignment are important early stages of a modern face recognition pipeline. Experiments show that just alignment increases the face recognition accuracy almost 1%. OpenCV, SSD, Dlib, MTCNN and RetinaFace detectors are wrapped in deepface.

All deepface functions accept an optional detector backend input argument. You can switch among those detectors with this argument. OpenCV is the default detector.

backends = ['opencv', 'ssd', 'dlib', 'mtcnn', 'retinaface']

#face verification
obj = DeepFace.verify(img1_path = "img1.jpg", img2_path = "img2.jpg", detector_backend = backends[4])

#face recognition
df = DeepFace.find(img_path = "img.jpg", db_path = "my_db", detector_backend = backends[4])

#facial analysis
demography = DeepFace.analyze(img_path = "img4.jpg", detector_backend = backends[4])

#face detection and alignment
face = DeepFace.detectFace(img_path = "img.jpg", target_size = (224, 224), detector_backend = backends[4])

Face recognition models are actually CNN models and they expect standard sized inputs. So, resizing is required before representation. To avoid deformation, deepface adds black padding pixels according to the target size argument after detection and alignment.

RetinaFace and MTCNN seem to overperform in detection and alignment stages but they are much slower. If the speed of your pipeline is more important, then you should use opencv or ssd. On the other hand, if you consider the accuracy, then you should use retinaface or mtcnn.

The performance of RetinaFace is very satisfactory even in the crowd as seen in the following illustration. Besides, it comes with an incredible facial landmark detection performance. Highlighted red points show some facial landmarks such as eyes, nose and mouth. That's why, alignment score of RetinaFace is high as well.

You can find out more about RetinaFace on this repo.

API - Demo

Deepface serves an API as well. You can clone /api/api.py and pass it to python command as an argument. This will get a rest service up. In this way, you can call deepface from an external system such as mobile app or web.

python api.py

Face recognition, facial attribute analysis and vector representation functions are covered in the API. You are expected to call these functions as http post methods. Service endpoints will be http://127.0.0.1:5000/verify for face recognition, http://127.0.0.1:5000/analyze for facial attribute analysis, and http://127.0.0.1:5000/represent for vector representation. You should pass input images as base64 encoded string in this case. Here, you can find a postman project.

Tech Stack - Vlog, Tutorial

Face recognition models represent facial images as vector embeddings. The idea behind facial recognition is that vectors should be more similar for same person than different persons. The question is that where and how to store facial embeddings in a large scale system. Herein, deepface offers a represention function to find vector embeddings from facial images.

embedding = DeepFace.represent(img_path = "img.jpg", model_name = 'Facenet')

Tech stack is vast to store vector embeddings. To determine the right tool, you should consider your task such as face verification or face recognition, priority such as speed or confidence, and also data size.

Contribution

Pull requests are welcome. You should run the unit tests locally by running test/unit_tests.py. Please share the unit test result logs in the PR. Deepface is currently compatible with TF 1 and 2 versions. Change requests should satisfy those requirements both.

Support

There are many ways to support a project - starring⭐️ the GitHub repo is just one 🙏

You can also support this work on Patreon

 

Citation

Please cite deepface in your publications if it helps your research. Here are its BibTeX entries:

@inproceedings{serengil2020lightface,
  title        = {LightFace: A Hybrid Deep Face Recognition Framework},
  author       = {Serengil, Sefik Ilkin and Ozpinar, Alper},
  booktitle    = {2020 Innovations in Intelligent Systems and Applications Conference (ASYU)},
  pages        = {23-27},
  year         = {2020},
  doi          = {10.1109/ASYU50717.2020.9259802},
  url          = {https://doi.org/10.1109/ASYU50717.2020.9259802},
  organization = {IEEE}
}
@inproceedings{serengil2021lightface,
  title        = {HyperExtended LightFace: A Facial Attribute Analysis Framework},
  author       = {Serengil, Sefik Ilkin and Ozpinar, Alper},
  booktitle    = {2021 International Conference on Engineering and Emerging Technologies (ICEET)},
  pages        = {1-4},
  year         = {2021},
  doi          = {10.1109/ICEET53442.2021.9659697},
  url.         = {https://doi.org/10.1109/ICEET53442.2021.9659697},
  organization = {IEEE}
}

Also, if you use deepface in your GitHub projects, please add deepface in the requirements.txt.

Author: Serengil
Source Code: https://github.com/serengil/deepface 
License: MIT License

#python #machine-learning 

Dominic  Feeney

Dominic Feeney

1648217849

Deepface: A Face Recognition and Facial Attribute Analysis for Python

deepface

Deepface is a lightweight face recognition and facial attribute analysis (age, gender, emotion and race) framework for python. It is a hybrid face recognition framework wrapping state-of-the-art models: VGG-Face, Google FaceNet, OpenFace, Facebook DeepFace, DeepID, ArcFace and Dlib.

Experiments show that human beings have 97.53% accuracy on facial recognition tasks whereas those models already reached and passed that accuracy level.

Installation

The easiest way to install deepface is to download it from PyPI. It's going to install the library itself and its prerequisites as well. The library is mainly powered by TensorFlow and Keras.

pip install deepface

Then you will be able to import the library and use its functionalities.

from deepface import DeepFace

Facial Recognition - Demo

A modern face recognition pipeline consists of 5 common stages: detect, align, normalize, represent and verify. While Deepface handles all these common stages in the background, you don’t need to acquire in-depth knowledge about all the processes behind it. You can just call its verification, find or analysis function with a single line of code.

Face Verification - Demo

This function verifies face pairs as same person or different persons. It expects exact image paths as inputs. Passing numpy or based64 encoded images is also welcome. Then, it is going to return a dictionary and you should check just its verified key.

result = DeepFace.verify(img1_path = "img1.jpg", img2_path = "img2.jpg")

Face recognition - Demo

Face recognition requires applying face verification many times. Herein, deepface has an out-of-the-box find function to handle this action. It's going to look for the identity of input image in the database path and it will return pandas data frame as output.

df = DeepFace.find(img_path = "img1.jpg", db_path = "C:/workspace/my_db")

Face recognition models - Demo

Deepface is a hybrid face recognition package. It currently wraps many state-of-the-art face recognition models: VGG-Face , Google FaceNet, OpenFace, Facebook DeepFace, DeepID, ArcFace and Dlib. The default configuration uses VGG-Face model.

models = ["VGG-Face", "Facenet", "Facenet512", "OpenFace", "DeepFace", "DeepID", "ArcFace", "Dlib"]

#face verification
result = DeepFace.verify(img1_path = "img1.jpg", img2_path = "img2.jpg", model_name = models[1])

#face recognition
df = DeepFace.find(img_path = "img1.jpg", db_path = "C:/workspace/my_db", model_name = models[1])

FaceNet, VGG-Face, ArcFace and Dlib are overperforming ones based on experiments. You can find out the scores of those models below on both Labeled Faces in the Wild and YouTube Faces in the Wild data sets declared by its creators.

ModelLFW ScoreYTF Score
Facenet51299.65%-
ArcFace99.41%-
Dlib99.38 %-
Facenet99.20%-
VGG-Face98.78%97.40%
Human-beings97.53%-
OpenFace93.80%-
DeepID-97.05%

Similarity

Face recognition models are regular convolutional neural networks and they are responsible to represent faces as vectors. We expect that a face pair of same person should be more similar than a face pair of different persons.

Similarity could be calculated by different metrics such as Cosine Similarity, Euclidean Distance and L2 form. The default configuration uses cosine similarity.

metrics = ["cosine", "euclidean", "euclidean_l2"]

#face verification
result = DeepFace.verify(img1_path = "img1.jpg", img2_path = "img2.jpg", distance_metric = metrics[1])

#face recognition
df = DeepFace.find(img_path = "img1.jpg", db_path = "C:/workspace/my_db", distance_metric = metrics[1])

Euclidean L2 form seems to be more stable than cosine and regular Euclidean distance based on experiments.

Facial Attribute Analysis - Demo

Deepface also comes with a strong facial attribute analysis module including age, gender, facial expression (including angry, fear, neutral, sad, disgust, happy and surprise) and race (including asian, white, middle eastern, indian, latino and black) predictions.

obj = DeepFace.analyze(img_path = "img4.jpg", actions = ['age', 'gender', 'race', 'emotion'])

Age model got ± 4.65 MAE; gender model got 97.44% accuracy, 96.29% precision and 95.05% recall as mentioned in its tutorial.

Streaming and Real Time Analysis - Demo

You can run deepface for real time videos as well. Stream function will access your webcam and apply both face recognition and facial attribute analysis. The function starts to analyze a frame if it can focus a face sequantially 5 frames. Then, it shows results 5 seconds.

DeepFace.stream(db_path = "C:/User/Sefik/Desktop/database")

Even though face recognition is based on one-shot learning, you can use multiple face pictures of a person as well. You should rearrange your directory structure as illustrated below.

user
├── database
│   ├── Alice
│   │   ├── Alice1.jpg
│   │   ├── Alice2.jpg
│   ├── Bob
│   │   ├── Bob.jpg

Face Detectors - Demo

Face detection and alignment are important early stages of a modern face recognition pipeline. Experiments show that just alignment increases the face recognition accuracy almost 1%. OpenCV, SSD, Dlib, MTCNN, RetinaFace and MediaPipe detectors are wrapped in deepface.

All deepface functions accept an optional detector backend input argument. You can switch among those detectors with this argument. OpenCV is the default detector.

backends = ['opencv', 'ssd', 'dlib', 'mtcnn', 'retinaface', 'mediapipe']

#face verification
obj = DeepFace.verify(img1_path = "img1.jpg", img2_path = "img2.jpg", detector_backend = backends[4])

#face recognition
df = DeepFace.find(img_path = "img.jpg", db_path = "my_db", detector_backend = backends[4])

#facial analysis
demography = DeepFace.analyze(img_path = "img4.jpg", detector_backend = backends[4])

#face detection and alignment
face = DeepFace.detectFace(img_path = "img.jpg", target_size = (224, 224), detector_backend = backends[4])

Face recognition models are actually CNN models and they expect standard sized inputs. So, resizing is required before representation. To avoid deformation, deepface adds black padding pixels according to the target size argument after detection and alignment.

RetinaFace and MTCNN seem to overperform in detection and alignment stages but they are much slower. If the speed of your pipeline is more important, then you should use opencv or ssd. On the other hand, if you consider the accuracy, then you should use retinaface or mtcnn.

The performance of RetinaFace is very satisfactory even in the crowd as seen in the following illustration. Besides, it comes with an incredible facial landmark detection performance. Highlighted red points show some facial landmarks such as eyes, nose and mouth. That's why, alignment score of RetinaFace is high as well.

You can find out more about RetinaFace on this repo.

API - Demo

Deepface serves an API as well. You can clone /api/api.py and pass it to python command as an argument. This will get a rest service up. In this way, you can call deepface from an external system such as mobile app or web.

python api.py

Face recognition, facial attribute analysis and vector representation functions are covered in the API. You are expected to call these functions as http post methods. Service endpoints will be http://127.0.0.1:5000/verify for face recognition, http://127.0.0.1:5000/analyze for facial attribute analysis, and http://127.0.0.1:5000/represent for vector representation. You should pass input images as base64 encoded string in this case. Here, you can find a postman project.

Tech Stack - Vlog, Tutorial

Face recognition models represent facial images as vector embeddings. The idea behind facial recognition is that vectors should be more similar for same person than different persons. The question is that where and how to store facial embeddings in a large scale system. Herein, deepface offers a represention function to find vector embeddings from facial images.

embedding = DeepFace.represent(img_path = "img.jpg", model_name = 'Facenet')

Tech stack is vast to store vector embeddings. To determine the right tool, you should consider your task such as face verification or face recognition, priority such as speed or confidence, and also data size.

Contribution

Pull requests are welcome. You should run the unit tests locally by running test/unit_tests.py. Please share the unit test result logs in the PR. Deepface is currently compatible with TF 1 and 2 versions. Change requests should satisfy those requirements both.

Support

There are many ways to support a project - starring⭐️ the GitHub repo is just one 🙏

You can also support this work on Patreon

 

Citation

Please cite deepface in your publications if it helps your research. Here are BibTeX entries:

@inproceedings{serengil2020lightface,
  title        = {LightFace: A Hybrid Deep Face Recognition Framework},
  author       = {Serengil, Sefik Ilkin and Ozpinar, Alper},
  booktitle    = {2020 Innovations in Intelligent Systems and Applications Conference (ASYU)},
  pages        = {23-27},
  year         = {2020},
  doi          = {10.1109/ASYU50717.2020.9259802},
  url          = {https://doi.org/10.1109/ASYU50717.2020.9259802},
  organization = {IEEE}
}
@inproceedings{serengil2021lightface,
  title        = {HyperExtended LightFace: A Facial Attribute Analysis Framework},
  author       = {Serengil, Sefik Ilkin and Ozpinar, Alper},
  booktitle    = {2021 International Conference on Engineering and Emerging Technologies (ICEET)},
  pages        = {1-4},
  year         = {2021},
  doi          = {10.1109/ICEET53442.2021.9659697},
  url          = {https://doi.org/10.1109/ICEET53442.2021.9659697},
  organization = {IEEE}
}

Also, if you use deepface in your GitHub projects, please add deepface in the requirements.txt.

Download Details:
Author: serengil
Source Code: https://github.com/serengil/deepface
License: MIT License

#tensorflow  #python #machinelearning 

Top 6 Alternatives To Hugging Face

  • With Hugging Face raising $40 million funding, NLPs has the potential to provide us with a smarter world ahead.

In recent news, US-based NLP startup, Hugging Face  has raised a whopping $40 million in funding. The company is building a large open-source community to help the NLP ecosystem grow. Its transformers library is a python-based library that exposes an API for using a variety of well-known transformer architectures such as BERT, RoBERTa, GPT-2, and DistilBERT. Here is a list of the top alternatives to Hugging Face .

Watson Assistant

LUIS:

Lex

Dialogflow

#opinions #alternatives to hugging face #chatbot #hugging face #hugging face ai #hugging face chatbot #hugging face gpt-2 #hugging face nlp #hugging face transformer #ibm watson #nlp ai #nlp models #transformers

Dominic  Feeney

Dominic Feeney

1648340700

Tsf GPU: Basic Multi-GPU Computation in TensorFlow

Setup

Refer to the setup instructions

Noted: Updated for multi-GPU support. Soon a 8-way GPU cluster to be tested on my own DeepLearning box.

This tutorial requires your machine to have 2 GPUs

  • "/cpu:0": The CPU of your machine.
  • "/gpu:0": The first GPU of your machine
  • "/gpu:1": The second GPU of your machine
  • For this example, we are using 2 GTX-980
import numpy as np
import tensorflow as tf
import datetime
#Processing Units logs
log_device_placement = True

#num of multiplications to perform
n = 10
# Example: compute A^n + B^n on 2 GPUs

# Create random large matrix
A = np.random.rand(1e4, 1e4).astype('float32')
B = np.random.rand(1e4, 1e4).astype('float32')

# Creates a graph to store results
c1 = []
c2 = []

# Define matrix power
def matpow(M, n):
    if n < 1: #Abstract cases where n < 1
        return M
    else:
        return tf.matmul(M, matpow(M, n-1))
# Single GPU computing

with tf.device('/gpu:0'):
    a = tf.constant(A)
    b = tf.constant(B)
    #compute A^n and B^n and store results in c1
    c1.append(matpow(a, n))
    c1.append(matpow(b, n))

with tf.device('/cpu:0'):
  sum = tf.add_n(c1) #Addition of all elements in c1, i.e. A^n + B^n

t1_1 = datetime.datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=log_device_placement)) as sess:
    # Runs the op.
    sess.run(sum)
t2_1 = datetime.datetime.now()
# Multi GPU computing
# GPU:0 computes A^n
with tf.device('/gpu:0'):
    #compute A^n and store result in c2
    a = tf.constant(A)
    c2.append(matpow(a, n))

#GPU:1 computes B^n
with tf.device('/gpu:1'):
    #compute B^n and store result in c2
    b = tf.constant(B)
    c2.append(matpow(b, n))

with tf.device('/cpu:0'):
  sum = tf.add_n(c2) #Addition of all elements in c2, i.e. A^n + B^n

t1_2 = datetime.datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=log_device_placement)) as sess:
    # Runs the op.
    sess.run(sum)
t2_2 = datetime.datetime.now()
print "Single GPU computation time: " + str(t2_1-t1_1)
print "Multi GPU computation time: " + str(t2_2-t1_2)
Single GPU computation time: 0:00:11.833497
Multi GPU computation time: 0:00:07.085913

Link; https://nbviewer.org/github/TarrySingh/Machine-Learning-Tutorials/blob/master/deep-learning/tensor-flow-examples/notebooks/4_multi_gpu/multigpu_basics.ipynb

#tensorflow  #python 

How to Predict Housing Prices with Linear Regression?

How-to-Predict-Housing-Prices-with-Linear-Regression

The final objective is to estimate the cost of a certain house in a Boston suburb. In 1970, the Boston Standard Metropolitan Statistical Area provided the information. To examine and modify the data, we will use several techniques such as data pre-processing and feature engineering. After that, we'll apply a statistical model like regression model to anticipate and monitor the real estate market.

Project Outline:

  • EDA
  • Feature Engineering
  • Pick and Train a Model
  • Interpret
  • Conclusion

EDA

Before using a statistical model, the EDA is a good step to go through in order to:

  • Recognize the data set
  • Check to see if any information is missing.
  • Find some outliers.
  • To get more out of the data, add, alter, or eliminate some features.

Importing the Libraries

  • Recognize the data set
  • Check to see if any information is missing.
  • Find some outliers.
  • To get more out of the data, add, alter, or eliminate some features.

# Import the libraries #Dataframe/Numerical libraries import pandas as pd import numpy as np #Data visualization import plotly.express as px import matplotlib import matplotlib.pyplot as plt import seaborn as sns #Machine learning model from sklearn.linear_model import LinearRegression

Reading the Dataset with Pandas

#Reading the data path='./housing.csv' housing_df=pd.read_csv(path,header=None,delim_whitespace=True)

 CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
00.0063218.02.3100.5386.57565.24.09001296.015.3396.904.9824.0
10.027310.07.0700.4696.42178.94.96712242.017.8396.909.1421.6
20.027290.07.0700.4697.18561.14.96712242.017.8392.834.0334.7
30.032370.02.1800.4586.99845.86.06223222.018.7394.632.9433.4
40.069050.02.1800.4587.14754.26.06223222.018.7396.905.3336.2
.............................................
5010.062630.011.9300.5736.59369.12.47861273.021.0391.999.6722.4
5020.045270.011.9300.5736.12076.72.28751273.021.0396.909.0820.6
5030.060760.011.9300.5736.97691.02.16751273.021.0396.905.6423.9
5040.109590.011.9300.5736.79489.32.38891273.021.0393.456.4822.0
5050.047410.011.9300.5736.03080.82.50501273.021.0396.907.8811.9

Have a Look at the Columns

Crime: It refers to a town's per capita crime rate.

ZN: It is the percentage of residential land allocated for 25,000 square feet.

Indus: The amount of non-retail business lands per town is referred to as the indus.

CHAS: CHAS denotes whether or not the land is surrounded by a river.

NOX: The NOX stands for nitric oxide content (part per 10m)

RM: The average number of rooms per home is referred to as RM.

AGE: The percentage of owner-occupied housing built before 1940 is referred to as AGE.

DIS: Weighted distance to five Boston employment centers are referred to as dis.

RAD: Accessibility to radial highways index

TAX: The TAX columns denote the rate of full-value property taxes per $10,000 dollars.

B: B=1000(Bk — 0.63)2 is the outcome of the equation, where Bk is the proportion of blacks in each town.

PTRATIO: It refers to the student-to-teacher ratio in each community.

LSTAT: It refers to the population's lower socioeconomic status.

MEDV: It refers to the 1000-dollar median value of owner-occupied residences.

Data Preprocessing

# Check if there is any missing values. housing_df.isna().sum() CRIM       0 ZN         0 INDUS      0 CHAS       0 NOX        0 RM         0 AGE        0 DIS        0 RAD        0 TAX        0 PTRATIO    0 B          0 LSTAT      0 MEDV       0 dtype: int64

No missing values are found

We examine our data's mean, standard deviation, and percentiles.

housing_df.describe()

Graph Data

 CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
count506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000
mean3.61352411.36363611.1367790.0691700.5546956.28463468.5749013.7950439.549407408.23715418.455534356.67403212.65306322.532806
std8.60154523.3224536.8603530.2539940.1158780.70261728.1488612.1057108.707259168.5371162.16494691.2948647.1410629.197104
min0.0063200.0000000.4600000.0000000.3850003.5610002.9000001.1296001.000000187.00000012.6000000.3200001.7300005.000000
25%0.0820450.0000005.1900000.0000000.4490005.88550045.0250002.1001754.000000279.00000017.400000375.3775006.95000017.025000
50%0.2565100.0000009.6900000.0000000.5380006.20850077.5000003.2074505.000000330.00000019.050000391.44000011.36000021.200000
75%3.67708312.50000018.1000000.0000000.6240006.62350094.0750005.18842524.000000666.00000020.200000396.22500016.95500025.000000
max88.976200100.00000027.7400001.0000000.8710008.780000100.00000012.12650024.000000711.00000022.000000396.90000037.97000050.000000

The crime, area, sector, nitric oxides, 'B' appear to have multiple outliers at first look because the minimum and maximum values are so far apart. In the Age columns, the mean and the Q2(50 percentile) do not match.

We might double-check it by examining the distribution of each column.

Inferences

  1. The rate of crime is rather low. The majority of values are in the range of 0 to 25. With a huge value and a value of zero.
  2. The majority of residential land is zoned for less than 25,000 square feet. Land zones larger than 25,000 square feet represent a small portion of the dataset.
  3. The percentage of non-retial commercial acres is mostly split between two ranges: 0-13 and 13-23.
  4. The majority of the properties are bordered by the river, although a tiny portion of the data is not.
  5. The content of nitrite dioxide has been trending lower from.3 to.7, with a little bump towards.8. It is permissible to leave a value in the range of 0.1–1.
  6. The number of rooms tends to cluster around the average.
  7. With time, the proportion of owner-occupied units rises.
  8. As the number of weights grows, the weight distance between 5 employment centers reduces. It could indicate that individuals choose to live in new high-employment areas.
  9. People choose to live in places with limited access to roadways (0-10). We have a 30th percentile outlier.
  10. The majority of dwelling taxes are in the range of $200-450, with large outliers around $700,000.
  11. The percentage of people with lower status tends to cluster around the median. The majority of persons are of lower social standing.

Because the model is overly generic, removing all outliers will underfit it. Keeping all outliers causes the model to overfit and become excessively accurate. The data's noise will be learned.

The approach is to establish a happy medium that prevents the model from becoming overly precise. When faced with a new set of data, however, they generalise well.

We'll keep numbers below 600 because there's a huge anomaly in the TAX column around 600.

new_df=housing_df[housing_df['TAX']<600]

Looking at the Distribution

Looking-at-the-Distribution

The overall distribution, particularly the TAX, PTRATIO, and RAD, has improved slightly.

Correlation

Correlation

Perfect correlation is denoted by the clear values. The medium correlation between the columns is represented by the reds, while the negative correlation is represented by the black.

With a value of 0.89, we can see that 'MEDV', which is the medium price we wish to anticipate, is substantially connected with the number of rooms 'RM'. The proportion of black people in area 'B' with a value of 0.19 is followed by the residential land 'ZN' with a value of 0.32 and the percentage of black people in area 'ZN' with a value of 0.32.

The metrics that are most connected with price will be plotted.

The-metrics-that-are-most-connected

Feature Engineering

Feature Scaling

Gradient descent is aided by feature scaling, which ensures that all features are on the same scale. It makes locating the local optimum much easier.

Mean standardization is one strategy to employ. It substitutes (target-mean) for the target to ensure that the feature has a mean of nearly zero.

def standard(X):    '''Standard makes the feature 'X' have a zero mean'''    mu=np.mean(X) #mean    std=np.std(X) #standard deviation    sta=(X-mu)/std # mean normalization    return mu,std,sta     mu,std,sta=standard(X) X=sta X

 CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTAT
0-0.6091290.092792-1.019125-0.2809760.2586700.2791350.162095-0.167660-2.105767-0.235130-1.1368630.401318-0.933659
1-0.575698-0.598153-0.225291-0.280976-0.4237950.0492520.6482660.250975-1.496334-1.032339-0.0041750.401318-0.219350
2-0.575730-0.598153-0.225291-0.280976-0.4237951.1897080.0165990.250975-1.496334-1.032339-0.0041750.298315-1.096782
3-0.567639-0.598153-1.040806-0.280976-0.5325940.910565-0.5263500.773661-0.886900-1.3276010.4035930.343869-1.283945
4-0.509220-0.598153-1.040806-0.280976-0.5325941.132984-0.2282610.773661-0.886900-1.3276010.4035930.401318-0.873561
..........................................
501-0.519445-0.5981530.585220-0.2809760.6048480.3060040.300494-0.936773-2.105767-0.5746821.4456660.277056-0.128344
502-0.547094-0.5981530.585220-0.2809760.604848-0.4000630.570195-1.027984-2.105767-0.5746821.4456660.401318-0.229652
503-0.522423-0.5981530.585220-0.2809760.6048480.8777251.077657-1.085260-2.105767-0.5746821.4456660.401318-0.820331
504-0.444652-0.5981530.585220-0.2809760.6048480.6060461.017329-0.979587-2.105767-0.5746821.4456660.314006-0.676095
505-0.543685-0.5981530.585220-0.2809760.604848-0.5344100.715691-0.924173-2.105767-0.5746821.4456660.401318-0.435703

Choose and Train the Model

For the sake of the project, we'll apply linear regression.

Typically, we run numerous models and select the best one based on a particular criterion.

Linear regression is a sort of supervised learning model in which the response is continuous, as it relates to machine learning.

Form of Linear Regression

y= θX+θ1 or y= θ1+X1θ2 +X2θ3 + X3θ4

y is the target you will be predicting

0 is the coefficient

x is the input

We will Sklearn to develop and train the model

#Import the libraries to train the model from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression

Allow us to utilise the train/test method to learn a part of the data on one set and predict using another set using the train/test approach.

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.4) #Create and Train the model model=LinearRegression().fit(X_train,y_train) #Generate prediction predictions_test=model.predict(X_test) #Compute loss to evaluate the model coefficient= model.coef_ intercept=model.intercept_ print(coefficient,intercept) [7.22218258] 24.66379606613584

In this example, you will learn the model using below hypothesis:

Price= 24.85 + 7.18* Room

It is interpreted as:

For a decided price of a house:

A 7.18-unit increase in the price is connected with a growth in the number of rooms.

As a side note, this is an association, not a cause!

Interpretation

You will need a metric to determine whether our hypothesis was right. The RMSE approach will be used.

Root Means Square Error (RMSE) is defined as the square root of the mean of square error. The difference between the true and anticipated numbers called the error. It's popular because it can be expressed in y-units, which is the median price of a home in our scenario.

def rmse(predict,actual):    return np.sqrt(np.mean(np.square(predict - actual))) # Split the Data into train and test set X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.4) #Create and Train the model model=LinearRegression().fit(X_train,y_train) #Generate prediction predictions_test=model.predict(X_test) #Compute loss to evaluate the model coefficient= model.coef_ intercept=model.intercept_ print(coefficient,intercept) loss=rmse(predictions_test,y_test) print('loss: ',loss) print(model.score(X_test,y_test)) #accuracy [7.43327725] 24.912055881970886 loss: 3.9673165450580714 0.7552661033654667 Loss will be 3.96

This means that y-units refer to the median value of occupied homes with 1000 dollars.

This will be less by 3960 dollars.

While learning the model you will have a high variance when you divide the data. Coefficient and intercept will vary. It's because when we utilized the train/test approach, we choose a set of data at random to place in either the train or test set. As a result, our theory will change each time the dataset is divided.

This problem can be solved using a technique called cross-validation.

Improvisation in the Model

With 'Forward Selection,' we'll iterate through each parameter to assist us choose the numbers characteristics to include in our model.

Forward Selection

  1. Choose the most appropriate variable (in our case based on high correlation)
  2. Add the next best variable to the model
  3. Some predetermined conditions must meet.

We'll use a random state of 1 so that each iteration yields the same outcome.

cols=[] los=[] los_train=[] scor=[] i=0 while i < len(high_corr_var):    cols.append(high_corr_var[i])        # Select inputs variables    X=new_df[cols]        #mean normalization    mu,std,sta=standard(X)    X=sta        # Split the data into training and testing    X_train,X_test,y_train,y_test= train_test_split(X,y,random_state=1)        #fit the model to the training    lnreg=LinearRegression().fit(X_train,y_train)        #make prediction on the training test    prediction_train=lnreg.predict(X_train)        #make prediction on the testing test    prediction=lnreg.predict(X_test)        #compute the loss on train test    loss=rmse(prediction,y_test)    loss_train=rmse(prediction_train,y_train)    los_train.append(loss_train)    los.append(loss)        #compute the score    score=lnreg.score(X_test,y_test)    scor.append(score)        i+=1

We have a big 'loss' with a smaller collection of variables, yet our system will overgeneralize in this scenario. Although we have a reduced 'loss,' we have a large number of variables. However, if the model grows too precise, it may not generalize well to new data.

In order for our model to generalize well with another set of data, we might use 6 or 7 features. The characteristic chosen is descending based on how strong the price correlation is.

high_corr_var ['RM', 'ZN', 'B', 'CHAS', 'RAD', 'DIS', 'CRIM', 'NOX', 'AGE', 'TAX', 'INDUS', 'PTRATIO', 'LSTAT']

With 'RM' having a high price correlation and LSTAT having a negative price correlation.

# Create a list of features names feature_cols=['RM','ZN','B','CHAS','RAD','CRIM','DIS','NOX'] #Select inputs variables X=new_df[feature_cols] # Split the data into training and testing sets X_train,X_test,y_train,y_test= train_test_split(X,y, random_state=1) # feature engineering mu,std,sta=standard(X) X=sta # fit the model to the trainning data lnreg=LinearRegression().fit(X_train,y_train) # make prediction on the testing test prediction=lnreg.predict(X_test) # compute the loss loss=rmse(prediction,y_test) print('loss: ',loss) lnreg.score(X_test,y_test) loss: 3.212659865936143 0.8582338376696363

The test set yielded a loss of 3.21 and an accuracy of 85%.

Other factors, such as alpha, the learning rate at which our model learns, could still be tweaked to improve our model. Alternatively, return to the preprocessing section and working to increase the parameter distribution.

For more details regarding scraping real estate data you can contact Scraping Intelligence today

https://www.websitescraper.com/how-to-predict-housing-prices-with-linear-regression.php