1594839180
Having a housing price prediction model can be a very important tool for both the seller and the buyer, as it can aid them in making well-informed decisions. For sellers, it may help determine the average price at which they should list their house, while for buyers, it may help them work out a fair average price to pay for it.
Our goal is to build a random forest regression model that can predict the median value of houses. We will also briefly walk through some Exploratory Data Analysis, Feature Engineering and Hyperparameter Tuning to improve the performance of our Random Forest model.
Our Machine Learning Pipeline can be broadly summarized into the following tasks:
We will be using the Boston Housing dataset: https://www.kaggle.com/c/boston-housing/data
#Importing the necessary libraries we will be using
%load_ext autoreload
%autoreload 2
%matplotlib inline
from fastai.imports import *
from fastai.structured import *
from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor
from IPython.display import display
from sklearn import metrics
from sklearn.model_selection import RandomizedSearchCV
#Loading the Dataset
PATH = 'data/Boston Housing Dataset/'
df_raw_train = pd.read_csv(f'{PATH}train.csv', low_memory=False)
df_raw_test = pd.read_csv(f'{PATH}test.csv', low_memory=False)
2.1 Check for and handle any missing data and outliers.
df_raw_train.info()
Understanding the raw data:
From the raw training dataset above:
(a) There are 14 variables (13 independent variables — Features and 1 dependent variable — Target Variable).
(b) The data types are either **integers** or floats.
(c) No categorical data is present.
(d) There are no missing values in our dataset.
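As a quick sanity check on observation (d), a one-liner like the following (a small sketch, not part of the original notebook) counts the missing values in each column:
#Count missing values per column (all counts should be 0)
df_raw_train.isnull().sum()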
2.2 As part of EDA, we will first try to determine the distribution of the dependent variable (MEDV).
#Plot the distribution of MEDV
plt.figure(figsize=(10, 6))
sns.distplot(df_raw_train['MEDV'],bins=30)
The values of MEDV follow a roughly **normal distribution** with a mean of around 22. There are some outliers to the right.
2.3 Next, try to determine if there are any correlations between:
(i) the independent variables themselves
(ii) the independent variables and dependent variable
To do this, let’s do a correlation heatmap.
# Plot the correlation heatmap
plt.figure(figsize=(14, 8))
corr_matrix = df_raw_train.corr().round(2)
sns.heatmap(data=corr_matrix,cmap='coolwarm',annot=True)
(i) Correlation between independent variables:
We need to look out for multicollinearity (i.e. features that are strongly correlated with each other), as it can distort the estimated relationship between each feature and the dependent variable.
Observe that **RAD** and **TAX** are highly correlated with each other (correlation score: 0.92), while a few other feature pairs are somewhat correlated with one another, with correlation scores of around 0.70 (INDUS and TAX, NOX and INDUS, AGE and DIS, AGE and INDUS).
(ii) Correlation between independent variable and dependent variable:
In order for our regression model to perform well, we ideally need to select those features that are highly correlated with our dependent variable (MEDV).
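One way to make this concrete (a small sketch reusing the corr_matrix computed above, not part of the original notebook) is to rank every feature by its correlation with MEDV:
#Rank features by their correlation with the target variable MEDV
corr_matrix['MEDV'].drop('MEDV').sort_values(ascending=False)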
We observe that both RM and LSTAT are strongly **correlated** with MEDV, with correlation scores of 0.66 and 0.74 respectively (LSTAT's correlation is negative). This can also be illustrated via a scatter plot.
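The scatter plot itself is not reproduced here, but a sketch along the following lines (assuming the same training DataFrame) would show the two relationships side by side:
#Scatter plots of the two most strongly correlated features against MEDV
plt.figure(figsize=(14, 5))
plt.subplot(1, 2, 1)
plt.scatter(df_raw_train['RM'], df_raw_train['MEDV'], alpha=0.5)
plt.xlabel('RM (average rooms per dwelling)')
plt.ylabel('MEDV')
plt.subplot(1, 2, 2)
plt.scatter(df_raw_train['LSTAT'], df_raw_train['MEDV'], alpha=0.5)
plt.xlabel('LSTAT (% lower status of the population)')
plt.ylabel('MEDV')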
#sklearn #random-forest #boston #data-science #data-analysis
1667425440
Perl script converts PDF files to Gerber format
Pdf2Gerb generates Gerber 274X photoplotting and Excellon drill files from PDFs of a PCB. Up to three PDFs are used: the top copper layer, the bottom copper layer (for 2-sided PCBs), and an optional silk screen layer. The PDFs can be created directly from any PDF drawing software, or a PDF print driver can be used to capture the Print output if the drawing software does not directly support output to PDF.
The general workflow is as follows:
Please note that Pdf2Gerb does NOT perform DRC (Design Rule Checks), as these will vary according to individual PCB manufacturer conventions and capabilities. Also note that Pdf2Gerb is not perfect, so the output files must always be checked before submitting them. As of version 1.6, Pdf2Gerb supports most PCB elements, such as round and square pads, round holes, traces, SMD pads, ground planes, no-fill areas, and panelization. However, because it interprets the graphical output of a Print function, there are limitations in what it can recognize (or there may be bugs).
See docs/Pdf2Gerb.pdf for install/setup, config, usage, and other info.
#Pdf2Gerb config settings:
#Put this file in same folder/directory as pdf2gerb.pl itself (global settings),
#or copy to another folder/directory with PDFs if you want PCB-specific settings.
#There is only one user of this file, so we don't need a custom package or namespace.
#NOTE: all constants defined in here will be added to main namespace.
#package pdf2gerb_cfg;
use strict; #trap undef vars (easier debug)
use warnings; #other useful info (easier debug)
##############################################################################################
#configurable settings:
#change values here instead of in main pdf2gerb.pl file
use constant WANT_COLORS => ($^O !~ m/Win/); #ANSI colors no worky on Windows? this must be set < first DebugPrint() call
#just a little warning; set realistic expectations:
#DebugPrint("${\(CYAN)}Pdf2Gerb.pl ${\(VERSION)}, $^O O/S\n${\(YELLOW)}${\(BOLD)}${\(ITALIC)}This is EXPERIMENTAL software. \nGerber files MAY CONTAIN ERRORS. Please CHECK them before fabrication!${\(RESET)}", 0); #if WANT_DEBUG
use constant METRIC => FALSE; #set to TRUE for metric units (only affect final numbers in output files, not internal arithmetic)
use constant APERTURE_LIMIT => 0; #34; #max #apertures to use; generate warnings if too many apertures are used (0 to not check)
use constant DRILL_FMT => '2.4'; #'2.3'; #'2.4' is the default for PCB fab; change to '2.3' for CNC
use constant WANT_DEBUG => 0; #10; #level of debug wanted; higher == more, lower == less, 0 == none
use constant GERBER_DEBUG => 0; #level of debug to include in Gerber file; DON'T USE FOR FABRICATION
use constant WANT_STREAMS => FALSE; #TRUE; #save decompressed streams to files (for debug)
use constant WANT_ALLINPUT => FALSE; #TRUE; #save entire input stream (for debug ONLY)
#DebugPrint(sprintf("${\(CYAN)}DEBUG: stdout %d, gerber %d, want streams? %d, all input? %d, O/S: $^O, Perl: $]${\(RESET)}\n", WANT_DEBUG, GERBER_DEBUG, WANT_STREAMS, WANT_ALLINPUT), 1);
#DebugPrint(sprintf("max int = %d, min int = %d\n", MAXINT, MININT), 1);
#define standard trace and pad sizes to reduce scaling or PDF rendering errors:
#This avoids weird aperture settings and replaces them with more standardized values.
#(I'm not sure how photoplotters handle strange sizes).
#Fewer choices here gives more accurate mapping in the final Gerber files.
#units are in inches
use constant TOOL_SIZES => #add more as desired
(
#round or square pads (> 0) and drills (< 0):
.010, -.001, #tiny pads for SMD; dummy drill size (too small for practical use, but needed so StandardTool will use this entry)
.031, -.014, #used for vias
.041, -.020, #smallest non-filled plated hole
.051, -.025,
.056, -.029, #useful for IC pins
.070, -.033,
.075, -.040, #heavier leads
# .090, -.043, #NOTE: 600 dpi is not high enough resolution to reliably distinguish between .043" and .046", so choose 1 of the 2 here
.100, -.046,
.115, -.052,
.130, -.061,
.140, -.067,
.150, -.079,
.175, -.088,
.190, -.093,
.200, -.100,
.220, -.110,
.160, -.125, #useful for mounting holes
#some additional pad sizes without holes (repeat a previous hole size if you just want the pad size):
.090, -.040, #want a .090 pad option, but use dummy hole size
.065, -.040, #.065 x .065 rect pad
.035, -.040, #.035 x .065 rect pad
#traces:
.001, #too thin for real traces; use only for board outlines
.006, #minimum real trace width; mainly used for text
.008, #mainly used for mid-sized text, not traces
.010, #minimum recommended trace width for low-current signals
.012,
.015, #moderate low-voltage current
.020, #heavier trace for power, ground (even if a lighter one is adequate)
.025,
.030, #heavy-current traces; be careful with these ones!
.040,
.050,
.060,
.080,
.100,
.120,
);
#Areas larger than the values below will be filled with parallel lines:
#This cuts down on the number of aperture sizes used.
#Set to 0 to always use an aperture or drill, regardless of size.
use constant { MAX_APERTURE => max((TOOL_SIZES)) + .004, MAX_DRILL => -min((TOOL_SIZES)) + .004 }; #max aperture and drill sizes (plus a little tolerance)
#DebugPrint(sprintf("using %d standard tool sizes: %s, max aper %.3f, max drill %.3f\n", scalar((TOOL_SIZES)), join(", ", (TOOL_SIZES)), MAX_APERTURE, MAX_DRILL), 1);
#NOTE: Compare the PDF to the original CAD file to check the accuracy of the PDF rendering and parsing!
#for example, the CAD software I used generated the following circles for holes:
#CAD hole size: parsed PDF diameter: error:
# .014 .016 +.002
# .020 .02267 +.00267
# .025 .026 +.001
# .029 .03167 +.00267
# .033 .036 +.003
# .040 .04267 +.00267
#This was usually ~ .002" - .003" too big compared to the hole as displayed in the CAD software.
#To compensate for PDF rendering errors (either during CAD Print function or PDF parsing logic), adjust the values below as needed.
#units are pixels; for example, a value of 2.4 at 600 dpi = .004 inch, 2 at 600 dpi = .0033"
use constant
{
HOLE_ADJUST => -0.004 * 600, #-2.6, #holes seemed to be slightly oversized (by .002" - .004"), so shrink them a little
RNDPAD_ADJUST => -0.003 * 600, #-2, #-2.4, #round pads seemed to be slightly oversized, so shrink them a little
SQRPAD_ADJUST => +0.001 * 600, #+.5, #square pads are sometimes too small by .00067, so bump them up a little
RECTPAD_ADJUST => 0, #(pixels) rectangular pads seem to be okay? (not tested much)
TRACE_ADJUST => 0, #(pixels) traces seemed to be okay?
REDUCE_TOLERANCE => .001, #(inches) allow this much variation when reducing circles and rects
};
#Also, my CAD's Print function or the PDF print driver I used was a little off for circles, so define some additional adjustment values here:
#Values are added to X/Y coordinates; units are pixels; for example, a value of 1 at 600 dpi would be ~= .002 inch
use constant
{
CIRCLE_ADJUST_MINX => 0,
CIRCLE_ADJUST_MINY => -0.001 * 600, #-1, #circles were a little too high, so nudge them a little lower
CIRCLE_ADJUST_MAXX => +0.001 * 600, #+1, #circles were a little too far to the left, so nudge them a little to the right
CIRCLE_ADJUST_MAXY => 0,
SUBST_CIRCLE_CLIPRECT => FALSE, #generate circle and substitute for clip rects (to compensate for the way some CAD software draws circles)
WANT_CLIPRECT => TRUE, #FALSE, #AI doesn't need clip rect at all? should be on normally?
RECT_COMPLETION => FALSE, #TRUE, #fill in 4th side of rect when 3 sides found
};
#allow .012 clearance around pads for solder mask:
#This value effectively adjusts pad sizes in the TOOL_SIZES list above (only for solder mask layers).
use constant SOLDER_MARGIN => +.012; #units are inches
#line join/cap styles:
use constant
{
CAP_NONE => 0, #butt (none); line is exact length
CAP_ROUND => 1, #round cap/join; line overhangs by a semi-circle at either end
CAP_SQUARE => 2, #square cap/join; line overhangs by a half square on either end
CAP_OVERRIDE => FALSE, #cap style overrides drawing logic
};
#number of elements in each shape type:
use constant
{
RECT_SHAPELEN => 6, #x0, y0, x1, y1, count, "rect" (start, end corners)
LINE_SHAPELEN => 6, #x0, y0, x1, y1, count, "line" (line seg)
CURVE_SHAPELEN => 10, #xstart, ystart, x0, y0, x1, y1, xend, yend, count, "curve" (bezier 2 points)
CIRCLE_SHAPELEN => 5, #x, y, r, count, "circle" (center + radius)
};
#const my %SHAPELEN =
#Readonly my %SHAPELEN =>
our %SHAPELEN =
(
rect => RECT_SHAPELEN,
line => LINE_SHAPELEN,
curve => CURVE_SHAPELEN,
circle => CIRCLE_SHAPELEN,
);
#panelization:
#This will repeat the entire body the number of times indicated along the X or Y axes (files grow accordingly).
#Display elements that overhang PCB boundary can be squashed or left as-is (typically text or other silk screen markings).
#Set "overhangs" TRUE to allow overhangs, FALSE to truncate them.
#xpad and ypad allow margins to be added around outer edge of panelized PCB.
use constant PANELIZE => {'x' => 1, 'y' => 1, 'xpad' => 0, 'ypad' => 0, 'overhangs' => TRUE}; #number of times to repeat in X and Y directions
# Set this to 1 if you need TurboCAD support.
#$turboCAD = FALSE; #is this still needed as an option?
#CIRCAD pad generation uses an appropriate aperture, then moves it (stroke) "a little" - we use this to find pads and distinguish them from PCB holes.
use constant PAD_STROKE => 0.3; #0.0005 * 600; #units are pixels
#convert very short traces to pads or holes:
use constant TRACE_MINLEN => .001; #units are inches
#use constant ALWAYS_XY => TRUE; #FALSE; #force XY even if X or Y doesn't change; NOTE: needs to be TRUE for all pads to show in FlatCAM and ViewPlot
use constant REMOVE_POLARITY => FALSE; #TRUE; #set to remove subtractive (negative) polarity; NOTE: must be FALSE for ground planes
#PDF uses "points", each point = 1/72 inch
#combined with a PDF scale factor of .12, this gives 600 dpi resolution (72 / .12 = 600 dpi)
use constant INCHES_PER_POINT => 1/72; #0.0138888889; #multiply point-size by this to get inches
# The precision used when computing a bezier curve. Higher numbers are more precise but slower (and generate larger files).
#$bezierPrecision = 100;
use constant BEZIER_PRECISION => 36; #100; #use const; reduced for faster rendering (mainly used for silk screen and thermal pads)
# Ground planes and silk screen or larger copper rectangles or circles are filled line-by-line using this resolution.
use constant FILL_WIDTH => .01; #fill at most 0.01 inch at a time
# The max number of characters to read into memory
use constant MAX_BYTES => 10 * M; #bumped up to 10 MB, use const
use constant DUP_DRILL1 => TRUE; #FALSE; #kludge: ViewPlot doesn't load drill files that are too small so duplicate first tool
my $runtime = time(); #Time::HiRes::gettimeofday(); #measure my execution time
print STDERR "Loaded config settings from '${\(__FILE__)}'.\n";
1; #last value must be truthful to indicate successful load
#############################################################################################
#junk/experiment:
#use Package::Constants;
#use Exporter qw(import); #https://perldoc.perl.org/Exporter.html
#my $caller = "pdf2gerb::";
#sub cfg
#{
# my $proto = shift;
# my $class = ref($proto) || $proto;
# my $settings =
# {
# $WANT_DEBUG => 990, #10; #level of debug wanted; higher == more, lower == less, 0 == none
# };
# bless($settings, $class);
# return $settings;
#}
#use constant HELLO => "hi there2"; #"main::HELLO" => "hi there";
#use constant GOODBYE => 14; #"main::GOODBYE" => 12;
#print STDERR "read cfg file\n";
#our @EXPORT_OK = Package::Constants->list(__PACKAGE__); #https://www.perlmonks.org/?node_id=1072691; NOTE: "_OK" skips short/common names
#print STDERR scalar(@EXPORT_OK) . " consts exported:\n";
#foreach(@EXPORT_OK) { print STDERR "$_\n"; }
#my $val = main::thing("xyz");
#print STDERR "caller gave me $val\n";
#foreach my $arg (@ARGV) { print STDERR "arg $arg\n"; }
Author: swannman
Source Code: https://github.com/swannman/pdf2gerb
License: GPL-3.0 license
1595563500
In this tutorial, we’re going to create a model to predict House prices🏡 based on various factors across different markets.
The goal of this statistical analysis is to help us understand the relationship between the house features and the price, and how these variables can be used to predict it.
Data used: Kaggle-kc_house Dataset
GitHub: you can find my source code here
First, Let’s import the data and have a look to see what kind of data we are dealing with:
#import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#import Data
Data = pd.read_csv('kc_house_data.csv')
Data.head(5).T
#get some information about our Data-Set
Data.info()
Data.describe().transpose()
5 records of our dataset
Information about the dataset: the data type of each variable
Statistical summary of the dataset
Our features are:
✔️ **Date:** Date the house was sold
✔️ **Price:** Price (prediction target)
✔️ **Bedrooms:** Number of bedrooms per house
✔️ **Bathrooms:** Number of bathrooms per house
✔️ **Sqft_Living:** Square footage of the home
✔️ **Sqft_Lot:** Square footage of the lot
✔️ **Floors:** Total floors (levels) in the house
✔️ **Waterfront:** House which has a view of a waterfront
✔️ **View:** Has been viewed
✔️ **Condition:** How good the condition is (overall)
✔️ **Grade:** Grade given to the housing unit, based on the King County grading system
✔️ **Sqft_Above:** Square footage of the house apart from the basement
✔️ **Sqft_Basement:** Square footage of the basement
✔️ **Yr_Built:** Year built
✔️ **Yr_Renovated:** Year when the house was renovated
✔️ **Zipcode:** Zip code
✔️ **Lat:** Latitude coordinate
✔️ **Long:** Longitude coordinate
✔️ **Sqft_Living15:** Living room area in 2015 (implies some renovations)
✔️ **Sqft_Lot15:** Lot size area in 2015 (implies some renovations)
Let’s plot a couple of features to get a better feel for the data:
#visualizing house prices
fig = plt.figure(figsize=(10,7))
fig.add_subplot(2,1,1)
sns.distplot(Data['price'])
fig.add_subplot(2,1,2)
sns.boxplot(Data['price'])
plt.tight_layout()
#visualizing square footage of (home,lot,above and basement)
fig = plt.figure(figsize=(16,5))
fig.add_subplot(2,2,1)
sns.scatterplot(Data['sqft_above'], Data['price'])
fig.add_subplot(2,2,2)
sns.scatterplot(Data['sqft_lot'],Data['price'])
fig.add_subplot(2,2,3)
sns.scatterplot(Data['sqft_living'],Data['price'])
fig.add_subplot(2,2,4)
sns.scatterplot(Data['sqft_basement'],Data['price'])
#visualizing bedrooms,bathrooms,floors,grade
fig = plt.figure(figsize=(15,7))
fig.add_subplot(2,2,1)
sns.countplot(Data['bedrooms'])
fig.add_subplot(2,2,2)
sns.countplot(Data['floors'])
fig.add_subplot(2,2,3)
sns.countplot(Data['bathrooms'])
fig.add_subplot(2,2,4)
sns.countplot(Data['grade'])
plt.tight_layout()
From the distribution plot of price, we can see that most prices fall between 0 and around 1 million, with a few outliers close to 8 million (fancy houses😉). It would make sense to drop those outliers in our analysis.
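As a sketch of that clean-up step (the exact cut-off is a judgment call and is not specified in the original analysis), the extreme prices could be filtered out before modelling, for example by capping at the 99th percentile:
#Drop extreme price outliers; the 99th-percentile threshold is illustrative only
price_cap = Data['price'].quantile(0.99)
Data = Data[Data['price'] <= price_cap]
print(f'Rows remaining after removing outliers: {len(Data)}')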
#linear-regression #machine-learning #python #house-price-prediction #deep-learning
1601206560
In this final part, I will share some popular machine learning algorithms for predicting housing prices, along with the live model that I have built. My objective is to find a model that can predict housing prices with high accuracy based on the available dataset, such that, given a new property with the required information, we can tell whether it is over- or under-valued.
I explore 5 machine learning algorithms that are used to predict the housing prices in Singapore, namely multi-linear regression, lasso, ridge, decision tree and neural network.
The multi-linear regression model, as its name suggests, is a widely used model that assumes a linear relationship between the independent variables and the dependent variable (price). This will be my baseline model for comparison.
Lasso and ridge are models that reduce model complexity and overfitting when there are too many parameters. For example, the lasso model effectively shrinks some of the coefficients towards zero, so that only the more important factors are taken into account. While there are only 17 variables in the dataset, and that number may not be considered extensive, it is still a good exercise to analyse the effectiveness of these models.
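A minimal sketch of that shrinkage effect with scikit-learn is shown below; X (the feature matrix) and y (the price target) are assumed to have been prepared elsewhere and are not part of the original code:
#Sketch: count how many coefficients lasso drives to exactly zero vs. ridge
#X and y are assumed to be a prepared feature matrix and price target
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print('Lasso coefficients shrunk to zero:', np.sum(lasso[-1].coef_ == 0))
print('Ridge coefficients shrunk to zero:', np.sum(ridge[-1].coef_ == 0))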
A decision tree is an easily interpretable model that uses a set of binary rules to reach the target value. This is extremely useful for decision making, as a tree diagram can be plotted to aid in understanding the importance of each variable (the higher a variable appears in the tree, the more important it is).
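A sketch of drawing such a tree diagram with scikit-learn (again with an assumed X and y; the depth is limited only so the diagram stays readable):
#Sketch: fit a shallow decision tree and plot the tree diagram
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree
tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X, y)
plt.figure(figsize=(16, 8))
plot_tree(tree, feature_names=list(X.columns), filled=True)
plt.show()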
Lastly, I have also explored a simple multi-layer perceptron neural network model. Simply put, the data inputs are passed through a few layers of “filters” (feed-forward hidden layers), and the model learns to minimise the loss function by changing the values in those “filter” matrices.
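And a minimal sketch of such a network using scikit-learn's MLPRegressor; the hidden-layer sizes below are illustrative, not the author's actual architecture:
#Sketch: a small feed-forward neural network for price regression
#X and y are assumed to be a prepared feature matrix and price target
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
mlp = make_pipeline(StandardScaler(), MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=42))
mlp.fit(X_train, y_train)
print('R^2 on held-out data:', mlp.score(X_test, y_test))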
#predictive-analytics #predictive-modeling #machine-learning #sklearn #housing-prices
1623906928
Model Stacking is a way to improve model predictions by combining the outputs of multiple models and running them through another machine learning model called a meta-learner. It is a popular strategy used to win Kaggle competitions, but despite its usefulness it is rarely talked about in data science articles, which I hope to change.
Essentially, a stacked model works by running the output of multiple models through a “meta-learner” (usually a linear regressor/classifier, but it can be another model such as a decision tree). The meta-learner attempts to minimize the weaknesses and maximize the strengths of each individual model. The result is usually a very robust model that generalizes well on unseen data.
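As a hedged sketch of this idea, scikit-learn's StackingRegressor (one possible implementation, not necessarily the library or base models used in this article) feeds the base models' out-of-fold predictions to a linear meta-learner:
#Sketch: stack three base regressors behind a linear meta-learner
#X and y are assumed to be a prepared feature matrix and target
from sklearn.ensemble import StackingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.tree import DecisionTreeRegressor
base_models = [
    ('rf', RandomForestRegressor(n_estimators=200, random_state=42)),
    ('gbm', GradientBoostingRegressor(random_state=42)),
    ('tree', DecisionTreeRegressor(max_depth=6, random_state=42)),
]
stack = StackingRegressor(estimators=base_models, final_estimator=RidgeCV())
stack.fit(X, y)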
The architecture for a stacked model can be illustrated by the image below:
#tensorflow #neural-networks #model-stacking #machine-learning