Several MIT courses involving numerical computation, including 18.06 / 18.C06, 18.303, 18.330, 18.335/6.337, 18.337/6.338, and 18.338, are beginning to use Julia, a fairly new language for technical computing. This page is intended to supplement the Julia documentation with some simple tutorials on installing and using Julia targeted at MIT students. See also our Julia cheatsheet listing a few basic commands, and various Julia tutorials online.
In particular, we will be using Julia in the Jupyter browser-based environment (via the IJulia plug-in), which leverages your web browser to provide a rich environment combining code, graphics, formatted text, and even equations, with sophisticated plots via Matplotlib.
You can also look at the Jupyter notebook from the fall 2020 tutorial, as well as the tutorial video (MIT only).
Julia is a relatively new high-level free/open-source language for numerical computing in the same spirit as Matlab, Python, or R, with a rich set of built-in types and libraries for working with linear algebra and other kinds of computation, and with a syntax that is superficially reminiscent of Matlab's. Basically, we are using Julia because, unlike Matlab or Python or R, it scales better to real computational problems — you can write performance-critical "inner loops" in Julia, whereas similar tasks in other high-level languages often require one to drop down to C or similar low-level languages. (See e.g. this 6.172 lecture on performance in Julia vs. Python.) Because of this, we are using Julia more and more in our own research, and we want to teach using software tools that we really employ ourselves.
The easiest way to get started with Julia is to run it in the cloud on mybinder.org, which is as easy as clicking this link:
That link opens up a default MIT-math Julia + Python environment that we set up, but you can also easily set up your own environments. Although the link above gives you access to our tutorial notebook here, you can create alternate links (e.g. for particular MIT courses) using nbgitpuller.
There are three major drawbacks to using the free mybinder.org service:
It's often slow (sometimes an order of magnitude slower than a typical laptop), especially to start up, although it's probably fast enough for simple problems in coursework.
It has a very short timeout: if you go for a coffee break, your session will probably have stopped running by the time you get back. Fortunately, there are save/download buttons that still work in a timed-out session, so you can save your work and restore it after restarting the binder session.
There are at most 100 simultaneous users for a given configuration repository. (Therefore, if your instructor wants to use mybinder for a course, encourage them to set up their own docker configuration, perhaps by forking our repo.)
Eventually you'll probably want to install Julia on your own computer to eliminate these frustrations. Fortunately, this is usually relatively easy:
First, download the 1.8.x release of Julia and run the installer. Then run the Julia application (double-click on it); a window with a julia> prompt will appear. At the julia> prompt, type a ] (close square bracket) to get a Julia package prompt pkg>, where you can type
(v1.8) pkg> add IJulia
You may also want to install these packages, which we tend to use in a lot of the lecture materials:
(v1.8) pkg> add Interact PyPlot Plots
(You can install other packages later as you need them using the same interface, of course. Thousands of other packages can be found on JuliaHub.)
Download the jupyterlab-desktop program, and launch it. Click the "Julia" button (or choose "New > Notebook" from the file menu and select the "Julia" kernel):
You should now have an interactive Julia notebook, whose usage we describe below.
If you have problems printing or exporting PDF from the JupyterLab Desktop (on some systems this fails if you don't have LaTeX installed), a workaround is to export as HTML (from File > Save and Export Notebook As… > HTML), open the resulting .html file in your browser (double-click on it), and print to PDF from your browser.
You can alternatively use Julia itself to install the Jupyter software and have it run its interface through your web browser.
Switch back to the julia> prompt by hitting backspace or ctrl-C, and then you can launch the notebook by running
julia> using IJulia
julia> notebook()
and type "y" if you are asked to install Jupyter. A "dashboard" window like this should open in your web browser (at address localhost:8888, which you can return to at any time as long as the notebook() server is running; I usually keep it running all the time):
Now, click on the New button and select the Julia option to start a new "notebook".
(You will have to leave this Julia command-line window open in order to keep the Jupyter process running. Alternatively, you can run notebook(detached=true) if you want to run the Jupyter server as a background process, at which point you can close the Julia command line; but then, if you ever want to restart the Jupyter server, you will need to kill it manually.)
Troubleshooting: if something goes wrong with the IJulia installation, run build at the pkg> prompt to try to rerun the install scripts. If that doesn't help, run update at the pkg> prompt and try again: this will fetch the latest versions of the Julia packages in case the problem you saw was fixed. Run build IJulia at the pkg> prompt if your Julia version may have changed. If this doesn't work, try just deleting the whole .julia directory in your home directory (on Windows, it is called AppData\Roaming\julia\packages in your home directory) and re-adding the packages.
If the notebook never connects to the kernel (e.g. a cell keeps showing In[*] indefinitely), try creating a new Python notebook (not Julia) from the New button in the Jupyter dashboard, to see if 1+1 works in Python. If it is the same problem, then probably you have a firewall running on your machine (this is common on Windows) and you need to disable the firewall, or at least allow the IP address 127.0.0.1. (For the Sophos endpoint security software, go to "Configure Anti-Virus and HIPS", select "Authorization" and then "Websites", and add 127.0.0.1 to "Authorized websites"; finally, restart your computer.)
A different interactive-computing environment for Julia is Pluto.jl, which runs in the browser like Jupyter but is more oriented towards "live" interaction, where updating one piece of code automatically re-runs anything affected by that change. Running Pluto is as easy as:
pkg> add Pluto
julia> using Pluto
julia> Pluto.run()
For writing larger programs, modules, and packages (as opposed to little interactive snippets), you'll want to start putting code into files and modules, and use a more full-featured code-editing environment. A popular choice is the free/cross-platform Visual Studio Code (VSCode) editor, which has a Julia VSCode plugin to provide a full-featured integrated development environment (IDE).
Of course, there is also good support for editing Julia in many other programs, such as Emacs, Vim, Atom, and so forth.
Julia is improving rapidly, so it won't be long before you want to update to a more recent version. The same is true of Julia add-on packages like PyPlot. To update the packages only, keeping Julia itself the same, just run:
(v1.8) pkg> update
at the Julia pkg> prompt after typing ]; you can also run ] update in IJulia.
If you download and install a new version of Julia from the Julia web site, you will also probably want to update the packages with update (in case newer versions of the packages are required for the most recent Julia). In any case, if you install a new Julia binary (or do anything that changes the location of Julia on your computer), you must update the IJulia installation (to tell Jupyter where to find the new Julia) by running build at the Julia pkg> prompt (not in IJulia).
Once you have followed the installation steps above, then you will want to open the Jupyter notebook interface. As explained above, you can either launch the standalone JupyterLab Desktop app (which you download and install separately), or you can install Jupyter via Julia and run it via your web browser.
Either way, a notebook will combine code, computed results, formatted text, and images; for example, you might use one notebook for each problem set. The notebook window that opens will look something like:
In the browser, you can click the "Untitled" at the top to change the name, e.g. to "My first Julia notebook"; in JupyterLab you click the "Rename" option in the "File" menu. You can enter Julia code at the In[ ] prompt, and hit shift-return to execute it and see the results. If you hit return without the shift key, it will add additional lines to a single input cell. For example, we can define a variable x (using the built-in constant pi and the built-in function sin), and then evaluate a polynomial 3x^2 + 2x - 5 in terms of x (note that, unlike Matlab or Python, we don't have to type 3*x^2 if we don't want to: a number followed by a variable is automatically interpreted as multiplication without having to type *):
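(The original page showed this example as a screenshot; here is a transcript in the same spirit, where the particular value of x is just an illustration:)

```julia
x = sin(2pi/3)   # define x using the built-in pi and sin
3x^2 + 2x - 5    # evaluate the polynomial; ≈ -1.018 for this x
```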
The result that is printed (in Out[1]) is the last expression from the input cell, i.e. the polynomial. If you want to see the value of x, for example, you could simply type x at the second In[ ] prompt and hit shift-return.
See, for example, the mathematical operations in the Julia manual for many more basic math functions.
There are several plotting packages available for Julia. If you followed the installation instructions, above, you already have one full-featured Matlab-like plotting package installed: PyPlot, which is simply a wrapper around Python's amazing Matplotlib library.
To start using PyPlot to make plots in Julia, first type using PyPlot at an input prompt and hit shift-enter. using is the Julia command to load an external module (which must usually be installed first, e.g. by the ] add PyPlot command from the installation instructions above). The very first time you do using PyPlot, it will take some time: the module and its dependencies will be "precompiled" so that in subsequent Julia sessions it will load quickly.
Then, you can type any of the commands from Matplotlib, which includes equivalents for most of the Matlab plotting functions. For example:
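(The example plot was shown as an image on the original page; here is a snippet in the same spirit, with an illustrative choice of function:)

```julia
using PyPlot
x = range(0, 2π, length=1000)
plot(x, sin.(3x))
xlabel("x")
ylabel("sin(3x)")
title("A simple PyPlot plot")
```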
Currently, printing a notebook from the browser's Print command can be somewhat problematic. There are four solutions:
At the top of the notebook, click on the File menu (in the notebook, not the browser's global menu bar), and choose Print Preview. This should open up a window/tab that you can print normally.
For turning in homework, a class may allow you to submit the notebook (.ipynb) file electronically (the graders will handle printing). You can save a notebook file in a different location by choosing Download as from the notebook's File menu.
The highest-quality printed output is produced by Jupyter's nbconvert utility. For example, if you have a file mynotebook.ipynb, you can run jupyter nbconvert --to html mynotebook.ipynb to convert it to an HTML file that you can open and print in your web browser. This requires you to install Jupyter (which is included with the Anaconda Python distribution) and, for some output formats, Pandoc on your computer.
If you post your notebook in a Dropbox account or in some other web-accessible location, you can paste the URL into the online nbviewer to get a printable version.
Author: mitmath
Source Code: https://github.com/mitmath/julia-mit
The objective is to estimate the price of a house in a Boston suburb, using data provided by the 1970 Boston Standard Metropolitan Statistical Area. To examine and transform the data, we will use several techniques, such as data pre-processing and feature engineering. After that, we'll fit a statistical model, a linear regression, to predict house prices.
Project Outline:
Before using a statistical model, exploratory data analysis (EDA) is a good step to go through in order to understand the data.
# Import the libraries
# Dataframe/numerical libraries
import pandas as pd
import numpy as np
# Data visualization
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# Machine learning model
from sklearn.linear_model import LinearRegression
# Reading the data
path = './housing.csv'
housing_df = pd.read_csv(path, header=None, delim_whitespace=True)
# Column names, from the dataset description below
housing_df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                      'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
505 | 0.04741 | 0.0 | 11.93 | 0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
CRIM: the town's per-capita crime rate.
ZN: the proportion of residential land zoned for lots over 25,000 square feet.
INDUS: the proportion of non-retail business acres per town.
CHAS: whether or not the tract bounds the Charles River (1 if it does, 0 otherwise).
NOX: the nitric oxides concentration (parts per 10 million).
RM: the average number of rooms per dwelling.
AGE: the proportion of owner-occupied units built before 1940.
DIS: the weighted distance to five Boston employment centers.
RAD: an index of accessibility to radial highways.
TAX: the full-value property-tax rate per $10,000.
B: the outcome of the equation B = 1000(Bk − 0.63)², where Bk is the proportion of Black residents in each town.
PTRATIO: the student-to-teacher ratio in each community.
LSTAT: the percentage of the population with lower socioeconomic status.
MEDV: the median value of owner-occupied homes, in $1000s.
# Check if there are any missing values
housing_df.isna().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64
No missing values are found
We examine our data's mean, standard deviation, and percentiles.
housing_df.describe()
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
At first glance, CRIM, ZN, INDUS, NOX, and 'B' appear to have multiple outliers, because their minimum and maximum values are so far apart. In the AGE column, the mean and the median (Q2, the 50th percentile) do not match.
We might double-check it by examining the distribution of each column.
Removing all outliers would make the model overly generic and underfit; keeping all outliers would let the model learn the data's noise and overfit, becoming accurate only on this sample. The approach is to strike a happy medium: a model that is not overly tuned to the training data, yet generalizes well when faced with a new set of data.
We'll keep numbers below 600 because there's a huge anomaly in the TAX column around 600.
new_df = housing_df[housing_df['TAX'] < 600]
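On a toy DataFrame (made-up values, not the real dataset), this kind of boolean filter works like this:

```python
import pandas as pd

# Toy data: two TAX values below the threshold, two above
toy = pd.DataFrame({'TAX': [296.0, 666.0, 273.0, 711.0]})
kept = toy[toy['TAX'] < 600]
print(len(kept))  # 2: only the rows with TAX below 600 survive
```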
The overall distribution, particularly the TAX, PTRATIO, and RAD, has improved slightly.
In the correlation heatmap, the light values denote perfect correlation, the reds represent medium correlation between the columns, and the black represents negative correlation.
With a value of 0.89, we can see that 'MEDV', the median price we wish to predict, is most strongly correlated with the number of rooms 'RM', followed by the residential-land proportion 'ZN' with a value of 0.32 and the 'B' column with a value of 0.19.
We will plot the features most correlated with price.
Feature scaling aids gradient descent by ensuring that all features are on the same scale, which makes locating the optimum much easier.
Mean standardization is one strategy to employ: it replaces each value by (value − mean)/std, so that the feature has a mean of roughly zero.
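As a quick sanity check of this idea on a small made-up array (not the housing data), the standardized values should come out with zero mean and unit standard deviation:

```python
import numpy as np

def standard(X):
    '''Standardize X to zero mean and unit standard deviation.'''
    mu = np.mean(X)        # mean
    std = np.std(X)        # standard deviation
    sta = (X - mu) / std   # mean normalization
    return mu, std, sta

x = np.array([1.0, 2.0, 3.0, 4.0])
mu, std, sta = standard(x)
print(np.isclose(np.mean(sta), 0.0))  # True: zero mean
print(np.isclose(np.std(sta), 1.0))   # True: unit standard deviation
```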
def standard(X):
    '''Standard makes the feature 'X' have a zero mean'''
    mu = np.mean(X)        # mean
    std = np.std(X)        # standard deviation
    sta = (X - mu) / std   # mean normalization
    return mu, std, sta

mu, std, sta = standard(X)
X = sta
X
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.609129 | 0.092792 | -1.019125 | -0.280976 | 0.258670 | 0.279135 | 0.162095 | -0.167660 | -2.105767 | -0.235130 | -1.136863 | 0.401318 | -0.933659 |
1 | -0.575698 | -0.598153 | -0.225291 | -0.280976 | -0.423795 | 0.049252 | 0.648266 | 0.250975 | -1.496334 | -1.032339 | -0.004175 | 0.401318 | -0.219350 |
2 | -0.575730 | -0.598153 | -0.225291 | -0.280976 | -0.423795 | 1.189708 | 0.016599 | 0.250975 | -1.496334 | -1.032339 | -0.004175 | 0.298315 | -1.096782 |
3 | -0.567639 | -0.598153 | -1.040806 | -0.280976 | -0.532594 | 0.910565 | -0.526350 | 0.773661 | -0.886900 | -1.327601 | 0.403593 | 0.343869 | -1.283945 |
4 | -0.509220 | -0.598153 | -1.040806 | -0.280976 | -0.532594 | 1.132984 | -0.228261 | 0.773661 | -0.886900 | -1.327601 | 0.403593 | 0.401318 | -0.873561 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
501 | -0.519445 | -0.598153 | 0.585220 | -0.280976 | 0.604848 | 0.306004 | 0.300494 | -0.936773 | -2.105767 | -0.574682 | 1.445666 | 0.277056 | -0.128344 |
502 | -0.547094 | -0.598153 | 0.585220 | -0.280976 | 0.604848 | -0.400063 | 0.570195 | -1.027984 | -2.105767 | -0.574682 | 1.445666 | 0.401318 | -0.229652 |
503 | -0.522423 | -0.598153 | 0.585220 | -0.280976 | 0.604848 | 0.877725 | 1.077657 | -1.085260 | -2.105767 | -0.574682 | 1.445666 | 0.401318 | -0.820331 |
504 | -0.444652 | -0.598153 | 0.585220 | -0.280976 | 0.604848 | 0.606046 | 1.017329 | -0.979587 | -2.105767 | -0.574682 | 1.445666 | 0.314006 | -0.676095 |
505 | -0.543685 | -0.598153 | 0.585220 | -0.280976 | 0.604848 | -0.534410 | 0.715691 | -0.924173 | -2.105767 | -0.574682 | 1.445666 | 0.401318 | -0.435703 |
For the sake of the project, we'll apply linear regression.
Typically, we run numerous models and select the best one based on a particular criterion.
As it relates to machine learning, linear regression is a type of supervised learning model in which the response is continuous.
Form of Linear Regression
y = θ0 + θ1X1, or, with several inputs, y = θ0 + θ1X1 + θ2X2 + θ3X3
y is the target you will be predicting
θ are the coefficients (θ0 is the intercept)
X are the inputs
We will use scikit-learn (sklearn) to develop and train the model.
# Import the libraries to train the model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
The train/test approach lets us fit the model on one subset of the data and evaluate its predictions on another, held-out subset.
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# Create and train the model
model = LinearRegression().fit(X_train, y_train)

# Generate predictions
predictions_test = model.predict(X_test)

# Inspect the learned parameters
coefficient = model.coef_
intercept = model.intercept_
print(coefficient, intercept)

[7.22218258] 24.66379606613584
In this example, the learned hypothesis is approximately:
Price = 24.85 + 7.18 * Room
(the exact coefficient and intercept vary with the random split).
It is interpreted as:
For a given house, each one-unit increase in the number of rooms is associated with a 7.18-unit increase in the price.
As a side note, this is an association, not a cause!
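As a quick arithmetic check of the hypothesis above, we can plug values in directly. (`predict_price` is a hypothetical helper written just for this illustration, and Room here is a standardized value after the mean-normalization step.)

```python
# Hypothetical helper: evaluate the learned hypothesis
# Price = 24.85 + 7.18 * Room for a given (standardized) room value.
def predict_price(room):
    return 24.85 + 7.18 * room

print(round(predict_price(0.0), 2))  # at the average number of rooms -> 24.85
print(round(predict_price(1.0), 2))  # one standard deviation above   -> 32.03
```

Since the target is in units of $1000, these correspond to roughly $24,850 and $32,030.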
You will need a metric to determine whether the hypothesis is any good. We will use RMSE.
Root Mean Square Error (RMSE) is defined as the square root of the mean of the squared errors, where the error is the difference between the true and predicted values: RMSE = sqrt(mean((predicted - actual)^2)). It's popular because it is expressed in the same units as y, which in our scenario is the median price of a home.
def rmse(predict, actual):
    return np.sqrt(np.mean(np.square(predict - actual)))

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# Create and train the model
model = LinearRegression().fit(X_train, y_train)

# Generate predictions
predictions_test = model.predict(X_test)

# Compute the loss to evaluate the model
coefficient = model.coef_
intercept = model.intercept_
print(coefficient, intercept)
loss = rmse(predictions_test, y_test)
print('loss: ', loss)
print(model.score(X_test, y_test))  # R² score

[7.43327725] 24.912055881970886
loss:  3.9673165450580714
0.7552661033654667

The loss is about 3.96.
Since y is the median value of occupied homes in units of $1000, a loss of 3.96 means the predictions are off by about $3,960 on average.
When learning the model, you will see high variance across splits: the coefficient and intercept change each time the data is divided. This is because the train/test approach assigns rows to the train or test set at random, so the hypothesis changes every time the dataset is split.
This problem can be solved using a technique called cross-validation.
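Cross-validation fits and scores the model on several different train/test partitions and averages the results, so the estimate no longer depends on a single random split. A minimal sketch with scikit-learn's `cross_val_score` (synthetic data stands in for the housing features X and target y):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(506, 1))  # stand-in for one standardized feature, e.g. 'RM'
y = 24.85 + 7.18 * X[:, 0] + rng.normal(scale=4.0, size=506)

# 5-fold CV: every row is used for testing exactly once
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
print("RMSE per fold:", -scores)
print("mean RMSE:", -scores.mean())
```

The mean of the per-fold RMSE values is a more stable estimate of generalization error than any single split.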
With 'Forward Selection,' we'll iterate over the features, adding one at a time, to help us choose how many features to include in our model.
We'll use random_state=1 so that each iteration uses the same split and yields a reproducible outcome.
cols = []
los = []
los_train = []
scor = []
i = 0
while i < len(high_corr_var):
    cols.append(high_corr_var[i])
    # Select input variables
    X = new_df[cols]
    # Mean normalization
    mu, std, sta = standard(X)
    X = sta
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    # Fit the model to the training set
    lnreg = LinearRegression().fit(X_train, y_train)
    # Make predictions on the training set
    prediction_train = lnreg.predict(X_train)
    # Make predictions on the testing set
    prediction = lnreg.predict(X_test)
    # Compute the loss on the test and training sets
    loss = rmse(prediction, y_test)
    loss_train = rmse(prediction_train, y_train)
    los_train.append(loss_train)
    los.append(loss)
    # Compute the score
    score = lnreg.score(X_test, y_test)
    scor.append(score)
    i += 1
With a small set of variables the loss is large and the model underfits; with many variables the loss is smaller, but a model that fits the training data too precisely may not generalize well to new data.
For our model to generalize well on another set of data, 6 or 7 features is a good compromise. The features are considered in descending order of how strongly they correlate with price.
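The trade-off above can be seen in a small sketch of the same forward-selection idea (synthetic data, where only the first 5 of 13 candidate features actually drive the target; `rmse` is defined as earlier):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def rmse(predict, actual):
    return np.sqrt(np.mean(np.square(predict - actual)))

rng = np.random.default_rng(1)
n = 506
X_all = rng.normal(size=(n, 13))          # 13 candidate features
# only the first 5 columns actually drive the target
y = X_all[:, :5] @ np.array([7.0, 3.0, 2.0, 1.0, 0.5]) \
    + rng.normal(scale=3.0, size=n)

losses = {}
for k in (1, 5, 13):                      # number of features included
    X_train, X_test, y_train, y_test = train_test_split(
        X_all[:, :k], y, random_state=1)
    model = LinearRegression().fit(X_train, y_train)
    losses[k] = rmse(model.predict(X_test), y_test)
    print(k, "features -> test RMSE:", round(losses[k], 2))
```

The test loss drops sharply as the informative features come in, then flattens once the remaining features add only noise.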
high_corr_var
['RM', 'ZN', 'B', 'CHAS', 'RAD', 'DIS', 'CRIM', 'NOX', 'AGE', 'TAX', 'INDUS', 'PTRATIO', 'LSTAT']
'RM' has the strongest positive correlation with price, while 'LSTAT' is negatively correlated with it.
# Create a list of feature names
feature_cols = ['RM', 'ZN', 'B', 'CHAS', 'RAD', 'CRIM', 'DIS', 'NOX']

# Select input variables
X = new_df[feature_cols]

# Feature engineering: mean normalization (done before splitting,
# so the model actually trains on the standardized data)
mu, std, sta = standard(X)
X = sta

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit the model to the training data
lnreg = LinearRegression().fit(X_train, y_train)

# Make predictions on the testing set
prediction = lnreg.predict(X_test)

# Compute the loss
loss = rmse(prediction, y_test)
print('loss: ', loss)
lnreg.score(X_test, y_test)

loss:  3.212659865936143
0.8582338376696363
The test set yielded a loss of 3.21 and an R² score of about 0.86.
The model could still be improved by tuning hyperparameters, such as the regularization strength alpha (available in Ridge or Lasso regression; plain LinearRegression has none) or the learning rate if fitting by gradient descent. Alternatively, return to the preprocessing section and work on improving the feature distributions.
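As a sketch of what tuning alpha could look like, here is a Ridge regression (where alpha is a real hyperparameter controlling regularization strength) tuned with a cross-validated grid search. The data is a synthetic stand-in for the housing features:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(506, 8))             # stand-in feature matrix
y = X @ rng.normal(size=8) + rng.normal(scale=3.0, size=506)

# Try several regularization strengths with 5-fold cross-validation
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print("best alpha:", search.best_params_["alpha"])
```

The grid search fits one model per (alpha, fold) pair and keeps the alpha with the best mean cross-validated score, which avoids tuning against a single lucky split.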
https://www.websitescraper.com/how-to-predict-housing-prices-with-linear-regression.php
1599954420
This free introductory computer science and programming course is available via MIT’s Open Courseware platform. It’s a great resource for mastering the fundamentals of one of data science’s major requirements.
I shouldn’t have to tell you that programming is an important aspect of data science.
In order to implement computational solutions to data science problems, it is clear that programming is an absolute necessity. Regardless of whether you are visualizing data, performing exploratory data analysis, or implementing machine learning models, and whether you are using existing code bases and libraries or coding from scratch, writing code as a data scientist is required.
But stringing together disparate lines of code found via Google searches shouldn’t be the goal of an aspiring data scientist (or anyone else learning to program). An understanding of computer science principles, computational approaches to problem solving, and the fundamentals of programming, all independent of implementation programming language, should be the goal of anyone with a true desire to really learn how to code.
There are lots of ways to pick up programming and master the concepts of computer science. Obviously, some people will learn better with some approaches than with others. Thorough university courses, including lectures, readings, slides, and assignments, are one such approach.
#2020 sep tutorials # overviews #computer science #courses #mit #programming #python
1602738000
Programming is an important part of data science, as are the underlying concepts of computer science. If we plan to implement computational solutions to data science problems, it is clear that programming is an absolute necessity. To help those looking to establish or solidify these skills, we recently shared a great free course from MIT's Open Courseware to start with.
After one learns the basics of programming, pivoting to thinking computationally is a good transition step toward solving complex real-world problems, including from a data science perspective. Today we share Computational Thinking and Data Science, another top-notch MIT Open Courseware offering freely available to anyone interested in learning.
#2020 oct tutorials # overviews #computer science #courses #data science #mit #python
1593242571
Techtutorials lists the best online IT courses, training, tutorials, certification courses, and syllabi, from beginner to advanced level, on the latest technologies recommended by the programming community, whether video-based, book-based, free, paid, or drawn from real-world experience.
#techtutorials #online it courses #mobile app development courses #web development courses #online courses for beginners #advanced online courses