Web Scraping for Machine Learning with SQL Database

When I thought about how to angle "Web Scraping for Machine Learning", I realized that web scraping should be an essential skill for Data Scientists, Data Engineers and Machine Learning Engineers.

The Full Stack AI/ML Engineer toolkit needs to include web scraping, because it can improve predictions with new, high-quality data. Machine Learning inherently requires data, and we are most comfortable when we have as much high-quality data as possible. But what about when the data you need is not available as a dataset? What then? Do you just go and ask organizations and hope that they will kindly deliver it to you for free?

The answer is: you collect, label and store it yourself.

I made a GitHub repository for scraping the data. I encourage you to try it out and scrape some data yourself, and even try to build an NLP or other Machine Learning project out of the scraped data.

In this article, we are going to web scrape Reddit – specifically, the /r/DataScience (and a little of the /r/MachineLearning) subreddit. There will be no usage of the Reddit API, since we usually web scrape when an API is not available. Furthermore, you are going to learn to combine knowledge of HTML, Python, databases, SQL and datasets for Machine Learning. We finish with a small NLP sample project, but only to showcase that you can pick up the dataset and create a model that provides predictions.

Table of Contents
  1. Web Scraping in Python - Beautiful Soup and Selenium
    • Be careful - warning
    • Benchmarking How Long It Takes To Scrape
    • Initializing Classes From The Project
    • How To Scrape Reddit URLs
    • Getting Data From URLs
    • Converting Data to Python Dict
    • Extracting Data From Python Dict
      • Scraping The Comments
  2. Labelling Scraped Data
    • Naming Conventions
  3. Storing Scraped Data
    • Designing SQL Tables
    • Inserting Into SQL Tables
    • Exporting From SQL Tables
  4. Small Machine Learning Project on Exported Dataset
Web Scraping in Python With BeautifulSoup and Selenium

The first thing we need to do is install BeautifulSoup and Selenium for scraping, but for the whole project (i.e. also the Machine Learning part), we need a few more packages.

Selenium is great for content that changes dynamically due to JavaScript, while BeautifulSoup is great at capturing static HTML from pages. Both packages can be installed into your environment with a simple pip or conda command.

Install all the packages from the Github (linked at the start)

pip install -r requirements.txt

Alternatively, if you are using Google Colab, you can run the following to install the packages needed:

!pip install -r https://github.com/casperbh96/Web-Scraping-Reddit/raw/master/requirements.txt

Next, you need to download the chromedriver and place it in the core folder of the downloaded repository from GitHub.

Basic Warning of Web Scraping

Scrape responsibly, please! Reddit might update their website and invalidate the current approach of scraping the data from the site. If this is used in production, you would really want to set up an email/SMS service, so that you get immediate notice when your web scraper fails.

This is for educational purposes only, please don't misuse or do anything illegal with the code. It's provided as-is, by the MIT license of the GitHub repository.

Benchmarking How Long It Takes To Scrape

Scraping takes time. Remember that you have to open each page, let it load and then scrape the needed data. It can really be a tedious process – even figuring out where to start gathering the data can be hard, or even figuring out exactly what data you want.

There are 5 main steps for scraping reddit data:

  1. Collecting links from a subreddit
  2. Finding the 'script' tag from each link
  3. Turning collected data from links into Python dictionaries
  4. Getting the data from Python dictionaries
  5. Scraping comment data

Steps 2 and 5 in particular will take a long time, because they are the hardest to optimize.

Approximate benchmarks:

  • Step 2: about 1 second per post
  • Step 5: roughly $n/3$ seconds per post, where $n$ is the number of comments. Scraping 300 comments took about 100 seconds for me, but it varies with internet speed, CPU speed, etc. (see the estimate sketch below)
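
To get a feel for these numbers before committing to a long run, a quick back-of-the-envelope estimate helps. This is a minimal sketch; the post and comment counts are made up for illustration:

num_posts = 50                          # hypothetical number of posts to scrape
avg_comments_per_post = 30              # hypothetical average comment count
step_2_seconds = num_posts * 1          # ~1 second per post for the 'script' tag
step_5_seconds = num_posts * avg_comments_per_post / 3  # ~n/3 seconds per post for comments
print(f'Estimated scraping time: {step_2_seconds + step_5_seconds:.0f} seconds')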

Initializing Classes From The Project

We are going to be using 4 files with one class each, placed in the core folder.

Whenever you see SQL, SelScraper or BSS being called, it means we are calling a method from one of those classes, e.g. BSS.get_title(). We are going to skip over some of the code along the way, because roughly 1000 lines have been written for this project – too much to explain here.

from core.selenium_scraper import SeleniumScraper
from core.soup_scraper import SoupScraper
from core.progress_bar import ProgressBar
from core.sql_access import SqlAccess

SQL = SqlAccess()
SelScraper = SeleniumScraper()
BSS = SoupScraper(reddit_home,
                  slash,
                  subreddit)

Scraping URLs From A Subreddit

We start off by scraping the actual post URLs from a given subreddit, defined at the start of scraper.py. What happens is that we open a browser in headless mode (without opening a window, basically running in the background), and use some JavaScript to scroll the page while collecting links to posts.
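
The driver setup itself lives inside the SeleniumScraper class; a minimal sketch of how a headless Chrome driver can be created looks roughly like this (the chromedriver path is an assumption based on the setup step above, not necessarily how the repository does it):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome(executable_path='core/chromedriver', options=options)
driver.get('https://www.reddit.com/r/datascience/')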

The next snippet runs in a while loop until the variable scroll_n_times reaches zero:

while scroll_n_times:
    # Scrolls browser to the bottom of the page
    self.driver.execute_script(
                "window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(sleep_time)
    scroll_n_times -= 1

    elements = self.driver.find_elements_by_xpath(xpath)

    # Get the link from the href attribute
    self.links = [tag.get_attribute('href') for tag in elements]

The following is the JavaScript used to scroll a page to the bottom:

window.scrollTo(0, document.body.scrollHeight);

After that, we use an XPath expression to find all the anchor tags in the HTML body that point to posts, which really just returns all the post links to us:

xpath = "//a[@data-click-id='body']"
elements = self.driver.find_elements_by_xpath(xpath)
self.links = [tag.get_attribute('href') for tag in elements]

At any point in time, if an exception happens, or once we are done scrolling, we quit the driver so that the program won't end up with 14 different Chrome browsers running in the background:

try:
    ...  # the while loop shown above goes here
finally:
    self.driver.quit()

Great, now we have a collection of links that we can start scraping, but how? Well, we start by...

Getting the data attribute

Upon opening one of the collected links, Reddit provides us with a JavaScript 'script' element that contains all the data for each post. You won't be able to see it when visiting the page, since the data is loaded in and then removed from the visible page.

But our great software package BeautifulSoup will! And for our convenience, Reddit marked the 'script' tag with an id: id='data'. This makes it easy for us to find the element and capture all the data of the page.

First we specify the headers, which tell the website which user agent we are using (so as to avoid being detected as a bot). Next, we make a request and let BeautifulSoup get all the text from the page, i.e. all the HTML.

progress = ProgressBar(len(urls))
for url in urls:
    progress.update()
    headers = {'User-Agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=headers)

    soup = BeautifulSoup(r.text, 'html.parser')

    pure_html_data.append(r.text)
    pure_script_data.append(soup.find(id='data').text)

You see that last line: after we have told BeautifulSoup to load the page, we find the text of the script with id='data'. Since it's an id attribute, I knew that there would only be one element on the whole page with it – hence this is a pretty much bulletproof approach to getting the data. Unless, of course, Reddit changes their site.

You might think that we could parallelize these operations, so that we can run through multiple URLs in the for loop, but it's bad practice and might get you banned – bigger sites especially will protect their servers from overload, which is effectively what you risk causing when hammering their URLs with requests.
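
If you want to be extra gentle on the server, a simple throttle between requests is usually enough. A minimal sketch; the 2-second delay and the example URL are arbitrary choices, not something the project enforces:

import time
import requests

urls = ['https://www.reddit.com/r/datascience/']  # stand-in for the collected post links

for url in urls:
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    # ... parse r.text with BeautifulSoup as shown above ...
    time.sleep(2)  # pause between requests instead of parallelizing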

Converting the 'script' string to Python dictionary

From the last code piece, we get a string which is in valid JSON format. We want to convert this string into a Python dict, so that we can easily look up the data for each link.

The basic approach is to find the index of the first left curly bracket, where the JSON starts, using index(), and to find the end with rfind() on the last right curly bracket – plus one to get the actual end index for slicing.

pure_dicts = []

print('Making Python dicts out of script data')

progress = ProgressBar(len(script_data))
for data in script_data:
    progress.update()

    first_index = data.index('{')
    last_index = data.rfind('}') + 1

    json_str = data[first_index:last_index]

    pure_dicts.append(json.loads(json_str))

return pure_dicts

Effectively, this gives us a Python dict of the whole Reddit data which is loaded into every post. This will be great for scraping all the data and storing it.

Actually Getting The Data (From Python dictionary)

I defined one for loop which iteratively scrapes the different pieces of data. This makes it easy to maintain and to find your mistakes.

As you can see, we get quite a lot of data – basically the whole post and its comments, except for the comment text, which we get later on.

progress = ProgressBar(len(links))
for i, current_data in enumerate(BSS.data):
    progress.update()

    BSS.get_url_id_and_url_title(BSS.urls[i],
                                 current_data, i)
    BSS.get_title()
    BSS.get_upvote_ratio()
    BSS.get_score()
    BSS.get_posted_time()
    BSS.get_author()
    BSS.get_flairs()
    BSS.get_num_gold()
    BSS.get_category()
    BSS.get_total_num_comments()
    BSS.get_links_from_post()
    BSS.get_main_link()
    BSS.get_text()
    BSS.get_comment_ids()

Scraping Comments Text Data

After we have collected all this data, we need to begin thinking about storing it. But we couldn't manage to scrape the text of the comments in the big for loop, so we will have to do that before we set up the inserts into the database tables.

This next code piece is quite long, but it's all you need. Here we collect comment text, score, author, upvote points and depth. Each comment can have subcomments, and depth tracks how deep we are in such a thread, e.g. depth is zero at each root comment.

for i, comment_url in enumerate(comment_id_links_array):
    author = None
    text = None
    score = None
    comment_id = array_of_comment_ids[i]

    # Open the url, find the div with classes Comment.t1_comment_id
    r = requests.get(comment_url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    div = soup.select('div.Comment.t1_{0}'.format(comment_id))

    # If it found the div, let's extract the text from the comment
    if div is not None and len(div) > 0:
        author = div[0].find_all('a')[0].get_text()
        spans = div[0].find_all("span")
        score = [spans[i].get_text() for i in range(len(spans)) if 'point' in spans[i].get_text()]

        html_and_text = div[0].find('div', attrs={'data-test-id' : 'comment'})
        if html_and_text is not None:
            text = html_and_text.get_text()

        if len(score) == 0:
            score = None
        else:
            score = score[0]

    # Make useable array for insertion
    array_of_comment_data.append([None,
                                  None,
                                  str(comment_id),
                                  str(score),
                                  array_of_depth[i],
                                  str(array_of_next[i]),
                                  str(array_of_prev[i]),
                                  str(author),
                                  str(text)])

return array_of_comment_data

What I found to work was opening each of the comment URLs collected in the earlier for loop and basically scraping once again. This ensures that we get all the comment data scraped.

Note: this approach to scraping the comments can be really slow! This is probably the number one thing to improve – it's completely dependent on the number of comments on a post, and scraping more than 100 comments takes a pretty long time.

Labelling The Collected Data

For the labelling part, we are mostly going to focus on tasks we can finish immediately with Python code, instead of tasks that we cannot. For instance, labelling images found on Reddit is probably not feasible with a script, but actually has to be done manually.

Naming Conventions

For the SQL in this article, we use Snake Case for naming the features. An example of this naming convention is my_feature_x, i.e. we split words with underscores, and only lower case.

...But please:

If you are working in a company, look at the naming conventions they use and follow them. The last thing you want is a mix of different styles, as it will just be confusing at the end of the day. Other common naming conventions include camelCase and PascalCase.

Storing The Labelled Data

For storing the collected and labelled data, I have specifically chosen to proceed with an SQLite database, since it's way easier for smaller projects like web scraping. We don't have to install any drivers as we would for MySQL or MS SQL servers, and we don't even need to install any packages to use it, because it comes bundled with Python.

Some consideration has been given to the data types for the columns in the SQLite database, but there is room for improvement in its current form. I used some cheap varchars in the comment table to get around some storage problems. It currently does not give me any problems, but it should probably be revisited in the future.
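
For reference, this is roughly all it takes to talk to SQLite from Python with the built-in sqlite3 module. The file name here is just an example; the project wraps this kind of logic inside the SqlAccess class:

import sqlite3

conn = sqlite3.connect('reddit.db')  # creates the file if it doesn't exist yet
c = conn.cursor()
c.execute('SELECT sqlite_version();')
print(c.fetchone())
conn.close()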

Designing and Creating Tables

The database design mostly follows the 1st normal form, i.e. we have separate tables for links and comments to avoid duplicate rows in the post table. Further improvements include making tables for categories and flairs, which are currently stored as a string built from an array.

Without further ado, let me present the database diagram for this project. It's quite simple.

In short: we have 3 tables. For each row with a post id in the post table, we can have multiple rows with the same post id in the comment and link table. This is how we link the post to all links and comments. The link exists because the post_id in the link and comment table has a foreign key on the post_id in the post table, which is the primary key.

I used an excellent tool for generating this database diagram, which I want to highlight – currently it's free and it's called https://dbdiagram.io/.

Play around with it, if you wish. This is not a paid endorsement of any sorts, just a shoutout to a great, free tool. Here is my code for the above diagram:

Table post as p {
  id int [pk]
  url varchar [not null]
  url_id varchar [not null]
  url_title varchar [not null]
  author varchar
  upvote_ratio uint8
  score int
  time_created datetime
  num_gold int
  num_comments int
  category varchar
  text varchar
  main_link varchar
  flairs int
}

Table link as l {
  id int [pk]
  post_id int
  link varchar [not null]
}

Table comment as c {
  id int [pk]
  post_id int
  comment_id varchar [not null]
  score varchar [not null]
  depth int [not null]
  next varchar
  previous varchar
  comment_author varchar
  text varchar
}

Ref: p.id < l.post_id
Ref: p.id < c.post_id

Actually Inserting The Data

The database and tables are generated automatically by some code which I set up to run at the start, so we will not cover that part here, but rather the part where we insert the actual data.

The first step is creating and/or connecting to the database (which will either automatically generate the database and tables, and/or just connect to the existing database).

After that, we begin inserting data.

try:
    SQL.create_or_connect_db(erase_first=erase_db_first)
    # [0] = post, [1] = comment, [2] = link
    for i in range(len(BSS.post_data)):
        SQL.insert('post', data = BSS.post_data[i])
        SQL.insert('link', data = BSS.link_data[i])

        if scrape_comments:
            SQL.insert('comment', data = BSS.comment_data[i])
except Exception as ex:
    print(ex)
finally:
    SQL.save_changes()

Inserting the data happens in a for loop, as can be seen in the code snippet above. We specify the table and the data which we want to insert.

For the next step, we need to get the number of columns in the table we are inserting into. From that number, we build a string of question marks – one question mark per column, separated by commas – because that is how the SQL VALUES syntax expects its placeholders.

The data is passed into a function I called insert(), and the data variable is an array in the form of a row. Basically, we have already concatenated all the data into an array, and now we are ready to insert.
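
The helper that builds the placeholder string is not shown here. A minimal sketch of what _question_mark_creator() might look like – the name comes from the call in the snippet below, the body is my assumption:

def _question_mark_creator(self, num_cols):
    # One '?' per column, joined by commas, e.g. 3 columns -> '?, ?, ?'
    return ', '.join(['?'] * num_cols)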

cols = c.execute(('''
                PRAGMA table_info({0})
                ''').format(table))

# Get the number of columns
num_cols = sum([1 for i in cols]) - 1

# Generate question marks for VALUES insertion
question_marks = self._question_mark_creator(num_cols)

if table == 'post':
    c.execute(('''INSERT INTO {0}
                  VALUES ({1})'''
              ).format(table, question_marks), data)

    self.last_post_id = c.lastrowid

elif (table == 'comment' or table == 'link') \
     and data != None and data != []:
    # setting post_id to the last post id, inserted in the post insert
    for table_data in data:
        table_data[1] = self.last_post_id
        c.execute(('''INSERT INTO {0}
                      VALUES ({1})'''
                  ).format(table, question_marks), table_data)

This wraps up scraping and inserting into a database, but how about...

Exporting From SQL Databases

For this project, I made three datasets, one of which I used for a Machine Learning project in the next section of this article.

  1. A dataset with posts, comments and links data
  2. A dataset with the post only
  3. A dataset with comments only

For these three datasets, I made a Python file called make_dataset.py for creating the datasets and saving them using Pandas and some SQL query.

For the first dataset, we use a LEFT JOIN from the SQL syntax (which I won't go into detail about), and it provides the dataset that we wish for. You do have to filter out a lot of nulls if you want to use this dataset for anything, i.e. a lot of data cleaning, but once that's done, you can use the whole dataset.

all_data = pd.read_sql_query("""
SELECT *
FROM post p 
LEFT JOIN comment c 
    ON p.id = c.post_id
LEFT JOIN link l
    ON p.id = l.post_id;
""", c)

all_data.to_csv('data/post_comment_link_data.csv', columns=all_data.columns, index=False)
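
As mentioned, the joined dataset contains many nulls from posts without comments or links. A minimal cleaning sketch when loading it back with pandas – the column names are taken from the table definitions above, so adjust to what your export actually contains:

import pandas as pd

df = pd.read_csv('data/post_comment_link_data.csv')
df = df.dropna(subset=['comment_id', 'link'])  # keep rows that actually have a comment and a link
print(df.shape)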

For the second and third datasets, a simple SELECT * FROM table query was enough. This needs no further explanation.

post = pd.read_sql_query("""
SELECT *
FROM post;
""", c)

comment = pd.read_sql_query("""
SELECT *
FROM comment;
""", c)

post.to_csv('data/post_data.csv', columns=post.columns, index=False)
comment.to_csv('data/comment_data.csv', columns=comment.columns, index=False)

Machine Learning Project Based On This Dataset

From the three generated datasets, I wanted to show you how to do a basic machine learning project.

The results are not amazing, but we are trying to classify each comment into four categories: exceptional, good, average and bad – all based on the upvotes a comment received.

Let's start! Firstly, we import the functions and packages we need, along with the dataset, which is the comments table. But we only import the score (upvotes) and comment text from that dataset.

import copy
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

pd.options.mode.chained_assignment = None
df = pd.read_csv('data/comment_data.csv', usecols=['score', 'text'])

The next thing we have to do is clean the dataset. Firstly, we start off with getting only the text by using a regular expression (regex). This removes any weird characters like \ | / & % etc.

The next thing we do concerns the score feature. The score feature is formatted as a string, and we just need the number from that string – but we also want the minus sign in front of it, in case a comment has been downvoted a lot. Another regex is used to do this.

The next ugly thing left over from our scraping script is the value None stored as a string. We replace this string with an actual NaN, so that we can run df.dropna().

The last thing we need to do is convert the score to a float, since that is required later on.

def prepare_data(df):
    # Remove everything except alphanumeric characters
    df.text = df.text.str.replace(r'[^a-zA-Z\s]', '', regex=True)

    # Get only numbers, but allow minus in front
    df.score = df.score.str.extract(r'(^-?[0-9]*\S+)', expand=False)

    # Remove rows with None as string
    df.score = df.score.replace('None', np.nan)

    # Remove all None
    df = df.dropna()

    # Convert score feature from string to float
    score = df.score.astype(float)
    df.score = copy.deepcopy(score)

    return df

df = prepare_data(df)

The next part is defining how our classification is going to work, so we have to adapt the data. An easy way is using percentiles (perhaps not an ideal way).

For this, we find the 50th, 75th and 95th quantiles of the scores and label each comment according to which bracket it falls into, with everything below the 50th quantile labelled as bad. We replace the score feature with these new labels.

def score_to_percentile(df):
    second = df.score.quantile(0.50) # Average
    third = df.score.quantile(0.75) # Good
    fourth = df.score.quantile(0.95) # exceptional

    new_score = []

    for i, row in enumerate(df.score):
        if row > fourth:
            new_score.append('exceptional')
        elif row > third:
            new_score.append('good')
        elif row > second:
            new_score.append('average')
        else:
            new_score.append('bad')

    df.score = new_score

    return df

df = score_to_percentile(df)

We need to split the data and tokenize the text, which we proceed to do in the following code snippet. There is really not much magic happening here, so let's move on.

def df_split(df):
    y = df[['score']]
    X = df.drop(['score'], axis=1)

    content = [' ' + comment for comment in X.text.values]
    X = CountVectorizer().fit_transform(content).toarray()

    X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.33, random_state=42)

    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = df_split(df)

The last part of this machine learning project tutorial is making predictions and scoring the algorithm we choose to go with.

Logistic Regression was used, and perhaps we did not get the best score, but this is merely a boilerplate for future improvement and use.

All we do here is fit the logistic regression model to the training data, make a prediction and then score how well the model predicted.

lr = LogisticRegression(C=0.05, solver='lbfgs', multi_class='multinomial')
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
score = accuracy_score(y_test, pred)

print ("Accuracy: {0}".format(score))

In our case, the predictions were not that great, as the accuracy turned out to be $0.59$.

Future work includes:

  • Fine-tuning different algorithms
  • Trying different metrics (see the sketch below)
  • Better data cleaning
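
As a small example of the second point, accuracy alone hides how the model does on each of the four classes. scikit-learn's classification_report gives per-class precision and recall on the same predictions we already made:

from sklearn.metrics import classification_report

# Flatten y_test to a 1-D array to avoid shape warnings
print(classification_report(y_test.values.ravel(), pred))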

Originally published by Casper Hansen at https://mlfromscratch.com

Connect to Microsoft SQL Server database on MacOS using Python

There are always situations where I want to automate small tasks. I like using Python for these kinds of things; you can quickly get something working without much hassle. I needed to perform some database changes in a Microsoft SQL Server database and wanted to connect to it using Python. On Windows this is usually pretty straightforward, but I use macOS as my main operating system and I had some hurdles along the way, so here is a quick how-to.

Preparing

If Homebrew isn't installed yet let's do that first. It's an excellent package manager for macOS. Paste the following command in the terminal to install it:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Once finished run the following commands to install the Microsoft ODBC Driver 13.1 for SQL Server. This driver will be used to connect to the MS SQL database.

brew tap microsoft/mssql-release https://github.com/Microsoft/homebrew-mssql-release
brew update
brew install msodbcsql@13.1.9.2 mssql-tools@14.0.6.0

Python package: pyodbc or pypyodbc

With the driver installed, we'll be using a Python package that uses the driver to connect to the database. There are two frequently used packages to connect in Python: pyodbc and pypyodbc. pypyodbc is a pure-Python re-implementation of pyodbc. The main thing I took away was that pypyodbc is easier to set up on different platforms. I could not get pyodbc working; it simply wouldn't connect to the database.

Installing pypyodbc can be done using pip, the python package installer.

pip install pypyodbc

Code

Now that the necessary components have been installed, let's open our favorite code editor and write code. The first thing that needs to be done is importing pypyodbc. The script starts with this:

import pypyodbc

Then we need a connection to the database:

sqlConnection = pypyodbc.connect(
    "Driver={ODBC Driver 13 for SQL Server};"
    "Server=<server IP address>;"
    "Database=<database>;"
    "uid=<username>;pwd=<password>")

Note that you have to replace four values in this code: server IP address, database, username and password. The value for the driver is a hard-coded value which indicates what driver to use to connect to the database; it points to the driver that was installed earlier.

Now all that's left is to use the connection and run a query.

cursor = sqlConnection.cursor()
cursor.execute("select * from Values")

The cursor now contains the result of our query, and the property cursor.rowcount returns the number of rows the query returned. It's now possible to loop through the rows and access the different columns:

for row in cursor:
    print(row)
    # first column
    firstColumn = row[0]
    # ...

When we're done, we need to clean up the cursor and database connection by closing it.

cursor.close()
sqlConnection.close()

And that's it! Save the file and run it with the python <filename>.py or python3 <filename>.py command, depending on your configuration. Here is the entire script:

import pypyodbc

sqlConnection = pypyodbc.connect(
    "Driver={ODBC Driver 13 for SQL Server};"
    "Server=<server IP address>;"
    "Database=<database>;"
    "uid=<username>;pwd=<password>")

cursor = sqlConnection.cursor()
cursor.execute("select * from Values")

for row in cursor:
    print(row)
    # first column
    firstColumn = row[0]
    # ...

cursor.close()
sqlConnection.close()

The with syntax can also be used to automatically close the cursor and the connection. This is another way of writing the same script:

import pypyodbc

with pypyodbc.connect(
        "Driver={ODBC Driver 13 for SQL Server};"
        "Server=<server IP address>;"
        "Database=<database>;"
        "uid=<username>;pwd=<password>") as sqlConnection:

    with sqlConnection.cursor() as cursor:

        cursor.execute("select * from Values")

        for row in cursor:
            print(row)
            # first column
            firstColumn = row[0]
            # ...


Learn to Build SQL Query | Ultimate SQL and Database Concepts | Simpliv

Description
SQL developers earn higher salaries in the IT industry, but it's not just about writing queries – it's about understanding and applying the right query at the right time, and this course will help you understand complex SQL statements in an easy way.

Moreover, this course will teach you how to extract data from a database and write complex queries against it. It covers Structured Query Language (SQL) concepts as a whole, whether students work with MySQL, Microsoft SQL Server, Oracle Server, etc.

This course has 5 chapters, in which you will learn:

Chapter 1: Fundamentals
  • Fundamentals
  • Building Blocks
  • Selecting Records from DB
  • Working with Arithmetic Expressions

Chapter 2: Conditioning, Sorting and Operators
  • Logical Operators
  • Comparison Operators
  • Operator Precedence
  • Sorting Results

Chapter 3: Functions
  • Character Functions
  • Number Functions
  • Date Functions
  • Conversions
  • General Purpose Functions
  • Nesting Functions

Chapter 4: Grouping
  • Multiple Row Functions on a Single Table
  • Multiple Row Functions on Many Tables

Chapter 5: Joins
  • Understanding Primary Keys
  • Understanding Foreign Keys
  • Understanding the Need for Joins
  • Cartesian Product
  • Equi Join / Simple Join / Self Join
  • Non-Equi Join
  • Outer Join
  • Self Join

The course is designed for college and university students who want solid SQL and database concepts in a short period of time.

Who this course is for:

  • Beginners
  • University or college students
  • Anyone who wants solid SQL concepts

Basic knowledge:

  • No prior knowledge is required
  • PC or Mac

What you will learn:

  • SQL fundamentals
  • Understand complex SQL concepts in an easy way using daily-life examples
  • Construct SQL statements
  • Use SQL to retrieve data from a database
  • Selecting data from a database
  • Restricting and sorting data from a DB
  • Grouping data from a DB
  • Construct SQL statements that work with more than two tables
  • Use SQL functions
  • Work with SQL operators and their precedence
  • Nesting in SQL
  • Joins

How to Prevent SQL Injection Attacks With Python

Every few years, the Open Web Application Security Project (OWASP) ranks the most critical web application security risks. Since the first report, injection risks have always been on top. Among all injection types, SQL injection is one of the most common attack vectors, and arguably the most dangerous. As Python is one of the most popular programming languages in the world, knowing how to protect against Python SQL injection is critical.

In this tutorial, you’re going to learn:

  • What Python SQL injection is and how to prevent it
  • How to compose queries with both literals and identifiers as parameters
  • How to safely execute queries in a database


This tutorial is suited for users of all database engines. The examples here use PostgreSQL, but the results can be reproduced in other database management systems (such as SQLite, MySQL, Microsoft SQL Server, Oracle, and so on).

Understanding Python SQL Injection

SQL Injection attacks are such a common security vulnerability that the legendary xkcd webcomic devoted a comic to it:

"Exploits of a Mom" (Image: xkcd)

Generating and executing SQL queries is a common task. However, companies around the world often make horrible mistakes when it comes to composing SQL statements. While the ORM layer usually composes SQL queries, sometimes you have to write your own.

When you use Python to execute these queries directly into a database, there’s a chance you could make mistakes that might compromise your system. In this tutorial, you’ll learn how to successfully implement functions that compose dynamic SQL queries without putting your system at risk for Python SQL injection.

Setting Up a Database

To get started, you’re going to set up a fresh PostgreSQL database and populate it with data. Throughout the tutorial, you’ll use this database to witness firsthand how Python SQL injection works.

Creating a Database

First, open your shell and create a new PostgreSQL database owned by the user postgres:

$ createdb -O postgres psycopgtest

Here you used the command line option -O to set the owner of the database to the user postgres. You also specified the name of the database, which is psycopgtest.

Note: postgres is a special user, which you would normally reserve for administrative tasks, but for this tutorial, it’s fine to use postgres. In a real system, however, you should create a separate user to be the owner of the database.

Your new database is ready to go! You can connect to it using psql:

$ psql -U postgres -d psycopgtest
psql (11.2, server 10.5)
Type "help" for help.

You’re now connected to the database psycopgtest as the user postgres. This user is also the database owner, so you’ll have read permissions on every table in the database.

Creating a Table With Data

Next, you need to create a table with some user information and add data to it:

psycopgtest=# CREATE TABLE users (
    username varchar(30),
    admin boolean
);
CREATE TABLE

psycopgtest=# INSERT INTO users
    (username, admin)
VALUES
    ('ran', true),
    ('haki', false);
INSERT 0 2

psycopgtest=# SELECT * FROM users;
 username | admin
----------+-------
 ran      | t
 haki     | f
(2 rows)

The table has two columns: username and admin. The admin column indicates whether or not a user has administrative privileges. Your goal is to target the admin field and try to abuse it.

Setting Up a Python Virtual Environment

Now that you have a database, it’s time to set up your Python environment.

Create your virtual environment in a new directory:

~/src $ mkdir psycopgtest
~/src $ cd psycopgtest
~/src/psycopgtest $ python3 -m venv venv

After you run this command, a new directory called venv will be created. This directory will store all the packages you install inside the virtual environment.

Connecting to the Database

To connect to a database in Python, you need a database adapter. Most database adapters follow version 2.0 of the Python Database API Specification, PEP 249, and every major database engine has a leading adapter.

To connect to a PostgreSQL database, you’ll need to install Psycopg, which is the most popular adapter for PostgreSQL in Python. Django ORM uses it by default, and it’s also supported by SQLAlchemy.

In your terminal, activate the virtual environment and use pip to install psycopg:

~/src/psycopgtest $ source venv/bin/activate
~/src/psycopgtest $ python -m pip install "psycopg2>=2.8.0"
Collecting psycopg2
  Using cached https://....
  psycopg2-2.8.2.tar.gz
Installing collected packages: psycopg2
  Running setup.py install for psycopg2 ... done
Successfully installed psycopg2-2.8.2

Now you’re ready to create a connection to your database. Here’s the start of your Python script:

import psycopg2

connection = psycopg2.connect(
    host="localhost",
    database="psycopgtest",
    user="postgres",
    password=None,
)
connection.set_session(autocommit=True)

You used psycopg2.connect() to create the connection. This function accepts the following arguments:

  • host is the IP address or the DNS of the server where your database is located. In this case, the host is your local machine, or localhost.

  • database is the name of the database to connect to. You want to connect to the database you created earlier, psycopgtest.

  • user is a user with permissions for the database. In this case, you want to connect to the database as the owner, so you pass the user postgres.

  • password is the password for whoever you specified in user. In most development environments, users can connect to the local database without a password.

After setting up the connection, you configured the session with autocommit=True. Activating autocommit means you won’t have to manually manage transactions by issuing a commit or rollback. This is the default behavior in most ORMs. You use this behavior here as well so that you can focus on composing SQL queries instead of managing transactions.
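
For comparison, if you had not enabled autocommit, you would manage the transaction yourself. A minimal sketch with psycopg2 – the INSERT statement is just an illustrative example against the users table created above:

try:
    with connection.cursor() as cursor:
        cursor.execute(
            "INSERT INTO users (username, admin) VALUES (%s, %s)",
            ('new_user', False),
        )
    connection.commit()       # make the change permanent
except Exception:
    connection.rollback()     # undo the transaction on any error
    raise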

Note: Django users can get the instance of the connection used by the ORM from django.db.connection:

from django.db import connection

Executing a Query

Now that you have a connection to the database, you’re ready to execute a query:

>>> with connection.cursor() as cursor:
...     cursor.execute('SELECT COUNT(*) FROM users')
...     result = cursor.fetchone()
...     print(result)
...
(2,)

You used the connection object to create a cursor. Just like a file in Python, cursor is implemented as a context manager. When you create the context, a cursor is opened for you to use to send commands to the database. When the context exits, the cursor closes and you can no longer use it.

While inside the context, you used cursor to execute a query and fetch the results. In this case, you issued a query to count the rows in the users table. To fetch the result from the query, you executed cursor.fetchone() and received a tuple. Since the query can only return one result, you used fetchone(). If the query were to return more than one result, then you’d need to either iterate over cursor or use one of the other fetch* methods.
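
For example, if you select every row from the users table, you can simply iterate over the cursor instead of calling fetchone(). A short sketch using the same connection:

with connection.cursor() as cursor:
    cursor.execute('SELECT username, admin FROM users')
    for username, admin in cursor:  # the cursor is iterable over result rows
        print(username, admin)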

Using Query Parameters in SQL

In the previous section, you created a database, established a connection to it, and executed a query. The query you used was static. In other words, it had no parameters. Now you’ll start to use parameters in your queries.

First, you’re going to implement a function that checks whether or not a user is an admin. is_admin() accepts a username and returns that user’s admin status:

# BAD EXAMPLE. DON'T DO THIS!
def is_admin(username: str) -> bool:
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT
                admin
            FROM
                users
            WHERE
                username = '%s'
        """ % username)
        result = cursor.fetchone()
    admin, = result
    return admin

This function executes a query to fetch the value of the admin column for a given username. You used fetchone() to return a tuple with a single result. Then, you unpacked this tuple into the variable admin. To test your function, check some usernames:

>>> is_admin('haki')
False
>>> is_admin('ran')
True

So far so good. The function returned the expected result for both users. But what about a non-existent user? Take a look at this Python traceback:

>>> is_admin('foo')
Traceback (most recent call last):
  File "", line 1, in 
  File "", line 12, in is_admin
TypeError: cannot unpack non-iterable NoneType object

When the user does not exist, a TypeError is raised. This is because .fetchone() returns None when no results are found, and unpacking None raises a TypeError. The error is raised at the line where you unpack result into admin.

To handle non-existing users, create a special case for when result is None:

# BAD EXAMPLE. DON'T DO THIS!
def is_admin(username: str) -> bool:
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT
                admin
            FROM
                users
            WHERE
                username = '%s'
        """ % username)
        result = cursor.fetchone()

    if result is None:
        # User does not exist
        return False

    admin, = result
    return admin

Here, you’ve added a special case for handling None. If username does not exist, then the function should return False. Once again, test the function on some users:

>>> is_admin('haki')
False
>>> is_admin('ran')
True
>>> is_admin('foo')
False

Great! The function can now handle non-existing usernames as well.


Exploiting Query Parameters With Python SQL Injection

In the previous example, you used string interpolation to generate a query. Then, you executed the query and sent the resulting string directly to the database. However, there’s something you may have overlooked during this process.

Think back to the username argument you passed to is_admin(). What exactly does this variable represent? You might assume that username is just a string that represents an actual user’s name. As you’re about to see, though, an intruder can easily exploit this kind of oversight and cause major harm by performing Python SQL injection.

Try to check if the following user is an admin or not:

>>> is_admin("'; select true; --")
True

Wait… What just happened?

Let’s take another look at the implementation. Print out the actual query being executed in the database:

>>> print("select admin from users where username = '%s'" % "'; select true; --")
select admin from users where username = ''; select true; --'

The resulting text contains three statements. To understand exactly how Python SQL injection works, you need to inspect each part individually. The first statement is as follows:

select admin from users where username = '';

This is your intended query. The semicolon (;) terminates the query, so the result of this query does not matter. Next up is the second statement:

select true;

This statement was constructed by the intruder. It’s designed to always return True.

Lastly, you see this short bit of code:

--'

This snippet defuses anything that comes after it. The intruder added the comment symbol (--) to turn everything you might have put after the last placeholder into a comment.

When you execute the function with this argument, it will always return True. If, for example, you use this function in your login page, an intruder could log in with the username '; select true; --, and they’ll be granted access.

If you think this is bad, it could get worse! Intruders with knowledge of your table structure can use Python SQL injection to cause permanent damage. For example, the intruder can inject an update statement to alter the information in the database:

>>> is_admin('haki')
False
>>> is_admin("'; update users set admin = 'true' where username = 'haki'; select true; --")
True
>>> is_admin('haki')
True

Let’s break it down again:

';

This snippet terminates the query, just like in the previous injection. The next statement is as follows:

update users set admin = 'true' where username = 'haki';

This section updates admin to true for user haki.

Finally, there’s this code snippet:

select true; --

As in the previous example, this piece returns true and comments out everything that follows it.

Why is this worse? Well, if the intruder manages to execute the function with this input, then user haki will become an admin:

psycopgtest=# select * from users;
 username | admin
----------+-------
 ran      | t
 haki     | t
(2 rows)

The intruder no longer has to use the hack. They can just log in with the username haki. (If the intruder really wanted to cause harm, then they could even issue a DROP DATABASE command.)

Before you forget, restore haki back to its original state:

psycopgtest=# update users set admin = false where username = 'haki';
UPDATE 1

So, why is this happening? Well, what do you know about the username argument? You know it should be a string representing the username, but you don’t actually check or enforce this assertion. This can be dangerous! It’s exactly what attackers are looking for when they try to hack your system.

Crafting Safe Query Parameters

In the previous section, you saw how an intruder can exploit your system and gain admin permissions by using a carefully crafted string. The issue was that you allowed the value passed from the client to be executed directly to the database, without performing any sort of check or validation. SQL injections rely on this type of vulnerability.

Any time user input is used in a database query, there’s a possible vulnerability for SQL injection. The key to preventing Python SQL injection is to make sure the value is being used as the developer intended. In the previous example, you intended for username to be used as a string. In reality, it was used as a raw SQL statement.

To make sure values are used as they’re intended, you need to escape the value. For example, to prevent intruders from injecting raw SQL in the place of a string argument, you can escape quotation marks:

>>> # BAD EXAMPLE. DON'T DO THIS!
>>> username = username.replace("'", "''")

This is just one example. There are a lot of special characters and scenarios to think about when trying to prevent Python SQL injection. Luckily for you, modern database adapters come with built-in tools for preventing Python SQL injection by using query parameters. These are used instead of plain string interpolation to compose a query with parameters.

Note: Different adapters, databases, and programming languages refer to query parameters by different names. Common names include bind variables, replacement variables, and substitution variables.

Now that you have a better understanding of the vulnerability, you’re ready to rewrite the function using query parameters instead of string interpolation:

def is_admin(username: str) -> bool:
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT
                admin
            FROM
                users
            WHERE
                username = %(username)s
        """, {
            'username': username
        })
        result = cursor.fetchone()

    if result is None:
        # User does not exist
        return False

    admin, = result
    return admin

Here’s what’s different in this example:

  • In line 9, you used a named parameter username to indicate where the username should go. Notice how the parameter username is no longer surrounded by single quotation marks.

  • In line 11, you passed the value of username as the second argument to cursor.execute(). The connection will use the type and value of username when executing the query in the database.

To test this function, try some valid and invalid values, including the dangerous string from before:

>>> is_admin('haki')
False
>>> is_admin('ran')
True
>>> is_admin('foo')
False
>>> is_admin("'; select true; --")
False

Amazing! The function returned the expected result for all values. What’s more, the dangerous string no longer works. To understand why, you can inspect the query generated by execute():

>>> with connection.cursor() as cursor:
...    cursor.execute("""
...        SELECT
...            admin
...        FROM
...            users
...        WHERE
...            username = %(username)s
...    """, {
...        'username': "'; select true; --"
...    })
...    print(cursor.query.decode('utf-8'))
SELECT
    admin
FROM
    users
WHERE
    username = '''; select true; --'

The connection treated the value of username as a string and escaped any characters that might terminate the string and introduce Python SQL injection.

Passing Safe Query Parameters

Database adapters usually offer several ways to pass query parameters. Named placeholders are usually the best for readability, but some implementations might benefit from using other options.

Let’s take a quick look at some of the right and wrong ways to use query parameters. The following code block shows the types of queries you’ll want to avoid:

# BAD EXAMPLES. DON'T DO THIS!
cursor.execute("SELECT admin FROM users WHERE username = '" + username + '");
cursor.execute("SELECT admin FROM users WHERE username = '%s' % username);
cursor.execute("SELECT admin FROM users WHERE username = '{}'".format(username));
cursor.execute(f"SELECT admin FROM users WHERE username = '{username}'");

Each of these statements passes username from the client directly to the database, without performing any sort of check or validation. This sort of code is ripe for inviting Python SQL injection.

In contrast, these types of queries should be safe for you to execute:

# SAFE EXAMPLES. DO THIS!
cursor.execute("SELECT admin FROM users WHERE username = %s'", (username, ));
cursor.execute("SELECT admin FROM users WHERE username = %(username)s", {'username': username});

In these statements, username is passed as a query parameter – positionally in the first statement and by name in the second. Now, the database will use the specified type and value of username when executing the query, offering protection from Python SQL injection.

Using SQL Composition

So far you’ve used parameters for literals. Literals are values such as numbers, strings, and dates. But what if you have a use case that requires composing a different query—one where the parameter is something else, like a table or column name?

Inspired by the previous example, let’s implement a function that accepts the name of a table and returns the number of rows in that table:

# BAD EXAMPLE. DON'T DO THIS!
def count_rows(table_name: str) -> int:
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT
                count(*)
            FROM
                %(table_name)s
        """, {
            'table_name': table_name,
        })
        result = cursor.fetchone()

    rowcount, = result
    return rowcount

Try to execute the function on your users table:

Traceback (most recent call last):
  File "", line 1, in 
  File "", line 9, in count_rows
psycopg2.errors.SyntaxError: syntax error at or near "'users'"
LINE 5:                 'users'
                        ^

The command failed to generate the SQL. As you’ve seen already, the database adapter treats the variable as a string or a literal. A table name, however, is not a plain string. This is where SQL composition comes in.

You already know it’s not safe to use string interpolation to compose SQL. Luckily, Psycopg provides a module called psycopg.sql to help you safely compose SQL queries. Let’s rewrite the function using psycopg.sql.SQL():

from psycopg2 import sql

def count_rows(table_name: str) -> int:
    with connection.cursor() as cursor:
        stmt = sql.SQL("""
            SELECT
                count(*)
            FROM
                {table_name}
        """).format(
            table_name = sql.Identifier(table_name),
        )
        cursor.execute(stmt)
        result = cursor.fetchone()

    rowcount, = result
    return rowcount

There are two differences in this implementation. First, you used sql.SQL() to compose the query. Then, you used sql.Identifier() to annotate the argument value table_name. (An identifier is a column or table name.)
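
If you want to see what the composed statement actually looks like, psycopg2.sql objects can be rendered against the connection with as_string(). A short sketch using the same users table as before:

from psycopg2 import sql

stmt = sql.SQL("SELECT count(*) FROM {table}").format(
    table=sql.Identifier('users'),
)
print(stmt.as_string(connection))  # SELECT count(*) FROM "users"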

Note: Users of the popular package django-debug-toolbar might get an error in the SQL panel for queries composed with psycopg2.sql.SQL(). A fix is expected for release in version 2.0.

Now, try executing the function on the users table:

>>> count_rows('users')
2

Great! Next, let’s see what happens when the table does not exist:

>>> count_rows('foo')
Traceback (most recent call last):
  File "", line 1, in 
  File "", line 11, in count_rows
psycopg2.errors.UndefinedTable: relation "foo" does not exist
LINE 5:                 "foo"
                        ^

The function throws the UndefinedTable exception. In the following steps, you’ll use this exception as an indication that your function is safe from a Python SQL injection attack.

Note: The exception UndefinedTable was added in psycopg2 version 2.8. If you’re working with an earlier version of Psycopg, then you’ll get a different exception.

To put it all together, add an option to count rows in the table up to a certain limit. This feature might be useful for very large tables. To implement this, add a LIMIT clause to the query, along with query parameters for the limit’s value:

from psycopg2 import sql

def count_rows(table_name: str, limit: int) -> int:
    with connection.cursor() as cursor:
        stmt = sql.SQL("""
            SELECT
                COUNT(*)
            FROM (
                SELECT
                    1
                FROM
                    {table_name}
                LIMIT
                    {limit}
            ) AS limit_query
        """).format(
            table_name = sql.Identifier(table_name),
            limit = sql.Literal(limit),
        )
        cursor.execute(stmt)
        result = cursor.fetchone()

    rowcount, = result
    return rowcount

In this code block, you annotated limit using sql.Literal(). As in the previous example, psycopg will bind all query parameters as literals when using the simple approach. However, when using sql.SQL(), you need to explicitly annotate each parameter using either sql.Identifier() or sql.Literal().

Note: Unfortunately, the Python API specification does not address the binding of identifiers, only literals. Psycopg is the only popular adapter that added the ability to safely compose SQL with both literals and identifiers. This fact makes it even more important to pay close attention when binding identifiers.

Execute the function to make sure that it works:

>>> count_rows('users', 1)
1
>>> count_rows('users', 10)
2

Now that you see the function is working, make sure it’s also safe:

>>> count_rows("(select 1) as foo; update users set admin = true where name = 'haki'; --", 1)
Traceback (most recent call last):
  File "", line 1, in 
  File "", line 18, in count_rows
psycopg2.errors.UndefinedTable: relation "(select 1) as foo; update users set admin = true where name = '" does not exist
LINE 8:                     "(select 1) as foo; update users set adm...
                            ^

This traceback shows that psycopg escaped the value, and the database treated it as a table name. Since a table with this name doesn’t exist, an UndefinedTable exception was raised and you were not hacked!


Conclusion

You’ve successfully implemented a function that composes dynamic SQL without putting your system at risk for Python SQL injection! You’ve used both literals and identifiers in your query without compromising security.

You’ve learned:

  • What Python SQL injection is and how it can be exploited
  • How to prevent Python SQL injection using query parameters
  • How to safely compose SQL statements that use literals and identifiers as parameters

You’re now able to create programs that can withstand attacks from the outside. Go forth and thwart the hackers!