How To Manage an SQL Database


An SQL Cheat Sheet

Introduction

SQL databases come installed with all the commands you need to add, modify, delete, and query your data. This cheat sheet-style guide provides a quick reference to some of the most commonly-used SQL commands.

How to Use This Guide:

  • This guide is in cheat sheet format with self-contained command-line snippets
  • Jump to any section that is relevant to the task you are trying to complete
  • When you see placeholder values like table, column, or value in this guide’s commands, keep in mind that you should replace them with the columns, tables, and data in your own database.
  • Throughout this guide, the example data values given are all wrapped in apostrophes ('). In SQL, it is necessary to wrap any data values that consist of strings in apostrophes. This isn’t required for numeric data, but it also won’t cause any issues if you do include apostrophes, as shown in the example following this list.
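
For instance, an INSERT statement written in this guide’s placeholder style wraps the string value in apostrophes while the numeric value can be left bare (the column names here are purely illustrative):

INSERT INTO table ( string_column, numeric_column ) VALUES ( 'Sammy', 42 );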

Please note that, while SQL is recognized as a standard, most SQL database programs have their own proprietary extensions. This guide uses MySQL as the example relational database management system (RDBMS), but the commands given will work with other relational database programs, including PostgreSQL, MariaDB, and SQLite. Where there are significant differences between RDBMSs, we have included the alternative commands.

Opening up the Database Prompt (using Socket/Trust Authentication)

By default on Ubuntu 18.04, the root MySQL user can authenticate without a password using the following command:

sudo mysql


To open up a PostgreSQL prompt, use the following command. This example will log you in as the postgres user, which is the included superuser role, but you can replace that with any already-created role:

sudo -u postgres psql


Opening up the Database Prompt (using Password Authentication)

If your root MySQL user is set to authenticate with a password, you can do so with the following command:

mysql -u root -p


If you’ve already set up a non-root user account for your database, you can also use this method to log in as that user:

mysql -u user -p


The above command will prompt you for your password after you run it. If you’d like to supply your password as part of the command, immediately follow the -p option with your password, with no space between them:

mysql -u root -ppassword


Creating a Database

The following command creates a database with default settings.

CREATE DATABASE database_name;


If you want your database to use a character set and collation different than the defaults, you can specify those using this syntax:

CREATE DATABASE database_name CHARACTER SET character_set COLLATE collation;


Listing Databases

To see what databases exist in your MySQL or MariaDB installation, run the following command:

SHOW DATABASES;


In PostgreSQL, you can see what databases have been created with the following command:

\list


Deleting a Database

To delete a database, including any tables and data held within it, run a command that follows this structure:

DROP DATABASE IF EXISTS database;


Creating a User

To create a user profile for your database without specifying any privileges for it, run the following command:

CREATE USER username IDENTIFIED BY 'password';


PostgreSQL uses a similar, but slightly different, syntax:

CREATE USER user WITH PASSWORD 'password';


If you want to create a new user and grant them privileges in one command, you can do so by issuing a GRANT statement. The following command creates a new user and grants them full privileges to every database and table in the RDBMS:

GRANT ALL PRIVILEGES ON *.* TO 'username'@'localhost' IDENTIFIED BY 'password';


Note the PRIVILEGES keyword in this previous GRANT statement. In most RDBMSs, this keyword is optional, and this statement can be equivalently written as:

GRANT ALL ON *.* TO 'username'@'localhost' IDENTIFIED BY 'password';


Be aware, though, that the PRIVILEGES keyword is required for granting privileges like this when Strict SQL mode is turned on.

Deleting a User

Use the following syntax to delete a database user profile:

DROP USER IF EXISTS username;


Note that this command will not by default delete any tables created by the deleted user, and attempts to access such tables may result in errors.

Selecting a Database

Before you can create a table, you first have to tell the RDBMS the database in which you’d like to create it. In MySQL and MariaDB, do so with the following syntax:

USE database;


In PostgreSQL, you must use the following command to select your desired database:

\connect database


Creating a Table

The following command structure creates a new table with the name table, and includes two columns, each with their own specific data type:

CREATE TABLE table ( column_1 column_1_data_type, column_2 column_2_data_type );


Deleting a Table

To delete a table entirely, including all its data, run the following:

DROP TABLE IF EXISTS table;


Inserting Data into a Table

Use the following syntax to populate a table with one row of data:

INSERT INTO table ( column_A, column_B, column_C ) VALUES ( 'data_A', 'data_B', 'data_C' );


You can also populate a table with multiple rows of data using a single command, like this:

INSERT INTO table ( column_A, column_B, column_C ) VALUES ( 'data_1A', 'data_1B', 'data_1C' ),  ( 'data_2A', 'data_2B', 'data_2C' ), ( 'data_3A', 'data_3B', 'data_3C' );


Deleting Data from a Table

To delete a row of data from a table, use the following command structure. Note that value should be the value held in the specified column in the row that you want to delete:

DELETE FROM table WHERE column='value';


Note: If you do not include a WHERE clause in a DELETE statement, as in the following example, it will delete all the data held in a table, but not the columns or the table itself:

DELETE FROM table;


Changing Data in a Table

Use the following syntax to update the data held in a given row. Note that the WHERE clause at the end of the command tells SQL which row to update. value is the value held in column_A that aligns with the row you want to change.

Note: If you fail to include a WHERE clause in an UPDATE statement, the command will replace the data held in every row of the table.

UPDATE table SET column_1 = value_1, column_2 = value_2 WHERE column_A=value;


Inserting a Column

The following command syntax will add a new column to a table:

ALTER TABLE table ADD COLUMN column data_type;


Deleting a Column

A command following this structure will delete a column from a table:

ALTER TABLE table DROP COLUMN column;


Performing Basic Queries

To view all the data from a single column in a table, use the following syntax:

SELECT column FROM table;


To query multiple columns from the same table, separate the column names with a comma:

SELECT column_1, column_2 FROM table;


You can also query every column in a table by replacing the names of the columns with an asterisk (*). In SQL, asterisks act as placeholders to represent “all”:

SELECT * FROM table;


Using WHERE Clauses

You can narrow down the results of a query by appending the SELECT statement with a WHERE clause, like this:

SELECT column FROM table WHERE conditions_that_apply;


For example, you can query all the data from a single row with a syntax like the following. Note that value should be a value held in both the specified column and the row you want to query:

SELECT * FROM table WHERE column = value;


Working with Comparison Operators

A comparison operator in a WHERE clause defines how the specified column should be compared against the value. Here are some common SQL comparison operators:

  • =: tests for equality
  • !=: tests for inequality
  • <: tests for less-than
  • >: tests for greater-than
  • <=: tests for less-than or equal-to
  • >=: tests for greater-than or equal-to
  • BETWEEN: tests whether a value lies within a given range
  • IN: tests whether a row’s value is contained in a set of specified values
  • EXISTS: tests whether rows exist, given the specified conditions
  • LIKE: tests whether a value matches a specified string
  • IS NULL: tests for NULL values
  • IS NOT NULL: tests for all values other than NULL
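
For example, a query like the following uses the BETWEEN operator to return only the rows where the specified column holds a value in a given range (the numeric bounds here are just illustrative):

SELECT * FROM table WHERE column BETWEEN 10 AND 20;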

Working with Wildcards

SQL allows the use of wildcard characters. These are useful if you’re trying to find a specific entry in a table, but aren’t sure of what that entry is exactly.

Asterisks (*) are placeholders that represent “all”; the following will query every column in a table:

SELECT * FROM table;


Percentage signs (%) represent zero or more unknown characters.

SELECT * FROM table WHERE column LIKE 'val%';


Underscores (_) are used to represent a single unknown character:

SELECT * FROM table WHERE column LIKE 'v_lue';


Counting Entries in a Column

The COUNT function is used to find the number of entries in a given column. The following syntax will return the total number of values held in column:

SELECT COUNT(column) FROM table;


You can narrow down the results of a COUNT function by appending a WHERE clause, like this:

SELECT COUNT(column) FROM table WHERE column=value;


Finding the Average Value in a Column

The AVG function is used to find the average (in this case, the mean) amongst values held in a specific column. Note that the AVG function will only work with columns holding numeric values; when used on a column holding string values, it may return an error or 0:

SELECT AVG(column) FROM table;


Finding the Sum of Values in a Column

The SUM function is used to find the sum total of all the numeric values held in a column:

SELECT SUM(column) FROM table;


As with the AVG function, if you run the SUM function on a column holding string values it may return an error or just 0, depending on your RDBMS.

Finding the Largest Value in a Column

To find the largest numeric value in a column or the last value alphabetically, use the MAX function:

SELECT MAX(column) FROM table;


Finding the Smallest Value in a Column

To find the smallest numeric value in a column or the first value alphabetically, use the MIN function:

SELECT MIN(column) FROM table;


Sorting Results with ORDER BY Clauses

An ORDER BY clause is used to sort query results. The following query syntax returns the values from column_1 and column_2 and sorts the results by the values held in column_1 in ascending order or, for string values, in alphabetical order:

SELECT column_1, column_2 FROM table ORDER BY column_1;


To perform the same action, but order the results in descending or reverse alphabetical order, append the query with DESC:

SELECT column_1, column_2 FROM table ORDER BY column_1 DESC;


Sorting Results with GROUP BY Clauses

The GROUP BY clause is similar to the ORDER BY clause, but it is used to sort the results of a query that includes an aggregate function such as COUNT, MAX, MIN, or SUM. On their own, the aggregate functions described in the previous section will only return a single value. However, you can view the results of an aggregate function performed on every matching value in a column by including a GROUP BY clause.

The following syntax will count the number of matching values in column_2 and group them in ascending or alphabetical order:

SELECT COUNT(column_1), column_2 FROM table GROUP BY column_2;


To perform the same action, but group the results in descending or reverse alphabetical order, append the query with DESC:

SELECT COUNT(column_1), column_2 FROM table GROUP BY column_2 DESC;


Querying Multiple Tables with JOIN Clauses

JOIN clauses are used to create result sets that combine rows from two or more tables. A JOIN clause will only work if the two tables each have a column with an identical name and data type, as in this example:

SELECT table_1.column_1, table_2.column_2 FROM table_1 JOIN table_2 ON table_1.common_column=table_2.common_column;


This is an example of an INNER JOIN clause. An INNER JOIN will return all the records that have matching values in both tables, but won’t show any records that don’t have matching values.

It’s possible to return all the records from one of two tables, including values that do not have a corresponding match in the other table, by using an outer JOIN clause. Outer JOIN clauses are written as either LEFT JOIN or RIGHT JOIN.

A LEFT JOIN clause returns all the records from the “left” table and only the matching records from the “right” table. In the context of outer JOIN clauses, the left table is the one referenced in the FROM clause, and the right table is any other table referenced after the JOIN statement. The following will show every record from table_1 and only the matching values from table_2. Any values that do not have a match in table_2 will appear as NULL in the result set:

SELECT table_1.column_1, table_2.column_2 FROM table_1 LEFT JOIN table_2 ON table_1.common_column=table_2.common_column;


A RIGHT JOIN clause functions the same as a LEFT JOIN, but it returns all the results from the right table, and only the matching values from the left:

SELECT table_1.column_1, table_2.column_2 FROM table_1 RIGHT JOIN table_2 ON table_1.common_column=table_2.common_column;


Combining Multiple SELECT Statements with UNION Clauses

A UNION operator is useful for combining the results of two (or more) SELECT statements into a single result set:

SELECT column_1 FROM table UNION SELECT column_2 FROM table;


Additionally, the UNION clause can combine two (or more) SELECT statements querying different tables into the same result set:

SELECT column FROM table_1 UNION SELECT column FROM table_2;


Conclusion

This guide covers some of the more common commands in SQL used to manage databases, users, and tables, and query the contents held in those tables. There are, however, many combinations of clauses and operators that all produce unique result sets. If you’re looking for a more comprehensive guide to working with SQL, we encourage you to check out Oracle’s Database SQL Reference.

Additionally, if there are common SQL commands you’d like to see in this guide, please ask or make suggestions in the comments below.

Learn More

MySQL Databases With Python Tutorial

SQL vs NoSQL or MySQL vs MongoDB

Building Web App using ASP.NET Web API Angular 7 and SQL Server

Learn NoSQL Databases from Scratch - Complete MongoDB Bootcamp 2019

MongoDB with Python Crash Course - Tutorial for Beginners

An Introduction to Queries in PostgreSQL

The Complete SQL Bootcamp

The Complete Oracle SQL Certification Course

SQL for Newbs: Data Analysis for Beginners

The Ultimate MySQL Bootcamp: Go from SQL Beginner to Expert

What are the differences between Standard SQL and Transact-SQL?

In this article, we'll explain syntax differences between standard SQL and the Transact-SQL (T-SQL) language used to interact with SQL Server databases.

#1 Names of Database Objects

In relational database systems, we name tables, views, and columns, but sometimes we need to use the same name as a keyword or use special characters. In standard SQL, you can place this kind of name in quotation marks (""), but in T-SQL, you can also place it in brackets ([]). Look at these examples for the name of a table in T-SQL:

CREATE TABLE dbo.test."first name" ( Id INT, Name VARCHAR(100));
CREATE TABLE dbo.test.[first name]  ( Id INT, Name VARCHAR(100));

Only the first delimiter (the quotation marks) is also part of the SQL standard; the brackets are specific to T-SQL.

What Is Different in a SELECT Statement?

#2 Returning Values

The SQL standard does not have a syntax for a query returning values or values coming from expressions without referring to any columns of a table, but MS SQL Server does allow for this type of expression. How? You can use a SELECT statement alone with an expression or with other values not coming from columns of the table. In T-SQL, it looks like the example below:

SELECT 12/6 ;

In this expression, we don’t need a table to evaluate 12 divided by 6; therefore, the FROM clause and the name of the table can be omitted.

#3 Limiting Records in a Result Set

In the SQL standard, you can limit the number of records in the results by using the syntax illustrated below:

SELECT * FROM tab FETCH FIRST 10 ROWS ONLY

T-SQL implements this syntax in a different way. The example below shows the MS SQL Server syntax:

SELECT * FROM tab ORDER BY col1 DESC OFFSET 0 ROWS FETCH FIRST 10 ROWS ONLY;

As you notice, this uses an ORDER BY clause. Another way to select rows, but without ORDER BY, is by using the TOP clause in T-SQL:

SELECT TOP 10 * FROM tab;

#4 Automatically Generating Values

The SQL standard enables you to create columns with automatically generated values. The syntax to do this is shown below:

CREATE TABLE tab (id DECIMAL GENERATED ALWAYS AS IDENTITY);

In T-SQL we can also automatically generate values, but in this way:

CREATE TABLE tab (id INTEGER IDENTITY);

#5 Math Functions

Several common mathematical functions are part of the SQL standard. One of these math functions is CEIL(x), which we don’t find in T-SQL. Instead, T-SQL provides the following non-standard functions: SIGN(x), ROUND(x[,d]) to round the decimal value x to d decimal positions, TRUNC(x) for truncating to a given number of decimal places, LOG(x) to return the natural logarithm of a value x, and RANDOM() to generate random numbers. The highest or lowest number in a list is returned in the SQL standard by the MAX(list) and MIN(list) functions, but in Transact-SQL, you use the GREATEST(list) and LEAST(list) functions.

T-SQL function ROUND:

SELECT ROUND(col, 0) FROM tab;

#6 Aggregate Functions

We find another syntax difference with the aggregate functions. The aggregate functions COUNT, SUM, and AVG all take an argument: the expression to aggregate. T-SQL allows the use of DISTINCT before this argument so that rows are counted only if their values differ from those of other rows. The SQL standard doesn't allow for the use of DISTINCT in these functions.

Standard SQL:
SELECT COUNT(col) FROM tab;

T-SQL:
SELECT COUNT(col) FROM tab;

SELECT COUNT(DISTINCT col) FROM tab;

But in T-SQL we don’t find a population covariance function: COVAR_POP(x,y), which is defined in the SQL standard.

#7 Retrieving Parts of Dates and Times

Most relational database systems deliver many functions to operate on dates and times.

In standard SQL, the EXTRACT(YEAR FROM x) function and similar functions to select parts of dates are different from the T-SQL functions like YEAR(x) or DATEPART(year, x).
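
As an illustration (the order_date column is just a placeholder), pulling the year out of a date looks like this in each dialect:

Standard SQL:
SELECT EXTRACT(YEAR FROM order_date) FROM tab;

T-SQL:
SELECT YEAR(order_date) FROM tab;
SELECT DATEPART(year, order_date) FROM tab;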

There is also a difference in getting the current date and time. Standard SQL allows you to get the current date with the CURRENT_DATE function, but in MS SQL Server, there is not a similar function, so we have to use the GETDATE function as an argument in the CAST function to convert to a DATE data type.
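
Putting that together, getting just the current date in T-SQL looks like this:

SELECT CAST(GETDATE() AS DATE);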

#8 Operating on Strings

Using functions to operate on strings is also different between the SQL standard and T-SQL. The main difference is found in removing trailing and leading spaces from a string. In standard SQL, there is the TRIM function, but in T-SQL, there are several related functions: TRIM (removing trailing and leading spaces), LTRIM (removing leading spaces), and RTRIM (removing trailing spaces).
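
For example (col is a placeholder column), either of the following removes both leading and trailing spaces in T-SQL; the nested LTRIM/RTRIM form also works on versions that predate the TRIM function:

SELECT TRIM(col) FROM tab;

SELECT LTRIM(RTRIM(col)) FROM tab;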

Another very-often-used string function is SUBSTRING.

The standard SQL syntax for the SUBSTRING function looks like:

SUBSTRING(str FROM start [FOR len])

but in T-SQL, the syntax of this function looks like:

SUBSTRING(str, start, length)
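
As a concrete example (col is only a placeholder), taking the first four characters of a column value looks like this in each dialect:

Standard SQL:
SELECT SUBSTRING(col FROM 1 FOR 4) FROM tab;

T-SQL:
SELECT SUBSTRING(col, 1, 4) FROM tab;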

There are times when you need to concatenate values coming from other columns and/or additional strings. Standard SQL enables the following syntax to do this:

SELECT col1 || col2 FROM tab;

As you can see, this syntax makes use of the || operator to append one string to another.

But the equivalent operator in T-SQL is the plus sign character. Look at this example:

SELECT col1 + col2  FROM tab;

In SQL Server, we also have the possibility to use the CONCAT function, which concatenates a list of strings:

SELECT CONCAT(col1, str1, col2, ...)  FROM tab;

We can also repeat one character several times. Standard SQL defines the function REPEAT(str, n) to do this. Transact-SQL provides the REPLICATE function. For example:

SELECT  REPLICATE(str, x);

where x indicates how many times to repeat the string or character.

#9 Inequality Operator

When filtering records in a SELECT statement, we sometimes have to use an inequality operator. Standard SQL defines <> as this operator, while T-SQL allows for both the standard operator and the != operator:

SELECT col3 FROM tab WHERE col1 != col2;

#10 ISNULL Function

In T-SQL, we have the ability to replace NULL values coming from a column with a specified replacement value using the ISNULL function. This function is specific to T-SQL and is not in the SQL standard.

SELECT ISNULL(col1, 'replacement value') FROM tab;

Which Parts of DML Syntax Are Different?

In T-SQL, the basic syntax of DELETE, UPDATE, and INSERT queries is the same as the SQL standard, but differences appear in more advanced queries. Let’s look at them.

#11 OUTPUT Keyword

The OUTPUT keyword occurs in DELETE, UPDATE, and INSERT statements. It is not defined in standard SQL.

Using T-SQL, we can see extra information returned by a query: both the old and new values in an UPDATE, the values added using INSERT, or the values removed using DELETE. To see this information, we use the Inserted and Deleted prefixes in the OUTPUT clause.

UPDATE tab SET col='new value'
OUTPUT Deleted.col, Inserted.col;

We see the result of changing records with the previous and new values in an updated column. The SQL standard does not support this feature.

#12 Syntax for INSERT INTO ... SELECT

Another structure of an INSERT query is INSERT INTO … SELECT. T-SQL allows you to insert data from another table into a destination table. Look at this query:

INSERT INTO tab SELECT col1,col2,... FROM tab_source;

It is not a standard feature but a feature characteristic of SQL Server.

#13 FROM Clause in DELETE and UPDATE

SQL Server provides an extended syntax for UPDATE and DELETE with FROM clauses. You can use DELETE with FROM to use the rows from one table to remove corresponding rows in another table by referring to a primary key and a foreign key. Similarly, you can use UPDATE with FROM to update rows from one table by referring to the rows of another table using common values (the primary key in one table and the foreign key in the second, e.g. the same city name). Here is an example:

DELETE FROM Book
FROM Author
WHERE Author.Id=Book.AuthorId AND Author.Name IS NULL;

UPDATE Book
SET Book.Price=Book.Price*0.2
FROM Author
WHERE Book.AuthorId=Author.Id AND Author.Id=12;

The SQL standard doesn’t provide this syntax.

#14 INSERT, UPDATE, and DELETE With JOIN

You can also use INSERT, UPDATE, and DELETE using JOIN to connect to another table. An example of this is:

DELETE ItemOrder FROM ItemOrder
JOIN Item ON ItemOrder.ItemId=Item.Id
WHERE YEAR(Item.DeliveredDate) <= 2017;

This feature is not in the SQL standard.

Summary

This article does not cover all the syntax differences between the SQL standard and T-SQL in the MS SQL Server system. However, it helps point out some basic features characteristic only of Transact-SQL, as well as some standard SQL syntax that isn't implemented by MS SQL Server.

Thanks for reading. If you liked this post, share it with all of your programming buddies!

Originally published on https://dzone.com


SQL Tutorial – Learn SQL Programming Online from Experts

This SQL tutorial will help you learn SQL basics, so you can become a successful SQL developer. You will find out about SQL commands, syntax, data types, operators, creating and dropping tables, and inserting and selecting data. Through this tutorial you will learn SQL for working with a relational database. Learn SQL from Intellipaat SQL training and fast-track your career.

SQL Tutorial for Beginners

This SQL Tutorial for Beginners is a complete package for how to learn SQL online. In this SQL tutorial, you will learn SQL programming to get a clear idea of what Structured Query Language is and how you deploy SQL to work with a relational database system.

Structured Query Language is a language used to operate relational databases. Some of the major ways in which SQL is used in conjunction with a relational database are storing, retrieving, and manipulating the data held in it.

Check out Intellipaat’s blog to get a fair understanding of SQL Optimization Techniques!


What is SQL?


The language used to communicate with relational databases is SQL, or Structured Query Language. SQL programming helps you operate relational databases and derive information from them.

Some of the operations that SQL performs include creating databases; fetching, modifying, updating, and deleting rows; and storing, manipulating, and retrieving data within the relational database. SQL is an ANSI-standard language, but there are many versions of SQL in use as well.


Why is SQL programming so widely used?

Structured Query Language or SQL programming is used so extensively for the following reasons.

  • SQL lets you access any data within the relational database
  • You can describe the data in the database using SQL
  • Using SQL, you can manipulate the data within the relational database
  • SQL can be embedded within other languages through SQL modules and libraries
  • SQL lets you easily create and drop databases and tables
  • SQL allows you to create views, functions, and stored procedures in databases
  • Using SQL, you can set permissions on procedures, tables, and views.


Wish to crack SQL job interviews? Intellipaat’s Top SQL Interview Questions are meant only for you!


Features of SQL

Here in this section of the SQL tutorial for beginners, we list some of the top features of SQL that make it so ubiquitous when it comes to managing relational databases.

  • SQL is a simple and easy-to-learn language
  • SQL is versatile, as it works with database systems from Oracle, IBM, Microsoft, etc.
  • SQL is an ANSI and ISO standard language for database creation and manipulation
  • SQL has a well-defined structure, as it uses long-established standards
  • SQL can retrieve large amounts of data quickly and efficiently
  • SQL lets you manage databases without knowing a lot of coding


Applications of SQL

In this section of the SQL tutorial, we will look at the applications of SQL that make it so important in a data-driven world where managing huge databases is the norm.

  • SQL is used as a Data Definition Language (DDL) meaning you can independently create a database, define its structure, use it and then discard it when you are done with it
  • SQL is also used as a Data Manipulation Language (DML) which means you can use it for maintaining an already existing database. SQL is a powerful language for entering data, modifying data and extracting data with regard to a database
  • SQL is also deployed as a Data Control Language (DCL) which specifies how you can protect your database against corruption and misuse.
  • SQL is extensively used as a Client/Server language to connect the front-end with the back-end thus supporting the client/server architecture
  • SQL can also be used in the three-tier architecture of a client, an application server and a database which defines the Internet architecture.

Still have queries? Come to Intellipaat’s SQL Community, clarify all your doubts, and excel in your career!

Why should you learn SQL?

Today, regardless of which relational database system major corporations like Oracle, IBM, and Microsoft offer, the one thing common to all of them is Structured Query Language, or SQL.

So if you learn SQL, you will be able to pursue a broad career spanning many roles and responsibilities. Learning SQL is also important for a data science career, since a data scientist has to deal with relational databases and query them using the standard language, SQL.

Originally published at www.intellipaat.com in the tutorial "SQL Tutorial" on 09 Sept. 2019.

Web Scraping for Machine Learning with SQL Database

Machine Learning requires data, and the Full Stack AI/ML Engineer toolkit needs to include web scraping, because it can improve predictions with new, quality data. In this article, you will learn to combine knowledge of HTML, Python, databases, SQL, and datasets for Machine Learning.

I thought, how can we angle "Web Scraping for Machine Learning", and I realized that Web Scraping should be essential to Data Scientists, Data Engineers and Machine Learning Engineers.

The Full Stack AI/ML Engineer toolkit needs to include web scraping, because it can improve predictions with new quality data. Machine Learning inherently requires data, and we would be most comfortable, if we have as much high quality data as possible. But what about when the data you need is not available as a dataset? What then? Do you just go and ask organizations and hope that they kindly will deliver it to you for free?

The answer is: you collect, label and store it yourself.

I made a GitHub repository for scraping the data. I encourage you to try it out, scrape some data yourself, and even try to make an NLP or other Machine Learning project out of the scraped data.

In this article, we are going to web scrape Reddit – specifically, the /r/DataScience (and a little of /r/MachineLearning) subreddit. There will be no usage of the Reddit API, since we usually web scrape when an API is not available. Furthermore, you are going to learn to combine the knowledge of HTML, Python, Databases, SQL and datasets for Machine Learning. We are doing a small NLP sample project at last, but this is only to showcase that you can pickup the dataset and create a model providing predictions.

Table of Contents (Click To Scroll)
  1. Web Scraping in Python - Beautiful Soup and Selenium
    • Be careful - warning
    • Benchmarking How Long It Takes To Scrape
    • Initializing Classes From The Project
    • How To Scrape Reddit URLs
    • Getting Data From URLs
    • Converting Data to Python Dict
    • Extracting Data From Python Dict
      • Scraping The Comments
  2. Labelling Scraped Data
    • Naming Conventions
  3. Storing Scraped Data
    • Designing SQL Tables
    • Inserting Into SQL Tables
    • Exporting From SQL Tables
  4. Small Machine Learning Project on Exported Dataset
Web Scraping in Python With BeautifulSoup and Selenium

The first thing we need to do is install BeautifulSoup and Selenium for scraping, but for accessing the whole project (i.e. also the Machine Learning part), we need more packages.

Selenium is basically good for content that changes due to Javascript, while BeautifulSoup is great at capturing static HTML from pages. Both the packages can be downloaded to your environment by a simple pip or conda command.

Install all the packages from the GitHub repository (linked at the start):

pip install -r requirements.txt

Alternatively, if you are using Google Colab, you can run the following to install the packages needed:

!pip install -r https://github.com/casperbh96/Web-Scraping-Reddit/raw/master/requirements.txt

Next, you need to download the chromedriver and place it in the core folder of the downloaded repository from GitHub.

Basic Warning of Web Scraping

Scrape responsibly, please! Reddit might update their website and invalidate the current approach of scraping the data from the website. If this is used in production, you would really want to set up an email/SMS service, so that you get immediate notice when your web scraper fails.

This is for educational purposes only, please don't misuse or do anything illegal with the code. It's provided as-is, by the MIT license of the GitHub repository.

Benchmarking How Long It Takes To Scrape

Scraping takes time. Remember that you have to open each page, letting it load, then scraping the needed data. It can really be a tedious process – even figuring out where to start gathering the data can be hard, or even figuring out exactly what data you want.

There are 5 main steps for scraping reddit data:

  1. Collecting links from a subreddit
  2. Finding the 'script' tag from each link
  3. Turning collected data from links into Python dictionaries
  4. Getting the data from Python dictionaries
  5. Scraping comment data

Steps 2 and 5 in particular will take you a long time, because they are the hardest to optimize.

An approximate benchmark for:

  • Step 2: about 1 second for each post
  • Step 5: $n/3$ seconds for each post, where $n$ is the number of comments. Scraping 300 comments took about 100 seconds for me, but it varies with internet speed, CPU speed, etc.

Initializing Classes From The Project

We are going to be using 4 files with one class each, a total of 4 classes, which I placed in the core folder.

Whenever you see SQL, SelScraper or BSS being called, it means we are calling a method from another class, e.g. BSS.get_title(). We are going to jump forward and over some lines of code, because about 1000 lines have been written (which is too much to explain here).

from core.selenium_scraper import SeleniumScraper
from core.soup_scraper import SoupScraper
from core.progress_bar import ProgressBar
from core.sql_access import SqlAccess

SQL = SqlAccess()
SelScraper = SeleniumScraper()
BSS = SoupScraper(reddit_home,
                  slash,
                  subreddit)

Scraping URLs From A Subreddit

We start off with scraping the actual URLs from a given subreddit, defined at the start of scraper.py. What happens is that we open a browser in headless mode (without opening a window, basically running in the background), and we use some Javascript to scroll the page, while collecting links to posts.

The next snippet just runs in a while loop until the variable scroll_n_times becomes zero:

while scroll_n_times:
    # Scrolls browser to the bottom of the page
    self.driver.execute_script(
                "window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(sleep_time)
    scroll_n_times -= 1

    elements = self.driver.find_elements_by_xpath(xpath)

    # Get the link from the href attribute
    self.links = [tag.get_attribute('href') for tag in elements]

The following is the Javascript to scroll a page to the bottom:

window.scrollTo(0, document.body.scrollHeight);

After that, we use xpath to find all tags in the HTML body, which really just returns all the links to us:

xpath = "//a[@data-click-id='body']"
elements = self.driver.find_elements_by_xpath(xpath)
self.links = [tag.get_attribute('href') for tag in elements]

At any point in time, if some exception happens, or we are done with scrolling, we basically garbage collect the running process, so that the program won't have 14 different Chrome browsers running:

try:
    pass  # the while loop code from above goes here
finally:
    self.driver.quit()

Great, now we have a collection of links that we can start scraping, but how? Well, we start by...

Getting the data attribute

Upon opening one of the collected links, Reddit provides us with a javascript 'script' element that contains all the data for each post. You won't be able to see it when visiting the page, since the data is loaded in, then removed.

But our great software package BeautifulSoup will! And for our convenience, Reddit marked the 'script' with an id: id=data. This attribute makes it easy for us to find the element and capture all the data of the page.

First we specify the headers, which tell the website we are visiting which user agent we are using (so as to avoid being detected as a bot). Next, we make a request and let BeautifulSoup get all the text from the page, i.e. all the HTML.

progress = ProgressBar(len(urls))
for url in urls:
    progress.update()
    headers = {'User-Agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=headers)

    soup = BeautifulSoup(r.text, 'html.parser')

    pure_html_data.append(r.text)
    pure_script_data.append(soup.find(id='data').text)

You see that last line, after we have told BeautifulSoup to load the page, we find the text of the script with the id='data'. Since it's an id attribute, I knew that there would only be one element on the whole page with this attribute – hence this is pretty much a bulletproof approach of getting the data. Unless, of course, Reddit changes their site.

You might think that we could parallelize these operations, so that we can run through multiple URLs in the for loop, but it's bad practice and might get you banned – especially on bigger sites, which will protect their servers from overload, something you can effectively cause by hammering them with requests.

Converting the 'script' string to Python dictionary

From the last code piece, we get a string, which is in a valid JSON format. We want to convert this string into a Python dict, such that we can easily lookup the data from each link.

The basic approach is to find the index of the first left curly bracket, which is where the JSON starts, and to find the last index with the rfind() method on the right curly bracket, adding one to get the correct end index for the slice.

pure_dicts = []

print('Making Python dicts out of script data')

progress = ProgressBar(len(script_data))
for data in script_data:
    progress.update()

    first_index = data.index('{')
    last_index = data.rfind('}') + 1

    json_str = data[first_index:last_index]

    pure_dicts.append(json.loads(json_str))

return pure_dicts

Effectively, this gives us a Python dict of the whole Reddit data which is loaded into every post. This will be great for scraping all the data and storing it.

Actually Getting The Data (From Python dictionary)

I defined one for loop, which iteratively scrapes different data. This makes it easy to maintain and find your mistakes.

As you can see, we get quite a lot of data – basically the whole post and comments, except for comments text, which we get later on.

progress = ProgressBar(len(links))
for i, current_data in enumerate(BSS.data):
    progress.update()

    BSS.get_url_id_and_url_title(BSS.urls[i],
                                 current_data, i)
    BSS.get_title()
    BSS.get_upvote_ratio()
    BSS.get_score()
    BSS.get_posted_time()
    BSS.get_author()
    BSS.get_flairs()
    BSS.get_num_gold()
    BSS.get_category()
    BSS.get_total_num_comments()
    BSS.get_links_from_post()
    BSS.get_main_link()
    BSS.get_text()
    BSS.get_comment_ids()

Scraping Comments Text Data

After we have collected all this data, we need to begin thinking about storing it. But we couldn't manage to scrape the text of the comments in the big for loop, so we will have to do that before we set up the inserts into the database tables.

This next code piece is quite long, but it's all you need. Here we collect the comment text, score, author, upvote points, and depth. Each comment can have subcomments under it, which is what the depth indicates, e.g. depth is zero at each root comment.

for i, comment_url in enumerate(comment_id_links_array):
    author = None
    text = None
    score = None
    comment_id = array_of_comment_ids[i]

    # Open the url, find the div with classes Comment.t1_comment_id
    r = requests.get(comment_url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    div = soup.select('div.Comment.t1_{0}'.format(comment_id))

    # If it found the div, let's extract the text from the comment
    if div is not None and len(div) > 0 :
        author = div[0].find_all('a')[0].get_text()
        spans = div[0].find_all("span")
        score = [spans[i].get_text() for i in range(len(spans)) if 'point' in spans[i].get_text()]

        html_and_text = div[0].find('div', attrs={'data-test-id' : 'comment'})
        if html_and_text is not None:
            text = html_and_text.get_text()

        if len(score) == 0:
            score = None
        else:
            score = score[0]

    # Make useable array for insertion
    array_of_comment_data.append([None,
                                  None,
                                  str(comment_id),
                                  str(score),
                                  array_of_depth[i],
                                  str(array_of_next[i]),
                                  str(array_of_prev[i]),
                                  str(author),
                                  str(text)])

return array_of_comment_data

What I found working was opening each of the collected comment URLs from the earlier for loop, and basically scraping once again. This ensures that we get all the data scraped.

Note: this approach for scraping the comments can be reaaally slow! This is probably the number one thing to improve – it's completely dependent on the number of comments on a post. Scraping >100 comments takes a pretty long time.

Labelling The Collected Data

For the labelling part, we are mostly going to focus on tasks we can immediately finish with Python code, instead of the tasks that we cannot. For instance, labelling images found on Reddit is probably not feasible by a script, but actually has to be done manually.

Naming Conventions

For the SQL in this article, we use Snake Case for naming the features. An example of this naming convention is my_feature_x, i.e. we split words with underscores, and only lower case.

...But please:

If you are working in a company, look at the naming conventions they use and follow them. The last thing you want is different styles, as it will just be confusing at the end of the day. Common naming conventions include camelCase and Pascal Case.

Storing The Labelled Data

For storing the collected and labelled data, I have specifically chosen that we should proceed with an SQLite database – since it's way easier for smaller projects like web scraping. We don't have to install any drivers like for MySQL or MS SQL servers, and we don't even need to install any packages to use it, because it comes natively with Python.
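
As a minimal sketch of what "comes natively with Python" means here (the file name is illustrative, not the one used in the repository), connecting to an SQLite database only requires the built-in sqlite3 module:

import sqlite3

# Connect to (or create) the database file; no external driver is needed
conn = sqlite3.connect('reddit.db')
c = conn.cursor()

# ... create tables and insert the scraped rows here ...

conn.commit()  # persist any changes
conn.close()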

Some consideration of data types has been made for the columns in the SQLite database, but there is room for improvement in its current form. I used some cheap varchars in the comment table to get around some storage problems. This currently does not cause any problems, but it should probably be updated in the future.

Designing and Creating Tables

The 1st normal form was mostly considered in the database design, i.e. we have separate tables for links and comments, for avoiding duplicate rows in the post table. Further improvements include making a table for categories and flairs, which is currently put into a string form from an array.

Without further notice, let me present you the database diagram for this project. It's quite simple.

In short: we have 3 tables. For each row with a post id in the post table, we can have multiple rows with the same post id in the comment and link table. This is how we link the post to all links and comments. The link exists because the post_id in the link and comment table has a foreign key on the post_id in the post table, which is the primary key.

I used an excellent tool for generating this database diagram, which I want to highlight – currently it's free and it's called https://dbdiagram.io/.

Play around with it, if you wish. This is not a paid endorsement of any sorts, just a shoutout to a great, free tool. Here is my code for the above diagram:

Table post as p {
  id int [pk]
  url varchar [not null]
  url_id varchar [not null]
  url_title varchar [not null]
  author varchar
  upvote_ratio uint8
  score int
  time_created datetime
  num_gold int
  num_comments int
  category varchar
  text varchar
  main_link varchar
  flairs int
}

Table link as l {
  id int [pk]
  post_id int
  link varchar [not null]
}

Table comment as c {
  id int [pk]
  post_id int
  comment_id varchar [not null]
  score varchar [not null]
  depth int [not null]
  next varchar
  previous varchar
  comment_author varchar
  text varchar
}

Ref: p.id < l.post_id
Ref: p.id < c.post_id
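
The tables themselves are created automatically by code in the repository (covered next), but as a rough illustration of how part of the diagram might translate into SQLite DDL (the column types are approximations, not the repository's exact schema), the post/comment relationship could be written like this:

CREATE TABLE post (
    id INTEGER PRIMARY KEY,
    url VARCHAR NOT NULL,
    url_id VARCHAR NOT NULL,
    url_title VARCHAR NOT NULL,
    author VARCHAR,
    score INTEGER
    -- ... remaining post columns from the diagram ...
);

CREATE TABLE comment (
    id INTEGER PRIMARY KEY,
    post_id INTEGER,
    comment_id VARCHAR NOT NULL,
    score VARCHAR NOT NULL,
    depth INTEGER NOT NULL,
    comment_author VARCHAR,
    text VARCHAR,
    -- post_id points back to the post this comment belongs to
    FOREIGN KEY (post_id) REFERENCES post (id)
);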

Actually Inserting The Data

The databases and tables will be automatically generated by some code which I set up to run automatically, so we will not cover that part here, but rather the part where we insert the actual data.

The first step is creating and/or connecting to the database (which will either automatically generate the database and tables, and/or just connect to the existing database).

After that, we begin inserting data.

try:
    SQL.create_or_connect_db(erase_first=erase_db_first)
    # [0] = post, [1] = comment, [2] = link
    for i in range(len(BSS.post_data)):
        SQL.insert('post', data = BSS.post_data[i])
        SQL.insert('link', data = BSS.link_data[i])

        if scrape_comments:
            SQL.insert('comment', data = BSS.comment_data[i])
except Exception as ex:
    print(ex)
finally:
    SQL.save_changes()

Inserting the data happens in a for loop, as can be seen from the code snippet above. We specify the column and the data which we want to input.

For the next step, we need to get the number of columns in the table we are inserting into. From the number of columns, we have to create a string of question-mark placeholders: one question mark per column, separated by commas. This is how data is inserted with the SQL syntax.

Some data is input into a function which I called insert(), and the data variable is an array in the form of a row. Basically, we already concatenated all the data into an array and now we are ready to insert.

cols = c.execute(('''
                PRAGMA table_info({0})
                ''').format(table))

# Get the number of columns
num_cols = sum([1 for i in cols]) - 1

# Generate question marks for VALUES insertion
question_marks = self._question_mark_creator(num_cols)

if table == 'post':
    c.execute(('''INSERT INTO {0}
                  VALUES ({1})'''
              ).format(table, question_marks), data)

    self.last_post_id = c.lastrowid

elif (table == 'comment' or table == 'link') \
     and data != None and data != []:
    # setting post_id to the last post id, inserted in the post insert
    for table_data in data:
        table_data[1] = self.last_post_id
        c.execute(('''INSERT INTO {0}
                      VALUES ({1})'''
                  ).format(table, question_marks), table_data)
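
The _question_mark_creator helper called above isn't shown in the article; assuming it only needs to build the comma-separated placeholder string, it might look something like this (a sketch, written as a method of the same class):

def _question_mark_creator(self, num_cols):
    # one '?' placeholder per column, e.g. '?, ?, ?' for three columns
    return ', '.join(['?'] * num_cols)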

This wraps up scraping and inserting into a database, but how about...

Exporting From SQL Databases

For this project, I made three datasets, one of which I used for a Machine Learning project in the next section of this article.

  1. A dataset with posts, comments and links data
  2. A dataset with the post only
  3. A dataset with comments only

For these three datasets, I made a Python file called make_dataset.py for creating the datasets and saving them using Pandas and some SQL query.

For the first dataset, we used a left join from the SQL syntax (which I won't go into detail about), and it provides the dataset that we wish for. You do have to filter a lot of nulls if you want to use this dataset for anything, i.e. a lot of data cleaning, but once that's done, you can use the whole dataset.

all_data = pd.read_sql_query("""
SELECT *
FROM post p 
LEFT JOIN comment c 
    ON p.id = c.post_id
LEFT JOIN link l
	ON p.id = l.post_id;
""", c)

all_data.to_csv('data/post_comment_link_data.csv', columns=all_data.columns, index=False)

For the second and third datasets, a simple select all from table SQL query was made to make the dataset. This needs no further explaining.

post = pd.read_sql_query("""
SELECT *
FROM post;
""", c)

comment = pd.read_sql_query("""
SELECT *
FROM comment;
""", c)

post.to_csv('data/post_data.csv', columns=post.columns, index=False)
comment.to_csv('data/comment_data.csv', columns=comment.columns, index=False)

Machine Learning Project Based On This Dataset

From the three generated datasets, I wanted to show you how to do a basic machine learning project.

The results are not amazing, but we are trying to classify the comments into four categories (exceptional, good, average, and bad), all based on the upvotes on a comment.

Let's start! Firstly, we import the functions and packages we need, along with the dataset, which is the comments table. But we only import the score (upvotes) and comment text from that dataset.

import copy
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

pd.options.mode.chained_assignment = None
df = pd.read_csv('data/comment_data.csv', usecols=['score', 'text'])

The next thing we have to do is clean the dataset. First, we keep only the text by using a regular expression (regex); this removes any weird characters like \ | / & %, etc.

Next, we handle the score feature. The score feature is formatted as a string, and we just need the number from that string; we also want to keep the minus sign in front of it, in case a comment has been downvoted a lot. Another regex is used here to do this.

The next ugly thing from our Python script is a None as a string. We replace this string with an actual NaN (np.nan), so that we can run df.dropna().

The last thing we need to do is convert the score to a float, since that is required for later.

def prepare_data(df):
    # Keep only letters and whitespace characters
    df.text = df.text.str.replace('[^a-zA-Z\s]', '')

    # Get only numbers, but allow minus in front
    df.score = df.score.str.extract('(^-?[0-9]*\S+)')

    # Remove rows with None as string
    df.score = df.score.replace('None', np.nan)

    # Remove all None
    df = df.dropna()

    # Convert score feature from string to float
    score = df.score.astype(float)
    df.score = copy.deepcopy(score)

    return df

df = prepare_data(df)

The next part is trying to define how our classification is going to work; so we have to adapt the data. An easy way is using percentiles (perhaps not an ideal way).

For this, we find the fiftieth, seventy-fifth, and ninety-fifth quantiles of the data and use them as thresholds: anything at or below the fiftieth quantile is marked as bad, and the higher bands as average, good, and exceptional. We replace the score feature with this new label.

def score_to_percentile(df):
    second = df.score.quantile(0.50) # Average
    third = df.score.quantile(0.75) # Good
    fourth = df.score.quantile(0.95) # exceptional

    new_score = []

    for i, row in enumerate(df.score):
        if row > fourth:
            new_score.append('exceptional')
        elif row > third:
            new_score.append('good')
        elif row > second:
            new_score.append('average')
        else:
            new_score.append('bad')

    df.score = new_score

    return df

df = score_to_percentile(df)

We need to split the data and tokenize the text, which we proceed to do in the following code snippet. There is really not much magic happening here, so let's move on.

def df_split(df):
    y = df[['score']]
    X = df.drop(['score'], axis=1)

    content = [' ' + comment for comment in X.text.values]
    X = CountVectorizer().fit_transform(content).toarray()

    X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.33, random_state=42)

    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = df_split(df)

The last part of this machine learning project tutorial is making predictions and scoring the algorithm we choose to go with.

Logistic Regression was used, and perhaps we did not get the best score, but this is merely a boilerplate for future improvement and use.

All we do here is fit the logistic regression model to the training data, make a prediction and then score how well the model predicted.

lr = LogisticRegression(C=0.05, solver='lbfgs', multi_class='multinomial')
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
score = accuracy_score(y_test, pred)

print ("Accuracy: {0}".format(score))

In our case, the predictions were not that great, as the accuracy turned out to be $0.59$.

Future work includes:

  • Fine-tuning different algorithms
  • Trying different metrics
  • Better data cleaning

Originally published by Casper Hansen at https://mlfromscratch.com