Pythonic Database Management with SQLAlchemy

Something we've taken for granted thus far on Hackers and Slackers is a library most data professionals have accepted as an undisputed standard: SQLAlchemy.

In the past, we've covered database connection management and querying using libraries such as PyMySQL and Psycopg2, both of which do an excellent job of interacting with databases just as we'd expect them to. The nature of opening/closing DB connections and working with cursors hasn't changed much in the past few decades (nearly the lifespan of relational databases themselves). Boilerplate is boring, one might figure, but at least it has remained consistent. That may have been the case, but the philosophical boom of MVC frameworks nearly a decade ago sparked a surge in the popularity of ORMs. While the world was singing the praises of object-oriented programming, containing database-oriented functionality within objects must have been a wet dream.

The only thing shocking about SQLAlchemy's popularity is its flip side: the contingent of teams functioning without SQLAlchemy as a part of their regular stack. Whether this stems from unawareness or active reluctance to change, data teams using Python without a proper ORM are surprisingly prevalent. It's easy to forget the reality of the workforce when our interactions with other professionals come mostly from blogs published by those at the top of their field.

I realize the "this is how we've always done it" attitude is a cliché with no shortage of commentary. Tales of adopting new (relatively speaking) practices dominate Silicon Valley blogs every day; it's the manner in which this resistance manifests, however, that catches me off guard. In this case, resistance to a single Python library can shed light on a frightening mental model that has implications up and down a corporation's stack.

Putting The 'M' In MVC

Frameworks which enforce a Model-View-Controller pattern have held undisputed consensus for long enough: none of us need to recap why creating apps this way is unequivocally correct. To understand why side-stepping an ORM is so significant, let's recall what ORM stands for:

Object-Relational Mapping, commonly referred to by its abbreviation ORM, is a technique that connects the rich objects of an application to tables in a relational database management system. Using ORM, the properties and relationships of the objects in an application can be easily stored and retrieved from a database without writing SQL statements directly and with less overall database access code. (Active Record Basics, Ruby on Rails Guides)

ORMs allow us to interact with databases simply by modifying objects in code (such as classes) as opposed to generating SQL queries by hand for each database interaction. Bouncing from application code to SQL is a major context switch, and the more interactions we introduce, the more out of control our app becomes.

To illustrate the alternative to this using models, I'll use an example offered by Flask-SQLAlchemy. Let's say we have a table of users which contains columns for id, username, and email. A model for such a table would look like this:

class User(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    username = db.Column(db.String(80), unique=True, nullable=False)
    email = db.Column(db.String(120), unique=True, nullable=False)

    def __repr__(self):
        return '<User %r>' % self.username

The 'model' is an object representing the structure of a single entry in our table. Once our model exists, this is all it takes to create an entry:

newuser = User(username='admin', email='admin@example.com')

That's a single readable line of code without writing a single line of SQL. Compare this to the alternative, which would be to use Psycopg2:

query = "INSERT INTO users VALUES username='admin', email='[email protected]';"

import psycopg2

import config  # local module holding our database credentials


def query_function(query):
    """Runs a database query."""
    try:
        conn = psycopg2.connect(
            user=config.username,
            password=config.password,
            host=config.host,
            port=config.port,
            database=config.database
        )
        with conn.cursor() as cur:
            cur.execute(query)
        conn.commit()
        conn.close()
    except Exception as e:
        print(e)

query_function(query)

Sure, query_function() only needs to be defined once, but compare the readability of using a model to the following:

query = "INSERT INTO users VALUES username='admin', email='[email protected]';"

query_function(query)

Despite achieving the same effect, the latter is far less readable and maintainable by human beings. Building an application around raw string queries can quickly become a nightmare.
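One detail worth spelling out: instantiating the model creates the object in memory, and persisting it still goes through a session. A minimal sketch, assuming the standard Flask-SQLAlchemy setup where db is the SQLAlchemy instance bound to our app:

# Persist the new record via Flask-SQLAlchemy's session.
db.session.add(newuser)
db.session.commit()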

Integration With Other Data Libraries

When it comes to golden standards of Python libraries, there is none more quintessential to data analysis than Pandas. The pairing of Pandas and SQLAlchemy is standard to the point where Pandas has built-in integrations to interact with data from SQLAlchemy. Here's what it takes to turn a database table into a Pandas dataframe with SQLAlchemy as our connector:

df = pd.read_sql(session.query(User).filter(User.id == 2).statement, session.bind)

Once again, a single line of Python code!
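For a fuller picture, here's a minimal, self-contained sketch of the same integration; the SQLite URI and table name are illustrative placeholders, not values from this post:

import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection URI; swap in your own database.
engine = create_engine('sqlite:///example.db')

# Pandas reads an entire table (or any query) straight into a
# DataFrame, using the SQLAlchemy engine as the connector.
df = pd.read_sql_table('users', con=engine)
print(df.head())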

Writing Queries Purely in Python

So far by using SQLAlchemy, we haven't needed to write a single line of SQL: how far could we take this? As far as we want, in fact. SQLAlchemy contains what they've dubbed function-based query construction, which is to say we can construct nearly any conceivable SQL query purely in Python by using the methods offered to us. For example, here's an update query:

stmt = users.update().values(fullname="Fullname: " + users.c.name)
conn.execute(stmt)
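Selects are built the same way. A quick sketch in the SQLAlchemy 1.x core style, reusing the users table and conn from above:

from sqlalchemy import select

# Build a SELECT ... WHERE ... entirely in Python.
stmt = select([users]).where(users.c.name == 'todd')
for row in conn.execute(stmt):
    print(row)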

Check the full reference to see what I mean. Every query you've ever needed to write: it's all there. All of it.

Simple Connection Management

Seeing as how we all now agree that SQLAlchemy is beneficial to our workflow, let's visit square one and see how simple it is to manage connections. The two key words to remember here are engines and sessions.

The Engine

An engine in SQLAlchemy is merely a bare-bones object representing our database. Making SQLAlchemy aware of our database is as simple as these two lines:

from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:', echo=True)

create_engine() accepts a simple URI identifying our database. Once engine exists, we could in theory use it exclusively via methods such as engine.connect() and engine.execute().
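For instance, a quick sanity check using the engine alone (a sketch; passing a raw SQL string like this works on SQLAlchemy 1.x):

# Use the engine directly, without the ORM.
with engine.connect() as conn:
    result = conn.execute('SELECT 1')
    print(result.fetchone())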

Sessions

To interact with our database in a Pythonic manner via the ORM, we'll need to create a session from the engine we just declared. Thus our code expands:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///:memory:', echo=True)
Session = sessionmaker(bind=engine)

That's all it takes! Now just as before, we can use SQLAlchemy's ORM and built-in functions to make simple interactions:

session = Session()
new_user = User(name='todd', fullname='Todd Hacker', password='toddspassword')
session.add(new_user)
session.commit()

Takeaway Goodies

It's worth mentioning that SQLAlchemy works with nearly every type of database, and does so by leveraging the base Python library for the respective type of database. For example, it probably seems to the outsider that we've spent some time shitting on Psycopg2. On the contrary, when SQLAlchemy connects to a Postgres database, it is using the Psycopg2 library under the hood to manage the boilerplate for us. The same goes for every other type of relational database along with their standard libraries.

There are plenty more reasons why SQLAlchemy is beneficial, to the point where it is arguably critical to data analysis workflows. The critical point to be made here is that leaving SQLAlchemy out of any data workflow only hurts the person writing the code, or more importantly, all those who come after.


By: Todd Birchard


How to Prevent SQL Injection Attacks With Python

Every few years, the Open Web Application Security Project (OWASP) ranks the most critical web application security risks. Since the first report, injection risks have always been on top. Among all injection types, SQL injection is one of the most common attack vectors, and arguably the most dangerous. As Python is one of the most popular programming languages in the world, knowing how to protect against Python SQL injection is critical.

In this tutorial, you’re going to learn:

  • What Python SQL injection is and how to prevent it
  • How to compose queries with both literals and identifiers as parameters
  • How to safely execute queries in a database

This tutorial is suited for users of all database engines. The examples here use PostgreSQL, but the results can be reproduced in other database management systems (such as SQLite, MySQL, Microsoft SQL Server, Oracle, and so on).

Understanding Python SQL Injection

SQL Injection attacks are such a common security vulnerability that the legendary xkcd webcomic devoted a comic to it:

"Exploits of a Mom" (Image: xkcd)

Generating and executing SQL queries is a common task. However, companies around the world often make horrible mistakes when it comes to composing SQL statements. While the ORM layer usually composes SQL queries, sometimes you have to write your own.

When you use Python to execute these queries directly into a database, there’s a chance you could make mistakes that might compromise your system. In this tutorial, you’ll learn how to successfully implement functions that compose dynamic SQL queries without putting your system at risk for Python SQL injection.

Setting Up a Database

To get started, you’re going to set up a fresh PostgreSQL database and populate it with data. Throughout the tutorial, you’ll use this database to witness firsthand how Python SQL injection works.

Creating a Database

First, open your shell and create a new PostgreSQL database owned by the user postgres:

$ createdb -O postgres psycopgtest

Here you used the command line option -O to set the owner of the database to the user postgres. You also specified the name of the database, which is psycopgtest.

Note: postgres is a special user, which you would normally reserve for administrative tasks, but for this tutorial, it’s fine to use postgres. In a real system, however, you should create a separate user to be the owner of the database.

Your new database is ready to go! You can connect to it using psql:

$ psql -U postgres -d psycopgtest
psql (11.2, server 10.5)
Type "help" for help.

You’re now connected to the database psycopgtest as the user postgres. This user is also the database owner, so you’ll have read permissions on every table in the database.

Creating a Table With Data

Next, you need to create a table with some user information and add data to it:

psycopgtest=# CREATE TABLE users (
    username varchar(30),
    admin boolean
);
CREATE TABLE

psycopgtest=# INSERT INTO users
    (username, admin)
VALUES
    ('ran', true),
    ('haki', false);
INSERT 0 2

psycopgtest=# SELECT * FROM users;
 username | admin
----------+-------
 ran      | t
 haki     | f
(2 rows)

The table has two columns: username and admin. The admin column indicates whether or not a user has administrative privileges. Your goal is to target the admin field and try to abuse it.

Setting Up a Python Virtual Environment

Now that you have a database, it’s time to set up your Python environment.

Create your virtual environment in a new directory:

~/src $ mkdir psycopgtest
~/src $ cd psycopgtest
~/src/psycopgtest $ python3 -m venv venv

After you run this command, a new directory called venv will be created. This directory will store all the packages you install inside the virtual environment.

Connecting to the Database

To connect to a database in Python, you need a database adapter. Most database adapters follow version 2.0 of the Python Database API Specification, PEP 249, and every major database engine has a leading adapter.

To connect to a PostgreSQL database, you’ll need to install Psycopg, which is the most popular adapter for PostgreSQL in Python. Django ORM uses it by default, and it’s also supported by SQLAlchemy.

In your terminal, activate the virtual environment and use pip to install psycopg:

~/src/psycopgtest $ source venv/bin/activate
~/src/psycopgtest $ python -m pip install "psycopg2>=2.8.0"
Collecting psycopg2
  Using cached https://....
  psycopg2-2.8.2.tar.gz
Installing collected packages: psycopg2
  Running setup.py install for psycopg2 ... done
Successfully installed psycopg2-2.8.2

Now you’re ready to create a connection to your database. Here’s the start of your Python script:

import psycopg2

connection = psycopg2.connect(
    host="localhost",
    database="psycopgtest",
    user="postgres",
    password=None,
)
connection.set_session(autocommit=True)

You used psycopg2.connect() to create the connection. This function accepts the following arguments:

  • host is the IP address or the DNS of the server where your database is located. In this case, the host is your local machine, or localhost.

  • database is the name of the database to connect to. You want to connect to the database you created earlier, psycopgtest.

  • user is a user with permissions for the database. In this case, you want to connect to the database as the owner, so you pass the user postgres.

  • password is the password for whoever you specified in user. In most development environments, users can connect to the local database without a password.

After setting up the connection, you configured the session with autocommit=True. Activating autocommit means you won’t have to manually manage transactions by issuing a commit or rollback. This is the default behavior in most ORMs. You use this behavior here as well so that you can focus on composing SQL queries instead of managing transactions.
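For contrast, here's a minimal sketch of what manual transaction management looks like, assuming a connection created without set_session(autocommit=True):

try:
    with connection.cursor() as cursor:
        cursor.execute("UPDATE users SET admin = false WHERE username = 'haki'")
    connection.commit()    # explicitly persist the change
except Exception:
    connection.rollback()  # undo everything since the last commit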

Note: Django users can get the instance of the connection used by the ORM from django.db.connection:

from django.db import connection

Executing a Query

Now that you have a connection to the database, you’re ready to execute a query:

>>> with connection.cursor() as cursor:
...     cursor.execute('SELECT COUNT(*) FROM users')
...     result = cursor.fetchone()
... print(result)
(2,)

You used the connection object to create a cursor. Just like a file in Python, cursor is implemented as a context manager. When you create the context, a cursor is opened for you to use to send commands to the database. When the context exits, the cursor closes and you can no longer use it.

While inside the context, you used cursor to execute a query and fetch the results. In this case, you issued a query to count the rows in the users table. To fetch the result from the query, you executed cursor.fetchone() and received a tuple. Since the query can only return one result, you used fetchone(). If the query were to return more than one result, then you’d need to either iterate over cursor or use one of the other fetch* methods.
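For example, a quick sketch of the multi-row case against the same users table:

>>> with connection.cursor() as cursor:
...     cursor.execute('SELECT username FROM users')
...     for row in cursor:  # iterate instead of fetchone()
...         print(row)
('ran',)
('haki',)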

Using Query Parameters in SQL

In the previous section, you created a database, established a connection to it, and executed a query. The query you used was static. In other words, it had no parameters. Now you’ll start to use parameters in your queries.

First, you’re going to implement a function that checks whether or not a user is an admin. is_admin() accepts a username and returns that user’s admin status:

# BAD EXAMPLE. DON'T DO THIS!
def is_admin(username: str) -> bool:
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT
                admin
            FROM
                users
            WHERE
                username = '%s'
        """ % username)
        result = cursor.fetchone()
    admin, = result
    return admin

This function executes a query to fetch the value of the admin column for a given username. You used fetchone() to return a tuple with a single result. Then, you unpacked this tuple into the variable admin. To test your function, check some usernames:

>>> is_admin('haki')
False
>>> is_admin('ran')
True

So far so good. The function returned the expected result for both users. But what about a non-existent user? Take a look at this Python traceback:

>>> is_admin('foo')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 12, in is_admin
TypeError: cannot unpack non-iterable NoneType object

When the user does not exist, a TypeError is raised. This is because .fetchone() returns None when no results are found, and unpacking None at the line admin, = result is what raises the TypeError.

To handle non-existing users, create a special case for when result is None:

# BAD EXAMPLE. DON'T DO THIS!
def is_admin(username: str) -> bool:
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT
                admin
            FROM
                users
            WHERE
                username = '%s'
        """ % username)
        result = cursor.fetchone()

    if result is None:
        # User does not exist
        return False

    admin, = result
    return admin

Here, you’ve added a special case for handling None. If username does not exist, then the function should return False. Once again, test the function on some users:

>>> is_admin('haki')
False
>>> is_admin('ran')
True
>>> is_admin('foo')
False

Great! The function can now handle non-existing usernames as well.

Exploiting Query Parameters With Python SQL Injection

In the previous example, you used string interpolation to generate a query. Then, you executed the query and sent the resulting string directly to the database. However, there’s something you may have overlooked during this process.

Think back to the username argument you passed to is_admin(). What exactly does this variable represent? You might assume that username is just a string that represents an actual user’s name. As you’re about to see, though, an intruder can easily exploit this kind of oversight and cause major harm by performing Python SQL injection.

Try to check if the following user is an admin or not:

>>> is_admin("'; select true; --")
True

Wait… What just happened?

Let’s take another look at the implementation. Print out the actual query being executed in the database:

>>> print("select admin from users where username = '%s'" % "'; select true; --")
select admin from users where username = ''; select true; --'

The resulting text contains three statements. To understand exactly how Python SQL injection works, you need to inspect each part individually. The first statement is as follows:

select admin from users where username = '';

This is your intended query. The semicolon (;) terminates the query, so the result of this query does not matter. Next up is the second statement:

select true;

This statement was constructed by the intruder. It’s designed to always return True.

Lastly, you see this short bit of code:

--'

This snippet defuses anything that comes after it. The intruder added the comment symbol (--) to turn everything you might have put after the last placeholder into a comment.

When you execute the function with this argument, it will always return True. If, for example, you use this function in your login page, an intruder could log in with the username '; select true; --, and they’ll be granted access.

If you think this is bad, it could get worse! Intruders with knowledge of your table structure can use Python SQL injection to cause permanent damage. For example, the intruder can inject an update statement to alter the information in the database:

>>> is_admin('haki')
False
>>> is_admin("'; update users set admin = 'true' where username = 'haki'; select true; --")
True
>>> is_admin('haki')
True

Let’s break it down again:

';

This snippet terminates the query, just like in the previous injection. The next statement is as follows:

update users set admin = 'true' where username = 'haki';

This section updates admin to true for user haki.

Finally, there’s this code snippet:

select true; --

As in the previous example, this piece returns true and comments out everything that follows it.

Why is this worse? Well, if the intruder manages to execute the function with this input, then user haki will become an admin:

psycopgtest=# select * from users;
 username | admin
----------+-------
 ran      | t
 haki     | t
(2 rows)

The intruder no longer has to use the hack. They can just log in with the username haki. (If the intruder really wanted to cause harm, then they could even issue a DROP DATABASE command.)

Before you forget, restore haki back to its original state:

psycopgtest=# update users set admin = false where username = 'haki';
UPDATE 1

So, why is this happening? Well, what do you know about the username argument? You know it should be a string representing the username, but you don’t actually check or enforce this assertion. This can be dangerous! It’s exactly what attackers are looking for when they try to hack your system.

Crafting Safe Query Parameters

In the previous section, you saw how an intruder can exploit your system and gain admin permissions by using a carefully crafted string. The issue was that you allowed the value passed from the client to be executed directly to the database, without performing any sort of check or validation. SQL injections rely on this type of vulnerability.

Any time user input is used in a database query, there’s a possible vulnerability for SQL injection. The key to preventing Python SQL injection is to make sure the value is being used as the developer intended. In the previous example, you intended for username to be used as a string. In reality, it was used as a raw SQL statement.

To make sure values are used as they’re intended, you need to escape the value. For example, to prevent intruders from injecting raw SQL in the place of a string argument, you can escape quotation marks:

>>> # BAD EXAMPLE. DON'T DO THIS!
>>> username = username.replace("'", "''")

This is just one example. There are a lot of special characters and scenarios to think about when trying to prevent Python SQL injection. Lucky for you, modern database adapters come with built-in tools for preventing Python SQL injection by using query parameters. These are used instead of plain string interpolation to compose a query with parameters.

Note: Different adapters, databases, and programming languages refer to query parameters by different names. Common names include bind variables, replacement variables, and substitution variables.

Now that you have a better understanding of the vulnerability, you’re ready to rewrite the function using query parameters instead of string interpolation:

def is_admin(username: str) -> bool:
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT
                admin
            FROM
                users
            WHERE
                username = %(username)s
        """, {
            'username': username
        })
        result = cursor.fetchone()

    if result is None:
        # User does not exist
        return False

    admin, = result
    return admin

Here’s what’s different in this example:

  • In line 9, you used a named parameter username to indicate where the username should go. Notice how the parameter username is no longer surrounded by single quotation marks.

  • In line 11, you passed the value of username as the second argument to cursor.execute(). The connection will use the type and value of username when executing the query in the database.

To test this function, try some valid and invalid values, including the dangerous string from before:

>>> is_admin('haki')
False
>>> is_admin('ran')
True
>>> is_admin('foo')
False
>>> is_admin("'; select true; --")
False

Amazing! The function returned the expected result for all values. What’s more, the dangerous string no longer works. To understand why, you can inspect the query generated by execute():

>>> with connection.cursor() as cursor:
...    cursor.execute("""
...        SELECT
...            admin
...        FROM
...            users
...        WHERE
...            username = %(username)s
...    """, {
...        'username': "'; select true; --"
...    })
...    print(cursor.query.decode('utf-8'))
SELECT
    admin
FROM
    users
WHERE
    username = '''; select true; --'

The connection treated the value of username as a string and escaped any characters that might terminate the string and introduce Python SQL injection.
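If you ever want to see the exact statement psycopg2 will send after binding, cursor.mogrify() returns it without executing anything. A small sketch:

>>> with connection.cursor() as cursor:
...     print(cursor.mogrify(
...         "SELECT admin FROM users WHERE username = %(username)s",
...         {'username': "'; select true; --"}
...     ).decode('utf-8'))
SELECT admin FROM users WHERE username = '''; select true; --'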

Passing Safe Query Parameters

Database adapters usually offer several ways to pass query parameters. Named placeholders are usually the best for readability, but some implementations might benefit from using other options.

Let’s take a quick look at some of the right and wrong ways to use query parameters. The following code block shows the types of queries you’ll want to avoid:

# BAD EXAMPLES. DON'T DO THIS!
cursor.execute("SELECT admin FROM users WHERE username = '" + username + "'")
cursor.execute("SELECT admin FROM users WHERE username = '%s'" % username)
cursor.execute("SELECT admin FROM users WHERE username = '{}'".format(username))
cursor.execute(f"SELECT admin FROM users WHERE username = '{username}'")

Each of these statements passes username from the client directly to the database, without performing any sort of check or validation. This sort of code is ripe for inviting Python SQL injection.

In contrast, these types of queries should be safe for you to execute:

# SAFE EXAMPLES. DO THIS!
cursor.execute("SELECT admin FROM users WHERE username = %s", (username, ))
cursor.execute("SELECT admin FROM users WHERE username = %(username)s", {'username': username})

In these statements, username is passed as a parameter (positional in the first query, named in the second). Now, the database will use the specified type and value of username when executing the query, offering protection from Python SQL injection.

Using SQL Composition

So far you’ve used parameters for literals. Literals are values such as numbers, strings, and dates. But what if you have a use case that requires composing a different query—one where the parameter is something else, like a table or column name?

Inspired by the previous example, let’s implement a function that accepts the name of a table and returns the number of rows in that table:

# BAD EXAMPLE. DON'T DO THIS!
def count_rows(table_name: str) -> int:
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT
                count(*)
            FROM
                %(table_name)s
        """, {
            'table_name': table_name,
        })
        result = cursor.fetchone()

    rowcount, = result
    return rowcount

Try to execute the function on your users table:

>>> count_rows('users')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 9, in count_rows
psycopg2.errors.SyntaxError: syntax error at or near "'users'"
LINE 5:                 'users'
                        ^

The command failed to generate the SQL. As you’ve seen already, the database adapter treats the variable as a string or a literal. A table name, however, is not a plain string. This is where SQL composition comes in.

You already know it’s not safe to use string interpolation to compose SQL. Luckily, Psycopg provides a module called psycopg.sql to help you safely compose SQL queries. Let’s rewrite the function using psycopg.sql.SQL():

from psycopg2 import sql

def count_rows(table_name: str) -> int:
    with connection.cursor() as cursor:
        stmt = sql.SQL("""
            SELECT
                count(*)
            FROM
                {table_name}
        """).format(
            table_name = sql.Identifier(table_name),
        )
        cursor.execute(stmt)
        result = cursor.fetchone()

    rowcount, = result
    return rowcount

There are two differences in this implementation. First, you used sql.SQL() to compose the query. Then, you used sql.Identifier() to annotate the argument value table_name. (An identifier is a column or table name.)
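If you want to inspect what the composed statement will look like, psycopg2's sql objects can be rendered to a plain string; as_string() needs a connection (or cursor) for proper quoting. A quick sketch:

# Render the composed query for inspection before executing it.
print(stmt.as_string(connection))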

Note: Users of the popular package django-debug-toolbar might get an error in the SQL panel for queries composed with psycopg2.sql.SQL(). A fix is expected for release in version 2.0.

Now, try executing the function on the users table:

>>> count_rows('users')
2

Great! Next, let’s see what happens when the table does not exist:

>>> count_rows('foo')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 11, in count_rows
psycopg2.errors.UndefinedTable: relation "foo" does not exist
LINE 5:                 "foo"
                        ^

The function throws the UndefinedTable exception. In the following steps, you’ll use this exception as an indication that your function is safe from a Python SQL injection attack.

Note: The exception UndefinedTable was added in psycopg2 version 2.8. If you’re working with an earlier version of Psycopg, then you’ll get a different exception.

To put it all together, add an option to count rows in the table up to a certain limit. This feature might be useful for very large tables. To implement this, add a LIMIT clause to the query, along with query parameters for the limit’s value:

from psycopg2 import sql

def count_rows(table_name: str, limit: int) -> int:
    with connection.cursor() as cursor:
        stmt = sql.SQL("""
            SELECT
                COUNT(*)
            FROM (
                SELECT
                    1
                FROM
                    {table_name}
                LIMIT
                    {limit}
            ) AS limit_query
        """).format(
            table_name = sql.Identifier(table_name),
            limit = sql.Literal(limit),
        )
        cursor.execute(stmt)
        result = cursor.fetchone()

    rowcount, = result
    return rowcount

In this code block, you annotated limit using sql.Literal(). As in the previous example, psycopg will bind all query parameters as literals when using the simple approach. However, when using sql.SQL(), you need to explicitly annotate each parameter using either sql.Identifier() or sql.Literal().

Note: Unfortunately, the Python API specification does not address the binding of identifiers, only literals. Psycopg is the only popular adapter that added the ability to safely compose SQL with both literals and identifiers. This fact makes it even more important to pay close attention when binding identifiers.

Execute the function to make sure that it works:

>>> count_rows('users', 1)
1
>>> count_rows('users', 10)
2

Now that you see the function is working, make sure it’s also safe:

>>> count_rows("(select 1) as foo; update users set admin = true where name = 'haki'; --", 1)
Traceback (most recent call last):
  File "", line 1, in 
  File "", line 18, in count_rows
psycopg2.errors.UndefinedTable: relation "(select 1) as foo; update users set admin = true where name = '" does not exist
LINE 8:                     "(select 1) as foo; update users set adm...
                            ^

This traceback shows that psycopg escaped the value, and the database treated it as a table name. Since a table with this name doesn’t exist, an UndefinedTable exception was raised and you were not hacked!

Conclusion

You’ve successfully implemented a function that composes dynamic SQL without putting your system at risk for Python SQL injection! You’ve used both literals and identifiers in your query without compromising security.

You’ve learned:

  • What Python SQL injection is and how it can be exploited
  • How to prevent Python SQL injection using query parameters
  • How to safely compose SQL statements that use literals and identifiers as parameters

You’re now able to create programs that can withstand attacks from the outside. Go forth and thwart the hackers!

What are the differences between Standard SQL and Transact-SQL?

In this article, we'll explain syntax differences between standard SQL and the Transact-SQL language dedicated to interacting with SQL Server.

#1 Names of Database Objects

In relational database systems, we name tables, views, and columns, but sometimes we need to use the same name as a keyword or use special characters. In standard SQL, you can place this kind of name in double quotation marks (""), but in T-SQL, you can also place it in brackets ([]). Look at these examples for the name of a table in T-SQL:

CREATE TABLE dbo.test."first name" ( Id INT, Name VARCHAR(100));
CREATE TABLE dbo.test.[first name]  ( Id INT, Name VARCHAR(100));

Only the first delimiter (the quotation marks) is part of the SQL standard.

What Is Different in a SELECT Statement?

#2 Returning Values

The SQL standard does not have a syntax for a query returning values or values coming from expressions without referring to any columns of a table, but MS SQL Server does allow for this type of expression. How? You can use a SELECT statement alone with an expression or with other values not coming from columns of the table. In T-SQL, it looks like the example below:

SELECT 12/6;

In this expression, we don't need a table to evaluate 12 divided by 6; therefore, the FROM clause and the name of the table can be omitted.

#3 Limiting Records in a Result Set

In the SQL standard, you can limit the number of records in the results by using the syntax illustrated below:

SELECT * FROM tab FETCH FIRST 10 ROWS ONLY

T-SQL implements this syntax in a different way. The example below shows the MS SQL Server syntax:

SELECT * FROM tab ORDER BY col1 DESC OFFSET 0 ROWS FETCH FIRST 10 ROWS ONLY;

As you notice, this uses an ORDER BY clause. Another way to select rows, but without ORDER BY, is by using the TOP clause in T-SQL:

SELECT TOP 10 * FROM tab;

#4 Automatically Generating Values

The SQL standard enables you to create columns with automatically generated values. The syntax to do this is shown below:

CREATE TABLE tab (id DECIMAL GENERATED ALWAYS AS IDENTITY);

In T-SQL we can also automatically generate values, but in this way:

CREATE TABLE tab (id INTEGER IDENTITY);

#5 Math Functions

Several common mathematical functions are part of the SQL standard, but some of them go by different names in T-SQL. One of these is CEIL(x), which we don't find in T-SQL; instead, T-SQL provides CEILING(x). Likewise, many databases generate random numbers with RANDOM(), while T-SQL uses RAND(), and T-SQL's ROUND(x, d) requires the number of decimal positions d (a third argument lets it truncate instead of round). The greatest or least value in a list is returned by the GREATEST(list) and LEAST(list) functions in many databases, but T-SQL does not provide them.

T-SQL function ROUND:

SELECT ROUND(col, 0) FROM tab;

#6 Aggregate Functions

The aggregate functions COUNT, SUM, and AVG all take an argument related to a count, and both standard SQL and T-SQL allow the optional DISTINCT quantifier before that argument so that rows are counted only if the values are different from other rows:

SELECT COUNT(col) FROM tab;

SELECT COUNT(DISTINCT col) FROM tab;

But in T-SQL we don't find a population covariance function: COVAR_POP(x,y), which is defined in the SQL standard.

#7 Retrieving Parts of Dates and Times

Most relational database systems deliver many functions to operate on dates and times.

In standard SQL, the EXTRACT(YEAR FROM x) function and similar functions to select parts of dates are different from the T-SQL functions like YEAR(x) or DATEPART(year, x).

There is also a difference in getting the current date and time. Standard SQL allows you to get the current date with the CURRENT_DATE function, but in MS SQL Server, there is not a similar function, so we have to use the GETDATE function as an argument in the CAST function to convert to a DATE data type.
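As a sketch of the two styles side by side (the column and table names are illustrative):

-- Standard SQL
SELECT EXTRACT(YEAR FROM order_date) FROM orders;

-- T-SQL
SELECT YEAR(order_date), DATEPART(year, order_date) FROM orders;
SELECT CAST(GETDATE() AS DATE);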

#8 Operating on Strings

Using functions to operate on strings is also different between the SQL standard and T-SQL. The main difference is found in removing trailing and leading spaces from a string. In standard SQL, there is the TRIM function, but in T-SQL, there are several related functions: TRIM (removing trailing and leading spaces), LTRIM (removing leading spaces), and RTRIM (removing trailing spaces).

Another very-often-used string function is SUBSTRING.

The standard SQL syntax for the SUBSTRING function looks like:

SUBSTRING(str FROM start [FOR len])

but in T-SQL, the syntax of this function looks like:

SUBSTRING(str, start, length)

There are reasons sometimes to add values coming from other columns and/or additional strings. Standard SQL enables the following syntax to do this:

SELECT col1 || col2 FROM tab;

As you can see, this syntax makes use of the || operator to add one string to another.

But the equivalent operator in T-SQL is the plus sign character. Look at this example:

SELECT col1 + col2  FROM tab;

In SQL Server, we also have the possibility to use the CONCAT function, which concatenates a list of strings:

SELECT CONCAT(col1, str1, col2, ...)  FROM tab;

We can also repeat one character several times. Standard SQL defines the function REPEAT(str, n) to do this. Transact-SQL provides the REPLICATE function. For example:

SELECT  REPLICATE(str, x);

where x indicates how many times to repeat the string or character.

#9 Inequality Operator

During filtering records in a SELECT statement, sometimes we have to use an inequality operator. Standard SQL defines <> as this operator, while T-SQL allows for both the standard operator and the != operator:

SELECT col3 FROM tab WHERE col1 != col2;

#10 ISNULL Function

In T-SQL, we have the ability to replace NULL values coming from a column using the ISNULL function. This is a function that is specific to T-SQL and is not in the SQL standard.

SELECT ISNULL(col1, 'no value') FROM tab;

Which Parts of DML Syntax Are Different?

In T-SQL, the basic syntax of DELETE, UPDATE, and INSERT queries is the same as the SQL standard, but differences appear in more advanced queries. Let’s look at them.

#11 OUTPUT Keyword

The OUTPUT keyword occurs in DELETE, UPDATE, and INSERT statements. It is not defined in standard SQL.

Using T-SQL, we can see extra information returned by a query. It returns both old and new values in UPDATE or the values added using INSERT or deleted using DELETE. To see this information, we have to use the Inserted and Deleted prefixes:

UPDATE tab SET col='new value'
OUTPUT Deleted.col, Inserted.col;

We see the result of changing records with the previous and new values in an updated column. The SQL standard does not support this feature.

#12 Syntax for INSERT INTO ... SELECT

Another structure of an INSERT query is INSERT INTO … SELECT. T-SQL allows you to insert data from another table into a destination table. Look at this query:

INSERT INTO tab SELECT col1,col2,... FROM tab_source;

This form is actually part of the SQL standard; what is characteristic of SQL Server is the related SELECT ... INTO variant, which copies data into a brand-new table it creates on the fly.
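A sketch of that variant (illustrative table names):

-- Creates tab_copy and fills it with the selected rows.
SELECT col1, col2 INTO tab_copy FROM tab_source;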

#13 FROM Clause in DELETE and UPDATE

SQL Server provides extended syntax of UPDATE and DELETE with FROM clauses. You can use DELETE with FROM to use the rows from one table to remove corresponding rows in another table by referring to a primary key and a foreign key. Similarly, you can use UPDATE with FROM to update rows in one table by referring to the rows of another table using common values (a primary key in one table and a foreign key in the second, e.g. the same city name). Here is an example:

DELETE FROM Book
FROM Author
WHERE Author.Id=Book.AuthorId AND Author.Name IS NULL;

UPDATE Book
SET Book.Price=Book.Price*0.2
FROM Author
WHERE Book.AuthorId=Author.Id AND Author.Id=12;

The SQL standard doesn’t provide this syntax.

#14 INSERT, UPDATE, and DELETE With JOIN

You can also use INSERT, UPDATE, and DELETE using JOIN to connect to another table. An example of this is:

DELETE ItemOrder FROM ItemOrder
JOIN Item ON ItemOrder.ItemId=Item.Id
WHERE YEAR(Item.DeliveredDate) <= 2017;

This feature is not in the SQL standard.

Summary

This article does not cover all the issues about syntax differences between the SQL standard and T-SQL in the MS SQL Server system. However, this guide helps point out some basic features characteristic only of Transact-SQL and some SQL standard syntax that isn't implemented by MS SQL Server.

Thanks for reading. If you liked this post, share it with all of your programming buddies!

Originally published on https://dzone.com


Comparing Python and SQL for Building Data Pipelines

Breaking into the workforce as a web developer, my first interaction with databases and SQL was using an Object Relational Model (ORM). I was using the Django query sets API and had an excellent experience using the interface.

Thereafter, I changed to a data engineering role and became much more involved in leveraging datasets to build AI. It became my responsibility to take the data from the user application and turn it into something usable by Data Scientists, a process commonly known as ETL.

The data on our production system was very messy and required a lot of transformations before anyone was going to be able to build AI on top of it. There were JSON columns that had different schemas per row, columns contained mixed data types, and some rows had erroneous values (people saying that they were born before 1850 or in the future). As I set out on cleaning, aggregating, and engineering features for the data, I tried to decide which language would be best for the task. Having used Python all day every day before this, I knew that it could do the job. However, what I learned through this experience was that just because Python could do the job doesn't mean it should.

The first time I misjudged SQL is when I assumed that SQL couldn’t do complicated transformations

We were working with a time-series dataset where we wanted to track particular users over time. Privacy laws prevent us from knowing the specific dates of the user visits, so we decided that we would normalize the date of the record to the user's first visit (i.e., 5 days after their first visit, etc.). For our analysis, it was important to know the time since the last visit as well as the time since their first visit. I had two sample datasets, one with approximately 7.5 million rows measuring 6.5 GB, and the other with 550,000 rows measuring 900 MB.

Using the Python and SQL code seen below, I used the smaller dataset to first test the transformations. Python and SQL completed the task in 591 and 40.9 seconds respectively. This means that SQL was able to provide a speed-up of roughly 14.5x!

# PYTHON
import pandas as pd

# connect to db using wrapper around psycopg2
db = DatabaseConnection(db='db', user='username', password='password')
# grab data from db and load into memory
df = db.run_query("SELECT * FROM cleaned_table;")
df = pd.DataFrame(df, columns=['user_id', 'series_id', 'timestamp'])
# calculate time since first visit
df = df.sort_values(['user_id', 'timestamp'], ascending=True).assign(time_since_first=df.groupby('user_id').timestamp.apply(lambda x: x - x.min()))
# calculate time since last visit
df = df.assign(time_since_last=df.sort_values(['timestamp'], ascending=True).groupby('user_id')['timestamp'].transform(pd.Series.diff))
# save df to compressed csv
df.to_csv('transform_time_test.gz', compression='gzip')

-- SQL equivalent
-- increase the working memory (be careful with this)
set work_mem='600MB';
-- create a dual index on the partition
CREATE INDEX IF NOT EXISTS user_time_index ON table(user_id, timestamp);
-- calculate time since last visit and time since first visit in one pass 
SELECT *, AGE(timestamp, LAG(timestamp, 1, timestamp) OVER w) AS time_since_last, AGE(timestamp, FIRST_VALUE(timestamp) OVER w) AS time_since_first FROM table WINDOW w AS (PARTITION BY user_id ORDER BY timestamp);

This SQL transformation was not only faster, but the code is also more readable and thus easier to maintain. Here, I used the LAG and FIRST_VALUE window functions to find specific records in each user's history (called a partition). I then used the AGE function to determine the time difference between visits.

What's even more interesting is that when these transformation scripts were applied to the 6.5 GB dataset, Python completely failed. Out of 3 attempts, Python crashed 2 times and my computer completely froze the 3rd time… while SQL took 226 seconds.

More info:

https://www.postgresql.org/docs/9.5/functions-window.html

http://www.postgresqltutorial.com/postgresql-window-function/

The second time I misjudged SQL is when I thought that it couldn’t flatten irregular json

Another game changer for me was realizing that Postgres works with JSON quite well. I initially thought that it would be impossible to flatten or parse JSON in Postgres… I can't believe that I was so dumb. If you want to relationalize JSON and its schema is consistent between rows, then your best bet is probably to use Postgres' built-in ability to parse JSON.

-- SQL (the -> syntax is how you parse json)
SELECT user_json->'info'->>'name' as user_name FROM user_table;

On the other hand, half the JSON in my sample dataset isn't valid JSON and thus is stored as text. In that case I was left with a choice: I could either recode the data to make it valid, or I could just drop the rows that didn't follow the rules. To do this, I created a new SQL function called is_json that I could then use to qualify valid JSON in a WHERE clause.

-- SQL
create or replace function is_json(text)
returns boolean language plpgsql immutable as $$
begin
    perform $1::json;
    return true;
exception
    when invalid_text_representation then
        return false;
end $$;

SELECT user_json->'info'->>'name' as user_name FROM user_table WHERE is_json(user_json);

Unfortunately, I found that the user_json had a different schema depending on which app version the user was on. Although this makes sense from an application development point of view, it makes it really expensive to conditionally parse every possibility per row. Was I destined to enter Python again… not a chance! I found another function on Stack Overflow written by a Postgres god named klin.

-- SQL
create or replace function create_jsonb_flat_view
    (table_name text, regular_columns text, json_column text)
    returns text language plpgsql as $$
declare
    cols text;
begin
    execute format ($ex$
        select string_agg(format('%2$s->>%%1$L "%%1$s"', key), ', ')
        from (
            select distinct key
            from %1$s, jsonb_each(%2$s)
            order by 1
            ) s;
        $ex$, table_name, json_column)
    into cols;
    execute format($ex$
        drop view if exists %1$s_view;
        create view %1$s_view as
        select %2$s, %3$s from %1$s
        $ex$, table_name, regular_columns, cols);
    return cols;
end $$;

This function was able to successfully flatten my json and solve my worst nightmare quite easily.
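A hedged usage sketch, matching the function's (table_name, regular_columns, json_column) signature with the column names from my dataset:

-- Build a flat view over user_table, keeping user_id as a regular
-- column and expanding every key found in the user_json column.
SELECT create_jsonb_flat_view('user_table', 'user_id', 'user_json');
SELECT * FROM user_table_view;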

Final Comments

There is an idiom that declares Python the second-best language for almost anything. I believe this to be true, and in some instances I have found the performance difference between Python and the 'best' language to be negligible. In this case, however, Python was unable to compete with SQL. These realizations, along with readings I've been doing, have completely changed my approach to ETL. I now work under the paradigm of "Do not move data to code; move code to your data". Python moves your data to the code, while SQL acts on it in place. What's more, I know that I've only scratched the surface of SQL and Postgres' abilities. I'm looking forward to more awesome functionality, and the possibility of getting speed-ups from using an analytical warehouse.

=======================

Thanks for reading! If you liked this post, share it with all of your programming buddies! Follow me on Facebook | Twitter
