Databases are an important part of (almost) every piece of software today. In this article I'm going to tell you everything you need to know to start working with them.
Databases are an important part of (almost) every piece of software today. In this article I'm going to tell you everything you need to know to start working with them.
If you manage information in files or folders, you sooner or later will find that:
This approach for handling information is inefficient, databases were made to fix these problems.
A Database is a system that allows everyone to share, manage and use data. To use one you first need to understand some things:
There are many kinds of databases, these three are the most commonly used data models for databases:
To use Hierarchical Data Model and Network Data Model you must manage data by keeping physical location and the order of data in mind, so performing flexible and high-speed search of your data is difficult.
That’s why we will be using the Relational Data Model.
A relational database is a type of database. It uses a structure that allows us to identify and access data in relation to another piece of data in the database. Data in a relational database is organized into tables.
A table is a set of data elements (values).
A piece of data in a file is called a record.
Each item in a record is called a field.
One piece of data or record is called a row.
Each item or field is called a column.
A field is often given an important role in the database, when this happens we call that field the Primary Key. In this example, Product Code is the Primary Key. More information of this in the next section.
A Unique value is a value that cannot be repeated (Like product name, you shouldn’t have two products with the same name).
Null is the absence of value (as seen above in “Remarks”, where there are empty values). Some Fields can be null (depending on the database).
When you try to create a database yourself, the first step is to determine the conditions of the data you are trying to model.
Model used for analysis to make diagrams. In this model you consider the actual world using concepts of entity and relationship.
E refers Entity. A recognizable object in the real world. For example, when exporting fruits to other countries, fruit and export destination can be considered entities. Represented with a rectangle. Each entity has attributes, particular properties that describe the entity (Product Name in Fruit, for example). Represented with an ovallus.
R refers relationship. How entities relate with each other. For example, fruits and export destination are associated with each other because you sell fruits to export destinations. Represented with a diamond. Fruit is exported to many Export destinations and Export destination purchases many kinds of fruit. We call this a many-to-many relationship. In the E-R model the number of associations between entities is considered, this is called cardinality.
Cardinality: Number of associations between entities.
One-to-one relationship (1-1): I only sell fruits to you and you only buy fruits from me.One-to-many (or many-to-one) relationship (n-1 or 1-n): I sell fruits to other families and those families buy fruits only from me.Many-to-many relationship (n-n)*: You have an example above.### Normalization
Process of tabulating data from the real world for a relational database, following a serie of steps. It is necessary to Normalize data in order to properly manage a relational database. Normalization is used for mainly two purposes:
The First Normal Form is created from this table. All the attributes you have identified for a given entity are probably grouped together in a flat structure. This is where the process of normalization comes into play, to organize the attributes.
For a table to be in the First Normal Form, it should follow the following 4 rules:
Before going on you must first know this:
Now that you are familiar with the basic terminology and the ER model, you are ready to design a database.
When you use the database you have to input or retrieve data using SQL (Structured Query Language). SQL allows you to communicate with the database. Some commands may change depending on the Database Management System (SQL Server for example) you are using.
It’s commands can be broken down into three distinct types:
Most basic SQL statement.
SELECT product_name /*The column you want to see...*/ FROM product; /*...from the table it belongs.*/
Used to specify the information you want.
This statement retrieves all data from the product table that has a unit price greater than or equal to 200.
SELECT * /*This selects every column in the table.*/ FROM product WHERE unit_price>=200; /*Very easy to understand right?*/
And this one retrieves all data with the product name ‘apple’.
SELECT * FROM product WHERE product_name=’apple’;
SELECT * FROM product WHERE unit_price BETWEEN 150 AND 200; /*Between doesn’t need explanation I think.*/ SELECT * FROM product WHERE unit_price is NULL; /*This one neither.*/
When you don’t know exactly what to search for you can use pattern matching in conditions, by using wildcard characters in a LIKE statement.
SELECT * FROM product WHERE product_name LIKE "%n"; /*Will search for data ending with the letter ‘n’.*/
Sort data based on a certain column.
SELECT * FROM product WHERE product_name LIKE "%n" ORDER BY unit_price; /*Easy to understand, right*/
Also known as Set Functions. You can use them to aggregate information such as maximum or minimum values.
SELECT MAX(unit_price) FROM product
If you group data you can obtain aggregated values easily. To group data combine an aggregate function with GROUP BY.
SELECT district, AVG(unit_price) FROM product GROUP BY district; /*Output: Average unit price per district*/
You can’t use WHERE with aggregate functions, you have to use HAVING.
SELECT district, AVG(unit_price) FROM product GROUP BY district; HAVING AVG(unit_price)>=200; /*Filters result after being grouped.*/
There are more complicated query methods in SQL.
You can embed one query in another query (this is called a subquery).
SELECT * FROM product WHERE product_code IN ( SELECT product_code FROM sales_statement WHERE quantity>=1000 );
The query in parentheses is performed first. The other SELECT is performed with the result. The IN operator allows multiple WHERE values (that’s why the first select works with the second).
A subquery may refer to data from the outer query, this is called a correlated query.
SELECT * FROM sales_statement U /*Alias? Maybe more on this later*/ WHERE quantity> ( SELECT AVG(quantity) FROM sales_statement WHERE product_code=U.product_code );
SQL aliases are used to give a table, or a column in a table, a temporary name. Often used to shorten column names to make working with them easier (mostly in joins). An alias only exists for the duration of the query.
SELECT o.OrderID, o.OrderDate, c.CustomerName FROM Customers AS c, Orders AS o WHERE c.CustomerName="Around the Horn" AND c.CustomerID=o.CustomerID;
JOIN is used to combine rows from two or more tables, based on a related column between them.
Here are the different types of the JOINs in SQL:
(INNER) JOIN: Returns records that have matching values in both tables.
LEFT (OUTER) JOIN: Return all records from the left table, and the matched records from the right table.
RIGHT (OUTER) JOIN: Return all records from the right table, and the matched records from the left table.
FULL (OUTER) JOIN: Return all records when there is a match in either left or right table.
Join against equality or matching column(s) values of the associated tables. An equal sign (=) is used as comparison operator in the where clause to refer equality.
SELECT * FROM table1 JOIN table2 [ON (join_condition)] /*You can also do it without JOIN*/ SELECT column_list FROM table1, table2.... WHERE table1.column_name = table2.column_name;
The code can be different depending on which kind of database you are working with.
CREATE TABLE product /*This line needs no explanation, right?*/ ( product_code INT NOT NULL, product_name VARCHAR(255), unit_price INT, PRIMARY KEY(product_code) );
Specifications for a table to prevent data conflicts.
INSERT INTO product (product_code, product_name, unit_price) VALUES (101, "melon", 800); INSERT INTO product VALUES (101, "melon", 800);
Both statements do the same. You insert data in the type and order it was defined. Remember the constraints (like primary keys).
Allows you to modify data inside a table.
UPDATE product SET product_name="cantaloupe" /*New value*/ WHERE product_name="melon"; /*Specific value to overwrite*/
Delete data from a table.
DELETE FROM product WHERE product_name="apple"; /*Row to delete*/
You can create a virtual table that exists only when it is viewed by a user. This is a view. The table from which a view is derived is called a base table.
CREATE VIEW expensive_product /*The view name is "expensive_product"*/ (product_code, product_name, unit_price) AS SELECT * FROM product WHERE unit_price>=200;
To use the view:
SELECT * FROM expensive_product WHERE unit_price>=500;
It is convenient to create a view when you want to make part of the data in a table public.
Allows you to delete:
/*A view:*/ DROP VIEW expensive_product; /*A base table:*/ DROP TABLE product; /*A database:*/ DROP DATABASE name;
A unit of data operations is called a transaction. Example: Reading data, writing data. A transaction always ends with a commit or a rollbackoperation.
It is important to ensure that multiple transactions can be processed without conflicting data. It is also important to protect data from inconsistencies in case a failure occurs while a transaction is being processed. To that end, the following table lists the properties required for a transaction, which memorably spell ACID.
Operations in the database are finalized when each transaction is processed correctly. That finalization is called a commit operation.
Operations by many users are controlled so that nothing goes wrong when they access the database concurrently. For that purpose, a method called Lock is used.
You lock data to prevent it from being erroneously processed.
If I perform some operations on the database, the data is locked until my operations are finished, then is unlocked to be used by another user and the data is locked again, and unlocked when this user finishes.
Although a lock has its own role in a database, it should not be overused because it can hinder its purpose: sharing data with a lot of people. So we use different types of locks depending on the situation.
For example, you can use a shared lock for a read operation when it is the only operation needed. Other users can read data but cannot perform a write operation on it.
When performing a write operation a user applies an exclusive lock. When it is applied other users cannot read or write data.
When a lock is used to control two or more transactions. Concurrency allows as many users as possible to use a database at one time while preventing data conflicts from ocurring.
In order to make sure a schedule is serializable we need to obey specific rules for setting and releasing locks. One of these rules is two-phase locking, for each transaction two phases should be used: one for setting locks and the other for releasing them.
There are a number of resources that can be locked. The extent to which resources are locked is referred to as granularity. Coarse granularity occurs when many resources are locked at once, and fine granularity occurs when few resources are locked. When granularity is coarse (or high) the number of locks needed per transaction is reduced, reducing the amount of processing required.
Simpler methods can be used when you have a small number of transactions or a high number of read operations.
You can set the level of transactions that can be processed concurrently, this is referred to as the isolation level. The SET TRANSACTION statement can be used to specify the isolation level of the following transactions:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
Depending on the isolation level setting, any of the following actions may occur:
Two users use an exclusive lock on two tables. Then, both will try to apply the same lock to the other table. Since each of them must wait for the lock applied by the other user to be released, neither can proceed with any operation. This situation is called a deadlock and cannot be solved unless one of the locks is released.
When a deadlock occurs, you can look for transactions that have been queued for a certain time and cancel them. Cancelling a transaction (every operation in it) is called a rollback. For example, if you wanted to apply a discount to all fruits with a price of 150 or more and one of them fails, you cancel everything and the database behaves as if no operation had been performed.
If you don’t secure the database, data can be deleted or modified without permission. A good solution may be to require usernames and passwords to limit users and limit operations to certain users (for example, only an administrator can DROP a table).
You can grant access to users using GRANT.
GRANT SELECT, UPDATE ON product TO Overseas_business_department;
Granting privilege with WITH GRANT OPTION enables the user to grant privileges to other users.
GRANT SELECT, UPDATE ON product TO overseas_business_department WITH GRANT OPTION;
To take away a user’s privileges uso REVOKE.
REVOKE SELECT, UPDATE ON product FROM overseas_business_department;
Some database products can group a number of privileges and grant them to multiple users at once. Grouping makes privilege management easier. Using views enables even more security, enabling the view to certain users you also protect the selected data in the view.
A database needs to have a mechanism that can protect data in the system in the event of a failure. To ensure durability of transactions is mandatory that no failure can create incorrect or faulty data. To protect itself from failure a database performs various operations which include creating backup and transaction logs.
Records called logs are kept whenever a data operation is performed. When a problem has occurred in the system you first restart the system and utilize logs to recover the database. The recovery method varies depending on whether or not the transaction has been committed.
In order to improve the efficiency of a write operation in a database, a buffer (segment of memory used to temporarily hold data) is often used to write data in the short term. The contents of the buffer and the database are synchronized, and then a checkpoint is written. When the database writes a checkpoint, it doesn’t have to perform any failure recovery for transactions that were committed before the checkpoint. Transactions that weren’t committed before the checkpoint must be recovered.
As the database grows and more people begin using it some problems may appear. The greater the volume of data, the slower a search operation becomes.
An index works as in a book. A blind search of a piece of information in a book would take time, so you check the index to speed up your search. It is very time consuming to browse all rows when searching for certain data. If you create indexed for product codes you can instantly learn where product data is stored for a product assigned. It tells you where on the disk is located, reducing the disk access count and therefore accelerating future searches.
It is up to the database administrator to add indexes. Creating too many indexes may lead to inefficiency (imagine a large number of indexes in a book, it won’t do any good).
Program logic inside the database server (basically queries inside the server). They help reduce the load on the network because it eliminates the need for frequent transfers of SQL queries.
A database in which not all storage devices are attached to a common processor but in multiple computers, located in the same physical location or dispersed over a network of interconnected computers. Keep in mind that it may be handled as a single database.
SQL stands for Structured Query Language. SQL is a scripting language expected to store, control, and inquiry information put away in social databases. The main manifestation of SQL showed up in 1974, when a gathering in IBM built up the principal model of a social database. The primary business social database was discharged by Relational Software later turning out to be Oracle.
Learn how Big Data and Business Intelligence, both technologies helps the decision makers to make proper decisions that can help the organization to get advantages over their peers.
Big Data Tutorial - Big Data Cluster Administration: In SQL Server 2019 Big Data Clusters, we ensure that management services embedded with the platform provide fast scale and upgrade operations, automatic logs and metrics collection, enterprise grade secure access and high availability. In this video we will provide an overview of these administration experiences for Big Data Clusters.
Even though Big data is into main stream of operations as of 2020, there are still potential issues or challenges the researchers.
In this article, we will go over key statistics highlighting the main data validation issues that currently impact big data companies.