The Most Popular Tool for Data Analysis in SQL

There are a bunch of tools that can be used for analyzing a data. One of the most popular and the most frequently used tool for data analysis is SQL (Structured Query Language) that run on database management system.

A several functions that SQL can do for analyzing a data in database include,

Show all of the record or a specific data.
Combining two or more data set table
Do some aggregate functions.
Make a categorical label for some data, and
Grouping a data based on a parameter.

We’ll do some query function using a SQL on analyzing a fruits sales data in some regions in Indonesia. The relational database management system that we’ll use is PostgreSQL Version 16.

The dataset that we’re going to analyze are fruits sales data, employee data, region data, user data, and vendor data.

Now, let’s begin.

Showing all of the data set in specific date

Here, we want to look all of the records of the fruit sales dataset with the specific period. The date period is from January 1st, 2021, until January 10th, 2021.

SELECT * FROM fruit_sales
WHERE purchase_date BETWEEN '2021-01-01' AND '2021-01-10'
ORDER BY purchase_date ASC

First, from the fruit sales dataset table, we want to select all of the record with SELECT * FROM statement. Then, set the parameter with WHERE clause in purchase date column, with the date period that mentioned above, and set the date period with a BETWEEN function. We can add some clause like ORDER BY to arrange the record based on date period in ascending arrangement.

Then, we can run the query, and the result will be like the table below.

2. Showing all of the data set in specific product name and location

Now, we want to look all of the record of the fruit sales in specific product name (Apel and Mangga), that sold in store with ID number 9.

SELECT * FROM fruit_sales
WHERE store_city_id IN (SELECT store_city_id FROM region
                        WHERE product_name IN ('Mangga','Apel') AND store_city_id = 9)
ORDER BY product_name

From the fruit sales dataset table, we want to select all of the record with SELECT * FROM statement. Now, for the parameter we need to specify two parameter that are product name and store ID.

We are going to use IN operator that has a function to allow us to specify multiple value in a WHERE clause.

Here we specify the first WHERE clause with the store ID, then we make another specified value with another SELECT statement from dataset table named ‘Region’. In this statement, we make a specify for product name that ‘Apel’ and ‘Mangga’ that was sold in store ID number 9.

After that, the first SELECT statement will show all of the record with the parameter in the second SELECT statement.

Note that the ORDER BY operator, if not specify the arrangement ascending or descending, it will be automatically executing the ascending or alphabetic arrangement.

Then, we can run the query, and the result will be like the table below.

3. Showing all of the data set in specific product name with the lowest quantity

A store that sells a variety of products must have the type of product with the highest sales or the type of product with the lowest sales. Here we are looking for data with the lowest sales on specific product type.

SELECT * FROM fruit_sales
WHERE
   product_name = 'Mangga' AND
   quantity = (
      SELECT
         MIN(quantity)
      FROM
         fruit_sales
 );

In fruit sales data, we select the data with the setting parameter are product name with specific value ‘Mangga’, and the quantity. To define the lowest quantity, we can use MIN function to give the lowest quantity value. So, here the result after the syntax run.

4. Show employees name that contains specific letter.

There are times when the system requires the sequence of employee names for several purposes. With the data search function in SQL, it makes it easier for users to find out the name of a data record in the form of a string with specific letters.

SELECT * FROM employees
WHERE
    first_name LIKE 'Ba%';

In employees table, we can set the WHERE parameter for the first name column with LIKE operator to set the specific employee's name that contains ‘Ba’ in the first two letters of their first name. The ‘%’ symbol is used for representing zero, one, or multiple character that means find any value that start with ‘Ba’, it’s because the % symbol was placed in the end of characters. And, here for the result.

5. Calculate the average salary in each year.

In the employees table, we know that the table have a salary column for each employee in year 2020, 2021 and 2022. We want to know the average annual salary for those years.

SELECT 
    ROUND(AVG(salary_2020), 2) AS avg_salary_2020,
    ROUND(AVG(salary_2021), 2) AS avg_salary_2021,
    ROUND(AVG(salary_2022), 2) AS avg_salary_2022
FROM
    employees;

From the employee table, first, select every salary column, in this case salary 2020, salary 2021 and salary 2022 columns. Now we can use the AVG function for define the average of values in a column. We used this query in every salary column, and because we don’t have a column to store the average value, we use AS keyword to make an aliases column that store the value temporary. The ROUND function is used for rounding up the value with the digits decimal number set by 2. Here’s for the result.

6. Show employee first name and last name with their work region.

When we open the employees table, there will be an employee first name column and employee last name column and also a region column. The region column also appears in region table. Every record in region table has a primary key on store_city_id column. Here we want to show the employee first and last name with their region.

SELECT 
  employees.first_name, 
  employees.last_name, 
  region.region
FROM employees
INNER JOIN region 
ON region.store_city_id = employees.store_city_id
WHERE region IN ('Yogyakarta', 'Jawa Barat')

To solve this problem, we will use JOIN clause. A JOIN clause in SQL is used for combining rows from two or more tables that have a relative column. There are two tables that we used namely employees table and region table. From employees table, we do an INNER JOIN with region table, specified in store_city_id column. So, the query will select the records that have matching values in both tables. The WHERE parameter set in region named Yogyakarta and Jawa Barat. Here’s for the result.

7. Show the total quantity of selected products between time periods.

Here we want to show the record of the product total quantity in fruit sales table. The product names are Mangga and Apple in between May 2nd, 2022, until May 23rd, 2022.

SELECT 
     product_name,
     SUM(quantity) AS total_quantity
FROM
    fruit_sales
WHERE 
     (purchase_date BETWEEN '2022-05-02' AND '2022-05-23')
     AND
     (product_name = 'Mangga' OR product_name = 'Apel')
GROUP BY product_name

First, use the SUM function in quantity column and make and an aliases AS total quantity. From the fruit sales, select the product name and quantity. Set the WHERE parameter with specific purchase date and specific product name. Lastly, use the GROUP BY statement for grouping the SUM function based on product name. Here’s for the result.

GROUP BY statement is always used whenever there are aggregate function (SUM, MIN, MAX, COUNT, etc.) in the query with the more than one column in result.

8. Show all the employees table records in specified parameters.

In employees table, there is store_city_id that become a foreign key in the employees' table. Meanwhile in the region table, the store_city_id become the primary key. The region names are only available in region table. So, when we want to show the records that have the specific region name without combining the table, we can use a subquery system.

SELECT * FROM employees
WHERE store_city_id IN (SELECT store_city_id FROM region
                        WHERE region IN ('Bali','Yogyakarta'))

Subquery can be interpreted as query inside the query. First, from the region table, we select region name Bali and Yogyakarta. This first query become a column for the second query. For the second query, the WHERE parameter is set in store_city_id in the first query that have been specified in region name. Here is for the result.

9. Make categorical label for employee salary.

In the employees table, we have a salary column for year 2020. Now, we want to make categorical label for employee salary in store_city_id = 9, with three labels namely LOW which the salary is less than 4000, MIDDLE which salary more than and equal to 4000 and less than 7000, HIGH which salary is more than 7000.

SELECT
  salary_categories,
  COUNT (salary_categories) AS number_of_employee
FROM (
  SELECT
      CASE
           WHEN salary_2020 < 4000 THEN 'LOW'
           WHEN salary_2020 >= 4000 AND salary_2020 < 7000 THEN 'MIDDLE'
           WHEN salary_2020 > 7000 THEN 'HIGH'
      END salary_categories
  FROM employees
  WHERE store_city_id = 9
) AS temporary_table
GROUP BY salary_categories

The main query that we use here is the subquery that become a temporary table that store salary_categories and number_of_employees data. Inside the subquery, we use CASE expression. CASE expression is used when we want to return a result if the condition is true like the if-else statement.

First, we make all the conditions that has been defined earlier and make an END function to stop the CASE expression. Here, all of the conditions are stored in salary_categories column in the temporary_table. Now, for the main query, we SELECT the salary_categories and make a COUNT for it in number_of_employee new column. Lastly, remember to put a GROUP BY function because we use an aggregate function. Here is for the result.

10. Calculate the average of sales revenue in specific region.

Here we want to calculate the average of sales revenue in regions namely Sumatera and Kalimantan. The two main tables that will be used are fruit sales table and region table. The formula that used for calculating the revenue is multiplying each row in quantity column with price_per_kg column.

WITH temp_table AS (
   SELECT 
        product_name, 
        store_city_id,
        SUM(quantity*price_per_kg) AS revenue 
   FROM fruit_sales
   GROUP BY product_name, store_city_id 
   ORDER BY store_city_id
),

avg_revenue_per_region AS (
   SELECT 
        store_city_id,
        ROUND(AVG(revenue),2) AS avg_revenue
   FROM temp_table
   GROUP BY store_city_id
)

SELECT rg.region, avg_amount.avg_revenue 
FROM avg_revenue_per_region AS avg_amount
LEFT JOIN region AS rg
ON avg_amount.store_city_id = rg.store_city_id
WHERE region IN ('Sumatera','Kalimantan')

Because we need to make a subquery and JOIN function, we can use WITH clause to create more than one subquery as temporary table. First, we make a temporary table named temp_table and make an anggregate function for calculating the revenue for each fruit sales records with SUM in quantity that multiplied by the price per kg. So, this first temporary table has a revenue records.

Now, we make the second temporary table named avg_revenue per_region that contains average revenue for each region. Now, this subquery has records for average revenue in each region.

After we’ve done with the subquery, now to showing the average revenue in specific regions, we use the JOIN function between the region table and the avg_revenue per_region table. The JOIN function matching the same store city id between these two tables.

11. Calculate the average of sales revenue in specific region and date period.

This case has the same instruction and query from the case earlier. The only difference is in this case we need to specify the time periods for the average revenue.

WITH temp_table AS (
    SELECT 
        product_name, 
        store_city_id,
        SUM(quantity*price_per_kg) AS revenue 
    FROM fruit_sales
    WHERE purchase_date BETWEEN '2021-01-01' AND '2021-12-31' 
    GROUP BY product_name, store_city_id 
    ORDER BY store_city_id
),

avg_revenue_per_region AS (
    SELECT 
        store_city_id,
        ROUND(AVG(revenue),2) AS avg_revenue
    FROM temp_table
    GROUP BY store_city_id
)

SELECT rg.region, avg_amount.avg_revenue 
FROM avg_revenue_per_region AS avg_amount
LEFT JOIN region AS rg
ON avg_amount.store_city_id = rg.store_city_id
WHERE region IN ('Yogyakarta','Sulawesi')

In the first temporary table, we specify the WHERE parameter with purchase_date column between January 1st, 2021, until December 31st, 2021. We can also make changes in the region name that we want to show.

12. Calculate the total revenue on region’s group.

In the region tables we have the region name data. In this case, some of the region can be grouped with one region label. The labels that we use here is Jawa which contains region Jawa Timur, Jawa Barat, Jawa Tengah, Jakarta, Bandung and Yogyakarta. And the label Luar Jawa for the rest of the region that not included in Jawa label.

SELECT
CASE WHEN store_city_id IN(SELECT store_city_id FROM region
     WHERE region IN ('Jawa Timur','Jawa Barat','Jawa Tengah','Jakarta','Bandung','Yogyakarta')) THEN 'Jawa'
     WHEN store_city_id IN(SELECT store_city_id FROM region
     WHERE region IN ('Bali','Sumatera','Sulawesi','Kalimantan')) THEN 'Luar Jawa'
     END AS base_region,
SUM(quantity*price_per_kg) as total_revenue
FROM fruit_sales
GROUP BY base_region

Because this case is conditionals type, we use the CASE expression in the query. Now, in every condition we set the WHERE parameter for the region in each label. You can see the details of the region data records in region table. We use the SUM function to calculate the total revenue for each label.

That’s all for some data analysis cases and practice using SQL. Hope this article will be useful and please wait for another articles about data analysis. Thanks for reading.

This blog post was originally published at: https://blog.devgenius.io

#sql #data-analysis #tools