A lightweight desktop script tool is a must-have for data analysts. But how do you know which is the most suitable one?
Almost all programming languages can manipulate data. Some are too general to lack functions for performing structured computations, such as C++ and JAVA, which produce lengthy code to deal with daily data analysis scenarios and are more suitable for taking care of major special projects. Some are technically-targeted and too highly-professional for daily analysis work, such as mathematical programming languages MATLAB and R, though they provide functions for structured data processing. My subjects in this article are the lightweight programming languages that are suitable for doing desktop analytic jobs. They are lightweight databases represented by MySQL, Excel VBA, Python pandas and esProc.
Now I’ll scrutinize the pros and cons of each to look at their capabilities.
It’s easy to run a small database, such as HSQLDB, DerbyDB, SQLite or MySQL, on desktop. Here I’ll take MySQL as an example.
The portable version of MySQL is convenient to install and configure. Though an environment configuration problem, like the folder permission issue, can only be solved with the installer version, the user-friendly wizard will make up for the trouble.
MySQL supports executing SQL with its built-in command-line tool, but the interactive user interface is crude. Many turn to a third-party tool, Navicat or Toad, to do the same thing. So the UI design isn’t MySQL’s strength.
The essential strength of a programming language is, of course, the data processing capability. For this, MySQL is intrinsically dependent on SQL to get its ability.
SQL, after nearly 50 years’ evolvement, is close to the limit of its capabilities within its model frame. Almost every basic algorithm has their SQL expression. This significantly lowers the bar for analysts who want to do data processing. In recent years, MySQL began to offer supports for window functions, WITH clause and the stored procedure. That makes it as capable as any large databases. To implement the following algorithm in MySQL, for example:
/*Filtering. emp table stores information of employees in every department*/ select eid, name, deptid, salary from emp where salary between 8000 and 10000 and hireday>=’2010–01–01' /*Sorting*/ select * from emp order by salary /*Distinct*/ select distinct deptid from emp /*Grouping & aggregation. share table stores the daily closing prices for a certain share*/ select year(sDate),month(sDate),max(price) from share group by year(sDate),month(sDate) /*Join; dept table stores department information*/ select e.name, e.salary, d.deptname from emp e inner join dept d on e.deptid=d.deptid /*Windowing; rank employees in each department by their salaries*/ select eid, name, deptid, salary, rank()over( partition by deptid order by salary desc) rk from emp
MySQL handles basic operations really well. But that is not the case with complex operations because SQL isn’t good at handling them.
SQL is not good at implementing the multistep process-mode algorithms. Here one example is to find the department having the most employees and the one with the least employees based on the emp table. Intuitively, there are two steps to get the task done. First, group the table by department and count the employees in each department; second, sort the groups by the number of employees in descending order. Now the first department and the last department are what we need. SQL, however, implements the algorithm by making it a 4-step process. The first step remains the same. Next it calculates the maximum number of employees using max function and finds the corresponding department using a nested query or a join query. Then it finds the department with the least employee with the same method. Finally, combine the results of the second step and the third step using union. The code is as follows:
with tmp as ( select dept, count(*) total from employee group by dept), deptmax as( select dept from tmp where total=( select max(total) from tmp) ), deptmin as( select dept from tmp where total=( select min(total) from tmp) ) select dept from deptmax union select dept from deptmin
It’s unnecessary lengthy.
Considering the time when SQL was invented, it’s understandable that it has certain defects. The order-based calculations are another scenario that SQL is not good at. An example is to find how many consecutive days a certain share rises based on the share table. SQL hasn’t a direct way of expressing the “consecutively rising” concept, so we need to take an extremely roundabout way. First you count the accumulative non-rising days for each transaction date. The transaction dates with same count of non-rising days are consecutive rising days. Then you group records according to whether a date is consecutive rising or not to get the maximum consecutive interval. Even a SQL expert finds it a headache to deal with such an algorithm. And their code solution is hard to read for ordinary users.
select max(consecutive_days) from (select count(*) consecutive_days from (select sum(updown_flag) over(order by sdate) no_up_days from (select sDate, case when price>LAG(price) over(order by sDate) then 0 else 1 end updown_flag from share) ) group by no_up_days)
Actually this is simpler because of the use of window function. The code is harder to write and read if you use an earlier SQL version.
Another example is to align records by a specified set. The orders table stores records of orders. We need to calculate the amount of large orders (amount >15000) on the current day in an order from Sunday to Saturday. Give a null value to a day without orders. SQL uses pseudo table technique to convert the weekday list to a set of records and then left join the pseudo table to the orders table. The implementation is complicated:
with std as( select 1 No,’Sun.’ name from dual union select 2 ,’Mon.’ from dual union select 3 ,’Tues.’ from dual union select 4 ,’Wed.’ from dual union select 5 ,’Thur’ from dual union select 6 ,’Fri.’ from dual union select 7 ,’Sat.’ from dual ) select std.No,std.name,data.total from std left join ( select DAYOFWEEK(orders.orderdate) No,sum(amount) total from orders where amount>15000 group by DAYOFWEEK(orders.birthday) ) data on std.No=data.No order by std.No
I can cite many examples of SQL headaches. The language is too old to adapt to our complicated business needs. Though it tries to keep up with the times through a series of patches and upgrades, including WITH clause, the stored procedure and window functions, the frame on which it is based confines its expression.
Besides, SQL is implemented, though not intrinsically, to be internal-oriented. SQL-based databases can compute the data tables inside a database but are hard to read and write data in an external data source.
Yet the first step of data manipulation is data source retrieval and the last of it is result set output in targeted format. One important aspect of evaluating a script tool is its ability to support external data source read/write. Unfortunately MySQL can only read (not including write) one external data source, the CSV files. And the reading process is not simple at all. To import emp.csv file of the standard format into the database, for example, you need 4 steps:
/Switch to the target database, create a new database table, import file data to the database, create index over the primary key to speed up the import/
mysql>create table emp (
-> empid int(10) not null,
-> name varchar(50),
-> deptid int(10),
-> salary float,
-> sex varchar(1),
-> birthday date,
-> hireday date)CHARSET = utf8;
mysql>LOAD DATA INFILE ‘d:\data\emp.csv’ INTO TABLE emp
->CHARACTER SET utf8
->FIELDS TERMINATED BY ‘,’
->LINES TERMINATED BY ‘\r\n’
->IGNORE 1 LINES;
mysql>ALTER TABLE emp ADD PRIMARY KEY (empid);
SQL’s closure design, which didn’t take the file retrieval into consideration at the beginning, accounts for the terribly complicated implementation, even though it was later get patched with the file retrieval feature.
A third-party tool, such as Navicat, enables MySQL to support more types of data sources. But essentially they just convert an external data source into a text file and then load it into the MySQL database. The non-native, patch-up method has a lot of drawbacks. Data sources of ancient formats, such as Dbase and Paradox, are best supported but little used. There are very strict requirements for Excel file loading and so successes are rare. The support of JSON only applies to the special two-dimensional format. Actually Navicat doesn’t support almost all common data sources we are using now.
SQL is difficult to debug. This significantly reduces the development speed.
Standard textbook algorithms don’t need debug because they are implemented with only several lines of code. The real-world data manipulation code is complicated, often having about one hundred lines of nested SQL query. Being unable to debug makes it hard to understand and maintain the code, which leads to low performance.
In a nutshell, SQL excels at processing database data with basic operations but falls short in handling data in external data sources and implementing complex algorithms. What SQL doesn’t have has offered opportunities to new script tools that are lightweight and desktop intended for the PC era.
In this article, see the role of big data in healthcare and look at the new healthcare dynamics. Big Data is creating a revolution in healthcare, providing better outcomes while eliminating fraud and abuse, which contributes to a large percentage of healthcare costs.
In this article, see a credible approach of big data processing and an analysis of telecom data to minimize crime, combat terrorism, unsocial activities, etc.
‘Data is the new science. Big Data holds the key answers’ - Pat Gelsinger The biggest advantage that the enhancement of modern technology has brought
We need no rocket science in understanding that every business, irrespective of their size in the modern-day business world, needs data insights for its expansion. Big data analytics is essential when it comes to understanding the needs and wants of a significant section of the audience.
Even though Big data is into main stream of operations as of 2020, there are still potential issues or challenges the researchers.