Powerful CSV processing with kdb+ - KDnuggets

Comma-separated text files (CSV) are the most fundamental format for data processing. All programming languages and software that support working with relational data, also provide some level of CSV handling. You can persist and process data without installing a database management system. Often you don’t need a full-blown DBMS with all its features, like handling transactions and concurrent/remote access, indexing, etc… The lightweight CSV format allows for easy processing and sharing of the captured information.

The CSV format predates personal computers and has been one of the most common data exchange formats for almost 50 years. CSV files will remain with us in the future. Working with this format efficiently is a core requirement of a productive developer, data engineer/scientist, DevOps person, etc… You may need to filter rows, sort by a column, select existing or derive new columns. Perhaps you need to do complex analysis that requires aggregation and grouping.

This article provides a glimpse into the available tools to work with CSV files and describes how kdb+ and its query language q raise CSV processing to a new level of performance and simplicity.

Common CSV tools

Linux command-line tools

Many CSV processing need to be done in a Linux or Mac environment that has a powerful terminal console with some kind of shells on it. Most shells, like Bash, support arrays. You can read a CSV line-by-line and store all fields in an array variable. You can use built-in string manipulation and integer calculations (even float calculations with e.g bc -l) to operate on cell values. The code will be lengthy and hard to maintain.

General text processing tools like [awk](https://en.wikipedia.org/wiki/AWK) and [sed](https://en.wikipedia.org/wiki/Sed) scripts may result in shorter and simpler code. Commands like [cut](https://en.wikipedia.org/wiki/Cut_(Unix))[sort](https://en.wikipedia.org/wiki/Sort_(Unix))[uniq](https://en.wikipedia.org/wiki/Uniq) and [paste](https://en.wikipedia.org/wiki/Paste_(Unix)) further simplify CSV processing. You can specify the separator and refer to fields by positions.

The world is constantly changing. So do CSV files. Position-based reference breaks if a new column is added ahead of the referred column or columns are shuffled e.g. to move related columns next to each other. The problem manifests silently: your scripts may run smoothly, but you just use a different column in your calculation! If you don’t have a regression-testing framework to safeguard your codebase, then the end-user (or your competitor) might discover the problem. This can be embarrassing.

Position-based reference creates fragile code. Processing CSV by these Linux commands is great for prototyping and for quick analysis but you hit the limits once your codebase starts increasing or you share scripts with other colleagues. No wonder that in SQL the position-based column reference is limited and discouraged.

The huge advantage of Linux command-line tools is that no installation is required. Your shell script will likely run on other’s Linux systems. Familiarity with tools readily available in Linux is useful, but they should often be avoided for complex, long-lived, tasks.


Many open-source libraries offer CSV support. The Python library CSVKit is one of the most popular. It offers a more robust solution than native Linux commands, such as allowing reference of columns by name. The column names are stored in the first row of the CSV. Reference by name is sensitive to column renaming but this probably happens less frequently than adding or moving columns.

Also, CSVKit handles the first rows better than the general-purpose text tools do. Linux command sort treats the first row as any other row and can place it in the middle of the output. Similarly, cat includes the first rows when you concatenate multiple CSV files. Commands csvsort and csvstack handle first rows properly.

Finally, the CSVKit developers took special care to provide consistent command-line parameters, e.g. separator is defined by -d. In contrast, you need to remember that the separator is specified by -t for [sort](https://en.wikipedia.org/wiki/Sort_(Unix)) and -d for the other Linux commands, [cut](https://en.wikipedia.org/wiki/Cut_(Unix))[paste](https://en.wikipedia.org/wiki/Paste_(Unix)).

CSVKit includes the simply-named utilities, [csvcut](https://csvkit.readthedocs.io/en/latest/scripts/csvcut.html)[csvgrep](https://csvkit.readthedocs.io/en/latest/scripts/csvgrep.html) and [csvsort](https://csvkit.readthedocs.io/en/latest/scripts/csvsort.html), which replace the traditional Linux commands cutgrep and sort. Nonetheless, the merit of the Linux commands is their speed.

You probably use Linux commands [head](https://en.wikipedia.org/wiki/Head_(Unix))[tail](https://en.wikipedia.org/wiki/Tail_(Unix))[less](https://en.wikipedia.org/wiki/Less_(Unix))/[more](https://en.wikipedia.org/wiki/More_(command)) and [cat](https://en.wikipedia.org/wiki/Cat_(Unix)) to take a quick look at the content of a text file. Unfortunately, the output of these tools is not appealing for CSV files. The columns are not aligned and you will spend a lot of time squinting a monochrome screen figuring out to which column a given cell belongs. You might give up and import the data into Excel or Google Sheet. However, if the file is on a remote machine you first need to SCP it to your desktop. You can save time and work in the console by using [csvlook](https://csvkit.readthedocs.io/en/latest/scripts/csvlook.html). Command csvlook nicely aligns column under the column name. To execute the command below, download arms dealership data and convert it to data.csv as CSVKit tutorial describes.

$ csvlook --max-rows 20 data.csv

Don’t worry if your console is narrow: pipe the output to less -S and use arrow keys to move left and right.

Another useful extension included in CSVKit is the command [csvstat](https://csvkit.readthedocs.io/en/latest/scripts/csvstat.html). It analyzes the file contents and displays statistics like the number of distinct values of all columns. Also, it tries to infer types. If the column type is a number then it also returns maximum, minimum, mean, median, and standard deviation of the values.

To perform aggregations, filtering and grouping, you can use the CSVKit command [csvsql](https://csvkit.readthedocs.io/en/latest/scripts/csvsql.html) that lets you run ANSI SQL commands on CSV files.


Some CSVKit commands are slow because they load the entire file into the memory and create an in-memory database. Rust developers reimplemented several traditional tools like catlsgrep and find and tools like [bat](https://github.com/sharkdp/bat)[exa](https://github.com/ogham/exa)[ripgrep](https://github.com/BurntSushi/ripgrep) and [fd](https://github.com/sharkdp/fd) were born. No wonder they also created a performant tool for CSV processing, library [xsv](https://github.com/BurntSushi/xsv).

The Rust library also supports selecting columns, filtering, sorting and joining CSV files. An index can be added to CSV files that are frequently processed to speed up operations. Indexing is an elegant and lightweight step towards DBMS.

Type inference

CSV is a text format that holds no type information for the columns. A string can be converted to a datatype based on its value. If all values of a column match the pattern YYYY.MM.DD we can conclude that the column holds dates. But how shall we treat the literal 100000? Is it an integer, or a time 10:00:00? Maybe the source process only supports digits and omitted the time separators? In real life, information about the source is not always available and you need to reverse engineer the data. If all values of the column match the string HHMMSS then we can conclude with high confidence that the column holds time values. The following are two approaches we can take to make a decision.

First, we could be strict: we predefine the pattern that any type needs to match. The patterns do not overlap. If time is defined as HH:MM:SS and integers as [1-9][0-9]* then 100000 is an integer.

Second, we could let patterns overlap and in case of conflict we choose the type with the smaller domain or based on some rules. This approach prefers time over int for 100000 if the time pattern also contains HHMMSS.

The CSVKit library implements the first approach.


Kdb+ is the world’s fastest time-series database, optimized for ingesting, analyzing and storing massive amounts of structured data. Its query language, called Q, is a general-purpose programming language. Tables are first-class objects in q. Q tables are semantically similar to Pandas/R data frames. You can persist tables to disk, hence the solution can be considered a database, referred to as kdb+.

Exporting and importing CSV files is part of the core language. Table t can be saved in directory dir by command

#2020 jul tutorials #overviews #data analysis #data processing #python #data analysis

What is GEEK

Buddha Community

Powerful CSV processing with kdb+ - KDnuggets
sophia tondon

sophia tondon


Microsoft Power BI Consulting | Power BI Solutions in India

Hire top dedicated Mirosoft power BI consultants from ValueCoders who aim at leveraging their potential to address organizational challenges for large-scale data storage and seamless processing.

We have a team of dedicated power BI consultants who help start-ups, SMEs, and enterprises to analyse business data and get useful insights.

What are you waiting for? Contact us now!

No Freelancers, 100% Own Staff
Experienced Consultants
Continuous Monitoring
Lean Processes, Agile Mindset
Non-Disclosure Agreement
Up To 2X Less Time

##power bi service #power bi consultant #power bi consultants #power bi consulting #power bi developer #power bi development

sophia tondon

sophia tondon


Hire Power BI Developer | Microsoft Power BI consultants in India

Hire our expert Power BI consultants to make the most out of your business data. Our power bi developers have deep knowledge in Microsoft Power BI data modeling, structuring, and analysis. 16+ Yrs exp | 2500+ Clients| 450+ Team

Visit Website - https://www.valuecoders.com/hire-developers/hire-power-bi-developer-consultants

#power bi service #power bi consultant #power bi consultants #power bi consulting #power bi developer #power bi consulting services

Power BI vs Tableau

In your search for a Business Intelligence (BI) or data visualization tool, you have probably come across the two front-runners in the category: Power BI and Tableau. They are very similar products, and you have to look quite closely to figure out which product might work the best for you. I work for Encore Business Solutions; a systems partner that specializes in both Power BI and Tableau. We’ve seen more than a few scenarios in which Tableau was being used when the company really should have gone with Power BI, and vice-versa. That was part of the inspiration for this side-by-side comparison.

This is image title

Unfortunately, the internet is full of auto-generated and biased pages regarding which product trumps the other. The truth is, the best product depends more on you, your organization, your budget, and your intended use case than the tools themselves. It is easy to nit-pick at features like the coding language that supports advanced analysis, or the type of maps supported — but these have a minimal impact for most businesses. I’m going to do my best to stay away from these types of comparisons.

To get in-Depth knowledge on Power BI you can enroll for a live demo on Power BI online training

In writing this comparison, I did a lot of research. The result was more than just this article: I also created a tool that can generate a recommendation for you based on your response to a short questionnaire. It will generate a score for both Power BI and Tableau, plus provide a few other things to think about.

Tableau Software
Founded in 2003, Tableau has been the gold-standard in data visualization for a long time. They went public in 2013, and they still probably have the edge on functionality over Power BI, thanks to their 10-year head start. There are a few factors that will heavily tip the scales in favour of Tableau, which I’ll cover in the next few paragraphs.

Tableau: Key Strengths
Let’s make one thing clear from the start: if you want the cream of the crop, all other factors aside, Tableau is the choice for you. Their organization has been dedicated to data visualization for over a decade and the results show in several areas: particularly product usability, Tableau’s community, product support, and flexible deployment options. The range of visualizations, user interface layout, visualization sharing, and intuitive data exploration capabilities also have an edge on Power BI. Tableau offers much more flexibility when it comes to designing your dashboards. From my own experience, Tableau’s functionality from an end-user perspective is much farther ahead of Power BI than the Gartner Magic Quadrant (below) would have you believe.

Tableau built their product on the philosophy of “seeing and exploring” data. This means that Tableau is engineered to create interactive visuals. Tableau’s product capabilities have been implemented in such a way that the user should be able to ask a question of their data, and receive an answer almost immediately by manipulating the tools available to them. I have heard of cases in which Tableau actually declined to pursue the business of a customer in the scenario that the customer didn’t have the right vision for how their software would be used. If you just want something to generate reports, Tableau is overkill.

Tableau is also much more flexible in its deployment than Power BI. You can install the Tableau server in any Window box without installing the SQL server. Power BI is less flexible which I will discuss in Power BI Weaknesses.

Tableau can be purchased on a subscription license and then installed either in the cloud or an on-premise server.

Finally, Tableau is all-in on data visualization, and they have their fingers firmly on the pulse of the data visualization community’s most pressing desires. You can expect significant future improvements in terms of performance when loading large datasets, new visualization options, and added ETL functions.

Tableau Weaknesses
Unfortunately, Tableau comes at a cost. When it comes to the investment required to purchase and implement Tableau – 9 times out of 10 it will be more expensive than Power BI, by a fair margin. Often, Tableau projects are accompanied by data-warehouse-building endeavours, which compound the amount of money it takes to get going. The results from building a data warehouse and then hooking up Tableau are phenomenal, but you’ll need an implementation budget of at the very least $50k – plus the incremental cost of Tableau licenses. Learn more from Power bi online course

Of course, a data warehouse is not a requirement. Tableau connects to more systems out-of-the-box than Power BI. However, Tableau users report connecting to fewer data sources than most other competing tools. Overall, considering the investment required to implement a data warehouse is a worthy indicator of the commitment required to get the most out of Tableau.

This is image title

Power BI
Power BI is Microsoft’s data visualization option. It was debuted in 2013, and has since quickly gained ground on Tableau. When you look at Gartner’s most recent BI Magic Quadrant, you’ll notice that Microsoft is basically equal to Tableau in terms of functionality, but strongly outpaces Tableau when it comes to “completeness of vision”. Indeed, the biggest advantage of Power BI is that it is embedded within the greater Microsoft stack, which contributes to Microsoft’s strong position in the Quadrant.

This is image title

Power BI: Key Strengths
Though Tableau is still regarded by many in the industry as the gold standard, Power BI is nothing to scoff at. Power BI is basically comparable to all of Tableau’s bells and whistles; unless you care deeply about the manifestation and execution of small features, you’re likely to find that Power BI is fully adequate for your BI needs.

As I mentioned, one of the biggest selling points of Power BI is that it is deeply entrenched in the Microsoft stack – and quickly becoming more integrated. It’s included in Office 365, and Microsoft really encourages the use of Power BI for visualizing data from their other cloud services. Power BI is also very capable of connecting to your external sources.

Because Power BI was originally a mostly Excel-driven product; and because the first to adopt Microsoft products are often more technical users, My personal experience is that Power BI is especially suitable for creating and displaying basic dashboards and reports. My own executive team really likes being able to access KPIs from the Office portal, without having to put much time into the report’s creation, sharing, and interactivity.

Power BI’s biggest strength; however, is its rock-bottom cost and fantastic value. For a product that is totally comparable to the category leader, it’s free (included in Office 365) for basic use and $10/user/month for a “Pro” license. This increases adoption of the product as individuals can use Power BI risk-free. For companies that don’t have the budget for a large Business Intelligence project (including a data warehouse, dedicated analysts, and several months of implementation time), Power BI is extremely attractive. Companies that are preparing to “invest” in BI are more likely to add Tableau to their list of strongly considered options.

Power BI is available on a SaaS model and on-premise; on-premise is only supported by Power BI Premium licensing.

Microsoft is also investing heavily in Power BI, and they’re closing the small gaps in their functionality extremely fast. All of those little issues some users have with Power BI are going to disappear sooner rather than later.

Power BI Weaknesses
As I’ve mentioned, Tableau still has the slight edge on Power BI when it comes to the minutiae of product functionality; mostly due to their 10-year head start. But perhaps Power BI’s greatest weakness is its lack of deployment flexibility. For Power BI on-premise you need to install the Power BI Report Server as well as the SQL Server.

I also mentioned that Tableau works well for users with large amounts of data and for users that want on-premise systems. You should be aware that there are some new features being added to Power BI via Power BI Premium that help catch Microsoft up to Tableau in the areas of large datasets and on-premise capabilities – but Power BI Premium adds significant cost, and these features are relatively new. Tableau still reigns in these areas.

To get more knowledge of Power BI and its usage in the practical way one can opt for Power bi online training Hyderabad from various platforms. Getting this knowledge from industry experts like IT Guru may help to visualize the future graphically. It will enhance skills and pave the way for a great future.

#power bi training #power bi course #learn power bi #power bi online training #microsoft power bi training #power bi online course

Power BI In Brief – 2020

Every month, we bring you news, tips, and expert opinions on Power BI? Do you want to tap into the power of Power BI? Ask the Power BI experts at ArcherPoint.

This is image title

Power BI Desktop – Feature List
More exciting updates for August—as always:

  • Reporting - Perspectives support for Personalize visuals; rectangular lasso-select for data points; additional dynamic formatting support to more visuals
  • Analytics - Direct Query support for Q&A
  • Visualizations - Linear Gauge by xViz; advanced Pie & Donut by xViz; ratings visual by TME AG; toggle switch by TME AG; fdrill down Pie PRO by MAQ Software; ADWISE RoadMap; updates to ArcGIS Maps; extending Admin capabilities for AppSource visuals
  • Template Apps - Agile CRM analytics for Dynamics 365
  • **Data Preparation ** - Text/CSV By Example
  • Data connectivity - Cherwell connector; Automation Anywhere connector; Acterys connector

To get in-Depth knowledge on Power BI you can enroll for a live demo on Power BI online training

Power BI Developer Update
And the updates continue—this time, for developers:

  • Updates in embedded analytics
  • Automation & life-cycle management
  • New API for updating paginated reports data sources
  • Get dataset/s APIs return new additional properties
  • Embed capabilities
  • Persistent filters support for embedding in the organization
  • Phased embedding
  • Control focus behavior for create/clone visual
  • Additional Javascript API enhancements
  • Selected learning resources

Multiple Data Lakes Support For Power BI Dataflows
And if that’s not enough, Microsoft also announced improvements and enhancements to Azure Data Lake Storage Gen2 support inside Dataflows in Power BI. Improvements and enhancements include: Support for workspace admins to bring their own ADLS Gen2 accounts; improvements to the Dataflows connector; take-ownership support for dataflows using ADLS Gen2; minor improvements to detaching from ADLS Gen2. Changes will start rolling out during the week of August 10. Read more on multiple data lakes support in Power BI dataflows.

To get more knowledge of Power BI and its usage in the practical way one can opt for Power bi online training Hyderabad from various platforms. Getting this knowledge from industry experts like IT Guru may help to visualize the future graphically. It will enhance skills and pave the way for a great future.

#power bi training #power bi course #learn power bi #power bi online training #microsoft power bi training #power bi online course

Is Power BI Actually Useful?

The short answer, for most of you, is no. However, the complexity and capability of the products could be beneficial depending on what type of position or organization you work in.
This is image title
In my effort to answer this common question about Power BI I researched the following:
– Power BI Desktop Gateway
– Syncing on-prem SQL server data
– Syncing SharePoint Online list data
– Syncing data from an Excel workbook
– Building, and sharing a dashboard
– Inserting a Power BI visualization into PowerPoint

To get in-Depth knowledge on Power BI you can enroll for a live demo on Power BI online training

The feature spread above gave me the opportunity to explore the main features of Power BI which break down as:
– Ingesting data, building a data set
– Creating dashboard or reports with visualizations based on that data

In a nutshell Power BI is a simple concept. You take a data set, and build visualizations that answer questions about that data. For example, how many products have we sold in Category A in the last month? Quarter? Year? Power BI is especially powerful when drilling up or down in time scale.
And there are some interesting ways to visualize that data:
However, there are a number of drawbacks to the current product that prevented me from being able to fold these visualizations into our existing business processes.

  1. Integration with PowerPoint is not free. This shocked me.

The most inspiring Power BI demo I saw at a Microsoft event showed a beautiful globe visualization within a PowerPoint presentation. It rendered flawlessly within PowerPoint and was a beautiful, interactive way to explore a geographically disparate data set. I was able to derive conclusions about the sales data displayed without having to look at an old, boring chart.

During the demo, nothing was mentioned about the technology required to make this embedded chart a reality. After looking into the PowerPoint integration I learned that not only was the add-in built by a third party, it was not free, and when I signed up for a free trial the add-in could barely render my Power BI visualization. The data drill up/down functionality was non-existent and not all of the visualizations were supported. Learn more from Power bi online course

  1. Only Dashboards can be shared with other users, and cannot be embedded in our organization’s community on SharePoint.

Folks in our organization spent 50% of their time in Outlook, and the rest in SharePoint, OneNote, Excel, Word, and the other applications needed for producing documents, and other work. Adding yet another destination to that list to check on how something is doing was impossible for us. Habits are extremely hard to change, and I see that consistently in our client’s organizations as well.

Because I was not able to fold in the visualizations with the PowerPoint decks we use during meetings, I had to stop presentations in the middle, navigate to Internet Explorer (because the visualizations only render well in that browser), and then go back to PowerPoint once we were done looking at the dashboard.

This broke up the flow of our meetings, and led to more distractions. I also followed up with coworkers after meetings to see if they ever visited the dashboard themselves at their desk. None of them had ever navigated to a dashboard outside of a meeting.

  1. The visualizations aren’t actually that great.

Creating visualizations that cover such a wide variety of data sets is difficult. But, the Excel team has been working on this problem for over 15 years. When I import my SharePoint or SQL data to Excel I’m able to create extremely customized Pivot Tables and Charts that show precisely the data I need to see.

I was never able to replicate visualizations from Excel in Power BI, to produce the types of visualizations I actually needed. Excel has the ability to do conditional formatting, and other customizations in charts and tables that is simply not possible with Power BI. Because of how generic the charts are, and the limited customization it looks “cool” without being functional.

In conclusion, if you have spare time and want to explore Power BI for your organization you should. However, if you are seriously thinking about how you can fold this product into your work processes, challenge yourself to build a dashboard and look at it once a week. See if you can keep that up for a month, and then think about how that change affected your work habits and whether the data analysis actually contributed value each time. At least half of you will realize that this gimmicky product is fancy, but not actually useful.

Take your career to new heights of success with Power BI online training Hyderabad

#power bi training #power bi course #learn power bi #power bi online training #microsoft power bi training #power bi online course