Powerful CSV processing with kdb+ - KDnuggets

Comma-separated values (CSV) text files are among the most fundamental formats for data processing. Almost every programming language and software package that works with relational data also provides some level of CSV handling. You can persist and process data without installing a database management system. Often you don’t need a full-blown DBMS with all its features, such as transaction handling, concurrent and remote access, or indexing. The lightweight CSV format allows for easy processing and sharing of the captured information.

The CSV format predates personal computers and has been one of the most common data exchange formats for almost 50 years. CSV files will remain with us in the future. Working with this format efficiently is a core skill for a productive developer, data engineer, data scientist or DevOps engineer. You may need to filter rows, sort by a column, select existing columns or derive new ones. Perhaps you need to do complex analysis that requires aggregation and grouping.

This article provides a glimpse into the available tools to work with CSV files and describes how kdb+ and its query language q raise CSV processing to a new level of performance and simplicity.

Common CSV tools

Linux command-line tools

Much CSV processing is done in a Linux or Mac environment, which offers a powerful terminal with a choice of shells. Most shells, like Bash, support arrays. You can read a CSV file line by line and store the fields in an array variable. You can use built-in string manipulation and integer arithmetic (and even floating-point arithmetic with e.g. bc -l) to operate on cell values. The code will be lengthy and hard to maintain.

General text processing tools like [awk](https://en.wikipedia.org/wiki/AWK) and [sed](https://en.wikipedia.org/wiki/Sed) scripts may result in shorter and simpler code. Commands like [cut](https://en.wikipedia.org/wiki/Cut_(Unix)), [sort](https://en.wikipedia.org/wiki/Sort_(Unix)), [uniq](https://en.wikipedia.org/wiki/Uniq) and [paste](https://en.wikipedia.org/wiki/Paste_(Unix)) further simplify CSV processing. You can specify the separator and refer to fields by position.

The world is constantly changing. So do CSV files. Position-based reference breaks if a new column is added ahead of the referred column or columns are shuffled e.g. to move related columns next to each other. The problem manifests silently: your scripts may run smoothly, but you just use a different column in your calculation! If you don’t have a regression-testing framework to safeguard your codebase, then the end-user (or your competitor) might discover the problem. This can be embarrassing.

Position-based reference creates fragile code. Processing CSV by these Linux commands is great for prototyping and for quick analysis but you hit the limits once your codebase starts increasing or you share scripts with other colleagues. No wonder that in SQL the position-based column reference is limited and discouraged.
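The silent failure described above can be reproduced in a few lines. The sketch below uses Python rather than shell commands; the sample data and the function names total_by_position and total_by_name are made up for illustration. It sums the price column of a small file before and after its columns are shuffled.

```python
import csv
import io

# Hypothetical sample data: the same records before and after the columns are shuffled.
OLD = "symbol,price,size\nAAPL,100.5,10\nMSFT,200.0,5\n"
NEW = "symbol,size,price\nAAPL,10,100.5\nMSFT,5,200.0\n"


def total_by_position(text, index=1):
    """Sum the column at a fixed position, skipping the header row."""
    rows = list(csv.reader(io.StringIO(text)))
    return sum(float(row[index]) for row in rows[1:])


def total_by_name(text, column="price"):
    """Sum the column identified by its header name."""
    return sum(float(row[column]) for row in csv.DictReader(io.StringIO(text)))


if __name__ == "__main__":
    print(total_by_position(OLD), total_by_position(NEW))  # second value is the sum of sizes
    print(total_by_name(OLD), total_by_name(NEW))          # both values are the sum of prices
```

The position-based version raises no error on the shuffled file; it simply sums a different column. The name-based version returns the same total for both layouts.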

The huge advantage of Linux command-line tools is that no installation is required. Your shell script will likely run on other people’s Linux systems. Familiarity with the tools readily available in Linux is useful, but they are often best avoided for complex, long-lived tasks.

CSVKit

Many open-source libraries offer CSV support. The Python library CSVKit is one of the most popular. It offers a more robust solution than native Linux commands, such as allowing reference of columns by name. The column names are stored in the first row of the CSV. Reference by name is sensitive to column renaming but this probably happens less frequently than adding or moving columns.

Also, CSVKit handles the first rows better than the general-purpose text tools do. Linux command sort treats the first row as any other row and can place it in the middle of the output. Similarly, cat includes the first rows when you concatenate multiple CSV files. Commands csvsort and csvstack handle first rows properly.
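To make the header problem concrete, here is a minimal Python sketch of header-aware sorting and stacking. It is not how csvsort or csvstack are implemented; the function names sort_csv and stack_csv are hypothetical, values are compared as plain strings, and all stacked inputs are assumed to share the same header.

```python
import csv
import io


def sort_csv(text, column):
    """Sort data rows by a named column (as strings), keeping the header first."""
    reader = csv.DictReader(io.StringIO(text))
    rows = sorted(reader, key=lambda row: row[column])
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def stack_csv(texts):
    """Concatenate CSV texts that share a header, emitting the header only once."""
    out = io.StringIO()
    writer = None
    for text in texts:
        reader = csv.DictReader(io.StringIO(text))
        for row in reader:
            if writer is None:
                # The first file that contains data defines the output header.
                writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
                writer.writeheader()
            writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    a = "name,qty\npear,3\napple,5\n"
    b = "name,qty\nfig,1\n"
    print(sorted(a.splitlines()))  # naive line sort: the header lands between the data rows
    print(sort_csv(a, "name"))
    print(stack_csv([a, b]))
```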

Finally, the CSVKit developers took special care to provide consistent command-line parameters, e.g. the separator is defined by -d. In contrast, you need to remember that the separator is specified by -t for [sort](https://en.wikipedia.org/wiki/Sort_(Unix)) and by -d for the other Linux commands, [cut](https://en.wikipedia.org/wiki/Cut_(Unix)) and [paste](https://en.wikipedia.org/wiki/Paste_(Unix)).

CSVKit includes the simply named utilities [csvcut](https://csvkit.readthedocs.io/en/latest/scripts/csvcut.html), [csvgrep](https://csvkit.readthedocs.io/en/latest/scripts/csvgrep.html) and [csvsort](https://csvkit.readthedocs.io/en/latest/scripts/csvsort.html), which replace the traditional Linux commands cut, grep and sort. Nonetheless, the merit of the Linux commands is their speed.

You probably use the Linux commands [head](https://en.wikipedia.org/wiki/Head_(Unix)), [tail](https://en.wikipedia.org/wiki/Tail_(Unix)), [less](https://en.wikipedia.org/wiki/Less_(Unix))/[more](https://en.wikipedia.org/wiki/More_(command)) and [cat](https://en.wikipedia.org/wiki/Cat_(Unix)) to take a quick look at the content of a text file. Unfortunately, the output of these tools is not appealing for CSV files. The columns are not aligned and you will spend a lot of time squinting at a monochrome screen, figuring out to which column a given cell belongs. You might give up and import the data into Excel or Google Sheets. However, if the file is on a remote machine you first need to SCP it to your desktop. You can save time and stay in the console by using [csvlook](https://csvkit.readthedocs.io/en/latest/scripts/csvlook.html). The csvlook command neatly aligns each column under its column name. To execute the command below, download the arms dealership data and convert it to data.csv as the CSVKit tutorial describes.

$ csvlook --max-rows 20 data.csv

Don’t worry if your console is narrow: pipe the output to less -S and use arrow keys to move left and right.

Another useful extension included in CSVKit is the command [csvstat](https://csvkit.readthedocs.io/en/latest/scripts/csvstat.html). It analyzes the file contents and displays statistics like the number of distinct values of all columns. Also, it tries to infer types. If the column type is a number then it also returns maximum, minimum, mean, median, and standard deviation of the values.
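The kind of per-column summary described above can be approximated in Python with the standard library. This is only a sketch of the idea, not csvstat's implementation: the function name column_stats is hypothetical, a column counts as numeric only if every value parses as a float, and missing values are not handled specially.

```python
import csv
import io
import statistics


def column_stats(text):
    """Return per-column statistics for CSV text whose first row holds column names."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in columns:
            columns[name].append(row[name])

    stats = {}
    for name, values in columns.items():
        summary = {"distinct": len(set(values))}
        try:
            numbers = [float(v) for v in values]
        except (TypeError, ValueError):
            numbers = []  # at least one value is not numeric: report distinct count only
        if numbers:
            summary["min"] = min(numbers)
            summary["max"] = max(numbers)
            summary["mean"] = statistics.mean(numbers)
            summary["median"] = statistics.median(numbers)
            if len(numbers) > 1:  # the sample standard deviation needs two or more values
                summary["stdev"] = statistics.stdev(numbers)
        stats[name] = summary
    return stats


if __name__ == "__main__":
    sample = "item,qty\na,1\nb,2\nc,3\na,6\n"
    for column, summary in column_stats(sample).items():
        print(column, summary)
```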

To perform aggregations, filtering and grouping, you can use the CSVKit command [csvsql](https://csvkit.readthedocs.io/en/latest/scripts/csvsql.html) that lets you run ANSI SQL commands on CSV files.
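The idea of running SQL over a CSV file can be sketched with Python's built-in sqlite3 module: load the rows into an in-memory table, then query it. This illustrates the approach only and makes no claim about csvsql's internals; the function name query_csv and the default table name data are assumptions, and every column is loaded as text, so numeric columns are cast explicitly in the query.

```python
import csv
import io
import sqlite3


def query_csv(text, sql, table="data"):
    """Load CSV text into an in-memory SQLite table and return the query result rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # the first row holds the column names
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)

    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(f'CREATE TABLE "{table}" ({columns})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = "state,qty\nNY,10\nCA,5\nNY,7\n"
    sql = (
        "SELECT state, SUM(CAST(qty AS INTEGER)) "
        "FROM data GROUP BY state ORDER BY state"
    )
    print(query_csv(sample, sql))
```

Because the whole file is inserted into the table before the query runs, memory use grows with the file size.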

xsv

Some CSVKit commands are slow because they load the entire file into memory and create an in-memory database. Rust developers have reimplemented several traditional tools like cat, ls, grep and find, and tools like [bat](https://github.com/sharkdp/bat), [exa](https://github.com/ogham/exa), [ripgrep](https://github.com/BurntSushi/ripgrep) and [fd](https://github.com/sharkdp/fd) were born. No wonder they also created a performant tool for CSV processing, [xsv](https://github.com/BurntSushi/xsv).

xsv also supports selecting columns, filtering, sorting and joining CSV files. An index can be added to CSV files that are frequently processed to speed up operations. Indexing is an elegant and lightweight step towards a DBMS.

Type inference

CSV is a text format that holds no type information for the columns. A string can be converted to a datatype based on its value. If all values of a column match the pattern YYYY.MM.DD we can conclude that the column holds dates. But how shall we treat the literal 100000? Is it an integer, or the time 10:00:00? Maybe the source process only supports digits and omitted the time separators? In real life, information about the source is not always available and you need to reverse engineer the data. If all values of the column match the pattern HHMMSS then we can conclude with high confidence that the column holds time values. There are two approaches we can take to make a decision.

First, we could be strict: we predefine the pattern that any type needs to match. The patterns do not overlap. If time is defined as HH:MM:SS and integers as [1-9][0-9]* then 100000 is an integer.

Second, we could let patterns overlap and in case of conflict we choose the type with the smaller domain or based on some rules. This approach prefers time over int for 100000 if the time pattern also contains HHMMSS.

The CSVKit library implements the first approach.
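The two approaches can be contrasted in a short Python sketch. The regular expressions and the function names infer_strict and infer_overlapping are illustrative assumptions; they are not taken from CSVKit or kdb+, and only the time and integer types are considered.

```python
import re

# Strict rule set: the patterns are disjoint, so a value matches at most one type.
TIME_WITH_SEPARATORS = re.compile(r"\d{2}:\d{2}:\d{2}")
INTEGER = re.compile(r"[1-9][0-9]*")

# Overlapping rule set: HHMMSS without separators is also accepted as a time.
TIME_COMPACT = re.compile(r"([01]\d|2[0-3])[0-5]\d[0-5]\d")


def infer_strict(values):
    """First approach: non-overlapping patterns, so no conflict can arise."""
    if values and all(TIME_WITH_SEPARATORS.fullmatch(v) for v in values):
        return "time"
    if values and all(INTEGER.fullmatch(v) for v in values):
        return "int"
    return "string"


def infer_overlapping(values):
    """Second approach: patterns overlap and the type with the smaller domain wins."""
    def is_time(v):
        return bool(TIME_WITH_SEPARATORS.fullmatch(v) or TIME_COMPACT.fullmatch(v))

    # Time is tested first because far fewer strings are valid times than valid integers.
    if values and all(is_time(v) for v in values):
        return "time"
    if values and all(INTEGER.fullmatch(v) for v in values):
        return "int"
    return "string"


if __name__ == "__main__":
    column = ["100000", "123059"]
    print(infer_strict(column))       # the strict rules see plain integers
    print(infer_overlapping(column))  # the overlapping rules prefer time
```

A single value that cannot be a time, such as 999999, makes the second approach fall back to integer for the whole column.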

q/kdb+

Kdb+ is the world’s fastest time-series database, optimized for ingesting, analyzing and storing massive amounts of structured data. Its query language, called q, is a general-purpose programming language. Tables are first-class objects in q, and they are semantically similar to Pandas/R data frames. You can persist tables to disk, hence the solution can be considered a database, referred to as kdb+.

Exporting and importing CSV files is part of the core language. Table t can be saved in directory dir by command



Comparing Power BI with other tools

The Business Intelligence (BI) world has been moving towards self-service BI. As expected, several vendors created tools empowering regular users to gain insights from their data. Among the many, there is Power BI. Nowadays, users want to understand the differences between Product X and Power BI.

One of the most common questions in conferences and user group sessions is likely, “can you provide a comparison between this product and Power BI?”.

The answer is almost always, “No, I cannot compare them, because they are too different”. First, one needs to understand the deep difference between Power BI and most other reporting tools on the market. Only later does a comparison make any sense. As a matter of fact, I think Power BI can be compared to only a few products on the market today. I would like to add my point of view to the discussion.

Indeed, Power BI is a tremendously powerful data modeling tool that happens to come with a pretty face; most other products are beautifully crafted reporting tools with a pretty face. The only thing they have in common is the pretty face. If you stop at what they have in common, you are only comparing a small fraction of the whole product, and that would be unfair.

To go further, a deeper understanding of basic BI concepts is needed.

Beware: this article is biased. I love Power BI and I make my living out of it. Nevertheless, I am a BI professional; I started working with Business Intelligence many years ago and I have gathered experience that I can share. I will try to be as fair as I can in this post, as my goal is not to provide a comparison with any tool. The goal of this post is to help you understand what you really need to evaluate when making (or reading) any comparison between different BI products.

At the top level, any Business Intelligence solution is composed of three layers:

Raw data: these are the data sources that one wants to analyze. Raw data comes as is.
Semantic model: this is where data is re-arranged to optimize it for analysis. Here you also define the calculations required by the reports.
Reports: these are the nice dashboards you can build with the tool.

Power BI manages all three layers: you start from raw data, you can build a semantic layer, and finally you prepare reports. Most other reporting tools are focused on the last layer and are limited in the previous two. In other words, they are missing the capability to build a real semantic layer. It is important to clarify what a semantic layer is, to understand what you would miss by choosing a different product.

In the old ages of BI, there was a clear separation between users and developers. A BI developer would build a project to help users extract insights from their data, and build reports. Users did not need to understand tables, relationships, or calculations. The developer oversaw shaping the tables, providing predefined calculations and giving sensible names to entities. Leveraging the semantic model, users did not have to know DAX, MDX or SQL.

A semantic model lets users interact with business entities like customers, sales, and products. Users would place those entities in reports made with Excel or with other reporting tools. Regular users were happy with just Excel and a Pivot Table. More advanced users wanted more powerful tools, and this led to the creation of several reporting tools with their ad-hoc programming language to create more advanced formulas. Regardless, the important thing is that no matter how powerful those tools are, they were still reporting tools based on the existence of a previously crafted semantic model. No semantic model, no reporting.

Picture this: a BI tool lets a developer build a semantic model. A reporting tool lets a user build a report on top of an existing semantic model. You need both to create a BI solution.

Unfortunately, building a BI project takes time. Users were hungry for reports. This led to the start of the Self-Service BI era. Self-service BI is the idea of users building reports themselves, to reduce development time and to build a democratic knowledge about data. Sounds cool and terrifying at the same time.

Anyway, this is where we are today. Obviously, driven by the market several vendors started to build self-service BI tools. A few new products appeared on the market. Rather, existing tools evolved into new ones, targeting self-service BI. Keep in mind: any self-service BI tool requires the functionalities to build both the semantic model and the report in the same tool. Thus, depending on where you start, you have two options to have an existing product evolve into a self-service product:

If you already have a semantic model tool, you need to add reporting capabilities. You need to make it easier to use, because the target is no longer a BI professional but a regular user instead.
If you already have a reporting tool, you need to add the capability to build a semantic model because your users need to massage the data and build calculations on top of the resulting model.

In both cases, in the end you obtain a tool that mixes the capabilities to create a semantic model and to build reports. After this first step, you can add tons of different features like sharing with other users, building wizards to automatically connect to other services, improving the formula language and so on. But the core is always the same: a semantic model and a reporting tool, bound together in a nice package.

Even though we consider Power BI to be a new product, it is actually the evolution of Power Pivot and Analysis Services Tabular (semantic model), Power Query (querying tool), and Power View (the first version of the reporting tool released with Excel and SharePoint). Other vendors took similar steps, with different starting points. It is fair to say that several vendors started from a reporting tool, adding the semantic model to it.

Now, if you need to compare two BI tools, you need to compare at least these two features: the semantic model and the tools to build a report.

Say you want to compare Product X with Power BI; you show me how easy it is to build a gorgeous report on top of an SQL view, much easier and much more powerful than Power BI. Cool, but you are only comparing a fraction of both products. Reporting-wise, sure, Product X is better than Power BI. But there are other considerations: can you load multiple tables in Product X? Can you build relationships between them? Can you use a programming language to author complex calculations that involve scanning different tables? All these operations belong to the semantic model. A fair comparison needs to apply to all the features.

This is what Power BI offers you:

Power Query – a data transformation tool which is easy to use and yet incredibly powerful. It can load virtually anything and join data from different sources.
A modeling environment where you can build different kinds of relationships between tables and build powerful models. It does not hurt that it runs on top of one of the fastest databases I have ever seen.
DAX – a programming language which is not easy, but lets you author nearly any query and calculation. Yes, on this I am biased for sure!
Power BI – a reporting engine which is very good in building dashboards and reports. It can also be extended with custom visuals and third-party products.

Then, there is web-based reporting and sharing, a mobile experience, the ability to load from nearly any data source in the cloud or on premises and many other useful features. Yet, the core is composed of the four features above. If you want to compare apples to apples, you need to compare at least these four parts. Be mindful: you need all of them. A tool that requires you to build a single table because it does not let you relate two tables is nothing but a nice reporting tool. Comparing it to Power BI does not make much sense to me.

Moreover, it does not come by chance that to learn Power BI, one needs to learn new programming languages. Each feature has its own language, and this is just the right thing to have.

Finally, reporting. Reporting is only the last part, even though it is the most visible one. You might find other products are better than Power BI when it comes to reporting. This is fine, if you are aware that you are only comparing a fraction of Power BI with the whole of Product X.

I love Power BI, and I would really love to see a fair comparison between Power BI and any other product. We could learn a lot from the topic. But for it to be fair, it cannot just be based on how easy it is to build a pie chart (just kidding! You are not using a pie chart, are you?). One needs to evaluate everything both products have to offer.


Power BI In Brief – 2020

Every month, we bring you news, tips, and expert opinions on Power BI. Do you want to tap into the power of Power BI? Ask the Power BI experts at ArcherPoint.

Power BI Desktop – Feature List
More exciting updates for August—as always:

  • Reporting - Perspectives support for Personalize visuals; rectangular lasso-select for data points; additional dynamic formatting support to more visuals
  • Analytics - Direct Query support for Q&A
  • Visualizations - Linear Gauge by xViz; advanced Pie & Donut by xViz; ratings visual by TME AG; toggle switch by TME AG; Drill Down Pie PRO by MAQ Software; ADWISE RoadMap; updates to ArcGIS Maps; extending Admin capabilities for AppSource visuals
  • Template Apps - Agile CRM analytics for Dynamics 365
  • Data Preparation - Text/CSV By Example
  • Data connectivity - Cherwell connector; Automation Anywhere connector; Acterys connector

Power BI Developer Update
And the updates continue—this time, for developers:

  • Updates in embedded analytics
  • Automation & life-cycle management
  • New API for updating paginated reports data sources
  • Get dataset/s APIs return new additional properties
  • Embed capabilities
  • Persistent filters support for embedding in the organization
  • Phased embedding
  • Control focus behavior for create/clone visual
  • Additional JavaScript API enhancements
  • Selected learning resources

Multiple Data Lakes Support For Power BI Dataflows
And if that’s not enough, Microsoft also announced improvements and enhancements to Azure Data Lake Storage Gen2 support inside Dataflows in Power BI. Improvements and enhancements include: Support for workspace admins to bring their own ADLS Gen2 accounts; improvements to the Dataflows connector; take-ownership support for dataflows using ADLS Gen2; minor improvements to detaching from ADLS Gen2. Changes will start rolling out during the week of August 10. Read more on multiple data lakes support in Power BI dataflows.


Power BI vs Tableau

In your search for a Business Intelligence (BI) or data visualization tool, you have probably come across the two front-runners in the category: Power BI and Tableau. They are very similar products, and you have to look quite closely to figure out which product might work the best for you. I work for Encore Business Solutions, a systems partner that specializes in both Power BI and Tableau. We’ve seen more than a few scenarios in which Tableau was being used when the company really should have gone with Power BI, and vice versa. That was part of the inspiration for this side-by-side comparison.

Unfortunately, the internet is full of auto-generated and biased pages regarding which product trumps the other. The truth is, the best product depends more on you, your organization, your budget, and your intended use case than the tools themselves. It is easy to nit-pick at features like the coding language that supports advanced analysis, or the type of maps supported — but these have a minimal impact for most businesses. I’m going to do my best to stay away from these types of comparisons.

In writing this comparison, I did a lot of research. The result was more than just this article: I also created a tool that can generate a recommendation for you based on your response to a short questionnaire. It will generate a score for both Power BI and Tableau, plus provide a few other things to think about.

Tableau Software
Founded in 2003, Tableau has been the gold-standard in data visualization for a long time. They went public in 2013, and they still probably have the edge on functionality over Power BI, thanks to their 10-year head start. There are a few factors that will heavily tip the scales in favour of Tableau, which I’ll cover in the next few paragraphs.

Tableau: Key Strengths
Let’s make one thing clear from the start: if you want the cream of the crop, all other factors aside, Tableau is the choice for you. Their organization has been dedicated to data visualization for over a decade and the results show in several areas: particularly product usability, Tableau’s community, product support, and flexible deployment options. The range of visualizations, user interface layout, visualization sharing, and intuitive data exploration capabilities also have an edge on Power BI. Tableau offers much more flexibility when it comes to designing your dashboards. From my own experience, Tableau’s functionality from an end-user perspective is much farther ahead of Power BI than the Gartner Magic Quadrant (below) would have you believe.

Tableau built their product on the philosophy of “seeing and exploring” data. This means that Tableau is engineered to create interactive visuals. Tableau’s product capabilities have been implemented in such a way that the user should be able to ask a question of their data, and receive an answer almost immediately by manipulating the tools available to them. I have heard of cases in which Tableau actually declined to pursue the business of a customer in the scenario that the customer didn’t have the right vision for how their software would be used. If you just want something to generate reports, Tableau is overkill.

Tableau is also much more flexible in its deployment than Power BI. You can install Tableau Server on any Windows box without installing SQL Server. Power BI is less flexible, which I will discuss under Power BI Weaknesses.

Tableau can be purchased on a subscription license and then installed either in the cloud or an on-premise server.

Finally, Tableau is all-in on data visualization, and they have their fingers firmly on the pulse of the data visualization community’s most pressing desires. You can expect significant future improvements in terms of performance when loading large datasets, new visualization options, and added ETL functions.

Tableau Weaknesses
Unfortunately, Tableau comes at a cost. When it comes to the investment required to purchase and implement Tableau – 9 times out of 10 it will be more expensive than Power BI, by a fair margin. Often, Tableau projects are accompanied by data-warehouse-building endeavours, which compound the amount of money it takes to get going. The results from building a data warehouse and then hooking up Tableau are phenomenal, but you’ll need an implementation budget of at the very least $50k – plus the incremental cost of Tableau licenses.

Of course, a data warehouse is not a requirement. Tableau connects to more systems out-of-the-box than Power BI. However, Tableau users report connecting to fewer data sources than most other competing tools. Overall, considering the investment required to implement a data warehouse is a worthy indicator of the commitment required to get the most out of Tableau.

Power BI
Power BI is Microsoft’s data visualization option. It debuted in 2013, and has since quickly gained ground on Tableau. When you look at Gartner’s most recent BI Magic Quadrant, you’ll notice that Microsoft is basically equal to Tableau in terms of functionality, but strongly outpaces Tableau when it comes to “completeness of vision”. Indeed, the biggest advantage of Power BI is that it is embedded within the greater Microsoft stack, which contributes to Microsoft’s strong position in the Quadrant.

Power BI: Key Strengths
Though Tableau is still regarded by many in the industry as the gold standard, Power BI is nothing to scoff at. Power BI is basically comparable to all of Tableau’s bells and whistles; unless you care deeply about the manifestation and execution of small features, you’re likely to find that Power BI is fully adequate for your BI needs.

As I mentioned, one of the biggest selling points of Power BI is that it is deeply entrenched in the Microsoft stack – and quickly becoming more integrated. It’s included in Office 365, and Microsoft really encourages the use of Power BI for visualizing data from their other cloud services. Power BI is also very capable of connecting to your external sources.

Because Power BI was originally a mostly Excel-driven product, and because the first to adopt Microsoft products are often more technical users, my personal experience is that Power BI is especially suitable for creating and displaying basic dashboards and reports. My own executive team really likes being able to access KPIs from the Office portal, without having to put much time into the report’s creation, sharing, and interactivity.

Power BI’s biggest strength, however, is its rock-bottom cost and fantastic value. For a product that is totally comparable to the category leader, it’s free (included in Office 365) for basic use and $10/user/month for a “Pro” license. This increases adoption of the product as individuals can use Power BI risk-free. For companies that don’t have the budget for a large Business Intelligence project (including a data warehouse, dedicated analysts, and several months of implementation time), Power BI is extremely attractive. Companies that are preparing to “invest” in BI are more likely to add Tableau to their list of strongly considered options.

Power BI is available on a SaaS model and on-premise; on-premise is only supported by Power BI Premium licensing.

Microsoft is also investing heavily in Power BI, and they’re closing the small gaps in their functionality extremely fast. All of those little issues some users have with Power BI are going to disappear sooner rather than later.

Power BI Weaknesses
As I’ve mentioned, Tableau still has the slight edge on Power BI when it comes to the minutiae of product functionality, mostly due to their 10-year head start. But perhaps Power BI’s greatest weakness is its lack of deployment flexibility. For Power BI on-premise you need to install Power BI Report Server as well as SQL Server.

I also mentioned that Tableau works well for users with large amounts of data and for users that want on-premise systems. You should be aware that there are some new features being added to Power BI via Power BI Premium that help catch Microsoft up to Tableau in the areas of large datasets and on-premise capabilities – but Power BI Premium adds significant cost, and these features are relatively new. Tableau still reigns in these areas.
