DataTables: A C++ Tabular Data Structure Project

The beginning of a project focused on creating efficient tabular data structures for working with data in C++.

Introduction

For statistical programming languages, or languages with good statistics processing libraries, the DataFrame is an essential structure. Most features of these languages and libraries (e.g. the R programming language or the Pandas package for Python) revolve around the DataFrame object, which provides useful functionality for working with datasets. There has been a big push to bring this type of structure to C++, with a few open-source libraries on GitHub, notably the xtensor project, which works to imitate NumPy arrays.

Although I’m sure these libraries are great, for the sake of learning by doing, I decided to create my own implementation of a data storage object in C++ to efficiently handle datasets. Initially, this functionality was part of a library I was creating (also for the sake of learning by doing) called YALL (Yet Another Learning Library); the name was thought up independently, but it’s not very original. However, I found this functionality useful and, since it really can be a standalone project, decided to pull it out of YALL and put it in its own repository.

The goal of this project is to provide the basic functionality of Pandas or R DataFrames without too much bloat. Since this project is much smaller than those two implementations, only the most commonly used functionality will initially be incorporated into the DataTable object. Ideally, these tables will have a small computational footprint and remain memory efficient, i.e. there will be a minimal amount of metadata so that using the DataTable class doesn’t decimate your RAM. In this post, I will discuss the first version of the DataTable class, which incorporates some limited functionality, by walking through the current implementation. Note that this is an initial version. The project is just getting started, and I plan on adding much-needed functionality and improving code and project quality in the future.

Implementation

DataTables was written in C++ and tested on Ubuntu 19.04 and Netrunner 19.08, both Debian-based distributions. CMake is used for the build process, so the project should be cross-platform; however, it has not been tested on macOS or Windows.

Due to the length of the code for the function implementations (>600 lines), I won’t share all of it here. All of the project files, including the function implementations, can be found on my GitHub. What I will go over here is, essentially, the header file that defines the DataTable class. I will walk through creating data tables, operating on them, reading and writing data, and viewing the data in the table to give some idea of how the structure is intended to be used. At times I will also try to draw parallels with Pandas and R DataFrames just for additional context.
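To make the idea of a lightweight table with minimal metadata a bit more concrete, here is a rough sketch of what such a structure could look like. This is not the DataTable class from the project. The class name SimpleTable and every member function shown (add_row, at, column_index, column_mean) are hypothetical names chosen for illustration, and the sketch assumes all values are doubles stored row by row in a single contiguous vector.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration only: SimpleTable is NOT the project's DataTable.
// The only metadata kept is the list of column names; all values live in one
// contiguous row-major buffer.
class SimpleTable {
public:
    explicit SimpleTable(std::vector<std::string> column_names)
        : names_(std::move(column_names)) {
        if (names_.empty()) {
            throw std::invalid_argument("a table needs at least one column");
        }
    }

    // Append one row; its length must match the number of columns.
    void add_row(const std::vector<double>& row) {
        if (row.size() != names_.size()) {
            throw std::invalid_argument("row length must equal column count");
        }
        data_.insert(data_.end(), row.begin(), row.end());
        assert(data_.size() % names_.size() == 0);  // buffer holds whole rows only
    }

    std::size_t rows() const { return data_.size() / names_.size(); }
    std::size_t cols() const { return names_.size(); }

    // Bounds-checked element access by (row, column) position.
    double at(std::size_t row, std::size_t col) const {
        if (row >= rows() || col >= cols()) {
            throw std::out_of_range("index outside table");
        }
        return data_[row * cols() + col];
    }

    // Linear search for a column name; fine for a handful of columns.
    std::size_t column_index(const std::string& name) const {
        for (std::size_t i = 0; i < names_.size(); ++i) {
            if (names_[i] == name) return i;
        }
        throw std::out_of_range("unknown column: " + name);
    }

    // Arithmetic mean of one column, looked up by name.
    double column_mean(const std::string& name) const {
        if (rows() == 0) {
            throw std::domain_error("mean of an empty column is undefined");
        }
        const std::size_t col = column_index(name);
        double sum = 0.0;
        for (std::size_t r = 0; r < rows(); ++r) sum += at(r, col);
        return sum / static_cast<double>(rows());
    }

private:
    std::vector<std::string> names_;  // column labels (the only metadata)
    std::vector<double> data_;        // row-major values, rows() * cols() long
};
```

With this sketch, a table could be created from a vector of column names, filled with add_row, and queried with at or column_mean. A single flat buffer keeps per-table overhead low, at the cost of supporting only one value type; a real DataFrame-like structure would need per-column types.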
