Just in time for the U.S. tax season’s delayed 2020 deadline, Intuit released the first framework to manage Apache Cassandra clusters. After nine years using the open source database, we made our first major contribution to the Cassandra community with DSE Pronto. Pronto is an Infrastructure as a Service automation suite used to deploy and manage DataStax Cassandra clusters in Amazon Web Services (AWS).

As its name suggests, Pronto aims to get you up and running with Cassandra much sooner.

Pronto ties together an open source suite that includes Packer, Terraform and Ansible, all built into a Docker image. Because these tools are so widely adopted, the framework should also be easy to extend to the other two cloud giants, Google Cloud Platform and Azure.

Pronto is the result of nine years of customizations we’ve made to Cassandra — a database system that is highly reliable and scalable, but not always intuitive.

DSE Pronto Abstracts the Complexity out of Managing Cassandra

My colleagues at Intuit, Ben Covi and Nancy Li, developed the Pronto GitHub repository after finding no third-party suite of similar tools with the configurability we wanted for self-managed clusters. It’s not easy to manage your own Cassandra cluster. Pronto solves that problem.

Pronto originated as a project with the Data Persistence Platform team, of which TurboTax is the biggest user. TurboTax operates in a highly regulated industry and needs to maintain tax data for at least seven years, with hundreds of thousands of integration partners. So TurboTax is anything but simple.

We actively support more than 300,000 concurrent users in production in AWS, spread over eight production clusters. Our largest production cluster right now has 72 servers in each data center, or 144 across two regions. Cassandra has to process massive amounts of data, such as entitlements, tax returns, filings, user experience, and everything else needed to support TurboTax.

There’s an operational learning curve with Cassandra, which is why we decided to open source the Pronto automation framework that’s already being “inner-sourced” across Intuit.

Sponsor Note

DataStax is the company behind the massively scalable, highly available, cloud native NoSQL data platform built on Apache Cassandra™️. DataStax gives users and enterprises the freedom to run data in any cloud at global scale with zero downtime and zero lock-in.

It’s actually not easy to maintain Cassandra clusters. A lot of people don’t maintain Cassandra well, and their clusters end up in a bad state. Our automation framework is popular within Intuit because it makes it easier to maintain Cassandra and keep it healthy.

When we first started, we implemented Cassandra for each tax year, which meant each tax year had a new cluster. But keeping seven clusters to cover seven years of data was too expensive and difficult to manage. So we consolidated four years of tax clusters into one.

When we had one year per cluster, it masked a lot of problems. After consolidating, we realized there were a lot of pauses.

I personally went through debugging and optimizing at every level, from the kernel and the JVM up to Cassandra itself. Cassandra doesn’t remove deleted data well. We run production tests every week to make sure services don’t degrade over time, and we have to run the Cassandra Garbage Collector to reclaim disk space.
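To illustrate why deleted data lingers, here is a minimal sketch in Python. It is a simplified model, not Pronto code and not how Intuit's clusters are configured. In Cassandra, a delete writes a marker called a tombstone, and compaction can only drop that tombstone once the table's `gc_grace_seconds` has elapsed (the default is 864000 seconds, or 10 days). The function names below are hypothetical, and real purging also depends on compaction actually running and on which SSTables overlap.

```python
from datetime import datetime, timedelta, timezone

# Cassandra's default gc_grace_seconds for a table: 864000 seconds (10 days).
DEFAULT_GC_GRACE_SECONDS = 864_000


def tombstone_purgeable_at(deleted_at, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """Earliest time at which compaction may drop a tombstone (simplified model)."""
    return deleted_at + timedelta(seconds=gc_grace_seconds)


def is_purgeable(deleted_at, now, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """True once the grace period has elapsed; until then the space cannot be reclaimed."""
    return now >= tombstone_purgeable_at(deleted_at, gc_grace_seconds)


# Hypothetical example: a row deleted on July 1, 2020 (UTC).
deleted = datetime(2020, 7, 1, tzinfo=timezone.utc)
print(tombstone_purgeable_at(deleted))                     # 2020-07-11 00:00:00+00:00
print(is_purgeable(deleted, deleted + timedelta(days=9)))  # False
```

Even after the grace period, the disk space only comes back once a compaction rewrites the affected files, which is why reclaiming it is an explicit operational step.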


Pronto! Intuit Releases First Open Source Cassandra Cluster Manager