Experiences with running PostgreSQL on Kubernetes

Introduction

Below is a transcript of an interview with our CTO, [Sasha Klizhentas], about his experience running PostgreSQL on Kubernetes. We discuss the challenges involved, the open source and commercial tools that can help, and alternatives to managing stateful applications on Kubernetes.

For some background, Gravitational specializes in running applications across a variety of infrastructure footprints with the help of Kubernetes. The applications our customers deploy need a persistent data store to go along with their stateless microservices. To complicate matters, the majority of our deployments are [on-premises private SaaS], so we cannot rely on cloud services like AWS RDS.

Challenges with running Postgres on Kubernetes

Abe: If someone wants to run Postgres or a similar database on Kubernetes where should they start?

Sasha: It’s really hard to do. The hardest thing in running Postgres on Kubernetes is to understand that Kubernetes is not aware of the deployment details of Postgres. A naive deployment could lead to complete data loss.


Here’s a typical scenario where that happens. You set up streaming replication, and let’s say the first master is up. All the writes go there and replicate asynchronously to the standby. Then the current master suddenly goes down while asynchronous replication has a huge lag, caused by something like a network partition. If a naive leader election algorithm kicks in, or an administrator who doesn’t know the state manually triggers failover, the secondary becomes the master and the new source of truth. All of the writes that had not yet been replicated are lost. When the admin recovers the first master, it is no longer the master and has to completely resync its state from the second node, which is now the master.
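The scenario above can be illustrated with a toy model. This is only a sketch in plain Python, not real Postgres behavior: the `Node` class, the `naive_failover` function, and the node names `pg-0` and `pg-1` are all hypothetical, and "replication" is just a list copy that stops partway to imitate lag.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a Postgres instance; `log` holds applied writes."""
    name: str
    log: list = field(default_factory=list)


def naive_failover(master, standby, writes, replicated_count):
    """Simulate a master crash followed by a failover that ignores lag.

    Every write is accepted by the master, but only the first
    `replicated_count` writes reach the standby before the crash.
    Returns the promoted node and the writes that were lost.
    """
    for write in writes:
        master.log.append(write)                  # accepted by the master
    standby.log.extend(writes[:replicated_count])  # lagging async replication

    # The master goes down here. A naive failover promotes the standby
    # without checking how far behind it is.
    lost = master.log[len(standby.log):]
    return standby, lost


old_master = Node("pg-0")
standby = Node("pg-1")
new_master, lost = naive_failover(
    old_master, standby, ["w1", "w2", "w3", "w4", "w5"], replicated_count=2
)
print(new_master.name, lost)  # pg-1 ['w3', 'w4', 'w5']
```

The three writes that never reached the standby exist only on the old master, which has to discard them when it resyncs from the new one.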

Abe: Have you seen this with clusters you support at Gravitational?

Sasha: Yeah, that was a real data loss pattern we saw with asynchronous replication.

Abe: Are you talking about classic [slony] or the [streaming replication features] that have been built into Postgres since 9.x?

Sasha: Asynchronous replication sends operations to the follower (standby) nodes. Those operations could be changes to the state, writes, or newly created values. Whenever a chunk of this data is lost, there should be a mechanism that tells the receiving node that its data is out of sync. Postgres has a mechanism that helps track replication lag, but it does not yet have a built-in authority that analyzes this data and helps whoever is doing leader election complete it.
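As a rough illustration of what such an authority might check, here is a minimal sketch of a lag-aware promotion guard. It assumes the last write-ahead log position (LSN) reported by the master was recorded somewhere outside the master before it went down. The function names and the `max_lag_bytes` threshold are hypothetical and not part of Postgres. The only Postgres detail relied on is the textual LSN format: two hexadecimal numbers separated by a slash, representing a 64-bit position.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '0/3000060' into a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(master_lsn: str, standby_replay_lsn: str) -> int:
    """Bytes of WAL the standby has not yet replayed (never negative)."""
    return max(0, parse_lsn(master_lsn) - parse_lsn(standby_replay_lsn))


def safe_to_promote(last_known_master_lsn: str,
                    standby_replay_lsn: str,
                    max_lag_bytes: int = 0) -> bool:
    """Hypothetical guard: allow promotion only if the standby is caught up.

    `last_known_master_lsn` is assumed to come from an external record
    kept while the master was still alive.
    """
    lag = replication_lag_bytes(last_known_master_lsn, standby_replay_lsn)
    return lag <= max_lag_bytes


# The standby is 0x60 = 96 bytes behind, so promotion is refused.
print(replication_lag_bytes("0/3000060", "0/3000000"))  # 96
print(safe_to_promote("0/3000060", "0/3000000"))        # False
print(safe_to_promote("0/3000060", "0/3000060"))        # True
```

A guard like this only narrows the window for data loss. If the master accepted writes after its last recorded position, those writes are still invisible to whoever makes the promotion decision.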
