
# Architecture


## Introduction

### Alternatives for scaling PostgreSQL

#### Vertical

The easiest scaling technique is to add more resources (CPU, memory, I/O bandwidth) to the server hosting the PostgreSQL database.

#### Read replicas

In HA PostgreSQL installations that use streaming replication, read-only queries can be directed to read replicas.

#### Sharding with Citus

Citus is the most popular sharding solution for PostgreSQL.

## What is a shard?

A shard is an ordinary table that is a leaf in a partitioning hierarchy:

```mermaid
block-beta
  columns 2
  my_data("my_data is a partitioned table"):2
  my_data_0("my_data_0 is a partition")
  my_data_1("my_data_1 is a partition")
  block:part20
    my_data_0_0[("my_data_0_0")]
    my_data_0_1[("my_data_0_1")]
    my_data_0_2[("my_data_0_2")]
    my_data_0_3[("my_data_0_3")]
  end
  block:part21
    my_data_1_0[("my_data_1_0")]
    my_data_1_1[("my_data_1_1")]
    my_data_1_2[("my_data_1_2")]
    my_data_1_3[("my_data_1_3")]
  end

  classDef front fill:#696,stroke:#333;

  class my_data_0_0,my_data_0_1,my_data_0_2,my_data_0_3,my_data_1_0,my_data_1_1,my_data_1_2,my_data_1_3 front
```

The green cylinders in the diagram above are shards.
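
To make this concrete, such a hierarchy can be declared with plain PostgreSQL DDL. The sketch below is illustrative only: it assumes hash partitioning at both levels, and the column names are made up.

```sql
-- Root of the hierarchy: a partitioned table (holds no rows itself).
CREATE TABLE my_data (
    id   bigint NOT NULL,
    data text
) PARTITION BY HASH (id);

-- First level: partitions that are themselves partitioned (not shards).
CREATE TABLE my_data_0 PARTITION OF my_data
    FOR VALUES WITH (MODULUS 2, REMAINDER 0)
    PARTITION BY HASH (id);
CREATE TABLE my_data_1 PARTITION OF my_data
    FOR VALUES WITH (MODULUS 2, REMAINDER 1)
    PARTITION BY HASH (id);

-- Second level: ordinary leaf tables -- these are the shards.
CREATE TABLE my_data_0_0 PARTITION OF my_data_0
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE my_data_0_1 PARTITION OF my_data_0
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
-- ... and likewise for my_data_0_2 through my_data_1_3.
```

Only the leaf tables hold rows; the intermediate levels exist purely to route tuples to the right shard.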

## How are shards distributed among replicas?

Each replica maintains a local copy of a subset of all shards configured for replication. For shards that are not maintained locally, a replica instead maintains foreign tables that point to the regular tables on the replicas hosting those shards.

```mermaid
block-beta
  columns 2
  my_data("my_data is a partitioned table"):2
  my_data_0("my_data_0 is a partition")
  my_data_1("my_data_1 is a partition")
  block:part20
    my_data_0_0[("my_data_0_0")]
    my_data_0_1[("my_data_0_1")]
    my_data_0_2[("my_data_0_2")]
    my_data_0_3[("my_data_0_3")]
  end
  block:part21
    my_data_1_0[("my_data_1_0")]
    my_data_1_1[("my_data_1_1")]
    my_data_1_2[("my_data_1_2")]
    my_data_1_3[("my_data_1_3")]
  end
  space:2
  block:r2part20
    r2my_data_0_0[("my_data_0_0")]
    r2my_data_0_1[("my_data_0_1")]
    r2my_data_0_2[("my_data_0_2")]
    r2my_data_0_3[("my_data_0_3")]
  end
  block:r2part21
    r2my_data_1_0[("my_data_1_0")]
    r2my_data_1_1[("my_data_1_1")]
    r2my_data_1_2[("my_data_1_2")]
    r2my_data_1_3[("my_data_1_3")]
  end
  r2my_data_0("my_data_0 is a partition")
  r2my_data_1("my_data_1 is a partition")
  r2my_data("my_data is a partitioned table"):2

  r2my_data_0_0-->my_data_0_0
  my_data_0_1-->r2my_data_0_1
  r2my_data_0_2-->my_data_0_2
  my_data_0_3-->r2my_data_0_3
  r2my_data_1_0-->my_data_1_0
  my_data_1_1-->r2my_data_1_1
  r2my_data_1_2-->my_data_1_2
  my_data_1_3-->r2my_data_1_3

  classDef local fill:#696,stroke:#333;
  classDef remote fill:#969,stroke:#333;

  class my_data_0_0,r2my_data_0_1,my_data_0_2,r2my_data_0_3,my_data_1_0,r2my_data_1_1,my_data_1_2,r2my_data_1_3 local
  class r2my_data_0_0,my_data_0_1,r2my_data_0_2,my_data_0_3,r2my_data_1_0,my_data_1_1,r2my_data_1_2,my_data_1_3 remote
```
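
In the diagram above, green cylinders are shards held locally as regular tables, while purple cylinders are foreign tables pointing at the replica that hosts the shard. The sketch below shows how such wiring could look, assuming postgres_fdw and illustrative server, host, and credential names; the project's actual setup may differ.

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- The peer replica that hosts the shards we do not keep locally.
CREATE SERVER replica_2
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'replica2.example.com', dbname 'mydb');

CREATE USER MAPPING FOR CURRENT_USER SERVER replica_2
    OPTIONS (user 'app', password 'secret');  -- illustrative credentials

-- my_data_0_0 is maintained locally: a regular table.
CREATE TABLE my_data_0_0 PARTITION OF my_data_0
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);

-- my_data_0_1 is hosted on replica_2: locally it is a foreign table
-- attached as a partition, pointing at the regular table of the same
-- name on the remote replica.
CREATE FOREIGN TABLE my_data_0_1
    PARTITION OF my_data_0
    FOR VALUES WITH (MODULUS 4, REMAINDER 1)
    SERVER replica_2;
```

Queries against my_data are then routed by ordinary partitioning logic: rows in local shards are read directly, while rows in remote shards are fetched through the foreign tables.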

### Shard-to-replica assignment

Assignment of shards to replicas is done using a variant of the Weighted Rendezvous Hashing (WRH) algorithm that takes into account the placement of replicas in different data centres (availability zones). For the sharding key computed for each shard (see below), the following algorithm is performed:

- For each pair (replica identifier, sharding key), a score is computed using the WRH algorithm.
- Replicas are grouped by availability zone.
- Each group is sorted by score, descending.
- Replicas are then ordered by their rank within their availability zone, and the top N replicas are selected to host the given shard, where N is the required number of shard copies for high availability and load balancing (see the sketch after this section).

The algorithm ensures that each shard has a copy in each availability zone (as long as the required number of copies is greater than or equal to the number of availability zones).
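
The sketch below illustrates this selection in plain SQL. It is an assumption-laden illustration, not the project's actual implementation: the replica catalog (a hypothetical table replica(id text, az text, weight numeric)), the use of hashtextextended() as the hash function, and the placeholder sharding key are all made up. A common formulation of weighted rendezvous scoring is score = -weight / ln(h), where h is the pair's hash mapped into (0, 1) and weights are positive; the highest scores win.

```sql
WITH scored AS (
    SELECT r.id,
           r.az,
           -- WRH score for the pair (replica id, sharding key);
           -- 'shard-key-0' is a placeholder for the shard's sharding
           -- key (its computation is described below).
           -r.weight / ln(
               (hashtextextended(r.id || ':' || 'shard-key-0', 0)::numeric
                  + 9223372036854775809)  -- shift the bigint hash into [1, 2^64]
               / 18446744073709551617     -- ...then scale it into (0, 1)
           ) AS score
    FROM replica r
),
ranked AS (
    -- Rank replicas within their availability zone, best score first.
    SELECT id, az, score,
           row_number() OVER (PARTITION BY az ORDER BY score DESC) AS az_rank
    FROM scored
)
SELECT id
FROM ranked
ORDER BY az_rank, score DESC  -- the best replica of every zone comes first
LIMIT 3;                      -- N = 3 required shard copies
```

Ordering by az_rank before score is what yields the zone guarantee: the top-ranked replica of every availability zone precedes any second pick, so with N greater than or equal to the number of zones, every zone receives a copy of the shard.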