Architecture
The easiest scaling technique is to add more resources (CPU, memory, I/O bandwidth) to the server hosting the PostgreSQL database.
For HA PostgreSQL installations that use streaming replication, read-only queries can be directed to read replicas.
Citus is the most popular PostgreSQL solution that implements sharding.
A shard is an ordinary table that is a leaf in a partitioning hierarchy:
```mermaid
block-beta
columns 2
my_data("my_data is a partitioned table"):2
my_data_0("my_data_0 is a partition")
my_data_1("my_data_1 is a partition")
block:part20
my_data_0_0[("my_data_0_0")]
my_data_0_1[("my_data_0_1")]
my_data_0_2[("my_data_0_2")]
my_data_0_3[("my_data_0_3")]
end
block:part21
my_data_1_0[("my_data_1_0")]
my_data_1_1[("my_data_1_1")]
my_data_1_2[("my_data_1_2")]
my_data_1_3[("my_data_1_3")]
end
classDef front fill:#696,stroke:#333;
class my_data_0_0,my_data_0_1,my_data_0_2,my_data_0_3,my_data_1_0,my_data_1_1,my_data_1_2,my_data_1_3 front
```
The green cylinders in the diagram above are shards.
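Such a hierarchy maps onto ordinary PostgreSQL declarative partitioning. The sketch below only prints DDL reproducing the layout in the diagram (2 partitions with 4 leaf tables each); the column names, the choice of hash partitioning, and the moduli are illustrative assumptions, not the project's actual schema.

```python
# Prints DDL for a two-level partitioning hierarchy whose leaf tables are shards.
# Assumed schema: hash partitioning on user_id, then on item_id (illustration only).
TOP_LEVEL_PARTITIONS = 2     # my_data_0, my_data_1
LEAVES_PER_PARTITION = 4     # my_data_<i>_0 .. my_data_<i>_3

print("CREATE TABLE my_data (user_id bigint, item_id bigint, payload text)"
      " PARTITION BY HASH (user_id);")
for i in range(TOP_LEVEL_PARTITIONS):
    print(f"CREATE TABLE my_data_{i} PARTITION OF my_data"
          f" FOR VALUES WITH (MODULUS {TOP_LEVEL_PARTITIONS}, REMAINDER {i})"
          f" PARTITION BY HASH (item_id);")
    for j in range(LEAVES_PER_PARTITION):
        # Each leaf is an ordinary table -- this is a shard.
        print(f"CREATE TABLE my_data_{i}_{j} PARTITION OF my_data_{i}"
              f" FOR VALUES WITH (MODULUS {LEAVES_PER_PARTITION}, REMAINDER {j});")
```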
Each replica maintains a local copy of a subset of all shards configured for replication. For shards that are not stored locally, a replica maintains foreign tables instead; each foreign table points to the regular table on the replica that hosts the corresponding shard (see the sketch after the diagram below).
```mermaid
block-beta
columns 2
my_data("my_data is a partitioned table"):2
my_data_0("my_data_0 is a partition")
my_data_1("my_data_1 is a partition")
block:part20
my_data_0_0[("my_data_0_0")]
my_data_0_1[("my_data_0_1")]
my_data_0_2[("my_data_0_2")]
my_data_0_3[("my_data_0_3")]
end
block:part21
my_data_1_0[("my_data_1_0")]
my_data_1_1[("my_data_1_1")]
my_data_1_2[("my_data_1_2")]
my_data_1_3[("my_data_1_3")]
end
space:2
block:r2part20
r2my_data_0_0[("my_data_0_0")]
r2my_data_0_1[("my_data_0_1")]
r2my_data_0_2[("my_data_0_2")]
r2my_data_0_3[("my_data_0_3")]
end
block:r2part21
r2my_data_1_0[("my_data_1_0")]
r2my_data_1_1[("my_data_1_1")]
r2my_data_1_2[("my_data_1_2")]
r2my_data_1_3[("my_data_1_3")]
end
r2my_data_0("my_data_0 is a partition")
r2my_data_1("my_data_1 is a partition")
r2my_data("my_data is a partitioned table"):2
r2my_data_0_0-->my_data_0_0
my_data_0_1-->r2my_data_0_1
r2my_data_0_2-->my_data_0_2
my_data_0_3-->r2my_data_0_3
r2my_data_1_0-->my_data_1_0
my_data_1_1-->r2my_data_1_1
r2my_data_1_2-->my_data_1_2
my_data_1_3-->r2my_data_1_3
classDef local fill:#696,stroke:#333;
classDef remote fill:#969,stroke:#333;
class my_data_0_0,r2my_data_0_1,my_data_0_2,r2my_data_0_3,my_data_1_0,r2my_data_1_1,my_data_1_2,r2my_data_1_3 local
class r2my_data_0_0,my_data_0_1,r2my_data_0_2,my_data_0_3,r2my_data_1_0,my_data_1_1,r2my_data_1_2,my_data_1_3 remote
```
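In the diagram above, green cylinders are shard tables stored locally on a replica, and purple cylinders are foreign tables pointing at the replica that hosts the corresponding shard. One way to realize such a pointer in stock PostgreSQL is a postgres_fdw foreign table attached as a leaf partition; the sketch below prints such DDL, with the server name, connection options, and the set of remote shards chosen purely for illustration.

```python
# Prints DDL that attaches postgres_fdw foreign tables as leaf partitions, standing
# in for shards hosted on another replica. Server name, connection options and the
# set of remote shards are assumptions for illustration.
REMOTE_SERVER = "replica2"
REMOTE_LEAVES = [("my_data_0", 1), ("my_data_0", 3)]   # (parent partition, leaf index)

print("CREATE EXTENSION IF NOT EXISTS postgres_fdw;")
print(f"CREATE SERVER {REMOTE_SERVER} FOREIGN DATA WRAPPER postgres_fdw"
      " OPTIONS (host 'replica2.example.org', dbname 'mydb');")
print(f"CREATE USER MAPPING FOR CURRENT_USER SERVER {REMOTE_SERVER} OPTIONS (user 'app');")
for parent, j in REMOTE_LEAVES:
    leaf = f"{parent}_{j}"
    # Locally the foreign table occupies the partition slot; its rows live in the
    # regular table of the same name on the replica that hosts the shard.
    print(f"CREATE FOREIGN TABLE {leaf} PARTITION OF {parent}"
          f" FOR VALUES WITH (MODULUS 4, REMAINDER {j})"
          f" SERVER {REMOTE_SERVER} OPTIONS (table_name '{leaf}');")
```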
Shards are assigned to replicas using a variant of the Weighted Rendezvous Hashing (WRH) algorithm that takes the placement of replicas in different data centres (availability zones) into account. For each sharding key calculated for each shard (see below), the following steps are performed:
- For each pair (replica identifier, sharding key), a score is computed using the WRH algorithm.
- Replicas are grouped by availability zone.
- Each group of replicas is sorted by score in descending order.
- Replicas are then ranked by their position within their availability zone, and the top N replicas are selected to host the given shard, where N is the number of shard copies required for high availability and load balancing.
This ensures that each shard has a copy in every availability zone (as long as the required number of copies is greater than or equal to the number of availability zones).
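A possible implementation of this placement procedure is sketched below; the hash function, the Replica fields, and the tie-break between replicas holding the same position in different zones are assumptions rather than the project's actual code.

```python
import hashlib
import math
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Replica:
    id: str
    zone: str            # availability zone / data centre
    weight: float = 1.0

def wrh_score(replica: Replica, sharding_key: str) -> float:
    # Weighted Rendezvous Hashing score: -w / ln(h), where h in (0, 1) is derived
    # from a hash of (replica id, sharding key). Higher score wins.
    digest = hashlib.sha256(f"{replica.id}:{sharding_key}".encode()).digest()
    h = (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)
    return -replica.weight / math.log(h)

def place_shard(replicas: list[Replica], sharding_key: str, copies: int) -> list[Replica]:
    scores = {r: wrh_score(r, sharding_key) for r in replicas}
    # Group replicas by availability zone and sort each group by score, best first.
    by_zone: dict[str, list[Replica]] = defaultdict(list)
    for r in replicas:
        by_zone[r.zone].append(r)
    for group in by_zone.values():
        group.sort(key=lambda r: scores[r], reverse=True)
    # Rank globally by a replica's position inside its zone (so every zone's best
    # precedes any zone's second-best), break ties by score, and take the top N.
    ranked = sorted(
        (r for group in by_zone.values() for r in group),
        key=lambda r: (by_zone[r.zone].index(r), -scores[r]),
    )
    return ranked[:copies]

# Example: 4 replicas in 2 zones, 2 copies per shard -> one copy lands in each zone.
replicas = [Replica("pg1", "az1"), Replica("pg2", "az1"),
            Replica("pg3", "az2"), Replica("pg4", "az2")]
print([r.id for r in place_shard(replicas, sharding_key="my_data_0_1", copies=2)])
```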