Step-by-Step Guide: Setting up a High Availability PostgreSQL Cluster with Load Balancing and Automatic Fail-over on AWS EC2 using Pgpool-II and Watchdog

Created By	Daniyal Khawaja
Type	Walk-Through
Created	@March 24, 2023 7:44 PM

Welcome to this step-by-step guide on setting up a high-availability Postgres cluster with load balancing and automatic failover on AWS EC2 using Pgpool and Watchdog. This guide walks you through the process of creating a highly available database cluster on AWS that provides improved reliability, automatic failover, and load balancing for improved performance.

Before we begin, note that this guide assumes that you have some basic knowledge of AWS, Postgres, Pgpool, and Watchdog. If you are new to any of these technologies, it is recommended that you review the documentation and tutorials provided by their respective vendors before proceeding.

With that said, let's get started!

Step 1: Setting up the Ec2 instances

Click the "Launch Instance" button to begin creating a new EC2 instance.

Choose the Ubuntu Server AMI (Amazon Machine Image) from the list of available images.

Select the instance type for your EC2 instances. For this setup, it's recommended to use at least medium instances with 2 vCPUs, 4 GB RAM, and 10 GB storage.

Configure the network settings. Make sure to select the same sub-net for all three instances to ensure that they are on the same network.

Review the instance details, and then click "Launch".

Create a new key pair or select an existing one. This will allow you to access your instances using SSH.

Launch your instances and wait for them to become available.

Once your instances are up and running, use SSH to connect to each instance.

Step 2: Installing PostgreSQL and Pgpool-II.

💡

In this guide, we will be using version 15 of PostgreSQL.

💡

The following steps need to be replicated in each instance:

Once you're logged in to each instance, run the following command to update and upgrade the system packages:

sudo apt-get update && sudo apt-get upgrade

Next, you'll need to download and install Postgres. Start by adding the Postgres repository to the sources list:

sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'

Import the Postgres signing key to trust the packages from the repository:

wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -

Update the package list again to include the Postgres packages:

sudo apt-get update

Finally, install Postgres by running the following command:

sudo apt-get -y install postgresql

Pgpool-II

Download Pgpool version 4.4.2 by running the following command

wget https://www.pgpool.net/mediawiki/download.php?f=pgpool-II-4.4.2.tar.gz

Install the required packages for building Pgpool by running the following command:

sudo apt-get install libpq-dev bison build-essential

Once the packages are installed, navigate to the directory where you downloaded Pgpool and extract the files by running:

tar -zxvf pgpool-II-4.4.2.tar.gz

Change to the extracted directory by running:

cd pgpool-II-4.4.2

Run the configure script to configure the build process:

./configure

After the configuration is complete, run the make command to build Pgpool:

make

Once the build is complete, install Pgpool by running:

sudo make install

💡

After installation, you can check the installed files in the /usr/local/etc/ directory.

PostgreSQL setup

To set up PostgreSQL on the main server , follow these steps:

Initialize the PostgreSQL data directory by running the following command:

sudo mkdir /var/db && sudo chmod a+rw /var/db && mkdir \
	/var/db/postgres && \
	/usr/lib/postgresql/15/bin/initdb -E UTF-8 /var/db/postgres/data15

Open the postgresql.conf file for editing by running the following command:

cd /var/db/postgres/data15
	nano postgresql.conf

In the postgresql.conf file, make the following changes:

listen_addresses = '*' 
	port = 5432
	
	shared_buffers = 1GB
	work_mem = 2621kB
	maintenance_work_mem = 256MB
	dynamic_shared_memory_type = posix
	effective_io_concurrency = 200 
	
	# WAL settings
	wal_level = replica 
	wal_compression = on
	wal_buffers = 16MB 
	checkpoint_completion_target = 0.9
	max_wal_size = 3GB
	min_wal_size = 1GB
	
	# Archive settings
	archive_mode = on
	archive_command = 'cp "%p" "/usr/lib/postgresql/archivedir/%f"'
	restore_command = 'scp ubuntu@<ip-address-of-main-server>:/usr/lib/postgresql/archivedir/%f "%p"' 
	
	max_wal_senders = 10  
	max_replication_slots = 10 
	wal_keep_size = '1GB' 
	synchronous_standby_names = '*'
	wal_receiver_create_temp_slot = on 
	wal_receiver_status_interval = 5s 
	hot_standby_feedback = on 
	random_page_cost = 1.1
	effective_cache_size = 3GB
	default_statistics_target = 100

These changes include enabling the PostgreSQL server to listen on all network interfaces by setting listen_addresses to '*' and setting the port to 5432.

The shared_buffers parameter has been set to 1GB, which controls the amount of memory that PostgreSQL uses for caching data. The work_mem parameter has been set to 2621kB, which controls the amount of memory used for sorting and other operations. The maintenance_work_mem parameter has been set to 256MB, which controls the amount of memory used for maintenance operations such as VACUUM and REINDEX.

The wal_level has been set to replica to enable replication. The archive_mode parameter has been set to on to enable archiving of WAL files. The archive_command and restore_command parameters specify how the archived WAL files are copied to and restored from the archive directory.

The max_wal_senders, max_replication_slots, wal_keep_size, synchronous_standby_names, wal_receiver_create_temp_slot, and wal_receiver_status_interval parameters are all related to synchronous replication and are set according to the specific needs of the cluster.

Finally, various other parameters such as random_page_cost, effective_cache_size, and default_statistics_target have been set to improve performance.

💡

You can always change these settings according to your needs.

To start the primary server, you need to first stop the PostgreSQL service if it is already running. This can be done with the command:

sudo systemctl stop postgresql.service

Next, you need to make sure that the /var/run/postgresql/ directory has the necessary permissions. You can do this with the command:

sudo chmod a+rw /var/run/postgresql/

After that, create a directory for storing the archived WAL files and give it the necessary permissions. You can use the following commands for this:

sudo mkdir /usr/lib/postgresql/archivedir/
	sudo chmod a+rw /usr/lib/postgresql/archivedir

To start the primary server, run the following command:

/usr/lib/postgresql/15/bin/pg_ctl -D data15/ start -l logfile

Once the server is running, you can connect to it using the psql command-line tool. Use the following command to create a new user with replication privileges:

psql -h localhost -U postgres
	####after entering psql, run these commands
	CREATE USER <username> WITH PASSWORD '<password>';
	ALTER USER <username> WITH REPLICATION;
	ALTER USER <username> WITH SUPERUSER;

To confirm that the user was created successfully, you can run the following command:

SELECT usename, usecreatedb, usesuper, userepl, passwd IS NOT NULL AS haspassword FROM pg_user;

Next, you need to create replication slots for the replica databases. To create a physical replication slot, run the following command:

SELECT pg_create_physical_replication_slot('<slot_name>');#(per replica)

To view the replication slots, you can run the following command:

SELECT slot_name, slot_type, active, wal_status FROM pg_replication_slots;

By following these steps, you should now have a primary PostgreSQL server running with the necessary permissions, a user with replication privileges, and replication slots created for the replica databases.

After creating the replication slots for replica databases, the next step is to add the replication servers to the main server's pg_hba.conf file.

Suppose the replication servers have IP addresses ip1 and ip2. To add them to the pg_hba.conf file, run the following commands:

sudo nano /var/db/postgres/data15/pg_hba.conf

Then, add the following lines to the file:

host replication    username    ip1/32   trust
	host postgres       ubuntu      ip1/32   trust
	host postgres       username    ip1/32   trust
	
	host replication    username    ip2/32    trust
	host postgres       ubuntu      ip2/32    trust
	host postgres       username    ip2/32    trust
	
	host replication    username    <main-server-ip>/32    trust
	host postgres       ubuntu      <main-server-ip>/32    trust
	host postgres       username    <main-server-ip>/32    trust

Save and close the file. After that, restart the PostgreSQL server using the following command:

sudo /usr/lib/postgresql/15/bin/pg_ctl -D /var/db/postgres/data15/ restart -l logfile

PostgreSQL setup - On standby servers

Run the pg_basebackup command to create a base backup of the primary server on the standby server:

sudo -u postgres /usr/lib/postgresql/15/bin/pg_basebackup \
	--pgdata=/var/db/postgres/data15 --format=p --write-recovery-conf \
	--checkpoint=fast --label=mffb --progress --host=<primary-ip> \
	--port=5432 --username=<username>

💡

Note: Replace <primary-ip> and <username> with the appropriate values.

Edit the postgresql.conf file to add the following parameters:

primary_conninfo = 'user=<username> port=5432 host=172.31.15.104 application_name=172.31.6.95_username'
	primary_slot_name = '<slot_name>'
	hot_standby = on

💡

Note: Replace <slot_name> with the replication slot name created on the primary server.

Now execute these commands.

sudo skill postgres
	sudo chmod a+rw /var/run/postgresql/
	sudo mkdir /usr/lib/postgresql/archivedir/
	sudo chmod a+rw /usr/lib/postgresql/archivedir

‼️

First, enable password-less SSH from every node to every other node. To do this, run ssh-keygen on every node and copy its ~/.ssh/id_rsa.pub file to every other node's ~/.ssh/authorized_keys file, including its own.

Now start the standby servers using the following command:

/usr/lib/postgresql/15/bin/pg_ctl  -D data15/ start -l logfile

Running the command psql -h <standby-ip> -U <username> on the standby server and then executing SELECT * FROM pg_stat_wal_receiver; will display the status of the WAL receiver process. This will confirm whether the standby server is successfully receiving the WAL data from the primary server and is ready to serve as a standby server.

Configuring Pgpool-II with Watchdog and Load Balancing on the Main Server.

After setting up streaming replication, the next step is to configure pgpool and enable high availability and load balancing. The following steps should be taken on the main server:

Change into the “/usr/local/etc” directory

Move the sample files by running the following commands:

sudo mv pgpool.conf.sample pgpool.conf
	sudo mv escalation.sh.sample escalation.sh
	sudo mv follow_primary.sh.sample follow_primary.sh
	sudo mv failover.sh.sample failover.sh
	sudo chmod +x escalation.sh follow_primary.sh failover.sh
	sudo chmod a+rw follow_primary.sh escalation.sh failover.sh

Open pgpool.conf with sudo nano pgpool.conf and make the following changes:

backend_clustering_mode = 'streaming_replication'
	listen_addresses = ''
	port = 9999
	pcp_listen_addresses = ''
	pcp_port = 9898
	backend_hostname0 = '<ip-of-primary-server>' (primary ip)
	backend_port0 = 5432
	backend_weight0 = 0
	backend_data_directory0 = '/var/db/postgres/data15'
	backend_flag0 = 'ALLOW_TO_FAILOVER'
	backend_application_name0 = 'primary_server'
	
	backend_hostname1 = '<ip-of-standby-server-1>'
	backend_port1 = 5432
	backend_weight1 = 1
	backend_data_directory1 = '/var/db/postgres/data15'
	backend_flag1 = 'ALLOW_TO_FAILOVER'
	backend_flag1 = 'ALLOW_TO_FAILOVER'
	backend_application_name1 = 'replica_server1'
	
	backend_hostname2 = '<ip-of-standby-server-2>'
	backend_port2 = 5432
	backend_weight2 = 1
	backend_data_directory2 = '/var/db/postgres/data15'
	backend_flag2 = 'ALLOW_TO_FAILOVER'
	backend_application_name2 = 'replica_server2'
	
	pool_passwd = 'pool_passwd'
	max_pool = 10
	log_standby_delay = 'always'
	logging_collector = on
	log_directory = '/home/ubuntu/pgpool_logs'
	log_filename = 'pgpool-%Y-%m-%d_%H%M%S.log'
	log_file_mode = 0600
	load_balance_mode = on
	ignore_leading_white_space = on
	read_only_function_list = 'get_.,select_.'
	write_function_list = 'nextval,setval,set_.,update_.,delete_.,insert_.'
	database_redirect_preference_list = 'postgres:0(0.5),postgres:1(0.5),postgres:2(0.5)'
	sr_check_period = 10
	sr_check_user = 'username'
	sr_check_password = ''
	sr_check_database = 'postgres'
	delay_threshold = 20
	prefer_lower_delay_standby = on
	follow_primary_command = '/usr/local/etc/follow_primary.sh %d %h %p %D %m %H %M %P %r %R'
	
	#Configure the health check settings:
	health_check_period = 5
	health_check_timeout = 20
	health_check_user = 'username'
	health_check_password = 'password'
	health_check_database = 'postgres'
	health_check_max_retries = 5
	health_check_retry_delay = 5
	failover_command = '/usr/local/etc/failover.sh %d %h %p %D %m %H %M %P %r %R %N %S'
	failover_on_backend_shutdown = on
	detach_false_primary = on
	search_primary_node_timeout = 10
	recovery_user = 'username'
	auto_failback = on
	auto_failback_interval = 1min
	
	# Watchdog settings 
	
	use_watchdog = on
	trusted_servers = 'server1,server2,server3'
	trusted_server_command = '/bin/ping -q -c3 %h'
	hostname0 = 'server1'
	wd_port0 = 9000
	pgpool_port0 = 9999
	hostname1 = 'server2'
	wd_port1 = 9000
	pgpool_port1 = 9999
	hostname2 = 'server3'
	wd_port2 = 9000
	pgpool_port2 = 9999
	wd_priority = 1
	delegate_ip = '<leave blank for a sec>'
	if_cmd_path = '/sbin'
	if_up_cmd = '/usr/bin/sudo /sbin/ip addr add $_IP_$/24 dev eth0 label eth0:0'
	if_up_cmd = '/usr/bin/sudo /sbin/ip addr add $_IP_$/24 dev eth0 label eth0:0'
	if_down_cmd = '/usr/bin/sudo /sbin/ip addr del $_IP_$/24 dev eth0'
	arping_path = '/usr/sbin'
	arping_cmd = '/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I eth0'
	ping_path = '/bin'
	wd_escalation_command = '/usr/local/etc/escalation.sh'
	failover_when_quorum_exists = on
	failover_require_consensus = on
	wd_remove_shutdown_nodes = on
	wd_lifecheck_method = 'heartbeat'
	wd_interval = 10
	heartbeat_hostname0 = 'server1'
	heartbeat_port0 = 9694
	heartbeat_device0 = ''
	heartbeat_hostname1 = 'server2'
	heartbeat_port1 = 9694
	heartbeat_device1 = ''
	heartbeat_hostname2 = 'server3'
	heartbeat_port2 = 9694
	heartbeat_device2 = ''
	wd_heartbeat_keepalive = 2
	wd_heartbeat_deadtime = 30

‼️

Before setting up pgpool, it's important to install arping and other net-tool utilities.

🔒

Additionally, add the following line to the ~/.bashrc file:

export LD_LIBRARY_PATH=//usr/local/lib/:$LD_LIBRARY_PATH

To configure Watchdog for Virtual IP, first allocate an Elastic IP to the primary server through AWS. Then, use that IP in the delegate IP configuration .

delegate_ip = 'elastic IP'

now, change the following in follow_primary.sh:

PGHOME=/lib/postgresql/15
	ARCHIVEDIR=/usr/lib/postgresql/archivedir
	REPLUSER=daniyal
	PCP_USER=daniyal
	PGPOOL_PATH=/usr/local/bin/
	POSTGRESQL_STARTUP_USER=ubuntu
	SSH_KEY_FILE=id_rsa

and change the following in failover.sh:

PGHOME=/lib/postgresql/15
	POSTGRESQL_STARTUP_USER=ubuntu
	SSH_KEY_FILE=id_rsa

change the following in escalation.sh

POSTGRESQL_STARTUP_USER=ubuntu
	SSH_KEY_FILE=id_rsa
	SSH_OPTIONS="-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/${SSH_KEY_FILE}"
	PGPOOLS=(server1 server2 server3)
	VIP=<delegate-ip>
	DEVICE=eth0

🗣

Now execute the following commands on all servers

sudo mkdir /var/run/pgpool && sudo chmod a+rw /var/run/pgpool
	sudo chmod a+rw /usr/local/etc/

Start pgpool on the primary node first.

Pgpool -D

Your output should be in the log file in the following format:

✅

If u see this output then you can go on ahead and start pgpool on the standby servers too, and observer which one is the master by using: ”pcp_watchdog_status -U username”

After completing the installation and configuration of the PostgreSQL cluster with pgpool and watchdog on AWS, you can check the replication statistics of the primary server using the following command:

psql -h <primary-ip> -p 9999 -U <username> -c "SELECT * FROM pg_stat_replication;"

You can also check the replication slots of the primary using the following command:

psql -h <primary-ip> -p 9999 -U <username> -c "SELECT * FROM pg_replication_slots;"

✅

You can test if fail-over is working correctly by purposely failing the primary node and observing the cluster's behavior through logs and the following command:

psql -h <primary-ip> -p 9999 -U <username> -c “show pool_nodes;”

To test if streaming replication is working fine, you can create a table from the primary and test if you can add or delete any value from it from the standby nodes.

To test for load balancing, you can use pgbench. Please refer to the documentation or Google for more information.

Remember to keep your system up to date with the latest security patches and to regularly back up your data to avoid any data loss. If you need further assistance or want to explore more advanced configurations, feel free to consult the official PostgreSQL documentation or seek help from the community.