In preparation for a critical product release, our team embarked on the essential task of upgrading our Postgres RDS instances from version 14.6 to 15.4. While the DevOps team successfully completed the upgrade in the development and staging environments, we encountered unexpected Flyway warnings post-upgrade, highlighting compatibility concerns with PostgreSQL 15.4.
This article provides a concise account of our migration journey, emphasizing the importance of adaptability and meticulous problem-solving in the face of technological evolution.
What are the key compatibility challenges? And how to solve them?
Let’s find out in the troubleshooting story from one of our team members, Ramin Orujov, Senior Software Engineer at Zoolatech.
Prior to an important product release, we made the decision to upgrade our Postgres RDS instances from version 14.6 to 15.4. The Devops team successfully handled the upgrade process across all RDS instances in the development and staging environments. However, after upgrade we started to see new flyway warnings:
“Flyway upgrade recommended: PostgreSQL 15.4 is newer than this version of Flyway and support has not been tested. The latest supported version of PostgreSQL is 14.”
With the introduction of Flyway 9.5.1 (2022-10-20), which added support for PostgreSQL 15, our team decided to upgrade all microservices to the latest version, 9.22.3.
Following the standard procedure, the team created a new branch in one of our most used microservices running on Spring Boot 2.7.x with Flyway 8.5.13.
After updating the Flyway version in the build.gradle, we ran the integration test on the local development machine. However, upon startup, the application became unresponsive, remaining in a blocked state for several minutes. Even after stopping the application, killing Docker containers, and restarting tests, the issue persisted.
For some reason the application was blocked when running the Flyway SQL migration with CREATE INDEX CONCURRENTLY statement. This blocking behavior stemmed from Flyway’s locking mechanism, which is invoked during the construction of concurrent indexes.
As a best practice, we always use CREATE/DROP INDEX CONCURRENTLY option, enabling PostgreSQL to build the index without taking any locks that prevent concurrent inserts, updates, or deletes on the table; whereas a standard index build locks out writes (but not reads) on the table until it’s done.
Quote from PostgreSQL documentation:
“Creating an index can interfere with regular operation of a database. Normally PostgreSQL locks the table to be indexed against writes and performs the entire index build with a single scan of the table. Other transactions can still read the table, but if they try to insert, update, or delete rows in the table they will block until the index build is finished. This could have a severe effect if the system is a live production database. Very large tables can take many hours to be indexed, and even for smaller tables, an index build can lock out writers for periods that are unacceptably long for a production system.
PostgreSQL supports building indexes without locking out writes. This method is invoked by specifying the CONCURRENTLY option of CREATE INDEX. When this option is used, PostgreSQL must perform two scans of the table, and in addition it must wait for all existing transactions that could potentially modify or use the index to terminate. Thus this method requires more total work than a standard index build and takes significantly longer to complete. However, since it allows normal operations to continue while the index is built, this method is useful for adding new indexes in a production environment. Of course, the extra CPU and I/O load imposed by the index creation might slow other operations.
In a concurrent index build, the index is actually entered as an “invalid” index into the system catalogs in one transaction, then two table scans occur in two more transactions. Before each table scan, the index build must wait for existing transactions that have modified the table to terminate. After the second scan, the index build must wait for any transactions that have a snapshot (see Chapter 13) predating the second scan to terminate, including transactions used by any phase of concurrent index builds on other tables, if the indexes involved are partial or have columns that are not simple column references.
Then finally the index can be marked “valid” and ready for use, and the CREATE INDEX command terminates. Even then, however, the index may not be immediately usable for queries: in the worst case, it cannot be used as long as transactions exist that predate the start of the index build.”
Quote that will be revisited in the next paragraphs.
To diagnose the issue, the team employed a systematic troubleshooting approach; first, we inserted a breakpoint and ran integration tests in debug mode.
Subsequently, I accessed the Docker container used by Testcointainer and executed the command ‘select * from pg_stat_activity;
It was observed three client backend processes:
- PID 55 is an idle Flyway process running SELECT COUNT(*) FROM pg_namespace WHERE nspname=? query to check if schema exists. (See link)
- PID 57 is my locked Flyway SQL migration process running CREATE INDEX CONCURRENTLY statement with wait_event_type = Lock
- PID 74 is my psql terminal session running select * from pg_stat_activity query.
In a separate session, it was executed the ‘select * from pg_stat_progress_create_index;’ query and observed that the SQL migration process (PID 57) tasked with building the concurrent index was locked by the Flyway schema check process (PID 55). It was waiting for old snapshots to be released.
Could you recall the process by which PostgreSQL constructs concurrent indexes as described in the highlighted documentation excerpt from the quote above?
“After the second scan, the index build must wait for any transactions that have a snapshot (see Chapter 13) predating the second scan to terminate, including transactions used by any phase of concurrent index builds on other tables, if the indexes involved are partial or have columns that are not simple column references.”
Therefore, in this scenario, PID 57 must wait for PID 55 to terminate and release transaction snapshots before proceeding.
Upon further investigation, it was discovered that Flyway 9.1.2 (2022-08-03) introduced a new feature known as transactional lock. This enhancement, detailed in Flyway’s issue tracker (https://github.com/flyway/flyway/issues/2895), addresses transactional concurrency challenges in PostgreSQL migrations. The introduction of transactional locks signifies a crucial evolution in Flyway’s capabilities, particularly in managing database transactions during migrations.
“Issue 2895 Use transactional locks in PostgreSQL by default, which enables support for PgBouncer (session locks can be enabled again by setting ‘flyway.postgresql.transactional.lock=false’)”
GitHub User (2023)
It means all transactions by default are running with global application level Postgres advisory lock using pg_try_advisory_xact_lock function and other sessions cannot acquire it until this lock is released(unlocked) by the owning process. This change explains why building concurrent index fails and we need to enable session level lock via flyway.postgresql.transactional.lock=false.
This signifies that by default, all transactions operate under a global application-level Postgres advisory lock, facilitated by the pg_try_advisory_xact_lock function. Consequently, other sessions are unable to acquire this lock until it’s released by the owning process. This default behavior sheds light on why the construction of concurrent indexes fails.
To address this issue, we must enable session-level locks by setting flyway.postgresql.transactional.lock=false. This adjustment ensures smoother migration processes by allowing concurrent index creation without conflicts arising from global advisory locks.
Following the configuration change, all Flyway migrations executed successfully, and the application resumed normal operation without any issues.
Our journey from initial upgrades to troubleshooting uncovered unexpected challenges and valuable insights.
The decision to upgrade Postgres RDS instances led us to discover nuances in Flyway migrations and PostgreSQL interactions.
Through systematic troubleshooting, we learned the importance of adaptability and thorough testing.
Ultimately, our experience underscores the necessity of understanding database systems and migration tools for seamless upgrades.
As we reflect on our journey, we emerge with newfound insights, ready to tackle future challenges with confidence and ingenuity in software development.