> We discovered that on reconnect, Postgres would start reading the replication slot’s WAL from the restart_lsn—effectively the beginning—rather than from the last processed position. For workloads with long-running, large, or interleaved transactions, this meant unnecessarily re-reading large portions of WAL and drastically impacting replication lag.
I wonder if that's intended by Postgres? Doesn't seem logical at first glance.
Great question! I believe this behavior is by design in logical decoding. Based on my reading and previous chats with committers, this is my understanding: logical decoding is not stateless, and on reconnection it loses the current transaction state (open transactions, subtransactions, snapshots, catalog state, etc.) that is required for decoding. As a result, a reconnection triggers reading WAL from restart_lsn in order to reconstruct that state.
There may be room for improvement in PostgreSQL, by persisting this state to help these reconnections, but I think this is non-trivial and complex than I think, because of the guarantees PostgreSQL must provide around correctness, consistency and reliability.
Also, based on what I’ve read, physical replication does not have this issue because it directly ships WAL files (instead of contructing a txn) and reconstructs state on the standby.
I’ll let PostgreSQL committers/contributors chime in too on this for a more precise analysis. :)
I wonder if that's intended by Postgres? Doesn't seem logical at first glance.