This error indicates an issue with PostgreSQL’s Write-Ahead Log (WAL) archiving when using replication slots. Here’s a comprehensive guide to understand and resolve this issue:

## Understanding the Problem

**Replication slots** keep track of how much WAL data a replica has consumed, preventing the primary from deleting WAL files that replicas still need. When **WAL archiving** (`archive_mode = on`) is enabled alongside replication slots, WAL can pile up in `pg_wal` and errors can occur if:

1. The archive command fails
2. Disk space is insufficient
3. Replication slots are preventing WAL cleanup
4. Archive timeout settings are misconfigured

## Common Error Messages

- `ERROR: replication slot "slot_name" cannot be archived`
- `WARNING: archiving write-ahead log file "0000000100000001000000AB" failed too many times`
- `FATAL: could not archive write-ahead log file "0000000100000001000000AB"`

## Step-by-Step Solutions

### 1. **Check Current Status**

```sql
-- Check replication slots
SELECT * FROM pg_replication_slots;

-- Check WAL archiving status
SELECT * FROM pg_stat_archiver;

-- Check current WAL position
SELECT pg_current_wal_lsn();
```

### 2. **Fix Archive Command Issues**

Check your `postgresql.conf`:
```ini
archive_mode = on
archive_command = 'cp %p /path/to/archive/%f'
# Consider adding error handling:
# archive_command = 'test ! -f /path/to/archive/%f && cp %p /path/to/archive/%f'
```

**Test the archive command manually:**
```bash
# Find a WAL file
find "$PGDATA/pg_wal" -name "*.partial" -o -name "[0-9]*" | head -5

# Test the archive command (run it as the OS user that runs PostgreSQL)
cp "$PGDATA/pg_wal/0000000100000001000000AB" /path/to/archive/
```
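A plain `cp` does not exercise the `%p` and `%f` placeholders or the exit status that PostgreSQL acts on. The sketch below expands a template the same way (`%p` becomes the full path of the segment, `%f` its file name) and runs it through the shell. `run_archive_command` is a hypothetical helper written for this guide, not a PostgreSQL tool, and it assumes the path contains no `|` or `&` characters.

```bash
#!/bin/bash
# Hypothetical helper: expand %p (full path) and %f (file name) in an
# archive_command template, then run the result through the shell.
# Assumes the WAL path contains no '|' or '&' (they are special to sed here).
run_archive_command() {
    template=$1
    wal_path=$2
    wal_name=$(basename "$wal_path")
    cmd=$(printf '%s\n' "$template" |
        sed -e "s|%p|$wal_path|g" -e "s|%f|$wal_name|g")
    echo "running: $cmd" >&2
    sh -c "$cmd"
}

# Example (segment name taken from the step above):
# run_archive_command 'test ! -f /path/to/archive/%f && cp %p /path/to/archive/%f' \
#     "$PGDATA/pg_wal/0000000100000001000000AB"
# echo "exit status: $?"
```

A non-zero exit status is what makes PostgreSQL keep the segment and retry, so this is the value worth checking when the archiver appears stuck.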

### 3. **Manage Replication Slots**

**If a replica is down or lagging:**
```sql
-- Check slot activity
SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
AS lag_bytes FROM pg_replication_slots;

-- Drop a problematic slot (CAUTION: replicas will need to re-sync)
SELECT pg_drop_replication_slot('slot_name');
```
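The `lag_bytes` figure is easier to relate to disk usage when expressed as a number of WAL files. A minimal sketch, assuming the default 16 MB `wal_segment_size` (pass a second argument if your cluster was initialized with a different size); `lag_to_segments` is a hypothetical helper name:

```bash
#!/bin/bash
# Hypothetical helper: convert a slot's lag in bytes into a count of WAL
# segments. The 16 MB default segment size is an assumption.
lag_to_segments() {
    lag_bytes=$1
    segment_bytes=${2:-16777216}
    # Round up: a partially consumed segment is still kept in full.
    echo $(( ($lag_bytes + $segment_bytes - 1) / $segment_bytes ))
}

# Example: feed it the lag of one slot ('slot_name' is a placeholder)
# lag=$(psql -U postgres -At -c "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
#       FROM pg_replication_slots WHERE slot_name = 'slot_name'")
# lag_to_segments "$lag"
```

A slot that is 10 GB behind, for example, pins roughly 640 segment files in `pg_wal`.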

**Alternative: Adjust slot retention**
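```sql
-- Cap the WAL that any replication slot, physical or logical, may retain (PostgreSQL 13+).
-- A slot that falls further behind is invalidated and its replica must be rebuilt.
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';

-- Minimum WAL kept for standbys that do not use a slot (PostgreSQL 13+)
ALTER SYSTEM SET wal_keep_size = '1024MB';

-- Apply the changes
SELECT pg_reload_conf();
```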

### 4. **Free Up WAL Space**

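```sql
-- Check WAL directory usage
SELECT count(*) AS files, pg_size_pretty(sum(size)) AS total FROM pg_ls_waldir();

-- Force checkpoint to recycle WAL (only segments no slot or archiver still needs)
CHECKPOINT;

-- Check oldest WAL position still required by a slot
SELECT slot_name, restart_lsn FROM pg_replication_slots ORDER BY restart_lsn LIMIT 1;
```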
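If a checkpoint frees nothing, it helps to know whether the archiver is the one holding segments back. PostgreSQL marks each segment that is waiting to be archived with a `.ready` file in `pg_wal/archive_status`, so counting those files gives the backlog. A small sketch; `count_ready_segments` is a hypothetical helper name:

```bash
#!/bin/bash
# Hypothetical helper: count WAL segments still waiting to be archived,
# i.e. the *.ready marker files in the given archive_status directory.
count_ready_segments() {
    status_dir=$1
    find "$status_dir" -maxdepth 1 -type f -name '*.ready' | wc -l | tr -d ' '
}

# Example (read-only; run as a user that can read the data directory):
# count_ready_segments "$PGDATA/pg_wal/archive_status"
```

A count that keeps growing points at the archive command; a count near zero with a large `pg_wal` points at a replication slot instead.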

### 5. **Adjust Configuration**

In `postgresql.conf`:
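```ini
# Force a WAL segment switch so quiet periods still reach the archive
archive_timeout = 300 # Force WAL switch every 5 minutes
wal_sender_timeout = 60s # Drop replication connections inactive this long

# Soft limit between checkpoints; slots and failed archiving can push WAL past it
max_wal_size = 1GB
min_wal_size = 80MB

# Minimum WAL kept for standbys that do not use a slot
wal_keep_size = 1024MB
```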

### 6. **Emergency Recovery**

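If the WAL directory is full and PostgreSQL won’t start:

```bash
# Temporarily disable archiving (never delete files from pg_wal by hand)
echo "archive_mode = off" >> "$PGDATA/postgresql.auto.conf"

# Start PostgreSQL
pg_ctl -D "$PGDATA" start

# Then drop or repair stuck slots, fix archive_command, re-enable archiving,
# and take a new base backup because the archive now has a gap
```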

### 7. **Automated Monitoring Script**

Create a monitoring script (`check_wal_archive.sh`):
```bash
#!/bin/bash
# Check archive status (failed_count is cumulative since the last stats reset)
FAILED_COUNT=$(psql -U postgres -At -c "SELECT failed_count FROM pg_stat_archiver")
echo "Archive failures so far: $FAILED_COUNT"

# Check disk space
ARCHIVE_DIR="/path/to/archive"
WAL_DIR="$PGDATA/pg_wal"

for dir in "$ARCHIVE_DIR" "$WAL_DIR"; do
    if [ "$(df -P "$dir" | awk 'NR==2 {print $5}' | sed 's/%//')" -gt 90 ]; then
        echo "$dir is nearly full"
    fi
done

# Check for stuck slots
psql -U postgres -c "SELECT slot_name, active,
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
    FROM pg_replication_slots WHERE active = false;"
```
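Because `failed_count` only ever grows until the statistics are reset, alerting on its absolute value fires forever after a single failure. One possible approach is to compare it with the value seen on the previous run. In the sketch below, `archive_failures_increased` and the state file path are hypothetical; the first run treats the previous value as 0.

```bash
#!/bin/bash
# Hypothetical helper: succeed (exit 0) only if failed_count grew since the
# previous run. The last seen value is stored in a small state file.
archive_failures_increased() {
    current=$1
    state_file=$2
    previous=0
    if [ -f "$state_file" ]; then
        previous=$(cat "$state_file")
    fi
    echo "$current" > "$state_file"
    [ "$current" -gt "$previous" ]
}

# Example (state file location is an assumption):
# count=$(psql -U postgres -At -c "SELECT failed_count FROM pg_stat_archiver")
# if archive_failures_increased "$count" /var/tmp/wal_archive_failed_count; then
#     echo "New WAL archive failures detected"
# fi
```

After a statistics reset the counter drops, the comparison is false, and the state file simply picks up the new baseline.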

## Prevention Best Practices

1. **Monitor regularly:**
   - Set up alerts for `failed_count` in `pg_stat_archiver`
   - Monitor disk space in archive and WAL directories
   - Track replication lag

2. **Maintain replication slots:**
```sql
-- Regular maintenance query
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;
```

3. **Use archive_timeout carefully:**
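   - Don’t set it too low: every forced switch archives a full segment file, even if it is mostly empty
   - Don’t set it too high: changes made since the last switch are missing from the archive if the server is lost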

4. **Implement proper retention policy:**
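```bash
# Clean old archives (example)
# Only safe if a base backup newer than the WAL cutoff exists;
# otherwise older backups can no longer be restored.
find /path/to/archive -name "*.backup" -mtime +30 -delete
find /path/to/archive -type f ! -name "*.backup" -mtime +7 -delete
```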

5. **Consider using pgBackRest or Barman** for more robust WAL management.
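
For the retention policy in item 4, an alternative to deleting by file age is `pg_archivecleanup`, which removes every WAL file older than a named one. The sketch below picks the newest backup history file (`*.backup`) in the archive as that cut-off, assuming a single timeline so that names sort chronologically. `newest_backup_marker` is a hypothetical helper, and keeping only the WAL since the newest base backup is itself an assumption about your recovery needs.

```bash
#!/bin/bash
# Hypothetical helper: print the name of the newest backup history file
# (*.backup) in an archive directory, or nothing if there is none.
# Assumes one timeline, so lexical order matches chronological order.
newest_backup_marker() {
    archive_dir=$1
    find "$archive_dir" -maxdepth 1 -type f -name '*.backup' |
        sed 's|.*/||' | sort | tail -n 1
}

# Dry run: -n only lists the files that would be removed.
# marker=$(newest_backup_marker /path/to/archive)
# if [ -n "$marker" ]; then
#     pg_archivecleanup -n /path/to/archive "$marker"
# fi
```

Review the dry-run output before dropping `-n`, since older base backups become unrestorable once their WAL is gone.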

## When to Seek Help

If the issue persists:
1. Check PostgreSQL logs: `tail -f $PGDATA/log/postgresql-*.log`
2. Verify filesystem permissions
3. Ensure network connectivity (if archiving to remote location)
4. If a standby should archive WAL as well, set `archive_mode = always` on it; on a primary in normal operation it behaves the same as `on`

The key is balancing between replication slot retention and WAL archiving requirements. Regular monitoring and proper configuration adjustments will prevent most issues.
