• PostgreSQL for DevOps:生产环境部署与运维实战指南

    对于寻求强大、可靠且可扩展的开源数据库的 DevOps 工程师和开发人员来说,PostgreSQL 常常是首选。它不仅以其对 SQL 标准的严格遵循和丰富的功能集而闻名,更因其坚如磐石的稳定性和 ACID 特性而备受赞誉。

    本指南专为 DevOps 专业人员设计,旨在提供一个从零开始、逐步深入的实战手册。我们将涵盖从初始安装、关键配置、安全管理到备份恢复、高可用复制和性能调优的全过程。读完本文,你将拥有在生产环境中自信地部署、管理和维护 PostgreSQL 所需的核心知识。

    概览:深入理解 PostgreSQL 的核心机制

    在深入操作之前,理解 PostgreSQL 的工作原理至关重要,尤其是它的数据写入机制。这不仅能帮助你更好地配置和优化数据库,还能让你明白为什么它如此可靠。

    PostgreSQL 是一个关系型数据库,它保证了 ACID 特性 (原子性、一致性、隔离性和持久性) 。以下是它的内部工作原理:

    写数据过程

    INSERT statement
        ↓
    Write-Ahead Log (WAL) buffer
        ↓
    WAL synced to disk (this is the commitment point)
        ↓
    Data written to shared buffers (in-memory cache)
        ↓
    Background writer flushes to data files
        ↓
    Data persisted

    关键要点 :PostgreSQL 会先将所有更改写入事务日志(WAL),然后再更新实际的数据文件。这个机制确保了即使在数据库或服务器突然崩溃的情况下,也可以通过重放日志来恢复所有已提交的事务,不会丢失数据。这也是 PostgreSQL 可靠性高的核心原因之一。

    安装和初始设置:迈出第一步

    现在,让我们从基础开始,在你的系统上安装 PostgreSQL,并学习如何连接和操作它。我将从基础知识讲起,逐步深入到更高级的信息,确保你完全掌握!

    安装 PostgreSQL

    # Ubuntu/Debian
    sudo apt-get update
    sudo apt-get install postgresql postgresql-contrib postgresql-client
    # macOS
    brew install postgresql
    # Verify installation
    psql --version
    # Output: psql (PostgreSQL) 16.0
    # Check if running
    sudo systemctl status postgresql
    # Output: ● postgresql.service - PostgreSQL RDBMS
    #         Loaded: loaded (/lib/systemd/system/postgresql.service)
    #         Active: active (running)

    连接到 PostgreSQ

    # Connect as postgres superuser
    sudo -u postgres psql
    # Or if you have sudo access without password
    psql -U postgres -h localhost
    # Verify you're connected
    postgres=# \conninfo
    # Output: You are connected to database "postgres" as user "postgres" via socket in "/var/run/postgresql" at port 5432.

    PostgreSQL 基本命令

    # List all databases
    \l
    # Connect to database
    \c database_name
    # List all tables
    \dt
    # List all schemas
    \dn
    # Show table structure
    \d table_name
    # Get help
    \h CREATE TABLE
    # Exit
    \q

    掌握这些基本命令是与 PostgreSQL 交互的基础。花点时间熟悉它们,你会在后续的管理工作中事半功倍。

    初始配置:为生产环境调优 postgresql.conf

    PostgreSQL 的默认配置主要为开发和低资源环境设计,远不能满足生产环境的性能和可靠性要求。postgresql.conf文件是你的主控制面板,通过调整其中的参数,我们可以释放数据库的全部潜力。

    postgresql.conf 文件用于配置 PostgreSQL 的行为。请查找该

    # Find location
    sudo -u postgres psql -c "SHOW config_file"
    # Output: /etc/postgresql/16/main/postgresql.conf
    # View important settings
    sudo -u postgres psql -c "SHOW port"
    # Output: 5432
    sudo -u postgres psql -c "SHOW data_directory"
    # Output: /var/lib/postgresql/16/main

    生产的基本配置

    以下是一份针对生产环境的基础配置建议。请注意,最佳值取决于你的服务器硬件(特别是 RAM)和具体工作负载。这里的建议是一个可靠的起点。

    编辑 /etc/postgresql/16/main/postgresql.conf :

    # Network settings
    listen_addresses = '*'           # Listen on all interfaces (restrict in production with IP)
    port = 5432                      # Standard PostgreSQL port
    # Memory configuration
    shared_buffers = 256MB           # 25% of system RAM for dedicated server
    effective_cache_size = 1GB       # 50-75% of total RAM
    work_mem = 16MB                  # Per operation memory
    maintenance_work_mem = 64MB      # For maintenance operations
    # Checkpoint configuration (critical for performance)
    checkpoint_timeout = 15min       # Force checkpoint every 15 minutes
    checkpoint_completion_target = 0.9  # Spread checkpoint over 90% of interval
    max_wal_size = 4GB              # Size of WAL before forcing checkpoint
    min_wal_size = 1GB              # Don't delete WAL below this
    # Write-Ahead Logging (WAL) configuration
    wal_level = replica             # Enable replication (critical for backup/failover)
    max_wal_senders = 10            # Max replication connections
    wal_keep_size = 1GB             # Keep WAL for replication
    wal_compression = on            # Compress WAL
    # Connections
    max_connections = 200           # Max client connections
    superuser_reserved_connections = 10  # Reserve for superuser
    # Logging
    logging_collector = on          # Enable log file collection
    log_directory = 'log'           # Relative to data_directory
    log_filename = 'postgresql-%a.log'
    log_truncate_on_rotation = on   # Overwrite old logs
    # Log slow queries (anything over 1s)
    log_min_duration_statement = 1000  # Milliseconds
    log_statement = 'all'           # Log all statements (use 'mod' for less noise)
    log_connections = on
    log_disconnections = on
    # Performance
    synchronous_commit = on         # Wait for WAL write (safer but slower)
    fsync = on                      # Always fsync to disk (never disable in production)

    编辑完成后,需要让 PostgreSQL 加载新的配置。某些参数(如 shared_buffers)需要重启服务才能生效,而其他参数则可以通过重新加载配置文件来动态应用

    # Reload without restart (some settings require restart)
    sudo -u postgres psql -c "SELECT pg_reload_conf()"
    # Output: pg_reload_conf
    #         ---------------
    #         t
    # Or restart PostgreSQL (safer for critical changes)
    sudo systemctl restart postgresql

    配置完成后,务必验证关键参数是否已成功应用。

    # Verify settings applied
    sudo -u postgres psql -c "SHOW shared_buffers"
    # Output: 256MB

    用户和数据库管理:建立安全边界

    在生产环境中,直接使用 `postgres` 超级用户来连接应用程序是极其危险的做法。最佳实践是遵循“最小权限原则”,为每个应用程序或服务创建专用的数据库和用户,并仅授予其完成任务所必需的最小权限。

    创建数据库和用户

     Connect as superuser
    sudo -u postgres psql
    # Create database
    CREATE DATABASE production_app;
    # Create user with password
    CREATE USER app_user WITH ENCRYPTED PASSWORD 'secure_password_123';
    # Grant privileges to user
    GRANT CONNECT ON DATABASE production_app TO app_user;
    GRANT USAGE ON SCHEMA public TO app_user;
    GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO app_user;
    GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO app_user;
    # Give default privileges for future tables
    ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON TABLES TO app_user;
    ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL ON SEQUENCES TO app_user;
    # Verify permissions
    \du
    # Lists all roles and their attributes
    # Exit
    \q

    应用程序的连接字符串

    你的应用程序将使用连接字符串(也称为 DSN,Data Source Name)来连接到数据库。请确保将这些凭证作为机密信息妥善管理,例如使用环境变量或专门的秘密管理工具。

    # PostgreSQL connection string (DSN)
    postgresql://app_user:secure_password_123@localhost:5432/production_app
    # Test connection from command line
    psql postgresql://app_user:secure_password_123@localhost:5432/production_app -c "SELECT 1"
    # Output: ?column?
    #         --------
    #                1

    创建表和索引:构建高效的数据结构

    良好的数据库性能始于合理的表结构和高效的索引策略。一个精心设计的模式不仅可以确保数据完整性,还能在数据量增长时保持查询速度。

    创建符合最佳实践的模式

    -- Connect as app_user
    \c production_app app_user
    -- Create table
    CREATE TABLE users (
        id SERIAL PRIMARY KEY,
        email VARCHAR(255) NOT NULL,
        username VARCHAR(100) NOT NULL,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        is_active BOOLEAN DEFAULT true
    );
    -- Add indexes for common queries
    CREATE UNIQUE INDEX idx_users_email ON users(email);
    CREATE UNIQUE INDEX idx_users_username ON users(username);
    CREATE INDEX idx_users_created_at ON users(created_at);
    CREATE INDEX idx_users_is_active ON users(is_active);
    -- View table structure
    \d users
    -- View indexes
    \di users*
    -- Check index size
    SELECT 
        schemaname,
        tablename,
        indexname,
        pg_size_pretty(pg_relation_size(indexrelid)) as size
    FROM pg_indexes
    WHERE tablename = 'users'
    ORDER BY pg_relation_size(indexrelid) DESC;

    添加约束

    约束是数据库层面保证数据质量的重要工具。除了主键和唯一索引,我们还可以使用 CHECK 约束和外键来强制执行业务规则。

    -- Add constraint to ensure valid email
    ALTER TABLE users 
    ADD CONSTRAINT check_valid_email 
    CHECK (email ~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}$');
    -- Add foreign key to another table
    CREATE TABLE posts (
        id SERIAL PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
        title VARCHAR(255) NOT NULL,
        content TEXT,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    -- View constraints
    \d posts

    监控所需的基本查询:洞察数据库状态

    作为 DevOps 工程师,你需要像飞行员关心仪表盘一样关注数据库的健康状况。主动监控可以帮助你在小问题演变成大故障之前发现并解决它们。以下是一些你需要定期执行的关键查询。

    检查数据库健康状况

    -- Current connections
    SELECT 
        datname,
        count(*) as connections,
        max(extract(epoch from (now() - query_start))) as longest_query_seconds
    FROM pg_stat_activity
    GROUP BY datname
    ORDER BY connections DESC;
    -- Long-running queries (over 5 minutes)
    SELECT 
        pid,
        usename,
        query_start,
        extract(epoch from (now() - query_start)) as duration_seconds,
        query
    FROM pg_stat_activity
    WHERE query_start < now() - interval '5 minutes'
    ORDER BY query_start;
    -- Kill long-running query (DANGER: use carefully)
    SELECT pg_terminate_backend(pid);

    检查数据库大小

    -- Total database size
    SELECT 
        datname,
        pg_size_pretty(pg_database_size(datname)) as size
    FROM pg_database
    WHERE datname NOT IN ('template0', 'template1')
    ORDER BY pg_database_size(datname) DESC;
    -- Size per table
    SELECT 
        schemaname,
        tablename,
        pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
    FROM pg_tables
    WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
    ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
    LIMIT 20;
    -- Unused indexes (wasting space and slowing writes)
    SELECT 
        schemaname,
        tablename,
        indexname,
        idx_scan,
        pg_size_pretty(pg_relation_size(indexrelid)) as size
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0
    ORDER BY pg_relation_size(indexrelid) DESC;

    检查慢查询

    慢查询是数据库性能的头号杀手。配置 PostgreSQL 记录执行时间超过特定阈值的查询,是定位和优化性能瓶颈的第一步。

    首先,请确保已配置日志记录:

    # Check log file location
    sudo -u postgres psql -c "SHOW log_directory"
    # Output: log
    # View recent logs
    sudo tail -f /var/lib/postgresql/16/main/log/postgresql-*.log | grep "duration:"
    # Output example:
    # LOG:  duration: 2534.123 ms  statement: SELECT * FROM large_table;

    pg_stat_statements 是一个非常有用的扩展,它能跟踪服务器上执行的所有 SQL 语句的统计信息,让你轻松找出最耗时、最频繁或 I/O 最多的查询。

    -- Queries taking longest (requires pg_stat_statements extension)
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
    SELECT 
        query,
        calls,
        total_time,
        mean_time,
        max_time
    FROM pg_stat_statements
    ORDER BY mean_time DESC
    LIMIT 10;
    -- Clear stats
    SELECT pg_stat_statements_reset();

    分析查询性能

    当你发现一个慢查询后,EXPLAIN 命令是你最好的朋友。它能揭示 PostgreSQL 的查询优化器为执行该查询所选择的计划。通过分析这个计划,你可以判断是否缺少索引、是否进行了全表扫描,或者连接顺序是否最优。

    EXPLAIN 显示了查询计划,但不会实际执行查询。而 EXPLAIN ANALYZE 则会执行查询并返回带有实际执行时间和行数的详细计划,这对于诊断问题更为精确。

    -- EXPLAIN shows query plan WITHOUT executing
    EXPLAIN SELECT * FROM users WHERE email = 'test@example.com';
    -- Output:
    -- QUERY PLAN
    -- ─────────────────────────────────────────────
    --  Index Scan using idx_users_email on users
    --    Index Cond: (email = 'test@example.com')
    -- EXPLAIN ANALYZE executes and shows actual performance
    EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'test@example.com';
    -- Output includes actual timing and row counts
    -- ─────────────────────────────────────────────
    --  Index Scan using idx_users_email on users
    --    Index Cond: (email = 'test@example.com')
    --    Planning Time: 0.123 ms
    --    Execution Time: 0.456 ms
    -- Complex query analysis
    EXPLAIN (ANALYZE, BUFFERS) 
    SELECT u.username, COUNT(p.id) as posts
    FROM users u
    LEFT JOIN posts p ON u.id = p.user_id
    GROUP BY u.id
    ORDER BY posts DESC;

    预写日志(WAL):PostgreSQL 的安全网

    现在我们来深入了解一下之前提到的预写日志(Write-Ahead Log, WAL)。WAL 是 PostgreSQL 实现数据持久性和可靠性的核心机制,同时也是实现备份和复制等高级功能的基础。

    WAL 是 PostgreSQL 的事务日志,它保证了 PostgreSQL 的可靠性。

    理解 WAL

    # Check WAL directory
    ls -lah /var/lib/postgresql/16/main/pg_wal/
    # Output:
    # -rw------- 1 postgres postgres 16M Nov 15 10:23 000000010000000000000001
    # -rw------- 1 postgres postgres 16M Nov 15 10:25 000000010000000000000002
    # -rw------- 1 postgres postgres 16M Nov 15 10:27 000000010000000000000003
    # Each WAL file is 16MB
    # Files are named sequentially
    # Older files are deleted after checkpoint (unless archiving is enabled)

    配置 WAL 归档(用于备份)

    默认情况下,PostgreSQL 在完成检查点后会回收旧的 WAL 文件。但为了实现时间点恢复(Point-in-Time Recovery, PITR),我们需要将这些 WAL 文件在被回收前归档到安全的位置

    # Create archive directory
    sudo mkdir -p /backup/wal_archive
    sudo chown postgres:postgres /backup/wal_archive
    sudo chmod 700 /backup/wal_archive
    # Edit postgresql.conf
    sudo nano /etc/postgresql/16/main/postgresql.conf
    # Add these lines:
    # wal_level = replica
    # archive_mode = on
    # archive_command = 'cp %p /backup/wal_archive/%f'
    # archive_timeout = 300  # Force archiving every 5 minutes

    配置完成后,PostgreSQL 将在每次 WAL 文件写满时,或在 archive_timeout 设定的时间内,调用 archive_command 将 WAL 段复制到指定目录。

    编辑后:

    # Reload configuration
    sudo -u postgres psql -c "SELECT pg_reload_conf()"
    # Force a checkpoint to start archiving
    sudo -u postgres psql -c "CHECKPOINT"
    # Verify WAL files are being archived
    ls -la /backup/wal_archive/
    # Output: WAL files copied here

    监测 WAL

    -- Check WAL level
    SHOW wal_level;
    -- Output: replica
    -- Check WAL stats
    SELECT 
        count(*) as wal_files,
        pg_size_pretty(sum(size)) as total_size
    FROM (
        SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')::bigint as size
    ) as t;
    -- Current WAL position
    SELECT pg_current_wal_lsn();
    -- Output: 0/17AB3F0

    备份策略:你的终极保障

    如果说 WAL 是安全网,那么备份就是你的降落伞。在数据丢失或损坏的灾难性场景下,一个可靠且经过测试的备份是让你能够恢复业务的唯一希望。为避免最糟糕的情况发生,制定备份策略是必要的。以下是一些我们可以采用的方法。

    策略 一:pg_dump(最适合中小型数据库)

    pg_dump 是一个简单而强大的工具,用于创建数据库的逻辑备份。它生成一个包含 SQL 命令或归档格式的文件,可以用于在任何地方重建数据库。它非常适合日常备份、迁移和开发环境的搭建。

    # Simple backup of one database
    pg_dump -U postgres -d production_app -F c -f backup.dump
    # Backup with compression
    pg_dump -U postgres -d production_app \
        -F c \
        -f production_app_$(date +%Y%m%d_%H%M%S).dump
    # Backup all databases
    pg_dumpall -U postgres -f all_databases.sql
    # Custom format (recommended for large DBs)
    pg_dump -U postgres -d production_app \
        -F c \
        --verbose \
        -f backup.dump
    # Backup with jobs (parallel, faster)
    pg_dump -U postgres -d production_app \
        -F d \
        -j 4 \
        -f backup_directory/
    # Show backup size
    ls -lh backup.dump

    策略二:从备份恢复

    拥有备份只是成功了一半,能够快速准确地恢复才是关键。pg_restore 是与 pg_dump 配套的恢复工具。

    # Restore from custom format dump
    pg_restore -U postgres -d production_app -F c backup.dump
    # List what's in the dump before restoring
    pg_restore -l backup.dump | head -20
    # Restore specific table
    pg_restore -U postgres -d production_app -F c \
        -t users backup.dump
    # Restore with progress
    pg_restore -U postgres -d production_app \
        -F c -j 4 --verbose backup.dump

    策略 三:持续归档 + 时间点恢复(PITR)

    对于关键的生产数据库,仅有每日备份是不够的。如果数据库在两次备份之间发生故障,你会丢失这期间的所有数据。时间点恢复(Point-in-Time Recovery, PITR)结合了完整的基础备份和持续归档的 WAL 日志,允许你将数据库恢复到故障前一刻的任意时间点,最大限度地减少数据丢失。

    这是最可靠的备份策略:

    # 1. Full backup
    pg_basebackup -h localhost -U postgres -D /backup/base_backup -Fp -Pv
    # 2. WAL archiving runs continuously (configured in postgresql.conf)
    # 3. To restore to specific point in time:
    # Stop PostgreSQL
    sudo systemctl stop postgresql
    # Move current data directory
    sudo mv /var/lib/postgresql/16/main /var/lib/postgresql/16/main.old
    # Restore base backup
    sudo cp -r /backup/base_backup /var/lib/postgresql/16/main
    sudo chown -R postgres:postgres /var/lib/postgresql/16/main
    # Create recovery.signal to enable recovery
    sudo touch /var/lib/postgresql/16/main/recovery.signal
    # Configure recovery settings
    sudo nano /var/lib/postgresql/16/main/postgresql.conf
    # Add:
    # restore_command = 'cp /backup/wal_archive/%f %p'
    # recovery_target_timeline = 'latest'
    # recovery_target_time = '2024-01-15 14:30:00'
    # Start PostgreSQL (it will replay WAL)
    sudo systemctl start postgresql
    # Verify recovery
    sudo -u postgres psql -c "SELECT now()"

    策略四:自动备份脚本

    手动执行备份容易出错且不可持续。在生产环境中,所有备份任务都应该自动化。下面是一个简单的 bash 脚本示例,用于执行每日备份并清理旧文件。

    #!/bin/bash
    # backup.sh
    BACKUP_DIR="/backup/postgresql"
    DB_USER="postgres"
    RETENTION_DAYS=30
    # Create backup directory
    mkdir -p $BACKUP_DIR
    # Full backup daily at 2 AM
    BACKUP_FILE="$BACKUP_DIR/db_$(date +%Y%m%d_%H%M%S).dump"
    echo "Starting backup..."
    pg_dump -U $DB_USER -d production_app -F c -f $BACKUP_FILE
    if [ $? -eq 0 ]; then
        echo "✓ Backup successful: $BACKUP_FILE"
    
        # Verify backup
        pg_restore -U $DB_USER -d production_app -F c -l $BACKUP_FILE > /dev/null
        if [ $? -eq 0 ]; then
            echo "✓ Backup verified"
        fi
    else
        echo "✗ Backup failed"
        exit 1
    fi
    # Remove old backups (older than RETENTION_DAYS)
    find $BACKUP_DIR -type f -name "*.dump" -mtime +$RETENTION_DAYS -delete
    echo "✓ Cleanup completed"
    # Optional: Copy to remote storage
    # scp $BACKUP_FILE remote_server:/remote/backup/

    将此脚本配置到 cron 中,即可实现无人值守的自动化备份。

    安排一下:

    # Add to crontab
    crontab -e
    # Add line (runs daily at 2 AM)
    0 2 * * * /home/postgres/backup.sh >> /var/log/pg_backup.log 2>&1
    # List cron jobs
    crontab -l

    如何设置流复制:实现高可用与读扩展

    流复制是 PostgreSQL 内置的高可用和负载均衡解决方案。它通过将主服务器上的 WAL 实时传输到一个或多个副本服务器来实现。这不仅提供了热备份(hot standby)以实现快速故障转移,还可以将读请求分流到副本,从而扩展读取能力。

    主服务器(生产环境):

    # Edit postgresql.conf
    sudo nano /etc/postgresql/16/main/postgresql.conf
    # Add:
    wal_level = replica
    max_wal_senders = 10
    wal_keep_size = 1GB
    hot_standby = on
    # Edit pg_hba.conf for replication user
    sudo nano /etc/postgresql/16/main/pg_hba.conf
    # Add (allow replica connection):
    host    replication    replica_user    192.168.1.0/24    md5
    # Create replication user
    sudo -u postgres psql -c "CREATE USER replica_user WITH REPLICATION ENCRYPTED PASSWORD 'replica_password';"
    # Restart
    sudo systemctl restart postgresql

    副本服务器(备用):

    # Stop replica if running
    sudo systemctl stop postgresql
    # Clear old data directory
    sudo rm -rf /var/lib/postgresql/16/main/*
    # Take base backup from primary
    pg_basebackup -h 192.168.1.100 -U replica_user -D /var/lib/postgresql/16/main -Fp -Pv
    # Create standby.signal (tells PostgreSQL this is standby)
    sudo touch /var/lib/postgresql/16/main/standby.signal
    # Configure replica postgresql.conf
    sudo nano /var/lib/postgresql/16/main/postgresql.conf
    # Add:
    primary_conninfo = 'host=192.168.1.100 port=5432 user=replica_user password=replica_password'
    hot_standby = on
    # Set correct permissions
    sudo chown -R postgres:postgres /var/lib/postgresql/16/main
    sudo chmod 700 /var/lib/postgresql/16/main
    # Start replica
    sudo systemctl start postgresql
    # Verify replication is working
    sudo -u postgres psql -c "SELECT * FROM pg_stat_replication;"
    # On replica:
    sudo -u postgres psql -c "SELECT * FROM pg_last_wal_receive_lsn();"

    监控复制状态

    设置好复制后,持续监控其健康状况至关重要。你需要确保副本没有远远落后于主库,否则在故障转移时可能会丢失数据。

    -- On Primary: Check replica lag
    SELECT 
        client_addr,
        state,
        sent_lsn,
        write_lsn,
        flush_lsn,
        replay_lsn,
        sync_state
    FROM pg_stat_replication;
    -- On Replica: Check replication status
    SELECT 
        pg_last_wal_receive_lsn() as receive_lsn,
        pg_last_wal_replay_lsn() as replay_lsn,
        pg_is_wal_replay_paused() as is_paused;
    -- Calculate replication lag (in bytes)
    SELECT 
        pg_wal_lsn_diff(sent_lsn, replay_lsn) as lag_bytes
    FROM pg_stat_replication;

    将副本提升为主副本:故障转移

    如果主服务器发生故障,你需要手动或通过自动化工具将一个健康的副本“提升”为新的主服务器,以恢复写服务。

    # On replica
    sudo -u postgres psql -c "SELECT pg_promote();"
    # Verify it's now primary
    sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
    # Output: f (false = it's primary now)

    一些性能调优和日志分析

    数据库性能调优是一个持续的过程,而不是一次性的任务。随着应用程序负载的变化,新的性能瓶颈可能会出现。定期分析慢查询日志和数据库统计信息是保持系统健康的关键。

    # Enable query logging
    sudo -u postgres psql -c "ALTER SYSTEM SET log_min_duration_statement = 1000;"
    sudo -u postgres psql -c "SELECT pg_reload_conf();"
    # View logs
    tail -f /var/lib/postgresql/16/main/log/postgresql-*.log | grep "duration:"
    # Parse and analyze
    grep "duration:" /var/lib/postgresql/16/main/log/postgresql-*.log | \
        awk '{print $NF}' | sort -rn | head -20

    如有需要,我们可以优化常见问题:

    autovacuum 是 PostgreSQL 的一个内置进程,用于自动执行 VACUUM 和 ANALYZE 命令。VACUUM 回收已删除或已更新行所占用的空间,防止表膨胀;ANALYZE 则更新统计信息,以帮助查询规划器做出更好的决策。确保它已启用并正确配置。

    -- 1. Add missing indexes
    -- Query to find queries without indexes:
    SELECT query FROM pg_stat_statements WHERE query LIKE '%WHERE%' LIMIT 5;
    -- Add index after analyzing with EXPLAIN
    CREATE INDEX idx_posts_user_id_created_at ON posts(user_id, created_at DESC);
    -- 2. Analyze table statistics
    ANALYZE users;
    ANALYZE posts;
    -- 3. Vacuum (reclaim space, update statistics)
    VACUUM ANALYZE;
    -- 4. Check for table bloat
    SELECT 
        schemaname,
        tablename,
        pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as total_size,
        pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename) - pg_relation_size(schemaname||'.'||tablename)) as indexes_size
    FROM pg_tables
    ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
    LIMIT 10;
    -- 5. Enable auto-vacuum (maintenance)
    SHOW autovacuum;
    ALTER SYSTEM SET autovacuum = on;
    SELECT pg_reload_conf();

    监控和警报:建立你的仪表盘

    仅仅记录指标是不够的,你需要建立一个自动化的监控和警报系统,以便在关键指标超出正常阈值时立即得到通知。这能让你在用户报告问题之前就采取行动。

    # 1. Check active connections
    sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"
    # 2. Check cache hit ratio (should be > 99%)
    sudo -u postgres psql << EOF
    SELECT 
        sum(heap_blks_read) as heap_read,
        sum(heap_blks_hit) as heap_hit,
        sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
    FROM pg_statio_user_tables;
    EOF
    # 3. Check disk usage
    df -h /var/lib/postgresql
    # 4. Check transaction ID age (prevent wraparound)
    sudo -u postgres psql -c "SELECT datname, age(datfrozenxid) FROM pg_database;"
    # 5. Check replication lag (if using replication)
    sudo -u postgres psql -c "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')/1024/1024 as lag_mb;"

    创建监控警报脚本

    下面这个简单的 shell 脚本可以作为你自定义监控系统的起点。它可以检查连接数、缓存命中率和磁盘使用情况,并在超出阈值时打印警告或严重警报。你可以将其与 Nagios、Zabbix 等监控系统或简单的邮件/Slack 通知集成。

    #!/bin/bash
    # pg_monitor.sh
    DB_USER="postgres"
    DB_NAME="production_app"
    WARNING_CONNECTIONS=150
    CRITICAL_CONNECTIONS=180
    # Check connections
    CONNECTIONS=$(sudo -u $DB_USER psql -d $DB_NAME -t -c "SELECT count(*) FROM pg_stat_activity;")
    if [ "$CONNECTIONS" -gt "$CRITICAL_CONNECTIONS" ]; then
        echo "CRITICAL: PostgreSQL connections: $CONNECTIONS/$MAX_CONNECTIONS"
        # Send alert (email, PagerDuty, Slack)
    elif [ "$CONNECTIONS" -gt "$WARNING_CONNECTIONS" ]; then
        echo "WARNING: PostgreSQL connections: $CONNECTIONS"
    fi
    # Check cache hit ratio
    CACHE_HIT=$(sudo -u $DB_USER psql -d $DB_NAME -t -c "SELECT round(sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) * 100, 2) FROM pg_statio_user_tables;")
    if (( $(echo "$CACHE_HIT < 95" | bc -l) )); then
        echo "WARNING: Low cache hit ratio: $CACHE_HIT%"
    fi
    # Check disk usage
    DISK_USED=$(df /var/lib/postgresql | awk 'NR==2 {print $5}' | sed 's/%//')
    if [ "$DISK_USED" -gt 80 ]; then
        echo "CRITICAL: Disk usage: $DISK_USED%"
    fi

    我使用 crontab 来安排监控:

    # Add to crontab (check every 5 minutes)
    */5 * * * * /home/postgres/pg_monitor.sh >> /var/log/pg_monitor.log 2>&1

    灾难恢复检查清单:为最坏情况做准备

    灾难恢复计划的价值在于演练,而不仅仅是文档。当真正的灾难来临时,你不会有时间去思考,只能依赖肌肉记忆和清晰的步骤。

    灾难发生之前

    # 1. Verify backups work
    pg_restore -U postgres -d test_db -F c /backup/database.dump
    # 2. Document recovery procedure
    # Create recovery_procedure.txt with all steps
    # 3. Test failover
    # Promote replica to primary
    # 4. Document connection strings
    # Primary: postgresql://user:pass@primary-ip:5432/db
    # Replica: postgresql://user:pass@replica-ip:5432/db
    # 5. Monitor backup age
    find /backup -name "*.dump" -type f -mtime +1 -exec ls -la {} \;

    当灾难来袭

    # 1. Assess the damage
    sudo systemctl status postgresql
    sudo journalctl -n 50 -u postgresql
    # 2. Check data directory integrity
    sudo -u postgres pg_controldata /var/lib/postgresql/16/main | head -20
    # 3. Try recovery first (don't restore unless necessary)
    sudo systemctl restart postgresql
    sudo tail -f /var/lib/postgresql/16/main/log/postgresql-*.log
    # 4. If corrupted, restore from backup
    # Follow "Strategy 3: Continuous Archiving" section above
    # 5. Verify data after recovery
    sudo -u postgres psql -c "SELECT count(*) FROM users;"

    常用命令快速参考

    这是一个常用 PostgreSQL 命令的备忘单,可以作为你日常工作的快速参考。

    # Connection & Authentication
    psql -U postgres -h localhost -d postgres
    psql postgresql://user:password@host:5432/dbname
    # Database Operations
    \l                    # List databases
    \c dbname            # Connect to database
    CREATE DATABASE name;
    DROP DATABASE name;
    # Table Operations
    \dt                  # List tables
    \d tablename         # Show table structure
    \di                  # List indexes
    \dn                  # List schemas
    # User Management
    \du                  # List users
    CREATE USER name WITH PASSWORD 'pass';
    DROP USER name;
    GRANT ALL ON DATABASE db TO user;
    # Backup & Restore
    pg_dump -U user -d db -F c -f backup.dump
    pg_restore -U user -d db -F c backup.dump
    pg_basebackup -U user -D /path -Fp -Pv
    # Monitoring
    SELECT * FROM pg_stat_activity;
    SELECT datname, pg_size_pretty(pg_database_size(datname)) FROM pg_database;
    EXPLAIN ANALYZE SELECT ...;
    # Performance
    VACUUM ANALYZE;
    REINDEX TABLE tablename;
    ALTER TABLE tablename SET (fillfactor=80);
    # Replication
    SELECT * FROM pg_stat_replication;
    SELECT pg_promote();

    生产部署检查清单

    在将新的 PostgreSQL 数据库投入生产之前,请使用此清单进行最后一次检查,确保没有遗漏任何关键步骤。

    ☐ postgresql.conf tuned (shared_buffers, wal_level, etc.)
    ☐ pg_hba.conf restricted (only necessary IPs)
    ☐ Users created with appropriate privileges
    ☐ Backup strategy implemented and tested
    ☐ WAL archiving configured
    ☐ Monitoring and alerting in place
    ☐ Replication configured (if high availability needed)
    ☐ Connection pooling configured (in application)
    ☐ Slow query logging enabled
    ☐ Regular VACUUM ANALYZE scheduled
    ☐ Cache hit ratio verified (>99%)
    ☐ Disk space monitored
    ☐ Recovery procedure documented
    ☐ Disaster recovery tested
    ☐ pg_stat_statements extension installed
    ☐ SSL/TLS configured for remote connections

    总结:DevOps 之旅

    通过本指南,我们一起研究了在生产环境中部署和运维 PostgreSQL 的关键方面。现在大家应该已经掌握了:

    • PostgreSQL 如何通过 WAL 可靠地写入数据。
    • 如何使用 pg_dump 和持续归档来安全地备份数据。
    • 如何利用时间点恢复(PITR)从灾难中恢复。
    • 如何通过流复制实现高可用和读扩展。
    • 如何通过监控、索引和调优来优化性能。

    请记住,这并非成为 DBA 的全部知识,但它为你提供了作为一名优秀的 DevOps 工程师所需的坚实基础,让你足以在生产环境中自信、安全地运行 PostgreSQL。持续学习和实践,你将成为团队中不可或缺的数据库专家。

    如有相关问题,请在文章后面给小编留言,小编安排作者第一时间和您联系,为您答疑解惑。

    «
以专业成就每一位客户,让企业IT只为效果和安全买单

以专业成就每一位客户,让企业IT只为效果和安全买单

在线咨询
连接中...