rustfs-io-core

Overview

rustfs-io-core is the core I/O scheduling module for RustFS, a distributed object storage system. It provides:

  • I/O Scheduler: Adaptive buffer size calculation and load management
  • Priority Queue: Request priority scheduling with starvation prevention
  • Backpressure Control: System overload protection with graceful degradation
  • Deadlock Detection: Wait-for graph based deadlock detection algorithm
  • Lock Optimizer: Adaptive spin lock optimization
  • Timeout Wrapper: Dynamic timeout calculation and operation progress tracking

Features

I/O Scheduler

Adaptive I/O scheduling with dynamic buffer size calculation based on file size, access pattern, and system load:

use rustfs_io_core::{calculate_optimal_buffer_size, IoLoadLevel, IoScheduler, IoSchedulerConfig};
use rustfs_io_core::io_profile::StorageMedia;

// Create scheduler
let config = IoSchedulerConfig {
    max_concurrent_reads: 64,
    base_buffer_size: 64 * 1024,  // 64 KB
    max_buffer_size: 1024 * 1024, // 1 MB
    ..Default::default()
};
let scheduler = IoScheduler::new(config);

// Calculate optimal buffer size
let buffer_size = calculate_optimal_buffer_size(
    10 * 1024 * 1024,  // 10 MB file
    64 * 1024,         // base buffer
    true,              // sequential access
    4,                 // concurrent requests
    StorageMedia::Ssd,
    IoLoadLevel::Low,
);

Priority Queue

Priority queue with starvation prevention:

use rustfs_io_core::{IoPriorityQueue, IoPriority, IoQueueStatus};

let queue = IoPriorityQueue::<()>::new(100);

// Enqueue request
let request_id = queue.enqueue(IoPriority::High, (), 1024);

// Dequeue request
if let Some((priority, data)) = queue.dequeue() {
    println!("Processing priority {:?} request", priority);
}

// Check queue status
let status = queue.status();
println!("High priority waiting: {}", status.high_priority_waiting);
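
The starvation-prevention idea can be sketched without the crate. The block below is a minimal illustration, not the crate's implementation: `SketchQueue`, `SketchPriority`, the three priority levels, and the `max_streak` rule are all assumptions made for the example. It serves the highest non-empty lane, but once higher-priority items have bypassed a waiting lower lane `max_streak` times in a row, the lowest waiting lane gets one turn.

```rust
use std::collections::VecDeque;

/// Hypothetical priority levels; the real `IoPriority` variants may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchPriority {
    High,
    Normal,
    Low,
}

/// Illustrative queue: one FIFO lane per priority plus a bypass counter.
pub struct SketchQueue<T> {
    lanes: [VecDeque<T>; 3], // index 0 = High, 1 = Normal, 2 = Low
    streak: usize,           // consecutive dequeues that bypassed a waiting lower lane
    max_streak: usize,       // assumed tuning knob, not a crate setting
}

impl<T> SketchQueue<T> {
    pub fn new(max_streak: usize) -> Self {
        Self {
            lanes: [VecDeque::new(), VecDeque::new(), VecDeque::new()],
            streak: 0,
            max_streak,
        }
    }

    pub fn enqueue(&mut self, priority: SketchPriority, item: T) {
        self.lanes[priority as usize].push_back(item);
    }

    pub fn dequeue(&mut self) -> Option<T> {
        let highest = self.lanes.iter().position(|lane| !lane.is_empty())?;
        let lowest = self.lanes.iter().rposition(|lane| !lane.is_empty())?;
        let pick = if lowest > highest && self.streak >= self.max_streak {
            // A lower lane has waited long enough: give it one turn.
            self.streak = 0;
            lowest
        } else {
            // Count the bypass only while a lower lane is actually waiting.
            self.streak = if lowest > highest { self.streak + 1 } else { 0 };
            highest
        };
        self.lanes[pick].pop_front()
    }
}
```

With `max_streak` set to 2, one low-priority item queued ahead of three high-priority items is served third rather than last.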

Backpressure Control

System overload protection:

use rustfs_io_core::{BackpressureMonitor, BackpressureState, BackpressureConfig};

let config = BackpressureConfig {
    high_watermark: 0.8,  // 80% triggers backpressure
    low_watermark: 0.5,   // 50% releases backpressure
    ..Default::default()
};
let monitor = BackpressureMonitor::new(config);

// Check state
match monitor.state() {
    BackpressureState::Normal => println!("System normal"),
    BackpressureState::Warning => println!("System warning"),
    BackpressureState::Critical => println!("System overloaded"),
}
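
The two watermarks form a hysteresis band: backpressure engages at the high watermark and is only released once utilization falls back to the low one, so the state does not flap around a single threshold. The sketch below illustrates that rule with a simplified two-state model. `SketchHysteresis` and `SketchState` are hypothetical names, and how the crate maps the watermarks onto its three `BackpressureState` variants is not shown here.

```rust
/// Simplified two-state model used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchState {
    Normal,
    Throttled,
}

/// Hypothetical hysteresis tracker driven by a utilization ratio in 0.0..=1.0.
pub struct SketchHysteresis {
    high: f64,
    low: f64,
    state: SketchState,
}

impl SketchHysteresis {
    pub fn new(high: f64, low: f64) -> Self {
        assert!(low < high, "low watermark must be below high watermark");
        Self { high, low, state: SketchState::Normal }
    }

    /// Feed the latest utilization sample and get the resulting state.
    pub fn update(&mut self, utilization: f64) -> SketchState {
        self.state = match self.state {
            // Engage only when the high watermark is reached.
            SketchState::Normal if utilization >= self.high => SketchState::Throttled,
            // Release only when utilization has dropped to the low watermark.
            SketchState::Throttled if utilization <= self.low => SketchState::Normal,
            // Inside the band, keep whatever state we were in.
            current => current,
        };
        self.state
    }
}
```

With watermarks 0.8 and 0.5, a sample of 0.6 leaves the tracker in whichever state it was already in.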

Deadlock Detection

Wait-for graph based deadlock detection:

use rustfs_io_core::{DeadlockDetector, LockType};

let detector = DeadlockDetector::with_defaults();

// Register locks
let lock1 = detector.register_lock(LockType::Mutex);
let lock2 = detector.register_lock(LockType::RwLockWrite);

// Record lock acquisition
detector.record_acquire(lock1, 1);  // Thread 1 acquires lock1
detector.record_wait(lock2, 1);     // Thread 1 waits for lock2

// Detect deadlock
if let Some(deadlock) = detector.detect_deadlock() {
    println!("Deadlock detected: {:?}", deadlock);
}
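
In a wait-for graph, an edge from thread A to thread B means A is waiting for a lock that B holds, and a deadlock corresponds to a cycle. The following sketch shows that idea with a depth-first search. It is a standalone illustration: `SketchWaitForGraph` and its methods are hypothetical and are not the `DeadlockDetector` API.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical wait-for graph keyed by thread id.
/// An edge `waiter -> holder` means `waiter` is blocked on a lock held by `holder`.
#[derive(Default)]
pub struct SketchWaitForGraph {
    edges: HashMap<u64, Vec<u64>>,
}

impl SketchWaitForGraph {
    pub fn add_wait(&mut self, waiter: u64, holder: u64) {
        self.edges.entry(waiter).or_default().push(holder);
    }

    /// Returns the thread ids on one cycle, if any cycle exists.
    pub fn find_cycle(&self) -> Option<Vec<u64>> {
        let mut done = HashSet::new();
        let mut starts: Vec<u64> = self.edges.keys().copied().collect();
        starts.sort_unstable(); // deterministic traversal order
        for start in starts {
            let mut path = Vec::new();
            if let Some(cycle) = self.dfs(start, &mut path, &mut done) {
                return Some(cycle);
            }
        }
        None
    }

    fn dfs(&self, node: u64, path: &mut Vec<u64>, done: &mut HashSet<u64>) -> Option<Vec<u64>> {
        // Reaching a node that is already on the current path closes a cycle.
        if let Some(pos) = path.iter().position(|&n| n == node) {
            return Some(path[pos..].to_vec());
        }
        // Fully explored nodes cannot lead to a new cycle.
        if done.contains(&node) {
            return None;
        }
        path.push(node);
        for &next in self.edges.get(&node).into_iter().flatten() {
            if let Some(cycle) = self.dfs(next, path, done) {
                return Some(cycle);
            }
        }
        path.pop();
        done.insert(node);
        None
    }
}
```

A chain such as 1 waits for 2, 2 waits for 3 has no cycle; adding 3 waits for 1 produces one.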

Lock Optimizer

Adaptive spin lock optimization:

use rustfs_io_core::{LockOptimizer, LockOptimizeConfig};

let optimizer = LockOptimizer::with_defaults();

// Record lock operations
optimizer.on_acquire();
// ... do work ...
optimizer.on_release(std::time::Duration::from_millis(10));

// View statistics
let stats = optimizer.stats();
println!("Locks acquired: {}", stats.total_acquired());
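
One way an optimizer could use the recorded hold times is to decide whether spinning is worthwhile: spin when locks are typically held for less time than a spin budget, otherwise block. The sketch below is an assumption about how such a heuristic might look, not a description of `LockOptimizer`. `SketchSpinAdvisor`, the 1/8 smoothing weight, and the `spin_budget` parameter are all made up for the example.

```rust
use std::time::Duration;

/// Hypothetical advisor that tracks a moving average of lock hold times.
#[derive(Debug, Default)]
pub struct SketchSpinAdvisor {
    avg_hold_nanos: u64,
    samples: u64,
}

impl SketchSpinAdvisor {
    /// Record how long a lock was held, using an exponential moving average
    /// with weight 1/8 for the newest sample (an arbitrary choice).
    pub fn on_release(&mut self, held: Duration) {
        let nanos = u64::try_from(held.as_nanos()).unwrap_or(u64::MAX);
        self.avg_hold_nanos = if self.samples == 0 {
            nanos
        } else {
            self.avg_hold_nanos - self.avg_hold_nanos / 8 + nanos / 8
        };
        self.samples = self.samples.saturating_add(1);
    }

    /// Spin only if locks are usually released within the spin budget.
    /// With no samples yet, spinning is allowed optimistically.
    pub fn should_spin(&self, spin_budget: Duration) -> bool {
        self.samples == 0 || u128::from(self.avg_hold_nanos) <= spin_budget.as_nanos()
    }
}
```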

Timeout Wrapper

Dynamic timeout calculation:

use rustfs_io_core::{RequestTimeoutWrapper, TimeoutConfig};
use std::time::Duration;

let config = TimeoutConfig {
    base_timeout: Duration::from_secs(5),
    timeout_per_mb: Duration::from_millis(100),
    max_timeout: Duration::from_secs(300),
    ..Default::default()
};
let wrapper = RequestTimeoutWrapper::new(config);

// Calculate operation timeout
let timeout = wrapper.calculate_timeout(10 * 1024 * 1024);  // 10 MB
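
The field names suggest a size-proportional timeout. The sketch below assumes the formula is the base timeout plus the per-MB allowance times the size in MiB (rounded up), capped at the maximum. That formula, `SketchTimeoutConfig`, and `sketch_timeout` are assumptions for illustration; the crate's actual calculation may differ.

```rust
use std::time::Duration;

/// Hypothetical mirror of the fields shown above; not the crate's real type.
#[derive(Debug, Clone, Copy)]
pub struct SketchTimeoutConfig {
    pub base_timeout: Duration,
    pub timeout_per_mb: Duration,
    pub max_timeout: Duration,
}

/// Assumed formula: base + per-MiB allowance * ceil(size in MiB), capped at max.
pub fn sketch_timeout(cfg: &SketchTimeoutConfig, size_bytes: u64) -> Duration {
    const MIB: u64 = 1024 * 1024;
    let mib = size_bytes.div_ceil(MIB);
    // Saturating arithmetic keeps very large sizes from overflowing.
    let factor = u32::try_from(mib).unwrap_or(u32::MAX);
    let scaled = cfg.timeout_per_mb.saturating_mul(factor);
    cfg.base_timeout.saturating_add(scaled).min(cfg.max_timeout)
}
```

Under this assumed formula, the configuration above (5 s base, 100 ms per MB, 300 s cap) gives 6 seconds for a 10 MB operation.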

Buffer Size Calculation

Multiple buffer size calculation functions are provided:

use rustfs_io_core::{
    get_concurrency_aware_buffer_size,
    get_advanced_buffer_size,
    get_buffer_size_for_media,
    calculate_optimal_buffer_size,
    IoLoadLevel,
    KI_B, MI_B,
};
use rustfs_io_core::io_profile::StorageMedia;

// Basic calculation
let size1 = get_concurrency_aware_buffer_size(1024 * 1024, 64 * 1024);

// Advanced calculation (considering access pattern)
let size2 = get_advanced_buffer_size(10 * 1024 * 1024, 64 * 1024, true);

// Media type optimization
let size3 = get_buffer_size_for_media(64 * 1024, StorageMedia::Ssd);

// Comprehensive calculation
let size4 = calculate_optimal_buffer_size(
    100 * 1024 * 1024,  // 100 MB file
    64 * 1024,          // base buffer
    true,               // sequential access
    4,                  // concurrent requests
    StorageMedia::Nvme,
    IoLoadLevel::Low,
);
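
To give a feel for what a concurrency-aware calculation might do, here is a minimal standalone sketch. It is not the crate's algorithm: `sketch_buffer_size`, the concurrency tiers, the percentages, and the 4 KiB to 1 MiB bounds are all assumed values chosen for the example. The idea is to shrink the base buffer as concurrency rises, avoid buffers larger than the file, and keep the result within fixed bounds.

```rust
const KIB: usize = 1024;

/// Hypothetical sizing rule: scale the base buffer down under concurrency,
/// never exceed the file size, and stay within [4 KiB, 1 MiB].
pub fn sketch_buffer_size(file_size: usize, base: usize, concurrent: usize) -> usize {
    // Assumed tiers, expressed in percent of the base buffer.
    let percent = match concurrent {
        0..=2 => 100,
        3..=8 => 70,
        _ => 40,
    };
    let scaled = base.saturating_mul(percent) / 100;
    // A buffer larger than the file only wastes memory.
    let capped = scaled.min(file_size.max(1));
    capped.clamp(4 * KIB, 1024 * KIB)
}
```

Integer percentages are used instead of floating-point factors so the result is exact and easy to test.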

Configuration

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| RUSTFS_MAX_CONCURRENT_READS | Max concurrent reads | 64 |
| RUSTFS_BASE_BUFFER_SIZE | Base buffer size (bytes) | 65536 |
| RUSTFS_MAX_BUFFER_SIZE | Max buffer size (bytes) | 1048576 |
| RUSTFS_IO_TIMEOUT_SECS | I/O timeout (seconds) | 30 |
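
A minimal sketch of reading these variables with their documented defaults follows. The variable names and default values come from the table; everything else is an assumption, including the `SketchEnvConfig` type and the choice to fall back to the default when a value is unset or malformed. How the crate itself loads and validates these variables is not shown here.

```rust
use std::env;
use std::str::FromStr;

/// Hypothetical container for the four documented settings.
#[derive(Debug, PartialEq, Eq)]
pub struct SketchEnvConfig {
    pub max_concurrent_reads: usize,
    pub base_buffer_size: usize,
    pub max_buffer_size: usize,
    pub io_timeout_secs: u64,
}

/// Parse one raw value, falling back to `default` when unset or malformed.
fn parse_or<T: FromStr>(raw: Option<String>, default: T) -> T {
    raw.and_then(|v| v.trim().parse().ok()).unwrap_or(default)
}

/// Build the config from any lookup function, which keeps the logic testable
/// without touching the process environment.
pub fn sketch_config_from(lookup: impl Fn(&str) -> Option<String>) -> SketchEnvConfig {
    SketchEnvConfig {
        max_concurrent_reads: parse_or(lookup("RUSTFS_MAX_CONCURRENT_READS"), 64),
        base_buffer_size: parse_or(lookup("RUSTFS_BASE_BUFFER_SIZE"), 65536),
        max_buffer_size: parse_or(lookup("RUSTFS_MAX_BUFFER_SIZE"), 1048576),
        io_timeout_secs: parse_or(lookup("RUSTFS_IO_TIMEOUT_SECS"), 30),
    }
}

/// Convenience wrapper that reads the real process environment.
pub fn sketch_config_from_env() -> SketchEnvConfig {
    sketch_config_from(|name| env::var(name).ok())
}
```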

Code Configuration

use rustfs_io_core::IoSchedulerConfig;

let config = IoSchedulerConfig {
    max_concurrent_reads: 128,
    base_buffer_size: 128 * 1024,
    max_buffer_size: 4 * 1024 * 1024,
    high_priority_threshold: 64 * 1024,
    low_priority_threshold: 4 * 1024 * 1024,
    ..Default::default()
};

// Validate configuration
if let Err(e) = config.validate() {
    panic!("Invalid configuration: {}", e);
}
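
The kind of checks a `validate()` call might perform can be sketched as follows. The rules here are guesses based on the field names (non-zero concurrency, base buffer no larger than the maximum, high-priority threshold below the low-priority threshold). `SketchSchedulerConfig` and `sketch_validate` are hypothetical and do not reflect the crate's actual validation logic or error type.

```rust
/// Hypothetical mirror of the fields used in the example above.
#[derive(Debug, Clone)]
pub struct SketchSchedulerConfig {
    pub max_concurrent_reads: usize,
    pub base_buffer_size: usize,
    pub max_buffer_size: usize,
    pub high_priority_threshold: usize,
    pub low_priority_threshold: usize,
}

/// Assumed consistency rules; returns the first violation found.
pub fn sketch_validate(cfg: &SketchSchedulerConfig) -> Result<(), String> {
    if cfg.max_concurrent_reads == 0 {
        return Err("max_concurrent_reads must be greater than zero".to_string());
    }
    if cfg.base_buffer_size == 0 || cfg.base_buffer_size > cfg.max_buffer_size {
        return Err("base_buffer_size must be in 1..=max_buffer_size".to_string());
    }
    if cfg.high_priority_threshold >= cfg.low_priority_threshold {
        return Err("high_priority_threshold must be below low_priority_threshold".to_string());
    }
    Ok(())
}
```

Returning a `Result` rather than panicking lets the caller decide whether an invalid configuration is fatal.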

Module Structure

rustfs-io-core/
├── src/
│   ├── lib.rs                # Module entry
│   ├── config.rs             # Configuration types
│   ├── scheduler.rs          # I/O scheduler
│   ├── io_priority_queue.rs  # Priority queue
│   ├── backpressure.rs       # Backpressure control
│   ├── deadlock_detector.rs  # Deadlock detection
│   ├── lock_optimizer.rs     # Lock optimization
│   ├── timeout_wrapper.rs    # Timeout wrapper
│   └── io_profile.rs         # I/O profile
└── Cargo.toml

Testing

# Run all tests
cargo test --package rustfs-io-core

# Run specific tests
cargo test --package rustfs-io-core --lib scheduler

# Run benchmarks
cargo bench --package rustfs-io-core

Documentation

  • rustfs-io-metrics: Metrics collection and configuration
  • rustfs: Main storage service

License

Apache License 2.0