rustfs-object-capacity
rustfs-object-capacity is the core object-capacity statistics component in RustFS. It scans local data directories, maintains a capacity cache, triggers incremental refreshes after writes, and provides the admin layer with a used-capacity result that is as inexpensive and resilient as possible.
This crate is not meant to measure total filesystem capacity. Its job is to answer: "How many bytes are currently occupied by RustFS object data?" It makes practical tradeoffs between accuracy, freshness, and scan cost.
Core Responsibilities
- Scan one or more local data-disk roots and aggregate used bytes and file counts.
- Reduce scan cost on large directories with an "exact prefix + sampled overflow" strategy.
- Return usable degraded results when scans time out, traversal stalls, or some directories fail, instead of failing the entire request immediately.
- Maintain a global `HybridCapacityManager` cache with scheduled refresh, write-triggered refresh, foreground blocking refresh, and background refresh.
- Track which disks were affected by writes so the system can refresh only the dirty subset after a complete per-disk cache is available.
- Emit capacity-related metrics for observability and benchmarks.
Module Layout
- `src/lib.rs`: Re-exports `scan_used_capacity_disks`, `CapacityDiskRef`, and `CapacityScanSummary`.
- `src/types.rs`: Defines scan input/output types, including `CapacityDiskRef`, the internal `CapacityScanResult`, and the public `CapacityScanSummary`.
- `src/scan.rs`: Implements directory traversal, sampled estimation, timeout/stall detection, multi-disk concurrent scans, and conversion into `CapacityUpdate`.
- `src/capacity_manager.rs`: Owns caching, write-frequency tracking, singleflight refresh coordination, background tasks, dirty-subset merge logic, and the global singleton manager.
- `src/capacity_scope.rs`: Tracks "which disks were touched by a write", including token-bound local scopes and the global dirty-scope registry.
- `benches/capacity_scan.rs`: Exercises the public scan API with benchmark scenarios for exact, sampled, and multi-disk scans.
Data Model
CapacityDiskRef
```rust
pub struct CapacityDiskRef {
    pub endpoint: String,
    pub drive_path: String,
}
```
This is the minimal unit required for a scan:
- `endpoint` is used to distinguish metrics and logs.
- `drive_path` is the local disk root path.
CapacityScanSummary
```rust
pub struct CapacityScanSummary {
    pub used_bytes: u64,
    pub file_count: usize,
    pub sampled_count: usize,
    pub is_estimated: bool,
    pub had_partial_errors: bool,
    pub scan_duration: Duration,
}
```
Field meanings:
- `used_bytes`: the computed or estimated used capacity.
- `file_count`: the number of regular files traversed.
- `sampled_count`: the number of overflow files sampled after crossing the threshold.
- `is_estimated`: whether the result is estimated instead of exact.
- `had_partial_errors`: whether traversal encountered local errors while still producing a result.
- `scan_duration`: total scan duration.
Scan Algorithm
The directory scan lives in scan.rs::get_dir_size_async and works as follows:
- Wrap blocking directory traversal in `tokio::task::spawn_blocking` so the async runtime is not blocked.
- Walk the directory tree with `WalkDir` and count only regular files.
- If the file count stays below `DEFAULT_MAX_FILES_THRESHOLD` (default `200_000`), add every file size exactly.
- After crossing the threshold:
  - Keep the first `max_files_threshold` files as an exact prefix.
  - Sample every `sample_rate`-th file after that and estimate the overflow portion from sampled bytes.
- Periodically perform progress checks:
  - If total elapsed time exceeds the timeout, attempt to fall back to a sampled estimate.
  - If no file progress is observed within `stall_timeout`, treat the traversal as stalled.
- If some directory entries or metadata reads fail:
  - As long as at least one disk scan succeeds, return a partial-success result.
  - Mark the result with `had_partial_errors = true`.
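The "exact prefix + sampled overflow" step can be sketched without touching the filesystem. The following is a minimal illustration, not the crate's implementation: `SizeEstimate`, `estimate_used_bytes`, and the rule of scaling the mean sampled size to the whole overflow are assumptions made for this example.

```rust
/// Outcome of estimating used bytes from a stream of file sizes.
/// Hypothetical type; field names loosely follow `CapacityScanSummary`.
#[derive(Debug, PartialEq)]
pub struct SizeEstimate {
    pub used_bytes: u64,
    pub file_count: usize,
    pub sampled_count: usize,
    pub is_estimated: bool,
}

/// Sums the first `max_files_threshold` sizes exactly, then samples every
/// `sample_rate`-th overflow file (starting with the first overflow file)
/// and scales the mean sampled size to the whole overflow.
/// The scaling rule is an assumption of this sketch.
pub fn estimate_used_bytes(
    sizes: impl IntoIterator<Item = u64>,
    max_files_threshold: usize,
    sample_rate: usize,
) -> SizeEstimate {
    let sample_rate = sample_rate.max(1); // guard against a zero interval
    let mut exact_bytes: u64 = 0;
    let mut sampled_bytes: u64 = 0;
    let mut sampled_count: usize = 0;
    let mut file_count: usize = 0;

    for size in sizes {
        file_count += 1;
        if file_count <= max_files_threshold {
            // Exact prefix: every file is counted.
            exact_bytes += size;
        } else {
            // Overflow: only every `sample_rate`-th file is measured.
            let overflow_index = file_count - max_files_threshold - 1;
            if overflow_index % sample_rate == 0 {
                sampled_bytes += size;
                sampled_count += 1;
            }
        }
    }

    let overflow_count = file_count.saturating_sub(max_files_threshold);
    let overflow_estimate = if sampled_count > 0 {
        // Mean sampled size multiplied by the number of overflow files.
        ((sampled_bytes as u128 * overflow_count as u128) / sampled_count as u128) as u64
    } else {
        0
    };

    SizeEstimate {
        used_bytes: exact_bytes + overflow_estimate,
        file_count,
        sampled_count,
        is_estimated: overflow_count > 0,
    }
}
```

With ten 100-byte files, a threshold of 4, and a sample rate of 3, the sketch counts 400 bytes exactly, samples two of the six overflow files, and estimates the remaining 600 bytes.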
Scan Concurrency
- Multi-disk scans run concurrently through `buffer_unordered`.
- The current hard-coded maximum concurrency is `4` disks.
- A failure on one disk does not immediately stop scans for the others.
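The crate bounds concurrency with `buffer_unordered` on an async stream. To stay dependency-free, the sketch below swaps in standard-library threads pulling from a shared queue, which gives the same observable properties: at most four scans in flight, results in completion order, and no cancellation when one disk fails. `MAX_CONCURRENT_SCANS` and `scan_all` are hypothetical names.

```rust
use std::collections::VecDeque;
use std::sync::{mpsc, Mutex};
use std::thread;

/// Hypothetical constant mirroring the four-disk limit described above.
const MAX_CONCURRENT_SCANS: usize = 4;

/// Runs `scan` over every disk with at most `MAX_CONCURRENT_SCANS` in flight.
/// Each disk's result is reported individually, so one failure does not
/// stop the remaining scans.
pub fn scan_all<F>(disks: Vec<String>, scan: F) -> Vec<(String, Result<u64, String>)>
where
    F: Fn(&str) -> Result<u64, String> + Sync,
{
    let workers = MAX_CONCURRENT_SCANS.min(disks.len());
    let queue = Mutex::new(VecDeque::from(disks));
    let (tx, rx) = mpsc::channel();

    thread::scope(|s| {
        for _ in 0..workers {
            let tx = tx.clone();
            let queue = &queue;
            let scan = &scan;
            s.spawn(move || loop {
                // Take the next disk; the lock is released before scanning.
                let next = queue.lock().unwrap().pop_front();
                let Some(disk) = next else { break };
                let result = scan(&disk);
                let _ = tx.send((disk, result));
            });
        }
    });

    // All workers have finished; close the channel and collect the results.
    drop(tx);
    rx.into_iter().collect()
}
```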
Timeout and Estimation Fallback
This crate intentionally does not treat a timeout as a hard failure:
- If enough sampled data has already been collected, a timeout or stall produces an estimated result.
- Only when no usable estimate is available does the scan return an error.
- This keeps capacity queries useful for large directories, slow disks, and temporary I/O stalls.
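The fallback rule can be expressed as a small decision function. This is a sketch under stated assumptions: `Interruption`, `PartialScan`, `ScanOutcome`, and the `min_files` cutoff for "enough data" are all hypothetical, since the source does not say how much sampled data counts as enough.

```rust
/// Why a traversal stopped early.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Interruption {
    Timeout,
    Stall,
}

/// Counters gathered before the interruption (hypothetical shape).
pub struct PartialScan {
    pub estimated_bytes: u64,
    pub files_seen: usize,
}

/// Result of resolving an interrupted scan.
#[derive(Debug, PartialEq)]
pub enum ScanOutcome {
    /// A degraded but usable answer.
    Estimated { used_bytes: u64, reason: Interruption },
    /// Nothing usable was collected, so the scan reports an error.
    Failed(Interruption),
}

/// Assumed rule: a scan that has seen at least `min_files` files
/// carries enough data to report an estimate.
pub fn resolve_interrupted(
    partial: &PartialScan,
    reason: Interruption,
    min_files: usize,
) -> ScanOutcome {
    if partial.files_seen >= min_files && partial.files_seen > 0 {
        ScanOutcome::Estimated {
            used_bytes: partial.estimated_bytes,
            reason,
        }
    } else {
        ScanOutcome::Failed(reason)
    }
}
```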
Symlink Handling
- Symlinks are not followed by default: `RUSTFS_CAPACITY_FOLLOW_SYMLINKS=false`.
- If enabled, the scan applies circular-reference detection and a maximum follow depth.
- The default maximum depth is `3`.
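One way to combine the opt-in flag, the depth limit, and cycle detection is a guard that remembers resolved targets. The sketch below is illustrative only: `SymlinkGuard` is a hypothetical type, and it assumes the caller passes an already-canonicalized target path and a depth that counts links followed so far.

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};

/// Decides whether a symlink may be followed (hypothetical helper).
pub struct SymlinkGuard {
    follow: bool,
    max_depth: usize,
    visited: HashSet<PathBuf>,
}

impl SymlinkGuard {
    pub fn new(follow: bool, max_depth: usize) -> Self {
        Self {
            follow,
            max_depth,
            visited: HashSet::new(),
        }
    }

    /// Returns true only when following is enabled, the depth limit has not
    /// been reached, and the resolved target has not been visited before.
    /// A repeated target is treated as a circular reference.
    pub fn should_follow(&mut self, resolved_target: &Path, depth: usize) -> bool {
        self.follow
            && depth < self.max_depth
            && self.visited.insert(resolved_target.to_path_buf())
    }
}
```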
Capacity Cache and Refresh Strategy
HybridCapacityManager is the state center of this crate.
Cached State
- Latest total capacity value: `total_used`
- Last refresh time: `last_update`
- File count: `file_count`
- Estimated/exact flag: `is_estimated`
- Data source: `DataSource`
- Per-disk cache: `disk_cache`
- Dirty-disk set
- Recent 60-second write buckets
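The "recent 60-second write buckets" entry suggests a sliding window of per-second counters. The source does not describe the layout, so the following is only one plausible shape: `WriteBuckets` is a hypothetical type that keeps 60 one-second slots and ignores slots whose timestamp has aged out.

```rust
/// Sliding window of per-second write counters covering the last 60 seconds.
/// Hypothetical layout; the crate's actual bucket structure may differ.
pub struct WriteBuckets {
    counts: [u32; 60],
    /// The wall-clock second each slot was last used for.
    stamps: [u64; 60],
}

impl WriteBuckets {
    pub fn new() -> Self {
        Self {
            counts: [0; 60],
            stamps: [u64::MAX; 60], // sentinel: slot never used
        }
    }

    /// Records one write at `now_secs` (seconds since an arbitrary epoch).
    pub fn record(&mut self, now_secs: u64) {
        let slot = (now_secs % 60) as usize;
        if self.stamps[slot] != now_secs {
            // The slot belongs to an older second; reuse it.
            self.stamps[slot] = now_secs;
            self.counts[slot] = 0;
        }
        self.counts[slot] += 1;
    }

    /// Counts writes recorded during the 60 seconds ending at `now_secs`.
    pub fn writes_in_window(&self, now_secs: u64) -> u32 {
        let mut total = 0;
        for (count, stamp) in self.counts.iter().zip(self.stamps.iter()) {
            if *stamp <= now_secs && now_secs - *stamp < 60 {
                total += *count;
            }
        }
        total
    }
}
```

A caller would compare `writes_in_window` against the write-frequency threshold to decide whether a write-triggered refresh is worth considering.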
DataSource
- `RealTime`: Foreground real-time refresh when no cache exists yet.
- `Scheduled`: Background refresh triggered by the scheduled task.
- `WriteTriggered`: Refresh triggered when write frequency is high and the cache is old enough.
- `Fallback`: Fallback to externally supplied disk-used capacity when all scans fail.
Refresh Entry Points
- `refresh_or_join`: A singleflight foreground refresh. If another refresh is already running, callers join and wait for the shared result.
- `spawn_refresh_if_needed`: A background refresh. If another refresh is already running, it is skipped.
- `start_background_task`: Starts two background tasks:
  - the scheduled capacity refresh task
  - the runtime summary logging task
Singleflight Semantics
refresh_or_join and spawn_refresh_if_needed use a watch channel to coordinate refresh cycles:
- Only one leader performs the actual refresh at a time.
- Joiners share the same published result after the leader completes.
- Panics inside the refresh function are caught and converted into errors so callers do not crash with the leader.
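The leader/joiner behavior can be illustrated with blocking primitives. The crate coordinates through a watch channel in async code; this sketch deliberately swaps in a `Mutex` and `Condvar` so it runs with the standard library alone. `SingleFlight` and its fields are hypothetical, and the `u64` result stands in for whatever the real refresh returns.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::{Condvar, Mutex};

struct FlightState {
    running: bool,
    /// Incremented each time a leader publishes a result.
    generation: u64,
    last: Option<Result<u64, String>>,
}

/// Blocking singleflight sketch (hypothetical type).
pub struct SingleFlight {
    state: Mutex<FlightState>,
    done: Condvar,
}

impl SingleFlight {
    pub fn new() -> Self {
        Self {
            state: Mutex::new(FlightState { running: false, generation: 0, last: None }),
            done: Condvar::new(),
        }
    }

    /// Runs `refresh` as the leader, or waits for the running leader's result.
    pub fn refresh_or_join<F>(&self, refresh: F) -> Result<u64, String>
    where
        F: FnOnce() -> Result<u64, String>,
    {
        let mut state = self.state.lock().unwrap();
        if state.running {
            // Joiner: wait until the leader publishes a new generation.
            let awaited = state.generation;
            while state.generation == awaited {
                state = self.done.wait(state).unwrap();
            }
            return state.last.clone().expect("leader publishes a result");
        }
        state.running = true;
        drop(state); // never hold the lock while refreshing

        // A panic inside the refresh becomes an error instead of unwinding.
        let outcome = catch_unwind(AssertUnwindSafe(refresh))
            .unwrap_or_else(|_| Err("refresh panicked".to_string()));

        let mut state = self.state.lock().unwrap();
        state.running = false;
        state.generation += 1;
        state.last = Some(outcome.clone());
        drop(state);
        self.done.notify_all();
        outcome
    }
}
```

Because the lock is released before `refresh` runs, a panicking refresh cannot poison the mutex, and later callers can become leaders as usual.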
Dirty Scope and Subset Refresh
One of the main optimizations in this crate is "refresh only the disks dirtied by writes".
Scope Propagation
capacity_scope.rs provides two ways to propagate dirty disks:
- token scope
  - The caller first binds a write operation to a disk set with `record_capacity_scope(token, scope)`.
  - Later, `record_write_operation_with_scope_token(Some(token))` consumes that scope and marks the disks dirty.
- global dirty scope
  - `record_global_dirty_scope(scope)` records dirty disks directly in the global registry.
  - The manager drains and merges them during `get_dirty_disks()`.
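The two propagation paths can be modeled with a small in-memory registry. This is a simplified sketch, not the crate's data structure: `ScopeRegistry` and its method names are hypothetical, a `u64` stands in for the UUID token, and disks are identified by plain path strings.

```rust
use std::collections::{BTreeSet, HashMap};

/// In-memory model of token scopes and the global dirty registry (hypothetical).
#[derive(Default)]
pub struct ScopeRegistry {
    /// Scopes bound to a token but not yet consumed by a write.
    pending: HashMap<u64, Vec<String>>,
    /// Disks reported dirty without a token.
    global_dirty: BTreeSet<String>,
    /// Disks the manager currently considers dirty.
    dirty: BTreeSet<String>,
}

impl ScopeRegistry {
    /// Binds a pending write (identified by `token`) to a set of disks.
    pub fn record_scope(&mut self, token: u64, disks: Vec<String>) {
        self.pending.insert(token, disks);
    }

    /// Consumes the scope for `token`, if any, and marks its disks dirty.
    pub fn record_write_with_token(&mut self, token: Option<u64>) {
        if let Some(disks) = token.and_then(|t| self.pending.remove(&t)) {
            self.dirty.extend(disks);
        }
    }

    /// Records dirty disks directly in the global registry.
    pub fn record_global_dirty(&mut self, disks: Vec<String>) {
        self.global_dirty.extend(disks);
    }

    /// Drains the global registry into the dirty set and returns a snapshot.
    pub fn dirty_disks(&mut self) -> BTreeSet<String> {
        let drained = std::mem::take(&mut self.global_dirty);
        self.dirty.extend(drained);
        self.dirty.clone()
    }
}
```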
When Dirty-Subset Refresh Is Allowed
Refreshing only dirty disks is safe only when:
- `disk_cache_complete == true`
  - which means the system has already completed at least one full refresh without partial errors
  - and the per-disk cache is fully populated
If the per-disk cache is incomplete, or there are no dirty disks, the system falls back to a full refresh.
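The eligibility rule reduces to a two-condition check. In this minimal sketch, `RefreshPlan` and `plan_refresh` are hypothetical names chosen for illustration.

```rust
/// Which disks the next refresh should scan (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum RefreshPlan {
    Full,
    DirtySubset(Vec<String>),
}

/// Chooses a subset refresh only when the per-disk cache is complete
/// and at least one disk is dirty; otherwise falls back to a full refresh.
pub fn plan_refresh(disk_cache_complete: bool, dirty: &[String]) -> RefreshPlan {
    if disk_cache_complete && !dirty.is_empty() {
        RefreshPlan::DirtySubset(dirty.to_vec())
    } else {
        RefreshPlan::Full
    }
}
```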
Merge Rules After a Subset Refresh
- On a successful full refresh, `per_disk` replaces the entire `disk_cache`.
- On a successful dirty-subset refresh, only the affected per-disk entries are updated.
- The total capacity is recomputed from the updated `disk_cache` instead of trusting the subset sum directly.
- If a dirty-subset refresh reports partial errors, that cycle fails and the caller falls back to a full refresh to recover consistency.
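The merge rules can be sketched over a plain map from disk path to used bytes. `PerDisk`, `apply_full_refresh`, and `apply_subset_refresh` are hypothetical names, and the real cache entries likely carry more than a byte count.

```rust
use std::collections::HashMap;

/// Per-disk used bytes, keyed by drive path (simplified stand-in for `disk_cache`).
pub type PerDisk = HashMap<String, u64>;

fn total(cache: &PerDisk) -> u64 {
    cache.values().sum()
}

/// A full refresh replaces the whole cache.
pub fn apply_full_refresh(cache: &mut PerDisk, per_disk: PerDisk) -> u64 {
    *cache = per_disk;
    total(cache)
}

/// A subset refresh updates only the scanned disks, then recomputes the
/// total from the merged cache. Partial errors reject the cycle and leave
/// the cache untouched so the caller can run a full refresh instead.
pub fn apply_subset_refresh(
    cache: &mut PerDisk,
    subset: PerDisk,
    had_partial_errors: bool,
) -> Result<u64, String> {
    if had_partial_errors {
        return Err("dirty-subset refresh had partial errors; run a full refresh".to_string());
    }
    for (disk, used) in subset {
        cache.insert(disk, used);
    }
    Ok(total(cache))
}
```

Recomputing from the merged cache matters because the subset sum covers only the dirty disks; returning it directly would drop the clean disks from the total.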
Relationship to the RustFS Main Flow
This crate provides capacity primitives only. The actual RustFS integration lives in rustfs/src/capacity/service.rs.
The high-level flow is:
- Startup calls `init_capacity_management_for_local_disks()`.
- It collects all local disks and calls `capacity_manager::start_background_task(...)`.
- Admin used-capacity queries first try the `HybridCapacityManager` cache.
- If the cache is fresh enough, the cached value is returned directly.
- If the cache is stale but still acceptable, the stale value is served and a background refresh is triggered.
- If the cache is very stale and the write rate is high, the request blocks on a foreground refresh.
- If the initial real-time scan fails, the service falls back to externally supplied disk-used capacity and stores it as `Fallback`.
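The query-side branches of this flow can be summarized as a pure decision function. The sketch is hedged throughout: `FreshnessPolicy`, `CacheDecision`, and `decide` are hypothetical, and which configuration values define "fresh" and "very stale" is an assumption, since the source does not spell out the mapping.

```rust
use std::time::Duration;

/// Hypothetical thresholds; their mapping to the crate's settings is assumed.
pub struct FreshnessPolicy {
    pub fresh_for: Duration,
    pub very_stale_after: Duration,
    pub write_threshold: u32,
}

#[derive(Debug, PartialEq)]
pub enum CacheDecision {
    /// No cache exists yet: run a foreground real-time scan.
    ScanNow,
    /// Cache is fresh: return it directly.
    ServeCached,
    /// Cache is stale but acceptable: serve it and refresh in the background.
    ServeStaleAndRefresh,
    /// Cache is very stale under heavy writes: block on a foreground refresh.
    BlockOnRefresh,
}

pub fn decide(
    cache_age: Option<Duration>,
    recent_writes: u32,
    policy: &FreshnessPolicy,
) -> CacheDecision {
    let Some(age) = cache_age else {
        return CacheDecision::ScanNow;
    };
    if age <= policy.fresh_for {
        CacheDecision::ServeCached
    } else if age > policy.very_stale_after && recent_writes >= policy.write_threshold {
        CacheDecision::BlockOnRefresh
    } else {
        CacheDecision::ServeStaleAndRefresh
    }
}
```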
crates/ecstore/src/set_disk.rs is responsible for recording capacity scopes during object writes, heal operations, data movement, and related flows, so this crate can learn which disks were affected.
Public API
1. Direct Scan
This is useful for benchmarks, operational tooling, or isolated validation.
```rust
use rustfs_object_capacity::{CapacityDiskRef, scan_used_capacity_disks};

let disks = vec![
    CapacityDiskRef {
        endpoint: "node-a".to_string(),
        drive_path: "/data/disk1".to_string(),
    },
];

let summary = scan_used_capacity_disks(&disks).await?;
println!(
    "used={} files={} estimated={}",
    summary.used_bytes, summary.file_count, summary.is_estimated
);
# Ok::<(), Box<dyn std::error::Error>>(())
```
2. Use the Global Manager
This is useful for in-service caching and refresh orchestration.
```rust
use rustfs_object_capacity::capacity_manager::{DataSource, get_capacity_manager};

let manager = get_capacity_manager();

if let Some(cached) = manager.get_capacity().await {
    println!("cached bytes={}", cached.total_used);
}

manager.record_write_operation().await;

let _ = manager
    .refresh_or_join(DataSource::Scheduled, || async {
        rustfs_object_capacity::scan::refresh_capacity_with_scope(
            vec![rustfs_object_capacity::CapacityDiskRef {
                endpoint: "node-a".to_string(),
                drive_path: "/data/disk1".to_string(),
            }],
            false,
        )
        .await
    })
    .await;
```
3. Propagate a Dirty Scope
```rust
use rustfs_object_capacity::capacity_scope::{
    CapacityScope, CapacityScopeDisk, record_capacity_scope,
};
use rustfs_object_capacity::capacity_manager::get_capacity_manager;
use uuid::Uuid;

let token = Uuid::new_v4();
record_capacity_scope(
    token,
    CapacityScope {
        disks: vec![CapacityScopeDisk {
            endpoint: "node-a".to_string(),
            drive_path: "/data/disk1".to_string(),
        }],
    },
);

get_capacity_manager()
    .record_write_operation_with_scope_token(Some(token))
    .await;
```
Environment Variables and Defaults
The configuration constants are defined in crates/config/src/constants/capacity.rs.
| Environment Variable | Default | Description |
|---|---|---|
| `RUSTFS_CAPACITY_SCHEDULED_INTERVAL` | `120s` | Scheduled refresh interval |
| `RUSTFS_CAPACITY_WRITE_TRIGGER_DELAY` | `5s` | Debounce delay after writes |
| `RUSTFS_CAPACITY_WRITE_FREQUENCY_THRESHOLD` | `5` | Recent 60-second write-frequency threshold |
| `RUSTFS_CAPACITY_FAST_UPDATE_THRESHOLD` | `30s` | Cache age required before fast refresh is considered |
| `RUSTFS_CAPACITY_MAX_FILES_THRESHOLD` | `200000` | Exact-count file threshold |
| `RUSTFS_CAPACITY_STAT_TIMEOUT` | `3s` | Base scan timeout |
| `RUSTFS_CAPACITY_SAMPLE_RATE` | `200` | Overflow-file sampling interval |
| `RUSTFS_CAPACITY_METRICS_INTERVAL` | `600s` | Runtime summary emission interval |
| `RUSTFS_CAPACITY_FOLLOW_SYMLINKS` | `false` | Whether to follow symlinks |
| `RUSTFS_CAPACITY_MAX_SYMLINK_DEPTH` | `3` | Maximum symlink follow depth |
| `RUSTFS_CAPACITY_ENABLE_DYNAMIC_TIMEOUT` | `true` | Whether to enable dynamic timeout scaling |
| `RUSTFS_CAPACITY_MIN_TIMEOUT` | `2s` | Dynamic-timeout lower bound |
| `RUSTFS_CAPACITY_MAX_TIMEOUT` | `15s` | Dynamic-timeout upper bound |
| `RUSTFS_CAPACITY_STALL_TIMEOUT` | `20s` | Stall-detection threshold |
Configuration-Caching Note
In non-test builds, configuration is cached behind OnceLock:
- Environment variables are effectively read once on first access.
- Updating `RUSTFS_CAPACITY_*` during runtime usually does not take effect immediately.
- A process restart is normally required to apply configuration changes reliably.
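The read-once behavior follows directly from how `OnceLock` works. The sketch below shows the pattern for a single setting. `parse_secs` and `scheduled_interval` are hypothetical helpers, and the accepted value format (plain seconds with an optional `s` suffix) is an assumption rather than the crate's documented parsing.

```rust
use std::sync::OnceLock;
use std::time::Duration;

/// Parses values such as `120` or `120s`; falls back to `default` on
/// missing or malformed input. The accepted format is assumed.
pub fn parse_secs(raw: Option<&str>, default: Duration) -> Duration {
    raw.map(str::trim)
        .map(|s| s.strip_suffix('s').unwrap_or(s))
        .and_then(|s| s.parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(default)
}

/// Reads the environment once; later changes to the variable are ignored
/// for the lifetime of the process.
pub fn scheduled_interval() -> Duration {
    static CACHED: OnceLock<Duration> = OnceLock::new();
    *CACHED.get_or_init(|| {
        let raw = std::env::var("RUSTFS_CAPACITY_SCHEDULED_INTERVAL").ok();
        parse_secs(raw.as_deref(), Duration::from_secs(120))
    })
}
```

Every call after the first returns the cached value without touching the environment, which is why a restart is needed for a new value to apply.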
Metrics
This crate reports multiple metric families to rustfs-io-metrics::capacity_metrics, including:
- cache hit / miss / served state
- refresh inflight, joiners, and success / error outcomes
- current capacity bytes
- write frequency
- dirty-disk count
- per-disk scan duration, sampling mode, timeout fallback, stall detection, and symlink statistics
So this crate is both a capacity-calculation component and an important producer of runtime observability data.
Benchmarks
Run the benchmark suite with:
```bash
cargo bench -p rustfs-object-capacity --bench capacity_scan
```
Current benchmark scenarios:
- `capacity_scan_exact`: Single-disk exact scan over 10k files.
- `capacity_scan_sampled`: Single-disk scan over 202,048 files that triggers sampled estimation.
- `capacity_scan_multi_disk`: Four-disk exact scan with mixed directory sizes.
Known Boundaries and Tradeoffs
- It sums file sizes under RustFS object-data directories; it is not a full replacement for filesystem-level `du`.
- Estimated mode prioritizes bounded cost and usable results over perfect per-run precision.
- Dirty-subset refresh is safe only after a complete per-disk cache has been established.
- Partial errors intentionally try to return a degraded result, which improves availability but means callers should pay attention to `had_partial_errors`.
- Symlink following is disabled by default for safety and determinism.