mirror of https://github.com/rustfs/rustfs.git synced 2026-06-22 02:32:45 +08:00

Henry Guo b387689f26 feat(heal): expose scanner-aware operations status (#3483)

Co-authored-by: Henry Guo <marshawcoco@users.noreply.github.com>

2026-06-15 22:28:53 +08:00

Scanner Runtime Controls

This document describes the runtime controls and status fields for the RustFS data scanner. It is written for operators who need to reduce scanner pressure, diagnose slow scan progress, or confirm that background lifecycle, replication, heal, bitrot, and usage work is still moving.

For reproducible scanner-pressure validation and before/after evidence, see Scanner Benchmark Runbook.

What the Scanner Does

The scanner is the background maintenance loop that walks stored objects and feeds several subsystems:

usage accounting and data usage cache updates;
lifecycle expiry and transition admission;
bucket replication repair admission;
scanner-originated heal and bitrot checks;
namespace alerts for excessive versions, retained version size, and folder fan-out.

Slowing the scanner can reduce idle CPU and disk pressure, but it also delays the maintenance work above. Check the status fields below before changing cycle or pacing values.

Configuration Sources

Scanner runtime config is resolved in this order:

1. Environment variables.
2. Persisted admin config for the scanner subsystem.
3. Built-in defaults or speed preset-derived values.

Bitrot cycle resolution is slightly different because the canonical persistent key belongs to the heal subsystem:

1. Environment variable RUSTFS_SCANNER_BITROT_CYCLE_SECS.
2. Persisted key heal.bitrot_cycle.
3. Legacy compatibility key scanner.bitrot_cycle.
4. Built-in default.

The /v3/scanner/status response reports each effective runtime value with a source of env, config, scanner_compat_config, or default.
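
The bitrot cycle lookup order can be sketched as a small shell function. This is only an illustration of the documented precedence, not the server's implementation: the function name resolve_bitrot_cycle and its positional arguments are hypothetical, and the fallback value is the 2592000 seconds listed as the default in the Runtime Controls table.

```shell
# Illustrative only: mimics the documented lookup order for the bitrot cycle.
# Hypothetical arguments (pass "" for "unset"):
#   $1 = value of RUSTFS_SCANNER_BITROT_CYCLE_SECS
#   $2 = persisted heal.bitrot_cycle
#   $3 = persisted legacy scanner.bitrot_cycle
# Prints "<value> <source>", using the source names reported by the status route.
resolve_bitrot_cycle() {
  local env_value="$1" heal_value="$2" compat_value="$3"
  if [ -n "$env_value" ]; then
    printf '%s env\n' "$env_value"
  elif [ -n "$heal_value" ]; then
    printf '%s config\n' "$heal_value"
  elif [ -n "$compat_value" ]; then
    printf '%s scanner_compat_config\n' "$compat_value"
  else
    printf '%s default\n' "2592000"
  fi
}

resolve_bitrot_cycle "" "" "86400"   # prints: 86400 scanner_compat_config
```

Comparing the printed source with the source reported in runtime_config.bitrot_cycle_seconds is a quick way to check which layer actually supplied the value.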

Runtime Controls

Persistent key	Environment variable	Unit	Default	Effect
`scanner.speed`	`RUSTFS_SCANNER_SPEED`	preset	`default`	Selects the base pacing preset: `fastest`, `fast`, `default`, `slow`, or `slowest`.
`scanner.delay`	`RUSTFS_SCANNER_DELAY`	factor	preset-derived	Overrides the sleep multiplier. Valid range is `0` through `10000`.
`scanner.max_wait`	`RUSTFS_SCANNER_MAX_WAIT_SECS`	seconds	preset-derived	Caps one scanner sleep.
`scanner.cycle`	`RUSTFS_SCANNER_CYCLE`	seconds	preset-derived	Sets the interval between scanner cycles.
`scanner.start_delay`	`RUSTFS_SCANNER_START_DELAY_SECS`	seconds	unset	Sets startup delay and, for compatibility, the cycle interval when `scanner.cycle` is unset.
`scanner.cycle_max_duration`	`RUSTFS_SCANNER_CYCLE_MAX_DURATION_SECS`	seconds	`0`	Caps one cycle's runtime. `0` disables this budget.
`scanner.cycle_max_objects`	`RUSTFS_SCANNER_CYCLE_MAX_OBJECTS`	objects	`0`	Caps objects processed by one cycle. `0` disables this budget.
`scanner.cycle_max_directories`	`RUSTFS_SCANNER_CYCLE_MAX_DIRECTORIES`	directories	`0`	Caps directories entered by one cycle. `0` disables this budget.
`heal.bitrot_cycle`	`RUSTFS_SCANNER_BITROT_CYCLE_SECS`	seconds	`2592000`	Controls periodic deep bitrot scans. `false`, `off`, `no`, or `disabled` disables periodic deep scans; `0`, `true`, `on`, or `yes` runs deep mode every scanner cycle.
`scanner.idle_mode`	`RUSTFS_SCANNER_IDLE_MODE`	boolean	`true`	Enables scanner sleeps and cooperative throttling.
`scanner.cache_save_timeout`	`RUSTFS_SCANNER_CACHE_SAVE_TIMEOUT_SECS`	seconds	`30`	Timeout for saving scanner cache; runtime enforces a minimum of `1`.
`scanner.max_concurrent_set_scans`	`RUSTFS_SCANNER_MAX_CONCURRENT_SET_SCANS`	count	`0`	Caps concurrent set-level scanner tasks. `0` keeps topology-derived concurrency.
`scanner.max_concurrent_disk_scans`	`RUSTFS_SCANNER_MAX_CONCURRENT_DISK_SCANS`	count	`0`	Caps concurrent disk bucket walks per set. `0` keeps disk-count-derived concurrency.
`scanner.yield_every_n_objects`	`RUSTFS_SCANNER_YIELD_EVERY_N_OBJECTS`	objects	`128`	Controls how often object loops yield to the async runtime. `0` disables this extra yield.
`scanner.alert_excess_versions`	`RUSTFS_SCANNER_ALERT_EXCESS_VERSIONS`	versions	`100`	Version count threshold for scanner alerts.
`scanner.alert_excess_version_size`	`RUSTFS_SCANNER_ALERT_EXCESS_VERSION_SIZE`	bytes	`1099511627776`	Retained version byte threshold for scanner alerts.
`scanner.alert_excess_folders`	`RUSTFS_SCANNER_ALERT_EXCESS_FOLDERS`	folders	`65538`	Direct subfolder threshold for scanner alerts.

The fastest, fast, default, slow, and slowest presets set the base sleep multiplier, maximum wait, and cycle interval. Use scanner.delay, scanner.max_wait, and scanner.cycle when the preset is close but one axis needs a precise override.
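
The heal.bitrot_cycle row accepts both keywords and a number of seconds, which is easy to misread. The sketch below classifies a candidate value the way the table describes it. The function name classify_bitrot_cycle is hypothetical, and treating anything else that is not a whole number as invalid is an assumption of this sketch; the server may accept other spellings.

```shell
# Illustrative only: classify a bitrot cycle setting as described in the table.
classify_bitrot_cycle() {
  case "$1" in
    false|off|no|disabled) echo "disabled" ;;              # no periodic deep scans
    0|true|on|yes)         echo "every_cycle" ;;           # deep mode on every scanner cycle
    *[!0-9]*|"")           echo "invalid" ;;               # assumption: not a whole number
    *)                     echo "interval_seconds=$1" ;;   # periodic deep scan interval
  esac
}

classify_bitrot_cycle 2592000   # prints: interval_seconds=2592000
```

Running a planned value through a check like this before setting RUSTFS_SCANNER_BITROT_CYCLE_SECS helps avoid enabling deep mode on every cycle by accident, for example by passing 0 when "disabled" was intended.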

Status Endpoint

The scanner status route, relative to the /rustfs/admin prefix, is:

GET /v3/scanner/status

The request must be authenticated with an admin identity that has ServerInfoAdminAction. The JSON response has two top-level objects:

runtime_config: the effective runtime controls and their value sources.
metrics: scanner work, pressure, checkpoint, lifecycle, replication, heal, bitrot, and alert counters.

Example fields to inspect:

runtime_config.speed.value
runtime_config.delay.value
runtime_config.max_wait_seconds.value
runtime_config.cycle_interval_seconds.value
runtime_config.bitrot_cycle_seconds.value
metrics.pacing_pressure.primary_pressure
metrics.pacing_pressure.last_cycle_budget_limited
metrics.lifecycle_transition.current_queued
metrics.lifecycle_transition.scanner_missed
metrics.maintenance_control.primary_control
metrics.source_work
metrics.replication_repair
metrics.scan_checkpoint
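
A minimal sketch of fetching the status from one node follows. It assumes the /rustfs/admin prefix used by the full paths in the distributed metrics section, and it reuses the awscurl flags and credential variables from that section's example. The function names scanner_status_url and fetch_scanner_status are hypothetical helpers, not part of RustFS.

```shell
# Build the status URL for one node. The /rustfs/admin prefix follows the
# full paths used elsewhere in this document.
scanner_status_url() {
  local endpoint="${1%/}"   # drop one trailing slash, if any
  printf '%s/rustfs/admin/v3/scanner/status\n' "$endpoint"
}

# Fetch the status JSON from one node. Requires awscurl on PATH and admin
# credentials in RUSTFS_ACCESS_KEY and RUSTFS_SECRET_KEY.
fetch_scanner_status() {
  awscurl \
    --service s3 \
    --region us-east-1 \
    --access_key "$RUSTFS_ACCESS_KEY" \
    --secret_key "$RUSTFS_SECRET_KEY" \
    --request GET \
    "$(scanner_status_url "$1")"
}

scanner_status_url "http://node-a:9000/"   # prints: http://node-a:9000/rustfs/admin/v3/scanner/status
```

Calling fetch_scanner_status http://node-a:9000 and saving the output gives a snapshot that can be compared after a tuning change.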

Reading Pacing Pressure

metrics.pacing_pressure.primary_pressure summarizes the highest-priority scanner pressure signal:

Value	Meaning	Usual response
`queued_scans`	Set or disk scan queues are backing up.	Lower scanner concurrency or increase pacing delay if user traffic is affected.
`cycle_budget`	The last cycle stopped because a runtime/object/directory budget was reached.	Check `last_cycle_partial_reason` and `last_cycle_partial_source`; increase the specific budget if scans need to finish sooner.
`throttle_pause`	Scanner sleeps or cooperative yields were observed.	Expected when `idle_mode` is enabled; inspect pause ratios before tuning.
`active_scans`	Scanner work is active but not currently queued or budget-limited.	Usually healthy; correlate with CPU/disk metrics.
`none`	No current scanner pressure was observed.	No scanner pacing action needed.

The ratio fields are fractions of the last cycle duration:

last_cycle_throttle_sleep_ratio
last_cycle_yield_ratio
last_cycle_total_pause_ratio

If CPU is high but pause ratios are already high, increasing scanner.delay or scanner.max_wait may have limited value. Check active paths, source work, and disk activity before changing the cycle interval.

Reading Source Work

metrics.source_work, metrics.current_cycle_source_work, and metrics.last_cycle_source_work group scanner work by source:

usage
lifecycle
bucket_replication
site_replication
heal
bitrot
alerts

Each source has checked, queued, executed, failed, skipped, and missed counters. missed means the scanner found work but could not admit it to the downstream queue. skipped means the work was intentionally merged or deduplicated.

Use these counters to decide whether scan progress is limited by scanner pacing or by a downstream subsystem such as lifecycle transition, replication repair, or heal admission.

Reading Heal Operations

The background heal status route is:

POST /v3/background-heal/status

It reports scanner-driven bitrot state together with heal queue execution state. healQueueLength and healActiveTasks keep the legacy totals. healOperations adds the same totals split by request source and priority:

Field	Meaning
`queueLength`	Total queued heal requests.
`activeTasks`	Total running heal tasks.
`queuedBySource`	Queued requests split into `scanner`, `admin`, `autoHeal`, and `internal`.
`activeBySource`	Running tasks split into `scanner`, `admin`, `autoHeal`, and `internal`.
`queuedByPriority`	Queued requests split into `low`, `normal`, `high`, and `urgent`.
`activeByPriority`	Running tasks split into `low`, `normal`, `high`, and `urgent`.

Use this route when metrics.source_work shows heal or bitrot queued or missed work. Scanner-originated object checks should appear under scanner/low for opportunistic work, while manual admin heal should appear under admin/high. If scanner work grows but admin work remains blocked, treat that as heal queue pressure rather than scanner pacing pressure.

Reading Replication Repair

metrics.replication_repair, metrics.current_cycle_replication_repair, and metrics.last_cycle_replication_repair split scanner-discovered replication repair work by source and repair kind.

Each entry has the same checked, queued, executed, failed, skipped, and missed counters used by source_work, plus:

Field	Meaning
`source`	`bucket_replication` for bucket replication repair, or `site_replication` for site replication boundary signals.
`kind`	Bucket repair kinds are `object`, `delete_marker`, `version_purge`, and `existing_object`. Site replication boundary kinds are `passive_requeue` and `active_resync`.

For bucket replication, queued means scanner-discovered repair was admitted to the replication queue, missed means the queue or worker path could not accept it, and skipped means the object did not require a new repair task.

The site replication kinds keep passive scanner discovery separate from active resync. Scanner status may report site replication boundary counters, but the scanner should not be treated as the active site replication resync controller.

Reading Maintenance Control

metrics.maintenance_control derives a source-level control snapshot from scanner pacing, partial-cycle state, source work, and lifecycle transition queue state. It does not change scanner scheduling by itself; it explains why a source is moving, deferred, or blocked. When no scan cycle is currently active, source-work controls use the last completed cycle so recently missed work stays visible between scanner passes.

metrics.maintenance_control.primary_control summarizes the highest-priority source state:

Value	Meaning
`blocked_source`	At least one maintenance source found work that could not be admitted or is blocked by a downstream queue.
`deferred_source`	At least one source was deferred by a partial scanner cycle or budget-limited pass.
`active_source`	At least one source has current-cycle work or queued downstream work.
`pacing_pressure`	No source-specific state dominated, but scanner pacing pressure is still visible.
`none`	No source-level maintenance control pressure was observed.

Each metrics.maintenance_control.sources[] entry has:

Field	Meaning
`source`	Scanner source such as `usage`, `lifecycle`, `bucket_replication`, `site_replication`, `heal`, `bitrot`, or `alerts`.
`state`	`idle`, `active`, `deferred`, or `blocked`.
`reason`	Derived reason such as `active_work`, `queued_work`, `partial_cycle`, `missed_work`, `expiry_queue_backlog`, `transition_failed`, `transition_compensation_backlog`, `transition_queue_backlog`, or `transition_queue_full`.
`backlog`	Current source-level backlog estimate from queued or missed work.
`current_checked`	Current-cycle checked work for this source, or the last completed cycle when no scan cycle is active.
`current_queued`	Current-cycle queued work for this source, or the last completed cycle when no scan cycle is active.
`current_missed`	Current-cycle work that could not be admitted, or the last completed cycle when no scan cycle is active.
`lifetime_missed`	Lifetime missed work counter for context.
`partial_cycles`	Partial cycles attributed to this source.

Use this snapshot before changing scanner controls. For example, blocked_source with lifecycle/missed_work points at downstream lifecycle admission, while deferred_source with usage/partial_cycle points at scanner cycle budgets. lifecycle/expiry_queue_backlog means scanner-driven expiry or delete work is still queued or active in the expiry worker pool. lifecycle/transition_failed means transition worker execution failed during the current or last completed scan cycle, while lifecycle/transition_compensation_backlog means transition compensation is still pending or running after queue backpressure.
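
The source/reason pairs described above can be turned into a lookup that an operator script might use when summarizing sources[] entries. This is a sketch only: maintenance_hint is a hypothetical helper, the hint strings are paraphrases of this section, and applying the missed_work and partial_cycle readings to every source (not just the lifecycle and usage examples given above) is an assumption.

```shell
# Illustrative only: map a maintenance_control source/reason pair to the
# place this document suggests looking first.
#   $1 = source (for example lifecycle), $2 = reason (for example missed_work)
maintenance_hint() {
  local src="$1" reason="$2"
  case "$src/$reason" in
    lifecycle/expiry_queue_backlog)
      echo "expiry worker pool still has queued or active work" ;;
    lifecycle/transition_failed)
      echo "transition worker execution failed" ;;
    lifecycle/transition_compensation_backlog)
      echo "transition compensation pending or running" ;;
    */missed_work)
      echo "downstream admission for $src" ;;    # assumption: generalized to any source
    */partial_cycle)
      echo "scanner cycle budgets" ;;            # assumption: generalized to any source
    *)
      echo "no specific hint" ;;
  esac
}

maintenance_hint lifecycle missed_work   # prints: downstream admission for lifecycle
```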

metrics.lifecycle_expiry exposes the expiry/delete worker queue observed by scanner-driven lifecycle work:

Field	Meaning
`current_queue_capacity`	Effective expiry worker queue capacity for this node.
`current_queued`	Expiry/delete tasks currently waiting in the worker queue.
`current_active`	Expiry/delete tasks currently running in a worker.
`current_workers`	Configured expiry worker count.
`queue_missed`	Expiry/delete tasks that could not be queued because no worker channel was available or the queue was closed.
`scanner_queued`	Scanner-discovered expiry/delete object versions admitted to the expiry queue.
`scanner_missed`	Scanner-discovered expiry/delete object versions that could not be admitted.

Reading Distributed Metrics

/rustfs/admin/v3/scanner/status and /rustfs/admin/v3/metrics report the node that handles the HTTP request. The metrics endpoint does not fan out to peer nodes. In distributed deployments, query every node explicitly and keep by-host=true enabled so each response includes that node's host view:

mkdir -p artifacts  # the output redirect below fails if this directory is missing
for endpoint in http://node-a:9000 http://node-b:9000 http://node-c:9000; do
  # Derive a short node name for the file name: strip the scheme, then the port.
  node="${endpoint#*://}"
  node="${node%%:*}"
  awscurl \
    --service s3 \
    --region us-east-1 \
    --access_key "$RUSTFS_ACCESS_KEY" \
    --secret_key "$RUSTFS_SECRET_KEY" \
    --request GET \
    "${endpoint}/rustfs/admin/v3/metrics?types=1&by-host=true&n=1" \
    > "artifacts/scanner-metrics.${node}.$(date -u +%Y%m%dT%H%M%SZ).ndjson"
done

The aggregated.scanner payload preserves the same scanner progress, checkpoint, pacing, source work, maintenance control, lifecycle expiry, and lifecycle transition fields used by the local scanner status, but only for the node that returned the response. The by_host.*.scanner payload keeps that node's host view. Compare the per-node artifacts externally to find old active paths, partial checkpoints, pacing pressure, source-level control pressure, or downstream queue admission problems across the deployment.

Reading Lifecycle Transition Status

metrics.lifecycle_transition focuses on scanner-driven lifecycle transition work:

Field	Meaning
`current_queue_capacity`	Current transition queue capacity.
`current_queued`	Transition tasks currently queued.
`current_active`	Transition tasks currently being processed.
`current_workers`	Transition worker count.
`queue_full`	Queue-full observations in the transition state.
`queue_send_timeout`	Send timeouts for transition queue admission.
`compensation_scheduled`	Buckets scheduled for transition compensation.
`compensation_pending`	Buckets with transition compensation still pending or running.
`compensation_running`	Transition compensation tasks currently running.
`scanner_queued`	Scanner transition tasks admitted to the queue.
`scanner_missed`	Scanner transition tasks that could not be admitted.
`completed`	Transition worker completions.
`failed`	Transition worker failures.

When scanner_missed or queue_full rises, scanner lifecycle work is finding transition candidates faster than the transition queue can accept them. That is a downstream transition pressure signal, not just a scanner walk pressure signal.
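
Because scanner_missed and queue_full are read here as rising counters, the signal comes from comparing two samples taken some time apart. The sketch below shows that comparison. transition_pressure is a hypothetical helper, and the four arguments are whole-number counter values copied by hand or by a script from two status responses.

```shell
# Illustrative only: compare two samples of the transition counters.
#   $1 = earlier scanner_missed   $2 = later scanner_missed
#   $3 = earlier queue_full       $4 = later queue_full
transition_pressure() {
  if [ "$2" -gt "$1" ] || [ "$4" -gt "$3" ]; then
    echo "downstream_transition_pressure"
  else
    echo "steady"
  fi
}

transition_pressure 10 14 0 0   # prints: downstream_transition_pressure
```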

Tuning Workflow

When a mostly idle single-node, single-disk deployment shows sustained CPU usage while the scanner is enabled, work through these steps in order:

1. Read /v3/scanner/status.
2. Check metrics.pacing_pressure.primary_pressure.
3. Check metrics.maintenance_control.primary_control and source entries before changing runtime controls.
4. Check runtime_config.delay, runtime_config.max_wait_seconds, and runtime_config.cycle_interval_seconds to confirm the active values and their sources.
5. Check metrics.current_cycle_objects_scanned, metrics.current_cycle_directories_scanned, and active paths to confirm the scanner is the active work.
6. If primary_pressure is throttle_pause and pause ratios are low, raise scanner.delay first.
7. If individual sleeps are too short, raise scanner.max_wait.
8. If each scan cycle finishes but starts too often, raise scanner.cycle.
9. If scans must be broken into bounded chunks, set one of the cycle budgets: scanner.cycle_max_duration, scanner.cycle_max_objects, or scanner.cycle_max_directories.
10. Recheck pacing_pressure, maintenance_control, source work, and lifecycle transition status after one or more scanner cycles.

Do not rely only on a longer cycle interval if lifecycle, replication, heal, or bitrot work must keep moving. Use source work and transition status to confirm that background maintenance is still making progress.
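
The pacing-related decisions in this workflow can be condensed into a small helper for triage scripts. This is a sketch under stated assumptions: next_tuning_step is a hypothetical function, the 0.10 cutoff for a "low" pause ratio is an arbitrary illustration and not a RustFS default, and the suggestions are paraphrases of the pacing pressure table and the workflow above. It expects primary_pressure as the first argument and last_cycle_total_pause_ratio as the second.

```shell
# Illustrative only: suggest a next step from primary_pressure and a pause ratio.
#   $1 = metrics.pacing_pressure.primary_pressure
#   $2 = last_cycle_total_pause_ratio (fraction of the last cycle duration)
#   $3 = optional cutoff for a "low" ratio (assumption: 0.10)
next_tuning_step() {
  local pressure="$1" pause_ratio="$2" low="${3:-0.10}"
  case "$pressure" in
    throttle_pause)
      # awk handles the fractional comparison; exit status 0 means ratio < cutoff.
      if awk -v r="$pause_ratio" -v t="$low" 'BEGIN { exit !((r + 0) < (t + 0)) }'; then
        echo "raise scanner.delay"
      else
        echo "pauses already high; check active paths, source work, and disk activity"
      fi ;;
    cycle_budget) echo "check last_cycle_partial_reason and last_cycle_partial_source" ;;
    queued_scans) echo "lower scanner concurrency or raise scanner.delay" ;;
    active_scans) echo "correlate with CPU and disk metrics" ;;
    none)         echo "no pacing change needed" ;;
    *)            echo "unknown pressure value" ;;
  esac
}

next_tuning_step throttle_pause 0.02   # prints: raise scanner.delay
```

The helper deliberately stops at a suggestion. Whether a change is safe still depends on the source work and transition status checks described above.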

Helm

The Helm chart exposes the scanner environment variables under config.rustfs.scanner. Example:

config:
  rustfs:
    scanner:
      speed: "slow"
      delay: "30"
      max_wait_secs: "15"
      cycle_secs: "3600"
      cycle_max_duration_secs: "1800"
      cycle_max_objects: "1000000"
      cycle_max_directories: "100000"
      idle_mode: "true"
      yield_every_n_objects: "128"
      bitrot_cycle_secs: "2592000"

Use extraEnv for experimental or unrelated environment variables that are not represented by chart values.
