diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
new file mode 100644
index 000000000..6b9cd9fc7
--- /dev/null
+++ b/ARCHITECTURE.md
@@ -0,0 +1,377 @@
+# ARCHITECTURE.md
+
+> Last updated: 2026-04-13 · Revision: 1 (draft)
+>
+> This document describes the high-level architecture of RustFS.
+> If you want to familiarize yourself with the code base, you are in the right place!
+>
+> See also [CONTRIBUTING.md](CONTRIBUTING.md) for development workflow.
+
+## Bird's Eye View
+
+RustFS is a high-performance, S3-compatible distributed object storage system written
+in Rust. It uses erasure coding for data durability, supports multi-tenancy through
+IAM/STS, and provides a web-based admin console.
+
+A running RustFS node exposes:
+
+- **S3 API** (port 9000) — the primary data path for object CRUD
+- **Admin API** (port 9000, `/minio/` prefix) — cluster management, IAM, metrics
+- **Console** (port 9001) — web UI backed by the Admin API
+- **Inter-node RPC** (gRPC/tonic) — cluster communication for distributed mode
+
+The core data flow for a PUT request looks like:
+
+```
+HTTP request
+  → server (TLS, auth, routing, compression)
+  → app/object_usecase (validation, policy, lifecycle)
+  → storage/ecfs (erasure coding, encryption, checksums)
+  → ecstore (disk pool selection, data distribution)
+  → rio (reader pipeline: encrypt → compress → hash → write)
+  → io-core (zero-copy I/O, buffer pool, direct I/O)
+  → local disk / remote disk via RPC
+```
+
+## Code Map
+
+The repository is a Cargo workspace with a flat `crates/` layout:
+
+```
+rustfs/                          # Workspace root (virtual manifest)
+├── rustfs/                      # Main binary + library crate (75K lines)
+│   └── src/
+│       ├── main.rs              # Entry point, startup sequence
+│       ├── lib.rs               # Module tree root
+│       ├── server/              # HTTP server, TLS, routing, middleware
+│       ├── admin/               # Admin API handlers and console
+│       ├── app/                 # Use-case layer (object, bucket, multipart)
+│       ├── storage/             # Storage engine interface and implementation
+│       ├── auth.rs              # S3 request authentication
+│       ├── config/              # CLI args, config parsing, workload profiles
+│       └── ...
+├── crates/                      # 39 library crates
+│   ├── ecstore/                 # Erasure-coded storage engine (⚠️ 87K lines)
+│   ├── rio/                     # Reader I/O pipeline (encrypt, compress, hash)
+│   ├── io-core/                 # Zero-copy I/O, scheduling, buffer pool
+│   ├── io-metrics/              # I/O metrics collection
+│   ├── common/                  # Shared runtime state, globals, data usage types
+│   ├── config/                  # Configuration types and parsing
+│   ├── utils/                   # Pure utility functions
+│   ├── ...                      # (see "Crate Reference" below)
+│   └── e2e_test/                # End-to-end integration tests
+└── docs/                        # Design documents and analysis
+```
+
+### Main Crate Layers (`rustfs/src/`)
+
+The main crate is organized in layers, top to bottom:
+
+| Layer | Directory | Responsibility |
+|-------|-----------|----------------|
+| **Server** | `server/` | HTTP listener, TLS, CORS, compression, middleware, graceful shutdown |
+| **Admin** | `admin/` | Admin API routing, 30+ handler modules, web console |
+| **App** | `app/` | Use-case orchestration: object_usecase, bucket_usecase, multipart_usecase |
+| **Storage** | `storage/` | S3 API translation, erasure-coded FS, SSE encryption, RPC, concurrency |
+| **Auth** | `auth.rs` | S3 signature verification, credential validation |
+| **Config** | `config/` | CLI parsing, config struct, workload profiles |
+
+A request flows **downward** through the layers. No layer should reach upward
+(e.g., storage must not import from admin).
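
The downward-only rule above can be checked mechanically. The following is a minimal sketch, not code from RustFS: the `Layer` enum, its `rank` values, and the `upward_imports` function are hypothetical names introduced for illustration, and placing Admin and App at the same rank is an assumption.

```rust
/// Request-path layers of the main crate, top to bottom as in the table above.
/// (Hypothetical type; Auth and Config are left out of this sketch.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Layer {
    Server,
    Admin,
    App,
    Storage,
}

impl Layer {
    /// Assumed rank: a smaller number is closer to the HTTP edge.
    /// Admin and App are treated as peers (an assumption of this sketch).
    pub fn rank(self) -> u8 {
        match self {
            Layer::Server => 0,
            Layer::Admin | Layer::App => 1,
            Layer::Storage => 2,
        }
    }
}

/// Returns every `(importer, imported)` edge where a lower layer reaches upward.
/// Imports between layers of the same rank are not flagged.
pub fn upward_imports(edges: &[(Layer, Layer)]) -> Vec<(Layer, Layer)> {
    edges
        .iter()
        .copied()
        .filter(|(from, to)| from.rank() > to.rank())
        .collect()
}
```

Given a list of import edges (for example, collected from `use` statements by some external tool), a call such as `upward_imports(&[(Layer::Storage, Layer::Admin)])` would report that single edge as a violation.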
+
+### Crate Reference
+
+Crates are organized in a dependency DAG with 9 depth levels (0 = leaf, 8 = top):
+
+```
+Depth 0 — LEAF (no internal deps):
+  appauth, checksums, config, credentials, crypto, io-metrics,
+  madmin, mcp, s3-common, workers, zip
+
+Depth 1:
+  io-core (→ io-metrics)
+  policy (→ config, credentials, crypto)
+  utils (→ config)  ⚠️ inverted: utils should be leaf
+
+Depth 2:
+  concurrency, filemeta, keystone, kms, lock, obs,
+  signer, targets, trusted-proxies
+
+Depth 3:
+  common (→ filemeta, madmin)  ⚠️ inverted: common should be leaf
+
+Depth 4:
+  object-capacity, protos, rio
+
+Depth 5 — CORE:
+  ecstore (16 internal deps, 11 dependents — the architectural heart)
+
+Depth 6:
+  audit, heal, iam, metrics, notify, s3select-api, scanner
+
+Depth 7:
+  object-io, protocols, s3select-query
+
+Depth 8 — TOP:
+  rustfs (35 internal deps — the binary, depends on almost everything)
+```
+
+#### By Domain
+
+**Core Infrastructure:**
+
+| Crate | Lines | Purpose |
+|-------|-------|---------|
+| `config` | 3.3K | Configuration types and environment parsing |
+| `utils` | 8.7K | Pure utilities (paths, compression, network, retry) |
+| `common` | 4.4K | Shared runtime state, globals, data usage types, metrics |
+| `madmin` | 5.5K | Admin API request/response types |
+
+**I/O Pipeline:**
+
+| Crate | Lines | Purpose |
+|-------|-------|---------|
+| `io-core` | 6.5K | Zero-copy I/O, buffer pool, direct I/O, scheduling, backpressure |
+| `io-metrics` | 4.5K | I/O operation metrics and counters |
+| `rio` | 6.9K | Composable reader chain (encrypt → compress → hash → limit) |
+| `object-io` | 2.4K | High-level object read/write using rio + ecstore |
+| `concurrency` | 1.8K | Concurrency control wrappers over io-core |
+
+**Storage Engine:**
+
+| Crate | Lines | Purpose |
+|-------|-------|---------|
+| `ecstore` | 87K | ⚠️ Erasure-coded storage: disks, pools, buckets, replication, lifecycle |
+| `filemeta` | 10K | File/object metadata types and versioning |
+| `checksums` | 732 | Checksum computation |
+| `lock` | 7.1K | Distributed lock manager |
+| `heal` | 5.9K | Data healing / bitrot repair |
+| `scanner` | 5.4K | Background data usage scanner |
+| `object-capacity` | 2.5K | Capacity tracking and management |
+
+**Security & Auth:**
+
+| Crate | Lines | Purpose |
+|-------|-------|---------|
+| `crypto` | 1.6K | Encryption primitives |
+| `credentials` | 713 | Credential types (access key / secret key) |
+| `signer` | 1.4K | S3 v4 request signing |
+| `iam` | 9.0K | Identity and access management |
+| `policy` | 8.8K | Policy engine (S3 bucket/IAM policies) |
+| `kms` | 8.1K | Key management service integration |
+| `keystone` | 1.9K | OpenStack Keystone auth |
+| `appauth` | 143 | Application-level auth tokens |
+
+**Protocol & API:**
+
+| Crate | Lines | Purpose |
+|-------|-------|---------|
+| `protos` | 5.7K | Protobuf/gRPC definitions for inter-node RPC |
+| `protocols` | 18K | FTP/FTPS, WebDAV, Swift API support |
+| `s3-common` | 738 | Shared S3 types |
+| `s3select-api` | 1.9K | S3 Select interface |
+| `s3select-query` | 3.6K | S3 Select query engine |
+
+**Observability:**
+
+| Crate | Lines | Purpose |
+|-------|-------|---------|
+| `metrics` | 8.4K | Prometheus metric collectors |
+| `io-metrics` | 4.5K | I/O-specific metrics |
+| `obs` | 5.6K | OpenTelemetry tracing and telemetry |
+| `audit` | 2.4K | Audit logging |
+
+**Events:**
+
+| Crate | Lines | Purpose |
+|-------|-------|---------|
+| `notify` | 5.5K | Event notification system |
+| `targets` | 3.2K | Notification targets (Kafka, AMQP, webhook, etc.) |
+
+**Other:**
+
+| Crate | Lines | Purpose |
+|-------|-------|---------|
+| `trusted-proxies` | 4.0K | Trusted proxy / IP forwarding |
+| `zip` | 986 | ZIP archive support for bulk downloads |
+| `workers` | 136 | Simple worker abstraction |
+| `mcp` | 2.0K | Model Context Protocol server (AI tooling) |
+
+## Architecture Invariants
+
+> These are rules that the codebase should follow. Some are currently violated
+> (marked with ⚠️). Documenting them here makes the violations explicit and
+> trackable.
+
+1. **Layers flow downward.** Server → Admin/App → Storage → ecstore → rio/io-core.
+   No upward imports.
+
+2. **Leaf crates have zero internal dependencies.** `config`, `credentials`, `crypto`,
+   `io-metrics`, `madmin`, `s3-common` should depend only on external crates.
+   - ⚠️ VIOLATED: `utils` and `common` are also meant to be leaves, but `utils` depends on `config` and `common` depends on `filemeta` and `madmin`.
+
+3. **Each type has exactly one definition.** Types shared across crates must be defined
+   in one crate and re-exported or imported by others.
+   - ⚠️ VIOLATED: `ReplicationStats` (4 copies), `LastMinuteLatency` (3 copies),
+     `BackpressureConfig` (3 copies), `DataUsageInfo` (2 copies).
+
+4. **ecstore does not know about HTTP or S3 protocol details.** It operates on
+   storage-level abstractions (objects, buckets, disks, pools).
+
+5. **The `rustfs` binary crate is the only place that wires everything together.**
+   Individual crates should be testable in isolation.
+
+6. **Error types use `thiserror` with descriptive names** (e.g., `StorageError`,
+   not bare `Error`).
+   - ⚠️ VIOLATED: 6 crates use `pub enum Error`; 2 crates use `snafu`;
+     `mcp` and `heal` use `anyhow` in library code.
+
+## Known Structural Issues
+
+> This section documents known problems in the current architecture.
+> It exists so the team can track and address them deliberately.
+
+### Critical
+
+- **common/scanner code duplication (~3K lines).** `scanner` depends on `common`
+  but maintains its own copies of `DataUsageInfo`, `LastMinuteLatency`, and related
+  types instead of importing them.
+
+- **ecstore is a monolith (87K lines, 163 files).** It contains disk management,
+  bucket management, erasure coding, replication, lifecycle, RPC, and configuration
+  — all in one crate. It should be decomposed along its existing subdirectories.
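
One possible shape for fixing the `common`/`scanner` duplication is to keep a single definition and re-export it. The sketch below uses two modules as stand-ins for the two crates. The fields and methods of `LastMinuteLatency` and the `scan_latency` helper are hypothetical, since this document does not show the real definitions.

```rust
/// Stand-in for the `common` crate: the single home of the shared type.
pub mod common {
    /// Hypothetical shape; the real fields are not described in this document.
    #[derive(Debug, Clone, Default, PartialEq)]
    pub struct LastMinuteLatency {
        pub total_ms: u64,
        pub samples: u64,
    }

    impl LastMinuteLatency {
        /// Adds one observation to the running totals.
        pub fn record(&mut self, elapsed_ms: u64) {
            self.total_ms += elapsed_ms;
            self.samples += 1;
        }

        /// Average latency in milliseconds, or 0 when nothing was recorded.
        pub fn average_ms(&self) -> u64 {
            if self.samples == 0 {
                0
            } else {
                self.total_ms / self.samples
            }
        }
    }
}

/// Stand-in for the `scanner` crate: it re-exports instead of redefining.
pub mod scanner {
    pub use super::common::LastMinuteLatency;

    /// Hypothetical helper that works directly on the shared type.
    pub fn scan_latency(samples_ms: &[u64]) -> LastMinuteLatency {
        let mut latency = LastMinuteLatency::default();
        for &ms in samples_ms {
            latency.record(ms);
        }
        latency
    }
}
```

With this layout `scanner::LastMinuteLatency` and `common::LastMinuteLatency` are the same type, so values pass between the two without conversion code, and existing `scanner::` paths keep compiling after the local copy is removed.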
+
+### High
+
+- **Dependency inversions.** `utils → config` and `common → filemeta/madmin` break
+  the layering model. These need to be untangled.
+
+- **Three-layer BackpressureConfig/DeadlockConfig duplication** across io-core,
+  concurrency, and rustfs/storage. Each should be defined once and shared via builder/composition.
+
+### Medium
+
+- **Inconsistent error handling.** Three strategies (thiserror/snafu/anyhow) and
+  mixed naming (bare `Error` vs descriptive names).
+
+- **Ambiguous common vs utils boundary.** Both are described as "utilities and data
+  structures." Clear ownership rules are needed.
+
+## Cross-Cutting Concerns
+
+### Error Handling
+
+The project convention is `thiserror` for typed errors with descriptive names.
+See `AGENTS.md`: "Prefer thiserror for library-facing error types."
+
+```rust
+// GOOD
+#[derive(Debug, thiserror::Error)]
+pub enum StorageError {
+    #[error("disk not found: {0}")]
+    DiskNotFound(String),
+}
+
+// AVOID
+// pub enum Error { ... }   // too generic
+// anyhow::Result<T>        // in library code (OK in tests/CLI)
+```
+
+### Logging & Tracing
+
+- Use `tracing` crate (`info!`, `warn!`, `error!`, `debug!`, `trace!`)
+- Structured fields: `tracing::info!(bucket = %name, "created bucket")`
+- Spans for request-scoped context
+
+### Metrics
+
+- Prometheus-style metrics via `rustfs-metrics` crate
+- I/O-specific counters via `rustfs-io-metrics`
+- Registration happens at crate level, collection in `metrics` crate
+
+### Testing
+
+- Unit tests: `#[cfg(test)] mod tests` in the same file
+- Integration tests: inside respective crates (not top-level `tests/`)
+- E2E tests: `crates/e2e_test/` — tests against a running server
+- Run all: `make test` or `cargo nextest run`
+
+## Startup Sequence
+
+The binary (`main.rs`) boots in this order:
+
+1. Environment variable compatibility (`MINIO_*` → `RUSTFS_*`)
+2. Tokio runtime construction
+3. CLI argument parsing
+4. License, observability, TLS, trusted proxies initialization
+5. Config parsing, server address resolution
+6. Credentials, endpoints, local disks, lock client initialization
+7. Capacity management initialization
+8. HTTP server start (S3 API + optional console)
+9. ECStore initialization (erasure coding storage engine)
+10. Global config, background replication, KMS
+11. Optional: FTP/FTPS/WebDAV servers
+12. Event notifier, audit system, deadlock detector
+13. Bucket metadata, IAM, Keystone, OIDC
+14. Scanner and heal manager
+15. Metrics system, mark `FullReady`
+16. Wait for shutdown signal → graceful shutdown
+
+## Dependency Diagram (Simplified)
+
+```
+                       ┌─────────┐
+                       │ rustfs  │  (binary + lib, 75K lines)
+                       │  main   │
+                       └────┬────┘
+                            │
+            ┌───────────────┼───────────────┐
+            │               │               │
+       ┌────▼────┐     ┌────▼────┐    ┌─────▼─────┐
+       │ server  │     │  admin  │    │    app    │
+       │ (HTTP)  │     │(console)│    │(use-cases)│
+       └────┬────┘     └────┬────┘    └─────┬─────┘
+            │               │               │
+            └───────────────┼───────────────┘
+                            │
+                     ┌──────▼──────┐
+                     │   storage   │
+                     │ (ecfs, SSE, │
+                     │  RPC, ACL)  │
+                     └──────┬──────┘
+                            │
+         ┌──────────────────┼──────────────────┐
+         │                  │                  │
+   ┌─────▼──────┐    ┌──────▼──────┐    ┌──────▼──────┐
+   │  ecstore   │    │     rio     │    │   io-core   │
+   │ (87K,core) │    │  (readers)  │    │ (zero-copy) │
+   └─────┬──────┘    └─────────────┘    └─────────────┘
+         │
+┌─────┬──┼──┬─────┬──────┐
+│     │  │  │     │      │
+common utils config policy filemeta ...
+```
+
+## How to Navigate
+
+- **"Where does S3 PutObject go?"**
+  `server/` routes → `app/object_usecase` validates → `storage/ecfs` encodes →
+  `ecstore` distributes → `rio` encrypts/compresses → `io-core` writes
+
+- **"Where are bucket policies enforced?"**
+  `app/bucket_usecase` calls into `crates/policy/`
+
+- **"Where is replication configured?"**
+  `admin/handlers/replication.rs` and `admin/handlers/site_replication.rs` for API,
+  `ecstore/src/bucket/replication/` for engine
+
+- **"Where do I add a new admin endpoint?"**
+  Add handler in `admin/handlers/`, register in `admin/router.rs`
+
+- **"Where do I add a new metric?"**
+  Define in `crates/metrics/`, register collector, expose via `/minio/v2/metrics`
+
+---
+
+*Inspired by [matklad's ARCHITECTURE.md](https://matklad.github.io/2021/02/06/ARCHITECTURE.md.html)
+and [rust-analyzer's architecture.md](https://github.com/rust-analyzer/rust-analyzer/blob/master/docs/book/src/contributing/architecture.md).*