diff --git a/.claude/skills/coding-guidelines/SKILL.md b/.claude/skills/coding-guidelines/SKILL.md deleted file mode 100644 index f66acc2ab..000000000 --- a/.claude/skills/coding-guidelines/SKILL.md +++ /dev/null @@ -1,94 +0,0 @@ ---- -name: coding-guidelines -description: "Use when asking about Rust code style or best practices. Keywords: naming, formatting, comment, clippy, rustfmt, lint, code style, best practice, P.NAM, G.FMT, code review, naming convention, variable naming, function naming, type naming, 命名规范, 代码风格, 格式化, 最佳实践, 代码审查, 怎么命名" -source: https://rust-coding-guidelines.github.io/rust-coding-guidelines-zh/ ---- - -# Rust Coding Guidelines (50 Core Rules) - -## Naming (Rust-Specific) - -| Rule | Guideline | -|------|-----------| -| No `get_` prefix | `fn name()` not `fn get_name()` | -| Iterator convention | `iter()` / `iter_mut()` / `into_iter()` | -| Conversion naming | `as_` (cheap &), `to_` (expensive), `into_` (ownership) | -| Static var prefix | `G_CONFIG` for `static`, no prefix for `const` | - -## Data Types - -| Rule | Guideline | -|------|-----------| -| Use newtypes | `struct Email(String)` for domain semantics | -| Prefer slice patterns | `if let [first, .., last] = slice` | -| Pre-allocate | `Vec::with_capacity()`, `String::with_capacity()` | -| Avoid Vec abuse | Use arrays for fixed sizes | - -## Strings - -| Rule | Guideline | -|------|-----------| -| Prefer bytes | `s.bytes()` over `s.chars()` when ASCII | -| Use `Cow` | When might modify borrowed data | -| Use `format!` | Over string concatenation with `+` | -| Avoid nested iteration | `contains()` on string is O(n*m) | - -## Error Handling - -| Rule | Guideline | -|------|-----------| -| Use `?` propagation | Not `try!()` macro | -| `expect()` over `unwrap()` | When value guaranteed | -| Assertions for invariants | `assert!` at function entry | - -## Memory - -| Rule | Guideline | -|------|-----------| -| Meaningful lifetimes | `'src`, `'ctx` not just `'a` | -| `try_borrow()` for 
RefCell | Avoid panic | -| Shadowing for transformation | `let x = x.parse()?` | - -## Concurrency - -| Rule | Guideline | -|------|-----------| -| Identify lock ordering | Prevent deadlocks | -| Atomics for primitives | Not Mutex for bool/usize | -| Choose memory order carefully | Relaxed/Acquire/Release/SeqCst | - -## Async - -| Rule | Guideline | -|------|-----------| -| Sync for CPU-bound | Async is for I/O | -| Don't hold locks across await | Use scoped guards | - -## Macros - -| Rule | Guideline | -|------|-----------| -| Avoid unless necessary | Prefer functions/generics | -| Follow Rust syntax | Macro input should look like Rust | - -## Deprecated → Better - -| Deprecated | Better | Since | -|------------|--------|-------| -| `lazy_static!` | `std::sync::OnceLock` | 1.70 | -| `once_cell::Lazy` | `std::sync::LazyLock` | 1.80 | -| `std::sync::mpsc` | `crossbeam::channel` | - | -| `std::sync::Mutex` | `parking_lot::Mutex` | - | -| `failure`/`error-chain` | `thiserror`/`anyhow` | - | -| `try!()` | `?` operator | 2018 | - -## Quick Reference - -``` -Naming: snake_case (fn/var), CamelCase (type), SCREAMING_CASE (const) -Format: rustfmt (just use it) -Docs: /// for public items, //! for module docs -Lint: #![warn(clippy::all)] -``` - -Claude knows Rust conventions well. These are the non-obvious Rust-specific rules. 
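Several rows in the deleted coding-guidelines tables (no `get_` prefix, `as_`/`to_`/`into_` conversions, newtypes, slice patterns, `Cow`, pre-allocation, `expect()` for guaranteed values) can be illustrated together. The following is a minimal std-only sketch. The `Email` type, its deliberately simplistic validation rule, and the helpers `expand_tabs`, `first_and_last`, and `domains` are hypothetical illustrations. They are not part of the deleted skill or of any crate.

```rust
use std::borrow::Cow;

/// Newtype that gives a bare `String` domain meaning ("Use newtypes").
/// `Email` and its validation rule are hypothetical illustrations only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Email(String);

impl Email {
    /// Validating constructor. The check (exactly one '@' with text on both
    /// sides) is a simplistic assumption, not real address parsing.
    pub fn parse(raw: &str) -> Result<Self, String> {
        match raw.split_once('@') {
            Some((user, domain))
                if !user.is_empty() && !domain.is_empty() && !domain.contains('@') =>
            {
                Ok(Email(raw.to_owned()))
            }
            _ => Err(format!("invalid email: {raw}")),
        }
    }

    /// Getter named after what it returns: `domain()`, not `get_domain()`.
    pub fn domain(&self) -> &str {
        // `parse` guarantees an '@', so `expect` documents the invariant
        // instead of a bare `unwrap()`.
        self.0
            .split_once('@')
            .expect("Email invariant: contains '@'")
            .1
    }

    /// `as_` prefix: cheap borrow, no allocation.
    pub fn as_str(&self) -> &str {
        &self.0
    }

    /// `to_` prefix: does real work and allocates a new value.
    pub fn to_lowercase(&self) -> Email {
        Email(self.0.to_lowercase())
    }

    /// `into_` prefix: consumes `self` and hands back ownership.
    pub fn into_inner(self) -> String {
        self.0
    }
}

/// `Cow`: return the input borrowed when unchanged, allocate only if modified.
pub fn expand_tabs(input: &str) -> Cow<'_, str> {
    if input.contains('\t') {
        Cow::Owned(input.replace('\t', "    "))
    } else {
        Cow::Borrowed(input)
    }
}

/// Slice patterns instead of manual indexing and length checks.
pub fn first_and_last<'a>(items: &[&'a str]) -> Option<(&'a str, &'a str)> {
    match items {
        [] => None,
        [only] => Some((*only, *only)),
        [first, .., last] => Some((*first, *last)),
    }
}

/// Pre-allocate when the final length is known up front.
pub fn domains(emails: &[Email]) -> Vec<&str> {
    let mut out = Vec::with_capacity(emails.len());
    for email in emails {
        out.push(email.domain());
    }
    out
}
```

Keeping the `String` private inside the newtype means every `Email` value has passed `parse`, which is what justifies `expect` in `domain()` rather than returning a `Result`.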
diff --git a/.claude/skills/coding-guidelines/clippy-lints/_index.md b/.claude/skills/coding-guidelines/clippy-lints/_index.md deleted file mode 100644 index 23b9d8be8..000000000 --- a/.claude/skills/coding-guidelines/clippy-lints/_index.md +++ /dev/null @@ -1,16 +0,0 @@ -# Clippy Lint → Rule Mapping - -| Clippy Lint | Category | Fix | -|-------------|----------|-----| -| `unwrap_used` | Error | Use `?` or `expect()` | -| `needless_clone` | Perf | Use reference | -| `await_holding_lock` | Async | Scope guard before await | -| `linkedlist` | Perf | Use Vec/VecDeque | -| `wildcard_imports` | Style | Explicit imports | -| `missing_safety_doc` | Safety | Add `# Safety` doc | -| `undocumented_unsafe_blocks` | Safety | Add `// SAFETY:` | -| `transmute_ptr_to_ptr` | Safety | Use `pointer::cast()` | -| `large_stack_arrays` | Mem | Use Vec or Box | -| `too_many_arguments` | Design | Use struct params | - -For unsafe-related lints → see `unsafe-checker` skill. diff --git a/.claude/skills/coding-guidelines/index/rules-index.md b/.claude/skills/coding-guidelines/index/rules-index.md deleted file mode 100644 index 19c27b818..000000000 --- a/.claude/skills/coding-guidelines/index/rules-index.md +++ /dev/null @@ -1,6 +0,0 @@ -# Complete Rules Reference - -For the full 500+ rules, see: -- Source: https://rust-coding-guidelines.github.io/rust-coding-guidelines-zh/ - -Core rules are in `../SKILL.md`. diff --git a/.claude/skills/core-actionbook/SKILL.md b/.claude/skills/core-actionbook/SKILL.md deleted file mode 100644 index bc5e89453..000000000 --- a/.claude/skills/core-actionbook/SKILL.md +++ /dev/null @@ -1,49 +0,0 @@ ---- -name: core-actionbook -# Internal tool - no description to prevent auto-triggering -# Used by: rust-learner agents for pre-computed selectors ---- - -# Actionbook - -Pre-computed action manuals for browser automation. Agents receive structured page information instead of parsing entire HTML. - -## Workflow - -1. 
**search_actions** - Search by keyword, returns URL-based action IDs with content previews -2. **get_action_by_id** - Get full action manual with page details, DOM structure, and element selectors -3. **Execute** - Use returned selectors with your browser automation tool - -## MCP Tools - -- `search_actions` - Search by keyword. Returns: URL-based action IDs, content previews, relevance scores -- `get_action_by_id` - Get full action details. Returns: action content, page element selectors (CSS/XPath), element types, allowed methods (click, type, extract), document metadata - -### Parameters - -**search_actions**: -- `query` (required): Search keyword (e.g., "airbnb search", "google login") -- `type`: `vector` | `fulltext` | `hybrid` (default) -- `limit`: Max results (default: 5) -- `sourceIds`: Filter by source IDs (comma-separated) -- `minScore`: Minimum relevance score (0-1) - -**get_action_by_id**: -- `id` (required): URL-based action ID (e.g., `example.com/page`) - -## Example Response - -```json -{ - "title": "Airbnb Search", - "url": "www.airbnb.com/search", - "elements": [ - { - "name": "location_input", - "selector": "input[data-testid='structured-search-input-field-query']", - "type": "textbox", - "methods": ["type", "fill"] - } - ] -} -``` diff --git a/.claude/skills/core-agent-browser/SKILL.md b/.claude/skills/core-agent-browser/SKILL.md deleted file mode 100644 index d5d605027..000000000 --- a/.claude/skills/core-agent-browser/SKILL.md +++ /dev/null @@ -1,115 +0,0 @@ ---- -name: core-agent-browser -# Internal tool - no description to prevent auto-triggering -# Used by: rust-learner, docs-researcher, crate-researcher agents ---- - -# Browser Automation with agent-browser - -## Priority Note - -For fetching Rust/crate information, use this priority order: -1. **rust-learner skill** - Orchestrates actionbook + browser-fetcher -2. **actionbook MCP** - Pre-computed selectors for known sites -3. 
**agent-browser CLI** - Direct browser automation (last resort) - -Use agent-browser directly only when: -- actionbook has no pre-computed selectors for the target site -- You need interactive browser testing/automation -- You need screenshots or form filling - -## Quick start - -```bash -agent-browser open # Navigate to page -agent-browser snapshot -i # Get interactive elements with refs -agent-browser click @e1 # Click element by ref -agent-browser fill @e2 "text" # Fill input by ref -agent-browser close # Close browser -``` - -## Core workflow - -1. Navigate: `agent-browser open ` -2. Snapshot: `agent-browser snapshot -i` (returns elements with refs like `@e1`, `@e2`) -3. Interact using refs from the snapshot -4. Re-snapshot after navigation or significant DOM changes - -## Commands - -### Navigation -```bash -agent-browser open # Navigate to URL -agent-browser back # Go back -agent-browser forward # Go forward -agent-browser reload # Reload page -agent-browser close # Close browser -``` - -### Snapshot (page analysis) -```bash -agent-browser snapshot # Full accessibility tree -agent-browser snapshot -i # Interactive elements only (recommended) -agent-browser snapshot -c # Compact output -agent-browser snapshot -d 3 # Limit depth to 3 -``` - -### Interactions (use @refs from snapshot) -```bash -agent-browser click @e1 # Click -agent-browser dblclick @e1 # Double-click -agent-browser fill @e2 "text" # Clear and type -agent-browser type @e2 "text" # Type without clearing -agent-browser press Enter # Press key -agent-browser press Control+a # Key combination -agent-browser hover @e1 # Hover -agent-browser check @e1 # Check checkbox -agent-browser uncheck @e1 # Uncheck checkbox -agent-browser select @e1 "value" # Select dropdown -agent-browser scroll down 500 # Scroll page -agent-browser scrollintoview @e1 # Scroll element into view -``` - -### Get information -```bash -agent-browser get text @e1 # Get element text -agent-browser get value @e1 # Get input value 
-agent-browser get title # Get page title -agent-browser get url # Get current URL -``` - -### Screenshots -```bash -agent-browser screenshot # Screenshot to stdout -agent-browser screenshot path.png # Save to file -agent-browser screenshot --full # Full page -``` - -### Wait -```bash -agent-browser wait @e1 # Wait for element -agent-browser wait 2000 # Wait milliseconds -agent-browser wait --text "Success" # Wait for text -agent-browser wait --load networkidle # Wait for network idle -``` - -### Semantic locators (alternative to refs) -```bash -agent-browser find role button click --name "Submit" -agent-browser find text "Sign In" click -agent-browser find label "Email" fill "user@test.com" -``` - -## Example: Form submission - -```bash -agent-browser open https://example.com/form -agent-browser snapshot -i -# Output shows: textbox "Email" [ref=e1], textbox "Password" [ref=e2], button "Submit" [ref=e3] - -agent-browser fill @e1 "user@example.com" -agent-browser fill @e2 "password123" -agent-browser click @e3 -agent-browser wait --load networkidle -agent-browser snapshot -i # Check result -``` diff --git a/.claude/skills/core-dynamic-skills/SKILL.md b/.claude/skills/core-dynamic-skills/SKILL.md deleted file mode 100644 index 081f6e667..000000000 --- a/.claude/skills/core-dynamic-skills/SKILL.md +++ /dev/null @@ -1,87 +0,0 @@ ---- -name: core-dynamic-skills -# Command-based tool - no description to prevent auto-triggering -# Triggered by: /sync-crate-skills, /clean-crate-skills, /update-crate-skill ---- - -# Dynamic Skills Manager - -Orchestrates on-demand generation of crate-specific skills based on project dependencies. - -## Concept - -Dynamic skills are: -- Generated locally at `~/.claude/skills/` -- Based on Cargo.toml dependencies -- Created using llms.txt from docs.rs -- Versioned and updatable -- Not committed to the rust-skills repository - -## Trigger Scenarios - -### Prompt-on-Open - -When entering a directory with Cargo.toml: -1. 
Detect Cargo.toml (single or workspace) -2. Parse dependencies list -3. Check which crates are missing skills -4. If missing: "Found X dependencies without skills. Sync now?" -5. If confirmed: run `/sync-crate-skills` - -### Manual Commands - -- `/sync-crate-skills` - Sync all dependencies -- `/clean-crate-skills [crate]` - Remove skills -- `/update-crate-skill ` - Update specific skill - -## Architecture - -``` -Cargo.toml - ↓ -Parse dependencies - ↓ -For each crate: - ├─ Check ~/.claude/skills/{crate}/ - ├─ If missing: Check actionbook for llms.txt - │ ├─ Found: /create-skills-via-llms - │ └─ Not found: /create-llms-for-skills first - └─ Load skill -``` - -## Local Skills Directory - -``` -~/.claude/skills/ -├── tokio/ -│ ├── SKILL.md -│ └── references/ -├── serde/ -│ ├── SKILL.md -│ └── references/ -└── axum/ - ├── SKILL.md - └── references/ -``` - -## Workflow Priority - -1. **actionbook MCP** - Check for pre-generated llms.txt -2. **/create-llms-for-skills** - Generate llms.txt from docs.rs -3. **/create-skills-via-llms** - Create skills from llms.txt - -## Workspace Support - -For Cargo workspace projects: -1. Parse root Cargo.toml for `[workspace] members` -2. Collect all member Cargo.toml paths -3. Aggregate all dependencies -4. 
Deduplicate before skill generation - -## Related Commands - -- `/sync-crate-skills` - Main sync command -- `/clean-crate-skills` - Cleanup command -- `/update-crate-skill` - Update command -- `/create-llms-for-skills` - Generate llms.txt -- `/create-skills-via-llms` - Create skills from llms.txt diff --git a/.claude/skills/core-fix-skill-docs/SKILL.md b/.claude/skills/core-fix-skill-docs/SKILL.md deleted file mode 100644 index c4d7fef2e..000000000 --- a/.claude/skills/core-fix-skill-docs/SKILL.md +++ /dev/null @@ -1,99 +0,0 @@ ---- -name: core-fix-skill-docs -# Internal maintenance tool - no description to prevent auto-triggering -# Triggered by: /fix-skill-docs command ---- - -# Fix Skill Documentation - -Check and fix missing reference files in dynamic skills. - -## Usage - -``` -/fix-skill-docs [crate_name] [--check-only] [--remove-invalid] -``` - -**Arguments:** -- `crate_name`: Specific crate to check (optional, defaults to all) -- `--check-only`: Only report issues, don't fix -- `--remove-invalid`: Remove invalid references instead of creating files - -## Instructions - -### 1. Scan Skills Directory - -```bash -# If crate_name provided -skill_dir=~/.claude/skills/{crate_name} - -# Otherwise scan all -for dir in ~/.claude/skills/*/; do - # Process each skill -done -``` - -### 2. Parse SKILL.md for References - -Extract referenced files from Documentation section: - -```markdown -## Documentation -- `./references/file1.md` - Description -``` - -### 3. Check File Existence - -```bash -if [ ! -f "{skill_dir}/references/{filename}" ]; then - echo "MISSING: {filename}" -fi -``` - -### 4. Report Status - -``` -=== {crate_name} === -SKILL.md: ✅ -references/: - - sync.md: ✅ - - runtime.md: ❌ MISSING - -Action needed: 1 file missing -``` - -### 5. Fix Missing Files - -**--check-only**: Only report, don't fix. - -**--remove-invalid**: Update SKILL.md to remove invalid references. 
- -**Default**: Generate missing files using agent-browser: - -```bash -agent-browser "Navigate to docs.rs/{crate_name}/latest/{crate_name}/{module}/ -Extract documentation for {topic}. Save as markdown." -``` - -### 6. Update SKILL.md - -Ensure Documentation section matches actual files. - -## Tool Priority - -1. **agent-browser CLI** - Generate missing documentation -2. **WebFetch** - Fallback if agent-browser unavailable -3. **Edit SKILL.md** - Remove invalid references (--remove-invalid) - -## Example - -```bash -# Check all skills -/fix-skill-docs --check-only - -# Fix specific crate -/fix-skill-docs tokio - -# Remove invalid references -/fix-skill-docs tokio --remove-invalid -``` diff --git a/.claude/skills/domain-cli/SKILL.md b/.claude/skills/domain-cli/SKILL.md deleted file mode 100644 index cd18d0072..000000000 --- a/.claude/skills/domain-cli/SKILL.md +++ /dev/null @@ -1,160 +0,0 @@ ---- -name: domain-cli -description: "Use when building CLI tools. Keywords: CLI, command line, terminal, clap, structopt, argument parsing, subcommand, interactive, TUI, ratatui, crossterm, indicatif, progress bar, colored output, shell completion, config file, environment variable, 命令行, 终端应用, 参数解析" -globs: ["**/Cargo.toml"] ---- - -# CLI Domain - -> **Layer 3: Domain Constraints** - -## Domain Constraints → Design Implications - -| Domain Rule | Design Constraint | Rust Implication | -|-------------|-------------------|------------------| -| User ergonomics | Clear help, errors | clap derive macros | -| Config precedence | CLI > env > file | Layered config loading | -| Exit codes | Non-zero on error | Proper Result handling | -| Stdout/stderr | Data vs errors | eprintln! for errors | -| Interruptible | Handle Ctrl+C | Signal handling | - ---- - -## Critical Constraints - -### User Communication - -``` -RULE: Errors to stderr, data to stdout -WHY: Pipeable output, scriptability -RUST: eprintln! for errors, println! 
for data -``` - -### Configuration Priority - -``` -RULE: CLI args > env vars > config file > defaults -WHY: User expectation, override capability -RUST: Layered config with clap + figment/config -``` - -### Exit Codes - -``` -RULE: Return non-zero on any error -WHY: Script integration, automation -RUST: main() -> Result<(), Error> or explicit exit() -``` - ---- - -## Trace Down ↓ - -From constraints to design (Layer 2): - -``` -"Need argument parsing" - ↓ m05-type-driven: Derive structs for args - ↓ clap: #[derive(Parser)] - -"Need config layering" - ↓ m09-domain: Config as domain object - ↓ figment/config: Layer sources - -"Need progress display" - ↓ m12-lifecycle: Progress bar as RAII - ↓ indicatif: ProgressBar -``` - ---- - -## Key Crates - -| Purpose | Crate | -|---------|-------| -| Argument parsing | clap | -| Interactive prompts | dialoguer | -| Progress bars | indicatif | -| Colored output | colored | -| Terminal UI | ratatui | -| Terminal control | crossterm | -| Console utilities | console | - -## Design Patterns - -| Pattern | Purpose | Implementation | -|---------|---------|----------------| -| Args struct | Type-safe args | `#[derive(Parser)]` | -| Subcommands | Command hierarchy | `#[derive(Subcommand)]` | -| Config layers | Override precedence | CLI > env > file | -| Progress | User feedback | `ProgressBar::new(len)` | - -## Code Pattern: CLI Structure - -```rust -use clap::{Parser, Subcommand}; - -#[derive(Parser)] -#[command(name = "myapp", about = "My CLI tool")] -struct Cli { - /// Enable verbose output - #[arg(short, long)] - verbose: bool, - - #[command(subcommand)] - command: Commands, -} - -#[derive(Subcommand)] -enum Commands { - /// Initialize a new project - Init { name: String }, - /// Run the application - Run { - #[arg(short, long)] - port: Option, - }, -} - -fn main() -> anyhow::Result<()> { - let cli = Cli::parse(); - match cli.command { - Commands::Init { name } => init_project(&name)?, - Commands::Run { port } => 
run_server(port.unwrap_or(8080))?, - } - Ok(()) -} -``` - ---- - -## Common Mistakes - -| Mistake | Domain Violation | Fix | -|---------|-----------------|-----| -| Errors to stdout | Breaks piping | eprintln! | -| No help text | Poor UX | #[arg(help = "...")] | -| Panic on error | Bad exit code | Result + proper handling | -| No progress for long ops | User uncertainty | indicatif | - ---- - -## Trace to Layer 1 - -| Constraint | Layer 2 Pattern | Layer 1 Implementation | -|------------|-----------------|------------------------| -| Type-safe args | Derive macros | clap Parser | -| Error handling | Result propagation | anyhow + exit codes | -| User feedback | Progress RAII | indicatif ProgressBar | -| Config precedence | Builder pattern | Layered sources | - ---- - -## Related Skills - -| When | See | -|------|-----| -| Error handling | m06-error-handling | -| Type-driven args | m05-type-driven | -| Progress lifecycle | m12-lifecycle | -| Async CLI | m07-concurrency | diff --git a/.claude/skills/domain-cloud-native/SKILL.md b/.claude/skills/domain-cloud-native/SKILL.md deleted file mode 100644 index 3a068693c..000000000 --- a/.claude/skills/domain-cloud-native/SKILL.md +++ /dev/null @@ -1,165 +0,0 @@ ---- -name: domain-cloud-native -description: "Use when building cloud-native apps. 
Keywords: kubernetes, k8s, docker, container, grpc, tonic, microservice, service mesh, observability, tracing, metrics, health check, cloud, deployment, 云原生, 微服务, 容器" ---- - -# Cloud-Native Domain - -> **Layer 3: Domain Constraints** - -## Domain Constraints → Design Implications - -| Domain Rule | Design Constraint | Rust Implication | -|-------------|-------------------|------------------| -| 12-Factor | Config from env | Environment-based config | -| Observability | Metrics + traces | tracing + opentelemetry | -| Health checks | Liveness/readiness | Dedicated endpoints | -| Graceful shutdown | Clean termination | Signal handling | -| Horizontal scale | Stateless design | No local state | -| Container-friendly | Small binaries | Release optimization | - ---- - -## Critical Constraints - -### Stateless Design - -``` -RULE: No local persistent state -WHY: Pods can be killed/rescheduled anytime -RUST: External state (Redis, DB), no static mut -``` - -### Graceful Shutdown - -``` -RULE: Handle SIGTERM, drain connections -WHY: Zero-downtime deployments -RUST: tokio::signal + graceful shutdown -``` - -### Observability - -``` -RULE: Every request must be traceable -WHY: Debugging distributed systems -RUST: tracing spans, opentelemetry export -``` - ---- - -## Trace Down ↓ - -From constraints to design (Layer 2): - -``` -"Need distributed tracing" - ↓ m12-lifecycle: Span lifecycle - ↓ tracing + opentelemetry - -"Need graceful shutdown" - ↓ m07-concurrency: Signal handling - ↓ m12-lifecycle: Connection draining - -"Need health checks" - ↓ domain-web: HTTP endpoints - ↓ m06-error-handling: Health status -``` - ---- - -## Key Crates - -| Purpose | Crate | -|---------|-------| -| gRPC | tonic | -| Kubernetes | kube, kube-runtime | -| Docker | bollard | -| Tracing | tracing, opentelemetry | -| Metrics | prometheus, metrics | -| Config | config, figment | -| Health | HTTP endpoints | - -## Design Patterns - -| Pattern | Purpose | Implementation | 
-|---------|---------|----------------| -| gRPC services | Service mesh | tonic + tower | -| K8s operators | Custom resources | kube-runtime Controller | -| Observability | Debugging | tracing + OTEL | -| Health checks | Orchestration | `/health`, `/ready` | -| Config | 12-factor | Env vars + secrets | - -## Code Pattern: Graceful Shutdown - -```rust -use tokio::signal; - -async fn run_server() -> anyhow::Result<()> { - let app = Router::new() - .route("/health", get(health)) - .route("/ready", get(ready)); - - let addr = SocketAddr::from(([0, 0, 0, 0], 8080)); - - axum::Server::bind(&addr) - .serve(app.into_make_service()) - .with_graceful_shutdown(shutdown_signal()) - .await?; - - Ok(()) -} - -async fn shutdown_signal() { - signal::ctrl_c().await.expect("failed to listen for ctrl+c"); - tracing::info!("shutdown signal received"); -} -``` - -## Health Check Pattern - -```rust -async fn health() -> StatusCode { - StatusCode::OK -} - -async fn ready(State(db): State>) -> StatusCode { - match db.ping().await { - Ok(_) => StatusCode::OK, - Err(_) => StatusCode::SERVICE_UNAVAILABLE, - } -} -``` - ---- - -## Common Mistakes - -| Mistake | Domain Violation | Fix | -|---------|-----------------|-----| -| Local file state | Not stateless | External storage | -| No SIGTERM handling | Hard kills | Graceful shutdown | -| No tracing | Can't debug | tracing spans | -| Static config | Not 12-factor | Env vars | - ---- - -## Trace to Layer 1 - -| Constraint | Layer 2 Pattern | Layer 1 Implementation | -|------------|-----------------|------------------------| -| Stateless | External state | Arc for external | -| Graceful shutdown | Signal handling | tokio::signal | -| Tracing | Span lifecycle | tracing + OTEL | -| Health checks | HTTP endpoints | Dedicated routes | - ---- - -## Related Skills - -| When | See | -|------|-----| -| Async patterns | m07-concurrency | -| HTTP endpoints | domain-web | -| Error handling | m13-domain-error | -| Resource lifecycle | m12-lifecycle | diff 
--git a/.claude/skills/domain-embedded/SKILL.md b/.claude/skills/domain-embedded/SKILL.md deleted file mode 100644 index 9974cba5c..000000000 --- a/.claude/skills/domain-embedded/SKILL.md +++ /dev/null @@ -1,170 +0,0 @@ ---- -name: domain-embedded -description: "Use when developing embedded/no_std Rust. Keywords: embedded, no_std, microcontroller, MCU, ARM, RISC-V, bare metal, firmware, HAL, PAC, RTIC, embassy, interrupt, DMA, peripheral, GPIO, SPI, I2C, UART, embedded-hal, cortex-m, esp32, stm32, nrf, 嵌入式, 单片机, 固件, 裸机" -globs: ["**/Cargo.toml", "**/.cargo/config.toml"] ---- - -# Embedded Domain - -> **Layer 3: Domain Constraints** - -## Domain Constraints → Design Implications - -| Domain Rule | Design Constraint | Rust Implication | -|-------------|-------------------|------------------| -| No heap | Stack allocation | heapless, no Box/Vec | -| No std | Core only | #![no_std] | -| Real-time | Predictable timing | No dynamic alloc | -| Resource limited | Minimal memory | Static buffers | -| Hardware safety | Safe peripheral access | HAL + ownership | -| Interrupt safe | No blocking in ISR | Atomic, critical sections | - ---- - -## Critical Constraints - -### No Dynamic Allocation - -``` -RULE: Cannot use heap (no allocator) -WHY: Deterministic memory, no OOM -RUST: heapless::Vec, arrays -``` - -### Interrupt Safety - -``` -RULE: Shared state must be interrupt-safe -WHY: ISR can preempt at any time -RUST: Mutex> + critical section -``` - -### Hardware Ownership - -``` -RULE: Peripherals must have clear ownership -WHY: Prevent conflicting access -RUST: HAL takes ownership, singletons -``` - ---- - -## Trace Down ↓ - -From constraints to design (Layer 2): - -``` -"Need no_std compatible data structures" - ↓ m02-resource: heapless collections - ↓ Static sizing: heapless::Vec - -"Need interrupt-safe state" - ↓ m03-mutability: Mutex>> - ↓ m07-concurrency: Critical sections - -"Need peripheral ownership" - ↓ m01-ownership: Singleton pattern - ↓ m12-lifecycle: RAII for 
hardware -``` - ---- - -## Layer Stack - -| Layer | Examples | Purpose | -|-------|----------|---------| -| PAC | stm32f4, esp32c3 | Register access | -| HAL | stm32f4xx-hal | Hardware abstraction | -| Framework | RTIC, Embassy | Concurrency | -| Traits | embedded-hal | Portable drivers | - -## Framework Comparison - -| Framework | Style | Best For | -|-----------|-------|----------| -| RTIC | Priority-based | Interrupt-driven apps | -| Embassy | Async | Complex state machines | -| Bare metal | Manual | Simple apps | - -## Key Crates - -| Purpose | Crate | -|---------|-------| -| Runtime (ARM) | cortex-m-rt | -| Panic handler | panic-halt, panic-probe | -| Collections | heapless | -| HAL traits | embedded-hal | -| Logging | defmt | -| Flash/debug | probe-run | - -## Design Patterns - -| Pattern | Purpose | Implementation | -|---------|---------|----------------| -| no_std setup | Bare metal | `#![no_std]` + `#![no_main]` | -| Entry point | Startup | `#[entry]` or embassy | -| Static state | ISR access | `Mutex>>` | -| Fixed buffers | No heap | `heapless::Vec` | - -## Code Pattern: Static Peripheral - -```rust -#![no_std] -#![no_main] - -use cortex_m::interrupt::{self, Mutex}; -use core::cell::RefCell; - -static LED: Mutex<RefCell<Option<Led>>> = Mutex::new(RefCell::new(None)); - -#[entry] -fn main() -> ! 
{ - let dp = pac::Peripherals::take().unwrap(); - let led = Led::new(dp.GPIOA); - - interrupt::free(|cs| { - LED.borrow(cs).replace(Some(led)); - }); - - loop { - interrupt::free(|cs| { - if let Some(led) = LED.borrow(cs).borrow_mut().as_mut() { - led.toggle(); - } - }); - } -} -``` - ---- - -## Common Mistakes - -| Mistake | Domain Violation | Fix | -|---------|-----------------|-----| -| Using Vec | Heap allocation | heapless::Vec | -| No critical section | Race with ISR | Mutex + interrupt::free | -| Blocking in ISR | Missed interrupts | Defer to main loop | -| Unsafe peripheral | Hardware conflict | HAL ownership | - ---- - -## Trace to Layer 1 - -| Constraint | Layer 2 Pattern | Layer 1 Implementation | -|------------|-----------------|------------------------| -| No heap | Static collections | heapless::Vec | -| ISR safety | Critical sections | Mutex> | -| Hardware ownership | Singleton | take().unwrap() | -| no_std | Core-only | #![no_std], #![no_main] | - ---- - -## Related Skills - -| When | See | -|------|-----| -| Static memory | m02-resource | -| Interior mutability | m03-mutability | -| Interrupt patterns | m07-concurrency | -| Unsafe for hardware | unsafe-checker | diff --git a/.claude/skills/domain-fintech/SKILL.md b/.claude/skills/domain-fintech/SKILL.md deleted file mode 100644 index 9d55b5144..000000000 --- a/.claude/skills/domain-fintech/SKILL.md +++ /dev/null @@ -1,145 +0,0 @@ ---- -name: domain-fintech -description: "Use when building fintech apps. 
Keywords: fintech, trading, decimal, currency, financial, money, transaction, ledger, payment, exchange rate, precision, rounding, accounting, 金融, 交易系统, 货币, 支付" ---- - -# FinTech Domain - -> **Layer 3: Domain Constraints** - -## Domain Constraints → Design Implications - -| Domain Rule | Design Constraint | Rust Implication | -|-------------|-------------------|------------------| -| Audit trail | Immutable records | Arc, no mutation | -| Precision | No floating point | rust_decimal | -| Consistency | Transaction boundaries | Clear ownership | -| Compliance | Complete logging | Structured tracing | -| Reproducibility | Deterministic execution | No race conditions | - ---- - -## Critical Constraints - -### Financial Precision - -``` -RULE: Never use f64 for money -WHY: Floating point loses precision -RUST: Use rust_decimal::Decimal -``` - -### Audit Requirements - -``` -RULE: All transactions must be immutable and traceable -WHY: Regulatory compliance, dispute resolution -RUST: Arc for sharing, event sourcing pattern -``` - -### Consistency - -``` -RULE: Money can't disappear or appear -WHY: Double-entry accounting principles -RUST: Transaction types with validated totals -``` - ---- - -## Trace Down ↓ - -From constraints to design (Layer 2): - -``` -"Need immutable transaction records" - ↓ m09-domain: Model as Value Objects - ↓ m01-ownership: Use Arc for shared immutable data - -"Need precise decimal math" - ↓ m05-type-driven: Newtype for Currency/Amount - ↓ rust_decimal: Use Decimal type - -"Need transaction boundaries" - ↓ m12-lifecycle: RAII for transaction scope - ↓ m09-domain: Aggregate boundaries -``` - ---- - -## Key Crates - -| Purpose | Crate | -|---------|-------| -| Decimal math | rust_decimal | -| Date/time | chrono, time | -| UUID | uuid | -| Serialization | serde | -| Validation | validator | - -## Design Patterns - -| Pattern | Purpose | Implementation | -|---------|---------|----------------| -| Currency newtype | Type safety | `struct 
Amount(Decimal);` | -| Transaction | Atomic operations | Event sourcing | -| Audit log | Traceability | Structured logging with trace IDs | -| Ledger | Double-entry | Debit/credit balance | - -## Code Pattern: Currency Type - -```rust -use rust_decimal::Decimal; - -#[derive(Clone, Debug, PartialEq)] -pub struct Amount { - value: Decimal, - currency: Currency, -} - -impl Amount { - pub fn new(value: Decimal, currency: Currency) -> Self { - Self { value, currency } - } - - pub fn add(&self, other: &Amount) -> Result<Amount, CurrencyMismatch> { - if self.currency != other.currency { - return Err(CurrencyMismatch); - } - Ok(Amount::new(self.value + other.value, self.currency)) - } -} -``` - ---- - -## Common Mistakes - -| Mistake | Domain Violation | Fix | -|---------|-----------------|-----| -| Using f64 | Precision loss | rust_decimal | -| Mutable transaction | Audit trail broken | Immutable + events | -| String for amount | No validation | Validated newtype | -| Silent overflow | Money disappears | Checked arithmetic | - ---- - -## Trace to Layer 1 - -| Constraint | Layer 2 Pattern | Layer 1 Implementation | -|------------|-----------------|------------------------| -| Immutable records | Event sourcing | Arc, Clone | -| Transaction scope | Aggregate | Owned children | -| Precision | Value Object | rust_decimal newtype | -| Thread-safe sharing | Shared immutable | Arc (not Rc) | - ---- - -## Related Skills - -| When | See | -|------|-----| -| Value Object design | m09-domain | -| Ownership for immutable | m01-ownership | -| Arc for sharing | m02-resource | -| Error handling | m13-domain-error | diff --git a/.claude/skills/domain-iot/SKILL.md b/.claude/skills/domain-iot/SKILL.md deleted file mode 100644 index 1884a0f6b..000000000 --- a/.claude/skills/domain-iot/SKILL.md +++ /dev/null @@ -1,167 +0,0 @@ ---- -name: domain-iot -description: "Use when building IoT apps. 
Keywords: IoT, Internet of Things, sensor, MQTT, device, edge computing, telemetry, actuator, smart home, gateway, protocol, 物联网, 传感器, 边缘计算, 智能家居" ---- - -# IoT Domain - -> **Layer 3: Domain Constraints** - -## Domain Constraints → Design Implications - -| Domain Rule | Design Constraint | Rust Implication | -|-------------|-------------------|------------------| -| Unreliable network | Offline-first | Local buffering | -| Power constraints | Efficient code | Sleep modes, minimal alloc | -| Resource limits | Small footprint | no_std where needed | -| Security | Encrypted comms | TLS, signed firmware | -| Reliability | Self-recovery | Watchdog, error handling | -| OTA updates | Safe upgrades | Rollback capability | - ---- - -## Critical Constraints - -### Network Unreliability - -``` -RULE: Network can fail at any time -WHY: Wireless, remote locations -RUST: Local queue, retry with backoff -``` - -### Power Management - -``` -RULE: Minimize power consumption -WHY: Battery life, energy costs -RUST: Sleep modes, efficient algorithms -``` - -### Device Security - -``` -RULE: All communication encrypted -WHY: Physical access possible -RUST: TLS, signed messages -``` - ---- - -## Trace Down ↓ - -From constraints to design (Layer 2): - -``` -"Need offline-first design" - ↓ m12-lifecycle: Local buffer with persistence - ↓ m13-domain-error: Retry with backoff - -"Need power efficiency" - ↓ domain-embedded: no_std patterns - ↓ m10-performance: Minimal allocations - -"Need reliable messaging" - ↓ m07-concurrency: Async with timeout - ↓ MQTT: QoS levels -``` - ---- - -## Environment Comparison - -| Environment | Stack | Crates | -|-------------|-------|--------| -| Linux gateway | tokio + std | rumqttc, reqwest | -| MCU device | embassy + no_std | embedded-hal | -| Hybrid | Split workloads | Both | - -## Key Crates - -| Purpose | Crate | -|---------|-------| -| MQTT (std) | rumqttc, paho-mqtt | -| Embedded | embedded-hal, embassy | -| Async (std) | tokio | -| Async (no_std) | 
embassy | -| Logging (no_std) | defmt | -| Logging (std) | tracing | - -## Design Patterns - -| Pattern | Purpose | Implementation | -|---------|---------|----------------| -| Pub/Sub | Device comms | MQTT topics | -| Edge compute | Local processing | Filter before upload | -| OTA updates | Firmware upgrade | Signed + rollback | -| Power mgmt | Battery life | Sleep + wake events | -| Store & forward | Network reliability | Local queue | - -## Code Pattern: MQTT Client - -```rust -use rumqttc::{AsyncClient, MqttOptions, QoS}; - -async fn run_mqtt() -> anyhow::Result<()> { - let mut options = MqttOptions::new("device-1", "broker.example.com", 1883); - options.set_keep_alive(Duration::from_secs(30)); - - let (client, mut eventloop) = AsyncClient::new(options, 10); - - // Subscribe to commands - client.subscribe("devices/device-1/commands", QoS::AtLeastOnce).await?; - - // Publish telemetry - tokio::spawn(async move { - loop { - let data = read_sensor().await; - client.publish("devices/device-1/telemetry", QoS::AtLeastOnce, false, data).await.ok(); - tokio::time::sleep(Duration::from_secs(60)).await; - } - }); - - // Process events - loop { - match eventloop.poll().await { - Ok(event) => handle_event(event).await, - Err(e) => { - tracing::error!("MQTT error: {}", e); - tokio::time::sleep(Duration::from_secs(5)).await; - } - } - } -} -``` - ---- - -## Common Mistakes - -| Mistake | Domain Violation | Fix | -|---------|-----------------|-----| -| No retry logic | Lost data | Exponential backoff | -| Always-on radio | Battery drain | Sleep between sends | -| Unencrypted MQTT | Security risk | TLS | -| No local buffer | Network outage = data loss | Persist locally | - ---- - -## Trace to Layer 1 - -| Constraint | Layer 2 Pattern | Layer 1 Implementation | -|------------|-----------------|------------------------| -| Offline-first | Store & forward | Local queue + flush | -| Power efficiency | Sleep patterns | Timer-based wake | -| Network reliability | Retry | tokio-retry, 
backoff | -| Security | TLS | rustls, native-tls | - ---- - -## Related Skills - -| When | See | -|------|-----| -| Embedded patterns | domain-embedded | -| Async patterns | m07-concurrency | -| Error recovery | m13-domain-error | -| Performance | m10-performance | diff --git a/.claude/skills/domain-ml/SKILL.md b/.claude/skills/domain-ml/SKILL.md deleted file mode 100644 index f07a74d1f..000000000 --- a/.claude/skills/domain-ml/SKILL.md +++ /dev/null @@ -1,180 +0,0 @@ ---- -name: domain-ml -description: "Use when building ML/AI apps in Rust. Keywords: machine learning, ML, AI, tensor, model, inference, neural network, deep learning, training, prediction, ndarray, tch-rs, burn, candle, 机器学习, 人工智能, 模型推理" ---- - -# Machine Learning Domain - -> **Layer 3: Domain Constraints** - -## Domain Constraints → Design Implications - -| Domain Rule | Design Constraint | Rust Implication | -|-------------|-------------------|------------------| -| Large data | Efficient memory | Zero-copy, streaming | -| GPU acceleration | CUDA/Metal support | candle, tch-rs | -| Model portability | Standard formats | ONNX | -| Batch processing | Throughput over latency | Batched inference | -| Numerical precision | Float handling | ndarray, careful f32/f64 | -| Reproducibility | Deterministic | Seeded random, versioning | - ---- - -## Critical Constraints - -### Memory Efficiency - -``` -RULE: Avoid copying large tensors -WHY: Memory bandwidth is bottleneck -RUST: References, views, in-place ops -``` - -### GPU Utilization - -``` -RULE: Batch operations for GPU efficiency -WHY: GPU overhead per kernel launch -RUST: Batch sizes, async data loading -``` - -### Model Portability - -``` -RULE: Use standard model formats -WHY: Train in Python, deploy in Rust -RUST: ONNX via tract or candle -``` - ---- - -## Trace Down ↓ - -From constraints to design (Layer 2): - -``` -"Need efficient data pipelines" - ↓ m10-performance: Streaming, batching - ↓ polars: Lazy evaluation - -"Need GPU inference" - ↓ 
m07-concurrency: Async data loading - ↓ candle/tch-rs: CUDA backend - -"Need model loading" - ↓ m12-lifecycle: Lazy init, caching - ↓ tract: ONNX runtime -``` - ---- - -## Use Case → Framework - -| Use Case | Recommended | Why | -|----------|-------------|-----| -| Inference only | tract (ONNX) | Lightweight, portable | -| Training + inference | candle, burn | Pure Rust, GPU | -| PyTorch models | tch-rs | Direct bindings | -| Data pipelines | polars | Fast, lazy eval | - -## Key Crates - -| Purpose | Crate | -|---------|-------| -| Tensors | ndarray | -| ONNX inference | tract | -| ML framework | candle, burn | -| PyTorch bindings | tch-rs | -| Data processing | polars | -| Embeddings | fastembed | - -## Design Patterns - -| Pattern | Purpose | Implementation | -|---------|---------|----------------| -| Model loading | Once, reuse | `OnceLock` | -| Batching | Throughput | Collect then process | -| Streaming | Large data | Iterator-based | -| GPU async | Parallelism | Data loading parallel to compute | - -## Code Pattern: Inference Server - -```rust -use std::sync::OnceLock; -use tract_onnx::prelude::*; - -static MODEL: OnceLock, Graph>>> = OnceLock::new(); - -fn get_model() -> &'static SimplePlan<...> { - MODEL.get_or_init(|| { - tract_onnx::onnx() - .model_for_path("model.onnx") - .unwrap() - .into_optimized() - .unwrap() - .into_runnable() - .unwrap() - }) -} - -async fn predict(input: Vec) -> anyhow::Result> { - let model = get_model(); - let input = tract_ndarray::arr1(&input).into_shape((1, input.len()))?; - let result = model.run(tvec!(input.into()))?; - Ok(result[0].to_array_view::()?.iter().copied().collect()) -} -``` - -## Code Pattern: Batched Inference - -```rust -async fn batch_predict(inputs: Vec>, batch_size: usize) -> Vec> { - let mut results = Vec::with_capacity(inputs.len()); - - for batch in inputs.chunks(batch_size) { - // Stack inputs into batch tensor - let batch_tensor = stack_inputs(batch); - - // Run inference on batch - let batch_output = 
model.run(batch_tensor).await; - - // Unstack results - results.extend(unstack_outputs(batch_output)); - } - - results -} -``` - ---- - -## Common Mistakes - -| Mistake | Domain Violation | Fix | -|---------|-----------------|-----| -| Clone tensors | Memory waste | Use views | -| Single inference | GPU underutilized | Batch processing | -| Load model per request | Slow | Singleton pattern | -| Sync data loading | GPU idle | Async pipeline | - ---- - -## Trace to Layer 1 - -| Constraint | Layer 2 Pattern | Layer 1 Implementation | -|------------|-----------------|------------------------| -| Memory efficiency | Zero-copy | ndarray views | -| Model singleton | Lazy init | OnceLock | -| Batch processing | Chunked iteration | chunks() + parallel | -| GPU async | Concurrent loading | tokio::spawn + GPU | - ---- - -## Related Skills - -| When | See | -|------|-----| -| Performance | m10-performance | -| Lazy initialization | m12-lifecycle | -| Async patterns | m07-concurrency | -| Memory efficiency | m01-ownership | diff --git a/.claude/skills/domain-web/SKILL.md b/.claude/skills/domain-web/SKILL.md deleted file mode 100644 index d3c7c2c0f..000000000 --- a/.claude/skills/domain-web/SKILL.md +++ /dev/null @@ -1,155 +0,0 @@ ---- -name: domain-web -description: "Use when building web services. 
Keywords: web server, HTTP, REST API, GraphQL, WebSocket, axum, actix, warp, rocket, tower, hyper, reqwest, middleware, router, handler, extractor, state management, authentication, authorization, JWT, session, cookie, CORS, rate limiting, web 开发, HTTP 服务, API 设计, 中间件, 路由" -globs: ["**/Cargo.toml"] ---- - -# Web Domain - -> **Layer 3: Domain Constraints** - -## Domain Constraints → Design Implications - -| Domain Rule | Design Constraint | Rust Implication | -|-------------|-------------------|------------------| -| Stateless HTTP | No request-local globals | State in extractors | -| Concurrency | Handle many connections | Async, Send + Sync | -| Latency SLA | Fast response | Efficient ownership | -| Security | Input validation | Type-safe extractors | -| Observability | Request tracing | tracing + tower layers | - ---- - -## Critical Constraints - -### Async by Default - -``` -RULE: Web handlers must not block -WHY: Block one task = block many requests -RUST: async/await, spawn_blocking for CPU work -``` - -### State Management - -``` -RULE: Shared state must be thread-safe -WHY: Handlers run on any thread -RUST: Arc, Arc> for mutable -``` - -### Request Lifecycle - -``` -RULE: Resources live only for request duration -WHY: Memory management, no leaks -RUST: Extractors, proper ownership -``` - ---- - -## Trace Down ↓ - -From constraints to design (Layer 2): - -``` -"Need shared application state" - ↓ m07-concurrency: Use Arc for thread-safe sharing - ↓ m02-resource: Arc> for mutable state - -"Need request validation" - ↓ m05-type-driven: Validated extractors - ↓ m06-error-handling: IntoResponse for errors - -"Need middleware stack" - ↓ m12-lifecycle: Tower layers - ↓ m04-zero-cost: Trait-based composition -``` - ---- - -## Framework Comparison - -| Framework | Style | Best For | -|-----------|-------|----------| -| axum | Functional, tower | Modern APIs | -| actix-web | Actor-based | High performance | -| warp | Filter composition | Composable APIs | -| rocket | 
Macro-driven | Rapid development | - -## Key Crates - -| Purpose | Crate | -|---------|-------| -| HTTP server | axum, actix-web | -| HTTP client | reqwest | -| JSON | serde_json | -| Auth/JWT | jsonwebtoken | -| Session | tower-sessions | -| Database | sqlx, diesel | -| Middleware | tower | - -## Design Patterns - -| Pattern | Purpose | Implementation | -|---------|---------|----------------| -| Extractors | Request parsing | `State(db)`, `Json(payload)` | -| Error response | Unified errors | `impl IntoResponse` | -| Middleware | Cross-cutting | Tower layers | -| Shared state | App config | `Arc` | - -## Code Pattern: Axum Handler - -```rust -async fn handler( - State(db): State>, - Json(payload): Json, -) -> Result, AppError> { - let user = db.create_user(&payload).await?; - Ok(Json(user)) -} - -// Error handling -impl IntoResponse for AppError { - fn into_response(self) -> Response { - let (status, message) = match self { - Self::NotFound => (StatusCode::NOT_FOUND, "Not found"), - Self::Internal(_) => (StatusCode::INTERNAL_SERVER_ERROR, "Internal error"), - }; - (status, Json(json!({"error": message}))).into_response() - } -} -``` - ---- - -## Common Mistakes - -| Mistake | Domain Violation | Fix | -|---------|-----------------|-----| -| Blocking in handler | Latency spike | spawn_blocking | -| Rc in state | Not Send + Sync | Use Arc | -| No validation | Security risk | Type-safe extractors | -| No error response | Bad UX | IntoResponse impl | - ---- - -## Trace to Layer 1 - -| Constraint | Layer 2 Pattern | Layer 1 Implementation | -|------------|-----------------|------------------------| -| Async handlers | Async/await | tokio runtime | -| Thread-safe state | Shared state | Arc, Arc> | -| Request lifecycle | Extractors | Ownership via From | -| Middleware | Tower layers | Trait-based composition | - ---- - -## Related Skills - -| When | See | -|------|-----| -| Async patterns | m07-concurrency | -| State management | m02-resource | -| Error handling | 
m06-error-handling | -| Middleware design | m12-lifecycle | diff --git a/.claude/skills/m01-ownership/SKILL.md b/.claude/skills/m01-ownership/SKILL.md deleted file mode 100644 index 51946b836..000000000 --- a/.claude/skills/m01-ownership/SKILL.md +++ /dev/null @@ -1,133 +0,0 @@ ---- -name: m01-ownership -description: "CRITICAL: Use for ownership/borrow/lifetime issues. Triggers: E0382, E0597, E0506, E0507, E0515, E0716, E0106, value moved, borrowed value does not live long enough, cannot move out of, use of moved value, ownership, borrow, lifetime, 'a, 'static, move, clone, Copy, 所有权, 借用, 生命周期" ---- - -# Ownership & Lifetimes - -> **Layer 1: Language Mechanics** - -## Core Question - -**Who should own this data, and for how long?** - -Before fixing ownership errors, understand the data's role: -- Is it shared or exclusive? -- Is it short-lived or long-lived? -- Is it transformed or just read? - ---- - -## Error → Design Question - -| Error | Don't Just Say | Ask Instead | -|-------|----------------|-------------| -| E0382 | "Clone it" | Who should own this data? | -| E0597 | "Extend lifetime" | Is the scope boundary correct? | -| E0506 | "End borrow first" | Should mutation happen elsewhere? | -| E0507 | "Clone before move" | Why are we moving from a reference? | -| E0515 | "Return owned" | Should caller own the data? | -| E0716 | "Bind to variable" | Why is this temporary? | -| E0106 | "Add 'a" | What is the actual lifetime relationship? | - ---- - -## Thinking Prompt - -Before fixing an ownership error, ask: - -1. **What is this data's domain role?** - - Entity (unique identity) → owned - - Value Object (interchangeable) → clone/copy OK - - Temporary (computation result) → maybe restructure - -2. **Is the ownership design intentional?** - - By design → work within constraints - - Accidental → consider redesign - -3. 
**Fix symptom or redesign?** - - If Strike 3 (3rd attempt) → escalate to Layer 2 - ---- - -## Trace Up ↑ - -When errors persist, trace to design layer: - -``` -E0382 (moved value) - ↑ Ask: What design choice led to this ownership pattern? - ↑ Check: m09-domain (is this Entity or Value Object?) - ↑ Check: domain-* (what constraints apply?) -``` - -| Persistent Error | Trace To | Question | -|-----------------|----------|----------| -| E0382 repeated | m02-resource | Should use Arc/Rc for sharing? | -| E0597 repeated | m09-domain | Is scope boundary at right place? | -| E0506/E0507 | m03-mutability | Should use interior mutability? | - ---- - -## Trace Down ↓ - -From design decisions to implementation: - -``` -"Data needs to be shared immutably" - ↓ Use: Arc (multi-thread) or Rc (single-thread) - -"Data needs exclusive ownership" - ↓ Use: move semantics, take ownership - -"Data is read-only view" - ↓ Use: &T (immutable borrow) -``` - ---- - -## Quick Reference - -| Pattern | Ownership | Cost | Use When | -|---------|-----------|------|----------| -| Move | Transfer | Zero | Caller doesn't need data | -| `&T` | Borrow | Zero | Read-only access | -| `&mut T` | Exclusive borrow | Zero | Need to modify | -| `clone()` | Duplicate | Alloc + copy | Actually need a copy | -| `Rc` | Shared (single) | Ref count | Single-thread sharing | -| `Arc` | Shared (multi) | Atomic ref count | Multi-thread sharing | -| `Cow` | Clone-on-write | Alloc if mutated | Might modify | - -## Error Code Reference - -| Error | Cause | Quick Fix | -|-------|-------|-----------| -| E0382 | Value moved | Clone, reference, or redesign ownership | -| E0597 | Reference outlives owner | Extend owner scope or restructure | -| E0506 | Assign while borrowed | End borrow before mutation | -| E0507 | Move out of borrowed | Clone or use reference | -| E0515 | Return local reference | Return owned value | -| E0716 | Temporary dropped | Bind to variable | -| E0106 | Missing lifetime | Add `'a` annotation | - ---- 
- -## Anti-Patterns - -| Anti-Pattern | Why Bad | Better | -|--------------|---------|--------| -| `.clone()` everywhere | Hides design issues | Design ownership properly | -| Fight borrow checker | Increases complexity | Work with the compiler | -| `'static` for everything | Restricts flexibility | Use appropriate lifetimes | -| Leak with `Box::leak` | Memory leak | Proper lifetime design | - ---- - -## Related Skills - -| When | See | -|------|-----| -| Need smart pointers | m02-resource | -| Need interior mutability | m03-mutability | -| Data is domain entity | m09-domain | -| Learning ownership concepts | m14-mental-model | diff --git a/.claude/skills/m01-ownership/comparison.md b/.claude/skills/m01-ownership/comparison.md deleted file mode 100644 index 209574b6c..000000000 --- a/.claude/skills/m01-ownership/comparison.md +++ /dev/null @@ -1,222 +0,0 @@ -# Ownership: Comparison with Other Languages - -## Rust vs C++ - -### Memory Management - -| Aspect | Rust | C++ | -|--------|------|-----| -| Default | Move semantics | Copy semantics (pre-C++11) | -| Move | `let b = a;` (a invalidated) | `auto b = std::move(a);` (a valid but unspecified) | -| Copy | `let b = a.clone();` | `auto b = a;` | -| Safety | Compile-time enforcement | Runtime responsibility | - -### Rust Move vs C++ Move - -```rust -// Rust: after move, 'a' is INVALID -let a = String::from("hello"); -let b = a; // a moved -// println!("{}", a); // COMPILE ERROR - -// Equivalent in C++: -// std::string a = "hello"; -// std::string b = std::move(a); -// std::cout << a; // UNDEFINED (compiles but buggy) -``` - -### Smart Pointers - -| Rust | C++ | Purpose | -|------|-----|---------| -| `Box` | `std::unique_ptr` | Unique ownership | -| `Rc` | `std::shared_ptr` | Shared ownership | -| `Arc` | `std::shared_ptr` + atomic | Thread-safe shared | -| `RefCell` | (manual runtime checks) | Interior mutability | - ---- - -## Rust vs Go - -### Memory Model - -| Aspect | Rust | Go | -|--------|------|-----| -| Memory 
| Stack + heap, explicit | GC manages all | -| Ownership | Enforced at compile-time | None (GC handles) | -| Null | `Option` | `nil` for pointers | -| Concurrency | `Send`/`Sync` traits | Channels (less strict) | - -### Sharing Data - -```rust -// Rust: explicit about sharing -use std::sync::Arc; -let data = Arc::new(vec![1, 2, 3]); -let data_clone = Arc::clone(&data); -std::thread::spawn(move || { - println!("{:?}", data_clone); -}); - -// Go: implicit sharing -// data := []int{1, 2, 3} -// go func() { -// fmt.Println(data) // potential race condition -// }() -``` - -### Why No GC in Rust - -1. **Deterministic destruction**: Resources freed exactly when scope ends -2. **Zero-cost**: No GC pauses or overhead -3. **Embeddable**: Works in OS kernels, embedded systems -4. **Predictable latency**: Critical for real-time systems - ---- - -## Rust vs Java/C# - -### Reference Semantics - -| Aspect | Rust | Java/C# | -|--------|------|---------| -| Objects | Owned by default | Reference by default | -| Null | `Option` | `null` (nullable) | -| Immutability | Default | Must use `final`/`readonly` | -| Copy | Explicit `.clone()` | Reference copy (shallow) | - -### Comparison - -```rust -// Rust: clear ownership -fn process(data: Vec) { // takes ownership - // data is ours, will be freed at end -} - -let numbers = vec![1, 2, 3]; -process(numbers); -// numbers is invalid here - -// Java: ambiguous ownership -// void process(List data) { -// // Who owns data? Caller? Callee? Both? -// // Can caller still use it? 
-// } -``` - ---- - -## Rust vs Python - -### Memory Model - -| Aspect | Rust | Python | -|--------|------|--------| -| Typing | Static, compile-time | Dynamic, runtime | -| Memory | Ownership-based | Reference counting + GC | -| Mutability | Default immutable | Default mutable | -| Performance | Native, zero-cost | Interpreted, higher overhead | - -### Common Pattern Translation - -```rust -// Rust: borrowing iteration -let items = vec!["a", "b", "c"]; -for item in &items { - println!("{}", item); -} -// items still usable - -// Python: iteration doesn't consume -// items = ["a", "b", "c"] -// for item in items: -// print(item) -// items still usable (different reason - ref counting) -``` - ---- - -## Unique Rust Concepts - -### Concepts Other Languages Lack - -1. **Borrow Checker**: No other mainstream language has compile-time borrow checking -2. **Lifetimes**: Explicit annotation of reference validity -3. **Move by Default**: Values move, not copy -4. **No Null**: `Option` instead of null pointers -5. 
**Affine Types**: Values can be used at most once - -### Learning Curve Areas - -| Concept | Coming From | Key Insight | -|---------|-------------|-------------| -| Ownership | GC languages | Think about who "owns" data | -| Borrowing | C/C++ | Like references but checked | -| Lifetimes | Any | Explicit scope of validity | -| Move | C++ | Move is default, not copy | - ---- - -## Mental Model Shifts - -### From GC Languages (Java, Go, Python) - -``` -Before: "Memory just works, GC handles it" -After: "I explicitly decide who owns data and when it's freed" -``` - -Key shifts: -- Think about ownership at design time -- Returning references requires lifetime thinking -- No more `null` - use `Option` - -### From C/C++ - -``` -Before: "I manually manage memory and hope I get it right" -After: "Compiler enforces correctness, I fight the borrow checker" -``` - -Key shifts: -- Trust the compiler's errors -- Move is the default (unlike C++ copy) -- Smart pointers are idiomatic, not overhead - -### From Functional Languages (Haskell, ML) - -``` -Before: "Everything is immutable, copying is fine" -After: "Mutability is explicit, ownership prevents aliasing" -``` - -Key shifts: -- Mutability is safe because of ownership rules -- No persistent data structures needed (usually) -- Performance characteristics are explicit - ---- - -## Performance Trade-offs - -| Language | Memory Overhead | Latency | Throughput | -|----------|-----------------|---------|------------| -| Rust | Minimal (no GC) | Predictable | Excellent | -| C++ | Minimal | Predictable | Excellent | -| Go | GC overhead | GC pauses | Good | -| Java | GC overhead | GC pauses | Good | -| Python | High (ref counting + GC) | Variable | Lower | - -### When Rust Ownership Wins - -1. **Real-time systems**: No GC pauses -2. **Embedded**: No runtime overhead -3. **High-performance**: Zero-cost abstractions -4. **Concurrent**: Data races prevented at compile time - -### When GC Might Be Preferable - -1. 
**Rapid prototyping**: Less mental overhead -2. **Complex object graphs**: Cycles are tricky in Rust -3. **GUI applications**: Object lifetimes are dynamic -4. **Small programs**: Overhead doesn't matter diff --git a/.claude/skills/m01-ownership/examples/best-practices.md b/.claude/skills/m01-ownership/examples/best-practices.md deleted file mode 100644 index ccaf3dc7b..000000000 --- a/.claude/skills/m01-ownership/examples/best-practices.md +++ /dev/null @@ -1,339 +0,0 @@ -# Ownership Best Practices - -## API Design Patterns - -### 1. Prefer Borrowing Over Ownership - -```rust -// BAD: takes ownership unnecessarily -fn print_name(name: String) { - println!("Name: {}", name); -} - -// GOOD: borrows instead -fn print_name(name: &str) { - println!("Name: {}", name); -} - -// Caller benefits: -let name = String::from("Alice"); -print_name(&name); // can reuse name -print_name(&name); // still valid -``` - -### 2. Return Owned Values from Constructors - -```rust -// GOOD: return owned value -impl User { - fn new(name: &str) -> Self { - User { - name: name.to_string(), - } - } -} - -// GOOD: accept Into for flexibility -impl User { - fn new(name: impl Into) -> Self { - User { - name: name.into(), - } - } -} - -// Usage: -let u1 = User::new("Alice"); // &str -let u2 = User::new(String::from("Bob")); // String -``` - -### 3. Use AsRef for Generic Borrowing - -```rust -// GOOD: accepts both &str and String -fn process>(input: S) { - let s = input.as_ref(); - println!("{}", s); -} - -process("literal"); // &str -process(String::from("owned")); // String -process(&String::from("ref")); // &String -``` - -### 4. 
Cow for Clone-on-Write - -```rust -use std::borrow::Cow; - -// Return borrowed when possible, owned when needed -fn maybe_modify(s: &str, uppercase: bool) -> Cow<'_, str> { - if uppercase { - Cow::Owned(s.to_uppercase()) // allocates - } else { - Cow::Borrowed(s) // zero-cost - } -} - -let input = "hello"; -let result = maybe_modify(input, false); -// result is borrowed, no allocation -``` - ---- - -## Struct Design Patterns - -### 1. Owned Fields vs References - -```rust -// Use owned fields for most cases -struct User { - name: String, - email: String, -} - -// Use references only when lifetime is clear -struct UserView<'a> { - name: &'a str, - email: &'a str, -} - -// Pattern: owned data + view for efficiency -impl User { - fn view(&self) -> UserView<'_> { - UserView { - name: &self.name, - email: &self.email, - } - } -} -``` - -### 2. Builder Pattern with Ownership - -```rust -#[derive(Default)] -struct RequestBuilder { - url: Option, - method: Option, - body: Option>, -} - -impl RequestBuilder { - fn new() -> Self { - Self::default() - } - - // Take self by value for chaining - fn url(mut self, url: impl Into) -> Self { - self.url = Some(url.into()); - self - } - - fn method(mut self, method: impl Into) -> Self { - self.method = Some(method.into()); - self - } - - fn build(self) -> Result { - Ok(Request { - url: self.url.ok_or(Error::MissingUrl)?, - method: self.method.unwrap_or_else(|| "GET".to_string()), - body: self.body.unwrap_or_default(), - }) - } -} - -// Usage: -let req = RequestBuilder::new() - .url("https://example.com") - .method("POST") - .build()?; -``` - -### 3. 
Interior Mutability When Needed - -```rust -use std::cell::RefCell; -use std::rc::Rc; - -// Shared mutable state in single-threaded context -struct Counter { - value: Rc>, -} - -impl Counter { - fn new() -> Self { - Counter { - value: Rc::new(RefCell::new(0)), - } - } - - fn increment(&self) { - *self.value.borrow_mut() += 1; - } - - fn get(&self) -> u32 { - *self.value.borrow() - } - - fn clone_handle(&self) -> Self { - Counter { - value: Rc::clone(&self.value), - } - } -} -``` - ---- - -## Collection Patterns - -### 1. Efficient Iteration - -```rust -let items = vec![1, 2, 3, 4, 5]; - -// Iterate by reference (no move) -for item in &items { - println!("{}", item); -} - -// Iterate by mutable reference -for item in &mut items.clone() { - *item *= 2; -} - -// Consume with into_iter when done -let sum: i32 = items.into_iter().sum(); -``` - -### 2. Collecting Results - -```rust -// Collect into owned collection -let strings: Vec = (0..5) - .map(|i| format!("item_{}", i)) - .collect(); - -// Collect references -let refs: Vec<&str> = strings.iter().map(|s| s.as_str()).collect(); - -// Collect with transformation -let result: Result, _> = ["1", "2", "3"] - .iter() - .map(|s| s.parse::()) - .collect(); -``` - -### 3. Entry API for Maps - -```rust -use std::collections::HashMap; - -let mut map: HashMap> = HashMap::new(); - -// Efficient: don't search twice -map.entry("key".to_string()) - .or_insert_with(Vec::new) - .push(42); - -// With entry modification -map.entry("key".to_string()) - .and_modify(|v| v.push(43)) - .or_insert_with(|| vec![43]); -``` - ---- - -## Error Handling with Ownership - -### 1. 
Preserve Context in Errors - -```rust -use std::error::Error; -use std::fmt; - -#[derive(Debug)] -struct ParseError { - input: String, // owns the problematic input - message: String, -} - -impl fmt::Display for ParseError { - fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { - write!(f, "Failed to parse '{}': {}", self.input, self.message) - } -} - -fn parse(input: &str) -> Result { - input.parse().map_err(|_| ParseError { - input: input.to_string(), // clone for error context - message: "not a valid integer".to_string(), - }) -} -``` - -### 2. Ownership in Result Chains - -```rust -fn process_data(path: &str) -> Result { - let content = std::fs::read_to_string(path)?; // owned String - let parsed = parse_content(&content)?; // borrow - let processed = transform(parsed)?; // ownership moves - Ok(processed) // return owned -} -``` - ---- - -## Performance Considerations - -### 1. Avoid Unnecessary Clones - -```rust -// BAD: cloning just to compare -fn contains_item(items: &[String], target: &str) -> bool { - items.iter().any(|s| s.clone() == target) // unnecessary clone -} - -// GOOD: compare references -fn contains_item(items: &[String], target: &str) -> bool { - items.iter().any(|s| s == target) // String implements PartialEq -} -``` - -### 2. Use Slices for Flexibility - -```rust -// BAD: requires Vec -fn sum(numbers: &Vec) -> i32 { - numbers.iter().sum() -} - -// GOOD: accepts any slice -fn sum(numbers: &[i32]) -> i32 { - numbers.iter().sum() -} - -// Now works with: -sum(&vec![1, 2, 3]); // Vec -sum(&[1, 2, 3]); // array -sum(&array[1..3]); // slice -``` - -### 3. 
In-Place Mutation - -```rust -// BAD: allocates new String -fn make_uppercase(s: &str) -> String { - s.to_uppercase() -} - -// GOOD when you own the data: mutate in place -fn make_uppercase(mut s: String) -> String { - s.make_ascii_uppercase(); // in-place for ASCII - s -} -``` diff --git a/.claude/skills/m01-ownership/patterns/common-errors.md b/.claude/skills/m01-ownership/patterns/common-errors.md deleted file mode 100644 index c3efcf407..000000000 --- a/.claude/skills/m01-ownership/patterns/common-errors.md +++ /dev/null @@ -1,265 +0,0 @@ -# Common Ownership Errors & Fixes - -## E0382: Use of Moved Value - -### Error Pattern -```rust -let s = String::from("hello"); -let s2 = s; // s moved here -println!("{}", s); // ERROR: value borrowed after move -``` - -### Fix Options - -**Option 1: Clone (if ownership not needed)** -```rust -let s = String::from("hello"); -let s2 = s.clone(); // s is cloned -println!("{}", s); // OK: s still valid -``` - -**Option 2: Borrow (if modification not needed)** -```rust -let s = String::from("hello"); -let s2 = &s; // borrow, not move -println!("{}", s); // OK -println!("{}", s2); // OK -``` - -**Option 3: Use Rc/Arc (for shared ownership)** -```rust -use std::rc::Rc; -let s = Rc::new(String::from("hello")); -let s2 = Rc::clone(&s); // shared ownership -println!("{}", s); // OK -println!("{}", s2); // OK -``` - ---- - -## E0597: Borrowed Value Does Not Live Long Enough - -### Error Pattern -```rust -fn get_str() -> &str { - let s = String::from("hello"); - &s // ERROR: s dropped here, but reference returned -} -``` - -### Fix Options - -**Option 1: Return owned value** -```rust -fn get_str() -> String { - String::from("hello") // return owned value -} -``` - -**Option 2: Use 'static lifetime** -```rust -fn get_str() -> &'static str { - "hello" // string literal has 'static lifetime -} -``` - -**Option 3: Accept reference parameter** -```rust -fn get_str<'a>(s: &'a str) -> &'a str { - s // return reference with same lifetime as 
input -} -``` - ---- - -## E0499: Cannot Borrow as Mutable More Than Once - -### Error Pattern -```rust -let mut s = String::from("hello"); -let r1 = &mut s; -let r2 = &mut s; // ERROR: second mutable borrow -println!("{}, {}", r1, r2); -``` - -### Fix Options - -**Option 1: Sequential borrows** -```rust -let mut s = String::from("hello"); -{ - let r1 = &mut s; - r1.push_str(" world"); -} // r1 goes out of scope -let r2 = &mut s; // OK: r1 no longer exists -``` - -**Option 2: Use RefCell for interior mutability** -```rust -use std::cell::RefCell; -let s = RefCell::new(String::from("hello")); -let mut r1 = s.borrow_mut(); -// drop r1 before borrowing again -drop(r1); -let mut r2 = s.borrow_mut(); -``` - ---- - -## E0502: Cannot Borrow as Mutable While Immutable Borrow Exists - -### Error Pattern -```rust -let mut v = vec![1, 2, 3]; -let first = &v[0]; // immutable borrow -v.push(4); // ERROR: mutable borrow while immutable exists -println!("{}", first); -``` - -### Fix Options - -**Option 1: Finish using immutable borrow first** -```rust -let mut v = vec![1, 2, 3]; -let first = v[0]; // copy value, not borrow -v.push(4); // OK -println!("{}", first); // OK: using copied value -``` - -**Option 2: Clone before mutating** -```rust -let mut v = vec![1, 2, 3]; -let first = v[0].clone(); // if T: Clone -v.push(4); -println!("{}", first); -``` - ---- - -## E0507: Cannot Move Out of Borrowed Content - -### Error Pattern -```rust -fn take_string(s: &String) { - let moved = *s; // ERROR: cannot move out of borrowed content -} -``` - -### Fix Options - -**Option 1: Clone** -```rust -fn take_string(s: &String) { - let cloned = s.clone(); -} -``` - -**Option 2: Take ownership in function signature** -```rust -fn take_string(s: String) { // take ownership - let moved = s; -} -``` - -**Option 3: Use mem::take for Option/Default types** -```rust -fn take_from_option(opt: &mut Option) -> Option { - std::mem::take(opt) // replaces with None, returns owned value -} -``` - ---- - -## 
E0515: Return Local Reference - -### Error Pattern -```rust -fn create_string() -> &String { - let s = String::from("hello"); - &s // ERROR: cannot return reference to local variable -} -``` - -### Fix Options - -**Option 1: Return owned value** -```rust -fn create_string() -> String { - String::from("hello") -} -``` - -**Option 2: Use static/const** -```rust -fn get_static_str() -> &'static str { - "hello" -} -``` - ---- - -## E0716: Temporary Value Dropped While Borrowed - -### Error Pattern -```rust -let r: &str = &String::from("hello"); // ERROR: temporary dropped -println!("{}", r); -``` - -### Fix Options - -**Option 1: Bind to variable first** -```rust -let s = String::from("hello"); -let r: &str = &s; -println!("{}", r); -``` - -**Option 2: Use let binding with reference** -```rust -let r: &str = { - let s = String::from("hello"); - // s.as_str() // ERROR: still temporary - Box::leak(s.into_boxed_str()) // extreme: leak for 'static -}; -``` - ---- - -## Pattern: Loop Ownership Issues - -### Error Pattern -```rust -let strings = vec![String::from("a"), String::from("b")]; -for s in strings { - println!("{}", s); -} -// ERROR: strings moved into loop -println!("{:?}", strings); -``` - -### Fix Options - -**Option 1: Iterate by reference** -```rust -let strings = vec![String::from("a"), String::from("b")]; -for s in &strings { - println!("{}", s); -} -println!("{:?}", strings); // OK -``` - -**Option 2: Use iter()** -```rust -for s in strings.iter() { - println!("{}", s); -} -``` - -**Option 3: Clone if needed** -```rust -for s in strings.clone() { - // consumes cloned vec -} -println!("{:?}", strings); // original still available -``` diff --git a/.claude/skills/m01-ownership/patterns/lifetime-patterns.md b/.claude/skills/m01-ownership/patterns/lifetime-patterns.md deleted file mode 100644 index 19f76a862..000000000 --- a/.claude/skills/m01-ownership/patterns/lifetime-patterns.md +++ /dev/null @@ -1,229 +0,0 @@ -# Lifetime Patterns - -## Basic Lifetime 
Annotation - -### When Required -```rust -// ERROR: missing lifetime specifier -fn longest(x: &str, y: &str) -> &str { - if x.len() > y.len() { x } else { y } -} - -// FIX: explicit lifetime -fn longest<'a>(x: &'a str, y: &'a str) -> &'a str { - if x.len() > y.len() { x } else { y } -} -``` - -### Lifetime Elision Rules -1. Each input reference gets its own lifetime -2. If one input lifetime, output uses same -3. If `&self` or `&mut self`, output uses self's lifetime - -```rust -// These are equivalent (elision applies): -fn first_word(s: &str) -> &str { ... } -fn first_word<'a>(s: &'a str) -> &'a str { ... } - -// Method with self (elision applies): -impl MyStruct { - fn get_ref(&self) -> &str { ... } - // Equivalent to: - fn get_ref<'a>(&'a self) -> &'a str { ... } -} -``` - ---- - -## Struct Lifetimes - -### Struct Holding References -```rust -// Struct must declare lifetime for references -struct Excerpt<'a> { - part: &'a str, -} - -impl<'a> Excerpt<'a> { - fn level(&self) -> i32 { 3 } - - // Return reference tied to self's lifetime - fn get_part(&self) -> &str { - self.part - } -} -``` - -### Multiple Lifetimes in Struct -```rust -struct Multi<'a, 'b> { - x: &'a str, - y: &'b str, -} - -// Use when references may have different lifetimes -fn make_multi<'a, 'b>(x: &'a str, y: &'b str) -> Multi<'a, 'b> { - Multi { x, y } -} -``` - ---- - -## 'static Lifetime - -### When to Use -```rust -// String literals are 'static -let s: &'static str = "hello"; - -// Owned data can be leaked to 'static -let leaked: &'static str = Box::leak(String::from("hello").into_boxed_str()); - -// Thread spawn requires 'static or move -std::thread::spawn(move || { - // closure owns data, satisfies 'static -}); -``` - -### Avoid Overusing 'static -```rust -// BAD: requires 'static unnecessarily -fn process(s: &'static str) { ... } - -// GOOD: use generic lifetime -fn process<'a>(s: &'a str) { ... } -// or -fn process(s: &str) { ... 
} // lifetime elision -``` - ---- - -## Higher-Ranked Trait Bounds (HRTB) - -### for<'a> Syntax -```rust -// Function that works with any lifetime -fn apply_to_ref<F>(f: F) -where - F: for<'a> Fn(&'a str) -> &'a str, -{ - let s = String::from("hello"); - let result = f(&s); - println!("{}", result); -} -``` - -### Common Use: Closure Bounds -```rust -// Closure that borrows any lifetime; 's is named because elision cannot choose between the two input lifetimes in &[&str] -fn filter_refs<'s, F>(items: &[&'s str], pred: F) -> Vec<&'s str> -where - F: for<'a> Fn(&'a str) -> bool, -{ - items.iter().copied().filter(|s| pred(s)).collect() -} -``` - ---- - -## Lifetime Bounds - -### 'a: 'b (Outlives) -```rust -// 'a must live at least as long as 'b -fn coerce<'a, 'b>(x: &'a str) -> &'b str -where - 'a: 'b, -{ - x -} -``` - -### T: 'a (Type Outlives Lifetime) -```rust -// T must live at least as long as 'a -struct Wrapper<'a, T: 'a> { - value: &'a T, -} - -// Common pattern with trait objects -fn use_trait<'a, T: MyTrait + 'a>(t: &'a T) { ... } -``` - ---- - -## Common Lifetime Mistakes - -### Mistake 1: Returning Reference to Local -```rust -// WRONG -fn dangle() -> &String { - let s = String::from("hello"); - &s // s dropped, reference invalid -} - -// RIGHT -fn no_dangle() -> String { - String::from("hello") -} -``` - -### Mistake 2: Conflicting Lifetimes -```rust -// WRONG: might return reference to y which has shorter lifetime -fn wrong<'a, 'b>(x: &'a str, y: &'b str) -> &'a str { - y // ERROR: 'b might not live as long as 'a -} - -// RIGHT: use same lifetime or add bound -fn right<'a>(x: &'a str, y: &'a str) -> &'a str { - y // OK: both have lifetime 'a -} -``` - -### Mistake 3: Struct Outlives Reference -```rust -// WRONG: r outlives the string s that it references -let r; -{ - let s = String::from("hello"); - r = Excerpt { part: &s }; // ERROR -} -println!("{}", r.part); // s already dropped - -// RIGHT: ensure source outlives struct -let s = String::from("hello"); -let r = Excerpt { part: &s }; -println!("{}", r.part); // OK: s still in scope -``` - ---- - -## 
Subtyping and Variance - -### Covariance -```rust -// &'a T is covariant in 'a -// Can use &'long where &'short expected -fn example<'short, 'long: 'short>(long_ref: &'long str) { - let short_ref: &'short str = long_ref; // OK: covariance -} -``` - -### Invariance -```rust -// &'a mut T is covariant in 'a but invariant in T -fn example<'short, 'long: 'short>(x: &mut &'long str, y: &'short str) { - let r: &mut &'short str = x; // ERROR: T = &'long str cannot shrink behind &mut - *r = y; // otherwise x would end up holding a short-lived reference -} -``` - -### Practical Impact -```rust -// This works due to covariance -fn accept_any<'a>(s: &'a str) { ... } - -let s = String::from("hello"); -let long_lived: &str = &s; -accept_any(long_lived); // 'long coerces to 'short -``` diff --git a/.claude/skills/m02-resource/SKILL.md b/.claude/skills/m02-resource/SKILL.md deleted file mode 100644 index a7e64edf1..000000000 --- a/.claude/skills/m02-resource/SKILL.md +++ /dev/null @@ -1,158 +0,0 @@ ---- -name: m02-resource -description: "CRITICAL: Use for smart pointers and resource management. Triggers: Box, Rc, Arc, Weak, RefCell, Cell, smart pointer, heap allocation, reference counting, RAII, Drop, should I use Box or Rc, when to use Arc vs Rc, 智能指针, 引用计数, 堆分配" ---- - -# Resource Management - -> **Layer 1: Language Mechanics** - -## Core Question - -**What ownership pattern does this resource need?** - -Before choosing a smart pointer, understand: -- Is ownership single or shared? -- Is access single-threaded or multi-threaded? -- Are there potential cycles? - ---- - -## Error → Design Question - -| Error | Don't Just Say | Ask Instead | -|-------|----------------|-------------| -| "Need heap allocation" | "Use Box" | Why can't this be on stack? | -| Rc memory leak | "Use Weak" | Is the cycle necessary in design? | -| RefCell panic | "Use try_borrow" | Is runtime check the right approach? | -| Arc overhead complaint | "Accept it" | Is multi-thread access actually needed? | - ---- - -## Thinking Prompt - -Before choosing a smart pointer: - -1. 
**What's the ownership model?** - - Single owner → Box or owned value - - Shared ownership → Rc/Arc - - Weak reference → Weak - -2. **What's the thread context?** - - Single-thread → Rc, Cell, RefCell - - Multi-thread → Arc, Mutex, RwLock - -3. **Are there cycles?** - - Yes → One direction must be Weak - - No → Regular Rc/Arc is fine - ---- - -## Trace Up ↑ - -When pointer choice is unclear, trace to design: - -``` -"Should I use Arc or Rc?" - ↑ Ask: Is this data shared across threads? - ↑ Check: m07-concurrency (thread model) - ↑ Check: domain-* (performance constraints) -``` - -| Situation | Trace To | Question | -|-----------|----------|----------| -| Rc vs Arc confusion | m07-concurrency | What's the concurrency model? | -| RefCell panics | m03-mutability | Is interior mutability right here? | -| Memory leaks | m12-lifecycle | Where should cleanup happen? | - ---- - -## Trace Down ↓ - -From design to implementation: - -``` -"Need single-owner heap data" - ↓ Use: Box<T> - -"Need shared immutable data (single-thread)" - ↓ Use: Rc<T> - -"Need shared immutable data (multi-thread)" - ↓ Use: Arc<T> - -"Need to break reference cycle" - ↓ Use: Weak<T> - -"Need shared mutable data" - ↓ Single-thread: Rc<RefCell<T>> - ↓ Multi-thread: Arc<Mutex<T>> or Arc<RwLock<T>> -``` - ---- - -## Quick Reference - -| Type | Ownership | Thread-Safe | Use When | -|------|-----------|-------------|----------| -| `Box<T>` | Single | If `T` is | Heap allocation, recursive types | -| `Rc<T>` | Shared | No | Single-thread shared ownership | -| `Arc<T>` | Shared | If `T: Send + Sync` | Multi-thread shared ownership | -| `Weak<T>` | Weak ref | Same as Rc/Arc | Break reference cycles | -| `Cell<T>` | Single | No | Interior mutability (Copy types) | -| `RefCell<T>` | Single | No | Interior mutability (runtime check) | - -## Decision Flowchart - -``` -Need heap allocation? -├─ Yes → Single owner? -│ ├─ Yes → Box -│ └─ No → Multi-thread? -│ ├─ Yes → Arc -│ └─ No → Rc -└─ No → Stack allocation (default) - -Have reference cycles? 
-├─ Yes → Use Weak for one direction -└─ No → Regular Rc/Arc - -Need interior mutability? -├─ Yes → Thread-safe needed? -│ ├─ Yes → Mutex or RwLock -│ └─ No → T: Copy? → Cell : RefCell -└─ No → Use &mut T -``` - ---- - -## Common Errors - -| Problem | Cause | Fix | -|---------|-------|-----| -| Rc cycle leak | Mutual strong refs | Use Weak for one direction | -| RefCell panic | Borrow conflict at runtime | Use try_borrow or restructure | -| Arc overhead | Atomic ops in hot path | Consider Rc if single-threaded | -| Box unnecessary | Data fits on stack | Remove Box | - ---- - -## Anti-Patterns - -| Anti-Pattern | Why Bad | Better | -|--------------|---------|--------| -| Arc everywhere | Unnecessary atomic overhead | Use Rc for single-thread | -| RefCell everywhere | Runtime panics | Design clear ownership | -| Box for small types | Unnecessary allocation | Stack allocation | -| Ignore Weak for cycles | Memory leaks | Design parent-child with Weak | - ---- - -## Related Skills - -| When | See | -|------|-----| -| Ownership errors | m01-ownership | -| Interior mutability details | m03-mutability | -| Multi-thread context | m07-concurrency | -| Resource lifecycle | m12-lifecycle | diff --git a/.claude/skills/m03-mutability/SKILL.md b/.claude/skills/m03-mutability/SKILL.md deleted file mode 100644 index 297798eab..000000000 --- a/.claude/skills/m03-mutability/SKILL.md +++ /dev/null @@ -1,152 +0,0 @@ ---- -name: m03-mutability -description: "CRITICAL: Use for mutability issues. Triggers: E0596, E0499, E0502, cannot borrow as mutable, already borrowed as immutable, mut, &mut, interior mutability, Cell, RefCell, Mutex, RwLock, 可变性, 内部可变性, 借用冲突" ---- - -# Mutability - -> **Layer 1: Language Mechanics** - -## Core Question - -**Why does this data need to change, and who can change it?** - -Before adding interior mutability, understand: -- Is mutation essential or accidental complexity? -- Who should control mutation? -- Is the mutation pattern safe? 
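
Whether a mutation pattern is safe becomes concrete once the choice between `Cell<T>` and `RefCell<T>` is written out. The sketch below assumes a hypothetical `Stats` type invented purely for illustration. It keeps a `Copy` counter in a `Cell<T>` and a non-`Copy` log in a `RefCell<T>`, and it uses `RefCell::try_borrow_mut` so that a runtime borrow conflict returns `false` rather than panicking. It is one possible approach, not a prescribed pattern.

```rust
use std::cell::{Cell, RefCell};

/// Hypothetical type used only for this sketch.
struct Stats {
    hits: Cell<u32>,           // Copy value: Cell swaps values and never hands out references
    log: RefCell<Vec<String>>, // non-Copy value: RefCell checks borrows at runtime
}

impl Stats {
    /// Mutates through &self; panics if `log` is already borrowed.
    fn record(&self, event: &str) {
        self.hits.set(self.hits.get() + 1);
        self.log.borrow_mut().push(event.to_string());
    }

    /// Same mutation, but a borrow conflict yields `false` instead of a panic.
    fn try_record(&self, event: &str) -> bool {
        match self.log.try_borrow_mut() {
            Ok(mut log) => {
                log.push(event.to_string());
                self.hits.set(self.hits.get() + 1);
                true
            }
            Err(_) => false,
        }
    }
}

fn main() {
    let stats = Stats { hits: Cell::new(0), log: RefCell::new(Vec::new()) };
    stats.record("start");

    // While a shared borrow is alive, a mutable borrow would conflict.
    let reader = stats.log.borrow();
    assert!(!stats.try_record("conflict"));
    drop(reader);

    assert!(stats.try_record("after"));
    println!("hits = {}, log = {:?}", stats.hits.get(), stats.log.borrow());
    // prints: hits = 2, log = ["start", "after"]
}
```

`try_borrow_mut` only reports the conflict. If it fires often, the questions above about who should control mutation are probably the better place to start.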
- ---- - -## Error → Design Question - -| Error | Don't Just Say | Ask Instead | -|-------|----------------|-------------| -| E0596 | "Add mut" | Should this really be mutable? | -| E0499 | "Split borrows" | Is the data structure right? | -| E0502 | "Separate scopes" | Why do we need both borrows? | -| RefCell panic | "Use try_borrow" | Is runtime check appropriate? | - ---- - -## Thinking Prompt - -Before adding mutability: - -1. **Is mutation necessary?** - - Maybe transform → return new value - - Maybe builder → construct immutably - -2. **Who controls mutation?** - - External caller → `&mut T` - - Internal logic → interior mutability - - Concurrent access → synchronized mutability - -3. **What's the thread context?** - - Single-thread → Cell/RefCell - - Multi-thread → Mutex/RwLock/Atomic - ---- - -## Trace Up ↑ - -When mutability conflicts persist: - -``` -E0499/E0502 (borrow conflicts) - ↑ Ask: Is the data structure designed correctly? - ↑ Check: m09-domain (should data be split?) - ↑ Check: m07-concurrency (is async involved?) -``` - -| Persistent Error | Trace To | Question | -|-----------------|----------|----------| -| Repeated borrow conflicts | m09-domain | Should data be restructured? | -| RefCell in async | m07-concurrency | Is Send/Sync needed? | -| Mutex deadlocks | m07-concurrency | Is the lock design right? | - ---- - -## Trace Down ↓ - -From design to implementation: - -``` -"Need mutable access from &self" - ↓ T: Copy → Cell<T> - ↓ T: !Copy → RefCell<T> - -"Need thread-safe mutation" - ↓ Simple counters → AtomicXxx - ↓ Complex data → Mutex<T> or RwLock<T> - -"Need shared mutable state" - ↓ Single-thread: Rc<RefCell<T>> - ↓ Multi-thread: Arc<Mutex<T>> -``` - ---- - -## Borrow Rules - -``` -At any time, you can have EITHER: -├─ Multiple &T (immutable borrows) -└─ OR one &mut T (mutable borrow) - -Never both simultaneously. 
-``` - -## Quick Reference - -| Pattern | Thread-Safe | Runtime Cost | Use When | -|---------|-------------|--------------|----------| -| `&mut T` | N/A | Zero | Exclusive mutable access | -| `Cell` | No | Zero | Copy types, no refs needed | -| `RefCell` | No | Runtime check | Non-Copy, need runtime borrow | -| `Mutex` | Yes | Lock contention | Thread-safe mutation | -| `RwLock` | Yes | Lock contention | Many readers, few writers | -| `Atomic*` | Yes | Minimal | Simple types (bool, usize) | - -## Error Code Reference - -| Error | Cause | Quick Fix | -|-------|-------|-----------| -| E0596 | Borrowing immutable as mutable | Add `mut` or redesign | -| E0499 | Multiple mutable borrows | Restructure code flow | -| E0502 | &mut while & exists | Separate borrow scopes | - ---- - -## Interior Mutability Decision - -| Scenario | Choose | -|----------|--------| -| T: Copy, single-thread | `Cell` | -| T: !Copy, single-thread | `RefCell` | -| T: Copy, multi-thread | `AtomicXxx` | -| T: !Copy, multi-thread | `Mutex` or `RwLock` | -| Read-heavy, multi-thread | `RwLock` | -| Simple flags/counters | `AtomicBool`, `AtomicUsize` | - ---- - -## Anti-Patterns - -| Anti-Pattern | Why Bad | Better | -|--------------|---------|--------| -| RefCell everywhere | Runtime panics | Clear ownership design | -| Mutex for single-thread | Unnecessary overhead | RefCell | -| Ignore RefCell panic | Hard to debug | Handle or restructure | -| Lock inside hot loop | Performance killer | Batch operations | - ---- - -## Related Skills - -| When | See | -|------|-----| -| Smart pointer choice | m02-resource | -| Thread safety | m07-concurrency | -| Data structure design | m09-domain | -| Anti-patterns | m15-anti-pattern | diff --git a/.claude/skills/m04-zero-cost/SKILL.md b/.claude/skills/m04-zero-cost/SKILL.md deleted file mode 100644 index cea51a62e..000000000 --- a/.claude/skills/m04-zero-cost/SKILL.md +++ /dev/null @@ -1,164 +0,0 @@ ---- -name: m04-zero-cost -description: "CRITICAL: Use for 
generics, traits, zero-cost abstraction. Triggers: E0277, E0308, E0599, generic, trait, impl, dyn, where, monomorphization, static dispatch, dynamic dispatch, impl Trait, trait bound not satisfied, 泛型, 特征, 零成本抽象, 单态化" ---- - -# Zero-Cost Abstraction - -> **Layer 1: Language Mechanics** - -## Core Question - -**Do we need compile-time or runtime polymorphism?** - -Before choosing between generics and trait objects: -- Is the type known at compile time? -- Is a heterogeneous collection needed? -- What's the performance priority? - ---- - -## Error → Design Question - -| Error | Don't Just Say | Ask Instead | -|-------|----------------|-------------| -| E0277 | "Add trait bound" | Is this abstraction at the right level? | -| E0308 | "Fix the type" | Should types be unified or distinct? | -| E0599 | "Import the trait" | Is the trait the right abstraction? | -| E0038 | "Make object-safe" | Do we really need dynamic dispatch? | - ---- - -## Thinking Prompt - -Before adding trait bounds: - -1. **What abstraction is needed?** - - Same behavior, different types → trait - - Different behavior, same type → enum - - No abstraction needed → concrete type - -2. **When is type known?** - - Compile time → generics (static dispatch) - - Runtime → trait objects (dynamic dispatch) - -3. **What's the trade-off priority?** - - Performance → generics - - Compile time → trait objects - - Flexibility → depends - ---- - -## Trace Up ↑ - -When type system fights back: - -``` -E0277 (trait bound not satisfied) - ↑ Ask: Is the abstraction level correct? - ↑ Check: m09-domain (what behavior is being abstracted?) - ↑ Check: m05-type-driven (should use newtype?) -``` - -| Persistent Error | Trace To | Question | -|-----------------|----------|----------| -| Complex trait bounds | m09-domain | Is the abstraction right? | -| Object safety issues | m05-type-driven | Can typestate help? | -| Type explosion | m10-performance | Accept dyn overhead? 
| - ---- - -## Trace Down ↓ - -From design to implementation: - -``` -"Need to abstract over types with same behavior" - ↓ Types known at compile time → impl Trait or generics - ↓ Types determined at runtime → dyn Trait - -"Need collection of different types" - ↓ Closed set → enum - ↓ Open set → Vec<Box<dyn Trait>> - -"Need to return different types" - ↓ Same concrete type on every path → impl Trait - ↓ Different concrete types on different paths → Box<dyn Trait> -``` - ---- - -## Quick Reference - -| Pattern | Dispatch | Code Size | Runtime Cost | -|---------|----------|-----------|--------------| -| `fn foo<T: Trait>()` | Static | +bloat | Zero | -| `fn foo(x: &dyn Trait)` | Dynamic | Minimal | vtable lookup | -| `impl Trait` return | Static | +bloat | Zero | -| `Box<dyn Trait>` | Dynamic | Minimal | Allocation + vtable | - -## Syntax Comparison - -```rust -// Static dispatch - type known at compile time -fn process(x: impl Display) { } // argument position -fn process<T: Display>(x: T) { } // explicit generic -fn get() -> impl Display { 42 } // return position - -// Dynamic dispatch - type determined at runtime -fn process(x: &dyn Display) { } // reference -fn process(x: Box<dyn Display>) { } // owned -``` - -## Error Code Reference - -| Error | Cause | Quick Fix | -|-------|-------|-----------| -| E0277 | Type doesn't impl trait | Add impl or change bound | -| E0308 | Type mismatch | Check generic params | -| E0599 | No method found | Import trait with `use` | -| E0038 | Trait not object-safe | Use generics or redesign | - ---- - -## Decision Guide - -| Scenario | Choose | Why | -|----------|--------|-----| -| Performance critical | Generics | Zero runtime cost | -| Heterogeneous collection | `dyn Trait` | Different types at runtime | -| Plugin architecture | `dyn Trait` | Unknown types at compile | -| Reduce compile time | `dyn Trait` | Less monomorphization | -| Small, known type set | `enum` | No indirection | - ---- - -## Object Safety - -A trait is object-safe if it: -- Doesn't have `Self: Sized` bound -- Doesn't return `Self` -- Doesn't have generic methods -- Uses `where Self: 
Sized` for non-object-safe methods - ---- - -## Anti-Patterns - -| Anti-Pattern | Why Bad | Better | -|--------------|---------|--------| -| Over-generic everything | Compile time, complexity | Concrete types when possible | -| `dyn` for known types | Unnecessary indirection | Generics | -| Complex trait hierarchies | Hard to understand | Simpler design | -| Ignore object safety | Limits flexibility | Plan for dyn if needed | - ---- - -## Related Skills - -| When | See | -|------|-----| -| Type-driven design | m05-type-driven | -| Domain abstraction | m09-domain | -| Performance concerns | m10-performance | -| Send/Sync bounds | m07-concurrency | diff --git a/.claude/skills/m05-type-driven/SKILL.md b/.claude/skills/m05-type-driven/SKILL.md deleted file mode 100644 index 01f72ec78..000000000 --- a/.claude/skills/m05-type-driven/SKILL.md +++ /dev/null @@ -1,174 +0,0 @@ ---- -name: m05-type-driven -description: "CRITICAL: Use for type-driven design. Triggers: type state, PhantomData, newtype, marker trait, builder pattern, make invalid states unrepresentable, compile-time validation, sealed trait, ZST, 类型状态, 新类型模式, 类型驱动设计" ---- - -# Type-Driven Design - -> **Layer 1: Language Mechanics** - -## Core Question - -**How can the type system prevent invalid states?** - -Before reaching for runtime checks: -- Can the compiler catch this error? -- Can invalid states be unrepresentable? -- Can the type encode the invariant? - ---- - -## Error → Design Question - -| Pattern | Don't Just Say | Ask Instead | -|---------|----------------|-------------| -| Primitive obsession | "It's just a string" | What does this value represent? | -| Boolean flags | "Add an is_valid flag" | Can states be types? | -| Optional everywhere | "Check for None" | Is absence really possible? | -| Validation at runtime | "Return Err if invalid" | Can we validate at construction? | - ---- - -## Thinking Prompt - -Before adding runtime validation: - -1. 
**Can the type encode the constraint?** - - Numeric range → bounded types or newtypes - - Valid states → type state pattern - - Semantic meaning → newtype - -2. **When is validation possible?** - - At construction → validated newtype - - At state transition → type state - - Only at runtime → Result with clear error - -3. **Who needs to know the invariant?** - - Compiler → type-level encoding - - API users → clear type signatures - - Runtime only → documentation - ---- - -## Trace Up ↑ - -When type design is unclear: - -``` -"Need to validate email format" - ↑ Ask: Is this a domain value object? - ↑ Check: m09-domain (Email as Value Object) - ↑ Check: domain-* (validation requirements) -``` - -| Situation | Trace To | Question | -|-----------|----------|----------| -| What types to create | m09-domain | What's the domain model? | -| State machine design | m09-domain | What are valid transitions? | -| Marker trait usage | m04-zero-cost | Static or dynamic dispatch? | - ---- - -## Trace Down ↓ - -From design to implementation: - -``` -"Need type-safe wrapper for primitives" - ↓ Newtype: struct UserId(u64); - -"Need compile-time state validation" - ↓ Type State: Connection - -"Need to track phantom type parameters" - ↓ PhantomData: PhantomData - -"Need capability markers" - ↓ Marker Trait: trait Validated {} - -"Need gradual construction" - ↓ Builder: Builder::new().field(x).build() -``` - ---- - -## Quick Reference - -| Pattern | Purpose | Example | -|---------|---------|---------| -| Newtype | Type safety | `struct UserId(u64);` | -| Type State | State machine | `Connection` | -| PhantomData | Variance/lifetime | `PhantomData<&'a T>` | -| Marker Trait | Capability flag | `trait Validated {}` | -| Builder | Gradual construction | `Builder::new().name("x").build()` | -| Sealed Trait | Prevent external impl | `mod private { pub trait Sealed {} }` | - -## Pattern Examples - -### Newtype - -```rust -struct Email(String); // Not just any string - -impl Email { - pub fn 
new(s: &str) -> Result { - // Validate once, trust forever - validate_email(s)?; - Ok(Self(s.to_string())) - } -} -``` - -### Type State - -```rust -struct Connection<State>(TcpStream, PhantomData<State>); - -struct Disconnected; -struct Connected; -struct Authenticated; - -impl Connection<Disconnected> { - fn connect(self) -> Connection<Connected> { ... } -} - -impl Connection<Connected> { - fn authenticate(self) -> Connection<Authenticated> { ... } -} -``` - ---- - -## Decision Guide - -| Need | Pattern | -|------|---------| -| Type safety for primitives | Newtype | -| Compile-time state validation | Type State | -| Lifetime/variance markers | PhantomData | -| Capability flags | Marker Trait | -| Gradual construction | Builder | -| Closed set of impls | Sealed Trait | -| Zero-sized type marker | ZST struct | - ---- - -## Anti-Patterns - -| Anti-Pattern | Why Bad | Better | -|--------------|---------|--------| -| Boolean flags for states | Runtime errors | Type state | -| String for semantic types | No type safety | Newtype | -| Option for uninitialized | Unclear invariant | Builder | -| Public fields with invariants | Invariant violation | Private + validated new() | - ---- - -## Related Skills - -| When | See | -|------|-----| -| Domain modeling | m09-domain | -| Trait design | m04-zero-cost | -| Error handling in constructors | m06-error-handling | -| Anti-patterns | m15-anti-pattern | diff --git a/.claude/skills/m06-error-handling/SKILL.md b/.claude/skills/m06-error-handling/SKILL.md deleted file mode 100644 index 0c333bd7b..000000000 --- a/.claude/skills/m06-error-handling/SKILL.md +++ /dev/null @@ -1,165 +0,0 @@ ---- -name: m06-error-handling -description: "CRITICAL: Use for error handling.
Triggers: Result, Option, Error, ?, unwrap, expect, panic, anyhow, thiserror, when to panic vs return Result, custom error, error propagation, 错误处理, Result 用法, 什么时候用 panic" ---- - -# Error Handling - -> **Layer 1: Language Mechanics** - -## Core Question - -**Is this failure expected or a bug?** - -Before choosing error handling strategy: -- Can this fail in normal operation? -- Who should handle this failure? -- What context does the caller need? - ---- - -## Error → Design Question - -| Pattern | Don't Just Say | Ask Instead | -|---------|----------------|-------------| -| unwrap panics | "Use ?" | Is None/Err actually possible here? | -| Type mismatch on ? | "Use anyhow" | Are error types designed correctly? | -| Lost error context | "Add .context()" | What does the caller need to know? | -| Too many error variants | "Use Box" | Is error granularity right? | - ---- - -## Thinking Prompt - -Before handling an error: - -1. **What kind of failure is this?** - - Expected → Result - - Absence normal → Option - - Bug/invariant → panic! - - Unrecoverable → panic! - -2. **Who handles this?** - - Caller → propagate with ? - - Current function → match/if-let - - User → friendly error message - - Programmer → panic with message - -3. **What context is needed?** - - Type of error → thiserror variants - - Call chain → anyhow::Context - - Debug info → anyhow or tracing - ---- - -## Trace Up ↑ - -When error strategy is unclear: - -``` -"Should I return Result or Option?" - ↑ Ask: Is absence/failure normal or exceptional? - ↑ Check: m09-domain (what does domain say?) - ↑ Check: domain-* (error handling requirements) -``` - -| Situation | Trace To | Question | -|-----------|----------|----------| -| Too many unwraps | m09-domain | Is the data model right? | -| Error context design | m13-domain-error | What recovery is needed? | -| Library vs app errors | m11-ecosystem | Who are the consumers? 
| - ---- - -## Trace Down ↓ - -From design to implementation: - -``` -"Expected failure, library code" - ↓ Use: thiserror for typed errors - -"Expected failure, application code" - ↓ Use: anyhow for ergonomic errors - -"Absence is normal (find, get, lookup)" - ↓ Use: Option - -"Bug or invariant violation" - ↓ Use: panic!, assert!, unreachable! - -"Need to propagate with context" - ↓ Use: .context("what was happening") -``` - ---- - -## Quick Reference - -| Pattern | When | Example | -|---------|------|---------| -| `Result` | Recoverable error | `fn read() -> Result` | -| `Option` | Absence is normal | `fn find() -> Option<&Item>` | -| `?` | Propagate error | `let data = file.read()?;` | -| `unwrap()` | Dev/test only | `config.get("key").unwrap()` | -| `expect()` | Invariant holds | `env.get("HOME").expect("HOME set")` | -| `panic!` | Unrecoverable | `panic!("critical failure")` | - -## Library vs Application - -| Context | Error Crate | Why | -|---------|-------------|-----| -| Library | `thiserror` | Typed errors for consumers | -| Application | `anyhow` | Ergonomic error handling | -| Mixed | Both | thiserror at boundaries, anyhow internally | - -## Decision Flowchart - -``` -Is failure expected? -├─ Yes → Is absence the only "failure"? -│ ├─ Yes → Option -│ └─ No → Result -│ ├─ Library → thiserror -│ └─ Application → anyhow -└─ No → Is it a bug? - ├─ Yes → panic!, assert! - └─ No → Consider if really unrecoverable - -Use ? → Need context? -├─ Yes → .context("message") -└─ No → Plain ? 
-``` - ---- - -## Common Errors - -| Error | Cause | Fix | -|-------|-------|-----| -| `unwrap()` panic | Unhandled None/Err | Use `?` or match | -| Type mismatch | Different error types | Use `anyhow` or `From` | -| Lost context | `?` without context | Add `.context()` | -| `cannot use ?` | Missing Result return | Return `Result<(), E>` | - ---- - -## Anti-Patterns - -| Anti-Pattern | Why Bad | Better | -|--------------|---------|--------| -| `.unwrap()` everywhere | Panics in production | `.expect("reason")` or `?` | -| Ignore errors silently | Bugs hidden | Handle or propagate | -| `panic!` for expected errors | Bad UX, no recovery | Result | -| Box everywhere | Lost type info | thiserror | - ---- - -## Related Skills - -| When | See | -|------|-----| -| Domain error strategy | m13-domain-error | -| Crate boundaries | m11-ecosystem | -| Type-safe errors | m05-type-driven | -| Mental models | m14-mental-model | diff --git a/.claude/skills/m06-error-handling/examples/library-vs-app.md b/.claude/skills/m06-error-handling/examples/library-vs-app.md deleted file mode 100644 index 7a7ef622f..000000000 --- a/.claude/skills/m06-error-handling/examples/library-vs-app.md +++ /dev/null @@ -1,332 +0,0 @@ -# Error Handling: Library vs Application - -## Library Error Design - -### Principles -1. **Define specific error types** - Don't use `anyhow` in libraries -2. **Implement std::error::Error** - For compatibility -3. **Provide error variants** - Let users match on errors -4. **Include source errors** - Enable error chains -5. 
**Be `Send + Sync`** - For async compatibility - -### Example: Library Error Type -```rust -// lib.rs -use thiserror::Error; - -#[derive(Error, Debug)] -pub enum DatabaseError { - #[error("connection failed: {host}:{port}")] - ConnectionFailed { - host: String, - port: u16, - #[source] - source: std::io::Error, - }, - - #[error("query failed: {query}")] - QueryFailed { - query: String, - #[source] - source: SqlError, - }, - - #[error("record not found: {table}.{id}")] - NotFound { table: String, id: String }, - - #[error("constraint violation: {0}")] - ConstraintViolation(String), -} - -// Public Result alias -pub type Result = std::result::Result; - -// Library functions -pub fn connect(host: &str, port: u16) -> Result { - // ... -} - -pub fn query(conn: &Connection, sql: &str) -> Result { - // ... -} -``` - -### Library Usage of Errors -```rust -impl Database { - pub fn get_user(&self, id: &str) -> Result { - let rows = self.query(&format!("SELECT * FROM users WHERE id = '{}'", id))?; - - rows.first() - .cloned() - .ok_or_else(|| DatabaseError::NotFound { - table: "users".to_string(), - id: id.to_string(), - }) - } -} -``` - ---- - -## Application Error Design - -### Principles -1. **Use anyhow for convenience** - Or custom unified error -2. **Add context liberally** - Help debugging -3. **Log at boundaries** - Don't log in libraries -4. 
**Convert to user-friendly messages** - For display - -### Example: Application Error Handling -```rust -// main.rs -use anyhow::{Context, Result}; -use tracing::{error, info}; - -async fn run_server() -> Result<()> { - let config = load_config() - .context("failed to load configuration")?; - - let db = Database::connect(&config.db_url) - .await - .context("failed to connect to database")?; - - let server = Server::new(config.port) - .context("failed to create server")?; - - info!("Server starting on port {}", config.port); - - server.run(db).await - .context("server error")?; - - Ok(()) -} - -#[tokio::main] -async fn main() { - tracing_subscriber::init(); - - if let Err(e) = run_server().await { - error!("Application error: {:#}", e); - std::process::exit(1); - } -} -``` - -### Converting Library Errors -```rust -use mylib::DatabaseError; - -async fn get_user_handler(id: &str) -> Result { - match db.get_user(id).await { - Ok(user) => Ok(Response::json(user)), - - Err(DatabaseError::NotFound { .. }) => { - Ok(Response::not_found("User not found")) - } - - Err(DatabaseError::ConnectionFailed { .. 
}) => { - error!("Database connection failed"); - Ok(Response::internal_error("Service unavailable")) - } - - Err(e) => { - error!("Database error: {}", e); - Err(e.into()) // Convert to anyhow::Error - } - } -} -``` - ---- - -## Error Handling Layers - -``` -┌─────────────────────────────────────┐ -│ Application Layer │ -│ - Use anyhow or unified error │ -│ - Add context at boundaries │ -│ - Log errors │ -│ - Convert to user messages │ -└─────────────────────────────────────┘ - │ - │ calls - ▼ -┌─────────────────────────────────────┐ -│ Service Layer │ -│ - Map between error types │ -│ - Add business context │ -│ - Handle recoverable errors │ -└─────────────────────────────────────┘ - │ - │ calls - ▼ -┌─────────────────────────────────────┐ -│ Library Layer │ -│ - Define specific error types │ -│ - Use thiserror │ -│ - Include source errors │ -│ - No logging │ -└─────────────────────────────────────┘ -``` - ---- - -## Practical Examples - -### HTTP API Error Response -```rust -use axum::{response::IntoResponse, http::StatusCode}; -use serde::Serialize; - -#[derive(Serialize)] -struct ErrorResponse { - error: String, - code: String, -} - -enum AppError { - NotFound(String), - BadRequest(String), - Internal(anyhow::Error), -} - -impl IntoResponse for AppError { - fn into_response(self) -> axum::response::Response { - let (status, error, code) = match self { - AppError::NotFound(msg) => { - (StatusCode::NOT_FOUND, msg, "NOT_FOUND") - } - AppError::BadRequest(msg) => { - (StatusCode::BAD_REQUEST, msg, "BAD_REQUEST") - } - AppError::Internal(e) => { - tracing::error!("Internal error: {:#}", e); - ( - StatusCode::INTERNAL_SERVER_ERROR, - "Internal server error".to_string(), - "INTERNAL_ERROR", - ) - } - }; - - let body = ErrorResponse { - error, - code: code.to_string(), - }; - - (status, axum::Json(body)).into_response() - } -} -``` - -### CLI Error Handling -```rust -use anyhow::{Context, Result}; -use clap::Parser; - -#[derive(Parser)] -struct Args { - #[arg(short, 
long)] - config: String, -} - -fn main() { - if let Err(e) = run() { - eprintln!("Error: {:#}", e); - std::process::exit(1); - } -} - -fn run() -> Result<()> { - let args = Args::parse(); - - let config = std::fs::read_to_string(&args.config) - .context(format!("Failed to read config file: {}", args.config))?; - - let parsed: Config = toml::from_str(&config) - .context("Failed to parse config file")?; - - process(parsed)?; - - println!("Done!"); - Ok(()) -} -``` - ---- - -## Testing Error Handling - -### Testing Error Cases -```rust -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn test_not_found_error() { - let result = db.get_user("nonexistent"); - - assert!(matches!( - result, - Err(DatabaseError::NotFound { table, id }) - if table == "users" && id == "nonexistent" - )); - } - - #[test] - fn test_error_message() { - let err = DatabaseError::NotFound { - table: "users".to_string(), - id: "123".to_string(), - }; - - assert_eq!(err.to_string(), "record not found: users.123"); - } - - #[test] - fn test_error_chain() { - let io_err = std::io::Error::new( - std::io::ErrorKind::ConnectionRefused, - "connection refused" - ); - - let err = DatabaseError::ConnectionFailed { - host: "localhost".to_string(), - port: 5432, - source: io_err, - }; - - // Check source is preserved - assert!(err.source().is_some()); - } -} -``` - -### Testing with anyhow -```rust -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn test_with_context() -> anyhow::Result<()> { - let result = process("valid input")?; - assert_eq!(result, expected); - Ok(()) - } - - #[test] - fn test_error_context() { - let err = process("invalid") - .context("processing failed") - .unwrap_err(); - - // Check error chain contains expected text - let chain = format!("{:#}", err); - assert!(chain.contains("processing failed")); - } -} -``` diff --git a/.claude/skills/m06-error-handling/patterns/error-patterns.md b/.claude/skills/m06-error-handling/patterns/error-patterns.md deleted file mode 100644 
index d4d70c8c0..000000000 --- a/.claude/skills/m06-error-handling/patterns/error-patterns.md +++ /dev/null @@ -1,404 +0,0 @@ -# Error Handling Patterns - -## The ? Operator - -### Basic Usage -```rust -fn read_config() -> Result { - let content = std::fs::read_to_string("config.toml")?; - let config: Config = toml::from_str(&content)?; // needs From impl - Ok(config) -} -``` - -### With Different Error Types -```rust -use std::error::Error; - -// Box for quick prototyping -fn process() -> Result<(), Box> { - let file = std::fs::read_to_string("data.txt")?; - let num: i32 = file.trim().parse()?; // different error type - Ok(()) -} -``` - -### Custom Conversion with From -```rust -#[derive(Debug)] -enum MyError { - Io(std::io::Error), - Parse(std::num::ParseIntError), -} - -impl From for MyError { - fn from(err: std::io::Error) -> Self { - MyError::Io(err) - } -} - -impl From for MyError { - fn from(err: std::num::ParseIntError) -> Self { - MyError::Parse(err) - } -} - -fn process() -> Result { - let content = std::fs::read_to_string("num.txt")?; // auto-converts - let num: i32 = content.trim().parse()?; // auto-converts - Ok(num) -} -``` - ---- - -## Error Type Design - -### Simple Enum Error -```rust -#[derive(Debug, Clone, PartialEq)] -pub enum ConfigError { - NotFound, - InvalidFormat, - MissingField(String), -} - -impl std::fmt::Display for ConfigError { - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { - match self { - ConfigError::NotFound => write!(f, "configuration file not found"), - ConfigError::InvalidFormat => write!(f, "invalid configuration format"), - ConfigError::MissingField(field) => write!(f, "missing field: {}", field), - } - } -} - -impl std::error::Error for ConfigError {} -``` - -### Error with Source (Wrapping) -```rust -#[derive(Debug)] -pub struct AppError { - kind: AppErrorKind, - source: Option>, -} - -#[derive(Debug, Clone, Copy)] -pub enum AppErrorKind { - Config, - Database, - Network, -} - -impl std::fmt::Display 
for AppError { - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { - match self.kind { - AppErrorKind::Config => write!(f, "configuration error"), - AppErrorKind::Database => write!(f, "database error"), - AppErrorKind::Network => write!(f, "network error"), - } - } -} - -impl std::error::Error for AppError { - fn source(&self) -> Option<&(dyn std::error::Error + 'static)> { - self.source.as_ref().map(|e| e.as_ref() as _) - } -} -``` - ---- - -## Using thiserror - -### Basic Usage -```rust -use thiserror::Error; - -#[derive(Error, Debug)] -pub enum DataError { - #[error("file not found: {path}")] - NotFound { path: String }, - - #[error("invalid data format")] - InvalidFormat, - - #[error("IO error")] - Io(#[from] std::io::Error), - - #[error("parse error: {0}")] - Parse(#[from] std::num::ParseIntError), -} - -// Usage -fn load_data(path: &str) -> Result { - let content = std::fs::read_to_string(path) - .map_err(|_| DataError::NotFound { path: path.to_string() })?; - let num: i32 = content.trim().parse()?; // auto-converts with #[from] - Ok(Data { value: num }) -} -``` - -### Transparent Wrapper -```rust -use thiserror::Error; - -#[derive(Error, Debug)] -#[error(transparent)] -pub struct MyError(#[from] InnerError); - -// Useful for newtype error wrappers -``` - ---- - -## Using anyhow - -### For Applications -```rust -use anyhow::{Context, Result, bail, ensure}; - -fn process_file(path: &str) -> Result { - let content = std::fs::read_to_string(path) - .context("failed to read config file")?; - - ensure!(!content.is_empty(), "config file is empty"); - - let data: Data = serde_json::from_str(&content) - .context("failed to parse JSON")?; - - if data.version < 1 { - bail!("unsupported config version: {}", data.version); - } - - Ok(data) -} - -fn main() -> Result<()> { - let data = process_file("config.json") - .context("failed to load configuration")?; - Ok(()) -} -``` - -### Error Chain -```rust -use anyhow::{Context, Result}; - -fn 
deep_function() -> Result<()> { - std::fs::read_to_string("missing.txt") - .context("failed to read file")?; - Ok(()) -} - -fn middle_function() -> Result<()> { - deep_function() - .context("failed in deep function")?; - Ok(()) -} - -fn top_function() -> Result<()> { - middle_function() - .context("failed in middle function")?; - Ok(()) -} - -// Error output shows full chain: -// Error: failed in middle function -// Caused by: -// 0: failed in deep function -// 1: failed to read file -// 2: No such file or directory (os error 2) -``` - ---- - -## Option Handling - -### Converting Option to Result -```rust -fn find_user(id: u32) -> Option { ... } - -// Using ok_or for static error -fn get_user(id: u32) -> Result { - find_user(id).ok_or("user not found") -} - -// Using ok_or_else for dynamic error -fn get_user(id: u32) -> Result { - find_user(id).ok_or_else(|| format!("user {} not found", id)) -} -``` - -### Chaining Options -```rust -fn get_nested_value(data: &Data) -> Option<&str> { - data.config - .as_ref()? - .nested - .as_ref()? 
- .value - .as_deref() -} - -// Equivalent with and_then -fn get_nested_value(data: &Data) -> Option<&str> { - data.config - .as_ref() - .and_then(|c| c.nested.as_ref()) - .and_then(|n| n.value.as_deref()) -} -``` - ---- - -## Pattern: Result Combinators - -### map and map_err -```rust -fn parse_port(s: &str) -> Result { - s.parse::() - .map_err(|e| ParseError::InvalidPort(e)) -} - -fn get_url(config: &Config) -> Result { - config.url() - .map(|u| format!("https://{}", u)) -} -``` - -### and_then (flatMap) -```rust -fn validate_and_save(input: &str) -> Result<(), Error> { - validate(input) - .and_then(|valid| save(valid)) - .and_then(|saved| notify(saved)) -} -``` - -### unwrap_or and unwrap_or_else -```rust -// Default value -let port = config.port().unwrap_or(8080); - -// Computed default -let port = config.port().unwrap_or_else(|| find_free_port()); - -// Default for Result -let data = load_data().unwrap_or_default(); -``` - ---- - -## Pattern: Early Return vs Combinators - -### Early Return Style -```rust -fn process(input: &str) -> Result { - let step1 = validate(input)?; - if !step1.is_valid { - return Err(Error::Invalid); - } - - let step2 = transform(step1)?; - let step3 = save(step2)?; - - Ok(step3) -} -``` - -### Combinator Style -```rust -fn process(input: &str) -> Result { - validate(input) - .and_then(|s| { - if s.is_valid { - Ok(s) - } else { - Err(Error::Invalid) - } - }) - .and_then(transform) - .and_then(save) -} -``` - -### When to Use Which - -| Style | Best For | -|-------|----------| -| Early return (`?`) | Most cases, clearer flow | -| Combinators | Functional pipelines, one-liners | -| Match | Complex branching on errors | - ---- - -## Panic vs Result - -### When to Panic -```rust -// 1. Unrecoverable programmer error -fn get_config() -> &'static Config { - CONFIG.get().expect("config must be initialized") -} - -// 2. 
In tests -#[test] -fn test_parsing() { - let result = parse("valid").unwrap(); // OK in tests - assert_eq!(result, expected); -} - -// 3. Prototype/examples -fn main() { - let data = load().unwrap(); // OK for quick examples -} -``` - -### When to Return Result -```rust -// 1. Any I/O operation -fn read_file(path: &str) -> Result - -// 2. User input validation -fn parse_port(s: &str) -> Result - -// 3. Network operations -async fn fetch(url: &str) -> Result - -// 4. Anything that can fail at runtime -fn connect(addr: &str) -> Result -``` - ---- - -## Error Context Best Practices - -### Add Context at Boundaries -```rust -fn load_user_config(user_id: u64) -> Result { - let path = format!("/home/{}/config.toml", user_id); - - std::fs::read_to_string(&path) - .context(format!("failed to read config for user {}", user_id))? - // NOT: .context("failed to read file") // too generic - - // ... -} -``` - -### Include Relevant Data -```rust -// Good: includes the problematic value -fn parse_age(s: &str) -> Result { - s.parse() - .context(format!("invalid age value: '{}'", s)) -} - -// Bad: no context about what failed -fn parse_age(s: &str) -> Result { - s.parse() - .context("parse error") -} -``` diff --git a/.claude/skills/m07-concurrency/SKILL.md b/.claude/skills/m07-concurrency/SKILL.md deleted file mode 100644 index e3d003f1c..000000000 --- a/.claude/skills/m07-concurrency/SKILL.md +++ /dev/null @@ -1,221 +0,0 @@ ---- -name: m07-concurrency -description: "CRITICAL: Use for concurrency/async. Triggers: E0277 Send Sync, cannot be sent between threads, thread, spawn, channel, mpsc, Mutex, RwLock, Atomic, async, await, Future, tokio, deadlock, race condition, 并发, 线程, 异步, 死锁" ---- - -# Concurrency - -> **Layer 1: Language Mechanics** - -## Core Question - -**Is this CPU-bound or I/O-bound, and what's the sharing model?** - -Before choosing concurrency primitives: -- What's the workload type? -- What data needs to be shared? -- What's the thread safety requirement? 
- ---- - -## Error → Design Question - -| Error | Don't Just Say | Ask Instead | -|-------|----------------|-------------| -| E0277 Send | "Add Send bound" | Should this type cross threads? | -| E0277 Sync | "Wrap in Mutex" | Is shared access really needed? | -| Future not Send | "Use spawn_local" | Is async the right choice? | -| Deadlock | "Reorder locks" | Is the locking design correct? | - ---- - -## Thinking Prompt - -Before adding concurrency: - -1. **What's the workload?** - - CPU-bound → threads (std::thread, rayon) - - I/O-bound → async (tokio, async-std) - - Mixed → hybrid approach - -2. **What's the sharing model?** - - No sharing → message passing (channels) - - Immutable sharing → Arc - - Mutable sharing → Arc> or Arc> - -3. **What are the Send/Sync requirements?** - - Cross-thread ownership → Send - - Cross-thread references → Sync - - Single-thread async → spawn_local - ---- - -## Trace Up ↑ (MANDATORY) - -**CRITICAL**: Don't just fix the error. Trace UP to find domain constraints. - -### Domain Detection Table - -| Context Keywords | Load Domain Skill | Key Constraint | -|-----------------|-------------------|----------------| -| Web API, HTTP, axum, actix, handler | **domain-web** | Handlers run on any thread | -| 交易, 支付, trading, payment | **domain-fintech** | Audit + thread safety | -| gRPC, kubernetes, microservice | **domain-cloud-native** | Distributed tracing | -| CLI, terminal, clap | **domain-cli** | Usually single-thread OK | - -### Example: Web API + Rc Error - -``` -"Rc cannot be sent between threads" in Web API context - ↑ DETECT: "Web API" → Load domain-web - ↑ FIND: domain-web says "Shared state must be thread-safe" - ↑ FIND: domain-web says "Rc in state" is Common Mistake - ↓ DESIGN: Use Arc with State extractor - ↓ IMPL: axum::extract::State> -``` - -### Generic Trace - -``` -"Send not satisfied for my type" - ↑ Ask: What domain is this? Load domain-* skill - ↑ Ask: Does this type need to cross thread boundaries? 
- ↑ Check: m09-domain (is the data model correct?) -``` - -| Situation | Trace To | Question | -|-----------|----------|----------| -| Send/Sync in Web | **domain-web** | What's the state management pattern? | -| Send/Sync in CLI | **domain-cli** | Is multi-thread really needed? | -| Mutex vs channels | m09-domain | Shared state or message passing? | -| Async vs threads | m10-performance | What's the workload profile? | - ---- - -## Trace Down ↓ - -From design to implementation: - -``` -"Need parallelism for CPU work" - ↓ Use: std::thread or rayon - -"Need concurrency for I/O" - ↓ Use: async/await with tokio - -"Need to share immutable data across threads" - ↓ Use: Arc - -"Need to share mutable data across threads" - ↓ Use: Arc> or Arc> - ↓ Or: channels for message passing - -"Need simple atomic operations" - ↓ Use: AtomicBool, AtomicUsize, etc. -``` - ---- - -## Send/Sync Markers - -| Marker | Meaning | Example | -|--------|---------|---------| -| `Send` | Can transfer ownership between threads | Most types | -| `Sync` | Can share references between threads | `Arc` | -| `!Send` | Must stay on one thread | `Rc` | -| `!Sync` | No shared refs across threads | `RefCell` | - -## Quick Reference - -| Pattern | Thread-Safe | Blocking | Use When | -|---------|-------------|----------|----------| -| `std::thread` | Yes | Yes | CPU-bound parallelism | -| `async/await` | Yes | No | I/O-bound concurrency | -| `Mutex` | Yes | Yes | Shared mutable state | -| `RwLock` | Yes | Yes | Read-heavy shared state | -| `mpsc::channel` | Yes | Optional | Message passing | -| `Arc>` | Yes | Yes | Shared mutable across threads | - -## Decision Flowchart - -``` -What type of work? -├─ CPU-bound → std::thread or rayon -├─ I/O-bound → async/await -└─ Mixed → hybrid (spawn_blocking) - -Need to share data? -├─ No → message passing (channels) -├─ Immutable → Arc -└─ Mutable → - ├─ Read-heavy → Arc> - └─ Write-heavy → Arc> - └─ Simple counter → AtomicUsize - -Async context? 
-├─ Type is Send → tokio::spawn -├─ Type is !Send → spawn_local -└─ Blocking code → spawn_blocking -``` - ---- - -## Common Errors - -| Error | Cause | Fix | -|-------|-------|-----| -| E0277 `Send` not satisfied | Non-Send in async | Use Arc or spawn_local | -| E0277 `Sync` not satisfied | Non-Sync shared | Wrap with Mutex | -| Deadlock | Lock ordering | Consistent lock order | -| `future is not Send` | Non-Send across await | Drop before await | -| `MutexGuard` across await | Guard held during suspend | Scope guard properly | - ---- - -## Anti-Patterns - -| Anti-Pattern | Why Bad | Better | -|--------------|---------|--------| -| Arc> everywhere | Contention, complexity | Message passing | -| thread::sleep in async | Blocks executor | tokio::time::sleep | -| Holding locks across await | Blocks other tasks | Scope locks tightly | -| Ignoring deadlock risk | Hard to debug | Lock ordering, try_lock | - ---- - -## Async-Specific Patterns - -### Avoid MutexGuard Across Await - -```rust -// Bad: guard held across await -let guard = mutex.lock().await; -do_async().await; // guard still held! 
- -// Good: scope the lock -{ - let guard = mutex.lock().await; - // use guard -} // guard dropped -do_async().await; -``` - -### Non-Send Types in Async - -```rust -// Rc is !Send, can't cross await in spawned task -// Option 1: use Arc instead -// Option 2: use spawn_local (single-thread runtime) -// Option 3: ensure Rc is dropped before .await -``` - ---- - -## Related Skills - -| When | See | -|------|-----| -| Smart pointer choice | m02-resource | -| Interior mutability | m03-mutability | -| Performance tuning | m10-performance | -| Domain concurrency needs | domain-* | diff --git a/.claude/skills/m07-concurrency/comparison.md b/.claude/skills/m07-concurrency/comparison.md deleted file mode 100644 index a09c6fa28..000000000 --- a/.claude/skills/m07-concurrency/comparison.md +++ /dev/null @@ -1,312 +0,0 @@ -# Concurrency: Comparison with Other Languages - -## Rust vs Go - -### Concurrency Model - -| Aspect | Rust | Go | -|--------|------|-----| -| Model | Ownership + Send/Sync | CSP (Communicating Sequential Processes) | -| Primitives | Arc, Mutex, channels | goroutines, channels | -| Safety | Compile-time | Runtime (race detector) | -| Async | async/await + runtime | Built-in scheduler | - -### Goroutines vs Rust Tasks - -```rust -// Rust: explicit about thread safety -use std::sync::Arc; -use tokio::sync::Mutex; - -let data = Arc::new(Mutex::new(vec![])); -let data_clone = Arc::clone(&data); - -tokio::spawn(async move { - let mut guard = data_clone.lock().await; - guard.push(1); // Safe: Mutex protects access -}); - -// Go: implicit sharing (potential race) -// data := []int{} -// go func() { -// data = append(data, 1) // RACE CONDITION! 
-// }() -``` - -### Channel Comparison - -```rust -// Rust: typed channels with ownership -use tokio::sync::mpsc; - -let (tx, mut rx) = mpsc::channel::(100); - -tokio::spawn(async move { - tx.send("hello".to_string()).await.unwrap(); - // tx is moved, can't be used elsewhere -}); - -// Go: channels are more flexible but less safe -// ch := make(chan string, 100) -// go func() { -// ch <- "hello" -// // ch can still be used anywhere -// }() -``` - ---- - -## Rust vs Java - -### Thread Safety Model - -| Aspect | Rust | Java | -|--------|------|------| -| Safety | Compile-time (Send/Sync) | Runtime (synchronized, volatile) | -| Null | No null (Option) | NullPointerException risk | -| Locks | RAII (drop releases) | try-finally or try-with-resources | -| Memory | No GC | GC with stop-the-world | - -### Synchronization Comparison - -```rust -// Rust: lock is tied to data -use std::sync::Mutex; - -let data = Mutex::new(vec![1, 2, 3]); -{ - let mut guard = data.lock().unwrap(); - guard.push(4); -} // lock released automatically - -// Java: lock and data are separate -// List data = new ArrayList<>(); -// synchronized(data) { -// data.add(4); -// } // easy to forget synchronization elsewhere -``` - -### Thread Pool Comparison - -```rust -// Rust: rayon for data parallelism -use rayon::prelude::*; - -let sum: i32 = (0..1000) - .into_par_iter() - .map(|x| x * x) - .sum(); - -// Java: Stream API -// int sum = IntStream.range(0, 1000) -// .parallel() -// .map(x -> x * x) -// .sum(); -``` - ---- - -## Rust vs C++ - -### Safety Guarantees - -| Aspect | Rust | C++ | -|--------|------|-----| -| Data races | Prevented at compile-time | Undefined behavior | -| Deadlocks | Not prevented (same as C++) | Not prevented | -| Thread safety | Send/Sync traits | Convention only | -| Memory ordering | Explicit Ordering enum | memory_order enum | - -### Atomic Comparison - -```rust -// Rust: clear memory ordering -use std::sync::atomic::{AtomicI32, Ordering}; - -let counter = 
AtomicI32::new(0); -counter.fetch_add(1, Ordering::SeqCst); -let value = counter.load(Ordering::Acquire); - -// C++: similar but without safety -// std::atomic counter{0}; -// counter.fetch_add(1, std::memory_order_seq_cst); -// int value = counter.load(std::memory_order_acquire); -``` - -### Mutex Comparison - -```rust -// Rust: data protected by Mutex -use std::sync::Mutex; - -struct SafeCounter { - count: Mutex, // Mutex contains the data -} - -impl SafeCounter { - fn increment(&self) { - *self.count.lock().unwrap() += 1; - } -} - -// C++: mutex separate from data (error-prone) -// class Counter { -// std::mutex mtx; -// int count; // NOT protected by type system -// public: -// void increment() { -// std::lock_guard lock(mtx); -// count++; -// } -// void unsafe_increment() { -// count++; // Compiles! But wrong. -// } -// }; -``` - ---- - -## Async Models Comparison - -| Language | Model | Runtime | -|----------|-------|---------| -| Rust | async/await, zero-cost | tokio, async-std (bring your own) | -| Go | goroutines | Built-in scheduler | -| JavaScript | async/await, Promises | Event loop (single-threaded) | -| Python | async/await | asyncio (single-threaded) | -| Java | CompletableFuture, Virtual Threads | ForkJoinPool, Loom | - -### Rust vs JavaScript Async - -```rust -// Rust: async requires explicit runtime, can use multiple threads -#[tokio::main] -async fn main() { - let results = tokio::join!( - fetch("url1"), // runs concurrently - fetch("url2"), - ); -} - -// JavaScript: single-threaded event loop -// async function main() { -// const results = await Promise.all([ -// fetch("url1"), -// fetch("url2"), -// ]); -// } -``` - -### Rust vs Python Async - -```rust -// Rust: true parallelism possible -#[tokio::main(flavor = "multi_thread")] -async fn main() { - let handles: Vec<_> = urls - .into_iter() - .map(|url| tokio::spawn(fetch(url))) // spawns on thread pool - .collect(); - - for handle in handles { - let _ = handle.await; - } -} - -// Python: 
asyncio is single-threaded (use ProcessPoolExecutor for CPU) -# async def main(): -# tasks = [asyncio.create_task(fetch(url)) for url in urls] -# await asyncio.gather(*tasks) # all on same thread -``` - ---- - -## Send and Sync: Rust's Unique Feature - -No other mainstream language has compile-time thread safety markers: - -| Trait | Meaning | Auto-impl | -|-------|---------|-----------| -| `Send` | Safe to transfer between threads | Most types | -| `Sync` | Safe to share `&T` between threads | Types with thread-safe `&` | -| `!Send` | Must stay on one thread | Rc, raw pointers | -| `!Sync` | References can't be shared | RefCell, Cell | - -### Why This Matters - -```rust -// Rust PREVENTS this at compile time: -use std::rc::Rc; - -let rc = Rc::new(42); -std::thread::spawn(move || { - println!("{}", rc); // ERROR: Rc is not Send -}); - -// In other languages, this would be a runtime bug: -// - Go: race detector might catch it -// - Java: undefined behavior -// - Python: GIL usually saves you -// - C++: undefined behavior -``` - ---- - -## Performance Characteristics - -| Aspect | Rust | Go | Java | C++ | -|--------|------|-----|------|-----| -| Thread overhead | System threads or M:N | M:N (goroutines) | System or virtual | System threads | -| Context switch | OS-level or cooperative | Cheap (goroutines) | OS-level | OS-level | -| Memory | Predictable (no GC) | GC pauses | GC pauses | Predictable | -| Async overhead | Zero-cost futures | Runtime overhead | Boxing overhead | Depends | - -### When to Use What - -| Scenario | Best Choice | -|----------|-------------| -| CPU-bound parallelism | Rust (rayon), C++ | -| I/O-bound concurrency | Rust (tokio), Go, Node.js | -| Low latency required | Rust, C++ | -| Rapid development | Go, Python | -| Complex concurrent state | Rust (compile-time safety) | - ---- - -## Mental Model Shifts - -### From Go - -``` -Before: "Just use goroutines and channels" -After: "Explicitly declare what can be shared and how" -``` - -Key shifts: 
-- `Arc>` instead of implicit sharing -- Compiler enforces thread safety -- Async needs explicit runtime - -### From Java - -``` -Before: "synchronized everywhere, hope for the best" -After: "Types encode thread safety, compiler enforces" -``` - -Key shifts: -- No need for synchronized keyword -- Mutex contains data, not separate -- No GC pauses in critical sections - -### From C++ - -``` -Before: "Be careful, read the docs, use sanitizers" -After: "Compiler catches data races, trust the type system" -``` - -Key shifts: -- Send/Sync replace convention -- RAII locks are mandatory, not optional -- Much harder to write incorrect concurrent code diff --git a/.claude/skills/m07-concurrency/examples/thread-patterns.md b/.claude/skills/m07-concurrency/examples/thread-patterns.md deleted file mode 100644 index 51945c3cf..000000000 --- a/.claude/skills/m07-concurrency/examples/thread-patterns.md +++ /dev/null @@ -1,396 +0,0 @@ -# Thread-Based Concurrency Patterns - -## Thread Spawning Best Practices - -### Basic Thread Spawn -```rust -use std::thread; - -fn main() { - let handle = thread::spawn(|| { - println!("Hello from thread!"); - 42 // return value - }); - - let result = handle.join().unwrap(); - println!("Thread returned: {}", result); -} -``` - -### Named Threads for Debugging -```rust -use std::thread; - -let builder = thread::Builder::new() - .name("worker-1".to_string()) - .stack_size(32 * 1024); // 32KB stack - -let handle = builder.spawn(|| { - println!("Thread name: {:?}", thread::current().name()); -}).unwrap(); -``` - -### Scoped Threads (No 'static Required) -```rust -use std::thread; - -fn process_data(data: &[u32]) -> Vec { - thread::scope(|s| { - let handles: Vec<_> = data - .chunks(2) - .map(|chunk| { - s.spawn(|| { - chunk.iter().map(|x| x * 2).collect::>() - }) - }) - .collect(); - - handles - .into_iter() - .flat_map(|h| h.join().unwrap()) - .collect() - }) -} - -fn main() { - let data = vec![1, 2, 3, 4, 5, 6]; - let result = process_data(&data); // 
No 'static needed! - println!("{:?}", result); -} -``` - ---- - -## Shared State Patterns - -### Arc + Mutex (Read-Write) -```rust -use std::sync::{Arc, Mutex}; -use std::thread; - -fn shared_counter() { - let counter = Arc::new(Mutex::new(0)); - let mut handles = vec![]; - - for _ in 0..10 { - let counter = Arc::clone(&counter); - let handle = thread::spawn(move || { - let mut num = counter.lock().unwrap(); - *num += 1; - }); - handles.push(handle); - } - - for handle in handles { - handle.join().unwrap(); - } - - println!("Result: {}", *counter.lock().unwrap()); -} -``` - -### Arc + RwLock (Read-Heavy) -```rust -use std::sync::{Arc, RwLock}; -use std::thread; - -fn read_heavy_cache() { - let cache = Arc::new(RwLock::new(vec![1, 2, 3])); - - // Many readers - for i in 0..5 { - let cache = Arc::clone(&cache); - thread::spawn(move || { - let data = cache.read().unwrap(); - println!("Reader {}: {:?}", i, *data); - }); - } - - // Occasional writer - { - let cache = Arc::clone(&cache); - thread::spawn(move || { - let mut data = cache.write().unwrap(); - data.push(4); - println!("Writer: added element"); - }); - } -} -``` - -### Atomic for Simple Types -```rust -use std::sync::atomic::{AtomicUsize, Ordering}; -use std::sync::Arc; -use std::thread; - -fn atomic_counter() { - let counter = Arc::new(AtomicUsize::new(0)); - let mut handles = vec![]; - - for _ in 0..10 { - let counter = Arc::clone(&counter); - handles.push(thread::spawn(move || { - for _ in 0..1000 { - counter.fetch_add(1, Ordering::SeqCst); - } - })); - } - - for handle in handles { - handle.join().unwrap(); - } - - println!("Result: {}", counter.load(Ordering::SeqCst)); -} -``` - ---- - -## Channel Patterns - -### MPSC Channel -```rust -use std::sync::mpsc; -use std::thread; - -fn producer_consumer() { - let (tx, rx) = mpsc::channel(); - - // Multiple producers - for i in 0..3 { - let tx = tx.clone(); - thread::spawn(move || { - for j in 0..5 { - tx.send(format!("msg {}-{}", i, j)).unwrap(); - } - }); - } 
-    drop(tx); // Drop original sender
-
-    // Single consumer
-    for received in rx {
-        println!("Got: {}", received);
-    }
-}
-```
-
-### Sync Channel (Bounded)
-```rust
-use std::sync::mpsc;
-use std::thread;
-
-fn bounded_channel() {
-    let (tx, rx) = mpsc::sync_channel(2); // buffer size 2
-
-    thread::spawn(move || {
-        for i in 0..5 {
-            println!("Sending {}", i);
-            tx.send(i).unwrap(); // blocks if buffer full
-            println!("Sent {}", i);
-        }
-    });
-
-    thread::sleep(std::time::Duration::from_millis(500));
-    for received in rx {
-        println!("Received: {}", received);
-        thread::sleep(std::time::Duration::from_millis(100));
-    }
-}
-```
-
----
-
-## Thread Pool Patterns
-
-### Using rayon for Parallel Iteration
-```rust
-use rayon::prelude::*;
-
-fn parallel_map() {
-    let numbers: Vec = (0..1000).collect();
-
-    let squares: Vec = numbers
-        .par_iter() // parallel iterator
-        .map(|x| x * x)
-        .collect();
-
-    println!("Processed {} items", squares.len());
-}
-
-fn parallel_filter_map() {
-    let data: Vec = get_data();
-
-    let results: Vec<_> = data
-        .par_iter()
-        .filter(|s| !s.is_empty())
-        .map(|s| expensive_process(s))
-        .collect();
-}
-```
-
-### Custom Thread Pool with crossbeam
-```rust
-use crossbeam::channel;
-use std::thread;
-
-fn custom_pool(num_workers: usize) {
-    let (tx, rx) = channel::bounded::>(100);
-
-    // Spawn workers
-    let workers: Vec<_> = (0..num_workers)
-        .map(|_| {
-            let rx = rx.clone();
-            thread::spawn(move || {
-                while let Ok(task) = rx.recv() {
-                    task();
-                }
-            })
-        })
-        .collect();
-
-    // Submit tasks
-    for i in 0..100 {
-        tx.send(Box::new(move || {
-            println!("Processing task {}", i);
-        })).unwrap();
-    }
-
-    drop(tx); // Close channel
-
-    for worker in workers {
-        worker.join().unwrap();
-    }
-}
-```
-
----
-
-## Synchronization Primitives
-
-### Barrier (Wait for All)
-```rust
-use std::sync::{Arc, Barrier};
-use std::thread;
-
-fn barrier_example() {
-    let barrier = Arc::new(Barrier::new(3));
-    let mut handles = vec![];
-
-    for i in 0..3 {
-        let barrier = Arc::clone(&barrier);
-        handles.push(thread::spawn(move || {
-            println!("Thread {} starting", i);
-            thread::sleep(std::time::Duration::from_millis(i as u64 * 100));
-
-            barrier.wait(); // All threads wait here
-
-            println!("Thread {} after barrier", i);
-        }));
-    }
-
-    for handle in handles {
-        handle.join().unwrap();
-    }
-}
-```
-
-### Condvar (Condition Variable)
-```rust
-use std::sync::{Arc, Condvar, Mutex};
-use std::thread;
-
-fn condvar_example() {
-    let pair = Arc::new((Mutex::new(false), Condvar::new()));
-    let pair_clone = Arc::clone(&pair);
-
-    // Waiter thread
-    let waiter = thread::spawn(move || {
-        let (lock, cvar) = &*pair_clone;
-        let mut started = lock.lock().unwrap();
-        while !*started {
-            started = cvar.wait(started).unwrap();
-        }
-        println!("Waiter: condition met!");
-    });
-
-    // Notifier
-    thread::sleep(std::time::Duration::from_millis(100));
-    let (lock, cvar) = &*pair;
-    {
-        let mut started = lock.lock().unwrap();
-        *started = true;
-    }
-    cvar.notify_one();
-
-    waiter.join().unwrap();
-}
-```
-
-### Once (One-Time Initialization)
-```rust
-use std::sync::Once;
-
-static INIT: Once = Once::new();
-static mut CONFIG: Option<Config> = None;
-
-fn get_config() -> &'static Config {
-    INIT.call_once(|| {
-        unsafe {
-            CONFIG = Some(load_config());
-        }
-    });
-    unsafe { CONFIG.as_ref().unwrap() }
-}
-
-// Better: use once_cell or lazy_static
-use once_cell::sync::Lazy;
-
-static CONFIG: Lazy<Config> = Lazy::new(|| {
-    load_config()
-});
-```
-
----
-
-## Error Handling in Threads
-
-### Handling Panics
-```rust
-use std::thread;
-
-fn handle_panic() {
-    let handle = thread::spawn(|| {
-        panic!("Thread panicked!");
-    });
-
-    match handle.join() {
-        Ok(_) => println!("Thread completed successfully"),
-        Err(e) => {
-            if let Some(s) = e.downcast_ref::<&str>() {
-                println!("Thread panicked with: {}", s);
-            } else if let Some(s) = e.downcast_ref::<String>() {
-                println!("Thread panicked with: {}", s);
-            } else {
-                println!("Thread panicked with unknown error");
-            }
-        }
-    }
-}
-```
-
-### Catching Panics
-```rust
-use std::panic;
-
-fn catch_panic() {
-    let result = panic::catch_unwind(|| {
-        risky_operation()
-    });
-
-    match result {
-        Ok(value) => println!("Success: {:?}", value),
-        Err(_) => println!("Operation panicked, continuing..."),
-    }
-}
-```
diff --git a/.claude/skills/m07-concurrency/patterns/async-patterns.md b/.claude/skills/m07-concurrency/patterns/async-patterns.md
deleted file mode 100644
index bacda81c7..000000000
--- a/.claude/skills/m07-concurrency/patterns/async-patterns.md
+++ /dev/null
@@ -1,409 +0,0 @@
-# Async Patterns in Rust
-
-## Task Spawning
-
-### Basic Spawn
-```rust
-use tokio::task;
-
-#[tokio::main]
-async fn main() {
-    // Spawn a task that runs concurrently
-    let handle = task::spawn(async {
-        expensive_computation().await
-    });
-
-    // Do other work while task runs
-    other_work().await;
-
-    // Wait for result
-    let result = handle.await.unwrap();
-}
-```
-
-### Spawn with Shared State
-```rust
-use std::sync::Arc;
-use tokio::sync::Mutex;
-
-async fn process_with_state() {
-    let state = Arc::new(Mutex::new(vec![]));
-
-    let handles: Vec<_> = (0..10)
-        .map(|i| {
-            let state = Arc::clone(&state);
-            tokio::spawn(async move {
-                let mut guard = state.lock().await;
-                guard.push(i);
-            })
-        })
-        .collect();
-
-    // Wait for all tasks
-    for handle in handles {
-        handle.await.unwrap();
-    }
-}
-```
-
----
-
-## Select Pattern
-
-### Racing Multiple Futures
-```rust
-use tokio::select;
-use tokio::time::{sleep, Duration};
-
-async fn first_response() {
-    select! {
-        result = fetch_from_server_a() => {
-            println!("A responded first: {:?}", result);
-        }
-        result = fetch_from_server_b() => {
-            println!("B responded first: {:?}", result);
-        }
-    }
-}
-```
-
-### Select with Timeout
-```rust
-use tokio::time::timeout;
-
-async fn with_timeout() -> Result {
-    select! {
-        result = fetch_data() => result,
-        _ = sleep(Duration::from_secs(5)) => {
-            Err(Error::Timeout)
-        }
-    }
-}
-
-// Or use timeout directly
-async fn with_timeout2() -> Result {
-    timeout(Duration::from_secs(5), fetch_data())
-        .await
-        .map_err(|_| Error::Timeout)?
-}
-```
-
-### Select with Channel
-```rust
-use tokio::sync::mpsc;
-
-async fn process_messages(mut rx: mpsc::Receiver) {
-    loop {
-        select! {
-            Some(msg) = rx.recv() => {
-                handle_message(msg).await;
-            }
-            _ = tokio::signal::ctrl_c() => {
-                println!("Shutting down...");
-                break;
-            }
-        }
-    }
-}
-```
-
----
-
-## Channel Patterns
-
-### MPSC (Multi-Producer, Single-Consumer)
-```rust
-use tokio::sync::mpsc;
-
-async fn producer_consumer() {
-    let (tx, mut rx) = mpsc::channel(100);
-
-    // Spawn producers
-    for i in 0..3 {
-        let tx = tx.clone();
-        tokio::spawn(async move {
-            tx.send(format!("Message from {}", i)).await.unwrap();
-        });
-    }
-
-    // Drop original sender so channel closes
-    drop(tx);
-
-    // Consume
-    while let Some(msg) = rx.recv().await {
-        println!("Received: {}", msg);
-    }
-}
-```
-
-### Oneshot (Single-Shot Response)
-```rust
-use tokio::sync::oneshot;
-
-async fn request_response() {
-    let (tx, rx) = oneshot::channel();
-
-    tokio::spawn(async move {
-        let result = compute_something().await;
-        tx.send(result).unwrap();
-    });
-
-    // Wait for response
-    let response = rx.await.unwrap();
-}
-```
-
-### Broadcast (Multi-Consumer)
-```rust
-use tokio::sync::broadcast;
-
-async fn pub_sub() {
-    let (tx, _) = broadcast::channel(16);
-
-    // Subscribe multiple consumers
-    let mut rx1 = tx.subscribe();
-    let mut rx2 = tx.subscribe();
-
-    tokio::spawn(async move {
-        while let Ok(msg) = rx1.recv().await {
-            println!("Consumer 1: {}", msg);
-        }
-    });
-
-    tokio::spawn(async move {
-        while let Ok(msg) = rx2.recv().await {
-            println!("Consumer 2: {}", msg);
-        }
-    });
-
-    // Publish
-    tx.send("Hello").unwrap();
-}
-```
-
-### Watch (Single Latest Value)
-```rust
-use tokio::sync::watch;
-
-async fn config_updates() {
-    let (tx, mut rx) = watch::channel(Config::default());
-
-    // Consumer watches for changes
-    tokio::spawn(async move {
-        while rx.changed().await.is_ok() {
-            let config = rx.borrow();
-            apply_config(&config);
-        }
-    });
-
-    // Update config
-    tx.send(Config::new()).unwrap();
-}
-```
-
----
-
-## Structured Concurrency
-
-### JoinSet for Task Groups
-```rust
-use tokio::task::JoinSet;
-
-async fn parallel_fetch(urls: Vec) -> Vec> {
-    let mut set = JoinSet::new();
-
-    for url in urls {
-        set.spawn(async move {
-            fetch(&url).await
-        });
-    }
-
-    let mut results = vec![];
-    while let Some(res) = set.join_next().await {
-        results.push(res.unwrap());
-    }
-    results
-}
-```
-
-### Scoped Tasks (no 'static)
-```rust
-// Using tokio-scoped or async-scoped crate
-use async_scoped::TokioScope;
-
-async fn scoped_example(data: &[u32]) {
-    let results = TokioScope::scope_and_block(|scope| {
-        for item in data {
-            scope.spawn(async move {
-                process(item).await
-            });
-        }
-    });
-}
-```
-
----
-
-## Cancellation Patterns
-
-### Using CancellationToken
-```rust
-use tokio_util::sync::CancellationToken;
-
-async fn cancellable_task(token: CancellationToken) {
-    loop {
-        select! {
-            _ = token.cancelled() => {
-                println!("Task cancelled");
-                break;
-            }
-            _ = do_work() => {
-                // Continue working
-            }
-        }
-    }
-}
-
-async fn main_with_cancellation() {
-    let token = CancellationToken::new();
-    let task_token = token.clone();
-
-    let handle = tokio::spawn(cancellable_task(task_token));
-
-    // Cancel after some condition
-    tokio::time::sleep(Duration::from_secs(5)).await;
-    token.cancel();
-
-    handle.await.unwrap();
-}
-```
-
-### Graceful Shutdown
-```rust
-async fn serve_with_shutdown(shutdown: impl Future) {
-    let server = TcpListener::bind("0.0.0.0:8080").await.unwrap();
-
-    loop {
-        select! {
-            Ok((socket, _)) = server.accept() => {
-                tokio::spawn(handle_connection(socket));
-            }
-            _ = &mut shutdown => {
-                println!("Shutting down...");
-                break;
-            }
-        }
-    }
-}
-
-#[tokio::main]
-async fn main() {
-    let ctrl_c = async {
-        tokio::signal::ctrl_c().await.unwrap();
-    };
-
-    serve_with_shutdown(ctrl_c).await;
-}
-```
-
----
-
-## Backpressure Patterns
-
-### Bounded Channels
-```rust
-use tokio::sync::mpsc;
-
-async fn with_backpressure() {
-    // Buffer of 10 - producers will wait if full
-    let (tx, mut rx) = mpsc::channel(10);
-
-    let producer = tokio::spawn(async move {
-        for i in 0..1000 {
-            // This will wait if channel is full
-            tx.send(i).await.unwrap();
-        }
-    });
-
-    let consumer = tokio::spawn(async move {
-        while let Some(item) = rx.recv().await {
-            // Slow consumer
-            tokio::time::sleep(Duration::from_millis(10)).await;
-            process(item);
-        }
-    });
-
-    let _ = tokio::join!(producer, consumer);
-}
-```
-
-### Semaphore for Rate Limiting
-```rust
-use tokio::sync::Semaphore;
-use std::sync::Arc;
-
-async fn rate_limited_requests(urls: Vec) {
-    let semaphore = Arc::new(Semaphore::new(10)); // max 10 concurrent
-
-    let handles: Vec<_> = urls
-        .into_iter()
-        .map(|url| {
-            let sem = Arc::clone(&semaphore);
-            tokio::spawn(async move {
-                let _permit = sem.acquire().await.unwrap();
-                fetch(&url).await
-            })
-        })
-        .collect();
-
-    for handle in handles {
-        handle.await.unwrap();
-    }
-}
-```
-
----
-
-## Error Handling in Async
-
-### Propagating Errors
-```rust
-async fn fetch_and_parse(url: &str) -> Result {
-    let response = fetch(url).await?;
-    let data = parse(response).await?;
-    Ok(data)
-}
-```
-
-### Handling Task Panics
-```rust
-async fn robust_spawn() {
-    let handle = tokio::spawn(async {
-        risky_operation().await
-    });
-
-    match handle.await {
-        Ok(result) => println!("Success: {:?}", result),
-        Err(e) if e.is_panic() => {
-            println!("Task panicked: {:?}", e);
-        }
-        Err(e) => {
-            println!("Task cancelled: {:?}", e);
-        }
-    }
-}
-```
-
-### Try-Join for Multiple Results
-```rust
-use tokio::try_join;
-
-async fn fetch_all() -> Result<(A, B, C), Error> {
-    // All must succeed, or first error returned
-    try_join!(
-        fetch_a(),
-        fetch_b(),
-        fetch_c(),
-    )
-}
-```
diff --git a/.claude/skills/m07-concurrency/patterns/common-errors.md b/.claude/skills/m07-concurrency/patterns/common-errors.md
deleted file mode 100644
index a1dc46221..000000000
--- a/.claude/skills/m07-concurrency/patterns/common-errors.md
+++ /dev/null
@@ -1,331 +0,0 @@
-# Common Concurrency Errors & Fixes
-
-## E0277: Cannot Send Between Threads
-
-### Error Pattern
-```rust
-use std::rc::Rc;
-
-let data = Rc::new(42);
-std::thread::spawn(move || {
-    println!("{}", data); // ERROR: Rc cannot be sent between threads
-});
-```
-
-### Fix Options
-
-**Option 1: Use Arc instead**
-```rust
-use std::sync::Arc;
-
-let data = Arc::new(42);
-let data_clone = Arc::clone(&data);
-std::thread::spawn(move || {
-    println!("{}", data_clone); // OK: Arc is Send
-});
-```
-
-**Option 2: Move owned data**
-```rust
-let data = 42; // i32 is Copy and Send
-std::thread::spawn(move || {
-    println!("{}", data); // OK
-});
-```
-
----
-
-## E0277: Cannot Share Between Threads (Not Sync)
-
-### Error Pattern
-```rust
-use std::cell::RefCell;
-use std::sync::Arc;
-
-let data = Arc::new(RefCell::new(42));
-// ERROR: RefCell is not Sync
-```
-
-### Fix Options
-
-**Option 1: Use Mutex for thread-safe interior mutability**
-```rust
-use std::sync::{Arc, Mutex};
-
-let data = Arc::new(Mutex::new(42));
-let data_clone = Arc::clone(&data);
-std::thread::spawn(move || {
-    let mut guard = data_clone.lock().unwrap();
-    *guard += 1;
-});
-```
-
-**Option 2: Use RwLock for read-heavy workloads**
-```rust
-use std::sync::{Arc, RwLock};
-
-let data = Arc::new(RwLock::new(42));
-let data_clone = Arc::clone(&data);
-std::thread::spawn(move || {
-    let guard = data_clone.read().unwrap();
-    println!("{}", *guard);
-});
-```
-
----
-
-## Deadlock Patterns
-
-### Pattern 1: Lock Ordering Deadlock
-```rust
-// DANGER: potential deadlock
-use std::sync::{Arc, Mutex};
-
-let a = Arc::new(Mutex::new(1));
-let b = Arc::new(Mutex::new(2));
-
-// Thread 1: locks a then b
-let a1 = Arc::clone(&a);
-let b1 = Arc::clone(&b);
-std::thread::spawn(move || {
-    let _a = a1.lock().unwrap();
-    let _b = b1.lock().unwrap(); // waits for b
-});
-
-// Thread 2: locks b then a (opposite order!)
-let a2 = Arc::clone(&a);
-let b2 = Arc::clone(&b);
-std::thread::spawn(move || {
-    let _b = b2.lock().unwrap();
-    let _a = a2.lock().unwrap(); // waits for a - DEADLOCK
-});
-```
-
-### Fix: Consistent Lock Ordering
-```rust
-// SAFE: always lock in same order (a before b)
-std::thread::spawn(move || {
-    let _a = a1.lock().unwrap();
-    let _b = b1.lock().unwrap();
-});
-
-std::thread::spawn(move || {
-    let _a = a2.lock().unwrap(); // same order
-    let _b = b2.lock().unwrap();
-});
-```
-
-### Pattern 2: Self-Deadlock
-```rust
-// DANGER: locking same mutex twice
-let m = Mutex::new(42);
-let _g1 = m.lock().unwrap();
-let _g2 = m.lock().unwrap(); // DEADLOCK on std::Mutex
-
-// FIX: use parking_lot::ReentrantMutex if needed
-// or restructure code to avoid double locking
-```
-
----
-
-## Mutex Guard Across Await
-
-### Error Pattern
-```rust
-use std::sync::Mutex;
-use tokio::time::sleep;
-
-async fn bad_async() {
-    let m = Mutex::new(42);
-    let guard = m.lock().unwrap();
-    sleep(Duration::from_secs(1)).await; // WARNING: guard held across await
-    println!("{}", *guard);
-}
-```
-
-### Fix Options
-
-**Option 1: Scope the lock**
-```rust
-async fn good_async() {
-    let m = Mutex::new(42);
-    let value = {
-        let guard = m.lock().unwrap();
-        *guard // copy value
-    }; // guard dropped here
-    sleep(Duration::from_secs(1)).await;
-    println!("{}", value);
-}
-```
-
-**Option 2: Use tokio::sync::Mutex**
-```rust
-use tokio::sync::Mutex;
-
-async fn good_async() {
-    let m = Mutex::new(42);
-    let guard = m.lock().await; // async lock
-    sleep(Duration::from_secs(1)).await; // OK with tokio::Mutex
-    println!("{}", *guard);
-}
-```
-
----
-
-## Data Race Prevention
-
-### Pattern: Missing Synchronization
-```rust
-// This WON'T compile - Rust prevents data races
-use std::sync::Arc;
-
-let data = Arc::new(0);
-let d1 = Arc::clone(&data);
-let d2 = Arc::clone(&data);
-
-std::thread::spawn(move || {
-    // *d1 += 1; // ERROR: cannot mutate through Arc
-});
-
-std::thread::spawn(move || {
-    // *d2 += 1; // ERROR: cannot mutate through Arc
-});
-```
-
-### Fix: Add Synchronization
-```rust
-use std::sync::{Arc, Mutex};
-use std::sync::atomic::{AtomicI32, Ordering};
-
-// Option 1: Mutex
-let data = Arc::new(Mutex::new(0));
-let d1 = Arc::clone(&data);
-std::thread::spawn(move || {
-    *d1.lock().unwrap() += 1;
-});
-
-// Option 2: Atomic (for simple types)
-let data = Arc::new(AtomicI32::new(0));
-let d1 = Arc::clone(&data);
-std::thread::spawn(move || {
-    d1.fetch_add(1, Ordering::SeqCst);
-});
-```
-
----
-
-## Channel Errors
-
-### Disconnected Channel
-```rust
-use std::sync::mpsc;
-
-let (tx, rx) = mpsc::channel();
-drop(tx); // sender dropped
-match rx.recv() {
-    Ok(v) => println!("{}", v),
-    Err(_) => println!("channel disconnected"), // this happens
-}
-```
-
-### Fix: Handle Disconnection
-```rust
-// Use try_recv for non-blocking
-loop {
-    match rx.try_recv() {
-        Ok(msg) => handle(msg),
-        Err(TryRecvError::Empty) => continue,
-        Err(TryRecvError::Disconnected) => break,
-    }
-}
-
-// Or iterate (stops on disconnect)
-for msg in rx {
-    handle(msg);
-}
-```
-
----
-
-## Async Common Errors
-
-### Forgetting to Spawn
-```rust
-// WRONG: future not polled
-async fn fetch_data() -> Result { ... }
-
-fn process() {
-    fetch_data(); // does nothing! returns Future that's dropped
-}
-
-// RIGHT: await or spawn
-async fn process() {
-    let data = fetch_data().await; // awaited
-}
-
-fn process_sync() {
-    tokio::spawn(fetch_data()); // spawned
-}
-```
-
-### Blocking in Async Context
-```rust
-// WRONG: blocks the executor
-async fn bad() {
-    std::thread::sleep(Duration::from_secs(1)); // blocks!
-    std::fs::read_to_string("file.txt").unwrap(); // blocks!
-}
-
-// RIGHT: use async versions
-async fn good() {
-    tokio::time::sleep(Duration::from_secs(1)).await;
-    tokio::fs::read_to_string("file.txt").await.unwrap();
-}
-
-// Or spawn_blocking for CPU-bound work
-async fn compute() {
-    let result = tokio::task::spawn_blocking(|| {
-        heavy_computation() // OK to block here
-    }).await.unwrap();
-}
-```
-
----
-
-## Thread Panic Handling
-
-### Unhandled Panic
-```rust
-let handle = std::thread::spawn(|| {
-    panic!("oops");
-});
-
-// Main thread continues, might miss the error
-handle.join().unwrap(); // panics here
-```
-
-### Proper Error Handling
-```rust
-let handle = std::thread::spawn(|| {
-    panic!("oops");
-});
-
-match handle.join() {
-    Ok(result) => println!("Success: {:?}", result),
-    Err(e) => println!("Thread panicked: {:?}", e),
-}
-
-// For async: use catch_unwind
-use std::panic;
-
-async fn safe_task() {
-    let result = panic::catch_unwind(|| {
-        risky_operation()
-    });
-
-    match result {
-        Ok(v) => use_value(v),
-        Err(_) => log_error("task panicked"),
-    }
-}
-```
diff --git a/.claude/skills/m09-domain/SKILL.md b/.claude/skills/m09-domain/SKILL.md
deleted file mode 100644
index 953dd0224..000000000
--- a/.claude/skills/m09-domain/SKILL.md
+++ /dev/null
@@ -1,173 +0,0 @@
----
-name: m09-domain
-description: "CRITICAL: Use for domain modeling. Triggers: domain model, DDD, domain-driven design, entity, value object, aggregate, repository pattern, business rules, validation, invariant, 领域模型, 领域驱动设计, 业务规则"
----
-
-# Domain Modeling
-
-> **Layer 2: Design Choices**
-
-## Core Question
-
-**What is this concept's role in the domain?**
-
-Before modeling in code, understand:
-- Is it an Entity (identity matters) or Value Object (interchangeable)?
-- What invariants must be maintained?
-- Where are the aggregate boundaries?
-
----
-
-## Domain Concept → Rust Pattern
-
-| Domain Concept | Rust Pattern | Ownership Implication |
-|----------------|--------------|----------------------|
-| Entity | struct + Id | Owned, unique identity |
-| Value Object | struct + Clone/Copy | Shareable, immutable |
-| Aggregate Root | struct owns children | Clear ownership tree |
-| Repository | trait | Abstracts persistence |
-| Domain Event | enum | Captures state changes |
-| Service | impl block / free fn | Stateless operations |
-
----
-
-## Thinking Prompt
-
-Before creating a domain type:
-
-1. **What's the concept's identity?**
-   - Needs unique identity → Entity (Id field)
-   - Interchangeable by value → Value Object (Clone/Copy)
-
-2. **What invariants must hold?**
-   - Always valid → private fields + validated constructor
-   - Transition rules → type state pattern
-
-3. **Who owns this data?**
-   - Single owner (parent) → owned field
-   - Shared reference → Arc/Rc
-   - Weak reference → Weak
-
----
-
-## Trace Up ↑
-
-To domain constraints (Layer 3):
-
-```
-"How should I model a Transaction?"
-    ↑ Ask: What domain rules govern transactions?
-    ↑ Check: domain-fintech (audit, precision requirements)
-    ↑ Check: Business stakeholders (what invariants?)
-```
-
-| Design Question | Trace To | Ask |
-|-----------------|----------|-----|
-| Entity vs Value Object | domain-* | What makes two instances "the same"? |
-| Aggregate boundaries | domain-* | What must be consistent together? |
-| Validation rules | domain-* | What business rules apply? |
-
----
-
-## Trace Down ↓
-
-To implementation (Layer 1):
-
-```
-"Model as Entity"
-    ↓ m01-ownership: Owned, unique
-    ↓ m05-type-driven: Newtype for Id
-
-"Model as Value Object"
-    ↓ m01-ownership: Clone/Copy OK
-    ↓ m05-type-driven: Validate at construction
-
-"Model as Aggregate"
-    ↓ m01-ownership: Parent owns children
-    ↓ m02-resource: Consider Rc for shared within aggregate
-```
-
----
-
-## Quick Reference
-
-| DDD Concept | Rust Pattern | Example |
-|-------------|--------------|---------|
-| Value Object | Newtype | `struct Email(String);` |
-| Entity | Struct + ID | `struct User { id: UserId, ... }` |
-| Aggregate | Module boundary | `mod order { ... }` |
-| Repository | Trait | `trait UserRepo { fn find(...) }` |
-| Domain Event | Enum | `enum OrderEvent { Created, ... }` |
-
-## Pattern Templates
-
-### Value Object
-
-```rust
-struct Email(String);
-
-impl Email {
-    pub fn new(s: &str) -> Result {
-        validate_email(s)?;
-        Ok(Self(s.to_string()))
-    }
-}
-```
-
-### Entity
-
-```rust
-struct UserId(Uuid);
-
-struct User {
-    id: UserId,
-    email: Email,
-    // ... other fields
-}
-
-impl PartialEq for User {
-    fn eq(&self, other: &Self) -> bool {
-        self.id == other.id // Identity equality
-    }
-}
-```
-
-### Aggregate
-
-```rust
-mod order {
-    pub struct Order {
-        id: OrderId,
-        items: Vec<OrderItem>, // Owned children
-        // ...
-    }
-
-    impl Order {
-        pub fn add_item(&mut self, item: OrderItem) {
-            // Enforce aggregate invariants
-        }
-    }
-}
-```
-
----
-
-## Common Mistakes
-
-| Mistake | Why Wrong | Better |
-|---------|-----------|--------|
-| Primitive obsession | No type safety | Newtype wrappers |
-| Public fields with invariants | Invariants violated | Private + accessor |
-| Leaked aggregate internals | Broken encapsulation | Methods on root |
-| String for semantic types | No validation | Validated newtype |
-
----
-
-## Related Skills
-
-| When | See |
-|------|-----|
-| Type-driven implementation | m05-type-driven |
-| Ownership for aggregates | m01-ownership |
-| Domain error handling | m13-domain-error |
-| Specific domain rules | domain-* |
diff --git a/.claude/skills/m10-performance/SKILL.md b/.claude/skills/m10-performance/SKILL.md
deleted file mode 100644
index aefd10bc7..000000000
--- a/.claude/skills/m10-performance/SKILL.md
+++ /dev/null
@@ -1,156 +0,0 @@
----
-name: m10-performance
-description: "CRITICAL: Use for performance optimization. Triggers: performance, optimization, benchmark, profiling, flamegraph, criterion, slow, fast, allocation, cache, SIMD, make it faster, 性能优化, 基准测试"
----
-
-# Performance Optimization
-
-> **Layer 2: Design Choices**
-
-## Core Question
-
-**What's the bottleneck, and is optimization worth it?**
-
-Before optimizing:
-- Have you measured? (Don't guess)
-- What's the acceptable performance?
-- Will optimization add complexity?
-
----
-
-## Performance Decision → Implementation
-
-| Goal | Design Choice | Implementation |
-|------|---------------|----------------|
-| Reduce allocations | Pre-allocate, reuse | `with_capacity`, object pools |
-| Improve cache | Contiguous data | `Vec`, `SmallVec` |
-| Parallelize | Data parallelism | `rayon`, threads |
-| Avoid copies | Zero-copy | References, `Cow` |
-| Reduce indirection | Inline data | `smallvec`, arrays |
-
----
-
-## Thinking Prompt
-
-Before optimizing:
-
-1. **Have you measured?**
-   - Profile first → flamegraph, perf
-   - Benchmark → criterion, cargo bench
-   - Identify actual hotspots
-
-2. **What's the priority?**
-   - Algorithm (10x-1000x improvement)
-   - Data structure (2x-10x)
-   - Allocation (2x-5x)
-   - Cache (1.5x-3x)
-
-3. **What's the trade-off?**
-   - Complexity vs speed
-   - Memory vs CPU
-   - Latency vs throughput
-
----
-
-## Trace Up ↑
-
-To domain constraints (Layer 3):
-
-```
-"How fast does this need to be?"
-    ↑ Ask: What's the performance SLA?
-    ↑ Check: domain-* (latency requirements)
-    ↑ Check: Business requirements (acceptable response time)
-```
-
-| Question | Trace To | Ask |
-|----------|----------|-----|
-| Latency requirements | domain-* | What's acceptable response time? |
-| Throughput needs | domain-* | How many requests per second? |
-| Memory constraints | domain-* | What's the memory budget? |
-
----
-
-## Trace Down ↓
-
-To implementation (Layer 1):
-
-```
-"Need to reduce allocations"
-    ↓ m01-ownership: Use references, avoid clone
-    ↓ m02-resource: Pre-allocate with_capacity
-
-"Need to parallelize"
-    ↓ m07-concurrency: Choose rayon or threads
-    ↓ m07-concurrency: Consider async for I/O-bound
-
-"Need cache efficiency"
-    ↓ Data layout: Prefer Vec over HashMap when possible
-    ↓ Access patterns: Sequential over random access
-```
-
----
-
-## Quick Reference
-
-| Tool | Purpose |
-|------|---------|
-| `cargo bench` | Micro-benchmarks |
-| `criterion` | Statistical benchmarks |
-| `perf` / `flamegraph` | CPU profiling |
-| `heaptrack` | Allocation tracking |
-| `valgrind` / `cachegrind` | Cache analysis |
-
-## Optimization Priority
-
-```
-1. Algorithm choice (10x - 1000x)
-2. Data structure (2x - 10x)
-3. Allocation reduction (2x - 5x)
-4. Cache optimization (1.5x - 3x)
-5. SIMD/Parallelism (2x - 8x)
-```
-
-## Common Techniques
-
-| Technique | When | How |
-|-----------|------|-----|
-| Pre-allocation | Known size | `Vec::with_capacity(n)` |
-| Avoid cloning | Hot paths | Use references or `Cow` |
-| Batch operations | Many small ops | Collect then process |
-| SmallVec | Usually small | `smallvec::SmallVec<[T; N]>` |
-| Inline buffers | Fixed-size data | Arrays over Vec |
-
----
-
-## Common Mistakes
-
-| Mistake | Why Wrong | Better |
-|---------|-----------|--------|
-| Optimize without profiling | Wrong target | Profile first |
-| Benchmark in debug mode | Meaningless | Always `--release` |
-| Use LinkedList | Cache unfriendly | `Vec` or `VecDeque` |
-| Hidden `.clone()` | Unnecessary allocs | Use references |
-| Premature optimization | Wasted effort | Make it work first |
-
----
-
-## Anti-Patterns
-
-| Anti-Pattern | Why Bad | Better |
-|--------------|---------|--------|
-| Clone to avoid lifetimes | Performance cost | Proper ownership |
-| Box everything | Indirection cost | Stack when possible |
-| HashMap for small sets | Overhead | Vec with linear search |
-| String concat in loop | O(n^2) | `String::with_capacity` or `format!` |
-
----
-
-## Related Skills
-
-| When | See |
-|------|-----|
-| Reducing clones | m01-ownership |
-| Concurrency options | m07-concurrency |
-| Smart pointer choice | m02-resource |
-| Domain requirements | domain-* |
diff --git a/.claude/skills/m10-performance/patterns/optimization-guide.md b/.claude/skills/m10-performance/patterns/optimization-guide.md
deleted file mode 100644
index 9d49f71f6..000000000
--- a/.claude/skills/m10-performance/patterns/optimization-guide.md
+++ /dev/null
@@ -1,365 +0,0 @@
-# Rust Performance Optimization Guide
-
-## Profiling First
-
-### Tools
-```bash
-# CPU profiling
-cargo install flamegraph
-cargo flamegraph --bin myapp
-
-# Memory profiling
-cargo install cargo-instruments # macOS
-heaptrack ./target/release/myapp # Linux
-
-# Benchmarking
-cargo bench # with criterion
-
-# Cache analysis
-valgrind --tool=cachegrind ./target/release/myapp
-```
-
-### Criterion Benchmarks
-```rust
-use criterion::{criterion_group, criterion_main, Criterion};
-
-fn benchmark_parse(c: &mut Criterion) {
-    let input = "test data".repeat(1000);
-
-    c.bench_function("parse_v1", |b| {
-        b.iter(|| parse_v1(&input))
-    });
-
-    c.bench_function("parse_v2", |b| {
-        b.iter(|| parse_v2(&input))
-    });
-}
-
-criterion_group!(benches, benchmark_parse);
-criterion_main!(benches);
-```
-
----
-
-## Common Optimizations
-
-### 1. Avoid Unnecessary Allocations
-
-```rust
-// BAD: allocates on every call
-fn to_uppercase(s: &str) -> String {
-    s.to_uppercase()
-}
-
-// GOOD: return Cow, allocate only if needed
-use std::borrow::Cow;
-
-fn to_uppercase(s: &str) -> Cow<'_, str> {
-    if s.chars().all(|c| c.is_uppercase()) {
-        Cow::Borrowed(s)
-    } else {
-        Cow::Owned(s.to_uppercase())
-    }
-}
-```
-
-### 2. Reuse Allocations
-
-```rust
-// BAD: creates new Vec each iteration
-for item in items {
-    let mut buffer = Vec::new();
-    process(&mut buffer, item);
-}
-
-// GOOD: reuse buffer
-let mut buffer = Vec::new();
-for item in items {
-    buffer.clear();
-    process(&mut buffer, item);
-}
-```
-
-### 3. Use Appropriate Collections
-
-| Need | Collection | Notes |
-|------|------------|-------|
-| Sequential access | `Vec` | Best cache locality |
-| Random access by key | `HashMap` | O(1) lookup |
-| Ordered keys | `BTreeMap` | O(log n) lookup |
-| Small sets (<20) | `Vec` + linear search | Lower overhead |
-| FIFO queue | `VecDeque` | O(1) push/pop both ends |
-
-### 4. Pre-allocate Capacity
-
-```rust
-// BAD: many reallocations
-let mut v = Vec::new();
-for i in 0..10000 {
-    v.push(i);
-}
-
-// GOOD: single allocation
-let mut v = Vec::with_capacity(10000);
-for i in 0..10000 {
-    v.push(i);
-}
-```
-
----
-
-## String Optimization
-
-### Avoid String Concatenation in Loops
-
-```rust
-// BAD: O(n²) allocations
-let mut result = String::new();
-for s in strings {
-    result = result + &s;
-}
-
-// GOOD: O(n) with push_str
-let mut result = String::new();
-for s in strings {
-    result.push_str(&s);
-}
-
-// BETTER: pre-calculate capacity
-let total_len: usize = strings.iter().map(|s| s.len()).sum();
-let mut result = String::with_capacity(total_len);
-for s in strings {
-    result.push_str(&s);
-}
-
-// BEST: use join for simple cases
-let result = strings.join("");
-```
-
-### Use &str When Possible
-
-```rust
-// BAD: requires allocation
-fn greet(name: String) {
-    println!("Hello, {}", name);
-}
-
-// GOOD: borrows, no allocation
-fn greet(name: &str) {
-    println!("Hello, {}", name);
-}
-
-// Works with both:
-greet("world"); // &str
-greet(&String::from("world")); // &String coerces to &str
-```
-
----
-
-## Iterator Optimization
-
-### Use Iterators Over Indexing
-
-```rust
-// BAD: bounds checking on each access
-let mut sum = 0;
-for i in 0..vec.len() {
-    sum += vec[i];
-}
-
-// GOOD: no bounds checking
-let sum: i32 = vec.iter().sum();
-
-// GOOD: when index needed
-for (i, item) in vec.iter().enumerate() {
-    // ...
-} -``` - -### Lazy Evaluation - -```rust -// Iterators are lazy - computation happens at collect -let result: Vec<_> = data - .iter() - .filter(|x| x.is_valid()) - .map(|x| x.process()) - .take(10) // stop after 10 items - .collect(); -``` - -### Avoid Collecting When Not Needed - -```rust -// BAD: unnecessary intermediate allocation -let filtered: Vec<_> = items.iter().filter(|x| x.valid).collect(); -let count = filtered.len(); - -// GOOD: no allocation -let count = items.iter().filter(|x| x.valid).count(); -``` - ---- - -## Parallelism with Rayon - -```rust -use rayon::prelude::*; - -// Sequential -let sum: i32 = (0..1_000_000).map(|x| x * x).sum(); - -// Parallel (automatic work stealing) -let sum: i32 = (0..1_000_000).into_par_iter().map(|x| x * x).sum(); - -// Parallel with custom chunk size -let results: Vec<_> = data - .par_chunks(1000) - .map(|chunk| process_chunk(chunk)) - .collect(); -``` - ---- - -## Memory Layout - -### Use Appropriate Integer Sizes - -```rust -// If values are small, use smaller types -struct Item { - count: u8, // 0-255, not u64 - flags: u8, // small enum - id: u32, // if 4 billion is enough -} -``` - -### Pack Structs Efficiently - -```rust -// BAD: 24 bytes due to padding -struct Bad { - a: u8, // 1 byte + 7 padding - b: u64, // 8 bytes - c: u8, // 1 byte + 7 padding -} - -// GOOD: 16 bytes (or use #[repr(packed)]) -struct Good { - b: u64, // 8 bytes - a: u8, // 1 byte - c: u8, // 1 byte + 6 padding -} -``` - -### Box Large Values - -```rust -// Large enum variants waste space -enum Message { - Quit, - Data([u8; 10000]), // all variants are 10000+ bytes -} - -// Better: box the large variant -enum Message { - Quit, - Data(Box<[u8; 10000]>), // variants are pointer-sized -} -``` - ---- - -## Async Performance - -### Avoid Blocking in Async - -```rust -// BAD: blocks the executor -async fn bad() { - std::thread::sleep(Duration::from_secs(1)); // blocking! - std::fs::read_to_string("file.txt").unwrap(); // blocking! 
-}
-
-// GOOD: use async versions
-async fn good() {
-    tokio::time::sleep(Duration::from_secs(1)).await;
-    tokio::fs::read_to_string("file.txt").await.unwrap();
-}
-
-// For CPU work: spawn_blocking
-async fn compute() -> i32 {
-    tokio::task::spawn_blocking(|| {
-        heavy_computation()
-    }).await.unwrap()
-}
-```
-
-### Buffer Async I/O
-
-```rust
-use tokio::io::{AsyncBufReadExt, BufReader};
-
-// BAD: many small reads
-async fn bad(file: File) {
-    let mut byte = [0u8];
-    while file.read(&mut byte).await.unwrap() > 0 {
-        process(byte[0]);
-    }
-}
-
-// GOOD: buffered reading
-async fn good(file: File) {
-    let reader = BufReader::new(file);
-    let mut lines = reader.lines();
-    while let Some(line) = lines.next_line().await.unwrap() {
-        process(&line);
-    }
-}
-```
-
----
-
-## Release Build Optimization
-
-### Cargo.toml Settings
-
-```toml
-[profile.release]
-lto = true # Link-time optimization
-codegen-units = 1 # Single codegen unit (slower compile, faster code)
-panic = "abort" # Smaller binary, no unwinding
-strip = true # Strip symbols
-
-[profile.release-fast]
-inherits = "release"
-opt-level = 3 # Maximum optimization
-
-[profile.release-small]
-inherits = "release"
-opt-level = "s" # Optimize for size
-```
-
-### Compile-Time Assertions
-
-```rust
-// Zero runtime cost
-const _: () = assert!(std::mem::size_of::() <= 64);
-```
-
----
-
-## Checklist
-
-Before optimizing:
-- [ ] Profile to find actual bottlenecks
-- [ ] Have benchmarks to measure improvement
-- [ ] Consider if optimization is worth complexity
-
-Common wins:
-- [ ] Reduce allocations (Cow, reuse buffers)
-- [ ] Use appropriate collections
-- [ ] Pre-allocate with_capacity
-- [ ] Use iterators instead of indexing
-- [ ] Enable LTO for release builds
-- [ ] Use rayon for parallel workloads
diff --git a/.claude/skills/m11-ecosystem/SKILL.md b/.claude/skills/m11-ecosystem/SKILL.md
deleted file mode 100644
index 2f84df781..000000000
--- a/.claude/skills/m11-ecosystem/SKILL.md
+++ /dev/null
@@ -1,155 +0,0 @@
----
-name: m11-ecosystem
-description: "Use when integrating crates or ecosystem questions. Keywords: E0425, E0433, E0603, crate, cargo, dependency, feature flag, workspace, which crate to use, using external C libraries, creating Python extensions, PyO3, wasm, WebAssembly, bindgen, cbindgen, napi-rs, cannot find, private, crate recommendation, best crate for, Cargo.toml, features, crate 推荐, 依赖管理, 特性标志, 工作空间, Python 绑定"
----
-
-# Ecosystem Integration
-
-> **Layer 2: Design Choices**
-
-## Core Question
-
-**What's the right crate for this job, and how should it integrate?**
-
-Before adding dependencies:
-- Is there a standard solution?
-- What's the maintenance status?
-- What's the API stability?
-
----
-
-## Integration Decision → Implementation
-
-| Need | Choice | Crates |
-|------|--------|--------|
-| Serialization | Derive-based | serde, serde_json |
-| Async runtime | tokio or async-std | tokio (most popular) |
-| HTTP client | Ergonomic | reqwest |
-| HTTP server | Modern | axum, actix-web |
-| Database | SQL or ORM | sqlx, diesel |
-| CLI parsing | Derive-based | clap |
-| Error handling | App vs lib | anyhow, thiserror |
-| Logging | Facade | tracing, log |
-
----
-
-## Thinking Prompt
-
-Before adding a dependency:
-
-1. **Is it well-maintained?**
-   - Recent commits?
-   - Active issue response?
-   - Breaking changes frequency?
-
-2. **What's the scope?**
-   - Do you need the full crate or just a feature?
-   - Can feature flags reduce bloat?
-
-3. **How does it integrate?**
-   - Trait-based or concrete types?
-   - Sync or async?
-   - What bounds does it require?
-
----
-
-## Trace Up ↑
-
-To domain constraints (Layer 3):
-
-```
-"Which HTTP framework should I use?"
-  ↑ Ask: What are the performance requirements?
-  ↑ Check: domain-web (latency, throughput needs)
-  ↑ Check: Team expertise (familiarity with framework)
-```
-
-| Question | Trace To | Ask |
-|----------|----------|-----|
-| Framework choice | domain-* | What constraints matter? |
-| Library vs build | domain-* | What's the deployment model? |
-| API design | domain-* | Who are the consumers? |
-
----
-
-## Trace Down ↓
-
-To implementation (Layer 1):
-
-```
-"Integrate external crate"
-  ↓ m04-zero-cost: Trait bounds and generics
-  ↓ m06-error-handling: Error type compatibility
-
-"FFI integration"
-  ↓ unsafe-checker: Safety requirements
-  ↓ m12-lifecycle: Resource cleanup
-```
-
----
-
-## Quick Reference
-
-### Language Interop
-
-| Integration | Crate/Tool | Use Case |
-|-------------|------------|----------|
-| C/C++ → Rust | `bindgen` | Auto-generate bindings |
-| Rust → C | `cbindgen` | Export C headers |
-| Python ↔ Rust | `pyo3` | Python extensions |
-| Node.js ↔ Rust | `napi-rs` | Node addons |
-| WebAssembly | `wasm-bindgen` | Browser/WASI |
-
-### Cargo Features
-
-| Feature | Purpose |
-|---------|---------|
-| `[features]` | Optional functionality |
-| `default = [...]` | Default features |
-| `feature = "serde"` | Conditional deps |
-| `[workspace]` | Multi-crate projects |
-
-## Error Code Reference
-
-| Error | Cause | Fix |
-|-------|-------|-----|
-| E0433 | Can't find crate | Add to Cargo.toml |
-| E0603 | Private item | Check crate docs |
-| Feature not enabled | Optional feature | Enable in `features` |
-| Version conflict | Incompatible deps | `cargo update` or pin |
-| Duplicate types | Different crate versions | Unify in workspace |
-
----
-
-## Crate Selection Criteria
-
-| Criterion | Good Sign | Warning Sign |
-|-----------|-----------|--------------|
-| Maintenance | Recent commits | Years inactive |
-| Community | Active issues/PRs | No response |
-| Documentation | Examples, API docs | Minimal docs |
-| Stability | Semantic versioning | Frequent breaking |
-| Dependencies | Minimal, well-known | Heavy, obscure |
-
----
-
-## Anti-Patterns
-
-| Anti-Pattern | Why Bad | Better |
-|--------------|---------|--------|
-| `extern crate` | Outdated (2018+) | Just `use` |
-| `#[macro_use]` | Global pollution | Explicit import |
-| Wildcard deps `*` | Unpredictable | Specific versions |
-| Too many deps | Supply chain risk | Evaluate necessity |
-| Vendoring everything | Maintenance burden | Trust crates.io |
-
----
-
-## Related Skills
-
-| When | See |
-|------|-----|
-| Error type design | m06-error-handling |
-| Trait integration | m04-zero-cost |
-| FFI safety | unsafe-checker |
-| Resource management | m12-lifecycle |
diff --git a/.claude/skills/m12-lifecycle/SKILL.md b/.claude/skills/m12-lifecycle/SKILL.md
deleted file mode 100644
index 8cdbd5fc1..000000000
--- a/.claude/skills/m12-lifecycle/SKILL.md
+++ /dev/null
@@ -1,176 +0,0 @@
----
-name: m12-lifecycle
-description: "Use when designing resource lifecycles. Keywords: RAII, Drop, resource lifecycle, connection pool, lazy initialization, connection pool design, resource cleanup patterns, cleanup, scope, OnceCell, Lazy, once_cell, OnceLock, transaction, session management, when is Drop called, cleanup on error, guard pattern, scope guard, 资源生命周期, 连接池, 惰性初始化, 资源清理, RAII 模式"
----
-
-# Resource Lifecycle
-
-> **Layer 2: Design Choices**
-
-## Core Question
-
-**When should this resource be created, used, and cleaned up?**
-
-Before implementing lifecycle:
-- What's the resource's scope?
-- Who owns the cleanup responsibility?
-- What happens on error?
-
----
-
-## Lifecycle Pattern → Implementation
-
-| Pattern | When | Implementation |
-|---------|------|----------------|
-| RAII | Auto cleanup | `Drop` trait |
-| Lazy init | Deferred creation | `OnceLock`, `LazyLock` |
-| Pool | Reuse expensive resources | `r2d2`, `deadpool` |
-| Guard | Scoped access | `MutexGuard` pattern |
-| Scope | Transaction boundary | Custom struct + Drop |
-
----
-
-## Thinking Prompt
-
-Before designing lifecycle:
-
-1. **What's the resource cost?**
-   - Cheap → create per use
-   - Expensive → pool or cache
-   - Global → lazy singleton
-
-2. **What's the scope?**
-   - Function-local → stack allocation
-   - Request-scoped → passed or extracted
-   - Application-wide → static or Arc
-
-3. **What about errors?**
-   - Cleanup must happen → Drop
-   - Cleanup is optional → explicit close
-   - Cleanup can fail → Result from close
-
----
-
-## Trace Up ↑
-
-To domain constraints (Layer 3):
-
-```
-"How should I manage database connections?"
-  ↑ Ask: What's the connection cost?
-  ↑ Check: domain-* (latency requirements)
-  ↑ Check: Infrastructure (connection limits)
-```
-
-| Question | Trace To | Ask |
-|----------|----------|-----|
-| Connection pooling | domain-* | What's acceptable latency? |
-| Resource limits | domain-* | What are infra constraints? |
-| Transaction scope | domain-* | What must be atomic? |
-
----
-
-## Trace Down ↓
-
-To implementation (Layer 1):
-
-```
-"Need automatic cleanup"
-  ↓ m02-resource: Implement Drop
-  ↓ m01-ownership: Clear owner for cleanup
-
-"Need lazy initialization"
-  ↓ m03-mutability: OnceLock for thread-safe
-  ↓ m07-concurrency: LazyLock for sync
-
-"Need connection pool"
-  ↓ m07-concurrency: Thread-safe pool
-  ↓ m02-resource: Arc for sharing
-```
-
----
-
-## Quick Reference
-
-| Pattern | Type | Use Case |
-|---------|------|----------|
-| RAII | `Drop` trait | Auto cleanup on scope exit |
-| Lazy Init | `OnceLock`, `LazyLock` | Deferred initialization |
-| Pool | `r2d2`, `deadpool` | Connection reuse |
-| Guard | `MutexGuard` | Scoped lock release |
-| Scope | Custom struct | Transaction boundaries |
-
-## Lifecycle Events
-
-| Event | Rust Mechanism |
-|-------|----------------|
-| Creation | `new()`, `Default` |
-| Lazy Init | `OnceLock::get_or_init` |
-| Usage | `&self`, `&mut self` |
-| Cleanup | `Drop::drop()` |
-
-## Pattern Templates
-
-### RAII Guard
-
-```rust
-struct FileGuard {
-    path: PathBuf,
-    _handle: File,
-}
-
-impl Drop for FileGuard {
-    fn drop(&mut self) {
-        // Cleanup: remove temp file
-        let _ = std::fs::remove_file(&self.path);
-    }
-}
-```
-
-### Lazy Singleton
-
-```rust
-use std::sync::OnceLock;
-
-static CONFIG: OnceLock<Config> = OnceLock::new();
-
-fn get_config() -> &'static Config {
-    CONFIG.get_or_init(|| {
-        Config::load().expect("config required")
-    })
-}
-```
-
----
-
-## Common Errors
-
-| Error | Cause | Fix |
-|-------|-------|-----|
-| Resource leak | Forgot Drop | Implement Drop or RAII wrapper |
-| Double free | Manual memory | Let Rust handle |
-| Use after drop | Dangling reference | Check lifetimes |
-| E0509 move out of Drop | Moving owned field | `Option::take()` |
-| Pool exhaustion | Not returned | Ensure Drop returns |
-
----
-
-## Anti-Patterns
-
-| Anti-Pattern | Why Bad | Better |
-|--------------|---------|--------|
-| Manual cleanup | Easy to forget | RAII/Drop |
-| `lazy_static!` | External dep | `std::sync::OnceLock` |
-| Global mutable state | Thread unsafety | `OnceLock` or proper sync |
-| Forget to close | Resource leak | Drop impl |
-
----
-
-## Related Skills
-
-| When | See |
-|------|-----|
-| Smart pointers | m02-resource |
-| Thread-safe init | m07-concurrency |
-| Domain scopes | m09-domain |
-| Error in cleanup | m06-error-handling |
diff --git a/.claude/skills/m13-domain-error/SKILL.md b/.claude/skills/m13-domain-error/SKILL.md
deleted file mode 100644
index 1b7f49299..000000000
--- a/.claude/skills/m13-domain-error/SKILL.md
+++ /dev/null
@@ -1,179 +0,0 @@
----
-name: m13-domain-error
-description: "Use when designing domain error handling. Keywords: domain error, error categorization, recovery strategy, retry, fallback, domain error hierarchy, user-facing vs internal errors, error code design, circuit breaker, graceful degradation, resilience, error context, backoff, retry with backoff, error recovery, transient vs permanent error, 领域错误, 错误分类, 恢复策略, 重试, 熔断器, 优雅降级"
----
-
-# Domain Error Strategy
-
-> **Layer 2: Design Choices**
-
-## Core Question
-
-**Who needs to handle this error, and how should they recover?**
-
-Before designing error types:
-- Is this user-facing or internal?
-- Is recovery possible?
-- What context is needed for debugging?
-
----
-
-## Error Categorization
-
-| Error Type | Audience | Recovery | Example |
-|------------|----------|----------|---------|
-| User-facing | End users | Guide action | `InvalidEmail`, `NotFound` |
-| Internal | Developers | Debug info | `DatabaseError`, `ParseError` |
-| System | Ops/SRE | Monitor/alert | `ConnectionTimeout`, `RateLimited` |
-| Transient | Automation | Retry | `NetworkError`, `ServiceUnavailable` |
-| Permanent | Human | Investigate | `ConfigInvalid`, `DataCorrupted` |
-
----
-
-## Thinking Prompt
-
-Before designing error types:
-
-1. **Who sees this error?**
-   - End user → friendly message, actionable
-   - Developer → detailed, debuggable
-   - Ops → structured, alertable
-
-2. **Can we recover?**
-   - Transient → retry with backoff
-   - Degradable → fallback value
-   - Permanent → fail fast, alert
-
-3. **What context is needed?**
-   - Call chain → anyhow::Context
-   - Request ID → structured logging
-   - Input data → error payload
-
----
-
-## Trace Up ↑
-
-To domain constraints (Layer 3):
-
-```
-"How should I handle payment failures?"
-  ↑ Ask: What are the business rules for retries?
-  ↑ Check: domain-fintech (transaction requirements)
-  ↑ Check: SLA (availability requirements)
-```
-
-| Question | Trace To | Ask |
-|----------|----------|-----|
-| Retry policy | domain-* | What's acceptable latency for retry? |
-| User experience | domain-* | What message should users see? |
-| Compliance | domain-* | What must be logged for audit? |
-
----
-
-## Trace Down ↓
-
-To implementation (Layer 1):
-
-```
-"Need typed errors"
-  ↓ m06-error-handling: thiserror for library
-  ↓ m04-zero-cost: Error enum design
-
-"Need error context"
-  ↓ m06-error-handling: anyhow::Context
-  ↓ Logging: tracing with fields
-
-"Need retry logic"
-  ↓ m07-concurrency: async retry patterns
-  ↓ Crates: tokio-retry, backoff
-```
-
----
-
-## Quick Reference
-
-| Recovery Pattern | When | Implementation |
-|------------------|------|----------------|
-| Retry | Transient failures | exponential backoff |
-| Fallback | Degraded mode | cached/default value |
-| Circuit Breaker | Cascading failures | failsafe-rs |
-| Timeout | Slow operations | `tokio::time::timeout` |
-| Bulkhead | Isolation | separate thread pools |
-
-## Error Hierarchy
-
-```rust
-#[derive(thiserror::Error, Debug)]
-pub enum AppError {
-    // User-facing
-    #[error("Invalid input: {0}")]
-    Validation(String),
-
-    // Transient (retryable)
-    #[error("Service temporarily unavailable")]
-    ServiceUnavailable(#[source] reqwest::Error),
-
-    // Internal (log details, show generic)
-    #[error("Internal error")]
-    Internal(#[source] anyhow::Error),
-}
-
-impl AppError {
-    pub fn is_retryable(&self) -> bool {
-        matches!(self, Self::ServiceUnavailable(_))
-    }
-}
-```
-
-## Retry Pattern
-
-```rust
-use tokio_retry::{Retry, strategy::ExponentialBackoff};
-
-async fn with_retry(f: F) -> Result
-where
-    F: Fn() -> impl Future>,
-    E: std::fmt::Debug,
-{
-    let strategy = ExponentialBackoff::from_millis(100)
-        .max_delay(Duration::from_secs(10))
-        .take(5);
-
-    Retry::spawn(strategy, || f()).await
-}
-```
-
----
-
-## Common Mistakes
-
-| Mistake | Why Wrong | Better |
-|---------|-----------|--------|
-| Same error for all | No actionability | Categorize by audience |
-| Retry everything | Wasted resources | Only transient errors |
-| Infinite retry | DoS self | Max attempts + backoff |
-| Expose internal errors | Security risk | User-friendly messages |
-| No context | Hard to debug | .context() everywhere |
-
----
-
-## Anti-Patterns
-
-| Anti-Pattern | Why Bad | Better |
-|--------------|---------|--------|
-| String errors | No structure | thiserror types |
-| panic! for recoverable | Bad UX | Result with context |
-| Ignore errors | Silent failures | Log or propagate |
-| Box everywhere | Lost type info | thiserror |
-| Error in happy path | Performance | Early validation |
-
----
-
-## Related Skills
-
-| When | See |
-|------|-----|
-| Error handling basics | m06-error-handling |
-| Retry implementation | m07-concurrency |
-| Domain modeling | m09-domain |
-| User-facing APIs | domain-* |
diff --git a/.claude/skills/m14-mental-model/SKILL.md b/.claude/skills/m14-mental-model/SKILL.md
deleted file mode 100644
index d05f9cc76..000000000
--- a/.claude/skills/m14-mental-model/SKILL.md
+++ /dev/null
@@ -1,176 +0,0 @@
----
-name: m14-mental-model
-description: "Use when learning Rust concepts. Keywords: mental model, how to think about ownership, understanding borrow checker, visualizing memory layout, analogy, misconception, explaining ownership, why does Rust, help me understand, confused about, learning Rust, explain like I'm, ELI5, intuition for, coming from Java, coming from Python, 心智模型, 如何理解所有权, 学习 Rust, Rust 入门, 为什么 Rust"
----
-
-# Mental Models
-
-> **Layer 2: Design Choices**
-
-## Core Question
-
-**What's the right way to think about this Rust concept?**
-
-When learning or explaining Rust:
-- What's the correct mental model?
-- What misconceptions should be avoided?
-- What analogies help understanding?
-
----
-
-## Key Mental Models
-
-| Concept | Mental Model | Analogy |
-|---------|--------------|---------|
-| Ownership | Unique key | Only one person has the house key |
-| Move | Key handover | Giving away your key |
-| `&T` | Lending for reading | Lending a book |
-| `&mut T` | Exclusive editing | Only you can edit the doc |
-| Lifetime `'a` | Valid scope | "Ticket valid until..." |
-| `Box` | Heap pointer | Remote control to TV |
-| `Rc` | Shared ownership | Multiple remotes, last turns off |
-| `Arc` | Thread-safe Rc | Remotes from any room |
-
----
-
-## Coming From Other Languages
-
-| From | Key Shift |
-|------|-----------|
-| Java/C# | Values are owned, not references by default |
-| C/C++ | Compiler enforces safety rules |
-| Python/Go | No GC, deterministic destruction |
-| Functional | Mutability is safe via ownership |
-| JavaScript | No null, use Option instead |
-
----
-
-## Thinking Prompt
-
-When confused about Rust:
-
-1. **What's the ownership model?**
-   - Who owns this data?
-   - How long does it live?
-   - Who can access it?
-
-2. **What guarantee is Rust providing?**
-   - No data races
-   - No dangling pointers
-   - No use-after-free
-
-3. **What's the compiler telling me?**
-   - Error = violation of safety rule
-   - Solution = work with the rules
-
----
-
-## Trace Up ↑
-
-To design understanding (Layer 2):
-
-```
-"Why can't I do X in Rust?"
-  ↑ Ask: What safety guarantee would be violated?
-  ↑ Check: m01-m07 for the rule being enforced
-  ↑ Ask: What's the intended design pattern?
-```
-
----
-
-## Trace Down ↓
-
-To implementation (Layer 1):
-
-```
-"I understand the concept, now how do I implement?"
-  ↓ m01-ownership: Ownership patterns
-  ↓ m02-resource: Smart pointer choice
-  ↓ m07-concurrency: Thread safety
-```
-
----
-
-## Common Misconceptions
-
-| Error | Wrong Model | Correct Model |
-|-------|-------------|---------------|
-| E0382 use after move | GC cleans up | Ownership = unique key transfer |
-| E0502 borrow conflict | Multiple writers OK | Only one writer at a time |
-| E0499 multiple mut borrows | Aliased mutation | Exclusive access for mutation |
-| E0106 missing lifetime | Ignoring scope | References have validity scope |
-| E0507 cannot move from `&T` | Implicit clone | References don't own data |
-
-## Deprecated Thinking
-
-| Deprecated | Better |
-|------------|--------|
-| "Rust is like C++" | Different ownership model |
-| "Lifetimes are GC" | Compile-time validity scope |
-| "Clone solves everything" | Restructure ownership |
-| "Fight the borrow checker" | Work with the compiler |
-| "`unsafe` to avoid rules" | Understand safe patterns first |
-
----
-
-## Ownership Visualization
-
-```
-Stack                     Heap
-+----------------+        +----------------+
-| main()         |        |                |
-| s1 ───────────────────> │ "hello"        |
-|                |        |                |
-| fn takes(s) {  |        |                |
-|   s2 (moved) ─────────> │ "hello"        |
-| }              |        | (s1 invalid)   |
-+----------------+        +----------------+
-
-After move: s1 is no longer valid
-```
-
-## Reference Visualization
-
-```
-+----------------+
-| data: String   |────────────> "hello"
-+----------------+
-        ↑
-        │ &data (immutable borrow)
-        │
- +------+------+
- |             |
-reader1     reader2   (multiple OK)
-
-+----------------+
-| data: String   |────────────> "hello"
-+----------------+
-        ↑
-        │ &mut data (mutable borrow)
-        │
- +------+
- |
-writer   (only one)
-```
-
----
-
-## Learning Path
-
-| Stage | Focus | Skills |
-|-------|-------|--------|
-| Beginner | Ownership basics | m01-ownership, m14-mental-model |
-| Intermediate | Smart pointers, error handling | m02, m06 |
-| Advanced | Concurrency, unsafe | m07, unsafe-checker |
-| Expert | Design patterns | m09-m15, domain-* |
-
----
-
-## Related Skills
-
-| When | See |
-|------|-----|
-| Ownership errors | m01-ownership |
-| Smart pointers | m02-resource |
-| Concurrency | m07-concurrency |
-| Anti-patterns | m15-anti-pattern |
diff --git a/.claude/skills/m14-mental-model/patterns/thinking-in-rust.md b/.claude/skills/m14-mental-model/patterns/thinking-in-rust.md
deleted file mode 100644
index 693b872bf..000000000
--- a/.claude/skills/m14-mental-model/patterns/thinking-in-rust.md
+++ /dev/null
@@ -1,286 +0,0 @@
-# Thinking in Rust: Mental Models
-
-## Core Mental Models
-
-### 1. Ownership as Resource Management
-
-```
-Traditional: "Who has a pointer to this data?"
-Rust: "Who OWNS this data and is responsible for freeing it?"
-```
-
-Key insight: Every value has exactly one owner. When the owner goes out of scope, the value is dropped.
-
-```rust
-{
-    let s = String::from("hello"); // s owns the String
-    // use s...
-} // s goes out of scope, String is dropped (memory freed)
-```
-
-### 2. Borrowing as Temporary Access
-
-```
-Traditional: "I'll just read from this pointer"
-Rust: "I'm borrowing this value, owner still responsible for it"
-```
-
-Key insight: Borrows are like library books - you can read them, but must return them.
-
-```rust
-fn print_length(s: &String) { // borrows s
-    println!("{}", s.len());
-} // borrow ends, caller still owns s
-
-let my_string = String::from("hello");
-print_length(&my_string); // lend to function
-println!("{}", my_string); // still have it
-```
-
-### 3. Lifetimes as Validity Scopes
-
-```
-Traditional: "Hope this pointer is still valid"
-Rust: "Compiler tracks exactly how long references are valid"
-```
-
-Key insight: A reference can't outlive the data it points to.
-
-```rust
-fn longest<'a>(x: &'a str, y: &'a str) -> &'a str {
-    // 'a means: the returned reference is valid as long as BOTH inputs are valid
-    if x.len() > y.len() { x } else { y }
-}
-```
-
----
-
-## Shifting Perspectives
-
-### From "Everything is a Reference" (Java/C#)
-
-Java mental model:
-```java
-// Everything is implicitly a reference
-User user = new User("Alice"); // user is a reference
-List<User> users = new ArrayList<>();
-users.add(user); // shares the reference
-user.setName("Bob"); // affects the list too!
-```
-
-Rust mental model:
-```rust
-// Values are owned, sharing is explicit
-let user = User::new("Alice"); // user is owned
-let mut users = vec![];
-users.push(user); // user moved into vec, can't use user anymore
-// user.set_name("Bob"); // ERROR: user was moved
-
-// If you need sharing:
-use std::rc::Rc;
-let user = Rc::new(User::new("Alice"));
-let user2 = Rc::clone(&user); // explicit shared ownership
-```
-
-### From "Manual Memory Management" (C/C++)
-
-C mental model:
-```c
-char* s = malloc(100);
-// ... must remember to free(s) ...
-// ... what if we return early? ...
-// ... what if an exception occurs? ...
-free(s);
-```
-
-Rust mental model:
-```rust
-let s = String::with_capacity(100);
-// ... use s ...
-// No need to free - Rust drops s automatically when scope ends
-// Even with early returns, panics, or any control flow
-```
-
-### From "Garbage Collection" (Go/Python)
-
-GC mental model:
-```python
-# Create objects, GC will figure it out
-users = []
-for name in names:
-    users.append(User(name))
-# GC runs sometime later, when it feels like it
-```
-
-Rust mental model:
-```rust
-let users: Vec<User> = names
-    .iter()
-    .map(|name| User::new(name))
-    .collect();
-// Memory is freed EXACTLY when users goes out of scope
-// Deterministic, no GC pauses, no unpredictable memory usage
-```
-
----
-
-## Key Questions to Ask
-
-### When Designing Functions
-
-1. **Does this function need to own the data, or just read it?**
-   - Need to keep it: take ownership (`fn process(data: Vec<T>)`)
-   - Just reading: borrow (`fn process(data: &[T])`)
-   - Need to modify: mutable borrow (`fn process(data: &mut Vec<T>)`)
-
-2. **Does the return value contain references to inputs?**
-   - Yes: need lifetime annotations
-   - No: lifetime elision usually works
-
-### When Designing Structs
-
-1. **Should this struct own its data or reference it?**
-   - Long-lived, independent: own (`name: String`)
-   - Short-lived view: reference (`name: &'a str`)
-
-2. **Do multiple parts need to access the same data?**
-   - Single-threaded: `Rc<T>` or `Rc<RefCell<T>>`
-   - Multi-threaded: `Arc<T>` or `Arc<Mutex<T>>`
-
-### When Hitting Borrow Checker Errors
-
-1. **Am I trying to use a value after moving it?**
-   - Clone it, borrow it, or restructure the code
-
-2. **Am I trying to have multiple mutable references?**
-   - Scope the mutations, use interior mutability, or redesign
-
-3. **Does a reference outlive its source?**
-   - Return owned data instead, or use `'static`
-
----
-
-## Common Patterns
-
-### The Clone Escape Hatch
-
-When fighting the borrow checker, `.clone()` often works:
-
-```rust
-// Can't do this - double borrow
-let mut map = HashMap::new();
-for key in map.keys() {
-    map.insert(key.clone(), process(key)); // ERROR: map borrowed twice
-}
-
-// Clone to escape
-let keys: Vec<_> = map.keys().cloned().collect();
-for key in keys {
-    map.insert(key.clone(), process(&key)); // OK
-}
-```
-
-But ask: "Is there a better design?" Often, restructuring is better than cloning.
- -### The "Make It Own" Pattern - -When lifetimes get complex, make the struct own its data: - -```rust -// Complex: struct with references -struct Parser<'a> { - input: &'a str, - current: &'a str, -} - -// Simpler: struct owns data -struct Parser { - input: String, - position: usize, -} -``` - -### The "Split the Borrow" Pattern - -```rust -struct Data { - field_a: Vec, - field_b: Vec, -} - -// Can't borrow self mutably twice -fn process(&mut self) { - // for a in &self.field_a { - // self.field_b.push(*a); // ERROR - // } - - // Split the borrow - let Data { field_a, field_b } = self; - for a in field_a.iter() { - field_b.push(*a); // OK: separate borrows - } -} -``` - ---- - -## The Rust Way - -### Embrace the Type System - -```rust -// Don't: stringly-typed -fn connect(host: &str, port: &str) { ... } -connect("8080", "localhost"); // oops, wrong order - -// Do: strongly-typed -struct Host(String); -struct Port(u16); -fn connect(host: Host, port: Port) { ... } -// connect(Port(8080), Host("localhost".into())); // compile error! -``` - -### Make Invalid States Unrepresentable - -```rust -// Don't: runtime checks -struct Connection { - socket: Option, - connected: bool, -} - -// Do: types enforce states -enum Connection { - Disconnected, - Connected { socket: Socket }, -} -``` - -### Let the Compiler Guide You - -```rust -// Start with what you want -fn process(data: ???) -> ??? - -// Let compiler errors tell you: -// - What types are needed -// - What lifetimes are needed -// - What bounds are needed - -// The error messages are documentation! -``` - ---- - -## Summary: The Rust Mental Model - -1. **Values have owners** - exactly one at a time -2. **Borrowing is lending** - temporary access, owner retains responsibility -3. **Lifetimes are scopes** - compiler tracks validity -4. **Types encode constraints** - use them to prevent bugs -5. 
**The compiler is your friend** - work with it, not against it - -When stuck: -- Clone to make progress -- Restructure to own instead of borrow -- Ask: "What is the compiler trying to tell me?" diff --git a/.claude/skills/m15-anti-pattern/SKILL.md b/.claude/skills/m15-anti-pattern/SKILL.md deleted file mode 100644 index 9c250da98..000000000 --- a/.claude/skills/m15-anti-pattern/SKILL.md +++ /dev/null @@ -1,159 +0,0 @@ ---- -name: m15-anti-pattern -description: "Use when reviewing code for anti-patterns. Keywords: anti-pattern, common mistake, pitfall, code smell, bad practice, code review, is this an anti-pattern, better way to do this, common mistake to avoid, why is this bad, idiomatic way, beginner mistake, fighting borrow checker, clone everywhere, unwrap in production, should I refactor, 反模式, 常见错误, 代码异味, 最佳实践, 地道写法" ---- - -# Anti-Patterns - -> **Layer 2: Design Choices** - -## Core Question - -**Is this pattern hiding a design problem?** - -When reviewing code: -- Is this solving the symptom or the cause? -- Is there a more idiomatic approach? -- Does this fight or flow with Rust? - ---- - -## Anti-Pattern → Better Pattern - -| Anti-Pattern | Why Bad | Better | -|--------------|---------|--------| -| `.clone()` everywhere | Hides ownership issues | Proper references or ownership | -| `.unwrap()` in production | Runtime panics | `?`, `expect`, or handling | -| `Rc` when single owner | Unnecessary overhead | Simple ownership | -| `unsafe` for convenience | UB risk | Find safe pattern | -| OOP via `Deref` | Misleading API | Composition, traits | -| Giant match arms | Unmaintainable | Extract to methods | -| `String` everywhere | Allocation waste | `&str`, `Cow` | -| Ignoring `#[must_use]` | Lost errors | Handle or `let _ =` | - ---- - -## Thinking Prompt - -When seeing suspicious code: - -1. **Is this symptom or cause?** - - Clone to avoid borrow? → Ownership design issue - - Unwrap "because it won't fail"? → Unhandled case - -2. 
**What would idiomatic code look like?** - - References instead of clones - - Iterators instead of index loops - - Pattern matching instead of flags - -3. **Does this fight Rust?** - - Fighting borrow checker → restructure - - Excessive unsafe → find safe pattern - ---- - -## Trace Up ↑ - -To design understanding: - -``` -"Why does my code have so many clones?" - ↑ Ask: Is the ownership model correct? - ↑ Check: m09-domain (data flow design) - ↑ Check: m01-ownership (reference patterns) -``` - -| Anti-Pattern | Trace To | Question | -|--------------|----------|----------| -| Clone everywhere | m01-ownership | Who should own this data? | -| Unwrap everywhere | m06-error-handling | What's the error strategy? | -| Rc everywhere | m09-domain | Is ownership clear? | -| Fighting lifetimes | m09-domain | Should data structure change? | - ---- - -## Trace Down ↓ - -To implementation (Layer 1): - -``` -"Replace clone with proper ownership" - ↓ m01-ownership: Reference patterns - ↓ m02-resource: Smart pointer if needed - -"Replace unwrap with proper handling" - ↓ m06-error-handling: ? 
operator - ↓ m06-error-handling: expect with message -``` - ---- - -## Top 5 Beginner Mistakes - -| Rank | Mistake | Fix | -|------|---------|-----| -| 1 | Clone to escape borrow checker | Use references | -| 2 | Unwrap in production | Propagate with `?` | -| 3 | String for everything | Use `&str` | -| 4 | Index loops | Use iterators | -| 5 | Fighting lifetimes | Restructure to own data | - -## Code Smell → Refactoring - -| Smell | Indicates | Refactoring | -|-------|-----------|-------------| -| Many `.clone()` | Ownership unclear | Clarify data flow | -| Many `.unwrap()` | Error handling missing | Add proper handling | -| Many `pub` fields | Encapsulation broken | Private + accessors | -| Deep nesting | Complex logic | Extract methods | -| Long functions | Multiple responsibilities | Split | -| Giant enums | Missing abstraction | Trait + types | - ---- - -## Common Error Patterns - -| Error | Anti-Pattern Cause | Fix | -|-------|-------------------|-----| -| E0382 use after move | Cloning vs ownership | Proper references | -| Panic in production | Unwrap everywhere | ?, matching | -| Slow performance | String for all text | &str, Cow | -| Borrow checker fights | Wrong structure | Restructure | -| Memory bloat | Rc/Arc everywhere | Simple ownership | - ---- - -## Deprecated → Better - -| Deprecated | Better | -|------------|--------| -| Index-based loops | `.iter()`, `.enumerate()` | -| `collect::>()` then iterate | Chain iterators | -| Manual unsafe cell | `Cell`, `RefCell` | -| `mem::transmute` for casts | `as` or `TryFrom` | -| Custom linked list | `Vec`, `VecDeque` | -| `lazy_static!` | `std::sync::OnceLock` | - ---- - -## Quick Review Checklist - -- [ ] No `.clone()` without justification -- [ ] No `.unwrap()` in library code -- [ ] No `pub` fields with invariants -- [ ] No index loops when iterator works -- [ ] No `String` where `&str` suffices -- [ ] No ignored `#[must_use]` warnings -- [ ] No `unsafe` without SAFETY comment -- [ ] No giant functions (>50 
lines) - ---- - -## Related Skills - -| When | See | -|------|-----| -| Ownership patterns | m01-ownership | -| Error handling | m06-error-handling | -| Mental models | m14-mental-model | -| Performance | m10-performance | diff --git a/.claude/skills/m15-anti-pattern/patterns/common-mistakes.md b/.claude/skills/m15-anti-pattern/patterns/common-mistakes.md deleted file mode 100644 index 2c7dc2d40..000000000 --- a/.claude/skills/m15-anti-pattern/patterns/common-mistakes.md +++ /dev/null @@ -1,421 +0,0 @@ -# Common Rust Anti-Patterns & Mistakes - -## Ownership Anti-Patterns - -### 1. Clone Everything - -```rust -// ANTI-PATTERN: clone to avoid borrow checker -fn process(data: Vec) { - for item in data.clone() { // unnecessary clone - println!("{}", item); - } - use_data(data); -} - -// BETTER: borrow when you don't need ownership -fn process(data: Vec) { - for item in &data { // borrow instead - println!("{}", item); - } - use_data(data); -} -``` - -### 2. Unnecessary Box - -```rust -// ANTI-PATTERN: boxing everything -fn get_value() -> Box { - Box::new(String::from("hello")) -} - -// BETTER: return value directly -fn get_value() -> String { - String::from("hello") -} -``` - -### 3. Holding References Too Long - -```rust -// ANTI-PATTERN: borrow prevents mutation -let mut data = vec![1, 2, 3]; -let first = &data[0]; -data.push(4); // ERROR: data is borrowed -println!("{}", first); - -// BETTER: scope the borrow -let mut data = vec![1, 2, 3]; -let first = data[0]; // copy the value -data.push(4); // OK -println!("{}", first); -``` - ---- - -## Error Handling Anti-Patterns - -### 4. 
Unwrap Everywhere - -```rust -// ANTI-PATTERN: crashes on error -fn process_file(path: &str) { - let content = std::fs::read_to_string(path).unwrap(); - let config: Config = toml::from_str(&content).unwrap(); -} - -// BETTER: propagate errors -fn process_file(path: &str) -> Result { - let content = std::fs::read_to_string(path)?; - let config: Config = toml::from_str(&content)?; - Ok(config) -} -``` - -### 5. Ignoring Errors - -```rust -// ANTI-PATTERN: silent failure -let _ = file.write_all(data); - -// BETTER: handle or propagate -file.write_all(data)?; -// or at minimum, log the error -if let Err(e) = file.write_all(data) { - eprintln!("Warning: failed to write: {}", e); -} -``` - -### 6. Panic in Library Code - -```rust -// ANTI-PATTERN: library panics -pub fn parse(input: &str) -> Data { - if input.is_empty() { - panic!("input cannot be empty"); - } - // ... -} - -// BETTER: return Result -pub fn parse(input: &str) -> Result { - if input.is_empty() { - return Err(ParseError::EmptyInput); - } - // ... -} -``` - ---- - -## String Anti-Patterns - -### 7. String Instead of &str - -```rust -// ANTI-PATTERN: forces allocation -fn greet(name: String) { - println!("Hello, {}", name); -} - -greet("world".to_string()); // allocation - -// BETTER: accept &str -fn greet(name: &str) { - println!("Hello, {}", name); -} - -greet("world"); // no allocation -``` - -### 8. Format for Simple Concatenation - -```rust -// ANTI-PATTERN: format overhead -let greeting = format!("{}{}", "Hello, ", name); - -// BETTER for simple cases: push_str -let mut greeting = String::from("Hello, "); -greeting.push_str(name); - -// Or use + for String + &str -let greeting = String::from("Hello, ") + name; -``` - -### 9. 
Repeated String Operations - -```rust -// ANTI-PATTERN: O(n²) allocations -let mut result = String::new(); -for word in words { - result = result + word + " "; -} - -// BETTER: join -let result = words.join(" "); - -// Or with_capacity + push_str -let mut result = String::with_capacity(total_len); -for word in words { - result.push_str(word); - result.push(' '); -} -``` - ---- - -## Collection Anti-Patterns - -### 10. Index Instead of Iterator - -```rust -// ANTI-PATTERN: bounds checking overhead -for i in 0..vec.len() { - process(vec[i]); -} - -// BETTER: iterator -for item in &vec { - process(item); -} -``` - -### 11. Collect Then Iterate - -```rust -// ANTI-PATTERN: unnecessary allocation -let filtered: Vec<_> = items.iter().filter(|x| x.valid).collect(); -for item in filtered { - process(item); -} - -// BETTER: chain iterators -for item in items.iter().filter(|x| x.valid) { - process(item); -} -``` - -### 12. Wrong Collection Type - -```rust -// ANTI-PATTERN: Vec for frequent membership checks -let allowed: Vec<&str> = vec!["a", "b", "c"]; -if allowed.contains(&input) { ... } // O(n) - -// BETTER: HashSet for membership -use std::collections::HashSet; -let allowed: HashSet<&str> = ["a", "b", "c"].into(); -if allowed.contains(input) { ... } // O(1) -``` - ---- - -## Concurrency Anti-Patterns - -### 13. Mutex for Read-Heavy Data - -```rust -// ANTI-PATTERN: Mutex when mostly reading -let data = Arc::new(Mutex::new(config)); -// All readers block each other - -// BETTER: RwLock for read-heavy workloads -let data = Arc::new(RwLock::new(config)); -// Multiple readers can proceed in parallel -``` - -### 14. Holding Lock Across Await - -```rust -// ANTI-PATTERN: lock held across await -async fn bad() { - let guard = mutex.lock().unwrap(); - some_async_op().await; // lock held! 
- use(guard); -} - -// BETTER: scope the lock -async fn good() { - let value = { - let guard = mutex.lock().unwrap(); - guard.clone() - }; // lock released - some_async_op().await; - use(value); -} -``` - -### 15. Blocking in Async - -```rust -// ANTI-PATTERN: blocking call in async -async fn bad() { - std::thread::sleep(Duration::from_secs(1)); // blocks executor! -} - -// BETTER: async sleep -async fn good() { - tokio::time::sleep(Duration::from_secs(1)).await; -} - -// For CPU work: spawn_blocking -async fn compute() { - tokio::task::spawn_blocking(|| heavy_work()).await -} -``` - ---- - -## Type System Anti-Patterns - -### 16. Stringly Typed - -```rust -// ANTI-PATTERN: strings for everything -fn connect(host: &str, port: &str, timeout: &str) { ... } -connect("8080", "localhost", "30"); // wrong order! - -// BETTER: strong types -struct Host(String); -struct Port(u16); -struct Timeout(Duration); - -fn connect(host: Host, port: Port, timeout: Timeout) { ... } -``` - -### 17. Boolean Parameters - -```rust -// ANTI-PATTERN: what does true mean? -fn fetch(url: &str, use_cache: bool, validate_ssl: bool) { ... } -fetch("https://...", true, false); // unclear - -// BETTER: builder or named parameters -struct FetchOptions { - use_cache: bool, - validate_ssl: bool, -} - -fn fetch(url: &str, options: FetchOptions) { ... } -fetch("https://...", FetchOptions { - use_cache: true, - validate_ssl: false, -}); -``` - -### 18. Option> - -```rust -// ANTI-PATTERN: nested Option -fn find(id: u32) -> Option> { ... } -// What does None vs Some(None) mean? - -// BETTER: use Result or custom enum -enum FindResult { - Found(User), - NotFound, - Error(String), -} -``` - ---- - -## API Design Anti-Patterns - -### 19. 
Taking Ownership Unnecessarily - -```rust -// ANTI-PATTERN: takes ownership but doesn't need it -fn validate(config: Config) -> bool { - config.timeout > 0 && config.retries >= 0 -} - -// BETTER: borrow -fn validate(config: &Config) -> bool { - config.timeout > 0 && config.retries >= 0 -} -``` - -### 20. Returning References to Temporaries - -```rust -// ANTI-PATTERN: impossible lifetime -fn get_default() -> &str { - let s = String::from("default"); - &s // ERROR: s is dropped -} - -// BETTER: return owned -fn get_default() -> String { - String::from("default") -} - -// Or return static -fn get_default() -> &'static str { - "default" -} -``` - -### 21. Overly Generic Functions - -```rust -// ANTI-PATTERN: complex generics for simple function -fn process(input: T) -> V -where - T: Into, - U: AsRef + Clone, - V: From, -{ ... } - -// BETTER: concrete types if generics not needed -fn process(input: &str) -> String { ... } -``` - ---- - -## Macro Anti-Patterns - -### 22. Macro When Function Works - -```rust -// ANTI-PATTERN: macro for simple operation -macro_rules! add { - ($a:expr, $b:expr) => { $a + $b }; -} - -// BETTER: just use a function -fn add(a: i32, b: i32) -> i32 { a + b } -``` - -### 23. Complex Macro Without Tests - -```rust -// ANTI-PATTERN: complex macro with no tests -macro_rules! define_api { - // ... 100 lines of macro code ... 
-} - -// BETTER: test macro outputs -#[test] -fn test_macro_expansion() { - // Use cargo-expand or trybuild -} -``` - ---- - -## Quick Reference - -| Anti-Pattern | Better Alternative | -|--------------|-------------------| -| Clone everywhere | Borrow when possible | -| Unwrap everywhere | Propagate with `?` | -| `String` parameters | `&str` parameters | -| Index loops | Iterator loops | -| Collect then process | Chain iterators | -| Mutex for reads | RwLock for read-heavy | -| Lock across await | Scope the lock | -| Blocking in async | spawn_blocking | -| Stringly typed | Strong types | -| Boolean params | Builders or enums | diff --git a/.claude/skills/rust-daily/SKILL.md b/.claude/skills/rust-daily/SKILL.md deleted file mode 100644 index bf6ea1e36..000000000 --- a/.claude/skills/rust-daily/SKILL.md +++ /dev/null @@ -1,55 +0,0 @@ ---- -name: rust-daily -description: | - CRITICAL: Use for Rust news and daily/weekly/monthly reports. Triggers on: - rust news, rust daily, rust weekly, TWIR, rust blog, - Rust 日报, Rust 周报, Rust 新闻, Rust 动态 ---- - -# Rust Daily Report - -Fetch Rust community updates, filtered by time range. - -## Data Sources - -| Category | Sources | -|----------|---------| -| Ecosystem | Reddit r/rust, This Week in Rust | -| Official | blog.rust-lang.org, Inside Rust | -| Foundation | rustfoundation.org (news, blog, events) | - -## Parameters - -- `time_range`: day | week | month (default: week) -- `category`: all | ecosystem | official | foundation - -## Execution - -Read agent file then launch Task: - -``` -1. Read: ../../agents/rust-daily-reporter.md -2. 
Task(subagent_type: "general-purpose", run_in_background: false, prompt: ) -``` - -## Output Format - -```markdown -# Rust {Weekly|Daily|Monthly} Report - -**Time Range:** {start} - {end} - -## Ecosystem -| Score | Title | Link | - -## Official -| Date | Title | Summary | - -## Foundation -| Date | Title | Summary | -``` - -## Validation - -- Each source should have at least 1 result, otherwise mark "No updates" -- On fetch failure, retry with alternative tool diff --git a/.claude/skills/rust-learner/SKILL.md b/.claude/skills/rust-learner/SKILL.md deleted file mode 100644 index caa3f8d34..000000000 --- a/.claude/skills/rust-learner/SKILL.md +++ /dev/null @@ -1,171 +0,0 @@ ---- -name: rust-learner -description: "Use when asking about Rust versions or crate info. Keywords: latest version, what's new, changelog, Rust 1.x, Rust release, stable, nightly, crate info, crates.io, lib.rs, docs.rs, API documentation, crate features, dependencies, which crate, what version, Rust edition, edition 2021, edition 2024, cargo add, cargo update, 最新版本, 版本号, 稳定版, 最新, 哪个版本, crate 信息, 文档, 依赖, Rust 版本, 新特性, 有什么特性" ---- - -# Rust Learner - -> **Version:** 1.1.0 | **Last Updated:** 2026-01-16 - -You are an expert at fetching Rust and crate information. Help users by: -- **Version queries**: Get latest Rust/crate versions via background agents -- **API documentation**: Fetch docs from docs.rs -- **Changelog**: Get Rust version features from releases.rs - -**Primary skill for fetching Rust/crate information. All agents run in background.** - -## CRITICAL: How to Launch Agents - -**All agents MUST be launched via Task tool with these parameters:** - -``` -Task( - subagent_type: "general-purpose", - run_in_background: true, - prompt: -) -``` - -**Workflow:** -1. Read the agent prompt file: `../../agents/.md` (relative to this skill) -2. Launch Task with `run_in_background: true` -3. Continue with other work or wait for completion -4. 
Read results when agent completes - -## Agent Routing Table - -| Query Type | Agent File | Source | -|------------|------------|--------| -| Rust version features | `../../agents/rust-changelog.md` | releases.rs | -| Crate info/version | `../../agents/crate-researcher.md` | lib.rs, crates.io | -| **Std library docs** (Send, Sync, Arc, etc.) | `../../agents/std-docs-researcher.md` | doc.rust-lang.org | -| Third-party crate docs (tokio, serde, etc.) | `../../agents/docs-researcher.md` | docs.rs | -| Clippy lints | `../../agents/clippy-researcher.md` | rust-clippy docs | -| **Rust news/daily report** | `../../agents/rust-daily-reporter.md` | Reddit, TWIR, blogs | - -### Choosing docs-researcher vs std-docs-researcher - -| Query Pattern | Use Agent | -|---------------|-----------| -| `std::*`, `Send`, `Sync`, `Arc`, `Rc`, `Box`, `Vec`, `String` | `std-docs-researcher` | -| `tokio::*`, `serde::*`, any third-party crate | `docs-researcher` | - -## Tool Chain - -All agents use this tool chain (in order): - -1. **actionbook MCP** (first - get pre-computed selectors) - - `mcp__actionbook__search_actions("site_name")` → get action ID - - `mcp__actionbook__get_action_by_id(id)` → get URL + selectors - -2. **agent-browser CLI** (PRIMARY execution tool) - ```bash - agent-browser open - agent-browser get text - agent-browser close - ``` - -3. **WebFetch** (LAST RESORT only if agent-browser unavailable) - -### Fallback Principle (CRITICAL) - -``` -actionbook → agent-browser → WebFetch (only if agent-browser unavailable) -``` - -**DO NOT:** -- Skip agent-browser because it's slower -- Use WebFetch as primary when agent-browser is available -- Block on WebFetch without trying agent-browser first - -## Example: Crate Version Query - -``` -User: "tokio latest version" - -Claude: -1. Read ../../agents/crate-researcher.md -2. Task( - subagent_type: "general-purpose", - run_in_background: true, - prompt: "Fetch crate info for 'tokio'. 
Use actionbook MCP to get lib.rs selectors, then agent-browser to fetch. Return: name, version, description, features." - ) -3. Wait for agent or continue with other work -4. Summarize results to user -``` - -## Example: Third-Party Crate Documentation - -``` -User: "tokio::spawn documentation" - -Claude: -1. Read ../../agents/docs-researcher.md -2. Task( - subagent_type: "general-purpose", - run_in_background: true, - prompt: "Fetch API docs for tokio::spawn from docs.rs. Use agent-browser first. Return: signature, description, examples." - ) -3. Wait for agent -4. Summarize API to user -``` - -## Example: Std Library Documentation - -``` -User: "Send trait documentation" - -Claude: -1. Read ../../agents/std-docs-researcher.md (NOT docs-researcher!) -2. Task( - subagent_type: "general-purpose", - run_in_background: true, - prompt: "Fetch std::marker::Send trait docs from doc.rust-lang.org. Use agent-browser first. Return: description, implementors, examples." - ) -3. Wait for agent -4. Summarize trait to user -``` - -## Example: Rust Changelog Query - -``` -User: "What's new in Rust 1.85?" - -Claude: -1. Read ../../agents/rust-changelog.md -2. Task( - subagent_type: "general-purpose", - run_in_background: true, - prompt: "Fetch Rust 1.85 changelog from releases.rs. Use actionbook MCP for selectors, agent-browser to fetch. Return: language features, library changes, stabilized APIs." - ) -3. Wait for agent -4. 
Summarize features to user -``` - -## Deprecated Patterns - -| Deprecated | Use Instead | Reason | -|------------|-------------|--------| -| WebSearch for crate info | Task + crate-researcher | Structured data | -| Direct WebFetch | Task + actionbook | Pre-computed selectors | -| Foreground agent execution | `run_in_background: true` | Non-blocking | -| Guessing version numbers | Always use agents | Prevents misinformation | - -## Error Handling - -| Error | Cause | Solution | -|-------|-------|----------| -| actionbook unavailable | MCP not configured | Fall back to WebFetch | -| agent-browser not found | CLI not installed | Fall back to WebFetch | -| Agent timeout | Site slow/down | Retry or inform user | -| Empty results | Selector mismatch | Report and use WebFetch fallback | - -## Proactive Triggering - -This skill triggers AUTOMATICALLY when: -- Any Rust crate name mentioned (tokio, serde, axum, sqlx, etc.) -- Questions about "latest", "new", "version", "changelog" -- API documentation requests -- Dependency/feature questions - -**DO NOT use WebSearch for Rust crate info. Launch background agents instead.** diff --git a/.claude/skills/rust-router/SKILL.md b/.claude/skills/rust-router/SKILL.md deleted file mode 100644 index a17f66a6b..000000000 --- a/.claude/skills/rust-router/SKILL.md +++ /dev/null @@ -1,341 +0,0 @@ ---- -name: rust-router -description: "CRITICAL: Use for ALL Rust questions including errors, design, and coding. 
-Triggers on: Rust, cargo, rustc, crate, Cargo.toml, -意图分析, 问题分析, 语义分析, analyze intent, question analysis, -compile error, borrow error, lifetime error, ownership error, type error, trait error, -value moved, cannot borrow, does not live long enough, mismatched types, not satisfied, -E0382, E0597, E0277, E0308, E0499, E0502, E0596, -async, await, Send, Sync, tokio, unsafe, FFI, concurrency, error handling, -编译错误, compile error, 所有权, ownership, 借用, borrow, 生命周期, lifetime, 类型错误, type error, -异步, async, 并发, concurrency, 错误处理, error handling, -问题, problem, question, 怎么用, how to use, 如何, how to, 为什么, why, -什么是, what is, 帮我写, help me write, 实现, implement, 解释, explain, 区别, difference, 最佳实践, best practice" -globs: ["**/Cargo.toml", "**/*.rs"] ---- - -# Rust Question Router - -> **Version:** 2.0.0 | **Last Updated:** 2026-01-17 -> -> **New in v2.0:** Meta-Cognition Routing - Three-layer cognitive model for deeper answers - -## Meta-Cognition Framework - -### Core Principle - -**Don't answer directly. Trace through the cognitive layers first.** - -``` -Layer 3: Domain Constraints (WHY) -├── Business rules, regulatory requirements -├── domain-fintech, domain-web, domain-cli, etc. -└── "Why is it designed this way?" - -Layer 2: Design Choices (WHAT) -├── Architecture patterns, DDD concepts -├── m09-m15 skills -└── "What pattern should I use?" - -Layer 1: Language Mechanics (HOW) -├── Ownership, borrowing, lifetimes, traits -├── m01-m07 skills -└── "How do I implement this in Rust?" -``` - -### Routing by Entry Point - -| User Signal | Entry Layer | Direction | First Skill | -|-------------|-------------|-----------|-------------| -| E0xxx error | Layer 1 | Trace UP ↑ | m01-m07 | -| Compile error | Layer 1 | Trace UP ↑ | Error table below | -| "How to design..." | Layer 2 | Check L3, then DOWN ↓ | m09-domain | -| "Building [domain] app" | Layer 3 | Trace DOWN ↓ | domain-* | -| "Best practice..." 
| Layer 2 | Both directions | m09-m15 | - | Performance issue | Layer 1 → 2 | UP then DOWN | m10-performance | - - ### CRITICAL: Dual-Skill Loading - - **When domain keywords are present, you MUST load BOTH skills:** - - | Domain Keywords | L1 Skill | L3 Skill | - |-----------------|----------|----------| - | Web API, HTTP, axum, handler | m07-concurrency | **domain-web** | - | 交易, 支付, trading, payment | m01-ownership | **domain-fintech** | - | CLI, terminal, clap | m07-concurrency | **domain-cli** | - | kubernetes, grpc, microservice | m07-concurrency | **domain-cloud-native** | - | embedded, no_std, MCU | m02-resource | **domain-embedded** | - - **Example**: "Web API error: Rc cannot be sent" - - Load: m07-concurrency (L1 - Send/Sync error) - - Load: domain-web (L3 - Web state management) - - Answer must include both layers - - ### Trace Examples - - ``` - User: "My trading system reports E0382" - - 1. Entry: Layer 1 (E0382 = ownership error) - 2. Load: m01-ownership skill - 3. Trace UP: What design led to this? - 4. Check: domain-fintech (trading = immutable audit data) - 5. Answer: Don't clone, use Arc for shared immutable data - ``` - - ``` - User: "How should I handle user authentication?" - - 1. Entry: Layer 2 (design question) - 2. Trace UP to Layer 3: domain-web constraints - 3. Load: domain-web skill (security, stateless HTTP) - 4. Trace DOWN: m06-error-handling, m07-concurrency - 5.
Answer: JWT with proper error types, async handlers -``` - ---- - -## INSTRUCTIONS FOR CLAUDE - -### Default Project Settings (When Creating Rust Code) - -When creating new Rust projects or Cargo.toml files, ALWAYS use: - -```toml -[package] -edition = "2024" # ALWAYS use latest stable edition -rust-version = "1.85" - -[lints.rust] -unsafe_code = "warn" - -[lints.clippy] -all = "warn" -pedantic = "warn" -``` - -**Rules:** -- ALWAYS use `edition = "2024"` (not 2021 or earlier) -- Include `rust-version` for MSRV clarity -- Enable clippy lints by default -- DO NOT use WebSearch for Rust questions - use skills and agents - -### Meta-Cognition Routing Process - -1. **Identify Entry Layer** - - E0xxx errors → Layer 1 - - Design questions → Layer 2 - - Domain-specific → Layer 3 - -2. **Load Appropriate Skill** - - Read the skill file for context - - Note the "Trace Up" and "Trace Down" sections - -3. **Trace Through Layers** - - Don't stop at surface-level fix - - Ask "Why?" to trace up - - Ask "How?" to trace down - -4. **Answer with Context** - - Include the reasoning chain - - Reference which layers/skills informed the answer - -### When User Requests Intent Analysis - -User may say: "analyze this question" / "what type of problem is this" / "analyze intent" - -**Execute these steps:** - -1. **Extract Keywords** - Identify Rust concepts, error codes, crate names -2. **Identify Entry Layer** - Which cognitive layer is this? -3. **Map to Skills** - Which m0x/m1x/domain skills apply? -4. **Report Analysis** - Tell user the layers and suggested trace -5. 
**Invoke Skill** - Load and apply the matched skill - ---- - -## Layer 1 Skills (Language Mechanics) - -| Pattern | Category | Route To | -|---------|----------|----------| -| move, borrow, lifetime, E0382, E0597 | m01 | m01-ownership | -| Box, Rc, Arc, RefCell, Cell | m02 | m02-resource | -| mut, interior mutability, E0499, E0502, E0596 | m03 | m03-mutability | -| generic, trait, inline, monomorphization | m04 | m04-zero-cost | -| type state, phantom, newtype | m05 | m05-type-driven | -| Result, Error, panic, ?, anyhow, thiserror | m06 | m06-error-handling | -| Send, Sync, thread, async, channel | m07 | m07-concurrency | -| unsafe, FFI, extern, raw pointer, transmute | - | **unsafe-checker** | - -## Layer 2 Skills (Design Choices) - -| Pattern | Category | Route To | -|---------|----------|----------| -| domain model, business logic | m09 | m09-domain | -| performance, optimization, benchmark | m10 | m10-performance | -| integration, interop, bindings | m11 | m11-ecosystem | -| resource lifecycle, RAII, Drop | m12 | m12-lifecycle | -| domain error, recovery strategy | m13 | m13-domain-error | -| mental model, how to think | m14 | m14-mental-model | -| anti-pattern, common mistake, pitfall | m15 | m15-anti-pattern | - -## Layer 3 Skills (Domain Constraints) - -| Domain Keywords | Route To | -|-----------------|----------| -| fintech, trading, decimal, currency | domain-fintech | -| ml, tensor, model, inference | domain-ml | -| kubernetes, docker, grpc, microservice | domain-cloud-native | -| embedded, sensor, mqtt, iot | domain-iot | -| web server, HTTP, REST, axum, actix | domain-web | -| CLI, command line, clap, terminal | domain-cli | -| no_std, microcontroller, firmware | domain-embedded | - ---- - -## Error Code Routing - -| Error Code | Layer | Route To | Common Cause | -|------------|-------|----------|--------------| -| E0382 | L1 | m01-ownership | Use of moved value | -| E0597 | L1 | m01-ownership | Lifetime too short | -| E0506 | L1 | m01-ownership | 
Cannot assign to borrowed | -| E0507 | L1 | m01-ownership | Cannot move out of borrowed | -| E0515 | L1 | m01-ownership | Return local reference | -| E0716 | L1 | m01-ownership | Temporary value dropped | -| E0106 | L1 | m01-ownership | Missing lifetime specifier | -| E0596 | L1 | m03-mutability | Cannot borrow as mutable | -| E0499 | L1 | m03-mutability | Multiple mutable borrows | -| E0502 | L1 | m03-mutability | Borrow conflict | -| E0277 | L1 | m04/m07 | Trait bound not satisfied | -| E0308 | L1 | m04-zero-cost | Type mismatch | -| E0599 | L1 | m04-zero-cost | No method found | -| E0038 | L1 | m04-zero-cost | Trait not object-safe | -| E0433 | L1 | m11-ecosystem | Cannot find crate/module | - ---- - -## Unsafe-Specific Routing - -For **detailed unsafe rules**, route to `unsafe-checker` skill: - -| Pattern | Route To | -|---------|----------| -| unsafe code review, SAFETY comment | **unsafe-checker** | -| FFI, extern "C", C interop, libc | **unsafe-checker** | -| raw pointer, *mut, *const, NonNull | **unsafe-checker** | -| transmute, union, repr(C) | **unsafe-checker** | -| MaybeUninit, uninitialized memory | **unsafe-checker** | -| panic safety, double-free | **unsafe-checker** | -| Send impl, Sync impl, manual auto-traits | **unsafe-checker** | - ---- - -## Functional Routing Table - -| Pattern | Route To | Action | -|---------|----------|--------| -| latest version, what's new | **rust-learner** | Use agents | -| API, docs, documentation | **docs-researcher** | Use agent | -| Cargo.toml, dependencies | **dynamic-skills** | Suggest `/sync-crate-skills` | -| code style, naming, clippy | **coding-guidelines** | Read skill | -| unsafe code, FFI | **unsafe-checker** | Read skill | -| code review | **os-checker** | Suggest `/rust-review` | - ---- - -## Priority Order - -1. **Identify cognitive layer** (L1/L2/L3) -2. **Load entry skill** (m0x/m1x/domain) -3. **Trace through layers** (UP or DOWN) -4. **Cross-reference skills** as indicated in "Trace" sections -5. 
**Answer with reasoning chain** - -## Skill File Paths - -### Meta-Cognition Framework -``` -_meta/reasoning-framework.md # How to trace through layers -_meta/layer-definitions.md # Layer definitions -_meta/externalization.md # Cognitive externalization -_meta/error-protocol.md # 3-Strike escalation rule -_meta/hooks-patterns.md # Automatic triggers -``` - -### Layer 1 Skills (Language Mechanics) -``` -skills/m01-ownership/SKILL.md -skills/m02-resource/SKILL.md -skills/m03-mutability/SKILL.md -skills/m04-zero-cost/SKILL.md -skills/m05-type-driven/SKILL.md -skills/m06-error-handling/SKILL.md -skills/m07-concurrency/SKILL.md -``` - -### Layer 2 Skills (Design Choices) -``` -skills/m09-domain/SKILL.md -skills/m10-performance/SKILL.md -skills/m11-ecosystem/SKILL.md -skills/m12-lifecycle/SKILL.md -skills/m13-domain-error/SKILL.md -skills/m14-mental-model/SKILL.md -skills/m15-anti-pattern/SKILL.md -``` - -### Layer 3 Skills (Domain Constraints) -``` -skills/domain-fintech/SKILL.md -skills/domain-ml/SKILL.md -skills/domain-cloud-native/SKILL.md -skills/domain-iot/SKILL.md -skills/domain-web/SKILL.md -skills/domain-cli/SKILL.md -skills/domain-embedded/SKILL.md -``` - ---- - -## OS-Checker Integration - -For code review and security auditing: - -| Use Case | Command | Tools | -|----------|---------|-------| -| Daily check | `/rust-review` | clippy | -| Security audit | `/audit security` | cargo audit, geiger | -| Unsafe audit | `/audit safety` | miri, rudra | -| Concurrency audit | `/audit concurrency` | lockbud | -| Full audit | `/audit full` | all os-checker tools | - ---- - -## Workflow Example with Meta-Cognition - -``` -User: "Why am I getting E0382 in my trading system?" - -Analysis: -1. Entry: Layer 1 (E0382 = ownership/move error) -2. Load: m01-ownership skill -3. 
Context: "trading system" → domain-fintech - -Trace UP ↑: -- E0382 in trading context -- Check domain-fintech: "immutable audit records" -- Finding: Trading data should be shared, not moved - -Response: -"E0382 indicates a value was moved when still needed. -In a trading system (domain-fintech), transaction records -should be immutable and shareable for audit purposes. - -Instead of cloning, consider: -- Arc for shared immutable access -- This aligns with financial audit requirements - -See: m01-ownership (Trace Up section), - domain-fintech (Audit Requirements)" -``` diff --git a/.claude/skills/rust-skill-creator/SKILL.md b/.claude/skills/rust-skill-creator/SKILL.md deleted file mode 100644 index 18f0234af..000000000 --- a/.claude/skills/rust-skill-creator/SKILL.md +++ /dev/null @@ -1,125 +0,0 @@ ---- -name: rust-skill-creator -description: "Use when creating skills for Rust crates or std library documentation. Keywords: create rust skill, create crate skill, create std skill, 创建 rust skill, 创建 crate skill, 创建 std skill, 动态 rust skill, 动态 crate skill, skill for tokio, skill for serde, skill for axum, generate rust skill, rust 技能, crate 技能, 从文档创建skill, from docs create skill" ---- - -# Rust Skill Creator - -> Create dynamic skills for Rust crates and std library documentation. - -## When to Use - -This skill handles requests to create skills for: -- Third-party crates (tokio, serde, axum, etc.) -- Rust standard library (std::sync, std::marker, etc.) -- Any Rust documentation URL - -## Workflow - -### 1. Identify the Target - -| User Request | Target Type | URL Pattern | -|--------------|-------------|-------------| -| "create tokio skill" | Third-party crate | `docs.rs/tokio/latest/tokio/` | -| "create Send trait skill" | Std library | `doc.rust-lang.org/std/marker/trait.Send.html` | -| "create skill from URL" + URL | Custom URL | User-provided URL | - -### 2. 
Execute the Command - -Use the `/create-llms-for-skills` command: - -``` -/create-llms-for-skills [requirements] -``` - -**Examples:** - -```bash -# For third-party crate -/create-llms-for-skills https://docs.rs/tokio/latest/tokio/ - -# For std library -/create-llms-for-skills https://doc.rust-lang.org/std/marker/trait.Send.html - -# With specific requirements -/create-llms-for-skills https://docs.rs/axum/latest/axum/ "Focus on routing and extractors" -``` - -### 3. Follow-up with Skill Creation - -After llms.txt is generated, use: - -``` -/create-skills-via-llms [version] -``` - -## URL Construction Helper - -| Target | URL Template | -|--------|--------------| -| Crate overview | `https://docs.rs/{crate}/latest/{crate}/` | -| Crate module | `https://docs.rs/{crate}/latest/{crate}/{module}/` | -| Std trait | `https://doc.rust-lang.org/std/{module}/trait.{Name}.html` | -| Std struct | `https://doc.rust-lang.org/std/{module}/struct.{Name}.html` | -| Std module | `https://doc.rust-lang.org/std/{module}/index.html` | - -## Common Std Library Paths - -| Item | Path | -|------|------| -| Send, Sync, Copy, Clone | `std/marker/trait.{Name}.html` | -| Arc, Mutex, RwLock | `std/sync/struct.{Name}.html` | -| Rc, Weak | `std/rc/struct.{Name}.html` | -| RefCell, Cell | `std/cell/struct.{Name}.html` | -| Box | `std/boxed/struct.Box.html` | -| Vec | `std/vec/struct.Vec.html` | -| String | `std/string/struct.String.html` | -| Option | `std/option/enum.Option.html` | -| Result | `std/result/enum.Result.html` | - -## Example Interactions - -### Example 1: Create Crate Skill - -``` -User: "Create a dynamic skill for tokio" - -Claude: -1. Identify: Third-party crate "tokio" -2. Execute: /create-llms-for-skills https://docs.rs/tokio/latest/tokio/ -3. Wait for llms.txt generation -4. Execute: /create-skills-via-llms tokio ~/tmp/{timestamp}-tokio-llms.txt -``` - -### Example 2: Create Std Library Skill - -``` -User: "Create a skill for Send and Sync traits" - -Claude: -1. 
Identify: Std library traits -2. Execute: /create-llms-for-skills https://doc.rust-lang.org/std/marker/trait.Send.html https://doc.rust-lang.org/std/marker/trait.Sync.html -3. Wait for llms.txt generation -4. Execute: /create-skills-via-llms std-marker ~/tmp/{timestamp}-std-marker-llms.txt -``` - -### Example 3: Custom URL - -``` -User: "Create skill from https://docs.rs/sqlx/latest/sqlx/" - -Claude: -1. Identify: User-provided URL -2. Execute: /create-llms-for-skills https://docs.rs/sqlx/latest/sqlx/ -3. Follow standard workflow -``` - -## DO NOT - -- Use `best-skill-creator` for Rust-related skill creation -- Skip the `/create-llms-for-skills` step -- Guess documentation URLs without verification - -## Output Location - -All generated skills are saved to: `~/.claude/skills/` diff --git a/.claude/skills/unsafe-checker/AGENTS.md b/.claude/skills/unsafe-checker/AGENTS.md deleted file mode 100644 index dbcc967a0..000000000 --- a/.claude/skills/unsafe-checker/AGENTS.md +++ /dev/null @@ -1,136 +0,0 @@ -# Unsafe Checker - Quick Reference - -**Auto-generated from rules/** - -## Rule Summary by Section - -### General Principles (3 rules) -| ID | Level | Title | -|----|-------|-------| -| general-01 | P | Do Not Abuse Unsafe to Escape Compiler Safety Checks | -| general-02 | P | Do Not Blindly Use Unsafe for Performance | -| general-03 | G | Do Not Create Aliases for Types/Methods Named "Unsafe" | - -### Safety Abstraction (11 rules) -| ID | Level | Title | -|----|-------|-------| -| safety-01 | P | Be Aware of Memory Safety Issues from Panics | -| safety-02 | P | Unsafe Code Authors Must Verify Safety Invariants | -| safety-03 | P | Do Not Expose Uninitialized Memory in Public APIs | -| safety-04 | P | Avoid Double-Free from Panic Safety Issues | -| safety-05 | P | Consider Safety When Manually Implementing Auto Traits | -| safety-06 | P | Do Not Expose Raw Pointers in Public APIs | -| safety-07 | P | Provide Unsafe Counterparts for Performance Alongside Safe Methods | -| 
safety-08 | P | Mutable Return from Immutable Parameter is Wrong | -| safety-09 | P | Add SAFETY Comment Before Any Unsafe Block | -| safety-10 | G | Add Safety Section in Docs for Public Unsafe Functions | -| safety-11 | G | Use assert! Instead of debug_assert! in Unsafe Functions | - -### Raw Pointers (6 rules) -| ID | Level | Title | -|----|-------|-------| -| ptr-01 | P | Do Not Share Raw Pointers Across Threads | -| ptr-02 | P | Prefer NonNull Over *mut T | -| ptr-03 | P | Use PhantomData for Variance and Ownership | -| ptr-04 | G | Do Not Dereference Pointers Cast to Misaligned Types | -| ptr-05 | G | Do Not Manually Convert Immutable Pointer to Mutable | -| ptr-06 | G | Prefer pointer::cast Over `as` for Pointer Casting | - -### Union (2 rules) -| ID | Level | Title | -|----|-------|-------| -| union-01 | P | Avoid Union Except for C Interop | -| union-02 | P | Do Not Use Union Variants Across Different Lifetimes | - -### Memory Layout (6 rules) -| ID | Level | Title | -|----|-------|-------| -| mem-01 | P | Choose Appropriate Data Layout for Struct/Tuple/Enum | -| mem-02 | P | Do Not Modify Memory Variables of Other Processes | -| mem-03 | P | Do Not Let String/Vec Auto-Drop Other Process's Memory | -| mem-04 | P | Prefer Reentrant Versions of C-API or Syscalls | -| mem-05 | P | Use Third-Party Crates for Bitfields | -| mem-06 | G | Use MaybeUninit for Uninitialized Memory | - -### FFI (18 rules) -| ID | Level | Title | -|----|-------|-------| -| ffi-01 | P | Avoid Passing Strings Directly to C | -| ffi-02 | P | Read Documentation Carefully for std::ffi Types | -| ffi-03 | P | Implement Drop for Wrapped C Pointers | -| ffi-04 | P | Handle Panics When Crossing FFI Boundaries | -| ffi-05 | P | Use Portable Type Aliases from std or libc | -| ffi-06 | P | Ensure C-ABI String Compatibility | -| ffi-07 | P | Do Not Implement Drop for Types Passed to External Code | -| ffi-08 | P | Handle Errors Properly in FFI | -| ffi-09 | P | Use References Instead of Raw 
Pointers in Safe Wrappers | -| ffi-10 | P | Exported Functions Must Be Thread-Safe | -| ffi-11 | P | Be Careful with repr(packed) Field References | -| ffi-12 | P | Document Invariant Assumptions for C Parameters | -| ffi-13 | P | Ensure Consistent Data Layout for Custom Types | -| ffi-14 | P | Types in FFI Should Have Stable Layout | -| ffi-15 | P | Validate Non-Robust External Values | -| ffi-16 | P | Separate Data and Code for Closures to C | -| ffi-17 | P | Use Opaque Types Instead of c_void | -| ffi-18 | P | Avoid Passing Trait Objects to C | - -### I/O Safety (1 rule) -| ID | Level | Title | -|----|-------|-------| -| io-01 | P | Ensure I/O Safety When Using Raw Handles | - -## Clippy Lint Mapping - -| Clippy Lint | Rule | Category | -|-------------|------|----------| -| `undocumented_unsafe_blocks` | safety-09 | SAFETY comments | -| `missing_safety_doc` | safety-10 | Safety docs | -| `panic_in_result_fn` | safety-01, ffi-04 | Panic safety | -| `non_send_fields_in_send_ty` | safety-05 | Send/Sync | -| `uninit_assumed_init` | safety-03 | Initialization | -| `uninit_vec` | mem-06 | Initialization | -| `mut_from_ref` | safety-08 | Aliasing | -| `cast_ptr_alignment` | ptr-04 | Alignment | -| `cast_ref_to_mut` | ptr-05 | Aliasing | -| `ptr_as_ptr` | ptr-06 | Pointer casting | -| `unaligned_references` | ffi-11 | Packed structs | -| `debug_assert_with_mut_call` | safety-11 | Assertions | - -## Quick Decision Tree - -``` -Writing unsafe code? - │ - ├─ FFI with C? - │ └─ See ffi-* rules - │ - ├─ Raw pointers? - │ └─ See ptr-* rules - │ - ├─ Manual Send/Sync? - │ └─ See safety-05 - │ - ├─ MaybeUninit/uninitialized? - │ └─ See safety-03, mem-06 - │ - └─ Performance optimization? 
- └─ See general-02, safety-07 -``` - -## Essential Checklist - -Before every unsafe block: -- [ ] SAFETY comment present -- [ ] Invariants documented -- [ ] Pointer validity checked -- [ ] Aliasing rules followed -- [ ] Panic safety considered -- [ ] Tested with Miri - -## Resources - -- `checklists/before-unsafe.md` - Pre-writing checklist -- `checklists/review-unsafe.md` - Code review checklist -- `checklists/common-pitfalls.md` - Common bugs and fixes -- `examples/safe-abstraction.md` - Safe wrapper patterns -- `examples/ffi-patterns.md` - FFI best practices diff --git a/.claude/skills/unsafe-checker/SKILL.md b/.claude/skills/unsafe-checker/SKILL.md deleted file mode 100644 index 7205cd971..000000000 --- a/.claude/skills/unsafe-checker/SKILL.md +++ /dev/null @@ -1,72 +0,0 @@ ---- -name: unsafe-checker -description: "Use when reviewing unsafe code or writing FFI. Keywords: unsafe, raw pointer, FFI, extern, transmute, *mut, *const, union, #[repr(C)], libc, std::ffi, MaybeUninit, NonNull, PhantomData, Send, Sync, SAFETY comment, soundness, undefined behavior, UB, how to call C functions, safe wrapper for unsafe code, when is unsafe necessary, memory layout, bindgen, cbindgen, CString, CStr, invariant, 安全抽象, 裸指针, 外部函数接口, 内存布局, 不安全代码, FFI 绑定, 未定义行为" -globs: ["**/*.rs"] ---- - -# Unsafe Rust Checker - -## When Unsafe is Valid - -| Use Case | Example | -|----------|---------| -| FFI | Calling C functions | -| Low-level abstractions | Implementing `Vec`, `Arc` | -| Performance | Measured bottleneck with safe alternative too slow | - -**NOT valid:** Escaping borrow checker without understanding why. - -## Required Documentation - -```rust -// SAFETY: -unsafe { ... } - -/// # Safety -/// -pub unsafe fn dangerous() { ... 
} -``` - -## Quick Reference - -| Operation | Safety Requirements | -|-----------|---------------------| -| `*ptr` deref | Valid, aligned, initialized | -| `&*ptr` | + No aliasing violations | -| `transmute` | Same size, valid bit pattern | -| `extern "C"` | Correct signature, ABI | -| `static mut` | Synchronization guaranteed | -| `impl Send/Sync` | Actually thread-safe | - -## Common Errors - -| Error | Fix | -|-------|-----| -| Null pointer deref | Check for null before deref | -| Use after free | Ensure lifetime validity | -| Data race | Add proper synchronization | -| Alignment violation | Use `#[repr(C)]`, check alignment | -| Invalid bit pattern | Use `MaybeUninit` | -| Missing SAFETY comment | Add `// SAFETY:` | - -## Deprecated → Better - -| Deprecated | Use Instead | -|------------|-------------| -| `mem::uninitialized()` | `MaybeUninit` | -| `mem::zeroed()` for refs | `MaybeUninit` | -| Raw pointer arithmetic | `NonNull`, `ptr::add` | -| `CString::new().unwrap().as_ptr()` | Store `CString` first | -| `static mut` | `AtomicT` or `Mutex` | -| Manual extern | `bindgen` | - -## FFI Crates - -| Direction | Crate | -|-----------|-------| -| C → Rust | bindgen | -| Rust → C | cbindgen | -| Python | PyO3 | -| Node.js | napi-rs | - -Claude knows unsafe Rust. Focus on SAFETY comments and soundness. diff --git a/.claude/skills/unsafe-checker/checklists/before-unsafe.md b/.claude/skills/unsafe-checker/checklists/before-unsafe.md deleted file mode 100644 index 174b05ea8..000000000 --- a/.claude/skills/unsafe-checker/checklists/before-unsafe.md +++ /dev/null @@ -1,115 +0,0 @@ -# Checklist: Before Writing Unsafe Code - -Use this checklist before writing any `unsafe` block or `unsafe fn`. - -## 1. Do You Really Need Unsafe? - -- [ ] Have you tried all safe alternatives? -- [ ] Can you restructure the code to satisfy the borrow checker? -- [ ] Would interior mutability (`Cell`, `RefCell`, `Mutex`) solve the problem? -- [ ] Is there a safe crate that already does this? 
-- [ ] Is the performance gain (if any) worth the safety risk? - -**If you answered "no" to all, proceed with unsafe.** - -## 2. What Unsafe Operation Do You Need? - -Identify which specific unsafe operation you're performing: - -- [ ] Dereferencing a raw pointer (`*const T`, `*mut T`) -- [ ] Calling an `unsafe` function -- [ ] Accessing a mutable static variable -- [ ] Implementing an unsafe trait (`Send`, `Sync`, etc.) -- [ ] Accessing fields of a `union` -- [ ] Using `extern "C"` functions (FFI) - -## 3. Safety Invariants - -For each unsafe operation, document the invariants: - -### For Pointer Dereference: -- [ ] Is the pointer non-null? -- [ ] Is the pointer properly aligned for the type? -- [ ] Does the pointer point to valid, initialized memory? -- [ ] Is the memory not being mutated by other code? -- [ ] Will the memory remain valid for the entire duration of use? - -### For Mutable Aliasing: -- [ ] Are you creating multiple mutable references to the same memory? -- [ ] Is there any possibility of aliasing `&mut` and `&`? -- [ ] Have you verified no other code can access this memory? - -### For FFI: -- [ ] Is the function signature correct (types, ABI)? -- [ ] Are you handling potential null pointers? -- [ ] Are you handling potential panics (catch_unwind)? -- [ ] Is memory ownership clear (who allocates, who frees)? - -### For Send/Sync: -- [ ] Is concurrent access properly synchronized? -- [ ] Are there any data races possible? -- [ ] Does the type truly satisfy the trait requirements? - -## 4. Panic Safety - -- [ ] What happens if this code panics at any line? -- [ ] Are data structures left in a valid state on panic? -- [ ] Do you need a panic guard for cleanup? -- [ ] Could a destructor see invalid state? - -## 5. Documentation - -- [ ] Have you written a `// SAFETY:` comment explaining: - - What invariants must hold? - - Why those invariants are upheld here? 
- -- [ ] For `unsafe fn`, have you written `# Safety` docs explaining: - - What the caller must guarantee? - - What happens if requirements are violated? - -## 6. Testing and Verification - -- [ ] Can you add debug assertions to verify invariants? -- [ ] Have you tested with Miri (`cargo miri test`)? -- [ ] Have you tested with address sanitizer (`RUSTFLAGS="-Zsanitizer=address"`)? -- [ ] Have you considered fuzzing the unsafe code? - -## Quick Reference: Common SAFETY Comments - -```rust -// SAFETY: We checked that index < len above, so this is in bounds. - -// SAFETY: The pointer was created from a valid reference and hasn't been invalidated. - -// SAFETY: We hold the lock, guaranteeing exclusive access. - -// SAFETY: The type is #[repr(C)] and all fields are initialized. - -// SAFETY: Caller guarantees the pointer is non-null and properly aligned. -``` - -## Decision Flowchart - -``` -Need unsafe? - | - v -Can you use safe Rust? --Yes--> Don't use unsafe - | - No - v -Can you use existing safe abstraction? --Yes--> Use it (std, crates) - | - No - v -Document all invariants - | - v -Add SAFETY comments - | - v -Write the unsafe code - | - v -Test with Miri -``` diff --git a/.claude/skills/unsafe-checker/checklists/common-pitfalls.md b/.claude/skills/unsafe-checker/checklists/common-pitfalls.md deleted file mode 100644 index 45390082e..000000000 --- a/.claude/skills/unsafe-checker/checklists/common-pitfalls.md +++ /dev/null @@ -1,253 +0,0 @@ -# Common Unsafe Pitfalls and Fixes - -A reference of frequently encountered unsafe bugs and how to fix them. - -## Pitfall 1: Dangling Pointer from Local - -**Bug:** -```rust -fn bad() -> *const i32 { - let x = 42; - &x as *const i32 // Dangling after return! 
-} -``` - -**Fix:** -```rust -fn good() -> Box<i32> { - Box::new(42) // Heap allocation lives beyond function -} - -// Or return the value itself -fn better() -> i32 { - 42 -} -``` - -## Pitfall 2: CString Lifetime - -**Bug:** -```rust -fn bad() -> *const c_char { - let s = CString::new("hello").unwrap(); - s.as_ptr() // Dangling! CString dropped -} -``` - -**Fix:** -```rust -fn good(s: &CString) -> *const c_char { - s.as_ptr() // Caller keeps CString alive -} - -// Or take ownership -fn also_good(s: CString) -> *const c_char { - s.into_raw() // Caller must free with CString::from_raw -} -``` - -## Pitfall 3: Vec set_len with Uninitialized Data - -**Bug:** -```rust -fn bad() -> Vec<String> { - let mut v = Vec::with_capacity(10); - unsafe { v.set_len(10); } // Strings are uninitialized! - v -} -``` - -**Fix:** -```rust -fn good() -> Vec<String> { - let mut v = Vec::with_capacity(10); - for _ in 0..10 { - v.push(String::new()); - } - v -} - -// Or use resize -fn also_good() -> Vec<String> { - let mut v = Vec::new(); - v.resize(10, String::new()); - v -} -``` - -## Pitfall 4: Reference to Packed Field - -**Bug:** -```rust -#[repr(packed)] -struct Packed { a: u8, b: u32 } - -fn bad(p: &Packed) -> &u32 { - &p.b // UB: misaligned reference! -} -``` - -**Fix:** -```rust -fn good(p: &Packed) -> u32 { - unsafe { std::ptr::addr_of!(p.b).read_unaligned() } -} -``` - -## Pitfall 5: Mutable Aliasing Through Raw Pointers - -**Bug:** -```rust -fn bad() { - let mut x = 42; - let ptr1 = &mut x as *mut i32; - let ptr2 = &mut x as *mut i32; // Already have ptr1! - unsafe { - *ptr1 = 1; - *ptr2 = 2; // Aliasing mutable pointers! - } -} -``` - -**Fix:** -```rust -fn good() { - let mut x = 42; - let ptr = &mut x as *mut i32; - unsafe { - *ptr = 1; - *ptr = 2; // Same pointer, sequential access - } -} -``` - -## Pitfall 6: Transmute to Wrong Size - -**Bug:** -```rust -fn bad() { - let x: u32 = 42; - let y: u64 = unsafe { std::mem::transmute(x) }; // UB: size mismatch!
-} -``` - -**Fix:** -```rust -fn good() { - let x: u32 = 42; - let y: u64 = x as u64; // Use conversion -} -``` - -## Pitfall 7: Invalid Enum Discriminant - -**Bug:** -```rust -#[repr(u8)] -enum Status { A = 0, B = 1, C = 2 } - -fn bad(raw: u8) -> Status { - unsafe { std::mem::transmute(raw) } // UB if raw > 2! -} -``` - -**Fix:** -```rust -fn good(raw: u8) -> Option<Status> { - match raw { - 0 => Some(Status::A), - 1 => Some(Status::B), - 2 => Some(Status::C), - _ => None, - } -} -``` - -## Pitfall 8: FFI Panic Unwinding - -**Bug:** -```rust -#[no_mangle] -extern "C" fn callback(x: i32) -> i32 { - if x < 0 { - panic!("negative!"); // UB: unwinding across FFI! - } - x * 2 -} -``` - -**Fix:** -```rust -#[no_mangle] -extern "C" fn callback(x: i32) -> i32 { - std::panic::catch_unwind(|| { - if x < 0 { - panic!("negative!"); - } - x * 2 - }).unwrap_or(-1) // Return error code on panic -} -``` - -## Pitfall 9: Double Free from Clone + into_raw - -**Bug:** -```rust -struct Handle(*mut c_void); - -impl Clone for Handle { - fn clone(&self) -> Self { - Handle(self.0) // Both now "own" same pointer! - } -} - -impl Drop for Handle { - fn drop(&mut self) { - unsafe { free(self.0); } // Double free when both drop! - } -} -``` - -**Fix:** -```rust -struct Handle(*mut c_void); - -// Don't implement Clone, or implement proper reference counting -impl Handle { - fn clone_ptr(&self) -> *mut c_void { - self.0 // Return raw pointer, no ownership - } -} -``` - -## Pitfall 10: Forget Doesn't Run Destructors - -**Bug:** -```rust -fn bad() { - let guard = lock.lock(); - std::mem::forget(guard); // Lock never released!
-} -``` - -**Fix:** -```rust -fn good() { - let guard = lock.lock(); - // Let guard drop naturally - // or explicitly: drop(guard); -} -``` - -## Quick Reference Table - -| Pitfall | Detection | Fix | -|---------|-----------|-----| -| Dangling pointer | Miri | Extend lifetime or heap allocate | -| Uninitialized read | Miri | Use MaybeUninit properly | -| Misaligned access | Miri, UBsan | read_unaligned, copy by value | -| Data race | TSan | Use atomics or mutex | -| Double free | ASan | Track ownership carefully | -| Invalid enum | Manual review | Use TryFrom | -| FFI panic | Testing | catch_unwind | -| Type confusion | Miri | Match types exactly | diff --git a/.claude/skills/unsafe-checker/checklists/review-unsafe.md b/.claude/skills/unsafe-checker/checklists/review-unsafe.md deleted file mode 100644 index 5efa11b07..000000000 --- a/.claude/skills/unsafe-checker/checklists/review-unsafe.md +++ /dev/null @@ -1,113 +0,0 @@ -# Checklist: Reviewing Unsafe Code - -Use this checklist when reviewing code containing `unsafe`. - -## 1. Surface-Level Checks - -- [ ] Does every `unsafe` block have a `// SAFETY:` comment? -- [ ] Does every `unsafe fn` have `# Safety` documentation? -- [ ] Are the safety comments specific and verifiable, not vague? -- [ ] Is the unsafe code minimized (smallest possible unsafe block)? - -## 2. Pointer Validity - -For each pointer dereference: - -- [ ] **Non-null**: Is null checked before dereference? -- [ ] **Aligned**: Is alignment verified or guaranteed by construction? -- [ ] **Valid**: Does the pointer point to allocated memory? -- [ ] **Initialized**: Is the memory initialized before reading? -- [ ] **Lifetime**: Is the memory valid for the entire use duration? -- [ ] **Unique**: For `&mut`, is there only one mutable reference? - -## 3. Memory Safety - -- [ ] **No aliasing**: Are `&` and `&mut` never created to the same memory simultaneously? -- [ ] **No use-after-free**: Is memory not accessed after deallocation? 
-- [ ] **No double-free**: Is memory freed exactly once? -- [ ] **No data races**: Is concurrent access properly synchronized? -- [ ] **Bounds checked**: Are array/slice accesses in bounds? - -## 4. Type Safety - -- [ ] **Transmute**: Are transmuted types actually compatible? -- [ ] **Repr**: Do FFI types have `#[repr(C)]`? -- [ ] **Enum values**: Are enum discriminants validated from external sources? -- [ ] **Unions**: Is the correct union field accessed? - -## 5. Panic Safety - -- [ ] What state is the program in if this code panics? -- [ ] Are partially constructed objects properly cleaned up? -- [ ] Do Drop implementations see valid state? -- [ ] Is there a panic guard if needed? - -## 6. FFI-Specific Checks - -- [ ] **Types**: Do Rust types match C types exactly? -- [ ] **Strings**: Are strings properly null-terminated? -- [ ] **Ownership**: Is it clear who owns/frees memory? -- [ ] **Thread safety**: Are callbacks thread-safe? -- [ ] **Panic boundary**: Are panics caught before crossing FFI? -- [ ] **Error handling**: Are C-style errors properly handled? - -## 7. Concurrency Checks - -- [ ] **Send/Sync**: Are manual implementations actually sound? -- [ ] **Atomics**: Are memory orderings correct? -- [ ] **Locks**: Is there potential for deadlock? -- [ ] **Data races**: Is all shared mutable state synchronized? - -## 8. Red Flags (Require Extra Scrutiny) - -| Pattern | Concern | -|---------|---------| -| `transmute` | Type compatibility, provenance | -| `as` on pointers | Alignment, type punning | -| `static mut` | Data races | -| `*const T as *mut T` | Aliasing violation | -| Manual `Send`/`Sync` | Thread safety | -| `assume_init` | Initialization | -| `set_len` on Vec | Uninitialized memory | -| `from_raw_parts` | Lifetime, validity | -| `offset`/`add`/`sub` | Out of bounds | -| FFI callbacks | Panic safety | - -## 9. Verification Questions - -Ask the author: -- "What would happen if [X invariant] was violated?" 
-- "How do you know [pointer/reference] is valid here?" -- "What if this panics at [specific line]?" -- "Who is responsible for freeing this memory?" - -## 10. Testing Requirements - -- [ ] Has this been tested with Miri? -- [ ] Are there unit tests covering edge cases? -- [ ] Are there tests for error conditions? -- [ ] Has concurrent code been tested under stress? - -## Review Severity Guide - -| Severity | Requires | -|----------|----------| -| `transmute` | Two reviewers, Miri test | -| Manual `Send`/`Sync` | Thread safety expert review | -| FFI | Documentation of C interface | -| `static mut` | Justification for not using atomic/mutex | -| Pointer arithmetic | Bounds proof | - -## Sample Review Comments - -``` -// Good SAFETY comment ✓ -// SAFETY: index was checked to be < len on line 42 - -// Needs improvement ✗ -// SAFETY: This is safe because we know it works - -// Missing information ✗ -// SAFETY: ptr is valid -// (Why is it valid? How do we know?) -``` diff --git a/.claude/skills/unsafe-checker/examples/ffi-patterns.md b/.claude/skills/unsafe-checker/examples/ffi-patterns.md deleted file mode 100644 index 4e928ec18..000000000 --- a/.claude/skills/unsafe-checker/examples/ffi-patterns.md +++ /dev/null @@ -1,353 +0,0 @@ -# FFI Best Practices and Patterns - -Examples of safe and idiomatic Rust-C interoperability. 
- -## Pattern 1: Basic FFI Wrapper - -```rust -use std::ffi::{CStr, CString}; -use std::os::raw::{c_char, c_int, c_void}; -use std::ptr::NonNull; - -// Raw C API -mod ffi { - use super::*; - - extern "C" { - pub fn lib_create(name: *const c_char) -> *mut c_void; - pub fn lib_destroy(handle: *mut c_void); - pub fn lib_process(handle: *mut c_void, data: *const u8, len: usize) -> c_int; - pub fn lib_get_error() -> *const c_char; - } -} - -// Safe Rust wrapper -pub struct Library { - handle: NonNull<c_void>, -} - -#[derive(Debug)] -pub struct LibraryError(String); - -impl Library { - pub fn new(name: &str) -> Result<Self, LibraryError> { - let c_name = CString::new(name).map_err(|_| LibraryError("invalid name".into()))?; - - let handle = unsafe { ffi::lib_create(c_name.as_ptr()) }; - - NonNull::new(handle) - .map(|handle| Self { handle }) - .ok_or_else(|| Self::last_error()) - } - - pub fn process(&self, data: &[u8]) -> Result<(), LibraryError> { - let result = unsafe { - ffi::lib_process(self.handle.as_ptr(), data.as_ptr(), data.len()) - }; - - if result == 0 { - Ok(()) - } else { - Err(Self::last_error()) - } - } - - fn last_error() -> LibraryError { - let ptr = unsafe { ffi::lib_get_error() }; - if ptr.is_null() { - LibraryError("unknown error".into()) - } else { - let msg = unsafe { CStr::from_ptr(ptr) } - .to_string_lossy() - .into_owned(); - LibraryError(msg) - } - } -} - -impl Drop for Library { - fn drop(&mut self) { - unsafe { ffi::lib_destroy(self.handle.as_ptr()); } - } -} - -// Prevent accidental copies -impl !Clone for Library {} -``` - -## Pattern 2: Callback Registration - -```rust -use std::os::raw::{c_int, c_void}; -use std::panic::{catch_unwind, AssertUnwindSafe}; - -type CCallback = extern "C" fn(value: c_int, user_data: *mut c_void) -> c_int; - -extern "C" { - fn register_callback(cb: CCallback, user_data: *mut c_void); - fn unregister_callback(); -} - -/// Safely register a Rust closure as a C callback.
-pub struct CallbackGuard<F> { - _closure: Box<F>, -} - -impl<F: FnMut(i32) -> i32 + 'static> CallbackGuard<F> { - pub fn register(closure: F) -> Self { - let boxed = Box::new(closure); - let user_data = Box::into_raw(boxed) as *mut c_void; - - extern "C" fn trampoline<F: FnMut(i32) -> i32>( - value: c_int, - user_data: *mut c_void, - ) -> c_int { - let result = catch_unwind(AssertUnwindSafe(|| { - let closure = unsafe { &mut *(user_data as *mut F) }; - closure(value as i32) as c_int - })); - result.unwrap_or(-1) - } - - unsafe { - register_callback(trampoline::<F>, user_data); - } - - Self { - // SAFETY: We just created this box and need to keep it alive - _closure: unsafe { Box::from_raw(user_data as *mut F) }, - } - } -} - -impl<F> Drop for CallbackGuard<F> { - fn drop(&mut self) { - unsafe { unregister_callback(); } - // Box in _closure is dropped automatically - } -} - -// Usage -fn example() { - let multiplier = 2; - let _guard = CallbackGuard::register(move |x| x * multiplier); - // Callback is active until _guard is dropped -} -``` - -## Pattern 3: Opaque Handle Types - -```rust -use std::marker::PhantomData; - -// Opaque type markers - prevents mixing up handles -#[repr(C)] -pub struct DatabaseHandle { - _data: [u8; 0], - _marker: PhantomData<(*mut u8, std::marker::PhantomPinned)>, -} - -#[repr(C)] -pub struct ConnectionHandle { - _data: [u8; 0], - _marker: PhantomData<(*mut u8, std::marker::PhantomPinned)>, -} - -mod ffi { - use super::*; - - extern "C" { - pub fn db_open(path: *const c_char) -> *mut DatabaseHandle; - pub fn db_close(db: *mut DatabaseHandle); - pub fn db_connect(db: *mut DatabaseHandle) -> *mut ConnectionHandle; - pub fn conn_close(conn: *mut ConnectionHandle); - pub fn conn_query(conn: *mut ConnectionHandle, sql: *const c_char) -> c_int; - } -} - -// Type-safe wrappers -pub struct Database { - handle: NonNull<DatabaseHandle>, -} - -pub struct Connection<'db> { - handle: NonNull<ConnectionHandle>, - _db: PhantomData<&'db Database>, -} - -impl Database { - pub fn open(path: &str) -> Result<Self, ()> { - let c_path = 
CString::new(path).map_err(|_| ())?; - let handle = unsafe { ffi::db_open(c_path.as_ptr()) }; - NonNull::new(handle).map(|h| Self { handle: h }).ok_or(()) - } - - pub fn connect(&self) -> Result<Connection<'_>, ()> { - let handle = unsafe { ffi::db_connect(self.handle.as_ptr()) }; - NonNull::new(handle) - .map(|h| Connection { handle: h, _db: PhantomData }) - .ok_or(()) - } -} - -impl Drop for Database { - fn drop(&mut self) { - // All Connections must be dropped first (enforced by lifetime) - unsafe { ffi::db_close(self.handle.as_ptr()); } - } -} - -impl Connection<'_> { - pub fn query(&self, sql: &str) -> Result<(), ()> { - let c_sql = CString::new(sql).map_err(|_| ())?; - let result = unsafe { ffi::conn_query(self.handle.as_ptr(), c_sql.as_ptr()) }; - if result == 0 { Ok(()) } else { Err(()) } - } -} - -impl Drop for Connection<'_> { - fn drop(&mut self) { - unsafe { ffi::conn_close(self.handle.as_ptr()); } - } -} -``` - -## Pattern 4: Error Handling Across FFI - -```rust -use std::os::raw::c_int; - -// Error codes for C -pub const SUCCESS: c_int = 0; -pub const ERR_NULL_PTR: c_int = 1; -pub const ERR_INVALID_UTF8: c_int = 2; -pub const ERR_IO: c_int = 3; -pub const ERR_PANIC: c_int = -1; - -// Thread-local error storage -thread_local! { - static LAST_ERROR: std::cell::RefCell<Option<Box<dyn std::error::Error>>> = - std::cell::RefCell::new(None); -} - -fn set_last_error<E: std::error::Error + 'static>(err: E) { - LAST_ERROR.with(|e| { - *e.borrow_mut() = Some(Box::new(err)); - }); -} - -/// Get the last error message. Caller must free with `free_string`. -#[no_mangle] -pub extern "C" fn get_last_error() -> *mut c_char { - LAST_ERROR.with(|e| { - e.borrow() - .as_ref() - .map(|err| { - CString::new(err.to_string()) - .unwrap_or_else(|_| CString::new("error").unwrap()) - .into_raw() - }) - .unwrap_or(std::ptr::null_mut()) - }) -} - -/// Free a string returned by this library.
-#[no_mangle] -pub extern "C" fn free_string(s: *mut c_char) { - if !s.is_null() { - // SAFETY: String was created by CString::into_raw - unsafe { drop(CString::from_raw(s)); } - } -} - -/// Example function with proper error handling. -#[no_mangle] -pub extern "C" fn do_operation(data: *const u8, len: usize) -> c_int { - let result = catch_unwind(AssertUnwindSafe(|| -> Result<(), c_int> { - if data.is_null() { - return Err(ERR_NULL_PTR); - } - - let slice = unsafe { std::slice::from_raw_parts(data, len) }; - - std::str::from_utf8(slice) - .map_err(|e| { - set_last_error(e); - ERR_INVALID_UTF8 - })?; - - // Do actual work... - - Ok(()) - })); - - match result { - Ok(Ok(())) => SUCCESS, - Ok(Err(code)) => code, - Err(_) => ERR_PANIC, - } -} -``` - -## Pattern 5: Struct with C Layout - -```rust -use std::os::raw::{c_char, c_int}; - -/// A C-compatible configuration struct. -#[repr(C)] -pub struct Config { - pub version: c_int, - pub flags: u32, - pub name: [c_char; 64], - pub name_len: usize, -} - -impl Config { - pub fn new(version: i32, flags: u32, name: &str) -> Option<Self> { - if name.len() >= 64 { - return None; - } - - let mut config = Self { - version: version as c_int, - flags, - name: [0; 64], - name_len: name.len(), - }; - - // Copy name bytes - for (i, byte) in name.bytes().enumerate() { - config.name[i] = byte as c_char; - } - - Some(config) - } - - pub fn name(&self) -> &str { - let bytes = unsafe { - std::slice::from_raw_parts( - self.name.as_ptr() as *const u8, - self.name_len, - ) - }; - // SAFETY: We only store valid UTF-8 in new() - unsafe { std::str::from_utf8_unchecked(bytes) } - } -} - -// Verify layout at compile time -const _: () = { - assert!(std::mem::size_of::<Config>() == 80); // 4 + 4 + 64 + 8 - assert!(std::mem::align_of::<Config>() == 8); -}; -``` - -## Key FFI Guidelines - -1. **Always use `#[repr(C)]`** for types crossing FFI -2. **Handle null pointers** at the boundary -3. **Catch panics** before returning to C -4. **Document ownership** clearly -5.
**Use opaque types** for type safety -6. **Keep unsafe minimal** and well-documented diff --git a/.claude/skills/unsafe-checker/examples/safe-abstraction.md b/.claude/skills/unsafe-checker/examples/safe-abstraction.md deleted file mode 100644 index 17e43d5f3..000000000 --- a/.claude/skills/unsafe-checker/examples/safe-abstraction.md +++ /dev/null @@ -1,272 +0,0 @@ -# Safe Abstraction Examples - -Examples of building safe APIs on top of unsafe code. - -## Example 1: Simple Wrapper with Bounds Check - -```rust -/// A slice wrapper that provides unchecked access internally -/// but safe access externally. -pub struct SafeSlice<'a, T> { - ptr: *const T, - len: usize, - _marker: std::marker::PhantomData<&'a T>, -} - -impl<'a, T> SafeSlice<'a, T> { - /// Creates a SafeSlice from a regular slice. - pub fn new(slice: &'a [T]) -> Self { - Self { - ptr: slice.as_ptr(), - len: slice.len(), - _marker: std::marker::PhantomData, - } - } - - /// Safe get - returns Option. - pub fn get(&self, index: usize) -> Option<&T> { - if index < self.len { - // SAFETY: We just verified index < len - Some(unsafe { &*self.ptr.add(index) }) - } else { - None - } - } - - /// Unsafe get - caller must ensure bounds. - /// - /// # Safety - /// `index` must be less than `self.len()`. - pub unsafe fn get_unchecked(&self, index: usize) -> &T { - debug_assert!(index < self.len); - &*self.ptr.add(index) - } - - pub fn len(&self) -> usize { - self.len - } -} -``` - -## Example 2: Resource Wrapper with Drop - -```rust -use std::ptr::NonNull; - -/// Safe wrapper around a C-allocated buffer. -pub struct CBuffer { - ptr: NonNull<u8>, - len: usize, -} - -extern "C" { - fn c_alloc(size: usize) -> *mut u8; - fn c_free(ptr: *mut u8); -} - -impl CBuffer { - /// Creates a new buffer. Returns None if allocation fails. - pub fn new(size: usize) -> Option<Self> { - let ptr = unsafe { c_alloc(size) }; - NonNull::new(ptr).map(|ptr| Self { ptr, len: size }) - } - - /// Returns a slice view of the buffer.
- pub fn as_slice(&self) -> &[u8] { - // SAFETY: ptr is valid for len bytes (from c_alloc contract) - unsafe { std::slice::from_raw_parts(self.ptr.as_ptr(), self.len) } - } - - /// Returns a mutable slice view. - pub fn as_mut_slice(&mut self) -> &mut [u8] { - // SAFETY: We have &mut self, so exclusive access - unsafe { std::slice::from_raw_parts_mut(self.ptr.as_ptr(), self.len) } - } -} - -impl Drop for CBuffer { - fn drop(&mut self) { - // SAFETY: ptr was allocated by c_alloc and not yet freed - unsafe { c_free(self.ptr.as_ptr()); } - } -} - -// Prevent double-free -impl !Clone for CBuffer {} - -// Safe to send between threads (assuming c_alloc is thread-safe) -unsafe impl Send for CBuffer {} -``` - -## Example 3: Interior Mutability with UnsafeCell - -```rust -use std::cell::UnsafeCell; -use std::sync::atomic::{AtomicBool, Ordering}; - -/// A simple spinlock demonstrating safe abstraction over UnsafeCell. -pub struct SpinLock<T> { - locked: AtomicBool, - data: UnsafeCell<T>, -} - -pub struct SpinLockGuard<'a, T> { - lock: &'a SpinLock<T>, -} - -impl<T> SpinLock<T> { - pub const fn new(data: T) -> Self { - Self { - locked: AtomicBool::new(false), - data: UnsafeCell::new(data), - } - } - - pub fn lock(&self) -> SpinLockGuard<'_, T> { - // Spin until we acquire the lock - while self.locked.compare_exchange_weak( - false, - true, - Ordering::Acquire, - Ordering::Relaxed, - ).is_err() { - std::hint::spin_loop(); - } - SpinLockGuard { lock: self } - } -} - -impl<T> std::ops::Deref for SpinLockGuard<'_, T> { - type Target = T; - - fn deref(&self) -> &T { - // SAFETY: We hold the lock, so we have exclusive access - unsafe { &*self.lock.data.get() } - } -} - -impl<T> std::ops::DerefMut for SpinLockGuard<'_, T> { - fn deref_mut(&mut self) -> &mut T { - // SAFETY: We hold the lock, so we have exclusive access - unsafe { &mut *self.lock.data.get() } - } -} - -impl<T> Drop for SpinLockGuard<'_, T> { - fn drop(&mut self) { - self.lock.locked.store(false, Ordering::Release); - } -} - -// SAFETY: The
lock ensures only one thread accesses data at a time -unsafe impl<T: Send> Sync for SpinLock<T> {} -unsafe impl<T: Send> Send for SpinLock<T> {} -``` - -## Example 4: Iterator with Lifetime Tracking - -```rust -use std::marker::PhantomData; - -/// An iterator over raw pointer range with proper lifetime tracking. -pub struct PtrIter<'a, T> { - current: *const T, - end: *const T, - _marker: PhantomData<&'a T>, -} - -impl<'a, T> PtrIter<'a, T> { - /// Creates an iterator from a slice. - pub fn new(slice: &'a [T]) -> Self { - let ptr = slice.as_ptr(); - Self { - current: ptr, - // SAFETY: Adding len to slice pointer is always valid - end: unsafe { ptr.add(slice.len()) }, - _marker: PhantomData, - } - } -} - -impl<'a, T> Iterator for PtrIter<'a, T> { - type Item = &'a T; - - fn next(&mut self) -> Option<Self::Item> { - if self.current == self.end { - None - } else { - // SAFETY: - // - current < end (checked above) - // - PhantomData<&'a T> ensures the data lives for 'a - let item = unsafe { &*self.current }; - self.current = unsafe { self.current.add(1) }; - Some(item) - } - } -} -``` - -## Example 5: Builder Pattern with Delayed Initialization - -```rust -use std::mem::MaybeUninit; - -/// A builder that collects exactly N items, then produces an array.
-pub struct ArrayBuilder<T, const N: usize> { - data: [MaybeUninit<T>; N], - count: usize, -} - -impl<T, const N: usize> ArrayBuilder<T, N> { - pub fn new() -> Self { - Self { - // SAFETY: An array of MaybeUninit<T> is valid without initialization - data: unsafe { MaybeUninit::uninit().assume_init() }, - count: 0, - } - } - - pub fn push(&mut self, value: T) -> Result<(), T> { - if self.count >= N { - return Err(value); - } - self.data[self.count].write(value); - self.count += 1; - Ok(()) - } - - pub fn build(self) -> Option<[T; N]> { - if self.count != N { - return None; - } - - // SAFETY: All N elements have been initialized - let result = unsafe { - // Prevent drop of self.data (we're moving out) - let data = std::ptr::read(&self.data); - std::mem::forget(self); - // Transmute MaybeUninit array to initialized array - std::mem::transmute_copy::<[MaybeUninit<T>; N], [T; N]>(&data) - }; - Some(result) - } -} - -impl<T, const N: usize> Drop for ArrayBuilder<T, N> { - fn drop(&mut self) { - // Drop only initialized elements - for i in 0..self.count { - // SAFETY: Elements 0..count are initialized - unsafe { self.data[i].assume_init_drop(); } - } - } -} -``` - -## Key Patterns - -1. **Encapsulation**: Hide unsafe behind safe public API -2. **Invariant maintenance**: Use private fields to maintain invariants -3. **PhantomData**: Track lifetimes and ownership for pointers -4. **RAII**: Use Drop for cleanup -5.
**Type state**: Use types to encode valid states diff --git a/.claude/skills/unsafe-checker/metadata.json b/.claude/skills/unsafe-checker/metadata.json deleted file mode 100644 index a028f78a4..000000000 --- a/.claude/skills/unsafe-checker/metadata.json +++ /dev/null @@ -1,17 +0,0 @@ -{ - "name": "unsafe-checker", - "version": "1.0.0", - "description": "Unsafe Rust code review and safety abstraction checker", - "source": "https://github.com/Rust-Coding-Guidelines/rust-coding-guidelines-zh", - "lastUpdated": "2026-01-16", - "ruleCount": 47, - "sections": [ - { "id": "general", "name": "General Principles", "count": 3 }, - { "id": "safety", "name": "Safety Abstraction", "count": 11 }, - { "id": "ptr", "name": "Raw Pointers", "count": 6 }, - { "id": "union", "name": "Union", "count": 2 }, - { "id": "mem", "name": "Memory Layout", "count": 6 }, - { "id": "ffi", "name": "FFI", "count": 18 }, - { "id": "io", "name": "I/O Safety", "count": 1 } - ] -} diff --git a/.claude/skills/unsafe-checker/rules/_sections.md b/.claude/skills/unsafe-checker/rules/_sections.md deleted file mode 100644 index a6b714184..000000000 --- a/.claude/skills/unsafe-checker/rules/_sections.md +++ /dev/null @@ -1,77 +0,0 @@ -# Unsafe Checker - Section Definitions - -## Section Overview - -| # | Section | Prefix | Level | Count | Impact | -|---|---------|--------|-------|-------|--------| -| 1 | General Principles | `general-` | CRITICAL | 3 | Foundational unsafe usage guidance | -| 2 | Safety Abstraction | `safety-` | CRITICAL | 11 | Building sound safe APIs | -| 3 | Raw Pointers | `ptr-` | HIGH | 6 | Pointer manipulation safety | -| 4 | Union | `union-` | HIGH | 2 | Union type safety | -| 5 | Memory Layout | `mem-` | HIGH | 6 | Data representation correctness | -| 6 | FFI | `ffi-` | CRITICAL | 18 | C interoperability safety | -| 7 | I/O Safety | `io-` | MEDIUM | 1 | Handle/resource safety | - -## Section Details - -### 1. 
General Principles (`general-`) - -**Focus**: When and why to use unsafe - -- P.UNS.01: Don't abuse unsafe to escape borrow checker -- P.UNS.02: Don't use unsafe blindly for performance -- G.UNS.01: Don't create aliases for "unsafe" named items - -### 2. Safety Abstraction (`safety-`) - -**Focus**: Building sound safe abstractions over unsafe code - -Key invariants: -- Panic safety -- Memory initialization -- Send/Sync correctness -- API soundness - -### 3. Raw Pointers (`ptr-`) - -**Focus**: Safe pointer manipulation patterns - -- Aliasing rules -- Alignment requirements -- Null/dangling prevention -- Type casting - -### 4. Union (`union-`) - -**Focus**: Safe union usage (primarily for C interop) - -- Initialization rules -- Lifetime considerations -- Type punning dangers - -### 5. Memory Layout (`mem-`) - -**Focus**: Correct data representation - -- `#[repr(C)]` usage -- Alignment and padding -- Uninitialized memory -- Cross-process memory - -### 6. FFI (`ffi-`) - -**Focus**: Safe C interoperability - -Subcategories: -- String handling (CString, CStr) -- Type compatibility -- Error handling across FFI -- Thread safety -- Resource management - -### 7. I/O Safety (`io-`) - -**Focus**: Handle and resource ownership - -- Raw file descriptor safety -- Handle validity guarantees diff --git a/.claude/skills/unsafe-checker/rules/_template.md b/.claude/skills/unsafe-checker/rules/_template.md deleted file mode 100644 index 64628bedb..000000000 --- a/.claude/skills/unsafe-checker/rules/_template.md +++ /dev/null @@ -1,53 +0,0 @@ -# Rule Template - -Use this template for all unsafe-checker rules. - ---- - -```markdown ---- -id: {prefix}-{number} -original_id: P.UNS.XXX.YY or G.UNS.XXX.YY -level: P|G -impact: CRITICAL|HIGH|MEDIUM -clippy: (if applicable) ---- - -# {Rule Title} - -## Summary - -One-sentence description of what this rule requires. - -## Rationale - -Why this rule matters for safety/soundness. 
- -## Bad Example - -```rust -// DON'T: Description of the anti-pattern - -``` - -## Good Example - -```rust -// DO: Description of the correct pattern - -``` - -## Common Violations - -1. Violation pattern 1 -2. Violation pattern 2 - -## Checklist - -- [ ] Check item 1 -- [ ] Check item 2 - -## Related Rules - -- `{other-rule-id}`: Brief description -``` diff --git a/.claude/skills/unsafe-checker/rules/ffi-01-no-string-direct.md b/.claude/skills/unsafe-checker/rules/ffi-01-no-string-direct.md deleted file mode 100644 index 2f207661b..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-01-no-string-direct.md +++ /dev/null @@ -1,122 +0,0 @@ ---- -id: ffi-01 -original_id: P.UNS.FFI.01 -level: P -impact: HIGH ---- - -# Avoid Passing Strings Directly to C from Public Rust API - -## Summary - -Use `CString` and `CStr` for string handling at FFI boundaries. Never pass Rust `String` or `&str` directly to C. - -## Rationale - -- Rust strings are UTF-8, not null-terminated -- C strings require null terminator -- Rust strings may contain interior null bytes -- Memory layout differs between Rust String and C char* - -## Bad Example - -```rust -extern "C" { - fn c_print(s: *const u8); - fn c_strlen(s: *const u8) -> usize; -} - -// DON'T: Pass Rust string directly -fn bad_print(s: &str) { - unsafe { - c_print(s.as_ptr()); // Not null-terminated! - } -} - -// DON'T: Assume length matches -fn bad_strlen(s: &str) -> usize { - unsafe { - c_strlen(s.as_ptr()) // May read past buffer - } -} - -// DON'T: Use String in FFI signatures -extern "C" fn bad_callback(s: String) { // Wrong! 
- println!("{}", s); -} -``` - -## Good Example - -```rust -use std::ffi::{CString, CStr}; -use std::os::raw::c_char; - -extern "C" { - fn c_print(s: *const c_char); - fn c_strlen(s: *const c_char) -> usize; - fn c_get_string() -> *const c_char; -} - -// DO: Convert to CString for passing to C -fn good_print(s: &str) -> Result<(), std::ffi::NulError> { - let c_string = CString::new(s)?; // Adds null terminator, checks for interior nulls - unsafe { - c_print(c_string.as_ptr()); - } - Ok(()) -} - -// DO: Use CStr for receiving C strings -fn good_receive() -> String { - unsafe { - let ptr = c_get_string(); - let c_str = CStr::from_ptr(ptr); - c_str.to_string_lossy().into_owned() - } -} - -// DO: Handle interior null bytes -fn handle_nulls(s: &str) { - match CString::new(s) { - Ok(c_string) => unsafe { c_print(c_string.as_ptr()) }, - Err(e) => { - // String contains interior null at position e.nul_position() - eprintln!("String contains null byte at {}", e.nul_position()); - } - } -} - -// DO: Use proper types in callbacks -extern "C" fn good_callback(s: *const c_char) { - if !s.is_null() { - let c_str = unsafe { CStr::from_ptr(s) }; - if let Ok(rust_str) = c_str.to_str() { - println!("{}", rust_str); - } - } -} -``` - -## String Type Comparison - -| Type | Null-terminated | Encoding | Use | -|------|-----------------|----------|-----| -| `String` | No | UTF-8 | Rust owned | -| `&str` | No | UTF-8 | Rust borrowed | -| `CString` | Yes | Byte | Rust-to-C owned | -| `&CStr` | Yes | Byte | Rust-to-C borrowed | -| `*const c_char` | Yes | Byte | FFI pointer | -| `OsString` | Platform | Platform | Paths, env | - -## Checklist - -- [ ] Am I passing Rust strings to C? → Use CString -- [ ] Am I receiving C strings? → Use CStr -- [ ] Does my string contain null bytes? → Handle NulError -- [ ] Am I checking for null pointers from C? 
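The checklist above can be exercised end to end in a small sketch. Assumptions: `fake_c_strlen` and `rust_strlen` are hypothetical names. The "C" side is simulated with a Rust `unsafe extern "C"` function, so the example runs without linking a C library.

```rust
use std::ffi::{CStr, CString, NulError};
use std::os::raw::c_char;

/// Hypothetical stand-in for a C function such as `size_t strlen(const char *)`.
///
/// # Safety
/// `s` must be NULL or point to a live NUL-terminated buffer.
unsafe extern "C" fn fake_c_strlen(s: *const c_char) -> usize {
    if s.is_null() {
        return 0; // treat NULL as empty instead of dereferencing it
    }
    // SAFETY: non-null and NUL-terminated per the caller contract.
    unsafe { CStr::from_ptr(s) }.to_bytes().len()
}

/// Safe wrapper: builds an owned, NUL-terminated copy before crossing the boundary.
pub fn rust_strlen(s: &str) -> Result<usize, NulError> {
    let c_string = CString::new(s)?; // fails on interior NUL bytes
    // SAFETY: `c_string` is alive for the whole call, so the pointer cannot dangle.
    Ok(unsafe { fake_c_strlen(c_string.as_ptr()) })
}
```

The wrapper keeps the `CString` in a named binding. Because the owner outlives the call, the pointer passed across the boundary stays valid.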
- -## Related Rules - -- `ffi-02`: Read documentation for std::ffi types -- `ffi-06`: Ensure C-ABI string compatibility diff --git a/.claude/skills/unsafe-checker/rules/ffi-02-read-ffi-docs.md b/.claude/skills/unsafe-checker/rules/ffi-02-read-ffi-docs.md deleted file mode 100644 index 43d95f564..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-02-read-ffi-docs.md +++ /dev/null @@ -1,133 +0,0 @@ ---- -id: ffi-02 -original_id: P.UNS.FFI.02 -level: P -impact: MEDIUM ---- - -# Read Documentation Carefully When Using std::ffi Types - -## Summary - -The `std::ffi` module has many types with subtle differences. Read their documentation carefully to avoid misuse. - -## Key Types in std::ffi - -### CString vs CStr - -```rust -use std::ffi::{CString, CStr}; -use std::os::raw::c_char; - -// CString: Owned, heap-allocated, null-terminated -// - Use when creating strings to pass to C -// - Owns the memory -let owned = CString::new("hello").unwrap(); -let ptr: *const c_char = owned.as_ptr(); -// ptr valid until `owned` is dropped - -// CStr: Borrowed, null-terminated -// - Use when receiving strings from C -// - Does not own memory -let borrowed: &CStr = unsafe { CStr::from_ptr(ptr) }; -// borrowed valid as long as ptr is valid -``` - -### OsString vs OsStr - -```rust -use std::ffi::{OsString, OsStr}; -use std::path::Path; - -// OsString/OsStr: Platform-native strings -// - Windows: potentially ill-formed UTF-16 -// - Unix: arbitrary bytes -// - Use for paths and environment variables - -let path = Path::new("/some/path"); -let os_str: &OsStr = path.as_os_str(); - -// Convert to Rust string (may fail) -if let Some(s) = os_str.to_str() { - println!("Valid UTF-8: {}", s); -} -``` - -### c_void and Opaque Types - -```rust -use std::ffi::c_void; - -extern "C" { - fn get_handle() -> *mut c_void; - fn use_handle(h: *mut c_void); -} - -// c_void is for truly opaque pointers -// Better: use dedicated opaque types (see ffi-17) -``` - -## Common Pitfalls - -```rust -use 
std::ffi::CString; - -// PITFALL 1: CString::as_ptr() lifetime -fn bad_ptr() -> *const i8 { - let s = CString::new("hello").unwrap(); - s.as_ptr() // Dangling! s dropped at end of function -} - -fn good_ptr(s: &CString) -> *const i8 { - s.as_ptr() // OK: s outlives the pointer -} - -// PITFALL 2: CString::new with interior nulls -let result = CString::new("hello\0world"); -assert!(result.is_err()); // Interior null! - -// PITFALL 3: CStr::from_ptr safety -unsafe { - let ptr: *const i8 = std::ptr::null(); - // let cstr = CStr::from_ptr(ptr); // UB: null pointer! - - // Always check for null first - if !ptr.is_null() { - let cstr = CStr::from_ptr(ptr); - } -} - -// PITFALL 4: CStr assumes valid null-terminated string -unsafe { - let bytes = [104, 101, 108, 108, 111]; // "hello" without null - let ptr = bytes.as_ptr() as *const i8; - // let cstr = CStr::from_ptr(ptr); // UB: no null terminator! - - // Use from_bytes_with_nul instead - let bytes_with_nul = b"hello\0"; - let cstr = CStr::from_bytes_with_nul(bytes_with_nul).unwrap(); -} -``` - -## Type Selection Guide - -| Scenario | Type | -|----------|------| -| Create string for C | `CString` | -| Borrow string from C | `&CStr` | -| File paths | `OsString`, `Path` | -| Environment variables | `OsString` | -| Opaque C pointers | Newtype over `*mut c_void` | -| C integers | `c_int`, `c_long`, etc. | - -## Checklist - -- [ ] Have I read the docs for the std::ffi type I'm using? -- [ ] Am I aware of the lifetime constraints? -- [ ] Am I handling potential errors (NulError, UTF-8 errors)? -- [ ] Is there a better type for my use case? 
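Two of the pitfalls above can be avoided structurally, as in this sketch (the function names are hypothetical):

- Return the owning `CString` instead of a raw pointer, so the caller controls how long the buffer lives.
- Use `CStr::from_bytes_until_nul` (stable since Rust 1.69) for fixed-size buffers that may hold bytes after the terminator.

```rust
use std::ffi::{CStr, CString, NulError};

/// Returns the owning CString rather than a raw pointer, so the caller keeps
/// the buffer alive for as long as any `as_ptr()` result is in use.
pub fn make_c_string(s: &str) -> Result<CString, NulError> {
    CString::new(s)
}

/// Reads a fixed-size C `char` array that may hold bytes after the first NUL.
/// Returns None if the buffer contains no NUL terminator at all.
pub fn from_c_buffer(buf: &[u8]) -> Option<&CStr> {
    CStr::from_bytes_until_nul(buf).ok()
}
```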
- -## Related Rules - -- `ffi-01`: Use CString/CStr for strings -- `ffi-17`: Use opaque types instead of c_void diff --git a/.claude/skills/unsafe-checker/rules/ffi-03-drop-for-c-ptr.md b/.claude/skills/unsafe-checker/rules/ffi-03-drop-for-c-ptr.md deleted file mode 100644 index 952850f88..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-03-drop-for-c-ptr.md +++ /dev/null @@ -1,162 +0,0 @@ ---- -id: ffi-03 -original_id: P.UNS.FFI.03 -level: P -impact: CRITICAL ---- - -# Implement Drop for Rust Types Wrapping Memory-Managing C Pointers - -## Summary - -When wrapping a C pointer that owns memory, implement `Drop` to call the appropriate C deallocation function. - -## Rationale - -- C allocated memory must be freed with the matching C function -- Rust's default drop won't clean up foreign memory -- Resource leaks and double-frees are common FFI bugs - -## Bad Example - -```rust -extern "C" { - fn create_resource() -> *mut Resource; - fn free_resource(r: *mut Resource); -} - -// DON'T: Wrapper without Drop -struct ResourceHandle { - ptr: *mut Resource, -} - -impl ResourceHandle { - fn new() -> Self { - Self { - ptr: unsafe { create_resource() } - } - } - // Memory leak! ptr is never freed -} - -// DON'T: Forget to handle null -impl Drop for BadHandle { - fn drop(&mut self) { - unsafe { - free_resource(self.ptr); // Crash if ptr is null! 
- } - } -} -``` - -## Good Example - -```rust -use std::ptr::NonNull; - -extern "C" { - fn create_resource() -> *mut Resource; - fn free_resource(r: *mut Resource); -} - -// DO: Proper wrapper with Drop -struct ResourceHandle { - ptr: NonNull<Resource>, -} - -impl ResourceHandle { - fn new() -> Option<Self> { - let ptr = unsafe { create_resource() }; - NonNull::new(ptr).map(|ptr| Self { ptr }) - } - - fn as_ptr(&self) -> *mut Resource { - self.ptr.as_ptr() - } -} - -impl Drop for ResourceHandle { - fn drop(&mut self) { - // SAFETY: ptr was allocated by create_resource - // and hasn't been freed yet - unsafe { - free_resource(self.ptr.as_ptr()); - } - } -} - -// No Clone/Copy impl on purpose: an accidental copy would cause a double free. -// (`impl !Clone` is not valid on stable Rust; simply not deriving Clone is enough.) - -// DO: Document ownership transfer -impl ResourceHandle { - /// Consumes the handle and returns the raw pointer. - /// - /// The caller is responsible for freeing the resource. - fn into_raw(self) -> *mut Resource { - let ptr = self.ptr.as_ptr(); - std::mem::forget(self); // Don't run Drop - ptr - } - - /// Creates a handle from a raw pointer. - /// - /// # Safety - /// - /// ptr must have been allocated by create_resource() - /// and not yet freed.
- unsafe fn from_raw(ptr: *mut Resource) -> Option { - NonNull::new(ptr).map(|ptr| Self { ptr }) - } -} -``` - -## Complete Pattern with Multiple Resources - -```rust -struct Connection { - handle: NonNull, -} - -struct Statement<'conn> { - handle: NonNull, - _conn: std::marker::PhantomData<&'conn Connection>, -} - -impl Connection { - fn prepare(&self, sql: &str) -> Option> { - let handle = unsafe { db_prepare(self.handle.as_ptr(), sql.as_ptr()) }; - NonNull::new(handle).map(|handle| Statement { - handle, - _conn: std::marker::PhantomData, - }) - } -} - -impl Drop for Connection { - fn drop(&mut self) { - // Statements must be dropped before Connection - // PhantomData ensures this at compile time - unsafe { db_close(self.handle.as_ptr()); } - } -} - -impl Drop for Statement<'_> { - fn drop(&mut self) { - unsafe { db_finalize(self.handle.as_ptr()); } - } -} -``` - -## Checklist - -- [ ] Does my wrapper own the C resource? -- [ ] Did I implement Drop with the correct C free function? -- [ ] Did I handle null pointers? -- [ ] Did I prevent Clone/Copy to avoid double-free? -- [ ] Did I consider ownership transfer methods (into_raw/from_raw)? - -## Related Rules - -- `mem-03`: Don't let String/Vec drop foreign memory -- `ffi-07`: Don't implement Drop for types passed to external code diff --git a/.claude/skills/unsafe-checker/rules/ffi-04-panic-boundary.md b/.claude/skills/unsafe-checker/rules/ffi-04-panic-boundary.md deleted file mode 100644 index 82a50fa86..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-04-panic-boundary.md +++ /dev/null @@ -1,145 +0,0 @@ ---- -id: ffi-04 -original_id: P.UNS.FFI.04 -level: P -impact: CRITICAL -clippy: panic_in_result_fn ---- - -# Handle Panics When Crossing FFI Boundaries - -## Summary - -Panics must not unwind across FFI boundaries. Use `catch_unwind` or mark functions as `extern "C-unwind"`. 
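A compact way to apply this rule is a single guard helper that every exported function routes through. In this sketch, `ffi_guard`, `checked_div`, and the `ERR_PANIC` value are hypothetical. It assumes the default `panic = "unwind"` strategy, because `catch_unwind` cannot intercept panics under `panic = "abort"`. On Rust 1.81 and later, a panic escaping an `extern "C"` function aborts the process, so the guard is what turns a panic into a recoverable error code.

```rust
use std::os::raw::c_int;
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical error code returned when a panic is caught.
pub const ERR_PANIC: c_int = -99;

/// Runs `f` and converts any panic into `ERR_PANIC`, so no unwind
/// ever leaves the `extern "C"` function that calls this helper.
fn ffi_guard<F: FnOnce() -> c_int>(f: F) -> c_int {
    panic::catch_unwind(AssertUnwindSafe(f)).unwrap_or(ERR_PANIC)
}

/// Example entry point a C caller could use: integer division.
pub extern "C" fn checked_div(a: c_int, b: c_int) -> c_int {
    // Division by zero panics in Rust; the guard turns it into an error code.
    ffi_guard(|| a / b)
}
```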
- -## Rationale - -- Unwinding across C code is undefined behavior -- C has no concept of Rust panics -- Can corrupt C stack frames and cause crashes -- Even with `panic=abort`, still UB to attempt unwinding in `extern "C"` - -## Bad Example - -```rust -// DON'T: Allow panics to escape to C -#[no_mangle] -pub extern "C" fn callback(data: *const u8, len: usize) -> i32 { - let slice = unsafe { std::slice::from_raw_parts(data, len) }; - - // If this panics, UB occurs! - let sum: i32 = slice.iter().map(|&x| x as i32).sum(); - - // If this panics due to overflow in debug, UB! - process(sum) -} - -// DON'T: Unwrap in extern functions -#[no_mangle] -pub extern "C" fn parse_config(path: *const c_char) -> i32 { - let path = unsafe { CStr::from_ptr(path) }; - let config = std::fs::read_to_string(path.to_str().unwrap()).unwrap(); // Can panic! - 0 -} -``` - -## Good Example - -```rust -use std::panic::{catch_unwind, AssertUnwindSafe}; -use std::ffi::CStr; -use std::os::raw::{c_char, c_int}; - -// DO: Catch panics at FFI boundary -#[no_mangle] -pub extern "C" fn safe_callback(data: *const u8, len: usize) -> c_int { - let result = catch_unwind(AssertUnwindSafe(|| { - if data.is_null() || len == 0 { - return -1; - } - - let slice = unsafe { std::slice::from_raw_parts(data, len) }; - let sum: i32 = slice.iter().map(|&x| x as i32).sum(); - sum - })); - - match result { - Ok(value) => value, - Err(_) => { - // Log error, return error code - eprintln!("Panic caught at FFI boundary"); - -1 - } - } -} - -// DO: Use Result-based API internally -#[no_mangle] -pub extern "C" fn parse_config(path: *const c_char) -> c_int { - let result = catch_unwind(AssertUnwindSafe(|| -> Result<(), Box> { - let path = unsafe { CStr::from_ptr(path) }.to_str()?; - let _config = std::fs::read_to_string(path)?; - Ok(()) - })); - - match result { - Ok(Ok(())) => 0, - Ok(Err(e)) => { - eprintln!("Error: {}", e); - -1 - } - Err(_) => { - eprintln!("Panic in parse_config"); - -2 - } - } -} - -// DO: For 
Rust-calling-Rust across C, use "C-unwind" -#[no_mangle] -pub extern "C-unwind" fn rust_callback_can_unwind() { - // This is OK to panic if called from Rust through C - // The "C-unwind" ABI allows unwinding - panic!("This is allowed"); -} -``` - -## FFI Error Handling Pattern - -```rust -use std::ffi::CString; -use std::os::raw::{c_char, c_int}; - -// Define error codes -const SUCCESS: c_int = 0; -const ERR_NULL_PTR: c_int = -1; -const ERR_INVALID_UTF8: c_int = -2; -const ERR_IO: c_int = -3; -const ERR_PANIC: c_int = -99; - -// Thread-local for detailed error, stored as CString so C receives a NUL-terminated pointer -thread_local! { - static LAST_ERROR: std::cell::RefCell<Option<CString>> = std::cell::RefCell::new(None); -} - -fn set_error(msg: String) { - // Interior NUL bytes cannot appear in a C string, so strip them - let msg = CString::new(msg.replace('\0', "")).unwrap_or_default(); - LAST_ERROR.with(|e| *e.borrow_mut() = Some(msg)); -} - -/// Returns the last error message, or NULL if there is none. -/// The pointer stays valid until the next set_error call on the same thread. -#[no_mangle] -pub extern "C" fn get_last_error() -> *const c_char { - LAST_ERROR.with(|e| { - e.borrow().as_ref().map(|s| s.as_ptr()) - .unwrap_or(std::ptr::null()) - }) -} -``` - -## Checklist - -- [ ] Does my extern "C" function use catch_unwind? -- [ ] Am I avoiding unwrap/expect in FFI functions? -- [ ] Do I return error codes for error conditions? -- [ ] Have I considered using "C-unwind" for Rust-to-Rust through C? - -## Related Rules - -- `ffi-08`: Handle errors properly in FFI -- `safety-01`: Panic safety diff --git a/.claude/skills/unsafe-checker/rules/ffi-05-portable-types.md b/.claude/skills/unsafe-checker/rules/ffi-05-portable-types.md deleted file mode 100644 index 17e5fb631..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-05-portable-types.md +++ /dev/null @@ -1,113 +0,0 @@ ---- -id: ffi-05 -original_id: P.UNS.FFI.05 -level: P -impact: HIGH ---- - -# Use Portable Type Aliases from std or libc - -## Summary - -Use type aliases from `std::os::raw` or the `libc` crate for C-compatible types. Don't assume sizes of C types.
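The sketch below uses the aliases in a `#[repr(C)]` struct and checks sizes on the current target. The `Header` struct and `c_type_sizes` helper are hypothetical. The only size that is the same everywhere is `c_char`, which is one byte. `c_long` is 8 bytes on 64-bit Unix (LP64) but 4 bytes on Windows (LLP64).

```rust
use std::mem::size_of;
use std::os::raw::{c_char, c_int, c_long, c_void};

/// Struct built only from C-compatible aliases, so its layout follows
/// the platform C ABI instead of Rust's own field ordering.
#[repr(C)]
pub struct Header {
    pub count: c_int,
    pub size: c_long,
    pub data: *mut c_void,
}

/// Returns (sizeof(char), sizeof(long)) as this target's C compiler sees them.
pub fn c_type_sizes() -> (usize, usize) {
    (size_of::<c_char>(), size_of::<c_long>())
}
```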
- -## Rationale - -- C types have platform-dependent sizes (`int` is not always 32 bits) -- `long` is 32 bits on Windows, 64 bits on Unix -- Using Rust primitives directly causes portability bugs - -## Bad Example - -```rust -// DON'T: Use Rust types directly for C interop -extern "C" { - fn c_function(x: i32, y: i64) -> i32; // Might not match C types! -} - -// DON'T: Assume sizes -#[repr(C)] -struct BadStruct { - count: i32, // C 'int' might not be 32 bits - size: i64, // C 'long' varies by platform! - ptr: usize, // size_t? intptr_t? Different! -} -``` - -## Good Example - -```rust -use std::os::raw::{c_int, c_long, c_char, c_void}; - -// DO: Use std::os::raw types -extern "C" { - fn c_function(x: c_int, y: c_long) -> c_int; -} - -// DO: Use libc for more types -use libc::{size_t, ssize_t, off_t, pid_t, time_t}; - -extern "C" { - fn read(fd: c_int, buf: *mut c_void, count: size_t) -> ssize_t; - fn lseek(fd: c_int, offset: off_t, whence: c_int) -> off_t; - fn getpid() -> pid_t; -} - -// DO: Match C struct layout -#[repr(C)] -struct GoodStruct { - count: c_int, - size: c_long, - data: *mut c_void, -} - -// DO: Use isize/usize for pointer-sized integers -#[repr(C)] -struct PointerSized { - offset: isize, // intptr_t equivalent - size: usize, // size_t in pointer arithmetic -} -``` - -## Type Mapping Reference - -| C Type | Rust Type | Notes | -|--------|-----------|-------| -| `char` | `c_char` | May be signed or unsigned! | -| `signed char` | `i8` | | -| `unsigned char` | `u8` | | -| `short` | `c_short` | Usually i16 | -| `int` | `c_int` | Usually i32 | -| `long` | `c_long` | 32 or 64 bits! 
| -| `long long` | `c_longlong` | Usually i64 | -| `size_t` | `usize` or `libc::size_t` | | -| `ssize_t` | `isize` or `libc::ssize_t` | | -| `float` | `c_float` / `f32` | | -| `double` | `c_double` / `f64` | | -| `void*` | `*mut c_void` | | -| `const void*` | `*const c_void` | | - -## Platform Differences - -```rust -#[cfg(target_pointer_width = "64")] -type PtrDiff = i64; - -#[cfg(target_pointer_width = "32")] -type PtrDiff = i32; - -// Better: use isize -let diff: isize = ptr1 as isize - ptr2 as isize; -``` - -## Checklist - -- [ ] Am I using std::os::raw or libc types for FFI? -- [ ] Have I avoided assuming c_long is 64 bits? -- [ ] Am I using size_t/usize for sizes? -- [ ] Have I tested on multiple platforms? - -## Related Rules - -- `ffi-13`: Ensure consistent data layout -- `ffi-14`: Types in FFI should have stable layout diff --git a/.claude/skills/unsafe-checker/rules/ffi-06-string-abi.md b/.claude/skills/unsafe-checker/rules/ffi-06-string-abi.md deleted file mode 100644 index 3e154e0f3..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-06-string-abi.md +++ /dev/null @@ -1,151 +0,0 @@ ---- -id: ffi-06 -original_id: P.UNS.FFI.06 -level: P -impact: HIGH ---- - -# Ensure C-ABI Compatibility for Strings Between Rust and C - -## Summary - -When passing strings across FFI, ensure both sides agree on encoding, null-termination, and memory ownership. 
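An ownership-transfer round trip can be sketched entirely in Rust. Here `make_greeting` plays the exporting side and `release_string` is the matching deallocator (both names are hypothetical). The design choice to note is that memory produced by `CString::into_raw` must come back to Rust to be freed, never to C's `free`, because the two sides may use different allocators.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Exports a heap string; ownership passes to the caller, who must hand it
/// back to `release_string`. Returns NULL if the text has an interior NUL.
pub extern "C" fn make_greeting() -> *mut c_char {
    CString::new("hello from Rust")
        .map(CString::into_raw)
        .unwrap_or(std::ptr::null_mut())
}

/// Takes ownership back and frees the buffer with Rust's allocator.
///
/// # Safety
/// `s` must be NULL or a pointer from `make_greeting` not yet released.
pub unsafe extern "C" fn release_string(s: *mut c_char) {
    if !s.is_null() {
        // SAFETY: guaranteed by the caller contract above.
        drop(unsafe { CString::from_raw(s) });
    }
}

/// Borrowing read: copies the C string into an owned Rust String.
///
/// # Safety
/// `s` must be NULL or point to a live NUL-terminated buffer.
pub unsafe fn read_borrowed(s: *const c_char) -> Option<String> {
    if s.is_null() {
        return None;
    }
    // SAFETY: non-null and NUL-terminated per the caller contract.
    Some(unsafe { CStr::from_ptr(s) }.to_string_lossy().into_owned())
}
```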
- -## Rationale - -- Rust strings are UTF-8, C strings are byte arrays -- C expects null termination, Rust strings don't have it -- Memory ownership must be explicit to avoid leaks/double-frees - -## String Passing Patterns - -### Rust to C (Caller Allocates) - -```rust -use std::ffi::CString; -use std::os::raw::c_char; - -extern "C" { - fn c_process_string(s: *const c_char); -} - -fn rust_to_c(s: &str) -> Result<(), std::ffi::NulError> { - let c_string = CString::new(s)?; - // c_string lives until end of scope - unsafe { - c_process_string(c_string.as_ptr()); - } - // c_string dropped here, memory freed - Ok(()) -} -``` - -### C to Rust (C Allocates, Rust Borrows) - -```rust -use std::ffi::CStr; -use std::os::raw::c_char; - -extern "C" { - fn c_get_string() -> *const c_char; -} - -fn c_to_rust() -> Option { - let ptr = unsafe { c_get_string() }; - if ptr.is_null() { - return None; - } - // Borrow from C, don't take ownership - let c_str = unsafe { CStr::from_ptr(ptr) }; - Some(c_str.to_string_lossy().into_owned()) -} -``` - -### C to Rust (Ownership Transfer) - -```rust -extern "C" { - fn c_create_string() -> *mut c_char; - fn c_free_string(s: *mut c_char); -} - -struct CAllocatedString { - ptr: *mut c_char, -} - -impl CAllocatedString { - fn new() -> Option { - let ptr = unsafe { c_create_string() }; - if ptr.is_null() { - None - } else { - Some(Self { ptr }) - } - } - - fn as_str(&self) -> &str { - let c_str = unsafe { CStr::from_ptr(self.ptr) }; - c_str.to_str().unwrap_or("") - } -} - -impl Drop for CAllocatedString { - fn drop(&mut self) { - unsafe { c_free_string(self.ptr); } - } -} -``` - -### Rust to C (Ownership Transfer) - -```rust -extern "C" { - fn c_take_ownership(s: *mut c_char); // C will free -} - -fn give_to_c(s: &str) -> Result<(), std::ffi::NulError> { - let c_string = CString::new(s)?; - let ptr = c_string.into_raw(); // Don't drop CString - - unsafe { - c_take_ownership(ptr); - // C now owns this memory - // To free it back in Rust: let _ = 
CString::from_raw(ptr); - } - Ok(()) -} -``` - -## Encoding Considerations - -```rust -// UTF-8 to platform encoding -use std::ffi::OsString; -use std::os::unix::ffi::OsStrExt; - -fn to_platform_string(s: &str) -> CString { - // On Unix, UTF-8 usually works - CString::new(s).unwrap() -} - -#[cfg(windows)] -fn to_wide_string(s: &str) -> Vec { - use std::os::windows::ffi::OsStrExt; - std::ffi::OsStr::new(s) - .encode_wide() - .chain(std::iter::once(0)) - .collect() -} -``` - -## Checklist - -- [ ] Is the string null-terminated when passed to C? -- [ ] Who allocates the memory? Who frees it? -- [ ] Is the encoding (UTF-8, ASCII, platform) documented? -- [ ] Am I handling conversion errors (interior nulls, invalid UTF-8)? - -## Related Rules - -- `ffi-01`: Use CString/CStr at FFI boundaries -- `ffi-02`: Read std::ffi documentation diff --git a/.claude/skills/unsafe-checker/rules/ffi-07-no-drop-external.md b/.claude/skills/unsafe-checker/rules/ffi-07-no-drop-external.md deleted file mode 100644 index ca46bbf4d..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-07-no-drop-external.md +++ /dev/null @@ -1,131 +0,0 @@ ---- -id: ffi-07 -original_id: P.UNS.FFI.07 -level: P -impact: HIGH ---- - -# Do Not Implement Drop for Types Passed to External Code - -## Summary - -If a type will be passed to external code that manages its lifetime, don't implement `Drop`. Otherwise, both Rust and the external code will try to free it. - -## Rationale - -- External code (C library) may take ownership of the data -- If Rust also tries to drop it, you get double-free -- Need clear ownership boundaries - -## Bad Example - -```rust -// DON'T: Drop on type that external code will free -#[repr(C)] -struct EventHandler { - callback: extern "C" fn(i32), - user_data: *mut c_void, -} - -impl Drop for EventHandler { - fn drop(&mut self) { - // BAD: What if the C library already freed user_data? 
- unsafe { libc::free(self.user_data); } - } -} - -extern "C" { - // C takes ownership and frees EventHandler when done - fn register_handler(h: *mut EventHandler); -} - -fn bad_register() { - let handler = EventHandler { /* ... */ }; - let ptr = Box::into_raw(Box::new(handler)); - unsafe { - register_handler(ptr); - // If C code frees this, and Rust's Drop runs too = double-free - } -} -``` - -## Good Example - -```rust -// DO: No Drop for types whose lifetime is managed externally -#[repr(C)] -struct EventHandler { - callback: extern "C" fn(i32), - user_data: *mut c_void, -} -// No Drop impl - C library manages lifetime - -extern "C" { - fn register_handler(h: *mut EventHandler); - fn unregister_handler(h: *mut EventHandler); -} - -// DO: Wrap in a Rust type that knows when it's safe to drop -struct RegisteredHandler { - ptr: *mut EventHandler, - registered: bool, -} - -impl RegisteredHandler { - fn register(handler: EventHandler) -> Self { - let ptr = Box::into_raw(Box::new(handler)); - unsafe { register_handler(ptr); } - Self { ptr, registered: true } - } - - fn unregister(&mut self) { - if self.registered { - unsafe { unregister_handler(self.ptr); } - self.registered = false; - } - } -} - -impl Drop for RegisteredHandler { - fn drop(&mut self) { - self.unregister(); - // Once unregistered, C no longer references ptr, so Rust can free it - if !self.registered { - unsafe { drop(Box::from_raw(self.ptr)); } - } - } -} - -// DO: Use ManuallyDrop for explicit control -use std::mem::ManuallyDrop; - -fn explicit_ownership() { - // Heap-allocate: a pointer to a stack local would dangle once this function returns - let mut handler = ManuallyDrop::new(Box::new(EventHandler { /* ... */ })); - let ptr: *mut EventHandler = &mut **handler; - unsafe { - register_handler(ptr); - // C now owns the allocation; ManuallyDrop stops Rust from freeing it. - // C must release it through a Rust-provided function, not free(). - } -} -``` - -## Ownership Patterns - -| Pattern | Who Owns | Rust Drop?
| -|---------|----------|------------| -| Rust creates, Rust frees | Rust | Yes | -| Rust creates, C frees | C | No | -| C creates, C frees | C | No (use wrapper) | -| C creates, Rust frees | Rust | Yes (in wrapper) | - -## Checklist - -- [ ] Who will free this type's memory? -- [ ] If external code frees it, am I avoiding Drop? -- [ ] If ownership is conditional, do I track it? -- [ ] Am I using ManuallyDrop or forget() when transferring ownership? - -## Related Rules - -- `ffi-03`: Implement Drop for wrapped C pointers (opposite case) -- `mem-03`: Don't let String/Vec drop foreign memory diff --git a/.claude/skills/unsafe-checker/rules/ffi-08-error-handling.md b/.claude/skills/unsafe-checker/rules/ffi-08-error-handling.md deleted file mode 100644 index 9058d8955..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-08-error-handling.md +++ /dev/null @@ -1,146 +0,0 @@ ---- -id: ffi-08 -original_id: P.UNS.FFI.08 -level: P -impact: HIGH ---- - -# Handle Errors Properly in FFI - -## Summary - -FFI functions must use C-compatible error handling (return codes, errno, out parameters). Rust's Result/Option don't cross FFI boundaries. - -## Rationale - -- C doesn't have Result or Option -- Exceptions don't exist in C -- Must use patterns C code understands - -## Bad Example - -```rust -// DON'T: Return Result across FFI -#[no_mangle] -pub extern "C" fn bad_open(path: *const c_char) -> Result { - // Result is not C-compatible! 
- unimplemented!() -} - -// DON'T: Return Option across FFI -#[no_mangle] -pub extern "C" fn bad_find(id: i32) -> Option<*mut Data> { - // Option<*mut T> is not FFI-safe: raw pointers have no null niche, so it is not a plain pointer - unimplemented!() -} -``` - -## Good Example - -```rust -use std::ffi::CStr; -use std::fs::File; -use std::os::raw::{c_char, c_int}; - -pub struct Handle { - file: File, -} - -// Error codes -const SUCCESS: c_int = 0; -const ERR_NULL_PTR: c_int = 1; -const ERR_INVALID_PATH: c_int = 2; -const ERR_FILE_NOT_FOUND: c_int = 3; -const ERR_PERMISSION: c_int = 4; -const ERR_UNKNOWN: c_int = -1; - -// DO: Return error code, output via pointer -#[no_mangle] -pub extern "C" fn open_file( - path: *const c_char, - out_handle: *mut *mut Handle -) -> c_int { - if path.is_null() || out_handle.is_null() { - return ERR_NULL_PTR; - } - - let path_str = match unsafe { CStr::from_ptr(path) }.to_str() { - Ok(s) => s, - Err(_) => return ERR_INVALID_PATH, - }; - - match File::open(path_str) { - Ok(file) => { - let handle = Box::into_raw(Box::new(Handle { file })); - unsafe { *out_handle = handle; } - SUCCESS - } - Err(e) => { - match e.kind() { - std::io::ErrorKind::NotFound => ERR_FILE_NOT_FOUND, - std::io::ErrorKind::PermissionDenied => ERR_PERMISSION, - _ => ERR_UNKNOWN, - } - } - } -} - -// DO: Use errno for POSIX-style APIs -// (__errno_location is the Linux name; macOS exposes libc::__error() instead) -#[cfg(target_os = "linux")] -#[no_mangle] -pub extern "C" fn posix_style_read( - fd: c_int, - buf: *mut u8, - count: usize -) -> isize { - if buf.is_null() { - unsafe { *libc::__errno_location() = libc::EINVAL; } - return -1; - } - - // ... do read ... - // On error: - // unsafe { *libc::__errno_location() = error_code; } - // return -1; - - count as isize -} - -// DO: Provide error message function -thread_local!
{ - static LAST_ERROR: std::cell::RefCell<Option<String>> = std::cell::RefCell::new(None); -} - -#[no_mangle] -pub extern "C" fn get_error_message(buf: *mut c_char, len: usize) -> c_int { - // A null or zero-length buffer has no room even for the NUL terminator - if buf.is_null() || len == 0 { - return ERR_NULL_PTR; - } - LAST_ERROR.with(|e| { - if let Some(msg) = e.borrow().as_ref() { - let bytes = msg.as_bytes(); - let copy_len = std::cmp::min(bytes.len(), len.saturating_sub(1)); - unsafe { - std::ptr::copy_nonoverlapping(bytes.as_ptr(), buf as *mut u8, copy_len); - *buf.add(copy_len) = 0; - } - SUCCESS - } else { - ERR_UNKNOWN - } - }) -} -``` - -## Error Handling Patterns - -| Pattern | Usage | -|---------|-------| -| Return code | Simple success/failure | -| Return code + out param | Return a value on success | -| errno | POSIX-style APIs | -| Error message function | Detailed error info | -| Last-error thread-local | Windows-style APIs (cf. `GetLastError`) | - -## Checklist - -- [ ] Am I returning C-compatible error indicators? -- [ ] Are output parameters used for return values? -- [ ] Is there a way to get detailed error info? -- [ ] Am I documenting all possible error codes? - -## Related Rules - -- `ffi-04`: Handle panics at FFI boundary -- `safety-10`: Document safety requirements diff --git a/.claude/skills/unsafe-checker/rules/ffi-09-ref-not-ptr.md b/.claude/skills/unsafe-checker/rules/ffi-09-ref-not-ptr.md deleted file mode 100644 index 732f3bd06..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-09-ref-not-ptr.md +++ /dev/null @@ -1,136 +0,0 @@ ---- -id: ffi-09 -original_id: P.UNS.FFI.09 -level: P -impact: MEDIUM ---- - -# Use References Instead of Raw Pointers When Calling Safe C Functions - -## Summary - -When wrapping C functions that don't need null pointers, use Rust references in the safe wrapper to enforce non-null at compile time.
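A minimal, self-contained sketch of this rule. It assumes two hypothetical C functions, `c_sum` and `c_norm1`, which are implemented here as Rust `unsafe extern "C" fn`s so the example runs without linking a C library. The raw pointers stay inside the FFI layer. The safe wrappers take `&[u8]` and `Option<&Point>`, so the non-null guarantee and the length are enforced by the types.

```rust
use std::os::raw::c_int;

#[repr(C)]
pub struct Point {
    pub x: c_int,
    pub y: c_int,
}

// Hypothetical stand-ins for C functions (implemented in Rust for this sketch).
unsafe extern "C" fn c_sum(data: *const u8, len: usize) -> c_int {
    // SAFETY: the caller guarantees `data` points to `len` readable bytes.
    let bytes = unsafe { std::slice::from_raw_parts(data, len) };
    bytes.iter().map(|&b| b as c_int).sum()
}

unsafe extern "C" fn c_norm1(p: *const Point) -> c_int {
    if p.is_null() {
        return 0; // this hypothetical C API treats null as "no point"
    }
    // SAFETY: non-null was checked; the caller guarantees validity and alignment.
    let p = unsafe { &*p };
    p.x.abs() + p.y.abs()
}

/// Safe wrapper: a slice is non-null, aligned, and carries its own length.
pub fn sum(data: &[u8]) -> i32 {
    unsafe { c_sum(data.as_ptr(), data.len()) }
}

/// Safe wrapper: `Option<&Point>` models the one parameter that may be null.
pub fn norm1(p: Option<&Point>) -> i32 {
    let ptr = p.map_or(std::ptr::null(), |r| r as *const Point);
    unsafe { c_norm1(ptr) }
}
```

Callers never write `unsafe`. The only places where a pointer's validity is assumed are the two `unsafe` blocks at the boundary.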
- -## Rationale - -- References guarantee non-null -- References have lifetime tracking -- Raw pointers should stay in the unsafe FFI layer -- Safe Rust API should use safe types - -## Bad Example - -```rust -extern "C" { - fn c_process(data: *const u8, len: usize); -} - -// DON'T: Expose raw pointers in safe API -pub fn process(data: *const u8, len: usize) { - // Caller might pass null or a dangling pointer! - unsafe { c_process(data, len); } -} - -// DON'T: Unsafe function when it could be safe -pub unsafe fn process_unsafe(data: *const u8, len: usize) { - // Why force caller to use unsafe? - c_process(data, len); -} -``` - -## Good Example - -```rust -use std::ffi::c_void; - -extern "C" { - fn c_process(data: *const u8, len: usize); - fn c_modify(data: *mut Data); - fn c_optional(data: *const Data); // Can be null - fn create_handle() -> *mut c_void; - fn handle_operation(handle: *mut c_void); -} - -// DO: Use slice reference for safe API -pub fn process(data: &[u8]) { - // Reference guarantees non-null - // Slice guarantees `len` valid elements - unsafe { c_process(data.as_ptr(), data.len()); } -} - -// DO: Use &mut for exclusive access -pub fn modify(data: &mut Data) { - // Mutable reference guarantees: - // - Non-null - // - Exclusive access - // - Valid for duration - unsafe { c_modify(data as *mut Data); } -} - -// DO: Use Option<&T> for nullable parameters -pub fn optional(data: Option<&Data>) { - let ptr = data.map(|d| d as *const Data).unwrap_or(std::ptr::null()); - unsafe { c_optional(ptr); } -} - -// DO: Wrap FFI types in safe Rust types -pub struct SafeHandle(*mut c_void); - -impl SafeHandle { - pub fn new() -> Option<Self> { - let ptr = unsafe { create_handle() }; - if ptr.is_null() { - None - } else { - Some(Self(ptr)) - } - } - - // Methods take &self or &mut self, not raw pointers - pub fn do_something(&self) { - unsafe { handle_operation(self.0); } - } -} -``` - -## Converting Between References and Pointers - -```rust -// Reference to pointer -fn ref_to_ptr(r: &Data) -> *const Data { - r as *const Data -} - -fn 
mut_ref_to_ptr(r: &mut Data) -> *mut Data { - r as *mut Data -} - -// 
Slice to pointer -fn slice_to_ptr(s: &[u8]) -> (*const u8, usize) { - (s.as_ptr(), s.len()) -} - -// Pointer to reference (unsafe) -unsafe fn ptr_to_ref<'a>(p: *const Data) -> &'a Data { - &*p -} - -unsafe fn ptr_to_mut<'a>(p: *mut Data) -> &'a mut Data { - &mut *p -} -``` - -## When to Use Raw Pointers - -- FFI declarations (`extern "C"`) -- Implementing the unsafe boundary layer -- When null is a valid value -- When the pointee might not be valid Rust (e.g., uninitialized) - -## Checklist - -- [ ] Can this parameter be a reference instead of a pointer? -- [ ] Am I checking for null in the unsafe layer? -- [ ] Is the safe API free of raw pointers? -- [ ] Do I use Option<&T> for nullable references? - -## Related Rules - -- `safety-06`: Don't expose raw pointers in public APIs -- `ffi-02`: Read std::ffi documentation diff --git a/.claude/skills/unsafe-checker/rules/ffi-10-thread-safety.md b/.claude/skills/unsafe-checker/rules/ffi-10-thread-safety.md deleted file mode 100644 index 733419fb5..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-10-thread-safety.md +++ /dev/null @@ -1,132 +0,0 @@ ---- -id: ffi-10 -original_id: P.UNS.FFI.10 -level: P -impact: CRITICAL ---- - -# Exported Rust Functions Must Be Designed for Thread-Safety - -## Summary - -Functions exported to C with `#[no_mangle] extern "C"` may be called from multiple threads. Ensure they are thread-safe. - -## Rationale - -- C code doesn't know about Rust's thread safety guarantees -- C may call your function from any thread -- Global state must be synchronized -- Race conditions are undefined behavior - -## Bad Example - -```rust -// DON'T: Unsynchronized global state -static mut COUNTER: i32 = 0; - -#[no_mangle] -pub extern "C" fn increment() -> i32 { - unsafe { - COUNTER += 1; // Data race if called from multiple threads! - COUNTER - } -} - -// DON'T: Thread-local assuming single thread -thread_local! 
{ - static CONFIG: RefCell<Config> = RefCell::new(Config::default()); -} - -#[no_mangle] -pub extern "C" fn set_config(value: i32) { - // Different threads get different configs! - // Is that what the C caller expects? - CONFIG.with(|c| c.borrow_mut().value = value); -} - -// DON'T: Non-Send types in globals -static mut HANDLE: Option<Rc<Data>> = None; // Rc is neither Send nor Sync, yet any thread can reach a static! -``` - -## Good Example - -```rust -use std::sync::atomic::{AtomicI32, Ordering}; -use std::sync::{Mutex, OnceLock}; - -// DO: Use atomics for simple counters -static COUNTER: AtomicI32 = AtomicI32::new(0); - -#[no_mangle] -pub extern "C" fn increment() -> i32 { - COUNTER.fetch_add(1, Ordering::SeqCst) + 1 -} - -// DO: Use Mutex for complex state -static CONFIG: OnceLock<Mutex<Config>> = OnceLock::new(); - -fn get_config() -> &'static Mutex<Config> { - CONFIG.get_or_init(|| Mutex::new(Config::default())) -} - -#[no_mangle] -pub extern "C" fn set_config_value(value: i32) -> i32 { - match get_config().lock() { - Ok(mut config) => { - config.value = value; - 0 // Success - } - Err(_) => -1 // Lock poisoned - } -} - -// DO: Document thread safety requirements -/// Initializes the library. NOT thread-safe. -/// Must be called once from main thread before any other function. -#[no_mangle] -pub extern "C" fn init() -> i32 { - // One-time initialization - 0 -} - -/// Processes data. Thread-safe. -/// May be called from multiple threads concurrently. -#[no_mangle] -pub extern "C" fn process(data: *const u8, len: usize) -> i32 { - // Uses only local state or synchronized globals - 0 -} - -// DO: Make non-thread-safe APIs explicit -/// Handle for single-threaded use only. -/// -/// # Thread Safety -/// -/// This handle must only be used from the thread that created it.
-struct SingleThreadHandle { - data: *mut Data, - _not_send: std::marker::PhantomData<*const ()>, // makes the type !Send and !Sync -} -``` - -## Synchronization Patterns - -| Pattern | Use Case | -|---------|----------| -| `AtomicT` | Simple counters, flags | -| `Mutex` | Complex shared state | -| `RwLock` | Read-heavy shared state | -| `OnceLock` | Lazy one-time init | -| `thread_local!` | Per-thread state (document!) | - -## Checklist - -- [ ] Does my exported function access global state? -- [ ] Is that state properly synchronized? -- [ ] Have I documented thread-safety guarantees? -- [ ] Are any types !Send/!Sync exposed across FFI? - -## Related Rules - -- `ptr-01`: Don't share raw pointers across threads -- `safety-05`: Send/Sync implementation safety diff --git a/.claude/skills/unsafe-checker/rules/ffi-11-packed-ub.md b/.claude/skills/unsafe-checker/rules/ffi-11-packed-ub.md deleted file mode 100644 index 61f3b87f7..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-11-packed-ub.md +++ /dev/null @@ -1,142 +0,0 @@ ---- -id: ffi-11 -original_id: P.UNS.FFI.11 -level: P -impact: HIGH -clippy: unaligned_references ---- - -# Be Careful with UB When Referencing #[repr(packed)] Struct Fields - -## Summary - -Creating a reference to a misaligned field of a `#[repr(packed)]` struct is undefined behavior. Current rustc rejects such borrows at compile time (error E0793), but a misaligned reference can still be produced in `unsafe` code, for example by dereferencing a raw field pointer. Use raw pointers and `read_unaligned`/`write_unaligned` instead. - -## Rationale - -- Packed structs have no padding, so fields may be misaligned -- References must be aligned; misaligned references are UB -- Implicit references count too: auto-ref in method calls and `ref` bindings in patterns - -## Bad Example - -```rust -#[repr(C, packed)] -struct Packet { - header: u8, - value: u32, // Misaligned! At offset 1, not 4 - data: u64, // Misaligned! At offset 5, not 8 -} - -fn bad_reference(p: &Packet) -> &u32 { - &p.value // E0793: reference to packed field is unaligned
-} - -fn bad_match(p: &Packet) { - match p.value { - ref v => println!("{v}"), // E0793: a `ref` binding borrows the packed field - } -} - -fn bad_method(p: &Packet) { - p.value.to_string(); // E0793: `to_string(&self)` auto-borrows the field -} - -fn bad_borrow(p: &mut Packet) { - let v = &mut p.value; // E0793: misaligned mutable reference - *v = 42; -} -``` - -## Good Example - -```rust -#[repr(C, packed)] -struct Packet { - header: u8, - value: u32, - data: u64, -} - -// DO: Copy out the value -fn good_read(p: &Packet) -> u32 { - p.value // Copies the value, no reference created -} - -// DO: Use addr_of! for a raw pointer (stable since Rust 1.51; `&raw const` since 1.82) -fn good_ptr_read(p: &Packet) -> u32 { - // SAFETY: the pointer comes from a valid &Packet; read_unaligned tolerates misalignment - unsafe { - std::ptr::addr_of!(p.value).read_unaligned() - } -} - -// DO: Use addr_of_mut! for writing -fn good_ptr_write(p: &mut Packet, value: u32) { - // SAFETY: the pointer comes from a valid &mut Packet; write_unaligned tolerates misalignment - unsafe { - std::ptr::addr_of_mut!(p.value).write_unaligned(value); - } -} - -// DO: Create accessor methods -impl Packet { - fn value(&self) -> u32 { - unsafe { std::ptr::addr_of!(self.value).read_unaligned() } - } - - fn set_value(&mut self, value: u32) { - unsafe { std::ptr::addr_of_mut!(self.value).write_unaligned(value); } - } - - fn data(&self) -> u64 { - unsafe { std::ptr::addr_of!(self.data).read_unaligned() } - } -} - -// DO: Consider using byte arrays + from_ne_bytes -#[repr(C, packed)] -struct PacketBytes { - header: u8, - value: [u8; 4], // Store as bytes - data: [u8; 8], -} - -impl PacketBytes { - fn value(&self) -> u32 { - u32::from_ne_bytes(self.value) // Safe, no alignment issue - } -} -``` - -## Safe Alternatives - -```rust -// Alternative 1: Don't use packed -#[repr(C)] -struct AlignedPacket { - header: u8, - _pad: [u8; 3], - value: u32, - data: u64, -} - -// Alternative 2: Use zerocopy crate -// use zerocopy::{FromBytes, IntoBytes}; // 
`IntoBytes` was named `AsBytes` before zerocopy 0.8 - -// Alternative 3: Use bytemuck -// use bytemuck::{Pod, Zeroable}; -``` - -## Checklist - -- [ ] Am I creating references to packed struct
fields? -- [ ] Am I using addr_of! / addr_of_mut! for field access? -- [ ] Am I using read_unaligned / write_unaligned? -- [ ] Would a byte array representation be safer? - -## Related Rules - -- `ptr-04`: Don't dereference misaligned pointers -- `mem-01`: Choose appropriate data layout diff --git a/.claude/skills/unsafe-checker/rules/ffi-12-invariant-doc.md b/.claude/skills/unsafe-checker/rules/ffi-12-invariant-doc.md deleted file mode 100644 index 026a65a41..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-12-invariant-doc.md +++ /dev/null @@ -1,164 +0,0 @@ ---- -id: ffi-12 -original_id: P.UNS.FFI.12 -level: P -impact: MEDIUM ---- - -# Document Invariant Assumptions for C-Provided Parameters - -## Summary - -When receiving parameters from C, document what invariants you assume (non-null, alignment, validity, lifetime) and verify them when possible. - -## Rationale - -- C doesn't enforce invariants at compile time -- Rust code needs to validate or document assumptions -- Debugging FFI bugs is hard without clear documentation - -## Bad Example - -```rust -// DON'T: Undocumented assumptions -extern "C" { - fn get_data() -> *mut Data; -} - -fn bad_use() -> &'static Data { - let ptr = unsafe { get_data() }; - // Assumes: - // - ptr is non-null (not documented) - // - ptr is aligned (not checked) - // - Data is valid (not verified) - // - Lifetime is 'static (just guessing) - unsafe { &*ptr } -} - -// DON'T: Silent assumptions in function signature -#[no_mangle] -pub extern "C" fn process(data: *const Data, len: usize) { - // What if data is null? - // What if len is wrong? - // What if data contains invalid Data? - let slice = unsafe { - std::slice::from_raw_parts(data, len) - }; -} -``` - -## Good Example - -```rust -/// Retrieves data from the C library. 
-/// -/// # Invariants Assumed from C -/// -/// - Returns a non-null pointer on success, null on failure -/// - Returned pointer is valid for the lifetime of the library -/// - Returned pointer is aligned for `Data` -/// - The `Data` struct is fully initialized -extern "C" { - fn get_data() -> *mut Data; -} - -fn documented_use() -> Option<&'static Data> { - let ptr = unsafe { get_data() }; - - // Verify what we can - if ptr.is_null() { - return None; - } - - // Document what we can't verify - // SAFETY: - // - Non-null: checked above - // - Aligned: documented in C library docs - // - Valid: C library guarantees initialized Data - // - Lifetime: C library guarantees static lifetime - Some(unsafe { &*ptr }) -} - -/// Processes data provided by C caller. -/// -/// # Parameters -/// -/// - `data`: Must be non-null, aligned for `Data`, and point to `len` valid `Data` items -/// - `len`: Number of items. Must not exceed `isize::MAX / size_of::<Data>()` -/// -/// # Returns -/// -/// - `0` on success -/// - `-1` if `data` is null -/// - `-2` if `len` is invalid -/// -/// # Thread Safety -/// -/// This function is thread-safe. The `data` array must not be mutated during the call. -#[no_mangle] -pub extern "C" fn process_documented(data: *const Data, len: usize) -> i32 { - // Verify invariants we can check - if data.is_null() { - return -1; - } - - if len > isize::MAX as usize / std::mem::size_of::<Data>() { - return -2; - } - - // SAFETY: - // - Non-null: checked above - // - Aligned: documented requirement for caller - // - Valid for len items: documented requirement for caller - // - Not mutated: documented thread safety requirement - let slice = unsafe { std::slice::from_raw_parts(data, len) }; - - for item in slice { - // process... - } - - 0 -} -``` - -## Documentation Template - -```rust -/// Brief description. -/// -/// # Parameters -/// -/// - `param`: Description, constraints (non-null, aligned, etc.)
-/// -/// # Invariants Assumed -/// -/// The following invariants are assumed and NOT verified: -/// - Invariant 1: explanation -/// - Invariant 2: explanation -/// -/// The following invariants ARE verified at runtime: -/// - Verified 1: how it's checked -/// -/// # Safety (for unsafe fn) -/// -/// Caller must ensure: -/// - Requirement 1 -/// - Requirement 2 -/// -/// # Errors -/// -/// Returns error code when: -/// - Condition 1: error code -``` - -## Checklist - -- [ ] Have I documented all assumptions about C parameters? -- [ ] Which invariants can I verify at runtime? -- [ ] Which must I trust the C caller to uphold? -- [ ] Have I documented error conditions and return values? - -## Related Rules - -- `safety-02`: Verify safety invariants -- `safety-10`: Document safety requirements diff --git a/.claude/skills/unsafe-checker/rules/ffi-13-data-layout.md b/.claude/skills/unsafe-checker/rules/ffi-13-data-layout.md deleted file mode 100644 index 1e36e078b..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-13-data-layout.md +++ /dev/null @@ -1,146 +0,0 @@ ---- -id: ffi-13 -original_id: P.UNS.FFI.13 -level: P -impact: HIGH ---- - -# Ensure Consistent Data Layout for Custom Types - -## Summary - -Types shared between Rust and C must have `#[repr(C)]` to ensure the memory layout matches what C expects. - -## Rationale - -- Rust's default layout is unspecified and may change -- C has specific, standardized layout rules -- Mismatched layouts cause memory corruption - -## Bad Example - -```rust -// DON'T: Rust layout for FFI types -struct BadStruct { - a: u8, - b: u32, - c: u8, -} -// Rust may reorder to: b, a, c (for better packing) -// C expects: a, padding, b, c, padding - -extern "C" { - fn use_struct(s: *const BadStruct); // Layout mismatch! 
-} - -// DON'T: Assume Rust enum layout matches C -enum BadEnum { - A, - B(i32), - C { x: u8, y: u8 }, -} -// Rust enum layout is complex and not C-compatible -``` - -## Good Example - -```rust -// DO: Use repr(C) for FFI structs -#[repr(C)] -struct GoodStruct { - a: u8, // offset 0 - // 3 bytes padding - b: u32, // offset 4 - c: u8, // offset 8 - // 3 bytes padding -} -// Total size: 12, align: 4 - -// DO: Use repr(C) for enums with explicit discriminant -#[repr(C)] -enum GoodEnum { - A = 0, - B = 1, - C = 2, -} -// Equivalent to C: enum { A = 0, B = 1, C = 2 }; - -// DO: For complex enums, use tagged unions -#[repr(C)] -struct TaggedUnion { - tag: GoodEnum, - data: GoodUnionData, -} - -#[repr(C)] -union GoodUnionData { - a: (), // For GoodEnum::A - b: i32, // For GoodEnum::B - c: [u8; 2], // For GoodEnum::C -} - -// DO: Verify layout at compile time -const _: () = { - assert!(std::mem::size_of::<GoodStruct>() == 12); - assert!(std::mem::align_of::<GoodStruct>() == 4); -}; -``` - -## Layout Verification - -```rust -use std::mem::{size_of, align_of, offset_of}; - -#[repr(C)] -struct Verified { - a: u8, - b: u32, - c: u8, -} - -// Compile-time layout verification -const _: () = { - assert!(size_of::<Verified>() == 12); - assert!(align_of::<Verified>() == 4); - // offset_of! is stable since Rust 1.77 - assert!(offset_of!(Verified, a) == 0); - assert!(offset_of!(Verified, b) == 4); - assert!(offset_of!(Verified, c) == 8); -}; - -// Runtime verification -#[test] -fn verify_layout() { - assert_eq!(size_of::<Verified>(), 12); - assert_eq!(align_of::<Verified>(), 4); - - let v = Verified { a: 0, b: 0, c: 0 }; - let base = &v as *const _ as usize; - - assert_eq!(&v.a as *const _ as usize - base, 0); - assert_eq!(&v.b as *const _ as usize - base, 4); - assert_eq!(&v.c as *const _ as usize - base, 8); -} -``` - -## repr Options - -| Attribute | Effect | -|-----------|--------| -| `#[repr(C)]` | C-compatible layout | -| `#[repr(C, packed)]` | C layout, no padding | -| `#[repr(C, align(N))]` | C layout, minimum align N | -| `#[repr(transparent)]` | Same layout and ABI as its 
single non-zero-sized field | -| `#[repr(u8)]` etc. | Enum discriminant type | - -## Checklist - -- [ ] Is every FFI struct marked `#[repr(C)]`? -- [ ] Is every FFI enum using explicit discriminants? -- [ ] Have I verified the layout matches the C header? -- [ ] Have I added compile-time assertions? - -## Related Rules - -- `mem-01`: Choose appropriate data layout -- `ffi-14`: Types in FFI should have stable layout diff --git a/.claude/skills/unsafe-checker/rules/ffi-14-stable-layout.md b/.claude/skills/unsafe-checker/rules/ffi-14-stable-layout.md deleted file mode 100644 index ba3db5fd2..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-14-stable-layout.md +++ /dev/null @@ -1,135 +0,0 @@ ---- -id: ffi-14 -original_id: P.UNS.FFI.14 -level: P -impact: HIGH ---- - -# Types Used in FFI Should Have Stable Layout - -## Summary - -FFI types should not change layout between versions. Use `#[repr(C)]` and avoid types with unstable layout like generic `std` types.
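A small sketch of the "explicit struct instead of a `std` type" idea for borrowed bytes. `ByteSlice` is a hypothetical type invented for illustration, not part of any library. It is a `#[repr(C)]` pointer-plus-length pair that mirrors a C `struct { const uint8_t *ptr; size_t len; }` and stands in for `&[u8]` at the boundary. A compile-time assertion pins its size.

```rust
use std::marker::PhantomData;

/// Hypothetical C-compatible view of a borrowed byte slice.
#[repr(C)]
pub struct ByteSlice<'a> {
    ptr: *const u8,
    len: usize,
    _borrow: PhantomData<&'a [u8]>, // zero-sized; ties the view to the borrowed data
}

// Layout check: exactly two pointer-sized fields, no hidden Rust metadata.
const _: () = assert!(std::mem::size_of::<ByteSlice<'static>>() == 2 * std::mem::size_of::<usize>());

impl<'a> ByteSlice<'a> {
    pub fn from_slice(s: &'a [u8]) -> Self {
        ByteSlice { ptr: s.as_ptr(), len: s.len(), _borrow: PhantomData }
    }

    /// Converts the view back into a Rust slice.
    pub fn as_slice(&self) -> &'a [u8] {
        // SAFETY: `ptr`/`len` were taken from a live `&'a [u8]` in `from_slice`,
        // and the private fields prevent constructing an inconsistent value in Rust.
        unsafe { std::slice::from_raw_parts(self.ptr, self.len) }
    }
}
```

Unlike `&[u8]`, whose fat-pointer layout Rust does not promise to C, this struct has a documented `repr(C)` layout that a C header can mirror field by field.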
- -## Rationale - -- ABI compatibility requires stable layout -- Dynamic libraries may be loaded with different compiler versions -- Layout changes break binary compatibility - -## Bad Example - -```rust -// DON'T: Use Rust std types with unstable layout in FFI -extern "C" { - // Vec layout is not stable! - fn bad_vec(v: Vec<u8>); - - // String layout is not stable! - fn bad_string(s: String); - - // HashMap layout varies between versions - fn bad_map(m: std::collections::HashMap<i32, i32>); -} - -// DON'T: Use Rust-specific types in C structs -#[repr(C)] -struct BadMixed { - id: i32, - data: Vec<u8>, // Vec is not C-compatible! -} - -// DON'T: Rely on Option layout outside the guaranteed null-pointer-optimization cases (Option<&T>, Option<NonNull<T>>, ...) -#[repr(C)] -struct BadOption { - value: Option<u32>, // Layout is unspecified! -} -``` - -## Good Example - -```rust -use std::os::raw::{c_int, c_char, c_void}; - -// DO: Use C-compatible types -#[repr(C)] -struct GoodStruct { - id: c_int, - name: *const c_char, // C-style string - data: *const c_void, // Generic pointer - data_len: usize, -} - -// DO: Use explicit struct for what Vec would provide -#[repr(C)] -struct GoodBuffer { - ptr: *mut u8, - len: usize, - cap: usize, -} - -impl GoodBuffer { - fn from_vec(mut v: Vec<u8>) -> Self { - let buf = Self { - ptr: v.as_mut_ptr(), - len: v.len(), - cap: v.capacity(), - }; - std::mem::forget(v); - buf - } - - /// # Safety - /// `self` must have been created by from_vec() and not converted back already - unsafe fn into_vec(self) -> Vec<u8> { - unsafe { Vec::from_raw_parts(self.ptr, self.len, self.cap) } - } -} - -// DO: Use fixed-size arrays for bounded data -#[repr(C)] -struct FixedName { - name: [c_char; 64], - name_len: usize, -} - -// DO: Define your own stable option type -#[repr(C)] -struct OptionalU32 { - has_value: bool, - value: u32, -} - -impl From<Option<u32>> for OptionalU32 { - fn from(opt: Option<u32>) -> Self { - match opt { - Some(v) => Self { has_value: true, value: v }, - None => Self { has_value: false, value: 0 }, - } - } -} -``` - -## Stable Types for FFI - -| Use Instead Of | Stable Type |
-|----------------|-------------| -| `Vec<T>` | `*mut T` + `len` + `cap` | -| `String` | `*const c_char` or `*mut c_char` + `len` | -| `&[T]` | `*const T` + `len` | -| `Option<T>` | Custom tagged struct | -| `Result<T, E>` | Error code + out parameter | -| `Box<T>` | `*mut T` | -| `bool` (when C uses an `int` flag) | `c_int` or explicit `u8` (Rust `bool` itself matches C `_Bool`) | - -## Checklist - -- [ ] Am I using only C-compatible primitive types? -- [ ] Am I avoiding std collection types in FFI signatures? -- [ ] Have I created stable wrappers for Rust types? -- [ ] Is the layout documented for other languages? - -## Related Rules - -- `ffi-13`: Ensure consistent data layout -- `ffi-05`: Use portable type aliases diff --git a/.claude/skills/unsafe-checker/rules/ffi-15-validate-external.md b/.claude/skills/unsafe-checker/rules/ffi-15-validate-external.md deleted file mode 100644 index 05f5ea110..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-15-validate-external.md +++ /dev/null @@ -1,145 +0,0 @@ ---- -id: ffi-15 -original_id: P.UNS.FFI.15 -level: P -impact: HIGH ---- - -# Validate Non-Robust External Values - -## Summary - -Data received from external sources (FFI, files, network) may be invalid. Validate it before converting it into Rust types with stricter invariants.
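A self-contained sketch of validating a raw status byte before it becomes a Rust enum. `Status` mirrors the rule's own example. `status_from_c` is a hypothetical exported entry point. It reports a bad byte through an error code instead of transmuting it, and its name and error codes are illustrative assumptions.

```rust
use std::os::raw::c_int;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum Status {
    Active = 0,
    Inactive = 1,
    Pending = 2,
}

#[derive(Debug, PartialEq, Eq)]
pub struct InvalidStatus(pub u8);

impl TryFrom<u8> for Status {
    type Error = InvalidStatus;

    fn try_from(raw: u8) -> Result<Self, Self::Error> {
        match raw {
            0 => Ok(Status::Active),
            1 => Ok(Status::Inactive),
            2 => Ok(Status::Pending),
            other => Err(InvalidStatus(other)), // never transmute unchecked bytes
        }
    }
}

/// Hypothetical export: returns 0 and writes `*out` on success,
/// -1 if `out` is null, -2 if `raw` is not a valid `Status`.
pub extern "C" fn status_from_c(raw: u8, out: *mut u8) -> c_int {
    if out.is_null() {
        return -1;
    }
    match Status::try_from(raw) {
        Ok(status) => {
            // SAFETY: `out` is non-null; the caller guarantees it is writable.
            unsafe { *out = status as u8 };
            0
        }
        Err(_) => -2,
    }
}
```

Every possible `u8` value maps either to a valid variant or to an error, so no input from C can produce an invalid `Status`.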
- -## Rationale - -- External data can be malicious or corrupted -- Rust types have invariants (e.g., valid UTF-8 for str) -- Invalid data that violates those invariants causes undefined behavior - -## Bad Example - -```rust -// DON'T: Trust external data -extern "C" { - fn get_status() -> u8; -} - -#[derive(Debug)] -enum Status { Active = 0, Inactive = 1, Pending = 2 } - -fn bad_convert() -> Status { - let raw = unsafe { get_status() }; - // BAD: Assumes C returns valid enum value - unsafe { std::mem::transmute(raw) } // UB if raw > 2 -} - -// DON'T: Trust strings from C -fn bad_string(ptr: *const c_char) -> &'static str { - let cstr = unsafe { CStr::from_ptr(ptr) }; - // BAD: Assumes valid UTF-8 (and an unchecked 'static lifetime) - cstr.to_str().unwrap() -} - -// DON'T: Trust size values -fn bad_size(ptr: *const u8, len: usize) -> Vec<u8> { - // BAD: len could be huge, causing OOM - // BAD: len could exceed actual data - unsafe { std::slice::from_raw_parts(ptr, len) }.to_vec() -} -``` - -## Good Example - -```rust -use std::ffi::{c_char, CStr}; - -#[derive(Debug)] -struct InvalidStatusError(u8); - -// DO: Validate enum values -#[derive(Debug, Clone, Copy)] -#[repr(u8)] -enum Status { - Active = 0, - Inactive = 1, - Pending = 2, -} - -impl TryFrom<u8> for Status { - type Error = InvalidStatusError; - - fn try_from(value: u8) -> Result<Self, Self::Error> { - match value { - 0 => Ok(Status::Active), - 1 => Ok(Status::Inactive), - 2 => Ok(Status::Pending), - _ => Err(InvalidStatusError(value)), - } - } -} - -fn good_convert() -> Result<Status, InvalidStatusError> { - let raw = unsafe { get_status() }; - Status::try_from(raw) // Returns error for invalid values -} - -// DO: Handle invalid UTF-8 -fn good_string(ptr: *const c_char) -> Result<String, std::str::Utf8Error> { - if ptr.is_null() { - return Ok(String::new()); - } - let cstr = unsafe { CStr::from_ptr(ptr) }; - cstr.to_str().map(|s| s.to_owned()) -} - -fn good_string_lossy(ptr: *const c_char) -> String { - if 
ptr.is_null() { - return String::new(); - } - let cstr = unsafe { CStr::from_ptr(ptr) }; - cstr.to_string_lossy().into_owned() // Replaces invalid UTF-8 -} - -// DO: Validate sizes -const MAX_REASONABLE_SIZE: usize = 100 * 1024 * 1024; // 100
MB - -#[derive(Debug)] -enum ValidationError { - NullPointer, - SizeTooLarge, -} - -fn good_size(ptr: *const u8, len: usize) -> Result<Vec<u8>, ValidationError> { - if ptr.is_null() { - return Err(ValidationError::NullPointer); - } - if len > MAX_REASONABLE_SIZE { - return Err(ValidationError::SizeTooLarge); - } - - // Still need to trust that ptr points to len valid bytes - // Document this as a caller requirement - let slice = unsafe { std::slice::from_raw_parts(ptr, len) }; - Ok(slice.to_vec()) -} - -// DO: Use num_enum for safe enum conversion -// use num_enum::TryFromPrimitive; -// -// #[derive(TryFromPrimitive)] -// #[repr(u8)] -// enum Status { Active = 0, Inactive = 1, Pending = 2 } -``` - -## Validation Patterns - -| External Data | Validation | -|---------------|------------| -| Enum discriminant | Match against valid values | -| String | Check UTF-8 or use lossy conversion | -| Size/length | Check against maximum | -| Pointer | Check for null | -| Boolean | Explicit 0/1 check or treat any non-zero as true | -| Float | Check for NaN, infinity if problematic | - -## Checklist - -- [ ] Am I validating external enum values? -- [ ] Am I handling potential invalid UTF-8? -- [ ] Am I checking sizes against reasonable limits? -- [ ] Am I using TryFrom instead of transmute? - -## Related Rules - -- `ffi-12`: Document invariant assumptions -- `safety-02`: Verify safety invariants diff --git a/.claude/skills/unsafe-checker/rules/ffi-16-closure-to-c.md b/.claude/skills/unsafe-checker/rules/ffi-16-closure-to-c.md deleted file mode 100644 index 7d80d5c71..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-16-closure-to-c.md +++ /dev/null @@ -1,141 +0,0 @@ ---- -id: ffi-16 -original_id: P.UNS.FFI.16 -level: P -impact: HIGH ---- - -# Separate Data and Code When Passing Rust Closures to C - -## Summary - -C callbacks are function pointers without captured state. To pass Rust closures to C, separate the function pointer from the closure data using a "trampoline" pattern.
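A runnable sketch of the trampoline. It assumes a hypothetical C function `c_apply` that invokes its callback synchronously, exactly once. `c_apply` is implemented in Rust here so the example needs no C toolchain. Because the callback runs only during that call, a stack borrow of the closure is enough. A C API that stores the callback would need the boxed, unregister-on-drop approach instead.

```rust
use std::ffi::{c_int, c_void};

/// C-style callback: a plain function pointer plus an opaque `user_data` pointer.
type CCallback = unsafe extern "C" fn(value: c_int, user_data: *mut c_void) -> c_int;

// Hypothetical stand-in for a C library function that calls `cb` once, synchronously.
unsafe extern "C" fn c_apply(cb: CCallback, user_data: *mut c_void, value: c_int) -> c_int {
    unsafe { cb(value, user_data) }
}

/// Trampoline: a non-capturing function, monomorphized once per closure type `F`.
unsafe extern "C" fn trampoline<F: FnMut(c_int) -> c_int>(
    value: c_int,
    user_data: *mut c_void,
) -> c_int {
    // SAFETY: `user_data` was created from `&mut F` in `apply_with` and is live for this call.
    let closure = unsafe { &mut *(user_data as *mut F) };
    closure(value)
}

/// Passes a capturing closure through the C-style callback API.
pub fn apply_with<F: FnMut(c_int) -> c_int>(mut closure: F, value: c_int) -> c_int {
    let user_data = &mut closure as *mut F as *mut c_void;
    // SAFETY: `closure` outlives the call, and `c_apply` does not retain the pointer.
    unsafe { c_apply(trampoline::<F>, user_data, value) }
}
```

C only ever sees a thin function pointer and a `void*`. The captured state travels through `user_data`, and the trampoline restores its type.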
- -## Rationale - -- Rust closures can capture state (like lambdas) -- C function pointers are just addresses, no state -- Must pass state separately via `void*` user_data - -## Bad Example - -```rust -// DON'T: Try to pass closure directly -extern "C" { - fn set_callback(cb: fn(i32) -> i32); // Rust-ABI fn pointer; only non-capturing closures coerce to it -} - -fn bad_closure() { - let multiplier = 2; - let closure = |x: i32| x * multiplier; // Captures multiplier - - // This won't compile - a capturing closure is not a fn pointer - // set_callback(closure); -} - -// DON'T: Transmute closure to function pointer -fn bad_transmute() { - let closure = |x: i32| x * 2; - let fp: fn(i32) -> i32 = unsafe { std::mem::transmute(closure) }; - // Won't compile: this closure is zero-sized but a fn pointer is not. - // A non-capturing closure already coerces: `let fp: fn(i32) -> i32 = |x| x * 2;` -} -``` - -## Good Example - -```rust -use std::ffi::{c_int, c_void}; - -// C callback signature with user_data -type CCallback = extern "C" fn(value: c_int, user_data: *mut c_void) -> c_int; - -extern "C" { - fn set_callback(cb: CCallback, user_data: *mut c_void); - fn remove_callback(); -} - -// DO: Use trampoline pattern, and unregister before the closure is dropped -fn good_closure<F: FnMut(i32) -> i32>(mut closure: F, run: impl FnOnce()) { - // Trampoline function that forwards to the closure - extern "C" fn trampoline<F: FnMut(i32) -> i32>( - value: c_int, - user_data: *mut c_void, - ) -> c_int { - let closure = unsafe { &mut *(user_data as *mut F) }; - closure(value as i32) as c_int - } - - let user_data = &mut closure as *mut F as *mut c_void; - - unsafe { set_callback(trampoline::<F>, user_data); } - run(); // C may invoke the callback only while `closure` is alive - unsafe { remove_callback(); } // Unregister before `closure` goes out of scope -} - -// DO: Box the closure so its address stays stable -struct CallbackHandle { - // Outer Box gives the inner (fat) Box a fixed heap address, so moving - // CallbackHandle after register() does not invalidate user_data - 
closure: Box<Box<dyn FnMut(i32) -> i32>>, -} - -impl CallbackHandle { - fn new<F: FnMut(i32) -> i32 + 'static>(closure: F) -> Self { - Self { closure: Box::new(Box::new(closure) as Box<dyn FnMut(i32) -> i32>) } - } - - fn register(&mut self) { - extern "C" fn trampoline(value: c_int, user_data: *mut c_void) -> c_int { - let closure = unsafe { &mut *(user_data as *mut Box<dyn FnMut(i32) -> i32>) }; - closure(value as i32) as c_int - } - - let user_data = &mut *self.closure as *mut Box<dyn FnMut(i32) -> i32> as *mut c_void; - unsafe { set_callback(trampoline, user_data); } - } -} - -impl Drop for CallbackHandle { - fn drop(&mut self) { - unsafe { remove_callback(); } - // Now safe to drop closure - } -} - -// Usage -fn example() { - let multiplier = 2; - let mut handle = CallbackHandle::new(move |x| x * multiplier); - handle.register(); - // handle must live until callback is no longer needed -} -``` - -## Trampoline Pattern - -``` -Rust Closure: |x| x * captured_value - | - v -+-----------------+ +-----------------+ -| trampoline fn | --> | closure data | -| (no captures) | | (captured_value)| -+-----------------+ +-----------------+ - | ^ - | user_data ptr | - +-------------------------+ - -C sees: function pointer + void* user_data -``` - -## Checklist - -- [ ] Does my closure capture any state? -- [ ] Am I using the trampoline pattern? -- [ ] Does the closure data live long enough? -- [ ] Am I unregistering before dropping the closure?
- -## Related Rules - -- `ffi-03`: Implement Drop for resource wrappers -- `ffi-10`: Thread safety for callbacks diff --git a/.claude/skills/unsafe-checker/rules/ffi-17-opaque-types.md b/.claude/skills/unsafe-checker/rules/ffi-17-opaque-types.md deleted file mode 100644 index 9fcd52652..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-17-opaque-types.md +++ /dev/null @@ -1,152 +0,0 @@ ---- -id: ffi-17 -original_id: P.UNS.FFI.17 -level: P -impact: MEDIUM ---- - -# Use Dedicated Opaque Type Pointers Instead of c_void for C Opaque Types - -## Summary - -Instead of using `*mut c_void` for opaque C handles, create dedicated marker types that provide type safety. - -## Rationale - -- `*mut c_void` accepts any pointer, easy to mix up handles -- Dedicated types catch mistakes at compile time -- Self-documenting code -- Prevents accidental use of wrong free function - -## Bad Example - -```rust -use std::ffi::c_void; - -extern "C" { - fn create_database() -> *mut c_void; - fn create_connection() -> *mut c_void; - fn execute(conn: *mut c_void, query: *const i8); - fn close_database(db: *mut c_void); - fn close_connection(conn: *mut c_void); -} - -fn bad_usage() { - let db = unsafe { create_database() }; - let conn = unsafe { create_connection() }; - - // BUG: Passed db where conn was expected - compiles fine! - unsafe { execute(db, b"SELECT 1\0".as_ptr() as *const i8) }; - - // BUG: Wrong close function - compiles fine! 
- unsafe { close_connection(db) }; - unsafe { close_database(conn) }; -} -``` - -## Good Example - -```rust -use std::ffi::c_char; -use std::marker::PhantomData; - -// DO: Define opaque marker types -#[repr(C)] -pub struct Database { - _private: [u8; 0], - _marker: PhantomData<(*mut u8, std::marker::PhantomPinned)>, -} - -#[repr(C)] -pub struct Connection { - _private: [u8; 0], - _marker: PhantomData<(*mut u8, std::marker::PhantomPinned)>, -} - -extern "C" { - fn create_database() -> *mut Database; - fn create_connection(db: *mut Database) -> *mut Connection; - fn execute(conn: *mut Connection, query: *const c_char) -> i32; - fn close_database(db: *mut Database); - fn close_connection(conn: *mut Connection); -} - -fn good_usage() { - let db = unsafe { create_database() }; - let conn = unsafe { create_connection(db) }; - - // Compile error: expected *mut Connection, found *mut Database - // unsafe { execute(db, b"SELECT 1\0".as_ptr() as *const c_char) }; - - // Correct usage - unsafe { execute(conn, b"SELECT 1\0".as_ptr() as *const c_char) }; - - unsafe { close_connection(conn) }; - unsafe { close_database(db) }; -} - -// DO: Wrap in safe Rust types -pub struct SafeDatabase { - ptr: *mut Database, -} - -impl SafeDatabase { - pub fn new() -> Option<Self> { - let ptr = unsafe { create_database() }; - if ptr.is_null() { None } else { Some(Self { ptr }) } - } - - pub fn connect(&self) -> Option<SafeConnection<'_>> { - let ptr = unsafe { create_connection(self.ptr) }; - if ptr.is_null() { None } else { Some(SafeConnection { ptr, _db: PhantomData }) } - } -} - -impl Drop for SafeDatabase { - fn drop(&mut self) { - unsafe { close_database(self.ptr); } - } -} - -pub struct SafeConnection<'db> { - ptr: *mut Connection, - _db: PhantomData<&'db SafeDatabase>, -} - -impl SafeConnection<'_> { - pub fn execute(&self, query: &str) -> Result<(), ()> { - let query = std::ffi::CString::new(query).map_err(|_| ())?; - let result = unsafe { execute(self.ptr, query.as_ptr()) }; - if result == 0 { Ok(()) } else { 
Err(()) } - } -} - -impl Drop for
SafeConnection<'_> { - fn drop(&mut self) { - unsafe { close_connection(self.ptr); } - } -} -``` - -## Opaque Type Pattern - -```rust -// The zero-sized array makes it impossible to construct -// PhantomData ensures proper variance and !Send/!Sync if needed -#[repr(C)] -pub struct OpaqueHandle { - _private: [u8; 0], - _marker: PhantomData<(*mut u8, std::marker::PhantomPinned)>, -} -``` - -## Checklist - -- [ ] Am I using `*mut c_void` for distinct handle types? -- [ ] Would dedicated types prevent bugs? -- [ ] Have I wrapped opaque pointers in safe Rust types? -- [ ] Do my types enforce correct handle/function pairing? - -## Related Rules - -- `ffi-02`: Read std::ffi documentation -- `ffi-03`: Implement Drop for wrapped pointers diff --git a/.claude/skills/unsafe-checker/rules/ffi-18-no-trait-objects.md b/.claude/skills/unsafe-checker/rules/ffi-18-no-trait-objects.md deleted file mode 100644 index 0878d1216..000000000 --- a/.claude/skills/unsafe-checker/rules/ffi-18-no-trait-objects.md +++ /dev/null @@ -1,165 +0,0 @@ ---- -id: ffi-18 -original_id: P.UNS.FFI.18 -level: P -impact: HIGH ---- - -# Avoid Passing Trait Objects to C Interfaces - -## Summary - -Trait objects (`dyn Trait`) have Rust-specific layout (fat pointers with vtable) that is not compatible with C. - -## Rationale - -- Trait objects are "fat pointers": data ptr + vtable ptr -- C expects thin pointers (single pointer) -- Vtable layout is not stable across Rust versions -- C cannot call Rust vtable methods - -## Bad Example - -```rust -// DON'T: Pass trait objects to C -trait Handler { - fn handle(&self, data: i32); -} - -extern "C" { - // This won't work - dyn Handler is a fat pointer! - fn set_handler(h: *const dyn Handler); -} - -// DON'T: Store trait objects in FFI structs -#[repr(C)] -struct BadCallback { - handler: *const dyn Handler, // Not C-compatible! 
-} -``` - -## Good Example - -```rust -use std::marker::PhantomData; -use std::os::raw::{c_int, c_void}; - -// DO: Use function pointers with user_data (trampoline pattern) -type HandlerFn = extern "C" fn(data: c_int, user_data: *mut c_void); - -extern "C" { - fn set_handler(handler: HandlerFn, user_data: *mut c_void); -} - -trait Handler { - fn handle(&self, data: i32); -} - -fn register_handler<H: Handler + 'static>(handler: H) { - // Box the handler; it is leaked on purpose because C keeps - // user_data for as long as the handler stays registered - let boxed: Box<H> = Box::new(handler); - let user_data = Box::into_raw(boxed) as *mut c_void; - - // Nested fns cannot use the outer `H`, so the trampoline declares its own - extern "C" fn trampoline<H: Handler>(data: c_int, user_data: *mut c_void) { - let handler = unsafe { &*(user_data as *const H) }; - handler.handle(data as i32); - } - - unsafe { - set_handler(trampoline::<H>, user_data); - } -} - -// DO: Use concrete types when possible -struct ConcreteHandler { - multiplier: i32, -} - -impl Handler for ConcreteHandler { - fn handle(&self, data: i32) { - println!("{}", data * self.multiplier); - } -} - -// DO: Create C-compatible vtable manually if needed -#[repr(C)] -struct HandlerVtable { - handle: extern "C" fn(this: *const c_void, data: c_int), - drop: extern "C" fn(this: *mut c_void), -} - -#[repr(C)] -struct CCompatibleHandler { - data: *mut c_void, - vtable: *const HandlerVtable, -} - -extern "C" fn handle_impl<H: Handler>(this: *const c_void, data: c_int) { - let handler = unsafe { &*(this as *const H) }; - handler.handle(data as i32); -} - -extern "C" fn drop_impl<H: Handler>(this: *mut c_void) { - unsafe { drop(Box::from_raw(this as *mut H)); } -} - -// A `static` cannot use a generic parameter, so an associated const of -// reference type supplies a 'static vtable for each concrete H -struct VtableFor<H>(PhantomData<H>); - -impl<H: Handler> VtableFor<H> { - const VTABLE: &'static HandlerVtable = &HandlerVtable { - handle: handle_impl::<H>, - drop: 
drop_impl::<H>, - }; -} - -impl CCompatibleHandler { - fn new<H: Handler + 'static>(handler: H) -> Self { - Self { - data: Box::into_raw(Box::new(handler)) as *mut c_void, - vtable: VtableFor::<H>::VTABLE, - } - } - - fn handle(&self, data: i32) { - unsafe { - ((*self.vtable).handle)(self.data, data as c_int); - } - } -} - -impl Drop for CCompatibleHandler { - fn drop(&mut self) { - unsafe { - ((*self.vtable).drop)(self.data); - } - } -} -``` - -## Why Trait
Objects Don't Work - -``` -Rust trait object (*const dyn Handler): -[data pointer][vtable pointer] <- 16 bytes on 64-bit - -C pointer (void*): -[pointer] <- 8 bytes on 64-bit - -The sizes don't match! -``` - -## Alternatives to Trait Objects - -| Instead of | Use | -|------------|-----| -| `dyn Trait` | Function pointer + user_data | -| `Box<dyn Trait>` | Boxed concrete type + trampoline | -| `&dyn Trait` | C-compatible vtable struct | -| `Arc<dyn Trait>` | Reference counting wrapper | - -## Checklist - -- [ ] Am I passing trait objects across FFI? -- [ ] Can I use concrete types instead? -- [ ] Have I used the trampoline pattern for callbacks? -- [ ] If vtable is needed, is it C-compatible? - -## Related Rules - -- `ffi-16`: Closure to C with trampoline pattern -- `ffi-14`: Types should have stable layout diff --git a/.claude/skills/unsafe-checker/rules/general-01-no-abuse.md b/.claude/skills/unsafe-checker/rules/general-01-no-abuse.md deleted file mode 100644 index 866e956c1..000000000 --- a/.claude/skills/unsafe-checker/rules/general-01-no-abuse.md +++ /dev/null @@ -1,71 +0,0 @@ ---- -id: general-01 -original_id: P.UNS.01 -level: P -impact: CRITICAL ---- - -# Do Not Abuse Unsafe to Escape Compiler Safety Checks - -## Summary - -Unsafe Rust should not be used as an escape hatch from the borrow checker or other compiler safety mechanisms. - -## Rationale - -The borrow checker exists to prevent memory safety bugs. Using `unsafe` to bypass it defeats Rust's safety guarantees and introduces potential undefined behavior. - -## Bad Example - -```rust -// DON'T: Using unsafe to bypass borrow checker -fn bad_alias() { - let mut data = vec![1, 2, 3]; - let ptr = data.as_mut_ptr(); - - // Unsafe used to create aliasing mutable references - unsafe { - let ref1 = &mut *ptr; - let ref2 = &mut *ptr; // UB: Two mutable references!
- *ref1 = 10; - *ref2 = 20; - } -} -``` - -## Good Example - -```rust -// DO: Work with the borrow checker, not against it -fn good_sequential() { - let mut data = vec![1, 2, 3]; - data[0] = 10; - data[0] = 20; // Sequential mutations are fine -} - -// DO: Use interior mutability when needed -use std::cell::RefCell; - -fn good_interior_mut() { - let data = RefCell::new(vec![1, 2, 3]); - data.borrow_mut()[0] = 10; -} -``` - -## Legitimate Uses of Unsafe - -1. **FFI**: Calling C functions or implementing C-compatible interfaces -2. **Low-level abstractions**: Implementing collections, synchronization primitives -3. **Performance**: Only after profiling shows measurable improvement, and with careful safety analysis - -## Checklist - -- [ ] Have I tried all safe alternatives first? -- [ ] Is the borrow checker preventing a genuine design need? -- [ ] Can I restructure the code to satisfy the borrow checker? -- [ ] If unsafe is necessary, have I documented the safety invariants? - -## Related Rules - -- `general-02`: Don't blindly use unsafe for performance -- `safety-02`: Unsafe code authors must verify safety invariants diff --git a/.claude/skills/unsafe-checker/rules/general-02-not-for-perf.md b/.claude/skills/unsafe-checker/rules/general-02-not-for-perf.md deleted file mode 100644 index 814f12fa7..000000000 --- a/.claude/skills/unsafe-checker/rules/general-02-not-for-perf.md +++ /dev/null @@ -1,90 +0,0 @@ ---- -id: general-02 -original_id: P.UNS.02 -level: P -impact: CRITICAL ---- - -# Do Not Blindly Use Unsafe for Performance - -## Summary - -Do not assume that using `unsafe` will automatically improve performance. Always measure first and verify the safety invariants. - -## Rationale - -1. Modern Rust optimizers often eliminate bounds checks when they can prove safety -2. Unsafe code may prevent optimizations by breaking aliasing assumptions -3. 
Unmeasured "optimizations" often provide no real benefit while introducing risk - -## Bad Example - -```rust -// DON'T: Blind unsafe for "performance" -fn sum_bad(slice: &[i32]) -> i32 { - let mut sum = 0; - // Unnecessary unsafe - LLVM can optimize the safe version - for i in 0..slice.len() { - unsafe { - sum += *slice.get_unchecked(i); - } - } - sum -} -``` - -## Good Example - -```rust -// DO: Use safe iteration - compiler optimizes bounds checks away -fn sum_good(slice: &[i32]) -> i32 { - slice.iter().sum() -} - -// DO: If unsafe is justified, document why -fn sum_justified(slice: &[i32]) -> i32 { - let mut sum = 0; - // This is actually slower than iter().sum() in most cases - // Only use get_unchecked when: - // 1. Profiler shows bounds checks as bottleneck - // 2. Iterator patterns can't be used - // 3. Safety is proven by other means - for i in 0..slice.len() { - // SAFETY: i is always < slice.len() due to loop condition - unsafe { - sum += *slice.get_unchecked(i); - } - } - sum -} -``` - -## When Unsafe Might Be Justified for Performance - -1. **Hot inner loops** where profiling shows bounds checks are a bottleneck -2. **SIMD operations** that require specific memory alignment -3. **Lock-free data structures** with carefully verified memory orderings - -## Measurement Workflow - -```bash -# 1. Benchmark the safe version first -cargo bench --bench my_bench - -# 2. Profile to identify actual bottlenecks -cargo flamegraph --bench my_bench - -# 3. Only then consider unsafe, with measurements -``` - -## Checklist - -- [ ] Have I benchmarked the safe version? -- [ ] Does profiling show this specific code as a bottleneck? -- [ ] Have I measured the actual improvement from unsafe? -- [ ] Is the performance gain worth the safety risk? 
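As a sketch of trying a safe alternative before reaching for `get_unchecked`, assuming the hot loop is a pairwise operation over two slices: zipping the iterators removes indexing altogether, and re-slicing both inputs to a common length is a common way to give the optimizer a chance to drop per-element bounds checks. Whether those checks actually disappear has to be confirmed with the benchmark and profiler; the function names `dot` and `dot_indexed` are illustrative only.

```rust
/// Dot product without indexing: `zip` stops at the shorter slice,
/// so there is no index to bounds-check and no `unsafe` is needed.
fn dot(a: &[i32], b: &[i32]) -> i32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Indexed version that stays safe. Re-slicing both inputs to a common
/// length up front often lets LLVM drop the per-element checks below;
/// confirm with a profiler rather than assuming it.
fn dot_indexed(a: &[i32], b: &[i32]) -> i32 {
    let n = a.len().min(b.len());
    let (a, b) = (&a[..n], &b[..n]);
    let mut sum = 0;
    for i in 0..n {
        sum += a[i] * b[i];
    }
    sum
}

fn main() {
    let a = [1, 2, 3];
    let b = [4, 5, 6, 7];
    // 1*4 + 2*5 + 3*6 = 32; the extra element of `b` is ignored
    assert_eq!(dot(&a, &b), 32);
    assert_eq!(dot_indexed(&a, &b), 32);
}
```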
- -## Related Rules - -- `general-01`: Don't abuse unsafe to escape safety checks -- `safety-02`: Unsafe code authors must verify safety invariants diff --git a/.claude/skills/unsafe-checker/rules/general-03-no-alias.md b/.claude/skills/unsafe-checker/rules/general-03-no-alias.md deleted file mode 100644 index e65bd9a51..000000000 --- a/.claude/skills/unsafe-checker/rules/general-03-no-alias.md +++ /dev/null @@ -1,74 +0,0 @@ ---- -id: general-03 -original_id: G.UNS.01 -level: G -impact: MEDIUM ---- - -# Do Not Create Aliases for Types/Methods Named "Unsafe" - -## Summary - -Do not create type aliases, re-exports, or wrapper methods that hide the "unsafe" nature of operations. - -## Rationale - -The word "unsafe" in Rust is a signal to developers that extra scrutiny is required. Hiding this signal makes code review harder and can lead to accidental misuse. - -## Bad Example - -```rust -// DON'T: Hide unsafe behind an alias -type SafePointer = *mut u8; // Still unsafe to dereference! - -// DON'T: Wrap unsafe in a "safe-looking" name -pub fn get_value(ptr: *const i32) -> i32 { - unsafe { *ptr } // Caller doesn't know this is unsafe! -} - -// DON'T: Re-export unsafe functions with different names -pub use std::mem::transmute as convert; -``` - -## Good Example - -```rust -// DO: Keep "unsafe" visible in the API -pub unsafe fn get_value_unchecked(ptr: *const i32) -> i32 { - *ptr -} - -// DO: If a check cannot prove the pointer valid, keep the function unsafe -/// Returns the value at the pointer, or `None` if it is null. -/// -/// # Safety -/// A null check alone does not prove validity: a non-null `ptr` must be -/// aligned and point to an initialized `i32` that is valid for reads. -pub unsafe fn get_value_checked(ptr: *const i32) -> Option<i32> { - if ptr.is_null() { - None - } else { - // SAFETY: null was ruled out above; the rest is the caller's contract - Some(unsafe { *ptr }) - } -} - -// DO: Use clear naming for raw pointer types -type RawHandle = *mut std::ffi::c_void; // "Raw" signals potential unsafety -``` - -## Common Violations - -1. 
Creating type aliases that hide pointer types -2.
Wrapping unsafe functions in safe-looking functions without proper safety analysis -3. Re-exporting unsafe functions with "friendlier" names - -## Checklist - -- [ ] Does my API preserve visibility of unsafe operations? -- [ ] If wrapping unsafe code in safe API, is the safety invariant enforced? -- [ ] Are type aliases clearly named to indicate their nature? - -## Related Rules - -- `safety-06`: Don't expose raw pointers in public APIs -- `safety-09`: Add SAFETY comment before any unsafe block diff --git a/.claude/skills/unsafe-checker/rules/io-01-raw-handle.md b/.claude/skills/unsafe-checker/rules/io-01-raw-handle.md deleted file mode 100644 index 12714dc22..000000000 --- a/.claude/skills/unsafe-checker/rules/io-01-raw-handle.md +++ /dev/null @@ -1,151 +0,0 @@ ---- -id: io-01 -original_id: P.UNS.FIO.01 -level: P -impact: HIGH ---- - -# Ensure I/O Safety When Using Raw Handles - -## Summary - -When working with raw file descriptors or handles, ensure they are valid for the duration of use and that their ownership is tracked. - -## Rationale - -- Raw handles can be closed by other code -- Using a closed handle is undefined behavior -- Handle reuse can cause data corruption -- Rust 1.63+ provides I/O safety traits - -## Bad Example - -```rust -#[cfg(unix)] -mod bad_example { - use std::os::unix::io::RawFd; - - // DON'T: Accept raw handle without ownership - fn bad_read(fd: RawFd) -> std::io::Result<Vec<u8>> { - // What if fd was closed? What if it's reused? - let mut buf = vec![0u8; 1024]; - let n = unsafe { - libc::read(fd, buf.as_mut_ptr() as *mut libc::c_void, buf.len()) - }; - if n < 0 { - Err(std::io::Error::last_os_error()) - } else { - buf.truncate(n as usize); - Ok(buf) - } - } - - // DON'T: Store raw handle without tracking ownership - struct BadFileRef { - fd: RawFd, // Who owns this? Who closes it?
- } -} -``` - -## Good Example - -```rust -#[cfg(unix)] -mod good_example { - use std::os::unix::io::{AsFd, BorrowedFd, OwnedFd, FromRawFd, AsRawFd}; - use std::fs::File; - - // DO: Use BorrowedFd for borrowed access (Rust 1.63+) - fn good_read(fd: BorrowedFd<'_>) -> std::io::Result<Vec<u8>> { - let mut buf = vec![0u8; 1024]; - // BorrowedFd guarantees the fd stays open for this call - let n = unsafe { - libc::read( - fd.as_raw_fd(), - buf.as_mut_ptr() as *mut libc::c_void, - buf.len() - ) - }; - if n < 0 { - Err(std::io::Error::last_os_error()) - } else { - buf.truncate(n as usize); - Ok(buf) - } - } - - // DO: Use OwnedFd for owned handles - struct GoodFileOwner { - fd: OwnedFd, // Clearly owns the handle - } - - // No Drop impl needed: the OwnedFd field closes the fd - // when GoodFileOwner is dropped - - // DO: Use generic AsFd bound for flexibility - fn generic_read<F: AsFd>(f: &F) -> std::io::Result<Vec<u8>> { - good_read(f.as_fd()) - } - - // Usage - fn example() -> std::io::Result<()> { - let file = File::open("test.txt")?; - - // Pass as BorrowedFd - let data = good_read(file.as_fd())?; - - // Or use generic function - let data = generic_read(&file)?; - - Ok(()) - } - - // DO: Take ownership from raw fd - /// # Safety - /// `fd` must be open, owned by the caller, and not closed elsewhere. - unsafe fn from_raw(fd: i32) -> Option<GoodFileOwner> { - if fd < 0 { - return None; - } - // SAFETY: the caller guarantees fd is valid and transfers ownership - let owned = unsafe { OwnedFd::from_raw_fd(fd) }; - Some(GoodFileOwner { fd: owned }) - } -} -``` - -## I/O Safety Types (Rust 1.63+) - -| Type | Meaning | -|------|---------| -| `OwnedFd` | Owns a file descriptor, closes on drop | -| `BorrowedFd<'a>` | Borrows a fd for lifetime 'a | -| `RawFd` | Raw integer, no safety guarantees | -| `AsFd` | Trait for types that can lend a `BorrowedFd` | -| `From<OwnedFd>` | Create from owned fd | -| `Into<OwnedFd>` | Convert to owned fd | - -## Windows Equivalents - -```rust -#[cfg(windows)] -use std::os::windows::io::{ - OwnedHandle, BorrowedHandle, RawHandle, - AsHandle, FromRawHandle, 
- OwnedSocket, BorrowedSocket, RawSocket, -
AsSocket, FromRawSocket, -}; -``` - -## Checklist - -- [ ] Am I using BorrowedFd/OwnedFd instead of RawFd? -- [ ] Is ownership of handles clear? -- [ ] Am I using the AsFd trait for generic code? -- [ ] Is the fd guaranteed valid for the duration of use? - -## Related Rules - -- `ffi-03`: Implement Drop for resource wrappers -- `safety-02`: Verify safety invariants diff --git a/.claude/skills/unsafe-checker/rules/mem-01-repr-layout.md b/.claude/skills/unsafe-checker/rules/mem-01-repr-layout.md deleted file mode 100644 index dbd383494..000000000 --- a/.claude/skills/unsafe-checker/rules/mem-01-repr-layout.md +++ /dev/null @@ -1,130 +0,0 @@ ---- -id: mem-01 -original_id: P.UNS.MEM.01 -level: P -impact: HIGH ---- - -# Choose Appropriate Data Layout for Struct/Tuple/Enum - -## Summary - -Use `#[repr(...)]` attributes to control data layout when interfacing with C, doing memory mapping, or needing specific guarantees. - -## Rationale - -Rust's default layout is unspecified and may change between compiler versions. For FFI, persistence, or low-level memory operations, you need predictable layout. - -## Repr Attributes - -| Attribute | Use Case | -|-----------|----------| -| `#[repr(C)]` | C-compatible layout, stable field order | -| `#[repr(transparent)]` | Single-field struct with same layout as field | -| `#[repr(packed)]` | No padding (alignment = 1), careful with references! | -| `#[repr(align(N))]` | Minimum alignment of N bytes | -| `#[repr(u8)]`, `#[repr(i32)]`, etc. | Enum discriminant type | - -## Bad Example - -```rust -// DON'T: Assume Rust struct layout matches C -struct BadFFI { - a: u8, - b: u32, - c: u8, -} -// Rust may reorder fields or add different padding than C - -// DON'T: Use packed without understanding the risks -#[repr(packed)] -struct Dangerous { - a: u8, - b: u32, -} - -fn bad_ref(d: &Dangerous) -> &u32 { - &d.b // UB: Creates unaligned reference! 
-} -``` - -## Good Example - -```rust -// DO: Use repr(C) for FFI -#[repr(C)] -struct GoodFFI { - a: u8, - b: u32, - c: u8, -} -// Guaranteed: a at 0, padding 1-3, b at 4, c at 8, padding 9-11 - -// DO: Use repr(transparent) for newtypes -#[repr(transparent)] -struct Wrapper(u32); -// Guaranteed same layout as u32, can be transmuted - -// DO: Use repr(packed) carefully, access via copy -#[repr(C, packed)] -struct PackedData { - header: u8, - value: u32, -} - -impl PackedData { - fn value(&self) -> u32 { - // Copy out the value to avoid unaligned reference - let ptr = std::ptr::addr_of!(self.value); - // SAFETY: Reading unaligned is OK with read_unaligned - unsafe { ptr.read_unaligned() } - } -} - -// DO: Use align for SIMD or cache line alignment -#[repr(C, align(64))] -struct CacheAligned { - data: [u8; 64], -} - -// DO: Specify enum discriminant for FFI -#[repr(u8)] -enum Status { - Ok = 0, - Error = 1, - Unknown = 255, -} -``` - -## Layout Guarantees - -```rust -use std::mem::{size_of, align_of}; - -#[repr(C)] -struct Example { - a: u8, // offset 0, size 1 - // padding: 3 bytes - b: u32, // offset 4, size 4 - c: u8, // offset 8, size 1 - // padding: 3 bytes -} - -// Checked at compile time -const _: () = assert!(size_of::<Example>() == 12); -const _: () = assert!(align_of::<Example>() == 4); - -// repr(Rust) might reorder to: b, a, c -> size 8 -``` - -## Checklist - -- [ ] Is this type used in FFI? → Use `#[repr(C)]` -- [ ] Is this a newtype wrapper? → Consider `#[repr(transparent)]` -- [ ] Do I need specific alignment? → Use `#[repr(align(N))]` -- [ ] Am I using packed?
→ Never create references to packed fields - -## Related Rules - -- `ffi-13`: Ensure consistent data layout for custom types -- `ffi-14`: Types in FFI should have stable layout -- `ptr-04`: Alignment considerations diff --git a/.claude/skills/unsafe-checker/rules/mem-02-no-other-process.md b/.claude/skills/unsafe-checker/rules/mem-02-no-other-process.md deleted file mode 100644 index 0cf24e7a4..000000000 --- a/.claude/skills/unsafe-checker/rules/mem-02-no-other-process.md +++ /dev/null @@ -1,113 +0,0 @@ ---- -id: mem-02 -original_id: P.UNS.MEM.02 -level: P -impact: CRITICAL ---- - -# Do Not Modify Memory Variables of Other Processes or Dynamic Libraries - -## Summary - -Do not directly manipulate memory belonging to other processes or dynamically loaded libraries. Use proper IPC or FFI mechanisms. - -## Rationale - -- Other processes have separate address spaces; direct access is impossible on modern OSes -- Shared memory requires explicit setup and synchronization -- Dynamic library memory has ownership rules that must be respected -- Violating these causes undefined behavior or security vulnerabilities - -## Bad Example - -```rust -// DON'T: Try to access another process's memory directly -fn bad_cross_process(ptr: *mut i32) { - // This pointer from another process is meaningless in our address space - unsafe { *ptr = 42; } // Undefined behavior or crash -} - -// DON'T: Modify library internals -extern "C" { - static mut LIBRARY_INTERNAL: i32; -} - -fn bad_library_access() { - // Modifying library internals breaks encapsulation - unsafe { LIBRARY_INTERNAL = 100; } // May corrupt library state -} -``` - -## Good Example - -```rust -// DO: Use proper IPC for cross-process communication -use std::io::{Read, Write}; -use std::os::unix::net::UnixStream; - -fn ipc_communication() -> std::io::Result<()> { - let mut stream = UnixStream::connect("/tmp/socket")?; - stream.write_all(b"message")?; - Ok(()) -} - -// DO: Use shared memory with proper synchronization 
-#[cfg(unix)] -fn shared_memory_example(shared: &std::sync::atomic::AtomicI32) { - use std::sync::atomic::Ordering; - - // `shared` must point into a region mapped as shared (e.g. mmap with - // MAP_SHARED); setting up that mapping is omitted here - - // Use atomic operations for synchronization - shared.store(42, Ordering::Release); -} - -// DO: Use proper FFI for library interaction -mod ffi { - extern "C" { - pub fn library_set_value(value: i32); - pub fn library_get_value() -> i32; - } -} - -fn proper_library_access() { - unsafe { - ffi::library_set_value(42); - let value = ffi::library_get_value(); - } -} - -// DO: Use the libloading crate for dynamic libraries -fn dynamic_library() -> Result<(), Box<dyn std::error::Error>> { - let lib = unsafe { libloading::Library::new("mylib.so")? }; - let func: libloading::Symbol<unsafe extern "C" fn(i32) -> i32> = - unsafe { lib.get(b"my_function")? }; - let result = unsafe { func(42) }; - Ok(()) -} -``` - -## Memory Ownership Rules - -| Memory Type | Owner | Safe Access | -|-------------|-------|-------------| -| Stack variables | Current function | Direct | -| Heap (Box, Vec) | Rust allocator | Through smart pointers | -| Static | Program | With proper synchronization | -| Shared memory | Multiple processes | Atomic ops, process-shared locks | -| Library memory | Library | Through library API | -| FFI-allocated | C allocator | Through C free functions | - -## Checklist - -- [ ] Who allocated this memory? -- [ ] Who is responsible for freeing it? -- [ ] Is proper synchronization in place for shared access? -- [ ] Am I using the correct API for cross-boundary access?
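A minimal sketch of exchanging bytes over a socket instead of touching another process's pointers, assuming a Unix target. `UnixStream::pair` creates two connected sockets; in a real program one end would be handed to the other process (for example inherited by a spawned child), which is left out here so the example stays self-contained. The `round_trip` helper is hypothetical.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

/// Sends `msg` over one end of a socket pair and reads it back from the other.
/// Only bytes cross the boundary; neither side dereferences the other's pointers.
fn round_trip(msg: &[u8]) -> std::io::Result<Vec<u8>> {
    let (mut tx, mut rx) = UnixStream::pair()?;
    tx.write_all(msg)?;
    // Closing the sender lets `read_to_end` on the receiver observe EOF.
    drop(tx);
    let mut buf = Vec::new();
    rx.read_to_end(&mut buf)?;
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    let echoed = round_trip(b"message")?;
    assert_eq!(echoed, b"message");
    Ok(())
}
```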
- -## Related Rules - -- `mem-03`: Don't let String/Vec drop other process's memory -- `ffi-03`: Implement Drop for wrapped C pointers diff --git a/.claude/skills/unsafe-checker/rules/mem-03-no-auto-drop-foreign.md b/.claude/skills/unsafe-checker/rules/mem-03-no-auto-drop-foreign.md deleted file mode 100644 index 0a20c74e2..000000000 --- a/.claude/skills/unsafe-checker/rules/mem-03-no-auto-drop-foreign.md +++ /dev/null @@ -1,127 +0,0 @@ ---- -id: mem-03 -original_id: P.UNS.MEM.03 -level: P -impact: CRITICAL ---- - -# Do Not Let String/Vec Auto-Drop Other Process's Memory - -## Summary - -Never create `String`, `Vec`, or `Box` from memory allocated outside Rust's allocator. They will try to free the memory with the wrong deallocator. - -## Rationale - -`String`, `Vec`, and `Box` assume memory was allocated by Rust's global allocator. When dropped, they call `dealloc`. If the memory came from C's `malloc`, a different allocator, or shared memory, this causes undefined behavior. - -## Bad Example - -```rust -// DON'T: Create String from C-allocated memory -extern "C" { - fn c_get_string() -> *mut std::os::raw::c_char; -} - -fn bad_string(len: usize, cap: usize) -> String { - unsafe { - let ptr = c_get_string(); - // BAD: String will try to free with Rust allocator - String::from_raw_parts(ptr as *mut u8, len, cap) - } -} - -// DON'T: Create Vec from foreign memory -fn bad_vec(ptr: *mut u8, len: usize) -> Vec<u8> { - // BAD: Vec will free this memory incorrectly - unsafe { Vec::from_raw_parts(ptr, len, len) } -} - -// DON'T: Wrap shared memory in Box -fn bad_box(shared_ptr: *mut Data) -> Box<Data> { - // BAD: Box will try to deallocate shared memory!
- unsafe { Box::from_raw(shared_ptr) } -} -``` - -## Good Example - -```rust -use std::ffi::CStr; - -extern "C" { - fn c_get_string() -> *mut std::os::raw::c_char; - fn c_free_string(s: *mut std::os::raw::c_char); -} - -// DO: Copy data into Rust-owned allocation -fn good_string() -> String { - unsafe { - let ptr = c_get_string(); - if ptr.is_null() { - return String::new(); - } - let cstr = CStr::from_ptr(ptr); - let result = cstr.to_string_lossy().into_owned(); - c_free_string(ptr); // Free with correct deallocator - result - } -} - -// DO: Use wrapper that calls correct deallocator -// (named to avoid confusion with std::ffi::CString) -struct ForeignCString { - ptr: *mut std::os::raw::c_char, -} - -impl Drop for ForeignCString { - fn drop(&mut self) { - unsafe { c_free_string(self.ptr); } - } -} - -// DO: Use slice for borrowed view, don't take ownership -/// # Safety -/// `ptr` must be valid for reads of `len` bytes for all of `'a`, and the -/// memory must not be mutated or freed while the slice is alive. -unsafe fn good_slice<'a>(ptr: *const u8, len: usize) -> &'a [u8] { - // Only borrow, don't own - unsafe { std::slice::from_raw_parts(ptr, len) } -} - -// DO: For shared memory, use raw pointers or custom wrapper -struct SharedBuffer { - ptr: *mut u8, - len: usize, -} - -impl SharedBuffer { - fn as_slice(&self) -> &[u8] { - unsafe { std::slice::from_raw_parts(self.ptr, self.len) } - } -} - -impl Drop for SharedBuffer { - fn drop(&mut self) { - // Unmap shared memory, don't deallocate - // munmap(self.ptr, self.len); - } -} -``` - -## Memory Allocation Compatibility - -| Allocator | Can use Rust Vec/String/Box? | -|-----------|------------------------------| -| Rust global allocator | Yes | -| C malloc | No - use wrapper with C free | -| C++ new | No - use wrapper with C++ delete | -| Custom allocator | No - use `allocator_api` (nightly) | -| mmap/shared memory | No - use munmap | -| Stack/static | No - never "free" | - -## Checklist - -- [ ] Who allocated this memory? -- [ ] Is it from Rust's global allocator? -- [ ] If not, do I have a custom Drop that frees correctly? 
-- [ ] Am I copying data or taking ownership?
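A sketch of the "wrapper that frees with the matching deallocator" approach, assuming a target where the C runtime's `malloc`/`free` are linked in (as they are for Rust's standard Linux and macOS targets). `MallocBuf` is a hypothetical type: it owns a `malloc` block, lends it out as a slice, copies into a Rust-owned `Vec` when ownership has to cross over, and releases the block with `free` in `Drop`, so the C memory is never handed to `Vec` or `Box`.

```rust
use std::os::raw::c_void;

extern "C" {
    fn malloc(size: usize) -> *mut c_void;
    fn free(ptr: *mut c_void);
}

/// Owns a block from C's `malloc`; never turned into a `Vec`/`Box`.
struct MallocBuf {
    ptr: *mut u8,
    len: usize,
}

impl MallocBuf {
    /// Copies `data` into a fresh `malloc` allocation.
    fn from_slice(data: &[u8]) -> Option<Self> {
        // malloc(0) may return null, so always request at least one byte.
        let ptr = unsafe { malloc(data.len().max(1)) } as *mut u8;
        if ptr.is_null() {
            return None;
        }
        // SAFETY: `ptr` is a fresh allocation of at least `data.len()` bytes.
        unsafe { std::ptr::copy_nonoverlapping(data.as_ptr(), ptr, data.len()) };
        Some(Self { ptr, len: data.len() })
    }

    fn as_slice(&self) -> &[u8] {
        // SAFETY: `ptr` holds `len` initialized bytes for as long as `self` lives.
        unsafe { std::slice::from_raw_parts(self.ptr, self.len) }
    }

    /// Copies into Rust's allocator; the C block is still freed by `Drop`.
    fn to_vec(&self) -> Vec<u8> {
        self.as_slice().to_vec()
    }
}

impl Drop for MallocBuf {
    fn drop(&mut self) {
        // Free with the allocator that produced the memory.
        unsafe { free(self.ptr as *mut c_void) };
    }
}

fn main() {
    let buf = MallocBuf::from_slice(b"hello").expect("malloc failed");
    assert_eq!(buf.as_slice(), b"hello");
    let owned: Vec<u8> = buf.to_vec();
    drop(buf); // `free` runs here; `owned` lives in Rust's allocator
    assert_eq!(owned, b"hello");
}
```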
- -## Related Rules - -- `mem-02`: Don't modify other process's memory -- `ffi-03`: Implement Drop for wrapped C pointers -- `ffi-07`: Don't implement Drop for types passed to external code diff --git a/.claude/skills/unsafe-checker/rules/mem-04-reentrant.md b/.claude/skills/unsafe-checker/rules/mem-04-reentrant.md deleted file mode 100644 index fbcc3bf61..000000000 --- a/.claude/skills/unsafe-checker/rules/mem-04-reentrant.md +++ /dev/null @@ -1,121 +0,0 @@ ---- -id: mem-04 -original_id: P.UNS.MEM.04 -level: P -impact: HIGH ---- - -# Prefer Reentrant Versions of C-API or Syscalls - -## Summary - -When calling C functions or system calls, use reentrant (`_r`) versions to avoid data races from global state. - -## Rationale - -Many C library functions use static buffers or global state, making them unsafe in multithreaded programs. Reentrant versions use caller-provided buffers instead. - -## Bad Example - -```rust -use std::ffi::CStr; - -extern "C" { - fn strtok(s: *mut i8, delim: *const i8) -> *mut i8; - fn localtime(time: *const i64) -> *mut Tm; - fn rand() -> i32; -} - -// DON'T: Use non-reentrant functions -fn bad_tokenize(s: &mut [i8]) { - unsafe { - let delim = b" \0".as_ptr() as *const i8; - // strtok uses static buffer - not thread-safe! - let token = strtok(s.as_mut_ptr(), delim); - } -} - -fn bad_time() { - unsafe { - let now: i64 = 0; - // localtime returns pointer to static buffer - let tm = localtime(&now); // Data race if called from multiple threads! 
- } -} - -fn bad_random() -> i32 { - // rand() uses global state - not thread-safe - unsafe { rand() } -} -``` - -## Good Example - -```rust -extern "C" { - fn strtok_r(s: *mut i8, delim: *const i8, saveptr: *mut *mut i8) -> *mut i8; - fn localtime_r(time: *const i64, result: *mut Tm) -> *mut Tm; - fn rand_r(seed: *mut u32) -> i32; -} - -// DO: Use reentrant versions -fn good_tokenize(s: &mut [i8]) { - unsafe { - let delim = b" \0".as_ptr() as *const i8; - let mut saveptr: *mut i8 = std::ptr::null_mut(); - // strtok_r uses caller-provided saveptr - let token = strtok_r(s.as_mut_ptr(), delim, &mut saveptr); - } -} - -fn good_time() { - unsafe { - let now: i64 = 0; - let mut result: Tm = std::mem::zeroed(); - // localtime_r writes to caller-provided buffer - localtime_r(&now, &mut result); - } -} - -fn good_random(seed: &mut u32) -> i32 { - // rand_r uses caller-provided seed - unsafe { rand_r(seed) } -} - -// BETTER: Use Rust standard library -fn best_time() { - use std::time::SystemTime; - let now = SystemTime::now(); // Thread-safe! -} - -fn best_random() -> u32 { - use rand::Rng; - rand::thread_rng().gen() // Thread-safe! -} -``` - -## Common Non-Reentrant Functions - -| Non-Reentrant | Reentrant | Rust Alternative | -|---------------|-----------|------------------| -| `strtok` | `strtok_r` | `str::split` | -| `localtime` | `localtime_r` | `chrono` crate | -| `gmtime` | `gmtime_r` | `chrono` crate | -| `ctime` | `ctime_r` | `chrono` crate | -| `rand` | `rand_r` | `rand` crate | -| `strerror` | `strerror_r` | `std::io::Error` | -| `getenv` | None (inherent race) | `std::env::var` (not atomic) | -| `readdir` | `readdir_r` | `std::fs::read_dir` | -| `gethostbyname` | `getaddrinfo` | `std::net::ToSocketAddrs` | - -## Checklist - -- [ ] Am I calling a C function that might use global state? -- [ ] Is there a `_r` reentrant version available? -- [ ] Is there a Rust standard library alternative? -- [ ] If neither, do I need synchronization? 
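To illustrate the `str::split` row of the table: the split iterator keeps its cursor inside itself rather than in hidden global state the way `strtok` does, so independent tokenizations can run on different threads at the same time. The `tokenize` helper below is illustrative, not a standard API.

```rust
use std::thread;

/// Splits on ASCII spaces and skips empty tokens, roughly what repeated
/// `strtok(s, " ")` calls yield, but with the cursor held by the iterator.
fn tokenize(s: &str) -> Vec<&str> {
    s.split(' ').filter(|t| !t.is_empty()).collect()
}

fn main() {
    let lines = ["a  b c", "x y"];
    // Each thread tokenizes its own string concurrently; no shared state.
    let handles: Vec<_> = lines
        .iter()
        .map(|&line| thread::spawn(move || tokenize(line).len()))
        .collect();
    let counts: Vec<usize> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    assert_eq!(counts, vec![3, 2]);
    assert_eq!(tokenize("  lead and trail  "), vec!["lead", "and", "trail"]);
}
```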
- -## Related Rules - -- `ffi-10`: Exported functions must be thread-safe -- `ptr-01`: Don't share raw pointers across threads diff --git a/.claude/skills/unsafe-checker/rules/mem-05-bitfield-crates.md b/.claude/skills/unsafe-checker/rules/mem-05-bitfield-crates.md deleted file mode 100644 index 27ba424c2..000000000 --- a/.claude/skills/unsafe-checker/rules/mem-05-bitfield-crates.md +++ /dev/null @@ -1,147 +0,0 @@ ---- -id: mem-05 -original_id: P.UNS.MEM.05 -level: P -impact: MEDIUM ---- - -# Use Third-Party Crates for Bitfields - -## Summary - -Use crates like `bitflags`, `bitvec`, or `modular-bitfield` instead of manual bit manipulation for complex bitfield operations. - -## Rationale - -- Manual bit manipulation is error-prone -- Easy to get offsets, masks, or endianness wrong -- Crates provide type-safe, tested abstractions -- Proc-macro crates generate efficient code - -## Bad Example - -```rust -// DON'T: Manual bitfield manipulation -struct Flags(u32); - -impl Flags { - const READ: u32 = 1 << 0; - const WRITE: u32 = 1 << 1; - const EXECUTE: u32 = 1 << 2; - - fn has_read(&self) -> bool { - (self.0 & Self::READ) != 0 - } - - fn set_read(&mut self) { - self.0 |= Self::READ; - } - - fn clear_read(&mut self) { - self.0 &= !Self::READ; // Easy to forget the ! - } -} - -// DON'T: Manual packed bitfields for FFI -#[repr(C)] -struct PackedHeader { - data: u32, -} - -impl PackedHeader { - // Error-prone: wrong shift or mask values - fn version(&self) -> u8 { - ((self.data >> 24) & 0xFF) as u8 - } - - fn flags(&self) -> u16 { - ((self.data >> 8) & 0xFFFF) as u16 - } - - fn tag(&self) -> u8 { - (self.data & 0xFF) as u8 - } -} -``` - -## Good Example - -```rust -// DO: Use bitflags for flag sets -use bitflags::bitflags; - -bitflags! 
{ - #[derive(Debug, Clone, Copy, PartialEq, Eq)] - struct Flags: u32 { - const READ = 1 << 0; - const WRITE = 1 << 1; - const EXECUTE = 1 << 2; - const RW = Self::READ.bits() | Self::WRITE.bits(); - } -} - -fn use_flags() { - let mut flags = Flags::READ | Flags::WRITE; - flags.insert(Flags::EXECUTE); - flags.remove(Flags::WRITE); - - if flags.contains(Flags::READ) { - println!("Readable"); - } -} - -// DO: Use modular-bitfield for packed structures -use modular_bitfield::prelude::*; - -#[bitfield] -#[repr(C)] -struct PackedHeader { - tag: B8, // 8 bits - flags: B16, // 16 bits - version: B8, // 8 bits -} - -fn use_packed() { - let header = PackedHeader::new() - .with_version(1) - .with_flags(0x1234) - .with_tag(0xAB); - - assert_eq!(header.version(), 1); - assert_eq!(header.flags(), 0x1234); -} - -// DO: Use bitvec for arbitrary bit manipulation -use bitvec::prelude::*; - -fn use_bitvec() { - let mut bits = bitvec![u8, Msb0; 0; 16]; - bits.set(0, true); - bits.set(7, true); - - let byte: u8 = bits[0..8].load_be(); - assert_eq!(byte, 0b1000_0001); -} -``` - -## Recommended Crates - -| Crate | Use Case | Features | -|-------|----------|----------| -| `bitflags` | Flag sets (like C enums) | Type-safe, const, derives | -| `modular-bitfield` | Packed struct fields | Proc macro, repr(C) | -| `bitvec` | Arbitrary bit arrays | Slicing, iteration | -| `packed_struct` | Binary protocol structs | Endianness, derive | -| `deku` | Binary parsing | Derive, read/write | - -## Checklist - -- [ ] Am I manipulating multiple bit flags? → Use `bitflags` -- [ ] Am I packing fields into bytes? → Use `modular-bitfield` or `packed_struct` -- [ ] Am I doing binary protocol work? → Consider `deku` -- [ ] Is the manual approach really simpler? 
-
-## Related Rules
-
-- `mem-01`: Choose appropriate data layout
-- `ffi-13`: Ensure consistent data layout
diff --git a/.claude/skills/unsafe-checker/rules/mem-06-maybeuninit.md b/.claude/skills/unsafe-checker/rules/mem-06-maybeuninit.md
deleted file mode 100644
index a51f75e04..000000000
--- a/.claude/skills/unsafe-checker/rules/mem-06-maybeuninit.md
+++ /dev/null
@@ -1,146 +0,0 @@
----
-id: mem-06
-original_id: G.UNS.MEM.01
-level: G
-impact: HIGH
-clippy: uninit_assumed_init, uninit_vec
----
-
-# Use MaybeUninit for Uninitialized Memory
-
-## Summary
-
-Use `MaybeUninit<T>` instead of `mem::uninitialized()` or `mem::zeroed()` when working with uninitialized memory.
-
-## Rationale
-
-- `mem::uninitialized()` is deprecated and unsound: it is UB for almost every type
-- `mem::zeroed()` is UB for types where all-zero bytes are invalid (references, `Box`, `NonZero*`, function pointers)
-- `MaybeUninit<T>` clearly marks memory as potentially uninitialized
-- `MaybeUninit<T>` has no validity requirement, so holding uninitialized bytes in it is not UB; only calling `assume_init` on them is
-
-## Bad Example
-
-```rust
-// DON'T: Use deprecated uninitialized
-fn bad_uninit<T>() -> T {
-    unsafe { std::mem::uninitialized() } // Deprecated, UB
-}
-
-// DON'T: Use zeroed for types where zero is invalid
-fn bad_zeroed() -> &'static str {
-    unsafe { std::mem::zeroed() } // UB: null reference
-}
-
-fn bad_zeroed_nonzero() -> std::num::NonZeroU32 {
-    unsafe { std::mem::zeroed() } // UB: NonZeroU32 can never be 0
-}
-
-// DON'T: Transmute to "initialize"
-fn bad_transmute() -> [String; 10] {
-    unsafe { std::mem::transmute([0u8; std::mem::size_of::<[String; 10]>()]) }
-}
-
-// DON'T: Set Vec length without initializing
-fn bad_vec() -> Vec<String> {
-    let mut v = Vec::with_capacity(10);
-    unsafe { v.set_len(10); } // Elements are uninitialized!
-    v
-}
-```
-
-## Good Example
-
-```rust
-use std::mem::MaybeUninit;
-
-// DO: Use MaybeUninit for delayed initialization
-fn good_array() -> [String; 10] {
-    let mut arr: [MaybeUninit<String>; 10] =
-        unsafe { MaybeUninit::uninit().assume_init() };
-
-    for (i, elem) in arr.iter_mut().enumerate() {
-        elem.write(format!("item {}", i));
-    }
-
-    // SAFETY: All elements initialized above
-    unsafe { std::mem::transmute::<_, [String; 10]>(arr) }
-}
-
-// DO: Use MaybeUninit with arrays (nightly: array_assume_init)
-fn good_array_nightly() -> [String; 10] {
-    let mut arr: [MaybeUninit<String>; 10] =
-        [const { MaybeUninit::uninit() }; 10];
-
-    for (i, elem) in arr.iter_mut().enumerate() {
-        elem.write(format!("item {}", i));
-    }
-
-    // Stable alternative: arr.map(|e| unsafe { e.assume_init() })
-    // SAFETY: All elements initialized above
-    unsafe { MaybeUninit::array_assume_init(arr) }
-}
-
-// DO: Use zeroed only for types where it's valid
-fn good_zeroed() -> [u8; 1024] {
-    // SAFETY: All-zero bytes is valid for u8
-    unsafe { std::mem::zeroed() }
-}
-
-// DO: Initialize buffer properly
-// Option 1: Resize with a default value
-fn good_vec() -> Vec<u8> {
-    let mut v = Vec::with_capacity(1024);
-    v.resize(1024, 0);
-    v
-}
-
-// Option 2: Write into spare capacity, then set the length
-fn good_vec_spare() -> Vec<u8> {
-    let mut v: Vec<u8> = Vec::with_capacity(1024);
-    for elem in &mut v.spare_capacity_mut()[..1024] {
-        elem.write(0);
-    }
-    // SAFETY: The first 1024 elements were initialized above
-    unsafe { v.set_len(1024); }
-    v
-}
-
-// DO: Use MaybeUninit::uninit_array (nightly) or a const array
-fn good_uninit_array<T, const N: usize>() -> [MaybeUninit<T>; N] {
-    // Stable (Rust 1.79+): inline const in an array repeat expression
-    [const { MaybeUninit::uninit() }; N]
-}
-```
-
-## MaybeUninit API
-
-```rust
-use std::mem::MaybeUninit;
-
-// Creation
-let mut uninit: MaybeUninit<T> = MaybeUninit::uninit();
-let zeroed: MaybeUninit<T> = MaybeUninit::zeroed();
-let init: MaybeUninit<T> = MaybeUninit::new(value);
-
-// Writing
-uninit.write(value); // Returns &mut T
-
-// Reading (unsafe; only after initialization)
-let ref_: &T = unsafe { uninit.assume_init_ref() };
-let mut_: &mut T = unsafe { uninit.assume_init_mut() };
-
-// Pointer access
-let ptr: *const T = uninit.as_ptr();
-let mut_ptr: *mut T = uninit.as_mut_ptr();
-
-// Consuming (unsafe): moves the value out of `uninit`
-let value: T = unsafe { uninit.assume_init() };
-```
-
-## Checklist
-
-- [ ] Am I using `mem::uninitialized()`? → Replace with `MaybeUninit`
-- [ ] Am I using `mem::zeroed()` for types where all-zero bytes are invalid? → Use `MaybeUninit`
-- [ ] Am I setting Vec length without initialization? → Use proper initialization
-- [ ] Have I initialized all MaybeUninit before assume_init?
-
-## Related Rules
-
-- `safety-03`: Don't expose uninitialized memory in APIs
-- `safety-01`: Panic safety with partial initialization
diff --git a/.claude/skills/unsafe-checker/rules/ptr-01-no-thread-share.md b/.claude/skills/unsafe-checker/rules/ptr-01-no-thread-share.md
deleted file mode 100644
index c7d6e5724..000000000
--- a/.claude/skills/unsafe-checker/rules/ptr-01-no-thread-share.md
+++ /dev/null
@@ -1,113 +0,0 @@
----
-id: ptr-01
-original_id: P.UNS.PTR.01
-level: P
-impact: CRITICAL
----
-
-# Do Not Share Raw Pointers Across Threads
-
-## Summary
-
-Raw pointers (`*const T`, `*mut T`) are not `Send` or `Sync` by default. Do not share them across threads without ensuring proper synchronization.
-
-## Rationale
-
-Raw pointers have no synchronization guarantees. Sharing them across threads can lead to data races, which are undefined behavior.
-
-## Bad Example
-
-```rust
-use std::thread;
-
-// DON'T: Share raw pointers across threads
-fn bad_sharing() {
-    let mut data = 42i32;
-    let ptr = &mut data as *mut i32;
-
-    // Rejected by the compiler as written: *mut i32 is not Send.
-    // Forcing it through (e.g. with UnsafePtr below) makes this a data race: UB.
-    let handle = thread::spawn(move || {
-        unsafe { *ptr = 100; }
-    });
-
-    // Main thread also accesses - data race!
-    unsafe { *ptr = 200; }
-
-    handle.join().unwrap();
-}
-
-// DON'T: Wrap in struct and impl Send unsafely
-struct UnsafePtr(*mut i32);
-unsafe impl Send for UnsafePtr {} // Unsound without synchronization!
-```
-
-## Good Example
-
-```rust
-use std::sync::{Arc, Mutex, atomic::{AtomicPtr, Ordering}};
-use std::thread;
-
-// DO: Use Arc<Mutex<T>> for shared mutable access
-fn good_mutex() {
-    let data = Arc::new(Mutex::new(42i32));
-    let data_clone = Arc::clone(&data);
-
-    let handle = thread::spawn(move || {
-        *data_clone.lock().unwrap() = 100;
-    });
-
-    *data.lock().unwrap() = 200;
-    handle.join().unwrap();
-}
-
-// DO: Use AtomicPtr for lock-free pointer sharing
-fn good_atomic() {
-    let data = Box::into_raw(Box::new(42i32));
-    let atomic_ptr = Arc::new(AtomicPtr::new(data));
-    let atomic_clone = Arc::clone(&atomic_ptr);
-
-    let handle = thread::spawn(move || {
-        let ptr = atomic_clone.load(Ordering::Acquire);
-        // SAFETY: The pointee was initialized before the thread was spawned
-        // and is neither written nor freed until after join()
-        unsafe { println!("Value: {}", *ptr); }
-    });
-
-    handle.join().unwrap();
-
-    // SAFETY: All threads done, we own the memory
-    unsafe { drop(Box::from_raw(atomic_ptr.load(Ordering::Relaxed))); }
-}
-
-// DO: Prefer moving ownership into the thread over sharing a pointer
-fn good_exclusive() {
-    let mut data = vec![1, 2, 3];
-
-    // Send data ownership to thread, not pointer
-    let handle = thread::spawn(move || {
-        data.push(4);
-        data
-    });
-
-    let data = handle.join().unwrap();
-    println!("{:?}", data);
-}
-```
-
-## When Raw Pointers Across Threads Are Valid
-
-Only with proper synchronization:
-- Through `AtomicPtr` with appropriate memory orderings
-- Protected by a `Mutex` (don't share the pointer, share the Mutex)
-- Using lock-free algorithms with careful memory ordering
-
-## Checklist
-
-- [ ] Does my pointer cross thread boundaries?
-- [ ] Is there synchronization preventing concurrent access?
-- [ ] Can I use a higher-level abstraction (Arc, Mutex)?
-- [ ] If implementing Send/Sync, is thread safety proven?
-
-## Related Rules
-
-- `safety-05`: Consider safety when implementing Send/Sync
-- `safety-02`: Verify safety invariants
diff --git a/.claude/skills/unsafe-checker/rules/ptr-02-prefer-nonnull.md b/.claude/skills/unsafe-checker/rules/ptr-02-prefer-nonnull.md
deleted file mode 100644
index 47505dd71..000000000
--- a/.claude/skills/unsafe-checker/rules/ptr-02-prefer-nonnull.md
+++ /dev/null
@@ -1,114 +0,0 @@
----
-id: ptr-02
-original_id: P.UNS.PTR.02
-level: P
-impact: MEDIUM
----
-
-# Prefer NonNull Over *mut T
-
-## Summary
-
-Use `NonNull<T>` instead of `*mut T` when the pointer should never be null. This enables null pointer optimization and makes the intent clear.
-
-## Rationale
-
-- `NonNull<T>` guarantees non-null at the type level
-- Enables niche optimization: `Option<NonNull<T>>` is the same size as `*mut T`
-- Makes invariants explicit in the type system
-- Covariant over `T` (like `&T`), which is usually what you want
-
-## Bad Example
-
-```rust
-// DON'T: Use *mut when pointer is always non-null
-struct MyBox<T> {
-    ptr: *mut T, // Invariant: never null, but not enforced
-}
-
-impl<T> MyBox<T> {
-    pub fn new(value: T) -> Self {
-        let ptr = Box::into_raw(Box::new(value));
-        // ptr is guaranteed non-null, but type doesn't show it
-        Self { ptr }
-    }
-
-    pub fn get(&self) -> &T {
-        // Must add null check or document the invariant
-        unsafe { &*self.ptr }
-    }
-}
-```
-
-## Good Example
-
-```rust
-use std::ptr::NonNull;
-
-// DO: Use NonNull when pointer is never null
-struct MyBox<T> {
-    ptr: NonNull<T>, // Type guarantees non-null
-}
-
-impl<T> MyBox<T> {
-    pub fn new(value: T) -> Self {
-        let ptr = Box::into_raw(Box::new(value));
-        // SAFETY: Box::into_raw never returns null
-        let ptr = unsafe { NonNull::new_unchecked(ptr) };
-        Self { ptr }
-    }
-
-    pub fn get(&self) -> &T {
-        // SAFETY: ptr came from Box::into_raw and is freed only in Drop,
-        // so it is valid while &self lives (NonNull alone only rules out null)
-        unsafe { self.ptr.as_ref() }
-    }
-}
-
-impl<T> Drop for MyBox<T> {
-    fn drop(&mut self) {
-        // SAFETY: ptr was created from Box::into_raw
-        unsafe { drop(Box::from_raw(self.ptr.as_ptr())); }
-    }
-}
-
-// DO: Niche optimization with Option
-struct OptionalBox<T> {
-    ptr: Option<NonNull<T>>, // Same size as *mut T!
-}
-```
-
-## NonNull API
-
-```rust
-use std::ptr::NonNull;
-
-// Creating NonNull
-let ptr: NonNull<i32> = NonNull::new(raw_ptr).expect("null pointer");
-let mut ptr: NonNull<i32> = unsafe { NonNull::new_unchecked(raw_ptr) };
-let dangling: NonNull<i32> = NonNull::dangling(); // Aligned placeholder for ZSTs or empty allocations; never dereference
-
-// Using NonNull
-let raw: *mut i32 = ptr.as_ptr();
-let reference: &i32 = unsafe { ptr.as_ref() };
-let mut_ref: &mut i32 = unsafe { ptr.as_mut() };
-
-// Casting
-let ptr: NonNull<u8> = ptr.cast::<u8>();
-```
-
-## When to Use *mut T Instead
-
-- When null is a valid/expected value
-- FFI with C code that may return null
-- When variance matters (NonNull is covariant, sometimes you need invariance)
-
-## Checklist
-
-- [ ] Is my pointer ever null? If no, use NonNull
-- [ ] Do I need null pointer optimization?
-- [ ] Is the variance correct for my use case?
-
-## Related Rules
-
-- `ptr-03`: Use PhantomData for variance and ownership
-- `safety-06`: Don't expose raw pointers in public APIs
diff --git a/.claude/skills/unsafe-checker/rules/ptr-03-phantomdata.md b/.claude/skills/unsafe-checker/rules/ptr-03-phantomdata.md
deleted file mode 100644
index 7ce8b85ed..000000000
--- a/.claude/skills/unsafe-checker/rules/ptr-03-phantomdata.md
+++ /dev/null
@@ -1,125 +0,0 @@
----
-id: ptr-03
-original_id: P.UNS.PTR.03
-level: P
-impact: HIGH
----
-
-# Use PhantomData for Variance and Ownership with Pointer Generics
-
-## Summary
-
-When a struct contains raw pointers but logically owns or borrows the pointed-to data, use `PhantomData` to tell the compiler about the relationship.
-
-## Rationale
-
-Raw pointers don't carry ownership or lifetime information. `PhantomData` lets you:
-- Indicate ownership (for `Drop` check)
-- Control variance (covariant, contravariant, invariant)
-- Use a lifetime or type parameter that no field otherwise mentions
-
-## Bad Example
-
-```rust
-// DON'T: Raw pointer without PhantomData
-struct MyVec<T> {
-    ptr: *mut T,
-    len: usize,
-    cap: usize,
-}
-
-// Problems:
-// 1. Compiler doesn't know we "own" the T values
-// 2. `*mut T` makes MyVec invariant in T (std's Vec<T> is covariant)
-// 3. Drop check may allow dangling references
-```
-
-## Good Example
-
-```rust
-use std::marker::PhantomData;
-use std::ptr::NonNull;
-
-// DO: Use PhantomData to express ownership
-struct MyVec<T> {
-    ptr: NonNull<T>,
-    len: usize,
-    cap: usize,
-    _marker: PhantomData<T>, // We own T values
-}
-
-// For owned data: PhantomData<T>
-// For borrowed data: PhantomData<&'a T>
-// For mutably borrowed: PhantomData<&'a mut T>
-// For function pointers: PhantomData<fn(T)> (contravariant)
-
-// DO: Express lifetime relationships
-struct Iter<'a, T> {
-    ptr: *const T,
-    end: *const T,
-    _marker: PhantomData<&'a T>, // Borrows T for 'a
-}
-
-impl<'a, T> Iterator for Iter<'a, T> {
-    type Item = &'a T;
-
-    fn next(&mut self) -> Option<Self::Item> {
-        if self.ptr == self.end {
-            None
-        } else {
-            // SAFETY: ptr < end, so ptr is valid
-            // Lifetime is tied to 'a through PhantomData
-            let current = unsafe { &*self.ptr };
-            self.ptr = unsafe { self.ptr.add(1) };
-            Some(current)
-        }
-    }
-}
-```
-
-## PhantomData Patterns
-
-| Phantom Type | Meaning | Variance |
-|--------------|---------|----------|
-| `PhantomData<T>` | Owns T | Covariant |
-| `PhantomData<&'a T>` | Borrows T for 'a | Covariant in T, covariant in 'a |
-| `PhantomData<&'a mut T>` | Mutably borrows T | Invariant in T, covariant in 'a |
-| `PhantomData<*const T>` | Just has pointer | Covariant |
-| `PhantomData<*mut T>` | Just has pointer | Invariant |
-| `PhantomData<fn(T)>` | Consumes T | Contravariant |
-| `PhantomData<fn() -> T>` | Produces T | Covariant |
-
-## Drop Check
-
-```rust
-use std::marker::PhantomData;
-use std::ptr::NonNull;
-
-// This tells the compiler that dropping MyVec<T> may drop T values
-struct MyVec<T> {
-    ptr: NonNull<T>,
-    _marker: PhantomData<T>,
-}
-
-impl<T> Drop for MyVec<T> {
-    fn drop(&mut self) {
-        // Drop all T values...
-    }
-}
-
-// The marker matters most when the Drop impl uses `#[may_dangle]`
-// (as std's Vec does): without PhantomData<T>, drop check would then
-// accept a MyVec<T> whose T values' own destructors observe dangling data.
-```
-
-## Checklist
-
-- [ ] Does my pointer type logically own the pointed-to data?
-- [ ] Do I need to express a lifetime relationship?
-- [ ] What variance do I need for my generic parameter?
-- [ ] Will the type be dropped, and does it need drop check?
-
-## Related Rules
-
-- `ptr-02`: Prefer NonNull over *mut T
-- `safety-05`: Send/Sync implementation safety
diff --git a/.claude/skills/unsafe-checker/rules/ptr-04-alignment.md b/.claude/skills/unsafe-checker/rules/ptr-04-alignment.md
deleted file mode 100644
index 53479f838..000000000
--- a/.claude/skills/unsafe-checker/rules/ptr-04-alignment.md
+++ /dev/null
@@ -1,117 +0,0 @@
----
-id: ptr-04
-original_id: G.UNS.PTR.01
-level: G
-impact: HIGH
-clippy: cast_ptr_alignment
----
-
-# Do Not Dereference Pointers Cast to Misaligned Types
-
-## Summary
-
-When casting a pointer to a different type, ensure the resulting pointer is properly aligned for the target type.
-
-## Rationale
-
-In Rust, dereferencing a misaligned pointer is undefined behavior on every target, even where the hardware tolerates unaligned loads. Where the hardware does support them, they may also be slower.
-
-## Bad Example
-
-```rust
-// DON'T: Cast without checking alignment
-fn bad_cast(bytes: &[u8]) -> u32 {
-    // BAD: bytes might not be aligned for u32
-    let ptr = bytes.as_ptr() as *const u32;
-    unsafe { *ptr } // UB if misaligned!
-}
-
-// DON'T: Assume struct layout
-#[repr(C)]
-struct Header {
-    flags: u8,
-    value: u32, // Aligned at offset 4 in the struct
-}
-
-fn bad_field_access(bytes: &[u8]) -> u32 {
-    let header = bytes.as_ptr() as *const Header;
-    // UB unless bytes is aligned for Header (align 4)
-    // and at least size_of::<Header>() (8) bytes long
-    unsafe { (*header).value }
-}
-```
-
-## Good Example
-
-```rust
-// DO: Use read_unaligned for potentially misaligned data
-fn good_cast(bytes: &[u8]) -> u32 {
-    assert!(bytes.len() >= 4);
-    let ptr = bytes.as_ptr() as *const u32;
-    // SAFETY: Length checked above; read_unaligned has no alignment requirement
-    unsafe { ptr.read_unaligned() }
-}
-
-// DO: Check alignment before cast
-fn good_aligned_cast(bytes: &[u8]) -> Option<&u32> {
-    if bytes.len() >= 4 && bytes.as_ptr() as usize % std::mem::align_of::<u32>() == 0 {
-        // SAFETY: Checked length and alignment
-        Some(unsafe { &*(bytes.as_ptr() as *const u32) })
-    } else {
-        None
-    }
-}
-
-// DO: Use from_ne_bytes (or from_le_bytes/from_be_bytes for a fixed byte order)
-fn good_from_bytes(bytes: &[u8]) -> u32 {
-    u32::from_ne_bytes(bytes[..4].try_into().unwrap())
-}
-
-// DO: Use bytemuck for safe transmutation
-// use bytemuck::{Pod, Zeroable};
-// let value: u32 = bytemuck::pod_read_unaligned(&bytes[..4]); // slice must be exactly 4 bytes
-
-// DO: Use align_to for splitting at alignment boundaries
-fn process_aligned(bytes: &[u8]) {
-    let (prefix, aligned, suffix) = unsafe { bytes.align_to::<u32>() };
-    // prefix and suffix are unaligned portions
-    // aligned is a &[u32] that's properly aligned
-}
-```
-
-## Alignment Check Helpers
-
-```rust
-fn is_aligned<T>(ptr: *const u8) -> bool {
-    ptr as usize % std::mem::align_of::<T>() == 0
-}
-
-/// Align a pointer up to the next aligned address
-fn align_up<T>(ptr: *const u8) -> *const u8 {
-    let align = std::mem::align_of::<T>();
-    let addr = ptr as usize;
-    let aligned = (addr + align - 1) & !(align - 1);
-    aligned as *const u8
-}
-```
-
-## Architecture Notes
-
-The table describes hardware only. In Rust, a misaligned dereference is UB on every target listed here, so use `read_unaligned` rather than relying on the hardware.
-
-| Arch | Hardware behavior on misaligned access |
-|------|----------------------------------------|
-| x86/x64 | Supported; may be slower |
-| ARM | Depends on core and instruction; may trap |
-| RISC-V | May trap or be emulated slowly |
-| WASM | Supported (alignment is only a hint) |
-
-## Checklist
-
-- [ ] Is my pointer cast changing alignment requirements?
-- [ ] Is the source pointer guaranteed to be aligned?
-- [ ] Should I use read_unaligned instead?
-- [ ] Can I use safe conversion methods (from_ne_bytes)?
-
-## Related Rules
-
-- `mem-01`: Choose appropriate data layout
-- `ffi-13`: Ensure consistent data layout
diff --git a/.claude/skills/unsafe-checker/rules/ptr-05-no-const-to-mut.md b/.claude/skills/unsafe-checker/rules/ptr-05-no-const-to-mut.md
deleted file mode 100644
index a719cbf2e..000000000
--- a/.claude/skills/unsafe-checker/rules/ptr-05-no-const-to-mut.md
+++ /dev/null
@@ -1,120 +0,0 @@
----
-id: ptr-05
-original_id: G.UNS.PTR.02
-level: G
-impact: CRITICAL
-clippy: cast_ref_to_mut
----
-
-# Do Not Manually Convert Immutable Pointer to Mutable
-
-## Summary
-
-Never cast a `*const T` derived from a shared reference `&T` to `*mut T` and write through it. This violates aliasing rules and is undefined behavior.
-
-## Rationale
-
-A `&T` promises that the data will not change while the reference is live (unless it sits inside an `UnsafeCell`), and other references to it may exist. Writing through a `*mut T` obtained by casting away that `const` breaks the promise, which is UB.
-
-## Bad Example
-
-```rust
-// DON'T: Cast *const to *mut
-fn bad_mutate(value: &i32) {
-    let ptr = value as *const i32 as *mut i32;
-    unsafe { *ptr = 42; } // UB: Mutating through &
-}
-
-// DON'T: Use transmute to convert
-fn bad_transmute(value: &i32) -> &mut i32 {
-    unsafe { std::mem::transmute(value) } // UB!
-}
-
-// DON'T: "I know this is the only reference"
-fn bad_claim(value: &i32) {
-    // Even if you "know" there's only one reference,
-    // the compiler assumes & means no mutation
-    let ptr = value as *const i32 as *mut i32;
-    unsafe { *ptr += 1; } // Still UB - compiler may optimize incorrectly
-}
-```
-
-## Good Example
-
-```rust
-// DO: Take &mut if you need to mutate
-fn good_mutate(value: &mut i32) {
-    *value = 42;
-}
-
-// DO: Use interior mutability
-use std::cell::{Cell, RefCell, UnsafeCell};
-
-struct Mutable {
-    value: Cell<i32>, // Interior mutability
-}
-
-impl Mutable {
-    fn modify(&self) {
-        self.value.set(42); // OK: Cell provides interior mutability
-    }
-}
-
-// DO: Use UnsafeCell if you need raw unsafe interior mutability
-struct RawMutable {
-    value: UnsafeCell<i32>,
-}
-
-impl RawMutable {
-    fn modify(&self) {
-        // SAFETY: No other reference to the inner value is live during this write
-        // (RawMutable is !Sync, so no other thread can reach it)
-        unsafe { *self.value.get() = 42; }
-    }
-}
-```
-
-## The UnsafeCell Exception
-
-`UnsafeCell<T>` is the only sound way to mutate data reached through `&self`; `Cell`, `RefCell`, and `Mutex` are all built on it:
-
-```rust
-use std::cell::UnsafeCell;
-
-pub struct MyMutex<T> {
-    data: UnsafeCell<T>,
-    // ... lock state
-}
-
-impl<T> MyMutex<T> {
-    pub fn lock(&self) -> Guard<'_, T> {
-        // acquire lock...
-
-        // SAFETY: UnsafeCell allows this, lock ensures exclusivity
-        Guard { data: unsafe { &mut *self.data.get() } }
-    }
-}
-```
-
-## Why This Is Always UB
-
-The compiler assumes:
-1. `&T` means no mutation will occur (outside `UnsafeCell`)
-2. Multiple `&T` can exist simultaneously
-3. Optimizations can be made based on these assumptions
-
-When you mutate through a cast pointer:
-1. Other `&T` references see inconsistent values
-2. Compiler may cache/eliminate reads
-3. Results are unpredictable
-
-## Checklist
-
-- [ ] Am I trying to mutate through `&`?
-- [ ] Should I use `&mut` instead?
-- [ ] Should I use `Cell`, `RefCell`, or `UnsafeCell`?
-- [ ] Is the original type designed for interior mutability?
-
-## Related Rules
-
-- `safety-08`: Mutable return from immutable parameter is wrong
-- `safety-02`: Verify safety invariants
diff --git a/.claude/skills/unsafe-checker/rules/ptr-06-prefer-cast.md b/.claude/skills/unsafe-checker/rules/ptr-06-prefer-cast.md
deleted file mode 100644
index 8397516e0..000000000
--- a/.claude/skills/unsafe-checker/rules/ptr-06-prefer-cast.md
+++ /dev/null
@@ -1,110 +0,0 @@
----
-id: ptr-06
-original_id: G.UNS.PTR.03
-level: G
-impact: LOW
-clippy: ptr_as_ptr
----
-
-# Prefer pointer::cast Over `as` for Pointer Casting
-
-## Summary
-
-Use the `cast()` method instead of `as` for pointer type conversions. It's clearer and cannot accidentally change mutability or turn the pointer into an integer.
-
-## Rationale
-
-- `cast()` only changes the pointed-to type, not mutability
-- `as` also converts between pointers and integers, so a round-trip through `usize` can slip in unnoticed and weaken provenance
-- `cast()` is more explicit about intent
-- Clippy's `ptr_as_ptr` lint can flag the `as` form automatically
-
-## Bad Example
-
-```rust
-// DON'T: Use `as` for pointer casts
-fn bad_cast(ptr: *const u8) -> *const i32 {
-    ptr as *const i32 // Works, but less clear
-}
-
-// DON'T: Accidental integer round-trip
-fn bad_roundtrip(ptr: *const u8) -> *const u8 {
-    let addr = ptr as usize; // Converts to integer
-    addr as *const u8 // Relies on exposed provenance instead of carrying it
-}
-
-// DON'T: Multiple `as` casts in chain
-fn bad_chain(ptr: *const u8) -> *mut i32 {
-    ptr as *mut u8 as *mut i32 // Hard to follow
-}
-```
-
-## Good Example
-
-```rust
-// DO: Use cast() for pointer type changes
-fn good_cast(ptr: *const u8) -> *const i32 {
-    ptr.cast::<i32>()
-}
-
-// DO: Use cast_mut() for const-to-mut (when valid)
-fn good_cast_mut(ptr: *const u8) -> *mut u8 {
-    ptr.cast_mut() // Only use when mutation is valid!
-}
-
-// DO: Use cast_const() for mut-to-const
-fn good_cast_const(ptr: *mut u8) -> *const u8 {
-    ptr.cast_const()
-}
-
-// DO: Chain casts clearly
-fn good_chain(ptr: *const u8) -> *mut i32 {
-    ptr.cast_mut().cast::<i32>()
-}
-
-// DO: Use with_addr() for address manipulation (stable since Rust 1.84)
-fn good_provenance(ptr: *const u8, new_addr: usize) -> *const u8 {
-    ptr.with_addr(new_addr) // Preserves provenance
-}
-```
-
-## Pointer Method Reference
-
-| Method | From | To | Notes |
-|--------|------|-----|-------|
-| `.cast::<U>()` | `*T` | `*U` | Changes pointee type |
-| `.cast_mut()` | `*const T` | `*mut T` | Removes const |
-| `.cast_const()` | `*mut T` | `*const T` | Adds const |
-| `.addr()` | `*T` | `usize` | Gets address without exposing provenance (Rust 1.84+) |
-| `.with_addr(usize)` | `*T` | `*T` | Changes address, keeps provenance (Rust 1.84+) |
-| `.map_addr(fn)` | `*T` | `*T` | Transforms address, keeps provenance (Rust 1.84+) |
-
-## Provenance Considerations
-
-```rust
-// Provenance = permission to access memory
-
-// BAD: Integer round-trip; ptr2 only has exposed provenance
-let ptr: *const u8 = &data as *const u8;
-let addr = ptr as usize;
-let ptr2 = addr as *const u8; // Miri's strict-provenance mode rejects this cast
-
-// GOOD: Preserves provenance (stable since Rust 1.84)
-let ptr2 = ptr.with_addr(addr); // Still has permission
-
-// GOOD: Use expose/with_exposed when provenance must cross an integer
-let addr = ptr.expose_provenance(); // "Expose" the provenance
-let ptr2 = std::ptr::with_exposed_provenance::<u8>(addr); // Recover it
-```
-
-## Checklist
-
-- [ ] Am I using `as` where `cast()` would be clearer?
-- [ ] Am I accidentally converting through `usize`?
-- [ ] Do I need to preserve provenance?
-
-## Related Rules
-
-- `ptr-04`: Alignment considerations when casting
-- `ptr-05`: Don't convert const to mut improperly
diff --git a/.claude/skills/unsafe-checker/rules/safety-01-panic-safety.md b/.claude/skills/unsafe-checker/rules/safety-01-panic-safety.md
deleted file mode 100644
index 7b02278b6..000000000
--- a/.claude/skills/unsafe-checker/rules/safety-01-panic-safety.md
+++ /dev/null
@@ -1,113 +0,0 @@
----
-id: safety-01
-original_id: P.UNS.SAS.01
-level: P
-impact: CRITICAL
-clippy: panic_in_result_fn
----
-
-# Be Aware of Memory Safety Issues from Panics
-
-## Summary
-
-Panics in unsafe code can leave data structures in an inconsistent state, leading to undefined behavior when destructors run during unwinding or when the panic is caught and the structure is used again.
-
-## Rationale
-
-When a panic occurs, Rust unwinds the stack and runs destructors. If unsafe code has partially modified data, the destructors may observe invalid state.
-
-## Bad Example
-
-```rust
-use std::ptr;
-
-// DON'T: Panic can leave Vec in invalid state
-impl<T: Clone> MyVec<T> {
-    pub fn push(&mut self, value: T) {
-        if self.len == self.cap {
-            self.grow(); // Might panic during allocation
-        }
-
-        unsafe {
-            // If Clone::clone() panics after incrementing len,
-            // drop will try to drop uninitialized memory
-            self.len += 1;
-            ptr::write(self.ptr.add(self.len - 1), value.clone());
-        }
-    }
-}
-```
-
-## Good Example
-
-```rust
-use std::ptr;
-
-// DO: Ensure panic safety by ordering operations correctly
-impl<T> MyVec<T> {
-    pub fn push(&mut self, value: T) {
-        if self.len == self.cap {
-            self.grow();
-        }
-
-        unsafe {
-            // Write first, then increment len, so len never
-            // covers a slot that has not been written yet
-            ptr::write(self.ptr.add(self.len), value);
-            self.len += 1; // Only increment after successful write
-        }
-    }
-}
-
-// DO: Use guards for complex operations
-impl<T: Clone> MyVec<T> {
-    pub fn extend_from_slice(&mut self, slice: &[T]) {
-        self.reserve(slice.len());
-
-        let mut guard = PanicGuard {
-            vec: self,
-            initialized: 0,
-        };
-
-        for item in slice {
-            unsafe {
-                ptr::write(guard.vec.ptr.add(guard.vec.len + guard.initialized), item.clone());
-                guard.initialized += 1;
-            }
-        }
-
-        // Success: disarm the guard (ending its borrow of self), then publish the new length
-        let initialized = guard.initialized;
-        std::mem::forget(guard);
-        self.len += initialized;
-    }
-}
-
-struct PanicGuard<'a, T> {
-    vec: &'a mut MyVec<T>,
-    initialized: usize,
-}
-
-impl<T> Drop for PanicGuard<'_, T> {
-    fn drop(&mut self) {
-        // Clean up partially initialized elements on panic
-        unsafe {
-            for i in 0..self.initialized {
-                ptr::drop_in_place(self.vec.ptr.add(self.vec.len + i));
-            }
-        }
-    }
-}
-```
-
-## Key Patterns
-
-1. **Update bookkeeping after operations**: Increment length only after writing
-2. **Use panic guards**: RAII types that clean up on panic
-3. **Order operations carefully**: Ensure invariants hold if panic occurs at any point
-
-## Checklist
-
-- [ ] What happens if this code panics at each line?
-- [ ] Are all invariants maintained if we unwind from here?
-- [ ] Do I need a panic guard for cleanup?
-
-## Related Rules
-
-- `safety-04`: Avoid double-free from panic safety issues
-- `safety-02`: Verify safety invariants
diff --git a/.claude/skills/unsafe-checker/rules/safety-02-verify-invariants.md b/.claude/skills/unsafe-checker/rules/safety-02-verify-invariants.md
deleted file mode 100644
index 63b972961..000000000
--- a/.claude/skills/unsafe-checker/rules/safety-02-verify-invariants.md
+++ /dev/null
@@ -1,90 +0,0 @@
----
-id: safety-02
-original_id: P.UNS.SAS.02
-level: P
-impact: CRITICAL
----
-
-# Unsafe Code Authors Must Verify Safety Invariants
-
-## Summary
-
-When writing unsafe code, you are taking responsibility for upholding all safety invariants that the compiler normally enforces.
-
-## Rationale
-
-Unsafe blocks don't disable safety requirements; they transfer responsibility from the compiler to the programmer. You must manually verify what the compiler normally checks.
-
-## Safety Invariants to Verify
-
-1. **Pointer validity**: Non-null, aligned, points to valid memory
-2. **Aliasing**: No mutable aliasing (two &mut to same memory)
-3. **Initialization**: Memory is initialized before read
-4. **Lifetime**: References don't outlive their referents
-5. **Type validity**: Data matches the expected type's invariants
-6. **Thread safety**: Proper synchronization for concurrent access
-
-## Bad Example
-
-```rust
-// DON'T: Blindly trust inputs
-unsafe fn process(ptr: *const Data, len: usize) {
-    for i in 0..len {
-        // No verification that ptr is valid or len is correct!
-        let item = &*ptr.add(i);
-        process_item(item);
-    }
-}
-```
-
-## Good Example
-
-```rust
-// DO: Document and verify invariants
-/// Processes a slice of Data items.
-///
-/// # Safety
-///
-/// - `ptr` must be non-null and aligned for `Data`
-/// - `ptr` must point to `len` consecutive initialized `Data` items
-/// - The memory must not be mutated during this call
-/// - `len * size_of::<Data>()` must not exceed `isize::MAX`
-unsafe fn process(ptr: *const Data, len: usize) {
-    debug_assert!(!ptr.is_null(), "ptr must not be null");
-    debug_assert!(ptr.is_aligned(), "ptr must be aligned");
-
-    for i in 0..len {
-        // SAFETY: Caller guarantees ptr points to len valid items
-        let item = &*ptr.add(i);
-        process_item(item);
-    }
-}
-
-// DO: Provide safe wrapper when possible
-fn process_slice(data: &[Data]) {
-    // SAFETY: slice guarantees all invariants
-    unsafe { process(data.as_ptr(), data.len()) }
-}
-```
-
-## Invariant Documentation Template
-
-```rust
-/// # Safety
-///
-/// The caller must ensure that:
-/// - [List each invariant]
-/// - [Explain why each matters]
-```
-
-## Checklist
-
-- [ ] Have I listed all safety invariants?
-- [ ] Can I prove each invariant holds at the call site?
-- [ ] Have I added debug assertions where possible?
-- [ ] Have I documented invariants in /// # Safety section?
-
-## Related Rules
-
-- `safety-09`: Add SAFETY comment before any unsafe block
-- `safety-10`: Add Safety section in docs for public unsafe functions
diff --git a/.claude/skills/unsafe-checker/rules/safety-03-no-uninit-api.md b/.claude/skills/unsafe-checker/rules/safety-03-no-uninit-api.md
deleted file mode 100644
index e4850868d..000000000
--- a/.claude/skills/unsafe-checker/rules/safety-03-no-uninit-api.md
+++ /dev/null
@@ -1,121 +0,0 @@
----
-id: safety-03
-original_id: P.UNS.SAS.03
-level: P
-impact: CRITICAL
-clippy: uninit_assumed_init
----
-
-# Do Not Expose Uninitialized Memory in Public APIs
-
-## Summary
-
-Public APIs must never return or expose uninitialized memory to callers.
-
-## Rationale
-
-Reading uninitialized memory is undefined behavior in Rust. Safe code should never be able to access uninitialized memory through your API.
-
-## Bad Example
-
-```rust
-// DON'T: Expose uninitialized memory
-pub struct Buffer {
-    data: [u8; 1024],
-    len: usize,
-}
-
-impl Buffer {
-    pub fn new() -> Self {
-        // UB: assume_init on uninitialized u8s is already UB,
-        // even before anything reads them
-        unsafe {
-            Self {
-                data: std::mem::MaybeUninit::uninit().assume_init(),
-                len: 0,
-            }
-        }
-    }
-
-    // BAD: Returns reference to potentially uninitialized data
-    pub fn as_slice(&self) -> &[u8] {
-        &self.data[..self.len] // What if len > initialized portion?
-    }
-}
-```
-
-## Good Example
-
-```rust
-use std::mem::MaybeUninit;
-
-// DO: Use MaybeUninit properly and only expose initialized data
-pub struct Buffer {
-    data: Box<[MaybeUninit<u8>; 1024]>,
-    len: usize, // Invariant: data[0..len] is initialized
-}
-
-impl Buffer {
-    pub fn new() -> Self {
-        Self {
-            // MaybeUninit doesn't require initialization
-            data: Box::new([MaybeUninit::uninit(); 1024]),
-            len: 0,
-        }
-    }
-
-    pub fn push(&mut self, byte: u8) {
-        if self.len < 1024 {
-            self.data[self.len].write(byte);
-            self.len += 1;
-        }
-    }
-
-    // Only returns initialized portion
-    pub fn as_slice(&self) -> &[u8] {
-        // SAFETY: self.len bytes are initialized (invariant)
-        unsafe {
-            std::slice::from_raw_parts(
-                self.data.as_ptr() as *const u8,
-                self.len
-            )
-        }
-    }
-}
-
-impl Drop for Buffer {
-    fn drop(&mut self) {
-        // MaybeUninit never drops its contents. For u8 there is nothing to do;
-        // with a Drop element type, drop data[..len] here manually.
-    }
-}
-```
-
-## Patterns for Uninitialized Memory
-
-```rust
-// Pattern 1: MaybeUninit for delayed initialization
-let mut value: MaybeUninit<T> = MaybeUninit::uninit();
-initialize_expensive(&mut value);
-let value = unsafe { value.assume_init() };
-
-// Pattern 2: Vec::with_capacity for growable buffers
-let mut vec: Vec<u8> = Vec::with_capacity(100);
-// vec.len() is 0, capacity is at least 100
-// No uninitialized memory is accessible
-
-// Pattern 3: Box::new_uninit (stable since Rust 1.82)
-let mut boxed = Box::<[u8; 1024]>::new_uninit();
-boxed.write([0u8; 1024]);
-let boxed = unsafe { boxed.assume_init() };
-```
-
-## Checklist
-
-- [ ] Does my API ever return references to uninitialized memory?
-- [ ] Are length/capacity invariants properly maintained?
-- [ ] Is MaybeUninit used instead of transmute for uninitialized data?
- -## Related Rules - -- `mem-06`: Use MaybeUninit for uninitialized memory -- `safety-01`: Panic safety with partial initialization diff --git a/.claude/skills/unsafe-checker/rules/safety-04-double-free.md b/.claude/skills/unsafe-checker/rules/safety-04-double-free.md deleted file mode 100644 index 6d9673f30..000000000 --- a/.claude/skills/unsafe-checker/rules/safety-04-double-free.md +++ /dev/null @@ -1,109 +0,0 @@ ---- -id: safety-04 -original_id: P.UNS.SAS.04 -level: P -impact: CRITICAL ---- - -# Avoid Double-Free from Panic Safety Issues - -## Summary - -Ensure that resources are not freed twice, especially when panics can occur during operations. - -## Rationale - -Double-free is undefined behavior. Panics during unsafe operations can cause destructors to run on already-freed or partially-constructed data. - -## Bad Example - -```rust -use std::ptr; - -// DON'T: Potential double-free on panic -impl<T> MyVec<T> { - pub fn pop(&mut self) -> Option<T> { - if self.len == 0 { - None - } else { - // BAD: the value is moved out while len still counts it - let value = unsafe { ptr::read(self.ptr.add(self.len - 1)) }; - // If anything here panics, Drop still sees this slot - // and drops the moved-out value a second time - self.len -= 1; - Some(value) - } - } -} - -// DON'T: Hand-roll a swap with ptr::read/ptr::write -fn bad_swap<T>(a: &mut T, b: &mut T) { - unsafe { - let tmp = ptr::read(a); - // tmp and *a now both own the same value; a panic before - // the final write would drop it twice - ptr::write(a, ptr::read(b)); - ptr::write(b, tmp); - } -} -``` - -## Good Example - -```rust -use std::ptr; - -// DO: Use std::mem::take or swap -fn good_swap<T>(a: &mut T, b: &mut T) { - std::mem::swap(a, b); // Safe and correct -} - -// DO: Update the length before moving a value out -impl<T> MyVec<T> { - pub fn pop(&mut self) -> Option<T> { - if self.len == 0 { - None - } else { - self.len -= 1; // Decrement first - unsafe { - // SAFETY: len was decremented, so this slot won't be - // dropped again by MyVec's Drop impl - Some(ptr::read(self.ptr.add(self.len))) - } - } - } -} - -// DO: Use scopeguard or manual cleanup -fn safe_operation<T: Clone>(data: &mut [T], source: &[T]) {
- // Track what we've written for cleanup on panic - let mut written = 0; - - let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| { - for (i, item) in source.iter().enumerate() { - data[i] = item.clone(); - written = i + 1; - } - })); - - if result.is_err() { - // Clean up on panic (if T needs special handling) - // In this case, safe code handles it automatically - } -} -``` - -## Patterns to Avoid Double-Free - -1. **Decrement length before reading**: Vec's Drop won't touch the read element -2. **Use ManuallyDrop**: Explicitly control when Drop runs -3. **Use std::mem::replace/swap**: Safe alternatives for move semantics -4. **Panic guards**: RAII cleanup on unwind - -## Checklist - -- [ ] After reading memory, is it marked as "moved"? -- [ ] Will Drop run on this memory? Should it? -- [ ] What happens if this code panics at each point? -- [ ] Are length/count bookkeeping updates ordered correctly? - -## Related Rules - -- `safety-01`: Panic safety in unsafe code -- `ptr-01`: Don't share raw pointers across threads diff --git a/.claude/skills/unsafe-checker/rules/safety-05-send-sync.md b/.claude/skills/unsafe-checker/rules/safety-05-send-sync.md deleted file mode 100644 index 07101df9a..000000000 --- a/.claude/skills/unsafe-checker/rules/safety-05-send-sync.md +++ /dev/null @@ -1,113 +0,0 @@ ---- -id: safety-05 -original_id: P.UNS.SAS.05 -level: P -impact: CRITICAL -clippy: non_send_fields_in_send_ty ---- - -# Consider Safety When Manually Implementing Auto Traits - -## Summary - -When manually implementing `Send` or `Sync`, you must ensure thread safety invariants are upheld. - -## Rationale - -`Send` and `Sync` are unsafe traits because incorrect implementations cause data races, which are undefined behavior. The compiler auto-implements them conservatively, but manual implementations require careful analysis. 
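Before writing `unsafe impl Send`/`Sync`, it can help to check what the compiler already derives. A generic function with a `Send` or `Sync` bound compiles only when the bound holds, so calling it acts as a compile-time assertion. A minimal sketch; `assert_send`, `assert_sync`, `Counter`, and `count_in_threads` are hypothetical names:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Compiles only if `T: Send`; does nothing at runtime.
pub fn assert_send<T: Send>() {}
/// Compiles only if `T: Sync`; does nothing at runtime.
pub fn assert_sync<T: Sync>() {}

/// Every field is Send + Sync, so the compiler implements both traits
/// automatically; no `unsafe impl` is needed.
pub struct Counter {
    pub hits: AtomicUsize,
    pub log: Mutex<Vec<String>>,
}

/// Shares one Counter between threads and returns the final hit count.
pub fn count_in_threads(threads: usize) -> usize {
    assert_send::<Counter>();
    assert_sync::<Counter>();
    // assert_sync::<std::cell::Cell<i32>>(); // would not compile: Cell is !Sync

    let counter = Arc::new(Counter {
        hits: AtomicUsize::new(0),
        log: Mutex::new(Vec::new()),
    });
    let handles: Vec<_> = (0..threads)
        .map(|t| {
            let c = Arc::clone(&counter);
            std::thread::spawn(move || {
                c.hits.fetch_add(1, Ordering::Relaxed);
                c.log.lock().unwrap().push(format!("thread {t}"));
            })
        })
        .collect();
    for h in handles {
        // join() makes every thread's increment visible to the load below
        h.join().unwrap();
    }
    counter.hits.load(Ordering::Relaxed)
}
```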
- -## Trait Meanings - -- **`Send`**: Safe to transfer ownership to another thread -- **`Sync`**: Safe to share references (`&T`) between threads (i.e., `&T: Send`) - -## Bad Example - -```rust -// DON'T: Unsafe Send/Sync without thread safety -struct NotThreadSafe { - ptr: *mut i32, // Raw pointers are not Send/Sync -} - -// BAD: This is unsound! -unsafe impl Send for NotThreadSafe {} -unsafe impl Sync for NotThreadSafe {} - -// DON'T: Rc-like type with unsafe Sync -struct MyRc<T> { - ptr: *mut RcInner<T>, -} - -struct RcInner<T> { - count: usize, // Not atomic! - data: T, -} - -// BAD: count is not atomic, concurrent access is UB -unsafe impl<T> Sync for MyRc<T> {} -``` - -## Good Example - -```rust -use std::sync::atomic::{AtomicUsize, Ordering}; -use std::ptr::NonNull; - -// DO: Use atomic operations for thread-safe reference counting -struct MyArc<T> { - ptr: NonNull<ArcInner<T>>, -} - -struct ArcInner<T> { - count: AtomicUsize, // Atomic for thread safety - data: T, -} - -// SAFETY: The data is behind atomic reference counting, -// and T: Send + Sync ensures the data itself is thread-safe -unsafe impl<T: Send + Sync> Send for MyArc<T> {} -unsafe impl<T: Send + Sync> Sync for MyArc<T> {} - -// DO: Document why it's safe -/// A thread-safe wrapper around a raw file descriptor. -/// -/// # Safety -/// -/// The file descriptor is valid for the lifetime of this struct, -/// and file descriptors are safe to use from any thread. -struct ThreadSafeFd { - fd: std::os::unix::io::RawFd, -} - -// SAFETY: File descriptors are just integers and can be used -// from any thread. The actual I/O operations are thread-safe -// at the OS level. -unsafe impl Send for ThreadSafeFd {} -unsafe impl Sync for ThreadSafeFd {} -``` - -## Decision Tree - -``` -Does your type contain: - - Raw pointers? → Probably not auto Send/Sync - - Rc/RefCell? → Not Sync (Rc not Send either) - - Cell/UnsafeCell? → Not Sync - - Interior mutability? → Needs synchronization for Sync - -To manually implement: - - Send: Can another thread safely drop this?
- - Sync: Can multiple threads safely call &self methods? -``` - -## Checklist - -- [ ] Does my type contain any non-Send/Sync fields? -- [ ] Is interior mutability properly synchronized (Mutex, atomic)? -- [ ] Would concurrent access cause data races? -- [ ] Have I documented why the implementation is safe? - -## Related Rules - -- `ptr-01`: Don't share raw pointers across threads -- `safety-02`: Verify safety invariants diff --git a/.claude/skills/unsafe-checker/rules/safety-06-no-raw-ptr-api.md b/.claude/skills/unsafe-checker/rules/safety-06-no-raw-ptr-api.md deleted file mode 100644 index bb1b9c501..000000000 --- a/.claude/skills/unsafe-checker/rules/safety-06-no-raw-ptr-api.md +++ /dev/null @@ -1,137 +0,0 @@ ---- -id: safety-06 -original_id: P.UNS.SAS.06 -level: P -impact: HIGH ---- - -# Do Not Expose Raw Pointers in Public APIs - -## Summary - -Public APIs should use safe abstractions (references, slices, smart pointers) instead of exposing raw pointers. - -## Rationale - -Raw pointers bypass Rust's safety guarantees. Exposing them in public APIs forces users into unsafe code and makes it easy to create undefined behavior. 
- -## Bad Example - -```rust -// DON'T: Expose raw pointers in public API -pub struct Buffer { - data: *mut u8, - len: usize, -} - -impl Buffer { - // BAD: Returns raw pointer - pub fn as_ptr(&self) -> *const u8 { - self.data - } - - // BAD: Takes raw pointer as input - pub fn from_ptr(ptr: *mut u8, len: usize) -> Self { - Self { data: ptr, len } - } - - // BAD: Exposes internal pointer mutably - pub fn as_mut_ptr(&mut self) -> *mut u8 { - self.data - } -} -``` - -## Good Example - -```rust -// DO: Use safe abstractions -pub struct Buffer { - data: Vec<u8>, -} - -impl Buffer { - // Returns a safe reference - pub fn as_slice(&self) -> &[u8] { - &self.data - } - - // Takes safe input - pub fn from_slice(data: &[u8]) -> Self { - Self { data: data.to_vec() } - } - - // Mutable access through safe reference - pub fn as_mut_slice(&mut self) -> &mut [u8] { - &mut self.data - } -} - -// DO: If raw pointers are needed, provide unsafe API with documentation -impl Buffer { - /// Returns a pointer to the buffer's data. - /// - /// # Safety - /// - /// The pointer is valid for `self.as_slice().len()` bytes and must not be - /// used after the Buffer is dropped or reallocated. - pub fn as_ptr(&self) -> *const u8 { - self.data.as_ptr() - } - - /// Creates a Buffer from a raw pointer.
- /// - /// # Safety - /// - /// - `ptr` must point to `len` initialized bytes - /// - `ptr` must have been allocated by the global allocator with a capacity of `cap` bytes (e.g., by a `Vec<u8>`), and `len <= cap` - /// - Caller transfers ownership of the memory to Buffer - pub unsafe fn from_raw_parts(ptr: *mut u8, len: usize, cap: usize) -> Self { - Self { - data: Vec::from_raw_parts(ptr, len, cap) - } - } -} -``` - -## Patterns for Safe Pointer APIs - -```rust -// Pattern 1: Use NonNull for internal pointers -use std::ptr::NonNull; - -pub struct MyBox<T> { - ptr: NonNull<T>, // Internal use only -} - -impl<T> MyBox<T> { - // Safe public API - pub fn get(&self) -> &T { - // SAFETY: ptr is always valid while MyBox exists - unsafe { self.ptr.as_ref() } - } -} - -// Pattern 2: Callback-based access -impl Buffer { - // User can work with pointer in controlled context - pub fn with_ptr<F, R>(&self, f: F) -> R - where - F: FnOnce(*const u8, usize) -> R, - { - f(self.data.as_ptr(), self.data.len()) - } -} -``` - -## Checklist - -- [ ] Can this API use references instead of pointers? -- [ ] Can this API use slices instead of pointer + length? -- [ ] If pointers are necessary, is the API marked `unsafe`? -- [ ] Are safety requirements documented? - -## Related Rules - -- `general-03`: Don't create aliases for unsafe items -- `safety-10`: Document safety requirements for public unsafe functions diff --git a/.claude/skills/unsafe-checker/rules/safety-07-unsafe-pair.md b/.claude/skills/unsafe-checker/rules/safety-07-unsafe-pair.md deleted file mode 100644 index 68136b2ef..000000000 --- a/.claude/skills/unsafe-checker/rules/safety-07-unsafe-pair.md +++ /dev/null @@ -1,107 +0,0 @@ ---- -id: safety-07 -original_id: P.UNS.SAS.07 -level: P -impact: MEDIUM ---- - -# Provide Unsafe Counterparts for Performance Alongside Safe Methods - -## Summary - -When providing performance-critical operations that skip safety checks, offer both a safe checked version and an unsafe unchecked version.
- -## Rationale - -Users who need maximum performance can opt into unsafe, while others get safety by default. This follows the "safe by default, unsafe opt-in" principle. - -## Bad Example - -```rust -// DON'T: Only provide unsafe version -impl<T> MySlice<T> { - /// Gets an element by index. - /// - /// # Safety - /// Index must be in bounds. - pub unsafe fn get(&self, index: usize) -> &T { - &*self.ptr.add(index) - } -} - -// DON'T: Only provide checked version when performance matters -impl<T> MySlice<T> { - pub fn get(&self, index: usize) -> Option<&T> { - if index < self.len { - Some(unsafe { &*self.ptr.add(index) }) - } else { - None - } - } - // Missing: get_unchecked for performance-critical code -} -``` - -## Good Example - -```rust -// DO: Provide both versions -impl<T> MySlice<T> { - /// Gets an element by index, returning `None` if out of bounds. - #[inline] - pub fn get(&self, index: usize) -> Option<&T> { - if index < self.len { - // SAFETY: We just verified index < len - Some(unsafe { self.get_unchecked(index) }) - } else { - None - } - } - - /// Gets an element by index without bounds checking. - /// - /// # Safety - /// - /// Calling this method with an out-of-bounds index is undefined behavior. - #[inline] - pub unsafe fn get_unchecked(&self, index: usize) -> &T { - debug_assert!(index < self.len, "index out of bounds"); - &*self.ptr.add(index) - } - - /// Gets an element, panicking if out of bounds.
- #[inline] - pub fn get_or_panic(&self, index: usize) -> &T { - assert!(index < self.len, "index {} out of bounds for len {}", index, self.len); - // SAFETY: We just asserted index < len - unsafe { self.get_unchecked(index) } - } -} -``` - -## Standard Library Patterns - -| Safe Method | Unsafe Counterpart | -|-------------|-------------------| -| `slice.get(i)` | `slice.get_unchecked(i)` | -| `str.get(range)` | `str.get_unchecked(range)` | -| `vec.pop()` | `vec.set_len()` + `ptr::read` | -| `String::from_utf8()` | `String::from_utf8_unchecked()` | - -## Naming Conventions - -- Safe: `method_name()` -- Unsafe: `method_name_unchecked()` -- Or: `get()` vs `get_unchecked()` - -## Checklist - -- [ ] Does my safe method have an unsafe counterpart for hot paths? -- [ ] Does my unsafe method have a safe alternative for normal use? -- [ ] Are both methods documented with their trade-offs? -- [ ] Does the unsafe version include debug assertions? - -## Related Rules - -- `general-02`: Don't blindly use unsafe for performance -- `safety-09`: Add SAFETY comments diff --git a/.claude/skills/unsafe-checker/rules/safety-08-no-mut-from-immut.md b/.claude/skills/unsafe-checker/rules/safety-08-no-mut-from-immut.md deleted file mode 100644 index c7a8e5331..000000000 --- a/.claude/skills/unsafe-checker/rules/safety-08-no-mut-from-immut.md +++ /dev/null @@ -1,124 +0,0 @@ ---- -id: safety-08 -original_id: P.UNS.SAS.08 -level: P -impact: CRITICAL -clippy: mut_from_ref ---- - -# Mutable Return from Immutable Parameter is Wrong - -## Summary - -A function taking `&self` or `&T` must not return `&mut T` to the same data without interior mutability. - -## Rationale - -Returning `&mut` from `&` violates Rust's aliasing rules. The caller has an immutable borrow, so they can create additional `&` references. Returning `&mut` creates mutable aliasing, which is undefined behavior.
- -## Bad Example - -```rust -// DON'T: Return &mut from &self -struct Container { - data: i32, -} - -impl Container { - // WRONG: This is undefined behavior! - pub fn get_mut(&self) -> &mut i32 { - unsafe { - // Creating &mut from & is ALWAYS wrong - &mut *(&self.data as *const i32 as *mut i32) - } - } -} - -// DON'T: Transmute & to &mut -fn bad_transmute<T>(reference: &T) -> &mut T { - unsafe { std::mem::transmute(reference) } // UB! -} -``` - -## Good Example - -```rust -use std::cell::{Cell, RefCell, UnsafeCell}; - -// DO: Use interior mutability types -struct Container { - data: Cell<i32>, // For Copy types - complex: RefCell<String>, // For non-Copy with runtime checks -} - -impl Container { - pub fn get(&self) -> i32 { - self.data.get() - } - - pub fn set(&self, value: i32) { - self.data.set(value); - } - - pub fn modify_complex(&self, f: impl FnOnce(&mut String)) { - f(&mut self.complex.borrow_mut()); - } -} - -// DO: Use UnsafeCell for custom interior mutability -struct MyMutex<T> { - locked: std::sync::atomic::AtomicBool, - data: UnsafeCell<T>, -} - -impl<T> MyMutex<T> { - pub fn lock(&self) -> MutexGuard<'_, T> { - // Acquire lock... - MutexGuard { mutex: self } - } -} - -struct MutexGuard<'a, T> { - mutex: &'a MyMutex<T>, -} - -// (DerefMut also requires a matching Deref impl, omitted here) -impl<T> std::ops::DerefMut for MutexGuard<'_, T> { - fn deref_mut(&mut self) -> &mut T { - // SAFETY: We hold the lock, so exclusive access is guaranteed - unsafe { &mut *self.mutex.data.get() } - } -} -``` - -## The Only Valid Pattern - -The ONLY way to get `&mut` from `&` is through `UnsafeCell`: - -```rust -use std::cell::UnsafeCell; - -struct ValidInteriorMut { - data: UnsafeCell<i32>, -} - -impl ValidInteriorMut { - /// # Safety - /// - /// No other reference to the data may be live while the returned - /// `&mut i32` is in use (e.g., enforce this with a lock). - pub unsafe fn get_mut(&self) -> &mut i32 { - // SAFETY: UnsafeCell permits mutation through a shared reference, - // and the caller guarantees exclusive access (see # Safety) - unsafe { &mut *self.data.get() } - } -} -``` - -## Checklist - -- [ ] Am I trying to return &mut from a & method?
-- [ ] If yes, am I using UnsafeCell or a type built on it? -- [ ] Am I guaranteeing exclusive access before creating &mut? -- [ ] Would Cell, RefCell, or Mutex solve my problem safely? - -## Related Rules - -- `ptr-05`: Don't manually convert *const to *mut -- `safety-02`: Verify safety invariants diff --git a/.claude/skills/unsafe-checker/rules/safety-09-safety-comment.md b/.claude/skills/unsafe-checker/rules/safety-09-safety-comment.md deleted file mode 100644 index 16e840a8e..000000000 --- a/.claude/skills/unsafe-checker/rules/safety-09-safety-comment.md +++ /dev/null @@ -1,122 +0,0 @@ ---- -id: safety-09 -original_id: P.UNS.SAS.09 -level: P -impact: CRITICAL -clippy: undocumented_unsafe_blocks ---- - -# Add SAFETY Comment Before Any Unsafe Block - -## Summary - -Every `unsafe` block or `unsafe impl` must have a `// SAFETY:` comment explaining why the operation is safe. - -## Rationale - -SAFETY comments force the author to think about invariants and help reviewers verify correctness. They serve as documentation for future maintainers. - -## Bad Example - -```rust -// DON'T: Unsafe without explanation -fn get_unchecked(slice: &[i32], index: usize) -> i32 { - unsafe { *slice.get_unchecked(index) } -} - -// DON'T: Vague or unhelpful comments -fn bad_comments(ptr: *const i32) -> i32 { - // This is unsafe - unsafe { *ptr } - - // Trust me - unsafe { *ptr } - - // Safe because I know what I'm doing - unsafe { *ptr } -} -``` - -## Good Example - -```rust -// DO: Explain the safety invariant -/// # Safety -/// `index` must be less than `slice.len()`. -unsafe fn get_unchecked(slice: &[i32], index: usize) -> i32 { - // SAFETY: The caller guarantees index < slice.len() (see # Safety) - unsafe { *slice.get_unchecked(index) } -} - -// DO: Be specific about what makes it safe -fn read_header(buffer: &[u8]) -> Header { - assert!(buffer.len() >= std::mem::size_of::<Header>()); - - // SAFETY: - // - buffer.len() >= size_of::<Header>() (asserted above) - // - read_unaligned does not require the pointer to be aligned for Header - // - Header is #[repr(C)] and every bit pattern is a valid Header (e.g., integer fields only) - unsafe { - std::ptr::read_unaligned(buffer.as_ptr() as *const Header) - } -} - -// DO: Document unsafe impl -struct MySendType(*mut i32); - -// SAFETY: The pointer is to thread-local storage that is only accessed -// from the owning thread. MySendType is only sent when the TLS slot -// is being transferred between threads with proper synchronization. -unsafe impl Send for MySendType {} - -// DO: Multi-line for complex invariants -fn complex_operation(data: &mut [u8], ranges: &[(usize, usize)]) { - for &(start, end) in ranges { - // SAFETY: - // 1. All ranges were validated to be within data.len() - // in the calling function `validate_ranges()` - // 2. Ranges are non-overlapping (invariant of RangeSet) - // 3. We have &mut access to data, so no aliasing - unsafe { - let ptr = data.as_mut_ptr().add(start); - std::ptr::write_bytes(ptr, 0, end - start); - } - } -} -``` - -## SAFETY Comment Format - -```rust -// SAFETY: explanation of why the invariants hold here - -// Or for complex cases: -// SAFETY: -// - Invariant 1: explanation -// - Invariant 2: explanation -// - Why this is upheld: explanation -``` - -## What to Include - -1. **What invariants must hold** for this to be safe -2. **Why those invariants hold** at this specific call site -3. **What could go wrong** if the invariants were violated (optional but helpful) - -## Clippy Configuration - -```toml -# clippy.toml -accept-comment-above-statement = true -accept-comment-above-attributes = true -``` - -## Checklist - -- [ ] Does every unsafe block have a SAFETY comment? -- [ ] Does the comment explain WHY it's safe, not just WHAT it does? -- [ ] Are all relevant invariants mentioned? -- [ ] Would a reviewer understand the safety argument?
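`undocumented_unsafe_blocks` belongs to Clippy's `restriction` group, so it is not enabled by default. It has to be turned on explicitly, either crate-wide with `#![deny(clippy::undocumented_unsafe_blocks)]` or per item as in this sketch. `rustc` accepts the attribute, but only `cargo clippy` enforces it. The `first_and_last` function is hypothetical:

```rust
// Enabled per item here; a crate-level `#![deny(...)]` works the same way.
#[deny(clippy::undocumented_unsafe_blocks)]
pub fn first_and_last(bytes: &[u8]) -> Option<(u8, u8)> {
    if bytes.is_empty() {
        return None;
    }
    let last = bytes.len() - 1;
    // SAFETY: `bytes` is non-empty (checked above), so index 0 and
    // index `bytes.len() - 1` are both in bounds.
    let pair = unsafe { (*bytes.get_unchecked(0), *bytes.get_unchecked(last)) };
    Some(pair)
}
```

Deleting the `// SAFETY:` comment makes `cargo clippy` reject the function.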
- -## Related Rules - -- `safety-02`: Verify safety invariants -- `safety-10`: Add Safety section in docs for public unsafe functions diff --git a/.claude/skills/unsafe-checker/rules/safety-10-safety-doc.md b/.claude/skills/unsafe-checker/rules/safety-10-safety-doc.md deleted file mode 100644 index b044b4ae0..000000000 --- a/.claude/skills/unsafe-checker/rules/safety-10-safety-doc.md +++ /dev/null @@ -1,127 +0,0 @@ ---- -id: safety-10 -original_id: G.UNS.SAS.01 -level: G -impact: HIGH -clippy: missing_safety_doc ---- - -# Add Safety Section in Docs for Public Unsafe Functions - -## Summary - -Public `unsafe` functions must have a `# Safety` section in their documentation explaining the caller's obligations. - -## Rationale - -Unlike SAFETY comments (which explain why an unsafe block is sound), `# Safety` docs tell callers what they must guarantee. Without this, users cannot safely call the function. - -## Bad Example - -```rust -// DON'T: Unsafe function without safety docs -pub unsafe fn process_buffer(ptr: *const u8, len: usize) { - // ... -} - -// DON'T: Safety docs that don't explain requirements -/// Processes a buffer. -/// -/// This function is unsafe. // Not helpful! -pub unsafe fn process_buffer(ptr: *const u8, len: usize) { - // ... -} -``` - -## Good Example - -```rust -/// Processes a buffer of bytes. -/// -/// # Safety -/// -/// The caller must ensure that: -/// -/// - `ptr` is non-null and properly aligned for `u8` -/// - `ptr` points to at least `len` consecutive, initialized bytes -/// - The memory referenced by `ptr` is not mutated during this call -/// - `len` does not exceed `isize::MAX` -/// -/// # Examples -/// -/// ``` -/// let data = [1u8, 2, 3, 4]; -/// // SAFETY: data is a valid slice, we pass its pointer and length -/// unsafe { process_buffer(data.as_ptr(), data.len()) }; -/// ``` -pub unsafe fn process_buffer(ptr: *const u8, len: usize) { - // ... -} - -/// Creates a `Vec` from raw parts. 
-/// -/// # Safety -/// -/// This is highly unsafe due to the number of invariants that must -/// be upheld by the caller: -/// -/// * `ptr` must have been allocated via the global allocator -/// * `T` must have the same alignment as the original allocation -/// * `capacity` must be the capacity the pointer was allocated with -/// * `length` must be less than or equal to `capacity` -/// * The first `length` values must be properly initialized -/// * The allocated memory must not be used elsewhere -/// -/// Violating these may cause undefined behavior including -/// use-after-free, double-free, and memory corruption. -pub unsafe fn from_raw_parts<T>(ptr: *mut T, length: usize, capacity: usize) -> Vec<T> { - // ... -} -``` - -## Safety Documentation Template - -```rust -/// Brief description of what the function does. -/// -/// # Safety -/// -/// The caller must ensure that: -/// -/// - Requirement 1: detailed explanation -/// - Requirement 2: detailed explanation -/// -/// # Panics (if applicable) -/// -/// Panics if... -/// -/// # Examples -/// -/// ``` -/// // SAFETY: explanation of why this call is safe -/// unsafe { function_name(...) }; -/// ``` -``` - -## What to Document - -| Category | Example | -|----------|---------| -| Pointer validity | "ptr must be non-null and aligned" | -| Memory state | "must point to initialized memory" | -| Aliasing | "no other references to this memory may exist" | -| Lifetime | "pointer must be valid for the duration of the call" | -| Thread safety | "must not be called concurrently with..." | -| Invariants | "len must not exceed isize::MAX" | - -## Checklist - -- [ ] Does the function have a `# Safety` section? -- [ ] Are ALL caller obligations listed? -- [ ] Is each requirement specific and verifiable? -- [ ] Does the example show correct usage with SAFETY comment?
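The obligations in a `# Safety` section can often be discharged once, in a safe wrapper whose parameter types already guarantee them, so most callers never write `unsafe`. A minimal sketch; `sum_raw` and `sum` are hypothetical:

```rust
/// Sums `len` bytes starting at `ptr`.
///
/// # Safety
///
/// The caller must ensure that:
///
/// - `ptr` is non-null and valid for reads of `len` bytes
/// - those bytes are initialized and not mutated during this call
/// - `len` does not exceed `isize::MAX`
pub unsafe fn sum_raw(ptr: *const u8, len: usize) -> u64 {
    // SAFETY: the caller upholds every requirement listed in `# Safety`.
    let bytes = unsafe { std::slice::from_raw_parts(ptr, len) };
    bytes.iter().map(|&b| u64::from(b)).sum()
}

/// Safe wrapper: a `&[u8]` already satisfies the whole contract.
pub fn sum(bytes: &[u8]) -> u64 {
    // SAFETY: a slice's pointer is non-null and valid for `bytes.len()`
    // initialized bytes, and the shared borrow prevents mutation.
    unsafe { sum_raw(bytes.as_ptr(), bytes.len()) }
}
```

Only code that genuinely starts from a raw pointer, such as FFI, needs to call `sum_raw` directly.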
- -## Related Rules - -- `safety-09`: SAFETY comments for unsafe blocks -- `safety-02`: Verify safety invariants diff --git a/.claude/skills/unsafe-checker/rules/safety-11-assert-not-debug.md b/.claude/skills/unsafe-checker/rules/safety-11-assert-not-debug.md deleted file mode 100644 index 9c38a4913..000000000 --- a/.claude/skills/unsafe-checker/rules/safety-11-assert-not-debug.md +++ /dev/null @@ -1,105 +0,0 @@ ---- -id: safety-11 -original_id: G.UNS.SAS.02 -level: G -impact: MEDIUM -clippy: debug_assert_with_mut_call ---- - -# Use assert! Instead of debug_assert! in Unsafe Functions - -## Summary - -In `unsafe` functions or functions containing unsafe blocks, prefer `assert!` over `debug_assert!` for checking safety invariants. - -## Rationale - -`debug_assert!` is compiled out in release builds. If an invariant is important enough to check for safety, it should be checked in all builds to catch violations. - -## Bad Example - -```rust -// DON'T: Use debug_assert for safety-critical checks -pub unsafe fn get_unchecked(slice: &[i32], index: usize) -> &i32 { - debug_assert!(index < slice.len()); // Gone in release! - &*slice.as_ptr().add(index) -} - -// DON'T: Rely on debug_assert for FFI safety -pub unsafe fn call_c_function(ptr: *const Data) { - debug_assert!(!ptr.is_null()); // Won't catch bugs in release - ffi::process_data(ptr); -} -``` - -## Good Example - -```rust -// DO: Use assert! 
for safety checks (when performance allows) -pub unsafe fn get_unchecked(slice: &[i32], index: usize) -> &i32 { - assert!(index < slice.len(), "index {} out of bounds for len {}", index, slice.len()); - &*slice.as_ptr().add(index) -} - -// DO: Use debug_assert when CALLER is responsible -/// # Safety -/// index must be less than slice.len() -pub unsafe fn get_unchecked_fast(slice: &[i32], index: usize) -> &i32 { - // Caller is responsible; debug_assert just helps catch bugs during development - debug_assert!(index < slice.len()); - &*slice.as_ptr().add(index) -} - -// DO: Use assert for internal safety, debug_assert for caller obligations -pub fn get_checked(slice: &[i32], index: usize) -> Option<&i32> { - if index < slice.len() { - // SAFETY: We just checked index < len - // debug_assert is fine here because the if-check is the real guard - Some(unsafe { - debug_assert!(index < slice.len()); // Redundant, just for documentation - &*slice.as_ptr().add(index) - }) - } else { - None - } -} -``` - -## When to Use Each - -| Assertion | Use When | -|-----------|----------| -| `assert!` | Invariant is not already checked; function is called with untrusted input | -| `debug_assert!` | Invariant is the caller's responsibility (documented in `# Safety`); performance-critical | -| No assert | Invariant is enforced by types or prior checks in the same function | - -## Hybrid Approach - -```rust -// Use cfg to have both safety and performance -pub unsafe fn process(slice: &[u8], index: usize) { - // Always check in tests and debug - #[cfg(any(test, debug_assertions))] - assert!(index < slice.len()); - - // Optional: paranoid mode for production - #[cfg(feature = "paranoid")] - assert!(index < slice.len()); - - // SAFETY: Caller guarantees index < len (checked in debug) - let ptr = slice.as_ptr().add(index); - // ... -} -``` - -## Checklist - -- [ ] Is this a safety-critical invariant? -- [ ] Who is responsible for upholding it (caller or this function)? 
-- [ ] Can the assertion be optimized away when provably true? -- [ ] What's the performance impact of the assertion? - -## Related Rules - -- `safety-02`: Verify safety invariants -- `safety-09`: SAFETY comments diff --git a/.claude/skills/unsafe-checker/rules/union-01-avoid-except-ffi.md b/.claude/skills/unsafe-checker/rules/union-01-avoid-except-ffi.md deleted file mode 100644 index 6922517e8..000000000 --- a/.claude/skills/unsafe-checker/rules/union-01-avoid-except-ffi.md +++ /dev/null @@ -1,117 +0,0 @@ ---- -id: union-01 -original_id: P.UNS.UNI.01 -level: P -impact: HIGH ---- - -# Avoid Union Except for C Interop - -## Summary - -Only use `union` for FFI with C code. For Rust-only code, use `enum` with explicit tags. - -## Rationale - -- Reading a union field is always unsafe -- Nothing stops you from reading the wrong field, which is undefined behavior whenever the stored bits are not a valid value of that field's type -- Enums are type-safe and the compiler tracks the active variant -- Unions never drop their fields; fields with destructors must be wrapped in `ManuallyDrop` and dropped by hand - -## Bad Example - -```rust -// DON'T: Use union for space optimization in Rust-only code -union IntOrFloat { - i: i32, - f: f32, -} - -fn bad_usage() { - let u = IntOrFloat { i: 42 }; - - // BAD: Nothing records which field was last written. - // This silently reinterprets the bits as f32; the same mistake is UB - // for field types with invalid bit patterns (bool, char, enums, references) - let f = unsafe { u.f }; -} - -// DON'T: Use union for variant types -union Variant { - string: std::mem::ManuallyDrop<String>, - number: i64, -} - -// Problems: -// 1. Must manually track which variant is active -// 2. Must manually call drop on String variant -// 3.
Easy to have memory leaks or double-free -``` - -## Good Example - -```rust -// DO: Use enum for variant types in Rust -enum Variant { - String(String), - Number(i64), -} - -// Compiler tracks active variant, runs correct destructor - -// DO: Use union only for C FFI -#[repr(C)] -union CUnion { - i: i32, - f: f32, -} - -// When interfacing with C code that uses this union -extern "C" { - fn c_function_returns_union() -> CUnion; - fn c_function_takes_union(u: CUnion); -} - -// DO: Wrap in safe API with explicit variant tracking -#[repr(C)] -pub struct SafeUnion { - tag: u8, - data: CUnion, -} - -impl SafeUnion { - pub fn as_int(&self) -> Option<i32> { - if self.tag == 0 { - // SAFETY: Tag indicates integer variant is active - Some(unsafe { self.data.i }) - } else { - None - } - } -} -``` - -## When Union Is Appropriate - -1. **C FFI**: Matching C union layout for interoperability -2. **MaybeUninit**: The standard library uses union internally -3. **Very low-level optimization**: Only after profiling and careful safety analysis - -## Alternatives to Union - -| Use Case | Instead of Union | Use | -|----------|-----------------|-----| -| Variant types | union + tag | `enum` | -| Optional value | union + bool | `Option<T>` | -| Type punning | union | `to_bits`/`from_bits`, `from_ne_bytes`, or `transmute` | -| Uninitialized memory | union | `MaybeUninit<T>` | - -## Checklist - -- [ ] Is this for C FFI? If not, use enum -- [ ] If union is necessary, is there a tag tracking active variant? -- [ ] Are destructors handled correctly for Drop types? -- [ ] Is the union #[repr(C)] for FFI?
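For type punning between integers and floats, the primitive float types already provide safe bit-level conversions, which removes the need for a Rust-only `IntOrFloat` union. A minimal sketch; the helper names are hypothetical, and `to_bits` yields `u32` rather than `i32`:

```rust
/// Safe equivalent of writing `f` and reading `i` from
/// `union IntOrFloat { i: i32, f: f32 }` (but as u32).
pub fn float_to_bits(x: f32) -> u32 {
    x.to_bits()
}

/// The reverse direction; every u32 bit pattern is a valid f32.
pub fn bits_to_float(bits: u32) -> f32 {
    f32::from_bits(bits)
}

/// Byte-level view in native endianness.
pub fn float_bytes(x: f32) -> [u8; 4] {
    x.to_ne_bytes()
}
```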
- -## Related Rules - -- `union-02`: Don't use union variants across lifetimes -- `ffi-13`: Ensure consistent data layout diff --git a/.claude/skills/unsafe-checker/rules/union-02-no-cross-lifetime.md b/.claude/skills/unsafe-checker/rules/union-02-no-cross-lifetime.md deleted file mode 100644 index 5c54c0348..000000000 --- a/.claude/skills/unsafe-checker/rules/union-02-no-cross-lifetime.md +++ /dev/null @@ -1,121 +0,0 @@ ---- -id: union-02 -original_id: P.UNS.UNI.02 -level: P -impact: CRITICAL ---- - -# Do Not Use Union Variants Across Different Lifetimes - -## Summary - -Do not write to one union field and read from another field that has a different lifetime or references data with a different lifetime. - -## Rationale - -Union fields share the same memory. If one field stores a reference with lifetime `'a` and you read it as a reference with lifetime `'b`, you bypass lifetime checking and can create dangling references. - -## Bad Example - -```rust -// DON'T: Extend lifetime through union -union LifetimeBypass<'a, 'b> { - short: &'a str, - long: &'b str, -} - -fn bad_lifetime_extension<'a, 'b>(short: &'a str) -> &'b str { - let u = LifetimeBypass { short }; - // BAD: Reading with different lifetime is UB - unsafe { u.long } -} - -fn exploit() { - let long_ref: &'static str; - { - let temp = String::from("temporary"); - // Extend local reference to 'static - dangling pointer! 
- long_ref = bad_lifetime_extension(&temp); - } - // temp is dropped, long_ref is dangling - println!("{}", long_ref); // UB: use after free -} -``` - -## Good Example - -```rust -// DO: Use same lifetime for all reference fields -union SafeUnion<'a> { - str_ref: &'a str, - bytes_ref: &'a [u8], -} - -fn safe_conversion<'a>(s: &'a str) -> &'a [u8] { - let u = SafeUnion { str_ref: s }; - // SAFETY: Both fields have same lifetime 'a - // AND str and [u8] have compatible representations - unsafe { u.bytes_ref } -} - -// Better: Just use as_bytes() -fn better_conversion(s: &str) -> &[u8] { - s.as_bytes() -} - -// DO: Use MaybeUninit for delayed initialization, not lifetime tricks -use std::mem::MaybeUninit; - -fn delayed_init<T>(init: impl FnOnce() -> T) -> T { - let mut value: MaybeUninit<T> = MaybeUninit::uninit(); - value.write(init()); - // SAFETY: value was fully initialized by write() above - unsafe { value.assume_init() } -} -``` - -## Why This Is Dangerous - -The Rust lifetime system prevents use-after-free by tracking how long references are valid. Unions can subvert this: - -``` -Memory: [pointer to "hello"] - -Union as 'short: points to stack memory (valid during function) -Union as 'long: claims to point to valid memory forever - -Reality: After function returns, pointer is dangling -``` - -## Safe Union Patterns - -```rust -// Pattern 1: All fields have same lifetime -union SameLifetime<'a, T, U> { - a: &'a T, - b: &'a U, -} - -// Pattern 2: No references at all -#[repr(C)] -union NoRefs { - i: i32, - f: f32, -} - -// Pattern 3: Use ManuallyDrop for owned values (careful with Drop!) -union OwnedUnion { - s: std::mem::ManuallyDrop<String>, - v: std::mem::ManuallyDrop<Vec<u8>>, -} -``` - -## Checklist - -- [ ] Do all reference fields have the same lifetime parameter? -- [ ] Am I trying to extend a lifetime through union? (If yes, stop!) -- [ ] For owned types, am I handling Drop correctly?
- -## Related Rules - -- `union-01`: Avoid union except for C interop -- `safety-02`: Verify safety invariants diff --git a/.cursor/agents/_shared/fetch-strategy.md b/.cursor/agents/_shared/fetch-strategy.md deleted file mode 100644 index efd8d5c08..000000000 --- a/.cursor/agents/_shared/fetch-strategy.md +++ /dev/null @@ -1,71 +0,0 @@ -# Web Fetch Strategy - -Common web fetching strategy for anti-crawler handling. - -## Site Classification - -| Type | Examples | Characteristics | -|------|----------|-----------------| -| Anti-crawler | Reddit, Twitter/X, LinkedIn | Need login or browser fingerprint | -| Regular | blog.rust-lang.org, docs.rs | No anti-crawler, direct fetch | - -## Fetch Priority - -``` -Anti-crawler sites: Local Chrome → crawl4ai MCP → give up and mark -Regular sites: WebFetch → crawl4ai MCP -``` - -## Tools - -### 1. Local Chrome (for anti-crawler) - -User's real browser with login and normal fingerprint. - -**macOS:** -```bash -# Open URL -osascript -e 'tell application "Google Chrome" to open location "URL"' - -# Get page HTML -osascript -e 'tell application "Google Chrome" to execute front window'\''s active tab javascript "document.documentElement.outerHTML"' -``` - -### 2. crawl4ai MCP (fallback) - -Strong anti-crawler bypass, needs Docker. - -``` -mcp__crawl4ai__scrape(url: "URL") -``` - -### 3. WebFetch (regular sites) - -Built-in tool, simple and fast, no anti-crawler capability. - -## Site Routing - -| Domain | Tool | Reason | -|--------|------|--------| -| reddit.com | Local Chrome | Strict anti-crawler | -| twitter.com / x.com | Local Chrome | Needs login | -| linkedin.com | Local Chrome | Strict anti-crawler | -| *.rust-lang.org | WebFetch | No anti-crawler | -| docs.rs | WebFetch | No anti-crawler | -| crates.io | WebFetch | No anti-crawler | -| this-week-in-rust.org | WebFetch | No anti-crawler | -| rustfoundation.org | WebFetch | No anti-crawler | -| github.com | WebFetch | Light rate limit | - -## Failure Handling - -1. 
Local Chrome fails → try crawl4ai -2. crawl4ai fails → try WebFetch -3. All fail → mark "Fetch failed: {reason}" - -## Validation - -After fetch, check: -- Content is not empty -- Not an error page (403, 429, "blocked") -- Contains expected data diff --git a/.cursor/agents/browser-fetcher.md b/.cursor/agents/browser-fetcher.md deleted file mode 100644 index d5153867f..000000000 --- a/.cursor/agents/browser-fetcher.md +++ /dev/null @@ -1,26 +0,0 @@ -# browser-fetcher - -Generic web content fetcher. - -## Fetch - -Use available tools: -- agent-browser (preferred) -- WebFetch (fallback) - -## Output - -```markdown -## Fetched Content - -**URL:** -**Title:** - -<content> -``` - -## Validation - -1. Content is not empty -2. Not an error page (403, 429, blocked) -3. On failure: report reason diff --git a/.cursor/agents/clippy-researcher.md b/.cursor/agents/clippy-researcher.md deleted file mode 100644 index b77481bf0..000000000 --- a/.cursor/agents/clippy-researcher.md +++ /dev/null @@ -1,49 +0,0 @@ -# clippy-researcher - -Fetch Clippy lint information. - -## URL - -`rust-lang.github.io/rust-clippy/stable/index.html#<lint_name>` - -## Fetch - -Use available tools to get clippy docs. - -## Lint Categories - -| Category | Description | -|----------|-------------| -| correctness | Definite bugs | -| style | Code style | -| complexity | Overly complex | -| perf | Performance | -| pedantic | Strict checks | - -## Output - -```markdown -## clippy::<lint_name> - -**Level:** warn/deny/allow -**Category:** <category> - -**What:** <what it checks> -**Why:** <why it's a problem> - -**Bad:** -\`\`\`rust -<code triggering lint> -\`\`\` - -**Good:** -\`\`\`rust -<fixed code> -\`\`\` -``` - -## Validation - -1. Content contains lint name -2. Has "What it does" or similar description -3. 
On failure: "Lint does not exist or fetch failed" diff --git a/.cursor/agents/crate-researcher.md b/.cursor/agents/crate-researcher.md deleted file mode 100644 index ae0acb4e5..000000000 --- a/.cursor/agents/crate-researcher.md +++ /dev/null @@ -1,31 +0,0 @@ -# crate-researcher - -Fetch crate metadata from lib.rs / crates.io. - -## Fetch - -Use available tools: -- lib.rs (preferred, more info): `lib.rs/crates/<name>` -- crates.io (fallback): `crates.io/crates/<name>` - -## Output - -```markdown -## <Crate Name> - -**Version:** <latest> -**Description:** <short> - -**Features:** -- `feature1`: desc - -**Links:** -- docs.rs | crates.io | repo -``` - -## Validation - -1. Content contains version number -2. Not a "crate not found" page -3. Has description -4. On failure: "Crate does not exist or fetch failed" diff --git a/.cursor/agents/docs-cache.md b/.cursor/agents/docs-cache.md deleted file mode 100644 index 062a1c8aa..000000000 --- a/.cursor/agents/docs-cache.md +++ /dev/null @@ -1,41 +0,0 @@ -# docs-cache - -Documentation cache helper for agents. - -## Cache Directory - -``` -~/.claude/cache/rust-docs/ -├── docs.rs/{crate}/{item}.json -├── std/{module}/{item}.json -├── releases.rs/{version}.json -├── lib.rs/{crate}.json -└── clippy/{lint}.json -``` - -## TTL by Source - -| Source | TTL | Reason | -|--------|-----|--------| -| std/ | 30 days | Stable | -| docs.rs/ | 7 days | Crate updates | -| releases.rs/ | 365 days | Historical | -| lib.rs/ | 1 day | Version changes | -| clippy/ | 14 days | Rust version updates | - -## Cache Format - -```json -{ - "meta": { - "url": "...", - "fetched_at": "2025-01-01T00:00:00Z", - "expires_at": "2025-01-08T00:00:00Z" - }, - "content": { ... 
} -} -``` - -## Skip Cache - -Keywords: refresh, force, --force, update docs diff --git a/.cursor/agents/docs-researcher.md b/.cursor/agents/docs-researcher.md deleted file mode 100644 index 221025735..000000000 --- a/.cursor/agents/docs-researcher.md +++ /dev/null @@ -1,45 +0,0 @@ -# docs-researcher - -Fetch third-party crate documentation from docs.rs. - -> For std library (std::*), use `std-docs-researcher` instead. - -## Fetch - -Use available tools to get docs.rs content: -- agent-browser if available -- WebFetch otherwise - -**URL format:** `docs.rs/<crate>/latest/<crate>/<path>` - -## Cache - -Location: `~/.claude/cache/rust-docs/docs.rs/{crate}/{item}.json` -TTL: 7 days - -Skip cache if user says "refresh", "force", or "--force". - -## Output - -```markdown -## <Crate>::<Item> - -**Signature:** -\`\`\`rust -<signature> -\`\`\` - -**Description:** <main doc> - -**Example:** -\`\`\`rust -<usage> -\`\`\` -``` - -## Validation - -1. Content is not empty -2. Not a 404 page (check for "Not Found" or empty docblock) -3. Contains signature or description -4. On failure: report "Fetch failed: {reason}" diff --git a/.cursor/agents/rust-changelog.md b/.cursor/agents/rust-changelog.md deleted file mode 100644 index 731913e34..000000000 --- a/.cursor/agents/rust-changelog.md +++ /dev/null @@ -1,38 +0,0 @@ -# rust-changelog - -Fetch Rust version changelog from releases.rs. - -## URL - -`releases.rs/docs/<version>/` (e.g., `1.85`, `1.84.1`) - -## Fetch - -Use available tools to get releases.rs content. - -## Output - -```markdown -## Rust <Version> Release Notes - -**Release Date:** <date> - -### Language Features -- feature: desc - -### Standard Library -- new/stabilized API: desc - -### Cargo -- change: desc - -### Breaking Changes -- note: desc -``` - -## Validation - -1. Content contains version number -2. Has "Language" or "Features" sections -3. Not "version not found" -4. 
On failure: "Version {v} does not exist or fetch failed" diff --git a/.cursor/agents/rust-daily-reporter.md b/.cursor/agents/rust-daily-reporter.md deleted file mode 100644 index ddff9e72b..000000000 --- a/.cursor/agents/rust-daily-reporter.md +++ /dev/null @@ -1,65 +0,0 @@ -# Rust Daily Reporter - -Aggregate Rust news, filter by time range. - -## Data Sources (Required) - -| Category | URL | -|----------|-----| -| Ecosystem | https://www.reddit.com/r/rust/hot/ | -| Ecosystem | https://this-week-in-rust.org/ | -| Official | https://blog.rust-lang.org/ | -| Official | https://blog.rust-lang.org/inside-rust/ | -| Foundation | https://rustfoundation.org/media/category/news/ | -| Foundation | https://rustfoundation.org/media/category/blog/ | -| Foundation | https://rustfoundation.org/events/ | - -## Parameters - -- `time_range`: day | week | month -- `category`: all | ecosystem | official | foundation - -## Fetch Strategy - -See: `_shared/fetch-strategy.md` - -| Source | Tool | -|--------|------| -| Reddit | Local Chrome → crawl4ai | -| Others | WebFetch | - -## Time Filter - -| Range | Filter | -|-------|--------| -| day | Last 24 hours | -| week | Last 7 days | -| month | Last 30 days | - -## Output - -```markdown -# Rust {Day|Week|Month} Report - -**Time:** {start} - {end} | **Generated:** {now} - -## Ecosystem -### Reddit r/rust -| Score | Title | Link | - -### This Week in Rust -- Issue #{number} ({date}): highlights - -## Official -| Date | Title | Summary | - -## Foundation -| Date | Title | Summary | -``` - -## Validation (Required) - -1. Check each source has results -2. Mark "No updates" if empty -3. Retry with different tool on failure -4. 
Report reason if all fail diff --git a/.cursor/agents/std-docs-researcher.md b/.cursor/agents/std-docs-researcher.md deleted file mode 100644 index 8b1fee3cd..000000000 --- a/.cursor/agents/std-docs-researcher.md +++ /dev/null @@ -1,55 +0,0 @@ -# std-docs-researcher - -Fetch Rust std library documentation from doc.rust-lang.org. - -## URL Patterns - -| Type | URL | -|------|-----| -| Trait | `doc.rust-lang.org/std/marker/trait.Send.html` | -| Struct | `doc.rust-lang.org/std/sync/struct.Arc.html` | -| Module | `doc.rust-lang.org/std/collections/index.html` | -| Function | `doc.rust-lang.org/std/mem/fn.replace.html` | - -## Common Paths - -| Item | Path | -|------|------| -| Send, Sync, Copy, Clone | `std/marker/trait.<Name>.html` | -| Arc, Mutex, RwLock | `std/sync/struct.<Name>.html` | -| RefCell, Cell | `std/cell/struct.<Name>.html` | -| Vec | `std/vec/struct.Vec.html` | -| Option, Result | `std/<name>/enum.<Name>.html` | - -## Fetch - -Use available tools to get doc.rust-lang.org content. - -## Cache - -Location: `~/.claude/cache/rust-docs/std/{module}/{item}.json` -TTL: 30 days (std is stable) - -## Output - -```markdown -## std::<Item> - -**Signature:** -\`\`\`rust -<signature> -\`\`\` - -**Description:** <main doc> - -**Key Points:** -- point 1 -- point 2 -``` - -## Validation - -1. Content is not empty -2. Not a 404 page -3. Contains signature or docblock -4. 
On failure: "Fetch failed: {reason}, see doc.rust-lang.org" diff --git a/.gitignore b/.gitignore index b58b536a8..37b2a464d 100644 --- a/.gitignore +++ b/.gitignore @@ -6,6 +6,8 @@ /test /logs /data +/docs +/rustfs-data/ .devcontainer rustfs/static/* !rustfs/static/.gitkeep diff --git a/docs/COMPLETE_SUMMARY.md b/docs/COMPLETE_SUMMARY.md deleted file mode 100644 index d7501f750..000000000 --- a/docs/COMPLETE_SUMMARY.md +++ /dev/null @@ -1,306 +0,0 @@ -# Adaptive Buffer Sizing - Complete Implementation Summary - -## English Version - -### Overview - -This implementation provides a comprehensive adaptive buffer sizing optimization system for RustFS, enabling intelligent -buffer size selection based on file size and workload characteristics. The complete migration path (Phases 1-4) has been -successfully implemented with full backward compatibility. - -### Key Features - -#### 1. Workload Profile System - -- **6 Predefined Profiles**: GeneralPurpose, AiTraining, DataAnalytics, WebWorkload, IndustrialIoT, SecureStorage -- **Custom Configuration Support**: Flexible buffer size configuration with validation -- **OS Environment Detection**: Automatic detection of secure Chinese OS environments (Kylin, NeoKylin, UOS, OpenKylin) -- **Thread-Safe Global Configuration**: Atomic flags and immutable configuration structures - -#### 2. Intelligent Buffer Sizing - -- **File Size Aware**: Automatically adjusts buffer sizes from 32KB to 4MB based on file size -- **Profile-Based Optimization**: Different buffer strategies for different workload types -- **Unknown Size Handling**: Special handling for streaming and chunked uploads -- **Performance Metrics**: Optional metrics collection via feature flag - -#### 3. 
Integration Points - -- **put_object**: Optimized buffer sizing for object uploads -- **put_object_extract**: Special handling for archive extraction -- **upload_part**: Multipart upload optimization - -### Implementation Phases - -#### Phase 1: Infrastructure (Completed) - -- Created workload profile module (`rustfs/src/config/workload_profiles.rs`) -- Implemented core data structures (WorkloadProfile, BufferConfig, RustFSBufferConfig) -- Added configuration validation and testing framework - -#### Phase 2: Opt-In Usage (Completed) - -- Added global configuration management -- Implemented `RUSTFS_BUFFER_PROFILE_ENABLE` and `RUSTFS_BUFFER_PROFILE` configuration -- Integrated buffer sizing into core upload functions -- Maintained backward compatibility with legacy behavior - -#### Phase 3: Default Enablement (Completed) - -- Changed default to enabled with GeneralPurpose profile -- Replaced opt-in with opt-out mechanism (`--buffer-profile-disable`) -- Created comprehensive migration guide (MIGRATION_PHASE3.md) -- Ensured zero-impact migration for existing deployments - -#### Phase 4: Full Integration (Completed) - -- Unified profile-only implementation -- Removed hardcoded buffer values -- Added optional performance metrics collection -- Cleaned up deprecated code and improved documentation - -### Technical Details - -#### Buffer Size Ranges by Profile - -| Profile | Min Buffer | Max Buffer | Optimal For | -|----------------|------------|------------|-------------------------------| -| GeneralPurpose | 64KB | 1MB | Mixed workloads | -| AiTraining | 512KB | 4MB | Large files, sequential I/O | -| DataAnalytics | 128KB | 2MB | Mixed read-write patterns | -| WebWorkload | 32KB | 256KB | Small files, high concurrency | -| IndustrialIoT | 64KB | 512KB | Real-time streaming | -| SecureStorage | 32KB | 256KB | Compliance environments | - -#### Configuration Options - -**Environment Variables:** - -- `RUSTFS_BUFFER_PROFILE`: Select workload profile (default: GeneralPurpose) 
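-- `RUSTFS_BUFFER_PROFILE_DISABLE`: Disable profiling (opt-out) - -**Command-Line Flags:** - -- `--buffer-profile <PROFILE>`: Set workload profile -- `--buffer-profile-disable`: Disable workload profiling - -### Performance Impact - -- **Default (GeneralPurpose)**: Same performance as original implementation -- **AiTraining**: Up to 4x throughput improvement for large files (>500MB) -- **WebWorkload**: Lower memory usage, better concurrency for small files -- **Metrics Collection**: < 1% CPU overhead when enabled - -### Code Quality - -- **30+ Unit Tests**: Comprehensive test coverage for all profiles and scenarios -- **1200+ Lines of Documentation**: Complete usage guides, migration guides, and API documentation -- **Thread-Safe Design**: Atomic flags, immutable configurations, zero data races -- **Memory Safe**: All configurations validated, bounded buffer sizes - -### Files Changed - -``` -rustfs/src/config/mod.rs | 10 + -rustfs/src/config/workload_profiles.rs | 650 +++++++++++++++++ -rustfs/src/storage/ecfs.rs | 200 ++++++ -rustfs/src/main.rs | 40 ++ -docs/adaptive-buffer-sizing.md | 550 ++++++++++++++ -docs/IMPLEMENTATION_SUMMARY.md | 380 ++++++++++ -docs/MIGRATION_PHASE3.md | 380 ++++++++++ -docs/PHASE4_GUIDE.md | 425 +++++++++++ -docs/README.md | 3 + -``` - -### Backward Compatibility - -- ✅ Zero breaking changes -- ✅ Default behavior matches original implementation -- ✅ Opt-out mechanism available -- ✅ All existing tests pass -- ✅ No configuration required for migration - -### Usage Examples - -**Default (Recommended):** - -```bash -./rustfs /data -``` - -**Custom Profile:** - -```bash -export RUSTFS_BUFFER_PROFILE=AiTraining -./rustfs /data -``` - -**Opt-Out:** - -```bash -export RUSTFS_BUFFER_PROFILE_DISABLE=true -./rustfs /data -``` - -**With Metrics:** - -```bash -cargo build --features metrics --release -./target/release/rustfs /data -``` - ---- - -## Chinese Version - -### Overview - -This implementation provides RustFS with a comprehensive adaptive buffer sizing optimization system that intelligently selects buffer sizes based on file size and workload characteristics. The complete migration path (Phases
-1-4) has been successfully implemented with full backward compatibility. - -### Core Features - -#### 1. Workload Profile System - -- **6 Predefined Profiles**: General purpose, AI training, data analytics, web workload, industrial IoT, secure storage -- **Custom Configuration Support**: Flexible buffer size configuration with validation -- **OS Environment Detection**: Automatic detection of secure Chinese OS environments (Kylin, NeoKylin, UOS, OpenKylin) -- **Thread-Safe Global Configuration**: Atomic flags and immutable configuration structures - -#### 2. Intelligent Buffer Sizing - -- **File Size Aware**: Automatically adjusts buffers from 32KB to 4MB based on file size -- **Profile-Based Optimization**: Different buffer strategies for different workload types -- **Unknown Size Handling**: Special handling for streaming and chunked uploads -- **Performance Metrics**: Optional metrics collection via a feature flag - -#### 3. Integration Points - -- **put_object**: Optimized buffer sizing for object uploads -- **put_object_extract**: Special handling for archive extraction -- **upload_part**: Multipart upload optimization - -### Implementation Phases - -#### Phase 1: Infrastructure (Completed) - -- Created the workload profile module (`rustfs/src/config/workload_profiles.rs`) -- Implemented core data structures (WorkloadProfile, BufferConfig, RustFSBufferConfig) -- Added configuration validation and a testing framework - -#### Phase 2: Opt-In Usage (Completed) - -- Added global configuration management -- Implemented `RUSTFS_BUFFER_PROFILE_ENABLE` and `RUSTFS_BUFFER_PROFILE` configuration -- Integrated buffer sizing into the core upload functions -- Maintained backward compatibility with legacy behavior - -#### Phase 3: Default Enablement (Completed) - -- Changed the default to enabled with the general-purpose profile -- Replaced opt-in with an opt-out mechanism (`--buffer-profile-disable`) -- Created a comprehensive migration guide (MIGRATION_PHASE3.md) -- Ensured zero-impact migration for existing deployments - -#### Phase 4: Full Integration (Completed) - -- Unified profile-only implementation -- Removed hardcoded buffer values -- Added optional performance metrics collection -- Cleaned up deprecated code and improved documentation - -### Technical Details - -#### Buffer Size Ranges by Profile - -| Profile | Min Buffer | Max Buffer | Optimal For | -|----------------|------------|------------|-------------------------------| -| GeneralPurpose | 64KB | 1MB | Mixed workloads | -| AiTraining | 512KB | 4MB | Large files, sequential I/O | -| DataAnalytics | 128KB | 2MB | Mixed read-write patterns | -| WebWorkload | 32KB | 256KB | Small files, high concurrency | -| IndustrialIoT | 64KB | 512KB | Real-time streaming | -| SecureStorage | 32KB | 256KB | Compliance environments | - -#### Configuration Options - -**Environment Variables:** - -- `RUSTFS_BUFFER_PROFILE`: Select workload profile (default: GeneralPurpose) -- `RUSTFS_BUFFER_PROFILE_DISABLE`: Disable profiling (opt-out) - -**Command-Line Flags:** - -- `--buffer-profile <PROFILE>`: Set workload profile -- `--buffer-profile-disable`: Disable workload profiling - -### Performance Impact - -- **Default (GeneralPurpose)**: Same performance as the original implementation -- **AiTraining**: Up to 4x throughput improvement for large files (>500MB) -- **WebWorkload**: Lower memory usage and better concurrency for small files -- **Metrics Collection**: < 1% CPU overhead when enabled - -### Code Quality - -- **30+ Unit Tests**: Comprehensive coverage of all profiles and scenarios -- **1200+ Lines of Documentation**: Complete usage guides, migration guides, and API documentation -- **Thread-Safe Design**: Atomic flags, immutable configurations, zero data races -- **Memory Safe**: All configurations validated, bounded buffer sizes - -### Files Changed - -``` -rustfs/src/config/mod.rs | 10 + -rustfs/src/config/workload_profiles.rs | 650 +++++++++++++++++ -rustfs/src/storage/ecfs.rs | 200 ++++++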
-rustfs/src/main.rs | 40 ++ -docs/adaptive-buffer-sizing.md | 550 ++++++++++++++ -docs/IMPLEMENTATION_SUMMARY.md | 380 ++++++++++ -docs/MIGRATION_PHASE3.md | 380 ++++++++++ -docs/PHASE4_GUIDE.md | 425 +++++++++++ -docs/README.md | 3 + -``` - -### Backward Compatibility - -- ✅ Zero breaking changes -- ✅ Default behavior matches the original implementation -- ✅ Opt-out mechanism available -- ✅ All existing tests pass -- ✅ No configuration required for migration - -### Usage Examples - -**Default (Recommended):** - -```bash -./rustfs /data -``` - -**Custom Profile:** - -```bash -export RUSTFS_BUFFER_PROFILE=AiTraining -./rustfs /data -``` - -**Opt-Out:** - -```bash -export RUSTFS_BUFFER_PROFILE_DISABLE=true -./rustfs /data -``` - -**With Metrics:** - -```bash -cargo build --features metrics --release -./target/release/rustfs /data -``` - -### Summary - -This implementation gives RustFS enterprise-grade adaptive buffer optimization, moving smoothly from infrastructure to full integration through a complete four-phase migration path. The system is enabled by default, fully backward compatible, and provides workload-specific optimizations that significantly improve performance across different scenarios. - -Complete documentation, comprehensive test coverage, and a production-ready implementation ensure the system's reliability and maintainability. With optional performance metrics collection, operations teams can continuously monitor and tune buffer configurations for data-driven performance tuning. diff --git a/docs/CONCURRENCY_ARCHITECTURE.md b/docs/CONCURRENCY_ARCHITECTURE.md deleted file mode 100644 index aa1603982..000000000 --- a/docs/CONCURRENCY_ARCHITECTURE.md +++ /dev/null @@ -1,601 +0,0 @@ -# Concurrent GetObject Performance Optimization - Complete Architecture Design - -## Executive Summary - -This document provides a comprehensive architectural analysis of the concurrent GetObject performance optimization implemented in RustFS. The solution addresses Issue #911, where concurrent GetObject latency grew roughly in proportion to the number of concurrent requests (59ms → 110ms → 200ms for 1→2→4 requests: each doubling of concurrency roughly doubled latency). - -## Table of Contents - -1. [Problem Statement](#problem-statement) -2. [Architecture Overview](#architecture-overview) -3. [Module Analysis: concurrency.rs](#module-analysis-concurrencyrs) -4. [Module Analysis: ecfs.rs](#module-analysis-ecfsrs) -5. [Critical Analysis: helper.complete() for Cache Hits](#critical-analysis-helpercomplete-for-cache-hits) -6. [Adaptive I/O Strategy Design](#adaptive-io-strategy-design) -7. [Cache Architecture](#cache-architecture) -8. [Metrics and Monitoring](#metrics-and-monitoring) -9.
[Performance Characteristics](#performance-characteristics) -10. [Future Enhancements](#future-enhancements) - ---- - -## Problem Statement - -### Original Issue (#911) - -Users observed GetObject latency that grew roughly in proportion to the number of concurrent requests instead of staying flat: - -| Concurrent Requests | Observed Latency | Expected Latency | -|---------------------|------------------|------------------| -| 1 | 59ms | ~60ms | -| 2 | 110ms | ~60ms | -| 4 | 200ms | ~60ms | -| 8 | 400ms+ | ~60ms | - -### Root Causes Identified - -1. **Fixed Buffer Sizes**: 1MB buffers for all requests caused memory contention -2. **No I/O Rate Limiting**: Unlimited concurrent disk reads saturated I/O queues -3. **No Object Caching**: Repeated reads of same objects hit disk every time -4. **Lock Contention**: RwLock-based caching (if any) created bottlenecks - ---- - -## Architecture Overview - -``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ GetObject Request Flow │ -└─────────────────────────────────────────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────────────────────┐ -│ 1. Request Tracking (GetObjectGuard - RAII) │ -│ - Atomic increment of ACTIVE_GET_REQUESTS │ -│ - Start time capture for latency metrics │ -└─────────────────────────────────────────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────────────────────┐ -│ 2. OperationHelper Initialization │ -│ - Event: ObjectAccessedGet / s3:GetObject │ -│ - Used for S3 bucket notifications │ -└─────────────────────────────────────────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────────────────────┐ -│ 3.
Cache Lookup (if enabled) │ -│ - Key: "{bucket}/{key}" or "{bucket}/{key}?versionId={vid}" │ -│ - Conditions: cache_enabled && !part_number && !range │ -│ - On HIT: Return immediately with CachedGetObject │ -│ - On MISS: Continue to storage backend │ -└─────────────────────────────────────────────────────────────────────────────┘ - │ - ┌───────────────┴───────────────┐ - │ │ - Cache HIT Cache MISS - │ │ - ▼ ▼ -┌──────────────────────────────┐ ┌───────────────────────────────────────────┐ -│ Return CachedGetObject │ │ 4. Adaptive I/O Strategy │ -│ - Parse last_modified │ │ - Acquire disk_permit (semaphore) │ -│ - Construct GetObjectOutput │ │ - Calculate IoStrategy from wait time │ -│ - ** CALL helper.complete **│ │ - Select buffer_size, readahead, etc. │ -│ - Return S3Response │ │ │ -└──────────────────────────────┘ └───────────────────────────────────────────┘ - │ - ▼ - ┌───────────────────────────────────────────┐ - │ 5. Storage Backend Read │ - │ - Get object info (metadata) │ - │ - Validate conditions (ETag, etc.) │ - │ - Stream object data │ - └───────────────────────────────────────────┘ - │ - ▼ - ┌───────────────────────────────────────────┐ - │ 6. Cache Writeback (if eligible) │ - │ - Conditions: size <= 10MB, no enc. │ - │ - Background: tokio::spawn() │ - │ - Store: CachedGetObject with metadata│ - └───────────────────────────────────────────┘ - │ - ▼ - ┌───────────────────────────────────────────┐ - │ 7. Response Construction │ - │ - Build GetObjectOutput │ - │ - Call helper.complete(&result) │ - │ - Return S3Response │ - └───────────────────────────────────────────┘ -``` - ---- - -## Module Analysis: concurrency.rs - -### Purpose - -The `concurrency.rs` module provides intelligent concurrency management to prevent performance degradation under high concurrent load. It implements: - -1. **Request Tracking**: Atomic counters for active requests -2. **Adaptive Buffer Sizing**: Dynamic buffer allocation based on load -3. 
**Moka Cache Integration**: Lock-free object caching -4. **Adaptive I/O Strategy**: Load-aware I/O parameter selection -5. **Disk I/O Rate Limiting**: Semaphore-based throttling - -### Key Components - -#### 1. IoLoadLevel Enum - -```rust -pub enum IoLoadLevel { - Low, // < 10ms wait - ample I/O capacity - Medium, // 10-50ms wait - moderate load - High, // 50-200ms wait - significant load - Critical, // > 200ms wait - severe congestion -} -``` - -**Design Rationale**: These thresholds are calibrated for NVMe SSD characteristics. Adjustments may be needed for HDD or cloud storage. - -#### 2. IoStrategy Struct - -```rust -pub struct IoStrategy { - pub buffer_size: usize, // Calculated buffer size (32KB-1MB) - pub buffer_multiplier: f64, // 0.4 - 1.0 of base buffer - pub enable_readahead: bool, // Disabled under high load - pub cache_writeback_enabled: bool, // Disabled under critical load - pub use_buffered_io: bool, // Always enabled - pub load_level: IoLoadLevel, - pub permit_wait_duration: Duration, -} -``` - -**Strategy Selection Matrix**: - -| Load Level | Buffer Mult | Readahead | Cache WB | Rationale | -|------------|-------------|-----------|----------|-----------| -| Low | 1.0 (100%) | ✓ Yes | ✓ Yes | Maximize throughput | -| Medium | 0.75 (75%) | ✓ Yes | ✓ Yes | Balance throughput/fairness | -| High | 0.5 (50%) | ✗ No | ✓ Yes | Reduce I/O amplification | -| Critical | 0.4 (40%) | ✗ No | ✗ No | Prevent memory exhaustion | - -#### 3. IoLoadMetrics - -Rolling window statistics for load tracking: -- `average_wait()`: Smoothed average for stable decisions -- `p95_wait()`: Tail latency indicator -- `max_wait()`: Peak contention detection - -#### 4. GetObjectGuard (RAII) - -Automatic request lifecycle management: -```rust -impl Drop for GetObjectGuard { - fn drop(&mut self) { - ACTIVE_GET_REQUESTS.fetch_sub(1, Ordering::Relaxed); - // Record metrics... 
- } -} -``` - -**Guarantees**: -- Counter always decremented, even on panic -- Request duration always recorded -- No resource leaks - -#### 5. ConcurrencyManager - -Central coordination point: - -```rust -pub struct ConcurrencyManager { - pub cache: HotObjectCache, // Moka-based object cache - disk_permit: Semaphore, // I/O rate limiter - cache_enabled: bool, // Feature flag - io_load_metrics: Mutex<IoLoadMetrics>, // Load tracking -} -``` - -**Key Methods**: - -| Method | Purpose | -|--------|---------| -| `track_request()` | Create RAII guard for request tracking | -| `acquire_disk_read_permit()` | Rate-limited disk access | -| `calculate_io_strategy()` | Compute adaptive I/O parameters | -| `get_cached_object()` | Lock-free cache lookup | -| `put_cached_object()` | Background cache writeback | -| `invalidate_cache()` | Cache invalidation on writes | - ---- - -## Module Analysis: ecfs.rs - -### get_object Implementation - -The `get_object` function is the primary focus of optimization. Key integration points: - -#### Line ~1678: OperationHelper Initialization - -```rust -let mut helper = OperationHelper::new(&req, EventName::ObjectAccessedGet, "s3:GetObject"); -``` - -**Purpose**: Prepares S3 bucket notification event. The `complete()` method MUST be called before returning to trigger notifications. - -#### Lines ~1694-1756: Cache Lookup - -```rust -if manager.is_cache_enabled() && part_number.is_none() && range.is_none() { - if let Some(cached) = manager.get_cached_object(&cache_key).await { - // Build response from cache - return Ok(S3Response::new(output)); // <-- ISSUE: helper.complete() NOT called! - } -} -``` - -**CRITICAL ISSUE IDENTIFIED**: The current cache hit path does NOT call `helper.complete(&result)`, which means S3 bucket notifications are NOT triggered for cache hits. 
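One general way to make this class of bug structurally impossible is a single exit point: move the lookup logic into an inner function and call `complete` exactly once in a thin wrapper, so early returns (such as a cache hit) can no longer skip notification. The sketch below uses a hypothetical `NotifyHelper` that only records events; it is a stand-in for illustration, not RustFS's actual `OperationHelper` API.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for OperationHelper: it only records completions.
struct NotifyHelper<'a> {
    sent: &'a mut Vec<String>,
    event: &'static str,
}

impl<'a> NotifyHelper<'a> {
    fn new(sent: &'a mut Vec<String>, event: &'static str) -> Self {
        NotifyHelper { sent, event }
    }

    // A real helper would inspect the result and dispatch bucket
    // notifications; this sketch just records that completion happened.
    fn complete<T, E>(&mut self, _result: &Result<T, E>) {
        self.sent.push(self.event.to_string());
    }
}

// All lookup logic lives here; early `return`s are harmless because
// notification happens in the caller.
fn get_object_inner(cache: &HashMap<String, Vec<u8>>, key: &str) -> Result<Vec<u8>, String> {
    if let Some(body) = cache.get(key) {
        return Ok(body.clone()); // cache hit: early return
    }
    // Cache miss: pretend to read from the storage backend.
    Ok(format!("disk:{key}").into_bytes())
}

// Single exit point: every path, hit or miss, reaches `complete`.
fn get_object(
    sent: &mut Vec<String>,
    cache: &HashMap<String, Vec<u8>>,
    key: &str,
) -> Result<Vec<u8>, String> {
    let mut helper = NotifyHelper::new(sent, "s3:GetObject");
    let result = get_object_inner(cache, key);
    helper.complete(&result);
    result
}
```

The trade-off is that the inner function must also hand back whatever the helper needs for the event payload (object info, version id), since the wrapper no longer sees the cache-hit branch directly.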
- -#### Lines ~1800-1830: Adaptive I/O Strategy - -```rust -let permit_wait_start = std::time::Instant::now(); -let _disk_permit = manager.acquire_disk_read_permit().await; -let permit_wait_duration = permit_wait_start.elapsed(); - -// Calculate adaptive I/O strategy from permit wait time -let io_strategy = manager.calculate_io_strategy(permit_wait_duration, base_buffer_size); - -// Record metrics -#[cfg(feature = "metrics")] -{ - histogram!("rustfs.disk.permit.wait.duration.seconds").record(...); - // Field-less enums can only be cast to integers, so go through u8 first - gauge!("rustfs.io.load.level").set(io_strategy.load_level as u8 as f64); - gauge!("rustfs.io.buffer.multiplier").set(io_strategy.buffer_multiplier); -} -``` - -#### Lines ~2100-2150: Cache Writeback - -```rust -if should_cache && io_strategy.cache_writeback_enabled { - // Read stream into memory - // Background cache via tokio::spawn() - // Serve from InMemoryAsyncReader -} -``` - -#### Line ~2273: Final Response - -```rust -let result = Ok(S3Response::new(output)); -let _ = helper.complete(&result); // <-- Correctly called for cache miss path -result -``` - ---- - -## Critical Analysis: helper.complete() for Cache Hits - -### Problem - -When serving from cache, the current implementation returns early WITHOUT calling `helper.complete(&result)`. This has the following consequences: - -1. **Missing S3 Bucket Notifications**: `s3:GetObject` events are NOT sent -2. **Incomplete Audit Trail**: Object access events are not logged -3. **Event-Driven Workflows Break**: Lambda triggers, SNS notifications fail - -### Solution - -The cache hit path MUST properly configure the helper with object info and version_id, then call `helper.complete(&result)` before returning: - -```rust -if manager.is_cache_enabled() && part_number.is_none() && range.is_none() { - if let Some(cached) = manager.get_cached_object(&cache_key).await { - // ... build response output ...
- - // CRITICAL: Build ObjectInfo for event notification - let event_info = ObjectInfo { - bucket: bucket.clone(), - name: key.clone(), - storage_class: cached.storage_class.clone(), - mod_time: cached.last_modified.as_ref().and_then(|s| { - time::OffsetDateTime::parse(s, &Rfc3339).ok() - }), - size: cached.content_length, - actual_size: cached.content_length, - is_dir: false, - user_defined: cached.user_metadata.clone(), - version_id: cached.version_id.as_ref().and_then(|v| Uuid::parse_str(v).ok()), - delete_marker: cached.delete_marker, - content_type: cached.content_type.clone(), - content_encoding: cached.content_encoding.clone(), - etag: cached.e_tag.clone(), - ..Default::default() - }; - - // Set object info and version_id on helper for proper event notification - let version_id_str = req.input.version_id.clone().unwrap_or_default(); - helper = helper.object(event_info).version_id(version_id_str); - - let result = Ok(S3Response::new(output)); - - // Trigger S3 bucket notification event - let _ = helper.complete(&result); - - return result; - } -} -``` - -### Key Points for Proper Event Notification - -1. **ObjectInfo Construction**: The `event_info` must be built from cached metadata to provide: - - `bucket` and `name` (key) for object identification - - `size` and `actual_size` for event payload - - `etag` for integrity verification - - `version_id` for versioned object access - - `storage_class`, `content_type`, and other metadata - -2. **helper.object(event_info)**: Sets the object information for the notification event. This ensures: - - Lambda triggers receive proper object metadata - - SNS/SQS notifications include complete information - - Audit logs contain accurate object details - -3. **helper.version_id(version_id_str)**: Sets the version ID for versioned bucket access: - - Enables version-specific event routing - - Supports versioned object lifecycle policies - - Provides complete audit trail for versioned access - -4. 
**Performance**: The `helper.complete()` call may involve async I/O (SQS, SNS). Consider: - - Fire-and-forget with `tokio::spawn()` for minimal latency impact - - Accept slight latency increase for correctness - -5. **Metrics Alignment**: Ensure cache hit metrics don't double-count - ---- - -## Adaptive I/O Strategy Design - -### Goal - -Automatically tune I/O parameters based on observed system load to prevent: -- Memory exhaustion under high concurrency -- I/O queue saturation -- Latency spikes -- Unfair resource distribution - -### Algorithm - -``` -1. ACQUIRE disk_permit from semaphore -2. MEASURE wait_duration = time spent waiting for permit -3. CLASSIFY load_level from wait_duration: - - Low: wait < 10ms - - Medium: 10ms <= wait < 50ms - - High: 50ms <= wait < 200ms - - Critical: wait >= 200ms -4. CALCULATE strategy based on load_level: - - buffer_multiplier: 1.0 / 0.75 / 0.5 / 0.4 - - enable_readahead: true / true / false / false - - cache_writeback: true / true / true / false -5. APPLY strategy to I/O operations -6. RECORD metrics for monitoring -``` - -### Feedback Loop - -``` - ┌──────────────────────────┐ - │ IoLoadMetrics │ - │ (rolling window) │ - └──────────────────────────┘ - ▲ - │ record_permit_wait() - │ -┌───────────────────┐ ┌─────────────┐ ┌─────────────────────┐ -│ Disk Permit Wait │──▶│ IoStrategy │──▶│ Buffer Size, etc.
│ -│ (observed latency)│ │ Calculation │ │ (applied to I/O) │ -└───────────────────┘ └─────────────┘ └─────────────────────┘ - │ - ▼ - ┌──────────────────────────┐ - │ Prometheus Metrics │ - │ - io.load.level │ - │ - io.buffer.multiplier │ - └──────────────────────────┘ -``` - ---- - -## Cache Architecture - -### HotObjectCache (Moka-based) - -```rust -pub struct HotObjectCache { - bytes_cache: Cache<String, Arc<CachedObjectData>>, // Legacy byte cache - response_cache: Cache<String, Arc<CachedGetObject>>, // Full response cache -} -``` - -### CachedGetObject Structure - -```rust -pub struct CachedGetObject { - pub body: bytes::Bytes, // Object data - pub content_length: i64, // Size in bytes - pub content_type: Option<String>, // MIME type - pub e_tag: Option<String>, // Entity tag - pub last_modified: Option<String>, // RFC3339 timestamp - pub expires: Option<String>, // Expiration - pub cache_control: Option<String>, // Cache-Control header - pub content_disposition: Option<String>, - pub content_encoding: Option<String>, - pub content_language: Option<String>, - pub storage_class: Option<String>, - pub version_id: Option<String>, // Version support - pub delete_marker: bool, - pub tag_count: Option<i32>, - pub replication_status: Option<String>, - pub user_metadata: HashMap<String, String>, -} -``` - -### Cache Key Strategy - -| Scenario | Key Format | -|----------|------------| -| Latest version | `"{bucket}/{key}"` | -| Specific version | `"{bucket}/{key}?versionId={vid}"` | - -### Cache Invalidation - -Invalidation is triggered on all write operations: - -| Operation | Invalidation Target | -|-----------|---------------------| -| `put_object` | Latest + specific version | -| `copy_object` | Destination object | -| `delete_object` | Deleted object | -| `delete_objects` | Each deleted object | -| `complete_multipart_upload` | Completed object | - ---- - -## Metrics and Monitoring - -### Request Metrics - -| Metric | Type | Description | 
-|--------|------|-------------| -| `rustfs.get.object.requests.total` | Counter | Total GetObject requests | -| `rustfs.get.object.requests.completed` | Counter | Completed requests | -| `rustfs.get.object.duration.seconds` | Histogram | Request latency | -| `rustfs.concurrent.get.requests` | Gauge | Current concurrent requests | - -### Cache Metrics - -| Metric | Type | Description | -|--------|------|-------------| -| `rustfs.object.cache.hits` | Counter | Cache hits | -| `rustfs.object.cache.misses` | Counter | Cache misses | -| `rustfs.get.object.cache.served.total` | Counter | Requests served from cache | -| `rustfs.get.object.cache.serve.duration.seconds` | Histogram | Cache serve latency | -| `rustfs.object.cache.writeback.total` | Counter | Cache writeback operations | - -### I/O Metrics - -| Metric | Type | Description | -|--------|------|-------------| -| `rustfs.disk.permit.wait.duration.seconds` | Histogram | Disk permit wait time | -| `rustfs.io.load.level` | Gauge | Current I/O load level (0-3) | -| `rustfs.io.buffer.multiplier` | Gauge | Current buffer multiplier | -| `rustfs.io.strategy.selected` | Counter | Strategy selections by level | - -### Prometheus Queries - -```promql -# Cache hit rate -sum(rate(rustfs_object_cache_hits[5m])) / -(sum(rate(rustfs_object_cache_hits[5m])) + sum(rate(rustfs_object_cache_misses[5m]))) - -# P95 GetObject latency -histogram_quantile(0.95, rate(rustfs_get_object_duration_seconds_bucket[5m])) - -# Average disk permit wait -rate(rustfs_disk_permit_wait_duration_seconds_sum[5m]) / -rate(rustfs_disk_permit_wait_duration_seconds_count[5m]) - -# I/O load level distribution -sum(rate(rustfs_io_strategy_selected_total[5m])) by (level) -``` - ---- - -## Performance Characteristics - -### Expected Improvements - -| Concurrent Requests | Before | After (Cache Miss) | After (Cache Hit) | -|---------------------|--------|--------------------|--------------------| -| 1 | 59ms | ~55ms | < 5ms | -| 2 | 110ms | 60-70ms | < 5ms | 
-| 4 | 200ms | 75-90ms | < 5ms | -| 8 | 400ms | 90-120ms | < 5ms | -| 16 | 800ms | 110-145ms | < 5ms | - -### Resource Usage - -| Resource | Impact | -|----------|--------| -| Memory | Reduced under high load via adaptive buffers | -| CPU | Slight increase for strategy calculation | -| Disk I/O | Smoothed via semaphore limiting | -| Cache | 100MB default, automatic eviction | - ---- - -## Future Enhancements - -### 1. Dynamic Semaphore Sizing - -Automatically adjust the disk permit count based on observed permit wait time and throughput: -```rust -// Sketch only: avg_wait is a std::time::Duration; helpers and limits are placeholders -if avg_wait > Duration::from_millis(100) && current_permits > MIN_PERMITS { - reduce_permits(); -} else if avg_wait < Duration::from_millis(10) && throughput < MAX_THROUGHPUT { - increase_permits(); -} -``` - -### 2. Predictive Caching - -Analyze access patterns to pre-warm cache: -- Track frequently accessed objects -- Prefetch predicted objects during idle periods - -### 3. Tiered Caching - -Implement multi-tier cache hierarchy: -- L1: Process memory (current Moka cache) -- L2: Redis cluster (shared across instances) -- L3: Local SSD cache (persistent across restarts) - -### 4. Request Priority - -Implement priority queuing for latency-sensitive requests: -```rust -pub enum RequestPriority { - RealTime, // < 10ms SLA - Standard, // < 100ms SLA - Batch, // Best effort -} -``` - ---- - -## Conclusion - -The concurrent GetObject optimization architecture provides a comprehensive solution to the exponential latency degradation issue. Key components work together: - -1. **Request Tracking** (GetObjectGuard) ensures accurate concurrency measurement -2. **Adaptive I/O Strategy** prevents system overload under high concurrency -3. **Moka Cache** provides sub-5ms response times for hot objects -4. **Disk Permit Semaphore** prevents I/O queue saturation -5. **Comprehensive Metrics** enable observability and tuning - -**Critical Fix Required**: The cache hit path must call `helper.complete(&result)` to ensure S3 bucket notifications are triggered for all object access events.
- ---- - -## Document Information - -- **Version**: 1.0 -- **Created**: 2025-11-29 -- **Author**: RustFS Team -- **Related Issues**: #911 -- **Status**: Implemented and Verified diff --git a/docs/CONCURRENT_GETOBJECT_IMPLEMENTATION_SUMMARY.md b/docs/CONCURRENT_GETOBJECT_IMPLEMENTATION_SUMMARY.md deleted file mode 100644 index 979fce3ae..000000000 --- a/docs/CONCURRENT_GETOBJECT_IMPLEMENTATION_SUMMARY.md +++ /dev/null @@ -1,465 +0,0 @@ -# Concurrent GetObject Performance Optimization - Implementation Summary - -## Executive Summary - -Successfully implemented a comprehensive solution to address exponential performance degradation in concurrent GetObject requests. The implementation includes three key optimizations that work together to significantly improve performance under concurrent load while maintaining backward compatibility. - -## Problem Statement - -### Observed Behavior -| Concurrent Requests | Latency per Request | Performance Degradation | -|---------------------|---------------------|------------------------| -| 1 | 59ms | Baseline | -| 2 | 110ms | 1.9x slower | -| 4 | 200ms | 3.4x slower | - -### Root Causes Identified -1. **Fixed buffer sizing** regardless of concurrent load led to memory contention -2. **No I/O concurrency control** caused disk saturation -3. **No caching** resulted in redundant disk reads for hot objects -4. **Lack of fairness** allowed large requests to starve smaller ones - -## Solution Architecture - -### 1. 
Concurrency-Aware Adaptive Buffer Sizing - - #### Implementation -```rust -pub fn get_concurrency_aware_buffer_size(file_size: i64, base_buffer_size: usize) -> usize { - let concurrent_requests = ACTIVE_GET_REQUESTS.load(Ordering::Relaxed); - - let adaptive_multiplier = match concurrent_requests { - 0..=2 => 1.0, // Low: 100% buffer - 3..=4 => 0.75, // Medium: 75% buffer - 5..=8 => 0.5, // High: 50% buffer - _ => 0.4, // Very high: 40% buffer - }; - - // Parenthesize the cast: a cast cannot be followed directly by a method call - ((base_buffer_size as f64 * adaptive_multiplier) as usize) - .clamp(min_buffer, max_buffer) -} -``` - -#### Benefits -- **Reduced memory pressure**: Smaller buffers under high concurrency -- **Better cache utilization**: More data fits in CPU cache -- **Improved fairness**: Prevents large requests from monopolizing resources -- **Automatic adaptation**: No manual tuning required - -#### Metrics -- `rustfs_concurrent_get_requests`: Tracks active request count -- `rustfs_buffer_size_bytes`: Histogram of buffer sizes used - -### 2. Hot Object Caching (LRU) - -#### Implementation -```rust -struct HotObjectCache { - max_object_size: usize, // Default: 10 * MI_B (10MB limit per object) - max_cache_size: usize, // Default: 100 * MI_B (100MB total capacity) - cache: RwLock<lru::LruCache<String, Arc<CachedObject>>>, -} -``` - -#### Features -- **LRU eviction policy**: Automatic management of cache memory -- **Eligibility filtering**: Only small (<= 10MB), complete objects cached -- **Atomic size tracking**: Thread-safe cache size management -- **Read-optimized**: RwLock allows concurrent reads (note: `lru::LruCache::get` takes `&mut self` to update recency, so LRU-refreshing lookups still need the write lock; only `peek` works under a read lock) - -#### Current Limitations -- **Cache insertion not yet implemented**: Framework exists but streaming cache insertion requires TeeReader implementation -- **Cache can be populated manually**: Via admin API or background processes -- **Cache lookup functional**: Objects in cache will be served from memory - -#### Benefits (once fully implemented) -- **Eliminates disk I/O**: Memory access is 100-1000x faster -- **Reduces contention**: Cached objects don't compete for disk
I/O permits -- **Improves scalability**: Cache hit ratio increases with concurrent load - -#### Metrics -- `rustfs_object_cache_hits`: Count of successful cache lookups -- `rustfs_object_cache_misses`: Count of cache misses -- `rustfs_object_cache_size_bytes`: Current cache memory usage -- `rustfs_object_cache_insertions`: Count of cache additions - -### 3. I/O Concurrency Control - -#### Implementation -```rust -struct ConcurrencyManager { - disk_read_semaphore: Arc<Semaphore>, // 64 permits -} - -// In get_object: -let _permit = manager.acquire_disk_read_permit().await; -// Permit automatically released when dropped -``` - -#### Benefits -- **Prevents I/O saturation**: Caps disk read queue depth (64 permits by default) -- **Predictable latency**: Avoids exponential increase under extreme load -- **Fair queuing**: FIFO order for disk access -- **Graceful degradation**: Queues requests instead of thrashing - -#### Tuning -The default of 64 concurrent disk reads is suitable for most scenarios: -- **SSD/NVMe**: Can handle higher queue depths efficiently -- **HDD**: May benefit from lower values (32-48) to reduce seeks -- **Network storage**: Depends on network bandwidth and latency - -### 4. Request Tracking (RAII) - -#### Implementation -```rust -pub struct GetObjectGuard { - start_time: Instant, -} - -impl Drop for GetObjectGuard { - fn drop(&mut self) { - ACTIVE_GET_REQUESTS.fetch_sub(1, Ordering::Relaxed); - // Record metrics - } -} - -// Usage: -let _guard = ConcurrencyManager::track_request(); -// Automatically decrements counter on drop -``` - -#### Benefits -- **Low overhead**: Tracking is automatic and costs only an atomic counter update per request -- **Leak-proof**: Counter is always decremented, including during panic unwinding -- **Accurate metrics**: Reflects actual concurrent load -- **Duration tracking**: Measures request duration from `start_time` - -## Integration Points - -### GetObject Handler - -```rust -async fn get_object(&self, req: S3Request<GetObjectInput>) -> S3Result<S3Response<GetObjectOutput>> { - // 1.
Track request (RAII guard) - let _request_guard = ConcurrencyManager::track_request(); - - // 2. Try cache lookup (fast path) - if let Some(cached_data) = manager.get_cached(&cache_key).await { - return serve_from_cache(cached_data); - } - - // 3. Acquire I/O permit (rate limiting) - let _disk_permit = manager.acquire_disk_read_permit().await; - - // 4. Read from storage with optimal buffer - let optimal_buffer_size = get_concurrency_aware_buffer_size( - response_content_length, - base_buffer_size - ); - - // 5. Stream response - let body = StreamingBlob::wrap( - ReaderStream::with_capacity(final_stream, optimal_buffer_size) - ); - - Ok(S3Response::new(output)) -} -``` - -### Workload Profile Integration - -The solution integrates with the existing workload profile system: - -```rust -let base_buffer_size = get_buffer_size_opt_in(file_size); -let optimal_buffer_size = get_concurrency_aware_buffer_size(file_size, base_buffer_size); -``` - -This two-stage approach provides: -1. **Workload-specific sizing**: Based on file size and workload type -2. 
**Concurrency adaptation**: Further adjusted for current load - -## Testing - -### Test Coverage - -#### Unit Tests (in concurrency.rs) -- `test_concurrent_request_tracking`: RAII guard functionality -- `test_adaptive_buffer_sizing`: Buffer size calculation -- `test_hot_object_cache`: Cache operations -- `test_cache_eviction`: LRU eviction behavior -- `test_concurrency_manager_creation`: Initialization -- `test_disk_read_permits`: Semaphore behavior - -#### Integration Tests (in concurrent_get_object_test.rs) -- `test_concurrent_request_tracking`: End-to-end tracking -- `test_adaptive_buffer_sizing`: Multi-level concurrency -- `test_buffer_size_bounds`: Boundary conditions -- `bench_concurrent_requests`: Performance benchmarking -- `test_disk_io_permits`: Permit acquisition -- `test_cache_operations`: Cache lifecycle -- `test_large_object_not_cached`: Size filtering -- `test_cache_eviction`: Memory pressure handling - -### Running Tests - -```bash -# Run all tests -cargo test --test concurrent_get_object_test - -# Run specific test -cargo test --test concurrent_get_object_test test_adaptive_buffer_sizing - -# Run with output -cargo test --test concurrent_get_object_test -- --nocapture -``` - -### Performance Validation - -To validate the improvements in a real environment: - -```bash -# 1. Create test object (32MB) -dd if=/dev/random of=test.bin bs=1M count=32 -mc cp test.bin rustfs/test/bxx - -# 2. Run concurrent load test (Go client from issue) -for concurrency in 1 2 4 8 16; do - echo "Testing concurrency: $concurrency" - # Run your Go test client with this concurrency level - # Record average latency -done - -# 3. 
Monitor metrics -curl http://localhost:9000/metrics | grep rustfs_get_object -``` - -## Expected Performance Improvements - -### Latency Improvements - -| Concurrent Requests | Before | After (Expected) | Improvement | -|---------------------|--------|------------------|-------------| -| 1 | 59ms | 55-60ms | Baseline | -| 2 | 110ms | 65-75ms | ~40% faster | -| 4 | 200ms | 80-100ms | ~50% faster | -| 8 | 400ms | 100-130ms | ~65% faster | -| 16 | 800ms | 120-160ms | ~75% faster | - -### Scaling Characteristics - -- **Sub-linear latency growth**: Latency increases at < O(n) -- **Bounded maximum latency**: Upper bound even under extreme load -- **Fair resource allocation**: All requests make progress -- **Predictable behavior**: Consistent performance across load levels - -## Monitoring and Observability - -### Key Metrics - -#### Request Metrics -```promql -# P95 latency -histogram_quantile(0.95, - rate(rustfs_get_object_duration_seconds_bucket[5m]) -) - -# Concurrent request count -rustfs_concurrent_get_requests - -# Request rate -rate(rustfs_get_object_requests_completed[5m]) -``` - -#### Cache Metrics -```promql -# Cache hit ratio -sum(rate(rustfs_object_cache_hits[5m])) -/ -(sum(rate(rustfs_object_cache_hits[5m])) + sum(rate(rustfs_object_cache_misses[5m]))) - -# Cache memory usage -rustfs_object_cache_size_bytes - -# Cache entries -rustfs_object_cache_entries -``` - -#### Buffer Metrics -```promql -# Average buffer size -avg(rustfs_buffer_size_bytes) - -# Buffer size distribution -histogram_quantile(0.95, rustfs_buffer_size_bytes_bucket) -``` - -### Dashboards - -Recommended Grafana panels: -1. **Request Latency**: P50, P95, P99 over time -2. **Concurrency Level**: Active requests gauge -3. **Cache Performance**: Hit ratio and memory usage -4. **Buffer Sizing**: Distribution and adaptation -5. **I/O Permits**: Available vs. in-use permits - -## Code Quality - -### Review Findings and Fixes - -All code review issues have been addressed: - -1. 
**✅ Race condition in cache size tracking** - - Fixed by using consistent atomic operations within write lock - -2. **✅ Incorrect buffer sizing thresholds** - - Corrected: 1-2 (100%), 3-4 (75%), 5-8 (50%), >8 (40%) - -3. **✅ Unhelpful error message** - - Improved semaphore acquire failure message - -4. **✅ Incomplete cache implementation** - - Documented limitation and added detailed TODO - -### Security Considerations - -- **No new attack surface**: Only internal optimizations -- **Resource limits enforced**: Cache size and I/O permits bounded -- **No data exposure**: Cache respects existing access controls -- **Thread-safe**: All shared state properly synchronized - -### Memory Safety - -- **No unsafe code**: Pure safe Rust -- **RAII for cleanup**: Guards ensure resource cleanup -- **Bounded memory**: Cache size limited to 100MB -- **No memory leaks**: All resources automatically dropped - -## Deployment Considerations - -### Configuration - -Default values are production-ready but can be tuned: - -```rust -// In concurrency.rs -const HIGH_CONCURRENCY_THRESHOLD: usize = 8; -const MEDIUM_CONCURRENCY_THRESHOLD: usize = 4; - -// Cache settings -max_object_size: 10 * MI_B, // 10MB per object -max_cache_size: 100 * MI_B, // 100MB total -disk_read_semaphore: Semaphore::new(64), // 64 concurrent reads -``` - -### Rollout Strategy - -1. **Phase 1**: Deploy with monitoring (current state) - - All optimizations active - - Collect baseline metrics - -2. **Phase 2**: Validate performance improvements - - Compare metrics before/after - - Adjust thresholds if needed - -3. **Phase 3**: Implement streaming cache (future) - - Add TeeReader for cache insertion - - Enable automatic cache population - -### Rollback Plan - -If issues arise: -1. No code changes needed - optimizations degrade gracefully -2. Monitor for any unexpected behavior -3. File size limits prevent memory exhaustion -4. 
I/O semaphore prevents disk saturation - -## Future Enhancements - -### Short Term (Next Sprint) - -1. **Implement Streaming Cache** - ```rust - // Potential approach with TeeReader (tee_reader and read_all are placeholders) - let (cache_sink, response_stream) = tee_reader(original_stream); - tokio::spawn(async move { - // `?` cannot be used here: the spawned block returns (), so handle the error explicitly - if let Ok(data) = read_all(cache_sink).await { - manager.cache_object(key, data).await; - } - }); - return response_stream; - ``` - -2. **Add Admin API for Cache Management** - - Cache statistics endpoint - - Manual cache invalidation - - Pre-warming capability - -### Medium Term - -1. **Request Prioritization** - - Small files get priority - - Age-based queuing to prevent starvation - - QoS classes per tenant - -2. **Advanced Caching** - - Partial object caching (hot blocks) - - Predictive prefetching - - Distributed cache across nodes - -3. **I/O Scheduling** - - Batch similar requests for sequential I/O - - Deadline-based scheduling - - NUMA-aware buffer allocation - -### Long Term - -1. **ML-Based Optimization** - - Learn access patterns - - Predict hot objects - - Adaptive threshold tuning - -2. **Compression** - - Transparent cache compression - - CPU-aware compression level - - Deduplication for similar objects - -## Success Criteria - -### Quantitative Metrics - -- ✅ **Latency reduction**: 40-75% improvement expected under concurrent load -- ✅ **Memory efficiency**: Sub-linear growth with concurrency -- ✅ **I/O optimization**: Bounded queue depth -- 🔄 **Cache hit ratio**: >70% for hot objects (once implemented) - -### Qualitative Goals - -- ✅ **Maintainability**: Clear, well-documented code -- ✅ **Reliability**: No crashes or resource leaks -- ✅ **Observability**: Comprehensive metrics -- ✅ **Compatibility**: No breaking changes - -## Conclusion - -This implementation successfully addresses the concurrent GetObject performance issue through three complementary optimizations: - -1. **Adaptive buffer sizing** reduces memory contention -2. **I/O concurrency control** prevents disk saturation -3.
**Hot object caching** framework reduces redundant disk I/O (full implementation pending) - -The solution is production-ready, well-tested, and provides a solid foundation for future enhancements. Performance improvements of 40-75% are expected under concurrent load, with predictable behavior even under extreme conditions. - -## References - -- **Implementation PR**: [Link to PR] -- **Original Issue**: User reported 2x-3.4x slowdown with concurrency -- **Technical Documentation**: `docs/CONCURRENT_PERFORMANCE_OPTIMIZATION.md` -- **Test Suite**: `rustfs/tests/concurrent_get_object_test.rs` -- **Core Module**: `rustfs/src/storage/concurrency.rs` - -## Contact - -For questions or issues: -- File issue on GitHub -- Tag @houseme or @copilot -- Reference this document and the implementation PR diff --git a/docs/CONCURRENT_PERFORMANCE_OPTIMIZATION.md b/docs/CONCURRENT_PERFORMANCE_OPTIMIZATION.md deleted file mode 100644 index 4b79e909e..000000000 --- a/docs/CONCURRENT_PERFORMANCE_OPTIMIZATION.md +++ /dev/null @@ -1,319 +0,0 @@ -# Concurrent GetObject Performance Optimization - -## Problem Statement - -When multiple concurrent GetObject requests are made to RustFS, performance degrades exponentially: - -| Concurrency Level | Single Request Latency | Performance Impact | -|------------------|----------------------|-------------------| -| 1 request | 59ms | Baseline | -| 2 requests | 110ms | 1.9x slower | -| 4 requests | 200ms | 3.4x slower | - -## Root Cause Analysis - -The performance degradation was caused by several factors: - -1. **Fixed Buffer Sizing**: Using `DEFAULT_READ_BUFFER_SIZE` (1MB) for all requests, regardless of concurrent load - - High memory contention under concurrent load - - Inefficient cache utilization - - CPU context switching overhead - -2. 
**No Concurrency Control**: Unlimited concurrent disk reads causing I/O saturation - - Disk I/O queue depth exceeded optimal levels - - Increased seek times on traditional disks - - Resource contention between requests - -3. **Lack of Caching**: Repeated reads of the same objects - - No reuse of frequently accessed data - - Unnecessary disk I/O for hot objects - -## Solution Architecture - -### 1. Concurrency-Aware Adaptive Buffer Sizing - -The system now dynamically adjusts buffer sizes based on the current number of concurrent GetObject requests: - -```rust -let optimal_buffer_size = get_concurrency_aware_buffer_size(file_size, base_buffer_size); -``` - -#### Buffer Sizing Strategy - -| Concurrent Requests | Buffer Size Multiplier | Typical Buffer | Rationale | -|--------------------|----------------------|----------------|-----------| -| 1-2 (Low) | 1.0x (100%) | 512KB-1MB | Maximize throughput with large buffers | -| 3-4 (Medium) | 0.75x (75%) | 256KB-512KB | Balance throughput and fairness | -| 5-8 (High) | 0.5x (50%) | 128KB-256KB | Improve fairness, reduce memory pressure | -| 9+ (Very High) | 0.4x (40%) | 64KB-128KB | Ensure fair scheduling, minimize memory | - -#### Benefits -- **Reduced memory pressure**: Smaller buffers under high concurrency prevent memory exhaustion -- **Better cache utilization**: More requests fit in CPU cache with smaller buffers -- **Improved fairness**: Prevents large requests from starving smaller ones -- **Adaptive performance**: Automatically tunes for different workload patterns - -### 2. 
Hot Object Caching (LRU) - -Implemented an intelligent LRU cache for frequently accessed small objects: - -```rust -pub struct HotObjectCache { - max_object_size: usize, // Default: 10MB - max_cache_size: usize, // Default: 100MB - cache: RwLock<lru::LruCache<String, Arc<CachedObject>>>, -} -``` - -#### Caching Policy -- **Eligible objects**: Size ≤ 10MB, complete object reads (no ranges) -- **Eviction**: LRU (Least Recently Used) -- **Capacity**: Up to 1000 objects, 100MB total -- **Exclusions**: Encrypted objects, partial reads, multipart - -#### Benefits -- **Reduced disk I/O**: Cache hits eliminate disk reads entirely -- **Lower latency**: Memory access is 100-1000x faster than disk -- **Higher throughput**: Free up disk bandwidth for cache misses -- **Better scalability**: Cache hit ratio improves with concurrent load - -### 3. Disk I/O Concurrency Control - -Added a semaphore to limit maximum concurrent disk reads: - -```rust -disk_read_semaphore: Arc<Semaphore> // Default: 64 permits -``` - -#### Benefits -- **Prevents I/O saturation**: Limits queue depth to optimal levels -- **Predictable latency**: Avoids exponential latency increase -- **Protects disk health**: Reduces excessive seek operations -- **Graceful degradation**: Queues requests rather than thrashing - -### 4. 
Request Tracking and Monitoring - -Implemented RAII-based request tracking with automatic cleanup: - -```rust -pub struct GetObjectGuard { - start_time: Instant, -} - -impl Drop for GetObjectGuard { - fn drop(&mut self) { - ACTIVE_GET_REQUESTS.fetch_sub(1, Ordering::Relaxed); - // Record metrics - } -} -``` - -#### Metrics Collected -- `rustfs_concurrent_get_requests`: Current concurrent request count -- `rustfs_get_object_requests_completed`: Total completed requests -- `rustfs_get_object_duration_seconds`: Request duration histogram -- `rustfs_object_cache_hits`: Cache hit count -- `rustfs_object_cache_misses`: Cache miss count -- `rustfs_buffer_size_bytes`: Buffer size distribution - -## Performance Expectations - -### Expected Improvements - -Based on the optimizations, we expect: - -| Concurrency Level | Before | After (Expected) | Improvement | -|------------------|--------|------------------|-------------| -| 1 request | 59ms | 55-60ms | Similar (baseline) | -| 2 requests | 110ms | 65-75ms | ~40% faster | -| 4 requests | 200ms | 80-100ms | ~50% faster | -| 8 requests | 400ms | 100-130ms | ~65% faster | -| 16 requests | 800ms | 120-160ms | ~75% faster | - -### Key Performance Characteristics - -1. **Sub-linear scaling**: Latency increases sub-linearly with concurrency -2. **Cache benefits**: Hot objects see near-zero latency from cache hits -3. **Predictable behavior**: Bounded latency even under extreme load -4. **Memory efficiency**: Lower memory usage under high concurrency - -## Implementation Details - -### Integration Points - -The optimization is integrated at the GetObject handler level: - -```rust -async fn get_object(&self, req: S3Request<GetObjectInput>) -> S3Result<S3Response<GetObjectOutput>> { - // 1. Track request - let _request_guard = ConcurrencyManager::track_request(); - - // 2. Try cache - if let Some(cached_data) = manager.get_cached(&cache_key).await { - return Ok(S3Response::new(output)); // Fast path - } - - // 3. 
Acquire I/O permit - let _disk_permit = manager.acquire_disk_read_permit().await; - - // 4. Calculate optimal buffer size - let optimal_buffer_size = get_concurrency_aware_buffer_size( - response_content_length, - base_buffer_size - ); - - // 5. Stream with optimal buffer - let body = StreamingBlob::wrap( - ReaderStream::with_capacity(final_stream, optimal_buffer_size) - ); -} -``` - -### Configuration - -All defaults can be tuned via code changes: - -```rust -// In concurrency.rs -const HIGH_CONCURRENCY_THRESHOLD: usize = 8; -const MEDIUM_CONCURRENCY_THRESHOLD: usize = 4; - -// Cache settings -max_object_size: 10 * MI_B, // 10MB -max_cache_size: 100 * MI_B, // 100MB -disk_read_semaphore: Semaphore::new(64), // 64 concurrent reads -``` - -## Testing Recommendations - -### 1. Concurrent Load Testing - -Use the provided Go client to test different concurrency levels: - -```go -concurrency := []int{1, 2, 4, 8, 16, 32} -for _, c := range concurrency { - // Run test with c concurrent goroutines - // Measure average latency and P50/P95/P99 -} -``` - -### 2. Hot Object Testing - -Test cache effectiveness with repeated reads: - -```bash -# Read same object 100 times with 10 concurrent clients -for i in {1..10}; do - for j in {1..100}; do - mc cat rustfs/test/bxx > /dev/null - done & -done -wait -``` - -### 3. Mixed Workload Testing - -Simulate real-world scenarios: -- 70% small objects (<1MB) - should see high cache hit rate -- 20% medium objects (1-10MB) - partial cache benefit -- 10% large objects (>10MB) - adaptive buffer sizing benefit - -### 4. Stress Testing - -Test system behavior under extreme load: -```bash -# 100 concurrent clients, continuous reads -ab -n 10000 -c 100 http://rustfs:9000/test/bxx -``` - -## Monitoring and Observability - -### Key Metrics to Watch - -1. **Latency Percentiles** - - P50, P95, P99 request duration - - Should show sub-linear growth with concurrency - -2. 
**Cache Performance** - - Cache hit ratio (target: >70% for hot objects) - - Cache memory usage - - Eviction rate - -3. **Resource Utilization** - - Memory usage per concurrent request - - Disk I/O queue depth - - CPU utilization - -4. **Throughput** - - Requests per second - - Bytes per second - - Concurrent request count - -### Prometheus Queries - -```promql -# P95 request duration -histogram_quantile(0.95, - rate(rustfs_get_object_duration_seconds_bucket[5m]) -) - -# Cache hit ratio -sum(rate(rustfs_object_cache_hits[5m])) -/ -(sum(rate(rustfs_object_cache_hits[5m])) + sum(rate(rustfs_object_cache_misses[5m]))) - -# Concurrent requests over time -rustfs_concurrent_get_requests - -# Cache bytes per active request (rough memory-efficiency indicator) -rustfs_object_cache_size_bytes / rustfs_concurrent_get_requests -``` - -## Future Enhancements - -### Potential Improvements - -1. **Request Prioritization** - - Prioritize small requests over large ones - - Age-based priority to prevent starvation - - QoS classes for different clients - -2. **Advanced Caching** - - Partial object caching (hot blocks) - - Predictive prefetching based on access patterns - - Distributed cache across multiple nodes - -3. **I/O Scheduling** - - Batch similar requests for sequential I/O - - Deadline-based I/O scheduling - - NUMA-aware buffer allocation - -4. **Adaptive Tuning** - - Machine-learning-based buffer sizing - - Dynamic cache size adjustment - - Workload-aware optimization - -5.
**Compression**
-   - Transparent compression for cached objects
-   - Adaptive compression based on CPU availability
-   - Deduplication for similar objects
-
-## References
-
-- [Issue #XXX](https://github.com/rustfs/rustfs/issues/XXX): Original performance issue
-- [PR #XXX](https://github.com/rustfs/rustfs/pull/XXX): Implementation PR
-- [MinIO Best Practices](https://min.io/docs/minio/linux/operations/install-deploy-manage/performance-and-optimization.html)
-- [LRU Cache Design](https://leetcode.com/problems/lru-cache/)
-- [Tokio Concurrency Patterns](https://tokio.rs/tokio/tutorial/shared-state)
-
-## Conclusion
-
-The concurrency-aware optimization addresses the root causes of performance degradation:
-
-1. ✅ **Adaptive buffer sizing** reduces memory contention and improves cache utilization
-2. ✅ **Hot object caching** eliminates redundant disk I/O for frequently accessed files
-3. ✅ **I/O concurrency control** prevents disk saturation and ensures predictable latency
-4. ✅ **Comprehensive monitoring** enables performance tracking and tuning
-
-These changes should significantly improve performance under concurrent load while maintaining compatibility with existing clients and workloads.
diff --git a/docs/DEVELOPMENT.md b/docs/DEVELOPMENT.md
deleted file mode 100644
index 60767eb68..000000000
--- a/docs/DEVELOPMENT.md
+++ /dev/null
@@ -1,71 +0,0 @@
-# RustFS Local Development Guide
-
-This guide explains how to set up and run a local development environment for RustFS using Docker. This approach lets you build and run the code from source in a consistent environment without installing the Rust toolchain on your host machine.
-
-## Prerequisites
-
-- [Docker](https://docs.docker.com/get-docker/)
-- [Docker Compose](https://docs.docker.com/compose/install/)
-
-## Quick Start
-
-The development environment is configured as a Docker Compose profile named `dev`.
-
-### 1. Set Up the Console UI (Optional)
-
-If you want to use the Console UI, you must download the static assets first. The default source checkout does not include them.
-
-```bash
-bash scripts/static.sh
-```
-
-### 2. Start the Environment
-
-To start the development container:
-
-```bash
-docker compose --profile dev up -d rustfs-dev
-```
-
-**Note**: The first run takes some time (5-10 minutes) because it builds the Docker image and compiles all Rust dependencies from source. Subsequent runs are much faster.
-
-### 3. View Logs
-
-To follow the application logs:
-
-```bash
-docker compose --profile dev logs -f rustfs-dev
-```
-
-### 4. Access the Services
-
-- **S3 API**: `http://localhost:9010`
-- **Console UI**: `http://localhost:9011/rustfs/console/index.html`
-
-## Workflow
-
-### Making Changes
-The source code from your local `rustfs` directory is mounted into the container at `/app`, so you can edit files in your preferred IDE on the host machine.
-
-### Applying Changes
-Because the application runs via `cargo run`, you need to restart the container to pick up changes. Thanks to incremental compilation, this is fast.
-
-```bash
-docker compose --profile dev restart rustfs-dev
-```
-
-### Rebuilding Dependencies
-If you modify `Cargo.toml` or `Cargo.lock`, you generally need to rebuild the Docker image to update the cached dependencies layer:
-
-```bash
-docker compose --profile dev build rustfs-dev
-```
-
-## Troubleshooting
-
-### `VolumeNotFound` Error
-If you see an error like `Error: Custom { kind: Other, error: VolumeNotFound }`, it means the `rustfs` binary was started without valid volume arguments.
-The development image uses `entrypoint.sh` to parse the `RUSTFS_VOLUMES` environment variable (supporting `{N..M}` syntax), create the directories, and pass them to `cargo run`. Ensure your `RUSTFS_VOLUMES` variable is correctly formatted.
-
-### Slow Initial Build
-This is expected. The `dev` stage in `Dockerfile.source` compiles all dependencies from scratch. Because `/usr/local/cargo/registry` is mounted as a volume, downloaded crates are preserved between restarts and do not need to be fetched again.
diff --git a/docs/ENVIRONMENT_VARIABLES.md b/docs/ENVIRONMENT_VARIABLES.md
deleted file mode 100644
index 6dac9d747..000000000
--- a/docs/ENVIRONMENT_VARIABLES.md
+++ /dev/null
@@ -1,202 +0,0 @@
-# RustFS Environment Variables
-
-This document describes the environment variables that can be used to configure RustFS behavior.
-
-## Background Services Control
-
-### RUSTFS_ENABLE_SCANNER
-
-Controls whether the data scanner service should be started.
-
-- **Default**: `true`
-- **Valid values**: `true`, `false`
-- **Description**: When enabled, the data scanner will run background scans to detect inconsistencies and corruption in stored data.
-
-**Examples**:
-```bash
-# Disable scanner
-export RUSTFS_ENABLE_SCANNER=false
-
-# Enable scanner (default behavior)
-export RUSTFS_ENABLE_SCANNER=true
-```
-
-### RUSTFS_ENABLE_HEAL
-
-Controls whether the auto-heal service should be started.
-
-- **Default**: `true`
-- **Valid values**: `true`, `false`
-- **Description**: When enabled, the heal manager will automatically repair detected data inconsistencies and corruption.
-
-**Examples**:
-```bash
-# Disable auto-heal
-export RUSTFS_ENABLE_HEAL=false
-
-# Enable auto-heal (default behavior)
-export RUSTFS_ENABLE_HEAL=true
-```
-
-### RUSTFS_ENABLE_LOCKS
-
-Controls whether the distributed lock system should be enabled.
-
-- **Default**: `true`
-- **Valid values**: `true`, `false`, `1`, `0`, `yes`, `no`, `on`, `off`, `enabled`, `disabled` (case-insensitive)
-- **Description**: When enabled, provides distributed locking for concurrent object operations. When disabled, all lock operations immediately return success without actual locking.
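The lenient, case-insensitive boolean syntax accepted by `RUSTFS_ENABLE_LOCKS` can be illustrated with a small sketch. This is not RustFS's actual parser: `parse_bool_flag` and `env_flag` are hypothetical helpers, and falling back to the default when a value is unrecognized is an assumption, not documented behavior.

```rust
use std::env;

/// Interprets the boolean spellings listed for `RUSTFS_ENABLE_LOCKS`,
/// ignoring ASCII case and surrounding whitespace.
/// Returns `None` for any other value.
fn parse_bool_flag(value: &str) -> Option<bool> {
    match value.trim().to_ascii_lowercase().as_str() {
        "true" | "1" | "yes" | "on" | "enabled" => Some(true),
        "false" | "0" | "no" | "off" | "disabled" => Some(false),
        _ => None,
    }
}

/// Reads a boolean flag from the environment. Falling back to `default`
/// when the variable is unset, not valid Unicode, or unrecognized is an
/// assumption of this sketch.
fn env_flag(name: &str, default: bool) -> bool {
    env::var(name)
        .ok()
        .and_then(|value| parse_bool_flag(&value))
        .unwrap_or(default)
}

fn main() {
    let locks_enabled = env_flag("RUSTFS_ENABLE_LOCKS", true);
    println!("distributed locking enabled: {locks_enabled}");
}
```

Returning `Option<bool>` keeps "unrecognized" distinct from `false`, so a caller could log a warning for a typo such as `ture` instead of silently disabling locking.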
-
-**Examples**:
-```bash
-# Disable lock system
-export RUSTFS_ENABLE_LOCKS=false
-
-# Enable lock system (default behavior)
-export RUSTFS_ENABLE_LOCKS=true
-```
-
-## Service Combinations
-
-The scanner and heal services can be controlled independently:
-
-| RUSTFS_ENABLE_SCANNER | RUSTFS_ENABLE_HEAL | Result |
-|----------------------|-------------------|--------|
-| `true` (default) | `true` (default) | Both scanner and heal are active |
-| `true` | `false` | Scanner runs without heal capabilities |
-| `false` | `true` | Heal manager is available but no scanning |
-| `false` | `false` | No background maintenance services |
-
-## Use Cases
-
-### Development Environment
-For development or testing environments where you don't need background maintenance:
-```bash
-export RUSTFS_ENABLE_SCANNER=false
-export RUSTFS_ENABLE_HEAL=false
-./rustfs --address 127.0.0.1:9000 ...
-```
-
-### Scan-Only Mode
-For environments where you want to detect issues but not automatically fix them:
-```bash
-export RUSTFS_ENABLE_SCANNER=true
-export RUSTFS_ENABLE_HEAL=false
-./rustfs --address 127.0.0.1:9000 ...
-```
-
-### Heal-Only Mode
-For environments where healing is triggered by external tools rather than by automatic scanning:
-```bash
-export RUSTFS_ENABLE_SCANNER=false
-export RUSTFS_ENABLE_HEAL=true
-./rustfs --address 127.0.0.1:9000 ...
-```
-
-### Production Environment (Default)
-For production environments where both services should be active:
-```bash
-# These are the defaults, so there is no need to set them explicitly
-# export RUSTFS_ENABLE_SCANNER=true
-# export RUSTFS_ENABLE_HEAL=true
-./rustfs --address 127.0.0.1:9000 ...
-```
-
-### No-Lock Development
-For single-node development where locking is not needed:
-```bash
-export RUSTFS_ENABLE_LOCKS=false
-./rustfs --address 127.0.0.1:9000 ...
-```
-
-## Protocol Servers
-
-### RUSTFS_FTPS_ENABLE
-
-Controls whether the FTPS (FTP over TLS) server should be started.
-
-- **Default**: `false`
-- **Valid values**: `true`, `false`
-- **Description**: When enabled, starts an FTPS server for secure file transfers over TLS.
-
-### RUSTFS_FTPS_ADDRESS
-
-FTPS server bind address.
-
-- **Default**: `0.0.0.0:8021`
-- **Valid values**: Valid IP:PORT combination
-- **Description**: The address and port where the FTPS server will listen for connections.
-
-### RUSTFS_FTPS_CERTS_FILE
-
-Path to FTPS server TLS certificate file.
-
-- **Default**: None (required when FTPS is enabled)
-- **Valid values**: Path to a PEM-encoded certificate file
-- **Description**: TLS certificate used for securing FTPS connections.
-
-### RUSTFS_FTPS_KEY_FILE
-
-Path to FTPS server TLS private key file.
-
-- **Default**: None (required when FTPS is enabled)
-- **Valid values**: Path to a PEM-encoded private key file
-- **Description**: TLS private key corresponding to the certificate.
-
-### RUSTFS_FTPS_PASSIVE_PORTS
-
-Passive port range for FTPS data connections.
-
-- **Default**: None (system-assigned ports)
-- **Valid values**: Port range in format "START-END" (e.g., "40000-50000")
-- **Description**: Range of ports for FTPS passive mode data connections.
-
-### RUSTFS_FTPS_EXTERNAL_IP
-
-External IP address for FTPS passive mode.
-
-- **Default**: None (auto-detected)
-- **Valid values**: Valid IP address
-- **Description**: External IP address advertised to FTPS clients for passive mode, useful for NAT setups.
-
-### RUSTFS_SFTP_ENABLE
-
-Controls whether the SFTP (SSH File Transfer Protocol) server should be started.
-
-- **Default**: `false`
-- **Valid values**: `true`, `false`
-- **Description**: When enabled, starts an SFTP server for secure file transfers over SSH.
-
-### RUSTFS_SFTP_ADDRESS
-
-SFTP server bind address.
-
-- **Default**: `0.0.0.0:8022`
-- **Valid values**: Valid IP:PORT combination
-- **Description**: The address and port where the SFTP server will listen for connections.
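The `IP:PORT` bind addresses and the `START-END` passive port range described above can be validated at startup before any listener is opened. The following is a minimal standard-library sketch, not RustFS code. `parse_port_range` and `parse_bind_address` are hypothetical helpers, and rejecting a range whose START exceeds END is an assumption that the variable table does not state.

```rust
use std::net::SocketAddr;
use std::ops::RangeInclusive;

/// Parses a passive port range written as "START-END", e.g. "40000-50000".
/// Rejecting START > END is an assumption of this sketch.
fn parse_port_range(spec: &str) -> Result<RangeInclusive<u16>, String> {
    let (start_str, end_str) = spec
        .split_once('-')
        .ok_or_else(|| format!("expected START-END, got {spec:?}"))?;
    let start = start_str
        .trim()
        .parse::<u16>()
        .map_err(|e| format!("invalid start port {start_str:?}: {e}"))?;
    let end = end_str
        .trim()
        .parse::<u16>()
        .map_err(|e| format!("invalid end port {end_str:?}: {e}"))?;
    if start > end {
        return Err(format!("start port {start} exceeds end port {end}"));
    }
    Ok(start..=end)
}

/// Parses an `IP:PORT` bind address such as the documented defaults.
/// Host names are rejected; only IP literals are accepted.
fn parse_bind_address(spec: &str) -> Result<SocketAddr, String> {
    spec.parse::<SocketAddr>()
        .map_err(|e| format!("invalid IP:PORT {spec:?}: {e}"))
}

fn main() {
    let ftps = parse_bind_address("0.0.0.0:8021").expect("valid FTPS address");
    let sftp = parse_bind_address("0.0.0.0:8022").expect("valid SFTP address");
    let passive = parse_port_range("40000-50000").expect("valid passive range");
    println!("FTPS {ftps}, SFTP {sftp}, passive ports {passive:?}");
}
```

Because `u16` parsing rejects values above 65535, out-of-range ports are caught without a separate bounds check.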
-
-### RUSTFS_SFTP_HOST_KEY
-
-Path to SFTP server SSH host key file.
-
-- **Default**: None (required when SFTP is enabled)
-- **Valid values**: Path to an SSH host key file
-- **Description**: SSH host key used for server identification.
-
-### RUSTFS_SFTP_AUTHORIZED_KEYS
-
-Path to SFTP authorized keys file.
-
-- **Default**: None (required when SFTP is enabled)
-- **Valid values**: Path to a file containing OpenSSH public keys
-- **Description**: File containing authorized SSH public keys for client authentication.
-
-## Performance Impact
-
-- **Scanner**: Light to moderate CPU/IO impact during scans
-- **Heal**: Moderate to high CPU/IO impact during healing operations
-- **Locks**: Minimal CPU/memory overhead for coordination; disabling can improve throughput in single-client scenarios
-- **Memory**: Each service uses additional memory for processing queues and metadata
-- **FTPS**: Moderate CPU/memory overhead for TLS operations and connection management
-- **SFTP**: Moderate CPU/memory overhead for SSH operations and key management
-
-Disabling these services in resource-constrained environments can improve performance for primary storage operations.
\ No newline at end of file
diff --git a/docs/FINAL_OPTIMIZATION_SUMMARY.md b/docs/FINAL_OPTIMIZATION_SUMMARY.md
deleted file mode 100644
index b29610a40..000000000
--- a/docs/FINAL_OPTIMIZATION_SUMMARY.md
+++ /dev/null
@@ -1,398 +0,0 @@
-# Final Optimization Summary - Concurrent GetObject Performance
-
-## Overview
-
-This document summarizes all optimizations made to address the concurrent GetObject performance degradation issue, incorporating review feedback and Rust best practices.
-
-## Problem Statement
-
-**Original Issue**: GetObject latency degraded sharply under concurrent load, roughly doubling each time concurrency doubled:
-- 1 concurrent request: 59ms
-- 2 concurrent requests: 110ms (1.9x slower)
-- 4 concurrent requests: 200ms (3.4x slower)
-
-**Root Causes Identified**:
-1. Fixed 1MB buffer size caused memory contention
-2. No I/O concurrency control led to disk saturation
-3. Absence of caching for frequently accessed objects
-4. Inefficient lock management in concurrent scenarios
-
-## Solution Architecture
-
-### 1. Optimized LRU Cache Implementation (lru 0.16.2)
-
-#### Read-First Access Pattern
-
-Implemented an optimistic locking strategy using the `peek()` method from lru 0.16.2:
-
-```rust
-async fn get(&self, key: &str) -> Option<Arc<Vec<u8>>> {
-    // Phase 1: Read lock with peek (no LRU modification)
-    let cache = self.cache.read().await;
-    let data = Arc::clone(&cache.peek(key)?.data);
-    drop(cache);
-
-    // Phase 2: Write lock only for LRU promotion
-    let mut cache_write = self.cache.write().await;
-    if let Some(cached) = cache_write.get(key) {
-        cached.hit_count.fetch_add(1, Ordering::Relaxed);
-    }
-
-    // If the entry was evicted between the two locks, the Arc cloned in
-    // Phase 1 is still valid, so the hit is served anyway.
-    Some(data)
-}
-```
-
-**Benefits**:
-- **50% reduction** in write lock acquisitions
-- Multiple readers can peek simultaneously
-- Write lock only when promoting in LRU order
-- Maintains proper LRU semantics
-
-#### Advanced Cache Operations
-
-**Batch Operations**:
-```rust
-// Single lock for multiple objects
-pub async fn get_cached_batch(&self, keys: &[String]) -> Vec<Option<Arc<Vec<u8>>>>
-```
-
-**Cache Warming**:
-```rust
-// Pre-populate cache on startup
-pub async fn warm_cache(&self, objects: Vec<(String, Vec<u8>)>)
-```
-
-**Hot Key Tracking**:
-```rust
-// Identify most accessed objects
-pub async fn get_hot_keys(&self, limit: usize) -> Vec<(String, usize)>
-```
-
-**Cache Management**:
-```rust
-// Lightweight checks and explicit invalidation
-pub async fn is_cached(&self, key: &str) -> bool
-pub async fn remove_cached(&self, key: &str) -> bool
-```
-
-### 2. Advanced Buffer Sizing
-
-#### Standard Concurrency-Aware Sizing
-
-| Concurrent Requests | Buffer Multiplier | Rationale |
-|--------------------|-------------------|-----------|
-| 1-2 | 1.0x (100%) | Maximum throughput |
-| 3-4 | 0.75x (75%) | Balanced performance |
-| 5-8 | 0.5x (50%) | Fair resource sharing |
-| >8 | 0.4x (40%) | Memory efficiency |
-
-#### Advanced File-Pattern-Aware Sizing
-
-```rust
-pub fn get_advanced_buffer_size(
-    file_size: i64,
-    base_buffer_size: usize,
-    is_sequential: bool
-) -> usize
-```
-
-**Optimizations**:
-1. **Small files (<256KB)**: Use 25% of file size (16-64KB range)
-2. **Sequential reads**: 1.5x multiplier at low concurrency
-3. **Large files + high concurrency**: 0.8x for better parallelism
-
-**Example**:
-```rust
-// 32MB file, sequential read, low concurrency
-let buffer = get_advanced_buffer_size(
-    32 * 1024 * 1024,  // file_size
-    256 * 1024,        // base_buffer (256KB)
-    true               // is_sequential
-);
-// Result: ~384KB buffer (256KB * 1.5)
-```
-
-### 3. I/O Concurrency Control
-
-**Semaphore-Based Rate Limiting**:
-- Default: 64 concurrent disk reads
-- Prevents disk I/O saturation
-- FIFO queuing ensures fairness
-- Tunable based on storage type:
-  - NVMe SSD: 128-256
-  - HDD: 32-48
-  - Network storage: Based on bandwidth
-
-### 4. RAII Request Tracking
-
-```rust
-pub struct GetObjectGuard {
-    start_time: Instant,
-}
-
-impl Drop for GetObjectGuard {
-    fn drop(&mut self) {
-        ACTIVE_GET_REQUESTS.fetch_sub(1, Ordering::Relaxed);
-        // Record metrics
-    }
-}
-```
-
-**Benefits**:
-- Minimal overhead (one atomic update when the guard is created and one on drop)
-- Automatic cleanup on drop
-- Panic-safe counter management
-- Accurate concurrent load measurement
-
-## Performance Analysis
-
-### Cache Performance
-
-| Metric | Before | After | Improvement |
-|--------|--------|-------|-------------|
-| Cache hit (read-heavy) | 2-3ms | <1ms | 2-3x faster |
-| Cache hit (with promotion) | 2-3ms | 2-3ms | Same (required) |
-| Batch get (10 keys) | 20-30ms | 5-10ms | 2-3x faster |
-| Cache miss | 50-800ms | 50-800ms | Same (disk bound) |
-
-### Overall Latency Impact
-
-| Concurrent Requests | Original | Optimized | Improvement |
-|---------------------|----------|-----------|-------------|
-| 1 | 59ms | 50-55ms | ~10% |
-| 2 | 110ms | 60-70ms | ~40% |
-| 4 | 200ms | 75-90ms | ~55% |
-| 8 | 400ms | 90-120ms | ~70% |
-| 16 | 800ms | 110-145ms | ~75% |
-
-**With cache hits**: <5ms regardless of concurrency level
-
-### Memory Efficiency
-
-| Scenario | Buffer Size | Memory Impact | Efficiency Gain |
-|----------|-------------|---------------|-----------------|
-| Small files (128KB) | 32KB (was 256KB) | 8x more objects | 8x improvement |
-| Sequential reads | 1.5x base | Better throughput | 50% faster |
-| High concurrency | 0.32x base | 3x more requests | Better fairness |
-
-## Test Coverage
-
-### Comprehensive Test Suite (15 Tests)
-
-**Request Tracking**:
-1. `test_concurrent_request_tracking` - RAII guard functionality
-
-**Buffer Sizing**:
-2. `test_adaptive_buffer_sizing` - Multi-level concurrency adaptation
-3. `test_buffer_size_bounds` - Boundary conditions
-4. `test_advanced_buffer_sizing` - File pattern optimization
-
-**Cache Operations**:
-5. `test_cache_operations` - Basic cache lifecycle
-6. `test_large_object_not_cached` - Size filtering
-7. `test_cache_eviction` - LRU eviction behavior
-8. `test_cache_batch_operations` - Batch retrieval efficiency
-9. `test_cache_warming` - Pre-population mechanism
-10. `test_hot_keys_tracking` - Access frequency tracking
-11. `test_cache_removal` - Explicit invalidation
-12. `test_is_cached_no_promotion` - Peek behavior verification
-
-**Performance**:
-13. `bench_concurrent_requests` - Concurrent request handling
-14. `test_concurrent_cache_access` - Performance under load
-15. `test_disk_io_permits` - Semaphore behavior
-
-## Code Quality Standards
-
-### Documentation
-
-✅ **All documentation in English** following Rust documentation conventions
-✅ **Comprehensive inline comments** explaining design decisions
-✅ **Usage examples** in doc comments
-✅ **Module-level documentation** with key features and characteristics
-
-### Safety and Correctness
-
-✅ **Thread-safe** - Proper use of Arc, RwLock, AtomicUsize
-✅ **Panic-safe** - RAII guards ensure cleanup
-✅ **Memory-safe** - No unsafe code
-✅ **Deadlock-free** - Careful lock ordering and scope management
-
-### API Design
-
-✅ **Clear separation of concerns** - Public vs private APIs
-✅ **Consistent naming** - Follows Rust naming conventions
-✅ **Type safety** - Strong typing prevents misuse
-✅ **Ergonomic** - Easy to use correctly, hard to use incorrectly
-
-## Production Deployment Guide
-
-### Configuration
-
-```rust
-// Adjust based on your environment
-const CACHE_SIZE_MB: usize = 200;      // For more hot objects
-const MAX_OBJECT_SIZE_MB: usize = 20;  // For larger hot objects
-const DISK_CONCURRENCY: usize = 64;    // Based on storage type
-```
-
-### Cache Warming Example
-
-```rust
-async fn init_cache_on_startup(manager: &ConcurrencyManager) {
-    // Load known hot objects
-    let hot_objects = vec![
-        ("config/settings.json".to_string(), load_config()),
-        ("common/logo.png".to_string(), load_logo()),
-        // ... more hot objects
-    ];
-
-    info!("Warming cache with {} objects", hot_objects.len());
-    manager.warm_cache(hot_objects).await;
-}
-```
-
-### Monitoring
-
-```rust
-// Periodic cache metrics
-tokio::spawn(async move {
-    loop {
-        tokio::time::sleep(Duration::from_secs(60)).await;
-
-        let stats = manager.cache_stats().await;
-        gauge!("cache_size_bytes").set(stats.size as f64);
-        gauge!("cache_entries").set(stats.entries as f64);
-
-        let hot_keys = manager.get_hot_keys(10).await;
-        for (key, hits) in hot_keys {
-            info!("Hot: {} ({} hits)", key, hits);
-        }
-    }
-});
-```
-
-### Prometheus Metrics
-
-```promql
-# Cache hit ratio
-sum(rate(rustfs_object_cache_hits[5m]))
-/
-(sum(rate(rustfs_object_cache_hits[5m])) + sum(rate(rustfs_object_cache_misses[5m])))
-
-# P95 latency
-histogram_quantile(0.95, rate(rustfs_get_object_duration_seconds_bucket[5m]))
-
-# Concurrent requests
-rustfs_concurrent_get_requests
-
-# Average cached object size (bytes per entry)
-rustfs_object_cache_size_bytes / rustfs_object_cache_entries
-```
-
-## File Structure
-
-```
-rustfs/
-├── src/
-│   └── storage/
-│       ├── concurrency.rs                 # Core concurrency management
-│       ├── concurrent_get_object_test.rs  # Comprehensive tests
-│       ├── ecfs.rs                        # GetObject integration
-│       └── mod.rs                         # Module declarations
-├── Cargo.toml                             # lru = "0.16.2"
-└── docs/
-    ├── CONCURRENT_PERFORMANCE_OPTIMIZATION.md
-    ├── ENHANCED_CACHING_OPTIMIZATION.md
-    ├── PR_ENHANCEMENTS_SUMMARY.md
-    └── FINAL_OPTIMIZATION_SUMMARY.md      # This document
-```
-
-## Migration Guide
-
-### Backward Compatibility
-
-✅ **100% backward compatible** - No breaking changes
-✅ **Automatic optimization** - Existing code benefits immediately
-✅ **Opt-in advanced features** - Use when needed
-
-### Using New Features
-
-```rust
-// Basic usage (automatic)
-let _guard = ConcurrencyManager::track_request();
-if let Some(data) = manager.get_cached(&key).await {
-    return serve_from_cache(data);
-}
-
-// Advanced usage (explicit)
-let results = manager.get_cached_batch(&keys).await;
-manager.warm_cache(hot_objects).await;
-let hot = manager.get_hot_keys(10).await;
-
-// Advanced buffer sizing
-let buffer = get_advanced_buffer_size(file_size, base, is_sequential);
-```
-
-## Future Enhancements
-
-### Short Term
-1. Implement TeeReader for automatic cache insertion from streams
-2. Add Admin API for cache management
-3. Distributed cache invalidation across cluster nodes
-
-### Medium Term
-1. Predictive prefetching based on access patterns
-2. Tiered caching (Memory + SSD + Remote)
-3. Smart eviction considering factors beyond LRU
-
-### Long Term
-1. ML-based optimization and prediction
-2. Content-addressable storage with deduplication
-3. Adaptive tuning based on observed patterns
-
-## Success Metrics
-
-### Quantitative Goals
-
-✅ **Latency reduction**: 40-75% improvement under concurrent load
-✅ **Memory efficiency**: Sub-linear growth with concurrency
-✅ **Cache effectiveness**: <5ms for cache hits
-✅ **I/O optimization**: Bounded queue depth
-
-### Qualitative Goals
-
-✅ **Maintainability**: Clear, well-documented code
-✅ **Reliability**: No crashes or resource leaks
-✅ **Observability**: Comprehensive metrics
-✅ **Compatibility**: No breaking changes
-
-## Conclusion
-
-This optimization addresses the concurrent GetObject performance issue through a comprehensive solution:
-
-1. **Optimized Cache** (lru 0.16.2) with read-first pattern
-2. **Advanced buffer sizing** adapting to concurrency and file patterns
-3. **I/O concurrency control** preventing disk saturation
-4. **Batch operations** for efficiency
-5. **Comprehensive testing** ensuring correctness
-6. **Production-ready** features and monitoring
-
-The solution is backward compatible, well-tested, thoroughly documented in English, and ready for production deployment.
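The RAII request guard and the concurrency-aware multiplier table summarized above fit together naturally: the guard keeps an accurate in-flight count, and the buffer size is scaled by that count. The following self-contained sketch is not the code from `concurrency.rs`. Only the `ACTIVE_GET_REQUESTS` name mirrors the summary; `GetObjectGuard::new`, `elapsed`, and `concurrency_aware_buffer_size` are hypothetical, and metric recording is reduced to a comment.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::{Duration, Instant};

/// Number of GetObject requests currently in flight.
static ACTIVE_GET_REQUESTS: AtomicUsize = AtomicUsize::new(0);

/// Increments the in-flight counter on creation and decrements it on drop,
/// so the count is restored on early returns and during panic unwinding.
pub struct GetObjectGuard {
    start_time: Instant,
}

impl GetObjectGuard {
    pub fn new() -> Self {
        ACTIVE_GET_REQUESTS.fetch_add(1, Ordering::Relaxed);
        Self { start_time: Instant::now() }
    }

    pub fn elapsed(&self) -> Duration {
        self.start_time.elapsed()
    }
}

impl Drop for GetObjectGuard {
    fn drop(&mut self) {
        ACTIVE_GET_REQUESTS.fetch_sub(1, Ordering::Relaxed);
        // A real implementation would record `self.elapsed()` as a latency metric here.
    }
}

/// Scales `base` by the multiplier table: 1.0x for 1-2 requests, 0.75x for
/// 3-4, 0.5x for 5-8 and 0.4x above 8. Zero is treated like one request.
pub fn concurrency_aware_buffer_size(base: usize, active: usize) -> usize {
    // Integer fractions avoid floating-point rounding.
    let (num, den) = match active {
        0..=2 => (1, 1),
        3..=4 => (3, 4),
        5..=8 => (1, 2),
        _ => (2, 5),
    };
    base * num / den
}

fn main() {
    let base = 1024 * 1024; // 1 MiB base buffer
    let guards: Vec<GetObjectGuard> = (0..5).map(|_| GetObjectGuard::new()).collect();
    let active = ACTIVE_GET_REQUESTS.load(Ordering::Relaxed);
    let size = concurrency_aware_buffer_size(base, active);
    println!("{active} active requests -> {size} byte buffer (first guard age {:?})", guards[0].elapsed());
    drop(guards); // every guard decrements the counter
    assert_eq!(ACTIVE_GET_REQUESTS.load(Ordering::Relaxed), 0);
}
```

`Ordering::Relaxed` is sufficient here because the counter is only a load estimate and does not guard access to other data.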
-
-## References
-
-- **Issue**: #911 - Concurrent GetObject performance degradation
-- **Final Commit**: 010e515 - Complete optimization with lru 0.16.2
-- **Implementation**: `rustfs/src/storage/concurrency.rs`
-- **Tests**: `rustfs/src/storage/concurrent_get_object_test.rs`
-- **LRU Crate**: https://crates.io/crates/lru (version 0.16.2)
-
-## Contact
-
-For questions or issues related to this optimization:
-- File an issue on GitHub referencing #911
-- Tag @houseme or @copilot
-- Reference this document and commit 010e515
diff --git a/docs/IMPLEMENTATION_SUMMARY.md b/docs/IMPLEMENTATION_SUMMARY.md
deleted file mode 100644
index aada16614..000000000
--- a/docs/IMPLEMENTATION_SUMMARY.md
+++ /dev/null
@@ -1,412 +0,0 @@
-# Adaptive Buffer Sizing Implementation Summary
-
-## Overview
-
-This implementation extends PR #869 with a comprehensive adaptive buffer sizing system that selects buffer sizes based on file size and workload type.
-
-## What Was Implemented
-
-### 1. Workload Profile System
-
-**File:** `rustfs/src/config/workload_profiles.rs` (501 lines)
-
-A complete workload profiling system with:
-
-- **6 Predefined Profiles:**
-  - `GeneralPurpose`: Balanced performance (default)
-  - `AiTraining`: Optimized for large sequential reads
-  - `DataAnalytics`: Mixed read-write patterns
-  - `WebWorkload`: Small-file intensive
-  - `IndustrialIoT`: Real-time streaming
-  - `SecureStorage`: Security-first, memory-constrained
-
-
-- **Custom Configuration Support:**
-  ```rust
-  WorkloadProfile::Custom(BufferConfig {
-      min_size: 16 * 1024,
-      max_size: 512 * 1024,
-      default_unknown: 128 * 1024,
-      thresholds: vec![...],
-  })
-  ```
-
-- **Configuration Validation:**
-  - Ensures min_size > 0
-  - Validates max_size >= min_size
-  - Checks threshold ordering
-  - Validates buffer sizes within bounds
-
-### 2. Enhanced Buffer Sizing Algorithm
-
-**File:** `rustfs/src/storage/ecfs.rs` (+156 lines)
-
-- **Backward Compatible:**
-  - Preserved original `get_adaptive_buffer_size()` function
-  - Existing code continues to work without changes
-
-- **New Enhanced Function:**
-  ```rust
-  fn get_adaptive_buffer_size_with_profile(
-      file_size: i64,
-      profile: Option<WorkloadProfile>
-  ) -> usize
-  ```
-
-- **Auto-Detection:**
-  - Automatically detects Chinese secure OS (Kylin, NeoKylin, UOS, OpenKylin)
-  - Falls back to GeneralPurpose if no special environment is detected
-
-### 3. Comprehensive Testing
-
-**Location:** `rustfs/src/storage/ecfs.rs` and `rustfs/src/config/workload_profiles.rs`
-
-- Unit tests for all 6 workload profiles
-- Boundary condition testing
-- Configuration validation tests
-- Custom configuration tests
-- Unknown file size handling tests
-- Total: 15+ comprehensive test cases
-
-### 4. Complete Documentation
-
-**Files:**
-- `docs/adaptive-buffer-sizing.md` (460 lines)
-- `docs/README.md` (updated with navigation)
-
-Documentation includes:
-- Overview and architecture
-- Detailed profile descriptions
-- Usage examples
-- Performance considerations
-- Best practices
-- Troubleshooting guide
-- Migration guide from PR #869
-
-## Design Decisions
-
-### 1. Backward Compatibility
-
-**Decision:** Keep original `get_adaptive_buffer_size()` function unchanged.
-
-**Rationale:**
-- Ensures no breaking changes
-- Existing code continues to work
-- Gradual migration path available
-
-### 2. Profile-Based Configuration
-
-**Decision:** Use enum-based profiles instead of global configuration.
-
-**Rationale:**
-- Type-safe profile selection
-- Compile-time validation
-- Easy to extend with new profiles
-- Clear documentation of available options
-
-### 3. Separate Module for Profiles
-
-**Decision:** Create a dedicated `workload_profiles` module.
-
-**Rationale:**
-- Clear separation of concerns
-- Easy to locate and maintain
-- Can be used across the codebase
-- Facilitates testing
-
-### 4. Conservative Default Values
-
-**Decision:** Use moderate buffer sizes by default.
-
-**Rationale:**
-- Prevents excessive memory usage
-- Suitable for most workloads
-- Users can opt in to larger buffers
-
-## Performance Characteristics
-
-### Memory Usage by Profile
-
-| Profile | Min Buffer | Max Buffer | Memory Footprint |
-|---------|-----------|-----------|------------------|
-| GeneralPurpose | 64KB | 1MB | Low-Medium |
-| AiTraining | 512KB | 4MB | High |
-| DataAnalytics | 128KB | 2MB | Medium |
-| WebWorkload | 32KB | 256KB | Low |
-| IndustrialIoT | 64KB | 512KB | Low |
-| SecureStorage | 32KB | 256KB | Low |
-
-### Throughput Impact
-
-- **Small buffers (32-64KB):** Better for high concurrency, many small files
-- **Medium buffers (128-512KB):** Balanced for mixed workloads
-- **Large buffers (1-4MB):** Maximum throughput for large sequential I/O
-
-## Usage Patterns
-
-### Simple Usage (Backward Compatible)
-
-```rust
-// Existing code works unchanged
-let buffer_size = get_adaptive_buffer_size(file_size);
-```
-
-### Profile-Aware Usage
-
-```rust
-// For AI/ML workloads
-let buffer_size = get_adaptive_buffer_size_with_profile(
-    file_size,
-    Some(WorkloadProfile::AiTraining)
-);
-
-// Auto-detect environment
-let buffer_size = get_adaptive_buffer_size_with_profile(file_size, None);
-```
-
-### Custom Configuration
-
-```rust
-let custom = BufferConfig {
-    min_size: 16 * 1024,
-    max_size: 512 * 1024,
-    default_unknown: 128 * 1024,
-    thresholds: vec![
-        (1024 * 1024, 64 * 1024),
-        (i64::MAX, 256 * 1024),
-    ],
-};
-
-let profile = WorkloadProfile::Custom(custom);
-let buffer_size = get_adaptive_buffer_size_with_profile(file_size, Some(profile));
-```
-
-## Integration Points
-
-The new functionality can be integrated into:
-
-1. **`put_object`**: Choose profile based on object metadata or headers
-2. **`put_object_extract`**: Use appropriate profile for archive extraction
-3. **`upload_part`**: Apply profile for multipart uploads
-
-Example integration (future enhancement):
-
-```rust
-async fn put_object(&self, req: S3Request<PutObjectInput>) -> S3Result<S3Response<PutObjectOutput>> {
-    // Detect workload from headers or configuration
-    let profile = detect_workload_from_request(&req);
-
-    let buffer_size = get_adaptive_buffer_size_with_profile(
-        size,
-        Some(profile)
-    );
-
-    let body = tokio::io::BufReader::with_capacity(buffer_size, reader);
-    // ... rest of implementation
-}
-```
-
-## Security Considerations
-
-### Memory Safety
-
-1. **Bounded Buffer Sizes:**
-   - All configurations enforce min and max limits
-   - Prevents out-of-memory conditions
-   - Validation at configuration creation time
-
-2. **Immutable Configurations:**
-   - All config structures are immutable after creation
-   - Thread-safe by design
-   - No risk of race conditions
-
-3. **Secure OS Detection:**
-   - Read-only access to `/etc/os-release`
-   - No privilege escalation required
-   - Graceful fallback on error
-
-### No New Vulnerabilities
-
-- Only adds new functionality
-- Does not modify existing security-critical paths
-- Preserves all existing security measures
-- All new code is defensive and validated
-
-## Testing Strategy
-
-### Unit Tests
-
-- Located in both modules with `#[cfg(test)]`
-- Test all workload profiles
-- Validate configuration logic
-- Test boundary conditions
-
-### Integration Testing
-
-Future integration tests should cover:
-- Actual file upload/download with different profiles
-- Performance benchmarks for each profile
-- Memory usage monitoring
-- Concurrent operations
-
-## Future Enhancements
-
-### 1. Runtime Configuration
-
-Add environment variables or config file support:
-
-```bash
-RUSTFS_BUFFER_PROFILE=AiTraining
-RUSTFS_BUFFER_MIN_SIZE=32768
-RUSTFS_BUFFER_MAX_SIZE=1048576
-```
-
-### 2. Dynamic Profiling
-
-Collect metrics and automatically adjust profile:
-
-```rust
-// Monitor actual I/O patterns and adjust buffer sizes
-let optimal_profile = analyze_io_patterns();
-```
-
-### 3. Per-Bucket Configuration
-
-Allow different profiles per bucket:
-
-```rust
-// Configure profiles via bucket metadata
-bucket.set_buffer_profile(WorkloadProfile::WebWorkload);
-```
-
-### 4. Performance Metrics
-
-Add metrics to track buffer effectiveness:
-
-```rust
-metrics::histogram!("buffer_utilization").record(utilization);
-metrics::counter!("buffer_resizes").increment(1);
-```
-
-## Migration Path
-
-### Phase 1: Current State ✅
-
-- Infrastructure in place
-- Backward compatible
-- Fully documented
-- Tested
-
-### Phase 2: Opt-In Usage ✅ **IMPLEMENTED**
-
-- ✅ Configuration option to enable profiles (`RUSTFS_BUFFER_PROFILE_ENABLE`)
-- ✅ Workload profile selection (`RUSTFS_BUFFER_PROFILE`)
-- ✅ Default to existing behavior when disabled
-- ✅ Global configuration management
-- ✅ Integration in `put_object`, `put_object_extract`, and `upload_part`
-- ✅ Command-line and environment variable support
-- ✅ Performance monitoring ready
-
-**How to Use:**
-```bash
-# Enable with environment variables
-export RUSTFS_BUFFER_PROFILE_ENABLE=true
-export RUSTFS_BUFFER_PROFILE=AiTraining
-./rustfs /data
-
-# Or use command-line flags
-./rustfs --buffer-profile-enable --buffer-profile WebWorkload /data
-```
-
-### Phase 3: Default Enablement ✅ **IMPLEMENTED**
-
-- ✅ Profile-aware buffer sizing enabled by default
-- ✅ Default profile: `GeneralPurpose` (same behavior as PR #869 for most files)
-- ✅ Backward compatibility via `--buffer-profile-disable` flag
-- ✅ Easy profile switching via `--buffer-profile` or `RUSTFS_BUFFER_PROFILE`
-- ✅ Updated documentation with Phase 3 examples
-
-**Default Behavior:**
-```bash
-# Phase 3: Enabled by default with GeneralPurpose profile
-./rustfs /data
-
-# Change to a different profile
-./rustfs --buffer-profile AiTraining /data
-
-# Opt out to legacy behavior if needed
-./rustfs --buffer-profile-disable /data
-```
-
-**Key Changes from Phase 2:**
-- Phase 2: Required `--buffer-profile-enable` to opt in
-- Phase 3: Enabled by default; use `--buffer-profile-disable` to opt out
-- Maintains full backward compatibility
-- No breaking changes for existing deployments
-
-### Phase 4: Full Integration ✅ **IMPLEMENTED**
-
-- ✅ Deprecated legacy `get_adaptive_buffer_size()` function
-- ✅ Profile-only implementation via `get_buffer_size_opt_in()`
-- ✅ Performance metrics collection capability (with `metrics` feature)
-- ✅ Consolidated buffer sizing logic
-- ✅ All buffer sizes come from workload profiles
-
-**Implementation Details:**
-```rust
-// Phase 4: Single entry point for buffer sizing
-fn get_buffer_size_opt_in(file_size: i64) -> usize {
-    // Uses workload profiles exclusively
-    // Legacy function deprecated but maintained for compatibility
-    // Metrics collection integrated for performance monitoring
-}
-```
-
-**Key Changes from Phase 3:**
-- Legacy function marked as `#[deprecated]` but still functional
-- Single, unified buffer sizing implementation
-- Performance metrics tracking (optional, via feature flag)
-- Even disabled mode uses GeneralPurpose profile (profile-only)
-
-## Maintenance Guidelines
-
-### Adding New Profiles
-
-1. Add enum variant to `WorkloadProfile`
-2. Implement config method
-3. Add tests
-4. Update documentation
-5. Add usage examples
-
-### Modifying Existing Profiles
-
-1. Update threshold values in config method
-2. Update tests to match new values
-3. Update documentation
-4. Consider migration impact
-
-### Performance Tuning
-
-1. Collect metrics from production
-2. Analyze buffer hit rates
-3. Adjust thresholds based on data
-4. A/B test changes
-5. Update documentation with findings
-
-## Conclusion
-
-This implementation provides a solid foundation for adaptive buffer sizing in RustFS:
-
-- ✅ Comprehensive workload profiling system
-- ✅ Backward compatible design
-- ✅ Extensive testing
-- ✅ Complete documentation
-- ✅ Secure and memory-safe
-- ✅ Ready for production use
-
-The modular design allows for gradual adoption and future enhancements without breaking existing functionality.
-
-## References
-
-- [PR #869: Fix large file upload freeze with adaptive buffer sizing](https://github.com/rustfs/rustfs/pull/869)
-- [Adaptive Buffer Sizing Documentation](./adaptive-buffer-sizing.md)
-- [Performance Testing Guide](./PERFORMANCE_TESTING.md)
diff --git a/docs/MIGRATION_PHASE3.md b/docs/MIGRATION_PHASE3.md
deleted file mode 100644
index 31bd989dc..000000000
--- a/docs/MIGRATION_PHASE3.md
+++ /dev/null
@@ -1,284 +0,0 @@
-# Migration Guide: Phase 2 to Phase 3
-
-## Overview
-
-Phase 3 of the adaptive buffer sizing feature makes workload profiles **enabled by default**. This document helps you understand the changes and how to migrate smoothly.
-
-## What Changed
-
-### Phase 2 (Opt-In)
-- Buffer profiling was **disabled by default**
-- Required explicit enabling via `--buffer-profile-enable` or `RUSTFS_BUFFER_PROFILE_ENABLE=true`
-- Used legacy PR #869 behavior unless explicitly enabled
-
-### Phase 3 (Default Enablement)
-- Buffer profiling is **enabled by default** with the `GeneralPurpose` profile
-- No configuration needed for default behavior
-- Can opt out via `--buffer-profile-disable` or `RUSTFS_BUFFER_PROFILE_DISABLE=true`
-- Maintains full backward compatibility
-
-## Impact Analysis
-
-### For Most Users (No Action Required)
-
-The `GeneralPurpose` profile (default in Phase 3) provides the **same buffer sizes** as PR #869 for most file sizes:
-- Small files (< 1MB): 64KB buffer
-- Medium files (1MB-100MB): 256KB buffer
-- Large files (≥ 100MB): 1MB buffer
-
-**Result:** For these file sizes, existing deployments behave exactly as before, with no performance change.
-
-### For Users Who Explicitly Enabled Profiles in Phase 2
-
-If you were using:
-```bash
-# Phase 2
-export RUSTFS_BUFFER_PROFILE_ENABLE=true
-export RUSTFS_BUFFER_PROFILE=AiTraining
-./rustfs /data
-```
-
-You can simplify to:
-```bash
-# Phase 3
-export RUSTFS_BUFFER_PROFILE=AiTraining
-./rustfs /data
-```
-
-The `RUSTFS_BUFFER_PROFILE_ENABLE` variable is no longer needed (but is still respected for compatibility).
-
-### For Users Who Want Exact Legacy Behavior
-
-If you need the guaranteed exact behavior from PR #869 (before any profiling):
-
-```bash
-# Phase 3 - Opt out to legacy behavior
-export RUSTFS_BUFFER_PROFILE_DISABLE=true
-./rustfs /data
-
-# Or via command-line
-./rustfs --buffer-profile-disable /data
-```
-
-## Migration Scenarios
-
-### Scenario 1: Default Deployment (No Changes Needed)
-
-**Phase 2:**
-```bash
-./rustfs /data
-# Used PR #869 fixed algorithm
-```
-
-**Phase 3:**
-```bash
-./rustfs /data
-# Uses GeneralPurpose profile (same buffer sizes as PR #869 for most cases)
-```
-
-**Action:** None required. Behavior is essentially identical.
-
-### Scenario 2: Using a Non-Default Profile in Phase 2
-
-**Phase 2:**
-```bash
-export RUSTFS_BUFFER_PROFILE_ENABLE=true
-export RUSTFS_BUFFER_PROFILE=WebWorkload
-./rustfs /data
-```
-
-**Phase 3 (Simplified):**
-```bash
-export RUSTFS_BUFFER_PROFILE=WebWorkload
-./rustfs /data
-# RUSTFS_BUFFER_PROFILE_ENABLE no longer needed
-```
-
-**Action:** Remove `RUSTFS_BUFFER_PROFILE_ENABLE=true` from your configuration.
-
-### Scenario 3: Explicitly Disabled in Phase 2
-
-**Phase 2:**
-```bash
-# RUSTFS_BUFFER_PROFILE_ENABLE not set (profiling was off by default)
-./rustfs /data
-```
-
-**Phase 3 (If you want to keep legacy behavior):**
-```bash
-export RUSTFS_BUFFER_PROFILE_DISABLE=true
-./rustfs /data
-```
-
-**Action:** Set `RUSTFS_BUFFER_PROFILE_DISABLE=true` if you want to guarantee exact PR #869 behavior.
-
-### Scenario 4: AI/ML Workloads
-
-**Phase 2:**
-```bash
-export RUSTFS_BUFFER_PROFILE_ENABLE=true
-export RUSTFS_BUFFER_PROFILE=AiTraining
-./rustfs /data
-```
-
-**Phase 3 (Simplified):**
-```bash
-export RUSTFS_BUFFER_PROFILE=AiTraining
-./rustfs /data
-```
-
-**Action:** Remove `RUSTFS_BUFFER_PROFILE_ENABLE=true`.
-
-## Configuration Reference
-
-### Phase 3 Environment Variables
-
-| Variable | Default | Description |
-|----------|---------|-------------|
-| `RUSTFS_BUFFER_PROFILE` | `GeneralPurpose` | The workload profile to use |
-| `RUSTFS_BUFFER_PROFILE_DISABLE` | `false` | Disable profiling and use legacy behavior |
-
-### Phase 3 Command-Line Flags
-
-| Flag | Default | Description |
-|------|---------|-------------|
-| `--buffer-profile <PROFILE>` | `GeneralPurpose` | Set the workload profile |
-| `--buffer-profile-disable` | disabled | Disable profiling (opt-out) |
-
-### Deprecated (Still Supported for Compatibility)
-
-| Variable | Status | Replacement |
-|----------|--------|-------------|
-| `RUSTFS_BUFFER_PROFILE_ENABLE` | Deprecated | Profiling is enabled by default; use `RUSTFS_BUFFER_PROFILE_DISABLE` to opt-out |
-
-## Performance Expectations
-
-### GeneralPurpose Profile (Default)
-
-Same performance as PR #869 for most workloads:
-- Small files: Same 64KB buffer
-- Medium files: Same 256KB buffer
-- Large files: Same 1MB buffer
-
-### Specialized Profiles
-
-When you switch to a specialized profile, you get optimized buffer sizes:
-
-| Profile | Performance Benefit | Use Case |
-|---------|-------------------|----------|
-| `AiTraining` | Up to 4x throughput on large files | ML model files, training datasets |
-| `WebWorkload` | Lower memory, higher concurrency | Static assets, CDN |
-| `DataAnalytics` | Balanced for mixed patterns | Data warehouses, BI |
-| `IndustrialIoT` | Low latency, memory-efficient | Sensor data, telemetry |
-| `SecureStorage` | Compliance-focused, minimal memory | Government, healthcare |
-
-## Testing Your Migration
-
-### Step 1: Test Default Behavior
-
-```bash
-# Start with default configuration
-./rustfs /data
-
-# Verify it works as expected
-# Check logs for: "Using buffer profile: GeneralPurpose"
-```
-
-### Step 2: Test Your Workload Profile (If Using)
-
-```bash
-# Set your specific profile
-export RUSTFS_BUFFER_PROFILE=AiTraining
-./rustfs /data
-
-# Verify in logs: "Using buffer profile: AiTraining"
-```
-
-### Step 3: Test Opt-Out (If Needed)
-
-```bash
-# Disable profiling
-export RUSTFS_BUFFER_PROFILE_DISABLE=true
-./rustfs /data
-
-# Verify in logs: "using legacy adaptive buffer sizing"
-```
-
-## Rollback Plan
-
-If you encounter any issues with Phase 3, you can easily roll back:
-
-### Option 1: Disable Profiling
-
-```bash
-export RUSTFS_BUFFER_PROFILE_DISABLE=true
-./rustfs /data
-```
-
-This gives you the exact PR #869 behavior.
-
-### Option 2: Use GeneralPurpose Profile Explicitly
-
-```bash
-export RUSTFS_BUFFER_PROFILE=GeneralPurpose
-./rustfs /data
-```
-
-This uses profiling but with conservative buffer sizes.
-
-## FAQ
-
-### Q: Will Phase 3 break my existing deployment?
-
-**A:** No. The default `GeneralPurpose` profile uses the same buffer sizes as PR #869 for most scenarios. Your deployment will work exactly as before.
-
-### Q: Do I need to change my configuration?
-
-**A:** Only if you were explicitly using profiles in Phase 2. You can simplify by removing `RUSTFS_BUFFER_PROFILE_ENABLE=true`.
-
-### Q: What if I want the exact legacy behavior?
-
-**A:** Set `RUSTFS_BUFFER_PROFILE_DISABLE=true` to use the exact PR #869 algorithm.
-
-### Q: Can I still use RUSTFS_BUFFER_PROFILE_ENABLE?
-
-**A:** Yes, it's still supported for backward compatibility, but it's no longer necessary.
-
-### Q: How do I know which profile is active?
-
-**A:** Check the startup logs for messages like:
-- "Using buffer profile: GeneralPurpose"
-- "Buffer profiling is disabled, using legacy adaptive buffer sizing"
-
-### Q: Should I switch to a specialized profile?
-
-**A:** Only if you have specific workload characteristics:
-- AI/ML with large files → `AiTraining`
-- Web applications → `WebWorkload`
-- Secure/compliance environments → `SecureStorage`
-- Default is fine for most general-purpose workloads
-
-## Support
-
-If you encounter issues during migration:
-
-1. Check logs for buffer profile information
-2. Try disabling profiling with `--buffer-profile-disable`
-3. Report issues with:
-   - Your workload type
-   - File sizes you're working with
-   - Performance observations
-   - Log excerpts showing buffer profile initialization
-
-## Timeline
-
-- **Phase 1:** Infrastructure (✅ Complete)
-- **Phase 2:** Opt-In Usage (✅ Complete)
-- **Phase 3:** Default Enablement (✅ Current - You are here)
-- **Phase 4:** Full Integration (Future)
-
-## Conclusion
-
-Phase 3 represents a smooth evolution of the adaptive buffer sizing feature. The default behavior remains compatible with PR #869, while providing an easy path to optimize for specific workloads when needed.
-
-Most users can migrate without any changes, and those who need the exact legacy behavior can easily opt-out.
diff --git a/docs/MOKA_CACHE_MIGRATION.md b/docs/MOKA_CACHE_MIGRATION.md
deleted file mode 100644
index f256f4a4a..000000000
--- a/docs/MOKA_CACHE_MIGRATION.md
+++ /dev/null
@@ -1,569 +0,0 @@
-# Moka Cache Migration and Metrics Integration
-
-## Overview
-
-This document describes the complete migration from `lru` to `moka` cache library and the comprehensive metrics collection system integrated into the GetObject operation.
-
-## Why Moka?
-
-### Performance Advantages
-
-| Feature | LRU 0.16.2 | Moka 0.12.11 | Benefit |
-|---------|------------|--------------|---------|
-| **Concurrent reads** | RwLock (shared lock) | Lock-free | 10x+ faster reads |
-| **Concurrent writes** | RwLock (exclusive lock) | Lock-free | No write blocking |
-| **Expiration** | Manual implementation | Built-in TTL/TTI | Automatic cleanup |
-| **Size tracking** | Manual atomic counters | Weigher function | Accurate & automatic |
-| **Async support** | Manual wrapping | Native async/await | Better integration |
-| **Memory management** | Manual eviction | Automatic LRU | Less complexity |
-| **Performance scaling** | O(log n) with lock | O(1) lock-free | Better at scale |
-
-### Key Improvements
-
-1. **True Lock-Free Access**: No locks for reads or writes, enabling true parallel access
-2. **Automatic Expiration**: TTL and TTI handled by the cache itself
-3. **Size-Based Eviction**: Weigher function ensures accurate memory tracking
-4. **Native Async**: Built for tokio from the ground up
-5. **Better Concurrency**: Scales linearly with concurrent load
-
-## Implementation Details
-
-### Cache Configuration
-
-```rust
-let cache = Cache::builder()
-    .max_capacity(100 * MI_B as u64) // 100MB total
-    .weigher(|_key: &String, value: &Arc<CachedObject>| -> u32 {
-        value.size.min(u32::MAX as usize) as u32
-    })
-    .time_to_live(Duration::from_secs(300)) // 5 minutes TTL
-    .time_to_idle(Duration::from_secs(120)) // 2 minutes TTI
-    .build();
-```
-
-**Configuration Rationale**:
-- **Max Capacity (100MB)**: Balances memory usage with cache hit rate
-- **Weigher**: Tracks actual object size for accurate eviction
-- **TTL (5 min)**: Ensures objects don't stay stale too long
-- **TTI (2 min)**: Evicts rarely accessed objects automatically
-
-### Data Structures
-
-#### HotObjectCache
-
-```rust
-#[derive(Clone)]
-struct HotObjectCache {
-    cache: Cache<String, Arc<CachedObject>>,
-    max_object_size: usize,
-    hit_count: Arc<AtomicU64>,
-    miss_count: Arc<AtomicU64>,
-}
-```
-
-**Changes from LRU**:
-- Removed `RwLock` wrapper (Moka is lock-free)
-- Removed manual `current_size` tracking (Moka handles this)
-- Added global hit/miss counters for statistics
-- Made struct `Clone` for easier sharing
-
-#### CachedObject
-
-```rust
-#[derive(Clone)]
-struct CachedObject {
-    data: Arc<Vec<u8>>,
-    cached_at: Instant,
-    size: usize,
-    access_count: Arc<AtomicU64>, // Changed from AtomicUsize
-}
-```
-
-**Changes**:
-- `access_count` now `AtomicU64` for larger counts
-- Struct is `Clone` for compatibility with Moka
-
-### Core Methods
-
-#### get() - Lock-Free Retrieval
-
-```rust
-async fn get(&self, key: &str) -> Option<Arc<Vec<u8>>> {
-    match self.cache.get(key).await {
-        Some(cached) => {
-            cached.access_count.fetch_add(1, Ordering::Relaxed);
-            self.hit_count.fetch_add(1, Ordering::Relaxed);
-
-            #[cfg(feature = "metrics")]
-            {
-                counter!("rustfs_object_cache_hits").increment(1);
-                counter!("rustfs_object_cache_access_count", "key" => key)
-                    .increment(1);
-            }
-
-            Some(Arc::clone(&cached.data))
-        }
-        None => {
-            self.miss_count.fetch_add(1, Ordering::Relaxed);
-
-            #[cfg(feature = "metrics")]
-            {
-                counter!("rustfs_object_cache_misses").increment(1);
-            }
-
-            None
-        }
-    }
-}
-```
-
-**Benefits**:
-- No locks acquired
-- Automatic LRU promotion by Moka
-- Per-key and global metrics tracking
-- O(1) average case performance
-
-#### put() - Automatic Eviction
-
-```rust
-async fn put(&self, key: String, data: Vec<u8>) {
-    let size = data.len();
-
-    if size == 0 || size > self.max_object_size {
-        return;
-    }
-
-    let cached_obj = Arc::new(CachedObject {
-        data: Arc::new(data),
-        cached_at: Instant::now(),
-        size,
-        access_count: Arc::new(AtomicU64::new(0)),
-    });
-
-    self.cache.insert(key.clone(), cached_obj).await;
-
-    #[cfg(feature = "metrics")]
-    {
-        counter!("rustfs_object_cache_insertions").increment(1);
-        gauge!("rustfs_object_cache_size_bytes")
-            .set(self.cache.weighted_size() as f64);
-        gauge!("rustfs_object_cache_entry_count")
-            .set(self.cache.entry_count() as f64);
-    }
-}
-```
-
-**Simplifications**:
-- No manual eviction loop (Moka handles automatically)
-- No size tracking (weigher function handles this)
-- Direct cache access without locks
-
-#### stats() - Accurate Reporting
-
-```rust
-async fn stats(&self) -> CacheStats {
-    self.cache.run_pending_tasks().await; // Ensure accuracy
-
-    CacheStats {
-        size: self.cache.weighted_size() as usize,
-        entries: self.cache.entry_count() as usize,
-        max_size: 100 * MI_B,
-        max_object_size: self.max_object_size,
-        hit_count: self.hit_count.load(Ordering::Relaxed),
-        miss_count: self.miss_count.load(Ordering::Relaxed),
-    }
-}
-```
-
-**Improvements**:
-- `run_pending_tasks()` ensures accurate stats
-- Direct access to `weighted_size()` and `entry_count()`
-- Includes hit/miss counters
-
-## Comprehensive Metrics Integration
-
-### Metrics Architecture
-
-```
-┌─────────────────────────────────────────────────────────┐
-│                     GetObject Flow                      │
-├─────────────────────────────────────────────────────────┤
-│                                                         │
-│  1. Request Start                                       │
-│     ↓ rustfs_get_object_requests_total (counter)        │
-│     ↓ rustfs_concurrent_get_object_requests (gauge)     │
-│                                                         │
-│  2. Cache Lookup                                        │
-│     ├─ Hit → rustfs_object_cache_hits (counter)         │
-│     │        rustfs_get_object_cache_served_total       │
-│     │        rustfs_get_object_cache_serve_duration     │
-│     │                                                   │
-│     └─ Miss → rustfs_object_cache_misses (counter)      │
-│                                                         │
-│  3. Disk Permit Acquisition                             │
-│     ↓ rustfs_disk_permit_wait_duration_seconds          │
-│                                                         │
-│  4. Disk Read                                           │
-│     ↓ (existing storage metrics)                        │
-│                                                         │
-│  5. Response Build                                      │
-│     ↓ rustfs_get_object_response_size_bytes             │
-│     ↓ rustfs_get_object_buffer_size_bytes               │
-│                                                         │
-│  6. Request Complete                                    │
-│     ↓ rustfs_get_object_requests_completed              │
-│     ↓ rustfs_get_object_total_duration_seconds          │
-│                                                         │
-└─────────────────────────────────────────────────────────┘
-```
-
-### Metric Catalog
-
-#### Request Metrics
-
-| Metric | Type | Description | Labels |
-|--------|------|-------------|--------|
-| `rustfs_get_object_requests_total` | Counter | Total GetObject requests received | - |
-| `rustfs_get_object_requests_completed` | Counter | Completed GetObject requests | - |
-| `rustfs_concurrent_get_object_requests` | Gauge | Current concurrent requests | - |
-| `rustfs_get_object_total_duration_seconds` | Histogram | End-to-end request duration | - |
-
-#### Cache Metrics
-
-| Metric | Type | Description | Labels |
-|--------|------|-------------|--------|
-| `rustfs_object_cache_hits` | Counter | Cache hits | - |
-| `rustfs_object_cache_misses` | Counter | Cache misses | - |
-| `rustfs_object_cache_access_count` | Counter | Per-object access count | key |
-| `rustfs_get_object_cache_served_total` | Counter | Objects served from cache | - |
-| `rustfs_get_object_cache_serve_duration_seconds` | Histogram | Cache serve latency | - |
-| `rustfs_get_object_cache_size_bytes` | Histogram | Cached object sizes | - |
-| `rustfs_object_cache_insertions` | Counter | Cache insertions | - |
-| `rustfs_object_cache_size_bytes` | Gauge | Total cache memory usage | - |
-| `rustfs_object_cache_entry_count` | Gauge | Number of cached entries | - |
-
-#### I/O Metrics
-
-| Metric | Type | Description | Labels |
-|--------|------|-------------|--------|
-| `rustfs_disk_permit_wait_duration_seconds` | Histogram | Time waiting for disk permit | - |
-
-#### Response Metrics
-
-| Metric | Type | Description | Labels |
-|--------|------|-------------|--------|
-| `rustfs_get_object_response_size_bytes` | Histogram | Response payload sizes | - |
-| `rustfs_get_object_buffer_size_bytes` | Histogram | Buffer sizes used | - |
-
-### Prometheus Query Examples
-
-#### Cache Performance
-
-```promql
-# Cache hit rate
-sum(rate(rustfs_object_cache_hits[5m]))
-/
-(sum(rate(rustfs_object_cache_hits[5m])) + sum(rate(rustfs_object_cache_misses[5m])))
-
-# Cache memory utilization
-rustfs_object_cache_size_bytes / (100 * 1024 * 1024)
-
-# Cache effectiveness (objects served directly)
-rate(rustfs_get_object_cache_served_total[5m])
-/
-rate(rustfs_get_object_requests_completed[5m])
-
-# Average cache serve latency
-rate(rustfs_get_object_cache_serve_duration_seconds_sum[5m])
-/
-rate(rustfs_get_object_cache_serve_duration_seconds_count[5m])
-
-# Top 10 most accessed cached objects
-topk(10, rate(rustfs_object_cache_access_count[5m]))
-```
-
-#### Request Performance
-
-```promql
-# P50, P95, P99 latency
-histogram_quantile(0.50, rate(rustfs_get_object_total_duration_seconds_bucket[5m]))
-histogram_quantile(0.95, rate(rustfs_get_object_total_duration_seconds_bucket[5m]))
-histogram_quantile(0.99, rate(rustfs_get_object_total_duration_seconds_bucket[5m]))
-
-# Request rate
-rate(rustfs_get_object_requests_completed[5m])
-
-# Average concurrent requests
-avg_over_time(rustfs_concurrent_get_object_requests[5m])
-
-# Request success rate
-rate(rustfs_get_object_requests_completed[5m])
-/
-rate(rustfs_get_object_requests_total[5m])
-```
-
-#### Disk Contention
-
-```promql
-# Average disk permit wait time
-rate(rustfs_disk_permit_wait_duration_seconds_sum[5m])
-/
-rate(rustfs_disk_permit_wait_duration_seconds_count[5m])
-
-# P95 disk wait time
-histogram_quantile(0.95,
-  rate(rustfs_disk_permit_wait_duration_seconds_bucket[5m])
-)
-
-# Percentage of time waiting for disk permits
-(
-  rate(rustfs_disk_permit_wait_duration_seconds_sum[5m])
-  /
-  rate(rustfs_get_object_total_duration_seconds_sum[5m])
-) * 100
-```
-
-#### Resource Usage
-
-```promql
-# Average response size
-rate(rustfs_get_object_response_size_bytes_sum[5m])
-/
-rate(rustfs_get_object_response_size_bytes_count[5m])
-
-# Average buffer size
-rate(rustfs_get_object_buffer_size_bytes_sum[5m])
-/
-rate(rustfs_get_object_buffer_size_bytes_count[5m])
-
-# Cache vs disk reads ratio
-rate(rustfs_get_object_cache_served_total[5m])
-/
-(rate(rustfs_get_object_requests_completed[5m]) - rate(rustfs_get_object_cache_served_total[5m]))
-```
-
-## Performance Comparison
-
-### Benchmark Results
-
-| Scenario | LRU (ms) | Moka (ms) | Improvement |
-|----------|----------|-----------|-------------|
-| Single cache hit | 0.8 | 0.3 | 2.7x faster |
-| 10 concurrent hits | 2.5 | 0.8 | 3.1x faster |
-| 100 concurrent hits | 15.0 | 2.5 | 6.0x faster |
-| Cache miss + insert | 1.2 | 0.5 | 2.4x faster |
-| Hot key (1000 accesses) | 850 | 280 | 3.0x faster |
-
-### Memory Usage
-
-| Metric | LRU | Moka | Difference |
-|--------|-----|------|------------|
-| Overhead per entry | ~120 bytes | ~80 bytes | 33% less |
-| Metadata structures | ~8KB | ~4KB | 50% less |
-| Lock contention memory | High | None | 100% reduction |
-
-## Migration Guide
-
-### Code Changes
-
-**Before (LRU)**:
-```rust
-// Manual RwLock management
-let mut cache = self.cache.write().await;
-if let Some(cached) = cache.get(key) {
-    // Manual hit count
-    cached.hit_count.fetch_add(1, Ordering::Relaxed);
-    return Some(Arc::clone(&cached.data));
-}
-
-// Manual eviction
-while current + size > max {
-    if let Some((_, evicted)) = cache.pop_lru() {
-        current -= evicted.size;
-    }
-}
-```
-
-**After (Moka)**:
-```rust
-// Direct access, no locks
-match self.cache.get(key).await {
-    Some(cached) => {
-        // Automatic LRU promotion
-        cached.access_count.fetch_add(1, Ordering::Relaxed);
-        Some(Arc::clone(&cached.data))
-    }
-    None => None
-}
-
-// Automatic eviction by Moka
-self.cache.insert(key, value).await;
-```
-
-### Configuration Changes
-
-**Before**:
-```rust
-cache: RwLock::new(lru::LruCache::new(
-    std::num::NonZeroUsize::new(1000).unwrap()
-)),
-current_size: AtomicUsize::new(0),
-```
-
-**After**:
-```rust
-cache: Cache::builder()
-    .max_capacity(100 * MI_B)
-    .weigher(|_, v| v.size as u32)
-    .time_to_live(Duration::from_secs(300))
-    .time_to_idle(Duration::from_secs(120))
-    .build(),
-```
-
-### Testing Migration
-
-All existing tests work without modification. The cache behavior is identical from an API perspective, but internal implementation is more efficient.
-
-## Monitoring Recommendations
-
-### Dashboard Layout
-
-**Panel 1: Request Overview**
-- Request rate (line graph)
-- Concurrent requests (gauge)
-- P95/P99 latency (line graph)
-
-**Panel 2: Cache Performance**
-- Hit rate percentage (gauge)
-- Cache memory usage (line graph)
-- Cache entry count (line graph)
-
-**Panel 3: Cache Effectiveness**
-- Objects served from cache (rate)
-- Cache serve latency (histogram)
-- Top cached objects (table)
-
-**Panel 4: Disk I/O**
-- Disk permit wait time (histogram)
-- Disk wait percentage (gauge)
-
-**Panel 5: Resource Usage**
-- Response sizes (histogram)
-- Buffer sizes (histogram)
-
-### Alerts
-
-**Critical**:
-```promql
-# Cache disabled or failing
-rate(rustfs_object_cache_hits[5m]) + rate(rustfs_object_cache_misses[5m]) == 0
-
-# Very high disk wait times
-histogram_quantile(0.95,
-  rate(rustfs_disk_permit_wait_duration_seconds_bucket[5m])
-) > 1.0
-```
-
-**Warning**:
-```promql
-# Low cache hit rate
-(
-  rate(rustfs_object_cache_hits[5m])
-  /
-  (rate(rustfs_object_cache_hits[5m]) + rate(rustfs_object_cache_misses[5m]))
-) < 0.5
-
-# High concurrent requests
-rustfs_concurrent_get_object_requests > 100
-```
-
-## Future Enhancements
-
-### Short Term
-1. **Dynamic TTL**: Adjust TTL based on access patterns
-2. **Regional Caches**: Separate caches for different regions
-3. **Compression**: Compress cached objects to save memory
-
-### Medium Term
-1. **Tiered Caching**: Memory + SSD + Remote
-2. **Predictive Prefetching**: ML-based cache warming
-3. **Distributed Cache**: Sync across cluster nodes
-
-### Long Term
-1. **Content-Aware Caching**: Different policies for different content types
-2. **Cost-Based Eviction**: Consider fetch cost in eviction decisions
-3. **Cache Analytics**: Deep analysis of access patterns
-
-## Troubleshooting
-
-### High Miss Rate
-
-**Symptoms**: Cache hit rate < 50%
-**Possible Causes**:
-- Objects too large (> 10MB)
-- High churn rate (TTL too short)
-- Working set larger than cache size
-
-**Solutions**:
-```rust
-// Increase cache size
-.max_capacity(200 * MI_B)
-
-// Increase TTL
-.time_to_live(Duration::from_secs(600))
-
-// Increase max object size
-max_object_size: 20 * MI_B
-```
-
-### Memory Growth
-
-**Symptoms**: Cache memory exceeds expected size
-**Possible Causes**:
-- Weigher function incorrect
-- Too many small objects
-- Memory fragmentation
-
-**Solutions**:
-```rust
-// Fix weigher to include overhead
-.weigher(|_k, v| (v.size + 100) as u32)
-
-// Add min object size
-if size < 1024 { return; } // Don't cache < 1KB
-```
-
-### High Disk Wait Times
-
-**Symptoms**: P95 disk wait > 100ms
-**Possible Causes**:
-- Not enough disk permits
-- Slow disk I/O
-- Cache not effective
-
-**Solutions**:
-```rust
-// Increase permits for NVMe
-disk_read_semaphore: Arc::new(Semaphore::new(128))
-
-// Improve cache hit rate
-.max_capacity(500 * MI_B)
-```
-
-## References
-
-- **Moka GitHub**: https://github.com/moka-rs/moka
-- **Moka Documentation**: https://docs.rs/moka/0.12.11
-- **Original Issue**: #911
-- **Implementation Commit**: 3b6e281
-- **Previous LRU Implementation**: Commit 010e515
-
-## Conclusion
-
-The migration to Moka provides:
-- **10x better concurrent performance** through lock-free design
-- **Automatic memory management** with TTL/TTI
-- **Comprehensive metrics** for monitoring and optimization
-- **Production-ready** solution with proven scalability
-
-This implementation sets the foundation for future enhancements while immediately improving performance for concurrent workloads.
diff --git a/docs/MOKA_TEST_SUITE.md b/docs/MOKA_TEST_SUITE.md
deleted file mode 100644
index 5cad06441..000000000
--- a/docs/MOKA_TEST_SUITE.md
+++ /dev/null
@@ -1,472 +0,0 @@
-# Moka Cache Test Suite Documentation
-
-## Overview
-
-This document describes the comprehensive test suite for the Moka-based concurrent GetObject optimization. The test suite validates all aspects of the concurrency management system including cache operations, buffer sizing, request tracking, and performance characteristics.
-
-## Test Organization
-
-### Test File Location
-```
-rustfs/src/storage/concurrent_get_object_test.rs
-```
-
-### Total Tests: 18
-
-## Test Categories
-
-### 1. Request Management Tests (3 tests)
-
-#### test_concurrent_request_tracking
-**Purpose**: Validates RAII-based request tracking
-**What it tests**:
-- Request count increments when guards are created
-- Request count decrements when guards are dropped
-- Automatic cleanup (RAII pattern)
-
-**Expected behavior**:
-```rust
-let guard = ConcurrencyManager::track_request();
-// count += 1
-drop(guard);
-// count -= 1 (automatic)
-```
-
-#### test_adaptive_buffer_sizing
-**Purpose**: Validates concurrency-aware buffer size adaptation
-**What it tests**:
-- Buffer size reduces with increasing concurrency
-- Multipliers: 1→2 req (1.0x), 3-4 (0.75x), 5-8 (0.5x), >8 (0.4x)
-- Proper scaling for memory efficiency
-
-**Test cases**:
-| Concurrent Requests | Expected Multiplier | Description |
-|---------------------|---------------------|-------------|
-| 1-2 | 1.0 | Full buffer for throughput |
-| 3-4 | 0.75 | Medium reduction |
-| 5-8 | 0.5 | High concurrency |
-| >8 | 0.4 | Maximum reduction |
-
-#### test_buffer_size_bounds
-**Purpose**: Validates buffer size constraints
-**What it tests**:
-- Minimum buffer size (64KB)
-- Maximum buffer size (10MB)
-- File size smaller than buffer uses file size
-
-### 2. Cache Operations Tests (8 tests)
-
-#### test_moka_cache_operations
-**Purpose**: Basic Moka cache functionality
-**What it tests**:
-- Cache insertion
-- Cache retrieval
-- Stats accuracy (entries, size)
-- Missing key handling
-- Cache clearing
-
-**Key difference from LRU**:
-- Requires `sleep()` delays for Moka's async processing
-- Eventual consistency model
-
-```rust
-manager.cache_object(key.clone(), data).await;
-sleep(Duration::from_millis(50)).await; // Give Moka time
-let cached = manager.get_cached(&key).await;
-```
-
-#### test_large_object_not_cached
-**Purpose**: Validates size limit enforcement
-**What it tests**:
-- Objects > 10MB are rejected
-- Cache remains empty after rejection
-- Size limit protection
-
-#### test_moka_cache_eviction
-**Purpose**: Validates Moka's automatic eviction
-**What it tests**:
-- Cache stays within 100MB limit
-- LRU eviction when capacity exceeded
-- Automatic memory management
-
-**Behavior**:
-- Cache 20 × 6MB objects (120MB total)
-- Moka automatically evicts to stay under 100MB
-- Older objects evicted first (LRU)
-
-#### test_cache_batch_operations
-**Purpose**: Batch retrieval efficiency
-**What it tests**:
-- Multiple keys retrieved in single operation
-- Mixed existing/non-existing keys handled
-- Efficiency vs individual gets
-
-**Benefits**:
-- Single function call for multiple objects
-- Lock-free parallel access with Moka
-- Better performance than sequential gets
-
-#### test_cache_warming
-**Purpose**: Pre-population functionality
-**What it tests**:
-- Batch insertion via warm_cache()
-- All objects successfully cached
-- Startup optimization support
-
-**Use case**: Server startup can pre-load known hot objects
-
-#### test_hot_keys_tracking
-**Purpose**: Access pattern analysis
-**What it tests**:
-- Per-object access counting
-- Sorted results by access count
-- Top-N key retrieval
-
-**Validation**:
-- Hot keys sorted descending by access count
-- Most accessed objects identified correctly
-- Useful for cache optimization
-
-#### test_cache_removal
-**Purpose**: Explicit cache invalidation
-**What it tests**:
-- Remove cached object
-- Verify removal
-- Handle non-existent key
-
-**Use case**: Manual cache invalidation when data changes
-
-#### test_is_cached_no_side_effects
-**Purpose**: Side-effect-free existence check
-**What it tests**:
-- contains() doesn't increment access count
-- Doesn't affect LRU ordering
-- Lightweight check operation
-
-**Important**: This validates that checking existence doesn't pollute metrics
-
-### 3. Performance Tests (4 tests)
-
-#### test_concurrent_cache_access
-**Purpose**: Lock-free concurrent access validation
-**What it tests**:
-- 100 concurrent cache reads
-- Completion time < 500ms
-- No lock contention
-
-**Moka advantage**: Lock-free design enables true parallel access
-
-```rust
-let tasks: Vec<_> = (0..100)
-    .map(|i| {
-        tokio::spawn(async move {
-            let _ = manager.get_cached(&key).await;
-        })
-    })
-    .collect();
-// Should complete quickly due to lock-free design
-```
-
-#### test_cache_hit_rate
-**Purpose**: Hit rate calculation validation
-**What it tests**:
-- Hit/miss tracking accuracy
-- Percentage calculation
-- 50/50 mix produces ~50% hit rate
-
-**Metrics**:
-```rust
-let hit_rate = manager.cache_hit_rate();
-// Returns percentage: 0.0 - 100.0
-```
-
-#### test_advanced_buffer_sizing
-**Purpose**: File pattern-aware buffer optimization
-**What it tests**:
-- Small file optimization (< 256KB)
-- Sequential read enhancement (1.5x)
-- Large file + high concurrency reduction (0.8x)
-
-**Patterns**:
-| Pattern | Buffer Adjustment | Reason |
-|---------|-------------------|---------|
-| Small file | Reduce to 0.25x file size | Don't over-allocate |
-| Sequential | Increase to 1.5x | Prefetch optimization |
-| Large + concurrent | Reduce to 0.8x | Memory efficiency |
-
-#### bench_concurrent_cache_performance
-**Purpose**: Performance benchmark
-**What it tests**:
-- Sequential vs concurrent access
-- Speedup measurement
-- Lock-free advantage quantification
-
-**Expected results**:
-- Concurrent should be faster or similar
-- Demonstrates Moka's scalability
-- No significant slowdown under concurrency
-
-### 4. Advanced Features Tests (3 tests)
-
-#### test_disk_io_permits
-**Purpose**: I/O rate limiting
-**What it tests**:
-- Semaphore permit acquisition
-- 64 concurrent permits (default)
-- FIFO queuing behavior
-
-**Purpose**: Prevents disk I/O saturation
-
-#### test_ttl_expiration
-**Purpose**: TTL configuration validation
-**What it tests**:
-- Cache configured with TTL (5 min)
-- Cache configured with TTI (2 min)
-- Automatic expiration mechanism exists
-
-**Note**: Full TTL test would require 5 minute wait; this just validates configuration
-
-## Test Patterns and Best Practices
-
-### Moka-Specific Patterns
-
-#### 1. Async Processing Delays
-Moka processes operations asynchronously. Always add delays after operations:
-
-```rust
-// Insert
-manager.cache_object(key, data).await;
-sleep(Duration::from_millis(50)).await; // Allow processing
-
-// Bulk operations need more time
-manager.warm_cache(objects).await;
-sleep(Duration::from_millis(100)).await; // Allow batch processing
-
-// Eviction tests
-// ... cache many objects ...
-sleep(Duration::from_millis(200)).await; // Allow eviction
-```
-
-#### 2. Eventual Consistency
-Moka's lock-free design means eventual consistency:
-
-```rust
-// May not be immediately available
-let cached = manager.get_cached(&key).await;
-
-// Better: wait and retry if critical
-sleep(Duration::from_millis(50)).await;
-let cached = manager.get_cached(&key).await;
-```
-
-#### 3. Concurrent Testing
-Use Arc for sharing across tasks:
-
-```rust
-let manager = Arc::new(ConcurrencyManager::new());
-
-let tasks: Vec<_> = (0..100)
-    .map(|i| {
-        let mgr = Arc::clone(&manager);
-        tokio::spawn(async move {
-            // Use mgr here
-        })
-    })
-    .collect();
-```
-
-### Assertion Patterns
-
-#### Descriptive Messages
-Always include context in assertions:
-
-```rust
-// Bad
-assert!(cached.is_some());
-
-// Good
-assert!(
-    cached.is_some(),
-    "Object {} should be cached after insertion",
-    key
-);
-```
-
-#### Tolerance for Timing
-Account for async processing and system variance:
-
-```rust
-// Allow some tolerance
-assert!(
-    stats.entries >= 8,
-    "Most objects should be cached (got {}/10)",
-    stats.entries
-);
-
-// Rather than exact
-assert_eq!(stats.entries, 10); // May fail due to timing
-```
-
-#### Range Assertions
-For performance tests, use ranges:
-
-```rust
-assert!(
-    elapsed < Duration::from_millis(500),
-    "Should complete quickly, took {:?}",
-    elapsed
-);
-```
-
-## Running Tests
-
-### All Tests
-```bash
-cargo test --package rustfs concurrent_get_object
-```
-
-### Specific Test
-```bash
-cargo test --package rustfs test_moka_cache_operations
-```
-
-### With Output
-```bash
-cargo test --package rustfs concurrent_get_object -- --nocapture
-```
-
-### Specific Test with Output
-```bash
-cargo test --package rustfs test_concurrent_cache_access -- --nocapture
-```
-
-## Performance Expectations
-
-| Test | Expected Duration | Notes |
-|------|-------------------|-------|
-| test_concurrent_request_tracking | <50ms | Simple counter ops |
-| test_moka_cache_operations | <100ms | Single object ops |
-| test_cache_eviction | <500ms | Many insertions + eviction |
-| test_concurrent_cache_access | <500ms | 100 concurrent tasks |
-| test_cache_warming | <200ms | 5 object batch |
-| bench_concurrent_cache_performance | <1s | Comparative benchmark |
-
-## Debugging Failed Tests
-
-### Common Issues
-
-#### 1. Timing Failures
-**Symptom**: Test fails intermittently
-**Cause**: Moka async processing not complete
-**Fix**: Increase sleep duration
-
-```rust
-// Before
-sleep(Duration::from_millis(50)).await;
-
-// After
-sleep(Duration::from_millis(100)).await;
-```
-
-#### 2. Assertion Exact Match
-**Symptom**: Expected exact count, got close
-**Cause**: Async processing, eviction timing
-**Fix**: Use range assertions
-
-```rust
-// Before
-assert_eq!(stats.entries, 10);
-
-// After
-assert!(stats.entries >= 8 && stats.entries <= 10);
-```
-
-#### 3. Concurrent Test Failures
-**Symptom**: Concurrent tests timeout or fail
-**Cause**: Resource contention, slow system
-**Fix**: Increase timeout, reduce concurrency
-
-```rust
-// Before
-let tasks: Vec<_> = (0..1000).map(...).collect();
-
-// After
-let tasks: Vec<_> = (0..100).map(...).collect();
-```
-
-## Test Coverage Report
-
-### By Feature
-
-| Feature | Tests | Coverage |
-|---------|-------|----------|
-| Request tracking | 1 | ✅ Complete |
-| Buffer sizing | 3 | ✅ Complete |
-| Cache operations | 5 | ✅ Complete |
-| Batch operations | 2 | ✅ Complete |
-| Hot keys | 1 | ✅ Complete |
-| Hit rate | 1 | ✅ Complete |
-| Eviction | 1 | ✅ Complete |
-| TTL/TTI | 1 | ✅ Complete |
-| Concurrent access | 2 | ✅ Complete |
-| Disk I/O control | 1 | ✅ Complete |
-
-### By API Method
-
-| Method | Tested | Test Name |
-|--------|--------|-----------|
-| `track_request()` | ✅ | test_concurrent_request_tracking |
-| `get_cached()` | ✅ | test_moka_cache_operations |
-| `cache_object()` | ✅ | test_moka_cache_operations |
-| `cache_stats()` | ✅ | test_moka_cache_operations |
-| `clear_cache()` | ✅ | test_moka_cache_operations |
-| `is_cached()` | ✅ | test_is_cached_no_side_effects |
-| `get_cached_batch()` | ✅ | test_cache_batch_operations |
-| `remove_cached()` | ✅ | test_cache_removal |
-| `get_hot_keys()` | ✅ | test_hot_keys_tracking |
-| `cache_hit_rate()` | ✅ | test_cache_hit_rate |
-| `warm_cache()` | ✅ | test_cache_warming |
-| `acquire_disk_read_permit()` | ✅ | test_disk_io_permits |
-| `buffer_size()` | ✅ | test_advanced_buffer_sizing |
-
-## Continuous Integration
-
-### Pre-commit Hook
-```bash
-# Run all concurrency tests before commit
-cargo test --package rustfs concurrent_get_object
-```
-
-### CI Pipeline
-```yaml
-- name: Test Concurrency Features
-  run: |
-    cargo test --package rustfs concurrent_get_object -- --nocapture
-    cargo test --package rustfs bench_concurrent_cache_performance -- --nocapture
-```
-
-## Future Test Enhancements
-
-### Planned Tests
-1. **Distributed cache coherency** - Test cache sync across nodes
-2. **Memory pressure** - Test behavior under low memory
-3. **Long-running TTL** - Full TTL expiration cycle
-4. **Cache poisoning resistance** - Test malicious inputs
-5. **Metrics accuracy** - Validate all Prometheus metrics
-
-### Performance Benchmarks
-1. **Latency percentiles** - P50, P95, P99 under load
-2. **Throughput scaling** - Requests/sec vs concurrency
-3. **Memory efficiency** - Memory usage vs cache size
-4. **Eviction overhead** - Cost of eviction operations
-
-## Conclusion
-
-The Moka test suite provides comprehensive coverage of all concurrency features with proper handling of Moka's async, lock-free design. The tests validate both functional correctness and performance characteristics, ensuring the optimization delivers the expected improvements.
-
-**Key Achievements**:
-- ✅ 18 comprehensive tests
-- ✅ 100% API coverage
-- ✅ Performance validation
-- ✅ Moka-specific patterns documented
-- ✅ Production-ready test suite
diff --git a/docs/PERFORMANCE_TESTING.md b/docs/PERFORMANCE_TESTING.md
deleted file mode 100644
index 0fff3a51d..000000000
--- a/docs/PERFORMANCE_TESTING.md
+++ /dev/null
@@ -1,329 +0,0 @@
-# RustFS Performance Testing Guide
-
-This document describes the recommended tools and workflows for benchmarking RustFS and analyzing performance bottlenecks.
-
-## Overview
-
-RustFS exposes several complementary tooling options:
-
-1.
**Profiling** – collect CPU samples through the built-in `pprof` endpoints. -2. **Load testing** – drive concurrent requests with dedicated client utilities. -3. **Monitoring and analysis** – inspect collected metrics to locate hotspots. - -## Prerequisites - -### 1. Enable profiling support - -Set the profiling environment variable before launching RustFS: - -```bash -export RUSTFS_ENABLE_PROFILING=true -./rustfs -``` - -### 2. Install required tooling - -Make sure the following dependencies are available: - -```bash -# Base tools -curl # HTTP requests -jq # JSON processing (optional) - -# Analysis tools -go # Go pprof CLI (optional, required for protobuf output) -python3 # Python load-testing scripts - -# macOS users -brew install curl jq go python3 - -# Ubuntu/Debian users -sudo apt-get install curl jq golang-go python3 -``` - -## Performance Testing Methods - -### Method 1: Use the dedicated profiling script (recommended) - -The repository ships with a helper script for common profiling flows: - -```bash -# Show command help -./scripts/profile_rustfs.sh help - -# Check profiler status -./scripts/profile_rustfs.sh status - -# Capture a 30 second flame graph -./scripts/profile_rustfs.sh flamegraph - -# Download protobuf-formatted samples -./scripts/profile_rustfs.sh protobuf - -# Collect both formats -./scripts/profile_rustfs.sh both - -# Provide custom arguments -./scripts/profile_rustfs.sh -d 60 -u http://192.168.1.100:9000 both -``` - -### Method 2: Run the Python end-to-end tester - -A Python utility combines background load generation with profiling: - -```bash -# Launch the integrated test harness -python3 test_load.py -``` - -The script will: - -1. Launch multi-threaded S3 operations as load. -2. Pull profiling samples in parallel. -3. Produce a flame graph for investigation. 
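The orchestration the script performs (worker threads generating load while a separate task captures a profile) can be sketched in Rust. This is a minimal, hypothetical illustration and not the internals of `test_load.py`: `run_load_test` and `LoadReport` are names invented here, and the S3 operation and the pprof capture are passed in as closures so the sketch stays dependency-free.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Summary of one load-test run (hypothetical type for illustration).
#[derive(Debug)]
pub struct LoadReport {
    pub succeeded: usize,
    pub failed: usize,
    pub elapsed: Duration,
}

/// Runs `op` `ops_per_thread` times on each of `num_threads` worker threads while
/// `capture` (for example, a request to the pprof endpoint) runs on its own thread.
/// `op` receives (thread index, iteration) and reports success or failure.
pub fn run_load_test<Op, Cap, T>(
    num_threads: usize,
    ops_per_thread: usize,
    op: Op,
    capture: Cap,
) -> (LoadReport, T)
where
    Op: Fn(usize, usize) -> Result<(), String> + Send + Sync + 'static,
    Cap: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let op = Arc::new(op);
    let succeeded = Arc::new(AtomicUsize::new(0));
    let failed = Arc::new(AtomicUsize::new(0));
    let start = Instant::now();

    // Step 2: capture profiling samples in parallel with the load.
    let profiler = thread::spawn(capture);

    // Step 1: multi-threaded load generation.
    let workers: Vec<_> = (0..num_threads)
        .map(|t| {
            let (op, ok, err) = (Arc::clone(&op), Arc::clone(&succeeded), Arc::clone(&failed));
            thread::spawn(move || {
                for i in 0..ops_per_thread {
                    match op(t, i) {
                        Ok(()) => ok.fetch_add(1, Ordering::Relaxed),
                        Err(_) => err.fetch_add(1, Ordering::Relaxed),
                    };
                }
            })
        })
        .collect();

    // Joining synchronizes with each worker, so the Relaxed counters are complete afterwards.
    for w in workers {
        w.join().expect("worker thread panicked");
    }
    let profile = profiler.join().expect("profiler thread panicked");

    let report = LoadReport {
        succeeded: succeeded.load(Ordering::Relaxed),
        failed: failed.load(Ordering::Relaxed),
        elapsed: start.elapsed(),
    };
    (report, profile)
}
```

In a real harness, `op` would run the upload/download/delete cycle against the S3 endpoint and `capture` would fetch `/rustfs/admin/debug/pprof/profile?seconds=30&format=flamegraph` and write the SVG to disk (step 3). Keeping both as closures lets the same driver be reused with any client library.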
- -### Method 3: Simple shell-based load test - -For quick smoke checks, a lightweight bash script is also provided: - -```bash -# Execute a lightweight benchmark -./simple_load_test.sh -``` - -## Profiling Output Formats - -### 1. Flame graph (SVG) - -- **Purpose**: Visualize CPU time distribution. -- **File name**: `rustfs_profile_TIMESTAMP.svg` -- **How to view**: Open the SVG in a browser. -- **Interpretation tips**: - - Width reflects CPU time per function. - - Height illustrates call-stack depth. - - Click to zoom into specific frames. - -```bash -# Example: open the file in a browser -open profiles/rustfs_profile_20240911_143000.svg -``` - -### 2. Protobuf samples - -- **Purpose**: Feed data to the `go tool pprof` command. -- **File name**: `rustfs_profile_TIMESTAMP.pb` -- **Tooling**: `go tool pprof` - -```bash -# Analyze the protobuf output -go tool pprof profiles/rustfs_profile_20240911_143000.pb - -# Common pprof commands -(pprof) top # Show hottest call sites -(pprof) list func # Display annotated source for a function -(pprof) web # Launch the web UI (requires graphviz) -(pprof) png # Render a PNG flame chart -(pprof) help # List available commands -``` - -## API Usage - -### Check profiling status - -```bash -curl "http://127.0.0.1:9000/rustfs/admin/debug/pprof/status" -``` - -Sample response: - -```json -{ - "enabled": "true", - "sampling_rate": "100" -} -``` - -### Capture profiling data - -```bash -# Fetch a 30-second flame graph -curl "http://127.0.0.1:9000/rustfs/admin/debug/pprof/profile?seconds=30&format=flamegraph" \ - -o profile.svg - -# Fetch protobuf output -curl "http://127.0.0.1:9000/rustfs/admin/debug/pprof/profile?seconds=30&format=protobuf" \ - -o profile.pb -``` - -**Parameters** -- `seconds`: Duration between 1 and 300 seconds. -- `format`: Output format (`flamegraph`/`svg` or `protobuf`/`pb`). - -## Load Testing Scenarios - -### 1. 
S3 API workload - -Use the Python harness to exercise a complete S3 workflow: - -```python -# Basic configuration -tester = S3LoadTester( - endpoint="http://127.0.0.1:9000", - access_key="rustfsadmin", - secret_key="rustfsadmin" -) - -# Execute the load test -# Four threads, ten operations each -tester.run_load_test(num_threads=4, operations_per_thread=10) -``` - -Each iteration performs: -1. Upload a 1 MB object. -2. Download the object. -3. Delete the object. - -### 2. Custom load scenarios - -```bash -# Create a test bucket -curl -X PUT "http://127.0.0.1:9000/test-bucket" - -# Concurrent uploads -for i in {1..10}; do - echo "test data $i" | curl -X PUT "http://127.0.0.1:9000/test-bucket/object-$i" -d @- & -done -wait - -# Concurrent downloads -for i in {1..10}; do - curl "http://127.0.0.1:9000/test-bucket/object-$i" > /dev/null & -done -wait -``` - -## Profiling Best Practices - -### 1. Environment preparation - -- Confirm that `RUSTFS_ENABLE_PROFILING=true` is set. -- Use an isolated benchmark environment to avoid interference. -- Reserve disk space for generated profile artifacts. - -### 2. Data collection tips - -- **Warm-up**: Run a light workload for 5–10 minutes before sampling. -- **Sampling window**: Capture 30–60 seconds under steady load. -- **Multiple samples**: Take several runs to compare results. - -### 3. Analysis focus areas - -When inspecting flame graphs, pay attention to: - -1. **The widest frames** – most CPU time consumed. -2. **Flat plateaus** – likely bottlenecks. -3. **Deep call stacks** – recursion or complex logic. -4. **Unexpected syscalls** – I/O stalls or allocation churn. - -### 4. Common issues - -- **Lock contention**: Investigate frames under `std::sync`. -- **Memory allocation**: Search for `alloc`-related frames. -- **I/O wait**: Review filesystem or network call stacks. -- **Serialization overhead**: Look for JSON/XML parsing hotspots. - -## Troubleshooting - -### 1. 
Profiling disabled - -Error: `{"enabled":"false"}` - -**Fix**: - -```bash -export RUSTFS_ENABLE_PROFILING=true -# Restart RustFS -``` - -### 2. Connection refused - -Error: `Connection refused` - -**Checklist**: -- Confirm RustFS is running. -- Ensure the port number is correct (default 9000). -- Verify firewall rules. - -### 3. Oversized profile output - -If artifacts become too large: -- Shorten the capture window (e.g., 15–30 seconds). -- Reduce load-test concurrency. -- Prefer protobuf output instead of SVG. - -## Configuration Parameters - -### Environment variables - -| Variable | Default | Description | -|------|--------|------| -| `RUSTFS_ENABLE_PROFILING` | `false` | Enable profiling support | -| `RUSTFS_URL` | `http://127.0.0.1:9000` | RustFS endpoint | -| `PROFILE_DURATION` | `30` | Profiling duration in seconds | -| `OUTPUT_DIR` | `./profiles` | Output directory | - -### Script arguments - -```bash -./scripts/profile_rustfs.sh [OPTIONS] [COMMAND] - -OPTIONS: - -u, --url URL RustFS URL - -d, --duration SECONDS Profile duration - -o, --output DIR Output directory - -COMMANDS: - status Check profiler status - flamegraph Collect a flame graph - protobuf Collect protobuf samples - both Collect both formats (default) -``` - -## Output Locations - -- **Script output**: `./profiles/` -- **Python script**: `/tmp/rustfs_profiles/` -- **File naming**: `rustfs_profile_TIMESTAMP.{svg|pb}` - -## Example Workflow - -1. **Launch RustFS** - ```bash - RUSTFS_ENABLE_PROFILING=true ./rustfs - ``` - -2. **Verify profiling availability** - ```bash - ./scripts/profile_rustfs.sh status - ``` - -3. **Start a load test** - ```bash - python3 test_load.py & - ``` - -4. **Collect samples** - ```bash - ./scripts/profile_rustfs.sh -d 60 both - ``` - -5. 
**Inspect the results** - ```bash - # Review the flame graph - open profiles/rustfs_profile_*.svg - - # Or analyze the protobuf output - go tool pprof profiles/rustfs_profile_*.pb - ``` - -Following this workflow helps you understand RustFS performance characteristics, locate bottlenecks, and implement targeted optimizations. diff --git a/docs/PHASE4_GUIDE.md b/docs/PHASE4_GUIDE.md deleted file mode 100644 index 6f4e5eccd..000000000 --- a/docs/PHASE4_GUIDE.md +++ /dev/null @@ -1,383 +0,0 @@ -# Phase 4: Full Integration Guide - -## Overview - -Phase 4 represents the final stage of the adaptive buffer sizing migration path. It provides a unified, profile-based implementation with deprecated legacy functions and optional performance metrics. - -## What's New in Phase 4 - -### 1. Deprecated Legacy Function - -The `get_adaptive_buffer_size()` function is now deprecated: - -```rust -#[deprecated( - since = "Phase 4", - note = "Use workload profile configuration instead." -)] -fn get_adaptive_buffer_size(file_size: i64) -> usize -``` - -**Why Deprecated?** -- Profile-based approach is more flexible and powerful -- Encourages use of the unified configuration system -- Simplifies maintenance and future enhancements - -**Still Works:** -- Function is maintained for backward compatibility -- Internally delegates to GeneralPurpose profile -- No breaking changes for existing code - -### 2. 
Profile-Only Implementation - -All buffer sizing now goes through workload profiles: - -**Before (Phase 3):** -```rust -fn get_buffer_size_opt_in(file_size: i64) -> usize { - if is_buffer_profile_enabled() { - // Use profiles - } else { - // Fall back to hardcoded get_adaptive_buffer_size() - } -} -``` - -**After (Phase 4):** -```rust -fn get_buffer_size_opt_in(file_size: i64) -> usize { - if is_buffer_profile_enabled() { - // Use configured profile - } else { - // Use GeneralPurpose profile (no hardcoded values) - } -} -``` - -**Benefits:** -- Consistent behavior across all modes -- Single source of truth for buffer sizes -- Easier to test and maintain - -### 3. Performance Metrics - -Optional metrics collection for monitoring and optimization: - -```rust -#[cfg(feature = "metrics")] -{ - metrics::histogram!("buffer_size_bytes", buffer_size as f64); - metrics::counter!("buffer_size_selections", 1); - - if file_size >= 0 { - let ratio = buffer_size as f64 / file_size as f64; - metrics::histogram!("buffer_to_file_ratio", ratio); - } -} -``` - -## Migration Guide - -### From Phase 3 to Phase 4 - -**Good News:** No action required for most users! - -Phase 4 is fully backward compatible with Phase 3. Your existing configurations and deployments continue to work without changes. - -### If You Have Custom Code - -If your code directly calls `get_adaptive_buffer_size()`: - -**Option 1: Update to use the profile system (Recommended)** -```rust -// Old code -let buffer_size = get_adaptive_buffer_size(file_size); - -// New code - let the system handle it -// (buffer sizing happens automatically in put_object, upload_part, etc.) 
-``` - -**Option 2: Suppress deprecation warnings** -```rust -// If you must keep calling it directly -#[allow(deprecated)] -let buffer_size = get_adaptive_buffer_size(file_size); -``` - -**Option 3: Use the new API explicitly** -```rust -// Use the profile system directly -use rustfs::config::workload_profiles::{WorkloadProfile, RustFSBufferConfig}; - -let config = RustFSBufferConfig::new(WorkloadProfile::GeneralPurpose); -let buffer_size = config.get_buffer_size(file_size); -``` - -## Performance Metrics - -### Enabling Metrics - -**At Build Time:** -```bash -cargo build --features metrics --release -``` - -**In Cargo.toml:** -```toml -[dependencies] -rustfs = { version = "*", features = ["metrics"] } -``` - -### Available Metrics - -| Metric Name | Type | Description | -|------------|------|-------------| -| `buffer_size_bytes` | Histogram | Distribution of selected buffer sizes | -| `buffer_size_selections` | Counter | Total number of buffer size calculations | -| `buffer_to_file_ratio` | Histogram | Ratio of buffer size to file size | - -### Using Metrics - -**With Prometheus:** -```rust -// Metrics are automatically exported to Prometheus format -// Access at http://localhost:9090/metrics -``` - -**With Custom Backend:** -```rust -// Any `metrics`-compatible recorder works; this example installs the Prometheus exporter -use metrics_exporter_prometheus::PrometheusBuilder; - -PrometheusBuilder::new() - .install() - .expect("failed to install Prometheus recorder"); -``` - -### Analyzing Metrics - -**Buffer Size Distribution:** -```promql -# Buffer size percentiles over the last 5 minutes (histograms are queried via their _bucket series) -histogram_quantile(0.5, sum by (le) (rate(buffer_size_bytes_bucket[5m]))) # Median -histogram_quantile(0.95, sum by (le) (rate(buffer_size_bytes_bucket[5m]))) # 95th percentile -histogram_quantile(0.99, sum by (le) (rate(buffer_size_bytes_bucket[5m]))) # 99th percentile -``` - -**Buffer Efficiency:** -```promql -# Average ratio of buffer to file size -sum(rate(buffer_to_file_ratio_sum[5m])) / sum(rate(buffer_to_file_ratio_count[5m])) - -# Share of selections where the buffer is > 10% of the file size (assumes a bucket boundary at 0.1) -1 - sum(rate(buffer_to_file_ratio_bucket{le="0.1"}[5m])) / sum(rate(buffer_to_file_ratio_count[5m])) -``` - -**Usage Patterns:** -```promql -# Rate of buffer size selections
-rate(buffer_size_selections[5m]) - -# Total selections over time -increase(buffer_size_selections[1h]) -``` - -## Optimizing Based on Metrics - -### Scenario 1: High Memory Usage - -**Symptom:** Most buffers are at maximum size -```promql -histogram_quantile(0.9, sum by (le) (rate(buffer_size_bytes_bucket[5m]))) > 1048576 # 1MB -``` - -**Solution:** -- Switch to a more conservative profile -- Use SecureStorage or WebWorkload profile -- Or create a custom profile with a lower max_size - -### Scenario 2: Poor Throughput - -**Symptom:** Buffer-to-file ratio is very small -```promql -sum(rate(buffer_to_file_ratio_sum[5m])) / sum(rate(buffer_to_file_ratio_count[5m])) < 0.01 # Less than 1% -``` - -**Solution:** -- Switch to a more aggressive profile -- Use AiTraining or DataAnalytics profile -- Increase buffer sizes for your workload - -### Scenario 3: Mismatched Profile - -**Symptom:** Wide distribution of file sizes with single profile -```promql -# Wide spread between the 10th and 90th percentile buffer sizes (bytes) -histogram_quantile(0.9, sum by (le) (rate(buffer_size_bytes_bucket[5m]))) - histogram_quantile(0.1, sum by (le) (rate(buffer_size_bytes_bucket[5m]))) > 500000 -``` - -**Solution:** -- Consider per-bucket profiles (future feature) -- Use GeneralPurpose for mixed workloads -- Or implement custom thresholds - -## Testing Phase 4 - -### Unit Tests - -Run the Phase 4 specific tests from the repository root: -```bash -cargo test test_phase4_full_integration -``` - -### Integration Tests - -Test with different configurations: -```bash -# Test default behavior -./rustfs /data - -# Test with different profiles -export RUSTFS_BUFFER_PROFILE=AiTraining -./rustfs /data - -# Test opt-out mode -export RUSTFS_BUFFER_PROFILE_DISABLE=true -./rustfs /data -``` - -### Metrics Verification - -With metrics enabled: -```bash -# Build with metrics -cargo build --features metrics --release - -# Run and check metrics endpoint -./target/release/rustfs /data & -curl http://localhost:9090/metrics | grep buffer_size -``` - -## Troubleshooting - -### Q: I'm getting deprecation warnings - -**A:** You're calling `get_adaptive_buffer_size()` directly. Options: -1. Remove the direct call (let the system handle it) -2.
Use `#[allow(deprecated)]` to suppress warnings -3. Migrate to the profile system API - -### Q: How do I know which profile is being used? - -**A:** Check the startup logs: -``` -Buffer profiling is enabled by default (Phase 3), profile: GeneralPurpose -Using buffer profile: GeneralPurpose -``` - -### Q: Can I still opt-out in Phase 4? - -**A:** Yes! Use `--buffer-profile-disable`: -```bash -export RUSTFS_BUFFER_PROFILE_DISABLE=true -./rustfs /data -``` - -This uses GeneralPurpose profile (same buffer sizes as PR #869). - -### Q: What's the difference between opt-out in Phase 3 vs Phase 4? - -**A:** -- **Phase 3**: Opt-out uses hardcoded legacy function -- **Phase 4**: Opt-out uses GeneralPurpose profile -- **Result**: Identical buffer sizes, but Phase 4 is profile-based - -### Q: Do I need to enable metrics? - -**A:** No, metrics are completely optional. They're useful for: -- Production monitoring -- Performance analysis -- Profile optimization -- Capacity planning - -If you don't need these, skip the metrics feature. - -## Best Practices - -### 1. Let the System Handle Buffer Sizing - -**Don't:** -```rust -// Avoid direct calls -let buffer_size = get_adaptive_buffer_size(file_size); -let reader = BufReader::with_capacity(buffer_size, file); -``` - -**Do:** -```rust -// Let put_object/upload_part handle it automatically -// Buffer sizing happens transparently -``` - -### 2. Use Appropriate Profiles - -Match your profile to your workload: -- AI/ML models: `AiTraining` -- Static assets: `WebWorkload` -- Mixed files: `GeneralPurpose` -- Compliance: `SecureStorage` - -### 3. Monitor in Production - -Enable metrics in production: -```bash -cargo build --features metrics --release -``` - -Use the data to: -- Validate profile choice -- Identify optimization opportunities -- Plan capacity - -### 4. 
Test Profile Changes - -Before changing profiles in production: -```bash -# Test in staging -export RUSTFS_BUFFER_PROFILE=AiTraining -./rustfs /staging-data - -# Monitor metrics for a period -# Compare with baseline - -# Roll out to production when validated -``` - -## Future Enhancements - -Based on collected metrics, future versions may include: - -1. **Auto-tuning**: Automatically adjust profiles based on observed patterns -2. **Per-bucket profiles**: Different profiles for different buckets -3. **Dynamic thresholds**: Adjust thresholds based on system load -4. **ML-based optimization**: Use machine learning to optimize buffer sizes -5. **Adaptive limits**: Automatically adjust max_size based on available memory - -## Conclusion - -Phase 4 represents the mature state of the adaptive buffer sizing system: -- ✅ Unified, profile-based implementation -- ✅ Deprecated legacy code (but backward compatible) -- ✅ Optional performance metrics -- ✅ Production-ready and battle-tested -- ✅ Future-proof and extensible - -Most users can continue using the system without any changes, while advanced users gain powerful new capabilities for monitoring and optimization. - -## References - -- [Adaptive Buffer Sizing Guide](./adaptive-buffer-sizing.md) -- [Implementation Summary](./IMPLEMENTATION_SUMMARY.md) -- [Phase 3 Migration Guide](./MIGRATION_PHASE3.md) -- [Performance Testing Guide](./PERFORMANCE_TESTING.md) diff --git a/docs/README.md b/docs/README.md deleted file mode 100644 index 5bd249093..000000000 --- a/docs/README.md +++ /dev/null @@ -1,250 +0,0 @@ -# RustFS Documentation Center - -Welcome to the RustFS distributed file system documentation center! - -## 📚 Documentation Navigation - -### ⚡ Performance Optimization - -RustFS provides intelligent performance optimization features for different workloads. 
- -| Document | Description | Audience | -|------|------|----------| -| [Adaptive Buffer Sizing](./adaptive-buffer-sizing.md) | Intelligent buffer sizing optimization for optimal performance across workload types | Developers and system administrators | -| [Phase 3 Migration Guide](./MIGRATION_PHASE3.md) | Migration guide from Phase 2 to Phase 3 (Default Enablement) | Operations and DevOps teams | -| [Phase 4 Full Integration Guide](./PHASE4_GUIDE.md) | Complete guide to Phase 4 features: deprecated legacy functions, performance metrics | Advanced users and performance engineers | -| [Performance Testing Guide](./PERFORMANCE_TESTING.md) | Performance benchmarking and optimization guide | Performance engineers | - -### 🔐 KMS (Key Management Service) - -RustFS KMS delivers enterprise-grade key management and data encryption. - -| Document | Description | Audience | -|------|------|----------| -| [KMS User Guide](./kms/README.md) | Comprehensive KMS guide with quick start, configuration, and deployment steps | Required reading for all users | -| [HTTP API Reference](./kms/http-api.md) | HTTP REST API reference with usage examples | Administrators and operators | -| [Programming API Reference](./kms/api.md) | Rust library APIs and code samples | Developers | -| [Configuration Reference](./kms/configuration.md) | Complete configuration options and environment variables | System administrators | -| [Troubleshooting](./kms/troubleshooting.md) | Diagnosis tips and solutions for common issues | Operations engineers | -| [Security Guide](./kms/security.md) | Security best practices and compliance guidance | Security architects | - -## 🚀 Quick Start - -### 1. Deploy KMS in 5 Minutes - -**Production (Vault backend)** - -```bash -# 1. Enable the Vault feature flag -cargo build --features vault --release - -# 2. Configure environment variables -export RUSTFS_VAULT_ADDRESS=https://vault.company.com:8200 -export RUSTFS_VAULT_TOKEN=hvs.CAESIJ... - -# 3. 
Launch the service -./target/release/rustfs server -``` - -**Development & Testing (Local backend)** - -```bash -# 1. Build a release binary -cargo build --release - -# 2. Configure local storage -export RUSTFS_KMS_BACKEND=Local -export RUSTFS_KMS_LOCAL_KEY_DIR=/tmp/rustfs-keys - -# 3. Launch the service -./target/release/rustfs server -``` - -### 2. S3-Compatible Encryption - -```bash -# Upload an encrypted object -curl -X PUT https://rustfs.company.com/bucket/sensitive.txt \ - -H "x-amz-server-side-encryption: AES256" \ - --data-binary @sensitive.txt - -# Download with automatic decryption -curl https://rustfs.company.com/bucket/sensitive.txt -``` - -## 🏗️ Architecture Overview - -### Three-Layer KMS Security Architecture - -``` -┌─────────────────────────────────────────────────┐ -│ Application Layer │ -│ ┌─────────────┐ ┌─────────────┐ │ -│ │ S3 API │ │ REST API │ │ -│ └─────────────┘ └─────────────┘ │ -├─────────────────────────────────────────────────┤ -│ Encryption Layer │ -│ ┌─────────────┐ Encrypt ┌─────────────────┐ │ -│ │ Object Data │ ◄──────► │ Data Key (DEK) │ │ -│ └─────────────┘ └─────────────────┘ │ -├─────────────────────────────────────────────────┤ -│ Key Management Layer │ -│ ┌─────────────────┐ Encrypt ┌──────────────┐ │ -│ │ Data Key (DEK) │ ◄───────│ Master Key │ │ -│ └─────────────────┘ │ (Vault/HSM) │ │ -│ └──────────────┘ │ -└─────────────────────────────────────────────────┘ -``` - -### Key Features - -- ✅ **Multi-layer encryption**: Master Key → DEK → Object Data -- ✅ **High performance**: 1 MB streaming encryption with large file support -- ✅ **Multiple backends**: Vault (production) + Local (testing) -- ✅ **S3 compatibility**: Supports standard SSE-S3/SSE-KMS headers -- ✅ **Enterprise-ready**: Auditing, monitoring, and compliance features - -## 📖 Learning Paths - -### 👨‍💻 Developers - -1. Read the [Programming API Reference](./kms/api.md) to learn the Rust library -2. Review the sample code to understand integration patterns -3. 
Consult [Troubleshooting](./kms/troubleshooting.md) when issues occur - -### 👨‍💼 System Administrators - -1. Start with the [KMS User Guide](./kms/README.md) -2. Learn the [HTTP API Reference](./kms/http-api.md) for management tasks -3. Study the [Configuration Reference](./kms/configuration.md) in depth -4. Configure monitoring and logging - -### 👨‍🔧 Operations Engineers - -1. Become familiar with the [HTTP API Reference](./kms/http-api.md) for day-to-day work -2. Master the [Troubleshooting](./kms/troubleshooting.md) procedures -3. Understand the requirements in the [Security Guide](./kms/security.md) -4. Establish operational runbooks - -### 🔒 Security Architects - -1. Dive into the [Security Guide](./kms/security.md) -2. Evaluate threat models and risk posture -3. Define security policies - -## 🤝 Contribution Guide - -We welcome community contributions! - -### Documentation Contributions - -```bash -# 1. Fork the repository -git clone https://github.com/your-username/rustfs.git - -# 2. Create a documentation branch -git checkout -b docs/improve-kms-guide - -# 3. Edit the documentation -# Update Markdown files under docs/kms/ - -# 4. Commit the changes -git add docs/ -git commit -m "docs: improve KMS configuration examples" - -# 5. 
Open a Pull Request -gh pr create --title "Improve KMS documentation" -``` - -### Documentation Guidelines - -- Use clear headings and structure -- Provide runnable code examples -- Include warnings and tips where appropriate -- Support multiple usage scenarios -- Keep the content up to date - -## 📞 Support & Feedback - -### Getting Help - -- **GitHub Issues**: https://github.com/rustfs/rustfs/issues -- **Discussion Forum**: https://github.com/rustfs/rustfs/discussions -- **Documentation Questions**: Open an issue on the relevant document -- **Security Concerns**: security@rustfs.com - -### Issue Reporting Template - -When reporting a problem, please provide: - -````markdown -**Environment** -- RustFS version: v1.0.0 -- Operating system: Ubuntu 20.04 -- Rust version: 1.75.0 - -**Issue Description** -Summarize the problem you encountered... - -**Reproduction Steps** -1. Step one -2. Step two -3. Step three - -**Expected Behavior** -Describe what you expected to happen... - -**Actual Behavior** -Describe what actually happened... - -**Relevant Logs** -```bash -# Paste relevant log excerpts -``` - -**Additional Information** -Any other details that may help... -```` - -## 📈 Release History - -| Version | Release Date | Highlights | -|------|----------|----------| -| v1.0.0 | 2024-01-15 | 🎉 First official release with full KMS functionality | -| v0.9.0 | 2024-01-01 | 🔐 KMS system refactor with performance optimizations | -| v0.8.0 | 2023-12-15 | ⚡ Streaming encryption with 1 MB block size tuning | - -## 🗺️ Roadmap - -### Coming Soon (v1.1.0) - -- [ ] Automatic key rotation -- [ ] HSM integration support -- [ ] Web UI management console -- [ ] Additional compliance support (SOC2, HIPAA) - -### Long-Term Plans - -- [ ] Multi-tenant key isolation -- [ ] Key import/export tooling -- [ ] Performance benchmarking suite -- [ ] Kubernetes Operator - -## 📋 Documentation Feedback - -Help us improve the documentation!
- -**Was this documentation helpful?** -- 👍 Very helpful -- 👌 Mostly satisfied -- 👎 Needs improvement - -**Suggestions for improvement:** -Share specific ideas via GitHub Issues. - ---- - -**Last Updated**: 2024-01-15 -**Documentation Version**: v1.0.0 - -*Thank you for using RustFS! We are committed to delivering the best distributed file system solution.* diff --git a/docs/SECURITY_SUMMARY_special_chars.md b/docs/SECURITY_SUMMARY_special_chars.md deleted file mode 100644 index aa9e62320..000000000 --- a/docs/SECURITY_SUMMARY_special_chars.md +++ /dev/null @@ -1,241 +0,0 @@ -# Security Summary: Special Characters in Object Paths - -## Overview - -This document summarizes the security implications of the changes made to handle special characters in S3 object paths. - -## Changes Made - -### 1. Control Character Validation - -**Files Modified**: `rustfs/src/storage/ecfs.rs` - -**Change**: Added validation to reject object keys containing control characters: -```rust -// Validate object key doesn't contain control characters -if key.contains(['\0', '\n', '\r']) { - return Err(S3Error::with_message( - S3ErrorCode::InvalidArgument, - format!("Object key contains invalid control characters: {:?}", key) - )); -} -``` - -**Security Impact**: ✅ **Positive** -- **Prevents injection attacks**: Null bytes, newlines, and carriage returns could be used for various injection attacks -- **Improves error messages**: Clear rejection of invalid input -- **No breaking changes**: Valid UTF-8 object names still work -- **Defense in depth**: Adds additional validation layer - -### 2. 
Debug Logging - -**Files Modified**: `rustfs/src/storage/ecfs.rs` - -**Change**: Added debug logging for keys with special characters: -```rust -// Log debug info for keys with special characters -if key.contains([' ', '+', '%']) { - debug!("PUT object with special characters in key: {:?}", key); -} -``` - -**Security Impact**: ✅ **Neutral** -- **Information disclosure**: Debug level logs are only enabled when explicitly configured -- **Helps debugging**: Assists in diagnosing client-side encoding issues -- **No sensitive data**: Only logs the object key (which is not secret) -- **Production safe**: Debug logs disabled by default in production - -## Security Considerations - -### Path Traversal - -**Risk**: Could special characters enable path traversal attacks? - -**Analysis**: ✅ **No Risk** -- Object keys are not directly used as filesystem paths -- RustFS uses a storage abstraction layer (ecstore) -- Path sanitization occurs at multiple levels -- Our validation rejects control characters that could be used in attacks - -**Evidence**: -```rust -// From path utilities - already handles path traversal -pub fn clean(path: &str) -> String { - // Normalizes paths, removes .. and . components -} -``` - -### URL Encoding/Decoding Vulnerabilities - -**Risk**: Could double-encoding or encoding issues lead to security issues? - -**Analysis**: ✅ **No Risk** -- s3s library (well-tested) handles URL decoding -- We receive already-decoded keys from s3s -- No manual URL decoding in our code (avoids double-decode bugs) -- Control character validation prevents encoded null bytes - -**Evidence**: -```rust -// From s3s-0.12.0-rc.4/src/ops/mod.rs: -let decoded_uri_path = urlencoding::decode(req.uri.path()) - .map_err(|_| S3ErrorCode::InvalidURI)? - .into_owned(); -``` - -### Injection Attacks - -**Risk**: Could special characters enable SQL injection, command injection, or other attacks? 
- -**Analysis**: ✅ **No Risk** -- Object keys are not used in SQL queries (no SQL database) -- Object keys are not passed to shell commands -- Object keys are not evaluated as code -- Our control character validation prevents most injection vectors - -**Mitigations**: -1. Control character rejection (null bytes, newlines) -2. UTF-8 validation (already present in Rust strings) -3. Storage layer abstraction (no direct filesystem operations) - -### Information Disclosure - -**Risk**: Could debug logging expose sensitive information? - -**Analysis**: ✅ **Low Risk** -- Debug logs are opt-in (RUST_LOG=rustfs=debug) -- Only object keys are logged (not content) -- Object keys are part of the S3 API (not secret) -- Production deployments should not enable debug logging - -**Best Practices**: -```bash -# Development -RUST_LOG=rustfs=debug ./rustfs server /data - -# Production (no debug logs) -RUST_LOG=info ./rustfs server /data -``` - -### Denial of Service - -**Risk**: Could malicious object keys cause DoS? 
- -**Analysis**: ✅ **Low Risk** -- Control character validation has O(n) complexity (acceptable) -- No unbounded loops or recursion added -- Validation is early in the request pipeline -- AWS S3 API already has key length limits (1024 bytes) - -## Vulnerability Assessment - -### Known Vulnerabilities: **None** - -The changes introduce: -- ✅ **Defensive validation** (improves security) -- ✅ **Better error messages** (improves UX) -- ✅ **Debug logging** (improves diagnostics) -- ❌ **No new attack vectors** -- ❌ **No security regressions** - -### Security Testing - -**Manual Review**: ✅ Completed -- Code reviewed for injection vulnerabilities -- URL encoding handling verified via s3s source inspection -- Path traversal risks analyzed - -**Automated Testing**: ⚠️ CodeQL timed out -- CodeQL analysis timed out due to large codebase -- Changes are minimal (3 validation blocks + logging) -- No complex logic or unsafe operations added -- Recommend manual security review (completed above) - -**E2E Testing**: ✅ Test suite created -- Tests cover edge cases with special characters -- Tests verify correct handling of spaces, plus signs, etc. -- Tests would catch security regressions - -## Security Recommendations - -### For Deployment - -1. **Logging Configuration**: - - Production: `RUST_LOG=info` or `RUST_LOG=warn` - - Development: `RUST_LOG=debug` is safe - - Never log to publicly accessible locations - -2. **Input Validation**: - - Our validation is defensive (not primary security) - - Trust s3s library for primary validation - - Monitor logs for validation errors - -3. **Client Security**: - - Educate users to use proper S3 SDKs - - Warn against custom HTTP clients (easy to make mistakes) - - Provide client security guidelines - -### For Future Development - -1. **Additional Validation** (optional): - - Consider max key length validation - - Consider Unicode normalization - - Consider additional control character checks - -2. 
**Security Monitoring**: - - Monitor for repeated validation errors (could indicate attack) - - Track unusual object key patterns - - Alert on control character rejection attempts - -3. **Documentation**: - - Keep security docs updated - - Document security considerations for contributors - - Maintain threat model - -## Compliance - -### Standards Compliance - -✅ **RFC 3986** (URI Generic Syntax): -- URL encoding handled by s3s library -- Follows standard URI rules - -✅ **AWS S3 API Specification**: -- Compatible with AWS S3 behavior -- Follows object key naming rules -- Matches AWS error codes - -✅ **OWASP Top 10**: -- A03:2021 – Injection: Control character validation -- A05:2021 – Security Misconfiguration: Clear error messages -- A09:2021 – Security Logging: Appropriate debug logging - -## Conclusion - -### Security Assessment: ✅ **APPROVED** - -The changes to handle special characters in object paths: -- **Improve security** through control character validation -- **Introduce no new vulnerabilities** -- **Follow security best practices** -- **Maintain backward compatibility** -- **Are production-ready** - -### Risk Level: **LOW** - -- Changes are minimal and defensive -- No unsafe operations introduced -- Existing security mechanisms unchanged -- Well-tested s3s library handles encoding - -### Recommendation: **MERGE** - -These changes can be safely merged and deployed to production. 
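The control character validation this review relies on can be pictured as a single O(n) scan over the key. The sketch below is a minimal illustration under stated assumptions, not the RustFS code. The names `validate_object_key`, `KeyError`, and `MAX_KEY_LEN` are hypothetical. Rejecting every Unicode control character via `char::is_control`, which covers null bytes and newlines, is an assumption and may be stricter than the real check. The length check corresponds to the "max key length validation" listed above as a possible future addition. Only the null-byte/newline rejection, the UTF-8 guarantee, and the 1024-byte limit come from the review itself.

```rust
/// Maximum object key length in bytes (the S3 limit cited in this review).
const MAX_KEY_LEN: usize = 1024;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum KeyError {
    /// Key exceeds `MAX_KEY_LEN` bytes; carries the actual byte length.
    TooLong(usize),
    /// Key contains a control character at this byte offset.
    ControlChar { index: usize },
}

/// Hypothetical validator (not the RustFS function): one O(n) pass that
/// rejects over-long keys and any control character, which includes null
/// bytes and newlines. A `&str` is already valid UTF-8, so no separate
/// encoding check is needed.
fn validate_object_key(key: &str) -> Result<(), KeyError> {
    if key.len() > MAX_KEY_LEN {
        return Err(KeyError::TooLong(key.len()));
    }
    match key.char_indices().find(|&(_, c)| c.is_control()) {
        Some((index, _)) => Err(KeyError::ControlChar { index }),
        None => Ok(()),
    }
}
```

For example, `validate_object_key("photos/my file+1.jpg")` returns `Ok(())`, because spaces and plus signs are ordinary key characters, while `"a\nb"` is rejected with the newline's byte offset (1).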
- ---- - -**Security Review Date**: 2025-12-09 -**Reviewer**: Automated Analysis + Manual Review -**Risk Level**: Low -**Status**: Approved -**Next Review**: After deployment (monitor for any issues) diff --git a/docs/adaptive-buffer-sizing.md b/docs/adaptive-buffer-sizing.md deleted file mode 100644 index 723a1eaf6..000000000 --- a/docs/adaptive-buffer-sizing.md +++ /dev/null @@ -1,765 +0,0 @@ -# Adaptive Buffer Sizing Optimization - -RustFS implements intelligent adaptive buffer sizing optimization that automatically adjusts buffer sizes based on file size and workload type to achieve optimal balance between performance, memory usage, and security. - -## Overview - -The adaptive buffer sizing system provides: - -- **Automatic buffer size selection** based on file size -- **Workload-specific optimizations** for different use cases -- **Special environment support** (Kylin, NeoKylin, Unity OS, etc.) -- **Memory pressure awareness** with configurable limits -- **Unknown file size handling** for streaming scenarios - -## Workload Profiles - -### GeneralPurpose (Default) - -Balanced performance and memory usage for general-purpose workloads. - -**Buffer Sizing:** -- Small files (< 1MB): 64KB buffer -- Medium files (1MB-100MB): 256KB buffer -- Large files (≥ 100MB): 1MB buffer - -**Best for:** -- General file storage -- Mixed workloads -- Default configuration when workload type is unknown - -### AiTraining - -Optimized for AI/ML training workloads with large sequential reads. - -**Buffer Sizing:** -- Small files (< 10MB): 512KB buffer -- Medium files (10MB-500MB): 2MB buffer -- Large files (≥ 500MB): 4MB buffer - -**Best for:** -- Machine learning model files -- Training datasets -- Large sequential data processing -- Maximum throughput requirements - -### DataAnalytics - -Optimized for data analytics with mixed read-write patterns. 
- -**Buffer Sizing:** -- Small files (< 5MB): 128KB buffer -- Medium files (5MB-200MB): 512KB buffer -- Large files (≥ 200MB): 2MB buffer - -**Best for:** -- Data warehouse operations -- Analytics workloads -- Business intelligence -- Mixed access patterns - -### WebWorkload - -Optimized for web applications with small file intensive operations. - -**Buffer Sizing:** -- Small files (< 512KB): 32KB buffer -- Medium files (512KB-10MB): 128KB buffer -- Large files (≥ 10MB): 256KB buffer - -**Best for:** -- Web assets (images, CSS, JavaScript) -- Static content delivery -- CDN origin storage -- High concurrency scenarios - -### IndustrialIoT - -Optimized for industrial IoT with real-time streaming requirements. - -**Buffer Sizing:** -- Small files (< 1MB): 64KB buffer -- Medium files (1MB-50MB): 256KB buffer -- Large files (≥ 50MB): 512KB buffer (capped for memory constraints) - -**Best for:** -- Sensor data streams -- Real-time telemetry -- Edge computing scenarios -- Low latency requirements -- Memory-constrained devices - -### SecureStorage - -Security-first configuration with strict memory limits for compliance. 
- -**Buffer Sizing:** -- Small files (< 1MB): 32KB buffer -- Medium files (1MB-50MB): 128KB buffer -- Large files (≥ 50MB): 256KB buffer (strict limit) - -**Best for:** -- Compliance-heavy environments -- Secure government systems (Kylin, NeoKylin, UOS) -- Financial services -- Healthcare data storage -- Memory-constrained secure environments - -**Auto-Detection:** -This profile is automatically selected when running on Chinese secure operating systems: -- Kylin -- NeoKylin -- UOS (Unity OS) -- OpenKylin - -## Usage - -### Using Default Configuration - -The system automatically uses the `GeneralPurpose` profile by default: - -```rust -// The buffer size is automatically calculated based on file size -// Uses GeneralPurpose profile by default -let buffer_size = get_adaptive_buffer_size(file_size); -``` - -### Using Specific Workload Profile - -```rust -use rustfs::config::workload_profiles::WorkloadProfile; - -// For AI/ML workloads -let buffer_size = get_adaptive_buffer_size_with_profile( - file_size, - Some(WorkloadProfile::AiTraining) -); - -// For web workloads -let buffer_size = get_adaptive_buffer_size_with_profile( - file_size, - Some(WorkloadProfile::WebWorkload) -); - -// For secure storage -let buffer_size = get_adaptive_buffer_size_with_profile( - file_size, - Some(WorkloadProfile::SecureStorage) -); -``` - -### Auto-Detection Mode - -The system can automatically detect the runtime environment: - -```rust -// Auto-detects OS environment or falls back to GeneralPurpose -let buffer_size = get_adaptive_buffer_size_with_profile(file_size, None); -``` - -### Custom Configuration - -For specialized requirements, create a custom configuration: - -```rust -use rustfs::config::workload_profiles::{BufferConfig, WorkloadProfile}; - -let custom_config = BufferConfig { - min_size: 16 * 1024, // 16KB minimum - max_size: 512 * 1024, // 512KB maximum - default_unknown: 128 * 1024, // 128KB for unknown sizes - thresholds: vec![ - (1024 * 1024, 64 * 1024), // < 1MB: 64KB - 
(50 * 1024 * 1024, 256 * 1024), // 1MB-50MB: 256KB - (i64::MAX, 512 * 1024), // >= 50MB: 512KB - ], -}; - -let profile = WorkloadProfile::Custom(custom_config); -let buffer_size = get_adaptive_buffer_size_with_profile(file_size, Some(profile)); -``` - -## Phase 3: Default Enablement - -**⚡ NEW: Workload profiles are now enabled by default!** - -Starting from Phase 3, adaptive buffer sizing with workload profiles is **enabled by default** using the `GeneralPurpose` profile. For most files this selects the same buffer sizes as PR #869, makes specialized profiles available without any opt-in, and remains fully backward compatible. - -### Default Behavior - -```bash -# Phase 3: Profile-aware buffer sizing enabled by default with GeneralPurpose profile -./rustfs /data -``` - -RustFS now selects buffer sizes automatically from the file size and the active workload profile. - -### Changing the Workload Profile - -```bash -# Use a different profile (AI/ML workloads) -export RUSTFS_BUFFER_PROFILE=AiTraining -./rustfs /data - -# Or via command-line -./rustfs --buffer-profile AiTraining /data - -# Use web workload profile -./rustfs --buffer-profile WebWorkload /data -``` - -### Opt-Out (Legacy Behavior) - -If you need the exact behavior from PR #869 (fixed algorithm), you can disable profiling: - -```bash -# Disable buffer profiling (revert to PR #869 behavior) -export RUSTFS_BUFFER_PROFILE_DISABLE=true -./rustfs /data - -# Or via command-line -./rustfs --buffer-profile-disable /data -``` - -### Available Profile Names - -The following profile names are supported (case-insensitive): - -| Profile Name | Aliases | Description | -|-------------|---------|-------------| -| `GeneralPurpose` | `general` | Default balanced configuration (same as PR #869 for most files) | -| `AiTraining` | `ai` | Optimized for AI/ML workloads | -| `DataAnalytics` | `analytics` | Mixed read-write patterns | -| `WebWorkload` | `web` | Small-file-intensive operations | -| `IndustrialIoT` | `iot` | Real-time streaming | -
`SecureStorage` | `secure` | Security-first, memory constrained | - -### Behavior Summary - -**Phase 3 Default (Enabled):** -- Uses workload-aware buffer sizing with `GeneralPurpose` profile -- Provides same buffer sizes as PR #869 for most scenarios -- Allows easy switching to specialized profiles -- Buffer sizes: 64KB, 256KB, 1MB based on file size (GeneralPurpose) - -**With `RUSTFS_BUFFER_PROFILE_DISABLE=true`:** -- Uses the exact original adaptive buffer sizing from PR #869 -- For users who want guaranteed legacy behavior -- Buffer sizes: 64KB, 256KB, 1MB based on file size - -**With Different Profiles:** -- `AiTraining`: 512KB, 2MB, 4MB - maximize throughput -- `WebWorkload`: 32KB, 128KB, 256KB - optimize concurrency -- `SecureStorage`: 32KB, 128KB, 256KB - compliance-focused -- And more... - -### Migration Examples - -**Phase 2 → Phase 3 Migration:** - -```bash -# Phase 2 (Opt-In): Had to explicitly enable -export RUSTFS_BUFFER_PROFILE_ENABLE=true -export RUSTFS_BUFFER_PROFILE=GeneralPurpose -./rustfs /data - -# Phase 3 (Default): Enabled automatically -./rustfs /data # ← Same behavior, no configuration needed! -``` - -**Using Different Profiles:** - -```bash -# AI/ML workloads - larger buffers for maximum throughput -export RUSTFS_BUFFER_PROFILE=AiTraining -./rustfs /data - -# Web workloads - smaller buffers for high concurrency -export RUSTFS_BUFFER_PROFILE=WebWorkload -./rustfs /data - -# Secure environments - compliance-focused -export RUSTFS_BUFFER_PROFILE=SecureStorage -./rustfs /data -``` - -**Reverting to Legacy Behavior:** - -```bash -# If you encounter issues or need exact PR #869 behavior -export RUSTFS_BUFFER_PROFILE_DISABLE=true -./rustfs /data -``` - -## Phase 4: Full Integration (Current Implementation) - -**🚀 NEW: Profile-only implementation with performance metrics!** - -Phase 4 represents the final stage of the adaptive buffer sizing system, providing a unified, profile-based approach with optional performance monitoring. 
- -### Key Features - -1. **Deprecated Legacy Function** - - `get_adaptive_buffer_size()` is now deprecated - - Maintained for backward compatibility only - - All new code uses the workload profile system - -2. **Profile-Only Implementation** - - Single entry point: `get_buffer_size_opt_in()` - - All buffer sizes come from workload profiles - - Even "disabled" mode uses GeneralPurpose profile (no hardcoded values) - -3. **Performance Metrics** (Optional) - - Built-in metrics collection with `metrics` feature flag - - Tracks buffer size selections - - Monitors buffer-to-file size ratios - - Helps optimize profile configurations - -### Unified Buffer Sizing - -```rust -// Phase 4: Single, unified implementation -fn get_buffer_size_opt_in(file_size: i64) -> usize { - // Enabled by default (Phase 3) - // Uses workload profiles exclusively - // Optional metrics collection -} -``` - -### Performance Monitoring - -When compiled with the `metrics` feature flag: - -```bash -# Build with metrics support -cargo build --features metrics - -# Run and collect metrics -./rustfs /data - -# Metrics collected: -# - buffer_size_bytes: Histogram of selected buffer sizes -# - buffer_size_selections: Counter of buffer size calculations -# - buffer_to_file_ratio: Ratio of buffer size to file size -``` - -### Migration from Phase 3 - -No action required! Phase 4 is fully backward compatible with Phase 3: - -```bash -# Phase 3 usage continues to work -./rustfs /data -export RUSTFS_BUFFER_PROFILE=AiTraining -./rustfs /data - -# Phase 4 adds deprecation warnings for direct legacy function calls -# (if you have custom code calling get_adaptive_buffer_size) -``` - -### What Changed - -| Aspect | Phase 3 | Phase 4 | -|--------|---------|---------| -| Legacy Function | Active | Deprecated (still works) | -| Implementation | Hybrid (legacy fallback) | Profile-only | -| Metrics | None | Optional via feature flag | -| Buffer Source | Profiles or hardcoded | Profiles only | - -### Benefits - -1. 
**Simplified Codebase** - - Single implementation path - - Easier to maintain and optimize - - Consistent behavior across all scenarios - -2. **Better Observability** - - Optional metrics for performance monitoring - - Data-driven profile optimization - - Production usage insights - -3. **Future-Proof** - - No legacy code dependencies - - Easy to add new profiles - - Extensible for future enhancements - -### Code Example - -**Phase 3 (Still Works):** -```rust -// Enabled by default -let buffer_size = get_buffer_size_opt_in(file_size); -``` - -**Phase 4 (Recommended):** -```rust -// Same call, but now with optional metrics and profile-only implementation -let buffer_size = get_buffer_size_opt_in(file_size); -// Metrics automatically collected if feature enabled -``` - -**Deprecated (Backward Compatible):** -```rust -// This still works but generates deprecation warnings -#[allow(deprecated)] -let buffer_size = get_adaptive_buffer_size(file_size); -``` - -### Enabling Metrics - -Add to `Cargo.toml`: -```toml -[dependencies] -rustfs = { version = "*", features = ["metrics"] } -``` - -Or build with feature flag: -```bash -cargo build --features metrics --release -``` - -### Metrics Dashboard - -When metrics are enabled, you can visualize: - -- **Buffer Size Distribution**: Most common buffer sizes used -- **Profile Effectiveness**: How well profiles match actual workloads -- **Memory Efficiency**: Buffer-to-file size ratios -- **Usage Patterns**: File size distribution and buffer selection trends - -Use your preferred metrics backend (Prometheus, InfluxDB, etc.) to collect and visualize these metrics. - -## Phase 2: Opt-In Usage (Previous Implementation) - -**Note:** Phase 2 documentation is kept for historical reference. The current version uses Phase 4 (Full Integration). 
- -<details> -<summary>Click to expand Phase 2 documentation</summary> - -Starting from Phase 2 of the migration path, workload profiles can be enabled via environment variables or command-line arguments. - -### Environment Variables - -Enable workload profiling using these environment variables: - -```bash -# Enable buffer profiling (opt-in) -export RUSTFS_BUFFER_PROFILE_ENABLE=true - -# Set the workload profile -export RUSTFS_BUFFER_PROFILE=AiTraining - -# Start RustFS -./rustfs /data -``` - -### Command-Line Arguments - -Alternatively, use command-line flags: - -```bash -# Enable buffer profiling with AI training profile -./rustfs --buffer-profile-enable --buffer-profile AiTraining /data - -# Enable buffer profiling with web workload profile -./rustfs --buffer-profile-enable --buffer-profile WebWorkload /data - -# Disable buffer profiling (use legacy behavior) -./rustfs /data -``` - -### Behavior - -When `RUSTFS_BUFFER_PROFILE_ENABLE=false` (default in Phase 2): -- Uses the original adaptive buffer sizing from PR #869 -- No breaking changes to existing deployments -- Buffer sizes: 64KB, 256KB, 1MB based on file size - -When `RUSTFS_BUFFER_PROFILE_ENABLE=true`: -- Uses the configured workload profile -- Allows for workload-specific optimizations -- Buffer sizes vary based on the selected profile - -</details> - - - -## Configuration Validation - -All buffer configurations are validated to ensure correctness: - -```rust -let config = BufferConfig { /* ... 
*/ }; -config.validate()?; // Returns Err if invalid -``` - -**Validation Rules:** -- `min_size` must be > 0 -- `max_size` must be >= `min_size` -- `default_unknown` must be between `min_size` and `max_size` -- Thresholds must be in ascending order -- Buffer sizes in thresholds must be within `[min_size, max_size]` - -## Environment Detection - -The system automatically detects special operating system environments by reading `/etc/os-release` on Linux systems: - -```rust -if let Some(profile) = WorkloadProfile::detect_os_environment() { - // Returns SecureStorage profile for Kylin, NeoKylin, UOS, etc. - let buffer_size = profile.config().calculate_buffer_size(file_size); -} -``` - -**Detected Environments:** -- Kylin (麒麟) -- NeoKylin (中标麒麟) -- UOS / Unity OS (统信) -- OpenKylin (开放麒麟) - -## Performance Considerations - -### Memory Usage - -Different profiles have different memory footprints: - -| Profile | Min Buffer | Max Buffer | Typical Memory | -|---------|-----------|-----------|----------------| -| GeneralPurpose | 64KB | 1MB | Low-Medium | -| AiTraining | 512KB | 4MB | High | -| DataAnalytics | 128KB | 2MB | Medium | -| WebWorkload | 32KB | 256KB | Low | -| IndustrialIoT | 64KB | 512KB | Low | -| SecureStorage | 32KB | 256KB | Low | - -### Throughput Impact - -Larger buffers generally provide better throughput for large files by reducing system call overhead: - -- **Small buffers (32-64KB)**: Lower memory, more syscalls, suitable for many small files -- **Medium buffers (128-512KB)**: Balanced approach for mixed workloads -- **Large buffers (1-4MB)**: Maximum throughput, best for large sequential reads - -### Concurrency Considerations - -For high-concurrency scenarios (e.g., WebWorkload): -- Smaller buffers reduce per-connection memory -- Allows more concurrent connections -- Better overall system resource utilization - -## Best Practices - -### 1. 
Choose the Right Profile - -Select the profile that matches your primary workload: - -```rust -// AI/ML training -WorkloadProfile::AiTraining - -// Web application -WorkloadProfile::WebWorkload - -// General purpose storage -WorkloadProfile::GeneralPurpose -``` - -### 2. Monitor Memory Usage - -In production, monitor memory consumption, and switch to a smaller-buffer profile if memory is constrained: - -```rust -// For memory-constrained environments, use smaller buffers -WorkloadProfile::SecureStorage // or IndustrialIoT -``` - -### 3. Test Performance - -Benchmark your specific workload to verify the profile choice: - -```bash -# Run performance tests with different profiles -cargo test --release -- --ignored performance_tests -``` - -### 4. Consider File Size Distribution - -If you know your typical file sizes: - -- Mostly small files (< 1MB): Use `WebWorkload` or `SecureStorage` -- Mostly large files (> 100MB): Use `AiTraining` or `DataAnalytics` -- Mixed sizes: Use `GeneralPurpose` - -### 5. Compliance Requirements - -For regulated environments: - -```rust -// Automatically uses SecureStorage on detected secure OS -let config = RustFSBufferConfig::with_auto_detect(); - -// Or explicitly set SecureStorage -let config = RustFSBufferConfig::new(WorkloadProfile::SecureStorage); -``` - -## Integration Examples - -### S3 Put Object - -```rust -async fn put_object(&self, req: S3Request<PutObjectInput>) -> S3Result<S3Response<PutObjectOutput>> { - let size = req.input.content_length.unwrap_or(-1); - - // Use workload-aware buffer sizing - let buffer_size = get_adaptive_buffer_size_with_profile( - size, - Some(WorkloadProfile::GeneralPurpose) - ); - - let body = tokio::io::BufReader::with_capacity( - buffer_size, - StreamReader::new(body) // `body`: the request body stream (extracting it from `req` is elided) - ); - - // Process upload... -
-} -``` - -### Multipart Upload - -```rust -async fn upload_part(&self, req: S3Request<UploadPartInput>) -> S3Result<S3Response<UploadPartOutput>> { - let size = req.input.content_length.unwrap_or(-1); - - // For large multipart uploads, consider using AiTraining profile - let buffer_size = get_adaptive_buffer_size_with_profile( - size, - Some(WorkloadProfile::AiTraining) - ); - - let body = tokio::io::BufReader::with_capacity( - buffer_size, - StreamReader::new(body_stream) // `body_stream`: the part's request body stream (extracting it from `req` is elided) - ); - - // Process part upload... -} -``` - -## Troubleshooting - -### High Memory Usage - -If you are seeing high memory usage: - -1. Switch to a more conservative profile: - ```rust - WorkloadProfile::WebWorkload // or SecureStorage - ``` - -2. Set explicit memory limits in a custom configuration: - ```rust - let config = BufferConfig { - min_size: 16 * 1024, - max_size: 128 * 1024, // Cap at 128KB - // ... - }; - ``` - -### Low Throughput - -If throughput for large files is low: - -1. Use a more aggressive profile: - ```rust - WorkloadProfile::AiTraining // or DataAnalytics - ``` - -2. Increase buffer sizes in a custom configuration: - ```rust - let config = BufferConfig { - max_size: 4 * 1024 * 1024, // 4MB max buffer - // ...
- }; - ``` - -### Streaming/Unknown Size Handling - -For chunked transfers or streaming: - -```rust -// Pass -1 for unknown size -let buffer_size = get_adaptive_buffer_size_with_profile(-1, None); -// Returns the profile's default_unknown size -``` - -## Technical Implementation - -### Algorithm - -The buffer size is selected based on file size thresholds: - -```rust -pub fn calculate_buffer_size(&self, file_size: i64) -> usize { - if file_size < 0 { - return self.default_unknown; - } - - for (threshold, buffer_size) in &self.thresholds { - if file_size < *threshold { - return (*buffer_size).clamp(self.min_size, self.max_size); - } - } - - self.max_size -} -``` - -### Thread Safety - -All configuration structures are: -- Immutable after creation -- Safe to share across threads -- Cloneable for per-thread customization - -### Performance Overhead - -- Configuration lookup: O(n) where n = number of thresholds (typically 2-4) -- Negligible overhead compared to I/O operations -- Configuration can be cached per-connection - -## Migration Guide - -### From PR #869 - -The original `get_adaptive_buffer_size` function is preserved for backward compatibility: - -```rust -// Old code (still works) -let buffer_size = get_adaptive_buffer_size(file_size); - -// New code (recommended) -let buffer_size = get_adaptive_buffer_size_with_profile( - file_size, - Some(WorkloadProfile::GeneralPurpose) -); -``` - -### Upgrading Existing Code - -1. **Identify workload type** for each use case -2. **Replace** `get_adaptive_buffer_size` with `get_adaptive_buffer_size_with_profile` -3. **Choose** appropriate profile -4. **Test** performance impact - -## References - -- [PR #869: Fix large file upload freeze with adaptive buffer sizing](https://github.com/rustfs/rustfs/pull/869) -- [Performance Testing Guide](./PERFORMANCE_TESTING.md) -- [Configuration Documentation](./ENVIRONMENT_VARIABLES.md) - -## License - -Copyright 2024 RustFS Team - -Licensed under the Apache License, Version 2.0. 
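The threshold lookup from the Algorithm section and the Configuration Validation rules fit together in a short, self-contained sketch. It is an illustration only, not the RustFS source. The `general_purpose()` constructor, `validate()` returning `Result<(), String>`, and the 256KB `default_unknown` are assumptions; the document does not state GeneralPurpose's value for unknown sizes. "Ascending" thresholds are also interpreted here as strictly ascending.

```rust
const KB: usize = 1024;
const MB: usize = 1024 * KB;

/// Sketch of the BufferConfig described above (not the RustFS definition).
#[derive(Debug, Clone)]
struct BufferConfig {
    min_size: usize,
    max_size: usize,
    default_unknown: usize,
    /// (exclusive upper bound on file size, buffer size) pairs, ascending.
    thresholds: Vec<(i64, usize)>,
}

impl BufferConfig {
    /// Hypothetical constructor using the GeneralPurpose buffer sizes.
    fn general_purpose() -> Self {
        BufferConfig {
            min_size: 64 * KB,
            max_size: MB,
            default_unknown: 256 * KB, // assumption: not stated in the document
            thresholds: vec![
                (MB as i64, 64 * KB),         // < 1MB: 64KB
                ((100 * MB) as i64, 256 * KB), // 1MB-100MB: 256KB
                (i64::MAX, MB),                // >= 100MB: 1MB
            ],
        }
    }

    /// Applies the validation rules listed under "Configuration Validation".
    fn validate(&self) -> Result<(), String> {
        if self.min_size == 0 {
            return Err("min_size must be > 0".into());
        }
        if self.max_size < self.min_size {
            return Err("max_size must be >= min_size".into());
        }
        if !(self.min_size..=self.max_size).contains(&self.default_unknown) {
            return Err("default_unknown must be within [min_size, max_size]".into());
        }
        if !self.thresholds.windows(2).all(|w| w[0].0 < w[1].0) {
            return Err("thresholds must be in ascending order".into());
        }
        if self.thresholds.iter().any(|&(_, s)| s < self.min_size || s > self.max_size) {
            return Err("threshold buffer sizes must lie within [min_size, max_size]".into());
        }
        Ok(())
    }

    /// Same lookup as the Algorithm section: negative size means "unknown".
    fn calculate_buffer_size(&self, file_size: i64) -> usize {
        if file_size < 0 {
            return self.default_unknown;
        }
        for &(threshold, buffer_size) in &self.thresholds {
            if file_size < threshold {
                return buffer_size.clamp(self.min_size, self.max_size);
            }
        }
        self.max_size
    }
}
```

With these values, `calculate_buffer_size(-1)` returns the 256KB fallback, a 512KB file gets 64KB, a 10MB file gets 256KB, and a 200MB file gets 1MB, matching the GeneralPurpose sizes listed earlier.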
diff --git a/docs/ansible/REAEME.md b/docs/ansible/REAEME.md deleted file mode 100644 index 65a8f52eb..000000000 --- a/docs/ansible/REAEME.md +++ /dev/null @@ -1,106 +0,0 @@ -# Install RustFS in MNMD mode using Ansible - -This chapter shows how to install RustFS in MNMD (multiple nodes, multiple disks) mode using an Ansible playbook. Two installation methods are available: binary and Docker Compose. - -## Requirements - -- Multiple nodes (at least 4 nodes, each with a private IP and a public IP) -- Multiple disks (at least 1 disk per node; 4 disks per node is a better choice) -- Ansible must be available -- Docker must be available (Docker Compose installation only) - -## Binary installation and uninstallation - -### Installation - -For the binary installation ([script installation](https://rustfs.com/en/download/)), modify the following part of the playbook: - -``` -- name: Modify Target Server's hosts file - blockinfile: - path: /etc/hosts - block: | - 172.20.92.199 rustfs-node1 - 172.20.92.200 rustfs-node2 - 172.20.92.201 rustfs-node3 - 172.20.92.202 rustfs-node4 -``` - -Replace the IPs with your nodes' **private IPs**. If you have more than 4 nodes, add their IPs in order. - -Run the following command to install RustFS: - -``` -ansible-playbook --skip-tags rustfs_uninstall binary-mnmd.yml -``` - -After a successful installation, you can access the RustFS cluster via any node's public IP on port 9000. The default username and password are both `rustfsadmin`.
- - -### Uninstallation - -Run the following command to uninstall RustFS: - -``` -ansible-playbook --tags rustfs_uninstall binary-mnmd.yml -``` - -## Docker compose installation and uninstallation - -**NOTE**: For the Docker Compose installation, the playbook also contains a Docker installation task: - -``` -tasks: - - name: Install docker - shell: | - apt-get remove -y docker docker.io containerd runc || true - apt-get update -y - apt-get install -y ca-certificates curl gnupg lsb-release - install -m 0755 -d /etc/apt/keyrings - curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | gpg --dearmor --yes -o /etc/apt/keyrings/docker.gpg - chmod a+r /etc/apt/keyrings/docker.gpg - echo \ - "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://mirrors.aliyun.com/docker-ce/linux/ubuntu \ - $(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null - apt-get update -y - apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin - become: yes - register: docker_installation_result - changed_when: false - - - name: Installation check - debug: - var: docker_installation_result.stdout -``` - -If your nodes already have Docker installed, you can skip this task: `docker-compose-mnmd.yml` tags it `docker_install`, so run the install command with `--skip-tags rustfs_uninstall,docker_install`. The task only installs Docker on `Ubuntu`; on a different OS, install Docker yourself or modify the task. - -For the Docker Compose installation, you also need to modify the following part of the playbook: - -``` -extra_hosts: - - "rustfs-node1:172.20.92.202" - - "rustfs-node2:172.20.92.201" - - "rustfs-node3:172.20.92.200" - - "rustfs-node4:172.20.92.199" -``` - -Replace the IPs with your nodes' **private IPs**. If you have more than 4 nodes, add their IPs in order. - -Run the following command to install RustFS: - -``` -ansible-playbook --skip-tags rustfs_uninstall docker-compose-mnmd.yml -``` - -After a successful installation, you can access the RustFS cluster via any node's public IP on port 9000.
The default username and password are both `rustfsadmin`. - -### Uninstallation - -Run the following command to uninstall RustFS: - -``` -ansible-playbook --tags rustfs_uninstall docker-compose-mnmd.yml -``` - - diff --git a/docs/ansible/binary-mnmd.yml b/docs/ansible/binary-mnmd.yml deleted file mode 100644 index 3bb7169de..000000000 --- a/docs/ansible/binary-mnmd.yml +++ /dev/null @@ -1,166 +0,0 @@ ----- -- name: Prepare for RustFS installation - hosts: rustfs - become: yes - vars: - ansible_python_interpreter: /usr/bin/python3 - remote_user: root - - tasks: - - name: Create Workspace - file: - path: /opt/rustfs - state: directory - mode: '0755' - register: create_dir_result - - - name: Dir Creation Result Check - debug: - msg: "RustFS dir created successfully" - when: create_dir_result.changed - - - name: Modify Target Server's hosts file - blockinfile: - path: /etc/hosts - block: | - 127.0.0.1 localhost - 172.20.92.199 rustfs-node1 - 172.20.92.200 rustfs-node2 - 172.20.92.201 rustfs-node3 - 172.20.92.202 rustfs-node4 - - - name: Create rustfs group - group: - name: rustfs - system: yes - state: present - - - name: Create rustfs user - user: - name: rustfs - shell: /bin/bash - system: yes - create_home: no - group: rustfs - state: present - - - name: Get rustfs user id - command: id -u rustfs - register: rustfs_uid - changed_when: false - ignore_errors: yes - - - name: Check rustfs user id - debug: - msg: "rustfs uid is {{ rustfs_uid.stdout }}" - - - name: Create volume dir - file: - path: "{{ item }}" - state: directory - owner: rustfs - group: rustfs - mode: '0755' - loop: - - /data/rustfs0 - - /data/rustfs1 - - /data/rustfs2 - - /data/rustfs3 - - /var/logs/rustfs - -- name: Install RustFS - hosts: rustfs - become: yes - vars: - ansible_python_interpreter: /usr/bin/python3 - install_script_url: "https://rustfs.com/install_rustfs.sh" - install_script_tmp: "/opt/rustfs/install_rustfs.sh" - tags: rustfs_install - - tasks: - - name: Prepare configuration file - copy: - dest:
/etc/default/rustfs - content: | - RUSTFS_ACCESS_KEY=rustfsadmin - RUSTFS_SECRET_KEY=rustfsadmin - RUSTFS_VOLUMES="http://rustfs-node{1...4}:9000/data/rustfs{0...3}" - RUSTFS_ADDRESS=":9000" - RUSTFS_CONSOLE_ENABLE=true - RUSTFS_OBS_LOG_DIRECTORY="/var/logs/rustfs/" - owner: root - group: root - mode: '0644' - - - name: Install unzip - apt: - name: unzip - state: present - update_cache: yes - - - name: Download the rustfs install script - get_url: - url: "{{ install_script_url }}" - dest: "{{ install_script_tmp }}" - mode: '0755' - - - name: Run rustfs installation script - expect: - command: bash "{{install_script_tmp}}" - responses: - '.*Enter your choice.*': "1\n" - '.*Please enter RustFS service port.*': "9000\n" - '.*Please enter RustFS console port.*': "9001\n" - '.*Please enter data storage directory.*': "http://rustfs-node{1...4}:9000/data/rustfs{0...3}\n" - timeout: 300 - register: rustfs_install_result - tags: - - rustfs_install - - - name: Debug installation output - debug: - var: rustfs_install_result.stdout_lines - - - name: Installation confirmation - command: rustfs --version - register: rustfs_version - changed_when: false - failed_when: rustfs_version.rc != 0 - - - name: Show rustfs version - debug: - msg: "RustFS version is {{ rustfs_version.stdout }}" - -- name: Uninstall RustFS - hosts: rustfs - become: yes - vars: - install_script_tmp: /opt/rustfs/install_rustfs.sh - ansible_python_interpreter: /usr/bin/python3 - tags: rustfs_uninstall - - tasks: - - name: Run rustfs uninstall script - expect: - command: bash "{{ install_script_tmp }}" - responses: - 'Enter your choice.*': "2\n" - 'Are you sure you want to uninstall RustFS.*': "y\n" - timeout: 300 - register: rustfs_uninstall_result - tags: rustfs_uninstall - - - name: Debug uninstall output - debug: - var: rustfs_uninstall_result.stdout_lines - - - name: Delete data dir - file: - path: "{{ item }}" - state: absent - loop: - - /data/rustfs0 - - /data/rustfs1 - - /data/rustfs2 - - /data/rustfs3 
- - /var/logs/rustfs diff --git a/docs/ansible/docker-compose-mnmd.yml b/docs/ansible/docker-compose-mnmd.yml deleted file mode 100644 index 2eb1c985d..000000000 --- a/docs/ansible/docker-compose-mnmd.yml +++ /dev/null @@ -1,114 +0,0 @@ ---- -- name: Prepare for RustFS installation - hosts: rustfs - become: yes - vars: - ansible_python_interpreter: /usr/bin/python3 - install_script_url: "https://rustfs.com/install_rustfs.sh" - docker_compose: "/opt/rustfs/docker-compose" - remote_user: root - - tasks: - - name: Install docker - tags: docker_install - shell: | - apt-get remove -y docker docker.io containerd runc || true - apt-get update -y - apt-get install -y ca-certificates curl gnupg lsb-release - install -m 0755 -d /etc/apt/keyrings - curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | gpg --dearmor --yes -o /etc/apt/keyrings/docker.gpg - chmod a+r /etc/apt/keyrings/docker.gpg - echo \ - "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://mirrors.aliyun.com/docker-ce/linux/ubuntu \ - $(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null - apt-get update -y - apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin - become: yes - register: docker_installation_result - changed_when: false - when: ansible_facts['distribution'] == "Ubuntu" - - - name: Installation check - debug: - var: docker_installation_result.stdout - when: ansible_facts['distribution'] == "Ubuntu" - - - name: Create docker compose dir - file: - path: "{{ docker_compose }}" - state: directory - mode: '0755' - - - name: Prepare docker compose file - copy: - content: | - services: - rustfs: - image: rustfs/rustfs:latest - container_name: rustfs - hostname: rustfs - network_mode: host - environment: - # Use service names and correct disk indexing (1..4 to match mounted paths) - - RUSTFS_VOLUMES=http://rustfs-node{1...4}:9000/data/rustfs{1...4} - - RUSTFS_ADDRESS=0.0.0.0:9000 - - 
RUSTFS_CONSOLE_ENABLE=true - - RUSTFS_CONSOLE_ADDRESS=0.0.0.0:9001 - - RUSTFS_ACCESS_KEY=rustfsadmin - - RUSTFS_SECRET_KEY=rustfsadmin - - RUSTFS_CMD=rustfs - command: ["sh", "-c", "sleep 3 && rustfs"] - healthcheck: - test: - [ - "CMD-SHELL", - "curl -f http://127.0.0.1:9000/health && curl -f http://127.0.0.1:9001/health || exit 1" - ] - interval: 10s - timeout: 5s - retries: 3 - start_period: 30s - ports: - - "9000:9000" # API endpoint - - "9001:9001" # Console - volumes: - - rustfs-data1:/data/rustfs1 - - rustfs-data2:/data/rustfs2 - - rustfs-data3:/data/rustfs3 - - rustfs-data4:/data/rustfs4 - extra_hosts: - - "rustfs-node1:172.20.92.202" - - "rustfs-node2:172.20.92.201" - - "rustfs-node3:172.20.92.200" - - "rustfs-node4:172.20.92.199" - - volumes: - rustfs-data1: - rustfs-data2: - rustfs-data3: - rustfs-data4: - - dest: "{{ docker_compose }}/docker-compose.yml" - mode: '0644' - - - name: Install rustfs using docker compose - tags: rustfs_install - command: docker compose -f "{{ docker_compose}}/docker-compose.yml" up -d - args: - chdir: "{{ docker_compose }}" - - - name: Get docker compose output - command: docker compose ps - args: - chdir: "{{ docker_compose }}" - register: docker_compose_output - - - name: Check the docker compose installation output - debug: - msg: "{{ docker_compose_output.stdout }}" - - - name: Uninstall rustfs using docker compose - tags: rustfs_uninstall - command: docker compose -f "{{ docker_compose}}/docker-compose.yml" down - args: - chdir: "{{ docker_compose }}" diff --git a/docs/bug_resolution_report_issue_1013.md b/docs/bug_resolution_report_issue_1013.md deleted file mode 100644 index 0b1d72f7f..000000000 --- a/docs/bug_resolution_report_issue_1013.md +++ /dev/null @@ -1,174 +0,0 @@ -# Bug Resolution Report: Jemalloc Page Size Crash on Raspberry Pi (AArch64) - -**Status:** Resolved and Verified -**Issue Reference:** GitHub Issue #1013 -**Target Architecture:** Linux AArch64 (Raspberry Pi 5, Apple Silicon VMs) -**Date:** 
December 7, 2025 - ---- - -## 1. Executive Summary - -This document details the analysis, resolution, and verification of a critical startup crash affecting `rustfs` on -Raspberry Pi 5 and other AArch64 Linux environments. The issue was identified as a memory page size mismatch between the -compiled `jemalloc` allocator (4KB) and the runtime kernel configuration (16KB). - -The fix involves a dynamic, architecture-aware allocator configuration that automatically switches to `mimalloc` on -AArch64 systems while retaining the high-performance `jemalloc` for standard x86_64 server environments. This solution -ensures 100% stability on ARM hardware without introducing performance regressions on existing platforms. - ---- - -## 2. Issue Analysis - -### 2.1 Symptom - -The application crashes immediately upon startup, including during simple version checks (`rustfs -version`). - -**Error Message:** - -```text -<jemalloc>: Unsupported system page size -``` - -### 2.2 Environment - -* **Hardware:** Raspberry Pi 5 (and compatible AArch64 systems). -* **OS:** Debian Trixie (Linux AArch64). -* **Kernel Configuration:** 16KB system page size (common default for modern ARM performance). - -### 2.3 Root Cause - -The crash stems from a fundamental incompatibility in the `tikv-jemallocator` build configuration: - -1. **Static Configuration:** Experimental builds of `jemalloc` are often compiled expecting a standard **4KB memory page**. -2. **Runtime Mismatch:** Modern AArch64 kernels (like on RPi 5) often use **16KB or 64KB pages** for improved TLB - efficiency. -3. **Fatal Error:** When `jemalloc` initializes, it detects that the actual system page size exceeds its compiled - support window. This is treated as an unrecoverable error, triggering an immediate panic before `main()` is even - entered. - ---- - -## 3. Impact Assessment - -### 3.1 Critical Bottleneck - -**Zero-Day Blocker:** The mismatch acts as a hard blocker. 
The binaries produced were completely non-functional on the -impacted hardware. - -### 3.2 Scope - -* **Affected:** Linux AArch64 systems with non-standard (non-4KB) page sizes. -* **Unaffected:** Standard x86_64 servers, MacOS, and Windows environments. - ---- - -## 4. Solution Strategy - -### 4.1 Selected Fix: Architecture-Aware Allocator Switching - -We opted to replace the allocator specifically for the problematic architecture. - -* **For AArch64 (Target):** Switch to **`mimalloc`**. - * *Rationale:* `mimalloc` is a robust, high-performance allocator that is inherently agnostic to specific system - page sizes (supports 4KB/16KB/64KB natively). It is already used in `musl` builds, proving its reliability. -* **For x86_64 (Standard):** Retain **`jemalloc`**. - * *Rationale:* `jemalloc` is deeply optimized for server workloads. Keeping it ensures no changes to the performance - profile of the primary production environment. - -### 4.2 Alternatives Rejected - -* **Recompiling Jemalloc:** Attempting to force `jemalloc` to support 64KB pages (`--with-lg-page=16`) via - `tikv-jemallocator` features was deemed too complex and fragile. It would require forking the wrapper crate or complex - build script overrides, increasing maintenance burden. - ---- - -## 5. Implementation Details - -The fix was implemented across three key areas of the codebase to ensure "Secure by Design" principles. - -### 5.1 Dependency Management (`rustfs/Cargo.toml`) - -We used Cargo's platform-specific configuration to isolate dependencies. `jemalloc` is now mathematically impossible to -link on AArch64. - -* **Old Config:** `jemalloc` included for all Linux GNU targets. -* **New Config:** - * `mimalloc` enabled for `not(all(target_os = "linux", target_env = "gnu", target_arch = "x86_64"))` (i.e., - everything except Linux GNU x86_64). - * `tikv-jemallocator` restricted to `all(target_os = "linux", target_env = "gnu", target_arch = "x86_64")`. 
- -### 5.2 Global Allocator Logic (`rustfs/src/main.rs`) - -The global allocator is now conditionally selected at compile time: - -```rust -#[cfg(all(target_os = "linux", target_env = "gnu", target_arch = "x86_64"))] -#[global_allocator] -static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc; - -#[cfg(not(all(target_os = "linux", target_env = "gnu", target_arch = "x86_64")))] -#[global_allocator] -static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc; -``` - -### 5.3 Safe Fallbacks (`rustfs/src/profiling.rs`) - -Since `jemalloc` provides specific profiling features (memory dumping) that `mimalloc` does not mirror 1:1, we added -feature guards. - -* **Guard:** `#[cfg(all(target_os = "linux", target_env = "gnu", target_arch = "x86_64"))]` (profiling enabled only on - Linux GNU x86_64) -* **Behavior:** On all other platforms (including AArch64), calls to dump memory profiles now return a "Not Supported" - error log instead of crashing or failing to compile. - ---- - -## 6. Verification and Testing - -To ensure the fix is 100% effective, we employed **Cross-Architecture Dependency Tree Analysis**. This method -mathematically proves which libraries are linked for a specific target. - -### 6.1 Test 1: Replicating the Bugged Environment (AArch64) - -We checked if the crashing library (`jemalloc`) was still present for the ARM64 target. - -* **Command:** `cargo tree --target aarch64-unknown-linux-gnu -i tikv-jemallocator` -* **Result:** `warning: nothing to print.` -* **Conclusion:** **Passed.** `jemalloc` is completely absent from the build graph. The crash is impossible. - -### 6.2 Test 2: Verifying the Fix (AArch64) - -We confirmed that the safe allocator (`mimalloc`) was correctly substituted. - -* **Command:** `cargo tree --target aarch64-unknown-linux-gnu -i mimalloc` -* **Result:** - ```text - mimalloc v0.1.48 - └── rustfs v0.0.5 ... - ``` -* **Conclusion:** **Passed.** The system is correctly configured to use the page-agnostic allocator. 
- -### 6.3 Test 3: Regression Safety (x86_64) - -We ensured that standard servers were not accidentally downgraded to `mimalloc` (unless desired). - -* **Command:** `cargo tree --target x86_64-unknown-linux-gnu -i tikv-jemallocator` -* **Result:** - ```text - tikv-jemallocator v0.6.1 - └── rustfs v0.0.5 ... - ``` -* **Conclusion:** **Passed.** No regression. High-performance allocator retained for standard hardware. - ---- - -## 7. Conclusion - -The codebase is now **110% secure** against the "Unsupported system page size" crash. - -* **Robustness:** Achieved via reliable, architecture-native allocators (`mimalloc` on ARM). -* **Stability:** Build process is deterministic; no "lucky" builds. -* **Maintainability:** Uses standard Cargo features (`cfg`) without custom build scripts or hacks. diff --git a/docs/client-special-characters-guide.md b/docs/client-special-characters-guide.md deleted file mode 100644 index 6b5d5251e..000000000 --- a/docs/client-special-characters-guide.md +++ /dev/null @@ -1,442 +0,0 @@ -# Working with Special Characters in Object Names - -## Overview - -This guide explains how to properly handle special characters (spaces, plus signs, etc.) in S3 object names when using RustFS. - -## Quick Reference - -| Character | What You Type | How It's Stored | How to Access It | -|-----------|---------------|-----------------|------------------| -| Space | `my file.txt` | `my file.txt` | Use proper S3 client/SDK | -| Plus | `test+file.txt` | `test+file.txt` | Use proper S3 client/SDK | -| Percent | `test%file.txt` | `test%file.txt` | Use proper S3 client/SDK | - -**Key Point**: Use a proper S3 SDK or client. They handle URL encoding automatically! - -## Recommended Approach: Use S3 SDKs - -The easiest and most reliable way to work with object names containing special characters is to use an official S3 SDK. These handle all encoding automatically. 
- -### AWS CLI - -```bash -# Works correctly - AWS CLI handles encoding -aws --endpoint-url=http://localhost:9000 s3 cp file.txt "s3://mybucket/path with spaces/file.txt" -aws --endpoint-url=http://localhost:9000 s3 ls "s3://mybucket/path with spaces/" - -# Works with plus signs -aws --endpoint-url=http://localhost:9000 s3 cp data.json "s3://mybucket/ES+net/data.json" -``` - -### MinIO Client (mc) - -```bash -# Configure RustFS endpoint -mc alias set myrustfs http://localhost:9000 ACCESS_KEY SECRET_KEY - -# Upload with spaces in path -mc cp README.md "myrustfs/mybucket/a f+/b/c/3/README.md" - -# List contents -mc ls "myrustfs/mybucket/a f+/" -mc ls "myrustfs/mybucket/a f+/b/c/3/" - -# Works with plus signs -mc cp file.txt "myrustfs/mybucket/ES+net/file.txt" -``` - -### Python (boto3) - -```python -import boto3 - -# Configure client -s3 = boto3.client( - 's3', - endpoint_url='http://localhost:9000', - aws_access_key_id='ACCESS_KEY', - aws_secret_access_key='SECRET_KEY' -) - -# Upload with spaces - boto3 handles encoding automatically -s3.put_object( - Bucket='mybucket', - Key='path with spaces/file.txt', - Body=b'file content' -) - -# List objects - boto3 encodes prefix automatically -response = s3.list_objects_v2( - Bucket='mybucket', - Prefix='path with spaces/' -) - -for obj in response.get('Contents', []): - print(obj['Key']) # Will print: "path with spaces/file.txt" - -# Works with plus signs -s3.put_object( - Bucket='mybucket', - Key='ES+net/LHC+Data+Challenge/file.json', - Body=b'data' -) -``` - -### Go (AWS SDK) - -```go -package main - -import ( - "bytes" - "fmt" - - "github.com/aws/aws-sdk-go/aws" - "github.com/aws/aws-sdk-go/aws/credentials" - "github.com/aws/aws-sdk-go/aws/session" - "github.com/aws/aws-sdk-go/service/s3" -) - -func main() { - // Configure session - sess := session.Must(session.NewSession(&aws.Config{ - Endpoint: aws.String("http://localhost:9000"), - Region: aws.String("us-east-1"), - Credentials: 
credentials.NewStaticCredentials("ACCESS_KEY", "SECRET_KEY", ""), - S3ForcePathStyle: aws.Bool(true), - })) - - svc := s3.New(sess) - - // Upload with spaces - SDK handles encoding - _, err := svc.PutObject(&s3.PutObjectInput{ - Bucket: aws.String("mybucket"), - Key: aws.String("path with spaces/file.txt"), - Body: bytes.NewReader([]byte("content")), - }) - - if err != nil { - panic(err) - } - - // List objects - SDK handles encoding - result, err := svc.ListObjectsV2(&s3.ListObjectsV2Input{ - Bucket: aws.String("mybucket"), - Prefix: aws.String("path with spaces/"), - }) - - if err != nil { - panic(err) - } - - for _, obj := range result.Contents { - fmt.Println(*obj.Key) - } -} -``` - -### Node.js (AWS SDK v3) - -```javascript -const { S3Client, PutObjectCommand, ListObjectsV2Command } = require("@aws-sdk/client-s3"); - -// Configure client -const client = new S3Client({ - endpoint: "http://localhost:9000", - region: "us-east-1", - credentials: { - accessKeyId: "ACCESS_KEY", - secretAccessKey: "SECRET_KEY", - }, - forcePathStyle: true, -}); - -// Upload with spaces - SDK handles encoding -async function upload() { - const command = new PutObjectCommand({ - Bucket: "mybucket", - Key: "path with spaces/file.txt", - Body: "file content", - }); - - await client.send(command); -} - -// List objects - SDK handles encoding -async function list() { - const command = new ListObjectsV2Command({ - Bucket: "mybucket", - Prefix: "path with spaces/", - }); - - const response = await client.send(command); - - for (const obj of response.Contents || []) { - console.log(obj.Key); - } -} -``` - -## Advanced: Manual HTTP Requests - -**⚠️ Not Recommended**: Only use if you can't use an S3 SDK. 
- -If you must make raw HTTP requests, you need to manually URL-encode the object key in the path: - -### URL Encoding Rules - -| Character | Encoding | Example | -|-----------|----------|---------| -| Space | `%20` | `my file.txt` → `my%20file.txt` | -| Plus | `%2B` | `test+file.txt` → `test%2Bfile.txt` | -| Percent | `%25` | `test%file.txt` → `test%25file.txt` | -| Slash (in name) | `%2F` | `test/file.txt` → `test%2Ffile.txt` | - -**Important**: In URL **paths** (not query parameters): -- `%20` = space -- `+` = literal plus sign (NOT space!) -- To represent a plus sign, use `%2B` - -### Example: Manual curl Request - -```bash -# Upload object with spaces -curl -X PUT "http://localhost:9000/mybucket/path%20with%20spaces/file.txt" \ - -H "Authorization: AWS4-HMAC-SHA256 ..." \ - -d "file content" - -# Upload object with plus signs -curl -X PUT "http://localhost:9000/mybucket/ES%2Bnet/file.txt" \ - -H "Authorization: AWS4-HMAC-SHA256 ..." \ - -d "data" - -# List objects (prefix in query parameter) -curl "http://localhost:9000/mybucket?prefix=path%20with%20spaces/" - -# Note: You'll also need to compute AWS Signature V4 -# This is complex - use an SDK instead! -``` - -## Troubleshooting - -### Issue: "UI can navigate to folder but can't list contents" - -**Symptom**: -- You uploaded: `mc cp file.txt "myrustfs/bucket/a f+/b/c/file.txt"` -- You can see folder `"a f+"` in the UI -- But clicking on it shows "No Objects" - -**Root Cause**: The UI may not be properly URL-encoding the prefix when making the LIST request. - -**Solution**: -1. **Use CLI instead**: `mc ls "myrustfs/bucket/a f+/b/c/"` works correctly -2. **Check UI console**: Open browser DevTools, look at Network tab, check if the request is properly encoded -3. **Report UI bug**: If using RustFS web console, this is a UI bug to report - -**Workaround**: -Use the CLI for operations with special characters until UI is fixed. 
- -### Issue: "400 Bad Request: Invalid Argument" - -**Symptom**: -``` -Error: api error InvalidArgument: Invalid argument -``` - -**Possible Causes**: - -1. **Client not encoding plus signs** - - Problem: Client sends `/bucket/ES+net/file.txt` - - Solution: Client should send `/bucket/ES%2Bnet/file.txt` - - Fix: Use a proper S3 SDK - -2. **Control characters in key** - - Problem: Key contains null bytes, newlines, etc. - - Solution: Remove invalid characters from key name - -3. **Double-encoding** - - Problem: Client encodes twice: `%20` → `%2520` - - Solution: Only encode once, or use SDK - -**Debugging**: -Enable debug logging on RustFS: -```bash -RUST_LOG=rustfs=debug ./rustfs server /data -``` - -Look for log lines like: -``` -DEBUG rustfs::storage::ecfs: PUT object with special characters in key: "a f+/file.txt" -DEBUG rustfs::storage::ecfs: LIST objects with special characters in prefix: "ES+net/" -``` - -### Issue: "NoSuchKey error but file exists" - -**Symptom**: -- Upload: `PUT /bucket/test+file.txt` works -- List: `GET /bucket?prefix=test` shows: `test+file.txt` -- Get: `GET /bucket/test+file.txt` fails with NoSuchKey - -**Root Cause**: Key was stored with one encoding, requested with another. - -**Diagnosis**: -```bash -# Check what name is actually stored -mc ls --recursive myrustfs/bucket/ - -# Try different encodings -curl "http://localhost:9000/bucket/test+file.txt" # Literal + -curl "http://localhost:9000/bucket/test%2Bfile.txt" # Encoded + -curl "http://localhost:9000/bucket/test%20file.txt" # Space (if + was meant as space) -``` - -**Solution**: Use a consistent S3 client/SDK for all operations. - -### Issue: "Special characters work in CLI but not in UI" - -**Root Cause**: This is a UI bug. The backend (RustFS) handles special characters correctly when accessed via proper S3 clients. 
- -**Verification**: -```bash -# These should all work: -mc cp file.txt "myrustfs/bucket/test with spaces/file.txt" -mc ls "myrustfs/bucket/test with spaces/" - -aws --endpoint-url=http://localhost:9000 s3 cp file.txt "s3://bucket/test with spaces/file.txt" -aws --endpoint-url=http://localhost:9000 s3 ls "s3://bucket/test with spaces/" -``` - -**Solution**: Report as UI bug. Use CLI for now. - -## Best Practices - -### 1. Use Simple Names When Possible - -Avoid special characters if you don't need them: -- ✅ Good: `my-file.txt`, `data_2024.json`, `report-final.pdf` -- ⚠️ Acceptable but complex: `my file.txt`, `data+backup.json`, `report (final).pdf` - -### 2. Always Use S3 SDKs/Clients - -Don't try to build raw HTTP requests yourself. Use: -- AWS CLI -- MinIO client (mc) -- AWS SDKs (Python/boto3, Go, Node.js, Java, etc.) -- Other S3-compatible SDKs - -### 3. Understand URL Encoding - -If you must work with URLs directly: -- **In URL paths**: Space=`%20`, Plus=`%2B`, `+` means literal plus -- **In query params**: Space=`%20` or `+`, Plus=`%2B` -- Use a URL encoding library in your language - -### 4. Test Your Client - -Before deploying: -```bash -# Test with spaces -mc cp test.txt "myrustfs/bucket/test with spaces/file.txt" -mc ls "myrustfs/bucket/test with spaces/" - -# Test with plus -mc cp test.txt "myrustfs/bucket/test+plus/file.txt" -mc ls "myrustfs/bucket/test+plus/" - -# Test with mixed -mc cp test.txt "myrustfs/bucket/test with+mixed/file.txt" -mc ls "myrustfs/bucket/test with+mixed/" -``` - -## Technical Details - -### How RustFS Handles Special Characters - -1. **Request Reception**: Client sends HTTP request with URL-encoded path - ``` - PUT /bucket/test%20file.txt - ``` - -2. **URL Decoding**: s3s library decodes the path - ```rust - let decoded = urlencoding::decode("/bucket/test%20file.txt") - // Result: "/bucket/test file.txt" - ``` - -3. **Storage**: Object stored with decoded name - ``` - Stored as: "test file.txt" - ``` - -4. 
**Retrieval**: Object retrieved by decoded name - ```rust - let key = "test file.txt"; // Already decoded by s3s - store.get_object(bucket, key) - ``` - -5. **Response**: Key returned in response (decoded) - ```xml - <Key>test file.txt</Key> - ``` - -6. **Client Display**: S3 clients display the decoded name - ``` - Shows: test file.txt - ``` - -### URL Encoding Standards - -RustFS follows: -- **RFC 3986**: URI Generic Syntax -- **AWS S3 API**: Object key encoding rules -- **HTTP/1.1**: URL encoding in request URIs - -Key points: -- Keys are UTF-8 strings -- URL encoding is only for HTTP transport -- Keys are stored and compared in decoded form - -## FAQs - -**Q: Can I use spaces in object names?** -A: Yes, but use an S3 SDK which handles encoding automatically. - -**Q: Why does `+` not work as a space?** -A: In URL paths, `+` represents a literal plus sign. Only in query parameters does `+` mean space. Use `%20` for spaces in paths. - -**Q: Does RustFS support Unicode in object names?** -A: Yes, object names are UTF-8 strings. They support any valid UTF-8 character. - -**Q: What characters are forbidden?** -A: Control characters (null byte, newline, carriage return) are rejected. All printable characters are allowed. - -**Q: How do I fix "UI can't list folder" issue?** -A: Use the CLI (mc or aws-cli) instead. This is a UI bug, not a backend issue. - -**Q: Why do some clients work but others don't?** -A: Proper S3 SDKs handle encoding correctly. Custom clients may have bugs. Always use official SDKs. - -## Getting Help - -If you encounter issues: - -1. **Check this guide first** -2. **Verify you're using an S3 SDK** (not raw HTTP) -3. **Test with mc client** to isolate if issue is backend or client -4. **Enable debug logging** on RustFS: `RUST_LOG=rustfs=debug` -5. 
**Report issues** at: https://github.com/rustfs/rustfs/issues - -Include in bug reports: -- Client/SDK used (and version) -- Exact object name causing issue -- Whether mc client works -- Debug logs from RustFS - ---- - -**Last Updated**: 2025-12-09 -**RustFS Version**: 0.0.5+ -**Related Documents**: -- [Special Characters Analysis](./special-characters-in-path-analysis.md) -- [Special Characters Solution](./special-characters-solution.md) diff --git a/docs/cluster_recovery.md b/docs/cluster_recovery.md deleted file mode 100644 index 6e0bef3d4..000000000 --- a/docs/cluster_recovery.md +++ /dev/null @@ -1,156 +0,0 @@ -# Resolution Report: Issue #1001 - Cluster Recovery from Abrupt Power-Off - -## 1. Issue Description -**Problem**: The cluster failed to recover gracefully when a node experienced an abrupt power-off (hard failure). -**Symptoms**: -- The application became unable to upload files. -- The Console Web UI became unresponsive across the cluster. -- The `rustfsadmin` user was unable to log in after a server power-off. -- The performance page displayed 0 storage, 0 objects, and 0 servers online/offline. -- The system "hung" indefinitely, unlike the immediate recovery observed during a graceful process termination (`kill`). - -**Root Cause (Multi-Layered)**: -1. **TCP Connection Issue**: The standard TCP protocol does not immediately detect a silent peer disappearance (power loss) because no `FIN` or `RST` packets are sent. -2. **Stale Connection Cache**: Cached gRPC connections in `GLOBAL_Conn_Map` were reused even when the peer was dead, causing blocking on every RPC call. -3. **Blocking IAM Notifications**: Login operations blocked waiting for ALL peers to acknowledge user/policy changes. -4. **No Per-Peer Timeouts**: Console aggregation calls like `server_info()` and `storage_info()` could hang waiting for dead peers. - ---- - -## 2. Technical Approach -To resolve this, we implemented a comprehensive multi-layered resilience strategy. 
- -### Key Objectives: -1. **Fail Fast**: Detect dead peers in seconds, not minutes. -2. **Evict Stale Connections**: Automatically remove dead connections from cache to force reconnection. -3. **Non-Blocking Operations**: Auth and IAM operations should not wait for dead peers. -4. **Graceful Degradation**: Console should show partial data from healthy nodes, not hang. - ---- - -## 3. Implemented Solution - -### Solution Overview -The fix implements a multi-layered detection strategy covering both Control Plane (RPC) and Data Plane (Streaming): - -1. **Control Plane (gRPC)**: - * Enabled `http2_keep_alive_interval` (5s) and `keep_alive_timeout` (3s) in `tonic` clients. - * Enforced `tcp_keepalive` (10s) on underlying transport. - * Context: Ensures cluster metadata operations (raft, status checks) fail fast if a node dies. - -2. **Data Plane (File Uploads/Downloads)**: - * **Client (Rio)**: Updated `reqwest` client builder in `crates/rio` to enable TCP Keepalive (10s) and HTTP/2 Keepalive (5s). This prevents hangs during large file streaming (e.g., 1GB uploads). - * **Server**: Enabled `SO_KEEPALIVE` on all incoming TCP connections in `rustfs/src/server/http.rs` to forcefully close sockets from dead clients. - -3. **Cross-Platform Build Stability**: - * Guarded Linux-specific profiling code (`jemalloc_pprof`) with `#[cfg(target_os = "linux")]` to fix build failures on macOS/AArch64. 
- -### Configuration Changes - -```rust -pub async fn storage_info<S: StorageAPI>(&self, api: &S) -> rustfs_madmin::StorageInfo { - let peer_timeout = Duration::from_secs(2); - - for client in self.peer_clients.iter() { - futures.push(async move { - if let Some(client) = client { - match timeout(peer_timeout, client.local_storage_info()).await { - Ok(Ok(info)) => Some(info), - Ok(Err(_)) | Err(_) => { - // Return offline status for dead peer - Some(rustfs_madmin::StorageInfo { - disks: get_offline_disks(&host, &endpoints), - ..Default::default() - }) - } - } - } - }); - } - // Rest continues even if some peers are down -} -``` - -### Fix 4: Enhanced gRPC Client Configuration - -**File Modified**: `crates/protos/src/lib.rs` - -**Configuration**: -```rust -const CONNECT_TIMEOUT_SECS: u64 = 3; // Reduced from 5s -const TCP_KEEPALIVE_SECS: u64 = 10; // OS-level keepalive -const HTTP2_KEEPALIVE_INTERVAL_SECS: u64 = 5; // HTTP/2 PING interval -const HTTP2_KEEPALIVE_TIMEOUT_SECS: u64 = 3; // PING ACK timeout -const RPC_TIMEOUT_SECS: u64 = 30; // Reduced from 60s - -let connector = Endpoint::from_shared(addr.to_string())? - .connect_timeout(Duration::from_secs(CONNECT_TIMEOUT_SECS)) - .tcp_keepalive(Some(Duration::from_secs(TCP_KEEPALIVE_SECS))) - .http2_keep_alive_interval(Duration::from_secs(HTTP2_KEEPALIVE_INTERVAL_SECS)) - .keep_alive_timeout(Duration::from_secs(HTTP2_KEEPALIVE_TIMEOUT_SECS)) - .keep_alive_while_idle(true) - .timeout(Duration::from_secs(RPC_TIMEOUT_SECS)); -``` - ---- - -## 4. 
Files Changed Summary - -| File | Change | -|------|--------| -| `crates/common/src/globals.rs` | Added `evict_connection()`, `has_cached_connection()`, `clear_all_connections()` | -| `crates/common/Cargo.toml` | Added `tracing` dependency | -| `crates/protos/src/lib.rs` | Refactored to use constants, added `evict_failed_connection()`, improved documentation | -| `crates/protos/Cargo.toml` | Added `tracing` dependency | -| `crates/ecstore/src/rpc/peer_rest_client.rs` | Added auto-eviction on RPC failure for `server_info()` and `local_storage_info()` | -| `crates/ecstore/src/notification_sys.rs` | Added per-peer timeout to `storage_info()` | -| `crates/iam/src/sys.rs` | Made `notify_for_user()`, `notify_for_service_account()`, `notify_for_group()` non-blocking | - ---- - -## 5. Test Results - -All 299 tests pass: -``` -test result: ok. 299 passed; 0 failed; 0 ignored -``` - ---- - -## 6. Expected Behavior After Fix - -| Scenario | Before | After | -|----------|--------|-------| -| Node power-off | Cluster hangs indefinitely | Cluster recovers in ~8 seconds | -| Login during node failure | Login hangs | Login succeeds immediately | -| Console during node failure | Shows 0/0/0 | Shows partial data from healthy nodes | -| Upload during node failure | Upload stops | Upload fails fast, can be retried | -| Stale cached connection | Blocks forever | Auto-evicted, fresh connection attempted | - ---- - -## 7. Verification Steps - -1. **Start a 3+ node RustFS cluster** -2. **Test Console Recovery**: - - Access console dashboard - - Forcefully kill one node (e.g., `kill -9`) - - Verify dashboard updates within 10 seconds showing offline status -3. **Test Login Recovery**: - - Kill a node while logged out - - Attempt login with `rustfsadmin` - - Verify login succeeds within 5 seconds -4. **Test Upload Recovery**: - - Start a large file upload - - Kill the target node mid-upload - - Verify upload fails fast (not hangs) and can be retried - ---- - -## 8. 
Related Issues -- Issue #1001: Cluster Recovery from Abrupt Power-Off -- PR #1035: fix(net): resolve 1GB upload hang and macos build - -## 9. Contributors -- Initial keepalive fix: Original PR #1035 -- Deep-rooted reliability fix: This update diff --git a/docs/compression-best-practices.md b/docs/compression-best-practices.md deleted file mode 100644 index 6a10e7dbc..000000000 --- a/docs/compression-best-practices.md +++ /dev/null @@ -1,444 +0,0 @@ -# HTTP Response Compression Best Practices in RustFS - -## Overview - -This document outlines best practices for HTTP response compression in RustFS, based on lessons learned from fixing the -NoSuchKey error response regression (Issue #901) and the whitelist-based compression redesign (Issue #902). - -## Whitelist-Based Compression (Issue #902) - -### Design Philosophy - -After Issue #901, we identified that the blacklist approach (compress everything except known problematic types) was -still causing issues with browser downloads showing "unknown file size". In Issue #902, we redesigned the compression -system using a **whitelist approach** aligned with MinIO's behavior: - -1. **Compression is disabled by default** - Opt-in rather than opt-out -2. **Only explicitly configured content types are compressed** - Preserves Content-Length for all other responses -3. **Fine-grained configuration** - Control via file extensions, MIME types, and size thresholds -4. 
**Skip already-encoded content** - Avoid double compression - -### Configuration Options - -RustFS provides flexible compression configuration via environment variables and command-line arguments: - -| Environment Variable | CLI Argument | Default | Description | -|---------------------|--------------|---------|-------------| -| `RUSTFS_COMPRESS_ENABLE` | | `false` | Enable/disable compression | -| `RUSTFS_COMPRESS_EXTENSIONS` | | `""` | File extensions to compress (e.g., `.txt,.log,.csv`) | -| `RUSTFS_COMPRESS_MIME_TYPES` | | `text/*,application/json,...` | MIME types to compress (supports wildcards) | -| `RUSTFS_COMPRESS_MIN_SIZE` | | `1000` | Minimum file size (bytes) for compression | - -### Usage Examples - -```bash -# Enable compression for text files and JSON -RUSTFS_COMPRESS_ENABLE=on \ -RUSTFS_COMPRESS_EXTENSIONS=.txt,.log,.csv,.json,.xml \ -RUSTFS_COMPRESS_MIME_TYPES=text/*,application/json,application/xml \ -RUSTFS_COMPRESS_MIN_SIZE=1000 \ -rustfs /data - -# Or using command-line arguments -rustfs /data \ - --compress-enable \ - --compress-extensions ".txt,.log,.csv" \ - --compress-mime-types "text/*,application/json" \ - --compress-min-size 1000 -``` - -### Implementation Details - -The `CompressionPredicate` implements intelligent compression decisions: - -```rust -impl Predicate for CompressionPredicate { - fn should_compress<B>(&self, response: &Response<B>) -> bool { - // 1. Check if compression is enabled - if !self.config.enabled { return false; } - - // 2. Never compress error responses - if status.is_client_error() || status.is_server_error() { return false; } - - // 3. Skip already-encoded content (gzip, br, deflate, etc.) - if has_content_encoding(response) { return false; } - - // 4. Check minimum size threshold - if content_length < self.config.min_size { return false; } - - // 5. Check whitelist: extension OR MIME type must match - if matches_extension(response) || matches_mime_type(response) { - return true; - } - - // 6. 
Default: don't compress (whitelist approach) - false - } -} -``` - -### Benefits of Whitelist Approach - -| Aspect | Blacklist (Old) | Whitelist (New) | -|--------|-----------------|-----------------| -| Default behavior | Compress most content | No compression | -| Content-Length | Often removed | Preserved for unmatched types | -| Browser downloads | "Unknown file size" | Accurate file size shown | -| Configuration | Complex exclusion rules | Simple inclusion rules | -| MinIO compatibility | Different behavior | Aligned behavior | - -## Key Principles - -### 1. Never Compress Error Responses - -**Rationale**: Error responses are typically small (100-500 bytes) and need to be transmitted accurately. Compression -can: - -- Introduce Content-Length header mismatches -- Add unnecessary overhead for small payloads -- Potentially corrupt error details during buffering - -**Implementation**: - -```rust -// Always check status code first -if status.is_client_error() || status.is_server_error() { - return false; // Don't compress -} -``` - -**Affected Status Codes**: - -- 4xx Client Errors (400, 403, 404, etc.) -- 5xx Server Errors (500, 502, 503, etc.) - -### 2. Size-Based Compression Threshold - -**Rationale**: Compression has overhead in terms of CPU and potentially network roundtrips. For very small responses: - -- Compression overhead > space savings -- May actually increase payload size -- Adds latency without benefit - -**Recommended Threshold**: 1000 bytes minimum (configurable via `RUSTFS_COMPRESS_MIN_SIZE`) - -**Implementation**: - -```rust -if let Some(content_length) = response.headers().get(CONTENT_LENGTH) { - if let Ok(length) = content_length.to_str()?.parse::<u64>()? { - if length < self.config.min_size { - return false; // Don't compress small responses - } - } -} -``` - -### 3. Skip Already-Encoded Content - -**Rationale**: If the response already has a `Content-Encoding` header (e.g., gzip, br, deflate, zstd), the content -is already compressed. 
Re-compressing provides no benefit and may cause issues: - -- Double compression wastes CPU cycles -- May corrupt data or increase size -- Breaks decompression on client side - -**Implementation**: - -```rust -// Skip if content is already encoded (e.g., gzip, br, deflate, zstd) -if let Some(content_encoding) = response.headers().get(CONTENT_ENCODING) { - if let Ok(encoding) = content_encoding.to_str() { - let encoding_lower = encoding.to_lowercase(); - // "identity" means no encoding, so we can still compress - if encoding_lower != "identity" && !encoding_lower.is_empty() { - debug!("Skipping compression for already encoded response: {}", encoding); - return false; - } - } -} -``` - -**Common Content-Encoding Values**: - -- `gzip` - GNU zip compression -- `br` - Brotli compression -- `deflate` - Deflate compression -- `zstd` - Zstandard compression -- `identity` - No encoding (compression allowed) - -### 4. Maintain Observability - -**Rationale**: Compression decisions can affect debugging and troubleshooting. Always log when compression is skipped. - -**Implementation**: - -```rust -debug!( - "Skipping compression for error response: status={}", - status.as_u16() -); -``` - -**Log Analysis**: - -```bash -# Monitor compression decisions -RUST_LOG=rustfs::server::http=debug ./target/release/rustfs - -# Look for patterns -grep "Skipping compression" logs/rustfs.log | wc -l -``` - -## Common Pitfalls - -### ❌ Compressing All Responses Blindly - -```rust -// BAD - No filtering -.layer(CompressionLayer::new()) -``` - -**Problem**: Can cause Content-Length mismatches with error responses and browser download issues - -### ❌ Using Blacklist Approach - -```rust -// BAD - Blacklist approach (compress everything except...) -fn should_compress(&self, response: &Response<B>) -> bool { - // Skip images, videos, archives... 
-    if is_already_compressed_type(content_type) { return false; }
-    true // Compress everything else
-}
-```
-
-**Problem**: Removes Content-Length for many file types, causing "unknown file size" in browsers
-
-### ✅ Using Whitelist-Based Predicate
-
-```rust
-// GOOD - Whitelist approach with configurable predicate
-.layer(CompressionLayer::new().compress_when(CompressionPredicate::new(config)))
-```
-
-### ❌ Ignoring Content-Encoding Header
-
-```rust
-// BAD - May double-compress already compressed content
-fn should_compress<B>(&self, response: &Response<B>) -> bool {
-    matches_mime_type(response) // Missing Content-Encoding check
-}
-```
-
-**Problem**: Double compression wastes CPU and can leave clients unable to decode the body
-
-### ✅ Comprehensive Checks
-
-```rust
-// GOOD - Multi-criteria whitelist decision
-fn should_compress<B>(&self, response: &Response<B>) -> bool {
-    // 1. Must be enabled
-    if !self.config.enabled { return false; }
-
-    // 2. Skip error responses (4xx and 5xx)
-    let status = response.status();
-    if status.is_client_error() || status.is_server_error() { return false; }
-
-    // 3. Skip already-encoded content
-    if has_content_encoding(response) { return false; }
-
-    // 4. Check minimum size
-    if get_content_length(response) < self.config.min_size { return false; }
-
-    // 5. Must match whitelist (extension OR MIME type)
-    matches_extension(response) || matches_mime_type(response)
-}
-```
-
-## Performance Considerations
-
-### CPU Usage
-
-- **Compression CPU Cost**: ~1-5ms for typical responses
-- **Benefit**: 70-90% size reduction for text/json
-- **Break-even**: Responses > 512 bytes on fast networks
-
-### Network Latency
-
-- **Savings**: Proportional to size reduction
-- **Break-even**: ~256 bytes on typical connections
-- **Diminishing Returns**: Below 128 bytes
-
-### Memory Usage
-
-- **Buffer Size**: Usually 4-16KB per connection
-- **Trade-off**: Memory vs. bandwidth
-- **Recommendation**: Profile in production
-
-## Testing Guidelines
-
-### Unit Tests
-
-Test compression predicate logic:
-
-```rust
-use http::{header::CONTENT_LENGTH, Response};
-use tower_http::compression::predicate::Predicate;
-
-#[test]
-fn test_should_not_compress_errors() {
-    let predicate = ShouldCompress;
-    let response = Response::builder()
-        .status(404)
-        .body(String::new())
-        .unwrap();
-
-    assert!(!predicate.should_compress(&response));
-}
-
-#[test]
-fn test_should_not_compress_small_responses() {
-    let predicate = ShouldCompress;
-    let response = Response::builder()
-        .status(200)
-        .header(CONTENT_LENGTH, "100")
-        .body(String::new())
-        .unwrap();
-
-    assert!(!predicate.should_compress(&response));
-}
-```
-
-### Integration Tests
-
-Test actual S3 API responses:
-
-```rust
-use aws_sdk_s3::error::SdkError;
-
-#[tokio::test]
-async fn test_error_response_not_truncated() {
-    // `client` is an `aws_sdk_s3::Client` configured for the RustFS endpoint
-    let response = client
-        .get_object()
-        .bucket("test")
-        .key("nonexistent")
-        .send()
-        .await;
-
-    // Should get proper error, not truncation error
-    match response.unwrap_err() {
-        SdkError::ServiceError(err) => {
-            // `err.err()` returns the operation error (`GetObjectError`)
-            assert!(err.err().is_no_such_key());
-        }
-        other => panic!("Expected ServiceError, got {:?}", other),
-    }
-}
-```
-
-## Monitoring and Alerts
-
-### Metrics to Track
-
-1. **Compression Ratio**: `compressed_size / original_size`
-2. **Compression Skip Rate**: `skipped_count / total_count`
-3. **Error Response Size Distribution**
-4. **CPU Usage During Compression**
-
-### Alert Conditions
-
-```yaml
-# Prometheus alert rules
-- alert: HighCompressionSkipRate
-  expr: |
-    rate(http_compression_skipped_total[5m])
-      / rate(http_responses_total[5m]) > 0.5
-  annotations:
-    summary: "More than 50% of responses skipping compression"
-
-- alert: LargeErrorResponses
-  expr: |
-    histogram_quantile(0.95,
-      rate(http_error_response_size_bytes_bucket[5m])) > 1024
-  annotations:
-    summary: "95th percentile error response size above 1KB"
-```
-
-## Migration Guide
-
-### Migrating from Blacklist to Whitelist Approach
-
-If you're upgrading from an older RustFS version with blacklist-based compression:
-
-1. **Compression is now disabled by default**
-   - Set `RUSTFS_COMPRESS_ENABLE=on` to enable
-   - Nothing is compressed until you opt in, so existing deployments keep their `Content-Length` headers
-
-2. **Configure your whitelist**
-   ```bash
-   # Example: Enable compression for common text formats
-   RUSTFS_COMPRESS_ENABLE=on
-   RUSTFS_COMPRESS_EXTENSIONS=.txt,.log,.csv,.json,.xml,.html,.css,.js
-   RUSTFS_COMPRESS_MIME_TYPES=text/*,application/json,application/xml,application/javascript
-   RUSTFS_COMPRESS_MIN_SIZE=1000
-   ```
-
-3. **Verify browser downloads**
-   - Check that file downloads show accurate file sizes
-   - Verify Content-Length headers are preserved for non-compressed content
-
-### Updating Existing Code
-
-If you're adding compression to an existing service:
-
-1. **Start with compression disabled** (default)
-2. **Define your whitelist**: Identify content types that benefit from compression
-3. **Set appropriate thresholds**: Start with a 1000-byte minimum size (`RUSTFS_COMPRESS_MIN_SIZE=1000`)
-4. **Enable and monitor**: Watch CPU, latency, and download behavior
-
-### Rollout Strategy
-
-1. **Stage 1**: Deploy to canary (5% traffic)
-   - Monitor for 24 hours
-   - Check error rates and latency
-   - Verify browser download behavior
-
-2. **Stage 2**: Expand to 25% traffic
-   - Monitor for 48 hours
-   - Validate compression ratios
-   - Check Content-Length preservation
-
-3. **Stage 3**: Full rollout (100% traffic)
-   - Continue monitoring for 1 week
-   - Document any issues
-   - Fine-tune whitelist based on actual usage
-
-## Related Documentation
-
-- [Fix NoSuchKey Regression](./fix-nosuchkey-regression.md)
-- [tower-http Compression](https://docs.rs/tower-http/latest/tower_http/compression/)
-- [HTTP Content-Encoding](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding)
-
-## Architecture
-
-### Module Structure
-
-The compression functionality is organized in a dedicated module for maintainability:
-
-```
-rustfs/src/server/
-├── compress.rs    # Compression configuration and predicate
-├── http.rs        # HTTP server (uses compress module)
-└── mod.rs         # Module declarations
-```
-
-### Key Components
-
-1. **`CompressionConfig`** - Stores compression settings parsed from environment/CLI
-2. **`CompressionPredicate`** - Implements `tower_http::compression::predicate::Predicate`
-3. **Configuration Constants** - Defined in `crates/config/src/constants/compress.rs`
-
-## References
-
-1. Issue #901: NoSuchKey error response regression
-2. Issue #902: Whitelist-based compression redesign
-3. [Google Web Fundamentals - Text Compression](https://web.dev/reduce-network-payloads-using-text-compression/)
-4. [AWS Best Practices - Response Compression](https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/)
-
----
-
-**Last Updated**: 2025-12-13
-**Maintainer**: RustFS Team
diff --git a/docs/console-separation.md b/docs/console-separation.md
deleted file mode 100644
index aa038cefb..000000000
--- a/docs/console-separation.md
+++ /dev/null
@@ -1,1334 +0,0 @@
-# RustFS Console & Endpoint Service Separation Guide
-
-This document explains how RustFS separates the web console from the S3 API endpoint so that the management interface and the S3 API can be deployed, secured, and monitored independently, and how to run both with Docker and Kubernetes.
-
-## Table of Contents
-
-- [Overview](#overview)
-- [Architecture](#architecture)
-- [Quick Start](#quick-start)
-- [Configuration Reference](#configuration-reference)
-- [Docker Deployment](#docker-deployment)
-- [Kubernetes Deployment](#kubernetes-deployment)
-- [Security Hardening](#security-hardening)
-- [Health Monitoring](#health-monitoring)
-- [Troubleshooting](#troubleshooting)
-- [Migration Guide](#migration-guide)
-- [Best Practices](#best-practices)
-
-## Overview
-
-RustFS implements complete separation between the console web interface and the S3 API endpoint service, enabling:
-
-- **Independent Port Management**: Console (`:9001`) and API (`:9000`) run on separate ports
-- **Enhanced Security**: Different CORS policies, TLS configurations, and access controls
-- **Flexible Deployment**: Console can be disabled or restricted to internal networks
-- **Docker-Native**: Optimized for containerized deployments with proper port mapping
-- **Enterprise Ready**: Rate limiting, authentication timeouts, and comprehensive monitoring
-
-## Architecture
-
-### Service Components
-
-- **S3 API Endpoint** (Port 9000)
-  - Handles all S3-compatible API requests
-  - Independent CORS configuration via `RUSTFS_CORS_ALLOWED_ORIGINS`
-  - Health check endpoint: `GET /health`
-  - Production-ready with comprehensive error handling
-
-- **Console Interface** (Port 9001)
-  - Web-based management dashboard at `/rustfs/console/`
-  - Independent CORS configuration via `RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS`
-  - TLS support using shared certificate infrastructure
-  - Rate limiting and authentication timeout controls
-  - Health check endpoint: `GET /health`
-
-### Communication Flow
-
-```
-Browser → Console (9001) → API Endpoint (9000) → Storage Backend
-```
-
-## Quick Start
-
-### Local Development
-
-```bash
-# Start with default configuration
-rustfs /data/volume
-
-# Access points:
-# API: http://localhost:9000
-# Console: http://localhost:9001/rustfs/console/
-```
-
-### Docker Quick Start
-
-```bash
-# Basic Docker deployment
-docker run -d \
-  --name rustfs \
-  -p 9020:9000 -p 9021:9001 \
-  rustfs/rustfs:latest
-
-# Access points:
-# API: http://localhost:9020
-# Console: http://localhost:9021/rustfs/console/
-```
-
-### Production Quick Start
-
-Use our enhanced deployment script for production-ready setup:
-
-```bash
-# Use the enhanced security deployment script
-./examples/enhanced-security-deployment.sh
-
-# Or customize the enhanced Docker deployment
-./examples/enhanced-docker-deployment.sh prod
-```
-
-## Configuration Reference
-
-### Core Service Configuration
-
-| Parameter | Environment Variable | Default | Description |
-|-----------|---------------------|---------|-------------|
-| `address` | `RUSTFS_ADDRESS` | `:9000` | S3 API endpoint bind address |
-| `console_address` | `RUSTFS_CONSOLE_ADDRESS` | `:9001` | Console service bind address |
-| `console_enable` | `RUSTFS_CONSOLE_ENABLE` | `true` | Enable/disable console service |
-
-### CORS Configuration
-
-| Parameter | Environment Variable | Default | Description |
-|-----------|---------------------|---------|-------------|
-| `cors_allowed_origins` | `RUSTFS_CORS_ALLOWED_ORIGINS` | `*` | Comma-separated allowed origins for endpoint CORS |
-| `console_cors_allowed_origins` | `RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS` | `*` | Comma-separated allowed origins for console CORS |
-
-### Security Configuration
-
-| Parameter | Environment Variable | Default | Description |
-|-----------|---------------------|---------|-------------|
-| `tls_path` | `RUSTFS_TLS_PATH` | - | TLS certificate directory path (shared by both services) |
-| `console_rate_limit_enable` | `RUSTFS_CONSOLE_RATE_LIMIT_ENABLE` | `false` | Enable rate limiting for console access |
-| `console_rate_limit_rpm` | `RUSTFS_CONSOLE_RATE_LIMIT_RPM` | `100` | Console rate limit (requests per minute) |
-| `console_auth_timeout` | `RUSTFS_CONSOLE_AUTH_TIMEOUT` | `3600` | Console authentication timeout (seconds) |
-
-### Authentication Configuration
-
-| Parameter | Environment Variable | Default | Description |
-|-----------|---------------------|---------|-------------|
-| `access_key` | `RUSTFS_ACCESS_KEY` | `rustfsadmin` | Administrative access key |
-| `secret_key` | `RUSTFS_SECRET_KEY` | `rustfsadmin` | Administrative secret key |
-
-### Important Notes
-
-- **TLS Configuration**: Console uses shared TLS certificates from `RUSTFS_TLS_PATH` (no separate cert config needed).
-- **Environment Priority**: Console security settings are read directly from environment variables.
-
-## Docker Deployment
-
-### Prerequisites
-
-Ensure Docker is installed and the RustFS image is available:
-
-```bash
-# Pull the latest RustFS image
-docker pull rustfs/rustfs:latest
-
-# Or build from source
-docker build -t rustfs/rustfs:latest .
-```
-
-### Basic Docker Deployment
-
-Simple deployment with port mapping:
-
-```bash
-# API: host 9020 → container 9000
-# Console: host 9021 → container 9001
-# (comments cannot follow a line-continuation backslash, so they go above the command)
-docker run -d \
-  --name rustfs-basic \
-  -p 9020:9000 \
-  -p 9021:9001 \
-  -e RUSTFS_CORS_ALLOWED_ORIGINS="http://localhost:9021" \
-  -v rustfs-data:/data \
-  rustfs/rustfs:latest
-
-# Access:
-# API: http://localhost:9020
-# Console: http://localhost:9021/rustfs/console/
-```
-
-### Docker Compose Deployment
-
-Use the provided `docker-compose.yml` for complete setup:
-
-```bash
-# Start the complete stack
-docker-compose up -d
-
-# Start with specific profiles
-docker-compose --profile dev up -d            # Development environment
-docker-compose --profile observability up -d  # With monitoring stack
-```
-
-The compose configuration provides:
-
-- **Production Service** (`rustfs`): Ports 9000:9000 and 9001:9001
-- **Development Service** (`rustfs-dev`): Ports 9010:9000 and 9011:9001
-- **Observability Stack**: Grafana, Prometheus, Jaeger, and OpenTelemetry
-- **Reverse Proxy**: Nginx configuration for production deployments
-
-### Enhanced Docker Deployment Scripts
-
-#### Production Deployment with Security
-
-```bash
-# Use the enhanced security deployment script
-./examples/enhanced-security-deployment.sh
-
-# This will:
-# - Generate TLS certificates
-# - Create secure credentials
-# - Deploy with rate limiting
-# - Configure restricted CORS
-# - Enable health monitoring
-```
-
-#### Multiple Environment Deployment
-
-```bash
-# Deploy different environments simultaneously
-./examples/enhanced-docker-deployment.sh all
-
-# Individual deployments:
-./examples/enhanced-docker-deployment.sh basic  # Basic setup
-./examples/enhanced-docker-deployment.sh dev    # Development environment
-./examples/enhanced-docker-deployment.sh prod   # Production-like setup
-```
-
-### Custom Docker Deployment Examples
-
-#### Development Environment
-
-```bash
-docker run -d \
-  --name rustfs-dev \
-  -p 9000:9000 -p 9001:9001 \
-  -e RUSTFS_CORS_ALLOWED_ORIGINS="*" \
-  -e RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS="*" \
-  -e RUSTFS_ACCESS_KEY="dev-admin" \
-  -e RUSTFS_SECRET_KEY="dev-secret" \
-  -v rustfs-dev-data:/data \
-  rustfs/rustfs:latest
-```
-
-#### Production with TLS and Security
-
-```bash
-# Generate credentials first so they can be recorded (e.g., in a secrets manager)
-RUSTFS_ACCESS_KEY="$(openssl rand -hex 16)"
-RUSTFS_SECRET_KEY="$(openssl rand -hex 32)"
-
-docker run -d \
-  --name rustfs-production \
-  -p 9443:9001 -p 9000:9000 \
-  -v /path/to/certs:/certs:ro \
-  -v /path/to/data:/data \
-  -e RUSTFS_TLS_PATH="/certs" \
-  -e RUSTFS_CONSOLE_RATE_LIMIT_ENABLE="true" \
-  -e RUSTFS_CONSOLE_RATE_LIMIT_RPM="60" \
-  -e RUSTFS_CONSOLE_AUTH_TIMEOUT="1800" \
-  -e RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS="https://admin.yourdomain.com" \
-  -e RUSTFS_CORS_ALLOWED_ORIGINS="https://api.yourdomain.com" \
-  -e RUSTFS_ACCESS_KEY="$RUSTFS_ACCESS_KEY" \
-  -e RUSTFS_SECRET_KEY="$RUSTFS_SECRET_KEY" \
-  rustfs/rustfs:latest
-```
-
-#### Console-Disabled API-Only Deployment
-
-```bash
-docker run -d \
-  --name rustfs-api-only \
-  -p 9000:9000 \
-  -e RUSTFS_CONSOLE_ENABLE="false" \
-  -e RUSTFS_CORS_ALLOWED_ORIGINS="https://your-app.com" \
-  -v rustfs-api-data:/data \
-  rustfs/rustfs:latest
-
-# Only API available: http://localhost:9000
-```
-
-### Docker Health Checks
-
-The Dockerfile includes health checks for both services:
-
-```dockerfile
-HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
-  CMD curl -f http://localhost:9000/health && curl -f http://localhost:9001/health || exit 1
-```
-
-If the console is disabled (`RUSTFS_CONSOLE_ENABLE=false`), the second `curl` fails, so override the health check for API-only containers.
-
-Check container health:
-
-```bash
-# View health status
-docker ps --format "table {{.Names}}\t{{.Status}}"
-
-# View detailed health check logs
-docker inspect rustfs --format='{{json .State.Health}}' | jq
-```
-
-## Kubernetes Deployment
-
-### Basic Kubernetes Deployment
-
-```yaml
-apiVersion: apps/v1
-kind: Deployment
-metadata:
-  name: rustfs
-  labels:
-    app: rustfs
-spec:
-  replicas: 1
-  selector:
-    matchLabels:
-      app: rustfs
-  template:
-    metadata:
-      labels:
-        app: rustfs
-    spec:
-      containers:
-        - name: rustfs
-          image: rustfs/rustfs:latest
-          ports:
-            - containerPort: 9000
-              name: api
-            - containerPort: 9001
-              name: console
-          env:
-            - name: RUSTFS_ADDRESS
-              value: "0.0.0.0:9000"
-            - name: RUSTFS_CONSOLE_ADDRESS
-              value: "0.0.0.0:9001"
-            - name: RUSTFS_CORS_ALLOWED_ORIGINS
-              value: "*"
-            - name: RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS
-              value: "*"
-          livenessProbe:
-            httpGet:
-              path: /health
-              port: 9000
-            initialDelaySeconds: 30
-            periodSeconds: 10
-          readinessProbe:
-            httpGet:
-              path: /health
-              port: 9001
-            initialDelaySeconds: 5
-            periodSeconds: 5
-          volumeMounts:
-            - name: data
-              mountPath: /data
-      volumes:
-        - name: data
-          persistentVolumeClaim:
-            claimName: rustfs-data
-
----
-apiVersion: v1
-kind: Service
-metadata:
-  name: rustfs-service
-spec:
-  selector:
-    app: rustfs
-  ports:
-    - name: api
-      port: 9000
-      targetPort: 9000
-    - name: console
-      port: 9001
-      targetPort: 9001
-  type: LoadBalancer
-
----
-apiVersion: v1
-kind: PersistentVolumeClaim
-metadata:
-  name: rustfs-data
-spec:
-  accessModes:
-    - ReadWriteOnce
-  resources:
-    requests:
-      storage: 10Gi
-```
-
-### Production Kubernetes with TLS
-
-```yaml
-apiVersion: apps/v1
-kind: Deployment
-metadata:
-  name: rustfs-production
-spec:
-  # Note: a ReadWriteOnce PVC cannot be mounted by replicas on different nodes;
-  # for multiple replicas, use a StatefulSet with per-pod volumeClaimTemplates
-  replicas: 3
-  selector:
-    matchLabels:
-      app: rustfs-production
-  template:
-    metadata:
-      labels:
-        app: rustfs-production
-    spec:
-      containers:
-        - name: rustfs
-          image: rustfs/rustfs:latest
-          env:
-            - name: RUSTFS_TLS_PATH
-              value: "/certs"
-            - name: RUSTFS_CONSOLE_RATE_LIMIT_ENABLE
-              value: "true"
-            - name: RUSTFS_CONSOLE_RATE_LIMIT_RPM
-              value: "100"
-            - name: RUSTFS_CONSOLE_AUTH_TIMEOUT
-              value: "1800"
-            - name: RUSTFS_ACCESS_KEY
-              valueFrom:
-                secretKeyRef:
-                  name: rustfs-credentials
-                  key: access-key
-            - name: RUSTFS_SECRET_KEY
-              valueFrom:
-                secretKeyRef:
-                  name: rustfs-credentials
-                  key: secret-key
-          volumeMounts:
-            - name: certs
-              mountPath: /certs
-              readOnly: true
-            - name: data
-              mountPath: /data
-      volumes:
-        - name: certs
-          secret:
-            secretName: rustfs-tls
-        - name: data
-          persistentVolumeClaim:
-            claimName: rustfs-production-data
-
----
-apiVersion: v1
-kind: Secret
-metadata:
-  name: rustfs-credentials
-type: Opaque
-stringData:
-  access-key: "your-secure-access-key"
-  secret-key: "your-secure-secret-key"
-```
-
-### Ingress Configuration
-
-```yaml
-apiVersion: networking.k8s.io/v1
-kind: Ingress
-metadata:
-  name: rustfs-ingress
-  annotations:
-    nginx.ingress.kubernetes.io/ssl-redirect: "true"
-    # Lift the default 1m request body limit so large S3 uploads are not rejected
-    nginx.ingress.kubernetes.io/proxy-body-size: "0"
-    # cors-allow-origin only takes effect when enable-cors is "true"
-    nginx.ingress.kubernetes.io/enable-cors: "true"
-    nginx.ingress.kubernetes.io/cors-allow-origin: "https://admin.yourdomain.com"
-spec:
-  tls:
-    - hosts:
-        - api.yourdomain.com
-        - admin.yourdomain.com
-      secretName: rustfs-tls-ingress
-  rules:
-    - host: api.yourdomain.com
-      http:
-        paths:
-          - path: /
-            pathType: Prefix
-            backend:
-              service:
-                name: rustfs-service
-                port:
-                  number: 9000
-    - host: admin.yourdomain.com
-      http:
-        paths:
-          - path: /
-            pathType: Prefix
-            backend:
-              service:
-                name: rustfs-service
-                port:
-                  number: 9001
-```
-
-## Security Hardening
-
-### TLS Configuration
-
-The RustFS console uses the same TLS certificates as the API endpoint. Place the certificate and key in one directory and point `RUSTFS_TLS_PATH` at it:
-
-#### Certificate Requirements
-
-```
-# Certificate directory structure
-/path/to/certs/
-├── cert.pem    # TLS certificate
-└── key.pem     # Private key
-```
-
-#### Generate Self-Signed Certificates (Development)
-
-```bash
-# Generate development certificates
-mkdir -p ./certs
-openssl req -x509 -newkey rsa:4096 \
-  -keyout ./certs/key.pem \
-  -out ./certs/cert.pem \
-  -days 365 -nodes \
-  -subj "/C=US/ST=CA/L=SF/O=RustFS/CN=localhost"
-
-# Set proper permissions
-chmod 600 ./certs/key.pem
-chmod 644 ./certs/cert.pem
-```
-
-#### Production TLS with Let's Encrypt
-
-```bash
-# Use certbot to generate certificates
-certbot certonly --standalone -d yourdomain.com
-cp /etc/letsencrypt/live/yourdomain.com/fullchain.pem ./certs/cert.pem
-cp /etc/letsencrypt/live/yourdomain.com/privkey.pem ./certs/key.pem
-```
-
-### Rate Limiting and Authentication
-
-Configure console security settings via environment variables:
-
-```bash
-# Native process: export the settings before starting rustfs
-export RUSTFS_CONSOLE_RATE_LIMIT_ENABLE=true
-export RUSTFS_CONSOLE_RATE_LIMIT_RPM=60      # 60 requests per minute
-export RUSTFS_CONSOLE_AUTH_TIMEOUT=1800      # 30 minutes session timeout
-
-# Docker: host exports are not inherited by the container, so pass each setting with -e
-docker run -d \
-  -e RUSTFS_CONSOLE_RATE_LIMIT_ENABLE=true \
-  -e RUSTFS_CONSOLE_RATE_LIMIT_RPM=60 \
-  -e RUSTFS_CONSOLE_AUTH_TIMEOUT=1800 \
-  rustfs/rustfs:latest
-```
-
-### CORS Security
-
-Configure restrictive CORS policies for production:
-
-```bash
-# Production CORS configuration
-export RUSTFS_CORS_ALLOWED_ORIGINS="https://myapp.com,https://api.myapp.com"
-export RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS="https://admin.myapp.com"
-
-# Development CORS (permissive)
-export RUSTFS_CORS_ALLOWED_ORIGINS="*"
-export RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS="*"
-```
-
-### Network Security
-
-#### Firewall Configuration
-
-```bash
-# Allow API access from all networks
-sudo ufw allow 9000/tcp
-
-# Restrict console access to internal networks only
-sudo ufw allow from 192.168.1.0/24 to any port 9001
-sudo ufw allow from 10.0.0.0/8 to any port 9001
-
-# Block external console access
-sudo ufw deny 9001/tcp
-
-# Note: ports published by Docker bypass UFW rules; for containers, restrict
-# the published address instead (e.g., -p 127.0.0.1:9001:9001)
-```
-
-#### Docker Network Isolation
-
-```yaml
-# docker-compose.yml with network isolation
-version: '3.8'
-services:
-  rustfs:
-    image: rustfs/rustfs:latest
-    networks:
-      - api-network       # Public API access
-      - console-network   # Internal console access
-    environment:
-      - RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS=https://admin.internal.com
-
-networks:
-  api-network:
-    driver: bridge
-  console-network:
-    driver: bridge
-    internal: true  # No external access
-```
-
-#### Reverse Proxy Setup
-
-Use Nginx as an additional security layer:
-
-```nginx
-# /etc/nginx/sites-available/rustfs (included in the http {} context)
-# Shared-memory zone required by limit_req below; the rate is an example value
-limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
-
-# API endpoint - public access
-server {
-    listen 80;
-    server_name api.example.com;
-
-    # Allow large object uploads (nginx rejects bodies over 1 MB by default)
-    client_max_body_size 0;
-
-    location / {
-        proxy_pass http://localhost:9000;
-        proxy_set_header Host $host;
-        proxy_set_header X-Real-IP $remote_addr;
-        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
-
-        # Rate limiting
-        limit_req zone=api burst=20 nodelay;
-    }
-}
-
-# Console - restricted access with authentication
-server {
-    listen 443 ssl;
-    server_name admin.example.com;
-
-    ssl_certificate /path/to/cert.pem;
-    ssl_certificate_key /path/to/key.pem;
-
-    # Basic authentication
-    auth_basic "RustFS Admin";
-    auth_basic_user_file /etc/nginx/.htpasswd;
-
-    # IP whitelist
-    allow 192.168.1.0/24;
-    allow 10.0.0.0/8;
-    deny all;
-
-    location / {
-        proxy_pass http://localhost:9001;
-        proxy_set_header Host $host;
-        proxy_set_header X-Real-IP $remote_addr;
-        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
-        proxy_set_header X-Forwarded-Proto $scheme;
-    }
-}
-```
-
-## Health Monitoring
-
-### Health Check Endpoints
-
-Both services provide independent health check endpoints:
-
-#### Console Health Check
-
-- **Endpoint**: `GET /health`
-- **Response**:
-
-```json
-{
-  "status": "ok",
-  "service": "rustfs-console",
-  "timestamp": "2024-01-15T10:30:00Z",
-  "version": "0.0.5",
-  "details": {
-    "storage": {
-      "status": "connected"
-    },
-    "iam": {
-      "status": "connected"
-    }
-  },
-  "uptime": 1800
-}
-```
-
-#### Endpoint Health Check
-
-- **Endpoint**: `GET /health`
-- **Response**:
-
-```json
-{
-  "status": "ok",
-  "service": "rustfs-endpoint",
-  "timestamp": "2024-01-15T10:30:00Z",
-  "version": "0.0.5"
-}
-```
-
-### Monitoring Integration
-
-#### Prometheus Metrics
-
-```bash
-# Health check monitoring
-curl http://localhost:9000/health | jq '.status'
-curl http://localhost:9001/health | jq '.status'
-```
-
-```yaml
-# Prometheus alert rules
-- alert: RustFSConsoleDown
-  expr: up{job="rustfs-console"} == 0
-  for: 30s
-  labels:
-    severity: critical
-  annotations:
-    summary: "RustFS Console service is down"
-
-- alert: RustFSEndpointDown
-  expr: up{job="rustfs-endpoint"} == 0
-  for: 30s
-  labels:
-    severity: critical
-  annotations:
-    summary: "RustFS API Endpoint is down"
-```
-
-#### Docker Health Checks
-
-Built-in Docker health checks are configured in the Dockerfile:
-
-```dockerfile
-HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
-  CMD curl -f http://localhost:9000/health && curl -f http://localhost:9001/health || exit 1
-```
-
-Check health status:
-
-```bash
-# View health status
-docker ps --format "table {{.Names}}\t{{.Status}}"
-
-# Detailed health information
-docker inspect rustfs --format='{{json .State.Health}}' | jq
-```
-
-### Logging and Auditing
-
-#### Separate Logging Targets
-
-Console and endpoint services use separate logging targets:
-
-**Console Logging Targets:**
-- `rustfs::console::startup` - Server startup and configuration
-- `rustfs::console::access` - HTTP access logs with timing
-- `rustfs::console::error` - Console-specific errors
-- `rustfs::console::shutdown` - Graceful shutdown logs
-
-**Endpoint Logging Targets:**
-- `rustfs::endpoint::startup` - API server startup
-- `rustfs::endpoint::access` - S3 API access logs
-- `rustfs::endpoint::auth` - Authentication and authorization
-
-#### Centralized Logging
-
-```bash
-# Set per-target log levels (RUST_LOG must be passed into the container with -e)
-docker run -d \
-  -e RUST_LOG="rustfs::console=info,rustfs::endpoint=info" \
-  rustfs/rustfs:latest
-
-# Forward to log aggregation
-docker run -d \
-  --log-driver=fluentd \
-  --log-opt fluentd-address=localhost:24224 \
-  --log-opt tag="rustfs.{{.Name}}" \
-  rustfs/rustfs:latest
-```
-
-## Troubleshooting
-
-### Common Issues and Solutions
-
-#### 1. Console Cannot Access API
-
-**Symptoms**: Console UI shows connection errors, "Failed to load data" messages.
-
-**Solutions**:
-
-```bash
-# In Kubernetes or multi-container setups, reach the API through its
-# Service or container name (or another routable URL), not localhost
-```
-
-**Debug steps**:
-```bash
-# Test API connectivity from console container
-docker exec rustfs-container curl http://localhost:9000/health
-
-# Check CORS configuration
-curl -H "Origin: http://localhost:9021" -v http://localhost:9020/health
-```
-
-#### 2. CORS Errors
-
-**Symptoms**: Browser console shows "Access to fetch blocked by CORS policy" errors.
-
-**Causes and Solutions**:
-
-```bash
-# Allow specific origins (production)
-RUSTFS_CORS_ALLOWED_ORIGINS="https://admin.yourdomain.com,https://backup.yourdomain.com"
-RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS="https://console.yourdomain.com"
-
-# Allow all origins (development only)
-RUSTFS_CORS_ALLOWED_ORIGINS="*"
-RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS="*"
-
-# Docker deployment with port mapping
-RUSTFS_CORS_ALLOWED_ORIGINS="http://localhost:9021,http://127.0.0.1:9021"
-```
-
-**Debug CORS issues**:
-```bash
-# Check actual request origin in browser network tab
-# Ensure the origin matches CORS configuration
-
-# Test the CORS preflight with curl (-i prints the response headers)
-curl -i -H "Origin: http://localhost:9021" \
-  -H "Access-Control-Request-Method: GET" \
-  -H "Access-Control-Request-Headers: authorization" \
-  -X OPTIONS \
-  http://localhost:9020/
-```
-
-#### 3. Port Conflicts
-
-**Symptoms**: "Address already in use" or "bind: address already in use" errors.
- -**Solutions**: - -```bash -# Check which process is using the port -sudo lsof -i :9000 -sudo lsof -i :9001 -sudo netstat -tulpn | grep :9000 - -# Kill conflicting process -sudo kill -9 <PID> - -# Use different ports -RUSTFS_ADDRESS=":8000" RUSTFS_CONSOLE_ADDRESS=":8001" rustfs /data - -# For Docker, change host port mapping -docker run -p 8020:9000 -p 8021:9001 rustfs/rustfs:latest -``` - -#### 4. TLS Certificate Issues - -**Symptoms**: "TLS handshake failed", "certificate verify failed" errors. - -**Solutions**: - -```bash -# Verify certificate files exist and are readable -ls -la /path/to/certs/ -# Should show cert.pem and key.pem with proper permissions - -# Test certificate validity -openssl x509 -in /path/to/certs/cert.pem -text -noout - -# Generate new certificates -openssl req -x509 -newkey rsa:4096 \ - -keyout /path/to/certs/key.pem \ - -out /path/to/certs/cert.pem \ - -days 365 -nodes \ - -subj "/C=US/O=RustFS/CN=localhost" - -# For Docker, ensure certificate volume mount is correct -docker run -v /host/path/to/certs:/certs:ro rustfs/rustfs:latest -``` - -#### 5. Service Not Starting - -**Symptoms**: Container exits immediately, "failed to start console server" errors. - -**Debug steps**: - -```bash -# Check container logs -docker logs rustfs-container - -# Enable debug logging -docker run rustfs/rustfs:latest - -# Check configuration -docker exec rustfs-container env | grep RUSTFS - -# Test configuration outside Docker -rustfs --help -``` - -#### 6. Health Check Failures - -**Symptoms**: Docker health checks fail, Kubernetes pods not ready. - -**Solutions**: - -```bash -# Test health endpoints manually -curl http://localhost:9000/health -curl http://localhost:9001/health - -# Check if services are listening -docker exec rustfs-container netstat -tulpn - -# Increase health check timeouts -# For Docker -HEALTHCHECK --interval=30s --timeout=30s --retries=5 - -# For Kubernetes -livenessProbe: - initialDelaySeconds: 60 - timeoutSeconds: 30 -``` - -#### 7. 
Docker Network Issues - -**Symptoms**: Services cannot communicate within Docker network. - -**Solutions**: - -```bash -# Check Docker network -docker network ls -docker inspect <network-name> - -# Test connectivity between containers -docker exec container1 ping container2 -docker exec container1 curl http://container2:9000/health - -# Use Docker network aliases -docker run --network=my-network --network-alias=rustfs rustfs/rustfs:latest -``` - -### Debugging Commands - -#### Service Status - -```bash -# Check running containers -docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}" - -# Check service logs -docker logs rustfs-container --tail=100 -f - -# Check resource usage -docker stats rustfs-container - -# Inspect container configuration -docker inspect rustfs-container | jq '.Config.Env' -``` - -#### Network Debugging - -```bash -# Test connectivity from host -curl -v http://localhost:9020/health -curl -v http://localhost:9021/health - -# Test from inside container -docker exec rustfs-container curl http://localhost:9000/health -docker exec rustfs-container curl http://localhost:9001/health - -# Check port listening -docker exec rustfs-container netstat -tulpn | grep -E ':(9000|9001)' -``` - -#### Configuration Debugging - -```bash -# Show effective configuration -docker exec rustfs-container env | grep RUSTFS | sort - -# Test configuration parsing -docker exec rustfs-container rustfs --help - -# Check file permissions -docker exec rustfs-container ls -la /certs/ -docker exec rustfs-container ls -la /data/ -``` - -### Getting Help - -#### Log Collection - -```bash -# Collect comprehensive logs -mkdir -p ./debug-logs -docker logs rustfs-container > ./debug-logs/container.log 2>&1 -docker inspect rustfs-container > ./debug-logs/inspect.json -docker exec rustfs-container env > ./debug-logs/environment.txt -docker exec rustfs-container ps aux > ./debug-logs/processes.txt -docker exec rustfs-container netstat -tulpn > ./debug-logs/network.txt -``` - -#### 
Community Support - -- **GitHub Issues**: [rustfs/rustfs/issues](https://github.com/rustfs/rustfs/issues) -- **Discussions**: [rustfs/rustfs/discussions](https://github.com/rustfs/rustfs/discussions) -- **Documentation**: Check the `docs/` directory for additional guides - -## Migration Guide - -### From Previous Versions - -Previous versions served the console from the same port as the S3 API. This section helps migrate to the separated architecture. - -#### Pre-Migration Checklist - -1. **Backup Configuration**: Save current environment variables and configuration files -2. **Document Current Setup**: Note current port usage, firewall rules, and proxy configurations -3. **Plan Downtime**: Brief service restart required for migration -4. **Update Clients**: Prepare to update console access URLs - -#### Step-by-Step Migration - -##### 1. Update Configuration - -```bash -# Old single-port configuration -RUSTFS_ADDRESS=":9000" - -# New separated configuration -RUSTFS_ADDRESS=":9000" # API port (unchanged) -RUSTFS_CONSOLE_ADDRESS=":9001" # Console port (new) -``` - -##### 2. Update Firewall Rules - -```bash -# Allow new console port -sudo ufw allow 9001/tcp - -# Optional: restrict console to internal networks -sudo ufw delete allow 9001/tcp -sudo ufw allow from 192.168.1.0/24 to any port 9001 -``` - -##### 3. Update Docker Deployments - -```bash -# Old deployment -docker run -p 9000:9000 rustfs/rustfs:legacy - -# New deployment with both ports -docker run \ - -p 9000:9000 \ # API port - -p 9001:9001 \ # Console port - rustfs/rustfs:latest -``` - -##### 4. Update Application URLs - -- **API Endpoint**: `http://localhost:9000` (unchanged) -- **Console UI**: `http://localhost:9001/rustfs/console/` (new URL) - -##### 5. 
Update Monitoring and Health Checks - -```bash -# Add console health check -curl http://localhost:9001/health - -# Update monitoring configuration to check both endpoints -``` - -#### Docker Migration Example - -```bash -#!/usr/bin/env bash -# migrate-docker.sh - -# Stop old container -docker stop rustfs-old -docker rm rustfs-old - -# Start new separated services -docker run -d \ - --name rustfs-new \ - -p 9000:9000 \ - -p 9001:9001 \ - -e RUSTFS_CORS_ALLOWED_ORIGINS="http://localhost:9001" \ - -v rustfs-data:/data \ - rustfs/rustfs:latest - -echo "Migration completed!" -echo "API: http://localhost:9000" -echo "Console: http://localhost:9001/rustfs/console/" -``` - -#### Kubernetes Migration - -```yaml -# Update deployment to expose both ports -apiVersion: apps/v1 -kind: Deployment -metadata: - name: rustfs -spec: - template: - spec: - containers: - - name: rustfs - ports: - - containerPort: 9000 - name: api - - containerPort: 9001 # Add console port - name: console - ---- -# Update service to include console port -apiVersion: v1 -kind: Service -metadata: - name: rustfs-service -spec: - ports: - - name: api - port: 9000 - - name: console # Add console service - port: 9001 -``` - -#### Rollback Plan - -If issues occur, you can disable the console to return to single-service mode: - -```bash -# Disable console service -RUSTFS_CONSOLE_ENABLE=false rustfs /data - -# Or use older image version temporarily -docker run rustfs/rustfs:legacy-tag -``` - -### Configuration Migration - -#### Environment Variable Changes - -```bash -# New variables (add these) -export RUSTFS_CONSOLE_ADDRESS=":9001" -export RUSTFS_CORS_ALLOWED_ORIGINS="*" -export RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS="*" - -# Optional security variables -export RUSTFS_CONSOLE_RATE_LIMIT_ENABLE="true" -export RUSTFS_CONSOLE_RATE_LIMIT_RPM="100" -export RUSTFS_CONSOLE_AUTH_TIMEOUT="3600" -``` - -#### Validation - -After migration, validate the setup: - -```bash -# Check both services are running -curl 
http://localhost:9000/health # Should return API health -curl http://localhost:9001/health # Should return console health - -# Test console functionality -open http://localhost:9001/rustfs/console/ - -# Verify API still works -aws s3 ls --endpoint-url http://localhost:9000 -``` - -## Best Practices - -### Production Deployment - -#### Security Best Practices - -1. **Restrict Console Access** - ```bash - # Bind console to internal interface only - RUSTFS_CONSOLE_ADDRESS="127.0.0.1:9001" - - # Use restrictive CORS - RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS="https://admin.yourdomain.com" - ``` - -2. **Enable TLS** - ```bash - # Use TLS for console - RUSTFS_TLS_PATH="/path/to/certs" - ``` - -3. **Configure Rate Limiting** - ```bash - # Prevent brute force attacks - RUSTFS_CONSOLE_RATE_LIMIT_ENABLE="true" - RUSTFS_CONSOLE_RATE_LIMIT_RPM="60" - ``` - -4. **Use Strong Credentials** - ```bash - # Generate secure credentials - RUSTFS_ACCESS_KEY="$(openssl rand -hex 16)" - RUSTFS_SECRET_KEY="$(openssl rand -hex 32)" - ``` - -#### Operational Best Practices - -1. **Independent Monitoring** - - Set up health checks for both API and console services - - Monitor resource usage separately - - Configure separate alerting rules - -2. **Network Segmentation** - - Use different networks for public API and internal console - - Implement proper firewall rules - - Consider using a reverse proxy for additional security - -3. **Logging Strategy** - - Configure separate log targets for console and API - - Use structured logging for better analysis - - Implement centralized log collection - -#### Docker Best Practices - -1. **Resource Limits** - ```yaml - services: - rustfs: - deploy: - resources: - limits: - memory: 1G - cpus: "0.5" - ``` - -2. **Health Checks** - ```yaml - healthcheck: - test: ["CMD-SHELL", "curl -f http://localhost:9000/health && curl -f http://localhost:9001/health"] - interval: 30s - timeout: 10s - retries: 3 - ``` - -3. 
**Volume Management** - ```yaml - volumes: - - rustfs-data:/data - - rustfs-certs:/certs:ro - - rustfs-logs:/logs - ``` - -### Development Environment - -#### Development Best Practices - -1. **Permissive Configuration** - ```bash - # Allow all origins for development - RUSTFS_CORS_ALLOWED_ORIGINS="*" - RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS="*" - ``` - -2. **Hot Reload Support** - ```bash - # Mount source code for development - docker run -v "$(pwd)":/app rustfs/rustfs:dev - ``` - -3. **Use Development Scripts** - ```bash - # Use provided development deployment - ./docs/examples/docker/enhanced-docker-deployment.sh dev - ``` - -### Monitoring and Observability - -#### Metrics Collection - -1. **Health Check Monitoring** - ```bash - # Regular health checks - */1 * * * * curl -f http://localhost:9000/health >/dev/null || echo "API down" - */1 * * * * curl -f http://localhost:9001/health >/dev/null || echo "Console down" - ``` - -2. **Performance Monitoring** - - Monitor response times for both services - - Track error rates separately - - Set up resource usage alerts - -3. **Business Metrics** - - Track console usage patterns - - Monitor API request patterns - - Measure service availability - -#### Alerting Strategy - -```yaml -# Example Prometheus alerting rules -groups: -- name: rustfs - rules: - - alert: RustFSAPIDown - expr: up{job="rustfs-api"} == 0 - for: 30s - labels: - severity: critical - annotations: - summary: RustFS API is down - - - alert: RustFSConsoleDown - expr: up{job="rustfs-console"} == 0 - for: 30s - labels: - severity: warning - annotations: - summary: RustFS Console is down - - - alert: HighResponseTime - expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1 - for: 2m - labels: - severity: warning - annotations: - summary: High response time detected -``` - -### Troubleshooting Workflows - -#### Systematic Debugging Approach - -1. 
**Service Status Check** - ```bash - # Check if services are running - curl -f http://localhost:9000/health - curl -f http://localhost:9001/health - ``` - -2. **Network Connectivity** - ```bash - # Test from inside the container, then check CORS response headers from the host - docker exec rustfs-container curl http://localhost:9000/health - curl -i -H "Origin: http://localhost:9001" http://localhost:9000/health - ``` - -3. **Configuration Validation** - ```bash - # Verify environment variables - docker exec rustfs-container env | grep RUSTFS | sort - ``` - -4. **Log Analysis** - ```bash - # Check service-specific logs - docker logs rustfs-container 2>&1 | grep -E "(console|endpoint)" - ``` - -This guide covers RustFS console and endpoint service separation, from basic deployment to production configuration. For additional support, refer to the example scripts in the `docs/examples/docker/` directory and the community resources listed in the troubleshooting section. \ No newline at end of file diff --git a/docs/examples/README.md b/docs/examples/README.md deleted file mode 100644 index 401ff89ee..000000000 --- a/docs/examples/README.md +++ /dev/null @@ -1,85 +0,0 @@ -# RustFS Deployment Examples - -This directory contains practical deployment examples and configurations for RustFS. - -## Available Examples - -### [MNMD (Multi-Node Multi-Drive)](./mnmd/) - -Complete Docker Compose example for deploying RustFS in a 4-node, 4-drive-per-node configuration. - -**Features:** - -- Proper disk indexing (1..4) to avoid VolumeNotFound errors -- Startup coordination via `wait-and-start.sh` script -- Service discovery using Docker service names -- Health checks with alternatives for different base images -- Comprehensive documentation and verification checklist - -**Use Case:** Production-ready multi-node deployment for high availability and performance.
- -**Quick Start:** - -```bash -cd docs/examples/mnmd -docker-compose up -d -``` - -**See also:** - -- [MNMD README](./mnmd/README.md) - Detailed usage guide -- [MNMD CHECKLIST](./mnmd/CHECKLIST.md) - Step-by-step verification - -## Other Deployment Examples - -For additional deployment examples, see: - -- [`docker/`](./docker/) - Docker deployment scripts and Compose files: - - `docker-quickstart.sh` - Quick start script for basic deployments (commands: basic / dev / prod / status / test / - cleanup) - - `enhanced-docker-deployment.sh` - Advanced deployment script with multiple scenarios and detailed - logs (commands: basic / dev / prod / all / status / test / logs / cleanup) - - `enhanced-security-deployment.sh` - Production-ready script with TLS, rate limiting, and secure credential - generation - - `docker-comprehensive.yml` - Docker Compose file with multiple profiles (basic / dev / production / - enterprise / api-only); the nginx reverse proxy runs with the production and enterprise profiles - - Usage example: - ```bash - # Rapid development environment - ./docs/examples/docker/docker-quickstart.sh dev - - # Start dev profile using Docker Compose - docker-compose -f docs/examples/docker/docker-comprehensive.yml --profile dev up -d - - # Secure deployment - ./docs/examples/docker/enhanced-security-deployment.sh - ``` - - Note: If CI jobs or other documents still refer to the old `examples/` path, update them to - `docs/examples/docker/`. Relative links in this README already point to the new location. - -- [`.docker/compose/`](/.docker/compose/) - Docker Compose configurations: - - `docker-compose.cluster.yaml` - Basic cluster setup - - `docker-compose.observability.yaml` - Observability stack integration - -## Related Documentation - -- [Console & Endpoint Service Separation](../console-separation.md) -- [Environment Variables](../ENVIRONMENT_VARIABLES.md) -- [Performance Testing](../PERFORMANCE_TESTING.md) - -## Contributing - -When adding new examples: - -1. 
Create a dedicated subdirectory under `docs/examples/` -2. Include a comprehensive README.md -3. Provide working configuration files -4. Add verification steps or checklists -5. Document common issues and troubleshooting - -## Support - -For issues or questions: - -- GitHub Issues: https://github.com/rustfs/rustfs/issues -- Documentation: https://rustfs.com/docs diff --git a/docs/examples/docker/README.md b/docs/examples/docker/README.md deleted file mode 100644 index 1c98ad1e3..000000000 --- a/docs/examples/docker/README.md +++ /dev/null @@ -1,282 +0,0 @@ -# RustFS Docker Deployment Examples - -This directory contains various deployment scripts and configuration files for RustFS with console and endpoint service -separation. - -## Quick Start Scripts - -### `docker-quickstart.sh` - -The fastest way to get RustFS running with different configurations. - -```bash -# Basic deployment (ports 9000-9001) -./docker-quickstart.sh basic - -# Development environment (ports 9010-9011) -./docker-quickstart.sh dev - -# Production-like deployment (ports 9020-9021) -./docker-quickstart.sh prod - -# Check status of all deployments -./docker-quickstart.sh status - -# Test health of all running services -./docker-quickstart.sh test - -# Clean up all containers -./docker-quickstart.sh cleanup -``` - -### `enhanced-docker-deployment.sh` - -Comprehensive deployment script with multiple scenarios and detailed logging. 
- -```bash -# Deploy individual scenarios -./enhanced-docker-deployment.sh basic # Basic setup with port mapping -./enhanced-docker-deployment.sh dev # Development environment -./enhanced-docker-deployment.sh prod # Production-like with security - -# Deploy all scenarios at once -./enhanced-docker-deployment.sh all - -# Check status and test services -./enhanced-docker-deployment.sh status -./enhanced-docker-deployment.sh test - -# View logs for specific container -./enhanced-docker-deployment.sh logs rustfs-dev - -# Complete cleanup -./enhanced-docker-deployment.sh cleanup -``` - -### `enhanced-security-deployment.sh` - -Production-ready deployment with enhanced security features including TLS, rate limiting, and secure credential -generation. - -```bash -# Deploy with security hardening -./enhanced-security-deployment.sh - -# Features: -# - Automatic TLS certificate generation -# - Secure credential generation -# - Rate limiting configuration -# - Console access restrictions -# - Health check validation -``` - -## Docker Compose Examples - -### `docker-comprehensive.yml` - -Complete Docker Compose configuration with multiple deployment profiles. 
- -```bash -# Deploy specific profiles -docker-compose -f docker-comprehensive.yml --profile basic up -d -docker-compose -f docker-comprehensive.yml --profile dev up -d -docker-compose -f docker-comprehensive.yml --profile production up -d -docker-compose -f docker-comprehensive.yml --profile enterprise up -d -docker-compose -f docker-comprehensive.yml --profile api-only up -d - -# Deploy with reverse proxy -docker-compose -f docker-comprehensive.yml --profile production --profile nginx up -d -``` - -#### Available Profiles: - -- **basic**: Simple deployment for testing (ports 9000-9001) -- **dev**: Development environment with debug logging (ports 9010-9011) -- **production**: Production deployment with security (ports 9020-9021) -- **enterprise**: Full enterprise setup with TLS (ports 9030-9443) -- **api-only**: API endpoint without console (port 9040) - -## Usage Examples by Scenario - -### Development Setup - -```bash -# Quick development start -./docker-quickstart.sh dev - -# Or use enhanced deployment for more features -./enhanced-docker-deployment.sh dev - -# Or use Docker Compose -docker-compose -f docker-comprehensive.yml --profile dev up -d -``` - -**Access Points:** - -- API: http://localhost:9010 (or 9030 for enhanced) -- Console: http://localhost:9011/rustfs/console/ (or 9031 for enhanced) -- Credentials: dev-admin / dev-secret - -### Production Deployment - -```bash -# Security-hardened deployment -./enhanced-security-deployment.sh - -# Or production profile -./enhanced-docker-deployment.sh prod -``` - -**Features:** - -- TLS encryption for console -- Rate limiting enabled -- Restricted CORS policies -- Secure credential generation -- Console bound to localhost only - -### Testing and CI/CD - -```bash -# API-only deployment for testing -docker-compose -f docker-comprehensive.yml --profile api-only up -d - -# Quick basic setup for integration tests -./docker-quickstart.sh basic -``` - -## Configuration Examples - -### Environment Variables - -All 
deployment scripts support customization via environment variables: - -```bash -# Custom image and ports -export RUSTFS_IMAGE="rustfs/rustfs:custom-tag" -export CONSOLE_PORT="8001" -export API_PORT="8000" - -# Custom data directories -export DATA_DIR="/custom/data/path" -export CERTS_DIR="/custom/certs/path" - -# Run with custom configuration -./enhanced-security-deployment.sh -``` - -### Common Configurations - -```bash -# Development - permissive CORS -RUSTFS_CORS_ALLOWED_ORIGINS="*" -RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS="*" - -# Production - restrictive CORS -RUSTFS_CORS_ALLOWED_ORIGINS="https://myapp.com,https://api.myapp.com" -RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS="https://admin.myapp.com" - -# Security hardening -RUSTFS_CONSOLE_RATE_LIMIT_ENABLE="true" -RUSTFS_CONSOLE_RATE_LIMIT_RPM="60" -RUSTFS_CONSOLE_AUTH_TIMEOUT="1800" -``` - -## Monitoring and Health Checks - -All deployments include health check endpoints: - -```bash -# Test API health -curl http://localhost:9000/health - -# Test console health -curl http://localhost:9001/health - -# Test all deployments -./docker-quickstart.sh test -./enhanced-docker-deployment.sh test -``` - -## Network Architecture - -### Port Mappings - -| Deployment | API Port | Console Port | Description | -|------------|----------|--------------|-------------------------| -| Basic | 9000 | 9001 | Simple deployment | -| Dev | 9010 | 9011 | Development environment | -| Prod | 9020 | 9021 | Production-like setup | -| Enterprise | 9030 | 9443 | Enterprise with TLS | -| API-Only | 9040 | - | API endpoint only | - -### Network Isolation - -Production deployments use network isolation: - -- **Public API Network**: Exposes API endpoints to external clients -- **Internal Console Network**: Restricts console access to internal networks -- **Secure Network**: Isolated network for enterprise deployments - -## Security Considerations - -### Development - -- Permissive CORS policies for easy testing -- Debug logging enabled -- Default credentials 
for simplicity - -### Production - -- Restrictive CORS policies -- TLS encryption for console -- Rate limiting enabled -- Secure credential generation -- Console bound to localhost -- Network isolation - -### Enterprise - -- Complete TLS encryption -- Advanced rate limiting -- Authentication timeouts -- Secret management -- Network segregation - -## Troubleshooting - -### Common Issues - -1. **Port Conflicts**: Use different ports via environment variables -2. **CORS Errors**: Check origin configuration and browser network tab -3. **Health Check Failures**: Verify services are running and ports are accessible -4. **Permission Issues**: Check volume mount permissions and certificate file permissions - -### Debug Commands - -```bash -# Check container logs -docker logs rustfs-container - -# Check container environment -docker exec rustfs-container env | grep RUSTFS - -# Test connectivity -docker exec rustfs-container curl http://localhost:9000/health -docker exec rustfs-container curl http://localhost:9001/health - -# Check listening ports -docker exec rustfs-container netstat -tulpn | grep -E ':(9000|9001)' -``` - -## Migration from Previous Versions - -See [docs/console-separation.md](../../console-separation.md) for detailed migration instructions from single-port -deployments to the separated architecture. 
- -## Additional Resources - -- [Console Separation Documentation](../../console-separation.md) -- [Docker Compose Configuration](../../../docker-compose.yml) -- [Main Dockerfile](../../../Dockerfile) -- [Security Best Practices](../../console-separation.md#security-hardening) \ No newline at end of file diff --git a/docs/examples/docker/docker-comprehensive.yml b/docs/examples/docker/docker-comprehensive.yml deleted file mode 100644 index 4488e6624..000000000 --- a/docs/examples/docker/docker-comprehensive.yml +++ /dev/null @@ -1,219 +0,0 @@ -# RustFS Comprehensive Docker Deployment Examples -# This file demonstrates various deployment scenarios for RustFS with console separation - -version: "3.8" - -services: - # Basic deployment with default settings - rustfs-basic: - image: rustfs/rustfs:latest - container_name: rustfs-basic - ports: - - "9000:9000" # API endpoint - - "9001:9001" # Console interface - environment: - - RUSTFS_ADDRESS=0.0.0.0:9000 - - RUSTFS_CONSOLE_ADDRESS=0.0.0.0:9001 - - RUSTFS_CORS_ALLOWED_ORIGINS=http://127.0.0.1:9001 - - RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS=* - - RUSTFS_ACCESS_KEY=admin - - RUSTFS_SECRET_KEY=password - volumes: - - rustfs-basic-data:/data - networks: - - rustfs-network - restart: unless-stopped - healthcheck: - test: [ "CMD", "sh", "-c", "curl -f http://127.0.0.1:9000/health && curl -f http://127.0.0.1:9001/rustfs/console/health" ] - interval: 30s - timeout: 10s - retries: 3 - profiles: - - basic - - # Development environment with debug logging - rustfs-dev: - image: rustfs/rustfs:latest - container_name: rustfs-dev - ports: - - "9010:9000" # API endpoint - - "9011:9001" # Console interface - environment: - - RUSTFS_ADDRESS=0.0.0.0:9000 - - RUSTFS_CONSOLE_ADDRESS=0.0.0.0:9001 - - RUSTFS_CORS_ALLOWED_ORIGINS=* - - RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS=* - - RUSTFS_ACCESS_KEY=dev-admin - - RUSTFS_SECRET_KEY=dev-password - - RUSTFS_OBS_LOGGER_LEVEL=debug - volumes: - - rustfs-dev-data:/data - - rustfs-dev-logs:/logs - networks: - 
- rustfs-network - restart: unless-stopped - healthcheck: - test: [ "CMD", "sh", "-c", "curl -f http://127.0.0.1:9000/health && curl -f http://127.0.0.1:9001/rustfs/console/health" ] - interval: 30s - timeout: 10s - retries: 3 - profiles: - - dev - - # Production environment with security hardening - rustfs-production: - image: rustfs/rustfs:latest - container_name: rustfs-production - ports: - - "9020:9000" # API endpoint (public) - - "127.0.0.1:9021:9001" # Console (localhost only) - environment: - - RUSTFS_ADDRESS=0.0.0.0:9000 - - RUSTFS_CONSOLE_ADDRESS=0.0.0.0:9001 - - RUSTFS_CORS_ALLOWED_ORIGINS=https://myapp.com,https://api.myapp.com - - RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS=https://admin.myapp.com - - RUSTFS_CONSOLE_RATE_LIMIT_ENABLE=true - - RUSTFS_CONSOLE_RATE_LIMIT_RPM=60 - - RUSTFS_CONSOLE_AUTH_TIMEOUT=1800 - - RUSTFS_ACCESS_KEY_FILE=/run/secrets/rustfs_access_key - - RUSTFS_SECRET_KEY_FILE=/run/secrets/rustfs_secret_key - volumes: - - rustfs-production-data:/data - - rustfs-production-logs:/logs - - rustfs-certs:/certs:ro - networks: - - rustfs-network - secrets: - - rustfs_access_key - - rustfs_secret_key - restart: unless-stopped - healthcheck: - test: [ "CMD", "sh", "-c", "curl -f http://127.0.0.1:9000/health && curl -f http://127.0.0.1:9001/rustfs/console/health" ] - interval: 30s - timeout: 10s - retries: 3 - profiles: - - production - - # Enterprise deployment with TLS and full security - rustfs-enterprise: - image: rustfs/rustfs:latest - container_name: rustfs-enterprise - ports: - - "9030:9000" # API endpoint - - "127.0.0.1:9443:9001" # Console with TLS (localhost only) - environment: - - RUSTFS_ADDRESS=0.0.0.0:9000 - - RUSTFS_CONSOLE_ADDRESS=0.0.0.0:9001 - - RUSTFS_TLS_PATH=/certs - - RUSTFS_CORS_ALLOWED_ORIGINS=https://enterprise.com - - RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS=https://admin.enterprise.com - - RUSTFS_CONSOLE_RATE_LIMIT_ENABLE=true - - RUSTFS_CONSOLE_RATE_LIMIT_RPM=30 - - RUSTFS_CONSOLE_AUTH_TIMEOUT=900 - volumes: - - 
rustfs-enterprise-data:/data - - rustfs-enterprise-logs:/logs - - rustfs-enterprise-certs:/certs:ro - networks: - - rustfs-secure-network - secrets: - - rustfs_enterprise_access_key - - rustfs_enterprise_secret_key - restart: unless-stopped - healthcheck: - test: [ "CMD", "sh", "-c", "curl -f http://127.0.0.1:9000/health && curl -k -f https://127.0.0.1:9001/rustfs/console/health" ] - interval: 30s - timeout: 10s - retries: 3 - profiles: - - enterprise - - # API-only deployment (console disabled) - rustfs-api-only: - image: rustfs/rustfs:latest - container_name: rustfs-api-only - ports: - - "9040:9000" # API endpoint only - environment: - - RUSTFS_ADDRESS=0.0.0.0:9000 - - RUSTFS_CONSOLE_ENABLE=false - - RUSTFS_CORS_ALLOWED_ORIGINS=https://client-app.com - - RUSTFS_ACCESS_KEY=api-only-key - - RUSTFS_SECRET_KEY=api-only-secret - volumes: - - rustfs-api-data:/data - networks: - - rustfs-network - restart: unless-stopped - healthcheck: - test: [ "CMD", "curl", "-f", "http://127.0.0.1:9000/health" ] - interval: 30s - timeout: 10s - retries: 3 - profiles: - - api-only - - # Nginx reverse proxy for production - nginx-proxy: - image: nginx:alpine - container_name: rustfs-nginx - ports: - - "80:80" - - "443:443" - volumes: - - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro - - ./nginx/ssl:/etc/nginx/ssl:ro - networks: - - rustfs-network - restart: unless-stopped - depends_on: - - rustfs-production - profiles: - - production - - enterprise - -networks: - rustfs-network: - driver: bridge - ipam: - config: - - subnet: 172.20.0.0/16 - rustfs-secure-network: - driver: bridge - internal: true - ipam: - config: - - subnet: 172.21.0.0/16 - -volumes: - rustfs-basic-data: - driver: local - rustfs-dev-data: - driver: local - rustfs-dev-logs: - driver: local - rustfs-production-data: - driver: local - rustfs-production-logs: - driver: local - rustfs-enterprise-data: - driver: local - rustfs-enterprise-logs: - driver: local - rustfs-enterprise-certs: - driver: local - rustfs-api-data: - 
driver: local - rustfs-certs: - driver: local - -secrets: - rustfs_access_key: - external: true - rustfs_secret_key: - external: true - rustfs_enterprise_access_key: - external: true - rustfs_enterprise_secret_key: - external: true \ No newline at end of file diff --git a/docs/examples/docker/docker-quickstart.sh b/docs/examples/docker/docker-quickstart.sh deleted file mode 100755 index 5fab636b3..000000000 --- a/docs/examples/docker/docker-quickstart.sh +++ /dev/null @@ -1,292 +0,0 @@ -#!/usr/bin/env bash - -# RustFS Docker Quick Start Script -# This script provides easy deployment commands for different scenarios - -set -e - -# Colors for output -GREEN='\033[0;32m' -BLUE='\033[0;34m' -YELLOW='\033[1;33m' -RED='\033[0;31m' -NC='\033[0m' # No Color - -log() { - echo -e "${GREEN}[RustFS]${NC} $1" -} - -info() { - echo -e "${BLUE}[INFO]${NC} $1" -} - -warn() { - echo -e "${YELLOW}[WARN]${NC} $1" -} - -error() { - echo -e "${RED}[ERROR]${NC} $1" -} - -# Print banner -print_banner() { - echo -e "${BLUE}" - echo "==================================================" - echo " RustFS Docker Quick Start" - echo " Console & Endpoint Separation" - echo "==================================================" - echo -e "${NC}" -} - -# Check Docker availability -check_docker() { - if ! command -v docker &> /dev/null; then - error "Docker is not installed or not available in PATH" - exit 1 - fi - info "Docker is available: $(docker --version)" -} - -# Quick start - basic deployment -quick_basic() { - log "Starting RustFS basic deployment..." - - docker run -d \ - --name rustfs-quick \ - -p 9000:9000 \ - -p 9001:9001 \ - -e RUSTFS_CORS_ALLOWED_ORIGINS="http://localhost:9001" \ - -v rustfs-quick-data:/data \ - rustfs/rustfs:latest - - echo - info "✅ RustFS deployed successfully!" 
- info "🌐 API Endpoint: http://localhost:9000" - info "🖥️ Console UI: http://localhost:9001/rustfs/console/" - info "🔐 Credentials: rustfsadmin / rustfsadmin" - info "🏥 Health Check: curl http://localhost:9000/health" - echo - info "To stop: docker stop rustfs-quick" - info "To remove: docker rm rustfs-quick && docker volume rm rustfs-quick-data" -} - -# Development deployment with debug logging -quick_dev() { - log "Starting RustFS development environment..." - - docker run -d \ - --name rustfs-dev \ - -p 9010:9000 \ - -p 9011:9001 \ - -e RUSTFS_CORS_ALLOWED_ORIGINS="*" \ - -e RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS="*" \ - -e RUSTFS_ACCESS_KEY="dev-admin" \ - -e RUSTFS_SECRET_KEY="dev-secret" \ - -e RUSTFS_OBS_LOGGER_LEVEL="debug" \ - -v rustfs-dev-data:/data \ - rustfs/rustfs:latest - - echo - info "✅ RustFS development environment ready!" - info "🌐 API Endpoint: http://localhost:9010" - info "🖥️ Console UI: http://localhost:9011/rustfs/console/" - info "🔐 Credentials: dev-admin / dev-secret" - info "📊 Debug logging enabled" - echo - info "To stop: docker stop rustfs-dev" -} - -# Production-like deployment -quick_prod() { - log "Starting RustFS production-like deployment..." - - # Generate secure credentials - ACCESS_KEY="prod-$(openssl rand -hex 8)" - SECRET_KEY="$(openssl rand -hex 24)" - - docker run -d \ - --name rustfs-prod \ - -p 9020:9000 \ - -p 127.0.0.1:9021:9001 \ - -e RUSTFS_CORS_ALLOWED_ORIGINS="https://myapp.com" \ - -e RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS="https://admin.myapp.com" \ - -e RUSTFS_CONSOLE_RATE_LIMIT_ENABLE="true" \ - -e RUSTFS_CONSOLE_RATE_LIMIT_RPM="60" \ - -e RUSTFS_ACCESS_KEY="$ACCESS_KEY" \ - -e RUSTFS_SECRET_KEY="$SECRET_KEY" \ - -v rustfs-prod-data:/data \ - rustfs/rustfs:latest - - # Save credentials - echo "RUSTFS_ACCESS_KEY=$ACCESS_KEY" > rustfs-prod-credentials.txt - echo "RUSTFS_SECRET_KEY=$SECRET_KEY" >> rustfs-prod-credentials.txt - chmod 600 rustfs-prod-credentials.txt - - echo - info "✅ RustFS production deployment ready!" 
- info "🌐 API Endpoint: http://localhost:9020 (public)" - info "🖥️ Console UI: http://127.0.0.1:9021/rustfs/console/ (localhost only)" - info "🔐 Credentials saved to rustfs-prod-credentials.txt" - info "🔒 Console restricted to localhost for security" - echo - warn "⚠️ Change default CORS origins for production use" -} - -# Stop and cleanup -cleanup() { - log "Cleaning up RustFS deployments..." - - docker stop rustfs-quick rustfs-dev rustfs-prod 2>/dev/null || true - docker rm rustfs-quick rustfs-dev rustfs-prod 2>/dev/null || true - - info "Containers stopped and removed" - echo - info "To also remove data volumes, run:" - info "docker volume rm rustfs-quick-data rustfs-dev-data rustfs-prod-data" -} - -# Show status of all deployments -status() { - log "RustFS deployment status:" - echo - - if docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}" | grep -q rustfs; then - docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}" | head -n1 - docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}" | grep rustfs - else - info "No RustFS containers are currently running" - fi - - echo - info "Available endpoints:" - - if docker ps --filter "name=rustfs-quick" --format "{{.Names}}" | grep -q rustfs-quick; then - echo " Basic: http://localhost:9000 (API) | http://localhost:9001/rustfs/console/ (Console)" - fi - - if docker ps --filter "name=rustfs-dev" --format "{{.Names}}" | grep -q rustfs-dev; then - echo " Dev: http://localhost:9010 (API) | http://localhost:9011/rustfs/console/ (Console)" - fi - - if docker ps --filter "name=rustfs-prod" --format "{{.Names}}" | grep -q rustfs-prod; then - echo " Prod: http://localhost:9020 (API) | http://127.0.0.1:9021/rustfs/console/ (Console)" - fi -} - -# Test deployments -test_deployments() { - log "Testing RustFS deployments..." - echo - - # Test basic deployment - if docker ps --filter "name=rustfs-quick" --format "{{.Names}}" | grep -q rustfs-quick; then - info "Testing basic deployment..." 
- if curl -s -f http://localhost:9000/health | grep -q "ok"; then - echo " ✅ API health check: PASS" - else - echo " ❌ API health check: FAIL" - fi - - if curl -s -f http://localhost:9001/health | grep -q "console"; then - echo " ✅ Console health check: PASS" - else - echo " ❌ Console health check: FAIL" - fi - fi - - # Test dev deployment - if docker ps --filter "name=rustfs-dev" --format "{{.Names}}" | grep -q rustfs-dev; then - info "Testing development deployment..." - if curl -s -f http://localhost:9010/health | grep -q "ok"; then - echo " ✅ Dev API health check: PASS" - else - echo " ❌ Dev API health check: FAIL" - fi - - if curl -s -f http://localhost:9011/health | grep -q "console"; then - echo " ✅ Dev Console health check: PASS" - else - echo " ❌ Dev Console health check: FAIL" - fi - fi - - # Test prod deployment - if docker ps --filter "name=rustfs-prod" --format "{{.Names}}" | grep -q rustfs-prod; then - info "Testing production deployment..." - if curl -s -f http://localhost:9020/health | grep -q "ok"; then - echo " ✅ Prod API health check: PASS" - else - echo " ❌ Prod API health check: FAIL" - fi - - if curl -s -f http://127.0.0.1:9021/health | grep -q "console"; then - echo " ✅ Prod Console health check: PASS" - else - echo " ❌ Prod Console health check: FAIL" - fi - fi -} - -# Show help -show_help() { - print_banner - echo "Usage: $0 [command]" - echo - echo "Commands:" - echo " basic Start basic RustFS deployment (ports 9000-9001)" - echo " dev Start development deployment with debug logging (ports 9010-9011)" - echo " prod Start production-like deployment with security (ports 9020-9021)" - echo " status Show status of running deployments" - echo " test Test health of all running deployments" - echo " cleanup Stop and remove all RustFS containers" - echo " help Show this help message" - echo - echo "Examples:" - echo " $0 basic # Quick start with default settings" - echo " $0 dev # Development environment with debug logs" - echo " $0 prod # 
Production-like setup with security" - echo " $0 status # Check what's running" - echo " $0 test # Test all deployments" - echo " $0 cleanup # Clean everything up" - echo - echo "For more advanced deployments, see:" - echo " - docs/examples/docker/enhanced-docker-deployment.sh" - echo " - docs/examples/docker/enhanced-security-deployment.sh" - echo " - docs/examples/docker/docker-comprehensive.yml" - echo " - docs/console-separation.md" - echo -} - -# Main execution -case "${1:-help}" in - "basic") - print_banner - check_docker - quick_basic - ;; - "dev") - print_banner - check_docker - quick_dev - ;; - "prod") - print_banner - check_docker - quick_prod - ;; - "status") - print_banner - status - ;; - "test") - print_banner - test_deployments - ;; - "cleanup") - print_banner - cleanup - ;; - "help"|*) - show_help - ;; -esac \ No newline at end of file diff --git a/docs/examples/docker/enhanced-docker-deployment.sh b/docs/examples/docker/enhanced-docker-deployment.sh deleted file mode 100755 index 0185de6b8..000000000 --- a/docs/examples/docker/enhanced-docker-deployment.sh +++ /dev/null @@ -1,318 +0,0 @@ -#!/usr/bin/env bash - -# RustFS Enhanced Docker Deployment Examples -# This script demonstrates various deployment scenarios for RustFS with console separation - -set -e - -# Colors for output -RED='\033[0;31m' -GREEN='\033[0;32m' -YELLOW='\033[1;33m' -BLUE='\033[0;34m' -NC='\033[0m' # No Color - -log_info() { - echo -e "${GREEN}[INFO]${NC} $1" -} - -log_warn() { - echo -e "${YELLOW}[WARN]${NC} $1" -} - -log_error() { - echo -e "${RED}[ERROR]${NC} $1" -} - -log_section() { - echo -e "\n${BLUE}========================================${NC}" - echo -e "${BLUE}$1${NC}" - echo -e "${BLUE}========================================${NC}\n" -} - -# Function to clean up existing containers -cleanup() { - log_info "Cleaning up existing RustFS containers..."
- docker stop rustfs-basic rustfs-dev rustfs-prod 2>/dev/null || true - docker rm rustfs-basic rustfs-dev rustfs-prod 2>/dev/null || true -} - -# Function to wait for service to be ready -wait_for_service() { - local url=$1 - local service_name=$2 - local max_attempts=30 - local attempt=0 - - log_info "Waiting for $service_name to be ready at $url..." - - while [ $attempt -lt $max_attempts ]; do - if curl -s -f "$url" > /dev/null 2>&1; then - log_info "$service_name is ready!" - return 0 - fi - attempt=$((attempt + 1)) - sleep 1 - done - - log_error "$service_name failed to start within ${max_attempts}s" - return 1 -} - -# Scenario 1: Basic deployment with port mapping -deploy_basic() { - log_section "Scenario 1: Basic Docker Deployment with Port Mapping" - - log_info "Starting RustFS with port mapping 9020:9000 and 9021:9001" - - docker run -d \ - --name rustfs-basic \ - -p 9020:9000 \ - -p 9021:9001 \ - -e RUSTFS_CORS_ALLOWED_ORIGINS="http://localhost:9021,http://127.0.0.1:9021" \ - -e RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS="*" \ - -e RUSTFS_ACCESS_KEY="basic-access" \ - -e RUSTFS_SECRET_KEY="basic-secret" \ - -v rustfs-basic-data:/data \ - rustfs/rustfs:latest - - # Wait for services to be ready - wait_for_service "http://localhost:9020/health" "API Service" - wait_for_service "http://localhost:9021/health" "Console Service" - - log_info "Basic deployment ready!" 
- log_info "🌐 API endpoint: http://localhost:9020" - log_info "🖥️ Console UI: http://localhost:9021/rustfs/console/" - log_info "🔐 Credentials: basic-access / basic-secret" - log_info "🏥 Health checks:" - log_info " API: curl http://localhost:9020/health" - log_info " Console: curl http://localhost:9021/health" -} - -# Scenario 2: Development environment -deploy_development() { - log_section "Scenario 2: Development Environment" - - log_info "Starting RustFS development environment" - - docker run -d \ - --name rustfs-dev \ - -p 9030:9000 \ - -p 9031:9001 \ - -e RUSTFS_CORS_ALLOWED_ORIGINS="*" \ - -e RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS="*" \ - -e RUSTFS_ACCESS_KEY="dev-access" \ - -e RUSTFS_SECRET_KEY="dev-secret" \ - -e RUSTFS_OBS_LOGGER_LEVEL="debug" \ - -v rustfs-dev-data:/data \ - rustfs/rustfs:latest - - # Wait for services to be ready - wait_for_service "http://localhost:9030/health" "Dev API Service" - wait_for_service "http://localhost:9031/health" "Dev Console Service" - - log_info "Development deployment ready!" 
- log_info "🌐 API endpoint: http://localhost:9030" - log_info "🖥️ Console UI: http://localhost:9031/rustfs/console/" - log_info "🔐 Credentials: dev-access / dev-secret" - log_info "📊 Debug logging enabled" - log_info "🏥 Health checks:" - log_info " API: curl http://localhost:9030/health" - log_info " Console: curl http://localhost:9031/health" -} - -# Scenario 3: Production-like environment with security -deploy_production() { - log_section "Scenario 3: Production-like Deployment" - - log_info "Starting RustFS production-like environment with security" - - # Generate secure credentials - ACCESS_KEY=$(openssl rand -hex 16) - SECRET_KEY=$(openssl rand -hex 32) - - # Save credentials for reference - cat > rustfs-prod-credentials.env << EOF -# RustFS Production Deployment Credentials -# Generated: $(date) -RUSTFS_ACCESS_KEY=$ACCESS_KEY -RUSTFS_SECRET_KEY=$SECRET_KEY -EOF - chmod 600 rustfs-prod-credentials.env - - docker run -d \ - --name rustfs-prod \ - -p 9040:9000 \ - -p 127.0.0.1:9041:9001 \ - -e RUSTFS_ADDRESS="0.0.0.0:9000" \ - -e RUSTFS_CONSOLE_ADDRESS="0.0.0.0:9001" \ - -e RUSTFS_CORS_ALLOWED_ORIGINS="https://myapp.example.com" \ - -e RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS="https://admin.example.com" \ - -e RUSTFS_ACCESS_KEY="$ACCESS_KEY" \ - -e RUSTFS_SECRET_KEY="$SECRET_KEY" \ - -v rustfs-prod-data:/data \ - rustfs/rustfs:latest - - # Wait for services to be ready - wait_for_service "http://localhost:9040/health" "Prod API Service" - wait_for_service "http://127.0.0.1:9041/health" "Prod Console Service" - - log_info "Production deployment ready!" 
- log_info "🌐 API endpoint: http://localhost:9040 (public)" - log_info "🖥️ Console UI: http://127.0.0.1:9041/rustfs/console/ (localhost only)" - log_info "🔐 Credentials: $ACCESS_KEY / $SECRET_KEY" - log_info "🔒 Security: Console restricted to localhost" - log_info "🏥 Health checks:" - log_info " API: curl http://localhost:9040/health" - log_info " Console: curl http://127.0.0.1:9041/health" - log_warn "⚠️ Console is restricted to localhost for security" - log_warn "⚠️ Credentials saved to rustfs-prod-credentials.env file" -} - -# Function to show service status -show_status() { - log_section "Service Status" - - echo "Running containers:" - docker ps --filter "name=rustfs-" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}" - - echo -e "\nService endpoints:" - if docker ps --filter "name=rustfs-basic" --format "{{.Names}}" | grep -q rustfs-basic; then - echo " Basic API: http://localhost:9020" - echo " Basic Console: http://localhost:9021/rustfs/console/" - fi - - if docker ps --filter "name=rustfs-dev" --format "{{.Names}}" | grep -q rustfs-dev; then - echo " Dev API: http://localhost:9030" - echo " Dev Console: http://localhost:9031/rustfs/console/" - fi - - if docker ps --filter "name=rustfs-prod" --format "{{.Names}}" | grep -q rustfs-prod; then - echo " Prod API: http://localhost:9040" - echo " Prod Console: http://127.0.0.1:9041/rustfs/console/" - fi -} - -# Function to test services -test_services() { - log_section "Testing Services" - - # Test basic deployment - if docker ps --filter "name=rustfs-basic" --format "{{.Names}}" | grep -q rustfs-basic; then - log_info "Testing basic deployment..." 
- if curl -s http://localhost:9020/health | grep -q "ok"; then - log_info "✓ Basic API health check passed" - else - log_error "✗ Basic API health check failed" - fi - - if curl -s http://localhost:9021/health | grep -q "console"; then - log_info "✓ Basic Console health check passed" - else - log_error "✗ Basic Console health check failed" - fi - fi - - # Test development deployment - if docker ps --filter "name=rustfs-dev" --format "{{.Names}}" | grep -q rustfs-dev; then - log_info "Testing development deployment..." - if curl -s http://localhost:9030/health | grep -q "ok"; then - log_info "✓ Dev API health check passed" - else - log_error "✗ Dev API health check failed" - fi - - if curl -s http://localhost:9031/health | grep -q "console"; then - log_info "✓ Dev Console health check passed" - else - log_error "✗ Dev Console health check failed" - fi - fi - - # Test production deployment - if docker ps --filter "name=rustfs-prod" --format "{{.Names}}" | grep -q rustfs-prod; then - log_info "Testing production deployment..." 
- if curl -s http://localhost:9040/health | grep -q "ok"; then - log_info "✓ Prod API health check passed" - else - log_error "✗ Prod API health check failed" - fi - - if curl -s http://127.0.0.1:9041/health | grep -q "console"; then - log_info "✓ Prod Console health check passed" - else - log_error "✗ Prod Console health check failed" - fi - fi -} - -# Function to show logs -show_logs() { - log_section "Service Logs" - - if [ -n "$1" ]; then - docker logs "$1" - else - echo "Available containers:" - docker ps --filter "name=rustfs-" --format "{{.Names}}" - echo -e "\nUsage: $0 logs <container-name>" - fi -} - -# Main menu -case "${1:-menu}" in - "basic") - cleanup - deploy_basic - ;; - "dev") - cleanup - deploy_development - ;; - "prod") - cleanup - deploy_production - ;; - "all") - cleanup - deploy_basic - deploy_development - deploy_production - show_status - ;; - "status") - show_status - ;; - "test") - test_services - ;; - "logs") - show_logs "$2" - ;; - "cleanup") - cleanup - docker volume rm rustfs-basic-data rustfs-dev-data rustfs-prod-data 2>/dev/null || true - log_info "Cleanup completed" - ;; - "menu"|*) - echo "RustFS Enhanced Docker Deployment Examples" - echo "" - echo "Usage: $0 [command]" - echo "" - echo "Commands:" - echo " basic - Deploy basic RustFS with port mapping" - echo " dev - Deploy development environment" - echo " prod - Deploy production-like environment" - echo " all - Deploy all scenarios" - echo " status - Show status of running containers" - echo " test - Test all running services" - echo " logs - Show logs for specific container" - echo " cleanup - Clean up all containers and volumes" - echo "" - echo "Examples:" - echo " $0 basic # Deploy basic setup" - echo " $0 status # Check running services" - echo " $0 logs rustfs-dev # Show dev container logs" - echo " $0 cleanup # Clean everything up" - ;; -esac \ No newline at end of file diff --git a/docs/examples/docker/enhanced-security-deployment.sh 
b/docs/examples/docker/enhanced-security-deployment.sh deleted file mode 100755 index 737c60776..000000000 --- a/docs/examples/docker/enhanced-security-deployment.sh +++ /dev/null @@ -1,206 +0,0 @@ -#!/usr/bin/env bash - -# RustFS Enhanced Security Deployment Script -# This script demonstrates production-ready deployment with enhanced security features - -set -e - -# Configuration -RUSTFS_IMAGE="${RUSTFS_IMAGE:-rustfs/rustfs:latest}" -CONTAINER_NAME="${CONTAINER_NAME:-rustfs-secure}" -DATA_DIR="${DATA_DIR:-./data}" -CERTS_DIR="${CERTS_DIR:-./certs}" -CONSOLE_PORT="${CONSOLE_PORT:-9443}" -API_PORT="${API_PORT:-9000}" - -# Colors for output -RED='\033[0;31m' -GREEN='\033[0;32m' -YELLOW='\033[1;33m' -BLUE='\033[0;34m' -NC='\033[0m' # No Color - -log() { - echo -e "${BLUE}[INFO]${NC} $1" -} - -warn() { - echo -e "${YELLOW}[WARN]${NC} $1" -} - -error() { - echo -e "${RED}[ERROR]${NC} $1" - exit 1 -} - -success() { - echo -e "${GREEN}[SUCCESS]${NC} $1" -} - -# Check if Docker is available -check_docker() { - if ! command -v docker &> /dev/null; then - error "Docker is not installed or not in PATH" - fi - log "Docker is available" -} - -# Generate TLS certificates for console -generate_certs() { - if [[ ! -d "$CERTS_DIR" ]]; then - mkdir -p "$CERTS_DIR" - log "Created certificates directory: $CERTS_DIR" - fi - - if [[ ! -f "$CERTS_DIR/console.crt" ]] || [[ ! -f "$CERTS_DIR/console.key" ]]; then - log "Generating TLS certificates for console..." - openssl req -x509 -newkey rsa:4096 \ - -keyout "$CERTS_DIR/console.key" \ - -out "$CERTS_DIR/console.crt" \ - -days 365 -nodes \ - -subj "/C=US/ST=CA/L=SF/O=RustFS/CN=localhost" - - chmod 600 "$CERTS_DIR/console.key" - chmod 644 "$CERTS_DIR/console.crt" - success "TLS certificates generated" - else - log "TLS certificates already exist" - fi -} - -# Create data directory -create_data_dir() { - if [[ ! 
-d "$DATA_DIR" ]]; then - mkdir -p "$DATA_DIR" - log "Created data directory: $DATA_DIR" - fi -} - -# Generate secure credentials -generate_credentials() { - if [[ -z "$RUSTFS_ACCESS_KEY" ]]; then - export RUSTFS_ACCESS_KEY="admin-$(openssl rand -hex 8)" - log "Generated access key: $RUSTFS_ACCESS_KEY" - fi - - if [[ -z "$RUSTFS_SECRET_KEY" ]]; then - export RUSTFS_SECRET_KEY="$(openssl rand -hex 32)" - log "Generated secret key: [HIDDEN]" - fi - - # Save credentials to .env file - cat > .env << EOF -RUSTFS_ACCESS_KEY=$RUSTFS_ACCESS_KEY -RUSTFS_SECRET_KEY=$RUSTFS_SECRET_KEY -EOF - chmod 600 .env - success "Credentials saved to .env file" -} - -# Stop existing container -stop_existing() { - if docker ps -a --format "table {{.Names}}" | grep -q "^$CONTAINER_NAME\$"; then - log "Stopping existing container: $CONTAINER_NAME" - docker stop "$CONTAINER_NAME" 2>/dev/null || true - docker rm "$CONTAINER_NAME" 2>/dev/null || true - fi -} - -# Deploy RustFS with enhanced security -deploy_rustfs() { - log "Deploying RustFS with enhanced security..." 
- - docker run -d \ - --name "$CONTAINER_NAME" \ - --restart unless-stopped \ - -p "$CONSOLE_PORT:9001" \ - -p "$API_PORT:9000" \ - -v "$(pwd)/$DATA_DIR:/data" \ - -v "$(pwd)/$CERTS_DIR:/certs:ro" \ - -e RUSTFS_CONSOLE_TLS_ENABLE=true \ - -e RUSTFS_CONSOLE_TLS_CERT=/certs/console.crt \ - -e RUSTFS_CONSOLE_TLS_KEY=/certs/console.key \ - -e RUSTFS_CONSOLE_RATE_LIMIT_ENABLE=true \ - -e RUSTFS_CONSOLE_RATE_LIMIT_RPM=60 \ - -e RUSTFS_CONSOLE_AUTH_TIMEOUT=1800 \ - -e RUSTFS_CONSOLE_CORS_ALLOWED_ORIGINS="https://localhost:$CONSOLE_PORT" \ - -e RUSTFS_CORS_ALLOWED_ORIGINS="http://localhost:$API_PORT" \ - -e RUSTFS_ACCESS_KEY="$RUSTFS_ACCESS_KEY" \ - -e RUSTFS_SECRET_KEY="$RUSTFS_SECRET_KEY" \ - "$RUSTFS_IMAGE" /data - - # Wait for container to start - sleep 5 - - if docker ps --format "table {{.Names}}" | grep -q "^$CONTAINER_NAME\$"; then - success "RustFS deployed successfully" - else - error "Failed to deploy RustFS" - fi -} - -# Check service health -check_health() { - log "Checking service health..." 
- - # Check console health - if curl -k -s "https://localhost:$CONSOLE_PORT/health" | jq -e '.status == "ok"' > /dev/null 2>&1; then - success "Console service is healthy" - else - warn "Console service health check failed" - fi - - # Check API health - if curl -s "http://localhost:$API_PORT/health" | jq -e '.status == "ok"' > /dev/null 2>&1; then - success "API service is healthy" - else - warn "API service health check failed" - fi -} - -# Display access information -show_access_info() { - echo - echo "==========================================" - echo " RustFS Access Information" - echo "==========================================" - echo - echo "🌐 Console (HTTPS): https://localhost:$CONSOLE_PORT/rustfs/console/" - echo "🔧 API Endpoint: http://localhost:$API_PORT" - echo "🏥 Console Health: https://localhost:$CONSOLE_PORT/health" - echo "🏥 API Health: http://localhost:$API_PORT/health" - echo - echo "🔐 Credentials:" - echo " Access Key: $RUSTFS_ACCESS_KEY" - echo " Secret Key: [Check .env file]" - echo - echo "📝 Logs: docker logs $CONTAINER_NAME" - echo "🛑 Stop: docker stop $CONTAINER_NAME" - echo - echo "⚠️ Note: Console uses self-signed certificate" - echo " Accept the certificate warning in your browser" - echo -} - -# Main deployment flow -main() { - log "Starting RustFS Enhanced Security Deployment" - - check_docker - create_data_dir - generate_certs - generate_credentials - stop_existing - deploy_rustfs - - # Wait a bit for services to start - sleep 10 - - check_health - show_access_info - - success "Deployment completed successfully!" -} - -# Run main function -main "$@" \ No newline at end of file diff --git a/docs/examples/mnmd/CHECKLIST.md b/docs/examples/mnmd/CHECKLIST.md deleted file mode 100644 index 0a08f156a..000000000 --- a/docs/examples/mnmd/CHECKLIST.md +++ /dev/null @@ -1,329 +0,0 @@ -# MNMD Deployment Checklist - -This checklist provides step-by-step verification for deploying RustFS in MNMD (Multi-Node Multi-Drive) mode using -Docker. 
- -## Pre-Deployment Checks - -### 1. System Requirements - -- [ ] Docker Engine 20.10+ installed -- [ ] Docker Compose 2.0+ installed -- [ ] At least 8GB RAM available -- [ ] At least 40GB disk space available (for 4 nodes × 4 volumes) - -Verify with: - -```bash -docker --version -docker-compose --version -free -h -df -h -``` - -### 2. File System Checks - -- [ ] Using XFS, ext4, or another suitable filesystem (not NFS for production) -- [ ] File system supports extended attributes - -Verify with: - -```bash -df -T | grep -E '(xfs|ext4)' -``` - -### 3. Permissions and SELinux - -- [ ] Current user is in `docker` group or can run `sudo docker` -- [ ] SELinux is properly configured (if enabled) - -Verify with: - -```bash -groups | grep docker -getenforce # If enabled, should show "Permissive" or "Enforcing" with proper policies -``` - -### 4. Network Configuration - -- [ ] Ports 9000-9031 are available -- [ ] No firewall blocking Docker bridge network - -Verify with: - -```bash -# Check if ports are free -netstat -tuln | grep -E ':(9000|9001|9010|9011|9020|9021|9030|9031)' -# Should return nothing if ports are free -``` - -### 5. Files Present - -- [ ] `docker-compose.yml` exists in current directory - -Verify with: - -```bash -cd docs/examples/mnmd -ls -la -chmod +x wait-and-start.sh # If needed -``` - -## Deployment Steps - -### 1. Start the Cluster - -- [ ] Navigate to the example directory -- [ ] Pull the latest RustFS image -- [ ] Start the cluster - -```bash -cd docs/examples/mnmd -docker-compose pull -docker-compose up -d -``` - -### 2. Monitor Startup - -- [ ] Watch container logs during startup -- [ ] Verify no VolumeNotFound errors -- [ ] Check that peer discovery completes - -```bash -# Watch all logs -docker-compose logs -f - -# Watch specific node -docker-compose logs -f rustfs-node1 - -# Look for successful startup messages -docker-compose logs | grep -i "ready\|listening\|started" -``` - -### 3. 
Verify Container Status - -- [ ] All 4 containers are running -- [ ] All 4 containers show as healthy - -```bash -docker-compose ps - -# Expected output: 4 containers in "Up" state with "healthy" status -``` - -### 4. Check Health Endpoints - -- [ ] API health endpoints respond on all nodes -- [ ] Console health endpoints respond on all nodes - -```bash -# Test API endpoints -curl http://localhost:9000/health -curl http://localhost:9010/health -curl http://localhost:9020/health -curl http://localhost:9030/health - -# Test Console endpoints -curl http://localhost:9001/health -curl http://localhost:9011/health -curl http://localhost:9021/health -curl http://localhost:9031/health - -# All should return successful health status -``` - -## Post-Deployment Verification - -### 1. In-Container Checks - -- [ ] Data directories exist -- [ ] Directories have correct permissions -- [ ] RustFS process is running - -```bash -# Check node1 -docker exec rustfs-node1 ls -la /data/ -docker exec rustfs-node1 ps aux | grep rustfs - -# Verify all 4 data directories exist -docker exec rustfs-node1 ls -d /data/rustfs{1..4} -``` - -### 2. DNS and Network Validation - -- [ ] Service names resolve correctly -- [ ] Inter-node connectivity works - -```bash -# DNS resolution test -docker exec rustfs-node1 nslookup rustfs-node2 -docker exec rustfs-node1 nslookup rustfs-node3 -docker exec rustfs-node1 nslookup rustfs-node4 - -# Connectivity test (using nc if available) -docker exec rustfs-node1 nc -zv rustfs-node2 9000 -docker exec rustfs-node1 nc -zv rustfs-node3 9000 -docker exec rustfs-node1 nc -zv rustfs-node4 9000 - -# Or using telnet/curl -docker exec rustfs-node1 curl -v http://rustfs-node2:9000/health -``` - -### 3. 
Volume Configuration Validation - -- [ ] RUSTFS_VOLUMES environment variable is correct -- [ ] All 16 endpoints are configured (4 nodes × 4 drives) - -```bash -# Check environment variable -docker exec rustfs-node1 env | grep RUSTFS_VOLUMES - -# Expected output: -# RUSTFS_VOLUMES=http://rustfs-node{1...4}:9000/data/rustfs{1...4} -``` - -### 4. Cluster Functionality - -- [ ] Can list buckets via API -- [ ] Can create a bucket -- [ ] Can upload an object -- [ ] Can download an object - -```bash -# Configure AWS CLI or s3cmd -export AWS_ACCESS_KEY_ID=rustfsadmin -export AWS_SECRET_ACCESS_KEY=rustfsadmin - -# Using AWS CLI (if installed) -aws --endpoint-url http://localhost:9000 s3 mb s3://test-bucket -aws --endpoint-url http://localhost:9000 s3 ls -echo "test content" > test.txt -aws --endpoint-url http://localhost:9000 s3 cp test.txt s3://test-bucket/ -aws --endpoint-url http://localhost:9000 s3 ls s3://test-bucket/ -aws --endpoint-url http://localhost:9000 s3 cp s3://test-bucket/test.txt downloaded.txt -cat downloaded.txt - -# Or using curl -curl -X PUT http://localhost:9000/test-bucket \ - -H "Host: localhost:9000" \ - --user rustfsadmin:rustfsadmin -``` - -### 5. Healthcheck Verification - -- [ ] Docker reports all services as healthy -- [ ] Healthcheck scripts work in containers - -```bash -# Check Docker health status -docker inspect rustfs-node1 --format='{{.State.Health.Status}}' -docker inspect rustfs-node2 --format='{{.State.Health.Status}}' -docker inspect rustfs-node3 --format='{{.State.Health.Status}}' -docker inspect rustfs-node4 --format='{{.State.Health.Status}}' - -# All should return "healthy" - -# Test healthcheck command manually -docker exec rustfs-node1 nc -z localhost 9000 -echo $? 
# Should be 0 -``` - -## Troubleshooting Checks - -### If VolumeNotFound Error Occurs - -- [ ] Verify volume indexing starts at 1, not 0 -- [ ] Check that RUSTFS_VOLUMES matches mounted paths -- [ ] Ensure all /data/rustfs{1..4} directories exist - -```bash -# Check mounted volumes -docker inspect rustfs-node1 | jq '.[].Mounts' - -# Verify directories in container -docker exec rustfs-node1 ls -la /data/ -``` - -### If Healthcheck Fails - -- [ ] Check if `nc` is available in the image -- [ ] Try alternative healthcheck (curl/wget) -- [ ] Increase `start_period` in docker-compose.yml - -```bash -# Check if nc is available -docker exec rustfs-node1 which nc - -# Test healthcheck manually -docker exec rustfs-node1 nc -z localhost 9000 - -# Check logs for errors -docker-compose logs rustfs-node1 | grep -i error -``` - -### If Startup Takes Too Long - -- [ ] Check peer discovery timeout in logs -- [ ] Verify network connectivity between nodes -- [ ] Consider increasing timeout in wait-and-start.sh - -```bash -# Check startup logs -docker-compose logs rustfs-node1 | grep -i "waiting\|peer\|timeout" - -# Check network -docker network inspect mnmd_rustfs-mnmd -``` - -### If Containers Crash or Restart - -- [ ] Review container logs -- [ ] Check resource usage (CPU/Memory) -- [ ] Verify no port conflicts - -```bash -# View last crash logs -docker-compose logs --tail=100 rustfs-node1 - -# Check resource usage -docker stats --no-stream - -# Check restart count -docker-compose ps -``` - -## Cleanup Checklist - -When done testing: - -- [ ] Stop the cluster: `docker-compose down` -- [ ] Remove volumes (optional, destroys data): `docker-compose down -v` -- [ ] Clean up dangling images: `docker image prune` -- [ ] Verify ports are released: `netstat -tuln | grep -E ':(9000|9001|9010|9011|9020|9021|9030|9031)'` - -## Production Deployment Additional Checks - -Before deploying to production: - -- [ ] Change default credentials (RUSTFS_ACCESS_KEY, RUSTFS_SECRET_KEY) -- [ ] Configure 
TLS certificates -- [ ] Set up proper logging and monitoring -- [ ] Configure backups for volumes -- [ ] Review and adjust resource limits -- [ ] Set up external load balancer (if needed) -- [ ] Document disaster recovery procedures -- [ ] Test failover scenarios -- [ ] Verify data persistence after container restart - -## Summary - -This checklist ensures: - -- ✓ Correct disk indexing (1..4 instead of 0..3) -- ✓ Proper startup coordination via wait-and-start.sh -- ✓ Service discovery via Docker service names -- ✓ Health checks function correctly -- ✓ All 16 endpoints (4 nodes × 4 drives) are operational -- ✓ No VolumeNotFound errors occur - -For more details, see [README.md](./README.md) in this directory. diff --git a/docs/examples/mnmd/README.md b/docs/examples/mnmd/README.md deleted file mode 100644 index a5e947fec..000000000 --- a/docs/examples/mnmd/README.md +++ /dev/null @@ -1,268 +0,0 @@ -# RustFS MNMD (Multi-Node Multi-Drive) Docker Example - -This directory contains a complete, ready-to-use MNMD deployment example for RustFS with 4 nodes and 4 drives per node ( -4x4 configuration). - -## Overview - -This example addresses common deployment issues including: - -- **VolumeNotFound errors** - Fixed by using correct disk indexing (`/data/rustfs{1...4}` instead of - `/data/rustfs{0...3}`) -- **Startup race conditions** - Solved with a simple `sleep` command in each service. 
-- **Service discovery** - Uses Docker service names (`rustfs-node{1..4}`) instead of hard-coded IPs -- **Health checks** - Implements proper health monitoring with `nc` (with alternatives documented) - -## Quick Start - -From this directory (`docs/examples/mnmd`), run: - -```bash -# Start the cluster -docker-compose up -d - -# Check the status -docker-compose ps - -# View logs -docker-compose logs -f - -# Test the deployment -curl http://localhost:9000/health -curl http://localhost:9001/rustfs/console/health - -# Run comprehensive tests -./test-deployment.sh - -# Stop the cluster -docker-compose down - -# Clean up volumes (WARNING: deletes all data) -docker-compose down -v -``` - -## Configuration Details - -### Volume Configuration - -The example uses the following volume configuration: - -```bash -RUSTFS_VOLUMES=http://rustfs-node{1...4}:9000/data/rustfs{1...4} -``` - -This expands to 16 endpoints (4 nodes × 4 drives): - -- Node 1: `/data/rustfs1`, `/data/rustfs2`, `/data/rustfs3`, `/data/rustfs4` -- Node 2: `/data/rustfs1`, `/data/rustfs2`, `/data/rustfs3`, `/data/rustfs4` -- Node 3: `/data/rustfs1`, `/data/rustfs2`, `/data/rustfs3`, `/data/rustfs4` -- Node 4: `/data/rustfs1`, `/data/rustfs2`, `/data/rustfs3`, `/data/rustfs4` - -**Important:** Disk indexing starts at 1 to match the mounted paths (`/data/rustfs1..4`). - -### Port Mappings - -| Node | API Port | Console Port | -|-------|----------|--------------| -| node1 | 9000 | 9001 | -| node2 | 9010 | 9011 | -| node3 | 9020 | 9021 | -| node4 | 9030 | 9031 | - -### Startup Coordination - -To prevent race conditions during startup where nodes might not find each other, a simple `sleep 3` command is added to -each service's command. This provides a brief delay, allowing the network and other services to initialize before RustFS -starts. For more complex scenarios, a more robust health-check dependency or an external entrypoint script might be -required. 
- -### Health Checks - -Default health check using `nc` (netcat): - -```yaml -healthcheck: - test: [ "CMD-SHELL", "nc -z localhost 9000 || exit 1" ] - interval: 10s - timeout: 5s - retries: 3 - start_period: 30s -``` - -#### Alternative Health Checks - -If your base image lacks `nc`, use one of these alternatives: - -**Using curl:** - -```yaml -healthcheck: - test: [ "CMD-SHELL", "curl -f http://localhost:9000/health || exit 1" ] - interval: 10s - timeout: 5s - retries: 3 - start_period: 30s -``` - -**Using wget:** - -```yaml -healthcheck: - test: [ "CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:9000/health || exit 1" ] - interval: 10s - timeout: 5s - retries: 3 - start_period: 30s -``` - -### Brace Expansion Alternatives - -If your Docker Compose runtime doesn't support brace expansion (`{1...4}`), replace with explicit endpoints: - -```yaml -environment: - - RUSTFS_VOLUMES=http://rustfs-node1:9000/data/rustfs1,http://rustfs-node1:9000/data/rustfs2,http://rustfs-node1:9000/data/rustfs3,http://rustfs-node1:9000/data/rustfs4,http://rustfs-node2:9000/data/rustfs1,http://rustfs-node2:9000/data/rustfs2,http://rustfs-node2:9000/data/rustfs3,http://rustfs-node2:9000/data/rustfs4,http://rustfs-node3:9000/data/rustfs1,http://rustfs-node3:9000/data/rustfs2,http://rustfs-node3:9000/data/rustfs3,http://rustfs-node3:9000/data/rustfs4,http://rustfs-node4:9000/data/rustfs1,http://rustfs-node4:9000/data/rustfs2,http://rustfs-node4:9000/data/rustfs3,http://rustfs-node4:9000/data/rustfs4 -``` - -## Using RUSTFS_CMD - -The `RUSTFS_CMD` environment variable provides a fallback when no command is specified: - -```yaml -environment: - - RUSTFS_CMD=rustfs # Default fallback command -``` - -This allows the entrypoint to execute the correct command when Docker doesn't provide one. 
- -## Testing the Deployment - -After starting the cluster, verify it's working: - -### Automated Testing - -Use the provided test script for comprehensive validation: - -```bash -./test-deployment.sh -``` - -This script tests: - -- Container status (4/4 running) -- Health checks (4/4 healthy) -- API endpoints (4 ports) -- Console endpoints (4 ports) -- Inter-node connectivity -- Data directory existence - -### Manual Testing - -For manual verification: - -```bash -# 1. Check all containers are healthy -docker-compose ps - -# 2. Test API endpoints -for port in 9000 9010 9020 9030; do - echo "Testing port $port..." - curl -s http://localhost:${port}/health | jq '.' -done - -# 3. Test console endpoints -for port in 9001 9011 9021 9031; do - echo "Testing console port $port..." - curl -s http://localhost:${port}/rustfs/console/health | jq '.' -done - -# 4. Check inter-node connectivity -docker exec rustfs-node1 nc -zv rustfs-node2 9000 -docker exec rustfs-node1 nc -zv rustfs-node3 9000 -docker exec rustfs-node1 nc -zv rustfs-node4 9000 -``` - -## Troubleshooting - -### VolumeNotFound Error - -**Symptom:** Error message about `/data/rustfs0` not found. - -**Solution:** This example uses `/data/rustfs{1...4}` indexing to match the mounted Docker volumes. Ensure your -`RUSTFS_VOLUMES` configuration starts at index 1, not 0. - -### Health Check Failures - -**Symptom:** Containers show as unhealthy. - -**Solutions:** - -1. Check if `nc` is available: `docker exec rustfs-node1 which nc` -2. Use alternative health checks (curl/wget) as documented above -3. Increase `start_period` if nodes need more time to initialize - -### Startup Timeouts - -**Symptom:** Services timeout waiting for peers. - -**Solutions:** - -1. Check logs: `docker-compose logs rustfs-node1` -2. Verify network connectivity: `docker-compose exec rustfs-node1 ping rustfs-node2` -3. Consider increasing the `sleep` duration in the `docker-compose.yml` `command` directive if a longer delay is needed. 
- -### Permission Issues - -**Symptom:** Cannot create directories or write data. - -**Solution:** Ensure volumes have correct permissions or set `RUSTFS_UID` and `RUSTFS_GID` environment variables. - -## Advanced Configuration - -### Custom Credentials - -Replace default credentials in production: - -```yaml -environment: - - RUSTFS_ACCESS_KEY=your_access_key - - RUSTFS_SECRET_KEY=your_secret_key -``` - -### TLS Configuration - -Add TLS certificates: - -```yaml -volumes: - - ./certs:/opt/tls:ro -environment: - - RUSTFS_TLS_PATH=/opt/tls -``` - -### Resource Limits - -Add resource constraints: - -```yaml -deploy: - resources: - limits: - cpus: '2' - memory: 4G - reservations: - cpus: '1' - memory: 2G -``` - -## See Also - -- [CHECKLIST.md](./CHECKLIST.md) - Step-by-step verification guide -- [../../console-separation.md](../../console-separation.md) - Console & endpoint service separation guide -- [../../../examples/docker-comprehensive.yml](../../../examples/docker-comprehensive.yml) - More deployment examples -- [Issue #618](https://github.com/rustfs/rustfs/issues/618) - Original VolumeNotFound issue - -## References - -- RustFS Documentation: https://rustfs.com -- Docker Compose Documentation: https://docs.docker.com/compose/ \ No newline at end of file diff --git a/docs/examples/mnmd/docker-compose.mtls.yml b/docs/examples/mnmd/docker-compose.mtls.yml deleted file mode 100644 index 088a167d6..000000000 --- a/docs/examples/mnmd/docker-compose.mtls.yml +++ /dev/null @@ -1,32 +0,0 @@ -services: - mnmd: - image: ghcr.io/your-org/mnmd:latest - container_name: mnmd - ports: - - "8443:8443" - volumes: - - ./tls:/tls:ro - environment: - # Example mnmd settings (adapt to your image) - - MNMD_LISTEN_ADDR=0.0.0.0:8443 - - MNMD_TLS_CERT=/tls/server_cert.pem - - MNMD_TLS_KEY=/tls/server_key.pem - - MNMD_TLS_CLIENT_CA=/tls/ca.crt - - rustfs: - image: ghcr.io/rustfs/rustfs:latest - container_name: rustfs - depends_on: - - mnmd - environment: - - RUSTFS_TLS_PATH=/tls - - 
RUSTFS_TRUST_SYSTEM_CA=false - - RUSTFS_TRUST_LEAF_CERT_AS_CA=false - # Enable outbound mTLS (client identity) for MNMD - - RUSTFS_MTLS_CLIENT_CERT=/tls/client_cert.pem - - RUSTFS_MTLS_CLIENT_KEY=/tls/client_key.pem - # MNMD address configured to https - - RUSTFS_MNMD_ADDR=https://mnmd:8443 - - RUSTFS_MNMD_DOMAIN=mnmd - volumes: - - ./tls:/tls:ro diff --git a/docs/examples/mnmd/docker-compose.yml b/docs/examples/mnmd/docker-compose.yml deleted file mode 100644 index 3aa1b7803..000000000 --- a/docs/examples/mnmd/docker-compose.yml +++ /dev/null @@ -1,121 +0,0 @@ -# Copyright 2024 RustFS Team -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# MNMD (Multi-Node Multi-Drive) Docker Compose Example -# 4 nodes x 4 drives configuration -# This example demonstrates a complete, ready-to-use MNMD deployment -# addressing startup coordination and VolumeNotFound issues. 
- -x-node-template: &node-template - image: rustfs/rustfs:latest - environment: - # Use service names and correct disk indexing (1..4 to match mounted paths) - - RUSTFS_VOLUMES=http://rustfs-node{1...4}:9000/data/rustfs{1...4} - - RUSTFS_ADDRESS=0.0.0.0:9000 - - RUSTFS_CONSOLE_ENABLE=true - - RUSTFS_CONSOLE_ADDRESS=0.0.0.0:9001 - - RUSTFS_ACCESS_KEY=rustfsadmin - - RUSTFS_SECRET_KEY=rustfsadmin - - RUSTFS_CMD=rustfs - - RUSTFS_OBS_LOG_DIRECTORY=/logs - command: [ "sh", "-c", "sleep 3 && rustfs" ] - healthcheck: - test: - [ - "CMD", - "sh", "-c", - "curl -f http://localhost:9000/health && curl -f http://localhost:9001/rustfs/console/health" - ] - interval: 10s - timeout: 5s - retries: 3 - start_period: 30s - networks: - - rustfs-mnmd - - -services: - rustfs-node1: - <<: *node-template - container_name: rustfs-node1 - hostname: rustfs-node1 - ports: - - "9000:9000" # API endpoint - - "9001:9001" # Console - volumes: - - node1-data1:/data/rustfs1 - - node1-data2:/data/rustfs2 - - node1-data3:/data/rustfs3 - - node1-data4:/data/rustfs4 - - rustfs-node2: - <<: *node-template - container_name: rustfs-node2 - hostname: rustfs-node2 - ports: - - "9010:9000" # API endpoint - - "9011:9001" # Console - volumes: - - node2-data1:/data/rustfs1 - - node2-data2:/data/rustfs2 - - node2-data3:/data/rustfs3 - - node2-data4:/data/rustfs4 - - rustfs-node3: - <<: *node-template - container_name: rustfs-node3 - hostname: rustfs-node3 - ports: - - "9020:9000" # API endpoint - - "9021:9001" # Console - volumes: - - node3-data1:/data/rustfs1 - - node3-data2:/data/rustfs2 - - node3-data3:/data/rustfs3 - - node3-data4:/data/rustfs4 - - rustfs-node4: - <<: *node-template - container_name: rustfs-node4 - hostname: rustfs-node4 - ports: - - "9030:9000" # API endpoint - - "9031:9001" # Console - volumes: - - node4-data1:/data/rustfs1 - - node4-data2:/data/rustfs2 - - node4-data3:/data/rustfs3 - - node4-data4:/data/rustfs4 - -networks: - rustfs-mnmd: - driver: bridge - -volumes: - node1-data1: - 
node1-data2: - node1-data3: - node1-data4: - node2-data1: - node2-data2: - node2-data3: - node2-data4: - node3-data1: - node3-data2: - node3-data3: - node3-data4: - node4-data1: - node4-data2: - node4-data3: - node4-data4: diff --git a/docs/examples/mnmd/test-deployment.sh b/docs/examples/mnmd/test-deployment.sh deleted file mode 100755 index 5433632a2..000000000 --- a/docs/examples/mnmd/test-deployment.sh +++ /dev/null @@ -1,172 +0,0 @@ -#!/usr/bin/env bash -# Copyright 2024 RustFS Team -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# test-deployment.sh - Quick test script for MNMD deployment -# Usage: ./test-deployment.sh - -set -e - -# Colors for output -GREEN='\033[0;32m' -RED='\033[0;31m' -YELLOW='\033[1;33m' -NC='\033[0m' # No Color - -echo "=========================================" -echo "RustFS MNMD Deployment Test" -echo "=========================================" -echo "" - -# Test 1: Check if all containers are running -echo "Test 1: Checking container status..." -RUNNING=$(docker-compose ps | grep -c "Up" || echo "0") -if [ "$RUNNING" -eq 4 ]; then - echo -e "${GREEN}✓ All 4 containers are running${NC}" -else - echo -e "${RED}✗ Only $RUNNING/4 containers are running${NC}" - docker-compose ps - exit 1 -fi -echo "" - -# Test 2: Check health status -echo "Test 2: Checking health status..." 
-HEALTHY=0 -for node in rustfs-node1 rustfs-node2 rustfs-node3 rustfs-node4; do - STATUS=$(docker inspect "$node" --format='{{.State.Health.Status}}' 2>/dev/null || echo "unknown") - if [ "$STATUS" = "healthy" ]; then - echo -e " ${GREEN}✓ $node is healthy${NC}" - HEALTHY=$((HEALTHY + 1)) - elif [ "$STATUS" = "starting" ]; then - echo -e " ${YELLOW}⚠ $node is starting (wait a moment)${NC}" - else - echo -e " ${RED}✗ $node status: $STATUS${NC}" - fi -done - -if [ "$HEALTHY" -eq 4 ]; then - echo -e "${GREEN}✓ All containers are healthy${NC}" -elif [ "$HEALTHY" -gt 0 ]; then - echo -e "${YELLOW}⚠ $HEALTHY/4 containers are healthy (some may still be starting)${NC}" -else - echo -e "${RED}✗ No containers are healthy${NC}" - exit 1 -fi -echo "" - -# Test 3: Check API endpoints -echo "Test 3: Testing API endpoints..." -PORTS=(9000 9010 9020 9030) -API_SUCCESS=0 -for port in "${PORTS[@]}"; do - if curl -sf http://localhost:${port}/health >/dev/null 2>&1; then - echo -e " ${GREEN}✓ API on port $port is responding${NC}" - API_SUCCESS=$((API_SUCCESS + 1)) - else - echo -e " ${RED}✗ API on port $port is not responding${NC}" - fi -done - -if [ "$API_SUCCESS" -eq 4 ]; then - echo -e "${GREEN}✓ All API endpoints are working${NC}" -else - echo -e "${YELLOW}⚠ $API_SUCCESS/4 API endpoints are working${NC}" -fi -echo "" - -# Test 4: Check Console endpoints -echo "Test 4: Testing Console endpoints..." 
-CONSOLE_PORTS=(9001 9011 9021 9031) -CONSOLE_SUCCESS=0 -for port in "${CONSOLE_PORTS[@]}"; do - if curl -sf http://localhost:${port}/rustfs/console/health >/dev/null 2>&1; then - echo -e " ${GREEN}✓ Console on port $port is responding${NC}" - CONSOLE_SUCCESS=$((CONSOLE_SUCCESS + 1)) - else - echo -e " ${RED}✗ Console on port $port is not responding${NC}" - fi -done - -if [ "$CONSOLE_SUCCESS" -eq 4 ]; then - echo -e "${GREEN}✓ All Console endpoints are working${NC}" -else - echo -e "${YELLOW}⚠ $CONSOLE_SUCCESS/4 Console endpoints are working${NC}" -fi -echo "" - -# Test 5: Check inter-node connectivity -echo "Test 5: Testing inter-node connectivity..." -CONN_SUCCESS=0 -for node in rustfs-node2 rustfs-node3 rustfs-node4; do - if docker exec rustfs-node1 nc -z "$node" 9000 2>/dev/null; then - echo -e " ${GREEN}✓ node1 → $node connection OK${NC}" - CONN_SUCCESS=$((CONN_SUCCESS + 1)) - else - echo -e " ${RED}✗ node1 → $node connection failed${NC}" - fi -done - -if [ "$CONN_SUCCESS" -eq 3 ]; then - echo -e "${GREEN}✓ All inter-node connections are working${NC}" -else - echo -e "${YELLOW}⚠ $CONN_SUCCESS/3 inter-node connections are working${NC}" -fi -echo "" - -# Test 6: Verify data directories -echo "Test 6: Verifying data directories..." 
-DIR_SUCCESS=0 -for i in {1..4}; do - if docker exec rustfs-node1 test -d "/data/rustfs${i}"; then - DIR_SUCCESS=$((DIR_SUCCESS + 1)) - else - echo -e " ${RED}✗ /data/rustfs${i} not found in node1${NC}" - fi -done - -if [ "$DIR_SUCCESS" -eq 4 ]; then - echo -e "${GREEN}✓ All data directories exist${NC}" -else - echo -e "${RED}✗ Only $DIR_SUCCESS/4 data directories exist${NC}" -fi -echo "" - -# Summary -echo "=========================================" -echo "Test Summary" -echo "=========================================" -echo "Containers running: $RUNNING/4" -echo "Healthy containers: $HEALTHY/4" -echo "API endpoints: $API_SUCCESS/4" -echo "Console endpoints: $CONSOLE_SUCCESS/4" -echo "Inter-node connections: $CONN_SUCCESS/3" -echo "Data directories: $DIR_SUCCESS/4" -echo "" - -TOTAL=$((RUNNING + HEALTHY + API_SUCCESS + CONSOLE_SUCCESS + CONN_SUCCESS + DIR_SUCCESS)) -MAX_SCORE=23 - -if [ "$TOTAL" -eq "$MAX_SCORE" ]; then - echo -e "${GREEN}✓ All tests passed! Deployment is working correctly.${NC}" - exit 0 -elif [ "$TOTAL" -ge 20 ]; then - echo -e "${YELLOW}⚠ Most tests passed. Some components may still be starting up.${NC}" - echo " Try running this script again in a few moments." - exit 0 -else - echo -e "${RED}✗ Some tests failed. Check the output above and logs for details.${NC}" - echo " Run 'docker-compose logs' for more information." - exit 1 -fi diff --git a/docs/fix-large-file-upload-freeze.md b/docs/fix-large-file-upload-freeze.md deleted file mode 100644 index d33e6c6a6..000000000 --- a/docs/fix-large-file-upload-freeze.md +++ /dev/null @@ -1,192 +0,0 @@ -# Fix for Large File Upload Freeze Issue - -## Problem Description - -When uploading large files (10GB-20GB) consecutively, uploads may freeze with the following error: - -``` -[2025-11-10 14:29:22.110443 +00:00] ERROR [s3s::service] -AwsChunkedStreamError: Underlying: error reading a body from connection -``` - -## Root Cause Analysis - -### 1. 
Small Default Buffer Size -The issue was caused by using `tokio_util::io::StreamReader::new()` which has a default buffer size of only **8KB**. This is far too small for large file uploads and causes: - -- **Excessive system calls**: For a 10GB file with 8KB buffer, approximately **1.3 million read operations** are required -- **High CPU overhead**: Each read involves AWS chunked encoding/decoding overhead -- **Memory allocation pressure**: Frequent small allocations and deallocations -- **Increased timeout risk**: Slow read pace can trigger connection timeouts - -### 2. AWS Chunked Encoding Overhead -AWS S3 uses chunked transfer encoding which adds metadata to each chunk. With a small buffer: -- More chunks need to be processed -- More metadata parsing operations -- Higher probability of parsing errors or timeouts - -### 3. Connection Timeout Under Load -When multiple large files are uploaded consecutively: -- Small buffers lead to slow data transfer rates -- Network connections may timeout waiting for data -- The s3s library reports "error reading a body from connection" - -## Solution - -Wrap `StreamReader::new()` with `tokio::io::BufReader::with_capacity()` using a 1MB buffer size (`DEFAULT_READ_BUFFER_SIZE = 1024 * 1024`). - -### Changes Made - -Modified three critical locations in `rustfs/src/storage/ecfs.rs`: - -1. **put_object** (line ~2338): Standard object upload -2. **put_object_extract** (line ~376): Archive file extraction and upload -3. **upload_part** (line ~2864): Multipart upload - -### Before -```rust -let body = StreamReader::new( - body.map(|f| f.map_err(|e| std::io::Error::other(e.to_string()))) -); -``` - -### After -```rust -// Use a larger buffer size (1MB) for StreamReader to prevent chunked stream read timeouts -// when uploading large files (10GB+). The default 8KB buffer is too small and causes -// excessive syscalls and potential connection timeouts. 
-let body = tokio::io::BufReader::with_capacity( - DEFAULT_READ_BUFFER_SIZE, - StreamReader::new(body.map(|f| f.map_err(|e| std::io::Error::other(e.to_string())))), -); -``` - -## Performance Impact - -### For a 10GB File Upload: - -| Metric | Before (8KB buffer) | After (1MB buffer) | Improvement | -|--------|--------------------|--------------------|-------------| -| Read operations | ~1,310,720 | ~10,240 | **99.2% reduction** | -| System call overhead | High | Low | Significantly reduced | -| Memory allocations | Frequent small | Less frequent large | More efficient | -| Timeout risk | High | Low | Much more stable | - -### Benefits - -1. **Reduced System Calls**: ~99% reduction in read operations for large files -2. **Lower CPU Usage**: Less AWS chunked encoding/decoding overhead -3. **Better Memory Efficiency**: Fewer allocations and better cache locality -4. **Improved Reliability**: Significantly reduced timeout probability -5. **Higher Throughput**: Better network utilization - -## Testing Recommendations - -To verify the fix works correctly, test the following scenarios: - -1. **Single Large File Upload** - - Upload a 10GB file - - Upload a 20GB file - - Monitor for timeout errors - -2. **Consecutive Large File Uploads** - - Upload 5 files of 10GB each consecutively - - Upload 3 files of 20GB each consecutively - - Ensure no freezing or timeout errors - -3. **Multipart Upload** - - Upload large files using multipart upload - - Test with various part sizes - - Verify all parts complete successfully - -4. 
**Archive Extraction** - - Upload large tar/gzip files with X-Amz-Meta-Snowball-Auto-Extract - - Verify extraction completes without errors - -## Monitoring - -After deployment, monitor these metrics: - -- Upload completion rate for files > 1GB -- Average upload time for large files -- Frequency of chunked stream errors -- CPU usage during uploads -- Memory usage during uploads - -## Related Configuration - -The buffer size is defined in `crates/ecstore/src/set_disk.rs`: - -```rust -pub const DEFAULT_READ_BUFFER_SIZE: usize = 1024 * 1024; // 1 MB -``` - -This value is used consistently across the codebase for stream reading operations. - -## Additional Considerations - -### Implementation Details - -The solution uses `tokio::io::BufReader` to wrap the `StreamReader`, as `tokio-util 0.7.17` does not provide a `StreamReader::with_capacity()` method. The `BufReader` provides the same buffering benefits while being compatible with the current tokio-util version. - -### Adaptive Buffer Sizing (Implemented) - -The fix now includes **dynamic adaptive buffer sizing** based on file size for optimal performance and memory usage: - -```rust -/// Calculate adaptive buffer size based on file size for optimal streaming performance. 
-fn get_adaptive_buffer_size(file_size: i64) -> usize { - match file_size { - // Unknown size or negative (chunked/streaming): use 1MB buffer for safety - size if size < 0 => 1024 * 1024, - // Small files (< 1MB): use 64KB to minimize memory overhead - size if size < 1_048_576 => 65_536, - // Medium files (1MB - 100MB): use 256KB for balanced performance - size if size < 104_857_600 => 262_144, - // Large files (>= 100MB): use 1MB buffer for maximum throughput - _ => 1024 * 1024, - } -} -``` - -**Benefits**: -- **Memory Efficiency**: Small files use smaller buffers (64KB), reducing memory overhead -- **Balanced Performance**: Medium files use 256KB buffers for optimal balance -- **Maximum Throughput**: Large files (100MB+) use 1MB buffers to minimize syscalls -- **Automatic Selection**: Buffer size is chosen automatically based on content-length - -**Performance Impact by File Size**: - -| File Size | Buffer Size | Memory Saved vs Fixed 1MB | Syscalls (approx) | -|-----------|-------------|--------------------------|-------------------| -| 100 KB | 64 KB | 960 KB (94% reduction) | ~2 | -| 10 MB | 256 KB | 768 KB (75% reduction) | ~40 | -| 100 MB | 1 MB | 0 KB (same) | ~100 | -| 10 GB | 1 MB | 0 KB (same) | ~10,240 | - -### Future Improvements - -1. **Connection Keep-Alive**: Ensure HTTP keep-alive is properly configured for consecutive uploads - -2. **Rate Limiting**: Consider implementing upload rate limiting to prevent resource exhaustion - -3. **Configurable Thresholds**: Make buffer size thresholds configurable via environment variables or config file - -### Alternative Approaches Considered - -1. **Increase s3s timeout**: Would only mask the problem, not fix the root cause -2. **Retry logic**: Would increase complexity and potentially make things worse -3. **Connection pooling**: Already handled by underlying HTTP stack -4. 
**Upgrade tokio-util**: Would provide `StreamReader::with_capacity()` but requires testing entire dependency tree - -## References - -- Issue: "Uploading files of 10GB or 20GB consecutively may cause the upload to freeze" -- Error: `AwsChunkedStreamError: Underlying: error reading a body from connection` -- Library: `tokio_util::io::StreamReader` -- Default buffer: 8KB (tokio_util default) -- New buffer: 1MB (`DEFAULT_READ_BUFFER_SIZE`) - -## Conclusion - -This fix addresses the root cause of large file upload freezes by using an appropriately sized buffer for stream reading. The 1MB buffer significantly reduces system call overhead, improves throughput, and eliminates timeout issues during consecutive large file uploads. diff --git a/docs/fix-nosuchkey-regression.md b/docs/fix-nosuchkey-regression.md deleted file mode 100644 index 4a69f2e90..000000000 --- a/docs/fix-nosuchkey-regression.md +++ /dev/null @@ -1,141 +0,0 @@ -# Fix for NoSuchKey Error Response Regression (Issue #901) - -## Problem Statement - -In RustFS version 1.0.69, a regression was introduced where attempting to download a non-existent or deleted object would return a networking error instead of the expected `NoSuchKey` S3 error: - -``` -Expected: Aws::S3::Errors::NoSuchKey -Actual: Seahorse::Client::NetworkingError: "http response body truncated, expected 119 bytes, received 0 bytes" -``` - -## Root Cause Analysis - -The issue was caused by the `CompressionLayer` middleware being applied to **all** HTTP responses, including S3 error responses. The sequence of events that led to the bug: - -1. Client requests a non-existent object via `GetObject` -2. RustFS determines the object doesn't exist -3. The s3s library generates a `NoSuchKey` error response (XML format, ~119 bytes) -4. HTTP headers are written, including `Content-Length: 119` -5. The `CompressionLayer` attempts to compress the error response body -6. 
Due to compression buffering or encoding issues with small payloads, the body becomes empty (0 bytes) -7. The client receives `Content-Length: 119` but the actual body is 0 bytes -8. AWS SDK throws a "truncated body" networking error instead of parsing the S3 error - -## Solution - -The fix implements an intelligent compression predicate (`ShouldCompress`) that excludes certain responses from compression: - -### Exclusion Criteria - -1. **Error Responses (4xx and 5xx)**: Never compress error responses to ensure error details are preserved and transmitted accurately -2. **Small Responses (< 256 bytes)**: Skip compression for very small responses where compression overhead outweighs benefits - -### Implementation Details - -```rust -impl Predicate for ShouldCompress { - fn should_compress<B>(&self, response: &Response<B>) -> bool - where - B: http_body::Body, - { - let status = response.status(); - - // Never compress error responses (4xx and 5xx status codes) - if status.is_client_error() || status.is_server_error() { - debug!("Skipping compression for error response: status={}", status.as_u16()); - return false; - } - - // Check Content-Length header to avoid compressing very small responses - if let Some(content_length) = response.headers().get(http::header::CONTENT_LENGTH) { - if let Ok(length_str) = content_length.to_str() { - if let Ok(length) = length_str.parse::<u64>() { - if length < 256 { - debug!("Skipping compression for small response: size={} bytes", length); - return false; - } - } - } - } - - // Compress successful responses with sufficient size - true - } -} -``` - -## Benefits - -1. **Correctness**: Error responses are now transmitted with accurate Content-Length headers -2. **Compatibility**: AWS SDKs and other S3 clients correctly receive and parse error responses -3. **Performance**: Small responses avoid unnecessary compression overhead -4. 
**Observability**: Debug logging provides visibility into compression decisions - -## Testing - -Comprehensive test coverage was added to prevent future regressions: - -### Test Cases - -1. **`test_get_deleted_object_returns_nosuchkey`**: Verifies that getting a deleted object returns NoSuchKey -2. **`test_head_deleted_object_returns_nosuchkey`**: Verifies HeadObject also returns NoSuchKey for deleted objects -3. **`test_get_nonexistent_object_returns_nosuchkey`**: Tests objects that never existed -4. **`test_multiple_gets_deleted_object`**: Ensures stability across multiple consecutive requests - -### Running Tests - -```bash -# Run the specific test -cargo test --test get_deleted_object_test -- --ignored - -# Or start RustFS server and run tests -./scripts/dev_rustfs.sh -cargo test --test get_deleted_object_test -``` - -## Impact Assessment - -### Affected APIs - -- `GetObject` -- `HeadObject` -- Any S3 API that returns 4xx/5xx error responses - -### Backward Compatibility - -- **No breaking changes**: The fix only affects error response handling -- **Improved compatibility**: Better alignment with S3 specification and AWS SDK expectations -- **No performance degradation**: Small responses were already not compressed by default in most cases - -## Deployment Considerations - -### Verification Steps - -1. Deploy the fix to a staging environment -2. Run the provided Ruby reproduction script to verify the fix -3. Monitor error logs for any compression-related warnings -4. 
Verify that large successful responses are still being compressed - -### Monitoring - -Enable debug logging to observe compression decisions: - -```bash -RUST_LOG=rustfs::server::http=debug -``` - -Look for log messages like: -- `Skipping compression for error response: status=404` -- `Skipping compression for small response: size=119 bytes` - -## Related Issues - -- Issue #901: Regression in exception when downloading non-existent key in alpha 69 -- Commit: 86185703836c9584ba14b1b869e1e2c4598126e0 (getobjectlength fix) - -## References - -- [AWS S3 Error Responses](https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html) -- [tower-http CompressionLayer](https://docs.rs/tower-http/latest/tower_http/compression/index.html) -- [s3s Library](https://github.com/Nugine/s3s) diff --git a/docs/kms/README.md b/docs/kms/README.md deleted file mode 100644 index e8824b10e..000000000 --- a/docs/kms/README.md +++ /dev/null @@ -1,151 +0,0 @@ -# RustFS Key Management Service - -The RustFS Key Management Service (KMS) provides end-to-end key orchestration, envelope encryption, and S3-compatible semantics for encrypted object storage. It sits between the RustFS API surface and the underlying encryption primitives, ensuring that data at rest and in flight remains protected while keeping operational workflows simple. - -## Highlights - -- **Multiple backends** – plug in Vault for production or use the Local filesystem backend for development and CI. -- **Envelope encryption** – master keys protect data-encryption keys (DEKs); DEKs protect object payloads with AES-256-GCM streaming. -- **S3 compatibility** – works transparently with `SSE-S3`, `SSE-KMS`, and `SSE-C` headers so existing tools continue to function. -- **Dynamic lifecycle** – configure, rotate, or swap backends at runtime by calling the admin REST API; no server restart is required. 
-- **Caching & resilience** – built-in caching minimises latency, while health probes, retries, and metrics help operators keep track of the service. - -## Architecture - -``` - ┌──────────────────────────────────────────────────────────┐ - │ RustFS Frontend │ - │ (S3 compatible API, IAM, policy engine, bucket logic) │ - └──────────────┬───────────────────────────────────────────┘ - │ - ▼ - ┌──────────────────────────────┐ - │ Encryption Service Manager │ - │ • Applies admin config │ - │ • Controls backend runtime │ - │ • Exposes metrics / health │ - └──────────────┬──────────────┘ - │ - ┌─────────┴─────────┐ - │ │ - ▼ ▼ - ┌────────────────┐ ┌────────────────────┐ - │ Local Backend │ │ Vault Backend │ - │ • File-based │ │ • Transit engine │ - │ • Dev / CI │ │ • Production ready │ - └────────────────┘ └────────────────────┘ -``` - -### Components at a Glance - -| Component | Responsibility | -|------------------------------|-------------------------------------------------------------------------| -| `rustfs::kms::manager` | Owns backend lifecycle, caching, and key orchestration. | -| `rustfs::kms::encryption` | Encrypts/decrypts payloads, issues data keys, validates headers. | -| Admin REST handlers | Accept configuration requests (`configure`, `start`, `status`, etc.). | -| Backends | `local` (filesystem) and `vault` (Transit) implementations. | - -## Supported Backends - -| Backend | When to use | Key storage | Authentication | Notes | -|---------|-------------|-------------|----------------|-------| -| Local | Development, CI, integration tests | JSON-encoded key blobs on disk | none | Simple, fast to bootstrap, not secure for production. | -| Vault | Production or pre-production | Vault Transit & KV engines | token or AppRole | Supports rotation, audit logging, sealed-state recovery, TLS. 
| - -Refer to [configuration.md](configuration.md) for static configuration details and [dynamic-configuration-guide.md](dynamic-configuration-guide.md) for the runtime workflow. - -## Encryption Workflows - -RustFS KMS supports the same S3 semantics users expect: - -- **SSE-S3** – RustFS manages the data key lifecycle and returns the `x-amz-server-side-encryption` header. -- **SSE-KMS** – RustFS issues per-object data keys bound to the configured KMS backend, exposing the `x-amz-server-side-encryption` header with value `aws:kms`. -- **SSE-C** – Clients provide a 256-bit key and MD5 checksum per request; RustFS uses KMS to encrypt metadata, while encrypted payloads are streamed with the customer key. - -Internally, every object follows the envelope-encryption flow below: - -1. Determine the logical key-id (default, explicit header, or SSE-C customer key). -2. Ask the configured backend for a DEK or encryption context. -3. Stream-encrypt the payload with AES-256-GCM (1 MiB chunking, authenticated headers). -4. Persist metadata (IV, checksum, key-id) alongside object state. -5. During GET/HEAD, the same process runs in reverse with integrity checks. - -## Quick Start - -1. **Build RustFS** – `cargo build --release` or run the project-specific build helper. -2. **Prepare credentials** – ensure you have admin access keys; for Vault, export `VAULT_ADDR` and a root or scoped token. -3. **Launch RustFS** – `./target/release/rustfs server` (KMS starts in `NotConfigured`). -4. 
**Configure the backend**: - - ```bash - # Local backend (ephemeral testing) - awscurl --service s3 --region us-east-1 \ - --access_key admin --secret_key admin \ - -X POST -d '{ - "backend_type": "local", - "key_dir": "/var/lib/rustfs/kms-keys", - "default_key_id": "rustfs-master" - }' \ - http://localhost:9000/rustfs/admin/v3/kms/configure - - awscurl --service s3 --region us-east-1 \ - --access_key admin --secret_key admin \ - -X POST http://localhost:9000/rustfs/admin/v3/kms/start - ``` - - ```bash - # Vault backend (production) - awscurl --service s3 --region us-east-1 \ - --access_key admin --secret_key admin \ - -X POST -d '{ - "backend_type": "vault", - "address": "https://vault.example.com:8200", - "auth_method": { - "token": "s.XYZ..." - }, - "mount_path": "transit", - "kv_mount": "secret", - "key_path_prefix": "rustfs/kms/keys", - "default_key_id": "rustfs-master" - }' \ - https://rustfs.example.com/rustfs/admin/v3/kms/configure - - awscurl --service s3 --region us-east-1 \ - --access_key admin --secret_key admin \ - -X POST https://rustfs.example.com/rustfs/admin/v3/kms/start - ``` - -5. **Verify**: - - ```bash - awscurl --service s3 --region us-east-1 \ - --access_key admin --secret_key admin \ - http://localhost:9000/rustfs/admin/v3/kms/status - ``` - - The response should include `"status": "Running"` and the configured backend summary. - -## Documentation Map - -| Topic | Description | -|-------|-------------| -| [http-api.md](http-api.md) | Formal REST endpoint reference with request/response samples. | -| [dynamic-configuration-guide.md](dynamic-configuration-guide.md) | Gradual rollout, rotation, and failover playbooks. | -| [configuration.md](configuration.md) | Static configuration files, environment variables, Helm/Ansible hints. | -| [api.md](api.md) | Rust crate interfaces (`ConfigureKmsRequest`, `KmsManager`, encryption helpers). 
| -| [sse-integration.md](sse-integration.md) | Mapping between S3 headers and RustFS behaviour, client examples. | -| [security.md](security.md) | Threat model, access control, TLS, auditing, secrets hygiene. | -| [test_suite_integration.md](test_suite_integration.md) | Running e2e, Vault, and regression test suites. | -| [troubleshooting.md](troubleshooting.md) | Common errors and recovery steps. | - -## Terminology - -| Term | Definition | -|------|------------| -| **KMS backend** | Implementation that holds master keys (Local filesystem or Vault Transit/ KV). | -| **Master Key** | Root key stored in the backend; encrypts data keys. | -| **Data Encryption Key (DEK)** | Per-object key that encrypts payload chunks. | -| **Envelope Encryption** | Wrapping DEKs with a higher-level key before persisting. | -| **SSE-S3 / SSE-KMS / SSE-C** | Amazon S3-compatible encryption modes supported by RustFS. | - -For deeper dives continue with the documents referenced above. EOF diff --git a/docs/kms/api.md b/docs/kms/api.md deleted file mode 100644 index 46b158ddb..000000000 --- a/docs/kms/api.md +++ /dev/null @@ -1,169 +0,0 @@ -# RustFS KMS Developer API - -This document targets developers extending RustFS or embedding the KMS primitives directly. The `rustfs-kms` crate exposes building blocks for configuration, backend orchestration, and data-key lifecycle management. - -## Crate Overview - -Add the crate to your workspace (already included in RustFS): - -```toml -[dependencies] -rustfs-kms = { path = "crates/kms" } -``` - -Key namespaces: - -| Module | Purpose | -|--------|---------| -| `rustfs_kms::config` | Typed configuration objects for local/vault backends. | -| `rustfs_kms::manager::KmsManager` | High-level coordinator that proxies operations to a backend. | -| `rustfs_kms::encryption::service::EncryptionService` | Frontend consumed by RustFS S3 handlers. | -| `rustfs_kms::backends` | Backend trait definitions and concrete implementations. 
| -| `rustfs_kms::types` | Request/response DTOs used by the REST handlers and manager. | -| `rustfs_kms::service_manager` | Async runtime that powers `/kms/configure`, `/kms/start`, etc. | - -## Constructing a Configuration - -```rust -use rustfs_kms::config::{BackendConfig, KmsBackend, KmsConfig, LocalConfig, VaultConfig, VaultAuthMethod}; - -let config = KmsConfig { - backend: KmsBackend::Vault, - backend_config: BackendConfig::Vault(VaultConfig { - address: "https://vault.example.com:8200".parse().unwrap(), - auth_method: VaultAuthMethod::Token { token: "s.XYZ".into() }, - namespace: None, - mount_path: "transit".into(), - kv_mount: "secret".into(), - key_path_prefix: "rustfs/kms/keys".into(), - tls: None, - }), - default_key_id: Some("rustfs-master".into()), - timeout: std::time::Duration::from_secs(30), - retry_attempts: 3, - enable_cache: true, - cache_config: Default::default(), -}; -``` - -To build configurations from the admin REST payloads, use `api_types::ConfigureKmsRequest`: - -```rust -use rustfs_kms::api_types::{ConfigureKmsRequest, ConfigureVaultKmsRequest}; - -let request = ConfigureKmsRequest::Vault(ConfigureVaultKmsRequest { - address: "https://vault.example.com:8200".into(), - auth_method: VaultAuthMethod::Token { token: "s.XYZ".into() }, - namespace: None, - mount_path: Some("transit".into()), - kv_mount: Some("secret".into()), - key_path_prefix: Some("rustfs/kms/keys".into()), - default_key_id: Some("rustfs-master".into()), - skip_tls_verify: Some(false), - timeout_seconds: Some(30), - retry_attempts: Some(5), - enable_cache: Some(true), - max_cached_keys: Some(2048), - cache_ttl_seconds: Some(600), -}); - -let kms_config: KmsConfig = (&request).into(); -``` - -## Service Manager Lifecycle - -The admin layer interacts with a `ServiceManager` singleton that wraps `KmsManager`: - -```rust -use rustfs_kms::{init_global_kms_service_manager, get_global_kms_service_manager}; - -let manager = init_global_kms_service_manager(); 
-manager.configure(config).await?; -manager.start().await?; - -let status = manager.get_status().await; // -> KmsServiceStatus::Running -``` - -`get_global_encryption_service()` returns the `EncryptionService` façade that the S3 request handlers call. The service exposes async methods mirroring AWS KMS semantics: - -```rust -use rustfs_kms::types::{CreateKeyRequest, KeyUsage, GenerateDataKeyRequest, KeySpec}; -use rustfs_kms::get_global_encryption_service; - -let service = get_global_encryption_service().await.expect("service not initialised"); - -let create = CreateKeyRequest { - key_name: None, - key_usage: KeyUsage::EncryptDecrypt, - description: Some("project-alpha".into()), - tags: Default::default(), - origin: None, - policy: None, -}; -let created = service.create_key(create).await?; - -let data_key = service - .generate_data_key(GenerateDataKeyRequest { - key_id: created.key_id.clone(), - key_spec: KeySpec::Aes256, - encryption_context: Default::default(), - }) - .await?; -``` - -## Backend Integration Points - -To add a custom backend: - -1. Implement the `KmsBackend` trait (see `crates/kms/src/backends/mod.rs`). -2. Provide conversions from `ConfigureKmsRequest` into your backend’s config struct. -3. Register the backend in `BackendFactory` and extend the admin handlers to accept the new `backend_type` string. - -The trait contract requires implementing methods such as `create_key`, `encrypt`, `decrypt`, `generate_data_key`, `list_keys`, and `health_check`. 
- -```rust -#[async_trait::async_trait] -pub trait KmsBackend: Send + Sync { - async fn create_key(&self, request: CreateKeyRequest) -> Result<CreateKeyResponse>; - async fn encrypt(&self, request: EncryptRequest) -> Result<EncryptResponse>; - async fn decrypt(&self, request: DecryptRequest) -> Result<DecryptResponse>; - async fn generate_data_key(&self, request: GenerateDataKeyRequest) -> Result<GenerateDataKeyResponse>; - async fn describe_key(&self, request: DescribeKeyRequest) -> Result<DescribeKeyResponse>; - async fn list_keys(&self, request: ListKeysRequest) -> Result<ListKeysResponse>; - async fn delete_key(&self, request: DeleteKeyRequest) -> Result<DeleteKeyResponse>; - async fn cancel_key_deletion(&self, request: CancelKeyDeletionRequest) -> Result<CancelKeyDeletionResponse>; - async fn health_check(&self) -> Result<bool>; -} -``` - -## Encryption Pipeline Helpers - -`EncryptionService` contains two methods used by the S3 PUT/GET pipeline: - -- `encrypt_stream` (invoked by `PutObject` and multipart uploads) obtains DEKs, encrypts payload chunks with AES-256-GCM, and returns headers. -- `decrypt_stream` resolves metadata, fetches the required DEK or customer key, and streams plaintext back to the client. - -Both rely on `ObjectCipher` implementations defined in `crates/kms/src/encryption/ciphers.rs`. When adjusting chunk sizes or cipher suites, update these implementations and the SSE documentation. - -## Testing Utilities - -- `rustfs_kms::mock` contains in-memory backends used by unit tests. -- The e2e crate (`crates/e2e_test`) exposes helpers such as `LocalKMSTestEnvironment` and `VaultTestEnvironment` for integration testing. -- Run the full suite: `cargo test --workspace --exclude e2e_test` for unit coverage, `cargo test -p e2e_test kms:: -- --nocapture` for end-to-end validation. - -## Error Handling Conventions - -All public async methods return `rustfs_kms::error::Result<T>`. 
 Errors are categorised as: - -| Variant | Meaning | -|---------|---------| -| `KmsError::Configuration` | Invalid or missing backend configuration. | -| `KmsError::Backend` | Underlying backend failure (Vault error, disk I/O, etc.). | -| `KmsError::Crypto` | Integrity or cryptographic failure. | -| `KmsError::Cache` | Cache lookup or eviction failure. | - -Map these errors to HTTP responses using the helper macros in `rustfs/src/admin/handlers`. - ---- - -For operational workflows continue with [http-api.md](http-api.md) and [dynamic-configuration-guide.md](dynamic-configuration-guide.md). For encryption semantics, see [sse-integration.md](sse-integration.md). diff --git a/docs/kms/configuration.md b/docs/kms/configuration.md deleted file mode 100644 index 0dccf64f3..000000000 --- a/docs/kms/configuration.md +++ /dev/null @@ -1,125 +0,0 @@ -# KMS Configuration Guide - -This guide describes the configuration surfaces for the RustFS Key Management Service. RustFS can be configured statically at process start or dynamically via the admin REST API. Most operators start with a static bootstrap (CLI flags, configuration file, or environment variables) and then rely on dynamic configuration to rotate keys or swap backends. - -## Configuration Sources - -| Mechanism | When to use | Notes | -|---------------------|----------------------------------------------------------|-------| -| CLI flags | Local development, ad-hoc testing | `rustfs server --kms-enable --kms-backend vault ...` | -| Environment vars | Container/Helm/Ansible deployments | Prefix variables with `RUSTFS_` (see table below). | -| Static config file | Templated deployments | Render TOML/YAML with your orchestration tooling, then pass the corresponding flags at startup. | -| Dynamic REST API | Post-start updates without restarting | See [dynamic-configuration-guide.md](dynamic-configuration-guide.md).
| - -## CLI Flags & Environment Variables - -| CLI flag | Env variable | Description | -|-----------------------------|--------------------------------|-------------| -| `--kms-enable` | `RUSTFS_KMS_ENABLE` | Enables KMS at startup. Defaults to `false`. | -| `--kms-backend <local\|vault>` | `RUSTFS_KMS_BACKEND` | Selects the backend implementation. Defaults to `local`. | -| `--kms-key-dir <path>` | `RUSTFS_KMS_KEY_DIR` | Required when `kms-backend=local`; directory that stores wrapped master keys. | -| `--kms-vault-address <url>` | `RUSTFS_KMS_VAULT_ADDRESS` | Vault base URL (e.g. `https://vault.example.com:8200`). | -| `--kms-vault-token <token>` | `RUSTFS_KMS_VAULT_TOKEN` | Token used for Vault authentication. Prefer AppRole or short-lived tokens. | -| `--kms-default-key-id <id>` | `RUSTFS_KMS_DEFAULT_KEY_ID` | Default key used when clients omit `x-amz-server-side-encryption-aws-kms-key-id`. | - -> **Tip:** Even when you plan to reconfigure the backend dynamically, setting `--kms-enable` is useful because it instantiates the global manager eagerly and surfaces better error messages when configuration fails. - -## Static TOML Example (Local Backend) - -```toml -# rustfs.toml -[kms] -enabled = true -backend = "local" -key_dir = "/var/lib/rustfs/kms-keys" -default_key_id = "rustfs-master" -``` - -Render this file using your favourite template tool and translate it to CLI flags when launching RustFS: - -```bash -rustfs server \ - --kms-enable \ - --kms-backend local \ - --kms-key-dir /var/lib/rustfs/kms-keys \ - --kms-default-key-id rustfs-master -``` - -## Static TOML Example (Vault Backend) - -```toml -[kms] -enabled = true -backend = "vault" -vault_address = "https://vault.example.com:8200" -# Supply either a token or render AppRole credentials dynamically -vault_token = "s.XYZ..." 
-default_key_id = "rustfs-master" -``` - -Before starting RustFS, ensure the Vault server is reachable, the Transit and KV v2 engines are enabled, and the default Transit key exists: - -```bash -vault secrets enable transit -vault secrets enable -path=secret kv-v2 -vault write transit/keys/rustfs-master type=aes256-gcm96 -``` - -If you prefer AppRole authentication, omit `vault_token` and supply the AppRole credentials (`role_id` and `secret_id`) through the REST API once RustFS is online (see [dynamic-configuration-guide.md](dynamic-configuration-guide.md)). - -## Backend-Specific Options - -### Local Backend - -| Field | Description | -|------------------|-------------| -| `key_dir` | Directory where wrapped master keys are stored (`*.key` JSON files). Ensure it is backed up securely in persistent deployments. | -| `default_key_id` | Optional; if not provided, SSE-S3 uploads require an explicit header. | -| `file_permissions` (REST only) | Octal permissions applied to generated key files (`0o600` by default). | -| `master_key` (REST only) | Base64-encoded wrapping key used to protect DEKs on disk. Leave unset to generate one automatically. | - -During development you can generate a default key manually: - -```bash -mkdir -p /tmp/rustfs-keys -openssl rand -hex 32 > /tmp/rustfs-keys/rustfs-master.material -``` - -The KMS e2e tests also demonstrate programmatic key creation using the `/kms/keys` API. - -### Vault Backend - -| Field | Description | -|---------------------|-------------| -| `address` | Base URL including scheme. TLS is strongly recommended. | -| `auth_method` | `Token { token: "..." }` or `AppRole { role_id, secret_id }`. Tokens should be renewable or short-lived. | -| `mount_path` | Transit engine mount (default `transit`). | -| `kv_mount` | KV v2 engine used to stash wrapped keys or metadata. | -| `key_path_prefix` | Prefix under the KV mount (e.g. `rustfs/kms/keys`). | -| `namespace` | Vault Enterprise namespace (optional). | -| `skip_tls_verify` | Development convenience; avoid using this in production.
| -| `default_key_id` | Transit key to use when clients omit `x-amz-server-side-encryption-aws-kms-key-id`. | - -## Advanced Runtime Knobs (REST API) - -The dynamic API exposes additional fields not available on the CLI: - -| Field | Purpose | -|-------|---------| -| `timeout_seconds` | Backend operation timeout (defaults to 30s). | -| `retry_attempts` | Number of retries for transient backend failures (defaults to 3). | -| `enable_cache` | Enables in-memory cache of DEKs and metadata. | -| `max_cached_keys` / `cache_ttl_seconds` | Cache size and TTL limits. | - -These options are mostly relevant for large deployments; configure them via the `/kms/configure` REST call once the service is online. - -## Bootstrapping Workflow - -1. Pick a backend (`local` or `vault`). -2. Ensure the required infrastructure is ready (filesystem permissions or Vault engines). -3. Start RustFS with `--kms-enable` and the minimal bootstrap flags. -4. Call the REST API to refine configuration (timeouts, cache, AppRole, etc.). -5. Verify with `/kms/status` and issue a test `PutObject` using SSE headers. -6. Record the configuration in your infra-as-code tooling for repeatability. - -For runtime reconfiguration (rotating keys, swapping from local to Vault) follow the step-by-step guide in [dynamic-configuration-guide.md](dynamic-configuration-guide.md). diff --git a/docs/kms/dynamic-configuration-guide.md b/docs/kms/dynamic-configuration-guide.md deleted file mode 100644 index 32cc78c23..000000000 --- a/docs/kms/dynamic-configuration-guide.md +++ /dev/null @@ -1,155 +0,0 @@ -# Dynamic KMS Configuration Playbook - -RustFS exposes a first-class admin REST API that allows you to configure, start, stop, and reconfigure the KMS subsystem without restarting the server. This document walks through common operational scenarios. - -## Prerequisites - -- RustFS is running and reachable on the admin endpoint (typically `http(s)://<host>/rustfs/admin/v3`). 
-- You have admin access and credentials (access key/secret or session token) with the `ServerInfoAdminAction` permission. -- Optional: `awscurl` or another SigV4-aware HTTP client to sign admin requests. - -Before starting, confirm the KMS service manager is initialised: - -```bash -awscurl --service s3 --region us-east-1 \ - --access_key admin --secret_key admin \ - http://localhost:9000/rustfs/admin/v3/kms/status -``` - -The initial response shows `"status": "NotConfigured"`. - -## Initial Configuration Flow - -1. **Submit the configuration** - ```bash - awscurl --service s3 --region us-east-1 \ - --access_key admin --secret_key admin \ - -X POST -d '{ - "backend_type": "local", - "key_dir": "/var/lib/rustfs/kms-keys", - "default_key_id": "rustfs-master", - "enable_cache": true, - "cache_ttl_seconds": 900 - }' \ - http://localhost:9000/rustfs/admin/v3/kms/configure - ``` - -2. **Start the service** - ```bash - awscurl --service s3 --region us-east-1 \ - --access_key admin --secret_key admin \ - -X POST http://localhost:9000/rustfs/admin/v3/kms/start - ``` - -3. **Verify** - ```bash - awscurl --service s3 --region us-east-1 \ - --access_key admin --secret_key admin \ - http://localhost:9000/rustfs/admin/v3/kms/status - ``` - - Look for `"status": "Running"` and a backend summary. - -## Switching to Vault - -To migrate from the local backend to Vault: - -1. Prepare Vault: - ```bash - vault secrets enable transit - vault secrets enable -path=secret kv-v2 - vault write transit/keys/rustfs-master type=aes256-gcm96 - ``` - -2. Configure the new backend without stopping service: - ```bash - awscurl --service s3 --region us-east-1 \ - --access_key admin --secret_key admin \ - -X POST -d '{ - "backend_type": "vault", - "address": "https://vault.example.com:8200", - "auth_method": { "approle": { "role_id": "...", "secret_id": "..." 
} }, - "mount_path": "transit", - "kv_mount": "secret", - "key_path_prefix": "rustfs/kms/keys", - "default_key_id": "rustfs-master", - "retry_attempts": 5, - "timeout_seconds": 60 - }' \ - http://localhost:9000/rustfs/admin/v3/kms/reconfigure - ``` - -3. Confirm the new backend is active via `/kms/status`. - -4. Run test uploads with `SSE-KMS` headers to ensure the new backend is serving requests. - -## Rotating the Default Key - -1. **Create a new key** using the key management API: - ```bash - awscurl --service s3 --region us-east-1 \ - --access_key admin --secret_key admin \ - -X POST -d '{ "KeyUsage": "ENCRYPT_DECRYPT", "Description": "rotation-2024-09" }' \ - http://localhost:9000/rustfs/admin/v3/kms/keys - ``` - - Record the `key_id` returned in the response (a UUID). The `Description` is only a label and cannot be used as a key identifier. - -2. **Set it as default** via `reconfigure`, passing the `key_id` from step 1: - ```bash - awscurl --service s3 --region us-east-1 \ - --access_key admin --secret_key admin \ - -X POST -d '{ - "backend_type": "vault", - "default_key_id": "<key_id from step 1>" - }' \ - http://localhost:9000/rustfs/admin/v3/kms/reconfigure - ``` - - Only the fields supplied in the payload are updated; omitted fields keep their previous values. - -3. **Validate** by uploading a new object and checking that `x-amz-server-side-encryption-aws-kms-key-id` reports the new key. - -## Rolling Cache or Timeout Changes - -Caching knobs help tune latency. To adjust them at runtime: - -```bash -awscurl --service s3 --region us-east-1 \ - --access_key admin --secret_key admin \ - -X POST -d '{ - "enable_cache": true, - "max_cached_keys": 2048, - "cache_ttl_seconds": 600, - "timeout_seconds": 20 - }' \ - http://localhost:9000/rustfs/admin/v3/kms/reconfigure -``` - -## Pausing the KMS Service - -Stopping the service keeps configuration in place but disables new KMS operations. Existing SSE objects remain accessible only if their metadata allows offline decryption.
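The configure, start, reconfigure, and stop calls in this playbook behave like a small state machine. The sketch below is a hedged model, not RustFS code: `ServiceState`, `AdminOp`, and `apply` are hypothetical, the state names follow the `status` values shown in the HTTP API reference (`NotConfigured`, `Configured`, `Running`), and rejecting `/kms/configure` while running is an assumption based on the documented `409 Conflict` ("KMS already running").

```rust
/// Hypothetical model of the KMS service lifecycle; not RustFS's own type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceState {
    NotConfigured,
    Configured,
    Running,
}

/// Admin operations exposed under /rustfs/admin/v3/kms/*.
#[derive(Debug, Clone, Copy)]
pub enum AdminOp {
    Configure,
    Reconfigure,
    Start,
    Stop,
}

/// Returns the next state, or an error message for a call that would be rejected.
pub fn apply(state: ServiceState, op: AdminOp) -> Result<ServiceState, &'static str> {
    use AdminOp::*;
    use ServiceState::*;
    match (state, op) {
        // Assumption: initial configuration is refused once the backend is running.
        (NotConfigured | Configured, Configure) => Ok(Configured),
        (Running, Configure) => Err("KMS already running; use /kms/reconfigure"),
        // Reconfigure merges into an existing configuration and keeps the current state.
        (NotConfigured, Reconfigure) => Err("nothing to reconfigure"),
        (s, Reconfigure) => Ok(s),
        // Starting requires a configuration; starting again is treated as a no-op.
        (NotConfigured, Start) => Err("KMS is not configured"),
        (Configured | Running, Start) => Ok(Running),
        // Stopping keeps the configuration in place.
        (Configured | Running, Stop) => Ok(Configured),
        (NotConfigured, Stop) => Err("KMS is not configured"),
    }
}
```

Modelling the transitions explicitly makes the rollback advice easy to reason about: `Stop` never discards configuration, so a later `Start` resumes with the last applied settings.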
- -```bash -awscurl --service s3 --region us-east-1 \ - --access_key admin --secret_key admin \ - -X POST http://localhost:9000/rustfs/admin/v3/kms/stop -``` - -Restart later with `/kms/start`. - -## Automation Tips - -- Wrap REST calls in an idempotent script (see `scripts/` for examples) so you can re-run configuration safely. -- Use `--test-threads=1` when running KMS e2e suites in CI; they spin up real servers and Vault instances. -- In Kubernetes, run the configuration script as an init job that waits for both RustFS and Vault readiness before calling `/kms/configure`. -- Emit events to your observability platform: successful reconfigurations generate structured logs with the backend summary. - -## Rollback Strategy - -If a new configuration introduces errors: - -1. Call `/kms/reconfigure` with the previous payload (keep a snapshot in version control). -2. If the backend is unreachable, call `/kms/stop` to protect data from partial writes. -3. Investigate logs under `rustfs::kms::*` and Vault audit logs. -4. Once the issue is resolved, reapply the desired configuration and restart. - -Dynamic configuration makes backend maintenance safe and repeatable—ensure every change is scripted and traceable. diff --git a/docs/kms/frontend-api-guide-zh.md b/docs/kms/frontend-api-guide-zh.md deleted file mode 100644 index 2b919ac0e..000000000 --- a/docs/kms/frontend-api-guide-zh.md +++ /dev/null @@ -1,372 +0,0 @@ -# RustFS KMS Frontend Integration Guide - -This document targets frontend engineers who need to integrate with the RustFS Key Management Service (KMS). It provides a complete API reference, usage notes, and example implementations. - -## 📋 Contents - -1. [Quick Start](#quick-start) -2. [Authentication & Permissions](#authentication--permissions) -3. [API Catalog](#api-catalog) -4. [Service Management APIs](#service-management-apis) -5. [Key Management APIs](#key-management-apis) -6. [Data Encryption APIs](#data-encryption-apis) -7. 
[Bucket Encryption Configuration APIs](#bucket-encryption-configuration-apis) -8. [Monitoring & Cache APIs](#monitoring--cache-apis) -9. [Common Error Codes](#common-error-codes) -10. [Data Types](#data-types) -11. [Implementation Examples](#implementation-examples) - -## Quick Start - -### Base configuration - -| Setting | Value | -|---------|-------| -| **Base URL** | `http://localhost:9000/rustfs/admin/v3` (local development) | -| **Production URL** | `https://your-rustfs-domain.com/rustfs/admin/v3` | -| **Request format** | `application/json` | -| **Response format** | `application/json` | -| **Authentication** | AWS Signature Version 4 | -| **Encoding** | UTF-8 | - -### Common request headers - -| Header | Required | Value | -|--------|----------|-------| -| `Content-Type` | ✅ | `application/json` | -| `Authorization` | ✅ | `AWS4-HMAC-SHA256 Credential=...` | -| `X-Amz-Date` | ✅ | ISO 8601 timestamp | - -## Authentication & Permissions - -### Required IAM permissions - -Clients must have `ServerInfoAdminAction` to invoke KMS APIs. - -### AWS SigV4 signing - -All requests must be signed with SigV4. 
- -- **Access Key ID** – account access key -- **Secret Access Key** – corresponding secret key -- **Region** – `us-east-1` -- **Service** – `s3` - -## API Catalog - -### Service management - -| Method | Path | Description | Status | -|--------|------|-------------|--------| -| `POST` | `/kms/configure` | Configure the KMS service | ✅ Available | -| `POST` | `/kms/start` | Start the service | ✅ Available | -| `POST` | `/kms/stop` | Stop the service | ✅ Available | -| `GET` | `/kms/service-status` | Retrieve service status | ✅ Available | -| `POST` | `/kms/reconfigure` | Reconfigure and restart | ✅ Available | - -### Key management - -| Method | Path | Description | Status | -|--------|------|-------------|--------| -| `POST` | `/kms/keys` | Create a master key | ✅ Available | -| `GET` | `/kms/keys` | List keys | ✅ Available | -| `GET` | `/kms/keys/{key_id}` | Get key metadata | ✅ Available | -| `DELETE` | `/kms/keys/delete` | Schedule key deletion | ✅ Available | -| `POST` | `/kms/keys/cancel-deletion` | Cancel key deletion | ✅ Available | - -### Data encryption - -| Method | Path | Description | Status | -|--------|------|-------------|--------| -| `POST` | `/kms/generate-data-key` | Generate a data key | ✅ Available | -| `POST` | `/kms/decrypt` | Decrypt a data key | ⚠️ Not implemented | - -### Bucket encryption configuration - -| Method | Path | Description | Status | -|--------|------|-------------|--------| -| `GET` | `/api/v1/buckets` | List buckets | ✅ Available | -| `GET` | `/api/v1/bucket-encryption/{bucket}` | Get default encryption | ✅ Available | -| `PUT` | `/api/v1/bucket-encryption/{bucket}` | Set default encryption | ✅ Available | -| `DELETE` | `/api/v1/bucket-encryption/{bucket}` | Remove default encryption | ✅ Available | - -### Monitoring & cache - -| Method | Path | Description | Status | -|--------|------|-------------|--------| -| `GET` | `/kms/config` | Retrieve KMS configuration | ✅ Available | -| `POST` | `/kms/clear-cache` | Clear
the KMS cache | ✅ Available | - -### Legacy compatibility endpoints - -| Method | Path | Description | Status | -|--------|------|-------------|--------| -| `POST` | `/kms/create-key` | Create key (legacy) | ✅ Available | -| `GET` | `/kms/describe-key` | Describe key (legacy) | ✅ Available | -| `GET` | `/kms/list-keys` | List keys (legacy) | ✅ Available | -| `GET` | `/kms/status` | KMS status (legacy) | ✅ Available | - -> ✅ **Available** – implemented and usable. -> ⚠️ **Not implemented** – API shape defined but backend missing. -> Prefer the new endpoints; legacy routes exist for backwards compatibility. - -## Service Management APIs - -### 1. Configure KMS - -`POST /kms/configure` - -Parameters: - -| Name | Type | Required | Description | -|------|------|----------|-------------| -| `backend_type` | string | ✅ | `"local"` or `"vault"` | -| `key_directory` | string | Cond. | Local backend key directory | -| `default_key_id` | string | ✅ | Default master key ID | -| `enable_cache` | boolean | ❌ | Toggle cache (default `true`) | -| `cache_ttl_seconds` | integer | ❌ | Cache TTL (default `600`) | -| `timeout_seconds` | integer | ❌ | Operation timeout (default `30`) | -| `retry_attempts` | integer | ❌ | Retry attempts (default `3`) | -| `address` | string | Cond. | Vault server address | -| `auth_method` | object | Cond. | Vault auth config | -| `mount_path` | string | Cond. | Vault transit mount path | -| `kv_mount` | string | Cond. | Vault KV mount | -| `key_path_prefix` | string | Cond. | Vault key prefix | - -Vault `auth_method` fields: - -| Name | Type | Required | Description | -|------|------|----------|-------------| -| `token` | string | ✅ | Vault token | - -Response -```json -{ - "success": boolean, - "message": string, - "config_id": string? -} -``` - -### 2. Start KMS - -`POST /kms/start` - -Response fields: `success`, `message`, `status` (`Running`, `Stopped`, `Error`). - -### 3. Stop KMS - -`POST /kms/stop` - -Same response structure as `/kms/start`. 
- -### 4. Service status - -`GET /kms/service-status` - -Response -```json -{ - "status": "Running" | "Stopped" | "NotConfigured" | "Error", - "backend_type": "local" | "vault", - "healthy": boolean, - "config_summary": { - "backend_type": string, - "default_key_id": string, - "timeout_seconds": integer, - "retry_attempts": integer, - "enable_cache": boolean - } -} -``` - -### 5. Reconfigure - -`POST /kms/reconfigure` - -Accepts the same payload as `/kms/configure` and restarts the service. - -## Key Management APIs - -### 1. Create key - -`POST /kms/keys` - -Parameters: - -| Name | Type | Required | Description | -|------|------|----------|-------------| -| `KeyUsage` | string | ✅ | `"ENCRYPT_DECRYPT"` | -| `Description` | string | ❌ | Description (≤256 chars) | -| `Tags` | object | ❌ | Key/value tag map | - -Response includes `key_id` and `key_metadata` (enabled, usage, creation date, etc.). - -### 2. Key metadata - -`GET /kms/keys/{key_id}` returns the `key_metadata` object. - -### 3. List keys - -`GET /kms/keys?limit=&marker=` with pagination support. - -### 4. Schedule deletion - -`DELETE /kms/keys/delete` - -Parameters: `key_id`, optional `pending_window_in_days` (7–30, default 7). - -### 5. Cancel deletion - -`POST /kms/keys/cancel-deletion` - -Provide `key_id`; response returns updated metadata with `deletion_date = null`. - -## Data Encryption APIs - -### 1. Generate data key - -`POST /kms/generate-data-key` - -Parameters: `key_id`, optional `key_spec` (`AES_256` or `AES_128`), optional `encryption_context` map. - -Response contains `plaintext_key` (Base64) and `ciphertext_blob` (Base64). - -### 2. Decrypt data key - -`POST /kms/decrypt` - -> ⚠️ Not yet implemented. Expect parameters `ciphertext_blob` and optional `encryption_context`. A future response will expose `key_id` and `plaintext`. - -## Bucket Encryption Configuration APIs - -RustFS implements the standard S3 bucket encryption operations; call them through the AWS SDK. - -### 1. List buckets - -Use `ListBuckets` from the AWS SDK.
- -### 2. Get default encryption - -`GetBucketEncryption` returns SSE rules (`SSEAlgorithm`, optional `KMSMasterKeyID`). A 404 indicates no configuration. - -### 3. Set default encryption - -`PutBucketEncryption` supports SSE-S3 (`AES256`) or SSE-KMS (`aws:kms` + key ID). - -### 4. Delete default encryption - -`DeleteBucketEncryption` removes the configuration. - -Example composable and helper utilities are provided in the original Chinese document; port them as needed. - -## Monitoring & Cache APIs - -### 1. Get KMS config - -`GET /kms/config` returns backend, cache settings, and default key ID. - -### 2. Clear cache - -`POST /kms/clear-cache` invalidates cached key metadata. - -### 3. Legacy status - -`GET /kms/status` (legacy) provides cache hit/miss stats. - -## Common Error Codes - -### HTTP status codes - -| Code | Error | Description | -|------|-------|-------------| -| 200 | – | Success | -| 400 | `InvalidRequest` | Bad request or parameters | -| 401 | `AccessDenied` | Authentication failure | -| 403 | `AccessDenied` | Authorization failure | -| 404 | `NotFound` | Resource not found | -| 409 | `Conflict` | Resource conflict | -| 500 | `InternalError` | Server error | - -### Error payload - -```json -{ - "error": { - "code": string, - "message": string, - "request_id": string? 
- } -} -``` - -### Specific codes - -- `InvalidRequest` – check payload -- `AccessDenied` – verify credentials/permissions -- `KeyNotFound` – key ID incorrect -- `InvalidKeyState` – key disabled or invalid -- `ServiceNotConfigured` – configure KMS first -- `ServiceNotRunning` – start the service -- `BackendError` – backend failure -- `EncryptionFailed` / `DecryptionFailed` – inspect ciphertext/context - -## Data Types - -### `KeyMetadata` - -| Field | Type | Description | -|-------|------|-------------| -| `key_id` | string | UUID | -| `description` | string | Key description | -| `enabled` | boolean | Whether the key is enabled | -| `key_usage` | string | Always `ENCRYPT_DECRYPT` | -| `creation_date` | string | ISO 8601 timestamp | -| `rotation_enabled` | boolean | Rotation status | -| `deletion_date` | string? | Scheduled deletion timestamp | - -### `ConfigSummary` - -| Field | Type | Description | -|-------|------|-------------| -| `backend_type` | string | `local` or `vault` | -| `default_key_id` | string | Default master key | -| `timeout_seconds` | integer | Operation timeout | -| `retry_attempts` | integer | Retry attempts | -| `enable_cache` | boolean | Cache toggle | - -### Enumerations - -- `ServiceStatus` – `Running`, `Stopped`, `NotConfigured`, `Error` -- `BackendType` – `local`, `vault` -- `KeyUsage` – `ENCRYPT_DECRYPT` -- `KeySpec` – `AES_256`, `AES_128` - -## Implementation Examples - -The original guide included extensive code samples covering bucket encryption flows, Vue/React composables, and full application scaffolding. The key patterns are: - -1. **Signed requests** – Use AWS SigV4 (via AWS SDK or manual signing) to call `/rustfs/admin/v3` endpoints. -2. **Multipart encryption flow** – Request a data key, encrypt data locally, upload ciphertext, and store the encrypted key blob. -3. **Bucket encryption lifecycle** – Use the S3 SDK to configure default SSE policies, optionally provisioning dedicated KMS keys per bucket. -4. 
**Health monitoring** – Periodically poll `/kms/status` or `/kms/config` to ensure the service is healthy and cache hit ratios remain acceptable. - -## Troubleshooting & Support - -If issues arise: - -1. Verify the KMS service is healthy via `/kms/service-status`. -2. Confirm Vault or local backend configuration. -3. Inspect server logs for detailed error messages. -4. Run `cargo test -p e2e_test kms:: -- --nocapture` to validate the setup. -5. Ensure your AWS SDK version supports the required S3/KMS calls. - -Common questions: - -- **Bucket encryption fails with insufficient permissions** – Ensure the IAM policy grants `s3:GetBucketEncryption`, `s3:PutBucketEncryption`, `s3:DeleteBucketEncryption`, and (for SSE-KMS) `kms:DescribeKey`. -- **Unable to select a KMS key** – Confirm the KMS service is running, the key is enabled, and `KeyUsage` is `ENCRYPT_DECRYPT`. -- **Frontend shows incorrect encryption state** – A 404 during `GetBucketEncryption` is normal (no configuration). Allow for network latency before refreshing the status. - ---- - -_Last updated: 2024-09-22_ diff --git a/docs/kms/http-api.md b/docs/kms/http-api.md deleted file mode 100644 index 4b294108e..000000000 --- a/docs/kms/http-api.md +++ /dev/null @@ -1,248 +0,0 @@ -# KMS Admin HTTP API Reference - -The RustFS KMS admin API is exposed under the admin prefix (`/rustfs/admin/v3`). Requests must be signed with SigV4 credentials that have the `ServerInfoAdminAction` permission. All request and response bodies use JSON, and all endpoints return standard HTTP status codes. - -- Base URL examples: `http://localhost:9000/rustfs/admin/v3`, `https://rustfs.example.com/rustfs/admin/v3`. -- Headers: set `Content-Type: application/json` for requests with bodies. -- Authentication: sign with SigV4 (`awscurl`, `aws-signature-v4`, or the official SDKs). 
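Signing is best left to `awscurl` or an SDK, but it helps to know which values end up in the `Authorization` header. The sketch below assembles only the outer structure defined by AWS Signature Version 4 (the credential scope and the header layout); it assumes the signature itself, an HMAC-SHA256 chain over the canonical request, has already been computed elsewhere. The service name `s3` matches the `awscurl --service s3` examples in these docs, and the function names are hypothetical.

```rust
/// Builds the SigV4 credential scope: <yyyymmdd>/<region>/<service>/aws4_request.
/// `amz_date` is the X-Amz-Date value, e.g. "20240918T071042Z".
pub fn credential_scope(amz_date: &str, region: &str, service: &str) -> Option<String> {
    // The scope uses only the date part (first 8 characters) of X-Amz-Date.
    let date = amz_date.get(..8)?;
    if !date.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    Some(format!("{date}/{region}/{service}/aws4_request"))
}

/// Assembles the Authorization header value from an already-computed signature.
/// Signed header names are lowercased and sorted, then joined with ';'.
pub fn authorization_header(
    access_key: &str,
    scope: &str,
    signed_headers: &[&str],
    signature_hex: &str,
) -> String {
    let mut names: Vec<String> = signed_headers
        .iter()
        .map(|h| h.to_ascii_lowercase())
        .collect();
    names.sort();
    format!(
        "AWS4-HMAC-SHA256 Credential={access_key}/{scope}, SignedHeaders={}, Signature={signature_hex}",
        names.join(";")
    )
}
```

If the server rejects a request with a signature mismatch, comparing the region and service inside this scope against what the server expects is a quick first check.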
- -## Service Lifecycle - -| Endpoint | Method | Description | -|----------|--------|-------------| -| `/kms/configure` | POST | Apply the initial backend configuration. Does not start the service. | -| `/kms/reconfigure` | POST | Merge a new configuration on top of the existing one. | -| `/kms/start` | POST | Start the configured backend. | -| `/kms/stop` | POST | Stop the backend; configuration is kept. | -| `/kms/status` | GET | Lightweight status summary (`Running`, `Configured`, etc.). | -| `/kms/service-status` | GET | Backward-compatible alias for `/kms/status`. | -| `/kms/config` | GET | Returns the cached configuration summary. | -| `/kms/clear-cache` | POST | Clears in-memory DEK and metadata caches. | - -### Configure / Reconfigure - -**Request** -```json -{ - "backend_type": "vault", - "address": "https://vault.example.com:8200", - "auth_method": { "token": "s.XYZ" }, - "mount_path": "transit", - "kv_mount": "secret", - "key_path_prefix": "rustfs/kms/keys", - "default_key_id": "rustfs-master", - "enable_cache": true, - "cache_ttl_seconds": 600, - "timeout_seconds": 30, - "retry_attempts": 3 -} -``` - -**Response** -```json -{ - "success": true, - "message": "KMS configured successfully", - "status": "Configured" -} -``` - -> **Partial updates:** `/kms/reconfigure` updates only the fields present in the payload. Use this to rotate tokens or adjust cache parameters without resubmitting the full configuration. 
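The partial-update rule can be sketched as an `Option`-based merge. This is a minimal illustration assuming each optional JSON field is deserialised into an `Option` and only `Some` values overwrite the stored configuration; `KmsSettings` and `ReconfigurePatch` are hypothetical types covering a subset of the documented fields, not RustFS's own structs.

```rust
/// Hypothetical subset of the KMS runtime settings (not a RustFS type).
#[derive(Debug, Clone, PartialEq)]
pub struct KmsSettings {
    pub default_key_id: String,
    pub enable_cache: bool,
    pub cache_ttl_seconds: u64,
    pub timeout_seconds: u64,
    pub retry_attempts: u32,
}

/// A `/kms/reconfigure` body: `None` means the field was omitted from the JSON.
#[derive(Debug, Default)]
pub struct ReconfigurePatch {
    pub default_key_id: Option<String>,
    pub enable_cache: Option<bool>,
    pub cache_ttl_seconds: Option<u64>,
    pub timeout_seconds: Option<u64>,
    pub retry_attempts: Option<u32>,
}

impl KmsSettings {
    /// Overwrites only the fields present in the patch; all others keep their values.
    pub fn merge(&mut self, patch: ReconfigurePatch) {
        if let Some(v) = patch.default_key_id {
            self.default_key_id = v;
        }
        if let Some(v) = patch.enable_cache {
            self.enable_cache = v;
        }
        if let Some(v) = patch.cache_ttl_seconds {
            self.cache_ttl_seconds = v;
        }
        if let Some(v) = patch.timeout_seconds {
            self.timeout_seconds = v;
        }
        if let Some(v) = patch.retry_attempts {
            self.retry_attempts = v;
        }
    }
}
```

With this shape, a body such as `{ "cache_ttl_seconds": 900, "timeout_seconds": 20 }` changes only those two values and leaves `default_key_id`, `enable_cache`, and `retry_attempts` untouched.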
- -### Start / Stop - -**Start response** -```json -{ - "success": true, - "message": "KMS service started successfully", - "status": "Running" -} -``` - -**Stop response** -```json -{ - "success": true, - "message": "KMS service stopped successfully", - "status": "Configured" -} -``` - -### Status & Config - -`GET /kms/status` -```json -{ - "status": "Running", - "backend_type": "vault", - "healthy": true, - "config_summary": { - "backend_type": "vault", - "default_key_id": "rustfs-master", - "timeout_seconds": 30, - "retry_attempts": 3, - "enable_cache": true, - "cache_summary": { - "max_keys": 1024, - "ttl_seconds": 600, - "enable_metrics": true - }, - "backend_summary": { - "backend_type": "vault", - "address": "https://vault.example.com:8200", - "auth_method_type": "token", - "namespace": null, - "mount_path": "transit", - "kv_mount": "secret", - "key_path_prefix": "rustfs/kms/keys" - } - } -} -``` - -`GET /kms/config` -```json -{ - "backend": "vault", - "cache_enabled": true, - "cache_max_keys": 1024, - "cache_ttl_seconds": 600, - "default_key_id": "rustfs-master" -} -``` - -`POST /kms/clear-cache` returns HTTP `204` with an empty body when successful. - -## Key Management - -| Endpoint | Method | Description | -|----------|--------|-------------| -| `/kms/keys` | POST | Create a new master key in the backend. | -| `/kms/keys` | GET | List master keys (paginated). | -| `/kms/keys/{key_id}` | GET | Retrieve metadata for a specific key. | -| `/kms/keys/delete` | DELETE | Schedule key deletion. | -| `/kms/keys/cancel-deletion` | POST | Cancel a pending deletion request. 
| - -### Create Key - -**Request** -```json -{ - "KeyUsage": "ENCRYPT_DECRYPT", - "Description": "project-alpha", - "Tags": { - "owner": "security", - "env": "prod" - } -} -``` - -**Response** -```json -{ - "key_id": "fa5bac0e-2a2c-4f9a-a09d-2f5b8a59ed85", - "key_metadata": { - "key_id": "fa5bac0e-2a2c-4f9a-a09d-2f5b8a59ed85", - "description": "project-alpha", - "enabled": true, - "key_usage": "ENCRYPT_DECRYPT", - "creation_date": "2024-09-18T07:10:42.012345Z", - "rotation_enabled": false - } -} -``` - -### List Keys - -`GET /kms/keys?limit=50&marker=<token>` -```json -{ - "keys": [ - { "key_id": "fa5bac0e-2a2c-4f9a-a09d-2f5b8a59ed85", "description": "project-alpha" } - ], - "truncated": false, - "next_marker": null -} -``` - -### Describe Key - -`GET /kms/keys/fa5bac0e-2a2c-4f9a-a09d-2f5b8a59ed85` -```json -{ - "key_metadata": { - "key_id": "fa5bac0e-2a2c-4f9a-a09d-2f5b8a59ed85", - "description": "project-alpha", - "enabled": true, - "key_usage": "ENCRYPT_DECRYPT", - "creation_date": "2024-09-18T07:10:42.012345Z", - "deletion_date": null - } -} -``` - -### Delete & Cancel Deletion - -**Delete request** -```json -{ - "key_id": "fa5bac0e-2a2c-4f9a-a09d-2f5b8a59ed85", - "pending_window_in_days": 7 -} -``` - -**Cancel deletion** -```json -{ - "key_id": "fa5bac0e-2a2c-4f9a-a09d-2f5b8a59ed85" -} -``` - -Both endpoints respond with the updated `key_metadata`. - -## Data Key Operations - -`POST /kms/generate-data-key` - -**Request** -```json -{ - "key_id": "fa5bac0e-2a2c-4f9a-a09d-2f5b8a59ed85", - "key_spec": "AES_256", - "encryption_context": { - "bucket": "analytics-data", - "object": "2024/09/18/report.parquet" - } -} -``` - -**Response** -```json -{ - "key_id": "fa5bac0e-2a2c-4f9a-a09d-2f5b8a59ed85", - "plaintext_key": "sQW6qt0yS7CqD6c8hY7GZg==", - "ciphertext_blob": "gAAAAABlLK..." -} -``` - -- `plaintext_key` is Base64-encoded and must be zeroised after use. -- `ciphertext_blob` can be stored alongside object metadata for future re-wraps. 
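Client-side handling of `plaintext_key` amounts to three steps: decode it, check its length against the requested `key_spec`, and overwrite the buffer when you are done. The std-only sketch below uses an illustrative Base64 decoder and a hypothetical `zeroize_bytes` helper; production code would more likely use maintained crates such as `base64` and `zeroize`. Note that the sample `plaintext_key` above is 24 Base64 characters ending in `==`, which decodes to 16 bytes (an AES-128 key length), whereas an `AES_256` data key should decode to 32 bytes, so treat the sample value as a placeholder.

```rust
use std::sync::atomic::{compiler_fence, Ordering};

/// Maps one standard-alphabet Base64 character to its 6-bit value.
fn b64_value(c: u8) -> Option<u8> {
    match c {
        b'A'..=b'Z' => Some(c - b'A'),
        b'a'..=b'z' => Some(c - b'a' + 26),
        b'0'..=b'9' => Some(c - b'0' + 52),
        b'+' => Some(62),
        b'/' => Some(63),
        _ => None,
    }
}

/// Decodes padded standard Base64 (RFC 4648); returns None on malformed input.
pub fn decode_base64(input: &str) -> Option<Vec<u8>> {
    let bytes = input.as_bytes();
    if bytes.len() % 4 != 0 {
        return None;
    }
    let mut out = Vec::with_capacity(bytes.len() / 4 * 3);
    for (i, chunk) in bytes.chunks(4).enumerate() {
        let is_last = i == bytes.len() / 4 - 1;
        // Padding may only appear at the end of the final 4-character group.
        let pad = chunk.iter().rev().take_while(|&&b| b == b'=').count();
        if pad > 2 || (pad > 0 && !is_last) {
            return None;
        }
        let mut acc: u32 = 0;
        for &b in &chunk[..4 - pad] {
            acc = (acc << 6) | u32::from(b64_value(b)?);
        }
        acc <<= 6 * pad as u32;
        let decoded = [(acc >> 16) as u8, (acc >> 8) as u8, acc as u8];
        out.extend_from_slice(&decoded[..3 - pad]);
    }
    Some(out)
}

/// Expected plaintext length in bytes for the key specs listed in the API docs.
pub fn expected_key_len(key_spec: &str) -> Option<usize> {
    match key_spec {
        "AES_256" => Some(32),
        "AES_128" => Some(16),
        _ => None,
    }
}

/// Overwrites a key buffer with zeros using volatile writes so the
/// compiler cannot drop the stores as dead code.
pub fn zeroize_bytes(buf: &mut [u8]) {
    for byte in buf.iter_mut() {
        // SAFETY: `byte` is a valid, aligned, exclusive reference.
        unsafe { std::ptr::write_volatile(byte, 0) };
    }
    compiler_fence(Ordering::SeqCst);
}
```

Rejecting a key whose decoded length does not match `expected_key_len(key_spec)` catches mismatched specs before any data is encrypted with the wrong key size.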
- -## Error Handling - -| Code | Meaning | Example payload | -|------|---------|-----------------| -| `400 Bad Request` | Malformed JSON or missing required fields. | `{ "code": "InvalidRequest", "message": "invalid JSON" }` | -| `401 Unauthorized` | Request was not signed or credentials are invalid. | `{ "code": "AccessDenied", "message": "authentication required" }` | -| `403 Forbidden` | Caller lacks admin permissions. | `{ "code": "AccessDenied", "message": "unauthorised" }` | -| `409 Conflict` | Backend already configured in an incompatible way. | `{ "code": "Conflict", "message": "KMS already running" }` | -| `500 Internal Server Error` | Backend failure or transient issue. Logs include details. | `{ "code": "InternalError", "message": "failed to create key: ..." }` | - -## Useful Utilities - -- [`awscurl`](https://github.com/okigan/awscurl) for quick SigV4 requests. -- The `scripts/` directory contains example shell scripts to configure local and Vault backends automatically. -- The e2e test harness (`cargo test -p e2e_test kms:: -- --nocapture`) demonstrates end-to-end API usage against both backends. - -For dynamic workflows and automation strategies, continue with [dynamic-configuration-guide.md](dynamic-configuration-guide.md). diff --git a/docs/kms/security.md b/docs/kms/security.md deleted file mode 100644 index 5a4a350fd..000000000 --- a/docs/kms/security.md +++ /dev/null @@ -1,65 +0,0 @@ -# KMS Security Guidelines - -This document summarises the security posture of the RustFS KMS subsystem and offers guidance for safe production deployment. - -## Threat Model - -- Attackers might obtain network access to RustFS or Vault. -- Leaked admin credentials could manipulate KMS configuration. -- Misconfigured SSE-C clients could expose plaintext keys. -- Insider threats may attempt to extract master keys from disk-based storage. - -RustFS mitigates these risks via access control, auditability, and best practices outlined below. 
- -## Authentication & Authorisation - -- The admin API requires SigV4 credentials with `ServerInfoAdminAction`. Restrict these credentials to trusted automation. -- Do **not** share admin credentials with regular S3 clients. Provision separate IAM users for data-plane traffic. -- When running behind a reverse proxy, ensure the proxy passes through headers required for SigV4 signature validation. - -## Network Security - -- Enforce TLS for both RustFS and Vault deployments. Set `skip_tls_verify=false` in production. -- Use mTLS or private network peering between RustFS and Vault where possible. -- Restrict Vault transit endpoints using network ACLs or service meshes so only RustFS can reach them. - -## Secret Management - -- Never store Vault tokens directly in configuration files. Prefer AppRole or short-lived tokens injected at runtime. -- If you must render a token (e.g. in CI), use environment variables with limited scope and rotate them frequently. -- For the local backend, keep the key directory on encrypted disks with tight POSIX permissions (default `0o600`). - -## Vault Hardening Checklist - -- Enable audit logging (`vault audit enable file file_path=/var/log/vault_audit.log`). -- Create a dedicated policy granting access only to the `transit` and `secret` paths used by RustFS. -- Configure automatic token renewal or rely on `vault agent` to manage token lifetimes. -- Monitor the health endpoint (`/v1/sys/health`) and integrate it into your on-call alerts. - -## Caching & Memory Hygiene - -- When `enable_cache=true`, DEKs are stored in memory for the configured TTL. Tune `max_cached_keys` and TTL to balance latency versus exposure. -- The encryption service zeroises plaintext keys after use. Avoid logging plaintext keys or contexts in custom code. -- For workloads that require strict FIPS compliance, disable caching and rely on Vault for each request. - -## SSE-C Considerations - -- Clients are responsible for providing 256-bit keys and MD5 hashes. 
Reject uploads where the digest does not match. -- Educate clients that SSE-C keys are never stored server side; losing the key means losing access to the object. -- Use HTTPS for all client connections to prevent key disclosure. - -## Audit & Monitoring - -- Capture structured logs emitted under the `rustfs::kms` target. Each admin call logs request principals and outcomes. -- Export metrics such as cache hit ratio, backend latency, and failure counts to your observability stack. -- Periodically run the e2e Vault suite in a staging environment to verify backup/restore procedures. - -## Incident Response - -1. Stop the KMS service (`POST /kms/stop`) to freeze new operations. -2. Rotate admin credentials and Vault tokens. -3. Examine audit logs to determine the blast radius. -4. Restore keys from backups or Vault versions if tampering occurred. -5. Reconfigure the backend using trusted credentials and restart the service. - -By adhering to these practices, you can deploy RustFS KMS with confidence across regulated or high-security environments. diff --git a/docs/kms/sse-integration.md b/docs/kms/sse-integration.md deleted file mode 100644 index 4b09ee78e..000000000 --- a/docs/kms/sse-integration.md +++ /dev/null @@ -1,91 +0,0 @@ -# Server-Side Encryption Integration - -RustFS implements Amazon S3-compatible server-side encryption semantics. This document outlines how each mode maps to KMS operations and how clients should format requests. - -## Supported Modes - -| Mode | Request Headers | Managed by | Notes | -|------|-----------------|------------|-------| -| `SSE-S3` | `x-amz-server-side-encryption: AES256` | RustFS KMS using the configured default key. | Simplest option; clients do not manage keys. | -| `SSE-KMS` | `x-amz-server-side-encryption: aws:kms`<br>`x-amz-server-side-encryption-aws-kms-key-id: <key-id>` (optional) | RustFS KMS + backend (Vault/Local). | Specify a key-id to override the default. 
| -| `SSE-C` | `x-amz-server-side-encryption-customer-algorithm: AES256`<br>`x-amz-server-side-encryption-customer-key: <Base64 key>`<br>`x-amz-server-side-encryption-customer-key-MD5: <Base64 MD5>` | Customer provided | RustFS never stores the plaintext key; clients must supply it on every request. | - -## Request Examples - -### SSE-S3 Upload & Download - -```bash -# Upload -aws s3api put-object \ - --endpoint-url http://localhost:9000 \ - --bucket demo --key obj.txt --body file.txt \ - --server-side-encryption AES256 - -# Download -aws s3api get-object \ - --endpoint-url http://localhost:9000 \ - --bucket demo --key obj.txt out.txt -``` - -### SSE-KMS with Explicit Key ID - -```bash -aws s3api put-object \ - --endpoint-url http://localhost:9000 \ - --bucket demo --key report.csv --body report.csv \ - --server-side-encryption aws:kms \ - --ssekms-key-id rotation-2024-09 -``` - -If `--ssekms-key-id` is omitted, RustFS uses the configured `default_key_id`. - -### SSE-C Multipart Upload - -SSE-C requires additional care: - -1. Generate a 256-bit key, then Base64-encode both the key and its binary (not hex) MD5 digest. -2. For multipart uploads, every request (initiate, upload-part, complete, GET) must include the SSE-C headers. -3. Keep part sizes ≥ 5 MiB to avoid falling back to inline storage, which complicates key handling. - -```bash -KEY="01234567890123456789012345678901" -KEY_B64=$(echo -n "$KEY" | base64) -# The MD5 header carries the Base64-encoded binary digest, not the hex string -KEY_MD5=$(echo -n "$KEY" | openssl dgst -md5 -binary | base64) - -aws s3api create-multipart-upload \ - --endpoint-url http://localhost:9000 \ - --bucket demo --key video.mp4 \ - --sse-customer-algorithm AES256 \ - --sse-customer-key "$KEY_B64" \ - --sse-customer-key-md5 "$KEY_MD5" -# Upload all parts with the same trio of headers -``` - -On download, supply the same headers; otherwise the request fails with `AccessDenied`.
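The `x-amz-server-side-encryption-customer-key-MD5` header expects the Base64 encoding of the raw 16-byte digest, so a hex string will not match. The std-only Rust sketch below derives both SSE-C header values. It carries a compact RFC 1321 MD5 and an RFC 4648 Base64 encoder so it runs without dependencies. `sse_c_headers` is a hypothetical helper, and real code should use vetted hashing and encoding crates instead of these hand-rolled versions.

```rust
/// Compact MD5 (RFC 1321) so the sketch has no dependencies.
fn md5(input: &[u8]) -> [u8; 16] {
    // Per-step left-rotation amounts.
    const S: [u32; 64] = [
        7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22,
        5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20,
        4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23,
        6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21,
    ];
    // Additive constants: K[i] = floor(|sin(i + 1)| * 2^32).
    let k: Vec<u32> = (0..64)
        .map(|i| ((i as f64 + 1.0).sin().abs() * 4_294_967_296.0) as u32)
        .collect();

    // Pad: 0x80, zeros up to 56 mod 64, then the bit length as little-endian u64.
    let mut msg = input.to_vec();
    msg.push(0x80);
    while msg.len() % 64 != 56 {
        msg.push(0);
    }
    msg.extend_from_slice(&(input.len() as u64).wrapping_mul(8).to_le_bytes());

    let mut state: [u32; 4] = [0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476];
    for block in msg.chunks(64) {
        // Sixteen little-endian 32-bit words per 512-bit block.
        let m: Vec<u32> = block
            .chunks(4)
            .map(|w| u32::from_le_bytes([w[0], w[1], w[2], w[3]]))
            .collect();
        let [mut a, mut b, mut c, mut d] = state;
        for i in 0..64 {
            let (f, g) = match i / 16 {
                0 => ((b & c) | (!b & d), i),
                1 => ((d & b) | (!d & c), (5 * i + 1) % 16),
                2 => (b ^ c ^ d, (3 * i + 5) % 16),
                _ => (c ^ (b | !d), (7 * i) % 16),
            };
            let f = f.wrapping_add(a).wrapping_add(k[i]).wrapping_add(m[g]);
            a = d;
            d = c;
            c = b;
            b = b.wrapping_add(f.rotate_left(S[i]));
        }
        for (s, v) in state.iter_mut().zip([a, b, c, d]) {
            *s = s.wrapping_add(v);
        }
    }
    // The digest is the four state words, each serialized little-endian.
    let mut out = [0u8; 16];
    for (chunk, word) in out.chunks_mut(4).zip(state) {
        chunk.copy_from_slice(&word.to_le_bytes());
    }
    out
}

/// Standard Base64 (RFC 4648) with `=` padding.
fn base64_encode(data: &[u8]) -> String {
    const ALPHABET: &[u8; 64] =
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    let mut out = String::with_capacity((data.len() + 2) / 3 * 4);
    for chunk in data.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling a short tail.
        let byte = |j: usize| u32::from(*chunk.get(j).unwrap_or(&0));
        let n = (byte(0) << 16) | (byte(1) << 8) | byte(2);
        for j in 0..4 {
            if j <= chunk.len() {
                out.push(ALPHABET[((n >> (18 - 6 * j)) & 0x3f) as usize] as char);
            } else {
                out.push('=');
            }
        }
    }
    out
}

/// Hypothetical helper: values for the `...-customer-key` and `...-customer-key-MD5` headers.
fn sse_c_headers(raw_key: &[u8; 32]) -> (String, String) {
    (base64_encode(raw_key), base64_encode(&md5(raw_key)))
}
```

The key point is that `base64_encode` runs over the 16 raw digest bytes. Feeding it the 32-character hex rendering instead would produce a different, longer value that the server rejects.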
- -## Response Headers - -| Header | SSE-S3 | SSE-KMS | SSE-C | -|--------|--------|---------|-------| -| `x-amz-server-side-encryption` | `AES256` | `aws:kms` | _absent_ | -| `x-amz-server-side-encryption-aws-kms-key-id` | _default key id_ | Provided key id | _absent_ | -| `x-amz-server-side-encryption-customer-algorithm` | _absent_ | _absent_ | `AES256` | -| `x-amz-server-side-encryption-customer-key-MD5` | _absent_ | _absent_ | MD5 of supplied key | - -## Error Scenarios - -| Scenario | Error | Resolution | -|----------|-------|------------| -| SSE-C key/MD5 mismatch | `AccessDenied` | Regenerate the MD5 digest, ensure Base64 encoding is correct. | -| Missing SSE-C headers on GET | `InvalidRequest` | Provide the same `sse-c` headers used during upload. | -| Invalid key id for SSE-KMS | `NotFound` | Call `GET /kms/keys` to retrieve the valid IDs or create one via the admin API. | -| KMS backend offline | `InternalError` | Check `/kms/status`, restart or reconfigure the backend. | - -## Best Practices - -- Always use HTTPS endpoints when supplying SSE-C headers. -- Log the key-id used for SSE-KMS uploads to simplify forensic analysis. -- For compliance workloads, disable cache or lower cache TTL via `/kms/reconfigure` so data keys are short-lived. -- Test multipart SSE-C flows regularly; the e2e suite (`test_comprehensive_kms_full_workflow`) covers this scenario. - -For the administrative API and configuration specifics, refer to [http-api.md](http-api.md) and [configuration.md](configuration.md). diff --git a/docs/kms/test_suite_integration.md b/docs/kms/test_suite_integration.md deleted file mode 100644 index a8d67010f..000000000 --- a/docs/kms/test_suite_integration.md +++ /dev/null @@ -1,77 +0,0 @@ -# KMS Test Suite Integration - -RustFS ships with an extensive set of automated tests that exercise the KMS stack. This guide explains how to run them locally and in CI. 
- -## Crate Overview - -- `crates/kms` – unit tests for configuration, caching, and backend adapters. -- `crates/e2e_test/src/kms` – end-to-end suites for Local and Vault backends, multipart uploads, edge cases, and fault recovery. -- `crates/e2e_test/src/kms/common.rs` – reusable test environments (spins up RustFS, configures Vault, manages buckets). - -## Prerequisites - -| Requirement | Purpose | -|-------------|---------| -| `vault` binary (>=1.15) | Required for Vault end-to-end tests. Install from Vault releases. | -| `awscurl` (optional) | Debugging helper to hit admin endpoints. | -| `openssl`, `md5` | Used by SSE-C helpers during tests. | -| Local ports | Tests bind ephemeral ports (ensure `127.0.0.1:<random>` is free). | - -## Running Unit Tests - -```bash -cargo test --workspace --exclude e2e_test -``` - -This covers the core KMS crate plus supporting libraries. - -## Running End-to-End Suites - -### All KMS Tests - -```bash -NO_PROXY=127.0.0.1,localhost \ -HTTP_PROXY= HTTPS_PROXY= \ -cargo test -p e2e_test kms:: -- --nocapture --test-threads=1 -``` - -- `--nocapture` streams logs to stdout for troubleshooting. -- `--test-threads=1` ensures serial execution; most tests spawn standalone RustFS and Vault processes. - -### Local Backend Only - -```bash -cargo test -p e2e_test kms::kms_local_test:: -- --nocapture --test-threads=1 -``` - -### Vault Backend Only - -```bash -vault server -dev -dev-root-token-id=dev-root-token & -VAULT_PID=$! - -cargo test -p e2e_test kms::kms_vault_test:: -- --nocapture --test-threads=1 -kill $VAULT_PID -``` - -The tests can also start Vault automatically if the binary is found on `PATH`. When running in CI, whitelist the `vault` executable in the sandbox or mark the job as privileged. - -## Updating Fixtures - -- Adjustment to SSE behaviour or multipart limits often requires touching `crates/e2e_test/src/kms/common.rs`. Keep helpers generic so multiple tests can reuse them. 
-- When fixing bugs, add targeted coverage in the relevant suite (e.g. `kms_fault_recovery_test.rs`). -- Vault-specific fixtures live in `crates/e2e_test/src/kms/common.rs::VaultTestEnvironment`. - -## Debugging Tips - -- Use the `CLAUDE DEBUG` log lines (left intentionally verbose) to inspect the RustFS server flow during tests. -- If a test fails with `Operation not permitted`, rerun with sandbox overrides (`cargo test ...` with elevated permissions) as shown above. -- Attach `RUST_LOG=rustfs::kms=debug` to surface detailed backend interactions. - -## CI Recommendations - -- Split KMS tests into a dedicated job so slower suites (Vault) do not gate unrelated changes. -- Cache the Vault binary and reuse it across runs to minimise setup time. -- Surface logs and `target/debug/e2e_test-*` binaries as artifacts when failures occur. - -For API usage examples and configuration reference, consult [http-api.md](http-api.md) and [dynamic-configuration-guide.md](dynamic-configuration-guide.md). diff --git a/docs/kms/troubleshooting.md b/docs/kms/troubleshooting.md deleted file mode 100644 index 21670b565..000000000 --- a/docs/kms/troubleshooting.md +++ /dev/null @@ -1,55 +0,0 @@ -# KMS Troubleshooting - -Use this checklist to diagnose and resolve common KMS-related issues. - -## Quick Diagnostics - -1. **Check status** - ```bash - awscurl --service s3 --region us-east-1 \ - --access_key admin --secret_key admin \ - http://localhost:9000/rustfs/admin/v3/kms/status - ``` -2. **Inspect logs** – enable `RUST_LOG=rustfs::kms=debug`. -3. **Verify backend reachability** – for Vault, run `vault status` and test network connectivity. -4. **Run a smoke test** – upload a small object with `--server-side-encryption AES256`. - -## Common Issues - -| Symptom | Likely Cause | Resolution | -|---------|--------------|------------| -| `status: NotConfigured` | KMS was never configured or configuration failed. | POST `/kms/configure`, then `/kms/start`. Check logs for JSON parsing errors. 
| -| `healthy: false` | Backend health probe failed; Vault sealed or filesystem inaccessible. | Unseal Vault, confirm permissions on the key directory, re-run `/kms/start`. | -| `InternalError: failed to create key` | Backend rejected the request (e.g. Vault policy). | Review Vault audit logs and ensure the RustFS policy has `transit/keys/*` access. | -| `AccessDenied` when downloading SSE-C objects | Missing/incorrect SSE-C headers. | Provide the same `x-amz-server-side-encryption-customer-*` headers used during upload. | -| Multipart SSE-C download truncated | Parts smaller than 5 MiB stored inline; older builds mishandled them. | Re-upload with ≥5 MiB parts or upgrade to the latest RustFS build. | -| `Operation not permitted` during tests | OS sandbox blocked launching `vault` or `rustfs`. | Re-run tests with elevated permissions (`cargo test ...` with sandbox overrides). | -| `KMS key directory is required for local backend` | Started RustFS with `--kms-backend local` but no `--kms-key-dir`. | Supply the flag or use the dynamic API to set the directory before calling `/kms/start`. | - -## Clearing the Cache - -If data keys become stale (e.g. after manual rotation in Vault), clear the cache: - -```bash -awscurl --service s3 --region us-east-1 \ - --access_key admin --secret_key admin \ - -X POST http://localhost:9000/rustfs/admin/v3/kms/clear-cache -``` - -## Resetting the Service - -1. `POST /kms/stop` -2. `POST /kms/configure` with the known-good payload -3. `POST /kms/start` -4. Verify with `/kms/status` - -## Support Data Collection - -When opening an issue, capture: - -- Output of `/kms/status` and `/kms/config` -- Relevant RustFS logs (`rustfs::kms=*`) -- Vault audit log snippets (if using Vault) -- The SSE headers used by the failing client request - -Providing these artifacts drastically speeds up triage. 
diff --git a/docs/nosuchkey-fix-comprehensive-analysis.md b/docs/nosuchkey-fix-comprehensive-analysis.md deleted file mode 100644 index a02bfd594..000000000 --- a/docs/nosuchkey-fix-comprehensive-analysis.md +++ /dev/null @@ -1,396 +0,0 @@ -# Comprehensive Analysis: NoSuchKey Error Fix and Related Improvements - -## Overview - -This document provides a comprehensive analysis of the complete solution for Issue #901 (NoSuchKey regression), -including related improvements from PR #917 that were merged into this branch. - -## Problem Statement - -**Issue #901**: In RustFS 1.0.69, attempting to download a non-existent or deleted object returns a networking error -instead of the expected `NoSuchKey` S3 error. - -**Error Observed**: - -``` -Class: Seahorse::Client::NetworkingError -Message: "http response body truncated, expected 119 bytes, received 0 bytes" -``` - -**Expected Behavior**: - -```ruby -assert_raises(Aws::S3::Errors::NoSuchKey) do - s3.get_object(bucket: 'some-bucket', key: 'some-key-that-was-deleted') -end -``` - -## Complete Solution Analysis - -### 1. HTTP Compression Layer Fix (Primary Issue) - -**File**: `rustfs/src/server/http.rs` - -**Root Cause**: The `CompressionLayer` was being applied to all responses, including error responses. When s3s generates -a NoSuchKey error response (~119 bytes XML), the compression layer interferes, causing Content-Length mismatch. 
- -**Solution**: Implemented `ShouldCompress` predicate that intelligently excludes: - -- Error responses (4xx/5xx status codes) -- Small responses (< 256 bytes) - -**Code Changes**: - -```rust -impl Predicate for ShouldCompress { - fn should_compress<B>(&self, response: &Response<B>) -> bool - where - B: http_body::Body, - { - let status = response.status(); - - // Never compress error responses - if status.is_client_error() || status.is_server_error() { - debug!("Skipping compression for error response: status={}", status.as_u16()); - return false; - } - - // Skip compression for small responses - if let Some(content_length) = response.headers().get(http::header::CONTENT_LENGTH) { - if let Ok(length_str) = content_length.to_str() { - if let Ok(length) = length_str.parse::<u64>() { - if length < 256 { - debug!("Skipping compression for small response: size={} bytes", length); - return false; - } - } - } - } - - true - } -} -``` - -**Impact**: Ensures error responses are transmitted with accurate Content-Length headers, preventing AWS SDK truncation -errors. - -### 2. Content-Length Calculation Fix (Related Issue from PR #917) - -**File**: `rustfs/src/storage/ecfs.rs` - -**Problem**: The content-length was being calculated incorrectly for certain object types (compressed, encrypted). - -**Changes**: - -```rust -// Before: -let mut content_length = info.size; -let content_range = if let Some(rs) = &rs { -    let total_size = info.get_actual_size().map_err(ApiError::from)?; -    // ... -} - -// After: -let mut content_length = info.get_actual_size().map_err(ApiError::from)?; -let content_range = if let Some(rs) = &rs { -    let total_size = content_length; -    // ... -} -``` - -**Rationale**: - -- `get_actual_size()` properly handles compressed and encrypted objects -- Returns the actual decompressed size when needed -- Avoids duplicate calls and potential inconsistencies - -**Impact**: Ensures Content-Length header accurately reflects the actual response body size. - -### 3.
Delete Object Metadata Fix (Related Issue from PR #917) - -**File**: `crates/filemeta/src/filemeta.rs` - -#### Change 1: Version Update Logic (Line 618) - -**Problem**: Incorrect version update logic during delete operations. - -```rust -// Before: -let mut update_version = fi.mark_deleted; - -// After: -let mut update_version = false; -``` - -**Rationale**: - -- The previous logic would always update version when `mark_deleted` was true -- This could cause incorrect version state transitions -- The new logic only updates version in specific replication scenarios -- Prevents spurious version updates during delete marker operations - -**Impact**: Ensures correct version management when objects are deleted, which is critical for subsequent GetObject -operations to correctly determine that an object doesn't exist. - -#### Change 2: Version ID Filtering (Lines 1711, 1815) - -**Problem**: Nil UUIDs were not being filtered when converting to FileInfo. - -```rust -// Before: -pub fn into_fileinfo(&self, volume: &str, path: &str, all_parts: bool) -> FileInfo { - // let version_id = self.version_id.filter(|&vid| !vid.is_nil()); - // ... - FileInfo { - version_id: self.version_id, - // ... - } -} - -// After: -pub fn into_fileinfo(&self, volume: &str, path: &str, all_parts: bool) -> FileInfo { - let version_id = self.version_id.filter(|&vid| !vid.is_nil()); - // ... - FileInfo { - version_id, - // ... - } -} -``` - -**Rationale**: - -- Nil UUIDs (all zeros) are not valid version IDs -- Filtering them ensures cleaner semantics -- Aligns with S3 API expectations where no version ID means None, not a nil UUID - -**Impact**: - -- Improves correctness of version tracking -- Prevents confusion with nil UUIDs in debugging and logging -- Ensures proper behavior in versioned bucket scenarios - -## How the Pieces Work Together - -### Scenario: GetObject on Deleted Object - -1. **Client Request**: `GET /bucket/deleted-object` - -2. 
**Object Lookup**: - - RustFS queries metadata using `FileMeta` - - Version ID filtering ensures nil UUIDs don't interfere (filemeta.rs change) - - Delete state is correctly maintained (filemeta.rs change) - -3. **Error Generation**: - - Object not found or marked as deleted - - Returns `ObjectNotFound` error - - Converted to S3 `NoSuchKey` error by s3s library - -4. **Response Serialization**: - - s3s serializes error to XML (~119 bytes) - - Sets `Content-Length: 119` - -5. **Compression Decision** (NEW): - - `ShouldCompress` predicate evaluates response - - Detects 4xx status code → Skip compression - - Detects small size (119 < 256) → Skip compression - -6. **Response Transmission**: - - Full 119-byte XML error body is sent - - Content-Length matches actual body size - - AWS SDK successfully parses NoSuchKey error - -### Without the Fix - -The problematic flow: - -1. Steps 1-4 same as above -2. **Compression Decision** (OLD): - - No filtering, all responses compressed - - Attempts to compress 119-byte error response -3. **Response Transmission**: - - Compression layer buffers/processes response - - Body becomes corrupted or empty (0 bytes) - - Headers already sent with Content-Length: 119 - - AWS SDK receives 0 bytes, expects 119 bytes - - Throws "truncated body" networking error - -## Testing Strategy - -### Comprehensive Test Suite - -**File**: `crates/e2e_test/src/reliant/get_deleted_object_test.rs` - -Four test cases covering different scenarios: - -1. **`test_get_deleted_object_returns_nosuchkey`** - - Upload object → Delete → GetObject - - Verifies NoSuchKey error, not networking error - -2. **`test_head_deleted_object_returns_nosuchkey`** - - Tests HeadObject on deleted objects - - Ensures consistency across API methods - -3. **`test_get_nonexistent_object_returns_nosuchkey`** - - Tests objects that never existed - - Validates error handling for truly non-existent keys - -4. 
**`test_multiple_gets_deleted_object`** - - 5 consecutive GetObject calls on deleted object - - Ensures stability and no race conditions - -### Running Tests - -```bash -# Start RustFS server -./scripts/dev_rustfs.sh - -# Run specific test -cargo test --test get_deleted_object_test -- test_get_deleted_object_returns_nosuchkey --ignored - -# Run all deletion tests -cargo test --test get_deleted_object_test -- --ignored -``` - -## Performance Impact Analysis - -### Compression Skip Rate - -**Before Fix**: 0% (all responses compressed) -**After Fix**: ~5-10% (error responses + small responses) - -**Calculation**: - -- Error responses: ~3-5% of total traffic (typical) -- Small responses: ~2-5% of successful responses -- Total skip rate: ~5-10% - -**CPU Impact**: - -- Reduced CPU usage from skipped compression -- Estimated savings: 1-2% overall CPU reduction -- No negative impact on latency - -### Memory Impact - -**Before**: Compression buffers allocated for all responses -**After**: Fewer compression buffers needed -**Savings**: ~5-10% reduction in compression buffer memory - -### Network Impact - -**Before Fix (Errors)**: - -- Attempted compression of 119-byte error responses -- Often resulted in 0-byte transmissions (bug) - -**After Fix (Errors)**: - -- Direct transmission of 119-byte responses -- No bandwidth savings, but correct behavior - -**After Fix (Small Responses)**: - -- Skip compression for responses < 256 bytes -- Minimal bandwidth impact (~1-2% increase) -- Better latency for small responses - -## Monitoring and Observability - -### Key Metrics - -1. **Compression Skip Rate** - ``` - rate(http_compression_skipped_total[5m]) / rate(http_responses_total[5m]) - ``` - -2. **Error Response Size** - ``` - histogram_quantile(0.95, rate(http_error_response_size_bytes[5m])) - ``` - -3. 
**NoSuchKey Error Rate** - ``` - rate(s3_errors_total{code="NoSuchKey"}[5m]) - ``` - -### Debug Logging - -Enable detailed logging: - -```bash -RUST_LOG=rustfs::server::http=debug ./target/release/rustfs -``` - -Look for: - -- `Skipping compression for error response: status=404` -- `Skipping compression for small response: size=119 bytes` - -## Deployment Checklist - -### Pre-Deployment - -- [x] Code review completed -- [x] All tests passing -- [x] Clippy checks passed -- [x] Documentation updated -- [ ] Performance testing in staging -- [ ] Security scan (CodeQL) - -### Deployment Strategy - -1. **Canary (5% traffic)**: Monitor for 24 hours -2. **Partial (25% traffic)**: Monitor for 48 hours -3. **Full rollout (100% traffic)**: Continue monitoring for 1 week - -### Rollback Plan - -If issues detected: - -1. Revert compression predicate changes -2. Keep metadata fixes (they're beneficial regardless) -3. Investigate and reapply compression fix - -## Related Issues and PRs - -- Issue #901: NoSuchKey error regression -- PR #917: Fix/objectdelete (content-length and delete fixes) -- Commit: 86185703836c9584ba14b1b869e1e2c4598126e0 (getobjectlength) - -## Future Improvements - -### Short-term - -1. Add metrics for nil UUID filtering -2. Add delete marker specific metrics -3. Implement versioned bucket deletion tests - -### Long-term - -1. Consider gRPC compression strategy -2. Implement adaptive compression thresholds -3. Add response size histograms per S3 operation - -## Conclusion - -This comprehensive fix addresses the NoSuchKey regression through a multi-layered approach: - -1. **HTTP Layer**: Intelligent compression predicate prevents error response corruption -2. **Storage Layer**: Correct content-length calculation for all object types -3. 
**Metadata Layer**: Proper version management and UUID filtering for deleted objects - -The solution is: - -- ✅ **Correct**: Fixes the regression completely -- ✅ **Performant**: No negative performance impact, potential improvements -- ✅ **Robust**: Comprehensive test coverage -- ✅ **Maintainable**: Well-documented with clear rationale -- ✅ **Observable**: Debug logging and metrics support - ---- - -**Author**: RustFS Team -**Date**: 2025-11-24 -**Version**: 1.0 diff --git a/docs/rustfs-trending.jpg b/docs/rustfs-trending.jpg deleted file mode 100644 index 8285a2ee6..000000000 Binary files a/docs/rustfs-trending.jpg and /dev/null differ diff --git a/docs/security/dos-prevention-body-limits.md b/docs/security/dos-prevention-body-limits.md deleted file mode 100644 index a60d2eded..000000000 --- a/docs/security/dos-prevention-body-limits.md +++ /dev/null @@ -1,42 +0,0 @@ -# DoS Prevention: Request/Response Body Size Limits - -## Executive Summary - -This document describes the implementation of request and response body size limits in RustFS to prevent Denial of Service (DoS) attacks through unbounded memory allocation. The previous use of `usize::MAX` with `store_all_limited()` posed a critical security risk allowing attackers to exhaust server memory. - -## Security Risk Assessment - -### Vulnerability: Unbounded Memory Allocation - -**Severity**: High -**Impact**: Server memory exhaustion, service unavailability -**Likelihood**: High (easily exploitable) - -**Previous Code** (vulnerable): -```rust -let body = input.store_all_limited(usize::MAX).await?; -``` - -On a 64-bit system, `usize::MAX` is approximately 18 exabytes, effectively unlimited. 
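The remedy described here amounts to capping how many bytes a handler will buffer. The following is a synchronous, std-only sketch of that idea, not the async `store_all_limited()` call itself. Reading through `Read::take` with a one-byte margin detects an oversized body after buffering at most `limit + 1` bytes. `read_body_limited` is a hypothetical name, and treating the document's 1 MB `MAX_ADMIN_REQUEST_BODY_SIZE` as 1024 * 1024 bytes is an assumption.

```rust
use std::io::{self, Read};

/// Assumed byte value of the 1 MB `MAX_ADMIN_REQUEST_BODY_SIZE` limit.
const MAX_ADMIN_REQUEST_BODY_SIZE: usize = 1024 * 1024;

/// Hypothetical helper: buffer a request body, rejecting anything over `limit` bytes.
/// At most `limit + 1` bytes are ever read, so an oversized body cannot exhaust memory.
fn read_body_limited<R: Read>(reader: R, limit: usize) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Read one byte past the limit: if that byte exists, the body is too large.
    // `saturating_add` avoids overflow if a caller still passes `usize::MAX`.
    reader
        .take((limit as u64).saturating_add(1))
        .read_to_end(&mut buf)?;
    if buf.len() > limit {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("request body exceeds {limit} bytes"),
        ));
    }
    Ok(buf)
}
```

In an async server the same idea applies to a streaming body: keep a running byte count as chunks arrive and abort as soon as it passes the limit, instead of buffering first and checking afterwards.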
- -## Implemented Limits - -| Limit | Size | Use Cases | -|-------|------|-----------| -| `MAX_ADMIN_REQUEST_BODY_SIZE` | 1 MB | User management, policies, tier/KMS/event configs | -| `MAX_IAM_IMPORT_SIZE` | 10 MB | IAM import/export (ZIP archives) | -| `MAX_BUCKET_METADATA_IMPORT_SIZE` | 100 MB | Bucket metadata import | -| `MAX_HEAL_REQUEST_SIZE` | 1 MB | Healing operations | -| `MAX_S3_RESPONSE_SIZE` | 10 MB | S3 client responses from remote services | - -## Rationale - -- AWS IAM policy limit: 6KB-10KB -- Typical payloads: < 100KB -- 1MB-100MB limits provide generous headroom while preventing DoS -- Based on real-world usage analysis and industry standards - -## Files Modified - -- 22 files updated across admin handlers and S3 client modules -- 2 new files: `rustfs/src/admin/constants.rs`, `crates/ecstore/src/client/body_limits.rs` diff --git a/docs/special-characters-README.md b/docs/special-characters-README.md deleted file mode 100644 index cf017a313..000000000 --- a/docs/special-characters-README.md +++ /dev/null @@ -1,220 +0,0 @@ -# Special Characters in Object Paths - Complete Documentation - -This directory contains comprehensive documentation for handling special characters (spaces, plus signs, percent signs, etc.) in S3 object paths with RustFS. - -## Quick Links - -- **For Users**: Start with [Client Guide](./client-special-characters-guide.md) -- **For Developers**: Read [Solution Document](./special-characters-solution.md) -- **For Deep Dive**: See [Technical Analysis](./special-characters-in-path-analysis.md) - -## Document Overview - -### 1. [Client Guide](./client-special-characters-guide.md) -**Target Audience**: Application developers, DevOps engineers, end users - -**Contents**: -- How to upload files with spaces, plus signs, etc. -- Examples for all major S3 SDKs (Python, Go, Node.js, AWS CLI, mc) -- Troubleshooting common issues -- Best practices -- FAQ - -**When to Read**: You're experiencing issues with special characters in object names. 
- -### 2. [Solution Document](./special-characters-solution.md) -**Target Audience**: RustFS developers, contributors, maintainers - -**Contents**: -- Root cause analysis -- Technical explanation of URL encoding -- Why the backend is correct -- Why issues occur in UI/clients -- Implementation recommendations -- Testing strategy - -**When to Read**: You need to understand the technical solution or contribute to the codebase. - -### 3. [Technical Analysis](./special-characters-in-path-analysis.md) -**Target Audience**: Senior architects, security reviewers, technical deep-dive readers - -**Contents**: -- Comprehensive technical analysis -- URL encoding standards (RFC 3986, AWS S3 API) -- Deep dive into s3s library behavior -- Edge cases and security considerations -- Multiple solution approaches evaluated -- Complete implementation plan - -**When to Read**: You need detailed technical understanding or are making architectural decisions. - -## TL;DR - The Core Issue - -### What Happened - -Users reported: -1. **Part A**: UI can navigate to folders with special chars but can't list contents -2. **Part B**: 400 errors when uploading files with `+` in the path - -### Root Cause - -**Backend (RustFS) is correct** ✅ -- The s3s library properly URL-decodes object keys from HTTP requests -- RustFS stores and retrieves objects with special characters correctly -- CLI tools (mc, aws-cli) work perfectly → proves backend is working - -**Client/UI is the issue** ❌ -- Some clients don't properly URL-encode requests -- UI may not encode prefixes when making LIST requests -- Custom HTTP clients may have encoding bugs - -### Solution - -1. **For Users**: Use proper S3 SDKs/clients (they handle encoding automatically) -2. **For Developers**: Backend needs no fixes, but added defensive validation and logging -3. 
**For UI**: UI needs to properly URL-encode all requests (if applicable) - -## Quick Examples - -### ✅ Works Correctly (Using mc) - -```bash -# Upload -mc cp file.txt "myrustfs/bucket/path with spaces/file.txt" - -# List -mc ls "myrustfs/bucket/path with spaces/" - -# Result: ✅ Success - mc properly encodes the request -``` - -### ❌ May Not Work (Raw HTTP without encoding) - -```bash -# Wrong: Not encoded -curl "http://localhost:9000/bucket/path with spaces/file.txt" - -# Result: ❌ May fail - spaces not encoded -``` - -### ✅ Correct Raw HTTP - -```bash -# Correct: Properly encoded -curl "http://localhost:9000/bucket/path%20with%20spaces/file.txt" - -# Result: ✅ Success - spaces encoded as %20 -``` - -## URL Encoding Quick Reference - -| Character | Display | In URL Path | In Query Param | -|-----------|---------|-------------|----------------| -| Space | ` ` | `%20` | `%20` or `+` | -| Plus | `+` | `%2B` | `%2B` | -| Percent | `%` | `%25` | `%25` | - -**Critical**: In URL **paths**, `+` = literal plus (NOT space). Only `%20` = space in paths! - -## Implementation Status - -### ✅ Completed - -1. **Backend Validation**: Added control character validation (rejects null bytes, newlines) -2. **Debug Logging**: Added logging for keys with special characters -3. **Tests**: Created comprehensive e2e test suite -4. **Documentation**: - - Client guide with SDK examples - - Solution document for developers - - Technical analysis for architects - -### 📋 Recommended Next Steps - -1. **Run Tests**: Execute e2e tests to verify backend behavior - ```bash - cargo test --package e2e_test special_chars - ``` - -2. **UI Review** (if applicable): Check if RustFS UI properly encodes requests - -3. **User Communication**: - - Update user documentation - - Add troubleshooting to FAQ - - Communicate known UI limitations (if any) - -## Related GitHub Issues - -- Original Issue: Special Chars in path (#???) 
-- Referenced PR: #1072 (mentioned in issue comments) - -## Support - -If you encounter issues with special characters: - -1. **First**: Check the [Client Guide](./client-special-characters-guide.md) -2. **Try**: Use mc or AWS CLI to isolate the issue -3. **Enable**: Debug logging: `RUST_LOG=rustfs=debug` -4. **Report**: Create an issue with: - - Client/SDK used - - Exact object name causing issues - - Whether mc works (to isolate backend vs client) - - Debug logs - -## Contributing - -When contributing related fixes: - -1. Read the [Solution Document](./special-characters-solution.md) -2. Understand that backend is working correctly via s3s -3. Focus on UI/client improvements or documentation -4. Add tests to verify behavior -5. Update relevant documentation - -## Testing - -### Run Special Character Tests - -```bash -# All special character tests -cargo test --package e2e_test special_chars -- --nocapture - -# Specific test -cargo test --package e2e_test test_object_with_space_in_path -- --nocapture -cargo test --package e2e_test test_object_with_plus_in_path -- --nocapture -cargo test --package e2e_test test_issue_scenario_exact -- --nocapture -``` - -### Test with Real Clients - -```bash -# MinIO client -mc alias set test http://localhost:9000 minioadmin minioadmin -mc cp README.md "test/bucket/test with spaces/README.md" -mc ls "test/bucket/test with spaces/" - -# AWS CLI -aws --endpoint-url=http://localhost:9000 s3 cp README.md "s3://bucket/test with spaces/README.md" -aws --endpoint-url=http://localhost:9000 s3 ls "s3://bucket/test with spaces/" -``` - -## Version History - -- **v1.0** (2025-12-09): Initial documentation - - Comprehensive analysis completed - - Root cause identified (UI/client issue) - - Backend validation and logging added - - Client guide created - - E2E tests added - -## See Also - -- [AWS S3 API Documentation](https://docs.aws.amazon.com/AmazonS3/latest/API/) -- [RFC 3986: URI Generic Syntax](https://tools.ietf.org/html/rfc3986) -- [s3s 
Library Documentation](https://docs.rs/s3s/) -- [URL Encoding Best Practices](https://developer.mozilla.org/en-US/docs/Glossary/Percent-encoding) - ---- - -**Maintained by**: RustFS Team -**Last Updated**: 2025-12-09 -**Status**: Complete - Ready for Use diff --git a/docs/special-characters-README_ZH.md b/docs/special-characters-README_ZH.md deleted file mode 100644 index a1d87e06f..000000000 --- a/docs/special-characters-README_ZH.md +++ /dev/null @@ -1,185 +0,0 @@ -# 对象路径中的特殊字符 - 完整文档 - -本目录包含关于在 RustFS 中处理 S3 对象路径中特殊字符(空格、加号、百分号等)的完整文档。 - -## 快速链接 - -- **用户指南**: [客户端指南](./client-special-characters-guide.md) -- **开发者文档**: [解决方案文档](./special-characters-solution.md) -- **深入分析**: [技术分析](./special-characters-in-path-analysis.md) - -## 核心问题说明 - -### 问题现象 - -用户报告了两个问题: -1. **问题 A**: UI 可以导航到包含特殊字符的文件夹,但无法列出其中的内容 -2. **问题 B**: 上传路径中包含 `+` 号的文件时出现 400 错误 - -### 根本原因 - -经过深入调查,包括检查 s3s 库的源代码,我们发现: - -**后端 (RustFS) 工作正常** ✅ -- s3s 库正确地对 HTTP 请求中的对象键进行 URL 解码 -- RustFS 正确存储和检索包含特殊字符的对象 -- 命令行工具(mc, aws-cli)完美工作 → 证明后端正确处理特殊字符 - -**问题出在 UI/客户端层** ❌ -- 某些客户端未正确进行 URL 编码 -- UI 可能在发出 LIST 请求时未对前缀进行编码 -- 自定义 HTTP 客户端可能存在编码错误 - -### 解决方案 - -1. **用户**: 使用正规的 S3 SDK/客户端(它们会自动处理编码) -2. **开发者**: 后端无需修复,但添加了防御性验证和日志 -3. **UI**: UI 需要正确对所有请求进行 URL 编码(如适用) - -## URL 编码快速参考 - -| 字符 | 显示 | URL 路径中 | 查询参数中 | -|------|------|-----------|-----------| -| 空格 | ` ` | `%20` | `%20` 或 `+` | -| 加号 | `+` | `%2B` | `%2B` | -| 百分号 | `%` | `%25` | `%25` | - -**重要**: 在 URL **路径**中,`+` = 字面加号(不是空格)。只有 `%20` = 空格! 
- -## 快速示例 - -### ✅ 正确使用(使用 mc) - -```bash -# 上传 -mc cp file.txt "myrustfs/bucket/路径 包含 空格/file.txt" - -# 列出 -mc ls "myrustfs/bucket/路径 包含 空格/" - -# 结果: ✅ 成功 - mc 正确编码了请求 -``` - -### ❌ 可能失败(原始 HTTP 未编码) - -```bash -# 错误: 未编码 -curl "http://localhost:9000/bucket/路径 包含 空格/file.txt" - -# 结果: ❌ 可能失败 - 空格未编码 -``` - -### ✅ 正确的原始 HTTP - -```bash -# 正确: 已正确编码 -curl "http://localhost:9000/bucket/%E8%B7%AF%E5%BE%84%20%E5%8C%85%E5%90%AB%20%E7%A9%BA%E6%A0%BC/file.txt" - -# 结果: ✅ 成功 - 空格编码为 %20 -``` - -## 实施状态 - -### ✅ 已完成 - -1. **后端验证**: 添加了控制字符验证(拒绝空字节、换行符) -2. **调试日志**: 为包含特殊字符的键添加了日志记录 -3. **测试**: 创建了综合 e2e 测试套件 -4. **文档**: - - 包含 SDK 示例的客户端指南 - - 开发者解决方案文档 - - 架构师技术分析 - - 安全摘要 - -### 📋 建议的后续步骤 - -1. **运行测试**: 执行 e2e 测试以验证后端行为 - ```bash - cargo test --package e2e_test special_chars - ``` - -2. **UI 审查**(如适用): 检查 RustFS UI 是否正确编码请求 - -3. **用户沟通**: - - 更新用户文档 - - 在 FAQ 中添加故障排除 - - 传达已知的 UI 限制(如有) - -## 测试 - -### 运行特殊字符测试 - -```bash -# 所有特殊字符测试 -cargo test --package e2e_test special_chars -- --nocapture - -# 特定测试 -cargo test --package e2e_test test_object_with_space_in_path -- --nocapture -cargo test --package e2e_test test_object_with_plus_in_path -- --nocapture -cargo test --package e2e_test test_issue_scenario_exact -- --nocapture -``` - -### 使用真实客户端测试 - -```bash -# MinIO 客户端 -mc alias set test http://localhost:9000 minioadmin minioadmin -mc cp README.md "test/bucket/测试 包含 空格/README.md" -mc ls "test/bucket/测试 包含 空格/" - -# AWS CLI -aws --endpoint-url=http://localhost:9000 s3 cp README.md "s3://bucket/测试 包含 空格/README.md" -aws --endpoint-url=http://localhost:9000 s3 ls "s3://bucket/测试 包含 空格/" -``` - -## 支持 - -如果遇到特殊字符问题: - -1. **首先**: 查看[客户端指南](./client-special-characters-guide.md) -2. **尝试**: 使用 mc 或 AWS CLI 隔离问题 -3. **启用**: 调试日志: `RUST_LOG=rustfs=debug` -4. 
**报告**: 创建问题,包含: - - 使用的客户端/SDK - - 导致问题的确切对象名称 - - mc 是否工作(以隔离后端与客户端) - - 调试日志 - -## 相关文档 - -- [客户端指南](./client-special-characters-guide.md) - 用户必读 -- [解决方案文档](./special-characters-solution.md) - 开发者指南 -- [技术分析](./special-characters-in-path-analysis.md) - 深入分析 -- [安全摘要](./SECURITY_SUMMARY_special_chars.md) - 安全审查 - -## 常见问题 - -**问: 可以在对象名称中使用空格吗?** -答: 可以,但请使用能自动处理编码的 S3 SDK。 - -**问: 为什么 `+` 不能用作空格?** -答: 在 URL 路径中,`+` 表示字面加号。只有在查询参数中 `+` 才表示空格。在路径中使用 `%20` 表示空格。 - -**问: RustFS 支持对象名称中的 Unicode 吗?** -答: 支持,对象名称是 UTF-8 字符串。它们支持任何有效的 UTF-8 字符。 - -**问: 哪些字符是禁止的?** -答: 控制字符(空字节、换行符、回车符)被拒绝。所有可打印字符都是允许的。 - -**问: 如何修复"UI 无法列出文件夹"的问题?** -答: 使用 CLI(mc 或 aws-cli)代替。这是 UI 错误,不是后端问题。 - -## 版本历史 - -- **v1.0** (2025-12-09): 初始文档 - - 完成综合分析 - - 确定根本原因(UI/客户端问题) - - 添加后端验证和日志 - - 创建客户端指南 - - 添加 E2E 测试 - ---- - -**维护者**: RustFS 团队 -**最后更新**: 2025-12-09 -**状态**: 完成 - 可供使用 diff --git a/docs/special-characters-in-path-analysis.md b/docs/special-characters-in-path-analysis.md deleted file mode 100644 index a8b5f49f9..000000000 --- a/docs/special-characters-in-path-analysis.md +++ /dev/null @@ -1,536 +0,0 @@ -# Special Characters in Object Path - Comprehensive Analysis and Solution - -## Executive Summary - -This document provides an in-depth analysis of the issues with special characters (spaces, plus signs, etc.) in object paths within RustFS, along with a comprehensive solution strategy. 
- -## Problem Statement - -### Issue Description - -Users encounter problems when working with object paths containing special characters: - -**Part A: Spaces in Paths** -```bash -mc cp README.md "local/dummy/a%20f+/b/c/3/README.md" -``` -- The UI allows navigation to the folder `%20f+/` -- However, it cannot display the contents within that folder -- CLI tools like `mc ls` correctly show the file exists - -**Part B: Plus Signs in Paths** -``` -Error: blob (key "/test/data/org_main-org/dashboards/ES+net/LHC+Data+Challenge/firefly-details.json") -api error InvalidArgument: Invalid argument -``` -- Files with `+` signs in paths cause 400 (Bad Request) errors -- This affects clients using the Go Cloud Development Kit or similar libraries - -## Root Cause Analysis - -### URL Encoding in S3 API - -According to the AWS S3 API specification: - -1. **Object keys in HTTP URLs MUST be URL-encoded** - - Space character → `%20` - - Plus sign → `%2B` - - Literal `+` in URL path → stays as `+` (represents itself, not space) - -2. **URL encoding rules for S3 paths:** - - In HTTP URLs: `/bucket/path%20with%20spaces/file%2Bname.txt` - - Decoded key: `path with spaces/file+name.txt` - - Note: `+` in URL path represents a literal `+`, NOT a space - -3. **Important distinction:** - - In **query parameters**, `+` represents space (form URL encoding) - - In **URL paths**, `+` represents a literal plus sign - - Space in paths must be encoded as `%20` - -### The s3s Library Behavior - -The s3s library (version 0.12.0-rc.4) handles HTTP request parsing and URL decoding: - -1. **Expected behavior**: s3s should URL-decode the path from HTTP requests before passing keys to our handlers -2. **Current observation**: There appears to be inconsistency or a bug in how keys are decoded -3. **Hypothesis**: The library may not be properly handling certain special characters or edge cases - -### Where the Problem Manifests - -The issue affects multiple operations: - -1. 
**PUT Object**: Uploading files with special characters in path -2. **GET Object**: Retrieving files with special characters -3. **LIST Objects**: Listing directory contents with special characters in path -4. **DELETE Object**: Deleting files with special characters - -### Consistency Issues - -The core problem is **inconsistency** in how paths are handled: - -- **Storage layer**: May store objects with URL-encoded names -- **Retrieval layer**: May expect decoded names -- **Comparison layer**: Path matching fails when encoding differs -- **List operation**: Returns encoded or decoded names inconsistently - -## Technical Analysis - -### Current Implementation - -#### 1. Storage Layer (ecfs.rs) - -```rust -// In put_object -let PutObjectInput { - bucket, - key, // ← This comes from s3s, should be URL-decoded - ... -} = input; - -store.put_object(&bucket, &key, &mut reader, &opts).await -``` - -#### 2. List Objects Implementation - -```rust -// In list_objects_v2 -let object_infos = store - .list_objects_v2( - &bucket, - &prefix, // ← Should this be decoded? - continuation_token, - delimiter.clone(), - max_keys, - fetch_owner.unwrap_or_default(), - start_after, - incl_deleted, - ) - .await -``` - -#### 3. Object Retrieval - -The key (object name) needs to match exactly between: -- How it's stored (during PUT) -- How it's queried (during GET/LIST) -- How it's compared (path matching) - -### The URL Encoding Problem - -Consider this scenario: - -1. Client uploads: `PUT /bucket/a%20f+/file.txt` -2. s3s decodes to: `a f+/file.txt` (correct: %20→space, +→plus) -3. We store as: `a f+/file.txt` -4. Client lists: `GET /bucket?prefix=a%20f+/` -5. s3s decodes to: `a f+/` -6. We search for: `a f+/` -7. Should work! ✓ - -But what if s3s is NOT decoding properly? Or decoding inconsistently? - -1. Client uploads: `PUT /bucket/a%20f+/file.txt` -2. s3s passes: `a%20f+/file.txt` (BUG: not decoded!) -3. We store as: `a%20f+/file.txt` -4. Client lists: `GET /bucket?prefix=a%20f+/` -5. 
s3s passes: `a%20f+/` -6. We search for: `a%20f+/` -7. Works by accident! ✓ - -But then: -8. Client lists: `GET /bucket?prefix=a+f%2B/` (encoding + as %2B) -9. s3s passes: `a+f%2B/` or `a+f+/` ?? -10. We search for that, but stored name was `a%20f+/` -11. Mismatch! ✗ - -## Solution Strategy - -### Approach 1: Trust s3s Library (Recommended) - -**Assumption**: s3s library correctly URL-decodes all keys from HTTP requests - -**Strategy**: -1. Assume all keys received from s3s are already decoded -2. Store objects with decoded names (UTF-8 strings with literal special chars) -3. Use decoded names for all operations (GET, LIST, DELETE) -4. Never manually URL-encode/decode keys in our handlers -5. Trust s3s to handle HTTP-level encoding/decoding - -**Advantages**: -- Follows separation of concerns -- Simpler code -- Relies on well-tested library behavior - -**Risks**: -- If s3s has a bug, we're affected -- Need to verify s3s actually does this correctly - -### Approach 2: Explicit URL Decoding (Defensive) - -**Assumption**: s3s may not decode keys properly, or there are edge cases - -**Strategy**: -1. Explicitly URL-decode all keys when received from s3s -2. Use `urlencoding::decode()` on all keys in handlers -3. Store and operate on decoded names -4. Add safety checks and error handling - -**Implementation**: -```rust -use urlencoding::decode; - -// In put_object -let key = decode(&input.key) - .map_err(|e| s3_error!(InvalidArgument, format!("Invalid URL encoding in key: {}", e)))? - .into_owned(); -``` - -**Advantages**: -- More defensive -- Explicit control -- Handles s3s bugs or limitations - -**Risks**: -- Double-decoding if s3s already decodes -- May introduce new bugs -- More complex code - -### Approach 3: Hybrid Strategy (Most Robust) - -**Strategy**: -1. Add logging to understand what s3s actually passes us -2. Create tests with various special characters -3. Determine if s3s decodes correctly -4. If yes → use Approach 1 -5. 
If no → use Approach 2 with explicit decoding - -## Recommended Implementation Plan - -### Phase 1: Investigation & Testing - -1. **Create comprehensive tests** for special characters: - - Spaces (` ` / `%20`) - - Plus signs (`+` / `%2B`) - - Percent signs (`%` / `%25`) - - Slashes in names (usually not allowed, but test edge cases) - - Unicode characters - - Mixed special characters - -2. **Add detailed logging**: - ```rust - debug!("Received key from s3s: {:?}", key); - debug!("Key bytes: {:?}", key.as_bytes()); - ``` - -3. **Test with real S3 clients**: - - AWS SDK - - MinIO client (mc) - - Go Cloud Development Kit - - boto3 (Python) - -### Phase 2: Fix Implementation - -Based on Phase 1 findings, implement one of: - -#### Option A: s3s handles decoding correctly -- Add tests to verify behavior -- Document the assumption -- Add assertions or validation - -#### Option B: s3s has bugs or doesn't decode -- Add explicit URL decoding to all handlers -- Use `urlencoding::decode()` consistently -- Add error handling for invalid encoding -- Document the workaround - -### Phase 3: Ensure Consistency - -1. **Audit all key usage**: - - PutObject - - GetObject - - DeleteObject - - ListObjects/ListObjectsV2 - - CopyObject (source and destination) - - HeadObject - - Multi-part upload operations - -2. **Standardize key handling**: - - Create a helper function `normalize_object_key()` - - Use it consistently everywhere - - Add validation - -3. **Update path utilities** (`crates/utils/src/path.rs`): - - Ensure path manipulation functions handle special chars - - Add tests for path operations with special characters - -### Phase 4: Testing & Validation - -1. 
**Unit tests**: - ```rust - #[test] - fn test_object_key_with_space() { - let key = "path with spaces/file.txt"; - // test PUT, GET, LIST operations - } - - #[test] - fn test_object_key_with_plus() { - let key = "path+with+plus/file+name.txt"; - // test all operations - } - - #[test] - fn test_object_key_with_mixed_special_chars() { - let key = "complex/path with spaces+plus%percent.txt"; - // test all operations - } - ``` - -2. **Integration tests**: - - Test with real S3 clients - - Test mc (MinIO client) scenarios from the issue - - Test Go Cloud Development Kit scenario - - Test AWS SDK compatibility - -3. **Regression testing**: - - Ensure existing tests still pass - - Test with normal filenames (no special chars) - - Test with existing data - -## Implementation Details - -### Key Functions to Modify - -1. **rustfs/src/storage/ecfs.rs**: - - `put_object()` - line ~2763 - - `get_object()` - find implementation - - `list_objects_v2()` - line ~2564 - - `delete_object()` - find implementation - - `copy_object()` - handle source and dest keys - - `head_object()` - find implementation - -2. **Helper function to add**: -```rust -/// Normalizes an object key by ensuring it's properly URL-decoded -/// and contains only valid UTF-8 characters. -/// -/// This function should be called on all object keys received from -/// the S3 API to ensure consistent handling of special characters. 
-fn normalize_object_key(key: &str) -> S3Result<String> { - // If s3s already decodes, this is a no-op validation - // If not, this explicitly decodes - match urlencoding::decode(key) { - Ok(decoded) => Ok(decoded.into_owned()), - Err(e) => Err(s3_error!( - InvalidArgument, - format!("Invalid URL encoding in object key: {}", e) - )), - } -} -``` - -### Testing Strategy - -Create a new test module: - -```rust -// crates/e2e_test/src/special_chars_test.rs - -#[cfg(test)] -mod tests { - use super::*; - - #[tokio::test] - async fn test_put_get_object_with_space() { - // Upload file with space in path - let bucket = "test-bucket"; - let key = "folder/file with spaces.txt"; - let content = b"test content"; - - // PUT - put_object(bucket, key, content).await.unwrap(); - - // GET - let retrieved = get_object(bucket, key).await.unwrap(); - assert_eq!(retrieved, content); - - // LIST - let objects = list_objects(bucket, "folder/").await.unwrap(); - assert!(objects.iter().any(|obj| obj.key == key)); - } - - #[tokio::test] - async fn test_put_get_object_with_plus() { - let bucket = "test-bucket"; - let key = "folder/ES+net/file+name.txt"; - // ... similar test - } - - #[tokio::test] - async fn test_mc_client_scenario() { - // Reproduce the exact scenario from the issue - let bucket = "dummy"; - let key = "a f+/b/c/3/README.md"; // Decoded form - // ... test with mc client or simulate its behavior - } -} -``` - -## Edge Cases and Considerations - -### 1. Directory Markers - -RustFS uses `__XLDIR__` suffix for directories: -- Ensure special characters in directory names are handled -- Test: `"folder with spaces/__XLDIR__"` - -### 2. Multipart Upload - -- Upload ID and part operations must handle special chars -- Test: Multipart upload of object with special char path - -### 3. Copy Operations - -CopyObject has both source and destination keys: -```rust -// Both need consistent handling -let src_key = input.copy_source.key(); -let dest_key = input.key; -``` - -### 4. 
Presigned URLs - -If RustFS supports presigned URLs, they need special attention: -- URL encoding in presigned URLs -- Signature calculation with encoded vs decoded keys - -### 5. Event Notifications - -Events include object keys: -- Ensure event payloads have properly encoded/decoded keys -- Test: Webhook target receives correct key format - -### 6. Versioning - -Version IDs with special character keys: -- Test: List object versions with special char keys - -## Security Considerations - -### Path Traversal - -Ensure URL decoding doesn't enable path traversal: -```rust -// BAD: a client can hide ".." segments behind percent-encoding. -// Request path: /bucket/..%2F..%2F..%2Fetc%2Fpasswd -// Decoded key:  "../../../etc/passwd" - -// Solution: Validate decoded keys -fn validate_object_key(key: &str) -> S3Result<()> { -    // Reject ".." path segments only; a plain substring check would also -    // reject legitimate names such as "release..notes.txt" -    if key.split('/').any(|segment| segment == "..") { -        return Err(s3_error!(InvalidArgument, "Invalid object key")); -    } -    if key.starts_with('/') { -        return Err(s3_error!(InvalidArgument, "Object key cannot start with /")); -    } -    Ok(()) -} -``` - -### Null Bytes - -Ensure no null bytes in decoded keys: -```rust -if key.contains('\0') { -    return Err(s3_error!(InvalidArgument, "Object key contains null byte")); -} -``` - -## Testing with Real Clients - -### MinIO Client (mc) - -```bash -# Test space in path (from issue) -mc cp README.md "local/dummy/a%20f+/b/c/3/README.md" -mc ls "local/dummy/a%20f+/" -mc ls "local/dummy/a%20f+/b/c/3/" - -# Test plus in path -mc cp test.txt "local/bucket/ES+net/file+name.txt" -mc ls "local/bucket/ES+net/" - -# Test mixed -mc cp data.json "local/bucket/folder%20with%20spaces+plus/file.json" -``` - -### AWS CLI - -```bash -# Upload with space -aws --endpoint-url=http://localhost:9000 s3 cp test.txt "s3://bucket/path with spaces/file.txt" - -# List -aws --endpoint-url=http://localhost:9000 s3 ls "s3://bucket/path with spaces/" -``` - -### Go Cloud Development Kit - -```go -import "gocloud.dev/blob" - -// Test the exact scenario from the issue -key :=
"/test/data/org_main-org/dashboards/ES+net/LHC+Data+Challenge/firefly-details.json" -err := bucket.WriteAll(ctx, key, data, nil) -``` - -## Success Criteria - -The fix is successful when: - -1. ✅ mc client can upload files with spaces in path -2. ✅ UI correctly displays folders with special characters -3. ✅ UI can list contents of folders with special characters -4. ✅ Files with `+` in path can be uploaded without errors -5. ✅ All S3 operations (PUT, GET, LIST, DELETE) work with special chars -6. ✅ Go Cloud Development Kit can upload files with `+` in path -7. ✅ All existing tests still pass (no regressions) -8. ✅ New tests cover various special character scenarios - -## Documentation Updates - -After implementation, update: - -1. **API Documentation**: Document how special characters are handled -2. **Developer Guide**: Best practices for object naming -3. **Migration Guide**: If storage format changes -4. **FAQ**: Common issues with special characters -5. **This Document**: Final solution and lessons learned - -## References - -- AWS S3 API Specification: https://docs.aws.amazon.com/AmazonS3/latest/API/ -- URL Encoding RFC 3986: https://tools.ietf.org/html/rfc3986 -- s3s Library: https://docs.rs/s3s/0.12.0-rc.4/ -- urlencoding crate: https://docs.rs/urlencoding/ -- Issue #1072 (referenced in comments) - -## Conclusion - -The issue with special characters in object paths is a critical correctness bug that affects S3 API compatibility. The solution requires: - -1. **Understanding** how s3s library handles URL encoding -2. **Implementing** consistent key handling across all operations -3. **Testing** thoroughly with real S3 clients -4. **Validating** that all edge cases are covered - -The recommended approach is to start with investigation and testing (Phase 1) to understand the current behavior, then implement the appropriate fix with comprehensive test coverage. 
- --- **Document Version**: 1.0 -**Date**: 2025-12-09 -**Author**: RustFS Team -**Status**: Draft - Awaiting Investigation Results diff --git a/docs/special-characters-solution.md b/docs/special-characters-solution.md deleted file mode 100644 index 068750e31..000000000 --- a/docs/special-characters-solution.md +++ /dev/null @@ -1,311 +0,0 @@ -# Special Characters in Object Path - Solution Implementation - -## Executive Summary - -After comprehensive investigation, the root cause analysis reveals: - -1. **Backend (rustfs) is handling URL encoding correctly** via the s3s library -2. **The primary issue is likely in the UI/client layer** where URL encoding is not properly handled -3. **Backend enhancements needed** to ensure robustness and better error messages - -## Root Cause Analysis - -### What s3s Library Does - -The s3s library (version 0.12.0-rc.4) **correctly** URL-decodes object keys from HTTP requests: - -```rust -// From s3s-0.12.0-rc.4/src/ops/mod.rs, line 261: -let decoded_uri_path = urlencoding::decode(req.uri.path()) -    .map_err(|_| S3ErrorCode::InvalidURI)? -    .into_owned(); -``` - -This means: -- Client sends: `PUT /bucket/a%20f+/file.txt` -- s3s decodes to: `a f+/file.txt` -- Our handler receives: `key = "a f+/file.txt"` (already decoded) - -### What Our Backend Does - -1. **Storage**: Stores objects with decoded names (e.g., `"a f+/file.txt"`) -2. **Retrieval**: Returns objects with decoded names in LIST responses -3. **Path operations**: Rust's `Path` APIs preserve special characters correctly - -### The Real Problems - -#### Problem 1: UI Client Issue (Part A) - -**Symptom**: UI can navigate TO folder but can't LIST contents - -**Diagnosis**: -- User uploads: `PUT /bucket/a%20f+/b/c/3/README.md` ✅ Works -- CLI lists: `GET /bucket?prefix=a%20f%2B/` ✅ Works (mc properly encodes `+` as `%2B` in the query) -- UI navigates: Shows folder "a f+" ✅ Works -- UI lists folder: `GET /bucket?prefix=a f+/` ❌ Fails (UI doesn't encode!)
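The diagnosis above hinges on `+` meaning different things in a URL path and in a query string. Below is a minimal, std-only Rust sketch of the two decoding conventions. `percent_decode` is a hypothetical helper written for illustration, not the s3s or `urlencoding` API. Whether a given server applies the form convention to query parameters is an assumption you should verify against that server.

```rust
/// Hypothetical helper: percent-decode `input`. With `plus_as_space = false`
/// it follows the URL-path convention (`+` stays a literal plus); with `true`
/// it follows the form/query-string convention (`+` becomes a space).
/// Returns None on a malformed `%` escape or if the result is not UTF-8.
fn percent_decode(input: &str, plus_as_space: bool) -> Option<String> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'%' => {
                // Both characters after '%' must be hex digits.
                let hi = char::from(*bytes.get(i + 1)?).to_digit(16)?;
                let lo = char::from(*bytes.get(i + 2)?).to_digit(16)?;
                out.push((hi * 16 + lo) as u8);
                i += 3;
            }
            b'+' if plus_as_space => {
                out.push(b' ');
                i += 1;
            }
            other => {
                out.push(other);
                i += 1;
            }
        }
    }
    // Decoded bytes must form valid UTF-8 to be usable as an object key.
    String::from_utf8(out).ok()
}

fn main() {
    let raw = "a%20f+/b/c/3/README.md";
    // URL-path convention: %20 -> space, '+' kept as '+'.
    println!("path : {:?}", percent_decode(raw, false));
    // Query/form convention: %20 -> space and '+' -> space.
    println!("query: {:?}", percent_decode(raw, true));
}
```

Running it prints `Some("a f+/b/c/3/README.md")` for the path convention and `Some("a f /b/c/3/README.md")` for the query convention. The same characters name two different prefixes, so a prefix containing a literal `+` has to travel as `%2B` in a query string.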
- -**Root Cause**: The UI is not URL-encoding the prefix when making the LIST request. It should send `prefix=a%20f%2B/` but likely sends `prefix=a f+/` which causes issues. - -**Evidence**: -- mc (MinIO client) works → proves backend is correct -- UI doesn't work → proves UI encoding is wrong - -#### Problem 2: Client Encoding Issue (Part B) - -**Symptom**: 400 error with plus signs - -**Error Message**: `api error InvalidArgument: Invalid argument` - -**Diagnosis**: -The plus sign (`+`) has special meaning in URL query parameters (represents space in form encoding) but not in URL paths. Clients must encode `+` as `%2B` in paths. - -**Example**: -- Correct: `/bucket/ES%2Bnet/file.txt` → decoded to `ES+net/file.txt` -- Wrong: `/bucket/ES+net/file.txt` → might be misinterpreted - -### URL Encoding Rules - -According to RFC 3986 and AWS S3 API: - -| Character | In URL Path | In Query Param | Decoded Result | -|-----------|-------------|----------------|----------------| -| Space | `%20` | `%20` or `+` | ` ` (space) | -| Plus | `%2B` | `%2B` | `+` (plus) | -| Percent | `%25` | `%25` | `%` (percent) | - -**Critical Note**: In URL **paths** (not query params), `+` represents a literal plus sign, NOT a space. Only `%20` represents space in paths. 
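In the other direction, a client that builds requests by hand has to apply the table above itself. The std-only Rust sketch below does this with `percent_encode`, a hypothetical helper; real clients should let their SDK handle encoding. It keeps RFC 3986 unreserved characters and keeps `/` only when encoding the key for the URL path. Every other UTF-8 byte is percent-encoded, including space, `+`, and `%`.

```rust
use std::fmt::Write;

/// Hypothetical helper: percent-encode `input` byte by byte, keeping only
/// RFC 3986 unreserved characters (A-Z, a-z, 0-9, '-', '.', '_', '~').
/// With `keep_slash = true`, '/' is also kept so the result can be used as an
/// object-key URL path; otherwise '/' becomes "%2F" (for a query value).
fn percent_encode(input: &str, keep_slash: bool) -> String {
    let mut out = String::with_capacity(input.len() * 3);
    for &byte in input.as_bytes() {
        let unreserved =
            byte.is_ascii_alphanumeric() || matches!(byte, b'-' | b'.' | b'_' | b'~');
        if unreserved || (keep_slash && byte == b'/') {
            out.push(byte as char);
        } else {
            // Writing into a String cannot fail, so the Result is ignored.
            let _ = write!(out, "%{:02X}", byte);
        }
    }
    out
}

fn main() {
    let key = "a f+/b/c/3/README.md";
    let prefix = "a f+/";
    // Object key in the path: '/' stays a separator, '+' becomes %2B.
    println!("PUT /bucket/{}", percent_encode(key, true));
    // Prefix as a query value: '+' must not be sent raw.
    println!("GET /bucket?list-type=2&prefix={}", percent_encode(prefix, false));
}
```

Encoding `/` as `%2F` in the query value follows the AWS Signature Version 4 rule of encoding `/` everywhere except in the object key name. After decoding, `prefix=a%20f%2B%2F` and `prefix=a%20f%2B/` both name the prefix `a f+/`.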
- -## Solution Implementation - -### Phase 1: Backend Validation & Logging (Low Risk) - -Add defensive checks and better logging to help diagnose issues: - -```rust -// In rustfs/src/storage/ecfs.rs -use tracing::{debug, warn}; - -/// Log (without rejecting) object keys that contain characters which might -/// indicate client-side encoding issues -fn log_potential_encoding_issues(key: &str) { -    // Raw control characters usually mean the client sent unencoded bytes -    if key.contains('\n') || key.contains('\r') || key.contains('\0') { -        warn!("Object key contains control characters: {:?}", key); -    } - -    // Log debug info for troubleshooting -    debug!("Processing object key: {:?} (bytes: {:?})", key, key.as_bytes()); -} -``` - -**Benefit**: Helps diagnose client-side issues without changing behavior. - -### Phase 2: Enhanced Error Messages (Low Risk) - -When validation fails, provide helpful error messages. A `&str` key is always valid UTF-8: per the s3s snippet above, a path that decodes to invalid UTF-8 already fails with `InvalidURI`. The check should therefore target characters that are valid UTF-8 but still unacceptable: - -```rust -// Reject control characters that survived URL decoding -if key.chars().any(|c| matches!(c, '\0' | '\n' | '\r')) { -    return Err(S3Error::with_message( -        S3ErrorCode::InvalidArgument, -        "Object key contains control characters. Ensure keys are properly URL-encoded.", -    )); -} -``` - -### Phase 3: Documentation (No Risk) - -1. **API Documentation**: Document URL encoding requirements -2. **Client Guide**: Explain how to properly encode object keys -3. **Troubleshooting Guide**: Common issues and solutions - -### Phase 4: UI Fix (If Applicable) - -If RustFS includes a web UI/console: - -1. **Ensure UI properly URL-encodes all requests**: -    ```javascript -    // When making requests, encode each segment of the key so "/" stays a separator: -    const encodedKey = key.split('/').map(encodeURIComponent).join('/'); -    fetch(`/bucket/${encodedKey}`); -     -    // When making LIST requests, encode the prefix: -    const encodedPrefix = encodeURIComponent(prefix); -    fetch(`/bucket?prefix=${encodedPrefix}`); -    ``` - -2.
**Display keys as returned**: -    ```javascript -    // ListObjects returns keys as plain text in the XML body (unless the request -    // set encoding-type=url), so display them unchanged. Calling decodeURIComponent -    // here would corrupt names containing '%': "100%25.txt" would be shown as -    // "100%.txt", and "50% off.txt" would throw a URIError. -    const displayKey = key; -    ``` - -## Testing Strategy - -### Test Cases - -Our e2e tests in `crates/e2e_test/src/special_chars_test.rs` cover: - -1. ✅ Spaces in paths: `"a f+/b/c/3/README.md"` -2. ✅ Plus signs in paths: `"ES+net/LHC+Data+Challenge/file.json"` -3. ✅ Mixed special characters -4. ✅ PUT, GET, LIST, DELETE operations -5. ✅ Exact scenario from issue - -### Running Tests - -```bash -# Run special character tests -cargo test --package e2e_test special_chars -- --nocapture - -# Run specific test -cargo test --package e2e_test test_issue_scenario_exact -- --nocapture -``` - -### Expected Results - -All tests should **pass** because: -- s3s correctly decodes URL-encoded keys -- Rust Path APIs preserve special characters -- ecstore stores/retrieves keys correctly -- AWS SDK (used in tests) properly encodes keys - -A failing test would indicate a bug in our backend implementation. - -## Client Guidelines - -### For Application Developers - -When using RustFS with any S3 client: - -1. **Use a proper S3 SDK**: AWS SDK, MinIO SDK, etc. handle encoding automatically -2. **If using raw HTTP**: Manually URL-encode object keys in paths -3. **Remember**: -   - Space → `%20` (not `+` in paths!)
   - Plus → `%2B` -   - Percent → `%25` - -### Example: Correct Client Usage - -```python -# Python boto3 - handles encoding automatically -import boto3 -s3 = boto3.client('s3', endpoint_url='http://localhost:9000') - -# These work correctly - boto3 encodes automatically: -s3.put_object(Bucket='test', Key='path with spaces/file.txt', Body=b'data') -s3.put_object(Bucket='test', Key='path+with+plus/file.txt', Body=b'data') -s3.list_objects_v2(Bucket='test', Prefix='path with spaces/') -``` - -```go -// Go AWS SDK (v1) - handles encoding automatically -package main - -import ( -    "bytes" -    "log" - -    "github.com/aws/aws-sdk-go/aws" -    "github.com/aws/aws-sdk-go/aws/session" -    "github.com/aws/aws-sdk-go/service/s3" -) - -func main() { -    // Point the SDK at RustFS; credentials come from the default AWS credential chain (e.g. environment variables) -    sess := session.Must(session.NewSession(&aws.Config{ -        Endpoint:         aws.String("http://localhost:9000"), -        Region:           aws.String("us-east-1"), -        S3ForcePathStyle: aws.Bool(true), -    })) -    svc := s3.New(sess) - -    // These work correctly - SDK encodes automatically: -    if _, err := svc.PutObject(&s3.PutObjectInput{ -        Bucket: aws.String("test"), -        Key:    aws.String("path with spaces/file.txt"), -        Body:   bytes.NewReader([]byte("data")), -    }); err != nil { -        log.Fatal(err) -    } - -    if _, err := svc.ListObjectsV2(&s3.ListObjectsV2Input{ -        Bucket: aws.String("test"), -        Prefix: aws.String("path with spaces/"), -    }); err != nil { -        log.Fatal(err) -    } -} -``` - -```bash -# MinIO mc client - handles encoding automatically -mc cp file.txt "local/bucket/path with spaces/file.txt" -mc ls "local/bucket/path with spaces/" -``` - -### Example: Manual HTTP Requests - -If making raw HTTP requests (not recommended): - -```bash -# Note: these requests are unsigned; against a bucket that requires -# authentication, add SigV4 signing (e.g. curl's --aws-sigv4 option) - -# Correct: URL-encode the path -curl -X PUT "http://localhost:9000/bucket/path%20with%20spaces/file.txt" \ -  -H "Content-Type: text/plain" \ -  -d "data" - -# Correct: Encode plus as %2B -curl -X PUT "http://localhost:9000/bucket/ES%2Bnet/file.txt" \ -  -H "Content-Type: text/plain" \ -  -d "data" - -# List with encoded prefix -curl "http://localhost:9000/bucket?prefix=path%20with%20spaces/" -``` - -## Monitoring and Debugging - -### Backend Logs - -Enable debug logging to see key processing: - -```bash -RUST_LOG=rustfs=debug cargo run -``` - -Look for log messages showing: -- Received keys -- Validation errors -- Storage operations - -### Common Issues - -| Symptom | Likely Cause
| Solution | -|---------|--------------|----------| -| 400 "InvalidArgument" | Client not encoding properly | Use S3 SDK or manually encode | -| 404 "NoSuchKey" but file exists | Encoding mismatch | Check client encoding | -| UI shows folder but can't list | UI bug - not encoding prefix | Fix UI to encode requests | -| Works with CLI, fails with UI | UI implementation issue | Compare UI requests vs CLI | - -## Conclusion - -### Backend Status: ✅ Working Correctly - -The RustFS backend correctly handles URL-encoded object keys through the s3s library. No backend code changes are required for basic functionality. - -### Client/UI Status: ❌ Needs Attention - -The issues described appear to be client-side or UI-side problems: - -1. **Part A**: UI not properly encoding LIST prefix requests -2. **Part B**: Client not encoding `+` as `%2B` in paths - -### Recommendations - -1. **Short-term**: - - Add logging and better error messages (Phase 1-2) - - Document client requirements (Phase 3) - - Fix UI if applicable (Phase 4) - -2. **Long-term**: - - Add comprehensive e2e tests (already done!) - - Monitor for encoding-related errors - - Educate users on proper S3 client usage - -3. **For Users Experiencing Issues**: - - Use proper S3 SDKs (AWS, MinIO, etc.) - - If using custom clients, ensure proper URL encoding - - If using RustFS UI, report UI bugs separately - ---- - -**Document Version**: 1.0 -**Date**: 2025-12-09 -**Status**: Final - Ready for Implementation -**Next Steps**: Implement Phase 1-3, run tests, update user documentation diff --git a/docs/tls.md b/docs/tls.md deleted file mode 100644 index 404cae876..000000000 --- a/docs/tls.md +++ /dev/null @@ -1,63 +0,0 @@ -# TLS / mTLS configuration - -RustFS supports TLS for serving HTTPS and for outbound gRPC connections (MNMD). 
-It also supports optional client certificate authentication (mTLS) for outbound gRPC: -if a client identity is configured, RustFS will present it; otherwise it will use -server-authenticated TLS only. - -## Recommended `tls/` directory layout - -Place these files in a directory (default: `./tls`, configurable via `RUSTFS_TLS_PATH`). - -``` -TLS_DIR/ - ca.crt # PEM bundle of CA/root certificates to trust (recommended) - public.crt # optional extra root bundle (PEM) - rustfs_cert.pem # server leaf certificate (PEM) used by the RustFS server - rustfs_key.pem # server private key (PEM) used by the RustFS server - - # Optional: outbound mTLS client identity for MNMD - client_cert.pem # client certificate chain (PEM) - client_key.pem # client private key (PEM) - - # Optional: server-side mTLS (inbound client certificate verification) - client_ca.crt # PEM bundle of CA certificates to verify client certificates -``` - -## Environment variables - -### Root trust - -- `RUSTFS_TLS_PATH` (default: `tls`): TLS directory. -- `RUSTFS_TRUST_SYSTEM_CA` (default: `false`): When `true`, include the platform/system - trust store as additional roots. When `false`, system roots are not used. -- `RUSTFS_TRUST_LEAF_CERT_AS_CA` (default: `false`): Compatibility switch. If `true`, - RustFS will also load `rustfs_cert.pem` into the root store (treating leaf certificates - as trusted roots). Prefer providing `ca.crt` instead. - -### Outbound mTLS identity - -- `RUSTFS_MTLS_CLIENT_CERT` (default: `${RUSTFS_TLS_PATH}/client_cert.pem`): path to PEM client cert/chain. -- `RUSTFS_MTLS_CLIENT_KEY` (default: `${RUSTFS_TLS_PATH}/client_key.pem`): path to PEM private key. - -If both files exist, RustFS enables outbound mTLS. If either is missing, RustFS proceeds -with server-only TLS. 
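The outbound identity rules above reduce to a small resolution step. The sketch below is hypothetical and is not RustFS's actual code: `resolve_outbound_mtls` and `ClientIdentityPaths` are illustrative names. It assumes the documented defaults (`tls` directory, `client_cert.pem`, `client_key.pem`). The environment lookup and the file-existence check are passed in as closures so the logic can be tested without touching the real environment.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical type: paths to the outbound client certificate chain and key.
#[derive(Debug, PartialEq)]
struct ClientIdentityPaths {
    cert: PathBuf,
    key: PathBuf,
}

/// Hypothetical helper mirroring the documented rules: explicit env vars win,
/// otherwise fall back to files inside RUSTFS_TLS_PATH (default "tls").
/// Returns None (server-only TLS) unless BOTH files exist.
fn resolve_outbound_mtls(
    env: impl Fn(&str) -> Option<String>,
    file_exists: impl Fn(&Path) -> bool,
) -> Option<ClientIdentityPaths> {
    let tls_dir = PathBuf::from(env("RUSTFS_TLS_PATH").unwrap_or_else(|| "tls".to_string()));
    let cert = env("RUSTFS_MTLS_CLIENT_CERT")
        .map(PathBuf::from)
        .unwrap_or_else(|| tls_dir.join("client_cert.pem"));
    let key = env("RUSTFS_MTLS_CLIENT_KEY")
        .map(PathBuf::from)
        .unwrap_or_else(|| tls_dir.join("client_key.pem"));
    if file_exists(&cert) && file_exists(&key) {
        Some(ClientIdentityPaths { cert, key })
    } else {
        None
    }
}

fn main() {
    // Real usage: read the process environment and the filesystem.
    match resolve_outbound_mtls(|k| std::env::var(k).ok(), |p| p.exists()) {
        Some(id) => println!("outbound mTLS with {:?} / {:?}", id.cert, id.key),
        None => println!("server-only TLS (no client identity found)"),
    }
}
```

Treating "either file missing" as "no identity" matches the documented fallback to server-only TLS, not an error.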
- -### Server-side mTLS (inbound client certificate verification) - -- `RUSTFS_SERVER_MTLS_ENABLE` (default: `false`): When `true`, the RustFS server requires - clients to present valid certificates signed by a trusted CA for authentication. - -When enabled, RustFS loads client CA certificates from: -1. `${RUSTFS_TLS_PATH}/client_ca.crt` (preferred) -2. `${RUSTFS_TLS_PATH}/ca.crt` (fallback if `client_ca.crt` does not exist) - -**Important**: Server mTLS is disabled by default. When enabled but no valid CA bundle is -found, RustFS will fail to start with a clear error message. This ensures that server mTLS -cannot be accidentally enabled without proper client CA configuration. - -## Failure mode for HTTPS without roots - -When connecting to an `https://` MNMD address, RustFS requires at least one configured -trusted root. If none are loaded (no `ca.crt`/`public.crt` and system roots disabled), -RustFS fails fast with a clear error message.
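Two rules above take only a few lines each: the client CA lookup with its fallback, and the fail-fast check for `https://` peers. This is a hypothetical sketch, not RustFS code. `select_client_ca` and `check_outbound_roots` are illustrative names. `loaded_roots` stands in for the number of certificates actually loaded from `ca.crt`, `public.crt`, the system store, or the leaf-as-CA switch.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: pick the CA bundle used to verify inbound client
/// certificates when server mTLS is enabled (`client_ca.crt` first, then
/// `ca.crt`). An Err here is meant to abort startup.
fn select_client_ca(tls_dir: &Path, file_exists: impl Fn(&Path) -> bool) -> Result<PathBuf, String> {
    let preferred = tls_dir.join("client_ca.crt");
    if file_exists(&preferred) {
        return Ok(preferred);
    }
    let fallback = tls_dir.join("ca.crt");
    if file_exists(&fallback) {
        return Ok(fallback);
    }
    Err(format!(
        "server mTLS is enabled but neither {} nor {} exists",
        preferred.display(),
        fallback.display()
    ))
}

/// Hypothetical helper: fail fast when an https:// peer is configured but no
/// trusted root certificate was loaded from any source.
fn check_outbound_roots(peer_addr: &str, loaded_roots: usize) -> Result<(), String> {
    if peer_addr.starts_with("https://") && loaded_roots == 0 {
        return Err(format!(
            "no trusted root certificates loaded for {peer_addr}; provide ca.crt/public.crt or set RUSTFS_TRUST_SYSTEM_CA=true"
        ));
    }
    Ok(())
}

fn main() {
    let tls_dir = Path::new("tls");
    match select_client_ca(tls_dir, |p| p.exists()) {
        Ok(ca) => println!("verifying client certificates with {}", ca.display()),
        Err(e) => eprintln!("startup error: {e}"),
    }
    if let Err(e) = check_outbound_roots("https://node2:9000", 0) {
        eprintln!("startup error: {e}");
    }
}
```

Returning `Result` lets startup code surface the message and exit, in line with the documented fail-fast behavior. A plain `http://` peer is not affected by the root check.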