Pipeline Developer Guide

Version: 2.0.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

Welcome

This is the comprehensive technical guide for the Adaptive Pipeline. Whether you're learning advanced Rust patterns, contributing to the project, or using the pipeline in production, this guide provides the depth you need.

How to Use This Guide

This guide follows a progressive disclosure approach - each section builds on previous ones:

Start Here: Fundamentals

If you're new to the pipeline, start with Fundamentals. This section introduces core concepts in an accessible way:

  • What pipelines do and why they're useful
  • Key terminology and concepts
  • How stages work together
  • Basic configuration
  • Running your first pipeline

Time commitment: 30-45 minutes

Building Understanding: Architecture

Once you understand the basics, explore the Architecture section. This explains how the pipeline is designed:

  • Layered architecture (Domain, Application, Infrastructure)
  • Domain-Driven Design concepts
  • Design patterns in use (Repository, Service, Adapter, Observer)
  • Dependency management

This section bridges the gap between basic usage and implementation details.

Time commitment: 1-2 hours

Going Deeper: Implementation

The Implementation section covers how specific features work:

  • Stage processing details
  • Compression and encryption
  • Data persistence and schema management
  • File I/O and chunking
  • Metrics and observability

Perfect for contributors or those adapting the pipeline for specific needs.

Time commitment: 2-3 hours

Expert Level: Advanced Topics

For optimization and extension, the Advanced Topics section covers:

  • Concurrency model and thread pooling
  • Performance optimization techniques
  • Creating custom stages and algorithms

Time commitment: 2-4 hours depending on depth

Reference: Formal Documentation

The Formal Documentation section contains:

  • Software Requirements Specification (SRS)
  • Software Design Document (SDD)
  • Test Strategy (STP)

These are comprehensive reference documents.

Documentation Scope

Following our "reasonable" principle, this guide focuses on:

✅ What you need to know to use, contribute to, or extend the pipeline
✅ Why decisions were made, with just enough context
✅ How to accomplish tasks, with practical examples
✅ Advanced Rust patterns demonstrated in real code

We intentionally do not include:

❌ Rust language tutorials (see The Rust Book)
❌ General programming concepts
❌ Third-party library documentation (links provided instead)
❌ Exhaustive algorithm details (high-level explanations with references)

Learning Path Recommendations

I want to use the pipeline

→ Read Fundamentals → Skip to Implementation for specific features

I want to learn advanced Rust patterns

→ Focus on Architecture section (patterns) → Review Implementation for real-world examples → Study Advanced Topics for concurrency/performance

I'm building something similar

→ Read Architecture + Implementation → Study formal documentation (SRS/SDD) → Review source code with this guide as reference

Conventions Used

Throughout this guide:

  • Code examples are complete and runnable unless marked otherwise
  • Source references use the format module/file.rs:line
  • Diagrams are written in PlantUML (rendered as SVG in the book)
  • Callouts highlight important information:

Note: Additional helpful information

Warning: Important caveats or gotchas

Example: Practical code demonstration

Ready to Start?

Choose your path:

Let's dive in!

What is a Pipeline?

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

Introduction to pipelines and their purpose.

What is a Pipeline?

A pipeline is a series of connected processing stages that transform data from input to output. Each stage performs a specific operation, and data flows through the stages sequentially or in parallel.

Think of it like a factory assembly line:

  • Raw materials (input file) enter at one end
  • Each station (stage) performs a specific task
  • The finished product (processed file) exits at the other end

Real-World Analogy

Assembly Line

Imagine an automobile assembly line:

Raw Materials → Welding → Painting → Assembly → Quality Check → Finished Car

In our pipeline system:

Input File → Compression → Encryption → Validation → Output File

Each stage:

  • Receives data from the previous stage
  • Performs its specific transformation
  • Passes the result to the next stage

Why Use a Pipeline?

Modularity

Each stage does one thing well. You can:

  • Add new stages easily
  • Remove stages you don't need
  • Reorder stages as needed

Example: Need encryption? Add an encryption stage. Don't need compression? Remove the compression stage.

Reusability

Stages can be used in multiple pipelines:

  • Use the same compression stage in different workflows
  • Share validation logic across projects
  • Build libraries of reusable components

Testability

Each stage can be tested independently:

  • Unit test individual stages
  • Mock stage inputs/outputs
  • Verify stage behavior in isolation

Scalability

Pipelines can process data efficiently:

  • Process file chunks in parallel
  • Distribute work across CPU cores
  • Handle files of any size

Our Pipeline System

The Adaptive Pipeline provides:

File Processing: Transform files through configurable stages

  • Input: Any file type
  • Stages: Compression, encryption, validation
  • Output: Processed .adapipe file
  • Memory-mapped files for efficient processing of huge files

Flexibility: Configure stages for your needs

  • Enable/disable stages
  • Choose algorithms (Brotli, LZ4, Zstandard for compression)
  • Set security levels (Public → Top Secret)

Performance: Handle large files efficiently

  • Stream processing (low memory usage)
  • Parallel chunk processing
  • Optimized algorithms

Security: Protect sensitive data

  • AES-256-GCM encryption
  • Argon2 key derivation
  • Integrity verification with checksums

Pipeline Flow

Here's how data flows through the pipeline:

Pipeline Flow

  1. Input: Read file from disk
  2. Chunk: Split into manageable pieces (default 1MB)
  3. Process: Apply stages to each chunk
    • Compress (optional)
    • Encrypt (optional)
    • Calculate checksum (always)
  4. Store: Write processed data and metadata
  5. Verify: Confirm integrity of output
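
The loop below sketches this flow using only the standard library. It is an illustration, not the pipeline's actual code: the helper name process_in_chunks and the file names are hypothetical, and the real system applies the configured stages and records per-chunk metadata where the comment indicates.

use std::fs::File;
use std::io::{BufReader, BufWriter, Read, Write};

// Hypothetical sketch of the chunked flow described above: read the input in
// fixed-size chunks, "process" each chunk (a pass-through here), and write it
// to the output.
fn process_in_chunks(input: &str, output: &str, chunk_size: usize) -> std::io::Result<()> {
    let mut reader = BufReader::new(File::open(input)?);
    let mut writer = BufWriter::new(File::create(output)?);
    let mut buffer = vec![0u8; chunk_size];

    loop {
        let bytes_read = reader.read(&mut buffer)?;
        if bytes_read == 0 {
            break; // end of file
        }
        let chunk = &buffer[..bytes_read];
        // Stages (compression, encryption, checksum) would transform `chunk` here.
        writer.write_all(chunk)?;
    }
    writer.flush()
}

fn main() -> std::io::Result<()> {
    process_in_chunks("input.bin", "output.bin", 1024 * 1024) // 1 MB chunks (the default)
}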

What You Can Do

With this pipeline, you can:

✅ Compress files to save storage space
✅ Encrypt files to protect sensitive data
✅ Validate integrity to detect corruption
✅ Process large files without running out of memory
✅ Customize workflows with configurable stages
✅ Track metrics to monitor performance

Next Steps

Continue to:

Core Concepts

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

Essential concepts for understanding the pipeline.

Key Terminology

Pipeline

A complete file processing workflow with:

  • Unique ID: Every pipeline has a ULID identifier
  • Input path: Source file to process
  • Output path: Destination for processed data
  • Stages: Ordered list of processing steps
  • Status: Created → Running → Completed (or Failed)

Stage

An individual processing operation within a pipeline:

  • Type: Compression, Encryption, or Integrity Check
  • Algorithm: Specific implementation (e.g., Brotli, AES-256-GCM)
  • Sequence: Order in the pipeline (1, 2, 3, ...)
  • Configuration: Stage-specific settings

File Chunk

A portion of a file processed independently:

  • Size: Configurable (default 1MB)
  • Sequence: Chunk number (0, 1, 2, ...)
  • Checksum: Integrity verification value
  • Offset: Position in original file
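
As a rough sketch, such a chunk could be modeled as a plain Rust type. The field names mirror the list above; the actual FileChunk value object in the domain crate may differ in detail.

// Hypothetical illustration of a file chunk (not the domain crate's exact type).
pub struct FileChunk {
    sequence: u64,      // chunk number (0, 1, 2, ...)
    offset: u64,        // byte position in the original file
    data: Vec<u8>,      // chunk payload (default 1 MB)
    checksum: [u8; 32], // integrity verification value (e.g., SHA-256)
}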

Core Components

Entities

Pipeline Entity

#![allow(unused)]
fn main() {
pub struct Pipeline {
    id: PipelineId,
    input_file_path: FilePath,
    output_file_path: FilePath,
    stages: Vec<PipelineStage>,
    status: PipelineStatus,
    created_at: DateTime<Utc>,
}
}

PipelineStage Entity

#![allow(unused)]
fn main() {
pub struct PipelineStage {
    id: StageId,
    pipeline_id: PipelineId,
    stage_type: StageType,
    algorithm: Algorithm,
    sequence_number: u32,
}
}

Value Objects

FilePath - Validated file system path

  • Must exist (for input) or be writable (for output)
  • Supports absolute and relative paths
  • Cross-platform compatibility

FileSize - File size in bytes

  • Human-readable display (KB, MB, GB)
  • Validation for reasonable limits
  • Efficient storage representation

Algorithm - Processing algorithm specification

  • Compression: Brotli, LZ4, Zstandard
  • Encryption: AES-256-GCM, ChaCha20-Poly1305
  • Checksum: Blake3, SHA-256
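
To make the value-object style concrete, here is a small sketch in the spirit of FileSize: validated on construction and displayed in human-readable units. It is illustrative only, with a made-up size limit, and is not the project's actual implementation.

// Illustrative FileSize-style value object (not the project's actual type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FileSize(u64);

impl FileSize {
    /// Hypothetical upper bound, used only in this sketch.
    const MAX_BYTES: u64 = 1 << 50; // 1 PiB

    pub fn new(bytes: u64) -> Result<Self, String> {
        // Validation for reasonable limits, as described above.
        if bytes > Self::MAX_BYTES {
            return Err(format!("file size {} exceeds the supported limit", bytes));
        }
        Ok(FileSize(bytes))
    }

    // Human-readable display (KB, MB, GB).
    pub fn human_readable(&self) -> String {
        const KB: u64 = 1024;
        const MB: u64 = KB * 1024;
        const GB: u64 = MB * 1024;
        match self.0 {
            b if b >= GB => format!("{:.2} GB", b as f64 / GB as f64),
            b if b >= MB => format!("{:.2} MB", b as f64 / MB as f64),
            b if b >= KB => format!("{:.2} KB", b as f64 / KB as f64),
            b => format!("{} B", b),
        }
    }
}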

Data Flow

Sequential Processing

Stages execute in order:

Input → Stage 1 → Stage 2 → Stage 3 → Output

Parallel Chunk Processing

Chunks process independently:

Chunk 0 ──┐
Chunk 1 ──┼→ All go through stages → Reassemble
Chunk 2 ──┘

This enables:

  • Concurrency: Multiple chunks processed simultaneously
  • Memory efficiency: Only active chunks in memory
  • Scalability: Leverage multiple CPU cores
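
As an illustration of fanning chunks out across cores, the sketch below uses the rayon crate; this is an assumption made for the example, and the pipeline's actual executor may differ.

use rayon::prelude::*; // assumes the `rayon` crate for data parallelism

// Sketch of parallel chunk processing: each chunk is transformed independently
// on the thread pool, and rayon's indexed collect preserves the original chunk
// order for reassembly.
fn process_chunks_in_parallel(chunks: Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    chunks
        .into_par_iter()
        .map(|chunk| {
            // The configured stages (compress, encrypt, checksum) would run here;
            // this sketch just passes the bytes through unchanged.
            chunk
        })
        .collect()
}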

Pipeline Execution Sequence

Stage Execution

  1. CLI receives command
  2. Pipeline Service creates pipeline
  3. File Processor reads input file
  4. For each chunk:
    • Apply compression (if enabled)
    • Apply encryption (if enabled)
    • Calculate checksum (always)
    • Store chunk metadata
    • Write processed chunk
  5. Update pipeline status
  6. Return result to user

Domain Model

Our domain model follows Domain-Driven Design principles:

Domain Model

Aggregates

Pipeline Aggregate - The root entity

  • Contains Pipeline entity
  • Manages associated FileChunks
  • Enforces business rules
  • Ensures consistency

Relationships

  • Pipeline has many PipelineStages (1:N)
  • Pipeline processes FileChunks (1:N)
  • FileChunk belongs to Pipeline (N:1)
  • PipelineStage uses Algorithm (N:1)

Processing Guarantees

Integrity

Every chunk has a checksum:

  • Calculated after processing
  • Verified on read/restore
  • Detects any corruption
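
A minimal sketch of per-chunk checksumming, assuming the sha2 crate (the pipeline also supports BLAKE3); this block is illustrative rather than the project's code.

use sha2::{Digest, Sha256}; // assumes the `sha2` crate

// Compute a chunk checksum after processing...
fn chunk_checksum(chunk: &[u8]) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(chunk);
    hasher.finalize().into()
}

// ...and verify it again on read/restore. Any corruption changes the digest,
// so a mismatch is detected here.
fn verify_chunk(chunk: &[u8], expected: &[u8; 32]) -> bool {
    &chunk_checksum(chunk) == expected
}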

Atomicity

Pipeline operations are transactional:

  • All stages complete, or none do
  • Metadata stored consistently
  • No partial outputs on failure
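
A common way to get "no partial outputs" for file data is the write-to-temporary-then-rename pattern sketched below. This is a general technique shown for illustration, not necessarily the pipeline's exact mechanism.

use std::fs;
use std::io::{self, Write};
use std::path::Path;

// Stage the bytes in a temporary file and rename it into place only after the
// write succeeds; on failure, nothing appears at the final path.
fn write_atomically(final_path: &Path, data: &[u8]) -> io::Result<()> {
    let tmp_path = final_path.with_extension("tmp");
    {
        let mut tmp = fs::File::create(&tmp_path)?;
        tmp.write_all(data)?;
        tmp.sync_all()?; // flush to disk before publishing
    }
    fs::rename(&tmp_path, final_path) // atomic within a filesystem on most platforms
}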

Durability

Processed data is persisted:

  • SQLite database for metadata
  • File system for binary data
  • Recoverable after crashes

Next Steps

Continue to:

Pipeline Stages

Version: 1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner, Claude Code Status: Active

What is a Stage?

A pipeline stage is a single processing operation that transforms data in a specific way. Each stage performs one well-defined task, like compressing data, encrypting it, or verifying its integrity.

Think of stages like workstations on an assembly line. Each workstation has specialized tools and performs one specific operation. The product moves from one workstation to the next until it's complete.

Stage Types

Our pipeline supports four main categories of stages:

1. Compression Stages

Compression stages reduce the size of your data. This is useful for:

  • Saving disk space
  • Reducing network bandwidth
  • Faster file transfers
  • Lower storage costs

Available Compression Algorithms:

  • Brotli - Best compression ratio, slower speed

    • Best for: Text files, web content, logs
    • Performance: Excellent compression, moderate speed
    • Memory: Higher memory usage
  • Gzip - General-purpose compression

    • Best for: General files, wide compatibility
    • Performance: Good balance of speed and ratio
    • Memory: Moderate memory usage
  • Zstandard (zstd) - Modern, fast compression

    • Best for: Large files, real-time compression
    • Performance: Excellent speed and ratio
    • Memory: Efficient memory usage
  • LZ4 - Extremely fast compression

    • Best for: Real-time applications, live data streams
    • Performance: Fastest compression, moderate ratio
    • Memory: Low memory usage

2. Encryption Stages

Encryption stages protect your data by making it unreadable without the correct key. This is essential for:

  • Protecting sensitive information
  • Compliance with security regulations
  • Secure data transmission
  • Privacy protection

Available Encryption Algorithms:

  • AES-256-GCM - Industry standard encryption

    • Key Size: 256 bits (32 bytes)
    • Security: FIPS approved, very strong
    • Performance: Excellent with AES-NI hardware support
    • Authentication: Built-in integrity verification
  • ChaCha20-Poly1305 - Modern stream cipher

    • Key Size: 256 bits (32 bytes)
    • Security: Strong, constant-time implementation
    • Performance: Consistent across all platforms
    • Authentication: Built-in integrity verification
  • AES-128-GCM - Faster AES variant

    • Key Size: 128 bits (16 bytes)
    • Security: Still very secure, slightly faster
    • Performance: Faster than AES-256
    • Authentication: Built-in integrity verification

3. Integrity Verification Stages

Integrity stages ensure your data hasn't been corrupted or tampered with. They create a unique "fingerprint" of your data called a checksum or hash.

Available Hashing Algorithms:

  • SHA-256 - Industry standard hashing

    • Output: 256 bits (32 bytes)
    • Security: Cryptographically secure
    • Performance: Good balance
    • Use Case: General integrity verification
  • SHA-512 - Stronger SHA variant

    • Output: 512 bits (64 bytes)
    • Security: Stronger than SHA-256
    • Performance: Good on 64-bit systems
    • Use Case: High-security applications
  • BLAKE3 - Modern, high-performance hashing

    • Output: 256 bits (32 bytes)
    • Security: Strong security properties
    • Performance: Very fast
    • Use Case: High-performance applications

4. Transform Stages

Transform stages modify or inspect data for specific purposes like debugging, monitoring, or data flow analysis. Unlike other stages, transforms focus on observability and diagnostics rather than data protection or size reduction.

Available Transform Stages:

  • Tee - Data splitting/inspection stage

    • Purpose: Copy data to secondary output while passing through
    • Use Cases: Debugging, monitoring, audit trails, data forking
    • Performance: Limited by I/O speed to tee output
    • Configuration:
      • output_path (required): Where to write teed data
      • format: binary (default), hex, or text
      • enabled: true (default) or false to disable
    • Example: Capture intermediate pipeline data for analysis
  • Debug - Diagnostic monitoring stage

    • Purpose: Monitor data flow with zero data modification
    • Use Cases: Pipeline debugging, performance analysis, corruption detection
    • Performance: Minimal overhead, pass-through operation
    • Configuration:
      • label (required): Unique identifier for metrics
    • Metrics: Emits Prometheus metrics (checksums, bytes, chunk counts)
    • Example: Detect where data corruption occurs in complex pipelines

Stage Configuration

Each stage has a configuration that specifies how it should process data:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::entities::{PipelineStage, StageType, StageConfiguration};
use std::collections::HashMap;

// Example: Compression stage
let compression_stage = PipelineStage::new(
    "compress".to_string(),
    StageType::Compression,
    StageConfiguration::new(
        "zstd".to_string(),  // algorithm name
        HashMap::new(),      // parameters
        false,               // parallel processing
    ),
    0, // stage order
)?;

// Example: Encryption stage
let encryption_stage = PipelineStage::new(
    "encrypt".to_string(),
    StageType::Encryption,
    StageConfiguration::new(
        "aes256gcm".to_string(),
        HashMap::new(),
        false,
    ),
    1, // stage order
)?;

// Example: Integrity verification stage
let integrity_stage = PipelineStage::new(
    "verify".to_string(),
    StageType::Checksum,
    StageConfiguration::new(
        "sha256".to_string(),
        HashMap::new(),
        false,
    ),
    2, // stage order
)?;
}

Stage Execution Order

Stages execute in the order you define them. The output of one stage becomes the input to the next stage.

Recommended Order for Processing:

  1. Compress (reduce size first)
  2. Encrypt (protect compressed data)
  3. Verify integrity (create checksum of encrypted data)

For Restoration (reverse order):

  1. Verify integrity (check encrypted data)
  2. Decrypt (recover compressed data)
  3. Decompress (restore original file)

Processing Pipeline:
Input File → Compress → Encrypt → Verify → Output File

Restoration Pipeline:
Input File → Verify → Decrypt → Decompress → Output File

Combining Stages

You can combine stages in different ways depending on your needs:

Maximum Security

#![allow(unused)]
fn main() {
vec![
    PipelineStage::new(
        "compress".to_string(),
        StageType::Compression,
        StageConfiguration::new("brotli".to_string(), HashMap::new(), false),
        0,
    )?,
    PipelineStage::new(
        "encrypt".to_string(),
        StageType::Encryption,
        StageConfiguration::new("aes256gcm".to_string(), HashMap::new(), false),
        1,
    )?,
    PipelineStage::new(
        "verify".to_string(),
        StageType::Checksum,
        StageConfiguration::new("blake3".to_string(), HashMap::new(), false),
        2,
    )?,
]
}

Maximum Speed

#![allow(unused)]
fn main() {
vec![
    PipelineStage::new(
        "compress".to_string(),
        StageType::Compression,
        StageConfiguration::new("lz4".to_string(), HashMap::new(), false),
        0,
    )?,
    PipelineStage::new(
        "encrypt".to_string(),
        StageType::Encryption,
        StageConfiguration::new("chacha20poly1305".to_string(), HashMap::new(), false),
        1,
    )?,
]
}

Balanced Approach

#![allow(unused)]
fn main() {
vec![
    PipelineStage::new(
        "compress".to_string(),
        StageType::Compression,
        StageConfiguration::new("zstd".to_string(), HashMap::new(), false),
        0,
    )?,
    PipelineStage::new(
        "encrypt".to_string(),
        StageType::Encryption,
        StageConfiguration::new("aes256gcm".to_string(), HashMap::new(), false),
        1,
    )?,
    PipelineStage::new(
        "verify".to_string(),
        StageType::Checksum,
        StageConfiguration::new("sha256".to_string(), HashMap::new(), false),
        2,
    )?,
]
}

Debugging Pipeline

#![allow(unused)]
fn main() {
vec![
    PipelineStage::new(
        "debug-input".to_string(),
        StageType::Transform,
        StageConfiguration::new("debug".to_string(), {
            let mut params = HashMap::new();
            params.insert("label".to_string(), "input-data".to_string());
            params
        }, false),
        0,
    )?,
    PipelineStage::new(
        "compress".to_string(),
        StageType::Compression,
        StageConfiguration::new("zstd".to_string(), HashMap::new(), false),
        1,
    )?,
    PipelineStage::new(
        "tee-compressed".to_string(),
        StageType::Transform,
        StageConfiguration::new("tee".to_string(), {
            let mut params = HashMap::new();
            params.insert("output_path".to_string(), "/tmp/compressed.bin".to_string());
            params.insert("format".to_string(), "hex".to_string());
            params
        }, false),
        2,
    )?,
    PipelineStage::new(
        "encrypt".to_string(),
        StageType::Encryption,
        StageConfiguration::new("aes256gcm".to_string(), HashMap::new(), false),
        3,
    )?,
]
}

Parallel Processing

Stages process file chunks in parallel for better performance:

File Split into Chunks:
┌──────┬──────┬──────┬──────┐
│Chunk1│Chunk2│Chunk3│Chunk4│
└──┬───┴──┬───┴──┬───┴──┬───┘
   │      │      │      │
   ▼      ▼      ▼      ▼
   ┌──────┬──────┬──────┬──────┐
   │Stage1│Stage1│Stage1│Stage1│ (Parallel)
   └──┬───┴──┬───┴──┬───┴──┬───┘
      ▼      ▼      ▼      ▼
   ┌──────┬──────┬──────┬──────┐
   │Stage2│Stage2│Stage2│Stage2│ (Parallel)
   └──┬───┴──┬───┴──┬───┴──┬───┘
      │      │      │      │
      ▼      ▼      ▼      ▼
   Combined Output File

This parallel processing allows the pipeline to utilize multiple CPU cores for faster throughput.

Stage Validation

The pipeline validates stages at creation time:

  • Algorithm compatibility: Ensures compression algorithms are only used in compression stages
  • Stage order: Verifies stages have unique, sequential order numbers
  • Configuration validity: Checks all stage parameters are valid
  • Dependency checks: Ensures restoration pipelines match processing pipelines

#![allow(unused)]
fn main() {
// This will fail - wrong algorithm for stage type
PipelineStage::new(
    "compress".to_string(),
    StageType::Compression,
    StageConfiguration::new(
        "aes256gcm".to_string(), // ❌ Encryption algorithm in compression stage!
        HashMap::new(),
        false,
    ),
    0,
) // ❌ Error: Algorithm not compatible with stage type
}

Extending with Custom Stages

The pipeline can be easily extended through custom stages to meet your specific requirements. You can create custom stages that implement your own processing logic, integrate third-party tools, or add specialized transformations.

For detailed information on implementing custom stages, see Custom Stages in the Advanced Topics section.

Next Steps

Now that you understand pipeline stages, you can learn about:

Configuration Basics

Version: 1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner, Claude Code Status: Active

Overview

The pipeline system provides flexible configuration through command-line options, environment variables, and configuration files. This chapter covers the basics of configuring your pipelines.

Command-Line Interface

The pipeline CLI provides several commands for managing and running pipelines.

Basic Commands

Process a File

pipeline process \
  --input /path/to/input.txt \
  --output /path/to/output.bin \
  --pipeline my-pipeline

Create a Pipeline

pipeline create \
  --name my-pipeline \
  --stages compression,encryption,integrity

List Pipelines

pipeline list

Show Pipeline Details

pipeline show my-pipeline

Delete a Pipeline

pipeline delete my-pipeline --force

Performance Options

CPU Threads

Control the number of worker threads for CPU-bound operations (compression, encryption):

pipeline process \
  --input file.txt \
  --output file.bin \
  --pipeline my-pipeline \
  --cpu-threads 8

Default: Number of CPU cores - 1 (reserves one core for I/O)

Tips:

  • Too high: CPU thrashing, context switching overhead
  • Too low: Underutilized cores, slower processing
  • Monitor CPU saturation metrics to tune
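
A sketch of how that default could be computed; this is illustrative, and the pipeline's actual logic may differ.

use std::thread;

// Default worker count: number of CPU cores minus one (reserving a core for
// I/O), with a floor of one worker.
fn default_cpu_threads() -> usize {
    let cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    cores.saturating_sub(1).max(1)
}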

I/O Threads

Control the number of concurrent I/O operations:

pipeline process \
  --input file.txt \
  --output file.bin \
  --pipeline my-pipeline \
  --io-threads 24

Default: Device-specific (NVMe: 24, SSD: 12, HDD: 4)

Storage Type Detection:

pipeline process \
  --input file.txt \
  --output file.bin \
  --pipeline my-pipeline \
  --storage-type nvme  # or ssd, hdd

Channel Depth

Control backpressure in the pipeline stages:

pipeline process \
  --input file.txt \
  --output file.bin \
  --pipeline my-pipeline \
  --channel-depth 8

Default: 4

Tips:

  • Lower values: Less memory, may cause pipeline stalls
  • Higher values: More buffering, higher memory usage
  • Optimal value depends on chunk processing time and I/O latency

Chunk Size

Configure the size of file chunks for parallel processing:

pipeline process \
  --input file.txt \
  --output file.bin \
  --pipeline my-pipeline \
  --chunk-size-mb 10

Default: Automatically determined based on file size and available resources

Global Options

Verbose Logging

Enable detailed logging output:

pipeline --verbose process \
  --input file.txt \
  --output file.bin \
  --pipeline my-pipeline

Configuration File

Use a custom configuration file:

pipeline --config /path/to/config.toml process \
  --input file.txt \
  --output file.bin \
  --pipeline my-pipeline

Configuration Files

Configuration files use TOML format and allow you to save pipeline settings for reuse.

Basic Configuration

[pipeline]
name = "my-pipeline"
stages = ["compression", "encryption", "integrity"]

[performance]
cpu_threads = 8
io_threads = 24
channel_depth = 4

[processing]
chunk_size_mb = 10

Algorithm Configuration

[stages.compression]
algorithm = "zstd"

[stages.encryption]
algorithm = "aes-256-gcm"
key_file = "/path/to/keyfile"

[stages.integrity]
algorithm = "sha256"

Complete Example

# Pipeline configuration example
[pipeline]
name = "secure-archival"
description = "High compression with encryption for archival"

[stages.compression]
algorithm = "brotli"
level = 11  # Maximum compression

[stages.encryption]
algorithm = "aes-256-gcm"
key_derivation = "argon2"

[stages.integrity]
algorithm = "blake3"

[performance]
cpu_threads = 16
io_threads = 24
channel_depth = 8
storage_type = "nvme"

[processing]
chunk_size_mb = 64
parallel_workers = 16

Using Configuration Files

# Use a configuration file
pipeline --config secure-archival.toml process \
  --input large-dataset.tar \
  --output large-dataset.bin

# Override configuration file settings
pipeline --config secure-archival.toml \
  --cpu-threads 8 \
  process --input file.txt --output file.bin

Environment Variables

Environment variables provide another way to configure the pipeline:

# Set performance defaults
export PIPELINE_CPU_THREADS=8
export PIPELINE_IO_THREADS=24
export PIPELINE_CHANNEL_DEPTH=8

# Set default chunk size
export PIPELINE_CHUNK_SIZE_MB=10

# Enable verbose logging
export PIPELINE_VERBOSE=true

# Run pipeline
pipeline process --input file.txt --output file.bin --pipeline my-pipeline
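
For illustration, here is one way such variables could be read and parsed in Rust. The variable names match the list above; the helper itself is hypothetical.

use std::env;

// Read a numeric PIPELINE_* override, falling back to a default when the
// variable is unset or unparseable. (Illustrative sketch.)
fn env_override(name: &str, default: usize) -> usize {
    env::var(name)
        .ok()
        .and_then(|value| value.parse().ok())
        .unwrap_or(default)
}

fn main() {
    let cpu_threads = env_override("PIPELINE_CPU_THREADS", 8);
    let io_threads = env_override("PIPELINE_IO_THREADS", 24);
    println!("cpu_threads={cpu_threads}, io_threads={io_threads}");
}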

Configuration Priority

When the same setting is configured in multiple places, the following priority applies (highest to lowest):

  1. Command-line arguments - Explicit flags like --cpu-threads
  2. Environment variables - PIPELINE_* variables
  3. Configuration file - Settings from --config file
  4. Default values - Built-in intelligent defaults

Example:

# Config file says cpu_threads = 8
# Environment says PIPELINE_CPU_THREADS=12
# Command line says --cpu-threads=16

# Result: Uses 16 (command-line wins)
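
The precedence rule fits in a few lines of Rust. This is a sketch of the rule itself, not the CLI's actual resolution code.

// Command line beats environment, which beats the configuration file, which
// beats the built-in default.
fn resolve_cpu_threads(
    cli: Option<usize>,         // --cpu-threads
    env: Option<usize>,         // PIPELINE_CPU_THREADS
    config_file: Option<usize>, // cpu_threads from --config
    default: usize,             // built-in default
) -> usize {
    cli.or(env).or(config_file).unwrap_or(default)
}

fn main() {
    // Config file says 8, environment says 12, command line says 16.
    assert_eq!(resolve_cpu_threads(Some(16), Some(12), Some(8), 4), 16);
}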

Performance Tuning Guidelines

For Maximum Speed

  • Use LZ4 compression
  • Use ChaCha20-Poly1305 encryption
  • Increase CPU threads to match cores
  • Use large chunks (32-64 MB)
  • Higher channel depth (8-16)

pipeline process \
  --input file.txt \
  --output file.bin \
  --pipeline speed-pipeline \
  --cpu-threads 16 \
  --chunk-size-mb 64 \
  --channel-depth 16

For Maximum Compression

  • Use Brotli compression
  • Smaller chunks for better compression ratio
  • More CPU threads for parallel compression

pipeline process \
  --input file.txt \
  --output file.bin \
  --pipeline compression-pipeline \
  --cpu-threads 16 \
  --chunk-size-mb 4

For Resource-Constrained Systems

  • Reduce CPU and I/O threads
  • Smaller chunks
  • Lower channel depth

pipeline process \
  --input file.txt \
  --output file.bin \
  --pipeline minimal-pipeline \
  --cpu-threads 2 \
  --io-threads 4 \
  --chunk-size-mb 2 \
  --channel-depth 2

Next Steps

Now that you understand configuration, you're ready to:

Running Your First Pipeline

Version: 1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner, Claude Code Status: Active

Prerequisites

Before running your first pipeline, ensure you have:

  • Pipeline binary - Built and available in your PATH

    cargo build --release
    cp target/release/pipeline /usr/local/bin/  # or add to PATH
    
  • Test file - A sample file to process

    echo "Hello, Pipeline World!" > test.txt
    
  • Permissions - Read/write access to input and output directories

Quick Start (5 minutes)

Let's run a simple compression and encryption pipeline in 3 steps:

Step 1: Create a Pipeline

pipeline create \
  --name my-first-pipeline \
  --stages compression,encryption

You should see output like:

✓ Created pipeline: my-first-pipeline
  Stages: compression (zstd), encryption (aes-256-gcm)

Step 2: Process a File

pipeline process \
  --input test.txt \
  --output test.bin \
  --pipeline my-first-pipeline

You should see progress output:

Processing: test.txt
Pipeline: my-first-pipeline
  Stage 1/2: Compression (zstd)... ✓
  Stage 2/2: Encryption (aes-256-gcm)... ✓
Output: test.bin (24 bytes)
Time: 0.05s

Step 3: Restore the File

pipeline restore \
  --input test.bin \
  --output restored.txt

Verify the restoration:

diff test.txt restored.txt
# No output = files are identical ✓

Detailed Walkthrough

Let's explore each step in more detail.

Creating Pipelines

Basic Pipeline

pipeline create \
  --name basic \
  --stages compression

This creates a simple compression-only pipeline using default settings (zstd compression).

Secure Pipeline

pipeline create \
  --name secure \
  --stages compression,encryption,integrity

This creates a complete security pipeline with:

  • Compression (reduces size)
  • Encryption (protects data)
  • Integrity verification (detects tampering)

Save Pipeline Configuration

pipeline create \
  --name archival \
  --stages compression,encryption \
  --output archival-pipeline.toml

This saves the pipeline configuration to a file for reuse.

Processing Files

Basic Processing

# Process a file
pipeline process \
  --input large-file.log \
  --output large-file.bin \
  --pipeline secure

With Performance Options

# Process with custom settings
pipeline process \
  --input large-file.log \
  --output large-file.bin \
  --pipeline secure \
  --cpu-threads 8 \
  --chunk-size-mb 32

With Verbose Logging

# See detailed progress
pipeline --verbose process \
  --input large-file.log \
  --output large-file.bin \
  --pipeline secure

Restoring Files

The pipeline automatically detects the processing stages from the output file's metadata:

# Restore automatically reverses all stages
pipeline restore \
  --input large-file.bin \
  --output restored-file.log

The system will:

  1. Read metadata from the file header
  2. Apply stages in reverse order
  3. Verify integrity if available
  4. Restore original file
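
Conceptually, restoration walks the recorded stages backwards. The sketch below shows only that control flow; the stage variants mirror this guide, and the transform bodies are placeholders rather than the real implementations.

enum StageType {
    Compression,
    Encryption,
    Checksum,
}

// Apply the recorded stages in reverse order to recover the original bytes.
fn restore(recorded_stages: &[StageType], mut data: Vec<u8>) -> Vec<u8> {
    for stage in recorded_stages.iter().rev() {
        data = match stage {
            StageType::Checksum => data,    // verify integrity (no transformation)
            StageType::Encryption => data,  // decrypt here
            StageType::Compression => data, // decompress here
        };
    }
    data
}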

Managing Pipelines

List All Pipelines

pipeline list

Output:

Available Pipelines:
  - my-first-pipeline (compression, encryption)
  - secure (compression, encryption, integrity)
  - archival (compression, encryption)

Show Pipeline Details

pipeline show secure

Output:

Pipeline: secure
  Stage 1: Compression (zstd)
  Stage 2: Encryption (aes-256-gcm)
  Stage 3: Integrity (sha256)
Created: 2025-01-04 10:30:00

Delete a Pipeline

pipeline delete my-first-pipeline --force

Understanding Output

Successful Processing

When processing completes successfully:

Processing: test.txt
Pipeline: my-first-pipeline
  Stage 1/2: Compression (zstd)... ✓
  Stage 2/2: Encryption (aes-256-gcm)... ✓

Statistics:
  Input size:  1,024 KB
  Output size: 512 KB
  Compression ratio: 50%
  Processing time: 0.15s
  Throughput: 6.8 MB/s

Output: test.bin

Performance Metrics

With --verbose flag, you'll see detailed metrics:

Pipeline Execution Metrics:
  Chunks processed: 64
  Parallel workers: 8
  Average chunk time: 2.3ms
  CPU utilization: 87%
  I/O wait: 3%

Stage Breakdown:
  Compression: 0.08s (53%)
  Encryption: 0.05s (33%)
  I/O: 0.02s (14%)

Error Messages

File Not Found

Error: Input file not found: test.txt
  Check the file path and try again

Permission Denied

Error: Permission denied: /protected/output.bin
  Ensure you have write access to the output directory

Invalid Pipeline

Error: Pipeline not found: nonexistent
  Use 'pipeline list' to see available pipelines

Common Scenarios

Scenario 1: Compress Large Log Files

# Create compression pipeline
pipeline create --name logs --stages compression

# Process log files
pipeline process \
  --input app.log \
  --output app.log.bin \
  --pipeline logs \
  --chunk-size-mb 64

# Compression ratio is typically 70-90% for text logs

Scenario 2: Secure Sensitive Files

# Create secure pipeline with all protections
pipeline create --name sensitive --stages compression,encryption,integrity

# Process sensitive file
pipeline process \
  --input customer-data.csv \
  --output customer-data.bin \
  --pipeline sensitive

# File is now compressed, encrypted, and tamper-evident

Scenario 3: High-Performance Batch Processing

# Process multiple files with optimized settings
for file in data/*.csv; do
  pipeline process \
    --input "$file" \
    --output "processed/$(basename $file).bin" \
    --pipeline fast \
    --cpu-threads 16 \
    --chunk-size-mb 128 \
    --channel-depth 16
done

Scenario 4: Restore and Verify

# Restore file
pipeline restore \
  --input customer-data.bin \
  --output customer-data-restored.csv

# Verify restoration
sha256sum customer-data.csv customer-data-restored.csv
# Both checksums should match

Testing Your Pipeline

Create Test Data

# Create a test file
dd if=/dev/urandom of=test-10mb.bin bs=1M count=10

# Calculate original checksum
sha256sum test-10mb.bin > original.sha256

Process and Restore

# Process the file
pipeline process \
  --input test-10mb.bin \
  --output test-10mb.processed \
  --pipeline my-first-pipeline

# Restore the file
pipeline restore \
  --input test-10mb.processed \
  --output test-10mb.restored

Verify Integrity

# Verify restored file matches original
sha256sum -c original.sha256
# Should output: test-10mb.bin: OK

Next Steps

Congratulations! You've run your first pipeline. Now you can:

Architecture Overview

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

High-level architectural overview of the pipeline system.

Design Philosophy

The Adaptive Pipeline is built on three foundational architectural patterns:

  1. Clean Architecture - Organizing code by dependency direction
  2. Domain-Driven Design (DDD) - Modeling the business domain
  3. Hexagonal Architecture - Isolating business logic from infrastructure

These patterns work together to create a maintainable, testable, and flexible system.

Layered Architecture

The pipeline follows a strict layered architecture where dependencies flow inward:

Layered Architecture

Layer Overview

Presentation Layer (Outermost)

  • CLI interface for user interaction
  • Configuration management
  • Request/response handling

Application Layer

  • Use cases and application services
  • Pipeline orchestration
  • File processing coordination

Domain Layer (Core)

  • Business logic and rules
  • Entities (Pipeline, PipelineStage)
  • Value objects (FilePath, FileSize, Algorithm)
  • Domain services

Infrastructure Layer (Outermost)

  • Database implementations (SQLite)
  • File system operations
  • External system adapters
  • Metrics collection

Clean Architecture

Clean Architecture ensures that business logic doesn't depend on implementation details:

Dependency Flow

Key Principles

Dependency Rule: Source code dependencies point only inward, toward higher-level policies.

  • High-level policy (Application layer) defines what the system does
  • Abstractions (Traits) define how components interact
  • Low-level details (Infrastructure) implements the abstractions

This means:

  • Domain layer has zero external dependencies
  • Application layer depends only on domain traits
  • Infrastructure implements domain interfaces

Benefits

✅ Testability: Business logic can be tested without database or file system
✅ Flexibility: Swap implementations (SQLite → PostgreSQL) without changing business logic
✅ Independence: Domain logic doesn't know about HTTP, databases, or file formats

Hexagonal Architecture (Ports and Adapters)

The pipeline uses Hexagonal Architecture to isolate the core business logic:

Hexagonal Architecture

Core Components

Application Core

  • Domain model (entities, value objects)
  • Business logic (pipeline orchestration)
  • Ports (trait definitions)

Primary Adapters (Driving)

  • CLI adapter - drives the application
  • HTTP adapter - future API endpoints

Secondary Adapters (Driven)

  • SQLite repository adapter - driven by the application
  • File system adapter - driven by the application
  • Prometheus metrics adapter - driven by the application

How It Works

  1. User interacts with Primary Adapter (CLI)
  2. Primary Adapter calls Application Core through defined ports
  3. Application Core uses Ports (traits) to interact with infrastructure
  4. Secondary Adapters implement these ports
  5. Adapters connect to external systems (database, files)

Example Flow:

CLI → Pipeline Service → Repository Port → SQLite Adapter → Database

The application core never knows it's using SQLite - it only knows the Repository trait.
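
For illustration, the repository port can be pictured as a trait defined in the domain layer. The method signatures below match the SQLite adapter shown later in this guide; the Send + Sync bounds are an assumption so the trait object can be shared (for example behind Arc<dyn PipelineRepository>).

use async_trait::async_trait; // assumed, matching the adapters' use of #[async_trait]

// Port: defined by the domain, consumed by the application core, and
// implemented by secondary adapters such as the SQLite repository.
#[async_trait]
pub trait PipelineRepository: Send + Sync {
    async fn find_by_id(&self, id: &PipelineId) -> Result<Option<Pipeline>, PipelineError>;
    async fn create(&self, pipeline: &Pipeline) -> Result<(), PipelineError>;
    // ... further operations (update, delete, list) omitted in this sketch
}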

Architecture Integration

These three patterns work together:

Clean Architecture:    Layers with dependency direction
Domain-Driven Design:  Business modeling within layers
Hexagonal Architecture: Ports/Adapters at layer boundaries

In Practice:

  • Domain layer contains pure business logic (DDD entities)
  • Application layer orchestrates use cases (Clean Architecture)
  • Infrastructure implements ports (Hexagonal Architecture)

This combination provides:

  • Clear separation of concerns
  • Testable business logic
  • Flexible infrastructure
  • Maintainable codebase

Next Steps

Continue to:

Layered Architecture

Version: 1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner, Claude Code Status: Active

Overview

The pipeline system is organized into four distinct layers, each with specific responsibilities and clear boundaries. This layered architecture provides separation of concerns, testability, and maintainability.

Layered Architecture

The Four Layers

┌─────────────────────────────────────────┐
│         Presentation Layer              │  ← User interface (CLI)
│  - CLI commands                         │
│  - User interaction                     │
└─────────────────┬───────────────────────┘
                  │ depends on
┌─────────────────▼───────────────────────┐
│         Application Layer               │  ← Use cases, orchestration
│  - Use cases                            │
│  - Application services                 │
│  - Commands/Queries                     │
└─────────────────┬───────────────────────┘
                  │ depends on
┌─────────────────▼───────────────────────┐
│           Domain Layer                  │  ← Core business logic
│  - Entities                             │
│  - Value objects                        │
│  - Domain services (interfaces)         │
│  - Business rules                       │
└─────────────────△───────────────────────┘
                  │ implements interfaces
┌─────────────────┴───────────────────────┐
│        Infrastructure Layer             │  ← External dependencies
│  - Database repositories                │
│  - File I/O                             │
│  - External services                    │
│  - Encryption/Compression               │
└─────────────────────────────────────────┘

Dependency Rule

The dependency rule is the most important principle in layered architecture:

Dependencies flow inward toward the domain layer.

  • Presentation depends on Application
  • Application depends on Domain
  • Infrastructure depends on Domain (via interfaces)
  • Domain depends on nothing (pure business logic)

This means:

  • ✅ Application can use Domain types
  • ✅ Infrastructure implements Domain interfaces
  • ❌ Domain cannot use Application types
  • ❌ Domain cannot use Infrastructure types

Domain Layer

Purpose

The domain layer contains the core business logic and is the heart of the application. It's completely independent of external concerns like databases, user interfaces, or frameworks.

Responsibilities

  • Define business entities and value objects
  • Enforce business rules and invariants
  • Provide domain service interfaces
  • Emit domain events
  • Define repository interfaces

Structure

pipeline-domain/
├── entities/
│   ├── pipeline.rs           # Pipeline entity
│   ├── pipeline_stage.rs     # Stage entity
│   ├── processing_context.rs # Processing state
│   └── security_context.rs   # Security management
├── value_objects/
│   ├── algorithm.rs          # Algorithm value object
│   ├── chunk_size.rs         # Chunk size validation
│   ├── file_path.rs          # Type-safe paths
│   └── pipeline_id.rs        # Type-safe IDs
├── services/
│   ├── compression_service.rs    # Compression interface
│   ├── encryption_service.rs     # Encryption interface
│   └── checksum_service.rs       # Checksum interface
├── repositories/
│   └── pipeline_repository.rs    # Repository interface
├── events/
│   └── domain_events.rs      # Business events
└── error/
    └── pipeline_error.rs     # Domain errors

Example

#![allow(unused)]
fn main() {
// Domain layer - pure business logic
pub struct Pipeline {
    id: PipelineId,
    name: String,
    stages: Vec<PipelineStage>,
    // ... no database or UI dependencies
}

impl Pipeline {
    pub fn new(name: String, stages: Vec<PipelineStage>) -> Result<Self, PipelineError> {
        // Business rule: must have at least one stage
        if stages.is_empty() {
            return Err(PipelineError::InvalidConfiguration(
                "Pipeline must have at least one stage".to_string()
            ));
        }

        // Create pipeline with validated business rules
        Ok(Self {
            id: PipelineId::new(),
            name,
            stages,
            // ...
        })
    }
}
}

Key Characteristics

  • No external dependencies - Only standard library and domain types
  • Highly testable - Can test without databases or files
  • Portable - Can be used in any context (web, CLI, embedded)
  • Stable - Rarely changes except for business requirement changes

Application Layer

Purpose

The application layer orchestrates the execution of business use cases. It coordinates domain objects and delegates to domain services to accomplish specific tasks.

Responsibilities

  • Implement use cases (user actions)
  • Coordinate domain objects
  • Manage transactions
  • Handle application-specific workflows
  • Emit application events

Structure

pipeline/src/application/
├── use_cases/
│   ├── process_file.rs       # File processing use case
│   ├── restore_file.rs       # File restoration use case
│   └── create_pipeline.rs    # Pipeline creation
├── services/
│   ├── pipeline_service.rs   # Pipeline orchestration
│   ├── file_processor_service.rs  # File processing
│   └── transactional_chunk_writer.rs  # Chunk writing
├── commands/
│   └── commands.rs           # CQRS commands
└── utilities/
    └── generic_service_base.rs  # Service helpers

Example

#![allow(unused)]
fn main() {
// Application layer - orchestrates domain objects
pub struct FileProcessorService {
    pipeline_repo: Arc<dyn PipelineRepository>,
    compression: Arc<dyn CompressionService>,
    encryption: Arc<dyn EncryptionService>,
}

impl FileProcessorService {
    pub async fn process_file(
        &self,
        pipeline_id: &PipelineId,
        input_path: &FilePath,
        output_path: &FilePath,
    ) -> Result<ProcessingMetrics, PipelineError> {
        // 1. Fetch pipeline from repository
        let pipeline = self.pipeline_repo
            .find_by_id(pipeline_id)
            .await?
            .ok_or(PipelineError::NotFound)?;

        // 2. Create processing context
        let context = ProcessingContext::new(
            pipeline.id().clone(),
            input_path.clone(),
            output_path.clone(),
        );

        // 3. Process each stage
        for stage in pipeline.stages() {
            match stage.stage_type() {
                StageType::Compression => {
                    self.compression.compress(/* ... */).await?;
                }
                StageType::Encryption => {
                    self.encryption.encrypt(/* ... */).await?;
                }
                // ... more stages
            }
        }

        // 4. Return metrics
        Ok(context.metrics().clone())
    }
}
}

Key Characteristics

  • Thin layer - Delegates to domain for business logic
  • Workflow coordination - Orchestrates multiple domain operations
  • Transaction management - Ensures atomic operations
  • No business logic - Business rules belong in domain layer

Infrastructure Layer

Purpose

The infrastructure layer provides concrete implementations of interfaces defined in the domain layer. It handles all external concerns like databases, file systems, and third-party services.

Responsibilities

  • Implement repository interfaces
  • Provide database access
  • Handle file I/O operations
  • Implement compression/encryption services
  • Integrate with external systems
  • Provide logging and metrics

Structure

pipeline/src/infrastructure/
├── repositories/
│   ├── sqlite_pipeline_repository.rs  # SQLite implementation
│   └── stage_executor.rs              # Stage execution
├── adapters/
│   ├── compression_service_adapter.rs # Compression implementation
│   ├── encryption_service_adapter.rs  # Encryption implementation
│   └── repositories/
│       └── sqlite_repository_adapter.rs  # Repository adapter
├── services/
│   └── binary_format_service.rs       # File format handling
├── metrics/
│   ├── metrics_service.rs             # Prometheus metrics
│   └── metrics_observer.rs            # Metrics collection
├── logging/
│   └── observability_service.rs       # Logging setup
└── config/
    └── config_service.rs              # Configuration management

Example

#![allow(unused)]
fn main() {
// Infrastructure layer - implements domain interfaces
pub struct SQLitePipelineRepository {
    pool: SqlitePool,
}

#[async_trait]
impl PipelineRepository for SQLitePipelineRepository {
    async fn find_by_id(&self, id: &PipelineId) -> Result<Option<Pipeline>, PipelineError> {
        // Database-specific code
        let row = sqlx::query_as::<_, PipelineRow>(
            "SELECT * FROM pipelines WHERE id = ?"
        )
        .bind(id.to_string())
        .fetch_optional(&self.pool)
        .await
        .map_err(|e| PipelineError::RepositoryError(e.to_string()))?;

        // Map database row to domain entity
        row.map(|r| self.to_domain_entity(r)).transpose()
    }

    async fn create(&self, pipeline: &Pipeline) -> Result<(), PipelineError> {
        // Convert domain entity to database row
        let row = self.to_persistence_model(pipeline);

        // Insert into database
        sqlx::query(
            "INSERT INTO pipelines (id, name, ...) VALUES (?, ?, ...)"
        )
        .bind(&row.id)
        .bind(&row.name)
        .execute(&self.pool)
        .await
        .map_err(|e| PipelineError::RepositoryError(e.to_string()))?;

        Ok(())
    }
}
}

Key Characteristics

  • Implements domain interfaces - Provides concrete implementations
  • Database access - Handles all persistence operations
  • External integrations - Communicates with external systems
  • Technology-specific - Uses specific libraries and frameworks
  • Replaceable - Can swap implementations without changing domain

Presentation Layer

Purpose

The presentation layer handles user interaction and input/output. It translates user commands into application use cases and presents results back to the user.

Responsibilities

  • Parse and validate user input
  • Execute application use cases
  • Format and display output
  • Handle user interaction
  • Map errors to user-friendly messages

Structure

pipeline/src/presentation/
├── presentation.rs           # Presentation module (Rust 2018+ pattern)
├── adapters.rs               # Adapter declarations
└── (CLI logic is in the bootstrap crate)

Example

// Presentation layer - CLI interaction
#[tokio::main]
async fn main() -> std::process::ExitCode {
    // 1. Parse CLI arguments
    let cli = bootstrap::bootstrap_cli()
        .unwrap_or_else(|e| {
            eprintln!("Error: {}", e);
            std::process::exit(65);
        });

    // 2. Set up dependencies (infrastructure)
    let db_pool = create_database_pool()
        .await
        .expect("failed to create database pool"); // `?` cannot be used here: main returns ExitCode
    let pipeline_repo = Arc::new(SQLitePipelineRepository::new(db_pool));
    let compression = Arc::new(CompressionServiceAdapter::new());
    let file_processor = FileProcessorService::new(pipeline_repo, compression);

    // 3. Execute use case based on command
    let result = match cli.command {
        Commands::Process { input, output, pipeline } => {
            // Call application service
            file_processor.process_file(&pipeline, &input, &output).await
        }
        Commands::Create { name, stages } => {
            // Call application service
            create_pipeline_service.create(&name, stages).await
        }
        // ... more commands
    };

    // 4. Handle result and display to user
    match result {
        Ok(_) => {
            println!("✓ Processing completed successfully");
            ExitCode::SUCCESS
        }
        Err(e) => {
            eprintln!("✗ Error: {}", e);
            ExitCode::FAILURE
        }
    }
}

Key Characteristics

  • Thin layer - Minimal logic, delegates to application
  • User-facing - Handles all user interaction
  • Input validation - Validates user input before processing
  • Error formatting - Converts technical errors to user-friendly messages

Layer Interactions

Example: Processing a File

Here's how the layers work together to process a file:

1. Presentation Layer (CLI)
   ↓ User runs: pipeline process --input file.txt --output file.bin --pipeline my-pipeline
   ├─ Parse command-line arguments
   ├─ Validate input parameters
   └─ Call Application Service

2. Application Layer (FileProcessorService)
   ↓ process_file(pipeline_id, input_path, output_path)
   ├─ Fetch Pipeline from Repository (Infrastructure)
   ├─ Create ProcessingContext (Domain)
   ├─ For each stage:
   │  ├─ Call CompressionService (Infrastructure)
   │  ├─ Call EncryptionService (Infrastructure)
   │  └─ Update metrics (Domain)
   └─ Return ProcessingMetrics (Domain)

3. Domain Layer (Pipeline, ProcessingContext)
   ↓ Enforce business rules
   ├─ Validate stage compatibility
   ├─ Enforce chunk sequencing
   └─ Calculate metrics

4. Infrastructure Layer (Repositories, Services)
   ↓ Handle external operations
   ├─ Query SQLite database
   ├─ Read/write files
   ├─ Compress data (brotli, zstd, etc.)
   └─ Encrypt data (AES, ChaCha20)

Benefits of Layered Architecture

Separation of Concerns

Each layer has a single, well-defined responsibility. This makes the code easier to understand and maintain.

Testability

  • Domain Layer: Test business logic without any infrastructure
  • Application Layer: Test workflows with mock repositories
  • Infrastructure Layer: Test database operations independently
  • Presentation Layer: Test user interaction separately

Flexibility

You can change infrastructure (e.g., swap SQLite for PostgreSQL) without touching domain or application layers.

Maintainability

Changes in one layer typically don't affect other layers, reducing the risk of breaking existing functionality.

Parallel Development

Teams can work on different layers simultaneously without conflicts.

Common Pitfalls

❌ Breaking the Dependency Rule

#![allow(unused)]
fn main() {
// WRONG: Domain depending on infrastructure
pub struct Pipeline {
    id: PipelineId,
    db_connection: SqlitePool,  // ❌ Database dependency in domain!
}
}

#![allow(unused)]
fn main() {
// CORRECT: Domain independent of infrastructure
pub struct Pipeline {
    id: PipelineId,
    name: String,
    stages: Vec<PipelineStage>,  // ✅ Pure domain types
}
}

❌ Business Logic in Application Layer

#![allow(unused)]
fn main() {
// WRONG: Business logic in application
impl FileProcessorService {
    pub async fn process_file(&self, pipeline: &Pipeline) -> Result<(), Error> {
        // ❌ Business rule in application layer!
        if pipeline.stages().is_empty() {
            return Err(Error::InvalidPipeline);
        }
        // ...
    }
}
}

#![allow(unused)]
fn main() {
// CORRECT: Business logic in domain
impl Pipeline {
    pub fn new(name: String, stages: Vec<PipelineStage>) -> Result<Self, PipelineError> {
        // ✅ Business rule in domain layer
        if stages.is_empty() {
            return Err(PipelineError::InvalidConfiguration(
                "Pipeline must have at least one stage".to_string()
            ));
        }
        Ok(Self { /* ... */ })
    }
}
}

❌ Direct Infrastructure Access from Presentation

// WRONG: Presentation accessing infrastructure directly
async fn main() {
    let db_pool = create_database_pool().await?;
    // ❌ CLI directly using repository!
    let pipeline = db_pool.query("SELECT * FROM pipelines").await?;
}

// CORRECT: Presentation using application services
async fn main() {
    let file_processor = create_file_processor().await?;
    // ✅ CLI using application service
    let result = file_processor.process_file(&pipeline_id, &input, &output).await?;
}

Next Steps

Now that you understand the layered architecture:

Dependency Flow

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

Understanding dependency direction and inversion.

Dependency Rule

TODO: Explain dependency direction

Dependency Inversion

TODO: Explain DIP application

Trait Abstractions

TODO: Explain trait usage

Domain Model

Version: 1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner, Claude Code Status: Active

Overview

The domain model is the heart of the pipeline system. It captures the core business concepts, rules, and behaviors using Domain-Driven Design (DDD) principles. This chapter explains how the domain model is structured and why it's designed this way.

Domain Model

Domain-Driven Design Principles

Domain-Driven Design (DDD) is a software development approach that emphasizes:

  1. Focus on the core domain - The business logic is the most important part
  2. Model-driven design - The domain model drives the software design
  3. Ubiquitous language - Shared vocabulary between developers and domain experts
  4. Bounded contexts - Clear boundaries between different parts of the system

Why DDD?

For a pipeline processing system, DDD provides:

  • Clear separation between business logic and infrastructure
  • Testable code - Domain logic can be tested without databases or files
  • Flexibility - Easy to change infrastructure without touching business rules
  • Maintainability - Business rules are explicit and well-organized

Core Domain Concepts

Entities

Entities are objects with a unique identity that persists through time. Two entities are equal if they have the same ID, even if all their other attributes differ.

Pipeline Entity

The central entity representing a file processing workflow.

#![allow(unused)]
fn main() {
pub struct Pipeline {
    id: PipelineId,                    // Unique identity
    name: String,                      // Human-readable name
    stages: Vec<PipelineStage>,        // Ordered processing stages
    configuration: HashMap<String, String>,  // Custom settings
    metrics: ProcessingMetrics,        // Performance data
    archived: bool,                    // Lifecycle state
    created_at: DateTime<Utc>,         // Creation timestamp
    updated_at: DateTime<Utc>,         // Last modification
}
}

Key characteristics:

  • Has unique PipelineId
  • Can be modified while maintaining identity
  • Enforces business rules (e.g., must have at least one stage)
  • Automatically adds integrity verification stages

Example:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::Pipeline;

// Two pipelines with same ID are equal, even if names differ
let pipeline1 = Pipeline::new("Original Name".to_string(), stages.clone())?;
let mut pipeline2 = pipeline1.clone();
pipeline2.set_name("Different Name");

assert_eq!(pipeline1.id(), pipeline2.id());  // Same identity
}

PipelineStage Entity

Represents a single processing operation within a pipeline.

#![allow(unused)]
fn main() {
pub struct PipelineStage {
    id: StageId,                       // Unique identity
    name: String,                      // Stage name
    stage_type: StageType,             // Compression, Encryption, etc.
    configuration: StageConfiguration, // Algorithm and parameters
    order: usize,                      // Execution order
}
}

Stage Types:

  • Compression - Data compression
  • Encryption - Data encryption
  • Checksum - Integrity verification
  • Transform - Data inspection and forking (Tee, Debug)
  • PassThrough - No modification (testing and debugging)

ProcessingContext Entity

Manages the runtime execution state of a pipeline.

#![allow(unused)]
fn main() {
pub struct ProcessingContext {
    id: ProcessingContextId,           // Unique identity
    pipeline_id: PipelineId,           // Associated pipeline
    input_path: FilePath,              // Input file
    output_path: FilePath,             // Output file
    current_stage: usize,              // Current stage index
    status: ProcessingStatus,          // Running, Completed, Failed
    metrics: ProcessingMetrics,        // Runtime metrics
}
}

SecurityContext Entity

Manages security and permissions for pipeline operations.

#![allow(unused)]
fn main() {
pub struct SecurityContext {
    id: SecurityContextId,             // Unique identity
    user_id: UserId,                   // User performing operation
    security_level: SecurityLevel,     // Required security level
    permissions: Vec<Permission>,      // Granted permissions
    encryption_key_id: Option<EncryptionKeyId>,  // Key for encryption
}
}

Value Objects

Value Objects are immutable objects defined by their attributes. Two value objects with the same attributes are considered equal.
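
In Rust this typically means deriving structural equality on the wrapped data. A minimal sketch (not the project's exact code):

#![allow(unused)]
fn main() {
// Sketch only: value-object equality is structural, so deriving
// PartialEq/Eq on the wrapped attribute is sufficient.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Algorithm(String);

let a = Algorithm("zstd".to_string());
let b = Algorithm("zstd".to_string());
assert_eq!(a, b); // equal because their attributes are equal
}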

Algorithm Value Object

Type-safe representation of processing algorithms.

#![allow(unused)]
fn main() {
pub struct Algorithm(String);

impl Algorithm {
    // Predefined compression algorithms
    pub fn brotli() -> Self { /* ... */ }
    pub fn gzip() -> Self { /* ... */ }
    pub fn zstd() -> Self { /* ... */ }
    pub fn lz4() -> Self { /* ... */ }

    // Predefined encryption algorithms
    pub fn aes_256_gcm() -> Self { /* ... */ }
    pub fn chacha20_poly1305() -> Self { /* ... */ }

    // Predefined hashing algorithms
    pub fn sha256() -> Self { /* ... */ }
    pub fn blake3() -> Self { /* ... */ }
}
}

Key characteristics:

  • Immutable after creation
  • Self-validating (enforces format rules)
  • Category detection (is_compression(), is_encryption())
  • Type-safe (can't accidentally use wrong algorithm)
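
For example, the category helpers make intent explicit at call sites. A small usage sketch (the import path is assumed, not taken from the source):

#![allow(unused)]
fn main() {
// Import path assumed for illustration; adjust to the actual module layout.
use adaptive_pipeline_domain::value_objects::Algorithm;

let algo = Algorithm::zstd();
assert!(algo.is_compression());
assert!(!algo.is_encryption());
}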

ChunkSize Value Object

Represents validated chunk sizes for file processing.

#![allow(unused)]
fn main() {
pub struct ChunkSize(usize);

impl ChunkSize {
    pub fn new(bytes: usize) -> Result<Self, PipelineError> {
        // Validates size is within acceptable range
        if bytes < MIN_CHUNK_SIZE || bytes > MAX_CHUNK_SIZE {
            return Err(PipelineError::InvalidConfiguration(/* ... */));
        }
        Ok(Self(bytes))
    }

    pub fn from_megabytes(mb: usize) -> Result<Self, PipelineError> {
        Self::new(mb * 1024 * 1024)
    }
}
}

FileChunk Value Object

Immutable representation of a piece of file data.

#![allow(unused)]
fn main() {
pub struct FileChunk {
    id: FileChunkId,                   // Unique chunk identifier
    sequence: usize,                   // Position in file
    data: Vec<u8>,                     // Chunk data
    is_final: bool,                    // Last chunk flag
    checksum: Option<String>,          // Integrity verification
}
}

FilePath Value Object

Type-safe, validated file paths.

#![allow(unused)]
fn main() {
pub struct FilePath(PathBuf);

impl FilePath {
    pub fn new(path: impl Into<PathBuf>) -> Result<Self, PipelineError> {
        let path = path.into();
        // Validation:
        // - Path traversal prevention
        // - Null byte checks
        // - Length limits
        // - Encoding validation
        Ok(Self(path))
    }
}
}

PipelineId, StageId, UserId (Type-Safe IDs)

All identifiers are wrapped in newtype value objects for type safety:

#![allow(unused)]
fn main() {
pub struct PipelineId(Ulid);  // Can't accidentally use StageId as PipelineId
pub struct StageId(Ulid);
pub struct UserId(Ulid);
pub struct ProcessingContextId(Ulid);
pub struct SecurityContextId(Ulid);
}

This prevents common bugs like passing the wrong ID to a function.
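
A short sketch of the compile-time guarantee, using simplified stand-in newtypes (the real IDs wrap a Ulid):

#![allow(unused)]
fn main() {
// Stand-in newtypes for illustration only.
struct PipelineId(String);
struct StageId(String);

fn archive_pipeline(id: &PipelineId) { /* business logic elided */ }

let pipeline_id = PipelineId("example".to_string());
let stage_id = StageId("example".to_string());

archive_pipeline(&pipeline_id);  // ✅ compiles
// archive_pipeline(&stage_id);  // ❌ compile error: expected `PipelineId`, found `StageId`
}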

Domain Services

Domain Services contain business logic that doesn't naturally fit in an entity or value object. They are stateless and operate on domain objects.

Domain services in our system are defined as traits (interfaces) in the domain layer and implemented in the infrastructure layer.

CompressionService

#![allow(unused)]
fn main() {
#[async_trait]
pub trait CompressionService: Send + Sync {
    async fn compress(
        &self,
        data: &[u8],
        algorithm: &Algorithm,
    ) -> Result<Vec<u8>, PipelineError>;

    async fn decompress(
        &self,
        data: &[u8],
        algorithm: &Algorithm,
    ) -> Result<Vec<u8>, PipelineError>;
}
}

EncryptionService

#![allow(unused)]
fn main() {
#[async_trait]
pub trait EncryptionService: Send + Sync {
    async fn encrypt(
        &self,
        data: &[u8],
        algorithm: &Algorithm,
        key: &EncryptionKey,
    ) -> Result<Vec<u8>, PipelineError>;

    async fn decrypt(
        &self,
        data: &[u8],
        algorithm: &Algorithm,
        key: &EncryptionKey,
    ) -> Result<Vec<u8>, PipelineError>;
}
}

ChecksumService

#![allow(unused)]
fn main() {
pub trait ChecksumService: Send + Sync {
    fn calculate(
        &self,
        data: &[u8],
        algorithm: &Algorithm,
    ) -> Result<String, PipelineError>;

    fn verify(
        &self,
        data: &[u8],
        expected: &str,
        algorithm: &Algorithm,
    ) -> Result<bool, PipelineError>;
}
}

Repositories

Repositories abstract data persistence, allowing the domain to work with collections without knowing about storage details.

#![allow(unused)]
fn main() {
#[async_trait]
pub trait PipelineRepository: Send + Sync {
    async fn create(&self, pipeline: &Pipeline) -> Result<(), PipelineError>;
    async fn find_by_id(&self, id: &PipelineId) -> Result<Option<Pipeline>, PipelineError>;
    async fn find_by_name(&self, name: &str) -> Result<Option<Pipeline>, PipelineError>;
    async fn update(&self, pipeline: &Pipeline) -> Result<(), PipelineError>;
    async fn delete(&self, id: &PipelineId) -> Result<(), PipelineError>;
    async fn list_all(&self) -> Result<Vec<Pipeline>, PipelineError>;
}
}

The repository interface is defined in the domain layer, but implementations live in the infrastructure layer. This follows the Dependency Inversion Principle.

Domain Events

Domain Events represent significant business occurrences that other parts of the system might care about.

#![allow(unused)]
fn main() {
pub enum DomainEvent {
    PipelineCreated {
        pipeline_id: PipelineId,
        name: String,
        created_at: DateTime<Utc>,
    },
    ProcessingStarted {
        pipeline_id: PipelineId,
        context_id: ProcessingContextId,
        input_path: FilePath,
    },
    ProcessingCompleted {
        pipeline_id: PipelineId,
        context_id: ProcessingContextId,
        metrics: ProcessingMetrics,
    },
    ProcessingFailed {
        pipeline_id: PipelineId,
        context_id: ProcessingContextId,
        error: String,
    },
}
}

Events enable:

  • Loose coupling - Components don't need direct references
  • Audit trails - Track all significant operations
  • Integration - External systems can react to events
  • Event sourcing - Reconstruct state from event history
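
For illustration, a hypothetical subscriber that builds an audit trail from these events might look like the following (it matches only the variants shown above and prints only their String fields):

#![allow(unused)]
fn main() {
// Hypothetical audit subscriber; assumes the DomainEvent enum shown above
// is in scope. Not part of the project's source.
fn audit(event: &DomainEvent) {
    match event {
        DomainEvent::PipelineCreated { name, .. } => {
            println!("audit: pipeline '{}' created", name);
        }
        DomainEvent::ProcessingStarted { .. } => {
            println!("audit: processing started");
        }
        DomainEvent::ProcessingCompleted { .. } => {
            println!("audit: processing completed");
        }
        DomainEvent::ProcessingFailed { error, .. } => {
            println!("audit: processing failed: {}", error);
        }
    }
}
}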

Business Rules and Invariants

The domain model enforces critical business rules:

Pipeline Rules

  1. Pipelines must have at least one user-defined stage

    #![allow(unused)]
    fn main() {
    if user_stages.is_empty() {
        return Err(PipelineError::InvalidConfiguration(
            "Pipeline must have at least one stage".to_string()
        ));
    }
    }
  2. Stage order must be sequential and valid

    #![allow(unused)]
    fn main() {
    // Stages are automatically reordered: 0, 1, 2, 3...
    // Input checksum = 0
    // User stages = 1, 2, 3...
    // Output checksum = final
    }
  3. Pipeline names must be unique (enforced by the repository; see the sketch below)
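
A hedged sketch of how that uniqueness check might look at the application/repository boundary (a hypothetical helper that assumes the PipelineRepository trait shown earlier and a name() accessor on Pipeline):

#![allow(unused)]
fn main() {
// Hypothetical uniqueness check before persisting; not project code.
async fn create_unique(
    repo: &dyn PipelineRepository,
    pipeline: &Pipeline,
) -> Result<(), PipelineError> {
    if repo.find_by_name(pipeline.name()).await?.is_some() {
        return Err(PipelineError::InvalidConfiguration(
            format!("pipeline '{}' already exists", pipeline.name()),
        ));
    }
    repo.create(pipeline).await
}
}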

Chunk Processing Rules

  1. Chunks must have non-zero size

    #![allow(unused)]
    fn main() {
    if size == 0 {
        return Err(PipelineError::InvalidChunkSize);
    }
    }
  2. Chunk sequence numbers must be sequential

    #![allow(unused)]
    fn main() {
    // Chunks are numbered 0, 1, 2, 3...
    // Missing sequences cause processing to fail
    }
  3. Final chunks must be properly marked

    #![allow(unused)]
    fn main() {
    if chunk.is_final() {
        // No more chunks should follow
    }
    }

Security Rules

  1. Security contexts must be validated

    #![allow(unused)]
    fn main() {
    security_context.validate()?;
    }
  2. Encryption keys must meet strength requirements

    #![allow(unused)]
    fn main() {
    if key.len() < MIN_KEY_LENGTH {
        return Err(PipelineError::WeakEncryptionKey);
    }
    }
  3. Access permissions must be checked

    #![allow(unused)]
    fn main() {
    if !security_context.has_permission(Permission::ProcessFile) {
        return Err(PipelineError::PermissionDenied);
    }
    }

Ubiquitous Language

The domain model uses consistent terminology shared between developers and domain experts:

Term           | Meaning
---------------|----------------------------------------------------------
Pipeline       | An ordered sequence of processing stages
Stage          | A single processing operation (compress, encrypt, etc.)
Chunk          | A piece of a file processed in parallel
Algorithm      | A specific processing method (zstd, aes-256-gcm, etc.)
Repository     | Storage abstraction for domain objects
Context        | Runtime execution state
Metrics        | Performance and operational measurements
Integrity      | Data verification through checksums
Security Level | Required protection level (Public, Confidential, Secret)

Testing Domain Logic

Domain objects are designed for easy testing:

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn pipeline_enforces_minimum_stages() {
        // Domain logic can be tested without any infrastructure
        let result = Pipeline::new("test".to_string(), vec![]);
        assert!(result.is_err());
    }

    #[test]
    fn algorithm_validates_format() {
        // Value objects self-validate
        let result = Algorithm::new("INVALID-NAME".to_string());
        assert!(result.is_err());

        let result = Algorithm::new("valid-name".to_string());
        assert!(result.is_ok());
    }

    #[test]
    fn chunk_size_enforces_limits() {
        // Business rules are explicit and testable
        let too_small = ChunkSize::new(1);
        assert!(too_small.is_err());

        let valid = ChunkSize::from_megabytes(10);
        assert!(valid.is_ok());
    }
}
}

Benefits of This Domain Model

  1. Pure Business Logic - No infrastructure dependencies
  2. Highly Testable - Can test without databases, files, or networks
  3. Type Safety - Strong typing prevents many bugs at compile time
  4. Self-Documenting - Code structure reflects business concepts
  5. Flexible - Easy to change infrastructure without touching domain
  6. Maintainable - Business rules are explicit and centralized

Next Steps

Now that you understand the domain model, continue with the chapters on Entities, Value Objects, and Aggregates.

Entities

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

Understanding entities in the domain model.

What are Entities?

TODO: Define entities

Pipeline Entity

TODO: Extract from entities/pipeline.rs

Stage Entity

TODO: Extract from entities/pipeline_stage.rs

Value Objects

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

Understanding value objects in the domain model.

What are Value Objects?

TODO: Define value objects

Immutability

TODO: Explain immutability

Examples

TODO: List key value objects

Aggregates

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

Understanding aggregates and aggregate roots.

What are Aggregates?

TODO: Define aggregates

Aggregate Boundaries

TODO: Explain boundaries

Pipeline Aggregate

TODO: Explain pipeline aggregate

Design Patterns

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

Design patterns used throughout the pipeline.

Pattern Overview

TODO: List patterns

When to Use Each Pattern

TODO: Add guidance

Repository Pattern

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

The Repository pattern for data persistence.

Pattern Overview

The Repository pattern provides an abstraction layer between the domain and data mapping layers. It acts like an in-memory collection of domain objects, hiding the complexities of database operations.

Key Idea: Your business logic shouldn't know whether data comes from SQLite, PostgreSQL, or a file. It just uses a Repository trait.

Architecture

(Diagram: Repository Pattern)

Components

Repository Trait (Domain Layer)

#![allow(unused)]
fn main() {
trait PipelineRepository {
    fn create(&self, pipeline: &Pipeline) -> Result<()>;
    fn find_by_id(&self, id: &PipelineId) -> Result<Option<Pipeline>>;
    fn update(&self, pipeline: &Pipeline) -> Result<()>;
    fn delete(&self, id: &PipelineId) -> Result<()>;
}
}

Repository Adapter (Infrastructure Layer)

#![allow(unused)]
fn main() {
struct PipelineRepositoryAdapter {
    repository: SQLitePipelineRepository,
}

impl PipelineRepository for PipelineRepositoryAdapter {
    // Implements trait methods
}
}

Concrete Repository (Infrastructure Layer)

#![allow(unused)]
fn main() {
struct SQLitePipelineRepository {
    pool: SqlitePool,
    mapper: PipelineMapper,
}
}

Layer Responsibilities

Domain Layer

Defines what operations are needed:

#![allow(unused)]
fn main() {
// Domain defines the interface
#[async_trait]
pub trait PipelineRepository: Send + Sync {
    async fn create(&self, pipeline: &Pipeline) -> Result<()>;
    async fn find_by_id(&self, id: &PipelineId) -> Result<Option<Pipeline>>;
    // ... more methods
}
}

Domain knows:

  • What operations it needs
  • What domain entities look like
  • Business rules and validations

Domain doesn't know:

  • SQL syntax
  • Database technology
  • Connection pooling

Infrastructure Layer

Implements how to persist data:

#![allow(unused)]
fn main() {
#[async_trait]
impl PipelineRepository for PipelineRepositoryAdapter {
    async fn create(&self, pipeline: &Pipeline) -> Result<()> {
        // Convert domain entity to database row
        let row = self.mapper.to_persistence(pipeline);

        // Execute SQL
        sqlx::query("INSERT INTO pipelines ...")
            .execute(&self.pool)
            .await?;

        Ok(())
    }
}
}

Infrastructure knows:

  • SQL syntax and queries
  • Database schema
  • Connection management
  • Error handling

Data Mapping

The Mapper separates domain models from database schema:

#![allow(unused)]
fn main() {
struct PipelineMapper;

impl PipelineMapper {
    // Domain → Database
    fn to_persistence(&self, pipeline: &Pipeline) -> PipelineRow {
        PipelineRow {
            id: pipeline.id().to_string(),
            input_path: pipeline.input_path().to_string(),
            // ... map all fields
        }
    }

    // Database → Domain
    fn to_domain(&self, row: SqliteRow) -> Result<Pipeline> {
        Pipeline::new(
            PipelineId::from_string(&row.id)?,
            FilePath::new(&row.input_path)?,
            FilePath::new(&row.output_path)?,
        )
    }
}
}

Why mapping?

  • Domain entities stay pure (no database annotations)
  • Database schema can change independently
  • Different databases can use different schemas
  • Validation happens in domain layer

Benefits

1. Testability

Business logic can be tested without a database:

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use mockall::mock;

    mock! {
        PipelineRepo {}

        #[async_trait]
        impl PipelineRepository for PipelineRepo {
            async fn create(&self, pipeline: &Pipeline) -> Result<()>;
            // ... mock other methods
        }
    }

    #[tokio::test]
    async fn test_pipeline_service() {
        let mut mock_repo = MockPipelineRepo::new();
        mock_repo.expect_create()
            .returning(|_| Ok(()));

        let service = PipelineService::new(Arc::new(mock_repo));
        // Test business logic without database
    }
}
}

2. Flexibility

Swap implementations without changing business logic:

#![allow(unused)]
fn main() {
// Start with SQLite
let repo = SQLitePipelineRepositoryAdapter::new(pool);
let service = PipelineService::new(Arc::new(repo));

// Later, switch to PostgreSQL
let repo = PostgresPipelineRepositoryAdapter::new(pool);
let service = PipelineService::new(Arc::new(repo));
// Business logic unchanged!
}

3. Centralized Data Access

All database queries in one place:

  • Easier to optimize
  • Easier to audit
  • Easier to cache
  • Easier to add logging (see the sketch below)
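
For example, because every caller goes through the same trait, a cross-cutting concern such as logging can be layered on with a small wrapper. The following is a hypothetical sketch (it assumes the tracing crate and the simplified signatures used in this chapter), not code from the project:

#![allow(unused)]
fn main() {
// Hypothetical logging decorator around any PipelineRepository implementation.
// Callers only see the trait, so the wrapper can be added without touching them.
pub struct LoggingRepository<R: PipelineRepository> {
    inner: R,
}

#[async_trait]
impl<R: PipelineRepository> PipelineRepository for LoggingRepository<R> {
    async fn create(&self, pipeline: &Pipeline) -> Result<()> {
        tracing::info!("repository: create pipeline");
        self.inner.create(pipeline).await
    }

    async fn find_by_id(&self, id: &PipelineId) -> Result<Option<Pipeline>> {
        tracing::debug!("repository: find_by_id");
        self.inner.find_by_id(id).await
    }

    // ... remaining trait methods delegate the same way
}
}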

4. Domain Purity

Domain layer stays technology-agnostic:

#![allow(unused)]
fn main() {
// Domain doesn't import sqlx, postgres, etc.
// Only depends on standard Rust types
pub struct Pipeline {
    id: PipelineId,           // Not i64 or UUID from database
    input_path: FilePath,     // Not String from database
    status: PipelineStatus,   // Not database enum
}
}

Usage Example

Application Layer

#![allow(unused)]
fn main() {
pub struct PipelineService {
    repository: Arc<dyn PipelineRepository>,
}

impl PipelineService {
    pub async fn create_pipeline(
        &self,
        input: FilePath,
        output: FilePath,
    ) -> Result<Pipeline> {
        // Create domain entity
        let pipeline = Pipeline::new(
            PipelineId::new(),
            input,
            output,
        )?;

        // Persist using repository
        self.repository.create(&pipeline).await?;

        Ok(pipeline)
    }

    pub async fn get_pipeline(
        &self,
        id: PipelineId,
    ) -> Result<Option<Pipeline>> {
        self.repository.find_by_id(&id).await
    }
}
}

The service doesn't know or care:

  • Which database is used
  • How data is stored
  • What the SQL looks like

It just uses the Repository trait!

Implementation in Pipeline

Our pipeline uses this pattern for:

PipelineRepository - Stores pipeline metadata

  • pipeline/domain/src/repositories/pipeline_repository.rs (trait)
  • pipeline/src/infrastructure/repositories/sqlite_pipeline_repository.rs (impl)

FileChunkRepository - Stores chunk metadata

  • pipeline/domain/src/repositories/file_chunk_repository.rs (trait)
  • pipeline/src/infrastructure/repositories/sqlite_file_chunk_repository.rs (impl)

Next Steps

Continue to the Service Pattern chapter.

Service Pattern

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

The Service pattern for domain and application logic.

Pattern Overview

TODO: Extract from service files

Domain vs Application Services

TODO: Explain distinction

Implementation

TODO: Show examples

Hexagonal Architecture (Ports and Adapters)

Version: 1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner, Claude Code Status: Active

Overview

Hexagonal Architecture, also known as Ports and Adapters, is a pattern that isolates the core business logic (domain) from external concerns. The pipeline system uses this pattern to keep the domain pure and infrastructure replaceable.

(Diagram: Hexagonal Architecture)

The Hexagon Metaphor

Think of your application as a hexagon:

                     ┌─────────────────┐
                     │   Primary       │
                     │   Adapters      │
                     │  (Drivers)      │
                     └────────┬────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
        │              ┌──────▼──────┐              │
        │              │             │              │
        │              │   Domain    │              │
        │              │    (Core)   │              │
        │              │             │              │
        │              └──────┬──────┘              │
        │                     │                     │
        └─────────────────────┼─────────────────────┘
                              │
                     ┌────────▼────────┐
                     │   Secondary     │
                     │   Adapters      │
                     │  (Driven)       │
                     └─────────────────┘
  • The Hexagon (Core): Your domain logic - completely independent
  • Ports: Interfaces that define how to interact with the core
  • Adapters: Implementations that connect the core to the outside world

Ports: The Interfaces

Ports are interfaces defined by the domain layer. They specify what the domain needs without caring about implementation details.

Primary Ports (Driving)

Primary ports define use cases - what the application can do. External systems drive the application through these ports.

#![allow(unused)]
fn main() {
// Domain layer defines the interface (port)
#[async_trait]
pub trait FileProcessorService: Send + Sync {
    async fn process_file(
        &self,
        pipeline_id: &PipelineId,
        input_path: &FilePath,
        output_path: &FilePath,
    ) -> Result<ProcessingMetrics, PipelineError>;
}
}

Examples in our system:

  • FileProcessorService - File processing operations
  • PipelineService - Pipeline management operations

Secondary Ports (Driven)

Secondary ports define dependencies - what the domain needs from the outside world. The application drives these external systems.

#![allow(unused)]
fn main() {
// Domain layer defines what it needs (port)
#[async_trait]
pub trait PipelineRepository: Send + Sync {
    async fn create(&self, pipeline: &Pipeline) -> Result<(), PipelineError>;
    async fn find_by_id(&self, id: &PipelineId) -> Result<Option<Pipeline>, PipelineError>;
    async fn update(&self, pipeline: &Pipeline) -> Result<(), PipelineError>;
    async fn delete(&self, id: &PipelineId) -> Result<(), PipelineError>;
}

#[async_trait]
pub trait CompressionService: Send + Sync {
    async fn compress(
        &self,
        data: &[u8],
        algorithm: &Algorithm,
    ) -> Result<Vec<u8>, PipelineError>;

    async fn decompress(
        &self,
        data: &[u8],
        algorithm: &Algorithm,
    ) -> Result<Vec<u8>, PipelineError>;
}
}

Examples in our system:

  • PipelineRepository - Data persistence
  • CompressionService - Data compression
  • EncryptionService - Data encryption
  • ChecksumService - Integrity verification

Adapters: The Implementations

Adapters are concrete implementations of ports. They translate between the domain and external systems.

Primary Adapters (Driving)

Primary adapters drive the application. They take input from the outside world and call the domain.

CLI Adapter (main.rs)

// Primary adapter - drives the application
#[tokio::main]
async fn main() -> Result<(), PipelineError> {
    // 1. Parse user input
    let cli = bootstrap::bootstrap_cli()?;

    // 2. Set up infrastructure (dependency injection)
    let services = setup_services().await?;

    // 3. Drive the domain through primary port
    match cli.command {
        Commands::Process { input, output, pipeline } => {
            // Call domain through FileProcessorService port
            services.file_processor
                .process_file(&pipeline, &input, &output)
                .await?;
        }
        Commands::Create { name, stages } => {
            // Call domain through PipelineService port
            services.pipeline_service
                .create_pipeline(&name, stages)
                .await?;
        }
        // ... more commands
    }

    Ok(())
}

Key characteristics:

  • Translates user input to domain operations
  • Handles presentation concerns (formatting, errors)
  • Drives the application core

HTTP API Adapter (future)

#![allow(unused)]
fn main() {
// Another primary adapter for HTTP API
async fn handle_process_request(
    req: HttpRequest,
    services: Arc<Services>,
) -> HttpResponse {
    // Parse HTTP request
    let body: ProcessFileRequest = req.json().await?;

    // Drive domain through the same port
    let result = services.file_processor
        .process_file(&body.pipeline_id, &body.input, &body.output)
        .await;

    // Convert result to HTTP response
    match result {
        Ok(metrics) => HttpResponse::Ok().json(metrics),
        Err(e) => HttpResponse::BadRequest().json(e),
    }
}
}

Notice: Both CLI and HTTP adapters use the same domain ports. The domain doesn't know or care which adapter is calling it.

Secondary Adapters (Driven)

Secondary adapters are driven by the application. They implement the interfaces the domain needs.

SQLite Repository Adapter

#![allow(unused)]
fn main() {
// Infrastructure layer - implements domain port
pub struct SQLitePipelineRepository {
    pool: SqlitePool,
}

#[async_trait]
impl PipelineRepository for SQLitePipelineRepository {
    async fn create(&self, pipeline: &Pipeline) -> Result<(), PipelineError> {
        // Convert domain entity to database row
        let row = PipelineRow::from_domain(pipeline);

        // Persist to SQLite
        sqlx::query(
            "INSERT INTO pipelines (id, name, archived, created_at, updated_at)
             VALUES (?, ?, ?, ?, ?)"
        )
        .bind(&row.id)
        .bind(&row.name)
        .bind(row.archived)
        .bind(&row.created_at)
        .bind(&row.updated_at)
        .execute(&self.pool)
        .await
        .map_err(|e| PipelineError::RepositoryError(e.to_string()))?;

        Ok(())
    }

    async fn find_by_id(&self, id: &PipelineId) -> Result<Option<Pipeline>, PipelineError> {
        // Query SQLite
        let row = sqlx::query_as::<_, PipelineRow>(
            "SELECT * FROM pipelines WHERE id = ?"
        )
        .bind(id.to_string())
        .fetch_optional(&self.pool)
        .await
        .map_err(|e| PipelineError::RepositoryError(e.to_string()))?;

        // Convert database row to domain entity
        row.map(|r| Pipeline::from_database(r)).transpose()
    }
}
}

Key characteristics:

  • Implements domain-defined interface
  • Handles database-specific operations
  • Translates between domain models and database rows
  • Can be swapped without changing domain

Compression Service Adapter

#![allow(unused)]
fn main() {
// Infrastructure layer - implements domain port
pub struct CompressionServiceAdapter {
    // Internal state for compression libraries
}

#[async_trait]
impl CompressionService for CompressionServiceAdapter {
    async fn compress(
        &self,
        data: &[u8],
        algorithm: &Algorithm,
    ) -> Result<Vec<u8>, PipelineError> {
        // Route to appropriate compression library
        match algorithm.name() {
            "brotli" => {
                let mut compressed = Vec::new();
                brotli::BrotliCompress(
                    &mut Cursor::new(data),
                    &mut compressed,
                    &Default::default(),
                )?;
                Ok(compressed)
            }
            "zstd" => {
                let compressed = zstd::encode_all(data, 3)?;
                Ok(compressed)
            }
            "lz4" => {
                let compressed = lz4::block::compress(data, None, false)?;
                Ok(compressed)
            }
            _ => Err(PipelineError::UnsupportedAlgorithm(
                algorithm.name().to_string()
            )),
        }
    }

    async fn decompress(
        &self,
        data: &[u8],
        algorithm: &Algorithm,
    ) -> Result<Vec<u8>, PipelineError> {
        // Similar implementation for decompression
        // ...
    }
}
}

Key characteristics:

  • Wraps external libraries (brotli, zstd, lz4)
  • Implements domain interface
  • Handles library-specific details
  • Can be swapped for different implementations

Benefits of Hexagonal Architecture

1. Testability

You can test the domain in isolation using mock adapters:

#![allow(unused)]
fn main() {
// Mock adapter for testing
struct MockPipelineRepository {
    pipelines: Mutex<HashMap<PipelineId, Pipeline>>,
}

#[async_trait]
impl PipelineRepository for MockPipelineRepository {
    async fn create(&self, pipeline: &Pipeline) -> Result<(), PipelineError> {
        self.pipelines.lock().unwrap()
            .insert(pipeline.id().clone(), pipeline.clone());
        Ok(())
    }

    async fn find_by_id(&self, id: &PipelineId) -> Result<Option<Pipeline>, PipelineError> {
        Ok(self.pipelines.lock().unwrap().get(id).cloned())
    }
}

#[tokio::test]
async fn test_file_processor_service() {
    // Use mock adapter instead of real database
    let repo = Arc::new(MockPipelineRepository::new());
    let service = FileProcessorService::new(repo);

    // Test domain logic without database
    let result = service.process_file(/* ... */).await;
    assert!(result.is_ok());
}
}

2. Flexibility

Swap implementations without changing the domain:

#![allow(unused)]
fn main() {
// Start with SQLite
let repo: Arc<dyn PipelineRepository> =
    Arc::new(SQLitePipelineRepository::new(pool));

// Later, switch to PostgreSQL
let repo: Arc<dyn PipelineRepository> =
    Arc::new(PostgresPipelineRepository::new(pool));

// Domain doesn't change - same interface!
let service = FileProcessorService::new(repo);
}

3. Multiple Interfaces

Support multiple input sources using the same domain:

#![allow(unused)]
fn main() {
// CLI adapter
async fn cli_handler(cli: Cli, services: Arc<Services>) {
    services.file_processor.process_file(/* ... */).await?;
}

// HTTP adapter
async fn http_handler(req: HttpRequest, services: Arc<Services>) {
    services.file_processor.process_file(/* ... */).await?;
}

// gRPC adapter
async fn grpc_handler(req: GrpcRequest, services: Arc<Services>) {
    services.file_processor.process_file(/* ... */).await?;
}
}

All three adapters use the same domain logic through the same port.

4. Technology Independence

The domain doesn't depend on specific technologies:

#![allow(unused)]
fn main() {
// Domain doesn't know about:
// - SQLite, PostgreSQL, or MongoDB
// - HTTP, gRPC, or CLI
// - Brotli, Zstd, or LZ4
// - Any specific framework or library

// It only knows about:
// - Business concepts (Pipeline, Stage, Chunk)
// - Business rules (validation, ordering)
// - Interfaces it needs (Repository, CompressionService)
}

Dependency Inversion

Hexagonal Architecture relies on Dependency Inversion Principle:

Traditional:                    Hexagonal:

┌──────────┐                   ┌──────────┐
│   CLI    │                   │   CLI    │
└────┬─────┘                   └────┬─────┘
     │ depends on                   │ depends on
     ▼                              ▼
┌──────────┐                   ┌──────────┐
│ Domain   │                   │  Port    │ ← Interface
└────┬─────┘                   │ (trait)  │
     │ depends on               └────△─────┘
     ▼                               │ implements
┌──────────┐                   ┌────┴─────┐
│ Database │                   │  Domain  │
└──────────┘                   └──────────┘
                                     △
                                     │ implements
                               ┌─────┴─────┐
                               │ Database  │
                               │ Adapter   │
                               └───────────┘

Traditional: Domain depends on Database (tight coupling)
Hexagonal: Database adapter depends on the Domain interface (loose coupling)

Our Adapter Structure

pipeline/src/
├── infrastructure/
│   └── adapters/
│       ├── compression_service_adapter.rs    # Implements CompressionService
│       ├── encryption_service_adapter.rs     # Implements EncryptionService
│       ├── async_compression_adapter.rs      # Async wrapper
│       ├── async_encryption_adapter.rs       # Async wrapper
│       └── repositories/
│           ├── sqlite_repository_adapter.rs  # Implements PipelineRepository
│           └── sqlite_base_repository.rs     # Base repository utilities

Adapter Responsibilities

What Adapters Should Do

✅ Translate between domain and external systems
✅ Handle technology-specific details
✅ Implement domain-defined interfaces
✅ Convert data formats (domain ↔ database, domain ↔ API)
✅ Manage external resources (connections, files, etc.)

What Adapters Should NOT Do

❌ Contain business logic - belongs in domain
❌ Make business decisions - belongs in domain
❌ Validate business rules - belongs in domain
❌ Know about other adapters - should be independent
❌ Expose infrastructure details to domain

Example: Complete Flow

Let's trace a complete request through the hexagonal architecture:

1. Primary Adapter (CLI)
   ↓ User types: pipeline process --input file.txt --output file.bin

2. Parse and validate input
   ↓ Create FilePath("/path/to/file.txt")

3. Call Primary Port (FileProcessorService)
   ↓ process_file(pipeline_id, input_path, output_path)

4. Domain Logic
   ├─ Fetch Pipeline (via PipelineRepository port)
   │  └─ Secondary Adapter queries SQLite
   ├─ Process each stage
   │  ├─ Compress (via CompressionService port)
   │  │  └─ Secondary Adapter uses brotli library
   │  ├─ Encrypt (via EncryptionService port)
   │  │  └─ Secondary Adapter uses aes-gcm library
   │  └─ Calculate checksum (via ChecksumService port)
   │     └─ Secondary Adapter uses sha2 library
   └─ Return ProcessingMetrics

5. Primary Adapter formats output
   ↓ Display metrics to user

Common Adapter Patterns

Repository Adapter Pattern

#![allow(unused)]
fn main() {
// 1. Domain defines interface (port)
pub trait PipelineRepository: Send + Sync {
    async fn find_by_id(&self, id: &PipelineId) -> Result<Option<Pipeline>>;
}

// 2. Infrastructure implements adapter
pub struct SQLitePipelineRepository { /* ... */ }

impl PipelineRepository for SQLitePipelineRepository {
    async fn find_by_id(&self, id: &PipelineId) -> Result<Option<Pipeline>> {
        // Database-specific implementation
    }
}

// 3. Application uses through interface
pub struct FileProcessorService {
    repository: Arc<dyn PipelineRepository>,  // Uses interface, not concrete type
}
}

Service Adapter Pattern

#![allow(unused)]
fn main() {
// 1. Domain defines interface
pub trait CompressionService: Send + Sync {
    async fn compress(&self, data: &[u8], algo: &Algorithm) -> Result<Vec<u8>>;
}

// 2. Infrastructure implements adapter
pub struct CompressionServiceAdapter { /* ... */ }

impl CompressionService for CompressionServiceAdapter {
    async fn compress(&self, data: &[u8], algo: &Algorithm) -> Result<Vec<u8>> {
        // Library-specific implementation
    }
}

// 3. Application uses through interface
pub struct StageExecutor {
    compression: Arc<dyn CompressionService>,  // Uses interface
}
}

Testing with Adapters

Unit Tests (Domain Layer)

#![allow(unused)]
fn main() {
// Test domain logic without any adapters
#[test]
fn test_pipeline_validation() {
    // Pure domain logic - no infrastructure needed
    let result = Pipeline::new("test", vec![]);
    assert!(result.is_err());
}
}

Integration Tests (With Mock Adapters)

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_file_processing() {
    // Use mock adapters
    let mock_repo = Arc::new(MockPipelineRepository::new());
    let mock_compression = Arc::new(MockCompressionService::new());

    let service = FileProcessorService::new(mock_repo, mock_compression);

    // Test without real database or compression
    let result = service.process_file(/* ... */).await;
    assert!(result.is_ok());
}
}

End-to-End Tests (With Real Adapters)

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_real_file_processing() {
    // Use real adapters
    let db_pool = create_test_database().await;
    let real_repo = Arc::new(SQLitePipelineRepository::new(db_pool));
    let real_compression = Arc::new(CompressionServiceAdapter::new());

    let service = FileProcessorService::new(real_repo, real_compression);

    // Test with real infrastructure
    let result = service.process_file(/* ... */).await;
    assert!(result.is_ok());
}
}

Next Steps

Now that you understand hexagonal architecture, continue to the Observer Pattern chapter.

Observer Pattern

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

The Observer pattern for metrics and events.

Pattern Overview

TODO: Extract from metrics observer

Implementation

TODO: Show implementation

Use Cases

TODO: List use cases

Stage Processing

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

This chapter provides a comprehensive overview of the stage processing architecture in the adaptive pipeline system. Stages are the fundamental building blocks that transform data as it flows through a pipeline.


Overview

Stages are individual processing steps within a pipeline that transform file chunks as data flows from input to output. Each stage performs a specific operation such as compression, encryption, or integrity checking.

Key Characteristics

  • Type Safety: Strongly-typed stage operations prevent configuration errors
  • Ordering: Explicit ordering ensures predictable execution sequence
  • Lifecycle Management: Stages track creation and modification timestamps
  • State Management: Stages can be enabled/disabled without removal
  • Resource Awareness: Stages provide resource estimation and management

Stage Processing Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Pipeline                             │
│                                                             │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐           │
│  │  Stage 1   │  │  Stage 2   │  │  Stage 3   │           │
│  │ Checksum   │→ │ Compress   │→ │  Encrypt   │→ Output  │
│  │ (Order 0)  │  │ (Order 1)  │  │ (Order 2)  │           │
│  └────────────┘  └────────────┘  └────────────┘           │
│        ↑               ↑               ↑                    │
│        └───────────────┴───────────────┘                    │
│              Stage Executor                                 │
└─────────────────────────────────────────────────────────────┘

Design Principles

  1. Domain-Driven Design: Stages are domain entities with identity
  2. Separation of Concerns: Configuration separated from execution
  3. Async-First: All operations are asynchronous for scalability
  4. Extensibility: New stage types can be added through configuration

Stage Types

The pipeline supports five distinct stage types, each optimized for different data transformation operations.

StageType Enum

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum StageType {
    /// Compression or decompression operations
    Compression,

    /// Encryption or decryption operations
    Encryption,

    /// Data transformation operations
    Transform,

    /// Checksum calculation and verification
    Checksum,

    /// Pass-through stage that doesn't modify data
    PassThrough,
}
}

Stage Type Details

Stage Type  | Purpose             | Examples                 | Typical Use Case
------------|---------------------|--------------------------|-------------------------------------
Compression | Reduce data size    | Brotli, Gzip, Zstd, Lz4  | Minimize storage/bandwidth
Encryption  | Secure data         | AES-256-GCM, ChaCha20    | Data protection
Transform   | Inspect/modify data | Tee, Debug               | Debugging, monitoring, data forking
Checksum    | Verify integrity    | SHA-256, SHA-512, Blake3 | Data validation
PassThrough | No modification     | Identity transform       | Testing/debugging

Parsing Stage Types

Stage types support case-insensitive parsing from strings:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::entities::pipeline_stage::StageType;
use std::str::FromStr;

// Parse from lowercase
let compression = StageType::from_str("compression").unwrap();
assert_eq!(compression, StageType::Compression);

// Case-insensitive parsing
let encryption = StageType::from_str("ENCRYPTION").unwrap();
assert_eq!(encryption, StageType::Encryption);

// Display format
assert_eq!(format!("{}", StageType::Checksum), "checksum");
}

Pattern Matching

#![allow(unused)]
fn main() {
fn describe_stage(stage_type: StageType) -> &'static str {
    match stage_type {
        StageType::Compression => "Reduces data size",
        StageType::Encryption => "Secures data",
        StageType::Transform => "Modifies data structure",
        StageType::Checksum => "Verifies data integrity",
        StageType::PassThrough => "No modification",
    }
}
}

Stage Entity

The PipelineStage is a domain entity that encapsulates a specific data transformation operation within a pipeline.

Entity Structure

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PipelineStage {
    id: StageId,
    name: String,
    stage_type: StageType,
    configuration: StageConfiguration,
    enabled: bool,
    order: u32,
    created_at: chrono::DateTime<chrono::Utc>,
    updated_at: chrono::DateTime<chrono::Utc>,
}
}

Entity Characteristics

  • Identity: Unique StageId persists through configuration changes
  • Name: Human-readable identifier (must not be empty)
  • Type: Strongly-typed operation (Compression, Encryption, etc.)
  • Configuration: Algorithm-specific parameters
  • Enabled Flag: Controls execution without removal
  • Order: Determines execution sequence (0-based)
  • Timestamps: Track creation and modification times

Creating a Stage

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::entities::pipeline_stage::{PipelineStage, StageConfiguration, StageType};
use std::collections::HashMap;

let mut params = HashMap::new();
params.insert("level".to_string(), "6".to_string());

let config = StageConfiguration::new("brotli".to_string(), params, true);
let stage = PipelineStage::new(
    "compression".to_string(),
    StageType::Compression,
    config,
    0  // Order: execute first
).unwrap();

assert_eq!(stage.name(), "compression");
assert_eq!(stage.stage_type(), &StageType::Compression);
assert_eq!(stage.algorithm(), "brotli");
assert!(stage.is_enabled());
}

Modifying Stage State

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::entities::pipeline_stage::{PipelineStage, StageConfiguration, StageType};
use std::collections::HashMap;

let mut stage = PipelineStage::new(
    "checksum".to_string(),
    StageType::Checksum,
    StageConfiguration::default(),
    0,
).unwrap();

// Disable the stage temporarily
stage.set_enabled(false);
assert!(!stage.is_enabled());

// Update configuration
let mut new_params = HashMap::new();
new_params.insert("algorithm".to_string(), "sha512".to_string());
let new_config = StageConfiguration::new("sha512".to_string(), new_params, true);
stage.update_configuration(new_config);

// Change execution order
stage.update_order(2);
assert_eq!(stage.order(), 2);

// Re-enable the stage
stage.set_enabled(true);
}

Stage Configuration

Each stage has a configuration that specifies how data should be transformed.

Configuration Structure

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StageConfiguration {
    pub algorithm: String,
    pub parameters: HashMap<String, String>,
    pub parallel_processing: bool,
    pub chunk_size: Option<usize>,
}
}

Configuration Parameters

Field               | Type          | Description                                  | Default
--------------------|---------------|----------------------------------------------|----------
algorithm           | String        | Algorithm name (e.g., "brotli", "aes256gcm") | "default"
parameters          | HashMap       | Algorithm-specific key-value parameters      | {}
parallel_processing | bool          | Enable parallel chunk processing             | true
chunk_size          | Option<usize> | Custom chunk size (1KB - 100MB)              | None

Compression Configuration

#![allow(unused)]
fn main() {
let mut params = HashMap::new();
params.insert("level".to_string(), "9".to_string());

let config = StageConfiguration::new(
    "zstd".to_string(),
    params,
    true,  // Enable parallel processing
);
}

Encryption Configuration

#![allow(unused)]
fn main() {
let mut params = HashMap::new();
params.insert("key_size".to_string(), "256".to_string());

let config = StageConfiguration::new(
    "aes256gcm".to_string(),
    params,
    false,  // Sequential processing for encryption
);
}

Default Configuration

#![allow(unused)]
fn main() {
let config = StageConfiguration::default();
// algorithm: "default"
// parameters: {}
// parallel_processing: true
// chunk_size: None
}

Stage Lifecycle

Stages progress through several lifecycle phases from creation to execution.

Lifecycle Phases

1. Creation
   ↓
2. Configuration
   ↓
3. Ordering
   ↓
4. Execution
   ↓
5. Monitoring

1. Creation Phase

Stages are created with initial configuration:

#![allow(unused)]
fn main() {
let stage = PipelineStage::new(
    "compression".to_string(),
    StageType::Compression,
    StageConfiguration::default(),
    0,
)?;
}

2. Configuration Phase

Parameters can be updated as needed:

#![allow(unused)]
fn main() {
stage.update_configuration(new_config);
// updated_at timestamp is automatically updated
}

3. Ordering Phase

Position in pipeline can be adjusted:

#![allow(unused)]
fn main() {
stage.update_order(1);
// Stage now executes second instead of first
}

4. Execution Phase

Stage processes data according to its configuration:

#![allow(unused)]
fn main() {
let executor: Arc<dyn StageExecutor> = /* ... */;
let result = executor.execute(&stage, chunk, &mut context).await?;
}

5. Monitoring Phase

Timestamps track when changes occur:

#![allow(unused)]
fn main() {
println!("Created: {}", stage.created_at());
println!("Last modified: {}", stage.updated_at());
}

Stage Execution Model

The stage executor processes file chunks through configured stages using two primary execution modes.

Single Chunk Processing

Process individual chunks sequentially:

#![allow(unused)]
fn main() {
async fn execute(
    &self,
    stage: &PipelineStage,
    chunk: FileChunk,
    context: &mut ProcessingContext,
) -> Result<FileChunk, PipelineError>
}

Execution Flow:

Input Chunk → Validate → Process → Update Context → Output Chunk

Parallel Processing

Process multiple chunks concurrently:

#![allow(unused)]
fn main() {
async fn execute_parallel(
    &self,
    stage: &PipelineStage,
    chunks: Vec<FileChunk>,
    context: &mut ProcessingContext,
) -> Result<Vec<FileChunk>, PipelineError>
}

Execution Flow:

Chunks: [1, 2, 3, 4]
         ↓  ↓  ↓  ↓
      ┌────┬───┬───┬────┐
      │ T1 │T2 │T3 │ T4 │  (Parallel threads)
      └────┴───┴───┴────┘
         ↓  ↓  ↓  ↓
Results: [1, 2, 3, 4]
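
Conceptually, parallel execution fans chunks out to concurrent tasks and reassembles the results in their original order. A minimal sketch of that fan-out, using the futures crate's try_join_all over a stand-in per-chunk function (not the real executor):

#![allow(unused)]
fn main() {
use futures::future::try_join_all;

// Stand-in per-chunk transform for the sketch; the real executor dispatches
// to compression/encryption/checksum services based on the stage.
async fn process_chunk(data: Vec<u8>) -> Result<Vec<u8>, String> {
    Ok(data)
}

async fn process_parallel(chunks: Vec<Vec<u8>>) -> Result<Vec<Vec<u8>>, String> {
    // One future per chunk; try_join_all preserves input order in its output.
    try_join_all(chunks.into_iter().map(process_chunk)).await
}
}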

Processing Context

The ProcessingContext maintains state during execution:

#![allow(unused)]
fn main() {
pub struct ProcessingContext {
    pub pipeline_id: String,
    pub stage_metrics: HashMap<String, StageMetrics>,
    pub checksums: HashMap<String, Vec<u8>>,
    // ... other context fields
}
}

Stage Executor Interface

The StageExecutor trait defines the contract for stage execution engines.

Trait Definition

#![allow(unused)]
fn main() {
#[async_trait]
pub trait StageExecutor: Send + Sync {
    /// Execute a stage on a single chunk
    async fn execute(
        &self,
        stage: &PipelineStage,
        chunk: FileChunk,
        context: &mut ProcessingContext,
    ) -> Result<FileChunk, PipelineError>;

    /// Execute a stage on multiple chunks in parallel
    async fn execute_parallel(
        &self,
        stage: &PipelineStage,
        chunks: Vec<FileChunk>,
        context: &mut ProcessingContext,
    ) -> Result<Vec<FileChunk>, PipelineError>;

    /// Validate if a stage can be executed
    async fn can_execute(&self, stage: &PipelineStage) -> Result<bool, PipelineError>;

    /// Get supported stage types
    fn supported_stage_types(&self) -> Vec<String>;

    /// Estimate processing time for a stage
    async fn estimate_processing_time(
        &self,
        stage: &PipelineStage,
        data_size: u64,
    ) -> Result<std::time::Duration, PipelineError>;

    /// Get resource requirements for a stage
    async fn get_resource_requirements(
        &self,
        stage: &PipelineStage,
        data_size: u64,
    ) -> Result<ResourceRequirements, PipelineError>;
}
}

BasicStageExecutor Implementation

The infrastructure layer provides a concrete implementation:

#![allow(unused)]
fn main() {
pub struct BasicStageExecutor {
    checksums: Arc<RwLock<HashMap<String, Sha256>>>,
    compression_service: Arc<dyn CompressionService>,
    encryption_service: Arc<dyn EncryptionService>,
}

impl BasicStageExecutor {
    pub fn new(
        compression_service: Arc<dyn CompressionService>,
        encryption_service: Arc<dyn EncryptionService>,
    ) -> Self {
        Self {
            checksums: Arc::new(RwLock::new(HashMap::new())),
            compression_service,
            encryption_service,
        }
    }
}
}

Supported Stage Types

The BasicStageExecutor supports:

  • Compression: Via CompressionService (Brotli, Gzip, Zstd, Lz4)
  • Encryption: Via EncryptionService (AES-256-GCM, ChaCha20-Poly1305)
  • Checksum: Via internal SHA-256 implementation
  • Transform: Via TeeService and DebugService (Tee, Debug)

Transform Stages Details

Transform stages provide observability and diagnostic capabilities without modifying data flow:

Tee Stage

The Tee stage copies data to a secondary output while passing it through unchanged - similar to the Unix tee command.

Configuration:

#![allow(unused)]
fn main() {
let mut params = HashMap::new();
params.insert("output_path".to_string(), "/tmp/debug-output.bin".to_string());
params.insert("format".to_string(), "hex".to_string());  // binary, hex, or text
params.insert("enabled".to_string(), "true".to_string());

let tee_stage = PipelineStage::new(
    "tee-compressed".to_string(),
    StageType::Transform,
    StageConfiguration::new("tee".to_string(), params, false),
    1,
)?;
}

Use Cases:

  • Debugging: Capture intermediate pipeline data for analysis
  • Monitoring: Sample data at specific pipeline stages
  • Audit Trails: Record data flow for compliance
  • Data Forking: Split data to multiple destinations

Output Formats:

  • binary: Raw binary output (default)
  • hex: Hexadecimal dump with ASCII sidebar (like hexdump -C)
  • text: UTF-8 text (lossy conversion for non-UTF8)

Performance: Limited by I/O speed to tee output file

Debug Stage

The Debug stage monitors data flow with zero modification, emitting Prometheus metrics for real-time observability.

Configuration:

#![allow(unused)]
fn main() {
let mut params = HashMap::new();
params.insert("label".to_string(), "after-compression".to_string());

let debug_stage = PipelineStage::new(
    "debug-compressed".to_string(),
    StageType::Transform,
    StageConfiguration::new("debug".to_string(), params, false),
    2,
)?;
}

Use Cases:

  • Pipeline Debugging: Identify where data corruption occurs
  • Performance Analysis: Measure throughput between stages
  • Corruption Detection: Calculate checksums at intermediate points
  • Production Monitoring: Real-time visibility into processing

Metrics Emitted (Prometheus):

  • debug_stage_checksum{label, chunk_id}: SHA256 of chunk data
  • debug_stage_bytes{label, chunk_id}: Bytes processed per chunk
  • debug_stage_chunks_total{label}: Total chunks processed
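
As a rough sketch of how such a counter could be emitted with the prometheus crate (the metric name is taken from the list above; the actual wiring inside the Debug stage may differ):

#![allow(unused)]
fn main() {
// Hedged sketch using the `prometheus` crate; register the metric once at startup.
use prometheus::{register_int_counter_vec, IntCounterVec};

fn chunk_counter() -> IntCounterVec {
    register_int_counter_vec!(
        "debug_stage_chunks_total",
        "Total chunks processed by the debug stage",
        &["label"]
    )
    .expect("metric should be registered exactly once")
}

fn observe_chunk(counter: &IntCounterVec, label: &str) {
    counter.with_label_values(&[label]).inc();
}
}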

Performance: Minimal overhead, pass-through operation with checksum calculation

Transform Stage Positioning

Both Tee and Debug stages can be placed anywhere in the pipeline (StagePosition::Any) and are fully reversible. They're ideal for:

#![allow(unused)]
fn main() {
// Example: Debug pipeline with monitoring at key points
vec![
    // Input checkpoint
    PipelineStage::new("debug-input".to_string(), StageType::Transform,
        debug_config("input"), 0)?,

    // Compression
    PipelineStage::new("compress".to_string(), StageType::Compression,
        compression_config(), 1)?,

    // Capture compressed data
    PipelineStage::new("tee-compressed".to_string(), StageType::Transform,
        tee_config("/tmp/compressed.bin", "hex"), 2)?,

    // Encryption
    PipelineStage::new("encrypt".to_string(), StageType::Encryption,
        encryption_config(), 3)?,

    // Output checkpoint
    PipelineStage::new("debug-output".to_string(), StageType::Transform,
        debug_config("output"), 4)?,
]
}

Compatibility and Ordering

Stages have compatibility rules that ensure optimal pipeline performance.

1. Input Checksum (automatic)
   ↓
2. Compression (reduces data size)
   ↓
3. Encryption (secures compressed data)
   ↓
4. Output Checksum (automatic)

Rationale:

  • Compress before encrypting to reduce encrypted payload size
  • Checksum before compression to detect input corruption early
  • Checksum after encryption to verify output integrity

Compatibility Matrix

From \ To      | Compression | Encryption | Checksum | PassThrough | Transform
---------------|-------------|------------|----------|-------------|----------
Compression    | ❌ No       | ✅ Yes     | ✅ Yes   | ✅ Yes      | ⚠️ Rare
Encryption     | ❌ No       | ❌ No      | ✅ Yes   | ✅ Yes      | ❌ No
Checksum       | ✅ Yes      | ✅ Yes     | ✅ Yes   | ✅ Yes      | ✅ Yes
PassThrough    | ✅ Yes      | ✅ Yes     | ✅ Yes   | ✅ Yes      | ✅ Yes
Transform      | ✅ Yes      | ✅ Yes     | ✅ Yes   | ✅ Yes      | ⚠️ Depends

Legend:

  • ✅ Yes: Recommended combination
  • ❌ No: Not recommended (avoid duplication or inefficiency)
  • ⚠️ Rare/Depends: Context-dependent

Checking Compatibility

#![allow(unused)]
fn main() {
let compression = PipelineStage::new(
    "compression".to_string(),
    StageType::Compression,
    StageConfiguration::default(),
    0,
).unwrap();

let encryption = PipelineStage::new(
    "encryption".to_string(),
    StageType::Encryption,
    StageConfiguration::default(),
    1,
).unwrap();

// Compression should come before encryption
assert!(compression.is_compatible_with(&encryption));
}

Compatibility Rules

The is_compatible_with method implements these rules (sketched in code after the list):

  1. Compression → Encryption: ✅ Compress first, then encrypt
  2. Compression → Compression: ❌ Avoid double compression
  3. Encryption → Encryption: ❌ Avoid double encryption
  4. Encryption → Compression: ❌ Cannot compress encrypted data effectively
  5. PassThrough → Any: ✅ No restrictions
  6. Checksum → Any: ✅ Checksums compatible with everything
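
A simplified sketch of these rules as a match over stage-type pairs (illustrative only; the real is_compatible_with also accounts for Transform stages, which the matrix above marks as context-dependent):

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::entities::pipeline_stage::StageType;

// Illustrative restatement of the ordering rules; not the actual method.
fn compatible(from: StageType, to: StageType) -> bool {
    use StageType::*;
    match (from, to) {
        (Compression, Compression) => false, // avoid double compression
        (Encryption, Encryption) => false,   // avoid double encryption
        (Encryption, Compression) => false,  // encrypted data compresses poorly
        (Compression, Encryption) => true,   // compress first, then encrypt
        (Checksum, _) | (PassThrough, _) => true,
        _ => true,                           // remaining pairs are context-dependent
    }
}
}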

Resource Management

Stages provide resource estimation and requirements to enable efficient execution planning.

Resource Requirements

#![allow(unused)]
fn main() {
#[derive(Debug, Clone)]
pub struct ResourceRequirements {
    pub memory_bytes: u64,
    pub cpu_cores: u32,
    pub disk_space_bytes: u64,
    pub network_bandwidth_bps: Option<u64>,
    pub gpu_memory_bytes: Option<u64>,
    pub estimated_duration: std::time::Duration,
}
}

Default Requirements

#![allow(unused)]
fn main() {
ResourceRequirements::default()
// memory_bytes: 64 MB
// cpu_cores: 1
// disk_space_bytes: 0
// network_bandwidth_bps: None
// gpu_memory_bytes: None
// estimated_duration: 1 second
}

Custom Requirements

#![allow(unused)]
fn main() {
let requirements = ResourceRequirements::new(
    128 * 1024 * 1024,  // 128 MB memory
    4,                   // 4 CPU cores
    1024 * 1024 * 1024, // 1 GB disk space
)
.with_duration(Duration::from_secs(30))
.with_network_bandwidth(100_000_000); // 100 Mbps
}

Estimating Resources

#![allow(unused)]
fn main() {
let executor: Arc<dyn StageExecutor> = /* ... */;
let requirements = executor.get_resource_requirements(
    &stage,
    10 * 1024 * 1024,  // 10 MB data size
).await?;

println!("Memory required: {}", Byte::from_bytes(requirements.memory_bytes));
println!("CPU cores: {}", requirements.cpu_cores);
println!("Estimated time: {:?}", requirements.estimated_duration);
}

Scaling Requirements

#![allow(unused)]
fn main() {
let mut requirements = ResourceRequirements::default();
requirements.scale(2.0);  // Double all requirements
}

Merging Requirements

#![allow(unused)]
fn main() {
let mut req1 = ResourceRequirements::default();
let req2 = ResourceRequirements::new(256_000_000, 2, 0);
req1.merge(&req2);  // Takes maximum of each field
}
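
A typical use of merge is to aggregate per-stage estimates into a single pipeline-level requirement, so planning reserves enough for the most demanding stage. A minimal sketch using the executor API shown above:

#![allow(unused)]
fn main() {
/// Aggregate per-stage estimates; merge() keeps the maximum of each field.
async fn pipeline_requirements(
    executor: &Arc<dyn StageExecutor>,
    stages: &[PipelineStage],
    data_size: u64,
) -> Result<ResourceRequirements, PipelineError> {
    let mut total = ResourceRequirements::default();
    for stage in stages {
        let per_stage = executor.get_resource_requirements(stage, data_size).await?;
        total.merge(&per_stage);
    }
    Ok(total)
}
}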

Usage Examples

Example 1: Creating a Compression Stage

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::entities::pipeline_stage::{PipelineStage, StageConfiguration, StageType};
use std::collections::HashMap;

let mut params = HashMap::new();
params.insert("level".to_string(), "9".to_string());

let config = StageConfiguration::new(
    "zstd".to_string(),
    params,
    true,  // Enable parallel processing
);

let compression_stage = PipelineStage::new(
    "fast-compression".to_string(),
    StageType::Compression,
    config,
    1,  // Execute after input checksum (order 0)
)?;

println!("Created stage: {}", compression_stage.name());
println!("Algorithm: {}", compression_stage.algorithm());
}

Example 2: Creating an Encryption Stage

#![allow(unused)]
fn main() {
let mut params = HashMap::new();
params.insert("key_size".to_string(), "256".to_string());

let config = StageConfiguration::new(
    "aes256gcm".to_string(),
    params,
    false,  // Sequential processing for security
);

let encryption_stage = PipelineStage::new(
    "secure-encryption".to_string(),
    StageType::Encryption,
    config,
    2,  // Execute after compression
)?;
}

Example 3: Building a Complete Pipeline

#![allow(unused)]
fn main() {
let mut stages = Vec::new();

// Stage 0: Input checksum
let checksum_in = PipelineStage::new(
    "input-checksum".to_string(),
    StageType::Checksum,
    StageConfiguration::new("sha256".to_string(), HashMap::new(), true),
    0,
)?;
stages.push(checksum_in);

// Stage 1: Compression
let mut compress_params = HashMap::new();
compress_params.insert("level".to_string(), "6".to_string());
let compression = PipelineStage::new(
    "compression".to_string(),
    StageType::Compression,
    StageConfiguration::new("brotli".to_string(), compress_params, true),
    1,
)?;
stages.push(compression);

// Stage 2: Encryption
let mut encrypt_params = HashMap::new();
encrypt_params.insert("key_size".to_string(), "256".to_string());
let encryption = PipelineStage::new(
    "encryption".to_string(),
    StageType::Encryption,
    StageConfiguration::new("aes256gcm".to_string(), encrypt_params, false),
    2,
)?;
stages.push(encryption);

// Stage 3: Output checksum
let checksum_out = PipelineStage::new(
    "output-checksum".to_string(),
    StageType::Checksum,
    StageConfiguration::new("sha256".to_string(), HashMap::new(), true),
    3,
)?;
stages.push(checksum_out);

// Validate compatibility
for i in 0..stages.len() - 1 {
    assert!(stages[i].is_compatible_with(&stages[i + 1]));
}
}

Example 4: Executing a Stage

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::repositories::stage_executor::StageExecutor;

let executor: Arc<dyn StageExecutor> = /* ... */;
let stage = /* ... */;
let chunk = FileChunk::new(0, vec![1, 2, 3, 4, 5]);
let mut context = ProcessingContext::new("pipeline-123");

// Execute single chunk
let result = executor.execute(&stage, chunk, &mut context).await?;

println!("Processed {} bytes", result.data().len());
}

Example 5: Parallel Execution

#![allow(unused)]
fn main() {
let chunks = vec![
    FileChunk::new(0, vec![1, 2, 3]),
    FileChunk::new(1, vec![4, 5, 6]),
    FileChunk::new(2, vec![7, 8, 9]),
];

let results = executor.execute_parallel(&stage, chunks, &mut context).await?;

println!("Processed {} chunks", results.len());
}

Performance Considerations

Chunk Size Selection

Chunk size significantly impacts stage performance:

Data Size     | Recommended Chunk Size | Rationale
--------------|------------------------|--------------------------
< 10 MB       | 1 MB                   | Minimize overhead
10-100 MB     | 2-4 MB                 | Balance memory/IO
100 MB - 1 GB | 4-8 MB                 | Optimize parallelization
> 1 GB        | 8-16 MB                | Maximize throughput
#![allow(unused)]
fn main() {
let mut config = StageConfiguration::default();
config.chunk_size = Some(4 * 1024 * 1024);  // 4 MB chunks
}
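
The table above can be encoded as a small helper that picks a chunk size from the input size. This is a sketch only; it takes the upper end of each recommended range, and the pipeline may expose its own heuristic.

#![allow(unused)]
fn main() {
/// Recommended chunk size (bytes) for a given input size, per the table above.
fn recommended_chunk_size(data_size: u64) -> usize {
    const MB: u64 = 1024 * 1024;
    match data_size {
        s if s < 10 * MB => (1 * MB) as usize,
        s if s < 100 * MB => (4 * MB) as usize,
        s if s < 1024 * MB => (8 * MB) as usize,
        _ => (16 * MB) as usize,
    }
}

let mut config = StageConfiguration::default();
config.chunk_size = Some(recommended_chunk_size(250 * 1024 * 1024)); // 8 MB
}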

Parallel Processing

Enable parallel processing for CPU-bound operations:

#![allow(unused)]
fn main() {
// Compression: parallel processing beneficial
let compress_config = StageConfiguration::new(
    "zstd".to_string(),
    HashMap::new(),
    true,  // Enable parallel
);

// Encryption: sequential often better for security
let encrypt_config = StageConfiguration::new(
    "aes256gcm".to_string(),
    HashMap::new(),
    false,  // Disable parallel
);
}

Stage Ordering Impact

Optimal:

Checksum → Compress (6:1 ratio) → Encrypt → Checksum
1 GB → 1 GB → 167 MB → 167 MB → 167 MB

Suboptimal:

Checksum → Encrypt → Compress (1.1:1 ratio) → Checksum
1 GB → 1 GB → 1 GB → 909 MB → 909 MB

Encrypting first removes most of the redundancy that compressors exploit, so the achievable compression ratio drops from roughly 6:1 to 1.1:1.

Memory Usage

Per-stage memory usage:

Stage Type  | Memory per Chunk  | Notes
------------|-------------------|----------------------
Compression | 2-3x chunk size   | Compression buffers
Encryption  | 1-1.5x chunk size | Encryption overhead
Checksum    | ~256 bytes        | Hash state only
PassThrough | 1x chunk size     | No additional memory

CPU Utilization

CPU-intensive stages:

  1. Compression: High CPU usage (especially Brotli level 9+)
  2. Encryption: Moderate CPU usage (AES-NI acceleration helps)
  3. Checksum: Low CPU usage (Blake3 faster than SHA-256)

Best Practices

1. Stage Naming

Use descriptive, kebab-case names:

#![allow(unused)]
fn main() {
// ✅ Good
"input-checksum", "fast-compression", "secure-encryption"

// ❌ Bad
"stage1", "s", "MyStage"
}

2. Configuration Validation

Always validate configurations:

#![allow(unused)]
fn main() {
let stage = PipelineStage::new(/* ... */)?;
stage.validate()?;  // Validate before execution
}

3. Optimal Ordering

Follow the recommended order:

1. Input Checksum
2. Compression
3. Encryption
4. Output Checksum

4. Enable/Disable vs. Remove

Prefer disabling over removing stages:

#![allow(unused)]
fn main() {
// ✅ Good: Preserve configuration
stage.set_enabled(false);

// ❌ Bad: Lose configuration
stages.retain(|s| s.name() != "compression");
}

5. Resource Estimation

Estimate resources before execution:

#![allow(unused)]
fn main() {
let requirements = executor.get_resource_requirements(&stage, file_size).await?;

if requirements.memory_bytes > available_memory {
    // Adjust chunk size or process sequentially
}
}

6. Error Handling

Handle stage-specific errors appropriately:

#![allow(unused)]
fn main() {
match executor.execute(&stage, chunk, &mut context).await {
    Ok(result) => { /* success */ },
    Err(PipelineError::CompressionFailed(msg)) => {
        // Handle compression errors
    },
    Err(PipelineError::EncryptionFailed(msg)) => {
        // Handle encryption errors
    },
    Err(e) => {
        // Handle generic errors
    },
}
}

7. Monitoring

Track stage execution metrics:

#![allow(unused)]
fn main() {
let start = Instant::now();
let result = executor.execute(&stage, chunk, &mut context).await?;
let duration = start.elapsed();

println!("Stage '{}' processed {} bytes in {:?}",
    stage.name(),
    result.data().len(),
    duration
);
}

8. Testing

Test stages in isolation:

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_compression_stage() {
    let stage = create_compression_stage();
    let executor = create_test_executor();
    let chunk = FileChunk::new(0, vec![0u8; 1024]);
    let mut context = ProcessingContext::new("test");

    let result = executor.execute(&stage, chunk, &mut context).await.unwrap();

    assert!(result.data().len() < 1024);  // Compression worked
}
}

Troubleshooting

Issue 1: Stage Validation Fails

Symptom:

Error: InvalidConfiguration("Stage name cannot be empty")

Solution:

#![allow(unused)]
fn main() {
// Ensure stage name is not empty
let stage = PipelineStage::new(
    "my-stage".to_string(),  // ✅ Non-empty name
    stage_type,
    config,
    order,
)?;
}

Issue 2: Incompatible Stage Order

Symptom:

Error: IncompatibleStages("Cannot encrypt before compressing")

Solution:

#![allow(unused)]
fn main() {
// Check compatibility before adding stages
if !previous_stage.is_compatible_with(&new_stage) {
    // Reorder stages
}
}

Issue 3: Chunk Size Validation Error

Symptom:

Error: InvalidConfiguration("Chunk size must be between 1KB and 100MB")

Solution:

#![allow(unused)]
fn main() {
let mut config = StageConfiguration::default();
config.chunk_size = Some(4 * 1024 * 1024);  // ✅ 4 MB (valid range)
// config.chunk_size = Some(512);  // ❌ Too small (< 1KB)
// config.chunk_size = Some(200_000_000);  // ❌ Too large (> 100MB)
}

Issue 4: Out of Memory During Execution

Symptom:

Error: ResourceExhaustion("Insufficient memory for stage execution")

Solution:

#![allow(unused)]
fn main() {
// Reduce chunk size or disable parallel processing
let mut config = stage.configuration().clone();
config.chunk_size = Some(1 * 1024 * 1024);  // Reduce to 1 MB
config.parallel_processing = false;  // Disable parallel
stage.update_configuration(config);
}

Issue 5: Stage Executor Not Found

Symptom:

Error: ExecutorNotFound("No executor for stage type 'CustomStage'")

Solution:

#![allow(unused)]
fn main() {
// Check supported stage types
let supported = executor.supported_stage_types();
println!("Supported: {:?}", supported);

// Use a supported stage type
let stage = PipelineStage::new(
    "compression".to_string(),
    StageType::Compression,  // ✅ Supported type
    config,
    0,
)?;
}

Issue 6: Performance Degradation

Symptom: Stage execution is slower than expected.

Diagnosis:

#![allow(unused)]
fn main() {
let requirements = executor.get_resource_requirements(&stage, file_size).await?;
let duration = executor.estimate_processing_time(&stage, file_size).await?;

println!("Expected duration: {:?}", duration);
println!("Memory needed: {}", Byte::from_bytes(requirements.memory_bytes));
}

Solutions:

  • Enable parallel processing for compression stages
  • Increase chunk size for large files
  • Use faster algorithms (e.g., Lz4 instead of Brotli)
  • Check system resource availability
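
The mitigations above can be applied directly to a stage's configuration. A sketch follows; the `algorithm` field name is an assumption inferred from `StageConfiguration::new`'s first argument.

#![allow(unused)]
fn main() {
// Switch to a faster algorithm, larger chunks, and parallel processing.
let mut config = stage.configuration().clone();
config.algorithm = "lz4".to_string();       // assumed field; otherwise rebuild via StageConfiguration::new
config.chunk_size = Some(16 * 1024 * 1024); // 16 MB chunks for large files
config.parallel_processing = true;
stage.update_configuration(config);
}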

Testing Strategies

Unit Tests

Test individual stage operations:

#![allow(unused)]
fn main() {
#[test]
fn test_stage_creation() {
    let stage = PipelineStage::new(
        "test-stage".to_string(),
        StageType::Compression,
        StageConfiguration::default(),
        0,
    );
    assert!(stage.is_ok());
}

#[test]
fn test_stage_validation() {
    let stage = PipelineStage::new(
        "".to_string(),  // Empty name
        StageType::Compression,
        StageConfiguration::default(),
        0,
    );
    assert!(stage.is_err());
}
}

Integration Tests

Test stage execution with real executors:

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_compression_integration() {
    let compression_service = create_compression_service();
    let encryption_service = create_encryption_service();
    let executor = BasicStageExecutor::new(compression_service, encryption_service);

    let stage = create_compression_stage();
    let chunk = FileChunk::new(0, vec![0u8; 10000]);
    let mut context = ProcessingContext::new("test-pipeline");

    let result = executor.execute(&stage, chunk, &mut context).await.unwrap();

    assert!(result.data().len() < 10000);  // Verify compression
}
}

Property-Based Tests

Test stage invariants:

#![allow(unused)]
fn main() {
#[quickcheck]
fn stage_order_preserved(order: u32) -> bool {
    let stage = PipelineStage::new(
        "test".to_string(),
        StageType::Checksum,
        StageConfiguration::default(),
        order,
    ).unwrap();

    stage.order() == order
}
}

Compatibility Tests

Test stage compatibility matrix:

#![allow(unused)]
fn main() {
#[test]
fn test_compression_encryption_compatibility() {
    let compression = create_stage(StageType::Compression, 0);
    let encryption = create_stage(StageType::Encryption, 1);

    assert!(compression.is_compatible_with(&encryption));
    assert!(encryption.is_compatible_with(&create_stage(StageType::Checksum, 2)));
}
}

Next Steps

After understanding stage processing fundamentals, explore specific implementations:

Detailed Stage Implementations

  1. Compression: Deep dive into compression algorithms and performance tuning
  2. Encryption: Encryption implementation, key management, and security considerations
  3. Integrity Checking: Checksum algorithms and verification strategies
  4. Data Persistence: How stages are persisted and retrieved from the database
  5. File I/O: File chunking and binary format for stage data
  6. Observability: Monitoring stage execution and performance

Advanced Topics


Summary

Key Takeaways:

  1. Stages are the fundamental building blocks of pipelines, each performing a specific transformation
  2. Five stage types are supported: Compression, Encryption, Transform, Checksum, PassThrough
  3. PipelineStage is a domain entity with identity, configuration, and lifecycle management
  4. Stage compatibility rules ensure optimal ordering (compress before encrypt)
  5. StageExecutor trait provides async execution with resource estimation
  6. Resource management enables efficient execution planning and monitoring
  7. Best practices include proper naming, validation, and error handling

Configuration File Reference: pipeline/src/domain/entities/pipeline_stage.rs
Executor Interface: pipeline-domain/src/repositories/stage_executor.rs:156
Executor Implementation: pipeline/src/infrastructure/repositories/stage_executor.rs:175

Compression Implementation

Version: 1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner, Claude Code Status: Active

Overview

The compression service provides multiple compression algorithms optimized for different use cases. It's implemented as an infrastructure adapter that implements the domain's CompressionService trait.

File: pipeline/src/infrastructure/adapters/compression_service_adapter.rs

Supported Algorithms

Brotli

  • Best for: Web content, text files, logs
  • Compression ratio: Excellent (typically 15-25% better than gzip)
  • Speed: Slower compression, fast decompression
  • Memory: Higher memory usage (~10-20 MB)
  • Library: brotli crate

Use cases:

  • Archival storage where size is critical
  • Web assets (HTML, CSS, JavaScript)
  • Log files with repetitive patterns

Performance characteristics:

File Type    | Compression Ratio | Speed      | Memory
-------------|-------------------|------------|--------
Text logs    | 85-90%           | Slow       | High
HTML/CSS     | 80-85%           | Slow       | High
Binary data  | 60-70%           | Moderate   | High

Gzip

  • Best for: General-purpose compression
  • Compression ratio: Good (industry standard)
  • Speed: Moderate compression and decompression
  • Memory: Moderate usage (~5-10 MB)
  • Library: flate2 crate

Use cases:

  • General file compression
  • Compatibility with other systems
  • Balanced performance needs

Performance characteristics:

File Type    | Compression Ratio | Speed      | Memory
-------------|-------------------|------------|--------
Text logs    | 75-80%           | Moderate   | Moderate
HTML/CSS     | 70-75%           | Moderate   | Moderate
Binary data  | 50-60%           | Moderate   | Moderate

Zstandard (Zstd)

  • Best for: Modern systems, real-time compression
  • Compression ratio: Very good (better than gzip)
  • Speed: Very fast compression and decompression
  • Memory: Efficient (~5-15 MB depending on level)
  • Library: zstd crate

Use cases:

  • Real-time data processing
  • Large file compression
  • Network transmission
  • Modern backup systems

Performance characteristics:

File Type    | Compression Ratio | Speed      | Memory
-------------|-------------------|------------|--------
Text logs    | 80-85%           | Fast       | Low
HTML/CSS     | 75-80%           | Fast       | Low
Binary data  | 55-65%           | Fast       | Low

LZ4

  • Best for: Real-time applications, live streams
  • Compression ratio: Moderate
  • Speed: Extremely fast (fastest available)
  • Memory: Very low usage (~1-5 MB)
  • Library: lz4 crate

Use cases:

  • Real-time data streams
  • Low-latency requirements
  • Systems with limited memory
  • Network protocols

Performance characteristics:

File Type    | Compression Ratio | Speed         | Memory
-------------|-------------------|---------------|--------
Text logs    | 60-70%           | Very Fast     | Very Low
HTML/CSS     | 55-65%           | Very Fast     | Very Low
Binary data  | 40-50%           | Very Fast     | Very Low

Architecture

Service Interface (Domain Layer)

The domain layer defines what compression operations are needed:

#![allow(unused)]
fn main() {
// pipeline-domain/src/services/compression_service.rs
use async_trait::async_trait;
use crate::value_objects::Algorithm;
use crate::error::PipelineError;

#[async_trait]
pub trait CompressionService: Send + Sync {
    /// Compress data using the specified algorithm
    async fn compress(
        &self,
        data: &[u8],
        algorithm: &Algorithm,
    ) -> Result<Vec<u8>, PipelineError>;

    /// Decompress data using the specified algorithm
    async fn decompress(
        &self,
        data: &[u8],
        algorithm: &Algorithm,
    ) -> Result<Vec<u8>, PipelineError>;
}
}

Service Implementation (Infrastructure Layer)

The infrastructure layer provides the concrete implementation:

#![allow(unused)]
fn main() {
// pipeline/src/infrastructure/adapters/compression_service_adapter.rs
pub struct CompressionServiceAdapter {
    // Configuration and state
}

#[async_trait]
impl CompressionService for CompressionServiceAdapter {
    async fn compress(
        &self,
        data: &[u8],
        algorithm: &Algorithm,
    ) -> Result<Vec<u8>, PipelineError> {
        // Route to appropriate algorithm
        match algorithm.name() {
            "brotli" => self.compress_brotli(data),
            "gzip" => self.compress_gzip(data),
            "zstd" => self.compress_zstd(data),
            "lz4" => self.compress_lz4(data),
            _ => Err(PipelineError::UnsupportedAlgorithm(
                algorithm.name().to_string()
            )),
        }
    }

    async fn decompress(
        &self,
        data: &[u8],
        algorithm: &Algorithm,
    ) -> Result<Vec<u8>, PipelineError> {
        // Route to appropriate algorithm
        match algorithm.name() {
            "brotli" => self.decompress_brotli(data),
            "gzip" => self.decompress_gzip(data),
            "zstd" => self.decompress_zstd(data),
            "lz4" => self.decompress_lz4(data),
            _ => Err(PipelineError::UnsupportedAlgorithm(
                algorithm.name().to_string()
            )),
        }
    }
}
}

Algorithm Implementations

Brotli Implementation

#![allow(unused)]
fn main() {
impl CompressionServiceAdapter {
    fn compress_brotli(&self, data: &[u8]) -> Result<Vec<u8>, PipelineError> {
        use brotli::enc::BrotliEncoderParams;
        use std::io::Cursor;

        let mut compressed = Vec::new();
        let mut params = BrotliEncoderParams::default();

        // Quality level 11 = maximum compression
        params.quality = 11;

        brotli::BrotliCompress(
            &mut Cursor::new(data),
            &mut compressed,
            &params,
        ).map_err(|e| PipelineError::CompressionError(e.to_string()))?;

        Ok(compressed)
    }

    fn decompress_brotli(&self, data: &[u8]) -> Result<Vec<u8>, PipelineError> {
        use brotli::Decompressor;
        use std::io::Read;

        let mut decompressed = Vec::new();
        let mut decompressor = Decompressor::new(data, 4096);

        decompressor.read_to_end(&mut decompressed)
            .map_err(|e| PipelineError::DecompressionError(e.to_string()))?;

        Ok(decompressed)
    }
}
}

Gzip Implementation

#![allow(unused)]
fn main() {
impl CompressionServiceAdapter {
    fn compress_gzip(&self, data: &[u8]) -> Result<Vec<u8>, PipelineError> {
        use flate2::write::GzEncoder;
        use flate2::Compression;
        use std::io::Write;

        let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
        encoder.write_all(data)
            .map_err(|e| PipelineError::CompressionError(e.to_string()))?;

        encoder.finish()
            .map_err(|e| PipelineError::CompressionError(e.to_string()))
    }

    fn decompress_gzip(&self, data: &[u8]) -> Result<Vec<u8>, PipelineError> {
        use flate2::read::GzDecoder;
        use std::io::Read;

        let mut decoder = GzDecoder::new(data);
        let mut decompressed = Vec::new();

        decoder.read_to_end(&mut decompressed)
            .map_err(|e| PipelineError::DecompressionError(e.to_string()))?;

        Ok(decompressed)
    }
}
}

Zstandard Implementation

#![allow(unused)]
fn main() {
impl CompressionServiceAdapter {
    fn compress_zstd(&self, data: &[u8]) -> Result<Vec<u8>, PipelineError> {
        // Level 3 provides good balance of speed and compression
        zstd::encode_all(data, 3)
            .map_err(|e| PipelineError::CompressionError(e.to_string()))
    }

    fn decompress_zstd(&self, data: &[u8]) -> Result<Vec<u8>, PipelineError> {
        zstd::decode_all(data)
            .map_err(|e| PipelineError::DecompressionError(e.to_string()))
    }
}
}

LZ4 Implementation

#![allow(unused)]
fn main() {
impl CompressionServiceAdapter {
    fn compress_lz4(&self, data: &[u8]) -> Result<Vec<u8>, PipelineError> {
        // Prepend the uncompressed size so decompression can size its buffer
        lz4::block::compress(data, None, true)
            .map_err(|e| PipelineError::CompressionError(e.to_string()))
    }

    fn decompress_lz4(&self, data: &[u8]) -> Result<Vec<u8>, PipelineError> {
        // LZ4 needs the original size; passing None reads the size that was
        // prepended during compression from the block header
        lz4::block::decompress(data, None)
            .map_err(|e| PipelineError::DecompressionError(e.to_string()))
    }
}
}

Performance Optimizations

Parallel Chunk Processing

The compression service processes file chunks in parallel using Rayon:

#![allow(unused)]
fn main() {
use rayon::prelude::*;

pub async fn compress_chunks(
    chunks: Vec<FileChunk>,
    algorithm: &Algorithm,
    compression_service: &Arc<dyn CompressionService>,
) -> Result<Vec<CompressedChunk>, PipelineError> {
    // Process chunks in parallel on the rayon pool. The async trait call is
    // bridged with a lightweight block_on: the adapter's compression work is
    // CPU-bound and does not await I/O, so nothing actually blocks on a future.
    chunks.par_iter()
        .map(|chunk| {
            // Compress each chunk independently
            let compressed_data = futures::executor::block_on(
                compression_service.compress(&chunk.data, algorithm),
            )?;

            Ok(CompressedChunk {
                sequence: chunk.sequence,
                data: compressed_data,
                original_size: chunk.data.len(),
            })
        })
        .collect()
}
}

Memory Management

Efficient buffer management reduces allocations:

#![allow(unused)]
fn main() {
pub struct CompressionBuffer {
    input_buffer: Vec<u8>,
    output_buffer: Vec<u8>,
}

impl CompressionBuffer {
    pub fn new(chunk_size: usize) -> Self {
        Self {
            // Pre-allocate buffers
            input_buffer: Vec::with_capacity(chunk_size),
            output_buffer: Vec::with_capacity(chunk_size * 2), // Assume 2x for safety
        }
    }

    pub fn compress(&mut self, data: &[u8], algorithm: &Algorithm) -> Result<&[u8], PipelineError> {
        // Reuse buffers instead of allocating new ones
        self.input_buffer.clear();
        self.output_buffer.clear();

        self.input_buffer.extend_from_slice(data);
        // Compress from input_buffer to output_buffer
        // ...

        Ok(&self.output_buffer)
    }
}
}

Adaptive Compression Levels

Adjust compression levels based on data characteristics:

#![allow(unused)]
fn main() {
pub fn select_compression_level(data: &[u8]) -> u32 {
    // Analyze data entropy, normalized to [0, 1] (maximum byte entropy is 8 bits)
    let entropy = calculate_entropy(data) / 8.0;

    if entropy < 0.5 {
        // Low entropy (highly repetitive) - use maximum compression
        11
    } else if entropy < 0.7 {
        // Medium entropy - balanced compression
        6
    } else {
        // High entropy (random-like) - fast compression
        3
    }
}

fn calculate_entropy(data: &[u8]) -> f64 {
    // Calculate Shannon entropy
    let mut freq = [0u32; 256];
    for &byte in data {
        freq[byte as usize] += 1;
    }

    let len = data.len() as f64;
    freq.iter()
        .filter(|&&f| f > 0)
        .map(|&f| {
            let p = f as f64 / len;
            -p * p.log2()
        })
        .sum()
}
}
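
A quick sanity check of the selector: highly repetitive input should get a higher (slower, stronger) level than random-looking input. The data below is purely illustrative.

#![allow(unused)]
fn main() {
let repetitive = vec![b'a'; 64 * 1024];
let mixed: Vec<u8> = (0..64 * 1024u32).map(|i| (i * 31 % 251) as u8).collect();

assert_eq!(select_compression_level(&repetitive), 11); // near-zero entropy
assert_eq!(select_compression_level(&mixed), 3);       // close to uniform
}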

Configuration

Compression Levels

Different algorithms support different compression levels:

#![allow(unused)]
fn main() {
pub struct CompressionConfig {
    pub algorithm: Algorithm,
    pub level: CompressionLevel,
    pub chunk_size: usize,
    pub parallel_chunks: usize,
}

pub enum CompressionLevel {
    Fastest,      // LZ4, Zstd level 1
    Fast,         // Zstd level 3, Gzip level 1
    Balanced,     // Zstd level 6, Gzip level 6
    Best,         // Brotli level 11, Gzip level 9
    BestSize,     // Brotli level 11 with maximum window
}

impl CompressionConfig {
    pub fn for_speed() -> Self {
        Self {
            algorithm: Algorithm::lz4(),
            level: CompressionLevel::Fastest,
            chunk_size: 64 * 1024 * 1024, // 64 MB chunks
            parallel_chunks: num_cpus::get(),
        }
    }

    pub fn for_size() -> Self {
        Self {
            algorithm: Algorithm::brotli(),
            level: CompressionLevel::BestSize,
            chunk_size: 4 * 1024 * 1024, // 4 MB chunks for better compression
            parallel_chunks: num_cpus::get(),
        }
    }

    pub fn balanced() -> Self {
        Self {
            algorithm: Algorithm::zstd(),
            level: CompressionLevel::Balanced,
            chunk_size: 16 * 1024 * 1024, // 16 MB chunks
            parallel_chunks: num_cpus::get(),
        }
    }
}
}

Error Handling

Comprehensive error handling for compression failures:

#![allow(unused)]
fn main() {
#[derive(Debug, thiserror::Error)]
pub enum CompressionError {
    #[error("Compression failed: {0}")]
    CompressionFailed(String),

    #[error("Decompression failed: {0}")]
    DecompressionFailed(String),

    #[error("Unsupported algorithm: {0}")]
    UnsupportedAlgorithm(String),

    #[error("Invalid compression level: {0}")]
    InvalidLevel(u32),

    #[error("Buffer overflow during compression")]
    BufferOverflow,

    #[error("Corrupted compressed data")]
    CorruptedData,
}

impl From<CompressionError> for PipelineError {
    fn from(err: CompressionError) -> Self {
        match err {
            CompressionError::CompressionFailed(msg) =>
                PipelineError::CompressionError(msg),
            CompressionError::DecompressionFailed(msg) =>
                PipelineError::DecompressionError(msg),
            CompressionError::UnsupportedAlgorithm(algo) =>
                PipelineError::UnsupportedAlgorithm(algo),
            _ => PipelineError::CompressionError(err.to_string()),
        }
    }
}
}

Usage Examples

Basic Compression

use adaptive_pipeline::infrastructure::adapters::CompressionServiceAdapter;
use adaptive_pipeline_domain::services::CompressionService;
use adaptive_pipeline_domain::value_objects::Algorithm;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create compression service
    let compression = CompressionServiceAdapter::new();

    // Compress data
    let data = b"Hello, World!".to_vec();
    let compressed = compression.compress(&data, &Algorithm::zstd()).await?;

    println!("Original size: {} bytes", data.len());
    println!("Compressed size: {} bytes", compressed.len());
    println!("Compression ratio: {:.2}%",
        (1.0 - compressed.len() as f64 / data.len() as f64) * 100.0);

    // Decompress data
    let decompressed = compression.decompress(&compressed, &Algorithm::zstd()).await?;
    assert_eq!(data, decompressed);

    Ok(())
}

Comparing Algorithms

#![allow(unused)]
fn main() {
use std::time::Instant;

async fn compare_algorithms(data: &[u8]) -> Result<(), PipelineError> {
    let compression = CompressionServiceAdapter::new();
    let algorithms = vec![
        Algorithm::brotli(),
        Algorithm::gzip(),
        Algorithm::zstd(),
        Algorithm::lz4(),
    ];

    println!("Original size: {} bytes\n", data.len());

    for algo in algorithms {
        let start = Instant::now();
        let compressed = compression.compress(data, &algo).await?;
        let compress_time = start.elapsed();

        let start = Instant::now();
        let _decompressed = compression.decompress(&compressed, &algo).await?;
        let decompress_time = start.elapsed();

        println!("Algorithm: {}", algo.name());
        println!("  Compressed size: {} bytes ({:.2}% reduction)",
            compressed.len(),
            (1.0 - compressed.len() as f64 / data.len() as f64) * 100.0
        );
        println!("  Compression time: {:?}", compress_time);
        println!("  Decompression time: {:?}\n", decompress_time);
    }

    Ok(())
}
}

Benchmarks

Typical performance on a modern system (Intel i7, 16GB RAM):

Algorithm | File Size | Comp. Time | Decomp. Time | Ratio | Throughput
----------|-----------|------------|--------------|-------|------------
Brotli    | 100 MB    | 8.2s       | 0.4s         | 82%   | 12 MB/s
Gzip      | 100 MB    | 1.5s       | 0.6s         | 75%   | 67 MB/s
Zstd      | 100 MB    | 0.8s       | 0.3s         | 78%   | 125 MB/s
LZ4       | 100 MB    | 0.2s       | 0.1s         | 60%   | 500 MB/s

Best Practices

Choosing the Right Algorithm

Use Brotli when:

  • Storage space is critical
  • Compression time is not a concern
  • Data will be compressed once, decompressed many times (web assets)

Use Gzip when:

  • Compatibility with other systems is required
  • Balanced performance is needed
  • Working with legacy systems

Use Zstandard when:

  • Modern systems are available
  • Both speed and compression ratio matter
  • Real-time processing is needed

Use LZ4 when:

  • Speed is the top priority
  • Working with live data streams
  • Low latency is critical
  • Memory is limited

Chunk Size Selection

#![allow(unused)]
fn main() {
// For maximum compression
let chunk_size = 4 * 1024 * 1024;  // 4 MB

// For balanced performance
let chunk_size = 16 * 1024 * 1024; // 16 MB

// For maximum speed
let chunk_size = 64 * 1024 * 1024; // 64 MB
}

Memory Considerations

#![allow(unused)]
fn main() {
// Estimate memory usage
fn estimate_memory_usage(
    chunk_size: usize,
    parallel_chunks: usize,
    algorithm: &Algorithm,
) -> usize {
    let per_chunk_overhead = match algorithm.name() {
        "brotli" => chunk_size * 2,  // Brotli uses ~2x for internal buffers
        "gzip" => chunk_size,         // Gzip uses ~1x
        "zstd" => chunk_size / 2,     // Zstd is efficient
        "lz4" => chunk_size / 4,      // LZ4 is very efficient
        _ => chunk_size,
    };

    per_chunk_overhead * parallel_chunks
}
}
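
The estimate can be checked against a memory budget before deciding how many chunks to process at once. A sketch; the 2 GB budget is arbitrary.

#![allow(unused)]
fn main() {
let chunk_size = 16 * 1024 * 1024;
let mut parallel_chunks = num_cpus::get();

// Halve parallelism until the estimate fits the (arbitrary) 2 GB budget.
while estimate_memory_usage(chunk_size, parallel_chunks, &Algorithm::zstd())
    > 2 * 1024 * 1024 * 1024
    && parallel_chunks > 1
{
    parallel_chunks /= 2;
}
}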

Next Steps

Now that you understand compression implementation:

Encryption Implementation

Version: 1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner, Claude Code Status: Active

Overview

The encryption service provides authenticated encryption with multiple algorithms, secure key management, and automatic integrity verification. It's implemented as an infrastructure adapter that implements the domain's EncryptionService trait.

File: pipeline/src/infrastructure/adapters/encryption_service_adapter.rs

Supported Algorithms

AES-256-GCM (Advanced Encryption Standard)

  • Key size: 256 bits (32 bytes)
  • Nonce size: 96 bits (12 bytes)
  • Security: Industry standard, FIPS 140-2 approved
  • Performance: Excellent with AES-NI hardware acceleration
  • Library: aes-gcm crate

Best for:

  • Compliance requirements (FIPS, government)
  • Systems with AES-NI support
  • Maximum security requirements
  • Long-term data protection

Performance characteristics:

Operation   | With AES-NI | Without AES-NI
------------|-------------|----------------
Encryption  | 2-3 GB/s    | 100-200 MB/s
Decryption  | 2-3 GB/s    | 100-200 MB/s
Key setup   | < 1 μs      | < 1 μs
Memory      | Low         | Low

ChaCha20-Poly1305 (Stream Cipher)

  • Key size: 256 bits (32 bytes)
  • Nonce size: 96 bits (12 bytes)
  • Security: Modern, constant-time implementation
  • Performance: Consistent across all platforms
  • Library: chacha20poly1305 crate

Best for:

  • Systems without AES-NI
  • Mobile/embedded devices
  • Constant-time requirements
  • Side-channel attack resistance

Performance characteristics:

Operation   | Any Platform
------------|-------------
Encryption  | 500-800 MB/s
Decryption  | 500-800 MB/s
Key setup   | < 1 μs
Memory      | Low

AES-128-GCM (Faster AES Variant)

  • Key size: 128 bits (16 bytes)
  • Nonce size: 96 bits (12 bytes)
  • Security: Very secure, faster than AES-256
  • Performance: ~30% faster than AES-256
  • Library: aes-gcm crate

Best for:

  • High-performance requirements
  • Short-term data protection
  • Real-time encryption
  • Bandwidth-constrained systems

Performance characteristics:

Operation   | With AES-NI | Without AES-NI
------------|-------------|----------------
Encryption  | 3-4 GB/s    | 150-250 MB/s
Decryption  | 3-4 GB/s    | 150-250 MB/s
Key setup   | < 1 μs      | < 1 μs
Memory      | Low         | Low

Security Features

Authenticated Encryption (AEAD)

All algorithms provide Authenticated Encryption with Associated Data (AEAD):

Plaintext → Encrypt → Ciphertext + Authentication Tag
                ↓
            Detects tampering

Properties:

  • Confidentiality: Data is encrypted and unreadable
  • Integrity: Any modification is detected
  • Authentication: Verifies data origin

Nonce Management

Each encryption operation requires a unique nonce (number used once):

#![allow(unused)]
fn main() {
// Nonces are automatically generated for each chunk
pub struct EncryptionContext {
    key: SecretKey,
    nonce_counter: AtomicU64,  // Incrementing counter
}

impl EncryptionContext {
    fn next_nonce(&self) -> Nonce {
        let counter = self.nonce_counter.fetch_add(1, Ordering::SeqCst);

        // Generate nonce from counter
        let mut nonce = [0u8; 12];
        nonce[0..8].copy_from_slice(&counter.to_le_bytes());

        *Nonce::from_slice(&nonce) // deref-copy the borrowed slice into an owned nonce
    }
}
}

Important: Never reuse a nonce with the same key!
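
An alternative to counter-based nonces is drawing each nonce from a CSPRNG; with 96-bit nonces this is safe well below the ~2^32 encryptions-per-key limit discussed under Security Considerations. A minimal sketch:

#![allow(unused)]
fn main() {
use rand::RngCore;

/// Random 96-bit nonce; never reuse a nonce with the same key.
fn random_nonce() -> [u8; 12] {
    let mut nonce = [0u8; 12];
    rand::thread_rng().fill_bytes(&mut nonce);
    nonce
}
}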

Key Derivation

Derive encryption keys from passwords using secure KDFs:

#![allow(unused)]
fn main() {
pub enum KeyDerivationFunction {
    Argon2,   // Memory-hard, GPU-resistant
    Scrypt,   // Memory-hard, tunable
    PBKDF2,   // Standard, widely supported
}

// Derive key from password
pub fn derive_key(
    password: &[u8],
    salt: &[u8],
    kdf: KeyDerivationFunction,
) -> Result<SecretKey, PipelineError> {
    match kdf {
        KeyDerivationFunction::Argon2 => {
            // Memory: 64 MB, Iterations: 3, Parallelism: 4 (rust-argon2 defaults)
            let key = argon2::hash_raw(password, salt, &argon2::Config::default())
                .map_err(|e| PipelineError::EncryptionError(e.to_string()))?;
            Ok(SecretKey::new(key))
        }
        KeyDerivationFunction::Scrypt => {
            // N=16384 (2^14), r=8, p=1; derive 32 bytes into a buffer
            let mut key = vec![0u8; 32];
            let params = scrypt::Params::new(14, 8, 1)
                .map_err(|e| PipelineError::EncryptionError(e.to_string()))?;
            scrypt::scrypt(password, salt, &params, &mut key)
                .map_err(|e| PipelineError::EncryptionError(e.to_string()))?;
            Ok(SecretKey::new(key))
        }
        KeyDerivationFunction::PBKDF2 => {
            // 100,000 iterations of HMAC-SHA-256, 32-byte output
            let mut key = vec![0u8; 32];
            pbkdf2::pbkdf2_hmac::<sha2::Sha256>(password, salt, 100_000, &mut key);
            Ok(SecretKey::new(key))
        }
    }
}
}

Architecture

Service Interface (Domain Layer)

#![allow(unused)]
fn main() {
// pipeline-domain/src/services/encryption_service.rs
use async_trait::async_trait;
use crate::value_objects::Algorithm;
use crate::error::PipelineError;

#[async_trait]
pub trait EncryptionService: Send + Sync {
    /// Encrypt data using the specified algorithm
    async fn encrypt(
        &self,
        data: &[u8],
        algorithm: &Algorithm,
        key: &EncryptionKey,
    ) -> Result<EncryptedData, PipelineError>;

    /// Decrypt data using the specified algorithm
    async fn decrypt(
        &self,
        encrypted: &EncryptedData,
        algorithm: &Algorithm,
        key: &EncryptionKey,
    ) -> Result<Vec<u8>, PipelineError>;
}

/// Encrypted data with nonce and authentication tag
pub struct EncryptedData {
    pub ciphertext: Vec<u8>,
    pub nonce: Vec<u8>,        // 12 bytes
    pub tag: Vec<u8>,          // 16 bytes (authentication tag)
}
}

Service Implementation (Infrastructure Layer)

#![allow(unused)]
fn main() {
// pipeline/src/infrastructure/adapters/encryption_service_adapter.rs
use aes_gcm::{Aes256Gcm, Key, Nonce};
use aes_gcm::aead::{Aead, NewAead};
use chacha20poly1305::ChaCha20Poly1305;

pub struct EncryptionServiceAdapter {
    // Secure key storage
    key_store: Arc<RwLock<KeyStore>>,
}

#[async_trait]
impl EncryptionService for EncryptionServiceAdapter {
    async fn encrypt(
        &self,
        data: &[u8],
        algorithm: &Algorithm,
        key: &EncryptionKey,
    ) -> Result<EncryptedData, PipelineError> {
        match algorithm.name() {
            "aes-256-gcm" => self.encrypt_aes_256_gcm(data, key),
            "chacha20-poly1305" => self.encrypt_chacha20(data, key),
            "aes-128-gcm" => self.encrypt_aes_128_gcm(data, key),
            _ => Err(PipelineError::UnsupportedAlgorithm(
                algorithm.name().to_string()
            )),
        }
    }

    async fn decrypt(
        &self,
        encrypted: &EncryptedData,
        algorithm: &Algorithm,
        key: &EncryptionKey,
    ) -> Result<Vec<u8>, PipelineError> {
        match algorithm.name() {
            "aes-256-gcm" => self.decrypt_aes_256_gcm(encrypted, key),
            "chacha20-poly1305" => self.decrypt_chacha20(encrypted, key),
            "aes-128-gcm" => self.decrypt_aes_128_gcm(encrypted, key),
            _ => Err(PipelineError::UnsupportedAlgorithm(
                algorithm.name().to_string()
            )),
        }
    }
}
}

Algorithm Implementations

AES-256-GCM Implementation

#![allow(unused)]
fn main() {
impl EncryptionServiceAdapter {
    fn encrypt_aes_256_gcm(
        &self,
        data: &[u8],
        key: &EncryptionKey,
    ) -> Result<EncryptedData, PipelineError> {
        use aes_gcm::{Aes256Gcm, Key, Nonce};
        use aes_gcm::aead::{Aead, NewAead};

        // Create cipher from key
        let key = Key::from_slice(key.as_bytes());
        let cipher = Aes256Gcm::new(key);

        // Generate unique nonce
        let nonce = self.generate_nonce();
        let nonce_obj = Nonce::from_slice(&nonce);

        // Encrypt with authentication
        let ciphertext = cipher
            .encrypt(nonce_obj, data)
            .map_err(|e| PipelineError::EncryptionError(e.to_string()))?;

        // Split ciphertext and tag
        let (ciphertext_bytes, tag) = ciphertext.split_at(ciphertext.len() - 16);

        Ok(EncryptedData {
            ciphertext: ciphertext_bytes.to_vec(),
            nonce: nonce.to_vec(),
            tag: tag.to_vec(),
        })
    }

    fn decrypt_aes_256_gcm(
        &self,
        encrypted: &EncryptedData,
        key: &EncryptionKey,
    ) -> Result<Vec<u8>, PipelineError> {
        use aes_gcm::{Aes256Gcm, Key, Nonce};
        use aes_gcm::aead::{Aead, NewAead};

        // Create cipher from key
        let key = Key::from_slice(key.as_bytes());
        let cipher = Aes256Gcm::new(key);

        // Reconstruct nonce
        let nonce = Nonce::from_slice(&encrypted.nonce);

        // Combine ciphertext and tag
        let mut combined = encrypted.ciphertext.clone();
        combined.extend_from_slice(&encrypted.tag);

        // Decrypt and verify authentication
        let plaintext = cipher
            .decrypt(nonce, combined.as_slice())
            .map_err(|e| PipelineError::DecryptionError(
                format!("Decryption failed (possibly tampered): {}", e)
            ))?;

        Ok(plaintext)
    }
}
}

ChaCha20-Poly1305 Implementation

#![allow(unused)]
fn main() {
impl EncryptionServiceAdapter {
    fn encrypt_chacha20(
        &self,
        data: &[u8],
        key: &EncryptionKey,
    ) -> Result<EncryptedData, PipelineError> {
        use chacha20poly1305::{ChaCha20Poly1305, Key, Nonce};
        use chacha20poly1305::aead::{Aead, NewAead};

        // Create cipher from key
        let key = Key::from_slice(key.as_bytes());
        let cipher = ChaCha20Poly1305::new(key);

        // Generate unique nonce
        let nonce = self.generate_nonce();
        let nonce_obj = Nonce::from_slice(&nonce);

        // Encrypt with authentication
        let ciphertext = cipher
            .encrypt(nonce_obj, data)
            .map_err(|e| PipelineError::EncryptionError(e.to_string()))?;

        // Split ciphertext and tag
        let (ciphertext_bytes, tag) = ciphertext.split_at(ciphertext.len() - 16);

        Ok(EncryptedData {
            ciphertext: ciphertext_bytes.to_vec(),
            nonce: nonce.to_vec(),
            tag: tag.to_vec(),
        })
    }

    fn decrypt_chacha20(
        &self,
        encrypted: &EncryptedData,
        key: &EncryptionKey,
    ) -> Result<Vec<u8>, PipelineError> {
        use chacha20poly1305::{ChaCha20Poly1305, Key, Nonce};
        use chacha20poly1305::aead::{Aead, NewAead};

        // Create cipher from key
        let key = Key::from_slice(key.as_bytes());
        let cipher = ChaCha20Poly1305::new(key);

        // Reconstruct nonce
        let nonce = Nonce::from_slice(&encrypted.nonce);

        // Combine ciphertext and tag
        let mut combined = encrypted.ciphertext.clone();
        combined.extend_from_slice(&encrypted.tag);

        // Decrypt and verify authentication
        let plaintext = cipher
            .decrypt(nonce, combined.as_slice())
            .map_err(|e| PipelineError::DecryptionError(
                format!("Decryption failed (possibly tampered): {}", e)
            ))?;

        Ok(plaintext)
    }
}
}

Key Management

Secure Key Storage

#![allow(unused)]
fn main() {
use zeroize::Zeroize;

/// Secure key that zeroizes on drop
pub struct SecretKey {
    bytes: Vec<u8>,
}

impl SecretKey {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self { bytes }
    }

    pub fn as_bytes(&self) -> &[u8] {
        &self.bytes
    }

    /// Generate random key
    pub fn generate(size: usize) -> Self {
        use rand::RngCore;
        let mut bytes = vec![0u8; size];
        rand::thread_rng().fill_bytes(&mut bytes);
        Self::new(bytes)
    }
}

impl Drop for SecretKey {
    fn drop(&mut self) {
        // Securely wipe key from memory
        self.bytes.zeroize();
    }
}

impl Zeroize for SecretKey {
    fn zeroize(&mut self) {
        self.bytes.zeroize();
    }
}
}

Key Rotation

#![allow(unused)]
fn main() {
pub struct KeyRotation {
    current_key: SecretKey,
    previous_key: Option<SecretKey>,
    rotation_interval: Duration,
    last_rotation: Instant,
}

impl KeyRotation {
    pub fn rotate(&mut self) -> Result<(), PipelineError> {
        // Save current key as previous
        let old_key = std::mem::replace(
            &mut self.current_key,
            SecretKey::generate(32),
        );
        self.previous_key = Some(old_key);
        self.last_rotation = Instant::now();

        Ok(())
    }

    pub fn should_rotate(&self) -> bool {
        self.last_rotation.elapsed() >= self.rotation_interval
    }
}
}
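
Rotation is typically checked on a schedule or before each batch. A minimal usage sketch, constructing the struct directly and assuming its fields are visible as shown above:

#![allow(unused)]
fn main() {
use std::time::{Duration, Instant};

let mut rotation = KeyRotation {
    current_key: SecretKey::generate(32),
    previous_key: None,
    rotation_interval: Duration::from_secs(86_400), // 24 hours
    last_rotation: Instant::now(),
};

if rotation.should_rotate() {
    rotation.rotate().expect("key rotation failed");
}
}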

Performance Optimizations

Parallel Chunk Encryption

#![allow(unused)]
fn main() {
use rayon::prelude::*;

pub async fn encrypt_chunks(
    chunks: Vec<FileChunk>,
    algorithm: &Algorithm,
    key: &SecretKey,
    encryption_service: &Arc<dyn EncryptionService>,
) -> Result<Vec<EncryptedChunk>, PipelineError> {
    // Encrypt chunks in parallel on the rayon pool; the async trait call is
    // bridged with block_on because the cipher work is CPU-bound
    chunks.par_iter()
        .map(|chunk| {
            let encrypted = futures::executor::block_on(
                encryption_service.encrypt(&chunk.data, algorithm, key),
            )?;

            Ok(EncryptedChunk {
                sequence: chunk.sequence,
                data: encrypted,
                original_size: chunk.data.len(),
            })
        })
        .collect()
}
}

Hardware Acceleration

#![allow(unused)]
fn main() {
// Detect AES-NI support
pub fn has_aes_ni() -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        use std::arch::x86_64::*;
        is_x86_feature_detected!("aes")
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        false
    }
}

// Select algorithm based on hardware
pub fn select_algorithm() -> Algorithm {
    if has_aes_ni() {
        Algorithm::aes_256_gcm()  // Fast with hardware support
    } else {
        Algorithm::chacha20_poly1305()  // Consistent without hardware
    }
}
}

Configuration

Encryption Configuration

#![allow(unused)]
fn main() {
pub struct EncryptionConfig {
    pub algorithm: Algorithm,
    pub key_derivation: KeyDerivationFunction,
    pub key_rotation_interval: Duration,
    pub nonce_reuse_prevention: bool,
}

impl EncryptionConfig {
    pub fn maximum_security() -> Self {
        Self {
            algorithm: Algorithm::aes_256_gcm(),
            key_derivation: KeyDerivationFunction::Argon2,
            key_rotation_interval: Duration::from_secs(86400), // 24 hours
            nonce_reuse_prevention: true,
        }
    }

    pub fn balanced() -> Self {
        Self {
            algorithm: if has_aes_ni() {
                Algorithm::aes_256_gcm()
            } else {
                Algorithm::chacha20_poly1305()
            },
            key_derivation: KeyDerivationFunction::Scrypt,
            key_rotation_interval: Duration::from_secs(604800), // 7 days
            nonce_reuse_prevention: true,
        }
    }

    pub fn high_performance() -> Self {
        Self {
            algorithm: if has_aes_ni() {
                Algorithm::aes_128_gcm()
            } else {
                Algorithm::chacha20_poly1305()
            },
            key_derivation: KeyDerivationFunction::PBKDF2,
            key_rotation_interval: Duration::from_secs(2592000), // 30 days
            nonce_reuse_prevention: true,
        }
    }
}
}
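
Selecting a preset is usually a one-time decision driven by deployment constraints. A sketch; fips_required and low_latency are hypothetical flags, not part of the configuration API.

#![allow(unused)]
fn main() {
let fips_required = false; // hypothetical deployment flags
let low_latency = true;

let config = if fips_required {
    EncryptionConfig::maximum_security()
} else if low_latency {
    EncryptionConfig::high_performance()
} else {
    EncryptionConfig::balanced()
};
}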

Error Handling

#![allow(unused)]
fn main() {
#[derive(Debug, thiserror::Error)]
pub enum EncryptionError {
    #[error("Encryption failed: {0}")]
    EncryptionFailed(String),

    #[error("Decryption failed: {0}")]
    DecryptionFailed(String),

    #[error("Authentication failed - data may be tampered")]
    AuthenticationFailed,

    #[error("Invalid key length: expected {expected}, got {actual}")]
    InvalidKeyLength { expected: usize, actual: usize },

    #[error("Nonce reuse detected")]
    NonceReuse,

    #[error("Key derivation failed: {0}")]
    KeyDerivationFailed(String),
}

impl From<EncryptionError> for PipelineError {
    fn from(err: EncryptionError) -> Self {
        match err {
            EncryptionError::EncryptionFailed(msg) =>
                PipelineError::EncryptionError(msg),
            EncryptionError::DecryptionFailed(msg) =>
                PipelineError::DecryptionError(msg),
            EncryptionError::AuthenticationFailed =>
                PipelineError::IntegrityError("Authentication failed".to_string()),
            _ => PipelineError::EncryptionError(err.to_string()),
        }
    }
}
}

Usage Examples

Basic Encryption

use adaptive_pipeline::infrastructure::adapters::EncryptionServiceAdapter;
use adaptive_pipeline_domain::services::EncryptionService;
use adaptive_pipeline_domain::value_objects::Algorithm;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create encryption service
    let encryption = EncryptionServiceAdapter::new();

    // Generate encryption key
    let key = SecretKey::generate(32); // 256 bits

    // Encrypt data
    let data = b"Sensitive information";
    let encrypted = encryption.encrypt(
        data,
        &Algorithm::aes_256_gcm(),
        &key
    ).await?;

    println!("Original size: {} bytes", data.len());
    println!("Encrypted size: {} bytes", encrypted.ciphertext.len());
    println!("Nonce: {} bytes", encrypted.nonce.len());
    println!("Tag: {} bytes", encrypted.tag.len());

    // Decrypt data
    let decrypted = encryption.decrypt(
        &encrypted,
        &Algorithm::aes_256_gcm(),
        &key
    ).await?;

    assert_eq!(data, decrypted.as_slice());
    println!("✓ Decryption successful");

    Ok(())
}

Password-Based Encryption

#![allow(unused)]
fn main() {
async fn encrypt_with_password(
    data: &[u8],
    password: &str,
) -> Result<EncryptedData, PipelineError> {
    // Generate random salt
    let salt = SecretKey::generate(16);

    // Derive key from password
    let key = derive_key(
        password.as_bytes(),
        salt.as_bytes(),
        KeyDerivationFunction::Argon2,
    )?;

    // Encrypt data
    let encryption = EncryptionServiceAdapter::new();
    let mut encrypted = encryption.encrypt(
        data,
        &Algorithm::aes_256_gcm(),
        &key,
    ).await?;

    // Store the salt alongside the ciphertext so the key can be re-derived on
    // decryption (assumes EncryptedData also carries a salt field, in addition
    // to the ciphertext, nonce, and tag shown earlier)
    encrypted.salt = salt.as_bytes().to_vec();

    Ok(encrypted)
}
}

Tamper Detection

#![allow(unused)]
fn main() {
async fn decrypt_with_verification(
    encrypted: &EncryptedData,
    key: &SecretKey,
) -> Result<Vec<u8>, PipelineError> {
    let encryption = EncryptionServiceAdapter::new();

    // Attempt decryption (will fail if tampered)
    match encryption.decrypt(encrypted, &Algorithm::aes_256_gcm(), key).await {
        Ok(plaintext) => {
            println!("✓ Data is authentic and unmodified");
            Ok(plaintext)
        }
        Err(PipelineError::DecryptionError(_)) => {
            eprintln!("✗ Data has been tampered with!");
            Err(PipelineError::IntegrityError(
                "Authentication failed - data may be tampered".to_string()
            ))
        }
        Err(e) => Err(e),
    }
}
}

Benchmarks

Typical performance on modern systems:

Algorithm          | File Size | Encrypt Time | Decrypt Time | Throughput
-------------------|-----------|--------------|--------------|------------
AES-256-GCM (NI)   | 100 MB    | 0.04s        | 0.04s        | 2.5 GB/s
AES-256-GCM (SW)   | 100 MB    | 0.8s         | 0.8s         | 125 MB/s
ChaCha20-Poly1305  | 100 MB    | 0.15s        | 0.15s        | 670 MB/s
AES-128-GCM (NI)   | 100 MB    | 0.03s        | 0.03s        | 3.3 GB/s

Best Practices

Algorithm Selection

Use AES-256-GCM when:

  • Compliance requires FIPS-approved encryption
  • Long-term data protection is needed
  • Hardware has AES-NI support
  • Maximum security is required

Use ChaCha20-Poly1305 when:

  • Running on platforms without AES-NI
  • Constant-time execution is critical
  • Side-channel resistance is needed
  • Mobile/embedded deployment

Use AES-128-GCM when:

  • Maximum performance is required
  • Short-term data protection is sufficient
  • Hardware has AES-NI support

Key Management

#![allow(unused)]
fn main() {
// ✅ GOOD: Secure key handling
let key = SecretKey::generate(32);
let encrypted = encrypt(data, &key)?;
// Key is automatically zeroized on drop

// ❌ BAD: Exposing key in logs
println!("Key: {:?}", key);  // Never log keys!

// ✅ GOOD: Key derivation from password
let key = derive_key(password, salt, KeyDerivationFunction::Argon2)?;

// ❌ BAD: Weak key derivation
let key = sha256(password);  // Not secure!
}

Nonce Management

#![allow(unused)]
fn main() {
// ✅ GOOD: Unique nonce per encryption
let nonce = generate_unique_nonce();

// ❌ BAD: Reusing nonces
let nonce = [0u8; 12];  // NEVER reuse nonces!

// ✅ GOOD: Counter-based nonces
let nonce_counter = AtomicU64::new(0);
let nonce = generate_nonce_from_counter(nonce_counter.fetch_add(1, Ordering::SeqCst));
}

Authentication Verification

#![allow(unused)]
fn main() {
// ✅ GOOD: Always verify authentication
match decrypt(encrypted, key) {
    Ok(data) => process(data),
    Err(e) => {
        log::error!("Decryption failed - possible tampering");
        return Err(e);
    }
}

// ❌ BAD: Ignoring authentication failures
let data = decrypt(encrypted, key).unwrap_or_default();  // Dangerous!
}

Security Considerations

Nonce Uniqueness

  • Critical: Never reuse a nonce with the same key
  • Use counter-based or random nonces
  • Rotate keys after 2^32 encryptions (GCM limit)

Key Strength

  • Minimum 256 bits for long-term security
  • Use cryptographically secure random number generators
  • Derive keys properly from passwords (use Argon2)

Memory Security

  • Keys are automatically zeroized on drop
  • Avoid cloning keys unnecessarily
  • Don't log or print keys

Side-Channel Attacks

  • ChaCha20 provides constant-time execution
  • AES requires AES-NI for timing attack resistance
  • Validate all inputs before decryption

Next Steps

Now that you understand encryption implementation:

Integrity Verification

Version: 1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner, Claude Code Status: Active

Overview

Integrity verification ensures data hasn't been corrupted or tampered with during processing. The pipeline system uses cryptographic hash functions to calculate checksums at various stages, providing strong guarantees about data integrity.

The checksum service operates in two modes:

  • Calculate Mode: Generates checksums for data chunks
  • Verify Mode: Validates existing checksums to detect tampering
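
Both modes are exposed through the same ChecksumProcessor shown later in this chapter; the constructor flag selects the behavior. An illustrative sketch:

#![allow(unused)]
fn main() {
// Calculate mode: compute and attach checksums, don't verify existing ones
let calculate = ChecksumProcessor::sha256_processor(false);

// Verify mode: validate checksums already attached to incoming chunks
let verify = ChecksumProcessor::sha256_processor(true);
}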

Supported Algorithms

SHA-256

Industry-standard cryptographic hash function

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::value_objects::Algorithm;

let algorithm = Algorithm::sha256();
}

Characteristics:

  • Hash Size: 256 bits (32 bytes)
  • Security: Cryptographically secure, collision-resistant
  • Performance: ~500 MB/s (software), ~2 GB/s (hardware accelerated)
  • Use Cases: General-purpose integrity verification

When to Use:

  • ✅ General-purpose integrity verification
  • ✅ Compliance requirements (FIPS 180-4)
  • ✅ Cross-platform compatibility
  • ✅ Hardware acceleration available (SHA extensions)

SHA-512

Stronger variant of SHA-2 family

#![allow(unused)]
fn main() {
let algorithm = Algorithm::sha512();
}

Characteristics:

  • Hash Size: 512 bits (64 bytes)
  • Security: Higher security margin than SHA-256
  • Performance: ~400 MB/s (software), faster on 64-bit systems
  • Use Cases: High-security requirements, 64-bit optimized systems

When to Use:

  • ✅ Maximum security requirements
  • ✅ 64-bit systems (better performance)
  • ✅ Long-term archival (future-proof security)
  • ❌ Resource-constrained systems (larger output)

BLAKE3

Modern, high-performance cryptographic hash

#![allow(unused)]
fn main() {
let algorithm = Algorithm::blake3();
}

Characteristics:

  • Hash Size: 256 bits (32 bytes, configurable)
  • Security: Based on BLAKE2, ChaCha stream cipher
  • Performance: ~3 GB/s (parallelizable, SIMD-optimized)
  • Use Cases: High-throughput processing, modern systems

When to Use:

  • ✅ Maximum performance requirements
  • ✅ Large file processing (highly parallelizable)
  • ✅ Modern CPUs with SIMD support
  • ✅ No regulatory compliance requirements
  • ❌ FIPS compliance needed (not FIPS certified)

Algorithm Comparison

Algorithm | Hash Size | Throughput | Security | Hardware Accel | FIPS
----------|-----------|------------|----------|----------------|------
SHA-256   | 256 bits  | 500 MB/s   | Strong   | ✅ (SHA-NI)    | ✅
SHA-512   | 512 bits  | 400 MB/s   | Stronger | ✅ (SHA-NI)    | ✅
BLAKE3    | 256 bits  | 3 GB/s     | Strong   | ✅ (SIMD)      | ❌

Performance measured on Intel i7-10700K @ 3.8 GHz

Architecture

Service Interface

The domain layer defines the checksum service interface:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::services::ChecksumService;
use adaptive_pipeline_domain::entities::ProcessingContext;
use adaptive_pipeline_domain::value_objects::FileChunk;
use adaptive_pipeline_domain::PipelineError;

/// Domain service for integrity verification
pub trait ChecksumService: Send + Sync {
    /// Process a chunk and update the running checksum
    fn process_chunk(
        &self,
        chunk: FileChunk,
        context: &mut ProcessingContext,
        stage_name: &str,
    ) -> Result<FileChunk, PipelineError>;

    /// Get the final checksum value
    fn get_checksum(
        &self,
        context: &ProcessingContext,
        stage_name: &str
    ) -> Option<String>;
}
}

Implementation

The infrastructure layer provides concrete implementations:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::services::ChecksumService;

/// Concrete checksum processor using SHA-256
pub struct ChecksumProcessor {
    pub algorithm: String,
    pub verify_existing: bool,
}

impl ChecksumProcessor {
    pub fn new(algorithm: String, verify_existing: bool) -> Self {
        Self {
            algorithm,
            verify_existing,
        }
    }

    /// Creates a SHA-256 processor
    pub fn sha256_processor(verify_existing: bool) -> Self {
        Self::new("SHA256".to_string(), verify_existing)
    }
}
}

Algorithm Implementations

SHA-256 Implementation

#![allow(unused)]
fn main() {
use sha2::{Digest, Sha256};

impl ChecksumProcessor {
    /// Calculate SHA-256 checksum
    pub fn calculate_sha256(&self, data: &[u8]) -> String {
        let mut hasher = Sha256::new();
        hasher.update(data);
        format!("{:x}", hasher.finalize())
    }

    /// Incremental SHA-256 hashing
    pub fn update_hash(&self, hasher: &mut Sha256, chunk: &FileChunk) {
        hasher.update(chunk.data());
    }

    /// Finalize hash and return hex string
    pub fn finalize_hash(&self, hasher: Sha256) -> String {
        format!("{:x}", hasher.finalize())
    }
}
}

Key Features:

  • Incremental hashing for streaming large files (see the sketch below)
  • Memory-efficient (constant 32-byte state)
  • Hardware acceleration with SHA-NI instructions
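
To make the streaming workflow concrete, here is a minimal sketch that feeds chunks through the incremental API shown above. It assumes a `chunks: Vec<FileChunk>` collection is already loaded; the processor and hash helpers are the ones defined earlier in this section.

#![allow(unused)]
fn main() {
use sha2::{Digest, Sha256};

// Hash an arbitrary number of chunks without buffering the whole file.
let processor = ChecksumProcessor::sha256_processor(false);
let mut hasher = Sha256::new();

for chunk in &chunks {
    // Constant memory: only the 32-byte hasher state is carried between chunks.
    processor.update_hash(&mut hasher, chunk);
}

let file_checksum = processor.finalize_hash(hasher);
println!("File SHA-256: {}", file_checksum);
}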

SHA-512 Implementation

#![allow(unused)]
fn main() {
use sha2::{Digest, Sha512};

impl ChecksumProcessor {
    /// Calculate SHA-512 checksum
    pub fn calculate_sha512(&self, data: &[u8]) -> String {
        let mut hasher = Sha512::new();
        hasher.update(data);
        format!("{:x}", hasher.finalize())
    }
}
}

Key Features:

  • 512-bit output for higher security margin
  • Optimized for 64-bit architectures
  • Suitable for long-term archival

BLAKE3 Implementation

#![allow(unused)]
fn main() {
use blake3::Hasher;

impl ChecksumProcessor {
    /// Calculate BLAKE3 checksum
    pub fn calculate_blake3(&self, data: &[u8]) -> String {
        let mut hasher = Hasher::new();
        hasher.update(data);
        hasher.finalize().to_hex().to_string()
    }

    /// Parallel BLAKE3 hashing
    pub fn calculate_blake3_parallel(&self, chunks: &[&[u8]]) -> String {
        let mut hasher = Hasher::new();
        for chunk in chunks {
            hasher.update(chunk);
        }
        hasher.finalize().to_hex().to_string()
    }
}
}

Key Features:

  • Highly parallelizable (uses Rayon internally)
  • SIMD-optimized for modern CPUs
  • Incremental and streaming support
  • Up to 6x faster than SHA-256

Chunk Processing

ChunkProcessor Trait

The checksum service implements the ChunkProcessor trait for integration with the pipeline:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::services::file_processor_service::ChunkProcessor;

impl ChunkProcessor for ChecksumProcessor {
    /// Process chunk with checksum calculation/verification
    fn process_chunk(&self, chunk: &FileChunk) -> Result<FileChunk, PipelineError> {
        // Step 1: Verify existing checksum if requested
        if self.verify_existing && chunk.checksum().is_some() {
            let is_valid = chunk.verify_integrity()?;
            if !is_valid {
                return Err(PipelineError::IntegrityError(format!(
                    "Checksum verification failed for chunk {}",
                    chunk.sequence_number()
                )));
            }
        }

        // Step 2: Ensure chunk has checksum (calculate if missing)
        if chunk.checksum().is_none() {
            chunk.with_calculated_checksum()
        } else {
            Ok(chunk.clone())
        }
    }

    fn name(&self) -> &str {
        "ChecksumProcessor"
    }

    fn modifies_data(&self) -> bool {
        false // Only modifies metadata
    }
}
}

Integrity Verification

The FileChunk value object provides built-in integrity verification:

#![allow(unused)]
fn main() {
impl FileChunk {
    /// Verify chunk integrity against stored checksum
    pub fn verify_integrity(&self) -> Result<bool, PipelineError> {
        match &self.checksum {
            Some(stored_checksum) => {
                let calculated = Self::calculate_checksum(self.data());
                Ok(*stored_checksum == calculated)
            }
            None => Err(PipelineError::InvalidConfiguration(
                "No checksum to verify".to_string()
            )),
        }
    }

    /// Calculate checksum for chunk data
    fn calculate_checksum(data: &[u8]) -> String {
        use sha2::{Digest, Sha256};
        let mut hasher = Sha256::new();
        hasher.update(data);
        format!("{:x}", hasher.finalize())
    }

    /// Create new chunk with calculated checksum
    pub fn with_calculated_checksum(&self) -> Result<FileChunk, PipelineError> {
        let checksum = Self::calculate_checksum(self.data());
        Ok(FileChunk {
            sequence_number: self.sequence_number,
            data: self.data.clone(),
            checksum: Some(checksum),
            metadata: self.metadata.clone(),
        })
    }
}
}

Performance Optimizations

Parallel Chunk Processing

Process multiple chunks in parallel using Rayon:

#![allow(unused)]
fn main() {
use rayon::prelude::*;

impl ChecksumProcessor {
    /// Process chunks in parallel for maximum throughput
    pub fn process_chunks_parallel(
        &self,
        chunks: &[FileChunk]
    ) -> Result<Vec<FileChunk>, PipelineError> {
        chunks
            .par_iter()
            .map(|chunk| self.process_chunk(chunk))
            .collect()
    }
}
}

Performance Benefits:

  • Linear Scaling: Performance scales with CPU cores
  • No Contention: Each chunk processed independently
  • 2-4x Speedup: On typical multi-core systems

Hardware Acceleration

Leverage CPU crypto extensions when available:

#![allow(unused)]
fn main() {
/// Check for SHA hardware acceleration
pub fn has_sha_extensions() -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        is_x86_feature_detected!("sha")
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        false
    }
}

/// Select optimal algorithm based on hardware
pub fn optimal_hash_algorithm() -> Algorithm {
    if has_sha_extensions() {
        Algorithm::sha256() // Hardware accelerated
    } else {
        Algorithm::blake3() // Software optimized
    }
}
}

Memory Management

Minimize allocations during hash calculation:

#![allow(unused)]
fn main() {
impl ChecksumProcessor {
    /// Reuse buffer for hash calculations
    pub fn calculate_with_buffer(
        &self,
        data: &[u8],
        buffer: &mut Vec<u8>
    ) -> String {
        buffer.clear();
        buffer.extend_from_slice(data);
        self.calculate_sha256(buffer)
    }
}
}

Configuration

Stage Configuration

Configure integrity stages in your pipeline:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::entities::PipelineStage;
use adaptive_pipeline_domain::value_objects::{Algorithm, StageType};

// Input integrity verification
let input_stage = PipelineStage::new(
    "input_checksum",
    StageType::Integrity,
    Algorithm::sha256(),
)?;

// Output integrity verification
let output_stage = PipelineStage::new(
    "output_checksum",
    StageType::Integrity,
    Algorithm::blake3(), // Faster for final verification
)?;
}

Verification Mode

Enable checksum verification for existing data:

#![allow(unused)]
fn main() {
// Calculate checksums only (default)
let processor = ChecksumProcessor::new("SHA256".to_string(), false);

// Verify existing checksums before processing
let verifying_processor = ChecksumProcessor::new("SHA256".to_string(), true);
}

Algorithm Selection

Choose algorithm based on requirements:

#![allow(unused)]
fn main() {
pub fn select_hash_algorithm(
    security_level: SecurityLevel,
    performance_priority: bool,
) -> Algorithm {
    match (security_level, performance_priority) {
        (SecurityLevel::Maximum, _) => Algorithm::sha512(),
        (SecurityLevel::High, false) => Algorithm::sha256(),
        (SecurityLevel::High, true) => Algorithm::blake3(),
        (SecurityLevel::Standard, _) => Algorithm::blake3(),
    }
}
}

Error Handling

Error Types

The service handles various error conditions:

#![allow(unused)]
fn main() {
pub enum IntegrityError {
    /// Checksum verification failed
    ChecksumMismatch {
        expected: String,
        actual: String,
        chunk: u64,
    },

    /// Invalid algorithm specified
    UnsupportedAlgorithm(String),

    /// Hash calculation failed
    HashCalculationError(String),

    /// Chunk data corrupted
    CorruptedData {
        chunk: u64,
        reason: String,
    },
}
}
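
As a brief illustration of how a verification failure might surface one of these variants, here is a hedged sketch; the `check_chunk` helper is hypothetical and not part of the service API.

#![allow(unused)]
fn main() {
// Hypothetical helper: compare an expected checksum against a recalculated one.
fn check_chunk(expected: &str, actual: &str, chunk: u64) -> Result<(), IntegrityError> {
    if expected != actual {
        return Err(IntegrityError::ChecksumMismatch {
            expected: expected.to_string(),
            actual: actual.to_string(),
            chunk,
        });
    }
    Ok(())
}

match check_chunk("abc123...", "def456...", 42) {
    Err(IntegrityError::ChecksumMismatch { chunk, .. }) => {
        eprintln!("Chunk {} failed verification", chunk);
    }
    _ => {}
}
}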

Error Recovery

Handle integrity errors gracefully:

#![allow(unused)]
fn main() {
impl ChecksumProcessor {
    pub fn process_with_retry(
        &self,
        chunk: &FileChunk,
        max_retries: u32
    ) -> Result<FileChunk, PipelineError> {
        let mut attempts = 0;

        loop {
            match self.process_chunk(chunk) {
                Ok(result) => return Ok(result),
                Err(PipelineError::IntegrityError(msg)) if attempts < max_retries => {
                    attempts += 1;
                    eprintln!("Integrity check failed (attempt {}/{}): {}",
                        attempts, max_retries, msg);
                    continue;
                }
                Err(e) => return Err(e),
            }
        }
    }
}
}

Usage Examples

Basic Checksum Calculation

Calculate SHA-256 checksums for data:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::services::ChecksumProcessor;

fn calculate_file_checksum(data: &[u8]) -> Result<String, PipelineError> {
    let processor = ChecksumProcessor::sha256_processor(false);
    let checksum = processor.calculate_sha256(data);
    Ok(checksum)
}

// Example usage
let data = b"Hello, world!";
let checksum = calculate_file_checksum(data)?;
println!("SHA-256: {}", checksum);
// Output: SHA-256: 315f5bdb76d078c43b8ac0064e4a0164612b1fce77c869345bfc94c75894edd3
}

Integrity Verification

Verify data hasn't been tampered with:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::value_objects::FileChunk;

fn verify_chunk_integrity(chunk: &FileChunk) -> Result<bool, PipelineError> {
    let processor = ChecksumProcessor::sha256_processor(true);

    // Process with verification enabled
    match processor.process_chunk(chunk) {
        Ok(_) => Ok(true),
        Err(PipelineError::IntegrityError(_)) => Ok(false),
        Err(e) => Err(e),
    }
}

// Example usage
let chunk = FileChunk::new(0, data.to_vec())?
    .with_calculated_checksum()?;

if verify_chunk_integrity(&chunk)? {
    println!("✓ Chunk integrity verified");
} else {
    println!("✗ Chunk has been tampered with!");
}
}

Pipeline Integration

Integrate checksums into processing pipeline:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::entities::{Pipeline, PipelineStage};

fn create_verified_pipeline() -> Result<Pipeline, PipelineError> {
    let stages = vec![
        // Input verification
        PipelineStage::new(
            "input_checksum",
            StageType::Integrity,
            Algorithm::sha256(),
        )?,

        // Processing stages...
        PipelineStage::new(
            "compression",
            StageType::Compression,
            Algorithm::zstd(),
        )?,

        // Output verification
        PipelineStage::new(
            "output_checksum",
            StageType::Integrity,
            Algorithm::sha256(),
        )?,
    ];

    Pipeline::new("verified-pipeline".to_string(), stages)
}
}

Parallel Processing

Process multiple chunks with maximum performance:

#![allow(unused)]
fn main() {
use rayon::prelude::*;

fn hash_large_file(chunks: Vec<FileChunk>) -> Result<Vec<String>, PipelineError> {
    let processor = ChecksumProcessor::sha256_processor(false);

    chunks.par_iter()
        .map(|chunk| processor.calculate_sha256(chunk.data()))
        .collect()
}

// Example: Hash 1000 chunks in parallel
let checksums = hash_large_file(chunks)?;
println!("Processed {} chunks", checksums.len());
}

Benchmarks

SHA-256 Performance

File Size: 100 MB, Chunk Size: 1 MB

Configuration       | Throughput | Total Time | CPU Usage
--------------------|------------|------------|----------------
Single-threaded     | 500 MB/s   | 200ms      | 100% (1 core)
Parallel (4 cores)  | 1.8 GB/s   | 56ms       | 400% (4 cores)
Hardware accel      | 2.0 GB/s   | 50ms       | 100% (1 core)

SHA-512 Performance

File Size: 100 MB, Chunk Size: 1 MB

Configuration       | Throughput | Total Time | CPU Usage
--------------------|------------|------------|----------------
Single-threaded     | 400 MB/s   | 250ms      | 100% (1 core)
Parallel (4 cores)  | 1.5 GB/s   | 67ms       | 400% (4 cores)

BLAKE3 Performance

File Size: 100 MB, Chunk Size: 1 MB

Configuration       | Throughput | Total Time | CPU Usage
--------------------|------------|------------|----------------
Single-threaded     | 1.2 GB/s   | 83ms       | 100% (1 core)
Parallel (4 cores)  | 3.2 GB/s   | 31ms       | 400% (4 cores)
SIMD optimized      | 3.5 GB/s   | 29ms       | 100% (1 core)

Test Environment: Intel i7-10700K @ 3.8 GHz, 32GB RAM, Ubuntu 22.04
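
These figures are environment-specific. A rough way to measure throughput on your own hardware (a quick sketch, not a rigorous benchmark harness) is:

#![allow(unused)]
fn main() {
use std::time::Instant;

// Time a single-threaded SHA-256 pass over an in-memory buffer.
let data = vec![0u8; 100 * 1024 * 1024]; // 100 MB test buffer
let processor = ChecksumProcessor::sha256_processor(false);

let start = Instant::now();
let _checksum = processor.calculate_sha256(&data);
let elapsed = start.elapsed();

let mb = data.len() as f64 / (1024.0 * 1024.0);
println!("{:.0} MB in {:?} ({:.0} MB/s)", mb, elapsed, mb / elapsed.as_secs_f64());
}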

Algorithm Recommendations by Use Case

Use Case          | Recommended Algorithm | Reason
------------------|-----------------------|------------------------------------------
General integrity | SHA-256               | Industry standard, FIPS certified
High security     | SHA-512               | Larger output, stronger security margin
High throughput   | BLAKE3                | 3-6x faster, highly parallelizable
Compliance        | SHA-256               | FIPS 180-4 certified
Archival          | SHA-512               | Future-proof security
Real-time         | BLAKE3                | Lowest latency

Best Practices

Algorithm Selection

Choose the right algorithm for your requirements:

#![allow(unused)]
fn main() {
// Compliance requirements
let algorithm = if needs_fips_compliance {
    Algorithm::sha256() // FIPS 180-4 certified
}
// Maximum security
else if security_level == SecurityLevel::Maximum {
    Algorithm::sha512() // Stronger security margin
}
// Performance critical
else if throughput_priority {
    Algorithm::blake3() // 3-6x faster
}
// Default
else {
    Algorithm::sha256() // Industry standard
};
}

Verification Strategy

Implement defense-in-depth verification:

#![allow(unused)]
fn main() {
// 1. Input verification (detect source corruption)
let input_checksum_stage = PipelineStage::new(
    "input_verify",
    StageType::Integrity,
    Algorithm::sha256(),
)?;

// 2. Processing stages...

// 3. Output verification (detect processing corruption)
let output_checksum_stage = PipelineStage::new(
    "output_verify",
    StageType::Integrity,
    Algorithm::sha256(),
)?;
}

Performance Optimization

Optimize for your workload:

#![allow(unused)]
fn main() {
// Small files (<10 MB): use single-threaded hashing
if file_size < 10 * 1024 * 1024 {
    let checksum = processor.calculate_sha256(data);
}
// Large files: use parallel processing
else {
    let checksums = processor.process_chunks_parallel(&chunks);
}

// Hardware acceleration available: use SHA-256
let algorithm = if has_sha_extensions() {
    Algorithm::sha256()
}
// No hardware acceleration: use BLAKE3
else {
    Algorithm::blake3()
};
}

Error Handling

Handle integrity failures appropriately:

#![allow(unused)]
fn main() {
match processor.process_chunk(&chunk) {
    Ok(verified_chunk) => {
        // Integrity verified, continue processing
        process_chunk(verified_chunk)
    }
    Err(PipelineError::IntegrityError(msg)) => {
        // Log error and attempt recovery
        eprintln!("Integrity failure: {}", msg);

        // Option 1: Retry from source
        let fresh_chunk = reload_chunk_from_source()?;
        processor.process_chunk(&fresh_chunk)
    }
    Err(e) => return Err(e),
}
}

Security Considerations

Cryptographic Strength

All supported algorithms are cryptographically secure:

  • SHA-256: 128-bit security level (2^128 operations for collision)
  • SHA-512: 256-bit security level (2^256 operations for collision)
  • BLAKE3: 128-bit security level (based on ChaCha20)

Collision Resistance

Practical collision resistance:

#![allow(unused)]
fn main() {
// SHA-256 collision resistance: ~2^128 operations
// Effectively impossible with current technology
let sha256_security_bits = 128;

// SHA-512 collision resistance: ~2^256 operations
// Provides future-proof security margin
let sha512_security_bits = 256;
}

Tampering Detection

Checksums detect any modification:

#![allow(unused)]
fn main() {
// Even single-bit changes produce completely different hashes
let original = "Hello, World!";
let tampered = "Hello, world!"; // Changed 'W' to 'w'

let hash1 = processor.calculate_sha256(original.as_bytes());
let hash2 = processor.calculate_sha256(tampered.as_bytes());

assert_ne!(hash1, hash2); // Completely different hashes
}

Not for Authentication

Important: Checksums alone don't provide authentication:

#![allow(unused)]
fn main() {
// ❌ WRONG: Checksum alone doesn't prove authenticity
let checksum = calculate_sha256(data);
// Attacker can modify data AND update checksum

// ✅ CORRECT: Use HMAC for authentication
let hmac = calculate_hmac_sha256(data, secret_key);
// Attacker cannot forge HMAC without secret key
}

Use HMAC or digital signatures for authentication.
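
For completeness, here is a minimal HMAC-SHA256 sketch using the `hmac` and `sha2` crates; this is an assumption for illustration, not a helper shipped by the pipeline.

#![allow(unused)]
fn main() {
use hmac::{Hmac, Mac};
use sha2::Sha256;

type HmacSha256 = Hmac<Sha256>;

let secret_key = b"example-secret-key";
let data = b"message to authenticate";

// Producer: only holders of the secret key can create a valid tag.
let mut mac = HmacSha256::new_from_slice(secret_key).expect("HMAC accepts any key length");
mac.update(data);
let tag = mac.finalize().into_bytes();

// Verifier: recompute the tag and compare in constant time.
let mut verifier = HmacSha256::new_from_slice(secret_key).expect("HMAC accepts any key length");
verifier.update(data);
verifier.verify_slice(&tag).expect("data is authentic");
}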

Next Steps

Now that you understand integrity verification:

Data Persistence

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

This chapter provides a comprehensive overview of the data persistence architecture in the adaptive pipeline system. Learn how the repository pattern, SQLite database, and schema management work together to provide reliable, efficient data storage.


Overview

Data persistence in the adaptive pipeline system follows Domain-Driven Design principles, separating domain logic from infrastructure concerns through the repository pattern. The system uses SQLite for reliable, zero-configuration data storage with full ACID transaction support.

Key Features

  • Repository Pattern: Abstraction layer between domain and infrastructure
  • SQLite Database: Embedded database with zero configuration
  • Schema Management: Automated migrations with sqlx
  • ACID Transactions: Full transactional support for data consistency
  • Connection Pooling: Efficient connection management
  • Type Safety: Compile-time query validation

Persistence Stack

┌──────────────────────────────────────────────────────────┐
│                    Domain Layer                          │
│  ┌────────────────────────────────────────────────┐     │
│  │   PipelineRepository (Trait)                   │     │
│  │   - save(), find_by_id(), list_all()          │     │
│  └────────────────────────────────────────────────┘     │
└──────────────────────────────────────────────────────────┘
                         ↓ implements
┌──────────────────────────────────────────────────────────┐
│                Infrastructure Layer                       │
│  ┌────────────────────────────────────────────────┐     │
│  │   SqlitePipelineRepository                     │     │
│  │   - Concrete SQLite implementation             │     │
│  └────────────────────────────────────────────────┘     │
│                         ↓ uses                           │
│  ┌────────────────────────────────────────────────┐     │
│  │   Schema Management                            │     │
│  │   - Migrations, initialization                 │     │
│  └────────────────────────────────────────────────┘     │
└──────────────────────────────────────────────────────────┘
                         ↓ persists to
┌──────────────────────────────────────────────────────────┐
│                  SQLite Database                         │
│  ┌────────────┬──────────────┬──────────────────┐      │
│  │ pipelines  │pipeline_stage│pipeline_config   │      │
│  └────────────┴──────────────┴──────────────────┘      │
└──────────────────────────────────────────────────────────┘

Design Principles

  1. Separation of Concerns: Domain logic independent of storage technology
  2. Testability: Easy mocking with in-memory implementations
  3. Flexibility: Support for different storage backends
  4. Consistency: ACID transactions ensure data integrity
  5. Performance: Connection pooling and query optimization

Persistence Architecture

The persistence layer follows a three-tier architecture aligned with Domain-Driven Design.

Architectural Layers

┌─────────────────────────────────────────────────────────────┐
│ Application Layer                                           │
│  - PipelineService uses repository trait                    │
│  - Business logic remains persistence-agnostic              │
└─────────────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────────────┐
│ Domain Layer                                                │
│  - PipelineRepository trait (abstract interface)            │
│  - Pipeline, PipelineStage entities                         │
│  - No infrastructure dependencies                           │
└─────────────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────────────┐
│ Infrastructure Layer                                        │
│  - SqlitePipelineRepository (concrete implementation)       │
│  - Schema management and migrations                         │
│  - Connection pooling and transaction management            │
└─────────────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────────────┐
│ Storage Layer                                               │
│  - SQLite database file                                     │
│  - Indexes and constraints                                  │
│  - Migration history tracking                               │
└─────────────────────────────────────────────────────────────┘

Component Relationships

Domain-to-Infrastructure Flow:

#![allow(unused)]
fn main() {
// Application code depends on domain trait
use adaptive_pipeline_domain::repositories::PipelineRepository;

async fn create_pipeline(
    repo: &dyn PipelineRepository,
    name: String,
) -> Result<Pipeline, PipelineError> {
    let pipeline = Pipeline::new(name)?;
    repo.save(&pipeline).await?;
    Ok(pipeline)
}

// Infrastructure provides concrete implementation
use adaptive_pipeline::infrastructure::repositories::SqlitePipelineRepository;

let repository = SqlitePipelineRepository::new(pool);
let pipeline = create_pipeline(&repository, "my-pipeline".to_string()).await?;
}

Benefits of This Architecture

Benefit             | Description
--------------------|--------------------------------------------------------
Domain Independence | Business logic doesn't depend on SQLite specifics
Testability         | Easy to mock repositories for unit testing
Flexibility         | Can swap SQLite for PostgreSQL without changing domain
Maintainability     | Clear separation makes code easier to understand
Type Safety         | Compile-time verification of database operations

Repository Pattern

The repository pattern provides an abstraction layer between domain entities and data storage.

Repository Pattern Benefits

1. Separation of Concerns

Domain logic remains free from persistence details:

#![allow(unused)]
fn main() {
// Domain layer - storage-agnostic
#[async_trait]
pub trait PipelineRepository: Send + Sync {
    async fn save(&self, pipeline: &Pipeline) -> Result<(), PipelineError>;
    async fn find_by_id(&self, id: &PipelineId) -> Result<Option<Pipeline>, PipelineError>;
    async fn list_all(&self) -> Result<Vec<Pipeline>, PipelineError>;
    async fn delete(&self, id: &PipelineId) -> Result<bool, PipelineError>;
}
}

2. Implementation Flexibility

Multiple storage backends can implement the same interface:

#![allow(unused)]
fn main() {
// SQLite implementation
pub struct SqlitePipelineRepository { /* ... */ }

// PostgreSQL implementation
pub struct PostgresPipelineRepository { /* ... */ }

// In-memory testing implementation
pub struct InMemoryPipelineRepository { /* ... */ }

// All implement the same PipelineRepository trait
}

3. Enhanced Testability

Mock implementations simplify testing:

#![allow(unused)]
fn main() {
#[cfg(test)]
struct MockPipelineRepository {
    pipelines: Arc<Mutex<HashMap<PipelineId, Pipeline>>>,
}

#[async_trait]
impl PipelineRepository for MockPipelineRepository {
    async fn save(&self, pipeline: &Pipeline) -> Result<(), PipelineError> {
        let mut pipelines = self.pipelines.lock().await;
        pipelines.insert(pipeline.id().clone(), pipeline.clone());
        Ok(())
    }
    // ... implement other methods
}
}

Repository Interface Design

Method Categories:

Category   | Methods                                      | Purpose
-----------|----------------------------------------------|------------------
CRUD       | save(), find_by_id(), update(), delete()     | Basic operations
Queries    | find_by_name(), find_all(), list_paginated() | Data retrieval
Validation | exists(), count()                            | Existence checks
Lifecycle  | archive(), restore(), list_archived()        | Soft deletion
Search     | find_by_config()                             | Advanced queries

Complete Interface:

#![allow(unused)]
fn main() {
#[async_trait]
pub trait PipelineRepository: Send + Sync {
    // CRUD Operations
    async fn save(&self, pipeline: &Pipeline) -> Result<(), PipelineError>;
    async fn find_by_id(&self, id: &PipelineId) -> Result<Option<Pipeline>, PipelineError>;
    async fn update(&self, pipeline: &Pipeline) -> Result<(), PipelineError>;
    async fn delete(&self, id: &PipelineId) -> Result<bool, PipelineError>;

    // Query Operations
    async fn find_by_name(&self, name: &str) -> Result<Option<Pipeline>, PipelineError>;
    async fn find_all(&self) -> Result<Vec<Pipeline>, PipelineError>;
    async fn list_all(&self) -> Result<Vec<Pipeline>, PipelineError>;
    async fn list_paginated(&self, offset: usize, limit: usize)
        -> Result<Vec<Pipeline>, PipelineError>;

    // Validation Operations
    async fn exists(&self, id: &PipelineId) -> Result<bool, PipelineError>;
    async fn count(&self) -> Result<usize, PipelineError>;

    // Lifecycle Operations
    async fn archive(&self, id: &PipelineId) -> Result<bool, PipelineError>;
    async fn restore(&self, id: &PipelineId) -> Result<bool, PipelineError>;
    async fn list_archived(&self) -> Result<Vec<Pipeline>, PipelineError>;

    // Search Operations
    async fn find_by_config(&self, key: &str, value: &str)
        -> Result<Vec<Pipeline>, PipelineError>;
}
}

Database Choice: SQLite

The system uses SQLite as the default database for its simplicity, reliability, and zero-configuration deployment.

Why SQLite?

Advantage          | Description
-------------------|---------------------------------------------
Zero Configuration | No database server to install or configure
Single File        | Entire database stored in one file
ACID Compliant     | Full transactional support
Cross-Platform     | Works on Linux, macOS, Windows
Embedded           | Runs in-process, no network overhead
Reliable           | Battle-tested, used in production worldwide
Fast               | Optimized for local file access

SQLite Characteristics

Performance Profile:

Operation          | Speed      | Notes
-------------------|------------|--------------------------------
Single INSERT      | ~0.1ms     | Very fast for local file
Batch INSERT       | ~10ms/1000 | Use transactions for batching
Single SELECT      | ~0.05ms    | Fast with proper indexes
Complex JOIN       | ~1-5ms     | Depends on dataset size
Full table scan    | ~10ms/10K  | Avoid without indexes

Limitations:

  • Concurrent Writes: Only one writer at a time (readers can be concurrent)
  • Network Access: Not designed for network file systems
  • Database Size: ~281 TB theoretical maximum, far beyond typical single-server use
  • Scalability: Best for single-server deployments

When SQLite is Ideal:

  • ✅ Single-server applications
  • ✅ Embedded systems
  • ✅ Desktop applications
  • ✅ Development and testing
  • ✅ Low-to-medium write concurrency

When to Consider Alternatives:

  • ❌ High concurrent write workload
  • ❌ Multi-server deployments
  • ❌ Network file systems
  • ❌ Very large datasets (> 100GB)

SQLite Configuration

Connection String:

#![allow(unused)]
fn main() {
// Local file database
let url = "sqlite://./pipeline.db";

// In-memory database (testing)
let url = "sqlite::memory:";

// Custom connection options
use sqlx::sqlite::SqliteConnectOptions;
let options = SqliteConnectOptions::new()
    .filename("./pipeline.db")
    .create_if_missing(true)
    .foreign_keys(true)
    .journal_mode(sqlx::sqlite::SqliteJournalMode::Wal);
}

Connection Pool Configuration:

#![allow(unused)]
fn main() {
use sqlx::sqlite::SqlitePoolOptions;

let pool = SqlitePoolOptions::new()
    .max_connections(5)          // Connection pool size
    .min_connections(1)          // Minimum connections
    .acquire_timeout(Duration::from_secs(30))
    .idle_timeout(Duration::from_secs(600))
    .connect(&database_url)
    .await?;
}

Storage Architecture

The storage layer uses a normalized relational schema with five core tables.

Database Schema Overview

┌─────────────┐
│  pipelines  │ (id, name, archived, created_at, updated_at)
└──────┬──────┘
       │ 1:N
       ├──────────────────────────┐
       │                          │
       ↓                          ↓
┌──────────────────┐    ┌──────────────────┐
│ pipeline_stages  │    │pipeline_config   │
│ (id, pipeline_id,│    │(pipeline_id, key,│
│  name, type,     │    │ value)           │
│  algorithm, ...)  │    └──────────────────┘
└──────┬───────────┘
       │ 1:N
       ↓
┌──────────────────┐
│stage_parameters  │
│(stage_id, key,   │
│ value)           │
└──────────────────┘

Table Purposes

Table                  | Purpose                | Key Fields
-----------------------|------------------------|--------------------------------------------
pipelines              | Core pipeline entity   | id (PK), name (UNIQUE), archived
pipeline_stages        | Stage definitions      | id (PK), pipeline_id (FK), stage_order
pipeline_configuration | Pipeline-level config  | (pipeline_id, key) composite PK
stage_parameters       | Stage-level parameters | (stage_id, key) composite PK
processing_metrics     | Execution metrics      | pipeline_id (PK/FK), progress, performance

Schema Design Principles

1. Normalization

Data is normalized to reduce redundancy:

-- ✅ Normalized: Stages reference pipeline
CREATE TABLE pipeline_stages (
    id TEXT PRIMARY KEY,
    pipeline_id TEXT NOT NULL,  -- Foreign key
    FOREIGN KEY (pipeline_id) REFERENCES pipelines(id) ON DELETE CASCADE
);

-- ❌ Denormalized: Duplicating pipeline data
-- Each stage would store pipeline name, created_at, etc.

2. Referential Integrity

Foreign keys enforce data consistency:

-- CASCADE DELETE: Deleting pipeline removes all stages
FOREIGN KEY (pipeline_id) REFERENCES pipelines(id) ON DELETE CASCADE

-- Orphaned stages cannot exist

3. Indexing Strategy

Indexes optimize common queries:

-- Index on foreign keys for JOIN performance
CREATE INDEX idx_pipeline_stages_pipeline_id ON pipeline_stages(pipeline_id);

-- Index on frequently queried fields
CREATE INDEX idx_pipelines_name ON pipelines(name);
CREATE INDEX idx_pipelines_archived ON pipelines(archived);

4. Timestamps

All entities track creation and modification:

created_at TEXT NOT NULL,  -- RFC 3339 format
updated_at TEXT NOT NULL   -- RFC 3339 format
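
A small sketch of the round-trip using the `chrono` types that the mapping code below relies on:

#![allow(unused)]
fn main() {
use chrono::{DateTime, Utc};

// Serialize a timestamp to its RFC 3339 text form for storage...
let created_at: DateTime<Utc> = Utc::now();
let stored: String = created_at.to_rfc3339();

// ...and parse it back when loading the row.
let parsed: DateTime<Utc> = DateTime::parse_from_rfc3339(&stored)
    .expect("stored timestamps are valid RFC 3339")
    .with_timezone(&Utc);
assert_eq!(created_at, parsed);
}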

Schema Initialization

Automated Migration System:

#![allow(unused)]
fn main() {
use adaptive_pipeline::infrastructure::repositories::schema;

// High-level initialization (recommended)
let pool = schema::initialize_database("sqlite://./pipeline.db").await?;
// Database created, migrations applied, ready to use!

// Manual initialization
schema::create_database_if_missing("sqlite://./pipeline.db").await?;
let pool = SqlitePool::connect("sqlite://./pipeline.db").await?;
schema::ensure_schema(&pool).await?;
}

Migration Tracking:

-- sqlx automatically creates this table
CREATE TABLE _sqlx_migrations (
    version BIGINT PRIMARY KEY,
    description TEXT NOT NULL,
    installed_on TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    success BOOLEAN NOT NULL,
    checksum BLOB NOT NULL,
    execution_time BIGINT NOT NULL
);

For complete schema details, see Schema Management.


Transaction Management

SQLite provides full ACID transaction support for data consistency.

ACID Properties

Property    | SQLite Implementation
------------|-----------------------------------------------
Atomicity   | All-or-nothing commits via rollback journal
Consistency | Foreign keys, constraints enforce invariants
Isolation   | Serializable isolation (single writer)
Durability  | WAL mode ensures data persists after commit

Transaction Usage

Explicit Transactions:

#![allow(unused)]
fn main() {
// Begin transaction
let mut tx = pool.begin().await?;

// Perform multiple operations
sqlx::query("INSERT INTO pipelines (id, name, created_at, updated_at) VALUES (?, ?, ?, ?)")
    .bind(&id)
    .bind(&name)
    .bind(&now)
    .bind(&now)
    .execute(&mut *tx)
    .await?;

sqlx::query("INSERT INTO pipeline_stages (id, pipeline_id, name, stage_type, stage_order, algorithm, created_at, updated_at) VALUES (?, ?, ?, ?, ?, ?, ?, ?)")
    .bind(&stage_id)
    .bind(&id)
    .bind("compression")
    .bind("compression")
    .bind(0)
    .bind("brotli")
    .bind(&now)
    .bind(&now)
    .execute(&mut *tx)
    .await?;

// Commit transaction (or rollback on error)
tx.commit().await?;
}

Automatic Rollback:

#![allow(unused)]
fn main() {
async fn save_pipeline_with_stages(
    pool: &SqlitePool,
    pipeline: &Pipeline,
) -> Result<(), PipelineError> {
    let mut tx = pool.begin().await?;

    // Insert pipeline
    insert_pipeline(&mut tx, pipeline).await?;

    // Insert all stages
    for stage in pipeline.stages() {
        insert_stage(&mut tx, stage).await?;
    }

    // Commit (or automatic rollback if any operation fails)
    tx.commit().await?;
    Ok(())
}
}

Transaction Best Practices

1. Keep Transactions Short

#![allow(unused)]
fn main() {
// ✅ Good: Short transaction
let mut tx = pool.begin().await?;
sqlx::query("INSERT INTO ...").execute(&mut *tx).await?;
tx.commit().await?;

// ❌ Bad: Long-running transaction
let mut tx = pool.begin().await?;
expensive_computation().await;  // Don't do this inside transaction!
sqlx::query("INSERT INTO ...").execute(&mut *tx).await?;
tx.commit().await?;
}

2. Handle Errors Gracefully

#![allow(unused)]
fn main() {
match save_pipeline(&pool, &pipeline).await {
    Ok(()) => info!("Pipeline saved successfully"),
    Err(e) => {
        error!("Failed to save pipeline: {}", e);
        // Transaction automatically rolled back
        return Err(e);
    }
}
}

3. Use Connection Pool

#![allow(unused)]
fn main() {
// ✅ Good: Use pool for automatic connection management
async fn save(pool: &SqlitePool, data: &Data) -> Result<(), Error> {
    sqlx::query("INSERT ...").execute(pool).await?;
    Ok(())
}

// ❌ Bad: Creating new connections
async fn save(url: &str, data: &Data) -> Result<(), Error> {
    let pool = SqlitePool::connect(url).await?;  // Expensive!
    sqlx::query("INSERT ...").execute(&pool).await?;
    Ok(())
}
}

Connection Management

Efficient connection management is crucial for performance and resource utilization.

Connection Pooling

SqlitePool Benefits:

  • Connection Reuse: Avoid overhead of creating new connections
  • Concurrency Control: Limit concurrent database access
  • Automatic Cleanup: Close idle connections automatically
  • Health Monitoring: Detect and recover from connection failures

Pool Configuration:

#![allow(unused)]
fn main() {
use sqlx::sqlite::SqlitePoolOptions;
use std::time::Duration;

let pool = SqlitePoolOptions::new()
    // Maximum number of connections in pool
    .max_connections(5)

    // Minimum number of idle connections
    .min_connections(1)

    // Timeout for acquiring connection from pool
    .acquire_timeout(Duration::from_secs(30))

    // Close connections idle for this duration
    .idle_timeout(Duration::from_secs(600))

    // Maximum lifetime of a connection
    .max_lifetime(Duration::from_secs(3600))

    // Test connection before returning from pool
    .test_before_acquire(true)

    .connect(&database_url)
    .await?;
}

Connection Lifecycle

1. Application requests connection
   ↓
2. Pool checks for available connection
   ├─ Available → Reuse existing connection
   └─ Not available → Create new connection (if under max)
   ↓
3. Application uses connection
   ↓
4. Application returns connection to pool
   ↓
5. Pool keeps connection alive (if under idle_timeout)
   ↓
6. Connection eventually closed (after max_lifetime)

Performance Tuning

Optimal Pool Size:

#![allow(unused)]
fn main() {
// For CPU-bound workloads
let pool_size = num_cpus::get();

// For I/O-bound workloads
let pool_size = num_cpus::get() * 2;

// For SQLite (single writer)
let pool_size = 5;  // Conservative for write-heavy workloads
}

Connection Timeout Strategies:

Scenario         | Timeout       | Rationale
-----------------|---------------|------------------------------------
Web API          | 5-10 seconds  | Fail fast for user requests
Background Job   | 30-60 seconds | More tolerance for delays
Batch Processing | 2-5 minutes   | Long-running operations acceptable
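
One way to encode these guidelines in code; the enum and helper below are hypothetical, shown only to make the table concrete:

#![allow(unused)]
fn main() {
use std::time::Duration;

// Hypothetical deployment scenarios mapped to acquire timeouts.
enum Scenario {
    WebApi,
    BackgroundJob,
    BatchProcessing,
}

fn acquire_timeout_for(scenario: Scenario) -> Duration {
    match scenario {
        Scenario::WebApi => Duration::from_secs(10),           // fail fast for user requests
        Scenario::BackgroundJob => Duration::from_secs(60),    // more tolerance for delays
        Scenario::BatchProcessing => Duration::from_secs(300), // long-running work acceptable
    }
}

let timeout = acquire_timeout_for(Scenario::WebApi);
// Then: SqlitePoolOptions::new().acquire_timeout(timeout) ...
}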

Data Mapping

Data mapping converts between domain entities and database records.

Entity-to-Row Mapping

Domain Entity:

#![allow(unused)]
fn main() {
pub struct Pipeline {
    id: PipelineId,
    name: String,
    stages: Vec<PipelineStage>,
    archived: bool,
    created_at: DateTime<Utc>,
    updated_at: DateTime<Utc>,
}
}

Database Row:

CREATE TABLE pipelines (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    archived BOOLEAN NOT NULL DEFAULT false,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL
);

Mapping Logic:

#![allow(unused)]
fn main() {
// Domain → Database (serialize)
let id_str = pipeline.id().to_string();
let name_str = pipeline.name();
let archived_bool = pipeline.is_archived();
let created_at_str = pipeline.created_at().to_rfc3339();
let updated_at_str = pipeline.updated_at().to_rfc3339();

sqlx::query("INSERT INTO pipelines (id, name, archived, created_at, updated_at) VALUES (?, ?, ?, ?, ?)")
    .bind(id_str)
    .bind(name_str)
    .bind(archived_bool)
    .bind(created_at_str)
    .bind(updated_at_str)
    .execute(pool)
    .await?;

// Database → Domain (deserialize)
let row = sqlx::query("SELECT * FROM pipelines WHERE id = ?")
    .bind(id_str)
    .fetch_one(pool)
    .await?;

// Note: row.get() requires the sqlx::Row trait to be in scope.
let id = PipelineId::from(row.get::<String, _>("id"));
let name = row.get::<String, _>("name");
let archived = row.get::<bool, _>("archived");
let created_at = DateTime::parse_from_rfc3339(&row.get::<String, _>("created_at"))?;
let updated_at = DateTime::parse_from_rfc3339(&row.get::<String, _>("updated_at"))?;
}

Type Conversions

Rust Type     | SQLite Type   | Conversion
--------------|---------------|----------------------------
String        | TEXT          | Direct mapping
i64           | INTEGER       | Direct mapping
f64           | REAL          | Direct mapping
bool          | INTEGER (0/1) | sqlx handles conversion
DateTime<Utc> | TEXT          | RFC 3339 string format
PipelineId    | TEXT          | UUID string representation
Vec<u8>       | BLOB          | Direct binary mapping
Option<T>     | NULL / value  | NULL for None
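
For example, `Option<T>` maps to NULL transparently when binding and decoding. A short sketch, assuming a `pipeline` entity and a `pool` are already in scope (the `description` column appears in the schema later in this guide):

#![allow(unused)]
fn main() {
// Binding None stores NULL; binding Some(value) stores the value.
let description: Option<String> = None;

sqlx::query("UPDATE pipelines SET description = ? WHERE id = ?")
    .bind(&description)              // NULL when None
    .bind(pipeline.id().to_string())
    .execute(&pool)
    .await?;

// Reading back: a NULL column decodes to None.
let loaded: Option<String> = sqlx::query_scalar("SELECT description FROM pipelines WHERE id = ?")
    .bind(pipeline.id().to_string())
    .fetch_one(&pool)
    .await?;
}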

Handling Relationships

One-to-Many (Pipeline → Stages):

#![allow(unused)]
fn main() {
async fn load_pipeline_with_stages(
    pool: &SqlitePool,
    id: &PipelineId,
) -> Result<Pipeline, PipelineError> {
    // Load pipeline
    let pipeline_row = sqlx::query("SELECT * FROM pipelines WHERE id = ?")
        .bind(id.to_string())
        .fetch_one(pool)
        .await?;

    // Load related stages
    let stage_rows = sqlx::query("SELECT * FROM pipeline_stages WHERE pipeline_id = ? ORDER BY stage_order")
        .bind(id.to_string())
        .fetch_all(pool)
        .await?;

    // Map to domain entities
    let pipeline = map_pipeline_row(pipeline_row)?;
    let stages = stage_rows.into_iter()
        .map(map_stage_row)
        .collect::<Result<Vec<_>, _>>()?;

    // Combine into aggregate
    pipeline.with_stages(stages)
}
}

For detailed repository implementation, see Repository Implementation.


Performance Optimization

Several strategies optimize persistence performance.

Query Optimization

1. Use Indexes Effectively

-- Index on frequently queried columns
CREATE INDEX idx_pipelines_name ON pipelines(name);
CREATE INDEX idx_pipelines_archived ON pipelines(archived);

-- Index on foreign keys for JOINs
CREATE INDEX idx_pipeline_stages_pipeline_id ON pipeline_stages(pipeline_id);

2. Avoid N+1 Queries

#![allow(unused)]
fn main() {
// ❌ Bad: N+1 query problem
for pipeline_id in pipeline_ids {
    let pipeline = repo.find_by_id(&pipeline_id).await?;
    // Process pipeline...
}

// ✅ Good: Single batch query
let pipelines = repo.find_all().await?;
for pipeline in pipelines {
    // Process pipeline...
}
}

3. Use Prepared Statements

#![allow(unused)]
fn main() {
// sqlx automatically uses prepared statements
let pipeline = sqlx::query_as::<_, Pipeline>("SELECT * FROM pipelines WHERE id = ?")
    .bind(id)
    .fetch_one(pool)
    .await?;
}

Connection Pool Tuning

Optimal Settings:

#![allow(unused)]
fn main() {
// For low-concurrency (CLI tools)
.max_connections(2)
.min_connections(1)

// For medium-concurrency (web services)
.max_connections(5)
.min_connections(2)

// For high-concurrency (not recommended for SQLite writes)
.max_connections(10)  // Reading only
.min_connections(5)
}

Batch Operations

Batch Inserts:

#![allow(unused)]
fn main() {
async fn save_multiple_pipelines(
    pool: &SqlitePool,
    pipelines: &[Pipeline],
) -> Result<(), PipelineError> {
    let mut tx = pool.begin().await?;

    for pipeline in pipelines {
        sqlx::query("INSERT INTO pipelines (id, name, created_at, updated_at) VALUES (?, ?, ?, ?)")
            .bind(pipeline.id().to_string())
            .bind(pipeline.name())
            .bind(pipeline.created_at().to_rfc3339())
            .bind(pipeline.updated_at().to_rfc3339())
            .execute(&mut *tx)
            .await?;
    }

    tx.commit().await?;
    Ok(())
}
}

Performance Benchmarks

Operation                  | Latency | Throughput | Notes
---------------------------|---------|------------|---------------------
Single INSERT              | ~0.1ms  | ~10K/sec   | Without transaction
Batch INSERT (1000)        | ~10ms   | ~100K/sec  | Within transaction
Single SELECT by ID        | ~0.05ms | ~20K/sec   | With index
SELECT with JOIN           | ~0.5ms  | ~2K/sec    | Two-table join
Full table scan (10K rows) | ~10ms   | ~1K/sec    | Without index

Usage Examples

Example 1: Basic CRUD Operations

use adaptive_pipeline::infrastructure::repositories::{schema, SqlitePipelineRepository};
use adaptive_pipeline_domain::Pipeline;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize database
    let pool = schema::initialize_database("sqlite://./pipeline.db").await?;
    let repo = SqlitePipelineRepository::new(pool);

    // Create pipeline
    let pipeline = Pipeline::new("my-pipeline".to_string())?;
    repo.save(&pipeline).await?;
    println!("Saved pipeline: {}", pipeline.id());

    // Read pipeline
    let loaded = repo.find_by_id(pipeline.id()).await?
        .ok_or("Pipeline not found")?;
    println!("Loaded pipeline: {}", loaded.name());

    // Update pipeline
    let mut updated = loaded;
    updated.update_name("renamed-pipeline".to_string())?;
    repo.update(&updated).await?;

    // Delete pipeline
    repo.delete(updated.id()).await?;
    println!("Deleted pipeline");

    Ok(())
}

Example 2: Transaction Management

#![allow(unused)]
fn main() {
use sqlx::SqlitePool;

async fn save_pipeline_atomically(
    pool: &SqlitePool,
    pipeline: &Pipeline,
) -> Result<(), PipelineError> {
    // Begin transaction
    let mut tx = pool.begin().await?;

    // Insert pipeline
    sqlx::query("INSERT INTO pipelines (id, name, created_at, updated_at) VALUES (?, ?, ?, ?)")
        .bind(pipeline.id().to_string())
        .bind(pipeline.name())
        .bind(pipeline.created_at().to_rfc3339())
        .bind(pipeline.updated_at().to_rfc3339())
        .execute(&mut *tx)
        .await?;

    // Insert all stages
    for (i, stage) in pipeline.stages().iter().enumerate() {
        sqlx::query("INSERT INTO pipeline_stages (id, pipeline_id, name, stage_type, stage_order, algorithm, created_at, updated_at) VALUES (?, ?, ?, ?, ?, ?, ?, ?)")
            .bind(stage.id().to_string())
            .bind(pipeline.id().to_string())
            .bind(stage.name())
            .bind(stage.stage_type().to_string())
            .bind(i as i64)
            .bind(stage.algorithm())
            .bind(stage.created_at().to_rfc3339())
            .bind(stage.updated_at().to_rfc3339())
            .execute(&mut *tx)
            .await?;
    }

    // Commit transaction (or rollback on error)
    tx.commit().await?;

    Ok(())
}
}

Example 3: Query with Pagination

#![allow(unused)]
fn main() {
async fn list_pipelines_paginated(
    repo: &dyn PipelineRepository,
    page: usize,
    page_size: usize,
) -> Result<Vec<Pipeline>, PipelineError> {
    let offset = page * page_size;
    repo.list_paginated(offset, page_size).await
}

// Usage
let page_1 = list_pipelines_paginated(&repo, 0, 10).await?;  // First 10
let page_2 = list_pipelines_paginated(&repo, 1, 10).await?;  // Next 10
}

Example 4: Archive Management

#![allow(unused)]
fn main() {
async fn archive_old_pipelines(
    repo: &dyn PipelineRepository,
    cutoff_date: DateTime<Utc>,
) -> Result<usize, PipelineError> {
    let pipelines = repo.find_all().await?;
    let mut archived_count = 0;

    for pipeline in pipelines {
        if pipeline.created_at() < &cutoff_date {
            repo.archive(pipeline.id()).await?;
            archived_count += 1;
        }
    }

    Ok(archived_count)
}
}

Example 5: Connection Pool Management

#![allow(unused)]
fn main() {
use sqlx::sqlite::SqlitePoolOptions;
use std::time::Duration;

async fn create_optimized_pool(database_url: &str) -> Result<SqlitePool, sqlx::Error> {
    let pool = SqlitePoolOptions::new()
        .max_connections(5)
        .min_connections(1)
        .acquire_timeout(Duration::from_secs(30))
        .idle_timeout(Duration::from_secs(600))
        .max_lifetime(Duration::from_secs(3600))
        .connect(database_url)
        .await?;

    Ok(pool)
}
}

Best Practices

1. Use Transactions for Multi-Step Operations

#![allow(unused)]
fn main() {
// ✅ Good: Atomic multi-step operation
async fn create_pipeline_with_stages(pool: &SqlitePool, pipeline: &Pipeline) -> Result<(), Error> {
    let mut tx = pool.begin().await?;
    insert_pipeline(&mut tx, pipeline).await?;
    insert_stages(&mut tx, pipeline.stages()).await?;
    tx.commit().await?;
    Ok(())
}

// ❌ Bad: Non-atomic operations
async fn create_pipeline_with_stages(pool: &SqlitePool, pipeline: &Pipeline) -> Result<(), Error> {
    insert_pipeline(pool, pipeline).await?;
    insert_stages(pool, pipeline.stages()).await?;  // May fail, leaving orphaned pipeline
    Ok(())
}
}

2. Always Use Connection Pooling

#![allow(unused)]
fn main() {
// ✅ Good: Reuse pool
let pool = SqlitePool::connect(&url).await?;
let repo = SqlitePipelineRepository::new(pool.clone());
let service = PipelineService::new(Arc::new(repo));

// ❌ Bad: Create new connections
for _ in 0..100 {
    let pool = SqlitePool::connect(&url).await?;  // Expensive!
    // ...
}
}

3. Handle Database Errors Gracefully

#![allow(unused)]
fn main() {
match repo.save(&pipeline).await {
    Ok(()) => info!("Pipeline saved successfully"),
    Err(PipelineError::DatabaseError(msg)) if msg.contains("UNIQUE constraint") => {
        warn!("Pipeline already exists: {}", pipeline.name());
    }
    Err(e) => {
        error!("Failed to save pipeline: {}", e);
        return Err(e);
    }
}
}

4. Use Indexes for Frequently Queried Fields

-- ✅ Good: Index on query columns
CREATE INDEX idx_pipelines_name ON pipelines(name);
SELECT * FROM pipelines WHERE name = ?;  -- Fast!

-- ❌ Bad: No index on query column
-- No index on 'name'
SELECT * FROM pipelines WHERE name = ?;  -- Slow (full table scan)

5. Keep Transactions Short

#![allow(unused)]
fn main() {
// ✅ Good: Short transaction
let mut tx = pool.begin().await?;
sqlx::query("INSERT ...").execute(&mut *tx).await?;
tx.commit().await?;

// ❌ Bad: Long transaction holding locks
let mut tx = pool.begin().await?;
expensive_computation().await;  // Don't do this!
sqlx::query("INSERT ...").execute(&mut *tx).await?;
tx.commit().await?;
}

6. Validate Data Before Persisting

#![allow(unused)]
fn main() {
// ✅ Good: Validate before save
async fn save_pipeline(repo: &dyn PipelineRepository, pipeline: &Pipeline) -> Result<(), Error> {
    pipeline.validate()?;  // Validate first
    repo.save(pipeline).await?;
    Ok(())
}
}

7. Use Migrations for Schema Changes

# ✅ Good: Create migration
sqlx migrate add add_archived_column

# Edit migration file
# migrations/20250101000001_add_archived_column.sql
ALTER TABLE pipelines ADD COLUMN archived BOOLEAN NOT NULL DEFAULT false;

# Apply migration
sqlx migrate run
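
Migrations can also be applied programmatically at startup. A minimal sketch, assuming a ./migrations directory in the crate root (the schema::initialize_database helper shown earlier likely wraps something similar):

#![allow(unused)]
fn main() {
use sqlx::SqlitePool;

// Embed the ./migrations directory at compile time and apply any pending migrations.
let pool = SqlitePool::connect("sqlite://./pipeline.db").await?;
sqlx::migrate!("./migrations").run(&pool).await?;
}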

Troubleshooting

Issue 1: Database Locked Error

Symptom:

Error: database is locked

Causes:

  • Long-running transaction blocking other operations
  • Multiple writers attempting simultaneous writes
  • Connection not released back to pool

Solutions:

#![allow(unused)]
fn main() {
// 1. Reduce transaction duration
let mut tx = pool.begin().await?;
// Do minimal work inside transaction
sqlx::query("INSERT ...").execute(&mut *tx).await?;
tx.commit().await?;

// 2. Enable WAL mode for better concurrency
use sqlx::sqlite::SqliteConnectOptions;
let options = SqliteConnectOptions::new()
    .filename("./pipeline.db")
    .journal_mode(sqlx::sqlite::SqliteJournalMode::Wal);

// 3. Increase busy timeout
.busy_timeout(Duration::from_secs(5))
}

Issue 2: Connection Pool Exhausted

Symptom:

Error: timed out while waiting for an open connection

Solutions:

#![allow(unused)]
fn main() {
// 1. Increase pool size
.max_connections(10)

// 2. Increase acquire timeout
.acquire_timeout(Duration::from_secs(60))

// 3. Ensure connections are returned
async fn query_data(pool: &SqlitePool) -> Result<(), Error> {
    let result = sqlx::query("SELECT ...").fetch_all(pool).await?;
    // Connection automatically returned to pool
    Ok(())
}
}

Issue 3: Foreign Key Constraint Failed

Symptom:

Error: FOREIGN KEY constraint failed

Solutions:

#![allow(unused)]
fn main() {
// 1. Ensure foreign keys are enabled
.foreign_keys(true)

// 2. Verify referenced record exists
let exists: bool = sqlx::query_scalar("SELECT EXISTS(SELECT 1 FROM pipelines WHERE id = ?)")
    .bind(&pipeline_id)
    .fetch_one(pool)
    .await?;

if !exists {
    return Err("Pipeline not found".into());
}

// 3. Use CASCADE DELETE in the schema for automatic cleanup:
//    CREATE TABLE pipeline_stages (
//        ...
//        FOREIGN KEY (pipeline_id) REFERENCES pipelines(id) ON DELETE CASCADE
//    );
}

Issue 4: Migration Checksum Mismatch

Symptom:

Error: migration checksum mismatch

Solutions:

# Option 1: Revert migration
sqlx migrate revert

# Option 2: Reset database (development only!)
rm pipeline.db
sqlx migrate run

# Option 3: Create new migration to fix
sqlx migrate add fix_schema_issue

Issue 5: Query Performance Degradation

Diagnosis:

#![allow(unused)]
fn main() {
// Enable query logging (run with: RUST_LOG=sqlx=debug cargo run)

// Analyze slow queries
use std::time::{Duration, Instant};

let start = Instant::now();
let result = query.fetch_all(pool).await?;
let duration = start.elapsed();
if duration > Duration::from_millis(100) {
    warn!("Slow query: {:?}", duration);
}
}

Solutions:

-- 1. Add missing indexes
CREATE INDEX idx_pipelines_created_at ON pipelines(created_at);

-- 2. Use EXPLAIN QUERY PLAN
EXPLAIN QUERY PLAN SELECT * FROM pipelines WHERE name = ?;

-- 3. Optimize query
-- Before: Full table scan
SELECT * FROM pipelines WHERE lower(name) = ?;

-- After: Use index
SELECT * FROM pipelines WHERE name = ?;

Testing Strategies

Unit Testing with Mock Repository

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;
    use async_trait::async_trait;
    use std::sync::Arc;
    use tokio::sync::Mutex;

    struct MockPipelineRepository {
        pipelines: Arc<Mutex<HashMap<PipelineId, Pipeline>>>,
    }

    #[async_trait]
    impl PipelineRepository for MockPipelineRepository {
        async fn save(&self, pipeline: &Pipeline) -> Result<(), PipelineError> {
            let mut pipelines = self.pipelines.lock().await;
            pipelines.insert(pipeline.id().clone(), pipeline.clone());
            Ok(())
        }

        async fn find_by_id(&self, id: &PipelineId) -> Result<Option<Pipeline>, PipelineError> {
            let pipelines = self.pipelines.lock().await;
            Ok(pipelines.get(id).cloned())
        }

        // ... implement other methods
    }

    #[tokio::test]
    async fn test_save_and_load() {
        let repo = MockPipelineRepository {
            pipelines: Arc::new(Mutex::new(HashMap::new())),
        };

        let pipeline = Pipeline::new("test".to_string()).unwrap();
        repo.save(&pipeline).await.unwrap();

        let loaded = repo.find_by_id(pipeline.id()).await.unwrap().unwrap();
        assert_eq!(loaded.name(), "test");
    }
}
}

Integration Testing with SQLite

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_sqlite_repository_integration() {
    // Use in-memory database for tests
    let pool = schema::initialize_database("sqlite::memory:").await.unwrap();
    let repo = SqlitePipelineRepository::new(pool);

    // Create pipeline
    let pipeline = Pipeline::new("integration-test".to_string()).unwrap();
    repo.save(&pipeline).await.unwrap();

    // Verify persistence
    let loaded = repo.find_by_id(pipeline.id()).await.unwrap().unwrap();
    assert_eq!(loaded.name(), "integration-test");

    // Verify stages are loaded
    assert_eq!(loaded.stages().len(), pipeline.stages().len());
}
}

Transaction Testing

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_transaction_rollback() {
    let pool = schema::initialize_database("sqlite::memory:").await.unwrap();

    let result = async {
        let mut tx = pool.begin().await?;

        sqlx::query("INSERT INTO pipelines (id, name, created_at, updated_at) VALUES (?, ?, ?, ?)")
            .bind("test-id")
            .bind("test")
            .bind("2025-01-01T00:00:00Z")
            .bind("2025-01-01T00:00:00Z")
            .execute(&mut *tx)
            .await?;

        // Simulate error
        return Err::<(), sqlx::Error>(sqlx::Error::RowNotFound);

        // This would commit, but error prevents it
        // tx.commit().await?;
    }.await;

    assert!(result.is_err());

    // Verify rollback - no pipeline should exist
    let count: i64 = sqlx::query_scalar("SELECT COUNT(*) FROM pipelines")
        .fetch_one(&pool)
        .await
        .unwrap();
    assert_eq!(count, 0);
}
}

Next Steps

After understanding data persistence fundamentals, explore specific implementations:

Detailed Persistence Topics

  1. Repository Implementation: Deep dive into repository pattern implementation
  2. Schema Management: Database schema design and migration strategies

Advanced Topics


Summary

Key Takeaways:

  1. Repository Pattern provides abstraction between domain and infrastructure layers
  2. SQLite offers zero-configuration, ACID-compliant persistence
  3. Schema Management uses sqlx migrations for automated database evolution
  4. Connection Pooling optimizes resource utilization and performance
  5. Transactions ensure data consistency with ACID guarantees
  6. Data Mapping converts between domain entities and database records
  7. Performance optimized through indexing, batching, and query optimization

Architecture File References:

  • Repository Interface: pipeline-domain/src/repositories/pipeline_repository.rs:138
  • SQLite Implementation: pipeline/src/infrastructure/repositories/sqlite_pipeline_repository.rs:193
  • Schema Management: pipeline/src/infrastructure/repositories/schema.rs:18

Repository Implementation

Version: 1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner, Claude Code Status: Active

Overview

The repository pattern provides an abstraction layer between the domain and data persistence, enabling the application to work with domain entities without knowing about database details. This separation allows for flexible storage implementations and easier testing.

Key Benefits:

  • Domain Independence: Business logic stays free from persistence concerns
  • Testability: Easy mocking with in-memory implementations
  • Flexibility: Support for different storage backends (SQLite, PostgreSQL, etc.)
  • Consistency: Standardized data access patterns

Repository Interface

Domain-Defined Contract

The domain layer defines the repository interface:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::repositories::PipelineRepository;
use adaptive_pipeline_domain::entities::Pipeline;
use adaptive_pipeline_domain::value_objects::PipelineId;
use adaptive_pipeline_domain::PipelineError;
use async_trait::async_trait;

#[async_trait]
pub trait PipelineRepository: Send + Sync {
    /// Saves a pipeline
    async fn save(&self, pipeline: &Pipeline) -> Result<(), PipelineError>;

    /// Finds a pipeline by ID
    async fn find_by_id(&self, id: PipelineId)
        -> Result<Option<Pipeline>, PipelineError>;

    /// Finds a pipeline by name
    async fn find_by_name(&self, name: &str)
        -> Result<Option<Pipeline>, PipelineError>;

    /// Lists all pipelines
    async fn list_all(&self) -> Result<Vec<Pipeline>, PipelineError>;

    /// Lists pipelines with pagination
    async fn list_paginated(&self, offset: usize, limit: usize)
        -> Result<Vec<Pipeline>, PipelineError>;

    /// Updates a pipeline
    async fn update(&self, pipeline: &Pipeline) -> Result<(), PipelineError>;

    /// Deletes a pipeline by ID
    async fn delete(&self, id: PipelineId) -> Result<bool, PipelineError>;

    /// Checks if a pipeline exists
    async fn exists(&self, id: PipelineId) -> Result<bool, PipelineError>;

    /// Counts total pipelines
    async fn count(&self) -> Result<usize, PipelineError>;

    /// Finds pipelines by configuration parameter
    async fn find_by_config(&self, key: &str, value: &str)
        -> Result<Vec<Pipeline>, PipelineError>;

    /// Archives a pipeline (soft delete)
    async fn archive(&self, id: PipelineId) -> Result<bool, PipelineError>;

    /// Restores an archived pipeline
    async fn restore(&self, id: PipelineId) -> Result<bool, PipelineError>;

    /// Lists archived pipelines
    async fn list_archived(&self) -> Result<Vec<Pipeline>, PipelineError>;
}
}
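
Because callers depend only on this trait, storage backends can be swapped without touching business logic. The sketch below is illustrative only (the archive_by_name helper is hypothetical and reuses the types imported in the snippet above); it calls nothing beyond the methods declared on the trait:

#![allow(unused)]
fn main() {
use std::sync::Arc;

// Hypothetical helper: depends only on the trait object, so production code
// can inject SqlitePipelineRepository while tests inject an in-memory
// implementation.
async fn archive_by_name(
    repo: Arc<dyn PipelineRepository>,
    name: &str,
) -> Result<bool, PipelineError> {
    match repo.find_by_name(name).await? {
        Some(pipeline) => repo.archive(pipeline.id().clone()).await,
        None => Ok(false),
    }
}
}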

Thread Safety

All repository implementations must be Send + Sync for concurrent access:

#![allow(unused)]
fn main() {
// ✅ CORRECT: Thread-safe repository
pub struct SqlitePipelineRepository {
    pool: SqlitePool, // SqlitePool is Send + Sync
}

// ❌ WRONG: Not thread-safe
pub struct UnsafeRepository {
    conn: Rc<Connection>, // Rc is not Send or Sync
}
}

SQLite Implementation

Architecture

The SQLite repository implements the domain interface using sqlx for type-safe queries:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::repositories::PipelineRepository;
use sqlx::SqlitePool;

pub struct SqlitePipelineRepository {
    pool: SqlitePool,
}

impl SqlitePipelineRepository {
    pub async fn new(database_path: &str) -> Result<Self, PipelineError> {
        let database_url = format!("sqlite:{}", database_path);
        let pool = SqlitePool::connect(&database_url)
            .await
            .map_err(|e| PipelineError::database_error(
                format!("Failed to connect: {}", e)
            ))?;

        Ok(Self { pool })
    }
}
}

Database Schema

The repository uses a normalized relational schema:

Pipelines Table

CREATE TABLE pipelines (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    description TEXT,
    archived BOOLEAN NOT NULL DEFAULT 0,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL
);

CREATE INDEX idx_pipelines_name ON pipelines(name);
CREATE INDEX idx_pipelines_archived ON pipelines(archived);

Pipeline Stages Table

CREATE TABLE pipeline_stages (
    id TEXT PRIMARY KEY,
    pipeline_id TEXT NOT NULL,
    name TEXT NOT NULL,
    stage_type TEXT NOT NULL,
    algorithm TEXT NOT NULL,
    enabled BOOLEAN NOT NULL DEFAULT 1,
    order_index INTEGER NOT NULL,
    parallel_processing BOOLEAN NOT NULL DEFAULT 0,
    chunk_size INTEGER,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    FOREIGN KEY (pipeline_id) REFERENCES pipelines(id) ON DELETE CASCADE
);

CREATE INDEX idx_stages_pipeline ON pipeline_stages(pipeline_id);
CREATE INDEX idx_stages_order ON pipeline_stages(pipeline_id, order_index);

Pipeline Configuration Table

CREATE TABLE pipeline_configuration (
    pipeline_id TEXT NOT NULL,
    key TEXT NOT NULL,
    value TEXT NOT NULL,
    PRIMARY KEY (pipeline_id, key),
    FOREIGN KEY (pipeline_id) REFERENCES pipelines(id) ON DELETE CASCADE
);

Pipeline Metrics Table

CREATE TABLE pipeline_metrics (
    id TEXT PRIMARY KEY,
    pipeline_id TEXT NOT NULL,
    bytes_processed INTEGER NOT NULL DEFAULT 0,
    bytes_total INTEGER NOT NULL DEFAULT 0,
    chunks_processed INTEGER NOT NULL DEFAULT 0,
    chunks_total INTEGER NOT NULL DEFAULT 0,
    start_time TEXT,
    end_time TEXT,
    throughput_mbps REAL NOT NULL DEFAULT 0.0,
    compression_ratio REAL,
    error_count INTEGER NOT NULL DEFAULT 0,
    warning_count INTEGER NOT NULL DEFAULT 0,
    recorded_at TEXT NOT NULL,
    FOREIGN KEY (pipeline_id) REFERENCES pipelines(id) ON DELETE CASCADE
);

CREATE INDEX idx_metrics_pipeline ON pipeline_metrics(pipeline_id);

CRUD Operations

Create (Save)

Save a complete pipeline with all related data:

#![allow(unused)]
fn main() {
#[async_trait]
impl PipelineRepository for SqlitePipelineRepository {
    async fn save(&self, pipeline: &Pipeline) -> Result<(), PipelineError> {
        // Start transaction for atomicity
        let mut tx = self.pool.begin().await
            .map_err(|e| PipelineError::database_error(
                format!("Failed to start transaction: {}", e)
            ))?;

        // Insert pipeline
        sqlx::query(
            "INSERT INTO pipelines
             (id, name, description, archived, created_at, updated_at)
             VALUES (?, ?, ?, ?, ?, ?)"
        )
        .bind(pipeline.id().to_string())
        .bind(pipeline.name())
        .bind(pipeline.description())
        .bind(pipeline.archived())
        .bind(pipeline.created_at().to_rfc3339())
        .bind(pipeline.updated_at().to_rfc3339())
        .execute(&mut *tx)
        .await
        .map_err(|e| PipelineError::database_error(
            format!("Failed to insert pipeline: {}", e)
        ))?;

        // Insert stages
        for (index, stage) in pipeline.stages().iter().enumerate() {
            sqlx::query(
                "INSERT INTO pipeline_stages
                 (id, pipeline_id, name, stage_type, algorithm, enabled,
                  order_index, parallel_processing, chunk_size,
                  created_at, updated_at)
                 VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"
            )
            .bind(stage.id().to_string())
            .bind(pipeline.id().to_string())
            .bind(stage.name())
            .bind(stage.stage_type().to_string())
            .bind(stage.algorithm().name())
            .bind(stage.enabled())
            .bind(index as i64)
            .bind(stage.parallel_processing())
            .bind(stage.chunk_size().map(|cs| cs.as_u64() as i64))
            .bind(stage.created_at().to_rfc3339())
            .bind(stage.updated_at().to_rfc3339())
            .execute(&mut *tx)
            .await
            .map_err(|e| PipelineError::database_error(
                format!("Failed to insert stage: {}", e)
            ))?;
        }

        // Insert configuration
        for (key, value) in pipeline.configuration() {
            sqlx::query(
                "INSERT INTO pipeline_configuration (pipeline_id, key, value)
                 VALUES (?, ?, ?)"
            )
            .bind(pipeline.id().to_string())
            .bind(key)
            .bind(value)
            .execute(&mut *tx)
            .await
            .map_err(|e| PipelineError::database_error(
                format!("Failed to insert config: {}", e)
            ))?;
        }

        // Commit transaction
        tx.commit().await
            .map_err(|e| PipelineError::database_error(
                format!("Failed to commit: {}", e)
            ))?;

        Ok(())
    }
}
}

Read (Find)

Retrieve pipelines with all related data:

#![allow(unused)]
fn main() {
impl SqlitePipelineRepository {
    async fn find_by_id(&self, id: PipelineId)
        -> Result<Option<Pipeline>, PipelineError> {
        // Fetch pipeline
        let pipeline_row = sqlx::query(
            "SELECT id, name, description, archived, created_at, updated_at
             FROM pipelines WHERE id = ?"
        )
        .bind(id.to_string())
        .fetch_optional(&self.pool)
        .await
        .map_err(|e| PipelineError::database_error(
            format!("Failed to fetch pipeline: {}", e)
        ))?;

        let Some(row) = pipeline_row else {
            return Ok(None);
        };

        // Fetch stages
        let stage_rows = sqlx::query(
            "SELECT id, name, stage_type, algorithm, enabled,
                    order_index, parallel_processing, chunk_size,
                    created_at, updated_at
             FROM pipeline_stages
             WHERE pipeline_id = ?
             ORDER BY order_index"
        )
        .bind(id.to_string())
        .fetch_all(&self.pool)
        .await
        .map_err(|e| PipelineError::database_error(
            format!("Failed to fetch stages: {}", e)
        ))?;

        // Fetch configuration
        let config_rows = sqlx::query(
            "SELECT key, value FROM pipeline_configuration
             WHERE pipeline_id = ?"
        )
        .bind(id.to_string())
        .fetch_all(&self.pool)
        .await
        .map_err(|e| PipelineError::database_error(
            format!("Failed to fetch config: {}", e)
        ))?;

        // Map rows to domain entities
        let pipeline = self.map_to_pipeline(row, stage_rows, config_rows)?;

        Ok(Some(pipeline))
    }
}
}

Update

Update existing pipeline:

#![allow(unused)]
fn main() {
impl SqlitePipelineRepository {
    async fn update(&self, pipeline: &Pipeline) -> Result<(), PipelineError> {
        let mut tx = self.pool.begin().await
            .map_err(|e| PipelineError::database_error(
                format!("Failed to start transaction: {}", e)
            ))?;

        // Update pipeline
        sqlx::query(
            "UPDATE pipelines
             SET name = ?, description = ?, archived = ?, updated_at = ?
             WHERE id = ?"
        )
        .bind(pipeline.name())
        .bind(pipeline.description())
        .bind(pipeline.archived())
        .bind(pipeline.updated_at().to_rfc3339())
        .bind(pipeline.id().to_string())
        .execute(&mut *tx)
        .await
        .map_err(|e| PipelineError::database_error(
            format!("Failed to update pipeline: {}", e)
        ))?;

        // Delete and re-insert stages (simpler than updating)
        sqlx::query("DELETE FROM pipeline_stages WHERE pipeline_id = ?")
            .bind(pipeline.id().to_string())
            .execute(&mut *tx)
            .await?;

        // Insert updated stages
        for (index, stage) in pipeline.stages().iter().enumerate() {
            // ... (same as save operation)
        }

        tx.commit().await
            .map_err(|e| PipelineError::database_error(
                format!("Failed to commit: {}", e)
            ))?;

        Ok(())
    }
}
}

Delete

Remove pipeline and all related data:

#![allow(unused)]
fn main() {
impl SqlitePipelineRepository {
    async fn delete(&self, id: PipelineId) -> Result<bool, PipelineError> {
        let result = sqlx::query("DELETE FROM pipelines WHERE id = ?")
            .bind(id.to_string())
            .execute(&self.pool)
            .await
            .map_err(|e| PipelineError::database_error(
                format!("Failed to delete: {}", e)
            ))?;

        // CASCADE will automatically delete related records
        Ok(result.rows_affected() > 0)
    }
}
}

Advanced Queries

Pagination

Efficiently paginate large result sets:

#![allow(unused)]
fn main() {
impl SqlitePipelineRepository {
    async fn list_paginated(&self, offset: usize, limit: usize)
        -> Result<Vec<Pipeline>, PipelineError> {
        let rows = sqlx::query(
            "SELECT id, name, description, archived, created_at, updated_at
             FROM pipelines
             ORDER BY created_at DESC
             LIMIT ? OFFSET ?"
        )
        .bind(limit as i64)
        .bind(offset as i64)
        .fetch_all(&self.pool)
        .await?;

        // Load stages and config for each pipeline
        let mut pipelines = Vec::new();
        for row in rows {
            let id = PipelineId::parse(&row.get::<String, _>("id"))?;
            if let Some(pipeline) = self.find_by_id(id).await? {
                pipelines.push(pipeline);
            }
        }

        Ok(pipelines)
    }
}
}

Find pipelines by configuration:

#![allow(unused)]
fn main() {
impl SqlitePipelineRepository {
    async fn find_by_config(&self, key: &str, value: &str)
        -> Result<Vec<Pipeline>, PipelineError> {
        let rows = sqlx::query(
            "SELECT DISTINCT p.id
             FROM pipelines p
             JOIN pipeline_configuration pc ON p.id = pc.pipeline_id
             WHERE pc.key = ? AND pc.value = ?"
        )
        .bind(key)
        .bind(value)
        .fetch_all(&self.pool)
        .await?;

        let mut pipelines = Vec::new();
        for row in rows {
            let id = PipelineId::parse(&row.get::<String, _>("id"))?;
            if let Some(pipeline) = self.find_by_id(id).await? {
                pipelines.push(pipeline);
            }
        }

        Ok(pipelines)
    }
}
}

Archive Operations

Soft delete with archive/restore:

#![allow(unused)]
fn main() {
impl SqlitePipelineRepository {
    async fn archive(&self, id: PipelineId) -> Result<bool, PipelineError> {
        let result = sqlx::query(
            "UPDATE pipelines SET archived = 1, updated_at = ?
             WHERE id = ?"
        )
        .bind(chrono::Utc::now().to_rfc3339())
        .bind(id.to_string())
        .execute(&self.pool)
        .await?;

        Ok(result.rows_affected() > 0)
    }

    async fn restore(&self, id: PipelineId) -> Result<bool, PipelineError> {
        let result = sqlx::query(
            "UPDATE pipelines SET archived = 0, updated_at = ?
             WHERE id = ?"
        )
        .bind(chrono::Utc::now().to_rfc3339())
        .bind(id.to_string())
        .execute(&self.pool)
        .await?;

        Ok(result.rows_affected() > 0)
    }

    async fn list_archived(&self) -> Result<Vec<Pipeline>, PipelineError> {
        let rows = sqlx::query(
            "SELECT id, name, description, archived, created_at, updated_at
             FROM pipelines WHERE archived = 1"
        )
        .fetch_all(&self.pool)
        .await?;

        // Load full pipelines
        let mut pipelines = Vec::new();
        for row in rows {
            let id = PipelineId::parse(&row.get::<String, _>("id"))?;
            if let Some(pipeline) = self.find_by_id(id).await? {
                pipelines.push(pipeline);
            }
        }

        Ok(pipelines)
    }
}
}

Transaction Management

ACID Guarantees

Ensure data consistency with transactions:

#![allow(unused)]
fn main() {
impl SqlitePipelineRepository {
    /// Execute multiple operations atomically
    async fn save_multiple(&self, pipelines: &[Pipeline])
        -> Result<(), PipelineError> {
        let mut tx = self.pool.begin().await?;

        for pipeline in pipelines {
            // All operations use the same transaction
            self.save_in_transaction(&mut tx, pipeline).await?;
        }

        // Commit all or rollback all
        tx.commit().await
            .map_err(|e| PipelineError::database_error(
                format!("Transaction commit failed: {}", e)
            ))?;

        Ok(())
    }

    async fn save_in_transaction(
        &self,
        tx: &mut Transaction<'_, Sqlite>,
        pipeline: &Pipeline
    ) -> Result<(), PipelineError> {
        // Insert using transaction
        sqlx::query("INSERT INTO pipelines ...")
            .execute(&mut **tx)
            .await?;

        Ok(())
    }
}
}

Rollback on Error

Automatic rollback ensures consistency:

#![allow(unused)]
fn main() {
async fn complex_operation(&self, pipeline: &Pipeline)
    -> Result<(), PipelineError> {
    let mut tx = self.pool.begin().await?;

    // Step 1: Insert pipeline
    sqlx::query("INSERT INTO pipelines ...")
        .execute(&mut *tx)
        .await?;

    // Step 2: Insert stages
    for stage in pipeline.stages() {
        sqlx::query("INSERT INTO pipeline_stages ...")
            .execute(&mut *tx)
            .await?;
        // If this fails, Step 1 is automatically rolled back
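        // (the `?` returns early, `tx` is dropped without a commit, and sqlx
        // rolls back an uncommitted transaction automatically on drop)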
    }

    // Commit only if all steps succeed
    tx.commit().await?;
    Ok(())
}
}

Error Handling

Database Errors

Handle various database error types:

#![allow(unused)]
fn main() {
impl SqlitePipelineRepository {
    async fn save(&self, pipeline: &Pipeline) -> Result<(), PipelineError> {
        match sqlx::query("INSERT INTO pipelines ...").execute(&self.pool).await {
            Ok(_) => Ok(()),
            Err(sqlx::Error::Database(db_err)) => {
                if db_err.is_unique_violation() {
                    Err(PipelineError::AlreadyExists(pipeline.id().to_string()))
                } else if db_err.is_foreign_key_violation() {
                    Err(PipelineError::InvalidReference(
                        "Invalid foreign key".to_string()
                    ))
                } else {
                    Err(PipelineError::database_error(db_err.to_string()))
                }
            }
            Err(e) => Err(PipelineError::database_error(e.to_string())),
        }
    }
}
}

Connection Failures

Handle connection issues gracefully:

#![allow(unused)]
fn main() {
use std::time::Duration;
use futures::future::BoxFuture;

impl SqlitePipelineRepository {
    async fn with_retry<F, T>(&self, mut operation: F) -> Result<T, PipelineError>
    where
        // A boxed 'static future keeps the retry closure simple; an elided
        // lifetime ('_) is not allowed in this bound
        F: FnMut() -> BoxFuture<'static, Result<T, PipelineError>>,
    {
        let max_retries = 3;
        let mut attempts = 0;

        loop {
            match operation().await {
                Ok(result) => return Ok(result),
                Err(PipelineError::DatabaseError(_)) if attempts < max_retries => {
                    attempts += 1;
                    tokio::time::sleep(
                        Duration::from_millis(100 * 2_u64.pow(attempts))
                    ).await;
                    continue;
                }
                Err(e) => return Err(e),
            }
        }
    }
}
}

Performance Optimizations

Connection Pooling

Configure optimal pool settings:

#![allow(unused)]
fn main() {
use sqlx::sqlite::SqlitePoolOptions;

impl SqlitePipelineRepository {
    pub async fn new_with_pool_config(
        database_path: &str,
        max_connections: u32,
    ) -> Result<Self, PipelineError> {
        let database_url = format!("sqlite:{}", database_path);

        let pool = SqlitePoolOptions::new()
            .max_connections(max_connections)
            .min_connections(5)
            .acquire_timeout(Duration::from_secs(10))
            .idle_timeout(Duration::from_secs(600))
            .max_lifetime(Duration::from_secs(1800))
            .connect(&database_url)
            .await?;

        Ok(Self { pool })
    }
}
}

Batch Operations

Optimize bulk inserts:

#![allow(unused)]
fn main() {
impl SqlitePipelineRepository {
    async fn save_batch(&self, pipelines: &[Pipeline])
        -> Result<(), PipelineError> {
        let mut tx = self.pool.begin().await?;

        // Build batch insert query
        let mut query_builder = sqlx::QueryBuilder::new(
            "INSERT INTO pipelines
             (id, name, description, archived, created_at, updated_at)"
        );

        query_builder.push_values(pipelines, |mut b, pipeline| {
            b.push_bind(pipeline.id().to_string())
             .push_bind(pipeline.name())
             .push_bind(pipeline.description())
             .push_bind(pipeline.archived())
             .push_bind(pipeline.created_at().to_rfc3339())
             .push_bind(pipeline.updated_at().to_rfc3339());
        });

        query_builder.build()
            .execute(&mut *tx)
            .await?;

        tx.commit().await?;
        Ok(())
    }
}
}

Query Optimization

Use indexes and optimized queries:

#![allow(unused)]
fn main() {
// ✅ GOOD: Uses index on pipeline_id
sqlx::query(
    "SELECT * FROM pipeline_stages
     WHERE pipeline_id = ?
     ORDER BY order_index"
)
.bind(id)
.fetch_all(&pool)
.await?;

// ❌ BAD: Full table scan
sqlx::query(
    "SELECT * FROM pipeline_stages
     WHERE name LIKE '%test%'"
)
.fetch_all(&pool)
.await?;

// ✅ BETTER: Use full-text search or specific index
sqlx::query(
    "SELECT * FROM pipeline_stages
     WHERE name = ?"
)
.bind("test")
.fetch_all(&pool)
.await?;
}

Testing Strategies

In-Memory Repository

Create test implementation:

#![allow(unused)]
fn main() {
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

pub struct InMemoryPipelineRepository {
    pipelines: Arc<Mutex<HashMap<PipelineId, Pipeline>>>,
}

impl InMemoryPipelineRepository {
    pub fn new() -> Self {
        Self {
            pipelines: Arc::new(Mutex::new(HashMap::new())),
        }
    }
}

#[async_trait]
impl PipelineRepository for InMemoryPipelineRepository {
    async fn save(&self, pipeline: &Pipeline) -> Result<(), PipelineError> {
        let mut pipelines = self.pipelines.lock().unwrap();

        if pipelines.contains_key(pipeline.id()) {
            return Err(PipelineError::AlreadyExists(
                pipeline.id().to_string()
            ));
        }

        pipelines.insert(pipeline.id().clone(), pipeline.clone());
        Ok(())
    }

    async fn find_by_id(&self, id: PipelineId)
        -> Result<Option<Pipeline>, PipelineError> {
        let pipelines = self.pipelines.lock().unwrap();
        Ok(pipelines.get(&id).cloned())
    }

    // ... implement other methods
}
}

Unit Tests

Test repository operations:

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_save_and_find() {
        let repo = InMemoryPipelineRepository::new();
        let pipeline = Pipeline::new("test".to_string(), vec![]).unwrap();

        // Save
        repo.save(&pipeline).await.unwrap();

        // Find
        let found = repo.find_by_id(pipeline.id().clone())
            .await
            .unwrap()
            .unwrap();

        assert_eq!(found.id(), pipeline.id());
        assert_eq!(found.name(), pipeline.name());
    }

    #[tokio::test]
    async fn test_duplicate_save_fails() {
        let repo = InMemoryPipelineRepository::new();
        let pipeline = Pipeline::new("test".to_string(), vec![]).unwrap();

        repo.save(&pipeline).await.unwrap();

        let result = repo.save(&pipeline).await;
        assert!(matches!(result, Err(PipelineError::AlreadyExists(_))));
    }
}
}

Integration Tests

Test with real database:

#![allow(unused)]
fn main() {
#[cfg(test)]
mod integration_tests {
    use super::*;

    async fn create_test_db() -> SqlitePipelineRepository {
        SqlitePipelineRepository::new(":memory:").await.unwrap()
    }

    #[tokio::test]
    async fn test_transaction_rollback() {
        let repo = create_test_db().await;
        let pipeline = Pipeline::new("test".to_string(), vec![]).unwrap();

        // Start transaction
        let mut tx = repo.pool.begin().await.unwrap();

        // Insert pipeline
        sqlx::query("INSERT INTO pipelines ...")
            .execute(&mut *tx)
            .await
            .unwrap();

        // Rollback
        tx.rollback().await.unwrap();

        // Verify pipeline was not saved
        let found = repo.find_by_id(pipeline.id().clone()).await.unwrap();
        assert!(found.is_none());
    }
}
}

Best Practices

Use Parameterized Queries

Prevent SQL injection:

#![allow(unused)]
fn main() {
// ✅ GOOD: Parameterized query
sqlx::query("SELECT * FROM pipelines WHERE name = ?")
    .bind(name)
    .fetch_one(&pool)
    .await?;

// ❌ BAD: String concatenation (SQL injection risk!)
let query = format!("SELECT * FROM pipelines WHERE name = '{}'", name);
sqlx::query(&query).fetch_one(&pool).await?;
}

Handle NULL Values

Properly handle nullable columns:

#![allow(unused)]
fn main() {
let description: Option<String> = row.try_get("description")?;
let chunk_size: Option<i64> = row.try_get("chunk_size")?;

let pipeline = Pipeline {
    description: description.unwrap_or_default(),
    // `?` cannot be used inside the closure passed to `map`; transpose the
    // Option<Result<_, _>> into a Result<Option<_>, _> first
    chunk_size: chunk_size.map(|cs| ChunkSize::new(cs as u64)).transpose()?,
    // ...
};
}

Use Foreign Keys

Maintain referential integrity:

CREATE TABLE pipeline_stages (
    id TEXT PRIMARY KEY,
    pipeline_id TEXT NOT NULL,
    -- ... other columns
    FOREIGN KEY (pipeline_id)
        REFERENCES pipelines(id)
        ON DELETE CASCADE
);

Index Strategic Columns

Optimize query performance:

-- Primary lookups
CREATE INDEX idx_pipelines_id ON pipelines(id);
CREATE INDEX idx_pipelines_name ON pipelines(name);

-- Filtering
CREATE INDEX idx_pipelines_archived ON pipelines(archived);

-- Foreign keys
CREATE INDEX idx_stages_pipeline ON pipeline_stages(pipeline_id);

-- Sorting
CREATE INDEX idx_stages_order
    ON pipeline_stages(pipeline_id, order_index);

Next Steps

Now that you understand repository implementation, continue with Schema Management below for database schema design and migration strategies.

Schema Management

Version: 1.0 Date: 2025-10-04 License: BSD-3-Clause Copyright: (c) 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Active


Overview

The Adaptive Pipeline uses SQLite for data persistence with an automated schema management system powered by sqlx migrations. This chapter explains the database schema design, migration strategy, and best practices for schema evolution.

Key Features

  • Automatic Migrations: Schema automatically initialized and updated on startup
  • Version Tracking: Migrations tracked in _sqlx_migrations table
  • Idempotent: Safe to run migrations multiple times
  • Normalized Design: Proper foreign keys and referential integrity
  • Performance Indexed: Strategic indexes for common queries
  • Test-Friendly: Support for in-memory databases

Database Schema

Entity-Relationship Diagram

┌─────────────────────────────────────────────────────────────┐
│                        pipelines                            │
├─────────────────────────────────────────────────────────────┤
│ id (PK)             TEXT                                    │
│ name                TEXT UNIQUE NOT NULL                    │
│ archived            BOOLEAN DEFAULT false                   │
│ created_at          TEXT NOT NULL                           │
│ updated_at          TEXT NOT NULL                           │
└────────────────┬────────────────────────────────────────────┘
                 │
                 │ 1:N
                 │
    ┌────────────┼──────────────────┐
    │            │                  │
    ▼            ▼                  ▼
┌─────────────────┐  ┌───────────────────────┐  ┌──────────────────┐
│ pipeline_stages │  │pipeline_configuration │  │processing_metrics│
├─────────────────┤  ├───────────────────────┤  ├──────────────────┤
│ id (PK)         │  │ pipeline_id (PK,FK)   │  │ pipeline_id (PK,FK)│
│ pipeline_id (FK)│  │ key (PK)              │  │ bytes_processed  │
│ name            │  │ value                 │  │ throughput_*     │
│ stage_type      │  │ archived              │  │ error_count      │
│ algorithm       │  │ created_at            │  │ ...              │
│ enabled         │  │ updated_at            │  └──────────────────┘
│ stage_order     │  └───────────────────────┘
│ ...             │
└────────┬────────┘
         │
         │ 1:N
         │
         ▼
┌──────────────────┐
│ stage_parameters │
├──────────────────┤
│ stage_id (PK,FK) │
│ key (PK)         │
│ value            │
│ archived         │
│ created_at       │
│ updated_at       │
└──────────────────┘

Tables Overview

Table                   | Purpose                                | Relationships
------------------------|----------------------------------------|------------------------------------------
pipelines               | Core pipeline configurations           | Parent of stages, config, metrics
pipeline_stages         | Processing stages within pipelines     | Child of pipelines, parent of parameters
pipeline_configuration  | Key-value configuration for pipelines  | Child of pipelines
stage_parameters        | Key-value parameters for stages        | Child of pipeline_stages
processing_metrics      | Execution metrics and statistics       | Child of pipelines

Table Schemas

pipelines

The root table for pipeline management:

CREATE TABLE IF NOT EXISTS pipelines (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    archived BOOLEAN NOT NULL DEFAULT false,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL
);

Columns:

  • id: UUID or unique identifier (e.g., "pipeline-123")
  • name: Human-readable name (unique constraint)
  • archived: Soft delete flag (false = active, true = archived)
  • created_at: RFC3339 timestamp of creation
  • updated_at: RFC3339 timestamp of last modification

Constraints:

  • Primary key on id
  • Unique constraint on name
  • Indexed on name WHERE archived = false for active pipeline lookups
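
The active-pipeline lookup mentioned above can be illustrated with a short sqlx sketch (assuming a pool is in scope); keeping the archived filter in the WHERE clause is what allows SQLite to use the partial index on name:

#![allow(unused)]
fn main() {
// Sketch: look up an active pipeline by name; the `archived = false` filter
// matches the partial index on name described above.
let row = sqlx::query(
    "SELECT id, name, created_at, updated_at
     FROM pipelines
     WHERE name = ? AND archived = false",
)
.bind("my-pipeline")
.fetch_optional(&pool)
.await?;
}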

pipeline_stages

Defines the ordered stages within a pipeline:

CREATE TABLE IF NOT EXISTS pipeline_stages (
    id TEXT PRIMARY KEY,
    pipeline_id TEXT NOT NULL,
    name TEXT NOT NULL,
    stage_type TEXT NOT NULL,
    enabled BOOLEAN NOT NULL DEFAULT TRUE,
    stage_order INTEGER NOT NULL,
    algorithm TEXT NOT NULL,
    parallel_processing BOOLEAN NOT NULL DEFAULT FALSE,
    chunk_size INTEGER,
    archived BOOLEAN NOT NULL DEFAULT FALSE,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    FOREIGN KEY (pipeline_id) REFERENCES pipelines(id) ON DELETE CASCADE
);

Columns:

  • id: Unique stage identifier
  • pipeline_id: Foreign key to owning pipeline
  • name: Stage name (e.g., "compression", "encryption")
  • stage_type: Type of stage (enum: compression, encryption, checksum)
  • enabled: Whether stage is active
  • stage_order: Execution order (0-based)
  • algorithm: Specific algorithm (e.g., "zstd", "aes-256-gcm")
  • parallel_processing: Whether stage can process chunks in parallel
  • chunk_size: Optional chunk size override for this stage
  • archived: Soft delete flag
  • created_at, updated_at: Timestamps

Constraints:

  • Primary key on id
  • Foreign key to pipelines(id) with CASCADE delete
  • Indexed on (pipeline_id, stage_order) for ordered retrieval
  • Indexed on pipeline_id for pipeline lookups

pipeline_configuration

Key-value configuration storage for pipelines:

CREATE TABLE IF NOT EXISTS pipeline_configuration (
    pipeline_id TEXT NOT NULL,
    key TEXT NOT NULL,
    value TEXT NOT NULL,
    archived BOOLEAN NOT NULL DEFAULT FALSE,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    PRIMARY KEY (pipeline_id, key),
    FOREIGN KEY (pipeline_id) REFERENCES pipelines(id) ON DELETE CASCADE
);

Columns:

  • pipeline_id: Foreign key to pipeline
  • key: Configuration key (e.g., "max_workers", "buffer_size")
  • value: Configuration value (stored as TEXT, parsed by application)
  • archived, created_at, updated_at: Standard metadata

Constraints:

  • Composite primary key on (pipeline_id, key)
  • Foreign key to pipelines(id) with CASCADE delete
  • Indexed on pipeline_id

Usage Example:

pipeline_id                          | key           | value
-------------------------------------|---------------|-------
pipeline-abc-123                     | max_workers   | 4
pipeline-abc-123                     | buffer_size   | 1048576
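
Because values are stored as TEXT, the application parses them when reading. A minimal sketch (the load_max_workers helper is hypothetical and assumes the PipelineError conversions used elsewhere in this chapter):

#![allow(unused)]
fn main() {
// Hypothetical helper: read one configuration value and parse it.
async fn load_max_workers(
    pool: &sqlx::SqlitePool,
    pipeline_id: &str,
) -> Result<usize, PipelineError> {
    let value: Option<String> = sqlx::query_scalar(
        "SELECT value FROM pipeline_configuration
         WHERE pipeline_id = ? AND key = 'max_workers'",
    )
    .bind(pipeline_id)
    .fetch_optional(pool)
    .await?;

    // Values are TEXT; parse, falling back to a default when the key is absent
    match value {
        Some(v) => v.parse::<usize>().map_err(|e| {
            PipelineError::database_error(format!("invalid max_workers: {}", e))
        }),
        None => Ok(4),
    }
}
}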

stage_parameters

Key-value parameters for individual stages:

CREATE TABLE IF NOT EXISTS stage_parameters (
    stage_id TEXT NOT NULL,
    key TEXT NOT NULL,
    value TEXT NOT NULL,
    archived BOOLEAN NOT NULL DEFAULT FALSE,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    PRIMARY KEY (stage_id, key),
    FOREIGN KEY (stage_id) REFERENCES pipeline_stages(id) ON DELETE CASCADE
);

Columns:

  • stage_id: Foreign key to stage
  • key: Parameter key (e.g., "compression_level", "key_size")
  • value: Parameter value (TEXT, parsed by stage)
  • archived, created_at, updated_at: Standard metadata

Constraints:

  • Composite primary key on (stage_id, key)
  • Foreign key to pipeline_stages(id) with CASCADE delete
  • Indexed on stage_id

Usage Example:

stage_id                | key                | value
------------------------|--------------------|---------
stage-comp-456          | compression_level  | 9
stage-enc-789           | key_size           | 256

processing_metrics

Tracks execution metrics for pipeline runs:

CREATE TABLE IF NOT EXISTS processing_metrics (
    pipeline_id TEXT PRIMARY KEY,
    bytes_processed INTEGER NOT NULL DEFAULT 0,
    bytes_total INTEGER NOT NULL DEFAULT 0,
    chunks_processed INTEGER NOT NULL DEFAULT 0,
    chunks_total INTEGER NOT NULL DEFAULT 0,
    start_time_rfc3339 TEXT,
    end_time_rfc3339 TEXT,
    processing_duration_ms INTEGER,
    throughput_bytes_per_second REAL NOT NULL DEFAULT 0.0,
    compression_ratio REAL,
    error_count INTEGER NOT NULL DEFAULT 0,
    warning_count INTEGER NOT NULL DEFAULT 0,
    input_file_size_bytes INTEGER NOT NULL DEFAULT 0,
    output_file_size_bytes INTEGER NOT NULL DEFAULT 0,
    input_file_checksum TEXT,
    output_file_checksum TEXT,
    FOREIGN KEY (pipeline_id) REFERENCES pipelines(id) ON DELETE CASCADE
);

Columns:

  • pipeline_id: Foreign key to pipeline (also primary key - one metric per pipeline)
  • Progress tracking: bytes_processed, bytes_total, chunks_processed, chunks_total
  • Timing: start_time_rfc3339, end_time_rfc3339, processing_duration_ms
  • Performance: throughput_bytes_per_second, compression_ratio
  • Status: error_count, warning_count
  • File info: input_file_size_bytes, output_file_size_bytes
  • Integrity: input_file_checksum, output_file_checksum

Constraints:

  • Primary key on pipeline_id
  • Foreign key to pipelines(id) with CASCADE delete
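
The derived columns can be cross-checked from the raw ones: throughput is bytes processed divided by elapsed seconds. A sketch of that check (assuming a pool in scope and the error conversions used elsewhere in this chapter):

#![allow(unused)]
fn main() {
// Sketch: recompute throughput from the stored raw columns and compare it
// with the persisted throughput_bytes_per_second value.
let rows: Vec<(String, i64, Option<i64>, f64)> = sqlx::query_as(
    "SELECT pipeline_id, bytes_processed, processing_duration_ms,
            throughput_bytes_per_second
     FROM processing_metrics",
)
.fetch_all(&pool)
.await?;

for (pipeline_id, bytes, duration_ms, stored) in rows {
    if let Some(ms) = duration_ms.filter(|ms| *ms > 0) {
        let derived = bytes as f64 * 1000.0 / ms as f64; // bytes per second
        println!("{pipeline_id}: stored {stored:.1} B/s, derived {derived:.1} B/s");
    }
}
}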

Migrations with sqlx

Migration Files

Migrations live in the /migrations directory at the project root:

migrations/
└── 20250101000000_initial_schema.sql

Naming Convention: {timestamp}_{description}.sql

  • Timestamp: YYYYMMDDHHMMSS format
  • Description: snake_case description of changes

Migration Structure

Each migration file contains:

-- Migration: 20250101000000_initial_schema.sql
-- Description: Initial database schema for pipeline management

-- Table creation
CREATE TABLE IF NOT EXISTS pipelines (...);
CREATE TABLE IF NOT EXISTS pipeline_stages (...);
-- ... more tables ...

-- Index creation
CREATE INDEX IF NOT EXISTS idx_pipeline_stages_pipeline_id ON pipeline_stages(pipeline_id);
-- ... more indexes ...

sqlx Migration Macro

The sqlx::migrate!() macro embeds migrations at compile time:

#![allow(unused)]
fn main() {
// In schema.rs
pub async fn ensure_schema(pool: &SqlitePool) -> Result<(), sqlx::Error> {
    debug!("Ensuring database schema is up to date");

    // Run migrations - sqlx will automatically track what's been applied
    sqlx::migrate!("../migrations").run(pool).await?;

    info!("Database schema is up to date");
    Ok(())
}
}

How it works:

  1. sqlx::migrate!("../migrations") scans directory at compile time
  2. Embeds migration SQL into binary
  3. run(pool) executes pending migrations at runtime
  4. Tracks applied migrations in _sqlx_migrations table

Schema Initialization

Automatic Initialization

The schema module provides convenience functions for database setup:

#![allow(unused)]
fn main() {
/// High-level initialization function
pub async fn initialize_database(database_url: &str) -> Result<SqlitePool, sqlx::Error> {
    // 1. Create database if it doesn't exist
    create_database_if_missing(database_url).await?;

    // 2. Connect to database
    let pool = SqlitePool::connect(database_url).await?;

    // 3. Run migrations
    ensure_schema(&pool).await?;

    Ok(pool)
}
}

Usage in application startup:

use adaptive_pipeline::infrastructure::repositories::schema;

#[tokio::main]
async fn main() -> Result<()> {
    // Initialize database with schema
    let pool = schema::initialize_database("sqlite://./pipeline.db").await?;

    // Database is ready to use!
    let repository = SqlitePipelineRepository::new(pool); // assumes a constructor variant that accepts an existing pool

    Ok(())
}

Create Database if Missing

For file-based SQLite databases:

#![allow(unused)]
fn main() {
use sqlx::migrate::MigrateDatabase;

pub async fn create_database_if_missing(database_url: &str) -> Result<(), sqlx::Error> {
    if !sqlx::Sqlite::database_exists(database_url).await? {
        debug!("Database does not exist, creating: {}", database_url);
        sqlx::Sqlite::create_database(database_url).await?;
        info!("Created new SQLite database: {}", database_url);
    } else {
        debug!("Database already exists: {}", database_url);
    }
    Ok(())
}
}

Handles:

  • New database creation
  • Existing database detection
  • File system permissions

In-Memory Databases

For testing, use in-memory databases:

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_with_in_memory_db() {
    // No file system needed
    let pool = schema::initialize_database("sqlite::memory:")
        .await
        .unwrap();

    // Database is fully initialized in memory
    // ... run tests ...
}
}

Migration Tracking

_sqlx_migrations Table

sqlx automatically creates a tracking table:

CREATE TABLE _sqlx_migrations (
    version BIGINT PRIMARY KEY,
    description TEXT NOT NULL,
    installed_on TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    success BOOLEAN NOT NULL,
    checksum BLOB NOT NULL,
    execution_time BIGINT NOT NULL
);

Columns:

  • version: Migration timestamp (e.g., 20250101000000)
  • description: Migration description
  • installed_on: When migration was applied
  • success: Whether migration succeeded
  • checksum: SHA256 of migration SQL
  • execution_time: Duration in milliseconds

Querying Applied Migrations

#![allow(unused)]
fn main() {
let migrations: Vec<(i64, String)> = sqlx::query_as(
    "SELECT version, description FROM _sqlx_migrations ORDER BY version"
)
.fetch_all(&pool)
.await?;

for (version, description) in migrations {
    println!("Applied migration: {} - {}", version, description);
}
}

Adding New Migrations

Step 1: Create Migration File

Create a new file in /migrations:

# Generate timestamp
TIMESTAMP=$(date +%Y%m%d%H%M%S)

# Create migration file
touch migrations/${TIMESTAMP}_add_pipeline_tags.sql

Step 2: Write Migration SQL

-- migrations/20250204120000_add_pipeline_tags.sql
-- Add tagging support for pipelines

CREATE TABLE IF NOT EXISTS pipeline_tags (
    pipeline_id TEXT NOT NULL,
    tag TEXT NOT NULL,
    created_at TEXT NOT NULL,
    PRIMARY KEY (pipeline_id, tag),
    FOREIGN KEY (pipeline_id) REFERENCES pipelines(id) ON DELETE CASCADE
);

CREATE INDEX IF NOT EXISTS idx_pipeline_tags_tag ON pipeline_tags(tag);

Step 3: Test Migration

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_new_migration() {
    let pool = schema::initialize_database("sqlite::memory:")
        .await
        .unwrap();

    // Verify new table exists
    let count: i64 = sqlx::query_scalar(
        "SELECT COUNT(*) FROM sqlite_master WHERE type='table' AND name='pipeline_tags'"
    )
    .fetch_one(&pool)
    .await
    .unwrap();

    assert_eq!(count, 1);
}
}

Step 4: Rebuild

# sqlx macro embeds migrations at compile time
cargo build

The next application start will automatically apply the new migration.


Indexes and Performance

Current Indexes

-- Ordered stage retrieval
CREATE INDEX idx_pipeline_stages_order
ON pipeline_stages(pipeline_id, stage_order);

-- Stage lookup by pipeline
CREATE INDEX idx_pipeline_stages_pipeline_id
ON pipeline_stages(pipeline_id);

-- Configuration lookup
CREATE INDEX idx_pipeline_configuration_pipeline_id
ON pipeline_configuration(pipeline_id);

-- Parameter lookup
CREATE INDEX idx_stage_parameters_stage_id
ON stage_parameters(stage_id);

-- Active pipelines only
CREATE INDEX idx_pipelines_name
ON pipelines(name) WHERE archived = false;

Index Strategy

When to add indexes:

  • ✅ Foreign key columns (for JOIN performance)
  • ✅ Columns in WHERE clauses (for filtering)
  • ✅ Columns in ORDER BY (for sorting)
  • ✅ Partial indexes for common filters (e.g., WHERE archived = false)

When NOT to index:

  • ❌ Small tables (< 1000 rows)
  • ❌ Columns with low cardinality (few distinct values)
  • ❌ Columns rarely used in queries
  • ❌ Write-heavy columns (indexes slow INSERTs/UPDATEs)
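
A practical way to confirm whether a query benefits from an index is SQLite's EXPLAIN QUERY PLAN; a sketch run through sqlx (assuming the idx_pipelines_name partial index above and a pool in scope):

#![allow(unused)]
fn main() {
// Sketch: ask SQLite how it will execute the lookup; the plan detail should
// mention the index (e.g. "SEARCH pipelines USING INDEX idx_pipelines_name").
let plan: Vec<(i64, i64, i64, String)> = sqlx::query_as(
    "EXPLAIN QUERY PLAN
     SELECT id FROM pipelines WHERE name = ? AND archived = false",
)
.bind("my-pipeline")
.fetch_all(&pool)
.await?;

for (_, _, _, detail) in plan {
    println!("{}", detail);
}
}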

Best Practices

✅ DO

Use idempotent migrations

-- Safe to run multiple times
CREATE TABLE IF NOT EXISTS new_table (...);
CREATE INDEX IF NOT EXISTS idx_name ON table(column);

Include rollback comments

-- Migration: Add user_id column
-- Rollback: older SQLite versions lack DROP COLUMN support; recreate the table if needed

ALTER TABLE pipelines ADD COLUMN user_id TEXT;

Use transactions for multi-statement migrations

BEGIN TRANSACTION;

CREATE TABLE new_table (...);
INSERT INTO new_table SELECT ...;
DROP TABLE old_table;

COMMIT;

Test migrations with production-like data

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_migration_with_data() {
    let pool = schema::initialize_database("sqlite::memory:").await.unwrap();

    // Insert test data
    sqlx::query("INSERT INTO pipelines (...) VALUES (...)")
        .execute(&pool)
        .await
        .unwrap();

    // Run migration
    schema::ensure_schema(&pool).await.unwrap();

    // Verify data integrity
    // ...
}
}

❌ DON'T

Don't modify existing migrations

-- BAD: Editing 20250101000000_initial_schema.sql after deployment
-- This will cause checksum mismatch!

-- GOOD: Create a new migration to alter the schema
-- migrations/20250204000000_modify_pipeline_name.sql

Don't use database-specific features unnecessarily

-- BAD: SQLite-only (limits portability)
CREATE TABLE pipelines (
    id INTEGER PRIMARY KEY AUTOINCREMENT
);

-- GOOD: Portable approach
CREATE TABLE pipelines (
    id TEXT PRIMARY KEY  -- Application generates UUIDs
);

Don't forget foreign key constraints

-- BAD: No referential integrity
CREATE TABLE pipeline_stages (
    pipeline_id TEXT NOT NULL
);

-- GOOD: Enforced relationships
CREATE TABLE pipeline_stages (
    pipeline_id TEXT NOT NULL,
    FOREIGN KEY (pipeline_id) REFERENCES pipelines(id) ON DELETE CASCADE
);

Testing Schema Changes

Unit Tests

From schema.rs:

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_create_database_if_missing() {
    let temp = NamedTempFile::new().unwrap();
    let db_path = temp.path().to_str().unwrap();
    let db_url = format!("sqlite://{}", db_path);
    drop(temp); // Remove file

    // Should create the database
    create_database_if_missing(&db_url).await.unwrap();

    // Should succeed if already exists
    create_database_if_missing(&db_url).await.unwrap();
}
}

Integration Tests

From schema_integration_test.rs:

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_schema_migrations_run_automatically() {
    let pool = schema::initialize_database("sqlite::memory:")
        .await
        .unwrap();

    // Verify _sqlx_migrations table exists
    let result: i64 = sqlx::query_scalar(
        "SELECT COUNT(*) FROM _sqlx_migrations"
    )
    .fetch_one(&pool)
    .await
    .unwrap();

    assert!(result > 0, "At least one migration should be applied");
}
}

Idempotency Tests

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_schema_idempotent_initialization() {
    let db_url = "sqlite::memory:";

    // Initialize twice - should not error
    let _pool1 = schema::initialize_database(db_url).await.unwrap();
    let _pool2 = schema::initialize_database(db_url).await.unwrap();
}
}

Troubleshooting

Issue: Migration checksum mismatch

Symptom: Error: "migration checksum mismatch"

Cause: Existing migration file was modified after being applied

Solution:

# NEVER modify applied migrations!
# Instead, create a new migration to make changes

# If in development and migration hasn't been deployed:
# 1. Drop database
rm pipeline.db

# 2. Recreate with modified migration
cargo run

Issue: Database file locked

Symptom: Error: "database is locked"

Cause: Another process has an exclusive lock

Solution:

#![allow(unused)]
fn main() {
// Use a connection pool configured to wait for locks instead of failing
use std::str::FromStr;
use std::time::Duration;
use sqlx::sqlite::{SqliteConnectOptions, SqliteJournalMode, SqlitePool};

let pool = SqlitePool::connect_with(
    SqliteConnectOptions::from_str("sqlite://./pipeline.db")?
        .busy_timeout(Duration::from_secs(30))   // Wait up to 30s for the lock
        .journal_mode(SqliteJournalMode::Wal),   // WAL mode allows concurrent readers
)
.await?;
}

Issue: Foreign key constraint failed

Symptom: Error: "FOREIGN KEY constraint failed"

Cause: Trying to insert/update with invalid foreign key

Solution:

-- Enable foreign key enforcement (SQLite default is OFF)
PRAGMA foreign_keys = ON;

-- Then verify referenced row exists before insert
SELECT id FROM pipelines WHERE id = ?;

File I/O

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

This chapter provides a comprehensive overview of the file input/output architecture in the adaptive pipeline system. Learn how file chunks, type-safe paths, and streaming I/O work together to enable efficient, memory-safe file processing.


Overview

File I/O in the adaptive pipeline system is designed for efficient, memory-safe processing of files of any size through chunking, streaming, and intelligent memory management. The system uses immutable value objects and async I/O for optimal performance.

Key Features

  • Chunked Processing: Files split into manageable chunks for parallel processing
  • Streaming I/O: Process files without loading entirely into memory
  • Type-Safe Paths: Compile-time path category enforcement
  • Immutable Chunks: Thread-safe, corruption-proof file chunks
  • Validated Sizes: Chunk sizes validated at creation
  • Async Operations: Non-blocking I/O for better concurrency

File I/O Stack

┌──────────────────────────────────────────────────────────┐
│                  Application Layer                        │
│  ┌────────────────────────────────────────────────┐      │
│  │   File Processor Service                       │      │
│  │   - Orchestrates chunking and processing       │      │
│  └────────────────────────────────────────────────┘      │
└──────────────────────────────────────────────────────────┘
                         ↓ uses
┌──────────────────────────────────────────────────────────┐
│                    Domain Layer                           │
│  ┌────────────────────────────────────────────────┐      │
│  │   FileIOService (Trait)                        │      │
│  │   - read_file_chunks(), write_file_chunks()    │      │
│  └────────────────────────────────────────────────┘      │
│  ┌────────────┬───────────┬──────────────┐              │
│  │ FileChunk  │ FilePath  │  ChunkSize   │              │
│  │ (immutable)│(type-safe)│(validated)   │              │
│  └────────────┴───────────┴──────────────┘              │
└──────────────────────────────────────────────────────────┘
                         ↓ implements
┌──────────────────────────────────────────────────────────┐
│              Infrastructure Layer                         │
│  ┌────────────────────────────────────────────────┐      │
│  │   Async File I/O Implementation                │      │
│  │   - tokio::fs for async operations             │      │
│  │   - Streaming, chunking, buffering             │      │
│  └────────────────────────────────────────────────┘      │
└──────────────────────────────────────────────────────────┘
                         ↓ reads/writes
┌──────────────────────────────────────────────────────────┐
│                  File System                              │
│  - Input files, output files, temporary files            │
└──────────────────────────────────────────────────────────┘

Design Principles

  1. Immutability: File chunks cannot be modified after creation
  2. Streaming: Process files without loading entirely into memory
  3. Type Safety: Compile-time path category enforcement
  4. Async-First: Non-blocking I/O for better concurrency
  5. Memory Efficiency: Bounded memory usage regardless of file size

File I/O Architecture

The file I/O layer uses value objects and async services to provide efficient, safe file processing.

Architectural Components

┌─────────────────────────────────────────────────────────────┐
│ Value Objects (Domain)                                      │
│  ┌────────────────┬────────────────┬─────────────────┐     │
│  │  FileChunk     │  FilePath<T>   │   ChunkSize     │     │
│  │  - Immutable   │  - Type-safe   │   - Validated   │     │
│  │  - UUID ID     │  - Category    │   - 1B-512MB    │     │
│  │  - Sequence #  │  - Validated   │   - Default 1MB │     │
│  └────────────────┴────────────────┴─────────────────┘     │
└─────────────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────────────┐
│ Service Interface (Domain)                                  │
│  ┌──────────────────────────────────────────────────┐      │
│  │  FileIOService (async trait)                     │      │
│  │  - read_file_chunks()                            │      │
│  │  - write_file_chunks()                           │      │
│  │  - stream_chunks()                               │      │
│  └──────────────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────────────┐
│ Implementation (Infrastructure)                             │
│  ┌──────────────────────────────────────────────────┐      │
│  │  Async File I/O                                  │      │
│  │  - tokio::fs::File                               │      │
│  │  - Buffering, streaming                          │      │
│  │  - Memory mapping (large files)                  │      │
│  └──────────────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────────────┘

Processing Flow

File → Chunks → Processing → Chunks → File:

Input File (e.g., 100MB)
        ↓
Split into chunks (1MB each)
        ↓
┌─────────┬─────────┬─────────┬─────────┐
│ Chunk 0 │ Chunk 1 │ Chunk 2 │   ...   │
│ (1MB)   │ (1MB)   │ (1MB)   │ (1MB)   │
└─────────┴─────────┴─────────┴─────────┘
        ↓ parallel processing
┌─────────┬─────────┬─────────┬─────────┐
│Processed│Processed│Processed│   ...   │
│ Chunk 0 │ Chunk 1 │ Chunk 2 │ (1MB)   │
└─────────┴─────────┴─────────┴─────────┘
        ↓
Reassemble chunks
        ↓
Output File (100MB)
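
Putting the flow together, a condensed sketch of read → process → write (processing is shown sequentially for clarity; transform_chunk is a placeholder for the configured stages, not an API of the system):

#![allow(unused)]
fn main() {
// Condensed sketch of the flow above: read chunks, transform each one,
// then reassemble the results into the output file.
async fn run(
    service: &impl FileIOService,
    input: &std::path::Path,
    output: &std::path::Path,
) -> Result<(), PipelineError> {
    let chunks = service
        .read_file_chunks(input, ChunkSize::from_mb(1)?)
        .await?;

    let mut processed = Vec::with_capacity(chunks.len());
    for chunk in chunks {
        processed.push(transform_chunk(chunk)?); // placeholder for stage processing
    }

    service.write_file_chunks(output, processed).await
}
}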

FileChunk Value Object

FileChunk is an immutable value object representing a portion of a file for processing.

FileChunk Structure

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
pub struct FileChunk {
    id: Uuid,                           // Unique identifier
    sequence_number: u64,               // Order in file (0-based)
    offset: u64,                        // Byte offset in original file
    size: ChunkSize,                    // Validated chunk size
    data: Vec<u8>,                      // Actual chunk data
    checksum: Option<String>,           // Optional SHA-256 checksum
    is_final: bool,                     // Last chunk flag
    created_at: DateTime<Utc>,          // Creation timestamp
}
}

Key Characteristics

Feature            | Description
-------------------|-------------------------------------------
Immutability       | Once created, chunks cannot be modified
Unique Identity    | Each chunk has a UUID for tracking
Sequence Ordering  | Maintains position for reassembly
Integrity          | Optional checksums for validation
Thread Safety      | Fully thread-safe due to immutability

Creating Chunks

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::FileChunk;

// Basic chunk creation
let data = vec![1, 2, 3, 4, 5];
let chunk = FileChunk::new(
    0,        // sequence_number
    0,        // offset
    data,     // data
    false,    // is_final
)?;

println!("Chunk ID: {}", chunk.id());
println!("Size: {} bytes", chunk.size().bytes());
}

Chunk with Checksum

#![allow(unused)]
fn main() {
// Create chunk with checksum
let data = vec![1, 2, 3, 4, 5];
let chunk = FileChunk::new(0, 0, data, false)?
    .with_calculated_checksum();

// Verify checksum
if let Some(checksum) = chunk.checksum() {
    println!("Checksum: {}", checksum);
}

// Verify data integrity
assert!(chunk.verify_checksum());
}

Chunk Properties

#![allow(unused)]
fn main() {
// Access chunk properties
println!("ID: {}", chunk.id());
println!("Sequence: {}", chunk.sequence_number());
println!("Offset: {}", chunk.offset());
println!("Size: {} bytes", chunk.size().bytes());
println!("Is final: {}", chunk.is_final());
println!("Created: {}", chunk.created_at());

// Access data
let data: &[u8] = chunk.data();
}

FilePath Value Object

FilePath<T> is a type-safe, validated file path with compile-time category enforcement.

Path Categories

#![allow(unused)]
fn main() {
// Type-safe path categories
pub struct InputPath;      // For input files
pub struct OutputPath;     // For output files
pub struct TempPath;       // For temporary files
pub struct LogPath;        // For log files

// Usage with phantom types
let input: FilePath<InputPath> = FilePath::new("./input.dat")?;
let output: FilePath<OutputPath> = FilePath::new("./output.dat")?;

// ✅ Type system prevents mixing categories
// ❌ Cannot assign input path to output variable
// let wrong: FilePath<OutputPath> = input;  // Compile error!
}

Path Validation

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::value_objects::{FilePath, InputPath};

// Create and validate path
let path = FilePath::<InputPath>::new("/path/to/file.dat")?;

// Path properties
println!("Path: {}", path.as_str());
println!("Exists: {}", path.exists());
println!("Is file: {}", path.is_file());
println!("Is dir: {}", path.is_dir());
println!("Is absolute: {}", path.is_absolute());

// Convert to std::path::Path
let std_path: &Path = path.as_path();
}

Path Operations

#![allow(unused)]
fn main() {
// Get file name
let file_name = path.file_name();

// Get parent directory
let parent = path.parent();

// Get file extension
let extension = path.extension();

// Convert to string
let path_str = path.to_string();
}

ChunkSize Value Object

ChunkSize represents a validated chunk size within system bounds.

Size Constraints

#![allow(unused)]
fn main() {
// Chunk size constants
ChunkSize::MIN_SIZE  // 1 byte
ChunkSize::MAX_SIZE  // 512 MB
ChunkSize::DEFAULT   // 1 MB
}

Creating Chunk Sizes

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::ChunkSize;

// From bytes
let size = ChunkSize::new(1024 * 1024)?;  // 1 MB
assert_eq!(size.bytes(), 1_048_576);

// From kilobytes
let size_kb = ChunkSize::from_kb(512)?;  // 512 KB
assert_eq!(size_kb.kilobytes(), 512.0);

// From megabytes
let size_mb = ChunkSize::from_mb(16)?;  // 16 MB
assert_eq!(size_mb.megabytes(), 16.0);

// Default (1 MB)
let default_size = ChunkSize::default();
assert_eq!(default_size.megabytes(), 1.0);
}

Size Validation

#![allow(unused)]
fn main() {
// ✅ Valid sizes
let valid = ChunkSize::new(64 * 1024)?;  // 64 KB

// ❌ Invalid: too small
let too_small = ChunkSize::new(0);  // Error: must be ≥ 1 byte
assert!(too_small.is_err());

// ❌ Invalid: too large
let too_large = ChunkSize::new(600 * 1024 * 1024);  // Error: must be ≤ 512 MB
assert!(too_large.is_err());
}

Optimal Sizing

#![allow(unused)]
fn main() {
// Calculate optimal chunk size for file
let file_size = 100 * 1024 * 1024;  // 100 MB
let optimal = ChunkSize::optimal_for_file_size(file_size);

println!("Optimal chunk size: {} MB", optimal.megabytes());

// Size conversions
let size = ChunkSize::from_mb(4)?;
println!("Bytes: {}", size.bytes());
println!("Kilobytes: {}", size.kilobytes());
println!("Megabytes: {}", size.megabytes());
}

FileIOService Interface

FileIOService is an async trait defining file I/O operations.

Service Interface

#![allow(unused)]
fn main() {
#[async_trait]
pub trait FileIOService: Send + Sync {
    /// Read file into chunks
    async fn read_file_chunks(
        &self,
        path: &Path,
        chunk_size: ChunkSize,
    ) -> Result<Vec<FileChunk>, PipelineError>;

    /// Write chunks to file
    async fn write_file_chunks(
        &self,
        path: &Path,
        chunks: Vec<FileChunk>,
    ) -> Result<(), PipelineError>;

    /// Stream chunks for processing
    async fn stream_chunks(
        &self,
        path: &Path,
        chunk_size: ChunkSize,
    ) -> Result<impl Stream<Item = Result<FileChunk, PipelineError>>, PipelineError>;
}
}

Why Async?

The service is async because file I/O is I/O-bound, not CPU-bound:

Benefits:

  • Non-Blocking: Doesn't block the async runtime
  • Concurrent: Multiple files can be processed concurrently
  • tokio Integration: Natural integration with tokio::fs
  • Performance: Better throughput for I/O operations

Classification:

  • This is an infrastructure port, not a domain service
  • Domain services (compression, encryption) are CPU-bound and sync
  • Infrastructure ports (file I/O, network, database) are I/O-bound and async
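
For example, the concurrency benefit means two independent files can be read at the same time from a single task; a sketch using tokio::join! (the service variable and file paths are illustrative):

#![allow(unused)]
fn main() {
use std::path::Path;

// Sketch: read two files concurrently; both futures make progress without
// blocking the async runtime.
let (a, b) = tokio::join!(
    service.read_file_chunks(Path::new("./a.dat"), ChunkSize::from_mb(1)?),
    service.read_file_chunks(Path::new("./b.dat"), ChunkSize::from_mb(1)?),
);

let chunks_a = a?;
let chunks_b = b?;
println!("read {} + {} chunks", chunks_a.len(), chunks_b.len());
}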

File Reading

File reading uses streaming and chunking for memory-efficient processing.

Reading Small Files

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::FileIOService;

// Read entire file into chunks
let service: Arc<dyn FileIOService> = /* ... */;
let chunks = service.read_file_chunks(
    Path::new("./input.dat"),
    ChunkSize::from_mb(1)?,
).await?;

println!("Read {} chunks", chunks.len());
for chunk in chunks {
    println!("Chunk {}: {} bytes", chunk.sequence_number(), chunk.size().bytes());
}
}

Streaming Large Files

#![allow(unused)]
fn main() {
use tokio_stream::StreamExt;

// Stream chunks for memory efficiency
let mut stream = service.stream_chunks(
    Path::new("./large_file.dat"),
    ChunkSize::from_mb(4)?,
).await?;

while let Some(chunk_result) = stream.next().await {
    let chunk = chunk_result?;

    // Process chunk without loading entire file
    process_chunk(chunk).await?;
}
}

Reading with Metadata

#![allow(unused)]
fn main() {
use std::fs::metadata;

// Get file metadata first
let metadata = metadata("./input.dat")?;
let file_size = metadata.len();

// Choose optimal chunk size
let chunk_size = ChunkSize::optimal_for_file_size(file_size);

// Read with optimal settings
let chunks = service.read_file_chunks(
    Path::new("./input.dat"),
    chunk_size,
).await?;
}

File Writing

File writing assembles processed chunks back into complete files.

Writing Chunks to File

#![allow(unused)]
fn main() {
// Write chunks to output file
let processed_chunks: Vec<FileChunk> = /* ... */;

service.write_file_chunks(
    Path::new("./output.dat"),
    processed_chunks,
).await?;

println!("File written successfully");
}

Streaming Write

#![allow(unused)]
fn main() {
use tokio::fs::File;
use tokio::io::AsyncWriteExt;

// Stream chunks directly to file
let mut file = File::create("./output.dat").await?;

for chunk in processed_chunks {
    file.write_all(chunk.data()).await?;
}

file.flush().await?;
println!("File written: {} bytes", file.metadata().await?.len());
}

Atomic Writes

#![allow(unused)]
fn main() {
use tempfile::NamedTempFile;

// Write to temporary file first
let temp_file = NamedTempFile::new()?;
service.write_file_chunks(
    temp_file.path(),
    chunks,
).await?;

// Atomically move to final location
temp_file.persist("./output.dat")?;
}

Memory Management

The system uses several strategies to manage memory efficiently.

Bounded Memory Usage

File Size: 1 GB
Chunk Size: 4 MB
Memory Usage: ~4 MB (single chunk in memory at a time)

Without chunking: 1 GB in memory
With chunking: 4 MB in memory (256x reduction!)

Memory Mapping for Large Files

#![allow(unused)]
fn main() {
// Automatically uses memory mapping for files > threshold
let config = FileIOConfig {
    chunk_size: ChunkSize::from_mb(4)?,
    use_memory_mapping: true,
    memory_mapping_threshold: 100 * 1024 * 1024,  // 100 MB
    ..Default::default()
};

// Files > 100 MB use memory mapping
let chunks = service.read_file_chunks_with_config(
    Path::new("./large_file.dat"),
    &config,
).await?;
}

Streaming vs. Buffering

Streaming (Memory-Efficient):

#![allow(unused)]
fn main() {
// Process one chunk at a time
let mut stream = service.stream_chunks(path, chunk_size).await?;
while let Some(chunk) = stream.next().await {
    process_chunk(chunk?).await?;
}
// Peak memory: 1 chunk size
}

Buffering (Performance):

#![allow(unused)]
fn main() {
// Load all chunks into memory
let chunks = service.read_file_chunks(path, chunk_size).await?;
process_all_chunks(chunks).await?;
// Peak memory: all chunks
}

Error Handling

The file I/O system handles various error conditions.

Common Errors

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::PipelineError;

match service.read_file_chunks(path, chunk_size).await {
    Ok(chunks) => { /* success */ },
    Err(PipelineError::FileNotFound(path)) => {
        eprintln!("File not found: {}", path);
    },
    Err(PipelineError::PermissionDenied(path)) => {
        eprintln!("Permission denied: {}", path);
    },
    Err(PipelineError::InsufficientDiskSpace) => {
        eprintln!("Not enough disk space");
    },
    Err(PipelineError::InvalidChunk(msg)) => {
        eprintln!("Invalid chunk: {}", msg);
    },
    Err(e) => {
        eprintln!("I/O error: {}", e);
    },
}
}

Retry Logic

#![allow(unused)]
fn main() {
use tokio::time::{sleep, Duration};

async fn read_with_retry(
    service: &dyn FileIOService,
    path: &Path,
    chunk_size: ChunkSize,
    max_retries: u32,
) -> Result<Vec<FileChunk>, PipelineError> {
    let mut retries = 0;

    loop {
        match service.read_file_chunks(path, chunk_size).await {
            Ok(chunks) => return Ok(chunks),
            Err(e) if retries < max_retries => {
                retries += 1;
                warn!("Read failed (attempt {}/{}): {}", retries, max_retries, e);
                sleep(Duration::from_secs(1 << retries)).await;  // Exponential backoff
            },
            Err(e) => return Err(e),
        }
    }
}
}

Partial Reads

#![allow(unused)]
fn main() {
// Handle partial reads gracefully
async fn read_available_chunks(
    service: &dyn FileIOService,
    path: &Path,
    chunk_size: ChunkSize,
) -> Result<Vec<FileChunk>, PipelineError> {
    let mut chunks = Vec::new();
    let mut stream = service.stream_chunks(path, chunk_size).await?;

    while let Some(chunk_result) = stream.next().await {
        match chunk_result {
            Ok(chunk) => chunks.push(chunk),
            Err(e) => {
                warn!("Chunk read error: {}, stopping", e);
                break;  // Return partial results
            },
        }
    }

    Ok(chunks)
}
}

Performance Optimization

Several strategies optimize file I/O performance.

Optimal Chunk Size Selection

#![allow(unused)]
fn main() {
// Chunk size recommendations
fn optimal_chunk_size(file_size: u64) -> ChunkSize {
    match file_size {
        0..=10_485_760 => ChunkSize::from_mb(1).unwrap(),          // < 10 MB: 1 MB chunks
        10_485_761..=104_857_600 => ChunkSize::from_mb(4).unwrap(), // 10-100 MB: 4 MB chunks
        104_857_601..=1_073_741_824 => ChunkSize::from_mb(8).unwrap(), // 100 MB - 1 GB: 8 MB chunks
        _ => ChunkSize::from_mb(16).unwrap(),                       // > 1 GB: 16 MB chunks
    }
}
}

Parallel Processing

#![allow(unused)]
fn main() {
use futures::future::try_join_all;

// Process chunks in parallel
let futures: Vec<_> = chunks.into_iter()
    .map(|chunk| async move {
        process_chunk(chunk).await
    })
    .collect();

let results = try_join_all(futures).await?;
}

Buffered I/O

#![allow(unused)]
fn main() {
use tokio::io::BufReader;

// Use buffered reading for better performance
let file = File::open(path).await?;
let mut reader = BufReader::with_capacity(8 * 1024 * 1024, file);  // 8 MB buffer

// Read chunks with buffering
while let Some(chunk) = read_chunk(&mut reader).await? {
    process_chunk(chunk).await?;
}
}

Performance Benchmarks

| Operation | Small Files (< 10 MB) | Medium Files (100 MB) | Large Files (> 1 GB) |
|-----------|-----------------------|-----------------------|----------------------|
| Read      | ~50 MB/s              | ~200 MB/s             | ~300 MB/s            |
| Write     | ~40 MB/s              | ~180 MB/s             | ~280 MB/s            |
| Stream    | ~45 MB/s              | ~190 MB/s             | ~290 MB/s            |

Benchmarks on SSD with 4 MB chunks


Usage Examples

Example 1: Basic File Processing

use adaptive_pipeline_domain::{ChunkSize, FileChunk, FileIOService};
use std::path::Path;
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let service: Arc<dyn FileIOService> = /* ... */;

    // Read file
    let chunks = service.read_file_chunks(
        Path::new("./input.dat"),
        ChunkSize::from_mb(1)?,
    ).await?;

    println!("Read {} chunks", chunks.len());

    // Process chunks
    let processed: Vec<_> = chunks.into_iter()
        .map(|chunk| process_chunk(chunk))
        .collect();

    // Write output
    service.write_file_chunks(
        Path::new("./output.dat"),
        processed,
    ).await?;

    println!("Processing complete!");
    Ok(())
}

fn process_chunk(chunk: FileChunk) -> FileChunk {
    // Transform chunk data
    let transformed_data = chunk.data().to_vec();
    FileChunk::new(
        chunk.sequence_number(),
        chunk.offset(),
        transformed_data,
        chunk.is_final(),
    ).unwrap()
}

Example 2: Streaming Large Files

use adaptive_pipeline_domain::{ChunkSize, FileIOService};
use std::path::Path;
use std::sync::Arc;
use tokio_stream::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let service: Arc<dyn FileIOService> = /* ... */;

    // Stream large file
    let mut stream = service.stream_chunks(
        Path::new("./large_file.dat"),
        ChunkSize::from_mb(8)?,
    ).await?;

    let mut processed_chunks = Vec::new();

    while let Some(chunk_result) = stream.next().await {
        let chunk = chunk_result?;

        // Process chunk in streaming fashion
        let processed = process_chunk(chunk);
        processed_chunks.push(processed);

        // Optional: write chunks as they're processed
        // write_chunk_to_file(&processed).await?;
    }

    println!("Processed {} chunks", processed_chunks.len());
    Ok(())
}

Example 3: Parallel Chunk Processing

use adaptive_pipeline_domain::{ChunkSize, FileChunk, FileIOService, PipelineError};
use futures::future::try_join_all;
use std::path::Path;
use std::sync::Arc;
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let service: Arc<dyn FileIOService> = /* ... */;

    // Read all chunks
    let chunks = service.read_file_chunks(
        Path::new("./input.dat"),
        ChunkSize::from_mb(4)?,
    ).await?;

    // Process chunks in parallel
    let futures = chunks.into_iter().map(|chunk| {
        tokio::spawn(async move {
            process_chunk_async(chunk).await
        })
    });

    let results = try_join_all(futures).await?;
    let processed: Vec<_> = results.into_iter()
        .collect::<Result<Vec<_>, _>>()?;

    // Write results
    service.write_file_chunks(
        Path::new("./output.dat"),
        processed,
    ).await?;

    Ok(())
}

async fn process_chunk_async(chunk: FileChunk) -> Result<FileChunk, PipelineError> {
    // Async processing
    tokio::time::sleep(Duration::from_millis(10)).await;
    Ok(process_chunk(chunk))
}

Example 4: Error Handling and Retry

use adaptive_pipeline_domain::{ChunkSize, FileChunk, FileIOService, PipelineError};
use std::path::Path;
use std::sync::Arc;
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let service: Arc<dyn FileIOService> = /* ... */;
    let path = Path::new("./input.dat");
    let chunk_size = ChunkSize::from_mb(1)?;

    // Retry on failure
    let chunks = read_with_retry(&*service, path, chunk_size, 3).await?;

    println!("Successfully read {} chunks", chunks.len());
    Ok(())
}

async fn read_with_retry(
    service: &dyn FileIOService,
    path: &Path,
    chunk_size: ChunkSize,
    max_retries: u32,
) -> Result<Vec<FileChunk>, PipelineError> {
    for attempt in 1..=max_retries {
        match service.read_file_chunks(path, chunk_size).await {
            Ok(chunks) => return Ok(chunks),
            Err(e) if attempt < max_retries => {
                eprintln!("Attempt {}/{} failed: {}", attempt, max_retries, e);
                tokio::time::sleep(Duration::from_secs(2_u64.pow(attempt))).await;
            },
            Err(e) => return Err(e),
        }
    }
    unreachable!()
}

Best Practices

1. Choose Appropriate Chunk Sizes

#![allow(unused)]
fn main() {
// ✅ Good: Optimize chunk size for file
let file_size = metadata(path)?.len();
let chunk_size = ChunkSize::optimal_for_file_size(file_size);

// ❌ Bad: Fixed chunk size for all files
let chunk_size = ChunkSize::from_mb(1)?;  // May be suboptimal
}

2. Use Streaming for Large Files

#![allow(unused)]
fn main() {
// ✅ Good: Stream large files
let mut stream = service.stream_chunks(path, chunk_size).await?;
while let Some(chunk) = stream.next().await {
    process_chunk(chunk?).await?;
}

// ❌ Bad: Load entire large file into memory
let chunks = service.read_file_chunks(path, chunk_size).await?;
// Entire file in memory!
}

3. Validate Chunk Integrity

#![allow(unused)]
fn main() {
// ✅ Good: Verify checksums
for chunk in chunks {
    if !chunk.verify_checksum() {
        return Err(PipelineError::ChecksumMismatch);
    }
    process_chunk(chunk)?;
}
}

4. Handle Errors Gracefully

#![allow(unused)]
fn main() {
// ✅ Good: Specific error handling
match service.read_file_chunks(path, chunk_size).await {
    Ok(chunks) => process(chunks),
    Err(PipelineError::FileNotFound(_)) => handle_missing_file(),
    Err(PipelineError::PermissionDenied(_)) => handle_permissions(),
    Err(e) => handle_generic_error(e),
}
}

5. Use Type-Safe Paths

#![allow(unused)]
fn main() {
// ✅ Good: Type-safe paths
let input: FilePath<InputPath> = FilePath::new("./input.dat")?;
let output: FilePath<OutputPath> = FilePath::new("./output.dat")?;

// ❌ Bad: Raw strings
let input = "./input.dat";
let output = "./output.dat";
}

6. Clean Up Temporary Files

#![allow(unused)]
fn main() {
// ✅ Good: Automatic cleanup with tempfile
let temp = NamedTempFile::new()?;
write_chunks(temp.path(), chunks).await?;
// Automatically deleted when dropped

// Or explicit cleanup
temp.close()?;
}

7. Monitor Memory Usage

#![allow(unused)]
fn main() {
// ✅ Good: Track memory usage
let chunks_in_memory = chunks.len();
let memory_used = chunks_in_memory * chunk_size.bytes();
if memory_used > max_memory {
    // Switch to streaming
}
}

Troubleshooting

Issue 1: Out of Memory

Symptom:

Error: Out of memory

Solutions:

#![allow(unused)]
fn main() {
// 1. Reduce chunk size
let chunk_size = ChunkSize::from_mb(1)?;  // Smaller chunks

// 2. Use streaming instead of buffering
let mut stream = service.stream_chunks(path, chunk_size).await?;

// 3. Process chunks one at a time
while let Some(chunk) = stream.next().await {
    process_chunk(chunk?).await?;
    // Chunk dropped, memory freed
}
}

Issue 2: Slow File I/O

Diagnosis:

#![allow(unused)]
fn main() {
let start = Instant::now();
let chunks = service.read_file_chunks(path, chunk_size).await?;
let duration = start.elapsed();
println!("Read took: {:?}", duration);
}

Solutions:

#![allow(unused)]
fn main() {
// 1. Increase chunk size
let chunk_size = ChunkSize::from_mb(8)?;  // Larger chunks = fewer I/O ops

// 2. Use memory mapping for large files
let config = FileIOConfig {
    use_memory_mapping: true,
    memory_mapping_threshold: 50 * 1024 * 1024,  // 50 MB
    ..Default::default()
};

// 3. Use buffered I/O
let reader = BufReader::with_capacity(8 * 1024 * 1024, file);
}

Issue 3: Checksum Mismatch

Symptom:

Error: Checksum mismatch for chunk 42

Solutions:

#![allow(unused)]
fn main() {
// 1. Verify during read
let chunk = chunk.with_calculated_checksum();
if !chunk.verify_checksum() {
    // Re-read chunk
}

// 2. Log and skip corrupted chunks
// (inside the chunk-processing loop)
if chunk.verify_checksum() {
    process_chunk(chunk)?;
} else {
    warn!("Chunk {} corrupted, skipping", chunk.sequence_number());
    continue;
}
}

Issue 4: File Permission Errors

Symptom:

Error: Permission denied: ./output.dat

Solutions:

#![allow(unused)]
fn main() {
// 1. Check permissions before writing
use std::fs;
let metadata = fs::metadata(parent_dir)?;
if metadata.permissions().readonly() {
    return Err("Output directory is read-only".into());
}

// 2. Use appropriate path categories
let output: FilePath<OutputPath> = FilePath::new("./output.dat")?;
output.ensure_writable()?;
}

Testing Strategies

Unit Testing with Mock Chunks

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_chunk_creation() {
        let data = vec![1, 2, 3, 4, 5];
        let chunk = FileChunk::new(0, 0, data.clone(), false).unwrap();

        assert_eq!(chunk.sequence_number(), 0);
        assert_eq!(chunk.data(), &data);
        assert!(!chunk.is_final());
    }

    #[test]
    fn test_chunk_checksum() {
        let data = vec![1, 2, 3, 4, 5];
        let chunk = FileChunk::new(0, 0, data, false)
            .unwrap()
            .with_calculated_checksum();

        assert!(chunk.checksum().is_some());
        assert!(chunk.verify_checksum());
    }
}
}

Integration Testing with Files

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_file_round_trip() {
    let service: Arc<dyn FileIOService> = create_test_service();

    // Create test data
    let test_data = vec![0u8; 10 * 1024 * 1024];  // 10 MB
    let input_path = temp_dir().join("test_input.dat");
    std::fs::write(&input_path, &test_data).unwrap();

    // Read chunks
    let chunks = service.read_file_chunks(
        &input_path,
        ChunkSize::from_mb(1).unwrap(),
    ).await.unwrap();

    assert_eq!(chunks.len(), 10);  // 10 x 1MB chunks

    // Write chunks
    let output_path = temp_dir().join("test_output.dat");
    service.write_file_chunks(&output_path, chunks).await.unwrap();

    // Verify
    let output_data = std::fs::read(&output_path).unwrap();
    assert_eq!(test_data, output_data);
}
}

Streaming Tests

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_streaming() {
    use tokio_stream::StreamExt;

    let service: Arc<dyn FileIOService> = create_test_service();
    let path = create_test_file(100 * 1024 * 1024);  // 100 MB

    let mut stream = service.stream_chunks(
        &path,
        ChunkSize::from_mb(4).unwrap(),
    ).await.unwrap();

    let mut chunk_count = 0;
    while let Some(chunk_result) = stream.next().await {
        let chunk = chunk_result.unwrap();
        assert!(chunk.size().bytes() <= 4 * 1024 * 1024);
        chunk_count += 1;
    }

    assert_eq!(chunk_count, 25);  // 100 MB / 4 MB = 25 chunks
}
}

Next Steps

After understanding file I/O fundamentals, explore specific implementations:

Detailed File I/O Topics

  1. Chunking Strategy: Deep dive into chunking algorithms and optimization
  2. Binary Format: File format specification and serialization

Advanced Topics


Summary

Key Takeaways:

  1. FileChunk is an immutable value object representing file data portions
  2. FilePath provides type-safe, validated file paths with phantom types
  3. ChunkSize validates chunk sizes within 1 byte to 512 MB bounds
  4. FileIOService defines async I/O operations for streaming and chunking
  5. Streaming enables memory-efficient processing of files of any size
  6. Memory Management uses bounded buffers and optional memory mapping
  7. Error Handling provides retry logic and graceful degradation

Architecture File References:

  • FileChunk: pipeline-domain/src/value_objects/file_chunk.rs:176
  • FilePath: pipeline-domain/src/value_objects/file_path.rs:1
  • ChunkSize: pipeline-domain/src/value_objects/chunk_size.rs:1
  • FileIOService: pipeline-domain/src/services/file_io_service.rs:185

Chunking Strategy

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

This chapter provides a comprehensive overview of the file chunking strategy in the adaptive pipeline system. Learn how files are split into manageable chunks, how chunk sizes are optimized, and how chunking enables efficient parallel processing.


Overview

Chunking is the process of dividing files into smaller, manageable pieces (chunks) that can be processed independently. This strategy enables efficient memory usage, parallel processing, and scalable file handling regardless of file size.

Key Benefits

  • Memory Efficiency: Process files larger than available RAM
  • Parallel Processing: Process multiple chunks concurrently
  • Fault Tolerance: Failed chunks can be retried independently
  • Progress Tracking: Track progress at chunk granularity
  • Scalability: Handle files from bytes to terabytes

Chunking Workflow

Input File (100 MB)
        ↓
[Chunking Strategy]
        ↓
┌──────────┬──────────┬──────────┬──────────┐
│ Chunk 0  │ Chunk 1  │ Chunk 2  │ Chunk 3  │
│ (0-25MB) │(25-50MB) │(50-75MB) │(75-100MB)│
└──────────┴──────────┴──────────┴──────────┘
        ↓ parallel processing
┌──────────┬──────────┬──────────┬──────────┐
│Processed │Processed │Processed │Processed │
│ Chunk 0  │ Chunk 1  │ Chunk 2  │ Chunk 3  │
└──────────┴──────────┴──────────┴──────────┘
        ↓
Output File (processed)

Design Principles

  1. Predictable Memory: Bounded memory usage regardless of file size
  2. Optimal Sizing: Empirically optimized chunk sizes for performance
  3. Independent Processing: Each chunk can be processed in isolation
  4. Ordered Reassembly: Chunks maintain sequence for correct reassembly
  5. Adaptive Strategy: Chunk size adapts to file size and system resources

Chunking Architecture

The chunking system uses a combination of value objects and algorithms to efficiently divide files.

Chunking Components

┌─────────────────────────────────────────────────────────┐
│                 Chunking Strategy                        │
│                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │
│  │  ChunkSize   │  │ FileChunk    │  │  Chunking    │ │
│  │   (1B-512MB) │  │ (immutable)  │  │  Algorithm   │ │
│  └──────────────┘  └──────────────┘  └──────────────┘ │
│                                                          │
└─────────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────────┐
│              Optimal Size Calculation                    │
│                                                          │
│  File Size → optimal_for_file_size() → ChunkSize       │
│                                                          │
│  - Small files  (<10 MB):    64-256 KB chunks          │
│  - Medium files (10-500 MB): 2-16 MB chunks            │
│  - Large files  (500MB-2GB): 64 MB chunks              │
│  - Huge files   (>2 GB):     128 MB chunks             │
└─────────────────────────────────────────────────────────┘

Chunk Lifecycle

1. Size Determination
   - Calculate optimal chunk size based on file size
   - Adjust for available memory if needed
   ↓
2. File Division
   - Read file in chunk-sized pieces
   - Create FileChunk with sequence number and offset
   ↓
3. Chunk Processing
   - Apply pipeline stages to each chunk
   - Process chunks in parallel if enabled
   ↓
4. Chunk Reassembly
   - Combine processed chunks by sequence number
   - Write to output file
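
A rough end-to-end sketch of this lifecycle is shown below. It reuses chunk_file from the Chunking Algorithm section later in this chapter; process_chunk stands in for the pipeline stages and is hypothetical.

#![allow(unused)]
fn main() {
use std::io::Write;
use std::path::Path;

// Hypothetical end-to-end flow; `chunk_file` is defined in "Chunking Algorithm" below,
// and `process_chunk: fn(FileChunk) -> Result<FileChunk, PipelineError>` stands in
// for the configured pipeline stages.
fn run_chunk_lifecycle(input: &Path, output: &Path) -> Result<(), PipelineError> {
    // 1. Size determination
    let file_size = std::fs::metadata(input)?.len();
    let chunk_size = ChunkSize::optimal_for_file_size(file_size);

    // 2. File division
    let chunks = chunk_file(input, chunk_size)?;

    // 3. Chunk processing (sequential here; see "Parallel Processing")
    let mut processed: Vec<FileChunk> = chunks
        .into_iter()
        .map(process_chunk)
        .collect::<Result<_, _>>()?;

    // 4. Ordered reassembly by sequence number
    processed.sort_by_key(|c| c.sequence_number());
    let mut out = std::fs::File::create(output)?;
    for chunk in &processed {
        out.write_all(chunk.data())?;
    }
    Ok(())
}
}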

Chunk Size Selection

Chunk size is critical for performance and memory efficiency. The system supports validated sizes from 1 byte to 512 MB.

Size Constraints

#![allow(unused)]
fn main() {
// ChunkSize constants
ChunkSize::MIN_SIZE  // 1 byte
ChunkSize::MAX_SIZE  // 512 MB
ChunkSize::DEFAULT   // 1 MB
}

Creating Chunk Sizes

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::ChunkSize;

// From bytes
let chunk = ChunkSize::new(1024 * 1024)?;  // 1 MB

// From kilobytes
let chunk_kb = ChunkSize::from_kb(512)?;  // 512 KB

// From megabytes
let chunk_mb = ChunkSize::from_mb(16)?;  // 16 MB

// Default size
let default_chunk = ChunkSize::default();  // 1 MB
}

Size Validation

#![allow(unused)]
fn main() {
// ✅ Valid sizes
let valid = ChunkSize::new(64 * 1024)?;  // 64 KB - valid

// ❌ Invalid: too small
let too_small = ChunkSize::new(0);  // Error: must be ≥ 1 byte
assert!(too_small.is_err());

// ❌ Invalid: too large
let too_large = ChunkSize::new(600 * 1024 * 1024);  // Error: must be ≤ 512 MB
assert!(too_large.is_err());
}

Size Trade-offs

| Chunk Size        | Memory Usage | I/O Overhead | Parallelism | Best For                    |
|-------------------|--------------|--------------|-------------|-----------------------------|
| Small (64-256 KB) | Low          | High         | Excellent   | Small files, limited memory |
| Medium (1-16 MB)  | Moderate     | Moderate     | Good        | Most use cases              |
| Large (64-128 MB) | High         | Low          | Limited     | Large files, ample memory   |

Chunking Algorithm

The chunking algorithm divides files into sequential chunks with proper metadata.

Basic Chunking Process

#![allow(unused)]
fn main() {
use std::fs::File;
use std::io::{BufReader, Read};
use std::path::Path;

pub fn chunk_file(
    file_path: &Path,
    chunk_size: ChunkSize,
) -> Result<Vec<FileChunk>, PipelineError> {
    let file = File::open(file_path)?;
    let file_size = file.metadata()?.len();
    let mut chunks = Vec::new();
    let mut offset = 0;
    let mut sequence = 0;

    // Read file in chunks
    let mut reader = BufReader::new(file);
    let mut buffer = vec![0u8; chunk_size.bytes()];

    loop {
        let bytes_read = reader.read(&mut buffer)?;
        if bytes_read == 0 {
            break;  // EOF
        }

        let data = buffer[..bytes_read].to_vec();
        let is_final = offset + bytes_read as u64 >= file_size;

        let chunk = FileChunk::new(sequence, offset, data, is_final)?;
        chunks.push(chunk);

        offset += bytes_read as u64;
        sequence += 1;
    }

    Ok(chunks)
}
}

Chunk Metadata

Each chunk contains essential metadata:

#![allow(unused)]
fn main() {
pub struct FileChunk {
    id: Uuid,                 // Unique chunk identifier
    sequence_number: u64,     // Order in file (0-based)
    offset: u64,              // Byte offset in original file
    size: ChunkSize,          // Actual chunk size
    data: Vec<u8>,            // Chunk data
    checksum: Option<String>, // Optional checksum
    is_final: bool,           // Last chunk flag
    created_at: DateTime<Utc>,// Creation timestamp
}
}

Calculating Chunk Count

#![allow(unused)]
fn main() {
// Calculate number of chunks needed
let file_size = 100 * 1024 * 1024;  // 100 MB
let chunk_size = ChunkSize::from_mb(4)?;  // 4 MB chunks

let num_chunks = chunk_size.chunks_needed_for_file(file_size);
println!("Need {} chunks", num_chunks);  // 25 chunks
}

Optimal Sizing Strategy

The system uses empirically optimized chunk sizes based on comprehensive benchmarking.

Optimization Strategy

#![allow(unused)]
fn main() {
pub fn optimal_for_file_size(file_size: u64) -> ChunkSize {
    let optimal_size = match file_size {
        // Small files: smaller chunks
        0..=1_048_576 => 64 * 1024,           // 64KB for ≤ 1MB
        1_048_577..=10_485_760 => 256 * 1024, // 256KB for ≤ 10MB

        // Medium files: empirically optimized
        10_485_761..=52_428_800 => 2 * 1024 * 1024,  // 2MB for ≤ 50MB
        52_428_801..=524_288_000 => 16 * 1024 * 1024, // 16MB for 50-500MB

        // Large files: larger chunks for efficiency
        524_288_001..=2_147_483_648 => 64 * 1024 * 1024, // 64MB for 500MB-2GB

        // Huge files: maximum throughput
        _ => 128 * 1024 * 1024, // 128MB for >2GB
    };

    ChunkSize { bytes: optimal_size.clamp(MIN_SIZE, MAX_SIZE) }
}
}

Empirical Results

Benchmarking results that informed this strategy:

| File Size | Chunk Size | Throughput | Improvement    |
|-----------|------------|------------|----------------|
| 100 MB    | 16 MB      | ~300 MB/s  | +43.7% vs 2 MB |
| 500 MB    | 16 MB      | ~320 MB/s  | +56.2% vs 4 MB |
| 2 GB      | 128 MB     | ~350 MB/s  | Baseline       |

Using Optimal Sizes

#![allow(unused)]
fn main() {
// Automatically select optimal chunk size
let file_size = 100 * 1024 * 1024;  // 100 MB
let optimal = ChunkSize::optimal_for_file_size(file_size);

println!("Optimal chunk size: {} MB", optimal.megabytes());  // 16 MB

// Check if current size is optimal
let current = ChunkSize::from_mb(4)?;
if !current.is_optimal_for_file(file_size) {
    println!("Warning: chunk size may be suboptimal");
}
}

Memory Management

Chunking enables predictable memory usage regardless of file size.

Bounded Memory Usage

Without Chunking:
  File: 10 GB
  Memory: 10 GB (entire file in memory)

With Chunking (16 MB chunks):
  File: 10 GB
  Memory: 16 MB (single chunk in memory)
  Reduction: 640x less memory!

Memory-Adaptive Chunking

#![allow(unused)]
fn main() {
// Adjust chunk size based on available memory
let available_memory = 100 * 1024 * 1024;  // 100 MB available
let max_parallel_chunks = 4;

let chunk_size = ChunkSize::from_mb(32)?;  // Desired 32 MB
let adjusted = chunk_size.adjust_for_memory(
    available_memory,
    max_parallel_chunks,
)?;

println!("Adjusted chunk size: {} MB", adjusted.megabytes());
// 25 MB (100 MB / 4 chunks)
}

Memory Footprint Calculation

#![allow(unused)]
fn main() {
fn calculate_memory_footprint(
    chunk_size: ChunkSize,
    parallel_chunks: usize,
) -> usize {
    // Base memory per chunk
    let per_chunk = chunk_size.bytes();

    // Additional overhead (metadata, buffers, etc.)
    let overhead_per_chunk = 1024;  // ~1 KB overhead

    // Total memory footprint
    parallel_chunks * (per_chunk + overhead_per_chunk)
}

let chunk_size = ChunkSize::from_mb(4)?;
let memory = calculate_memory_footprint(chunk_size, 4);
println!("Memory footprint: {} MB", memory / (1024 * 1024));
// ~16 MB total
}

Parallel Processing

Chunking enables efficient parallel processing of file data.

Parallel Chunk Processing

#![allow(unused)]
fn main() {
use futures::future::try_join_all;

async fn process_chunks_parallel(
    chunks: Vec<FileChunk>,
) -> Result<Vec<FileChunk>, PipelineError> {
    // Process chunks in parallel
    let futures = chunks.into_iter().map(|chunk| {
        tokio::spawn(async move {
            process_chunk(chunk).await
        })
    });

    // Wait for all to complete
    let results = try_join_all(futures).await?;
    Ok(results.into_iter().collect::<Result<Vec<_>, _>>()?)
}
}

Parallelism Trade-offs

Sequential Processing:
  Time = num_chunks × time_per_chunk
  Memory = 1 × chunk_size

Parallel Processing (N threads):
  Time = (num_chunks / N) × time_per_chunk
  Memory = N × chunk_size
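
For instance, with purely illustrative numbers (25 chunks of 16 MB, ~100 ms of processing per chunk, 4 workers), the trade-off works out as follows:

#![allow(unused)]
fn main() {
// Illustrative numbers: 25 chunks of 16 MB, ~100 ms per chunk, 4 workers.
let num_chunks: u64 = 25;
let time_per_chunk_ms: u64 = 100;
let chunk_size_mb: u64 = 16;
let workers: u64 = 4;

// Sequential: 25 × 100 ms = 2.5 s, peak memory ≈ 16 MB (one chunk)
let sequential_ms = num_chunks * time_per_chunk_ms;
let sequential_mem_mb = chunk_size_mb;

// Parallel: ceil(25 / 4) × 100 ms = 700 ms, peak memory ≈ 64 MB (4 chunks)
let parallel_ms = num_chunks.div_ceil(workers) * time_per_chunk_ms;
let parallel_mem_mb = workers * chunk_size_mb;

assert_eq!((sequential_ms, sequential_mem_mb), (2_500, 16));
assert_eq!((parallel_ms, parallel_mem_mb), (700, 64));
}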

Optimal Parallelism

#![allow(unused)]
fn main() {
// Calculate optimal parallelism
fn optimal_parallelism(
    file_size: u64,
    chunk_size: ChunkSize,
    available_memory: usize,
    cpu_cores: usize,
) -> usize {
    let num_chunks = chunk_size.chunks_needed_for_file(file_size) as usize;

    // Memory-based limit
    let memory_limit = available_memory / chunk_size.bytes();

    // CPU-based limit
    let cpu_limit = cpu_cores;

    // Chunk count limit
    let chunk_limit = num_chunks;

    // Take minimum of all limits
    memory_limit.min(cpu_limit).min(chunk_limit).max(1)
}

let file_size = 100 * 1024 * 1024;
let chunk_size = ChunkSize::from_mb(4)?;
let parallelism = optimal_parallelism(
    file_size,
    chunk_size,
    64 * 1024 * 1024,  // 64 MB available
    8,                  // 8 CPU cores
);
println!("Optimal parallelism: {} chunks", parallelism);
}

Adaptive Chunking

The system can adapt chunk sizes dynamically based on conditions.

Adaptive Sizing Triggers

#![allow(unused)]
fn main() {
pub enum AdaptiveTrigger {
    MemoryPressure,      // Reduce chunk size due to low memory
    SlowPerformance,     // Increase chunk size for better throughput
    NetworkLatency,      // Reduce chunk size for streaming
    CpuUtilization,      // Adjust based on CPU usage
}
}

Dynamic Adjustment

#![allow(unused)]
fn main() {
async fn adaptive_chunking(
    file_path: &Path,
    initial_chunk_size: ChunkSize,
) -> Result<Vec<FileChunk>, PipelineError> {
    let mut chunk_size = initial_chunk_size;
    let mut chunks = Vec::new();
    let mut performance_samples = Vec::new();

    loop {
        let start = Instant::now();

        // Read next chunk
        let chunk = read_next_chunk(file_path, chunk_size)?;
        if chunk.is_none() {
            break;
        }

        let duration = start.elapsed();
        performance_samples.push(duration);

        chunks.push(chunk.unwrap());

        // Adapt chunk size based on performance
        if performance_samples.len() >= 5 {
            let avg_time = performance_samples.iter().sum::<Duration>() / 5;

            if avg_time > Duration::from_millis(100) {
                // Chunks take too long: shrink them to keep per-chunk latency bounded
                chunk_size = adjust_chunk_size(chunk_size, 0.75)?;
            } else if avg_time < Duration::from_millis(10) {
                // Chunks finish very quickly: grow them to reduce per-chunk overhead
                chunk_size = adjust_chunk_size(chunk_size, 1.5)?;
            }

            performance_samples.clear();
        }
    }

    Ok(chunks)
}

fn adjust_chunk_size(
    current: ChunkSize,
    factor: f64,
) -> Result<ChunkSize, PipelineError> {
    let new_size = (current.bytes() as f64 * factor) as usize;
    ChunkSize::new(new_size)
}
}

Performance Characteristics

Chunking performance varies based on size and system characteristics.

Throughput by Chunk Size

| Chunk Size | Read Speed | Write Speed | CPU Usage | Memory |
|------------|------------|-------------|-----------|--------|
| 64 KB      | ~40 MB/s   | ~35 MB/s    | 15%       | Low    |
| 1 MB       | ~120 MB/s  | ~100 MB/s   | 20%       | Low    |
| 16 MB      | ~300 MB/s  | ~280 MB/s   | 25%       | Medium |
| 64 MB      | ~320 MB/s  | ~300 MB/s   | 30%       | High   |
| 128 MB     | ~350 MB/s  | ~320 MB/s   | 35%       | High   |

Benchmarks on NVMe SSD with 8-core CPU

Latency Characteristics

Small Chunks (64 KB):
  - Low latency per chunk: ~1-2ms
  - High overhead: many chunks
  - Good for: streaming, low memory

Large Chunks (128 MB):
  - High latency per chunk: ~400ms
  - Low overhead: few chunks
  - Good for: throughput, batch processing
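
To put these figures in perspective, here are the chunk counts for a 1 GiB file at the two extremes (illustrative arithmetic only):

#![allow(unused)]
fn main() {
// Chunk counts for a 1 GiB file at the two extremes.
let file_size: u64 = 1024 * 1024 * 1024;

let small_chunks = file_size.div_ceil(64 * 1024);          // 64 KB chunks
let large_chunks = file_size.div_ceil(128 * 1024 * 1024);  // 128 MB chunks

assert_eq!(small_chunks, 16_384);  // many chunks: low memory, high per-chunk overhead
assert_eq!(large_chunks, 8);       // few chunks: high throughput, high memory per chunk
}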

Performance Optimization

#![allow(unused)]
fn main() {
// Benchmark different chunk sizes
async fn benchmark_chunk_sizes(
    file_path: &Path,
    sizes: &[ChunkSize],
) -> Vec<(ChunkSize, Duration)> {
    let mut results = Vec::new();

    for &size in sizes {
        let start = Instant::now();
        let _ = chunk_file(file_path, size).unwrap();
        let duration = start.elapsed();

        results.push((size, duration));
    }

    results.sort_by_key(|(_, duration)| *duration);
    results
}

// Usage
let sizes = vec![
    ChunkSize::from_kb(64)?,
    ChunkSize::from_mb(1)?,
    ChunkSize::from_mb(16)?,
    ChunkSize::from_mb(64)?,
];

let results = benchmark_chunk_sizes(Path::new("./test.dat"), &sizes).await;
for (size, duration) in results {
    println!("{} MB: {:?}", size.megabytes(), duration);
}
}

Usage Examples

Example 1: Basic Chunking

use adaptive_pipeline_domain::{ChunkSize, FileChunk};
use std::path::Path;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file_path = Path::new("./input.dat");

    // Determine optimal chunk size
    let file_size = std::fs::metadata(file_path)?.len();
    let chunk_size = ChunkSize::optimal_for_file_size(file_size);

    println!("File size: {} MB", file_size / (1024 * 1024));
    println!("Chunk size: {} MB", chunk_size.megabytes());

    // Chunk the file
    let chunks = chunk_file(file_path, chunk_size)?;
    println!("Created {} chunks", chunks.len());

    Ok(())
}

Example 2: Memory-Adaptive Chunking

use adaptive_pipeline_domain::ChunkSize;
use std::path::Path;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file_path = Path::new("./large_file.dat");
    let file_size = std::fs::metadata(file_path)?.len();

    // Start with optimal size
    let optimal = ChunkSize::optimal_for_file_size(file_size);

    // Adjust for available memory
    let available_memory = 100 * 1024 * 1024;  // 100 MB
    let max_parallel = 4;

    let chunk_size = optimal.adjust_for_memory(
        available_memory,
        max_parallel,
    )?;

    println!("Optimal: {} MB", optimal.megabytes());
    println!("Adjusted: {} MB", chunk_size.megabytes());

    let chunks = chunk_file(file_path, chunk_size)?;
    println!("Created {} chunks", chunks.len());

    Ok(())
}

Example 3: Parallel Chunk Processing

use adaptive_pipeline_domain::{ChunkSize, FileChunk, PipelineError};
use futures::future::try_join_all;
use std::path::Path;
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file_path = Path::new("./input.dat");
    let chunk_size = ChunkSize::from_mb(16)?;

    // Create chunks
    let chunks = chunk_file(file_path, chunk_size)?;
    println!("Processing {} chunks in parallel", chunks.len());

    // Process in parallel
    let futures = chunks.into_iter().map(|chunk| {
        tokio::spawn(async move {
            // Simulate processing
            tokio::time::sleep(Duration::from_millis(10)).await;
            process_chunk(chunk).await
        })
    });

    let results = try_join_all(futures).await?;
    let processed: Vec<_> = results.into_iter()
        .collect::<Result<Vec<_>, _>>()?;

    println!("Processed {} chunks", processed.len());

    Ok(())
}

async fn process_chunk(chunk: FileChunk) -> Result<FileChunk, PipelineError> {
    // Transform chunk data
    let transformed_data = chunk.data().to_vec();
    Ok(FileChunk::new(
        chunk.sequence_number(),
        chunk.offset(),
        transformed_data,
        chunk.is_final(),
    )?)
}

Example 4: Adaptive Chunking

use adaptive_pipeline_domain::ChunkSize;
use std::path::Path;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file_path = Path::new("./test.dat");
    let initial_size = ChunkSize::from_mb(4)?;

    println!("Starting with {} MB chunks", initial_size.megabytes());

    let chunks = adaptive_chunking(file_path, initial_size).await?;

    println!("Created {} chunks with adaptive sizing", chunks.len());

    // Analyze chunk sizes
    for chunk in &chunks[0..5.min(chunks.len())] {
        println!("Chunk {}: {} bytes",
            chunk.sequence_number(),
            chunk.size().bytes()
        );
    }

    Ok(())
}

Best Practices

1. Use Optimal Chunk Sizes

#![allow(unused)]
fn main() {
// ✅ Good: Use optimal sizing
let file_size = metadata(path)?.len();
let chunk_size = ChunkSize::optimal_for_file_size(file_size);

// ❌ Bad: Fixed size for all files
let chunk_size = ChunkSize::from_mb(1)?;
}

2. Consider Memory Constraints

#![allow(unused)]
fn main() {
// ✅ Good: Adjust for available memory
let chunk_size = optimal.adjust_for_memory(
    available_memory,
    max_parallel_chunks,
)?;

// ❌ Bad: Ignore memory limits
let chunk_size = ChunkSize::from_mb(128)?;  // May cause OOM
}

3. Validate Chunk Sizes

#![allow(unused)]
fn main() {
// ✅ Good: Validate user input
let user_size_mb = 32;
let file_size = metadata(path)?.len();

match ChunkSize::validate_user_input(user_size_mb, file_size) {
    Ok(bytes) => {
        let chunk_size = ChunkSize::new(bytes)?;
        // Use validated size
    },
    Err(msg) => {
        eprintln!("Invalid chunk size: {}", msg);
        // Use optimal instead
        let chunk_size = ChunkSize::optimal_for_file_size(file_size);
    },
}
}

4. Monitor Chunk Processing

#![allow(unused)]
fn main() {
// ✅ Good: Track progress
for (i, chunk) in chunks.iter().enumerate() {
    let start = Instant::now();
    process_chunk(chunk)?;
    let duration = start.elapsed();

    println!("Chunk {}/{}: {:?}",
        i + 1, chunks.len(), duration);
}
}

5. Handle Edge Cases

#![allow(unused)]
fn main() {
// ✅ Good: Handle small files
if file_size < chunk_size.bytes() as u64 {
    // File fits in single chunk
    let chunk_size = ChunkSize::new(file_size as usize)?;
}

// ✅ Good: Handle empty files
if file_size == 0 {
    return Ok(Vec::new());  // No chunks needed
}
}

6. Use Checksums for Integrity

#![allow(unused)]
fn main() {
// ✅ Good: Add checksums to chunks
let chunk = FileChunk::new(seq, offset, data, is_final)?
    .with_calculated_checksum();

// Verify before processing
if !chunk.verify_checksum() {
    return Err(PipelineError::ChecksumMismatch);
}
}

Troubleshooting

Issue 1: Out of Memory with Large Chunks

Symptom:

Error: Out of memory allocating chunk

Solutions:

#![allow(unused)]
fn main() {
// 1. Reduce chunk size
let smaller = ChunkSize::from_mb(4)?;  // Instead of 128 MB

// 2. Adjust for available memory
let adjusted = chunk_size.adjust_for_memory(
    available_memory,
    max_parallel,
)?;

// 3. Process sequentially instead of parallel
for chunk in chunks {
    process_chunk(chunk).await?;
    // Chunk dropped, memory freed
}
}

Issue 2: Poor Performance with Small Chunks

Symptom: Processing is slower than expected.

Diagnosis:

#![allow(unused)]
fn main() {
let start = Instant::now();
let chunks = chunk_file(path, chunk_size)?;
let duration = start.elapsed();

println!("Chunking took: {:?}", duration);
println!("Chunks created: {}", chunks.len());
println!("Avg per chunk: {:?}", duration / chunks.len() as u32);
}

Solutions:

#![allow(unused)]
fn main() {
// 1. Use optimal chunk size
let chunk_size = ChunkSize::optimal_for_file_size(file_size);

// 2. Increase chunk size
let larger = ChunkSize::from_mb(16)?;  // Instead of 1 MB

// 3. Benchmark different sizes
let results = benchmark_chunk_sizes(path, &sizes).await;
let (best_size, _) = results.first().unwrap();
}

Issue 3: Too Many Chunks

Symptom:

Created 10,000 chunks for 1 GB file

Solutions:

#![allow(unused)]
fn main() {
// 1. Increase chunk size
let chunk_size = ChunkSize::from_mb(16)?;  // ~63 chunks for 1 GB

// 2. Use optimal sizing
let chunk_size = ChunkSize::optimal_for_file_size(file_size);

// 3. Set maximum chunk count
let max_chunks = 100;
let min_chunk_size = file_size / max_chunks as u64;
let chunk_size = ChunkSize::new(min_chunk_size as usize)?;
}

Issue 4: Chunk Size Larger Than File

Symptom:

Error: Chunk size 16 MB is larger than file size (1 MB)

Solutions:

#![allow(unused)]
fn main() {
// 1. Validate before chunking
let chunk_size = if file_size < chunk_size.bytes() as u64 {
    ChunkSize::new(file_size as usize)?
} else {
    chunk_size
};

// 2. Use validate_user_input
match ChunkSize::validate_user_input(user_size_mb, file_size) {
    Ok(bytes) => ChunkSize::new(bytes)?,
    Err(msg) => {
        eprintln!("Warning: {}", msg);
        ChunkSize::optimal_for_file_size(file_size)
    },
}
}

Testing Strategies

Unit Tests

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_chunk_size_validation() {
        // Valid sizes
        assert!(ChunkSize::new(1024).is_ok());
        assert!(ChunkSize::new(1024 * 1024).is_ok());

        // Invalid sizes
        assert!(ChunkSize::new(0).is_err());
        assert!(ChunkSize::new(600 * 1024 * 1024).is_err());
    }

    #[test]
    fn test_optimal_sizing() {
        // Small file
        let small = ChunkSize::optimal_for_file_size(500_000);
        assert_eq!(small.bytes(), 64 * 1024);

        // Medium file
        let medium = ChunkSize::optimal_for_file_size(100 * 1024 * 1024);
        assert_eq!(medium.bytes(), 16 * 1024 * 1024);

        // Large file
        let large = ChunkSize::optimal_for_file_size(1_000_000_000);
        assert_eq!(large.bytes(), 64 * 1024 * 1024);
    }

    #[test]
    fn test_chunks_needed() {
        let chunk_size = ChunkSize::from_mb(4).unwrap();
        let file_size = 100 * 1024 * 1024;  // 100 MB

        let num_chunks = chunk_size.chunks_needed_for_file(file_size);
        assert_eq!(num_chunks, 25);  // 100 MB / 4 MB = 25
    }
}
}

Integration Tests

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_file_chunking() {
    // Create test file
    let test_file = create_test_file(10 * 1024 * 1024);  // 10 MB

    // Chunk the file
    let chunk_size = ChunkSize::from_mb(1).unwrap();
    let chunks = chunk_file(&test_file, chunk_size).unwrap();

    assert_eq!(chunks.len(), 10);

    // Verify sequences
    for (i, chunk) in chunks.iter().enumerate() {
        assert_eq!(chunk.sequence_number(), i as u64);
    }

    // Verify last chunk flag
    assert!(chunks.last().unwrap().is_final());
}
}

Performance Tests

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_chunking_performance() {
    let test_file = create_test_file(100 * 1024 * 1024);  // 100 MB

    let sizes = vec![
        ChunkSize::from_mb(1).unwrap(),
        ChunkSize::from_mb(16).unwrap(),
        ChunkSize::from_mb(64).unwrap(),
    ];

    for size in sizes {
        let start = Instant::now();
        let chunks = chunk_file(&test_file, size).unwrap();
        let duration = start.elapsed();

        println!("{} MB chunks: {} chunks in {:?}",
            size.megabytes(), chunks.len(), duration);
    }
}
}

Next Steps

After understanding chunking strategy, explore related topics:

  1. File I/O: File reading and writing with chunks
  2. Binary Format: How chunks are serialized

Advanced Topics


Summary

Key Takeaways:

  1. Chunking divides files into manageable pieces for efficient processing
  2. Chunk sizes range from 1 byte to 512 MB with optimal sizes empirically determined
  3. Optimal sizing adapts to file size: small files use small chunks, large files use large chunks
  4. Memory management ensures bounded memory usage regardless of file size
  5. Parallel processing enables concurrent chunk processing for better performance
  6. Adaptive chunking can dynamically adjust chunk sizes based on performance
  7. Performance varies significantly with chunk size (64 KB: ~40 MB/s, 128 MB: ~350 MB/s)

Architecture File References:

  • ChunkSize: pipeline-domain/src/value_objects/chunk_size.rs:169
  • FileChunk: pipeline-domain/src/value_objects/file_chunk.rs:176
  • Chunking Diagram: pipeline/docs/diagrams/chunk-processing.puml

Binary File Format

Version: 1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner, Claude Code Status: Active

Overview

The Adaptive Pipeline uses a custom binary file format (.adapipe) to store processed files with complete recovery metadata and integrity verification. This format enables perfect restoration of original files while maintaining processing history and security.

Key Features:

  • Complete Recovery: All metadata needed to restore original files
  • Integrity Verification: SHA-256 checksums for both input and output
  • Processing History: Complete record of all processing steps
  • Format Versioning: Backward compatibility through version management
  • Security: Supports encryption with nonce management

File Format Specification

Binary Layout

The .adapipe format uses a reverse-header design for efficient processing:

┌─────────────────────────────────────────────┐
│          PROCESSED CHUNK DATA               │
│         (variable length)                   │
│  - Compressed and/or encrypted chunks       │
│  - Each chunk: [NONCE][LENGTH][DATA]        │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│          JSON HEADER                        │
│         (variable length)                   │
│  - Processing metadata                      │
│  - Recovery information                     │
│  - Checksums                                │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│      HEADER_LENGTH (4 bytes, u32 LE)        │
│  - Length of JSON header in bytes           │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│    FORMAT_VERSION (2 bytes, u16 LE)         │
│  - Current version: 1                       │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│      MAGIC_BYTES (8 bytes)                  │
│  - "ADAPIPE\0" (0x4144415049504500)         │
└─────────────────────────────────────────────┘

Why Reverse Header?

  • Efficient Reading: Read magic bytes and version first
  • Validation: Quickly validate format without reading entire file
  • Streaming: Process chunk data while reading header
  • Metadata Location: Header location calculated from end of file

Magic Bytes

#![allow(unused)]
fn main() {
pub const MAGIC_BYTES: [u8; 8] = [
    0x41, 0x44, 0x41, 0x50, // "ADAP"
    0x49, 0x50, 0x45, 0x00  // "IPE\0"
];
}

Purpose:

  • Identify files in .adapipe format
  • Prevent accidental processing of wrong file types
  • Enable format detection tools
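
A minimal detection sketch built on these constants (the helper name is hypothetical; only the trailer layout described above is assumed):

#![allow(unused)]
fn main() {
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

// Hypothetical helper: check only the last 8 bytes against MAGIC_BYTES.
fn is_adapipe_file(path: &str) -> std::io::Result<bool> {
    let mut file = File::open(path)?;
    if file.metadata()?.len() < 8 {
        return Ok(false); // Too short to contain the trailer
    }
    file.seek(SeekFrom::End(-8))?;
    let mut magic = [0u8; 8];
    file.read_exact(&mut magic)?;
    Ok(magic == MAGIC_BYTES)
}
}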

Format Version

#![allow(unused)]
fn main() {
pub const CURRENT_FORMAT_VERSION: u16 = 1;
}

Version History:

  • Version 1: Initial format with compression, encryption, checksum support

Future Versions:

  • Version 2: Enhanced metadata, additional algorithms
  • Version 3: Streaming optimizations, compression improvements

File Header Structure

Header Fields

The JSON header contains comprehensive metadata:

#![allow(unused)]
fn main() {
pub struct FileHeader {
    /// Application version (e.g., "0.1.0")
    pub app_version: String,

    /// File format version (1)
    pub format_version: u16,

    /// Original input filename
    pub original_filename: String,

    /// Original file size in bytes
    pub original_size: u64,

    /// SHA-256 checksum of original file
    pub original_checksum: String,

    /// SHA-256 checksum of processed file
    pub output_checksum: String,

    /// Processing steps applied (in order)
    pub processing_steps: Vec<ProcessingStep>,

    /// Chunk size used (bytes)
    pub chunk_size: u32,

    /// Number of chunks
    pub chunk_count: u32,

    /// Processing timestamp (RFC3339)
    pub processed_at: DateTime<Utc>,

    /// Pipeline ID
    pub pipeline_id: String,

    /// Additional metadata
    pub metadata: HashMap<String, String>,
}
}

Processing Steps

Each processing step records transformation details:

#![allow(unused)]
fn main() {
pub struct ProcessingStep {
    /// Step type (compression, encryption, etc.)
    pub step_type: ProcessingStepType,

    /// Algorithm used (e.g., "brotli", "aes-256-gcm")
    pub algorithm: String,

    /// Algorithm-specific parameters
    pub parameters: HashMap<String, String>,

    /// Application order (0-based)
    pub order: u32,
}

pub enum ProcessingStepType {
    Compression,
    Encryption,
    Checksum,
    PassThrough,
    Custom(String),
}
}

Example Processing Steps:

{
  "processing_steps": [
    {
      "step_type": "Compression",
      "algorithm": "brotli",
      "parameters": {
        "level": "6"
      },
      "order": 0
    },
    {
      "step_type": "Encryption",
      "algorithm": "aes-256-gcm",
      "parameters": {
        "key_derivation": "argon2"
      },
      "order": 1
    },
    {
      "step_type": "Checksum",
      "algorithm": "sha256",
      "parameters": {},
      "order": 2
    }
  ]
}

Chunk Format

Chunk Structure

Each chunk in the processed data section follows this format:

┌────────────────────────────────────┐
│   NONCE (12 bytes)                 │
│  - Unique for each chunk           │
│  - Used for encryption IV          │
└────────────────────────────────────┘
┌────────────────────────────────────┐
│   DATA_LENGTH (4 bytes, u32 LE)    │
│  - Length of encrypted data        │
└────────────────────────────────────┘
┌────────────────────────────────────┐
│   ENCRYPTED_DATA (variable)        │
│  - Compressed and encrypted        │
│  - Includes authentication tag     │
└────────────────────────────────────┘

Rust Structure:

#![allow(unused)]
fn main() {
pub struct ChunkFormat {
    /// Encryption nonce (12 bytes for AES-GCM)
    pub nonce: [u8; 12],

    /// Length of encrypted data
    pub data_length: u32,

    /// Encrypted (and possibly compressed) chunk data
    pub encrypted_data: Vec<u8>,
}
}

Chunk Processing

Forward Processing (Compress → Encrypt):

1. Read original chunk
2. Compress chunk data
3. Generate unique nonce
4. Encrypt compressed data
5. Write: [NONCE][LENGTH][ENCRYPTED_DATA]

Reverse Processing (Decrypt → Decompress):

1. Read: [NONCE][LENGTH][ENCRYPTED_DATA]
2. Decrypt using nonce
3. Decompress decrypted data
4. Verify checksum
5. Write original chunk
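
A minimal serialization sketch for this per-chunk layout follows; the method names are illustrative, and only the [NONCE][LENGTH][DATA] layout shown above is assumed.

#![allow(unused)]
fn main() {
impl ChunkFormat {
    /// Serialize as [NONCE (12)][DATA_LENGTH (4, u32 LE)][ENCRYPTED_DATA].
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(12 + 4 + self.encrypted_data.len());
        out.extend_from_slice(&self.nonce);
        out.extend_from_slice(&self.data_length.to_le_bytes());
        out.extend_from_slice(&self.encrypted_data);
        out
    }

    /// Parse one chunk from a byte slice, returning the chunk and bytes consumed.
    pub fn from_bytes(buf: &[u8]) -> Option<(Self, usize)> {
        if buf.len() < 16 {
            return None; // Not enough bytes for nonce + length
        }
        let mut nonce = [0u8; 12];
        nonce.copy_from_slice(&buf[..12]);
        let data_length = u32::from_le_bytes(buf[12..16].try_into().ok()?);
        let end = 16 + data_length as usize;
        if buf.len() < end {
            return None; // Truncated data section
        }
        Some((
            ChunkFormat {
                nonce,
                data_length,
                encrypted_data: buf[16..end].to_vec(),
            },
            end,
        ))
    }
}
}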

Creating Binary Files

Basic File Creation

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::value_objects::{FileHeader, ProcessingStep};
use std::fs::File;
use std::io::Write;

fn create_adapipe_file(
    input_data: &[u8],
    output_path: &str,
    processing_steps: Vec<ProcessingStep>,
) -> Result<(), PipelineError> {
    // Create header
    let original_checksum = calculate_sha256(input_data);
    let mut header = FileHeader::new(
        "input.txt".to_string(),
        input_data.len() as u64,
        original_checksum,
    );

    // Add processing steps
    header.processing_steps = processing_steps;
    header.chunk_count = calculate_chunk_count(input_data.len(), header.chunk_size);

    // Process chunks
    let processed_data = process_chunks(input_data, &header.processing_steps)?;

    // Calculate output checksum
    header.output_checksum = calculate_sha256(&processed_data);

    // Serialize header to JSON
    let json_header = serde_json::to_vec(&header)?;
    let header_length = json_header.len() as u32;

    // Write file in reverse order
    let mut file = File::create(output_path)?;

    // 1. Write processed data
    file.write_all(&processed_data)?;

    // 2. Write JSON header
    file.write_all(&json_header)?;

    // 3. Write header length
    file.write_all(&header_length.to_le_bytes())?;

    // 4. Write format version
    file.write_all(&CURRENT_FORMAT_VERSION.to_le_bytes())?;

    // 5. Write magic bytes
    file.write_all(&MAGIC_BYTES)?;

    Ok(())
}
}

Adding Processing Steps

#![allow(unused)]
fn main() {
impl FileHeader {
    /// Add compression step
    pub fn add_compression_step(mut self, algorithm: &str, level: u32) -> Self {
        let mut parameters = HashMap::new();
        parameters.insert("level".to_string(), level.to_string());

        self.processing_steps.push(ProcessingStep {
            step_type: ProcessingStepType::Compression,
            algorithm: algorithm.to_string(),
            parameters,
            order: self.processing_steps.len() as u32,
        });

        self
    }

    /// Add encryption step
    pub fn add_encryption_step(
        mut self,
        algorithm: &str,
        key_derivation: &str
    ) -> Self {
        let mut parameters = HashMap::new();
        parameters.insert("key_derivation".to_string(), key_derivation.to_string());

        self.processing_steps.push(ProcessingStep {
            step_type: ProcessingStepType::Encryption,
            algorithm: algorithm.to_string(),
            parameters,
            order: self.processing_steps.len() as u32,
        });

        self
    }

    /// Add checksum step
    pub fn add_checksum_step(mut self, algorithm: &str) -> Self {
        self.processing_steps.push(ProcessingStep {
            step_type: ProcessingStepType::Checksum,
            algorithm: algorithm.to_string(),
            parameters: HashMap::new(),
            order: self.processing_steps.len() as u32,
        });

        self
    }
}
}

Reading Binary Files

Basic File Reading

#![allow(unused)]
fn main() {
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

fn read_adapipe_file(path: &str) -> Result<FileHeader, PipelineError> {
    let mut file = File::open(path)?;

    // Read from end of file (reverse header)
    file.seek(SeekFrom::End(-8))?;

    // 1. Read and validate magic bytes
    let mut magic = [0u8; 8];
    file.read_exact(&mut magic)?;

    if magic != MAGIC_BYTES {
        return Err(PipelineError::InvalidFormat(
            "Not an .adapipe file".to_string()
        ));
    }

    // 2. Read format version
    file.seek(SeekFrom::End(-10))?;
    let mut version_bytes = [0u8; 2];
    file.read_exact(&mut version_bytes)?;
    let version = u16::from_le_bytes(version_bytes);

    if version > CURRENT_FORMAT_VERSION {
        return Err(PipelineError::UnsupportedVersion(version));
    }

    // 3. Read header length
    file.seek(SeekFrom::End(-14))?;
    let mut length_bytes = [0u8; 4];
    file.read_exact(&mut length_bytes)?;
    let header_length = u32::from_le_bytes(length_bytes) as usize;

    // 4. Read JSON header
    file.seek(SeekFrom::End(-(14 + header_length as i64)))?;
    let mut json_data = vec![0u8; header_length];
    file.read_exact(&mut json_data)?;

    // 5. Deserialize header
    let header: FileHeader = serde_json::from_slice(&json_data)?;

    Ok(header)
}
}

Reading Chunk Data

#![allow(unused)]
fn main() {
fn read_chunks(
    file: &mut File,
    header: &FileHeader
) -> Result<Vec<ChunkFormat>, PipelineError> {
    let mut chunks = Vec::with_capacity(header.chunk_count as usize);

    // Seek to start of chunk data
    file.seek(SeekFrom::Start(0))?;

    for _ in 0..header.chunk_count {
        // Read nonce
        let mut nonce = [0u8; 12];
        file.read_exact(&mut nonce)?;

        // Read data length
        let mut length_bytes = [0u8; 4];
        file.read_exact(&mut length_bytes)?;
        let data_length = u32::from_le_bytes(length_bytes);

        // Read encrypted data
        let mut encrypted_data = vec![0u8; data_length as usize];
        file.read_exact(&mut encrypted_data)?;

        chunks.push(ChunkFormat {
            nonce,
            data_length,
            encrypted_data,
        });
    }

    Ok(chunks)
}
}

File Recovery

Complete Recovery Process

#![allow(unused)]
fn main() {
fn restore_original_file(
    input_path: &str,
    output_path: &str,
    password: Option<&str>,
) -> Result<(), PipelineError> {
    // 1. Read header
    let header = read_adapipe_file(input_path)?;

    // 2. Read chunks
    let mut file = File::open(input_path)?;
    let chunks = read_chunks(&mut file, &header)?;

    // 3. Process chunks in reverse order
    let mut restored_data = Vec::new();

    for chunk in chunks {
        let mut chunk_data = chunk.encrypted_data;

        // Reverse processing steps
        for step in header.processing_steps.iter().rev() {
            chunk_data = match step.step_type {
                ProcessingStepType::Encryption => {
                    decrypt_chunk(chunk_data, &chunk.nonce, &step, password)?
                }
                ProcessingStepType::Compression => {
                    decompress_chunk(chunk_data, &step)?
                }
                ProcessingStepType::Checksum => {
                    verify_chunk_checksum(&chunk_data, &step)?;
                    chunk_data
                }
                _ => chunk_data,
            };
        }

        restored_data.extend_from_slice(&chunk_data);
    }

    // 4. Verify restored data
    let restored_checksum = calculate_sha256(&restored_data);
    if restored_checksum != header.original_checksum {
        return Err(PipelineError::IntegrityError(
            "Restored data checksum mismatch".to_string()
        ));
    }

    // 5. Write restored file
    let mut output = File::create(output_path)?;
    output.write_all(&restored_data)?;

    Ok(())
}
}

Processing Step Reversal

#![allow(unused)]
fn main() {
fn reverse_processing_step(
    data: Vec<u8>,
    step: &ProcessingStep,
    password: Option<&str>,
) -> Result<Vec<u8>, PipelineError> {
    match step.step_type {
        ProcessingStepType::Compression => {
            // Decompress
            match step.algorithm.as_str() {
                "brotli" => decompress_brotli(data),
                "gzip" => decompress_gzip(data),
                "zstd" => decompress_zstd(data),
                "lz4" => decompress_lz4(data),
                _ => Err(PipelineError::UnsupportedAlgorithm(
                    step.algorithm.clone()
                )),
            }
        }
        ProcessingStepType::Encryption => {
            // Decrypt
            let password = password.ok_or(PipelineError::MissingPassword)?;
            match step.algorithm.as_str() {
                "aes-256-gcm" => decrypt_aes_256_gcm(data, password, step),
                "chacha20-poly1305" => decrypt_chacha20(data, password, step),
                _ => Err(PipelineError::UnsupportedAlgorithm(
                    step.algorithm.clone()
                )),
            }
        }
        ProcessingStepType::Checksum => {
            // Verify checksum (no transformation)
            verify_checksum(&data, step)?;
            Ok(data)
        }
        _ => Ok(data),
    }
}
}

Integrity Verification

File Validation

#![allow(unused)]
fn main() {
fn validate_adapipe_file(path: &str) -> Result<ValidationReport, PipelineError> {
    let mut report = ValidationReport::new();

    // 1. Read and validate header
    let header = match read_adapipe_file(path) {
        Ok(h) => {
            report.add_check("Header format", true, "Valid");
            h
        }
        Err(e) => {
            report.add_check("Header format", false, &e.to_string());
            return Ok(report);
        }
    };

    // 2. Validate format version
    if header.format_version <= CURRENT_FORMAT_VERSION {
        report.add_check("Format version", true, &format!("v{}", header.format_version));
    } else {
        report.add_check(
            "Format version",
            false,
            &format!("Unsupported: v{}", header.format_version)
        );
    }

    // 3. Validate processing steps
    for (i, step) in header.processing_steps.iter().enumerate() {
        let is_supported = match step.step_type {
            ProcessingStepType::Compression => {
                matches!(step.algorithm.as_str(), "brotli" | "gzip" | "zstd" | "lz4")
            }
            ProcessingStepType::Encryption => {
                matches!(step.algorithm.as_str(), "aes-256-gcm" | "chacha20-poly1305")
            }
            _ => true,
        };

        report.add_check(
            &format!("Step {} ({:?})", i, step.step_type),
            is_supported,
            &step.algorithm
        );
    }

    // 4. Verify output checksum
    let mut file = File::open(path)?;
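    // Processed chunk data occupies everything before the trailer:
    // trailer = JSON header + 4 (length) + 2 (version) + 8 (magic) = json_size + 14 bytes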
    let data_length = file.metadata()?.len() - 14 - header.json_size() as u64;
    let mut processed_data = vec![0u8; data_length as usize];
    file.read_exact(&mut processed_data)?;

    let calculated_checksum = calculate_sha256(&processed_data);
    let checksums_match = calculated_checksum == header.output_checksum;

    report.add_check(
        "Output checksum",
        checksums_match,
        if checksums_match { "Valid" } else { "Mismatch" }
    );

    Ok(report)
}

pub struct ValidationReport {
    checks: Vec<(String, bool, String)>,
}

impl ValidationReport {
    pub fn new() -> Self {
        Self { checks: Vec::new() }
    }

    pub fn add_check(&mut self, name: &str, passed: bool, message: &str) {
        self.checks.push((name.to_string(), passed, message.to_string()));
    }

    pub fn is_valid(&self) -> bool {
        self.checks.iter().all(|(_, passed, _)| *passed)
    }
}
}

Checksum Verification

#![allow(unused)]
fn main() {
fn verify_file_integrity(path: &str) -> Result<bool, PipelineError> {
    let header = read_adapipe_file(path)?;

    // Calculate actual checksum
    let mut file = File::open(path)?;
    let data_length = file.metadata()?.len() - 14 - header.json_size() as u64;
    let mut data = vec![0u8; data_length as usize];
    file.read_exact(&mut data)?;

    let calculated = calculate_sha256(&data);

    // Compare with stored checksum
    Ok(calculated == header.output_checksum)
}

fn calculate_sha256(data: &[u8]) -> String {
    let mut hasher = Sha256::new();
    hasher.update(data);
    format!("{:x}", hasher.finalize())
}
}

Version Management

Format Versioning

#![allow(unused)]
fn main() {
pub fn check_format_compatibility(version: u16) -> Result<(), PipelineError> {
    match version {
        1 => Ok(()), // Current version
        v if v < CURRENT_FORMAT_VERSION => {
            // Older version - attempt migration
            migrate_format(v, CURRENT_FORMAT_VERSION)
        }
        v => Err(PipelineError::UnsupportedVersion(v)),
    }
}
}

Format Migration

#![allow(unused)]
fn main() {
fn migrate_format(from: u16, to: u16) -> Result<(), PipelineError> {
    match (from, to) {
        (1, 2) => {
            // Migration from v1 to v2
            // Add new fields with defaults
            Ok(())
        }
        _ => Err(PipelineError::MigrationUnsupported(from, to)),
    }
}
}

Backward Compatibility

#![allow(unused)]
fn main() {
fn read_any_version(path: &str) -> Result<FileHeader, PipelineError> {
    let version = read_format_version(path)?;

    match version {
        1 => read_v1_format(path),
        2 => read_v2_format(path),
        v => Err(PipelineError::UnsupportedVersion(v)),
    }
}
}

Best Practices

File Creation

Always set checksums:

#![allow(unused)]
fn main() {
// ✅ GOOD: Set both checksums
let original_checksum = calculate_sha256(&input_data);
let mut header = FileHeader::new(filename, size, original_checksum);
// ... process data ...
header.output_checksum = calculate_sha256(&processed_data);
}

Record all processing steps:

#![allow(unused)]
fn main() {
// ✅ GOOD: Record every transformation
header = header
    .add_compression_step("brotli", 6)
    .add_encryption_step("aes-256-gcm", "argon2")
    .add_checksum_step("sha256");
}

File Reading

Always validate format:

#![allow(unused)]
fn main() {
// ✅ GOOD: Validate before processing
let header = read_adapipe_file(path)?;

if header.format_version > CURRENT_FORMAT_VERSION {
    return Err(PipelineError::UnsupportedVersion(
        header.format_version
    ));
}
}

Verify checksums:

#![allow(unused)]
fn main() {
// ✅ GOOD: Verify integrity
let restored_checksum = calculate_sha256(&restored_data);
if restored_checksum != header.original_checksum {
    return Err(PipelineError::IntegrityError(
        "Checksum mismatch".to_string()
    ));
}
}

Error Handling

Handle all error cases:

#![allow(unused)]
fn main() {
match read_adapipe_file(path) {
    Ok(header) => process_file(header),
    Err(PipelineError::InvalidFormat(msg)) => {
        eprintln!("Not a valid .adapipe file: {}", msg);
    }
    Err(PipelineError::UnsupportedVersion(v)) => {
        eprintln!("Unsupported format version: {}", v);
    }
    Err(e) => {
        eprintln!("Error reading file: {}", e);
    }
}
}

Security Considerations

Nonce Management

Never reuse nonces:

#![allow(unused)]
fn main() {
// ✅ GOOD: Generate unique nonce per chunk
fn generate_nonce() -> [u8; 12] {
    let mut nonce = [0u8; 12];
    use rand::RngCore;
    rand::thread_rng().fill_bytes(&mut nonce);
    nonce
}
}

Key Derivation

Use strong key derivation:

#![allow(unused)]
fn main() {
// ✅ GOOD: Argon2 for password-based key derivation
fn derive_key(password: &str, salt: &[u8]) -> Vec<u8> {
    use argon2::Argon2;

    // Derive a 256-bit key directly from the password and raw salt bytes
    let mut key = vec![0u8; 32];
    Argon2::default()
        .hash_password_into(password.as_bytes(), salt, &mut key)
        .expect("key derivation failed");

    key
}
}

Integrity Protection

Verify at every step:

#![allow(unused)]
fn main() {
// ✅ GOOD: Verify after each transformation
fn process_with_verification(
    data: Vec<u8>,
    step: &ProcessingStep
) -> Result<Vec<u8>, PipelineError> {
    let processed = apply_transformation(data, step)?;
    verify_transformation(&processed, step)?;
    Ok(processed)
}
}

Next Steps

Now that you understand the binary file format:

Observability Overview

Version: 1.0 Date: 2025-10-04 License: BSD-3-Clause Copyright: (c) 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Active


Overview

Observability is the ability to understand the internal state of a system by examining its external outputs. The Adaptive Pipeline implements a comprehensive observability strategy that combines metrics, logging, and health monitoring to provide complete system visibility.

Key Principles

  • Three Pillars: Metrics, Logs, and Health Monitoring (which stands in for traditional distributed traces)
  • Comprehensive Coverage: Monitor all aspects of system operation
  • Real-Time Insights: Live performance tracking and alerting
  • Low Overhead: Minimal performance impact on pipeline processing
  • Integration Ready: Compatible with external monitoring systems (Prometheus, Grafana)
  • Actionable: Designed to support debugging, optimization, and operations

The Three Pillars

1. Metrics - Quantitative Measurements

What: Numerical measurements aggregated over time

Purpose: Track system performance, identify trends, detect anomalies

Implementation:

  • Domain layer: ProcessingMetrics entity
  • Infrastructure layer: MetricsService with Prometheus integration
  • HTTP /metrics endpoint for scraping

Key Metrics:

  • Counters: Total pipelines processed, bytes processed, errors
  • Gauges: Active pipelines, current throughput, memory usage
  • Histograms: Processing duration, latency distribution

See: Metrics Collection

2. Logging - Contextual Events

What: Timestamped records of discrete events with structured context

Purpose: Understand what happened, when, and why

Implementation:

  • Bootstrap phase: BootstrapLogger trait
  • Application phase: tracing crate with structured logging
  • Multiple log levels: ERROR, WARN, INFO, DEBUG, TRACE

Key Features:

  • Structured fields for filtering and analysis
  • Correlation IDs for request tracing
  • Integration with ObservabilityService for alerts
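
For example, a correlation ID can be attached to a tracing span so that every event recorded inside the request inherits it. This is a minimal sketch (request_id is a hypothetical value), not the project's exact wiring:

#![allow(unused)]
fn main() {
use tracing::{info, info_span};

let request_id = "req-1234"; // hypothetical correlation ID for this request

// Every event recorded while the guard is alive carries correlation_id
let span = info_span!("pipeline_request", correlation_id = %request_id);
let _guard = span.enter();
info!("Processing request"); // inherits correlation_id from the span
}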

See: Logging Implementation

3. Health Monitoring - System Status

What: Aggregated health scores and status indicators

Purpose: Quickly assess system health and detect degradation

Implementation:

  • ObservabilityService with real-time health scoring
  • SystemHealth status reporting
  • Alert generation for threshold violations

Key Components:

  • Performance health (throughput, latency)
  • Error health (error rates, failure patterns)
  • Resource health (CPU, memory, I/O)
  • Overall health score (weighted composite)

Architecture

Layered Observability

┌─────────────────────────────────────────────────────────────┐
│                    Application Layer                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────────────────────────────────────────────┐   │
│  │           ObservabilityService                      │   │
│  │  (Orchestrates monitoring, alerting, health)        │   │
│  └──────────┬────────────────┬──────────────┬──────────┘   │
│             │                │              │               │
│             ▼                ▼              ▼               │
│  ┌──────────────┐  ┌─────────────┐  ┌─────────────┐       │
│  │ Performance  │  │   Alert     │  │   Health    │       │
│  │   Tracker    │  │  Manager    │  │  Monitor    │       │
│  └──────────────┘  └─────────────┘  └─────────────┘       │
│                                                              │
└─────────────────────────────────────────────────────────────┘
                           │
                           │ Uses
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                  Infrastructure Layer                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────────┐              ┌──────────────────┐    │
│  │ MetricsService   │              │ Logging (tracing)│    │
│  │ (Prometheus)     │              │ (Structured logs)│    │
│  └──────────────────┘              └──────────────────┘    │
│           │                                 │               │
│           │                                 │               │
│           ▼                                 ▼               │
│  ┌──────────────────┐              ┌──────────────────┐    │
│  │ /metrics HTTP    │              │ Log Subscribers  │    │
│  │ endpoint         │              │ (console, file)  │    │
│  └──────────────────┘              └──────────────────┘    │
│                                                              │
└─────────────────────────────────────────────────────────────┘
                           │
                           │ Exposes
                           ▼
┌─────────────────────────────────────────────────────────────┐
│                    External Systems                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐ │
│  │  Prometheus  │    │    Grafana   │    │ Log Analysis │ │
│  │   (Scraper)  │    │ (Dashboards) │    │    Tools     │ │
│  └──────────────┘    └──────────────┘    └──────────────┘ │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Component Integration

The observability components are tightly integrated:

  1. ObservabilityService orchestrates monitoring
  2. MetricsService records quantitative data
  3. Logging records contextual events
  4. PerformanceTracker maintains real-time state
  5. AlertManager checks thresholds and generates alerts
  6. HealthMonitor computes system health scores
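
A minimal wiring sketch, based on the constructors used elsewhere in this chapter (MetricsService::new() and ObservabilityService::new()); error handling is elided:

#![allow(unused)]
fn main() {
use std::sync::Arc;

// Infrastructure: Prometheus-backed metrics
let metrics_service = Arc::new(MetricsService::new().unwrap());

// Application: orchestrator owning the performance tracker and alert thresholds
let observability_service = Arc::new(ObservabilityService::new(metrics_service.clone()));

// Share the Arc'd services with any component that records metrics or tracks operations
}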

ObservabilityService

Core Responsibilities

The ObservabilityService is the central orchestrator for monitoring:

#![allow(unused)]
fn main() {
pub struct ObservabilityService {
    metrics_service: Arc<MetricsService>,
    performance_tracker: Arc<RwLock<PerformanceTracker>>,
    alert_thresholds: AlertThresholds,
}
}

Key Methods:

  • start_operation() - Begin tracking an operation
  • complete_operation() - End tracking with metrics
  • get_system_health() - Get current health status
  • record_processing_metrics() - Record pipeline metrics
  • check_alerts() - Evaluate alert conditions

PerformanceTracker

Maintains real-time performance state:

#![allow(unused)]
fn main() {
pub struct PerformanceTracker {
    pub active_operations: u32,
    pub total_operations: u64,
    pub average_throughput_mbps: f64,
    pub peak_throughput_mbps: f64,
    pub error_rate_percent: f64,
    pub system_health_score: f64,
    pub last_update: Instant,
}
}

Tracked Metrics:

  • Active operation count
  • Total operation count
  • Average and peak throughput
  • Error rate percentage
  • Overall health score
  • Last update timestamp
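
The averages can be maintained incrementally. The helper below is hypothetical (record_completion is not part of the actual API) and only illustrates the arithmetic over the fields shown above:

#![allow(unused)]
fn main() {
use std::time::Instant;

impl PerformanceTracker {
    /// Hypothetical helper: fold one completed operation into the running stats.
    fn record_completion(&mut self, throughput_mbps: f64) {
        self.active_operations = self.active_operations.saturating_sub(1);
        self.total_operations += 1;

        // Incremental mean: avg += (x - avg) / n
        let n = self.total_operations as f64;
        self.average_throughput_mbps += (throughput_mbps - self.average_throughput_mbps) / n;
        self.peak_throughput_mbps = self.peak_throughput_mbps.max(throughput_mbps);
        self.last_update = Instant::now();
    }
}
}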

OperationTracker

Automatic operation lifecycle tracking:

#![allow(unused)]
fn main() {
pub struct OperationTracker {
    operation_name: String,
    start_time: Instant,
    observability_service: ObservabilityService,
    completed: AtomicBool,
}
}

Lifecycle:

  1. Created via start_operation()
  2. Increments active operation count
  3. Logs operation start
  4. On completion: Records duration, throughput, success/failure
  5. On drop (if not completed): Marks as failed

Drop Safety: If the tracker is dropped without explicit completion (e.g., due to panic), it automatically marks the operation as failed.
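
A sketch of how such drop safety can be implemented (the real implementation may differ; the complete_operation signature and Clone on ObservabilityService are assumptions here):

#![allow(unused)]
fn main() {
use std::sync::atomic::Ordering;

impl Drop for OperationTracker {
    fn drop(&mut self) {
        // Only act if complete() was never called
        if !self.completed.swap(true, Ordering::SeqCst) {
            let service = self.observability_service.clone();
            let name = self.operation_name.clone();
            // Drop is synchronous, so hand the async bookkeeping to the runtime.
            // (complete_operation's signature is assumed for illustration.)
            tokio::spawn(async move {
                service.complete_operation(&name, false, 0).await;
            });
        }
    }
}
}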


Health Monitoring

SystemHealth Structure

#![allow(unused)]
fn main() {
pub struct SystemHealth {
    pub status: HealthStatus,
    pub score: f64,
    pub active_operations: u32,
    pub throughput_mbps: f64,
    pub error_rate_percent: f64,
    pub uptime_seconds: u64,
    pub alerts: Vec<Alert>,
}

pub enum HealthStatus {
    Healthy,   // Score >= 90.0
    Warning,   // Score >= 70.0 && < 90.0
    Critical,  // Score < 70.0
    Unknown,   // Unable to determine health
}
}

Health Score Calculation

The health score starts at 100 and deductions are applied:

#![allow(unused)]
fn main() {
let mut score = 100.0;

// Deduct for high error rate
if error_rate_percent > max_error_rate_percent {
    score -= 30.0;  // Error rate is critical
}

// Deduct for low throughput
if average_throughput_mbps < min_throughput_mbps {
    score -= 20.0;  // Performance degradation
}

// Additional deductions for other factors...
}

Health Score Ranges:

  • 100-90: Healthy - System operating normally
  • 89-70: Warning - Degraded performance, investigation needed
  • 69-0: Critical - System in distress, immediate action required

Alert Structure

#![allow(unused)]
fn main() {
pub struct Alert {
    pub severity: AlertSeverity,
    pub message: String,
    pub timestamp: String,
    pub metric_name: String,
    pub current_value: f64,
    pub threshold: f64,
}

pub enum AlertSeverity {
    Info,
    Warning,
    Critical,
}
}

Alert Thresholds

Configuration

#![allow(unused)]
fn main() {
pub struct AlertThresholds {
    pub max_error_rate_percent: f64,
    pub min_throughput_mbps: f64,
    pub max_processing_duration_seconds: f64,
    pub max_memory_usage_mb: f64,
}

impl Default for AlertThresholds {
    fn default() -> Self {
        Self {
            max_error_rate_percent: 5.0,
            min_throughput_mbps: 1.0,
            max_processing_duration_seconds: 300.0,
            max_memory_usage_mb: 1024.0,
        }
    }
}
}

Alert Generation

Alerts are generated when thresholds are violated:

#![allow(unused)]
fn main() {
async fn check_alerts(&self, tracker: &PerformanceTracker) {
    // High error rate alert
    if tracker.error_rate_percent > self.alert_thresholds.max_error_rate_percent {
        warn!(
            "🚨 Alert: High error rate {:.1}% (threshold: {:.1}%)",
            tracker.error_rate_percent,
            self.alert_thresholds.max_error_rate_percent
        );
    }

    // Low throughput alert
    if tracker.average_throughput_mbps < self.alert_thresholds.min_throughput_mbps {
        warn!(
            "🚨 Alert: Low throughput {:.2} MB/s (threshold: {:.2} MB/s)",
            tracker.average_throughput_mbps,
            self.alert_thresholds.min_throughput_mbps
        );
    }

    // High concurrent operations alert
    if tracker.active_operations > 10 {
        warn!("🚨 Alert: High concurrent operations: {}", tracker.active_operations);
    }
}
}
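
The warnings above are only logged; to surface them through SystemHealth::alerts, an Alert value can be built from the same data. This helper is hypothetical but uses the Alert fields defined earlier:

#![allow(unused)]
fn main() {
use chrono::Utc;

fn make_alert(metric_name: &str, current_value: f64, threshold: f64) -> Alert {
    Alert {
        severity: AlertSeverity::Warning,
        message: format!(
            "{} is {:.2} (threshold: {:.2})",
            metric_name, current_value, threshold
        ),
        timestamp: Utc::now().to_rfc3339(),
        metric_name: metric_name.to_string(),
        current_value,
        threshold,
    }
}
}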

Usage Patterns

Basic Operation Tracking

#![allow(unused)]
fn main() {
// Start operation tracking
let tracker = observability_service
    .start_operation("file_processing")
    .await;

// Do work
let result = process_file(&input_path).await?;

// Complete tracking with success/failure
tracker.complete(true, result.bytes_processed).await;
}

Automatic Tracking with Drop Safety

#![allow(unused)]
fn main() {
async fn process_pipeline(id: &PipelineId) -> Result<()> {
    // Tracker automatically handles failure if function panics or returns Err
    let tracker = observability_service
        .start_operation("pipeline_execution")
        .await;

    // If this fails, tracker is dropped and marks operation as failed
    let result = execute_stages(id).await?;

    // Explicit success
    tracker.complete(true, result.bytes_processed).await;
    Ok(())
}
}

Recording Pipeline Metrics

#![allow(unused)]
fn main() {
// After pipeline completion
let metrics = pipeline.processing_metrics();

// Record to both Prometheus and performance tracker
observability_service
    .record_processing_metrics(&metrics)
    .await;

// This automatically:
// - Updates Prometheus counters/gauges/histograms
// - Updates PerformanceTracker state
// - Checks alert thresholds
// - Logs completion with metrics
}

Health Check Endpoint

#![allow(unused)]
fn main() {
async fn health_check() -> Result<SystemHealth> {
    let health = observability_service.get_system_health().await;

    match health.status {
        HealthStatus::Healthy => {
            info!("System health: HEALTHY (score: {:.1})", health.score);
        }
        HealthStatus::Warning => {
            warn!(
                "System health: WARNING (score: {:.1}, {} alerts)",
                health.score,
                health.alerts.len()
            );
        }
        HealthStatus::Critical => {
            error!(
                "System health: CRITICAL (score: {:.1}, {} alerts)",
                health.score,
                health.alerts.len()
            );
        }
        HealthStatus::Unknown => {
            warn!("System health: UNKNOWN");
        }
    }

    Ok(health)
}
}

Performance Summary

#![allow(unused)]
fn main() {
// Get human-readable performance summary
let summary = observability_service
    .get_performance_summary()
    .await;

println!("{}", summary);
}

Output:

📊 Performance Summary:
Active Operations: 3
Total Operations: 1247
Average Throughput: 45.67 MB/s
Peak Throughput: 89.23 MB/s
Error Rate: 2.1%
System Health: 88.5/100 (Warning)
Alerts: 1

Integration with External Systems

Prometheus Integration

The system exposes metrics via an HTTP endpoint:

#![allow(unused)]
fn main() {
// HTTP /metrics endpoint
use axum::{routing::get, Router};

let app = Router::new()
    .route("/metrics", get(metrics_handler));

async fn metrics_handler() -> String {
    metrics_service.get_metrics()
        .unwrap_or_else(|_| "# Error generating metrics\n".to_string())
}
}

Prometheus Configuration:

scrape_configs:
  - job_name: 'pipeline'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s
    scrape_timeout: 10s

Grafana Dashboards

Create dashboards to visualize:

  • Pipeline Throughput: Line graph of MB/s over time
  • Active Operations: Gauge of current active count
  • Error Rate: Line graph of error percentage
  • Processing Duration: Histogram of completion times
  • System Health: Gauge with color thresholds

Example PromQL Queries:

# Average throughput over 5 minutes
rate(pipeline_bytes_processed_total[5m]) / 1024 / 1024

# Error rate percentage
100 * (
  rate(pipeline_errors_total[5m]) /
  rate(pipeline_processed_total[5m])
)

# P99 processing duration
histogram_quantile(0.99, pipeline_processing_duration_seconds_bucket)

Log Aggregation

Send logs to external systems:

#![allow(unused)]
fn main() {
use tracing_subscriber::{fmt, layer::SubscriberExt, EnvFilter, Registry};
use tracing_appender::{non_blocking, rolling};

// JSON logs for shipping to ELK/Splunk
let file_appender = rolling::daily("./logs", "pipeline.json");
let (non_blocking_appender, _guard) = non_blocking(file_appender);

let file_layer = fmt::layer()
    .with_writer(non_blocking_appender)
    .json()
    .with_target(true)
    .with_thread_ids(true);

let subscriber = Registry::default()
    .with(EnvFilter::new("info"))
    .with(file_layer);

tracing::subscriber::set_global_default(subscriber)?;
}

Performance Considerations

Low Overhead Design

Atomic Operations: Metrics use atomic types to avoid locks:

#![allow(unused)]
fn main() {
pub struct MetricsService {
    pipelines_processed: Arc<AtomicU64>,
    bytes_processed: Arc<AtomicU64>,
    // ...
}
}

Async RwLock: PerformanceTracker uses async RwLock for concurrent reads:

#![allow(unused)]
fn main() {
performance_tracker: Arc<RwLock<PerformanceTracker>>
}

Lazy Evaluation: Expensive health calculations are performed only when health is queried

Compile-Time Filtering: Debug/trace log statements can be compiled out entirely, so they add zero overhead in release builds
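
For example, the tracing crate's release_max_level_* Cargo features cap the maximum log level at compile time, so debug!/trace! calls become no-ops in release builds. A dependency-declaration sketch:

# Cargo.toml (sketch)
[dependencies]
tracing = { version = "0.1", features = ["release_max_level_info"] }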

Benchmark Results

Observability overhead on Intel i7-10700K @ 3.8 GHz:

Operation                      Time      Overhead
start_operation()              ~500 ns   Negligible
complete_operation()           ~1.2 μs   Minimal
record_processing_metrics()    ~2.5 μs   Low
get_system_health()            ~8 μs     Moderate (infrequent)
info!() log                    ~80 ns    Negligible
debug!() log (disabled)        ~0 ns     Zero

Total overhead: < 0.1% of pipeline processing time


Best Practices

✅ DO

Track all significant operations

#![allow(unused)]
fn main() {
let tracker = observability.start_operation("file_compression").await;
let result = compress_file(&path).await?;
tracker.complete(true, result.compressed_size).await;
}

Use structured logging

#![allow(unused)]
fn main() {
info!(
    pipeline_id = %id,
    bytes = total_bytes,
    duration_ms = elapsed.as_millis(),
    "Pipeline completed"
);
}

Record domain metrics

#![allow(unused)]
fn main() {
observability.record_processing_metrics(&pipeline.metrics()).await;
}

Check health regularly

#![allow(unused)]
fn main() {
// In health check endpoint
let health = observability.get_system_health().await;
}

Configure thresholds appropriately

#![allow(unused)]
fn main() {
let observability = ObservabilityService::new_with_config(metrics_service).await;
}
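
The call above relies on whatever configuration the service loads; when custom limits are needed, the AlertThresholds fields shown earlier can be populated directly. How the struct is wired into the service depends on the actual constructor, so treat this as a sketch of the values only:

#![allow(unused)]
fn main() {
// Hypothetical: tighter limits for a latency-sensitive deployment
let thresholds = AlertThresholds {
    max_error_rate_percent: 1.0,
    min_throughput_mbps: 10.0,
    max_processing_duration_seconds: 60.0,
    max_memory_usage_mb: 512.0,
};
}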

❌ DON'T

Don't track trivial operations

#![allow(unused)]
fn main() {
// BAD: Too fine-grained
let tracker = observability.start_operation("allocate_vec").await;
let vec = Vec::with_capacity(100);
tracker.complete(true, 0).await; // Overhead > value
}

Don't log in hot loops without rate limiting

#![allow(unused)]
fn main() {
// BAD: Excessive logging
for chunk in chunks {
    debug!("Processing chunk {}", chunk.id); // Called millions of times!
}

// GOOD: Log summary
debug!(chunk_count = chunks.len(), "Processing chunks");
info!(chunks_processed = chunks.len(), "Chunk processing complete");
}

Don't forget to complete trackers

#![allow(unused)]
fn main() {
// BAD: Leaks active operation count
let tracker = observability.start_operation("process").await;
process().await?;
// Forgot to call tracker.complete()!

// GOOD: Explicit completion
let tracker = observability.start_operation("process").await;
let result = process().await?;
tracker.complete(true, result.bytes).await;
}

Don't block on observability operations

#![allow(unused)]
fn main() {
// BAD: Blocking in async context
tokio::task::block_in_place(|| {
    observability.get_system_health().await // Won't compile anyway!
});

// GOOD: Await directly
let health = observability.get_system_health().await;
}

Testing Strategies

Unit Testing ObservabilityService

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_operation_tracking() {
    let metrics_service = Arc::new(MetricsService::new().unwrap());
    let observability = ObservabilityService::new(metrics_service);

    // Start operation
    let tracker = observability.start_operation("test").await;

    // Check active count increased
    let health = observability.get_system_health().await;
    assert_eq!(health.active_operations, 1);

    // Complete operation
    tracker.complete(true, 1000).await;

    // Check active count decreased
    let health = observability.get_system_health().await;
    assert_eq!(health.active_operations, 0);
}
}

Testing Alert Generation

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_high_error_rate_alert() {
    let metrics_service = Arc::new(MetricsService::new().unwrap());
    let mut observability = ObservabilityService::new(metrics_service);

    // Set low threshold
    observability.alert_thresholds.max_error_rate_percent = 1.0;

    // Simulate high error rate
    for _ in 0..10 {
        let tracker = observability.start_operation("test").await;
        tracker.complete(false, 0).await; // All failures
    }

    // Check health has alerts
    let health = observability.get_system_health().await;
    assert!(!health.alerts.is_empty());
    assert_eq!(health.status, HealthStatus::Critical);
}
}

Integration Testing

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_end_to_end_observability() {
    // Setup
    let metrics_service = Arc::new(MetricsService::new().unwrap());
    let observability = Arc::new(ObservabilityService::new(metrics_service.clone()));

    // Run pipeline with tracking
    let tracker = observability.start_operation("pipeline").await;
    let result = run_test_pipeline().await.unwrap();
    tracker.complete(true, result.bytes_processed).await;

    // Verify metrics recorded
    let metrics_output = metrics_service.get_metrics().unwrap();
    assert!(metrics_output.contains("pipeline_processed_total"));

    // Verify health is good
    let health = observability.get_system_health().await;
    assert_eq!(health.status, HealthStatus::Healthy);
}
}

Common Issues and Solutions

Issue: Active operations count stuck

Symptom: active_operations never decreases

Cause: OperationTracker not completed or dropped

Solution:

#![allow(unused)]
fn main() {
// Ensure tracker is completed in all code paths
let tracker = observability.start_operation("op").await;
let result = match dangerous_operation().await {
    Ok(r) => {
        tracker.complete(true, r.bytes).await;
        Ok(r)
    }
    Err(e) => {
        tracker.complete(false, 0).await;
        Err(e)
    }
};
}

Issue: Health score always 100

Symptom: Health never degrades despite errors

Cause: Metrics not being recorded

Solution:

#![allow(unused)]
fn main() {
// Always record processing metrics
observability.record_processing_metrics(&metrics).await;
}

Issue: Alerts not firing

Symptom: Thresholds violated but no alerts logged

Cause: Log level filtering out WARN messages

Solution:

# Enable WARN level
export RUST_LOG=warn

# Or per-module
export RUST_LOG=pipeline::infrastructure::logging=warn

Next Steps


References

Metrics Collection

Version: 1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner, Claude Code Status: Active

Overview

The pipeline system implements comprehensive metrics collection for monitoring, observability, and performance analysis. Metrics are collected in real-time, exported in Prometheus format, and can be visualized using Grafana dashboards.

Key Features:

  • Prometheus Integration: Native Prometheus metrics export
  • Real-Time Collection: Low-overhead metric updates
  • Comprehensive Coverage: Performance, system, and business metrics
  • Dimensional Data: Labels for multi-dimensional analysis
  • Thread-Safe: Safe concurrent metric updates

Metrics Architecture

Two-Level Metrics System

The system uses a two-level metrics architecture:

┌─────────────────────────────────────────────┐
│         Application Layer                   │
│  - ProcessingMetrics (domain entity)        │
│  - Per-pipeline performance tracking        │
│  - Business metrics and analytics           │
└─────────────────┬───────────────────────────┘
                  │
                  ↓
┌─────────────────────────────────────────────┐
│       Infrastructure Layer                  │
│  - MetricsService (Prometheus)              │
│  - System-wide aggregation                  │
│  - HTTP export endpoint                     │
└─────────────────────────────────────────────┘

Domain Metrics (ProcessingMetrics):

  • Attached to processing context
  • Track individual pipeline execution
  • Support detailed analytics
  • Persist to database

Infrastructure Metrics (MetricsService):

  • System-wide aggregation
  • Prometheus counters, gauges, histograms
  • HTTP /metrics endpoint
  • Real-time monitoring

Domain Metrics

ProcessingMetrics Entity

Tracks performance data for individual pipeline executions:

#![allow(unused)]
fn main() {
// Defined in adaptive_pipeline_domain::entities
use serde::{Deserialize, Serialize};
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ProcessingMetrics {
    // Progress tracking
    bytes_processed: u64,
    bytes_total: u64,
    chunks_processed: u64,
    chunks_total: u64,

    // Timing (high-resolution internal, RFC3339 for export)
    #[serde(skip)]
    start_time: Option<Instant>,
    #[serde(skip)]
    end_time: Option<Instant>,
    start_time_rfc3339: Option<String>,
    end_time_rfc3339: Option<String>,
    processing_duration: Option<Duration>,

    // Performance metrics
    throughput_bytes_per_second: f64,
    compression_ratio: Option<f64>,

    // Error tracking
    error_count: u64,
    warning_count: u64,

    // File information
    input_file_size_bytes: u64,
    output_file_size_bytes: u64,
    input_file_checksum: Option<String>,
    output_file_checksum: Option<String>,

    // Per-stage metrics
    stage_metrics: HashMap<String, StageMetrics>,
}
}

Creating and Using ProcessingMetrics

#![allow(unused)]
fn main() {
impl ProcessingMetrics {
    /// Create new metrics for pipeline execution
    pub fn new(total_bytes: u64, total_chunks: u64) -> Self {
        Self {
            bytes_processed: 0,
            bytes_total: total_bytes,
            chunks_processed: 0,
            chunks_total: total_chunks,
            start_time: Some(Instant::now()),
            start_time_rfc3339: Some(Utc::now().to_rfc3339()),
            end_time: None,
            end_time_rfc3339: None,
            processing_duration: None,
            throughput_bytes_per_second: 0.0,
            compression_ratio: None,
            error_count: 0,
            warning_count: 0,
            input_file_size_bytes: total_bytes,
            output_file_size_bytes: 0,
            input_file_checksum: None,
            output_file_checksum: None,
            stage_metrics: HashMap::new(),
        }
    }

    /// Update progress
    pub fn update_progress(&mut self, bytes: u64, chunks: u64) {
        self.bytes_processed += bytes;
        self.chunks_processed += chunks;
        self.update_throughput();
    }

    /// Calculate throughput
    fn update_throughput(&mut self) {
        if let Some(start) = self.start_time {
            let elapsed = start.elapsed().as_secs_f64();
            if elapsed > 0.0 {
                self.throughput_bytes_per_second =
                    self.bytes_processed as f64 / elapsed;
            }
        }
    }

    /// Complete processing
    pub fn complete(&mut self) {
        self.end_time = Some(Instant::now());
        self.end_time_rfc3339 = Some(Utc::now().to_rfc3339());

        if let (Some(start), Some(end)) = (self.start_time, self.end_time) {
            self.processing_duration = Some(end - start);
        }

        self.update_throughput();
        self.calculate_compression_ratio();
    }

    /// Calculate compression ratio
    fn calculate_compression_ratio(&mut self) {
        if self.output_file_size_bytes > 0 && self.input_file_size_bytes > 0 {
            self.compression_ratio = Some(
                self.output_file_size_bytes as f64 /
                self.input_file_size_bytes as f64
            );
        }
    }
}
}

StageMetrics

Track performance for individual pipeline stages:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StageMetrics {
    /// Stage name
    stage_name: String,

    /// Bytes processed by this stage
    bytes_processed: u64,

    /// Processing time for this stage
    #[serde(skip)]
    processing_time: Option<Duration>,

    /// Throughput (bytes per second)
    throughput_bps: f64,

    /// Error count for this stage
    error_count: u64,

    /// Memory usage (optional)
    memory_usage_bytes: Option<u64>,
}

impl ProcessingMetrics {
    /// Record stage metrics
    pub fn record_stage_metrics(
        &mut self,
        stage_name: String,
        bytes: u64,
        duration: Duration,
    ) {
        let throughput = if duration.as_secs_f64() > 0.0 {
            bytes as f64 / duration.as_secs_f64()
        } else {
            0.0
        };

        let stage_metrics = StageMetrics {
            stage_name: stage_name.clone(),
            bytes_processed: bytes,
            processing_time: Some(duration),
            throughput_bps: throughput,
            error_count: 0,
            memory_usage_bytes: None,
        };

        self.stage_metrics.insert(stage_name, stage_metrics);
    }
}
}

Prometheus Metrics

MetricsService

Infrastructure service for Prometheus metrics:

#![allow(unused)]
fn main() {
use prometheus::{
    IntCounter, IntGauge, Gauge, Histogram,
    HistogramOpts, Opts, Registry
};
use std::sync::Arc;

#[derive(Clone)]
pub struct MetricsService {
    registry: Arc<Registry>,

    // Pipeline execution metrics
    pipelines_processed_total: IntCounter,
    pipeline_processing_duration: Histogram,
    pipeline_bytes_processed_total: IntCounter,
    pipeline_chunks_processed_total: IntCounter,
    pipeline_errors_total: IntCounter,
    pipeline_warnings_total: IntCounter,

    // Performance metrics
    throughput_mbps: Gauge,
    compression_ratio: Gauge,
    active_pipelines: IntGauge,

    // System metrics
    memory_usage_bytes: IntGauge,
    cpu_utilization_percent: Gauge,
}

impl MetricsService {
    pub fn new() -> Result<Self, PipelineError> {
        let registry = Arc::new(Registry::new());

        // Create counters
        let pipelines_processed_total = IntCounter::new(
            "pipeline_processed_total",
            "Total number of pipelines processed"
        )?;

        let pipeline_bytes_processed_total = IntCounter::new(
            "pipeline_bytes_processed_total",
            "Total bytes processed by pipelines"
        )?;

        let pipeline_errors_total = IntCounter::new(
            "pipeline_errors_total",
            "Total number of processing errors"
        )?;

        // Create histograms
        let pipeline_processing_duration = Histogram::with_opts(
            HistogramOpts::new(
                "pipeline_processing_duration_seconds",
                "Pipeline processing duration in seconds"
            )
            .buckets(vec![0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0])
        )?;

        // Create gauges
        let throughput_mbps = Gauge::new(
            "pipeline_throughput_mbps",
            "Current processing throughput in MB/s"
        )?;

        let compression_ratio = Gauge::new(
            "pipeline_compression_ratio",
            "Current compression ratio"
        )?;

        let active_pipelines = IntGauge::new(
            "pipeline_active_count",
            "Number of currently active pipelines"
        )?;

        // Register all metrics
        registry.register(Box::new(pipelines_processed_total.clone()))?;
        registry.register(Box::new(pipeline_bytes_processed_total.clone()))?;
        registry.register(Box::new(pipeline_errors_total.clone()))?;
        registry.register(Box::new(pipeline_processing_duration.clone()))?;
        registry.register(Box::new(throughput_mbps.clone()))?;
        registry.register(Box::new(compression_ratio.clone()))?;
        registry.register(Box::new(active_pipelines.clone()))?;

        Ok(Self {
            registry,
            pipelines_processed_total,
            pipeline_processing_duration,
            pipeline_bytes_processed_total,
            pipeline_chunks_processed_total: /* ... */,
            pipeline_errors_total,
            pipeline_warnings_total: /* ... */,
            throughput_mbps,
            compression_ratio,
            active_pipelines,
            memory_usage_bytes: /* ... */,
            cpu_utilization_percent: /* ... */,
        })
    }
}
}

Recording Metrics

#![allow(unused)]
fn main() {
impl MetricsService {
    /// Record pipeline completion
    pub fn record_pipeline_completion(
        &self,
        metrics: &ProcessingMetrics
    ) -> Result<(), PipelineError> {
        // Increment counters
        self.pipelines_processed_total.inc();
        self.pipeline_bytes_processed_total
            .inc_by(metrics.bytes_processed());
        self.pipeline_chunks_processed_total
            .inc_by(metrics.chunks_processed());

        // Record duration
        if let Some(duration) = metrics.processing_duration() {
            self.pipeline_processing_duration
                .observe(duration.as_secs_f64());
        }

        // Update gauges
        self.throughput_mbps.set(
            metrics.throughput_bytes_per_second() / 1_000_000.0
        );

        if let Some(ratio) = metrics.compression_ratio() {
            self.compression_ratio.set(ratio);
        }

        // Record errors
        if metrics.error_count() > 0 {
            self.pipeline_errors_total
                .inc_by(metrics.error_count());
        }

        Ok(())
    }

    /// Record pipeline start
    pub fn record_pipeline_start(&self) {
        self.active_pipelines.inc();
    }

    /// Record pipeline end
    pub fn record_pipeline_end(&self) {
        self.active_pipelines.dec();
    }
}
}

Available Metrics

Counter Metrics

Monotonically increasing values:

Metric Name                        Type       Description
pipeline_processed_total           Counter    Total pipelines processed
pipeline_bytes_processed_total     Counter    Total bytes processed
pipeline_chunks_processed_total    Counter    Total chunks processed
pipeline_errors_total              Counter    Total processing errors
pipeline_warnings_total            Counter    Total warnings

Gauge Metrics

Values that can increase or decrease:

Metric Name                         Type     Description
pipeline_active_count               Gauge    Currently active pipelines
pipeline_throughput_mbps            Gauge    Current throughput (MB/s)
pipeline_compression_ratio          Gauge    Current compression ratio
pipeline_memory_usage_bytes         Gauge    Memory usage in bytes
pipeline_cpu_utilization_percent    Gauge    CPU utilization percentage

Histogram Metrics

Distribution of values with buckets:

Metric Name                             Type         Buckets                       Description
pipeline_processing_duration_seconds    Histogram    0.1, 0.5, 1, 5, 10, 30, 60    Processing duration
pipeline_chunk_size_bytes               Histogram    1K, 10K, 100K, 1M, 10M        Chunk size distribution
pipeline_stage_duration_seconds         Histogram    0.01, 0.1, 0.5, 1, 5          Stage processing time

Metrics Export

HTTP Endpoint

Export metrics via HTTP for Prometheus scraping:

#![allow(unused)]
fn main() {
use warp::Filter;

pub fn metrics_endpoint(
    metrics_service: Arc<MetricsService>
) -> impl Filter<Extract = impl warp::Reply, Error = warp::Rejection> + Clone {
    warp::path!("metrics")
        .and(warp::get())
        .map(move || {
            let encoder = prometheus::TextEncoder::new();
            let metric_families = metrics_service.registry.gather();
            let mut buffer = Vec::new();

            encoder.encode(&metric_families, &mut buffer)
                .unwrap_or_else(|e| {
                    eprintln!("Failed to encode metrics: {}", e);
                });

            warp::reply::with_header(
                buffer,
                "Content-Type",
                encoder.format_type()
            )
        })
}
}

Prometheus Configuration

Configure Prometheus to scrape metrics:

# prometheus.yml
scrape_configs:
  - job_name: 'pipeline'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: '/metrics'

Example Metrics Output

# HELP pipeline_processed_total Total number of pipelines processed
# TYPE pipeline_processed_total counter
pipeline_processed_total 1234

# HELP pipeline_bytes_processed_total Total bytes processed
# TYPE pipeline_bytes_processed_total counter
pipeline_bytes_processed_total 1073741824

# HELP pipeline_processing_duration_seconds Pipeline processing duration
# TYPE pipeline_processing_duration_seconds histogram
pipeline_processing_duration_seconds_bucket{le="0.1"} 45
pipeline_processing_duration_seconds_bucket{le="0.5"} 120
pipeline_processing_duration_seconds_bucket{le="1.0"} 280
pipeline_processing_duration_seconds_bucket{le="5.0"} 450
pipeline_processing_duration_seconds_bucket{le="10.0"} 500
pipeline_processing_duration_seconds_bucket{le="+Inf"} 520
pipeline_processing_duration_seconds_sum 2340.5
pipeline_processing_duration_seconds_count 520

# HELP pipeline_throughput_mbps Current processing throughput
# TYPE pipeline_throughput_mbps gauge
pipeline_throughput_mbps 125.7

# HELP pipeline_compression_ratio Current compression ratio
# TYPE pipeline_compression_ratio gauge
pipeline_compression_ratio 0.35

Integration with Processing

Automatic Metric Collection

Metrics are automatically collected during processing:

#![allow(unused)]
fn main() {
use adaptive_pipeline::MetricsService;

async fn process_file_with_metrics(
    input_path: &str,
    output_path: &str,
    metrics_service: Arc<MetricsService>,
) -> Result<(), PipelineError> {
    // Create domain metrics
    let mut metrics = ProcessingMetrics::new(
        file_size,
        chunk_count
    );

    // Record start
    metrics_service.record_pipeline_start();

    // Process file
    for chunk in chunks {
        let start = Instant::now();

        let processed = process_chunk(chunk)?;

        metrics.update_progress(
            processed.len() as u64,
            1
        );

        metrics.record_stage_metrics(
            "compression".to_string(),
            processed.len() as u64,
            start.elapsed()
        );
    }

    // Complete metrics
    metrics.complete();

    // Export to Prometheus
    metrics_service.record_pipeline_completion(&metrics)?;
    metrics_service.record_pipeline_end();

    Ok(())
}
}

Observer Pattern

Use the observer pattern for automatic metric updates:

#![allow(unused)]
fn main() {
pub struct MetricsObserver {
    metrics_service: Arc<MetricsService>,
}

impl MetricsObserver {
    pub fn observe_processing(
        &self,
        event: ProcessingEvent
    ) {
        match event {
            ProcessingEvent::PipelineStarted => {
                self.metrics_service.record_pipeline_start();
            }
            ProcessingEvent::ChunkProcessed { bytes, duration } => {
                self.metrics_service.pipeline_bytes_processed_total
                    .inc_by(bytes);
                self.metrics_service.pipeline_processing_duration
                    .observe(duration.as_secs_f64());
            }
            ProcessingEvent::PipelineCompleted { metrics } => {
                self.metrics_service
                    .record_pipeline_completion(&metrics)
                    .ok();
                self.metrics_service.record_pipeline_end();
            }
            ProcessingEvent::Error => {
                self.metrics_service.pipeline_errors_total.inc();
            }
        }
    }
}
}

Visualization with Grafana

Dashboard Configuration

Create a Grafana dashboard for visualization:

{
  "dashboard": {
    "title": "Pipeline Metrics",
    "panels": [
      {
        "title": "Throughput",
        "targets": [{
          "expr": "rate(pipeline_bytes_processed_total[5m]) / 1000000",
          "legendFormat": "Throughput (MB/s)"
        }]
      },
      {
        "title": "Processing Duration (P95)",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(pipeline_processing_duration_seconds_bucket[5m]))",
          "legendFormat": "P95 Duration"
        }]
      },
      {
        "title": "Active Pipelines",
        "targets": [{
          "expr": "pipeline_active_count",
          "legendFormat": "Active"
        }]
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "rate(pipeline_errors_total[5m])",
          "legendFormat": "Errors/sec"
        }]
      }
    ]
  }
}

Common Queries

Throughput over time:

rate(pipeline_bytes_processed_total[5m]) / 1000000

Average processing duration:

rate(pipeline_processing_duration_seconds_sum[5m]) /
rate(pipeline_processing_duration_seconds_count[5m])

P99 latency:

histogram_quantile(0.99,
  rate(pipeline_processing_duration_seconds_bucket[5m])
)

Error rate:

rate(pipeline_errors_total[5m])

Compression effectiveness:

avg(pipeline_compression_ratio)

Performance Considerations

Low-Overhead Updates

Metrics use atomic operations for minimal overhead:

#![allow(unused)]
fn main() {
// ✅ GOOD: Atomic increment
self.counter.inc();

// ❌ BAD: Locking for simple increment
let mut guard = self.counter.lock().unwrap();
*guard += 1;
}

Batch Updates

Batch metric updates when possible:

#![allow(unused)]
fn main() {
// ✅ GOOD: Batch update
self.pipeline_bytes_processed_total.inc_by(total_bytes);

// ❌ BAD: Multiple individual updates
for _ in 0..total_bytes {
    self.pipeline_bytes_processed_total.inc();
}
}

Efficient Labels

Use labels judiciously to avoid cardinality explosion:

#![allow(unused)]
fn main() {
// ✅ GOOD: Limited cardinality
let counter = register_int_counter_vec!(
    "pipeline_processed_total",
    "Total pipelines processed",
    &["algorithm", "stage"]  // ~10 algorithms × ~5 stages = 50 series
)?;

// ❌ BAD: High cardinality
let counter = register_int_counter_vec!(
    "pipeline_processed_total",
    "Total pipelines processed",
    &["pipeline_id", "user_id"]  // Could be millions of series!
)?;
}

Alerting

Alert Rules

Define Prometheus alert rules:

# alerts.yml
groups:
  - name: pipeline_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(pipeline_errors_total[5m]) > 0.1
        for: 5m
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: LowThroughput
        expr: rate(pipeline_bytes_processed_total[5m]) < 1000000
        for: 10m
        annotations:
          summary: "Low throughput detected"
          description: "Throughput is {{ $value }} bytes/sec"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(pipeline_processing_duration_seconds_bucket[5m])
          ) > 60
        for: 5m
        annotations:
          summary: "High P95 latency"
          description: "P95 latency is {{ $value }}s"

Best Practices

Metric Naming

Follow Prometheus naming conventions:

#![allow(unused)]
fn main() {
// ✅ GOOD: Clear, consistent names
pipeline_bytes_processed_total      // Counter with _total suffix
pipeline_processing_duration_seconds // Time in base unit (seconds)
pipeline_active_count               // Gauge without suffix

// ❌ BAD: Inconsistent naming
processed_bytes                     // Missing namespace
duration_ms                        // Wrong unit (use seconds)
active                             // Too vague
}

Unit Consistency

Always use base units:

#![allow(unused)]
fn main() {
// ✅ GOOD: Base units
duration_seconds: f64              // Seconds, not milliseconds
size_bytes: u64                    // Bytes, not KB/MB
ratio: f64                         // Unitless ratio 0.0-1.0

// ❌ BAD: Non-standard units
duration_ms: u64
size_mb: f64
percentage: u8
}

Documentation

Document all metrics:

#![allow(unused)]
fn main() {
// ✅ GOOD: Well documented
let counter = IntCounter::with_opts(
    Opts::new(
        "pipeline_processed_total",
        "Total number of pipelines processed successfully. \
         Incremented on completion of each pipeline execution."
    )
)?;
}

Next Steps

Now that you understand metrics collection:

Logging Implementation

Version: 1.0 Date: 2025-10-04 License: BSD-3-Clause Copyright: (c) 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Active


Overview

The Adaptive Pipeline uses structured logging via the tracing crate to provide comprehensive observability throughout the system. This chapter details the logging architecture, implementation patterns, and best practices.

Key Features

  • Structured Logging: Rich, structured log events with contextual metadata
  • Two-Phase Architecture: Separate bootstrap and application logging
  • Hierarchical Levels: Traditional log levels (ERROR, WARN, INFO, DEBUG, TRACE)
  • Targeted Filtering: Fine-grained control via log targets
  • Integration: Seamless integration with ObservabilityService and metrics
  • Performance: Low-overhead logging with compile-time filtering
  • Testability: Trait-based abstractions for testing

Architecture

Two-Phase Logging System

The system employs a two-phase logging approach to handle different initialization stages:

┌─────────────────────────────────────────────────────────────┐
│                    Application Lifecycle                     │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Phase 1: Bootstrap                Phase 2: Application     │
│  ┌───────────────────┐             ┌──────────────────────┐ │
│  │ BootstrapLogger   │────────────>│ Tracing Subscriber   │ │
│  │ (Early init)      │             │ (Full featured)      │ │
│  ├───────────────────┤             ├──────────────────────┤ │
│  │ - Simple API      │             │ - Structured events  │ │
│  │ - No dependencies │             │ - Span tracking      │ │
│  │ - Testable        │             │ - Context propagation│ │
│  │ - Minimal overhead│             │ - External outputs   │ │
│  └───────────────────┘             └──────────────────────┘ │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Phase 1: Bootstrap Logger

Located in bootstrap/src/logger.rs, the bootstrap logger provides minimal logging during early initialization:

  • Minimal dependencies: No heavy tracing infrastructure
  • Trait-based abstraction: BootstrapLogger trait for testability
  • Simple API: Only 4 log levels (error, warn, info, debug)
  • Early availability: Available before tracing subscriber initialization

Phase 2: Application Logging

Once the application is fully initialized, the full tracing infrastructure is used:

  • Rich structured logging: Fields, spans, and events
  • Multiple subscribers: Console, file, JSON, external systems
  • Performance tracking: Integration with ObservabilityService
  • Distributed tracing: Context propagation for async operations
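
A minimal sketch of initializing such a subscriber with tracing-subscriber (assuming its env-filter feature; the project's actual bootstrap sequence may differ):

#![allow(unused)]
fn main() {
use tracing_subscriber::EnvFilter;

fn init_application_logging() {
    // Structured console output, filtered via RUST_LOG
    // (e.g. RUST_LOG=info,adaptive_pipeline=debug)
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::from_default_env())
        .with_target(true)
        .init();
}
}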

Log Levels

The system uses five hierarchical log levels, from most to least severe:

Level    Macro      Use Case                                  Example
ERROR    error!()   Fatal errors, unrecoverable failures      Database connection failure, file corruption
WARN     warn!()    Non-fatal issues, degraded performance    High error rate alert, configuration warning
INFO     info!()    Normal operations, key milestones         Pipeline started, file processed successfully
DEBUG    debug!()   Detailed diagnostic information           Stage execution details, chunk processing
TRACE    trace!()   Very verbose debugging                    Function entry/exit, detailed state dumps

Level Guidelines

ERROR: Use sparingly for genuine failures that require attention

#![allow(unused)]
fn main() {
error!("Failed to connect to database: {}", err);
error!(pipeline_id = %id, "Pipeline execution failed: {}", err);
}

WARN: Use for concerning situations that don't prevent operation

#![allow(unused)]
fn main() {
warn!("High error rate: {:.1}% (threshold: {:.1}%)", rate, threshold);
warn!(stage = %name, "Stage processing slower than expected");
}

INFO: Use for important operational events

#![allow(unused)]
fn main() {
info!("Started pipeline processing: {}", pipeline_name);
info!(bytes = %bytes_processed, duration = ?elapsed, "File processing completed");
}

DEBUG: Use for detailed diagnostic information during development

#![allow(unused)]
fn main() {
debug!("Preparing stage: {}", stage.name());
debug!(chunk_count = chunks, size = bytes, "Processing chunk batch");
}

TRACE: Use for extremely detailed debugging (usually disabled in production)

#![allow(unused)]
fn main() {
trace!("Entering function with args: {:?}", args);
trace!(state = ?current_state, "State transition complete");
}

Bootstrap Logger

BootstrapLogger Trait

The bootstrap logger abstraction allows for different implementations:

#![allow(unused)]
fn main() {
/// Bootstrap logging abstraction
pub trait BootstrapLogger: Send + Sync {
    /// Log an error message (fatal errors during bootstrap)
    fn error(&self, message: &str);

    /// Log a warning message (non-fatal issues)
    fn warn(&self, message: &str);

    /// Log an info message (normal bootstrap progress)
    fn info(&self, message: &str);

    /// Log a debug message (detailed diagnostic information)
    fn debug(&self, message: &str);
}
}

ConsoleLogger Implementation

The production implementation wraps the tracing crate:

#![allow(unused)]
fn main() {
/// Console logger implementation using tracing
pub struct ConsoleLogger {
    prefix: String,
}

impl ConsoleLogger {
    /// Create a new console logger with default prefix
    pub fn new() -> Self {
        Self::with_prefix("bootstrap")
    }

    /// Create a new console logger with custom prefix
    pub fn with_prefix(prefix: impl Into<String>) -> Self {
        Self { prefix: prefix.into() }
    }
}

impl BootstrapLogger for ConsoleLogger {
    fn error(&self, message: &str) {
        tracing::error!(target: "bootstrap", "[{}] {}", self.prefix, message);
    }

    fn warn(&self, message: &str) {
        tracing::warn!(target: "bootstrap", "[{}] {}", self.prefix, message);
    }

    fn info(&self, message: &str) {
        tracing::info!(target: "bootstrap", "[{}] {}", self.prefix, message);
    }

    fn debug(&self, message: &str) {
        tracing::debug!(target: "bootstrap", "[{}] {}", self.prefix, message);
    }
}
}

NoOpLogger for Testing

A no-op implementation for testing scenarios where logging should be silent:

#![allow(unused)]
fn main() {
/// No-op logger for testing
pub struct NoOpLogger;

impl NoOpLogger {
    pub fn new() -> Self {
        Self
    }
}

impl BootstrapLogger for NoOpLogger {
    fn error(&self, _message: &str) {}
    fn warn(&self, _message: &str) {}
    fn info(&self, _message: &str) {}
    fn debug(&self, _message: &str) {}
}
}

Usage Example

#![allow(unused)]
fn main() {
use bootstrap::logger::{BootstrapLogger, ConsoleLogger};

fn bootstrap_application() -> Result<()> {
    let logger = ConsoleLogger::new();

    logger.info("Starting application bootstrap");
    logger.debug("Parsing command line arguments");

    match parse_config() {
        Ok(config) => {
            logger.info("Configuration loaded successfully");
            Ok(())
        }
        Err(e) => {
            logger.error(&format!("Failed to load configuration: {}", e));
            Err(e)
        }
    }
}
}

Application Logging with Tracing

Basic Logging Macros

Once the tracing subscriber is initialized, use the standard tracing macros:

#![allow(unused)]
fn main() {
use tracing::{trace, debug, info, warn, error};

fn process_pipeline(pipeline_id: &str) -> Result<()> {
    info!("Starting pipeline processing: {}", pipeline_id);

    debug!(pipeline_id = %pipeline_id, "Loading pipeline configuration");

    match execute_pipeline(pipeline_id) {
        Ok(result) => {
            info!(
                pipeline_id = %pipeline_id,
                bytes_processed = result.bytes,
                duration = ?result.duration,
                "Pipeline completed successfully"
            );
            Ok(())
        }
        Err(e) => {
            error!(
                pipeline_id = %pipeline_id,
                error = %e,
                "Pipeline execution failed"
            );
            Err(e)
        }
    }
}
}

Structured Fields

Add structured fields to log events for better searchability and filtering:

#![allow(unused)]
fn main() {
// Simple field
info!(stage = "compression", "Stage started");

// Display formatting with %
info!(pipeline_id = %id, "Processing pipeline");

// Debug formatting with ?
debug!(config = ?pipeline_config, "Loaded configuration");

// Multiple fields
info!(
    pipeline_id = %id,
    stage = "encryption",
    bytes_processed = total_bytes,
    duration_ms = elapsed.as_millis(),
    "Stage completed"
);
}

Log Targets

Use targets to categorize and filter log events:

#![allow(unused)]
fn main() {
// Bootstrap logs
tracing::info!(target: "bootstrap", "Application starting");

// Domain events
tracing::debug!(target: "domain::pipeline", "Creating pipeline entity");

// Infrastructure events
tracing::debug!(target: "infrastructure::database", "Executing query");

// Metrics events
tracing::info!(target: "metrics", "Recording pipeline completion");
}

Example from Codebase

From pipeline/src/infrastructure/repositories/stage_executor.rs:

#![allow(unused)]
fn main() {
tracing::debug!(
    "Processing {} chunks with {} workers",
    chunks.len(),
    worker_count
);

tracing::info!(
    "Processed {} bytes in {:.2}s ({:.2} MB/s)",
    total_bytes,
    duration.as_secs_f64(),
    throughput_mbps
);

tracing::debug!(
    "Stage {} processed {} chunks successfully",
    stage_name,
    chunk_count
);
}

From bootstrap/src/signals.rs:

#![allow(unused)]
fn main() {
tracing::info!("Received SIGTERM, initiating graceful shutdown");
tracing::info!("Received SIGINT (Ctrl+C), initiating graceful shutdown");
tracing::error!("Failed to register SIGTERM handler: {}", e);
}

Integration with ObservabilityService

The logging system integrates with the ObservabilityService for comprehensive monitoring.

Automatic Logging in Operations

From pipeline/src/infrastructure/logging/observability_service.rs:

#![allow(unused)]
fn main() {
impl ObservabilityService {
    /// Complete operation tracking
    pub async fn complete_operation(
        &self,
        operation_name: &str,
        duration: Duration,
        success: bool,
        throughput_mbps: f64,
    ) {
        // ... update metrics (elided; produces the PerformanceTracker used below as `tracker`) ...

        info!(
            "Completed operation: {} in {:.2}s (throughput: {:.2} MB/s, success: {})",
            operation_name,
            duration.as_secs_f64(),
            throughput_mbps,
            success
        );

        // Check for alerts
        self.check_alerts(&tracker).await;
    }

    /// Check for alerts based on current metrics
    async fn check_alerts(&self, tracker: &PerformanceTracker) {
        if tracker.error_rate_percent > self.alert_thresholds.max_error_rate_percent {
            warn!(
                "🚨 Alert: High error rate {:.1}% (threshold: {:.1}%)",
                tracker.error_rate_percent,
                self.alert_thresholds.max_error_rate_percent
            );
        }

        if tracker.average_throughput_mbps < self.alert_thresholds.min_throughput_mbps {
            warn!(
                "🚨 Alert: Low throughput {:.2} MB/s (threshold: {:.2} MB/s)",
                tracker.average_throughput_mbps,
                self.alert_thresholds.min_throughput_mbps
            );
        }
    }
}
}

OperationTracker with Logging

The OperationTracker automatically logs operation lifecycle:

#![allow(unused)]
fn main() {
pub struct OperationTracker {
    operation_name: String,
    start_time: Instant,
    observability_service: ObservabilityService,
    completed: std::sync::atomic::AtomicBool,
}

impl OperationTracker {
    pub async fn complete(self, success: bool, bytes_processed: u64) {
        self.completed.store(true, std::sync::atomic::Ordering::Relaxed);

        let duration = self.start_time.elapsed();
        let throughput_mbps = calculate_throughput(bytes_processed, duration);

        // Logs completion via ObservabilityService.complete_operation()
        self.observability_service
            .complete_operation(&self.operation_name, duration, success, throughput_mbps)
            .await;
    }
}
}

Usage Pattern

#![allow(unused)]
fn main() {
// Start operation (increments active count, logs start)
let tracker = observability_service
    .start_operation("file_processing")
    .await;

// Do work...
let result = process_file(&input_path).await?;

// Complete operation (logs completion with metrics)
tracker.complete(true, result.bytes_processed).await;
}

Logging Configuration

Environment Variables

Control logging behavior via environment variables:

# Set log level (error, warn, info, debug, trace)
export RUST_LOG=info

# Set level per module
export RUST_LOG=pipeline=debug,bootstrap=info

# Set level per target
export RUST_LOG=infrastructure::database=debug

# Complex filtering
export RUST_LOG="info,pipeline::domain=debug,sqlx=warn"
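
If you prefer a dedicated variable over RUST_LOG, tracing-subscriber's EnvFilter can read an arbitrary environment variable. A small sketch (the variable name ADAPIPE_LOG is hypothetical, not part of the pipeline's configuration):

#![allow(unused)]
fn main() {
use tracing_subscriber::EnvFilter;

fn filter_from_env() -> EnvFilter {
    // Try the dedicated variable first, fall back to RUST_LOG, then to "info"
    EnvFilter::try_from_env("ADAPIPE_LOG")
        .or_else(|_| EnvFilter::try_from_default_env())
        .unwrap_or_else(|_| EnvFilter::new("info"))
}
}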

Subscriber Initialization

In main.rs or bootstrap code:

#![allow(unused)]
fn main() {
use tracing_subscriber::{fmt, EnvFilter};

fn init_logging() -> Result<()> {
    // Initialize tracing subscriber with environment filter
    tracing_subscriber::fmt()
        .with_env_filter(
            EnvFilter::try_from_default_env()
                .unwrap_or_else(|_| EnvFilter::new("info"))
        )
        .with_target(true)
        .with_thread_ids(true)
        .with_line_number(true)
        .init();

    info!("Logging initialized");
    Ok(())
}
}

Advanced Subscriber Configuration

#![allow(unused)]
fn main() {
use tracing_subscriber::{fmt, layer::SubscriberExt, EnvFilter, Registry};
use tracing_appender::{non_blocking, rolling};

fn init_advanced_logging() -> Result<()> {
    // Console output
    let console_layer = fmt::layer()
        .with_target(true)
        .with_thread_ids(true);

    // File output with daily rotation
    let file_appender = rolling::daily("./logs", "pipeline.log");
    let (non_blocking_appender, _guard) = non_blocking(file_appender);
    let file_layer = fmt::layer()
        .with_writer(non_blocking_appender)
        .with_ansi(false)
        .json();

    // Combine layers
    let subscriber = Registry::default()
        .with(EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info")))
        .with(console_layer)
        .with(file_layer);

    tracing::subscriber::set_global_default(subscriber)?;

    info!("Advanced logging initialized");
    Ok(())
}
}

Performance Considerations

Compile-Time Filtering

The tracing crate supports compile-time filtering through its max_level_* (debug builds) and release_max_level_* (release builds) Cargo features. Events above the static maximum level are compiled out entirely, so their arguments are never evaluated:

#![allow(unused)]
fn main() {
// Compiles to a no-op in release builds when release_max_level_info is enabled
debug!("Expensive computation result: {:?}", expensive_calculation());
}

To strip debug! and trace! events from release builds, add to Cargo.toml:

[dependencies]
tracing = { version = "0.1", features = ["release_max_level_info"] }

Lazy Evaluation

Avoid building expensive strings inside log calls. When an event is disabled, tracing skips evaluating its arguments entirely; when it is enabled, a value interpolated into the message is formatted immediately, while a value captured as a field defers formatting to the subscriber:

#![allow(unused)]
fn main() {
// Formats the large struct as soon as the event is enabled
debug!("Large struct: {}", format_large_struct(&data));

// Defers Debug formatting to the subscriber, which may skip the field entirely
debug!(data = ?data, "Processing large struct");
}

Structured vs. Formatted

Prefer structured fields over values interpolated into the message:

#![allow(unused)]
fn main() {
// Message-only: values are flattened into the text, hard to search or filter on
info!("Processed {} bytes in {}ms", bytes, duration_ms);

// Structured: fields are recorded as typed values that subscribers can index and filter
info!(bytes = bytes, duration_ms = duration_ms, "Processed data");
}

Async Performance

In async contexts, avoid blocking operations in log statements:

#![allow(unused)]
fn main() {
// BAD: Blocking call in log statement
info!("Config: {}", read_config_file_sync()); // Blocks async task!

// GOOD: Log after async operation completes
let config = read_config_file().await?;
info!(config = ?config, "Configuration loaded");
}

Benchmark Results

Logging overhead on Intel i7-10700K @ 3.8 GHz:

| Operation                        | Time    | Overhead   |
|----------------------------------|---------|------------|
| info!() with 3 fields            | ~80 ns  | Negligible |
| debug!() (disabled)              | ~0 ns   | Zero       |
| debug!() (enabled) with 5 fields | ~120 ns | Minimal    |
| JSON serialization               | ~500 ns | Low        |
| File I/O (async)                 | ~10 μs  | Moderate   |

Recommendation: Log freely at info! and above. Use debug! and trace! judiciously.


Best Practices

✅ DO

Use structured fields for important data

#![allow(unused)]
fn main() {
info!(
    pipeline_id = %id,
    bytes = total_bytes,
    duration_ms = elapsed.as_millis(),
    "Pipeline completed"
);
}

Use display formatting (%) for user-facing types

#![allow(unused)]
fn main() {
info!(pipeline_id = %id, stage = %stage_name, "Processing stage");
}

Use debug formatting (?) for complex types

#![allow(unused)]
fn main() {
debug!(config = ?pipeline_config, "Loaded configuration");
}

Log errors with context

#![allow(unused)]
fn main() {
error!(
    pipeline_id = %id,
    stage = "encryption",
    error = %err,
    "Stage execution failed"
);
}

Use targets for filtering

#![allow(unused)]
fn main() {
tracing::debug!(target: "infrastructure::database", "Executing query: {}", sql);
}

Log at appropriate levels

#![allow(unused)]
fn main() {
// Error: Genuine failures
error!("Database connection lost: {}", err);

// Warn: Concerning but recoverable
warn!("Retry attempt {} of {}", attempt, max_attempts);

// Info: Important operational events
info!("Pipeline started successfully");

// Debug: Detailed diagnostic information
debug!("Stage prepared with {} workers", worker_count);
}

❌ DON'T

Don't log sensitive information

#![allow(unused)]
fn main() {
// BAD: Logging encryption keys
debug!("Encryption key: {}", key); // SECURITY RISK!

// GOOD: Log that operation happened without exposing secrets
debug!("Encryption key loaded from configuration");
}

Don't use println! or eprintln! in production code

#![allow(unused)]
fn main() {
// BAD
println!("Processing file: {}", path);

// GOOD
info!(path = %path, "Processing file");
}

Don't log in hot loops without rate limiting

#![allow(unused)]
fn main() {
// BAD: Logs millions of times
for chunk in chunks {
    debug!("Processing chunk {}", chunk.id); // Too verbose!
}

// GOOD: Log summary
debug!(chunk_count = chunks.len(), "Processing chunks");
// ... process chunks ...
info!(chunks_processed = chunks.len(), "Chunk processing complete");
}

Don't perform expensive operations in log statements

#![allow(unused)]
fn main() {
// BAD
debug!("Data: {}", expensive_serialization(&data));

// GOOD: Use debug formatting for lazy evaluation
debug!(data = ?data, "Processing data");
}

Don't duplicate metrics in logs

#![allow(unused)]
fn main() {
// BAD: Metrics already tracked separately
info!("Throughput: {} MB/s", throughput); // Redundant with MetricsService!

// GOOD: Log events, not metrics
info!("File processing completed successfully");
// MetricsService already tracks throughput
}

Testing Strategies

CapturingLogger for Tests

The bootstrap layer provides a capturing logger for tests:

#![allow(unused)]
fn main() {
#[cfg(test)]
pub struct CapturingLogger {
    messages: Arc<Mutex<Vec<LogMessage>>>,
}

#[cfg(test)]
impl CapturingLogger {
    pub fn new() -> Self {
        Self {
            messages: Arc::new(Mutex::new(Vec::new())),
        }
    }

    pub fn messages(&self) -> Vec<LogMessage> {
        self.messages.lock().unwrap().clone()
    }

    pub fn clear(&self) {
        self.messages.lock().unwrap().clear();
    }
}

#[cfg(test)]
impl BootstrapLogger for CapturingLogger {
    fn error(&self, message: &str) {
        self.log(LogLevel::Error, message);
    }

    fn warn(&self, message: &str) {
        self.log(LogLevel::Warn, message);
    }

    fn info(&self, message: &str) {
        self.log(LogLevel::Info, message);
    }

    fn debug(&self, message: &str) {
        self.log(LogLevel::Debug, message);
    }
}
}

Test Usage

#![allow(unused)]
fn main() {
#[test]
fn test_bootstrap_logging() {
    let logger = CapturingLogger::new();

    logger.info("Starting operation");
    logger.debug("Detailed diagnostic");
    logger.error("Operation failed");

    let messages = logger.messages();
    assert_eq!(messages.len(), 3);
    assert_eq!(messages[0].level, LogLevel::Info);
    assert_eq!(messages[0].message, "Starting operation");
    assert_eq!(messages[2].level, LogLevel::Error);
}
}

Testing with tracing-subscriber

For testing application-level logging:

#![allow(unused)]
fn main() {
use tracing_subscriber::{fmt, EnvFilter};

#[test]
fn test_application_logging() {
    // Initialize test subscriber
    let subscriber = fmt()
        .with_test_writer()
        .with_env_filter(EnvFilter::new("debug"))
        .finish();

    tracing::subscriber::with_default(subscriber, || {
        // Test code that produces logs
        process_pipeline("test-pipeline");
    });
}
}

NoOp Logger for Silent Tests

#![allow(unused)]
fn main() {
#[test]
fn test_without_logging() {
    let logger = NoOpLogger::new();

    // Run tests without any log output
    let result = bootstrap_with_logger(&logger);

    assert!(result.is_ok());
}
}

Common Patterns

Request/Response Logging

#![allow(unused)]
fn main() {
pub async fn process_file(path: &Path) -> Result<ProcessingResult> {
    let request_id = Uuid::new_v4();

    info!(
        request_id = %request_id,
        path = %path.display(),
        "Processing file"
    );

    match do_processing(path).await {
        Ok(result) => {
            info!(
                request_id = %request_id,
                bytes_processed = result.bytes,
                duration_ms = result.duration.as_millis(),
                "File processed successfully"
            );
            Ok(result)
        }
        Err(e) => {
            error!(
                request_id = %request_id,
                error = %e,
                "File processing failed"
            );
            Err(e)
        }
    }
}
}

Progress Logging

#![allow(unused)]
fn main() {
pub async fn process_chunks(chunks: Vec<Chunk>) -> Result<()> {
    let total = chunks.len();

    info!(total_chunks = total, "Starting chunk processing");

    for (i, chunk) in chunks.iter().enumerate() {
        process_chunk(chunk).await?;

        // Log progress every 10%
        if (i + 1) % (total / 10).max(1) == 0 {
            let percent = ((i + 1) * 100) / total;
            info!(
                processed = i + 1,
                total = total,
                percent = percent,
                "Chunk processing progress"
            );
        }
    }

    info!(total_chunks = total, "Chunk processing completed");
    Ok(())
}
}

Error Context Logging

#![allow(unused)]
fn main() {
pub async fn execute_pipeline(id: &PipelineId) -> Result<()> {
    debug!(pipeline_id = %id, "Loading pipeline from database");

    let pipeline = repository.find_by_id(id).await
        .map_err(|e| {
            error!(
                pipeline_id = %id,
                error = %e,
                "Failed to load pipeline from database"
            );
            e
        })?;

    debug!(
        pipeline_id = %id,
        stage_count = pipeline.stages().len(),
        "Pipeline loaded successfully"
    );

    Ok(())
}
}

Conditional Logging

#![allow(unused)]
fn main() {
pub fn process_with_validation(data: &Data) -> Result<()> {
    if let Err(e) = validate(data) {
        warn!(
            validation_error = %e,
            "Data validation failed, attempting recovery"
        );

        return recover_from_validation_error(data, e);
    }

    debug!("Data validation passed");
    process(data)
}
}


Concurrency Model

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

This chapter provides a comprehensive overview of the concurrency model in the adaptive pipeline system. Learn how async/await, Tokio runtime, and concurrent patterns enable high-performance, scalable file processing.



Overview

The concurrency model enables the adaptive pipeline to process files efficiently through parallel processing, async I/O, and concurrent chunk handling. The system uses Rust's async/await with the Tokio runtime for high-performance, scalable concurrency.

Key Features

  • Async/Await: Non-blocking asynchronous operations
  • Tokio Runtime: Multi-threaded async runtime
  • Parallel Processing: Concurrent chunk processing
  • Worker Pools: Configurable thread pools
  • Resource Management: Efficient resource allocation and cleanup
  • Thread Safety: Safe concurrent access through Rust's type system

Concurrency Stack

┌─────────────────────────────────────────────────────────────┐
│                  Application Layer                          │
│  - Pipeline orchestration                                   │
│  - File processing coordination                             │
└─────────────────────────────────────────────────────────────┘
                         ↓ async
┌─────────────────────────────────────────────────────────────┐
│                   Tokio Runtime                             │
│  - Multi-threaded work-stealing scheduler                   │
│  - Async task execution                                     │
│  - I/O reactor                                              │
└─────────────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────────────┐
│              Concurrency Primitives                         │
│  ┌─────────┬──────────┬──────────┬─────────────┐           │
│  │ Mutex   │ RwLock   │ Semaphore│  Channel    │           │
│  │ (Sync)  │ (Shared) │ (Limit)  │ (Message)   │           │
│  └─────────┴──────────┴──────────┴─────────────┘           │
└─────────────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────────────┐
│                  Worker Threads                             │
│  - Chunk processing workers                                 │
│  - I/O workers                                              │
│  - Background tasks                                         │
└─────────────────────────────────────────────────────────────┘

Design Principles

  1. Async-First: All I/O operations are asynchronous
  2. Structured Concurrency: Clear task ownership and lifetimes
  3. Safe Sharing: Thread-safe sharing through Arc and sync primitives
  4. Resource Bounded: Limited resource usage with semaphores
  5. Zero-Cost Abstractions: Minimal overhead from async runtime

Concurrency Architecture

The system uses a layered concurrency architecture with clear separation between sync and async code.

Architectural Layers

┌─────────────────────────────────────────────────────────────┐
│ Async Layer (I/O Bound)                                     │
│  ┌──────────────────────────────────────────────────┐      │
│  │  File I/O Service (async)                        │      │
│  │  - tokio::fs file operations                     │      │
│  │  - Async read/write                              │      │
│  └──────────────────────────────────────────────────┘      │
│  ┌──────────────────────────────────────────────────┐      │
│  │  Pipeline Service (async orchestration)          │      │
│  │  - Async workflow coordination                   │      │
│  │  - Task spawning and management                  │      │
│  └──────────────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────────────┐
│ Sync Layer (CPU Bound)                                      │
│  ┌──────────────────────────────────────────────────┐      │
│  │  Compression Service (sync)                      │      │
│  │  - CPU-bound compression algorithms              │      │
│  │  - No async overhead                             │      │
│  └──────────────────────────────────────────────────┘      │
│  ┌──────────────────────────────────────────────────┐      │
│  │  Encryption Service (sync)                       │      │
│  │  - CPU-bound encryption algorithms               │      │
│  │  - No async overhead                             │      │
│  └──────────────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────────────┘

Async vs Sync Decision

Async for:

  • File I/O (tokio::fs)
  • Network I/O
  • Database operations
  • Long-running waits

Sync for:

  • CPU-bound compression
  • CPU-bound encryption
  • Hash calculations
  • Pure computation
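
A minimal illustration of that split, assuming the sha2 crate for hashing: the file read is awaited, while the hash is computed synchronously.

#![allow(unused)]
fn main() {
use sha2::{Digest, Sha256};
use std::path::Path;

async fn read_and_hash(path: &Path) -> std::io::Result<Vec<u8>> {
    let bytes = tokio::fs::read(path).await?;  // async: I/O-bound
    Ok(Sha256::digest(&bytes).to_vec())        // sync: CPU-bound (move to spawn_blocking if large)
}
}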

Async/Await Model

The system uses Rust's async/await for non-blocking concurrency.

Async Functions

#![allow(unused)]
fn main() {
// Async function definition
async fn process_file(
    path: &Path,
    chunk_size: ChunkSize,
) -> Result<Vec<FileChunk>, PipelineError> {
    // Await async operations
    let chunks = read_file_chunks(path, chunk_size).await?;

    // Process chunks
    let results = process_chunks_parallel(chunks).await?;

    Ok(results)
}
}

Awaiting Futures

#![allow(unused)]
fn main() {
// Sequential awaits
let chunks = service.read_file_chunks(path, chunk_size).await?;
let processed = process_chunks(chunks).await?;
service.write_file_chunks(output_path, processed).await?;

// Parallel awaits with join
use tokio::try_join;

let (chunks1, chunks2) = try_join!(
    service.read_file_chunks(path1, chunk_size),
    service.read_file_chunks(path2, chunk_size),
)?;
}

Async Traits

#![allow(unused)]
fn main() {
use async_trait::async_trait;

#[async_trait]
pub trait FileIOService: Send + Sync {
    async fn read_file_chunks(
        &self,
        path: &Path,
        chunk_size: ChunkSize,
    ) -> Result<Vec<FileChunk>, PipelineError>;

    async fn write_file_chunks(
        &self,
        path: &Path,
        chunks: Vec<FileChunk>,
    ) -> Result<(), PipelineError>;
}
}

Tokio Runtime

The system uses Tokio's multi-threaded runtime for async execution.

Runtime Configuration

#![allow(unused)]
fn main() {
use tokio::runtime::Runtime;

// Multi-threaded runtime (default)
let runtime = Runtime::new()?;

// Custom configuration
let runtime = tokio::runtime::Builder::new_multi_thread()
    .worker_threads(8)              // 8 worker threads
    .thread_name("pipeline-worker")
    .thread_stack_size(3 * 1024 * 1024)  // 3 MB stack
    .enable_all()                   // Enable I/O and time drivers
    .build()?;

// Execute async work
runtime.block_on(async {
    process_file(path, chunk_size).await
})?;
}

Runtime Selection

// Multi-threaded runtime (CPU-bound + I/O)
#[tokio::main]
async fn main() {
    // Automatically uses multi-threaded runtime
    process_pipeline().await;
}

// Current-thread runtime (testing, single-threaded)
#[tokio::main(flavor = "current_thread")]
async fn main() {
    // Single-threaded runtime
    process_pipeline().await;
}

Work-Stealing Scheduler

Tokio uses a work-stealing scheduler for load balancing:

Thread 1: [Task A] [Task B] ────────> (idle, steals Task D)
Thread 2: [Task C] [Task D] [Task E]  (busy)
Thread 3: [Task F] ────────────────>  (idle, steals Task E)
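
A small sketch that makes the scheduler visible: spawned tasks report which worker thread ran them (the output varies from run to run because idle workers steal queued tasks).

#[tokio::main]
async fn main() {
    let handles: Vec<_> = (0..8)
        .map(|i| {
            tokio::spawn(async move {
                // Any worker may pick this task up; idle workers steal from busy ones
                println!("task {} ran on {:?}", i, std::thread::current().name());
            })
        })
        .collect();

    for handle in handles {
        handle.await.unwrap();
    }
}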

Parallel Chunk Processing

Chunks are processed concurrently for maximum throughput.

Parallel Processing with try_join_all

#![allow(unused)]
fn main() {
use futures::future::try_join_all;

async fn process_chunks_parallel(
    chunks: Vec<FileChunk>,
) -> Result<Vec<FileChunk>, PipelineError> {
    // Spawn tasks for each chunk
    let futures = chunks.into_iter().map(|chunk| {
        tokio::spawn(async move {
            process_chunk(chunk).await
        })
    });

    // Wait for all to complete
    let results = try_join_all(futures).await?;

    // Collect results
    Ok(results.into_iter().collect::<Result<Vec<_>, _>>()?)
}
}

Bounded Parallelism

#![allow(unused)]
fn main() {
use tokio::sync::Semaphore;
use std::sync::Arc;

async fn process_with_limit(
    chunks: Vec<FileChunk>,
    max_parallel: usize,
) -> Result<Vec<FileChunk>, PipelineError> {
    let semaphore = Arc::new(Semaphore::new(max_parallel));
    let futures = chunks.into_iter().map(|chunk| {
        let permit = semaphore.clone();
        async move {
            let _guard = permit.acquire().await.unwrap();
            process_chunk(chunk).await
        }
    });

    try_join_all(futures).await
}
}

Pipeline Parallelism

#![allow(unused)]
fn main() {
// Stage 1: Read chunks
let chunks = read_chunks_stream(path, chunk_size);

// Stage 2: Process chunks (parallel)
let processed = chunks
    .map(|chunk| async move {
        tokio::spawn(async move {
            compress_chunk(chunk).await
        }).await
    })
    .buffer_unordered(8);  // Up to 8 chunks in flight

// Stage 3: Write chunks
write_chunks_stream(processed).await?;
}

Concurrency Primitives

The system uses Tokio's async-aware concurrency primitives.

Async Mutex

#![allow(unused)]
fn main() {
use tokio::sync::Mutex;
use std::sync::Arc;

let shared_state = Arc::new(Mutex::new(HashMap::new()));

// Acquire lock asynchronously
let mut state = shared_state.lock().await;
state.insert(key, value);
// Lock automatically released when dropped
}

Async RwLock

#![allow(unused)]
fn main() {
use tokio::sync::RwLock;
use std::sync::Arc;

let config = Arc::new(RwLock::new(PipelineConfig::default()));

// Multiple readers
let config_read = config.read().await;
let chunk_size = config_read.chunk_size;

// Single writer
let mut config_write = config.write().await;
config_write.chunk_size = ChunkSize::from_mb(16)?;
}

Channels (mpsc)

#![allow(unused)]
fn main() {
use tokio::sync::mpsc;

// Create channel
let (tx, mut rx) = mpsc::channel::<FileChunk>(100);

// Send chunks
tokio::spawn(async move {
    for chunk in chunks {
        tx.send(chunk).await.unwrap();
    }
});

// Receive chunks
while let Some(chunk) = rx.recv().await {
    process_chunk(chunk).await?;
}
}

Semaphores

#![allow(unused)]
fn main() {
use tokio::sync::Semaphore;

// Limit concurrent operations
let semaphore = Arc::new(Semaphore::new(4));  // Max 4 concurrent

for chunk in chunks {
    let permit = semaphore.clone().acquire_owned().await.unwrap();
    tokio::spawn(async move {
        let result = process_chunk(chunk).await;
        drop(permit);  // Release permit
        result
    });
}
}

Thread Pools and Workers

Worker pools manage concurrent task execution.

Worker Pool Configuration

#![allow(unused)]
fn main() {
pub struct WorkerPool {
    max_workers: usize,
    semaphore: Arc<Semaphore>,
}

impl WorkerPool {
    pub fn new(max_workers: usize) -> Self {
        Self {
            max_workers,
            semaphore: Arc::new(Semaphore::new(max_workers)),
        }
    }

    pub async fn execute<F, T>(&self, task: F) -> Result<T, PipelineError>
    where
        F: Future<Output = Result<T, PipelineError>> + Send + 'static,
        T: Send + 'static,
    {
        let _permit = self.semaphore.acquire().await.unwrap();
        tokio::spawn(task).await.unwrap()
    }
}
}

Adaptive Worker Count

#![allow(unused)]
fn main() {
fn optimal_worker_count() -> usize {
    let cpu_count = num_cpus::get();

    // For I/O-bound: 2x CPU count
    // For CPU-bound: 1x CPU count
    // For mixed: 1.5x CPU count
    (cpu_count as f64 * 1.5) as usize
}

let worker_pool = WorkerPool::new(optimal_worker_count());
}

For detailed worker pool implementation, see Thread Pooling.


Resource Management

Efficient resource management is critical for concurrent systems.

Resource Limits

#![allow(unused)]
fn main() {
pub struct ResourceLimits {
    max_memory: usize,
    max_file_handles: usize,
    max_concurrent_tasks: usize,
}

impl ResourceLimits {
    pub fn calculate_max_parallel_chunks(&self, chunk_size: ChunkSize) -> usize {
        let memory_limit = self.max_memory / chunk_size.bytes();
        let task_limit = self.max_concurrent_tasks;

        memory_limit.min(task_limit)
    }
}
}

Resource Tracking

#![allow(unused)]
fn main() {
use std::sync::atomic::{AtomicUsize, Ordering};

pub struct ResourceTracker {
    active_tasks: AtomicUsize,
    memory_used: AtomicUsize,
}

impl ResourceTracker {
    pub fn acquire_task(&self) -> TaskGuard {
        self.active_tasks.fetch_add(1, Ordering::SeqCst);
        TaskGuard { tracker: self }
    }
}

pub struct TaskGuard<'a> {
    tracker: &'a ResourceTracker,
}

impl Drop for TaskGuard<'_> {
    fn drop(&mut self) {
        self.tracker.active_tasks.fetch_sub(1, Ordering::SeqCst);
    }
}
}

For detailed resource management, see Resource Management.


Concurrency Patterns

Common concurrency patterns used in the pipeline.

Pattern 1: Fan-Out/Fan-In

#![allow(unused)]
fn main() {
// Fan-out: Distribute work to multiple workers
let futures = chunks.into_iter().map(|chunk| {
    tokio::spawn(async move {
        process_chunk(chunk).await
    })
});

// Fan-in: Collect results
let results = try_join_all(futures).await?;
}

Pattern 2: Pipeline Pattern

#![allow(unused)]
fn main() {
use tokio_stream::StreamExt;

// Stage 1 → Stage 2 → Stage 3
let result = read_stream(path)
    .map(|chunk| compress_chunk(chunk))
    .buffer_unordered(8)
    .map(|chunk| encrypt_chunk(chunk))
    .buffer_unordered(8)
    .collect::<Vec<_>>()
    .await;
}

Pattern 3: Worker Pool Pattern

#![allow(unused)]
fn main() {
let pool = WorkerPool::new(8);

for chunk in chunks {
    pool.execute(async move {
        process_chunk(chunk).await
    }).await?;
}
}

Pattern 4: Rate Limiting

#![allow(unused)]
fn main() {
use tokio::time::{interval, Duration};

let mut interval = interval(Duration::from_millis(100));

for chunk in chunks {
    interval.tick().await;  // Rate limit: 10 chunks/sec
    process_chunk(chunk).await?;
}
}

Performance Considerations

Tokio Task Overhead

| Operation      | Cost        | Notes                   |
|----------------|-------------|-------------------------|
| Spawn task     | ~1-2 μs     | Very lightweight        |
| Context switch | ~100 ns     | Work-stealing scheduler |
| Mutex lock     | ~50 ns      | Uncontended case        |
| Channel send   | ~100-200 ns | Depends on channel type |

Choosing Concurrency Level

#![allow(unused)]
fn main() {
fn optimal_concurrency(
    file_size: u64,
    chunk_size: ChunkSize,
    available_memory: usize,
) -> usize {
    let num_chunks = (file_size / chunk_size.bytes() as u64) as usize;
    let memory_limit = available_memory / chunk_size.bytes();
    let cpu_limit = num_cpus::get() * 2;

    num_chunks.min(memory_limit).min(cpu_limit)
}
}

Avoiding Contention

#![allow(unused)]
fn main() {
// ❌ Bad: High contention
let counter = Arc::new(Mutex::new(0));
for _ in 0..1000 {
    let c = counter.clone();
    tokio::spawn(async move {
        *c.lock().await += 1;  // Lock contention!
    });
}

// ✅ Good: Reduce contention
let counter = Arc::new(AtomicUsize::new(0));
for _ in 0..1000 {
    let c = counter.clone();
    tokio::spawn(async move {
        c.fetch_add(1, Ordering::Relaxed);  // Lock-free!
    });
}
}

Best Practices

1. Use Async for I/O, Sync for CPU

#![allow(unused)]
fn main() {
// ✅ Good: Async I/O
async fn read_file(path: &Path) -> Result<Vec<u8>, Error> {
    tokio::fs::read(path).await
}

// ✅ Good: Sync CPU-bound
fn compress_data(data: &[u8]) -> Result<Vec<u8>, Error> {
    brotli::compress(data)  // Sync, CPU-bound
}

// ❌ Bad: Async for CPU-bound
async fn compress_data_async(data: &[u8]) -> Result<Vec<u8>, Error> {
    // Unnecessary async overhead
    brotli::compress(data)
}
}

2. Spawn Blocking for Sync Code

#![allow(unused)]
fn main() {
// ✅ Good: Spawn blocking task
async fn process_chunk(chunk: FileChunk) -> Result<FileChunk, Error> {
    tokio::task::spawn_blocking(move || {
        // CPU-bound compression in blocking thread
        compress_sync(chunk)
    }).await?
}
}

3. Limit Concurrent Tasks

#![allow(unused)]
fn main() {
// ✅ Good: Bounded parallelism
let semaphore = Arc::new(Semaphore::new(max_concurrent));
for chunk in chunks {
    let permit = semaphore.clone();
    tokio::spawn(async move {
        let _guard = permit.acquire().await.unwrap();
        process_chunk(chunk).await
    });
}

// ❌ Bad: Unbounded parallelism
for chunk in chunks {
    tokio::spawn(async move {
        process_chunk(chunk).await  // May spawn thousands of tasks!
    });
}
}

4. Use Channels for Communication

#![allow(unused)]
fn main() {
// ✅ Good: Channel communication
let (tx, mut rx) = mpsc::channel(100);

tokio::spawn(async move {
    while let Some(chunk) = rx.recv().await {
        process_chunk(chunk).await;
    }
});

tx.send(chunk).await?;
}

5. Handle Errors Properly

#![allow(unused)]
fn main() {
// ✅ Good: Proper error handling
let results: Result<Vec<_>, _> = try_join_all(futures).await;
match results {
    Ok(chunks) => { /* success */ },
    Err(e) => {
        error!("Processing failed: {}", e);
        // Cleanup resources
        return Err(e);
    }
}
}

Troubleshooting

Issue 1: Too Many Tokio Tasks

Symptom:

thread 'tokio-runtime-worker' stack overflow

Solutions:

#![allow(unused)]
fn main() {
// 1. Limit concurrent tasks
let semaphore = Arc::new(Semaphore::new(100));

// 2. Use buffer_unordered
stream.buffer_unordered(10).collect().await

// 3. Increase stack size
tokio::runtime::Builder::new_multi_thread()
    .thread_stack_size(4 * 1024 * 1024)  // 4 MB
    .build()?;
}

Issue 2: Mutex Deadlock

Symptom: Tasks hang indefinitely.

Solutions:

#![allow(unused)]
fn main() {
use std::ptr;
use tokio::sync::Mutex;
use tokio::time::Duration;

// 1. Always acquire locks in the same order (here: by address)
async fn transfer(from: &Mutex<u64>, to: &Mutex<u64>, amount: u64) {
    let (first, second) = if ptr::eq(from, to) {
        panic!("Same account");
    } else if (from as *const _ as usize) < (to as *const _ as usize) {
        (from, to)
    } else {
        (to, from)
    };

    let mut a = first.lock().await;
    let mut b = second.lock().await;
    // Transfer logic
}

// 2. Bound lock acquisition with a timeout so a potential deadlock surfaces as an error
let guard = tokio::time::timeout(Duration::from_secs(5), mutex.lock()).await?;
}

Issue 3: Channel Backpressure

Symptom:

Producer overwhelms consumer

Solutions:

#![allow(unused)]
fn main() {
// 1. Bounded channel
let (tx, rx) = mpsc::channel::<FileChunk>(100);  // Max 100 in flight

// 2. Apply backpressure
match tx.try_send(chunk) {
    Ok(()) => { /* sent */ },
    Err(TrySendError::Full(_)) => {
        // Wait and retry
        tokio::time::sleep(Duration::from_millis(10)).await;
    },
    Err(e) => return Err(e.into()),
}
}

Next Steps

After understanding the concurrency model, explore specific implementations:

  1. Thread Pooling: Worker pool implementation and optimization
  2. Resource Management: Memory and resource tracking

Summary

Key Takeaways:

  1. Async/Await provides non-blocking concurrency for I/O operations
  2. Tokio Runtime uses work-stealing for efficient task scheduling
  3. Parallel Processing enables concurrent chunk processing for throughput
  4. Concurrency Primitives (Mutex, RwLock, Semaphore) enable safe sharing
  5. Worker Pools manage bounded concurrent task execution
  6. Resource Management tracks and limits resource usage
  7. Patterns (fan-out/fan-in, pipeline, worker pool) structure concurrent code

Architecture File References:

  • Pipeline Service: pipeline/src/application/services/pipeline_service.rs:189
  • File Processor: pipeline/src/application/services/file_processor_service.rs:1

Thread Pooling

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

This chapter explores the pipeline's thread pool architecture, configuration strategies, and tuning guidelines for optimal performance across different workload types.

Overview

The pipeline employs a dual thread pool architecture that separates async I/O operations from sync CPU-bound computations:

  • Tokio Runtime: Handles async I/O operations (file reads/writes, database queries, network)
  • Rayon Thread Pools: Handles parallel CPU-bound operations (compression, encryption, hashing)

This separation prevents CPU-intensive work from blocking async tasks and ensures optimal resource utilization.

Thread Pool Architecture

Dual Pool Design

┌─────────────────────────────────────────────────────────────┐
│                    Async Layer (Tokio)                      │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Multi-threaded Tokio Runtime                        │   │
│  │  - Work-stealing scheduler                           │   │
│  │  - Async I/O operations                              │   │
│  │  - Task coordination                                 │   │
│  │  - Event loop management                             │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                                │
                                │ spawn_blocking()
                                ▼
┌─────────────────────────────────────────────────────────────┐
│                  Sync Layer (Rayon Pools)                   │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  CPU-Bound Pool (rayon-cpu-{N})                      │   │
│  │  - Compression operations                            │   │
│  │  - Encryption/decryption                             │   │
│  │  - Checksum calculation                              │   │
│  │  - Complex transformations                           │   │
│  └──────────────────────────────────────────────────────┘   │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Mixed Workload Pool (rayon-mixed-{N})               │   │
│  │  - File processing with transformations              │   │
│  │  - Database operations with calculations             │   │
│  │  - Network operations with data processing           │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

Thread Naming Convention

All worker threads are named for debugging and profiling:

  • Tokio threads: Default Tokio naming (tokio-runtime-worker-{N})
  • Rayon CPU pool: rayon-cpu-{N} where N is the thread index
  • Rayon mixed pool: rayon-mixed-{N} where N is the thread index

This naming convention enables:

  • Easy identification in profilers (e.g., perf, flamegraph)
  • Clear thread attribution in stack traces
  • Simplified debugging of concurrency issues
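
A short sketch of the naming mechanism itself: rayon's ThreadPoolBuilder::thread_name assigns the rayon-cpu-{N} style names, which then show up in backtraces and profiler output.

fn main() {
    use rayon::prelude::*;

    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(4)
        .thread_name(|i| format!("rayon-cpu-{}", i))
        .build()
        .expect("failed to build pool");

    pool.install(|| {
        (0..4).into_par_iter().for_each(|_| {
            // Prints names like Some("rayon-cpu-0"); the same names appear in perf/flamegraph
            println!("running on {:?}", std::thread::current().name());
        });
    });
}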

Tokio Runtime Configuration

Default Configuration

The pipeline uses Tokio's default multi-threaded runtime:

#[tokio::main]
async fn main() -> Result<()> {
    // Tokio runtime initialized with default settings
    run_pipeline().await
}

Default Tokio Settings:

  • Worker threads: Equal to number of CPU cores (via std::thread::available_parallelism())
  • Thread stack size: 2 MiB per thread
  • Work-stealing: Enabled (automatic task balancing)
  • I/O driver: Enabled (async file I/O, network)
  • Time driver: Enabled (async timers, delays)

Custom Runtime Configuration

For advanced use cases, you can configure the Tokio runtime explicitly:

#![allow(unused)]
fn main() {
use tokio::runtime::{Builder, Runtime};

fn create_custom_runtime() -> Runtime {
    Builder::new_multi_thread()
        .worker_threads(8)                   // Override thread count
        .thread_stack_size(3 * 1024 * 1024)  // 3 MiB stack
        .thread_name("pipeline-async")
        .enable_all()                        // Enable I/O and time drivers
        .build()
        .expect("Failed to create Tokio runtime")
}
}

Configuration Parameters:

  • worker_threads(usize): Number of worker threads (defaults to CPU cores)
  • thread_stack_size(usize): Stack size per thread in bytes (default: 2 MiB)
  • thread_name(impl Into<String>): Thread name prefix for debugging
  • enable_all(): Enable both I/O and time drivers

Rayon Thread Pool Configuration

Global Pool Manager

The pipeline uses a global RAYON_POOLS manager with two specialized pools:

#![allow(unused)]
fn main() {
use adaptive_pipeline::infrastructure::config::rayon_config::RAYON_POOLS;

// Access CPU-bound pool
let cpu_pool = RAYON_POOLS.cpu_bound_pool();

// Access mixed workload pool
let mixed_pool = RAYON_POOLS.mixed_workload_pool();
}

Implementation (pipeline/src/infrastructure/config/rayon_config.rs):

#![allow(unused)]
fn main() {
pub struct RayonPoolManager {
    cpu_bound_pool: Arc<rayon::ThreadPool>,
    mixed_workload_pool: Arc<rayon::ThreadPool>,
}

impl RayonPoolManager {
    pub fn new() -> Result<Self, PipelineError> {
        let available_cores = std::thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(WorkerCount::DEFAULT_WORKERS);

        // CPU-bound pool: Use optimal worker count for CPU-intensive ops
        let cpu_worker_count = WorkerCount::optimal_for_processing_type(
            100 * 1024 * 1024,  // Assume 100MB default file size
            available_cores,
            true,               // CPU-intensive = true
        );

        let cpu_bound_pool = rayon::ThreadPoolBuilder::new()
            .num_threads(cpu_worker_count.count())
            .thread_name(|i| format!("rayon-cpu-{}", i))
            .build()?;

        // Mixed workload pool: Use fewer threads to avoid contention
        let mixed_worker_count = (available_cores / 2).max(WorkerCount::MIN_WORKERS);

        let mixed_workload_pool = rayon::ThreadPoolBuilder::new()
            .num_threads(mixed_worker_count)
            .thread_name(|i| format!("rayon-mixed-{}", i))
            .build()?;

        Ok(Self {
            cpu_bound_pool: Arc::new(cpu_bound_pool),
            mixed_workload_pool: Arc::new(mixed_workload_pool),
        })
    }
}

// Global static instance
pub static RAYON_POOLS: std::sync::LazyLock<RayonPoolManager> =
    std::sync::LazyLock::new(|| RayonPoolManager::new()
        .expect("Failed to initialize Rayon pools"));
}

CPU-Bound Pool

Purpose: Maximize throughput for CPU-intensive operations

Workload Types:

  • Data compression (Brotli, LZ4, Zstandard)
  • Data encryption/decryption (AES-256-GCM, ChaCha20-Poly1305)
  • Checksum calculation (SHA-256, BLAKE3)
  • Complex data transformations

Thread Count Strategy:

#![allow(unused)]
fn main() {
// Uses WorkerCount::optimal_for_processing_type()
// For CPU-intensive work, allocates up to available_cores threads
let cpu_worker_count = WorkerCount::optimal_for_processing_type(
    file_size,
    available_cores,
    true  // CPU-intensive
);
}

Typical Thread Counts (on 8-core system):

  • Small files (< 50 MB): 6-14 workers (aggressive parallelism)
  • Medium files (50-500 MB): 5-12 workers (balanced approach)
  • Large files (> 500 MB): 8-12 workers (moderate parallelism)
  • Huge files (> 2 GB): 3-6 workers (conservative, avoid overhead)

Mixed Workload Pool

Purpose: Handle operations with both CPU and I/O components

Workload Types:

  • File processing with transformations
  • Database operations with calculations
  • Network operations with data processing

Thread Count Strategy:

#![allow(unused)]
fn main() {
// Uses half the cores to avoid contention with I/O operations
let mixed_worker_count = (available_cores / 2).max(WorkerCount::MIN_WORKERS);
}

Typical Thread Counts (on 8-core system):

  • 4 threads (half of 8 cores)
  • Minimum: 1 thread (WorkerCount::MIN_WORKERS)
  • Maximum: 16 threads (half of MAX_WORKERS = 32)

The spawn_blocking Pattern

Purpose

tokio::task::spawn_blocking bridges async and sync code by running synchronous CPU-bound operations on a dedicated blocking thread pool without blocking the Tokio async runtime.

When to Use spawn_blocking

Always use for:

  • ✅ CPU-intensive operations (compression, encryption, hashing)
  • ✅ Blocking I/O that can't be made async
  • ✅ Long-running computations (> 10-100 μs)
  • ✅ Sync library calls that may block

Never use for:

  • ❌ Quick operations (< 10 μs)
  • ❌ Already async operations
  • ❌ Pure data transformations (use inline instead)

Usage Pattern

Basic spawn_blocking:

#![allow(unused)]
fn main() {
use tokio::task;

async fn compress_chunk_async(
    chunk: FileChunk,
    config: &CompressionConfig,
) -> Result<FileChunk, PipelineError> {
    let config = config.clone();

    // Move CPU-bound work to blocking thread pool
    task::spawn_blocking(move || {
        // This runs on a dedicated blocking thread
        compress_chunk_sync(chunk, &config)
    })
    .await
    .map_err(|e| PipelineError::InternalError(format!("Task join error: {}", e)))?
}

fn compress_chunk_sync(
    chunk: FileChunk,
    config: &CompressionConfig,
) -> Result<FileChunk, PipelineError> {
    // Expensive CPU-bound operation
    // This will NOT block the Tokio async runtime
    let compressed_data = brotli::compress(&chunk.data, config.level)?;
    Ok(FileChunk::new(compressed_data))
}
}

Rayon inside spawn_blocking (recommended for batch parallelism):

#![allow(unused)]
fn main() {
async fn compress_chunks_parallel(
    chunks: Vec<FileChunk>,
    config: &CompressionConfig,
) -> Result<Vec<FileChunk>, PipelineError> {
    use rayon::prelude::*;

    let config = config.clone();

    // Entire Rayon batch runs on blocking thread pool
    tokio::task::spawn_blocking(move || {
        // Use CPU-bound pool for compression
        RAYON_POOLS.cpu_bound_pool().install(|| {
            chunks
                .into_par_iter()
                .map(|chunk| compress_chunk_sync(chunk, &config))
                .collect::<Result<Vec<_>, _>>()
        })
    })
    .await
    .map_err(|e| PipelineError::InternalError(format!("Task join error: {}", e)))?
}
}

Key Points:

  • spawn_blocking returns a JoinHandle<T>
  • Must .await the handle to get the result
  • Errors are wrapped in JoinError (handle with map_err)
  • Data must be Send + 'static (use clone() or move)

spawn_blocking Thread Pool

Tokio maintains a separate blocking thread pool for spawn_blocking:

  • Default size: 512 threads (very large to handle many blocking operations)
  • Thread creation: On-demand (lazy initialization)
  • Thread lifetime: Threads terminate after being idle for 10 seconds
  • Stack size: 2 MiB per thread (same as Tokio async threads)

Why 512 threads?

  • Blocking operations may wait indefinitely (file I/O, database queries)
  • Need many threads to prevent starvation
  • Threads are cheap (created on-demand, destroyed when idle)
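
If the defaults don't suit a deployment, the blocking pool can be tuned on a custom runtime. A sketch (not the project's actual configuration) using Tokio's builder:

#![allow(unused)]
fn main() {
use std::time::Duration;
use tokio::runtime::{Builder, Runtime};

fn build_runtime() -> std::io::Result<Runtime> {
    Builder::new_multi_thread()
        .max_blocking_threads(64)                   // cap spawn_blocking threads (default: 512)
        .thread_keep_alive(Duration::from_secs(10)) // idle blocking threads exit after 10 s
        .enable_all()
        .build()
}
}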

Worker Count Optimization

WorkerCount Value Object

The pipeline uses a sophisticated WorkerCount value object for adaptive thread allocation:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::value_objects::WorkerCount;

// Optimal for file size (empirically validated)
let workers = WorkerCount::optimal_for_file_size(file_size);

// Optimal for file size + system resources
let workers = WorkerCount::optimal_for_file_and_system(file_size, available_cores);

// Optimal for processing type (CPU-intensive vs I/O-intensive)
let workers = WorkerCount::optimal_for_processing_type(
    file_size,
    available_cores,
    is_cpu_intensive,
);
}

Optimization Strategies

Empirically Validated (based on benchmarks):

| File Size           | Worker Count | Strategy               | Performance Gain |
|---------------------|--------------|------------------------|------------------|
| < 1 MB (tiny)       | 1-2          | Minimal parallelism    | N/A              |
| 1-50 MB (small)     | 6-14         | Aggressive parallelism | +102% (5 MB)     |
| 50-500 MB (medium)  | 5-12         | Balanced approach      | +70% (50 MB)     |
| 500 MB-2 GB (large) | 8-12         | Moderate parallelism   | Balanced         |
| > 2 GB (huge)       | 3-6          | Conservative strategy  | +76% (2 GB)      |

Why these strategies?

  • Small files: High parallelism amortizes task overhead quickly
  • Medium files: Balanced approach for consistent performance
  • Large files: Moderate parallelism to manage memory and coordination overhead
  • Huge files: Conservative to avoid excessive memory pressure and thread contention
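
A hypothetical helper that mirrors the table above; the real logic lives in WorkerCount and may use different thresholds:

#![allow(unused)]
fn main() {
fn workers_for_file_size(file_size: u64, available_cores: usize) -> usize {
    const MB: u64 = 1024 * 1024;
    const GB: u64 = 1024 * MB;

    let suggested = match file_size {
        s if s < MB => 2,        // tiny: minimal parallelism
        s if s < 50 * MB => 14,  // small: aggressive parallelism
        s if s < 500 * MB => 12, // medium: balanced approach
        s if s < 2 * GB => 10,   // large: moderate parallelism
        _ => 6,                  // huge: conservative
    };

    // Never exceed a small multiple of the physical cores
    suggested.min(available_cores.max(1) * 2)
}
}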

Configuration Options

Via CLI:

#![allow(unused)]
fn main() {
use bootstrap::config::AppConfig;

let config = AppConfig::builder()
    .app_name("pipeline")
    .worker_threads(8)  // Override automatic detection
    .build();
}

Via Environment Variable:

export ADAPIPE_WORKER_COUNT=8
./pipeline process input.txt

Programmatic:

#![allow(unused)]
fn main() {
let worker_count = config.worker_threads()
    .map(WorkerCount::new)
    .unwrap_or_else(|| WorkerCount::default_for_system());
}

Tuning Guidelines

1. Start with Defaults

Recommendation: Use default settings for most workloads.

#![allow(unused)]
fn main() {
// Let WorkerCount choose optimal values
let workers = WorkerCount::optimal_for_file_size(file_size);
}

Why?

  • Empirically validated strategies
  • Adapts to system resources
  • Handles edge cases (tiny files, huge files)

2. CPU-Intensive Workloads

Symptoms:

  • High CPU utilization (> 80%)
  • Low I/O wait time
  • Operations: Compression, encryption, hashing

Tuning:

#![allow(unused)]
fn main() {
// Use CPU-bound pool
RAYON_POOLS.cpu_bound_pool().install(|| {
    // Parallel CPU-intensive work
});

// Or increase worker count
let workers = WorkerCount::new(available_cores);
}

Guidelines:

  • Match worker count to CPU cores (1x to 1.5x)
  • Avoid excessive oversubscription (> 2x cores)
  • Monitor context switching with perf stat

3. I/O-Intensive Workloads

Symptoms:

  • Low CPU utilization (< 50%)
  • High I/O wait time
  • Operations: File reads, database queries, network

Tuning:

#![allow(unused)]
fn main() {
// Use mixed workload pool or reduce workers
let workers = WorkerCount::optimal_for_processing_type(
    file_size,
    available_cores,
    false,  // Not CPU-intensive
);
}

Guidelines:

  • Use fewer workers (0.5x to 1x cores)
  • Rely on async I/O instead of parallelism
  • Consider increasing I/O buffer sizes

4. Memory-Constrained Systems

Symptoms:

  • High memory usage
  • Swapping or OOM errors
  • Large files (> 1 GB)

Tuning:

#![allow(unused)]
fn main() {
// Reduce worker count to limit memory
let workers = WorkerCount::new(3);  // Conservative

// Or increase chunk size to reduce memory overhead
let chunk_size = ChunkSize::new(64 * 1024 * 1024);  // 64 MB chunks
}

Guidelines:

  • Limit workers to 3-6 for huge files (> 2 GB)
  • Increase chunk size to reduce parallel memory allocations
  • Monitor RSS memory with htop or ps
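
Peak chunk memory is roughly workers × chunk size, so a simple clamp keeps the pool inside a memory budget. A sketch with hypothetical parameter names:

#![allow(unused)]
fn main() {
fn clamp_workers_to_memory(
    desired_workers: usize,
    chunk_size_bytes: usize,
    memory_budget_bytes: usize,
) -> usize {
    // At most this many chunks can be in flight without exceeding the budget,
    // e.g. an 8 GB budget with 64 MB chunks allows at most 128 in-flight chunks
    let by_memory = (memory_budget_bytes / chunk_size_bytes.max(1)).max(1);
    desired_workers.min(by_memory)
}
}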

5. Profiling and Measurement

Tools:

  • perf: CPU profiling, context switches, cache misses
  • flamegraph: Visual CPU time breakdown
  • htop: Real-time CPU and memory usage
  • tokio-console: Async task monitoring

Example: perf stat:

perf stat -e cycles,instructions,context-switches,cache-misses \
    ./pipeline process large-file.bin

Metrics to monitor:

  • CPU utilization: Should be 70-95% for CPU-bound work
  • Context switches: High (> 10k/sec) indicates oversubscription
  • Cache misses: High indicates memory contention
  • Task throughput: Measure chunks processed per second

Common Patterns

Pattern 1: Fan-Out CPU Work

Distribute CPU-bound work across Rayon pool:

#![allow(unused)]
fn main() {
use rayon::prelude::*;

async fn process_chunks_parallel(
    chunks: Vec<FileChunk>,
) -> Result<Vec<FileChunk>, PipelineError> {
    tokio::task::spawn_blocking(move || {
        RAYON_POOLS.cpu_bound_pool().install(|| {
            chunks
                .into_par_iter()
                .map(|chunk| {
                    // CPU-intensive per-chunk work
                    compress_and_encrypt(chunk)
                })
                .collect::<Result<Vec<_>, _>>()
        })
    })
    .await??
}
}

Pattern 2: Bounded Parallelism

Limit concurrent CPU-bound tasks with semaphore:

#![allow(unused)]
fn main() {
use tokio::sync::Semaphore;

async fn process_with_limit(
    chunks: Vec<FileChunk>,
    max_parallel: usize,
) -> Result<Vec<FileChunk>, PipelineError> {
    let semaphore = Arc::new(Semaphore::new(max_parallel));

    let futures = chunks.into_iter().map(|chunk| {
        let permit = semaphore.clone();
        async move {
            let _guard = permit.acquire().await.unwrap();

            // Run CPU-bound work on blocking pool
            tokio::task::spawn_blocking(move || {
                compress_chunk_sync(chunk)
            })
            .await?
        }
    });

    futures::future::try_join_all(futures).await
}
}

Pattern 3: Mixed Async/Sync Pipeline

Combine async I/O with sync CPU-bound stages:

#![allow(unused)]
fn main() {
async fn process_file_pipeline(path: &Path) -> Result<(), PipelineError> {
    // Stage 1: Async file read
    let chunks = read_file_chunks(path).await?;

    // Stage 2: Sync CPU-bound processing (spawn_blocking + Rayon)
    let processed = tokio::task::spawn_blocking(move || {
        RAYON_POOLS.cpu_bound_pool().install(|| {
            chunks.into_par_iter()
                .map(|chunk| compress_and_encrypt(chunk))
                .collect::<Result<Vec<_>, _>>()
        })
    })
    .await??;

    // Stage 3: Async file write
    write_chunks(path, processed).await?;

    Ok(())
}
}

Performance Characteristics

Thread Creation Overhead

| Operation                    | Time      | Notes                         |
|------------------------------|-----------|-------------------------------|
| Tokio runtime initialization | ~1-5 ms   | One-time cost at startup      |
| Rayon pool creation          | ~500 μs   | One-time cost (global static) |
| spawn_blocking task          | ~10-50 μs | Per-task overhead             |
| Rayon parallel iteration     | ~5-20 μs  | Per-iteration overhead        |
| Thread context switch        | ~1-5 μs   | Depends on workload and OS    |

Guidelines:

  • Amortize overhead with work units > 100 μs
  • Batch small operations to reduce per-task overhead
  • Avoid spawning tasks for trivial work (< 10 μs)
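
One way to amortize the per-task cost is to hand a whole batch to a single spawn_blocking call. A sketch in which process_small_chunk is a hypothetical cheap transformation:

#![allow(unused)]
fn main() {
fn process_small_chunk(data: Vec<u8>) -> Vec<u8> {
    // placeholder for a cheap per-chunk transformation
    data
}

async fn process_in_batches(chunks: Vec<Vec<u8>>, batch_size: usize) -> Vec<Vec<u8>> {
    let mut out = Vec::with_capacity(chunks.len());
    for batch in chunks.chunks(batch_size).map(|b| b.to_vec()) {
        // One spawn_blocking per batch, not per chunk: the ~10-50 μs overhead is paid once per batch
        let processed = tokio::task::spawn_blocking(move || {
            batch.into_iter().map(process_small_chunk).collect::<Vec<_>>()
        })
        .await
        .expect("blocking task panicked");
        out.extend(processed);
    }
    out
}
}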

Scalability

Linear Scaling (ideal):

  • CPU-bound operations with independent chunks
  • Small files (< 50 MB) with 6-14 workers
  • Minimal memory contention

Sub-Linear Scaling (common):

  • Large files (> 500 MB) due to memory bandwidth limits
  • Mixed workloads with I/O contention
  • High worker counts (> 12) with coordination overhead

Performance Cliff (avoid):

  • Excessive worker count (> 2x CPU cores)
  • Memory pressure causing swapping
  • Lock contention in shared data structures

Best Practices

1. Use Appropriate Thread Pool

#![allow(unused)]
fn main() {
// ✅ Good: CPU-intensive work on CPU-bound pool
RAYON_POOLS.cpu_bound_pool().install(|| {
    chunks.par_iter().map(|c| compress(c)).collect()
});

// ❌ Bad: Using wrong pool or no pool
chunks.iter().map(|c| compress(c)).collect()  // Sequential!
}

2. Wrap Rayon with spawn_blocking

#![allow(unused)]
fn main() {
// ✅ Good: Rayon work inside spawn_blocking
tokio::task::spawn_blocking(move || {
    RAYON_POOLS.cpu_bound_pool().install(|| {
        // Parallel work
    })
})
.await?

// ❌ Bad: Rayon work directly in async context
RAYON_POOLS.cpu_bound_pool().install(|| {
    // This blocks the async runtime!
})
}

3. Let WorkerCount Optimize

#![allow(unused)]
fn main() {
// ✅ Good: Use empirically validated strategies
let workers = WorkerCount::optimal_for_file_size(file_size);

// ❌ Bad: Arbitrary fixed count
let workers = 8;  // May be too many or too few!
}

4. Monitor and Measure

#![allow(unused)]
fn main() {
// ✅ Good: Measure actual performance
let start = Instant::now();
let chunk_count = chunks.len();
let result = process_chunks(chunks).await?;
let duration = start.elapsed();
info!("Processed {} chunks in {:?}", chunk_count, duration);

// ❌ Bad: Assume defaults are optimal without measurement
}

5. Avoid Oversubscription

#![allow(unused)]
fn main() {
// ✅ Good: Bounded parallelism based on cores
let max_workers = available_cores.min(WorkerCount::MAX_WORKERS);

// ❌ Bad: Unbounded parallelism
let workers = chunks.len();  // Could be thousands!
}

Troubleshooting

Issue 1: High Context Switching

Symptoms:

  • perf stat shows > 10k context switches/sec
  • High CPU usage but low throughput

Causes:

  • Too many worker threads (> 2x cores)
  • Rayon pool size exceeds optimal

Solutions:

#![allow(unused)]
fn main() {
// Reduce Rayon pool size
let workers = WorkerCount::new(available_cores);  // Not 2x

// Or use sequential processing for small workloads
if chunks.len() < 10 {
    chunks.into_iter().map(process).collect()
} else {
    chunks.into_par_iter().map(process).collect()
}
}

Issue 2: spawn_blocking Blocking Async Runtime

Symptoms:

  • Async tasks become slow
  • Other async operations stall

Causes:

  • Long-running CPU work directly in async fn (no spawn_blocking)
  • Blocking I/O in async context

Solutions:

#![allow(unused)]
fn main() {
// ✅ Good: Use spawn_blocking for CPU work
async fn process_chunk(chunk: FileChunk) -> Result<FileChunk> {
    tokio::task::spawn_blocking(move || {
        // CPU-intensive work here
        compress_chunk_sync(chunk)
    })
    .await?
}

// ❌ Bad: CPU work blocking async runtime
async fn process_chunk(chunk: FileChunk) -> Result<FileChunk> {
    compress_chunk_sync(chunk)  // Blocks entire runtime!
}
}

Issue 3: Memory Pressure with Many Workers

Symptoms:

  • High memory usage (> 80% RAM)
  • Swapping or OOM errors

Causes:

  • Too many concurrent chunks allocated
  • Large chunk size × high worker count

Solutions:

#![allow(unused)]
fn main() {
// Reduce worker count for large files
let workers = if file_size > 2 * GB {
    WorkerCount::new(3)  // Conservative for huge files
} else {
    WorkerCount::optimal_for_file_size(file_size)
};

// Or reduce chunk size
let chunk_size = ChunkSize::new(16 * MB);  // Smaller chunks
}

Summary

The pipeline's dual thread pool architecture provides:

  1. Tokio Runtime: Async I/O operations with work-stealing scheduler
  2. Rayon Pools: Parallel CPU-bound work with specialized pools
  3. spawn_blocking: Bridge between async and sync without blocking
  4. WorkerCount: Empirically validated thread allocation strategies
  5. Tuning: Guidelines for CPU-intensive, I/O-intensive, and memory-constrained workloads

Key Takeaways:

  • Use CPU-bound pool for compression, encryption, hashing
  • Wrap Rayon work with spawn_blocking in async contexts
  • Let WorkerCount optimize based on file size and system resources
  • Monitor performance with perf, flamegraph, and metrics
  • Tune based on actual measurements, not assumptions

Resource Management

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

This chapter explores the pipeline's resource management system, including CPU and I/O token governance, memory tracking, and resource optimization strategies for different workload types.

Overview

The pipeline employs a two-level resource governance architecture that prevents system oversubscription when processing multiple files concurrently:

  1. Global Resource Manager: Caps total system resources (CPU tokens, I/O tokens, memory)
  2. Local Resource Limits: Caps per-file concurrency (semaphores within each file's processing)

This two-level approach ensures optimal resource utilization without overwhelming the system.

Why Resource Governance?

Problem without limits:

10 files × 8 workers/file = 80 concurrent tasks on an 8-core machine
Result: CPU oversubscription, cache thrashing, poor throughput

Solution with resource governance:

Global CPU tokens: 7 (cores - 1)
Maximum 7 concurrent CPU-intensive tasks across all files
Result: Optimal CPU utilization, consistent performance

Resource Management Architecture

┌─────────────────────────────────────────────────────────────┐
│              Global Resource Manager                        │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  CPU Tokens (Semaphore)                              │   │
│  │  - Default: cores - 1                                │   │
│  │  - Purpose: Prevent CPU oversubscription             │   │
│  │  - Used by: Rayon workers, CPU-bound operations      │   │
│  └──────────────────────────────────────────────────────┘   │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  I/O Tokens (Semaphore)                              │   │
│  │  - Default: Device-specific (NVMe:24, SSD:12, HDD:4) │   │
│  │  - Purpose: Prevent I/O queue overrun                │   │
│  │  - Used by: File reads/writes                        │   │
│  └──────────────────────────────────────────────────────┘   │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Memory Tracking (AtomicUsize)                       │   │
│  │  - Default: 40 GB capacity                           │   │
│  │  - Purpose: Monitor memory pressure (gauge only)     │   │
│  │  - Used by: Memory allocation/deallocation           │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                                │
                                │ acquire_cpu() / acquire_io()
                                ▼
┌─────────────────────────────────────────────────────────────┐
│              File-Level Processing                          │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Per-File Semaphore                                  │   │
│  │  - Local concurrency limit (e.g., 8 workers/file)    │   │
│  │  - Prevents single file from saturating system       │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

CPU Token Management

Overview

CPU tokens limit the total number of concurrent CPU-bound operations across all files being processed. This prevents CPU oversubscription and cache thrashing.

Implementation: Semaphore-based with RAII (Resource Acquisition Is Initialization)

Configuration

Default CPU Token Count:

#![allow(unused)]
fn main() {
// Automatic detection: cores - 1
let available_cores = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
let cpu_token_count = (available_cores - 1).max(1);

// Example on 8-core machine: 7 CPU tokens
}

Why cores - 1?

  • Leave one core for OS, I/O threads, and system tasks
  • Prevents complete CPU saturation
  • Improves overall system responsiveness
  • Reduces context switching overhead

Custom Configuration:

#![allow(unused)]
fn main() {
use adaptive_pipeline::infrastructure::runtime::{init_resource_manager, ResourceConfig};

// Initialize with custom CPU token count
let config = ResourceConfig {
    cpu_tokens: Some(6),  // Override: use 6 CPU workers
    ..Default::default()
};

init_resource_manager(config)?;
}

Usage Pattern

Acquire CPU token before CPU-intensive work:

#![allow(unused)]
fn main() {
use adaptive_pipeline::infrastructure::runtime::RESOURCE_MANAGER;

async fn process_chunk(chunk: FileChunk) -> Result<FileChunk, PipelineError> {
    // 1. Acquire global CPU token (waits if system is saturated)
    let _cpu_permit = RESOURCE_MANAGER.acquire_cpu().await?;

    // 2. Do CPU-intensive work (compression, encryption, hashing)
    tokio::task::spawn_blocking(move || {
        RAYON_POOLS.cpu_bound_pool().install(|| {
            compress_and_encrypt(chunk)
        })
    })
    .await?

    // 3. Permit released automatically when _cpu_permit goes out of scope (RAII)
}
}

Key Points:

  • acquire_cpu() returns a SemaphorePermit guard
  • If all CPU tokens are in use, the call waits (backpressure)
  • Permit is auto-released when dropped (RAII pattern)
  • This creates natural flow control and prevents oversubscription

Backpressure Mechanism

When all CPU tokens are in use:

Timeline:
─────────────────────────────────────────────────────────────
Task 1: [===CPU work===]         (acquired token)
Task 2:   [===CPU work===]       (acquired token)
Task 3:     [===CPU work===]     (acquired token)
...
Task 7:             [===CPU===]  (acquired token - last one!)
Task 8:               ⏳⏳⏳     (waiting for token...)
Task 9:                 ⏳⏳⏳   (waiting for token...)
─────────────────────────────────────────────────────────────

When Task 1 completes → Task 8 acquires the released token
When Task 2 completes → Task 9 acquires the released token

This backpressure prevents overwhelming the CPU with too many concurrent tasks.
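
The waiting behavior in the timeline falls out of Tokio's Semaphore directly. Below is a standalone sketch (not the actual RESOURCE_MANAGER internals): with 7 permits, the 8th and later tasks park on acquire_owned() until an earlier task drops its permit.

#![allow(unused)]
fn main() {
use std::sync::Arc;
use tokio::sync::Semaphore;

async fn run_with_backpressure() {
    // 7 permits ≈ 7 CPU tokens on an 8-core machine.
    let tokens = Arc::new(Semaphore::new(7));

    let handles: Vec<_> = (0..10)
        .map(|task_id| {
            let tokens = Arc::clone(&tokens);
            tokio::spawn(async move {
                // Tasks 8-10 wait here until an earlier task releases a permit.
                let _permit = tokens.acquire_owned().await.unwrap();
                println!("task {task_id} acquired a CPU token");
                // CPU-bound work (via spawn_blocking) would run here.
            })
        })
        .collect();

    for handle in handles {
        handle.await.unwrap();
    }
}
}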

Monitoring CPU Saturation

#![allow(unused)]
fn main() {
use adaptive_pipeline::infrastructure::metrics::CONCURRENCY_METRICS;

// Check CPU saturation percentage
let saturation = CONCURRENCY_METRICS.cpu_saturation_percent();

if saturation > 80.0 {
    println!("CPU-saturated: {}%", saturation);
    println!("Available tokens: {}", RESOURCE_MANAGER.cpu_tokens_available());
    println!("Total tokens: {}", RESOURCE_MANAGER.cpu_tokens_total());
}
}

Saturation Interpretation:

  • 0-50%: Underutilized, could increase worker count
  • 50-80%: Good utilization, healthy balance
  • 80-95%: High utilization, approaching saturation
  • 95-100%: Fully saturated, tasks frequently waiting

I/O Token Management

Overview

I/O tokens limit the total number of concurrent I/O operations to prevent overwhelming the storage device's queue depth.

Why separate from CPU tokens?

  • I/O and CPU have different characteristics
  • Storage devices have specific optimal queue depths
  • Prevents I/O queue saturation independent of CPU usage

Device-Specific Optimization

Different storage devices have different optimal I/O queue depths:

Storage Type | Queue Depth | Rationale
NVMe         | 24          | Multiple parallel channels, low latency
SSD          | 12          | Medium parallelism, good random access
HDD          | 4           | Sequential access preferred, high seek latency
Auto         | 12          | Conservative default (assumes SSD)

Implementation (pipeline/src/infrastructure/runtime/resource_manager.rs):

#![allow(unused)]
fn main() {
pub enum StorageType {
    NVMe,   // 24 tokens
    SSD,    // 12 tokens
    HDD,    // 4 tokens
    Auto,   // 12 tokens (SSD default)
    Custom(usize),
}

fn detect_optimal_io_tokens(storage_type: StorageType) -> usize {
    match storage_type {
        StorageType::NVMe => 24,
        StorageType::SSD => 12,
        StorageType::HDD => 4,
        StorageType::Auto => 12,  // Conservative default
        StorageType::Custom(n) => n,
    }
}
}

Configuration

Custom I/O token count:

#![allow(unused)]
fn main() {
let config = ResourceConfig {
    io_tokens: Some(24),              // Override: NVMe-optimized
    storage_type: StorageType::NVMe,
    ..Default::default()
};

init_resource_manager(config)?;
}

Usage Pattern

Acquire I/O token before file operations:

#![allow(unused)]
fn main() {
use std::io::SeekFrom;
use tokio::io::{AsyncReadExt, AsyncSeekExt};

async fn read_file_chunk(path: &Path, offset: u64, size: usize)
    -> Result<Vec<u8>, PipelineError>
{
    // 1. Acquire global I/O token (waits if I/O queue is full)
    let _io_permit = RESOURCE_MANAGER.acquire_io().await?;

    // 2. Perform I/O operation
    let mut file = tokio::fs::File::open(path).await?;
    file.seek(SeekFrom::Start(offset)).await?;

    let mut buffer = vec![0u8; size];
    file.read_exact(&mut buffer).await?;

    Ok(buffer)

    // 3. Permit auto-released when _io_permit goes out of scope
}
}

Monitoring I/O Saturation

#![allow(unused)]
fn main() {
// Check I/O saturation
let io_saturation = CONCURRENCY_METRICS.io_saturation_percent();

if io_saturation > 80.0 {
    println!("I/O-saturated: {}%", io_saturation);
    println!("Available tokens: {}", RESOURCE_MANAGER.io_tokens_available());
    println!("Consider reducing concurrent I/O or optimizing chunk size");
}
}

Memory Management

Overview

The pipeline uses memory tracking (not enforcement) to monitor memory pressure and provide visibility into resource usage.

Design Philosophy:

  • Phase 1: Monitor memory usage (current implementation)
  • Phase 2: Soft limits with warnings
  • Phase 3: Hard limits with enforcement (future)

Why start with monitoring only?

  • Memory is harder to predict and control than CPU/I/O
  • Avoids complexity in initial implementation
  • Allows gathering real-world usage data before adding limits

Memory Tracking

Atomic counter (AtomicUsize) tracks allocated memory:

#![allow(unused)]
fn main() {
use adaptive_pipeline::infrastructure::runtime::RESOURCE_MANAGER;

// Track memory allocation
let chunk_size = 1024 * 1024;  // 1 MB
RESOURCE_MANAGER.allocate_memory(chunk_size);

// ... use memory ...

// Track memory deallocation
RESOURCE_MANAGER.deallocate_memory(chunk_size);
}

Usage with RAII guard:

#![allow(unused)]
fn main() {
pub struct MemoryGuard {
    size: usize,
}

impl MemoryGuard {
    pub fn new(size: usize) -> Self {
        RESOURCE_MANAGER.allocate_memory(size);
        Self { size }
    }
}

impl Drop for MemoryGuard {
    fn drop(&mut self) {
        RESOURCE_MANAGER.deallocate_memory(self.size);
    }
}

// Usage
let _guard = MemoryGuard::new(chunk_size);
// Memory automatically tracked on allocation and deallocation
}

Configuration

Set memory capacity:

#![allow(unused)]
fn main() {
let config = ResourceConfig {
    memory_limit: Some(40 * 1024 * 1024 * 1024),  // 40 GB capacity
    ..Default::default()
};

init_resource_manager(config)?;
}

Monitoring Memory Usage

#![allow(unused)]
fn main() {
use adaptive_pipeline::infrastructure::metrics::CONCURRENCY_METRICS;

// Check current memory usage
let used_bytes = RESOURCE_MANAGER.memory_used();
let used_mb = CONCURRENCY_METRICS.memory_used_mb();
let utilization = CONCURRENCY_METRICS.memory_utilization_percent();

println!("Memory: {:.2} MB / {} MB ({:.1}%)",
    used_mb,
    RESOURCE_MANAGER.memory_capacity() / (1024 * 1024),
    utilization
);

// Alert on high memory usage
if utilization > 80.0 {
    println!("⚠️  High memory usage: {:.1}%", utilization);
    println!("Consider reducing chunk size or worker count");
}
}

Concurrency Metrics

Overview

The pipeline provides comprehensive metrics for monitoring resource utilization, wait times, and saturation levels.

Metric Types:

  • Gauges: Instant values (e.g., CPU tokens available)
  • Counters: Cumulative values (e.g., total wait time)
  • Histograms: Distribution of values (e.g., P50/P95/P99 wait times)

CPU Metrics

#![allow(unused)]
fn main() {
use adaptive_pipeline::infrastructure::metrics::CONCURRENCY_METRICS;

// Instant metrics (gauges)
let cpu_available = CONCURRENCY_METRICS.cpu_tokens_available();
let cpu_saturation = CONCURRENCY_METRICS.cpu_saturation_percent();

// Wait time percentiles (histograms)
let p50 = CONCURRENCY_METRICS.cpu_wait_p50();  // Median wait time (ms)
let p95 = CONCURRENCY_METRICS.cpu_wait_p95();  // 95th percentile (ms)
let p99 = CONCURRENCY_METRICS.cpu_wait_p99();  // 99th percentile (ms)

println!("CPU Saturation: {:.1}%", cpu_saturation);
println!("Wait times: P50={} ms, P95={} ms, P99={} ms", p50, p95, p99);
}

Why percentiles matter:

Averages hide problems:

  • Average wait: 10 ms (looks fine)
  • P99 wait: 500 ms (users experience terrible latency!)

Histograms show the full distribution and reveal tail latencies.
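
A tiny standalone illustration of the numbers above: 99 waits of 5 ms plus a single 500 ms straggler average out to roughly 10 ms, while the 99th percentile exposes the 500 ms tail.

#![allow(unused)]
fn main() {
// 100 wait-time samples (ms): 99 fast acquisitions plus one 500 ms straggler.
let mut waits_ms: Vec<f64> = vec![5.0; 99];
waits_ms.push(500.0);

let mean = waits_ms.iter().sum::<f64>() / waits_ms.len() as f64;

waits_ms.sort_by(|a, b| a.partial_cmp(b).unwrap());
let p99 = waits_ms[waits_ms.len() * 99 / 100];

println!("mean = {:.1} ms, p99 = {:.0} ms", mean, p99);
// mean ≈ 10 ms looks healthy; p99 = 500 ms reveals the tail latency.
}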

I/O Metrics

#![allow(unused)]
fn main() {
// I/O saturation and wait times
let io_saturation = CONCURRENCY_METRICS.io_saturation_percent();
let io_p95 = CONCURRENCY_METRICS.io_wait_p95();

if io_saturation > 80.0 && io_p95 > 50 {
    println!("⚠️  I/O bottleneck detected:");
    println!("  Saturation: {:.1}%", io_saturation);
    println!("  P95 wait time: {} ms", io_p95);
}
}

Worker Metrics

#![allow(unused)]
fn main() {
// Worker tracking
let active = CONCURRENCY_METRICS.active_workers();
let spawned = CONCURRENCY_METRICS.tasks_spawned();
let completed = CONCURRENCY_METRICS.tasks_completed();

println!("Workers: {} active, {} spawned, {} completed",
    active, spawned, completed);

// Queue depth (backpressure indicator)
let queue_depth = CONCURRENCY_METRICS.cpu_queue_depth();
let queue_max = CONCURRENCY_METRICS.cpu_queue_depth_max();

if queue_depth > 100 {
    println!("⚠️  High queue depth: {} (max: {})", queue_depth, queue_max);
    println!("Workers can't keep up with reader - increase workers or optimize stages");
}
}

Queue Metrics

Queue depth reveals whether workers can keep up with the reader:

  • Depth near 0: Workers are faster than reader (good!)
  • Depth near capacity: Workers are bottleneck (increase workers or optimize stages)
  • Depth at capacity: Reader is blocked (severe backpressure)

#![allow(unused)]
fn main() {
// Queue wait time distribution
let queue_p50 = CONCURRENCY_METRICS.cpu_queue_wait_p50();
let queue_p95 = CONCURRENCY_METRICS.cpu_queue_wait_p95();
let queue_p99 = CONCURRENCY_METRICS.cpu_queue_wait_p99();

println!("Queue wait: P50={} ms, P95={} ms, P99={} ms",
    queue_p50, queue_p95, queue_p99);
}

Resource Configuration

ResourceConfig Structure

#![allow(unused)]
fn main() {
pub struct ResourceConfig {
    /// Number of CPU worker tokens (default: cores - 1)
    pub cpu_tokens: Option<usize>,

    /// Number of I/O tokens (default: device-specific)
    pub io_tokens: Option<usize>,

    /// Storage device type for I/O optimization
    pub storage_type: StorageType,

    /// Soft memory limit in bytes (gauge only, no enforcement)
    pub memory_limit: Option<usize>,
}
}

Initialization Pattern

In main(), before any operations:

use adaptive_pipeline::infrastructure::runtime::{init_resource_manager, ResourceConfig, StorageType};
use anyhow::{anyhow, Result};

#[tokio::main]
async fn main() -> Result<()> {
    // 1. Initialize resource manager with configuration
    let config = ResourceConfig {
        cpu_tokens: Some(6),              // 6 CPU workers
        io_tokens: Some(24),              // NVMe-optimized
        storage_type: StorageType::NVMe,
        memory_limit: Some(40 * 1024 * 1024 * 1024),  // 40 GB
    };

    init_resource_manager(config)
        .map_err(|e| anyhow!("Failed to initialize resources: {}", e))?;

    // 2. Now safe to use RESOURCE_MANAGER anywhere
    run_pipeline().await
}

Global access pattern:

#![allow(unused)]
fn main() {
async fn my_function() -> Result<(), PipelineError> {
    // Access global resource manager
    let _cpu_permit = RESOURCE_MANAGER.acquire_cpu().await?;
    // ... CPU work ...
    Ok(())
}
}

Tuning Guidelines

1. CPU-Bound Workloads

Symptoms:

  • High CPU saturation (> 80%)
  • Low CPU wait times (< 10 ms P95)
  • Heavy compression, encryption, or hashing

Tuning:

#![allow(unused)]
fn main() {
// Increase CPU tokens to match cores (remove safety margin)
let config = ResourceConfig {
    cpu_tokens: Some(available_cores),  // Use all cores
    ..Default::default()
};
}

Trade-offs:

  • ✅ Higher throughput for CPU-bound work
  • ❌ Reduced system responsiveness
  • ❌ Higher context switching overhead

2. I/O-Bound Workloads

Symptoms:

  • High I/O saturation (> 80%)
  • High I/O wait times (> 50 ms P95)
  • Many concurrent file operations

Tuning:

#![allow(unused)]
fn main() {
// Increase I/O tokens for high-throughput storage
let config = ResourceConfig {
    io_tokens: Some(32),              // Higher than default
    storage_type: StorageType::Custom(32),
    ..Default::default()
};
}

Monitoring:

# Check I/O queue depth on Linux
iostat -x 1

# Look for:
# - avgqu-sz (average queue size) - should be < I/O tokens
# - %util (device utilization) - should be 70-90%

3. Memory-Constrained Systems

Symptoms:

  • High memory utilization (> 80%)
  • Swapping or OOM errors
  • Processing large files

Tuning:

#![allow(unused)]
fn main() {
// Reduce chunk size to lower memory usage
let chunk_size = ChunkSize::new(16 * 1024 * 1024);  // 16 MB (smaller)

// Reduce worker count to limit concurrent chunks
let config = ResourceConfig {
    cpu_tokens: Some(3),  // Fewer workers = less memory
    ..Default::default()
};
}

Formula:

Peak Memory ≈ chunk_size × cpu_tokens × files_processed_concurrently

Example:
  chunk_size = 64 MB
  cpu_tokens = 7
  files = 3
  Peak Memory ≈ 64 MB × 7 × 3 = 1.3 GB
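
A hypothetical helper (not part of the pipeline API) that encodes the formula above; the numbers reproduce the worked example.

#![allow(unused)]
fn main() {
// Hypothetical helper encoding the formula above (illustration only).
fn estimate_peak_memory(chunk_size: usize, cpu_tokens: usize, concurrent_files: usize) -> usize {
    chunk_size * cpu_tokens * concurrent_files
}

let peak = estimate_peak_memory(64 * 1024 * 1024, 7, 3);
println!("Estimated peak: {:.2} GB", peak as f64 / (1024.0 * 1024.0 * 1024.0));
// ≈ 1.31 GB, matching the example above
}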

4. Mixed Workloads

Symptoms:

  • Both CPU and I/O saturation
  • Variable chunk processing times
  • Compression + file I/O operations

Tuning:

#![allow(unused)]
fn main() {
// Balanced configuration
let config = ResourceConfig {
    cpu_tokens: Some(available_cores - 1),  // Leave headroom
    io_tokens: Some(12),                    // Moderate I/O concurrency
    storage_type: StorageType::SSD,
    ..Default::default()
};
}

Best practices:

  • Monitor both CPU and I/O saturation
  • Adjust based on bottleneck (CPU vs I/O)
  • Use Rayon's mixed workload pool for hybrid operations

5. Multi-File Processing

Symptoms:

  • Processing many files concurrently
  • High queue depths
  • Resource contention between files

Tuning:

#![allow(unused)]
fn main() {
// Global limits prevent oversubscription
let config = ResourceConfig {
    cpu_tokens: Some(7),   // Total across ALL files
    io_tokens: Some(12),   // Total across ALL files
    ..Default::default()
};

// Per-file semaphores limit individual file's concurrency
async fn process_file(path: &Path) -> Result<()> {
    let file_semaphore = Arc::new(Semaphore::new(4));  // Max 4 workers/file

    for chunk in chunks {
        let _global_cpu = RESOURCE_MANAGER.acquire_cpu().await?;  // Global limit
        let _local = file_semaphore.acquire().await?;             // Local limit

        // Process chunk...
    }

    Ok(())
}
}

Two-level governance:

  • Global: Prevents system oversubscription (7 CPU tokens total)
  • Local: Prevents single file from monopolizing resources (4 workers/file)

Performance Characteristics

Resource Acquisition Overhead

Operation                   | Time      | Notes
acquire_cpu() (fast path)   | ~100 ns   | Token immediately available
acquire_cpu() (slow path)   | ~1-50 ms  | Must wait for token
acquire_io() (fast path)    | ~100 ns   | Token immediately available
allocate_memory() tracking  | ~10 ns    | Atomic increment (Relaxed)
Metrics update              | ~50 ns    | Atomic operations

Guidelines:

  • Fast path is negligible overhead
  • Slow path (waiting) creates backpressure (intentional)
  • Memory tracking is extremely low overhead

Scalability

Linear scaling (ideal):

  • CPU tokens = available cores
  • I/O tokens matched to device queue depth
  • Minimal waiting for resources

Sub-linear scaling (common with oversubscription):

  • Too many CPU tokens (> 2x cores)
  • Excessive context switching
  • Cache thrashing

Performance cliff (avoid):

  • No resource limits
  • Uncontrolled parallelism
  • System thrashing (swapping, CPU saturation)

Best Practices

1. Always Acquire Resources Before Work

#![allow(unused)]
fn main() {
// ✅ Good: Acquire global resource token before work
async fn process_chunk(chunk: FileChunk) -> Result<FileChunk> {
    let _cpu_permit = RESOURCE_MANAGER.acquire_cpu().await?;

    tokio::task::spawn_blocking(move || {
        // CPU-intensive work
    }).await?
}

// ❌ Bad: No resource governance
async fn process_chunk(chunk: FileChunk) -> Result<FileChunk> {
    tokio::task::spawn_blocking(move || {
        // Uncontrolled parallelism!
    }).await?
}
}

2. Use RAII for Automatic Release

#![allow(unused)]
fn main() {
// ✅ Good: RAII guard auto-releases
let _permit = RESOURCE_MANAGER.acquire_cpu().await?;
// Work here...
// Permit released automatically when _permit goes out of scope

// ❌ Bad: Manual release (error-prone)
// Don't do this - no manual release API
}

3. Monitor Saturation Regularly

#![allow(unused)]
fn main() {
// ✅ Good: Periodic monitoring
tokio::spawn(async {
    let mut interval = tokio::time::interval(Duration::from_secs(10));
    loop {
        interval.tick().await;

        let cpu_sat = CONCURRENCY_METRICS.cpu_saturation_percent();
        let io_sat = CONCURRENCY_METRICS.io_saturation_percent();
        let mem_util = CONCURRENCY_METRICS.memory_utilization_percent();

        info!("Resources: CPU={:.1}%, I/O={:.1}%, Mem={:.1}%",
            cpu_sat, io_sat, mem_util);
    }
});
}

4. Configure Based on Workload Type

#![allow(unused)]
fn main() {
// ✅ Good: Workload-specific configuration
let config = if is_cpu_intensive {
    ResourceConfig {
        cpu_tokens: Some(available_cores),
        io_tokens: Some(8),  // Lower I/O concurrency
        ..Default::default()
    }
} else {
    ResourceConfig {
        cpu_tokens: Some(available_cores / 2),
        io_tokens: Some(24),  // Higher I/O concurrency
        ..Default::default()
    }
};
}

5. Track Memory with Guards

#![allow(unused)]
fn main() {
// ✅ Good: RAII memory guard
pub struct ChunkBuffer {
    data: Vec<u8>,
    _memory_guard: MemoryGuard,
}

impl ChunkBuffer {
    pub fn new(size: usize) -> Self {
        let data = vec![0u8; size];
        let _memory_guard = MemoryGuard::new(size);
        Self { data, _memory_guard }
    }
}
// Memory automatically tracked on allocation and freed on drop
}

Troubleshooting

Issue 1: High CPU Wait Times

Symptoms:

  • P95 CPU wait time > 50 ms
  • Low CPU saturation (< 50%)
  • Many tasks waiting for CPU tokens

Causes:

  • Too few CPU tokens configured
  • CPU tokens not matching actual cores

Solutions:

#![allow(unused)]
fn main() {
// Check current configuration
println!("CPU tokens: {}", RESOURCE_MANAGER.cpu_tokens_total());
println!("CPU saturation: {:.1}%", CONCURRENCY_METRICS.cpu_saturation_percent());
println!("CPU wait P95: {} ms", CONCURRENCY_METRICS.cpu_wait_p95());

// Increase CPU tokens
let config = ResourceConfig {
    cpu_tokens: Some(available_cores),  // Increase from cores-1
    ..Default::default()
};
init_resource_manager(config)?;
}

Issue 2: I/O Queue Saturation

Symptoms:

  • I/O saturation > 90%
  • High I/O wait times (> 100 ms P95)
  • iostat shows high avgqu-sz

Causes:

  • Too many I/O tokens for storage device
  • Storage device can't handle queue depth
  • Sequential I/O on HDD with high concurrency

Solutions:

#![allow(unused)]
fn main() {
// Reduce I/O tokens for HDD
let config = ResourceConfig {
    storage_type: StorageType::HDD,  // Sets I/O tokens = 4
    ..Default::default()
};

// Or manually configure
let config = ResourceConfig {
    io_tokens: Some(4),  // Lower concurrency
    ..Default::default()
};
}

Issue 3: Memory Pressure

Symptoms:

  • Memory utilization > 80%
  • Swapping (check vmstat)
  • OOM killer activated

Causes:

  • Too many concurrent chunks allocated
  • Large chunk size × high worker count
  • Memory leaks (not tracked properly)

Solutions:

#![allow(unused)]
fn main() {
// Reduce memory usage
let chunk_size = ChunkSize::new(16 * 1024 * 1024);  // Smaller chunks

let config = ResourceConfig {
    cpu_tokens: Some(3),  // Fewer concurrent chunks
    ..Default::default()
};

// Monitor memory closely
let mem_mb = CONCURRENCY_METRICS.memory_used_mb();
let mem_pct = CONCURRENCY_METRICS.memory_utilization_percent();

if mem_pct > 80.0 {
    error!("⚠️  High memory usage: {:.1}% ({:.2} MB)", mem_pct, mem_mb);
}
}

Issue 4: High Queue Depth

Symptoms:

  • CPU queue depth > 100
  • High queue wait times (> 50 ms P95)
  • Reader blocked waiting for workers

Causes:

  • Workers can't keep up with reader
  • Slow compression/encryption stages
  • Insufficient worker count

Solutions:

#![allow(unused)]
fn main() {
// Check queue metrics
let depth = CONCURRENCY_METRICS.cpu_queue_depth();
let max_depth = CONCURRENCY_METRICS.cpu_queue_depth_max();
let wait_p95 = CONCURRENCY_METRICS.cpu_queue_wait_p95();

println!("Queue depth: {} (max: {})", depth, max_depth);
println!("Queue wait P95: {} ms", wait_p95);

// Increase workers to drain queue faster
let config = ResourceConfig {
    cpu_tokens: Some(available_cores),  // More workers
    ..Default::default()
};

// Or optimize stages (faster compression, fewer stages)
}

Summary

The pipeline's resource management system provides:

  1. CPU Token Management: Prevent CPU oversubscription with semaphore-based limits
  2. I/O Token Management: Device-specific I/O queue depth optimization
  3. Memory Tracking: Monitor memory usage with atomic counters (gauge only)
  4. Concurrency Metrics: Comprehensive observability (gauges, counters, histograms)
  5. Two-Level Governance: Global + local limits prevent system saturation

Key Takeaways:

  • Always acquire resource tokens before CPU/I/O work
  • Use RAII guards for automatic resource release
  • Monitor saturation percentages and wait time percentiles
  • Configure based on workload type (CPU-intensive vs I/O-intensive)
  • Start with defaults, tune based on actual measurements
  • Memory tracking is informational only (no enforcement yet)

Performance Optimization

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

This chapter explores performance optimization strategies for the adaptive pipeline, including benchmarking methodologies, tuning parameters, and common performance bottlenecks with their solutions.

Overview

The pipeline is designed for high-performance file processing with several optimization strategies:

  1. Adaptive Configuration: Automatically selects optimal settings based on file characteristics
  2. Parallel Processing: Leverages multi-core systems with Tokio and Rayon
  3. Resource Management: Prevents oversubscription with CPU/I/O token governance
  4. Memory Efficiency: Streaming processing with bounded memory usage
  5. I/O Optimization: Memory mapping, chunked I/O, and device-specific tuning

Performance Goals:

  • Throughput: 100-500 MB/s for compression/encryption pipelines
  • Latency: < 100 ms overhead for small files (< 10 MB)
  • Memory: Bounded memory usage regardless of file size
  • Scalability: Linear scaling up to available CPU cores

Performance Metrics

Throughput

Definition: Bytes processed per second

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::entities::ProcessingMetrics;

let metrics = ProcessingMetrics::new();
metrics.start();

// ... process data ...
metrics.add_bytes_processed(file_size);

metrics.end();

println!("Throughput: {:.2} MB/s", metrics.throughput_mb_per_second());
}

Typical Values:

  • Uncompressed I/O: 500-2000 MB/s (limited by storage device)
  • LZ4 compression: 300-600 MB/s (fast, low compression)
  • Brotli compression: 50-150 MB/s (slow, high compression)
  • AES-256-GCM encryption: 400-800 MB/s (hardware-accelerated)
  • ChaCha20-Poly1305: 200-400 MB/s (software)

Latency

Definition: Time from start to completion

Components:

  • Setup overhead: File opening, thread pool initialization (1-5 ms)
  • I/O time: Reading/writing chunks (varies by device and size)
  • Processing time: Compression, encryption, hashing (varies by algorithm)
  • Coordination overhead: Task spawning, semaphore acquisition (< 1 ms)

Optimization Strategies:

  • Minimize setup overhead by reusing resources
  • Use memory mapping for large files to reduce I/O time
  • Choose faster algorithms (LZ4 vs Brotli, ChaCha20 vs AES)
  • Batch small operations to amortize coordination overhead
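
The latency components above can be measured directly. The sketch below assumes the read_file_chunks, process_chunks, and write_chunks helpers from the earlier examples:

#![allow(unused)]
fn main() {
use std::time::Instant;

async fn process_with_timing(path: &Path) -> Result<(), PipelineError> {
    let t0 = Instant::now();
    let chunks = read_file_chunks(path).await?;     // I/O: read
    let t1 = Instant::now();
    let processed = process_chunks(chunks).await?;  // processing (compression, encryption, ...)
    let t2 = Instant::now();
    write_chunks(path, processed).await?;           // I/O: write
    let t3 = Instant::now();

    info!(
        "read={:?} process={:?} write={:?} total={:?}",
        t1 - t0,
        t2 - t1,
        t3 - t2,
        t3 - t0
    );
    Ok(())
}
}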

Memory Usage

Formula:

Peak Memory ≈ chunk_size × active_workers × files_concurrent

Example:

chunk_size = 64 MB
active_workers = 7
files_concurrent = 1
Peak Memory ≈ 64 MB × 7 × 1 = 448 MB

Monitoring:

#![allow(unused)]
fn main() {
use adaptive_pipeline::infrastructure::metrics::CONCURRENCY_METRICS;

let mem_mb = CONCURRENCY_METRICS.memory_used_mb();
let mem_pct = CONCURRENCY_METRICS.memory_utilization_percent();

println!("Memory: {:.2} MB ({:.1}%)", mem_mb, mem_pct);
}

Optimization Strategies

1. Chunk Size Optimization

Impact: Chunk size affects memory usage, I/O efficiency, and parallelism.

Adaptive Chunk Sizing:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::value_objects::ChunkSize;

// Automatically selects optimal chunk size based on file size
let chunk_size = ChunkSize::optimal_for_file_size(file_size);

println!("Optimal chunk size: {}", chunk_size);  // e.g., "4.0MB"
}

Guidelines:

File Size            | Chunk Size    | Rationale
< 10 MB (small)      | 64-256 KB     | Minimize memory, enable fine-grained parallelism
10-100 MB (medium)   | 256 KB-1 MB   | Balance memory and I/O efficiency
100 MB-1 GB (large)  | 1-4 MB        | Reduce I/O overhead, acceptable memory usage
> 1 GB (huge)        | 4-16 MB       | Maximize I/O throughput, still bounded memory

Trade-offs:

  • Small chunks: ✅ Lower memory, better parallelism ❌ Higher I/O overhead
  • Large chunks: ✅ Lower I/O overhead ❌ Higher memory, less parallelism

2. Worker Count Optimization

Impact: Worker count affects CPU utilization and resource contention.

Adaptive Worker Count:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::value_objects::WorkerCount;

// File size + system resources + processing type
let workers = WorkerCount::optimal_for_processing_type(
    file_size,
    available_cores,
    is_cpu_intensive,  // true for compression/encryption
);

println!("Optimal workers: {}", workers);  // e.g., "8 workers"
}

Empirically Validated Strategies:

File Size       | Worker Count | Strategy                       | Benchmark Result
5 MB (small)    | 9            | Aggressive parallelism         | +102% speedup
50 MB (medium)  | 5            | Balanced approach              | +70% speedup
2 GB (huge)     | 3            | Conservative (avoid overhead)  | +76% speedup

Why these strategies work:

  • Small files: Task overhead is amortized quickly with many workers
  • Medium files: Balanced to avoid both under-utilization and over-subscription
  • Huge files: Fewer workers prevent memory pressure and coordination overhead
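
As an illustration only (the real heuristics live in WorkerCount::optimal_for_file_size), a size-based selection mirroring the table; the exact thresholds here are assumptions:

#![allow(unused)]
fn main() {
// Illustration only — thresholds are assumptions chosen to mirror the table above.
const MB: u64 = 1024 * 1024;
const GB: u64 = 1024 * MB;

fn workers_for(file_size: u64) -> usize {
    match file_size {
        s if s <= 10 * MB => 9,   // small: aggressive parallelism
        s if s <= 100 * MB => 5,  // medium: balanced
        _ => 3,                   // huge: conservative
    }
}

assert_eq!(workers_for(5 * MB), 9);
assert_eq!(workers_for(2 * GB), 3);
}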

3. Memory Mapping vs Regular I/O

When to use memory mapping:

  • ✅ Files > 100 MB (amortizes setup cost)
  • ✅ Random access patterns (page cache efficiency)
  • ✅ Read-heavy workloads (no write overhead)

When to use regular I/O:

  • ✅ Files < 10 MB (lower setup cost)
  • ✅ Sequential access patterns (streaming)
  • ✅ Write-heavy workloads (buffered writes)

Configuration:

#![allow(unused)]
fn main() {
use adaptive_pipeline::infrastructure::adapters::file_io::TokioFileIO;
use adaptive_pipeline_domain::services::file_io_service::FileIOConfig;

let config = FileIOConfig {
    enable_memory_mapping: true,
    max_mmap_size: 1024 * 1024 * 1024,  // 1 GB threshold
    default_chunk_size: 64 * 1024,       // 64 KB chunks
    ..Default::default()
};

let service = TokioFileIO::new(config);
}

Benchmark Results (from pipeline/benches/file_io_benchmark.rs):

File Size | Regular I/O | Memory Mapping | Winner
1 MB      | 2000 MB/s   | 1500 MB/s      | Regular I/O
10 MB     | 1800 MB/s   | 1900 MB/s      | Comparable
50 MB     | 1500 MB/s   | 2200 MB/s      | Memory Mapping
100 MB    | 1400 MB/s   | 2500 MB/s      | Memory Mapping

4. Compression Algorithm Selection

Performance vs Compression Ratio:

Algorithm | Compression Speed | Decompression Speed | Ratio | Use Case
LZ4       | 500-700 MB/s      | 2000-3000 MB/s      | 2-3x  | Real-time, low latency
Zstd      | 200-400 MB/s      | 600-800 MB/s        | 3-5x  | Balanced, general use
Brotli    | 50-150 MB/s       | 300-500 MB/s        | 4-8x  | Storage, high compression

Adaptive Selection:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::services::CompressionPriority;

// Automatic algorithm selection
let config = service.get_optimal_config(
    "data.bin",
    &sample_data,
    CompressionPriority::Speed,  // or CompressionPriority::Ratio
)?;

println!("Selected: {:?}", config.algorithm);
}

Guidelines:

  • Speed priority: LZ4 for streaming, real-time processing
  • Balanced: Zstandard for general-purpose use
  • Ratio priority: Brotli for archival, storage optimization

5. Encryption Algorithm Selection

Performance Characteristics:

Algorithm           | Throughput    | Security  | Hardware Support
AES-256-GCM         | 400-800 MB/s  | Excellent | Yes (AES-NI)
ChaCha20-Poly1305   | 200-400 MB/s  | Excellent | No
XChaCha20-Poly1305  | 180-350 MB/s  | Excellent | No

Configuration:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::services::EncryptionAlgorithm;

// Use AES-256-GCM if hardware support available
let algorithm = if has_aes_ni() {
    EncryptionAlgorithm::Aes256Gcm  // 2-4x faster with AES-NI
} else {
    EncryptionAlgorithm::ChaCha20Poly1305  // Software fallback
};
}

Common Bottlenecks

1. CPU Bottleneck

Symptoms:

  • CPU saturation > 80%
  • High CPU wait times (P95 > 50 ms)
  • Low I/O utilization

Causes:

  • Too many CPU-intensive operations (compression, encryption)
  • Insufficient worker count for CPU-bound work
  • Slow algorithms (Brotli on large files)

Solutions:

#![allow(unused)]
fn main() {
// Increase CPU tokens to match cores
let config = ResourceConfig {
    cpu_tokens: Some(available_cores),  // Use all cores
    ..Default::default()
};

// Use faster algorithms
let compression = CompressionAlgorithm::Lz4;  // Instead of Brotli
let encryption = EncryptionAlgorithm::Aes256Gcm;  // With AES-NI

// Optimize worker count
let workers = WorkerCount::optimal_for_processing_type(
    file_size,
    available_cores,
    true,  // CPU-intensive = true
);
}

2. I/O Bottleneck

Symptoms:

  • I/O saturation > 80%
  • High I/O wait times (P95 > 100 ms)
  • Low CPU utilization

Causes:

  • Too many concurrent I/O operations
  • Small chunk sizes causing excessive syscalls
  • Storage device queue depth exceeded

Solutions:

#![allow(unused)]
fn main() {
// Increase chunk size to reduce I/O overhead
let chunk_size = ChunkSize::from_mb(4)?;  // 4 MB chunks

// Reduce I/O concurrency for HDD
let config = ResourceConfig {
    storage_type: StorageType::HDD,  // 4 I/O tokens
    ..Default::default()
};

// Use memory mapping for large files
let use_mmap = file_size > 100 * 1024 * 1024;  // > 100 MB
}

I/O Optimization by Device:

Device Type | Optimal Chunk Size | I/O Tokens | Strategy
HDD         | 1-4 MB             | 4          | Sequential, large chunks
SSD         | 256 KB-1 MB        | 12         | Balanced
NVMe        | 64 KB-256 KB       | 24         | Parallel, small chunks
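
A sketch combining the table's recommendations into a configuration. ResourceConfig, StorageType, and ChunkSize are the types used earlier in this guide; the specific chunk sizes below are assumptions taken from the table, not an API the pipeline exposes.

#![allow(unused)]
fn main() {
// Illustration only: pick settings matching the table above.
fn io_settings_for(storage: StorageType) -> (ResourceConfig, ChunkSize) {
    match storage {
        StorageType::HDD => (
            ResourceConfig { storage_type: StorageType::HDD, ..Default::default() },  // 4 I/O tokens
            ChunkSize::new(4 * 1024 * 1024),  // large sequential chunks
        ),
        StorageType::NVMe => (
            ResourceConfig { storage_type: StorageType::NVMe, ..Default::default() }, // 24 I/O tokens
            ChunkSize::new(256 * 1024),       // small parallel chunks
        ),
        _ => (
            // SSD / Auto / Custom: treat as the balanced default.
            ResourceConfig { storage_type: StorageType::SSD, ..Default::default() },  // 12 I/O tokens
            ChunkSize::new(1024 * 1024),
        ),
    }
}
}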

3. Memory Bottleneck

Symptoms:

  • Memory utilization > 80%
  • Swapping (check vmstat)
  • OOM errors

Causes:

  • Too many concurrent chunks allocated
  • Large chunk size × high worker count
  • Memory leaks or unbounded buffers

Solutions:

#![allow(unused)]
fn main() {
// Reduce chunk size
let mut chunk_size = ChunkSize::from_mb(16)?;  // Smaller chunks (may shrink further below)

// Limit concurrent workers
let config = ResourceConfig {
    cpu_tokens: Some(3),  // Fewer workers = less memory
    ..Default::default()
};

// Monitor memory closely
if CONCURRENCY_METRICS.memory_utilization_percent() > 80.0 {
    warn!("High memory usage, reducing chunk size");
    chunk_size = ChunkSize::from_mb(8)?;
}
}

4. Coordination Overhead

Symptoms:

  • High task spawn latency
  • Context switching > 10k/sec
  • Low overall throughput despite low resource usage

Causes:

  • Too many small tasks (excessive spawn_blocking calls)
  • High semaphore contention
  • Channel backpressure

Solutions:

#![allow(unused)]
fn main() {
// Batch small operations
if chunks.len() < 10 {
    // Sequential for small batches (avoid spawn overhead)
    for chunk in chunks {
        process_chunk_sync(chunk)?;
    }
} else {
    // Parallel for large batches
    tokio::task::spawn_blocking(move || {
        RAYON_POOLS.cpu_bound_pool().install(|| {
            chunks.into_par_iter().map(process_chunk_sync).collect()
        })
    }).await??
}

// Reduce worker count to lower contention
let workers = WorkerCount::new(available_cores / 2);
}

Tuning Parameters

Chunk Size Tuning

Parameters:

#![allow(unused)]
fn main() {
impl ChunkSize {
    pub const MIN_SIZE: usize = 1;                  // 1 byte
    pub const MAX_SIZE: usize = 512 * 1024 * 1024;  // 512 MB
    pub const DEFAULT_SIZE: usize = 1024 * 1024;    // 1 MB
}
}

Configuration:

#![allow(unused)]
fn main() {
// Via ChunkSize value object
let chunk_size = ChunkSize::from_mb(4)?;

// Via CLI/config file
let chunk_size_mb = 4;
let chunk_size = ChunkSize::from_mb(chunk_size_mb)?;

// Adaptive (recommended)
let chunk_size = ChunkSize::optimal_for_file_size(file_size);
}

Impact:

  • Memory: Directly proportional (2x chunk = 2x memory per worker)
  • I/O overhead: Inversely proportional (2x chunk = 0.5x syscalls)
  • Parallelism: Inversely proportional (2x chunk = 0.5x parallel units)

Worker Count Tuning

Parameters:

#![allow(unused)]
fn main() {
impl WorkerCount {
    pub const MIN_WORKERS: usize = 1;
    pub const MAX_WORKERS: usize = 32;
    pub const DEFAULT_WORKERS: usize = 4;
}
}

Configuration:

#![allow(unused)]
fn main() {
// Manual
let workers = WorkerCount::new(8);

// Adaptive (recommended)
let workers = WorkerCount::optimal_for_file_size(file_size);

// With system resources
let workers = WorkerCount::optimal_for_file_and_system(
    file_size,
    available_cores,
);

// With processing type
let workers = WorkerCount::optimal_for_processing_type(
    file_size,
    available_cores,
    is_cpu_intensive,
);
}

Impact:

  • Throughput: Generally increases with workers (up to cores)
  • Memory: Directly proportional (2x workers = 2x memory)
  • Context switching: Increases with workers (diminishing returns > 2x cores)

Resource Token Tuning

CPU Tokens:

#![allow(unused)]
fn main() {
let config = ResourceConfig {
    cpu_tokens: Some(7),  // cores - 1 (default)
    ..Default::default()
};
}

I/O Tokens:

#![allow(unused)]
fn main() {
let config = ResourceConfig {
    io_tokens: Some(24),          // Device-specific
    storage_type: StorageType::NVMe,
    ..Default::default()
};
}

Impact:

  • CPU tokens: Limits total CPU-bound parallelism across all files
  • I/O tokens: Limits total I/O concurrency across all files
  • Both: Prevent system oversubscription

Performance Monitoring

Real-Time Metrics

#![allow(unused)]
fn main() {
use adaptive_pipeline::infrastructure::metrics::CONCURRENCY_METRICS;
use std::time::Duration;

// Spawn monitoring task
tokio::spawn(async {
    let mut interval = tokio::time::interval(Duration::from_secs(5));
    loop {
        interval.tick().await;

        // Resource saturation
        let cpu_sat = CONCURRENCY_METRICS.cpu_saturation_percent();
        let io_sat = CONCURRENCY_METRICS.io_saturation_percent();
        let mem_util = CONCURRENCY_METRICS.memory_utilization_percent();

        // Wait time percentiles
        let cpu_p95 = CONCURRENCY_METRICS.cpu_wait_p95();
        let io_p95 = CONCURRENCY_METRICS.io_wait_p95();

        info!(
            "Resources: CPU={:.1}%, I/O={:.1}%, Mem={:.1}% | Wait: CPU={}ms, I/O={}ms",
            cpu_sat, io_sat, mem_util, cpu_p95, io_p95
        );

        // Alert on issues
        if cpu_sat > 90.0 {
            warn!("CPU saturated - consider increasing workers or faster algorithms");
        }
        if mem_util > 80.0 {
            warn!("High memory - consider reducing chunk size or workers");
        }
    }
});
}

Processing Metrics

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::entities::ProcessingMetrics;

let metrics = ProcessingMetrics::new();
metrics.start();

// Process file...
for chunk in chunks {
    metrics.add_bytes_processed(chunk.data.len() as u64);
}

metrics.end();

// Report performance
println!("Throughput: {:.2} MB/s", metrics.throughput_mb_per_second());
println!("Duration: {:.2}s", metrics.duration().as_secs_f64());
println!("Processed: {} MB", metrics.bytes_processed() / (1024 * 1024));

// Stage-specific metrics
for stage_metrics in metrics.stage_metrics() {
    println!("  {}: {:.2} MB/s", stage_metrics.stage_name, stage_metrics.throughput);
}
}

Performance Best Practices

1. Use Adaptive Configuration

#![allow(unused)]
fn main() {
// ✅ Good: Let the system optimize
let chunk_size = ChunkSize::optimal_for_file_size(file_size);
let workers = WorkerCount::optimal_for_processing_type(
    file_size,
    available_cores,
    is_cpu_intensive,
);

// ❌ Bad: Fixed values
let chunk_size = ChunkSize::from_mb(1)?;
let workers = WorkerCount::new(8);
}

2. Choose Appropriate Algorithms

#![allow(unused)]
fn main() {
// ✅ Good: Algorithm selection based on priority
let compression_config = service.get_optimal_config(
    file_extension,
    &sample_data,
    CompressionPriority::Speed,  // or Ratio
)?;

// ❌ Bad: Always use same algorithm
let compression_config = CompressionConfig {
    algorithm: CompressionAlgorithm::Brotli,  // Slow!
    ..Default::default()
};
}

3. Monitor and Measure

#![allow(unused)]
fn main() {
// ✅ Good: Measure actual performance
let start = Instant::now();
let result = process_file(path).await?;
let duration = start.elapsed();

let throughput_mb_s = (file_size as f64 / duration.as_secs_f64()) / (1024.0 * 1024.0);
info!("Throughput: {:.2} MB/s", throughput_mb_s);

// ❌ Bad: Assume performance without measurement
let result = process_file(path).await?;
}

4. Batch Small Operations

#![allow(unused)]
fn main() {
// ✅ Good: Batch to amortize overhead
tokio::task::spawn_blocking(move || {
    RAYON_POOLS.cpu_bound_pool().install(|| {
        chunks.into_par_iter()
            .map(|chunk| process_chunk(chunk))
            .collect::<Result<Vec<_>, _>>()
    })
}).await??

// ❌ Bad: Spawn for each small operation
for chunk in chunks {
    tokio::task::spawn_blocking(move || {
        process_chunk(chunk)  // Excessive spawn overhead!
    }).await??
}
}

5. Use Device-Specific Settings

#![allow(unused)]
fn main() {
// ✅ Good: Configure for storage type
let config = ResourceConfig {
    storage_type: StorageType::NVMe,  // 24 I/O tokens
    io_tokens: Some(24),
    ..Default::default()
};

// ❌ Bad: One size fits all
let config = ResourceConfig {
    io_tokens: Some(12),  // May be suboptimal
    ..Default::default()
};
}

Troubleshooting Performance Issues

Issue 1: Low Throughput Despite Low Resource Usage

Symptoms:

  • Throughput < 100 MB/s
  • CPU usage < 50%
  • I/O usage < 50%

Diagnosis:

#![allow(unused)]
fn main() {
// Check coordination overhead
let queue_depth = CONCURRENCY_METRICS.cpu_queue_depth();
let active_workers = CONCURRENCY_METRICS.active_workers();

println!("Queue: {}, Active: {}", queue_depth, active_workers);
}

Causes:

  • Too few workers (underutilization)
  • Small batch sizes (high spawn overhead)
  • Synchronous bottlenecks

Solutions:

#![allow(unused)]
fn main() {
// Increase workers
let workers = WorkerCount::new(available_cores);

// Batch operations
let batch_size = 100;
for batch in chunks.chunks(batch_size) {
    process_batch(batch).await?;
}
}

Issue 2: Inconsistent Performance

Symptoms:

  • Performance varies widely between runs
  • High P99 latencies (> 10x P50)

Diagnosis:

#![allow(unused)]
fn main() {
// Check wait time distribution
let p50 = CONCURRENCY_METRICS.cpu_wait_p50();
let p95 = CONCURRENCY_METRICS.cpu_wait_p95();
let p99 = CONCURRENCY_METRICS.cpu_wait_p99();

println!("Wait times: P50={}ms, P95={}ms, P99={}ms", p50, p95, p99);
}

Causes:

  • Resource contention (high wait times)
  • GC pauses or memory pressure
  • External system interference

Solutions:

#![allow(unused)]
fn main() {
// Reduce contention
let config = ResourceConfig {
    cpu_tokens: Some(available_cores - 2),  // Leave headroom
    ..Default::default()
};

// Monitor memory
if mem_util > 70.0 {
    chunk_size = ChunkSize::from_mb(chunk_size_mb / 2)?;
}
}

Issue 3: Memory Growth

Symptoms:

  • Memory usage grows over time
  • Eventually triggers OOM or swapping

Diagnosis:

#![allow(unused)]
fn main() {
// Track memory trends
let mem_start = CONCURRENCY_METRICS.memory_used_mb();
// ... process files ...
let mem_end = CONCURRENCY_METRICS.memory_used_mb();

if mem_end > mem_start * 1.5 {
    warn!("Memory grew {:.1}%", ((mem_end - mem_start) / mem_start) * 100.0);
}
}

Causes:

  • Memory leaks (improper cleanup)
  • Unbounded queues or buffers
  • Large chunk size with many workers

Solutions:

#![allow(unused)]
fn main() {
// Use RAII guards for cleanup
struct ChunkBuffer {
    data: Vec<u8>,
    _guard: MemoryGuard,
}

// Limit queue depth
let (tx, rx) = tokio::sync::mpsc::channel(100);  // Bounded channel

// Reduce chunk size
let chunk_size = ChunkSize::from_mb(16)?;  // Smaller
}

Summary

The pipeline's performance optimization system provides:

  1. Adaptive Configuration: Automatic chunk size and worker count optimization
  2. Algorithm Selection: Choose algorithms based on speed/ratio priority
  3. Resource Governance: Prevent oversubscription with token limits
  4. Memory Efficiency: Bounded memory usage with streaming processing
  5. Comprehensive Monitoring: Real-time metrics and performance tracking

Key Takeaways:

  • Use adaptive configuration (ChunkSize::optimal_for_file_size, WorkerCount::optimal_for_processing_type)
  • Choose algorithms based on workload (LZ4 for speed, Brotli for ratio)
  • Monitor metrics regularly (CPU/I/O saturation, wait times, throughput)
  • Tune based on bottleneck (CPU: increase workers/faster algorithms, I/O: increase chunk size, Memory: reduce chunk/workers)
  • Benchmark and measure actual performance (don't assume)

Performance Goals Achieved:

  • ✅ Throughput: 100-500 MB/s (algorithm-dependent)
  • ✅ Latency: < 100 ms overhead for small files
  • ✅ Memory: Bounded usage (chunk_size × workers × files)
  • ✅ Scalability: Linear scaling up to available cores

Benchmarking

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

This chapter covers the pipeline's benchmark suite, methodologies for measuring performance, and techniques for interpreting benchmark results to guide optimization decisions.

Overview

The pipeline uses Criterion.rs for rigorous, statistical benchmarking. Criterion provides:

  • Statistical Analysis: Measures mean, standard deviation, and percentiles
  • Regression Detection: Identifies performance regressions automatically
  • HTML Reports: Generates detailed visualizations and comparisons
  • Outlier Detection: Filters statistical outliers for consistent results
  • Iterative Measurement: Automatically determines iteration counts

Benchmark Location: pipeline/benches/file_io_benchmark.rs

Benchmark Suite

The benchmark suite covers four main categories:

1. Read Method Benchmarks

Purpose: Compare regular file I/O vs memory-mapped I/O across different file sizes.

File Sizes Tested:

  • 1 MB (small files)
  • 10 MB (medium files)
  • 50 MB (large files)
  • 100 MB (very large files)

Methods Compared:

  • regular_io: Traditional buffered file reading
  • memory_mapped: Memory-mapped file access (mmap)

Configuration:

#![allow(unused)]
fn main() {
let service = TokioFileIO::new(FileIOConfig {
    default_chunk_size: 64 * 1024,     // 64KB chunks
    max_mmap_size: 1024 * 1024 * 1024, // 1GB threshold
    enable_memory_mapping: true,
    ..Default::default()
});
}

Expected Results:

  • Small files (< 10 MB): Regular I/O faster (lower setup cost)
  • Large files (> 50 MB): Memory mapping faster (reduced copying)

2. Chunk Size Benchmarks

Purpose: Determine optimal chunk sizes for different workloads.

Chunk Sizes Tested:

  • 4 KB (4096 bytes)
  • 8 KB (8192 bytes)
  • 16 KB (16384 bytes)
  • 32 KB (32768 bytes)
  • 64 KB (65536 bytes)
  • 128 KB (131072 bytes)

File Size: 10 MB (representative medium file)

Methods:

  • regular_io: Chunked reading with various sizes
  • memory_mapped: Memory mapping with logical chunk sizes

Expected Results:

  • Smaller chunks (4-16 KB): Higher overhead, more syscalls
  • Medium chunks (32-64 KB): Good balance for most workloads
  • Larger chunks (128 KB+): Lower syscall overhead, higher memory usage
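
For reference, a chunk-size sweep of this shape might look roughly like the following with Criterion. This is a standalone sketch over an in-memory buffer, not the project's actual file_io_benchmark.rs, and read_with_chunk_size is a hypothetical helper:

use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};

// Hypothetical helper: walk a buffer in chunks of the given size.
fn read_with_chunk_size(data: &[u8], chunk_size: usize) -> usize {
    data.chunks(chunk_size).map(|c| black_box(c).len()).sum()
}

fn bench_chunk_sizes(c: &mut Criterion) {
    let data = vec![0u8; 10 * 1024 * 1024]; // 10 MB in-memory stand-in
    let mut group = c.benchmark_group("chunk_sizes");
    group.throughput(Throughput::Bytes(data.len() as u64));

    for &kb in &[4usize, 8, 16, 32, 64, 128] {
        group.bench_with_input(BenchmarkId::new("regular_io", kb), &kb, |b, &kb| {
            b.iter(|| read_with_chunk_size(black_box(&data), kb * 1024));
        });
    }
    group.finish();
}

criterion_group!(benches, bench_chunk_sizes);
criterion_main!(benches);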

3. Checksum Calculation Benchmarks

Purpose: Measure overhead of integrity verification.

Benchmarks:

  • with_checksums: File reading with SHA-256 checksum calculation
  • without_checksums: File reading without checksums
  • checksum_only: Standalone checksum calculation

File Size: 10 MB

Expected Results:

  • Checksum overhead: 10-30% depending on CPU and chunk size
  • Larger chunks reduce relative overhead (fewer per-chunk hash operations)

4. Write Operation Benchmarks

Purpose: Compare write performance across data sizes and options.

Data Sizes Tested:

  • 1 KB (tiny writes)
  • 10 KB (small writes)
  • 100 KB (medium writes)
  • 1000 KB / 1 MB (large writes)

Write Options:

  • write_data: Buffered write without checksum
  • write_data_with_checksum: Buffered write with SHA-256 checksum
  • write_data_sync: Buffered write with fsync

Expected Results:

  • Write throughput: 100-1000 MB/s (buffered, no sync)
  • Checksum overhead: 10-30%
  • Sync overhead: 10-100x depending on storage device

Running Benchmarks

Basic Usage

Run all benchmarks:

cargo bench --bench file_io_benchmark

Output:

file_read_methods/regular_io/1    time:   [523.45 µs 528.12 µs 533.28 µs]
file_read_methods/regular_io/10   time:   [5.1234 ms 5.2187 ms 5.3312 ms]
file_read_methods/memory_mapped/1 time:   [612.34 µs 618.23 µs 624.91 µs]
file_read_methods/memory_mapped/10 time: [4.8923 ms 4.9721 ms 5.0584 ms]
...

Benchmark Groups

Run specific benchmark group:

# Read methods only
cargo bench --bench file_io_benchmark -- "file_read_methods"

# Chunk sizes only
cargo bench --bench file_io_benchmark -- "chunk_sizes"

# Checksums only
cargo bench --bench file_io_benchmark -- "checksum_calculation"

# Write operations only
cargo bench --bench file_io_benchmark -- "write_operations"

Specific Benchmarks

Run benchmarks matching a pattern:

# All regular_io benchmarks
cargo bench --bench file_io_benchmark -- "regular_io"

# Only 50MB file benchmarks
cargo bench --bench file_io_benchmark -- "/50"

# Memory-mapped with 64KB chunks
cargo bench --bench file_io_benchmark -- "memory_mapped/64"

Output Formats

HTML report (default):

cargo bench --bench file_io_benchmark

# Opens: target/criterion/report/index.html
open target/criterion/report/index.html

Save baseline for comparison:

# Save current performance as baseline
cargo bench --bench file_io_benchmark -- --save-baseline main

# Compare against baseline after changes
cargo bench --bench file_io_benchmark -- --baseline main

Example comparison output:

file_read_methods/regular_io/50
                        time:   [24.512 ms 24.789 ms 25.091 ms]
                        change: [-5.2341% -4.1234% -2.9876%] (p = 0.00 < 0.05)
                        Performance has improved.

Interpreting Results

Understanding Output

Criterion output format:

benchmark_name          time:   [lower_bound estimate upper_bound]
                        change: [lower% estimate% upper%] (p = X.XX < 0.05)

Components:

  • lower_bound: 95% confidence interval lower bound
  • estimate: Best estimate (typically median)
  • upper_bound: 95% confidence interval upper bound
  • change: Performance change vs baseline (if available)
  • p-value: Statistical significance (< 0.05 = significant)

Example:

file_read_methods/regular_io/50
                        time:   [24.512 ms 24.789 ms 25.091 ms]
                        change: [-5.2341% -4.1234% -2.9876%] (p = 0.00 < 0.05)

Interpretation:

  • Time: File reading takes ~24.79 ms (median)
  • Confidence: 95% certain actual time is between 24.51-25.09 ms
  • Change: 4.12% faster than baseline (statistically significant)
  • Conclusion: Performance improvement confirmed

Throughput Calculation

Calculate MB/s from benchmark time:

File size: 50 MB
Time: 24.789 ms = 0.024789 seconds

Throughput = 50 MB / 0.024789 s = 2017 MB/s

Rust code:

#![allow(unused)]
fn main() {
let file_size_mb = 50.0;
let time_ms = 24.789;
let throughput_mb_s = file_size_mb / (time_ms / 1000.0);

println!("Throughput: {:.2} MB/s", throughput_mb_s);  // 2017 MB/s
}

Regression Detection

Performance regression indicators:

Improvement (faster):

change: [-10.234% -8.123% -6.012%] (p = 0.00 < 0.05)
Performance has improved.

Regression (slower):

change: [+6.234% +8.456% +10.891%] (p = 0.00 < 0.05)
Performance has regressed.

⚠️ No significant change:

change: [-1.234% +0.456% +2.123%] (p = 0.42 > 0.05)
No change in performance detected.

Statistical significance:

  • p < 0.05: Change is statistically significant (95% confidence)
  • p > 0.05: Change could be noise (not statistically significant)

HTML Report Navigation

Criterion generates interactive HTML reports at target/criterion/report/index.html.

Report Sections:

  1. Summary: All benchmarks with comparisons
  2. Individual Reports: Detailed analysis per benchmark
  3. Violin Plots: Distribution visualization
  4. History: Performance over time

Key Metrics in Report:

  • Mean: Average execution time
  • Median: 50th percentile (typical case)
  • Std Dev: Variability in measurements
  • MAD: Median Absolute Deviation (robust to outliers; see the short example after this list)
  • Slope: Linear regression slope (for iteration scaling)
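
To see why MAD is robust to outliers, here is a small standalone example (not part of the benchmark suite) that computes the median and MAD for five sample times, one of which is a 30 ms outlier:

fn median(sorted: &[f64]) -> f64 {
    let n = sorted.len();
    if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    }
}

fn main() {
    // Sample times in ms; the 30.0 outlier barely moves the MAD.
    let mut samples = vec![24.5, 24.8, 25.1, 24.9, 30.0];
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let med = median(&samples); // 24.9 ms

    let mut deviations: Vec<f64> = samples.iter().map(|t| (t - med).abs()).collect();
    deviations.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mad = median(&deviations); // 0.2 ms, while the standard deviation is ~2 ms

    println!("median = {med:.1} ms, MAD = {mad:.1} ms");
}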

Performance Baselines

Establishing Baselines

Save baseline before optimizations:

# 1. Run benchmarks on main branch
git checkout main
cargo bench --bench file_io_benchmark -- --save-baseline main

# 2. Switch to feature branch
git checkout feature/optimize-io

# 3. Run benchmarks and compare
cargo bench --bench file_io_benchmark -- --baseline main

Baseline management:

# List saved baselines (one subdirectory per baseline under each benchmark)
ls target/criterion/*/

# Delete old baseline
rm -rf target/criterion/*/baseline_name/

# Run against an existing baseline while saving the results as a new one
cargo bench -- --baseline old_baseline --save-baseline new_baseline

Baseline Strategy

Recommended baselines:

  1. main: Current production performance
  2. release-X.Y.Z: Tagged release versions
  3. pre-optimization: Before major optimization work
  4. target: Performance goals

Example workflow:

# Establish target baseline (goals)
cargo bench -- --save-baseline target

# Work on optimizations...
# (make changes, run benchmarks)

# Compare to target
cargo bench -- --baseline target

# If goals met, update main baseline
cargo bench -- --save-baseline main

Continuous Integration

CI/CD Integration

GitHub Actions example:

name: Benchmarks

on:
  pull_request:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install Rust
        uses: actions-rs/toolchain@v1
        with:
          toolchain: stable

      - name: Run benchmarks
        run: |
          cargo bench --bench file_io_benchmark -- --save-baseline pr

      - name: Compare to main
        run: |
          git fetch origin main:main
          git checkout main
          cargo bench --bench file_io_benchmark -- --save-baseline main
          git checkout -
          cargo bench --bench file_io_benchmark -- --baseline main

      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: benchmark-results
          path: target/criterion/

Regression Alerts

Detect regressions in CI:

#!/bin/bash
# scripts/check_benchmarks.sh

# Run benchmarks and save output
cargo bench --bench file_io_benchmark -- --baseline main > bench_output.txt

# Check for regressions
if grep -q "Performance has regressed" bench_output.txt; then
    echo "❌ Performance regression detected!"
    grep "Performance has regressed" bench_output.txt
    exit 1
else
    echo "✅ No performance regressions detected"
    exit 0
fi

Benchmark Best Practices

1. Use Representative Workloads

#![allow(unused)]
fn main() {
// ✅ Good: Test realistic file sizes
for size_mb in [1, 10, 50, 100, 500, 1000].iter() {
    let test_file = create_test_file(*size_mb);
    benchmark_file_io(&test_file);
}

// ❌ Bad: Only tiny files
let test_file = create_test_file(1);  // Not representative!
}

2. Control Variables

#![allow(unused)]
fn main() {
// ✅ Good: Isolate what you're measuring
group.bench_function("compression_only", |b| {
    b.iter(|| {
        // Only benchmark compression, not I/O
        compress_data(black_box(&test_data))
    });
});

// ❌ Bad: Measuring multiple things
group.bench_function("compression_and_io", |b| {
    b.iter(|| {
        let data = read_file(path);  // I/O overhead!
        compress_data(&data)         // Compression
    });
});
}

3. Use black_box to Prevent Optimization

#![allow(unused)]
fn main() {
// ✅ Good: Prevent compiler from eliminating code
b.iter(|| {
    let result = expensive_operation();
    black_box(result);  // Ensures result is not optimized away
});

// ❌ Bad: Compiler may optimize away the work
b.iter(|| {
    expensive_operation();  // Result unused, may be eliminated!
});
}

4. Warm Up Before Measuring

#![allow(unused)]
fn main() {
// ✅ Good: Criterion handles warmup automatically
c.bench_function("my_benchmark", |b| {
    // Criterion runs warmup iterations automatically
    b.iter(|| expensive_operation());
});

// ❌ Bad: Manual warmup (unnecessary with Criterion)
for _ in 0..100 {
    expensive_operation();  // Criterion does this for you!
}
}

5. Measure Multiple Configurations

#![allow(unused)]
fn main() {
// ✅ Good: Test parameter space
let mut group = c.benchmark_group("chunk_sizes");
for chunk_size in [4096, 8192, 16384, 32768, 65536].iter() {
    group.bench_with_input(
        BenchmarkId::from_parameter(chunk_size),
        chunk_size,
        |b, &size| b.iter(|| process_with_chunk_size(size))
    );
}

// ❌ Bad: Single configuration
c.bench_function("process", |b| {
    b.iter(|| process_with_chunk_size(65536));  // Only one size!
});
}

Example Benchmark Analysis

Scenario: Optimizing Chunk Size

1. Run baseline benchmarks:

cargo bench --bench file_io_benchmark -- "chunk_sizes" --save-baseline before

Results:

chunk_sizes/regular_io/4096   time:   [82.341 ms 83.129 ms 83.987 ms]
chunk_sizes/regular_io/16384  time:   [62.123 ms 62.891 ms 63.712 ms]
chunk_sizes/regular_io/65536  time:   [52.891 ms 53.523 ms 54.201 ms]

2. Calculate throughput:

File size: 10 MB
4KB chunks:   10 / 0.083129 = 120.3 MB/s
16KB chunks:  10 / 0.062891 = 159.0 MB/s
64KB chunks:  10 / 0.053523 = 186.8 MB/s

3. Analysis:

  • 4 KB chunks: Slow (120 MB/s) due to syscall overhead
  • 16 KB chunks: Better (159 MB/s), balanced
  • 64 KB chunks: Best (187 MB/s), amortizes overhead

4. Recommendation:

Use 64 KB chunks for medium files (10-100 MB).

Scenario: Memory Mapping Threshold

1. Run benchmarks across file sizes:

cargo bench --bench file_io_benchmark -- "file_read_methods"

Results:

regular_io/1      time:   [523.45 µs 528.12 µs 533.28 µs]  → 1894 MB/s
memory_mapped/1   time:   [612.34 µs 618.23 µs 624.91 µs]  → 1618 MB/s

regular_io/50     time:   [24.512 ms 24.789 ms 25.091 ms]  → 2017 MB/s
memory_mapped/50  time:   [19.234 ms 19.512 ms 19.812 ms]  → 2563 MB/s

regular_io/100    time:   [52.123 ms 52.891 ms 53.712 ms]  → 1891 MB/s
memory_mapped/100 time:   [38.234 ms 38.712 ms 39.234 ms]  → 2584 MB/s

2. Crossover analysis:

  • 1 MB: Regular I/O faster (1894 vs 1618 MB/s)
  • 50 MB: Memory mapping faster (2563 vs 2017 MB/s) - 27% improvement
  • 100 MB: Memory mapping faster (2584 vs 1891 MB/s) - 37% improvement

3. Recommendation:

Use memory mapping for files larger than roughly 10 MB; the measured crossover lies somewhere between 1 MB and 50 MB.
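
In code, this recommendation reduces to a size check before choosing a read path. The sketch below is illustrative only; ReadStrategy and MMAP_THRESHOLD_BYTES are hypothetical names, not the pipeline's actual API:

/// Hypothetical helper: pick a read strategy from the benchmark-derived threshold.
const MMAP_THRESHOLD_BYTES: u64 = 10 * 1024 * 1024; // ~10 MB crossover

enum ReadStrategy {
    RegularIo,
    MemoryMapped,
}

fn choose_read_strategy(file_size: u64) -> ReadStrategy {
    if file_size >= MMAP_THRESHOLD_BYTES {
        ReadStrategy::MemoryMapped // 27-37% faster at 50-100 MB in the results above
    } else {
        ReadStrategy::RegularIo // avoids mmap setup cost on small files
    }
}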

Summary

The pipeline's benchmarking suite provides:

  1. Comprehensive Coverage: Read/write operations, chunk sizes, checksums
  2. Statistical Rigor: Criterion.rs with confidence intervals and regression detection
  3. Baseline Comparison: Track performance changes over time
  4. CI/CD Integration: Automated regression detection in pull requests
  5. HTML Reports: Interactive visualizations and detailed analysis

Key Takeaways:

  • Run benchmarks before and after optimizations (--save-baseline)
  • Use representative workloads (realistic file sizes and configurations)
  • Look for statistically significant changes (p < 0.05)
  • Calculate throughput (MB/s) for intuitive performance comparison
  • Integrate benchmarks into CI/CD for regression prevention
  • Use HTML reports for detailed analysis and visualization

Benchmark Commands:

# Run all benchmarks
cargo bench --bench file_io_benchmark

# Save baseline
cargo bench --bench file_io_benchmark -- --save-baseline main

# Compare to baseline
cargo bench --bench file_io_benchmark -- --baseline main

# Specific group
cargo bench --bench file_io_benchmark -- "file_read_methods"

Profiling

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

This chapter covers profiling tools and techniques for identifying performance bottlenecks, analyzing CPU usage, memory allocation patterns, and optimizing the pipeline's runtime characteristics.

Overview

Profiling is the process of measuring where your program spends time and allocates memory. Unlike benchmarking (which measures aggregate performance), profiling provides detailed insights into:

  • CPU hotspots: Which functions consume the most CPU time
  • Memory allocation: Where and how much memory is allocated
  • Lock contention: Where threads wait for synchronization
  • Cache misses: Memory access patterns affecting performance

When to profile:

  • ✅ After identifying performance issues in benchmarks
  • ✅ When optimizing critical paths
  • ✅ When investigating unexplained slowness
  • ✅ Before and after major architectural changes

When NOT to profile:

  • ❌ Before establishing performance goals
  • ❌ On trivial workloads (profile representative cases)
  • ❌ Without benchmarks (profile after quantifying the issue)

Profiling Tools

CPU Profiling Tools

| Tool        | Platform      | Sampling | Instrumentation | Overhead   | Output           |
|-------------|---------------|----------|-----------------|------------|------------------|
| perf        | Linux         | Yes      | No              | Low (1-5%) | Text/perf.data   |
| flamegraph  | All           | Yes      | No              | Low (1-5%) | SVG              |
| samply      | All           | Yes      | No              | Low (1-5%) | Firefox Profiler |
| Instruments | macOS         | Yes      | Optional        | Low-Med    | GUI              |
| VTune       | Linux/Windows | Yes      | Optional        | Med        | GUI              |

Recommended for Rust:

  • Linux: perf + flamegraph
  • macOS: samply or Instruments
  • Windows: VTune or Windows Performance Analyzer

Memory Profiling Tools

| Tool              | Platform    | Heap | Leaks | Peak Usage | Output        |
|-------------------|-------------|------|-------|------------|---------------|
| valgrind (massif) | Linux/macOS | Yes  | No    | Yes        | Text/ms_print |
| heaptrack         | Linux       | Yes  | Yes   | Yes        | GUI           |
| dhat              | Linux/macOS | Yes  | No    | Yes        | JSON/Web UI   |
| Instruments       | macOS       | Yes  | Yes   | Yes        | GUI           |

Recommended for Rust:

  • Linux: heaptrack or dhat
  • macOS: Instruments (Allocations template)
  • Memory leaks: valgrind (memcheck)

CPU Profiling

Using perf (Linux)

Setup:

# Install perf
sudo apt-get install linux-tools-common linux-tools-generic  # Ubuntu/Debian
sudo dnf install perf  # Fedora

# Build with debug symbols
cargo build --release --bin pipeline

# Enable perf for non-root users
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid

Record profile:

# Profile pipeline execution
perf record --call-graph dwarf \
    ./target/release/pipeline process testdata/large-file.bin

# Output: perf.data

Analyze results:

# Interactive TUI
perf report

# Text summary
perf report --stdio

# Self time only (exclude time spent in callees)
perf report --stdio --no-children

# Top functions
perf report --stdio | head -20

Example output:

# Overhead  Command   Shared Object     Symbol
# ........  ........  ................  .......................
    45.23%  pipeline  libbrotli.so      BrotliEncoderCompress
    18.91%  pipeline  pipeline          rayon::iter::collect
    12.34%  pipeline  libcrypto.so      AES_encrypt
     8.72%  pipeline  pipeline          tokio::runtime::task
     6.45%  pipeline  libc.so           memcpy
     3.21%  pipeline  pipeline          std::io::Write::write_all

Interpretation:

  • 45% in BrotliEncoderCompress: Compression is the hotspot
  • 19% in rayon::iter::collect: Parallel iteration overhead
  • 12% in AES_encrypt: Encryption is expensive but expected
  • 9% in Tokio task: Async runtime overhead

Optimization targets:

  1. Use faster compression (LZ4 instead of Brotli)
  2. Reduce Rayon parallelism overhead (larger chunks; see the sketch after this list)
  3. Use AES-NI hardware acceleration
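
For target 2, the usual fix is to hand each Rayon task more work so scheduling cost is amortized. A generic illustration (not the pipeline's code; compress_block is a placeholder):

use rayon::prelude::*;

fn compress_block(block: &[u8]) -> Vec<u8> {
    block.to_vec() // placeholder for the real per-block work
}

fn compress_all(data: &[u8]) -> Vec<Vec<u8>> {
    // ❌ Fine-grained: one parallel task per 4 KB block amplifies Rayon overhead.
    // data.par_chunks(4 * 1024).map(compress_block).collect()

    // ✅ Coarser-grained: 1 MB blocks amortize the per-task cost.
    data.par_chunks(1024 * 1024).map(compress_block).collect()
}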

Using Flamegraph (All Platforms)

Setup:

# Install cargo-flamegraph
cargo install flamegraph

# Linux: Install perf (see above)

# macOS: Install DTrace (built-in)

# Windows: Install WPA (Windows Performance Analyzer)

Generate flamegraph:

# Profile and generate SVG
cargo flamegraph --bin pipeline -- process testdata/large-file.bin

# Output: flamegraph.svg
open flamegraph.svg

Flamegraph structure:

┌──────────────────────────────────────────────────────────────┐
│                          main (100%)                          │ ← Root
├──────────────────────┬────────────────────────┬───────────────┤
│ tokio::runtime (50%) │   rayon::iter (30%)    │  other (20%)  │
├───────┬───────┬──────┼──────────┬───────┬─────┼───────────────┤
│  I/O  │ task  │ ...  │ compress │encrypt│hash │      ...      │
│  20%  │  15%  │ 15%  │   18%    │  8%   │ 4%  │               │
└───────┴───────┴──────┴──────────┴───────┴─────┴───────────────┘

Reading flamegraphs:

  • Width: Percentage of CPU time (wider = more time)
  • Height: Call stack depth (bottom = entry point, top = leaf functions)
  • Color: Random (for visual distinction, not meaningful)
  • Interactive: Click to zoom, search functions

Example analysis:

If you see:

main → tokio::runtime → spawn_blocking → rayon::par_iter → compress → brotli
                                                               45%

Interpretation:

  • 45% of time spent in Brotli compression
  • Called through Rayon parallel iteration
  • Spawned from Tokio's blocking thread pool

Action:

  • Evaluate if 45% compression time is acceptable
  • If too slow, switch to LZ4 or reduce compression level

Using Samply (All Platforms)

Setup:

# Install samply
cargo install samply

Profile and view:

# Profile and open in Firefox Profiler
samply record ./target/release/pipeline process testdata/large-file.bin

# Automatically opens in Firefox Profiler web UI

Firefox Profiler features:

  • Timeline view: See CPU usage over time
  • Call tree: Hierarchical function breakdown
  • Flame graph: Interactive flamegraph
  • Marker timeline: Custom events and annotations
  • Thread activity: Per-thread CPU usage

Advantages over static flamegraphs:

  • ✅ Interactive (zoom, filter, search)
  • ✅ Timeline view (see activity over time)
  • ✅ Multi-threaded visualization
  • ✅ Easy sharing (web-based)

Memory Profiling

Using Valgrind Massif (Linux/macOS)

Setup:

# Install valgrind
sudo apt-get install valgrind  # Ubuntu/Debian
brew install valgrind  # macOS (Intel only)

Profile heap usage:

# Run with Massif
valgrind --tool=massif --massif-out-file=massif.out \
    ./target/release/pipeline process testdata/large-file.bin

# Visualize results
ms_print massif.out

Example output:

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1        123,456       64,000,000       64,000,000             0            0
  2        234,567      128,000,000      128,000,000             0            0
  3        345,678      192,000,000      192,000,000             0            0  ← Peak
  4        456,789      128,000,000      128,000,000             0            0
  5        567,890       64,000,000       64,000,000             0            0
  6        678,901                0                0             0            0

Peak memory: 192 MB

Top allocations:
    45.2%  (87 MB)  Vec::with_capacity (chunk buffer)
    23.1%  (44 MB)  Vec::with_capacity (compressed data)
    15.7%  (30 MB)  Box::new (encryption context)
    ...

Interpretation:

  • Peak memory: 192 MB
  • Main contributor: 87 MB chunk buffers (45%)
  • Growth pattern: Linear increase then decrease (expected for streaming)

Optimization:

  • Reduce chunk size to lower peak memory
  • Reuse buffers instead of allocating new ones (sketched after this list)
  • Use SmallVec for small allocations
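
A minimal sketch of the buffer-reuse idea (illustrative only; process_into is a hypothetical routine that writes into a caller-supplied buffer):

// One long-lived scratch buffer instead of one allocation per chunk.
fn process_chunks(chunks: &[Vec<u8>]) {
    let mut scratch = Vec::with_capacity(16 * 1024 * 1024); // sized for 16 MB chunks
    for chunk in chunks {
        scratch.clear(); // drops contents, keeps capacity
        process_into(chunk, &mut scratch);
        // ... write `scratch` out before the next iteration ...
    }
}

// Placeholder for real per-chunk work that appends its output to `out`.
fn process_into(input: &[u8], out: &mut Vec<u8>) {
    out.extend_from_slice(input);
}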

Using Heaptrack (Linux)

Setup:

# Install heaptrack
sudo apt-get install heaptrack heaptrack-gui  # Ubuntu/Debian

Profile and analyze:

# Record heap usage
heaptrack ./target/release/pipeline process testdata/large-file.bin

# Output: heaptrack.pipeline.12345.gz

# Analyze with GUI
heaptrack_gui heaptrack.pipeline.12345.gz

Heaptrack GUI features:

  • Summary: Total allocations, peak memory, leaked memory
  • Top allocators: Functions allocating the most
  • Flame graph: Allocation call chains
  • Timeline: Memory usage over time
  • Leak detection: Allocated but never freed

Example metrics:

Total allocations: 1,234,567
Total allocated: 15.2 GB
Peak heap: 192 MB
Peak RSS: 256 MB
Leaked: 0 bytes

Top allocators:

1. Vec::with_capacity (chunk buffer)        5.4 GB  (35%)
2. tokio::spawn (task allocation)           2.1 GB  (14%)
3. Vec::with_capacity (compressed data)     1.8 GB  (12%)

Using DHAT (Linux/macOS)

Setup:

# Install valgrind with DHAT
sudo apt-get install valgrind

Profile with DHAT:

# Run with DHAT
valgrind --tool=dhat --dhat-out-file=dhat.out \
    ./target/release/pipeline process testdata/large-file.bin

# Output: dhat.out (as named by --dhat-out-file)

# View in the DHAT web UI: open dh_view.html (bundled with Valgrind)
# or https://nnethercote.github.io/dh_view/dh_view.html, then load dhat.out

DHAT metrics:

  • Total bytes allocated: Cumulative allocation size
  • Total blocks allocated: Number of allocations
  • Peak bytes: Maximum heap size
  • Average block size: Typical allocation size
  • Short-lived: Allocations freed quickly

Example output:

Total:     15.2 GB in 1,234,567 blocks
Peak:      192 MB
At t-gmax: 187 MB in 145 blocks
At t-end:  0 B in 0 blocks

Top allocation sites:
  1. 35.4% (5.4 GB in 98,765 blocks)
     Vec::with_capacity (file_io.rs:123)
     ← chunk buffer allocation

  2. 14.2% (2.1 GB in 567,890 blocks)
     tokio::spawn (runtime.rs:456)
     ← task overhead

  3. 11.8% (1.8 GB in 45,678 blocks)
     Vec::with_capacity (compression.rs:789)
     ← compressed buffer

Profiling Workflows

Workflow 1: Identify CPU Hotspot

1. Establish baseline (benchmark):

cargo bench --bench file_io_benchmark -- --save-baseline before

2. Profile with perf + flamegraph:

cargo flamegraph --bin pipeline -- process testdata/large-file.bin
open flamegraph.svg

3. Identify hotspot:

Look for wide bars in flamegraph (e.g., 45% in BrotliEncoderCompress).

4. Optimize:

#![allow(unused)]
fn main() {
// Before: Brotli (slow, high compression)
let compression = CompressionAlgorithm::Brotli;

// After: LZ4 (fast, lower compression)
let compression = CompressionAlgorithm::Lz4;
}

5. Verify improvement:

cargo flamegraph --bin pipeline -- process testdata/large-file.bin
# Check if Brotli bar shrunk

cargo bench --bench file_io_benchmark -- --baseline before
# Expect: Performance has improved

Workflow 2: Reduce Memory Usage

1. Profile heap usage:

heaptrack ./target/release/pipeline process testdata/large-file.bin
heaptrack_gui heaptrack.pipeline.12345.gz

2. Identify large allocations:

Look for top allocators (e.g., 87 MB chunk buffers).

3. Calculate optimal size:

Current: 64 MB chunks × 8 workers = 512 MB peak
Target: < 256 MB peak
Solution: 16 MB chunks × 8 workers = 128 MB peak

4. Reduce chunk size:

#![allow(unused)]
fn main() {
// Before
let chunk_size = ChunkSize::from_mb(64)?;

// After
let chunk_size = ChunkSize::from_mb(16)?;
}

5. Verify reduction:

heaptrack ./target/release/pipeline process testdata/large-file.bin
# Check peak memory: should be < 256 MB

Workflow 3: Optimize Parallel Code

1. Profile with perf:

perf record --call-graph dwarf \
    ./target/release/pipeline process testdata/large-file.bin

perf report --stdio

2. Check for synchronization overhead:

12.3%  pipeline  [kernel]     futex_wait    ← Lock contention
 8.7%  pipeline  pipeline     rayon::join    ← Coordination
 6.5%  pipeline  pipeline     Arc::clone     ← Reference counting

3. Reduce contention:

#![allow(unused)]
fn main() {
// Before: Shared mutex (high contention)
let counter = Arc::new(Mutex::new(0));

// After: Per-thread counters (no contention)
let counter = Arc::new(AtomicUsize::new(0));
counter.fetch_add(1, Ordering::Relaxed);
}

4. Verify improvement:

perf record ./target/release/pipeline process testdata/large-file.bin
perf report --stdio
# Check if futex_wait reduced

Interpreting Results

CPU Profile Patterns

Pattern 1: Single Hotspot

compress_chunk: 65%  ← Dominant function
encrypt_chunk:  15%
write_chunk:    10%
other:          10%

Action: Optimize the 65% hotspot (use faster algorithm or optimize implementation).

Pattern 2: Distributed Cost

compress: 20%
encrypt:  18%
hash:     15%
io:       22%
other:    25%

Action: No single hotspot. Profile deeper or optimize multiple functions.

Pattern 3: Framework Overhead

tokio::runtime: 35%  ← High async overhead
rayon::iter:    25%  ← High parallel overhead
actual_work:    40%

Action: Reduce task spawning frequency, batch operations, or use sync code for CPU-bound work.

Memory Profile Patterns

Pattern 1: Linear Growth

Memory over time:
  0s:   0 MB
 10s:  64 MB
 20s: 128 MB  ← Growing linearly
 30s: 192 MB

Likely cause: Streaming processing with bounded buffers (normal).

Pattern 2: Sawtooth

Memory over time:
  0-5s:   0 → 128 MB  ← Allocate
  5s:   128 →   0 MB  ← Free
  6-11s:  0 → 128 MB  ← Allocate again

Likely cause: Batch processing with periodic flushing (normal).

Pattern 3: Unbounded Growth

Memory over time:
  0s:   0 MB
 10s:  64 MB
 20s: 158 MB  ← Growing faster than linear
 30s: 312 MB

Likely cause: Memory leak (allocated but never freed).

Action: Use heaptrack to identify leak source, ensure proper cleanup with RAII.
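
A sketch of the RAII-style fix (illustrative, not the pipeline's code): bound whatever structure is accumulating so old entries are dropped, and therefore freed, as new ones arrive.

use std::collections::VecDeque;

/// Keeps only the most recent `cap` results; older entries are dropped automatically.
struct RecentResults {
    buf: VecDeque<Vec<u8>>,
    cap: usize,
}

impl RecentResults {
    fn new(cap: usize) -> Self {
        Self { buf: VecDeque::with_capacity(cap), cap }
    }

    fn push(&mut self, item: Vec<u8>) {
        if self.buf.len() == self.cap {
            self.buf.pop_front(); // oldest entry freed here (RAII: Drop runs on pop)
        }
        self.buf.push_back(item);
    }
}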

Common Performance Issues

Issue 1: Lock Contention

Symptoms:

  • High futex_wait in perf
  • Threads spending time in synchronization

Diagnosis:

perf record --call-graph dwarf ./target/release/pipeline ...
perf report | grep futex

Solutions:

#![allow(unused)]
fn main() {
// ❌ Bad: Shared mutex
let counter = Arc::new(Mutex::new(0));

// ✅ Good: Atomic
let counter = Arc::new(AtomicUsize::new(0));
counter.fetch_add(1, Ordering::Relaxed);
}

Issue 2: Excessive Allocations

Symptoms:

  • High malloc / free in flamegraph
  • Poor cache performance

Diagnosis:

heaptrack ./target/release/pipeline ...
# Check "Total allocations" count

Solutions:

#![allow(unused)]
fn main() {
// ❌ Bad: Allocate per iteration
for chunk in chunks {
    let buffer = vec![0; size];  // New allocation every time!
    process(buffer);
}

// ✅ Good: Reuse buffer
let mut buffer = vec![0; size];
for chunk in chunks {
    process(&mut buffer);
    buffer.clear();
}
}

Issue 3: Small Task Overhead

Symptoms:

  • High tokio::spawn or rayon::spawn overhead
  • More framework time than actual work

Diagnosis:

cargo flamegraph --bin pipeline ...
# Check width of spawn-related functions

Solutions:

#![allow(unused)]
fn main() {
// ❌ Bad: Spawn per chunk
for chunk in chunks {
    tokio::spawn(async move { process(chunk).await });
}

// ✅ Good: Batch chunks
tokio::task::spawn_blocking(move || {
    RAYON_POOLS.cpu_bound_pool().install(|| {
        chunks.into_par_iter().map(process).collect()
    })
})
}

Profiling Best Practices

1. Profile Release Builds

# ✅ Good: Release mode
cargo build --release
perf record ./target/release/pipeline ...

# ❌ Bad: Debug mode (10-100x slower, misleading results)
cargo build
perf record ./target/debug/pipeline ...

2. Use Representative Workloads

# ✅ Good: Large realistic file
perf record ./target/release/pipeline process testdata/100mb-file.bin

# ❌ Bad: Tiny file (setup overhead dominates)
perf record ./target/release/pipeline process testdata/1kb-file.bin

3. Profile Multiple Scenarios

# Profile different file sizes
perf record -o perf-small.data ./target/release/pipeline process small.bin
perf record -o perf-large.data ./target/release/pipeline process large.bin

# Compare results
perf report -i perf-small.data
perf report -i perf-large.data

4. Combine Profiling Tools

# 1. Flamegraph for overview
cargo flamegraph --bin pipeline -- process test.bin

# 2. perf for detailed analysis
perf record --call-graph dwarf ./target/release/pipeline process test.bin
perf report

# 3. Heaptrack for memory
heaptrack ./target/release/pipeline process test.bin

5. Profile Before and After Optimizations

# Before
cargo flamegraph -o before.svg --bin pipeline -- process test.bin
heaptrack -o before.gz ./target/release/pipeline process test.bin

# Make changes...

# After
cargo flamegraph -o after.svg --bin pipeline -- process test.bin
heaptrack -o after.gz ./target/release/pipeline process test.bin

# Compare visually
open before.svg after.svg

Summary

Profiling tools and techniques provide:

  1. CPU Profiling: Identify hotspots with perf, flamegraph, samply
  2. Memory Profiling: Track allocations with heaptrack, DHAT, Massif
  3. Workflows: Systematic approaches to optimization
  4. Pattern Recognition: Understand common performance issues
  5. Best Practices: Profile release builds with representative workloads

Key Takeaways:

  • Use cargo flamegraph for quick CPU profiling (all platforms)
  • Use heaptrack for comprehensive memory analysis (Linux)
  • Profile release builds (cargo build --release)
  • Use representative workloads (large realistic files)
  • Combine benchmarking (quantify) with profiling (diagnose)
  • Profile before and after optimizations to verify improvements

Quick Start:

# CPU profiling
cargo install flamegraph
cargo flamegraph --bin pipeline -- process large-file.bin

# Memory profiling (Linux)
sudo apt-get install heaptrack
heaptrack ./target/release/pipeline process large-file.bin
heaptrack_gui heaptrack.pipeline.*.gz

Extending the Pipeline

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

This chapter explains how to extend the pipeline with custom functionality, including custom stages, algorithms, and services while maintaining architectural integrity and type safety.

Overview

The pipeline is designed for extensibility through well-defined extension points:

  1. Custom Stages: Add new processing stages with custom logic
  2. Custom Algorithms: Implement new compression, encryption, or hashing algorithms
  3. Custom Services: Create new domain services for specialized operations
  4. Custom Adapters: Add infrastructure adapters for external systems
  5. Custom Metrics: Extend observability with custom metrics collectors

Design Principles:

  • Open/Closed: Open for extension, closed for modification
  • Dependency Inversion: Depend on abstractions (traits), not concretions
  • Single Responsibility: Each extension has one clear purpose
  • Type Safety: Use strong typing to prevent errors

Extension Points

1. Custom Stage Types

Add new stage types by extending the StageType enum.

Current stage types:

#![allow(unused)]
fn main() {
pub enum StageType {
    Compression,    // Data compression/decompression
    Encryption,     // Data encryption/decryption
    Transform,      // Data transformation
    Checksum,       // Integrity verification
    PassThrough,    // No modification
}
}

Adding a new stage type:

#![allow(unused)]
fn main() {
// 1. Extend StageType enum (in pipeline-domain/src/entities/pipeline_stage.rs)
pub enum StageType {
    Compression,
    Encryption,
    Transform,
    Checksum,
    PassThrough,

    // Custom stage types
    Sanitization,   // Data sanitization (e.g., PII removal)
    Validation,     // Data validation
    Deduplication,  // Remove duplicate chunks
}

// 2. Update Display trait
impl std::fmt::Display for StageType {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            // ... existing types ...
            StageType::Sanitization => write!(f, "sanitization"),
            StageType::Validation => write!(f, "validation"),
            StageType::Deduplication => write!(f, "deduplication"),
        }
    }
}

// 3. Update FromStr trait
impl std::str::FromStr for StageType {
    type Err = PipelineError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_lowercase().as_str() {
            // ... existing types ...
            "sanitization" => Ok(StageType::Sanitization),
            "validation" => Ok(StageType::Validation),
            "deduplication" => Ok(StageType::Deduplication),
            _ => Err(PipelineError::InvalidConfiguration(format!(
                "Unknown stage type: {}",
                s
            ))),
        }
    }
}
}

2. Custom Algorithms

Extend algorithm enums to support new implementations.

Example: Custom Compression Algorithm

#![allow(unused)]
fn main() {
// 1. Extend CompressionAlgorithm enum
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum CompressionAlgorithm {
    Brotli,
    Gzip,
    Zstd,
    Lz4,

    // Custom algorithms
    Snappy,    // Google's Snappy compression
    Lzma,      // LZMA/XZ compression
    Custom(u8), // Custom algorithm identifier
}

// 2. Implement Display trait
impl std::fmt::Display for CompressionAlgorithm {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            // ... existing algorithms ...
            CompressionAlgorithm::Snappy => write!(f, "snappy"),
            CompressionAlgorithm::Lzma => write!(f, "lzma"),
            CompressionAlgorithm::Custom(id) => write!(f, "custom-{}", id),
        }
    }
}

// 3. Update FromStr trait
impl std::str::FromStr for CompressionAlgorithm {
    type Err = PipelineError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_lowercase().as_str() {
            // ... existing algorithms ...
            "snappy" => Ok(CompressionAlgorithm::Snappy),
            "lzma" => Ok(CompressionAlgorithm::Lzma),
            s if s.starts_with("custom-") => {
                let id = s.strip_prefix("custom-")
                    .and_then(|id| id.parse::<u8>().ok())
                    .ok_or_else(|| PipelineError::InvalidConfiguration(
                        format!("Invalid custom algorithm ID: {}", s)
                    ))?;
                Ok(CompressionAlgorithm::Custom(id))
            }
            _ => Err(PipelineError::InvalidConfiguration(format!(
                "Unknown compression algorithm: {}",
                s
            ))),
        }
    }
}
}

3. Custom Services

Implement domain service traits for custom functionality.

Example: Custom Compression Service

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::services::CompressionService;
use adaptive_pipeline_domain::{FileChunk, PipelineError, ProcessingContext};

/// Custom compression service using Snappy algorithm
pub struct SnappyCompressionService;

impl SnappyCompressionService {
    pub fn new() -> Self {
        Self
    }
}

impl CompressionService for SnappyCompressionService {
    fn compress(
        &self,
        chunk: FileChunk,
        context: &mut ProcessingContext,
    ) -> Result<FileChunk, PipelineError> {
        use snap::raw::Encoder;

        let start = std::time::Instant::now();

        // Compress data using Snappy
        let mut encoder = Encoder::new();
        let compressed = encoder
            .compress_vec(chunk.data())
            .map_err(|e| PipelineError::CompressionError(format!("Snappy error: {}", e)))?;

        // Update context
        let duration = start.elapsed();
        context.add_bytes_processed(chunk.data().len() as u64);
        context.record_stage_duration(duration);

        // Create compressed chunk
        let mut result = FileChunk::new(
            chunk.sequence_number(),
            chunk.file_offset(),
            compressed,
        );

        // Preserve metadata
        result.set_metadata(chunk.metadata().clone());

        Ok(result)
    }

    fn decompress(
        &self,
        chunk: FileChunk,
        context: &mut ProcessingContext,
    ) -> Result<FileChunk, PipelineError> {
        use snap::raw::Decoder;

        let start = std::time::Instant::now();

        // Decompress data using Snappy
        let mut decoder = Decoder::new();
        let decompressed = decoder
            .decompress_vec(chunk.data())
            .map_err(|e| PipelineError::DecompressionError(format!("Snappy error: {}", e)))?;

        // Update context
        let duration = start.elapsed();
        context.add_bytes_processed(decompressed.len() as u64);
        context.record_stage_duration(duration);

        // Create decompressed chunk
        let mut result = FileChunk::new(
            chunk.sequence_number(),
            chunk.file_offset(),
            decompressed,
        );

        result.set_metadata(chunk.metadata().clone());

        Ok(result)
    }

    fn estimate_compressed_size(&self, chunk: &FileChunk) -> usize {
        // Snappy typically compresses to ~50-70% of original
        (chunk.data().len() as f64 * 0.6) as usize
    }
}
}

4. Custom Adapters

Create infrastructure adapters for external systems.

Example: Custom Cloud Storage Adapter

#![allow(unused)]
fn main() {
use async_trait::async_trait;
use adaptive_pipeline_domain::services::FileIOService;
use std::path::Path;

/// Custom adapter for cloud storage (e.g., S3, Azure Blob)
pub struct CloudStorageAdapter {
    client: CloudClient,
    bucket: String,
}

impl CloudStorageAdapter {
    pub fn new(endpoint: &str, bucket: &str) -> Result<Self, PipelineError> {
        let client = CloudClient::connect(endpoint)?;
        Ok(Self {
            client,
            bucket: bucket.to_string(),
        })
    }
}

#[async_trait]
impl FileIOService for CloudStorageAdapter {
    async fn read_file_chunks(
        &self,
        path: &Path,
        options: ReadOptions,
    ) -> Result<Vec<FileChunk>, PipelineError> {
        // 1. Convert local path to cloud path
        let cloud_path = self.to_cloud_path(path)?;

        // 2. Read file from cloud storage
        let data = self.client
            .get_object(&self.bucket, &cloud_path)
            .await
            .map_err(|e| PipelineError::IOError(format!("Cloud read error: {}", e)))?;

        // 3. Chunk the data
        let chunk_size = options.chunk_size.unwrap_or(64 * 1024);
        let chunks = data
            .chunks(chunk_size)
            .enumerate()
            .map(|(seq, chunk)| {
                FileChunk::new(seq, seq * chunk_size, chunk.to_vec())
            })
            .collect();

        Ok(chunks)
    }

    async fn write_file_data(
        &self,
        path: &Path,
        data: &[u8],
        options: WriteOptions,
    ) -> Result<(), PipelineError> {
        // 1. Convert local path to cloud path
        let cloud_path = self.to_cloud_path(path)?;

        // 2. Write to cloud storage
        self.client
            .put_object(&self.bucket, &cloud_path, data)
            .await
            .map_err(|e| PipelineError::IOError(format!("Cloud write error: {}", e)))?;

        Ok(())
    }

    // ... implement other methods ...
}
}

5. Custom Metrics

Extend observability with custom metrics collectors.

Example: Custom Metrics Collector

#![allow(unused)]
fn main() {
use adaptive_pipeline::infrastructure::metrics::MetricsCollector;
use std::sync::atomic::{AtomicU64, Ordering};

/// Custom metrics collector for advanced analytics
pub struct AdvancedMetricsCollector {
    // Existing metrics
    chunks_processed: AtomicU64,
    bytes_processed: AtomicU64,

    // Custom metrics
    compression_ratio: AtomicU64,  // Stored as f64 bits
    deduplication_hits: AtomicU64,
    cache_efficiency: AtomicU64,   // Percentage * 100
}

impl AdvancedMetricsCollector {
    pub fn new() -> Self {
        Self {
            chunks_processed: AtomicU64::new(0),
            bytes_processed: AtomicU64::new(0),
            compression_ratio: AtomicU64::new(0),
            deduplication_hits: AtomicU64::new(0),
            cache_efficiency: AtomicU64::new(0),
        }
    }

    /// Record compression ratio for a chunk
    pub fn record_compression_ratio(&self, original: usize, compressed: usize) {
        let ratio = (compressed as f64 / original as f64) * 100.0;
        self.compression_ratio.store(ratio.to_bits(), Ordering::Relaxed);
    }

    /// Increment deduplication hits counter
    pub fn record_deduplication_hit(&self) {
        self.deduplication_hits.fetch_add(1, Ordering::Relaxed);
    }

    /// Update cache efficiency percentage
    pub fn update_cache_efficiency(&self, hits: u64, total: u64) {
        let efficiency = ((hits as f64 / total as f64) * 10000.0) as u64;
        self.cache_efficiency.store(efficiency, Ordering::Relaxed);
    }

    /// Get compression ratio
    pub fn compression_ratio(&self) -> f64 {
        f64::from_bits(self.compression_ratio.load(Ordering::Relaxed))
    }

    /// Get deduplication hits
    pub fn deduplication_hits(&self) -> u64 {
        self.deduplication_hits.load(Ordering::Relaxed)
    }

    /// Get cache efficiency percentage
    pub fn cache_efficiency_percent(&self) -> f64 {
        self.cache_efficiency.load(Ordering::Relaxed) as f64 / 100.0
    }
}

impl MetricsCollector for AdvancedMetricsCollector {
    fn record_chunk_processed(&self, size: usize) {
        self.chunks_processed.fetch_add(1, Ordering::Relaxed);
        self.bytes_processed.fetch_add(size as u64, Ordering::Relaxed);
    }

    fn chunks_processed(&self) -> u64 {
        self.chunks_processed.load(Ordering::Relaxed)
    }

    fn bytes_processed(&self) -> u64 {
        self.bytes_processed.load(Ordering::Relaxed)
    }
}
}

Architecture Patterns

Hexagonal Architecture

The pipeline uses hexagonal architecture (Ports and Adapters):

Ports (Interfaces):

  • Domain Services (CompressionService, EncryptionService)
  • Repositories (StageExecutor)
  • Value Objects (StageType, CompressionAlgorithm)

Adapters (Implementations):

  • Infrastructure Services (BrotliCompressionService, AesEncryptionService)
  • Infrastructure Adapters (TokioFileIO, SQLitePipelineRepository)
  • Application Services (ConcurrentPipeline, StreamingFileProcessor)

Extension Strategy (a condensed sketch follows the list):

  1. Define domain trait (port)
  2. Implement infrastructure adapter
  3. Register with dependency injection container
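
Condensed into code, the three steps look roughly like this; ChecksumPort and Blake3Checksum are hypothetical names used only for illustration:

use std::sync::Arc;

// 1. Port: a domain trait describing the capability.
pub trait ChecksumPort: Send + Sync {
    fn checksum(&self, data: &[u8]) -> String;
}

// 2. Adapter: an infrastructure implementation of the port.
pub struct Blake3Checksum;

impl ChecksumPort for Blake3Checksum {
    fn checksum(&self, data: &[u8]) -> String {
        blake3::hash(data).to_hex().to_string()
    }
}

// 3. Registration: wire the adapter in wherever dependencies are assembled.
pub fn build_checksum_service() -> Arc<dyn ChecksumPort> {
    Arc::new(Blake3Checksum)
}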

Dependency Injection

Example: Registering Custom Services

#![allow(unused)]
fn main() {
use std::sync::Arc;

/// Application dependencies container
pub struct AppDependencies {
    compression_service: Arc<dyn CompressionService>,
    encryption_service: Arc<dyn EncryptionService>,
    file_io_service: Arc<dyn FileIOService>,
}

impl AppDependencies {
    pub fn new_with_custom_compression(
        compression: Arc<dyn CompressionService>,
    ) -> Self {
        Self {
            compression_service: compression,
            encryption_service: Arc::new(DefaultEncryptionService::new()),
            file_io_service: Arc::new(DefaultFileIOService::new()),
        }
    }

    pub fn new_with_cloud_storage(
        cloud_adapter: Arc<CloudStorageAdapter>,
    ) -> Self {
        Self {
            compression_service: Arc::new(DefaultCompressionService::new()),
            encryption_service: Arc::new(DefaultEncryptionService::new()),
            file_io_service: cloud_adapter,
        }
    }
}
}

Best Practices

1. Follow Domain-Driven Design

#![allow(unused)]
fn main() {
// ✅ Good: Domain trait in domain layer
// pipeline-domain/src/services/my_service.rs
pub trait MyService: Send + Sync {
    fn process(&self, data: &[u8]) -> Result<Vec<u8>, PipelineError>;
}

// ✅ Good: Infrastructure implementation in infrastructure layer
// pipeline/src/infrastructure/services/my_custom_service.rs
// Note: Use technology-based names (e.g., TokioFileIO, BrotliCompression)
// rather than "Impl" suffix for actual implementations
pub struct MyCustomService {
    config: MyConfig,
}

impl MyService for MyCustomService {
    fn process(&self, data: &[u8]) -> Result<Vec<u8>, PipelineError> {
        // Implementation details
    }
}

// ❌ Bad: Implementation in domain layer
// pipeline-domain/src/services/my_service.rs
pub struct MyCustomService { /* ... */ }  // Wrong layer!
}

2. Use Type-Safe Configuration

#![allow(unused)]
fn main() {
// ✅ Good: Strongly typed configuration
pub struct MyStageConfig {
    algorithm: MyAlgorithm,
    level: u8,
    parallel: bool,
}

impl MyStageConfig {
    pub fn new(algorithm: MyAlgorithm, level: u8) -> Result<Self, PipelineError> {
        if level > 10 {
            return Err(PipelineError::InvalidConfiguration(
                "Level must be 0-10".to_string()
            ));
        }
        Ok(Self { algorithm, level, parallel: true })
    }
}

// ❌ Bad: Stringly-typed configuration
pub struct MyStageConfig {
    algorithm: String,  // Could be anything!
    level: String,      // Not validated!
}
}

3. Implement Proper Error Handling

#![allow(unused)]
fn main() {
// ✅ Good: Specific error types
impl MyService for MyCustomService {
    fn process(&self, data: &[u8]) -> Result<Vec<u8>, PipelineError> {
        self.validate_input(data)
            .map_err(|e| PipelineError::ValidationError(format!("Invalid input: {}", e)))?;

        let result = self.do_processing(data)
            .map_err(|e| PipelineError::ProcessingError(format!("Processing failed: {}", e)))?;

        Ok(result)
    }
}

// ❌ Bad: Generic errors
impl MyService for MyServiceImpl {
    fn process(&self, data: &[u8]) -> Result<Vec<u8>, PipelineError> {
        // Just returns "something went wrong" - not helpful!
        Ok(self.do_processing(data).unwrap())
    }
}
}

4. Add Comprehensive Tests

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    // ✅ Good: Unit tests for service
    #[test]
    fn test_custom_compression_small_data() {
        let service = SnappyCompressionService::new();
        let chunk = FileChunk::new(0, 0, vec![0u8; 1024]);
        let mut context = ProcessingContext::new();

        let result = service.compress(chunk, &mut context).unwrap();

        assert!(result.data().len() < 1024);
        assert_eq!(context.bytes_processed(), 1024);
    }

    // ✅ Good: Integration test
    #[tokio::test]
    async fn test_custom_cloud_adapter_roundtrip() {
        let adapter = CloudStorageAdapter::new("http://localhost:9000", "test").unwrap();
        let test_data = b"Hello, World!";

        adapter.write_file_data(Path::new("test.txt"), test_data, Default::default())
            .await
            .unwrap();

        let chunks = adapter.read_file_chunks(Path::new("test.txt"), Default::default())
            .await
            .unwrap();

        assert_eq!(chunks[0].data(), test_data);
    }

    // ✅ Good: Error case testing
    #[test]
    fn test_custom_compression_invalid_data() {
        let service = SnappyCompressionService::new();
        let corrupt_chunk = FileChunk::new(0, 0, vec![0xFF; 10]);
        let mut context = ProcessingContext::new();

        let result = service.decompress(corrupt_chunk, &mut context);

        assert!(result.is_err());
        assert!(matches!(result.unwrap_err(), PipelineError::DecompressionError(_)));
    }
}
}

5. Document Extension Points

#![allow(unused)]
fn main() {
/// Custom sanitization service for PII removal
///
/// This service implements data sanitization by removing personally
/// identifiable information (PII) from file chunks.
///
/// # Examples
///
/// ```
/// use adaptive_pipeline::infrastructure::services::SanitizationService;
///
/// let service = SanitizationService::new();
/// let sanitized = service.sanitize(chunk, &mut context)?;
/// ```
///
/// # Performance
///
/// - Throughput: ~200 MB/s (regex-based)
/// - Memory: O(chunk_size)
/// - Thread-safe: Yes
///
/// # Algorithms Supported
///
/// - Email addresses (regex: `\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`)
/// - Phone numbers (various formats)
/// - Social Security Numbers (US)
/// - Credit card numbers (Luhn validation)
pub struct SanitizationService {
    patterns: Vec<Regex>,
}
}

Integration Examples

Example 1: Custom Deduplication Stage

Complete implementation:

#![allow(unused)]
fn main() {
// 1. Add to StageType enum
pub enum StageType {
    // ... existing types ...
    Deduplication,
}

// 2. Define domain service
pub trait DeduplicationService: Send + Sync {
    fn deduplicate(
        &self,
        chunk: FileChunk,
        context: &mut ProcessingContext,
    ) -> Result<Option<FileChunk>, PipelineError>;
}

// 3. Implement infrastructure service
pub struct BloomFilterDeduplicationService {
    bloom_filter: Arc<Mutex<BloomFilter>>,
}

impl DeduplicationService for BloomFilterDeduplicationService {
    fn deduplicate(
        &self,
        chunk: FileChunk,
        context: &mut ProcessingContext,
    ) -> Result<Option<FileChunk>, PipelineError> {
        use blake3::hash;

        // Calculate chunk hash
        let hash = hash(chunk.data());

        // Check if chunk seen before
        let mut filter = self.bloom_filter.lock().unwrap();
        if filter.contains(&hash) {
            // Duplicate found
            context.increment_deduplication_hits();
            return Ok(None);
        }

        // New chunk, add to filter
        filter.insert(&hash);
        Ok(Some(chunk))
    }
}

// 4. Register in pipeline configuration
let mut pipeline = Pipeline::new();
pipeline.add_stage(PipelineStage::new(
    StageType::Deduplication,
    "dedup",
    StageConfiguration::new(
        "bloom-filter".to_string(),
        HashMap::new(),
        false,  // Not parallel (shared bloom filter)
    ),
));
}

Summary

The pipeline provides multiple extension points:

  1. Custom Stages: Extend StageType enum and implement processing logic
  2. Custom Algorithms: Add new compression, encryption, or hashing algorithms
  3. Custom Services: Implement domain service traits for specialized operations
  4. Custom Adapters: Create infrastructure adapters for external systems
  5. Custom Metrics: Extend observability with custom metrics collectors

Key Takeaways:

  • Follow hexagonal architecture (domain → application → infrastructure)
  • Use dependency injection for loose coupling
  • Implement domain traits (ports) in infrastructure layer (adapters)
  • Maintain type safety with strong typing and validation
  • Add comprehensive tests for all custom implementations
  • Document extension points and usage examples

Extension Checklist:

  • Define domain trait (if new concept)
  • Implement infrastructure adapter
  • Add unit tests for adapter
  • Add integration tests for end-to-end workflow
  • Document configuration options
  • Update architectural diagrams (if significant)
  • Consider performance impact (benchmark)
  • Verify thread safety (Send + Sync)

Custom Stages

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Stable

This chapter provides a practical guide to creating custom pipeline stages using the unified StageService architecture. All stages—built-in and custom—implement the same traits and follow identical patterns.

Quick Start

Creating a custom stage involves three simple steps:

  1. Implement StageService trait - Define your stage's behavior
  2. Implement FromParameters trait - Extract typed config from HashMap
  3. Register in service registry - Add to main.rs stage_services HashMap

That's it! No enum modifications, no executor updates, no separate trait definitions.

Learning by Example

The codebase includes three production-ready stages that demonstrate the complete pattern:

| Stage       | File                                                             | Description               | Complexity |
|-------------|------------------------------------------------------------------|---------------------------|------------|
| Base64      | pipeline/src/infrastructure/services/base64_encoding_service.rs | Binary-to-text encoding   | ⭐ Simple   |
| PII Masking | pipeline/src/infrastructure/services/pii_masking_service.rs     | Privacy protection        | ⭐⭐ Medium  |
| Tee         | pipeline/src/infrastructure/services/tee_service.rs             | Data inspection/debugging | ⭐⭐ Medium  |

Recommendation: Start by reading base64_encoding_service.rs (220 lines) for a complete, minimal example.

The StageService Pattern

Every stage implements two traits:

1. FromParameters - Type-Safe Configuration

Extracts typed configuration from HashMap<String, String>:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::services::FromParameters;
use adaptive_pipeline_domain::PipelineError;
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Base64Config {
    pub variant: Base64Variant,
}

impl FromParameters for Base64Config {
    fn from_parameters(params: &HashMap<String, String>) -> Result<Self, PipelineError> {
        let variant = params
            .get("variant")
            .map(|s| match s.to_lowercase().as_str() {
                "standard" => Ok(Base64Variant::Standard),
                "url_safe" => Ok(Base64Variant::UrlSafe),
                other => Err(PipelineError::InvalidParameter(format!(
                    "Unknown Base64 variant: {}",
                    other
                ))),
            })
            .transpose()?
            .unwrap_or(Base64Variant::Standard);

        Ok(Self { variant })
    }
}
}

Key Points:

  • Similar to Rust's FromStr trait
  • Validates parameters early (before execution)
  • Provides clear error messages
  • Allows default values with .unwrap_or()

2. StageService - Core Processing Logic

Defines how your stage processes data:

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::services::StageService;
use adaptive_pipeline_domain::entities::{
    Operation, ProcessingContext, StageConfiguration, StagePosition, StageType,
};
use adaptive_pipeline_domain::value_objects::file_chunk::FileChunk;
use adaptive_pipeline_domain::PipelineError;

pub struct Base64EncodingService;

impl StageService for Base64EncodingService {
    fn process_chunk(
        &self,
        chunk: FileChunk,
        config: &StageConfiguration,
        context: &mut ProcessingContext,
    ) -> Result<FileChunk, PipelineError> {
        // 1. Extract typed config
        let base64_config = Base64Config::from_parameters(&config.parameters)?;

        // 2. Process based on operation
        let processed_data = match config.operation {
            Operation::Forward => self.encode(chunk.data(), base64_config.variant),
            Operation::Reverse => self.decode(chunk.data(), base64_config.variant)?,
        };

        // 3. Return processed chunk
        chunk.with_data(processed_data)
    }

    fn position(&self) -> StagePosition {
        StagePosition::PreBinary  // Before compression/encryption
    }

    fn is_reversible(&self) -> bool {
        true  // Can encode AND decode
    }

    fn stage_type(&self) -> StageType {
        StageType::Transform
    }
}
}

Key Points:

  • process_chunk() - The core processing logic
  • position() - Where in pipeline (PreBinary/PostBinary/Any)
  • is_reversible() - Whether stage supports reverse operation
  • stage_type() - Category (Transform/Compression/Encryption/etc.)

Stage Position (Binary Boundary)

The position() method enforces ordering constraints:

┌──────────────┬─────────────┬──────────────┐
│  PreBinary   │   Binary    │  PostBinary  │
│   Stages     │  Boundary   │    Stages    │
├──────────────┼─────────────┼──────────────┤
│              │             │              │
│ Base64       │ Compression │              │
│ PII Masking  │ Encryption  │              │
│              │             │              │
└──────────────┴─────────────┴──────────────┘
     ↑                             ↑
     └─ Sees readable data         └─ Sees binary data

PreBinary:

  • Executes BEFORE compression/encryption
  • Sees readable, original data format
  • Examples: Base64, PII Masking, format conversion

PostBinary:

  • Executes AFTER compression/encryption
  • Sees final binary output
  • Examples: checksumming, digital signatures

Any:

  • Can execute anywhere in pipeline
  • Typically diagnostic/observability stages
  • Examples: Tee (data inspection), metrics

Complete Example: Custom ROT13 Stage

Here's a minimal custom stage (rot13 cipher):

#![allow(unused)]
fn main() {
// pipeline/src/infrastructure/services/rot13_service.rs

use adaptive_pipeline_domain::entities::{
    Operation, ProcessingContext, StageConfiguration, StagePosition, StageType,
};
use adaptive_pipeline_domain::services::{FromParameters, StageService};
use adaptive_pipeline_domain::value_objects::file_chunk::FileChunk;
use adaptive_pipeline_domain::PipelineError;
use std::collections::HashMap;

/// Configuration for ROT13 cipher (no parameters needed)
#[derive(Debug, Clone, Default)]
pub struct Rot13Config;

impl FromParameters for Rot13Config {
    fn from_parameters(_params: &HashMap<String, String>) -> Result<Self, PipelineError> {
        Ok(Self)  // No parameters to parse
    }
}

/// ROT13 cipher service (simple character rotation)
pub struct Rot13Service;

impl Rot13Service {
    pub fn new() -> Self {
        Self
    }

    fn rot13_byte(b: u8) -> u8 {
        match b {
            b'A'..=b'Z' => ((b - b'A' + 13) % 26) + b'A',
            b'a'..=b'z' => ((b - b'a' + 13) % 26) + b'a',
            _ => b,
        }
    }
}

impl StageService for Rot13Service {
    fn process_chunk(
        &self,
        chunk: FileChunk,
        config: &StageConfiguration,
        _context: &mut ProcessingContext,
    ) -> Result<FileChunk, PipelineError> {
        let _ = Rot13Config::from_parameters(&config.parameters)?;

        // ROT13 is self-inverse: same operation for Forward and Reverse
        let processed: Vec<u8> = chunk.data()
            .iter()
            .map(|&b| Self::rot13_byte(b))
            .collect();

        chunk.with_data(processed)
    }

    fn position(&self) -> StagePosition {
        StagePosition::PreBinary  // Must execute before encryption
    }

    fn is_reversible(&self) -> bool {
        true  // ROT13 is self-inverse
    }

    fn stage_type(&self) -> StageType {
        StageType::Transform
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_rot13_roundtrip() {
        let service = Rot13Service::new();
        let chunk = FileChunk::new(0, 0, b"Hello, World!".to_vec(), false).unwrap();
        let config = StageConfiguration::default();
        let mut context = ProcessingContext::default();

        // Encode
        let encoded = service.process_chunk(chunk, &config, &mut context).unwrap();
        assert_eq!(encoded.data(), b"Uryyb, Jbeyq!");

        // Decode (same operation)
        let decoded = service.process_chunk(encoded, &config, &mut context).unwrap();
        assert_eq!(decoded.data(), b"Hello, World!");
    }
}
}

Registration and Usage

1. Export from Module

#![allow(unused)]
fn main() {
// pipeline/src/infrastructure/services.rs

pub mod base64_encoding_service;
pub mod pii_masking_service;
pub mod tee_service;
pub mod rot13_service;  // Add your service

pub use base64_encoding_service::Base64EncodingService;
pub use pii_masking_service::PiiMaskingService;
pub use tee_service::TeeService;
pub use rot13_service::Rot13Service;  // Export
}

2. Register in main.rs

#![allow(unused)]
fn main() {
// pipeline/src/main.rs

use adaptive_pipeline::infrastructure::services::{
    Base64EncodingService, PiiMaskingService, TeeService, Rot13Service,
};

// Build stage service registry
let mut stage_services: HashMap<String, Arc<dyn StageService>> = HashMap::new();

// Register built-in stages
stage_services.insert("brotli".to_string(), compression_service.clone());
stage_services.insert("aes256gcm".to_string(), encryption_service.clone());

// Register production transform stages
stage_services.insert("base64".to_string(), Arc::new(Base64EncodingService::new()));
stage_services.insert("pii_masking".to_string(), Arc::new(PiiMaskingService::new()));
stage_services.insert("tee".to_string(), Arc::new(TeeService::new()));

// Register your custom stage
stage_services.insert("rot13".to_string(), Arc::new(Rot13Service::new()));

// Create stage executor with registry
let stage_executor = Arc::new(BasicStageExecutor::new(stage_services));
}

3. Use in Pipeline Configuration

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::entities::pipeline_stage::{PipelineStage, StageConfiguration, StageType};
use adaptive_pipeline_domain::entities::Operation;
use std::collections::HashMap;

// Create ROT13 stage
let rot13_stage = PipelineStage::new(
    "rot13_encode".to_string(),
    StageType::Transform,
    StageConfiguration {
        algorithm: "rot13".to_string(),  // Matches registry key
        operation: Operation::Forward,
        parameters: HashMap::new(),       // No parameters needed
        parallel_processing: false,
        chunk_size: None,
    },
    1,  // Order
).unwrap();

// Add to pipeline
pipeline.add_stage(rot13_stage)?;
}

Design Patterns and Best Practices

Pattern 1: Parameterless Stages

For stages without configuration:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Default)]
pub struct MyConfig;

impl FromParameters for MyConfig {
    fn from_parameters(_params: &HashMap<String, String>) -> Result<Self, PipelineError> {
        Ok(Self)  // Always succeeds, no validation needed
    }
}
}

Pattern 2: Optional Parameters with Defaults

#![allow(unused)]
fn main() {
impl FromParameters for TeeConfig {
    fn from_parameters(params: &HashMap<String, String>) -> Result<Self, PipelineError> {
        // Required parameter
        let output_path = params
            .get("output_path")
            .ok_or_else(|| PipelineError::MissingParameter("output_path".into()))?;

        // Optional with default
        let format = params
            .get("format")
            .map(|s| TeeFormat::from_str(s))
            .transpose()?
            .unwrap_or(TeeFormat::Binary);

        Ok(Self { output_path: output_path.into(), format })
    }
}
}

Pattern 3: Non-Reversible Stages

For one-way operations:

#![allow(unused)]
fn main() {
impl StageService for PiiMaskingService {
    fn process_chunk(...) -> Result<FileChunk, PipelineError> {
        match config.operation {
            Operation::Forward => {
                // Mask PII
                self.mask_data(chunk.data(), &pii_config)?
            }
            Operation::Reverse => {
                // Cannot unmask - data is destroyed
                return Err(PipelineError::ProcessingFailed(
                    "PII masking is not reversible".to_string()
                ));
            }
        }
    }

    fn is_reversible(&self) -> bool {
        false  // Important: declares non-reversibility
    }
}
}

Pattern 4: Lazy Static Resources

For expensive initialization (regex, schemas, etc.):

#![allow(unused)]
fn main() {
use once_cell::sync::Lazy;
use regex::Regex;

static EMAIL_REGEX: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
        .expect("Invalid email regex")
});

impl StageService for PiiMaskingService {
    fn process_chunk(...) -> Result<FileChunk, PipelineError> {
        // Regex compiled once, reused for all chunks
        let masked = EMAIL_REGEX.replace_all(&text, "***");
        // ...
    }
}
}

Testing Your Stage

Unit Tests

Test your service in isolation:

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;
    use adaptive_pipeline_domain::entities::{SecurityContext, SecurityLevel};
    use std::path::PathBuf;

    #[test]
    fn test_config_parsing() {
        let mut params = HashMap::new();
        params.insert("variant".to_string(), "url_safe".to_string());

        let config = Base64Config::from_parameters(&params).unwrap();
        assert_eq!(config.variant, Base64Variant::UrlSafe);
    }

    #[test]
    fn test_forward_operation() {
        let service = Base64EncodingService::new();
        let chunk = FileChunk::new(0, 0, b"Hello, World!".to_vec(), false).unwrap();
        let config = StageConfiguration::default();
        let mut context = ProcessingContext::new(
            PathBuf::from("/tmp/in"),
            PathBuf::from("/tmp/out"),
            100,
            SecurityContext::new(None, SecurityLevel::Public),
        );

        let result = service.process_chunk(chunk, &config, &mut context).unwrap();

        // Base64 of "Hello, World!" is "SGVsbG8sIFdvcmxkIQ=="
        assert_eq!(result.data(), b"SGVsbG8sIFdvcmxkIQ==");
    }

    #[test]
    fn test_roundtrip() {
        let service = Base64EncodingService::new();
        let original = b"Test data";
        let chunk = FileChunk::new(0, 0, original.to_vec(), false).unwrap();

        let mut config = StageConfiguration::default();
        let mut context = ProcessingContext::default();

        // Encode
        config.operation = Operation::Forward;
        let encoded = service.process_chunk(chunk, &config, &mut context).unwrap();

        // Decode
        config.operation = Operation::Reverse;
        let decoded = service.process_chunk(encoded, &config, &mut context).unwrap();

        assert_eq!(decoded.data(), original);
    }
}
}

Integration Tests

Test stage within pipeline:

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_rot13_in_pipeline() {
    // Build stage service registry
    let mut stage_services = HashMap::new();
    stage_services.insert("rot13".to_string(),
        Arc::new(Rot13Service::new()) as Arc<dyn StageService>);

    let stage_executor = Arc::new(BasicStageExecutor::new(stage_services));

    // Create pipeline with ROT13 stage
    let rot13_stage = PipelineStage::new(
        "rot13".to_string(),
        StageType::Transform,
        StageConfiguration {
            algorithm: "rot13".to_string(),
            operation: Operation::Forward,
            parameters: HashMap::new(),
            parallel_processing: false,
            chunk_size: None,
        },
        1,
    ).unwrap();

    let pipeline = Pipeline::new("test".to_string(), vec![rot13_stage]).unwrap();

    // Execute
    let chunk = FileChunk::new(0, 0, b"Hello!".to_vec(), false).unwrap();
    let mut context = ProcessingContext::default();

    let result = stage_executor
        .execute(pipeline.stages()[1], chunk, &mut context)  // [1] skips input_checksum
        .await
        .unwrap();

    assert_eq!(result.data(), b"Uryyb!");
}
}

Real-World Examples

Production Stages in Codebase

Study these files for complete, tested implementations:

  1. base64_encoding_service.rs (220 lines)

    • Simple reversible transformation
    • Multiple config variants
    • Clean error handling
    • Complete test coverage
  2. pii_masking_service.rs (309 lines)

    • Non-reversible transformation
    • Multiple regex patterns
    • Lazy static initialization
    • Complex configuration
  3. tee_service.rs (380 lines)

    • Pass-through with side effects
    • File I/O operations
    • Multiple output formats
    • Position::Any usage

Common Pitfalls

❌ Don't: Modify Domain Enums

#![allow(unused)]
fn main() {
// WRONG: Don't add to StageType enum
pub enum StageType {
    Transform,
    MyCustomStage,  // ❌ Not needed!
}
}

Correct: Just use StageType::Transform and register with a unique algorithm name.

❌ Don't: Create Separate Traits

#![allow(unused)]
fn main() {
// WRONG: Don't create custom trait
pub trait MyCustomService: Send + Sync {
    fn process(...);
}
}

Correct: Implement StageService directly. All stages use the same trait.

❌ Don't: Manual Executor Updates

#![allow(unused)]
fn main() {
// WRONG: Don't modify stage executor
impl StageExecutor for BasicStageExecutor {
    async fn execute(...) {
        match algorithm {
            "my_stage" => // manual dispatch ❌
        }
    }
}
}

Correct: Just register in stage_services HashMap. Executor uses the registry.

Summary

Creating custom stages with the unified architecture:

Three Steps:

  1. Implement StageService trait (process_chunk, position, is_reversible, stage_type)
  2. Implement FromParameters trait (type-safe config extraction)
  3. Register in main.rs stage_services HashMap

Key Benefits:

  • No enum modifications needed
  • No executor changes required
  • Type-safe configuration
  • Automatic validation (ordering, parameters)
  • Same pattern for all stages

Learn More:

  • Read production stages: base64_encoding_service.rs, pii_masking_service.rs, tee_service.rs
  • See pipeline-domain/src/services/stage_service.rs for trait definitions
  • See pipeline-domain/src/entities/pipeline_stage.rs for StagePosition documentation

Quick Reference:

#![allow(unused)]
fn main() {
// Minimal custom stage template
pub struct MyService;

impl StageService for MyService {
    fn process_chunk(...) -> Result<FileChunk, PipelineError> {
        let config = MyConfig::from_parameters(&config.parameters)?;
        // ... process data ...
        chunk.with_data(processed_data)
    }
    fn position(&self) -> StagePosition { StagePosition::PreBinary }
    fn is_reversible(&self) -> bool { true }
    fn stage_type(&self) -> StageType { StageType::Transform }
}

impl FromParameters for MyConfig {
    fn from_parameters(params: &HashMap<String, String>) -> Result<Self, PipelineError> {
        // ... parse parameters ...
        Ok(Self { /* config fields */ })
    }
}
}

Custom Algorithms

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

This chapter demonstrates how to implement custom compression, encryption, and hashing algorithms while integrating them seamlessly with the pipeline's existing infrastructure.

Overview

The pipeline supports custom algorithm implementations for:

  • Compression: Add new compression algorithms (e.g., Snappy, LZMA, custom formats)
  • Encryption: Implement new ciphers (e.g., alternative AEADs, custom protocols)
  • Hashing: Add new checksum algorithms (e.g., BLAKE3, xxHash, custom)

Key Concepts:

  • Algorithm Enum: Type-safe algorithm identifier
  • Service Trait: Domain interface defining algorithm operations
  • Service Implementation: Concrete algorithm implementation
  • Configuration: Algorithm-specific parameters and tuning

Custom Compression Algorithm

Step 1: Extend CompressionAlgorithm Enum

#![allow(unused)]
fn main() {
// pipeline-domain/src/value_objects/algorithm.rs

#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum CompressionAlgorithm {
    Brotli,
    Gzip,
    Zstd,
    Lz4,

    // Custom algorithms
    Snappy,         // Google's Snappy
    Lzma,           // LZMA/XZ compression
    Custom(u8),     // Custom algorithm ID
}

impl std::fmt::Display for CompressionAlgorithm {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            // ... existing ...
            CompressionAlgorithm::Snappy => write!(f, "snappy"),
            CompressionAlgorithm::Lzma => write!(f, "lzma"),
            CompressionAlgorithm::Custom(id) => write!(f, "custom-{}", id),
        }
    }
}

impl std::str::FromStr for CompressionAlgorithm {
    type Err = PipelineError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_lowercase().as_str() {
            // ... existing ...
            "snappy" => Ok(CompressionAlgorithm::Snappy),
            "lzma" => Ok(CompressionAlgorithm::Lzma),
            s if s.starts_with("custom-") => {
                let id = s.strip_prefix("custom-")
                    .and_then(|id| id.parse::<u8>().ok())
                    .ok_or_else(|| PipelineError::InvalidConfiguration(
                        format!("Invalid custom ID: {}", s)
                    ))?;
                Ok(CompressionAlgorithm::Custom(id))
            }
            _ => Err(PipelineError::InvalidConfiguration(format!(
                "Unknown algorithm: {}",
                s
            ))),
        }
    }
}
}

Step 2: Implement CompressionService

#![allow(unused)]
fn main() {
// pipeline/src/infrastructure/services/snappy_compression_service.rs

use adaptive_pipeline_domain::services::CompressionService;
use adaptive_pipeline_domain::{FileChunk, PipelineError, ProcessingContext};
use snap::raw::{Encoder, Decoder};

/// Snappy compression service implementation
///
/// Snappy is a fast compression/decompression library developed by Google.
/// It does not aim for maximum compression, or compatibility with any other
/// compression library; instead, it aims for very high speeds and reasonable
/// compression.
///
/// **Performance Characteristics:**
/// - Compression: 250-500 MB/s
/// - Decompression: 500-1000 MB/s
/// - Compression ratio: ~50-70% of original size
pub struct SnappyCompressionService;

impl SnappyCompressionService {
    pub fn new() -> Self {
        Self
    }
}

impl CompressionService for SnappyCompressionService {
    fn compress(
        &self,
        chunk: FileChunk,
        context: &mut ProcessingContext,
    ) -> Result<FileChunk, PipelineError> {
        let start = std::time::Instant::now();

        // Compress using Snappy
        let mut encoder = Encoder::new();
        let compressed = encoder
            .compress_vec(chunk.data())
            .map_err(|e| PipelineError::CompressionError(format!("Snappy: {}", e)))?;

        // Update metrics
        let duration = start.elapsed();
        context.add_bytes_processed(chunk.data().len() as u64);
        context.record_stage_duration(duration);

        // Create compressed chunk
        let mut result = FileChunk::new(
            chunk.sequence_number(),
            chunk.file_offset(),
            compressed,
        );

        result.set_metadata(chunk.metadata().clone());

        Ok(result)
    }

    fn decompress(
        &self,
        chunk: FileChunk,
        context: &mut ProcessingContext,
    ) -> Result<FileChunk, PipelineError> {
        let start = std::time::Instant::now();

        // Decompress using Snappy
        let mut decoder = Decoder::new();
        let decompressed = decoder
            .decompress_vec(chunk.data())
            .map_err(|e| PipelineError::DecompressionError(format!("Snappy: {}", e)))?;

        // Update metrics
        let duration = start.elapsed();
        context.add_bytes_processed(decompressed.len() as u64);
        context.record_stage_duration(duration);

        // Create decompressed chunk
        let mut result = FileChunk::new(
            chunk.sequence_number(),
            chunk.file_offset(),
            decompressed,
        );

        result.set_metadata(chunk.metadata().clone());

        Ok(result)
    }

    fn estimate_compressed_size(&self, chunk: &FileChunk) -> usize {
        // Snappy typically achieves ~50-70% compression
        (chunk.data().len() as f64 * 0.6) as usize
    }
}

impl Default for SnappyCompressionService {
    fn default() -> Self {
        Self::new()
    }
}
}

Step 3: Register Algorithm

#![allow(unused)]
fn main() {
// In service factory or dependency injection

use std::sync::Arc;
use adaptive_pipeline_domain::services::CompressionService;

fn create_compression_service(
    algorithm: CompressionAlgorithm,
) -> Result<Arc<dyn CompressionService>, PipelineError> {
    match algorithm {
        CompressionAlgorithm::Brotli => Ok(Arc::new(BrotliCompressionService::new())),
        CompressionAlgorithm::Gzip => Ok(Arc::new(GzipCompressionService::new())),
        CompressionAlgorithm::Zstd => Ok(Arc::new(ZstdCompressionService::new())),
        CompressionAlgorithm::Lz4 => Ok(Arc::new(Lz4CompressionService::new())),

        // Custom algorithms
        CompressionAlgorithm::Snappy => Ok(Arc::new(SnappyCompressionService::new())),
        CompressionAlgorithm::Lzma => Ok(Arc::new(LzmaCompressionService::new())),

        CompressionAlgorithm::Custom(id) => {
            Err(PipelineError::InvalidConfiguration(format!(
                "Custom algorithm {} not registered",
                id
            )))
        }
    }
}
}

Custom Encryption Algorithm

Step 1: Extend EncryptionAlgorithm Enum

#![allow(unused)]
fn main() {
// pipeline-domain/src/value_objects/algorithm.rs

#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum EncryptionAlgorithm {
    Aes256Gcm,
    ChaCha20Poly1305,
    XChaCha20Poly1305,

    // Custom algorithms
    Aes128Gcm,      // AES-128-GCM (faster, less secure)
    Twofish,        // Twofish cipher
    Custom(u8),     // Custom algorithm ID
}
}

Step 2: Implement EncryptionService

#![allow(unused)]
fn main() {
// pipeline/src/infrastructure/services/aes128_encryption_service.rs

use adaptive_pipeline_domain::services::EncryptionService;
use adaptive_pipeline_domain::{FileChunk, PipelineError, ProcessingContext};
use aes_gcm::{
    aead::{Aead, KeyInit, OsRng},
    Aes128Gcm, Nonce,
};

/// AES-128-GCM encryption service
///
/// This implementation uses AES-128-GCM, which is faster than AES-256-GCM
/// but provides a lower security margin (128-bit vs 256-bit key).
///
/// **Use Cases:**
/// - Performance-critical applications
/// - Scenarios where 128-bit security is sufficient
/// - Systems without AES-NI support (software fallback)
///
/// **Performance:**
/// - Encryption: 800-1200 MB/s (with AES-NI)
/// - Encryption: 150-300 MB/s (software)
pub struct Aes128EncryptionService {
    cipher: Aes128Gcm,
}

impl Aes128EncryptionService {
    pub fn new(key: &[u8; 16]) -> Self {
        let cipher = Aes128Gcm::new(key.into());
        Self { cipher }
    }

    pub fn generate_key() -> [u8; 16] {
        use rand::RngCore;
        let mut key = [0u8; 16];
        OsRng.fill_bytes(&mut key);
        key
    }
}

impl EncryptionService for Aes128EncryptionService {
    fn encrypt(
        &self,
        chunk: FileChunk,
        context: &mut ProcessingContext,
    ) -> Result<FileChunk, PipelineError> {
        let start = std::time::Instant::now();

        // Generate nonce (96-bit for GCM)
        let mut nonce_bytes = [0u8; 12];
        use rand::RngCore;
        OsRng.fill_bytes(&mut nonce_bytes);
        let nonce = Nonce::from_slice(&nonce_bytes);

        // Encrypt data
        let ciphertext = self.cipher
            .encrypt(nonce, chunk.data())
            .map_err(|e| PipelineError::EncryptionError(format!("AES-128-GCM: {}", e)))?;

        // Prepend nonce to ciphertext
        let mut encrypted = nonce_bytes.to_vec();
        encrypted.extend_from_slice(&ciphertext);

        // Update metrics
        let duration = start.elapsed();
        context.add_bytes_processed(chunk.data().len() as u64);
        context.record_stage_duration(duration);

        // Create encrypted chunk
        let mut result = FileChunk::new(
            chunk.sequence_number(),
            chunk.file_offset(),
            encrypted,
        );

        result.set_metadata(chunk.metadata().clone());

        Ok(result)
    }

    fn decrypt(
        &self,
        chunk: FileChunk,
        context: &mut ProcessingContext,
    ) -> Result<FileChunk, PipelineError> {
        let start = std::time::Instant::now();

        // Extract nonce from beginning
        if chunk.data().len() < 12 {
            return Err(PipelineError::DecryptionError(
                "Encrypted data too short".to_string()
            ));
        }

        let (nonce_bytes, ciphertext) = chunk.data().split_at(12);
        let nonce = Nonce::from_slice(nonce_bytes);

        // Decrypt data
        let plaintext = self.cipher
            .decrypt(nonce, ciphertext)
            .map_err(|e| PipelineError::DecryptionError(format!("AES-128-GCM: {}", e)))?;

        // Update metrics
        let duration = start.elapsed();
        context.add_bytes_processed(plaintext.len() as u64);
        context.record_stage_duration(duration);

        // Create decrypted chunk
        let mut result = FileChunk::new(
            chunk.sequence_number(),
            chunk.file_offset(),
            plaintext,
        );

        result.set_metadata(chunk.metadata().clone());

        Ok(result)
    }
}
}

Custom Hashing Algorithm

Step 1: Extend HashAlgorithm Enum

#![allow(unused)]
fn main() {
// pipeline-domain/src/value_objects/algorithm.rs

#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
pub enum HashAlgorithm {
    Sha256,
    Blake3,

    // Custom algorithms
    XxHash,         // xxHash (extremely fast)
    Blake2b,        // BLAKE2b
    Custom(u8),     // Custom algorithm ID
}
}

Step 2: Implement ChecksumService

#![allow(unused)]
fn main() {
// pipeline/src/infrastructure/services/xxhash_checksum_service.rs

use adaptive_pipeline_domain::services::ChecksumService;
use adaptive_pipeline_domain::{FileChunk, PipelineError, ProcessingContext, Checksum};
use xxhash_rust::xxh3::Xxh3;

/// xxHash checksum service implementation
///
/// xxHash is an extremely fast non-cryptographic hash algorithm,
/// working at RAM speed limits.
///
/// **Performance:**
/// - Hashing: 10-30 GB/s (on modern CPUs)
/// - ~10-20x faster than SHA-256
/// - ~2-3x faster than BLAKE3
///
/// **Use Cases:**
/// - Data integrity verification (non-cryptographic)
/// - Deduplication
/// - Hash tables
/// - NOT for security (use SHA-256 or BLAKE3 instead)
pub struct XxHashChecksumService;

impl XxHashChecksumService {
    pub fn new() -> Self {
        Self
    }
}

impl ChecksumService for XxHashChecksumService {
    fn calculate_checksum(
        &self,
        chunk: &FileChunk,
        context: &mut ProcessingContext,
    ) -> Result<Checksum, PipelineError> {
        let start = std::time::Instant::now();

        // Calculate xxHash64
        let mut hasher = Xxh3::new();
        hasher.update(chunk.data());
        let hash = hasher.digest();

        // Convert to bytes (big-endian)
        let hash_bytes = hash.to_be_bytes();

        // Update metrics
        let duration = start.elapsed();
        context.record_stage_duration(duration);

        Ok(Checksum::new(hash_bytes.to_vec()))
    }

    fn verify_checksum(
        &self,
        chunk: &FileChunk,
        expected: &Checksum,
        context: &mut ProcessingContext,
    ) -> Result<bool, PipelineError> {
        let calculated = self.calculate_checksum(chunk, context)?;
        Ok(calculated == *expected)
    }
}

impl Default for XxHashChecksumService {
    fn default() -> Self {
        Self::new()
    }
}
}

Algorithm Configuration

Compression Configuration

#![allow(unused)]
fn main() {
use adaptive_pipeline_domain::entities::StageConfiguration;
use std::collections::HashMap;

// Snappy configuration (minimal parameters)
let snappy_config = StageConfiguration::new(
    "snappy".to_string(),
    HashMap::new(),  // No tuning parameters
    true,            // Parallel processing
);

// LZMA configuration (with level)
let lzma_config = StageConfiguration::new(
    "lzma".to_string(),
    HashMap::from([
        ("level".to_string(), "6".to_string()),  // 0-9
    ]),
    true,
);
}

Encryption Configuration

#![allow(unused)]
fn main() {
// AES-128-GCM configuration
let aes128_config = StageConfiguration::new(
    "aes128gcm".to_string(),
    HashMap::from([
        ("key".to_string(), base64::encode(&key)),
    ]),
    true,
);
}

Hashing Configuration

#![allow(unused)]
fn main() {
// xxHash configuration
let xxhash_config = StageConfiguration::new(
    "xxhash".to_string(),
    HashMap::new(),
    true,
);
}

Performance Comparison

Compression Algorithms

| Algorithm | Compression Speed | Decompression Speed | Ratio | Use Case |
|-----------|-------------------|---------------------|-------|----------|
| LZ4 | 500-700 MB/s | 2000-3000 MB/s | 2-3x | Real-time, low latency |
| Snappy | 250-500 MB/s | 500-1000 MB/s | 1.5-2x | Google services, fast |
| Zstd | 200-400 MB/s | 600-800 MB/s | 3-5x | Modern balanced |
| Brotli | 50-150 MB/s | 300-500 MB/s | 4-8x | Web, maximum compression |
| LZMA | 10-30 MB/s | 50-100 MB/s | 5-10x | Archival, best ratio |

Encryption Algorithms

| Algorithm | Encryption Speed | Security | Hardware Support |
|-----------|------------------|----------|------------------|
| AES-128-GCM | 800-1200 MB/s | Good | Yes (AES-NI) |
| AES-256-GCM | 400-800 MB/s | Excellent | Yes (AES-NI) |
| ChaCha20 | 200-400 MB/s | Excellent | No |

Hashing Algorithms

| Algorithm | Throughput | Security | Use Case |
|-----------|------------|----------|----------|
| xxHash | 10-30 GB/s | None | Integrity, dedup |
| BLAKE3 | 3-10 GB/s | Cryptographic | General purpose |
| SHA-256 | 400-800 MB/s | Cryptographic | Security, signatures |

Testing Custom Algorithms

Unit Tests

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_snappy_compression_roundtrip() {
        let service = SnappyCompressionService::new();
        let original_data = b"Hello, World! ".repeat(100);
        let chunk = FileChunk::new(0, 0, original_data.to_vec());
        let mut context = ProcessingContext::new();

        // Compress
        let compressed = service.compress(chunk.clone(), &mut context).unwrap();
        assert!(compressed.data().len() < original_data.len());

        // Decompress
        let decompressed = service.decompress(compressed, &mut context).unwrap();
        assert_eq!(decompressed.data(), &original_data);
    }

    #[test]
    fn test_aes128_encryption_roundtrip() {
        let key = Aes128EncryptionService::generate_key();
        let service = Aes128EncryptionService::new(&key);
        let original_data = b"Secret message";
        let chunk = FileChunk::new(0, 0, original_data.to_vec());
        let mut context = ProcessingContext::new();

        // Encrypt
        let encrypted = service.encrypt(chunk.clone(), &mut context).unwrap();
        assert_ne!(encrypted.data(), original_data);

        // Decrypt
        let decrypted = service.decrypt(encrypted, &mut context).unwrap();
        assert_eq!(decrypted.data(), original_data);
    }

    #[test]
    fn test_xxhash_checksum() {
        let service = XxHashChecksumService::new();
        let data = b"Test data";
        let chunk = FileChunk::new(0, 0, data.to_vec());
        let mut context = ProcessingContext::new();

        let checksum = service.calculate_checksum(&chunk, &mut context).unwrap();
        assert_eq!(checksum.bytes().len(), 8);  // 64-bit hash

        // Verify
        let valid = service.verify_checksum(&chunk, &checksum, &mut context).unwrap();
        assert!(valid);
    }
}
}

Benchmark Custom Algorithm

#![allow(unused)]
fn main() {
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn benchmark_snappy_vs_lz4(c: &mut Criterion) {
    let snappy = SnappyCompressionService::new();
    let lz4 = Lz4CompressionService::new();

    let test_data = vec![0u8; 1024 * 1024];  // 1 MB
    let chunk = FileChunk::new(0, 0, test_data);
    let mut context = ProcessingContext::new();

    let mut group = c.benchmark_group("compression");

    group.bench_function("snappy", |b| {
        b.iter(|| {
            snappy.compress(black_box(chunk.clone()), &mut context).unwrap()
        });
    });

    group.bench_function("lz4", |b| {
        b.iter(|| {
            lz4.compress(black_box(chunk.clone()), &mut context).unwrap()
        });
    });

    group.finish();
}

criterion_group!(benches, benchmark_snappy_vs_lz4);
criterion_main!(benches);
}

Best Practices

1. Choose Appropriate Algorithms

#![allow(unused)]
fn main() {
// ✅ Good: Match algorithm to use case
let compression = if priority == Speed {
    CompressionAlgorithm::Snappy  // Fastest
} else if priority == Ratio {
    CompressionAlgorithm::Lzma    // Best ratio
} else {
    CompressionAlgorithm::Zstd    // Balanced
};

// ❌ Bad: Always use same algorithm
let compression = CompressionAlgorithm::Brotli;  // Slow!
}

2. Handle Errors Gracefully

#![allow(unused)]
fn main() {
// ✅ Good: Descriptive errors
fn compress(&self, chunk: FileChunk) -> Result<FileChunk, PipelineError> {
    encoder.compress_vec(chunk.data())
        .map_err(|e| PipelineError::CompressionError(
            format!("Snappy compression failed: {}", e)
        ))?
}

// ❌ Bad: Generic errors
fn compress(&self, chunk: FileChunk) -> Result<FileChunk, PipelineError> {
    encoder.compress_vec(chunk.data()).unwrap()  // Panics!
}
}

3. Benchmark Performance

#![allow(unused)]
fn main() {
// Always benchmark custom algorithms
#[bench]
fn bench_custom_algorithm(b: &mut Bencher) {
    let service = MyCustomService::new();
    let chunk = FileChunk::new(0, 0, vec![0u8; 1024 * 1024]);

    b.iter(|| {
        service.process(black_box(chunk.clone()))
    });
}
}

Summary

Implementing custom algorithms involves:

  1. Extend Algorithm Enum: Add variant for new algorithm
  2. Implement Service Trait: Create concrete implementation
  3. Register Algorithm: Add to service factory
  4. Test Thoroughly: Unit tests, integration tests, benchmarks
  5. Document Performance: Measure and document characteristics

Key Takeaways:

  • Choose algorithms based on workload (speed vs ratio vs security)
  • Implement complete error handling with specific error types
  • Benchmark against existing algorithms
  • Document performance characteristics
  • Add comprehensive tests (correctness + performance)
  • Consider hardware acceleration (AES-NI, SIMD)

Algorithm Implementation Checklist:

  • Extend algorithm enum (Display + FromStr)
  • Implement service trait
  • Add unit tests (correctness)
  • Add roundtrip tests (compress/decompress, encrypt/decrypt)
  • Benchmark performance
  • Compare with existing algorithms
  • Document use cases and characteristics
  • Register in service factory
  • Update configuration documentation

Software Requirements Specification

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

Comprehensive software requirements specification.

Introduction

TODO: Add SRS introduction

Functional Requirements

TODO: List functional requirements

Non-Functional Requirements

TODO: List non-functional requirements

System Requirements

TODO: List system requirements

Software Design Document (SDD)

Version: 1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner, Claude Code Status: Active


1. Introduction

1.1 Purpose

This Software Design Document (SDD) describes the architectural and detailed design of the Adaptive Pipeline system. It provides technical specifications for developers implementing, maintaining, and extending the system.

Intended Audience:

  • Software developers implementing features
  • Technical architects reviewing design decisions
  • Code reviewers evaluating pull requests
  • New team members onboarding to the codebase

1.2 Scope

This document covers the design of a high-performance file processing pipeline implemented in Rust, following Domain-Driven Design (DDD), Clean Architecture, and Hexagonal Architecture principles.

Covered Topics:

  • System architecture and layer organization
  • Component interactions and dependencies
  • Data structures and algorithms
  • Interface specifications
  • Design patterns and rationale
  • Technology stack and tooling
  • Cross-cutting concerns

Not Covered:

  • Requirements (see SRS)
  • Testing strategies (see Test Plan)
  • Deployment procedures
  • Operational procedures

1.3 Design Philosophy

Core Principles:

  1. Domain-Driven Design: Business logic isolated in domain layer
  2. Clean Architecture: Dependencies point inward toward domain
  3. Hexagonal Architecture: Ports and adapters isolate external concerns
  4. SOLID Principles: Single responsibility, dependency inversion
  5. Rust Idioms: Zero-cost abstractions, ownership, type safety

Design Goals:

  • Correctness: Type-safe, compiler-verified invariants
  • Performance: Async I/O, parallel processing, minimal allocations
  • Maintainability: Clear separation of concerns, testable components
  • Extensibility: Plugin architecture for custom stages
  • Observability: Comprehensive metrics and logging

1.4 References


2. System Architecture

2.1 Architectural Overview

The system follows a layered architecture with strict dependency rules:

┌──────────────────────────────────────────────┐
│         Presentation Layer                   │ ← CLI, Future APIs
│  (bootstrap, cli, config, logger)            │
├──────────────────────────────────────────────┤
│         Application Layer                    │ ← Use Cases
│  (commands, services, orchestration)         │
├──────────────────────────────────────────────┤
│         Domain Layer (Core)                  │ ← Business Logic
│  (entities, value objects, services)         │
├──────────────────────────────────────────────┤
│         Infrastructure Layer                 │ ← External Concerns
│  (repositories, adapters, file I/O)          │
└──────────────────────────────────────────────┘

Dependency Rule: Inner layers know nothing about outer layers

2.2 Layered Architecture Details

2.2.1 Presentation Layer (Bootstrap Crate)

Responsibilities:

  • CLI argument parsing and validation
  • Application configuration
  • Logging and console output
  • Exit code handling
  • Platform-specific concerns

Key Components:

  • cli::parser - Command-line argument parsing
  • cli::validator - Input validation and sanitization
  • config::AppConfig - Application configuration
  • logger::Logger - Logging abstraction
  • platform::Platform - Platform detection

Design Decisions:

  • Separate crate for clean separation
  • No domain dependencies
  • Generic, reusable components
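
As an illustration of the components listed above, the command-line surface might be declared with clap's derive API roughly as follows. This is a sketch only; the actual cli::parser and its flag set may differ in detail:

use clap::{Parser, Subcommand};
use std::path::PathBuf;

#[derive(Parser)]
#[command(name = "pipeline", version)]
struct Cli {
    #[command(subcommand)]
    command: Commands,
}

#[derive(Subcommand)]
enum Commands {
    /// Process a file through a named pipeline
    Process {
        #[arg(long)]
        input: PathBuf,
        #[arg(long)]
        output: PathBuf,
        #[arg(long)]
        pipeline: String,
    },
    /// Restore a processed .adapipe file
    Restore {
        #[arg(long)]
        input: PathBuf,
        #[arg(long)]
        output: PathBuf,
    },
}

fn main() {
    let cli = Cli::parse();
    match cli.command {
        Commands::Process { .. } => { /* validate inputs, then dispatch to the application layer */ }
        Commands::Restore { .. } => { /* validate inputs, then dispatch to the application layer */ }
    }
}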

2.2.2 Application Layer

Responsibilities:

  • Orchestrate use cases
  • Coordinate domain services
  • Manage transactions
  • Handle application-level errors

Key Components:

  • use_cases::ProcessFileUseCase - File processing orchestration
  • use_cases::RestoreFileUseCase - File restoration orchestration
  • commands::ProcessFileCommand - Command pattern implementation
  • commands::RestoreFileCommand - Restoration command

Design Patterns:

  • Command Pattern: Encapsulate requests as objects
  • Use Case Pattern: Application-specific business rules
  • Dependency Injection: Services injected via constructors

Key Abstractions:

#![allow(unused)]
fn main() {
#[async_trait]
pub trait UseCase<Input, Output> {
    async fn execute(&self, input: Input) -> Result<Output>;
}
}
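
A concrete use case then composes domain services behind this trait. The sketch below is illustrative only: the request type, field names, and anyhow-based error handling are assumptions rather than the actual API. It is included to show the orchestration role: load configuration, then delegate to the processing engine (Section 3.1).

#![allow(unused)]
fn main() {
use std::path::PathBuf;
use std::sync::Arc;
use async_trait::async_trait;

// Hypothetical request type for the ProcessFile use case
pub struct ProcessFileRequest {
    pub pipeline_name: String,
    pub input: PathBuf,
}

pub struct ProcessFileUseCase {
    repository: Arc<dyn PipelineRepository>,
    processor: Arc<PipelineProcessor>,
}

#[async_trait]
impl UseCase<ProcessFileRequest, PathBuf> for ProcessFileUseCase {
    async fn execute(&self, request: ProcessFileRequest) -> Result<PathBuf> {
        // Application-level orchestration: load the pipeline configuration...
        let pipeline = self
            .repository
            .find_by_name(&request.pipeline_name)
            .await?
            .ok_or_else(|| anyhow::anyhow!("pipeline not found: {}", request.pipeline_name))?;

        // ...then delegate chunked processing to the engine
        self.processor.process(&pipeline, request.input).await
    }
}
}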

2.2.3 Domain Layer

Responsibilities:

  • Core business logic
  • Domain rules and invariants
  • Domain events
  • Pure functions (no I/O)

Key Components:

Entities:

  • Pipeline - Aggregate root for pipeline configuration
  • PipelineStage - Individual processing stage
  • FileMetadata - File information and state

Value Objects:

  • ChunkSize - Validated chunk size
  • PipelineId - Unique pipeline identifier
  • StageType - Type-safe stage classification

Domain Services:

  • CompressionService - Compression/decompression logic
  • EncryptionService - Encryption/decryption logic
  • ChecksumService - Integrity verification logic
  • FileIOService - File operations abstraction

Invariants Enforced:

#![allow(unused)]
fn main() {
impl Pipeline {
    pub fn new(name: String, stages: Vec<PipelineStage>) -> Result<Self> {
        // Invariant: Name must be non-empty
        if name.trim().is_empty() {
            return Err(PipelineError::InvalidName);
        }

        // Invariant: At least one stage required
        if stages.is_empty() {
            return Err(PipelineError::NoStages);
        }

        // Invariant: Stages must have unique, sequential order
        validate_stage_order(&stages)?;

        Ok(Pipeline { name, stages })
    }
}
}

2.2.4 Infrastructure Layer

Responsibilities:

  • External system integration
  • Data persistence
  • File I/O operations
  • Third-party library adapters

Key Components:

Repositories:

  • SqlitePipelineRepository - SQLite-based persistence
  • InMemoryPipelineRepository - Testing/development

Adapters:

  • BrotliCompressionAdapter - Brotli compression
  • ZstdCompressionAdapter - Zstandard compression
  • AesGcmEncryptionAdapter - AES-256-GCM encryption
  • Sha256ChecksumAdapter - SHA-256 hashing

File I/O:

  • BinaryFileWriter - .adapipe format writer
  • BinaryFileReader - .adapipe format reader
  • ChunkedFileReader - Streaming file reader
  • ChunkedFileWriter - Streaming file writer
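
For example, a checksum adapter in this layer can be little more than a thin wrapper around the sha2 crate. The sketch below assumes a simple synchronous ChecksumAdapter port for brevity; the production adapters may differ in signature and error handling.

#![allow(unused)]
fn main() {
use sha2::{Digest, Sha256};

// Hypothetical synchronous port, shown for brevity
pub trait ChecksumAdapter: Send + Sync {
    fn name(&self) -> &str;
    fn hash(&self, data: &[u8]) -> Vec<u8>;
}

pub struct Sha256ChecksumAdapter;

impl ChecksumAdapter for Sha256ChecksumAdapter {
    fn name(&self) -> &str {
        "sha256"
    }

    fn hash(&self, data: &[u8]) -> Vec<u8> {
        // Hash the chunk data and return the 32-byte digest
        let mut hasher = Sha256::new();
        hasher.update(data);
        hasher.finalize().to_vec()
    }
}
}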

2.3 Hexagonal Architecture (Ports & Adapters)

Primary Ports (driving side - how the world uses the system):

  • CLI commands
  • Future: REST API, gRPC API

Secondary Ports (driven side - how the system uses the world):

#![allow(unused)]
fn main() {
// Domain defines the interface (port)
#[async_trait]
pub trait CompressionPort: Send + Sync {
    async fn compress(&self, data: &[u8]) -> Result<Vec<u8>>;
    async fn decompress(&self, data: &[u8]) -> Result<Vec<u8>>;
}

// Infrastructure provides implementations (adapters)
pub struct BrotliAdapter {
    quality: u32,
}

#[async_trait]
impl CompressionPort for BrotliAdapter {
    async fn compress(&self, data: &[u8]) -> Result<Vec<u8>> {
        // Brotli-specific implementation
    }
}
}

Benefits:

  • Domain layer testable in isolation
  • Easy to swap implementations
  • Third-party dependencies isolated
  • Technology-agnostic core

3. Component Design

3.1 Pipeline Processing Engine

Architecture:

┌──────────────┐
│   Pipeline   │ (Aggregate Root)
│              │
│  - name      │
│  - stages[]  │
│  - metadata  │
└──────┬───────┘
       │ has many
       ▼
┌──────────────┐
│PipelineStage │
│              │
│  - name      │
│  - type      │
│  - config    │
│  - order     │
└──────────────┘

Processing Flow:

#![allow(unused)]
fn main() {
pub struct PipelineProcessor {
    context: Arc<PipelineContext>,
}

impl PipelineProcessor {
    pub async fn process(&self, pipeline: &Pipeline, input: PathBuf)
        -> Result<PathBuf>
    {
        // 1. Validate input file
        let metadata = self.validate_input(&input).await?;

        // 2. Create output file
        let output = self.create_output_file(&pipeline.name).await?;

        // 3. Process in chunks
        let chunks = self.chunk_file(&input, self.context.chunk_size).await?;

        // 4. Execute stages on each chunk (parallel)
        for chunk in chunks {
            let processed = self.execute_stages(pipeline, chunk).await?;
            self.write_chunk(&output, processed).await?;
        }

        // 5. Write metadata and finalize
        self.finalize(&output, metadata).await?;

        Ok(output)
    }

    async fn execute_stages(&self, pipeline: &Pipeline, chunk: Chunk)
        -> Result<Chunk>
    {
        let mut data = chunk.data;

        for stage in &pipeline.stages {
            data = match stage.stage_type {
                StageType::Compression => {
                    self.context.compression.compress(&data).await?
                }
                StageType::Encryption => {
                    self.context.encryption.encrypt(&data).await?
                }
                StageType::Checksum => {
                    let hash = self.context.checksum.hash(&data).await?;
                    self.context.store_hash(chunk.id, hash);
                    data // Checksum doesn't transform data
                }
                StageType::PassThrough => data,
            };
        }

        Ok(Chunk { data, ..chunk })
    }
}
}

3.2 Chunk Processing Strategy

Design Rationale:

  • Large files cannot fit in memory
  • Streaming processing enables arbitrary file sizes
  • Parallel chunk processing utilizes multiple cores
  • Fixed chunk size simplifies memory management

Implementation:

#![allow(unused)]
fn main() {
pub const DEFAULT_CHUNK_SIZE: usize = 1_048_576; // 1 MB

pub struct Chunk {
    pub id: usize,
    pub data: Vec<u8>,
    pub is_final: bool,
}

pub struct ChunkedFileReader {
    file: File,
    chunk_size: usize,
    chunk_id: usize,
}

impl ChunkedFileReader {
    pub async fn read_chunk(&mut self) -> Result<Option<Chunk>> {
        let mut buffer = vec![0u8; self.chunk_size];
        let bytes_read = self.file.read(&mut buffer).await?;

        if bytes_read == 0 {
            return Ok(None);
        }

        buffer.truncate(bytes_read);

        let chunk = Chunk {
            id: self.chunk_id,
            data: buffer,
            is_final: bytes_read < self.chunk_size,
        };

        self.chunk_id += 1;
        Ok(Some(chunk))
    }
}
}

Parallel Processing:

#![allow(unused)]
fn main() {
use std::sync::Arc;
use tokio::task::JoinSet;

pub async fn process_chunks_parallel(
    chunks: Vec<Chunk>,
    stage: Arc<dyn ProcessingStage>,
) -> Result<Vec<Chunk>> {
    let mut join_set = JoinSet::new();

    for chunk in chunks {
        let stage = Arc::clone(&stage);
        join_set.spawn(async move {
            stage.process(chunk).await
        });
    }

    let mut results = Vec::new();
    while let Some(result) = join_set.join_next().await {
        results.push(result??);
    }

    // Sort by chunk ID to maintain order
    results.sort_by_key(|c| c.id);

    Ok(results)
}
}

3.3 Repository Pattern Implementation

Interface (Port):

#![allow(unused)]
fn main() {
#[async_trait]
pub trait PipelineRepository: Send + Sync {
    async fn save(&self, pipeline: &Pipeline) -> Result<()>;
    async fn find_by_id(&self, id: &str) -> Result<Option<Pipeline>>;
    async fn find_by_name(&self, name: &str) -> Result<Option<Pipeline>>;
    async fn delete(&self, id: &str) -> Result<()>;
    async fn list_all(&self) -> Result<Vec<Pipeline>>;
}
}

SQLite Implementation:

#![allow(unused)]
fn main() {
pub struct SqlitePipelineRepository {
    pool: Arc<SqlitePool>,
}

#[async_trait]
impl PipelineRepository for SqlitePipelineRepository {
    async fn save(&self, pipeline: &Pipeline) -> Result<()> {
        let mut tx = self.pool.begin().await?;

        // Insert pipeline
        sqlx::query!(
            "INSERT INTO pipelines (id, name, created_at) VALUES (?, ?, ?)",
            pipeline.id,
            pipeline.name,
            pipeline.created_at
        )
        .execute(&mut *tx)
        .await?;

        // Insert stages
        for stage in &pipeline.stages {
            sqlx::query!(
                "INSERT INTO pipeline_stages
                 (pipeline_id, name, type, order_num, config)
                 VALUES (?, ?, ?, ?, ?)",
                pipeline.id,
                stage.name,
                stage.stage_type.to_string(),
                stage.order,
                serde_json::to_string(&stage.configuration)?
            )
            .execute(&mut *tx)
            .await?;
        }

        tx.commit().await?;
        Ok(())
    }
}
}

3.4 Adapter Pattern for Algorithms

Design: Each algorithm family has a common interface with multiple implementations.

Compression Adapters:

#![allow(unused)]
fn main() {
use std::io::Read;

#[async_trait]
pub trait CompressionAdapter: Send + Sync {
    fn name(&self) -> &str;
    async fn compress(&self, data: &[u8]) -> Result<Vec<u8>>;
    async fn decompress(&self, data: &[u8]) -> Result<Vec<u8>>;
}

pub struct BrotliAdapter {
    quality: u32,
}

#[async_trait]
impl CompressionAdapter for BrotliAdapter {
    fn name(&self) -> &str { "brotli" }

    async fn compress(&self, data: &[u8]) -> Result<Vec<u8>> {
        let mut output = Vec::new();
        let mut compressor = brotli::CompressorReader::new(
            data,
            4096,
            self.quality,
            22
        );
        compressor.read_to_end(&mut output)?;
        Ok(output)
    }

    async fn decompress(&self, data: &[u8]) -> Result<Vec<u8>> {
        let mut output = Vec::new();
        let mut decompressor = brotli::Decompressor::new(data, 4096);
        decompressor.read_to_end(&mut output)?;
        Ok(output)
    }
}
}

Factory Pattern:

#![allow(unused)]
fn main() {
pub struct AdapterFactory;

impl AdapterFactory {
    pub fn create_compression(
        algorithm: &str
    ) -> Result<Box<dyn CompressionAdapter>> {
        match algorithm.to_lowercase().as_str() {
            "brotli" => Ok(Box::new(BrotliAdapter::new(11))),
            "zstd" => Ok(Box::new(ZstdAdapter::new(3))),
            "gzip" => Ok(Box::new(GzipAdapter::new(6))),
            "lz4" => Ok(Box::new(Lz4Adapter::new())),
            _ => Err(PipelineError::UnsupportedAlgorithm(
                algorithm.to_string()
            )),
        }
    }
}
}

4. Data Design

4.1 Domain Entities

Pipeline Entity:

#![allow(unused)]
fn main() {
pub struct Pipeline {
    id: PipelineId,
    name: String,
    stages: Vec<PipelineStage>,
    created_at: DateTime<Utc>,
    updated_at: DateTime<Utc>,
}

impl Pipeline {
    // Factory method enforces invariants
    pub fn new(name: String, stages: Vec<PipelineStage>) -> Result<Self> {
        // Validation logic
    }

    // Domain behavior
    pub fn add_stage(&mut self, stage: PipelineStage) -> Result<()> {
        // Ensure stage order is valid
        // Ensure no duplicate stage names
    }

    pub fn execute_order(&self) -> &[PipelineStage] {
        &self.stages
    }

    pub fn restore_order(&self) -> Vec<PipelineStage> {
        self.stages.iter().rev().cloned().collect()
    }
}
}

PipelineStage Entity:

#![allow(unused)]
fn main() {
pub struct PipelineStage {
    name: String,
    stage_type: StageType,
    configuration: StageConfiguration,
    order: usize,
}

pub enum StageType {
    Compression,
    Encryption,
    Checksum,
    PassThrough,
}

pub struct StageConfiguration {
    pub algorithm: String,
    pub parameters: HashMap<String, String>,
    pub parallel_processing: bool,
    pub chunk_size: Option<usize>,
}
}

4.2 Value Objects

ChunkSize:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ChunkSize(usize);

impl ChunkSize {
    pub const MIN: usize = 1024;  // 1 KB
    pub const MAX: usize = 100 * 1024 * 1024;  // 100 MB
    pub const DEFAULT: usize = 1_048_576;  // 1 MB

    pub fn new(size: usize) -> Result<Self> {
        if size < Self::MIN || size > Self::MAX {
            return Err(PipelineError::InvalidChunkSize {
                size,
                min: Self::MIN,
                max: Self::MAX
            });
        }
        Ok(ChunkSize(size))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}
}

4.3 Database Schema

Schema Management: Using sqlx with migrations

-- migrations/20250101000000_initial_schema.sql

-- Pipelines table: stores pipeline configurations
CREATE TABLE IF NOT EXISTS pipelines (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    archived BOOLEAN NOT NULL DEFAULT false,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL
);

-- Pipeline configuration: key-value pairs for pipeline settings
CREATE TABLE IF NOT EXISTS pipeline_configuration (
    pipeline_id TEXT NOT NULL,
    key TEXT NOT NULL,
    value TEXT NOT NULL,
    archived BOOLEAN NOT NULL DEFAULT FALSE,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    PRIMARY KEY (pipeline_id, key),
    FOREIGN KEY (pipeline_id) REFERENCES pipelines(id) ON DELETE CASCADE
);

-- Pipeline stages: defines processing stages in a pipeline
CREATE TABLE IF NOT EXISTS pipeline_stages (
    id TEXT PRIMARY KEY,
    pipeline_id TEXT NOT NULL,
    name TEXT NOT NULL,
    stage_type TEXT NOT NULL,
    enabled BOOLEAN NOT NULL DEFAULT TRUE,
    stage_order INTEGER NOT NULL,
    algorithm TEXT NOT NULL,
    parallel_processing BOOLEAN NOT NULL DEFAULT FALSE,
    chunk_size INTEGER,
    archived BOOLEAN NOT NULL DEFAULT FALSE,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    FOREIGN KEY (pipeline_id) REFERENCES pipelines(id) ON DELETE CASCADE
);

-- Stage parameters: configuration for individual stages
CREATE TABLE IF NOT EXISTS stage_parameters (
    stage_id TEXT NOT NULL,
    key TEXT NOT NULL,
    value TEXT NOT NULL,
    archived BOOLEAN NOT NULL DEFAULT FALSE,
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    PRIMARY KEY (stage_id, key),
    FOREIGN KEY (stage_id) REFERENCES pipeline_stages(id) ON DELETE CASCADE
);

-- Processing metrics: tracks pipeline execution metrics
CREATE TABLE IF NOT EXISTS processing_metrics (
    pipeline_id TEXT PRIMARY KEY,
    bytes_processed INTEGER NOT NULL DEFAULT 0,
    bytes_total INTEGER NOT NULL DEFAULT 0,
    chunks_processed INTEGER NOT NULL DEFAULT 0,
    chunks_total INTEGER NOT NULL DEFAULT 0,
    start_time_rfc3339 TEXT,
    end_time_rfc3339 TEXT,
    processing_duration_ms INTEGER,
    throughput_bytes_per_second REAL NOT NULL DEFAULT 0.0,
    compression_ratio REAL,
    error_count INTEGER NOT NULL DEFAULT 0,
    warning_count INTEGER NOT NULL DEFAULT 0,
    input_file_size_bytes INTEGER NOT NULL DEFAULT 0,
    output_file_size_bytes INTEGER NOT NULL DEFAULT 0,
    input_file_checksum TEXT,
    output_file_checksum TEXT,
    FOREIGN KEY (pipeline_id) REFERENCES pipelines(id) ON DELETE CASCADE
);

-- Indexes for performance
CREATE INDEX IF NOT EXISTS idx_pipeline_stages_pipeline_id ON pipeline_stages(pipeline_id);
CREATE INDEX IF NOT EXISTS idx_pipeline_stages_order ON pipeline_stages(pipeline_id, stage_order);
CREATE INDEX IF NOT EXISTS idx_pipeline_configuration_pipeline_id ON pipeline_configuration(pipeline_id);
CREATE INDEX IF NOT EXISTS idx_stage_parameters_stage_id ON stage_parameters(stage_id);
CREATE INDEX IF NOT EXISTS idx_pipelines_name ON pipelines(name) WHERE archived = false;
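
Reading follows the same sqlx patterns shown in Section 3.3. For example, loading a pipeline's stages in execution order might look like this sketch against the schema above:

#![allow(unused)]
fn main() {
use sqlx::sqlite::{SqlitePool, SqliteRow};

async fn load_stage_rows(
    pool: &SqlitePool,
    pipeline_id: &str,
) -> Result<Vec<SqliteRow>, sqlx::Error> {
    // Fetch active stages for one pipeline, ordered for execution
    sqlx::query(
        "SELECT name, stage_type, algorithm, stage_order
         FROM pipeline_stages
         WHERE pipeline_id = ? AND archived = FALSE
         ORDER BY stage_order",
    )
    .bind(pipeline_id)
    .fetch_all(pool)
    .await
}
}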

4.4 Binary File Format (.adapipe)

Format Specification:

.adapipe File Structure:
┌────────────────────────────────────┐
│ Magic Number (4 bytes): "ADPI"    │
│ Version (2 bytes): Major.Minor     │
│ Header Length (4 bytes)            │
├────────────────────────────────────┤
│ Header (JSON):                     │
│  - pipeline_name                   │
│  - stage_configs[]                 │
│  - original_filename               │
│  - original_size                   │
│  - chunk_count                     │
│  - checksum_algorithm              │
├────────────────────────────────────┤
│ Chunk 0:                           │
│  - Length (4 bytes)                │
│  - Checksum (32 bytes)             │
│  - Data (variable)                 │
├────────────────────────────────────┤
│ Chunk 1: ...                       │
├────────────────────────────────────┤
│ ...                                │
├────────────────────────────────────┤
│ Footer (JSON):                     │
│  - total_checksum                  │
│  - created_at                      │
└────────────────────────────────────┘

Implementation:

#![allow(unused)]
fn main() {
pub struct AdapipeHeader {
    pub pipeline_name: String,
    pub stages: Vec<StageConfig>,
    pub original_filename: String,
    pub original_size: u64,
    pub chunk_count: usize,
    pub checksum_algorithm: String,
}

pub struct BinaryFileWriter {
    file: File,
}

impl BinaryFileWriter {
    pub async fn write_header(&mut self, header: &AdapipeHeader)
        -> Result<()>
    {
        // Magic number
        self.file.write_all(b"ADPI").await?;

        // Version
        self.file.write_u8(1).await?;  // Major
        self.file.write_u8(0).await?;  // Minor

        // Header JSON
        let header_json = serde_json::to_vec(header)?;
        self.file.write_u32(header_json.len() as u32).await?;
        self.file.write_all(&header_json).await?;

        Ok(())
    }

    pub async fn write_chunk(&mut self, chunk: &ProcessedChunk)
        -> Result<()>
    {
        self.file.write_u32(chunk.data.len() as u32).await?;
        self.file.write_all(&chunk.checksum).await?;
        self.file.write_all(&chunk.data).await?;
        Ok(())
    }
}
}
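
The corresponding reader validates the magic number and version before deserializing the header. The sketch below uses anyhow-style errors as an assumption for brevity:

#![allow(unused)]
fn main() {
use tokio::fs::File;
use tokio::io::AsyncReadExt;

pub struct BinaryFileReader {
    file: File,
}

impl BinaryFileReader {
    pub async fn read_header(&mut self) -> Result<AdapipeHeader> {
        // Magic number
        let mut magic = [0u8; 4];
        self.file.read_exact(&mut magic).await?;
        if &magic != b"ADPI" {
            anyhow::bail!("not an .adapipe file: bad magic number");
        }

        // Version (major version must be understood by this reader)
        let major = self.file.read_u8().await?;
        let _minor = self.file.read_u8().await?;
        if major != 1 {
            anyhow::bail!("unsupported .adapipe version: {}", major);
        }

        // Header JSON
        let header_len = self.file.read_u32().await? as usize;
        let mut header_json = vec![0u8; header_len];
        self.file.read_exact(&mut header_json).await?;

        Ok(serde_json::from_slice(&header_json)?)
    }
}
}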

5. Interface Design

5.1 Public API

Command-Line Interface:

# Process a file
pipeline process \
    --input ./document.pdf \
    --output ./document.adapipe \
    --pipeline secure-archive \
    --compress zstd \
    --encrypt aes256gcm \
    --key-file ./key.txt

# Restore a file
pipeline restore \
    --input ./document.adapipe \
    --output ./document.pdf \
    --key-file ./key.txt

# List pipelines
pipeline list

# Create pipeline configuration
pipeline create \
    --name my-pipeline \
    --stages compress:zstd,encrypt:aes256gcm

Future: Library API:

use adaptive_pipeline_domain::entities::{Pipeline, PipelineStage, StageType, StageConfiguration};
use std::collections::HashMap;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create pipeline stages
    let compression_stage = PipelineStage::new(
        "compress".to_string(),
        StageType::Compression,
        StageConfiguration::new(
            "zstd".to_string(),
            HashMap::new(),
            false,
        ),
        0,
    )?;

    let encryption_stage = PipelineStage::new(
        "encrypt".to_string(),
        StageType::Encryption,
        StageConfiguration::new(
            "aes256gcm".to_string(),
            HashMap::new(),
            false,
        ),
        1,
    )?;

    // Create pipeline with stages
    let pipeline = Pipeline::new(
        "my-pipeline".to_string(),
        vec![compression_stage, encryption_stage],
    )?;

    // Process file (future implementation)
    // let processor = PipelineProcessor::new()?;
    // let output = processor.process(&pipeline, "input.txt").await?;

    println!("Pipeline created: {}", pipeline.name());
    Ok(())
}

5.2 Internal APIs

Domain Service Interfaces:

#![allow(unused)]
fn main() {
#[async_trait]
pub trait CompressionService: Send + Sync {
    async fn compress(
        &self,
        algorithm: &str,
        data: &[u8]
    ) -> Result<Vec<u8>>;

    async fn decompress(
        &self,
        algorithm: &str,
        data: &[u8]
    ) -> Result<Vec<u8>>;
}

#[async_trait]
pub trait EncryptionService: Send + Sync {
    async fn encrypt(
        &self,
        algorithm: &str,
        data: &[u8],
        key: &[u8]
    ) -> Result<Vec<u8>>;

    async fn decrypt(
        &self,
        algorithm: &str,
        data: &[u8],
        key: &[u8]
    ) -> Result<Vec<u8>>;
}
}

6. Technology Stack

6.1 Core Technologies

| Technology | Purpose | Version |
|------------|---------|---------|
| Rust | Primary language | 1.75+ |
| Tokio | Async runtime | 1.35+ |
| SQLx | Database access | 0.7+ |
| Serde | Serialization | 1.0+ |
| Anyhow | Error handling | 1.0+ |
| Clap | CLI parsing | 4.5+ |

6.2 Algorithms & Libraries

Compression:

  • Brotli (brotli crate)
  • Zstandard (zstd crate)
  • Gzip (flate2 crate)
  • LZ4 (lz4 crate)

Encryption:

  • AES-GCM (aes-gcm crate)
  • ChaCha20-Poly1305 (chacha20poly1305 crate)

Hashing:

  • SHA-256/SHA-512 (sha2 crate)
  • BLAKE3 (blake3 crate)

Testing:

  • Criterion (benchmarking)
  • Proptest (property testing)
  • Mockall (mocking)

6.3 Development Tools

  • Cargo: Build system and package manager
  • Clippy: Linter
  • Rustfmt: Code formatter
  • Cargo-audit: Security auditing
  • Cargo-deny: Dependency validation
  • mdBook: Documentation generation

7. Design Patterns

7.1 Repository Pattern

Purpose: Abstract data persistence logic from domain logic.

Implementation: See Section 3.3

Benefits:

  • Domain layer independent of storage mechanism
  • Easy to swap SQLite for PostgreSQL or other storage
  • Testable with in-memory implementation
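
A minimal in-memory implementation might look like the following sketch, which assumes Pipeline is Clone and exposes id and name as in Section 3.3:

#![allow(unused)]
fn main() {
use std::collections::HashMap;
use std::sync::RwLock;
use async_trait::async_trait;

pub struct InMemoryPipelineRepository {
    pipelines: RwLock<HashMap<String, Pipeline>>,
}

#[async_trait]
impl PipelineRepository for InMemoryPipelineRepository {
    async fn save(&self, pipeline: &Pipeline) -> Result<()> {
        self.pipelines
            .write()
            .unwrap()
            .insert(pipeline.id.to_string(), pipeline.clone());
        Ok(())
    }

    async fn find_by_id(&self, id: &str) -> Result<Option<Pipeline>> {
        Ok(self.pipelines.read().unwrap().get(id).cloned())
    }

    async fn find_by_name(&self, name: &str) -> Result<Option<Pipeline>> {
        Ok(self
            .pipelines
            .read()
            .unwrap()
            .values()
            .find(|p| p.name == name)
            .cloned())
    }

    async fn delete(&self, id: &str) -> Result<()> {
        self.pipelines.write().unwrap().remove(id);
        Ok(())
    }

    async fn list_all(&self) -> Result<Vec<Pipeline>> {
        Ok(self.pipelines.read().unwrap().values().cloned().collect())
    }
}
}

Tests construct this type directly and inject it wherever a PipelineRepository is expected, keeping domain and application tests free of database setup.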

7.2 Adapter Pattern

Purpose: Integrate third-party algorithms without coupling.

Implementation: See Section 3.4

Benefits:

  • Easy to add new algorithms
  • Algorithm selection at runtime
  • Consistent interface across all adapters

7.3 Strategy Pattern

Purpose: Select algorithm implementation dynamically.

Example:

#![allow(unused)]
fn main() {
pub struct PipelineProcessor {
    compression_strategy: Box<dyn CompressionAdapter>,
    encryption_strategy: Box<dyn EncryptionAdapter>,
}

impl PipelineProcessor {
    pub fn new(
        compression: &str,
        encryption: &str
    ) -> Result<Self> {
        Ok(Self {
            compression_strategy: AdapterFactory::create_compression(
                compression
            )?,
            encryption_strategy: AdapterFactory::create_encryption(
                encryption
            )?,
        })
    }
}
}

7.4 Builder Pattern

Purpose: Construct complex objects step by step.

Note: Builder pattern is a potential future enhancement. Current API uses direct construction.

Current API:

#![allow(unused)]
fn main() {
let pipeline = Pipeline::new(
    "my-pipeline".to_string(),
    vec![compression_stage, encryption_stage],
)?;
}

Potential Future Builder API:

#![allow(unused)]
fn main() {
// Not yet implemented - potential future enhancement
let pipeline = Pipeline::builder()
    .name("my-pipeline")
    .add_stage(compression_stage)
    .add_stage(encryption_stage)
    .build()?;
}

7.5 Command Pattern

Purpose: Encapsulate requests as objects for undo/redo, queueing.

Example:

#![allow(unused)]
fn main() {
#[async_trait]
pub trait Command: Send + Sync {
    async fn execute(&self) -> Result<()>;
}

pub struct ProcessFileCommand {
    pipeline: Pipeline,
    input: PathBuf,
    output: PathBuf,
}

#[async_trait]
impl Command for ProcessFileCommand {
    async fn execute(&self) -> Result<()> {
        // Processing logic
    }
}
}

8. Cross-Cutting Concerns

8.1 Error Handling

Strategy: Use anyhow::Result for application errors, custom error types for domain errors.

#![allow(unused)]
fn main() {
#[derive(Debug, thiserror::Error)]
pub enum PipelineError {
    #[error("Invalid pipeline name: {0}")]
    InvalidName(String),

    #[error("No stages defined")]
    NoStages,

    #[error("Invalid chunk size: {size} (must be {min}-{max})")]
    InvalidChunkSize { size: usize, min: usize, max: usize },

    #[error("Stage order conflict at position {0}")]
    StageOrderConflict(usize),

    #[error("Unsupported algorithm: {0}")]
    UnsupportedAlgorithm(String),

    #[error("I/O error: {0}")]
    Io(#[from] std::io::Error),
}
}

8.2 Logging

Strategy: Structured logging with tracing.

#![allow(unused)]
fn main() {
use tracing::{info, warn, error, debug, instrument};

#[instrument(skip(self, data))]
async fn process_chunk(&self, chunk_id: usize, data: &[u8])
    -> Result<Vec<u8>>
{
    debug!(chunk_id, size = data.len(), "Processing chunk");

    let start = Instant::now();
    let result = self.compress(data).await?;

    info!(
        chunk_id,
        original_size = data.len(),
        compressed_size = result.len(),
        duration_ms = start.elapsed().as_millis(),
        "Chunk compressed"
    );

    Ok(result)
}
}

8.3 Metrics Collection

Strategy: Prometheus-style metrics.

#![allow(unused)]
fn main() {
pub struct PipelineMetrics {
    pub chunks_processed: AtomicUsize,
    pub bytes_processed: AtomicU64,
    pub compression_ratio: AtomicU64,
    pub processing_duration: Duration,
}

impl PipelineMetrics {
    pub fn record_chunk(&self, original_size: usize, compressed_size: usize) {
        self.chunks_processed.fetch_add(1, Ordering::Relaxed);
        self.bytes_processed.fetch_add(
            original_size as u64,
            Ordering::Relaxed
        );

        let ratio = (compressed_size as f64 / original_size as f64 * 100.0)
            as u64;
        self.compression_ratio.store(ratio, Ordering::Relaxed);
    }
}
}

8.4 Security

Key Management:

  • Keys never logged or persisted unencrypted
  • Argon2 key derivation from passwords
  • Secure memory wiping for key material

Input Validation:

  • CLI inputs sanitized against injection
  • File paths validated
  • Chunk sizes bounded
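
A minimal sketch of these validation rules, assuming hypothetical helper names and bounds (the real limits live in the chunk-size value object):

use std::path::{Component, Path};

// Hypothetical bounds for illustration.
const MIN_CHUNK_SIZE: usize = 1024;         // 1 KiB
const MAX_CHUNK_SIZE: usize = 100_000_000;  // ~100 MB

/// Reject paths containing parent-directory components (path traversal).
fn validate_input_path(path: &str) -> Result<&Path, String> {
    let candidate = Path::new(path);
    if candidate.components().any(|c| matches!(c, Component::ParentDir)) {
        return Err(format!("path traversal rejected: {path}"));
    }
    Ok(candidate)
}

/// Reject chunk sizes outside the configured bounds.
fn validate_chunk_size(size: usize) -> Result<usize, String> {
    if (MIN_CHUNK_SIZE..=MAX_CHUNK_SIZE).contains(&size) {
        Ok(size)
    } else {
        Err(format!("chunk size {size} out of range"))
    }
}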

Algorithm Selection:

  • Only vetted, well-known algorithms
  • Default to secure settings
  • AEAD ciphers for authenticated encryption

9. Performance Considerations

9.1 Asynchronous I/O

Rationale: File I/O is the primary bottleneck. Async I/O allows overlap of CPU and I/O operations.

Implementation:

  • Tokio async runtime
  • tokio::fs for file operations
  • tokio::io::AsyncRead and AsyncWrite traits
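
The sketch below illustrates the overlap: while one chunk is being transformed on the CPU, the runtime can schedule other I/O. The chunk size and function names are illustrative, not the pipeline's actual API.

use tokio::fs::File;
use tokio::io::{AsyncReadExt, AsyncWriteExt};

/// Read a file in fixed-size chunks, transform each chunk, write the result.
async fn process_file(input: &str, output: &str) -> std::io::Result<()> {
    let mut reader = File::open(input).await?;
    let mut writer = File::create(output).await?;
    let mut buf = vec![0u8; 64 * 1024]; // illustrative 64 KiB chunk

    loop {
        let n = reader.read(&mut buf).await?;
        if n == 0 {
            break; // EOF
        }
        let processed = transform_chunk(&buf[..n]); // stand-in for a pipeline stage
        writer.write_all(&processed).await?;
    }
    writer.flush().await?;
    Ok(())
}

// Placeholder for a real stage (compression, encryption, ...).
fn transform_chunk(chunk: &[u8]) -> Vec<u8> {
    chunk.to_vec()
}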

9.2 Parallel Processing

Chunk-level Parallelism:

#![allow(unused)]
fn main() {
use rayon::prelude::*;

pub fn process_chunks_parallel(
    chunks: Vec<Chunk>,
    processor: &ChunkProcessor,
) -> Result<Vec<ProcessedChunk>> {
    chunks
        .par_iter()
        .map(|chunk| processor.process(chunk))
        .collect()
}
}

9.3 Memory Management

Strategies:

  • Fixed chunk size limits peak memory
  • Streaming processing avoids loading entire files
  • Buffer pooling for frequently allocated buffers
  • Zero-copy where possible
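
A minimal sketch of the buffer-pooling idea (names and structure are illustrative, not the pipeline's actual implementation):

use std::collections::VecDeque;
use std::sync::Mutex;

/// Reuse chunk-sized buffers instead of allocating one per chunk.
/// A production pool would bound its size and avoid a single global lock.
pub struct BufferPool {
    buffers: Mutex<VecDeque<Vec<u8>>>,
    chunk_size: usize,
}

impl BufferPool {
    pub fn new(chunk_size: usize) -> Self {
        Self { buffers: Mutex::new(VecDeque::new()), chunk_size }
    }

    /// Take a buffer from the pool, or allocate one if the pool is empty.
    pub fn acquire(&self) -> Vec<u8> {
        self.buffers
            .lock()
            .unwrap()
            .pop_front()
            .unwrap_or_else(|| vec![0u8; self.chunk_size])
    }

    /// Return a buffer so later chunks can reuse the allocation.
    pub fn release(&self, buf: Vec<u8>) {
        self.buffers.lock().unwrap().push_back(buf);
    }
}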

9.4 Algorithm Selection

Performance Profiles:

Algorithm   Speed    Ratio    Memory
─────────   ─────    ─────    ──────
LZ4         ★★★★★    ★★☆☆☆    ★★★★★
Zstd        ★★★★☆    ★★★★☆    ★★★★☆
Gzip        ★★★☆☆    ★★★☆☆    ★★★★☆
Brotli      ★★☆☆☆    ★★★★★    ★★★☆☆

10. Testing Strategy (Overview)

10.1 Test Organization

  • Unit Tests: #[cfg(test)] modules in source files
  • Integration Tests: tests/integration/
  • E2E Tests: tests/e2e/
  • Benchmarks: benches/

10.2 Test Coverage Goals

  • Domain layer: 90%+ coverage
  • Application layer: 80%+ coverage
  • Infrastructure: 70%+ coverage (mocked external deps)

10.3 Testing Tools

  • Mockall for mocking
  • Proptest for property-based testing
  • Criterion for benchmarking
  • Cargo-tarpaulin for coverage

11. Future Enhancements

11.1 Planned Features

  1. Distributed Processing: Process files across multiple machines
  2. Cloud Integration: S3, Azure Blob, GCS support
  3. REST API: HTTP API for remote processing
  4. Plugin System: Dynamic loading of custom stages
  5. Web UI: Browser-based configuration and monitoring

11.2 Architectural Evolution

Phase 1 (Current): Single-machine CLI application

Phase 2: Library + CLI + REST API

Phase 3: Distributed processing with coordinator nodes

Phase 4: Cloud-native deployment with Kubernetes


12. Conclusion

This Software Design Document describes a robust, extensible file processing pipeline built on solid architectural principles. The design prioritizes:

  • Correctness through type safety and domain-driven design
  • Performance through async I/O and parallel processing
  • Maintainability through clean architecture and separation of concerns
  • Extensibility through ports & adapters and plugin architecture

The implementation follows Rust best practices and leverages the language's strengths in safety, concurrency, and zero-cost abstractions.


Appendix A: Glossary

Term           Definition
────────────   ───────────────────────────────────────────────────
Aggregate      Domain entity that serves as consistency boundary
AEAD           Authenticated Encryption with Associated Data
Chunk          Fixed-size portion of file for streaming processing
DDD            Domain-Driven Design architectural approach
Port           Interface defined by application core
Adapter        Implementation of port for external system
Use Case       Application-specific business rule
Value Object   Immutable object defined by its attributes

Appendix B: Diagrams

See the chapters that follow for detailed diagrams.

Software Test Plan (STP)

Version: 1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner, Claude Code Status: Active


1. Introduction

1.1 Purpose

This Software Test Plan (STP) defines the testing strategy, approach, and organization for the Adaptive Pipeline system. It ensures the system meets all functional and non-functional requirements specified in the SRS through comprehensive, systematic testing.

Intended Audience:

  • QA engineers implementing tests
  • Developers writing testable code
  • Project managers tracking test progress
  • Stakeholders evaluating quality assurance

1.2 Scope

Testing Coverage:

  • Unit testing of all domain, application, and infrastructure components
  • Integration testing of component interactions
  • End-to-end testing of complete workflows
  • Architecture compliance testing
  • Performance and benchmark testing
  • Security testing

Out of Scope:

  • User acceptance testing (no end users yet)
  • Load testing (single-machine application)
  • Penetration testing (no network exposure)
  • GUI testing (CLI only)

1.3 Test Objectives

  1. Verify Correctness: Ensure all requirements are met
  2. Validate Design: Confirm architectural compliance
  3. Ensure Quality: Maintain high code quality standards
  4. Prevent Regression: Catch bugs before they reach production
  5. Document Behavior: Tests serve as living documentation
  6. Enable Refactoring: Safe code changes through comprehensive tests

1.4 References


2. Test Strategy

2.1 Testing Approach

Test-Driven Development (TDD):

  • Write tests before implementation when feasible
  • Red-Green-Refactor cycle
  • Tests as specification

Behavior-Driven Development (BDD):

  • Given-When-Then structure for integration tests
  • Readable test names describing behavior
  • Focus on outcomes, not implementation

Property-Based Testing:

  • Use Proptest for algorithmic correctness
  • Generate random inputs to find edge cases
  • Verify invariants hold for all inputs

2.2 Test Pyramid

        ┌─────────────┐
        │   E2E (11)  │  ← Few, slow, high-level
        ├─────────────┤
        │Integration  │  ← Medium count, medium speed
        │    (35)     │
        ├─────────────┤
        │   Unit      │  ← Many, fast, focused
        │   (314)     │
        └─────────────┘

Rationale:

  • Most tests are fast unit tests for quick feedback
  • Integration tests verify component collaboration
  • E2E tests validate complete user workflows
  • Architecture tests ensure design compliance

2.3 Test Organization (Post-Reorganization)

Following Rust best practices:

Unit Tests:

  • Location: #[cfg(test)] modules within source files
  • Scope: Single function/struct in isolation
  • Run with: cargo test --lib
  • Count: 314 tests (68 bootstrap + 90 pipeline + 156 pipeline-domain)
  • Example: pipeline-domain/src/entities/pipeline_stage.rs:590-747

Integration Tests:

  • Location: pipeline/tests/integration/
  • Entry: pipeline/tests/integration.rs
  • Scope: Multiple components working together
  • Run with: cargo test --test integration
  • Count: 35 tests (3 ignored pending work)

End-to-End Tests:

  • Location: pipeline/tests/e2e/
  • Entry: pipeline/tests/e2e.rs
  • Scope: Complete workflows from input to output
  • Run with: cargo test --test e2e
  • Count: 11 tests

Architecture Compliance Tests:

  • Location: pipeline/tests/architecture_compliance_test.rs
  • Scope: Validate DDD, Clean Architecture, Hexagonal Architecture
  • Run with: cargo test --test architecture_compliance_test
  • Count: 2 tests

Documentation Tests:

  • Location: Doc comments with ``` code blocks
  • Run with: cargo test --doc
  • Count: 27 tests (12 ignored)

Total Test Count: 389 tests (15 ignored)


3. Test Levels

3.1 Unit Testing

Objectives:

  • Test individual functions and methods in isolation
  • Verify domain logic correctness
  • Ensure edge cases are handled
  • Validate error conditions

Coverage Goals:

  • Domain layer: 90%+ line coverage
  • Application layer: 85%+ line coverage
  • Infrastructure layer: 75%+ line coverage

Example Test Structure:

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_pipeline_creation_validates_name() {
        // Given: Empty name
        let name = "";
        let stages = vec![create_test_stage()];

        // When: Creating pipeline
        let result = Pipeline::new(name.to_string(), stages);

        // Then: Should fail validation
        assert!(result.is_err());
        assert!(matches!(result, Err(PipelineError::InvalidName(_))));
    }

    #[test]
    fn test_pipeline_requires_at_least_one_stage() {
        // Given: Valid name but no stages
        let name = "test-pipeline";
        let stages = vec![];

        // When: Creating pipeline
        let result = Pipeline::new(name.to_string(), stages);

        // Then: Should fail validation
        assert!(result.is_err());
        assert!(matches!(result, Err(PipelineError::NoStages)));
    }
}
}

Key Test Categories:

Category         Description                Example
──────────────   ────────────────────────   ────────────────────────────────────
Happy Path       Normal, expected inputs    Valid pipeline creation
Edge Cases       Boundary conditions        Empty strings, max values
Error Handling   Invalid inputs             Malformed data, constraints violated
Invariants       Domain rules always hold   Stage order uniqueness

3.2 Integration Testing

Objectives:

  • Test component interactions
  • Verify interfaces between layers
  • Validate repository operations
  • Test service collaboration

Test Structure:

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_pipeline_repository_save_and_retrieve() {
    // Given: In-memory repository and pipeline
    let repo = Arc::new(InMemoryPipelineRepository::new());
    let pipeline = create_test_pipeline();

    // When: Saving and retrieving pipeline
    repo.save(&pipeline).await.unwrap();
    let retrieved = repo.find_by_name(&pipeline.name)
        .await
        .unwrap()
        .unwrap();

    // Then: Retrieved pipeline matches original
    assert_eq!(retrieved.name, pipeline.name);
    assert_eq!(retrieved.stages.len(), pipeline.stages.len());
}
}

Integration Test Suites:

  1. Application Layer Integration (application_integration_test.rs)

    • Command execution
    • Use case orchestration
    • Service coordination
  2. Application Services (application_services_integration_test.rs)

    • Service interactions
    • Transaction handling
    • Error propagation
  3. Domain Services (domain_services_test.rs)

    • Compression service integration
    • Encryption service integration
    • Checksum service integration
    • File I/O service integration
  4. Schema Integration (schema_integration_test.rs)

    • Database creation
    • Schema migrations
    • Idempotent initialization
  5. Pipeline Name Validation (pipeline_name_validation_tests.rs)

    • Name normalization
    • Validation rules
    • Reserved names

3.3 End-to-End Testing

Objectives:

  • Validate complete user workflows
  • Test real file processing scenarios
  • Verify .adapipe format correctness
  • Ensure restoration matches original

Test Structure:

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_e2e_complete_pipeline_workflow() -> Result<()> {
    // Given: Test input file and pipeline configuration
    let input_path = create_test_file_with_content("Hello, World!");
    let pipeline = create_secure_pipeline(); // compress + encrypt + checksum

    // When: Processing the file
    let processor = PipelineProcessor::new()?;
    let output_path = processor.process(&pipeline, &input_path).await?;

    // Then: Output file should exist and be valid .adapipe format
    assert!(output_path.exists());
    assert!(output_path.extension().unwrap() == "adapipe");

    // And: Can restore original file
    let restored_path = processor.restore(&output_path).await?;
    let restored_content = fs::read_to_string(&restored_path).await?;
    assert_eq!(restored_content, "Hello, World!");

    // Cleanup
    cleanup_test_files(&[input_path, output_path, restored_path])?;
    Ok(())
}
}

E2E Test Scenarios:

  1. Binary Format Complete Roundtrip (e2e_binary_format_test.rs)

    • Process file through all stages
    • Verify .adapipe format structure
    • Restore and compare with original
    • Test large files (memory mapping)
    • Test corruption detection
    • Test version compatibility
  2. Restoration Pipeline (e2e_restore_pipeline_test.rs)

    • Multi-stage restoration
    • Stage ordering (reverse of processing)
    • Real-world document restoration
    • File header roundtrip
    • Chunk processing verification

3.4 Architecture Compliance Testing

Objectives:

  • Enforce architectural boundaries
  • Validate design patterns
  • Ensure dependency rules
  • Verify SOLID principles

Test Structure:

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_ddd_compliance() {
    println!("Testing DDD Compliance");

    // Test domain entities in isolation
    test_domain_entity_isolation().await;

    // Test value objects for immutability
    test_value_object_patterns().await;

    // Test domain services through interfaces
    test_domain_service_interfaces().await;

    println!("✅ DDD Compliance: PASSED");
}
}

Compliance Checks:

  1. Domain-Driven Design (DDD)

    • Entities tested in isolation
    • Value objects immutable
    • Domain services use interfaces
    • Aggregates maintain consistency
  2. Clean Architecture

    • Dependency flow is inward only
    • Use cases independent
    • Infrastructure tested through abstractions
  3. Hexagonal Architecture

    • Primary ports (driving adapters) tested
    • Secondary ports (driven adapters) mocked
    • Application core isolated
  4. Dependency Inversion Principle

    • High-level modules depend on abstractions
    • Low-level modules implement abstractions
    • Abstractions remain stable
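
As a concrete illustration of the "value objects immutable" check, a helper such as test_value_object_patterns might assert construction-by-validation and value equality. This is a sketch only; the actual helper's assertions may differ, and it assumes ChunkSize implements PartialEq and Debug.

// Sketch of one compliance helper; `ChunkSize` is used as a representative
// value object here.
async fn test_value_object_patterns() {
    // Value objects are created through validating constructors...
    let a = ChunkSize::new(64 * 1024).unwrap();
    let b = ChunkSize::new(64 * 1024).unwrap();

    // ...compare by value, not identity...
    assert_eq!(a, b);

    // ...and expose only read-only accessors.
    assert_eq!(a.value(), 64 * 1024);
}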

4. Test Automation

4.1 Continuous Integration

CI Pipeline:

# .github/workflows/ci.yml (example)
name: CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions-rs/toolchain@v1
        with:
          toolchain: stable
      - name: Run unit tests
        run: cargo test --lib
      - name: Run integration tests
        run: cargo test --test integration
      - name: Run E2E tests
        run: cargo test --test e2e
      - name: Run architecture tests
        run: cargo test --test architecture_compliance_test
      - name: Run doc tests
        run: cargo test --doc

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run clippy
        run: cargo clippy -- -D warnings
      - name: Check formatting
        run: cargo fmt -- --check

  coverage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install tarpaulin
        run: cargo install cargo-tarpaulin
      - name: Generate coverage
        run: cargo tarpaulin --out Xml
      - name: Upload to codecov
        uses: codecov/codecov-action@v3

4.2 Pre-commit Hooks

Git Hooks:

#!/bin/sh
# .git/hooks/pre-commit

# Run tests
cargo test --lib || exit 1

# Check formatting
cargo fmt -- --check || exit 1

# Run clippy
cargo clippy -- -D warnings || exit 1

echo "✅ All pre-commit checks passed"

4.3 Test Commands

Quick Feedback (fast):

cargo test --lib  # Unit tests only (~1 second)

Full Test Suite:

cargo test  # All tests (~15 seconds)

Specific Test Suites:

cargo test --test integration          # Integration tests
cargo test --test e2e                  # E2E tests
cargo test --test architecture_compliance_test  # Architecture tests
cargo test --doc                       # Documentation tests

With Coverage:

cargo tarpaulin --out Html --output-dir coverage/

5. Testing Tools and Frameworks

5.1 Core Testing Framework

Built-in Rust Testing:

  • #[test] attribute for unit tests
  • #[tokio::test] for async tests
  • assert!, assert_eq!, assert_ne! macros
  • #[should_panic] for error testing

Async Testing:

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_async_operation() {
    let result = async_function().await;
    assert!(result.is_ok());
}
}

5.2 Mocking and Test Doubles

Mockall:

#![allow(unused)]
fn main() {
#[automock]
#[async_trait]
pub trait PipelineRepository {
    async fn save(&self, pipeline: &Pipeline) -> Result<()>;
    async fn find_by_id(&self, id: &str) -> Result<Option<Pipeline>>;
}

#[tokio::test]
async fn test_with_mock_repository() {
    let mut mock_repo = MockPipelineRepository::new();
    mock_repo
        .expect_save()
        .times(1)
        .returning(|_| Ok(()));

    let service = PipelineService::new(Arc::new(mock_repo));
    let result = service.create_pipeline("test").await;

    assert!(result.is_ok());
}
}

5.3 Property-Based Testing

Proptest:

#![allow(unused)]
fn main() {
use proptest::prelude::*;

proptest! {
    #[test]
    fn test_chunk_size_always_valid(size in 1024usize..=100_000_000) {
        // Given: Any size within valid range
        let chunk_size = ChunkSize::new(size);

        // Then: Should always succeed
        prop_assert!(chunk_size.is_ok());
        prop_assert_eq!(chunk_size.unwrap().value(), size);
    }

    #[test]
    fn test_compression_roundtrip(
        data in prop::collection::vec(any::<u8>(), 0..10000)
    ) {
        // Given: Random byte array
        let compressed = compress(&data).unwrap();
        let decompressed = decompress(&compressed).unwrap();

        // Then: Should match original
        prop_assert_eq!(data, decompressed);
    }
}
}

5.4 Benchmarking

Criterion:

#![allow(unused)]
fn main() {
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_compression(c: &mut Criterion) {
    let data = vec![0u8; 1_000_000]; // 1 MB

    c.bench_function("brotli_compression", |b| {
        b.iter(|| {
            let adapter = BrotliAdapter::new(6);
            adapter.compress(black_box(&data))
        })
    });

    c.bench_function("zstd_compression", |b| {
        b.iter(|| {
            let adapter = ZstdAdapter::new(3);
            adapter.compress(black_box(&data))
        })
    });
}

criterion_group!(benches, bench_compression);
criterion_main!(benches);
}

Run Benchmarks:

cargo bench

5.5 Code Coverage

Cargo-tarpaulin:

# Install
cargo install cargo-tarpaulin

# Generate coverage report
cargo tarpaulin --out Html --output-dir coverage/

# View report
open coverage/index.html

Coverage Goals:

  • Overall: 80%+ line coverage
  • Domain layer: 90%+ coverage
  • Critical paths: 95%+ coverage

6. Test Data Management

6.1 Test Fixtures

Fixture Organization:

testdata/
├── input/
│   ├── sample.txt
│   ├── large_file.bin (10 MB)
│   └── document.pdf
├── expected/
│   ├── sample_compressed.bin
│   └── sample_encrypted.bin
└── schemas/
    └── v1_pipeline.json

Fixture Helpers:

#![allow(unused)]
fn main() {
pub fn create_test_file(content: &str) -> PathBuf {
    // Persist the temp directory so the file outlives this helper;
    // dropping the `TempDir` guard would delete the directory immediately.
    let temp_dir = tempfile::tempdir().unwrap().into_path();
    let file_path = temp_dir.join("test_file.txt");
    std::fs::write(&file_path, content).unwrap();
    file_path
}

pub fn create_test_pipeline() -> Pipeline {
    Pipeline::builder()
        .name("test-pipeline")
        .add_stage(compression_stage("compress", "zstd", 1))
        .add_stage(encryption_stage("encrypt", "aes256gcm", 2))
        .build()
        .unwrap()
}
}

6.2 Test Database

In-Memory SQLite:

#![allow(unused)]
fn main() {
async fn create_test_db() -> SqlitePool {
    let pool = SqlitePoolOptions::new()
        .connect(":memory:")
        .await
        .unwrap();

    // Run migrations
    sqlx::migrate!("./migrations")
        .run(&pool)
        .await
        .unwrap();

    pool
}
}

Test Isolation:

  • Each test gets fresh database
  • Transactions rolled back after test
  • No test interdependencies
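
In practice, each test simply builds its own :memory: database with create_test_db (shown above), so no state leaks between tests. The repository type and its constructor below are assumptions for illustration:

#[tokio::test]
async fn test_repository_uses_isolated_database() {
    // Fresh in-memory database per test: no shared state, no cleanup needed.
    let pool = create_test_db().await;
    let repo = SqlitePipelineRepository::new(pool); // hypothetical constructor

    let pipeline = create_test_pipeline();
    repo.save(&pipeline).await.unwrap();
    assert!(repo.find_by_name("test-pipeline").await.unwrap().is_some());
}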

7. Performance Testing

7.1 Benchmark Suites

Algorithm Benchmarks:

  • Compression algorithms (Brotli, Zstd, Gzip, LZ4)
  • Encryption algorithms (AES-256-GCM, ChaCha20-Poly1305)
  • Hashing algorithms (SHA-256, SHA-512, BLAKE3)

File Size Benchmarks:

  • Small files (< 1 MB)
  • Medium files (1-100 MB)
  • Large files (> 100 MB)

Concurrency Benchmarks:

  • Single-threaded vs multi-threaded
  • Async vs sync I/O
  • Chunk size variations
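
A sketch of how the chunk-size variation benchmark could be expressed with Criterion; the group name and the process_with_chunk_size helper are illustrative stand-ins for the real pipeline entry point:

use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};

// Stand-in for the pipeline's chunked processing entry point.
fn process_with_chunk_size(data: &[u8], chunk_size: usize) -> usize {
    data.chunks(chunk_size).map(|c| c.len()).sum()
}

fn bench_chunk_sizes(c: &mut Criterion) {
    let data = vec![0u8; 10_000_000]; // 10 MB of sample input
    let mut group = c.benchmark_group("chunk_size");

    for &chunk_size in &[64 * 1024, 256 * 1024, 1024 * 1024] {
        group.bench_with_input(
            BenchmarkId::from_parameter(chunk_size),
            &chunk_size,
            |b, &size| b.iter(|| process_with_chunk_size(black_box(&data), size)),
        );
    }
    group.finish();
}

criterion_group!(benches, bench_chunk_sizes);
criterion_main!(benches);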

7.2 Performance Regression Testing

Baseline Establishment:

# Run benchmarks and save baseline
cargo bench -- --save-baseline main

Regression Detection:

# Compare against baseline
cargo bench -- --baseline main

CI Integration:

  • Fail PR if >10% performance degradation
  • Alert on >5% degradation
  • Celebrate >10% improvement

8. Security Testing

8.1 Input Validation Testing

Test Cases:

  • Path traversal attempts (../../etc/passwd)
  • Command injection attempts
  • SQL injection (SQLx prevents, but verify)
  • Buffer overflow attempts
  • Invalid UTF-8 sequences

Example:

#![allow(unused)]
fn main() {
#[test]
fn test_rejects_path_traversal() {
    let malicious_path = "../../etc/passwd";
    let result = validate_input_path(malicious_path);
    assert!(result.is_err());
}
}

8.2 Cryptographic Testing

Test Cases:

  • Key derivation reproducibility
  • Encryption/decryption roundtrips
  • Authentication tag verification
  • Nonce uniqueness
  • Secure memory wiping

Example:

#![allow(unused)]
fn main() {
#[tokio::test]
async fn test_encryption_with_wrong_key_fails() -> Result<()> {
    let data = b"secret data";
    let correct_key = generate_key();
    let wrong_key = generate_key();

    let encrypted = encrypt(data, &correct_key).await?;
    let result = decrypt(&encrypted, &wrong_key).await;

    assert!(result.is_err());
    Ok(())
}
}

8.3 Dependency Security

Cargo-audit:

# Install
cargo install cargo-audit

# Check for vulnerabilities
cargo audit

# CI integration
cargo audit --deny warnings

Cargo-deny:

# Check licenses and security
cargo deny check

9. Test Metrics and Reporting

9.1 Test Metrics

Key Metrics:

  • Test count: 389 tests (15 ignored)
  • Test pass rate: Target 100%
  • Code coverage: Target 80%+
  • Test execution time: < 20 seconds for full suite
  • Benchmark performance: Track trends

Tracking:

# Test count
cargo test -- --list | wc -l

# Coverage
cargo tarpaulin --out Json | jq '.coverage'

# Execution time
time cargo test

9.2 Test Reporting

Console Output:

running 389 tests
test unit::test_pipeline_creation ... ok
test integration::test_repository_save ... ok
test e2e::test_complete_workflow ... ok

test result: ok. 374 passed; 0 failed; 15 ignored; 0 measured; 0 filtered out; finished in 15.67s

Coverage Report:

|| Tested/Total Lines:
|| src/domain/entities/pipeline.rs: 95/100 (95%)
|| src/domain/services/compression.rs: 87/95 (91.6%)
|| Overall: 2847/3421 (83.2%)

Benchmark Report:

brotli_compression      time:   [45.2 ms 46.1 ms 47.0 ms]
                        change: [-2.3% +0.1% +2.5%] (p = 0.91 > 0.05)
                        No change in performance detected.

zstd_compression        time:   [12.5 ms 12.7 ms 12.9 ms]
                        change: [-8.2% -6.5% -4.8%] (p = 0.00 < 0.05)
                        Performance has improved.

10. Test Maintenance

10.1 Test Review Process

Code Review Checklist:

  • Tests cover new functionality
  • Tests follow naming conventions
  • Tests are independent and isolated
  • Test data is appropriate
  • Assertions are clear and specific
  • Edge cases are tested
  • Error conditions are tested

10.2 Test Refactoring

When to Refactor Tests:

  • Duplicate test logic (extract helpers)
  • Brittle tests (too coupled to implementation)
  • Slow tests (optimize or move to integration)
  • Unclear test names (rename for clarity)

Test Helpers:

#![allow(unused)]
fn main() {
// Instead of duplicating this in every test:
#[test]
fn test_something() {
    let stage = PipelineStage::new(
        "compress".to_string(),
        StageType::Compression,
        StageConfiguration {
            algorithm: "zstd".to_string(),
            parameters: HashMap::new(),
            parallel_processing: false,
            chunk_size: None,
        },
        1
    ).unwrap();
    // ...
}

// Extract helper:
fn create_compression_stage(name: &str, order: usize) -> PipelineStage {
    PipelineStage::compression(name, "zstd", order).unwrap()
}

#[test]
fn test_something() {
    let stage = create_compression_stage("compress", 1);
    // ...
}
}

10.3 Flaky Test Prevention

Common Causes:

  • Time-dependent tests
  • Filesystem race conditions
  • Nondeterministic ordering
  • Shared state between tests

Solutions:

  • Mock time with mockall
  • Use unique temp directories
  • Sort collections before assertions
  • Ensure test isolation
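
For example, the unique-temp-directory solution uses tempfile so concurrent tests never collide on paths (sketch):

#[test]
fn test_writes_do_not_collide_across_tests() {
    // Each test gets its own directory; it is removed when `dir` is dropped.
    let dir = tempfile::tempdir().unwrap();
    let path = dir.path().join("output.bin");

    std::fs::write(&path, b"chunk data").unwrap();
    assert_eq!(std::fs::read(&path).unwrap(), b"chunk data");
}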

11. Test Schedule

11.1 Development Workflow

During Development:

# Quick check (unit tests only)
cargo test --lib

# Before commit
cargo test && cargo clippy

Before Push:

# Full test suite
cargo test

# Check formatting
cargo fmt -- --check

# Lint
cargo clippy -- -D warnings

Before Release:

# All tests
cargo test

# Benchmarks
cargo bench

# Coverage
cargo tarpaulin

# Security audit
cargo audit

11.2 CI/CD Integration

On Every Commit:

  • Unit tests
  • Integration tests
  • Clippy linting
  • Format checking

On Pull Request:

  • Full test suite
  • Coverage report
  • Benchmark comparison
  • Security audit

On Release:

  • Full test suite
  • Performance benchmarks
  • Security scan
  • Documentation build

12. Test Deliverables

12.1 Test Documentation

  • Test Organization Guide (docs/TEST_ORGANIZATION.md)
  • Architecture Compliance Tests (tests/architecture_compliance_test.rs)
  • This Software Test Plan

12.2 Test Code

  • 314 unit tests in source files (68 bootstrap + 90 pipeline + 156 pipeline-domain)
  • 35 integration tests in tests/integration/ (3 ignored)
  • 11 E2E tests in tests/e2e/
  • 2 architecture compliance tests
  • 27 documentation tests (12 ignored)
  • Benchmark suite (TODO)
  • Property-based tests (TODO)

12.3 Test Reports

Generated Artifacts:

  • Test execution report (console output)
  • Coverage report (HTML, XML)
  • Benchmark report (HTML, JSON)
  • Security audit report

13. Risks and Mitigation

13.1 Testing Risks

Risk                      Impact   Probability   Mitigation
───────────────────────   ──────   ───────────   ──────────────────────────────────────
Flaky tests               Medium   Low           Test isolation, deterministic behavior
Slow tests                Medium   Medium        Optimize, parallelize, tiered testing
Low coverage              High     Low           Coverage tracking, CI enforcement
Missing edge cases        High     Medium        Property-based testing, code review
Test maintenance burden   Medium   High          Helper functions, clear conventions

13.2 Mitigation Strategies

Flaky Tests:

  • Run tests multiple times in CI
  • Investigate and fix immediately
  • Use deterministic test data

Slow Tests:

  • Profile test execution
  • Optimize slow tests
  • Move to higher test level if appropriate

Low Coverage:

  • Track coverage in CI
  • Require minimum coverage for PRs
  • Review uncovered code paths

14. Conclusion

This Software Test Plan establishes a comprehensive testing strategy for the Adaptive Pipeline system. Key highlights:

  • 389 tests across all levels (314 unit, 35 integration, 11 E2E, 2 architecture, 27 doc)
  • Organized structure following Rust best practices
  • Automated CI/CD integration for continuous quality
  • High coverage goals (80%+ overall, 90%+ domain layer)
  • Multiple testing approaches (TDD, BDD, property-based)
  • Performance monitoring through benchmarks
  • Security validation through audits and crypto testing

The testing strategy ensures the system meets all requirements, maintains high quality, and remains maintainable as it evolves.


Appendix A: Test Command Reference

# Run all tests
cargo test

# Run specific test levels
cargo test --lib                    # Unit tests only
cargo test --test integration       # Integration tests
cargo test --test e2e              # E2E tests
cargo test --test architecture_compliance_test  # Architecture tests
cargo test --doc                    # Doc tests

# Run specific test
cargo test test_pipeline_creation

# Run tests matching pattern
cargo test pipeline

# Show test output
cargo test -- --nocapture

# Run tests in parallel (default)
cargo test

# Run tests serially
cargo test -- --test-threads=1

# Generate coverage
cargo tarpaulin --out Html

# Run benchmarks
cargo bench

# Security audit
cargo audit

Appendix B: Test Naming Conventions

Unit Tests:

  • test_<function>_<scenario>_<expected_result>
  • Example: test_pipeline_creation_with_empty_name_fails

Integration Tests:

  • test_<component>_<interaction>_<expected_result>
  • Example: test_repository_save_and_retrieve_pipeline

E2E Tests:

  • test_e2e_<workflow>_<scenario>
  • Example: test_e2e_complete_pipeline_roundtrip

Property Tests:

  • test_<property>_holds_for_all_<inputs>
  • Example: test_compression_roundtrip_succeeds_for_all_data

Public API Reference

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

Public API documentation and examples.

API Overview

TODO: Add API overview

Core Types

TODO: List core public types

Examples

TODO: Add usage examples

Generated Documentation

See rustdoc API documentation

Internal APIs

Version: 0.1.0 Date: October 08, 2025 SPDX-License-Identifier: BSD-3-Clause License File: See the LICENSE file in the project root. Copyright: © 2025 Michael Gardner, A Bit of Help, Inc. Authors: Michael Gardner Status: Draft

Internal API documentation for contributors.

Internal Architecture

TODO: Add internal API overview

Module Organization

TODO: Explain module structure

Extension Points

TODO: List internal extension points