OpenDB
OpenDB is a high-performance hybrid embedded database written in pure Rust, combining multiple database paradigms into a single, cohesive system.
Features
- Key-Value Store: Fast point lookups and range scans
- Structured Records: Document/row storage with schema support
- Graph Database: Relationships and graph traversals
- Vector Search: Semantic search with HNSW-based approximate nearest neighbors
- In-Memory Cache: LRU cache for hot data
- ACID Transactions: Full transactional guarantees with WAL
Why OpenDB?
OpenDB is designed for applications that need multiple database capabilities without the complexity of managing separate systems:
- Agent Memory Systems: Store and recall facts, relationships, and semantic information
- Knowledge Graphs: Build and traverse complex relationship networks
- Semantic Search: Find similar content using vector embeddings
- High-Performance Applications: LSM-tree backend for excellent write throughput
Repository
- GitHub: muhammad-fiaz/OpenDB
- Documentation: https://muhammad-fiaz.github.io/opendb
- Contact: contact@muhammadfiaz.com
Quick Example
```rust
use opendb::{OpenDB, Memory};

fn main() -> opendb::Result<()> {
    // Open the database
    let db = OpenDB::open("./my_database")?;

    // Store a memory with an embedding
    let memory = Memory::new(
        "memory_1",
        "Rust is awesome!",
        vec![0.1, 0.2, 0.3],
        0.9, // importance
    );
    db.insert_memory(&memory)?;

    // Create a relationship (from, to, relation)
    db.link("memory_1", "memory_2", "related_to")?;

    // Vector search: top 5 nearest neighbors
    let similar = db.search_similar(&[0.1, 0.2, 0.3], 5)?;
    println!("Found {} similar memories", similar.len());

    Ok(())
}
```
Installation
From crates.io (once published)
cargo add opendb
From source
- Clone the repository:
git clone https://github.com/muhammad-fiaz/OpenDB.git
cd OpenDB
- Build the project:
cargo build --release
- Run tests:
cargo test
- Run examples:
cargo run --example quickstart
cargo run --example memory_agent
cargo run --example graph_relations
Requirements
- Rust: 1.70.0 or higher (Rust 2021 edition)
- Operating System: Linux, macOS, or Windows
- Dependencies: All dependencies are managed by Cargo
System Dependencies
OpenDB uses RocksDB as its storage backend, which requires:
- Linux: gcc, g++, make, libsnappy-dev, zlib1g-dev, libbz2-dev, liblz4-dev
- macOS: Xcode command line tools
- Windows: Visual Studio Build Tools
Linux Setup
# Ubuntu/Debian
sudo apt-get install -y gcc g++ make libsnappy-dev zlib1g-dev libbz2-dev liblz4-dev
# Fedora/RHEL
sudo dnf install -y gcc gcc-c++ make snappy-devel zlib-devel bzip2-devel lz4-devel
macOS Setup
xcode-select --install
Windows Setup
Install Visual Studio Build Tools
Verifying Installation
cargo test --all
All tests should pass. If you encounter issues, please check:
- Rust version: `rustc --version`
- Build dependencies are installed
- Open an issue if problems persist
Quick Start
This guide will walk you through the basic usage of OpenDB.
Opening a Database
```rust
use opendb::{OpenDB, Result};

fn main() -> Result<()> {
    // Open or create a database
    let db = OpenDB::open("./my_database")?;
    Ok(())
}
```
Working with Key-Value Data
```rust
// Store a value
db.put(b"my_key", b"my_value")?;

// Retrieve a value
if let Some(value) = db.get(b"my_key")? {
    println!("Value: {:?}", value);
}

// Check existence
if db.exists(b"my_key")? {
    println!("Key exists!");
}

// Delete a value
db.delete(b"my_key")?;
```
Working with Memory Records
Memory records are structured data with embeddings for semantic search.
```rust
use opendb::Memory;

// Create a memory
let memory = Memory::new(
    "memory_001",
    "The user prefers dark mode",
    vec![0.1, 0.2, 0.3, 0.4], // embedding vector
    0.9,                      // importance (0.0 to 1.0)
)
.with_metadata("category", "preference")
.with_metadata("source", "user_settings");

// Insert the memory
db.insert_memory(&memory)?;

// Retrieve it
if let Some(mem) = db.get_memory("memory_001")? {
    println!("Content: {}", mem.content);
    println!("Importance: {}", mem.importance);
}

// List all memories with a prefix
let all = db.list_memories("memory")?;
println!("Found {} memories", all.len());
```
Creating Relationships
```rust
// Create relationships between memories (from, to, relation)
db.link("memory_001", "memory_002", "related_to")?;
db.link("memory_001", "memory_003", "caused_by")?;

// Query relationships
let related = db.get_related("memory_001", "related_to")?;
for edge in related {
    println!("Related memory: {}", edge.to);
}

// Get all outgoing edges
let edges = db.get_outgoing("memory_001")?;
for edge in edges {
    println!("{} --[{}]--> {}", edge.from, edge.relation, edge.to);
}
```
Vector Search
```rust
// Search for the 5 most similar memories
let query_embedding = vec![0.1, 0.2, 0.3, 0.4];
let results = db.search_similar(&query_embedding, 5)?;

for result in results {
    println!(
        "Memory: {} (distance: {:.4})",
        result.memory.content, result.distance
    );
}
```
Using Transactions
```rust
// Begin a transaction
let mut txn = db.begin_transaction()?;

// Perform operations
txn.put("records", b"key1", b"value1")?;
txn.put("records", b"key2", b"value2")?;

// Commit the transaction
txn.commit()?;

// Or roll back instead:
// txn.rollback()?;
```
Flushing to Disk
```rust
// Ensure all writes are persisted
db.flush()?;
```
Complete Example
See the quickstart example for a complete, runnable example.
Architecture Overview
OpenDB is designed as a modular, hybrid database system that combines multiple database paradigms while maintaining high performance and ACID guarantees.
System Architecture
```
┌───────────────────────────────────────────────────────────┐
│                     OpenDB Public API                     │
├──────────────┬──────────────┬──────────────┬──────────────┤
│  Key-Value   │   Records    │    Graph     │   Vectors    │
│    Store     │   (Memory)   │  Relations   │    (HNSW)    │
├──────────────┴──────────────┴──────────────┴──────────────┤
│                Transaction Manager (ACID)                 │
│              WAL + Optimistic Locking + MVCC              │
├───────────────────────────────────────────────────────────┤
│                      LRU Cache Layer                      │
│              (Write-Through + Invalidation)               │
├───────────────────────────────────────────────────────────┤
│             Storage Trait (Pluggable Backend)             │
├───────────────────────────────────────────────────────────┤
│                 RocksDB Backend (LSM Tree)                │
│        Column Families + Native Transactions + WAL        │
└───────────────────────────────────────────────────────────┘
```
Core Components
1. Storage Layer
- Backend: RocksDB (high-performance LSM tree)
- Column Families: Namespace isolation for different data types
- Persistence: Write-Ahead Log (WAL) for durability
2. Transaction Manager
- ACID Guarantees: Full transactional support
- Isolation: Snapshot isolation via RocksDB transactions
- Concurrency: Optimistic locking
3. Cache Layer
- Strategy: LRU (Least Recently Used)
- Write Policy: Write-through (update storage first, then cache)
- Coherency: Automatic invalidation on delete
4. Feature Modules
Key-Value Store
- Direct byte-level storage
- Prefix scans
- Cache-accelerated reads
Records Manager
- Structured Memory records
- Codec: rkyv (zero-copy deserialization)
- Metadata support
Graph Manager
- Bidirectional adjacency lists
- Forward index: `from → [(relation, to)]`
- Backward index: `to → [(relation, from)]`
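The dual-index layout can be sketched with plain `HashMap`s. This is a simplified illustration of the idea, not OpenDB's actual persisted structures (which live in the `graph_forward`/`graph_backward` column families):

```rust
use std::collections::HashMap;

// Simplified dual adjacency index (illustration only).
#[derive(Default)]
struct GraphIndex {
    forward: HashMap<String, Vec<(String, String)>>,  // from -> [(relation, to)]
    backward: HashMap<String, Vec<(String, String)>>, // to -> [(relation, from)]
}

impl GraphIndex {
    fn link(&mut self, from: &str, to: &str, relation: &str) {
        // Every edge is written to both indexes, so outgoing and
        // incoming queries are each a single lookup.
        self.forward.entry(from.to_string()).or_default()
            .push((relation.to_string(), to.to_string()));
        self.backward.entry(to.to_string()).or_default()
            .push((relation.to_string(), from.to_string()));
    }

    fn outgoing(&self, id: &str) -> &[(String, String)] {
        self.forward.get(id).map(Vec::as_slice).unwrap_or(&[])
    }

    fn incoming(&self, id: &str) -> &[(String, String)] {
        self.backward.get(id).map(Vec::as_slice).unwrap_or(&[])
    }
}

fn main() {
    let mut g = GraphIndex::default();
    g.link("mem_001", "mem_002", "related_to");
    assert_eq!(g.outgoing("mem_001").len(), 1);
    assert_eq!(g.incoming("mem_002")[0].1, "mem_001");
}
```

The cost of doubling the write is that `get_outgoing` and `get_incoming` never have to scan the opposite index.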
Vector Manager
- HNSW index for approximate nearest neighbor search
- Automatic index rebuilding
- Configurable search quality
Data Flow
Write Path
Application → OpenDB API → Cache (update) → Storage Backend → WAL → Disk
Read Path (Cache Hit)
Application → OpenDB API → Cache → Return
Read Path (Cache Miss)
Application → OpenDB API → Cache (miss) → Storage Backend → Cache (populate) → Return
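The two read paths and the write-through write path can be modeled in a few lines. This is a toy sketch using `HashMap`s for both tiers; OpenDB's real cache is an LRU and its backend is RocksDB:

```rust
use std::collections::HashMap;

// Toy model of the write and read paths (illustration only).
struct Db {
    cache: HashMap<Vec<u8>, Vec<u8>>,
    storage: HashMap<Vec<u8>, Vec<u8>>,
}

impl Db {
    // Write path: storage first (durability), then cache.
    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.storage.insert(key.to_vec(), value.to_vec());
        self.cache.insert(key.to_vec(), value.to_vec());
    }

    // Read path: a cache hit returns immediately; a miss falls back
    // to storage and populates the cache for the next read.
    fn get(&mut self, key: &[u8]) -> Option<Vec<u8>> {
        if let Some(v) = self.cache.get(key) {
            return Some(v.clone()); // cache hit
        }
        let v = self.storage.get(key).cloned()?; // cache miss
        self.cache.insert(key.to_vec(), v.clone()); // populate
        Some(v)
    }
}

fn main() {
    let mut db = Db { cache: HashMap::new(), storage: HashMap::new() };
    db.put(b"greeting", b"hello");
    db.cache.clear(); // simulate an eviction
    assert_eq!(db.get(b"greeting"), Some(b"hello".to_vec())); // served from storage
    assert!(db.cache.contains_key(b"greeting".as_slice()));   // and cached again
}
```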
Design Decisions
Why RocksDB?
Advantages:
- Production-tested LSM tree
- Excellent write throughput
- Built-in WAL and transactions
- Column families for organization
Tradeoffs:
- Not pure Rust (C++ with bindings)
- Larger binary size
Alternatives Considered:
- `redb`: Pure Rust, B-tree based, simpler but lower throughput
- `sled`: Pure Rust, but less mature and maintenance concerns
- Custom LSM: Too much complexity for the initial version
Why rkyv for Serialization?
Advantages:
- Zero-copy deserialization (fast reads)
- Schema versioning support
- Type safety
Alternatives:
- `bincode`: Simpler but requires full deserialization
- `serde_json`: Human-readable but slower
Why HNSW for Vector Search?
Advantages:
- Excellent accuracy/speed tradeoff
- Logarithmic search complexity
- Works well for high-dimensional data
Alternatives:
- IVF (Inverted File Index): Faster but less accurate
- Flat index: Exact but O(n) search
Storage Layer
RocksDB Backend
OpenDB uses RocksDB as its default storage backend, providing a robust foundation for ACID transactions and high-performance data access.
Column Families
Data is organized into separate column families (namespaces):
| Column Family | Purpose | Data Format |
|---|---|---|
| `default` | Key-value store | Raw bytes |
| `records` | Memory records | rkyv-encoded `Memory` structs |
| `graph_forward` | Forward adjacency list | rkyv-encoded `Edge` arrays |
| `graph_backward` | Backward adjacency list | rkyv-encoded `Edge` arrays |
| `vector_data` | Vector embeddings | bincode-encoded `f32` arrays |
| `vector_index` | HNSW metadata | (currently in-memory) |
| `metadata` | DB metadata | JSON |
Storage Trait
The storage layer is abstracted behind a trait, allowing for pluggable backends:
```rust
pub trait StorageBackend: Send + Sync {
    fn get(&self, cf: &str, key: &[u8]) -> Result<Option<Vec<u8>>>;
    fn put(&self, cf: &str, key: &[u8], value: &[u8]) -> Result<()>;
    fn delete(&self, cf: &str, key: &[u8]) -> Result<()>;
    fn scan_prefix(&self, cf: &str, prefix: &[u8]) -> Result<Vec<(Vec<u8>, Vec<u8>)>>;
    fn begin_transaction(&self) -> Result<Box<dyn Transaction>>;
    fn flush(&self) -> Result<()>;
}
```
Performance Tuning
RocksDB is configured with optimizations for mixed read/write workloads:
```rust
// Write buffer: 128 MB
opts.set_write_buffer_size(128 * 1024 * 1024);

// Number of write buffers: 3
opts.set_max_write_buffer_number(3);

// Target file size: 64 MB
opts.set_target_file_size_base(64 * 1024 * 1024);

// Compression: LZ4
opts.set_compression_type(rocksdb::DBCompressionType::Lz4);
```
Write-Ahead Log (WAL)
RocksDB's WAL ensures durability:
- All writes are first appended to the WAL
- Then applied to memtables
- Periodically flushed to SST files
- Old WAL segments are deleted after checkpoint
LSM Tree Structure
RocksDB uses a Log-Structured Merge (LSM) tree:
Write Path:
Write → WAL → MemTable → (flush) → L0 SST → (compact) → L1 SST → ...
Read Path:
Read → MemTable → Block Cache → L0 → L1 → ... → Ln
Advantages
- Write Amplification: Minimized for sequential writes
- Compression: Data is compressed at each level
- Compaction: Background process merges and cleans data
Tradeoffs
- Read Amplification: May need to check multiple levels
- Space Amplification: Compaction creates temporary overhead
Future Backend Options
redb (Pure Rust B-Tree)
Pros:
- Pure Rust, no C++ dependencies
- Simpler architecture
- Good for read-heavy workloads
Cons:
- Lower write throughput than LSM
- Less mature
Custom LSM Implementation
Pros:
- Full control over optimization
- Pure Rust
Cons:
- High development and maintenance cost
- Risk of bugs in critical path
Transaction Model
OpenDB provides full ACID (Atomicity, Consistency, Isolation, Durability) guarantees through RocksDB's transaction support.
ACID Properties
Atomicity
All operations in a transaction either succeed together or fail together.
```rust
let mut txn = db.begin_transaction()?;
txn.put("records", b"key1", b"value1")?;
txn.put("records", b"key2", b"value2")?;
txn.commit()?; // Both writes succeed or both fail
```
Consistency
Transactions move the database from one consistent state to another.
Isolation
Transactions use snapshot isolation:
- Each transaction sees a consistent snapshot of the database
- Concurrent transactions don't interfere with each other
- RocksDB provides MVCC (Multi-Version Concurrency Control)
Durability
Once a transaction commits, the changes are permanent:
- Write-Ahead Log (WAL) ensures durability
- Data survives process crashes
- Can be verified by reopening the database
Transaction API
Basic Usage
```rust
// Begin transaction
let mut txn = db.begin_transaction()?;

// Perform operations
txn.put("records", b"key1", b"value1")?;
let val = txn.get("records", b"key1")?;

// Commit
txn.commit()?;
```
Rollback
```rust
let mut txn = db.begin_transaction()?;
txn.put("records", b"key1", b"modified")?;

// Something went wrong, roll back
txn.rollback()?; // Original value remains unchanged
```
Auto-Rollback
Transactions are automatically rolled back if dropped without commit:
```rust
{
    let mut txn = db.begin_transaction()?;
    txn.put("records", b"key1", b"value")?;
    // txn dropped here - auto rollback
}
```
Concurrency Model
Optimistic Locking
RocksDB transactions use optimistic locking:
- Read phase: Transaction reads data without locks
- Validation phase: Before commit, check if data changed
- Write phase: If no conflicts, commit; otherwise abort
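The three phases can be modeled with version stamps. This is a minimal sketch of optimistic validation, not RocksDB's actual mechanism; the `Store` and `Txn` types here are hypothetical:

```rust
use std::collections::HashMap;

// key -> (version, value); the version bumps on every committed write.
struct Store { data: HashMap<String, (u64, String)> }

// A transaction records the versions it read and the writes it buffered.
struct Txn { reads: Vec<(String, u64)>, writes: Vec<(String, String)> }

impl Txn {
    fn commit(self, store: &mut Store) -> Result<(), &'static str> {
        // Validation phase: abort if any key we read has since changed.
        for (key, version) in &self.reads {
            if store.data.get(key).map(|(v, _)| *v) != Some(*version) {
                return Err("conflict");
            }
        }
        // Write phase: apply buffered writes, bumping versions.
        for (key, value) in self.writes {
            let v = store.data.get(&key).map(|(v, _)| v + 1).unwrap_or(0);
            store.data.insert(key, (v, value));
        }
        Ok(())
    }
}

fn main() {
    let mut store = Store { data: HashMap::new() };
    store.data.insert("counter".into(), (0, "0".into()));

    // txn_a read version 0 and commits first: succeeds, version becomes 1.
    let txn_a = Txn {
        reads: vec![("counter".into(), 0)],
        writes: vec![("counter".into(), "1".into())],
    };
    assert!(txn_a.commit(&mut store).is_ok());

    // txn_b also read version 0, but the key is now at version 1: conflict.
    let txn_b = Txn {
        reads: vec![("counter".into(), 0)],
        writes: vec![("counter".into(), "2".into())],
    };
    assert_eq!(txn_b.commit(&mut store), Err("conflict"));
}
```

No locks are held during the read phase; the price is that a conflicting transaction must retry, as shown in the conflict-handling example later in this chapter.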
Conflict Detection
```rust
// Transaction 1
let mut txn1 = db.begin_transaction()?;
txn1.put("records", b"counter", b"1")?;

// Transaction 2 (concurrent)
let mut txn2 = db.begin_transaction()?;
txn2.put("records", b"counter", b"2")?;

// First to commit wins
txn1.commit()?; // Success
txn2.commit()?; // May fail with a conflict error
```
Snapshot Isolation Example
```rust
// Initial state: counter = 0
db.put(b"counter", b"0")?;

// Transaction 1 reads
let mut txn1 = db.begin_transaction()?;
let val1 = txn1.get("default", b"counter")?;

// Meanwhile, Transaction 2 updates
let mut txn2 = db.begin_transaction()?;
txn2.put("default", b"counter", b"5")?;
txn2.commit()?;

// Transaction 1 still sees its original snapshot
let val1_again = txn1.get("default", b"counter")?;
assert_eq!(val1, val1_again); // Still "0"
```
Best Practices
Keep Transactions Short
```rust
// Bad: one long-running transaction
let mut txn = db.begin_transaction()?;
for i in 0..1_000_000 {
    txn.put("default", i.to_string().as_bytes(), b"value")?;
}
txn.commit()?;

// Good: batch commits
for chunk in (0..1_000_000).collect::<Vec<_>>().chunks(1000) {
    let mut txn = db.begin_transaction()?;
    for i in chunk {
        txn.put("default", i.to_string().as_bytes(), b"value")?;
    }
    txn.commit()?;
}
```
Handle Conflicts
```rust
loop {
    let mut txn = db.begin_transaction()?;

    // Read-modify-write
    let val = txn.get("default", b"counter")?.unwrap_or_default();
    let new_val = increment(val);
    txn.put("default", b"counter", &new_val)?;

    match txn.commit() {
        Ok(_) => break,
        Err(Error::Transaction(_)) => continue, // Retry on conflict
        Err(e) => return Err(e),
    }
}
```
Use Snapshots for Consistent Reads
For read-only operations across multiple keys, use snapshots (coming soon):
```rust
let snapshot = db.snapshot()?;
let val1 = snapshot.get("records", b"key1")?;
let val2 = snapshot.get("records", b"key2")?;
// val1 and val2 are from the same consistent point in time
```
Limitations
- Transactions are single-threaded (one transaction per thread)
- Cross-column-family transactions are supported
- Very large transactions may impact performance
Caching Strategy
OpenDB uses an LRU (Least Recently Used) cache to accelerate reads while maintaining consistency.
Cache Architecture
```
┌───────────────────────────────────┐
│            Application            │
└─────────────────┬─────────────────┘
                  │
              Read/Write
                  │
┌─────────────────▼─────────────────┐
│             LRU Cache             │
│   ┌──────┬──────┬──────┬──────┐   │
│   │ Hot1 │ Hot2 │ Hot3 │ Hot4 │   │
│   └──────┴──────┴──────┴──────┘   │
└─────────────────┬─────────────────┘
                  │
           Cache Miss/Write
                  │
┌─────────────────▼─────────────────┐
│          Storage Backend          │
│             (RocksDB)             │
└───────────────────────────────────┘
```
Write-Through Policy
All writes go to storage first, then update the cache:
```rust
pub fn put(&self, key: &[u8], value: &[u8]) -> Result<()> {
    // 1. Write to storage (ensures durability)
    self.storage.put(ColumnFamilies::DEFAULT, key, value)?;

    // 2. Update cache
    self.cache.insert(key.to_vec(), value.to_vec());
    Ok(())
}
```
Why Write-Through?
- ✅ Durability: Data is persisted immediately
- ✅ Consistency: Cache never has uncommitted data
- ❌ Slower writes: Every write hits disk
Alternative: Write-Back
- ✅ Faster writes (batch to disk later)
- ❌ Risk of data loss if crash before flush
- ❌ More complex consistency model
Cache Invalidation
Deletes remove from both cache and storage:
```rust
pub fn delete(&self, key: &[u8]) -> Result<()> {
    // 1. Delete from storage
    self.storage.delete(ColumnFamilies::DEFAULT, key)?;

    // 2. Invalidate cache
    self.cache.invalidate(&key.to_vec());
    Ok(())
}
```
LRU Eviction
When cache reaches capacity, least-recently-used items are evicted:
Cache (capacity = 3):
```
Put("A", "1") → [A]
Put("B", "2") → [B, A]
Put("C", "3") → [C, B, A]
Get("A")      → [A, C, B]   # A is now most recent
Put("D", "4") → [D, A, C]   # B evicted (LRU)
```
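The trace above can be reproduced with a tiny LRU built on a `VecDeque`. This is an illustration of the eviction policy only; OpenDB uses a proper `LruCache` type rather than this linear-scan sketch:

```rust
use std::collections::VecDeque;

// Minimal LRU for illustration; front of the deque = most recently used.
struct TinyLru { cap: usize, items: VecDeque<(String, String)> }

impl TinyLru {
    fn new(cap: usize) -> Self { Self { cap, items: VecDeque::new() } }

    fn put(&mut self, k: &str, v: &str) {
        self.items.retain(|(key, _)| key != k);
        self.items.push_front((k.to_string(), v.to_string()));
        if self.items.len() > self.cap {
            self.items.pop_back(); // evict the least recently used entry
        }
    }

    fn get(&mut self, k: &str) -> Option<String> {
        let pos = self.items.iter().position(|(key, _)| key == k)?;
        let entry = self.items.remove(pos)?;
        let val = entry.1.clone();
        self.items.push_front(entry); // promote to most recent
        Some(val)
    }

    fn keys(&self) -> Vec<&str> {
        self.items.iter().map(|(k, _)| k.as_str()).collect()
    }
}

fn main() {
    let mut cache = TinyLru::new(3);
    cache.put("A", "1");
    cache.put("B", "2");
    cache.put("C", "3");
    let _ = cache.get("A"); // A becomes most recent
    cache.put("D", "4");    // B (least recently used) is evicted
    assert_eq!(cache.keys(), vec!["D", "A", "C"]);
    assert_eq!(cache.get("B"), None);
}
```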
Cache Sizes
Default cache sizes:
```rust
pub struct OpenDBOptions {
    pub kv_cache_size: usize,     // Default: 1000
    pub record_cache_size: usize, // Default: 500
}
```
Tuning Cache Size
```rust
let mut options = OpenDBOptions::default();
options.kv_cache_size = 10_000;    // More KV entries
options.record_cache_size = 2_000; // More Memory records

let db = OpenDB::open_with_options("./db", options)?;
```
Guidelines:
- Small cache (100-1000): Low memory, high cache miss rate
- Medium cache (1000-10000): Balanced for most workloads
- Large cache (10000+): High memory, low cache miss rate
Cache Hit Rates
Monitor effectiveness (metrics to be added):
Hit Rate = Cache Hits / Total Reads
- > 80%: Excellent, cache is effective
- 50-80%: Good, consider increasing size
- < 50%: Poor, increase cache or review access patterns
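Until OpenDB exposes these metrics, the calculation is simple enough to do from your own counters (the counter names here are hypothetical):

```rust
// Hit rate from application-side counters (hypothetical; OpenDB does
// not yet expose hit/miss statistics).
fn hit_rate(hits: u64, total_reads: u64) -> f64 {
    if total_reads == 0 {
        return 0.0;
    }
    hits as f64 / total_reads as f64
}

fn main() {
    let rate = hit_rate(850, 1000);
    assert!(rate > 0.8); // falls in the "excellent" band above
    println!("hit rate: {:.1}%", rate * 100.0);
}
```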
Multi-Level Caching
OpenDB has two cache levels:
- Application Cache (LRU): In-process, fast
- RocksDB Block Cache: Built into RocksDB, shared
RocksDB Block Cache
RocksDB has its own block cache (not exposed in current API):
```rust
// Future tuning option
opts.set_block_cache_size(256 * 1024 * 1024); // 256 MB
```
Concurrent Access
Caches use parking_lot::RwLock for thread safety:
```rust
pub struct LruMemoryCache<K, V> {
    cache: RwLock<LruCache<K, V>>,
}
```
- Reads: Multiple concurrent readers
- Writes: Exclusive lock during insert/evict
Cache Coherency Guarantees
- Write Visibility: Writes are immediately visible after `put()` returns
- Delete Visibility: Deletes are immediately visible after `delete()` returns
- Transaction Isolation: Transactions bypass the cache (reads come from the storage snapshot)
Best Practices
Warm Up Cache
```rust
// Preload important data
let important_ids = vec!["mem_001", "mem_002", "mem_003"];
for id in important_ids {
    db.get_memory(id)?; // Populate cache
}
```
Avoid Thrashing
```rust
// Bad: random access pattern, poor cache hit rate
for _ in 0..1_000_000 {
    let random_key = generate_random_key();
    db.get(&random_key)?;
}

// Good: sequential or localized access
for i in 0..1000 {
    db.get(format!("key_{}", i).as_bytes())?;
}
```
Cache Bypass for Large Scans
For scanning large datasets, consider bypassing cache (future feature):
```rust
// Future API
db.scan_prefix_no_cache(b"prefix")?;
```
Key-Value Store API
OpenDB provides a simple, fast key-value interface for storing arbitrary binary data.
Basic Operations
Put
Store a value under a key:
```rust
use opendb::OpenDB;

let db = OpenDB::open("./db")?;
db.put(b"user:123", b"Alice")?;
```
Signature:
#![allow(unused)] fn main() { pub fn put(&self, key: &[u8], value: &[u8]) -> Result<()> }
Behavior:
- Writes to storage immediately (write-through cache)
- Updates cache
- Returns error if storage fails
Get
Retrieve a value by key:
```rust
let value = db.get(b"user:123")?;
match value {
    Some(bytes) => println!("Found: {}", String::from_utf8_lossy(&bytes)),
    None => println!("Not found"),
}
```
Signature:
#![allow(unused)] fn main() { pub fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>> }
Behavior:
- Checks cache first (fast path)
- Falls back to storage on cache miss
- Returns `None` if the key doesn't exist
Delete
Remove a key-value pair:
#![allow(unused)] fn main() { db.delete(b"user:123")?; }
Signature:
#![allow(unused)] fn main() { pub fn delete(&self, key: &[u8]) -> Result<()> }
Behavior:
- Removes from storage
- Invalidates cache entry
- Succeeds even if key doesn't exist
Exists
Check if a key exists without fetching the value:
#![allow(unused)] fn main() { if db.exists(b"user:123")? { println!("User exists"); } }
Signature:
#![allow(unused)] fn main() { pub fn exists(&self, key: &[u8]) -> Result<bool> }
Behavior:
- Checks cache first
- Falls back to storage on cache miss
- More efficient than `get()` for existence checks
Advanced Operations
Scan Prefix
Iterate over all keys with a common prefix:
```rust
let users = db.scan_prefix(b"user:")?;
for (key, value) in users {
    println!(
        "{} = {}",
        String::from_utf8_lossy(&key),
        String::from_utf8_lossy(&value)
    );
}
```
Signature:
#![allow(unused)] fn main() { pub fn scan_prefix(&self, prefix: &[u8]) -> Result<Vec<(Vec<u8>, Vec<u8>)>> }
Behavior:
- Bypasses cache (reads from storage)
- Returns all matching key-value pairs
- Sorted by key (lexicographic order)
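These semantics (start at the prefix, stop at the first non-matching key, results key-sorted) can be modeled over any sorted map. A sketch using `BTreeMap` as a stand-in for the sorted key space; OpenDB actually scans RocksDB, not an in-memory map:

```rust
use std::collections::BTreeMap;

// Prefix-scan semantics over sorted keys (illustration only).
fn scan_prefix<'a>(
    map: &'a BTreeMap<Vec<u8>, Vec<u8>>,
    prefix: &[u8],
) -> Vec<(&'a [u8], &'a [u8])> {
    map.range(prefix.to_vec()..) // seek to the first key >= prefix
        .take_while(|(k, _)| k.starts_with(prefix)) // stop past the prefix
        .map(|(k, v)| (k.as_slice(), v.as_slice()))
        .collect()
}

fn main() {
    let mut map = BTreeMap::new();
    map.insert(b"session:abc".to_vec(), b"u1".to_vec());
    map.insert(b"user:123".to_vec(), b"Alice".to_vec());
    map.insert(b"user:456".to_vec(), b"Bob".to_vec());

    let users = scan_prefix(&map, b"user:");
    assert_eq!(users.len(), 2);
    assert_eq!(users[0].0, &b"user:123"[..]); // results come back key-sorted
}
```

Because the underlying keys are sorted, the scan touches only the contiguous range of matching keys rather than the whole keyspace.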
Usage Patterns
Namespacing
Use prefixes to organize data:
```rust
// User namespace
db.put(b"user:123", b"Alice")?;
db.put(b"user:456", b"Bob")?;

// Session namespace
db.put(b"session:abc", b"user:123")?;
db.put(b"session:xyz", b"user:456")?;

// Scan all users
let users = db.scan_prefix(b"user:")?;
```
Counter
Implement atomic counters with transactions:
```rust
fn increment_counter(db: &OpenDB, key: &[u8]) -> Result<u64> {
    let mut txn = db.begin_transaction()?;
    let current = txn.get("default", key)?
        .map(|v| u64::from_le_bytes(v.try_into().unwrap()))
        .unwrap_or(0);
    let new_val = current + 1;
    txn.put("default", key, &new_val.to_le_bytes())?;
    txn.commit()?;
    Ok(new_val)
}

let count = increment_counter(&db, b"visits")?;
```
Binary Data
Store any serializable type:
```rust
use serde::{Serialize, Deserialize};

#[derive(Serialize, Deserialize)]
struct Config {
    host: String,
    port: u16,
}

let config = Config {
    host: "localhost".to_string(),
    port: 8080,
};

// Serialize
let bytes = bincode::serialize(&config)?;
db.put(b"config", &bytes)?;

// Deserialize
let bytes = db.get(b"config")?.unwrap();
let config: Config = bincode::deserialize(&bytes)?;
```
Performance Characteristics
| Operation | Time Complexity | Cache Hit | Cache Miss |
|---|---|---|---|
| `get()` | O(1) avg | ~100 ns | ~1-10 µs |
| `put()` | O(log n) | ~1-10 µs | ~1-10 µs |
| `delete()` | O(log n) | ~1-10 µs | ~1-10 µs |
| `exists()` | O(1) avg | ~100 ns | ~1-10 µs |
| `scan_prefix()` | O(k log n) | N/A | ~10 µs + k×1 µs |
Where:
- `n` = total keys in the database
- `k` = number of matching keys
Error Handling
All operations return Result<T, Error>:
```rust
use opendb::{OpenDB, Error};

match db.get(b"key") {
    Ok(Some(value)) => { /* use value */ },
    Ok(None) => { /* key not found */ },
    Err(Error::Storage(e)) => { /* storage error */ },
    Err(Error::Cache(e)) => { /* cache error */ },
    Err(e) => { /* other error */ },
}
```
Thread Safety
All KV operations are thread-safe:
```rust
use std::sync::Arc;
use std::thread;

let db = Arc::new(OpenDB::open("./db")?);

let handles: Vec<_> = (0..10).map(|i| {
    let db = Arc::clone(&db);
    thread::spawn(move || {
        db.put(format!("key_{}", i).as_bytes(), b"value").unwrap();
    })
}).collect();

for handle in handles {
    handle.join().unwrap();
}
```
Records API
The Records API manages structured Memory objects with metadata, timestamps, and embeddings.
Memory Type
```rust
pub struct Memory {
    pub id: String,
    pub content: String,
    pub embedding: Vec<f32>,
    pub importance: f64,
    pub timestamp: i64,
    pub metadata: HashMap<String, String>,
}
```
Creating Memories
New Memory
```rust
use opendb::{OpenDB, Memory};

let memory = Memory::new(
    "mem_001".to_string(),
    "User asked about Rust ownership".to_string(),
);
```
With Metadata
```rust
let memory = Memory::new("mem_002".to_string(), "Content".to_string())
    .with_metadata("category", "conversation")
    .with_metadata("user_id", "123");
```
Custom Builder
```rust
use std::collections::HashMap;

let mut metadata = HashMap::new();
metadata.insert("priority".to_string(), "high".to_string());

let memory = Memory {
    id: "mem_003".to_string(),
    content: "Important note".to_string(),
    embedding: vec![0.1, 0.2, 0.3], // 3D for demo
    importance: 0.95,
    timestamp: chrono::Utc::now().timestamp(),
    metadata,
};
```
CRUD Operations
Insert
```rust
let db = OpenDB::open("./db")?;
let memory = Memory::new("mem_001".to_string(), "Hello world".to_string());
db.insert_memory(&memory)?;
```
Signature:
#![allow(unused)] fn main() { pub fn insert_memory(&self, memory: &Memory) -> Result<()> }
Behavior:
- Serializes with `rkyv` (zero-copy)
- Writes to the `records` column family
- Updates cache
- If the embedding is non-empty, stores it in the vector index (requires rebuild for search)
Get
```rust
let memory = db.get_memory("mem_001")?;
match memory {
    Some(mem) => println!("Content: {}", mem.content),
    None => println!("Not found"),
}
```
Signature:
#![allow(unused)] fn main() { pub fn get_memory(&self, id: &str) -> Result<Option<Memory>> }
Behavior:
- Checks cache first
- Deserializes from storage on cache miss
- Returns `None` if not found
Update
```rust
let mut memory = db.get_memory("mem_001")?.unwrap();
memory.content = "Updated content".to_string();
memory.importance = 0.9;
memory.touch(); // Update timestamp

db.insert_memory(&memory)?; // Upsert
```
Note: insert_memory() acts as upsert (update if exists, insert if not).
Delete
#![allow(unused)] fn main() { db.delete_memory("mem_001")?; }
Signature:
#![allow(unused)] fn main() { pub fn delete_memory(&self, id: &str) -> Result<()> }
Behavior:
- Removes from storage
- Invalidates cache
- Does not remove from vector index (requires rebuild)
- Does not remove graph edges (handle separately)
Listing Operations
List All IDs
#![allow(unused)] fn main() { let ids = db.list_memory_ids()?; for id in ids { println!("Memory ID: {}", id); } }
Signature:
#![allow(unused)] fn main() { pub fn list_memory_ids(&self) -> Result<Vec<String>> }
List All Memories
#![allow(unused)] fn main() { let memories = db.list_memories()?; for memory in memories { println!("{}: {}", memory.id, memory.content); } }
Signature:
#![allow(unused)] fn main() { pub fn list_memories(&self) -> Result<Vec<Memory>> }
Warning: Loads all memories into memory. For large datasets, use pagination (not yet implemented) or filter by prefix.
Advanced Usage
Importance Filtering
#![allow(unused)] fn main() { let memories = db.list_memories()?; let important: Vec<_> = memories.into_iter() .filter(|m| m.importance > 0.8) .collect(); }
Metadata Queries
#![allow(unused)] fn main() { let memories = db.list_memories()?; let category_matches: Vec<_> = memories.into_iter() .filter(|m| { m.metadata.get("category") .map(|v| v == "conversation") .unwrap_or(false) }) .collect(); }
Time Range Queries
#![allow(unused)] fn main() { use chrono::{Utc, Duration}; let one_hour_ago = (Utc::now() - Duration::hours(1)).timestamp(); let recent: Vec<_> = db.list_memories()?.into_iter() .filter(|m| m.timestamp > one_hour_ago) .collect(); }
Embeddings
Setting Embeddings
Embeddings enable semantic search:
```rust
let embedding = generate_embedding("Hello world"); // Your embedding model

let memory = Memory {
    id: "mem_001".to_string(),
    content: "Hello world".to_string(),
    embedding, // Vec<f32>
    ..Default::default()
};

db.insert_memory(&memory)?;
```
Dimension Requirements
All embeddings must have the same dimension (default 384):
```rust
use opendb::OpenDBOptions;

let mut options = OpenDBOptions::default();
options.vector_dimension = 768; // For larger models

let db = OpenDB::open_with_options("./db", options)?;
```
Searching Embeddings
See Vector API for semantic search.
Touch Timestamp
Update access time without modifying content:
```rust
let mut memory = db.get_memory("mem_001")?.unwrap();
memory.touch(); // Sets timestamp to now
db.insert_memory(&memory)?;
```
Default Values
```rust
impl Default for Memory {
    fn default() -> Self {
        Self {
            id: String::new(),
            content: String::new(),
            embedding: Vec::new(),
            importance: 0.5,
            timestamp: chrono::Utc::now().timestamp(),
            metadata: HashMap::new(),
        }
    }
}
```
Performance Tips
- Batch Inserts: Use transactions for multiple inserts:
```rust
let mut txn = db.begin_transaction()?;
for memory in memories {
    // Insert via transaction (lower-level API needed)
}
txn.commit()?;
```
- Cache Warm-Up: Preload frequently accessed memories:
```rust
for id in important_ids {
    db.get_memory(id)?; // Populate cache
}
```
- Lazy Embedding Generation: Only generate embeddings when needed for search:
```rust
let memory = Memory::new(id, content);
// Don't set an embedding unless search is required
db.insert_memory(&memory)?;
```
Error Handling
#![allow(unused)] fn main() { use opendb::Error; match db.get_memory("mem_001") { Ok(Some(memory)) => { /* use memory */ }, Ok(None) => { /* not found */ }, Err(Error::Codec(_)) => { /* deserialization error */ }, Err(Error::Storage(_)) => { /* storage error */ }, Err(e) => { /* other error */ }, } }
Graph API
OpenDB provides a labeled property graph for modeling relationships between memories.
Core Concepts
- Nodes: `Memory` objects (referenced by ID)
- Edges: Directed relationships with labels and weights
- Relations: String labels like `"causes"`, `"before"`, `"similar_to"`
Edge Type
```rust
pub struct Edge {
    pub from: String,
    pub relation: String,
    pub to: String,
    pub weight: f64,
    pub timestamp: i64,
}
```
Linking Memories
Basic Link
```rust
use opendb::{OpenDB, Memory};

let db = OpenDB::open("./db")?;

// Create two memories
let mem1 = Memory::new("mem_001".to_string(), "Rust is fast".to_string());
let mem2 = Memory::new("mem_002".to_string(), "C++ is fast".to_string());
db.insert_memory(&mem1)?;
db.insert_memory(&mem2)?;

// Link them
db.link("mem_001", "mem_002", "similar_to")?;
```
Signature:
#![allow(unused)] fn main() { pub fn link(&self, from: &str, to: &str, relation: &str) -> Result<()> }
Behavior:
- Creates a directed edge `from → to`
- Default weight: 1.0
- Stores in both forward and backward indexes
- Allows multiple relations between same nodes
Custom Weight
```rust
use opendb::Edge;

let edge = Edge {
    from: "mem_001".to_string(),
    relation: "causes".to_string(),
    to: "mem_002".to_string(),
    weight: 0.85, // Custom confidence score
    timestamp: chrono::Utc::now().timestamp(),
};

// Link via the graph manager (internal API; use link() for simple cases)
```
Unlinking
Remove a specific relationship:
#![allow(unused)] fn main() { db.unlink("mem_001", "mem_002", "similar_to")?; }
Signature:
#![allow(unused)] fn main() { pub fn unlink(&self, from: &str, to: &str, relation: &str) -> Result<()> }
Behavior:
- Removes edge from both indexes
- Succeeds even if edge doesn't exist
- Does not delete the nodes
Querying Relationships
Get All Related Nodes
#![allow(unused)] fn main() { let related = db.get_related("mem_001", "similar_to")?; for edge in related { println!("{} --[{}]--> {} (weight: {})", edge.from, edge.relation, edge.to, edge.weight); } }
Signature:
#![allow(unused)] fn main() { pub fn get_related(&self, id: &str, relation: &str) -> Result<Vec<Edge>> }
Returns: All edges from id with the specified relation.
Get Outgoing Edges
#![allow(unused)] fn main() { let outgoing = db.get_outgoing("mem_001")?; for edge in outgoing { println!("Outgoing: {} --[{}]--> {}", edge.from, edge.relation, edge.to); } }
Signature:
#![allow(unused)] fn main() { pub fn get_outgoing(&self, id: &str) -> Result<Vec<Edge>> }
Returns: All edges where id is the source (all relations).
Get Incoming Edges
#![allow(unused)] fn main() { let incoming = db.get_incoming("mem_002")?; for edge in incoming { println!("Incoming: {} --[{}]--> {}", edge.from, edge.relation, edge.to); } }
Signature:
#![allow(unused)] fn main() { pub fn get_incoming(&self, id: &str) -> Result<Vec<Edge>> }
Returns: All edges where id is the target (all relations).
Relation Types
OpenDB provides predefined relation constants:
#![allow(unused)] fn main() { pub mod relation { pub const RELATED_TO: &str = "related_to"; pub const CAUSED_BY: &str = "caused_by"; pub const BEFORE: &str = "before"; pub const AFTER: &str = "after"; pub const REFERENCES: &str = "references"; pub const SIMILAR_TO: &str = "similar_to"; pub const CONTRADICTS: &str = "contradicts"; pub const SUPPORTS: &str = "supports"; } }
Usage
#![allow(unused)] fn main() { use opendb::graph::relation; db.link("mem_001", "mem_002", relation::CAUSED_BY)?; db.link("mem_002", "mem_003", relation::BEFORE)?; }
Custom Relations
You can use any string as a relation:
#![allow(unused)] fn main() { db.link("mem_001", "mem_002", "depends_on")?; db.link("mem_003", "mem_004", "implements")?; }
Graph Patterns
Temporal Chain
#![allow(unused)] fn main() { use opendb::graph::relation; // Build timeline db.link("event_1", "event_2", relation::BEFORE)?; db.link("event_2", "event_3", relation::BEFORE)?; db.link("event_3", "event_4", relation::BEFORE)?; // Traverse forward let next_events = db.get_related("event_1", relation::BEFORE)?; }
Causal Graph
#![allow(unused)] fn main() { use opendb::graph::relation; // A causes B, B causes C db.link("symptom_A", "symptom_B", relation::CAUSED_BY)?; db.link("symptom_B", "symptom_C", relation::CAUSED_BY)?; // Find root causes let causes = db.get_incoming("symptom_C")?; }
Knowledge Graph
#![allow(unused)] fn main() { use opendb::graph::relation; // Rust has ownership db.link("rust", "ownership", "has_feature")?; // Ownership enables memory_safety db.link("ownership", "memory_safety", "enables")?; // Memory_safety prevents bugs db.link("memory_safety", "bug_prevention", "prevents")?; // Traverse features let features = db.get_related("rust", "has_feature")?; }
Bidirectional Relationships
#![allow(unused)] fn main() { // A is similar to B db.link("mem_A", "mem_B", "similar_to")?; // B is also similar to A db.link("mem_B", "mem_A", "similar_to")?; // Query either direction let similar_from_A = db.get_related("mem_A", "similar_to")?; let similar_from_B = db.get_related("mem_B", "similar_to")?; }
Advanced Queries
Multi-Hop Traversal
#![allow(unused)] fn main() { fn traverse_depth_2(db: &OpenDB, start: &str, relation: &str) -> Result<Vec<String>> { let mut result = Vec::new(); // First hop let hop1 = db.get_related(start, relation)?; for edge1 in hop1 { result.push(edge1.to.clone()); // Second hop let hop2 = db.get_related(&edge1.to, relation)?; for edge2 in hop2 { result.push(edge2.to.clone()); } } Ok(result) } }
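The depth-2 helper above can be generalized to any depth with a breadth-first traversal that also deduplicates visited nodes (the version above can revisit a node reachable by two paths). This sketch uses a plain adjacency map in place of repeated `get_related` calls against the database:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// BFS to a maximum depth over an adjacency map; the map stands in for
// repeated `get_related` calls. Each node is reported at most once.
fn traverse_to_depth(
    graph: &HashMap<String, Vec<String>>,
    start: &str,
    max_depth: usize,
) -> Vec<String> {
    let mut visited: HashSet<String> = HashSet::new();
    let mut result = Vec::new();
    let mut queue = VecDeque::new();
    queue.push_back((start.to_string(), 0));
    visited.insert(start.to_string());

    while let Some((node, depth)) = queue.pop_front() {
        if depth >= max_depth {
            continue; // depth budget exhausted for this branch
        }
        for next in graph.get(&node).into_iter().flatten() {
            if visited.insert(next.clone()) {
                result.push(next.clone());
                queue.push_back((next.clone(), depth + 1));
            }
        }
    }
    result
}

fn main() {
    let mut g = HashMap::new();
    g.insert("a".to_string(), vec!["b".to_string()]);
    g.insert("b".to_string(), vec!["c".to_string(), "a".to_string()]);
    // The back-edge b -> a is ignored because "a" was already visited.
    println!("{:?}", traverse_to_depth(&g, "a", 2)); // ["b", "c"]
}
```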
Filter by Weight
#![allow(unused)] fn main() { let edges = db.get_related("mem_001", "similar_to")?; let strong_edges: Vec<_> = edges.into_iter() .filter(|e| e.weight > 0.8) .collect(); }
Aggregate Relations
#![allow(unused)] fn main() { use std::collections::HashMap; let outgoing = db.get_outgoing("mem_001")?; let mut relation_counts: HashMap<String, usize> = HashMap::new(); for edge in outgoing { *relation_counts.entry(edge.relation).or_insert(0) += 1; } println!("Relation distribution: {:?}", relation_counts); }
Performance Characteristics
| Operation | Time Complexity | Notes |
|---|---|---|
| link() | O(log n) | Two index writes (forward + backward) |
| unlink() | O(k log n) | k = edges between nodes |
| get_related() | O(log n + k) | k = matching edges |
| get_outgoing() | O(log n + k) | k = total outgoing edges |
| get_incoming() | O(log n + k) | k = total incoming edges |
Storage Details
Edges are stored in two column families:
- graph_forward: `{from}:{relation}` → `Vec<Edge>`
- graph_backward: `{to}:{relation}` → `Vec<Edge>`
This dual-indexing enables fast queries in both directions.
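The dual-index layout can be sketched with in-memory maps. This is a simplified stand-in for the column families described above, not OpenDB's actual internals: the same edge is written under its source in the forward index and under its target in the backward index, so queries in either direction are a single lookup.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct Edge {
    from: String,
    relation: String,
    to: String,
}

#[derive(Default)]
struct GraphIndex {
    forward: HashMap<String, Vec<Edge>>,  // key: "{from}:{relation}"
    backward: HashMap<String, Vec<Edge>>, // key: "{to}:{relation}"
}

impl GraphIndex {
    // Each link writes the edge into both indexes.
    fn link(&mut self, from: &str, to: &str, relation: &str) {
        let edge = Edge {
            from: from.into(),
            relation: relation.into(),
            to: to.into(),
        };
        self.forward
            .entry(format!("{}:{}", from, relation))
            .or_default()
            .push(edge.clone());
        self.backward
            .entry(format!("{}:{}", to, relation))
            .or_default()
            .push(edge);
    }

    fn related(&self, id: &str, relation: &str) -> &[Edge] {
        self.forward
            .get(&format!("{}:{}", id, relation))
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }

    fn incoming(&self, id: &str, relation: &str) -> &[Edge] {
        self.backward
            .get(&format!("{}:{}", id, relation))
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}

fn main() {
    let mut g = GraphIndex::default();
    g.link("mem_001", "mem_002", "similar_to");
    assert_eq!(g.related("mem_001", "similar_to").len(), 1);
    assert_eq!(g.incoming("mem_002", "similar_to").len(), 1);
}
```

The cost of this design is write amplification: every link is two writes, which is why `link()` is listed as two index writes in the performance table.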
Error Handling
#![allow(unused)] fn main() { use opendb::Error; match db.link("mem_001", "mem_002", "related_to") { Ok(_) => println!("Link created"), Err(Error::Storage(_)) => println!("Storage error"), Err(Error::Graph(_)) => println!("Graph error"), Err(e) => println!("Other error: {}", e), } }
Next
Vector Search API
OpenDB provides semantic similarity search using HNSW (Hierarchical Navigable Small World) index.
Overview
Vector search enables finding memories based on semantic similarity rather than exact matches:
#![allow(unused)] fn main() { use opendb::OpenDB; let db = OpenDB::open("./db")?; // Insert memories with embeddings let memory = Memory { id: "mem_001".to_string(), content: "Rust is a systems programming language".to_string(), embedding: generate_embedding("Rust is a systems programming language"), ..Default::default() }; db.insert_memory(&memory)?; // Search by query embedding let query_embedding = generate_embedding("What is Rust?"); let results = db.search_similar(&query_embedding, 5)?; }
Search Similar
Find memories similar to a query vector:
#![allow(unused)] fn main() { let results = db.search_similar(&query_embedding, top_k)?; for result in results { println!("ID: {}, Distance: {}", result.id, result.distance); let memory = db.get_memory(&result.id)?.unwrap(); println!("Content: {}", memory.content); } }
Signature:
#![allow(unused)] fn main() { pub fn search_similar(&self, query: &[f32], top_k: usize) -> Result<Vec<SearchResult>> }
Parameters:
- `query`: Query vector (must match the configured dimension)
- `top_k`: Number of results to return
Returns: Vec<SearchResult> sorted by distance (closest first).
SearchResult Type
#![allow(unused)] fn main() { pub struct SearchResult { pub id: String, pub distance: f32, } }
- id: Memory ID
- distance: Euclidean distance (lower = more similar)
Embeddings
Dimension Configuration
Set embedding dimension when opening database:
#![allow(unused)] fn main() {
use opendb::OpenDBOptions;

let mut options = OpenDBOptions::default();
options.vector_dimension = 768; // e.g. for 768-dim models such as all-mpnet-base-v2
let db = OpenDB::open_with_options("./db", options)?;
}
Default: 384 (for sentence-transformers/all-MiniLM-L6-v2)
Generating Embeddings
OpenDB does not include embedding generation. Use external models:
Example: sentence-transformers (Python)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode("Hello world").tolist() # [0.1, -0.2, ...]
Example: OpenAI API
#![allow(unused)] fn main() { // Pseudo-code (use openai-rust crate) let embedding = openai_client .embeddings("text-embedding-ada-002") .create("Hello world") .await?; }
Example: Candle (Rust)
#![allow(unused)] fn main() { // Use candle-transformers for local inference // See: https://github.com/huggingface/candle }
Synthetic Embeddings (Testing)
For testing without real models:
#![allow(unused)] fn main() {
fn generate_synthetic_embedding(text: &str, dimension: usize) -> Vec<f32> {
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    let mut hasher = DefaultHasher::new();
    text.hash(&mut hasher);
    let mut state = hasher.finish().max(1); // xorshift seed must be non-zero

    (0..dimension)
        .map(|_| {
            // xorshift64: a tiny deterministic PRNG, so the same text
            // always maps to the same vector (no rand crate needed)
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            (state as f32 / u64::MAX as f32) * 2.0 - 1.0
        })
        .collect()
}
}
Index Management
Automatic Index Building
The HNSW index is built automatically on first search:
#![allow(unused)] fn main() { // Insert memories db.insert_memory(&memory1)?; db.insert_memory(&memory2)?; // First search triggers index build let results = db.search_similar(&query, 5)?; // Builds index here }
Manual Rebuild
Force index rebuild (e.g., after bulk inserts):
#![allow(unused)] fn main() { db.rebuild_vector_index()?; }
Signature:
#![allow(unused)] fn main() { pub fn rebuild_vector_index(&self) -> Result<()> }
When to rebuild:
- After bulk memory inserts
- After changing embeddings
- To incorporate deleted memories
Note: Search automatically rebuilds if index is stale.
HNSW Parameters
HNSW has tunable parameters for speed vs accuracy tradeoff:
Default Parameters
#![allow(unused)] fn main() { pub struct HnswParams { pub ef_construction: usize, // 200 pub max_neighbors: usize, // 16 } }
Presets
#![allow(unused)] fn main() { // High accuracy (slower build, better recall) HnswParams::high_accuracy() // ef=400, neighbors=32 // High speed (faster build, lower recall) HnswParams::high_speed() // ef=100, neighbors=8 // Balanced (default) HnswParams::default() // ef=200, neighbors=16 }
Note: Currently not exposed in OpenDB API. Future versions will allow tuning.
Distance Metric
OpenDB uses Euclidean distance:
$$ d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} $$
Properties:
- Lower distance = more similar
- Distance 0 = identical vectors
- Sensitive to magnitude (normalize if needed)
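The formula above translates directly into a few lines of Rust; this is a generic sketch of the metric, not OpenDB's internal implementation:

```rust
// Euclidean distance: sqrt of the sum of squared component differences.
fn euclidean_distance(p: &[f32], q: &[f32]) -> f32 {
    assert_eq!(p.len(), q.len(), "vectors must share a dimension");
    p.iter().zip(q).map(|(a, b)| (a - b).powi(2)).sum::<f32>().sqrt()
}

fn main() {
    // Identical vectors have distance 0
    assert_eq!(euclidean_distance(&[1.0, 2.0], &[1.0, 2.0]), 0.0);
    // 3-4-5 right triangle
    assert_eq!(euclidean_distance(&[0.0, 0.0], &[3.0, 4.0]), 5.0);
}
```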
Normalization
For cosine similarity behavior, normalize embeddings:
#![allow(unused)] fn main() { fn normalize(vec: &mut Vec<f32>) { let magnitude: f32 = vec.iter().map(|x| x * x).sum::<f32>().sqrt(); for x in vec.iter_mut() { *x /= magnitude; } } let mut embedding = generate_embedding(text); normalize(&mut embedding); }
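The reason normalization gives cosine-like behavior: for unit vectors, the squared Euclidean distance is `2 * (1 - cos θ)`, so ranking by Euclidean distance is equivalent to ranking by cosine similarity. A small self-contained check:

```rust
fn normalize(v: &mut [f32]) {
    let mag: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if mag > 0.0 {
        for x in v.iter_mut() {
            *x /= mag;
        }
    }
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let mut a = vec![3.0, 4.0];
    let mut b = vec![1.0, 0.0];
    normalize(&mut a); // (0.6, 0.8)
    normalize(&mut b); // (1.0, 0.0)

    // For unit vectors: ||a - b||^2 == 2 * (1 - cos(a, b))
    let d2: f32 = a.iter().zip(&b).map(|(x, y)| (x - y).powi(2)).sum();
    let cos = dot(&a, &b);
    assert!((d2 - 2.0 * (1.0 - cos)).abs() < 1e-6);
}
```

Note the zero-magnitude guard: dividing a zero vector by its magnitude would produce NaNs, which the snippet above avoids.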
Usage Patterns
Semantic Memory Search
#![allow(unused)] fn main() { // User asks a question let query = "How do I prevent memory leaks in Rust?"; let query_embedding = generate_embedding(query); // Find relevant memories let results = db.search_similar(&query_embedding, 3)?; for result in results { let memory = db.get_memory(&result.id)?.unwrap(); println!("Relevant memory: {}", memory.content); } }
Deduplication
Find duplicate or near-duplicate content:
#![allow(unused)] fn main() { let new_content = "Rust ownership prevents data races"; let new_embedding = generate_embedding(new_content); let similar = db.search_similar(&new_embedding, 1)?; if let Some(top) = similar.first() { if top.distance < 0.1 { // Threshold for "duplicate" println!("Similar content already exists: {}", top.id); } } }
Clustering
Group similar memories:
#![allow(unused)] fn main() { let all_memories = db.list_memories()?; let mut clusters: Vec<Vec<String>> = Vec::new(); for memory in all_memories { if memory.embedding.is_empty() { continue; } let similar = db.search_similar(&memory.embedding, 5)?; let cluster: Vec<String> = similar.iter() .filter(|r| r.distance < 0.5) // Similarity threshold .map(|r| r.id.clone()) .collect(); clusters.push(cluster); } }
Performance Characteristics
| Operation | Time Complexity | Typical Latency |
|---|---|---|
| search_similar() | O(log n) | ~1-10ms |
| rebuild_vector_index() | O(n log n) | ~100ms per 1k vectors |
| Insert with embedding | O(1) + rebuild | Instant (rebuild deferred) |
Scalability:
- 100-1k memories: Instant search
- 1k-10k memories: <10ms search
- 10k-100k memories: <50ms search
- 100k+ memories: Consider sharding (future feature)
Limitations
- Dimension Mismatch: All embeddings must have same dimension
- No Incremental Updates: Index rebuild is full reconstruction
- Memory Usage: HNSW index kept in memory (~4 bytes × dimension × count)
- No GPU Support: Pure CPU implementation
Error Handling
#![allow(unused)] fn main() { use opendb::Error; match db.search_similar(&query, 10) { Ok(results) => { /* use results */ }, Err(Error::VectorIndex(e)) => println!("Index error: {}", e), Err(Error::InvalidInput(e)) => println!("Bad query: {}", e), Err(e) => println!("Other error: {}", e), } }
Best Practices
- Batch Inserts: Insert all memories, then rebuild once:
#![allow(unused)] fn main() { for memory in memories { db.insert_memory(&memory)?; } db.rebuild_vector_index()?; // One rebuild for all }
- Lazy Embeddings: Only generate embeddings for searchable content:
#![allow(unused)] fn main() { let memory = Memory::new(id, content); // Don't set embedding if this memory won't be searched db.insert_memory(&memory)?; }
- Relevance Filtering: Filter by distance threshold:
#![allow(unused)] fn main() { let results = db.search_similar(&query, 20)?; let relevant: Vec<_> = results.into_iter() .filter(|r| r.distance < 1.0) // Adjust threshold .collect(); }
- Combine with Metadata: Use metadata to post-filter:
#![allow(unused)] fn main() { let results = db.search_similar(&query, 50)?; for result in results { let memory = db.get_memory(&result.id)?.unwrap(); if memory.metadata.get("category") == Some(&"docs".to_string()) { println!("Relevant doc: {}", memory.content); } } }
Next
Multimodal File Support
OpenDB provides production-ready support for multimodal file processing, designed specifically for AI/LLM applications, RAG (Retrieval Augmented Generation) pipelines, and agent memory systems.
Overview
The multimodal API enables you to:
- Detect and classify file types (PDF, DOCX, audio, video, text)
- Process and chunk large documents
- Store extracted text with embeddings
- Track processing status for async workflows
- Add custom metadata for any file type
File Type Detection
FileType Enum
The FileType enum represents supported file formats:
#![allow(unused)] fn main() { use opendb::FileType; // Automatic detection from file extension let pdf_type = FileType::from_extension("pdf"); assert_eq!(pdf_type, FileType::Pdf); let audio_type = FileType::from_extension("mp3"); assert_eq!(audio_type, FileType::Audio); // Get human-readable description println!("{}", pdf_type.description()); // "PDF document" println!("{}", audio_type.description()); // "Audio file" }
Supported File Types
| FileType | Extensions | Description |
|---|---|---|
| Text | .txt | Plain text file |
| Pdf | .pdf | PDF document |
| Docx | .docx | Microsoft Word document |
| Audio | .mp3, .wav, .ogg, .flac | Audio file |
| Video | .mp4, .avi, .mkv, .mov | Video file |
| Image | .jpg, .png, .gif, .bmp | Image file |
| Unknown | others | Unknown file type |
Example: File Type Detection
#![allow(unused)] fn main() { use opendb::FileType; fn detect_file_type(filename: &str) -> FileType { let extension = filename .rsplit('.') .next() .unwrap_or(""); FileType::from_extension(extension) } // Usage let file = "research_paper.pdf"; let file_type = detect_file_type(file); match file_type { FileType::Pdf => println!("Processing PDF document"), FileType::Audio => println!("Transcribing audio file"), FileType::Video => println!("Extracting video captions"), _ => println!("Unsupported file type"), } }
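One caveat with the `rsplit('.')` approach above: for a filename with no dot at all, `rsplit('.').next()` returns the whole filename, which would then be treated as an extension. A safer sketch (the helper name is ours, not part of the OpenDB API) uses `rsplit_once`:

```rust
// Returns the extension only when a '.' is actually present, so "README"
// yields None instead of being treated as an extension. Dotfiles such as
// ".gitignore" are also rejected (empty stem).
fn file_extension(filename: &str) -> Option<&str> {
    match filename.rsplit_once('.') {
        Some((stem, ext)) if !stem.is_empty() => Some(ext),
        _ => None,
    }
}

fn main() {
    assert_eq!(file_extension("paper.pdf"), Some("pdf"));
    assert_eq!(file_extension("archive.tar.gz"), Some("gz"));
    assert_eq!(file_extension("README"), None);
    assert_eq!(file_extension(".gitignore"), None);
}
```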
Multimodal Documents
MultimodalDocument Structure
The MultimodalDocument struct represents a processed file with extracted content:
#![allow(unused)] fn main() { pub struct MultimodalDocument { pub id: String, pub filename: String, pub file_type: FileType, pub file_size: usize, pub extracted_text: String, pub chunks: Vec<DocumentChunk>, pub embedding: Option<Vec<f32>>, pub metadata: HashMap<String, String>, pub processing_status: ProcessingStatus, pub created_at: DateTime<Utc>, pub updated_at: DateTime<Utc>, } }
CRUD Operations
Create
#![allow(unused)] fn main() { use opendb::{MultimodalDocument, FileType}; // Create a new multimodal document let doc = MultimodalDocument::new( "doc_001", // Unique ID "research_paper.pdf", // Filename FileType::Pdf, // File type 1024 * 500, // File size in bytes (500 KB) "Extracted text content...", // Extracted text vec![0.1; 384], // Document embedding (384-dim) ); // Add metadata let doc = doc .with_metadata("author", "Dr. Jane Smith") .with_metadata("pages", "25") .with_metadata("year", "2024") .with_metadata("category", "machine-learning"); println!("Created document: {}", doc.id); println!("Status: {:?}", doc.processing_status); }
Read
#![allow(unused)] fn main() {
// Access document properties
println!("Filename: {}", doc.filename);
println!("File type: {:?}", doc.file_type);
println!("File size: {} KB", doc.file_size / 1024);
println!("Extracted text length: {} chars", doc.extracted_text.len());
println!("Number of chunks: {}", doc.chunks.len());

// Access metadata
if let Some(author) = doc.metadata.get("author") {
    println!("Author: {}", author);
}

// Check processing status
match &doc.processing_status {
    ProcessingStatus::Completed => println!("✅ Processing complete"),
    ProcessingStatus::Processing => println!("⏳ Still processing..."),
    ProcessingStatus::Failed(err) => println!("❌ Failed: {}", err),
    ProcessingStatus::Queued => println!("⏸ Queued for processing"),
}
}
Update
#![allow(unused)] fn main() { use opendb::ProcessingStatus; // Update processing status let mut doc = doc.clone(); doc.processing_status = ProcessingStatus::Processing; // Add more metadata doc.metadata.insert("processed_by".to_string(), "worker-01".to_string()); doc.metadata.insert("processing_time_ms".to_string(), "1234".to_string()); // Mark as completed doc.processing_status = ProcessingStatus::Completed; doc.updated_at = chrono::Utc::now(); println!("Updated document: {}", doc.id); }
Delete
#![allow(unused)] fn main() { // In OpenDB, you would typically delete by ID using the database handle // This is a conceptual example showing how to remove from memory let mut documents: Vec<MultimodalDocument> = vec![/* ... */]; documents.retain(|d| d.id != "doc_001"); println!("Document deleted"); }
Document Chunking
DocumentChunk Structure
For large documents, use DocumentChunk to split content into processable segments:
#![allow(unused)] fn main() { pub struct DocumentChunk { pub chunk_id: String, pub content: String, pub embedding: Option<Vec<f32>>, pub start_offset: usize, pub end_offset: usize, pub metadata: HashMap<String, String>, } }
Creating Chunks
#![allow(unused)] fn main() { use opendb::{DocumentChunk, MultimodalDocument}; let mut doc = MultimodalDocument::new( "doc_002", "large_book.pdf", FileType::Pdf, 1024 * 1024 * 5, // 5 MB "Full book content...", vec![0.1; 384], ); // Add chunks (e.g., by chapter or page) doc.add_chunk(DocumentChunk::new( "chunk_0", "Chapter 1: Introduction to Rust programming...", vec![0.15; 384], // Chunk-specific embedding 0, // Start offset 1500, // End offset ).with_metadata("chapter", "1") .with_metadata("page_start", "1") .with_metadata("page_end", "15")); doc.add_chunk(DocumentChunk::new( "chunk_1", "Chapter 2: Ownership and Borrowing...", vec![0.25; 384], 1500, 3200, ).with_metadata("chapter", "2") .with_metadata("page_start", "16") .with_metadata("page_end", "32")); println!("Added {} chunks", doc.chunks.len()); }
Chunk Strategies
1. Fixed-Size Chunking
#![allow(unused)] fn main() { fn chunk_by_size(text: &str, chunk_size: usize) -> Vec<String> { text.chars() .collect::<Vec<_>>() .chunks(chunk_size) .map(|chunk| chunk.iter().collect()) .collect() } // Usage let text = "Very long document text..."; let chunks = chunk_by_size(&text, 1000); }
2. Paragraph-Based Chunking
#![allow(unused)] fn main() { fn chunk_by_paragraphs(text: &str, max_paragraphs: usize) -> Vec<String> { text.split("\n\n") .collect::<Vec<_>>() .chunks(max_paragraphs) .map(|chunk| chunk.join("\n\n")) .collect() } // Usage let chunks = chunk_by_paragraphs(&text, 3); }
3. Token-Based Chunking (for LLMs)
#![allow(unused)] fn main() { // Requires tiktoken-rs or similar tokenizer fn chunk_by_tokens(text: &str, max_tokens: usize) -> Vec<String> { // Pseudo-code - use actual tokenizer in production let tokens = tokenize(text); tokens .chunks(max_tokens) .map(|chunk| detokenize(chunk)) .collect() } }
Processing Status
ProcessingStatus Enum
Track the lifecycle of document processing:
#![allow(unused)] fn main() {
use opendb::ProcessingStatus;

// Status variants
let queued = ProcessingStatus::Queued;
let processing = ProcessingStatus::Processing;
let completed = ProcessingStatus::Completed;
let failed = ProcessingStatus::Failed("OCR error".to_string());

// Pattern matching
match doc.processing_status {
    ProcessingStatus::Queued => {
        println!("Document is queued for processing");
    }
    ProcessingStatus::Processing => {
        println!("Processing in progress...");
    }
    ProcessingStatus::Completed => {
        println!("✅ Processing completed successfully");
    }
    ProcessingStatus::Failed(error) => {
        eprintln!("❌ Processing failed: {}", error);
    }
}
}
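In an async pipeline it can help to enforce legal status transitions (Queued → Processing → Completed/Failed). OpenDB itself does not enforce an order; the guard below is a hypothetical sketch using a simplified stand-in enum:

```rust
#[derive(Debug, Clone, PartialEq)]
enum ProcessingStatus {
    Queued,
    Processing,
    Completed,
    Failed(String),
}

// A hypothetical transition guard illustrating a typical pipeline lifecycle:
// a document must be Processing before it can complete or fail.
fn can_transition(from: &ProcessingStatus, to: &ProcessingStatus) -> bool {
    use ProcessingStatus::*;
    matches!(
        (from, to),
        (Queued, Processing) | (Processing, Completed) | (Processing, Failed(_))
    )
}

fn main() {
    assert!(can_transition(&ProcessingStatus::Queued, &ProcessingStatus::Processing));
    // Completed is terminal in this sketch
    assert!(!can_transition(&ProcessingStatus::Completed, &ProcessingStatus::Processing));
}
```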
Production Workflow
Complete PDF Processing Example
#![allow(unused)] fn main() {
use opendb::{OpenDB, MultimodalDocument, DocumentChunk, FileType, ProcessingStatus};
use std::fs;

fn process_pdf(filepath: &str, db: &OpenDB) -> Result<String> {
    // 1. Read file
    let file_bytes = fs::read(filepath)?;
    let filename = filepath.rsplit('/').next().unwrap();

    // 2. Extract text (use pdf-extract or pdfium in production)
    let extracted_text = extract_pdf_text(&file_bytes)?;

    // 3. Generate document embedding
    let doc_embedding = generate_embedding(&extracted_text)?;

    // 4. Create multimodal document
    let mut doc = MultimodalDocument::new(
        &generate_id(),
        filename,
        FileType::Pdf,
        file_bytes.len(),
        &extracted_text,
        doc_embedding,
    )
    .with_metadata("source", "upload")
    .with_metadata("pages", &count_pages(&file_bytes).to_string());

    // 5. Chunk the document
    let chunks = chunk_text(&extracted_text, 1000);
    for (i, chunk_text) in chunks.iter().enumerate() {
        let chunk_embedding = generate_embedding(chunk_text)?;
        let chunk = DocumentChunk::new(
            &format!("chunk_{}", i),
            chunk_text,
            chunk_embedding,
            i * 1000,
            (i + 1) * 1000,
        )
        .with_metadata("chunk_index", &i.to_string());
        doc.add_chunk(chunk);
    }

    // 6. Mark as completed
    doc.processing_status = ProcessingStatus::Completed;

    // 7. Store in OpenDB (pseudo-code - actual storage via Memory type)
    let doc_id = doc.id.clone();
    store_document(db, &doc)?;

    Ok(doc_id)
}

// Helper functions (implement with actual libraries)
fn extract_pdf_text(bytes: &[u8]) -> Result<String> {
    // Use pdf-extract, pdfium, or poppler
    todo!("Implement with pdf-extract crate")
}

fn generate_embedding(text: &str) -> Result<Vec<f32>> {
    // Use sentence-transformers, OpenAI API, or onnxruntime
    todo!("Implement with embedding model")
}

fn chunk_text(text: &str, size: usize) -> Vec<String> {
    // Smart chunking by sentences/paragraphs
    todo!("Implement chunking strategy")
}

fn generate_id() -> String {
    uuid::Uuid::new_v4().to_string()
}

fn count_pages(bytes: &[u8]) -> usize {
    // Parse PDF to count pages
    todo!("Implement page counting")
}

fn store_document(db: &OpenDB, doc: &MultimodalDocument) -> Result<()> {
    // Store document and chunks as Memory records with embeddings
    todo!("Implement storage logic")
}
}
Audio Transcription Example
#![allow(unused)] fn main() { use opendb::{MultimodalDocument, DocumentChunk, FileType, ProcessingStatus}; fn process_audio(filepath: &str) -> Result<MultimodalDocument> { let file_bytes = fs::read(filepath)?; let filename = filepath.rsplit('/').next().unwrap(); // 1. Transcribe audio (use whisper-rs or OpenAI Whisper API) let transcript = transcribe_audio(&file_bytes)?; // 2. Generate embedding from transcript let embedding = generate_embedding(&transcript)?; // 3. Create multimodal document let mut doc = MultimodalDocument::new( &generate_id(), filename, FileType::Audio, file_bytes.len(), &transcript, embedding, ) .with_metadata("duration_seconds", &get_audio_duration(&file_bytes).to_string()) .with_metadata("transcription_model", "whisper-large-v3"); // 4. Add timestamped chunks let timestamped_segments = get_timestamped_segments(&file_bytes)?; for (i, segment) in timestamped_segments.iter().enumerate() { let chunk_embedding = generate_embedding(&segment.text)?; let chunk = DocumentChunk::new( &format!("segment_{}", i), &segment.text, chunk_embedding, segment.start_offset, segment.end_offset, ) .with_metadata("timestamp_start", &segment.start_time.to_string()) .with_metadata("timestamp_end", &segment.end_time.to_string()); doc.add_chunk(chunk); } doc.processing_status = ProcessingStatus::Completed; Ok(doc) } struct AudioSegment { text: String, start_time: f64, end_time: f64, start_offset: usize, end_offset: usize, } fn transcribe_audio(bytes: &[u8]) -> Result<String> { // Use whisper-rs or cloud API todo!("Implement transcription") } fn get_audio_duration(bytes: &[u8]) -> f64 { // Parse audio metadata todo!("Implement duration extraction") } fn get_timestamped_segments(bytes: &[u8]) -> Result<Vec<AudioSegment>> { // Use Whisper with timestamps todo!("Implement segment extraction") } }
Integration with OpenDB
Storing Multimodal Documents
#![allow(unused)] fn main() {
use opendb::{OpenDB, Memory, MultimodalDocument};

fn store_multimodal_document(db: &OpenDB, doc: &MultimodalDocument) -> Result<()> {
    // Store main document as Memory
    let memory = Memory::new(
        &doc.id,
        &doc.extracted_text,
        doc.embedding.clone().unwrap_or_default(),
        1.0, // importance
    )
    .with_metadata("filename", &doc.filename)
    .with_metadata("file_type", &format!("{:?}", doc.file_type))
    .with_metadata("file_size", &doc.file_size.to_string());
    db.insert_memory(&memory)?;

    // Store each chunk as a separate Memory with relationships
    for chunk in &doc.chunks {
        let chunk_memory = Memory::new(
            &format!("{}_{}", doc.id, chunk.chunk_id),
            &chunk.content,
            chunk.embedding.clone().unwrap_or_default(),
            0.8, // chunk importance
        )
        .with_metadata("parent_doc", &doc.id)
        .with_metadata("chunk_id", &chunk.chunk_id);
        db.insert_memory(&chunk_memory)?;

        // Link chunk to parent document (link takes from, to, relation)
        db.link(&memory.id, &chunk_memory.id, "has_chunk")?;
    }

    Ok(())
}
}
Semantic Search Across Documents
#![allow(unused)] fn main() {
use opendb::{OpenDB, SearchResult};

fn search_documents(
    db: &OpenDB,
    query: &str,
    top_k: usize,
) -> Result<Vec<SearchResult>> {
    // Generate query embedding
    let query_embedding = generate_embedding(query)?;

    // Search across all documents and chunks
    let results = db.search_similar(&query_embedding, top_k)?;
    Ok(results)
}

// Usage
let results = search_documents(&db, "machine learning algorithms", 5)?;
for result in results {
    // SearchResult carries only id and distance; fetch the content by id
    let memory = db.get_memory(&result.id)?.unwrap();
    println!("Found: {} (distance: {:.4})", memory.content, result.distance);
}
}
Best Practices
1. Chunking Strategy
- Small chunks (500-1000 chars): Better precision, more API calls
- Large chunks (1500-3000 chars): More context, fewer API calls
- Overlap chunks: 10-20% overlap for continuity
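The overlap suggestion above can be sketched as a character-window chunker; the window size and overlap are illustrative parameters, and a production version would prefer sentence or token boundaries:

```rust
// Fixed-size chunks with overlap: each chunk starts `size - overlap`
// characters after the previous one, so adjacent chunks share context.
fn chunk_with_overlap(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < size, "overlap must be smaller than chunk size");
    let chars: Vec<char> = text.chars().collect();
    let step = size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start += step;
    }
    chunks
}

fn main() {
    // size 4, overlap 1 => each chunk repeats the last char of the previous
    let chunks = chunk_with_overlap("abcdefghij", 4, 1);
    println!("{:?}", chunks); // ["abcd", "defg", "ghij"]
}
```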
2. Metadata Usage
- Always add source file metadata
- Include timestamps for temporal data
- Add processing metadata (model version, date)
- Store original file path for reference
3. Error Handling
#![allow(unused)] fn main() { use opendb::ProcessingStatus; fn safe_process(filepath: &str) -> MultimodalDocument { let mut doc = MultimodalDocument::new( &generate_id(), filepath, FileType::Unknown, 0, "", vec![], ); doc.processing_status = ProcessingStatus::Queued; match process_file(filepath) { Ok(processed) => { doc = processed; doc.processing_status = ProcessingStatus::Completed; } Err(e) => { doc.processing_status = ProcessingStatus::Failed(e.to_string()); eprintln!("Processing failed: {}", e); } } doc } }
4. Memory Management
- Process files in batches
- Clear processed chunks from memory
- Use streaming for very large files
- Implement backpressure for async processing
See Also
- Records Management - Storing Memory records
- Vector Search - Semantic similarity search
- Graph Operations - Linking documents and chunks
- Multimodal Example - Complete working example
Production Libraries
PDF Processing
- pdf-extract - Text extraction
- pdfium-render - Rendering and OCR
- lopdf - Low-level parsing
DOCX Processing
- docx-rs - Read/write DOCX
- mammoth-rs - Convert to text
Audio Transcription
- whisper-rs - Local Whisper
- OpenAI Whisper API - Cloud service
Video Processing
- ffmpeg-next - Video/audio extraction
- Combine with Whisper for captions
Embeddings
- sentence-transformers (Python + PyO3)
- OpenAI Embeddings API
- onnxruntime - Local models
Transactions API
OpenDB provides ACID-compliant transactions for atomic multi-operation updates.
Overview
Transactions group multiple operations into a single atomic unit:
#![allow(unused)] fn main() { use opendb::OpenDB; let db = OpenDB::open("./db")?; let mut txn = db.begin_transaction()?; txn.put("default", b"key1", b"value1")?; txn.put("default", b"key2", b"value2")?; txn.commit()?; // Both writes succeed or both fail }
Basic API
Begin Transaction
#![allow(unused)] fn main() { let mut txn = db.begin_transaction()?; }
Signature:
#![allow(unused)] fn main() { pub fn begin_transaction(&self) -> Result<Transaction> }
Returns: Transaction handle for performing operations.
Commit
#![allow(unused)] fn main() { txn.commit()?; }
Signature:
#![allow(unused)] fn main() { pub fn commit(mut self) -> Result<()> }
Behavior:
- Atomically applies all changes
- Returns error if conflicts detected (optimistic locking)
- Consumes transaction (can't use after commit)
Rollback
#![allow(unused)] fn main() { txn.rollback()?; }
Signature:
#![allow(unused)] fn main() { pub fn rollback(mut self) -> Result<()> }
Behavior:
- Discards all changes
- Always succeeds
- Consumes transaction
Auto-Rollback
Transactions auto-rollback if dropped without commit:
#![allow(unused)] fn main() {
{
    let mut txn = db.begin_transaction()?;
    txn.put("default", b"key", b"value")?;
    // txn dropped here → automatic rollback
}
// Key was not written
assert!(db.get(b"key")?.is_none());
}
Transaction Operations
Get
#![allow(unused)] fn main() { let value = txn.get("default", b"key")?; }
Signature:
#![allow(unused)] fn main() { pub fn get(&self, cf: &str, key: &[u8]) -> Result<Option<Vec<u8>>> }
Behavior:
- Reads from transaction snapshot
- Sees writes from current transaction
- Isolated from concurrent transactions
Put
#![allow(unused)] fn main() { txn.put("default", b"key", b"value")?; }
Signature:
#![allow(unused)] fn main() { pub fn put(&mut self, cf: &str, key: &[u8], value: &[u8]) -> Result<()> }
Behavior:
- Buffers write in transaction
- Not visible outside transaction until commit
- Visible to subsequent reads in same transaction
Delete
#![allow(unused)] fn main() { txn.delete("default", b"key")?; }
Signature:
#![allow(unused)] fn main() { pub fn delete(&mut self, cf: &str, key: &[u8]) -> Result<()> }
Behavior:
- Buffers delete in transaction
- Subsequent gets in the same transaction return `None`
Column Families
Transactions work across all column families:
#![allow(unused)] fn main() { let mut txn = db.begin_transaction()?; // Write to different column families txn.put("default", b"kv_key", b"value")?; txn.put("records", b"mem_001", &encoded_memory)?; txn.put("graph_forward", b"mem_001:related_to", &edges)?; txn.commit()?; // All or nothing }
Available Column Families:
"default"- KV store"records"- Memory records"graph_forward"- Outgoing edges"graph_backward"- Incoming edges"vector_data"- Embedding data"vector_index"- HNSW index"metadata"- Database metadata
ACID Examples
Atomicity
Either all operations succeed or none:
#![allow(unused)] fn main() { let mut txn = db.begin_transaction()?; txn.put("default", b"account_A", b"-100")?; txn.put("default", b"account_B", b"+100")?; match txn.commit() { Ok(_) => println!("Transfer complete"), Err(e) => println!("Transfer failed, both accounts unchanged: {}", e), } }
Consistency
Maintain invariants across operations:
#![allow(unused)] fn main() { // Invariant: memory must exist before linking let mut txn = db.begin_transaction()?; // Insert memories txn.put("records", b"mem_001", &encode_memory(&mem1))?; txn.put("records", b"mem_002", &encode_memory(&mem2))?; // Create link (requires both memories exist) txn.put("graph_forward", b"mem_001:related_to", &encode_edges(&edges))?; txn.commit()?; // Ensures consistency }
Isolation
Transactions don't see each other's uncommitted changes:
#![allow(unused)] fn main() { // Transaction 1 let mut txn1 = db.begin_transaction()?; txn1.put("default", b"counter", b"100")?; // Transaction 2 (concurrent) let mut txn2 = db.begin_transaction()?; let val = txn2.get("default", b"counter")?; // Sees old value (not 100) txn1.commit()?; txn2.commit()?; // May conflict depending on operations }
Durability
Committed changes survive crashes:
#![allow(unused)] fn main() { let mut txn = db.begin_transaction()?; txn.put("default", b"important", b"data")?; txn.commit()?; // Even if process crashes here, data is safe // Reopen database let db = OpenDB::open("./db")?; assert_eq!(db.get(b"important")?.unwrap(), b"data"); }
Conflict Handling
Transactions use optimistic locking and may fail on conflict:
```rust
use opendb::Error;

loop {
    let mut txn = db.begin_transaction()?;

    // Read-modify-write
    let val = txn.get("default", b"counter")?
        .and_then(|v| String::from_utf8(v).ok())
        .and_then(|s| s.parse::<i64>().ok())
        .unwrap_or(0);

    let new_val = val + 1;
    txn.put("default", b"counter", new_val.to_string().as_bytes())?;

    match txn.commit() {
        Ok(_) => break,
        Err(Error::Transaction(_)) => {
            println!("Conflict detected, retrying...");
            continue; // Retry
        }
        Err(e) => return Err(e),
    }
}
```
Advanced Patterns
Compare-and-Swap
```rust
fn compare_and_swap(
    db: &OpenDB,
    key: &[u8],
    expected: &[u8],
    new_value: &[u8],
) -> Result<bool> {
    let mut txn = db.begin_transaction()?;

    let current = txn.get("default", key)?;
    if current.as_deref() != Some(expected) {
        txn.rollback()?;
        return Ok(false); // Value changed
    }

    txn.put("default", key, new_value)?;
    txn.commit()?;
    Ok(true)
}
```
Batch Updates
```rust
fn batch_update(db: &OpenDB, updates: Vec<(Vec<u8>, Vec<u8>)>) -> Result<()> {
    let mut txn = db.begin_transaction()?;
    for (key, value) in updates {
        txn.put("default", &key, &value)?;
    }
    txn.commit()
}
```
Conditional Delete
```rust
fn delete_if_exists(db: &OpenDB, key: &[u8]) -> Result<bool> {
    let mut txn = db.begin_transaction()?;

    if txn.get("default", key)?.is_none() {
        txn.rollback()?;
        return Ok(false);
    }

    txn.delete("default", key)?;
    txn.commit()?;
    Ok(true)
}
```
Performance Considerations
Transaction Overhead
Transactions have overhead compared to direct writes:
```rust
// ❌ Slower: many small transactions
for i in 0..1000 {
    let mut txn = db.begin_transaction()?;
    txn.put("default", format!("key_{}", i).as_bytes(), b"value")?;
    txn.commit()?;
}

// ✅ Faster: one transaction for the whole batch
let mut txn = db.begin_transaction()?;
for i in 0..1000 {
    txn.put("default", format!("key_{}", i).as_bytes(), b"value")?;
}
txn.commit()?;
```
Transaction Size
Keep transactions reasonably sized:
- Small (1-100 ops): Best performance
- Medium (100-1000 ops): Good
- Large (1000+ ops): May increase conflict rate and memory usage
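One way to stay in the small-to-medium range is to split a large batch into fixed-size chunks and commit each chunk in its own transaction. A minimal std-only sketch of the chunking logic (the commented line marks where the transaction calls from this guide would go; `chunk_updates` is our name, not an OpenDB API):

```rust
// Split a large batch of updates into chunks of at most `chunk_size`
// so that each transaction stays reasonably sized.
fn chunk_updates(
    updates: Vec<(Vec<u8>, Vec<u8>)>,
    chunk_size: usize,
) -> Vec<Vec<(Vec<u8>, Vec<u8>)>> {
    updates
        .chunks(chunk_size)
        .map(|c| c.to_vec())
        .collect()
    // For each chunk: begin_transaction(), put() every pair, commit().
}

fn main() {
    let updates: Vec<(Vec<u8>, Vec<u8>)> = (0..2500)
        .map(|i| (format!("key_{}", i).into_bytes(), b"value".to_vec()))
        .collect();

    let chunks = chunk_updates(updates, 1000);
    // 2500 updates -> 3 transactions of 1000, 1000, and 500 ops
    assert_eq!(chunks.len(), 3);
    assert_eq!(chunks[2].len(), 500);
}
```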
Conflict Rate
High contention increases conflict rate:
```rust
// High contention: many threads updating the same key.
// Solution: shard keys or use separate counters.
```
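A common workaround for a hot counter is to split it into N shard keys, have each writer increment its own shard, and sum the shards on read. A sketch of the key-sharding scheme only (std-only; the helper names are ours, and the earlier `txn.put` calls would operate on the returned shard key):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const NUM_SHARDS: u64 = 16;

// Map a writer id (e.g. a thread name) to one of NUM_SHARDS shard keys,
// so that concurrent increments rarely touch the same key.
fn shard_key(counter: &str, writer_id: &str) -> String {
    let mut hasher = DefaultHasher::new();
    writer_id.hash(&mut hasher);
    format!("{}:shard_{}", counter, hasher.finish() % NUM_SHARDS)
}

// On read, fetch and sum every shard: counter:shard_0 .. counter:shard_15.
fn all_shard_keys(counter: &str) -> Vec<String> {
    (0..NUM_SHARDS)
        .map(|i| format!("{}:shard_{}", counter, i))
        .collect()
}

fn main() {
    let key = shard_key("counter", "thread-7");
    assert!(key.starts_with("counter:shard_"));
    assert_eq!(all_shard_keys("counter").len(), 16);
}
```

The trade-off: writes become nearly conflict-free, while reads pay for N point lookups instead of one.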
Limitations
- Single-threaded: One transaction per thread
- No nested transactions: Can't begin transaction within transaction
- Memory buffering: Large transactions use more memory
- Optimistic locking: High contention may cause retries
Error Handling
```rust
use opendb::Error;

let mut txn = db.begin_transaction()?;
txn.put("default", b"key", b"value")?;

match txn.commit() {
    Ok(_) => println!("Success"),
    Err(Error::Transaction(e)) => println!("Conflict: {}", e),
    Err(Error::Storage(e)) => println!("Storage error: {}", e),
    Err(e) => println!("Other error: {}", e),
}
```
Best Practices
- Keep transactions short: Minimize duration to reduce conflicts
- Handle conflicts: Implement retry logic for read-modify-write
- Batch when possible: Group related operations
- Use auto-rollback: Let Drop handle cleanup in error paths
- Explicit commits: Don't rely on implicit behavior
Next
Performance Tuning
This guide covers optimization strategies for OpenDB deployments.
Profiling
Before optimizing, measure your bottleneck:
```rust
use std::time::Instant;

let start = Instant::now();
db.insert_memory(&memory)?;
println!("Insert took: {:?}", start.elapsed());
```
RocksDB Tuning
Write Buffer Size
Larger write buffers improve write throughput:
```rust
// Default: 128 MB
// For write-heavy workloads, increase:
opts.set_write_buffer_size(256 * 1024 * 1024); // 256 MB
```
Trade-offs:
- ✅ Fewer flushes to disk
- ✅ Better write throughput
- ❌ More memory usage
- ❌ Longer recovery time after a crash
Block Cache
RocksDB's internal cache for disk blocks:
```rust
opts.set_block_cache_size(512 * 1024 * 1024); // 512 MB
```
Trade-offs:
- ✅ Faster reads
- ❌ More memory usage
Compression
Balance CPU vs storage:
```rust
use rocksdb::DBCompressionType;

// Default: LZ4 (fast, moderate compression)
opts.set_compression_type(DBCompressionType::Lz4);

// For better compression (slower writes):
opts.set_compression_type(DBCompressionType::Zstd);

// For faster writes (larger storage):
opts.set_compression_type(DBCompressionType::None);
```
Parallelism
Increase background threads for compaction:
```rust
opts.increase_parallelism(4); // Use 4 background threads
```
Cache Tuning
Cache Sizes
Adjust cache capacity based on workload:
```rust
use opendb::OpenDBOptions;

let mut options = OpenDBOptions::default();

// For read-heavy workloads
options.kv_cache_size = 10_000;
options.record_cache_size = 5_000;

// For write-heavy workloads (smaller cache)
options.kv_cache_size = 1_000;
options.record_cache_size = 500;

let db = OpenDB::open_with_options("./db", options)?;
```
Cache Hit Rate
Monitor cache effectiveness:
```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Implement hit rate tracking (example)
struct CacheStats {
    hits: AtomicU64,
    misses: AtomicU64,
}

impl CacheStats {
    fn hit_rate(&self) -> f64 {
        let hits = self.hits.load(Ordering::Relaxed) as f64;
        let misses = self.misses.load(Ordering::Relaxed) as f64;
        hits / (hits + misses)
    }
}
```
Target hit rates:
- Above 90%: Excellent
- 70-90%: Good
- Below 70%: Increase cache size
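The thresholds above can be folded into a small helper that turns a measured hit rate into a recommendation (an illustration only; `cache_advice` is our name, not an OpenDB API):

```rust
// Classify a cache hit rate (0.0..=1.0) per the guidance above.
fn cache_advice(hit_rate: f64) -> &'static str {
    if hit_rate > 0.90 {
        "excellent"
    } else if hit_rate >= 0.70 {
        "good"
    } else {
        "increase cache size"
    }
}

fn main() {
    assert_eq!(cache_advice(0.95), "excellent");
    assert_eq!(cache_advice(0.80), "good");
    assert_eq!(cache_advice(0.50), "increase cache size");
}
```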
Batch Operations
Batch Inserts
Use transactions for bulk inserts:
```rust
// ❌ Slow: individual commits
for memory in memories {
    db.insert_memory(&memory)?;
}

// ✅ Fast: batch commit (future API)
let mut txn = db.begin_transaction()?;
for memory in memories {
    // Insert via transaction
}
txn.commit()?;
```
Flush Control
Control when data is flushed to disk:
```rust
// Insert many records
for i in 0..10_000 {
    db.insert_memory(&memory)?;
}

// Explicit flush
db.flush()?;
```
Vector Search Optimization
Index Parameters
Tune HNSW parameters for your use case:
```rust
// High accuracy (slower, better recall)
HnswParams::high_accuracy() // ef=400, neighbors=32

// High speed (faster, lower recall)
HnswParams::high_speed() // ef=100, neighbors=8
```
Rebuild Strategy
Rebuild index strategically:
```rust
// ❌ Bad: rebuild after every insert
for memory in memories {
    db.insert_memory(&memory)?;
    db.rebuild_vector_index()?; // Expensive!
}

// ✅ Good: rebuild once after the batch
for memory in memories {
    db.insert_memory(&memory)?;
}
db.rebuild_vector_index()?; // Once
```
Dimension Reduction
Lower dimensions = faster search:
```rust
// 768D (high quality, slower)
options.vector_dimension = 768;

// 384D (balanced)
options.vector_dimension = 384;

// 128D (fast, lower quality)
options.vector_dimension = 128;
```
Graph Optimization
Link Batching
Batch graph operations:
```rust
// Create all memories first
for memory in memories {
    db.insert_memory(&memory)?;
}

// Then create all links
for (from, to, relation) in edges {
    db.link(from, to, relation)?;
}
```
Prune Unused Relations
Remove stale edges periodically:
```rust
use std::collections::HashSet;

fn prune_orphaned_edges(db: &OpenDB) -> Result<()> {
    let all_ids: HashSet<_> = db.list_memory_ids()?.into_iter().collect();

    for id in db.list_memory_ids()? {
        let outgoing = db.get_outgoing(&id)?;
        for edge in outgoing {
            if !all_ids.contains(&edge.to) {
                db.unlink(&edge.from, &edge.to, &edge.relation)?;
            }
        }
    }
    Ok(())
}
```
Memory Usage
Estimate Memory Footprint
```text
Total Memory =
    RocksDB Write Buffers
  + RocksDB Block Cache
  + Application Caches
  + HNSW Index
  + Overhead
```
Example:
```text
  128 MB (write buffers)
+ 256 MB (block cache)
+  10 MB (app caches, 10k entries × 1 KB avg)
+  30 MB (HNSW, 10k vectors × 384D × 4 bytes × 2x overhead)
+  50 MB (overhead)
= ~474 MB
```
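The same arithmetic is easy to reproduce in code. This hypothetical helper (not an OpenDB API) mirrors the example above, with sizes in MB and the HNSW term computed as vectors × dimension × 4 bytes × 2x overhead:

```rust
// Rough memory footprint estimate in MB, mirroring the example above.
fn estimate_memory_mb(
    write_buffers_mb: f64,
    block_cache_mb: f64,
    cache_entries: u64,
    avg_entry_kb: f64,
    num_vectors: u64,
    dimension: u64,
    overhead_mb: f64,
) -> f64 {
    let app_cache_mb = cache_entries as f64 * avg_entry_kb / 1024.0;
    // 4 bytes per f32 component, ~2x index overhead
    let hnsw_mb =
        num_vectors as f64 * dimension as f64 * 4.0 * 2.0 / (1024.0 * 1024.0);
    write_buffers_mb + block_cache_mb + app_cache_mb + hnsw_mb + overhead_mb
}

fn main() {
    let total = estimate_memory_mb(128.0, 256.0, 10_000, 1.0, 10_000, 384, 50.0);
    // 128 + 256 + ~9.8 + ~29.3 + 50, i.e. close to the ~474 MB above
    assert!((total - 474.0).abs() < 5.0);
}
```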
Reduce Memory Usage
- Smaller caches:
  ```rust
  options.kv_cache_size = 100;
  options.record_cache_size = 100;
  ```
- Lower RocksDB buffers:
  ```rust
  opts.set_write_buffer_size(64 * 1024 * 1024);  // 64 MB
  opts.set_block_cache_size(128 * 1024 * 1024);  // 128 MB
  ```
- Smaller embeddings:
  ```rust
  options.vector_dimension = 128; // instead of 768
  ```
Disk Usage
Compaction
Force compaction to reclaim space:
```rust
// Manual compaction (future API)
db.compact_range(None, None)?;
```
Monitoring
Check database size:
```rust
// On Linux
std::process::Command::new("du")
    .args(&["-sh", "./db"])
    .output()?;
```
Benchmarking
Use Criterion for accurate benchmarks:
```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn benchmark_insert(c: &mut Criterion) {
    let db = OpenDB::open("./bench_db").unwrap();

    c.bench_function("insert_memory", |b| {
        b.iter(|| {
            let memory = Memory::new("id".to_string(), "content".to_string());
            db.insert_memory(black_box(&memory)).unwrap();
        });
    });
}

criterion_group!(benches, benchmark_insert);
criterion_main!(benches);
```
Monitoring Metrics
Implement metrics collection:
```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct Metrics {
    reads: AtomicU64,
    writes: AtomicU64,
    cache_hits: AtomicU64,
    cache_misses: AtomicU64,
}

impl Metrics {
    fn report(&self) {
        println!("Reads: {}", self.reads.load(Ordering::Relaxed));
        println!("Writes: {}", self.writes.load(Ordering::Relaxed));

        let hits = self.cache_hits.load(Ordering::Relaxed) as f64;
        let misses = self.cache_misses.load(Ordering::Relaxed) as f64;
        println!("Cache hit rate: {:.2}%", hits / (hits + misses) * 100.0);
    }
}
```
Platform-Specific Tips
Linux
- Use `io_uring` for async I/O (future RocksDB feature)
- Disable transparent huge pages for lower latency
- Use `fallocate` to preallocate disk space
macOS
- APFS filesystem has good performance
- Use `F_NOCACHE` for large scans (avoids cache pollution)
Windows
- Use NTFS for best RocksDB performance
- Disable indexing on database directory
- Use SSD for best performance
Common Bottlenecks
- Slow writes: Increase write buffer size, disable compression
- Slow reads: Increase cache sizes, use SSD
- High memory: Reduce cache sizes, lower embedding dimension
- Slow vector search: Reduce HNSW parameters, lower dimension
- Large database size: Enable compression, run compaction
Next
Extending OpenDB
OpenDB is designed to be extensible. This guide covers custom backends, plugins, and extensions.
Custom Storage Backends
OpenDB uses the `StorageBackend` trait for pluggability.
Storage Trait
```rust
pub trait StorageBackend: Send + Sync {
    fn get(&self, cf: &str, key: &[u8]) -> Result<Option<Vec<u8>>>;
    fn put(&self, cf: &str, key: &[u8], value: &[u8]) -> Result<()>;
    fn delete(&self, cf: &str, key: &[u8]) -> Result<()>;
    fn exists(&self, cf: &str, key: &[u8]) -> Result<bool>;
    fn scan_prefix(&self, cf: &str, prefix: &[u8]) -> Result<Vec<(Vec<u8>, Vec<u8>)>>;
    fn begin_transaction(&self) -> Result<Box<dyn Transaction>>;
    fn flush(&self) -> Result<()>;
    fn snapshot(&self) -> Result<Box<dyn Snapshot>>;
}
```
Example: In-Memory Backend
```rust
use std::collections::HashMap;
use std::sync::RwLock;
use opendb::storage::{StorageBackend, Transaction, Snapshot};
use opendb::{Result, Error};

pub struct MemoryBackend {
    data: RwLock<HashMap<String, HashMap<Vec<u8>, Vec<u8>>>>,
}

impl MemoryBackend {
    pub fn new() -> Self {
        Self { data: RwLock::new(HashMap::new()) }
    }
}

impl StorageBackend for MemoryBackend {
    fn get(&self, cf: &str, key: &[u8]) -> Result<Option<Vec<u8>>> {
        let data = self.data.read().unwrap();
        Ok(data.get(cf).and_then(|cf_data| cf_data.get(key)).cloned())
    }

    fn put(&self, cf: &str, key: &[u8], value: &[u8]) -> Result<()> {
        let mut data = self.data.write().unwrap();
        data.entry(cf.to_string())
            .or_insert_with(HashMap::new)
            .insert(key.to_vec(), value.to_vec());
        Ok(())
    }

    fn delete(&self, cf: &str, key: &[u8]) -> Result<()> {
        let mut data = self.data.write().unwrap();
        if let Some(cf_data) = data.get_mut(cf) {
            cf_data.remove(key);
        }
        Ok(())
    }

    fn exists(&self, cf: &str, key: &[u8]) -> Result<bool> {
        Ok(self.get(cf, key)?.is_some())
    }

    fn scan_prefix(&self, cf: &str, prefix: &[u8]) -> Result<Vec<(Vec<u8>, Vec<u8>)>> {
        let data = self.data.read().unwrap();
        Ok(data.get(cf)
            .map(|cf_data| {
                cf_data.iter()
                    .filter(|(k, _)| k.starts_with(prefix))
                    .map(|(k, v)| (k.clone(), v.clone()))
                    .collect()
            })
            .unwrap_or_default())
    }

    fn flush(&self) -> Result<()> {
        // No-op for in-memory
        Ok(())
    }

    // Implement the Transaction and Snapshot traits...
}
```
Using Custom Backend
```rust
use std::sync::Arc;

let backend = Arc::new(MemoryBackend::new());
let db = OpenDB::with_backend(backend, OpenDBOptions::default())?;
```
Custom Cache Implementations
Implement the `Cache` trait for custom caching strategies:
```rust
pub trait Cache<K, V>: Send + Sync {
    fn get(&self, key: &K) -> Option<V>;
    fn put(&self, key: K, value: V);
    fn remove(&self, key: &K);
    fn clear(&self);
    fn len(&self) -> usize;
}
```
Example: TTL Cache
```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};
use parking_lot::RwLock;

pub struct TtlCache<K, V> {
    data: RwLock<HashMap<K, (V, Instant)>>,
    ttl: Duration,
}

impl<K: Eq + std::hash::Hash + Clone, V: Clone> Cache<K, V> for TtlCache<K, V> {
    fn get(&self, key: &K) -> Option<V> {
        let data = self.data.read();
        data.get(key).and_then(|(value, inserted)| {
            if inserted.elapsed() < self.ttl {
                Some(value.clone())
            } else {
                None // Expired
            }
        })
    }

    fn put(&self, key: K, value: V) {
        let mut data = self.data.write();
        data.insert(key, (value, Instant::now()));
    }

    // ... implement other methods
}
```
Custom Vector Indexes
While OpenDB uses HNSW, you can wrap alternative indexes:
Example: Flat Index
```rust
use parking_lot::RwLock;

pub struct FlatVectorIndex {
    vectors: RwLock<Vec<(String, Vec<f32>)>>,
}

impl FlatVectorIndex {
    pub fn search(&self, query: &[f32], top_k: usize) -> Vec<SearchResult> {
        let vectors = self.vectors.read();
        let mut results: Vec<_> = vectors.iter()
            .map(|(id, vec)| {
                let distance = euclidean_distance(query, vec);
                SearchResult { id: id.clone(), distance }
            })
            .collect();

        results.sort_by(|a, b| a.distance.partial_cmp(&b.distance).unwrap());
        results.truncate(top_k);
        results
    }
}

fn euclidean_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter())
        .map(|(x, y)| (x - y).powi(2))
        .sum::<f32>()
        .sqrt()
}
```
Custom Serialization
Replace `rkyv` with a custom codec:
```rust
pub trait Codec<T> {
    fn encode(&self, value: &T) -> Result<Vec<u8>>;
    fn decode(&self, bytes: &[u8]) -> Result<T>;
}

pub struct JsonCodec;

impl<T: serde::Serialize + serde::de::DeserializeOwned> Codec<T> for JsonCodec {
    fn encode(&self, value: &T) -> Result<Vec<u8>> {
        serde_json::to_vec(value).map_err(|e| Error::Codec(e.to_string()))
    }

    fn decode(&self, bytes: &[u8]) -> Result<T> {
        serde_json::from_slice(bytes).map_err(|e| Error::Codec(e.to_string()))
    }
}
```
Plugin System (Future)
Planned plugin architecture:
```rust
use std::fs::File;
use std::io::Write;
use std::sync::Mutex;

pub trait Plugin: Send + Sync {
    fn name(&self) -> &str;
    fn init(&mut self, db: &OpenDB) -> Result<()>;
    fn on_insert(&self, memory: &Memory) -> Result<()>;
    fn on_delete(&self, id: &str) -> Result<()>;
    fn on_link(&self, edge: &Edge) -> Result<()>;
}

// Example: audit logger plugin
pub struct AuditPlugin {
    log_file: Mutex<File>,
}

impl Plugin for AuditPlugin {
    fn on_insert(&self, memory: &Memory) -> Result<()> {
        let mut file = self.log_file.lock().unwrap();
        writeln!(file, "INSERT: {}", memory.id)?;
        Ok(())
    }
}
```
Custom Relation Types
Extend graph relations for domain-specific needs:
```rust
pub mod custom_relations {
    pub const IMPLEMENTS: &str = "implements";
    pub const EXTENDS: &str = "extends";
    pub const DEPENDS_ON: &str = "depends_on";
    pub const TESTED_BY: &str = "tested_by";
}

use custom_relations::*;

db.link("MyStruct", "MyTrait", IMPLEMENTS)?;
db.link("ChildStruct", "ParentStruct", EXTENDS)?;
```
Embedding Adapters
Create adapters for different embedding models:
```rust
pub trait EmbeddingModel {
    fn dimension(&self) -> usize;
    fn encode(&self, text: &str) -> Result<Vec<f32>>;
}

pub struct SentenceTransformerAdapter {
    // Python bindings via PyO3
}

impl EmbeddingModel for SentenceTransformerAdapter {
    fn dimension(&self) -> usize {
        384 // all-MiniLM-L6-v2
    }

    fn encode(&self, text: &str) -> Result<Vec<f32>> {
        // Call the Python model
        todo!()
    }
}
```
Future Extension Points
Planned extensibility features:
- Query Language: SQL-like interface for complex queries
- Triggers: Execute callbacks on events
- Views: Virtual collections with custom logic
- Migrations: Schema evolution helpers
- Replication: Multi-instance synchronization
Contributing Extensions
If you build a useful extension, consider contributing:
- Fork the repository
- Create a new module in `src/extensions/`
- Document usage and API
- Add tests for functionality
- Submit a pull request
Best Practices
- Follow trait contracts: Implement all required methods
- Handle errors: Use `Result<T, Error>` consistently
- Thread safety: Use `Send + Sync` for shared state
- Document: Provide clear documentation and examples
- Test: Write comprehensive tests for custom components
Examples
See the `examples/` directory for:

- `custom_backend.rs`: Alternative storage backend
- `plugin_example.rs`: Sample plugin implementation
- `custom_index.rs`: Alternative vector index
Next
Contributing to OpenDB
Thank you for your interest in contributing to OpenDB! This guide will help you get started.
Code of Conduct
This project adheres to the Contributor Covenant Code of Conduct. By participating, you are expected to uphold this code.
How to Contribute
Reporting Bugs
- Check existing issues to avoid duplicates
- Use the bug report template when creating a new issue
- Provide details:
- OpenDB version
- Rust version (`rustc --version`)
- Operating system
- Minimal reproduction steps
- Expected vs actual behavior
Suggesting Features
- Check the roadmap to see if it's planned
- Use the feature request template
- Describe:
- Use case and motivation
- Proposed API design
- Alternative solutions considered
Pull Requests
- Fork the repository
- Create a branch from `main`: `git checkout -b feature/my-feature`
- Make your changes following our code style
- Write tests for new functionality
- Update documentation if needed
- Commit with descriptive messages
- Push to your fork
- Open a pull request with detailed description
Development Setup
Prerequisites
- Rust 1.70 or later
- RocksDB development libraries (see Installation guide)
Clone and Build
git clone https://github.com/muhammad-fiaz/OpenDB.git
cd OpenDB
cargo build
Run Tests
# All tests
cargo test
# Specific test
cargo test test_name
# With output
cargo test -- --nocapture
Run Examples
cargo run --example quickstart
cargo run --example memory_agent
cargo run --example graph_relations
Build Documentation
# API docs
cargo doc --open
# mdBook docs
cd docs
mdbook serve --open
Code Style
Formatting
Use `rustfmt` for consistent formatting:
cargo fmt --all
Linting
Use `clippy` for code quality:
cargo clippy --all-targets --all-features -- -D warnings
Naming Conventions
- Types: `PascalCase` (e.g., `OpenDB`, `StorageBackend`)
- Functions: `snake_case` (e.g., `insert_memory`, `get_related`)
- Constants: `SCREAMING_SNAKE_CASE` (e.g., `DEFAULT_CACHE_SIZE`)
- Modules: `snake_case` (e.g., `graph`, `vector`)
Documentation
- Public APIs: Must have `///` documentation
- Examples: Include usage examples in doc comments
- Errors: Document possible error cases
Example:
````rust
/// Inserts a memory record into the database.
///
/// # Arguments
///
/// * `memory` - The memory record to insert
///
/// # Returns
///
/// Returns `Ok(())` on success, or an error if:
/// - Serialization fails
/// - Storage write fails
///
/// # Example
///
/// ```
/// let memory = Memory::new("id".to_string(), "content".to_string());
/// db.insert_memory(&memory)?;
/// ```
pub fn insert_memory(&self, memory: &Memory) -> Result<()> {
    // ...
}
````
Testing Guidelines
Unit Tests
Place unit tests in the same file as the code:
```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_memory_creation() {
        let memory = Memory::new("id".to_string(), "content".to_string());
        assert_eq!(memory.id, "id");
        assert_eq!(memory.content, "content");
    }
}
```
Integration Tests
Place integration tests in `tests/`:
```rust
// tests/my_feature_test.rs
use opendb::{OpenDB, Memory};
use tempfile::TempDir;

#[test]
fn test_my_feature() {
    let temp_dir = TempDir::new().unwrap();
    let db = OpenDB::open(temp_dir.path()).unwrap();
    // Test logic
}
```
Test Coverage
Aim for:
- New features: >80% coverage
- Bug fixes: Regression test included
- Edge cases: Test error paths
Commit Messages
Follow conventional commits format:
```text
<type>(<scope>): <subject>

<body>

<footer>
```
Types:
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation changes
- `style`: Formatting changes
- `refactor`: Code refactoring
- `test`: Adding tests
- `chore`: Maintenance tasks
Examples:
```text
feat(graph): add weighted edge support

Adds optional weight parameter to link() method,
allowing users to specify edge weights.

Closes #123
```

```text
fix(cache): prevent race condition in LRU eviction

Fixes deadlock when multiple threads evict simultaneously
by using a write lock during eviction.

Fixes #456
```
Pull Request Guidelines
PR Title
Use the same format as commit messages:
```text
feat(vector): add cosine similarity distance metric
```
PR Description
Include:
- What: Description of changes
- Why: Motivation and context
- How: Implementation approach
- Testing: How you tested the changes
- Checklist:
  - [ ] Tests added/updated
  - [ ] Documentation updated
  - [ ] Changelog updated (for features/fixes)
  - [ ] Code formatted with `rustfmt`
  - [ ] Linted with `clippy`
Review Process
- CI checks: All tests must pass
- Code review: At least one maintainer approval
- Documentation: Verify docs are updated
- Changelog: Ensure CHANGELOG.md is updated
Architecture Guidelines
Module Organization
Follow existing structure:
```text
src/
  lib.rs          # Public API exports
  database.rs     # Main OpenDB struct
  error.rs        # Error types
  types.rs        # Core data types
  storage/        # Storage backends
  cache/          # Caching layer
  kv/             # Key-value store
  records/        # Memory records
  graph/          # Graph relationships
  vector/         # Vector search
  transaction/    # Transaction management
  codec/          # Serialization
```
Adding New Features
- New module: Create in appropriate directory
- Trait-based: Use traits for extensibility
- Error handling: Use `Result<T, Error>`
- Thread safety: Ensure `Send + Sync` where needed
Performance Considerations
- Benchmarks: Add benchmarks for performance-critical code
- Profiling: Profile before optimizing
- Allocations: Minimize unnecessary allocations
- Locks: Prefer `RwLock` for read-heavy workloads
Documentation Updates
When adding features, update:
- API docs: `///` comments in code
- mdBook docs: Relevant pages in `docs/src/`
- Examples: Add an example if appropriate
- CHANGELOG.md: Document changes
- README.md: Update if API changes
Release Process (Maintainers)
- Version bump: Update `Cargo.toml`
- Changelog: Update `CHANGELOG.md`
- Tag: Create git tag `v0.x.y`
- Publish: `cargo publish`
- GitHub Release: Create release notes
Getting Help
- Discussions: GitHub Discussions for questions
- Issues: GitHub Issues for bugs/features
- Email: contact@muhammadfiaz.com for private inquiries
Recognition
Contributors are recognized in:
- `CONTRIBUTORS.md` file
- GitHub contributors page
- Release notes
Thank you for contributing to OpenDB!