Platform — LakeX Sovereign DataVault

OneView DSL — Control Plane

Meridian: centralised intelligence for your entire archive estate

Meridian is the brain of the platform — a FastAPI backend with a React control plane that manages every Stratum node, archive job, user, and governance policy across your organisation. All metadata lives here; all data stays on your Stratum servers.

⚡

WebSocket Agent Transport

Stratum agents maintain a persistent WebSocket connection to the gateway pool. Job dispatch latency drops from ~2 seconds (polling) to ~10 milliseconds. Redis Pub/Sub decouples the backend from gateway instances, enabling horizontal scaling.

🏢

Multi-Tenant Organisations

Full parent/child organisation hierarchy. Each org gets its own user space, Stratum registry, storage backends, archive jobs, and governance policies. System admins manage the platform; org admins manage their own estate — with complete isolation enforced at the database level.

📅

SLA Templates & Scheduler

Define reusable SLA templates specifying schedule, retention policy, and health thresholds. Assign templates to archive jobs — changes propagate automatically. Built-in scheduler supports cron expressions, daily/weekly/monthly cycles, and on-demand triggers.

🔄

Catalog Sync & Schema Drift

Sovereign DataVault tracks the schema of every archived table. When the source schema changes, drift detection surfaces additions, removals, and type changes — giving operators a preview before the next archive run or restore operation touches the target.
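
As a rough illustration of what drift detection computes, the sketch below compares two schema snapshots and reports additions, removals, and type changes. The snapshot shape (a column-name to type-name mapping) and the names `SchemaDrift` and `detect_drift` are assumptions made for this example, not the platform's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDrift:
    """Hypothetical drift report; field names are illustrative only."""
    added: dict[str, str] = field(default_factory=dict)
    removed: dict[str, str] = field(default_factory=dict)
    type_changed: dict[str, tuple[str, str]] = field(default_factory=dict)

    @property
    def has_drift(self) -> bool:
        return bool(self.added or self.removed or self.type_changed)


def detect_drift(archived: dict[str, str], current: dict[str, str]) -> SchemaDrift:
    """Compare the archived schema snapshot with the live source schema."""
    drift = SchemaDrift()
    for column, col_type in current.items():
        if column not in archived:
            drift.added[column] = col_type
        elif archived[column] != col_type:
            # Record (old type, new type) so the preview can show both.
            drift.type_changed[column] = (archived[column], col_type)
    for column, col_type in archived.items():
        if column not in current:
            drift.removed[column] = col_type
    return drift


# Made-up snapshots for illustration.
archived_schema = {"loan_id": "BIGINT", "amount": "DECIMAL(18,2)", "branch": "VARCHAR(10)"}
current_schema = {"loan_id": "BIGINT", "amount": "DECIMAL(20,4)", "channel": "VARCHAR(20)"}
drift = detect_drift(archived_schema, current_schema)
```

A report like this is what an operator would review before an archive run or restore proceeds.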

🔒

HashiCorp Vault Integration

All connection credentials, encryption keys, and SSH passwords are stored in HashiCorp Vault — never in the application database as plaintext. Vault Transit provides envelope encryption; Vault KV stores per-org secrets. Snapshot and restore for Vault state included.

📦

Package Manager

Stage and distribute agent binaries, LLM GGUF model files, and Trino plugin JARs from Meridian. Stratum agents pull packages on demand — no external internet access required. Complete air-gap operation from day one.

Structured Data — LVS Plane

Archive relational data at enterprise scale — with millisecond query access

Sovereign DataVault Structured (LVS) connects to your production relational databases, extracts data on a configurable schedule, writes sorted Apache Iceberg Parquet to your Stratum storage, and registers each file in a queryable metadata tracker. No external clusters. No manual partition tuning. No idle compute burning resources.

Sort Advisor

Automated write-time sort column selection

At archive time, the Sort Advisor inspects the source table's schema — primary keys, clustered indexes, column types — and selects the optimal sort strategy before writing a single Parquet row.

Composite clustered index → Z-order multi-column sort

IDENTITY / SEQUENCE PK → sort by primary key

UUID PK + indexed date → skip UUID, sort by date

Non-clustered index on timestamp → sort by that column

No useful index → write unsorted, pruning disabled

Sort basis is recorded per-file in the tracker (sort_column_basis) — giving operators a full audit trail of why each sort decision was made.
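
The five rules above can be read as a small decision function. The sketch below is one possible encoding, assuming the rules are evaluated in the order listed; `TableProfile`, `SortDecision`, the strategy labels, and the `sort_column_basis` values are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TableProfile:
    """Hypothetical summary of a source table's schema."""
    clustered_index_columns: list[str] = field(default_factory=list)
    primary_key: Optional[str] = None
    primary_key_kind: Optional[str] = None  # "IDENTITY", "SEQUENCE" or "UUID"
    indexed_date_column: Optional[str] = None
    indexed_timestamp_column: Optional[str] = None


@dataclass
class SortDecision:
    strategy: str            # "ZORDER", "SORT" or "UNSORTED" (assumed labels)
    columns: list[str]
    sort_column_basis: str   # audit value; the real tracker values are not documented here
    pruning_enabled: bool = True


def advise_sort(t: TableProfile) -> SortDecision:
    # Rule 1: composite clustered index -> Z-order over its columns.
    if len(t.clustered_index_columns) > 1:
        return SortDecision("ZORDER", list(t.clustered_index_columns), "composite_clustered_index")
    # Rule 2: IDENTITY / SEQUENCE primary key -> sort by the key.
    if t.primary_key and t.primary_key_kind in ("IDENTITY", "SEQUENCE"):
        return SortDecision("SORT", [t.primary_key], "identity_or_sequence_pk")
    # Rule 3: UUID primary key plus an indexed date -> skip the UUID, sort by date.
    if t.primary_key_kind == "UUID" and t.indexed_date_column:
        return SortDecision("SORT", [t.indexed_date_column], "uuid_pk_indexed_date")
    # Rule 4: non-clustered index on a timestamp -> sort by that column.
    if t.indexed_timestamp_column:
        return SortDecision("SORT", [t.indexed_timestamp_column], "nonclustered_timestamp_index")
    # Rule 5: nothing useful -> write unsorted and disable pruning.
    return SortDecision("UNSORTED", [], "no_useful_index", pruning_enabled=False)


decision = advise_sort(
    TableProfile(primary_key="id", primary_key_kind="UUID", indexed_date_column="order_date")
)
```

Returning the basis alongside the chosen columns mirrors the audit-trail idea: the reason for the decision travels with the decision itself.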

Pruning Panel — AI Query Page

Predicate: order_date BETWEEN '2023-01-01' AND '2023-03-31'

📄 FINCORE_2019_001.parquet · SKIP

📄 FINCORE_2020_001.parquet · SKIP

📄 FINCORE_2021_001.parquet · SKIP

📄 FINCORE_2023_Q1.parquet · SCAN

📄 FINCORE_2023_Q2.parquet · SKIP

📄 FINCORE_2024_001.parquet · SKIP

✓ 5 of 6 files pruned · 1 scanned · 142ms total

Bounds Verify: all file bounds match live Parquet footer stats
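
The panel above reflects min/max pruning: a file is skipped when its recorded bounds for the sort column cannot overlap the predicate range. The sketch below shows that overlap test in isolation. The file names come from the panel, but the per-file date bounds are invented for illustration, and `FileBounds` and `plan_scan` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FileBounds:
    name: str
    min_value: date  # lowest order_date in the file (assumed value)
    max_value: date  # highest order_date in the file (assumed value)


def plan_scan(files: list[FileBounds], lo: date, hi: date) -> dict[str, str]:
    """Mark each file SCAN if [min, max] overlaps [lo, hi], otherwise SKIP."""
    return {
        f.name: "SCAN" if f.min_value <= hi and f.max_value >= lo else "SKIP"
        for f in files
    }


files = [
    FileBounds("FINCORE_2019_001.parquet", date(2019, 1, 1), date(2019, 12, 31)),
    FileBounds("FINCORE_2020_001.parquet", date(2020, 1, 1), date(2020, 12, 31)),
    FileBounds("FINCORE_2021_001.parquet", date(2021, 1, 1), date(2021, 12, 31)),
    FileBounds("FINCORE_2023_Q1.parquet", date(2023, 1, 1), date(2023, 3, 31)),
    FileBounds("FINCORE_2023_Q2.parquet", date(2023, 4, 1), date(2023, 6, 30)),
    FileBounds("FINCORE_2024_001.parquet", date(2024, 1, 1), date(2024, 12, 31)),
]

# Predicate: order_date BETWEEN '2023-01-01' AND '2023-03-31'
plan = plan_scan(files, date(2023, 1, 1), date(2023, 3, 31))
pruned = sum(1 for status in plan.values() if status == "SKIP")
```

Pruning of this kind is only effective when files are written sorted on the filter column, which is why unsorted files have pruning disabled.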

Source Connectors

Every major enterprise relational database — supported out of the box

Sovereign DataVault ships production-ready connectors for all major enterprise databases. Each connector handles schema discovery, data type mapping, incremental extraction, redacted column detection, and prerequisite validation before the first byte is written.

🐉 Oracle 19c / 21c

🐘 PostgreSQL

🐬 MySQL / MariaDB

🪟 SQL Server

🔷 IBM Db2

🗄️ More coming

Iceberg

Native format

Parquet

File encoding

Trino

Query engine

Archive Schedule — FINCORE_PROD

Schema Discovery: 142 tables scanned

Sort Advisor: TRXN_DATE selected

Extract & Sort: In progress (67%)

Write Parquet → Stratum: Queued

Register Iceberg Catalog: Queued

Update Bounds Tracker: Queued

🔍

Redacted Column Detection

Automatically discovers masked or redacted columns in source schemas — ensuring compliant handling before archiving.

📊

Schema Version History

Every archive run snapshots the schema. Drift is detected, compared, and surfaced in the UI before any restore.

✔️

Archive Health Monitor

Continuous health scoring per archive job. Agents report heartbeats, job status, and error events in real time.

🚀

Flight SQL Server

Arrow Flight SQL interface for high-throughput bulk reads from archived Iceberg tables — ideal for analytics pipelines.

Unstructured Data — LVUS Plane

Your unstructured data has never been this searchable

Sovereign DataVault Unstructured (LVUS) ingests file-based sources — local filesystems, NFS, SFTP, object storage, cloud collaboration, email archives, web feeds, and application logs — extracts structured metadata and raw text, runs Named Entity Recognition, generates embeddings, and indexes everything in a local Qdrant vector store. The result: semantic search and RAG-powered AI queries over millions of documents without a single byte leaving your infrastructure.

Source Connectors — 20+ built in

📁 Local Filesystem

🔗 NFS Shares

🔒 SFTP Servers

📦 S3-Compatible (on-prem)

🌐 AWS S3 + IAM / STS

☁️ Azure Blob Storage

📁 HDFS

☁️ Google Drive

📊 SharePoint Online

📧 Email Archives (EML)

📝 Confluence Spaces

🪵 Application Logs

🌊 JSON / XML Feeds

📉 CSV / TSV Files

📨 Messaging Archives

🌍 Web Crawlers

📄 Fixed-Width (FW) Files

🔄 EDI Documents

Document Formats Supported

PDF (text + scanned via OCR)

Word (DOCX), Excel (XLSX/XLS)

PowerPoint (PPTX)

HTML, Markdown, RTF

Plain text and log files

Images (JPEG, PNG) with OCR

Video (audio transcript extraction)

Email (EML, MSG with attachments)

🧠

Named Entity Recognition

NER runs at ingest time — detecting PII (names, emails, phone numbers, national IDs, account numbers, addresses) in every document. Entities are indexed separately and drive DSAR search.

🔢

Vector Embeddings — Local

nomic-embed-text (768-dim) runs on the Stratum node via Ollama, so embeddings are generated locally and stored in a per-Stratum Qdrant instance. No cloud API call is ever needed for embedding.

🗂️

Iceberg Manifest Parquet

LVUS archivers write ZSTD-compressed JSONL extractions and register manifest Parquet files in the datagen Iceberg catalog — making document metadata queryable via Trino SQL.

🔐

Permission Inheritance

SharePoint MIP labels, S3 Object Lock / Macie classification, HDFS permissions, and group-based ACLs are captured at ingest and enforced at query time via the governance engine.

📊

Eager & Lazy Embedding

Choose Eager mode (embed at ingest — immediate semantic search) or Lazy mode (embed on demand — lower write overhead for large cold archives). Switch per Stratum.

🔄

ZSTD Compression + CAS

All raw content is stored as ZSTD-compressed archives. Content-addressed storage (CAS) deduplicates text payloads across documents — dramatically reducing storage footprint.
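
To make the deduplication idea concrete, the sketch below stores each text payload under the SHA-256 digest of its bytes, so identical payloads are written once however many documents reference them. It is a minimal in-memory illustration: `ContentStore` is a hypothetical name, the choice of SHA-256 is an assumption, and `zlib` from the standard library stands in for ZSTD.

```python
import hashlib
import zlib


class ContentStore:
    """Toy content-addressed store: key = SHA-256 of the raw bytes."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, text: str) -> str:
        raw = text.encode("utf-8")
        digest = hashlib.sha256(raw).hexdigest()
        if digest not in self._blobs:
            # zlib is used here only as a stand-in for ZSTD compression.
            self._blobs[digest] = zlib.compress(raw)
        return digest

    def get(self, digest: str) -> str:
        return zlib.decompress(self._blobs[digest]).decode("utf-8")

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
key_a = store.put("Quarterly risk report, final version.")
key_b = store.put("Quarterly risk report, final version.")  # same payload, e.g. an email attachment
key_c = store.put("Board minutes, March.")
```

Because the key is derived from the content, two documents carrying the same attachment end up pointing at a single stored blob.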

📁

Document State Lifecycle

Documents move through PRODUCTION → ARCHIVED_WARM → ARCHIVED_COLD → FILE_DELETED states, with configurable warm/cold tiering thresholds per source.
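
One way to picture the lifecycle is as a small state machine. The sketch below uses the four state names from the text and assumes transitions are strictly linear and forward-only; that assumption, and the `advance` helper, are illustrative rather than documented behaviour.

```python
from enum import Enum


class DocState(Enum):
    PRODUCTION = "PRODUCTION"
    ARCHIVED_WARM = "ARCHIVED_WARM"
    ARCHIVED_COLD = "ARCHIVED_COLD"
    FILE_DELETED = "FILE_DELETED"


# Assumed transition table: each state may only move to the next one.
ALLOWED = {
    DocState.PRODUCTION: {DocState.ARCHIVED_WARM},
    DocState.ARCHIVED_WARM: {DocState.ARCHIVED_COLD},
    DocState.ARCHIVED_COLD: {DocState.FILE_DELETED},
    DocState.FILE_DELETED: set(),
}


def advance(current: DocState, target: DocState) -> DocState:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


state = advance(DocState.PRODUCTION, DocState.ARCHIVED_WARM)
state = advance(state, DocState.ARCHIVED_COLD)
```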

🔍

Document Field Versioning

Track field-level version history for reclassified documents — see exactly which attributes changed, when, and by whom. Full audit trail for compliance.

AI Query Engine

Natural language intelligence over your entire archive

The AI Query page brings together a professional Monaco SQL editor, a persistent multi-session AI chat assistant, and a real-time schema browser — all powered by your choice of LLM. For unstructured data, the same interface switches to RAG mode, retrieving relevant document chunks and synthesising answers with citations.

Structured Query (SQL + AI)

Monaco editor with syntax highlighting and schema autocomplete

AI generates schema-aware SQL from natural language

COUNT(*) pre-check before full query execution

Multi-session chat with persistent history per archive

Inline pruning panel — see exactly which files were skipped

Results visualised as bar, line, pie, or scatter charts

AI narrative generation — convert results to business prose

Saved queries, query history, thumbs up/down feedback

Export results to CSV / Excel

Lazy schema tree — expand catalogs, schemas, tables on click

Unstructured RAG Query

Semantic vector search across all ingested documents

RAG: top-K chunk retrieval + LLM synthesis with citations

Document preview pane with entity highlights

Similar document discovery (post-retrieval)

LvEL commands — advanced filter syntax for structured search

Saved commands — repeatable research workflows

In-session document access requests with justification
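
The retrieval half of RAG can be illustrated without a vector database. The sketch below ranks chunks by cosine similarity to a query vector and keeps the top K. It uses toy 3-dimensional vectors in place of real 768-dimensional embeddings, and `cosine` and `top_k` are hypothetical helpers, not the Qdrant API.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], chunks: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks most similar to the query, best first."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:k]


# Toy chunks with made-up vectors and citation metadata.
chunks = [
    {"doc": "policy.pdf", "page": 3, "vector": [1.0, 0.0, 0.0]},
    {"doc": "memo.docx", "page": 1, "vector": [0.9, 0.1, 0.0]},
    {"doc": "minutes.eml", "page": 1, "vector": [0.0, 1.0, 0.0]},
]
hits = top_k([1.0, 0.0, 0.0], chunks, k=2)
citations = [(h["doc"], h["page"]) for h in hits]
```

The retrieved chunks and their source metadata would then be passed to the LLM, which is how the synthesised answer can carry citations.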

Supported AI Providers

Sovereign DataVault is provider-agnostic. Switch models per session or set an org-level default. Air-gap deployments use Ollama with locally-hosted GGUF models — no internet required.

🤖 Ollama (local)

✦ Claude (Anthropic)

⬡ OpenAI

☁️ AWS Bedrock

🔷 IBM WatsonX

🐉 Oracle Vector DB

AI Session · FINCORE Archive

👤

What were the top 10 loan disbursements in Q1 2023?

🤖

Generated SQL:

SELECT loan_id, customer_name,
  amount, disbursement_date
FROM datagen.fincore.loan_master
WHERE disbursement_date BETWEEN
  '2023-01-01' AND '2023-03-31'
ORDER BY amount DESC LIMIT 10

✓ 10 rows · 89ms · 31 files pruned

Restore & Test Data Management

From archived Parquet back to your production database — in hours

Point-in-Time Restore

Restore archived Iceberg Parquet files to any supported target relational database. Sovereign DataVault reads Parquet, maps types back to the target schema, and writes row by row in foreign-key (FK) dependency order. Schema drift is detected and surfaced before the restore begins, so there are no surprises.

Restore to Oracle, PostgreSQL, MySQL, and SQL Server targets

ADD COLUMN schema diff preview before restore

Configurable batch size and parallel writer count

Restore points — select any historical archive version

Unstructured restore: LVUS restore dialog + per-file progress
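
FK order awareness amounts to loading parent tables before the tables that reference them. A minimal sketch of that ordering, using the standard library's `graphlib.TopologicalSorter`, is shown below. Apart from `loan_master`, the table names are invented, and `load_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter


def load_order(fk_deps: dict[str, set[str]]) -> list[str]:
    """fk_deps maps each table to the set of parent tables it references.

    The returned list places every parent before its children, so rows can
    be inserted without violating foreign-key constraints.
    """
    return list(TopologicalSorter(fk_deps).static_order())


# Hypothetical FK graph: repayment -> loan_master -> customer, branch
fk_deps = {
    "loan_master": {"customer", "branch"},
    "repayment": {"loan_master"},
}
order = load_order(fk_deps)
```

`TopologicalSorter` raises `graphlib.CycleError` on circular references, which a real restore would need to handle separately, for example by deferring constraints.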

Test Data Management (TDM)

TDM Workflows provision masked copies of production archives to dev and staging environments. Every workflow is replayable, CI/CD-triggerable, and produces referentially intact, compliance-safe test data.

Masking modes: FULL, PARTIAL, REGEX, FPE, RANDOM, DENY, NULL

FK chain traversal — provision in dependency order

Row count and size estimation before provisioning

Seed SQL files for deterministic initial data state

CI/CD API keys — trigger provisioning from pipelines

Job retry (max 3) with full run history per workflow

CDC bridge sources for near-real-time test data refresh
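
To give a feel for how masking modes differ, the sketch below implements four of the listed modes (FULL, PARTIAL, REGEX, NULL) on string values. The exact semantics of each mode are assumptions made for this example, and FPE, RANDOM, and DENY are deliberately left out because they need a cipher, a random source, or access-control context.

```python
import re
from typing import Optional


def mask_value(value: str, mode: str, pattern: str = r"\d", keep_last: int = 4) -> Optional[str]:
    """Apply one masking mode to a string value (assumed semantics)."""
    if mode == "NULL":
        # Replace the value with NULL entirely.
        return None
    if mode == "FULL":
        # Hide every character but preserve the length.
        return "*" * len(value)
    if mode == "PARTIAL":
        # Keep only the last `keep_last` characters visible.
        visible = value[-keep_last:] if keep_last > 0 else ""
        return "*" * (len(value) - len(visible)) + visible
    if mode == "REGEX":
        # Mask every match of `pattern` (digits by default).
        return re.sub(pattern, "*", value)
    raise ValueError(f"mode not covered by this sketch: {mode}")


masked_pan = mask_value("4111111111111111", "PARTIAL")
masked_ref = mask_value("AB-123", "REGEX")
```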

Semi-Structured Restore

The Semi-Structured plane handles CSV, JSON, XML, EDI, and fixed-width archives independently — with its own restore dialog, file browser, and job tracking. Restore specific files or full source snapshots.

TDM Workflow — RISK_DEV_REFRESH

⚙ FK Graph Build: 87 links resolved

📏 Size Estimate: 14.2 GB · 8.7M rows

🔐 Apply Masks: FPE on PAN, PARTIAL on email

✍ Write to DEV_DB: In progress

🌱 Run seed.sql: Queued

Triggered via CI/CD · API key: tdm_•••••••••••••

🔁 CDC Bridge (Phase 3)

Register a Change Data Capture bridge source to ingest CDC event batches into TDM — keeping your test environments in near-real-time sync with production changes, without direct production access.

Observability

See everything — every service, every trace, every anomaly

Sovereign DataVault's observability suite gives platform operators full visibility into the health and behaviour of every component — from Stratum agent heartbeats to Trino query traces and archive job event streams.

🗺️

Service Map

Interactive topology diagram showing every service (agents, Trino, Qdrant, Ollama, catalogs) and their live connections. Click any node for health details, latency metrics, and recent events.

🔍

Log Search

Full-text search across all service logs with time-range filtering, severity faceting, and service-level filtering. Query across hundreds of gigabytes of structured log data in seconds.

🩺

Root Cause Analysis

Automatic correlation of anomalies across services. When an archive job fails, RCA traces the causal chain — from the agent event, through Trino query failures, to the originating storage or network error.

🕵️

Trace Explorer

Distributed trace viewer for every query and archive operation. Inspect spans across the full call stack — backend → agent → Trino → Iceberg catalog — with timing, status, and payload at each hop.

💥

Impact Analyzer

Before making infrastructure changes, model the downstream impact — which archive jobs, queries, and restore pipelines will be affected. Blast-radius analysis for Stratum maintenance windows.

📈

Observability Dashboard

Pre-bucketed log metrics (1-minute, 5-minute, 1-hour granularity), service health summaries, active and resolved outage tracking, and configurable alert thresholds — all in a single pane of glass.
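
Pre-bucketing means each log event is counted against the start of its time window when it arrives, so dashboards read small aggregates instead of raw logs. The sketch below floors timestamps to the three granularities named above. The `"1m"`, `"5m"`, and `"1h"` labels and both function names are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

# Bucket widths in seconds for the three granularities.
GRANULARITIES = {"1m": 60, "5m": 300, "1h": 3600}


def bucket_start(ts: datetime, granularity: str) -> datetime:
    """Floor a timezone-aware timestamp to the start of its bucket (UTC)."""
    width = GRANULARITIES[granularity]
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % width, tz=timezone.utc)


def bucket_counts(timestamps: list[datetime], granularity: str) -> Counter:
    """Count events per bucket at the given granularity."""
    return Counter(bucket_start(ts, granularity) for ts in timestamps)


events = [
    datetime(2024, 1, 1, 10, 7, 42, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 10, 8, 5, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 10, 12, 0, tzinfo=timezone.utc),
]
per_five_minutes = bucket_counts(events, "5m")
```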

Deployment Architecture

Sovereign by design — no compromise

Three clearly separated planes (control, data, and AI) can be deployed on-premises, in a private cloud, or in a hybrid configuration. No customer data ever transits the AI plane. No metadata ever leaves Meridian.

Control Plane — Meridian (OneView DSL)

React Frontend (MUI)

FastAPI Backend

Gunicorn + Uvicorn

PostgreSQL MetaDB

Redis Pub/Sub

WebSocket Gateway Pool

Celery / APScheduler

HashiCorp Vault

Nginx :443 TLS Termination

↕ WebSocket (~10 ms job dispatch) + REST API over TLS

Stratum — Structured (LVS)

LVS Agent (systemd)

Trino Query Engine

Iceberg REST Catalog

PostgreSQL 16 (catalog)

Block / NFS Storage

Stratum — Unstructured (LVUS)

LVUS Agent (systemd)

Qdrant Vector Store

Ollama (embed model)

Trino + Iceberg Catalog

Block / NFS / S3 Blobs

AI Server (Query-Time Only)

Ollama LLM (Qwen2.5 etc.)

Claude / OpenAI / Bedrock

RAG Orchestration

NO customer data stored

↕ Stratum-local data access only (vectors + chunks stay on Stratum)

Data Sources (remain untouched — Sovereign DataVault reads, never modifies source data)

Oracle

PostgreSQL

MySQL

SQL Server

IBM Db2

AWS S3

Azure Blob

SharePoint

HDFS

NFS

SFTP

Email

Logs

Confluence

Google Drive

🏗️ On-Premises

Docker Compose for single-node deployments. Kubernetes + Helm for production clusters. All components are deployable on RHEL 8/9, Ubuntu 22.04+, or compatible enterprise Linux.

🔌 Air-Gap

Full air-gap operation. Package Manager distributes binaries and LLM models offline. Ollama serves local inference. No external internet connection required at any point after initial setup.

☁️ Private Cloud / SaaS Overlay

Set SCHEDULER_EXTERNAL=true to offload scheduling to Redis. Add the saas Docker profile for cloud-native operation. The control plane can run in your private cloud while Stratum nodes remain on-premises.

The Sovereign DataVault Platform