10 sections · 3 search modes · 100% local / no cloud · RAG with citations
Every moving part of Knowledge Hub: the system architecture, OCR pipeline, full-text search stack, semantic search with pgvector, hybrid ranking math, and the RAG layer that powers cited answers, all with implementation notes.
The system is a single-origin Flask API behind Gunicorn. All storage is local: Postgres 16 handles both keyword search and semantic vectors via pgvector, so there is one operational store and no separate vector database to run. File originals live in Storage (local filesystem or S3). Ollama runs the answer LLM entirely on-device.
All services run locally: no API keys, no cloud dependency. Flask (Gunicorn) is the single entry point: uploads go to Storage and queue background ingestion into Postgres; search and embedding queries hit Postgres directly; answer requests retrieve from Postgres, then forward context to Ollama.
- Postgres: GIN for FTS · IVFFlat for ANN
- Flask ingestion: 202 Accepted · async chunking
- Ollama: gemma3:1b · no API key needed
The schema centres on a handful of tables: documents is the catalogue; chunks is the retrieval unit; embeddings holds the pgvector column; users and tags round out multi-user and organisational features.
| Table | Key Columns | Notes |
|---|---|---|
| documents | id, user_id, title, source_path, mime_type, pages, bytes, hash_sha256, status | SHA-256 used for dedup and versioning |
| chunks | id, document_id, version, page_no, chunk_index, text, tokens, modality, bbox, extra_json | extra_json stores { "ocr_conf": 72.5 }; chunk is the retrieval unit |
| embeddings | id, chunk_id, model, dim, vector (pgvector) | Embedding dim must equal model output (384 for MiniLM-L6-v2) |
| users / tags | id, email, password_hash / name, color | AuthZ per user_id; document_tags join table |
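For concreteness, here is a hedged sketch of the core DDL, with column lists abbreviated and names taken from the table above; the connection string is a placeholder and the repo's migrations remain the source of truth.

```python
# Abbreviated DDL sketch; assumes the pgvector extension is installed.
import psycopg2

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS documents (
    id          BIGSERIAL PRIMARY KEY,
    user_id     BIGINT NOT NULL,
    title       TEXT,
    hash_sha256 TEXT UNIQUE,            -- dedup and versioning
    status      TEXT DEFAULT 'pending'
);

CREATE TABLE IF NOT EXISTS chunks (
    id          BIGSERIAL PRIMARY KEY,
    document_id BIGINT REFERENCES documents(id),
    page_no     INT,
    chunk_index INT,
    text        TEXT,
    extra_json  JSONB                   -- e.g. {"ocr_conf": 72.5}
);

CREATE TABLE IF NOT EXISTS embeddings (
    id       BIGSERIAL PRIMARY KEY,
    chunk_id BIGINT REFERENCES chunks(id),
    model    TEXT,
    dim      INT CHECK (dim = 384),     -- one way to enforce the MiniLM dim at insert
    vector   vector(384)
);
"""

with psycopg2.connect("dbname=khub") as conn:
    conn.cursor().execute(DDL)
```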
Design decision
Initially one chunk per page; later refined to 300–700 tokens with overlap once search quality testing showed that page-sized chunks hurt precision on long PDFs. Chunk size turned out to matter more than embedding model choice.
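A minimal sliding-window chunker in that spirit; the whitespace tokenisation and the 500/75 defaults are illustrative, not the repo's exact splitter.

```python
def chunk_tokens(tokens, max_tokens=500, overlap=75):
    """Yield overlapping windows in the 300-700-token band."""
    step = max_tokens - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield tokens[start:start + max_tokens]

page_text = "..."  # OCR output or embedded text layer for one page
chunks = [" ".join(window) for window in chunk_tokens(page_text.split())]
```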
Upload returns 202 immediately. A background thread does the heavy work: rendering, preprocessing, OCR, chunking, and embedding. Each stage records per-page errors and continues rather than aborting on a bad page.
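The shape of that endpoint, as a sketch; the route and helper names are assumptions, not the repo's exact API.

```python
# Minimal sketch of the 202 pattern; save_original/ingest are hypothetical
# stand-ins for the repo's storage writer and pipeline entry point.
import threading
import uuid

from flask import Flask, request, jsonify

app = Flask(__name__)

def save_original(doc_id: str, raw: bytes) -> None:
    ...  # write the untouched upload to Storage

def ingest(doc_id: str) -> None:
    ...  # render -> preprocess -> OCR -> chunk -> embed, committing as it goes

@app.post("/documents")
def upload():
    raw = request.files["file"].read()
    doc_id = str(uuid.uuid4())
    save_original(doc_id, raw)
    threading.Thread(target=ingest, args=(doc_id,), daemon=True).start()
    return jsonify(id=doc_id, status="processing"), 202
```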
Each document goes through PyMuPDF rendering at scale 3 for clarity, then deskew (affine transform), non-local-means denoising, and Otsu/adaptive binarization before Tesseract runs three PSM passes and keeps the highest-confidence result. If average token confidence falls below the threshold, TrOCR (ViT encoder + Transformer LM) handles the page as a fallback. Text is then chunked at 300–700 tokens with overlap, embedded by Sentence-Transformers, and stored in Postgres.
3.1: Rendering & Preprocessing
PyMuPDF rasterises each page at scale s=3 (typically 216 DPI) to improve OCR readability. Deskew estimates rotation via minimum-area rectangle on foreground pixels and applies an affine transform. Non-local-means denoising followed by an unsharp mask sharpens fine details. Binarization uses Otsu's global threshold or adaptive local thresholds for uneven illumination. A 2×2 closing operation connects broken handwriting strokes.
M = [[cos θ, sin θ, tₓ], [−sin θ, cos θ, tᵧ]]
affine deskew — θ from minAreaRect on foreground pixels
I_sharp = a·I − b·G_σ(I) (a=1.5, b=0.5)
unsharp mask — G_σ is Gaussian blur at scale σ
t* = arg max_t σ²_B(t)
Otsu threshold — maximises between-class variance
T(x,y) = μ_{N(x,y)} − C
adaptive threshold — local neighbourhood mean minus constant C
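Condensed into OpenCV calls, the whole stage looks roughly like this; thresholds are illustrative, and the deskew branch assumes the OpenCV 4.x convention where minAreaRect returns angles in (0, 90].

```python
import cv2
import numpy as np

def preprocess(gray: np.ndarray) -> np.ndarray:
    # Deskew: θ from the minimum-area rectangle around foreground (dark) pixels;
    # the 200 cutoff for "foreground" is an illustrative value.
    coords = np.column_stack(np.where(gray < 200)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:
        angle -= 90                      # map (0, 90] to a small correction
    h, w = gray.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    gray = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)

    # Non-local-means denoise, then unsharp mask: I_sharp = 1.5·I − 0.5·G_σ(I).
    gray = cv2.fastNlMeansDenoising(gray, h=10)
    blur = cv2.GaussianBlur(gray, (0, 0), sigmaX=3)
    gray = cv2.addWeighted(gray, 1.5, blur, -0.5, 0)

    # Otsu global threshold (cv2.adaptiveThreshold handles uneven lighting),
    # then a 2×2 closing to reconnect broken handwriting strokes.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return cv2.morphologyEx(binary, cv2.MORPH_CLOSE, np.ones((2, 2), np.uint8))
```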
3.2: Tesseract OCR (multi-pass + confidence)
Three PSM configs run (--psm 6, 11, 4) and the result with highest average word confidence is kept. Token confidences c_i ∈ [0, 100] are returned by image_to_data; the pipeline computes the average and stores it alongside the text in chunks.extra_json.
avg_conf = (1/n) ∑ᵢ cᵢ, cᵢ ∈ [0, 100]
average token confidence — stored in chunks.extra_json as ocr_conf; fallback if below threshold
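As a sketch with pytesseract; the -1 confidences that image_to_data emits for non-word boxes are skipped before averaging.

```python
import pytesseract
from pytesseract import Output

def best_ocr(img, psms=(6, 11, 4)):
    best_text, best_conf = "", -1.0
    for psm in psms:
        data = pytesseract.image_to_data(img, config=f"--psm {psm}",
                                         output_type=Output.DICT)
        confs = [float(c) for c in data["conf"] if float(c) >= 0]
        avg = sum(confs) / len(confs) if confs else 0.0
        if avg > best_conf:
            best_text = " ".join(w for w in data["text"] if w.strip())
            best_conf = avg
    return best_text, best_conf   # best_conf lands in chunks.extra_json["ocr_conf"]
```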
3.3: Handwriting Fallback (TrOCR)
When Tesseract confidence is low, TrOCR (VisionEncoderDecoder: ViT encoder + Transformer LM) handles the page. The decoder minimises cross-entropy loss over token sequences conditioned on the image, making it far more robust to cursive and irregular handwriting. This runs only as a fallback to keep inference time reasonable.
L = −∑ₜ log p(yₜ | y_{<t}, image)
cross-entropy loss — Transformer decoder conditioned on ViT image features
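With Hugging Face's public handwriting checkpoint, the fallback is a few lines; the repo may pin a different model id.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

def trocr_page(image: Image.Image) -> str:
    pixel_values = processor(image.convert("RGB"), return_tensors="pt").pixel_values
    ids = model.generate(pixel_values)              # autoregressive decoding
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```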
Postgres's native FTS is fast enough for sub-second search over tens of thousands of chunks with a GIN index, and it ships with the database — no Elasticsearch to operate.
4.1: Tokenisation → tsvector
Postgres parses text, lowercases it, stems it, removes stopwords, and produces a multiset of lexemes with positional info. Example: to_tsvector('english', text) → a:1 b:2,7 c:4. We index on to_tsvector('english', coalesce(text,'')) with a GIN index for sub-second search.
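The index itself, sketched with psycopg2; the connection string is a placeholder.

```python
import psycopg2

with psycopg2.connect("dbname=khub") as conn:
    conn.cursor().execute("""
        CREATE INDEX IF NOT EXISTS chunks_fts_idx
        ON chunks USING GIN (to_tsvector('english', coalesce(text, '')));
    """)
```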
4.2: Queries → tsquery
plainto_tsquery('english', q) handles plain queries robustly. websearch_to_tsquery supports Google-like syntax with "phrase", -exclude, and OR.
4.3: Ranking → ts_rank / ts_rank_cd
ts_rank ranks by term frequency with optional weights per lexeme class. ts_rank_cd (cover density) favours compact spans covering many query terms — it scores coverage windows and normalises by document length. We use ts_rank_cd because it penalises chunks where the query terms are scattered through the text rather than clustered together.
4.4: Snippets → ts_headline
ts_headline generates fragments with query terms emphasised. Parameters control fragment count, min/max word window, and surrounding HTML tags — we use <b>…</b> to highlight hits in the search UI without XSS risk.
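One query can exercise 4.2–4.4 together; a sketch, not the repo's exact SQL.

```python
import psycopg2

SEARCH_SQL = """
SELECT c.id,
       ts_rank_cd(to_tsvector('english', coalesce(c.text, '')), q) AS score,
       ts_headline('english', c.text, q,
                   'StartSel=<b>, StopSel=</b>, MaxFragments=2')    AS snippet
FROM chunks AS c,
     websearch_to_tsquery('english', %(q)s) AS q
WHERE to_tsvector('english', coalesce(c.text, '')) @@ q
ORDER BY score DESC
LIMIT 20;
"""

with psycopg2.connect("dbname=khub") as conn, conn.cursor() as cur:
    cur.execute(SEARCH_SQL, {"q": '"hybrid search" -elasticsearch'})
    rows = cur.fetchall()
```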
Semantic search finds conceptually similar chunks even when the user's query shares no exact keywords with the document. Both the chunks and the query are encoded into the same vector space; nearest-neighbour search does the rest.
5.1: Embedding Model & Normalisation
sentence-transformers/all-MiniLM-L6-v2 encodes text into dense vectors x ∈ ℝ³⁸⁴. Vectors are L2-normalised so ‖x‖₂ = 1 — this makes cosine similarity equal to the inner product: cos(θ) = x · y. The embedding dimension must match the model output; the embeddings.dim column enforces this at insert time.
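Encoding is two lines with sentence-transformers; normalize_embeddings=True produces the unit-length vectors the inner-product shortcut relies on (the example strings are hypothetical).

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vecs = model.encode(["payment terms for the 2023 lease", "..."],
                    normalize_embeddings=True)   # shape (n, 384), ‖v‖₂ = 1
```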
5.2: Distance Metrics in pgvector
The operator class chosen at index time must match the metric: vector_l2_ops (Euclidean), vector_ip_ops (inner product), or vector_cosine_ops (cosine distance).
5.3: IVFFlat Index (Approximate Nearest Neighbour)
Exact k-NN over all vectors is O(N). IVFFlat coarse-quantises the space into L buckets (the lists parameter) using k-means centroids. At query time only the P nearest centroids (the probes parameter) are scanned, giving roughly O(P · N/L) work: much faster, with a controlled recall tradeoff.
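Index creation and a cosine query, sketched; lists = 100 and probes = 10 are illustrative starting points, not tuned values.

```python
import psycopg2

qvec = "[" + ",".join("0.0" for _ in range(384)) + "]"   # placeholder query vector

with psycopg2.connect("dbname=khub") as conn, conn.cursor() as cur:
    cur.execute("""
        CREATE INDEX IF NOT EXISTS embeddings_ann_idx
        ON embeddings USING ivfflat (vector vector_cosine_ops)
        WITH (lists = 100);
    """)
    cur.execute("SET ivfflat.probes = 10;")              # recall/latency knob
    cur.execute("""
        SELECT chunk_id, 1 - (vector <=> %(q)s::vector) AS cosine_sim
        FROM embeddings
        ORDER BY vector <=> %(q)s::vector
        LIMIT 10;
    """, {"q": qvec})
```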
FTS gives precision (exact keyword hits); semantic search gives recall (conceptual matches). Combining both yields stable rankings across document types and query styles. The key challenge is that the two score scales are incomparable, so z-score normalisation brings them to the same range before blending.
Both FTS and semantic search run in parallel on the same query. Raw scores (ts_rank_cd and cosine similarity 1 − d_cos) are each z-score normalised within their stream, then blended: score = 0.6·z_v + 0.4·z_f. Chunks with low OCR confidence take an optional penalty λ. This produces stable, interpretable rankings across wildly different document types.
Why z-score and not min-max?
Min-max normalisation is sensitive to outliers: one unusually high FTS score on a dense keyword document would compress every other score into a tiny range. Z-score handles outliers better and produces stable, interpretable blending. The weights α and β can later be tuned from click-through data.
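The blend reduces to a few lines; the λ trigger below (conf < 60) is an assumed threshold, and the sketch assumes both streams are scored over the same aligned candidate set.

```python
import statistics

def zscores(xs):
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs) or 1.0            # guard against zero variance
    return [(x - mu) / sd for x in xs]

def blend(vec_scores, fts_scores, ocr_confs, alpha=0.6, beta=0.4, lam=0.2):
    zv, zf = zscores(vec_scores), zscores(fts_scores)
    return [alpha * v + beta * f - (lam if conf < 60 else 0.0)
            for v, f, conf in zip(zv, zf, ocr_confs)]
```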
The goal is to compose an answer only from retrieved chunks, with citations, so the model cannot hallucinate facts that aren't in the user's own documents. Ollama runs gemma3:1b entirely on-device.
7.1: Prompt Structure
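A plausible template in this style, assuming the [CIT-#] labelling that 7.3 extracts (the repo's exact wording may differ), plus the on-device Ollama call.

```python
# Hypothetical prompt template and Ollama call; only the [CIT-#] convention
# and the gemma3:1b model are taken from the text above.
import requests

PROMPT = """Answer strictly from the context below.
Cite every claim with its chunk tag, e.g. [CIT-2].
If the context does not contain the answer, say you cannot find it.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(chunks, question):
    context = "\n\n".join(f"[CIT-{i}] (p.{c['page_no']}) {c['text']}"
                          for i, c in enumerate(chunks, 1))
    return PROMPT.format(context=context, question=question)

def ask(prompt):
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": "gemma3:1b", "prompt": prompt, "stream": False},
                      timeout=120)
    return r.json()["response"]
```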
7.2: Context Packing & Map-Reduce
Chunks are sorted by hybrid score so the highest-quality context appears early (models attend more to earlier tokens). Total context is capped at ~3k–4k tokens. When the top chunks overflow, the pipeline switches to map-reduce: each chunk is summarised independently (map), then the summaries are combined into a final answer (reduce), preserving citations throughout.
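Greedy packing against the budget, as a sketch; the 4-characters-per-token estimate is an assumption, not a tokenizer.

```python
def pack_context(ranked_chunks, budget_tokens=3500):
    packed, used = [], 0
    for chunk in ranked_chunks:                  # highest hybrid score first
        cost = max(1, len(chunk["text"]) // 4)   # rough token estimate
        if used + cost > budget_tokens:
            break                                # overflow -> map-reduce path
        packed.append(chunk)
        used += cost
    return packed
```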
7.3: Citation Enforcement
Post-processing extracts [CIT-#] tags from the LLM output and maps each back to (document_id, page_no, title). If the output contains no citation tags, the pipeline optionally re-prompts with a stricter instruction before returning.
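The extraction step is a regex plus a lookup; field names follow the schema, and the sketch assumes each packed chunk carries its document's title.

```python
import re

def extract_citations(answer, packed_chunks):
    cited = {int(n) for n in re.findall(r"\[CIT-(\d+)\]", answer)}
    return [{"document_id": c["document_id"],
             "page_no": c["page_no"],
             "title": c["title"]}
            for i, c in enumerate(packed_chunks, 1) if i in cited]
```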
Ingestion
Processed in background threads; commits every N pages to avoid long transactions. Per-page errors are recorded and ingestion continues rather than aborting.
FTS
GIN index on to_tsvector(...) keeps queries sub-second. vacuum/analyze runs periodically to keep index statistics fresh.
Vectors
IVFFlat with cosine ops. Tune lists and probes as corpus grows. Embeddings are batched in groups of 64–256 rows to limit RAM.
Monitoring
Log durations for render, OCR, chunk, embed, retrieval, and LLM. Track total end-to-end latency per query. Timeouts avoid request-handler blocking.
- Prefer embedded text over OCR; run OCR only when the PDF has no selectable text layer.
- Chunk at 300–700 tokens with overlap plus a heading_path field to preserve the document outline for context.
- Detect and store table_json plus auto-summaries; caption figures for discoverability in semantic search.
- Learn the α, β weights from click/relevance data; consider BM25 for the lexical score instead of ts_rank_cd.
- Evaluate HNSW or PQ-IVF at ultra-large scale for better recall/latency curves.
- Replace background threads with Redis/RQ or Celery for durable task queues with a /tasks/:id status API.
Source
Everything described here is in the repo. docker-compose up brings up Flask, Postgres, and Ollama together. No external services required.