10 sections · 3 search modes · 100% local / no cloud · RAG with citations
Every moving part of Knowledge Hub: the system architecture, OCR pipeline, full-text search stack, semantic search with pgvector, hybrid ranking math, and the RAG layer that powers cited answers, all with implementation notes.
The system is a single-origin Flask API behind Gunicorn. All storage is local: Postgres 16 handles both keyword search and semantic vectors via pgvector, so there is one operational store and no separate vector database to run. File originals live in Storage (local filesystem or S3). Ollama runs the answer LLM entirely on-device.
All services run locally: no API keys, no cloud dependency. Flask (Gunicorn) is the single entry point: uploads go to Storage and queue background ingestion into Postgres; search and embedding queries hit Postgres directly; answer requests retrieve from Postgres, then forward context to Ollama.
- Postgres: GIN for FTS · IVFFlat for ANN
- Flask ingestion: 202 Accepted · async chunking
- Ollama: gemma3:1b · no API key needed
The schema centres on a handful of tables: documents is the catalogue; chunks is the retrieval unit; embeddings holds the pgvector column; users and tags round out multi-user and organisational features.
| Table | Key Columns | Notes |
|---|---|---|
| documents | id, user_id, title, source_path, mime_type, pages, bytes, hash_sha256, status | SHA-256 used for dedup and versioning |
| chunks | id, document_id, version, page_no, chunk_index, text, tokens, modality, bbox, extra_json | extra_json stores { "ocr_conf": 72.5 }; chunk is the retrieval unit |
| embeddings | id, chunk_id, model, dim, vector (pgvector) | Embedding dim must equal model output (384 for MiniLM-L6-v2) |
| users / tags | id, email, password_hash / name, color | AuthZ per user_id; document_tags join table |
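For concreteness, here is a hedged sketch of the core DDL, with column lists abbreviated and names taken from the table above; the connection string is a placeholder and the repo's migrations remain the source of truth.

```python
# Abbreviated DDL sketch; assumes the pgvector extension is installed.
import psycopg2

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS documents (
    id          BIGSERIAL PRIMARY KEY,
    user_id     BIGINT NOT NULL,
    title       TEXT,
    hash_sha256 TEXT UNIQUE,            -- dedup and versioning
    status      TEXT DEFAULT 'pending'
);

CREATE TABLE IF NOT EXISTS chunks (
    id          BIGSERIAL PRIMARY KEY,
    document_id BIGINT REFERENCES documents(id),
    page_no     INT,
    chunk_index INT,
    text        TEXT,
    extra_json  JSONB                   -- e.g. {"ocr_conf": 72.5}
);

CREATE TABLE IF NOT EXISTS embeddings (
    id       BIGSERIAL PRIMARY KEY,
    chunk_id BIGINT REFERENCES chunks(id),
    model    TEXT,
    dim      INT CHECK (dim = 384),     -- one way to enforce the MiniLM dim at insert
    vector   vector(384)
);
"""

with psycopg2.connect("dbname=khub") as conn:
    conn.cursor().execute(DDL)
```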
Design decision
Initially one chunk per page; later refined to 300–700 tokens with overlap once search quality testing showed that page-sized chunks hurt precision on long PDFs. Chunk size turned out to matter more than embedding model choice.
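A minimal sliding-window chunker in that spirit; the whitespace tokenisation and the 500/75 defaults are illustrative, not the repo's exact splitter.

```python
def chunk_tokens(tokens, max_tokens=500, overlap=75):
    """Yield overlapping windows in the 300-700-token band."""
    step = max_tokens - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield tokens[start:start + max_tokens]

page_text = "..."  # OCR output or embedded text layer for one page
chunks = [" ".join(window) for window in chunk_tokens(page_text.split())]
```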
Upload returns 202 immediately. A background thread does the heavy work: rendering, preprocessing, OCR, chunking, and embedding. Each stage records per-page errors and continues rather than aborting on a bad page.
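The shape of that endpoint, as a sketch; the route and helper names are assumptions, not the repo's exact API.

```python
# Minimal sketch of the 202 pattern; save_original/ingest are hypothetical
# stand-ins for the repo's storage writer and pipeline entry point.
import threading
import uuid

from flask import Flask, request, jsonify

app = Flask(__name__)

def save_original(doc_id: str, raw: bytes) -> None:
    ...  # write the untouched upload to Storage

def ingest(doc_id: str) -> None:
    ...  # render -> preprocess -> OCR -> chunk -> embed, committing as it goes

@app.post("/documents")
def upload():
    raw = request.files["file"].read()
    doc_id = str(uuid.uuid4())
    save_original(doc_id, raw)
    threading.Thread(target=ingest, args=(doc_id,), daemon=True).start()
    return jsonify(id=doc_id, status="processing"), 202
```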
Each document goes through PyMuPDF rendering at scale 3 for clarity, then deskew (affine transform), non-local-means denoising, and Otsu/adaptive binarization before Tesseract runs three PSM passes and keeps the highest-confidence result. If average token confidence falls below the threshold, TrOCR (ViT encoder + Transformer LM) handles the page as a fallback. Text is then chunked at 300–700 tokens with overlap, embedded by Sentence-Transformers, and stored in Postgres.
3.1: Rendering & Preprocessing
PyMuPDF rasterises each page at scale s=3 (typically 216 DPI) to improve OCR readability. Deskew estimates rotation via minimum-area rectangle on foreground pixels and applies an affine transform. Non-local-means denoising followed by an unsharp mask sharpens fine details. Binarization uses Otsu's global threshold or adaptive local thresholds for uneven illumination. A 2×2 closing operation connects broken handwriting strokes.
M = [[cos θ, sin θ, tₓ], [−sin θ, cos θ, tᵧ]]
affine deskew — θ from minAreaRect on foreground pixels
I_sharp = a·I − b·G_σ(I) (a=1.5, b=0.5)
unsharp mask — G_σ is Gaussian blur at scale σ
t* = arg max_t σ²_B(t)
Otsu threshold — maximises between-class variance
T(x,y) = μ_{N(x,y)} − C
adaptive threshold — local neighbourhood mean minus constant C
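Condensed into OpenCV calls, the whole stage looks roughly like this; thresholds are illustrative, and the deskew branch assumes the OpenCV 4.x convention where minAreaRect returns angles in (0, 90].

```python
import cv2
import numpy as np

def preprocess(gray: np.ndarray) -> np.ndarray:
    # Deskew: θ from the minimum-area rectangle around foreground (dark) pixels;
    # the 200 cutoff for "foreground" is an illustrative value.
    coords = np.column_stack(np.where(gray < 200)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:
        angle -= 90                      # map (0, 90] to a small correction
    h, w = gray.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    gray = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)

    # Non-local-means denoise, then unsharp mask: I_sharp = 1.5·I − 0.5·G_σ(I).
    gray = cv2.fastNlMeansDenoising(gray, h=10)
    blur = cv2.GaussianBlur(gray, (0, 0), sigmaX=3)
    gray = cv2.addWeighted(gray, 1.5, blur, -0.5, 0)

    # Otsu global threshold (cv2.adaptiveThreshold handles uneven lighting),
    # then a 2×2 closing to reconnect broken handwriting strokes.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return cv2.morphologyEx(binary, cv2.MORPH_CLOSE, np.ones((2, 2), np.uint8))
```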
3.2: Tesseract OCR (multi-pass + confidence)
Three PSM configs run (--psm 6, 11, 4) and the result with highest average word confidence is kept. Token confidences c_i ∈ [0, 100] are returned by image_to_data; the pipeline computes the average and stores it alongside the text in chunks.extra_json.
avg_conf = (1/n) ∑ᵢ cᵢ, cᵢ ∈ [0, 100]
average token confidence — stored in chunks.extra_json as ocr_conf; fallback if below threshold
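As a sketch with pytesseract; the -1 confidences that image_to_data emits for non-word boxes are skipped before averaging.

```python
import pytesseract
from pytesseract import Output

def best_ocr(img, psms=(6, 11, 4)):
    best_text, best_conf = "", -1.0
    for psm in psms:
        data = pytesseract.image_to_data(img, config=f"--psm {psm}",
                                         output_type=Output.DICT)
        confs = [float(c) for c in data["conf"] if float(c) >= 0]
        avg = sum(confs) / len(confs) if confs else 0.0
        if avg > best_conf:
            best_text = " ".join(w for w in data["text"] if w.strip())
            best_conf = avg
    return best_text, best_conf   # best_conf lands in chunks.extra_json["ocr_conf"]
```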
3.3: Handwriting Fallback (TrOCR)
When Tesseract confidence is low, TrOCR (VisionEncoderDecoder: ViT encoder + Transformer LM) handles the page. The decoder minimises cross-entropy loss over token sequences conditioned on the image, making it far more robust to cursive and irregular handwriting. This runs only as a fallback to keep inference time reasonable.
L = −∑ₜ log p(yₜ | y_{<t}, image)
cross-entropy loss — Transformer decoder conditioned on ViT image features
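With Hugging Face's public handwriting checkpoint, the fallback is a few lines; the repo may pin a different model id.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

def trocr_page(image: Image.Image) -> str:
    pixel_values = processor(image.convert("RGB"), return_tensors="pt").pixel_values
    ids = model.generate(pixel_values)              # autoregressive decoding
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```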
Postgres's native FTS is fast enough for sub-second search over tens of thousands of chunks with a GIN index, and it ships with the database — no Elasticsearch to operate.
4.1: Tokenisation → tsvector
Postgres parses text, lowercases it, stems it, removes stopwords, and produces a multiset of lexemes with positional info. Example: to_tsvector('english', text) → a:1 b:2,7 c:4. We index on to_tsvector('english', coalesce(text,'')) with a GIN index for sub-second search.
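The index itself, sketched with psycopg2; the connection string is a placeholder.

```python
import psycopg2

with psycopg2.connect("dbname=khub") as conn:
    conn.cursor().execute("""
        CREATE INDEX IF NOT EXISTS chunks_fts_idx
        ON chunks USING GIN (to_tsvector('english', coalesce(text, '')));
    """)
```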
4.2: Queries → tsquery
plainto_tsquery('english', q) handles plain queries robustly. websearch_to_tsquery supports Google-like syntax with "phrase", -exclude, and OR.
4.3: Ranking → ts_rank / ts_rank_cd
ts_rank ranks by term frequency with optional weights per lexeme class. ts_rank_cd (cover density) favours compact spans covering many query terms — it scores coverage windows and normalises by document length. We use ts_rank_cd because it penalises chunks where the query terms are scattered through the text rather than clustered together.
4.4: Snippets → ts_headline
ts_headline generates fragments with query terms emphasised. Parameters control fragment count, min/max word window, and surrounding HTML tags — we use <b>…</b> to highlight hits in the search UI without XSS risk.
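One query can exercise 4.2–4.4 together; a sketch, not the repo's exact SQL.

```python
import psycopg2

SEARCH_SQL = """
SELECT c.id,
       ts_rank_cd(to_tsvector('english', coalesce(c.text, '')), q) AS score,
       ts_headline('english', c.text, q,
                   'StartSel=<b>, StopSel=</b>, MaxFragments=2')    AS snippet
FROM chunks AS c,
     websearch_to_tsquery('english', %(q)s) AS q
WHERE to_tsvector('english', coalesce(c.text, '')) @@ q
ORDER BY score DESC
LIMIT 20;
"""

with psycopg2.connect("dbname=khub") as conn, conn.cursor() as cur:
    cur.execute(SEARCH_SQL, {"q": '"hybrid search" -elasticsearch'})
    rows = cur.fetchall()
```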
Semantic search finds conceptually similar chunks even when the user's query shares no exact keywords with the document. Both the chunks and the query are encoded into the same vector space; nearest-neighbour search does the rest.
5.1: Embedding Model & Normalisation
sentence-transformers/all-MiniLM-L6-v2 encodes text into dense vectors x ∈ ℝ³⁸⁴. Vectors are L2-normalised so ‖x‖₂ = 1 — this makes cosine similarity equal to the inner product: cos(θ) = x · y. The embedding dimension must match the model output; the embeddings.dim column enforces this at insert time.
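Encoding is two lines with sentence-transformers; normalize_embeddings=True produces the unit-length vectors the inner-product shortcut relies on (the example strings are hypothetical).

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vecs = model.encode(["payment terms for the 2023 lease", "..."],
                    normalize_embeddings=True)   # shape (n, 384), ‖v‖₂ = 1
```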
5.2: Distance Metrics in pgvector
The operator class chosen at index time must match the metric: vector_l2_ops (Euclidean), vector_ip_ops (inner product), or vector_cosine_ops (cosine distance).
5.3: IVFFlat Index (Approximate Nearest Neighbour)
Exact k-NN over all vectors is O(N). IVFFlat coarse-quantises the space into L buckets (the lists parameter) using k-means centroids. At query time only the P nearest centroids (the probes parameter) are scanned, giving roughly O(P · N/L) work: much faster, with a controlled recall tradeoff.
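Index creation and a cosine query, sketched; lists = 100 and probes = 10 are illustrative starting points, not tuned values.

```python
import psycopg2

qvec = "[" + ",".join("0.0" for _ in range(384)) + "]"   # placeholder query vector

with psycopg2.connect("dbname=khub") as conn, conn.cursor() as cur:
    cur.execute("""
        CREATE INDEX IF NOT EXISTS embeddings_ann_idx
        ON embeddings USING ivfflat (vector vector_cosine_ops)
        WITH (lists = 100);
    """)
    cur.execute("SET ivfflat.probes = 10;")              # recall/latency knob
    cur.execute("""
        SELECT chunk_id, 1 - (vector <=> %(q)s::vector) AS cosine_sim
        FROM embeddings
        ORDER BY vector <=> %(q)s::vector
        LIMIT 10;
    """, {"q": qvec})
```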
FTS gives precision (exact keyword hits); semantic search gives recall (conceptual matches). Combining both yields stable rankings across document types and query styles. The key challenge is that the two score scales are incomparable, so z-score normalisation brings them to the same range before blending.
Both FTS and semantic search run in parallel on the same query. Raw scores (ts_rank_cd and cosine similarity 1 − d_cos) are each z-score normalised within their stream, then blended: score = 0.6·z_v + 0.4·z_f. Chunks with low OCR confidence take an optional penalty λ. This produces stable, interpretable rankings across wildly different document types.
Why z-score and not min-max?
Min-max normalisation is sensitive to outliers: one unusually high FTS score on a dense keyword document would compress every other score into a tiny range. Z-score handles outliers better and produces stable, interpretable blending. The weights α and β can later be tuned from click-through data.
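The blend reduces to a few lines; the λ trigger below (conf < 60) is an assumed threshold, and the sketch assumes both streams are scored over the same aligned candidate set.

```python
import statistics

def zscores(xs):
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs) or 1.0            # guard against zero variance
    return [(x - mu) / sd for x in xs]

def blend(vec_scores, fts_scores, ocr_confs, alpha=0.6, beta=0.4, lam=0.2):
    zv, zf = zscores(vec_scores), zscores(fts_scores)
    return [alpha * v + beta * f - (lam if conf < 60 else 0.0)
            for v, f, conf in zip(zv, zf, ocr_confs)]
```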
The goal is to compose an answer only from retrieved chunks, with citations, so the model cannot hallucinate facts that aren't in the user's own documents. Ollama runs gemma3:1b entirely on-device.
7.1: Prompt Structure
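A plausible template in this style, assuming the [CIT-#] labelling that 7.3 extracts (the repo's exact wording may differ), plus the on-device Ollama call.

```python
# Hypothetical prompt template and Ollama call; only the [CIT-#] convention
# and the gemma3:1b model are taken from the text above.
import requests

PROMPT = """Answer strictly from the context below.
Cite every claim with its chunk tag, e.g. [CIT-2].
If the context does not contain the answer, say you cannot find it.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(chunks, question):
    context = "\n\n".join(f"[CIT-{i}] (p.{c['page_no']}) {c['text']}"
                          for i, c in enumerate(chunks, 1))
    return PROMPT.format(context=context, question=question)

def ask(prompt):
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": "gemma3:1b", "prompt": prompt, "stream": False},
                      timeout=120)
    return r.json()["response"]
```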
7.2: Context Packing & Map-Reduce
Chunks are sorted by hybrid score so the highest-quality context appears early (models attend more to earlier tokens). Total context is capped at ~3k–4k tokens. When the top chunks overflow, the pipeline switches to map-reduce: each chunk is summarised independently (map), then the summaries are combined into a final answer (reduce), preserving citations throughout.
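Greedy packing against the budget, as a sketch; the 4-characters-per-token estimate is an assumption, not a tokenizer.

```python
def pack_context(ranked_chunks, budget_tokens=3500):
    packed, used = [], 0
    for chunk in ranked_chunks:                  # highest hybrid score first
        cost = max(1, len(chunk["text"]) // 4)   # rough token estimate
        if used + cost > budget_tokens:
            break                                # overflow -> map-reduce path
        packed.append(chunk)
        used += cost
    return packed
```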
7.3: Citation Enforcement
Post-processing extracts [CIT-#] tags from the LLM output and maps each back to (document_id, page_no, title). If the output contains no citation tags, the pipeline optionally re-prompts with a stricter instruction before returning.
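The extraction step is a regex plus a lookup; field names follow the schema, and the sketch assumes each packed chunk carries its document's title.

```python
import re

def extract_citations(answer, packed_chunks):
    cited = {int(n) for n in re.findall(r"\[CIT-(\d+)\]", answer)}
    return [{"document_id": c["document_id"],
             "page_no": c["page_no"],
             "title": c["title"]}
            for i, c in enumerate(packed_chunks, 1) if i in cited]
```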
Ingestion
Processed in background threads; commits every N pages to avoid long transactions. Per-page errors are recorded and ingestion continues rather than aborting.
FTS
GIN index on to_tsvector(...) keeps queries sub-second. vacuum/analyze runs periodically to keep index statistics fresh.
Vectors
IVFFlat with cosine ops. Tune lists and probes as corpus grows. Embeddings are batched in groups of 64–256 rows to limit RAM.
Monitoring
Log durations for render, OCR, chunk, embed, retrieval, and LLM. Track total end-to-end latency per query. Timeouts avoid request-handler blocking.
- Prefer embedded text over OCR; run OCR only when the PDF has no selectable text layer.
- Chunk at 300–700 tokens with overlap plus a heading_path field to preserve the document outline for context.
- Detect and store table_json plus auto-summaries; caption figures for discoverability in semantic search.
- Learn the α, β weights from click/relevance data; consider BM25 for the lexical score instead of ts_rank_cd.
- Evaluate HNSW or PQ-IVF at ultra-large scale for better recall/latency curves.
- Replace background threads with Redis/RQ or Celery for durable task queues with a /tasks/:id status API.
Source
Everything described here is in the repo. docker-compose up brings up Flask, Postgres, and Ollama together. No external services required.