A full-stack app that scrapes tech articles, categorizes them with Azure OpenAI, embeds them into Qdrant, and lets me browse or semantically search a personalized feed.

Tech Updates is a personal tech news aggregator. The pipeline pulls articles from a few sources I actually read, tags them with categories like AI/ML, startups, and web dev, and indexes them in a vector database so I can search the feed semantically. The React + Vite frontend displays the articles as tiles with pagination, dark mode, and a modal for details.
It is not a SaaS product. It is something I built so I would stop wasting twenty minutes every morning bouncing between tabs, and so I had a real reason to put a vector database and an LLM into the same app.
The problem was small but real. I was reading tech news from four or five sources, and there was no single place that pulled them together in a way that respected what I cared about. Most aggregators are either too broad (RSS readers that drown you) or too narrow (one source, one perspective).
I also wanted an excuse to play with vector search and LLM categorization. I had read about Qdrant and SentenceTransformers and wanted to wire them together myself instead of nodding along to blog posts. So this project did two things at once: solve my own annoyance, and let me build something end-to-end across scraping, an LLM pipeline, a vector DB, and a frontend.
The flow is straightforward, with each step doing one thing:
A search request goes to /api/search, which embeds the query, asks Qdrant for the nearest vectors, and returns the matching articles with a similarity score.
The system has five clear layers. Each one has a single job, and the contract between them is plain JSON.
Source-specific Python scripts collect articles and normalize them into a common shape (title, details, source, URL, category).
Utility functions clean the data and call Azure OpenAI to assign categories. Output is written to JSON.
Flask serves three operations under /api: scrape + categorize, upload to Qdrant, and search.
SentenceTransformers encodes queries into 384-dim embeddings. Qdrant searches the tech_news collection by similarity.
React + Vite reads categorized articles and renders the feed with tiles, modal, dark mode, and pagination.
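To make the "plain JSON contract" concrete, here is a minimal sketch of the first two layers. It is not the project's actual code: the raw field names (`headline`, `summary`, `link`), the label set, the fallback label, and the `fake_classify` stand-in for the Azure OpenAI call are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable

# Hypothetical label set; the write-up only names these as example categories.
ALLOWED_CATEGORIES = {"AI/ML", "Startups", "Web Dev"}
FALLBACK_CATEGORY = "Uncategorized"  # assumed fallback, not from the project


@dataclass
class Article:
    """The common shape every layer agrees on."""
    title: str
    details: str
    source: str
    url: str
    category: str = FALLBACK_CATEGORY


def normalize(raw: dict, source: str) -> Article:
    """Map one scraper's raw record onto the shared shape, collapsing whitespace."""
    return Article(
        title=" ".join(raw.get("headline", "").split()),
        details=" ".join(raw.get("summary", "").split()),
        source=source,
        url=raw.get("link", "").strip(),
    )


def categorize(article: Article, classify: Callable[[str], str]) -> Article:
    """Ask the classifier for a label and reject anything outside the allowed set."""
    label = classify(f"{article.title}\n{article.details}").strip()
    article.category = label if label in ALLOWED_CATEGORIES else FALLBACK_CATEGORY
    return article


def fake_classify(text: str) -> str:
    """Offline stand-in for the LLM call so the sketch runs without credentials."""
    return "AI/ML" if "model" in text.lower() else "something unexpected"


raw = {
    "headline": "  Mistral 7B on   consumer hardware ",
    "summary": "A small model that runs locally.",
    "link": "https://example.com/a ",
}
article = categorize(normalize(raw, source="Medium"), fake_classify)
payload = json.dumps([asdict(article)], indent=2)  # what the next layer would read
```

Passing the classifier in as a callable keeps the validation logic testable without a network call, and checking the label against a fixed set guards against the model inventing a category.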
The backend is a Flask app organized around blueprints and utility modules. The main app file registers the API blueprint under /api, initializes Flask-APScheduler so scraping can run on a schedule, and wires up the routes. Qdrant credentials and the Azure OpenAI key load from a .env file through python-dotenv, which kept secrets out of source control from day one and meant I could publish the repo without a cleanup pass.
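A rough sketch of how the secrets might be read once they are in the environment. The variable names here are hypothetical, not the ones the project uses. With python-dotenv, calling `load_dotenv()` before this function would populate `os.environ` from the .env file; the sketch itself stays on the standard library so it runs anywhere.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    qdrant_url: str
    qdrant_api_key: str
    azure_openai_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read required secrets and fail loudly, naming any that are missing."""
    # Hypothetical variable names, chosen for illustration only.
    names = ("QDRANT_URL", "QDRANT_API_KEY", "AZURE_OPENAI_API_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(
        qdrant_url=env["QDRANT_URL"],
        qdrant_api_key=env["QDRANT_API_KEY"],
        azure_openai_key=env["AZURE_OPENAI_API_KEY"],
    )
```

Failing at startup with the names of the missing variables is usually easier to debug than a connection error deep inside a request.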
Each endpoint maps to a small utility function so the route handlers stay thin:
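A thin handler of that kind might look roughly like the sketch below. The blueprint registration under /api and the /search route come from the description above; the POST method, the request and response shapes, and the `search_articles` helper with its placeholder result are assumptions. Scheduler setup is left out.

```python
from flask import Blueprint, Flask, jsonify, request

api = Blueprint("api", __name__)


def search_articles(query: str, limit: int = 5) -> list[dict]:
    """Hypothetical utility; the real one would embed the query and call Qdrant."""
    placeholder = [{"title": "Mistral 7B on consumer hardware", "score": 0.71}]
    return placeholder[:limit] if query else []


@api.route("/search", methods=["POST"])
def search():
    # The handler only parses input, delegates, and shapes the response.
    body = request.get_json(silent=True) or {}
    query = str(body.get("query", "")).strip()
    if not query:
        return jsonify({"error": "query is required"}), 400
    return jsonify({"results": search_articles(query)})


def create_app() -> Flask:
    app = Flask(__name__)
    # Every route on the blueprint is served under the /api prefix.
    app.register_blueprint(api, url_prefix="/api")
    return app
```

Because the handler holds no logic of its own, the utility function can be tested directly and the route only needs a smoke test through Flask's test client.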
Semantic search is what makes the feed feel different from a regular news reader. Instead of matching exact words, it compares the meaning of the query to the meaning of every article in the index.
The difference from keyword search shows up immediately. A query like "open source models that run on a laptop" can surface an article titled "Mistral 7B on consumer hardware." Keyword search would miss that, because the two share almost no words. Semantic search does not, because the query and the title land close together in embedding space.
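The ranking idea can be shown with a toy example. This is a sketch under assumptions, not the project's search code: the three-dimensional vectors are made up to stand in for real 384-dim embeddings, a dictionary stands in for the Qdrant collection, and cosine similarity is assumed as the metric.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for the same direction, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec: list[float], index: dict, top_k: int = 2) -> list[tuple]:
    """Score every article against the query and keep the top_k best matches."""
    scored = [(cosine(query_vec, vec), title) for title, vec in index.items()]
    return sorted(scored, reverse=True)[:top_k]


# Made-up vectors; a real index would hold model-generated embeddings.
index = {
    "Mistral 7B on consumer hardware": [0.9, 0.1, 0.0],
    "Seed round roundup": [0.0, 0.2, 0.9],
    "CSS container queries explained": [0.1, 0.9, 0.1],
}

# Pretend embedding of "open source models that run on a laptop".
query = [0.8, 0.2, 0.1]
results = search(query, index)
```

The top result is the Mistral article even though the query and the title barely overlap in wording, because their vectors point in nearly the same direction. A vector database does the same comparison, but with an approximate index instead of a full scan.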
The frontend is a React + Vite app written in TypeScript. It defines a small Article interface with the same fields the backend produces: Title, Details, URL, Source, Category.
The UI fetches /categorized_data.json once and renders the list. A few specifics worth calling out:
Source-specific Python scrapers pull articles from Medium, YC-related feeds, and Crunchbase, then normalize them into one shape.
Azure OpenAI assigns each article a topic like AI/ML, startups, or web dev, so the feed is already grouped by topic before I open it.
Articles are embedded with all-MiniLM-L6-v2 and upserted into a `tech_news` collection for similarity search.
Queries get embedded and compared against article vectors, so a search for an idea finds articles that share the meaning, not just the keyword.
Flask-APScheduler triggers scraping from inside the API, so there is no separate worker process to babysit.
React + Vite frontend with reusable Tile components, a Modal for details, dark mode, and pagination at 6 articles per page.
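The pagination arithmetic is small enough to sketch. The frontend itself is TypeScript; the version below is in Python, the backend's language, and the function names and the clamping behaviour are assumptions rather than the component's actual logic. Only the page size of 6 comes from the feature list above.

```python
import math

PAGE_SIZE = 6  # articles per page, as described above


def page_count(total: int, size: int = PAGE_SIZE) -> int:
    """Number of pages needed; an empty feed still shows one (empty) page."""
    return max(1, math.ceil(total / size))


def page_slice(items: list, page: int, size: int = PAGE_SIZE) -> list:
    """Return the items for a 1-based page, clamping out-of-range page numbers."""
    page = min(max(page, 1), page_count(len(items), size))
    start = (page - 1) * size
    return items[start:start + size]
```

With 14 articles this gives three pages: two full pages of six and a last page of two. Clamping the page number means a stale page index after a refresh shows the last page instead of an empty grid.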
A few choices I want to defend, because they are the ones a reviewer would push on first:
Flask over FastAPI
I wanted to ship a small REST API quickly. Blueprints made the layout obvious and I did not need anything that Flask could not give me.
JSON files instead of a database
This is a prototype. Plain JSON made the pipeline visible during debugging and meant I did not have to manage a schema while the data shape was still moving.
Qdrant for vectors
I wanted a purpose-built vector DB with a real API and an admin UI, not a library glued into the same Python process.
all-MiniLM-L6-v2 locally
Cheap, fast on CPU, and good enough for short article text. Using a hosted embedding API for a personal project felt like overkill.
APScheduler inside Flask
One process, one place to look at the schedule. For something only I use, a separate worker setup was not worth the operational cost.
React + Vite
Fast dev loop, no SSR needed, and Vite's HMR made iterating on the UI feel almost instant.
Challenges
What I Learned
Things I would do if I picked this up again, roughly in order of how much I would learn from each: