Projects

Things I've Built

Most of these started with a question I couldn't stop thinking about. Some turned into production systems. Others taught me why the problem was hard.

---

LLM Sycophancy Evaluator

Research Project · Aug 2024 – May 2026

The Problem: Large language models agree with users even when the user is wrong. But most benchmarks treat this as a static property. I wanted to measure it behaviorally—across specific pressure tactics—and understand whether instruct-tuning actually fixes the problem or just hides it.

What I Built: A paired-prompt evaluation framework targeting Llama-3.2-3B. Thirty adversarial prompt pairs, each contrasting a neutral question against variants that inject agreement bias, flattery, or assertive pressure. Every inference is logged to structured JSON with the prompt template ID, model checkpoint, and raw generation metadata.

The Details: I wrote a weighted keyword-based scorer that classifies each response as sycophantic, ambiguous, or honest, with tunable thresholds per behavior type. Built configurable Jinja2 prompt templates so the framework can swap model checkpoints or behavior definitions without code changes. Wrapped the whole thing in a Streamlit dashboard for interactive result exploration, and wrote thirteen unit tests covering scorer edge cases, template rendering, and JSON schema validation.

What I Found: The base (non-instruct) model produces severe repetition loops that inflate ambiguous scores—meaning instruct-tuning isn't just about capability, it's about evaluability. Measured 45% agreement bias, 40% pressure capitulation, and 80% flattery resistance. Context-dependent capitulation showed up clearly: the model abandoned correct positions on flat Earth and "largest organ" facts under assertive pushback, but held firm on Pluto classification. Sycophancy is not uniform across domains.

Tech: Python · PyTorch · HuggingFace Transformers · Streamlit · Jinja2 · pytest

More: Full harness, methodology, and results on GitHub and lkslokesh.com

---

Chain-of-Thought Faithfulness Evaluation

Research Project · Active

The Problem: When an LLM shows its work, is that reasoning actually driving the answer, or is it post-hoc rationalization? If we can't tell the difference, we can't trust systems that use CoT for high-stakes decisions.

What I'm Building: An extension of Turpin et al. (2023) and Lanham et al. (2023) to Llama-3-8B on BIG-Bench-Hard. The methodology is a biasing-feature perturbation: inject a subtle hint toward a wrong answer into the prompt, then measure (a) whether accuracy drops and (b) whether the CoT verbalizes the bias or silently conforms to it.

The Details: I'm extending the evaluation harness from the sycophancy project with new prompt templates designed for CoT extraction. Implemented paired CoT logging that captures both biased and unbiased reasoning traces for direct comparison. Built bias-acknowledgement tagging using regex and heuristic classifiers to detect whether the model explicitly mentions the injected hint—or ignores it while still being swayed.

Status: Systematic evaluation running across BIG-Bench-Hard subsets. Results and code to be released.

Tech: Python · PyTorch · HuggingFace Transformers · BIG-Bench-Hard

---

AI-Powered DDoS Detection System

ML Engineering Project

The Problem: DDoS detection is a sequence problem, not a snapshot problem. A single packet tells you almost nothing. A sequence of packets—their timing, inter-arrival deltas, rate ramp—tells you everything. Most systems either ignore this (static thresholds), overfit it, or drown in 300+ features of noise.

What I Built: A BiLSTM + 1D Residual CNN ensemble for three-class intrusion classification (Attack / Benign / Suspicious) on 700K+ flows from BCCC-Cloud-DDoS-2024. The BiLSTM captures temporal dependencies across sliding windows; the 1D Residual CNN extracts local traffic patterns. A fusion layer combines both representations before classification.

The Details: Preprocessed 700K+ flows and reduced 318 raw features to 86 via correlation analysis and information gain. Grouped traffic into sliding temporal windows to model sequence behavior. Added confidence-based escalation: predictions above 0.85 confidence auto-classify, while ambiguous traffic routes for human review instead of forcing a wrong label.

Results:

  1. 11% accuracy. The escalation layer handles 85% of traffic autonomously at high confidence and explicitly reduces miscalls on the ambiguous "Suspicious" class by routing low-confidence predictions for analyst review.

Tech: Python · PyTorch · BiLSTM · 1D CNN · Scikit-learn · Pandas

---

Temporal CNN-Based Intrusion Detection System

Capstone Project · Active

The Problem: Security teams get alerts they can't act on. ALERT: flow_id=47293 label=anomaly confidence=0.94 doesn't tell a SOC analyst what happened or what to do. Detection without explainability just creates alert fatigue.

What I'm Building: A Temporal Convolutional Network trained on BCCC-Cloud-DDoS-2024 (540,000+ real flows) with automated feature selection and natural language alert explanations via the Gemini API.

The Details:

  1. Most are correlated noise. I built an automated feature selection pipeline using mutual information and recursive feature elimination that identified 32 critical indicators—90% dimensionality reduction. Training efficiency improved 85%, and detection accuracy held. TCNs handle sequential patterns without LSTM's vanishing gradients: dilated causal convolutions with exponentially growing receptive fields. Integrated the Gemini API to parse detection outputs and generate contextual explanations like: "Coordinated SYN flood — 47 source IPs, packet rate 400× baseline, sustained 3-minute window targeting port 443. Consistent with volumetric DDoS. Recommend upstream rate limiting."

Status: Tuning false positive rate. Detection without alert fatigue is the actual benchmark.

Tech: Python · TensorFlow · Temporal CNN · Gemini API · Pandas · scikit-learn

---

QUIC Router Simulation

Side Project · Ongoing

The Question: When 10,000 flows compete for 1Gbps, which packet gets sent first—and what does that choice do to P99 tail latency?

What I'm Building: A discrete-event packet-level simulator in Python testing queue scheduling algorithms under real load conditions:

The Details: Implemented a custom event loop modeling QUIC short-header packets with connection IDs and variable-length payloads. Traffic generators produce three classes: bulk transfer (greedy TCP-like), video streaming (CBR with burst tolerance), and WebRTC (latency-sensitive, small packets). Each class gets configurable weights in WFQ and nested hierarchies in HPFQ. I'm measuring P99 tail latency specifically—not mean, not median. P99 is where video streaming stutters, where WebRTC calls drop frames, where user experience actually degrades.

Why QUIC: QUIC is the transport layer behind HTTP/3. Every modern browser uses it. Understanding queue scheduling behavior at this layer means understanding where real internet bottlenecks form under concurrent mixed-priority traffic.

Tech: Python · discrete-event simulation · QUIC protocol modeling · algorithmic queue management

---

DSA Coaching Platform

Full-Stack Project

The Problem: Most DSA prep resources are either passive (YouTube videos) or expensive (paid tutors). Students who learn by doing and explaining need something in between—a space where they can work through problems, see structured breakdowns, and build intuition rather than memorize patterns.

What I Built: A full-stack web platform with 150+ curated problems organized by concept graph, not just difficulty. I built the backend API, content system, and frontend from scratch.

The Details: Backend in FastAPI with SQLAlchemy ORM over PostgreSQL. JWT-based authentication with bcrypt-hashed credentials. Problem content stored as structured Markdown with metadata tags for prerequisites, patterns, and complexity classes. Frontend in React using Context API for state management. Each problem includes a structured approach breakdown, edge case analysis, complexity proof, and common misconception warnings. Docker Compose setup for local deployment: app container, Postgres container, and Nginx reverse proxy.

What It Taught Me: Content organization is a design problem. How you structure information determines how people learn from it. The graph thinking applies: prerequisites are edges, concepts are nodes, and good pedagogy is shortest-path from confusion to understanding. Used by 40+ students in the initial pilot.

Tech: Python · FastAPI · React · PostgreSQL · Docker · Nginx

---

Self-Hosted AI Inference Stack

Infrastructure Project · Ongoing

Why Self-Host: Running inference through APIs gives you outputs. Running it on your own hardware gives you understanding. Quantization tradeoffs, memory bandwidth ceilings, latency under concurrent requests, thermal throttling at sustained load—these are the constraints that matter when you're building production AI systems, and you don't learn them from HTTP calls.

What I Run: Local LLM inference stack on personal hardware. RTX 3080 (10GB VRAM), 32GB DDR4, Ryzen 7 5800X. Running llama.cpp server with CUDA acceleration. Models tested: Llama-2-7B (Q4KM, Q5KM), Mistral-7B-Instruct (Q4KM), Phi-2 (Q4_0).

The Details: Built an automated benchmark suite measuring time-to-first-token, tokens-per-second at batch sizes 1 through 8, and latency degradation curves under concurrent requests. Used a custom Python load generator (asyncio + aiohttp) to simulate client traffic. Logged thermal throttling and VRAM utilization via nvidia-smi polling. Testing both 4-bit and 8-bit quantization across reasoning and summarization tasks.

What I've Learned:

Tech: Linux · CUDA · llama.cpp · Python · asyncio · local networking

---

Network Intrusion Detection Using Deep Learning

Research Project · Anna University

The Challenge: Detect 9 different attack types—DoS, exploits, reconnaissance, brute force, backdoors, and more—from raw network flow records. Dataset: UNSW-NB15. 2.5 million flows, 49 raw features, real attack captures.

What I Built: A hybrid LSTM-CNN architecture. LSTM to capture sequential flow behavior across time windows; CNN to detect local feature patterns within individual flows. Together they handle both temporal dependencies and spatial feature interactions.

The Details: Preprocessed the dataset with SMOTE to handle class imbalance on minority attack categories. Analyzed the impact of input sequence length by varying temporal window sizes from 5 to 50 flows. Implemented a two-stage metaheuristic pipeline for feature selection: Sine Cosine Algorithm (SCA) for global search across 42 features, then Particle Swarm Optimization (PSO) for local refinement of candidate subsets. Population size 30, 100 iterations per stage.

Results:

  1. False positive rate dropped 25%. Real-time detection latency improved. The 18 features that survived—flow duration, packet length variance, inter-arrival time statistics, TCP flag counts—are the ones that actually separate attack behavior from normal flow patterns. Everything else was noise that made the model worse.

What I Learned: Feature selection is the work. Not a preprocessing step. The swarm intelligence approach solved a network security problem better than manual feature engineering.

Tech: Python · TensorFlow · LSTM · CNN · metaheuristic optimization (SCA, PSO) · UNSW-NB15

---

Secure Digital Library Platform

Production System · Anna University

The Context:

  1. Different access levels for students, faculty, and administrators. Had to handle traffic spikes during exam season without falling over.

What I Built: A containerized full-stack platform with OAuth 2.0 authentication and role-based access control. PostgreSQL backend. Docker deployment for horizontal scaling.

The Details: Node.js/Express REST API with Passport.js handling OAuth 2.0 flows, JWT session tokens, and refresh token rotation. RBAC middleware enforcing five permission levels: student (read), faculty (read + annotate), librarian (content management), admin (user management), super-admin (system config). Database queries optimized with composite indexes on userroles and resourceaccess tables. Docker multi-container deployment with health checks and automatic restart policies. Nginx load balancing across three API container replicas.

The Results:

  1. 5% uptime over 12 months. Sub-200ms P95 API response under peak exam-season load (800+ concurrent users). Students could access resources when they needed them—exam night included. Zero privilege escalation incidents during security audit.

What Mattered: Session management correctness, RBAC implementation without privilege escalation gaps, and database queries that didn't degrade under concurrent reads. The boring infrastructure correctness that makes or breaks production systems.

Tech: Node.js · PostgreSQL · React · Docker · Nginx · OAuth 2.0 · RBAC

---

Face Detection Mobile App

Undergrad Project

The Architecture: Android app in Java → FastAPI backend on Heroku → OpenCV Haar Cascade detection → bounding box coordinates → real-time camera overlay.

The Details: Android client uses Camera2 API for 30fps 640×480 preview frames. Frames are converted from YUV420888 to JPEG via RenderScript for consistent encoding, then sent as multipart/form-data POST to the FastAPI backend. Backend runs on Heroku with gunicorn (3 workers, 120s timeout), loads OpenCV's haarcascadefrontalfacedefault.xml, and runs detection with scaleFactor=1.1, minNeighbors=5, minSize=(30,30). Returns JSON: {"faces": [{"x": 120, "y": 80, "w": 200, "h": 200, "confidence": 0.92}]}. The Android overlay canvas draws rectangles in real-time using SurfaceView.

Why I Built It: I wanted to understand the full stack: mobile client, API design, computer vision algorithm, cloud deployment, how data actually flows from camera pixel to screen annotation. The full graph, end to end.

Results: End-to-end latency ~400ms on WiFi. Handled 5 concurrent users without blocking via FastAPI's async endpoints. The real bottlenecks weren't the detection algorithm—they were camera preview orientation, JPEG compression quality, and network payload size.

Tech: Android (Java) · Camera2 API · FastAPI · OpenCV · Heroku · REST

Source: GitLab

---

NFT Tracking Dashboard

24-Hour Hackathon · NIT Trichy · First Place

The Situation: NIT Trichy blockchain competition. 24 hours. Tool I'd never touched: Google Apps Script.

What I Built: A real-time dashboard tracking Ethereum NFT transactions. Automated data pulls, transaction pattern visualization, and price alerts—running entirely inside Google Workspace.

The Details: Google Apps Script with time-driven triggers firing every 5 minutes. UrlFetchApp calls to the Etherscan API for wallet transaction history and to the OpenSea API for floor price data. Parsed JSON responses into structured Google Sheets tabs: Transactions (tx hash, timestamp, ETH value, gas fee), Wallets (tracked addresses, holdings, P&L), and Alerts (price threshold breaches, whale movements). Built Sheets-native charts for transaction volume trends and wallet activity heatmaps. Added a custom menu in the Sheets UI for manual refresh and wallet addition.

What It Taught Me: You can absorb a new tool fast when you're shipping something real. The constraint of "no backend except Google" forced efficient design: Apps Script's 6-minute execution limit meant I had to implement paginated API fetching and batched writes. Sometimes the right tool is whatever gets it working in time, not the theoretically perfect solution.

Tech: Google Apps Script · Etherscan API · OpenSea API · Google Sheets

---

3D Campus Digital Twin

Undergrad Exploration

What I Built: A complete 3D model of my entire undergrad campus in Blender. ~200 acres, 15+ buildings modeled to scale.

The Details: Used Blender 2.93 with satellite imagery as reference planes for accurate vertex placement. Key structures: main admin block (4 stories, 2,400 polygons), central library (cylindrical reading room, 1,800 polygons), three hostel blocks, open-air auditorium, and access roads. Applied texture baking with 2K PBR materials for concrete, glass, and foliage. Optimized for real-time walkthrough via the Eevee render engine with baked lighting. Exported to glTF 2.0 for web viewer integration.

Results:

  1. Learned that spatial abstraction is a systems problem: LOD decisions, occlusion culling, and texture atlasing directly impact whether a complex scene remains navigable. The detour taught me how to represent graph-like spatial relationships in queryable 3D form.

Tech: Blender · glTF · Eevee · photogrammetry reference

---

What Connects These

Feature selection appears in three separate projects. Graph thinking shows up in every one. The gap between benchmark accuracy and production usefulness is the recurring constraint.

Whether it's packets through router queues, network flows hiding attacks, or users hitting a database during exams—the structure is the same. Understand the graph. Find where it breaks. Build the thing that holds.

---