AI Researcher & ML Scientist
Himanshu Kumar
Final-year CS student, IIIT Nagpur
I work at the intersection of multimodal learning, multilingual NLP, and mechanistic interpretability — building and understanding AI systems that operate across languages, modalities, and scales. Published at AAAI 2026 with research on representation geometry and efficient vision-language models.
Publications
Research
LM4UC Workshop · AAAI 2026
When Gujarati Meets English: Toward Robust Translation of Code-Mixed Low Resourced Indian Language
Created the first large-scale Gujlish–English parallel corpus addressing translation for millions of Gujarati speakers who naturally code-mix Gujarati with English. Fine-tuned NLLB-200 for Romanized Gujarati and intra-sentential code-mixing.
30K sentence pairs via BPCC + GPT-4o generation with human validation
1.5–2× BLEU and ChrF++ improvements over Google Translate
New Gujlish evaluation benchmarks adapted from XNLI and IN22
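ChrF++, used in the results above, belongs to the family of character n-gram overlap metrics. As a purely illustrative sketch of that idea, here is a single character-bigram F-score; the real metric averages several character and word n-gram orders:

```python
from collections import Counter

def char_bigram_f(hyp: str, ref: str, beta: float = 2.0) -> float:
    """Toy character-bigram F-beta score, the core idea behind ChrF-style metrics."""
    h = Counter(zip(hyp, hyp[1:]))  # character bigrams of the hypothesis
    r = Counter(zip(ref, ref[1:]))  # character bigrams of the reference
    overlap = sum((h & r).values())  # clipped bigram matches
    if not overlap:
        return 0.0
    prec = overlap / sum(h.values())
    rec = overlap / sum(r.values())
    b2 = beta * beta
    # F-beta with recall weighted beta times as much as precision.
    return (1 + b2) * prec * rec / (b2 * prec + rec)

# Identical strings score 1.0; strings sharing no bigrams score 0.0.
same = char_bigram_f("kem cho", "kem cho")
diff = char_bigram_f("kem cho", "xyzzyq")
```

Character-level overlap is why ChrF-style metrics are more forgiving of Romanization and morphology variants than word-level BLEU, which matters for Romanized Gujarati.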
arXiv Preprint
NanoVLM: How Small Can Vision Language Models Be and Still Generate Coherent Text?
Systematically studied the lower bound of VLM scale for coherent image captioning. Designed a family of parameter-efficient vision-language models achieving 10× parameter reduction vs. standard VLMs, revealing that caption length matters more than parameter count for alignment.
NanoVLM (mini/base/large) with up to 10× parameter reduction
Curated ShortDesc (20–25 words) and LongDesc (60–70 words) minimal alignment datasets

New evaluation axes: creativity, consistency, semantic coherence
Under Review · Vizuara AI Labs
The Geometry of Entanglement: Bridging Representation Probing and Mechanistic Interpretability
Multi-scale investigation of how deep vision models implicitly encode attributes beyond their training objective, revealing a fundamental geometric asymmetry between task-relevant and implicitly encoded features.
Task features: ~750 localized neurons with strong correlations (|r|>0.6)
Implicit attributes: ~1,200 neurons with distributed weak signals (|r|<0.2), 94% linearly separable
95% of gender information concentrates in 1/16 of ResNet-50's feature dimensions (causal entanglement)
Work Experience
Research & Industry
AI Research Intern — Mechanistic Interpretability
- Led mechanistic interpretability study on representation entanglement in vision models using correlation analysis and linear probing across layers and neurons.
- Distinguished task-relevant features from implicitly encoded attributes; performed targeted ablations to quantify bias–utility trade-offs.
- Built reproducible representation analysis pipelines that became the basis of a publication.
AI Research Intern — Generative AI & RLHF
- Developed soundscape music generation system using MusicGen trained on curated ambient datasets scraped from YouTube.
- Designed and deployed a web-based human feedback collection platform for evaluating 30-second generated audio clips.
- Applied RLHF to align generative outputs with human aesthetic and perceptual preferences.
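The preference-alignment step can be illustrated with a toy Bradley-Terry reward model, the usual first stage of an RLHF pipeline. Everything here (clip feature vectors, the hidden "taste" direction, the linear reward) is a hypothetical stand-in for the real audio setup:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy preference data: pairs of clip feature vectors; the "winner" of each
# pair is picked by a hidden taste vector we pretend annotators share.
dim = 6
taste = rng.normal(size=dim)
a = rng.normal(size=(200, dim))
b = rng.normal(size=(200, dim))
better = (a @ taste) > (b @ taste)
pref = np.where(better[:, None], a, b)  # preferred clip of each pair
rej = np.where(better[:, None], b, a)   # rejected clip of each pair

def bt_loss(w: np.ndarray) -> float:
    """Bradley-Terry loss: mean of -log sigmoid(reward(pref) - reward(rej))."""
    margin = (pref - rej) @ w
    return float(np.log1p(np.exp(-margin)).mean())

# Fit a linear reward model by gradient descent on the preference loss.
w = np.zeros(dim)
loss_start = bt_loss(w)  # log(2) at initialization (50/50 guesses)
for _ in range(300):
    margin = (pref - rej) @ w
    grad = -((1 / (1 + np.exp(margin)))[:, None] * (pref - rej)).mean(axis=0)
    w -= 0.1 * grad
loss_end = bt_loss(w)
```

The fitted reward model is what a policy (here, the music generator) would then be optimized against.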
AI Research Intern — Multimodal Learning
- Built multimodal image captioning VLM using a pretrained ViT encoder and GPT-2/BERT decoder, trained on Flickr30k.
- Applied Bottom-Up Top-Down attention for enhanced feature extraction, achieving measurable improvements in caption quality.
- Research directly informed the NanoVLM arXiv paper.
ML Engineering Intern
- Architected a MultiPDF RAG system with semantic search across 100+ documents; 85% retrieval accuracy with Supabase-backed vector storage.
- Deployed production pipeline via FastAPI serving 50+ daily queries at sub-2s latency.
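The semantic-search core of such a RAG system reduces to nearest-neighbor lookup over embeddings. A minimal sketch with toy vectors standing in for real embeddings and for the Supabase vector store (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for document chunks and their embeddings, which a real
# pipeline would hold in a vector store rather than in memory.
chunks = ["refund policy", "shipping times", "warranty terms", "contact info"]
chunk_vecs = rng.normal(size=(len(chunks), 8))
chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True)

def retrieve(query_vec: np.ndarray, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = chunk_vecs @ q  # cosine similarity (rows are unit-norm)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

# A query embedded near the "refund policy" chunk should retrieve it first.
query = chunk_vecs[0] + 0.1 * rng.normal(size=8)
hits = retrieve(query)
```

In the deployed system the retrieved chunks would then be packed into the LLM prompt; the retrieval step itself is this simple.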
Selected Work
Projects
AI-Powered Appointment Scheduling — Multi-Agent System
Multi-agent architecture with specialized agents for intent parsing, availability reasoning, and booking confirmation. Real-time slot validation, conflict resolution, and async email/SMS notifications.
GitHub
Sankshipt: Multilingual News Summarization
Transformer-based summarization pipeline for 10 Indian languages with language detection, normalization, and entity preservation for cross-lingual topic coverage.
GitHub
Multi-Label Sentiment Analysis
9-label multi-label sentiment classifier using DistilBERT for overlapping emotional categories in user-generated text. Mitigated severe class imbalance via weighted loss — 88% accuracy.
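The imbalance fix mentioned above is typically a per-label weighted binary cross-entropy. A minimal numpy sketch with made-up label counts rather than the actual dataset (in the DistilBERT stack this would be the `pos_weight`-style weighting of the sigmoid BCE loss):

```python
import numpy as np

# Toy multi-label matrix: 6 samples x 3 labels, where label 2 is rare.
labels = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
], dtype=float)

# Positive-class weights by inverse frequency: rare labels count more.
pos_freq = labels.mean(axis=0)
pos_weight = (1 - pos_freq) / pos_freq

def weighted_bce(logits: np.ndarray, y: np.ndarray) -> float:
    """Sigmoid BCE where positive terms are scaled by per-label pos_weight."""
    p = 1 / (1 + np.exp(-logits))
    loss = -(pos_weight * y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(loss.mean())

logits = np.zeros_like(labels)  # uniform 0.5 predictions
loss_val = weighted_bce(logits, labels)
```

Up-weighting positives of rare labels keeps the classifier from trivially predicting "absent" for every minority emotion category.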
GitHub
NanoVLM — Tiny Vision-Language Model
ViT encoder + GPT-2 decoder VLM family achieving coherent image captioning at 10× smaller scale. Includes curated alignment datasets and new evaluation criteria for caption quality.
GitHub
Technical
Skills
AI / ML Domains
Frameworks & Tools
Languages
Core Coursework
Recognition
Awards & Certifications
Jagriti, IIT BHU
2nd Runner-up
Built a multi-label sentiment analysis system · Competed against 700+ teams
Codefest, IIT BHU
2nd Runner-up
Competed against 1,600+ teams
Mar 2025
CUDA C/C++ Fundamentals
NVIDIA Deep Learning Institute
Feb 2025
Deep Learning Fundamentals
NVIDIA Deep Learning Institute
Nov 2024
AI for Anomaly Detection
NVIDIA Deep Learning Institute
Oct 2024
Transformer NLP Applications
NVIDIA Deep Learning Institute
Get in touch
Contact
Seeking ML Engineer, AI Researcher, or Data Scientist roles starting May 2025. Open to research collaborations and interesting problems at the frontier of AI.
Current Research Interests
Mechanistic Interpretability of vision & language models
Representation geometry and feature disentanglement
Low-resource and code-mixed multilingual NLP
Efficient vision-language model architectures
RLHF and human preference alignment