When a VC partner searches for "autonomous vehicles", they might find grants about LiDAR sensor arrays, simultaneous localisation and mapping, or V2X communication protocols. The problem is obvious: the grant database doesn't speak in VC categories. It speaks in research abstracts.
SIC codes won't help. Keyword search won't help. The gap between how scientists describe their work and how investors categorise markets is the core challenge of deep tech deal sourcing.
We built a dual-taxonomy engine to bridge that gap.
Two lenses on the same data
Every grant in Loturo is classified through two parallel systems: one that maps to venture capital investment sectors, and one that maps to scientific disciplines. The same grant gets tagged in both.
The VC taxonomy
- Purpose: commercial viability signalling
- Structure: sectors → sub-sectors → niches
- Example: Healthcare → Medical Devices → Diagnostic Imaging
- Audience: deal sourcing, portfolio mapping

The science taxonomy
- Purpose: technical domain identification
- Structure: disciplines from PACS, arXiv, MeSH, OpenAlex
- Example: Physics → Condensed Matter → Superconductivity
- Audience: technical deep-dives, cross-disciplinary search
The power isn't in either taxonomy alone. It's in the intersection. A grant tagged as "Drug Delivery" in the VC taxonomy and "Molecular Biology" in the science taxonomy tells a richer story than either label could. A grant at the crossroads of "Aerospace" and "Machine Learning" is a different signal from pure aerospace.
Why keywords fail
Traditional classification uses keywords or manual categories. Both break at scale.
Keyword search is brittle. A grant about "therapeutic protein engineering for targeted oncology delivery systems" won't match a search for "cancer drugs" even though that's exactly what it is. The vocabulary is different. Researchers write for peer review, not for Bloomberg Terminal queries.
Manual categorisation is expensive and subjective. The NIH uses study sections, SBIR uses topic codes, the EU uses Horizon pillars — but none of these map to how an investor thinks about their portfolio. And none of them are consistent across programmes.
We needed an approach that could understand meaning, not just match words.
Semantic matching with embeddings
Every grant abstract in Loturo is converted into a 1,536-dimensional vector using OpenAI's embedding model. Each category in each taxonomy is embedded the same way. Classification then becomes a geometric problem: which taxonomy categories are closest to this grant in vector space?
This is fundamentally different from keyword matching:
- "Therapeutic protein engineering" lands near "Drug Delivery" and "Biotechnology" in the VC taxonomy — even though those exact words never appear in the abstract.
- "Autonomous navigation for GPS-denied environments" gets tagged under both "Defence & Security" and "Robotics" — because the meaning overlaps with both sectors.
- "Novel photovoltaic materials using perovskite nanostructures" maps to "Clean Energy" (VC) and "Condensed Matter Physics" + "Materials Science" (science) simultaneously.
For each grant, we compute similarity against every taxonomy category and keep the top 5 matches per source. A typical grant ends up with 15–25 tags spanning VC sectors and scientific disciplines.
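The scoring step can be sketched as straightforward vector geometry. This is an illustrative implementation, not Loturo's actual code; the function and variable names are invented for the example:

```python
import numpy as np

def top_matches(grant_vec, category_vecs, labels, k=5):
    """Rank taxonomy categories by cosine similarity to a grant embedding.

    grant_vec: (d,) embedding of the grant abstract
    category_vecs: (n, d) embeddings of categories from ONE taxonomy source
    labels: list of n category names
    """
    g = grant_vec / np.linalg.norm(grant_vec)
    c = category_vecs / np.linalg.norm(category_vecs, axis=1, keepdims=True)
    sims = c @ g                         # cosine similarity per category
    order = np.argsort(sims)[::-1][:k]   # top-k highest-scoring categories
    return [(labels[i], float(sims[i])) for i in order]
```

Running this once per taxonomy source and concatenating the results yields the 15–25 tags a typical grant ends up with.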
Quality control: not all matches are equal
Raw similarity scores need calibration. A VC taxonomy written in business language naturally aligns better with grant abstracts than a physics classification scheme that uses highly technical terminology. If we used the same threshold for both, we'd either get too many false positives in science or too few matches in VC.
Each taxonomy source has its own minimum similarity threshold, tuned to its domain characteristics:
- VC categories: lower threshold (0.30) — business language aligns well with grant abstracts
- PACS (physics): moderate (0.35) — technical but precise
- arXiv, OpenAlex: higher (0.40) — academic language, stricter matching
- MeSH (biomedical): highest (0.45) — extremely specialised vocabulary
The result: each source contributes meaningful tags without flooding grants with noise.
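In code, the per-source calibration amounts to a lookup table of floors applied after scoring. A minimal sketch using the thresholds quoted above (the structure and names are illustrative; the real tuning is more involved):

```python
# Per-source minimum similarity floors, as quoted in the text.
MIN_SIMILARITY = {
    "vc": 0.30,
    "pacs": 0.35,
    "arxiv": 0.40,
    "openalex": 0.40,
    "mesh": 0.45,
}

def filter_tags(scored_tags, source):
    """Keep only matches above the source-specific similarity floor.

    scored_tags: list of (category, similarity) pairs from one taxonomy source
    """
    floor = MIN_SIMILARITY[source]
    return [(cat, sim) for cat, sim in scored_tags if sim >= floor]
```

The same raw score of 0.40 survives for a VC category but is dropped for a MeSH term, which is exactly the asymmetry the calibration is meant to encode.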
What this reveals
When you layer VC sectors on top of scientific disciplines across 292,000 grants and 80,000 companies, patterns emerge that are invisible in flat databases.
Technology cluster maps
Our PCA scatter visualisation projects all taxonomy categories into 2D space. VC categories and science categories form clusters that show where investment themes and research domains converge. Gaps between clusters — areas where science exists but no VC category is close — are potential whitespace opportunities.
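The projection behind that scatter plot is plain PCA. A self-contained sketch via SVD (assumed, simplified version of the pipeline; names are illustrative):

```python
import numpy as np

def project_2d(embeddings):
    """Project high-dimensional category embeddings to 2D via PCA.

    embeddings: (n, d) matrix, one row per taxonomy category.
    Returns (n, 2) coordinates for the cluster scatter plot.
    """
    centered = embeddings - embeddings.mean(axis=0)
    # Right singular vectors of the centered matrix are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
```

Because VC and science categories are embedded in the same space, both sets can be projected together, which is what makes the convergence zones and the whitespace gaps visible.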
Cross-domain heatmaps
The taxonomy heatmap cross-tabulates VC sectors against science disciplines. Each cell shows how many grants sit at that intersection. High-density cells are mature fields. Low-density cells with rising grant counts are emerging ones.
For example: the intersection of "AI & Machine Learning" (VC) and "Biomedical Engineering" (science) shows accelerating grant density since 2020 — a signal that BioML is moving from academic curiosity to funded development.
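The cross-tabulation itself is a simple pairwise count over each grant's two tag sets. A minimal sketch (the dict keys are illustrative, not Loturo's schema):

```python
from collections import Counter

def intersection_counts(grants):
    """Cross-tabulate VC sectors against science disciplines.

    grants: iterable of dicts with 'vc_tags' and 'science_tags' lists.
    Returns a Counter keyed by (vc_sector, science_discipline) pairs,
    one heatmap cell per key.
    """
    cells = Counter()
    for grant in grants:
        for vc in grant["vc_tags"]:
            for sci in grant["science_tags"]:
                cells[(vc, sci)] += 1
    return cells
```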
Trend detection
By tracking grant counts per taxonomy category over time, we can see which fields are growing. Not based on headlines or Gartner hype cycles, but on where governments are actually directing R&D money. Government funding often leads private markets by 3–5 years.
Government R&D budgets are the world's largest leading indicator for deep tech venture. Taxonomy trend data makes that signal readable.
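At its core, the trend signal is a count of grants per category per award year, compared across years. A hedged sketch (field names are invented for the example):

```python
from collections import defaultdict

def yearly_counts(grants):
    """Count grants per taxonomy category per award year.

    grants: iterable of dicts with 'year' and 'tags' (category names).
    Returns {category: {year: count}}.
    """
    trend = defaultdict(lambda: defaultdict(int))
    for g in grants:
        for tag in g["tags"]:
            trend[tag][g["year"]] += 1
    return trend

def growth(trend, tag, start, end):
    """Ratio of grant counts between two years for one category."""
    base = trend[tag].get(start, 0)
    return trend[tag].get(end, 0) / base if base else float("inf")
```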
Hierarchical exploration
Each taxonomy is structured in levels. VC categories go from broad sectors (Level 1: "Healthcare") to sub-sectors (Level 2: "Medical Devices") to niches (Level 3: "Diagnostic Imaging"). Science taxonomies follow their native hierarchies.
Users can start broad — "how many grants fall under Clean Energy?" — and drill down to the specific technology: "perovskite photovoltaics". At each level, they see grant counts, total funding, and the actual grants ranked by relevance.
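The drill-down view is a rollup over each grant's taxonomy path, truncated to the level being browsed. A minimal sketch under assumed field names:

```python
from collections import defaultdict

def rollup(grants, level):
    """Aggregate grant counts and total funding at one taxonomy level.

    grants: iterable of dicts with 'path' (e.g. ["Healthcare",
            "Medical Devices", "Diagnostic Imaging"]) and 'amount'.
    level: 1 = sector, 2 = sub-sector, 3 = niche.
    """
    totals = defaultdict(lambda: {"count": 0, "funding": 0.0})
    for g in grants:
        key = tuple(g["path"][:level])   # truncate path to the browsed level
        totals[key]["count"] += 1
        totals[key]["funding"] += g["amount"]
    return totals
```

Calling `rollup(grants, 1)` answers "how many grants fall under Clean Energy?"; re-running at level 3 surfaces the perovskite-photovoltaics niche.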
This is how you go from sector thesis to specific company in three clicks.
What's next
The taxonomy engine is the foundation for everything that comes next. Alerts when new grants appear in your sectors. Portfolio overlap analysis. Custom taxonomy sources that match your fund's investment thesis. Competitive intelligence: which sectors are getting crowded with new SBIR awards?
But the core insight is simple: government R&D data is only useful if you can see it through your own lens. Keywords give you a keyhole. A dual-taxonomy engine gives you the full picture.
Explore the taxonomy
Browse VC sectors and scientific disciplines across 292K+ grants. See where capital meets science.
Open Loturo