Career Vision

I'm at a deliberate inflection point — transitioning from platform engineering into AI safety. Three paths are in view. Each has a different theory of change and a different ask of my background.

AI Safety Research

Transitioning into technical AI safety research with a focus on ML security, robustness, and the security of agentic AI systems. Likely pathway: DPhil at Oxford AI Security or an intensive fellowship programme. My statistics training, 10+ years of engineering, and hands-on agentic AI work give me a concrete foundation — and my DevSecOps background brings a security-first lens that is underrepresented in ML research.

DPhil / Oxford · ML Security · Robustness · Alignment

AI Safety Engineering

Specialising in AI security, ML infrastructure security, or building safety evaluation infrastructure — at labs like Anthropic, DeepMind, or dedicated safety organisations. This path offers faster time-to-impact and fills a genuine talent gap: there are far fewer safety-focused engineers than researchers in the ecosystem. My agentic AI technical leadership translates directly.

ML Infrastructure · Security Engineering · Safety Evals · AI Labs

AI Governance

Combining technical depth with policy and governance work — either as a researcher at GovAI-type organisations, through policy fellowships, or in grantmaking roles that require technical evaluation capacity. My linguistics background and international social enterprise experience are comparative advantages for cross-jurisdictional AI governance work.

Policy · GovAI · Standards · International

Current thinking leans toward Path 1 or 2 — staying deeply technical while pivoting to safety. The core open question: is the marginal safety contribution greater from an engineer who ships immediately, or from one who invests 2–4 years to become a researcher? If you have a perspective, I want to hear it.

Learning Path

Structured upskilling in AI engineering and safety — building from foundations to production systems.

Active

AI Engineering

6 weeks · 45+ hours

Structured to take you from foundations to production. Build real RAG pipelines, evaluation frameworks, and agentic systems.

1
LLM Fundamentals & RAG Foundations Build a working RAG pipeline that answers questions from real documentation
  • How LLMs work under the hood: pre-training, tokenization, attention, post-training
  • The RAG paradigm: when to use it, how to scope your project
  • Data ingestion: handling different document formats for your corpus
  • Your first end-to-end pipeline: index, retrieve, generate, test
  • Framework: RAG Project Scoping Framework

You build

Interactive Q&A system on MCP documentation

Qdrant · Haystack · FastEmbed · Gemini
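The index → retrieve → generate loop described above can be sketched in plain Python. This is a toy, stdlib-only illustration: a term-frequency Counter stands in for a real embedding model (FastEmbed in the course stack), the two corpus strings are invented, and generation is reduced to assembling the prompt an LLM would receive.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words term-frequency vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Index: embed every document in the corpus.
corpus = [
    "MCP servers expose tools over a JSON-RPC transport",
    "Haystack pipelines connect retrievers and generators",
]
index = [(doc, embed(doc)) for doc in corpus]

# 2. Retrieve: rank documents by similarity to the query.
def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda d: cosine(q, d[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# 3. Generate: in a real pipeline an LLM answers from the retrieved
#    context; here we only show the prompt that would be sent to it.
context = retrieve("What transport do MCP servers use?")
prompt = f"Answer from context:\n{context[0]}"
```

Swapping the toy pieces for Qdrant (index), FastEmbed (embed), and Gemini (generate) gives the shape of the module's actual build.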
2
Chunking & Embeddings Test 7 chunking strategies on your data and find the winner
  • Why chunking is the most important decision in your RAG pipeline
  • 7 strategies compared, from naive and sentence splitting to recursive, semantic, and hybrid content-aware chunking
  • Embeddings deep dive: how text becomes vectors, FastEmbed vs Voyage
  • LLM-as-Judge evaluation with side-by-side comparison dashboards
  • Framework: Chunking Decision Framework

You build

Ranked chunking strategy backed by your own evaluation evidence

7 Strategies · Voyage AI · Streamlit · LLM-as-Judge
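To make the naive-vs-sentence comparison concrete, here is a minimal stdlib sketch of two of the strategies. The 80-character budget and the regex sentence splitter are illustrative choices, not the course's implementation.

```python
import re

def naive_chunks(text: str, size: int = 80) -> list[str]:
    # Fixed-width character windows: simple, but cuts sentences mid-word.
    return [text[i:i + size] for i in range(0, len(text), size)]

def sentence_chunks(text: str, max_chars: int = 80) -> list[str]:
    # Greedily pack whole sentences into chunks of up to max_chars.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

text = ("Chunking decides what the retriever can see. "
        "Small chunks are precise but lose context. "
        "Large chunks keep context but dilute relevance.")
```

Running both on the same corpus and scoring the retrieval results, as the module does with an LLM judge, is what turns the choice of strategy into an evidence-backed decision.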
3
Advanced Retrieval Optimise retrieval accuracy from 70% to 90%+
  • Vector DB internals: how HNSW and approximate nearest-neighbour search work
  • Hybrid retrieval: combining dense and sparse search with Reciprocal Rank Fusion
  • Reranking architectures: when and why cross-encoders beat bi-encoders
  • Search space narrowing: metadata filtering and LLM-based routing
  • Framework: Retrieval Strategy Selection Framework

You build

Evidence-based retrieval strategy with 4 techniques evaluated head-to-head

Hybrid Search · BM25 · Voyage Reranker · LLM Routing
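Reciprocal Rank Fusion, mentioned above, is small enough to show in full. This is a generic sketch (k = 60 follows the original RRF paper's default), merging a hypothetical dense ranking with a sparse, BM25-style one:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank).
    # The constant k dampens the advantage of top positions, so a document
    # ranked well by both retrievers beats one ranked first by only one.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Dense and sparse retrievers often disagree; RRF merges their rankings
# without needing their raw scores to be comparable.
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_b", "doc_c", "doc_a"]
fused = rrf([dense, sparse])
```

Here doc_b wins the fused ranking because it places near the top of both lists, which is exactly the behaviour hybrid retrieval is after.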
4
Mastering Evaluation Build your own evaluation system with golden datasets
  • The evaluation challenge: why measuring RAG quality is harder than building it
  • Synthetic test generation, LLM-as-Judge, deterministic semantic metrics
  • Building golden datasets from scratch when no ground truth exists
  • Cross-validating 3 independent evaluation methods to find where your system breaks
  • Framework: RAG Evaluation Strategy Framework

You build

Golden dataset + multi-method evaluation framework

RAGAS · DeepEval · Custom Judges · Triangulation
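One of the deterministic semantic metrics this module points at can be as simple as SQuAD-style token F1: a cheap, reproducible signal to triangulate against LLM-as-Judge scores. A minimal sketch, with an invented golden pair for illustration:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    # Token-overlap F1: deterministic, fast, and fully reproducible,
    # unlike an LLM judge whose scores can drift between runs.
    pred = prediction.lower().split()
    ref = reference.lower().split()
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

golden = [  # tiny golden dataset: (question, reference answer)
    ("What is RRF?", "reciprocal rank fusion merges rankings"),
]
score = token_f1("rrf is reciprocal rank fusion", golden[0][1])
```

Cross-validating a metric like this against LLM-as-Judge and synthetic tests is the triangulation step: where the three disagree is usually where the system breaks.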
5
Production Engineering Deploy a production chatbot with caching, memory, and observability
  • Production architecture: the real tradeoffs between latency, cost, and accuracy
  • Semantic caching with Redis for sub-50ms response times on repeated queries
  • Conversation memory, query rewriting, and intent-based routing
  • Observability, user feedback loops, Docker deployment
  • Framework: Production RAG Architecture

You build

Deployed production chatbot serving real requests

FastAPI · Redis · Streamlit · Opik · Docker
6
Agentic AI & Security Build a self-correcting RAG agent with adaptive routing
  • The intelligence spectrum: from single API calls to fully autonomous agents
  • Corrective RAG: grading retrieval quality and self-correcting when it fails
  • Adaptive agents: confidence-based tool selection across multiple sources
  • RAG security essentials: injection detection, retrieval validation, output sanitization
  • Framework: Intelligence Spectrum Framework

You build

Self-correcting CRAG system + adaptive multi-tool agent

CRAG · Haystack Agents · Tavily · GitHub MCP
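The corrective-RAG loop (grade the retrieval, self-correct when it is weak) can be sketched as below. The grader is a toy term-overlap heuristic where a real CRAG system would use an LLM grader, and both sources are hypothetical stand-ins for the corpus retriever and Tavily web search:

```python
def grade(query: str, chunk: str) -> float:
    # Toy relevance grader: fraction of query terms present in the chunk.
    terms = set(query.lower().split())
    found = sum(1 for t in terms if t in chunk.lower())
    return found / len(terms) if terms else 0.0

def corrective_answer(query: str, retrieve, web_search,
                      threshold: float = 0.5) -> str:
    # Corrective RAG: grade the retrieved context; if it scores below
    # the threshold, self-correct by routing to a fallback source.
    chunk = retrieve(query)
    if grade(query, chunk) >= threshold:
        return f"answer from corpus: {chunk}"
    return f"answer from web: {web_search(query)}"

def corpus_retriever(query: str) -> str:
    return "MCP servers expose tools over JSON-RPC"

def web(query: str) -> str:
    return "latest docs found online"

good = corrective_answer("what do MCP servers expose", corpus_retriever, web)
bad = corrective_answer("pricing of voyage embeddings", corpus_retriever, web)
```

The adaptive-agent version of the module generalises the same pattern: confidence scores decide which of several tools handles the query, rather than a single binary fallback.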

Your Mission

The goal is to create legible output — to practice the craft, get feedback, find collaborators, test your fit, and improve understanding. This is the operating framework for everything that follows.

The Cheap Tests Ladder

Cheap tests reduce your uncertainty for the least effort, time, and money. Start very short (<1 hour each), progress to short (1–10 hours), then long (10–100 hours), then very long. Each rung gives a stronger signal that you're a good fit — without sunk-cost commitment.

Very short (<1h each)

  • Talk to people further along the path
  • Read abstracts, blog posts, newsletters
  • Watch YouTube videos on technical content
  • Run a GitHub repo, reproduce math proofs

Short (1–10h each)

  • Read research papers and agendas
  • Reproduce a toy version of a paper
  • Write a short post; estimate timelines
  • Attend an ML conference or workshop

Long (10–100h each)

  • Read a book; complete an online course
  • Replicate and extend a paper
  • Do an Apart Sprint hackathon
  • Form an inside view on timelines

Very long (100–1000h each)

  • Internships and residencies
  • Research fellowships (MATS, AISC)
  • Masters programme or DPhil
  • Independent research project

Next Steps Framework

Read / Listen / Watch

Follow your nose through the resources below. Prioritise things that build intuition before depth.

Do Stuff — Create Legible Output

Summarise what you read. Write opinions. Code and do maths. Add to GitHub. Post to EA Forum or LessWrong to get feedback.

Network — Learn, Don't Sell

Reach out to people doing the work you want to do. Learn about their path, get feedback on your understanding, build relationships before you need them.

Apply — Even for Feedback

Every application is a cheap test. A rejection with feedback is worth the effort. Use jobs.80000hours.org and the EA Opportunities board.

Roadmap

Concrete next steps toward AI safety — structured by timeframe and grounded in 80,000 Hours career advising.

Opportunity Roles

Specific roles identified through 80,000 Hours advising as strong matches for this career transition.