Alan Jiang

Production Engineer / Site Reliability Engineer

I build and debug production-adjacent infrastructure: Kubernetes/GKE observability, gateway reliability, operational automation, and small open-source fixes that make real systems safer to run.

Upstream OSS PRs merged in Google- and Google Cloud Platform-maintained repositories

GKE AI inference reliability lab with traces, incident replay, and SLO-style gates

Production gateway, data infrastructure, runbook, dashboard, and automation experience

Bellevue, WA | haihanj99@gmail.com | LinkedIn | GitHub | GitHub profile README

Outside work, I like quiet coastlines. At work, I like systems that stay calm under load.

How I Work

I am most interested in engineering work where correctness, operations, and automation meet.

Make production changes reversible

I prefer staged rollout, health checks, observation windows, and a clear rollback path over one-shot changes that are hard to reason about during incidents.
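As a rough illustration of that preference, here is a minimal Python sketch of a staged rollout with an observation window and an automatic rollback. It is not code from any production system: `staged_rollout`, `set_weight`, `is_healthy`, and the stage weights are hypothetical names and values chosen for the example, and a real observation window would wait between checks.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[int],
    set_weight: Callable[[int], None],
    is_healthy: Callable[[], bool],
    checks_per_stage: int = 3,
) -> bool:
    """Shift traffic through the given weights; roll back on the first failed check.

    Returns True if every stage passed, False if the rollout was rolled back.
    """
    for weight in stages:
        set_weight(weight)
        # Observation window: require several healthy checks in a row
        # before moving on. A real system would sleep between checks.
        for _ in range(checks_per_stage):
            if not is_healthy():
                set_weight(0)  # rollback: send all traffic back to the old path
                return False
    return True


# Usage with stand-in callbacks: record each weight, always report healthy.
applied_weights = []
completed = staged_rollout([5, 25, 50, 100], applied_weights.append, lambda: True)
```

The point of the shape is that the stop condition and the rollback action are written down before the first percent of traffic moves.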

Turn manual operations into systems

When a workflow repeats, I look for the state machine behind it: inputs, validation, side effects, ownership, observability, and safe retry behavior.
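A small Python sketch of what I mean by finding the state machine, under stated assumptions: the `Request` type, the four states, and the `limit` check are hypothetical and only stand in for whatever inputs and validation a real workflow has.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    SUBMITTED = "submitted"
    VALIDATED = "validated"
    APPLIED = "applied"
    REJECTED = "rejected"


# Allowed transitions; anything else is a bug or an unsafe retry.
TRANSITIONS = {
    State.SUBMITTED: {State.VALIDATED, State.REJECTED},
    State.VALIDATED: {State.APPLIED},
    State.APPLIED: set(),
    State.REJECTED: set(),
}


@dataclass
class Request:
    owner: str
    amount: int
    state: State = State.SUBMITTED
    audit_log: list = field(default_factory=list)

    def move_to(self, new_state: State) -> None:
        if new_state == self.state:
            return  # safe retry: repeating a completed step is a no-op
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        # Observability: every transition leaves an audit entry.
        self.audit_log.append((self.state.value, new_state.value))
        self.state = new_state


def process(request: Request, limit: int) -> Request:
    """Validate the input, then apply the side effect at most once."""
    if request.state is State.SUBMITTED:
        valid = bool(request.owner) and 0 < request.amount <= limit
        request.move_to(State.VALIDATED if valid else State.REJECTED)
    if request.state is State.VALIDATED:
        # The real side effect (for example an API call) would go here.
        request.move_to(State.APPLIED)
    return request
```

Calling `process` a second time on an already applied request changes nothing and adds no audit entries, which is the retry behavior I look for.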

Leave evidence for the next engineer

Dashboards, audit logs, runbooks, and small validation tools matter because they turn production work from tribal knowledge into something reviewable.

Fix sharp edges in developer tooling

My open-source work is usually small but practical: build integration fixes, reliability edge cases, clearer docs, and safer infrastructure defaults.

Curiosity

The problems that keep my attention are usually practical, messy, and close to production.

How do we know a change is safe?

I like building gates around real signals: health checks, diff checks, traffic weights, dashboards, and clear stop conditions.

What should be automated, and what should stay supervised?

The best automation removes repeated work while keeping high-risk decisions visible, reviewable, and reversible.

Field Notes

Gateway migration work taught me to treat rollout design as engineering, not ceremony.

SHEIN production gateway infrastructure

The interesting part of a migration is not only replacing machines or operating systems. It is deciding how traffic should move, what signal proves the new path is healthy, how long old nodes should remain useful, and what would make rollback obvious at 2 a.m.

Capacity automation is a reliability problem disguised as a workflow problem.

Meta monetization data infrastructure

A quota request can look like a ticket queue, but under the hood it needs validation, demand breakdown, state tracking, polling, alerts, ownership, and failure recovery. That is the type of backend infrastructure work I enjoy: making operational intent executable and observable.

I use open source as a way to practice engineering judgment in public.

Google-maintained infrastructure and developer-tooling repositories

The contributions I care about are usually not flashy. They make package metadata more correct, tests more robust when host capabilities differ, Terraform defaults safer, or docs clearer for the next person trying to integrate a tool.

Current Build

GKE AI Inference Reliability Lab is a runnable Kubernetes observability and reliability lab for the kind of infrastructure work I want to keep doing: practical, reviewable, and close to production failure modes.

Incident replay

The project replays baseline traffic, cache-miss latency, dependency timeout, and rollout regression scenarios, then publishes sample evidence from the replay.

Reliability gate

It evaluates the replay against SLO-style checks so the lab proves both healthy baseline behavior and expected incident classification.
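To show the idea of such a gate, here is a minimal Python sketch. It is an illustration, not the lab's actual implementation: the `ScenarioResult` fields, the two thresholds, the class labels, and every metric value are assumptions invented for the example. Only the scenario names come from the replay list above.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    name: str
    p95_latency_ms: float
    error_rate: float
    expected_class: str  # what the replayed scenario is meant to look like


def classify(result, latency_slo_ms=500.0, error_slo=0.01):
    """Label a replayed scenario using two hypothetical SLO thresholds."""
    if result.error_rate > error_slo:
        return "errors"
    if result.p95_latency_ms > latency_slo_ms:
        return "latency"
    return "healthy"


def gate(results):
    """Return the names of scenarios whose observed class differs from the expected one."""
    return [r.name for r in results if classify(r) != r.expected_class]


# Illustrative numbers only; they are not measurements from the lab.
results = [
    ScenarioResult("baseline", 120.0, 0.001, "healthy"),
    ScenarioResult("cache-miss", 900.0, 0.002, "latency"),
    ScenarioResult("dependency-timeout", 300.0, 0.08, "errors"),
    ScenarioResult("rollout-regression", 650.0, 0.004, "latency"),
]
failures = gate(results)  # an empty list means the gate passes
```

Checking the expected class for incident scenarios, and not only the healthy baseline, is what keeps a gate from passing simply because nothing was detected.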

GKE production shape

It includes Kubernetes manifests for collector RBAC, resource enrichment, PVC-backed queue storage, cross-namespace instrumentation, and a small sample workload.

Google Cloud OSS track

Related OpenTelemetry Operator sample PRs are open and pending review. I keep this wording precise until upstream maintainers merge the work.

Why it matters

AI services still need ordinary production discipline: traces, Kubernetes context, durable delivery, safe rollout evidence, and incident narratives a team can act on.

project repo | case study | sample incident report | reliability gate | incident replay | architecture | upstream PR tracker

Selected Proof

Production systems

Experience across Nginx/APISIX gateways, AKS/Kubernetes-style platforms, Kafka, ZooKeeper, Elasticsearch, warehouse quota systems, and on-call workflows.

Automation and reliability

Built workflow automation, service reliability controls, validation tooling, runbooks, dashboards, and scripts that reduced manual operational work.

Open-source examples

Selected merged contributions are linked below. They show practical infrastructure fixes across runtime, fuzzing, Kubernetes troubleshooting, cloud defaults, and build/test tooling.

gVisor cgroup | syzkaller dashboard | KHI slog | containerregistry export | benchmark pkg-config | benchmark perf counters | Terraform defaults | cloud config

Roles I fit best

SRE, Production Engineer, Infrastructure Engineer, Platform Engineer, Cloud/DevOps Engineer, and backend infrastructure roles close to production operations.

Tech

Python | C++ | Java | Go | PHP | SQL | Bash | Linux | Kubernetes/AKS | Tupperware (Kubernetes-equivalent) | Azure | AWS | Terraform | Ansible | Nginx/APISIX | Consul/etcd | Kafka | ZooKeeper | Elasticsearch | Hive/Scuba | XDB | MWFS | Thrift/RPC | SLI/SLO | Incident response | OpenTelemetry | GKE | Git/GitHub | CMake/pkg-config