Make production changes reversible
I prefer staged rollout, health checks, observation windows, and a clear rollback path over one-shot changes that are hard to reason about during incidents.
Production Engineer / Site Reliability Engineer
I build and debug production-adjacent infrastructure: Kubernetes/GKE observability, gateway reliability, operational automation, and small open-source fixes that make real systems safer to run.
I am most interested in engineering work where correctness, operations, and automation meet.
When a workflow repeats, I look for the state machine behind it: inputs, validation, side effects, ownership, observability, and safe retry behavior.
Dashboards, audit logs, runbooks, and small validation tools matter because they turn production work from tribal knowledge into something reviewable.
My open-source work is usually small but practical: build integration fixes, reliability edge cases, clearer docs, and safer infrastructure defaults.
The problems that keep my attention are usually practical, messy, and close to production.
I like building gates around real signals: health checks, diff checks, traffic weights, dashboards, and clear stop conditions.
The best automation removes repeated work while keeping high-risk decisions visible, reviewable, and reversible.
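A gate of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from any system described here: `staged_rollout`, `set_weight`, `is_healthy`, and `checks_per_stage` are hypothetical names, and a real observation window would wait between checks instead of polling back to back.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class StageResult:
    weight: int    # percent of traffic sent to the new path
    healthy: bool  # True if every check in the observation window passed


def staged_rollout(
    weights: Sequence[int],
    set_weight: Callable[[int], None],   # hypothetical hook that shifts traffic
    is_healthy: Callable[[], bool],      # hypothetical health signal
    checks_per_stage: int = 3,
) -> list[StageResult]:
    """Raise the traffic weight stage by stage; roll back on the first failed check."""
    history: list[StageResult] = []
    for weight in weights:
        set_weight(weight)
        # Observation window: all checks must pass before the next stage.
        # all() stops at the first failing check.
        healthy = all(is_healthy() for _ in range(checks_per_stage))
        history.append(StageResult(weight, healthy))
        if not healthy:
            set_weight(0)  # stop condition hit: send all traffic back to the old path
            break
    return history
```

The returned history doubles as a small audit trail: it records which weights were tried and which stage triggered the rollback.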
SHEIN production gateway infrastructure
The interesting part of a migration is not only replacing machines or operating systems. It is deciding how traffic should move, what signal proves the new path is healthy, how long old nodes should remain useful, and what would make rollback obvious at 2 a.m.
Meta monetization data infrastructure
A quota request can look like a ticket queue, but under the hood it needs validation, demand breakdown, state tracking, polling, alerts, ownership, and failure recovery. That is the type of backend infrastructure work I enjoy: making operational intent executable and observable.
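As a rough sketch of what "state tracking" and "failure recovery" can mean for such a request, the Python below models a request as an explicit state machine. The state names, the transition table, and the `QuotaRequest` class are all hypothetical and chosen for illustration; they do not describe the actual system.

```python
from enum import Enum


class State(Enum):
    SUBMITTED = "submitted"
    VALIDATED = "validated"
    PROVISIONING = "provisioning"
    DONE = "done"
    FAILED = "failed"


# Allowed transitions; anything else is rejected instead of silently applied.
TRANSITIONS = {
    State.SUBMITTED: {State.VALIDATED, State.FAILED},
    State.VALIDATED: {State.PROVISIONING, State.FAILED},
    State.PROVISIONING: {State.DONE, State.FAILED},
    State.FAILED: {State.SUBMITTED},  # failure recovery: the request can be resubmitted
    State.DONE: set(),
}


class QuotaRequest:
    def __init__(self, owner: str, amount: int) -> None:
        self.owner = owner
        self.amount = amount
        self.state = State.SUBMITTED
        self.audit_log = [State.SUBMITTED]  # every accepted transition is recorded

    def advance(self, target: State) -> bool:
        """Move to `target`. Returns False when a retry repeats the current state."""
        if target == self.state:
            return False  # safe retry: repeating a step is a no-op
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {target.name}")
        self.state = target
        self.audit_log.append(target)
        return True
```

Treating a repeated transition as a no-op is one way to make polling and retries safe; rejecting unknown transitions keeps a stuck or skipped step visible instead of hidden.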
Google-maintained infrastructure and developer-tooling repositories
The contributions I care about are usually not flashy. They make package metadata more correct, tests more robust when host capabilities differ, Terraform defaults safer, or docs clearer for the next person trying to integrate a tool.
GKE AI Inference Reliability Lab is a runnable Kubernetes observability and reliability lab for the kind of infrastructure work I want to keep doing: practical, reviewable, and close to production failure modes.
The project replays baseline traffic, cache-miss latency, dependency timeout, and rollout regression scenarios, then publishes sample evidence from the replay.
It evaluates the replay against SLO-style checks so the lab proves both healthy baseline behavior and expected incident classification.
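One way to picture an SLO-style check is the Python sketch below. It is not the lab's actual code: the thresholds, the `ReplayStats` fields, and the classification labels are assumptions made up for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayStats:
    p95_latency_ms: float
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


# Hypothetical thresholds, chosen only for illustration.
P95_LATENCY_SLO_MS = 500.0
ERROR_RATE_SLO = 0.01


def classify(stats: ReplayStats) -> str:
    """Label a replay by which SLO-style thresholds it breaches."""
    latency_breach = stats.p95_latency_ms > P95_LATENCY_SLO_MS
    error_breach = stats.error_rate > ERROR_RATE_SLO
    if latency_breach and error_breach:
        return "latency_and_errors"
    if latency_breach:
        return "latency"
    if error_breach:
        return "errors"
    return "healthy"


def evaluate(
    replays: dict[str, ReplayStats], expected: dict[str, str]
) -> dict[str, bool]:
    """Report, per scenario, whether the observed label matches the expected one."""
    return {name: classify(stats) == expected[name] for name, stats in replays.items()}
```

The point of comparing against an expected label is that a healthy baseline and a correctly detected incident both count as a pass, while a missed or misclassified incident fails the check.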
It includes Kubernetes manifests for collector RBAC, resource enrichment, PVC-backed queue storage, cross-namespace instrumentation, and a small sample workload.
Related OpenTelemetry Operator sample PRs are open and pending review. I will describe them as pending until upstream maintainers merge them.
AI services still need ordinary production discipline: traces, Kubernetes context, durable delivery, safe rollout evidence, and incident narratives a team can act on.
Experience across Nginx/APISIX gateways, AKS/Kubernetes-style platforms, Kafka, ZooKeeper, Elasticsearch, warehouse quota systems, and on-call workflows.
Built workflow automation, service reliability controls, validation tooling, runbooks, dashboards, and scripts that reduced manual operational work.
Selected merged contributions are linked below. They show practical infrastructure fixes across runtime, fuzzing, Kubernetes troubleshooting, cloud defaults, and build/test tooling.
SRE, Production Engineer, Infrastructure Engineer, Platform Engineer, Cloud/DevOps Engineer, and backend infrastructure roles close to production operations.