Namit Tiwari — Site Reliability Architect

Node: namit

$ cat /etc/namit/bio

I'm Namit — a platform engineer who has spent eight years turning fragile release nights into boring ones. I lead the Digital Platform Team for Mobile Banking 2.0 at Kotak Mahindra Bank: the clusters, the pipelines, the mesh, and the people who keep them healthy.

Digital Platform Team · Mobile Banking 2.0

Release Engineering: CI/CD, GitOps, progressive delivery
Infrastructure Engineering: EKS platform, service mesh, networking
VACA / Vulnerability Mgmt: Scanning, patching cadence, posture
Compliance: RBI-regulated change & audit controls
SRE: SLOs, incident response, capacity

Operating Principles

Automate yourself out of toil.
If a human did it twice, it's a script. If a script ran twice, it's a pipeline. The best on-call shift is the one where nothing pages you.

You can't AI what you can't observe.
Every AIOps ambition dies without clean telemetry underneath it. Instrument first, correlate second, automate third.

The Service Mesh

Every skill here is wired to the others the way it is in production.

Platform

Primary cloud since 2019. Designed and operate the EKS platform underneath Mobile Banking 2.0.
Multi-cluster production operations at bank scale — workload architecture, upgrades, hardening.

Traffic & Mesh

Istio: Production service mesh with mTLS via AWS Private CA across banking microservices.
Kong: Dual-gateway ingress with separate internal and external traffic paths in front of the mesh.

Infra as Code

Everything is a module. Cluster, mesh, and observability stacks are fully codified.

Observability

Prometheus: Metrics backbone with recording rules, federation, and SLO burn-rate alerting.
Grafana: Golden-signal dashboards per service; the single pane the war room actually uses.
Jaeger: Distributed tracing across the mesh; latency archaeology during incidents.
Kiali: Mesh topology and traffic health; the map when a rollout goes sideways.
OpenSearch: Centralized log store for the platform; lifecycle-managed indices at bank volume.
Fluent Bit: DaemonSet log pipeline feeding OpenSearch, covering parsing, enrichment, and backpressure tuning.

Delivery

Git is the source of truth. Drift is a bug. App-of-apps across environments.

Code

Automation, operators, glue, and the LLM-agent prototypes in the AIOps roadmap.

AI / AIOps

Authored a 12-month AI-infrastructure roadmap: alert-noise reduction, release risk scoring, SLO breach prediction.

Resilience & Cost

DR architecture for RBI-regulated workloads — tested failovers, not paper exercises.
Right-sizing, spot strategy, and showback for the platform's cloud spend.

Workloads in Production

Written the way incidents are: problem, action, impact. Metrics marked "TODO:" are still being declassified from war-room notes.

INC-2401

Zero-trust service mesh for a bank

Istio
AWS PCA
Kong
EKS
mTLS

Problem

East–west traffic between banking microservices was unencrypted and invisible, and a single shared ingress mixed internal and external traffic paths — unacceptable for a regulated workload.

Action

Designed and shipped a production Istio service mesh on EKS with mTLS anchored to AWS Private CA, fronted by a Kong dual-gateway ingress that hard-separates internal and external traffic.

Impact

100% of service-to-service traffic now rides mTLS with automated cert rotation. TODO: services onboarded, p99 latency overhead, cert-rotation incidents (zero so far).

INC-2402

Centralized logging pipeline

Fluent Bit
OpenSearch
EKS

Problem

Logs lived and died on individual nodes. Debugging a cross-service failure meant kubectl-exec archaeology across dozens of pods under incident pressure.

Action

Rolled out a Fluent Bit DaemonSet pipeline into OpenSearch — structured parsing, Kubernetes metadata enrichment, index lifecycle management sized for bank-scale volume.
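To make the parsing and enrichment steps concrete, here is a minimal Python sketch of what they do to a single record. It is an illustration only, not Fluent Bit's implementation or this platform's configuration: the function name `enrich`, the sample file name, and the output field names are hypothetical, and the sketch assumes container log files are named `<pod>_<namespace>_<container>-<id>.log`.

```python
import json
import re

# Assumed file-name layout: <pod>_<namespace>_<container>-<64-hex-id>.log
LOG_NAME = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_(?P<container>.+)-(?P<id>[0-9a-f]{64})\.log$"
)


def enrich(raw_line: str, file_name: str) -> dict:
    """Parse one log line and attach Kubernetes metadata derived from its file name."""
    try:
        parsed = json.loads(raw_line)
    except json.JSONDecodeError:
        parsed = None
    # Non-JSON lines (and JSON scalars) are kept verbatim under "message".
    record = dict(parsed) if isinstance(parsed, dict) else {"message": raw_line}

    match = LOG_NAME.match(file_name)
    if match:
        record["kubernetes"] = {
            "pod_name": match["pod"],
            "namespace_name": match["namespace"],
            "container_name": match["container"],
        }
    return record


# Hypothetical file name and log line, for illustration only.
sample_file = "payments-7d9f_banking_app-" + "a" * 64 + ".log"
print(enrich('{"level": "error", "msg": "timeout"}', sample_file))
```

Once every record carries the pod and namespace it came from, a cross-service failure becomes a single filtered query instead of a pod-by-pod search.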

Impact

One query now answers what used to take an hour of pod spelunking. TODO: GB/day ingested, MTTR delta, retention window.

INC-2403

12-month AIOps roadmap for a regulated bank

LLM agents
Python
Prometheus
OpenSearch

Problem

Alert volume outgrew human triage, and release risk was judged by gut feel — in an environment where a bad call is a regulatory event, not just a bad day.

Action

Authored the bank's 12-month AI-infrastructure roadmap: alert-noise reduction via correlation, ML-scored release risk gates in the pipeline, and predictive SLO-breach detection on golden signals.
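As a rough illustration of what "alert-noise reduction via correlation" can mean in its simplest form, the sketch below folds alerts from the same service into one incident when they fire close together in time. It is not the roadmap's actual design: the `Alert` fields, the `correlate` function, and the 300-second window are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    name: str
    ts: float  # seconds since some reference point


def correlate(alerts: list[Alert], window_s: float = 300.0) -> list[list[Alert]]:
    """Group alerts per service; a gap longer than window_s starts a new incident."""
    incidents: list[list[Alert]] = []
    open_by_service: dict[str, list[Alert]] = {}
    for alert in sorted(alerts, key=lambda a: a.ts):
        group = open_by_service.get(alert.service)
        if group is not None and alert.ts - group[-1].ts <= window_s:
            group.append(alert)  # same burst, same incident
        else:
            group = [alert]      # first alert, or the previous burst has gone quiet
            open_by_service[alert.service] = group
            incidents.append(group)
    return incidents


# Hypothetical alerts: five pages collapse into three incidents.
alerts = [
    Alert("payments", "HighLatency", 0),
    Alert("auth", "PodRestart", 30),
    Alert("payments", "ErrorRate", 60),
    Alert("payments", "QueueDepth", 120),
    Alert("payments", "HighLatency", 1000),
]
print([len(group) for group in correlate(alerts)])
```

A production correlator would also need topology and causality signals; grouping by service and time is only the baseline that anything smarter has to beat.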

Impact

Roadmap approved and funded; first phase in flight. TODO: % alert reduction in pilot, releases scored, prediction lead-time.

INC-2404

Full observability stack rollout

Prometheus
Grafana
Jaeger
Kiali

Problem

Every team had its own half-configured dashboards and nobody trusted any of them. Incidents began with an argument about whose graph was right.

Action

Stood up a unified stack — Prometheus metrics with SLO burn-rate alerts, golden-signal Grafana dashboards per service, Jaeger tracing through the mesh, Kiali for topology.
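For readers unfamiliar with burn-rate alerting, here is a minimal Python sketch of the arithmetic behind it. It is not the platform's alert rules: the function names, the 99.9% target, and the sample counts are hypothetical, and the 14.4 threshold is used only as a commonly cited example for a one-hour window against a 30-day SLO.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Speed at which the error budget is being spent; 1.0 means exactly on budget."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget


def should_page(long_window: tuple[int, int], short_window: tuple[int, int],
                slo_target: float, threshold: float) -> bool:
    """Page only if both the long and the short window burn faster than the threshold.

    The long window shows the burn is significant; the short window shows it is
    still happening, so the alert stops firing soon after recovery.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)


# Hypothetical (errors, requests) counts for a 99.9% availability SLO.
one_hour = (1500, 100_000)  # 1.5% errors, roughly 15x the budgeted rate
five_min = (150, 10_000)
print(should_page(one_hour, five_min, slo_target=0.999, threshold=14.4))
```

Requiring both windows to exceed the threshold is what keeps this kind of alert from paging on a brief blip or on an incident that has already ended.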

Impact

One agreed-upon picture of production. War rooms start with data, not debate. TODO: services covered, alert precision before/after.

INC-2405

DR/BCP for RBI-regulated workloads

AWS
Multi-region
RTO/RPO
Chaos drills

Problem

Regulatory DR requirements demanded provable recovery — not a document that claims failover works, but a failover that actually has.

Action

Architected the DR/BCP posture for the platform's banking workloads: multi-region topology, data replication strategy, runbooked failover, and scheduled game-day drills.
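One way to turn a game-day drill into audit evidence is to derive RTO and RPO from three timestamps and compare them with the targets. The sketch below shows that calculation under stated assumptions: the `DrillResult` type, the timestamps, and the targets are made up for illustration and are not measurements from this platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    outage_start: datetime           # primary region declared down
    last_replicated_write: datetime  # newest write present in the DR region
    service_restored: datetime       # DR region serving traffic again

    @property
    def rto(self) -> timedelta:
        """Measured recovery time: how long the service was unavailable."""
        return self.service_restored - self.outage_start

    @property
    def rpo(self) -> timedelta:
        """Measured data-loss window: writes newer than this were not replicated."""
        return max(timedelta(0), self.outage_start - self.last_replicated_write)


def meets_targets(result: DrillResult, rto_target: timedelta, rpo_target: timedelta) -> bool:
    return result.rto <= rto_target and result.rpo <= rpo_target


# Hypothetical drill timeline and targets.
drill = DrillResult(
    outage_start=datetime(2024, 1, 10, 2, 0, 0),
    last_replicated_write=datetime(2024, 1, 10, 1, 59, 30),
    service_restored=datetime(2024, 1, 10, 2, 42, 0),
)
print(drill.rto, drill.rpo,
      meets_targets(drill, rto_target=timedelta(hours=1), rpo_target=timedelta(minutes=1)))
```

Recording these three timestamps for every drill gives a timed, comparable record of each failover.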

Impact

Recovery is rehearsed, timed, and evidenced for audit. TODO: measured RTO/RPO, drill cadence, last failover duration.

Runbooks

SEV-1 = deep dive · SEV-3 = quick tip. Runbooks are written after the pager stops.

Page Me

Evaluating me for a Principal SRE, Platform Architect, or engineering-management role? Open an incident — I acknowledge fast.