$ whoami

Namit Tiwari — Site Reliability Architect

$ kubectl get engineer namit -o wide

NAME    ROLE                        ORG                   EXP  STATUS
namit   DVP · Sr. Architect, SRE    kotak-mahindra-bank   8y   Running

I build and run the digital platform behind Kotak's Mobile Banking 2.0 — banking infrastructure that millions of customers rely on and never have to think about.

99.99% availability · 8 years in production

Node: namit

$ cat /etc/namit/bio

I'm Namit — a platform engineer who has spent eight years turning fragile release nights into boring ones. I lead the Digital Platform Team for Mobile Banking 2.0 at Kotak Mahindra Bank: the clusters, the pipelines, the mesh, and the people who keep them healthy.

Digital Platform Team · Mobile Banking 2.0

Release Engineering
CI/CD, GitOps, progressive delivery
Infrastructure Engineering
EKS platform, service mesh, networking
VACA / Vulnerability Mgmt
Scanning, patching cadence, posture
Compliance
RBI-regulated change & audit controls
SRE
SLOs, incident response, capacity

operating principles

Automate yourself out of toil.

If a human did it twice, it's a script. If a script ran twice, it's a pipeline. The best on-call shift is the one where nothing pages you.

You can't AI what you can't observe.

Every AIOps ambition dies without clean telemetry underneath it. Instrument first, correlate second, automate third.

The Service Mesh

Every skill here is wired to the others the way it is in production.

Platform

  • AWS has been the primary cloud since 2019. Designed and operate the EKS platform underneath Mobile Banking 2.0.
  • Multi-cluster production operations at bank scale — workload architecture, upgrades, hardening.

Traffic & Mesh

  • Istio: production service mesh with mTLS via AWS Private CA across banking microservices.
  • Kong: dual-gateway ingress with separate internal and external traffic paths in front of the mesh.

Infra as Code

  • Everything is a module. Cluster, mesh, and observability stacks are fully codified.

Observability

  • Prometheus: the metrics backbone, with recording rules, federation, and SLO burn-rate alerting.
  • Grafana: golden-signal dashboards per service; the single pane the war room actually uses.
  • Jaeger: distributed tracing across the mesh; latency archaeology during incidents.
  • Kiali: mesh topology and traffic health, the map when a rollout goes sideways.
  • OpenSearch: centralized log store for the platform; lifecycle-managed indices at bank volume.
  • Fluent Bit: DaemonSet log pipeline feeding OpenSearch, with parsing, enrichment, and backpressure tuning.

Delivery

  • Git is the source of truth. Drift is a bug. App-of-apps across environments.

Code

  • Automation, operators, glue, and the LLM-agent prototypes in the AIOps roadmap.

AI / AIOps

  • Authored a 12-month AI-infrastructure roadmap: alert-noise reduction, release risk scoring, SLO breach prediction.

Resilience & Cost

  • DR architecture for RBI-regulated workloads — tested failovers, not paper exercises.
  • Right-sizing, spot strategy, and showback for the platform's cloud spend.

Workloads in Production

Written the way incidents are: problem, action, impact. Metrics marked TODO: are being de-classified from war-room notes.

INC-2401

Zero-trust service mesh for a bank

  • Istio
  • AWS PCA
  • Kong
  • EKS
  • mTLS

Problem

East–west traffic between banking microservices was unencrypted and invisible, and a single shared ingress mixed internal and external traffic paths — unacceptable for a regulated workload.

Action

Designed and shipped a production Istio service mesh on EKS with mTLS anchored to AWS Private CA, fronted by a Kong dual-gateway ingress that hard-separates internal and external traffic.

Impact

100% of service-to-service traffic now rides mTLS with automated cert rotation. TODO: services onboarded, p99 latency overhead, cert-rotation incidents (zero so far).

INC-2402

Centralized logging pipeline

  • Fluent Bit
  • OpenSearch
  • EKS

Problem

Logs lived and died on individual nodes. Debugging a cross-service failure meant kubectl-exec archaeology across dozens of pods under incident pressure.

Action

Rolled out a Fluent Bit DaemonSet pipeline into OpenSearch — structured parsing, Kubernetes metadata enrichment, index lifecycle management sized for bank-scale volume.

Impact

One query now answers what used to take an hour of pod spelunking. TODO: GB/day ingested, MTTR delta, retention window.
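As a rough illustration of the "structured parsing plus Kubernetes metadata enrichment" step, here is a minimal Python sketch. It is not the Fluent Bit configuration used on the platform. It assumes the standard kubelet naming convention for container log files and that application logs are JSON lines; the function name `enrich` and the output field names are hypothetical.

```python
import json
import re

# Kubelet naming convention for container log symlinks:
# /var/log/containers/<pod>_<namespace>_<container>-<64-hex container id>.log
LOG_PATH = re.compile(
    r"(?P<pod>[^_/]+)_(?P<namespace>[^_]+)_(?P<container>.+)"
    r"-(?P<container_id>[0-9a-f]{64})\.log$"
)


def enrich(path: str, raw_line: str) -> dict:
    """Parse one log line and attach Kubernetes metadata derived from its path.

    Hypothetical helper: field names below are illustrative only.
    """
    match = LOG_PATH.search(path)
    if match is None:
        raise ValueError(f"not a container log path: {path}")

    # Structured parsing: keep JSON objects as-is, wrap anything else.
    try:
        record = json.loads(raw_line)
        if not isinstance(record, dict):
            record = {"message": raw_line}
    except json.JSONDecodeError:
        record = {"message": raw_line}

    # Enrichment: one query can now filter by namespace, pod, or container.
    record["kubernetes"] = {
        "pod_name": match["pod"],
        "namespace_name": match["namespace"],
        "container_name": match["container"],
        "container_id": match["container_id"],
    }
    return record
```

With records shaped like this, a cross-service search becomes a filter on `kubernetes.namespace_name` instead of a walk through individual pods.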

INC-2403

12-month AIOps roadmap for a regulated bank

  • LLM agents
  • Python
  • Prometheus
  • OpenSearch

Problem

Alert volume outgrew human triage, and release risk was judged by gut feel — in an environment where a bad call is a regulatory event, not just a bad day.

Action

Authored the bank's 12-month AI-infrastructure roadmap: alert-noise reduction via correlation, ML-scored release risk gates in the pipeline, and predictive SLO-breach detection on golden signals.

Impact

Roadmap approved and funded; first phase in flight. TODO: % alert reduction in pilot, releases scored, prediction lead-time.
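To make "alert-noise reduction via correlation" concrete, here is a minimal sketch of one common approach: collapse alerts that share the same labels and arrive close together into a single incident. This is an illustration, not the roadmap's actual design. The alert shape (`ts`, `labels`), the grouping keys, and the 5-minute window are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    key: tuple
    first_seen: float
    last_seen: float
    alerts: list = field(default_factory=list)


def correlate(alerts, window_seconds=300.0, keys=("service", "alertname")):
    """Group alerts with identical key labels into incidents.

    An alert joins the open incident for its key if it arrives within
    window_seconds of that incident's most recent alert; otherwise it
    opens a new incident. Alert dicts are assumed to look like
    {"ts": <epoch seconds>, "labels": {...}} (hypothetical shape).
    """
    open_incident = {}
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = tuple(alert["labels"].get(k, "") for k in keys)
        current = open_incident.get(key)
        if current is not None and alert["ts"] - current.last_seen <= window_seconds:
            current.alerts.append(alert)
            current.last_seen = alert["ts"]
        else:
            current = Incident(key, alert["ts"], alert["ts"], [alert])
            open_incident[key] = current
            incidents.append(current)
    return incidents
```

The ratio of `len(incidents)` to `len(alerts)` gives a simple first measure of how much paging noise the grouping removes.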

INC-2404

Full observability stack rollout

  • Prometheus
  • Grafana
  • Jaeger
  • Kiali

Problem

Every team had its own half-configured dashboards and nobody trusted any of them. Incidents began with an argument about whose graph was right.

Action

Stood up a unified stack — Prometheus metrics with SLO burn-rate alerts, golden-signal Grafana dashboards per service, Jaeger tracing through the mesh, Kiali for topology.

Impact

One agreed-upon picture of production. War rooms start with data, not debate. TODO: services covered, alert precision before/after.
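For readers unfamiliar with SLO burn-rate alerting, the arithmetic can be sketched in a few lines of Python. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below uses the widely published multi-window pattern, where a page fires only if both a long and a short window exceed the threshold. The 14.4 threshold and the 99.99% target in the usage note are assumptions for illustration, not the platform's actual alert rules.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed: ~2% of a 30-day budget burned in 1 hour
) -> bool:
    """Page only when both windows burn fast.

    The long window shows the burn is significant; the short window
    confirms it is still happening, so resolved spikes do not page.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )
```

For example, against a 99.99% target a sustained 0.2% error ratio is a burn rate of roughly 20, so `should_page(0.002, 0.002, 0.9999)` returns `True`, while a long-window spike that has already recovered in the short window does not page.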

INC-2405

DR/BCP for RBI-regulated workloads

  • AWS
  • Multi-region
  • RTO/RPO
  • Chaos drills

Problem

Regulatory DR requirements demanded provable recovery — not a document that claims failover works, but a failover that actually has.

Action

Architected the DR/BCP posture for the platform's banking workloads: multi-region topology, data replication strategy, runbooked failover, and scheduled game-day drills.

Impact

Recovery is rehearsed, timed, and evidenced for audit. TODO: measured RTO/RPO, drill cadence, last failover duration.
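"Rehearsed, timed, and evidenced" implies that each drill yields measured numbers that can be compared with targets. A minimal sketch of that bookkeeping follows. The `DrillRecord` type, its field names, and the targets in the usage note are hypothetical; no real drill data is shown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillRecord:
    """Timestamps captured during one failover drill (hypothetical shape)."""
    last_replicated_write: datetime  # newest data present in the DR region
    outage_start: datetime           # moment the primary was taken down
    service_restored: datetime       # moment traffic was served from DR

    @property
    def measured_rto(self) -> timedelta:
        # Recovery time: how long the service was unavailable.
        return self.service_restored - self.outage_start

    @property
    def measured_rpo(self) -> timedelta:
        # Recovery point: how much recent data would have been lost.
        return self.outage_start - self.last_replicated_write


def meets_targets(drill: DrillRecord, rto_target: timedelta, rpo_target: timedelta) -> bool:
    """True when both measured values are within their targets."""
    return drill.measured_rto <= rto_target and drill.measured_rpo <= rpo_target
```

A drill record can then be checked with something like `meets_targets(drill, timedelta(hours=1), timedelta(minutes=5))`, where both targets are placeholders, and the result archived as audit evidence.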

Page Me

Evaluating me for a Principal SRE, Platform Architect, or engineering-management role? Open an incident — I acknowledge fast.