Hi, I'm Sugandha Vashishtha
Cloud & Infrastructure Engineer — specializing in AWS, Azure, SRE, and enterprise reliability at scale.

Where Operations meet Reliability
I'm Sugandha Vashishtha, a Cloud and Site Reliability Engineer with 9+ years of experience in IT, including 4+ years specializing in cloud operations and SRE across Amazon Web Services, Microsoft Azure, and Microsoft Nebula on-premises environments.
At Tech Mahindra, I've served as a Designated Response Individual (DRI) on agile SRE teams, owning service availability, incident response, and SLA/SLO accountability for production workloads at Microsoft scale. I've monitored and triaged alerts using Dynatrace, Jarvis, Hawkeye, and Microsoft ICM, and led root cause analyses for production incidents.
I believe the best SREs come from ops — because they've felt the 3 AM pages, traced the packet drops, and learned that reliability is a feature, not an afterthought.
Master of Computer Applications (MCA) — Pursuing
Manipal University Jaipur · 2026 – 2028
Bachelor of Computer Applications (BCA)
Manipal University Jaipur · 2022 – 2025
Current Focus
Cloud infrastructure engineering, Azure Bastion architecture, and hybrid cloud security, while working toward CKA and Terraform certifications
AWS & Azure Cloud Operations
SRE · Incident Management · DRI
Dynatrace, Jarvis & Hawkeye Observability
On-Prem · Hybrid Cloud · Azure Bastion
$ whoami
sugandha-vashishtha
$ cat /etc/current-role
Cloud Infrastructure Engineer @ Tech Mahindra
$ cat /etc/current-focus
Azure Bastion ▸ AWS ▸ SRE ▸ Observability ▸ CKA
$ echo $AVAILABILITY
open_to_opportunities=true
Technical Arsenal
Skills organized by depth of hands-on experience — from daily production use to active learning
Expertise
Core competencies — used daily in production
Proficient
Hands-on project & work experience
Learning & Building
Actively studying and expanding
On-Premises Infrastructure
Enterprise datacenter & bare-metal operations — Microsoft Nebula environments
Work History
9+ years building reliability — from network operations to cloud infrastructure and SRE at scale
Cloud Infrastructure Engineer
📁 AT&T Bastion — Microsoft Azure & AWS
Responsibilities
- Architected and administered Azure Bastion hybrid cloud infrastructure for AT&T enterprise workloads — eliminated public RDP/SSH exposure across all managed VMs and enforced zero-trust access patterns
- Diagnosed and resolved complex Azure-to-on-premises connectivity failures spanning Bastion, Load Balancers, Virtual Machines, and HTTP proxy; identified 3 recurring session-drop patterns and authored runbooks that cut resolution time from 40+ minutes to under 10
- Managed Azure subscriptions, resource groups, virtual networks, IAM policies, and shared image galleries across Windows and Linux fleets; enforced configuration standards using automated health checks
- Maintained SLA compliance through continuous metrics monitoring, log analysis, and proactive operational health checks across production cloud environments; used GitHub Copilot to accelerate code-level diagnostics
DevSecOps Engineer — Application OS Remediation
📁 Tracfone–Verizon — AWS Security & Vulnerability Remediation
Responsibilities
- Performed SAST and DAST scans across 126 AWS-hosted applications to identify security vulnerabilities; tracked 300+ findings through full remediation lifecycle in alignment with compliance requirements
- Conducted systematic vulnerability assessment and remediation of OS-level risks across AWS EC2 instances; prioritised HIGH and CRITICAL CVEs for immediate patching
- Executed OS patching across 40+ EC2 instances aligned with Change Management protocols; ran pre- and post-patching validation checkpoints achieving zero downtime across all change windows
- Accessed AWS EC2 instances via AWS Session Manager for patching and upgrades; managed server snapshots and rollback strategy for risk mitigation and business continuity
Key Outcomes
- Assessed 126 applications through SAST/DAST in a single engagement cycle — identified and tracked 300+ HIGH/CRITICAL vulnerabilities to remediation closure
- Achieved zero-downtime patching across 40+ EC2 instances through structured pre/post-validation checkpoints
Site Reliability Engineer (DRI)
📁 Microsoft Nebula Score Cloud
Responsibilities
- Served as Designated Response Individual (DRI) on an agile SRE team — owned service availability, service health, incident response, and SLA/SLO accountability
- Monitored services using Dynatrace and Microsoft ICM; participated in daily spike-management calls, triaged alerts, and led/contributed to root cause analyses (RCA) for production incidents
- Monitored and supported production applications using Hawkeye and Jarvis; improved alert detection time through proactive monitoring and alert tuning
- Managed Cloudman billing, OS patching, OS upgrades, infrastructure monitoring, and Fabric Manager operations; used KQL extensively to query telemetry and investigate production issues
- Oversaw Microsoft Nebula architecture: offline nodes, storage tier services, hardware failures, host upgrades, reimaging, RAID configuration, WDS, fabric creation, and DHCP scoping
Key Outcomes
- Maintained 99.9% SLA accountability as DRI across 8+ production services running on Microsoft Nebula infrastructure
- Reduced mean time to detect (MTTD) by ~30% through proactive Dynatrace alert tuning and ICM incident triage optimisation
- Coordinated with SETO, Private Lab Networks, and Microsoft Lab Services for hardware and network operations — supporting 2,000+ bare-metal nodes
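To put the 99.9% figure above in concrete terms, here is a minimal sketch of the downtime arithmetic behind an availability target. The 30-day window and the function name are assumptions made for illustration. They are not taken from the actual SLA terms.

```python
def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability target allows over a window.

    `slo` is a fraction (0.999 for 99.9%). The 30-day default window is an
    assumption for illustration, not a term from any real SLA.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


budget = downtime_budget_minutes(0.999)
print(f"{budget:.1f} minutes per 30 days")  # prints "43.2 minutes per 30 days"
```

Under these assumptions, 99.9% leaves roughly 43 minutes of total downtime per month across all incidents, which is why faster detection matters as much as faster repair.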
Network Support Engineer
📁 Netgear — US, UK, Australia, Canada
Responsibilities
- Resolved enterprise network connectivity incidents across four countries; configured wireless controllers, routers, and switches; performed systematic routing and switching troubleshooting
Technical Support & Customer Service
📁 Xavient Digital (TELUS) · HI3 Technologies (HP) · Convergys (AT&T)
Responsibilities
- Tier 2 technical support, root cause analysis, escalation management, and enterprise customer service across TELUS (Canada), HP India, and AT&T (USA) accounts
Credentials & Learning Path
AWS & Red Hat certified — actively building toward Kubernetes and Terraform credentials
Completed
3 certifications
AWS Certified Solutions Architect – Associate
Amazon Web Services
Issued May 2026
Designing resilient, cost-optimized, and high-performing architectures on AWS
AWS Certified Cloud Practitioner
Amazon Web Services
Issued Jun 2025 · Valid through Jun 2028
Foundational AWS cloud concepts, services, security, and billing
Red Hat System Administration I (RH124) — Ver. 9.3
Red Hat
Issued via Credly
Linux system administration fundamentals on Red Hat Enterprise Linux 9 — users, storage, networking, and services
In Progress
Actively studying
Certified Kubernetes Administrator (CKA)
CNCF / Linux Foundation
Administering Kubernetes clusters in production environments
HashiCorp Certified: Terraform Associate
HashiCorp
Infrastructure as Code with Terraform for multi-cloud deployments
Things I've Built
Hands-on projects demonstrating cloud architecture, DevOps practices, and infrastructure automation
AWS Three-Tier Architecture
Problem
Need a scalable, fault-tolerant web application infrastructure that handles traffic spikes without manual intervention.
Solution
Production-grade three-tier architecture on AWS using EC2, RDS Multi-AZ, and ALB. Infrastructure provisioned entirely via Terraform with VPC segmentation, security groups, NAT Gateways, and CloudWatch monitoring.
Outcome
ASG auto-scales from 2→6 EC2 instances under load. Multi-AZ RDS failover completes in under 60 seconds. Terraform apply time reduced from hours of manual work to under 8 minutes.
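The 2→6 scaling range can be illustrated with a simplified proportional rule: scale the fleet by the ratio of observed load to target load, then clamp to the group's bounds. This is a sketch only. It is not the algorithm AWS Auto Scaling uses internally. The 50% CPU target and the function name are assumptions for illustration.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     min_size: int = 2, max_size: int = 6) -> int:
    """Proportional scaling estimate, clamped to the group's size bounds.

    Simplified illustration only; the min/max defaults mirror the 2 and 6
    instance limits described above, and the rule itself is an assumption.
    """
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be > 0")
    proposed = math.ceil(current * metric / target)
    return max(min_size, min(max_size, proposed))


# Hypothetical readings against an assumed 50% average-CPU target.
print(desired_capacity(2, 90.0, 50.0))   # prints 4
print(desired_capacity(4, 100.0, 50.0))  # prints 6 (capped at max_size)
print(desired_capacity(4, 10.0, 50.0))   # prints 2 (floored at min_size)
```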
Kubernetes Monitoring Stack
Problem
Kubernetes clusters running without visibility into pod health, node resource usage, or SLO breach alerting.
Solution
Full observability stack with Prometheus, Grafana, and Alertmanager. Pre-built dashboards for cluster health, pod metrics, and custom SLO tracking. Deployed via Helm charts with GitOps-ready configuration.
Outcome
MTTD reduced from ~15 minutes (manual log review) to under 2 minutes via Alertmanager firing before user impact. Dashboard covers 20+ Kubernetes metrics across pods, nodes, and SLOs.
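One common way to express SLO tracking of this kind is an error-budget burn rate: the observed error rate divided by the error rate the SLO permits. The sketch below assumes a request-based SLO and uses hypothetical counts. It does not reproduce the project's actual alert rules.

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly as fast as
    the SLO permits; higher values mean it will run out early.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    error_rate = failed / total
    allowed_error_rate = 1.0 - slo
    return error_rate / allowed_error_rate


# Hypothetical sample: 5 failed requests out of 1,000 against a 99.9% SLO.
print(f"{burn_rate(5, 1000, 0.999):.1f}")  # prints "5.0"
```

A burn rate of 5 means the budget would be exhausted in one fifth of the SLO window, so an alert keyed to this value can fire well before users notice sustained impact.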
Terraform Infrastructure Automation
Problem
Manual infrastructure provisioning across dev/staging/prod environments leads to configuration drift and inconsistent deployments.
Solution
Modular Terraform codebase for multi-environment AWS infrastructure using remote state management with S3/DynamoDB locking, workspaces, and reusable modules for VPC, EKS, RDS, and IAM. Integrated with GitHub Actions.
Outcome
Eliminated configuration drift across 3 environments (dev/staging/prod). Infra provisioning time cut from 4+ hours of manual clicks to a single `terraform apply` run in under 12 minutes.
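Drift of this kind can be checked mechanically by reading Terraform's JSON plan output and listing every resource whose planned action is not a no-op. The sketch below assumes the plan was exported with `terraform plan -out=tfplan` followed by `terraform show -json tfplan`. The sample document and resource addresses are hypothetical.

```python
import json


def changed_resources(plan_json: str) -> list[tuple[str, list[str]]]:
    """Return (address, actions) for every resource the plan would change."""
    plan = json.loads(plan_json)
    changes = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A plan with no drift reports ["no-op"] for each managed resource.
        if actions and actions != ["no-op"]:
            changes.append((rc["address"], actions))
    return changes


# Hypothetical, heavily trimmed plan document for illustration.
sample = json.dumps({
    "resource_changes": [
        {"address": "aws_vpc.main", "change": {"actions": ["no-op"]}},
        {"address": "aws_db_instance.app", "change": {"actions": ["update"]}},
    ]
})
print(changed_resources(sample))  # prints [('aws_db_instance.app', ['update'])]
```

An empty result for each of dev, staging, and prod is one reasonable definition of "no drift" that a scheduled pipeline job could assert.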
Azure Hybrid Connectivity
Problem
On-premises workloads need secure, reliable connectivity to Azure virtual networks without exposing public endpoints.
Solution
Site-to-site VPN between on-premises network and Azure VNet using Azure VPN Gateway, local network gateway configuration, BGP routing, and NSG rules. Documented with step-by-step runbooks and tested failover procedures.
Outcome
Hybrid tunnel established with sub-100ms latency between on-prem and Azure VNet. BGP failover tested at under 30 seconds. Runbooks cut incident resolution time from 45+ minutes to under 10 minutes.
DevOps CI/CD Pipeline
Problem
Containerized Node.js app deployed manually — no automated testing, no security scanning, and no rollback capability.
Solution
End-to-end CI/CD pipeline using GitHub Actions, Docker, and Kubernetes. Stages: linting → unit tests → Docker build/push to ECR → Trivy security scanning → automated deployment to EKS with rollback capability.
Outcome
End-to-end pipeline runs in under 6 minutes (lint → test → build → scan → deploy). Trivy blocks HIGH/CRITICAL CVEs before production. Rollback completes in under 2 minutes on failure — zero bad deployments reach users.
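The severity gate described above can be sketched as a small check over a Trivy JSON report. Trivy's own `--severity` and `--exit-code` flags can enforce the same gate directly, so this is only an illustration of the logic. The sample report and CVE identifiers are hypothetical placeholders.

```python
import json

BLOCKING_SEVERITIES = {"HIGH", "CRITICAL"}


def blocking_findings(report_json: str) -> list[str]:
    """Collect vulnerability IDs whose severity should fail the pipeline."""
    report = json.loads(report_json)
    found = []
    # "Results" and "Vulnerabilities" can be missing or null in a clean scan.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING_SEVERITIES:
                found.append(vuln.get("VulnerabilityID", "unknown"))
    return found


# Hypothetical trimmed report; the IDs are placeholders, not real CVEs.
sample_report = json.dumps({
    "Results": [
        {"Target": "app-image", "Vulnerabilities": [
            {"VulnerabilityID": "CVE-0000-0001", "Severity": "LOW"},
            {"VulnerabilityID": "CVE-0000-0002", "Severity": "CRITICAL"},
        ]},
        {"Target": "package-lock.json", "Vulnerabilities": None},
    ]
})
print(blocking_findings(sample_report))  # prints ['CVE-0000-0002']
```

A CI step would exit non-zero whenever the returned list is non-empty, which stops the deploy stage from running.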
Activity & Contributions
Building in public — infrastructure code, automation scripts, and DevOps tooling
Public Repos
5
Primary Languages
HCL · YAML
Security Scans
Trivy + SAST
Active Projects
IaC · K8s · CI/CD
📌 Featured Repositories
Production-grade three-tier architecture on AWS with Terraform — VPC, EC2, RDS Multi-AZ, ALB, CloudWatch
Prometheus + Grafana + Alertmanager observability stack for Kubernetes clusters with SLO tracking
Modular Terraform codebase for multi-environment AWS infrastructure with remote state and GitHub Actions CI/CD
End-to-end CI/CD pipeline with GitHub Actions, Docker, Trivy security scanning, and EKS deployment with rollback
Technical Writing
Articles on Cloud, DevOps, AWS, and infrastructure — published on LinkedIn and Medium
GitHub vs GitLab: What Most Developers Get Wrong!!
DevOps vs DevSecOps: Why Speed Alone Is No Longer Enough
Streamlining Server Regression Analysis with Windows Deployment Kit
Speech to Text using AWS Transcribe, S3, and Lambda
Mastering API Fundamentals with POSTMAN and GraphQL
Understanding Web Servers: The Heart of the Internet
AWS Global Accelerator — Improving Latency and Design for Failure
The Benefits of AWS Global Accelerator — Case Study: EC2 Linux GUI
Empowering AI: Unleashing the Potential of ChatGPT Prompt Engineering
My Resume
A full picture of my experience, skills, and education

Sugandha Vashishtha
Cloud & Site Reliability Engineer
Key Skills
Ready to download
9+ years of experience across cloud operations, SRE, AWS, Azure, incident management, and enterprise infrastructure. Available as a PDF for immediate download.
Sugandha_Vashishtha_Resume.pdf
Last updated · June 2026
Let's Connect
Open to Infrastructure, Cloud, DevOps, and SRE opportunities at enterprise technology organizations. Drop me a message — I respond within 24 hours.
Prefer a quick call? 📅
Schedule a 15-minute intro call — no commitment, just a conversation about how I can help.
buildwithsugandha@gmail.com
linkedin.com/in/sugandha-vashishtha
GitHub
github.com/buildwithsugandha
Location
Noida, India
Actively seeking Cloud Infrastructure, SRE, and DevOps roles at enterprise technology organizations. Remote and Noida/India-based positions welcome.