Tech Mahindra · Cloud Infrastructure Engineer · Jul 2025 – Present
Open to SRE & Cloud Opportunities

Hi, I'm Sugandha Vashishtha

Cloud & Infrastructure Engineer — specializing in AWS, Azure, SRE, and enterprise reliability at scale.

Noida, India · 🏆 AWS Certified · Tech Mahindra
Download Resume
💼 9+ Yrs Exp
☁️ 4+ Yrs Cloud
🏆 AWS Certified

Where Operations Meet Reliability

I'm Sugandha Vashishtha, a Cloud and Site Reliability Engineer with 9+ years of experience in IT, including 4+ years specializing in cloud operations and SRE across Amazon Web Services, Microsoft Azure, and Microsoft Nebula on-premises environments.

At Tech Mahindra, I've served as a Designated Response Individual (DRI) on agile SRE teams, owning service availability, incident response, and SLA/SLO accountability for production workloads at Microsoft scale. I've monitored and triaged alerts using Dynatrace, Jarvis, Hawkeye, and Microsoft ICM, and led root cause analyses for production incidents.

I believe the best SREs come from ops — because they've felt the 3 AM pages, traced the packet drops, and learned that reliability is a feature, not an afterthought.

Master of Computer Applications (MCA) — Pursuing

Manipal University Jaipur · 2026 – 2028

Bachelor of Computer Applications (BCA)

Manipal University Jaipur · 2022 – 2025

Current Focus

Cloud infrastructure engineering, Azure Bastion architecture, and hybrid cloud security, while working toward the CKA and HashiCorp Terraform Associate certifications

AWS & Azure Cloud Operations

SRE · Incident Management · DRI

Dynatrace, Jarvis & Hawkeye Observability

On-Prem · Hybrid Cloud · Azure Bastion

9+ Years in IT
4+ Years in Cloud & SRE
3 Cloud Platforms
4 Certifications
sugandha@cloud:~$

$ whoami

sugandha-vashishtha

$ cat /etc/current-role

Cloud Infrastructure Engineer @ Tech Mahindra

$ cat /etc/current-focus

Azure Bastion ▸ AWS ▸ SRE ▸ Observability ▸ CKA

$ echo $AVAILABILITY

open_to_opportunities=true

Technical Arsenal

Skills organized by depth of hands-on experience — from daily production use to active learning

Expertise

Core competencies — used daily in production

Microsoft Azure
AWS
Incident Management
SRE / DRI
Dynatrace
Microsoft ICM
Azure Bastion
Linux Administration
Windows Server
KQL (Kusto)
SQL
SLA / SLO Management

Proficient

Hands-on project & work experience

Terraform
Kubernetes
GitHub Actions
Prometheus & Grafana
Docker
SAST / DAST
Vulnerability Remediation
AWS EC2 & IAM
AWS Session Manager
Hawkeye & Jarvis
Shell Scripting
Python (Fundamentals)
CI/CD Pipelines
Git & GitHub
GitHub Copilot
TCP/IP & Networking

Learning & Building

Actively studying and expanding

Snowflake
CKA (Kubernetes Admin)
HashiCorp Terraform Associate

On-Premises Infrastructure

Enterprise datacenter & bare-metal operations — Microsoft Nebula environments

Microsoft Nebula
Cloudman
Fabric Manager
RAID Configuration
Bare Metal Ops
WDS
DHCP Scoping
Hardware Management

Work History

9+ years building reliability — from network operations to cloud infrastructure and SRE at scale

Cloud Infrastructure Engineer

Tech Mahindra

📁 AT&T Bastion — Microsoft Azure & AWS

Jul 2025 – Present

Responsibilities

  • Architected and administered Azure Bastion hybrid cloud infrastructure for AT&T enterprise workloads — eliminated public RDP/SSH exposure across all managed VMs and enforced zero-trust access patterns
  • Diagnosed and resolved complex Azure-to-on-premises connectivity failures spanning Bastion, Load Balancers, Virtual Machines, and HTTP proxy; identified 3 recurring session-drop patterns and authored runbooks that cut resolution time from 40+ minutes to under 10
  • Managed Azure subscriptions, resource groups, virtual networks, IAM policies, and shared image galleries across Windows and Linux fleets; enforced configuration standards using automated health checks
  • Maintained SLA compliance through continuous metrics monitoring, log analysis, and proactive operational health checks across production cloud environments; used GitHub Copilot to accelerate code-level diagnostics
Azure Bastion · AWS · IAM · Load Balancers · Virtual Networks · Linux · GitHub Copilot

DevSecOps Engineer — Application OS Remediation

Tech Mahindra

📁 Tracfone–Verizon — AWS Security & Vulnerability Remediation

Aug 2024 – Jun 2025

Responsibilities

  • Performed SAST and DAST scans across 126 AWS-hosted applications to identify security vulnerabilities; tracked 300+ findings through the full remediation lifecycle in line with compliance requirements
  • Conducted systematic vulnerability assessment and remediation of OS-level risks across AWS EC2 instances; prioritized HIGH and CRITICAL CVEs for immediate patching
  • Executed OS patching across 40+ EC2 instances aligned with Change Management protocols; ran pre- and post-patching validation checkpoints achieving zero downtime across all change windows
  • Accessed AWS EC2 instances via AWS Session Manager for patching and upgrades; managed server snapshots and rollback strategy for risk mitigation and business continuity

Key Outcomes

  • Assessed 126 applications through SAST/DAST in a single engagement cycle — identified and tracked 300+ HIGH/CRITICAL vulnerabilities to remediation closure
  • Achieved zero-downtime patching across 40+ EC2 instances through structured pre/post-validation checkpoints
SAST · DAST · AWS EC2 · Session Manager · OS Patching · Vulnerability Remediation · Change Management · DevSecOps

Site Reliability Engineer (DRI)

Tech Mahindra

📁 Microsoft Nebula Score Cloud

Jul 2022 – Jul 2024

Responsibilities

  • Served as Designated Response Individual (DRI) on an agile SRE team — owned service availability, service health, incident response, and SLA/SLO accountability
  • Monitored services using Dynatrace and Microsoft ICM; participated in daily spike-management calls, triaged alerts, and led/contributed to root cause analyses (RCA) for production incidents
  • Monitored and supported production applications using Hawkeye and Jarvis; improved alert detection time through proactive monitoring and alert tuning
  • Managed Cloudman billing, OS patching, OS upgrades, infrastructure monitoring, and Fabric Manager operations; used KQL extensively to query telemetry and investigate production issues
  • Oversaw Microsoft Nebula architecture: offline nodes, storage tier services, hardware failures, host upgrades, reimaging, RAID configuration, WDS, fabric creation, and DHCP scoping

Key Outcomes

  • Maintained a 99.9% availability SLA as DRI across 8+ production services running on Microsoft Nebula infrastructure
  • Reduced mean time to detect (MTTD) by ~30% through proactive Dynatrace alert tuning and ICM incident triage optimization
  • Coordinated with SETO, Private Lab Networks, and Microsoft Lab Services for hardware and network operations — supporting 2,000+ bare-metal nodes
SRE · DRI · Dynatrace · Microsoft ICM · Hawkeye · Jarvis · KQL · Nebula · Cloudman

Network Support Engineer

Tech Mahindra

📁 Netgear — US, UK, Australia, Canada

Dec 2021 – May 2022

Responsibilities

  • Resolved enterprise network connectivity incidents across four countries; configured wireless controllers, routers, and switches; performed systematic routing and switching troubleshooting
Networking · Routing · Switching · Wireless · TCP/IP

Technical Support & Customer Service

Earlier Experience

📁 Xavient Digital (TELUS) · HI3 Technologies (HP) · Convergys (AT&T)

May 2016 – Mar 2021

Responsibilities

  • Tier 2 technical support, root cause analysis, escalation management, and enterprise customer service across TELUS (Canada), HP India, and AT&T (USA) accounts
Technical Support · Root Cause Analysis · Networking · Enterprise Clients

Credentials & Learning Path

AWS & Red Hat certified — actively building toward Kubernetes and Terraform credentials

Completed

3 certifications
🟠
✅ Completed

AWS Certified Solutions Architect – Associate

Amazon Web Services

Issued May 2026

Designing resilient, cost-optimized, and high-performing architectures on AWS

Verified
🟠
✅ Completed

AWS Certified Cloud Practitioner

Amazon Web Services

Issued Jun 2025 · Valid through Jun 2028

Foundational AWS cloud concepts, services, security, and billing

Verified
🎩
✅ Completed

Red Hat System Administration I (RH124) — Ver. 9.3

Red Hat

Issued via Credly

Linux system administration fundamentals on Red Hat Enterprise Linux 9 — users, storage, networking, and services

Verified

In Progress

Actively studying
☸️
🔄 In Progress

Certified Kubernetes Administrator (CKA)

CNCF / Linux Foundation

Administering Kubernetes clusters in production environments

Studying now
🌍
🔄 In Progress

HashiCorp Certified: Terraform Associate

HashiCorp

Infrastructure as Code with Terraform for multi-cloud deployments

Studying now

Things I've Built

Hands-on projects demonstrating cloud architecture, DevOps practices, and infrastructure automation

🏗️

AWS Three-Tier Architecture

Problem

Need a scalable, fault-tolerant web application infrastructure that handles traffic spikes without manual intervention.

Solution

Production-grade three-tier architecture on AWS using EC2, RDS Multi-AZ, and ALB. Infrastructure provisioned entirely via Terraform with VPC segmentation, security groups, NAT Gateways, and CloudWatch monitoring.

Outcome

The Auto Scaling group (ASG) scales from 2 to 6 EC2 instances under load. Multi-AZ RDS failover completes in under 60 seconds. Provisioning time reduced from hours of manual work to a single Terraform apply of under 8 minutes.

Multi-AZ · Auto-scaling · IaC
AWS · Terraform · VPC · EC2 · RDS · ALB · CloudWatch

Kubernetes Monitoring Stack

Problem

Kubernetes clusters running without visibility into pod health, node resource usage, or SLO breach alerting.

Solution

Full observability stack with Prometheus, Grafana, and Alertmanager. Pre-built dashboards for cluster health, pod metrics, and custom SLO tracking. Deployed via Helm charts with GitOps-ready configuration.

Outcome

MTTD reduced from ~15 minutes (manual log review) to under 2 minutes, with Alertmanager alerts firing before users are affected. Dashboards cover 20+ Kubernetes metrics across pods, nodes, and SLOs.

SLO Tracking · Helm · GitOps
Kubernetes · Prometheus · Grafana · Helm · Alertmanager

Terraform Infrastructure Automation

Problem

Manual infrastructure provisioning across dev/staging/prod environments leads to configuration drift and inconsistent deployments.

Solution

Modular Terraform codebase for multi-environment AWS infrastructure, using an S3 remote state backend with DynamoDB state locking, workspaces, and reusable modules for VPC, EKS, RDS, and IAM. Integrated with GitHub Actions.

Outcome

Eliminated configuration drift across 3 environments (dev/staging/prod). Infra provisioning time cut from 4+ hours of manual clicks to a single `terraform apply` run in under 12 minutes.

Multi-env · Remote State · CI/CD
Terraform · AWS · GitHub Actions · S3 · EKS · IAM

Azure Hybrid Connectivity

Problem

On-premises workloads need secure, reliable connectivity to Azure virtual networks without exposing public endpoints.

Solution

Site-to-site VPN between the on-premises network and an Azure VNet using Azure VPN Gateway, a local network gateway, BGP routing, and NSG rules. Documented with step-by-step runbooks and tested failover procedures.

Outcome

Hybrid tunnel established with sub-100ms latency between on-prem and Azure VNet. BGP failover tested at under 30 seconds. Runbooks cut incident resolution time from 45+ minutes to under 10 minutes.

BGP Routing · Failover · Hybrid
Azure · VPN Gateway · BGP · NSG · Hybrid Cloud

DevOps CI/CD Pipeline

Problem

Containerized Node.js app deployed manually — no automated testing, no security scanning, and no rollback capability.

Solution

End-to-end CI/CD pipeline using GitHub Actions, Docker, and Kubernetes. Stages: linting → unit tests → Docker build/push to ECR → Trivy security scanning → automated deployment to EKS with rollback capability.

Outcome

End-to-end pipeline runs in under 6 minutes (lint → test → build → scan → deploy). Trivy blocks HIGH/CRITICAL CVEs before production. Rollback completes in under 2 minutes on failure — zero bad deployments reach users.

Security Scan · Auto-rollback · EKS
GitHub Actions · Docker · Kubernetes · ECR · Trivy · EKS

My Resume

A full picture of my experience, skills, and education

Sugandha Vashishtha

Cloud & Site Reliability Engineer

📍 Noida, India · 📧 buildwithsugandha@gmail.com · 🔗 linkedin.com/in/sugandha-vashishtha

Key Skills

Cloud & Site Reliability Engineering (SRE)
AWS & Azure Cloud Operations
Incident Management · DRI · SLA/SLO
Azure Bastion & Hybrid Cloud
SAST / DAST Vulnerability Remediation
Dynatrace · Jarvis · Hawkeye · ICM
Linux & Windows Server Administration
Shell Scripting · Python · KQL

Ready to download

9+ years of experience across cloud operations, SRE, AWS, Azure, incident management, and enterprise infrastructure. Available as a PDF for immediate download.

Download Resume (PDF)

Sugandha_Vashishtha_Resume.pdf

Last updated · June 2026

Let's Connect

Open to Infrastructure, Cloud, DevOps, and SRE opportunities at enterprise technology organizations. Drop me a message — I respond within 24 hours.

Prefer a quick call? 📅

Schedule a 15-minute intro call — no commitment, just a conversation about how I can help.

Schedule a Call

Email

buildwithsugandha@gmail.com

LinkedIn

linkedin.com/in/sugandha-vashishtha

GitHub

github.com/buildwithsugandha

Location

Noida, India

Resume

Download PDF

Schedule a Call

Book a 15-min intro

Available for opportunities

Actively seeking Cloud Infrastructure, SRE, and DevOps roles at enterprise technology organizations. Remote and Noida/India-based positions welcome.