Senior Site Reliability Engineer
Company: Las Vegas Sands
Location: Dallas
Posted on: April 1, 2026
Job Description:

Position Overview

The primary responsibility of the Senior Site Reliability Engineer
(SRE) is to lead reliability engineering initiatives across our
Azure estate and Command Center operations. This role focuses on
scripting, automation, and observability to ensure uptime,
performance, and rapid incident response. The Senior SRE will
design and implement monitoring-as-code, optimize alerting, and
build self-healing automation that reduces toil and accelerates
recovery. As part of our journey from traditional operations toward
a mature SRE model, the Senior SRE will partner with product
engineering, platform teams, and the Command Center, including the
Service Desk and Major Incident Command (MIC), to deliver
measurable improvements in service reliability.

All duties are to be performed in accordance with departmental and
Las Vegas Sands Corp.’s policies, practices, and procedures. All
Las Vegas Sands Corp. Team Members are expected to conduct and
carry themselves in a professional manner at all times. Team
Members are required to observe the company’s standards, work
requirements, and rules of conduct.

Essential Duties & Responsibilities

Observability & Monitoring:
- Architect end-to-end monitoring using Azure Monitor, Log
  Analytics, Application Insights, and ITRS Geneos.
- Implement monitoring-as-code with Terraform/Bicep, including
  alerts, dashboards, and diagnostic settings.
- Create actionable dashboards (Azure Workbooks, Grafana) for
  SLIs/SLOs and real-time service health.

Alerting & Incident Response:
- Design alert taxonomies with severity mapping (P0–P4), dynamic
  thresholds, and escalation policies.
- Reduce alert noise and ensure 100% alert-to-runbook mapping.
- Support Major Incident Command (MIC) during P0/P1 bridges with
  technical expertise and rapid remediation.

Automation & Tooling:
- Build automation using PowerShell, Python, and Azure Functions
  for alert lifecycle, runbooks, and self-healing workflows.
- Integrate with ITSM (ServiceNow/Jira) for automated ticket
  enrichment and routing.
- Eliminate repetitive operational tasks and reduce toil through
  automation-first practices.

Reliability Engineering:
- Define and enforce SLIs/SLOs, error budgets, and resilience
  patterns (bulkheads, retries, timeouts).
- Conduct production readiness reviews, chaos drills, and failover
  rehearsals.
- Partner with app teams to embed instrumentation and structured
  logging.

Governance & Compliance:
- Enforce desired state with Azure Policy, DSC/Guest Configuration,
  and drift detection.
- Harden networking (VNet, NSGs, Private Link, Firewall), identity
  (Entra ID), and secrets (Key Vault).
- Ensure auditability and compliance across environments.

- Perform job duties in a safe manner.
- Attend work as scheduled on a consistent and regular basis.
- Perform other related duties as assigned.

Minimum Qualifications

- At least 21 years of age.
- Proof of authorization to work in the United States.
- Bachelor’s degree in Computer Science or IT field, or equivalent
  experience.
- Must be able to obtain and maintain any certification or license,
  as required by law or policy.
- 7 years of experience in SRE/DevOps/Platform roles, with 4 years
  focused on Azure in production at scale.
- Expert knowledge in Infrastructure as Code (Terraform or Bicep)
  and Git-based workflows (GitHub Actions/Azure DevOps).
- Proficiency in CI/CD, deployment strategies (canary, blue-green),
  and automated rollbacks.
- Proficiency in PowerShell and Python for automation; experience
  building reusable modules.
- Demonstrated experience with AKS, App Services, Functions, VM
  Scale Sets, and Azure networking/security.
- Deep knowledge of:
  - Azure: AKS, App Services, Functions, VMSS, Storage, Front Door,
    API Management, Load Balancers, Monitor, Log Analytics, App
    Insights, Key Vault, Policy, Defender
  - Automation & IaC: Terraform/Bicep, PowerShell, Python, GitHub
    Actions/Azure DevOps
  - Observability: Azure Monitor, Log Analytics, App Insights,
    Prometheus/OpenTelemetry; experience with ITRS Geneos
  - Service Management: ServiceNow, Jira
- Proficiency in SRE fundamentals: SLIs/SLOs, error budgets,
  capacity planning, chaos testing, and toil reduction.
- Demonstrated experience leading incidents and collaborating
  across teams.
- Strong interpersonal skills with the ability to communicate
  effectively and interact appropriately with management, other
  Team Members, and outside contacts of different backgrounds and
  levels of experience.
- Must be available to work varied shifts, including nights,
  weekends, and holidays, to ensure 24/7 coverage.
- Provide off-hours support on an infrequent but as-needed basis
  during critical incidents. (Potential shifts may run 24/7 due to
  the needs of the business.)
- Team Members are required to be on site within the IT Command
  Center.

Preferred Qualifications

Certifications & Training:
- AZ-400: Azure DevOps Engineer Expert
- AZ-305: Azure Solutions Architect Expert or AZ-104: Azure
  Administrator
- AZ-500: Azure Security Engineer Associate
- ITIL v4 for operational rigor
- SRE Foundation/Practitioner Certification (DevOps Institute or
  equivalent)

Physical Requirements

Must be able to:
- Lift or carry 50 pounds, unassisted, in the performance of
  specific tasks, as assigned.
- Physically access assigned workspace areas with or without
  reasonable accommodation.
- Work indoors and be exposed to various environmental factors such
  as, but not limited to, CRT, noise, and dust.
- Utilize laptop and standard keyboard to perform essential
  functions of the job.
Keywords: Las Vegas Sands, Grand Prairie, Senior Site Reliability Engineer, IT / Software / Systems, Dallas, Texas