Site Reliability Engineering Lead
Company: Truist Bank
Location: Atlanta
Posted on: April 2, 2026
Job Description:
The position is described below. If you want to apply, click the
Apply Now button at the top or bottom of this page. After you click
Apply Now and complete your application, you'll be invited to
create a profile, which will let you see your application status
and any communications. If you already have a profile with us, you
can log in to check status. Need Help? If you have a disability and
need assistance with the application, you can request a reasonable
accommodation. Send an email to Accessibility (accommodation
requests only; other inquiries won't receive a response).
Regular or Temporary: Regular
Language Fluency: English (Required)
Work Shift: 1st shift (United States of America)
Please review the following job description:
The Site Reliability Engineering Lead is
a senior, hands-on technical leader within the Wholesale Production
Support Operations organization. This teammate is accountable for
elevating the reliability, resiliency, and operational excellence
of critical enterprise platforms across hybrid cloud and on-prem
environments. Acting as both a hands-on SRE expert and a cross-domain
influencer, the SRE Lead drives systemic improvements in
observability, automation, AIOps adoption, fault tolerance, and
incident management. The role partners closely with Application
Development, Infrastructure, Production Support, Platform Delivery,
Architecture, Cybersecurity, Risk, and Business technology teams to
uplift operational practices and deliver stable, predictable, and
scalable services. This position also plays a pivotal role in
building and maturing the SRE Center for Enablement (C4E) by
contributing standards, repeatable patterns, runbooks, playbooks,
and coaching that amplify reliability practices across the
enterprise. The SRE Lead delivers measurable impact through deep
expertise in distributed systems, modern operational tooling,
cloud-native reliability patterns, and enterprise-scale
incident/problem management.

ESSENTIAL DUTIES AND RESPONSIBILITIES
Following is a summary of the essential functions for this job.
Other duties may be performed, both major and minor, which are not
mentioned below. Specific activities may change from time to time.
1. Guide, educate, and provide thought leadership to our delivery
   teams as related to their optimum adoption of DevSecOps practices
   and framework.
2. Champion the use of DevSecOps as a strategic asset of culture
   change to enhance the flow of business value to our clients.
3. Make informed decisions and determine which tool best fits any
   given situation, drawing on proficiency with multiple vendor
   products for each of the above capabilities.
4. Develop and recommend DevSecOps best practices.
5. Use sophisticated, analytical thought to exercise judgment and
   design innovative solutions for the most complex components of the
   DevSecOps lifecycle.
6. Work independently, with guidance in only the most complex
   situations.
7. Provide technical and process guidance to junior team members.
8. Build and maintain the automation and streamlining of software
   delivery and operations for new or existing software applications
   through advanced proficiency and subject matter expertise in
   vendor tools in the DevOps lifecycle, including:
   a. Infrastructure as Code; Agile and Development Lifecycle
      Management; Source Code Management; Build Orchestration; Build
      Management; Artifact Repository Management; Behavior Driven
      Development; Test Driven Development; Automated Testing
      including Unit Testing, Integration Testing, Functional
      Testing, Smoke Testing, Regression Testing, Stress Testing, and
      Performance Testing; Static Code Analysis; Load and Performance
      Testing; Artifact Scanning; Database Schema Management,
      Orchestration and Recovery; Compliance Automation and Audit
      Trails; Configuration Management; Containers; Application
      Release Automation; Deployment Strategies and Patterns
      including Blue/Green Deployment, Canary Releases, and Rolling
      Releases; Logging and Log Analytics; and Performance Monitoring
      and Management.
9. Liaise with DevSecOps Center for Enablement (C4E) to ensure that
   Enterprise tools or practices are followed, and to share
   information about any team specific tools or practices that may
   benefit other teams.
10. Participate actively in the Truist Agile Guild and Agile DevOps
    Communities of Practice.

Key Responsibilities
Incident & Problem Management Leadership
Lead major and high-severity incident
response efforts, focusing on diagnosing technical root causes and
driving multi-team technical resolution. Drive problem management to
closure, ensuring systemic fixes replace recurring operational risks.
Establish and maintain standardized incident playbooks, escalation
paths, and communication frameworks.

Reliability Engineering & Automation
Architect and deliver automation solutions that eliminate toil,
reduce MTTR, and increase service resilience. Implement intelligent
alerting, anomaly detection, and event correlation leveraging AI and
AIOps tools. Guide and enforce SLO/SLI adoption across product teams,
ensuring metrics inform decision-making and prioritization.

Observability & Operational Excellence
Enhance telemetry coverage across logs, metrics, traces, and events
using platforms such as Dynatrace and Splunk. Define and standardize
enterprise observability practices, dashboards, and KPIs. Ensure
operational readiness of applications and platforms through
resiliency testing, chaos engineering, and failure-mode validation.

Cross-Functional Leadership & Influence
Partner with Delivery, Architecture, Security, and Risk teams to
embed reliability and resilience into design and execution. Act as a
change agent to elevate operational maturity and drive transformative
improvements across Wholesale. Lead workshops, maturity assessments,
and enablement sessions through the SRE C4E and Communities of
Practice.

Standardization & Documentation
Develop, maintain, and enforce runbooks, response playbooks, and
automated recovery patterns. Contribute to enterprise SRE frameworks,
templates, and maturity models. Promote consistent adoption of best
practices across domains and lines of business.

Mentorship & Technical Development
Coach and mentor Associate, Professional, and Senior SREs to build
technical depth and operational discipline. Provide thought
leadership in SRE methodologies, cloud-native operational patterns,
and automated reliability engineering.

Required Qualifications
7 years of
experience in Site Reliability Engineering, DevOps, Platform
Engineering, or Infrastructure Operations.
Deep hands-on experience with distributed systems, container
orchestration (Kubernetes), and cloud-native operational tooling.
Proficiency with automation and scripting languages (Python, Go,
PowerShell, Ansible).
Strong understanding of observability platforms (Splunk, Dynatrace)
and event-driven monitoring.
Proven leadership in major incident management and cross-team
technical coordination.
Strong grasp of networking, Linux/Unix internals, and modern
infrastructure patterns.
Excellent communication skills, including executive-level
situational awareness during critical incidents.
Demonstrated ability to influence technical roadmaps and drive
adoption of reliability best practices.

Preferred Qualifications
Financial services or regulated industry experience.
Experience enabling large-scale SRE transformations or modernization
initiatives.
Familiarity with chaos engineering, resilience assessments, and
service failure modeling.
Exposure to hybrid-cloud and multi-cloud operational frameworks.
Experience contributing to or leading Center for Enablement
functions or Communities of Practice.

OTHER JOB REQUIREMENTS / WORKING CONDITIONS
Sitting: Constantly (More than 50% of the time)
Standing: Occasionally (Less than 25% of the time)
Walking: Occasionally (Less than 25% of the time)
Visual / Audio / Speaking: Able to access and interpret client
information received from the computer and able to hear and speak
with individuals in person and on the phone.
Manual Dexterity / Keyboarding: Able to work standard office
equipment, including PC keyboard and mouse, copy/fax machines, and
printers.
Availability: Able to work all hours scheduled, including overtime
as directed by manager/supervisor and required by business need.
Travel: Minimal and up to 10%

General Description of Available Benefits for Eligible Employees of
Truist Financial Corporation: All regular
teammates (not temporary or contingent workers) working 20 hours or
more per week are eligible for benefits, though eligibility for
specific benefits may be determined by the division of Truist
offering the position. Truist offers medical, dental, vision, life
insurance, disability, accidental death and dismemberment,
tax-preferred savings accounts, and a 401k plan to teammates.
Teammates also receive no less than 10 days of vacation (prorated
based on date of hire and by full-time or part-time status) during
their first year of employment, along with 10 sick days (also
prorated), and paid holidays. For more details on Truist’s generous
benefit plans, please visit our Benefits site. Depending on the
position and division, this job may also be eligible for Truist’s
defined benefit pension plan, restricted stock units, and/or a
deferred compensation plan. As you advance through the hiring
process, you will also learn more about the specific benefits
available for any non-temporary position for which you apply, based
on full-time or part-time status, position, and division of work.
Truist is an Equal Opportunity Employer that does not discriminate
on the basis of race, gender, color, religion, citizenship or
national origin, age, sexual orientation, gender identity,
disability, veteran status, or other classification protected by
law. Truist is a Drug Free Workplace.
Keywords: Truist Bank, Macon, Site Reliability Engineering Lead, IT / Software / Systems, Atlanta, Georgia