Site Reliability Engineer III (AIML SRE)

JPMorgan Chase Bank, N.A.
Jersey City, NJ
Are you looking for an exciting opportunity to join a dynamic and growing team in a fast-paced and challenging area? This is a unique opportunity for you to join our team and partner with the Business to provide a comprehensive view.

As a Senior AI Reliability Engineer at JPMorgan Chase within the Technology and Operations division, you will join our dynamic team of innovators and technologists. Your mission will be to enhance the reliability and resilience of AI systems that revolutionize how the Bank services and advises clients. You will focus on ensuring the robustness and availability of AI models, deepening client engagements, and promoting process transformation. We seek team members passionate about leveraging advanced reliability engineering practices, AI observability, and incident response strategies to solve complex business challenges through high-quality, cloud-centric software delivery.

Job Responsibilities:


  • Define and refine Service Level Objectives (SLOs) for large language model serving and training systems, using metrics like accuracy, fairness, latency, drift targets, TTFT, and TPOT, while balancing reliability and development velocity.
  • Design, implement, and continuously improve monitoring systems to track availability, latency, drift, and other key metrics for robust observability and rapid issue detection.
  • Collaborate in the design and deployment of high-availability language model serving infrastructure that supports high-traffic internal workloads across multiple regions and cloud providers.
  • Champion site reliability engineering practices, providing technical leadership and fostering a culture of reliability, resilience, and continuous improvement across teams.
  • Develop and manage automated failover and recovery systems for model serving deployments, ensuring seamless operation and rapid recovery from failures.
  • Create and lead AI-specific incident response playbooks for issues like model drift or bias spikes, including automated rollbacks, circuit breakers, and systematic post-incident improvements.
  • Build and maintain cost optimization systems for large-scale AI infrastructure, leveraging load balancing, caching, optimized GPU scheduling, and AI Gateways to ensure efficient, secure, and scalable operations.
Required qualifications, capabilities, and skills:
  • Formal training or certification on AI reliability concepts and 3+ years applied experience.
  • Demonstrate a strong sense of curiosity and a passion for continuous learning, especially in the rapidly evolving field of AI reliability.
  • Show proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices.
  • Possess deep knowledge and experience in observability, including white and black box monitoring, SLO alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk.
  • Be proficient with continuous integration and delivery and infrastructure-as-code tools like Jenkins, GitLab, or Terraform, as well as container and orchestration technologies such as ECS, Kubernetes, and Docker.
  • Have experience troubleshooting common networking technologies and issues, and understand the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines.
  • Communicate effectively and bridge the gap between ML engineers and infrastructure teams, with proven experience implementing and maintaining SLO/SLA frameworks for business-critical services, and working with both traditional and AI-specific metrics.
Preferred qualifications, capabilities, and skills:
  • Experience with AI-specific observability tools and platforms, such as OpenTelemetry and OpenInference.
  • Familiarity with AI incident response strategies, including automated rollbacks and AI circuit breakers.
  • Knowledge of AI-centric SLOs/SLAs, including metrics like accuracy, fairness, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token).
  • Expertise in engineering for scale and security, including load balancing, caching, optimized GPU scheduling, and AI Gateways.
  • Experience with continuous evaluation processes, including pre-deployment, pre-release, and post-deployment monitoring for drift and degradation.
  • Understand ML model deployment strategies and their reliability implications.
  • Have contributed to open-source infrastructure or ML tooling.
  • Have experience with chaos engineering and systematic resilience testing.

JPMorganChase, one of the oldest financial institutions, offers innovative financial solutions to millions of consumers, small businesses and many of the world's most prominent corporate, institutional and government clients under the J.P. Morgan and Chase brands. Our history spans over 200 years and today we are a leader in investment banking, consumer and small business banking, commercial banking, financial transaction processing and asset management.

We offer a competitive total rewards package including base salary determined based on the role, experience, skill set and location. Those in eligible roles may receive commission-based pay and/or discretionary incentive compensation, paid in the form of cash and/or forfeitable equity, awarded in recognition of individual achievements and contributions. We also offer a range of benefits and programs to meet employee needs, based on eligibility. These benefits include comprehensive health care coverage, on-site health and wellness centers, a retirement savings plan, backup childcare, tuition reimbursement, mental health support, financial coaching and more. Additional details about total compensation and benefits will be provided during the hiring process.

We recognize that our people are our strength and the diverse talents they bring to our global workforce are directly linked to our success. We are an equal opportunity employer and place a high value on diversity and inclusion at our company. We do not discriminate on the basis of any protected attribute, including race, religion, color, national origin, gender, sexual orientation, gender identity, gender expression, age, marital or veteran status, pregnancy or disability, or any other basis protected under applicable law. We also make reasonable accommodations for applicants' and employees' religious practices and beliefs, as well as mental health or physical disability needs. Visit our FAQs for more information about requesting an accommodation.


JPMorgan Chase & Co. is an Equal Opportunity Employer, including Disability/Veterans

Base Pay/Salary
Jersey City, NJ: $133,000.00 - $185,000.00 / year
Posted 2025-09-21
