JP Morgan Services India Pvt Ltd

Senior Software Engineer - AI Reliability

Bengaluru/Bangalore
Not disclosed
Work from Office
Full Time
Min. 3 years

Job Description

Software Engineer III - SRE

We have an exciting and rewarding opportunity for you to take your software engineering career to the next level.

As a Software Engineer III - AI Reliability Engineer at JPMorgan Chase within the Asset and Wealth Management Technology team, your mission will be to enhance the reliability and resilience of AI systems that revolutionize how the Bank services and advises clients. You will focus on ensuring the robustness and availability of AI models, deepening client engagements, and driving process transformation. We seek individuals passionate about leveraging advanced reliability engineering practices, AI observability, and incident response strategies to solve complex business challenges through high-quality, cloud-centric software delivery. Our culture thrives on experimentation, continuous improvement, and learning. You will work in a collaborative, trusting, and intellectually stimulating environment, one that values diversity of thought and fosters creative solutions that serve the best interests of our global clientele.

Responsibilities:

  • Develop and refine Service Level Objectives (including metrics like accuracy, fairness, latency, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token)) for large language model serving and training systems, balancing availability/latency with development velocity
  • Design, implement, and continuously improve monitoring systems covering availability, latency, and other salient metrics
  • Collaborate in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of high-traffic internal workloads
  • Consistently model and champion site reliability culture and practices, providing technical leadership and influence within your team and across teams to foster a culture of reliability and resilience
  • Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers
  • Develop AI Incident Response playbooks for AI-specific failures like sudden drift or bias spikes, including automated rollbacks and AI circuit breakers.
  • Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident.
  • Build and maintain cost optimization systems for large-scale AI infrastructure, ensuring efficient resource utilization without compromising performance.
  • Engineer for Scale and Security, leveraging techniques like load balancing, caching, optimized GPU scheduling, and AI Gateways for managing traffic and security.
  • Collaborate with ML engineers to ensure seamless integration and operation of AI infrastructure, bridging the gap between development and operations.
  • Implement Continuous Evaluation, including pre-deployment, pre-release, and continuous post-deployment monitoring for drift and degradation.

Required qualifications, capabilities, and skills

  • Formal training or certification in software engineering concepts and 3+ years of applied experience
  • Demonstrated proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices
  • Proficient knowledge of and experience in observability, such as white-box and black-box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others
  • Proficient with continuous integration and continuous delivery tools like Jenkins, GitLab, or Terraform
  • Proficient with containers and container orchestration (ECS, Kubernetes, Docker)
  • Experience with troubleshooting common networking technologies and issues
  • Understanding of the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines
  • Proven experience implementing and maintaining SLO/SLA frameworks for business-critical services
  • Comfort working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence)
  • Ability to effectively bridge the gap between ML engineers and infrastructure teams
  • Excellent communication skills

Preferred qualifications, capabilities, and skills

  • Experience with AI-specific observability tools and platforms, such as OpenTelemetry and OpenInference.
  • Familiarity with AI incident response strategies, including automated rollbacks and AI circuit breakers.
  • Knowledge of AI-centric SLOs/SLAs, including metrics like accuracy, fairness, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token).
  • Expertise in engineering for scale and security, including load balancing, caching, optimized GPU scheduling, and AI Gateways.
  • Experience with continuous evaluation processes, including pre-deployment, pre-release, and post-deployment monitoring for drift and degradation.

Experience Level

Senior Level

Job role

Work location
Bengaluru, Karnataka, India
Department
Software Engineering
Role / Category
Software Development
Employment type
Full Time
Shift
Day Shift

Job requirements

Experience
Min. 3 years

About company

Name
JP Morgan Services India Pvt Ltd
Job posted by JP Morgan Services India Pvt Ltd

