Senior Site Reliability Engineer - Observability and DevOps

Qualys Security Techservices Private Limited

Pune

Not disclosed

Work from Office

Full Time

Min. 5 years

Job Details

Job Description

Lead Site Reliability Engineer, DevOps

Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!

Job Title

Senior Site Reliability Engineer (SRE) – Observability & DevOps

Role Summary

We are looking for a Senior SRE who will own and evolve our observability and reliability platform. The ideal candidate has strong Linux fundamentals, hands-on experience with modern monitoring stacks, and the ability to design scalable alerting and metrics pipelines for large, distributed systems.

This role requires both deep technical expertise and a production-ownership mindset.

Primary Responsibilities

Observability & Monitoring

  • Design, implement, and maintain end-to-end observability using:
    • Prometheus for metrics collection
    • Alertmanager for alert routing, deduplication, and escalation
    • Grafana for visualization and dashboards
    • AppDynamics for APM, transaction tracing, and application health
  • Build actionable dashboards for:
    • SLIs, SLOs, and error budgets
    • Application, infrastructure, and platform health
  • Reduce alert fatigue by implementing signal-based alerting and proper severity models
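
As context for the error-budget dashboards mentioned above, an error budget is the amount of unreliability an SLO permits over a window. The following is a minimal Python sketch of that arithmetic, not a description of our tooling. The function names, the 99.9% target, and the 30-day window are hypothetical values chosen purely for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window.

    slo_target is a fraction, e.g. 0.999 for a (hypothetical) 99.9% SLO.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent.

    The result goes negative once downtime exceeds the budget,
    i.e. once the SLO has been breached for the window.
    """
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # Illustrative numbers only: 99.9% over 30 days, 10.8 minutes of downtime so far.
    print(round(error_budget_minutes(0.999, 30), 1))
    print(round(budget_remaining(0.999, 30, 10.8), 2))
```

With these illustrative inputs, the budget is about 43.2 minutes for the window, and 10.8 minutes of downtime leaves roughly three quarters of it unspent. A dashboard panel would typically plot the remaining fraction over time.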

Data & Metrics Platform

  • Manage and optimize ClickHouse for:
    • High-volume metrics, logs, or traces
    • Long-term retention and fast analytical queries
  • Work on schema design, performance tuning, and cost optimization

Reliability & Operations

  • Define and track SLIs, SLOs, and SLAs in line with SRE best practices
  • Participate in incident response, postmortems, and root cause analysis
  • Drive reliability improvements through automation and capacity planning

Automation & Engineering

  • Develop tooling and automation using at least one scripting/programming language
  • Automate monitoring onboarding, alert generation, and dashboard creation
  • Improve operational efficiency across DevOps tooling

Required Technical Skills (Must-Have)

Core Skills

  • Strong Linux fundamentals
    • Troubleshooting, performance tuning, networking, system internals
  • Scripting / Programming (one or more):
    • Python (preferred), Bash, Go, or similar
  • Observability Tools (Hands-on):
    • Prometheus
    • Alertmanager
    • Grafana
    • AppDynamics
  • Data Platform:
    • Hands-on experience with ClickHouse

Monitoring & Alerting Concepts

  • Metrics vs logs vs traces
  • Golden signals (latency, traffic, errors, saturation)
  • Alert thresholds, routing policies, escalation strategies

Preferred / Nice-to-Have Skills

  • Kubernetes monitoring (Prometheus Operator, kube-state-metrics)
  • Infrastructure as Code (Terraform, Helm)
  • CI/CD observability
  • Cloud platforms (AWS / Azure / GCP)
  • Experience managing observability at scale (100+ services / platforms)

Senior-Level Expectations

  • Architect observability solutions rather than only operating them
  • Take ownership of production troubleshooting and incidents
  • Mentor junior engineers
  • Influence DevOps and SRE best practices across teams
  • Communicate clearly with developers and leadership

Experience & Qualification

  • 5-7 years of experience in SRE / DevOps / Production Engineering
  • Experience operating high-availability, large-scale systems
  • Proven background in observability-driven reliability improvements

Experience Level: Senior Level

Job role

Work location: Pune, India
Department: Software Engineering
Role / Category: DevOps
Employment type: Full Time
Shift: Day Shift

Job requirements

Experience: Min. 5 years
