NielsenIQ

Principal Software Engineer - Site Reliability Engineering

Chennai
Not disclosed
Work from Office
Full Time
Min. 10 years

Job Description

Principal Software Engineer – Site Reliability & Application Support, Chennai

We are looking for a Principal Software Engineer in Site Reliability Engineering (SRE) to define and drive the reliability strategy for large‑scale, distributed, and cloud‑native applications. This role operates at a company and platform level, bridging the gap between software engineering and operations to ensure our applications are highly available, performant, and resilient at scale. The scope spans the full application stack (Angular front‑end, Node.js services, Java back‑end, and Python tooling) and encompasses reliability engineering, observability, incident management, and continuous improvement of application health across production environments.
You will act as a technical authority for application reliability and support, leading triage efforts, driving automation to eliminate toil, setting company‑wide SRE standards, and collaborating with development, platform, and architecture teams to embed reliability as a first‑class engineering concern.

Responsibilities

Application Reliability & Support

  • Own end‑to‑end reliability of multi‑tier applications spanning Angular, Node.js, Java, and Python stacks
  • Monitor, triage, and resolve production incidents with speed and precision, minimizing customer impact and MTTR
  • Perform root cause analysis (RCA) on recurring issues and drive permanent fixes through development or platform teams
  • Define and track SLIs, SLOs, and error budgets aligned to business criticality
  • Lead blameless post‑mortems and ensure actionable follow‑through on learnings
  • Proactively identify reliability risks and work with engineering teams to address them before they impact production

Incident Management & Technical Triage

  • Lead technical triage bridges during P1/P2 incidents, coordinating across application, infrastructure, and vendor teams
  • Rapidly diagnose issues across the full stack — front‑end rendering, API failures, JVM issues, database bottlenecks, and network anomalies
  • Establish and maintain runbooks, escalation paths, and incident response playbooks
  • Drive structured incident timelines, stakeholder communications, and resolution documentation
  • Champion fast feedback loops between on‑call, engineering, and leadership during high‑severity events

Observability & Monitoring

  • Design and implement end‑to‑end observability strategies covering logs, metrics, traces, and synthetic monitoring
  • Build and maintain dashboards, alerting rules, and anomaly detection for Angular, Node.js, Java, and Python applications
  • Define golden signals (latency, traffic, errors, saturation) and SLO‑based alerting for all critical services
  • Drive adoption of distributed tracing and correlation of signals across service boundaries
  • Evaluate and integrate observability tooling (e.g., Prometheus, Grafana, OpenTelemetry, Datadog, Dynatrace, Splunk, ELK)
  • Continuously improve signal‑to‑noise ratio to reduce alert fatigue and improve detection confidence

Automation & Toil Reduction

  • Identify and eliminate operational toil through automation, scripting, and self‑healing mechanisms
  • Build and maintain automation scripts in Python, Shell/Bash, or Node.js for diagnostics, remediation, and reporting
  • Develop automated health checks, smoke tests, and canary validations for releases and deployments
  • Automate repetitive support workflows such as log analysis, data reconciliation, and environment reset procedures
  • Contribute to the internal tooling ecosystem to improve operational efficiency across teams

Release & Change Management

  • Coordinate application releases in alignment with change management processes and release calendars
  • Conduct pre‑release readiness reviews, validating deployment readiness, rollback plans, and monitoring coverage
  • Collaborate with development and DevOps teams to define and enforce safe deployment practices (blue‑green, canary, feature flags)
  • Participate in change advisory board (CAB) processes, providing technical assessment of risk and impact
  • Maintain deployment runbooks and ensure change traceability across environments

Collaboration — Development, Architecture & Platform Teams

  • Serve as the operational voice in engineering discussions, advocating for reliability, observability, and supportability
  • Partner with development teams during design and sprint cycles to embed SRE best practices early
  • Engage with architects to review designs for failure modes, observability gaps, and operability concerns
  • Provide production insights and telemetry data to inform architectural decisions and technical debt prioritization
  • Drive feedback loops from production back to development and architecture teams in a structured, data‑driven manner

Cloud & Infrastructure

  • Support and operate cloud‑native applications on Azure, AWS, or GCP, leveraging managed services effectively
  • Manage and troubleshoot containerized workloads using Docker and Kubernetes (AKS / EKS / GKE)
  • Understand and operate CI/CD pipelines, supporting deployment automation and pipeline reliability
  • Apply Infrastructure‑as‑Code (Terraform, Bicep, or similar) understanding to diagnose and support environment‑level issues
  • Collaborate with platform and cloud teams on capacity planning, cost optimization, and scaling strategies

AI & Engineering Innovation

  • Leverage AI‑assisted tooling (e.g., AIOps, GenAI‑based log analysis, intelligent alerting) to accelerate diagnosis and reduce resolution time
  • Evaluate and adopt AI/ML‑driven observability and anomaly detection capabilities
  • Apply GenAI tools responsibly to improve runbook generation, RCA summaries, and incident documentation quality
  • Contribute to organizational knowledge by documenting patterns, solutions, and operational best practices

Required Technical Skills

Application Stack

  • Angular (component lifecycle, API integration, front‑end performance profiling, browser diagnostics)
  • Node.js (event loop, async patterns, memory management, npm ecosystem, service debugging)
  • Java (Spring Boot, JVM diagnostics, heap/thread analysis, REST APIs, microservices)
  • Python (scripting, automation, data analysis, diagnostic tooling)

SRE & Reliability Engineering

  • SLI / SLO / SLA definition, tracking, and error budget management
  • Incident management frameworks (ITIL, PagerDuty, Opsgenie, or equivalent)
  • Root cause analysis methodologies (5 Whys, Fishbone, fault tree analysis)
  • Reliability patterns: circuit breakers, retries, timeouts, bulkheads, graceful degradation
  • Capacity planning, performance profiling, and load analysis

Observability & Monitoring

  • Logging: ELK Stack / Splunk / Loki (structured logging, log correlation, query analysis)
  • Metrics: Prometheus, Grafana, Datadog, CloudWatch, Azure Monitor
  • Tracing: OpenTelemetry, Jaeger, Zipkin, distributed trace correlation
  • Synthetic monitoring, uptime checks, and real‑user monitoring (RUM)
  • Alert design: thresholds, multi‑condition rules, SLO burn rate alerts

Automation & Scripting

  • Python, Shell/Bash, PowerShell for automation, diagnostics, and remediation scripts
  • REST API automation and integration testing tools (Postman, curl, pytest, JUnit)
  • CI/CD pipelines (Jenkins, GitHub Actions, Azure DevOps, GitLab CI)
  • Infrastructure tooling: Terraform, Ansible, or similar

Cloud & Platforms

  • Cloud platforms: Azure / AWS / GCP (managed services, networking, IAM, storage, compute)
  • Containers and orchestration: Docker, Kubernetes (kubectl, Helm, namespaces, resource limits)
  • Service mesh basics (Istio, Linkerd) and API gateway management
  • Database operations: SQL query analysis, connection pool diagnostics, slow query identification

AI / Data (Working Knowledge)

  • AIOps platforms and AI‑assisted alert correlation
  • GenAI tooling for documentation, RCA assistance, and knowledge management
  • Basic understanding of ML model deployment and observability for AI‑driven systems

Qualifications

  • Must have 10–15+ years of hands‑on software engineering and/or SRE experience
  • Proven experience designing and operating enterprise‑grade, large‑scale production systems
  • Demonstrated impact at Staff / Principal / Architect level in SRE, platform engineering, or application reliability
  • Strong background in influencing reliability and observability strategy across multiple teams or platforms
  • Demonstrated experience leading incident triage and driving resolution in high‑pressure, high‑stakes environments
  • Bachelor's or master's degree in Computer Science, Information Technology, or a related field

Leadership & Soft Skills

  • Exceptional analytical, diagnostic, and structured problem‑solving skills
  • Strong written and verbal communication, with the ability to convey technical issues clearly to both technical and non‑technical stakeholders
  • Ability to lead under pressure and bring calm and clarity to high‑severity incidents
  • High ownership, accountability, and bias for action
  • Collaborative mindset with the ability to influence development and architectural decisions through data and evidence
  • Continuous improvement orientation: always looking to reduce toil, improve quality, and raise the reliability bar

Nice to Have

  • Experience with Kafka, event‑driven architectures, and streaming system observability
  • Exposure to security monitoring, compliance frameworks, and vulnerability management in production
  • Experience with large‑scale analytics platforms (Spark, BigQuery, Databricks)
  • Familiarity with chaos engineering principles and tooling (Chaos Monkey, Litmus, Gremlin)
  • Prior role as Principal SRE, Staff Engineer, or Platform Reliability Architect
  • Certifications: AWS/Azure/GCP Associate or Professional, CKA (Certified Kubernetes Administrator), or equivalent

Additional Information

Our Benefits

  • Flexible working environment
  • Volunteer time off
  • LinkedIn Learning
  • Employee Assistance Program (EAP)

NIQ may utilize artificial intelligence (AI) tools at various stages of the recruitment process, including résumé screening, candidate assessments, interview scheduling, job matching, communication support, and certain administrative tasks that help streamline workflows. These tools are intended to improve efficiency and support fair and consistent evaluation based on job-related criteria. All use of AI is governed by NIQ’s principles of fairness, transparency, human oversight, and inclusion. Final hiring decisions are made exclusively by humans. NIQ regularly reviews its AI tools to help mitigate bias and ensure compliance with applicable laws and regulations. If you have questions, require accommodations, or wish to request human review where permitted by law, please contact your local HR representative. For more information, please visit NIQ’s AI Safety Policies and Guiding Principles: https://nielseniq.com/global/en/info/niqs-ai-safety-policies/

About NIQ

NIQ is the world’s leading consumer intelligence company, delivering the most complete understanding of consumer buying behavior and revealing new pathways to growth. In 2023, NIQ combined with GfK, bringing together the two industry leaders with unparalleled global reach. With a holistic retail read and the most comprehensive consumer insights—delivered with advanced analytics through state-of-the-art platforms—NIQ delivers the Full View™. NIQ is an Advent International portfolio company with operations in 100+ markets, covering more than 90% of the world’s population.

For more information, visit NIQ.com

Our commitment to Diversity, Equity, and Inclusion

At NIQ, we are steadfast in our commitment to fostering an inclusive workplace that mirrors the rich diversity of the communities and markets we serve. We believe that embracing a wide range of perspectives drives innovation and excellence. All employment decisions at NIQ are made without regard to race, color, religion, sex (including pregnancy, sexual orientation, or gender identity), national origin, age, disability, genetic information, marital status, veteran status, or any other characteristic protected by applicable laws. We invite individuals who share our dedication to inclusivity and equity to join us in making a meaningful impact. To learn more about our ongoing efforts in diversity and inclusion, please visit https://nielseniq.com/global/en/news-center/diversity-inclusion

Experience Level

Senior Level

Job role

Work location: Chennai, TN, India
Department: Software Engineering
Role / Category: Software Development
Employment type: Full Time
Shift: Day Shift

Job requirements

Experience: Min. 10 years

About company

Name: NielsenIQ
Job posted by NielsenIQ

