Published on April 25, 2025. Modified on May 13, 2025.
As a Site Reliability Engineer (SRE) at Centific, you will be responsible for implementing and maintaining highly available, scalable, and secure infrastructure and services.
This role involves developing automation solutions to enhance reliability, performance, and incident response while ensuring operational efficiency.
You will collaborate with software engineering, infrastructure, and DevOps teams to proactively identify potential issues, prevent system failures, and drive continuous improvements across cloud and on-prem environments.
This role is hands-on and requires expertise in system reliability, automation, cloud infrastructure, and incident response.
Key Responsibilities :
Reliability Engineering & Infrastructure Automation :
- Implement scalable and highly available systems to improve system resilience.
- Automate manual operational tasks using Python/Bash to improve system performance and reliability.
- Develop and maintain Infrastructure as Code (IaC) solutions using Terraform/Ansible.
- Apply auto-scaling, load balancing, and failover strategies for cloud-based applications.
- Work with cloud services such as AWS/Azure/GCP to optimize infrastructure provisioning and scaling.
- Develop and deploy self-healing mechanisms for automated remediation of system failures.
Incident and Problem Management :
- Follow incident response playbooks to streamline on-call troubleshooting and resolution.
- Apply knowledge of ITIL v3/v4.
- Automate orchestration workflows using any ITSM tool.
- Participate in production incident resolution, conduct root cause analysis (RCA), and assist in implementing permanent fixes.
- Improve system fault tolerance using chaos engineering tools (Chaos Monkey/LitmusChaos) to test failure scenarios.
- Support disaster recovery (DR) plans with backup, restore, and failover strategies.
- Participate in regular failover drills and game days to validate recovery strategies and incident handling efficiency.
Performance Optimization & Capacity Planning :
- Assist in system performance analysis through capacity planning, latency tracking, and traffic analysis.
- Support monitoring of Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets to ensure uptime and performance targets are met.
- Work with DevOps and infrastructure teams to ensure systems are scalable and meet business growth demands.
- Leverage predictive analytics to proactively detect capacity bottlenecks and optimize resource allocation.
Security, Compliance & Best Practices :
- Follow security best practices in cloud and on-prem environments.
- Support compliance with standards and regulations such as GDPR, HIPAA, and ISO 27001 in reliability and monitoring solutions.
- Adhere to role-based access control (RBAC) policies and encryption standards, and support vulnerability assessments.
- Knowledge of automated security scanning and monitoring to detect vulnerabilities and misconfigurations in real time.
Monitoring, Observability & Performance Optimization :
- Deploy and configure monitoring, logging, and alerting tools such as Elastic Stack (ELK)/New Relic.
- Establish real-time alerting mechanisms using Prometheus Alertmanager/PagerDuty/Opsgenie to proactively detect failures.
- Work with developers and DevOps teams to instrument applications with OpenTelemetry/Jaeger/AWS X-Ray for distributed tracing.
- Implement log aggregation pipelines using Fluentd/Graylog to centralize logs for troubleshooting and analytics.
- Optimize metrics ingestion pipelines to maintain performance efficiency with minimal overhead.
- Establish Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets to ensure uptime and performance targets are met.
CI/CD & DevOps Integration :
- Contribute to highly efficient CI/CD pipelines using Jenkins/GitHub Actions/GitLab CI/CD.
- Work with developers to integrate reliability principles into software development workflows.
- Assist in progressive delivery strategies such as blue-green deployments and canary releases to minimize production impact.
- Automate deployment rollback mechanisms to improve system stability and reduce downtime.
Must-Have Qualifications :
- Education : Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- Experience : 3+ years of experience in site reliability engineering, cloud infrastructure, and DevOps automation.
- Cloud & Infrastructure : Practical expertise in AWS/Azure/GCP, with experience in cloud networking, storage, and computing services.
- Automation & Scripting : Proficiency in Python/Go/Bash to build automation scripts and tools.
- CI/CD & Infrastructure Automation : Experience in managing CI/CD pipelines with Jenkins/GitHub Actions/GitLab CI/CD.
- High Availability & Performance Optimization : Knowledge of auto-scaling, load balancing, and performance tuning.
- Incident Response & RCA : Ability to assist in production incident response and apply RCA methodologies.
Good to Have Qualifications :
- Certifications : AWS Certified Solutions Architect, Google Cloud Professional Engineer, or Certified Kubernetes Administrator (CKA).
- Chaos Engineering : Experience with Chaos Monkey or LitmusChaos for testing system resilience.
- Kubernetes & Containerization : Familiarity with Kubernetes cluster management and container orchestration.
- Security & Compliance : Experience in implementing security policies, access controls, and vulnerability assessments.
- Experience with Predictive Analytics : Knowledge of AI/ML techniques for proactive failure detection and automated incident response.
Soft Skills :
- Strong problem-solving and analytical thinking to diagnose and troubleshoot complex system failures efficiently.
- Ability to collaborate effectively with development, DevOps, and infrastructure teams to integrate reliability best practices.
- Strong verbal and written communication skills to explain technical issues clearly to both engineering and non-technical teams.
- Ability to remain calm under pressure during high-severity incidents and make well-reasoned decisions.
- Adaptability to work in dynamic environments with evolving infrastructure, tools, and business requirements.
- Resilience and stress management to handle on-call rotations, production outages, and critical system failures.
Why Join Centific?
- Innovative Work Environment : Work on cutting-edge SRE and infrastructure reliability solutions.
- Global Impact : Contribute to mission-critical systems used by Fortune 500 companies and industry leaders.
- Competitive Compensation : Attractive salary package, bonuses, and performance-based incentives.
- Career Growth : Access to certification sponsorships, upskilling programs, and industry-leading training sessions.
- Work-Life Balance : Hybrid work model, flexible schedules, and wellness programs for employees.
- Technology & Tools : Gain hands-on experience with the latest cloud, DevOps, and automation technologies.
Shift & Work Schedule :
- This role may require working in a 24/7 shift rotation, including night shifts and on-call duties to support real-time monitoring and incident response.
This role is for professionals passionate about automation, scalability, and site reliability engineering. Be a part of Centific and shape the future of operational excellence.