Job position: Site Reliability Engineer
We are seeking an experienced and motivated Site Reliability Engineer (SRE) to join a high-performing team supporting multiple data product and platform groups. This role is focused on improving the reliability, scalability, observability, deployment, and operational support of critical data-driven platforms and services operating within complex production environments.
The successful candidate will work closely with engineering, platform, and operational support teams to strengthen monitoring and alerting capabilities, improve logging and traceability, troubleshoot incidents, support deployments, and automate operational processes wherever possible. The environment includes Kubernetes, Helm, the ELK stack, and a broad range of modern Site Reliability Engineering and cloud platform practices.
This is a hands-on technical role suited to someone who thrives in fast-paced operational environments, enjoys solving complex production issues, and is passionate about automation, platform reliability, and continuous improvement. The role requires strong collaboration with both client stakeholders and engineering teams to ensure operational excellence, platform resilience, and service availability across critical systems.
Candidate profile
- Manage and support Kubernetes clusters and Helm-based deployments across multiple environments.
- Enhance monitoring, alerting, logging, and observability solutions to improve operational visibility and system reliability.
- Investigate incidents, analyse logs, identify root causes, and drive timely resolution of production issues.
- Participate in incident response, post-incident reviews, and continuous service improvement activities.
- Automate operational tasks and repetitive support activities to reduce manual effort and improve platform efficiency.
- Collaborate with engineering and data platform teams to improve scalability, resilience, deployment reliability, and operational maturity.
- Develop and maintain operational documentation, support procedures, runbooks, and troubleshooting guides.
- Contribute to reliability engineering initiatives including proactive monitoring, service health management, operational readiness, and platform optimisation.
- Support deployment activities, release processes, and production change management activities.
Required qualifications to be successful in this role
- Strong commercial experience in Site Reliability Engineering, DevOps, Platform Engineering, or Production Support environments.
- Strong hands-on experience with Kubernetes and Helm within enterprise or production environments.
- Proven experience supporting mission-critical production platforms and operational support environments.
- Strong experience with the ELK stack (Elasticsearch, Logstash, Kibana) for logging, monitoring, troubleshooting, and operational analysis.
- Demonstrated capability in log analysis, incident investigation, troubleshooting, and root cause analysis.
- Strong understanding of, and practical experience with, core SRE practices.
- Experience working with data platforms, analytics platforms, or data product teams would be highly advantageous.
- Experience with scripting and automation technologies such as Bash, Python, or similar would be beneficial.
- Exposure to CI/CD pipelines, Infrastructure as Code, cloud-native platforms, or observability tooling would be desirable.
- Strong communication, stakeholder engagement, and collaboration skills.
- Ability to work effectively within fast-paced operational support environments while managing competing priorities and deadlines.
Security Clearance
- Candidate must be willing and able to work onsite at the client location five days per week.
- Candidate must already hold current HLC clearance (mandatory requirement).
- Previous experience working within secure, government, defence, or highly regulated environments will be highly regarded.
- Due to client security requirements, only candidates meeting the required clearance criteria will be considered.
Working environment
Life at CGI is rooted in ownership, teamwork, respect and belonging. Here, you’ll reach your full potential because…
You are invited to be an owner from day 1 as we work together to bring our Dream to life. That’s why we call ourselves CGI Partners rather than employees. We benefit from our collective success and actively shape our company’s strategy and direction.
Your work creates value. You’ll develop innovative solutions and build relationships with teammates and clients while accessing global capabilities to scale your ideas, embrace new opportunities, and benefit from expansive industry and technology expertise.
You’ll shape your career by joining a company built to grow and last. You’ll be supported by leaders who care about your health and well-being and provide you with opportunities to deepen your skills and broaden your horizons.
Come join our team—one of the largest IT and business consulting services firms in the world.
