Title: Site Reliability Engineer, Group Digital Product
Kuala Lumpur, MY
Job summary
Site Reliability Engineering (SRE) ensures the scalability, reliability, and performance of systems by applying software engineering principles to IT operations. It focuses on automation, incident response, monitoring, and optimizing system availability.
General responsibilities
- Oversee and ensure the continuous health, performance, and stability of SAP CX, Cloud, and .NET applications across the enterprise ecosystem.
- Establish and manage comprehensive monitoring frameworks and dashboards to proactively identify risks, performance bottlenecks, and system anomalies.
- Lead and coordinate first-line support and incident response efforts to ensure minimal downtime and seamless user experience.
- Drive root cause analysis and implement preventive measures to enhance application reliability and service continuity.
- Plan, manage, and govern deployment and release processes for SAP CX and .NET applications, ensuring controlled rollouts with zero disruption to production environments.
- Optimize and maintain operational tools, processes, and environments to support scalable and efficient application performance.
- Oversee cloud operations, ensuring optimal utilization, cost management, and performance of applications deployed on Azure and other cloud platforms.
- Collaborate with business and technology stakeholders to understand operational requirements, define SLAs, and deliver effective technical support and service improvements.
- Communicate effectively with cross-functional teams and senior leadership to provide updates, insights, and recommendations on system performance and improvement plans.
- Foster strong partnerships across teams to align operational objectives with business goals and project deliverables.
- Coach and mentor team members to build technical depth, accountability, and a culture of continuous improvement.
- Champion initiatives aimed at improving application reliability, scalability, and performance through automation and best practices.
- Stay ahead of industry trends, evaluating emerging tools and technologies to enhance operational excellence and ensure the team leverages modern methodologies.
Functional skills and knowledge
- Strong understanding of infrastructure monitoring, first-level support, and deployment and release management.
- Experience managing the operation and maintenance of tools and services.
- Excellent stakeholder management skills.
- Proven expertise in managing and optimizing applications on Azure and other cloud platforms.
- Familiarity with multiple technology stacks and cloud applications.
- Strong problem-solving and analytical skills.
- Excellent communication and collaboration skills.