IBM Site Reliability Engineer (SRE) in Monterrey, Mexico

Job Description

Design, analyze, and troubleshoot large-scale distributed systems

Participate in the on-call rotation

Engage with product teams to fix production outages and carry forward action items to improve ongoing reliability

Develop effective tooling, alerts, and response procedures to identify and address reliability risks, including automatic problem detection and mitigation

Required Technical and Professional Expertise

DevOps Mindset

You enjoy solving difficult engineering problems and don’t mind getting your hands dirty

Good software engineering skills, ideally with experience in Python, Go, and/or Java

Understanding of Linux system internals and familiarity with the TCP/IP stack, network routing, and load balancing

You approach troubleshooting systematically and have a deep sense of ownership for whatever you work on

Ability to identify the root causes of instability in a high-traffic distributed system

Experience configuring and troubleshooting Linux, Java, and Docker systems

Understanding of large-scale complex systems from a reliability perspective

Passion for resolving reliability issues and identifying strategies to mitigate them going forward

Willingness to work in an ever-changing environment

You are lazy – you are passionate about automation and innovations that improve productivity

Preferred Technical and Professional Experience

English level: Independent User

EO Statement

IBM is committed to creating a diverse environment and is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status. IBM is also committed to compliance with all fair employment practices regarding citizenship and immigration status.