Reliability Engineer – Bologna

  • Location:
    Bologna
  • Salary:
    Negotiable / year
  • Job type:
    Full-time
  • Posted:
    2 days ago
  • Category:
    Engineering
  • Deadline:
    09/12/2024

JOB DESCRIPTION

Job Summary: 

We are in search of a highly motivated reliability engineer to work in the Service Reliability team of the Application Delivery section at ECMWF. In this role, you will support some of ECMWF’s most essential platform services in the areas of observability and identity and access management.

The role requires experience in both IT systems and software development, with a focus on maintaining effective operations. Key skills include cloud-native technologies, automation, IT observability, service and application performance monitoring, logging, and analytics.

Our reliability engineers engage with, advise, steer and support services relevant to the lifecycle of application deployment and hosting, including technical strategy, design, infrastructure, software development, tooling, service transition, service operation and use.

Day-to-day, you will act as a bridge between the ECMWF Computing Department, in-house and community system/service providers, application developers, and your technical peers at our WEkEO partners (EUMETSAT, Mercator Ocean, and EEA). You will advocate good practice and build a greater understanding of architecture and design to enable reliable and performant operations of the WEkEO distributed platform.


Responsibilities:

  • Developing and supporting observability platforms for services and their underlying systems
  • Developing and supporting identity and access management (IAM) platforms for services
  • Advocating for reliability engineering within ECMWF, WEkEO, and Copernicus partners
  • Developing observability capabilities for WEkEO services and the underlying systems
  • Contributing to federated IAM including liaising with technical peers at WEkEO partners
  • Deploying open source, commercial, and proprietary software to containers, VMs, or bare metal
  • Contributing to documentation and training, including cross-training within the team
  • Participating in regular 24-hour on-call rotas for critical services in the relevant areas
  • Contributing to any other relevant areas of the team’s portfolio

Qualifications:

Education:

  • A university degree (EQF Level 6 or above) or equivalent industry experience

Experience required in the following areas:

  • Demonstrated relevant professional experience
  • Experience in configuring network, server and storage infrastructures
  • Experience in operational monitoring and application performance monitoring systems
  • Experience in identity and access management systems
  • Experience in designing and developing in a Linux-based cloud environment
  • This role would suit IT professionals with either a software development or IT Operations background

Knowledge and skills:

We encourage you to apply even if you don’t feel you meet precisely all these criteria. In particular, we welcome applications from candidates with other technical/computing backgrounds to join this multi-disciplinary team.


Demonstrable knowledge and skills in some of the following:

  • Programming (any language) or scripting (Python, Ruby, Perl, Go)
  • Observability: monitoring, logging, analytics and application tracing
  • General Linux system administration
  • Cloud-native technologies (Kubernetes, Docker)
  • Cloud IaaS (Terraform, VMware, OpenStack, Amazon, Google)
  • The server, storage and networking components of Cloud applications
  • Excellent interpersonal and communication skills, with a co-operative nature
  • Strong analytical and problem-solving skills, with a proactive continuous improvement approach
  • Self-motivated and able to work with minimal supervision
  • Dedication and enthusiasm to work in a geographically distributed team
  • Ability to work efficiently and complete diverse tasks in a timely manner

A working knowledge of some of the following is desirable:

  • Splunk, Grafana, Prometheus, Loki, ELK or similar
  • Application Performance Monitoring
  • NoSQL (InfluxDB or other TSDBs) and SQL (PostgreSQL or MySQL) databases
  • Identity and Access Management (e.g. OpenID Connect, SAML)
  • Microsoft Active Directory
  • Git version control

Please provide clear examples of your knowledge and experience in the space provided on the application form.

Candidates must be able to work effectively in English and interviews will be conducted in English. A good knowledge of one of the Centre’s other working languages (French or German) is an advantage but not required.