Full-time · Hybrid

Site Reliability Engineer (SRE)

This is a high-impact SRE role focused on building autonomous, self-healing systems for AI-driven cloud infrastructure. It stands out by moving away from traditional 'on-call' maintenance toward sophisticated automation using Temporal and Argo. It is an ideal spot for an engineer who wants to define reliability standards for the next wave of AI technology.

Hiring company

PEAK

Tel Aviv, Tel Aviv District, Israel · Posted 2 May 2026

The hiring side

About PEAK

PEAK is a mid-sized employer in the technology / AI infrastructure sector, based in Tel Aviv, Tel Aviv District, Israel.

Industry: Technology / AI Infrastructure

Size: Medium

Location: Tel Aviv, Tel Aviv District, Israel

What they need

Requirements & Skills

Key Responsibilities

  • Architecting reliability frameworks for multi-cluster Kubernetes deployments
  • Developing automated self-healing and failover systems
  • Establishing and monitoring critical reliability metrics (SLOs/SLIs)
  • Managing full-stack observability operations and distributed tracing
  • Creating automated incident playbooks via ChatOps
  • Collaborating with developers to optimize service performance and resilience

Essential

  • Deep technical understanding of Kubernetes and container orchestration patterns
  • Hands-on experience with Infrastructure as Code using Terraform or Terragrunt
  • Proven ability to implement GitOps methodologies
  • Proficiency in programming with Python or Go for developing internal tools and automation
  • Experience managing observability stacks including Prometheus and Grafana

Preferred

  • Experience with Argo Workflows and Temporal for orchestration
  • Familiarity with Loki and Tempo for logging and tracing
  • Interest in integrating AI technologies into reliability and DevOps workflows
  • Experience implementing ChatOps for incident management

Key Skills

Kubernetes · Terraform · Terragrunt · GitOps · Python · Go · Prometheus · Grafana · Loki · Tempo · Argo Workflows · Temporal · Incident Response Automation · SLO/SLI Definition

Perks

Benefits & perks

  • Hybrid work model for better work-life balance
  • Opportunity to work with cutting-edge AI-driven cloud systems
  • Collaborative and high-tech work environment in Tel Aviv
  • Professional growth in advanced automation and orchestration technologies

Found via pickpeak.co
