Full-time · Hybrid

Site Reliability Engineer (SRE)

This is a high-impact SRE role focused on building autonomous, self-healing systems for AI-driven cloud infrastructure. It stands out by moving away from traditional 'on-call' maintenance toward automation built on Temporal and Argo Workflows. It suits an engineer who wants to define reliability standards for the next wave of AI technology.

Hiring company

PEAK

Tel Aviv, Tel Aviv District, Israel · Posted 2 May 2026

The role

Overview

The hiring side

About PEAK

PEAK is a mid-sized employer in the technology and AI infrastructure sector, based in Tel Aviv, Tel Aviv District, Israel.

Industry

Technology / AI Infrastructure

Size

Medium

Location

Tel Aviv, Tel Aviv District, Israel

What they need

Requirements & Skills

Key Responsibilities

Architecting reliability frameworks for multi-cluster Kubernetes deployments
Developing automated self-healing and failover systems
Establishing and monitoring critical reliability metrics (SLOs/SLIs)
Managing full-stack observability operations and distributed tracing
Creating automated incident playbooks via ChatOps
Collaborating with developers to optimize service performance and resilience

Essential

Deep technical understanding of Kubernetes and container orchestration patterns
Hands-on experience with Infrastructure as Code using Terraform or Terragrunt
Proven ability to implement GitOps methodologies
Proficiency in programming with Python or Go for developing internal tools and automation
Experience managing observability stacks including Prometheus and Grafana

Preferred

Experience with Argo Workflows and Temporal for orchestration
Familiarity with Loki and Tempo for logging and tracing
Interest in integrating AI technologies into reliability and DevOps workflows
Experience implementing ChatOps for incident management

Key Skills

Kubernetes
Terraform
Terragrunt
GitOps
Python
Go
Prometheus
Grafana
Loki
Tempo
Argo Workflows
Temporal
Incident Response Automation
SLO/SLI Definition

Networking

People to Know

Hiring influence

James Mitchell

Head of Talent Acquisition

Over 12 years leading hiring strategy across EMEA. Oversees all senior-level recruitment and partners closely with department heads on headcount planning.

Hiring influence

Sarah Reynolds

Engineering Team Lead

Manages a team of 8 engineers and is closely involved in technical interviews. Previously scaled engineering teams at two high-growth startups.

Hiring influence

Alex Kim

Senior Product Manager

Drives product roadmap and cross-functional collaboration. Regularly involved in hiring for product and design roles.

Perks

Benefits & perks

Hybrid work model for better work-life balance
Opportunity to work with cutting-edge AI-driven cloud systems
Collaborative and high-tech work environment in Tel Aviv
Professional growth in advanced automation and orchestration technologies

Next step

Apply now

Apply on pickpeak.co

Found via pickpeak.co
