SRE / Site Reliability Engineer (Middle / Senior)

Roles & Responsibilities:

Ensuring the smooth operation of software, environments and company services
Analyzing and improving the performance and availability of products
Identifying bottlenecks in the architecture and infrastructure
Improving system alerting and incident management
Improving SLI-based monitoring systems (Prometheus, Icinga, Grafana, etc.)
Defining SLIs that reflect the main business requirements
Defining SLOs for services and for the infrastructure as a whole
Minimizing system recovery time and data loss (RTO and RPO)
Analyzing incidents in the production environment
Capacity management

Requirements:

3+ years of work experience implementing, troubleshooting, and supporting infrastructure software and distributed systems
Experience supporting software written in Golang, Python, Ruby
More than 2 years of experience with virtualization and containerization technologies (containerd, Docker, k8s)
Experience setting up CI pipelines of varying complexity (Jenkins) with CD to different environments
Experience creating and maintaining fault-tolerant systems with logging coverage, monitoring, and alerting
Understanding of the "infrastructure as code" principle and the ability to test it (Ansible, Terraform)
Knowledge of network security principles (IPsec, WAF, IPS)
Experience maintaining blockchain nodes
Availability in a US time zone is required

Our Tech Stack: