SRE / Site Reliability Engineer (Middle / Senior)
Roles & Responsibilities:
- Ensuring the smooth operation of software, environments and company services
- Analyzing and improving the performance and availability of products
- Identifying bottlenecks in the architecture and infrastructure
- Improving system alerting and incident management
- Improving SLI-based monitoring systems (Prometheus, Icinga, Grafana, etc.)
- Formalizing SLIs in line with the main business requirements
- Defining SLOs for services and for the infrastructure as a whole
- Minimizing system recovery time and data loss (RTO and RPO)
- Analyzing incidents in the production environment
- Capacity management
Requirements:
- 3+ years of work experience implementing, troubleshooting, and supporting infrastructure software and distributed systems
- Experience supporting software written in Golang, Python, and Ruby
- More than 2 years of experience with virtualization and containerization technologies (containerd, Docker, k8s)
- Experience setting up CI pipelines of varying complexity (Jenkins) with CD to different environments
- Experience creating and maintaining fault-tolerant systems with logging, monitoring, and alerting coverage
- Understanding of the "infrastructure as code" principle and the ability to test it (Ansible, Terraform)
- Understanding of network security principles (IPsec, WAF, IPS)
- Experience maintaining blockchain nodes
- Availability in a US time zone is required
Our Tech Stack:
- Infrastructure: Bare-metal / AWS
- Databases: ClickHouse / MySQL
- SCM: Git / GitHub
- Message broker: Kafka
- Repository: Nexus
- CI/CD: Jenkins
- Monitoring: Icinga 2, Grafana, Prometheus, VictoriaMetrics, ELK
- Orchestration: k8s, Ansible, Terraform
- Containers: LXC, Docker
- Scripting: Python, Golang, Ruby, Groovy
- OS: Debian/Ubuntu
- Others: Docker Compose, IPsec
Interview Process:
1. Exploratory Interview
2. Technical Interview I
3. Technical Interview II
4. Challenge
5. HR decision