Senior/Staff SRE Engineer
Stellar Cyber
- Taiwan
- Permanent
- Full-time
- Administer and maintain container orchestration platforms and containerized workloads.
- Monitor and troubleshoot production systems, participating in on-call rotations to ensure reliability.
- Drive observability improvements by enhancing monitoring, logging, and alerting capabilities across systems and data platforms.
- Administer and optimize cloud-based environments across multiple providers.
- Manage and support distributed data platforms and real-time processing systems.
- Develop and maintain continuous integration and delivery pipelines for efficient and reliable deployments.
- Own and implement Infrastructure as Code (IaC) practices to ensure consistency and scalability.
- Automate and orchestrate infrastructure using programming and scripting languages.
- Perform system administration and networking tasks to support internal and external environments.
- Collaborate effectively with engineers and stakeholders across different time zones.
- 5+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles.
- Proven success leading large-scale production systems in cloud environments (AWS, GCP, Azure, or OCI).
- Demonstrated leadership in driving incident response, on-call best practices, and reliability-focused culture.
- Strong experience with production on-call operations and incident management.
- Advanced proficiency in Kubernetes administration and troubleshooting.
- Hands-on experience with observability tools: Prometheus, Grafana, Loki, and Alertmanager.
- Knowledge of chat-based operations interfaces and/or auto-remediation controllers built on AI agentic frameworks.
- Understanding of AI agents that auto-triage alerts, correlate signals, and suggest root-cause hypotheses.
- Expertise in operating data platforms (Elasticsearch, MongoDB, Spark, Kafka, Redis).
- Proficiency with public cloud services (AWS, Azure, GCP, or OCI).
- Strong programming and automation skills in Python and Bash.
- Deep understanding of Infrastructure as Code (Terraform, Helm).
- Experience with CI/CD pipelines (GitHub Actions, Bitbucket, ArgoCD).
- Strong technical background in distributed systems, databases, networking, and Linux administration.
- Excellent problem-solving, communication, and leadership abilities.
- Bachelor's degree in Computer Science, Engineering, or a related technical field.
- Certifications in AWS, GCP, observability, Linux, or Kubernetes are a plus.