Our client is a leading consumer genetics and research company committed to helping people access, understand, and benefit from the human genome. With the world’s largest crowdsourced platform for genetic research, they are dedicated to innovation and improving accessibility.
This team focuses on supporting backend services that power key consumer-facing products. You’ll play a critical role in maintaining and optimizing data storage systems, managing computing resources, and ensuring the stability of core infrastructure. Responsibilities include supporting existing services, filling documentation gaps, managing permissions and access roles, and eventually participating in on-call rotations (with no immediate response expectations). The role will also contribute to long-term efforts like addressing audit findings and improving system reliability across departments.
Responsibilities:
Incident Management: Lead and participate in the investigation, diagnosis, and resolution of production incidents affecting back-office systems, ensuring timely communication and minimizing impact.
System Reliability & Availability: Proactively identify potential failure points and implement strategies to improve system reliability, availability, and overall stability.
Performance Optimization: Analyze and optimize the performance of our back-office systems, with a specific focus on database performance tuning and query optimization.
Infrastructure as Code (IaC): Design, implement, and maintain our cloud infrastructure using Infrastructure as Code principles with Terraform and AWS CloudFormation.
Automation: Develop and implement automation scripts and tools (primarily in Python) to streamline operational tasks, improve efficiency, and reduce manual errors.
Batch Processing: Design, develop, and maintain robust and efficient Python-based batch processing jobs for various back-office workflows.
Capacity Planning: Participate in capacity planning activities to ensure our systems can handle future growth and demand.
Monitoring & Alerting: Design, implement, and maintain comprehensive monitoring and alerting systems to provide real-time visibility into system health and performance.
Documentation: Create and maintain clear and concise documentation for system architecture, operational procedures, incident post-mortems, and knowledge base articles.
Collaboration: Work closely with development, operations, and other cross-functional teams to ensure seamless integration and deployment of new features and services.
On-Call Rotations: Participate in an on-call rotation to provide support for critical production issues outside of regular business hours.
System Enhancements: Contribute to the design and implementation of enhancements to our existing back-office systems to meet evolving business needs.
Requirements:
5+ years of experience in a Site Reliability Engineering (SRE), DevOps, or similar role focused on production systems.
Strong practical experience with Amazon Web Services (AWS) and a deep understanding of core AWS services (e.g., EC2, S3, RDS, IAM).
Proven ability to develop and maintain infrastructure using Terraform and AWS CloudFormation.
Excellent programming skills in Python, with specific experience in developing and managing batch processing jobs.
Solid understanding of database principles and experience with database performance tuning and optimization (experience with the specific database technologies the client uses would be a plus).
Experience with monitoring and logging tools (e.g., CloudWatch, Prometheus, Grafana, ELK stack).
Strong troubleshooting and problem-solving skills with a systematic approach to identifying root causes.
Excellent communication, collaboration, and documentation skills.
Experience participating in on-call rotations and effectively responding to production incidents.
Ability to work independently and manage priorities in a fast-paced environment.
Extra points if you have:
Experience with containerization technologies like Docker and orchestration tools like Kubernetes.
Familiarity with configuration management tools like Ansible or Chef.
Experience with CI/CD pipelines and automation tools.
Knowledge of networking principles and security best practices in a cloud environment.
Experience working in an Agile development environment.