Senior/Staff Site Reliability Engineer (SRE)
Since 2014, Morressier has been helping a growing number of scientific and academic societies and publishers harness the power and scalability of the digital world to make scientific research more accessible. We connect the thinkers, the curious, and the inquisitive so they can discover, absorb, and collaborate on research. Our platform supports research through peer review workflows, virtual and hybrid conferencing, research libraries, and integrity checks.
As we continue to grow, we are looking for a Senior or Staff level Site Reliability Engineer to ensure that our underlying infrastructure runs smoothly and that our systems and tools work as expected, so that our users have uninterrupted access to the platform and can accelerate scientific breakthroughs!
At Morressier, SREs are a blend of pragmatic operators and software craftspeople who apply sound engineering principles, operational discipline, and mature automation to our environments and the Morressier codebase. You’ll be joining the Core Engineering team, where we produce and support essential tools and services that help the Product Engineering teams succeed. Our goals are stability, reliability, and consistency across all areas of the Morressier platform. You will help us develop and automate solutions in these areas, while contributing to the standardization of our architectures, technologies, and the way we build applications and services.
As an SRE in Core Engineering, you will:
• Work closely with Product Engineering teams to establish strong operational readiness across the organization
• Help to scale systems through automation, improving change velocity without sacrificing performance and reliability
• Treat monitoring and alerting as first-class citizens, extending them through OpenTelemetry-based instrumentation and application and system logging, and enable Product Engineering teams in their use of Honeycomb
• Document your work so that your findings turn into repeatable procedures, and then into automation
• Work with other engineering stakeholders on resolving larger architectural bottlenecks, and identify areas for improvement as the platform scales
• Aid in debugging production issues across services and levels of the stack, and help to complete root cause analyses (RCAs) as needed
You may be a great fit for the role if you:
• Are self-driven and a fast learner, eager to deliver great functionality
• Can structure and organize your work easily
• Have a strong sense for action and know how to iterate through a problem quickly
• Have hands-on experience with Docker and Kubernetes, including managing Kubernetes resources gracefully
• Have experience working with complex event-driven architectures
• Know about creating and optimizing CI/CD pipelines and processes
• Have worked with MongoDB and PostgreSQL and can articulate the trade-offs in reliability and scalability between them
• Have worked with Google Cloud Platform (GCP), ideally with Google Kubernetes Engine (GKE) and Pub/Sub
Projects you could work on:
• Coding infrastructure automation with Terraform and managing Kubernetes resources
• Expanding and deepening our use of observability and monitoring via code instrumentation and telemetry
• Developing relationships with Product Engineering groups, defining their SLIs, SLOs, and metrics, and improving their reliability and performance
• Working with engineering stakeholders to establish Disaster Recovery and High Availability strategies
• Standardizing tooling, practices, and technical considerations across the Engineering departments
*Even if you don't meet all of these requirements, we would still like to hear from you.
Could you be the newest member of our team? Apply now and let’s find out.