Implementation & Infrastructure Reliability Specialist

  • Location: Remote
  • Date Posted: Jan. 7, 2022
  • Function: IT
  • Sector: Data

Gremlin’s mission is to make the Internet more reliable. We’re leading the way in the exciting, growing practice of Chaos Engineering for enterprises like Target, Twilio, and JPMorgan Chase that are building complex, distributed SaaS applications whose success depends on uptime. The Gremlin platform uncovers risks and weaknesses that aren’t addressed by traditional DevOps and IT operations processes and best practices. If paving a new path forward at the leading edge of technology sounds exciting to you, we should talk.

About the Role

The Implementation & Infrastructure Reliability Specialist works directly with Gremlin’s largest customers from onboarding through renewal to ensure our product’s implementation and strategic use. You will build and expand the customer’s technical expertise and help prove the business value of Gremlin as THE tool to support an organization’s Chaos Engineering practice. Acting as an extension of the customer’s team, you will collaborate with our customers’ highly technical engineers and engineering leaders to evangelize chaos engineering, support experimental design, and tailor targeted chaos within a customer’s applications and architecture to help them develop reliability measures. Ultimately, your work is integral to an organization’s ability to drive a business plan designed to deliver a more resilient system. In this collaborative role, you will work with Gremlin’s Customer Success Managers, Developer Advocates, and Product Teams. Our team is a tight-knit group of technically minded individuals and relationship magic managers who together help customers implement, enable, and adopt our platform and chaos engineering best practices to take our customers to the next level of reliability.

Gremlin Reliability Specialists are our customers’ deepest technical experts, wielding reliability, chaos engineering, and DevOps expertise to implement Gremlin’s application in their complex, highly secure, and regulated infrastructure. But that is just the first step: our specialists take our customers further, enabling them to use our app and API to support their use cases and reach their business goals. Specialists intimately understand the short- and long-term use of our product relative to our customers’ unique, infrastructure-driven use cases. Together, Gremlin’s Reliability Specialists and customers form a team to create experiments to understand failure points, design scenarios to test integrated systems and resolve reliability issues, and embed automation to keep their systems resilient and the SRE teams informed of potential failures.

This role requires both breadth and depth of technical knowledge and experience in DevOps, SRE, and/or Chaos Engineering, as well as the ability to establish and lead successful relationships with multiple personas across an organization.

In this role, you’ll get to:

  • Work hands-on with customers, diving deep into their implementation, onboarding, and first-tier technical needs
  • Act as an advocate for our customers, and invest the time to develop and enhance relationships with key stakeholders to earn “trusted advisor” status, naturally growing value and revenue and increasing customer satisfaction.
  • Design and execute the business plan discussed with the customer during the sales engagement
  • Understand a customer’s infrastructure as it relates to the product and the importance of chaos engineering as a practice and then help them see the path to a mature Reliability Practice
  • Identify applications or services to target for Chaos Engineering experiments and help the customer prioritize an attack and scenario rollout plan
  • Assist customers with the implementation of Gremlin’s agent and mitigate any issues that may arise using your knowledge of services, dependencies, and integrations
    • Enterprise SSO Authentication Systems (ADFS, Okta, etc.)
  • Integrate Gremlin with existing customer enterprise tools
    • CI/CD Pipelines (Jenkins, Spinnaker, etc.)
  • Perform architecture reviews with customers (application and infrastructure perspectives) to assess their current reliability and propose where and how to test to increase reliability
  • Organize, plan, and assist in running GameDays with customers
  • Provide training to customers and customer teams, both directly and train-the-trainer
  • Document each customer’s success criteria, then communicate and validate with each customer on an ongoing basis that value is being recognized. Consult with customers on best practices to increase value and ROI, ensuring we’re hitting our renewal and expansion targets.
  • Align customer goals with the Gremlin account team to drive deliverables.
  • Be a primary escalation point of contact for targeted customers.
  • Work with other internal resources to coordinate/facilitate high level demos, workshops, and training sessions to educate customers on current features based on best practices and provide visibility into current vs. future product features and capabilities.
  • Engage with the Gremlin product team to advocate for customer requests and improve their experiences.
  • Create opportunities for customer stories, whitepapers, and blog posts among assigned customers.

We’ll expect you to have:

  • 3-5+ years in SRE, DevOps, IaaS or SaaS providers, or Software Development
  • 3-5+ years consulting experience driving requirements and deliverables for customers at both the technical and executive levels
  • Experience in one of Java, Python, JavaScript or Rust programming languages
  • Strong Linux and Container experience
  • Knowledge of Kubernetes and OpenShift container orchestration platforms
  • Familiarity with IT management frameworks such as ITIL, COBIT, or eTOM
  • Experience with automation frameworks such as Puppet, Chef, or Ansible
  • Experience with CI/CD tools such as Jenkins, Spinnaker, or GitHub Actions
  • Experience working through a production outage
  • Excellent verbal and written communication skills
  • Ability to manage up and down and think creatively outside the box

Bonus Experience:

  • Experience with monitoring and observability tools such as Grafana, New Relic, Datadog, or CloudWatch
  • Familiarity with incident management tools such as PagerDuty
  • Familiarity with project management tools such as Asana, Jira, or Trello
  • Familiarity with the modern software development life cycle
  • Prior experience in test automation
  • AWS, GCP or Azure cloud certifications
  • Experience in a previous role supporting the establishment and growth of Chaos Engineering
  • Experience using Gremlin for reliability testing

If you don’t think you meet all of the criteria above but are still interested in the job, please apply. Nobody checks every box; we’re looking for candidates who are particularly strong in a few areas and have some interest and capabilities in others.

Benefits:

  • Competitive compensation
  • 401k Match
  • Stock Options
  • Flexible PTO
  • Competitive benefits package, including medical, dental, and vision insurance
  • Team Activities (currently virtual due to COVID-19)

About Gremlin:

Our founders, Kolton Andrus and Matthew Fornaciari, lived and breathed incidents, on-call, and Chaos Engineering at Amazon and Netflix. As “Call Leaders,” they were responsible for guiding teams through analyzing and resolving global outages. After a decade of developing and advocating Chaos Engineering internally, in 2016 they decided to make what they had learned available to a wider set of enterprise companies and launched Gremlin.

Since then, Gremlin has built an incredible team of industry veterans and people eager to learn from one another while pushing the entire industry forward to new heights. We’re backed by top-tier investors Index Ventures, Amplify Partners, and Redpoint Ventures. Our customers love us, and we’re thrilled to be a partner in their success.

At Gremlin, we value:

  • OUR CUSTOMERS - We won’t be a company if our customers aren’t thrilled. We live and die by our customers, so they come first.
  • ACTION - We favor small experiments to gather data rather than over-analyzing a situation. Getting stuff done always beats talking about getting stuff done.
  • CONTEXT, NOT CONTROL - We hire autonomous adults with good judgement. We provide them with the context to make smart decisions. We don’t micromanage.
  • BEING VOCALLY SELF-CRITICAL - We all make mistakes, we all have ways in which we can improve. We own that upfront, and honestly discuss ways in which we’ve personally made mistakes and can get better. Then, we encourage and help one another succeed at doing so.
  • DIVERSITY, EQUITY, & INCLUSION - We are at our best when we encourage and include the thoughts and voices of people from many diverse backgrounds into our strategy and execution. We recognize that systemic racism and gender bias are real and that we aren’t perfect, so we actively work to encourage the difficult conversations, to listen, and to change as we discover our blind spots so that Gremlin is a company all of us feel proud to be a part of.
  • FRUGALITY - We are working to build a profitable company and create a new practice in the industry. We spend money on the right things, like making sure employees have the tools they need to be successful and the company has what it needs; we simply choose not to waste what we have and not to buy what we don’t actually need.

You are welcome at Gremlin for who you are. The more voices and ideas we have represented in our business, the more we will all flourish, contribute, and build a more reliable internet. Gremlin is a place where everyone can grow and is encouraged. However you identify and whatever background you bring with you, please apply if this sounds like a role that would make you excited to come into work every day. It’s in our differences that we will find the power to keep building a more reliable internet by building and designing tools used by the best companies in the world.

We can’t wait to meet you!

Learn more about how Gremlin is defining the practice of Chaos Engineering: