Think Beyond The Label Jobs

Job Information

Vertafore Sr. Manager, Site Reliability (This is not a DevOps role) in IN, United States

Vertafore is looking for a Senior Manager, SRE, to manage a complex IT systems environment and its processes, spanning on-prem and AWS Cloud. This is a critical role in SaaS operations, responsible for designing, maintaining and managing our Site Reliability Engineering (SRE) and PKI teams. The candidate is expected to have strong experience and expertise in SRE, PKI and broader infrastructure operations. This role reports to the Director, SaaS Operations.

Core Requirements and Responsibilities:

· Mentor and empower Site Reliability Engineers to deliver sound solutions for our customers within defined SLAs. The team’s primary responsibilities include analyzing and troubleshooting application and infrastructure issues, debugging and fixing code, writing database queries, setting up clients/systems and deploying changes

· Coach the team on site reliability and resiliency principles and in-depth software engineering best practices

· Build the team into the key subject matter experts on applications, their underlying architecture, and data relationships

· Serve as the primary liaison among SRE, Customer Support/Success, Global Command Center, SaaS Operations, DevOps and Development teams

· Act as technical lead on critical incident and escalation calls

· Collaborate with other teams in SaaS Operations, Development and Product Management on long-term solutions that improve the agility and scalability of products

· Identify opportunities and take the lead on automation projects that will improve processes and the usability of products for internal users.

· Ensure the team creates and maintains runbooks that outline product information and support procedures

· Create metrics and measure team performance. Lead automation projects to measure product stability, usability and SRE performance

· Hands-on software development and support as required. Willing to roll up their sleeves and assist the team whenever needed

· Communicate solutions effectively to technical and non-technical teams and regularly update leadership on project and delivery status

· Comply with standard processes and security policies when implementing solutions

· Ability to operate in complex, highly demanding, highly secure, and highly available operations environments

· Interact with technology domain experts and leaders as required to maintain and continuously enhance the department's infrastructure services.

· Excellent knowledge of monitoring and automation practices and an understanding of tools such as Dynatrace, SolarWinds, Chef and Ansible

· Must have managed a team of 20 or more people, including hiring, training, educating and mentoring them to maintain proficiency in current and future technologies within the organization

Knowledge, Skills and Abilities:

· Systems and Networking: Deep understanding of operating systems (Linux, Windows, etc.), networking protocols (TCP/IP, BGP, etc.), and infrastructure components (PKI, load balancers, firewalls, etc.).

· Cloud Technologies: Experience with major cloud providers (AWS, GCP, Azure), cloud architecture patterns, and infrastructure-as-code (Terraform, CloudFormation).

· Programming/Scripting: Proficiency in at least one programming language (Python, Go, Java) and scripting languages (Bash, PowerShell) for automation and tooling.

· Monitoring and Observability: Knowledge of monitoring tools (Dynatrace, SolarWinds) to ensure system health and performance.

· PKI: Deep understanding of PKI principles and concepts, including thorough knowledge of public key cryptography, digital certificates, certificate authorities (CAs), registration authorities (RAs), and validation authorities (VAs).

· Incident Management: Expertise in leading incident response, conducting postmortems, and implementing corrective actions to improve system reliability.

· Database Management: Familiarity with various database technologies (SQL, NoSQL) for managing and optimizing data storage.

· Security: Understanding of security best practices, vulnerability management, and incident response to protect systems and data.

· Team Management: Experience in building, motivating, and leading high-performing SRE teams, including hiring, mentoring, and performance management.

· Project Management: Ability to manage complex projects, define scope, set goals, track progress, and deliver results on time and within budget.

· Strategic Planning: Developing and executing a long-term vision for the SRE team, aligning with organizational goals, and anticipating future challenges.

· Decision-making: Ability to make sound, data-driven decisions under pressure, considering various factors and trade-offs.

· Stakeholder Management: Effectively communicating with stakeholders, building relationships, and managing expectations across different departments.

Qualifications:

· 10+ years of experience in Infrastructure and Operations

· Solid understanding of ITSM, SRE and PKI. Ability to monitor performance and proactively detect issues before they cause an outage.

· Bachelor of Science in Computer Science, Business Information Engineering, or established professional with equivalent experience.

· Must have exposure to and experience with AWS (preferred) or another cloud provider.

Additional Requirements and Details:

· Must be flexible to work odd hours, rotating shifts and weekends to support incidents, releases, maintenance activities and large project go-lives.