Site Reliability Engineer (SRE)

Key Responsibilities

As a Site Reliability Engineer, you will be part of a group that’s intensely focused on our customers and the health of our solutions. Whether it’s incident management, production support, advanced monitoring, or mentoring, SREs provide the foundation for issue triage and speedy resolution with a continuous improvement mindset.

Serve as a Tier 2 escalation point for reported issues.
Research problem tickets involving data, setup, and code issues, and correct them promptly.
Assist in prioritization of enhancement and defect resolution.
Monitor automated system alerts, log files, and other monitoring tool outputs.
Design and develop emergency patches to address critical production issues.
Manage third-party components used in the different Digital Commerce Solutions.
Perform administrative functions for application software.
Assist in providing support by participating in a weekly on-call rotation.
Interact with internal business customers, operation personnel, and development groups in troubleshooting and correction of issues.
Manage the development, quality assurance, and production application environments, working closely with operations personnel to honor application Service Level Agreements (SLA).
Perform root cause analysis on issues, leading to the implementation of processes that prevent recurring problems.
Maintain knowledge base documentation.
Work on projects to improve the Production Support model and processes.

Candidates should have experience with many of the following:

You have solved multiple problems by writing and documenting exceptional script solutions.
You have extensive experience automating solutions to identified issues/bugs/anomalies. You have a passion for replacing manual processes with efficient and concise automated solutions.
You have been responsible for running critical services that multiple customers depend upon. You understand the importance and impact that operational optimization can have on a product and the positive ripple effects that it can have across an entire organization.
You believe CI servers, push-button deploys, time-series datastores, metrics dashboards, and centralized logging are not just “nice to haves,” they are critical pieces of infrastructure that rapidly pay for themselves. You are familiar with the tool-space and can suggest products in each of these areas.
You are empathetic: You take others’ opinions into account and clearly communicate your thoughts to reach technical solutions quickly.
You consider it necessary to understand and appreciate your customers and enjoy seeing your work improve the work of others.
Mentorship and a Servant/Leader mentality
Experience in automation, specifically related to deployment, recovery, or other manual processes.
Experience using telemetry to understand throughput, limitations, and constraints in a service.
Strong problem-solving skills and passion for solving hard problems as part of a team and by individual investigation.
Experience in defining cost per transaction or per user, based on service configurations.
Experience with REST APIs, JSON, and exposure to container-based technologies.
Good understanding of SOA principles, Web Services and messaging technologies.
Experience supporting zero fault-tolerant, scalable, and high-volume systems applications in .NET.
Experience designing and developing complex .NET, C++, IIS applications.
Experience with SQL Server 2012, Transact-SQL, and stored procedures.
Strong experience with Agile development incorporating continuous integration and continuous delivery, utilizing technologies such as Git, TFS/Azure DevOps, Sonar, JUnit, Dynatrace, and Redis.
Great analytical skills and the ability to think on your feet and work under pressure.
Deep understanding of XML parsing and XML schema design.
Strong Windows/Unix platform skills and understanding of network, storage, tiered application environments, and security.
Excellent interpersonal skills along with effective communication (both written and verbal) skills.
Ability to solve complex systems and database environment issues.
Knowledge of Splunk, Graylog, Dynatrace, Application Insights or equivalent monitoring tools.
Experience with performance tuning of applications and use of load testing tools such as JMeter, LoadRunner, etc.
Experience analyzing .NET thread/heap dumps.

Experience with AWS services such as S3, Lambda, SQS, SNS, EC2, and EKS.

Minimum Requirements

Bachelor’s degree in computer information systems preferred.
4+ years of software engineering or SRE experience with a focus on problem resolution and platform optimization.
Ability to read/write/configure code using modern design techniques to automate various platform capabilities
Background building and managing end-to-end services surfacing telemetry and stitching together long-running business processes
Experience with enabling and managing cloud services, usage, and optimizations
Experience with resilience modeling (FMEA, MTTR) and the ability to automate simulation of service outages for platforms
Ability to work with service teams and own Live Site Reviews and corrective action plans
Excellent knowledge of a scripting language: Ruby, Python, and/or .NET Core
Experience working on an Azure-based, cloud-native infrastructure and managed services, including App Services, SQL Azure, and containers
Experience with Docker in a production environment including container orchestration (e.g., Nomad, Mesos, Kubernetes, etc.)
Knowledge of configuration management systems like Ansible, Chef or Terraform
Experience with infrastructure as code (Terraform or CloudFormation)
Experience with API technologies: Swagger, REST APIs, JSON, JWT, and OAuth.
Good knowledge of managing data disks and storage for Windows, both on Azure and on-premises.
Good knowledge of Windows Cluster configuration and troubleshooting.
Good experience developing Power BI reports.