Site Reliability Engineer (SRE)
- Kuala Lumpur
- Develop an in-depth understanding of the business; take responsibility for high-availability governance of the financial business and for continuous improvement of business SLAs.
- Through continuous, comprehensive data operations (including availability indicators, historical incidents, resource utilization, etc.), find system weaknesses and implement improvement projects;
- Continuously refine the monitoring system, improve monitoring efficiency, and shorten the time needed to locate faults;
- Ensure the efficient and stable operation of the IaaS and PaaS infrastructure behind business systems, continuously improve O&M specifications, and refine standard operating procedures;
- Monitor and review the soundness of the system architecture and process logic, along with system performance, stability, and other technical areas and indicators, and drive the project business team to solve problems;
- Respond to production faults immediately and act as the overall coordinator, organizing R&D, operations and maintenance, product, and other relevant parties to jointly investigate and resolve problems; own the fault response time and the fault resolution time (MTTR);
- Guide the evolution of basic SRE O&M work towards automation, platformization, and intelligence, and improve the overall O&M management efficiency of each infrastructure component.
- Build up operational best practices, provide guidance for business architecture design and component selection, and produce O&M technical documents.
- Write relevant documents and regularly share technical and management results.
- Bachelor's degree in computer science; several years of experience in development, operations and maintenance, or SRE at a medium or large Internet or financial company; 3 years of production maintenance experience with message middleware/cache/k8s/database.
- Proficient in shell programming and in 1-2 of the following programming languages: Golang, Java, Python;
- Good knowledge of computing, storage, networking, security, and computer architecture;
- Familiar with basic networking principles and with technologies such as TCP/UDP, HTTP, sockets, and CDNs;
- Proficient in the working principles, deployment, and use of common middleware and databases such as Nginx, LVS, Redis, Kafka, MySQL, and Elasticsearch.
- Familiar with Jenkins, GitLab, and similar tools, with practical experience in CI/CD process development and integration;
- Familiar with Docker/k8s container platform and related underlying technologies and principles;
- Familiar with Internet technical architecture, with a deep understanding of network communication protocols, application servers, load balancing, and microservice architecture;
- Familiar with common Internet components, with a deep understanding of message middleware, distributed caches, and databases;
- Extensive troubleshooting experience in service or middleware operations and maintenance, with a systematic understanding of, and hands-on experience with, common latent system risks and system failures;
- Able to respond to and handle faults 24/7, with strong resilience under pressure, good service awareness, and team spirit;
- Cheerful and outgoing, with good cross-team communication skills, a strong sense of responsibility, excellent motivation, and a drive for excellence.
- Meticulous and thoughtful, with strong data analysis and problem-solving skills;
- Experience in assisting cross-regional remote projects is preferred;
- Relevant technical work experience in securities firms, futures companies, or blockchain;
- Experience in the development of complete automated O&M tools is preferred;
Interested in joining our team and exploring your talents in different parts of the world? Worry not: Doo Prime will also provide a work visa, if applicable.
Please send your resume, your personal and professional certificates, and your job application to our HR mailbox: [email protected]
We will contact you soon if the requirements are met.