MLOps Engineer – Infrastructure & Operations Management
What We Offer:
- Remote job opportunity
- Internet allowance
- Canteen Subsidy
- Night Shift allowance as per process
- Health Insurance
- Tuition Reimbursement
- Work-Life Balance Initiatives
- Rewards & Recognition
- Internal movement through IJP
WHAT YOU’LL BE DOING:
Infrastructure Management:
- Design and maintain scalable infrastructure for AI/ML model training and deployment on platforms like AWS.
- Manage GPU/CPU resources and automate infrastructure deployment using IaC tools.
CI/CD Pipeline Development:
- Develop CI/CD pipelines for continuous integration and deployment of AI models.
- Integrate monitoring tools to track model performance and detect issues early.
Model Deployment & Monitoring:
- Oversee the deployment of AI models, manage model versioning, and implement monitoring solutions.
Optimization & Resource Management:
- Optimize training jobs and monitor cloud resource usage to ensure cost-effectiveness.
Security & Compliance:
- Implement security best practices and ensure compliance with relevant regulations.
Documentation & Knowledge Sharing:
- Document infrastructure setups and contribute to the team’s knowledge base.
- Collaborate with AI team members to understand infrastructure needs and provide technical support.
WHAT WE EXPECT YOU TO HAVE:
Cloud Infrastructure:
- AWS Services: Expertise in AWS cloud services, particularly Amazon SageMaker for model training and deployment, EC2 for scalable compute resources, EKS for Kubernetes management, and S3/DocumentDB for data storage.
Infrastructure as Code (IaC):
- Proficiency in Terraform or AWS CloudFormation for automating the provisioning and management of cloud infrastructure. Experience in creating reusable, version-controlled infrastructure templates.
Containerization & Orchestration:
- Docker: Advanced knowledge of Docker for containerizing applications and managing dependencies in isolated environments.
- Kubernetes: Proficiency in Kubernetes for orchestrating containerized applications, managing clusters, and ensuring high availability of AI/ML services.
CI/CD Pipeline Development:
- CI/CD Tools: Hands-on experience with CI/CD tools such as Jenkins, GitLab CI, or CircleCI for automating the deployment and integration of AI/ML models.
- Model Versioning & Monitoring: Proficiency in integrating tools like MLflow or TensorBoard for model versioning, tracking experiments, and managing model artifacts across different environments.
Monitoring & Logging:
- Monitoring Tools: Experience with monitoring and logging tools like Prometheus and Grafana for tracking system performance, detecting anomalies, and ensuring the reliability of AI/ML systems.
- Model Performance Monitoring: Implementing and managing systems for monitoring model performance, detecting model drift, and triggering retraining processes as needed.
Optimization & Resource Management:
- Resource Allocation: Expertise in managing and optimizing the allocation of GPU/CPU resources for training and inference tasks, ensuring efficient use of computational resources.
- Cost Optimization: Strong understanding of cloud cost management strategies, including the use of spot instances, auto-scaling, and reserved instances to minimize costs while maintaining performance.
- Distributed Training: Familiarity with distributed training techniques and frameworks to accelerate large-scale model training.
Security & Compliance:
- Security Best Practices: Knowledge of security best practices for AI/ML deployments, including network security, data encryption, identity and access management (IAM), and secure API development.
- Compliance Standards: Experience ensuring compliance with industry standards and regulations (e.g., GDPR, HIPAA) in AI/ML operations, particularly in environments handling sensitive data.
Experience:
- Infrastructure Management: Proven experience in designing, deploying, and managing large-scale cloud infrastructure for AI/ML workloads, with a focus on scalability, reliability, and efficiency.
- Model Lifecycle Management: Experience in managing the end-to-end lifecycle of AI/ML models, from training and deployment to monitoring and continuous integration.
- DevOps & MLOps: Strong background in DevOps practices, with a focus on applying these principles to the deployment and management of AI/ML models in production environments.
Education:
- Academic Qualifications: Bachelor’s degree in Computer Science, DevOps, Cloud Computing, or a related field. An advanced degree or relevant certifications in cloud computing (e.g., AWS Certified DevOps Engineer) or MLOps is a plus.
- Certifications: Relevant certifications in cloud infrastructure (e.g., AWS Certified Solutions Architect, Certified Kubernetes Administrator) or security (e.g., Certified Information Systems Security Professional – CISSP) would be advantageous.
Other Skills:
- Problem-Solving: Strong analytical skills with the ability to troubleshoot and resolve complex infrastructure and deployment issues in a timely manner.
- Collaboration: Ability to work closely with AI researchers and developers to understand infrastructure needs and provide robust technical support.
- Documentation: Experience in creating clear, detailed documentation for infrastructure setups, deployment pipelines, and operational procedures to facilitate knowledge sharing and reproducibility.
To apply for this job, email your details to priya.mittal@etechtexas.com