MLOps Engineer – Infrastructure & Operations Management

  • Remote (work from anywhere)
What We Offer:
  • Remote job opportunity
  • Internet allowance
  • Canteen Subsidy
  • Night Shift allowance as per process
  • Health Insurance
  • Tuition Reimbursement
  • Work-Life Balance Initiatives
  • Rewards & Recognition
  • Internal mobility through Internal Job Postings (IJP)

WHAT YOU’LL BE DOING: 

Infrastructure Management: 

  • Design and maintain scalable infrastructure for AI/ML model training and deployment on platforms like AWS. 
  • Manage GPU/CPU resources and automate infrastructure deployment using IaC tools. 

CI/CD Pipeline Development: 

  • Develop CI/CD pipelines for continuous integration and deployment of AI models. 
  • Integrate monitoring tools to track model performance and detect issues early. 

Model Deployment & Monitoring:

  • Oversee the deployment of AI models, manage model versioning, and implement monitoring solutions. 

Optimization & Resource Management: 

  • Optimize training jobs and monitor cloud resource usage to ensure cost-effectiveness. 

Security & Compliance: 

  • Implement security best practices and ensure compliance with relevant regulations. 

Documentation & Knowledge Sharing: 

  • Document infrastructure setups and contribute to the team’s knowledge base.
  • Collaborate with AI team members to understand infrastructure needs and provide technical support. 

WHAT WE EXPECT YOU TO HAVE: 

Cloud Infrastructure: 

  • AWS Services: Expertise in AWS cloud services, particularly Amazon SageMaker for model training and deployment, EC2 for scalable compute resources, EKS for Kubernetes management, and S3/DocumentDB for data storage. 

Infrastructure as Code (IaC):

  • Proficiency in Terraform or AWS CloudFormation for automating the provisioning and management of cloud infrastructure. Experience in creating reusable, version-controlled infrastructure templates. 

Containerization & Orchestration: 

  • Docker: Advanced knowledge of Docker for containerizing applications and managing dependencies in isolated environments. 
  • Kubernetes: Proficiency in Kubernetes for orchestrating containerized applications, managing clusters, and ensuring high availability of AI/ML services. 

CI/CD Pipeline Development: 

  • CI/CD Tools: Hands-on experience with CI/CD tools such as Jenkins, GitLab CI, or CircleCI for automating the deployment and integration of AI/ML models. 
  • Model Versioning & Monitoring: Proficiency in integrating tools like MLflow or TensorBoard for model versioning, tracking experiments, and managing model artifacts across different environments. 
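The model-versioning requirement above can be pictured with a toy registry. In practice a team would use something like MLflow's Model Registry; this stdlib-only sketch (all class and method names are hypothetical, for illustration only) shows the core idea: immutable, monotonically versioned artifacts identified by a content hash, with lifecycle stage transitions.

```python
import hashlib


class ModelRegistry:
    """Toy in-memory model registry: each registration gets a new
    monotonically increasing version plus a content hash, and any
    version can be promoted through lifecycle stages."""

    STAGES = ("none", "staging", "production", "archived")

    def __init__(self):
        self._models = {}  # name -> list of version dicts

    def register(self, name, artifact_bytes, metrics=None):
        versions = self._models.setdefault(name, [])
        entry = {
            "version": len(versions) + 1,
            "sha256": hashlib.sha256(artifact_bytes).hexdigest(),
            "metrics": metrics or {},
            "stage": "none",
        }
        versions.append(entry)
        return entry["version"]

    def promote(self, name, version, stage):
        if stage not in self.STAGES:
            raise ValueError(f"unknown stage {stage!r}")
        entry = self._models[name][version - 1]
        # Only one production model per name: archive the old one.
        if stage == "production":
            for other in self._models[name]:
                if other["stage"] == "production":
                    other["stage"] = "archived"
        entry["stage"] = stage
        return entry
```

Registering twice yields versions 1 and 2; promoting version 2 to production automatically archives version 1, which is the usual registry semantics for a single serving slot.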

Monitoring & Logging: 

  • Monitoring Tools: Experience with monitoring and logging tools like Prometheus and Grafana for tracking system performance, detecting anomalies, and ensuring the reliability of AI/ML systems. 
  • Model Performance Monitoring: Implementing and managing systems for monitoring model performance, detecting model drift, and triggering retraining processes as needed. 
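As an illustration of the drift-detection responsibility above, here is a minimal pure-Python sketch that compares a live feature sample against its training baseline using the Population Stability Index (PSI). The bucket count and the 0.25 retraining threshold are conventional illustrative choices, not universal values.

```python
import math


def psi(baseline, live, bins=10):
    """Population Stability Index between two samples of one feature.

    Buckets are derived from the baseline's min/max; a small epsilon
    avoids log(0) for empty buckets. Conventional reading of the score:
    < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def frac(sample, i):
        left = lo + i * width
        right = lo + (i + 1) * width
        n = sum(1 for x in sample
                if left <= x < right or (i == bins - 1 and x == hi))
        return max(n / len(sample), 1e-6)

    return sum(
        (frac(live, i) - frac(baseline, i))
        * math.log(frac(live, i) / frac(baseline, i))
        for i in range(bins)
    )


def should_retrain(baseline, live, threshold=0.25):
    """Trigger signal for a retraining pipeline based on feature drift."""
    return psi(baseline, live) > threshold
```

In a real pipeline this check would run per feature on a schedule, and `should_retrain` would emit an alert or kick off a retraining job rather than just return a boolean.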

Optimization & Resource Management: 

  • Resource Allocation: Expertise in managing and optimizing the allocation of GPU/CPU resources for training and inference tasks, ensuring efficient use of computational resources. 
  • Cost Optimization: Strong understanding of cloud cost management strategies, including the use of spot instances, auto-scaling, and reserved instances to minimize costs while maintaining performance. 
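The spot-versus-on-demand trade-off in the cost bullet above can be made concrete with a back-of-the-envelope model. All rates and interruption probabilities here are hypothetical placeholders, not real AWS quotes; the point is that spot savings must be weighed against re-run overhead from interruptions (assuming checkpointed training).

```python
def expected_spot_cost(on_demand_rate, spot_discount, interrupt_prob,
                       rerun_overhead):
    """Expected hourly cost of a checkpointed training job on spot capacity.

    on_demand_rate  -- $/hour for the equivalent on-demand instance
    spot_discount   -- fraction saved on spot (e.g. 0.7 for 70% off)
    interrupt_prob  -- probability a given hour is interrupted
    rerun_overhead  -- fraction of an hour re-run per interruption
    """
    spot_rate = on_demand_rate * (1 - spot_discount)
    # Each interruption wastes rerun_overhead of a spot-priced hour.
    return spot_rate * (1 + interrupt_prob * rerun_overhead)


def spot_is_cheaper(on_demand_rate, spot_discount, interrupt_prob,
                    rerun_overhead):
    return expected_spot_cost(on_demand_rate, spot_discount,
                              interrupt_prob, rerun_overhead) < on_demand_rate
```

With a 70% discount, a 5% hourly interruption rate, and half an hour lost per interruption, the expected spot cost stays well under the on-demand rate; with no discount, any interruption overhead makes spot strictly worse.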
  • Distributed Training: Familiarity with distributed training techniques and frameworks to accelerate large-scale model training. 

Security & Compliance: 

  • Security Best Practices: Knowledge of security best practices for AI/ML deployments, including network security, data encryption, identity and access management (IAM), and secure API development. 
  • Compliance Standards: Experience ensuring compliance with industry standards and regulations (e.g., GDPR, HIPAA) in AI/ML operations, particularly in environments handling sensitive data. 

Experience: 

  • Infrastructure Management: Proven experience in designing, deploying, and managing large-scale cloud infrastructure for AI/ML workloads, with a focus on scalability, reliability, and efficiency. 
  • Model Lifecycle Management: Experience in managing the end-to-end lifecycle of AI/ML models, from training and deployment to monitoring and continuous integration. 
  • DevOps & MLOps: Strong background in DevOps practices, with a focus on applying these principles to the deployment and management of AI/ML models in production environments. 

Education: 

  • Academic Qualifications: Bachelor’s degree in Computer Science, DevOps, Cloud Computing, or a related field. An advanced degree or relevant certifications in cloud computing (e.g., AWS Certified DevOps Engineer) or MLOps is a plus. 
  • Certifications: Relevant certifications in cloud infrastructure (e.g., AWS Certified Solutions Architect, Certified Kubernetes Administrator) or security (e.g., Certified Information Systems Security Professional – CISSP) would be advantageous. 

Other Skills: 

  • Problem-Solving: Strong analytical skills with the ability to troubleshoot and resolve complex infrastructure and deployment issues in a timely manner. 
  • Collaboration: Ability to work closely with AI researchers and developers to understand infrastructure needs and provide robust technical support. 
  • Documentation: Experience in creating clear, detailed documentation for infrastructure setups, deployment pipelines, and operational procedures to facilitate knowledge sharing and reproducibility. 

To apply for this job email your details to priya.mittal@etechtexas.com

Location: Gandhinagar