GenAI Ops Engineer

Location: Remote
Contract Duration: 3-month contract (with possibility of extension)

Job Overview:
We are seeking a skilled GenAI Ops Engineer to join a 3+ month platform support project. The ideal candidate will play a critical role in ensuring the smooth operation of AI/ML models and APIs, providing both user-level and project-level support. You will work in an agile environment alongside a small, collaborative team to maintain and optimize platform operations on AWS SageMaker and Kubernetes.

Key Responsibilities:

User-Level Support:
- Support users by troubleshooting access issues, responding to inquiries, and providing training and documentation for effective use of GenAI tools and platforms.
Project-Level Support:
- Handle new requests and escalations related to GenAI models and APIs.
- Provide hands-on maintenance of deployed AI/ML models and ensure the platform is functioning optimally.
Platform Maintenance and Engineering:
- Oversee infrastructure, particularly AWS SageMaker, to ensure model deployments are efficient and reliable.
- Collaborate with platform engineering teams to support SageMaker Inference and Kubernetes services and to troubleshoot any issues that arise.
Model and API Management:
- Maintain and optimize the API layer, ensuring fast and reliable access to deployed models.
- Work with TensorRT, TGI, and similar frameworks to manage inference for Large Language Models (LLMs).

Required Skills and Experience:

AWS SageMaker Inference:
Experience in deploying and managing AI models on AWS SageMaker or a similar ML platform.
Kubernetes Service Layer:
Hands-on experience with Kubernetes, particularly in managing service layers implemented in Golang.
TGI or LLM Frameworks:
Exposure to TensorRT, TGI, or other LLM inference frameworks is essential, especially for troubleshooting and optimizing model performance.
Golang:
Experience with Golang is a plus, particularly if you've worked with proxies or backend services in Golang.

Nice to Have:

Experience working in an agile team environment.
Experience troubleshooting issues at both the platform and application levels.