VLA/VLM Engineer
Position: Application Deployment Division
Openings: 1
Application deadline: Not yet determined

Job Summary:

The VLA/VLM Engineer will design and deploy vision-language-action models that enable humanoid robots to perceive scenes, understand context, and act intelligently. The role bridges computer vision, natural language processing, and decision-making, empowering robots to interpret human instructions, recognize behaviors, and perform real-world tasks safely and autonomously.
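
As a rough, hypothetical illustration of that perception-reasoning-action cycle, the Python sketch below shows how the pieces fit together; robot, vlm, and policy and all of their methods are placeholder names, not a real API:

    # Hypothetical perceive -> understand -> act loop; every name here is a
    # placeholder standing in for real perception, VLM, and control components.
    def run_episode(robot, vlm, policy, instruction: str) -> None:
        while not robot.task_done():
            obs = robot.get_observation()             # RGB/Depth, LiDAR, audio
            context = vlm.describe(obs, instruction)  # vision-language grounding
            action = policy.decide(context, obs)      # VLA policy -> motor command
            robot.execute(action)                     # hand off to the control layer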

Position: Vision-Language-Action (VLA/VLM) Engineer
Department: AI Application
Location: Gia Lam, Ha Noi
Reports to: Head of AI Application

Key Responsibilities:

  • Design, train, and deploy vision-language-action (VLA) models that combine visual perception, language understanding, and robotic control.
  • Develop multimodal AI pipelines that process RGB/Depth video, LiDAR, and audio data for perception and reasoning.
  • Implement VLMs (e.g., LLaVA, GPT-4V, BLIP-2, Florence-2, Kosmos-2, or Gemini-based models) for scene understanding and natural interaction with humans (see the sketch after this list).
  • Integrate AI models with the humanoid robot’s control and decision layers, enabling the robot to interpret human commands and respond intelligently.
  • Build and optimize prompt-based reasoning and visual grounding systems for human-robot dialogue and situational awareness.
  • Conduct experiments in simulation and real-world environments to test perception–reasoning–action loops.
  • Collaborate with cross-functional teams (Vision, RL, Motion, Control, Conversation) to ensure seamless end-to-end integration.
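
For concreteness, the snippet below is a minimal sketch of the VLM-based scene understanding mentioned above, using a public BLIP-2 checkpoint through the Hugging Face transformers library; the model choice, prompt, and image path are illustrative assumptions, not a prescribed stack:

    import torch
    from PIL import Image
    from transformers import Blip2Processor, Blip2ForConditionalGeneration

    # Load a public BLIP-2 checkpoint (illustrative; any supported VLM would do).
    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    model = Blip2ForConditionalGeneration.from_pretrained(
        "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
    ).to("cuda")

    # Ask a grounded question about a single RGB frame (hypothetical file name).
    image = Image.open("scene.jpg")
    prompt = "Question: What objects are on the table? Answer:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
    output_ids = model.generate(**inputs, max_new_tokens=40)
    print(processor.decode(output_ids[0], skip_special_tokens=True))

On a robot, the same query would typically run through a quantized or TensorRT-compiled model on edge hardware rather than a full-precision GPU checkpoint.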

Requirements:

Must Have:

  • Bachelor’s, Master’s, or PhD degree in Computer Science, Artificial Intelligence, Robotics, or a related field.
  • Strong programming skills in Python and C++; experience with PyTorch or TensorFlow.
  • Solid understanding of multimodal learning (vision + language + action) and transformer architectures (ViT, CLIP, BLIP, Flamingo, or LLaVA).
  • Hands-on experience training or fine-tuning VLMs or LLMs with vision input for image/video captioning, grounding, or reasoning.
  • Experience with dataset preparation and annotation for multimodal tasks (e.g., VQA, instruction-following, embodied navigation).
  • Knowledge of deployment and inference optimization on edge hardware such as NVIDIA Jetson, AGX Orin, or RTX platforms.
  • Familiarity with ROS/ROS2, real-time inference, and integrating AI perception modules into robotic systems (a minimal node sketch follows this list).
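
As one way to picture the last requirement, here is a minimal ROS 2 (rclpy) node sketch of the subscribe-infer-publish pattern; the topic names, node name, and the stubbed-out inference step are all assumptions for illustration:

    import rclpy
    from rclpy.node import Node
    from sensor_msgs.msg import Image
    from std_msgs.msg import String

    class PerceptionBridge(Node):
        # Subscribes to camera frames, runs a (stubbed) VLM inference step,
        # and publishes the result for downstream decision layers.
        def __init__(self):
            super().__init__("perception_bridge")
            self.sub = self.create_subscription(
                Image, "/camera/color/image_raw", self.on_frame, 10)
            self.pub = self.create_publisher(
                String, "/perception/scene_description", 10)

        def on_frame(self, msg: Image):
            # Placeholder for the real inference call (e.g., a TensorRT-optimized
            # VLM running on a Jetson-class device).
            caption = f"frame at t={msg.header.stamp.sec}s: <scene description>"
            self.pub.publish(String(data=caption))

    def main():
        rclpy.init()
        rclpy.spin(PerceptionBridge())
        rclpy.shutdown()

    if __name__ == "__main__":
        main()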

Nice to Have:

  • Experience with Vision-Language-Action models (RT-2, OpenVLA, PaLM-E, or GR-1).
  • Research or publications in robotics, multimodal AI, or embodied intelligence (CVPR, ICRA, NeurIPS, CoRL, RSS).
  • Experience in robotic control via language (e.g., natural language navigation, human instruction following).
  • Understanding of reinforcement learning with multimodal feedback.
  • Familiarity with safety alignment and hallucination mitigation for large vision-language models.

Benefits:

  • Competitive salary and benefits package (salary negotiable).
  • Opportunities for professional development and career growth.
  • Flexible work arrangements.
  • A collaborative and innovative work environment where ideas are valued and creativity is encouraged.
