Research Engineer, Computer Vision - Learning From Videos (LFV)
Company: Toyota Research Institute
Location: Los Altos
Posted on: April 1, 2026
Job Description:
At Toyota Research Institute (TRI), we’re on a mission to
improve the quality of human life. We’re developing new tools and
capabilities to amplify the human experience. To lead this
transformative shift in mobility, we’ve built a world-class team
advancing the state of the art in AI, robotics, driving, and
material sciences.

The Mission
Make general-purpose robots a reality.

The Challenge
We envision a future where robots assist
with household chores and cooking, aid the elderly in maintaining
their independence, and enable people to spend more time on the
activities they enjoy most. To achieve this, robots need to operate
reliably in messy, unstructured environments. Our mission is to
answer the question: what will it take to create truly
general-purpose robots that can accomplish a wide variety of tasks
in settings like human homes, with minimal human supervision? We
believe that the answer lies in cultivating large-scale datasets of
physical interaction from a variety of sources and building on the
latest advances in machine learning to learn general-purpose robot
behaviors from this data.

The Team
The Learning From Videos (LFV)
team in the Robotics division develops foundation models that
leverage large-scale multi-modal data (RGB, depth, flow, semantics,
actions, tactile, audio, etc.) from multiple domains (driving,
robotics, indoors, outdoors, etc.) to power downstream embodied AI
tasks. Our topics of interest include Video Generation, World
Models, 4D Reconstruction, Multi-Modal Models, Multi-View Geometry,
Data Augmentation, and Video-Language-Action models, with a primary
focus on embodied applications. We are making progress on some of
the hardest scientific challenges around spatio-temporal reasoning,
and how it can lead to the deployment of autonomous agents in
real-world unstructured environments, across both robotics and
driving domains.

The Opportunity
Our team is looking for a Research
Engineer to own and drive the core data and model infrastructure
that powers our research. As our foundation models scale in both
data diversity and model complexity, we need a strong engineer who
can bridge the gap between research ideas and production-grade
systems. This is not a traditional software engineering role; you
will work directly alongside research scientists, understand the
research deeply enough to make independent technical decisions, and
play a key role in enabling the team to move faster and train
better models.

As a Research Engineer, you will be responsible for
building and maintaining the infrastructure that ingests, unifies,
and serves heterogeneous multi-modal datasets at scale; supporting
and optimizing large-scale distributed training of diffusion and
transformer models; and developing tools and pipelines that
accelerate the research-to-results cycle. You will work closely
with researchers to prototype new ideas, run experiments, and help
ship our most successful models toward real-world applications.

Responsibilities
- Build and maintain scalable pipelines for ingesting, converting,
  validating, and serving heterogeneous robotics and vision datasets
  (multi-view, multi-modal, multi-embodiment, etc.) into unified
  training-ready formats. Track and integrate new public and
  internal datasets as they become available.
- Support and optimize large-scale distributed training of
  foundation models (diffusion transformers, video generation
  models) on multi-GPU and multi-node clusters. Manage experiment
  workflows, profiling, debugging, and hyperparameter sweeps.
- Collaborate directly with research scientists to implement,
  iterate on, and evaluate new architectures, objectives, and
  training strategies. Translate research prototypes into clean,
  maintainable, reusable code.
- Develop tools for dataset inspection, experiment tracking, model
  evaluation, GPU resource management, and visualization. Automate
  repetitive workflows to improve team velocity.
- Work with other TRI teams and Toyota affiliates to set up shared
  pipelines, onboard their data, and support joint training and
  evaluation efforts.
- Produce maintainable, well-documented code. Contribute to
  internal tooling and open-source releases to the scientific
  community.

Qualifications
- Master’s or PhD in Computer Science, Electrical Engineering,
  Machine Learning, or a related field, with a minimum of 2 years
  of relevant experience and strong software engineering skills.
- Deep proficiency in Python, PyTorch, and the Unix/Linux
  toolchain. Comfort working in terminal-heavy, SSH-based workflows
  on shared GPU clusters.
- Hands-on experience with large-scale deep learning training,
  including distributed training (DDP, FSDP, DeepSpeed, or
  similar), GPU profiling, and debugging training failures at
  scale.
- Experience building data pipelines for heterogeneous or
  multi-modal datasets (images, video, depth, point clouds,
  actions, etc.).
- Strong fundamentals in at least one of: computer vision, video
  understanding, generative models, 3D reconstruction, or robotics.
- You are proactive, self-directed, and comfortable operating with
  ambiguity in a research-driven environment.
- You are a reliable teammate who communicates clearly and takes
  ownership of problems end-to-end.

Bonus Qualifications
- Experience with video diffusion models, world models, or
  multi-view geometry pipelines.
- Familiarity with robotics data formats and collection pipelines
  (ROS, MCAP, HDF5, etc.).
- Experience with cloud training infrastructure (AWS SageMaker,
  EC2) and containerized workflows (Docker, Kubernetes).
- Proficiency with modern AI-assisted development tools (e.g.,
  Copilot, Cursor, Claude Code) for accelerating engineering
  workflows.
- Track record of contributions to open-source projects or
  publications at top venues (CVPR, ICLR, NeurIPS, RSS, ICRA, etc.)
  is a plus but not required.

When submitting your CV for this position, please include a brief
cover letter and a link to your Google Scholar profile with a full
list of publications.

The pay range for this position at commencement of
employment is expected to be between $176,000 and $253,000/year for
California-based roles. Base pay offered will depend on multiple
individualized factors, including, but not limited to, a
candidate's experience, skills, job-related knowledge, and market
location.

TRI offers a generous benefits package including medical,
dental, and vision insurance, 401(k) eligibility, paid time off
benefits (including vacation, sick time, and parental leave), and
an annual cash bonus structure. Additional details regarding these
benefit plans will be provided if an employee receives an offer of
employment.

Please refer to the Candidate Privacy Notice for
the categories of personal information that we
collect from individuals who inquire about and/or apply to work for
Toyota Research Institute, Inc. or its subsidiaries, including
Toyota A.I. Ventures GP, L.P., and the purposes for which we use
such personal information.

TRI is fueled by a diverse and inclusive
community of people with unique backgrounds, education and life
experiences. We are dedicated to fostering an innovative and
collaborative environment by living the values that are an
essential part of our culture. We believe diversity makes us
stronger and are proud to provide Equal Employment Opportunity for
all, without regard to an applicant’s race, color, creed, gender,
gender identity or expression, sexual orientation, national origin,
age, physical or mental disability, medical condition, religion,
marital status, genetic information, veteran status, or any other
status protected under federal, state or local laws.

It is unlawful
in Massachusetts to require or administer a lie detector test as a
condition of employment or continued employment. An employer who
violates this law shall be subject to criminal penalties and civil
liability.

Pursuant to the San Francisco Fair Chance Ordinance, we
will consider qualified applicants with arrest and conviction
records for employment.

We may use artificial intelligence (AI)
tools to support parts of the hiring process, such as reviewing
applications, analyzing resumes, or assessing responses. These
tools assist our recruitment team but do not replace human
judgment. Final hiring decisions are ultimately made by humans. If
you would like more information about how your data is processed,
please contact us.