NVIDIA recently released LLaMA-Mesh, a method that enables large language models (LLMs) to generate 3D meshes from text prompts. The approach unifies 3D mesh generation with language understanding by representing meshes in plain text, without modifying the model's vocabulary or tokeniser. The update was announced on LinkedIn by Ahsen Khaliq, ML at Hugging Face.
LLaMA-Mesh builds on the LLaMA language model by fine-tuning it on a curated dataset of 3D dialogues. The method, designed by researchers from NVIDIA and Tsinghua University, preserves the model's language capabilities while extending it to generate and understand 3D content.
So, how does LLaMA-Mesh work? The method leverages existing spatial knowledge embedded in LLMs, derived from textual sources like 3D tutorials. It tokenises 3D mesh data, including vertex coordinates and face definitions, into text, allowing seamless processing by language models. To train the model, the researchers developed a supervised fine-tuning dataset. This dataset enables the LLM to perform tasks such as generating 3D meshes from text prompts, producing interleaved text and 3D outputs, and interpreting 3D mesh structures.
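The tokenisation step above can be sketched in a few lines. LLaMA-Mesh represents meshes in the OBJ-style plain-text format of "v x y z" vertex lines and "f i j k" face lines; the coordinate quantisation shown here (binning floats to small integers so each coordinate becomes a short token) is an illustrative assumption, not NVIDIA's exact implementation.

```python
def mesh_to_text(vertices, faces, bins=64):
    """Serialise a triangle mesh into OBJ-style plain text an LLM can read.

    Coordinates are quantised into `bins` integer levels so each value is a
    short, vocabulary-friendly token (quantisation scheme assumed here).
    """
    lines = []
    flat = [c for v in vertices for c in v]
    lo, hi = min(flat), max(flat)
    scale = (bins - 1) / (hi - lo) if hi > lo else 0.0
    for x, y, z in vertices:
        q = [int(round((c - lo) * scale)) for c in (x, y, z)]
        lines.append(f"v {q[0]} {q[1]} {q[2]}")
    for a, b, c in faces:  # 1-based vertex indices, as in the OBJ format
        lines.append(f"f {a} {b} {c}")
    return "\n".join(lines)

# Minimal example: a single triangle.
verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
tris = [(1, 2, 3)]
print(mesh_to_text(verts, tris))
```

The resulting string is ordinary text, which is why a language model can both emit it (mesh generation) and consume it (mesh understanding) with no architectural changes.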
The study shows that LLaMA-Mesh achieves 3D mesh generation quality comparable to specialised models trained exclusively on 3D data. The model was trained on 32 A100 GPUs for 21k iterations over three days, using an AdamW optimiser with a small learning rate and warm-up steps. The training loss curve shows quick adaptation to the new task with no spikes, indicating stable and efficient learning.
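A warm-up schedule like the one mentioned above can be sketched as follows. The peak learning rate, warm-up length, and the linear decay after warm-up are assumptions for illustration; the article only states that a small learning rate with warm-up steps was used, and only the 21k-iteration count comes from the source.

```python
def lr_at_step(step, peak_lr=1e-5, warmup_steps=300, total_steps=21_000):
    """One common fine-tuning schedule: linear warm-up to peak_lr,
    then linear decay to zero over the remaining steps.

    All hyperparameter values here are hypothetical except total_steps.
    """
    if step < warmup_steps:
        # Ramp up gradually to avoid destabilising the pretrained weights.
        return peak_lr * step / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 1.0 - frac)
```

Warm-up matters during fine-tuning because a full-size learning rate applied immediately can produce exactly the loss spikes the researchers report avoiding.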
The researchers claim that the method creates detailed, high-quality 3D meshes with artist-like topology, a quality learned during training. It can generate diverse, creative outputs from the same text prompt, making it well suited to tasks that require multiple design options. Even after fine-tuning for 3D mesh generation, the model retains its strong language skills: it understands complex instructions, asks clarifying questions, and gives detailed answers. Tests show it performs on par with comparable models in reasoning and problem-solving while also excelling at creating 3D designs.
AI is transforming animation and 3D modelling, impacting artists, studios, developers, and end-users alike. By automating repetitive tasks, AI frees artists to focus on creativity, while studios benefit from faster, more cost-effective production. However, this raises the question of whether AI will soon replace human animators and modellers. Earlier in October, NVIDIA announced EdgeRunner, which can generate highly detailed 3D meshes with up to 4,000 faces at a spatial resolution of 512, derived from both images and text, showcasing the potential of AI in 3D modelling.