At AWS re:Invent 2024, Amazon Web Services (AWS) announced the general availability of AWS Trainium2-powered Amazon Elastic Compute Cloud (EC2) instances, which offer 30-40% better price performance than the previous generation of GPU-based EC2 instances. AWS CEO Matt Garman stated, “Today, I’m excited to announce the GA of Trainium2-powered Amazon EC2 Trn2 instances.” The company also introduced Trn2 UltraServers and previewed its next-generation Trainium3 AI chip.
The Trn2 instances are built with 16 Trainium2 chips, delivering up to 20.8 petaflops of compute performance, and are designed for training and deploying large language models (LLMs) with billions of parameters. Trn2 UltraServers combine four Trn2 servers into a single system with 64 interconnected Trainium2 chips, offering 83.2 petaflops of compute for higher scalability.
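The quoted figures scale consistently, which is easy to verify from the numbers in the announcement: 20.8 petaflops across 16 chips implies 1.3 petaflops per Trainium2 chip, and four such servers give the 64 chips and 83.2 petaflops of an UltraServer.

```python
# Consistency check on the quoted Trn2 specifications.
TRN2_CHIPS = 16            # Trainium2 chips per Trn2 instance
TRN2_PFLOPS = 20.8         # peak petaflops per Trn2 instance
SERVERS_PER_ULTRA = 4      # Trn2 servers combined into one UltraServer

# Per-chip throughput implied by the instance spec: 20.8 / 16 = 1.3 petaflops.
per_chip_pflops = TRN2_PFLOPS / TRN2_CHIPS

# An UltraServer is a straight 4x scale-up of the instance figures.
ultra_chips = SERVERS_PER_ULTRA * TRN2_CHIPS     # 64 chips
ultra_pflops = SERVERS_PER_ULTRA * TRN2_PFLOPS   # 83.2 petaflops

print(per_chip_pflops, ultra_chips, ultra_pflops)
```

The 64-chip and 83.2-petaflop UltraServer numbers are thus exactly four times the single-instance figures, with no additional overhead reflected in the quoted peaks.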
David Brown, AWS vice president of compute and networking, stated, “The launch of Trainium2 instances and Trn2 UltraServers provides customers with the computational power needed to tackle the most complex AI models, whether for training or inference.” AWS is also collaborating with Anthropic to create Project Rainier, a large-scale AI compute cluster powered by hundreds of thousands of Trainium2 chips. This infrastructure will support Anthropic’s model development, including the optimization of its flagship product, Claude, to run on Trainium2 hardware.
Databricks and Hugging Face have partnered with AWS to leverage Trainium2’s capabilities for improved performance and cost efficiency in their AI offerings. Databricks plans to utilize the hardware to enhance the Mosaic AI platform, while Hugging Face integrates Trainium2 into its AI development and deployment tools. Other customers of Trainium2 include Adobe, Poolside, and Qualcomm.
Garman mentioned, “Adobe is seeing very promising early testing after running Trainium2 against their Firefly inference model, and they expect to save significant amounts of money.” He added, “Poolside expects to save 40% compared to alternative options, and Qualcomm is using Trainium2 to deliver AI systems that can train in the cloud and then deploy at the edge.”
AWS also previewed its Trainium3 chip, built on a 3-nanometer process node. Trainium3-powered UltraServers are expected in late 2025 and aim to deliver four times the performance of Trn2 UltraServers. To help developers get peak performance from Trainium hardware, AWS provides the Neuron SDK, a suite of software tools for compiling and optimizing models for the chips. The SDK supports frameworks such as JAX and PyTorch, allowing customers to integrate it into existing workflows with minimal code changes, and it supports over 100,000 models hosted on the Hugging Face model hub, further broadening its accessibility for AI developers.
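As a rough illustration of the minimal-code-changes claim, the Neuron SDK's PyTorch integration (the `torch-neuronx` package) compiles a model ahead of time with `torch_neuronx.trace`, after which the compiled module is called like any other PyTorch module. The sketch below is an assumption-laden example, not AWS's reference code: the helper name is invented, and it falls back to the unmodified model when the SDK (which is only available on Neuron-equipped hosts such as Trn2 instances) is not installed.

```python
import importlib.util

def compile_for_trainium(model, example_input):
    """Compile `model` for Trainium via the Neuron SDK when it is installed;
    otherwise return the model unchanged (e.g. on a non-Trainium host).
    Hypothetical helper for illustration -- not part of the Neuron SDK."""
    if importlib.util.find_spec("torch_neuronx") is None:
        # Neuron SDK not installed: keep the plain PyTorch model.
        return model, False
    import torch_neuronx  # Neuron SDK's PyTorch package (pip: torch-neuronx)
    # torch_neuronx.trace compiles the model ahead of time for the Neuron
    # device; the returned module is invoked exactly like the original one.
    return torch_neuronx.trace(model, example_input), True

# Usage: the only change to an existing PyTorch workflow is the wrap call.
# model, on_trainium = compile_for_trainium(model, example_batch)
```

The point of the pattern is that training or inference code downstream of the wrap call stays identical whether the model runs on Trainium or on CPU/GPU.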
Trn2 instances are currently available in multiple regions, including US East (N. Virginia), US West (Oregon), and Europe (Ireland). With the launch of Trainium2, AWS is providing customers with powerful and cost-effective options for training and deploying AI models. The company’s partnerships with Anthropic, Databricks, and Hugging Face demonstrate its commitment to supporting the AI community and driving innovation in the field. As Trainium3 and the Neuron SDK become available, we can expect even more advancements in AI technology from AWS.