Amazon Web Services said Project Rainier, an AI compute cluster powered by 500,000 Trainium2 chips, is now in use by Anthropic.

The AI infrastructure project is critical for AWS because Anthropic is using it to train its Claude models and run other workloads. AWS said that Project Rainier will ultimately scale to 1 million Trainium2 processors.

Project Rainier was announced a year ago. Anthropic has pursued a multi-cloud approach and recently said it would procure TPUs from Google Cloud. Anthropic's models will now run on Nvidia, AWS and Google Cloud.

Key facts include:

  • AWS said Project Rainier will have more than 1 million Trainium2 chips by the end of the year.
  • The AI compute power is being used to build and deploy future versions of Claude.
  • Project Rainier is AWS' largest infrastructure project to date.
  • Project Rainier is designed as a massive “EC2 UltraCluster of Trainium2 UltraServers.”
  • The architecture strings together UltraServers, each of which combines four physical Trainium2 servers with 16 Trainium2 chips apiece, or 64 chips per UltraServer. The chips communicate via high-speed connections called NeuronLinks.
  • Together, these UltraServers form an UltraCluster.
  • AWS said the vertical integration will enable it to continually optimize Project Rainier for cost and energy efficiency.
  • Given that AWS is highly likely to announce Trainium3, the next question will revolve around the replacement cadence and depreciation for Trainium2.
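
To put the architecture figures above in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted in this article (four servers per UltraServer, 16 chips per server, and the 500,000 and 1 million chip totals). The function names are illustrative, and the assumption that every chip sits in a fully populated UltraServer is ours, not something AWS has stated.

```python
import math

# Figures taken from the article.
SERVERS_PER_ULTRASERVER = 4
CHIPS_PER_SERVER = 16


def chips_per_ultraserver() -> int:
    """Trainium2 chips in one UltraServer (4 servers x 16 chips)."""
    return SERVERS_PER_ULTRASERVER * CHIPS_PER_SERVER


def ultraservers_needed(total_chips: int) -> int:
    """UltraServers required to host total_chips.

    Assumption (not from AWS): every chip lives in a fully
    populated UltraServer, so we round up to a whole unit.
    """
    return math.ceil(total_chips / chips_per_ultraserver())


if __name__ == "__main__":
    print(chips_per_ultraserver())         # 64
    print(ultraservers_needed(500_000))    # 7813
    print(ultraservers_needed(1_000_000))  # 15625
```

Under those assumptions, the current 500,000-chip deployment would correspond to roughly 7,800 UltraServers, and the 1 million chip target to about 15,600.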

Constellation Research analyst Holger Mueller said:

"It's good to see AWS being on track to build its first super computer. Traditionally AWS would scale through many machines not large machines, which require different engineering and different fault tolerances. We will see more details when the new machine will be in production. Obviously, AWS is confident for it to work and wants a portion of the news during GTC week. And finally it's a great proof point for AWS wanting to keep workloads inhouse and living the build vs buy mantra."