Advancing AI Hardware: The Thermal Implications

6 December, 2019

Competition within the world of artificial intelligence hardware is heating up. 

The recent Hot Chips conference at Stanford University showcased some astonishing new chip designs, not least the Wafer-Scale Engine (WSE) from Cerebras Systems.

The largest chip ever built, the WSE is nearly 9 inches wide, and its total area is 56 times that of the biggest graphics processing unit.

But perhaps the most startling thing about the WSE is that it contains 400,000 cores and 1.2 trillion transistors, and draws an absolutely massive 15 kW of power.

As the WSE has not yet hit the market, it remains to be seen if performance testing lives up to the claims surrounding its capabilities. 

It also remains to be seen whether the thermal implications of its power draw have been fully accounted for.

Cooling Challenges for AI Chips

For hardware manufacturers to truly overcome the thermal complications associated with such powerful AI designs, thermal management must be considered at every stage of the design process.

For example, a chip dissipating 15 kW of heat will need especially careful selection of substrate, bonding, die-attach and thermal interface materials.

At the system level there are equally important decisions to be made regarding printed circuit board (PCB) materials, heat sinks and where to incorporate liquid cooling or thermoelectric coolers. 

Additionally, the more robust materials required by high power electronics bring unique challenges. Compared to typical FR4 PCBs, materials like ceramic or copper have high thermal conductivity, which is an advantage in thermal management. But these materials can also add significant cost and weight to a design if not used optimally. 
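To put rough numbers on the conductivity gap, the sketch below uses the standard one-dimensional conduction relation R = t / (k × A) to compare the thermal resistance of the same slab made from different materials. The conductivity figures are approximate textbook values, and the slab dimensions and 5 W heat load are hypothetical; none of these numbers come from a specific product or from the WSE.

```python
# Approximate room-temperature thermal conductivities in W/(m*K).
# Typical textbook figures used for illustration only (assumption).
CONDUCTIVITY_W_PER_MK = {
    "FR4 (through-plane)": 0.3,
    "alumina ceramic": 25.0,
    "aluminium": 205.0,
    "copper": 390.0,
}


def slab_resistance(thickness_m, area_m2, conductivity_w_per_mk):
    """Return the 1D conduction resistance of a flat slab in K/W."""
    if thickness_m <= 0 or area_m2 <= 0 or conductivity_w_per_mk <= 0:
        raise ValueError("thickness, area and conductivity must be positive")
    return thickness_m / (conductivity_w_per_mk * area_m2)


def temperature_rise(power_w, resistance_k_per_w):
    """Return the steady-state temperature drop across a resistance in K."""
    return power_w * resistance_k_per_w


if __name__ == "__main__":
    thickness = 1.6e-3    # hypothetical 1.6 mm thick board or substrate
    area = 0.01 * 0.01    # hypothetical 10 mm x 10 mm heat path
    power = 5.0           # hypothetical 5 W flowing through the slab

    for name, k in CONDUCTIVITY_W_PER_MK.items():
        r = slab_resistance(thickness, area, k)
        dt = temperature_rise(power, r)
        print(f"{name:22s} R = {r:8.3f} K/W   dT = {dt:8.2f} K")
```

This ignores heat spreading, contact resistance and convection, so it only indicates the order of magnitude separating FR4 from ceramic or metal substrates.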

All of these seemingly minor considerations can accumulate into major concerns for manufacturers, potentially leading to reduced running efficiency and possibly even product recalls. 

Thermal Simulation for AI Hardware

Businesses that build thermal considerations into their designs from the outset, and make full use of innovative cooling solutions, will be most effective at minimizing unnecessary risk and costs due to component failure.

And the most efficient way to achieve this is through thermal simulation.

By creating a thermal model in advance with a simulation tool such as 6SigmaET, AI hardware engineers can test their designs using a wide variety of different materials and configurations, for example switching a component from copper to aluminium at the click of a button.

Simulation also enables designs to be tested in a massive variety of different environments, temperatures and operating scenarios. Not only does this alert designers to potential inefficiencies early on in the process, but it also negates the need for multiple real-world prototypes — saving manufacturers time and money.
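The idea of sweeping materials and environments can be illustrated with a very small lumped-resistance model. This is only a sketch of the concept and not how 6SigmaET or any other CFD-based tool works internally; the function names, the fixed resistance, the heat-sink base dimensions, the 100 W load and the 85 °C limit are all hypothetical values chosen for illustration.

```python
from itertools import product

# Approximate thermal conductivities in W/(m*K) (textbook values, assumption).
SINK_CONDUCTIVITY = {"copper": 390.0, "aluminium": 205.0}


def junction_temperature(ambient_c, power_w, r_fixed_k_per_w,
                         base_thickness_m, base_area_m2, conductivity):
    """Estimate junction temperature in deg C from a series resistance chain.

    r_fixed_k_per_w lumps together everything that does not depend on the
    heat-sink base material (package, interface material, convection).
    """
    r_base = base_thickness_m / (conductivity * base_area_m2)
    return ambient_c + power_w * (r_fixed_k_per_w + r_base)


def sweep(power_w, ambients_c, materials, limit_c,
          r_fixed_k_per_w=0.4, base_thickness_m=5e-3, base_area_m2=2.5e-3):
    """Evaluate every material/ambient combination against a temperature limit."""
    results = []
    for (name, k), ambient in product(materials.items(), ambients_c):
        tj = junction_temperature(ambient, power_w, r_fixed_k_per_w,
                                  base_thickness_m, base_area_m2, k)
        results.append((name, ambient, tj, tj <= limit_c))
    return results


if __name__ == "__main__":
    # Hypothetical 100 W device with an 85 deg C junction limit.
    for name, ambient, tj, ok in sweep(100.0, [25.0, 40.0, 55.0],
                                       SINK_CONDUCTIVITY, limit_c=85.0):
        status = "OK" if ok else "over limit"
        print(f"{name:10s} ambient {ambient:5.1f} C -> Tj {tj:6.2f} C  {status}")
```

A full simulation resolves airflow, spreading and transient behaviour that a model like this cannot, but the workflow is the same: change one parameter, rerun, and compare the result against the limit before any hardware is built.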

Incorporating thermal simulation into the early stages of the design process allows engineers to precisely understand the unique thermal challenges facing AI hardware. 

By dealing with thermal considerations far earlier, designers can optimize the thermal performance of AI hardware and reduce the risk of expensive late-stage fixes and over-engineering.

Download 6SigmaET’s latest Thermal Focus whitepaper to find out more about how new technologies and trends are impacting thermal management.

Blog written by: Tom Gregory, Product Manager