Google TPU: Thermal Management in Machine Learning

13 March, 2019


For much of the electronics community, 2018 was the year that artificial intelligence, and machine learning in particular, became a reality.

Used in everything from retail recommendations to driverless cars, machine learning represents the ongoing effort to get computers to solve problems without being explicitly programmed to do so. While both machine learning and the wider notion of AI come with revolutionary new implications for the technology sector, they also require significant investment in new forms of integrated circuits.

One of the latest examples of such hardware is the recently released Tensor Processing Unit 3.0 (TPUv3) from Google. Designed specifically with the future of machine learning in mind, the TPU is a custom ASIC tailored for TensorFlow, Google’s open source software library for machine learning applications.

While positioned as a game-changing development in the AI space, Google’s TPU faces many of the same challenges as competitor products offered by Amazon and Nvidia – in particular, the potential for thermal complications.

As with so much of the hardware developed specifically for the machine learning market, Google’s TPUv3 offers a colossal amount of processing power. In fact, according to Google CEO Sundar Pichai, each TPUv3 pod is eight times more powerful than the previous year’s TPU pods (Google I/O 2018). From an AI standpoint, this is hugely beneficial, because machine learning relies on the ability to process very large volumes of data quickly, both to train models and to let them make decisions without explicit programming.

From a thermal perspective, however, this dramatic increase in processing power represents a minefield of potential complications: the more power a device draws, the more heat it generates. If that heat is allowed to accumulate, it can degrade reliability, ultimately putting the performance and longevity of the TPU at risk.
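To make the link between power and temperature concrete, the sketch below uses the standard lumped thermal-resistance model, in which the steady-state junction temperature is the ambient temperature plus power multiplied by the junction-to-ambient thermal resistance. The function name and every number in it are illustrative assumptions, not TPUv3 specifications.

```python
def junction_temperature(power_w, theta_ja_c_per_w, ambient_c):
    """Steady-state junction temperature (degrees C) from a lumped model.

    T_junction = T_ambient + P * theta_JA
    """
    return ambient_c + power_w * theta_ja_c_per_w


# Hypothetical values chosen only for illustration.
AMBIENT_C = 35.0   # assumed inlet air temperature, degrees C
THETA_JA = 0.25    # assumed junction-to-ambient resistance, degrees C per watt

for power_w in (100.0, 200.0):
    t_j = junction_temperature(power_w, THETA_JA, AMBIENT_C)
    print(f"{power_w:.0f} W -> junction at {t_j:.1f} C")
```

With the cooling path unchanged, doubling the power doubles the temperature rise above ambient, which is why a large jump in processing power usually forces a rethink of the thermal design.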

In a market where reliability is essential, and buyers have little room for system downtime, this issue could prove the deciding factor in which hardware manufacturers ultimately claim ownership of the AI space.

Given these high stakes, Google has clearly invested significant time in optimizing the thermal design of its TPUv3. Unlike the company’s previous tensor processing units, the TPUv3 is the first to bring liquid cooling to the chip: coolant is delivered to a cold plate sitting atop each TPUv3 ASIC, and two attached pipes carry the heated coolant away from the components. According to Sundar Pichai, this is the first time that Google has needed to introduce liquid cooling into its data centers (Google I/O 2018).
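As a rough illustration of how a cold-plate loop is sized, the sketch below applies the steady-state energy balance Q = m_dot * c_p * dT to estimate the coolant flow needed to carry away a given heat load. The heat load and allowed coolant temperature rise are hypothetical, the coolant is assumed to be water, and nothing here reflects Google’s actual design.

```python
WATER_CP_J_PER_KG_K = 4186.0  # approximate specific heat of liquid water


def required_mass_flow(heat_w, coolant_rise_k, cp_j_per_kg_k=WATER_CP_J_PER_KG_K):
    """Coolant mass flow (kg/s) needed to absorb heat_w watts.

    Rearranged from the energy balance Q = m_dot * c_p * dT.
    """
    if coolant_rise_k <= 0:
        raise ValueError("coolant temperature rise must be positive")
    return heat_w / (cp_j_per_kg_k * coolant_rise_k)


# Hypothetical operating point: 200 W per chip, 10 K coolant temperature rise.
flow_kg_s = required_mass_flow(heat_w=200.0, coolant_rise_k=10.0)
# Water is roughly 1 kg per litre, so kg/s * 60 approximates litres per minute.
print(f"{flow_kg_s:.4f} kg/s (about {flow_kg_s * 60:.2f} L/min of water)")
```

Halving the allowed temperature rise doubles the required flow, so pump and pipework choices follow directly from how warm the coolant is permitted to get.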

Figure 1. The Google Tensor Processing Unit 3.0

While the addition of liquid cooling technology has been positioned as an innovation for the industry, the reality is that high-powered electronics running in rugged environments have been using similar heat dissipation systems for a long time.

As just one example, 6SigmaET has been used to model liquid cooling systems for aerospace projects and designs. Aerospace electronics are tightly packed, highly engineered, and high in power density, and they are exposed to an array of very harsh environments. Designers preparing for these environments must also work with limited cooling resources, so they have to find clever ways of moving heat away from the critical components. In this context, carefully designed liquid cooling systems have proved vital in ensuring that the electronics function reliably.

While such liquid cooling systems are extremely effective, they should not necessarily be used as the go-to solution for thermal management. With those in the machine learning space looking to optimize efficiency – both in terms of energy and cost – it’s vital that designers minimize thermal issues across their entire designs, and do not rely on the sledgehammer approach of installing a liquid cooling system just because the option is available.

For some of the most powerful chips, such as the Google TPUv3, it may be that liquid cooling is the only viable solution. In the future, however, as ever more investment is made in machine learning hardware, developers should not grow complacent when it comes to exploring different thermal management solutions. Liquid cooling may help to dissipate heat build-up, but it is far more efficient to strive for designs that do not risk such accumulations of heat in the first place.

It is the businesses that maximize their use of space and build thermal considerations into the fabric of their designs that will be most effective at minimizing energy waste and limiting unnecessary component costs. Inevitably, these will also be the firms that produce the most elegant and, ultimately, the most reliable products, affording them the best opportunity to claim ownership of the AI hardware space.


Blog written by: Tom Gregory, Product Manager at 6SigmaET