    Ex-Twitter dev reminisces about finding 700 unused Nvidia GPUs after takeover — forgotten cluster was 'powered on and idle'

    By Mark Tyson, Tom's Hardware

    8 hours ago

    An engineer who worked at Twitter during the seismic Agrawal-Musk transition has been publicly reminiscing about finding a cluster of 700 Nvidia V100 GPUs. Tim Zaman, who now works as a software engineer at Google DeepMind, discovered the sizable cluster powered on but sitting idle in the data center of X’s chirpy ancestor.

    The warm humming mass of Nvidia silicon and PCBs in the Twitter data center was poetically described as “the forgotten remains of an honest attempt to make a cluster within Twitter 1.0” by Zaman in a Twitter/X post on Monday. The engineer had been spurred to write about his surprise discovery of this silicon treasure trove after reading about xAI’s Memphis Supercluster getting to work training Grok 3, powered by 100,000 liquid-cooled Nvidia H100 accelerators on a single RDMA fabric.

    Zaman underlined what many of you will be thinking – Twitter had 700 of the world's most powerful GPUs humming along without purpose for years. “How times have changed!” he exclaimed. Indeed, Nvidia's first Volta-architecture V100 data center GPUs arrived on the market during the first great GPU shortage of 2017, and Zaman found the 700-card V100 cluster still running idle in mid-2022. That’s a lot of computing time and resources wasted.

    Another moment of mirth for Zaman was discovering that the 700 Nvidia V100s were PCIe cards rather than the SXM2 form-factor variants with their much higher-bandwidth NVLink interconnects. Of course, we don’t know why 2017-era Twitter bought PCIe rather than SXM2 V100 GPUs for this sizable installation, and perhaps we never will.

    Zaman’s post also contained some interesting musings about Musk’s new ‘Gigafactory of Compute.’ Running “100k GPUs on the same fabric must be an epic challenge,” commented the engineer. “At that scale, the only guarantee is failure, and it's all about graceful failure management.” With this in mind, Zaman pondered disaggregating resources into distinct domains so that a single failure wouldn’t bring the whole house down.
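    To illustrate the kind of design Zaman is hinting at, here is a minimal, hypothetical Python sketch (not taken from xAI or Zaman's post; the domain size and helper names are invented for illustration) of splitting a large GPU fleet into independent failure domains, so that a fault in one domain removes only that slice of capacity rather than stalling the whole training job:

    ```python
    # Hypothetical sketch: partition a 100k-GPU fleet into independent failure domains.
    # A fault inside one domain costs at most that domain's GPUs; the rest keep training.
    # All numbers and names here are illustrative, not xAI's actual configuration.

    TOTAL_GPUS = 100_000
    DOMAIN_SIZE = 4_096  # e.g. one network "island" per domain

    # Assign GPUs to domains by simple contiguous partitioning of their IDs.
    domains = {
        d: list(range(d * DOMAIN_SIZE, min((d + 1) * DOMAIN_SIZE, TOTAL_GPUS)))
        for d in range((TOTAL_GPUS + DOMAIN_SIZE - 1) // DOMAIN_SIZE)
    }

    def usable_gpus(failed_domains: set[int]) -> int:
        """Capacity that survives when entire domains are lost."""
        return sum(len(gpus) for d, gpus in domains.items() if d not in failed_domains)

    print(usable_gpus(set()))   # 100000 -- healthy cluster
    print(usable_gpus({3}))     # 95904  -- one full domain down, blast radius capped
    ```

    In a real deployment the partitioning would follow the physical topology (racks, leaf switches, power feeds) rather than contiguous ID ranges, but the principle is the same: cap the blast radius of any single failure at one domain.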

    The engineer was also fascinated by the question of how many GPUs could ultimately sit on a single fabric. With tech titans racing to build bigger and bigger AI training clusters, both the predictable and the unforeseen limits on the maximum number of GPUs on the same fabric are bound to become apparent.
