    Messenger Media Online

    Reinforcement Learning Uncovers Silent Data Errors

    By Dave, April 25, 2025

    For high-performance chips in massive data centers, math can be the enemy. Because of the sheer scale of the calculations running in hyperscale data centers, which operate around the clock with thousands upon thousands of nodes and vast amounts of silicon, extremely rare errors appear. It’s simply statistics. These rare, “silent” data errors don’t show up during conventional quality-control screenings, even when companies spend hours looking for them.
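
The "simply statistics" point can be made concrete with a small calculation. The sketch below assumes each operation fails independently with the same tiny probability; the per-operation probability and the operation counts are hypothetical values chosen for illustration, not figures from Intel.

```python
import math

def prob_at_least_one_error(p_per_op: float, n_ops: float) -> float:
    """Return P(at least one error) = 1 - (1 - p)^n for independent operations.

    log1p and expm1 keep the result accurate when p is extremely small.
    """
    return -math.expm1(n_ops * math.log1p(-p_per_op))

# Hypothetical numbers, for illustration only.
P_PER_OP = 1e-15      # assumed chance that a single operation is silently wrong
LAPTOP_OPS = 1e9      # assumed operations on one small machine
FLEET_OPS = 1e16      # assumed operations across a large fleet

laptop_risk = prob_at_least_one_error(P_PER_OP, LAPTOP_OPS)
fleet_risk = prob_at_least_one_error(P_PER_OP, FLEET_OPS)
print(f"laptop: {laptop_risk:.3g}, fleet: {fleet_risk:.3g}")
```

Under these assumed numbers, an error that is practically invisible on one machine becomes close to certain across the fleet.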

    This month at the IEEE International Reliability Physics Symposium in Monterey, Calif., Intel engineers described a technique that uses reinforcement learning to uncover more silent data errors faster. The company is using the machine learning method to ensure the quality of its Xeon processors.

    When an error occurs in a data center, operators can either take a node down and replace it, or use the flawed system for lower-stakes computing, says Manu Shamsa, an electrical engineer at Intel’s Chandler, Ariz., campus. But it would be much better if errors could be detected earlier. Ideally they would be caught before a chip is built into a computer system, when it is still possible to make design or manufacturing corrections that prevent the errors from recurring.

    Finding these flaws is not so easy. Shamsa says engineers have been so baffled by them that they joked the errors must be caused by spooky action at a distance, Einstein’s phrase for quantum entanglement. But there’s nothing spooky about them, and Shamsa has spent years characterizing them. In a paper presented at the same conference last year, his team provided a comprehensive catalog of the causes of these errors. Most are caused by infinitesimal variations in manufacturing.

    Even when each of the billions of transistors on a chip is functional, they are not completely identical to one another. Subtle differences in how a given transistor responds to changes in temperature, voltage, or frequency, for instance, can lead to an error.

    These subtleties are much more likely to crop up in huge data centers because of the pace of computing and the vast amount of silicon involved. “In a laptop, you won’t notice any errors. In data centers, with really dense nodes, there are high chances the stars will align and an error will occur,” Shamsa says.

    Some errors may crop up only after a chip has been installed in a data center and has been running for months. Small variations in the properties of transistors can cause them to degrade over time. One such silent error Shamsa has found is related to electrical resistance. A transistor that operates properly at first, and passes standard tests that look for shorts, can degrade with use so that its resistance increases.

    “You’re thinking everything is fine, but underneath, an error is causing a wrong decision,” Shamsa says. Over time, because of a slight weakness in a single transistor, “one plus one goes to three, silently, until you see the impact,” Shamsa says.

    The new technique builds on an existing set of methods for detecting silent errors, called Eigen tests. These tests make the chip do hard math problems, repeatedly over a period of time, in the hope of making silent errors apparent. They involve operations on matrices of different sizes filled with random data.
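
The general idea of such a test (repeat a matrix operation on random data and look for results that disagree) can be sketched in a few lines. This is a toy illustration, not the actual Eigen tests or openDCDiag code: every function name here is hypothetical, the reference result is simply the first run on the same machine, and `make_flaky_matmul` is an invented fault injector that stands in for defective hardware.

```python
import random

def random_matrix(n, rng):
    """Build an n x n matrix filled with random values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]

def matmul(a, b):
    """Plain matrix multiplication on lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def repeated_matmul_check(size, repetitions, seed=0, compute=matmul):
    """Repeat the same multiplication and count results that differ
    from a reference computed once up front."""
    rng = random.Random(seed)
    a, b = random_matrix(size, rng), random_matrix(size, rng)
    reference = matmul(a, b)
    mismatches = 0
    for _ in range(repetitions):
        if compute(a, b) != reference:
            mismatches += 1
    return mismatches

def make_flaky_matmul(every):
    """Hypothetical fault injector: corrupts one element on every
    `every`-th call, mimicking an intermittent silent error."""
    calls = 0
    def flaky(a, b):
        nonlocal calls
        calls += 1
        out = matmul(a, b)
        if calls % every == 0:
            out[0][0] += 1.0
        return out
    return flaky

healthy_mismatches = repeated_matmul_check(size=4, repetitions=10)
faulty_mismatches = repeated_matmul_check(size=4, repetitions=10,
                                          compute=make_flaky_matmul(5))
print(healthy_mismatches, faulty_mismatches)
```

A deterministic computation should never disagree with itself, so any mismatch counts as a detected error. Real tests also vary matrix sizes and operating conditions, which this sketch leaves out.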

    There are a large number of Eigen tests. Running all of them would take an impractical amount of time, so chipmakers use a randomized approach to generate a manageable set of them. This saves time but leaves errors undetected. “There’s no principle to guide the selection of inputs,” Shamsa says. He wanted to find a way to guide the selection so that a relatively small number of tests could turn up more errors.

    The Intel team used reinforcement learning to develop tests for the part of its Xeon CPU chip that performs matrix multiplication using what are called fused multiply-add (FMA) instructions. Shamsa says they chose the FMA region because it takes up a relatively large area of the chip, making it more vulnerable to potential silent errors: more silicon, more problems. What’s more, flaws in this part of a chip can generate electromagnetic fields that affect other parts of the system. And because the FMA unit is turned off to save power when it’s not in use, testing it involves repeatedly powering it up and down, potentially activating hidden defects that otherwise wouldn’t appear in standard tests.
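
For readers unfamiliar with the operation: a fused multiply-add computes a × b + c with a single rounding at the end, and matrix multiplication is essentially a long chain of these steps. The sketch below emulates that behavior in software using exact rational arithmetic. It illustrates the arithmetic only and says nothing about how the Xeon hardware implements it; `fma_exact` and `dot_fma` are hypothetical helper names.

```python
from fractions import Fraction

def fma_exact(a: float, b: float, c: float) -> float:
    """Emulate fused multiply-add: compute a*b + c exactly, then round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def dot_fma(xs, ys):
    """Dot product as a chain of multiply-add steps, the core of matrix multiply."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = fma_exact(x, y, acc)
    return acc

# Inputs chosen so the exact product, 1 - 2**-60, cannot be stored in a double.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30

unfused = a * b - 1.0            # product is rounded first, then the sum
fused = fma_exact(a, b, -1.0)    # single rounding at the very end

print(unfused, fused)
print(dot_fma([1.0, 2.0], [3.0, 4.0]))
```

The two results differ because the unfused version rounds the product to 1.0 before subtracting, while the fused version keeps the tiny residual.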

    During each step of its training, the reinforcement-learning program selects different tests for the potentially defective chip. Each error it detects is treated as a reward, and over time the agent learns to select the tests that maximize the chances of detecting errors. After about 500 testing cycles, the algorithm learned which set of Eigen tests optimized the error-detection rate for the FMA region.
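
The article does not say which reinforcement-learning algorithm Intel used. As a rough illustration of the reward loop described above, here is a minimal sketch using an epsilon-greedy multi-armed bandit, a much simpler stand-in technique. The candidate tests, their detection probabilities, and the function name are all hypothetical, and the chip is replaced by a random simulation.

```python
import random

def train_test_selector(detect_probs, cycles=500, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: each arm is a candidate test, and the reward
    is 1 when the (simulated) test exposes an error, otherwise 0."""
    rng = random.Random(seed)
    n = len(detect_probs)
    pulls = [0] * n        # how often each test was run
    value = [0.0] * n      # running estimate of each test's detection rate
    for _ in range(cycles):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                        # explore a random test
        else:
            arm = max(range(n), key=lambda i: value[i])   # exploit the best so far
        # Simulated chip: the test detects an error with an assumed probability.
        reward = 1.0 if rng.random() < detect_probs[arm] else 0.0
        pulls[arm] += 1
        value[arm] += (reward - value[arm]) / pulls[arm]  # incremental mean
    return pulls, value

# Hypothetical detection probabilities for three candidate tests.
SIMULATED_DETECT_PROBS = [0.01, 0.02, 0.50]

pulls, value = train_test_selector(SIMULATED_DETECT_PROBS)
best = max(range(len(value)), key=lambda i: value[i])
print("runs per test:", pulls)
print("estimated detection rates:", [round(v, 3) for v in value])
print("preferred test index:", best)
```

Compared with picking tests uniformly at random, the agent spends most of its budget on whichever test has paid off so far, which is the intuition behind guiding test selection with rewards.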

    Shamsa says this technique is five times as likely to detect a defect as randomized Eigen testing. Eigen tests are open source, part of the openDCDiag for data centers. So other users should be able to use reinforcement learning to adapt these tests for their own systems, he says.

    To a certain extent, silent, subtle flaws are an unavoidable part of the manufacturing process; absolute perfection and uniformity remain out of reach. But Shamsa says Intel is trying to use this research to learn to find the precursors that lead to silent data errors faster. He’s investigating whether there are red flags that could provide an early warning of future errors, and whether it’s possible to change chip recipes or designs to address them.
