Reliability

AMD product qualification exceeds industry standards by applying knowledge-based methodologies and demonstrating world-class reliability results on leading technology nodes.

Overview

AMD product reliability exceeds industry requirements by applying knowledge-based qualification methodologies and demonstrating world-class reliability results on leading-edge technology nodes. Before production release, all products must meet unique and stringent quality and reliability exit criteria (standards-based criteria such as JEDEC, IPC, IEC, AEC, and MIL-STD, as well as knowledge-based methodology, machine learning, etc.).

Reliability margins have been shrinking over the years. Generation after generation, AMD has been flattening the “shrinking bathtub” by addressing infant mortality / early-life failure rate (EFR), thereby reducing defect density (DD), through:

  • Extrinsic limit: infant mortality (EFR)
    • Defect reduction at design
    • Advanced outlier elimination
    • Electrical voltage stress screens
  • Intrinsic limit: wear-out
    • Realistic use condition
    • Appropriate model
    • Design for reliability
    • Longer / newer tests

Figure 1. Shrinking Bathtub: High-volume consumer markets such as mobile communications are shrinking the IC reliability margin. By focusing on design reliability, AMD has been able to deliver and meet the reliability requirements.
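A bathtub curve is commonly modeled as the sum of three Weibull hazard terms: a decreasing infant-mortality term (shape below 1), a constant random-failure term (shape equal to 1), and an increasing wear-out term (shape above 1). The sketch below illustrates that general idea only. The function names and all shape and scale values are hypothetical and are not AMD data or an AMD model.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t,
                   infant=(0.5, 2.0e6),    # shape < 1: rate falls with time
                   random_=(1.0, 1.0e6),   # shape = 1: constant rate
                   wearout=(4.0, 1.5e5)):  # shape > 1: rate rises with time
    """Sum of three Weibull hazards; every parameter here is a made-up example."""
    return sum(weibull_hazard(t, shape, scale)
               for shape, scale in (infant, random_, wearout))


# Hazard per hour at early life, mid life, and late life (hours are illustrative).
for hours in (1, 1e4, 2e5):
    print(f"t = {hours:>8.0f} h  hazard = {bathtub_hazard(hours):.3e} per hour")
```

In this picture, reducing defects and screening outliers lowers the infant-mortality term, while design for reliability pushes the wear-out term further out in time.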

Qualification Methodology

Standard and Knowledge-Based Qualification Methodology

Advanced leading-edge technologies have led to a need for evolutionary thinking around testing and the analysis of quality and reliability of components. AMD uses knowledge-based reliability qualification (KBQ) in addition to standards-based methods (JEDEC, AEC, MIL-STD, etc.) to combat the shrinking bathtub. AMD also evolves its risk mitigation by analyzing test data with deep-learning and machine-learning applications for defect exclusion.

Reliability at the system level requires key learnings and knowledge of customer use conditions and failure mechanisms. AMD Volume-System-Test (VST) characterizes devices at the system level, emulating typical application use conditions.

As part of our commitment to assure reliability, AMD exceeds the industry standards-based qualification requirements in order to understand failure mechanisms and evaluate reliability margins before product release. These results are achieved through a focus on die-, package-, and test-level defect reduction initiatives. The success of these efforts can be seen both in the line quality measured at customer sites and in the low FIT rates published in the AMD reliability report.

Process Technology    FIT Rate
16nm                  8
20nm                  11
28nm                  11
40nm                  10
45nm                  11
65nm                  6
90nm                  2

AMD publishes a device reliability monitor report to give customers insight into the reliability of AMD products. The goal of the reliability program is continuous improvement in the robustness of each product being evaluated. As part of this program, finished product reliability is measured on an ongoing, periodic basis to ensure that product performance meets or exceeds reliability specifications.
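For readers less familiar with the unit: one FIT is one failure per 10^9 device-hours. The sketch below shows how a FIT rate translates into a mean time to failure and an expected number of failures across a fleet. It uses the 16nm value from the table above and assumes a constant failure rate. The fleet size and service life are hypothetical, as are the function names.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def fit_to_mttf_hours(fit):
    """Mean time to failure in hours for a constant failure rate given in FIT."""
    return 1e9 / fit


def expected_failures(fit, devices, years):
    """Expected failures in a fleet, assuming a constant failure rate."""
    device_hours = devices * years * HOURS_PER_YEAR
    return fit * device_hours / 1e9


# 8 FIT is the 16nm table entry; 10,000 devices over 10 years is an assumption.
mttf = fit_to_mttf_hours(8)
failures = expected_failures(8, devices=10_000, years=10)
print(f"MTTF: {mttf:.3e} hours; expected failures: {failures:.2f}")
```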

Reliability Estimator

The Xilinx Reliability Estimator (XRE) tool was developed to help customers estimate the reliability performance and lifetime of products based on the customer mission profile and use conditions. Designed from the ground up, the calculator estimates the failure rates (FITs) for various customer-specified use conditions and durations.

The fundamental concepts of the XRE tool include:

  • Separating the chip into small components according to their characteristics and applications
  • Calculating the failure rate of each component using a reliability aging model
  • Taking into consideration the reliability physics for gate oxide, transistors, and interconnects, as well as:
    • Design characteristics (voltage, current, area, complexity)
    • The physics of failures: wear-out (BTI/HCI, TDDB, EM) data
    • Parametric drift (BTI/HCI) data from CAD simulations, mitigated with test guard-band

Figure 1. Example of 28nm FIT Rate Calculation: The XRE tool takes into consideration the reliability device physics, along with the appropriate models and customer profiles to calculate an accurate FIT rate.
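The general approach described above (split the chip into components, estimate each component's failure rate with an aging model, then combine them over a mission profile) can be sketched as follows. This is not the XRE implementation. It assumes a simple Arrhenius temperature model and ignores voltage, current, and area effects. The component names, base FIT values, activation energies, and the mission profile are all hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def arrhenius_factor(ea_ev, t_ref_c, t_use_c):
    """Acceleration of failure rate at t_use_c relative to t_ref_c (Arrhenius)."""
    t_ref_k = t_ref_c + 273.15
    t_use_k = t_use_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_ref_k - 1.0 / t_use_k))


def chip_fit(components, mission_profile, t_ref_c=55.0):
    """Time-weighted sum of component FITs over a mission profile.

    components: name -> (base FIT at t_ref_c, activation energy in eV)
    mission_profile: list of (fraction of lifetime, junction temperature in C)
    """
    total = 0.0
    for base_fit, ea_ev in components.values():
        for fraction, t_use_c in mission_profile:
            total += fraction * base_fit * arrhenius_factor(ea_ev, t_ref_c, t_use_c)
    return total


# Hypothetical component breakdown and mission profile, for illustration only.
components = {
    "gate_oxide_tddb": (2.0, 0.7),
    "transistor_bti_hci": (1.5, 0.5),
    "interconnect_em": (3.0, 0.9),
}
profile = [(0.7, 55.0), (0.2, 85.0), (0.1, 100.0)]
print(f"Estimated FIT: {chip_fit(components, profile):.1f}")
```

Weighting each phase by its share of the lifetime is what lets the same component data produce different estimates for different customer use conditions.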

Single Event Upsets

Ionizing radiation is capable of inducing undesired effects in most silicon devices. A single event upset (SEU) is an unintentional change of state caused by ionizing radiation in any integrated circuit, including ASIC, ASSP, memory, logic, and mixed-signal devices.

AMD devices are designed to have an inherently low susceptibility to SEUs. Although SEUs are extremely rare and fully recoverable in AMD devices, AMD understands the need for the utmost in system reliability and availability, and that managing SEUs requires far more than simply estimating SEU Failures-In-Time (FIT). To that end, AMD provides system designers a comprehensive solution for SEU mitigation.

Silicon

The foundation of reliability and availability is the silicon. Through continued innovation in circuit design and layout techniques, AMD has lowered the intrinsic SEU FIT of the silicon with each new generation, enabling most application deployments without any additional SEU mitigation. In addition, should an SEU occur, AMD provides rapid embedded error detection and correction that can restore the device state, such that the majority of SEUs will not result in system interruption.

To maximize the integrity of designs in AMD devices, AMD offers industry-leading resilience to SEUs through more than 40 techniques spanning process, layout, circuit, and device architecture. Compared to 7 Series devices, Versal devices achieve a reduction of more than two orders of magnitude in SEU FIT.

Versal devices are the next step in continuing efforts to offer the most robust and comprehensive solution available. Versal devices contain additional design innovations and use 7nm FinFET transistor technology as a multiplier to gain substantial additional reduction in SEU FIT. Most applications will meet their reliability and availability requirements based on the inherent resilience of Versal devices without any additional SEU mitigation.

Packaging

AMD uses only ultra-low alpha (ULA) packaging materials and actively monitors material suppliers to ensure compliance with ULA specifications.

Mitigation Solutions

To effectively manage SEUs, AMD offers optional solutions that can be leveraged to increase reliability and availability in applications requiring additional mitigation.

The Xilinx Soft Error Mitigation (XilSEM) Library for Versal SoCs is a user-configurable, pre-verified solution to detect and correct SEUs in Configuration RAM. It also supports advanced techniques that enable users to classify SEUs in Configuration RAM during device operation. For previous AMD architectures, the Soft Error Mitigation (SEM) IP Core for MPSoCs, FPGAs, and SoCs is available with similar capabilities.

These solutions do not prevent SEUs; however, they provide a method to better manage system-level effects of SEUs. Proper management of SEUs increases reliability and availability, and reduces system maintenance and downtime costs. These solutions are validated and characterized through accelerated particle testing at one or more radiation effects facilities.

Design Techniques

Optimization by EDA tools typically improves quality of results, but these tools may also optimize away design-level SEU mitigation, such as redundant circuits or modules. AMD offers tools and a methodology to ensure mitigation techniques are left intact and design functionality is preserved.
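One widely used form of redundancy-based, design-level mitigation is triple modular redundancy (TMR): three copies of a module compute the same value and a majority voter masks an upset in any single copy. The sketch below models the voter bitwise in Python purely for illustration. In a real device this logic would be written in HDL, which is exactly the kind of redundant structure an optimizer could remove. The function name and data values are hypothetical.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


golden = 0b1011_0010          # value all three copies should hold
upset = golden ^ 0b0000_1000  # one copy with a single bit flipped by an SEU

# The voter masks the flipped bit as long as the other two copies agree.
voted = majority_vote(golden, upset, golden)
print(f"voted = {voted:#010b}, matches golden: {voted == golden}")
```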

For extremely demanding applications intersecting Aerospace and Defense or Functional Safety, AMD offers additional tools and techniques to assist system architects and developers. Please visit our overview of these topics using the Quick Links provided on this page.

Analysis and Verification

Analysis and verification are the most critical pieces for ensuring reliability and availability. AMD takes an open and direct approach to assessing SEU FIT. AMD stands alone in the publication of radiation effects data for commercial devices, via the AMD Device Reliability Report, and uses this data to support pre-design and post-design SEU FIT estimation for reliability and availability analysis.

In order to foster independent verification by interested users and the broader radiation effects community, AMD hardware debug tools support device Configuration RAM read back for verification during radiation effects tests.
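As a rough illustration of pre-design SEU FIT estimation, a first-order estimate multiplies the amount of memory by a published soft error rate per megabit, then optionally derates by the fraction of bits that actually matter to the design. The sketch below assumes that simple model. The function name, the derating parameter, and every number are hypothetical placeholders rather than published AMD data.

```python
def seu_fit(memory_mb, fit_per_mb, critical_fraction=1.0):
    """First-order SEU FIT estimate: memory size * rate per Mb * derating."""
    if not 0.0 <= critical_fraction <= 1.0:
        raise ValueError("critical_fraction must be between 0 and 1")
    return memory_mb * fit_per_mb * critical_fraction


# Placeholder inputs: substitute values from a device reliability report
# and from analysis of the actual design.
raw = seu_fit(memory_mb=100.0, fit_per_mb=50.0)
derated = seu_fit(memory_mb=100.0, fit_per_mb=50.0, critical_fraction=0.1)
print(f"raw SEU FIT: {raw:.0f}, derated SEU FIT: {derated:.0f}")
```

A post-design estimate would refine the derating factor using knowledge of which bits the implemented design actually uses.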

Continuous Improvement

Despite delivering absolute quality with zero-defect targets at production, and exceeding industry reliability and operating lifetimes, AMD applies continuous improvement on a daily basis. It is in our DNA and is driven mostly by stringent markets and their longer-lifetime reliability requirements.

AMD Continuous Improvement Action (CIA) eliminates the causes of non-conformities to prevent recurrence. An automatic escalation process throughout the management chain ensures that CIAs are addressed and closed. The Preventive Action Request (PAR) and Material Review Board (MRB) systems detect and eliminate the causes of potential non-conformities to prevent their occurrence or any impact on customers.

Figure 1. Over the last 8 years, RMAs have been reduced by more than 60% as a result of product quality, customer support, and direct engagement for issue resolution.