• 27 Posts
  • 1.76K Comments
Joined 10 months ago
Cake day: October 4th, 2023


  • Yeah. They can’t replace it with their upcoming 15th gen, because that uses a new, incompatible socket. They’d apparently been handing replacement CPUs out to large customers to replace failed processors, according to one of Steve Burke’s past videos on the subject.

    On a motherboard that has the microcode update which they’re theoretically supposed to get out in a month or so, the processors should at least refrain from destroying themselves, though I expect that they’ll probably run with some degree of degraded performance from the update.

    Just guessing, not anything Burke said, but if there’s enough demand for replacement CPUs, might also be possible that they’ll do another 14th gen production run, maybe fixing the oxidation issue this time, so that the processors could work as intended.




  • I think I went 12th Gen for my brother's computer

    12th gen isn’t affected. The problem affects only the 13th and 14th gen Intel chips. If your brother has 12th gen – and you might want to confirm that – he’s okay.

    For the high-end thing, initially it was speculated that it was just the high-end chips in these generations, but it’s definitely the case that chips other than just the high-end ones have been recorded failing. It may be that the problem is worse with the high-end CPUs, but it’s known to not be restricted to them at this point.

    The bar they list in the article here is 13th and 14th gen Intel desktop CPUs over 65W TDP.


  • I really think that most people who think that they want ARM machines are wrong, at least given the state of things in 2024. Like, maybe you use Linux…but do you want to run x86 Windows binary-only games? Even if you can get 'em running, you’ve lost the power efficiency. What’s hardware support like? Do you want to be able to buy other components? If you like stuff like that Framework laptop, which seems popular on here, an SoC is heading in the opposite direction of that – an all-in-one, non-expandable manufacturer-specified system.

    But yours is a legit application. A non-CPU-constrained datacenter application running open-source software compiled against ARM, where someone else has validated that the hardware is all good for the OS.

    I would not go ARM for a desktop or laptop as things stand, though.


  • To put this another way, Intel had at least three serious failures that let the problem reach this level:

    • A manufacturing defect that led to the flawed CPUs being produced in the first place.

    • A QA failure to detect the flawed CPUs initially (or to be able to quickly narrow down the likely and certain scope of the problem once the issue arose). Not to mention that a second generation of chips with the defect went out the door, presumably (and hopefully) without QA having identified that they were also affected.

    • A customer care issue, in that Intel did not promptly publicly provide customers with information that Intel either had or should have had about the likely scope of the problem, mitigation, and, at least within some bounds of uncertainty ("if it can be proven that the problem is due to an Intel manufacturing defect on a given processor, for some definition of proven, Intel will provide a replacement processor"), what Intel would do for affected customers. A lot of customers spent a lot of time duplicating effort trying to diagnose and address the problem at their level, as well as continuing to buy and use the defective CPUs. It is almost certain that some of that was not necessary.

    The manufacturing failure sucks, fine. But it happens. Intel’s pushing physical limits. I accept that this kind of thing is just one thing that occasionally happens when you do that. Obviously not great, but it happens. This was an especially bad defect, but it’s within the realm of what I can understand and accept. AMD just recalled an initial batch of new CPUs (albeit way, way earlier in the generation than Intel)…they dicked something up too.

    I still don’t understand how the QA failure happened to the degree that it did. Like, yes, it was a hard problem to identify, since it was progressive degradation that took some time to arise, and there were a lot of reasons for other components to potentially be at fault. And CPUs are a fast moving market. You can’t try running a new gen of CPU for weeks or months prior to shipping, maybe. But for Intel to not have identified that they had a problem with the 13th gen at least within certain parameters at least subsequent to release and then to have not held up the 14th gen until it was definitely addressed seems unfathomable to me. Like, does Intel not have a number of CPUs that they just keep hot and running to see if there are aging problems? Surely that has to be part of their QA process, right? I used to work for another PC component manufacturer and while I wasn’t involved in it, I know that they definitely did that as part of their QA process.

    But as much as I think that that QA failure should not have happened, it pales in comparison to the customer care failure.

    Like, there were Intel customers who kept building systems with components that Intel knew or should have known were defective. For a long time, Intel did not promptly issue a public warning saying "we know that there is a problem with this product". They did not pull known defective components from the market, which means that customers kept sinking money into them (and resources trying to diagnose and otherwise resolve the issues). Intel did not issue a public statement about the likely-affected components, even though they were probably in the best position to know. Again, they let customers keep building them into systems.

    They did not issue a statement as to what Intel would do (and I'm not saying that Intel has to conclusively determine that this is an Intel problem, but at least say "if this is shown to be an Intel defect, then we will provide a replacement for parts proven to be defective due to this cause"). They did not issue a statement telling Intel customers what to do to qualify for any such program. Those are all things that I am confident Intel could have done much earlier and which would have substantially reduced how bad this incident was for their customers.

    Instead, their customers were left in isolation to try to figure out the problems individually and come up with mitigations themselves. In many cases, manufacturers of other parts were blamed, and money was spent buying components unnecessarily, or trying to run important services on components that Intel knew or should have known were potentially defective. Whatever failures happen at the manufacturing or QA stages, I expect Intel to get the customer care done correctly, even if Intel does not yet completely understand the scope of the problem or how it could be addressed. And they really did not.


  • 150W instead of 250

    Yeah, when I saw that the CPU could pull 250W, I initially thought that it was a misprint in the spec sheet. That is kind of a nutty number. I have a space heater whose low setting runs at 400W, which is getting into that range, and you can get very low-power space heaters that consume less power than the TDP on that processor. That's an awful lot of heat to be putting into an incredibly small, fragile part.

    That being said, I don’t believe that Intel intentionally passed the initial QA for the 13th generation thinking that there were problems. They probably thought there was a healthy safety margin. You can certainly blame them for insufficient QA or for how they handled the problem as the issue was ongoing, though.

    And you could also have said “this is absurd” at many times in the past when other performance barriers came up. I remember – a long time ago now – when the idea of processors that needed active cooling or they would destroy themselves seemed rather alarming and fragile. I mean, fans do fail. Processors capable of at least shutting down on overheat to avoid destroying themselves, or later throttling themselves, didn’t come along until much later. But if we’d stopped with passive heatsink cooling, we’d be using far slower systems (though probably a lot quieter!)


  • They do say that you can contact Intel customer support if you have an affected CPU, and that they’re replacing CPUs that have been actually damaged. I don’t know – and Intel may not know – what information or proof you need, but my guess is that it’s good odds that you can get a replacement CPU. So there probably is some level of recourse.

    Now, obviously that’s still a bad situation. You’re out the time that you didn’t have a stable system, out the effort you put into diagnosing it, maybe have losses from system downtime (like, I took an out-of-state trip expecting to be able to access my system remotely and had it hang due to the CPU damage at one point), maybe out data you lost from corruption, maybe out money you spent trying to fix the problem (like, on other parts).

    But I’d guess that specifically for the CPU, if it’s clearly damaged, you have good odds of being able to at least get a non-damaged replacement CPU at some point without needing to buy it. It may not perform as well as the generation had initially been benchmarked at. But it should be stable.


  • I mean, I’m sure Intel cares.

    My problem is really in how they handled the situation once they knew that there was a problem, not even the initial manufacturing defect.

    Yes, okay. They didn’t know exactly the problem, didn’t know exactly the scope, and didn’t have a fix. Fine. I get that that is a really hard problem to solve.

    But they knew that there was a problem.

    Putting out a list of known-affected processors and a list of known-possibly-affected processors at the earliest date would have at least let their customers do what is possible to mitigate the situation. And I personally think that they shouldn’t have been selling more of the potentially-affected processors until they’d figured out the problem sufficient to ensure that people who bought new ones wouldn’t be affected.

    And I think that, at first opportunity, they should have advised customers as to what Intel planned to do, at least within the limits of certainty (e.g. if Intel can confirm that the problem is due to an Intel manufacturing or design problem, then Intel will issue a replacement to consumers who can send in affected CPUs) and what customers should do (save purchase documentation or physical CPUs).

    Those are things that Intel could certainly have done but didn’t. This is the first statement they’ve made with some of that kind of information.

    It might have meant that an Intel customer holds off on an upgrade to a potentially-problematic processor. Maybe those customers would have been fine taking the risk or just waiting for Intel to figure out the issue, issue an update, and make sure that they used updated systems with the affected processors. But they would have at least been going into this with their eyes open, and been able to mitigate some of the impact.

    Like, I think that in general, the expectation should be that a manufacturer who has sold a product with a defect should put out what information they can to help customers mitigate the impact, even if that information is incomplete, at the soonest opportunity. And I generally don’t think that a manufacturer should sell a product with known severe defects (of the “it might likely destroy itself in a couple months” variety).

    I think that one should be able to expect that a manufacturer do so even today. If there are some kind of reasons that they are not willing to do so (e.g. concerns about any statement affecting their position in potential class-action suits), I'd like regulators to restructure the rules to eliminate that misincentive. Maybe it could be a stick, like "if you don't issue information dealing with known product defects of severity X within N days, you are exposed to strict liability". Or a carrot, like "any information in public statements provided to consumers with the intent of mitigating harm caused by a defective product may not be introduced as evidence in class action lawsuits over the issue".

    But I want manufacturers of defective products to act, not to just sit there clammed up, even if they haven't figured out the full extent of the problem, because they are almost certainly in a better position to figure out the problem and issue information to mitigate it than their customers individually are. In this case, Intel just silently sat there for a very long time while a lot of their customers tried to figure out the scope of what was going wrong, and often spent a lot of money trying to address the problem themselves, when more information from Intel probably would have avoided them incurring some of those costs.



  • I would have gone AMD in the first place if this happened at the time of my purchase.

    Well, you’ve got better judgement than me. I’d been running just Intel for ~25 years and was comfortable with them, and even when ordering the replacement, still wasn’t absolutely certain that the CPU was at fault until the replacement (temporarily, for a few months) resolved all the problems.

    Moving forward, I expect I’ll use AMD unless they manage to do something like this.

    My last gaming PC served me well for almost 10 years before I did an in-socket upgrade.

    Yeah, not a lot of annual single-threaded performance improvements since the early 2000s. Can very easily use older CPUs just fine for a long time these days, depending upon workload.



  • That was one initial theory, but it’s known to not be the cause. In an earlier video that Steve Burke and Wendell from Level1Techs did, Wendell examined several hundred CPUs that were running in servers on non-Z790 motherboards (another source of potential problems that was initially blamed) at conservative settings, with temperature known and logged for the lifetime of the server (so not a temperature problem). He still saw about a 50% failure rate.

    I also personally destroyed one of my CPUs with motherboard default settings, and the other with Intel’s recommended settings (less aggressive than the motherboard defaults), so I can personally attest to this not just being people running with crazy voltages or something.

    There may also be other issues that people have caused by doing something else, but the elephant in the room has been narrowed down to processors destroying themselves while running well within spec.


  • That is, disappointingly, not sufficient to guarantee avoiding damage. I set all that in the BIOS using my first processor (13900KF) before ever inserting my replacement processor (14900KF) into the motherboard. The replacement processor still destroyed itself.

    Processor 1 used only motherboard defaults and managed to destroy itself.

    Processor 2 used only Intel recommended settings, no XMP memory profile, no Intel turbo boost, more conservative than motherboard defaults, and also destroyed itself.

    I did not try running a processor for its lifetime at minimum memory speed or with only 1 core active. It’s possible that that might be sufficient to avoid damage. If I hadn’t already gone AMD over this, and had to use a processor from the affected generations, that’s what I’d be doing now until Intel comes out with their update. Not gonna do much by way of fancy gaming, but at least the system’s usable and hopefully won’t destroy itself.
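
    If you do end up running one of these chips conservatively in the meantime, you can at least confirm from a booted Linux system that turbo really is off after the BIOS change. This is just a sketch; which sysfs knob exists depends on the cpufreq driver your kernel is using:

    ```shell
    # Check whether turbo/boost is disabled, from a booted Linux system.
    if [ -f /sys/devices/system/cpu/intel_pstate/no_turbo ]; then
        # intel_pstate driver: 1 means turbo is disabled
        cat /sys/devices/system/cpu/intel_pstate/no_turbo
    elif [ -f /sys/devices/system/cpu/cpufreq/boost ]; then
        # acpi-cpufreq and similar: 0 means boost is disabled
        cat /sys/devices/system/cpu/cpufreq/boost
    else
        echo "no turbo/boost control exposed by this kernel/driver"
    fi
    ```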


  • If you can avoid using a new one, I would. I would not buy or use an unused 13th gen or 14th gen Intel CPU until Intel completes their updates.

    In my case, there was a period of time where I had an old, damaged 13th gen CPU, and a new, unused 14th gen.

    I was always able to use my damaged CPUs without problems as long as I booted up Linux and told it to use only one core (maxcpus=1 on the GRUB command line passed to the kernel). With even two cores enabled, it couldn’t boot towards the end, but I never saw corruption with one.

    If I could rewind time, I would continue to use my old CPU and avoid using the new one. I would add maxcpus=1 to my Linux command line (to do it every boot, edit /etc/default/grub and run sudo update-grub on Debian-family systems). And I’d use the damaged CPU on a single core until I knew that Intel had a workaround in microcode and my motherboard had the relevant BIOS update applied, and then I’d swap in the replacement CPU.
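
    To make that concrete, here’s roughly what the edit looks like; this is a sketch done on a sample copy of the file (with example contents I made up), not the real /etc/default/grub:

    ```shell
    # Illustration on a sample copy; on a real Debian-family system you'd
    # edit /etc/default/grub itself as root and then run `sudo update-grub`
    # to regenerate the actual boot config.
    printf '%s\n' 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' > grub.sample
    # Prepend maxcpus=1 inside the quoted default command line
    sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="\)/\1maxcpus=1 /' grub.sample
    cat grub.sample   # GRUB_CMDLINE_LINUX_DEFAULT="maxcpus=1 quiet splash"
    ```

    After a reboot, `nproc` should report 1 if the option took effect.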

    If I didn’t have a known-damaged CPU, just have a still-working 13th or 14th gen processor and could get by using an old desktop or laptop or something until the update is out, I’d probably do that if at all possible, so that I don’t incur damage.


  • I destroyed my second CPU, a 14900KF, while already aware of that recommendation: I had disabled all such settings that the motherboard vendor had enabled by default before ever inserting the replacement CPU, and only ever used the CPU with those settings. It still destroyed itself, like the first. I am very confident that you can still destroy a CPU having done that.

    That isn’t to say that using conservative settings is a bad idea (and maybe doing something further, like running memory at minimum frequency, not just using the Intel recommended default rather than the motherboard vendor defaults, might actually manage to reliably avoid CPU damage). But I am confident that just running standard Intel recommended settings is not, alone, enough to avoid damage.


  • If I had a known unused one, I would absolutely not use it until Intel finishes putting out their patch to motherboards to address this. You have no idea whether you could cause damage that won’t be detected, leaving you with a slightly damaged processor that malfunctions occasionally.

    Intel may publish guidance on how to use unpatched processors. If they don’t – they sure have not been forthcoming with information thus far – here’s my own suggestion.

    When I do use it, I would, prior to booting any OS on the CPU, go into the BIOS and turn everything related to the CPU to minimal performance. Memory speed down, disable Intel turbo boost, everything. If you can disable cores there, disable all but one – even my severely-damaged pair of CPUs could still boot without corrupting my root filesystem as long as I ran using only a single core (though two cores induced problems), and I’d take that as an argument in favor of one core being preferable, though I cannot say for sure that doing so helps avoid damaging the chip rather than just avoiding being affected by the damage once incurred.

    And the first thing I’d do, booted into that minimal-performance-CPU-environment, would be to do that motherboard BIOS update. Then go back and reset the motherboard to defaults and use the thing normally.

    Maybe that’s over-cautious, but we know that the processors destroy themselves with use, and we have no idea what the minimum amount of time – if any – to incur damage is. Unless Intel can come out with some kind of diagnostic to reliably detect damaged CPUs, you won’t know if you damaged your CPU in that window before the BIOS update, and it is maybe occasionally corrupting data, which I’d guess is a situation that you probably don’t want to be in during the lifetime of the CPU.
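
    One thing you can check yourself after applying the BIOS update is the microcode revision the kernel actually loaded. I don’t know what revision number Intel’s fix will ship as – they haven’t said, as far as I know – so treat the “good” value as something to look up; this just shows how to read the current one on Linux:

    ```shell
    # Print CPU model and loaded microcode revision (Linux on x86).
    # Compare the revision against whatever Intel publishes for the fix;
    # that number isn't known to me, so it's something to look up.
    grep -E -m2 'model name|microcode' /proc/cpuinfo
    ```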


  • It can get a whole lot worse.

    I bought a $500 13th gen CPU that destroyed itself, replaced it (and didn’t keep the dead CPU) with a $500 14th gen CPU that destroyed itself, and spent another ~$500 on related hardware and dumping Intel stuff to go AMD to get a working system. I also spent a lot of time trying to resolve the problem. I’d bet that I’m not the person burned worst, because someone could very easily have replaced their motherboard or memory or power supply unit in the hopes of fixing the issue, as any of these could have looked like potential causes, and there’d be no way for anyone to prove to Intel that this was the cause even if Intel intended to reimburse for these.

    At most, I might get $500 back if Intel reimburses for the 14th gen CPU; based on what they’ve been doing so far, I’d assume that at best they’d send out another Intel CPU (which I no longer have a use for, having gone AMD).

    And I was mostly using this system for fun. While my root filesystem was being corrupted regularly at boot towards the end, I ultimately didn’t – as far as I know – suffer any serious data loss or expense from the corruption. My system was mostly to be used for my own entertainment. I didn’t miss deadlines or lose critical information.

    As Steve Burke has pointed out in earlier episodes on this, there are people who have been impacted by those secondary costs, some of which might make my own costs look irrelevant.

    He was talking to video game companies who were using affected processors and also had customers who were affected; they had apparently banned some customers for cheating because they could see that the internal state of the game was incorrect. They couldn’t figure out what the customers were doing, but knew that the game state was being modified. It apparently wasn’t the customers cheating, but their CPUs, which had partially destroyed themselves and were now corrupting memory.

    Another had been using CPUs for video game servers and those kept dying and taking down service; another company estimated that they’d lost $100k in player business due to the problem.

    Apparently these were also popular, due to high single-threaded performance, with hedge funds that do stock trading. I imagine that a system that suddenly stops working or corrupts data can very quickly become extremely expensive in that context, far in excess of what the CPUs cost.

    OEMs who built and sold systems containing these CPUs had apparently been taking back systems and repeatedly replacing parts; they probably incurred substantial costs and hits to their own reputation, as customers were upset with them.

    Same thing with datacenter providers, who incurred a lot of costs investigating and mitigating problems, swapping parts and CPUs. Burke quoted one of these as having advised customers to use an alternate AMD-based system; if they insisted on the Intel one, the provider would charge a $1,000 additional service fee to cover all the costs the provider was incurring in having to deal with systems based on these CPUs. Gives an idea of what they were losing.

    God only knows what the impact of having a ton of data around the world corrupted is. Probably no more than a tiny fraction of the problems related to corruption will ever actually be attributed to the CPUs themselves.

    And I don’t know how many systems out there may not be fully tracked – so they don’t get updates to avoid the problem – and have the CPUs built into them. Industrial automation hardware? Ship navigation systems? Who knows? All kinds of things that might fail in absolutely spectacular ways if they work for a period of time, then down the road, eventually start corrupting data more and more severely.

    I mean, Intel might, at best, provide a cash refund for a dead CPU. But they aren’t gonna cover losses from secondary problems, and there’s no realistic way that most businesses and people who bought these could prove them, anyway.

    Buying the last CPU they made before this clusterfuck occurred is maybe one of the best things you could have done and still be indirectly affected, as you got a reasonably fast system that wasn’t directly affected – if I’d known about this in advance, rather than Intel not saying anything, I’d have happily purchased a 12th gen CPU instead of spending another $1k on useless hardware and a ton of time trying to resolve my problems. You’ll have the option, at upgrade time, to go AMD or 15th gen Intel and LGA 1851, if you want to hope that Intel’s 15th gen is more solid than their previous two. Just means a new motherboard and, if you’re using DDR4 memory, you’ll need to toss that and buy DDR5.