Bug In AMD's Quad-Core Barcelona And Phenom May Be More Serious Than Previously Suspected

On Friday, I thought I'd identified the translation-lookaside buffer (TLB) bug that AMD said was responsible for the problems it's having with its new Barcelona and Phenom quad-core processors. Now, two readers claim that the bug is more serious than I suggested. The reason: while there is a BIOS workaround, they claim the fix results in a big performance penalty. (There's also reportedly an operating system fix with no performance hit.) This may be why heavy volume shipments don't seem to be in the cards until Q1, when updated silicon, now being readied, is available.

Okay, here's the deal. AMD on Thursday issued a statement in which it said: "There has been some talk about an erratum relative to our TLB cache in Barcelona as well [as] Phenom processor resulting in delays. AMD notified customers of this erratum and released a BIOS fix prior to the Nov. 19th launch that resolves it."

I looked through my AMD documentation and came up with what I thought was the bug (erratum) to which the statement referred. I figured it was number 122, "TLB Flush Filter May Cause Coherency Problem in Multicore Systems." Erratum 122 isn't a huge deal; it can be managed by disabling the TLB flush filter.

However, by Friday evening two anonymous readers had posted comments claiming that erratum 122 wasn't the bug at issue, and that the glitch affecting Barcelona and Phenom is in fact more serious than anyone thinks.

Here's what commenter "Fred" wrote:


"Alex, you're wrong. The erratum is not yet in AMD's public documentation. It's #298, or something, and it most certainly IS a show-stopper.

The patch, needed to avoid random crashes, results in [an approximately] 13% penalty to desktop apps and a huge penalty to virtualization. The penalty is so bad that no Tier 1 OEM will ship Barcelona servers until the B3 stepping in Q1.

AMD is left foisting these defective parts on HPC installations willing to take them at a steep discount, and, until recently, an unsuspecting consumer public that was buying 9500 and 9600 Phenoms. AMD made sure these parts were benchmarked by review sites without the performance-killing fix, which is shameless. It really is surprising that there hasn't been a recall."

Here's what the second poster, self-identified using the Slashdot slang "Anonymous Coward," wrote:


"Errata definitely exists, and as Fred pointed out, it's a new number. There are actually two "fixes" for this bug.

1) BIOS-level fix (the 13-20% performance penalty). I've read this errata, it sets two specific hidden registers, surprisingly simple ... which means I'll bet the BIOS-level fix actually disables the L3 cache, the performance penalty is about right.

2) Operating system workaround, word is the performance cost is effectively zero. RedHat has a fix, Microsoft has a fix, VMware (who would take the biggest performance hit) could do one, it's easy to do. Catch is, the OEMs can't guarantee the end customer runs a patched OS, so the OEM would rather wait three months to ship a fixed processor.

One of the rumors I heard suggested this bug affects all processors above a certain speed, 2.0GHz chips wouldn't hit it but 2.4+ are likely, so everything ends up in a low speed bin. Just a rumor."

So, in summary, these posters are claiming that the bug in Phenom and Barcelona causes random crashes and that there are both BIOS and operating-system workarounds, but that one of those fixes -- the BIOS -- results in a big performance penalty.
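
To put the claimed numbers in perspective, here is a back-of-the-envelope sketch in Python. It rests on my own assumption that the commenters' 13% to 20% figures describe lost throughput; neither poster spelled that out, and the function names below are purely illustrative. The point is only that a throughput penalty translates into a somewhat larger increase in run time for a fixed job.

```python
def patched_throughput(baseline: float, penalty: float) -> float:
    """Throughput remaining after a fractional penalty (0.13 means 13%).

    Assumption: the commenters' percentages describe lost throughput.
    """
    if not 0.0 <= penalty < 1.0:
        raise ValueError("penalty must be in the range [0, 1)")
    return baseline * (1.0 - penalty)


def runtime_increase(penalty: float) -> float:
    """Fractional increase in run time for a fixed workload.

    If throughput falls to (1 - penalty) of baseline, the same job
    takes 1 / (1 - penalty) times as long.
    """
    if not 0.0 <= penalty < 1.0:
        raise ValueError("penalty must be in the range [0, 1)")
    return 1.0 / (1.0 - penalty) - 1.0


if __name__ == "__main__":
    # The two figures quoted by the commenters, not measured values.
    for penalty in (0.13, 0.20):
        score = patched_throughput(100.0, penalty)
        slower = runtime_increase(penalty)
        print(f"penalty {penalty:.0%}: score {score:.1f}, run time +{slower:.1%}")
```

Under that assumption, a benchmark that scored 100 without the BIOS fix would land somewhere between 80 and 87 with it, and a fixed job would take roughly 15% to 25% longer.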

I was going to cut the post off here, but I have one final thought: "Anonymous Coward's" closing rumor, above, that the bug affects all processors above a certain speed, makes me wonder whether it has something to do with erratum #169, "System May Hang Due To DMA or Stalled Probe Response." This is an obscure glitch where, under certain timing conditions, the Northbridge hangs. Interestingly, the fix is a BIOS workaround.

Anyway, I have a query in to AMD, and I also invite any readers with knowledge of the situation to comment below.

Finally, I want to state that I remain a huge fan of AMD's innovative new 10h architecture, which is making its first appearance in Barcelona and Phenom. I remain convinced that the success of both of these processor families is important for the industry. I refer you to my earlier piece, "Inside AMD's Phenom And Opteron Quad-Core Architectures."

P.S. Readers who wish to comment directly can e-mail me at [email protected]



Detailed description of erratum 169, which isn't the one the commenters are talking about, just one that I think might be relevant.

P.P.S. For the latest update, see "AMD's Quad-Core Barcelona Bug Revealed."