Has there ever been a truly trouble-free PC? We've gotten a lot closer to it in recent years, thanks to better diagnostics and improved software and hardware engineering -- but every now and then, things fall apart and the center cannot hold, in a big way.
The worst problems of all are the ones that come without warning, maybe also striking again and again without warning, and leave little or nothing for you to analyze when they're done. That's when you need to call in a PC version of Gregory House, Fox TV's caustic but brilliant medical mastermind, or play a version of the role yourself, whittling down possible causes until the patient recovers. Or doesn’t.
Resist the temptation to pitch the whole thing out the window.
The good news is that you don't have to put up with problems like these. Over time I've built up a repository of insights and strategies for dealing with these kinds of difficult-to-trace failures. They take time and effort to track down, but the effort is well spent.
Note that most of the discussion here is aimed at a Windows-centric audience, but many of the same concepts apply to Linux or other OS users, too -- especially tips about hardware.
Types Of Failure
Most of the time, when something goes wrong, there's at least an error message or a warning of some kind, like the infamous Blue Screen of Death, to steer us in the right direction. This piece, though, deals with failures that have no warning at all -- no BSOD, no errors, nothing. The system may hang completely, reboot spontaneously, or even shut itself off without warning.
If there's no BSOD, then the system has been -- to use a euphemism employed by one of my industry colleagues -- "mugged," meaning whatever happened was outside the realm of the operating system's ability to cope with it. Such things generally fall into a few basic categories: hardware failures, electrical problems, and untrappable OS issues.
Hardware Failures are anything from a component going bad to memory failing to a device being mistakenly disconnected. A fair number of hardware failures are "trappable," meaning the OS can anticipate disasters of that variety and warn the user about what went wrong (via a BSOD). But not everything can be trapped in this fashion, simply because there's no way to anticipate it.
Electrical Problems might normally be filed under hardware failures, but I'm breaking them out as a category of their own for a few reasons. For one, electrical problems can come from outside the PC entirely (a frayed power cable, a bad socket, a dying UPS battery) or from within it (a failing system power supply, a faulty soft switch). Also, they can typically be fixed without affecting the rest of the PC or its hardware.
Untrappable Issues include things like badly debugged kernel-level device drivers, which by Microsoft's own statistics are still the biggest cause of system crashes. Video cards, USB/SATA or other bus-control drivers, and audio controllers are three of the biggest culprits in this regard. If one of them goes sour, the whole system can follow suit in an eyeblink.
How To Diagnose The Undiagnosable
We've become used to the idea that the modern PC can give us reasonably detailed information about what might be wrong when things go awry. That hasn't made end-user detective work obsolete -- if anything, it's made it all the more valuable, since the user now has to diagnose what on the face of it might seem wholly undiagnosable. It's not -- it just requires a bit more tenacity and patience than normal.
Sysinternals's AutoRuns tool lets you peek at what's being loaded when you boot.
Remove Or Disable Everything That's Physically Unnecessary
This is a textbook troubleshooting technique, but many people are loath to go to the lengths they need to make it thoroughly effective. A mouse, keyboard, display, and maybe a network connection are all you need to get things going -- and sometimes you can do without the network connection as well. If you have an extra display, unplug it, too -- a second display can be problematic for reasons I'll go into later.
It might also be useful to go into the system BIOS and disable devices that are not in use, if you have the option to do so. Examples: onboard audio or networking, unused bus controllers (e.g., FireWire), or devices that are enabled but never actually used.
Clean Up And Look Around
Now would be as good a time as any to break out the Q-tips and vacuum cleaner. Open up the PC and look around -- sometimes the problem may be something grossly physical that wouldn't come to your attention when the lid is on. Loose or severely bent cabling (especially for hard drives), dust clogs on fans or heat-exchange apparatus, and bulging capacitors should all be considered signs of trouble. This is another reason to run with as little hardware as possible: the less you have inside the system, the easier it is to spot problems like this.
Turn Off Unneeded Kernel-level Objects
Aside from hardware, disable any non-Microsoft drivers or components that aren't absolutely essential. One powerful tool that can be used to this end is Sysinternals's AutoRuns, a program that's something of a big brother to Microsoft's own MSCONFIG. AutoRuns covers a great deal more territory than MSCONFIG, and like that program, anything disabled through it can be re-enabled later on without a great deal of hassle; its effects are totally reversible and nondestructive.
When you run AutoRuns (remember to run it in admin mode!), use the Options | Hide Signed Microsoft Entries menu option to show only files that have been provided by other companies, which are more likely to be the problem. Pay specific attention to the entries in the Drivers and Services tabs, but a once-over in the Everything tab wouldn't be a bad idea if you're patient.
Another good way to get a complete summary of kernel-level objects is through Gabriel Topala's outstanding SIW utility, easily one of the best general-purpose system-information tools out there. The program can generate, in stupefying detail, reports about a system's makeup, including kernel device drivers.
Run the program (again, in admin mode), look in Software | Drivers and sort by the "Type" column, then scroll down to "Kernel Drivers" (with "Running" in the Status column) to see a full rundown of what's currently running as a kernel driver. Right-click on any of those entries to change their running status -- but be very careful what you turn off here, as you could bring your system to a screeching halt if you're careless.
So how do you know what's needed and what's not? This part may require some research on your part, since it isn't always obvious. If you have a guru handy, dump the list out to a file (SIW lets you do this), send it his (or her) way, and have him (or her) peek at it. If your guru can't figure out what a given kernel driver is for, or feels it's creating more trouble than anything else, nix it.
The SIW utility delivers detailed information about what's going on under the hood.
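If you would rather trim the driver list before handing it to your guru, the filtering step described above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: it assumes the list has been exported as CSV with columns named "Module Name", "Display Name", "Driver Type" and "State", which is roughly the layout produced by Windows' built-in `driverquery /v /fo csv` command. SIW's own export may be laid out differently, and the sample rows below (including `exampledrv` and `olddrv`) are made up for the demonstration.

```python
import csv
import io


def running_kernel_drivers(csv_text):
    """Return sorted (module, display name) pairs for running kernel drivers."""
    found = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Values may carry trailing spaces, so strip before comparing.
        driver_type = (row.get("Driver Type") or "").strip()
        state = (row.get("State") or "").strip()
        if driver_type == "Kernel" and state == "Running":
            found.append((row["Module Name"].strip(), row["Display Name"].strip()))
    return sorted(found)


# Hypothetical sample data; a real export would have many more rows and columns.
SAMPLE = '''"Module Name","Display Name","Driver Type","State"
"ACPI","Microsoft ACPI Driver","Kernel ","Running"
"exampledrv","Example Vendor Filter","Kernel ","Running"
"Ntfs","Ntfs","File System ","Running"
"olddrv","Old Scanner Driver","Kernel ","Stopped"
'''

if __name__ == "__main__":
    for module, name in running_kernel_drivers(SAMPLE):
        print(f"{module}\t{name}")
```

The output is just a shortlist to research; the script does not decide which drivers are safe to disable.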
Be Mindful Of Power Issues When Debugging
Electrical problems can be some of the toughest to diagnose because they don't seem to be related to anything happening on the PC itself. They just strike like, well, lightning.
I mentioned before, in passing, that electrical problems can take two forms. One is the power supplied to the PC itself, and if you live in an area with glitchy power you already know about this first-hand. I live on an island in the Atlantic Ocean, where even on less windy days the power to my house is fairly dirty. Consequently, UPSes for each computer and its associated peripherals are mandatory. I should also note that a UPS's power load should be distributed intelligently: don't plug laser printers, for instance, into the battery-backup sockets of a UPS, since there's little reason to give them power protection.
The second form is the power supplied within the PC itself -- the power that the PC's power supply distributes internally. Few people reading this need to be convinced of the wildly varying quality of computer power supplies. Anyone stuck with a low-wattage, no-name or third-tier power supply in their PC automatically has a good reason to drop a few dollars and upgrade to something a little more robust. 500 watts or more is a good margin of safety for most desktop PCs.
Also, be mindful of a common PC component with potentially high electrical consumption that can be a hidden source of problems: the video card. A gaming-quality video card can use up enough juice by itself to count as a compelling argument to upgrade the power supply. Problems with video card power draw can manifest in three ways: BSODs, hard freezes, and (most commonly) that frustratingly inconclusive "Video driver stopped responding and has been restarted" message. That error has caused no end of people to tear their hair out because it doesn't tell you why that happened.
Problems with video cards can show up for reasons that have nothing to do with gaming. On a system that shipped with a 375 watt power supply (low end, to be sure), I added a second display and within barely a week was experiencing all of the above symptoms in various combinations.
I shied away from upgrading the power supply -- which would have been a major hassle -- and sought other solutions. As it turned out, the software control suite for the video card (an ATI Radeon HD 4650) allowed the user to manually override the GPU and memory clock speeds, as well as the fan speed. I set all of these to the lowest possible settings (see the illustration), reattached the second display, and haven't had a problem since.
The Catalyst Control Center for ATI video cards lets you change a whole bevy of normally hidden settings.
I should point out that other devices, such as hard drives and optical drives, typically don't draw all that much power. Removing them as a power-saving measure (as opposed to debugging, as described above) gives you back very little.
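The power-budget reasoning in this section boils down to simple arithmetic, which can be sketched as follows. Treat it as a back-of-the-envelope illustration only: the function name `psu_headroom`, the 20 percent safety margin, and every per-component wattage in `ESTIMATES` are placeholder assumptions, not measurements of any real system. Only the 375 watt and 500 watt supply ratings come from the discussion above.

```python
def psu_headroom(psu_watts, component_watts, margin_percent=20):
    """Return spare watts once the estimated load plus a safety margin is covered.

    A negative result suggests the supply is undersized for the estimate.
    """
    load = sum(component_watts.values())
    # Integer percent keeps the arithmetic exact for whole-watt inputs.
    required = load * (100 + margin_percent) / 100
    return psu_watts - required


# Placeholder figures for illustration only; look up real numbers for your parts.
ESTIMATES = {
    "cpu": 95,
    "video card": 110,
    "motherboard": 50,
    "drives": 30,
    "fans": 10,
}

if __name__ == "__main__":
    for rating in (375, 500):
        print(f"{rating} W supply: {psu_headroom(rating, ESTIMATES):+.0f} W headroom")
```

With these placeholder figures the video card is the largest single entry and the drives are among the smallest, which mirrors the point that pulling drives gives back very little.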
One common cause of random failures is faulty memory -- a bad memory module can turn up in even the most high-end machines. The best way to determine if there's a memory issue with a given machine is to test it, rigorously and repeatedly. Vista has its own memory test application, but you can also download and run a program like Memtest86+ (http://www.memtest.org/), which sports a slightly broader set of test parameters.
The best way to run a memory test is to set it up and let it run overnight: not just one pass, but continuously, for hours on end. If the test program detects an error -- or, worse, if the machine locks up solid -- there's a good chance one of the DIMMs is defective. Sometimes mismatched DIMMs can cause problems; try pulling one and then the other, and see if things go south on you then.
Memtest86+ puts your system's memory through a grueling battery of workouts.
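To see why repeated passes with different patterns matter, here is a toy sketch of the write-then-verify idea behind a memory test. It is not how Memtest86+ itself is implemented: a real tester runs from its own boot media against physical RAM, while this example only exercises a Python object. The `StuckBitMemory` class is a hypothetical stand-in that simulates one faulty cell, and all names here are invented for illustration.

```python
class StuckBitMemory:
    """Simulated faulty memory: bit 0 of one cell is always stored as 1."""

    def __init__(self, size, bad_offset):
        self._cells = bytearray(size)
        self._bad_offset = bad_offset

    def __len__(self):
        return len(self._cells)

    def __setitem__(self, index, value):
        if index == self._bad_offset:
            value |= 0x01  # the stuck bit corrupts whatever is written
        self._cells[index] = value

    def __getitem__(self, index):
        return self._cells[index]


def pattern_pass(buffer, pattern):
    """Fill the buffer with one byte pattern, read it back, return bad offsets."""
    for i in range(len(buffer)):
        buffer[i] = pattern
    return [i for i in range(len(buffer)) if buffer[i] != pattern]


def run_passes(buffer, passes, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Repeat every pattern for the given number of passes and collect errors."""
    errors = []
    for pass_number in range(1, passes + 1):
        for pattern in patterns:
            for offset in pattern_pass(buffer, pattern):
                errors.append((pass_number, pattern, offset))
    return errors


if __name__ == "__main__":
    print(run_passes(bytearray(1024), passes=3))  # healthy buffer: no errors
    print(run_passes(StuckBitMemory(16, bad_offset=3), passes=2))
```

Note that the simulated fault only shows up with the 0x00 and 0xAA patterns, since the other two already have the stuck bit set. That is the reason a tester cycles through several patterns rather than relying on one.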
Get Everything Up-To-Date
This means more than running Windows Update. Your PC manufacturer may have updates not offered through Microsoft -- BIOS patches, for instance, or device drivers not provided in the default Windows installation. Fortunately it's become that much easier to find these things and keep them current -- Sony and Dell, for instance, both have applications that bring you directly to the relevant web page for your system. BIOS updates often go unnoticed, both because they're generally not delivered automatically and because many people are still twitchy about applying them. They shouldn't be: in the past, updating the BIOS typically required booting a DOS disk or something similar, but today the vast majority of such updates can be done from within Windows, quite safely.
The single most important thing -- and the one hardest to remember for many people -- is to be patient and diligent. It's easy to succumb to the temptation to pitch the whole thing out the window and start anew, but that's also an expensive solution -- and brings with it the risk that you'll end up no better off than you were before. Solve a problem like this on your own (or with a little guru oversight), and you'll be that much better equipped to tackle something like this the next time it shows up.