Troubleshooting Intermittent Hardware Problems
An intermittent problem is one that occurs occasionally or unpredictably. This differs from a predictable problem, where, for example, the PC always fails to power up or always fails when a certain program or function is run. With intermittent problems, a good Problem Log is essential to narrowing down the root cause.
Typical causes of intermittent problems are:
Firmware Problem or Configuration: the Operating System or device drivers do not correctly support a component
Heat problems: a component (often the CPU or a graphics card) is overheating
Power problems: the PSU is supplying too little or too much power, or cannot maintain a constant supply
Motherboard problems: a motherboard component (e.g. Northbridge / MCH) that communicates with the other components is damaged or misbehaving
Manufacturing Defect: a particular sub-part of a component is faulty but the PC functions normally until that area is accessed
Hardware is usually very reliable and seldom fails after it has been working correctly for some time (the exception is hard drives, which become more likely to fail with age or rough usage).
The most common cause of intermittent faults is firmware. Firmware here comprises the device drivers required to support each component and their configuration within the Operating System (O/S). As companies strive to produce innovative ways of delivering components, with increased capacity and speed, the components often differ in how they interact with the O/S. This means that a new driver may be required for reliable communication between the device and the O/S, or that different configuration settings may have to be made before the device will operate consistently.
Firmware problems should be suspected if any of the following are true:
If you suspect driver or configuration problems, the best course of action is to search the device manufacturer's website for upgraded drivers or FAQs: manufacturers issue updated drivers or fixes as soon as they are made aware of problems. If no updates are available, you can also contact their support department (giving them as much of your evidence as possible) and ask whether they are aware of the problem: if nothing else, they can generally give you valuable advice on how to troubleshoot their devices further.
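Before searching for an updated driver, it helps to know which driver is actually in use for each device. The following is a minimal sketch, assuming the pciutils package (which provides lspci) is installed; the function name drivers_in_use is made up for this example, and i915 is only an example module name.

```shell
#!/bin/sh
# List each PCI device together with the kernel driver bound to it.
# Reads `lspci -k` output on stdin, so it also works on a saved copy.
drivers_in_use() {
    awk '
        # Device lines are not indented and start with a bus address.
        /^[0-9a-f]/ { device = $0 }
        # Indented detail line naming the driver currently bound.
        /Kernel driver in use:/ {
            sub(/^.*Kernel driver in use: */, "")
            printf "%s\t%s\n", $0, device
        }
    '
}

# Typical use on a live system (uncomment to run):
# lspci -k | drivers_in_use
# modinfo i915    # details of one module; i915 is only an example name
```

The driver name and device description printed here are useful search terms for the manufacturer's website, and useful evidence to pass on to a support department.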
Forums are another useful resource: these are sites where questions can be posted and knowledgeable people answer (if they can). Generally, the best way to proceed is to type a description of the problem (with any specifics of your own computer removed) into a search engine and see if it returns any matches. If you do not find the answer (or even the question), you can sign up to one or more of the technical forums most closely associated with your configuration and post your question there.
Overheating is particularly common in high-end gaming PCs, where there are many high-power components in a confined space, and in overclocked systems, where the user is running the CPU at a faster clock speed than is recommended. The key to diagnosing overheating is to keep an eye on the temperature of your system over time: if it is inexorably climbing (even when the fan is running flat out), then this is likely to be the source of the problem.
You can normally monitor temperatures in the BIOS, but a more convenient way is from within Linux. You can check the temperature of various components using the sensors command, which can be installed from the command line by typing:
sudo apt-get install lm-sensors
Here is an example of using the sensors command:
$ sensors
k10temp-pci-00c3
Adapter: PCI adapter
temp1: +16.6°C (high = +70.0°C, crit = +99.5°C)
atk0110-acpi-0
Adapter: ACPI interface
Vcore Voltage: +1.04 V (min = +0.85 V, max = +1.60 V)
+3.3 Voltage: +3.31 V (min = +2.97 V, max = +3.63 V)
+5 Voltage: +4.86 V (min = +4.50 V, max = +5.50 V)
+12 Voltage: +11.73 V (min = +10.20 V, max = +13.80 V)
CPU FAN Speed: 2537 RPM (min = 600 RPM)
CHASSIS FAN Speed: 2537 RPM (min = 600 RPM)
CPU Temperature: +31.0°C (high = +60.0°C, crit = +95.0°C)
MB Temperature: +27.0°C (high = +45.0°C, crit = +75.0°C)
The useful thing about this is that it can be scheduled with cron to run periodically, with its output redirected to a file on disk. The file can then be imported into a spreadsheet and graphed to identify any trends.
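As a rough illustration of that logging idea, the sketch below turns the temperature lines of the sensors output into CSV rows that a spreadsheet can import. The function names, the wrapper script path and the log file name are all made up for this example, and the parsing assumes your sensors output looks like the sample shown earlier.

```shell
#!/bin/sh
# Turn `sensors` output (stdin) into CSV rows: timestamp,label,celsius
# Only current temperature readings are kept; voltages and fans are skipped.
sensors_to_csv() {
    # $1 = timestamp to put in the first column of every row
    awk -v ts="$1" -F': *' '
        $2 ~ /^[+-]?[0-9.]+°C/ {
            # "+31.0°C (high = ...)" converts to 31: awk reads the leading number
            printf "%s,%s,%.1f\n", ts, $1, $2 + 0
        }
    '
}

# Print first reading, last reading and the change for one label in the log.
temp_change() {
    # $1 = label exactly as logged, e.g. "CPU Temperature"
    awk -F',' -v want="$1" '
        $2 == want { if (!seen) { first = $3; seen = 1 } last = $3 }
        END { if (seen) printf "%.1f %.1f %+.1f\n", first, last, last - first }
    '
}

# A hypothetical wrapper script, /usr/local/bin/log-temps.sh, could end with:
# sensors | sensors_to_csv "$(date '+%Y-%m-%d %H:%M:%S')" >> "$HOME/sensors-log.csv"
# and a crontab entry to run it every 5 minutes would be:
# */5 * * * * /usr/local/bin/log-temps.sh
# Later: temp_change "CPU Temperature" < "$HOME/sensors-log.csv"
```

Writing one row per reading keeps the file easy to filter and pivot in a spreadsheet. Putting the command in a wrapper script also avoids having to escape the % characters of the date format inside the crontab itself.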
There are only two possible cures for overheating: reduce the heat generated, or increase cooling.
The components in your PC are not going to be happy if your PSU cannot supply the right voltage consistently. This points to a problem with the PSU, and it can be diagnosed using the sensors command in the same way as overheating problems.
Here is an example of using the sensors command:
$ sensors
k10temp-pci-00c3
Adapter: PCI adapter
temp1: +16.6°C (high = +70.0°C, crit = +99.5°C)
atk0110-acpi-0
Adapter: ACPI interface
Vcore Voltage: +1.04 V (min = +0.85 V, max = +1.60 V)
+3.3 Voltage: +3.31 V (min = +2.97 V, max = +3.63 V)
+5 Voltage: +4.86 V (min = +4.50 V, max = +5.50 V)
+12 Voltage: +11.73 V (min = +10.20 V, max = +13.80 V)
CPU FAN Speed: 2537 RPM (min = 600 RPM)
CHASSIS FAN Speed: 2537 RPM (min = 600 RPM)
CPU Temperature: +31.0°C (high = +60.0°C, crit = +95.0°C)
MB Temperature: +27.0°C (high = +45.0°C, crit = +75.0°C)
Check that each reading stays between the min and max values shown, and that those limits lie within the tolerances for your CPU; otherwise problems are likely.
If you find that a constant voltage is not being maintained, you can try removing some of the other components (if they are not essential) to lighten the load. If that does not help, a new PSU is required.
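The voltage check can also be automated rather than read by eye. The sketch below compares each voltage reading with the min and max printed on the same line of the sensors output. The function name check_voltages is made up for this example, and the limits it uses are simply whatever sensors reports, which may not match the tolerances of your particular PSU or components.

```shell
#!/bin/sh
# Compare each voltage reading in `sensors` output (stdin) with the min/max
# limits printed on the same line. Exit status is 1 if any reading is outside.
check_voltages() {
    awk -F': *' '
        $2 ~ / V +\(min = .*max = / {
            rest = $2
            gsub(/[^0-9.+-]+/, " ", rest)   # keep only the three numbers
            split(rest, v, " ")             # v[1]=reading v[2]=min v[3]=max
            ok = (v[1] + 0 >= v[2] + 0 && v[1] + 0 <= v[3] + 0)
            if (!ok) bad = 1
            printf "%s: %.2f V (limits %.2f to %.2f) %s\n",
                $1, v[1], v[2], v[3], (ok ? "ok" : "OUT OF RANGE")
        }
        END { exit bad }
    '
}

# Typical use on a live system (uncomment to run):
# sensors | check_voltages || echo "at least one voltage is out of range"
```

A single snapshot only shows the voltages at that moment. For an intermittent fault it is more useful to run a check like this periodically and keep the output, so that a sagging rail can be matched against the times in your Problem Log.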
Motherboard failure is very rare: if it does occur, it is often due to either static damage during installation or a power surge (which is why it is important to protect your PC with a surge protector or, better still, an Uninterruptible Power Supply).
If a component on the motherboard has truly failed, there is only one course of action: a motherboard replacement. This is a major step, and an expensive one, so you need to be absolutely sure that the motherboard is at fault. Be sure to first check that the setup is correct in the BIOS setup program (it normally has a "reset to factory defaults" option that you can try if all else fails), and also check the motherboard manufacturer's website to see if there is a later version of the BIOS which might fix your problem.
Unfortunately, it is unlikely that you will gain definitive proof that the root cause of the problem lies with the motherboard (short of swapping it for a new one); it is normally a case of ruling out everything else before being left with the motherboard hypothesis. Here are some guidelines:
Use your Problem Log and the Linux Log Files to see if the failure occurs during a function controlled by the motherboard
Strip out all but the essential components from your PC: for example, remove any cards and all drives, then try booting from a USB stick to see if the problem persists. If it does not, add components back one at a time (beginning with the drive containing the Operating System) to see if the problem comes back: if it does, it could be a component and not the motherboard that is faulty. In these cases, follow the troubleshooting section for that component within this guide to rule it out as the root cause
Sometimes, you may find that a particular port or slot is damaged: try moving cards or cables to a different slot or port to see if this fixes things. If it does, then the problem lies with the motherboard
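For the first guideline, the kernel log is usually the quickest place to look for hardware-related messages around the time of a failure. The sketch below filters kernel messages for a few phrases that commonly accompany hardware faults. The function name hw_errors and the list of phrases are illustrative assumptions: treat the list as a starting point rather than a complete one, and compare the timestamps of any matches with your Problem Log.

```shell
#!/bin/sh
# Keep only kernel log lines (stdin) that commonly accompany hardware faults.
# The pattern list is an illustrative starting point, not an exhaustive one.
hw_errors() {
    grep -i -E 'machine check|hardware error|pcie bus error|i/o error|ata[0-9]+(\.[0-9]+)?: (failed|exception)'
}

# Typical use (uncomment to run):
# dmesg | hw_errors                 # kernel messages from the current boot
# journalctl -k -b -1 | hw_errors   # previous boot; needs systemd with a
#                                   # persistent journal
```

If the matching lines consistently name one device or bus, that component is a better suspect than the motherboard as a whole.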