Troubleshooting

This section contains troubleshooting procedures for common situations as well as for specific problems.

Common Troubleshooting Procedures

This section describes commands and procedures that can be used in troubleshooting. Topics covered include:

Cables Attached Correctly
System Requirements
Drivers and Firmware
Isolate Hardware Problems
Rescan to Update Information on SCSI Controllers
Replacing a Failed Disk
Recovering from Removing the Wrong Physical Disk
Resolving Microsoft Windows Upgrade Problems

Cables Attached Correctly

Verify that the power-supply cord and adapter cables are attached correctly. If the system is having trouble with read and write operations to a particular virtual disk or non-RAID physical disk (if the system hangs, for example), then make sure that the cables attached to the corresponding enclosure or backplane are secure. If the connection is secure but the problem persists, you may need to replace a cable. Also see Isolate Hardware Problems.

On SAS controllers, you should verify that the cable configuration is valid. Refer to the SAS hardware documentation for valid cable configurations. If the cable configuration is invalid, you may receive alerts 2182 or 2356.

For information on Alert Messages, see the Dell OpenManage Server Administrator Messages Reference Guide at support.dell.com/manuals.

System Requirements

Make sure that the system meets all system requirements. In particular, verify that the correct levels of firmware and drivers are installed on the system. For more information on drivers and firmware, see Drivers and Firmware.

Drivers and Firmware

Storage Management is tested with the supported controller firmware and drivers. In order to function properly, the controller must have the minimum required version of the firmware and drivers installed. The most current versions can be obtained from support.dell.com.

NOTE: You can verify which firmware and drivers are installed by selecting the Storage object in the tree view and clicking the Information/Configuration tab. You can also check the Alert Log for alerts relating to unsupported firmware and driver versions.

It is also recommended that you periodically obtain and apply the latest Dell PowerEdge server system BIOS to benefit from the most recent improvements. For more information, see the Dell PowerEdge system documentation.

Isolate Hardware Problems

If you receive a "timeout" alert related to a hardware device, or if you otherwise suspect that a device attached to the system is failing, do the following to confirm the problem:

Verify that the cables are correctly attached.
If the cables are correctly attached and you are still experiencing the problem, then disconnect the device cables and reboot the system. If the system reboots successfully, then one of the devices may be defective. Refer to the hardware device documentation for more information.

Rescan to Update Information on SCSI Controllers

On SCSI controllers, use the Rescan controller task to update information for the controller and attached devices. This operation may take a few minutes if there are a number of devices attached to the controller.

If the Rescan does not properly update the disk information, you may need to reboot your system.

Replacing a Failed Disk

You may need to replace a failed disk in the following situations:

Replacing a Failed Disk that is Part of a Redundant Virtual Disk
Replacing a Failed Physical Disk that is Part of a Non-Redundant Virtual Disk
Replacing a Failed Physical Disk in a RAID 1 on a CERC SATA1.5/2s

Replacing a Failed Disk that is Part of a Redundant Virtual Disk

If the failed disk is part of a redundant virtual disk, then the disk failure should not result in data loss. You should replace the failed disk immediately, however, as additional disk failures can cause data loss.

If the redundant virtual disk has a hot spare assigned to it, then the data from the failed disk is rebuilt onto the hot spare. After the rebuild, the former hot spare functions as a regular physical disk and the virtual disk is left without a hot spare. In this case, you should replace the failed disk and make the replacement disk a hot spare.
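The hot-spare behavior described above can be pictured as a small state change. The following is a minimal Python sketch under stated assumptions: the `VirtualDisk` class, its fields, and `fail_member` are hypothetical names used only for illustration and are not part of any Storage Management interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VirtualDisk:
    """Hypothetical model of a redundant virtual disk (not a Storage Management API)."""
    members: List[int]               # slot IDs of the member physical disks
    hot_spare: Optional[int] = None  # slot ID of the assigned hot spare, if any


def fail_member(vd: VirtualDisk, failed_slot: int) -> bool:
    """Model a member failure; return True if a rebuild onto the hot spare starts."""
    if vd.hot_spare is None:
        # No hot spare: the virtual disk stays degraded until the disk is replaced.
        return False
    # The hot spare takes the failed disk's place and becomes a regular member.
    vd.members[vd.members.index(failed_slot)] = vd.hot_spare
    # The virtual disk is now left without a hot spare.
    vd.hot_spare = None
    return True


vd = VirtualDisk(members=[0, 1, 2], hot_spare=3)
print(fail_member(vd, 1), vd.members, vd.hot_spare)  # True [0, 3, 2] None
```

After the rebuild, `hot_spare` is `None`. For this reason, the replacement disk should then be assigned as the new hot spare.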

NOTE: If the redundant virtual disk does not have a hot spare assigned to it, then replace the failed disk using the procedure described in Replacing a Physical Disk Receiving SMART Alerts.

Replacing the Disk:

1 Remove the failed disk.
2 Insert a new disk. Make sure that the new disk is the same size as or larger than the disk you are replacing. On some controllers, you may not be able to use the additional disk space if you insert a larger disk. For more information, see Virtual Disk Considerations for Controllers.

A rebuild is automatically initiated because the virtual disk is redundant.
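The size rule in step 2 can be expressed as a simple comparison. This is a minimal Python sketch; `check_replacement` is a hypothetical helper, and it assumes that on controllers unable to use the extra space, usable capacity is capped at the size of the disk being replaced.

```python
def check_replacement(failed_gb: int, new_gb: int) -> tuple:
    """Return (acceptable, possibly_unused_gb) for a replacement disk.

    Assumption: on controllers that cannot use the additional space, usable
    capacity is capped at the size of the disk being replaced.
    """
    if new_gb < failed_gb:
        # A smaller disk cannot replace the failed member.
        return (False, 0)
    # Equal or larger is acceptable; any surplus may go unused on some controllers.
    return (True, new_gb - failed_gb)


print(check_replacement(146, 300))  # (True, 154)
print(check_replacement(300, 146))  # (False, 0)
```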

Assigning a Hot Spare:

If a hot spare was already assigned to the virtual disk, then data from the failed disk may already have been rebuilt onto the hot spare. In this case, you need to assign a new hot spare. For more information, see Assign and Unassign Dedicated Hot Spare and Assign and Unassign Global Hot Spare.

Replacing a Failed Physical Disk that is Part of a Non-Redundant Virtual Disk

If the failed physical disk is part of a non-redundant virtual disk (such as RAID 0), then the failure of a single physical disk causes the entire virtual disk to fail. To proceed, determine when your last backup was made and whether any new data has been written to the virtual disk since that time.

If you have backed up recently and there is no new data on the disks that would be missed, you can restore from backup.

NOTE: If the failed disk is attached to a PERC 4/SC, 4/DC, 4e/DC, or 4/Di controller, you can attempt to recover data from the disk by using the procedure described in Using the Physical Disk Online Command on Select Controllers before continuing with the following procedure.

Do the following:

1 Delete the virtual disk which is currently in a failed state.
2 Remove the failed physical disk.
3 Insert a new physical disk.
4 Create a new virtual disk.
5 Restore from backup.
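The choice between the redundant and non-redundant procedures in this section can be summarized as a small decision function. This is a minimal Python sketch; `failed_disk_plan` and the set of RAID levels treated as redundant are assumptions for illustration. The optional forced-Online attempt on select PERC 4 controllers is not modeled.

```python
# Assumption: these RAID levels are treated as redundant; RAID 0 is not.
REDUNDANT_LEVELS = {1, 5, 6, 10, 50, 60}


def failed_disk_plan(raid_level: int, hot_spare_assigned: bool) -> list:
    """Return the replacement steps described in this section for a failed disk."""
    if raid_level not in REDUNDANT_LEVELS:
        # Non-redundant: the whole virtual disk fails and data comes from backup.
        return [
            "delete the failed virtual disk",
            "remove the failed physical disk",
            "insert a new physical disk",
            "create a new virtual disk",
            "restore from backup",
        ]
    steps = [
        "remove the failed disk",
        "insert a new disk of the same size or larger",
    ]
    if hot_spare_assigned:
        # Data has already been rebuilt onto the hot spare.
        steps.append("assign the replacement disk as a hot spare")
    else:
        steps.append("a rebuild starts automatically on the new disk")
    return steps


for step in failed_disk_plan(0, hot_spare_assigned=False):
    print(step)
```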

Using the Physical Disk Online Command on Select Controllers

Does my controller support this feature? See Supported Features

If you do not have a suitable backup available, and if the failed disk is part of a virtual disk on a controller that supports the Online physical disk task, then you can attempt to retrieve data by selecting Online from the failed disk's drop-down task menu.

The Online command attempts to force the failed disk back into an Online state. If you are able to force the disk into an Online state, you may be able to recover individual files. How much data you can recover depends on the extent of the disk damage. File recovery is only possible if a limited portion of the disk is damaged.

There is no guarantee that you will be able to recover any data using this method. A forced Online does not fix a failed disk, so you should not attempt to write new data to the virtual disk.

After retrieving any viable data from the disk, replace the failed disk as described previously in Replacing a Failed Disk that is Part of a Redundant Virtual Disk or Replacing a Failed Physical Disk that is Part of a Non-Redundant Virtual Disk.

Replacing a Failed Physical Disk in a RAID 1 on a CERC SATA1.5/2s

On a CERC SATA1.5/2s controller, a rebuild may not start automatically when you replace a failed physical disk that is part of a RAID 1 virtual disk. In this circumstance, use the following procedure to replace the failed physical disk and rebuild the redundant data.

1 Turn off the system.
2 Disconnect the SATA cable on the failed physical disk in the RAID 1 virtual disk.
3 Replace the failed physical disk with a formatted physical disk. You can format the physical disk using the Disk Utilities in the controller BIOS. (You may not need to format the entire physical disk. Formatting 1% of the disk may be sufficient.)
4 Reboot the system. When rebooted, the RAID 1 virtual disk should display a Failed Redundancy state.
5 Expand the controller object in the tree view and select the Physical Disks object.
6 Execute the Rebuild task for the physical disk you added.

Recovering from Removing the Wrong Physical Disk

If the physical disk that you mistakenly removed is part of a redundant virtual disk that also has a hot spare, then the virtual disk rebuilds automatically either immediately or when a write request is made. After the rebuild has completed, the virtual disk no longer has a hot spare since data has been rebuilt onto the disk previously assigned as a hot spare. In this case, you should assign a new hot spare.

If the physical disk that you removed is part of a redundant virtual disk that does not have a hot spare, then replace the physical disk and do a rebuild.

For information on rebuilding physical disks and assigning hot spares, see the following sections:

Understanding Hot Spares for RAID controllers
Rebuild for PERC 4/SC, 4/DC, 4e/DC, 4/Di, PERC 5/E and PERC 5/i controllers

You can avoid removing the wrong physical disk by blinking the LED display on the physical disk that you intend to remove. For information on blinking the LED display, see Blink and Unblink (Physical Disk).

Resolving Microsoft Windows Upgrade Problems

If you upgrade the Microsoft Windows operating system on a server, you may find that Storage Management no longer functions after the upgrade. The installation process installs files and makes registry entries on the server that are specific to the operating system. For this reason, changing the operating system can disable Storage Management.

To avoid this problem, you should uninstall Storage Management before upgrading. If you have already upgraded without uninstalling Storage Management, however, you should uninstall Storage Management after the upgrade.

After you have uninstalled Storage Management and completed the upgrade, reinstall Storage Management using the Storage Management install media. You can download Storage Management from support.dell.com.

Virtual Disk Troubleshooting

The following sections describe troubleshooting procedures for virtual disks.

Replacing a Failed Disk that is Part of a Redundant Virtual Disk
Replacing a Failed Physical Disk in a RAID 1 on a CERC SATA1.5/2s
A Rebuild Does Not Work
A Rebuild Completes with Errors
Cannot Create a Virtual Disk
A Virtual Disk of Minimum Size is Not Visible to Windows Disk Management
Virtual Disk Errors on Linux
Problems Associated With Using the Same Physical Disks for Both Redundant and Non-Redundant Virtual Disks

A Rebuild Does Not Work

A rebuild does not work in the following situations:

The virtual disk is non-redundant—For example, a RAID 0 virtual disk cannot be rebuilt because RAID 0 does not provide data redundancy.
There is no hot spare assigned to the virtual disk—As long as the virtual disk is redundant, you can rebuild it in either of the following ways:
- Pull out the failed physical disk and replace it. A rebuild automatically starts on the new disk.
- Assign a hot spare to the virtual disk and then perform a rebuild.
You are attempting to rebuild onto a hot spare that is too small—Different controllers have different size requirements for hot spares. For more information on disk size requirements, see Considerations for Hot Spares on PERC 4/SC, 4/DC, 4e/DC, 4/Di, 4e/Si, 4e/Di, PERC 5/E, PERC 5/i, PERC 6/E, PERC 6/I, and CERC 6/I Controllers and Considerations for Hot Spares on CERC SATA1.5/6ch, S100, and S300 Controllers.
The hot spare has been unassigned from the virtual disk—This could happen on some controllers if the hot spare was assigned to more than one virtual disk and has already been used to rebuild a failed physical disk for another virtual disk. For more information, see Considerations for Hot Spares on CERC SATA1.5/6ch, S100, and S300 Controllers.
On SCSI controllers, both redundant and non-redundant virtual disks reside on the same set of physical disks—On the PERC 4/SC, 4/DC, 4e/DC, and 4/Di controllers, a rebuild is not performed for a physical disk that is used by both redundant and non-redundant virtual disks. In order to rebuild the redundant virtual disk, you need to delete the non-redundant virtual disk. Before deleting this disk, however, you can attempt to recover data from the failed physical disk by forcing it back online. For more information, see Using the Physical Disk Online Command on Select Controllers.
A physical disk has been removed, and the system has not yet attempted to write data to the removed disk—In this case, the system does not recognize the removal of a physical disk until it attempts a write operation to the disk. If the physical disk is part of a redundant virtual disk, then the system rebuilds the disk after attempting a write operation. This situation applies to PERC 4/SC, 4/DC, 4e/DC, and 4/Di controllers.
The virtual disk includes failed or corrupt physical disks—This situation may generate alert 2083. For information on Alert Messages, see the Dell OpenManage Server Administrator Messages Reference Guide at support.dell.com/manuals.
The rebuild rate setting is too low—If the rebuild rate setting is quite low and the system is processing a number of operations, then the rebuild may take an unusual amount of time to complete. For more information, see Set Rebuild Rate.
The rebuild was cancelled—Another user can cancel a rebuild that you have initiated.
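Several of the conditions above can be checked mechanically. The following is a minimal Python sketch; `RebuildContext`, `rebuild_blockers`, and the `min_spare_gb` field are hypothetical, and the minimum hot spare size is a controller-specific value that you would take from the hot spare considerations referenced above. Situations that only delay a rebuild (a low rebuild rate, or a removed disk that has not yet been written to) are not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RebuildContext:
    """Hypothetical snapshot of the conditions listed above (not a real API)."""
    redundant: bool
    hot_spare_gb: Optional[int]   # None when no hot spare is assigned
    min_spare_gb: int             # assumed controller-specific minimum spare size
    shares_disks_with_non_redundant: bool
    has_failed_or_corrupt_members: bool
    cancelled: bool = False


def rebuild_blockers(ctx: RebuildContext) -> List[str]:
    """Return the reasons, from the list above, that would prevent a rebuild."""
    reasons = []
    if not ctx.redundant:
        reasons.append("virtual disk is non-redundant")
    if ctx.hot_spare_gb is None:
        reasons.append("no hot spare: replace the failed disk or assign a hot spare")
    elif ctx.hot_spare_gb < ctx.min_spare_gb:
        reasons.append("hot spare is too small")
    if ctx.shares_disks_with_non_redundant:
        reasons.append("physical disks shared with a non-redundant virtual disk")
    if ctx.has_failed_or_corrupt_members:
        reasons.append("failed or corrupt member disks (alert 2083)")
    if ctx.cancelled:
        reasons.append("rebuild was cancelled")
    return reasons


ctx = RebuildContext(redundant=True, hot_spare_gb=73, min_spare_gb=146,
                     shares_disks_with_non_redundant=False,
                     has_failed_or_corrupt_members=False)
print(rebuild_blockers(ctx))  # ['hot spare is too small']
```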

A Rebuild Completes with Errors

This section applies to PERC 4/SC, 4/DC, 4e/DC, 4/Di, 4e/Si, and 4e/Di controllers.

In some situations, a rebuild may complete successfully while also reporting errors. This may occur when a portion of the disk containing redundant (parity) information is damaged. The rebuild process can restore data from the healthy portions of the disk but not from the damaged portion.

When a rebuild is able to restore all data except data from damaged portions of the disk, it indicates successful completion while also generating alert 2163.

For information on Alert Messages, see the Dell OpenManage Server Administrator Messages Reference Guide at support.dell.com/manuals.

The rebuild may also report sense key errors. In this situation, take the following actions to restore the maximum data possible:

1 Back up the degraded virtual disk onto a fresh (unused) tape.
- If the backup is successful, the user data on the virtual disk has not been damaged. In this case, you can continue with step 2.
- If the backup encounters errors, the user data has been damaged and cannot be recovered from the virtual disk. In this case, the only possibility for recovery is to restore from a previous backup of the virtual disk.
2 Perform a Check Consistency on the virtual disk that you have backed up onto tape.
3 Restore the virtual disk from the tape onto healthy physical disks.

Cannot Create a Virtual Disk

You might be attempting a RAID configuration that is not supported by the controller. Check the following:

How many virtual disks already exist on the controller? Each controller supports a maximum number of virtual disks. See Maximum Number of Virtual Disks per Controller for more information.
Is there adequate available space on the physical disks? The physical disks that you have selected for creating the virtual disk must have an adequate amount of free space available.
The controller may be performing other tasks, such as rebuilding a physical disk, that must run to completion before the controller can create the new virtual disk.
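The three checks above can be combined into a single precheck. This is a minimal Python sketch; `create_vd_blockers` and all of its parameters are hypothetical inputs that you would gather yourself, and `max_vds` must come from the controller's documented limit (the values in the usage line are examples only).

```python
from typing import List


def create_vd_blockers(existing_vds: int, max_vds: int,
                       free_gb: List[int], needed_gb_per_disk: int,
                       controller_busy: bool) -> List[str]:
    """Return reasons, from the checklist above, why virtual disk creation may fail."""
    reasons = []
    if existing_vds >= max_vds:
        reasons.append("controller already has its maximum number of virtual disks")
    # Indexes of selected physical disks without enough free space.
    short = [i for i, gb in enumerate(free_gb) if gb < needed_gb_per_disk]
    if short:
        reasons.append(f"insufficient free space on selected disks {short}")
    if controller_busy:
        reasons.append("controller is busy with another task, such as a rebuild")
    return reasons


# Example values only; they are not limits of any particular controller.
print(create_vd_blockers(2, 40, [100, 20, 100], 50, False))
```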

A Virtual Disk of Minimum Size is Not Visible to Windows Disk Management

If you create a virtual disk using the minimum allowable size in Storage Management, the virtual disk may not be visible to Windows Disk Management even after initialization. This occurs because Windows Disk Management is only able to recognize extremely small virtual disks if they are dynamic. It is generally advisable to create virtual disks of larger size when using Storage Management.

Virtual Disk Errors on Linux

On some versions of the Linux operating system, the virtual disk size is limited to 1TB. If you create a virtual disk that exceeds the 1TB limitation, your system may experience the following behavior:

I/O errors to the virtual disk or logical drive
Inaccessible virtual disk or logical drive
Virtual disk or logical drive size is smaller than expected

If you have created a virtual disk that exceeds the 1TB limitation, you should do the following:

1 Back up your data.
2 Delete the virtual disk.
3 Create one or more virtual disks that are smaller than 1TB.
4 Restore your data from backup.
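Step 3 amounts to dividing the required capacity into pieces that each stay below the limit. This is a minimal Python sketch; `split_capacity` is a hypothetical helper, and it assumes that the 1TB limit is treated as 1024 GiB. Substitute the value documented for your operating system version.

```python
from typing import List


def split_capacity(total_gib: int, limit_gib: int = 1024) -> List[int]:
    """Split total_gib into virtual disk sizes that are each smaller than limit_gib.

    Assumption: the 1TB limit is treated as 1024 GiB.
    """
    if total_gib <= 0 or limit_gib <= 1:
        raise ValueError("total_gib must be positive and limit_gib greater than 1")
    max_each = limit_gib - 1                # strictly smaller than the limit
    count = -(-total_gib // max_each)       # ceiling division: fewest disks that fit
    base, extra = divmod(total_gib, count)
    # Spread the remainder so that sizes differ by at most 1 GiB.
    return [base + 1 if i < extra else base for i in range(count)]


print(split_capacity(3000))  # [1000, 1000, 1000]
print(split_capacity(2047))  # [683, 682, 682]
```

Spreading the capacity evenly, rather than filling each virtual disk to just under the limit, keeps the resulting virtual disks similar in size.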

Whether your Linux operating system limits the virtual disk size to 1TB depends on the version of the operating system and any updates or modifications that you have implemented. For more information, see your operating system documentation.

Problems Associated With Using the Same Physical Disks for Both Redundant and Non-Redundant Virtual Disks

When creating virtual disks, you should avoid using the same physical disks for both redundant and non-redundant virtual disks. This recommendation applies to all controllers. Using the same physical disks for both redundant and non-redundant virtual disks can result in unexpected behavior including data loss.

NOTE: SAS controllers do not allow you to create redundant and non-redundant virtual disks on the same set of physical disks.

Considerations for CERC SATA1.5/6ch and CERC SATA1.5/2s Controllers When Physical Disks are Shared by Redundant and Non-Redundant Virtual Disks

This section describes behavior that may occur on the CERC SATA1.5/6ch and CERC SATA1.5/2s controllers if you use the same physical disks for both redundant and non-redundant virtual disks. In this type of configuration, the failure or removal of a physical disk can cause the following behavior:

The non-redundant virtual disk displays a Failed state.

Resolution: This behavior is expected because the virtual disk is non-redundant. In this case, the failure or removal of a single physical disk causes the entire virtual disk to fail with no possibility of recovering the data unless a backup is available.

The redundant virtual disks display a Degraded state.

Resolution: This behavior is also expected. Data can be recovered if a hot spare is available to rebuild the failed or removed disk.

Various disks display an Offline state. The Offline state may apply to all physical disks used by the redundant and non-redundant virtual disks.

Resolution: Perform a Rescan Controller. When the rescan is complete, select each physical disk that is Offline and perform a Remove Dead Segments task. You must remove the dead segments before the physical disk can be brought back online. The dead segments are caused by the failure or removal of the shared physical disk.
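The first two behaviors above follow directly from which virtual disks use the shared physical disk. This is a minimal Python sketch; `states_after_failure`, the "Unaffected" label, and the set of RAID levels treated as redundant are assumptions for illustration. The Offline state and dead segments described in the last item are not modeled.

```python
from typing import Dict, Set, Tuple

# Assumption: RAID levels treated as redundant for this illustration.
REDUNDANT_LEVELS = {1, 5, 6, 10, 50, 60}


def states_after_failure(vds: Dict[str, Tuple[int, Set[int]]],
                         failed_disk: int) -> Dict[str, str]:
    """Predict virtual disk states after one shared physical disk fails or is removed.

    vds maps a virtual disk name to (RAID level, set of member physical disk IDs).
    """
    states = {}
    for name, (level, members) in vds.items():
        if failed_disk not in members:
            states[name] = "Unaffected"
        elif level in REDUNDANT_LEVELS:
            states[name] = "Degraded"   # recoverable if a hot spare can rebuild
        else:
            states[name] = "Failed"     # unrecoverable without a backup
    return states


shared = {"vd0": (0, {0, 1}), "vd1": (1, {0, 1}), "vd2": (5, {2, 3, 4})}
print(states_after_failure(shared, 1))
# {'vd0': 'Failed', 'vd1': 'Degraded', 'vd2': 'Unaffected'}
```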

NOTE: It is recommended that you avoid using the same physical disks for both redundant and non-redundant virtual disks.

Specific Problem Situations and Solutions

This section contains additional troubleshooting problem areas. Topics include:

Physical Disk is Offline or Displays an Error Status
A Disk is Marked as Failed When Rebuilding in a Cluster Configuration
A Disk on a PERC 4/Di Controller Does not Return Online after a Prepare to Remove
Receive a "Bad Block" Alert with "Replacement," "Sense," or "Medium" Error
Read and Write Operations Experience Problems
I/O Stops When a Redundant Channel Fails
A Task Menu Option is Not Displayed
A Corrupt Disk or Drive Message Suggests Running autocheck During a Reboot
Erroneous Status and Error Messages after a Windows Hibernation
Storage Management May Delay Before Updating Temperature Probe Status
Storage Management May Delay Displaying Storage Devices After Reboot
You are Unable to Log into a Remote System
Cannot Connect to Remote System Running Windows Server 2003
Reconfiguring a Virtual Disk Displays Error in Mozilla Browser
Physical Disks Display Under Connector Not Enclosure Tree Object

Physical Disk is Offline or Displays an Error Status

A physical disk may display an error status if it has been damaged, taken offline, or was a member of a virtual disk that has been deleted or initialized. The following actions may resolve the error condition:

If a user has taken the disk offline, then return the disk to Online status by executing the Online disk task.
Rescan the controller. This action updates the status of storage objects attached to the controller. If the error status was caused by deleting or initializing a virtual disk, rescanning the controller should resolve this problem.
Investigate whether there are any cable, enclosure, or controller problems preventing the disk from communicating with the controller. If you find a problem and resolve it, you may need to rescan the controller to return the disk to Online or Ready status. If the disk does not return to Online or Ready status, reboot the system.
If the disk is damaged, replace it. For more information, see Replacing a Failed Disk.

A Disk is Marked as Failed When Rebuilding in a Cluster Configuration

When a system in a cluster attempts to rebuild a failed disk but the rebuild fails, then another system takes over the rebuild. In this situation, you may notice that the rebuilt disk continues to be marked as failed on both systems even after the second system has rebuilt successfully. To resolve this problem, perform a rescan on both systems after the rebuild completes successfully.

A Disk on a PERC 4/Di Controller Does not Return Online after a Prepare to Remove

When you do a Prepare to Remove command on a physical disk attached to a PERC 4/Di controller, you may find that the disk does not display in the Storage Management tree view even after doing a rescan or a reboot.

In this case, to redisplay the disk in the Storage Management tree view:

1 Manually remove and then replace the physical disk.
2 Rescan the controller or reboot the system.

Receive a "Bad Block" Alert with "Replacement," "Sense," or "Medium" Error

The following alerts or events are generated when a portion of a physical disk is damaged:

2146
2147
2148
2149
2150

This damage is discovered when the controller performs an operation that requires scanning the disk. Examples of operations that may result in these alerts are as follows:

Consistency check
Rebuild
Virtual disk format
I/O

If you receive alerts 2146 through 2150 as the result of a rebuild or while the virtual disk is in a degraded state, then data cannot be recovered from the damaged disk without restoring from backup. If you receive alerts 2146 through 2150 under other circumstances, then data recovery may be possible. The following sections describe each of these situations.

Alerts 2146 through 2150 Received during a Rebuild or while a Virtual Disk is Degraded

Do the following if you receive alerts 2146 through 2150 during a rebuild or while the virtual disk is in a degraded state:

1 Replace the damaged physical disk.
2 Create a new virtual disk and allow the virtual disk to completely resynchronize. While the resynchronization is in progress, the status of the virtual disk is Resynching.
3 Restore data to the virtual disk from backup.

Alerts 2146 through 2150 Received while Performing I/O, Consistency Check, Format, or Other Operation

If you receive alerts 2146 through 2150 while performing an operation other than a rebuild, you should replace the damaged disk immediately to avoid data loss.

Do the following:

1 Back up the degraded virtual disk to a fresh (unused) tape.
2 Replace the damaged disk.
3 Do a rebuild.
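The two procedures above depend only on the alert number and on when it was received. This is a minimal Python sketch; `bad_block_procedure` is a hypothetical helper that returns the steps from this section as text.

```python
from typing import List, Optional

BAD_BLOCK_ALERTS = range(2146, 2151)  # alerts 2146 through 2150


def bad_block_procedure(alert_id: int, rebuild_or_degraded: bool) -> Optional[List[str]]:
    """Return the procedure from this section for a bad-block alert, or None."""
    if alert_id not in BAD_BLOCK_ALERTS:
        return None
    if rebuild_or_degraded:
        # Data cannot be recovered from the damaged disk; restore from backup.
        return [
            "replace the damaged physical disk",
            "create a new virtual disk and let it fully resynchronize",
            "restore data from backup",
        ]
    # Recovery may be possible; act immediately to avoid data loss.
    return [
        "back up the degraded virtual disk to a fresh (unused) tape",
        "replace the damaged disk",
        "perform a rebuild",
    ]


print(bad_block_procedure(2148, rebuild_or_degraded=False)[0])
```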

Read and Write Operations Experience Problems

If the system is hanging, timing out, or experiencing other problems with read and write operations, then there may be a problem with the controller cables or a device. For more information, see Cables Attached Correctly and Isolate Hardware Problems.

I/O Stops When a Redundant Channel Fails

If you have implemented channel redundancy on a PERC 4/SC, 4/DC, 4e/DC, or 4/Di controller, a failure of one channel causes I/O to stop on the other channels included in the channel-redundant configuration. To resolve this issue, see Channel Redundancy on PERC 4/DC, 4e/DC, 4/Di, and 4e/Di Controllers.

A Task Menu Option is Not Displayed

You may notice that the task menus do not always display the same task options. This is because Storage Management only displays those tasks that are valid at the time the menu is displayed. Some tasks are only valid for certain types of objects or at certain times. For example, a Check Consistency task can only be performed on a redundant virtual disk. Similarly, if a disk is already offline, the Offline task option is not displayed.

There may be other reasons why a task cannot be run at a certain time. For example, there may already be a task running on the object that must complete before additional tasks can be run.
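The filtering described above can be illustrated with two small functions. This is a simplified Python sketch that covers only the examples given in this section; `physical_disk_tasks` and `virtual_disk_tasks` are hypothetical, and hiding all tasks while another task is running is an assumption.

```python
from typing import List


def physical_disk_tasks(state: str, task_running: bool) -> List[str]:
    """Return a simplified subset of physical disk tasks valid in the given state."""
    if task_running:
        return []  # assumption: wait for the running task to complete
    tasks = []
    if state == "Online":
        tasks.append("Offline")  # not offered if the disk is already offline
    if state == "Offline":
        tasks.append("Online")
    return tasks


def virtual_disk_tasks(redundant: bool, task_running: bool) -> List[str]:
    """Check Consistency is only offered for redundant virtual disks."""
    if task_running:
        return []
    return ["Check Consistency"] if redundant else []


print(physical_disk_tasks("Offline", task_running=False))  # ['Online']
```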

A Corrupt Disk or Drive Message Suggests Running autocheck During a Reboot

Let autocheck run; no further action is required in response to the message. The reboot completes after autocheck finishes. Depending on the size of your system, this may take about ten minutes.

Erroneous Status and Error Messages after a Windows Hibernation

Activating the Windows hibernation feature may cause Storage Management to display erroneous status information and error messages. This problem resolves itself when the Windows operating system recovers from hibernation.

Storage Management May Delay Before Updating Temperature Probe Status

In order to display the enclosure temperature and temperature probe status, Storage Management polls the enclosure firmware at regular intervals to obtain temperature and status information. On some enclosures, there is a short delay before the enclosure firmware reports the current temperature and temperature probe status. Because of this delay, Storage Management may require one or two minutes before displaying the correct temperature and temperature probe status.

Storage Management May Delay Displaying Storage Devices After Reboot

Storage Management requires time after a reboot to find and inventory all attached storage devices. Storage controllers may not be displayed until this operation has completed.

You are Unable to Log into a Remote System

Access can be denied if you do not enter a user name and password that match an administrator account on the remote computer, or if you mistype the login information. The remote system may also be powered off, or there may be network problems.

Cannot Connect to Remote System Running Windows Server 2003

When connecting to a remote system running Windows Server 2003, log into the remote system using an account that has administrator privileges. By default, Windows Server 2003 does not allow anonymous (null) connections to access the SAM user accounts. Therefore, if you are attempting to connect using an account that has a blank or null password, the connection may fail.

Reconfiguring a Virtual Disk Displays Error in Mozilla Browser

When reconfiguring a virtual disk using the Mozilla browser, the following error message may display:

Although this page is encrypted, the information you have entered is to be sent over an unencrypted connection and could easily be read by a third party.

You can disable this error message by changing a Mozilla browser setting. To disable this error message:

1 Select Edit and then Preferences.
2 Click Privacy and Security.
3 Click SSL.
4 Uncheck the "Sending form data from an unencrypted page to an unencrypted page" option.

Physical Disks Display Under Connector Not Enclosure Tree Object

Storage Management polls the status of physical disks at frequent intervals. When the physical disk is located in an enclosure, Storage Management uses the data reported by the SCSI Enclosure Processor (SEP) to ascertain the status of the physical disk. In the event that the SEP is not functioning, Storage Management is still able to poll the status of the physical disk, but Storage Management is not able to identify the physical disk as being located in the enclosure. In this case, Storage Management displays the physical disk directly below the Connector object in the tree view and not under the enclosure object.

You can resolve this problem by restarting the Server Administrator service or by rebooting the system. For more information on restarting the Server Administrator service, see the Dell OpenManage™ Server Administrator User's Guide.

PCIe SSD Troubleshooting

Peripheral Component Interconnect Express (PCIe) Solid-State Drive (SSD) is not seen in the operating system

Probable Cause:

Hardware is not installed correctly

Solution:

Check the following components to ensure that they are properly connected:

Devices: Ensure that the PCIe SSDs are installed in a PCIe SSD backplane.
Backplane: Ensure that the cables for the PCIe SSD backplane are connected.
Cables: PCIe cables are specific to the configuration. Ensure that the backplane cable connectors mate with the backplane and the extender card cable connectors mate with the extender card.
Extender card: Ensure that the PCIe extender card is plugged into the correct supported slot.

PCIe SSD is not seen in disk management in the operating system

Probable Cause:

Device driver is not installed

Solution:

1 Download the latest PCIe SSD driver from support.dell.com.
2 Open Device Manager and double-click Other Devices, where the PCIe device is displayed with a yellow mark.
3 Right-click the PCIe device and install the downloaded driver.

For more information on possible error conditions with your PCIe SSD, see the system specific Owner's Manual at support.dell.com/manuals.