Ticket #1298 (assigned defect)

Opened 8 years ago

Last modified 7 years ago

suspect bad memory

Reported by: jhoblitt Owned by: cindy
Priority: normal Milestone:
Component: hardware Version:
Severity: normal Keywords:
Cc:

Description

These nodes have been suspected at one time or another of having stability issues due to bad memory:

ipp005
ipp008
ipp018
ipp027
ipp025

ipp018 is the worst offender, as MCE errors are being logged.

They should be taken offline and have a memory check run on them. In the past, we've had dubious results running memtest86+, but we've never gone into the BIOS and disabled ECC checking. I think it's worth trying this sort of test again with ECC/scrubbing disabled. The way to track down a defective memory module is via a binary search, e.g. run memtest86+ until an error is hit, remove half the memory sticks, and run memtest86+ again. If no error is hit, swap the removed memory back in and the good memory out, then repeat. I'd let memtest86+ run for 5-7 days without an error before considering it a negative result.
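The binary search described above can be sketched roughly as follows. This is only an illustration of the procedure, not a tool that exists on these nodes: `find_bad_module` and the `fails` callback are hypothetical names, and `fails` stands in for a person installing the given sticks and running memtest86+ on them. It assumes exactly one defective stick, that the fault reproduces within the test window, and it ignores any rules the motherboard may have about populating DIMMs in pairs or banks.

```python
from typing import Callable, List, Optional, Sequence


def find_bad_module(
    modules: Sequence[str],
    fails: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Narrow a set of memory sticks down to one defective stick.

    `fails(installed)` is a hypothetical stand-in for a manual step:
    install only the listed sticks, run the memory test, and report
    True if an error was hit. Assumes at most one bad stick.
    """
    installed: List[str] = list(modules)

    # First run with everything installed; no error means a negative result.
    if not fails(installed):
        return None

    while len(installed) > 1:
        half = len(installed) // 2
        kept, removed = installed[:half], installed[half:]
        # If the kept half still fails, the bad stick is in it.
        # Otherwise swap: the removed half goes back in, the good half out.
        installed = kept if fails(kept) else removed

    return installed[0]


if __name__ == "__main__":
    # Simulated node with eight sticks where slot "C1" is defective.
    slots = ["A1", "A2", "B1", "B2", "C1", "C2", "D1", "D2"]
    print(find_bad_module(slots, lambda installed: "C1" in installed))
```

With eight sticks this takes one full run plus three half-runs, so at 5-7 days per clean run the isolation step can take weeks per node, which is worth keeping in mind when scheduling downtime.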

Change History

Changed 8 years ago by cindy

ipp005 done - no memory
ipp018 - done 1 bad memory board

Changed 8 years ago by cindy

ipp008 ipp0018 - no bad memory

Changed 7 years ago by eugene

  • owner changed from jhoblitt to cindy
  • status changed from new to assigned