Version 29 (modified by rmattson, 6 years ago)

--

  • 2009-09-15T14:35:36 NFS issues; ls: cannot access /data/ipp051.0: Input/output error ; rebooted by gavin
  • 2009-09-18T16:32:55 Degraded unit: unit=0, port=18 (replaced)
  • 2010-01-05T16:20:30 system unresponsive, ganglia load, memory, cpu, power cycled by gavin
  • 2011-05-12T05:37:16 Controller#1(PCI) IDE Channel # 9 Reading Error
  • 2011-07-06T11:26:14 system unresponsive, nothing on console, power cycled by gavin
  • 2011-07-07T03:28:44 system unresponsive, nothing on console, power cycled by gavin (2011-07-07T08:38:30)
  • 2011-10-03T23:48:00 system crash, console information shows the same kind of error as on ipp026:
    <Oct/04 08:39 am>ipp029 login: [7653866.118158] 
    <Oct/04 08:39 am>[7653866.118158] HARDWARE ERROR
    <Oct/04 08:39 am>[7653866.118158] CPU 5: Machine Check Exception:                4 Bank 0: b200004000000800
    <Oct/04 08:39 am>[7653866.118158] TSC 56ca1938fc43b0 
    <Oct/04 08:39 am>[7653866.118158] This is not a software problem!
    <Oct/04 08:39 am>[7653866.118158] Run through mcelog --ascii to decode and contact your hardware vendor
    
    ^^^^^^^^^^TRANSLATION^^^^^^^^^^^
    (ipp029:~) cindy% /usr/sbin/mcelog --k8 --ascii < myerror 
    
    
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 5 BANK 0 TSC 56ca1938fc43b0 
    MCG status:MCIP 
    MCi status:
    Uncorrected error
    Error enabled
    Processor context corrupt
    MCA:BUS Level-0 Originated-request Generic Memory-access Request-timeout Error
    Model:
    STATUS b200004000000800 MCGSTATUS 4
    
    
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    
    <Oct/04 08:39 am>[7653866.118158] 
    <Oct/04 08:39 am>[7653866.118158] HARDWARE ERROR
    <Oct/04 08:39 am>[7653866.118158] CPU 7: Machine Check Exception:                5 Bank 0: b200004000000800
    <Oct/04 08:39 am>[7653866.118158] RIP !INEXACT! 10:<ffffffff80212604> {mwait_idle+0x41/0x44}
    <Oct/04 08:39 am>[7653866.118158] TSC 56ca1938fc43a0 
    <Oct/04 08:39 am>[7653866.118158] This is not a software problem!
    <Oct/04 08:39 am>[7653866.118158] Run through mcelog --ascii to decode and contact your hardware vendor
    <Oct/04 08:39 am>[7653866.118158] 
    
    ^^^^^^^^^^TRANSLATION^^^^^^^^^^^
    
    (ipp029:~) cindy% /usr/sbin/mcelog --p4 --ascii < myerror2
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 7 BANK 0 TSC 56ca1938fc43a0 
    MCG status:RIPV MCIP 
    MCi status:
    Uncorrected error
    Error enabled
    Processor context corrupt
    MCA:BUS Level-0 Originated-request Generic Memory-access Request-timeout Error
    Model:
    STATUS b200004000000800 MCGSTATUS 5
    
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    
    <Oct/04 08:39 am>[7653866.118158] HARDWARE ERROR
    <Oct/04 08:39 am>[7653866.118158] CPU 7: Machine Check Exception:                5 Bank 5: b200120020080400
    <Oct/04 08:39 am>[7653866.118158] RIP !INEXACT! 10:<ffffffff80212604> {mwait_idle+0x41/0x44}
    <Oct/04 08:39 am>[7653866.118158] TSC 56ca1938fc4fd8 
    <Oct/04 08:39 am>[7653866.118158] This is not a software problem!
    <Oct/04 08:39 am>[7653866.118158] Run through mcelog --ascii to decode and contact your hardware vendor
    <Oct/04 08:39 am>[7653866.118158] 
    
    ^^^^^^^^^^TRANSLATION^^^^^^^^^^^
    
    (ipp029:~) cindy% /usr/sbin/mcelog --p4 --ascii < myerror3
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 7 BANK 5 TSC 56ca1938fc4fd8 
    MCG status:RIPV MCIP 
    MCi status:
    Uncorrected error
    Error enabled
    Processor context corrupt
    MCA:Internal Timer error
    STATUS b200120020080400 MCGSTATUS 5
    
    
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    
    <Oct/04 08:39 am>[7653866.118158] HARDWARE ERROR
    <Oct/04 08:39 am>[7653866.118158] CPU 5: Machine Check Exception:                4 Bank 5: b200120014040400
    <Oct/04 08:39 am>[7653866.118158] TSC 56ca1938fc5098 
    <Oct/04 08:39 am>[7653866.118158] This is not a software problem!
    <Oct/04 08:39 am>[7653866.118158] Run through mcelog --ascii to decode and contact your hardware vendor
    
    
    ^^^^^^^^^^TRANSLATION^^^^^^^^^^^
    
    (ipp029:~) cindy% /usr/sbin/mcelog --p4 --ascii < myerror4
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 5 BANK 5 TSC 56ca1938fc5098 
    MCG status:MCIP 
    MCi status:
    Uncorrected error
    Error enabled
    Processor context corrupt
    MCA:Internal Timer error
    STATUS b200120014040400 MCGSTATUS 4
    
    
    
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    
    
    <Oct/04 08:39 am>[7653866.118158] Kernel panic - not syncing: Machine check
    <Oct/04 08:39 am>[7653866.118158] ------------[ cut here ]------------
    <Oct/04 08:39 am>[7653866.118158] WARNING: at kernel/smp.c:333 smp_call_function_mask+0x37/0x1d7()
    <Oct/04 08:39 am>[7653866.118158] Modules linked in: coretemp w83627hf w83793 hwmon_vid k8temp autofs4 i2c_i801 i2c_core iTCO_wdt e1000e tg3 libphy e1000 xfs dm_snapshot dm_mirror dm_region_hash dm_log aacraid 3w_9xxx 3w_xxxx atp870u arcmsr aic7xxx scsi_wait_scan
    <Oct/04 08:39 am>[7653866.118158] Pid: 26053, comm: ppSub Tainted: G   M    W  2.6.28-rc7-00105-gfeaf384 #4
    <Oct/04 08:39 am>[7653866.118158] Call Trace:
    <Oct/04 08:39 am>[7653866.118158]  <#MC>  [<ffffffff802397db>] warn_on_slowpath+0x51/0x6d
    <Oct/04 08:39 am>[7653866.118158]  [<ffffffff803eec86>] notify_update+0x2b/0x30
    <Oct/04 08:39 am>[7653866.118158]  [<ffffffff8025bc31>] smp_call_function_mask+0x37/0x1d7
    <Oct/04 08:39 am>[7653866.118158]  [<ffffffff80264013>] crash_kexec+0x17/0xef
    <Oct/04 08:39 am>[7653866.118158]  [<ffffffff802640e2>] crash_kexec+0xe6/0xef
    <Oct/04 08:39 am>[7653866.118158]  [<ffffffff8021d53e>] native_smp_send_stop+0x1a/0x26
    <Oct/04 08:39 am>[7653866.118158]  [<ffffffff80239896>] panic+0x95/0x13f
    <Oct/04 08:39 am>[7653866.118158]  [<ffffffff80239d5d>] release_console_sem+0x3e/0x1a5
    <Oct/04 08:39 am>[7653866.118158]  [<ffffffff80239d5d>] release_console_sem+0x3e/0x1a5
    <Oct/04 08:39 am>[7653866.118158]  [<ffffffff80239d5d>] release_console_sem+0x3e/0x1a5
    <Oct/04 08:39 am>[7653866.118158]  [<ffffffff805b9e66>] __atomic_notifier_call_chain+0x74/0x83
    <Oct/04 08:39 am>[7653866.118158]  [<ffffffff805b9df2>] __atomic_notifier_call_chain+0x0/0x83
    <Oct/04 08:39 am>[7653866.118158]  [<ffffffff80217b55>] mce_log+0x0/0x7f
    <Oct/04 08:39 am>[7653866.118158]  [<ffffffff80217eef>] do_machine_check+0x2d5/0x378
    <Oct/04 08:39 am>[7653866.118158]  [<ffffffff8020d04f>] machine_check+0x7f/0x90
    <Oct/04 08:39 am>[7653866.118158]  <<EOE>> <4>---[ end trace 4eaa2a86a8e2da22 ]---
    
  • 2011-10-06T10:00:00 system unresponsive, ipp029-hw-error, power cycled by gavin
  • 2011-10-08T12:41:06 system unresponsive, ipp029-hw-error, power cycled by gavin
  • 2011-10-09T15:38:56 system unresponsive, ipp029-hw-error, power cycled by gavin
  • 2011-10-11T13:26:16 system unresponsive, ipp029-hw-error, power cycled by gavin
  • 2011-10-11T14:11:16 system unresponsive, ipp029-hw-error, power cycled by gavin
  • 2011-10-11T14:46:08 system unresponsive, ipp029-hw-error, power cycled by gavin
  • 2011-10-17T13:15 replaced memory modules (4 x 4GB DDR2 667 modules)
  • 2011-10-17T14:07 crashed with single processing load Ipp029-crash-20111017T140700
  • 2011-11-09T01:08 crashed but it ran for several hours more before ganglia noticed
    <Nov/10 01:08 am>ipp029 login: [69507.667999] BUG: unable to handle kernel paging request at 000000003e445208
    <Nov/10 01:08 am>[69507.670758] IP: [<ffffffff8022f1db>] task_rq_lock+0x52/0x75
    <Nov/10 01:08 am>[69507.670758] PGD 2c9d47067 PUD 28b033067 PMD 0 
    <Nov/10 01:08 am>[69507.670758] Oops: 0000 [#1] SMP 
    <Nov/10 01:08 am>[69507.670758] last sysfs file: /sys/devices/platform/coretemp.7/temp1_crit

    <Nov/10 01:08 am>[69507.670758] CPU 5 
    <Nov/10 01:08 am>[69507.670758] Modules linked in: coretemp w83627hf w83793 hwmon_vid k8temp autofs4 i2c_i801 i2c_core iTCO_wdt e1000e tg3 libphy e1000 xfs dm_snapshot dm_mirror dm_region_hash dm_log aacraid 3w_9xxx 3w_xxxx atp870u arcmsr aic7xxx scsi_wait_scan

  • 2011-11-10T14:40 Earlier today Haydn re-attached the heat sink to one of the CPUs. However, as soon as we started to load the system, we got a crash like the ones reported above.
  • 2011-11-13T18:30 crashed again
  • 2011-11-13T23:40 crashed again
  • 2011-11-14T16:15 intended load crashed again with similar panic (but not with IPP code).
  • 2011-11-14T18:40 unresponsive, no info on console
  • 2011-12-07 drive 16 failed; replaced
  • 2012-04-11T12:22:00 Controller#1(PCI) IDE Channel # 1 Reading Error (disk replaced)
  • 2012-09-11T05:07:02 Controller#1(PCI) IDE Channel #24 Reading Error (replaced: 2012-09-11T09:22:27 Controller#1(PCI) IDE Channel #24 Device Inserted)
  • 2012-12-10 ~18:00 system unresponsive, nothing on console. Power cycle attempted, but the system did not come back up. (Mark)
  • 2013-01-16 All RAID 1TB drives replaced with 2TB drives (Rita and Haydn)
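
The mcelog translations in the 2011-10-03 entry can be cross-checked by hand from the raw STATUS and MCGSTATUS values. The sketch below is not part of mcelog; the function and table names are made up for illustration, and the bit positions are an assumption taken from the architectural machine-check register layout (flag bits 63..57 of the bank status word, MCA error code in bits 15..0, RIPV/EIPV/MCIP in bits 0..2 of the global status). It only decodes those flag bits, not the full error-code grammar.

```python
# Illustrative decoder for the flag bits of a machine-check bank status
# word (STATUS) and the global status (MCGSTATUS). Names are hypothetical.

# Bank status flags, highest bit first (assumed architectural layout).
MCI_STATUS_FLAGS = {
    63: "Valid",
    62: "Overflow",
    61: "Uncorrected error",
    60: "Error enabled",
    59: "MISC register valid",
    58: "ADDR register valid",
    57: "Processor context corrupt",
}

# Global status flags, lowest bit first.
MCG_STATUS_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}


def decode_flags(value, table):
    """Return the names of all flags in `table` whose bit is set in `value`."""
    return [name for bit, name in table.items() if (value >> bit) & 1]


def decode_mce(status_hex, mcgstatus_hex):
    """Split a STATUS/MCGSTATUS pair (hex strings) into flags and code fields."""
    status = int(status_hex, 16)
    mcg = int(mcgstatus_hex, 16)
    return {
        "mci_flags": decode_flags(status, MCI_STATUS_FLAGS),
        "mca_error_code": status & 0xFFFF,            # bits 15..0
        "model_error_code": (status >> 16) & 0xFFFF,  # bits 31..16
        "mcg_flags": decode_flags(mcg, MCG_STATUS_FLAGS),
    }


if __name__ == "__main__":
    # Values copied from the console dump in the 2011-10-03 entry.
    for status, mcg in [
        ("b200004000000800", "4"),
        ("b200004000000800", "5"),
        ("b200120020080400", "5"),
        ("b200120014040400", "4"),
    ]:
        print(status, mcg, decode_mce(status, mcg))
```

For all four records the flag bits come out as Valid, Uncorrected error, Error enabled and Processor context corrupt, which agrees with the mcelog output. MCGSTATUS 4 decodes to MCIP and MCGSTATUS 5 to RIPV plus MCIP, again matching the translations.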
