PS1 IPP Czar Logs for the week 2017-07-31 - 2017-08-06

Monday : 2017-07-31

  • 16:00 CZW: Restarted ippitc pantasks. The check_system.sh stop.block.restart seems to be working correctly.
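    For context, a minimal sketch of the stop/block/restart pattern such a wrapper can implement; the pantasks_client name, the stop/run commands, and the job names are assumptions, not the actual contents of check_system.sh:
      #!/bin/bash
      # Hypothetical stop.block.restart cycle (not the real check_system.sh).
      echo stop | pantasks_client          # assumed client: stop launching new jobs
      while pgrep -f 'ppImage|ppStack' > /dev/null; do
          sleep 60                         # block until in-flight jobs drain
      done
      echo run | pantasks_client           # resume processing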

Tuesday : 2017-08-01

  • 15:15 CZW: Power cycling stare01, which I appear to have killed with ppStack.
  • 15:54 CZW: And again. Leaving processing off on that host, as it seems to have an issue. From the console:
    [ 1411.290072] Pid: 14737, comm: addstar Tainted: G   M       2.6.28-rc7-00105-gfeaf384 #4
    [ 1411.290072] Call Trace:
    [ 1411.290072]  <#MC>  [<ffffffff802397db>] warn_on_slowpath+0x51/0x6d
    [ 1411.290072]  [<ffffffff803eec86>] notify_update+0x2b/0x30
    [ 1411.290072]  [<ffffffff8025bc31>] smp_call_function_mask+0x37/0x1d7
    [ 1411.290072]  [<ffffffff80264013>] crash_kexec+0x17/0xef
    [ 1411.290072]  [<ffffffff802640e2>] crash_kexec+0xe6/0xef
    [ 1411.290072]  [<ffffffff8021d53e>] native_smp_send_stop+0x1a/0x26
    [ 1411.290072]  [<ffffffff80239896>] panic+0x95/0x13f
    [ 1411.290072]  [<ffffffff80239d5d>] release_console_sem+0x3e/0x1a5
    [ 1411.290072]  [<ffffffff80239d5d>] release_console_sem+0x3e/0x1a5
    [ 1411.290072]  [<ffffffff80239d5d>] release_console_sem+0x3e/0x1a5
    [ 1411.290072]  [<ffffffff805b9e66>] __atomic_notifier_call_chain+0x74/0x83
    
    [ 1411.290072]  [<ffffffff805b9df2>] __atomic_notifier_call_chain+0x0/0x83
    [ 1411.290072]  [<ffffffff8023969e>] oops_enter+0x9/0x10
    [ 1411.290072]  [<ffffffff80217b55>] mce_log+0x0/0x7f
    [ 1411.290072]  [<ffffffff80217eef>] do_machine_check+0x2d5/0x378
    [ 1411.290072]  [<ffffffff8020d04f>] machine_check+0x7f/0x90
    [ 1411.290072]  <<EOE>> <4>---[ end trace 9eaaeb060d55ea58 ]---
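
    For reference, a hedged sketch of one way to power cycle a wedged node like stare01 out of band, assuming it has a BMC reachable at a management address (the hostname, user, and password file below are placeholders):
      # Power cycle via IPMI (stare01-ipmi and credentials are placeholders):
      ipmitool -I lanplus -H stare01-ipmi -U admin -f ~/.ipmipass chassis power cycle
      # Confirm it came back up:
      ipmitool -I lanplus -H stare01-ipmi -U admin -f ~/.ipmipass chassis power status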
    

Wednesday : 2017-08-02

  • 15:00 CZW: Adding x3 to the /data/stare04.1/watersc1/hsc/stdsci2/ pantasks, in an attempt to get HSC stacks finished at a reasonable rate. With stare03/04 now responding properly to NFS, this should not cause mount issues.
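    A hedged sketch of what "adding x3" can look like from the pantasks side; the controller syntax, client name, and host names are assumptions, not the exact commands used:
      # Add three slots per host so each runs three simultaneous stack jobs
      # (hosts and "controller host add" syntax are placeholders):
      for host in stare03 stare04; do
          for i in 1 2 3; do
              echo "controller host add $host" | pantasks_client
          done
      done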

Thursday : 2017-08-03

  • MEH: cleared stalled/broken stamps from 7/30; unclear why they weren't cleared by the czar..
  • MEH: HVAC planned downtime 8/4-8/6 @IfA -- Mez computers that are not critical for operations will need to be powered down
    • shut down any mysql instances, then halt and power down ipp003,ipp001,ipp022,ippb04,ippb05; ippc20,c21,c22,013,018-021 (Gavin powered these down and switched them off manually since they are not on remote power control) -- see the sketch after this list
    • monitor temperatures with sensors on ippops1,ippops2,ipp002
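      A hedged sketch of the shutdown and monitoring steps above (run as root on each affected machine; host lists abbreviated):
        # On each machine being powered down: stop mysql cleanly, then halt.
        mysqladmin shutdown      # skip on hosts without a mysql instance
        shutdown -h now
        # On ippops1/ippops2/ipp002 during the outage: watch temperatures.
        watch -n 60 sensors      # lm-sensors; append to a log with: sensors >> /tmp/temps.log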
  • MEH: Haydn sent Chris email about working on the console for ipp074,075,076,114 -- the console appears to actually not be working... -- Haydn had turned off i04-con (which impacts access to ippdb and the ipp105-117 nodes critical for nightly and PSPS processing); Ming turned it back on, but this is still a critical problem
    • 086-089,091,092: remote power management okay but no serial console access -- leave them in neb-host repair since they hold minimal data space
    • 105-112: no remote power management now -- critical problem -- put ipp105 into neb-host repair to avoid breaking nightly processing if the node has a problem (it is one of the primary nightly data nodes)
  • MEH: set ipp100-104 to neb-host up to add extra data space while ipp105 is in repair (state changes sketched below)
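    A hedged sketch of the nebulous state changes described above; the neb-host argument order is an assumption and should be checked against the tool's usage:
      # Take ipp105 out of active targeting (placeholder syntax):
      neb-host ipp105 repair
      # Bring ipp100-104 up to cover the capacity:
      for h in ipp100 ipp101 ipp102 ipp103 ipp104; do
          neb-host $h up
      done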
  • MEH: more misc cleanup of SNIaF updates
  • 17:25 CZW: Power cycling stare01, which crashed again.
    [115569.806728] HARDWARE ERROR
    [115569.806729] CPU 0: Machine Check Exception:                4 Bank 8: 0000000000000000
    [115569.810018] TSC 0 
    [115569.810018] This is not a software problem!
    [115569.810018] Run through mcelog --ascii to decode and contact your hardware vendor
    [115569.810018] Kernel panic - not syncing: Machine check
    [115569.810018] ------------[ cut here ]------------
    [115569.810018] WARNING: at kernel/smp.c:333 smp_call_function_mask+0x37/0x1d7()
    [115569.810018] Modules linked in: w83793 hwmon_vid autofs4 e100 i2c_i801 iTCO_wdt i2c_core e1000e tg3 libphy e1000 xfs dm_snapshot dm_mirror dm_region_hash dm_log aacraid 3w_9xxx 3w_xxxx atp870u arcmsr aic7xxx scsi_wait_scan
    [115569.810018] Pid: 0, comm: swapper Tainted: G   M       2.6.28-rc7-00105-gfeaf384 #4
    [115569.810018] Call Trace:
    [115569.810018]  <#MC>  [<ffffffff802397db>] warn_on_slowpath+0x51/0x6d
    [115569.810018]  [<ffffffff803eec86>] notify_update+0x2b/0x30
    [115569.810018]  [<ffffffff8025bc31>] smp_call_function_mask+0x37/0x1d7
    [115569.810018]  [<ffffffff80264013>] crash_kexec+0x17/0xef
    [115569.810018]  [<ffffffff802640e2>] crash_kexec+0xe6/0xef
    
    [115569.810018]  [<ffffffff805b7758>] _spin_lock_irqsave+0x37/0x3f
    [115569.810018]  [<ffffffff8021d53e>] native_smp_send_stop+0x1a/0x26
    [115569.810018]  [<ffffffff80239896>] panic+0x95/0x13f
    [115569.810018]  [<ffffffff80239d5d>] release_console_sem+0x3e/0x1a5
    [115569.810018]  [<ffffffff80239d5d>] release_console_sem+0x3e/0x1a5
    [115569.810018]  [<ffffffff80239d5d>] release_console_sem+0x3e/0x1a5
    [115569.810018]  [<ffffffff805b9dc5>] notifier_call_chain+0x29/0x56
    [115569.810018]  [<ffffffff805b9e66>] __atomic_notifier_call_chain+0x74/0x83
    [115569.810018]  [<ffffffff805b9df2>] __atomic_notifier_call_chain+0x0/0x83
    [115569.810018]  [<ffffffff8023969e>] oops_enter+0x9/0x10
    [115569.810018]  [<ffffffff80217b55>] mce_log+0x0/0x7f
    [115569.810018]  [<ffffffff80217eef>] do_machine_check+0x2d5/0x378
    [115569.810018]  [<ffffffff8020d04f>] machine_check+0x7f/0x90
    [115569.810018]  [<ffffffff802572dd>] __lock_acquire+0x6ef/0x7f3
    [115569.810018]  <<EOE>>  <IRQ>  [<ffffffff80257d90>] lock_acquire+0x52/0x6b
    [115569.810018]  [<ffffffff80521a95>] sk_filter+0x0/0xaf
    [115569.810018]  [<ffffffff80521acc>] sk_filter+0x37/0xaf
    [115569.810018]  [<ffffffff80521a95>] sk_filter+0x0/0xaf
    [115569.810018]  [<ffffffff805474e3>] tcp_v4_rcv+0x270/0x68d
    [115569.810018]  [<ffffffff8052e344>] ip_local_deliver+0xcb/0x164
    [115569.810018]  [<ffffffff8052e2d4>] ip_local_deliver+0x5b/0x164
    [115569.810018]  [<ffffffff8052e8d7>] ip_rcv+0x4fa/0x54f
    [115569.810018]  [<ffffffff805164df>] netif_receive_skb+0x252/0x29b
    [115569.810018]  [<ffffffff80516397>] netif_receive_skb+0x10a/0x29b
    [115569.810018]  [<ffffffffa016b73a>] e1000_receive_skb+0x3e/0x54 [e1000e]
    [115569.810018]  [<ffffffffa016e65b>] e1000_clean_rx_irq+0x234/0x2d6 [e1000e]
    [115569.810018]  [<ffffffffa016de03>] e1000_clean+0x24a/0x262 [e1000e]
    [115569.810018]  [<ffffffff8051870b>] net_rx_action+0xc5/0x1fb
    [115569.810018]  [<ffffffff8051869d>] net_rx_action+0x57/0x1fb
    [115569.810018]  [<ffffffff8023e144>] __do_softirq+0x7a/0x13d
    [115569.810018]  [<ffffffff8020d07c>] call_softirq+0x1c/0x28
    [115569.810018]  [<ffffffff8020e028>] do_softirq+0x2c/0x68
    [115569.810018]  [<ffffffff8023e084>] irq_exit+0x3f/0x85
    [115569.810018]  [<ffffffff8020e1ae>] do_IRQ+0x14a/0x16c
    [115569.810018]  [<ffffffff8020c2f6>] ret_from_intr+0x0/0xa
    [115569.810018]  <EOI>  [<ffffffff803d4f96>] acpi_idle_enter_bm+0x2b0/0x31a
    [115569.810018]  [<ffffffff803d4f8c>] acpi_idle_enter_bm+0x2a6/0x31a
    [115569.810018]  [<ffffffff804f9763>] cpuidle_idle_call+0x7f/0xbc
    [115569.810018]  [<ffffffff8020ac41>] cpu_idle+0x4a/0x6d
    [115569.810018] ---[ end trace c9a61e19b0bbc266 ]---
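
    As the panic text itself suggests, the raw MCE lines can be decoded after the fact by piping the captured console text through mcelog in ASCII mode (the log path below is a placeholder for wherever the console output was saved):
      # Decode the machine check from the captured console text:
      mcelog --ascii < /tmp/stare01-console.log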
    
  • MEH: nightly processing issue in summitcopy: 3-5 minutes to download some OTAs. From the timestamps in summit.copy.log, the delay appears to happen in the copy-into-nebulous stage; this seemed to be the case last night as well. Further tracing shows cases where the /tmp file for wget sits at 0 bytes for >120s, which looks like a summit/pixel server issue (rough stall check sketched below)
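    A rough sketch of the stall check described above; the temp-file glob is a placeholder for wherever summitcopy's wget writes its downloads:
      # Flag wget temp files that have sat at 0 bytes for more than 120 seconds.
      now=$(date +%s)
      for f in /tmp/wget*; do
          [ -e "$f" ] || continue
          size=$(stat -c %s "$f")
          age=$(( now - $(stat -c %Y "$f") ))
          [ "$size" -eq 0 ] && [ "$age" -gt 120 ] && echo "stalled: $f (0 bytes for ${age}s)"
      done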

Friday : 2017-08-04

  • MEH: reconfigure ippitc targeting to include ipp100-104 tonight

Saturday : 2017-08-05

  • MEH: update+distribution of additional SNIaF data

Sunday : 2017-08-06

  • MEH: Sidik fixed the pixel server that was having issues and degrading the download rate over the past few days
  • MEH: HVAC back on @IfA Saturday -- try to boot up ippdev machines if possible
    • ipp003 -- ok, mysql/web pages online
    • ippb04, b05 -- ok, disks seen from ITC cluster
    • ipp001 -- boots to liveCD, leave off
    • ipp022 -- doesn't boot (also didn't after the April HVAC power-down)