PS1 IPP Czar Logs for the week 2013.10.21 - 2013.10.27

(Up to PS1 IPP Czar Logs)

Monday : 2013.10.21

  • 07:20 MEH: like Saturday morning, looks like 3 camRuns are stalling due to missing chipRun files -- are these not being checked for validity?
     Exp Name     Exp ID  Chip ID  Cam ID  state  label                   data grp          dist grp  Fault  Revert Command
     o6586g0074o  667195  902792   873445  new    ThreePi.nightlyscience  ThreePi.20131021  ThreePi   3      camtool -revertprocessedexp -fault 3 -label ThreePi.nightlyscience -dbname gpc1
     o6586g0398o  667519  903082   873737  new    ThreePi.nightlyscience  ThreePi.20131021  ThreePi   3      camtool -revertprocessedexp -fault 3 -label ThreePi.nightlyscience -dbname gpc1
     o6586g0458o  667579  903139   873793  new    ThreePi.nightlyscience  ThreePi.20131021  ThreePi   3      camtool -revertprocessedexp -fault 3 -label ThreePi.nightlyscience -dbname gpc1
    
    neb-less neb://any/gpc1/ThreePi.nt/2013/10/21//o6586g0074o.667195/o6586g0074o.667195.cm.873445.log
    
    Unable to parse camera.
     -> pmConfigConvertFilename (pmConfig.c:1802): System error
         Unable to access file neb://ipp041.0/gpc1/ThreePi.nt/2013/10/21//o6586g0074o.667195/o6586g0074o.667195.ch.902792.XY62.b1.fits: nebclient.c:535 nebFind() - no instances found
     -> readPHUfromFilename (pmFPAfileDefine.c:454): System error
         Failed to convert file name neb://ipp041.0/gpc1/ThreePi.nt/2013/10/21//o6586g0074o.667195/o6586g0074o.667195.ch.902792.XY62.b1.fits
     -> fpaFileDefineFromArray (pmFPAfileDefine.c:641): System error
         Failed to read PHU for neb://ipp041.0/gpc1/ThreePi.nt/2013/10/21//o6586g0074o.667195/o6586g0074o.667195.ch.902792.XY62.b1.fits
     -> ppImageDefineFile (ppImageDefineFile.c:17): unknown psLib error
         failed to load file definition ARG LIST
     -> ppImageParseCamera (ppImageParseCamera.c:12): I/O error
         Can't find an input image source
    
    perl ~ipp/src/ipp-20130712/tools/runchipimfile.pl --chip_id  902792  --class_id XY62 --redirect-output
    
    neb://any/gpc1/ThreePi.nt/2013/10/21//o6586g0398o.667519/o6586g0398o.667519.cm.873737.log
    perl ~ipp/src/ipp-20130712/tools/runchipimfile.pl --chip_id 903082   --class_id XY62 --redirect-output
    
    neb://any/gpc1/ThreePi.nt/2013/10/21//o6586g0458o.667579/o6586g0458o.667579.cm.873793.log
    perl ~ipp/src/ipp-20130712/tools/runchipimfile.pl --chip_id 903139   --class_id XY62 --redirect-output
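    
    The three reruns follow the same pattern, so a loop covers them (a sketch using only the command above; chip_ids are from the table at the top of this entry):
    
    for chip_id in 902792 903082 903139; do
        perl ~ipp/src/ipp-20130712/tools/runchipimfile.pl --chip_id $chip_id --class_id XY62 --redirect-output
    done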
    
  • 12:05 MEH: difficult to get even a status from stdsci.. time for the regular restart..
    • MD03.pv2 label out for a bit so LAP can push through to stacks -- many (>400) warp skyfiles faulting, 4 camera faults to look into
  • 12:30 CZW: The excessive failures of warp were due to ipp031 not having a proper /local/ipp/gpc1/tess/ hierarchy. rsync -auv /data/ipp044.0/ipp/gpc1/tess/ . seems to have resolved this problem. Reverting the failures, and seeing if they pop up again.
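    
    A sketch of a preventative check for the other nodes, assuming the ipp044.0 export is mounted everywhere and the host list is illustrative:
    
    # verify each node has the tess hierarchy; resync from ipp044.0's copy if missing
    for h in ipp030 ipp031 ipp032; do
        ssh $h '[ -d /local/ipp/gpc1/tess ] || rsync -au /data/ipp044.0/ipp/gpc1/tess/ /local/ipp/gpc1/tess/'
    done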
  • 13:10 MEH: something is still messed up, with many (10%?) warp skyfile faults; set lap.cleanup.off so no more chips trigger, but will push through all the warps possible to diagnose
  • 16:50 MEH: ipp041 fully out of processing, see if that helps the odd missing-file faults (maybe clears up nightly processing faults as well?)
  • 16:53 Bill: reran class_id XY62 for STS chip_ids 900631 900863 900954 901070 901994 whose contents were wiped out by a duplicate run for the job. The jobs all ran on ipp041, which has been a problem child. Reverted the associated camRuns.
  • 17:45 CZW: ipp029 had a load spike, then a kernel panic, and so I've cycled the power.
    [1645892.244376] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000007
    [1645892.244376] 
    [1645892.253892] Pid: 1, comm: init Tainted: G          IO 3.7.6 #1
    [1645892.254240] Call Trace:
    [1645892.254240]  [<ffffffff8151470d>] panic+0xb9/0x1c5
    [1645892.254240]  [<ffffffff810322fe>] do_exit+0x361/0x7a0
    [1645892.254240]  [<ffffffff810327af>] do_group_exit+0x72/0x9a
    [1645892.254240]  [<ffffffff8103c701>] get_signal_to_deliver+0x442/0x461
    [1645892.254240]  [<ffffffff81002016>] do_signal+0x2c/0x4d6
    [1645892.254240]  [<ffffffff81024f3a>] ? is_prefetch+0xc1/0x1c2
    [1645892.254240]  [<ffffffff810024e7>] do_notify_resume+0x27/0x5f
    [1645892.254240]  [<ffffffff815176f5>] retint_signal+0x3d/0x78
    [1645892.307707] ------------[ cut here ]------------
    [1645892.312552] WARNING: at arch/x86/kernel/smp.c:123 native_smp_send_reschedule+0x25/0x51()
    [1645892.317705] Hardware name: Empty
    [1645892.317705] Modules linked in: w83627hf w83793 hwmon_vid k8temp ipv6 dm_mod joydev usbhid snd_hda_codec_realtek mperf sg ehci_hcd e1000e(O) snd_hda_intel uhci_hcd snd_hda_codec usbcore microcode snd_hwdep coretemp processor thermal_sys snd_pcm snd_timer snd i5k_amb pcspkr snd_page_alloc i2c_i801 i2c_core usb_common floppy container button
    [1645892.317705] Pid: 1, comm: init Tainted: G          IO 3.7.6 #1
    [1645892.317705] Call Trace:
    [1645892.317705]  <IRQ>  [<ffffffff8102d9a0>] warn_slowpath_common+0x80/0x98
    [1645892.317705]  [<ffffffff8102d9cd>] warn_slowpath_null+0x15/0x17
    [1645892.317705]  [<ffffffff8101c53b>] native_smp_send_reschedule+0x25/0x51
    [1645892.317705]  [<ffffffff81053437>] trigger_load_balance+0x1e8/0x214
    [1645892.317705]  [<ffffffff8105081f>] scheduler_tick+0x10c/0x114
    [1645892.317705]  [<ffffffff810385d9>] update_process_times+0x62/0x73
    [1645892.317705]  [<ffffffff81063222>] tick_sched_timer+0x7c/0x9b
    [1645892.317705]  [<ffffffff81048b6f>] __run_hrtimer+0x4e/0xc3
    [1645892.317705]  [<ffffffff81048f10>] hrtimer_interrupt+0xce/0x1b0
    [1645892.317705]  [<ffffffff8101d921>] smp_apic_timer_interrupt+0x81/0x94
    [1645892.317705]  [<ffffffff815187ca>] apic_timer_interrupt+0x6a/0x70
    [1645892.317705]  <EOI>  [<ffffffff8104a36e>] ? up+0x34/0x39
    [1645892.317705]  [<ffffffff815147dd>] ? panic+0x189/0x1c5
    [1645892.317705]  [<ffffffff81514742>] ? panic+0xee/0x1c5
    [1645892.317705]  [<ffffffff810322fe>] do_exit+0x361/0x7a0
    [1645892.317705]  [<ffffffff810327af>] do_group_exit+0x72/0x9a
    [1645892.317705]  [<ffffffff8103c701>] get_signal_to_deliver+0x442/0x461
    [1645892.317705]  [<ffffffff81002016>] do_signal+0x2c/0x4d6
    [1645892.317705]  [<ffffffff81024f3a>] ? is_prefetch+0xc1/0x1c2
    [1645892.317705]  [<ffffffff810024e7>] do_notify_resume+0x27/0x5f
    [1645892.317705]  [<ffffffff815176f5>] retint_signal+0x3d/0x78
    [1645892.317705] ---[ end trace f668650e43a0263c ]---
    
  • 20:10 MEH: need another regular restart of stdsci before nightly, >100k warp Njobs
    • LAP finished all the warps possible; the remaining skyfiles fault. Avoiding new chip loads -- lap.cleanup.off
    • LAP stacks loading now
    • MD03.pv2.20131018 prio run with nightly

Tuesday : 2013.10.22

mark is czar

  • 08:30 MEH: nightly cleared, setting up for regular restart stdsci
    • lap.cleanup.off
  • 14:05 MEH: attempting fixes, X.revert.off in stdsci
  • 14:25 MEH: ipp040 overloaded for ~1 hr, power cycling -- back up and ok
  • 14:30 MEH: taking -1x all wave out of stdsci..
  • 14:50 MEH: ipp057 has had enough as well.. crash.. -- power cycling -- back up and ok ipp057-crash-20131022T145000
  • 18:15 MEH: STS label out for nightly science, MD05.pv2 updates running instead
    • updates taking a long time, something wrong. Some large loads on the stsci nodes -- killed off all the ppImage jobs running >2 ks and things are moving again..
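      
      A sketch of that cleanup, assuming ps supports etimes and the host names are illustrative:
      
      # kill ppImage processes that have been running longer than ~2000 s
      for h in stsci00 stsci01; do
          ssh $h "ps -eo pid,etimes,comm | awk '\$3 == \"ppImage\" && \$2 > 2000 {print \$1}' | xargs -r kill"
      done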

Wednesday : 2013.10.23

mark is czar

  • 06:20 MEH: nightly finished, STS label back in
  • 08:30 MEH: MD03.pv2 chips to clean up, to free comparable space for MD05.pv2
  • 10:00 MEH: general restart of pantasks
    • distribution, publishing last night
    • summitcopy, registration
    • stdsci -- lap.cleanup.off (will want to turn it on soon to clean up chips?), -1x all wave machines still
  • 17:15 MEH: email reports too many failures when attempting to copy /export/ipp001.0/ipp/mysql-dumps/gpc1_checksum.md5; disk space is ok. Will let it try over the next 4 hours and if it fails again will look into it before the morning QUB needs.
    • worked itself out
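      
      A manual spot check, assuming the .md5 file lists the dumps that sit alongside it:
      
      # verify the gpc1 dumps against the checksum file in place on ipp001
      ssh ipp001 'cd /export/ipp001.0/ipp/mysql-dumps && md5sum -c gpc1_checksum.md5'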
  • 18:00 Bill restarted the pstamp and update pantasks. Their pcontrols were using too much CPU.
  • 19:10 STS label out for the night

Thursday : 2013.10.24

  • 13:40 MEH: stdsci needs its regular restart.. ugh.. commands repeatedly bork with "server is busy.."
  • 15:45 MEH: doesn't look like neb-host down was set for the last-minute(?) announcement that ipp046,047 were ready for shutdown.. things kind of stopped, waiting for files off those hosts..
  • 16:25 MEH: and because of that, stdsci is some 70% plugged with chip jobs that stalled when ipp046,047 went down mid-processing... they will need to be cleared for nightly processing..
  • 17:15 MEH: finally cleared, and fixed ipp011's borked mount to ipp047 --
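    
    A sketch of the mount fix, assuming the usual /data/ipp047.0 mount point with an fstab entry:
    
    # lazy-unmount the stale NFS mount on ipp011, then remount from fstab
    ssh ipp011 'umount -l /data/ipp047.0 && mount /data/ipp047.0'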

Friday : 2013.10.25

  • 10:25 MEH: last of the nightly diffs finishing, preparing for the needed stdsci regular restart..
  • 11:30 MEH: something is stalling chip updates (>2 ks..).. chip.off to track down which nodes are the culprits --
    • all outputs go to the stsci1x nodes now since all the others are full; those nodes all have lots of disk wait% as the ippm scans are all being done at the same time.. -- not good, the scan should probably be done on a single partition at a time..
    • ipp011 heavy wait%, taking it out of processing to help; probably needs a mysql restart
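      
      A sketch for spotting the culprit nodes, host list illustrative:
      
      # compare I/O wait across candidate nodes; the "wa" column of the last
      # vmstat sample is the live iowait percentage
      for h in ipp011 ipp019; do
          echo "== $h"; ssh $h 'vmstat 1 2 | tail -1'
      done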
  • 14:08 Bill: queued STS.rp.2013 chips and warps for cleanup
  • 22:10 MEH: again, some long-running/stalled chips (3-4 ks..), processing rate ~20% of normal since ~7pm..
    • ipp019 has a bit of wait% going, taking it out of processing too
    • kill all ppImages
    • might as well do the regular restart of stdsci as well..

Saturday : 2013.10.26

  • 08:20 MEH: ipp062 mount trouble stalled stdsci, reg, summitcopy.. NFS had to be restarted (sketch below) -- no nightly, but lost 50% of potential reprocessing
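    
    A sketch of the NFS restart, assuming the cluster's init-script style:
    
    ssh ipp062 /etc/init.d/nfs restart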
  • 11:30 MEH: regular restart stdsci, rate getting better
  • 17:30 MEH: again, rate is bad.. another mount?
  • 17:42 haf: did something happen to a computer since yesterday? all my loaders are dead jim
    • whole system has been having hangups since last night.. when did loaders die?
  • 19:10 MEH: stdsci is behaving strangely.. chip barely keeping >100 jobs when there is plenty to do.. restarting..

Sunday : 2013.10.27

  • 10:45 MEH: nightly ssdiffs finished, time for regular restart stdsci
    • rate much improved after ~8pm last night.. did some other processing finish around then?
  • 10:50 MEH: the stalling in the ippmonitor czartool page is due to ipp001 filling up its disk, so the gpc1 slave check stalls.. -- looks like Gene is cleaning up space; back up to 100G free now, and gpc1@ipp001 is catching up for QUB to access for transient work, 26 ks behind
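    
    A sketch of the lag check, assuming the gpc1 slave runs on ipp001 and replication-client credentials are set up:
    
    # replication state and lag on the gpc1 slave (26 ks ~ 7.2 hours behind)
    mysql -h ipp001 -e 'SHOW SLAVE STATUS\G' | egrep 'Seconds_Behind_Master|Slave_IO_Running|Slave_SQL_Running'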