PS1 IPP Czar Logs for the week 2014.04.28 - 2014.05.04

Monday : 2014.04.28

mark is czar

  • 07:00 MEH: chip.off to push remaining night warp+diffim through (lots of LAP chips..). two fault 5 OSS diffim cleared, small stamp poor stats robustness (543558 skycell.1014.058; 543560 skycell.1014.001)
  • 08:10 MEH: stdsci poll underloaded, nightly finished, regular restart of pantasks to begin
    • have to cycle chip.off to mostly push backlog of LAP warps out, seems mixed and need all out to start making stacks..
  • 10:20 MEH: LAP stacks now loading, will need to throw stdsci nodes into stack -- if like yesterday ~3500 stacks/400hr then mostly finished by nightly obs, but if normal night processing then leaving ~1k keeps the few stack nodes busy all night
  • 10:40 MEH: ipp053 was never put back into processing after the mobo was replaced on 3/20..
  • 18:40 MEH: ~800 LAP stacks remain, returning power to stdsci for nightly (after pantasks restart)

Tuesday : 2014.04.29

  • 10:04 Bill: staticsky memory use is mostly staying around 30G peak. Set one set of c2 nodes to on in stack.
  • 11:22 Bill: ganglia reports that ippc13 has been down for 6 1/3 hours. I set the power to off and notified Haydn.
  • 15:30 CZW: I've restarted the replication pantasks, pointing it at the new permcheck.pl script and the PV1 stacks. This will shuffle the backup image copy to the b storage nodes.
  • 20:20 MEH: ipps05-14 machines manually allocating to staticsky until next MD run ready -- 2x should be possible; ipps01-04 3x possible

Wednesday : 2014.04.30

  • 14:30 CZW: ippdb01 seems to have crashed. Nothing on the console except iostat/temperature checks someone had done that were left in the log. Power cycled, and it seems to be up and responding again. I'll use this as an opportunity to restart the pantasks servers.
  • 15:30 CZW: Just realized that the reason the czartool page wasn't updating was because the crash of ippdb01 probably killed the scripts that run those jobs. That was the case. Restarted.

Thursday : 2014.05.01

  • 10:11 Bill: ippc26, 27, and 28 have got extremely large psphotStack processes. Set those nodes to off in the pstamp and stack pantasks.
    • 11:10 ippc27 and 28 have finished and the one on ippc26 is in the process of exiting.
    • we are out of the galactic plane for now. Enabling a second set of c2 nodes in staticsky
    • 11:40 MEH: remember to bump up the poll number, the ipps nodes will float in/out as available
  • 15:20 EAM : stopped and restarted stdscience, pstamp, registration (pantasks were running slow)
  • 15:25 EAM : ipp051 is down. trying to power cycle. so far, no luck.
  • 15:40 EAM : ipp051 is back up (took a second power cycle try with a long off time).

Friday : 2014.05.02

  • 08:00 Bill: in staticsky set.ra.max 270. This will stop staticsky processing once all of the skycells at lower RA finish. The next pending skycells at ra > 270 will require going back to single set of c2 nodes to avoid memory overload. I think we should get there sometime on Saturday.
  • 10:20 Bill: investigated memory corruption fault from warp 610477 skycell.2155.006. Does not fail with trunk build. Does not fail with tag build if run in debugger, so I ran the script with the tag as ipp user attaching to pswarp with debugger, let the program finish successfully clearing the fault. Move along nothing to see here.
  • 14:00 Bill: survey.relexp task has been failing for a while due to duplicate processings for exposures. Dropped the second of the two camRuns for each exposure. Note that the stacks are inconsistent since some warps were used from the duplicate processing.
    camtool -updaterun -set_state drop -set_label duplicate.LAP.ThreePi.20130717 -set_note duplicate -cam_id 940384
    camtool -updaterun -set_state drop -set_label duplicate.LAP.ThreePi.20130717 -set_note duplicate -cam_id 940422
    camtool -updaterun -set_state drop -set_label duplicate.LAP.ThreePi.20130717 -set_note duplicate -cam_id 940423
    camtool -updaterun -set_state drop -set_label duplicate.LAP.ThreePi.20130717 -set_note duplicate -cam_id 940386
    camtool -updaterun -set_state drop -set_label duplicate.LAP.ThreePi.20130717 -set_note duplicate -cam_id 940393
    camtool -updaterun -set_state drop -set_label duplicate.LAP.ThreePi.20130717 -set_note duplicate -cam_id 940385
    camtool -updaterun -set_state drop -set_label duplicate.LAP.ThreePi.20130717 -set_note duplicate -cam_id 940391
    camtool -updaterun -set_state drop -set_label duplicate.LAP.ThreePi.20130717 -set_note duplicate -cam_id 940597
    camtool -updaterun -set_state drop -set_label duplicate.LAP.ThreePi.20130717 -set_note duplicate -cam_id 940596
    camtool -updaterun -set_state drop -set_label duplicate.LAP.ThreePi.20130717 -set_note duplicate -cam_id 940598
    camtool -updaterun -set_state drop -set_label duplicate.LAP.ThreePi.20130717 -set_note duplicate -cam_id 940599
    camtool -updaterun -set_state drop -set_label duplicate.LAP.ThreePi.20130717 -set_note duplicate -cam_id 940389
    camtool -updaterun -set_state drop -set_label duplicate.LAP.ThreePi.20130717 -set_note duplicate -cam_id 940441
    camtool -updaterun -set_state drop -set_label duplicate.LAP.ThreePi.20130717 -set_note duplicate -cam_id 940443
    camtool -updaterun -set_state drop -set_label duplicate.LAP.ThreePi.20130717 -set_note duplicate -cam_id 943985
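
    The fifteen commands above differ only in -cam_id, so the same list could be generated with a loop. The following is a minimal dry-run sketch, not what was actually run: it only prints the command lines for review. The camtool flags and cam_ids are copied from the entry above; the variable names and the function name drop_duplicate_cmds are made up for illustration.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) one camtool command per duplicate cam_id.
# Flags and ids are copied verbatim from the log entry; LABEL, CAM_IDS and
# drop_duplicate_cmds are hypothetical names introduced for this example.
LABEL="duplicate.LAP.ThreePi.20130717"
CAM_IDS="940384 940422 940423 940386 940393 940385 940391 940597 940596 940598 940599 940389 940441 940443 943985"

drop_duplicate_cmds() {
    for cam_id in $CAM_IDS; do
        echo "camtool -updaterun -set_state drop -set_label $LABEL -set_note duplicate -cam_id $cam_id"
    done
}

# Print the commands so they can be checked before anything is run by hand.
drop_duplicate_cmds
```

    Printing rather than executing keeps the sketch safe to run anywhere; the output can be reviewed and then pasted into a shell on a host where camtool is available.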
    
  • 15:55 Bill: ippc08 crashed about 20 minutes ago. Nothing on the console; cycled power. Didn't come up right away. Gavin is investigating.
  • 21:00 MEH: warp.skycell is off, czarplot shows warps off most of day.. not clear why, guessing for warp mem debug in morning. turning on for night after doing the normal daily required stdsci restart

Saturday : 2014.05.03

  • 06:36 Bill: ipp051 has gone down. Powered off with console then back on 1 minute later. Nothing happened, which is what Gene reported the other day. Leaving powered off for now.
  • 06:49 ipp010 is down as well. This one has responded to power cycle
  • 13:45 MEH: looks like stdsci has been clogged for past ~10 hrs with no chips->warps getting done according to the czarplots.. probably due to ipp051 and needing to clear mounts.. guess I'll start clearing them so we will be able to do any nightly processing for MOPS/NASA program..
    • attempted a few power cycle attempts since ipp051 off for a while, no response, so back off
    • remaining darks from morning now registered
    • resulting massive backlog of camera, raise poll 45 -- down to 30 now
    • some burn.tbl to fix since on ipp051, chip.revert.off until fixed -- all but 1 fixed, level of caring <0 for now

Sunday : 2014.05.04

  • 13:10 MEH: looks like LAP is going to fully stall w/ 46 or so chip faults, good time for regular stdsci restart and then cleanup faults if possible so system doesn't sit mostly idle..