PS1 IPP Czar Logs for the week 2011-08-22 to 2011-08-28

Monday : 2011-08-22

  • 9:00 Bill shut down publishing, rcserver, and the postage stamp server to recover from an out-of-disk-space issue on ipppc17
  • 12:20 Bill fixed a camera run with corrupt files (250162) that was causing some warps to fail
  • 12:20 Bill fixed a corrupt skycell
  • evening Bill queued diffs for STS.201108 19, 20, 21
  • raised priority for STS and M31 nightlyscience to 400 to match the other labels.

Tuesday : 2011-08-23

  • 00:00 Mark: made a change to the ppStack PSF-matching rejection before the new stacks were ready, and queued MD01.refstacks
  • 7:40 Bill: added each of the stare nodes to stdscience twice
  • 15:55 CZW: Restart stdscience to jumpstart processing after nebulous issue (still not fully understood).
  • 17:50 CZW: Cycling power on ippc06 for the second time today. Removed from stdscience and stack processing.

Wednesday : 2011-08-24

  • 02:15 Bill finds many faults on czartool. ipp014 load over 100. Unresponsive to ssh. Nothing on the console. Power cycled it.
  • 02:57 ipp014 left a mess behind. Did force.umount on several hosts. On some it did not work. Removed stare01 and stare02 from stdscience for this reason
  • 03:10 destreak is a mess. Many failed_revert runs. This will have to wait until morning
  • 03:14 warpPendingSkyCell full of pages in state DONE. reset it
  • 03:21 warp is in a bad way. There are over 200 runs to do, yet only 50 skycells are showing up in the book. warptool -towarped produces the expected output. I'm going to stop and restart stdscience.
  • 03:41 stdscience restarted. killed all lingering ppImage, pswarp, and ppSub jobs on the cluster. (There were many stuck in the SOAP recv call)
  • 08:12 found a few more nodes that can't access ipp014. did force.umount
  • 08:13 warpPendingSkycell full with DONE again. Leaving alone for Gene to possibly debug.
  • 09:45 stopped processing. Gavin reset the nfs mounts of ipp014
  • 09:55 rebuilt pantasks with a bug fix from Gene that should stop the warp book from filling up.
  • 09:57 started stdscience only. Still getting faults. reverts off
  • 10:04 started summit_copy with only gpc1. All files have been copied except for some recent LED exposures, some of which are failing. Set pantasks to stop.
  • 10:09 started registration. It is registering and applying burntool successfully. Looks like fixing ipp014 repaired the problem.
  • 10:18 getting faults from ipp039. It is not mounting /data/ipp014.0. Set it to off in stdscience.
  • 10:30 ipp006, ipp049, ipp053, and ippc18, 13, and 14 are also not mounting ipp014. Reported to Gavin. Setting the host to repair
  • 11:38 CZW: rebooting offending nodes appears to resolve the problem. I have restarted registration and stdscience, and jobs appear to be completing correctly now.
  • 13:05 CZW: Increased poll limit in distribution pantasks from 128 to 300. This should allow enough jobs to accumulate in "run" to avoid downtime due to slow "load" queries.
  • 13:30 Bill: fixed bug in warptool -towarped that caused the queue to not be full.
  • 14:02 Bill: killed a ppMops process that was using all of the memory on ippc10
  • 15:10 Mark: restarted cleanup, added some MD01.GR0 chip cleanup earlier.

Thursday : 2011-08-25

  • 01:00-02:00 Mark: looks like ipp014 may be having trouble. Can't mount ipp014.0, and trying to ssh results in "Connection to ipp014 closed by remote host. Connection to ipp014 closed." So turned it off in stdscience and set neb-host down to help the remaining STD and MD01 chips through. Many of the failures seem due to wanting data from ipp014. Not sure if a reboot is necessary, but don't have access to do one in any case. Queued up remaining MD10.refstack reruns to keep stack processing fed.
  • 06:30 Bill: There were 420 magicDSRuns in state failed_revert. Cleared the statefaults.
  • 06:30 repaired a broken diffSkyfile that was causing a magic run to get stuck
  • 06:30 installed (temporarily ?) a modified that deletes the existing log file. This works around several warps that were stuck for this reason.
  • 06:56 There are over 3000 magicDSFiles still left to revert
  • 07:10 dropped distRun 672273 (a diff from MD08.20110821) because one of the destreaked skycells is corrupt and the original files have been deleted because the magicDSRun is cleaned. There is no easy way to repair faults like this as long as we insist on cleaning up data whether all of the processing is complete for the exposure or not.
  • 07:17 turned and rcserver.revert off in order to find the number of recurring faults that we have.
  • 07:52 set destreak.revert.load task to run more often.
  • 08:52 set destreak.revert.load back to normal pace. Set destreak.revert off to look for repeat failures

Friday : 2011-08-26

  • 06:30 Bill: ipp013 is back online but /local/ipp hasn't been populated. Later Gene mentioned that ipp018 has the same problem
  • 06:58 3 runs got queued for camera engineering exposures that were misidentified as MD08. Dropped the warps
  • 09:25 ipp018 /local/ipp/tmp didn't have the right ownership so some dist runs failed. Fixed.
  • 09:47 fixed a corrupted warp skycell on ipp052: runwarpskyfile --warp_id 241244 --skycell_id skycell.0947.038

Saturday : 2011-08-27

  • 06:36 Bill checked in and found that burntool had stopped at o5800g0328o.
From the registration pantasks, found an error and the name of the logfile:

Running [/home/panstarrs/ipp/psconfig/ipp-20110622.lin64/bin/ --dbname gpc1  --class_id 'XY21' --exp_id 382931  --this_uri neb://ipp008.0/gpc1/20110827/o5800g0328o/o5800g0328o.ota21.fits  --previous_uri neb://ipp008.0/gpc1/20110827/o5800g0327o/o5800g0327o.ota21.fits  --imfile_state check_burntool  --verbose ]...
Unable to perform 1 at /home/panstarrs/ipp/psconfig//ipp-20110622.lin64/bin/ line 518
        main::my_die_for_update('Unable to perform 1', 382931, '\'XY21\'', 2) called at /home/panstarrs/ipp/psconfig//ipp-20110622.lin64/bin/ line 352
Running: /home/panstarrs/ipp/psconfig/ipp-20110622.lin64/bin/regtool -updateprocessedimfile -exp_id 382931 -class_id 'XY21' -fault 2 -hostname ipp008 -dbname gpc1

 -> psDBAlloc (psDB.c:166): Database error originated in the client library
     Failed to connect to database.  Error: Lost connection to MySQL server at 'reading authorization packet', system error: 0
 -> regtoolConfig (regtoolConfig.c:468): (null)
     Can't configure database
 -> main (regtool.c:71): (null)
     failed to configure

I ran the regtool -updateprocessedimfile command by hand, then ran regtool -revertprocessedimfile. Burntool ran swiftly to conclusion.
  • 20:30 A LAP warp was stuck with several failing skycells. It looks like some of the outputs from the camera run are corrupt. Sigh. Fixed with tools/ --redirect-output --cam_id 254369
  • 22:00 Mark: set MD10.GR warps to cleanup

Sunday : 2011-08-28