PS1 IPP Czar Logs for the week 2011-04-25 - 2011-05-01


Monday : 2011-04-25

Bill is czar today.

  • 09:00 The 576 STS.2010.b exposures queued for reprocessing so far have finished and have been posted to the distribution server. Queued the chip, warp, and diff runs for cleanup.
  • 11:58 Set the pantasks to run. They had been stopped to allow Chris to make some changes to nebulous.
  • 13:56 CZW: ATRC storage nodes completely added to nebulous. The replication pantasks is stopped because we want to run some replication tests that send data to these disks.

Tuesday : 2011-04-26

  • 08:00 Bill: burntool was stuck due to a database error that left a rawImfile in an incorrect state. Fixed by rerunning the failed command by hand:
    regtool -updateprocessedimfile -dbname gpc1 -exp_id 330078 -class_id 'XY46' -burntool_state 0 -set_state pending_burntool
  • 08:05 Bill: warp was stuck because the book warpPendingSkyCell was full of entries in pantaskState DONE. Fixed by running warp.reset in stdscience.
  • 09:32 ipp041 is generating lots of faults because it cannot see /data/ippb02.0. force.umount hung. Running sudo umount -f -v /data/ippb02.2 hung as well. Set ipp041 to off in stdscience.
  • 10:56 Heather snuck in some more jt stacks. (MD02.jtrp)
  • 11:15 Serge tries to replicate ippc02.gpc1 onto ipp001. There might be unusual activity between them.

Wednesday : 2011-04-27

  • 07:20 Bill noticed that warp was stuck: the warpPendingSkyCell book was full of DONE jobs. Fix: wait for any running jobs to finish, then warp.reset, then warp.on.
  • 07:30 Interestingly, the czartool graph for warp did not flatten out even though no warp jobs were running. Perhaps I caught it just before it would have flatlined.
  • 10:27 Bill set label STS.2010.b to inactive to prevent chip processing from slowing down last night's science processing.
  • 10:31 Bill set ipp012 and ipp020 nebulous state to up (they were set to repair over the weekend)
  • 13:45 gpc1 database restarted, stdscience restarted (to avoid timeout counts disturbing Nrun counts)
  • 14:00 Serge restarted czarpoll and czarbot
  • 14:10 On ippc02, show slave status\G says (shortened):
               Slave_IO_Running: Yes
              Slave_SQL_Running: Yes
                Replicate_Do_DB: gpc1,czardb,isp,ippadmin,ipptopsps
          Seconds_Behind_Master: 0

This means that gpc1,czardb,isp,ippadmin,ipptopsps are replicated from ippdb01.
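
The replication check above can be automated. The following is a minimal sketch, not the czar tooling itself: it assumes the vertical (\G) output of show slave status has been captured as text, and the function names parse_slave_status and replication_healthy are hypothetical. It treats replication as healthy only when both slave threads are running and the lag is within a chosen bound.

```python
def parse_slave_status(text):
    """Parse the vertical (\\G) output of SHOW SLAVE STATUS into a dict."""
    status = {}
    for line in text.splitlines():
        # Each field is printed as "   Name: value". Lines without a colon
        # (blank lines, row separators) are skipped.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        status[key.strip()] = value.strip()
    return status


def replication_healthy(status, max_lag_seconds=0):
    """Return True when both slave threads run and lag is within bounds."""
    # Seconds_Behind_Master is NULL when replication is not running.
    lag = status.get("Seconds_Behind_Master", "NULL")
    return (
        status.get("Slave_IO_Running") == "Yes"
        and status.get("Slave_SQL_Running") == "Yes"
        and lag.isdigit()
        and int(lag) <= max_lag_seconds
    )


# The shortened output recorded in the 14:10 entry.
sample = """
               Slave_IO_Running: Yes
              Slave_SQL_Running: Yes
                Replicate_Do_DB: gpc1,czardb,isp,ippadmin,ipptopsps
          Seconds_Behind_Master: 0
"""

status = parse_slave_status(sample)
print(replication_healthy(status))           # True
print(status["Replicate_Do_DB"].split(","))  # the replicated databases
```

Raising max_lag_seconds would tolerate a slave that is briefly behind; the default of 0 matches the state logged here.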

  • 16:00 gpc1,czardb,isp,ippadmin,ipptopsps are replicated on ipp001 (from ippdb01 unfortunately).
  • 16:15 restarted distribution. pcontrol was pegged and few jobs were running. Also set ippc15 off in stdscience

Thursday : 2011-04-28

  • 11:05 Serge stopped gpc1 and other dbs replication (on the Mānoa cluster) to see if it has any effect on the processing speed.
  • 13:30 Serge restarts replication since stopping it has no apparent effect.
  • 14:00 Bill killed ~100 threads from survey.addstar jobs that had timed out. This reduced the mysqld cpu usage from 700% to ~250%
  • 14:17 Bill saw hundreds of faults on czartool. At least one problem was that ipp040 was having nfs problems accessing ipp013. force.umount seems to have fixed the problem.
  • 14:23 Bill ran pubtool -revert to clear 5 faults. None of them came back. Also ran warp and diff revert commands to clear up the faults due to ipp040-ipp013 communication problems.
  • 15:40 ipp013 got very overloaded running streaksremove on STS fields. The memory usage was such that nfs errors resulted and the jobs never finished. Bill restarted distribution with the usual number of hosts.
  • 19:58 Bill: Will this day ever end??? The warpPendingSkyCell book full of DONE jobs is delaying everything downstream. Ran warp.reset.
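
The 14:00 entry does not record how the timed-out threads were identified or killed. As a rough sketch of one way to do it, the code below builds MySQL KILL statements from process list rows. It assumes the rows come from SHOW FULL PROCESSLIST (Id, Command, Time, and Info are real columns of that output), and that a long run time marks a stuck thread, which is an assumption and not something the log states. The function name kill_statements and the sample rows are hypothetical.

```python
def kill_statements(processlist, min_seconds):
    """Return KILL statements for queries that have run at least min_seconds."""
    statements = []
    for row in processlist:
        # "Command" is "Query" for a thread executing a statement; idle
        # connections show "Sleep" and are left alone.
        if row["Command"] == "Query" and row["Time"] >= min_seconds:
            statements.append("KILL {};".format(row["Id"]))
    return statements


# Hypothetical rows, shaped like SHOW FULL PROCESSLIST output.
rows = [
    {"Id": 4101, "Command": "Query", "Time": 5400, "Info": "SELECT ..."},
    {"Id": 4102, "Command": "Sleep", "Time": 9000, "Info": None},
    {"Id": 4103, "Command": "Query", "Time": 12, "Info": "SELECT ..."},
]

for statement in kill_statements(rows, min_seconds=3600):
    print(statement)
```

The statements are only printed so they can be reviewed before anything is run against the database.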

Friday : 2011-04-29

  • 10:45 Bill fixed the one destreak fault from yesterday's 3pi data. The variance image from warp_id 188058 skycell.1198.141 was corrupt. This was difficult to debug since streaksremove was core dumping inside cfitsio.
  • 15:00 CZW: Stopped science and nebulous/apache to install a fix in Nebulous-Server that prevents the ATRC nodes from being randomly selected as a destination by a replication process. Re-enabled everything, and set the state of those nodes to up. Restarted replication pantasks (which doesn't actually use the random replication process, but rather specifically selects the target volume) and now it seems to be correctly shuffling data to the ATRC, while not putting any processing data there.
  • 17:00 Heather: sneaking in more MD.jtrp stuff. Found a bug in the format configs and stopped stdsci to fix it. Cleaned up MD02.jtrp, MD03.jtrp.

Saturday : 2011-04-30

  • 08:15 Bill queued 7 warp-warp diffs for the 14 sts test exposures from last night.

Sunday : 2011-05-01

  • 06:00 Roy restarted stdscience, which had crashed at around 5am.
  • 16:40 Bill restarted distribution which was way behind. Also doubled the number of hosts working on it.