PS1 IPP Czar Logs for the week 2011.01.31 - 2011.02.06

(Up to PS1 IPP Czar Logs)

Monday : 2011.01.31

  • Bill has magic cleanup running on ipp053, cleaning up old magic runs. This is the reason for the relatively higher load there. The pantasks directory is ~bills/magic_cleanup.

Tuesday : 2011.02.01

  • Bill 06:55 Lots of red on the board this morning. This could be because there is a lot of update processing going on. 3 stack faults. Turned stack.revert on (since we fixed the known bugs it should be on all the time now.)
  • Bill 07:00 Once last night's data is through warp I will queue the next STS night from September. We still have several thousand STS exposures to deliver. This key project has been much neglected.
  • Lots of database problems (probably due to multiple processing streams) and nfs errors today.
  • Bill 17:24 Two distRuns failed to complete because of corrupted input files (dist_ids 409008, 403036). Set their state to 'hold' to stop the fail/revert/fail cycle.
  • Set STS.201009 to be distributed.
  • 17:30 set update and distribution pantasks to off to see if the load drops significantly. Yep. Set them to run at 17:40
  • fixed a corrupted warp file with runwarpskyfile. I lost the ids; the diff that was failing completed.
  • 15:50 STS is through diff, so the only things running are distribution and update. Still getting lots of nfs faults.
  • restarting distribution pantasks to reduce the number of enabled hosts.
  • we got an MD07 exposure last night and we are threatened with MD06. Added these labels back into the stdscience survey tasks.
  • 17:00 Bill is investigating distribution and destreak failures. Turned dist.revert and destreak.revert off (pantasks toggles; see the sketch after this list).
  • all reverts back on
  • fixed one corrupt warp component with perl ~bills/ipp/tools/runwarpskycell.pl --redirect-output --warp_id 157463 --skycell_id skycell.298
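
For reference, the revert switches mentioned above are toggled at the pantasks console. A minimal sketch of today's sequence, assuming the .on forms mirror the .off forms quoted in this log (exact names depend on the pantasks setup):

  dist.revert.off
  destreak.revert.off
  (investigate the distribution/destreak failures)
  dist.revert.on
  destreak.revert.on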

Wednesday : 2011.02.02

  • 05:50 (Bill) space page on czartool is mostly red. STS.20100909 is done except for 3 broken diff stage distRuns. Dropped those and set chip, warp, and diff data to be cleaned.
  • 11:38 (heather) Heather is running a separate stdscience to process magictest and mopssuite data.
  • 14:30 (Serge): I changed the /etc/mysql/my.cnf (backup is /etc/mysql/my.cnf.O_DIRECT). O_DIRECT has been replaced by O_DSYNC.

For reference, the mysql log shows:

  [...]
110202 14:23:17 [Note] /usr/sbin/mysqld: Normal shutdown
110202 14:28:07 [Note] /usr/sbin/mysqld: Shutdown complete
  [...]
110202 14:28:39 [Note] /usr/sbin/mysqld: ready for connections.
  [...]

Command to stop the server: mysqladmin shutdown
Command to start the server: /etc/init.d/mysql start
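
For reference, O_DIRECT and O_DSYNC are values of MySQL's InnoDB flush method, so the change above presumably amounts to editing one line of /etc/mysql/my.cnf along these lines (variable name assumed; the backup copy keeps the old value):

  # InnoDB section of /etc/mysql/my.cnf
  innodb_flush_method = O_DSYNC   # was O_DIRECT (see my.cnf.O_DIRECT)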

Thursday : 2011.02.03

  • Bill 21:03 Two STS chip files with zero-sized burntool instances. I'm starting to think that the problem isn't replication processing but rather dsreg's replication.
(ipp004:~) bills% nvi neb://ipp006.0/gpc1/STS.201009/o5449g0304o.222698/o5449g0304o.222698.ch.189072.XY03.log
reading /data/ipp006.0/nebulous/fb/fb/675079205.gpc1:STS.201009:o5449g0304o.222698:o5449g0304o.222698.ch.189072.XY03.log

Found that this burntool file was bad:

(ipp004:~) bills% nls neb://ipp006.0/gpc1/20100910/o5449g0304o/o5449g0304o.ota03.burn.tbl
/data/ipp052.0/nebulous/a4/4b/461150307.gpc1:20100910:o5449g0304o:o5449g0304o.ota03.burn.tbl

Sure enough, one of the instances is empty:

(ipp004:~) bills% nls -l -a !$
nls -l -a neb://ipp006.0/gpc1/20100910/o5449g0304o/o5449g0304o.ota03.burn.tbl
-rw-rw-r-- 1 apache nebulous 0 Sep 23 10:30 /data/ipp052.0/nebulous/a4/4b/461150307.gpc1:20100910:o5449g0304o:o5449g0304o.ota03.burn.tbl
-rw-rw-r-- 1 apache nebulous 206285 Sep 23 10:33 /data/ipp034.0/nebulous/a4/4b/461151851.gpc1:20100910:o5449g0304o:o5449g0304o.ota03.burn.tbl

Copy the good instance over the bad one:

(ipp004:~) bills% cp /data/ipp034.0/nebulous/a4/4b/461151851.gpc1:20100910:o5449g0304o:o5449g0304o.ota03.burn.tbl /data/ipp052.0/nebulous/a4/4b/461150307.gpc1:20100910:o5449g0304o:o5449g0304o.ota03.burn.tbl

Second bad file found with:
(ipp004:~) bills% nvi  neb://ipp034.0/gpc1/STS.201009/o5449g0174o.222568/o5449g0174o.222568.ch.189011.log
no instances found
(ipp004:~) bills% nvi neb://ipp034.0/gpc1/STS.201009/o5449g0174o.222568/o5449g0174o.222568.ch.189011.XY57.log
reading /data/ipp034.0/nebulous/5b/5c/673256319.gpc1:STS.201009:o5449g0174o.222568:o5449g0174o.222568.ch.189011.XY57.log
(ipp004:~) bills% nls -a -l neb://ipp034.0/gpc1/20100910/o5449g0174o/o5449g0174o.ota57.burn.tbl
-rw-rw-r-- 1 apache nebulous 0 Sep 23 10:29 /data/ipp027.0/nebulous/4e/e0/461150290.gpc1:20100910:o5449g0174o:o5449g0174o.ota57.burn.tbl
-rw-rw-r-- 1 apache nebulous 274019 Sep 23 10:33 /data/ipp042.0/nebulous/4e/e0/461150997.gpc1:20100910:o5449g0174o:o5449g0174o.ota57.burn.tbl
(ipp004:~) bills% cp /data/ipp042.0/nebulous/4e/e0/461150997.gpc1:20100910:o5449g0174o:o5449g0174o.ota57.burn.tbl /data/ipp027.0/nebulous/4e/e0/461150290.gpc1:20100910:o5449g0174o:o5449g0174o.ota57.burn.tbl
(ipp004:~) bills% !nls
nls -a -l neb://ipp034.0/gpc1/20100910/o5449g0174o/o5449g0174o.ota57.burn.tbl
-rw-rw-r-- 1 apache nebulous 274019 Feb  3 21:02 /data/ipp027.0/nebulous/4e/e0/461150290.gpc1:20100910:o5449g0174o:o5449g0174o.ota57.burn.tbl
-rw-rw-r-- 1 apache nebulous 274019 Sep 23 10:33 /data/ipp042.0/nebulous/4e/e0/461150997.gpc1:20100910:o5449g0174o:o5449g0174o.ota57.burn.tbl


(ipp004:~) bills% chiptool -revertprocessedimfile -label STS.201009
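
To summarize the repair pattern above for a zero-length burntool replica (a sketch with placeholder names; ippNNN, DATE, EXPNAME, and XY stand for the host, night, exposure, and OTA in question):

  nls -a -l neb://ippNNN.0/gpc1/DATE/EXPNAME/EXPNAME.otaXY.burn.tbl   # list all replicas; look for a 0-byte instance
  cp <path of the good replica> <path of the empty replica>           # overwrite the empty copy with the good one
  chiptool -revertprocessedimfile -label STS.201009                   # revert the faulted chip imfiles so they reprocess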

Friday : 2011.02.04

  • 06:53 boosted 3pi priority over MD to speed up the processing of 136.3PI._.A-NE-P2% data
  • 23:07 magic and its brothers in the distribution pantasks have fallen behind. Restarted distribution and doubled the number of active nodes.

Saturday : 2011.02.05

The following entries are from Bill

  • 06:30 all data from last night appears to be through warp. Distribution still has lots of work to do.
  • 06:50 ipp008's load just spiked and the host became unresponsive. A few minutes later ganglia figured that out and listed the host as down. Since it was just upgraded, I'm reluctant to restart it.
  • turned off destreak.revert, dist.revert, and stack.revert since the faults are likely to recur with the node down.
  • set cleanup to 'stop' (it wasn't busy yet)
  • 08:03 I noticed that ipp008's nebulous state was "mounted = 0, allocate = 1, available = 1". This clearly isn't correct since the host has crashed. Set it to down.
  • 10:00 ipp008 eventually panic'd; Gene rebooted it.
  • 10:30 turned on revert tasks.
  • 10:50 set cleanup to run. It doesn't seem to have anything to do though.

Sunday : 2011.02.06