PS1 IPP Czar Logs for the week 2012.02.06 - 2012.02.12

(Up to PS1 IPP Czar Logs)

Monday : 2012.02.06

Roy is czar

  • 09:00: only 70 science images, all through the pipeline
  • 12:45 Serge: replication/shuffling seemed to be stuck. Following Chris's advice, I shut it down and restarted it.
  • 13:30: Roy: ran Bill's script to repair 'lost' instance of Exp: 193145, OTA: 15
repair_bad_instance -e 193145 -c XY15 -l
  • 13:40: fixing a bad burn.tbl as an example for Roy:
neb-stat --validate neb://ipp049.0/gpc1/20100629/o5376g0254o/o5376g0254o.ota54.burn.tbl --class_id XY54 --exp_id 188719 --dbname gpc1 --this_uri neb://ipp049.0/gpc1/20100629/o5376g0254o/o5376g0254o.ota54.fits --previous_uri neb://ipp049.0/gpc1/20100629/o5376g0253o/o5376g0253o.ota54.fits
neb-repair neb://ipp049.0/gpc1/20100629/o5376g0254o/o5376g0254o.ota54.burn.tbl
  • 13:50: Roy: Trying one on my own:
neb-stat --validate neb://ipp051.0/gpc1/20100719/o5396g0026o/o5396g0026o.ota64.burn.tbl --class_id XY64 --exp_id 193152 --dbname gpc1 --this_uri neb://ipp051.0/gpc1/20100719/o5396g0026o/o5396g0026o.ota64.fits --previous_uri neb://ipp051.0/gpc1/20100719/o5396g0025o/o5396g0025o.ota64.fits
neb-repair neb://ipp051.0/gpc1/20100719/o5396g0026o/o5396g0026o.ota64.burn.tbl
  • same for:
--class_id XY65 --exp_id 193166
--class_id XY51 --exp_id 193145
  • and finally:
repair_bad_instance -e 193162 -c XY26 -r
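The neb-stat/neb-repair pairs above differ only in chip, exposure, and URIs, so they can be wrapped in a small helper. This is a sketch, not an IPP tool: the `fix_burn_tbl` name, the `DRYRUN` variable, and the assumption that the burn.tbl URI is always the .fits URI with its extension swapped (which holds for every example in this log) are mine; the neb-stat/neb-repair flags are exactly those used above.

```shell
# Hypothetical helper (not an IPP tool): wraps the neb-stat/neb-repair pair.
# DRYRUN=echo prints the commands instead of running them; set DRYRUN= to
# execute for real on the cluster.
DRYRUN=echo

fix_burn_tbl() {
    # args: class_id exp_id this_uri previous_uri
    local class_id=$1 exp_id=$2 this_uri=$3 prev_uri=$4
    # assumed convention: burn.tbl URI = .fits URI with the extension swapped
    local burn_uri=${this_uri%.fits}.burn.tbl
    $DRYRUN neb-stat --validate "$burn_uri" --class_id "$class_id" \
        --exp_id "$exp_id" --dbname gpc1 \
        --this_uri "$this_uri" --previous_uri "$prev_uri"
    $DRYRUN neb-repair "$burn_uri"
}

# e.g. the XY64 / 193152 case from 13:50
fix_burn_tbl XY64 193152 \
    neb://ipp051.0/gpc1/20100719/o5396g0026o/o5396g0026o.ota64.fits \
    neb://ipp051.0/gpc1/20100719/o5396g0025o/o5396g0025o.ota64.fits
```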

Tuesday : 2012.02.07

Roy is czar

  • 08:00: Roy: No data last night
  • 08:00: Roy: LAP stuck again. One chip with a single corrupted copy and two with no copies at all, so:
repair_bad_instance -e 183688 -c XY26 -l
repair_bad_instance -e 183688 -c XY32 -l
repair_bad_instance -e 183664 -c XY65 -l
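Runs of repairs like the three above can be batched. A minimal sketch, assuming nothing beyond the -e/-c/-l flags already used in this log; the `repair_batch` name and `DRYRUN` dry-run variable are mine, not part of Bill's script.

```shell
# Hypothetical batch wrapper around Bill's repair_bad_instance script.
# DRYRUN=echo prints the commands instead of running them.
DRYRUN=echo

repair_batch() {
    # each argument is an exp_id:chip pair
    local pair
    for pair in "$@"; do
        $DRYRUN repair_bad_instance -e "${pair%%:*}" -c "${pair##*:}" -l
    done
}

# this morning's stuck LAP chips
repair_batch 183688:XY26 183688:XY32 183664:XY65
```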
  • 09:42 postage stamp server and data store shut down for database recovery.
  • 10:00: Roy: stuck stacks, so:
stacktool -revertsumskyfile -fault 2 -label LAP.ThreePi.20110809 -dbname gpc1
  • 11:18 Bill: a couple of distribution fileset faults were lingering, probably due to the ippc17 failures. The faults were dsreg "fileset already exists" errors; deleted the existing filesets and the registrations completed.
  • 11:40: Roy: more stuck stacks:
stacktool -revertsumskyfile -fault 2 -label LAP.ThreePi.20110809 -dbname gpc1
  • 12:10 Earlier, Bill queued all of the ps_ud% labels for cleanup, but cleanup was proceeding slowly. Restarted the cleanup pantasks; much better.
  • 14:55: Roy: Another lost chip:
repair_bad_instance -e 179899 -c XY76 -l
  • 15:19 Bill queued some M31 data in the g, z, and y filters for processing. Some of these exposures are old, so there may be some problems.
  • 15:33 As predicted, some of the data has an old burntool_state. Set those runs' labels to M31.gzy.old.burntool
  • 16:15: Roy:
repair_bad_instance -e 191531 -c XY76 -r
  • 16:20: Roy: a 'bad' burn.tbl:
neb-stat --validate neb://ipp042.0/gpc1/20100614/o5361g0284o/o5361g0284o.ota23.burn.tbl --class_id XY23 --exp_id 181037 --dbname gpc1 --this_uri neb://ipp042.0/gpc1/20100614/o5361g0284o/o5361g0284o.ota23.fits --previous_uri neb://ipp042.0/gpc1/20100614/o5361g0283o/o5361g0283o.ota23.fits
neb-repair neb://ipp042.0/gpc1/20100614/o5361g0284o/o5361g0284o.ota23.burn.tbl
  • 16:40: Roy: 3 more stuck stacks:
stacktool -revertsumskyfile -fault 2 -label LAP.ThreePi.20110809 -dbname gpc1
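That is the third time today the same revert was needed, so a tiny wrapper saves retyping it. A sketch only: `revert_stuck` and the `DRYRUN` variable are my names; the stacktool invocation is exactly the one above, parameterized by label.

```shell
# Hypothetical wrapper; the stacktool command is exactly the one used in
# today's entries. DRYRUN=echo for a dry run, DRYRUN= to execute.
DRYRUN=echo

revert_stuck() {
    # revert summary-skyfile entries stuck on fault 2 for the given label
    $DRYRUN stacktool -revertsumskyfile -fault 2 -label "$1" -dbname gpc1
}

revert_stuck LAP.ThreePi.20110809
```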

Wednesday : 2012.02.08

  • 09:45 Mark: stdscience seems to be struggling to catch up with last night's data. pcontrol CPU use is high; restarting. Still struggling, but mostly because many pending jobs want ipp015.
  • 11:35 CZW: turning off in pantasks the hosts that are having repair work done today: ippc08, ipp030, ipp028, ipp031, ipp023. I've also set these hosts to state 'repair' in nebulous. When I get the all clear, I will undo these changes.
  • 12:04 CZW: We have a backlog of jobs that all want to run on ipp015. To prevent this from slowing everything down for the rest of the day, I'm removing ipp015 from stdscience to allow these jobs to run on other hosts. This will increase the NFS load on ipp015, but once we've cleared this backlog, I'll add it back in.
  • 13:15 CZW: Further work on ipp008, ipp014, ipp016. I've added ipp023, ipp030, and ipp031 back, as the work on them is finished.
  • 17:09 CZW: ipp033 went down a few times, so I've marked it as repair in nebulous. ipp027 and ipp028 won't come back up, and I've set them to down in nebulous. It seems like the maintenance on the remaining hosts is finished, so I've set them back to 'up'.

Thursday : 2012.02.09

  • 09:30 Heather: chip.revert.on (and off) - everything had faulted in the chip stage because of ipp033. The files were accessible again (?), so the revert worked.
  • 12:52 CZW: ipp027 and ipp028 appear to be back up and running. I've set them to 'repair' in nebulous, and will move them to 'up' later today if they stay up and running.
  • 16:55 Mark: tweaking stdscience to run the MD SSdiffs at 5pm Hawaii time now that the stacks are finished.

Friday : 2012.02.10

  • 14:21 CZW: Restarted the stack pantasks, as it appeared to be confused (it claimed jobs were running when top showed they weren't, and refused to queue jobs that stacktool -tosum listed).
  • 14:33 Bill: M31.gzy has a few chips left to do, which are requiring massive amounts of memory (good seeing, galactic bulge, millions of sources). I've removed the label from stdscience and started a new pantasks on ipp054 to run the jobs: ~ipp/m31.gzy

Saturday : 2012.02.11

  • 11:30 Mark: giving LAP a kick on a few chip faults.
  • 11:45 stdscience staggering; pcontrol running with high CPU use. Restarting and adding the labels back in. Oops, M31.gzy was still in the default labels, so it was added back on restart; removing it in both cases...
  • 15:15 stare03 appears down on ganglia but really isn't; it came back once the load dropped.
  • 15:20 some LAP warps stalled/stuck. Looks like the skycells are all poor quality (42 or others) and stuck in state=update, data_state=full (e.g., warp_id=344489, 344494).
  • 17:30 ipp009 has been getting overly abused today: pswarp for o5441g0060o.217792.wrp.365925.skycell.1122.003 is using >26% of 20GB RAM, and a couple of others are in the high teens.
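To spot which jobs are eating a node's memory, a plain ps/awk filter is enough. This is standard shell, nothing IPP-specific; the `mem_hogs` name and the 20% threshold are illustrative.

```shell
# List processes using more than a given percentage of RAM (default 20%).
# Run on the loaded node (e.g. ipp009); prints PID, %MEM, and command.
mem_hogs() {
    ps aux | awk -v t="${1:-20}" 'NR > 1 && $4 + 0 > t { print $2, $4 "%", $11 }'
}

mem_hogs 20                 # anything over 20% of RAM
mem_hogs 15 | grep pswarp   # just the warp jobs in the high teens and up
```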

Sunday : 2012.02.12

  • 09:30 Mark: looks like ipp014 is unresponsive; can't log in even through the console. Rebooted okay; console message saved as ipp014-2012-02-12-crash. ipp054 is turned off but still has 1 entry in stdscience, presumably for a reason (M31.gzy?). It doesn't seem busy, so turning some hosts back on to push nightly science through.
  • 10:00 Bill: turning off ipp054 again. MD is backed up waiting for ipp054, so SSdiff will need to be tweaked to run in a couple of hours instead. Will try deleting the host from stdscience in the future to see if that unthrottles it.
  • 12:20 running "server input tweak_ssdiff" in stdscience to finish the remaining SSdiffs.
  • 14:29 m31.gzy chips are finally done. Turning ipp054 back on in stdscience.