PS1 IPP Czar Logs for the week 2011-05-09 - 2011-05-15

Monday : 2011-05-09

Bill is czar today.

  • 15:18 incomplete list of activities done so far today
    • restarted distribution with double the number of nodes (since it was way behind)
    • moved warp to its own pantasks.
    • Now we have a few too many pclients running on some of the weaker machines. Some nodes have entered swap heck. So removed 2 of the 3 host entries in distribution for ipp013 and ipp037.
    • Created a new publishing client and associated data store IPP-MOPS-LAP
    • queued next set of runs for STS.2010.b (STS.b.20100831)
    • repaired 2 warp and one diff skycells that had corrupted files. (ipp050, ipp051, and ipp025)

Tuesday : 2011-05-10

Bill is czar today.

  • Still getting lots of faults and file corruption. A couple of magicRuns got very damaged with zero-sized streakMap files. Our tools aren't ready for this.
  • 11:51 Stopping stdscience. The warp pantasks experiment is causing too much load on the system.
  • 12:33 stdscience restarted including warp.
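
The zero-sized streakMap files noted above are the kind of damage a simple scan can surface before a run fails. A minimal sketch, not an IPP tool: the function name, the directory argument, and the `*.streakMap` file-name pattern are assumptions made for illustration.

```python
import fnmatch
import os


def find_zero_size_files(root, pattern="*"):
    """Return sorted paths of empty regular files under root whose names match pattern."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not fnmatch.fnmatch(name, pattern):
                continue
            path = os.path.join(dirpath, name)
            try:
                if os.path.isfile(path) and os.path.getsize(path) == 0:
                    empty.append(path)
            except OSError:
                # File vanished or is unreadable mid-scan; skip it.
                continue
    return sorted(empty)
```

Something like `find_zero_size_files(run_dir, "*.streakMap")` would then list the candidates to repair or revert, where `run_dir` is whatever directory holds the run's outputs (hypothetical here).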

Wednesday : 2011-05-11

Heather is czar today

  • Bill: 09:40 Still getting some file corruption. chip 229883 XY04 warp 196230.skycell.1198.051 Fixed with tools/run*
  • 09:45 turned off chip processing to allow the nightly science warps to finish faster
  • 11:13 chip processing turned back on.
  • 19:25 ipp005 died. rebooted and set it to state down. Reverted the chipRuns that wanted files from there.

Thursday : 2011-05-12

bill (aka mr fussy) is watching things today

  • 07:30 ipp005 crashed again.
  • 11:00 stopped processing to check in two fixes to ppStack: remove an unneeded assertion failure, and reject inputs with fwhm_major > 12.
  • 11:28 Several MD08 diffs failed because they couldn't find the cmf from the reference stack. It turned out that one of the instances had zero size.
  • 11:34 turned chip processing off to allow the LAP warps to catch up a bit.
  • 13:25 cluster is quite unhappy. set stack poll to 20 to reduce the memory load.
  • 14:50 moved stack out of stdscience into its own pantasks using only compute nodes for processing.
  • 15:19 restarted distribution because pcontrol was pegging one of ippc15's cpus at 100%
  • 15ish setting up ippdb03 as the new ippdb01 replication server
  • 16:25 rebuilt psModules with fix for the problem reported in ticket #1484
  • 14:40 data from ipp005 is needed for postage stamp requests. Set host to repair mode.
  • 20:08 corrupt file caused several warp skycells to fail. fixed with: perl tools/ --chip_id 230276 --class_id XY13 --redirect-output

Friday : 2011-05-13

  • 06:45 Turned chip processing off to let the warps make better progress.
  • 07:00 Fixed corrupted chip file on ipp021 (ppImage run on ippc26) with perl tools/ --redirect-output --chip_id 231352 --class_id XY32
  • 07:00 Fixed corrupted camera mask files on ipp052 (psastro run on ipp044) with tools/ --redirect-output --cam_id 209191
  • 07:08 Fixed corrupted diff skyfile on ipp014 (ppSub run on ippc05) with tools/ --redirect-output --diff_id 131534 --skycell_id skycell.1468.063
  • 07:15 Fixed corrupted warp skyfile on ipp036 (pswarp run on ippc10) with tools/ --redirect-output --warp_id 197590 --skycell_id skycell.1379.032
  • 09:45 Fixed corrupted destreaked diff skyfile (on ipp035) with the steps described here.
  • 13:15 Killed ingestion of gpc1 into gpc1_1 on ipp001 to give some air to replication ingestion
  • 13:10 turned chip processing back on

Saturday : 2011-05-14

  • Bill 19:00 purged the incorrect warpstack diffs and magicRuns for last night's STS exposures. Wrote a script to queue 42 pairwise diffs. 7 pointings x 6 dithers x 2 visits.
  • Bill 19:02 distribution pantasks died. Restarted.
  • 19:20 a few skycells were failing due to corrupt camera mask files on ipp052 (psastro on ippc12); fixed with tools/ --redirect-output --cam_id 209573
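
The count of 42 pairwise diffs from 7 pointings x 6 dithers x 2 visits can be checked with a short sketch. It assumes each diff pairs the two visits of the same pointing and dither, which the log does not state explicitly; the function name and the index tuples are illustrative only and are not taken from the actual queueing script.

```python
from itertools import product

N_POINTINGS = 7
N_DITHERS = 6
N_VISITS = 2


def pairwise_diffs(n_pointings=N_POINTINGS, n_dithers=N_DITHERS):
    """Pair visit 0 with visit 1 for every (pointing, dither) combination."""
    pairs = []
    for pointing, dither in product(range(n_pointings), range(n_dithers)):
        first_visit = (pointing, dither, 0)
        second_visit = (pointing, dither, 1)
        pairs.append((first_visit, second_visit))
    return pairs


n_exposures = N_POINTINGS * N_DITHERS * N_VISITS  # 84 exposures in total
n_diffs = len(pairwise_diffs())                   # one diff per exposure pair
```

Under that assumption the 84 exposures pair off into 42 diffs, matching the number queued.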

Sunday : 2011-05-15

  • 12:45 Set revert.on for warps and diffs