PS1 IPP Czar Logs for the week 2012.01.23 - 2012.01.29

Monday : 2012.01.23

  • stdsci is slow, restarting. Not much data from last night.
  • CZW: Missing burntool tables seemed to be preventing some chips from finishing. To resolve this, I did the following: first, confirm the bad file status with neb-stat --validate. Second, regenerate the burntool table with the ipp_apply_burntool_single.pl command; this requires the exp_id, the class_id, and the nebulous keys of the fits files for this image and the previous image (the previous-image URI is generally the same as this one, with a different exposure number, as in the example). Finally, issue a neb-repair command to ensure that all instances of the table agree. A wrapper sketch tying these steps together appears after this list.
    # neb-stat --validate neb://ipp040.0/gpc1/20100719/o5396g0129o/o5396g0129o.ota15.burn.tbl
     [...]
           1 d41d8cd98f00b204e9800998ecf8427e file:///data/ippb01.2/nebulous/be/47/963446032.gpc1:20100719:o5396g0129o:o5396g0129o.ota15.burn.tbl
          0                     NON-EXISTANT file:///data/ippb02.0/nebulous/be/47/963447208.gpc1:20100719:o5396g0129o:o5396g0129o.ota15.burn.tbl
    # ipp_apply_burntool_single.pl --class_id XY15 --exp_id 193253 --dbname gpc1 --this_uri neb://ipp040.0/gpc1/20100719/o5396g0129o/o5396g0129o.ota15.fits --previous_uri neb://ipp040.0/gpc1/20100719/o5396g0128o/o5396g0128o.ota15.fits
    # neb-repair neb://ipp040.0/gpc1/20100719/o5396g0129o/o5396g0129o.ota15.burn.tbl
    Repairing instances
            cp /data/ippb01.2/nebulous/be/47/963446032.gpc1:20100719:o5396g0129o:o5396g0129o.ota15.burn.tbl /data/ippb02.0/nebulous/be/47/963447208.gpc1:20100719:o5396g0129o:o5396g0129o.ota15.burn.tbl
    
  • CZW: To clear up warp updates that are failing due to missing camera-stage products, I've been using an SQL query that generates the appropriate warptool commands to mark these skycells as impossible (quality 42) and force processing to continue (see the sketch after this list for one way to run the commands it generates).
    select CONCAT("warptool -tofullskyfile -warp_id ",warp_id," -skycell_id ",skycell_id," -set_quality 42 -dbname gpc1") from warpRun JOIN warpSkyfile USING(warp_id) where label = 'LAP.ThreePi.20110809' AND state = 'update' AND data_state = 'update' AND fault != 0;
    
  • 12:30 Mark: MD04.GR0 full reprocessing of nightly stacks should start to greatly harass distribution soon
  • 12:40 Mark: dropping an MD02 exposure from last week that was junk (lost guiding), using a loop over each OTA (XY) cell; see the loop sketch after this list
    regtool -dbname gpc1 -updateprocessedimfile -set_ignored -exp_id 440256 -class_id XY
    
  • 15:00 Bill: restarted pstamp and update pantasks on ippc17. Their pcontrols were spinning.
  • 23:00 Mark: cleared out MD04,06.GR0 stack fault 5 as quality 13007. Sent MD06.GR0 warps to cleanup, which may free up some space on the full 20 TB disks.
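
Re the burntool repair above: a minimal wrapper sketch tying the three steps together. The script name, argument handling, and derived burn.tbl key are hypothetical; the commands and flags are the ones from the worked example (here # marks comments, not a shell prompt).

    # fix_burntool.sh -- hypothetical wrapper; usage: fix_burntool.sh <exp_id> <class_id> <this_uri> <previous_uri>
    exp_id=$1; class_id=$2; this_uri=$3; prev_uri=$4
    # the burn.tbl nebulous key sits alongside the .fits key
    burn_uri=${this_uri%.fits}.burn.tbl
    # 1. confirm the bad instance
    neb-stat --validate $burn_uri
    # 2. regenerate the burntool table from this image and the previous one
    ipp_apply_burntool_single.pl --class_id $class_id --exp_id $exp_id --dbname gpc1 \
        --this_uri $this_uri --previous_uri $prev_uri
    # 3. make all instances of the table agree again
    neb-repair $burn_uri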
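
Re the warp-update query above: the generated commands can be run by piping the query output straight into a shell, along these lines. The mysql invocation (and whatever host/user options it needs) is an assumption; the query is the one shown above, with the string quoting switched to single quotes so it survives the shell.

    mysql -N -B gpc1 -e "select CONCAT('warptool -tofullskyfile -warp_id ',warp_id,' -skycell_id ',skycell_id,' -set_quality 42 -dbname gpc1') from warpRun JOIN warpSkyfile USING(warp_id) where label = 'LAP.ThreePi.20110809' AND state = 'update' AND data_state = 'update' AND fault != 0;" | sh -x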
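
Re the MD02 exposure drop above: a sketch of the per-OTA loop, assuming the full 8x8 XY grid (XY00 through XY77); cells that do not exist for this exposure are assumed to fault harmlessly.

    # hypothetical loop over all OTA cells of exp_id 440256
    for x in 0 1 2 3 4 5 6 7; do
        for y in 0 1 2 3 4 5 6 7; do
            regtool -dbname gpc1 -updateprocessedimfile -set_ignored -exp_id 440256 -class_id XY${x}${y}
        done
    done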

Tuesday : 2012.01.24

  • 10:19 Bill turned off staticsky since stdscience is a bit behind
  • Heather restarted stdscience and stack sometime this afternoon. This turned staticsky back on.
  • 16:06 Bill turned staticsky off again.
  • 16:08 Bill turned chip.off to help last night's science complete
  • 16:15 turned ipp064 on in stdscience to work on the backlogged jobs
  • 17:20 chip.on

Wednesday : 2012.01.25

Bill is czar today

  • 03:15 stdscience is backed up with fake jobs again. This time the class_id is XY46, whose target host is ipp029. ipp029 is currently in hosts_wave2_ignore, so no jobs run there. Set 3 of the stdscience hosts to on; the backlog is clearing slowly.
  • 03:30 repaired 4 broken XY26 instances that had faulted
  • 07:30 all nightly exposures downloaded, registered, and through camera state. Still a big backup at fake. Turning chip.off for a bit
  • 07:52 fake caught up. chip.on
  • 11:29 set staticsky to on with a very low poll limit. I'll reorganize the pantasks to run only on high memory hosts after lunch
  • 13:28 restarted stdscience. Modified fake.imfile.run to not use host targeting. Since no files are actually accessed this isn't needed.
  • 13:45 set stack pantasks to stop in preparation for re-organization
  • 14:41 stack (running stack only) and deepstack (running LAP staticsky, no stack) restarted.

Thursday : 2012.01.26

Bill is czar again today.

  • 04:25 Looks like they had an unannounced (partial) stare night. Ran stare_nodes.sh off. We are currently 849 exposures behind on downloads. That number is slowly dropping. So far registration is keeping up.
  • 12:00 We still have 414 exposures left to download.
  • 13:00 61 more stare exposures left to download. Setting pzDownloadExp.state = 'wait' for the stare exposures so that the science data can jump ahead in line (a hedged SQL sketch appears after this list).
  • 13:16 This is working. We have started downloading and processing the 265 science exposures taken after the stares.
  • 14:18 Since the stare nodes don't seem to be doing anything, added them back into processing.
  • 14:14 cleanup pantasks died. Here is the tail of pantasks.stderr.log:
controller is not responding (0 tries)
garbage in pcontrol reponse
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
controller is not responding (0 tries)
missing PID in pcontrol message : programming error
ControllerCommand returns: 1
ControllerCommand response: Njobs: 0

controller is not responding (0 tries)
controller is not responding (0 tries)

  • 14:31 Queued some more staticsky runs. Projection cells 0713, 785 - 793
  • 17:19 all science exposures successfully downloaded and burntooled. Set stare exposures in wait state back to run.
  • 17:25 set stdscience to stop in preparation for restarting it. Decided not to right now and set it back to 'run'
  • 19:00 set stdscience to stop in preparation for restarting it
  • 19:05 restarted stdscience
  • 19:07 stare_nodes.sh off
  • 20:57 7 x (control host delete ipp064) in stdscience. Will fix the config file
  • 21:34 well, that didn't work (jobs in the queue with ipp064 as target didn't run); 7 x (control host add ipp064)
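
Re the 13:00 pzDownloadExp note above: the state change was presumably an UPDATE of roughly this shape. This is only a sketch; the way the stare exposures are identified (here a hypothetical exp_name pattern) is not recorded in the log and would need to be replaced with the real selection.

    # hypothetical -- the WHERE clause picking out the stare exposures is a guess
    mysql gpc1 -e "UPDATE pzDownloadExp SET state = 'wait' WHERE exp_name LIKE '%stare%' AND state = 'run'"
    # and at 17:19 the same rows were set back to run:
    mysql gpc1 -e "UPDATE pzDownloadExp SET state = 'run' WHERE exp_name LIKE '%stare%' AND state = 'wait'"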

Friday : 2012.01.27

  • 04:50 Ken Smith reports that yesterday's stack-stack diffs didn't show up on the data store. They didn't run because the stacks weren't ready in the time window in which the survey task runs (20 - 21 UT / 10 - 11 am HST). Tweaked the task parameters to get it to run now, 14:50 - 15:15 UT. Need to figure out how to undo that or restart stdscience before tomorrow night.
  • 05:16 stack stack diffs are on the data store.
  • 09:16: Roy just realized he is czar today...
  • 09:40 Serge: /tmp full on ippc01. Stopped apache. Moved /tmp/nebulous_server.log to /export/ippc01.0/ipp (Renamed it to nebulous_server_20120127.log and bzipped it there). Restarted apache.
  • 14:00 Bill made some changes to the cleanup SQL and the script so that bad-quality files are not cleaned up and chips that fail are not set to error_cleaned.
  • 14:50 Bill updated the postage stamp server code to include the fixes to detectability queries. He will quietly announce this to a couple of interested users.
  • 14:54 Bill set all ps_ud% labels to be cleaned. Earlier he set a whole slew of lingering magicDSRuns that were in full state to be cleaned.
  • 17:10 Bill halted things for a bit in order to pick up a bug fix in psphotForced

Saturday : 2012.01.28

  • 14:10 Bill repaired a number of bad instances (mostly ota 26)
  • 14:44 Bill rebuilt chip 394275 XY55 to fix a corrupt file. Strange, the script checks for this now, yet this file got by...

Sunday : 2012.01.29