PS1 IPP Czar Logs for the week 2011.07.25 - 2011.07.31

Monday : 2011-07-25

  • 09:07 Serge: A lot of red in publishing. pubtool -revert -dbname gpc1 -client_id 5. MOPS wants ANW3, ANO1, and OSS ASAP (see the pubtool sketch after this list).
  • 09:24 Serge: chip.off so that the ANO1 quad gets processed/reverted faster. OSS and ANW3 all published (except 2 exposures, because of bad cam quality).
  • 09:34 Serge: warp.off. 8 exposures in ANO1 to go
  • 09:43 Serge: pubtool -definerun -client_id 5 -label ThreePi.nightlyscience -dbname gpc1
  • 09:47 Serge: All MOPS quads published. chip.on, warp.on
  • 12:12 CZW: Restarted stack pantasks, as it did not appear to be properly queuing new jobs (and SAS had 900 stacks sitting idle).
  • 15:52 Bills: added M31 label to stdscience and updated the input file. Did not turn on warp-stack diffs yet. We need to examine the exposures relative to the current template.
  • 16:46 Mark: re-enabled MD10 for W-S diffs tonight.
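
  A minimal sketch of the publishing revert/re-queue cycle from this morning, using the same
  commands and flags recorded above (gpc1 database, MOPS client_id 5, ThreePi.nightlyscience
  label); the fault-count query is an assumption based on the schema visible in Thursday's entry:

    # revert the faulted publish runs for MOPS (client 5)
    pubtool -revert -dbname gpc1 -client_id 5
    # check how many publish faults remain (assumed query; tables as in the 2011-07-28 entry)
    mysql gpc1 -e "SELECT count(*) AS pending_faults
                   FROM publishRun JOIN publishDone USING(pub_id)
                   WHERE publishDone.fault > 0 AND publishRun.state = 'new';"
    # once clear, re-queue the nightly ThreePi publish run
    pubtool -definerun -client_id 5 -label ThreePi.nightlyscience -dbname gpc1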

Tuesday : 2011-07-26

  • 09:14: Serge: Heather ran the following summit_copy.pl by hand, which (apparently) fixed the summitcopy problem:
    summit_copy.pl --uri http://conductor.ifa.hawaii.edu/ds/gpc1/o5768g0297o/o5768g0297o43.fits 
    --filename neb://ipp026.0/gpc1/20110726/o5768g0297o/o5768g0297o.ota43.fits --summit_id 364059 
    --exp_name o5768g0297o --inst gpc1 --telescope ps1 --class chip --class_id ota43 --bytes 49432320 
    --md5 424f69d8594ef727d7b78ce03d0a7a43 --dbname gpc1 --timeout 600 --verbose --copies 2 
    --compress --nebulous
    
  • 09:36: Bill queued sts diffs for July 3.
  • 10:03: Bill/Serge killed some ppSub jobs that had been running for ages (ipp043/diff_id 147113/267983.34 seconds, ippc28/diff_id 147235/204905.41 seconds, ipp038/diff_id 147554/74371.20 seconds); see the sketch after this list for spotting such jobs.
  • 12:05: Serge: chip.off so that mops have OSS asap
  • 12:56: Serge: warp.off
  • 13:12: Serge: chip.on, warp.on
  • 13:20: Serge: ippdb03 is down
  • 13:40: Serge: ipp026 is down
  • 15:30: CZW: paused lap processing to track down and fix a bug that was preventing SAS2.12 gri runs from queuing stacks correctly. This is resolved now.
  • 17:30: Serge: At Gene's request, the system was stopped and shut down
  • 17:55: Serge: all pantasks shutdown
  • 18:00: Serge: all pantasks running
    ipp@ipp004:/home/panstarrs/ipp>./check_system.sh start.server
    [...]
    ipp@ipp004:/home/panstarrs/ipp>./check_system.sh start
    pantasks server cleanup has been started (host: ippc07)
    pantasks server distribution has been started (host: ippc15)
    pantasks server pstamp has been started (host: ippc17)
    pantasks server publishing has been started (host: ippc08)
    pantasks server registration has been started (host: ipp052)
    pantasks server replication has been started (host: ippc19)
    pantasks server stack has been started (host: ippc05)
    pantasks server stdscience has been started (host: ippc16)
    pantasks server summitcopy has been started (host: ipp050)
    pantasks server update has been started (host: ippc13)
    ipp@ipp004:/home/panstarrs/ipp>./check_system.sh run
    pantasks server cleanup is running (host: ippc07)
    pantasks server distribution is running (host: ippc15)
    pantasks server pstamp is running (host: ippc17)
    pantasks server publishing is running (host: ippc08)
    pantasks server registration is running (host: ipp052)
    pantasks server replication is running (host: ippc19)
    pantasks server stack is running (host: ippc05)
    pantasks server stdscience is running (host: ippc16)
    pantasks server summitcopy is running (host: ipp050)
    pantasks server update is running (host: ippc13)
    
  • 18:07: CZW: restarted replication pantasks with new input file that limits the number of hosts being used for shuffle. This does not seem to have had a significant impact on the shuffle speed.
  • 18:20: Serge: fixed and restarted replication of ippdb01 onto ippdb03
  • 18:20+epsilon Serge: ipp030 is down
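
  A minimal sketch of how the runaway ppSub jobs noted at 10:03 can be spotted on a node
  before killing them by hand; the elapsed-time listing uses standard ps, and the decision
  to kill is left to the czar after checking the diff_id on the command line:

    # list ppSub processes with PID and elapsed time ([[dd-]hh:]mm:ss)
    ps -eo pid,etime,args | grep '[p]pSub'
    # anything running for many hours (like diff_id 147113 above) is a candidate for
    #   kill <pid>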

Wednesday : 2011-07-27

  • 13:36 Bill: ipp030 is still down, causing lots of faults. Turned magic, destreak, and dist revert off to find out the extent of the errors.
  • 15:57 Bill turned the reverts back on since ipp030 has finished its brain transplant.

Thursday : 2011-07-28

  • 09:16 Serge: chip.off, warp.off
  • 09:20 Serge: killed ppSub on ipp036 (diff 148435) and on ipp049 (148445) running for about 3 hours
  • 09:21 Serge: pubtool -revert -label (OSS.nightlyscience, MD10.nightlyscience, MD08.nightlyscience)
  • 09:22 Serge: warp.on, chip.on
  • 09:55 Serge: trying to fix the 400+ failed published items (see the revert-loop sketch after this list):
    mysql> select pub_id, client_id, diffRun.state, diffRun.data_group, diffRun.dist_group from publishRun 
           join publishDone using(pub_id) join diffRun on stage_id = diff_id where publishDone.fault > 0 
           and publishRun.state ='new' and diffRun.dist_group ='ThreePi';
    +--------+-----------+-------+------------------+------------+
    | pub_id | client_id | state | data_group       | dist_group |
    +--------+-----------+-------+------------------+------------+
    | 237134 |         5 | full  | ThreePi.20110713 | ThreePi    | -> publishRun.state set to drop (likely too old)
    | 242765 |         1 | full  | ThreePi.20110727 | ThreePi    | \
    | 242871 |         1 | full  | ThreePi.20110727 | ThreePi    | |
    | 242873 |         1 | full  | ThreePi.20110726 | ThreePi    | |
    | 242882 |         1 | full  | ThreePi.20110726 | ThreePi    | |
    | 242928 |         1 | full  | ThreePi.20110726 | ThreePi    | |
    | 242929 |         1 | full  | ThreePi.20110726 | ThreePi    | |
    | 242930 |         1 | full  | ThreePi.20110726 | ThreePi    | |
    | 242945 |         1 | full  | ThreePi.20110726 | ThreePi    | |--> All fixed with pubtool -revert -pub_id <pub_id>
    | 242974 |         1 | full  | ThreePi.20110726 | ThreePi    | |
    | 242996 |         5 | full  | ThreePi.20110728 | ThreePi    | |
    | 243046 |         1 | full  | ThreePi.20110728 | ThreePi    | |
    | 243050 |         1 | full  | ThreePi.20110728 | ThreePi    | |
    | 243054 |         1 | full  | ThreePi.20110728 | ThreePi    | |
    | 243059 |         5 | full  | ThreePi.20110728 | ThreePi    | /
    +--------+-----------+-------+------------------+------------+
    15 rows in set (0.01 sec)
    
  • 10:45 Serge: Trying to fix failed warps (neb-mv missing log file + revert), e.g.:
    neb-mv neb://ipp014.0/gpc1/ThreePi.nt/2011/07/27/o5769g0171o.368889/o5769g0171o.368889.wrp.228460.skycell.1578.028.log neb://trash/o5769g0171o.368889.wrp.228460.skycell.1578.028.log
    warptool -revertwarped -dbname gpc1 -warp_id 228460
    
  • 11:10 Serge: Fixed failed cam runs (the trace files had no nebulous instances, but the log reported the problem as if the FITS file were corrupted or nonexistent)
  • 11:21 Bill fixed a zero sized raw file instance by copying a good copy on top of it: cp /data/ippb01.2/nebulous/3d/7d/1101912367.gpc1:20101008:o5477g0061o:o5477g0061o.ota75.fits /data/ipp025.0/nebulous/3d/7d/900677111.gpc1:20101008:o5477g0061o:o5477g0061o.ota75.fits
  • 11:15 Bill fixed some corrupt mask files with runcamexp cam_id 240526 and 240529
  • 11:50 Serge: Fixed all ThreePi.nightlyscience failed warps
  • 12:13 Bill fixed a bug in magic_destreak_revert.pl which caused camera stage faults to go to failed_revert. Cleared all destreak state faults.
  • 12:25 Serge: UPDATE publishRun JOIN publishDone using(pub_id) join diffRun on stage_id = diff_id SET publishRun.state='drop' WHERE publishRun.state='new' and diffRun.dist_group='ThreePi.offnight'; to fix the 417 exposures which were queued to publishing by mistake
  • 13:40 Bill added module backgound.pro to stdscience to do the background restoration for the M31 images.
  • 13:51 Bill queued STS diffs for July 4 - 9
  • 13:56 Bill put a couple of chips that were repeatedly asserting in psphot out of their misery. See ticket #1493:
    • chiptool -updateprocessedimfile -chip_id 259540 -fault 0 -set_quality 42 -class_id XY67
    • chiptool -updateprocessedimfile -chip_id 259539 -fault 0 -set_quality 42 -class_id XY67
  • 17:12 Bill found that the warpPendingSkyCell book in stdscience was full of skycells in state done. warp.reset cleared that out.
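
  A minimal sketch of the per-pub_id revert used in the 09:55 entry, feeding the pub_ids from
  that query back into pubtool (as was done by hand for the 14 runs above); the batch-mode
  mysql invocation is an assumption, and the -dbname flag follows the other pubtool calls in
  this log:

    mysql -N gpc1 -e "SELECT pub_id FROM publishRun
                      JOIN publishDone USING(pub_id)
                      JOIN diffRun ON stage_id = diff_id
                      WHERE publishDone.fault > 0
                        AND publishRun.state = 'new'
                        AND diffRun.dist_group = 'ThreePi';" |
    while read pub_id; do
        # revert each faulted publish run individually
        pubtool -revert -pub_id "$pub_id" -dbname gpc1
    done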

Friday : 2011-07-29

  • 08:30 Not much data last night.
  • 08:30 put a diff skyfile that was repeatedly asserting out of its misery (-diff_id 148886 -skycell_id skycell.7.25). See ticket #1494
  • 08:50 queued the rest of the STS diffs.
  • 10:45 stopped all pantasks for ippdb00 & ippdb01 memory upgrades (32GB -> 64GB) (EAM)
  • 11:53 restarted all pantasks after the ippdb00 & ippdb01 memory upgrade (EAM)
  • 17:36 CZW: realized that the reason the nebulous graphs weren't updating was that nebdiskd didn't restart after the upgrade. Restarted nebdiskd (see the check below).
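
  A minimal check for the nebdiskd problem above, assuming the daemon appears under that name
  in the process table (the exact restart invocation is not recorded in this log):

    # verify the disk-monitoring daemon came back after the ippdb00/ippdb01 reboots
    pgrep -fl nebdiskd || echo "nebdiskd is not running -- nebulous graphs will not update"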

Saturday : 2011-07-30

  • 07:55 All STS data has completed diff, magic, and destreak. There are 311 dist runs to process and they are going slowly, with a high fault rate. Turned off dist.revert to investigate. Also, since we have so many stages to poll but only 3 (chip, camera, and warp) are processed for STS, the queue empties between polls. Set the poll limit to 300 and dropped the exec time for dist.process.load from 20 seconds down to 5.
  • 08:00 The faults have stopped since revert was turned off. There are 14 or so faults that must be repeating

Sunday : 2011-07-31

  • 11:00 Bill found that two of the dist faults were due to corrupt chip files. Luckily the destreak run hadn't been cleaned up yet for 2 of them, so those were fixed. For the MD08 component I set the quality to 42 temporarily. The other 12 errors were due to warp log files that have gone missing; neb-touching the filenames allowed the distribution to complete (see the sketch after this list).
  • 11:00 dist.revert.on
  • 11:02 Set STS data that has finished distribution to be cleaned. This will free up space for ~1000 exposures
  • 17:00 cleared an STS warp fault 4. It worked the second time. Random. Queued STS diffs for last night's data.
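
  A minimal sketch of the neb-touch fix for the 12 missing warp log files noted at 11:00,
  assuming neb-touch accepts a nebulous filename like the neb-mv call in Thursday's entry,
  and that the missing names have been collected into a file (missing_warp_logs.txt is
  hypothetical):

    # recreate placeholder instances for the warp log files whose nebulous
    # instances went missing, so the distribution stage can complete
    while read logfile; do
        neb-touch "$logfile"
    done < missing_warp_logs.txt
    # then turn distribution reverts back on in pantasks (dist.revert.on, as at 11:00)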