PS1 IPP Czar Logs for the week 2011.07.04 - 2011.07.10

(Up to PS1 IPP Czar Logs)

Monday : 2011-07-04

  • 07:50 : Roy : Restarted summitcopy, which appeared to have crashed around 2am
  • 15:54 : Bill checked in to see whether the STS.2009.b data was finally done. It turned out that a couple of corrupt files needed repair. Noticed that the distribution pcontrol had grown to 1.3GB in size and was not spawning the pending jobs very quickly. Restarted it.

Tuesday : 2011-07-05

Serge is czar

  • 10:41 Serge: stopping all servers
  • 10:50 Serge: all servers stopped
  • 12:00 Serge: processing resumed
  • 13:15 Serge:
  • 13:34 CZW: Serge noticed that diffs were not being queued. There was a registration fault this morning, and I suspect that fooled nightly_science into setting the nsDiffState to FINISHED_DIFFING. I launched most of the missing diffs with --date 2011-07-05 --verbose --queue_diffs. It looks like some diff inputs aren't ready, so this command will likely need to be run once more (alternatively, a restart of stdscience will reset the state).
  • 15:03 Serge: from the log file /data/ipp041.0/nebulous/be/ef/1063727843.gpc1:IPP-MOPS-TEST:IPP-MOPS-TEST.231034.log: the task has been restarted 4464 times and finished with the error:
    input file does not exist: neb://ipp020.0/gpc1/MD08.nightlyscience/2011/06/30/MD08.V2/skycell.094/SR_MD08.V2.skycell.094.WS.dif.141868.cmf 
    at /home/panstarrs/ipp/psconfig//ipp-20110622.lin64/bin/ line 374
    	main::my_die('input file does not exist: neb://ipp020.0/gpc1/MD08.nightlysc...', 231034, 2) 
            called at /home/panstarrs/ipp/psconfig//ipp-20110622.lin64/bin/ line 190

I tried:

difftool -revertdiffskyfile -dbname gpc1 -skycell_id skycell.094 -diff_id 141868

which had no effect.
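The failure mode above — a job restarted thousands of times because its input file never appears, and a revert that has no effect until it does — can be sketched generically. In this sketch, run_diff is a hypothetical stand-in for the real difftool/ job; the retry counter plays the role of the repeated reverts, and the touch simulates the upstream .cmf input finally arriving.

```shell
# Hedged sketch, not the real tooling: run_diff stands in for the diff job,
# which dies as long as its input .cmf file does not exist.
input=$(mktemp -u)                     # the missing input file (not created yet)
run_diff() { [ -e "$input" ]; }        # stub: succeeds only once the input exists
reverts=0
until run_diff; do
  reverts=$((reverts + 1))             # each pass models one revert + retry
  if [ "$reverts" -ge 3 ]; then
    touch "$input"                     # simulate the upstream input appearing
  fi
done
echo "diff succeeded after $reverts reverts"
rm -f "$input"
```

The point of the sketch is that reverting is harmless but useless until the missing input exists — which matches the observation that the -revertdiffskyfile had no effect.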

  • 15:20: Serge:
  • 16:20: Serge: chip.on/warp.on. 5 exposures have been processed in the last 20 minutes
  • 16:35: Serge: Once again it can be observed that performance drops when the stare hosts are loaded.
  • 17:25: Serge:
  • 19:25: Serge: chip.on/warp.on

Wednesday : 2011-07-06

  • heather restarted summitcopy - there was a chip that wasn't downloading (?). when I restarted it, it looked like it downloaded (checkexp says so, mysql says no)
  • registration is 'not happy' heather is investigating this still
  • burntool is burntooling, no faults since heather started investigating (?)
  • checkexp now reports things are going through registration.
  • 11:10 Serge: MD data were not published to IPP-MOPS-TEST. I suspect it's because of the multiple declarations (e.g. survey.add.publish MD01 MD01.nightlyscience 5 NULL/survey.add.publish MD01 MD01.nightlyscience 1 NULL) in ~ipp/stdscience/input. I deleted the surveys (e.g. survey.del.publish MD01) and added them back with different "labels" (e.g. survey.add.publish MD01 MD01.nightlyscience 5 NULL/survey.add.publish MD01.ds MD01.nightlyscience 1 NULL). I modified ~ipp/stdscience/input accordingly.
  • 11:22 CZW: registration stuck on exposure o5748g0233o: trunk/tool/ identified this as being a bt_state = -1 data_state = check_burntool, suggesting burntool failed to run correctly. Manually fixed with: --exp_id 358617 --class_id XY52 --this_uri neb://ipp033.0/gpc1/20110706/o5748g0233o/o5748g0233o.ota52.fits --previous_uri neb://ipp033.0/gpc1/20110706/o5748g0232o/o5748g0232o.ota52.fits --verbose --dbname gpc1
  • 12:40 heather: registration was stuck again, but this is a pantasks-type fault. The controller status said:
    0    ipp015    RESP   4544.07 --exp_id 358627 --class_id XY52 --this_uri neb://ipp033.0/gpc1/20110706/o5748g0246o/o5748g0246o.ota52.fits --continue 10 --previous_uri neb://ipp033.0/gpc1/20110706/o5748g0245o/o5748g0245o.ota52.fits --dbname gpc1 --verbose 
    but when I checked on ipp015, there was no evidence of burntool. I ran this command by hand, and it worked (I don't know how/where/why it got lost in pantasks). Registration is now continuing.
  • 12:41 Serge: added label for SAS data to stdscience/publishing
  • 14:39 heather restarted stdsci at the request of chris (this should queue yesterday's stacks)
  • 14:49 Serge (in stdscience):
    survey.add.publish SAS.footprint.123 SAS.footprint.123 5 SAS2_for_IPP-MOPS-TEST
    survey.add.publish SAS2.12 SAS2.12 5 SAS2_for_IPP-MOPS-TEST
  • 16:00 heather deleted all the non-nightlyscience labels out of stdsci (LAP.ThreePi.20110621, LAP.ThreePi.test, SAS2.12, SAS.footprint.123)
  • heather forgot to add: some nightlyscience was delayed because some of the detrends were on ipp018 (out), and the other copy did not exist on ippb01 (like it claimed)
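The stuck-registration fix at 12:40 above amounts to: if the job a pantasks controller claims to be running leaves no trace on the target host, re-run it by hand. A minimal sketch of that check, where pidfile stands in for the evidence heather looked for on ipp015 and job_cmd is a hypothetical stand-in for the actual burntool command line:

```shell
# Hedged sketch, not the real pantasks machinery: re-run a job by hand when
# there is no evidence of it on the host the controller claims is running it.
pidfile=$(mktemp -u)                 # path that does not exist: no evidence of the job
job_cmd() { echo "burntool-done"; }  # hypothetical stand-in for the burntool command
if [ ! -e "$pidfile" ]; then
  result=$(job_cmd)                  # no trace of the job on the host: run it by hand
fi
echo "$result"
```

In the real incident the command succeeded when run manually, so the job was lost inside pantasks rather than genuinely failing.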

Thursday : 2011-07-07

  • heather/serge stopped all processing and mysql servers for memory upgrade
  • memory upgrade not performed: only 3/13 memory sticks available
  • heather/serge restarted all processing and mysql servers
  • heather/serge stopped all processing
  • serge took a dump of nebulous
  • serge restarted processing
  • heather cleaned up the czar board. she has no idea how to deal with the 600+ destreak faults.
  • heather added sas/lap labels - there's not much chance of nightlyscience tonight (high winds)

Friday : 2011-07-08

  • 10:03 Serge: stdscience
    pantasks: del.label SAS2.12
    pantasks: del.label SAS.footprint.123
    pantasks: del.label LAP.ThreePi.20110621
  • 10:45 Serge stopped stdscience to try to fix corrupted files with Bill's scripts while Heather investigates distribution
  • 11:00 heather discovered the destreaks are in state 'failed_revert' - following bill's example here:
  • 11:00 heather set destreak off, and did magicdstool -clearstatefaults -dbname gpc1 -label ThreePi.nightlyscience -set_state new -state failed_revert to set the destreaks to new. do they revert?
  • 11:00 yes! they are reverting! (128 of 532 so far)
  • 11:01 bill also reverted the destreaks, using the same method, and he also did the other labels (and update)
  • 11:53 heather restarted distribution - it wasn't queueing up more destreaks (why?). It's still not queueing; heather is investigating....
  • 13:39 all but 2 files have destreaked
  • 13:55 From distribution log:
    • In /data/ipp030.0/nebulous/db/a3/1077314227.gpc1:destreak:ThreePi.nightlyscience:358533:warp:358533.mds.revert.581401.219054.skycell.2065.017.log
      /data/ipp030.0/nebulous/ef/cd/1074937924.gpc1:ThreePi.nt:2011:07:06:o5748g0156o.358533:o5748g0156o.358533.wrp.219054.skycell.2065.017.fits is not a destreaked file
      /data/ipp030.0/nebulous/23/d5/1075327019.gpc1:ThreePi.nt:2011:07:06:o5748g0156o.358533:SR_o5748g0156o.358533.wrp.219054.skycell.2065.017.fits is not a destreaked file
      both files appear to not be destreaked
      original: neb://ipp030.0/gpc1/ThreePi.nt/2011/07/06//o5748g0156o.358533/o5748g0156o.358533.wrp.219054.skycell.2065.017.fits
      backup:   neb://ipp030.0/gpc1/ThreePi.nt/2011/07/06//o5748g0156o.358533/SR_o5748g0156o.358533.wrp.219054.skycell.2065.017.fits
    • In /data/ipp030.0/gpc1/destreak/ThreePi.nightlyscience/358533/warp/358533.mds.revert.581401.219054.skycell.2065.017.log
      /data/ipp030.0/nebulous/ef/cd/1074937924.gpc1:ThreePi.nt:2011:07:06:o5748g0156o.358533:o5748g0156o.358533.wrp.219054.skycell.2065.017.fits is not a destreaked file
      Running [/home/panstarrs/ipp/psconfig/ipp-20110622.lin64/bin/isdestreaked /data/ipp030.0/nebulous/23/d5/1075327019.gpc1:ThreePi.nt:2011:07:06:o5748g0156o.358533:SR_o5748g0156o.358533.wrp.219054.skycell.2065.017.fits]...
      /data/ipp030.0/nebulous/23/d5/1075327019.gpc1:ThreePi.nt:2011:07:06:o5748g0156o.358533:SR_o5748g0156o.358533.wrp.219054.skycell.2065.017.fits is not a destreaked file
      both files appear to not be destreaked
      original: neb://ipp030.0/gpc1/ThreePi.nt/2011/07/06//o5748g0156o.358533/o5748g0156o.358533.wrp.219054.skycell.2065.017.fits
      backup:   neb://ipp030.0/gpc1/ThreePi.nt/2011/07/06//o5748g0156o.358533/SR_o5748g0156o.358533.wrp.219054.skycell.2065.017.fits

According to Gene, the files don't seem to have been swapped (or they were swapped and have since been unswapped)
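The revert log above is making a simple decision: it runs isdestreaked on both the original and the SR_ backup, and if neither carries the destreak marker there is nothing to swap back. A sketch of that logic, where is_destreaked is a stub standing in for the real isdestreaked binary and the marker string is purely illustrative:

```shell
# Hedged sketch of the destreak-revert decision: is_destreaked is a stub
# stand-in for the real isdestreaked binary; DESTREAKED is an illustrative marker.
is_destreaked() { grep -q DESTREAKED "$1"; }   # stub check for the destreak marker
orig=$(mktemp); backup=$(mktemp)               # neither temp file has the marker
if ! is_destreaked "$orig" && ! is_destreaked "$backup"; then
  verdict="both files appear to not be destreaked"
fi
echo "$verdict"
rm -f "$orig" "$backup"
```

This matches Gene's reading: if both copies test clean, either the swap never happened or it was already undone, and re-running the revert will keep hitting the same branch.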

  • 14:00 Serge tries to fix publish issues: Since 'neb://ipp032.0/gpc1/ThreePi.nt/2011/07/06/RINGS.V0/skycell.1210.116/SR_RINGS.V0.skycell.1210.116.dif.143513.cmf' does not exist, let's create it:
    perl --diff_id 143513 --skycell_id skycell.1210.116
  • 14:04 Serge (continued)
    pubtool -revert -dbname gpc1 -pub_id 233273
  • 14:10 Serge (end): no effect. I stop.
  • 14:40 Serge has added back the three labels removed at 10:03 and restarted stdscience
  • 14:50 Serge stopped stdscience to try:
    perl ./ --chip_id 254088 --class_id XY26
  • 15:02 Serge restarted stdscience (because the previous command worked!)
  • 15:55 CZW: stopped replication due to nebulous concern.
  • 17:18 CZW: dropped pstamp requests for raw data that were clogging the postage stamp server:
    pst -updatereq -req_id 93751 -set_state drop
    pst -updatereq -req_id 93750 -set_state drop
    pst -updatereq -req_id 93749 -set_state drop
  • 17:29 CZW: shutdown stdscience and restarted fresh to attempt to unstick remaining SAS y/z warps and diffs. Seems to have taken, and the jobs are completing now.
  • 17:34 CZW: And added their labels back into distribution so they can be magicked/destreaked/stacked.
  • 18:13 CZW: Merge update from trunk to tag that uses glockfile to attempt to force NFS to write data to disk during neb-replicate. This should resolve the nebulous issue, and I have restarted shuffle. We will need to look on Monday to see how many files replicated over the weekend are corrupt/zero byte/completely missing.
  • 21:40 Serge: ippdb02 is still ingesting the dump (instance table). Let's wait until tomorrow.
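The glockfile change merged at 18:13 can be sketched with standard tools: hold an exclusive lock for the duration of the copy and flush the destination to disk before releasing it, so a reader on another NFS client cannot see a half-written replica. flock and sync below are generic stand-ins for glockfile and the neb-replicate internals; the temp files stand in for the real nebulous replica paths.

```shell
# Hedged sketch of the lock-around-copy idea, using flock/sync as stand-ins
# for glockfile and the neb-replicate write path.
src=$(mktemp); dst=$(mktemp)
echo "replica payload" > "$src"
(
  flock -x 9                 # exclusive lock held for the whole copy
  cp "$src" "$dst"
  sync "$dst"                # force the written data out to disk before unlocking
) 9> "$dst.lock"
cmp -s "$src" "$dst" && status="copy verified"
echo "$status"
rm -f "$src" "$dst" "$dst.lock"
```

The verification step at the end is the weekend check mentioned above in miniature: compare source and destination to catch corrupt or zero-byte replicas.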

Saturday : 2011-07-09

  • 00:49 CZW: an imfile failed to copy correctly with a CRASH instead of a fault. Manually re-ran the command: --uri --filename neb://ipp026.0/gpc1/20110709/o5751g0034o/o5751g0034o.ota43.fits --summit_id 354868 --exp_name o5751g0034o --inst gpc1 --telescope ps1 --class chip --class_id ota43 --bytes 49432320 --md5 cd7bf8fe865254b7d40d83d79e72e84c --dbname gpc1 --timeout 600 --verbose --copies 2 --compress --nebulous
  • 14:18 heather reverted a few thing (camera stage)
  • 16:40 Serge: nebulous replication activated on ippdb02
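The --md5 argument in the re-run command above implies a post-copy check: recompute the checksum of the copied file and compare it with the value the summit reported. A minimal sketch, with an illustrative payload and temp path standing in for the real FITS file:

```shell
# Hedged sketch of the md5 verification implied by --md5; payload and paths
# are illustrative stand-ins for the copied FITS file.
f=$(mktemp)
printf 'pixel data' > "$f"                                # stand-in for the copied file
expected=$(printf 'pixel data' | md5sum | cut -d' ' -f1)  # stand-in for the summit-reported md5
actual=$(md5sum "$f" | cut -d' ' -f1)
if [ "$actual" = "$expected" ]; then
  check="copy verified"
else
  check="md5 mismatch; re-queue the copy"
fi
echo "$check"
rm -f "$f"
```

A CRASH rather than a fault means the copy died before this comparison could run, which is why re-running the command by hand was enough to recover.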

Sunday : 2011-07-10