PS1 IPP Czar Logs for the week 2011-07-11 - 2011-07-17

Monday : 2011-07-11

  • 17:00 Serge: o5753g0178o stuck at warp stage. I ran:
    perl runwarpskycell.pl --warp_id 220303 --skycell_id skycell.2615.054
    
  • 17:04 Serge: o5753g0096o stuck at fake stage. pantasks.stdout.log reported a crash. I ran the same command again and it seems to be working:
    fake_imfile.pl --exp_id 360038 --fake_id 217175 --class_id XY42 \
       --chiproot=neb://ipp026.0/gpc1/MD07.nt/2011/07/11//o5753g0096o.360038/o5753g0096o.360038.ch.255145 \
       --camroot=neb://any/gpc1/MD07.nt/2011/07/11//o5753g0096o.360038/o5753g0096o.360038.cm.232378 --camera GPC1 \
       --outroot neb://ipp026.0/gpc1/MD07.nt/2011/07/11//o5753g0096o.360038/o5753g0096o.360038.fk.217175 --dbname gpc1 --verbose
    

Tuesday : 2011-07-12

  • 08:00: Roy: Everything downloaded from summit, but quite a few faults. Will investigate...
  • 09:55: Serge: Manually reverted failed publish to IPP-MOPS-TEST -> pubtool -revert -dbname gpc1 -client_id 5
  • 10:03: Roy: stdscience died. No idea why. Restarted.
  • 14:30: Roy: ipp026 load up to 130....

Comment from Serge: From http://ipp004.ifa.hawaii.edu/clusterMonitor/top-20110712T1500.html the load can clearly be attributed to NFS.
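
A general way to confirm that load is NFS-bound (not specific to this snapshot) is to count processes stuck in uninterruptible sleep (state D), which on these nodes usually means they are waiting on NFS I/O:
    # count D-state (typically I/O / NFS wait) processes by command name
    ps -eo state,comm | awk '$1 == "D" {print $2}' | sort | uniq -c | sort -rn
    # NFS client call / retransmission statistics
    nfsstat -c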

  • 14:55: Serge: Ran a bunch of commands to unstick MOPS data:
    difftool -revertdiffskyfile -fault 2 -label ThreePi.nightlyscience -dbname gpc1
    warptool -revertwarped -fault 4 -label ThreePi.nightlyscience -dbname gpc1
    pubtool -revert -dbname gpc1 -client_id 5
  • 15:00 heather restarted the stack pantasks because no LAP.ThreePi stacks were getting queued. They still aren't, and further investigation (through labeltool) shows that the labels are not in the active state. The labels are loaded in stdscience, but they won't be processed until their state changes to active.
  • 15:05: Roy: Stopped all pantasks servers
  • 15:11 Serge: diff log complains about missing warp files. I tried: perl runwarpskycell.pl --warp_id 220738 --skycell_id skycell.1996.132
  • 15:25: Gene: Rebooted ipp026
  • 15:30: Roy: ipp026 back up. Started all pantasks servers
  • 15:55: Roy: attempted to revert destreaks with:
    magicdstool -clearstatefaults -state failed_revert -set_state new -label 'Threepi.nightlyscience' -dbname gpc1

Didn't seem to do anything. Followed instructions here instead. They seem to be clearing.

  • 16:30: Roy: warps have finally been reverted and are being processed again.
  • 18:00 heather reverted stacks - it says fault 4 but I don't believe it.
  • 18:00 heather neb-rm -m'd gpc1/ThreePi.nt/2011/07/12/o5754g0471o.360705/o5754g0471o.360705.wrp.220745.skycell.1320.071.log, which had a nebulous entry but no corresponding file.

Wednesday : 2011-07-13

  • 07:15: Roy: Only 88 of 496 science exposures downloaded from last night, even though summitcopy is apparently running. Unfinished postage stamp requests continue to grow...
  • 08:11: CZW: regpeek.pl suggested that o5755g0120o/XY43 was stuck. Reverted using regtool -updateprocessedimfile -exp_id 360887 -class_id XY43 -set_state pending_burntool. Watching pantasks showed continuing errors, and attempts to read the burntool log (gpc1/20110713/o5755g0120o/o5755g0120o.ota43.burn.log) resulted in nebulous unavailable instance errors. I didn't check if there was a nebulous host down, but simply moved the log and table (also unavailable) to FILE.bak to allow burntool to write to that location. Exposures seem to be registering now.
  • 08:30: Roy: Gene rebooted ipp026. It had been down for 10 hours. Taken out of neb allocate.
  • 13:56: CZW: Allowed the replication pantasks to resume shuffling data to the ATRC. Tests on the newly created zero-byte files show that the shuffle code creates a third copy of the file, fails to replicate it correctly (resulting in the zero-byte file), and then exits, preserving both original copies. Although this means we are creating spurious files at a 0.4% rate, we are not damaging any of the data or reducing our redundancy (a check for such zero-byte files is sketched at the end of today's entries).
  • 18:00 heather found that the mask files gpc1/detrend/GPC1.MASK.20101215/GPC1.MASK.20101215.XY62.fits and gpc1/detrend/GPC1.MASK.20101215/GPC1.MASK.20101215.XY22.fits only had one copy each, and they were not accessible. She fixed this and is going to set them to have multiple copies. This should clear up the faults for ThreePi.nightlyscience chips.
  • 20:20 heather: one of the linearity files (gpc1/detrend.20100817/linearity/linearity_data.XY30.fits) has copies on ippb02 (a backup machine) and on ipp026. The copy on ippb02 is 0 bytes. The copy on ipp026 is inaccessible because the machine is down (again). This is blocking at least 1 chip in ThreePi.nightlyscience.
  • 22:00 heather: gene rebooted ipp026 and is investigating ipp051, which had an insanely high load
  • 22:00 heather fixed the linearity file above - it now has 3 copies (all non-zero). Heather neb-replicated gpc1/flatcorr.20100124/GPC1.FLATTEST.303/GPC1.FLATTEST.303.XY64.co.fits because both of its copies were on ipp026 and ipp051
  • 23:14 heather moved summitcopy off of ipp051 to ipp050. ipp051 still has a stupidly high load; heather doesn't know why
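
A minimal sketch for spotting the zero-byte copies mentioned above (the shuffle's spurious third copies and the empty detrend copy); the path is only an example and would be pointed at whichever data volume is being checked:
    # list zero-byte files under a data volume (path is illustrative)
    find /data/ipp026.0/gpc1 -type f -size 0 -print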

Thursday : 2011-07-14

  • 07:00 Gene stopped replication.
  • 09:20 Serge: Processing is remarkably inefficient. As usual, I have no idea why.
  • 09:45 Serge: I added LAP.ThreePi.20110621 to the publishing labels.
  • 10:15 Heather: some of the failed chips have some neb files in a funny state. Heather's sorting this out.
  • 11:08 heather found that ippc23 can't access /data/ipp026.0. She emailed ipp-dev. She's had bad luck unsticking NFS...
  • 12:25 Serge: I set the priority of MD09 to 405 (for MOPS).
  • 12:45 Serge: Repeated failures for cam_id 233235 because of a missing file. I ran perl runchipimfile.pl --chip_id 256006 --class_id XY26
  • 14:56 heather: gene unstuck a few NFS mounts; heather went on a killing spree, killing old ipp processes that had been running since Jun/Jul (a listing sketch is at the end of today's entries)
  • 15:08 heather: cam_id 233235 is failing because of neb's there_can_be_only_one on gpc1/ThreePi.nt/2011/07/13//o5755g0332o.361098/o5755g0332o.361098.cm.233235.trace. Moved the offending file out of the way.
  • 18:00 Mark: MD10 setup for the V3 tessellation in nightlyscience if observed tonight/weekend. WSdiff and SSdiff for MD10 disabled in stdscience processing until refstack finishes.
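
A rough sketch of how the stale processes from the 14:56 killing spree could be listed; the "ipp" account name is an assumption, and an etime containing '-' means the process has been up for a day or more:
    # etime format is [[dd-]hh:]mm:ss, so a '-' marks processes at least a day old
    ps -eo pid,user,etime,args | awk '$2 == "ipp" && $3 ~ /-/'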

Friday : 2011-07-15

  • 11:40 Serge: One of the MD09 chips is repeatedly failing:
    chiptool -dbname gpc1 -updateprocessedimfile -chip_id 256988 -class_id XY17 -fault 0 -set_quality 42
    
  • 12:06 Serge (11:40 entry continued). On Gene's request (a verification query is sketched at the end of today's entries):
    update warpRun set state="keep" where warp_id=222184;
    
  • 12:33 Serge: labeltool -updatelabel -label MD10.nightlyscience -set_priority 405 -dbname gpc1
  • 18:23 heather: 6 of the warps are repeatedly failing; suspect a bad camera-stage file, using runcameraexp.pl (cam_id 233608) to recover.
  • 22:00 Mark: added MD10.V3 y-band reprocessing of last year's data through warp under MD10.GR0 label.
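
A quick check that the manual warpRun update from 12:06 took effect, assuming the usual mysql client access to the gpc1 database:
    mysql gpc1 -e 'select warp_id, state from warpRun where warp_id = 222184;'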

Saturday : 2011-07-16

Sunday : 2011-07-17