
Monday : 2012.10.22

Mark is czar

  • 00:00 MEH: after rebooting ipp023, restarted summitcopy, registration, and stdscience for a clean start. Ran regtool -revertprocessedimfile -dbname gpc1.
  • 01:30 MEH: watching processing; summitcopy/registration keeps getting stalled. ipp018 is often not responding for the replication stage, so putting it into repair (too many hosts trying to replicate to ipp018 with so many red disks?); repair step sketched below. This has been a problem for a while, with
    INFO: task mount.nfs:29470 blocked for more than 120 seconds.
    
    • the problem now seems to have moved to ipp017, but it is handling it better. ipp018 may want a reboot (it has been ~23 days since the last one).
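    • For reference, a minimal sketch of the repair step and the check for the stall. The neb-host form follows the "neb-host ... down" usage later in this entry; the "repair" state name is an assumption based on the wording above.
      neb-host ipp018 repair                                # stop new replication to ipp018 (assumed syntax)
      dmesg | grep "blocked for more than 120 seconds"      # the hung-task messages quoted above show up here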
  • 07:40 oddly, manually running regtool was needed here, when this has been automatic in the past (and sometimes still is). Has something changed in the related code? Is there a limited number of retries before it gives up?
    regtool -updateprocessedimfile -exp_id 536377 -class_id XY11 -set_state pending_burntool -dbname gpc1
    
  • 08:45 Bill restarted pstamp and update pantasks in order to have clear success and failure counts.
  • 09:15 MEH: ipp023 down again, same problem as last night. Rebooting and taking it out of processing: Ipp023-crash-20121022
  • 10:25 MEH: nightlyscience 99% through (finished downloading ~09:20)
  • 12:15 MEH: Bill changed the retention for M31+STS for the MPG folks; once nightlyscience finishes, need to run make install in src/ipp-20120802/ippconfig/recipes/ (shutting down and restarting all pantasks); install step sketched below.
    • this may strain the available disk space on the 20 TB machines; only 8 of 30 aren't in repair or red.
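    • A minimal sketch of the recipe install step, assuming the relative build-area layout implied by the path above:
      cd src/ipp-20120802/ippconfig/recipes/   # path from the note above, relative to the ipp build area
      make install                             # install the updated retention recipes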
  • 12:45 MEH: Chris added a check for the registration burntool stalling condition to ippScripts. Pantasks stopped (mostly distribution, pstamp only active) and only ippScripts rebuilt.
  • 12:50 MEH: reboot testing ipp046 (the BIOS battery replacement hasn't been done yet): does it boot properly on a power cycle now? No. Sending email to Hayden, Gavin, and Rita to ask the MHPCC people to look into it.
  • 13:25 MEH: reactivating the LAP label
    labeltool -dbname gpc1 -updatelabel -label LAP.ThreePi.20120706 -set_active
    
  • 14:40 MEH: Rita reported a loss of AC for the ippbXX nodes, so shutting them down until it is repaired tomorrow.
    • replication pantasks stopped
    • neb-host ippb00 down, etc., for all 4 nodes (loop sketched below)
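    • A sketch of that shutdown; the neb-host form is the one quoted above, and the host names beyond ippb00 are assumed from the "all 4" note:
      for h in ippb00 ippb01 ippb02 ippb03 ; do
          neb-host $h down    # take each backup node out of nebulous until the AC is back
      done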
  • 14:50 MEH: many LAP faults; turning all reverts off (as sketched below) to avoid wasting cycles on the same faults until they are fixed (there are many of them) or nightly science starts.
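    • Sketch of the revert toggles, typed at the stdscience pantasks prompt. Only chip.off and chip.revert.on appear elsewhere in this log; the .off form and the other stage names are assumptions by analogy:
      chip.revert.off
      # and likewise for the other stages (camera, warp, diff), assuming they expose the same toggle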
  • 15:10 MEH: adding deepstack's compute3 hosts into pstamp for a 100% increase in nodes, to help move through the MOPS request.
  • 17:00 MEH: something is spiking the load >150 on the stsci nodes; skycell_jpeg is very backlogged.
  • 19:00 MEH: some 43 camera-stage distribution faults (fault 2) from 10/16; reverted and cleared with
    disttool -revertrun -dbname gpc1 -fault 2 -label ThreePi.nightlyscience
  • 22:30 MEH: checked on exposures o6218g0006o--o6218g0054o reported from OTIS as missing DQstats; they all appear to be processed and present on the datastore from 10/18.

Tuesday : 2012.10.23

Mark is czar

  • 07:00 MEH: only 346 exposures, mostly finished. no registration hangups last night.
  • 10:30 MEH: looking at the IPP->OTIS Problem Exposure Notification from 10/17-10/18. Todd notes there was a FITS formatting anomaly [Header missing END card.] for those, but the following days are okay. Manually loading dqstats.9999.fits with pyfits 2.0.1dev428 locally seems okay (quick check sketched below).
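    • A one-liner version of that manual check; pyfits.open/verify/info are standard pyfits calls and the filename is the one quoted above:
      python -c "import pyfits; h = pyfits.open('dqstats.9999.fits'); h.verify('exception'); h.info()"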
  • 11:40 MEH: ippbXX machines back online now that the AC is fixed at ATRC. ippb00.0/1 and ippb03 were all in repair; keeping ippb03 in repair but putting ippb00.0/1 up. The ippb02 volumes are very full, so setting them all to repair.
  • 11:50 MEH: working on fixing faulted LAP chips
    -- regenerated burn.tbl files using ipp-trunk/tools/fixburntool. Important to note: the replicated versions of these end up on the stsci nodes, as nightly science sometimes does as well.
    neb://ipp034.0/gpc1/20100626/o5373g0426o/o5373g0426o.ota57.burn.tbl
    neb://ipp034.0/gpc1/20100626/o5373g0466o/o5373g0466o.ota57.burn.tbl
    neb://ipp034.0/gpc1/20100626/o5373g0445o/o5373g0445o.ota57.burn.tbl
    neb://ipp034.0/gpc1/20100626/o5373g0449o/o5373g0449o.ota57.burn.tbl
    neb://ipp036.0/gpc1/20100917/o5456g0089o/o5456g0089o.ota67.burn.tbl
    neb://ipp036.0/gpc1/20100917/o5456g0083o/o5456g0083o.ota67.burn.tbl
    neb://ipp040.0/gpc1/20100530/o5346g0451o/o5346g0451o.ota14.burn.tbl
    
    -- the two versions of this file have different md5sums but the same size. Moved the replicated version to .maybebad and copied the main version over the replicated one (sketch below).
    neb://ipp036.0/gpc1/20120806/o6145g0438o/o6145g0438o.ota55.fits
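    -- hedged sketch of that fix; md5sum/mv/cp are the only tools used, and the /path/... locations are placeholders for the resolved nebulous instances of the file above
    md5sum /path/to/main/o6145g0438o.ota55.fits /path/to/replica/o6145g0438o.ota55.fits       # confirm the mismatch
    mv /path/to/replica/o6145g0438o.ota55.fits /path/to/replica/o6145g0438o.ota55.fits.maybebad
    cp /path/to/main/o6145g0438o.ota55.fits /path/to/replica/o6145g0438o.ota55.fits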
    
    
  • 13:45 MEH: preparing for a full restart of pantasks and a rebuild with Chris' change to ippconfig to start cleaning warps. stdscience is due for its regular restart anyway.
    • an excessive number of warps (2500), so chip.off for a while
  • 14:45 MEH: bringing mysql back up on the many (16) recently rebooted machines so deneb-locate can be used for fixing LAP chips. Doing the wave1 set (ipp010, ipp011, etc.) slowly, watching for excessive load/problems from the updatedb and crontab scripts that could crash them; per-host form sketched after the commands below.
    sudo /etc/init.d/mysql zap
    sudo /etc/init.d/mysql start
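    • Done one host at a time over ssh, e.g. (only ipp010/ipp011 are named above; the rest of wave1 follows the same pattern, and ssh/sudo access is assumed):
      ssh ipp010 "sudo /etc/init.d/mysql zap && sudo /etc/init.d/mysql start"
      ssh ipp011 "sudo /etc/init.d/mysql zap && sudo /etc/init.d/mysql start"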
    
  • 15:10 MEH: working on fixing missing LAP chips now
    -- 0-sized files; found an orphan version to replace each (replacement sketched at the end of this block)
    neb://ipp031.0/gpc1/20100821/o5429g0219o/o5429g0219o.ota43.fits
    neb://ipp027.0/gpc1/20101123/o5523g0192o/o5523g0192o.ota26.fits
    neb://ipp007.0/gpc1/20100628/o5375g0494o/o5375g0494o.ota05.fits
    neb://ipp029.0/gpc1/20100627/o5374g0426o/o5374g0426o.ota33.fits
    neb://ipp053.0/gpc1/20100627/o5374g0450o/o5374g0450o.ota73.fits
    
    -- corrupted but recovered
    neb://ipp025.0/gpc1/20100627/o5374g0344o/o5374g0344o.ota76.fits
    
    -- corrupted
    neb://ipp036.0/gpc1/20120806/o6145g0438o/o6145g0438o.ota55.fits
    
    chiptool -dropprocessedimfile -set_quality 42 -chip_id 648294   -class_id XY55 -dbname gpc1
    regtool -updateprocessedimfile -set_ignored -exp_id 505899 -class_id XY55 -dbname gpc1
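    -- sketch of the 0-sized-file replacement mentioned above; the paths are placeholders for the empty registered instance and the orphan copy found on disk
    ls -l /path/to/registered/o5429g0219o.ota43.fits    # 0 bytes
    ls -l /path/to/orphan/o5429g0219o.ota43.fits        # full-size orphan copy
    cp /path/to/orphan/o5429g0219o.ota43.fits /path/to/registered/o5429g0219o.ota43.fits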
    
  • 16:10 MEH: stdscience seemed to stall: all running jobs cleared out and no new ones were loaded, but poll reported plenty of jobs available. Did someone do a book reset or something?
  • 16:30 chip.off until nightly science starts or something breaks.
  • 19:00 MEH: like last night, adding ipp054-059 to summitcopy, as Gene mentioned a few days ago, to bring the number of download nodes back to 30 and help the download rate.
  • 23:35 MEH: stdscience crashed.. restarted.
    [2012-10-23 22:23:19] pantasks_server[4489]: segfault at 90 ip 0000000000407d9c sp 000000004139ef50 error 6 in pantasks_server[400000+13000]
    

Wednesday : 2012.10.24

Bill is czar today

  • 07:30 Wow! last night appears to have gone very smoothly, at least after the stdscience restart. All nightlyscience appears to be done. Recovered three missing raw files with deneb-locate.py and rebuilt one burntool table.
  • 11:00 MEH: going through the past October czar logs and adding missing reboot/problem notes to Production_Cluster
    • 12:00 also for September
  • 16:00 restarted stdscience pantasks

Thursday : 2012.10.25

Bill is czar today

  • 10:01 recovered some "lost" raw files and regenerated a couple of lost burntool tables
  • 19:05 reran --chip_id 654849 --class_id XY55, which had left a corrupt file behind
  • 19:37 rebooted ipp051, which had deadlocked NFS mounts with ipp056. Stopping stdscience to clear out the deadlocked jobs.
  • cleanup has run out of things to do
  • 19:46 restarted summitcopy, registration, and shortly afterwards stdscience, in order to clear out faults due to the ipp051 outage
  • 20:10 everything looks stable. Regarding what Mark did a few nights ago ("adding ipp054-059 to summitcopy"): should the hosts list be updated?

Friday : 2012.10.26

Serge is czar

  • 06:45 (Serge): Nightly processing finished but for one exposure
  • 12:52 (Bill) Turned off chip processing in preparation for changing to the new ipp-20121026 tag.
  • 14:04 (Bill) Tag ipp-20121026 has been installed and all pantasks have been restarted. Before the restart I ensured that all runs in progress were completed with the old tag. The ids of the last runs processed with ipp-20120802 are
    chip   660276
    camera 637103
    warp   615859
    diff   328156
    stack 1674519
  • 14:20 (Bill) started the deepstack pantasks and queued staticsky runs in the 5 filters for SAS_v10 to run there. Set one batch of compute3 hosts to off in the stack pantasks.
  • 18:41 (Bill) turned chip.revert.on for nightly science processing.
  • 20:00 (Bill) skycalibration runs are now running out of deepstack. They seem to be stressing the NFS servers holding the dvo catalogs (failure-to-lock errors). Reducing the poll limit to 10; they will still be done by morning.

Saturday : 2012.10.27

  • 13:15 MEH: stdscience running half loaded, time for regular restart
    • chip.off for a bit to push warps through for stacks

Sunday : 2012.10.28

  • 10:35 MEH: stdscience running barely half loaded, doing regular restart