PS1 IPP Czar Logs for the week 2015.07.27 - 2015.08.02

(Up to PS1 IPP Czar Logs)

Non-standard Processing

In an effort to improve communication, HAF thinks we should list additional (non-standard) processing in the czar pages so that we all know what's going on (no requirement for others to help run it). HAF will start:

HAF:

  • ipp054 - ipp081: addstar processing for full force / diff. If you notice problems and need to reboot / stop this, please do the following:
    • contact HAF via email or phone (all should have my #)
    • if you feel addstar needs to be stopped and can't reach her:
             ssh ippdvo2@ippc19
             cd addstar.ipp054 (replace with machine that is problematic)
             pantasks_client
             stop
      
      • and then check status: wait for the addstar run to finish and for minidvodb.premerge to finish (otherwise you risk corrupting the mini DVO db) -- see the sketch at the end of this list
  • ~stsci03 is currently running ipptopsps for SAS39 FW. This is fairly resilient - no need to stop it if there is a problem. You can access it via:~ no longer running
             ssh ippdvo2@ippc19
             screen -r fw
    
    • you can Ctrl-C to stop it if necessary (ipptopsps really doesn't care; it gracefully dies off if gpc1/nebulous die off, for example)
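  • For reference, a minimal sketch of the full stop-and-check sequence for the addstar pantasks described above (machine name is an example; exact status output depends on the local pantasks setup):
             ssh ippdvo2@ippc19
             cd addstar.ipp054      # replace with the machine that is problematic
             pantasks_client
             stop
             status
    After the stop, repeat status until the running addstar job and minidvodb.premerge have finished before doing anything with the host.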


Monday : 2015.07.27

  • HAF restarted summitcopy - shoved aside the summitExp faults (I think -- these are darks, and checkexp and others look "happy" now):
    delete from summitExp where fault = 110 and summit_id > 945000 and imfiles is null;
    
    update summitExp set fault = 0 where fault = 110 and summit_id > 945000 ;
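  • A quick follow-up check (a sketch, assuming command-line mysql access to the gpc1 database with the usual connection options) to confirm no fault-110 rows remain in that summit_id range:
    mysql gpc1 -e "select count(*) from summitExp where fault = 110 and summit_id > 945000;"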
    

Tuesday : 2015.07.28

  • 14:00 HAF : daily restart of pantasks

Wednesday : 2015.07.29

  • 19:35 EAM : daily restart of pantasks

Thursday : 2015.07.30

  • 04:40 Bill : reduced priority of ps_ud_MOPS label below nightly science labels since there seem to be a lot of requests pending
  • 04:45 gpc1 progress is quite slow. gpc2 is using the cycles again.
  • 8:26 HAF: registration is borked, download is slow.. I kicked registration
        regtool -updateprocessedimfile -exp_id 950693 -class_id XY54 -set_state pending_burntool -dbname gpc1
        regtool -updateprocessedimfile -exp_id 950693 -class_id XY15 -set_state pending_burntool -dbname gpc1
    
  • 10:12 HAF: kicked registration - this is self preservation to make sure my stuff is working.. I am not the czar
     regtool -updateprocessedimfile -exp_id 950770 -class_id XY72 -set_state pending_burntool -dbname gpc1
    
    
  • 14:45 CZW: started 10x check script on stare04 (to keep it out of the way) that is scanning lists of raw image prototypes and checking that we have two copies on the cluster. If not, it replicates copies to fix this. This is a pre-emptive scan to ensure that after shipping stsci nodes, we have two copies of the raw data.
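  • Roughly the kind of per-prototype check involved looks like the loop below (a sketch only: neb-locate is used here as a stand-in for whatever nebulous query the script actually uses to count instances, and the replication step itself is omitted; raw_prototype_keys.txt is a hypothetical input file):
    # one nebulous key per line in raw_prototype_keys.txt
    while read key; do
        ncopies=$(neb-locate "$key" | wc -l)    # assumed to print one line per existing instance
        if [ "$ncopies" -lt 2 ]; then
            echo "needs replication: $key ($ncopies copies)"
        fi
    done < raw_prototype_keys.txt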

Friday : 2015.07.31

  • 00:00 MEH: only 4 data nodes for nightly processing and one is nearly full.. removing WS diffs until morning as they use a lot of space
    • Ken has said gpc1 has priority over gpc2, and gpc1 is >100 behind; removing gpc2 from summitcopy and stdsci until morning
    • there are no notes for ipp073 being in repair.. rsync doesn't seem to be running on ipp074, and it's unclear why it was put into repair, as was ipp079 per an email from Bill days ago.. unclear whether rsyncs are still going to use these hosts or not, and if not, why they haven't been put back up. a past email noted ippb06 will be the target for rsync, but that really needs to be logged on the czar pages
    • ipp082 needs its BBU.. that also seems to have been ignored. it may be a better target for rsyncs
  • 00:37 MEH: ipp071 is not actually down; it needed gmond restarted as it is getting clobbered
  • 03:23 Bill: summitcopy was not running any jobs so I restarted the pantasks. I'm not sure what was going on. Maybe it had something to do with removing the gpc2 database.
    • It is running now. 195 incomplete downloads. chip processing just ran out of work to do a few minutes ago
  • 03:56 bill: cleared curve of growth fault for -warp_id 1609413 -skycell_id skycell.1203.064
  • 07:30 MEH: two more data nodes are red and nearly full, back down to 5 from the 7 we were raised to last night.. another one is close to red
  • 08:40 MEH: summitcopy mostly done with the last 3PI images and just darks; adding the gpc2 db back in to continue downloading those now
    • had to be removed again: gpc1 downloads stalled at ~52/60, jobs were pending but never loaded; gpc2 had a similar issue at ~43/46
    • gpc1 quickly cleared and didn't stall like last night -- added gpc2 back in
  • 08:55 MEH: camera stage is the bottleneck but has been doing better -- adding the WS label in for a bit so we don't run out of things to do, rather than adding more camera jobs with gpc2 until gpc1 has cleared
  • 10:00 MEH: top priority OSS gpc1 is finished, once 3PI camera clears then add back gpc2 to stdsci. ~100 gpc2 exposures left to download
    • second priority gpc1 SNIa finished, and WSdiff will probably take >3 hrs still --
    • gpc2 data also likely >3hrs
  • 10:18 MEH: ipp073 gmond down again -- suspect this is the problem, per the logs below -- it isn't in stdsci processing, but wonder if it's causing addstar faults
    [2015-07-31 10:03:31] VFS: file-max limit 19809059 reached
    
    Jul 31 10:06:23 ipp073 nrpe[7329]: Network server accept failure (23: Too many open files in system)
    Jul 31 10:06:23 ipp073 nrpe[7329]: Cannot remove pidfile '/var/run/nrpe/nrpe.pid' - check your privileges.
    
    • Heather reports no addstar faults, so it's not to the point of blocking logins and such. will wait until nightly finishes, since the data there predates the disk filling up and it is still serving that data fine
  • 12:55 MEH: nightly processing getting clobbered with faults.. did something get turned on?
    • need to wait for MOPS to finish getting stamps before restarting ipp073 (and ipp071?), so try to get more WS diffims finished
  • 13:10 MEH: looks like roboczar crashed and has been stopped all week.. will restart it later to avoid messages
  • 14:09 MEH: also noticed ipp038 has a full load like ipp039, but unlike ipp039 it only has 1 CPU, as noted by Haydn on 7/6... it needs a correspondingly lighter load and/or should not be running dvomerge alongside regular processing there.. -- with the newer host grouping, guess we will just remove it entirely
    • probably want to keep it in summitcopy and registration however, or replace it with another node to keep the rate up
  • 14:55 MEH: just stdsci back on to finish last night's data while working out console access to ipp073, ipp071
  • 16:32 MEH: ipp071,076,080 rebooted -- stdsci back on w/ gpc2, cleanup doing diffs only to make space ASAP for tonight
    • summitcopy+registration stopped until gpc2 finishes, to reduce disk loads -- only 4 nodes with reasonable space right now
  • 18:28 MEH: it was decided that ipp082, w/o a BBU but with the write cache on, is now neb-host up for data (don't kill it..)
    • after a bit the load jumped and wait CPU was >50% and climbing.. back to repair -- 2x dvomerge is the only thing using it now and the load is still high..
    • by 2100 finally back to normal load.. without being able to pause/stop dvomerge, there's not much we can do..
  • 18:40 MEH: after talking with Ken, the following are the more detailed processing priorities (not just gpc1, then gpc2)
    gpc1 OSS/ESS/MSS
    gpc1 SNIa (through warp)
    gpc2 OSS 
    gpc1 WS 
    
    • if someone is watching the system, gpc2 could be left in summitcopy+stdsci as long as the normal rate keeps pace; however, with few data nodes having significant space and all the other processing/disk use going on, the gpc2 label should probably be removed until morning. on the other hand, we could just leave both in and, in the morning, focus on finishing up gpc1 in case there is weather.
    • the gpc1 WS label should be taken out until morning, as those runs use a lot of processing and disk space (see the label sketch at the end of this day's entries)
  • 20:30 MEH: to save OSS.20150731 from being cleaned, changing the label to
    difftool -dbname gpc2 -updaterun -data_group OSS.20150731 -set_label OSS.nightlyscience.hold
    
  • 21:55 MEH: stack on in cleanup now that last night's cleanup has almost finished.. -- ah, someone turned it on early w/o noting it...
  • 22:35 MEH: will watch ipp082 in neb-host up w/ regular nightly processing for a bit -- it immediately went to high load, wait CPU ~50% and io only ~30MB/20MB in/out -- seems like the case where the write cache is not actually on. suspect that since there is no BBU, the write cache is turned off even though it reports as on... if so, it will need to be set back to repair
    • it does seem to be a drag on the processing and probably impacts the dvomerge running -- back to repair @2320
  • 00:55 MEH: rate seems reasonable tonight w/ gpc2 kept in, leaving as is and WS labels out until morning
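  • For reference, the WS label juggling above is done from the stdscience pantasks client using the del.WS.label / add.WS.label macros (also referenced in the Sunday 00:16 entry); a rough sketch -- session/host details and exact macro behavior depend on the local stdscience setup:
    pantasks_client        # attach to the stdscience pantasks
    del.WS.label           # drop the WS diff label until morning
    add.WS.label           # once gpc1 nightly has cleared, put the WS label back in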

Saturday : 2015.08.01

  • 07:34 MEH: clear stalled warp fault
    warptool -dbname gpc1 -updateskyfile -set_quality 42 -skycell_id skycell.2277.067 -warp_id 1610131  -fault 0
    
  • 07:36 MEH: nightly gpc1 OSS almost finished and gpc2 OSS finished, adding WS labels back in -- will see how long they take on their own compared to more limited data nodes yesterday (suspect 2/6 may go red after WS)
  • 08:30 MEH: moving some static test data off ipp071 to ipp023 to help free up space for priority nightly processing data nodes
  • 09:00 MEH: with the current priorities and WS running in the morning, ps_ud_QUB must be higher priority than the WS label, otherwise already-delayed data gets delayed again in the TSS pipeline..
    • ps_ud_MOPS should probably be raised as well
  • 13:39 MEH: nightly WS finished except for the one that regularly gets fault 2; set quality 42 to clear it.. -- distribution will take some time to push out the bundles for QUB now
  • 16:20 MEH: nightly distribution just about finally finished... this is 5 minutes past the final deadline to make it into tonight's QUB ingest and needs to be improved....
  • 16:40 MEH: had to spend extra time fixing a stalled WS diff from 20150725 since it wasn't dealt with before all the parts were cleaned.. -- diff_id 1181701, skycell.2261.014
  • 22:00 EAM: nebulous replication has been running days behind and has finally given up the ghost. There are missing binlogs and the only solution is to re-rsync the full nebulous database and restart replication. Since there is currently bad weather, I am shutting the system down now and plan to stop nebulous and do the rsync.
  • 22:50 EAM: rsync ippdb08 -> ippdb06 started. here are the relevant coordinates:
    mysql> show master status;
    +-------------------+-----------+--------------+------------------+-------------------+
    | File              | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
    +-------------------+-----------+--------------+------------------+-------------------+
    | mysqld-bin.002287 | 164616904 |              |                  |                   |
    +-------------------+-----------+--------------+------------------+-------------------+
    1 row in set (0.00 sec)
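  • Once the rsync completes, replication on ippdb06 (the replica) would be re-pointed at the coordinates above with something like the following, run from the shell on ippdb06 (a sketch; the existing replication user/password settings are assumed to still be configured and so are not repeated here):
    mysql -e "CHANGE MASTER TO MASTER_HOST='ippdb08', MASTER_LOG_FILE='mysqld-bin.002287', MASTER_LOG_POS=164616904; START SLAVE;"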
    

Sunday : 2015.08.02

  • Gene fixed nebulous and restarted stdsci (HAF added this so we have a record, Gene should add details)
  • HAF: 12:02am registration stuck, kicked it (see the loop sketch at the end of this list for handling several stuck imfiles at once)
    regtool -updateprocessedimfile -exp_id 952319 -class_id XY35 -set_state pending_burntool -dbname gpc1
    
    • 00:16 MEH: another reg fault; removed the WS diff label (del.WS.label/add.WS.label) to catch up on the primary OSS gpc1 data
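  • When several imfiles are stuck at once, the same regtool kick used above can be looped over a list of exp_id / class_id pairs; a minimal sketch (stuck.txt is a hypothetical file with one "exp_id class_id" pair per line, e.g. "952319 XY35"):
    while read exp_id class_id; do
        regtool -updateprocessedimfile -exp_id "$exp_id" -class_id "$class_id" -set_state pending_burntool -dbname gpc1
    done < stuck.txt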