PS1 IPP Czar Logs for the week 2012.03.12 - 2012.03.17


Monday : 2012.03.12

Mark is czar

  • 07:15 nightly science looks to have downloaded and processed. One MD07 stack (828514, skycell.011) has a completely bad PSF and probably needs to be dropped. Fixed a couple of XY26 LAP chips.
  • 08:30 stack ahoy.. cleared the 5 stalled LAP runs held up by error_cleaned in warp stage (common skycell_ids 2349.041,2349.042,2350.019,2350.053) with
    warptool -dbname gpc1 -tocleanedskyfile -warp_id 383567 -skycell_id skycell.2350.053
    warptool -dbname gpc1  -setskyfiletoupdate -warp_id 383567 -skycell_id skycell.2350.053 -set_label LAP.ThreePi.20110809
    
  • 08:50 while most of LAP is in stack, going to restart stdscience.
  • 10:50 LAP stack on ippc24 seems stalled for >2 ks. On a suggestion from Chris, logging into the data host ipp008 and running sync seems to have cleared it.
    -- stack pantasks
      0    ippc24    BUSY   2349.08 0.0.0.20e5  0 stack_skycell.pl --threads @MAX_THREADS@ --stack_id 829996 --outroot neb://any/gpc1/LAP.ThreePi.20110809/2012/03/12/RINGS.V3/skycell.2572.042/RINGS.V3.skycell.2572.042.stk.829996 --redirect-output --run-state new --reduction THREEPI_STACK --dbname gpc1 --verbose
    
    -- ps on ippc24
    ipp      15381 11400  0 10:38 ?        00:00:00 glockfile /data/ipp008.0/nebulous/e2/07/1954880509.gpc1:LAP.ThreePi.20110809:2012:03:12:RINGS.V3:skycell.2572.042:RINGS.V3.skycell.2572.042.stk.829996.ppStack.mdc xcld 0
    
    
  • 12:10 many pending LAP jobs want data on ipp041 (still only 1 CPU and out of processing). Chris has set unwant=7 (default is 5), with 7 remote tasks running, and is going to try bumping it up to 10; the system is very busy so he will watch to see if anything breaks.
  • 13:55 Serge fixed replication issue Replication_Issues#a2012031213:31ippdb03ipp0001
  • 17:00 non-existent LAP OTA to drop
    neb://ipp040.0/gpc1/20100921/o5460g0133o/o5460g0133o.ota15.fits 
    
    chiptool -dropprocessedimfile -set_quality 42 -chip_id 426078 -class_id XY15 -dbname gpc1
    regtool -updateprocessedimfile -set_ignored -exp_id 228082 -class_id XY15 -dbname gpc1
    
  • 17:30 Chris/Gene investigated ipp015 disk issues; ended up rebooting it.
  • 18:00 ipp015 back up but the disk was not being exported (correctly?); ran exportfs -f. However, ganglia still seemed to think it was down; it eventually came back (see the export check sketched after this list).
  • 19:00 more XY26 LAP chips repaired. And oops, it turns out there are actually 18 more lap_ids left to go before the 0-1hr strip is finished.
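
For reference, a minimal sketch of the kind of NFS export check relevant to the 18:00 ipp015 entry (standard exportfs/showmount usage; nothing here is specific to the ipp015 filesystem layout):

    # on the rebooted server: re-read /etc/exports and flush the kernel export table
    exportfs -ra
    exportfs -f
    # list what is actually being exported now
    exportfs -v
    # from a client: confirm the server's exports are visible again
    showmount -e ipp015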

Tuesday : 2012.03.13

Mark is czar

  • 07:00 ipp015 is down, and not rebooting... looks like it needs someone to physically go check it out.
  • 07:30 setting neb-host down for ipp015; stopping and shutting down most pantasks while attempting to untangle things that needed ipp015. Not sure what to do about addstar.
  • 08:00 turning summitcopy back on (while heading into the office); it seems to be downloading, somewhat.
  • 09:10 Serge: Nebulous snafu. Restarted the mysql server on ippdb00.
  • 10:05 Serge: All apache/nebulous servers have been restarted.
  • 11:00 still killing off hung processes.
  • 11:10 notified by Rita that the AC at the ATRC is down; the room is ~91F but now open with fans on. No temperature info on http://ps1wiki.ifa.hawaii.edu/cgi-bin/ippDashboard.py, but not anticipating any load on those machines. Should they be powered down anyway?
  • 11:30 restarting things, putting ipp015 into the hosts_ignore_wave1 list (do we want to comment it out completely in the wave1 list?). Bill is taking care of the postage stamp server.
  • 12:13 CZW: I've modified a copy of repair_bad_instance and fed it an input list generated from Serge's slocate databases of ota26 entries on ipp064 (a hedged sketch of pulling such a list follows this list). It's currently running single-threaded, checking all of these raw FITS files and repairing bad instances where possible. This copies good copies from the ATRC, but since it should not be doing more than one at a time, I do not expect it to overly tax the ATRC computers/ATRC AC.
  • 12:15 Serge: Killed ippdb03 scheduled mysqldump to perform a manual one that I intend to ingest on the broken ipp001. Replication on ippdb03 has been stopped on purpose.
  • 12:40 nightly science cam/warp/stack faults are mostly due to needed processed chip data that lives on ipp015. The same goes for LAP, so LAP is effectively stalled.
  • 12:50 replication pantasks back running, target.on only.
  • 13:00 Gavin set up access to the power control on the compute3 nodes for us.
  • 14:10 Serge: Restarted replication on ippdb03. CHANGE MASTER TO MASTER_LOG_FILE='mysqld-bin.020936', MASTER_LOG_POS=73345738;
  • 14:30 registration has been very unhappy; ipp011 is often seeing "lockd: server not responding" for ipp008 and ipp014, so set those two to repair and will watch with tonight's data.
  • 15:00 all data downloaded and registered from last night, finally. There will be some data stuck (3 MD06 warps, 3 MD07 stacks, 14 3PI warps, 7 3PI camera, 1 3PI diff) that are waiting for access to intermediate processing products on ipp015. Turning chip/camera/warp.revert.off for now and will toggle manually every so often as remaining science data moves through.
  • 16:00 extra LAP runs queued to see if they will process without needing any previously processed products from ipp015. Some did, but most got stuck as well because of ipp015.
  • 16:30 Chris bumped the stdscience unwant back up to 10 since it was reset to 5 after the pantasks restart. May just remove ipp015 from pantasks (rather than setting it to off) tonight if things are still running slow while waiting for targeted ipp015 jobs to be triggered remotely.
  • 16:50 Rita reports that the ATRC room temperature is still ~90F, Gene is going to power down the ippb00-03 machines. AC service not until Thursday afternoon.
  • 19:00 the previous night's data is just finishing up before tonight's data arrives.
  • 22:00 looks like lockd is also having trouble with ipp009 and ipp016, in addition to ipp008 and ipp014 seen earlier. Looking at /var/log/messages on a few machines, this seems to have been an issue for at least the past week (see the scan sketched after this list). Not sure how much this slows down registration in general. Putting them back into the normal neb-host up state.
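
For the 22:00 lockd entry, a minimal sketch of the kind of scan used to see how far back the "lockd: server not responding" messages go (plain ssh/grep; the host list is just the machines named in the entries above):

    # check each registration-related host for NFS lock daemon complaints
    for h in ipp008 ipp009 ipp011 ipp014 ipp016; do
        echo "== $h =="
        ssh $h "grep -i 'lockd: server' /var/log/messages | tail -n 5"
    done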
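
For the 12:13 repair_bad_instance entry, a hedged sketch of how an ota26 input list could be pulled from a per-host slocate database; the database path and output file name below are placeholders, not the actual files Serge built:

    # query the slocate database built for ipp064 and keep only the raw ota26 frames
    locate -d /path/to/ipp064.slocate.db ota26 | grep '\.fits$' > ota26_on_ipp064.list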

Wednesday : 2012.03.14

Serge is czar. TODO list:

  • If no news from Rita/Haydn by 9:30am, the MHPCC phone number is 808-879-5077 (ask for the server room working on PanSTARRS).
  • Start mysql replication slave on ipp001
  • Remove ippc50 from processing list
  • 06:34 EAM: ipp014 hung, rebooted it (looks like a cpu error)
  • 08:30 Serge: Shutting down ippc50
  • 08:52 Serge: ippc50 is down. Updated http://ps1wiki.ifa.hawaii.edu/trac/wiki/IppSystemDownProcedure with the not-so-new-now compute nodes, ippdb03...
  • 10:15 Serge: Added ipp015 and ippc50 back in pantasks and nebulous.
  • 10:30 Serge: Haydn checked ipp058 (see Mark's e-mail "[ps-ipp-dev] ipp058 stuck in power cycle" Fri, 9 Mar 2012 17:26:04 for details). The cabinet seems to be ok (status is ON and no more PENDING).
  • 11:18 Bill: set host ippc43 to off in pantasks for stdscience, deepstack, and stack. He is using that node to debug the psphotStack memory explosion.
  • 11:18 Bill: reverted LAP staticsky runs that faulted due to ipp015 being down and other cluster sadness.
  • 11:23 Bill: ran pztool -clearcommonfaults to clear several summit copy faults. (The revert task would have done it eventually, but I'm impatient.) Still have 84 incomplete exposures in summit copy and 36 that have been copied but haven't finished burntool.
  • 11:25 Mark: helping check and kick the remaining LAP chip-warp runs into stack while the nightly science download catches up.
  • 11:47 Serge: ipp001 replication is back (the generic slave-restart recipe is sketched after this list).
  • 12:40 Mark: running tweak_ssdiff in stdscience to make the MD SSdiffims now that all MD are finished processing
  • 13:36 CZW: ipp014 became unresponsive and I cycled the power.
  • 15:30 Serge: removed stare nodes from pantasks servers
  • 15:40 Serge: ipp059 hung. Rebooted it.
  • 17:00 Mark: ipp059 stalling put some LAP diffs into a strange pantasks state; going to restart stdscience. Turned ippc43 off for Bill, and turned ipp014 off for 3 of its 6 pantasks slots since it has been unhappy.
  • 17:10 LAP looks to be effectively stalled on the final few runs needing instances from the ippb machines.
  • 17:20 Bill finished, adding ippc43 back into stdscience.
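
For the 11:47 entry, the generic recipe for bringing a replication slave back from a fresh dump, as a sketch only: ingest the dump on the slave, then point it at the binlog coordinates recorded with the dump. The coordinates below are just the ippdb03 values Serge logged on Tuesday, reused here to illustrate the form:

    -- run in the mysql client on the slave after loading the dump
    STOP SLAVE;
    CHANGE MASTER TO MASTER_LOG_FILE='mysqld-bin.020936', MASTER_LOG_POS=73345738;
    START SLAVE;
    SHOW SLAVE STATUS\G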

Thursday : 2012.03.15

  • 01:10 Mark: do we need to manually put the stare data into the wait state to start the PS1 nightly data? Looks like it has to be done manually; still downloading stare data:
    update pzDownloadExp set state ='wait' where state = 'run' and exp_name like '%a';
    
    (Talked with Chris in the morning; need to look at pztool_pendingimfile.sql to debug this in the future.)
  • 02:00 registration seems to be stuck in pending_burntool on o6001g1205o? It seemed to start after ~30 min; maybe 1204o wasn't completely downloaded yet?
  • 07:00 Bill: registration has advanced to o6001g1378o but is now stuck because an apply-burntool process was using ipp061, which has crashed. Power cycled it. From the ganglia chart it looks like it crashed earlier as well.
  • 09:16 Serge: Burntool finished with science data. Executed stmt "update pzDownloadExp set state ='run' where state = 'wait' and exp_name like '%a';". About 800 exposures to download/register.
  • 12:05 Serge: stdscience finished. Turned chip.revert.off
  • 15:20 ippb machines at the ATRC are back up temporarily to get the needed LAP chips; reverting faulted LAP chips and succeeding. Word is the AC repair will be early next week.
  • ~16:00 Chris fixed summitcopy's pztool_pendingimfile.sql to select science data over stare data; will see if it works tonight since there will still be stare data downloading once the science observations start (a hedged illustration of the idea follows this list).
  • 18:10 Mark: merging SYS.ERR=0 for stacks, will restart stack and stdscience pantasks when finished rebuilding ippconfig.
  • 19:30 the fix Chris made to summitcopy worked: with new science data and ~350 stare exposures left, it chose to download the available science data. Once caught up with the science data, a few stare chips have been seen slipping in as well.
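
For the ~16:00 entry, the actual change to pztool_pendingimfile.sql is not recorded here; the query below is only a hedged illustration of the idea (science exposures sorted ahead of stare exposures, whose names end in 'a'). Only the pzDownloadExp table and its state/exp_name columns are taken from the entries above; the rest is an assumption:

    -- hypothetical ordering only: stare exposures (exp_name ending in 'a')
    -- sort after everything else, so science data is picked up first
    SELECT exp_name, state
      FROM pzDownloadExp
     WHERE state = 'run'
     ORDER BY (exp_name LIKE '%a'), exp_name;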

Friday : 2012.03.16

  • 09:15 Serge: MOPS asks why the diffs have not been processed.
  • 09:30 Serge: Shut down ipp003 and ipp022
  • 09:40 Mark: looks like the MD stacks were not made either - looks like the ppSub.config change had a space rather than a tab (unsure how), so double-checked that, rebuilt, and restarted the stdscience and stack pantasks (a quick tab-vs-space check is sketched after this list).
  • 10:47 Serge: ns.add.date 2012-03-16 fixed the problem! I also ran ns.add.date 2012-03-15 just in case.
  • 11:15 Serge: Still 33 stare exposures taken two nights ago to be downloaded.
  • 12:22 Serge: All stare exposures have been downloaded.
  • 13:20 Mark: MD stacks finished, running tweak_ssdiff
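
For the 09:40 entry, a quick way to spot a space-where-a-tab-should-be problem in a config file; grepping for SYS.ERR (the keyword merged in Thursday's 18:10 entry) is just one way to narrow the output, and the file is assumed to be inspected from wherever it sits in the ippconfig checkout:

    # cat -A prints tabs as ^I and line ends as $, so a line saved with spaces
    # where a tab was expected stands out immediately
    cat -A ppSub.config | grep SYS.ERR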

Saturday : 2012.03.17

Sunday : 2012.03.18