PS1 IPP Czar Logs for the week 2013.04.01 - 2013.04.07


Monday : 2013.04.01

Mark is czar

  • 08:05 MEH: remaining few nightly science exposures finishing up
  • 09:38 Bill: ippc03 is running out of space in /tmp. The nebulous_server.log file had grown to 22G. Stopped apache, moved the file out of the way, restarted apache, then deleted it (a sketch of the rotation is below, after this list).
  • 11:10 MEH: ipp05,08,09 have nebulous_server.log >22G as well, leaving ~4G on the / OS disk; these should be cycled out as well -- done. ippc07 will be next, with only ~9G left available
  • 19:00 MEH: MD09 deepstacks done, so compute3 goes back to stdsci and stack until the MD08 refstack starts ~tomorrow
  • 22:20 MEH: appear to have lost the connection from IfA to the production cluster? ganglia is also reporting all systems down..
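
For reference, a minimal sketch of the log rotation described in the 09:38 and 11:10 entries, assuming the apache init script name and the log location (both are assumptions and vary per host):

      # free the OS disk on a nebulous apache host when nebulous_server.log blows up
      sudo /etc/init.d/apache2 stop                                     # init script name is an assumption
      sudo mv /tmp/nebulous_server.log /tmp/nebulous_server.log.old     # log path is an assumption
      sudo /etc/init.d/apache2 start                                    # apache reopens a fresh log file
      sudo rm /tmp/nebulous_server.log.old                              # now safe to reclaim the ~22G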

Tuesday : 2013.04.02

Mark is czar

  • 07:00 MEH: network issue with the summit, but also something odd with the production cluster -- it looks like some systems are down and some were rebooted.
    • still down: ipp028, 033, 036, 046, 048, c17, c19, db02, and the console is not reachable -- for some this was just a ganglia problem; ipp046, 048, c17, c19 are actually down/off the network and cannot be reached via console
    • checking whether the stsci nodes were power cycled -- no, so probably a good thing
    • ippdb00, 01, 03 have been, so mysql may need to be checked
    • various data nodes (non-stsci data) have been rebooted, it seems
  • 08:30 MEH: restarted screens and czarpoll on ippc11; will wait on roboczar until systems are back up and running
  • 08:40 MEH: rebooted ipp036 to fix mount issues; it didn't come back up cleanly after whatever happened last night. Need to scan mounts on all machines
    • Serge put ipp036 into neb-host repair -- needs to be put back up -- ok
    • ipp028,033,036,060 disks aren't always mounting -- Gene noted ypbind has gone rogue; need to kill and restart it along with nfs (get a root shell first via sudo tcsh or su -, since once ypbind is killed there is of course no sudo access; see the ypbind sketch after this list)
    • some wave1 machines have trouble mounting wave4
  • 09:30 MEH: ganglia is showing high 1-min load on several machines with nothing running; gmond probably needs to be restarted on most/all machines -- done, and login seems fine except for the ipp041 date issue below
  • 09:40 MEH: ippc17 and ippc19 are down; they host ippRequestServer (PSS, datastore db, and replication), so it is not possible for PSS to run until they are fixed
  • 10:00 MEH: ipp041 date is wrong -- 10:25 manually set to within 1s. How to fix the time sync? (see the ntpd sketch after this list)
  • 10:20 Bill added PSS MOTD, to remove -- cp /data/ippc17.0/pstamp/web/pstamp_motd.html.empty /data/ippc17.0/pstamp/web/pstamp_motd.html
  • 11:45 MEH: ipp048 and ippc19 still have problems; setting ipp048 to down. Haydn and Rita were having trouble logging into the ipp048 console; I can log in, and a disk scan is what is taking the time for the reboot -- back up now, neb-host up.
    • ippc19 still being investigated -- burning smell, out for the time being
  • 12:05 MEH: datanodes back online, mounts appear fine. slowly starting pantasks -- summitcopy, registration, stdscience, stack
  • 12:20 MEH: nebdiskd needs to be started on ippdb00
  • 12:35 Serge fixed the crashed czardb table processed -- repair table processed. loaded_to_datastore had also crashed and was repaired; eventually the czarpoll page will be up to date. loaded_to_ODM also crashed and was repaired. (See the REPAIR TABLE sketch after this list.)
  • 12:50 MEH: distribution, publishing, detrend started. pstamp and update will wait until decide on ippc17 replication
  • 13:40 MEH: finding other machines with clocks close but wrong by hours.. more importantly, the ippdb00,01 clocks are off as well, so all processing needs to stop
  • 15:00 MEH: system times fixed and /etc/init.d/ntpd restarted; ippdb00,01 mysql restarted; processing restarted
  • 15:05 MEH: wave1 nodes need a reboot with the testing 3.7.6 kernel, or they may have disk issues with all of them up in nebulous -- all put into repair and removed from processing, slowly rebooting with the 3.7.6 kernel
    • the kernel option may not be visible depending on the console state and which system you are logged into; it should show up when you arrow-key up
    • ipp017 has a different mobo, so keep the original kernel and leave it in neb-host repair
    • ipp011-016 rebooted with new kernel and neb-host up
    • ipp018-021 rebooted with new kernel and set neb-host up (ipp021 originally okay with original kernel and neb-host up)
  • 15:15 Bill noticed pstamp was confused by ports from detrend; fixed, and pstamp is running again
  • 15:18 Bill removed the "pstamp is down" message from postage stamp web pages by emptying the pstamp_motd file.
  • 17:55 MEH: ippc01-c05, stare, stsci ganglia not reporting right, sudo /etc/init.d/gmond restart
  • 19:30 MEH: done downloading last night's data, but registration is stuck and not registering them all. Downloading and registering tonight's data is okay.
  • 20:30 MEH: cleared the offending imfiles and rawExp; registration from last night is proceeding
  • 22:20 MEH: nominally caught up with last night's and tonight's exposure registration; the processing load seems more irregular than normal, with more stage faults.
  • 23:40 MEH: restarted roboczar on ippc11. cleanup stays off (not in roboczar) and replication is off since it normally runs on ippc19, which is not active right now anyway.
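
A sketch of the ypbind recovery noted under the 08:40 entry, for hosts whose disks were not mounting; the init script names are assumptions, and the key point is to grab a root shell first, since sudo relies on NIS and stops working the moment ypbind is killed:

      sudo tcsh                      # or: su -   (get root BEFORE touching ypbind)
      /etc/init.d/ypbind restart     # kill the rogue ypbind and bring it back (script name is an assumption)
      /etc/init.d/nfs restart        # restart nfs as well (script name is an assumption)
      mount -a                       # retry the NFS mounts that failed earlier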
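
For the wrong system clocks noted at 10:00, 13:40, and 15:00, a sketch of the manual fix; the NTP server name below is a placeholder, not a known site server:

      ntpdate -q ntp1.example.edu            # query only: how far off is this host?
      sudo date -s "2013-04-02 15:00:00"     # set the clock by hand (or: sudo ntpdate <server>)
      sudo /etc/init.d/ntpd restart          # let ntpd keep it in sync from here on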
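
The crashed czardb tables from the 12:35 entry were handled with plain REPAIR TABLE; a sketch, where the database host and schema name are assumptions:

      mysql -h ippdb01 czardb -e "CHECK TABLE processed;"
      mysql -h ippdb01 czardb -e "REPAIR TABLE processed, loaded_to_datastore, loaded_to_ODM;"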

Wednesday : 2013.04.03

Bill is czar today

  • 08:05 Nightly science has about 40 warps left to process and lots of diffs. Setting chip.off for now.
  • 09:51 Just 50 diffs to do; setting warp to off for a few minutes
  • 10:21 Letting nightly science diffs finish then will restart stdscience.
  • 10:36 stdscience restarted
    • MEH: compute3 is still out of autoload to stdscience and stack -- deepstack tests and refstacks soon -- adding it manually back into stdscience until I reactivate deepstack after the Tuesday mess
  • 11:00 MEH: appears MD05,06,07 from 4/02 are still missing from stacks and diffims; is re-adding that date to stdscience needed to pick them up? Will need to tweak_ssdiff since we are past the default processing window
    • added, and the stacks except MD07 were picked up @1210. tweak_ssdiff done and the 4/03 SSdiff running; may need to extend the window depending on how long the 4/02 stacks take
  • 13:40 MEH: MD stacks and ssdiff finished, tweak_ssdiff time back to default
  • 14:45 OTIS reported a problem exposure from Monday night, o6384g0201o. One of its pzDownloadImfiles had fault 110 (HTTP Gone 410 - 300) and the pzDownloadExp had been set to drop. The file is now available, so I reverted the imfile and set the state back to run, and the exposure copied and registered successfully. It is now going through chip processing.
  • 15:00 MEH: md.pv1 notes -- no need to monitor, and since it is low priority, if chip.off etc. becomes necessary, that is not a problem.
    • MD08.refstack.20130401 chip->warp, and its priority may move up since we've started observing MD08. deepstack will take the compute3 nodes in the very near future.
    • MD08.pv1.20130403 reprocessing is necessary for all exposures prior to July 2012 (minimum date of May 15, 2012, so there is some overlap of "acceptable" processing for comparing the PV1 start vs now)
    • MD09.update2012: almost half of the MD09.nightlyscience warps from 2012 were not cleaned up (the chips were); moved the label MD09.nightlyscience->MD09.update2012 and am updating the warps that were cleaned up to be available for the MD09 2009-2012 deepstack.
  • stsci06 has gone down, stopping processing
  • 16:09 set stsci06 to down and set pantasks to run
  • 16:57 many stuck jobs waiting for stsci06. Stopping processing and will attempt to kill outstanding processes
  • 17:18 stdscience, distribution, update, and pstamp pantasks restarted.
  • 17:20 camera and warp revert.off for now. The faulted jobs need data from stsci06 that is not available.
  • 17:40 Many, many nebulous errors creating new files. Restarted the apache nebulous servers that had high load; this seems to have improved things
  • 21:25 MEH: with stsci06 down, removing the MD08.refstack.20130401 label since the remaining chips/warps fault. chip/camera/warp revert back on so nightly processing doesn't hang up

Thursday : 2013.04.04

Bill is czar today

  • 05:40 Lots of scattered red on czartool, but no smoking guns. Publish in particular has about 20% of jobs faulting since last pantasks restart. It is running sluggishly so restarted publishing pantasks.
    • summit copy and registration kept up pretty well. The last science observation was 17 minutes ago and there are only 7 more exposures to copy.
    • publish faults are repeating. It looks like the previous faults may have occurred when attempting to add the result to the database, which occurs after fileset registration. The current failures are failures to add to the datastore because the fileset already exists. Yuck. Those are hard to fix.
    • example of a fault from a problem connecting to the database:
       -> psDBAlloc (psDB.c:166): Database error originated in the client library
           Failed to connect to database.  Error: Unknown MySQL server host 'scidbm' (1)
       -> camtoolConfig (camtoolConfig.c:331): unknown psLib error
           Can't configure database
       -> main (camtool.c:62): (null)
           failed to configure
      
  • 08:25 Serge pointed out that this could be due to ippc19 being down since it is one of the name servers. I've changed site.config to use db01 explicitly. I'm not sure if that will make any difference or not.
  • 08:30 stsci02 is down which has sent all of the nebulous servers into non-responsive mode. Setting all pantasks to stop until the situation improves.
  • 09:55: Some tables in the DisksMonitoring databases on the storage nodes are corrupted. The error is ERROR 130 (HY000): Incorrect file format. A classical REPAIR TABLE <table name> doesn't work, but REPAIR TABLE <table name> USE_FRM does (see the sketch after this list).
  • 10:10: Rita is on her way down to Kihei
  • 11:06: Rita was able to get ahold of MHPCC and they were able to reboot stsci02 and stsci06. I am letting the queues empty of formerly stuck jobs before restarting things. (Except for pstamp which I have restarted because MOPS is waiting)
  • 13:00 MEH: dealing with messed up MD08xxx files
  • 14:10 MEH: tweak_ssdiff to make SSdiffs from last night now that night stacks are finished
  • 14:33 conductor is back online. Restarted summitcopy
  • 14:55 MEH: camera.revert.off on purpose while fixing MD08xxx problems..
  • 15:20 MEH: SSdiffs finished, tweak_ssdiff back to default
  • 15:40 MEH: manually adding compute3 to stdsci since MD08.ref processing not ready to use deepstack yet
  • 20:30 MEH: deepstack started, removing compute3 1x stack and 2x stdsci
  • 21:20 MEH: may want to revise prio for WEB pstamp, looks like a large request is blocking MOPS -- looks like someone just did
  • 22:10 MEH: looks like nightly_science.pl --queue_stacks is having regular timeouts, which is probably why the MD and xSS stacks are not being made? MD03 is almost finished, will watch -- ESS queued ok
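
A sketch of the USE_FRM repair from the 09:55 entry, run against the DisksMonitoring database on each affected storage node; the table name below is a placeholder:

      # a plain repair fails with ERROR 130 (HY000): Incorrect file format
      mysql DisksMonitoring -e "REPAIR TABLE disk_usage;"
      # USE_FRM rebuilds the table from its .frm definition and succeeds
      mysql DisksMonitoring -e "REPAIR TABLE disk_usage USE_FRM;"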

Friday : 2013.04.05

Heather is czar today

  • 02:40 MEH: looks like stsci09 is unresponsive/down and will need to be specially rebooted in the morning. Setting neb-host down to maybe keep some processing going..
    • not sure what has been on the console, but we may want to specially reboot the remaining ones that haven't crashed after the bad power issues last Tuesday..
  • 08:00 Rita contacted MHPCC to specially reboot stsci09, back up @0840. setting neb-host repair
  • 08:58 Bill: set summit copy to stop because conductor is down.
  • 11:15 MEH: might as well chip.off to push the remaining 3pi warps through, lots of downtime to do the MD08.pv1 chips later
  • 11:50 MEH: MD05 night stacks finished, tweak_ssdiff to get them out. MD06,07 exposures are not yet downloaded, so those will have to wait for later.
  • 12:30 MEH: now ippc17 is down.. looking into ... cannot access power management on console.. no PSS/datastore for now... with c19 gone as well
    • Eddie @MHPCC rebooted it, but the console shows it booting to a livecd.. BIOS problem?
  • 13:20 MEH: moving deepstack pantasks to run on ipp060 since just used for testing and will be regularly logged into to do test runs
  • 13:26 Bill: set publishing and distribution/rcserver to stop since they depend on the data store and ippc17 is down.
  • 14:45 MEH: talked with Serge; will turn on cleanup for a couple of days, until ~Sunday. With the problem of stsci nodes going down recently, will also put them into repair over the weekend and remove MD08.pv1 from processing
    • dist.cleanup.off since it uses the ippc17 DB
    • restarted stdscience to clear some stalled MD08 jobs. ns.add.date 2013-04-05 (and ns.add.date 2013-04-05 just in case) to pick up last night's data.
  • 20:00 MEH: dist.cleanup.on, restarting distribution and publishing now that ippc17 is back up and things are moving along
  • 22:40 MEH: looks like summitcopy got a double loading of nodes -- was this intentional? Download times were >1ks and reg times were large as well. Removed the double loading and times are down to normal values.

Saturday : 2013.04.06

  • 08:30 MEH: adding the extra compute3 nodes seems to stress the processing load; something is having problems
    • ipp023 isn't happy with its mounts, but it is out of normal processing
  • 08:50 MEH: dropping the ThreePi.WS.nightlyscience label from stdscience for a bit; there are some 200 3pi warps ahead of MD05, and the diffs are using a large chunk of processing power.
  • 12:20 MEH: MD05 finished its stack, tweak_ssdiff to finish. ThreePi.WS.nightlyscience is still out
  • 14:40 MEH: adding the ThreePi.WS.nightlyscience label back in now that the nightly science warps are finished and the final diffs are loaded
  • 15:00 MEH: only diffim skycells are left; have had problems in the past with all-skycell processing overloading the stsci node i/o, so limiting stdscience with set.poll 200 and watching the rate.
    • even <150 still drives the stsci nodes to spike into i/o wait; may as well over-load the queue of running jobs -- while some are stalled, others will run
  • 17:00 MEH: stsci08 load very high (>200), scaling stdsci WSdiff back down.. with reduced data nodes, removing compute3 from stdscience to reduce i/o
  • 19:40 MEH: waiting for stdscience to start chips before turning up poll.. registration was confused, so restarted and chips loading
  • 22:10 MEH: ThreePi.WS.nightlyscience label is out of stdsci until morning; it appears to regularly pull stack files from the same stsci node, driving it into a heavy wait state if >100 jobs are running -- the label will need to be added back in the morning with the poll set to 100

Sunday : 2013.04.07

  • 06ish Serge: poked the usual OSS stuck at publishing
  • 09:30 Serge: Stopped cleanup. Started a backup of the ippdb02 mysql to ippc63 (screen session backup_nebulous / ippdb02 superuser); see the sketch at the end of this list.
  • 10:35 MEH: SSdiffim finished; adding the ThreePi.WS.nightlyscience label in with poll 100. Starting the game of keeping the stsci nodes from spiking into a wait state
  • 10:50 MEH: ippc06 down. console shows it is rebooting, on its own??
  • 14:49 MEH: ippc06 down again.. console shows in the middle of reboot on own again??
  • 19:35 MEH: turning up the stdsci poll once a handful of chips are registered, removing the ThreePi.WS.nightlyscience label
  • 23:00 MEH: marginal night so far; many MD exposures, and MD has WS which will be like ThreePi.WS.nightlyscience. Adding the WS label back and setting poll 130; if chip, warp, and diff are all running then most nodes will be used, but the WS diffs won't hammer the stsci nodes as much
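
A sketch of how a backup like the one in the 09:30 entry can run inside a named screen session so it survives logout; the mysqldump options, database name, and target path are assumptions, not necessarily what Serge used:

      screen -S backup_nebulous              # on ippdb02, as a suitably privileged user
      mysqldump --single-transaction nebulous | gzip > /data/ippc63.0/backup/nebulous-20130407.sql.gz   # path is a placeholder
      # detach with Ctrl-a d; reattach later with: screen -r backup_nebulous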