PS1 IPP Czar Logs for the week 2013.04.15 - 2013.04.21

(Up to PS1 IPP Czar Logs)

Monday : 2013.04.15

mark is czar

  • 07:10 MEH: nightly data downloaded (before the morning shutdown again) and processed -- running tweak_ssdiff early before any system work needs to be done today
  • 09:00 Bill: postage stamp pantasks was running very slowly (pcontrol spinning), so restarted it. Also restarted update.
  • 10:25 Bill: mysqld on ippc17 is not running.
    • MEH: ippc17 having several port/disk error messages..
  • 10:40 MEH: stsci01 crashed over the weekend even with neb-host set to repair. It was rebooted into the new 3.7.6 kernel; will set neb-host to up since it seems to be running okay like stsci06. stsci01 was one of the stsci machines that had not crashed since the 4/2 power failure while in neb-host repair, so we will probably want to start sequentially rebooting all stsci machines into the new kernel soon.
  • 10:50 MEH: shutting down majority of processing for ippdb work
  • 13:00 MEH: Serge et al finished ippdb swap and cleanup of apache node logs, system back up.
    • ipp home directory has <1G of space left.. people need to clean their home dirs. Moving ~ipp pantasks logs to an archive and compressing (see the sketch after this entry) -- with everyone's help, back to ~150G by end of day.
    • setting some nightly stacks with fault 5 to drop; they will never complete.
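    • a minimal sketch of the log archiving mentioned above (the archive location, log pattern, and age cutoff are hypothetical; the actual destination was not recorded here):
      mkdir -p /data/ippXXX.0/ipp/pantasks_log_archive
      find ~ipp -maxdepth 2 -name '*.log' -mtime +7 \
           -exec mv {} /data/ippXXX.0/ipp/pantasks_log_archive/ \;
      gzip /data/ippXXX.0/ipp/pantasks_log_archive/*.log
      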
  • 16:30 Gavin/Rita report that ippc17 looks to have experienced a motherboard failure and will need to be looked at again tomorrow. The datastore and postage stamp server (as it uses the datastore to distribute images) will likely be down for 24 hours -- sent email to the general ps-ipp-users list to inform all users of this.
  • 17:00 Serge: Commented out ippc19 related backup on ipp@ipp001: /home/panstarrs/ipp/mysql-dump/ops_dump.csh
  • 17:10 Bill set rcserver.off as it uses the datastore, but distribution can be left running to do the preparation work.
  • 17:40 MEH: removing pstamp and publishing from roboczar until ippc17/datastore back
  • 23:50 MEH: looks like ipp025 has given up around 23:14.. SMP oops.. cannot log in, power cycled.. ipp025_log
    • seemed like an odd message on the boot screen about SATA: link online but device misclassified, retrying; it takes a bit and then says it is setting the link to 1.5Mbps.. seemed okay in the end

Tuesday : 2013.04.16

mark is czar

  • 06:17 Bill: postage stamp and update pantasks have been moved to ippc14 and started up.
  • 06:30 MEH: looks like registration may have had hangups over the early morning again.. may not make it through all processing before the network goes down ~09:00. MD done, so running tweak_ssdiff.
  • 8:13 Bill: set rcDestination.dbhost = 'ippc19' and restarted distribution pantasks.
  • 09:50 MEH: all processing stopped/shutdown including czartools for Rita/Gavin to swap switch connecting ippc01-ippc16.
  • 10:50 Haydn replaced BBU for raid in ipp064
  • 10:55 Haydn adding 32G memory to ippc13
  • 13:00 Rita/Gavin finished the switch rewiring; ippc01-ippc16 are back up and running. Starting reboots of the stsci machines to load the new 3.7.6 kernel. Can do a normal reboot procedure from the console since it won't power down the disks. We don't have access to power management for a power cycle (a good thing), so Rita is sticking around at MHPCC in case a power cycle is needed.
    sudo tcsh 
    su -
    sync;sync;sync; umount -a; df -h
    reboot -f
    uname -a
    
    • stsci00 has an aspensys user logged in, so started with stsci02, which came back up. stsci03, 04, 05, 07 all needed to do local disk scans and are okay.
  • 13:30 MEH: Gavin noticed the stsci02 network bond config under 3.7.6 wasn't the same as Aspen had configured it; fixed, and he is rebooting all the stsci nodes again to correct this.
  • 14:20 MEH: restarted mysql on ipp019, 020
  • 14:30 MEH: ippc13 down again; Gavin rebooted it, maybe a bad memory module (Haydn and Gavin will run memtest later). Taking it out of processing.
  • 16:40 MEH: Gavin notes ippc07 is running at a network speed of 100Mbit.. taking it out of the nebulous host set and out of processing; someone will look at it tomorrow.
    • while looking at the load problem, noticed that pstamp loads its hosts multiple times in the input file rather than the default of once, with the number of loadings defined in the pantasks_hosts.input file
  • 17:50 MEH: now everything stalled/faulting..
    • nebulous tools and such not working.. stopping processing and restarting apache on ippc01-ippc09 (loop sketch below). Things seemed to clear up when ippc08 and ippc09 were done.
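    • roughly what the apache restart amounts to (a sketch; assumes a standard apache2 init script on these hosts and a bash shell):
      for h in ippc01 ippc02 ippc03 ippc04 ippc05 ippc06 ippc07 ippc08 ippc09; do
          ssh $h "sudo /etc/init.d/apache2 restart"   # restart the nebulous apache on each server
      done
      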
  • 18:30 MEH: MOPS reporting they are still missing 3 exposures, but nothing is showing up. Once the system gets going again, will try to trace them back.
  • 22:30 MEH: processing has been behaving like the parking brake is on.. many red 20TB disks; have seen similar behavior before when this happened, but it seems more extreme now.
    • stsci nodes have been fine with the new kernel, so setting neb-host to up with a note about the new kernel. No improvement, but now skycell products will go there to ease space usage on the other data nodes:
      neb-host stsci00 up --note "upgraded kernel to 3.7.6"
      
    • even cleanup is painfully slow, so maybe an issue with nebulous or raid speed on a large portion of machines -- enabling a few more of the commented-out nebservers (ippc04, ippc06) but leaving ippc07 out due to its network issue
    • cleared old ippc17.0 stalled mounts
    • stsci has a dqstats process that uses the datastore; it had a problem with a missing dir, so set it to off and emailed Bill
    • removing the ThreePi.WS.nightlyscience label to focus on primary nightly science
    • raidstatus?
      cat ~ipp/raidstatus/*
      --> ipp057 raid partially degraded, but someone put it into repair @0923 while it rebuilds
      

Wednesday : 2013.04.17

Bill is czar today. Oh joy.

  • 06:40 processing is still painfully slow. camera_exp processing is taking > 4000 s. The ones I checked were all done with the real work and were ever so slowly running the
     foreach file (@outputs) { check fits and replicate }
    
  • The problem is either network throughput or a nebulous bottleneck. Clearly having so many nodes working is not helping. I'm going to turn off a bunch and see if that improves
  • In email Gene suggests shutting everything down and restarting the nebulous apache servers. stopping stdscience
  • 07:36 changed my mind. I'm going to wait until I get to the office ~8:10 to restart things
  • 08:55 Bill stopped all processing. cleanup was stopped a couple of hours ago yet still has jobs running. Many blocked threads in nebulous mysql processlist. command = 'commit'
  • 09:20 Serge stopped the nebulous apaches. nebulous mysql threadlist emptied. Restarted apaches
  • 09:36 Bill restarted registration, pstamp, and stdscience pantasks. Will start others shortly.
  • 10:11 registration finished soon after restart. All pantasks are up except cleanup, deepstack, detrend, and replication
  • 10:11 ippc07 network connectivity was repaired. Added it to the ~ipp nebulous server list
  • Evidence is mounting that the problem with our throughput is in the nebulous database on ippdb00. Looking at the processlist, it appears that queries are getting blocked for several seconds.
  • 10:50 increased camera poll limit to 100. Since these processes are taking the better part of an hour to do the replication it doesn't hurt to get them all started.
  • 11:45 stopped processing in preparation for switching nebulous to the mysql on ippdb04
  • 12:14 change of plans. starting processing again
  • 12:30 Serge: I quote Gavin:
    I found that "Write" caching was not enabled on ippdb00.
    I also found that ippdb00 firmware version is different
    from ippdb02. (I recall updating ippdb02 firmware when we
    were conducting SATA HDD compatibility tests.)
    ippdb00 - Firmware Version = FH9X 4.10.00.007
    ippdb02 - Firmware Version = FH9X 4.10.00.027
    I've enabled "write" cache on ippdb00. 
    

    And since we had suspicion about them: the battery backup unit (BBU) in ippdb02 was replaced once (Nov 2011; the date corresponds to the work log on svnwiki), while the ippdb00 BBU looks to be the original.
    ippdb02
    /c4/bbu Battery Installation Date = 11-Nov-2011
    /c4/bbu Last Capacity Test        = 24-Feb-2013
    ippdb00
    /c4/bbu Battery Installation Date = 24-Aug-2009
    /c4/bbu Last Capacity Test        = 14-Apr-2013
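    For reference, a sketch of how these settings can be checked and changed, assuming the controllers are 3ware units managed with tw_cli (which the /c4/bbu output above suggests); the unit number u0 is an assumption:
      tw_cli /c4 show             # controller summary, including per-unit cache status
      tw_cli /c4/bbu show all     # battery install date, last capacity test, health
      tw_cli /c4/u0 set cache=on  # enable the write cache on unit 0 (the kind of change Gavin describes)
      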
  • 12:35 Serge: Processing normal... Back to nfs errors
  • 14:27 started up cleanup pantasks
  • 15:00 MEH: tweak_ssdiff for MD05 now that stacks finished. ready to queue up replacement/redo diffims for MOPS for three "missing" exposures under a .redo label/data_group
    • missing exposures were due to nightly_science being set up to get out whatever diffims are available as soon as possible -- o6398g0178o, o6398g0179o were flagged as fault 110 and thought lost, so it proceeded to make diffims in the 163.OSS.R.Q.w ps1_15_4072 and 163.OSS.R.Q.w ps1_15_4049 quads with whatever was available (visits 1-3 and 2-3 respectively). MOPS wants the full diffim set, visits 1-2 and 3-4.
    • manually queued with respective exp_id as
      difftool -dbname gpc1 -definewarpwarp -exp_id 600199  -template_exp_id 600440  -backwards -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2013/04/16 -set_dist_group SweetSpot -set_label OSS.nightlyscience.redo -set_data_group OSS.20130416.redo -set_reduction SWEETSPOT -simple
      
      -- once done, manually publish (not sure if we need to push to distribution; not going to unless requested) -- because the data_group is special/unique, the publish can just use that:
      pubtool -dbname gpc1 -definerun -set_label OSS.nightlyscience -data_group OSS.20130416.redo -client_id 5 -simple -pretend
      
  • 17:10 CZW: As things seem to be running again, I'm running some test SAS stacks. These are largely running in my pantasks on ippc25 (/home/panstarrs/watersc1/pantasks_debugging/sideband_science/ is the directory with the ptolemy.rc file), but I am running a set of stacks on the standard stack pantasks with label czw.SAS12.ipp and priority 5 (because I wanted to ensure it was lower priority than anything else).
  • 18:00 MOPS notes two exposures missing (o6399g0497o, o6399g0498o -- a single visit on a quad from last night) -- they were waiting on a WSdiff (not a multi-day WWdiff) that was sitting in a long publish queue from all the interrupted processing.

Thursday : 2013.04.18

Bill is czar today

  • 06:35 stdscience processing is finished except for ~50 stacks that faulted. Reverted them
  • added compute3 nodes to the cleanup pantasks. Cleanup for warp and diff is done, chip is proceeding. dist.cleanup is off because of the ippc17 outage. I may have to revisit this.
  • 11:50 MEH: enabling MD08.refstack.20130401 now since there have been no complaints the past couple of weeks -- will need to set up some older warps for update to get diffims made
  • 12:15 CZW: Serge stopped by to ask why a diff wasn't queued for a pair of exposures he manually pushed through the camera stage. I checked the nightly_science.pl result, and realized that this trick probably wasn't widely known. The command I ran was:
      nightly_science.pl --verbose --debug --date 2013-04-18 --queue_diffs --this_target_only OSS
      
    This does all the database queries and state checks, and then displays which difftool commands it thinks need to be run for this date/target combination. The output contained the line for the diff that Serge was looking for:
      /home/panstarrs/ipp/psconfig/ipp-20130307.lin64/bin/difftool -dbname gpc1 -definewarpwarp -input_label OSS.nightlyscience -template_label OSS.nightlyscience -backwards -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2013/04/18 -set_dist_group SweetSpot -set_data_group OSS.20130418 -simple -set_label OSS.nightlyscience -exp_id 601333 -template_exp_id 601345 -set_reduction SWEETSPOT
      
    The reason this wasn't done automatically is that nightly_science had decided that all diffs possible for the date were finished and stopped polling before Serge fixed the exposures.
  • 12:45 MEH: taking compute2+stare out of processing for memory swaps/upgrades by Haydn
  • 13:45 MEH: MD09 updates finished; chips sent to cleanup and stdscience restarted (a regular restart was needed anyway, and need to include the MD08 refstack)
  • 13:50 MEH: system mostly back to normal operation, adding publishing and pstamp back to czarconfig.xml for roboczar
  • 15:06 set ippc30 in the stdscience hosts to off because it is hosting the data store. This change is probably temporary.
  • 16:05 MEH: putting stare+c2 back into processing; keep an eye on stare during nightly data

Friday : 2013.04.19

  • 08:00 MEH: nightly finished, stare+c2 out of processing for more work today. when?
  • 11:52 Bill: set 13471 chipRuns in state error_cleaned to goto_cleaned and added their label goto_cleaned.rerun to the cleanup pantasks
  • 13:37 Bill: ippc17 is now the data store server again. ippc30 has the postage stamp server web services and working directories. The database is on ippc19
  • 14:12 Bill: added distribution cleanup back into the cleanup pantasks
  • 14:29 Bill: repaired broken flat image gpc1//flatcorr.20100124/GPC1.FLATTEST.300/GPC1.FLATTEST.300.XY24.co.fits by copying the good copy to the bad one's location
  • 17:20 MEH: /data/ipp043.0 xfs error and inaccessible, overly full. Gavin rebooted, Gene cleared off some data. /data/ipp053.0 may be close to the same state; should remove non-essential data from the 20T data nodes. Set neb-host repair for both (see sketch below); it may help, but they are so full that nebulous shouldn't be putting anything there anyway.
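    • for reference, the repair settings applied (same neb-host syntax as the stsci example above; the --note text here is illustrative):
      neb-host ipp043 repair --note "xfs errors, disk nearly full"
      neb-host ipp053 repair --note "disk nearly full"
      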
  • 21:36 Bill: summit copy seems to be getting 100% timeouts from conductor
  • 21:53 SCRATCH that. It's nebulous that's timing out
  • 23:00 MEH: tried to connect to mysql on ippdb00 and got ERROR 1040 (00000): Too many connections; stopped summit copy and then could connect.. the mysql connection limit may need changing, or a restart? (see the sketch at the end of this entry)
    • stopping all attempts at processing; only summitcopy is really trying, and 600 jobs are queued?? something odd.
    • neb connections on ippdb00 slowly decreasing
    • ippc01 can't mount /data/ipp043.0 -- likely need to refresh the mount on many systems..
    • yeah, /data/ipp043.0 mount glitch on the apache servers -- restarting apache and remounting cleared things up.. probably messed up other things... and so continues the regular nightly problem whack-a-mole
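    • a sketch of the max_connections check/change being considered above, run from a mysql prompt on ippdb00 (standard MySQL; the value 4000 is only an example, not something that was actually set):
      SHOW GLOBAL VARIABLES LIKE 'max_connections';
      SHOW GLOBAL STATUS LIKE 'Max_used_connections';
      SET GLOBAL max_connections = 4000;  -- takes effect immediately but is lost on mysqld restart
      -- to make it persistent, set max_connections under [mysqld] in my.cnf
      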
  • 00:21 MEH: nightly data processing proceeding; need to add stare+c2 back into processing, I guess

Saturday : 2013.04.20

  • 16:20 EAM : ippc14 had hung mounts on ipp043. I killed the pending update tasks and umounted /data/ipp043.0 (rough sequence sketched below); that seems to have cleared up updates somewhat.
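    • roughly the sequence for clearing a hung mount like this (a sketch; the commands are standard, and the pids are whatever update jobs fuser reports as stuck):
      fuser -vm /data/ipp043.0                 # list processes holding the mount
      kill <pids of the stuck update jobs>     # placeholder; use the pids reported above
      umount /data/ipp043.0                    # or 'umount -l /data/ipp043.0' for a lazy unmount if it still refuses
      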
  • 21:04 Bill: reset.chip in update pantasks. A postage stamp request was stuck due to a crashed job in the book....
  • .... moved pstamp and update pantasks back to ippc17. ippc14 has too much work to do as a processing node.

Sunday : 2013.04.21

  • 09:15 EAM : I have set up pantasks / rsync jobs running on ipp031 & ipp032 to copy their data to the stsci node directories (a rough sketch follows). I have spread the output data across all stsci0X.Y, so the impact should be only ~2% on each of those nodes' disks. I have set those two machines to 'repair' in nebulous so they will not get new data.
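    • a rough bash sketch of what such a spreading job can look like (not the actual pantasks setup; the target volume list, source path, and remote user are assumptions):
      # round-robin the ipp031 data directories across the stsci volumes
      targets=(stsci00.0 stsci00.1 stsci01.0 stsci01.1)   # ... continue through stsci0X.Y
      i=0
      for d in /data/ipp031.0/gpc1/*; do
          t=${targets[$((i % ${#targets[@]}))]}            # pick the next target volume
          rsync -a "$d" "ipp@${t%.*}:/data/$t/gpc1/"       # host name is the volume name minus the .Y suffix
          i=$((i + 1))
      done
      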