PS1 IPP Czar Logs for the week 2015.03.16 - 2015.03.22


Monday : 2015.03.16

  • 07:06 Bill: postage stamp server was a bit sluggish; pcontrol was spinning, so I restarted it.
    • Cleared some MPIA request faults due to problems at their data store. I need to start reverting fault 200 (HTTP 500), which their data store installation returns from time to time.
    • Lowered QUB's priority to be even with MPIA and MPG, since QUB has a large backlog of requests requiring update processing.

Tuesday : 2015.03.17

  • 06:00 EAM: stdscience running a bit slow, restarting; also restarting pv3diff
  • 10:21 Bill: disabled ipp-misc site on ippc17 which is where the getsmf script lives. It requires nebulous.
  • 17:20 EAM: Bill upgraded the ippdb00 mysql to 5.6 today. things restarted fine and seemed to run ok for a while. around 16:15, Bill reported errors going to the log. looking into these, I concluded we were having trouble with locks due to the InnoDB adaptive hash index (see http://dev.mysql.com/doc/refman/5.6/en/innodb-adaptive-hash.html). I turned off the adaptive hash with the following:
    mysql> set global innodb_adaptive_hash_index = 0;
    

So far, it looks like this is helping (we are not getting a build-up of long thread waits in the SEMAPHORES section of show engine innodb status\G).
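For reference, the semaphore check is just the standard InnoDB status query; long waits show up as "--Thread ... has waited at ... for N seconds" lines under SEMAPHORES (a sketch, not the exact monitoring used here):
    mysql> SHOW ENGINE INNODB STATUS\G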

Wednesday : 2015.03.18

Thursday : 2015.03.19

  • 05:25 EAM : processing was running glacially slow. the nebulous mysql on ippdb00 had a lot of stuck long-running jobs. I killed off the worst ones (which cleared the rest). Also very curious: there had been no entries in the log since around the time of the adaptive hash change above. without logging, it is hard to know what the problem was. I shut mysql down, added the adaptive hash value to the config file, and restarted mysql. things are running along fine now.
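    • The exact config entry isn't recorded here, but persisting the change is a one-line addition under [mysqld] (a sketch using the stock MySQL 5.6 option name):
      # /etc/mysql/my.cnf -- keep the adaptive hash index disabled across restarts
      [mysqld]
      innodb_adaptive_hash_index = 0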
  • 09:50 MEH: looks like some exposures stuck in registration -- restarting registration while looking into it
    o7100g0281o          null       0
    o7100g0287o          null       0
    o7100g0290o          null       0
    o7100g0293o          null       0
    o7100g0321o          null       0
    o7100g0333o          null       0
    
    • seems o7100g0281o is in pzDownloadExp with state=drop.. in fact all of them are.. unclear why, and why it wasn't logged..
  • 10:45 Bill: mysqld on ippdb00 (nebulous) had a hiccup. Hopefully this is because I changed a file that it had open and the program simply lost its place in it. OOPS.
    • Restarted the server and it seems back to normal.
    • the good news is that replication on ippdb02 is working now.
    • ippdb02 and ippdb06 are now running in a new, safer mode that mysqld warning messages suggested. Rather than using disk files for the replication state, it uses tables in the 'mysql' database (changes below; a status check is sketched after the config).
    # Changes to /etc/mysql/my.cnf to make replication more crash-resistant
    master-info-repository=TABLE
    relay-log-info-repository=TABLE
    relay-log-recovery=1
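    • A quick way to confirm a replica is healthy after a change like this is the standard status query; Slave_IO_Running and Slave_SQL_Running should both be Yes, and Seconds_Behind_Master shows the lag (a sketch, not necessarily the exact check used):
      mysql> SHOW SLAVE STATUS\G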

  • 11:10 MEH: dropped and stalled exposures of course cause WW diffims to be scrambled.. notified MOPS and running difftool cmds to redo once all OSS nightly warps and diffs are finished

Friday : 2015.03.20

  • 10:00 MEH: bad exposures o7101g0307o and o7101g0446o were used automatically for the WW diffims (v3-4).. should probably redo them.. -- Serge thinks that is okay for now. also notified QUB about possibly poor WSdiff
  • 00:00 MEH: looks like ippdb00 behaving poorly since ~23:40 with nightly data -- going to stop pv3diff and see if that helps

Saturday : 2015.03.21

  • 05:11 Bill: ippdb00 had very little free memory. Slow log filling up quickly. Restarted mysqld. Changed the definition of a slow query from 1s to 10s (see the sketch after this item).
    • throughput looks more normal after restart
    • ran regtool -revertprocessedimfile to fix some registration faults.
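    • The slow-query threshold change is a single global variable in MySQL 5.6; a minimal sketch (a matching long_query_time entry in my.cnf is also needed for it to survive a restart):
      mysql> SET GLOBAL long_query_time = 10;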
  • 06:15 Bill: nebulous database replication on ippdb02 has now caught up with the master.
  • 07:00 EAM: ippdb00 has a low load so I'm turning pv3diffs back on.
  • 09:35 MEH: clearing bad warp skycells (cannot build growth curve (psf model is invalid everywhere)) so MOPS and QUB get their data
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1514268  -skycell_id skycell.0854.088
    
    warptool -dbname gpc1 -updateskyfile -fault 0 -set_quality 42 -warp_id 1514279   -skycell_id skycell.0855.063
    
  • 10:36 Bill: restarting the postage stamp server and mysql on ippc17
  • 10:45 MEH: all nightly pantasks to stop to rebuild ops tag -- and then regular restart
  • 11:02 Bill: Rebuilt the replicated copy of the isp database on ippc17 using a mysql dump. Replication is now proceeding.
  • Earlier Gene fixed the czarlog page to look at ippdb06 instead of the non-existent ippdb04, so the czartool page loads much better now.
  • 11:20 MEH: ops tag rebuild finished -- pantasks restarted and set to run
  • 11:30 MEH: pantasks to stop again to rebuild tag
  • 12:50 MEH: pantasks to run, SNIa finished starting MD backlog
  • 14:45 MEH: restarting stdsci to start MD WSdiff backlog
  • 17:30 MEH: restarting stdsci and stack for MD nightly stack setup -- night stack will run on m0,m1,x0b,x1b until nodes are defined for it
    • SSdiff will run in the morning as normally done

Sunday : 2015.03.22

  • 00:45 Bill: nebulous mysqld memory use blew up again. Restarted mysql with a 40GB buffer pool (config sketch below). This leaves more headroom for the system.
    • 01:19 processing rate is now more reasonable. Ganglia is quite red. Getting quite a few I/O faults from various tasks. Don't see a pattern in the storage nodes that are targets for the failures.
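    • For the record, a 40GB pool is the usual InnoDB buffer setting; the exact stanza on ippdb00 isn't logged, so this is only a sketch:
      # /etc/mysql/my.cnf -- cap InnoDB memory use (takes effect on restart in 5.6)
      [mysqld]
      innodb_buffer_pool_size = 40G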
  • 06:10 EAM: restarting pv3diff pantasks
  • 07:30 Bill: all science exposures have been downloaded and registered
  • 09:45 MEH: clearing diffims from fault -- cannot build growth curve (psf model is invalid everywhere)
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0687.072 -diff_id 800110 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0687.072 -diff_id 800122 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0687.072 -diff_id 800158 -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.0687.072 -diff_id 800170 -fault 0
    
  • 10:15 MEH: MD SSdiffs running -- it isn't clear whether the nightly nodes are shut off in pv3diff, but the set is small and the load seems ok
    • many timeouts -- probably want to include some sort of startdate (>2015) for at least the diffim query in the nightly_science script at some point; that will also prevent PS1SC data from being picked up for various reasons..
  • 12:50 MEH: pantasks stop while rebuilding ops tag ippconfig -- restart reg+stdsci pantasks to clear old dates
  • 21:00 Bill: It looks like the database may not be in great shape again.
    • The mysqld processlist has over 700 entries.
    • 21:08 Going to set stdscience to stop for a few minutes to see if it stabilizes.
    • 21:16 no change in db backlog.
    • 21:20 There are 2 database threads that have been running simple selects for > 7000 and 12000 seconds respectively.
    • Killed them and now the number of outstanding threads has shrunk back to normal. See ~bills/mysqld.stuckthreads (query sketch below).
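    • Finding and killing the offenders is a standard processlist exercise; a minimal sketch (the thread id shown is hypothetical):
      mysql> SELECT id, time, state, LEFT(info, 60) AS query
               FROM information_schema.processlist
              WHERE command <> 'Sleep'
              ORDER BY time DESC;
      mysql> KILL 1234567;  -- hypothetical id taken from the list above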