PS1 IPP Czar Logs for the week 2011.11.14 - 2011.11.20

(Up to PS1 IPP Czar Logs)

Monday : 2011.11.14

  • 08:40 Mark: looks like ippc06 is down; it seems to be causing problems even though it is out of the nebulous hosts? Maybe the problem is in my config (yes, see ~ipp/.tcshrc). Rebooting it (no info on the console).
    Unable to access file neb://@HOST@.0/gpc1/SAS2.123/2011/07/26/RINGS.V3/skycell.1405.001/RINGS.V3.skycell.1405.001.stk.339528.target.psf: nebclient.c:535 nebFind() - no instances found
    
    -- but neb-stat says ok (see the instance-check sketch after this day's entries):
          1 409a428a70bcbbccd1528cecdd33c02a file:///data/ipp037.0/nebulous/f6/92/1132632500.gpc1:SAS2.123:2011:07:26:RINGS.V3:skycell.1405.001:RINGS.V3.skycell.1405.001.stk.339528.target.psf
          1 409a428a70bcbbccd1528cecdd33c02a file:///data/ipp027.0/nebulous/f6/92/1132694136.gpc1:SAS2.123:2011:07:26:RINGS.V3:skycell.1405.001:RINGS.V3.skycell.1405.001.stk.339528.target.psf
    
    
  • 14:50 Mark: stress testing the CPUs on ippc11 to check whether it can be used again after the heatsink reset. Using CPU1 only (up to 4 cores) may be possible; the full 8 cores lasted 3 minutes. It also isn't rebooting on power cycle again; leaving it off for a bit, and it came back after ~5-10 mins (still seems like a thermal issue). A minimal stress-loop sketch is included after this day's entries.
  • 16:16 Mark: crashed ipp029 with the same stress code (only 2 jobs) and got a kernel panic like the ones seen before; probably not a thermal issue.
  • 16:30 Mark: crashed ippc11 again with 6 jobs running. The temperature plots show the expected rise, then a drop-off after some 15 mins, then a spike on one CPU and the crash. Suspect ippc11 could now be used like ippc13 in the hosts_poor_compute group, at about 30-50% load.
  • 18:40 Mark: rebooted ipp029 (no info on console this time..)
  • 14:35 Bill: damaged the magicDSFile table in the database. We are working to recover that data from backup.
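
A quick way to chase down the mismatch in the 08:40 entry (nebFind() reporting no instances while neb-stat lists two) is to check whether the file:// instances neb-stat reports are actually present and non-empty on disk. The sketch below is a hypothetical helper, not an IPP tool; it assumes the /data/ippNNN.0 volumes are mounted on the host running it and that neb-stat output is piped in on stdin.

    #!/usr/bin/env python3
    # Hypothetical helper (not an IPP tool): read neb-stat output on stdin and
    # report whether each file:// instance is present and non-empty on disk.
    # Assumes the /data/ippNNN.0 volumes are mounted on this host.
    import os
    import sys
    from urllib.parse import urlparse

    def check_instances(lines):
        for line in lines:
            fields = line.split()
            if not fields or not fields[-1].startswith('file://'):
                continue
            path = urlparse(fields[-1]).path
            if not os.path.exists(path):
                print('MISSING %s' % path)
            elif os.path.getsize(path) == 0:
                print('EMPTY   %s' % path)
            else:
                print('OK      %s' % path)

    if __name__ == '__main__':
        check_instances(sys.stdin)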
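
For reference, the 14:50/16:16 stress runs do not need anything exotic to reproduce the load; the sketch below is an illustrative busy-loop burner (not the actual stress code used), where the number of worker processes stands in for the 2, 6, or 8 jobs mentioned above.

    #!/usr/bin/env python3
    # Illustrative CPU burner (not the actual stress code used): spin N worker
    # processes in a tight loop for a fixed time so per-core load and
    # temperatures can be watched while varying the worker count.
    import multiprocessing
    import sys
    import time

    def burn(seconds):
        t_end = time.time() + seconds
        x = 0
        while time.time() < t_end:
            x += 1  # busy work

    if __name__ == '__main__':
        nworkers = int(sys.argv[1]) if len(sys.argv) > 1 else 8
        seconds = float(sys.argv[2]) if len(sys.argv) > 2 else 600.0
        procs = [multiprocessing.Process(target=burn, args=(seconds,))
                 for _ in range(nworkers)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()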

Tuesday : 2011.11.15

Bill is czar today

  • 08:00 processing went smoothly last night with destreak and distribution turned off (to work around the broken table)

Wednesday : 2011.11.16

  • 10:00 Serge and Bill have finished restoring the missing rows from the magicDSFile table. LAP and update processing have been enabled.

Thursday : 2011.11.17

  • 07:50 Only 6 MD09 exposures taken last night
  • 11:40 Rebooted ipp029

Friday : 2011.11.18

  • 7-9 am Bill fixed the data-state problems that were preventing several LAP chipRuns from finishing.
  • 9:30 chip revert is off while Bill debugs an assertion failure
  • 10:02 48 warps were stuck because they depended on chips whose magicDSRun had state = 'update' but label = 'goto_cleaned'. Set the label to the proper LAP label (a sketch of this kind of fix follows Friday's entries).
  • 17:00 Serge started condor running MD06 chip->warp.
  • 17:30 Mark: shutting down pantasks and rebuilding the ops tag with Gene's psphotStack merges. Nightly science seems to have needed the restart as well. Looks like the new wave4 machines are in use as well.
  • 17:xx ippc11 went down from the condor runs; someone rebooted it. Also ran condor_off for ippc13 (weak machine) and ippc16 (stdscience pantasks).
  • 23:00 looks like ippc04 is being worked a bit too hard with processing plus running staticsky photometry from the stack pantasks. Setting the host off for a bit.
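
The 10:02 fix amounts to relabelling the affected magicDSRun rows. The sketch below shows roughly what such a one-off repair could look like; the connection details, the LAP label value, the magic_ds_id column name, and the run IDs are all placeholders, and a real fix would be scoped to the specific runs involved.

    #!/usr/bin/env python3
    # Hypothetical one-off repair sketch (not an IPP tool): set the label back
    # to the LAP label on magicDSRun rows that are in state 'update' but still
    # carry the 'goto_cleaned' label.  Connection details, the label value, the
    # magic_ds_id column name, and the run IDs are placeholders.
    import pymysql

    LAP_LABEL = 'LAP.example.label'   # placeholder, not the real LAP label
    RUN_IDS = (0,)                    # placeholder magic_ds_id values

    conn = pymysql.connect(host='ippdb01', user='ipp', password='XXXX', db='gpc1')
    try:
        placeholders = ','.join(['%s'] * len(RUN_IDS))
        sql = ("UPDATE magicDSRun SET label = %s "
               "WHERE state = 'update' AND label = 'goto_cleaned' "
               "AND magic_ds_id IN (" + placeholders + ")")
        with conn.cursor() as cur:
            cur.execute(sql, (LAP_LABEL,) + tuple(RUN_IDS))
        conn.commit()
    finally:
        conn.close()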

Saturday : 2011.11.19

Bill is accidentally czar this morning.

  • 09:00 gpc1 database overloaded: ~500 queries in the processlist. Stopped stdscience. Killed condor.
  • 09:40 process list not shrinking very fast. Killed all connections to gpc1; they were all SELECTs (a bulk-kill sketch is included after Saturday's entries).
  • 10:33 restarted processing and the database became sluggish quite quickly. Decided to restart it. Before that, made a dump of gpc1 in /export/ippdb01.0/bills/gpc1.dump.20111113T1116.sql. Did not bother to compress it (90 GB).
  • 13:00 shut down mysqld with mysqladmin shutdown, then restarted it with /etc/init.d/mysqld start. That init script is impatient and says that it didn't start, but it did. Restarted all pantasks except for deepstack and the addstars. There is a dvomerge that kept running through the shutdown (whew); Heather will restart addstar when it is done.
  • 13:30 Bill is resigning as czar.
  • 13:33 ... well, one last thing. Ran mkdir /local/ipp/tmp on the new nodes; the directories were missing and causing lots of faults.
  • 16:30 We're getting timeouts from the load tasks. I think the timeout value for several of the tasks (30s) might be too small. Restarted distribution to reset the counts.
  • 16:31 This doesn't sound good:
Running [/home/panstarrs/ipp/psconfig/ipp-20111110.lin64/bin/magicdstool -magic_ds_id 734026 -component XY61 -setmagicked -tofullfile -dbname gpc1]...


 -> p_psDBRunQuery (psDB.c:812): Database error generated by the server
     Failed to execute SQL query.  Error: Deadlock found when trying to get lock; try restarting transaction
 -> setMagicked (magicdstool.c:802): unknown psLib error
     database error
 -> change_file_data_state (magicdstool.c:1722): unknown psLib error
     setMagicked failed
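
The deadlock in the 16:31 output is the kind of transient error MySQL explicitly suggests retrying ("try restarting transaction"); magicdstool gives up instead, so one workaround is to re-run the command when that error appears. The wrapper below is only a sketch, with the retry count and back-off chosen arbitrarily.

    #!/usr/bin/env python3
    # Sketch of a retry wrapper: re-run a command a few times if its output
    # mentions a MySQL deadlock, since deadlocks are transient.  The retry
    # count and back-off are arbitrary choices.
    import subprocess
    import sys
    import time

    def run_with_retry(cmd, tries=3, delay=30):
        for attempt in range(1, tries + 1):
            proc = subprocess.run(cmd, capture_output=True, text=True)
            if proc.returncode == 0:
                return 0
            if 'Deadlock found' not in proc.stdout + proc.stderr:
                break  # a real failure; do not retry
            print('deadlock on attempt %d/%d, retrying in %ds'
                  % (attempt, tries, delay))
            time.sleep(delay)
        return proc.returncode

    if __name__ == '__main__':
        if len(sys.argv) < 2:
            sys.exit('usage: retry_wrapper.py <command> [args...]')
        sys.exit(run_with_retry(sys.argv[1:]))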
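
For the 09:40 step (killing the piled-up SELECTs on gpc1), doing it in bulk from the processlist is straightforward; the sketch below is hypothetical, with placeholder host and credentials, and it kills only threads whose current query is a SELECT, mirroring the note above.

    #!/usr/bin/env python3
    # Hypothetical bulk-kill sketch: walk the gpc1 processlist and kill threads
    # whose current query is a SELECT.  Host and credentials are placeholders.
    import pymysql

    conn = pymysql.connect(host='ippdb01', user='ipp', password='XXXX', db='gpc1')
    try:
        with conn.cursor() as cur:
            cur.execute('SHOW FULL PROCESSLIST')
            for row in cur.fetchall():
                thread_id, user, host, db, command, secs, state, info = row[:8]
                if (command == 'Query' and info
                        and info.lstrip().upper().startswith('SELECT')):
                    print('killing thread %s (%ss): %.60s' % (thread_id, secs, info))
                    cur.execute('KILL %d' % int(thread_id))
    finally:
        conn.close()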

Sunday : 2011.11.20