PS1 IPP Czar Logs for the week 2014.06.02 - 2014.06.08


Monday : 2014.06.02

  • 22:45 MEH: power cycle ippdb01, nothing on console
    • takes ~3 min to get through the mem test with a static boot screen.. looked like one of the RAIDs may be degraded? the mysql crash seems to have recovered, per /var/log/mysql/mysqld.err
    • ganglia temperature plots seem to indicate a ramp-up, with one sensor reaching ~84F before the crash -- overheat level?
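The temperature ramp above could be caught before a crash with a simple threshold check on the ganglia samples. A minimal sketch, assuming readings have already been pulled as (timestamp, degF) pairs; the sample data and the 82F alert level are made up for illustration (only the ~84F crash level comes from the log):

```python
# Flag samples approaching the ~84F level seen before the ippdb01 crashes.
# Alerting a little below the observed crash temperature gives time to react.
ALERT_F = 82.0  # assumed alert threshold, below the observed ~84F

def over_threshold(samples, limit=ALERT_F):
    """Return the (timestamp, degF) samples at or above the limit."""
    return [(t, f) for t, f in samples if f >= limit]

# invented readings for illustration; real ones would come from ganglia
readings = [("22:00", 78.5), ("22:20", 81.0), ("22:40", 83.6)]
print(over_threshold(readings))  # [('22:40', 83.6)]
```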

Tuesday : 2014.06.03

mark is czar

  • 00:10 MEH: summitcopy and registration jobs stalling 7ks, pztool revert commonfaults and revert reg seem to clear them but still not moving.. been a long time since a restart anyway, so restarted summitcopy, registration, stdsci. might finally be moving again after manually resetting pending_burntool for a few exposures
  • 00:30 MEH: czarpoll and roboczar crash on ippc11 from ippdb01 crash -- restarted
  • 00:50 MEH: since watching things still, might as well restart distribution and pstamp since have also been a while
  • 08:30 MEH: repeat warp fault for memory issue -- faulting on ipp063, turned that machine off and it ran on ipp042 successfully
    o6811g0522o 	747855 	1006029 	975131 	951386 	957441 	skycell.1225.068 
  • 08:50 MEH: looks like exposure o6811g0389o is stuck in registration -- exposure gone from the summit; it was a bad exposure and was re-observed as o6811g0390o. need to drop it
    • pzDownloadExp already set to drop.. odd?
    • obs_type is DARK, which is where the "missing" dark came from -- set to broken:
      update summitExp set exp_type = 'broken', imfiles=0, fault =0 where exp_name = 'o6811g0389o'; 
  • 09:00 EAM: running relastro across the cluster so I stopped staticsky
  • 18:00 EAM: relastro finished with the storage + compute nodes, so I've restarted staticsky
  • 23:40 MEH: of course, another node down -- ipp033, for ~1200s -- power cycle and back up, neb-host repair for the night
    • catching up with processing
  • 01:30 MEH: ippdb01 temperature spiked near 84F like the last time it crashed.. hopefully it won't tonight

Wednesday : 2014.06.04

mark is czar

  • 07:30 MEH: fault 5 WS diffims to clear
  • 11:00 MEH: going to switch the PV1 to PV2 stacks for WS diffims now that LAP PV2 is finished (except for pole) -- mod made to stdscience/input, svn commit, restart stdsci
    • of course, with the label change it will make new diffims.. so need to stop stdsci -- keeping the ones made for comparison, dropping all others, then cleanup
    • label/data_group change to .skipnewtemplate --
  • 21:40 MEH: looks like someone didn't turn off the storage nodes in staticsky before nightly OSS data started.. system overrunning a bit..
    • long-running jobs, even worse when overloaded.. starting to kill them off manually -- particularly the ones in cab5; ippdb01 cpu2 running up to 83F
  • 22:34 EAM : I just remembered to call '' for staticsky. I need to automate this...

Thursday : 2014.06.05

chris is czar

  • 06:40 EAM : turned on storage nodes for static sky.
  • 15:26 Bill: reverted 283 faulted staticsky runs

Friday : 2014.06.06

  • 11:35 MEH: cleared a stalled fault 5 diffim so it isn't forgotten over the weekend, and cleaned up..
    difftool -dbname gpc1 -updatediffskyfile -set_quality 14006 -skycell_id skycell.2609.037 -diff_id 558693  -fault 0
  • 14:37 Bill: did some relabeling of some of the LAP staticsky runs:
    • 8669 runs LAP.ThreePi.20130717 -- abs(glat) > 20 || number of detections in PV1 <= 100,000
    • 5082 runs LAP.ThreePi.20130717.100k -- 100,000 < number of detections in PV1 < 200,000
    • 9510 runs LAP.ThreePi.20130717.dense -- >= 200,000 detections in PV1, or no data from PV1
    • the standard label can run with the current host configuration
    • the 100k label will go > 32G memory if we run more than 3 at a time
    • the dense label will need to run 1 at a time on the nodes
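The relabeling rule above can be sketched as a small classifier. This is only an illustration of the thresholds as listed in the log; the function name is invented, and the precedence when abs(glat) > 20 overlaps a high detection count is my assumption (standard label wins):

```python
def staticsky_label(n_detections_pv1, abs_glat):
    """Assign a LAP staticsky label from the PV1 detection count.

    n_detections_pv1: detections in PV1 for the run's footprint, or None
    if PV1 had no data there; abs_glat: |galactic latitude| in degrees.
    """
    if n_detections_pv1 is None:
        return "LAP.ThreePi.20130717.dense"    # no PV1 data -> dense
    if abs_glat > 20 or n_detections_pv1 <= 100_000:
        return "LAP.ThreePi.20130717"          # standard host config is fine
    if n_detections_pv1 < 200_000:
        return "LAP.ThreePi.20130717.100k"     # > 32G if more than 3 at a time
    return "LAP.ThreePi.20130717.dense"        # run 1 at a time per node

print(staticsky_label(50_000, 35.0))   # LAP.ThreePi.20130717
print(staticsky_label(150_000, 10.0))  # LAP.ThreePi.20130717.100k
print(staticsky_label(None, 5.0))      # LAP.ThreePi.20130717.dense
```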

Saturday : 2014.06.07

Sunday : 2014.06.08

  • 20:00 EAM : I've shut down and restarted staticsky. I added the LAP.ThreePi.20130717.100k label. These jobs will use > 16GB each. I modified the host lists to use only 1 job on the 24GB or 32GB machines and 2 on each of the 48GB machines. This is slightly conservative, but will avoid thrashing. There are only 90 machines in this list, so if the memory usage is not too high, I should add another job to the 32GB machines.
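The host-list sizing above amounts to a simple slot count. A back-of-the-envelope sketch: the jobs-per-memory-tier rule is from the log, but the per-tier machine counts below are invented placeholders (the log only says ~90 machines total):

```python
# Concurrent staticsky job slots implied by the per-tier rule:
# 1 job on 24GB and 32GB machines, 2 jobs on 48GB machines.
jobs_per_host = {24: 1, 32: 1, 48: 2}   # RAM in GB -> jobs per machine

def total_slots(hosts_by_ram_gb):
    """hosts_by_ram_gb: {ram_gb: machine_count} -> total concurrent jobs."""
    return sum(jobs_per_host[ram] * n for ram, n in hosts_by_ram_gb.items())

# hypothetical even split of the ~90 machines across the three tiers
print(total_slots({24: 30, 32: 30, 48: 30}))  # 120
```

Bumping the 32GB tier to 2 jobs each, as contemplated above, would add one slot per 32GB machine.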