PS1 IPP Czar Logs for the week 2013-12-16 - 2013-12-22


Monday : 2013-12-16

  • 10:20 Bill: restarted stdscience with STS.rp.2013 label added.
  • 13:13 Bill: I'm running a psphot full-force test using compute3 and wave4; since there are only 14 runs, this should not be very visible.
  • 16:59 CZW: restarted replication pantasks to use the rawcheck.pro module to scan all raw OTA data and replicate and cull as appropriate. Use rawcheck.show.date and rawcheck.set.date DATE to monitor progress.
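    • For reference, a minimal sketch of the monitoring (the date shown is just an example): connect with the replication pantasks client and issue the rawcheck commands at the prompt
      pantasks_client -c ~ipp/replication/ptolemy.rc
      rawcheck.show.date
      rawcheck.set.date 2013-12-16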
  • 17:50 MEH: running local trunk staticsky+skycal on MD04, taking 1x c3 from stack since LAP processing will be behind STS for a day or so

Tuesday : 2013-12-17

  • 12:29 CZW: STS has finished warp stage, so I stopped stdscience to rebuild psLib to pick up the changes to the psMixtureModel code. Restarting stdscience now.

Wednesday : 2013-12-18

  • 07:50 Bill: turned chip revert off. There are 6 chips with missing burntool tables that I want to fix.
    • 08:08 It was actually only one missing burntool table (--exp_id 178815 --class_id XY34). The other faults were due to ipp050 misbehaving temporarily ('cannot instantiate logfile' and other similarly nebulous errors). Revert is back on.
    • 08:09 Fixed a broken pstamp dependency. diff_id 504256 (from just last week) was marked as state 'cleaned', but several skycells' data state was 'full'. The dependency checker does not expect this combination. Worked around this by changing the state of the run to 'update'.
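      • For reference, the workaround is of this form (a sketch only, assuming difftool accepts the same -updaterun/-set_state options as the chiptool calls elsewhere in this log):
        difftool -dbname gpc1 -updaterun -set_state update -diff_id 504256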
  • 08:45 Bill: restarted pstamp and update pantasks; their pcontrols were spinning
    • 08:58 ...and since the restart, 6596 apparently backlogged jobs have run.
  • 22:50 MEH: STS warps finishing up; LAP updates well underloaded -- stdsci needs its regular restart. Did distribution as well.

Thursday : 2013-12-19

mark is czar

  • 08:00 MEH: chip.revert.off to fix some burn.tbl files for STS -- as in the normal case, the burn.tbl instances appear to be zero-length or missing, BUT the primary and secondary copies are both on the ippbXX machines ONLY. If this is common for all, it may be a problem.
    • all have a rogue file to copy over, so primary and secondary are still only on ippbXX (see the copy sketch after the list below)
      o5354g0426o 	XY34 	178903 	925065 
      neb://ipp045.0/gpc1/20100607/o5354g0426o/o5354g0426o.ota34.burn.tbl
            1 d41d8cd98f00b204e9800998ecf8427e file:///data/ippb01.2/nebulous/7a/c2/929738415.gpc1:20100607:o5354g0426o:o5354g0426o.ota34.burn.tbl
            0                     NON-EXISTANT file:///data/ippb02.0/nebulous/7a/c2/929746459.gpc1:20100607:o5354g0426o:o5354g0426o.ota34.burn.tbl
      -->
         300606 | /data/ippb02.2/nebulous/7a/c2/929746459.gpc1:20100607:o5354g0426o:o5354g0426o.ota34.burn.tbl
      
      
      o5354g0388o 	XY06 	178865 	925027 
      neb://ipp023.0/gpc1/20100607/o5354g0388o/o5354g0388o.ota06.burn.tbl
            1 d41d8cd98f00b204e9800998ecf8427e file:///data/ippb01.2/nebulous/11/ac/929738221.gpc1:20100607:o5354g0388o:o5354g0388o.ota06.burn.tbl
            0                     NON-EXISTANT file:///data/ippb02.0/nebulous/11/ac/929749698.gpc1:20100607:o5354g0388o:o5354g0388o.ota06.burn.tbl
      -->
         252446 | /data/ippb02.2/nebulous/11/ac/929749698.gpc1:20100607:o5354g0388o:o5354g0388o.ota06.burn.tbl
      
      
      o5354g0381o 	XY13 	178858 	925020 
      neb://ipp024.0/gpc1/20100607/o5354g0381o/o5354g0381o.ota13.burn.tbl
            1 d41d8cd98f00b204e9800998ecf8427e file:///data/ippb01.1/nebulous/99/e1/929738267.gpc1:20100607:o5354g0381o:o5354g0381o.ota13.burn.tbl
            0                     NON-EXISTANT file:///data/ippb02.2/nebulous/99/e1/929746247.gpc1:20100607:o5354g0381o:o5354g0381o.ota13.burn.tbl
      -->
         235117 | /data/ippb02.1/nebulous/99/e1/929746247.gpc1:20100607:o5354g0381o:o5354g0381o.ota13.burn.tbl
      
      
      o5354g0366o 	XY13 	178843 	925005 
      neb://ipp024.0/gpc1/20100607/o5354g0366o/o5354g0366o.ota13.burn.tbl
            1 d41d8cd98f00b204e9800998ecf8427e file:///data/ippb01.2/nebulous/1a/2c/929738232.gpc1:20100607:o5354g0366o:o5354g0366o.ota13.burn.tbl
            0                     NON-EXISTANT file:///data/ippb02.2/nebulous/1a/2c/929746241.gpc1:20100607:o5354g0366o:o5354g0366o.ota13.burn.tbl
      --> 
         222820 | /data/ippb02.1/nebulous/1a/2c/929746241.gpc1:20100607:o5354g0366o:o5354g0366o.ota13.burn.tbl
      
      
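    • The repair sketched here is an assumption, not the exact commands used: copy the rogue file over the path nebulous expects for the broken instance, then verify it is no longer empty (d41d8cd98f00b204e9800998ecf8427e is the md5sum of a zero-length file); the zero-byte primary on ippb01 would need the same treatment. Using the first exposure above, run where the /data/ippb02.* volumes are visible:
      cp /data/ippb02.2/nebulous/7a/c2/929746459.gpc1:20100607:o5354g0426o:o5354g0426o.ota34.burn.tbl \
         /data/ippb02.0/nebulous/7a/c2/929746459.gpc1:20100607:o5354g0426o:o5354g0426o.ota34.burn.tbl
      md5sum /data/ippb02.0/nebulous/7a/c2/929746459.gpc1:20100607:o5354g0426o:o5354g0426o.ota34.burn.tbl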
  • 10:57 CZW: Mark pointed out an exposure that was stuck with chipRun.state = 'error_cleaned' and warpRun.state = 'update'. This blocks LAP somewhat, as it waits for a warp that is waiting for a chip that isn't updating because chiptool refuses to update data that isn't in state 'cleaned'. This has happened on occasion before, and my solution has been to issue the commands generated by the output of /data/ippc18.0/home/watersc1/PV2.LAP.20130717/fixes/scan_chips_for_errorCleaned.sql. An example output is listed below. The run_update_for_chip_id.pl script issues the chiptool/camtool/warptool commands to set things to be updated.
     chiptool -updaterun -set_state cleaned -chip_id 574277
     chiptool -updaterun -set_state cleaned -chip_id 574396
     chiptool -updaterun -set_state cleaned -chip_id 574444
     chiptool -updaterun -set_state cleaned -chip_id 837183
     /home/panstarrs/watersc1/bin//run_update_for_chip_id.pl 574277
     /home/panstarrs/watersc1/bin//run_update_for_chip_id.pl 574396
     /home/panstarrs/watersc1/bin//run_update_for_chip_id.pl 574444
     /home/panstarrs/watersc1/bin//run_update_for_chip_id.pl 837183
    
  • 16:35 MEH: regular restart of stdsci before nightly
  • 23:20 MEH: something is stalling LAP processing/cleanup... cam stage and cleanup are reaching >3 ks; also ~6 LAP ids are stalled waiting for a pending update -- will look into this later

Friday : 2013-12-20

mark is czar

  • 09:00 MEH: pausing everything to track down what was stalling processing late last night -- this includes all of the extra processing.
    • CZW_notes has details of some of the extra processing -- turning these off:
      --- convolved stack cleanup -- uses compute3
      pantasks_client -c ~watersc1/this_is_where_pantasks_lives/ptolemy.rc
      
      --- rawcheck running in replication using wave2-4 + 2_weak
      pantasks_client -c ~ipp/replication/ptolemy.rc
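      
      For reference, either of these can be paused from its pantasks prompt rather than killed (standard pantasks control; a sketch):
        pantasks_client -c <ptolemy.rc path above>
        stop      (pause task scheduling)
        status    (check remaining jobs)
        run       (resume later)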
      
    • ippc01,c02,c03 -- /var/log/apache2/error_log shows a growing number of "server seems busy" messages recently -- only on c01-c03; are these being manually targeted in addition?
      [Fri Dec 20 11:05:25 2013] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 8 children, there are 0 idle, and 21 total children
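      
      To gauge how frequently the warning is appearing (simple sketch, run on each of ippc01-c03):
        grep -c "server seems busy" /var/log/apache2/error_log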
      
      • restarted apache on all 3, cleared larger nebulous_server.log on ippc02
        sudo /etc/init.d/apache2 stop
        sudo rm /tmp/nebulous_server.log ; sudo touch /tmp/nebulous_server.log ; sudo chown apache /tmp/nebulous_server.log ; sudo chmod g+w /tmp/nebulous_server.log ; ls -l /tmp/*log
        sudo /etc/init.d/apache2 start
        
    • ipp044 was running neb_rawOTA_host_scan.pl with long (>10 ks) query times in the nebulous DB -- used kill -STOP to pause it, but it appears to have died or finished (on its last set?) after about an hour
      | 19865251 | ipp | ipp044.ifa.hawaii.edu:48902 | nebulous | Query | 15915 | Sending data | SELECT ins_id,so_id,uri FROM instance WHERE vol_id = 39 AND ins_id > 4280476494 LIMIT 10000 |
      
      | 19865251 | ipp | ipp044.ifa.hawaii.edu:48902 | nebulous | Query | 11300 | Sending data | SELECT ins_id,so_id,uri FROM instance WHERE vol_id = 39 AND ins_id > 4284170025 LIMIT 10000 |
      
      --- cmdline when kill -STOP
      watersc1  2130 30746  0 Dec16 pts/0    00:01:06 perl ./neb_rawOTA_host_scan.pl --limit 10000 --continue --host ipp044 --min 2479121495
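      
      For reference, a sketch of how to spot such long-running queries and pause the scanner (the nebulous DB host and user here are placeholders; the PID is the one from the ps line above):
        mysql -h NEB_DB_HOST -u USER -p nebulous -e 'SHOW FULL PROCESSLIST'
        ssh ipp044 kill -STOP 2130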
      
      
  • 13:10 MEH: things are moving more now but still at a reduced rate.
  • 14:00 nope, breaking down again -- turning off the neb_rawOTA_host_scan.pl nodes and setting them to repair
  • 15:20 MEH: last night found that ~1/3 of the LAP runs (20770-20775) were stalled by a messed-up chip state (goto_cleaned while the warps were full) -- all of the chips were still untouched, so they were set back to full and the LAP runs cleared for stacks (a query sketch for finding such cases follows the commands below)
    chiptool -dbname gpc1 -updaterun -set_state full -set_label LAP.ThreePi.20130717 -chip_id 925815
    chiptool -dbname gpc1 -updaterun -set_state full -set_label LAP.ThreePi.20130717 -chip_id 925808
    chiptool -dbname gpc1 -updaterun -set_state full -set_label LAP.ThreePi.20130717 -chip_id 925809
    chiptool -dbname gpc1 -updaterun -set_state full -set_label LAP.ThreePi.20130717 -chip_id 925810
    chiptool -dbname gpc1 -updaterun -set_state full -set_label LAP.ThreePi.20130717 -chip_id 925811
    chiptool -dbname gpc1 -updaterun -set_state full -set_label LAP.ThreePi.20130717 -chip_id 925812
    
    chiptool -dbname gpc1 -updaterun -set_state full -set_label LAP.ThreePi.20130717 -chip_id 925813
    chiptool -dbname gpc1 -updaterun -set_state full -set_label LAP.ThreePi.20130717 -chip_id 924307
    chiptool -dbname gpc1 -updaterun -set_state full -set_label LAP.ThreePi.20130717 -chip_id 925814
    
    chiptool -dbname gpc1 -updaterun -set_state full -set_label LAP.ThreePi.20130717 -chip_id 925813
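    
    Such cases could be located with a query along these lines (a sketch only, assuming the usual gpc1 chipRun -> camRun -> fakeRun -> warpRun linkage):
      SELECT c.chip_id, c.state AS chip_state, w.warp_id, w.state AS warp_state
      FROM chipRun c
      JOIN camRun  cam ON cam.chip_id = c.chip_id
      JOIN fakeRun f   ON f.cam_id    = cam.cam_id
      JOIN warpRun w   ON w.fake_id   = f.fake_id
      WHERE c.label = 'LAP.ThreePi.20130717'
        AND c.state = 'goto_cleaned'
        AND w.state = 'full';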
    
  • 18:30 core processing is better -- rawcheck and conv stack cleanup back on
  • 22:00 MEH: no nightly so far, preparing for regular stdscience restart once LAP triggers stacks
    • the source of last night's problem is unclear; the LAP rate is now >100/hr and seems higher than we have had recently -- possibly a bad intersection of too many file-operation processes; may want to turn some of these extras off during nightly processing
    • manually turned off nodes again -- ipp031,032 (rsyncs still?); 015, 050 (generally problematic, need reboot soon?); ipp042, 043, 044, 046, 047, 048 (running neb_rawOTA_host_scan.pl)

Saturday : 2013-12-21

  • 14:45 MEH: repairing LAP error_cleaned per Chris' script above
    mysql -h ippdb01 -uippuser -pxxxxx gpc1 <  /data/ippc18.0/home/watersc1/PV2.LAP.20130717/fixes/scan_chips_for_errorCleaned.sql
    
    chiptool -updaterun -set_state cleaned -chip_id 577754 -dbname gpc1
    chiptool -updaterun -set_state cleaned -chip_id 578312 -dbname gpc1
    chiptool -updaterun -set_state cleaned -chip_id 578329 -dbname gpc1
    chiptool -updaterun -set_state cleaned -chip_id 858023 -dbname gpc1
    /home/panstarrs/watersc1/bin//run_update_for_chip_id.pl 577754
    /home/panstarrs/watersc1/bin//run_update_for_chip_id.pl 578312
    /home/panstarrs/watersc1/bin//run_update_for_chip_id.pl 578329
    /home/panstarrs/watersc1/bin//run_update_for_chip_id.pl 858023
    
  • 14:50 MEH: LAP rate still okay, preparing for regular restart of stdsci before nightly
    • ipp015, 050 out of all processing; not clear if they were part of the problems the other night
    • ipp042,043,044,046,048 out of processing while some are running neb_rawOTA_host_scan.pl; not clear if they were part of the problems the other night -- this will reduce summitcopy and registration, so some may need to be test-placed back into use
    • stsci19 set to neb-host repair while dvo is using it heavily
    • ipp015, 050 set to neb-host repair since they are having issues
    • ipp042,043,044,046,047,048 set to neb-host repair while neb_rawOTA_host_scan.pl runs, in case it was causing issues the other night

Sunday : 2013-12-22

  • 09:00 Bill restarted stdscience, update, and pstamp pantasks
  • 14:50 EAM: things looked a little slow, but there is no obvious culprit. Things seem to be slow in camera, so I bumped up the polling there (now to 40).