PS1 IPP Czar Logs for the week 2015.10.05 - 2015.10.11

Extra/Non-standard Processing

Continuing HAF's suggestion to improve communication: a list in the czar pages of additional (non-standard) processing, so that we all know what's going on.

Daily Czaring:

  • currently a modified ops tag is running diffs (WS labels only) as ippqub (was ippmops) under ~ippqub/src/stdscience_ws on ippc06 (was ippc29) -- if there are problems (Njobs > 100k, power loss on ippc06, etc.), it will need to be restarted like a normal nightly-processing pantasks:
    ./start_server.sh stdscience_ws
    
    • even with the modified ops tag, there are still various files MIA in nebulous, so the daily czar still needs to check and clear them, as has been discussed before

MD processing:

  • ippmd/stdscience is running WS diffs without writing images -- using ippx065-x096 (hosts_xmd) -- stop it as necessary, but always communicate when doing so


Monday : 2015.10.05

  • CZW: 12:00 I plan to start a large-scale scan of disks for nebulous trash and orphan files. This will likely run on all IPP storage nodes and will slow I/O down. There is some nebulous database overhead, but it's all indexed look-ups, so nebulous should not be overloaded.
  • CZW: 14:30 This is now running, via watersc1 screen sessions on stare04 that ssh to each storage node and run the scan code (a sketch of the launch pattern is below).
  • CZW: adding label PV2.cleanup to the cleanup pantasks. This will clear PV2 camera data. Some of these are failing because there is goto_cleaned data that is not cleaning properly. I've moved that to the label goto_cleaned_CZWwait, but the book will need to clear out.
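  • a sketch of the scan launch pattern described above (the storage-node names and the scan-code path below are placeholders, not the actual code):
    #!/bin/bash
    # from stare04: start one detached screen session per storage node;
    # each session ssh-es to its node and runs the scan code there
    for node in ipp027 ipp055 ipp067; do                    # hypothetical node list
      screen -dmS "nebscan_${node}" \
        ssh "${node}" "nice /path/to/neb_orphan_scan"       # scan-code path is a placeholder
    done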

Tuesday : 2015.10.06

Wednesday : 2015.10.07

  • 17:30 EAM: since we are tight on space, I'm running gzip on the cpm files for some of the minis on the worst machines: ipp027, ipp055, ipp067, ipp081, ipp073 (a sketch is below)
  • 17:40 EAM: stopping and restarting pantasks for the night
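  • a minimal sketch of the gzip pass noted at 17:30 (the /data/${host}.0 path and the *.cpm name pattern are assumptions about where the cpm files live):
    #!/bin/bash
    # compress cpm files on the fullest storage nodes to recover some space
    for host in ipp027 ipp055 ipp067 ipp081 ipp073; do
      ssh "${host}" "find /data/${host}.0 -name '*.cpm' -exec gzip -v {} +"
    done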

Thursday : 2015.10.08

  • 09:30 EAM: running updates on the outstanding WS diff inputs. I've written a script in ~ipp to handle this situation so we don't have to keep re-inventing the commands. The script is called "~ipp/fix.ws.nightlyscience". Here are examples of using it to find and fix the WS diff run inputs and then the diff runs (replace PASSWORD with the ippuser password):
    • list the state of the inputs to the outstanding WS diff runs (the three state columns are the chip, warp, and diff states, respectively):
      ipp@ippc19:/home/panstarrs/ipp>fix.ws.nightlyscience PASSWORD list
      +---------+---------+---------+------------------+---------+---------+-------+
      | chip_id | warp_id | diff_id | data_group       | state   | state   | state |
      +---------+---------+---------+------------------+---------+---------+-------+
      | 1676749 | 1624170 | 1218073 | SNIa.20151002    | cleaned | full    | new   | 
      | 1677272 | 1624691 | 1222131 | ThreePi.20151003 | full    | cleaned | new   | 
      | 1677283 | 1624702 | 1222307 | ThreePi.20151003 | full    | cleaned | new   | 
      | 1677582 | 1624994 | 1222718 | OSS.20151003     | cleaned | cleaned | new   | 
      +---------+---------+---------+------------------+---------+---------+-------+
      
    • generate chiptool and warptool commands to fix the problems above:
      ipp@ippc19:/home/panstarrs/ipp>fix.ws.nightlyscience PASSWORD update
      chiptool -dbname gpc1 -setimfiletoupdate  -set_label ws_nightly_update -chip_id 1676749
      chiptool -dbname gpc1 -setimfiletoupdate  -set_label ws_nightly_update -chip_id 1677582
      warptool -dbname gpc1 -setskyfiletoupdate -set_label ws_nightly_update -warp_id 1624691
      warptool -dbname gpc1 -setskyfiletoupdate -set_label ws_nightly_update -warp_id 1624702
      warptool -dbname gpc1 -setskyfiletoupdate -set_label ws_nightly_update -warp_id 1624994
      

You can cut and paste the above lines, redirect them into a script to be sourced, or pipe the output directly to csh. After running the above, you will see the states get updated:

ipp@ippc19:/home/panstarrs/ipp>fix.ws.nightlyscience PASSWORD list
+---------+---------+---------+------------------+--------+--------+-------+
| chip_id | warp_id | diff_id | data_group       | state  | state  | state |
+---------+---------+---------+------------------+--------+--------+-------+
| 1676749 | 1624170 | 1218073 | SNIa.20151002    | update | full   | new   | 
| 1677272 | 1624691 | 1222131 | ThreePi.20151003 | full   | update | new   | 
| 1677283 | 1624702 | 1222307 | ThreePi.20151003 | full   | update | new   | 
| 1677582 | 1624994 | 1222718 | OSS.20151003     | update | update | new   | 
+---------+---------+---------+------------------+--------+--------+-------+
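
For example, to apply the generated commands in one step, the output can be piped straight to csh as described above (shown here purely as an illustration):

ipp@ippc19:/home/panstarrs/ipp>fix.ws.nightlyscience PASSWORD update | csh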

The runs get set to the label 'ws_nightly_update', which I've added to the input script for stdscience, so it is always available.

After the inputs are processed, you may find the diff runs still fail (likely, as this is why they never completed in the first place). You can list the diff runs like this:

ipp@ippc19:/home/panstarrs/ipp>fix.ws.nightlyscience PASSWORD diffstate
+---------+------------------+---------------------------+------------------+-------+-------+
| diff_id | skycell_id       | label                     | data_group       | state | fault |
+---------+------------------+---------------------------+------------------+-------+-------+
| 1222131 | skycell.1565.069 | ThreePi.WS.nightlyscience | ThreePi.20151003 | new   |     2 | 
| 1222307 | skycell.1566.012 | ThreePi.WS.nightlyscience | ThreePi.20151003 | new   |     2 | 
| 1218073 | skycell.1051.065 | SNIa.WS.nightlyscience    | SNIa.20151002    | new   |     5 | 
| 1222718 | skycell.0885.058 | OSS.WS.nightlyscience     | OSS.20151003     | new   |     5 | 
+---------+------------------+---------------------------+------------------+-------+-------+

You can also examine the logs with the command:

fix.ws.nightlyscience PASSWORD difflogs   

Finally, the script will generate difftool commands to set the quality flags for the failures:

ipp@ippc19:/home/panstarrs/ipp>fix.ws.nightlyscience PASSWORD difffix
difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 66 -diff_id 1222131 -skycell_id skycell.1565.069
difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 66 -diff_id 1222307 -skycell_id skycell.1566.012
difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1218073 -skycell_id skycell.1051.065
difftool -dbname gpc1 -updatediffskyfile -fault 0 -set_quality 42 -diff_id 1222718 -skycell_id skycell.0885.058

In this case, the fault 2 entries had missing stack inputs, so they get quality 66, while the fault 5s were more traditional psphot failures.
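
Presumably the same pipe-to-csh shortcut applies to these commands as well, e.g.:

fix.ws.nightlyscience PASSWORD difffix | csh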

Friday : 2015.10.09

  • 07:10 MEH: clearing stalled warp
    warptool -dbname gpc1 -updateskyfile -set_quality 42 -skycell_id skycell.0941.094 -warp_id 1626232   -fault 0
    
  • 07:23 MEH: ipp026, ipp054, ipp082 show as down in ganglia but are not really down -- each needed /etc/init.d/gmond restarted
    • all had the common problem indicator of the file-max limit being reached, causing misc ipp/system processes that try to open files to behave badly (see the quick check below)
      Oct  9 06:25:39 ipp026 [1186637.290055] VFS: file-max limit 1644327 reached
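    • a quick way to confirm the symptom (standard Linux /proc counters, not IPP tools):
      cat /proc/sys/fs/file-nr    # allocated handles, free handles, and the system-wide limit
      cat /proc/sys/fs/file-max   # the limit quoted in the message above
      /etc/init.d/gmond restart   # then restart the ganglia daemon on the affected host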
      
  • 07:25 MEH: very large loads on various systems, likely because only ~5 data nodes again have space available for data..
  • 07:41 MEH: poor exposure, looks like rotator was moving during exposure
    camtool -dbname gpc1 -updateprocessedexp -set_quality 42 -cam_id 1646469   -fault 0
    
  • 09:00 MEH: mostly red across all but 4 data nodes up.. setting up cleanup of anything possible..
  • 09:10 MEH: starting the swap of ippmops to ippqub for WS diffs once nightly finishes
  • 12:10 MEH: running MOPS test for mk+wt compression
  • 13:50 MEH: Gene put ipp006,007,009,010,011,015,025,033,057,060 neb-host up from repair
    • BBU and/or write cache seem to be enabled for all, though 057 and 060 may have had problems in the past -- at 13:08 ipp057 reported a BBU failure, so will need to watch for overloading..
    • may want to keep an eye on the network, not sure where all of them are -- http://ifa-cacti.ifa.hawaii.edu
    • ipp015 may have had unresponsive behavior in the past; ipp025 recently had an xfs issue to keep an eye on
    • ipp033 is not a 40 TB but rather a 20 TB node as well..
  • 21:10 MEH: clearing stalled warp -- cannot build growth curve (psf model is invalid everywhere)
    warptool -dbname gpc1 -updateskyfile -set_quality 42 -skycell_id skycell.1040.041 -warp_id 1626978   -fault 0
    

Saturday : 2015.10.10

  • 07:33 MEH: more stalled warps
    warptool -dbname gpc1 -updateskyfile -set_quality 42 -skycell_id skycell.2169.078 -warp_id 1627337  -fault 0
    warptool -dbname gpc1 -updateskyfile -set_quality 42 -skycell_id skycell.2169.078 -warp_id 1627350  -fault 0
    
  • 10:05 MEH: ipp039 is down, nothing on the console -- power cycling -- it fails to boot, so powering it off and setting neb-host down

Sunday : 2015.10.11