PS1 IPP Czar Logs for the week 2013.05.06 - 2013.05.12


Monday : 2013.05.06

  • CZW 11:50: over the weekend, I noticed that a large number of warp jobs were failing on ipp030. I disabled that host in my pantasks and ignored it. Looking at the logs today, it appears that /local/ipp was a symlink pointing to the non-existent /export/ipp053.0/ipp. I pointed it at the correct /export/ipp030.0/ipp, and it should work again.
  • 15:50 SC: Restored the apache configuration on ippc01, c02, and c03 and gracefully restarted apache there. The number of active connections is more reasonable now.
  • 15:55 SC: Changed the configuration of ippdb00 to allow up to 1300 connections: 1300 = 10 apache clients * 128 connections + 20 extra connections. I haven't restarted the mysql server though:
    mysql> show variables like "max_connections";
    +-----------------+-------+
    | Variable_name   | Value |
    +-----------------+-------+
    | max_connections | 1024  |
    +-----------------+-------+
    1 row in set (0.00 sec)

    mysql> set global max_connections = 1300;
    Query OK, 0 rows affected (0.02 sec)

    mysql> show variables like "max_connections";
    +-----------------+-------+
    | Variable_name   | Value |
    +-----------------+-------+
    | max_connections | 1300  |
    +-----------------+-------+
    1 row in set (0.00 sec)
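A note on persistence: SET GLOBAL only lasts until mysqld next restarts, so the same limit would also need to land in the server's config file to survive a restart. A minimal fragment (the file path varies by install; /etc/my.cnf is an assumption, not the known location on ippdb00):

```ini
# /etc/my.cnf (path is an assumption -- use whatever config ippdb00 actually reads)
[mysqld]
max_connections = 1300
```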
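The /local/ipp symlink repair from the 11:50 entry boils down to `ln -sfn`. A minimal sketch against a scratch directory rather than the real /local and /export paths, so it can be tried safely anywhere:

```shell
# Demonstrate fixing a dangling symlink with ln -sfn,
# using a scratch directory in place of the real /local/ipp paths.
tmp=$(mktemp -d)
mkdir -p "$tmp/export/ipp030.0/ipp"

# Simulate the broken state: the link points at a non-existent target.
ln -s "$tmp/export/ipp053.0/ipp" "$tmp/local_ipp"
readlink "$tmp/local_ipp"     # prints the dangling target

# -s symbolic, -f replace the existing link, -n treat the old link as a file
# (without -n, ln would follow the old link instead of replacing it)
ln -sfn "$tmp/export/ipp030.0/ipp" "$tmp/local_ipp"
readlink "$tmp/local_ipp"     # now points at the real directory
test -d "$tmp/local_ipp" && echo "link OK"
rm -rf "$tmp"
```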

Tuesday : 2013.05.07

  • 01:20 MEH: looks like exposures are missing from last night; adding them to registration (2013-05-06 gpc1 14). A few burntool fixes and a faulted chip missing from the OTIS report are now moving through, as well as some reverts to clean up the red faults.

Wednesday : 2013.05.08

  • 05:45 CZW: registration crashed again around 2AM. Restarted.
  • 13:44 Bill: stopping processing for rebuild.
  • 14:00 Bill: Rebuilt the production tag with new recipes and optional auxiliary masking. Will start the first batch of STS reprocessing using the new recipes.
  • all pantasks restarted except for replication, deepstack, and detrend
  • 17:10 Bill: queued the first 115 STS test exposures; the label is STS.rp.20130508
  • 17:30 MEH: compute3 is in full use for MD09 deep stack processing. ippc29 (compute2) is also being used now that it has extra RAM (~90G), to test normal processing while the deep stacks run. The pantasks is running on ippc30 with logs in ~mhuber/IPP/local_deepstack, using a locally modified ipp-test-20130502 tag. Labels in gpc1: MD09.alldeeptest.20130507, MD09.cutdeeptest.20130508, MD09.deeptest.20130509

Thursday : 2013.05.09

mark is czar

  • 00:40 MEH: registration crashed again ~00:30. ipp052 is out of processing; if it goes down again, move registration to ippc13
  • 06:20 Bill: queued another batch of STS exposures.
  • 11:20 MEH: moving the registration (ipp052) and stack (ippc05) pantasks both to ippc13, making a dedicated server out of processing
    • scratch that -- ippc04 is out of processing due to half RAM (16/32G), so try running registration and stack there in order to use ippc13 as a full processing node
  • 12:15 MEH: other host modifications -- many entries in ~ipp/ippconfig/pantasks_hosts.input are neither commented nor explained in the czar logs; must add notes.
    • ipp028,030 back into processing; it looks like they were taken out for a disk swap and never put back in
    • ipp026,027,029,037 out for old crash issues; will try adding them back slowly while monitoring processing later tonight/tomorrow
    • ipp014,019 also have crash issues, but without a date noted or with an older kernel; will try adding them back slowly while monitoring processing later tonight/tomorrow
    • ipp011,012,018 may no longer be "weak" if they are using the new kernel
    • 2x compute3 back in stdscience; should be okay alongside the deep stacks, may add more later
    • ippc26,c27,c28 have 48G now and ippc29 has 90G, so there is now a compute2_himem group that can be added for deepstack processing (used 3x in stdsci and 1x in update, so removing 1x from stdsci)
  • 14:10 MEH: ipp017 seems to have mount issues with ipp004,005,015,009,032,038,049,058,059,063,064... dvomerge is running; emailed heather.
    • restarted NFS and it seems to have recovered after 10-15 min
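For the note-keeping promised in the 12:15 entry, one option is dated `#` comments alongside each host entry in pantasks_hosts.input. The annotation style below is only a suggestion (the host entries themselves keep whatever syntax the file already uses):

```
# 2013.05.09 MEH: ipp028,030 back in -- were out for disk swap, never restored
# 2013.05.09 MEH: ipp026,027,029,037 out for old crash issues; re-add slowly
# 2013.05.09 MEH: ippc26-c28 now 48G, ippc29 90G -> compute2_himem for deepstacks
```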
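Mount problems like the ipp017 one can be triaged by probing each NFS mount with a bounded `stat`: a stale mount hangs, so the timeout is what makes the probe safe. A sketch assuming a Linux client with /proc/mounts and the coreutils `timeout`:

```shell
# Walk the mount table, keep only NFS entries (field 3 is the fs type,
# field 2 the mount point), and probe each one with a 5-second limit.
awk '$3 ~ /^nfs/ {print $2}' /proc/mounts | while read -r mnt; do
    if timeout 5 stat "$mnt" >/dev/null 2>&1; then
        echo "ok    $mnt"
    else
        echo "STALE $mnt"
    fi
done
```

On a healthy client this prints an "ok" line per NFS mount; any "STALE" line is a candidate for the kind of NFS restart used above.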

Friday : 2013.05.10

mark is czar

  • 07:00 MEH: nightly data through:
    • ipp014,019 with new kernel back in full and seem okay so far
    • ipp013,015 were commented out without any info other than being used for ipptopsps, which they aren't; so they are also back in and seem okay so far
    • registration and stack pantasks on ippc04 seems okay so far
  • 15:00 MEH: working to clear some warp backlog before nightly science (warps are lagging chips by ~20/hr)

Saturday : 2013.05.11

  • 16:00 MEH: doing regular restart of stdscience before nightly science starts

Sunday : 2013.05.12

  • 23:30 MEH: apparently the local deepstack pantasks on ippc30 crashed with a bus error ~14:04... jobs are still running, and many will be for at least another 10 hrs, so it will not be restarted until morning.