PS1 IPP Czar Logs for the week 2016.04.11 - 2016.04.17

(Up to PS1 IPP Czar Logs)

Monday : 2016.04.11

  • 15:00 CZW: Processing was slow last night, but none of the main tasks seemed to be failing at an excessive rate. There are more timeouts in the pztool -pendingexp task, so perhaps that was the limiting step. I've rebuilt the ipp user's ippTasks and ippTools to pick up the r39527 changes I've made to this task/tool. There is now a -dateobs_begin/end argument to restrict the dates considered. The task uses a 30-day look-back to set dateobs_begin, which was the agreed-upon length from last week's email chain. A quick test suggests a factor-of-four speed improvement with this restriction. I plan on relaunching the ipp pantasks at 16:00 today, at which point this change will be active.
  • 15:14 CZW: Saw that some daytime exposures were being downloaded by summitcopy, so I restarted it to pick up the change early and check for problems before tonight. It seems to work fine.
  • 16:25 CZW: Cleanup took some time to finish up. Restarting ipp pantasks.
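The 30-day look-back above is simple to sketch. A minimal example of computing the -dateobs_begin value (the argument name is from the log entry; the ISO date format and the command construction are assumptions, not pztool's documented interface):

```python
from datetime import date, timedelta

def pendingexp_window(today, lookback_days=30):
    """Compute the -dateobs_begin value for the pending-exposure query.

    Restricting the query to the last 30 days is what gave the reported
    factor-of-four speedup; the date format here is an assumption.
    """
    return (today - timedelta(days=lookback_days)).isoformat()

# Hypothetical invocation sketch using the restricted window:
cmd = ["pztool", "-pendingexp",
       "-dateobs_begin", pendingexp_window(date(2016, 4, 11))]
print(cmd[-1])
```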

Tuesday : 2016.04.12

  • 14:22 MEH: revised pantasks_hosts.input to use s5 again since Gene reported the dvo calibration is mostly finished, and added s6 as well -- also updating the data targets -- will restart the nightly pantasks ~1600 to use these
    • seeing ipp008 off and others on... looks like a conflict in how the hosts_ignore_storage group is used -- a host in the _ignore_ group also needs to be commented out of the active group... this has probably been a problem for a while...
    • summitcopy/registration are now at ~72 nodes -- probably excessive for summit, but it provides a sample of poor nodes to prune (possibly/likely the set with rsyncs running..) -- after also pruning the rack set that likes to power cycle, this is down to 67 nodes
    • stdscience is at ~560 after pruning the rack set that likes to power cycle (some of which were already supposed to be pruned and actually weren't... -- ipp008,012,013,014,016,018,037)
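The active/ignore conflict noted above could be caught mechanically. A hedged sketch, assuming the two host groups can be parsed into plain sets (the group semantics follow the log; the parsing itself is not shown, and the example memberships are illustrative):

```python
def group_conflicts(active_hosts, ignore_hosts):
    """Return hosts listed in both the active group and an _ignore_
    group; any such host should be commented out of the active group."""
    return sorted(set(active_hosts) & set(ignore_hosts))

# Illustrative memberships (not the actual pantasks_hosts.input contents):
conflicts = group_conflicts(
    ["ipp008", "ipp012", "ipp050"],   # active group
    ["ipp008", "ipp013", "ipp014"])   # hosts_ignore_storage
print(conflicts)
```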
  • 15:10 CZW: Started wave3 rsyncs on stare00. Beginning fairly slowly, with 5x4x10MB transfers, to get an idea of the speed and loading.
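Reading "5x4x10MB" as 5 hosts x 4 streams x a 10 MB/s cap per stream (an assumption about the notation), the aggregate ceiling is easy to sanity-check:

```python
def aggregate_cap_mb_s(hosts, streams_per_host, bwlimit_mb_s):
    """Upper bound on combined rsync throughput, assuming every
    stream saturates its per-stream bandwidth cap."""
    return hosts * streams_per_host * bwlimit_mb_s

print(aggregate_cap_mb_s(5, 4, 10))  # at most 200 MB/s aggregate
```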
  • 19:12 MEH: ipp050 crashed, nothing on the console -- not responding to power cycle yet, so set neb-host down (from repair) before it messes up nightly processing..
  • 20:13 MEH: it appears ipp100-104 were NOT fully set up to run pantasks jobs... missing the /local/ipp link etc..
    • oddly, so is ipp078 -- missing the /local/ipp symlink to /export/ipp078.0/ipp -- why aren't things like this being checked... this probably explains some faults in the past when ipp078 was in use, before its dvo use...
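The missing-symlink problem above is checkable. A sketch of such a check, assuming the /export/<host>.0/ipp convention seen on ipp078 generalizes to the other compute nodes (that generalization is an assumption):

```python
import os
import socket

def expected_ipp_target(host):
    """Expected /local/ipp target under the /export/<host>.0/ipp
    convention noted in the log."""
    return "/export/%s.0/ipp" % host

def local_ipp_ok(link="/local/ipp"):
    """True if this node's /local/ipp symlink exists and points at
    the conventional target for this hostname."""
    host = socket.gethostname().split(".")[0]
    return os.path.islink(link) and os.readlink(link) == expected_ipp_target(host)
```

Run across the cluster (e.g. before adding hosts like ipp100-104 to pantasks), this would have flagged the mis-set-up nodes ahead of time.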
  • 23:10 MEH: things are running better tonight, only ~40 exposures behind @2230 versus >70 the past few nights (so similar to earlier last week, though still not back to the ~60 exposures/hr that was normal several weeks ago) -- noticed another apache server had NOT been turned on to replace ippc05 when it was removed for disk/RAID issues; ippc02 was fixed after its disk/RAID and power supply issues at the end of 2015, so I have uncommented it in ~ipp, ~ippqub/.tcshrc
    • ippc02 also originally hosted the stdscience pantasks, and probably should be returned to that role as well, since ippc01 has been seen with a good amount of load at times...
    • wondering if there are other pantasks running that do not use the same apache for nebulous -- maybe we should stop apache on those nodes to block any such cases... -- yep... ~ipptest and ~ipplanl were both out of alignment with ~ipp/~ippqub, using ippc02 rather than ippc05, so using ippc02 again puts them all on the same proper apache servers..
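The apache/nebulous misalignment across pantasks users could be flagged the same way. A hedged sketch, assuming each user's configured apache host can be read into a dict (user names are from the log; the config parsing is not shown, and the example values are illustrative):

```python
def misaligned_users(apache_by_user, reference_user="ipp"):
    """Return users whose nebulous apache host differs from the
    reference user's, i.e. candidates for realignment."""
    ref = apache_by_user[reference_user]
    return sorted(u for u, h in apache_by_user.items() if h != ref)

# Illustration of the state described in the log before the fix:
state = {"ipp": "ippc05", "ippqub": "ippc05",
         "ipptest": "ippc02", "ipplanl": "ippc02"}
print(misaligned_users(state))
```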

Wednesday : 2016.04.13

  • 12:10 CZW: Started additional rsync processes on ipp033 to clear that host faster. It's already ignored in pantasks, so this shouldn't cause processing issues.
  • 17:58 HAF: daily restart of pantasks.
    Summary of Haydn's work (from various emails):
    I've got ipp048 back online. Please exercise it vigorously if possible.
    I'll have ipp050 back online with one less CPU soon.
    I'd like to power cycle ippc05 to swap a disk in it, and then it might work again.
    ippc05 is back online, and will be rebuilding its RAID for the next hour or so.
    ipp085 is back online with a different RAID battery, which I tested earlier this week.

Thursday : 2016.04.14

  • 20:00 EAM : stopping pantasks for restart

Friday : 2016.04.15

Saturday : 2016.04.16

  • 19:55 EAM : stopping pantasks for restart

Sunday : 2016.04.17

  • 06:40 MEH: looks like registration has been down since before 0400.. restarting it
    • pantasks segfault:
      [17680772.288433] pantasks_server[31319]: segfault at d61b48 ip 0000000000408a4e sp 0000000042021f20 error 4 in pantasks_server[400000+16000]