PS1 IPP Czar Logs for the week 2014-02-24 - 2014-03-02


Monday : 2014-02-24

  • 10:37 Bill set stdscience to stop. It is desperately in need of a restart.
    • serge came by and reported that the postage stamp server is slow. Restarted it and added a set of compute3 nodes.
    • 10:44 stdscience restarted with the standard (low-power) host set
    • restarting staticsky
  • 14:45 CZW: quick note on how I investigated the issue from last night:
    • Step 1: check that registration was working correctly: regpeek.pl
    • Step 2: investigate the state of diffs: nightly_science.pl --debug --verbose --queue_diffs --date 2014-02-24. This command does not actually execute anything, but runs the full logic check to see what's up. The interesting line is: "diff_queue: Number of input warps to make diffs is not even for target OSS and object ps1_14_1783! 4 : I should declare an exposure to be qualityy.", which, despite the spelling error, points out that object ps1_14_1783 has 5 warps, so one of them needs to be excluded (because diffs require an even number of exposures). However, the exclusion is done based on exp_id (for historical reasons), which kicks out the visit 1 instance (a rough sketch of this selection behavior follows below).
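    • A rough Python sketch (not the actual IPP/nightly_science.pl code) of the warp-selection behavior described above; the choice to drop the warp with the lowest exp_id is an assumption made only for illustration:
       # hedged sketch: reduce an odd warp count to an even one for diff queuing
       def select_warps_for_diffs(warps):
           """warps: list of dicts with 'exp_id' and 'visit' for one target object."""
           if len(warps) % 2 == 0:
               return warps                      # already even, nothing to exclude
           # assumption: the exclusion is keyed on exp_id, not on visit number
           warps = sorted(warps, key=lambda w: w['exp_id'])
           return warps[1:]                      # drop one warp so diffs can be paired

       # example: 5 warps for an object like ps1_14_1783 -> 4 warps, visit 1 excluded
       warps = [{'exp_id': 700000 + i, 'visit': i + 1} for i in range(5)]
       print([w['visit'] for w in select_warps_for_diffs(warps)])   # prints [2, 3, 4, 5]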

Tuesday : 2014-02-25

  • 10:54 Bill: daily restart of pstamp server
  • 11:27 Bill: stopping staticsky in preparation to rebuild the tag to support distribution of full force runs
    • set back to run; I will do this after lunch.
    • 12:59 set to stop
  • 16:30 Bill: rebuilt ippsky's ipp and started up a pantasks to distribute the full force runs. This shouldn't take too long to complete.
    • 16:38 set staticsky to run
  • 17:30 CZW: In an attempt to keep stdscience from falling behind, I've made the following changes:
    • stdscience:
       hosts add wave2; hosts off ignore_wave2
       hosts add wave2; hosts off ignore_wave2
       hosts add wave3; hosts off ignore_wave3
       hosts add wave3; hosts off ignore_wave3
       hosts add compute2; hosts off ignore_compute2
      
    • staticsky:
       hosts off wave3
      

Wednesday : 2014-02-26

  • 05:10 Bill: processing is pretty far behind again. Set staticsky to stop. Rate may have climbed a little bit.
    • 06:10 summit copy backlog 67
  • 07:30 Bill: added 2 sets of compute3 nodes to stdscience; removed the LAP label from stdscience
    • pcontrol is spinning; I'm not sure that pantasks is keeping things loaded as well as it could. Status commands are mostly timing out.
    • really really need to do a daily restart
  • 07:53 now we are getting a very large number of database-related faults. detselect commands are failing and ipptool -pending queries are timing out.
  • 07:58 Bill: I'm going to restart stdscience
    • 08:05 restarted stdscience, adding the deepstack hosts (staticsky is off)
  • 10:13 Bill: with permission from the czar (gene) I am adding the STS label to stdscience. I will watch for memory problems. (I could not reproduce the result that we got on Friday)
    • the memory problem is back. It seems that some chips (XY24 and XY31 at least) for some exposures get really wacky results, generating lots of detections (>300,000), which causes memory to grow very large (>20GB) when writing the cmf file. Unfortunately, the targets for these chips (ipp016 and ipp020) have only 24 GB of memory, so they can't handle multiple ppImage instances. Further debugging is required.
  • 10:55 Gene is going to restart things without the STS label.
  • 10:55 EAM : I found a situation which may be causing pcontrol to get extra slow and sticky: hosts which appear to be hung. I've added a bit of debugging to understand this better, so I'm stopping everything and restarting. In restarting, I'm keeping the STS label out since it was causing some trouble with memory explosions (2009 data on a specific chip).
  • 16:33 Bill: ippdb01 has apparently gone down. Setting pantasks to stop
  • 21:00 EAM : ippdb01 was rebooted by Haydn and was working ok by about 17:20; I restarted processing at that time, and it ran for a while. However, it crashed around 18:20. I was able to power cycle it and it came back online @ 20:30, but I'm not certain how stable it will be. there are error messages in /var/log/messages which look like problems with the disks. I have reduced operations to summitcopy, registration, stdscience, publications, and pstamp. hopefully this will be enough to get data to MOPS.

Thursday : 2014-02-27

  • 22:20 Bill: fixed a stuck diffRun by manually setting the quality flag; the code should really do this automatically (a sketch of scripting this fix follows after this list). difftool -updatediffskyfile -set_quality 14006 -diff_id 527324 -skycell_id skycell.1636.050 -fault 0
  • survey.relexp and relstack tasks have been enabled again after dropping 300 duplicate LAP stacks and 1 duplicate cam run.
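  • A hedged Python sketch of how the manual fix above could be scripted for a batch of stuck diff skycells; the difftool arguments are copied verbatim from the entry above, while the list of (diff_id, skycell_id) pairs is assumed to come from whatever query identified the stuck runs (not shown here):
     # hedged sketch: apply the quality-flag fix to a list of stuck diff skycells
     import subprocess

     STUCK_SKYCELLS = [
         (527324, "skycell.1636.050"),   # example taken from the log entry above
     ]

     for diff_id, skycell_id in STUCK_SKYCELLS:
         cmd = ["difftool", "-updatediffskyfile",
                "-set_quality", "14006",
                "-diff_id", str(diff_id),
                "-skycell_id", skycell_id,
                "-fault", "0"]
         print(" ".join(cmd))                # log the command being issued
         subprocess.run(cmd, check=True)     # raises if difftool exits non-zero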

Friday : 2014-02-28

  • 10:13 Bill: restarted postage stamp server. pcontrol was spinning, so jobs were running very slowly.

Saturday : 2014-03-01

Sunday : 2014-03-02

  • 11:00 EAM : I am stopping all processing to do a major revamp on the way we allocate hosts. more in a moment...
  • 12:45 EAM :
    last week, we were having some trouble keeping up with nightly science.  
    we did not have enough machines at night to support the necessary exposure rate.  
    when I tried to make adjustments, I decided it was difficult to see how best 
    to balance staticsky and stdscience with the machines organized as they were.  
    therefore, I have made a significant change to the way the pantasks hosts are allocated.
    
    one source of confusion has been that not all 'wave1' or 'compute3', etc, machines 
    are alike or even very similar.  to avoid overloading some in the same class, we 
    have been either underloading others or going through some silly gymnastics.
    
    I have re-sorted the list of machines into groups by compute and storage based on 
    their total memory (and a bit on the number of cores).  The hosts are now 
    in the following groups:
    
    storage: 
    s0 : 16GB or 20GB + 8 cores
    s1 : 24GB + 8 cores
    s2 : 32GB + 8 cores
    s3 : 40GB or 48GB + 12 cores
    
    compute:
    c0 : 24GB + 8 or 12 cores 
    c1 : 32GB + 8 cores
    c2 : 48GB + 12 cores
    
    (c1 is split for now into c1a and c1b, where c1a are the apache servers)
    
    Next, I looked at the typical usage for stdscience and our bottom-line requirement 
    there.  For stdscience, most of the jobs take a modest amount of memory (~2GB) and 
    do not normally keep all 4 threads spinning (say 2 on average).  Meanwhile, the 
    typical processing time in stdscience is ~3.85 node-hours per exposure total with 
    (chip,cam,fake,warp,diff) = (1.5,0.12,0.06,1.04,1.15).  For a solid NASA night, 
    we get about 700 exposures.  If we process 70 exposures per hour, we can keep up 
    with the night (done in 10 hours).  That says we need a total of ~270 jobs active 
    (a quick sanity check of this arithmetic is sketched at the end of this note).  
    
    I've assigned stdscience to the storage nodes only with the following number of jobs 
    per host, based on the memory and cpu usage:
    s0 : 3
    s1 : 4
    s2 : 5
    s3 : 6
    
    That gives us a total of 271 active connections, so we should be OK in 
    terms of nightly science processing with that level.
    
    Next, I've assigned only a modest number of machines to stack: 2 x c1a, or 
    about 16 jobs.  We are not doing too much stacking at the moment.  when stack gets 
    behind, we can swap stdscience and stack allocations during the day.
    
    That leaves c0, c1b, and c2 machines for staticsky.  that gives us ~130 staticsky jobs.  
    
    For the other pantasks, I've given a modest allocation on either 
    s0-s3 (cleanup, detrend, registration, summitcopy, distribution) or c2 (pstamp & publish).  
    
    Note that I have removed the deepstack list for now : we will need to re-vamp it 
    when the new machines are ready, and we will have to juggle the stare and c2 nodes 
    with ipptopsps and staticsky.
    
    Sifan and Haydn have set up a couple of new (actually old) machines at MTRC-B.  
    These are low-cpu, low-mem machines which should be sufficient for both apache 
    and pantasks servers. Over the next few days, I'd like to test them and get the 
    services used by some of the compute nodes moved over there.  
    that will let us more fully use the compute hardware.
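
    A quick sanity check of the stdscience arithmetic above (a standalone Python sketch; 
    all input values are copied from this note, and since the per-class host counts are 
    not listed here the quoted total of 271 active connections is not re-derived):

       # hedged sketch: back-of-the-envelope check of the stdscience throughput numbers
       per_stage = {"chip": 1.5, "cam": 0.12, "fake": 0.06, "warp": 1.04, "diff": 1.15}
       node_hours_per_exposure = sum(per_stage.values())   # ~3.87, quoted above as ~3.85
       exposures_per_night = 700                            # a solid NASA night
       hours_available = 10                                 # finish within the night
       rate = exposures_per_night / hours_available         # 70 exposures per hour
       jobs_needed = rate * node_hours_per_exposure         # ~271 concurrent jobs
       print(rate, round(jobs_needed))

       # jobs per storage host by class, as assigned above
       jobs_per_host = {"s0": 3, "s1": 4, "s2": 5, "s3": 6}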