PS1 IPP Czar Logs for the week 2012.04.09 - 2012.04.15


Monday : 2012.04.09

  • 09:30 Bill set all data with label ps_ud% to be cleaned
  • 09:55 stopped pantasks in order to pick up a bug fix for psphotStack
  • 10:02 set pantasks to run. Restarted deepstack. Reverted faults with fault == 4
  • 11:40 changed state of staticskyRuns in state new.wait with abs(glat) > 10 to new
  • 12:01 Started running camera destreak restore in the distribution pantasks. To stop use "destreak.revert.off"
  • 23:10 Mark: looks like ipp056 became unresponsive; power cycled it and it is back up.

Tuesday : 2012.04.10

Wednesday : 2012.04.11

  • 12:00 Bill stopped processing in order to rebuild psphot to pick up a workaround to the memory explosion issue. If the number of peaks found is > 50,000 we exit psphotStack with fault 1
  • 17:42 distribution pantasks was acting funny so Bill restarted it

Thursday : 2012.04.12

  • 11:27 ipp017 is backing up processing because it is in swap heck trying to do 6 STS warps at a time. Set two hosts to off.

Friday : 2012.04.13

  • 06:05 EAM : a hung mount point for ipp008 on ipp042 was blocking jobs on ipp042. I tried to force.umount ipp008 (after killing off the relevant jobs), but failed. I rebooted ipp042.
  • 06:10 EAM : pantasks have been running since Apr 5, and stdscience is running slow as a result. I've shut down everything and restarted it all.
  • 09:00 Bill: distribution is backing up; turning off camera destreak restore (destreak.revert)
  • 09:30 Bill: restarted deepstack

Saturday : 2012.04.14

  • 06:40 EAM : problems with registration of some test camera images:
    there were a handful of test images taken last with a somewhat
    different format than usual: they claim to have only 8 amplifiers per
    chip.  this causes two problems:
    
    1) they are not recognized as a form of gpc1 exposure (this is
    arguably the right thing since they are of an unknown gpc1 layout)
    2) they ARE picked up as ssp images.  this is because the ssp format
    is too generous: any FITS image of any dimension will match it since
    it only requires the keyword SIMPLE to be true.
    
    In fact, the images fail to register because the ssp data is expected
    to have some pixels, while the PHU of these images does not have any
    pixels.
    
    I don't have time this morning to deal with these issues, so I've
    marked the images with a state of 'wait' in newExp for now so they do
    not clog up registration.
    
  • 14:43 Heather reports through email that nebulous is "cranky". Bill checks ganglia and ipp017 is down:
    <Apr/14 10:50 am>[6211029.517021] BUG: soft lockup - CPU#7 stuck for 61s! [swapper:0]
    
    DOESN'T LINUX have a reboot on panic option like UNIX had in 1982 ??
    (see /etc/sysctl.conf)
    
  • Bill stopped pantasks and power cycled ipp017. Once it came back up the nebulous apache servers got back to work. Set pantasks to run.
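
The format-matching problem in the 06:40 entry (a rule that only requires SIMPLE to be true will accept any FITS header, including a PHU with no pixels) can be illustrated with a minimal sketch. This is not the IPP's format-recognition code: headers are modeled as plain dicts, and `matches`, `permissive_rule`, and `strict_rule` are hypothetical names. The extra NAXIS check is only one possible tightening, assumed here for illustration.

```python
# Hypothetical sketch: headers are plain dicts, rules are lists of predicates.
# None of these names come from the IPP code base.

def matches(header, rule):
    """Return True if every predicate in the rule accepts the header."""
    return all(pred(header) for pred in rule)

# Permissive rule: only requires SIMPLE to be true, so any FITS header passes.
permissive_rule = [lambda h: h.get("SIMPLE") is True]

# Stricter rule (an assumption, not the actual fix): also require pixel data.
strict_rule = permissive_rule + [lambda h: h.get("NAXIS", 0) > 0]

# A primary header unit with no pixels, like the test images in the log entry.
empty_phu = {"SIMPLE": True, "NAXIS": 0}
# A header that does carry a 2-D image.
image_hdu = {"SIMPLE": True, "NAXIS": 2, "NAXIS1": 100, "NAXIS2": 100}

results = {
    "permissive/empty": matches(empty_phu, permissive_rule),
    "permissive/image": matches(image_hdu, permissive_rule),
    "strict/empty": matches(empty_phu, strict_rule),
    "strict/image": matches(image_hdu, strict_rule),
}
```

With the permissive rule both headers match, which is how the pixel-less test images ended up classified as ssp; a rule that also checks for pixel data would reject the empty PHU at recognition time rather than at registration.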

Sunday : 2012.04.15

  • 09:30 Bill: A ppImage process has been stuck on ipp010 since 7am yesterday. Killed it off and reverted it. The chip completed in a couple of minutes. The rerun occurred on ipp010, so whatever the problem was has gone away.
  • 17:30 Bill reverted a rawExp that had been faulted for several days. Apparently we've been lax about paying attention to the IPP->OTIS problem notifications. Added 2012-04-10 to ns.date list and the diffs got queued.
  • Earlier: dropped poll limit from 200 to 100 in distribution in an attempt to keep destreak.revert tasks (camera destreak restore task) from filling up the queue. Bill will look into fixing this on Monday.
  • 20:33 destreak revert off in distribution