PS1 IPP Czar Logs for the week 2014.08.25 - 2014.08.31

(Up to PS1 IPP Czar Logs)

Monday : 2014.08.25

  • 06:45 Bill: distribution's rcserver.make.fileset.load task has been timing out, so filesets were not getting posted to the data store. There are about 88500 of these short jobs pending. I debugged this a bit and found that if I have the MD06.nightlyscience label in the list, the query takes > 10 minutes; if I remove it, it is quite fast. Must be some strange interaction with the priority ordering, because the md06 label has no filesets pending. Removed that label, verified that data started flowing again, then set distribution to stop so as not to interfere with nightlyscience processing.
    • please set the distribution pantasks to run when nightly processing completes
  • 08:45: ipp035 crashed, nothing on console. Rebooting now (back up @ 08:46)
  • 16:48 Bill: set distribution to run
  • 17:50 MEH: another quick MD run using the compute3 nodes before nightly starts
  • 17:55 Bill: restarted distribution pantasks
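
The 06:45 query debugging (timing the pending-filesets query with each label in turn to isolate the pathological one) can be sketched roughly as below. This is a minimal, self-contained sketch: the table, columns, and label names are illustrative stand-ins, not the real gpc1 distribution schema.

```python
import sqlite3
import time

# Illustrative stand-in for the distribution database; the real
# schema, table, and label names differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fileset (label TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO fileset VALUES (?, ?)",
    [("MD07.nightlyscience", "pending")] * 100
    + [("MD06.nightlyscience", "posted")] * 100,
)

def time_pending_query(labels):
    """Run the pending-filesets query restricted to `labels`;
    return (rows, elapsed seconds)."""
    placeholders = ",".join("?" * len(labels))
    t0 = time.perf_counter()
    rows = conn.execute(
        f"SELECT label, COUNT(*) FROM fileset"
        f" WHERE state = 'pending' AND label IN ({placeholders})"
        f" GROUP BY label",
        labels,
    ).fetchall()
    return rows, time.perf_counter() - t0

# Time the query once per label: a label that is fast on its own but
# slow in combination points at the priority ordering, not data volume.
for label in ("MD06.nightlyscience", "MD07.nightlyscience"):
    rows, dt = time_pending_query([label])
    print(f"{label}: {rows} in {dt * 1000:.2f} ms")
```

Dropping the slow label from the list, as was done above, restores throughput while the underlying query plan is investigated.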

Tuesday : 2014.08.26

  • 08:15 MEH: with the half-night of nightly finished, starting more MD pv2 on compute3 nodes
  • 09:45 MEH: shouldn't need to pause MD while MOPS is making stamps, but keeping an eye on
  • 11:22 Bill: queued more skycal distribution runs for processing. ~10% left to go.
  • 16:43 Bill: restarted the distribution pantasks which was getting sluggish after >~200,000 jobs
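
Restarting a task server once its job count passes a threshold, as in the 16:43 entry, can be mechanized with a simple watchdog. The class and the 200,000-job threshold below are assumptions for illustration (mirroring the observed sluggishness), not pantasks internals:

```python
class RestartWatchdog:
    """Flag a long-running task server for restart once it has
    dispatched more than max_jobs jobs. The default threshold is
    illustrative, matching the ~200,000-job sluggishness seen here."""

    def __init__(self, max_jobs=200_000):
        self.max_jobs = max_jobs
        self.jobs_done = 0

    def record_job(self):
        # Call once per completed job.
        self.jobs_done += 1

    def needs_restart(self):
        return self.jobs_done > self.max_jobs
```

A czar (or a cron job) polling needs_restart() would then stop the server, restart it, and re-load its task lists before setting it back to run.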

Wednesday : 2014.08.27

  • 05:22 Bill: restarted postage stamp server with a bug fix
    • 07:34 turned pstamp parser off in order to let backed up cleanup jobs free up some space
    • 08:05 removed MPE label from pstamp to see if the jobs are having a significant effect on stdscience throughput.
  • 10:48 EAM : stopping all processing in prep of a UPS test @ MRTC-B
  • 11:55 EAM : UPS test postponed : restarted processing
  • 16:10 MEH: ipp043 very unhappy w/ cpu_wait and load -- putting into repair
  • 16:15 MEH: s2+s3 being used by ippmd to distribute MD pv2 products -- finished before nightly starts

Thursday : 2014.08.28

  • 05:00 Bill: removed MPIA and MPE labels from pstamp to try to get an idea of whether the stamp processing is having any effect on stdscience throughput.
    • 06:27 looks like the postage stamp processing does not noticeably affect the processing rate. labels back on
  • 10:56 Bill: changed the labels for the final batch of lap skycal dist runs to run label and queued runs for the pole. Set distribution back to run. 39868 left to go
    • 18:33 set labels for skycal distRuns for LAP.ThreePi?.20130717.pole3 to LAP.ThreePi?.20130717 last 12,500 runs running. PV2 will then be finished. For IPP anyways
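
The 05:00/06:27 labels-off experiment reduces to comparing jobs-per-hour across the two windows. A minimal sketch, with made-up completion counts and a 5% "noticeable" cutoff (both assumptions):

```python
def jobs_per_hour(completion_times, t_start, t_end):
    """Processing rate over [t_start, t_end) given job-completion
    timestamps in seconds."""
    n = sum(1 for t in completion_times if t_start <= t < t_end)
    return n / ((t_end - t_start) / 3600.0)

# Made-up completion streams for the two one-hour windows.
rate_labels_off = jobs_per_hour(list(range(0, 3600, 36)), 0, 3600)
rate_labels_on = jobs_per_hour(list(range(3600, 7200, 37)), 3600, 7200)

# Treat anything within 5% as noise, matching the "does not
# noticeably affect the processing rate" conclusion.
noticeable = abs(rate_labels_on - rate_labels_off) / rate_labels_off > 0.05
```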

Friday : 2014.08.29

  • 06:20 Bill: PV2 skycal distribution is done. Shut down distribution pantasks as it has nothing to do

Saturday : 2014.08.30

Sunday : 2014.08.31

  • 01:15 MEH: looks like ipp034 is down since ~0100.. nothing on console, minor load -- already was in repair, taking out of processing until the morning
  • 01:20 MEH: also looks like stdsci is well due for a regular restart and is partly the cause of the slow processing rate -- always fun trying to get stop to go through in such a condition..
    • processing is already overly slowed, do we really want PV3 processing running during the night?
    • also thought the WS were lower prio but they are getting processed, taking label OSS.WS.nightlyscience out until morning as well
  • 09:25 MEH: ipp035 down ~0915.. nothing on console, moderate load -- put into repair (little/no disk space left anyways) and taking out of processing
  • 09:30 still >150 OSS warps to do.. nightly finished downloading, restarting summitcopy+registration since they have also been running for a very long time..
    • stsci12 has had a load >350 since ~0800.. data disks appear non-responsive, unable to log in
  • 12:40 MEH: all base nightly diffims finished, WS continue. turning PV3 back on, problem w/ stsci12 may cause issues..
  • 14:20 MEH: stsci12 issue blocking 3PI WS, trying to set stsci12 neb-host down (from repair)
    • looks like many jobs stalled, will want to clear before nightly -- killed all ppSub jobs, restarted stdsci, and cleared them
  • 17:20 MEH: no change on diskspace plot after cleanup, czarpoll/nebdiskd broke?
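
Finding the stalled ppSub jobs before killing them (14:20 entry) is essentially a runtime-threshold scan. A hypothetical sketch; the job ids and the one-hour bound are illustrative assumptions, not the real job-tracking interface:

```python
def find_stalled(jobs, now, max_runtime_s=3600):
    """Return ids of jobs whose runtime exceeds max_runtime_s.
    jobs maps job id -> start timestamp (epoch seconds)."""
    return [jid for jid, start in jobs.items() if now - start > max_runtime_s]

# Example: one job started 100 minutes ago, one 10 minutes ago.
running = {"ppSub.1234": 0, "ppSub.1235": 5400}
stalled = find_stalled(running, now=6000)
```

Anything flagged would then be killed and re-queued before nightly starts.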