PS1 IPP Czar Logs for the week 2013.01.21 - 2013.01.27

(Up to PS1 IPP Czar Logs)

Monday : 2013.01.21

  • 12:02 Bill reloaded the module survey.pro into stdscience with the survey.skycal task set to add -ra_min 120 to the command line; this restricts queueing to the area of most interest to us (a sketch of the RA cut follows this list). Staticsky has finished the region between RA 120 and 180, including skycells with < 5 filters. The area from 180 to 210 should finish later today. See the first plot at http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/staticsky.20120706
  • 16:38 Bill restarted stdscience
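
A minimal sketch of the RA cut implied by the 12:02 entry above: -ra_min 120 in effect keeps only skycells with center RA >= 120 degrees. The skycell records, field names, and helper here are illustrative assumptions; the real selection is done by survey.pro and the skycal tasks.

    # Minimal sketch: keep only skycells with RA >= ra_min, mirroring the effect
    # of adding -ra_min 120 to the skycal command line. The skycell records and
    # field names are illustrative, not the IPP schema.
    def filter_by_ra_min(skycells, ra_min=120.0):
        """Return the skycells whose center RA (degrees) is at or above ra_min."""
        return [sc for sc in skycells if sc["ra_deg"] >= ra_min]

    skycells = [
        {"name": "skycell.example.a", "ra_deg": 118.3},   # would be skipped
        {"name": "skycell.example.b", "ra_deg": 151.7},   # would be queued
    ]
    print(filter_by_ra_min(skycells))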

Tuesday : 2013.01.22

  • 08:05 Bill added label goto_cleaned.rerun to the cleanup pantasks. There are 19711 chip runs set to have their cmf files removed (a rough sketch of that cleanup follows this list).
  • 10:41 Bill removed label goto_cleaned.rerun, watching its effect on the staticsky completion rate. (That wasn't it; the label was set back on a couple of hours later.)
  • 14:56 Bill set stsci06 to repair mode
  • 15:53 Bill set stsci06 back to up.
  • 15:59 Bill set cleanup pantasks to stop. It's ready for a restart.
  • 16:42 started cleanup pantasks
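
Regarding the 08:05 entry: a rough sketch of the kind of cleanup the goto_cleaned.rerun label drives, i.e. removing the .cmf files under a chip run's directory. The directory path, the dry_run behavior, and the helper itself are assumptions for illustration; the actual removal is done by the cleanup pantasks tasks, not a script like this.

    # Rough sketch of the cleanup idea: for a chip run flagged for cleaning,
    # find and remove its .cmf files. The run directory below is a placeholder.
    import os

    def remove_cmf_files(run_dir, dry_run=True):
        """Delete (or, with dry_run, just list) all .cmf files under run_dir."""
        hit = []
        for dirpath, _subdirs, filenames in os.walk(run_dir):
            for name in filenames:
                if name.endswith(".cmf"):
                    path = os.path.join(dirpath, name)
                    hit.append(path)
                    if not dry_run:
                        os.remove(path)
        return hit

    for path in remove_cmf_files("/data/example.0/nightly/example_chip_run"):
        print("would remove", path)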

Wednesday : 2013.01.23

  • 12:02 Bill has been investigating the reduction in staticsky throughput during the day. It is 3x faster at night, even though nightly science is also doing work then. We believe that the culprit is stsci06. Set one set of wave3 nodes to off in deepstack to reduce the load.
  • 13:05 Bill: Rita is going to stop the raid consistency check on stsci06. The I/O lag does not seem to be happening. Set the wave3 nodes back to on.
  • 14:10 MEH: the stsci00 disks were put into neb-host repair in December because the 5% space buffer is not tracked by nebulous and could have resulted in badness. Gavin will look into setting it down to 1% like the other stsci disks after the ipp027-ipp030 updates are finished (data on stsci00), but there is +2TB free on the disks now, so setting them to up and will keep an eye on them. If they go red, they need to be set to repair again.
  • 15:09 Bill: stsci06 results inconclusive as yet. It does seem to show up as high load on ganglia less often.
  • 15:10 restarted pstamp and update pantasks. Their pcontrols were hogging cpu
  • 19:59 Bill: deepstack pantasks finished all of the runs with RA < 270. Stopped it and restarted. Set waiting runs in the range 300 < RA < 330 from state wait to new. Queued all remaining skycells in that RA range with abs(glat) > 10 (the selection is sketched below this list). This is about 10000 skycells in state new.
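
The 19:59 selection (300 < RA < 330, abs(glat) > 10) expressed as a small Python filter, just to make the cuts explicit. The candidate list and the helper are illustrative; the real queueing goes through the staticsky/deepstack tools and the database.

    # Illustrative version of the 19:59 cuts: keep skycells with
    # 300 < RA < 330 and |galactic latitude| > 10 degrees.
    def want_skycell(ra_deg, glat_deg, ra_lo=300.0, ra_hi=330.0, glat_min=10.0):
        """True if a skycell passes both the RA window and the |glat| cut."""
        return ra_lo < ra_deg < ra_hi and abs(glat_deg) > glat_min

    candidates = [
        ("skycell.example.a", 312.4, 25.1),   # passes both cuts
        ("skycell.example.b", 315.0, 4.2),    # too close to the galactic plane
        ("skycell.example.c", 290.0, 40.0),   # outside the RA window
    ]
    to_queue = [name for name, ra, glat in candidates if want_skycell(ra, glat)]
    print(to_queue)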

Thursday : 2013.01.24

mark is czar

  • 08:44 Bill: registration is stuck waiting for a register_imfile job running on ipp016 to finish, but that host has crashed. Power cycling it and will restart the registration pantasks.
  • 09:25 MEH: looks like ipp020 has had a stuck imfile for the past 30 mins, fixing. Other things have been stuck since before 6am from mount problems. Removing ipp020 from processing and restarting nfs; some mounts are back. Registration is moving again, clearing other stalled jobs now.
  • 09:45 MEH: mysql restarted on ipp016 after its reboot
  • 10:00 MEH: now looks like stsci06 has been having nfs issues for the past 20 mins, stalling stdscience jobs... setting neb-host repair on it
  • 10:10 MEH: registration finally finished but nightly science has a ways to go. skycal and deepstack set to stop until things balance out
  • 10:40 MEH: stdscience is having trouble keeping fully loaded; only the skycal stage is >100 jobs but pcontrol is at 100% CPU... so restarting stdscience
  • 11:00 Bill: set cleanup to stop to further reduce I/O activity.
    • 11:50 MEH: still having many of the same issues; it doesn't appear cleanup is part of the problem
  • 11:05 MEH: deepstack back on; stsci06 had a massive load spike only (it is in repair so this should just be reads, at the start of jobs anyways).. slowly returning to normal. Will stop and bring it back on gradually next time.
  • 11:39 Bill: shut down deepstack pantasks and killed the psphotStack jobs for now
  • 12:10 MEH: restarted distribution, clearing many many Nfail -- appears to be a config error, a problem with diff_id=367980; publishing is faulting as well.
    • the diffim was missing its .mdc file, some glitch with stsci06. Regenerated the diffim with:
      perl ~ipp/src/ipp-20121218/tools/rundiffskycell.pl --diff_id 367980 --skycell_id skycell.2115.064
      
  • 12:13 Bill: reinserted the staticskyResults that he deleted in error while trying to revert the faults caused by killing deepstack (luckily replication was a bit behind; a sketch of that kind of recovery follows this list)
  • 11:57 MOPS postage stamps finished
  • 13:40 MEH: a few remaining diffims left and nightly science finally finished..
    • stsci06 neb-host up again
    • skycal set back to run
    • cleanup set back to run
  • 14:00 MEH: going to start nfs/rpc.statd on some of the wave1 systems with large swap use (ipp010, ipp011, ipp012, ipp018) as a preventative measure (the swap check is sketched below this list)
    • added ipp007 and ipp019 to the list as they have been problematic and had large shared memory use, respectively
  • 15:50 Bill restarted deepstack. Starting with a poll limit of 36 in order to ramp up the load slowly
  • 16:24 Bill worked around a problem with distribution of single-filter staticsky results. Need to think about whether to code support for older staticsky runs that used psphot; those should probably be purged.
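
Regarding the 12:13 entry: the recovery worked because the replica had not yet replayed the accidental delete. A sketch of copying such rows back from a lagging replica is below; the host names, table name, key column, and range are placeholders, not the actual gpc1 layout or the exact statements Bill ran.

    # Sketch only: pull rows that still exist on a lagging replica back into
    # the primary after an accidental delete. Hosts, table name, and the key
    # range are placeholders; this is not the exact recovery that was run.
    import MySQLdb

    replica = MySQLdb.connect(host="replica-host", user="ipp", db="gpc1")
    primary = MySQLdb.connect(host="primary-host", user="ipp", db="gpc1")
    rcur, pcur = replica.cursor(), primary.cursor()

    rcur.execute("SELECT * FROM staticskyResult WHERE sky_id BETWEEN %s AND %s",
                 (100000, 100500))
    for row in rcur.fetchall():
        slots = ",".join(["%s"] * len(row))
        pcur.execute("INSERT IGNORE INTO staticskyResult VALUES (" + slots + ")", row)
    primary.commit()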
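
Regarding the 14:00 entry: a minimal sketch of the swap check behind picking out the wave1 hosts. It reads the local /proc/meminfo; running it on each host (ipp010, ipp011, ipp012, ipp018, ...) over ssh, and the 50% threshold, are assumptions.

    # Sketch for the swap check: report the fraction of swap in use so that
    # hosts with heavy swap use can get a preemptive nfs/rpc.statd restart.
    # Reads the local /proc/meminfo; values there are reported in kB.
    def swap_used_fraction(meminfo_path="/proc/meminfo"):
        """Return the fraction of swap currently used, or 0.0 with no swap."""
        fields = {}
        with open(meminfo_path) as f:
            for line in f:
                key, value = line.split(":", 1)
                fields[key] = int(value.split()[0])
        total = fields.get("SwapTotal", 0)
        if total == 0:
            return 0.0
        return (total - fields.get("SwapFree", 0)) / float(total)

    if swap_used_fraction() > 0.5:   # threshold is an assumption
        print("heavy swap use -- candidate for an rpc.statd restart")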

Friday : 2013.01.25

mark is czar

  • 09:10 MEH: nightly science finished except for 1 diffim with its outroot to stsci06..
  • 10:10 Bill: reduced the deepstack poll limit to 64 to give stsci06 a chance to settle down.
  • 12:45 Bill: set the deepstack poll limit back to the nominal 105 without noticing that the postage stamp server is busy getting diff stamps for MOPS. Oops. (The throttling is sketched below this list.)
  • 14:52 Bill: last postage stamp barrage is complete. deepstack is back at nominal load. I'm going to stop touching things now and go to bed
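
A minimal sketch of the poll-limit juggling from today's entries: keep deepstack throttled while there is competing I/O load (stsci06 recovering, MOPS postage stamps running) and go back to the nominal 105 otherwise. The numbers come from the log; apply_poll_limit() is a hypothetical stand-in for however the limit is actually changed in the pantasks.

    # Sketch of the throttling rule applied by hand on Friday.
    NOMINAL_POLL_LIMIT = 105
    THROTTLED_POLL_LIMIT = 64

    def choose_poll_limit(stsci06_recovering, pstamp_busy):
        """Pick the deepstack poll limit given the competing load."""
        if stsci06_recovering or pstamp_busy:
            return THROTTLED_POLL_LIMIT
        return NOMINAL_POLL_LIMIT

    def apply_poll_limit(limit):
        print("set deepstack poll limit to", limit)   # placeholder action

    apply_poll_limit(choose_poll_limit(stsci06_recovering=False, pstamp_busy=True))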

Saturday : 2013.01.26

  • 11:00 MEH: doing regular restart of stdscience
  • 18:11 Bill: since last night deepstack has been running with 2 x compute3 nodes instead of 3 x. The throughput was stable at about 280 runs per hour and did not drop significantly during the day, as it had been when 3 sets were used. Changed the default for deepstack to this configuration.
  • 23:00 Angie notes the exposure quality feedback reporting large chunks of last night's data ranked as poor quality (roughly o6318g0168o->0355o, excepting 0240o->0245o)
    • MEH: unclear what the criteria for ranking poor are, but the exposures < g0168o appear poorer, with 6-8 FWHM and 0.1-0.3 fluctuations in zpt from the camProcessedExp entries. The 0168-0335 range seems fairly regular at 4.5-5.5 FWHM, zpt ~24.5. Maybe odd profiles? (A crude version of such a cut is sketched below.)
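
A crude version of the kind of cut the camProcessedExp numbers above suggest: flag an exposure when its FWHM is well above the ~4.5-5.5 typical range or its zero point strays far from ~24.5. The thresholds and record layout are guesses for illustration only; they are not the criteria the quality feedback actually applies.

    # Illustrative quality cut based on the numbers quoted above; the actual
    # ranking criteria used by the quality feedback are not known here.
    def looks_poor(fwhm, zero_point, fwhm_max=6.0, zpt_ref=24.5, zpt_tol=0.3):
        """True if the FWHM or the zero point falls outside the assumed limits."""
        return fwhm > fwhm_max or abs(zero_point - zpt_ref) > zpt_tol

    examples = [
        ("exposure_a", 7.2, 24.2),   # large FWHM, low zpt: flagged
        ("exposure_b", 5.0, 24.5),   # typical values: passes
    ]
    for name, fwhm, zpt in examples:
        print(name, "poor" if looks_poor(fwhm, zpt) else "ok")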

Sunday : 2013.01.27

  • 06:20 EAM : ipp018 had a partial crash with a kernel panic around 4am; I rebooted it after checking the console (ipp018-crash-20130127)