PS1 IPP Czar Logs for the week 2013.07.08 - 2013.07.14

Monday : 2013.07.08

  • 08:04 Bill stopped distribution. It is falling all over itself because the new stsci machines do not have /local/ipp set up correctly. (It was set to be a link to /export/ipp053.0 which is inaccessible except from ipp053)
  • 08:23 Set up /local/ipp on stsci10 - 19 (rm /local/ipp ; mkdir -p /export/$hostname.0/ipp/tmp ; ln -s /export/$hostname.0/ipp /local ; mkdir /export/$hostname.0/ipp/gpc1 ; rsync -av /data/ipp053.0/ipp/gpc1/tess /local/ipp/gpc1)
  • 08:38 Bill: more distribution woes. The wave 2 nodes have absolutely no space in their /export partitions. Thus their /local/ipp/tmp has no space, so distribution tasks cannot run on those nodes. Restarted distribution with wave2 and wave3 nodes turned off and some compute nodes added in.
  • 10:09 Bill: set M31.test and STS.rerun data to be cleaned. We now have 8000 chipRuns in state goto_cleaned
  • 11:00 EAM : set all relevant machines that are > 98% full to 'repair'. Started dvo purge-temp to remove unneeded temp files from ipptopsps ops.
  • 11:01 EAM : ipp054 crashed : ipp054.20130708.crashlog
    • console server for ipp054 (and ipp055 at least) does not allow power control; contacted Gavin (who contacted Rita) to fix.
  • 23:30 MEH: taking 1x compute3 and compute2_himem from stdscience to run MD09.deeptest staticsky
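
The 08:23 fix above can be sketched as a small shell function. This is only an illustration of the steps in that entry, not the exact commands that were run: the function name and the optional `root` prefix are assumptions added so the steps can be tried in a scratch directory, and the rsync step is skipped when the source directory is not reachable.

```shell
# Sketch only: repoint /local/ipp at this host's own /export partition.
# "root" is a hypothetical prefix (not in the log) for trying the steps in a
# scratch directory; pass nothing to act on the real filesystem.
setup_local_ipp() {
    local host="$1"
    local root="${2:-}"
    local export_ipp="$root/export/$host.0/ipp"
    local tess_src="$root/data/ipp053.0/ipp/gpc1/tess"

    # Remove the stale link (the log notes it pointed at /export/ipp053.0).
    # This fails on purpose if /local/ipp is a real directory, not a link.
    rm -f "$root/local/ipp" || return 1

    # Create the per-host directories, then link /local/ipp to them.
    mkdir -p "$export_ipp/tmp" "$export_ipp/gpc1" "$root/local" || return 1
    ln -s "$export_ipp" "$root/local/ipp" || return 1

    # Copy the tess directory only if the source is reachable from here.
    if [ -d "$tess_src" ]; then
        rsync -av "$tess_src" "$root/local/ipp/gpc1" || return 1
    fi
}
```

On a real node this would be called as `setup_local_ipp "$(hostname)"`. The function removes the old link first, so it can be re-run on a host that was already fixed.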

Tuesday : 2013.07.09

  • earlier Bill: set all chip, warp, and diff runs with ps_ud% labels to goto_cleaned
  • 10:45 Bill: shut down ippc17 so Rita can check the fans. Since this is the data store machine we need to gracefully shut down some services
    • shut down pstamp and update pantasks
    • shut down apache with /etc/init.d/apache2 stop
    • shut down mysqld with mysqladmin shutdown (this takes a while)
    • shut down host with shutdown -h now
      • I actually didn't know about the -h but I add it here for future reference
  • 11:04 Bill: forgot about some other services that depend on ippc17 because of the data store: distribution cleanup (which is now hung), the rcserver task in distribution (which was idle), and ipptopsps, which writes to the data store.
  • 11:35 Bill: restarted cleanup pantasks with
  • 12:24 Bill: pstamp restarted. Added a couple of sets of compute3 nodes to work on the mops backlog.
  • 12:33 restarted update and set dist.cleanup.on (in cleanup) and rcserver.on (in distribution). Everything is back online.
  • 13:10 MEH: returning 1x compute3 + compute2_himem from MD09.deeptest staticsky run back to stdsci
  • 21:45 Bill: set cleanup to stop. Will restart with standard host config.
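
For future ippc17 shutdowns, the 10:45 and 11:04 entries can be folded into one ordered checklist. The sketch below is a dry run by default and only prints each step. The function names and the `DRY_RUN` variable are assumptions, the pantasks-side steps are left as manual reminders because the log does not give their commands, and placing the dependent services before apache is an inference from the 11:04 entry rather than a documented procedure.

```shell
# Sketch only: ordered shutdown checklist for the data store host.
# DRY_RUN (hypothetical variable) defaults to 1, which prints without running.
run_step() {
    local desc="$1"
    shift
    if [ "$#" -eq 0 ]; then
        # No command known from the log: remind the operator to do it by hand.
        echo "MANUAL: $desc"
    elif [ "${DRY_RUN:-1}" = "1" ]; then
        echo "WOULD RUN: $* ($desc)"
    else
        echo "RUNNING: $* ($desc)"
        "$@"
    fi
}

shutdown_datastore_host() {
    run_step "stop pstamp and update pantasks" &&
    run_step "stop distribution cleanup, the rcserver task, and ipptopsps" &&
    run_step "stop apache" /etc/init.d/apache2 stop &&
    run_step "stop mysqld, which takes a while" mysqladmin shutdown &&
    run_step "halt the host" shutdown -h now
}
```

Calling `shutdown_datastore_host` with `DRY_RUN` unset prints the five steps in order. Setting `DRY_RUN=0` runs the three real commands and stops at the first one that fails.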

Wednesday : 2013.07.10

  • 06:45 Bill: added some hosts to the cleanup pantasks. wave2, compute2, compute3
  • 11:50 MEH: sending ~5k older MD04,06,07 warps to cleanup since we will use the update method for pv2; hopefully many live on the wave2 datanodes. Also including ~3k MD diffims with the data_group of MDxx.nightlyscience up to 6/20, and MD.GR0 distribution bundles prior to 1/2013 (6 months back).
  • 12:06 Bill: stopping processing to add chipRun.update_mode to the database
  • 13:33 Bill: pantasks restarted

Thursday : 2013.07.11

Bill is czar today

  • 02:50 burntool is about 20 exposures behind. Setting cleanup to stop. Will restart with standard number of nodes running.
  • 02:55 ipp060 is in heavy swap heck. This was causing some nfs errors a while ago. Setting to repair in nebulous
  • 07:58 cleanup has caught up. added label goto_cleaned.rerun and boosted the number of hosts
  • 09:35 shutting down ippc17 for battery replacement
  • 10:39 ippc17 is back up and all services have been resumed
  • 11:52 cleanup has completed its work. Resuming repeat cleanup of old nightly 3pi chip data. This will remove the cmf files and binned images.
            chiptool -updaterun -set_state goto_cleaned -set_label goto_cleaned.rerun -data_group ThreePi.201008%
  • 21:07 EAM : ipp026 has some kernel faults and was causing disk I/O trouble; rebooting at console
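
If the repeat cleanup from the 11:52 entry is extended to other data groups, a helper that prints the command for review before anything is run may be useful. The flags below are copied from that entry. The function name is hypothetical, and the function only prints the commands and never calls chiptool.

```shell
# Sketch only: print (not run) the cleanup command for each data_group pattern.
# Flags are copied from the 11:52 log entry; the function name is hypothetical.
print_cleanup_cmds() {
    local group
    for group in "$@"; do
        # The pattern is passed as an argument, so its % is printed literally.
        printf 'chiptool -updaterun -set_state goto_cleaned -set_label goto_cleaned.rerun -data_group %s\n' "$group"
    done
}

print_cleanup_cmds 'ThreePi.201008%'
```

Single-quoting the pattern keeps the shell from touching the trailing `%` wildcard before it reaches the command line.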

Friday : 2013.07.12

  • 08:50 EAM : a psphot core dump on stare03 is taking hours to complete. I cannot kill it off so I am rebooting the machine.

Saturday : 2013.07.13

  • 22:08 CZW: Restarting stdscience pantasks as the rate has dropped below 100 exp/hr. It has ~185631 chip_imfile jobs completed, so it's due.

Sunday : 2013.07.14

  • 15:30 MEH: ippdvo rsync looks to be pushing the node into a heavy wait state and making it difficult to run manual tests. Taking ipp058 out of stdsci and stack so we can at least do 'ls'.
  • 17:45 CZW: restarting stdscience.
