
Monday : 2012-09-03

  • 11:40 EAM : some hung NFS mounts on ipp016 (from stsci). I am stopping processing to clear out hung jobs.
  • 12:40 EAM : I tried to clear the NFS mounts with no luck. I rebooted ipp016 and restarted processing.

Tuesday : 2012-09-04

  • 11:44 CZW: Restarted stdscience because the rate plots were showing <50 exp/hour.

Wednesday : 2012-09-05

  • 09:02 (Serge): Stopped replication on ippc63.
  • 09:10 (Serge): Started rsync to ippc62 (/export/ippc62.0/backup_nebulous/20120905 ; screen session: backup_nebulous on ippc62 rsync -e rsh -avz ippc63:/export/ippc63.0/mysql .)
  • 13:05 (Serge): chip.revert.off for LAP surgery
    • Fixed:
      • gpc1/20100607/o5354g0061o/o5354g0061o.ota35.burn.tbl
      • gpc1/20100228/o5255g0601o/o5255g0601o.ota62.burn.tbl
      • gpc1/20100228/o5255g0603o/o5255g0603o.ota24.burn.tbl
      • gpc1/20100607/o5354g0213o/o5354g0213o.ota35.burn.tbl
      • gpc1/20100607/o5354g0252o/o5354g0252o.ota06.burn.tbl
      • gpc1/20100607/o5354g0240o/o5354g0240o.ota34.burn.tbl
      • gpc1/20100607/o5354g0242o/o5354g0242o.ota13.burn.tbl
      • gpc1/20100607/o5354g0260o/o5354g0260o.ota13.burn.tbl
      • gpc1/20100607/o5354g0261o/o5354g0261o.ota06.burn.tbl
    • Recovered:
      • gpc1/20100607/o5354g0236o/o5354g0236o.ota02.fits
      • gpc1/20100607/o5354g0261o/o5354g0261o.ota65.fits
      • gpc1/20100616/o5363g0091o/o5363g0091o.ota37.fits
      • gpc1/20100616/o5363g0111o/o5363g0111o.ota26.fits
      • gpc1/20100616/o5363g0110o/o5363g0110o.ota44.fits
      • gpc1/20100617/o5364g0138o/o5364g0138o.ota52.fits
      • gpc1/20100619/o5366g0205o/o5366g0205o.ota52.fits
      • gpc1/20100619/o5366g0223o/o5366g0223o.ota51.fits
      • gpc1/20100619/o5366g0202o/o5366g0202o.ota44.fits
    • Lost
      • gpc1/20100617/o5364g0160o/o5364g0160o.ota65.fits (XY65 182678 581641)
  • 13:55 (Serge): chip.revert.on
  • 14:50 (Serge): Shutting everything down (E-mail from Gavin about power outage at mhpcc)
  • 18:10 (haf/czw): haf restarted processing (all but addstars and deepstack... deepstack was never shut off?). czw restarted nebdiskd, which kicked out ipp013 and ipp046 (these will be off tonight). haf is doing stuff with addstars, so ignore if they are off.
  • 19:00 (haf): restarted czarpoll/roboczar

Thursday : 2012-09-06

Serge is czar

  • 07:00 (Serge): Started slave on ippdb02.
  • 07:05 (Serge): Registration was stuck but now seems to run again.
  • 08:00 (haf): kicking registration again
  • 09:23 (Serge): Restarted rsync on ippc62
  • 11:13 (Serge): Set ipp013 to down in nebulous.
  • 12:00 (Bill): postage stamp job stuck for >50000 seconds; restarted pstamp pantasks
  • 13:35 (Serge): chip.revert.off
  • 14:00 - 15:00 (Bill): shut down ipp011 because jobs there were frequently hanging due to NFS problems. It took so long because Bill got distracted.
  • 15:45 (Serge): Asked Gavin to reboot ipp037 (Input/output error messages when connecting as well as when trying to send commands like 'ls')
  • 16:05 (Serge): Stopped all processing. Shut down all pantasks. ipp037 is not well (load/procs>200?!)
  • 16:30 (Serge): Restarted processing. ipp037 is set to down in nebulous (neb-host ipp037 down) and has been removed from /home/panstarrs/ipp/ippconfig/pantasks_hosts.input
  • 17:50 (Serge): neb-host ipp037 up

Friday : 2012-09-07

  • 07:47 (bill): turned warp revert off as it doesn't seem to be doing anything useful right now.
     Task Status
      AV Name                     Nrun   Njobs  Ngood  Nfail Ntime Command           
      ++ warp.skycell.run          184  290524  17591 272749     0 warp_skycell.pl       
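The call to turn warp revert off is borne out by a quick failure-rate check on the counts in the status table (a minimal Python sketch; the numbers are copied from the table, and the Nrun reconciliation is my inference):

```python
# Counts copied from the warp.skycell.run status line above.
nrun, njobs, ngood, nfail = 184, 290524, 17591, 272749

# Good + failed should account for everything not currently running.
assert ngood + nfail == njobs - nrun

fail_frac = nfail / njobs
print(f"{fail_frac:.1%} of queued warp.skycell jobs are faulting")
```

With roughly 94% of jobs faulting, reverting just re-queues work that fails again, which matches the "not doing anything useful" assessment.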
  • 07:49 (Serge): No observations last night because of rotator problems.
  • 08:05 (Bill): Set stdscience to stop in preparation for cleaning up various NFS problems. Starting with stuck warp_skycell processes on ipp047.
  • 08:10 (Serge): Set ipp037 to down
  • 08:51 (Bill): Found one stuck distribution job and 5 stuck warp_skycell.pl processes on ipp047. Killed them off and the pantasks state is clear. "who uses the cluster" just shows pstamp and stack jobs, which seem to be proceeding nominally. Setting stdscience to run with warp revert and camera revert turned off.
  • 09:07 (Bill): recovered 3 lost raw files. All were found on ippb02.X and remade.
  • 09:40 (Bill): stopped looking for new work on the MOPS pstamp data store. They are complaining that we aren't giving them stamps fast enough, yet we have been working as fast as we usually do. New requests are continuously being queued; we have a backlog of 28000 jobs. I'm pausing to let them confirm that things are working fine on their side. The obscure command to disable their data store is:
    pstamptool -dbname ippRequestServer -dbserver ippc17 -moddatastore -set_state disabled -ds_id 4

To turn it back on substitute enabled for disabled.

  • 10:00 restarted pstamp. pcontrol CPU usage seemed high.
  • 10:37 (Serge): nebulous (actually mysql) directory backup complete on ippc62 (/export/ippc62.0/backup_nebulous/20120905).
  • 10:55 (Serge): Still attempting to restart mysql server on ippc63
  • 11:05 (Serge): mysql server on ippc63 finally started (15min needed). Replication working (180 ks behind).
  • 15:10 (Bill): set ipp013 and ipp037 to nebulous repair state. Restarted stdscience with LAP label turned off.
  • 15:17 (Bill): started querying MOPS data store for new postage stamp requests.
  • 15:40 (bill): added LAP label to stdscience (so czartool shows it), but its entry in the database is set to inactive for now. Current plan: once nightlyscience is caught up we will enable the label but set ipp037 to down. This will allow stdscience to continue but will fault anything needed from ipp037. Then Gene will restart the rsyncs.
  • 16:42 (Bill): 3pi diffs are running. warp is off.
  • 17:00 (Bill): SAS_v9 has been started. It is running in ~ipp/deepstack as bill, off of an optimized version of the trunk.
  • 17:28 (Bill): ipp037 set to down in nebulous. Gene has his rsyncs running.

Saturday : 2012-09-08

  • 09:00 (Bill): moved deepstack pantasks to ippc17. ippc05 is loaded with processing, so it was very slow to spawn skycal jobs, which are pretty fast.
  • 10:30 (Bill): rebooted ipp015, which encountered a general protection fault and then became a near zombie
  • 15:30 (Bill): oh rats. I apparently didn't update my optimized build with Gene's new code before running sas_v9 staticsky and skycalibration. Relabeled the runs sas_v9.broken and queued new ones.

Sunday : 2012-09-09

Bill is acting czar today

  • 08:30 reverted LAP stacks with fault == 2
  • Found exp_id 184688 XY31 with data_state = 'pending_burntool' and burntool_state = 0. Strange. Ran fixburntool to create a burn table and then updated the rawImfile. Chip run completed promptly.
  • 08:49 set ipp037 to repair and reverted the warps. 4221 to process. Set poll limit to 16 in stdscience in order to keep the load light. In order to concentrate solely on the warps, turned lap monitoring off for now.
  • 09:40 set host ipp051 to off in stdscience. It is having a problem running jobs. Attempts to access files in /data/ipp051.0 on that node hang. This causes jobs to hang. The file system is visible from remote nodes and locally as /export/ipp051.0.
  • 09:49 3557 skycells to go. 664 finished in last hour. At this rate it will take about 6 hours to catch up. task warp.skycell.run npending = 24
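The "about 6 hours" figure is simple rate arithmetic on the numbers in this entry (a minimal sketch; the variable names are mine):

```python
# Catch-up ETA from the counts in the entry above.
remaining = 3557        # skycells still to process at 09:49
rate_per_hour = 664     # skycells finished in the previous hour
eta_hours = remaining / rate_per_hour
print(f"~{eta_hours:.1f} hours to catch up at the current rate")
```

The naive figure is ~5.4 hours; "about 6" presumably allows some slack for the rate dropping off.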
  • afternoon: regenerating M31 g-band warpBackgroundRuns that got cleaned up before the distribution setup had a chance to distribute them
  • 13:16 807 skycells to go. npending = 40. Turning lap on.
  • 14:05 LAP warps all caught up. Restarting stdscience.