PS1 IPP Czar Logs for the week 2013-12-30 - 2014-01-05


Monday : 2013-12-30

Bill is the czar today

  • 07:48 Last night's data has finished chip processing. Turning chip.off for now.
  • 09:54 stopping stdscience to restart
    • cleared two diff faults:
      difftool -updatediffskyfile -set_quality 14006 -fault 0 -skycell_id skycell.2466.010 -diff_id 510366
      difftool -updatediffskyfile -set_quality 14006 -fault 0 -skycell_id skycell.2466.010 -diff_id 510375
    • 10:05 restarted, chip.off, poll set to 600
  • 10:15 restarted pstamp; added a set of compute3 hosts to work on the MOPS stamps.
    • 11:04 MOPS stamps are done. pstamp is mostly idle now; the QUB jobs need images to be updated.
  • 14:24 Cleanup and distribution processing are proceeding very slowly. I suspect the rsync jobs are part of the problem, but cleanup's pcontrol is spinning severely. Setting cleanup to stop for a bit.
  • 15:30 changed the interval between gpc1 mysql dumps on ippdb03 from 4 to 6 hours. The previous file had not finished its copy before the next one started, leading to checksum comparison problems (see the sketch after this day's entries).
  • 15:40 restarted cleanup pantasks. Set sts chip and warp data that has already been bundled for distribution to be cleaned.
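
A minimal sketch of the checksum check behind the 15:30 note above -- verifying that the previous gpc1 dump finished copying before the next dump starts. The paths, filename, and destination directory are placeholders, not the actual ippdb03/ipp001 layout:

      # hypothetical paths -- compare the source dump against the remote copy
      src=/export/mysql-dumps/gpc1.latest.sql.gz
      md5sum "$src"
      ssh ipp001 "md5sum /export/backups/gpc1/$(basename $src)"
      # if the sums differ (or the remote file is missing), the previous copy
      # had not finished and the next dump should wait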

Tuesday : 2013-12-31

Bill is czar today

  • 06:45 summit copy and registration are stuck. Didn't get very much data last night
    • set all pantasks to stop
    • a registration process hung on ipp056 (cannot communicate with nfs server ipp029); killed it
    • restarted registration and summit copy pantasks. Many task timeouts in the status page before restart. Took ipp056 out of the host list.
    • 06:55 summit copy and registration are proceeding now.
    • summit copy has several stuck processes on ipp056. Will reboot it.
  • 07:05 restarted stdscience. Set pantasks to run, except for cleanup, which also has a number of timeouts. ipp056 seems OK after the power cycle.
  • 07:14 registration is quite slow. The number of enabled hosts is small. Right now burntool is backed up waiting for one chip on exposure 64. The job is in the controller queue but wants to run on ipp047, which is set to state off, so it has to wait. Once that job goes, a burst of exposures should finish registration.
  • 07:30 The critical rsyncs are going too slowly. Gene is removing more hosts from processing. Removing STS and LAP labels from stdscience for the time being.
  • 09:20 Turned sts label back on, but with chip.off
  • 09:59 killed the scp of the 00:05 gpc1 database backup to ipp001. The 06:05 backup completed at 09:30 and was also still copying. Changed the backup schedule to twice a day (04:00 and 16:00) for now (see the sketch after this day's entries).
  • 10:06 ipp051 crashed, cycling power. On console: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
    • after reboot ganglia shows that there was a spike in network activity on the node right before the crash at 09:34
  • 11:56 current host configuration really starves distribution. Adding two sets of compute3s and compute2. Since stack is idle, this should be fine. Will remove before nightly science starts up.
  • 12:04 started cleanup pantasks and turned chip.on
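
Assuming the gpc1 dump and copy to ipp001 are driven from cron, the twice-a-day schedule noted at 09:59 would look roughly like the entry below; the script name and log path are placeholders, not the actual setup:

      # hypothetical crontab entry on ippdb03: dump + copy at 04:00 and 16:00
      0 4,16 * * *  /home/ipp/bin/gpc1_backup.sh >> /var/log/gpc1_backup.log 2>&1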

Wednesday : 2014-01-01

Bill is continuing to watch things today

  • 09:50 restarted stdscience pantasks. 48 sts 2010 chipRuns left to process. When those finish, there will be only ~1100 from 2009 to do. Will wait for feedback from MPG for those, so I should be able to start up LAP again this afternoon.
  • 18:29 restarted stdscience with the LAP label included. 71 sts warps left.

Thursday : 2014-01-02

  • 06:00 EAM : burntool failed on one imfile. I reset the data_state and it continued fine.
  • 11:00 Bill : stack is stopped. I'm debugging another stats failure
    • 11:24 stack set to run

Friday : 2014-01-03

Mark is czar

  • 07:20 MEH: looks like something is stalling camera processing, >3-13ks jobs... chip+warp >1ks, cleanup jobs also >>1ks
    • stdsci prep for regular restart
    • stopping cleanup and replication to look for slowdown
  • 11:30 MEH: removing LAP label from stdsci; it isn't making any progress anyway, and this may help ease the load for jobs in preparation for the machine move
  • 16:00 MEH: looks like ipp052 may have crashed @15:45.. power cycled and back up ipp052-crash-20140103
  • 20:10 MEH: nightly is already 50 exposures behind and MD chips are taking >1ks; the rate is going to be low for nightly processing..
  • 23:30 MEH: 175 exposures behind downloading, first warps finally done..
    • as Gene suggested, will start calling kill -STOP on the running neb_rawOTA_host_scan.pl processes on ipp033-048 (see the sketch after this day's entries)
      --> 033,034,035
      sudo kill -STOP 336
      sudo kill -STOP 23473
      sudo kill -STOP 15042
      --> 037,038,039 
      sudo kill -STOP 22934
      sudo kill -STOP 11329
      sudo kill -STOP 29598
      --> 046,047,048
      sudo kill -STOP 24219
      sudo kill -STOP 14106
      sudo kill -STOP 24630
      
    • MOPS stamps also stalling.. restarting pstamp to help
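
For reference, a generic way to do the 23:30 kill -STOP step without copying PIDs by hand; the host list and process name are taken from the note above, but this is just a sketch (pkill -f matches the script name, and pkill -CONT resumes the scans afterwards):

      # pause any running neb_rawOTA_host_scan.pl on each node
      for h in ipp033 ipp034 ipp035 ipp037 ipp038 ipp039 ipp046 ipp047 ipp048; do
          ssh $h 'sudo pkill -STOP -f neb_rawOTA_host_scan.pl'
      done
      # later, resume them:
      #   for h in ...; do ssh $h 'sudo pkill -CONT -f neb_rawOTA_host_scan.pl'; done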

Saturday : 2014-01-04

  • 10:05 Bill restarted update pantasks. Added a set of compute3 nodes to help with MOPS backlog
    • 10:49 undid that. New MOPS requests that don't need updates have been submitted.

Sunday : 2014-01-05

  • 07:00 MEH: like yesterday, registration seemed to be stalled waiting for a summitcopy fault to clear, but it had to be cleared manually with pztool -clearcommonfaults. The final 60 exposures are going through registration now.