PS1 IPP Czar Logs for the week 2013.08.19 - 2013.08.25


Monday : 2013.08.19

Bill is czar today

  • 04:15 restarted pstamp and update pantasks
  • 04:30 restarted distribution and publishing. (There were multiple pantasks servers running for these directories)
  • 09:00 set quality on a diffSkyfile that couldn't get a PSF; it had a very small number of pixels covered. There are two warpstack diffs that fault with this error, but those skycells have good coverage. Leaving them faulted in case someone wants to investigate.
    • Here are the details on the diffs
This is the one with the tiny number of pixels

difftool -updatediffskyfile -set_quality 14006 -fault 0 -diff_id 466363 -skycell_id skycell.2221.065

These are the 3PI warpstack diffs. Looking at the inputs, it appears the telescope moved during the exposure. Don't know
why only these two skycells failed to get a PSF when the rest of the exposure worked.

difftool -updatediffskyfile -set_quality 14006 -fault 0 -diff_id 466503 -skycell_id skycell.1332.013
difftool -updatediffskyfile -set_quality 14006 -fault 0 -diff_id 466503 -skycell_id skycell.1332.056

  • 09:10 There were 3 LAP camRuns in state update whose corresponding chipRuns were not updating; the chipRuns were in state cleaned. All of the chipProcessedImfiles already had state update, which I think causes the -setimfiletoupdate command to do nothing. Set the chipRun states to update and they are running now (see the sketch at the end of today's entries).
  • 14:39 stopping stdscience in preparation for periodic restart
  • 14:45 restarted stdscience, registration, and summit copy pantasks.
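
A minimal sketch of the 09:10 chipRun fix above, assuming direct SQL against the gpc1 database with chipRun and chipProcessedImfile keyed by chip_id and a state column as in the note (table and column names follow that wording and are not verified against the schema; the chip_ids and connection options are placeholders). The usual route is through chiptool, but the state change itself looks like this:

# Sketch only: inspect the stuck runs, then push the chipRun state back to update.
# chip_id values are placeholders; connection options are omitted.
mysql gpc1 -e "
    SELECT c.chip_id, c.state AS chip_state, i.state AS imfile_state
      FROM chipRun c JOIN chipProcessedImfile i USING (chip_id)
     WHERE c.chip_id IN (111111, 222222, 333333);"
mysql gpc1 -e "
    UPDATE chipRun SET state = 'update'
     WHERE chip_id IN (111111, 222222, 333333) AND state = 'cleaned';"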

Tuesday : 2013.08.20

Bill is czar today but woke up late

  • 07:05 MEH: looks like /tmp on some of ippc01, 03, 04, 08, 09 has little or no space available, which is stalling processing -- no time to fix properly; cleaned up a little space, maybe enough to crawl things along..
  • 09:50 Bill: processing stopped.
    • deleted /tmp/nebulous_server.log on a number of nodes. ippc03 still has no space free; Gavin will reboot it single-user to investigate
    • fixed now. There must have been a process that still had the deleted nebulous_server.log open
    • deleted all of the nebulous_server.log files on ippc01 - ippc10 and restarted apache (see the cleanup sketch at the end of today's entries)
  • 10:05 setting pantasks to run
  • 10:14 oops, forgot to start apache on ippc05; fault fest ensued
  • 11:12 warps are backed up wanting stsci04; bumped the unwant parameter to 30
  • 13:43 stdscience's pcontrol is spinning. Stopping stdscience in preparation for restart
  • 14:02 The glockfile slowness is back. A number of the storage nodes, in particular ipp023 and ipp024, are running the older kernel. Stopping processing to fix.
  • 14:18 all pantasks restarted. chip.off in stdscience
  • 14:48 That's better. stdscience warps are done. Turned chip.on
  • 14:57 changed the label for pending publish runs that aren't going to be processed from ThreePi.WS.nightlyscience to ThreePi.WS.nightlyscience.todrop
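
A minimal sketch of the /tmp cleanup described above, assuming the log sits at /tmp/nebulous_server.log on ippc01 - ippc10 and that apache is managed via its init script (the restart command and required privileges are assumptions). Truncating instead of deleting avoids the deleted-but-still-open problem seen on ippc03:

# Sketch only: free /tmp on each ippc node without orphaning an open log file.
for n in $(seq -w 1 10); do
    host=ippc${n}
    ssh ${host} 'df -h /tmp; : > /tmp/nebulous_server.log; /etc/init.d/apache2 restart'
done

Truncating with ": >" keeps the inode in place, so a process that still holds the file open cannot pin the freed space the way the deleted-but-open log did.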

Wednesday : 2013.08.21

  • 10:02 Bill: restarted pstamp and update pantasks.
  • 10:11 CZW: removed ipp052 and ippc63 from processing so they will be clear for work by Haydn at MHPCC.
  • 11:00 CZW: daily stdscience restart.

Thursday : 2013.08.22

  • 06:40 Bill: Registration got stuck. XY67 was in a strange state; see below. It seemed to clear up after fixing the burntool state of exposure 646624 XY67 using ipp_apply_burntool_single.pl
mysql> select exp_id, class_id, data_state, burntool_state, fault from rawImfile where exp_id >= 646621 and class_id = 'xy67' order by dateobs limit 20;
+--------+----------+------------------+----------------+-------+
| exp_id | class_id | data_state       | burntool_state | fault |
+--------+----------+------------------+----------------+-------+
| 646621 | XY67     | full             |            -14 |     0 | 
| 646622 | XY67     | full             |            -14 |     0 | 
| 646624 | XY67     | full             |              0 |     0 | 
| 646623 | XY67     | full             |            -14 |     0 | 
| 646625 | XY67     | full             |            -14 |     0 | 
| 646627 | XY67     | full             |            -14 |     0 | 
| 646626 | XY67     | full             |            -14 |     0 | 
| 646629 | XY67     | full             |            -14 |     0 | 
| 646628 | XY67     | full             |            -14 |     0 | 
| 646630 | XY67     | full             |            -14 |     0 | 
| 646631 | XY67     | full             |            -14 |     0 | 
| 646632 | XY67     | full             |            -14 |     0 | 
| 646648 | XY67     | full             |            -14 |     0 | 
| 646633 | XY67     | full             |            -14 |     0 | 
| 646634 | XY67     | pending_burntool |              0 |     0 | 
| 646635 | XY67     | pending_burntool |              0 |     0 | 
| 646637 | XY67     | pending_burntool |              0 |     0 | 

  • 07:25 Bill: stdscience has a large number of processes that have been running for more than 20,000 seconds.
  • 08:30 Bill: ipp047 is down. No messages on the console. Starting to power cycle it.
  • 08:37 ipp047 is not booting after power cycle. Setting it to down in nebulous
  • 12:30 CZW: After a number of passes, it looks like all of the hung mounts of ipp047 have been cleared (see the unmount sketch at the end of today's entries). I'm going to restart processing and see if we can get things running cleanly again.
  • 12:52 CZW: czarpoll seemed not to be running; I've restarted it on ippc11 as described on the documentation page.
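
A minimal sketch of how hung mounts from a down server can be found and cleared, assuming the ipp047 exports are NFS-mounted on the processing nodes. The lazy-unmount approach is standard Linux practice, not a record of the exact commands used here:

# Sketch only: run on each affected node; find mounts served by ipp047 and lazy-unmount them.
# umount -l detaches the mount point even while processes are still stuck in it.
for m in $(awk '$1 ~ /^ipp047:/ {print $2}' /proc/mounts); do
    echo "detaching ${m}"
    umount -l "${m}"
done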

Friday : 2013.08.23

Mark is czar today

  • 10:05 MEH: nightly SSdiffs finished, time for regular restart of stdsci
    • with ipp047 down, all sorts of missing replicated instances are appearing -- 142 chip imfiles -- chip.revert.off
    • 49 image replicants are non-existent, but rogue copies are being found
    • 95 are likely burn.tbl files
  • 13:10 MEH: looking for disk space -- nodes that could be turned back on from repair (see the space-check sketch at the end of today's entries)
    • ipp017 -- needed a new kernel build before using again; built, so trying to put it back up
    • ipp018 -- note about a statd problem on 7/1 -- does the new kernel resolve it? -- set up
    • ipp012 -- in repair since 5/5 without a clear note why -- set up
    • ipp013 -- in repair since 8/13 due to mount issues with ipp052 -- set up
    • ipp014 -- in repair since 5/22 without a clear note why -- set up
    • ipp007 - ipp010 -- were planned for DVO-related use, but other nodes are being used instead -- set up; they can be worked back into processing if needed
    • all 20TB disks remain in repair for the copy to the stsci machines for the move
  • 13:45 all processing stopped for Chris to rebuild ippTools. Also removing the goto_cleaned.rerundiff label from cleanup since it is backing up normal cleanup.
  • 17:40 MEH: PSS has no disk space remaining (COB on Friday, of course) -- Bill is cleaning up space
  • 22:50 MEH: appears ipp033 crashed about an hour ago; nothing on the display. It is already in neb-host repair; taking it out of processing until later
  • 23:30 MEH: killing off some long-running tasks; nightly is slowly catching up
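
A minimal sketch of the kind of space check behind the 13:10 list above, assuming the candidate nodes answer ssh and that their data volumes are mounted under /export (the mount path is an assumption):

# Sketch only: report free space on each node being considered for return to service.
for host in ipp007 ipp008 ipp009 ipp010 ipp012 ipp013 ipp014 ipp017 ipp018; do
    echo "=== ${host} ==="
    ssh ${host} df -h /export
done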

Saturday : 2013.08.24

  • 07:05 MEH: typical warp backlog, chip.off for a bit
  • 08:40 MEH: nightly done, doing the regular restart of stdsci
  • 13:00 Bill: looks like M31 has run into a burst of broken burntool instances. Turning chip.revert off for a while.
    • 14:40 recovered a few lost raw instances and rebuilt some burntool tables, but got bored. Decided to try moving forward. M31 data up to 2011-08-31 is queued.
    • Turning chip reverts on. The repeating faults fail quickly.

Sunday : 2013.08.25

  • 09:00 MEH: looks like LAP and M31 have faulted to a standstill -- doing the regular restart of stdsci and then will see if we can get things moving again
    • 3 imfiles only exist on ipp047 and need to be fixed when it is back online
      neb:///gpc1/20111120/o5885g0116o/o5885g0116o.ota67.fits
      neb:///gpc1/20111125/o5890g0407o/o5890g0407o.ota67.fits
      neb:///gpc1/20111126/o5891g0203o/o5891g0203o.ota67.fits
      
  • 11:00 MEH: some LAP fixed; while here, doing burn.tbl recovery for M31 where possible
  • 23:40 MEH: registration stalled.. a manual revert with pztool and regtool seems to have cleared it. 40 exposures behind; registration is catching up