PS1 IPP Czar Logs for the week 2015.05.04 - 2015.05.10


Monday : 2015.05.04

  • ippc18 is off - this causes its cron jobs not to run, as well as ganglia (so this is a warning)
  • haf : 21:00 registration is stuck; reset the following imfiles to pending_burntool:
        20:53  regtool -updateprocessedimfile -exp_id 909463 -class_id XY37 -set_state pending_burntool -dbname gpc1
        20:53  regtool -updateprocessedimfile -exp_id 909463 -class_id XY42 -set_state pending_burntool -dbname gpc1
        20:53  regtool -updateprocessedimfile -exp_id 909463 -class_id XY62 -set_state pending_burntool -dbname gpc1
        21:50  regtool -updateprocessedimfile -exp_id 909566 -class_id XY37 -set_state pending_burntool -dbname gpc1
    

Tuesday : 2015.05.05

  • 03:50 Bill: Registration is falling behind. There are 3 register_imfile.pl processes that have been running for 10 hours. I am beginning to investigate.
    • those processes were not an issue. They were all from one dark exposure that had already finished registration; the scripts must have gotten stuck while exiting.
    • However, exp_id 909635 XY31 was stuck in the check_burntool data_state. Set it back to pending_burntool and registration is now catching up.
  • 05:40 Bill: registration is stuck again: 3 apply_burntool single processes have been running for > 6000 seconds (XY42, XY75, and XY17). The hosts running the processes, ipp004, ipp010, and ipp054, are the same ones that were stuck previously. There is probably an nfs problem.
    • killed the burntool processes that were inexplicably stuck, and the system fixed the burntool_state on its own (a sketch of this kind of process hunt is at the end of today's entries).
    • set hosts ipp004, ipp010, and ipp054 to off in the registration pantasks. They seem to keep hanging.
  • 12:29 Bill: set pstamp pantasks to stop in preparation for a restart. There are lots of faults in the status that are distracting.
    • 12:36 restarted pstamp pantasks
  • 12:40 CZW: restarting pv3fflt/rt.
  • 20:12 HAF: registration stuck
    • regtool -updateprocessedimfile -exp_id 909868 -class_id XY02 -set_state pending_burntool -dbname gpc1
    • regtool -updateprocessedimfile -exp_id 909877 -class_id XY06 -set_state pending_burntool -dbname gpc1
    • regtool -updateprocessedimfile -exp_id 909885 -class_id XY06 -set_state pending_burntool -dbname gpc1
    • regtool -updateprocessedimfile -exp_id 909887 -class_id XY06 -set_state pending_burntool -dbname gpc1
    • regtool -updateprocessedimfile -exp_id 909890 -class_id XY06 -set_state pending_burntool -dbname gpc1
      
      
  • 21:35 HAF : ipp010 / ipp054 / ipp029 had stuck mounts; gene fixed them (mostly) - ipp010 and ipp029 are fixed, ipp054 and ippc19 are still unhappy, but better than we were..
  • 00:45 MEH: registration >200 behind; it doesn't look like any gpc1 data has processed in a while (though gpc2 has been processing). Seems to be stuck on
    o7148g0073o  XY06 -14 pending_burntool neb://ipp091.0/gpc1/20150506/o7148g0073o/o7148g0073o.ota06.fits  
    
    • looks like this is full but got set to pending -- doing
      regtool -updateprocessedimfile -exp_id 909890 -class_id XY06 -set_state full -dbname gpc1
      
  • 01:10 MEH: since I'm here, also going to do a restart of registration; it has been running for a few days
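  • For reference, a rough sketch of how the long-running apply_burntool jobs mentioned at 05:40 can be found and killed by hand (assumes standard procps ps plus awk/xargs on the processing hosts; the 6000 s threshold and the process name pattern are taken from that entry, so adjust as needed):
        # list burntool-related processes with their elapsed time in seconds
        ps -eo pid,etimes,cmd | grep '[a]pply_burntool'
        # after checking the list by eye, kill anything running longer than ~6000 s
        ps -eo pid,etimes,cmd | grep '[a]pply_burntool' | awk '$2 > 6000 {print $1}' | xargs -r kill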

Wednesday : 2015.05.06

  • 03:50 Bill: registration is still proceeding very slowly. It is 258 exposures behind with burntool. Each burntool process for the 5 or so chips that are behind seems to take several minutes to complete. However, there are 30 or so chipRuns in the queue, so the system isn't completely stuck at this time. The processing rate is only about 20-25 exposures per hour, though.
  • 06:00 EAM: still way behind, and Ken Smith reports problems with postage stamps. I'm going to stop all processing and attempt to clear out hung mounts throughout the system and restart apache servers as needed.
  • 06:30 EAM: problem hosts (hung df): ipp054 (rebooting now), ippc11, ippc13, ippc19, ippc29, ippc32, ippc33. i'll try to clear mounts on these and reboot as needed (e.g. ippc19); a hung-mount probe sketch is at the end of today's entries. update: also stsci00, stsci01.
    • MEH: remember ippc29 isn't a normal processing machine, please give notification before rebooting
  • 6:45 HAF: came late to the game, noticed all the pantasks are off - I assume they will get restarted once stuff is cleared up (by gene?) - i'll stay out of the way now..
  • 07:10 EAM : only ippc13 (failed to reboot) and ippc19 (hung mounts) are now a problem. I'm restarting just the ipp services
  • 07:11 EAM : update : some ippx nodes are also still having a problem -- they do not see ippc19.
  • 07:40 EAM : i forgot to remove ipp004 from the pantasks hosts. it is being used by haydn for ippc18 recovery work and has an excessive load. i'm stopping things now to take it out cleanly
  • 07:55 EAM : everything for ipp user is up and running again
  • 10:50 EAM : burntool got stuck at exp 348, but i've cleared the problem and things are moving along now.
  • 15:00 CZW: regtool -updateprocessedimfile -set_state pending_burntool -exp_id 910276 -class_id XY23
  • 17:45 EAM: ippc30 crashed; i am rebooting now
  • 17:50 HAF: I have been shepherding registration for gene (I offered). Here are the fixes from the past 2 hours:
    regtool -updateprocessedimfile -exp_id 910328 -class_id XY14 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 910330 -class_id XY57 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 910330 -class_id XY54 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 910329 -class_id XY54 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 910334 -class_id XY54 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 910339 -class_id XY72 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 910343 -class_id XY72 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 910345 -class_id XY72 -set_state pending_burntool -dbname gpc1
    regtool -updateprocessedimfile -exp_id 910351 -class_id XY72 -set_state pending_burntool -dbname gpc1
    910351 XY72 ended up in a race state, so set it to full instead:
    regtool -updateprocessedimfile -exp_id 910351 -class_id XY72 -set_state full -dbname gpc1
    regtool -updateprocessedimfile -exp_id 910354 -class_id XY72 -set_state pending_burntool -dbname gpc1
    
  • 20:20 EAM: things are still running quite slowly, with many gpc1 errors. I am stopping pantasks so I can restart the gpc1 mysql server.
  • 20:30 EAM: after ippdb05/mysql (gpc1) restart, things seem perhaps a bit better. burntool jobs are completing more quickly than before. I still cannot get ippc19 to release ippc18.0, however.
  • 20:55 EAM: ippc26 claimed to be down in nebulous, but it is actually responding (load of 30). it was hanging up on /data/ippc30.0 (why?); in the end, i had to reboot it anyway.
  • 21:20 EAM: pstamp is currently offline : ippc30 is down so pstamp is just making machines hang on the mount point for /data/ippc30.1.
  • 21:23 HAF:
    regtool -updateprocessedimfile -exp_id 910602 -class_id XY52 -set_state pending_burntool -dbname gpc1
    
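  • For reference, a rough sketch of the kind of per-mount probe and lazy unmount used when df hangs on a dead NFS export (the mount points below are illustrative, taken from the ippc18.0/ippc30.0 mentions above; assumes GNU coreutils timeout and root access):
        # probe each data mount with a timeout so a dead NFS server cannot hang the shell
        for m in /data/ippc18.0 /data/ippc30.0; do
            timeout 5 stat -t "$m" > /dev/null 2>&1 || echo "hung: $m"
        done
        # lazily detach a hung mount; jobs already blocked on it may still need to be killed
        umount -l /data/ippc30.0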

Thursday : 2015.05.07

  • 08:00 MEH: MD PV3 continues using c29, ippsXX, ippx037-044 with night stacks and can be stopped as necessary in ~ippmd/deepstack
  • 13:50 EAM: I rebooted ippc19 to clear out hung mounts (ippc18.0 and ippc30.0). this seems to have helped: query to ippMonitor is much quicker than before. I'm restarting ipp servers and pv3ff[lr]t servers.
  • 16:45 CZW: ran the following to clear out stuck diffs that were holding up GPC2 publishing completion:
    difftool -updatediffskyfile -diff_id 941 -skycell_id skycell.2192.004 -fault 0 -set_quality 42 -dbname gpc2
    difftool -updatediffskyfile -diff_id 959 -skycell_id skycell.2192.068 -fault 0 -set_quality 42 -dbname gpc2
  • 20:39 HAF : weirdness with registration - it's not picking up a couple of chips to burntool - stopping, and investigating
  • 20:50 EAM : registration was stuck on 2 chips from exp o7150g0057o - they had been running on ipp025 and ipp055 for the past hour, stuck on the db update step. i killed off those jobs and eventually regpeek showed them stuck in a normal way, at which point I updated the state.

Friday : 2015.05.08

  • 05:45 EAM : stsci10 was down, rebooting now.
  • 6:25 HAF: registration stuck: regtool -updateprocessedimfile -exp_id 911663 -class_id XY02 -set_state pending_burntool -dbname gpc1
  • 10:30 MEH: Serge reports a couple of exposure pairs missing from GPC2 processing:
    o7150h0088o-o7150h0107o
    o7150h0187o-o7150h0206o
    
    • looks like maybe an invisible fault 3 was blocking -- yes, the error was
       -> psDBAlloc (psDB.c:166): Database error originated in the client library
           Failed to connect to database.  Error: Unknown MySQL server host 'scidbm' (1)
       -> warptoolConfig (warptoolConfig.c:499): unknown psLib error
      
    • reverting cleared the warps; diffs were queued automatically and published
  • 14:10 EAM: mysql on ippdb05 has been restarted with innodb_buffer_pool_size = 128000M (128G); see the my.cnf sketch at the end of today's entries
  • 15:55 CZW: Added gpc2 database to ~ipp/cleanup pantasks. I'm also going to send old nightlyscience data to cleanup, as future data will be handled by the initday cleanup.
  • 18:25 CZW: restarting ~ippsky/pv3ff{lt|rt} pantasks.
  • 20:00 MEH: faulted warp updates stalling QUB stamps, revert label ps_ud_QUB
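  • The 14:10 buffer pool change corresponds to a my.cnf setting along these lines (a sketch only; the actual config file path on ippdb05 and the rest of the server tuning are not shown here):
        # my.cnf for the gpc1 mysql server on ippdb05 (file location varies by install)
        [mysqld]
        # keep ~128 GB of InnoDB data and index pages cached in RAM
        innodb_buffer_pool_size = 128000M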

Saturday : 2015.05.09

  • 11:45 EAM: stsci17 down, rebooting now
  • 18:25 MEH: ipp029 down for ~1 hour -- trying a power cycle, will take a bit (a tune2fs sketch is at the end of today's entries)
     * Checking root filesystem .../dev/sda3 has gone 278 days without being checked, check forced.
    
  • 18:30 MEH: doing the regular restart of nightly pantasks before nightly starts
  • 18:45 MEH: pstamp stalled, possibly because stsci08 has been down for the past ~4 hrs -- cleared and stsci08 rebooted; this seems to have also cleared the hanging FF jobs
  • 20:20 EAM: restarting pv3ff[rl]t
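  • The forced check on ipp029 comes from ext3/ext4's periodic fsck counters; a quick sketch of how to inspect (and, if desired, relax) them with tune2fs, assuming /dev/sda3 as reported in the boot message:
        # show the mount-count and time-interval check settings
        tune2fs -l /dev/sda3 | grep -Ei 'mount count|check'
        # example: disable the time-based check and force a check every 50 mounts instead
        tune2fs -i 0 -c 50 /dev/sda3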

Sunday : 2015.05.10

  • 12:45 Haydn power-cycled stsci01; then stsci04 went down and Haydn cycled it as well
  • 12:50 MEH: looks like ipp029 is also unresponsive again -- doing another power cycle
    • not booting; leaving power off -- neb-host down, taking it out of nightly processing, emailed group
    • may need to watch for mount issues when nightly processing starts
  • 13:18 MEH: looks like stsci04 is down again... and stsci14 -- did someone start something?
    • pstamp stalled by stsci04 being down -- jobs cleared when it came back up -- last time rebooting these, no time to play whack-a-mole with them..
    • leaving stsci04 neb-host down for an hour or so to see if it stays up okay..