PS1 IPP Czar Logs for the week 2012.01.30 - 2012.02.03

(Up to PS1 IPP Czar Logs)

Monday : 2012.01.30

Roy is czar

  • 08:00: Everything downloaded, 3PI and MDFs through the system

Tuesday : 2012.01.31

Mark is czar

  • 00:00 standard science lagging/under utilized, restarting.
  • 07:00 LAP fully stalled. fixing 2 non-existent files (set_quality 42 and set_ignored on chip_ids 397155 and 397536, both XY15) and a corrupted warp mask file stalling a stack (run runwarpskycell.pl with --warp_id 254273 --skycell_id skycell.2361.049); sketch below.
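    A rough sketch of the fixes; the combined chiptool flag set for set_quality/set_ignored is an assumption, modelled on the chiptool call recorded under the 22:30 entry below, and the runwarpskycell.pl arguments are as given above:
    # mark the two chips with missing files as quality 42 / ignored (hypothetical flag combination)
    chiptool -dropprocessedimfile -set_quality 42 -set_ignored -chip_id 397155 -class_id XY15 -dbname gpc1
    chiptool -dropprocessedimfile -set_quality 42 -set_ignored -chip_id 397536 -class_id XY15 -dbname gpc1
    # regenerate the corrupted warp mask so the stalled stack can proceed
    runwarpskycell.pl --warp_id 254273 --skycell_id skycell.2361.049
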
  • 07:38 looks like all data downloaded and through except for one ThreePi off-night diff
  • 07:45 odd case: LAP waiting for warps to finish when there are no chips/warps left to run (lap_ids 2745,2746,2747,2749); out-of-sequence cleaning again? 2745 has a fake in state new. setting exp_id 197295 to drop in those 4 lap_ids per the 2-day rule (fake_id=353973)
  • 08:00 Bill repaired a corrupt chip file ("Successfully rebuilt chip 393617 XY55") and a raw file (repair_bad_instance -c xy26 -e 228020 -r). Set exp_id 82627, XY31 to be ignored since all instances have been lost.
  • 10:05 Bill: MPG has reported that there are 46 2011 M31 exposures that have fallen through the cracks, caused by the loss of the ipp064 raid, limitations of the distribution client, and intercontinental miscommunication between Johannes and Bill. Rather than try to regenerate the bits for the existing runs, he has decided to rerun them from scratch. Queued them with label M31.V4.rerun, which, since it has no Label entry, will run at the very highest priority. Set all of the existing M31.v4.2010-2011 data to be cleaned.
  • 11:10 oddities with stdscience pantasks, restarting.
  • 12:00 all LAP up to date and running
  • 14:36 M31.V4.rerun is complete. The missing exposures were g, z, and y band. These weren't processed previously because I thought we were only observing in r and i.
  • 15:30 Bill started 40 addstar runs with label STS.PP5; the goal is to build a reference catalog for one of the STS fields.
  • 15:50 Bill started up a magic_cleanup pantasks on ipp053. This should restore a bit of space on ipp053.
  • 17:11 magicRuns have had their workdirs cleaned. pantasks.off. There are still some runs in state error_cleaned; Bill will look at those later.
  • 19:30 summitcopy getting 503 errors; okay ~10 min later and slowly catching up. (half-stare night, don't want to get behind..)
  • 19:50 working on stalled LAP chips so the system keeps busy while stare data is taken. a fun assortment of missing-instance, zero-sized, and broken MEF files and burntool tbl files...
  • 20:00 nebulous replication 24ks behind..
  • 21:15 looks like ipp055 decided to freak out.. load and swap-space overload, nothing on console. rebooted ok.
  • 22:30 following the method in the czar logs from 1/23, had tried to fix the burntool table failures but used the incorrect previous-exposure directory (i.e., neb://ipp043.0/gpc1/20100605/o5352g0471o/o5352g0470o.ota25.fits instead of neb://ipp043.0/gpc1/20100605/o5352g0470o/o5352g0470o.ota25.fits) and didn't catch it until after the waiting chips had been reverted and processed. Fixed the table files, but these chip_ids may need to have quality 42 set (not sure how it worked with an improper uri, or what the result was); a sketch for the remaining chips follows below.
    exp_id=177853, chip_id=398181
    neb://ipp043.0/gpc1/20100605/o5352g0471o/o5352g0471o.ota25.burn.tbl
    
    exp_id=175908, chip_id=398115 
    neb://ipp051.0/gpc1/20100601/o5348g0248o/o5348g0248o.ota64.burn.tbl
    
    exp_id=176051, chip_id=398040  
    neb://ipp040.0/gpc1/20100602/o5349g0047o/o5349g0047o.ota14.burn.tbl
    
    -- still fails with a wrong-version error, so set quality to 42
    neb://ipp044.0/gpc1/20100605/o5352g0506o/o5352g0506o.ota30.burn.tbl
    chiptool -dropprocessedimfile -set_quality 42 -chip_id 398287 -class_id XY30 -dbname gpc1
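    If the three chips listed above also end up needing quality 42, the same chiptool pattern would presumably apply (hypothetical, not run; class_ids inferred from the OTA names in the burn.tbl paths):
    chiptool -dropprocessedimfile -set_quality 42 -chip_id 398181 -class_id XY25 -dbname gpc1
    chiptool -dropprocessedimfile -set_quality 42 -chip_id 398115 -class_id XY64 -dbname gpc1
    chiptool -dropprocessedimfile -set_quality 42 -chip_id 398040 -class_id XY14 -dbname gpc1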
    
  • 23:50 ~800 of 1100 stare exposures still behind.. won't be downloading normal science for processing until late morning at this rate unless we use the method Bill used before to put stare images into a wait state.

Wednesday : 2012.02.01

Mark is czar today

  • 01:00 fixing a few more LAP MEF errors and a burn.tbl file.
  • 06:40 Gene set the stare data to wait so science data can be downloaded
  • 07:10 looks like ipp062 has been down for ~1.2 hrs and stalled everything.
  • 07:30 registration was also stuck on a stare exposure (neb://stare04.1/gpc1/20120201/o5958g0657a/o5958g0657a.ota25.fits); restarted, and it looks like science exposures are registering now.
  • 07:50 but burntool is stalled. not sure how to fix; reading the wiki and sent email. -- someone fixed it? how?
  • 09:15 Gene added ipp027 back into stdscience; need to keep an eye on it for problems.
  • 11:00 nightly science has been running for a few hours now; seems like we're seeing an inordinate number of camera faults in 3PI and MD04 so far. reverting seems to eventually get them through.
  • 12:05 ipp054 has been down for 10mins.. ganglia shows massive overload and overuse of swap space. rebooting, running again
  • 12:55 looks like the science data has downloaded; setting the stare data state back from wait to run (after stopping summitcopy, checking that no jobs were loaded, and doing a query first to confirm only exposures from last night are affected). 645 to download.
    update pzDownloadExp set state ='run' where state = 'wait' and exp_name like '%a';
    
  • 13:15 looks like nightly science is finished. looking into queuing up the SSdiffs for the MD fields.
  • 13:30 summitcopy died? restarting.. stare data still downloading, ~600 to go
  • 15:20 MD SSdiffs running using a similar tweak to the one Bill used a few days ago (server input tweak_ssdiff). to reset the trange setting, -reset can be used.
  • 16:00 ipp058 died ~30 mins ago.. rebooting ok. same problem as the others..
  • 18:25 stopped summitcopy to set last night's incompletely downloaded stare data back to wait, but should've started earlier since LED and dark exposures remain as well.
    update pzDownloadExp set state ='wait' where state = 'run' and exp_name like '%a';
    
  • 18:50 ipp059 down.. no info on console. rebooting.. okay. also have top running for all wave4 nodes on the desktop, so will see what the last state was.
  • 19:10 also setting all the LED exposures from earlier to wait
    update pzDownloadExp set state ='wait' where state = 'run' and exp_name like '%l' and epoch>"2012";
    
  • 19:45 noticed registration wasn't doing anything with new 3PI data at start of night. restarted.

Thursday : 2012.02.02

  • 01:00 Mark: looks like ipp061 crashed ~2h ago, stalling things. rebooting... unending dump to console. about 2 hrs behind in download now. will another wave4 system go down in 2-4 hrs? Bill had stopped psphotStack, so it looked like ppStack may be the cause, as the console dumps for a couple of the crashes suggested. -- psphotStack was actually still running (it seems to take a while to flush), but it was a mix of psphotStack and ppStack wanting memory.
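    A quick way to see what is eating memory on a wave4 node before it tips over (generic shell sketch; the hostname is just an example):
    ssh ipp061 'ps aux --sort=-rss | head -n 15'   # largest resident-memory processes (look for ppStack/psphotStack)
    ssh ipp061 'free -m'                           # remaining RAM and swap
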
  • 03:00 Mark: looks to have finally caught up with nightly science and is running ok. once summitcopy is done downloading in the morning, getting the remaining stare data from the other night and the LED exposures should just need their states set back to run:
    update pzDownloadExp set state ='run' where state = 'wait' and exp_name like '%a';
    update pzDownloadExp set state ='run' where state = 'wait' and exp_name like '%l' and epoch>"2012-";
    
  • 08:50 Serge: Nightly science download and processing complete. I ran both previous sql statements on gpc1.
  • 09:05 Serge: Nebulous replicant is a bit late (more than 100000 sec). I killed the current dump.
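    One way to check how far the nebulous replicant is lagging (a sketch; assumes ippdb02 is the replica of ippdb00, as noted further down in this log):
    mysql -h ippdb02 -e 'SHOW SLAVE STATUS\G' | grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'
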
  • 09:30 Bill: repaired a bunch of broken chips. Turned chip revert off. It's almost always a waste of effort to try again these days.
  • 10:55 Serge: Stopped all processing for Haydn/Rita to check fans. Set neb-host down for ippc08, ipp025, ipp017, ipp015
  • 12:35 Serge: Stopped the mysql server on ippdb00 to rename /var/log/mysql/mysqld.slow.log (~29GB!) to mysqld.slow.log.20120202.bz2 after bzipping it. The server was restarted immediately afterwards.
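    Roughly what that amounts to on ippdb00 (a sketch; the init-script path is an assumption for this host):
    /etc/init.d/mysql stop                        # stop mysqld so the slow log is closed cleanly
    cd /var/log/mysql
    mv mysqld.slow.log mysqld.slow.log.20120202   # rename the ~29GB slow-query log
    bzip2 mysqld.slow.log.20120202                # compress to mysqld.slow.log.20120202.bz2
    /etc/init.d/mysql start                       # bring the server straight back up
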
  • 15:25 Serge: neb-host ippc08, ipp025, ipp017, ipp015 set to up. ipp027 was lost. Processing (except Heather's stuff) restarted.
  • 16:00 Serge: update and pstamp restarted after ippc17 went down.
  • 16:05 Serge: ippdb02/ippdb00 replicant has caught up with its master.
  • 17:00 Serge: ippc17 crashed again. At 18:00 restarted update and pstamp.
  • 18:00 Serge: all exposures from last night have been downloaded.
  • 21:00 Mark: helped clear off ppStacks running on the stare nodes for Paul Sydney for the stare night. he had only been using the ippdor/condor job-stop script, so reminded him about both:
    ssh ipp@stare03 stare_nodes.sh off
    ssh ippdor@stare03 stare_nodes_off 
    

Friday : 2012.02.03

  • 08:30 Serge: ippc17 crashed again last night. /var/log/mysql/mysqld.err says that some tables of ippRequestServer need repair, namely dsFileset, pstampDependent, and pstampJob. The replicant ippc19 was crashed by the REPAIR TABLE commands.
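    The repairs were presumably along these lines, run against the mysql server on ippc17 (exact invocation assumed):
    mysql ippRequestServer -e 'REPAIR TABLE dsFileset, pstampDependent, pstampJob;'
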
  • 08:51 Serge: All fixed, even though http://pstamp.ipp.ifa.hawaii.edu/status.php claims that "The IPP is currently shut down for system maintenance".
  • 10:35 Serge: I remembered we had to run 'target.on' in replication pantasks to start shuffling.
  • 13:30 Serge: Dumped ippRequestServer from ippc17 and reingested it into ippc19. Pstamp replication is working again.
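    Roughly the dump-and-reload used to resync the replicant (a sketch; hosts and credentials assumed, and pstamp replication then needs restarting):
    mysqldump -h ippc17 ippRequestServer > ippRequestServer.sql   # dump the repaired copy from the master
    mysql -h ippc19 ippRequestServer < ippRequestServer.sql       # reingest it on the replicant
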
  • 17:00 Mark: kicking LAP while it's down.. stacks held up by corrupted mask/weight files, fixing with runwarpskycell.pl. the normal mess of 10 chips with corrupt MEFs, burn.tbl files, and missing files, with a bonus stack stuck in new and 8 stuck in distribution.
  • 18:00 Mark: ippc17 down, no info on console. looks like it's being overused again? it had 2 instances in the stack pantasks that shouldn't have been there. set them to off..
  • 18:20 Mark: also noticed Paul Sydney running jobs on the stare nodes and that stdscience had been turned back on there. turning it off again.

Saturday : 2012.02.04

  • 09:30 morning coffee and LAP: 7 chips and 2 stacks stuck, enough to bring it to a halt
  • 18:00 many LAP stack faults, which seem to succeed on revert. IO issues, or from running on wave4 with staticsky?

Sunday : 2012.02.05

  • 15:00 stdscience not doing much, restarting. Paul Sydney is running on the stare nodes, so be sure to turn them off in pantasks!