PS1 IPP Czar Logs for the week 2012.05.07 - 2012.05.13


Monday : 2012.05.07

  • 07:25 Mark: registration stalled? Had to run the following to get things moving along:
    regtool -updateprocessedimfile -exp_id 484476 -class_id XY43 -set_state pending_burntool -dbname gpc1
    
    -- burntool also seems to be taking 20ks, but not going to try to fix that until I get in
      0    ipp008    BUSY  21216.52 0.0.3.740b  0 ipp_apply_burntool_single.pl --exp_id 484466 --class_id XY43 --this_uri neb://ipp015.0/gpc1/20120507/o6054g0187o/o6054g0187o.ota43.fits --continue 10 --previous_uri neb://ipp015.0/gpc1/20120507/o6054g0186o/o6054g0186o.ota43.fits --dbname gpc1 --verbose 
    -- nothing in the registration log for this. Looks like it stalled on ipp008 in a neb-replicate for neb://ipp015.0/gpc1/20120507/o6054g0197o/o6054g0197o.ota43.burn.tbl
    -- which now has 3 instances, one with a different md5sum -- culled the 0-size one but that didn't unstick it
    -- unable to kill the neb-replicate. Not going to mess with it anymore; emailed the czar. Maybe the replication pantasks just needs a restart. Otherwise mostly everything registered and processing.
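    The cull decision above (several instances of one burntool table, one zero-size, one with a disagreeing md5sum) can be sketched as a small script. This is not IPP code; the instance tuples are hypothetical stand-ins for whatever the nebulous listing tools report, and the rule is simply: among non-empty copies, take the majority md5 as consensus and flag the rest.

```python
# Hedged sketch (not an IPP tool): pick out bad replicas among N instances
# of the same file -- zero-size copies and md5 outliers vs. the majority.
from collections import Counter

def bad_replicas(instances):
    """instances: list of (path, size_bytes, md5hex). Returns paths to cull."""
    # Majority vote on md5 among the non-empty copies.
    md5s = Counter(md5 for _path, size, md5 in instances if size > 0)
    if not md5s:
        return [path for path, _size, _md5 in instances]
    consensus = md5s.most_common(1)[0][0]
    return [path for path, size, md5 in instances
            if size == 0 or md5 != consensus]

# Hypothetical replica listing (paths abbreviated):
replicas = [
    ("neb://ipp015.0/.../o6054g0197o.ota43.burn.tbl", 48412, "ab12"),
    ("neb://ipp008.0/.../o6054g0197o.ota43.burn.tbl", 48412, "ab12"),
    ("neb://ipp027.0/.../o6054g0197o.ota43.burn.tbl", 0,     "d41d"),
]
print(bad_replicas(replicas))  # only the 0-size copy is flagged
```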
    
  • 10:00 heather: messing with addstars...
  • 14:05 Bill rebooting ipp008. It isn't playing nicely with others. Also will restart stdscience.

Tuesday : 2012.05.08

Bill is czar today.

  • 07:30 restarted summit copy. Set ipp008 to repair. It is bogging things down.
  • 09:56 summit copy is proceeding but running into "503 try again" faults from conductor
  • set newExp for exp_id 456051 456052 from December 15 to state 'drop'. They keep reverting and failing to register (c5972g0035l and c5972g0036l).

Wednesday : 2012.05.09

Mark czar today

  • 07:30 looks like nightly science finished downloading and processing.
  • 11:15 Chris rebuilt ops tag and restarted stdscience to reprocess SAS with changes (details)

Thursday : 2012.05.10

Mark is czar

  • 07:40 all nightly data downloaded and processed except for one chip on ipp008 that had been running for 36ks. Manually killed the ppImage on ipp008, reverted, and it continued through warp fine. Not a particularly crowded field at all; nothing out of the ordinary in the ipp008 log around the time it was originally running.
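    Spotting jobs like this one (a ppImage running for 36ks) can be done from the shell before deciding what to kill and revert. A minimal sketch, assuming GNU ps; the 36000-second threshold mirrors the stuck job above and should be tuned to taste:

```shell
# Hedged sketch: list ppImage processes running longer than 36 ks (10 h)
# so the czar can decide what to kill and revert.
THRESH=36000
ps -eo pid,etimes,comm --no-headers | awk -v t="$THRESH" \
    '$3 == "ppImage" && $2 > t { print "stuck:", $1, $2 "s" }'
```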
  • 10:30 connections from manoa to the production cluster seem to have been interrupted/lost for a bit (as was found earlier in the morning when I got in). ippc19 seems to not be coming back, or is down (can connect via console however, so it looks like a network issue). When connecting via console, noticed 2 pantasks_servers were running as ipp -- two replication servers??
  • 12:20 looks like Gavin had to reboot ippc19. restarting replication pantasks
  • 12:30 Chris doing daily restarting of stdscience and will start LAP processing.
  • 12:41 Bill restarted pstamp pantasks. It had many timeout errors in the status output.
  • 13:40 upped the stack.poll 100->200 now that there are ~140 hosts. Chris bumped up the unwant 5->7 in stdscience.
  • 15:10 Chris found an assortment of nightly science chips/warps missed in cleanup over the last year and sent them to cleanup now.
  • 19:00 looks like LAP got stuck: a chip has been running on ipp046 for 20ks and none of the 100 LAP chips are processing. Killed and reverted
      0    ipp046    BUSY  20845.46 0.0.0.45ca  0 chip_imfile.pl --threads @MAX_THREADS@ --exp_id 401130 --chip_id 450882 --chip_imfile_id 26878138 --class_id XY66 --uri neb://ipp046.0/gpc1/20110930/o5834g0380o/o5834g0380o.ota66.fits --camera GPC1 --run-state new --deburned 0 --outroot neb://ipp046.0/gpc1/LAP.ThreePi.20120510/2012/05/10/o5834g0380o.401130/o5834g0380o.401130.ch.450882 --redirect-output --reduction LAP_SCIENCE --dbname gpc1 --verbose 
    
  • 20:20 ipp017 has been overworked (load spikes >30) for the past couple of hours; ipp012 now looks to be experiencing the same thing. Is it Serge's python code causing the high CPU wait? Would nice help?
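    On the "would nice help?" question: one low-risk option is to renice the suspect processes rather than kill them, so nightly processing gets CPU priority back. A hedged sketch -- the username and process-name match below are assumptions to be verified with ps first:

```shell
# Hedged sketch: deprioritize a user's python jobs instead of killing them.
# "serge" and the "python" name match are assumptions; check with ps first.
for pid in $(pgrep -u serge python); do
    renice -n 10 -p "$pid"
done
```

    Note that only root can renice a process downward again, so start with a modest value like 10.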

Friday : 2012.05.11

Serge is czar

  • 05:45: All exposures downloaded and processed
  • 05:50: My python scripts are still running on about 25 hosts (ipp006-025; ipp060-066 except 064). The cpu wio is about 12.5% (i.e. 1 proc out of 8 is waiting?).
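    The parenthetical arithmetic checks out: on an 8-core host, one core fully stuck in iowait shows up as 1/8 of system-wide CPU time.

```python
# Sanity check of the wio arithmetic: one of eight cores waiting on I/O
# corresponds to 1/8 = 12.5% system-wide cpu wio.
cores = 8
cores_waiting = 1
wio_percent = 100.0 * cores_waiting / cores
print(wio_percent)  # 12.5
```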
  • 09:40: 3 failures for LAP.ThreePi.20120510:
    • o5517g0339o - XY26: no file on disk for nebkey gpc1/20101117/o5517g0339o/o5517g0339o.ota26.fits (or [...].fz).
      chiptool -dropprocessedimfile -set_quality 42 -chip_id 451565  -class_id XY26 -dbname gpc1
      regtool -updateprocessedimfile -set_ignored -exp_id 257297  -class_id XY26 -dbname gpc1
      
    • o5517g0341o - XY26: no file on disk for nebkey gpc1/20101117/o5517g0341o/o5517g0341o.ota26.fits (or [...].fz).
      chiptool -dropprocessedimfile -set_quality 42 -chip_id 451567 -class_id XY26 -dbname gpc1
      regtool -updateprocessedimfile -set_ignored -exp_id 257290 -class_id XY26 -dbname gpc1
      
    • o5523g0285o - XY26: The log says:
      error: libfh: problem with MEF structure
      
      error: Cannot read EXTNAME from `/tmp/chip.260332.XY26.deburned.yI9k.fits' for extension #28
      Unable to perform burntool: 193 at /home/panstarrs/ipp/psconfig//ipp-20120404.lin64/bin/chip_imfile.pl line 797
      	main::my_die('Unable to perform burntool: 193', 260332, 451774, 'XY26', 2) called at /home/panstarrs/ipp/psconfig//ipp-20120404.lin64/bin/chip_imfile.pl line 398
      

      but I don't know yet what to do with it.

  • Bill 10:11 One of the burntool table instances was bad for 260332/XY26. Copied the good one on top of the bad one.
  • heather 14:27 added SAStest.v4 to deepstack (for staticsky stuff) - also all the addstars are back up - addstarring of LAP is paused while I sort out dvodb problems.
  • 15:50 Mark: giving the SAStest.v4 label a priority equal to the refstacks that were already running in deepstack to see if they will play nicely together

Saturday : 2012.05.12

Sunday : 2012.05.13

  • 21:40 Mark: stdscience seems to be struggling, restarting.
  • 22:30 LAP also having trouble
    --> 2 exposures don't exist -- Bill reminded me to use whichnode; these were on ipp064.0X, so they are lost
    neb://ipp027.0/gpc1/20101109/o5509g0165o/o5509g0165o.ota26.fits
    neb://ipp027.0/gpc1/20101102/o5502g0492o/o5502g0492o.ota26.fits
    
    --> chip_id=451774 has 59 imfiles and seems to stall the fakeRun (fake_id 409798), which is probably stalling the i-band LAP. Seems to be a single instance of
    neb://ipp027.0/gpc1/20101123/o5523g0285o/o5523g0285o.ota26.fits
    
    --> so dropping the exposure from the runs for now; it has been past the 2-day limit. Not sure how to clean up or fix this properly
    laptool -updateexp -lap_id 3651 -exp_id 260332 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 3652 -exp_id 260332 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 3655 -exp_id 260332 -set_data_state drop -dbname gpc1
    laptool -updateexp -lap_id 3656 -exp_id 260332 -set_data_state drop -dbname gpc1 
    
    --> Chris discovered the chip is actually truncated, so XY26 should be set to ignored -- it had been checked as valid because it actually passes funpack
    chiptool -dropprocessedimfile -set_quality 42 -chip_id 451774 -class_id XY26 -dbname gpc1
    regtool -updateprocessedimfile -set_ignored -exp_id 260332 -class_id XY26 -dbname gpc1
    
    --> the exposure was cleaned after being dropped from LAP earlier, so after the updates and reverts it looks like the warp was made for future overlap, if any.