PS1 IPP Czar Logs for the week 2011-09-12 - 2011-09-18

Monday : 2011-09-12

  • Around 10am Bill finally got all of the pieces checked into the branch for the ppMops memory reduction fix. pantasks were restarted.
  • 11:00 adjusted the pstamp.dependent.run task to run less aggressively. This should reduce the database load that it causes.
  • 15:00 CZW: Reworked host definitions to be more equitable (hopefully) and to run the processors we have as hard as possible without crashing anything. New definitions for all servers are stored in /home/panstarrs/ipp/ippconfig/pantasks_hosts.input

Tuesday : 2011-09-13

  • 09:50 Mark (czar): removed from nebulous and processing stopped on ipp021 for Cindy to upgrade the motherboard.
  • 13:00 ipp021 back online. Added back nebulous and pantasks.
  • 15:10 ipp026 went down. Removed from nebulous list until back up. Kernel panic similar to what has happened before; Chris rebooted and added details to ipp026-crash-20110913. Put back into nebulous in same state as before: repair.
  • 21:30 registration trouble: the crash below was reported in pantasks.stdout.log, and the command was then run manually
    crash for: ipp_apply_burntool_single.pl --exp_id 392018 --class_id XY04 --this_uri neb://ipp006.0/gpc1/20110914/o5818g0058o/o5818g0058o.ota04.fits --continue 10 --previous_uri neb://ipp006.0/gpc1/20110914/o5818g0057o/o5818g0057o.ota04.fits --dbname gpc1 --verbose
    

Wednesday : 2011-09-14

  • 06:00 Mark: stalled at o5818g0440o, ipp_apply_burntool_single.pl running for 8ks so killed.
    • stalled again o5818g0448o with check_burntool and ota27 but picked itself up. ipp018 was having trouble connecting to ipp007.0.
  • 11:45 excessive CPU use by distribution pcontrol, restarted distribution.
  • 12:00 diff faulting from error reading FITS file /data/ipp042.0/nebulous/48/67/1297415775.gpc1:ThreePi.nt:2011:09:14:o5818g0508o.392470:o5818g0508o.392470.wrp.254273.skycell.2361.049.mask.fits. Regenerated with
    perl ~ipp/src/ipp-20110622/tools/runwarpskycell.pl --warp_id 254273 --skycell_id skycell.2361.049 --redirect-output 
    
  • 12:10 diffim (diff_id=165304) running on ipp026 for 46ks, ppSub hanging. Killed ppSub to fault and revert. ipp026 has had timeouts to ippb00,01,02 in the past (seen in dmesg, not sure when). Diff completed.
  • 12:21-12:56 Serge: stopped pstamp; dumped ippRequestServer to /export/ippc17.0/ipp/mysql-dumps/ippRequestServer.20110914.sql ; all done in less than 2 minutes. Master coordinates: mysqld-bin.000610, 505678724. Dump copied to /export/ippc19.0/pstamp_replication. Stopped slave on ippc19. Dropped existing database on ippc19. Ingested dump. Changed master coordinates. Restarted slave.
  • 12:30 Mark: stdscience pcontrol on ippc16 100%, restarting stdscience now that last night's data finished (and start habit of restarting regularly to see if improves rates). Waiting for jobs to finish.
  • 13:10 took longer to flush stdscience than normal. Also a hanging warp on ipp026 (warp_id=254454). stdscience now restarted.
  • 13:20 diffim repeatedly faulting (diff_id=165610, skycell_id skycell.0982.067) as described in PS1_IPP_czarLog_20110627 for LAP diff 141693. Set quality=42, fault=0
    difftool -updatediffskyfile -diff_id 165610 -skycell_id skycell.0982.067 -set_quality 42 -set_fault 0 -dbname gpc1
    
  • 14:16 Bill: experimenting with pantasks parameters. In the update pantasks, changed LOADEXEC from the default 5 seconds to 20 seconds and raised POLLLIMIT from 32 to 64. The goal is to see whether the database load is noticeably reduced.
  • 14:18 Set LOADEXEC to 30 and POLLLIMIT to 32 in the cleanup pantasks. The previous POLLLIMIT was 200, which is silly since the jobs take a long time.
  • 14:34 It turns out that LOADEXEC is applied when the task is created and is not subsequently updated. Restarted the update pantasks.
  • 17:11 CZW: After wondering why none of the lapRuns were completing, I tracked down a stuck magicRun (magic_id = 204696). Since I could see jobs to do by calling magictool -toprocess, I tried resetting the book (magic.reset) in the distribution pantasks. This appears to have unstuck this magicRun.
  • 18:00 Mark: 16 remain faulted in magicDS for ThreePi.nightlyscience because of a missing Skychip.psf table in the diffim CMFs.
    neb://ipp043.0/gpc1/destreak/ThreePi.nightlyscience/392499/diff/392499.mds.697053.165578.skycell.1338.011.log
    
    Regenerated CMF with
    perl ~ipp/src/ipp-20110622/tools/rundiffskycell.pl --redirect-output --diff_id 165578 --skycell_id skycell.1338.011
    
    with same odd/bad result.
  • 21:00 Reran a sample ppSub (skycell.1338.011, diff_id=165578) from the 3PI magicDS failing set with the missing SkyChip.psf table, redirecting output to a local directory. It produced the table, with few (7) detections.
  • 21:48 CZW: I merged the registration bugfix into the working branch. This of course means that a bug popped up elsewhere. summit_copy.pl exited with a "CRASH" state, which seems to have left a bad entry in the book. Manually running the commands that crashed:
    summit_copy.pl --uri http://conductor.ifa.hawaii.edu/ds/gpc1/o5819g0063o/o5819g0063o36.fits --filename neb://ipp044.0/gpc1/20110915/o5819g0063o/o5819g0063o.ota36.fits --summit_id 388294 --exp_name o5819g0063o --inst gpc1 --telescope ps1 --class chip --class_id ota36 --bytes 51831360 --md5 5ce24a1da3a713695bfabd72fa6df8c8 --dbname gpc1 --timeout 600 --verbose --copies 2 --compress --nebulous
    summit_copy.pl --uri http://conductor.ifa.hawaii.edu/ds/gpc1/o5819g0065o/o5819g0065o04.fits --filename neb://ipp006.0/gpc1/20110915/o5819g0065o/o5819g0065o.ota04.fits --summit_id 388296 --exp_name o5819g0065o --inst gpc1 --telescope ps1 --class chip --class_id ota04 --bytes 49432320 --md5 6a33bfd0cab134dcfb2563431cabef3f --dbname gpc1 --timeout 600 --verbose --copies 2 --compress --nebulous
    

    cleared up the problems, and burntool started running and finishing registration for subsequent exposures.
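
The 12:21 replica re-seed on ippc19 can be summarized as a short statement sequence. The sketch below only builds the slave-side SQL text. It assumes the standard MySQL replication statements (STOP SLAVE, CHANGE MASTER TO, START SLAVE); the function name is hypothetical, and the dump ingest is left as a comment because the exact command used is not recorded in this log.

```python
def reseed_slave_statements(database, log_file, log_pos):
    """Return the slave-side SQL for re-seeding a replica from a fresh dump.

    Hypothetical helper: it only formats text and does not talk to MySQL.
    """
    return [
        "STOP SLAVE;",
        f"DROP DATABASE IF EXISTS {database};",
        # The dump taken on the master is ingested at this point (not shown).
        f"CHANGE MASTER TO MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={int(log_pos)};",
        "START SLAVE;",
    ]


# Master coordinates as recorded in the 12:21 entry.
for statement in reseed_slave_statements("ippRequestServer", "mysqld-bin.000610", 505678724):
    print(statement)
```

Keeping the coordinates as explicit arguments makes it harder to restart the slave against a stale binlog position by accident.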

Thursday : 2011-09-15

Serge is czar

  • 09:00 Serge: nightly processing finished but a few 3pi at destreak stage. Reverted 4 errors in publishing.
  • 10:00 Mark: still tracking down the 16 or so 3PI magicDS failures such as
    failure for: magic_destreak.pl --magic_ds_id 697058 --camera GPC1 --exp_id 392505 --streaks_path_base neb://any/gpc1/20110914/o5818g0543o.392505/o5818g0543o.392505.mgc.204600 --inv_streaks_path_base neb://any/gpc1/20110914/o5818g0566o.392528/o5818g0566o.392528.mgc.204611 --streaks NULL --inv_streaks NULL --stage diff --stage_id 165583 --component skycell.1519.088 --uri NULL --path_base neb://ipp009.0/gpc1/ThreePi.nt/2011/09/14/RINGS.V3/skycell.1519.088/RINGS.V3.skycell.1519.088.dif.165583 --cam_path_base NULL --cam_reduction NULL --outroot neb://ipp009.0/gpc1/destreak/ThreePi.nightlyscience/392505/diff --logfile neb://ipp009.0/gpc1/destreak/ThreePi.nightlyscience/392505/diff/392505.mds.697058.165583.skycell.1519.088.log --recoveryroot neb://any/gpc1/destreak/recover/ThreePi.nightlyscience --replace T --magicked 0 --run-state new --dbname gpc1 --verbose
    
    failed to read table in /data/ipp009.0/nebulous/69/f3/1297857751.gpc1:ThreePi.nt:2011:09:14:RINGS.V3:skycell.1519.088:RINGS.V3.skycell.1519.088.dif.165583.cmf. Chris suggested renaming the .cmf to .cmf.bak. So ran
    neb-mv neb://ipp009.0/gpc1/ThreePi.nt/2011/09/14/RINGS.V3/skycell.1519.088/RINGS.V3.skycell.1519.088.dif.165583.cmf neb://ipp009.0/gpc1/ThreePi.nt/2011/09/14/RINGS.V3/skycell.1519.088/RINGS.V3.skycell.1519.088.dif.165583.cmf.bak
    
    and reran
    perl ~ipp/src/ipp-20110622/tools/rundiffskycell.pl --redirect-output --diff_id 165583 --skycell_id skycell.1519.088
    
    still failed to produce a detection table. Running it multiple times did eventually produce one, with 3 detections. So it appears that when there are 0 detections an empty table is not written; not sure why. The following lists the 17 that originally failed magicDS at the diff stage; 3 are marked as still failing because of the missing detection table
    --stage_id 165557 --component skycell.2414.004
    --stage_id 165586 --component skycell.1340.067
    --stage_id 165583 --component skycell.1519.088
    --stage_id 165519 --component skycell.2411.062
    --stage_id 165543 --component skycell.2505.023
    --stage_id 165527 --component skycell.2411.052
    --stage_id 165580 --component skycell.1429.028
    --stage_id 165574 --component skycell.1517.056
    --stage_id 165610 --component skycell.0982.016 -- still problem
    --stage_id 165609 --component skycell.0896.063
    --stage_id 164809 --component skycell.1693.014 -- still problem
    --stage_id 165551 --component skycell.2622.095 -- still problem
    --stage_id 164754 --component skycell.2019.073
    --stage_id 165638 --component skycell.1050.064
    --stage_id 165537 --component skycell.2576.075
    --stage_id 164754 --component skycell.2019.084
    --stage_id 165547 --component skycell.2543.053
    
    Not sure whether the reverts will succeed once the diff detection table is fixed.
  • 11:50 Serge: stopped, shutdown and restarted distribution
  • 11:58 Heather: restarted stack; it crashed on me. I suspect it was because I was running 'status' too frequently. I also added a small number of stacks for testing: MD09.haf
  • 14:23 CZW: restarted stdscience, partially to see if that would kick processing rates, partially to add a rate adjustment for the LAP monitor stage to see if that is overloading the database.
  • 16:06 Serge: killed ppMops on ippc12 (-exp_name o5814g0052o)
  • 17:30 Bill: set all runs with label ps_ud% to goto_cleaned. This should free up about 900 chip runs worth of space
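
The on-disk path and the neb:// URI quoted in the 10:00 entry appear to be related by a simple rule: drop the numeric prefix from the file name and turn the ":" separators into "/". The sketch below encodes that rule. It is inferred only from the paths in this log, not from the nebulous code, and the function name is hypothetical.

```python
import re

# Layout assumed from this log: /data/<volume>/nebulous/<xx>/<yy>/<number>.<key>
_STORAGE_PATH = re.compile(r"^/data/([^/]+)/nebulous/[0-9a-f]{2}/[0-9a-f]{2}/(\d+)\.(.+)$")


def disk_path_to_neb_uri(path):
    """Guess the neb:// URI for a nebulous storage path (assumed mapping)."""
    match = _STORAGE_PATH.match(path)
    if match is None:
        raise ValueError(f"not a recognised nebulous storage path: {path}")
    volume, _numeric_prefix, key = match.groups()
    # Key components are ':'-separated on disk and '/'-separated in the URI.
    return f"neb://{volume}/" + key.replace(":", "/")


print(disk_path_to_neb_uri(
    "/data/ipp009.0/nebulous/69/f3/1297857751.gpc1:ThreePi.nt:2011:09:14:"
    "RINGS.V3:skycell.1519.088:RINGS.V3.skycell.1519.088.dif.165583.cmf"
))
```

For the path from the 10:00 entry this reproduces the URI that was passed to neb-mv; whether the volume in a URI always matches the volume holding a given copy is not something this log establishes.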

Friday : 2011-09-16

Serge is czar

  • No observation last night
  • 10:00 Serge: all processing stopped for ipp020 motherboard replacement
  • 10:50 Serge/Mark: can't connect to ipp044
  • 11:00 Mark/Bill: found a LAP chip fault due to being unable to access /data/ipp053.0/nebulous/c6/dc/423950490.gpc1:20100830:o5438g0382o:o5438g0382o.ota76.fits. Checked directory and copied from ippb00
    ls -l /data/ipp053.0/nebulous/c6/dc
    cp /data/ippb00.1/nebulous/c6/dc/1128537216.gpc1:20100830:o5438g0382o:o5438g0382o.ota76.fits /data/ipp053.0/nebulous/c6/dc/423950490.gpc1:20100830:o5438g0382o:o5438g0382o.ota76.fits 
    
  • 11:10 Serge: stopped all apache servers
  • 12:00 Cindy is finished with ipp020 / Serge is not finished with mysql
  • 13:33 Serge: gpc1 optimization is finished
  • 13:48 Serge: stopped nebulous optimization and restarted (in this order): replication slaves; apache servers; czarpoll and roboczar; pantasks.
  • 14:40 Serge: reports for optimization are attached to this page. Optimization of gpc1 lasted 2 hours 51 minutes. It seems that the time required to optimize a table is roughly 1.5 times what it was 6 months ago (see attachments at http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/201103_Optimization)
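
Before copying one stored file over another, as in the 11:00 entry, it may be worth checking that both paths really are copies of the same object. In the paths in this log, copies share the two hash directories and everything after the leading numeric prefix of the file name. The sketch below checks exactly that; the layout is an assumption drawn from these examples, and the helper names are hypothetical.

```python
import os


def nebulous_key(path):
    """Return the file name with its leading numeric prefix removed."""
    prefix, _, key = os.path.basename(path).partition(".")
    if not prefix.isdigit() or not key:
        raise ValueError(f"unexpected file name: {path}")
    return key


def hash_dirs(path):
    """Return the two directory names directly above the file."""
    return path.split("/")[-3:-1]


def same_object(src, dst):
    """True when both paths carry the same key and the same hash directories."""
    return nebulous_key(src) == nebulous_key(dst) and hash_dirs(src) == hash_dirs(dst)


src = "/data/ippb00.1/nebulous/c6/dc/1128537216.gpc1:20100830:o5438g0382o:o5438g0382o.ota76.fits"
dst = "/data/ipp053.0/nebulous/c6/dc/423950490.gpc1:20100830:o5438g0382o:o5438g0382o.ota76.fits"
print(same_object(src, dst))
```

This is a guard against typing the wrong destination; it says nothing about whether the source copy itself is intact.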

Saturday : 2011-09-17

  • 09:00 Mark: registration stuck with ~80 exposures left. Regpeek.pl reported o5821g0478o.ota11.fits was in state check_burntool and pantasks.stdout.log reported a config_error for exp_id 393399. Ran regtool and registration is now finishing up.
    regtool -updateprocessedimfile -exp_id  393399 -class_id XY11 -set_state pending_burntool -dbname gpc1
    
  • 11:00 LAP chip fault, funpack error on /data/ipp053.0/nebulous/8b/3b/423999272.gpc1:20100830:o5438g0438o:o5438g0438o.ota76.fits, missing file neb://ipp053.0/gpc1/20100830/o5438g0438o/o5438g0438o.ota76.fits. Ran
    cp /data/ippb01.0/nebulous/8b/3b/880997586.gpc1:20100830:o5438g0438o:o5438g0438o.ota76.fits /data/ipp053.0/nebulous/8b/3b/423999272.gpc1:20100830:o5438g0438o:o5438g0438o.ota76.fits
    
  • 19:50 set ipp033 to repair after Cindy reported it in a degraded state.
  • 22:50 burntool/registration stalled for the past ~30min; regpeek reported neb://ipp007.0/gpc1/20110918/o5822g0237o/o5822g0237o.ota05.fits and registration/pantasks.stdout.log reported a system failure for: register_imfile.pl --exp_id 39374. Catching up after running
    regtool -updateprocessedimfile -exp_id  393746 -class_id XY05 -set_state pending_burntool -dbname gpc1
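
Both registration stalls today were cleared with the same regtool call, differing only in exposure and chip. A minimal sketch that assembles that command line from the exp_id and the stuck file name is shown below. It uses only the flags that appear in the entries above; the otaNN to XYNN mapping is taken from the examples in this log, and the function name is hypothetical.

```python
import re
import shlex


def pending_burntool_command(exp_id, ota_file, dbname="gpc1"):
    """Build the regtool command that resets one imfile to pending_burntool."""
    match = re.search(r"\.ota(\d{2})\.fits$", ota_file)
    if match is None:
        raise ValueError(f"cannot find an otaNN chip name in: {ota_file}")
    class_id = "XY" + match.group(1)  # e.g. ota11 -> XY11, as in the 09:00 entry
    args = [
        "regtool", "-updateprocessedimfile",
        "-exp_id", str(exp_id),
        "-class_id", class_id,
        "-set_state", "pending_burntool",
        "-dbname", dbname,
    ]
    return " ".join(shlex.quote(arg) for arg in args)


print(pending_burntool_command(393399, "o5821g0478o.ota11.fits"))
```

The function only returns the string, so the command can be reviewed before it is pasted into a shell.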
    

Sunday : 2011-09-18

  • 08:00 Mark: removed ipp033 from processing and nebulous so Cindy could reboot and re-seat a disk.
  • 08:30 ipp033 pressed back into service.
  • 09:30 another case of a 0-size file among 3 copies
    -rw-rw-r-- 1 apache 23503680 Jun 17 05:20 /data/ipp016.0/nebulous/aa/0b/1015497360.gpc1:20110617:o5729g0569o:o5729g0569o.ota03.fits
    -rw-rw-r-- 1 apache 0 Jul 22 10:46 /data/ipp006.0/nebulous/aa/0b/1123250971.gpc1:20110617:o5729g0569o:o5729g0569o.ota03.fits
    -rw-rw-r-- 1 apache 23503680 Jul 27 04:27 /data/ippb00.0/nebulous/aa/0b/1136967851.gpc1:20110617:o5729g0569o:o5729g0569o.ota03.fits
    
    copied a valid copy over the 0-size file for now
    cp /data/ipp016.0/nebulous/aa/0b/1015497360.gpc1:20110617:o5729g0569o:o5729g0569o.ota03.fits /data/ipp006.0/nebulous/aa/0b/1123250971.gpc1:20110617:o5729g0569o:o5729g0569o.ota03.fits
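
The check done by hand at 09:30 (list the copies, spot the 0-size one, pick a good copy as the source) can be sketched as follows. This is an illustration only: it looks at file sizes and nothing else, it performs no copy, and the function names are hypothetical.

```python
import os


def split_replicas(paths):
    """Separate copy paths into (empty, non_empty) by on-disk size."""
    empty, non_empty = [], []
    for path in paths:
        (empty if os.path.getsize(path) == 0 else non_empty).append(path)
    return empty, non_empty


def repair_plan(paths):
    """Return (source, target) pairs that would overwrite each 0-size copy.

    Refuses to guess when the non-empty copies disagree in size.
    """
    empty, non_empty = split_replicas(paths)
    if not empty or not non_empty:
        return []
    sizes = {os.path.getsize(path) for path in non_empty}
    if len(sizes) != 1:
        raise ValueError(f"non-empty copies differ in size: {sorted(sizes)}")
    donor = non_empty[0]
    return [(donor, target) for target in empty]
```

Called with the three paths listed above, this would propose the ipp016.0 copy as the source for the ipp006.0 copy, which matches the cp that was run. Equal sizes do not prove equal content, so a checksum comparison would be a sensible extra step where an md5 is available.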
    
