PS1 IPP Czar Logs for the week 2011.10.17 - 2011.10.23

Monday : 2011.10.17

  • 11:00 - 11:15 update pantasks died. Restarted by bills. Distribution is backed up. Restarted pantasks.
  • 12:45 Mark: set ipp029 down in nebulous for Cindy to install new RAM.
  • 13:40 Added back to nebulous in repair state initially. Will set to up in nebulous and add back in gradually for testing.
  • 14:15 One stdscience job started on ipp029 ran for ~15min before it crashed.
  • 20:45 stdscience seems to be under-running, restarting.

Tuesday : 2011.10.18

  • 06:00 registration stalled. restarted pantasks and it stalled again. started running again after ~20 minutes.
  • 06:20 ippc13 down for 5hr. rebooted.
  • 06:25 warprun stalled
    Reading FITS file /data/ipp027.0/nebulous/c7/38/1467947204.gpc1:ThreePi.nt:2011:10:18:o5852g0316o.408936:o5852g0316o.408936.ch.327264.XY44.ch.wt.fits failed.
    
    perl ~ipp/src/ipp-20110622/tools/runchipimfile.pl --chip_id  327264  --class_id XY44 --redirect-output 
    
  • 06:30 registration now stuck with ipp033 mount to ipp028, running again after
    /usr/local/sbin/force.umount ipp028
    
  • 06:35 stdscience pantasks down. restarted...
  • 06:50 registration stalled again... restarted. summitcopy stalled with 26 images left. restarted. looks like o5852g0500o/o5852g0500o46.fits got stalled with mount problem and failed with
    failure for: summit_copy.pl --uri http://conductor.ifa.hawaii.edu/ds/gpc1/o5852g0500o/o5852g0500o46.fits --filename neb://ipp029.0/gpc1/20111018/o5852g0500o/o5852g0500o.ota46.fits --summit_id 404735 --exp_name o5852g0500o --inst gpc1 --telescope ps1 --class chip --class_id ota46 --bytes 51831360 --md5 7e8c954ded69e0537d465a1f457ebc26 --dbname gpc1 --timeout 600 --verbose --copies 2 --compress --nebulous
    
    summitcopy still not loading and running manually fails to insert row now. Entry looks okay, no fault. Talking with Bill to trace out why/how to fix.
  • 09:50 Bill traced it to variable CLASS_ID not found in summit_copy.pro and put in a temporary fix. Remaining exposures look to make it through to fake stage okay. In future would help to properly use the start_server.sh so logs get renamed.
  • 10:00 will delay shutdown for nebulous repair until remaining data looks to make it through and MOPS gets postage stamps.
  • 11:00 systems shutdown, apache off for nebulous/ippdb02 repair.
  • 11:20 nebulous mysqldump started (/export/ippdb00.0/sch/nebulous-20111018.sql). Master coordinates: mysqld-bin.003345 ; 30002581
  • 11:30 Cleaned ippdb02. Backup of user table in /export/ippdb02.0/mysql-dumps/backups/mysql.user.sql; cd /var/lib/mysql && rm -rf *. Archived /var/log/mysql contents (to /export/ippdb02.0/mysql-dumps/backups/20111018)
  • 16:30 dump not going to finish before 5pm. stopping it and restarting IPP processing.
  • 16:50 ipp036 crashed after processing restarted: Ipp036-crash-20111018T165000 (looks like a kernel panic, like ipp026?). It had gone 482 days w/o the root filesystem being checked, so a check was done.
  • 17:30 processing running smoothly again.
  • 21:00 MD04 g-band deep stack sample running. the deepstack pantasks now uses only the compute2 group (ippc20-c29), the compute group has been returned to distribution.

Wednesday : 2011.10.19

  • 00:39 ippc06 down. rebooted.
  • 09:20 stopped processing for nebulous mysqldump. Master coordinates mysqld-bin.003352 / 277806591
  • 09:21 Dump started /export/ippdb00.0/sch/nebulous_20111019.sql.bz2
  • 15:00 Cindy reported possibly 5 drives in ippb00 failed/faulted according to the manufacturer in past 2 weeks. ippb00 is going to be shut down to avoid any data loss until she receives more information from the manufacturer tomorrow.
  • 15:16 End of nebulous dump (5h57m to make the dump)
  • 16:00 Mark: ippc22 went down ~1 hr ago while the system was idle (during nebulous DB dump). Tried to bring it back up and was unable to. Odd characters on console. Cindy found a loose power connection on the PDU.
  • 16:05 Processing restarted. Copied nebulous dump on ippdb02 (/export/ippdb02.0/ipp/nebulous_20111019.sql.bz2) and ipp001 (/export/ipp001.0/ipp/mysql-dumps/nebulous_20111019.sql.bz2). For info: scp at 59.3MB/s from ippdb00 to ippdb02 / 30.6MB/s from ippdb02 to ipp001.
  • 16:10 Started ingestion on ippdb02
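
The master coordinates recorded with each dump are what a replica needs in order to resume from the point the dump was taken. A minimal sketch, assuming ippdb02 is to be re-attached as a MySQL replication slave once ingestion finishes (the log does not state this explicitly). The function name is hypothetical, and the statement is only printed, not executed:

```shell
# Print the CHANGE MASTER TO statement for a given binlog file and position.
# Assumption: ippdb02 will replicate from the nebulous master; host and
# credential settings are omitted because they are not recorded in this log.
print_change_master() {
    log_file=$1
    log_pos=$2
    printf "CHANGE MASTER TO MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;\n" "$log_file" "$log_pos"
}

# Coordinates taken from the 09:20 entry above.
print_change_master mysqld-bin.003352 277806591
```

The printed statement would be reviewed and then run in a mysql session on the replica before starting replication.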

Thursday : 2011.10.20

  • 19:00 summit copy seems to be down. restarting.
  • 19:30 Mark: new nightly data having trouble, ipp036 looks to be down again. console message at ipp036-crash-2011020T1900. powered off and on, waiting still for it to come back up. doesn't seem to be..
  • 21:30 with all the system issues, tried to push through any LAP stuck in destreak before nightly science gets there
    dist.revert.off
    destreak.off
    magicdstool -clearstatefaults -dbname gpc1 -label LAP.ThreePi.20110809 -set_state new -state failed_revert
    

Friday : 2011.10.21

  • 08:35 (About nebulous ingestion on ippdb02): `/3 of the 'instance' table has been ingested (according to the so_id).
  • 10:00 Mark: registration needed a kick
    o5855g0321o  XY06 0 check_burntool neb://ipp007.0/gpc1/20111021/o5855g0321o/o5855g0321o.ota06.fits
    
    regtool -updateprocessedimfile -exp_id 410550 -class_id XY06 -set_state pending_burntool -dbname gpc1
    
  • 10:30 looking into how to re-download the file from the summit that ended up on ipp036 last night. Not finding any that didn't have a second instance copy elsewhere.
  • 12:50 MD09 had 11 warps stuck, were actually OT ENGINEERING so set state to drop and added note
    warptool -dbname gpc1 -updaterun -set_state drop -set_note "OT testing, object not set ENGINEERING" -warp_id 290947
    
  • ippb00 kept down in nebulous. rsyncs running.
  • 15:20 seems nebdiskd was down on ippdb00, Gene restarted.
  • 16:00 working to kick LAP
  • 16:20 restarted stdscience, distribution, stack, update pantasks. setting revert.off until new news on ipp036 so not to fill up logs.
  • 16:25 looks like all waiting in distribution are files on ipp036, publish faults are also waiting on ipp036. If ipp036 not back online by Sunday will want to modify cleanup.
  • 16:40 magicDs fault on ThreePi camera from missing chip XY55 on ipp036 still
    failure for: magic_destreak.pl --magic_ds_id 775012 --camera GPC1 --exp_id 406580 --streaks_path_base neb://any/gpc1/20111013/o5847g0115o.406580/o5847g0115o.406580.mgc.226726 --inv_streaks_path_base NULL --streaks NULL --inv_streaks NULL --stage camera --stage_id 298409 --component exposure --uri NULL --path_base neb://any/gpc1/ThreePi.nt/2011/10/13//o5847g0115o.406580/o5847g0115o.406580.cm.298409 --cam_path_base neb://any/gpc1/ThreePi.nt/2011/10/13//o5847g0115o.406580/o5847g0115o.406580.cm.298409 --cam_reduction NULL --outroot neb://any/gpc1/destreak/ThreePi.nightlyscience/406580/camera --logfile neb://any/gpc1/destreak/ThreePi.nightlyscience/406580/camera/406580.mds.775012.298409.exposure.log --recoveryroot neb://any/gpc1/destreak/recover/ThreePi.nightlyscience --replace T --magicked 0 --run-state new --dbname gpc1 --verbose
    
  • 16:50 two LAP chips stuck in update.. likely due to chip data on ipp036, why not faulting?
  • 19:40 registration stalled, restarted. stare night but some normal 3PI data taken at beginning.
  • 20:00 summary of attempts to move LAP along
    - collection of XY76 chips having problems, setting quality to 42 but waiting to set corrupt until they can be looked at when ipp036/ippb00 are back up
    -- stuck in chipRun as update
     chiptool -updateprocessedimfile -set_state full -chip_id 328426 -class_id XY76 -dbname gpc1
     chiptool -updateprocessedimfile -fault 0 -set_quality 42 -chip_id 328426 -class_id XY76 -dbname gpc1
     chiptool -updateprocessedimfile -set_state full -chip_id 328452 -class_id XY76 -dbname gpc1
     chiptool -updateprocessedimfile -fault 0 -set_quality 42 -chip_id 328452 -class_id XY76 -dbname gpc1
     chiptool -updateprocessedimfile -set_state full -chip_id 328507 -class_id XY76 -dbname gpc1
     chiptool -updateprocessedimfile -fault 0 -set_quality 42 -chip_id 328507 -class_id XY76 -dbname gpc1
    -- didn't help, try and drop from the LAP set. no change
    laptool -updateexp -lap_id 1381 -exp_id 169255 -set_data_state drop
    laptool -updateexp -lap_id 1368 -exp_id 222467 -set_data_state drop
    -- also stuck in chipRun update due to XY55 on ipp036; does setting quality to 42 move it forward? no
     chiptool -updateprocessedimfile -set_state full -chip_id 308565  -class_id XY55 -dbname gpc1
     chiptool -updateprocessedimfile -fault 0 -set_quality 42 -chip_id 308565  -class_id XY55 -dbname gpc1
    -- stuck in destreak
     chiptool -updateprocessedimfile -set_state full -chip_id 328511  -class_id XY76 -dbname gpc1
     chiptool -updateprocessedimfile -fault 0 -set_quality 42 -chip_id 328511  -class_id XY76 -dbname gpc1
     chiptool -updateprocessedimfile -set_state full -chip_id 328571  -class_id XY76 -dbname gpc1
     chiptool -updateprocessedimfile -fault 0 -set_quality 42 -chip_id 328571  -class_id XY76 -dbname gpc1
     chiptool -updateprocessedimfile -set_state full -chip_id 328615  -class_id XY76 -dbname gpc1
     chiptool -updateprocessedimfile -fault 0 -set_quality 42 -chip_id 328615  -class_id XY76 -dbname gpc1
     chiptool -updateprocessedimfile -set_state full -chip_id 328641  -class_id XY76 -dbname gpc1
     chiptool -updateprocessedimfile -fault 0 -set_quality 42 -chip_id 328641  -class_id XY76 -dbname gpc1
     chiptool -updateprocessedimfile -set_state full -chip_id 328643  -class_id XY76 -dbname gpc1
     chiptool -updateprocessedimfile -fault 0 -set_quality 42 -chip_id 328643  -class_id XY76 -dbname gpc1
    -- 8 camera exposures stuck, try dropping one of them from LAP. still stuck
    laptool -updateexp -lap_id 1375 -exp_id 174295 -set_data_state drop
    
  • 22:00 not able to unstick LAP, looks like ipp036 hit each group, unsurprisingly. queuing up a new group to see if it makes it through. not with secondary instances on ippb00. loaded 2 manually to work through and they also stall
    laptool -definerun -dbname gpc1 -projection_cell skycell.1670 -tess_id RINGS.V3 -ra 327.273 -decl 14 -radius 5 -seq_id 8 -label LAP.ThreePi.20110809 -dist_group LAP.ThreePi -filter z.00000
    laptool -definerun -dbname gpc1 -projection_cell skycell.1754 -tess_id RINGS.V3 -ra 318.14 -decl 18 -radius 5 -seq_id 8 -label LAP.ThreePi.20110809 -dist_group LAP.ThreePi -filter z.00000
    
  • 00:30 the combination of ipp036 and ippb00 being down has basically stalled any reprocessing.
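
The repeated chiptool pairs in the 20:00 entry follow one pattern per chip: set the state to full, then clear the fault and set quality to 42. A minimal dry-run sketch of that pattern is below. The function name is hypothetical and the commands are only printed for review, not executed. The chiptool flags are copied verbatim from the entries above.

```shell
# Dry-run helper: print the two chiptool commands used in the 20:00 entry
# for each chip_id of a given class_id. Nothing is executed against gpc1.
# The function name is an illustrative assumption, not an IPP tool.
mark_chips_quality42() {
    class_id=$1
    shift
    for chip_id in "$@"; do
        echo "chiptool -updateprocessedimfile -set_state full -chip_id $chip_id -class_id $class_id -dbname gpc1"
        echo "chiptool -updateprocessedimfile -fault 0 -set_quality 42 -chip_id $chip_id -class_id $class_id -dbname gpc1"
    done
}

# XY76 chips stuck in chipRun as update
mark_chips_quality42 XY76 328426 328452 328507
# XY76 chips stuck in destreak
mark_chips_quality42 XY76 328511 328571 328615 328641 328643
```

Printing first makes it easy to check the chip list before pasting the commands into a shell. As the log notes, setting quality to 42 did not unstick these runs while ipp036 and ippb00 were down.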

Saturday : 2011.10.22

  • 12:45 full test of modifications Chris made for stare nights was successful. Night started with 3PI/MD, switched to stare, and finished with more 3PI. Downloading and processing finished.

Sunday : 2011.10.23