PS1 IPP Czar Logs for the week 2011.08.08 - 2011.08.14


Monday : 2011.08.08

Tuesday : 2011.08.09

  • 08:00: Roy: Registration stuck as usual. regpeek tells me offending exposure is o5782g0063o
  • 08:10: Bill and Heather: stopped the summitcopy pantasks, waited for the Nrun column to reach zero, then ran:
    copy.reset
    which reset the book with:
    book init pzPendingImfile
    (See the sketch at the end of this list for the full sequence.)
  • 10:38 Serge: Reverted failed publishing exposures
  • 11:40 Bill: changed the label on runs with data_group STS.201104rerun from STS.nightlyscience to 201104rerun. (Added the new label to the usual places.)
  • 12:03 Bill: queued STS diffs from 2011-08-06. I really need to automate this.
  • 14:19 Bill: reverted cam_id 245246, which has faulted twice with "unable to access file on ipp020" for two different files. There are diff faults for files on that node as well.
  • 14:25: Roy: processing slow, so stopping stdscience and...
  • 14:40: Roy: ...restarted again
  • 15:20: Roy: reverted stuck chips with:
camtool -revertprocessedexp -fault 3 -label ThreePi.nightlyscience -dbname gpc1
  • 15:40: Roy: ippc11 is down. Tried to reboot with the serial console:
    ssh rhenders:ippdb03@cab6con.ipp.ifa.hawaii.edu
    but blocked by another user.
  • 22:30 : eam : ipp014 crashed so I rebooted it. I suspected that the ippc11 crashes may have left some things stuck, so I restarted the full system.
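
For reference, the 08:10 summitcopy recovery amounts to roughly the following session in the summitcopy pantasks console (a sketch: only the commands quoted in that entry are from the log, the rest of the session is assumed):

    # in the summitcopy pantasks console:
    stop
    # ... watch "status" until the Nrun column reaches zero ...
    status
    copy.reset
    book init pzPendingImfile
    run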

Wednesday : 2011.08.10

  • 03:30 eam : summit copy failure. here is the error that occurred, leaving behind a crashed entry:
    Starting script /home/panstarrs/ipp/psconfig//ipp-20110622.lin64/bin/summit_copy.pl on ipp041
    
    Running [/home/panstarrs/ipp/psconfig/ipp-20110622.lin64/bin/dsget --uri http://conductor.ifa.hawaii.edu/ds/gpc1/o5783g0268o/o5783g0268o32.fits --filename neb://ipp021.0/gpc1/20110810/o5783g0268o/o5783g0268o.ota32.fits --compress --bytes 52842240 --nebulous --md5 26c60c4a3177cd43bcb803ca44e2ace5 --timeout 600 --copies 2]...
    downloading file to /tmp/o5783g0268o.ota32.fits.uznkmvcY.tmp
    can't sync nebulous filehandle: Input/output error at /home/panstarrs/ipp/psconfig/ipp-20110622.lin64/bin/dsget line 207.
    Unable to perform dsget: 5 at /home/panstarrs/ipp/psconfig//ipp-20110622.lin64/bin/summit_copy.pl line 236.
    
  • 08:00 Roy: All data downloaded from last night.
  • 09:10 Serge: reverted failed publishing
  • 09:35 Roy: killed some chip_imfile.pl instances that had been running for too long (see the sketch at the end of this list)
  • 10:40 Roy: warps stuck, so ran warp.reset in stdscience
  • 12:06 CZW: Investigating LAP faults led me to discover an imfile with no valid instances. It looks like this might have been lost when ipp020 lost its RAID (the original instance was there). Set the imfile state to make this visible for the future:
    regtool -updateprocessedimfile -exp_id 182839 -class_id XY54 -set_state corrupt
  • 13:19 CZW: Restarted registration pantasks to include bugfix to burntool host assignment.
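
A sketch of the kind of check behind the 09:35 entry: finding chip_imfile.pl instances that have been running too long and killing them by hand (the ps/kill invocation is an assumption, not the exact command used):

    ps -eo pid,etime,args | grep '[c]hip_imfile.pl'   # list instances with their elapsed run time
    kill <pid>                                        # replace <pid> with any instance running far too long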

Thursday : 2011.08.11

  • 00:40 Mark: setting up the MD01.V3 tessellation, so pausing any MD01 nightly processing until MD01.V3 is finished.
  • 09:00 Serge: In stdscience, "controller status" shows no tasks?! "status" doesn't display anything weird. I "stop"'d and "run"'d it without effect. I ran "warp.reset" and a few tasks were queued.
  • 11:30 Serge: Killed the diffs for diff_ids 153081 and 153091 and the warp for warp_id 233860. All had been running for too long. Note: all were running on ippc14.
  • 11:35 Serge: warp.reset, diff.reset
  • 12:45 Serge: unmounted /data/ipp014.0 on ippc14 (a lot of "server ipp014 not responding, timed out" messages in the ippc14 /var/log/messages file); see the sketch at the end of this list.
  • 14:00 Mark: set up MD10.GR0 for distribution (large)
  • 14:15 Serge: warp.off to fix --warp_id 232591 --skycell_id skycell.1837.003
  • 14:30 Serge: warp.on
  • 17:40 Mark: MD01.V3 tessellation set up for nightly processing (w/o diffs until the refstacks are done)
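
A sketch of the stale-mount cleanup from the 12:45 entry (the exact umount flags are an assumption; the log only records that the mount was removed):

    grep 'not responding' /var/log/messages | tail    # confirm which server is timing out (ipp014 here)
    umount /data/ipp014.0                             # may need -f (force) or -l (lazy) if the NFS server stays unreachable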

Friday : 2011.08.12

  • 08:00 Serge: All mops data published (16+16+2+37+26+48=145 exposures)
  • 15:49 Bill: stdscience was sluggish because a lot of nodes wanted to put data on ipp026, which is turned off in the host lists. Changed ipphosts.mhpcc to target ipp024 instead of ipp026. Restarted stdscience.
  • 18:51 Bill: interesting: restarting stdscience caused a number of M31 nightly stacks and ThreePi diffs to be queued.
  • 18:52 Bill: stopping stdscience to tweak the ippHosts again. Will restart.
  • 18:30 Mark: set MD10.GR0 chips to goto_cleaned to start clearing space for MD01.

Saturday : 2011.08.13

  • 02:00 Mark: queued up MD01-y chip-to-warp for the new reference stack (little nightly data and M31 mostly through).
  • 15:00 Mark: MD01 chip stage failing: an OTA file had 0 size. Copied over a good instance with (see the verification sketch at the end of this list):
    cp /data/ipp039.0/nebulous/3a/a1/457785155.gpc1:20100921:o5460g0266o:o5460g0266o.ota56.fits /data/ipp034.0/nebulous/3a/a1/951582978.gpc1:20100921:o5460g0266o:o5460g0266o.ota56.fits
    
  • 17:45 Mark: warps continuing to stall again around 2pm? Did a warp.reset; it seems to have finished the one remaining M31.nightly and is running the MD01.GR0 now.
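
A sketch of how the repaired instance from the 15:00 entry might be checked afterwards (these verification commands are an assumption, not part of the log):

    # the replaced copy should no longer have size 0
    ls -l /data/ipp034.0/nebulous/3a/a1/951582978.gpc1:20100921:o5460g0266o:o5460g0266o.ota56.fits
    # the two instances should now have matching checksums
    md5sum /data/ipp039.0/nebulous/3a/a1/457785155.gpc1:20100921:o5460g0266o:o5460g0266o.ota56.fits \
           /data/ipp034.0/nebulous/3a/a1/951582978.gpc1:20100921:o5460g0266o:o5460g0266o.ota56.fits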

Sunday : 2011.08.14

  • 13:15 EAM : around 10:30 I noticed that processing was stuck. It seemed that we were getting many nebulous-related errors. I discovered some possible problems with the /tmp/nebulous_server.log files and the apache access_log files that could have been causing this. I stopped all processing and shut down the neb apache servers on ippc01-09. I modified the apache config to limit MaxClients to 127, moved /var/log/apache2 to /export/$HOSTNAME/apache2, moved access_log to a saved name and recreated it with 0 size, and moved /tmp/nebulous_server.log to /var/log/apache2 (linked back to /tmp), zeroing it out as well. I also found a few zombie jobs on some machines and a hung nfs mount on ippc17. I restarted all processing at about 13:00. (A sketch of the per-node cleanup is below.)
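
The per-node cleanup described above is roughly the following shell sequence, run on each of ippc01-09 (a sketch: the init-script name, the config-file location, the ".save" name, and the symlink back from /var/log/apache2 are assumptions not spelled out in the entry):

    # stop the nebulous apache server (init-script name assumed)
    /etc/init.d/apache2 stop

    # set "MaxClients 127" in the apache MPM configuration (file location varies by install)

    # move the apache log directory onto local disk; the symlink back is an assumption
    mkdir -p /export/$HOSTNAME
    mv /var/log/apache2 /export/$HOSTNAME/apache2
    ln -s /export/$HOSTNAME/apache2 /var/log/apache2

    # keep the old access_log under a saved name and recreate it with 0 size
    mv /var/log/apache2/access_log /var/log/apache2/access_log.save
    : > /var/log/apache2/access_log

    # move the nebulous server log out of /tmp, link it back, and zero it
    mv /tmp/nebulous_server.log /var/log/apache2/nebulous_server.log
    ln -s /var/log/apache2/nebulous_server.log /tmp/nebulous_server.log
    : > /var/log/apache2/nebulous_server.log

    # restart once the config change is in place
    /etc/init.d/apache2 start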