(Up to PS1 IPP Czar Logs)

Monday : 2012-11-12

Mark is czar

  • 08:00 MEH: last of the nightly science is finishing up. LAP and other reprocessing are still hitting chip faults from permissions problems.
    • chip.off for a while to push more LAP warps
  • 11:30 still babysitting a large number of OSS diffims slowly making their way through.
  • 14:30 OSS and 3PI diffims finished, regular restart of stdscience done.
  • 21:00 MEH: poor weather so pushing more MD05 reprocessing through
  • 21:30 repaired lost files, replaced with found orphans

Tuesday : 2012-11-13

Serge is czar

  • 06:15: No observations last night. However, the ipp001 mysql server crashed; nothing in its log explains why. Replication is only 10 minutes behind, so this should not become a major incident.
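A quick way to watch the replication lag mentioned above (a sketch: the filter just parses standard `SHOW SLAVE STATUS\G` output; the function name is illustrative and connection options are omitted):

```shell
replication_lag() {
    # Reads "SHOW SLAVE STATUS\G" output on stdin and prints the
    # number of seconds the slave is behind the master.
    awk -F': ' '/Seconds_Behind_Master/ { print $2 }'
}
# typical use: mysql -e 'SHOW SLAVE STATUS\G' | replication_lag
```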
  • 14:00: Stopped all processing
  • 14:45: Restarting all rpc.statd on all storage nodes:
    schastel@ippc11:~$ echo $PYTHONPATH 
    schastel@ippc11:~/dev/SshSudo$ ./ssh_sudo.py -H `cat tmp/storageNodes` 'sudo /etc/init.d/rpc.statd restart'
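For reference, the same mass restart without the ssh_sudo.py wrapper could be sketched as a plain ssh loop (dry run; the host file path and function name are illustrative, and real use would need sudo rights on each node):

```shell
restart_statd() {
    # $1 = file listing one hostname per line
    while read -r host; do
        # dry run: drop the leading "echo" to actually run the restarts;
        # ssh would also want -n so it does not eat the host list on stdin
        echo ssh "$host" "sudo /etc/init.d/rpc.statd restart"
    done < "$1"
}
# typical use: restart_statd tmp/storageNodes
```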
  • 15:12 (Bill) Since there are many postage stamp requests requiring update processing, I added more hosts to the update pantasks (hosts.update) and tweaked the polling variables: set.poll 80 in update, and set.dependent.poll 120 in pstamp.
  • 17:00: Chris queued a bunch of cleanup tasks.
    • MEH also pushed all of the MD05 reprocessed chip+warps to cleanup as they are now finished.

Wednesday : 2012-11-14

Serge is czar

  • 08:00: 35(?) science exposures last night. No observation report though.
  • 11:24: Bill started restoring about 15,000 camera runs that never had magic destreak restored. Apparently only ThreePi exposures were done. It is being run as user bills from ~bills/destreak.
  • 12:30: ippc09 root partition is full (Thanks Bill).
    rm /tmp/nebulous_server.log ; touch /tmp/nebulous_server.log ; /etc/init.d/apache2 graceful
  • 12:35: Did the same on ippc05, c06, and c08, which were above 85% usage.
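The 85% check above can be automated; a sketch (assumes portable `df -P` output; the function name is illustrative):

```shell
full_filesystems() {
    # $1 = usage threshold in percent; reads `df -P` output on stdin
    # and prints "mountpoint usage%" for filesystems over the threshold.
    awk -v limit="$1" 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 > limit) print $6, $5 "%" }'
}
# typical use: df -P | full_filesystems 85
```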
  • 21:35 MEH: stdscience could use its regular restart
    • pantasks is fully loaded but something is still dragging on the system. rpc.statd on ipp018 is pegged at 100% CPU; an nfsd restart seems to have cleared it and the processing rate is back up.

Thursday : 2012-11-15

  • 07:27 Bill: a registration job was stuck for > 4000 seconds on ipp016, which had an NFS connection problem with stsci05.2. Killed the job, removed ipp016, and cleared the regPendingImfile book in the registration pantasks; registration then finished the 35 unregistered exposures in a few minutes.
  • 07:35 Bill: force.umount successfully unmounted the stuck partition on ipp016, but subsequent mount attempts failed (or at least didn't succeed before I got impatient):
(ipp016:~) bills% sudo ~eugene/force.umount stsci05.2
USAGE: force.umount (ipphost) (volume)
(ipp016:~) bills% sudo ~eugene/force.umount stsci05 stsci05.2
umount.nfs: trying prog 100005 vers 3 prot UDP port 46872
stsci05:/export/stsci05.2 umounted
(ipp016:~) bills% ls /data/stsci05.2
^Cls: cannot open directory /data/stsci05.2: Interrupted system call

(ipp016:~) bills% !sud
sudo ~eugene/force.umount stsci05 stsci05.2
Could not find /data/stsci05.2 in mtab
umount2: Invalid argument
umount: /data/stsci05.2: not mounted
(ipp016:~) bills% !ls
ls /data/stsci05.2
^Cls: cannot open directory /data/stsci05.2: Interrupted system call
  • 08:04 Bill: recovered two missing raw files o5467g0328o.ota20.fits and o5467g0342o.ota02.fits
  • 13:23 CZW: periodic restart of stdscience.

Friday : 2012-11-16

  • 9:30 CZW: Noticed that everything seemed to be jammed up since 2AM (probably just after the last person looked at it). Tracked the problem to hung NFS mounts of ipp033.0. Restarted nfs on ipp033, which seems to have cleared things up: jobs are completing again, so this was likely the culprit.
  • 9:35 CZW: Definitely related to the previous issue: registration faulted on OTA52, which is targeted to ipp033. "regtool -revertprocessedimfile -exp_id_begin 549130" cleared out three imfiles, and it looks like we're now registering the end-of-night darks, suggesting the issue is resolved.
  • 10:04 Bill: some files still hadn't been downloaded yet due to summit copy faults. pztool -clearcommonfaults got them going and now they are registering. Eventually pantasks would have run that command.
  • 10:37 CZW: Fixed bad burntool table: ipp_apply_burntool_single.pl --class_id XY21 --exp_id 94834 --this_uri neb://ipp041.0/gpc1/20090906/o5080g0172o/o5080g0172o.ota21.fits --previous_uri neb://ipp041.0/gpc1/20090906/o5080g0171o/o5080g0171o.ota21.fits --dbname gpc1
  • 11:49 Serge: Added SC.TEST.MOPS.PS1_DV3 to stdscience and publishing for mops tests
  • 16:00 CZW: ipp018 crashed (or became sufficiently unresponsive to simulate being crashed). Console messages saved here: ipp018-crash-20121116.
  • 17:29 CZW: Set ipp018 to repair mode. I think we are seeing the following failure mode: 1) Large numbers of hosts are set to repair. 2) We target chips to those hosts. 3) Nebulous transfers those requests to the host with the most space (actually, randomly among the top N hosts with space). 4) This swamps one host (ipp018) under a large number of requests. Because this happens outside of pantasks targeting, the controller machines option does not prevent it.
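The allocation behavior in step 3 can be sketched as follows (a hypothetical helper, not Nebulous's actual interface; the "host free_space" input format and function name are assumptions):

```shell
pick_target() {
    # $1 = N; reads "host free_space" lines on stdin and prints one
    # host chosen at random from the N hosts with the most free space.
    # With many hosts in repair, the same few hosts top this list
    # repeatedly, which is how one node ends up swamped.
    sort -rn -k2,2 | head -n "$1" | shuf -n 1 | awk '{ print $1 }'
}
```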

Saturday : 2012-11-17

  • 08:10 Gene looking into registration holdup, ipp009, ipp020 mount troubles
  • 12:05 MEH: looks like stdscience could use its regular restart
  • 23:00 MEH: ipp014 lost the mount to its own data disk and stalled registration; restarted nfs and registration.
    • stdscience also has some stuck jobs, some of which cannot be killed. ipp014 may just want a reboot tomorrow. Some stalled jobs on ipp020 as well; trying to purge them now too.
    • LAP jobs stalling on ipp014 and ipp020 again. Taking them out of processing until they can be looked at in more detail or rebooted.

Sunday : 2012-11-18

  • 10:50 MEH: stdscience struggling to stay loaded above 50%; regular restart plus removing ipp014 and ipp020 from processing.