
Monday : 2012-10-15

  • 07:30 Bill: ipp020 is having NFS problems, which have clogged stdscience and registration. Fixed the registration problem (we are a couple of hundred exposures behind). Working on ipp020 now.
  • 07:50 Restarted stdscience and summitcopy. A stuck job in summitcopy was blocking burntool processing. We have 274 exposures copied but not registered. ipp020 is set to repair, and set to off in registration and summitcopy.
  • 08:00 removed LAP label from stdscience to give nightlyscience all of the horsepower
  • 11:36 added LAP label back. The database replication problem happened again and czartool is behind the times
  • 14:14 Bill ran 'stop slave; set global sql_slave_skip_counter=1; start slave;' on ippdb03 and ipp001 to fix the replication problem
  • 22:10 MEH: looks like ipp016 has stalling registration and processing.. trying to isolate + clean up
    • restarted registration w/o ipp016 - ok
    • restarted stdscience w/o ipp016 - nightly_science.pl --queue_diffs hanging again.. will probably cause replication error
    • restarted pstamp w/o ipp016 - ok
    • neb-host ipp016 repair -- don't want to reboot tonight if we don't have to, can still access /data/ipp016.0, just cannot ssh into it (like ipp023, ipp013 etc recently)

Tuesday : 2012-10-16

  • 06:08 EAM : ipp023 crashed, rebooted it (Ipp023-crash-20121016)
  • 06:20 EAM : set ipp023 to 'retry' in summitcopy and registration (had been 'down' due to crash). also set ipp054 - ipp059 to 'on' in summitcopy. we used to have 30 machines doing the download, but now we seem to be down to 23 by default. I suspect this is because of concern about some of the wave 1 nodes. I think we should keep enough nodes in the list to keep the download rate acceptable.
  • 07:55 Bill: ipp016 had a panic this morning; just power cycled it.
    <Oct/15 06:09 pm>ipp016 login: [1200542.035287] BUG: unable to handle kernel paging request at ffffffff800073d4
    <Oct/15 06:09 pm>[1200542.036617] IP: [<ffffffff80589612>] xprt_autoclose+0x19/0x4c
    <Oct/15 06:09 pm>[1200542.036617] PGD 203067 PUD 207063 PMD 0 
    <Oct/15 06:09 pm>[1200542.036617] Oops: 0000 [#1] SMP 
    <Oct/15 06:09 pm>[1200542.036617] last sysfs file: /sys/class/i2c-adapter/i2c-0/0-002f/temp6_alarm
    <Oct/15 06:09 pm>[1200542.036617] CPU 0 
  • 10:20 EAM : repeated failures on stack_id 1618773 : unable to find PSF. I set quality & fault to 42

Wednesday : 2012-10-17

  • 06:42 Bill : repaired stalled db replication on ippdb03 and ipp001
  • 09:00 EAM : things were sluggish, so I've stopped and restarted all pantasks.

Thursday : 2012-10-18

  • 02:45 (Serge): Fixed ipp001 and ippdb03 replication
  • 09:45 (Serge): Bill rebooted sick ipp012. Serge set it to down for nebulous. neb-host ipp012 down
  • 09:55 (Serge): neb-host ipp012 up
  • 11:20 (Serge): Restarted publishing and (since it didn't kill pcontrol) distribution

Friday : 2012-10-19

  • 08:22 (Bill): tried to fix mysql slave status on ippdb03 and ipp001. Following the procedure, it faults again immediately with another error. Looking at the error log more carefully, it is a different error today. This time it isn't duplicate rows that are the problem but rather missing entries in diffRun. It is trying to copy diffSkyfiles for diffRuns that do not exist in the replicant.
  • 11:40 (Serge): rebooted then power-cycled ipp010
  • 11:50 (Serge): rebooted then power-cycled ipp011
  • Dumping czardb gpc1 ippadmin ipptopsps megacam ssp uip from ippdb01. Master coordinates:
    | File              | Position | Binlog_Do_DB | Binlog_Ignore_DB |
    | mysqld-bin.023689 | 16391529 |              |                  | 
  • 17:00 (Serge) All pantasks and czartool restarted. Note: While ippdb03 replication is being restored, configuration of czartool /home/panstarrs/ipp/src/ippMonitor/czartool/czartool/czarconfig.xml has been slightly changed to point to scidbm instead of scidbs1.
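
The replication entries this week describe two different failure modes: duplicate rows, which Bill cleared on Monday with sql_slave_skip_counter, and missing diffRun entries on Friday, which the same procedure did not clear. A minimal triage sketch for telling the two apart from the vertical (\G) output of SHOW SLAVE STATUS is given below. The function names and the action labels are hypothetical, the sample text is made up for illustration and is not taken from ippdb03 or ipp001, and the only error number assumed is MySQL's 1062 (duplicate entry).

```python
# MySQL error number for a duplicate-key violation (ER_DUP_ENTRY).
ER_DUP_ENTRY = 1062


def parse_slave_status(text):
    """Parse vertical (\\G) SHOW SLAVE STATUS output into a dict of strings."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only: error messages contain colons too.
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def suggest_action(status):
    """Return a hypothetical triage label for one replica's status."""
    if status.get("Slave_SQL_Running") == "Yes":
        return "ok"
    # Older servers only report Last_Errno; newer ones add Last_SQL_Errno.
    errno = int(status.get("Last_SQL_Errno") or status.get("Last_Errno") or 0)
    if errno == ER_DUP_ENTRY:
        # Duplicate row: the case where skipping one event was used on Monday.
        return "skip-candidate"
    # Anything else (such as rows that reference a missing diffRun) needs a look.
    return "investigate"


# Made-up sample, for illustration only.
sample = """\
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_SQL_Errno: 1062
Last_SQL_Error: Error 'Duplicate entry' on query. Default database: 'gpc1'.
"""
print(suggest_action(parse_slave_status(sample)))
```

Anything labelled "investigate" would still need a person to read the error log, as on Friday; the sketch only separates the case where the skip procedure is known to have helped from everything else.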

Saturday : 2012-10-20

  • 05:40 (Serge): Same problems as yesterday (about 150 exposures have been downloaded out of 550 and the night is not finished). Once again, replication to ipp011 or ipp010 is slow. Maybe this will help?
    neb-host ipp010 repair
    neb-host ipp011 repair
  • 15:00 EAM : related or not to the problem above, ipp023 crashed sometime before 10am. I rebooted it, but there were some outstanding problems from summit_copy. specifically, o6220g0168o10.fits and o6220g0168o42.fits crashed and did not recover. I manually ran the download commands (cut & paste from error log). this cleared things up, and processing is now moving along. I turned off LAP.ThreePi.20120706 until nightly science is up to date (or close).
  • 16:45 EAM : things running a bit slow, so I restarted the pantaskses

Sunday : 2012-10-21

  • 05:40 EAM : mount problem on ipp018. an interesting bit in /var/log/messages: ipp018.interesting.message
  • 06:15 EAM : I spent some time looking into the system throughput. Last night, I turned off the LAP label for processing, in order to let the nightly processing catch up. Overnight the mean processing rate was down in the 20 exp/hour range. After looking into things a bit, I suspect that this was limited by download. summitcopy was running, but a few attempts were taking 5-10 minutes. These seem to be cases where there is an intermittent NFS glitch. In one example, a machine was trying to place results on /data/ipp018.0, and that machine had an NFS hang. While I looked into it, the hang eventually cleared without any real action on my part, but the message above about a blocked mount job seemed to have the correct timing. In summary, I suspect we are suffering these kinds of problems frequently, but the sheer number of machines and the LAP processing somewhat hide them. If LAP had been on, we probably would have had much closer to 100 exp/hour since chip, etc were running along well.
  • 13:00 EAM : (added later) : I needed to clear a few burntool failure problems. 2 or 3 chips got stuck in 'check_burntool', which is still not automatically recovering. we need to fix this.
  • 23:30 MEH: looks like nightly science stalled since about 20:00, want to avoid pileup in morning so seeing if I can work through it.
    • looks like ipp023 is down, see log Ipp023-crash-20121021, last crash was yesterday, prior crash Saturday 10/13 with same hardware problem
    • LAP not running, why is it still off?
    • EAM : because we are having such problems with nightly science that it is not regularly finishing before late afternoon...
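
As a rough back-of-envelope for why nightly science runs into the afternoon, the sketch below divides a night's exposure count by a sustained rate. It uses figures quoted in this week's entries purely as illustration: about 550 exposures in a night (Saturday 05:40), about 20 exp/hour when download-limited, and the 100 exp/hour that the 06:15 entry guesses would have been possible. The function name is hypothetical, and a constant rate is an assumption.

```python
def hours_to_clear(exposures, rate_per_hour):
    """Hours needed to process `exposures` at a constant `rate_per_hour`."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return exposures / rate_per_hour


# Illustrative inputs taken from the log entries; a constant rate is assumed.
for rate in (20, 100):
    print(f"{rate} exp/hour: {hours_to_clear(550, rate):.1f} h")
# prints:
# 20 exp/hour: 27.5 h
# 100 exp/hour: 5.5 h
```

At the download-limited rate a full night would take longer than a day to process, which is consistent with the decision to keep LAP off until the download problems are understood.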