(Up to PS1 IPP Czar Logs)

Monday : 2012-11-03

Serge is czar

  • 06:30 (Serge): No observation last night.
  • 10:00 (Serge): Stopping replication on ippdb02 and starting backup to ippc62 (/export/ippc62.1/backup_nebulous/20121203)
  • 10:30 (Serge): Power cycled ipp012. The not-so-interesting apparent fault is the following:
    <Dec/03 01:40 am>ipp012 login: [2201848.347410] BUG: unable to handle kernel paging request at ffffffff800073d4
    <Dec/03 01:40 am>[2201848.349821] IP: [<ffffffff80589612>] xprt_autoclose+0x19/0x4c
    <Dec/03 01:40 am>[2201848.349821] PGD 203067 PUD 207063 PMD 0 
    <Dec/03 01:40 am>[2201848.349821] Oops: 0000 [#1] SMP 
    <Dec/03 01:40 am>[2201848.349821] last sysfs file: /sys/class/i2c-adapter/i2c-0/0-002f/temp6_alarm
    <Dec/03 01:40 am>[2201848.349821] CPU 0 
    <Dec/03 01:40 am>[2201848.349821] Modules linked in: w83793 hwmon_vid autofs4 i2c_i801 iTCO_wdt i2c_core e1000e tg3 libphy e1000 xfs 
    dm_snapshot dm_mirror dm_region_hash dm_log aacraid 3w_9xxx 3w_xxxx atp870u arcmsr aic7xxx scsi_wait_scan
    <Dec/03 01:40 am>[2201848.349821] Pid: 1522, comm: rpciod/0 Tainted: G        W  2.6.28-rc7-00105-gfeaf384 #4
    <Dec/03 01:40 am>[2201848.349821] RIP: 0010:[<ffffffff80589612>]  [<ffffffff80589612>] xprt_autoclose+0x19/0x4c
    <Dec/03 01:40 am>[2201848.349821] RSP: 0018:ffff88063c8ede60  EFLAGS: 00010246
    <Dec/03 01:40 am>[2201848.349821] RAX: ffffffff80007374 RBX: ffff8800697d0578 RCX: 0000000000000000
    <Dec/03 01:40 am>[2201848.349821] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800697d0000
    <Dec/03 01:40 am>[2201848.349821] RBP: ffff88033cc09a00 R08: 0000000000000001 R09: 0000000000000000
    <Dec/03 01:40 am>[2201848.349821] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8800697d0578
    <Dec/03 01:40 am>[2201848.349821] R13: ffff8800697d0000 R14: ffffffff808b85a0 R15: 0000000000000000
    <Dec/03 01:40 am>[2201848.349821] FS:  0000000000000000(0000) GS:ffffffff80872d80(0000) knlGS:0000000000000000
    <Dec/03 01:40 am>[2201848.349821] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    <Dec/03 01:40 am>[2201848.349821] CR2: ffffffff800073d4 CR3: 0000000043d72000 CR4: 00000000000006e0
    <Dec/03 01:40 am>[2201848.349821] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    <Dec/03 01:40 am>[2201848.349821] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    <Dec/03 01:40 am>[2201848.349821] Process rpciod/0 (pid: 1522, threadinfo ffff88063c8ec000, task ffff88063ccc5b20)
    <Dec/03 01:40 am>[2201848.349821] Stack:
    <Dec/03 01:40 am>[2201848.349821]  ffff88063c8ede90 ffff8800697d0578 ffffffff805895f9 ffffffff80248973
    <Dec/03 01:40 am>[2201848.349821]  ffffffff8024891d 0000000000000046 0000000000000000 0000000000000000
    <Dec/03 01:40 am>[2201848.349821]  0000000000000000 ffff88033cc09a00 ffff88063e43bbf0 ffff88033cc09a00
    <Dec/03 01:40 am>[2201848.349821] Call Trace:
    <Dec/03 01:40 am>[2201848.349821]  [<ffffffff805895f9>] ? xprt_autoclose+0x0/0x4c
    <Dec/03 01:40 am>[2201848.349821]  [<ffffffff80248973>] ? run_workqueue+0xed/0x1ed
    <Dec/03 01:40 am>[2201848.349821]  [<ffffffff8024891d>] ? run_workqueue+0x97/0x1ed
    <Dec/03 01:40 am>[2201848.349821]  [<ffffffff80249511>] ? worker_thread+0xd8/0xe3
    <Dec/03 01:40 am>[2201848.349821]  [<ffffffff8024bfda>] ? autoremove_wake_function+0x0/0x2e
    <Dec/03 01:40 am>[2201848.349821]  [<ffffffff80249439>] ? worker_thread+0x0/0xe3
    <Dec/03 01:40 am>[2201848.349821]  [<ffffffff8024becc>] ? kthread+0x47/0x73
    <Dec/03 01:40 am>[2201848.349821]  [<ffffffff80234852>] ? schedule_tail+0x27/0x5f
    <Dec/03 01:40 am>[2201848.349821]  [<ffffffff8020cd19>] ? child_rip+0xa/0x11
    <Dec/03 01:40 am>[2201848.349821]  [<ffffffff8024be85>] ? kthread+0x0/0x73
    <Dec/03 01:40 am>[2201848.349821]  [<ffffffff8020cd0f>] ? child_rip+0x0/0x11
    <Dec/03 01:40 am>[2201848.349821] Code: 89 f7 e8 59 dd 02 00 89 d8 5b 5d 41 5c 41 5d 41 5e c3 41 55 4c 8d af 88 fa ff ff 41 54 53 48 
    8b 87 90 fa ff ff 48 89 fb 4c 89 ef <ff> 50 60 f0 80 63 d0 f7 4c 8d a3 98 00 00 00 4c 89 e7 e8 43 de 
    <Dec/03 01:40 am>[2201848.349821] RIP  [<ffffffff80589612>] xprt_autoclose+0x19/0x4c
    <Dec/03 01:40 am>[2201848.349821]  RSP <ffff88063c8ede60>
    <Dec/03 01:40 am>[2201848.349821] CR2: ffffffff800073d4
    <Dec/03 01:40 am>[2201848.349821] ---[ end trace 5dc435962a52df42 ]---
    • MEH: stopping lm_sensors on ipp010 and ipp011 (~5 and 1 week since last related crash) to see if this problem can be mitigated
  • 10:45 bill set deepstack to stop. Going to restart with 8 threads per host.
  • 11:15 bill set sts label to active. We'll need to keep an eye on the psastro processes pointing 2 contains a globular cluster and we had some memory explosions yesterday. Camera poll limit is 25
  • 11:45 bill skycal.pro added to stdscience
  • 12:34 bill restarted deepstack with 8 threads per job
  • 13:00 (Serge) Bill set chip.off
  • 14:19 bill {{set chip.on}}}, deepstack to off. Setting 8 threads per job resulted in apparently single threaded operation for some reason.

Tuesday : 2012.12.04

  • 05:00 bill: cluster is wedged due to ipp013 same signature as usual power cycled it. Took a very long time to do its memory check
     <Dec/04 00:27 am>ipp013 login: [4365681.451101] BUG: unable to handle kernel paging request at ffffffff800073d4
    • remember to add these to the ipp013_log respectively
  • 05:12 staticsky has finished 5 filter runs for 267 < RA < 303 DEC > 10. turned off deepstack for now and added 3 x compute3 to stdscience.
  • 06:40 since rebooting ipp013 the system has made very little progress. Stopping processing.
  • 07:12 setting 10, 11, 12, 13, and 19 to repair mode got things going again. Running with this state for now. We can revisit this during the shutdown later today. Retarted rpc.statd on ipp012 since its glockfiles weren't finishing
  • 07:30 stdscience is now making progress. poll limit set to 250. There are plenty of jobs in the books, but pantasks is falling trouble getting them to the queue. pantasks needs its daily restart
  • 09:27 CZW: stopping all pantasks_servers to allow for NFS switchover at 10:00.
    • MEH: will shut down ghost running refstacks @09:50
  • 09:27:01: Serge: Restarted replication slave ippdb02
  • 10:00 CZW: all processing stopped for NFS switchover.
  • 11:00 MEH: on the problem machines ipp010,011,012,013,014,016,018,019,020
    • restarted rpc.statd twice to clear old process
    • shut down lm_sensors manually (except ipp020 off for default init now). also renamed /etc/ganglia/conf.d/sensors.pyconf and restarted gmond to stop spamming /var/log/messages on these machines.
    • once processing up stable for a while, will slowly put back into neb-host up from repair
  • 11:30 CZW: processing back up.
  • 12:50 CZW: set ipp02[3-6] to nebulous up.
  • 13:30 MEH: stack gets 2x wave4 back in, adding ipp060 which is normally out. MD05.refstack can run with stdscience on compute3 so have started there unless LAP stacks start lagging warps.
  • 13:50 MEH: ipp020 has given up on some mounts, rebooting and turning off lm_sensors on startup.
  • 14:25 MEH: ipp010 neb-host up, will do others in sequence to see if rpc.statd restart and lm_sensors off have cleared. stdscience also has 7x wave1, seems possibly excessive. turning one of those off for - ~8 nodes.
  • 14:55 ipp011 neb-host up (neb-host has date stamp now for tracking)
  • 15:00 Bill: Two staticsky runs have y band inputs that generate unreasonable numbers of detections (> 200,000). Since they are y band these cause forced photometry detections for the other bands. This causes memory explosion when processing. If the runs had completed, they would generate ~2,000,000 false detections for dvo and psps and ultimately the user. I have updated the stack tables to mark the offending stacks as bad quality and removed them from the corresponding staticskyInput. The stack_ids are 1094024 and 1494695.
  • 15:25 MEH: ipp012 neb-host up
  • 15:55 ipp013 neb-host up
  • 18:25 ipp019 neb-host up
  • 18:55 ipp016 neb-host up
  • 16:55 Bill turned on skycal.revert we're getting catalog lock failures from getstar. It uses the same mechanism as glockfile.
  • 17:25 MEH: ipp018 neb-host up, 1x wave1 hosts off to see if helps hangups at all.
  • 17:35 MEH: processing mostly came to a halt -- oddly, putting back to repair the recent adds ipp016, 018 and processing back up and not clear what ipp016, 018 connection there was. File samples looked like ipp010,012 were stalling things. Leaving things at this stage for tonight. Didn't get to ipp014,ipp020 so will also be left in repair. The ones up have all had lm_sensor problems in past for watching.
  • 19:55 MEH: nightly data -- many processes want to put files to ipp013 and stalling, setting ipp013 to repair for a bit. then ipp012, now ipp019 overly busy but leaving up.. basically appears many systems want to lock files on a wave1 host and that host goes tttthhhhpppppppptttttt. ipp013 neb-host back up
    • once the few MD04 nightly through, then fine except ipp013,015 seem to have overly busy rpc.statd very often without stalling things much
    • what is different between nightly and LAP data, summitcopy+registration using wave1 as well?
  • 22:45 CZW: restarting stdscience to see if that will help it keep going overnight.

Wednesday : 2012.12.05

  • 09:55 MEH: no kernel paging problems with lm_sensors/temp6_alarm -- add in ipp014 neb-host up
  • 09:52 Bill: reference catalog on ipp018 is locked. Tried restarting rpc.statd but that didn't work. Will need to fix later.
  • 09:52 turned skycal revert on and changed task file to turn it on by default
  • 10:25 MEH: ipp020 neb-host up
  • 10:35 MEH: machines with kernel paging trouble in recent past ipp010, 011, 012, 013, 014, 016, 018, 019, 020
    • seemed related to lm_sensors with last sysfs file: /sys/class/i2c-adapter/i2c-0/0-002f/temp6_alarm
    • removed lm_sensors from startup/default init
    • removed sensors.pyconf from ganglia calls
  • 10:45 MEH: ipp019 was set up yesterday but isn't a normal host in the wave1 list (commented out and not added to host_ignore set) and no clear notes in ipp19_log or czarlog for why. adding into stdscience and turning off, fixing in ippconfig/pantasks_hosts.input
  • 11:25 MEH: ipp016 neb-host up
  • 12:17 CZW: finally able to restart stdscience after flushing out hung glockfiles that were preventing the old stdscience from being finished.
  • 12:55 MEH: ipp018 neb-host up, last of of set
  • 14:30 CZW: begin rsync of refcat.20120524.v0 to ipp023-026,054-059,061-063,065-066,048.
  • 21:55 MEH: glocks on wave1 set is unstable again with nightly science, will slowly put ones back into repair that are struggling: ipp013--ipp020 again until morning

Thursday : 2012.12.06

  • 10:45 MEH: ipp019 has encountered a GPF, power cycling ipp019_log
  • 12:10 MEH: stsci00.1,00.2 data disks low on space ~100,50 GB. put to repair until ghost 5% buffer can be adjusted on them
  • 13:36 CZW: switched refcat to point at new hosts (away from wave1), and restarted stdscience.
    • MD05 refstack setup and diffims from last night proceeding
  • 13:55 MEH: proceeding to take problem data hosts out of repair again -- ipp018, ipp019, ipp020 to start with. ipp015,016,017 next. ipp013,014 last. not as bad as yesterday bringing things to a standstill, but still getting hung up.
    • ipp016, 019 back into repair seems to be the magic pair for today, glocks happening but processing moving along about at normal.
  • 20:05 MEH: ipp018 crash, power cycling ipp018_log
    • pausing stdscience until things settle down/clear
    • stdscience left with ipp018 jobs hanging around, restarting
    • ipp013,014,015,016,017,018,019,020 have ended up in repair again. rpc.statd on 013,015,017 tend to clear a bit more quickly, ~3-4 mins, but 014,018,020 can take >10 min, 018 even stalled with ssh/login for 15-20 min. will probably leave through night, may be good to just leave until LAP wraps up.
    • ipp018 even stalling jobs running locally, rpc full cpu use. finally caught up, but taking ipp018 out of all processing.
  • 22:10 MEH: neb glitch on neb://ipp027.0/gpc1/ESS.nt/2012/12/07/o6268g0110o.553240/o6268g0110o.553240.cm.698383.log, renamed .bad and reverted camera run
  • 22:50 MEH: multiple rsync on stsci nodes seem to be stalling some jobs, emailed Gene
    • putting stsci03 into neb-host repair until the rsyncs are dealt with, all warp,diff, dist etc jobs that touch it are stalled.. adding 02,04,07
    • 23:50 stsci06 also

Friday : 2012.12.07

  • 07:20 MEH: mounts on ipp056 gone inaccessible, stalled registration.
    • restarting rpc.statd -- not helping
    • removing from processing
    • stopping, restarting registration, stdscience, summitcopy. killing hung jobs from stack (LAPs) failed so restarted as well.
    • putting into repair in prep for reboot later
  • 08:05 MEH: now wave4 machines are being overloaded (ipp058,054,066..). taking 1-2 instances out of processing as needed
  • 08:46 EAM : rsyncs to stsci machines were causing too much load. I've stopped them, and I'll restart with lower bandwidth.
  • 09:00 MEH: rebooting ipp056, ipp056_log, neb-host up after a bit
  • 09:02 Serge: Fixed replication (see Replication_Issues)
  • 09:05 MEH: ipp054 now the glock tarpit, has frozen registration. has lost many mounts.
    • restarting rpc and after ~5-10 min mounts back and registration moving along
    • then stalled again. put to neb-host repair
  • 09:28 EAM : my rsync system was allowing multiple jobs to hit the same target output machine. I reworked the scripts to a single connection per source per target, so there should not be the same overloading. also re-started with 5MB/sec instead of 8MB/sec per connection
  • 09:35 MEH: neb-host stsci up again, maybe help ease the load on wave4
  • 09:45 MEH: ipp054 lost mounts to ipp030,023 again and restarting rpc.statd not helping. rebooting ipp054_log
    • glocks now flushing, registration moving again
  • 10:40 MEH: registration finished, slowly putting the wave4 nodes back up
  • 12:30 MEH: MD stacks finished after the SSdiff window, running tweak_ssdiff to adjust the time to queue them. chip.off to push the remaining 100 3PI warps through
  • 13:35 Chris added back the original wave1 refcats to available locations so now twice as many options for access
    • CZW: except on ipp009, which was a pointer to the ipp064 copy. So we have 31 instead of 32 to select from.
  • 18:00 CZW stopped stdscience to let statd calm down on ipp058. This never happened, so I restarted statd on ipp058. skycal is off, and I turned off camera as a test as well. Given that this issue started on ipp008-ipp020, and spread to ipp52-60 when I moved the reference catalogs there, I'm increasingly suspicious that this is the root cause of the nfs issues.

Saturday : 2012.12.08

  • 07:45 MEH: summitcopy and registration had just not been doing anything. restarted both and now finishing up nightly science.
  • 10:05 Bill: There is a 3pi warp in the warpPendingSkyCell book with pantasksState CRASH. Setting STS label to inactive to clear out the queue then will reset the book.
  • 12:20 Bill: cleared the book and set STS label to active.
  • 12:28 Bill: two lap warps were not running because their chips had been cleaned set chip_ids 692565 and 609303 to be updated
  • 18:35: MEH: ipp011 is unhappy and blocking processing and nightly science, kernel paging fault has moved on to another last call. ipp011_log
    • rebooted, locks clearing. put into repair for a bit, may try and take out later
  • 22:59 CZW: noticed stacks were stuck. It looks like the problem is that the stack.summary stage runs at a particular time each night, and launched ~100 jobs. These jobs attempted to read data from stsci00, causing that host to grind to a halt trying to serve all the files. This hasn't been a problem since we generally don't do 10k stacks a day. I killed all the ppSkycell instances I could find, and that seems to have unstuck things. I've disabled stack.summary for the weekend, and will turn it back on Monday with a lower number of concurrent jobs.

Sunday : 2012.12.09

  • 01:20 MEH: registration stall for a while, lost file neb://ipp012.0/gpc1/20121209/o6270g0278o/o6270g0278o.ota16.fits with a crash status (like on saturday). restarting registration and running again.
  • 09:05 MEH stdscience needs its regular restart, will also pick up some dropped/lost exposures Denver queried about
    • taking ipp026 out of processing, is the common thread for all the lost jobs from crash pantasks task and annoying for the weekend...
    • ipp026 may want a reboot but weekend and new raid card so waiting.. many faults, trap invalid opcode and BUG: unable to handle kernel NULL pointer dereference at 0000000000000000, ipp026_log
  • 09:20 Serge: labeltool -updatelabel -set_inactive -label STS.rerun.20121129 -dbname gpc1
  • 12:35 MEH: can re-activate the STS label
  • 12:45 Serge: labeltool -updatelabel -set_active -label STS.rerun.20121129 -dbname gpc1
  • 12:50 Serge: Fixed broken replication (see log)
  • 20:53 Bill: set sts label to inactive since it seems so backed up at warp. Too many sources taking up too much memory perhaps