IPP MHPCC Production Cluster Status

Log of Work Done on Production Nodes

List of Spares for Production Nodes

ippc30.1 and ipp0xx.0 filesystems exported to jaws j00xbxx servers, Mar. 10th, 2014 4pm

  • /export/ippc31.1 and /export/ipp0xx.0 filesystems exported to jaws cluster - per eugene/bills

stare0x.0 filesystem exported to entire IPP cluster, Feb. 28th, 2014 4pm

  • /export/stare0x.0 filesystem exported to IPP cluster - per heather

Precautionary Shutdown of IPP Servers at MHPCC due to Tropical Storm Flossie July 29-30, 2013

  • Majority of servers powered up with no issues except for the following:
    ippc02 - need to check why alarm is on (chassis fan?)
    ippc53 - failed bootdisk in slot0 prevented system from detecting grub on mirror disk. Eddie replaced w/ new disk. software raid is rebuilding.
    ipp046 - required onsite intervention to get system powered up.
    ipp060 - required reboot to detect all 48GB of memory
    ippdb04 - Haydn booted up system at ATRC.
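
For a rebuild like the one on ippc53, progress can be read from /proc/mdstat. The following is a minimal sketch, assuming Linux md software RAID and the usual /proc/mdstat layout; the helper name rebuild_progress and the sample text are hypothetical and not taken from any IPP node.

```python
import re

# Array header lines look like "md0 : active raid1 sdb1[2] sda1[0]".
_ARRAY = re.compile(r"^(md\d+)\s*:")
# Progress lines contain "recovery = 12.6%" or "resync = 3.0%".
_PROGRESS = re.compile(r"(recovery|resync)\s*=\s*([\d.]+)%")


def rebuild_progress(mdstat_text):
    """Return {array: (action, percent)} for arrays that are rebuilding."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        header = _ARRAY.match(line)
        if header:
            current = header.group(1)
            continue
        found = _PROGRESS.search(line)
        if found and current:
            progress[current] = (found.group(1), float(found.group(2)))
    return progress


# Hypothetical sample in /proc/mdstat format (illustration only).
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[2] sda1[0]
      104320 blocks [2/1] [U_]
      [==>..................]  recovery = 12.6% (13184/104320) finish=0.1min speed=13184K/sec

md1 : active raid1 sdb2[1] sda2[0]
      976655488 blocks [2/2] [UU]

unused devices: <none>
"""

print(rebuild_progress(SAMPLE_MDSTAT))
```

On a live node the same function could be fed the contents of /proc/mdstat; an empty result means no array is currently rebuilding.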
    

Network Outage due to MRTC-A Building Power Outage Scheduled for August 26th 12pm - 1pm HST

  • MRTC-A will be installing electrical meters.
  • UH ITS routers located in MRTC-A will be down. They are the only route out to the internet for PS1 IPP & PSPS, so the network will be unavailable during the scheduled power outage.

Status as of 2011-01-20: ipp005,006,007,008,009,012,014,016,025 all need either USB or new CD/DVD drives with compatible cables to house the boot disk.

Wave1 Refurb Stage3 (Motherboard upgrades on ipp005,006,007 and 025.)

  • ipp005 : resulted in 2 degraded drives - rebuilding.
  • ipp006 : CPU1 memory slot closest to CPU not working. Memory moved to black slot - all 24 GB working.
  • ipp025 : CPU0 memory slot furthest from CPU not working. Memory moved to black slot - all 24 GB working.

Wave1 Refurb Stage2 (Disk upgrades on ipp008,016, and 021, and motherboard upgrades on ipp008 and ipp016.)

  • no problems

Wave1 Refurb Stage1 (Disk and Motherboard upgrades on ipp009,012,014)

components upgraded: motherboard + cpu + memory + 2TB HDD

  • ipp009 : BIOS detects 6 memory modules in temp readings, BIOS/OS only recognizes 20GB of 24GB
  • ipp012 : unexpectedly crashes under high rsync load - RAID card replaced.
  • ipp014 : erratic beeping; PSU was suspected, but the beeping seems to originate from the PSB controller for HDD and fan display alarms, not the mobo

Storage upgrade

  • ipp005 : 2TB HDD install - Mohr (2010-07-02)
  • ipp006 : 2TB HDD install - Bill/Cindy (2010-08-06)
  • ipp007 : 2TB HDD install - Bill/Cindy (2010-08-06)

Increase swap

  • ipp015 : 32GB swapfile - total ~64GB swap (2012-02-10)
  • ipp054 : 152GB swapfile - total ~200GB swap (2012-02-01)
  • stare03 : 64GB swapfile - total ~100GB swap (2012-02-10)
  • stare04 : 64GB swapfile - total ~100GB swap (2012-02-10)
  • Bill requested to increase swap space on all nodes +128GB (2012-02-17)
  • ipp054-66 added 128GB swapfile (2012-02-23)
  • ippc20-c29 added 128GB swapfile (2012-02-23)
  • ipp005-053 added 128GB swapfile (2012-02-29)
  • ippc01-c16 added 64GB swapfile (2012-09-21)
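
To confirm that a node reports the expected swap after a swapfile is added, the totals can be read from /proc/meminfo. This is a minimal sketch, assuming the standard Linux /proc/meminfo format; the function names and the sample values are hypothetical, not measurements from IPP nodes.

```python
def parse_meminfo(text):
    """Return {field: numeric value} from /proc/meminfo-style text.

    Size fields are reported by the kernel in kB.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_total_gib(text):
    """Total swap in GiB (kB divided by 1024 * 1024)."""
    return parse_meminfo(text)["SwapTotal"] / (1024 * 1024)


# Hypothetical sample in /proc/meminfo format (illustration only).
SAMPLE_MEMINFO = """\
MemTotal:       49432016 kB
MemFree:         1203456 kB
SwapTotal:     134217728 kB
SwapFree:      134217728 kB
"""

print(swap_total_gib(SAMPLE_MEMINFO))  # 128.0
```

The same parse also exposes MemTotal, which is useful for spotting nodes where the OS sees less memory than is installed.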

apache2 activated

  • March 7th, 2013
    • on servers ippc20-32
    • NAT reconfigured; allocated static IPs to ippc20-32 for external access to www (80/tcp)
    • FW modified to allow two hosts from lanl.gov to ippc20-32
  • February 27th, 2014
    • modified firewall to allow mustang lanl (mswarren) to ippc20-c32 via ssh (per watersc1/eugene)
    • patched sshd to v6.3p1 and hpn

MHPCC System Maintenance

  • Required to shut down the Pan-STARRS IPP cluster (for 1hr starting @ 10AM HST) in order to install power meters on IPP circuits.

Nodes down

  • ipp008 : motherboard problems -- CPUs were changed on 2010.02.17, but still no boot
  • ipp018 : motherboard problems? power supply problems?
  • ipp037 : long history of return, etc.

Under-Used Nodes : Status 2012.02.09

  • ipp004 : currently used for addstar / dvo databases
  • ipp005 : recently replaced CPU from ipp018, has been used for processing since 2/16, but not yet for nebulous
  • ipp008 : dead hardware (CPU appears to be defective 2010-02-22), (2010-03-22 Bill G. replaced motherboard, using CPU fans from IPP018)
  • ipp009 : had several crashes in 2009, only used for nebulous since ~2009.10.29
  • ipp012 : several crashes, suspected faulty memory, 8GB memory replaced (2010-08-06 Bill/Cindy)
  • ipp014 : had several crashes in 2009, only 'available' in nebulous, not 'allocated' (since ~2009.10.29), suspect bad memory (mcelog entries ~2008)
  • ipp016 : some crashes in 2009, only nebulous for some time (memory replaced 2010-05-28)
  • ipp018 : dead hardware (motherboard dead; CPUs installed into ipp005 2010-02-16); replaced motherboard/cpu (2010-04-01)
  • ipp025 : paul raised suspicions about memory problems; only used for nebulous since then. (memory replaced 2010-05-28)
  • ipp027 : high loads, sluggish, suspect bad cpu, possible bad memory (mcelog entries dated daily) - (2012-02-09 Haydn flashed BIOS -> v1.04)
  • ipp037 : dead hardware (determined motherboard & CPU are defective 2010-02-22)
  • ipp049 : seems to have somewhat slow RAID I/O; has not been used for processing, and only 'available' in nebulous
  • ippc18 : battery temperature warning (Cindy added fans with the intention to cool RAID controller & batt., no warnings since 2010-02-22)
  • ippc19 : battery temperature warning (Cindy added fans with the intention to cool RAID controller & batt., no warnings since 2010-02-09)
  • ippdb02: series of crashes, motherboard swapped. ippc17 pstamp/datastore moved to ippdb02 2010-02-22.

Other suspicious Wave 1 machines

  • ipp006 : a few crashes in 2009
  • ipp010 : several crashes, kernel panics in 2009
  • ipp017 : several high-load, high-memory crashes 2009-2010 (not a Wave 1 machine!)

Notes

  • ippc31 - bills requested to reformat to xfs to support >32000 files per directory. (2013-04-18)
  • ipp005 - The memory was completely swapped and crashes continued. CPUs were replaced (2010.02.09). In production since 2010.02.16.
  • ipp014 - using the chassis from ipp017 with disks from ipp014. The chassis had been ipp037, but was sent back to ASA for repair; ASA shipped it back with a replacement motherboard (Tyan S2912G2NR-E). Tyan S2912G2NR is EOL - kernel 2.6.31.5 is required.
  • ipp017 - using the chassis from ipp037 with disks from ipp017
  • ipp018 - replaced 8GB of memory (2009-12-30)
  • ipp027 - original areca card replaced with a newer 3ware card; using this raid as manoa backup
  • ipp034 - using IPP037 Areca RAID Controller (IPP034 original Areca Controller appears to be defective -> Controller#1(PCI) H/W MONITOR DRAM 1-Bit ECC)
  • ipp037 - using the chassis from ipp014 with disks from ipp037
  • ippdb00 - in production as nebulous server
  • ippdb02 - memtest86 4 passes completed no errors. (2009-12-30)

Known issues

  • ipp008 - complete raid failure 2009.10.28 -- disks have been replaced and raid needs to be re-built. ipp008 has not been used for storage for some time, so we have not lost any vital data.
  • ipp009 - several recent crashes with coincident complaints from CPU #3. note: we will swap CPU 1 and 3 and then stress-test this machine to see if the effect moves with the CPU
  • ipp013 - dead fans, but it is not the fans themselves (tried replacing them). It appears to be the cabling that leads up to the fan modules that's gone bad.
  • ipp014 - has a newer version of the Wave #1 motherboard which requires older kernel drivers or kernel 2.6.31.5.
  • ipp016 - dead fans, but it is not the fans themselves (tried replacing them). It appears to be the cabling that leads up to the fan modules that's gone bad.
  • ipp018 - suspect bad memory module (system is usable)
  • ipp025 - suspect bad memory module (system is usable)
  • ipp027 - suspect bad memory module (system is usable)
  • ipp037 - fails to boot; serial device not detected by acs48, suspect motherboard
  • ipp034 - Areca reported: 2009-11-17 13:42:23 Controller#1(PCI) H/W MONITOR DRAM 1-Bit ECC (the online event logs state to check the DRAM module and replace it with a new one if required.)
  • ipp008, ipp046, ippdb02 - after mhpcc power outage (2009-11-27), the nodes fail to retain BIOS settings when power is totally lost. PDU outlets are powered off after shutdown procedure to prevent electrical surge when power resumes.

Old Issues

  • random system crashes of nodes under heavy load, occasionally with a printk() of "do_IRQ: X.XXX", which appears to be caused by a hardware interrupt that does not have a handler (device driver) for it
    • these were only present on Wave #1 hardware
    • these seem to have largely stopped with the current forcedeth driver and kernel (early 2009?)
  • ipp004 - disk bay #12 was dead, but started working again after the disk was reseated (ipp004 has device labels inverted, and it is suspected that previous re-seat attempts were addressing the wrong disk)
  • cabinet 4 managed power strip was being overloaded. With the wave 3 delivery, we rebalanced the power load to avoid this problem.
  • cab2pdu0 - cab2 down (2009-11-14). PDU sockets 5 & 9 report on but PSUs on ipp005 & ipp009 are not powered. Plan to return for repair.
  • Entire IPP Production Cluster. MHPCC is currently experiencing a power outage. (2009-11-27T08:59)

Layout at MHPCC

  cabinet0: ipp020-21
  cabinet1: ipp012-013,037,008,014,016,018-019
  cabinet2: ipp004-007,015,009-011
  cabinet3: ipp023-029
  cabinet4: ipp030-036,017
  cabinet5: ipp038-043,ippdb00-02 (cab5con.ipp)
  cabinet6: ipp044-048,ippc00-09  (cab6con.ipp)
  cabinet7: ipp049-053,ippc10-19  (cab7con.ipp)
  cabinet8: ippc20-29,stare00-04,ipp054-058 (cab8con.ipp)
  cabinet9: ipp059-066
  cabinet10: ippc30-46,stsci00-02 (cab10con.ipp)
  cabinet11: ippc47-63,stsci03-05
  cabinet12: stsci06-09
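
The compact range notation in the layout above can be expanded into full host names when a per-cabinet host list is needed (for example, before powering a cabinet down). This is a minimal sketch, assuming that a bare number inherits the prefix of the previous entry and that a range end such as "21" in "ipp020-21" is the full number padded to the width of the start; the function names are hypothetical.

```python
import re

# One entry: optional letter prefix, digits, optional "-end" (end may repeat a prefix).
_TOKEN = re.compile(r"^([A-Za-z]*)(\d+)(?:-[A-Za-z]*(\d+))?$")


def expand_hosts(spec):
    """Expand e.g. 'ipp012-013,037,ippdb00-02' into a list of host names."""
    hosts = []
    prefix = ""
    for token in spec.split(","):
        token = token.strip()
        m = _TOKEN.match(token)
        if m is None:
            raise ValueError(f"unrecognized host token: {token!r}")
        if m.group(1):
            prefix = m.group(1)  # bare numbers reuse the previous prefix
        if not prefix:
            raise ValueError(f"no host prefix for token: {token!r}")
        start = m.group(2)
        end = m.group(3) or start
        width = len(start)  # keep the zero padding of the start number
        for n in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{n:0{width}d}")
    return hosts


def parse_layout_line(line):
    """Split 'cabinetN: hosts (console)' into (cabinet, [hosts])."""
    cabinet, _, rest = line.partition(":")
    rest = re.sub(r"\(.*?\)", "", rest)  # drop the console name in parentheses
    return cabinet.strip(), expand_hosts(rest.strip())


print(parse_layout_line("  cabinet5: ipp038-043,ippdb00-02 (cab5con.ipp)"))
```

The same expander accepts ranges written with a repeated prefix on the end, such as ippc20-c29 in the swap notes.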