wiki:Production_Work_Log

Version 73 (modified by huntley, 13 years ago)

--

https://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/Production_Cluster_Status

2013 Production Node Work

Apr 19 (Haydn)

  • stare04 - tried to upgrade it to 96GB RAM, but it didn't like the new RAM. Used the leftover 4GB sticks to upgrade it to 48GB.
  • ippc29 - removed its old RAM and upgraded it to 88GB. When I tried to insert the last 8GB stick, it showed only 80GB, so I left it at 88GB.
  • ippc26 - transplanted the leftover 4GB sticks to upgrade it to 48GB.

Apr 18 (Haydn)

  • ippc17 - reinstalled it, now it has a 500W power supply.
  • ippc13 - completely stopped working. Removed the extra 32GB of RAM which had been added. It remained inoperative. Took it back to ATRC to repair it.
  • stare00-stare03 - removed their six 4GB RAM sticks (24GB) and replaced them with twelve 8GB RAM sticks to upgrade them to 96GB. Not all of the memory seems to be recognized all of the time, so sometimes these machines boot up with 80GB, 88GB, or 96GB.
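The arithmetic behind the 80GB/88GB/96GB boot sizes can be sketched as below. This is only an illustration, assuming every shortfall is a whole number of undetected sticks; the function name `missing_sticks` is hypothetical and not part of any tool we use.

```python
def missing_sticks(installed_sticks, stick_gb, reported_gb):
    """Return how many sticks' worth of RAM the machine failed to report.

    Assumes the shortfall is a whole number of sticks of size stick_gb.
    """
    expected_gb = installed_sticks * stick_gb
    shortfall_gb = expected_gb - reported_gb
    if shortfall_gb < 0 or shortfall_gb % stick_gb != 0:
        raise ValueError("reported size is not explained by whole missing sticks")
    return shortfall_gb // stick_gb


# stare00-stare03: twelve 8GB sticks, observed booting with 80, 88, or 96GB
for reported in (80, 88, 96):
    print(reported, "GB ->", missing_sticks(12, 8, reported), "stick(s) not recognized")
```

The same check applies to the older nodes with twelve 4GB sticks that report 40GB instead of 48GB: `missing_sticks(12, 4, 40)` gives 2.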

Apr 17 (Rita)

  • ippc07 - wasn't linking at 1Gb/s like the other servers on the new 48-port switch; it was linking at 100Mb/s. Discovered that the cable C7N2S0 was bad. Labeled the cable as bad and swapped it with the C7N2S1 cable, which is now in ethernet port 0.
  • ippc13 - checked on the memory check that Haydn had started 4/16/13. It looked as if the memory check ran without errors, but I couldn't reboot from the monitor. Tried to power cycle the server many times but it wouldn't start up. Strangely, it then seemed to power up on its own. Gavin was then able to see it on the serial console and configured it to boot from the RAID disk. Haydn to do more checking on this server to see if the newly added memory could be faulty.

Apr 16 (Rita & Haydn)

  • ipp064 - Replaced the LSI RAID BBU. Now write-thru caching is working again.
  • ipp057 - Replaced failed drive 19.
  • ippc17 - Discovered that the power supply fan had stopped working. The power supply is no longer supplying +12V, and the -12V supply is only providing -10.5V. Also, heat sink compound inadvertently got onto the bottom of one of the CPUs and onto the CPU socket's pins. We don't have the materials here to fix the power supply or clean the CPU or CPU socket, so we'll take it back to ATRC, and I'll try to fix it there.
  • ippc13 - installed the leftover 32GB RAM from the old ippc12 into ippc13, to upgrade it to 64GB.
  • completed the rest of the 10G data transfer hardware swap
    • moved 2 x10G uplinks from the brocade to ipp2960s-te
    • moved ippc01 - ippc16 connections from the brocade to ipp2960-s
    • removed the 10GBase blade WS-X6704-10GE (on loan to use from Indiana Univ.)
    • inserted PanSTARRS 10GBase blade WS-X6704-10GE (recently purchased by PS1)
    • inserted long range transceiver (XENPAK-10GB-LR+) in Te13/1 and reconnected fiber
    • inserted the two short range transceivers in Te13/3 and Te13/4 and reconnected fiber
    • Gavin confirmed the new connection to ippc01-c16 was operational, and was able to restore traffic to Te13/1
    • removed 10Gbase brocade switch from cab0 (on loan from Indiana Univ.), which was replaced with ipp2960s-te
  • Rita to box up all equipment for IU and mail back to IU from ATRC, ASAP

Apr 10 (Rita)

  • Worked on ippdb00 and swapped out the 300GB disks for new 600GB disks. Reconfigured the RAID array, and handed over the OS rebuild to Gavin and Haydn to perform remotely. All came up fine and IPP team was notified that afternoon that they could proceed.
  • While I was there, Gene and team remotely updated stsci06 to kernel version 3.6.7 and rebooted it. From what I could see on site, all seemed fine when the system came back online.

Apr 9 (Rita)

  • Worked on the 10G data transfer hardware swap
    • Indiana Univ. hardware that has been on loan needs to be swapped with new hardware PS1 has recently purchased, so the loaned hardware can be returned.
    • Haydn removed the DELL 1U server on 4/8 that will be returned.
    • Rita mounted the new Cisco 2960S-48TB-L in Cab 0 and inserted the two SFP10G-SR transceiver modules.
    • While I was at MHPCC, the IPPCORE was having problems. See http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/ippcore_log for details. Will continue with the 10G data hardware swap once the IPPCORE is stable.

Apr 8 (Haydn)

  • ippc19 - reinstalled it, but this time with a Tyan S7010 motherboard, and replaced the fan in the power supply. I had to modify the power supply to add a second 8-pin power connector for the S7010.
  • xfer22 - Gavin shut it down, and I disconnected it, packed it in a box, and brought it back to ATRC to be shipped to IU soon.
  • cab7 - reseated the ethernet cable connecting pdu0 and pdu1, because we couldn't reach pdu1. Gavin confirmed that it works now.

Apr 2 (Rita)

  • A power outage for parts of Kihei from approx. 4/1/13 @ 10:30PM to 4/2/13 1AM affected the MHPCC PanSTARRS IPP cluster. The IPP PanSTARRS system unfortunately went down along with the UPSes about 45 minutes after the outage began. I went to MHPCC on 4/2 at 9AM to check it out:

All but 4 servers had restarted when power was restored.

  • ipp046 - needed to have the BIOS reset. Once that was done the reboot went fine.
  • ipp048 - had rebooted automatically and was running fsck, which took some time, but eventually recovered without problems.
  • ippc17 - had to be power cycled a few times, and then it did reboot without problems.
  • ippc19 - got no response after a few power cycles. Opened the box and reset the CMOS. When trying to re-power the server, saw FF on the LED, and then Eddie and I smelled something burning and quickly unplugged power.

Brought the server back to the ATRC to investigate and it appears that the voltage regulator to the CPU had burned - MOBO must be replaced.

Mar 27 (Haydn and Rita)

  • ippdb02: swapped out 300GB drives with new 600GB drives. Configured RAID array and installed OS. All seems to have gone fine. IPP team to check out server.
  • ipp064: Haydn was going to replace the RAID battery, but we didn't have the correct model. Haydn to place order.
  • Per request of Ken and Gene we checked the RAM in the IPP compute nodes in order to determine which ones to upgrade to 64GB and 48GB.

Mar 19 (Haydn)

  • ippdb02: Powered machine down, checked how the RAID card is setup, and decided not to try another RAID card in it. Powered machine back up.
  • ippcore: Together with Brad, cleaned fiber port 13/1. Gavin said it is working now.

Mar 8 (Rita)

  • ippdb02: 1TB drives were swapped with the original 300GB drives. All rebooted fine.

MAR 7 (Rita)

  • ippdb02: 2TB drives installed on 2/28 were not compatible. Meant to swap in the original 300GB drives but it was discovered that I accidentally installed the 1TB drives. Will return tomorrow and swap again with 300GB this time.

FEB 28 (Rita)

  • ippdb02: Swapped out the 300GB drives and replaced with new 2TB drives. IPP team will test to see if they are compatible with the RAID controller.

FEB 26 (Rita)

  • ippdb02: Swapped out the 1TB disks that were installed FEB-21-13, due to compatibility issues, and replaced the original 300GB 15K RPM drives. Was able to replace the drives in the correct slot and didn't have to rebuild the RAID array.

FEB 21 (Haydn/Rita)

  • ippdb02: Upgraded the RAID from using 300GB 15k RPM Cheetah drives to using ordinary 1TB drives, to increase the RAID storage capacity.

JAN 16 (Haydn/Rita)

  • ipp027-030: All of the RAID controller disks were swapped for 2TB disks. We are waiting for the RAID silvering to finish before installing the OS.
  • ipp027: Had stopped working on January 6, 2013. It looks like another case of an incorrectly installed CPU heat sink (very little heat sink compound was used). Again, it seems to have made the CPU socket stop working. I tried 4 known good CPU's in the second socket, and none would work. All of those same CPU's worked fine in the first socket. Now the machine is running with just one CPU.

2012 Production Node Work

DEC 12 (Haydn/Rita)

  • ipp029: Machine stopped working. Discovered that the heat sink hadn't been screwed down -- it was just loose. The heat sink compound was dark (almost black) and grainy, the top of the CPU was slightly pitted, and the bottom of the heat sink was also slightly pitted. We tried using a known good CPU and many unknown ones in socket 1, but nothing worked and it kept the machine from booting. We had to remove the CPU from socket 1 so that the machine would work.
  • ipp032: Replaced drive #31.

NOV 9 (Haydn)

  • ipp026: New Areca RAID controller installed and tested for compatibility. All was fine and data still accessible.
  • ipp023, ipp024, ipp025, ipp026: All the Areca RAID controller disks were swapped for 2TB disks. OS was reinstalled and all backed up data restored.

OCT 11 (Haydn)

  • IPP013: Spoke with Mr. Chau (chaut@…) about the Tyan S7010 motherboard we sent back to them to fix on 9/12. He said the motherboard is broken and cannot be repaired. Something broke within the Northbridge chip. We still have to pay their $50 processing fee. It is NOT covered under warranty. Their warranty period is 3 years from the date of manufacture, and even though we purchased them less than 3 years ago, the board was manufactured ~4 years ago. NewEgg.com no longer sells these boards. There are two for sale on eBay for ~$290+shipping. I'll check with Gavin to see what we want to do.

OCT 03 (Haydn/Rita)

  • IPP046: removed and replaced the BIOS settings battery. The old battery was completely dead. Also verified that the CMOS jumper was in the normal position. Re-entered the usual BIOS settings for console redirection, power, and boot.
  • IPP013: removed the CPU heatsink which was missing a fan, added the repaired fan, reinstalled it with fresh Arctic Silver thermal paste. When power was reapplied it almost instantly showed the Tyan logo screen, however it stayed there for a long time. I pressed the reset button and it did the same thing. Tim pressed the escape key, and then it instantly went through the rest of the boot process. I'm wondering if it has the correct BIOS settings? When it finished booting, I logged in via SSH, but it hadn't mounted our home directories. I let Heather (today's Czar) know about this problem.

SEPT 07 (Cindy/Rita/Haydn/Tyler)

  • IPP013: replaced mobo w/ cold spare Tyan S7010 which Tyler was testing on ippc12. Discovered CPU1 fan wire was sliced.
  • IPP037: RAID issues surfaced starting on the afternoon of 9/6/2012. RAID volume appeared to become unresponsive after boot. Re-seated Areca controller. System remained responsive after re-seating. IPP team to backup nebulous data to available storage space on stsci0x nodes. /export/ipp037.0 mounted in read-only mode during rsync process.

SEPT 06 (Cindy/Rita)

  • Servers affected: IPP013, IPP046, and IPPC04 did not boot up after the clean (emergency) shutdown caused by the MHPCC power outage on 9/5/2012.
  • IPP013: VGA and serial console not responding when server is powered on. All 4 PSU are powered on, memory reseated, PCI-e RAID controller reseated, CMOS cleared, chassis circuit board examined. Server powers up, HDD LED visible, VGA and serial console not displaying anything.
  • IPP046: CMOS error, Cindy fixed by resetting BIOS (console redirection + power on after failure)
  • IPPC04: Console Redirection not enabled in BIOS, all other BIOS settings were correct.

JUNE 19

  • re-labeled the A disks on the 10 stsci servers.
  • contacted Lonnie at ASPEN to work with MegaCli (the MegaRAID command line utility), but we could not get the syntax correct for the PdLocate option. I plan to experiment further with MegaCli and document the correct syntax for others.
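For whoever picks this up, here is a hedged sketch of the PdLocate form as it is commonly given in LSI's MegaCli documentation: `-PdLocate -start|-stop -physdrv[E:S] -aN`. It has not been verified against the MegaCli version installed on the stsci servers. The helper name `pd_locate_command` and the enclosure, slot, and adapter numbers are hypothetical, and the binary may be installed as `MegaCli64` on some systems. The script only builds and prints the command; it does not run it.

```python
from typing import List


def pd_locate_command(enclosure: int, slot: int, adapter: int = 0,
                      start: bool = True) -> List[str]:
    """Build (but do not run) a MegaCli command to blink or stop blinking a drive LED.

    The option layout is an assumption taken from commonly cited MegaCli usage;
    check it against the local MegaCli help output before relying on it.
    """
    action = "-start" if start else "-stop"
    return ["MegaCli", "-PdLocate", action,
            "-physdrv[{}:{}]".format(enclosure, slot),
            "-a{}".format(adapter)]


# Hypothetical example: enclosure 252, slot 3, adapter 0
print(" ".join(pd_locate_command(252, 3)))
print(" ".join(pd_locate_command(252, 3, start=False)))
```

Note that the square brackets usually need quoting when the command is typed into a shell.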

JUNE 18

  • ippc04 fixed, so added it back into the cluster at MHPCC. It started up fine, and I informed Gavin so he could continue configuring it for the cluster.
  • verified that the RAID disk locate utility from the MSM GUI was correct on all 10 servers, and that the labels did need to be switched for the A and B disks.
  • re-labeled all the B disks on the stsci servers, but too late to start on the A disks.

JUNE 13

ippc04 MOBO was swapped into ipp015 (see April 4 & 5). Tyler found JP1 in the CMOS clear position, where the machine hangs at checksum. With the jumper in normal mode the machine won't power on. Without the jumper in normal mode the machine will not be able to save the BIOS settings, which is what we saw. With the jumper removed it powers on, and it ran overnight.

JUNE 12

DB03 crashing - no hint as to why. Swapped MOBO, memory and CPUs out of ippc12 into ippdb03, and added 8x2GB of RAM for a total of 48GB of RAM. Brought DB03 back to ATRC with original memory and CPUs for diagnosis.

JUNE 5-6

Installed STSCI servers - all running.

MAY

Installed the BROCADE switch and CISCO 10G card. swapped cables around the big CISCO switch to make room for the STSCI machines.

APR 05

  • ippc04 -- removed the motherboard from it and installed it in ipp015. Brought it back to ATRC, in case we want to install a new motherboard/CPU/RAM combination in it.
  • ipp015 -- works now, but the motherboard I pulled out of it had the RAM in the wrong slots, and JP7 on the wrong pins. We could use better quality assurance when we are retrofitting motherboards.

APR 04

  • ipp015 -- it stopped working yesterday, and I went to MHPCC to see if I could coax it into functioning again. It was off when I arrived. I tried turning it on, clearing the CMOS RAM, removing the battery, removing the RAM, removing the RAID card, moving the RAID card to a new slot, removing CPU #2, and swapping CPU #2 into CPU#1's place. Nothing made any difference. Tomorrow morning at 9am I will return and transplant ippc04's motherboard into ipp015.

MAR 28

  • ipp041 -- took it down to swap in a motherboard we obtained on eBay. Unfortunately, the new motherboard wasn't actually new, and the power sockets on it wouldn't accept our power cables, so we had to remove it and revert back to the old motherboard, which only works with one CPU. The eBay seller apologized and is sending us a full refund.

MAR 21

  • ipp014 -- took it down to rotate one of the CPU coolers by 180 degrees. It had accidentally been installed backwards.
  • ipp041 -- took it down twice to identify what motherboard it uses. It is a Tyan S5397AG2NRF. It turns out there is a second variant, the S5397WAGNRF. Normally these cost about $500-$700, but I found one on eBay. The auction closes tomorrow morning, but currently there are 21 bidders and the price is $92, plus $12 for priority mail shipping. The board is "New In Box" and the pictures of it show it in the original box, with the original stickers over the slots and the cables and other components sealed in their bags. The difference between the two boards is that the one on eBay also has connectors for SAS drives, while the one we have doesn't have that part of the board populated. I think we should snap this one up -- it should go for less than $200, which is about a third of the best price Google/products showed.

MAR 14

  • ipp015 -- was turned off when we arrived. We plugged a screen into it and turned it on and it worked, except for networking. I'd chosen to boot the top kernel in the list instead of the default one, so we tried again with the default kernel, and then it worked perfectly.
  • ipp058 -- it was reported to us that one of its PDU's was in an odd state. We confirmed that the electrical plugs were connected to the right sockets on the PDU, and that their little green lights were both glowing (and thus indicating power on). We couldn't access the URL for the PDU from MHPCC. When Serge went to confirm the existence of the odd status, it was no longer there, so at that point there was nothing more to be fixed.

FEB 22

  • ipp041 -- fixed the problem with the CPU not being able to make good contact with the heatsink. Had to remove a standoff which was in the way. Unfortunately, had only 1 hour to work on this machine, so we ran out of time to debug it and still the second CPU was keeping the machine from working, so we had to remove the second CPU again and leave. We wanted to begin working on this machine first, but couldn't make contact with anyone from the IPP-dev team, including the czar, so most of our time was wasted. Perhaps next week or in a few weeks we can visit it again. I bought some Q-tips and Goo-Gone to clean the second CPU with next time I'm there. This was frustrating.
  • ippc11 -- reinstalled it with the new motherboard. Confirmed with Gavin that all of the cables for it were correct.
  • ippc12 -- accidentally unplugged the power.

FEB 17

  • ipp041 -- discovered that one of the CPU coolers wasn't making good contact with the CPU, so had to remove that CPU to get the machine to boot up. Next Wed or so we should be able to return and take a stab at fixing the problem.

FEB 9

  • ipp027 -- began displaying the BIOS message today -- we have no idea why, but we were very thankful! Used the CD to update its BIOS. This cleared the old settings, so I changed the ones which we always change. Now it works fine again.
  • ipp028 -- still catatonic. Displayed "E4" POST code. Cindy recommended removing everything from the machine, so we removed the RAID card. This caused the two beeps it was making to change to a different (undocumented in the manual) pattern. The POST code changed to "E1". Next we removed all of the RAM. The POST code changed to "60" after going through a bunch of different values. We started reinstalling the RAM, first 1 stick, then 4, then 8. With one stick the BIOS screen said that the BIOS settings were invalid, so I reset them to our usual values and saved it. Apparently this machine's BIOS settings spontaneously got corrupted. If this happens again, we should replace the CR2032 battery with a fresh one, and/or check that the CMOS reset jumper is in the default position. Reinstalled the RAID card. Oddly, this machine recognizes that it has an ATA DVD/RW drive, however it cannot read the CD in the drive.

FEB 8

  • ipp030 - one small fan in back wasn't working, but didn't bother to replace it. It has minuscule impact.
  • ipp028 - no fan failures found.
  • ipp031 - no fan failures found.
  • ipp023 - no fan failures found.
  • ipp014 - no fan failures found, however one of the CPU fans is installed backwards -- it is pushing air toward all of the other fans. Fixing it requires removing the heat sink and rotating it 180 degrees, which we didn't want to do at that time, but we should consider doing in the future.
  • ipp016 - missing one fan (on the left), had 2 broken fans. One fan was actually broken, and the other wouldn't work because of a short/open circuit on the little board which the fans plug into. Replaced the broken fan, and connected the other fan directly to the headers behind the disk drives.
  • ippc08 - the large power supply fan was actually ok.
  • ipp025 - moved the power cables on the PDU to where they should have gone.
  • ipp027 - the power cables weren't looped through the power supply handles, so one by one I moved them. Tried to update the BIOS with the CD which Gavin sent me the ISO for. Unfortunately, the machine turned catatonic, and we couldn't see anything on the video and the caps lock light wouldn't come on on the keyboard.
  • ipp028 - wouldn't come back up. Gavin recommended using the motherboard jumper to clear the CMOS. Behaved just like ipp027 -- catatonic. Displayed "E4" POST code.

FEB 2

  • Checked 4 servers for possible fan failures - Haydn and Rita
    • ippc08 - larger fan on power supply not working -- frozen stuck. We did not have the replacement fan with us and will replace next week.
    • ipp015 - we found two fans in the front (that pull the cold air in) that weren't working. We discovered two power connectors to power the fans were not connected. After connecting power all fans now working.
    • ipp025 - no fan failures found.
    • ipp017 - no fan failures found.
  • Rita - While I was there working Gavin called for support to check out ipp027, which seemed to have crashed. After power cycling a few times, waiting 1 hr., and power cycling again, the system rebooted and came back online.

JAN 30

  • ipp027 was crashing and its CPU overheated. Opened ipp027 to find the CPU1 fan had come unplugged. Reconnected it - running ok since then.
  • remounted ippc12 with new MOBO installed by Haydn
  • Unboxed and racked WAVE 5 34 1U's in cab 10 and 11.

JAN 31

  • Configured the bios on all 34 WAVE 5 machines and began the labeling of the nodes.

2011 Production Node Work

DEC 19

  • Replaced the bad RAM in ipp063.
  • Rebuilt the RAID in ipp064, however I didn't have the documentation, couldn't get help from Cindy or Gavin, and one parameter wasn't correct, so it will have to be redone. Maybe Wednesday...
  • Found an old Tyan LGA771 motherboard and brought it back. I'll try populating it to see if it works in ippc11.
  • Found two small fans. I'll try them in the power supply for ippc12.
  • ipp064 sometimes shows 40 GB RAM when it boots, and other times 48 GB. It contains some inconsistent RAM. I'll try to find it and swap it out next time I'm there.

DEC 12

  • Temporarily swapped two known good motherboards into the chassis for ippc11 and ippc12. They both worked fine, so this indicates that the motherboards in ippc11 and ippc12 are both broken. Perhaps the CPU's and RAM are okay, but we'll need a known good LGA771 motherboard to test them with.

DEC 8

  • Swapped the RAM between ipp065, ipp066, and ipp063. ipp065 and ipp066 either sometimes or always indicated only 40GB, instead of 48GB. When we moved the RAM between the different machines, the good RAM from ipp063 always worked, the bad RAM always failed, and the inconsistent RAM continued to behave inconsistently, so the problem is that we have at least two bad sticks of RAM -- one which always fails and one which fails intermittently.

DEC 7

  • Tested the power supplies on ippc11 and ippc12. ippc11's power supply's -12v line wasn't working. ippc12's power supply produces the right voltages, but the exit fan doesn't spin, which decreases the airflow to about 25% of what it should be. Found a spare power supply for ippc11. Replaced the motherboard in ippc11, but the "new" motherboard doesn't work.

NOV 21

  • Cindy, Bill, and Rita swapped out motherboard and CPUs from ippc12 into ipp029

NOV 15

  • added auto-monthly consistency checks for ippb machines
  • switched the names for ippb00/03 machines - auto-mount configuration ippb00.0 -> ippb00.2 to reflect physical array swaps done.

Nov 10

  • Haydn visited MHPCC.
  • Installed a new BBU in ippdb02.
  • Checked ipp029, because it was reported to crash quickly under load. The CPU heat sinks had not been installed correctly. Heat sink compound covered only ~50% of the CPUs. Removed the old heat sink compound and used Arctic Silver 5. We should probably do this for all of the machines in this series -- just because they aren't crashing, doesn't mean they were installed correctly. They are probably simply throttling themselves to avoid thermal failure.
  • Checked ippc11, because it was reported to crash under load every few days. Burned myself slightly when my bare forearm touched the bottom of the case. This is a 1U machine, and the power supply is incredibly hot (~55 deg C). Removed the CPU heat sinks, but they had originally been installed correctly. Re-racked the machine, but other machines with the same design have power supplies which are much cooler. Perhaps the power supply is failing, or one of the fans in it has failed. Because this is a 1U machine, the CPU heat sinks depend on the cover being on in order to get good air flow across them, but next time I'm there, there is a spare power supply which might work in this machine, and I can try running it with the top off for a few seconds to check if the power supply fans are working or not.
  • Reseated the memory in ipp065, but it still reports only 40GB instead of 48GB.

Nov 7

  • team racking and installation of thirteen 5U's
  • put cpu back into ipp036
  • put ippc10 back in rack

Nov 4

  • ippb00 Haydn replaced back plane for ippb00 a0 - Drive 3 detected. All drives put back in unit - parts sent back to YMI.

Nov 1

  • Installed ASUS MOBO into ippc10

Oct 27

  • Installed spare ASUS MOBO into ippc10 - cpu heat sinks do not install properly - Brought ippc10 back to ATRC -ordered proper CPU carriage frame from YMI

Oct 25

  • ippb00 replaced cable, RAID card and sled - still not detecting drive A0 3 - ordered back plane

Oct 24

  • ipp036 has dead CPU0 fan unable to boot.
  • Swapped MOBO out of ippc10 into ipp036 (still in rack - unable to remove it - only me working). Left ipp036 with 1 processor - not sure if the CPU is damaged - did not want to hurt the MOBO of a running system
  • Installed 6 of the 8 3TB drives in STARE nodes. 2 drives were DOA and sent for RMA

Oct 19

  • replaced ippb00 A0 drive 3 but it is still not being detected. Worked with LSI to debug - needs a hardware fix. Ordered parts from YMI

Oct 10

  • Swapped CPU0 memory with CPU1 memory

Sept 30

  • installed 3TB disks in slots 2&3

Sept 20

  • Bill and Haydn Installed the new MOBO on ipp020

Sept 13

  • Bill and Haydn Installed the new MOBO on ipp021

Aug 31

  • ipp010, ipp017, ipp020 (disk upgrades) and ipp021 (mobo upgrade).

Aug 30

  • Replaced IPP014's raid controller.

Aug 23 2011

  • Haydn, Bill and I were able to upgrade the MB and disks in ipp004, ipp013 and ipp018 without incident.
  • We corrected the existing error in the RAID card cabling in ipp004, so now the drives are seated in the same order as in the other machines (funnily, we discovered and remembered this problem because we had missed swapping a drive and put a 500GB drive back in, which was showing in the wrong slot).
  • ipp037 has a new MB.
  • Gavin installed the OS and built the RAID - he was instrumental once again in supporting us while at MHPCC.
  • Haydn did all the drive replacements

Aug 16

  • Installed the replacement MOBO for ippc10

Aug 15

  • ipp011
    • Upgraded to new MB and 2 TB drives. Had to move CPU0 memory stick from outer blue slot to middle black slot.
    • Supposed to fill blue slots first. The memory was not detected in the outer black slot. Same weirdness seen in ipp009 (according to Gavin, ASA chassis)
  • ipp019
    • Upgraded to new MB and 2 TB drives. Memory all in blue slots and detected
    • The 1st new MB installed did not detect any memory in the cpu0 no matter where we put the sticks. Tried a second new MB using all the same memory and CPU's - with good results. Will RMA the 1st new MB.
  • ipp015
    • Upgraded to 2TB drives. On reboot saw only 12GB of memory. For this machine memory is supposed to be in all black slots. However, we noted that the memory was in a mixture of black and blue slots. Moved to all blacks. All memory detected.

Aug 1

  • Swapped all memory out of ippdb03 - formerly ippc00 - using the memory from ippc10 - waiting for MB
  • Added 2 2TB disks to ippc10 to become the newest rendition of ippdb03

July 29

  • Added new memory to ippb00 and 01 - now have 32GB each.

July 27

  • Spent AM diagnosing the non-boot problem in ipp030 - failed MB
  • After lunch
  • Took failed MB out of ipp030 and replaced it with MB from ippc10.
  • ippc10 remains out of the rack.

July 15

  • Swapped the 1st 3 mem sticks out of ippdb00

May 16

  • replaced a RAID card in ipp005

Feb 9

  • Bill and I replaced the motherboards of ipp005, ipp006, ipp007, ipp025

Notables: ipp006 & ipp025 memory counts were missing 4GB. We identified which slot and put the stick in a black slot next to a blue working slot. All 24GB are now showing on both machines. The malfunctioning slots on the 2 machines were not the same. We tested the power cables and while we found some odd readings they were not consistent between the machines or relative to the cpu's with the malfunctioning slots.

The boards installed on the back of the CD drive are not working. We still have external CD drives hanging off the machines. We need to come up with a solution.

At some point in time - this fix should work for ipp009 so we have all 24 GB available.

ipp006: Voltage readings for CPU0 8pin power connector:

  +5V    +12V   +3.3V
  5.2    12.5   3.4

  -12V   +12V   5VSB   PG
  11.9   12.2*  5.4*   410ms

  * = blinking

Voltage readings for CPU1 8pin power connector: same as CPU0, except that PG was 590ms and also blinking.

CPU1 slot closest to the cpu was not working.

ipp025: Voltage readings for CPU0 8pin power connector: Same display for ipp006 24 pin and 8 pin cpu0

Voltage readings for CPU0 8pin power connector: Same for CPU1 8pin. pg 430ms
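As a rough cross-check of the "odd readings", the ipp006 CPU0 values can be compared against nominal ATX tolerances. This is a sketch under stated assumptions: the tolerances (±5% on +3.3V, +5V, +12V and 5VSB, ±10% on -12V) are taken from the ATX power supply design guide, not from anything we measured. The mapping of readings to rails is my reading of the tester display, and the tester shows the -12V rail as a magnitude. The names `RAILS` and `out_of_tolerance` are hypothetical.

```python
# rail -> (nominal volts, allowed fractional deviation); assumed ATX tolerances
RAILS = {
    "+3.3V": (3.3, 0.05),
    "+5V": (5.0, 0.05),
    "+12V": (12.0, 0.05),
    "-12V": (12.0, 0.10),  # compared by magnitude, as the tester displays it
    "5VSB": (5.0, 0.05),
}


def out_of_tolerance(readings):
    """Return the (rail, volts) pairs that deviate more than the assumed tolerance."""
    bad = []
    for rail, volts in readings:
        nominal, tol = RAILS[rail]
        if abs(abs(volts) - nominal) > nominal * tol:
            bad.append((rail, volts))
    return bad


# ipp006, CPU0 8pin connector, as logged above (two +12V readings were shown)
ipp006_cpu0 = [("+5V", 5.2), ("+12V", 12.5), ("+3.3V", 3.4),
               ("-12V", 11.9), ("+12V", 12.2), ("5VSB", 5.4)]
print(out_of_tolerance(ipp006_cpu0))
```

Under these assumptions only the 5VSB reading of 5.4 falls outside its band (4.75 to 5.25). The blinking +12V reading of 12.2 is within tolerance, so blinking on the tester does not map one-to-one onto this check.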

Feb 4

We installed the new RAID card in ipp012 without problems.

However, we did not have success with the IDE->SATA converter board for the CD/DVD. The drive was not being detected on boot. We tried all SATA ports, a different card, a different CD/DVD, and a different cable, to no avail. We checked the BIOS and there was not anything we could see that would prevent it from working. The external drive has been reattached. We brought a card and CD/DVD player back to test it and get it working here. It may need a jumper set to make sure it is not in slave mode - however we did not have a manual or a jumper - the boards did not come with one.

We also had the usual rack problems - ipp012 is particularly difficult to put back in - it is at the bottom of the rack and ipp013 is weighing down on it a bit.

Jan 31

disk upgrades on ipp008, ipp016, and ipp021, and motherboard upgrades on ipp008 and ipp016.

Jan 24

We put the power supplies from ipp009 in and it booted normally. However there are still fan related alarms sounding though all the fans are working. We swapped out the led/alarm board out with the same results.

ipp009 is up and ready for use but still short 4GB of memory. After many tests we can conclude that the problem lies in something other than the motherboard or memory. We had the same problem with a known good motherboard and memory. We narrowed it down to a set of black and blue memory slots associated with CPU0. Not the slots themselves, but something that serves the slots. We know there is a problem with 2 slots because 12 x 4GB memory sticks only showed 40GB of memory. Also: The power supplies from ipp014 were put into ipp009 and there didn't seem to be any issues. Therefore we have no idea why ipp014 started booting normally with a power supply swap. Note: the ipp009 memory problem occurred with the good power supplies as well.

Jan 21

replaced motherboard hardware and disks for ipp009, ipp012, ipp014.

ipp012: new disk 11 had to be swapped out. ipp009: 6 4GB sticks were installed but only 5 are showing. All the memory has been swapped out for new memory with the same result. ipp014: We hear continuous beeps on power on, changing to 2 short beeps and a long one after something appears on the screen. However, the 2 short beeps and a long one are not constant - they become random. We don't see any memory report on the first screen like the others. We changed the motherboard and memory with identical results. There was a dead fan issue but we rewired it and all the fans are working and the fan light is off. No effect on the beep pattern. The initial quick short beeps sound like memory beeps, but it is weird that we get it to boot at all. Left out of the rack.

The new motherboards do not have an IDE connection for the existing optical drives. We (Bill) drove to the only 2 computer stores on the island and purchased the only 2 notebook size optical drives with SATA connectors available. However - once we installed these we realized the connectors were something we had not seen - and did not have the cables for. At first glance they look like SATA connections but there is a power component as well. Oops. Fortunately Brad had external USB optical drives we could use. Hundreds of them, in fact. Gavin had to MacGyver them into place because the cables were too short to set them anywhere.

2010 Production Node Work

(Up to Production Cluster Status)

Sept 15

  • Removed cable obstructing power supply fan in ippc11

Sept 13

  • Replaced CPUs (with new ones) and failed CPU fan in ipp014

Aug 23

  • Swapped Memory out of IPP14

Aug 6

  • Swapped Memory out of IPP12
  • Upgraded IPP006 and IPP007 with 2TB drives

Jun 22

  • Swapped Memory out of IPP14

April 27

  • ipp018
    • swapped memory -> continuous beeps
    • Brad noticed one row of the hard drive array did not light up with red lights
    • Reseated SATA cables to backplane and RAID card
    • Swapped in set of new memory

April 26

  • Swapped in larger RAID card memory module in ipp037
  • ipp018
    • reseated memory -> still 2G memory
    • swapped CPUs -> still 2G memory showing
    • swapped in new set of memory (all 8 sticks) -> no boot rapid beeps

April 16

  • Swapped two new CPUs into ipp037 and booted it into .31 kernel

April 8

  • Swapped positions of ipp005 and ipp037 in rack
  • Attached ipp037 SATA cables to the RAID card so they're mapped properly
  • Noted fan warnings on nodes ipp014, 016, 008, 007, 004, 025, 024, and 045 (appear to be due to fan power board failures)

April 1

  • ipp008
    • Put back into rack
    • Still missing single HD array fan
  • ipp018
    • New motherboard; 2 new CPU fans (Sunon)
    • Put back in rack
  • ipp037
    • New motherboard; 2 new CPU fans (Sunon)
    • Put back in rack
    • Array cables not mapped to RAID card correctly
  • ipp014
    • Unsuccessfully attempted to get one of the HD array duo fans working
    • Appears to be an issue with the connector

March 22

  • ipp018
    • Boots but goes to A:\>
    • Pulled CMOS battery; reset MB w/ jumper; no help
  • ipp008
    • No boot
    • Rapid beeps
    • One pwr supply dead
    • Swapped in new MB
    • Replaced CPU fans with those taken from ipp037
    • Single HD array fan not working (appears to be module connector)

February 22

  • Confirmed light on ippc17 drive
  • Installed additional fan in ippc18 (small fan on right side)

February 16

  • ipp008
    • No boot
    • rapid beeps -> then 1 sec beep
  • ipp018
    • No boot
    • Both CPU fans not working - one has broken wire
    • No codes on MB
    • No lights on power supplies
  • ipp037
    • No RAID card
    • CPUs installed w/o paste

February 11

  • Installed an additional fan in ippc19 (small fan on right side)
  • Swapped CPUs
    • ipp018 -> ipp005
  • ipp037 -> ipp008
  • Notes: ipp018 had excessive paste, bent pins on CPU2 (on MB), and an apparently scorched heatsink

NODE SWAPPING MAP

*Wave1 IPP17 died. Its disks were swapped into Wave2 IPP37

  • So IPP17 is a Wave2 Machine.

*The broken Wave1 Machine (previously IPP17) was sent back. When it came back

  • It had a different motherboard and was incomplete - missing fans.
  • The disks from a newly broken IPP14 were swapped in.

*So IPP14 became IPP37.

  • IPP14 ( Now IPP37 ) was out of warranty so we decided to swap the mother board and cpu's
  • Summary:
    • IPP14 is a Wave 1 ASA Machine with a different motherboard.
    • IPP17 Is a Wave 2 YMI Machine
    • IPP37 is a Wave 1 Machine
