PS1 IPP Czar Logs for the week 2013.07.01 - 2013.07.07

(Up to PS1 IPP Czar Logs)

Monday : 2013.07.01

  • 11:10 MEH: restarted apache on ippc20-ippc29,ippc31,ippc32 for high throughput data access tests with Suvi et al. at UMD after Gavin opened up the firewall for the requested ports. ippc30 not in the set as it is used for other work (PSS?)
  • 11:40 MEH: exposure o6461g0486o from 6/18/13 is confirmed lost by camera group, will need to flag appropriately in DB. Angie notes it has been retaken as o6462g0489o.
  • 12:00 CZW: ipp055 had an unrepairable NFS issue, so I rebooted it. I had stopped stdscience while attempting to fix this, and after the reboot, decided to simply restart stdscience as well. Processing appears to be completing now.
  • 23:20 MEH: pstamp looked like it was stumbling; restarted it.

Tuesday : 2013.07.02

mark is czar

  • 07:15 MEH: data downloaded and processing finished
  • 07:40 MHPCC chillers down; systems need to be shut down. Rita, Heather, Gene, and Gavin are shutting systems down, mostly compute nodes and things not running DVO. All processing stopped.
  • 10:50 MHPCC chillers back up; systems to start up again -- Gavin going through the slow process of booting the many machines
  • 13:29 Bill: set pstamp pantasks to run
  • 14:39 Bill: added some compute nodes to pstamp pantasks to work on the backlog
  • 14:45 MEH: turning stsci back on w/o compute(1),2,3 nodes (being used by PSS) in order to do the late morning SS diffim for MD.
  • 15:15 CZW: Mark raised an interesting question, and I want to record my understanding of the logic. We've added the new stsci1X.Y nodes into nebulous and enabled them to accept data. This allows nebulous to use them as a target for anything without an explicit target set (replicated copies of data, files with no host request, files whose host request cannot be serviced because the disk is full, etc.); nebulous fully controls these hosts. However, the majority of hosts have targeting rules set in pantasks, as part of the ippTasks/ipphosts.mhpcc.config file. This file sets rules that target OTAXY onto host ippXYZ.0, and does a similar thing with skycell_ids. Once these rules are defined, pantasks will begin to write data directly to those hosts. Since this can cause NFS issues if pantasks attempts to write a large amount of data to a single host, we use the ~ipp/ippconfig/pantasks_hosts.input file to restrict it: pantasks will not launch more than N jobs that rely on a given host, provided it knows about that host. To prevent this issue on the stsci0X.Y nodes, we add one instance of each host and then turn that instance off. This prevents processing from running on those hosts, but still lets pantasks manage the jobs that read/write from them. So if we add targeting to the stsci1X.Y nodes, we must also add them to pantasks_hosts.input to keep them from being overloaded.
  • 15:30 MEH: turning on other pantasks for now as well, will do a clean restart for many before nightly obs start -- ssp summitcopy+registration catching up
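The per-host job cap described in the 15:15 note (a host listed in pantasks_hosts.input still rate-limits jobs that touch its disks, even when its instance is turned "off" so nothing executes there) can be sketched roughly as follows. This is a hypothetical Python illustration of the bookkeeping, not actual pantasks code; the class and method names are invented:

```python
from collections import defaultdict

class HostThrottle:
    """Hypothetical sketch of pantasks-style per-host job limiting:
    a job counts against every *known* host whose disks it touches,
    whether or not that host is enabled to run jobs itself."""

    def __init__(self, limit_per_host=4):
        self.limit = limit_per_host
        self.known = set()               # hosts listed in the config (even "off" ones)
        self.active = defaultdict(int)   # host -> running jobs touching its disks

    def add_host(self, host):
        self.known.add(host)

    def try_launch(self, io_hosts):
        """Launch a job that reads/writes the given hosts, unless some
        known host is already at its limit.  Hosts pantasks does not
        know about are not throttled at all (the failure mode the note
        warns about for untracked stsci1X.Y nodes)."""
        tracked = [h for h in io_hosts if h in self.known]
        if any(self.active[h] >= self.limit for h in tracked):
            return False
        for h in tracked:
            self.active[h] += 1
        return True

    def finish(self, io_hosts):
        """Release the slots a completed job held."""
        for h in io_hosts:
            if h in self.known:
                self.active[h] -= 1
```

For example, with a limit of 2, a third job targeting the same known host is deferred, while jobs touching an unlisted host sail through unthrottled, which is exactly why new data nodes need an entry in pantasks_hosts.input.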

Wednesday : 2013.07.03

mark is czar

  • 07:10 MEH: all night downloaded and processed. may need to bump PSS prio for MOPS(499) vs QUB(515)
  • 20:50 MEH: looks like stealth stacks are running on wave1-4, compute(1)-3 without adjustments to stdscience and have started sticking jobs on a few regular suspects (ipp040..); removing the equivalent from stdscience to help balance.
  • 22:15 MEH: stacks may need more than -1x to balance because of time+threading. ipp056 getting hit a bit when other stacks run on it. Probably do not want to run extra jobs on anything other than the compute3 nodes, particularly with DVO often running extra on the data nodes.
    • pulling wave4 from stack in anticipation that others will have issues; +1x compute3 to stack, -2x compute3 from stdsci to balance.

Thursday : 2013.07.04

Friday : 2013.07.05

Saturday : 2013.07.06

Sunday : 2013.07.07