PS1 IPP Czar Logs for the week 2014.08.11 - 2014.08.17

Monday : 2014.08.11

  • HAF czar : restarted summitcopy, registration, distribution, stdscience, stack (cluster was off this weekend due to the hurricane)
  • HAF czar : ippb00, 01, 03, 04 are set to repair, 02 is set to up
  • HAF czar : restarted czartool/roboczar on c18, then stopped and restarted on c11

Tuesday : 2014.08.12

  • 07:30 MEH: LAP stacks needed to be reverted, so some runs are now clearing; one had stack label LAP.PV3.20140730.local
  • 08:30 MEH: probably need to do a regular restart of the LAP pantasks as well -- will do tonight if I have to reallocate for the normal nightly
  • 10:30 MEH: restarted pstamp -- pstamp must run from ipptest for now -- often running out of space; need to change PSTAMP_PRESERVE_DAYS from 14 to 7 days or so, or document how to do it manually?
    • suspect stamp cleanup is not happening because the cleanup pantasks is not running? starting it now -- or does this just clean up the updated products and not the stamp bundles? need to clean up stack bundles for space.. and some more doc on this
    • to try and clean up more space for pstamp, as ipptest ran: pstamp_queue_cleanup.pl --preserve-days=12
      • clearing >200G now..
  • 11:00 MEH: LAP pole stdsci poll was set too low for the number of nodes available if it runs out of chips etc.; bumped up
  • 12:00 MEH: don't see nebdiskd on ippdb00, needs to be started -- ipp@ippdb00 started nebdiskd
  • 13:10 MEH: unable to log into ipp071.. suspect NIS/ypbind issue again like on 8/4 -- only noticed since nebdiskd couldn't access it; has it been broken since the reboot the other day or did it die recently? -- Gavin found a problem with a 10G fiber delay in network startup causing ypbind to time out. He will look into a fix
  • 14:10 MEH: restarted LAP stdlocal pantasks and reloaded nodes from before -- too much, if any pstamp updates happen then will overload datanodes..
  • 15:10 MEH: pstamp having trouble -- seems a reqType unknown is in system and causing problems? -- leaving off until solved because just filling log file
    failure for: request_finish.pl --req_id 405851 --req_type unknown --req_file /data/ippc30.1/pstamp/work/webreq/2014/08/12/web_162610.fits --req_name NULL --product NULL --outdir NULL --redirect-output --dbname ippRequestServer --verbose
    job exit status: 29
    job host: ippc38
    job dtime: 0.604044
    job exit date: Tue Aug 12 15:25:27 2014
    
    • is creating ~ipptest/NULL/reqfinish.405851.log file, seems to be unknown DB? but if ipptest was running before, should be okay?
      request  405851 has unknown reqType unknown
      Running [/home/panstarrs/ipptest/psconfig/ipp-pv3-20140717.lin64/bin/pstamptool -updatereq -req_id 405851 -state stop -fault 5 -dbname ippRequestServer]...
      Unable to perform /home/panstarrs/ipptest/psconfig/ipp-pv3-20140717.lin64/bin/pstamptool -updatereq -req_id 405851 -state stop -fault 5 -dbname ippRequestServer error code: 768 at /home/panstarrs/ipptest/psconfig/ipp-pv3-20140717.lin64/bin/request_finish.pl line 101.
      request  405851 has unknown reqType unknown
      Running [/home/panstarrs/ipptest/psconfig/ipp-pv3-20140717.lin64/bin/pstamptool -updatereq -req_id 405851 -state stop -fault 5 -dbname ippRequestServer]...
      
      
       -> psDBAlloc (psDB.c:166): Database error generated by the server
           Failed to connect to database.  Error: Unknown database 'ippRequestServer'
       -> pstamptoolConfig (pstamptoolConfig.c:353): unknown psLib error
           Can't configure database
       -> main (pstamptool.c:80): (null)
           failed to configure
      Unable to perform /home/panstarrs/ipptest/psconfig/ipp-pv3-20140717.lin64/bin/pstamptool -updatereq -req_id 405851 -state stop -fault 5 -dbname ippRequestServer error code: 768 at /home/panstarrs/ipptest/psconfig/ipp-pv3-20140717.lin64/bin/request_finish.pl line 101.
      
    • looks like the cmd has left off the -dbserver ippc17
    • not only that, but if the cmd did work it would fault because there is no -set_XXXX. something is wrong with this error case then and -state,-fault need to be -set_state,-set_fault?
    • 16:35 MEH: fixed the stop/fault cmd and ran it for the broken ones -- seems to be moving again (problem looks to have happened ~4-6am)
      pstamptool -dbname ippRequestServer -dbserver ippc17 -updatereq -req_id 405851 -set_state stop -set_fault 5
      pstamptool -dbname ippRequestServer -dbserver ippc17 -updatereq -req_id 405852 -set_state stop -set_fault 5
      pstamptool -dbname ippRequestServer -dbserver ippc17 -updatereq -req_id 405853 -set_state stop -set_fault 5
      pstamptool -dbname ippRequestServer -dbserver ippc17 -updatereq -req_id 405854 -set_state stop -set_fault 5
      pstamptool -dbname ippRequestServer -dbserver ippc17 -updatereq -req_id 405855 -set_state stop -set_fault 5
      pstamptool -dbname ippRequestServer -dbserver ippc17 -updatereq -req_id 405856 -set_state stop -set_fault 5
      pstamptool -dbname ippRequestServer -dbserver ippc17 -updatereq -req_id 405857 -set_state stop -set_fault 5
      pstamptool -dbname ippRequestServer -dbserver ippc17 -updatereq -req_id 405858 -set_state stop -set_fault 5
      
    • does seem to suggest a code bug, but the process logistics here are unclear so..
  • 15:30 MEH: ifaps1 is still down from the weekend storms? -- does it need to be up?
    • 16:00 HAF: I haven't talked to Gavin about it, but I'd like it to be up..
  • 17:20 MEH: other PSS jobs have finished, leaving a handful in various states: two in fault 4 and four in state new
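
The eight manual pstamptool commands in the 16:35 entry differ only in req_id, so they could be generated with a loop. The following is a minimal dry-run sketch that only prints the commands and does not run them. The database name, server, flags and id range are copied from the log entry; build_cmd and print_fix_cmds are hypothetical helper names, not part of the IPP tools.

```shell
#!/bin/sh
# Dry-run sketch: print the corrected pstamptool command for each stuck
# request. Values below are copied from the 16:35 log entry.
DBNAME=ippRequestServer
DBSERVER=ippc17
FIRST_ID=405851
LAST_ID=405858

# build_cmd (hypothetical helper): print one command line for req_id $1.
build_cmd() {
    printf 'pstamptool -dbname %s -dbserver %s -updatereq -req_id %s -set_state stop -set_fault 5\n' \
        "$DBNAME" "$DBSERVER" "$1"
}

# print_fix_cmds (hypothetical helper): print one line per id in the range.
print_fix_cmds() {
    req_id=$FIRST_ID
    while [ "$req_id" -le "$LAST_ID" ]; do
        build_cmd "$req_id"
        req_id=$((req_id + 1))
    done
}

print_fix_cmds
```

Printing first lets the czar check the commands before anything touches the request database; the output could then be piped to sh from an account that has pstamptool on its path.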

Wednesday : 2014.08.13

  • 05:54 Bill: reverted registration fault for o6882g0284o XY04. It ran into a problem connecting to the database on first try. Science exposures are all registered now.
  • 07:00 MEH: some remaining warps also faulted on DB access; a revert cleared those as well.
  • 07:10 MEH: nightly finished, putting some datanode power into LAP stdlocal again --
  • 09:55 Bill: restarted pstamp pantasks. Made a change to dsreg to not fail if the size of a file does not match the value in the registration list. Just use the new value and remeasure the md5sum. When this happens (rarely) the file size differences are small. Integrated change into trunk, 20130712 tag (for distribution), and ipp-pv3-20140717 for pstamp.
    • 10:30 installed new version of request_finish.pl that looks up the dbserver in site.config if it is not supplied. Also fixed the arguments to the -updatereq command to -set_state and -set_fault from -state -fault
  • 13:38 Bill: pstamp set to stop while I clean out some stale data in /data/ippc30.1/pstamp/work
    • 14:22 pstamp set back to run. 483G now available in the working partition. Cleaned up:
      • work/server_status: the server status page, updated every couple of minutes
      • work/webreq: old request files created by the web interfaces. These are copied to the working directories, so they are not needed
      • work/2013 and 2014 prior to July. Cleanup is sometimes leaving files behind. Since I do rm -rf, errors are not detected. May want to reconsider
      • found another TB of dvo backups that Gene is going to delete.
  • 19:20 MEH: looks like LAP stdlocal is using nightly data nodes for processing -- these need to be turned off..
  • 20:25 MEH: looks like ipp047 has crashed.. nothing on console. caught before wedged things, leaving in repair
  • 21:40 MEH: ipps13 reports incorrect RAM, will try a reboot since MD processing is stalled for the moment.. -- reboot did not recover the missing memory; it had a memory issue in the past, so maybe the same issue?
  • 22:40 MEH: since MD stalled for the moment, adding the ippsXX compute nodes to LAP stdlocal and bumped up the poll
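
On the rm -rf note in the 13:38 cleanup entry: one possible way to notice leftover files, assuming the cleanup is a shell step, is to keep the exit status of rm and then check that the path is really gone. This is only a sketch; remove_and_verify is a hypothetical helper, not the existing pstamp cleanup code.

```shell
#!/bin/sh
# Sketch: remove a directory tree and report anything left behind.
# remove_and_verify is a hypothetical helper, not part of the IPP tools.
remove_and_verify() {
    target=$1
    # -f silences "no such file" errors, but rm still exits non-zero when
    # it cannot delete something, so the exit status is worth keeping.
    if ! rm -rf -- "$target"; then
        echo "rm failed for $target" >&2
        return 1
    fi
    # Second check: the path must not exist after the removal.
    if [ -e "$target" ]; then
        echo "leftover files under $target" >&2
        return 1
    fi
    return 0
}
```

A caller could run remove_and_verify on each old work directory and log the ones that return non-zero for manual follow-up.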

Thursday : 2014.08.14

  • 08:04 Bill: set quality fault for repeatedly failing diff with: difftool -updatediffskyfile -fault 0 -set_quality 14006 -diff_id 582996 -skycell_id skycell.0489.079
  • 08:45 EAM : ipp047 crashed (2nd time today, Gavin rebooted at 07:30). rebooting, but perhaps needs mobo work?
  • 17:06 HAF: set repair for all stsci nodes (for addstar purposes)
  • 18:35 EAM : cab1 PDU failed : I've put ipp008, ipp012, ipp014, ipp013, ipp016, ipp018, ipp019, ipp037 into neb off and removed from processing.
  • 21:02 HAF : summitcopy had tons of neb failures - stopped processing, stopped c01-c10 apache servers, restarted c01-c10 apache servers (for neb), seems to have cleared things up. (we are about 120 exposures behind right now)

Friday : 2014.08.15

  • 14:15 EAM : Haydn replaced the PDU for cab1. Two chips spent the night failing as they needed detrends on those machines. I've restarted everything and reverted the chip failures, and processing is now back up and running. It will be some time before everything is finished for tonight.
  • 14:50 EAM : ipp030 crashed again : leave out of processing.
  • 17:35 MEH: using ippsXX 4/12 cores for stack photometry
  • 20:15 HAF: problems with registration - it fell behind and I'm investigating. I find that the failures happen on ipp046, and it hangs on df (on ipp039). ipp039 has a crazy high load of 145, so I set it to 'repair' for now (to try to stop some of the crazy? hopefully?)
  • 22:30 EAM : a lot of trouble with processing due to high load on a few of the ipp0XX machines. these are the only ones with space, aside from the new big storage nodes (ipp067-071). Much of the output is going there so the processing hammers those nodes. I think the new nodes, with newer gen raid cards and 10g eth can handle the load better, so I have set the older ipp0XX machines to neb 'repair'.

Saturday : 2014.08.16

  • 07:05 EAM : ipp034 was down this morning; rebooted it (nothing on console).
  • 11:25 MEH: ipp040 isn't happy being mostly full and stalling processing -- repair and out of processing

Sunday : 2014.08.17