PS1 IPP Czar Logs for the week YYYY.MM.DD - YYYY.MM.DD

Monday : YYYY.MM.DD

  • 11:25 CZW: started pantasks server on j001b09 to do raw data shuffle to backup nodes.

Tuesday : 2014.11.11

  • 5:00 HAF: registration is stuck on o6972g0517o -- usual bag o' tricks didn't work. I ended up setting a quality of 42 on that stupid burntool: update rawImfile set quality = 42, fault = 0, state = 'full' where exp_name = 'o6972g0517o' and class_id = 'XY14'; but not before spending an hour trying to figure out what was going on...
  • 6:20 HAF: registration is still stuck on the same exp, after moving on.. I redid the update since it seems to have been reverted... registration is moving again but I don't trust it
  • 6:24 Chris investigated: regtool -updateprocessedimfile -burntool_state -14 -exp_id 816417 -class_id XY14 -set_ignored -set_state full is the proper way to tell the database to ignore an xy
  • 12:00 MEH: mops wondering where data is, stdsci looks to have been down since ~10am? did someone shut it down?
    • restarting .. -- lanl stdlocal off until nightly finished (wasn't doing anything anyway).. putting ippsXX nodes into stdsci to finish up asap..
  • 12:40 MEH: ipp035 unresponsive.. neb-host down until back up, out of processing --
  • 12:50 MEH: ipp035 up, putting back into repair -- ipp029 now unresponsive.. seriously..
    • ipp029 from neb-host up to down -- out of processing
    • ipp029 up (kernel panic), putting neb-host into repair (normally up)
  • 13:40 MEH: when nightly finished, ippsXX off in stdsci and stdlocal for MD/CNP processing, stdlocal on again and update labels back into stdsci
  • 15:30 EAM : stdlocal is going slowly, so i'm stopping stacks in prep for restarting
    • good, was in between runs; had set the ippsXX off in stdlocal.. setting back to off again
  • 17:30 HAF : restarted registration, it had crashed (? i assumed?)
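
The regtool call quoted in the 6:24 entry can be sketched as a small command builder, so the exposure and chip can be swapped without retyping the flags. This is only a dry-run sketch: the helper name regtool_ignore_command is hypothetical, the flags are copied verbatim from the entry, and whether -14 is the right burntool_state for other cases is an assumption (it is simply the value used above). The snippet prints the command and does not run it.

```python
import shlex


def regtool_ignore_command(exp_id, class_id, burntool_state=-14):
    """Build (but do not run) the regtool call quoted in the 6:24 entry.

    The flags are taken verbatim from the log. The default burntool_state
    of -14 is just the value used there; treat it as an assumption for
    any other exposure.
    """
    return [
        "regtool", "-updateprocessedimfile",
        "-burntool_state", str(burntool_state),
        "-exp_id", str(exp_id),
        "-class_id", class_id,
        "-set_ignored",
        "-set_state", "full",
    ]


# Dry run: print the command line for the exposure and chip from the log.
cmd = regtool_ignore_command(816417, "XY14")
print(shlex.join(cmd))
```

Printing first and pasting by hand keeps a typo in exp_id or class_id from touching the database.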

Wednesday : 2014.11.12

  • 06:40 EAM : I am stopping stdlocal for now in anticipation of launching relphot on the compute nodes. Also note that nightly science is proceeding slowly - is this the db again?
    • 10:32 HAF : in answer to Gene's question - I noticed that we were slow to download images last night - we did catch up, and probably without errors (?). I had noticed timeouts on Tuesday, so I wonder if that's related (i.e., db is slow?)
  • 07:20 EAM : lots of warps still to be done -- I've added compute nodes c2 to stdscience to up the throughput.
  • 10:00 EAM : launched relphot on compute nodes ippc33 - ippc63.
  • 13:00 EAM : my relphot analysis crashed, so I'm turning stdlocal back on. i'm going to do some tests, and maybe run relphot again soon, so i'm keeping stack off for quicker response.
  • 15:10 MEH: ippsXX on in stdlocal until system restart later tonight
  • 15:20 CZW: restarted stdlanl to attempt to clear out the backlog of fakeRuns. Mustang DST has been extended overnight until Thursday, so other than preparing as much as possible for execution when it returns to active use, there's not much for stdlanl to do.
  • 20:20 EAM : The cooling work was aborted; Haydn brought the cluster up, but the following machines are still not on line:
    • ipp052, ipp068, ipp071, stsci03, stsci09

I have set these machines to 'down' in nebulous and am starting processing. We may have trouble until they are available (eg, missing detrends)

  • 21:06 EAM : ipp068, ipp071, and stsci09 all came up; just ipp052 and stsci03 still unavailable
  • 22:15 MEH: ippsXX machines in use by ~mhuber/IPP/local_deepstack/ptolemy.rc -- data shouldn't be needed on ipp052 or stsci03

Thursday : 2014.11.13

  • 01:40 MEH: ippsXX available for stdlocal again for next day or so -- seems lanl stdlocal is down so just leaving ippsXX idle
  • 09:25 MEH: looks like stdlocal is running now w/o ippsXX machines so turning them on
  • 15:45 CZW: Mustang DST finished at 5pm their time. Restarted stdlanl remote poll/exec jobs. Came back later to see that they had all failed. Mustang DST has broken their firewall rules, preventing file transfers over ssh. I've emailed everyone appropriate to try to get this fixed.
  • 16:45 EAM: ipp052 is back on line, but stsci03 is going to be off for a few more days -- power supply problems.
  • 23:30 EAM: the relphot analysis is done on the c2 nodes for tonight so I am adding them back to stdlocal. weather looks bad -- i've set stack poll low (50) and am leaving the storage nodes in stdlocal. if data comes along, the tasks should keep stdlocal from getting in the way.

Friday : 2014.11.14

  • 14:30 CZW: started ipplanl/pv3shuffle pantasks to shift PV3 warps from the main storage nodes to stsciXX.

Saturday : 2014.11.15

  • 16:55 EAM : a lot of errors from nebulous troubles. it turns out the /tmp/nebulous_server.log filled the /tmp disk on ippc02. I fixed it and the errors stopped
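
A quick check along these lines might catch the full-/tmp condition before nebulous starts throwing errors. It is a minimal sketch, not what was actually run on ippc02: the function name disk_report and the 90% warning threshold are assumptions, and only the two paths come from the entry above.

```python
import os
import shutil


def disk_report(mount="/tmp", log_path="/tmp/nebulous_server.log",
                warn_fraction=0.9):
    """Report how full a mount is and how large one log file has grown.

    warn_fraction (90% here) is an arbitrary illustrative threshold,
    not a site policy.
    """
    usage = shutil.disk_usage(mount)
    used_fraction = usage.used / usage.total if usage.total else 0.0
    # A missing log is reported as 0 bytes rather than raising.
    log_bytes = os.path.getsize(log_path) if os.path.exists(log_path) else 0
    return {
        "used_fraction": used_fraction,
        "log_bytes": log_bytes,
        "warn": used_fraction >= warn_fraction,
    }


if __name__ == "__main__":
    print(disk_report())
```

Run from cron or by hand on the nebulous server host; a "warn" of True together with a large log_bytes would point at the same failure seen here.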

Sunday : 2014.11.16

  • 05:30 EAM : ipp035 crashed in the night, i'm rebooting. no messages on console
  • 23:55 MEH: ippsXX off in stdlocal