PS1 IPP Czar Logs for the week 2012.06.25 - 2012.07.01

Monday : 2012.06.25

Mark was Czar

  • No data last night; postage stamp server busy with MD04 chip bundles and other requests.
  • Finally able to rebuild the corrupted MD06.GR0 nightly stack (dest_id=1628823). -startover only seems to work with cleaned runs, so using it with dsreg --product ps1-md-GR --del stack.609903.1628823.53 didn't work. Had to trick it into reverting by setting fault states (kludge).

Tuesday : 2012.06.26

  • Mark: PSS has a parse problem with recent local IfA user requests; it doesn't seem to like ROWNUM 0? Talked with the user; will fix and resubmit jobs. Need to clear/drop the faulted ones since they are not clearing automatically, or trace out why error info is not being sent to the datastore.
    • User noted the datastore cannot be accessed from the IfA-secure wifi network (403), while open wifi (with password) can connect. Is this something we want to set up/allow? -- Heather talked with Gavin; ifa-secured should be allowed to reach the datastore now.
    • PSS/camtool also doesn't like an MJD_MAX far into the future (4538377-10-14T23:59:26Z). Again no error is flagged to the datastore, and the job seems stuck.
  • 13:45 Serge: Stopping czartool while I make the update for the stsci nodes.
  • Mark: PSS mostly stuck again. Put all 50 or so previous jobs (ifa.2012xxxx) into hold state and restarted pstamp and update. Other jobs are moving through again.
    • Once finished, slowly adding back the previously stalled ifa.2012xxxx jobs; the ones that previously failed to parse are now clearing as they should.
    • Later added the other ifa.2012xxxx jobs with image updates that had stalled; these are also moving through slowly.
    • There seem to be a large number of chip updates that fail repeatedly. The PPIMAGE.PATTERN config problem with the Feb. 2010 data seems common, along with some misc. errors:
      -- PPIMAGE.PATTERN - gpc1/MD04.20100208/o5213g0186o.123508/o5213g0186o.123508.ch.55151.XY42.log.update
      
      -- missing/corrupt burntbl - gpc1/ThreePi.nt/2010/05/21/o5337g0178o.170550/o5337g0178o.170550.ch.96218.XY61.log.update
      neb://ipp050.0/gpc1/20100521/o5337g0178o/o5337g0178o.ota61.burn.tbl
      
      -- missing psphot file and cannot regenerate? gpc1/ThreePi.nt/2011/07/16/o5758g0048o.362703/o5758g0048o.362703.ch.257334.XY23.log.update
      neb://ipp015.0/gpc1/ThreePi.nt/2011/07/16//o5758g0048o.362703/o5758g0048o.362703.ch.257334.XY23.cmf
      
      -- not finding mdl file?  gpc1/ThreePi.nt/2010/02/27/o5254g0598o.142669/o5254g0598o.142669.ch.67727.XY54.log.update
      neb://ipp049.0/gpc1/ThreePi.nt/2010/02/27/o5254g0598o.142669/o5254g0598o.142669.ch.67727.XY54.mdl.fits
      
      -- magictest.md20 also being updated? Not sure what it is, but it doesn't seem like something we would want updated.
      gpc1/magictest.md20.20100205/o5217g0208o.125162/o5217g0208o.125162.ch.54932.XY31.log.update
      
    • Maybe the problem is related to last week's original update stall and needs to be cleaned up; not sure of the proper way to do that for PSS-related runs.
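
A minimal sketch (not PSS code) of the kind of sanity check that could flag a runaway MJD_MAX like the one noted above before a request is queued. The function names and the latest_year cutoff are hypothetical, and it assumes MJD_MAX arrives as a plain number of days; how PSS/camtool actually parses the field is not known here.

```python
from datetime import datetime, timedelta, timezone

# MJD 0 corresponds to 1858-11-17 00:00 UTC.
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)


def mjd_to_datetime(mjd):
    """Convert an MJD to a UTC datetime, or return None if it cannot be represented."""
    try:
        return MJD_EPOCH + timedelta(days=mjd)
    except (OverflowError, ValueError):
        # Raised for values outside datetime's year 1..9999 range or for NaN.
        return None


def mjd_max_is_sane(mjd_max, latest_year=2100):
    """Hypothetical check: accept only dates that are representable and not past latest_year."""
    when = mjd_to_datetime(mjd_max)
    return when is not None and when.year <= latest_year
```

For example, mjd_max_is_sane(56104) accepts an MJD from this week, while a value on the order of 1.6e9 days (roughly the year shown in the stuck request) is rejected rather than passed along.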

Wednesday : 2012.06.27

Thursday : 2012.06.28

  • 10:40 Serge: Restarted replication, which seems to have crashed.

Friday : 2012.06.29

Saturday : 2012.06.30

  • 08:30 Mark: PSS went down around 0330 (nothing really in the log). Restarted and seems okay.
  • 12:00 Looks like ippc05 was down for ~30 mins. Console display also looks stalled, with a partial line, so power-cycling.
    [53486671.703816] BUG: spinlock lockup on CPU#4, ntpd/29947, ffff8800280a7180
    
  • 15:45 Oh yeah, probably needed to restart the stack+deepstack pantasks as well.

Sunday : 2012.07.01

  • 07:40 Serge: Cleaning /export/ipp001.0