PS1 IPP Czar Logs for the week 2017.01.16 - 2017.01.22


Monday : 2017.01.16

  • MEH: MOPS test chunk running ipps, upper ippx, ippc nodes w/ data to ipp100+
  • MEH: setting nightly processing up for another night of QUB targeted followup -- no changes should be made to nightly processing

Tuesday : 2017.01.17

  • MEH: setting nightly processing up for another night of QUB targeted followup -- no changes should be made to nightly processing

Wednesday : 2017.01.18

  • MEH: QUB targeted followup needs ippx, ipps nodes as well as nightly nodes in AM and mid-afternoon -- finished
  • MEH: OSS WSdiff skipped in distribution, doing large number of WSdiff updates in normal ps_ud_QUB -- using ipps, ippx048-088

Thursday : 2017.01.19

  • 16:15 CZW: Started running trunk/tools/raw_MD5_check.pl on ippb04. This script uses 12 client runs to calculate on-disk MD5 sums for raw GPC1 OTA data in each nebulous sub-sub directory (e.g. /data/HOST.VOL/nebulous/aa/bb/FAKE_ota55.fits has its MD5 sum recorded in the file /data/HOST.VOL/nebulous/aa/bb.OTA_md5sums). This will add some load on this machine, and on the other b node machines once I start running it on them (ippb05 today, the others either later tonight or tomorrow). A quick check suggests total runtimes on the order of a week and a half.
  • 22:10 MEH: ipp121 kernel panic blocking nightly processing for a while now (90 exposures stuck in chip stage) and not checked on... neb-host repair and take out of any pantasks -- moving forward some now...
    • ipp121 appears to not be up any longer (data mount was accessible at some level before) -- will power cycle soon if no one claims doing so, otherwise stalled mount mess...
    • ipp121 back up (boots w/in 5 minutes so not doing ram check for the 260GB...) -- nightly processing is moving again
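
The per-directory checksum scan described in the 16:15 CZW entry can be sketched roughly as below. This is a minimal illustration only, not the actual raw_MD5_check.pl: the function names, the `*.fits` pattern, and the manifest line format (md5sum-style "digest, two spaces, filename") are all assumptions, and the 12-client parallelism is not modeled.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_manifest(subdir, pattern="*.fits", suffix=".OTA_md5sums"):
    """Checksum files matching `pattern` in `subdir` and write a sibling manifest.

    For a directory .../aa/bb the manifest is written to .../aa/bb.OTA_md5sums,
    mirroring the layout in the log entry. The line format is an assumption.
    """
    subdir = Path(subdir)
    manifest = subdir.parent / (subdir.name + suffix)
    sums = {
        p.name: md5_of_file(p)
        for p in sorted(subdir.glob(pattern))
        if p.is_file()
    }
    with open(manifest, "w") as out:
        for name, md5 in sums.items():
            out.write(f"{md5}  {name}\n")
    return manifest, sums
```

Calling `write_md5_manifest("/data/HOST.VOL/nebulous/aa/bb")` would, under these assumptions, produce `/data/HOST.VOL/nebulous/aa/bb.OTA_md5sums`; reading files in fixed-size chunks keeps memory use flat regardless of file size.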

Friday : 2017.01.20

  • MEH: special QUB needs ippx, ipps nodes as well as nightly nodes in AM and mid-afternoon
  • 15:15 CZW: Started more md5sum scans on ippb03, ippb05, ippb06. Should have little impact on other processing.
  • MEH: setting nightly processing up for another night of QUB targeted followup -- no changes should be made to nightly processing

Saturday : 2017.01.21

  • MEH: QUB targeted followup needs ippx, ipps nodes as well as nightly nodes in AM -- finished
  • MEH: setting nightly processing up for another night of QUB targeted followup -- no changes should be made to nightly processing
  • MEH: manually managing large number of QUB pstamp image updates

Sunday : 2017.01.22

  • MEH: IfA-wide AC failure -- powering down ipp003,001,022 as less critical than ipp002,ippops1,ippops2
    • last gpc1 dump was successful ~0000 last night, before ippdb machines went down
    • AC up but leaving machines off until morning and until winds lighten, in case AC crashes again
  • MEH: was QUB night, will restart pantasks now that things appear to be back online --
    • ipp032 also not booting -- normally neb-host repair, but must set neb-host down
    • ipp121 not booting -- must be set neb-host down until fixed -- hopefully won't bork summitcopy since power glitch interrupted download of o7775g0046f-o7775g0049f, but only flats
      • o7775g0046f having regular reg fault -- messed up redownload and ended up with duplicate newExp/Imfiles -- 1190291 original (dropped from gpc1), 1190457 replaced
      • o7775g0047f handful of files in missing (ipp121) and/or corrupted state but sufficient to replicate and repair
    • ippMonitor crashed when systems down, restart on ippc33
    • not sure if nebdiskd is running still or where it runs now (ippdb08 seems to be from latest czarlog entry?)
  • EAM: restarted nebdiskd on ippdb01 (location of nebulous mysql master server)
  • MEH: setting nightly processing up for another night of QUB targeted followup -- no changes should be made to nightly processing
  • MEH: Gene's data shuffle on stare04 started killing ipp088,104 (data nodes critical for priority QUB processing...) -- 140 remote_md5sum.pl jobs and rising -- so set to neb-host repair -- setting ~ipptest/replication stop -- cannot have datanode minefield for nightly processing...
  • MEH: ipp118, ipp120 appear to be having XFS issues -- neb-host repair until can look into... -- regular faults then started clearing
    • looks like some files on ipp120 are blocking registration -- corrupted gpc1/20170123/o7776g0076o/o7776g0076o.ota55.burn.tbl, manually regenerated with fixburntool