PS1 IPP Czar Logs for the week 2016.11.28 - 2016.12.04


Monday : 2016.11.28

  • MEH:
    • ipp118-121 not reported in ganglia --
      • ipp118,119 gmond not started -- reporting now
      • ipp120,121 /etc/ganglia/gmond.conf set up for ippc18 instead of ippc19 again -- correct and start gmond
    • Haydn needs to work with ipp121 to troubleshoot 10G connection on ipp122 -- ipp121 not available until you hear otherwise
    • ipp118-122 not in ~ipp/.Consolerc -- also adding ipp105-117 since they are not there either
      • ipp120 oddly logs into root account

Tuesday : 2016.11.29

  • 15:30 CZW: Merging and installing new nightly_science.pl/nightly_science.config files into the ipp-20141024 IPP user build. This should enable the updated diff queue management, and reduce the amount of manual effort needed for this.
  • 15:37 CZW: Restarting IPP pantasks. This isn't required to pick up changes, just a daily restart.

Wednesday : 2016.11.30

  • 12:15 CZW: Pulling ipp083 so Haydn can replace the RAID battery.
  • 12:30 CZW: I'm tired of looking at c7696g0016f failing repeatedly in registration. It is a bad exposure with no useful header information, which causes registration to fail the imfiles with fault=5. Because no exposure-level entry is ever inserted, it will keep failing until fixed. I'm going to insert minimal information in the hope that this stops the constant failures. Commands used:
     [...]
     regtool -dbname gpc1 -updateprocessedimfile -exp_id 1160472 -fault 0 -set_ignored -class_id ota76
     regtool -dbname gpc1 -addprocessedexp -exp_id 1160472 -exp_name c7696g0016f -exp_tag c7696g0016f.1160472 -end_stage reg -state full -telescope UNKNOWN -inst UNKNOWN -filelevel CHIP -exp_type DOMEFLAT -obs_mode ENGINEERING
    
  • 15:50 CZW: Restarting pantasks.

Thursday : 2016.12.01

  • 11:38 HAF: No network from 2:25am to about 11:00am. Pantasks in a funky state, restarting.
  • 12:06 CZW: The new MRTCB nebulous nodes should be fine to accept data. Setting to up.
  • 12:10 CZW: I'm increasing the loading of the ipptest/replication pantasks to speed up the MRTCB->ITC raw shuffle. This increase will use 6 times loading of the x2 and x3 nodes. The notes for this process are listed on the wiki here. It would be helpful if czars can check on this pantasks every day or so, and try to kill hung jobs. The standard execution time is about 400-600s, so jobs with times in the 6000+s range are probably having mount issues. The wiki page discusses the ways I've been dealing with this (kill the jobs, restart nfs if needed to clear bad mounts). Although it's best to have things complete, failed jobs can be easily retried.
  • 15:20 CZW: I began getting concerned that the czartool page hadn't noted that those new nebulous volumes were up. Restarted czartool and roboczar scripts.
  • MEH: large set of QUB/MOPS warp updates started and targeting ipp100+, running ~ippmops/stdscience using ipps nodes (for now) -- adding c2 normally used for QUB WS
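
A minimal sketch of the check described in the 12:10 CZW note above: flag replication jobs whose elapsed time is far beyond the standard 400-600s. The record layout (job id, host, elapsed seconds), the function name, and the sample values are all hypothetical, made up for illustration; this is not pantasks output or an IPP tool.

```python
# Hypothetical helper: the job records below are made-up examples,
# not real pantasks output. Thresholds come from the 12:10 log note.
TYPICAL_MAX_S = 600       # upper end of the standard 400-600 s execution time
HUNG_THRESHOLD_S = 6000   # jobs at or beyond this probably have mount issues


def find_hung_jobs(jobs, threshold_s=HUNG_THRESHOLD_S):
    """Return the (job_id, host, elapsed_s) records at or over the threshold,
    longest-running first."""
    hung = [job for job in jobs if job[2] >= threshold_s]
    return sorted(hung, key=lambda job: job[2], reverse=True)


if __name__ == "__main__":
    # Assumed sample data: (job_id, host, elapsed seconds).
    sample = [
        (101, "hostA", 450),
        (102, "hostB", 6500),
        (103, "hostC", 580),
        (104, "hostD", 7200),
    ]
    for job_id, host, elapsed in find_hung_jobs(sample):
        print(f"job {job_id} on {host}: {elapsed}s, candidate to kill and retry")
```

Jobs flagged this way are only candidates: per the note, the follow-up is to kill them (restarting nfs if needed to clear bad mounts) and let them be retried.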

Friday : 2016.12.02

  • MEH: large set of QUB/MOPS warp updates started and targeting ipp100+, running ~ippmops/stdscience using ipps nodes (for now) -- adding c2 normally used for QUB WS and stacks -- if pantasks is stopped, you must notify me directly, as it can block MOPS+QUB stamps

Saturday : 2016.12.03

Sunday : 2016.12.04