Monday : 2017.05.01

  • MEH: ipp081 seems stable when idle; neb-host set from down->repair until remote power access is configured
  • MEH: ippx041-x044 online and connected to network for heavy use and power measurement -- will remain up so Rob can use them after the test as well

  • Czar will need to shut down mysql on ippdb05 for Haydn to replace BBU (TBD)
  • Czar will probably need to shut down mysql on ippdb06 while Haydn works on raid (TBD)
  • 18:30 CZW: Gene mentioned that the nebulous replication was back online, and that with the telescope down for the night, it was safe to restart the retired host reinsert process. Running as ipptest in screen session on stare02. Control-C will stop it.

Wednesday : 2017.05.03

  • 16:00 CZW: Things should be back online, although the database replicants are not yet cleared. I am restarting retire check scans, which are read-only nebulous operations.

Thursday : 2017.05.04

  • 09:20 EAM : pub ID 1098211 is repeatedly failing. I'm running it manually to debug; meanwhile I've turned off the publish task to avoid collisions.
  • 10:50 EAM : the above failure was due to a bad diff skycell: the cmf output files for skycell.0600.027 had NaN values for FPA.RA and other critical keywords. This caused a segfault in ppMops. I've set the quality to 42 for the problem skycell, but for future debugging purposes, I've saved the input file for ppMops as ~ippitc/publish.1098211.mops.neg.4zxQ and the cmf.inv files referenced therein in ~ippitc/publish.1098211.mops.neg.files. To debug, run the following command (fixing the input list):
    ppMops input.list /data/ipp117.1/nebulous/65/80/9786860543.gpc1:IPP-MOPS-TEST:IPP-MOPS-TEST.1098211.neg.mops -exp_name o7877g0159o -exp_id 1238270 -chip_id 1931003 -cam_id 1898513 -fake_id 1870452 -warp_id 1877068 -diff_id 1529439 -camera GPC1 -inverse -zp 28.4612312844384 -zp_error 0.148091 -astrom_rms 0.275213030305616 -comment "OSSR.R12S5.7.Q.i ps1_23_1501 visit 1" -obsmode "OSS" -difftype WW -sky 303.222 -shutoutc "2017-05-04T07:35:45"
  • 15:15 EAM : moved nebdiskd to be run by the ipp user, not the ippitc user
    • the above move revealed a number of issues:
      • ipp071 and ipp022 were not mounting home directories from ippc18 (now fixed by Gavin)
      • ippb06 is missing (did it move to ATRC?)
      • ippb09 is not correctly added to nebulous (it was not previously cross-mounted on the cluster, but this is now addressed)
      • ippb15.1 does not have nebulous subdirs
  • 15:45 CZW: I have set the nebulous xattr=5 for the retired volumes. This removes them from scanning and marks them as permanently gone. It should have no effect on the scans and reinsertion process.
  • 16:10 CZW: ippb09 is now added to nebulous. ippb15.1 likely did not accept data due to an incorrect ownership of the nebulous directory (ipp.users instead of apache.nebulous).
  • 16:25 CZW: For future reference, this query generates the DELETE statements that clear nebulous xattr=5 entries from the mountedvol table:
    select CONCAT("DELETE FROM mountedvol WHERE mountpoint = '",mountedvol.mountpoint,"';") from volume JOIN mountedvol USING (vol_id) WHERE volume.xattr = 5;
  • 17:25 CZW: Corresponding query for the hosts table in czardb:
    select CONCAT("DELETE FROM hosts WHERE host = '",host,"';") from hosts WHERE total < 1 AND host != '';
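
  Note on the two queries above: both emit DELETE statements as text, to be reviewed and then pasted back into mysql. The same select-then-delete pattern can be sketched in Python with a dry-run default. This is only an illustration, not the procedure used here: it runs against the stdlib sqlite3 module as a stand-in for mysql, the function names are hypothetical, and only the table and column names (volume, mountedvol, vol_id, mountpoint, xattr) come from the query above.

```python
import sqlite3


def xattr5_mountpoints(conn):
    """List mountedvol mountpoints whose volume row has xattr = 5."""
    rows = conn.execute(
        "SELECT mountedvol.mountpoint "
        "FROM volume JOIN mountedvol USING (vol_id) "
        "WHERE volume.xattr = 5 "
        "ORDER BY mountedvol.mountpoint"
    ).fetchall()
    return [row[0] for row in rows]


def clear_mountpoints(conn, mountpoints, dry_run=True):
    """Delete the given mountpoints from mountedvol; return rows deleted.

    With dry_run=True (the default) nothing is deleted and 0 is returned,
    so the target list can be reviewed first.
    """
    if dry_run:
        return 0
    deleted = 0
    with conn:  # commit on success, roll back if any DELETE fails
        for mountpoint in mountpoints:
            # '?' is the sqlite3 placeholder; MySQL drivers typically use '%s'.
            cur = conn.execute(
                "DELETE FROM mountedvol WHERE mountpoint = ?", (mountpoint,)
            )
            deleted += cur.rowcount
    return deleted
```

  Binding the mountpoint as a parameter avoids the manual quoting that the CONCAT approach relies on, and the dry-run default keeps the review step that copy-and-paste gives for free.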

Friday : 2017.05.05

  • 17:15 CZW: Restarting retire database checks, as ipp118-ipp122 are back online.

