PS1 IPP Czar Logs for the week 2016.04.25 - 2016.05.01

Monday : 2016.04.25

  • 23:50 MEH: another warp fault 4 that would have stalled until manually cleared -- cannot build growth curve (psf model is invalid everywhere)
    warptool -dbname gpc1 -updateskyfile -set_quality 42 -skycell_id skycell.0392.066 -warp_id 1721240 -fault 0

Tuesday : 2016.04.26

  • 13:40 EAM : ipp041 fell over, rebooting now.
  • 14:20 CZW : starting the third batch of wave 3 rsync processes, concurrent with the still-running second batch. This set is ipp043, 44, 45, 46, 47. Delays are built into the jobs to stagger the impact.
  • 20:15 EAM : stopping and restarting pantasks
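
The staggering mentioned in the 14:20 entry can be sketched as follows. This is a minimal illustration, not the script actually used: the function name `stagger_delays` and the 600-second step are assumptions, and only the host list comes from the log.

```python
from typing import Dict, List


def stagger_delays(hosts: List[str], step_seconds: int) -> Dict[str, int]:
    """Assign each host a start delay so the jobs do not all begin at once.

    The first host starts immediately; each later host waits one more step.
    """
    if step_seconds < 0:
        raise ValueError("step_seconds must be non-negative")
    return {host: index * step_seconds for index, host in enumerate(hosts)}


# Hosts from the third batch of wave 3; the 600 s step is a hypothetical value.
batch3 = ["ipp043", "ipp044", "ipp045", "ipp046", "ipp047"]
delays = stagger_delays(batch3, step_seconds=600)
for host, delay in delays.items():
    print(f"{host}: start after {delay} s")
```

Each job would sleep for its assigned delay before launching its transfer, so the load ramps up gradually instead of hitting the network and disks all at once.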

Wednesday : 2016.04.27

  • 23:24 MEH: looks like registration crashed... restarting
    [2016-04-27 23:11:07] pantasks_server[6514]: segfault at 10f8ab8 ip 0000000000408a4e sp 00000000429b9f20 error 4 in pantasks_server[400000+16000]

Thursday : 2016.04.28

  • 07:40 MEH: Serge/MOPS requested manual diffims because visit 4 had a cam quality fault, so doing v2-3
    difftool -dbname gpc1 -definewarpwarp -exp_id 1084724 -template_exp_id 1084735 -backwards -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2016/04/28 -set_dist_group SweetSpot -set_label OSS.nightlyscience -set_data_group OSS.20160428.extra -set_reduction SWEETSPOT -simple -rerun 
    difftool -dbname gpc1 -definewarpwarp -exp_id 1084812 -template_exp_id 1084830 -backwards -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2016/04/28 -set_dist_group SweetSpot -set_label OSS.nightlyscience -set_data_group OSS.20160428.extra -set_reduction SWEETSPOT -simple -rerun
  • 08:20 MEH: the .multi diffims are still being made, sending a large number to cleanup...
  • 08:30 MEH: doing cleanup of ps_ud% to clear the cycling faulting chips and warps...
  • 11:48 MEH: manually updating local ~ipp/psconfig/ipp-20141024.lin64/share/pantasks/modules/ipphosts.mhpcc.config to use .20160428avoidfullv2 for data targeting, to reduce the number of ipp1xx nodes used and to avoid some full disks, reducing random allocations
    • also reducing s6 group use in stdscience 6x->3x to reduce load there and since stdsci is still overpowered
    • restarting nightly pantasks, as normally required, to make use of these changes
  • 16:30 EAM: stopping pantasks to do the pantasks host re-organization (moving pantaskses from ippc01-09 to ippc20-25). I will also double-check the pantasks client loading assignments to avoid overloading c20-25.
    • 16:55 EAM: pantaskses are now running on c20-c25. I've removed all processing tasks from these machines.
  • 16:40 CZW: Running nebulous restore on rsynced data from ipp033. This is running on ipp100, as it must run directly on the host containing the data. It is incredibly lightweight (doing link() and sql updates), but will likely be running for a few hours.
  • 17:40 EAM: I set up nebulous on the ippc21-c36 machines and restarted their apache servers. I unfortunately forgot to change the number of connection threads, and they blocked the mysql database. I shut them all down again. I removed nebulous from ippc30 (pstamp) and ippc32 (dvodist) and restarted just those two. I will fix this tomorrow morning.
  • 20:50 EAM: one dark got confused and was left in a funny state. I manually set it to 'stop', then needed to manually advance it:
    pztool -dbname gpc1 -advance -summit_id 1080680 -exp_name o7507g0003d -inst gpc1 -telescope ps1 -end_stage reg -workdir neb://@HOST@.0/gpc1/20160429
  • 23:01 MEH: manually reverting a fault 4 warp memory error that would have stalled o7507g0248o until morning

Friday : 2016.04.29

  • 12:40 EAM : I set up the apache / nebulous interfaces for ippc20-ippc28. I have installed the nebulous perl modules (nebserver and apache) and /etc/apache2 elements for ippc20-ippc36, but I have only activated the first 9 machines (the others have the needed module in /etc/apache2/modules.d with .with.nebulous appended -- to activate apache, just move the active one out of the way and replace it with this one, then restart apache). We will use 7 of these and the rest can be spares. It was also necessary to modify /etc/apache2/modules.d/00_mpm.conf to avoid overloading the nebulous mysql machine, as happened yesterday afternoon.
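
The activation step described above (move the active module file out of the way, replace it with the `.with.nebulous` copy) could be scripted roughly as below. This is a hedged sketch, not the procedure actually used: the function name `activate_variant`, the `.disabled` backup suffix, and the file name `example.conf` are assumptions, since the log does not name the module file. Restarting apache is left as a manual step.

```python
import shutil
import tempfile
from pathlib import Path


def activate_variant(active: Path,
                     variant_suffix: str = ".with.nebulous",
                     backup_suffix: str = ".disabled") -> Path:
    """Back up the active config file and copy the variant into its place.

    Returns the path of the backup. Both suffixes are illustrative defaults.
    """
    variant = active.with_name(active.name + variant_suffix)
    backup = active.with_name(active.name + backup_suffix)
    if not active.exists():
        raise FileNotFoundError(f"no active file: {active}")
    if not variant.exists():
        raise FileNotFoundError(f"no variant file: {variant}")
    if backup.exists():
        raise FileExistsError(f"backup already present: {backup}")
    active.rename(backup)          # move the active one out of the way
    shutil.copy2(variant, active)  # replace it with the nebulous variant
    return backup


# Demonstration in a scratch directory; nothing under /etc is touched.
with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "example.conf"  # hypothetical module file name
    conf.write_text("plain\n")
    (Path(tmp) / "example.conf.with.nebulous").write_text("nebulous\n")
    backup = activate_variant(conf)
    print(conf.read_text().strip(), "| backup:", backup.name)
```

Copying rather than renaming the variant keeps the `.with.nebulous` file in place, so the same swap can be repeated on a spare machine later.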

Saturday : 2016.04.30

Sunday : 2016.05.01

  • 15:00 MEH: ippdb03 down since ~0730 this morning, tried a couple of power cycles without success (not sure how long it takes to reach the boot screen with the RAM it has, but gave it ~20-30 minutes). Looks like it crashed in the middle of the gpc1 db dump this morning. With ippdb03 down, ippmonitor is down, but it looks like ippc18 ippmonitor is still running okay.