PS1 IPP Czar Logs for the week 2016.05.16 - 2016.05.22

Monday : 2016.05.16

  • 7:24 am Rob noticed the following exposures had problems:
    Exposure Status Comment
    o7524g0105o FAIL (Diff stage) OSSR.R12N6.14.Q.r ps1_30_4430 visit 1
    o7524g0126o FAIL (Diff1 stage) OSSR.R12N6.14.Q.r ps1_30_4430 visit 2
    
    Exposure Status Comment
    o7524g0392o FAIL (Diff stage) OSSR.R19S3.14.Q.w ps1_30_3423 visit 3
    o7524g0393o FAIL (Diff stage) OSSR.R19S3.14.Q.w ps1_30_3436 visit 3
    o7524g0410o FAIL (Cam (Bad Quality) stage) OSSR.R19S3.14.Q.w ps1_30_3423 visit 4
    o7524g0411o FAIL (Diff stage) OSSR.R19S3.14.Q.w ps1_30_3436 visit 4
    
    Exposure Status Comment
    o7524g0225o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4187 visit 3
    o7524g0226o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4353 visit 3
    o7524g0227o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4376 visit 3
    o7524g0228o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4186 visit 3
    o7524g0229o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4352 visit 3
    o7524g0230o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4176 visit 3
    o7524g0231o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4329 visit 3
    o7524g0232o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4350 visit 3
    o7524g0233o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4351 visit 3
    o7524g0234o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4185 visit 3
    o7524g0235o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4327 visit 3
    o7524g0236o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4174 visit 3
    o7524g0237o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4326 visit 3
    o7524g0238o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4165 visit 3
    o7524g0239o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4307 visit 3
    o7524g0240o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4328 visit 3
    o7524g0241o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4175 visit 3
    o7524g0242o FAIL (Diff stage) OSSR.R15N4.14.Q.r ps1_30_4187 visit 4
    
  • 9:00am Serge says "I took the chainsaw and took care of them (marked one warp and two diffs as invalid)."
  • 10:40am HAF queued the following diff (o7524g0393o - o7524g0411o) for Serge, and also investigated/checked and set quality 42 for a few diffs:
    difftool -dbname gpc1 -definewarpwarp -exp_id 1093036 -template_exp_id 1093055 -backwards 
    -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2016/05/16 -set_dist_group SweetSpot 
    -set_label OSS.nightlyscience -set_data_group OSS.20160516 -set_reduction SWEETSPOT -simple -rerun
    
    the following diff_ids were set to quality of 42
    
    1394181
    1394189
    1394230
    1394606
    
  • 13:20 MEH: updating labels in ippMonitor/czartool
  • 19:47 MEH: summitcopy is having regular faults that seem to be related to ipp101 -- looks like it was brought up from repair today -- this also looks like it was happening to the afternoon darks and should have been dealt with then -- reminder, the afternoon darks provide a handy test for any changes to the data system, but only if they are checked... -- putting ipp101 back into repair to see if the faults clear...
    start copying file into nebulous  Mon May 16 17:21:49 HST 2016
    *** stderr ***
    Use of uninitialized value in string eq at /usr/lib64/perl5/5.8.8/File/Copy.pm line 76.
    Use of uninitialized value in stat at /usr/lib64/perl5/5.8.8/File/Copy.pm line 87.
    Use of uninitialized value in pattern match (m//) at /usr/lib64/perl5/5.8.8/File/Copy.pm line 131.
    Use of uninitialized value in concatenation (.) or string at /usr/lib64/perl5/5.8.8/File/Copy.pm line 133.
    2016/05/16 17:22:50 | ipp094 | FATAL | Nebulous::Client::replicate - can not copy instance file:///data/ipp101.0/nebulous/db/6f/8713342108.gpc1:20160517:o7525g0051d:o7525g0051d.ota22.fits
    can not copy instance file:///data/ipp101.0/nebulous/db/6f/8713342108.gpc1:20160517:o7525g0051d:o7525g0051d.ota22.fits at /usr/lib64/perl5/vendor_perl/5.8.8/Log/Log4perl/Logger.pm line 896
    Unable to perform dsget: 255 at /home/panstarrs/ipp/psconfig/ipp-20141024.lin64/bin/summit_copy.pl line 246.
    
    failure for: summit_copy.pl --uri http://conductor.ifa.hawaii.edu/ds/gpc1/o7525g0051d/o7525g0051d22.fits --filename neb://ipp060.0/gpc1/20160517/o7525g0051d/o7525g0051d.ota22.fits --summit_id 1088982 --exp_name o7525g0051d --inst gpc1 --telescope ps1 --class chip --class_id ota22 --bytes 49432320 --md5 80e639f1d9c99325dd84fbaeb8b6b631 --dbname gpc1 --timeout 600 --verbose --copies 2 --compress --nebulous
    job exit status: 255
    job host: ipp094
    job dtime: 110.888797
    job exit date: Mon May 16 17:22:51 2016
    
    .....
    
    start copying file into nebulous  Mon May 16 19:47:55 HST 2016
    *** stderr ***
    Use of uninitialized value in string eq at /usr/lib64/perl5/5.8.8/File/Copy.pm line 76.
    Use of uninitialized value in stat at /usr/lib64/perl5/5.8.8/File/Copy.pm line 87.
    Use of uninitialized value in pattern match (m//) at /usr/lib64/perl5/5.8.8/File/Copy.pm line 131.
    Use of uninitialized value in concatenation (.) or string at /usr/lib64/perl5/5.8.8/File/Copy.pm line 133.
    2016/05/16 19:48:55 | ipp054 | FATAL | Nebulous::Client::replicate - can not copy instance file:///data/ipp101.0/nebulous/0a/f8/8713344299.gpc1:20160517:o7525g0062o:o7525g0062o.ota14.fits
    can not copy instance file:///data/ipp101.0/nebulous/0a/f8/8713344299.gpc1:20160517:o7525g0062o:o7525g0062o.ota14.fits at /usr/lib64/perl5/vendor_perl/5.8.8/Log/Log4perl/Logger.pm line 896
    Unable to perform dsget: 255 at /home/panstarrs/ipp/psconfig/ipp-20141024.lin64/bin/summit_copy.pl line 246.
    
    failure for: summit_copy.pl --uri http://conductor.ifa.hawaii.edu/ds/gpc1/o7525g0062o/o7525g0062o14.fits --filename neb://ipp054.0/gpc1/20160517/o7525g0062o/o7525g0062o.ota14.fits --summit_id 1088993 --exp_name o7525g0062o --inst gpc1 --telescope ps1 --class chip --class_id ota14 --bytes 49432320 --md5 a41b6db19f64e1980f199d17e180a754 --dbname gpc1 --timeout 600 --verbose --copies 2 --compress --nebulous
    job exit status: 255
    job host: ipp054
    job dtime: 110.480272
    job exit date: Mon May 16 19:48:56 2016
    
    • does help if ipp101 (and not ipp100) is put into repair.. don't have time to deal with this further..
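    A quick way to confirm whether the instance named in these faults is actually unreadable on the suspect node is to try it directly; a minimal sketch (the host and file path are copied from the FATAL message above, the check itself is only plain ssh/ls/md5sum and is not a recorded command):
    
    # hypothetical check: can the failing nebulous instance be read on ipp101?
    ssh ipp101 ls -l /data/ipp101.0/nebulous/0a/f8/8713344299.gpc1:20160517:o7525g0062o:o7525g0062o.ota14.fits
    ssh ipp101 md5sum /data/ipp101.0/nebulous/0a/f8/8713344299.gpc1:20160517:o7525g0062o:o7525g0062o.ota14.fits
    # a hang or I/O error here points at the ipp101 volume rather than the
    # summit_copy jobs, consistent with putting ipp101 back into repair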
  • 19:53 MEH: nightly processing needs its regular restart since it was not done over the weekend??

Tuesday : 2016.05.17

  • 00:49 MEH: registration pantasks segfaulted on ippc25.. restarting
    [2016-05-17 00:34:00] pantasks_server[14410]: segfault at ae3348 ip 0000000000408a4e sp 000000004242ef20 error 4 in pantasks_server[400000+16000]
    
  • 00:57 MEH: and while here... clearing fault 4, cannot build growth curve (psf model is invalid everywhere)
    warptool -dbname gpc1 -updateskyfile -set_quality 42 -skycell_id skycell.0396.071 -warp_id 1729388 -fault 0
    
  • 21:00 EAM : using ippx001, ippx002, ippx003, ippx004, ippx009, ippx010, ippx011, ippx012 for dvo data processing

Wednesday : 2016.05.18

  • 12:30 EAM : stopping gpc1 mysql replication slave on ippdb03 to make a copy to ippdb08.
  • 17:25 EAM : stopping and restarting pantasks
  • 17:30 EAM : restarted ippdb03 replication after rsync to ippdb08 completed
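    For reference, the slave-copy procedure amounts to: pause replication on ippdb03, rsync the datadir to ippdb08, then restart the slave. A rough sketch only -- the datadir path and whether mysqld was fully shut down for the copy are assumptions, not a record of the exact commands:
    
    # on ippdb03: pause replication so the data files stop changing
    mysql -e 'STOP SLAVE;'
    # copy the quiescent datadir to the new SSD host (path is illustrative)
    rsync -a /var/lib/mysql/ ippdb08:/var/lib/mysql/
    # on ippdb03: resume replication once the copy completes
    mysql -e 'START SLAVE;'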
  • 20:05 EAM : apparently I forgot to hit 'run' when I restarted the pantasks

Thursday : 2016.05.19

  • 09:45 EAM : ipp hardware updates:
    I have shut down ippc11, ipp034, ipp050, ipp052, and ipp053, based on Chris' report from yesterday.
    
    I have also copied the gpc1 replicated mysql from ippdb03 to ippdb08 and restarted replication.  
    At this point, gpc1, ippadmin, and czardb are running well under replication.
    
    It seems the isp replicated database was not actually up to date, and I was not confident of the 
    state of other dbs (they were not running under replication for a while).  I have made mysqldumps in 
    /export/ippdb05.0/mysqldumps for all of the following databases: gpc2, ps2_tc3, isp, ssp, uip.  
    We can re-ingest and restart replication for these databases or a subset as we desire.  
    I'm going to postpone that step until after I'm back from holidays.
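    Roughly, those dumps would have been of this form; a sketch, since the exact mysqldump options and connection details are not recorded here (only the target directory and database list above are from the log):
    
    # dump each not-yet-replicated database to the holding area on ippdb05
    for db in gpc2 ps2_tc3 isp ssp uip; do
        mysqldump --single-transaction --databases "$db" > /export/ippdb05.0/mysqldumps/${db}.sql
    done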
    
    This means we are finally done with the upgrade from spinning disks to 
    SSDs for the 2 critical database servers.  The only outstanding task in this area 
    is the goal of having a second SSD slave machine for each server.  
    

Friday : 2016.05.20

  • MEH: pstamp in blocking state for QUB stamps from OSS.20160517 -- somehow the warpRun entries got into the cleaned state while the Skyfiles are all in full with label ps_ud_QUB -- have to manually set the warpRun state back to full for the stamps to clear (see the sketch below)
    | 855512 | qub_ps_request_20160520_012110 | QUB   | pstamp  | run   |     0 |       28 |         28 |             23 |            5 |            0 |            19 |     2016-05-20 00:21:36 | 2016-05-20 00:21:50 | 
    
    | exp_id  | warp_id | state  | state   | label     |
    +---------+---------+--------+---------+-----------+
    | 1093470 | 1729365 | update | cleaned | ps_ud_QUB | 
    | 1093473 | 1729368 | update | cleaned | ps_ud_QUB | 
    | 1093482 | 1729377 | update | cleaned | ps_ud_QUB | 
    | 1093483 | 1729376 | update | cleaned | ps_ud_QUB | 
    | 1093487 | 1729379 | update | cleaned | ps_ud_QUB |
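    The manual fix is along these lines; a sketch only -- the UPDATE statement itself is an assumption, with the warp_ids and label taken from the query result above:
    
    # hypothetical fix: put the warpRun rows back into 'full' so the pstamp
    # request can find the skyfiles again
    mysql gpc1 -e "UPDATE warpRun SET state='full' WHERE warp_id IN (1729365,1729368,1729377,1729376,1729379) AND label='ps_ud_QUB';"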
    
  • users are filling up the homedir and need to use the datanodes for large data products...
    /dev/sdb1             917G  857G   14G  99% /export/ippc19.0
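    To see which directories are responsible on the nearly-full volume, something like the following works (the layout of user directories directly under /export/ippc19.0 is an assumption; sizes are in KB):
    
    # largest top-level directories on the 99%-full export, biggest first
    du -xs /export/ippc19.0/* 2>/dev/null | sort -rn | head -20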
    
  • MEH: ipp017 gmond crash, restarted
  • Haydn work --
    • replaced faulty disks in ippc32.1, ippc40.1, ippc56.1, ippc60.1 -- these are striped RAID volumes, so the replacement was destructive and any data on those disks was lost
    • ipps14 tracking power usage

Saturday : 2016.05.21

  • 08:15 EAM : cleared a bad warp : warptool -updateskyfile -set_quality 42 -fault 0 -warp_id 1731481 -skycell_id skycell.1487.040 -dbname gpc1
  • MEH: restarted nightly pantasks; ipp062 and ipp031 put into repair (from up) for now due to read errors reported by email

Sunday : 2016.05.22

  • 12:38 MEH: ipp031 raid still rebuilding so leave in repair