PS1 IPP Czar Logs for the week 2012.05.28 - 2012.06.03

Monday : 2012.05.28

  • A few MD08 exposures last night; they look to have gone through okay. Looking at whether the new masking Chris set up is being used and whether it is causing a problem.

http://ippmonitor.ipp.ifa.hawaii.edu/ippMonitor/showimage.php?name=neb://any/gpc1/MD08.nt/2012/05/28//o6075g0034o.492281/o6075g0034o.492281.cm.443364&rule=PPIMAGE.JPEG1&camera=GPC1&class_id=NONE

http://ippmonitor.ipp.ifa.hawaii.edu/ippMonitor/showimage.php?name=neb://any/gpc1/MD08.nt/2012/05/20//o6067g0308o.490670/o6067g0308o.490670.cm.439923&rule=PPIMAGE.JPEG1&camera=GPC1&class_id=NONE

  • Looks like the new setup is masking more (ref_static ~0.197-0.199 on 05/20 and 05/25 vs ~0.238-0.240 on 05/28), so we should probably revert the masks for now. A hedged query sketch follows the table.
    | exp_id | exp_name    | label                      | ref_static | ref_dyna | ref_advis | max_static | max_dyna | max_advis |
    +--------+-------------+----------------------------+------------+----------+-----------+------------+----------+-----------+
    | 490670 | o6067g0308o | MD08.20120520              | 0.197      | 0.000234 | 0.0201    | 0.211      | 0.000223 | 0.0196    | 
    | 490672 | o6067g0309o | MD08.20120520              | 0.197      | 0.000280 | 0.0213    | 0.211      | 0.000300 | 0.0208    | 
    
    | 491784 | o6072g0163o | MD08.20120525              | 0.198      | 0.000255 | 0.0153    | 0.212      | 0.000275 | 0.0149    | 
    | 491785 | o6072g0164o | MD08.20120525              | 0.199      | 0.000246 | 0.0164    | 0.213      | 0.000243 | 0.0159    | 
    
    | 492281 | o6075g0034o | MD08.20120528              | 0.240      | 0.000245 | 0.0060    | 0.253      | 0.000234 | 0.0058    | 
    | 492282 | o6075g0035o | MD08.20120528              | 0.238      | 0.000258 | 0.0079    | 0.252      | 0.000247 | 0.0076    | 
    
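    A hedged sketch of the kind of query behind the comparison above (not the actual command used): the table name footprintSummary is hypothetical, the column names are taken from the header row, and the gpc1 database on ippdb01 is assumed.
    mysql -h ippdb01 -e "SELECT exp_id, exp_name, label, ref_static, ref_dyna, ref_advis, \
        max_static, max_dyna, max_advis FROM footprintSummary \
        WHERE exp_id IN (490670,490672,491784,491785,492281,492282)" gpc1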

Tuesday : 2012.05.29

Mark is czar

  • No data last night. Reverted to the previous mask set using a script Chris set up, since his new masks (meant to remove static issues due to crosstalk bleeds, etc.) were masking more than expected/desired.
  • 11:00 Serge: stopped the replication slave on ippdb02 to give mysqldump a chance to finish. Changed the nebulous backup copy on ipp@ipp001 to 20:05 (instead of 12:05); see the sketch below.
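    A minimal sketch of the backup-time change, assuming the nebulous backup copy is driven by a cron entry for ipp@ipp001; the script path is hypothetical.
    # crontab -e as ipp on ipp001: move the copy from 12:05 to 20:05
    #5 12 * * * /home/ipp/scripts/nebulous_backup_copy.sh    (old entry, commented out)
    5 20 * * * /home/ipp/scripts/nebulous_backup_copy.sh
    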
  • 15:40 Re-ran SAS.Footprint to verify the masks were reverted; looks like they are.
    | exp_id | exp_name    | label                      | ref_static | ref_dyna | ref_advis | max_static | max_dyna | max_advis |
    +--------+-------------+----------------------------+------------+----------+-----------+------------+----------+-----------+
    | 356282 | o5744g0329o | czw.footprint.test21       | 0.195      | 0.000031 | 0.0102    | 0.209      | 0.000029 | 0.0099    | 
    | 356282 | o5744g0329o | czw.footprint.test120525   | 0.225      | 0.000029 | 0.0096    | 0.238      | 0.000027 | 0.0094    | 
    | 356282 | o5744g0329o | czw.footprint.test120525v2 | 0.238      | 0.000029 | 0.0094    | 0.251      | 0.000027 | 0.0092    | 
    | 356282 | o5744g0329o | czw.footprint.test120529   | 0.195      | 0.000031 | 0.0102    | 0.209      | 0.000029 | 0.0099    | 
    | 356283 | o5744g0330o | czw.footprint.test21       | 0.195      | 0.000036 | 0.0106    | 0.209      | 0.000034 | 0.0104    | 
    | 356283 | o5744g0330o | czw.footprint.test120525   | 0.224      | 0.000034 | 0.0101    | 0.237      | 0.000032 | 0.0099    | 
    | 356283 | o5744g0330o | czw.footprint.test120525v2 | 0.238      | 0.000034 | 0.0099    | 0.251      | 0.000031 | 0.0097    | 
    | 356283 | o5744g0330o | czw.footprint.test120529   | 0.195      | 0.000036 | 0.0106    | 0.209      | 0.000034 | 0.0104    | 
    
  • 15:50 shutting down all pantasks and turning apache off on ippc17 while Serge does some DB work.
  • 16:15 Serge is now in charge.
  • 16:30 Serge: restarted slave on ippdb02. All apache nebulous conf files are now pointing towards ippdb02. I will wait till 8pm for jt scripts running on ippc19 to finish.
  • 16:35 Serge: nebulous slave is now synchronized with its master.
  • 17:10 Serge: the things to be done are described here (just in case I have a moped accident)

Wednesday : 2012.05.30

Serge is czar

  • No observations last night due to a problem with the primary mirror actuator system. Repair is scheduled for 5-30-2012 HST (quoted from the ps-obs report).
  • Nebulous maintenance is happening; details are on the same reporting wiki page as yesterday.
  • 12:00 Started optimization of gpc1 on ippdb01
  • 15:25 All but the last two versions of src/ipp-<tag> are being moved to /export/ippc18.1/ipp/src/Archives/
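    A hedged sketch of the move (the selection by modification time is an assumption): keep the two newest ipp-* source trees in ~ipp/src and archive the rest.
    cd ~ipp/src
    ls -dt ipp-* | tail -n +3 | xargs -I{} mv {} /export/ippc18.1/ipp/src/Archives/
    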
  • 16:35 Moved ~ipp/iasc to ippc18.1, leaving a symbolic link in ~ipp.
  • 17:08 End of gpc1 optimization: 195 GB before, 167 GB after.
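    A hedged sketch of what the optimization likely amounted to (the exact invocation was not recorded here); mysqlcheck -o runs OPTIMIZE TABLE over every table in the database.
    # on ippdb01, while processing is stopped
    mysqlcheck -o gpc1
    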
  • 17:10 No change to the estimated ippdb00 ingestion completion time: 6am.

Thursday : 2012.05.31

Serge is czar

  • 09:30 ippdb00 ingestion is still happening
  • 11:00 Processing has been restarted
  • 11:57 Processing stopped. The new tag is being built.
  • 13:05 All pantasks servers (except the ippdvo-related ones) have been restarted
  • 14:05 mysql on ippdb00 has crashed. From /var/log/messages (a hedged diagnostic sketch follows the excerpt):
    May 31 14:01:02 ippdb00 [26276891.033202] 3w-9xxx: scsi5: ERROR: (0x06:0x0010): Microcontroller Error: clearing.
    May 31 14:01:32 ippdb00 [26276921.659024] sd 5:0:0:0: WARNING: (0x06:0x002C): Command (0x35) timed out, resetting card.
    May 31 14:02:02 ippdb00 [26276951.032734] 3w-9xxx: scsi5: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.
    May 31 14:02:27 ippdb00 [26276975.998987] 3w-9xxx: scsi5: AEN: INFO (0x04:0x0063): Enclosure added:encl=0.
    May 31 14:02:52 ippdb00 [26277001.122317] 3w-9xxx: scsi5: AEN: INFO (0x04:0x0063): Enclosure added:encl=0.
    May 31 14:03:02 ippdb00 [26277011.439889] end_request: I/O error, dev sda, sector 2080656881
    May 31 14:03:02 ippdb00 [26277011.439931] I/O error in filesystem ("sda4") meta-data dev sda4 block 0x74603200       ("xlog_iodone") error 5 buf count 262144
    May 31 14:03:02 ippdb00 [26277011.439936] xfs_force_shutdown(sda4,0x2) called from line 1062 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa00c4652
    May 31 14:03:02 ippdb00 [26277011.439961] Filesystem "sda4": Log I/O Error Detected.  Shutting down filesystem: sda4
    May 31 14:03:02 ippdb00 [26277011.439964] Please umount the filesystem, and rectify the problem(s)
    May 31 14:03:17 ippdb00 [26277025.812318] Filesystem "sda4": xfs_log_force: error 5 returned.
    [...]
    May 31 14:08:26 ippdb00 [26277335.389888] 3w-9xxx: scsi5: AEN: INFO (0x04:0x001A): Drive inserted:vport=23.
    [...]
    May 31 14:19:23 ippdb00 [26277992.304317] Filesystem "sda4": xfs_log_force: error 5 returned.
    May 31 14:19:23 ippdb00 [26277992.304323] xfs_force_shutdown(sda4,0x1) called from line 420 of file fs/xfs/xfs_rw.c.  Return address = 0xffffffffa00d6a4d
    [...]
    
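    A hedged follow-up sketch (not taken from the log) of checks one might run after controller/filesystem errors like the above; the 3ware controller id c5 is assumed from the scsi5 messages.
    tw_cli /c5 show            # 3ware controller, unit and drive status
    dmesg | tail -n 50         # recent kernel messages
    xfs_repair -n /dev/sda4    # read-only check of the XFS metadata (only once sda4 is unmounted)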
  • 15:40 Mark changed ~ipp/ippconfig/site.config and gpc1/psastro.config to use the new refcat.20120524.v0 and rebuilt ippconfig (stopped/restarted the pstamp and replication pantasks); see the sketch below.
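    A rough sketch of the refcat switch (the keyword names inside the config files are not reproduced here, and the rebuild step is an assumption).
    cd ~ipp/ippconfig
    grep -rn refcat site.config gpc1/psastro.config    # locate the old catalog entries
    # edit the matching entries to point at refcat.20120524.v0, rebuild/install ippconfig,
    # then restart the pstamp and replication pantasks so they pick up the change
    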
  • 23:00 Mark: yay, nightly science data. SAStest.v5 chip->warp finished; running the 5x600 stacks. Adding 2x compute3 into the stack pantasks (no deepstack running) and setting the SAStest.v5 priority to 200.
  • 23:40 Mark: fun... looks like the MD diffims are failing, ending with "missing sources?". Will look at the images in the morning. diff.revert.off in stdscience until further notice.

Friday : 2012.06.01

Mark is czar

  • 08:00 Only a little MD data last night; it is through except for the known diffim faults being looked into.
  • 08:30 SAStest.v5 chip->warp+stacks finished, but staticsky is all faulting in the new ops tag ipp-20120531 now that the stack_id is in the CMF filename... setting the stack up for distribution at least (took 1.5 hours to start).
  • 10:30 SSdiffs also faulting.
  • 11:00 Heather is looking into the staticsky trouble. It also seems the deepstack pantasks wasn't recording the error/fault in pantasks.stdout.log; restarted the deepstack pantasks and it is now logging properly.
  • 11:30 Mark is starting the MD09.GR0 chip->warp processing.
  • 13:30 Heather fixed the code for the name change and staticsky is running; faulted runs are reverting okay. Adding another compute3 and a compute2 to deepstack (since it is not doing stacks); memory use is ~10-15 GB (more than half the RAM on a compute2, so only 1 group there).
  • 14:00 A note that nebdiskd is not running while we are using ippdb02 for nebulous, so the czarplot of overall disk use will not update. The per-system disk use plots will, however.
  • 15:00 Gene traced the MD diffim faults to psphotFindDetections.c; we don't want to fault/fail on the no-sources case but rather continue on/return true in the usFootprints condition. Tested and merged into the ops tag; nightly diffims are processing successfully now.
  • 16:30 SAStest.v5 mostly done with ~20/581 needing to be looked at for repeat faults.
  • 17:30 Need to restart stdscience, stack, and deepstack to clear out the changes made to push SAS through quickly, plus the normal regular restart of stdscience. stdscience got stuck; Gene found a pcontrol was still running/hung on ippc16 (see the sketch below).
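    A hedged sketch of the kind of check (not the exact commands used): find and kill the leftover pcontrol on ippc16.
    ssh ippc16 'ps auxww | grep [p]control'    # locate the hung pcontrol and note its PID
    ssh ippc16 'kill <PID>'                    # then kill it (kill -9 if it refuses to exit)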

Saturday : 2012.06.02

  • 06:40 Serge: For the nebulous ingestion, 70% of the instance table has been ingested.
  • 09:00 Mark: seems like there is a summit copy problem and some machines are stalled; looking into it.
    • stdscience -- forgot to manually set "controller parameters unwant = 7", so ipp048 was fully loaded with stalled ppImages; there is a large backlog of ipp048-targeted runs to do for MD09.GR0.
    • summitcopy -- 2 object files have md5sum conflicts; looking into it, but it was not real data last night. There seem to also be some darks affected. Will need to come in to look at it.
  • 11:00 Mark: ugh, ppImage is still having problems, using a lot of RAM (+10-20 GB) and running into trouble even on the compute2 and wave4 systems with MD09.GR0... will need to look at it in more detail.
  • 13:30 Emailed ps-camera and ps-ops asking whether these two are bad images from last night (their md5sums did not match; a checksum-check sketch follows):
    o6080g0092o52.fits - expected md5: f29c87138568f93fe0d22e62eb5375b6 got: 8982a5f074635e69998e8a0599b1e117
    o6080g0094o62.fits - expected md5: 895e6377e15427fe67c75b37971330de got: 1c3024cf110f385b20f8484d0664a553
    
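    A hedged sketch of re-checking one of the copies against the expected summit checksum; the local path is hypothetical.
    md5sum /data/ippXXX.0/nightly-copy/o6080g0092o52.fits
    # compare the printed sum with the expected value (f29c87138568f93fe0d22e62eb5375b6);
    # a mismatch means the copy (or the file at the summit) is corrupt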
  • 15:00 Mark: MD09.GR0 is getting stuck often, it seems for XY71 and XY67, and is using all the RAM even on wave4/compute3 (50 GB); still tracing.
  • 18:15 Jacob reports that 0092-0094 were taken during camera problems; am just going to flag all three as bad:
    pztool -updatepzexp -exp_name o6080g0092o -inst gpc1 -telescope ps1 -set_state drop -summit_id 488351 -dbname gpc1
    pztool -updatepzexp -exp_name o6080g0093o -inst gpc1 -telescope ps1 -set_state drop -summit_id 488352 -dbname gpc1
    pztool -updatepzexp -exp_name o6080g0094o -inst gpc1 -telescope ps1 -set_state drop -summit_id 488353 -dbname gpc1
    
  • 18:30 Since the download was okay for o6080g0093o, leave it, but shut down summitcopy and set the other two in gpc1 on ippdb01 with:
    update summitExp set exp_type ='broken', imfiles=0, fault=0 where exp_name='o6080g0094o';
    update summitExp set exp_type ='broken', imfiles=0, fault=0 where exp_name='o6080g0092o';
    
  • 19:00 Mark: 20120602 needed to be added back into the registration pantasks to finish registering:
    register.add.date 2012-06-02 gpc1 14
    
  • 19:30 summitcopy also has o6080g0101o with a fault=110, summit_id=488360.
  • 23:30 Mark: two more non-existent OTAs in MD09 to drop:
    chiptool -dropprocessedimfile -set_quality 42 -chip_id 468770 -class_id XY33 -dbname gpc1
    regtool -updateprocessedimfile -set_ignored -exp_id 200329 -class_id XY33 -dbname gpc1
    
    chiptool -dropprocessedimfile -set_quality 42 -chip_id 468788 -class_id XY17 -dbname gpc1
    regtool -updateprocessedimfile -set_ignored -exp_id 203817 -class_id XY17 -dbname gpc1
    
  • 23:40 Mark: looks like the ppImage/psphot problem had overloaded the memory on of order 10 machines, squelching nightly processing... will start manually killing jobs (see the sketch below) and will have to stop the MD09.GR0 processing for refstacks. The issue looks to be in Kron Iterate (2nd pass, ~5k sources) with ~2 ks runtime and >30 GB memory; psphot Source Size can then take >5 ks to maybe finish.
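    A hedged sketch of finding and killing the memory hogs by hand (the exact commands used are not recorded here).
    ps aux --sort=-rss | egrep 'ppImage|psphot' | head    # largest resident-memory jobs first
    kill <PID>                                            # kill the offenders (kill -9 if needed)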

Sunday : 2012.06.03

  • 07:30 Serge: Nebulous ingestion. 90% of the instance table...
  • 12:00 Serge: Nebulous ingestion: 95% of instance table. Instance table should be finished in the late afternoon.
  • 17:00 Mark: chip.revert.off while working with the problematic psphot/MD09 chips until midnight or so; will revert manually as needed.