PS1 IPP Czar Logs for the week 2012.06.11 - 2012.06.18

Monday : 2012.06.11

  • 21:20 leaving postage stamp server running along with nightly science. seems to do fine with active MOPS jobs past couple nights while being monitored, but will turn update off later. some QUB stamps seem to have stalled even with update on and no other stamps running, will need looking into.

Tuesday : 2012.06.12

Mark is czar

  • 07:20 nightly science data downloaded and processed. postage stamp still running, MOPS jobs almost finished but more QUB jobs seem to have become stuck; looking into it.
  • 09:20 tracing out the PSS stalling: one looks to be waiting to update a diffim that is in the goto_cleaned state (never actually cleaned), MD06 diff_id=242933
    • turning on cleanup pantasks and watching processing/ippdb03 closely to avoid crashing, try and "clean-up" -- looks like will take a couple hours. want to turn back off before nightly science?
    • 12:30 QUB from past few days moving through now as well as some web ones too.
  • 10:00 Serge turning replication/balance back on since it doesn't use gpc1
  • 10:30 removing ippc12 from pantasks and from loaded compute list since being swapped in for ippdb01.
  • 10:35 did ippc12 going down stall a distribution entry for MD07 SSdiff?
    • do the easy restart of distribution and see if picks up again -- yes and cleared.
  • 12:30 MOPS has ~36k stamp requests, running ~200/minute, so about 3hr (36000/200 = 180 min) for this batch to finish before a gpc1 dump? no reason to wait, sooner is better since not replicated/no copy to fall back on.
  • 13:35 Serge: manually set the correct time and restarted ntpd on ippdb01
  • 13:40 Mark: stopping all processing, czartool etc for ippdb03:gpc1 dump
  • 13:53 Serge: dumping gpc1 on ippdb03 to /export/ippdb03.0/mysql-dumps/backup/gpc1_20120612.sql.bz2. Coordinates of ippdb03 as a master are:
    mysql> SHOW MASTER STATUS;
    +-------------------+----------+--------------+------------------+
    | File              | Position | Binlog_Do_DB | Binlog_Ignore_DB |
    +-------------------+----------+--------------+------------------+
    | mysqld-bin.000004 |       98 |              |                  | 
    +-------------------+----------+--------------+------------------+
    1 row in set (0.00 sec)
    
  • 15:40 Serge: end of dump. I also performed a czardb dump
  • 15:45 Mark: processing restarted, stdscience+stack+pstamp+update fully shut down and restarted to make sure ippc12 was dropped.
  • 16:15 Serge: ippdb01 is "clean". Dropped the following databases: MD01, MD02, MD03, MD04, Rw10BH96, beaumont, czardb_2, czwtest, detection_efficiency, eamMC, eamtest, haf_addtest, heather, heather2, ipptopsps_test, simtestMEH, ippadmin. I also dropped czardb and gpc1 for a full reingestion (czardb has actually been ingested while I write these lines and gpc1 ingestion has begun).
  • 20:50 Mark: appears registration is stalled. trying a restart -- that worked...
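
The binlog coordinates recorded at 13:53 are the values a replica would normally be pointed at once the dump is loaded. The following is only a sketch, assuming Python is at hand: it parses the SHOW MASTER STATUS table as pasted above and prints the matching CHANGE MASTER TO statement. The helper names parse_master_status and change_master_statement are hypothetical, and the log does not record which command was actually run on ippdb01.

```python
# Output of SHOW MASTER STATUS on ippdb03, copied from the 13:53 entry.
MASTER_STATUS = """
mysql> SHOW MASTER STATUS;
+-------------------+----------+--------------+------------------+
| File              | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+------------------+
| mysqld-bin.000004 |       98 |              |                  |
+-------------------+----------+--------------+------------------+
1 row in set (0.00 sec)
"""


def parse_master_status(text):
    """Return (binlog_file, position) from the ASCII table printed by the mysql client."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only "| ... |" rows; borders, the prompt and the footer are skipped.
        if line.startswith("|") and line.endswith("|"):
            rows.append([cell.strip() for cell in line.split("|")[1:-1]])
    if len(rows) != 2:
        raise ValueError("expected one header row and one data row")
    record = dict(zip(rows[0], rows[1]))
    return record["File"], int(record["Position"])


def change_master_statement(log_file, log_pos):
    """Build the statement that sets a replica's starting binlog coordinates."""
    return "CHANGE MASTER TO MASTER_LOG_FILE='%s', MASTER_LOG_POS=%d;" % (log_file, log_pos)


log_file, log_pos = parse_master_status(MASTER_STATUS)
print(change_master_statement(log_file, log_pos))
```

Host, user and password for the master would still have to be supplied separately; only the file name and position come from the log entry.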

Wednesday : 2012.06.13

Mark is czar

  • 06:44 Serge: MySQL replication of ippdb03 to ippdb01 started and looking fine (right now: 45000 seconds behind).
  • 07:40 Serge: ippdb03 has caught up. We can switch back to the normal configuration.
  • 07:50 Mark: nightly science downloaded and processed.
  • 08:10 all services shut down to flip back over to primary DB scidbm (ippdb01)
  • 09:45 Serge: ippdb03 is replicating ippdb01. To avoid problems I removed all access to ippdb03 by the 'ipp' user. Other users are unchanged.
  • 10:00 Mark: restarting pantasks, starting with pstamp since will have MOPS requests to check. then update, then stdscience for the MD SSdiffs, then distribution
    • pstamp running, SSdiffs running, update failing -- needed ippadmin DB on scidbm/ippdb01 even though it only queries entries. -- SSdiffs and updates now completing.
  • Heather added isp back to summitcopy - it seems to be working, Heather will verify later (it's several days behind). Heather will also add it to the input files so isp is automatic again. The magic command for summitcopy/registration (since there are no labels in those pantasks) is add.database isp or del.database isp
  • 12:10 czartool page not fully updating, investigating -- czarpages on ippMonitor use scidbs (db03) and replication broken right now.
  • 14:30 init.day and cleanup seem to be okay.
  • 14:20 appears not all hosts know what scidbm is in the DNS (ippc13, others?) -- Serge checked and didn't find any others. Gavin fixed the DNS.
  • 17:20 adding use of DBSERVER rather than hardcoded server name for regpeek.pl and nightly_science.pl for testing tonight. stopping pantasks for that soon.
  • 23:00 change-over back to ippdb01 seems good, mods to nightly_science.pl and regpeek.pl seem to work.
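
For the 14:20 DNS issue, a quick per-host check that the database aliases resolve can be sketched as below. This is an assumption-laden sketch, not the check Serge or Gavin ran: resolve_names and unresolved are hypothetical helpers, and the script only tests resolution on the host it runs on, so it would have to be run on each node (ippc13 etc.) in turn.

```python
import socket


def resolve_names(names, resolver=socket.gethostbyname):
    """Map each name to its resolved IPv4 address, or None if the lookup fails."""
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror and socket.herror are OSError subclasses
            results[name] = None
    return results


def unresolved(results):
    """Return the sorted list of names that did not resolve."""
    return sorted(name for name, addr in results.items() if addr is None)


# Usage on a given host (performs real DNS lookups):
#   missing = unresolved(resolve_names(["scidbm", "scidbs"]))
#   print("unresolved:", missing)
```

The resolver argument exists only so the lookup can be swapped out when trying the function without network access.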

Thursday : 2012.06.14

  • 07:50 Mark: watching nightly science processing while czar out this morning. Looks like mostly done, one stuck in publishing? yes for ~10ks.. stage_id=247263 (diff) -- killed and reverted but still not finishing..
    • by 11:00 has finished .pos.mops and is 104MB in size.. so why taking so long.. diffim shows what looks like an image jump/poor tracking. warned MOPS about pub.425940.diff.247263.
    • not sure which exposure is the cause (add here when traced out)
  • 07:55 Mark: looking into Larry report that about 1/3 of the OTAs in this diff pair (o6092g0331o and o6092g0348o) have no detections reported from them for MOPS.
    • looks like not all skycells were run for the diffim. why? not in warpSkyCellMap. why? poor astrometry solution. why for this exposure and not the one 17 later (o6092g0348o)? unclear, but it is also seen to a lesser degree in the following exposure. brief details from the log files:
      -- gpc1/OSS.nt/2012/06/14/o6092g0331o.499070/o6092g0331o.499070.wrp.433383.log
      bad astrometric solution in header
      skipping XY01.hdr
      bad astrometric solution in header
      skipping XY02.hdr, XY03,04,05,06,10,11,12,13,14,16,17,21,23
      
      -- next exposure also somewhat poor -- gpc1/OSS.nt/2012/06/14/o6092g0332o.499069/o6092g0332o.499069.wrp.433393.log
      bad astrometric solution in header
      skipping XY16.hdr, XY17,35,36,47,57,61,63,76
      
      -- compared to later -- gpc1/OSS.nt/2012/06/14/o6092g0348o.499086/o6092g0348o.499086.wrp.433398.log
      bad astrometric solution in header
      skipping XY10.hdr, XY12,46,54,73
      
      
      
  • 08:05 Mark: dropping old LAP stack that never ran
    stacktool -dbname gpc1 -updaterun -stack_id 926353 -set_state drop
    
  • 10:30 Mark: restarted stdscience and found the OSS stack didn't trigger properly last night, running now. may be due to the change to nightly_science.pl? MD SSdiffs ran normally and distributed.
    • test change back to hardcoded DBSERVER and no timeout for: nightly_science.pl --queue_stacks --date 2012-06-14 --dbname gpc1 in pantasks logs.
  • 11:05 OSS stacks finished and going out to distribution.
  • 15:05 Serge: Replication on ipp001 was running fine until the same statement that stopped replication on ippdb03 yesterday was replayed. It is:
    mysql> SELECT id, user, host, db, command, time, state, info FROM INFORMATION_SCHEMA.PROCESSLIST ORDER BY Host;
    
    | 768 | system user |           | isp  | Connect |  220 | Copying to tmp table             | 
    INSERT INTO summitExp   SELECT       NULL,       incoming.*,       NULL,       0,       NULL   
    FROM incoming   LEFT JOIN summitExp       USING(exp_name, camera, telescope)   
    WHERE       summitExp.exp_name is NULL       AND summitExp.camera is NULL       AND summitExp.telescope is NULL 
    | 
    
    We need to figure out why this statement succeeds on ippdb01 but fails when it's replicated. My best guess is that it actually fails with a timeout and is then processed at the pantasks level. Note that if any similar statement is run for gpc1, it doesn't affect the replication...
  • 15:10 Serge: I'm removing isp from the replicated databases on ipp001.
  • 16:55 Serge: ipp001 replication is running and up to date.
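
For reference, the INSERT quoted at 15:05 is an anti-join: it copies into summitExp only those rows of incoming that have no match on (exp_name, camera, telescope). A minimal sketch of that logic, using an in-memory SQLite database rather than MySQL, is below. The table layouts are cut down to hypothetical columns (summit_id, fault) and the row values are placeholders; it illustrates what the statement does, not why it stalls under replication.

```python
import sqlite3


def insert_new_exposures(conn):
    """Copy rows from incoming that are not yet present in summitExp."""
    conn.execute(
        """
        INSERT INTO summitExp
        SELECT NULL, incoming.*, 0
        FROM incoming
        LEFT JOIN summitExp USING (exp_name, camera, telescope)
        WHERE summitExp.exp_name IS NULL
          AND summitExp.camera IS NULL
          AND summitExp.telescope IS NULL
        """
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified, hypothetical schemas; the real tables have more columns.
    CREATE TABLE incoming (exp_name TEXT, camera TEXT, telescope TEXT);
    CREATE TABLE summitExp (
        summit_id INTEGER PRIMARY KEY,
        exp_name TEXT, camera TEXT, telescope TEXT,
        fault INTEGER
    );
    INSERT INTO summitExp VALUES (NULL, 'exp_a', 'cam', 'tel', 0);
    INSERT INTO incoming VALUES ('exp_a', 'cam', 'tel');
    INSERT INTO incoming VALUES ('exp_b', 'cam', 'tel');
    INSERT INTO incoming VALUES ('exp_c', 'cam', 'tel');
    """
)

insert_new_exposures(conn)  # adds exp_b and exp_c; exp_a is already present
names = [row[0] for row in conn.execute("SELECT exp_name FROM summitExp ORDER BY exp_name")]
print(names)
```

Running the insert a second time adds nothing, since every incoming row then has a match.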

Friday : 2012.06.15

  • 11:20 Gene rebooted ipp019, seemed stalled and holding up jobs.
  • 11:30 Mark is mucking around with stdscience now that nightly science processing is finished. will be stop/starting.
  • 15:20 Mark added readonly DB access RO_DBSERVER/USER/PASS variables to ~ipp/ippconfig/site.config for use by tools/regpeek.pl and ippScripts/scripts/nightly_science.pl. will watch tonight if nightly_science.pl working right.

Saturday : 2012.06.16

  • 16:20 Mark: looks like ipp019 has been causing problems, registration still not finished and has burntool process many 10ks old. going to set ipp019 as neb-repair for rest of weekend and see if can't sort this out before nightly science starts tonight.
  • 18:30 rebooted ipp019, remaining images registered and processed. will look into summit faults later tonight when checking on start of nightly science. /var/log/messages filled with lockd and mount.nfs hung entries throughout early morning.
  • 21:00 recovered/downloaded the 3 exposures faulted from last night (o6094g0050o, o6094g0053o, o6094g0054o) and processed.

Sunday : 2012.06.17