PS1 IPP Czar Logs for the week 2013.04.29 - 2013.05.05

(Up to PS1 IPP Czar Logs)

Monday : 2013.04.29

mark is czar

  • 07:00 MEH: looks like 63 exposures are stuck in registration; is the auto-fixing of the check_burntool problem not working now?
  • 14:50 MEH: stsci nodes gone, abducted by aliens? cannot reach them by console, so it must be a network issue. may want to do neb-host down if they are gone for a while -- Gavin was working on a host config issue, all okay now.
  • 17:00 MEH: doing regular restart of stdscience

Tuesday : 2013.04.30

mark is czar

  • 04:00 MEH: manually reverted some 40 stacks (ESS, MD); the usual NFS faults seem to be happening more often now, so an auto-revert is needed. Looks like the typical random compute nodes having issues, maybe overloaded. Last night, and for part of the SAS test stacks Chris ran, ~90% of the faults in the stack pantask log were from ippc40 (nothing clear in the messages as to why, and the mounts are okay).
  • 11:10 MEH: enabling stack.revert -- removed all non-nightly science labels from ~ipp/stack/input; it is turned off in the stack.pro task itself, but added stack.revert.on to the input file.

Wednesday : 2013.05.01

  • 13:30 Bill: stopped all processing and apache servers on ippc17 (datastore) and ipp049 (pstamp-test) to rebuild the ippRequestServer database on ippc17 using InnoDB tables and to restore replication to ippc19 (see the sketch after this list).
  • 16:00 started up summit copy and registration
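A minimal sketch of the kind of conversion involved, assuming the tables are altered in place on ippc17 before re-seeding the ippc19 slave (only dsFileset appears in this log; the other steps are illustrative):

-- on ippc17, with the apache servers and pantasks stopped
USE ippRequestServer;
ALTER TABLE dsFileset ENGINE=InnoDB;
-- ...repeat for the remaining tables, then dump for the slave, e.g.:
-- mysqldump --single-transaction --master-data=2 ippRequestServer > ippRequestServer.sql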

Thursday : 2013.05.02

Bill is czar today

  • 04:00 EAM : reconfigured mysql on ipp005 - ipp009 : set innodb_buffer_pool_size back down to 2G from 16G. I had bumped these up in an attempt to help ipptopsps / dvopsps go faster (avoid swapping), but have since switched to the MEMORY engine for the dvoDetectionFull table that was causing the problem (config sketch below). I also killed some rpc.statd processes on ipp008 and ipp010.
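For reference, the change is just the buffer pool line in my.cnf (the file location and section layout on the ipp nodes are assumptions here):

# [mysqld] section of my.cnf on ipp005 - ipp009, followed by a mysqld restart
innodb_buffer_pool_size = 2G    # was 16G during the ipptopsps / dvopsps tuning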
  • 05:12 replication of the ippRequestServer database from ippc17 to ippc19 has failed. The error is that an INSERT into dsFileset replayed on the slave tried to use a primary key value that duplicates an existing row.
On the slave
mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: ippc17.ifa.hawaii.edu
                Master_User: repl_pstamp
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: mysqld-bin.000718
        Read_Master_Log_Pos: 63235322
             Relay_Log_File: mysqld-relay-bin.000002
              Relay_Log_Pos: 1751685
      Relay_Master_Log_File: mysqld-bin.000718
           Slave_IO_Running: Yes
          Slave_SQL_Running: No
            Replicate_Do_DB: ippRequestServer
        Replicate_Ignore_DB: 
         Replicate_Do_Table: 
     Replicate_Ignore_Table: 
    Replicate_Wild_Do_Table: 
Replicate_Wild_Ignore_Table: 
                 Last_Errno: 1062
                 Last_Error: Error 'Duplicate entry '3478690' for key 1' on query. Default database: 'ippRequestServer'. Query: 'INSERT into dsFileset (prod_id, fileset_name, reg_time, type, prod_col_0, prod_col_1, prod_col_2, prod_col_3, prod_col_4, prod_col_5,  prod_col_6, prod_col_7) VALUES('52', 'dqstats.20794', UTC_TIMESTAMP(), 'table', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)'
               Skip_Counter: 0
        Exec_Master_Log_Pos: 1751547
            Relay_Log_Space: 63235460
            Until_Condition: None
             Until_Log_File: 
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File: 
         Master_SSL_CA_Path: 
            Master_SSL_Cert: 
          Master_SSL_Cipher: 
             Master_SSL_Key: 
      Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

On the master the troublesome row is

select * From dsFileset where fileset_id = 3478690;
+---------+------------+---------------+---------------------+------+-------+------------+------------+------------+------------+------------+------------+------------+------------+
| prod_id | fileset_id | fileset_name  | reg_time            | hide | type  | prod_col_0 | prod_col_1 | prod_col_2 | prod_col_3 | prod_col_4 | prod_col_5 | prod_col_6 | prod_col_7 |
+---------+------------+---------------+---------------------+------+-------+------------+------------+------------+------------+------------+------------+------------+------------+
|      52 |    3478690 | dqstats.20794 | 2013-05-02 06:18:09 |    0 | table | NULL       | NULL       | NULL       | NULL       | NULL       | NULL       | NULL       | NULL       | 
+---------+------------+---------------+---------------------+------+-------+------------+------------+------------+------------+------------+------------+------------+------------+
1 row in set (0.00 sec)

While on the slave that id is used for another entry

mysql> select * From dsFileset where fileset_id = 3478690;
+---------+------------+--------------------------------------+---------------------+------+----------+------------+------------+------------+-------------+------------------+------------+------------+------------+
| prod_id | fileset_id | fileset_name                         | reg_time            | hide | type     | prod_col_0 | prod_col_1 | prod_col_2 | prod_col_3  | prod_col_4       | prod_col_5 | prod_col_6 | prod_col_7 |
+---------+------------+--------------------------------------+---------------------+------+----------+------------+------------+------------+-------------+------------------+------------+------------+------------+
|      54 |    3478690 | o6414g0056o.camera.808365.2945979.26 | 2013-05-02 06:17:09 |    0 | IPP-DIST | 1316       | camera     | 808365     | o6414g0056o | ThreePi.20130502 | y.00000    | NULL       | NULL       | 
+---------+------------+--------------------------------------+---------------------+------+----------+------------+------------+------------+-------------+------------------+------------+------------+------------+
1 row in set (0.00 sec)

But that entry does not exist on the master

mysql> select * From dsFileset where fileset_name = 'o6414g0056o.camera.808365.2945979.26';
Empty set (0.50 sec)


Interestingly the corresponding fileset directory does exist in the dsroot directory on ippc17.

bills% ls /data/ippc17.0/datastore/dsroot/ps1-3pi-cat/o6414g0056o.camera.808365.2945979.26
dbinfo.camera.808365.mdc   index.txt
dirinfo.camera.808365.mdc  o6414g0056o.607641.cm.808365.tgz

But it is inaccessible through the datastore: the request

wcat http://ippc17/ds/ps1-3pi-cat/o6414g0056o.camera.808365.2945979.26/index.txt

failed to find o6414g0056o.camera.808365.2945979.26 in ps1-3pi-cat
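Once the spurious slave row is identified, the generic MySQL recovery would look roughly like this (a sketch only, not the exact commands run here; the actual cleanup was handled by a script, see the 07:37 entry below):

-- on the slave (ippc19)
STOP SLAVE SQL_THREAD;
DELETE FROM ippRequestServer.dsFileset WHERE fileset_id = 3478690 AND prod_id = 54;
START SLAVE SQL_THREAD;
SHOW SLAVE STATUS\G    -- Slave_SQL_Running should come back as Yes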
  • 05:56 I have stopped distribution. It appears that the rcserver task in the distribution pantasks has been inserting data directly into the slave ippc19. This is very strange. pstamp publishing and dqstats are inserting into ippc17, but they all use the same program (dsreg) and the same configuration (~ipp/ippconfig/site.config).
  • 06:00 Problem understood. Distribution gets the database from gpc1.rcDestination, which still specifies ippc19 (see the sketch at the end of this list). The system has a bit too much unused flexibility.
  • 07:37 Bill is running a script to remove the filesets that were mis-registered.
  • 07:54 The data store repair is continuing. Started rcserver task in distribution. It is now inserting the filesets from last night.
  • 08:29 Added label goto_cleaned to the cleanup pantasks. Changed the input file to remove the commented-out set.label goto_cleaned.
  • 14:30 All pantasks restarted with tag updated with new pstamp and release management code.
  • 22:05 Queued a big chunk of data from ps_ud% labels for cleanup.... Need to automate this.
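For the 06:00 rcDestination issue above, a check along these lines would confirm (and, if desired, correct) which host rcserver writes to; the host column name is hypothetical, so verify the actual gpc1.rcDestination schema first:

SELECT * FROM gpc1.rcDestination;
-- hypothetical column name; point the destination back at the master
-- UPDATE gpc1.rcDestination SET db_host = 'ippc17' WHERE db_host = 'ippc19';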

Friday : 2013.05.03

  • 01:15 EAM : removing ipp043 from processing pantasks for tonight as it is working on dvo / relastro needed for ipptopsps to start.
  • 10:00 EAM : reconfigured mysql on ipp012 & ipp014 : set innodb_buffer_pool_size back down to 2G from 16G. I had bumped these up in an attempt to help ipptopsps / dvopsps go faster (avoid swapping), but have since switched to the MEMORY engine for the dvoDetectionFull table that was causing the problem (same change as on ipp005 - ipp009 on Thursday).
  • 10:10 MEH: MD05 stack fault 4; may have just reverted it in time for it to be caught in the SSdiff nightly processing window.

Saturday : 2013.05.04

  • 01:10 MEH: registration looks backed up; manually running -revertprocessedimfile and clearing it.
  • 08:45 EAM: nebulous was having some trouble. I was getting 'too many connections' errors, and 'show processlist' had 1022 entries (see the connection-count sketch after this list). I shut down all pantasks and stopped all apache servers, and the backlog on nebulous went away. I started the apaches back up and things seem to be ok, but the number of sleeping processes seems high to me (~340). I decided not to include ippc06 -- I wonder if it was added recently and if this caused our resource usage to exceed some limit...
  • 13:15 Serge: Couldn't access mysql on ippdb00 ('too many connections' errors as well). Stopped all apache servers; could then get a connection to the mysql server. ippdb00 was running as a slave of ippdb02 (who/what restarted it?!). For safety, deleted all replication parameters for the ippdb02 -> ippdb00 replication. Stopped the mysql server on ippdb00 to clean it up (and make sure it forgot everything about its slave state), then restarted it. Restarted the apache servers. Did the usual "good health" test of nebulous: neb-touch, neb-ls, neb-less, neb-stat, neb-stat validate, neb-rm, and everything seems fine.
  • 13:30 Serge: Restarted crashed mysql server on ippdb04
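For the 'too many connections' episodes above, a quick way to compare connection usage against the configured limit (standard MySQL commands, run as an admin user on the affected server):

SHOW VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
-- which clients are holding the sleeping connections?
SELECT host, COUNT(*) AS n FROM information_schema.processlist
  WHERE command = 'Sleep' GROUP BY host ORDER BY n DESC;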

Sunday : 2013.05.05