The purpose of this page is to show how database replication problems were solved. Please contribute when you fix a MySQL replication problem (add a header <date>/<time>: <Problem>). If you fix something similar to an existing entry, just add an entry like 'Fixed problem as in <entry reference>'. Once we have accumulated a lot of entries, I (SC) will try to reorganize them by category.

2013

2013-01-04 09:10

Last_error value when executing SHOW SLAVE STATUS on that host is:
Error 'Duplicate entry ...

Fixed with the usual: stop slave; set global sql_slave_skip_counter=1; start slave;
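
For the record, the 'usual' fix expands to the following in a mysql client on the affected slave (the closing SHOW SLAVE STATUS check is just a sanity check, and the whole recipe gets repeated if Last_Error reports another duplicate):

STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;
-- then verify: Slave_SQL_Running should be 'Yes' and Last_Error empty
SHOW SLAVE STATUS\G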

2012

2012-12-09

Last_error value when executing SHOW SLAVE STATUS on that host is:
Error 'Duplicate entry ...

Fixed with the usual: stop slave; set global sql_slave_skip_counter=1; start slave;

2012-12-07

Last_error value when executing SHOW SLAVE STATUS on that host is:
Error 'Duplicate entry ...

Fixed with the usual: stop slave; set global sql_slave_skip_counter=1; start slave;

2012-11-14

Last_error value when executing SHOW SLAVE STATUS on that host is:
Error 'Duplicate entry '21338906' for key 1' on query. Default database: 'gpc1'. Query: 'INSERT INTO diffInputSkyfile(diff_id, skycell_id, warp1, stack1, warp2, stack2, tess_id) SELECT diff_id, skycell_id, warp1, stack1, warp2, stack2, tess_id from skycellsToDiff'

Fixed with the usual: stop slave; set global sql_slave_skip_counter=1; start slave;

2012-11-13: nebulous backup

Stopped the mysql server on ippc63; rsyncing to ippc62. 2012-11-14: restarted.

2012-11-13: ipp001 crash

No apparent reason. Restarted mysql server. Replication easily caught up.

2012-11-08: ippc17 crash

Error 'Duplicate entry '824346-fpa' for key 1' on query. Default database: 'isp'. Query: 'INSERT INTO [...]

The usual:

stop slave; set global sql_slave_skip_counter=1; start slave;

fixed it.

2012-11-05

Stopped replication from ippdb00 to ippc63. Started rsync from ippc63 to ippc61. (On 2012-11-06, restarted replication)

2012-11-02

Backup complete. Replication to ippc63 restarted.

2012-11-01

Stopped replication and mysql server on ippc63 to make a backup to ippc62

2012-10-23

Again, a few rounds of STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;

2012-10-21/22 gpc1 Reingestion

Reingestion on ippdb03 and ipp001

2012-10-15 ippdb01 -> ippdb03

Replication problem again

 Error 'Duplicate entry '19665422' for key 1' on query. Default database: 'gpc1'. Query: 'INSERT INTO diffInputSkyfile SELECT * from skycellsToDiff'

fixed ippdb03 with recipe

stop slave; set global sql_slave_skip_counter=1; start slave;

2012-10-11: ippdb01 -> ippdb03/ipp001

Replication problem happened twice last night.

Error 'Duplicate entry '19442342' for key 1' on query. Default database: 'gpc1'. 
Query: 'INSERT INTO diffInputSkyfile SELECT * from skycellsToDiff'
Error 'Duplicate entry '19446512' for key 1' on query. Default database: 'gpc1'. 
Query: 'INSERT INTO diffInputSkyfile SELECT * from skycellsToDiff'

2012-10-10: ippdb01 -> ippdb03/ipp001

Again

2012-10-09: ippdb01 -> ippdb03/ipp001

Same problem as yesterday.

Error 'Duplicate entry '19309559' for key 1' on query. Default database: 'gpc1'.
Query: 'INSERT INTO diffInputSkyfile SELECT * from skycellsToDiff'

Same fix:

STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;

2012-10-08: ippdb01 -> ippdb03/ipp001

Error 'Duplicate entry '19172463' for key 1' on query. Default database: 'gpc1'. 
Query: 'INSERT INTO diffInputSkyfile SELECT * from skycellsToDiff'

Fix:

STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;

2012-08-25 / ippdb01 to ippdb03-ipp001

difftool.c can replay the same INSERT statement twice, which breaks replication.
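
For illustration only (this is not what difftool.c does, just a sketch of one way such a statement could be made tolerant of a replay): MySQL's INSERT IGNORE silently drops rows that would violate the primary key instead of aborting, so a replayed statement inserts nothing rather than stopping the slave.

-- sketch only: same shape as the failing statements logged above
INSERT IGNORE INTO diffInputSkyfile
SELECT * FROM skycellsToDiff;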

2012-07-06 / ipp001/gpc1

Error 'Cannot add or update a child row: a foreign key constraint fails (gpc1/distComponent, CONSTRAINT distComponent_ibfk_1 FOREIGN KEY (dist_id) REFERENCES distRun (dist_id))' on query. Default database: 'gpc1'. Query: 'INSERT INTO distComponent (dist_id, component, bytes, md5sum, state, outdir, name, fault) VALUES (2066813, 'skycell.7.17', 108129, 'e5f49b59c7f693649f40a33e7a3980b9', 'full', 'neb://stsci08.0/gpc1/dist/STS.nightlyscience/2012/06/13/o6091g0322o/warp.2066813', 'o6091g0322o.498476.wrp.432885.skycell.7.17.tgz', 0)'

Tried to skip the offending statement without success.
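
A diagnostic worth trying in this situation (my assumption, not something recorded in the original entry): check whether the parent distRun row is simply missing on the slave, since a missing parent would explain why skipping the child INSERT does not get replication going again.

-- on the slave: is the referenced parent row there at all?
SELECT * FROM gpc1.distRun WHERE dist_id = 2066813;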

2012-06-12 -> 2012-06-14

After the ippdb01 crash, see TODO

20120529-20120529 ippdb00/ippdb02

Full nebulous ingestion. Details here

20120427/11:05 ipp001

ipp001 replication broken because of 'Duplicate entry '32' for key 1' on query. Default database: 'ippadmin'. Query: 'insert into users (username, password) values ('chenx', 'chenx')'. I ran the usual STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;. Then 'Duplicate entry '30' for key 1' on query. Default database: 'ippadmin'. Query: 'INSERT into cookies (username, cookie) values ('chenx', '32709')'. Another STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; fixed it.

20120313/12:05 ipp001

ipp001 replication is broken again (broken foreign keys?! e.g. Cannot add or update a child row: a foreign key constraint fails (`gpc1/chipProcessedImfile`, CONSTRAINT `chipProcessedImfile_ibfk_1` FOREIGN KEY (`chip_id`, `exp_id`) REFERENCES `chipRun`)). Made a manual backup of ippdb03 to /export/ippdb03.0/mysql-dumps/backup_20120313T1208.sql.bz2

            Master_Log_File: mysqld-bin.020936
        Read_Master_Log_Pos: 73345738
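
Presumably the coordinates above were recorded so the slave can be re-pointed later. A sketch of how such coordinates are used (values copied from the status lines above; note that to resume execution one would normally take Relay_Master_Log_File / Exec_Master_Log_Pos rather than the read position):

STOP SLAVE;
CHANGE MASTER TO MASTER_LOG_FILE='mysqld-bin.020936', MASTER_LOG_POS=73345738;
START SLAVE;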

20120312/13:31 ippdb03/ipp001

Error 'Table 'skycells.skycell' doesn't exist' on query. Default database: 'gpc1'. Query: 'update staticskyRun join staticskyInput using(sky_id) join stackRun using(stack_id) join skycells.skycell using(skycell_id) set staticskyRun.state = 'new.wait' where staticskyRun.label ='lap.threepi.20110809' and staticskyRun.state ='new' and ra/15 > 12'

From Bill: "I did use the table in an update last night. That was a one time command. The reason that the table is not in gpc1 is that I'm not finished with it yet and I didn't want to cause problems with replication. Apparently using it for a table update caused a problem."

I ran the usual STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; on ippdb03.

It looks, though, like there are other problems (multiple insert attempts for the same primary key) on ipp001.

20120223/08:40: ippdb03

Error 'Duplicate entry 'XXXXX-X' for key 1' on query. Default database: 'ipptopsps'. I ran ~schastel/dev/many_failures/fix_replication.sh about 200 times without success. I talked to Roy, who will replicate ipptopsps somewhere else. I'm dropping it from replication for now. I stopped the mysql server on ippdb03, modified /etc/mysql/my.cnf (commented out the ipptopsps replication), and restarted the server.

20120222/09:21: ippc19

Replication on ippc19 broke. The 'SHOW SLAVE STATUS\G' command showed the following message: "Slave: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 0". /var/log/mysql/mysqld.err showed that replication was broken at log 'mysqld-bin.000639' position 1041257298 (see message 120222 8:46:23). I stopped the slave, changed the master coordinates on the slave (with the input found in /var/log/mysql/mysqld.err) and restarted the slave, that is:

STOP SLAVE;
CHANGE MASTER TO MASTER_LOG_FILE='mysqld-bin.000639',  MASTER_LOG_POS=1041257298;
START SLAVE;

Then "SHOW SLAVE STATUS\G" commands showed that replication worked again (Seconds_Behind_Master is decreasing).

20120203/08:51: ippc19

ippc17 experienced crashes yesterday and tables were corrupted (see http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/PS1_IPP_CzarLog_20120130#Friday:2012.02.03). Repairing the tables crashed the replication server: the error messages were about "Duplicate entr[ies]"

I ran the usual: STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; about 16 times to fix it.

20120131/08:57: ipp001

Ken Smith told us about ipp001 unavailability. The mysql server on that host crashed for some unknown reason (it was 62000 seconds behind so I assume it happened around 15:00 HST on Jan 30? Maybe it's related to the global loss of connection to the development cluster that happened around that time). I restarted the mysql server and had to fix it with STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;

2011

20111121/08:47: ippdb03/ipp001

Two instances of czarpoll were started at the same time, running two "OPTIMIZE TABLE burntool". Replication broke on both clients. The message on both clients was: "Query partially completed on the master (error on master: 1053) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'OPTIMIZE TABLE burntool'"

Fixed with STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;
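
Had the plain skip not been enough, the error message's own recommendation would amount to the following on each slave (a sketch; I am assuming burntool lives in gpc1):

STOP SLAVE;
OPTIMIZE TABLE gpc1.burntool;  -- rerun the aborted query manually on the slave
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;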

20111117/16:23: ipp001 crashed

mysqld crashed. Nothing interesting in mysqld.err. Once restarted, slave resumed normally.

20111115/09:23: ippc19 errors

Errors like 'Duplicate entry...' after ippc17 crash last night. The usual "STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;" fixed the problem.

20111109/12:11: ippdb03/ipp001 errors

Alterations to gpc1:

mysql> ALTER TABLE diffInputSkyfile DROP PRIMARY KEY, ADD PRIMARY KEY(diff_skyfile_id);
Query OK, 11661867 rows affected (12 min 37.50 sec)
Records: 11661867  Duplicates: 0  Warnings: 0
mysql> ALTER TABLE diffInputSkyfile CHANGE COLUMN skycell_id  skycell_id VARCHAR(64) DEFAULT NULL;
Query OK, 11661867 rows affected (11 min 28.22 sec)
Records: 11661867  Duplicates: 0  Warnings: 0
mysql> UPDATE dbversion set schema_version = '1.1.71', updated= CURRENT_TIMESTAMP();
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

The first statement broke replication on both ippdb03 and ipp001 (because of the multiple diff_skyfile_id = 10264768 rows in diffInputSkyfile; the error message was: "ERROR 1062 (23000): Duplicate entry '10264768' for key 1"). On both replicas I ran:

mysql> STOP SLAVE;
mysql> delete from diffInputSkyfile where diff_id = 162937 and skycell_id = 'skycell.1485.049';
mysql> update diffInputSkyfile set diff_skyfile_id = 10264767 where diff_id = 162937 and skycell_id = 'skycell.1485.059';

and re-ran the ALTER command:

mysql> ALTER TABLE diffInputSkyfile DROP PRIMARY KEY, ADD PRIMARY KEY(diff_skyfile_id);

and then:

mysql> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;
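
A sanity check worth running after hand-editing the rows (my addition, not in the original notes): confirm that no duplicate diff_skyfile_id values remain before the primary key is re-added.

mysql> SELECT diff_skyfile_id, COUNT(*) AS n FROM diffInputSkyfile GROUP BY diff_skyfile_id HAVING n > 1;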

20110921/13:56: ippdb03 error

From Anonymous: "screwed-up the replication to ippdb03 while adding the new ippMonitor user (i didn't think i could update a slave, so believed i was safe). i've tried to fix it" The usual STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE; fixed it

20110913/1720: ippdb03 "error"

Not really an error, actually. isp was being updated with this statement (see SHOW FULL PROCESSLIST or look at mysqld.err):

INSERT INTO summitExp SELECT NULL, incoming.*, NULL, 0, NULL FROM incoming LEFT JOIN summitExp
  USING(exp_name, camera, telescope)   
  WHERE summitExp.exp_name is NULL AND summitExp.camera is NULL AND summitExp.telescope is NULL

I killed the process in the mysql server and skipped the statement with STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;

Note: I had to do it twice.

20110913/0930: ipp001 error

Error 'The table 'incoming' is full' on query. Default database: 'isp'. Query: 
'INSERT INTO incoming (exp_name, camera, telescope, dateobs, exp_type, uri) VALUES
 ('o5253i0240o01', 'isp', 'ps1', '2010-02-26 09:29:39', 'OBJECT', 
'http://otis3.ifa.hawaii.edu/ds/skyprobe/o5253i0240o01/index.txt')'

I don't understand why this _temporary_ table can't be created. I dropped isp from replication.
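
For next time (an assumption, not something that was checked here): 'The table ... is full' usually means a MEMORY/temporary table hit its size limit, or that the disk/tablespace holding the table ran out of room. These commands expose the relevant engine and limits:

SHOW TABLE STATUS FROM isp LIKE 'incoming';   -- engine and current size
SHOW VARIABLES LIKE 'max_heap_table_size';
SHOW VARIABLES LIKE 'tmp_table_size';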

20110801/1505: ippdb03 error

ippdb03 crashed often during the last weekend. SHOW SLAVE STATUS\G showed this error message:

*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: ippdb01.ifa.hawaii.edu
                Master_User: repl_gpc1
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: mysqld-bin.020270
        Read_Master_Log_Pos: 962746384
             Relay_Log_File: mysqld-relay-bin.000005
              Relay_Log_Pos: 24123883
      Relay_Master_Log_File: mysqld-bin.020269
           Slave_IO_Running: Yes
          Slave_SQL_Running: No
            Replicate_Do_DB: gpc1,czardb,ippadmin,isp
        Replicate_Ignore_DB: 
         Replicate_Do_Table: 
     Replicate_Ignore_Table: 
    Replicate_Wild_Do_Table: 
Replicate_Wild_Ignore_Table: 
                 Last_Errno: 0
                 Last_Error: Could not parse relay log event entry. The possible reasons are: the master's binary
 log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is 
corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the 
master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be 
able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
               Skip_Counter: 1
        Exec_Master_Log_Pos: 24123745
            Relay_Log_Space: 2036505138
            Until_Condition: None
             Until_Log_File: 
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File: 
         Master_SSL_CA_Path: 
            Master_SSL_Cert: 
          Master_SSL_Cipher: 
             Master_SSL_Key: 
      Seconds_Behind_Master: NULL

The relay log was corrupted. I ran 'mysqlbinlog mysqld-relay-bin.000005 > /dev/null' and saw an error message looking like:

ERROR: Error in Log_event::read_log_event(): 'read error', data_len: 0, event_type: 0
Could not read entry at offset <some value>:Error in log format or read error

I googled a bit and found this, which fixed the problem: http://www.mysqlperformanceblog.com/2008/08/02/troubleshooting-relay-log-corruption-in-mysql/

Precisely, I ran: CHANGE MASTER TO MASTER_LOG_FILE='mysqld-bin.020269', MASTER_LOG_POS=24123745;

The problem seems to have been fixed
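
For reference, the recipe from that article boils down to discarding the corrupt relay logs by re-pointing the slave at the last position it actually executed, i.e. Relay_Master_Log_File / Exec_Master_Log_Pos from the status dump above:

STOP SLAVE;
CHANGE MASTER TO MASTER_LOG_FILE='mysqld-bin.020269', MASTER_LOG_POS=24123745;  -- this also purges the old relay logs
START SLAVE;
SHOW SLAVE STATUS\G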

20110727/1410: ippdb03 Replication Error

ippdb03 crashed. The SHOW SLAVE STATUS messages were like: 'Duplicate entry ...'

I applied STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE; about twenty times and everything looks fine now.

20110726/1810: ippdb03 Replication Error

ippdb03 crashed. The SHOW SLAVE STATUS messages were like: 'Duplicate entry ...'

I applied STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE; four times and everything looks fine now.

20110429/0850: gpc1 Replication Error

From the mysql slave on ippc02 (SHOW SLAVE STATUS): Query partially completed on the master (error on master: 1317) and was aborted. There is a chance that your master is inconsistent at this point. If you are sure that your master is ok, run this query manually on the slave and then restart the slave with SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'drop table lapExp'.

I checked with Chris. I manually dropped the table in gpc1 and did a 'STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;'. Hopefully we will not have to ingest a new dump once again...

20110413/1450: ippadmin Replication Error

  • Logged onto ipp001
  • Started mysql client as root
  • 'SHOW SLAVE STATUS\G' shows the following error:
    Last_Error: Error 'Duplicate entry '26' for key 1' on query. Default database: 'ippadmin'. Query: 'insert into users (username, password) values ('denver', '********')'
    
  • I did the usual 'STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;'
  • The slave is 172833 seconds (i.e. 2 days) behind its master. It clearly shows that no one looks at the replication server statuses.

20110307/0830: Nebulous Replication Error

  • Logged onto ippdb02
  • Started mysql client as root
  • 'SHOW SLAVE STATUS\G' shows the following error:
    Error 'Table 'nebulous.mountedvol' doesn't exist' on query. 
    Default database: 'nebulous'. 
    Query: 'REPLACE INTO mountedvol SELECT vol_id,name,host,path,allocate,available,xattr,mountpoint, 
           '10651598144', '8973230080',note FROM volume WHERE mountpoint = '/data/ipp004.0''
    
  • But, if I check the contents of the database, I can see that nebulous.mountedvol exists! It seems that the replicated nebulous did not make it through last week's optimization.
  • I'm setting up a new replicated server.
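  • For the record, a sketch of how I would seed the new replica (the host, replication account, and coordinates below are placeholders/assumptions, not the values actually used): restore a dump of nebulous taken on the master with --master-data, then point the new slave at the binlog coordinates recorded in that dump and start it:
      CHANGE MASTER TO
        MASTER_HOST='ippdb00.ifa.hawaii.edu',  -- assumption: the nebulous master
        MASTER_USER='repl_neb',                -- placeholder replication account
        MASTER_PASSWORD='********',
        MASTER_LOG_FILE='mysqld-bin.000001',   -- placeholder: take from the dump header
        MASTER_LOG_POS=4;                      -- placeholder: take from the dump header
      START SLAVE;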

2010

20101116/1659: gpc1 Replication Error

20101104/1429: gpc1 Replication Error

Problem detected through czartool

  • Logged onto ipp001
  • Started mysql client as root
  • 'SHOW SLAVE STATUS\G' showed (shortened)
              Slave_SQL_Running: No
                     Last_Errno: 1062
                     Last_Error: Error 'Duplicate entry '23' for key 1' on query. Default database: 'ippadmin'. 
                                 Query: 'INSERT into cookies (username, cookie) values ('psobs', '99497')'
    
  • Local non-replicated database involved... Statement skipped (in MySQL client):
      STOP SLAVE;
      SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; 
      START SLAVE;
    
    on one line:
      STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;
    
  • Then 'SHOW SLAVE STATUS\G' showed (shortened)
              Slave_SQL_Running: Yes
                     Last_Errno: 0
                     Last_Error: 
    
  • Czartool didn't show any problem after the fix