PS1 IPP Czar Logs for the week 2011.06.27 - 2011.07.03


Monday: 2011-06-27

  • 07:17 CZW: marked exposure impacted by loss of summit computer as drop to allow registration to proceed:

pztool -updatepzexp -exp_name o5739g0084o -inst gpc1 -telescope ps1 -set_state drop -summit_id 348983

  • 08:00 (Roy) 3 exposures not registered. Looks like one or all of summitcopy/burntool/registration were down most of the night.
  • 10:58 serge: in stdscience:
    del.label LAP.ThreePi.20110621
    
  • 15:17 serge stopped gpc1 pantasks servers. heather stopped addstar and isp-related ones
  • 15:53 serge started dump of gpc1 czardb isp ippadmin ipptopsps to /export/ippdb01.0/mysql_gpc1.backup/gpc1.20110627.sql.bz2
    mysqldump [hidden connection information] --databases gpc1 czardb isp ippadmin | ~ipp/local/bin/pbzip2 > gpc1.20110627.sql.bz2
    

About the master:

mysql> SHOW MASTER STATUS;
+-------------------+----------+--------------+------------------+
| File              | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+------------------+
| mysqld-bin.020077 | 84637877 |              |                  | 
+-------------------+----------+--------------+------------------+
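
For reference, these coordinates are what a replica needs in order to resume from this dump; a sketch of the corresponding commands on ippdb03 (credentials omitted, and the exact invocation used may have differed), using the repl_gpc1 account that shows up in the slave status below:

    # point the slave at the master binlog position noted above, then start it
    mysql -h ippdb03 -e "CHANGE MASTER TO MASTER_HOST='ippdb01.ifa.hawaii.edu',
        MASTER_USER='repl_gpc1', MASTER_LOG_FILE='mysqld-bin.020077',
        MASTER_LOG_POS=84637877; START SLAVE;"
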
  • 17:09 Pantasks servers have been restarted
  • 17:45 CZW: Finally remembered to queue the cleanup of LAP/20110505 distribution bundles for bad stacks.
  • 18:26 CZW: Rebuilt psModules to pick up stacking speed improvements. This may not be sufficient, however, so more work is likely needed.

Tuesday : 2011-06-28

  • 05:30 Serge: gpc1 is replicated on ippdb03 (ingestion on ipp001 is not finished yet)
  • 08:13 Roy: all exposures downloaded. Will add the LAP.ThreePi.20110621 label back in when 3PI is done
  • 09:15 Serge: gpc1 is replicated on ipp001
  • 09:35 Serge: gpc1, ippadmin, and isp dumps are distributed as they should be. I will keep an eye on them though
  • 12:00 roy: ipp033 is down on Ganglia, but has network. Gavin is on the case.
  • 13:50 roy: ipp033 is back, but /data/ipp033.0/ is not visible from other machines
  • 14:00 Serge: replication broken on both hosts. ippdb01 is caching the hostname of ippdb03 with the old hostname value (ippc00). I don't think it's the explanation for the failure though. I'm ingesting yesterday's dump in ippdb03 once again. I will not ingest it in ipp001. I leave the old ingestion schema activated (ingestion in gpc1_0 and gpc1_1).
  • 15:15 - 15:40 Bill: regenerated several corrupt files using the scripts in ipp/tools
    corrupt file on ipp031 caused diff failure --warp_id 215094 --skycell_id skycell.1276.049
    corrupt file on ipp018 caused diff failure --warp_id 215061 --skycell_id skycell.0962.042
    corrupt file on ipp051 caused warp failure --cam_id 226915
    corrupt file on ipp008 caused magic failure --diff_id 141410 --skycell_id skycell.1992.136
    corrupt file on ipp012 caused magic failure --diff_id 141396 --skycell_id skycell.1992.143
    corrupt file on ipp012 caused magic failure --diff_id 141386 --skycell_id skycell.1992.134
    corrupt file on ipp012 caused distribution failure. Fixing this required faulting the corrupted magicDSFile and setting the states in the DB back to new.
  • 15:35 turned magic revert and dist revert off to track down some corrupt files
  • 19:38 Serge: Replication started on ippdb03
  • 19:47 Bill: turned off diff processing temporarily.
  • 19:43-19:53 Serge: Replication seems to have failed again. Last exp_id on ippdb01 is 354060 while on ippdb03 it's 353683.

The SHOW SLAVE STATUS command shows that everything's fine on the replication host:

            Slave_IO_State: Waiting for master to send event
                Master_Host: ippdb01.ifa.hawaii.edu
                Master_User: repl_gpc1
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: mysqld-bin.020091
        Read_Master_Log_Pos: 99203082
             Relay_Log_File: mysqld-relay-bin.000031
              Relay_Log_Pos: 99203220
      Relay_Master_Log_File: mysqld-bin.020091
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
            Replicate_Do_DB: gpc1,czardb,ippadmin,isp
        Replicate_Ignore_DB: 
         Replicate_Do_Table: 
     Replicate_Ignore_Table: 
    Replicate_Wild_Do_Table: 
Replicate_Wild_Ignore_Table: 
                 Last_Errno: 0
                 Last_Error: 
               Skip_Counter: 0
        Exec_Master_Log_Pos: 99203082
            Relay_Log_Space: 99203220
            Until_Condition: None
             Until_Log_File: 
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File: 
         Master_SSL_CA_Path: 
            Master_SSL_Cert: 
          Master_SSL_Cipher: 
             Master_SSL_Key: 
      Seconds_Behind_Master: 0
1 row in set (0.00 sec)

On the master:

| 204458295 | repl_gpc1 | ippdb03.ifa.hawaii.edu:35103 | NULL   | Binlog Dump |   574 | Has sent all binlog to slave; waiting for binlog to be updated |

ippdb03 log does not show any error:

110628 19:37:42 [Note] Slave I/O thread: connected to master 'repl_gpc1@ippdb01.ifa.hawaii.edu:3306',  replication started in log 'mysqld-bin.020077' at position 84637877
110628 19:37:42 [Note] Slave SQL thread initialized, starting replication in log 'mysqld-bin.020077' at position 84637877, relay log './mysqld-relay-bin.000001' position: 4
EOF

... and the ippdb01 log is empty (size 0) (it doesn't seem to be redirected to some other file according to /etc/mysql/my.cnf)
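
For reference, the exp_id comparison above can be reproduced with something like the following (a sketch; it assumes the quoted exp_id values come from gpc1's rawExp table):

    # highest registered exposure on the master vs. on the replica
    mysql -h ippdb01 gpc1 -e 'SELECT MAX(exp_id) FROM rawExp;'
    mysql -h ippdb03 gpc1 -e 'SELECT MAX(exp_id) FROM rawExp;'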

Note: While looking at the PROCESSLIST table of ippdb01, I saw that:

| 204467208 | unauthenticated user | ipp034.ifa.hawaii.edu:59761  | NULL   | Connect     |  NULL | Reading from net                                               | NULL                                                                                                 | 

(It's likely unrelated but I wanted to write it down)

I don't like this...

Bill is czar tomorrow and might as well start now.

  • 20:00 found a number of stuck ppSub processes (one with no activity since June 26). Killed the processes, cleared the diff book in stdscience, and turned diff back on (see the sketch after this list)
  • 20:07 distribution pantasks was not queuing jobs quickly and had fallen way behind. Restarted.
  • 20:15 moved a number of log files with storage objects but no instances to the trash.
  • 20:31 several LAP cam runs are faulting repeatedly; looks like corrupt chip files. Set camera.revert.off
  • 20:45 Actually 10 of the LAP camera faults are due to "no reference stars found"
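
A sketch of the kind of check behind the 20:00 stuck-ppSub entry, run per node (the pattern is illustrative; inspect before killing anything):

    # list ppSub processes with their elapsed run times; anything running for
    # days is a candidate for killing
    ps -eo pid,etime,args | grep '[p]pSub'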

Wednesday : 2011-06-29

  • 09:00 Lots of red on the board but processing making progress. Turned off reverts for warp, diff, magic, destreak, and dist in order to get a clear idea of the situation.
  • 09:15 set LAP.ThreePi.20110621 label to inactive.
  • Corrupt files repaired
    corrupt file on ipp029 --chip_id 249841 --class_id XY46 caused several skycells to fail. Fixed.
    corrupt file on ipp012 magic_ds_id 571438 skycell.1365.087 Faulted component, set the magicDSRun back to new.
    corrupt file on ipp012 diff 141521 skycell.1023.049
    corrupt file on ipp027 diff 141510 skycell.0851.085
    corrupt file on ipp027 diff 141597 skycell.1024.064
    
  • md08 stack 323934 faults repeatedly with "known psLib error"; set state to drop with a note
  • LAP diff 141693 --skycell_id skycell.1376.082 has faulted several times with the error below. Set quality=42
              found 3 footprints: 0.088260 sec
    Specified number of rows (0) or columns (0) is invalid.
    Backtrace depth: 13
    Backtrace 0: p_psAbort
    Backtrace 1: (unknown)
    Backtrace 2: pmSetFootprintArrayIDs
    Backtrace 3: pmFootprintArrayGrow
    Backtrace 4: psphotFindFootprints
    Backtrace 5: psphotFindDetectionsReadout
    Backtrace 6: psphotFindDetections
    Backtrace 7: psphotReadoutMinimal
    Backtrace 8: (unknown)
    Backtrace 9: (unknown)
    Backtrace 10: (unknown)
    Backtrace 11: __libc_start_main
    Backtrace 12: (unknown)
    Using Convolved images because NOCONVOLVE is FALSE
    Using Convolved images because NOCONVOLVE is FALSE
    
  • 09:53 all repeating faults corrected except for the LAP "no reference stars found" camera runs and 17 LAP stacks which I haven't debugged yet.
  • 09:59 turned the label back on to verify that the magic runs can complete. Turned it back off a few minutes later
  • 10:08 Queued the rest of the STS.2009.b data for processing. 294 exposures. Raised priority of label above LAP. It has over 5000 stacks pending. That seems like it is enough for now.
  • 10:18 turned magic, destreak, and dist reverts back on.
  • 10:52 all but 7 nodes are out of space. Set stdscience to stop (a disk-usage check is sketched after this list).
  • 11:10 restarted stdscience with a minimal set of nodes and chip processing off. Hopefully we can limp through the rest of last night's data without creating too many corrupt files. All nightly science chips are done except sts.
  • 15:35 restarted stdscience with default.hosts.
  • 16:15 stack was stuck with a bunch of apparently never ending quickstacks. Restarted pantasks to allow nightly stacks to complete.
  • 17:17 set STS.2009.b and LAP.threePi.20110621 to active
  • 17:38 lots of LAP publish runs backed up. Ah, the label was not in the list. Added it. Turned publish.revert.off
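
A sketch of the kind of disk check behind the 10:52 out-of-space entry (the host list is illustrative only; Ganglia shows the same thing):

    # check free space on a few of the storage nodes' data partitions
    for h in ipp018 ipp027 ipp029 ipp030; do
        ssh $h df -h /data/$h.0
    done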

Thursday : 2011-06-30

  • 05:33 nightlyscience is progressing with few errors but the completion rate seems quite slow: 220 chip runs in 10 hours. All pending LAP stacks that have inputs available are done so I did "hosts add compute2" in stdscience.
  • 09:11 Serge stopped processing: ./check_system.sh stop
  • 09:57 Serge: dump started:
    mysqldump [hidden] --master-data=1 --single-transaction --databases gpc1 ippadmin isp czardb 
       | ~/local/bin/pbzip2 
       > /export/ippdb01.0/mysql_gpc1.backup/gpc1.20110630.sql.bz2
    

From dump file:

CHANGE MASTER TO MASTER_LOG_FILE='mysqld-bin.020111', MASTER_LOG_POS=17908812;
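
Because the dump was taken with --master-data=1, that CHANGE MASTER line sits near the head of the file and can be pulled out without decompressing everything, e.g. (sketch):

    # --master-data writes the binlog coordinates at the top of the dump
    pbzip2 -dc /export/ippdb01.0/mysql_gpc1.backup/gpc1.20110630.sql.bz2 \
        | head -n 50 | grep 'CHANGE MASTER TO'
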
  • 10:55 Serge: processing restarted.
  • 10:57 The dump is being ingested into ippdb03 with a full db recreation (i.e. rm -rf <contents of mysql datadir> + mysql_install_db ...)
  • 11:00 The dump has been copied to ipp001: /export/ipp001.0/ipp/mysql-dumps/ippdb01_manual_dump.20110630.sql.bz2
  • 11:30 replication pantasks has been down since 7:08 this morning. Restarted it.
  • 12:40 (approximate) added another set of compute nodes to stdscience "hosts add compute"
  • 13:42 stdscience restarted
  • 15:45 hosts add compute2 and hosts add compute to stdscience
  • 16:02 noticed that ipp030 is in nebulous state repair. Set it to up.
  • 16:57 gpc1 replication is back. Various figures:
    • Average transfer speed between ippdb03 and ippdb01 is 58.4MB/s
    • Average transfer speed between ipp001 and ippdb01 is 32.0MB/s
    • The bzip2-compressed dump has grown by 42 MB in three days (was: 7360867848 bytes, is: 7404921969 bytes).
    • gpc1 size on ippdb01 is 116G (120746524) while on the fresh ippdb03 it's only 98G (102253712)... When was the last optimization? (Per-table sizes can be compared as sketched after this list.)
  • 18:08 nightlyscience 20110630 chipRuns all done. Set chip.off
  • 20:17 all but 6 warps done. set chip.on
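
On the 116G vs 98G question at 16:57: the gap is presumably free space and fragmentation inside the master's table files that the fresh reload on ippdb03 reclaimed. Per-table footprints can be compared on both hosts with something like this (sketch; information_schema sizes are approximate):

    # largest gpc1 tables by data+index size, in GB
    mysql -h ippdb01 -e "SELECT table_name,
        ROUND((data_length+index_length)/POW(1024,3),1) AS size_gb
        FROM information_schema.TABLES
        WHERE table_schema='gpc1' ORDER BY size_gb DESC LIMIT 10;"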

Friday : 2011-07-01

Serge is czar but all is quiet?!

  • 09:30 Serge: ipp001 is now replicating gpc1, ippadmin, isp, and czardb. I need to check whether it can bear the load though (a lag check is sketched after this list).
  • 10:40 Serge: chip.off to speed up last night's data processing
  • 13:00 Serge: chip.on
  • 14:15 Serge: chip.off/warp.off just long enough for the last diffs to be queued
  • 14:40 Serge: chip.on/warp.on
  • 16:52 CZW: detrend pantasks is running and should stay running until tuesday at the earliest.
  • 17:50 Serge: chip.off/warp.off again... I don't understand why they are not queued (or maybe they are and I don't see it)
  • 18:45 Serge: czartool has crashed because it tried to divide by 0.
  • 19:55 Serge: chip.on/warp.on... Roy has fixed czartool but it needs to rebuild.
  • 19:56 Serge: czartool has rebuilt. The 2 diffs are still here. Enough for today
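
For the 09:30 question of whether ipp001 can bear the replication load, a quick lag check (sketch):

    # a non-zero or growing Seconds_Behind_Master means ipp001 is not keeping up
    mysql -h ipp001 -e 'SHOW SLAVE STATUS\G' | egrep 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'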

Saturday : 2011-07-02

  • 07:50 Serge: restarted the crashed replication (last heartbeat was yesterday at 19:16?)

Sunday : 2011-07-03