PS1 IPP Czar Logs for the week 2011.03.14 - 2011.03.20

Monday : 2011.03.14

  • czar heather
  • 10:44 heather started addstar - no merging
  • 10:30 or so Bill fixed some failed warps due to a corrupt camera mask file.
  • 11:18 Bill doubled the number of hosts working on distribution since it is so far behind.
  • 12:54 reran --warp_id 171166 --skycell_id skycell.0852.098 because one of the output files was corrupted.
  • 13:48 heather restarted distribution with 3xdefault hosts

Tuesday : 2011.03.15

  • 07:00 Bill: lots of red on czartool. Summit copy and registration are far behind. The problem appears to be that nebulous is overloaded and not responding in a timely fashion. Stopped most processing to investigate. Sample error messages:
    500 Can't connect to (connect: timeout) at /home/panstarrs/ipp/psconfig//ipp-20110218.lin64/lib/Nebulous/ line 852
    Unable to perform neb-locate: 9 at /home/panstarrs/ipp/psconfig//ipp-20110218.lin64/bin/ line 233.
    Unable to stat Nebulous handle neb://ipp030.0/gpc1/20110315/o5635g0395o.310644/o5635g0395o.310644.reg.ota37.log at /home/panstarrs/ipp/psconfig//ipp-20110218.lin64/bin/ line 61
  • cleanup has a very large backlog of jobs. It looks like it's in the mode where it takes 10 minutes per (sparse) chipRun. Replication was running as well. With just summit copy, registration, and cleanup running the errors have stopped.
  • 07:09 restarting stdscience and distribution. Cleanup is stopped but at this rate it will take a while to empty the queues.
  • 7:26 put two repeatedly failing skycells out of their misery.
    mysql> update diffSkyfile set fault=0, quality=42 where diff_id = 115074 and skycell_id ='skycell.1455.059';
    Query OK, 1 row affected (0.03 sec)
    Rows matched: 1  Changed: 1  Warnings: 0
    mysql> update diffSkyfile set fault=0, quality=42 where diff_id = 115913 and skycell_id ='skycell.2517.092';
    Query OK, 1 row affected (0.00 sec)
    Rows matched: 1  Changed: 1  Warnings: 0
  • 08:21 turning replication back on
  • 08:43 found 103 magicDSRuns in state failed_revert. Set them to new. Turned off destreak to see if the faults repeat.
  • 08:52 all of the faults reverted successfully. revert off destreak on
  • 09:24 destreak revert back on.
  • 09:40 cleanup set to run with poll limit of 4
  • 11:30 heather reverted 4 skycells for MD08.jtrp and added them back into stdsci - 3 are fault 2 (related to various ipps being down/rebooted lately). One is a 'fix this code'. Once these fail (or succeed) heather is going to send them to the cleaners.
  • 11:40 heather set MD08.jtrp to be cleaned. the 3 fault 2s succeeded and the fault4 faulted again.
  • 11:45 All processing is stopped
  • 12:18 nebulous replication Parameters:
    *************************** 1. row ***************************
                File: mysqld-bin.002221
            Position: 1068308831

Dump is on ippdb00:

mysqldump --master-data nebulous -u root > /export/ippdb00.0/ipp/nebulous/nebulous_dump-20110315T121800.sql

For your records: Beginning: 2011-03-15 12:17:51 End: 14:41:03 (that is about 2h30). Size of dump: 174GB (185851128043 bytes)

  • 13:24 Finally got the lock on ippdb01 (1 hour to complete. Weird)
    | File              | Position | Binlog_Do_DB | Binlog_Ignore_DB |
    | mysqld-bin.018618 |  4201327 |              |                  | 

For information: Beginning: 2011-03-15 13:24:40; End: 2011-03-15 13:58:17; (that is about 30 minutes) Size of dump: 49GB (52416179007)
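As a cross-check on the two dump summaries above, the elapsed times and sizes can be recomputed from the logged timestamps and byte counts. This is only a sketch: the helper name summarize_dump is made up for illustration, and it assumes that both timestamps of the ippdb00 dump fall on 2011-03-15 and that the GB figures are binary units (2^30 bytes).

```python
from datetime import datetime

def summarize_dump(start, end, size_bytes):
    """Return (elapsed time as a timedelta, size in GiB) for one dump."""
    fmt = "%Y-%m-%d %H:%M:%S"
    elapsed = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return elapsed, size_bytes / 2**30

# Timestamps and byte counts copied from the log entries above.
ippdb00_dump = summarize_dump("2011-03-15 12:17:51", "2011-03-15 14:41:03", 185851128043)
ippdb01_dump = summarize_dump("2011-03-15 13:24:40", "2011-03-15 13:58:17", 52416179007)

for host, (elapsed, gib) in (("ippdb00", ippdb00_dump), ("ippdb01", ippdb01_dump)):
    print(f"{host}: elapsed {elapsed}, size {gib:.1f} GiB")
```

Run as written, this gives 2:23:12 and about 173.1 GiB for the ippdb00 dump and 0:33:37 and about 48.8 GiB for the ippdb01 dump, close to the figures quoted in the log, which appear to be rounded up.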

  • 14:38 heather added magictest.3Pi.200110309.a to ipp/distribution to run magic
  • 16:09 copy to ippdb02 is finished. I'm now 'cat nebulous_dump-20110315T121800.sql | mysql nebulous'. It finished the day after so it took 24 hours 17 minutes.
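
The 7:26 entry above marks two repeatedly failing skycells as bad by hand. A minimal sketch of the same update done from a script with bound parameters follows. It uses Python's built-in sqlite3 with an in-memory stand-in table so that it runs anywhere; the real database is MySQL, the reduced table layout and the starting fault values are assumptions, and mark_skycell_bad is a hypothetical helper, not an IPP tool.

```python
import sqlite3

def mark_skycell_bad(conn, diff_id, skycell_id, quality=42):
    """Clear the fault on one diffSkyfile row and flag it with the given quality."""
    cur = conn.execute(
        "UPDATE diffSkyfile SET fault = 0, quality = ? "
        "WHERE diff_id = ? AND skycell_id = ?",
        (quality, diff_id, skycell_id),
    )
    conn.commit()
    return cur.rowcount  # number of rows changed

# Stand-in table: only the columns used by the logged UPDATE statements.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE diffSkyfile "
    "(diff_id INTEGER, skycell_id TEXT, fault INTEGER, quality INTEGER)"
)
targets = [(115074, "skycell.1455.059"), (115913, "skycell.2517.092")]
# The starting fault value (4) is invented for the demonstration.
conn.executemany(
    "INSERT INTO diffSkyfile VALUES (?, ?, 4, 0)", targets
)

changed = [mark_skycell_bad(conn, diff_id, skycell) for diff_id, skycell in targets]
rows = conn.execute("SELECT fault, quality FROM diffSkyfile").fetchall()
print(changed, rows)
```

Checking the returned row count plays the same role as the "Rows matched: 1  Changed: 1" lines in the mysql session: a count other than 1 means the diff_id or skycell_id was mistyped.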

Wednesday : 2011.03.16

  • ran pubtool -revert. One publishRun faulted again. This is 196772 from March 11. It got queued even though there are no skycells in diff_id 115106 that have good quality. I deleted the 577MB log file, reverted one more time to get a record of the fault, and then set the run state to 'drop'. I have filed ticket 1467 recording the problem. We either need to not queue the run or handle the nothing-to-do case gracefully. Perhaps posting an empty detections file would be useful for MOPS.
  • Bill removed ippdb02 from cleanup. ippdb02 should stay off any activity while the nebulous replication server is rebuilding.
  • o5635g0509 was not burntooled yesterday. Bill investigated and found that 79 other exposures were in a similar state; he opened ticket 1468.
  • Bill and Gene looked into why distribution pantasks is not keeping the hosts busy. Played around with the timing parameters a bit. Nothing seemed to help permanently. Restarted pantasks around 11:10
  • now that burntool is done, heather added MD09.jtrp to ipp's stdscience
  • 15:04 heather removed MD09.jtrp
  • 16:26 ingestion of nebulous on ippdb02 just finished
  • 16:40 replication of Nebulous on ippdb02 is running
  • 17:00 Nebulous replication slave is 91520 seconds behind master
  • 18:28 Bill found a number of magicDSRuns in state 'failed_revert'. Set them back to new. Reverts seem to be working. (I really need to handle this automatically)
  • 19:43 diff.revert.on in stdscience. This should probably be on during the night. Off during the day so we can evaluate errors.
  • 20:07 All MOPS 3PI data for the last two nights have been published. Thanks Bill.
  • 20:08 Nebulous replication slave is 33651 seconds behind master
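
The 18:28 entry notes that resetting magicDSRuns from 'failed_revert' to 'new' ought to be handled automatically. One possible shape for such a job is sketched below, using sqlite3 as a stand-in for the MySQL database. The table name magicDSRun, its columns, the sample ids, and the function reset_failed_reverts are all assumptions for illustration; only the state names come from the log.

```python
import sqlite3

def reset_failed_reverts(conn):
    """Set every run in state 'failed_revert' back to 'new'; return the ids reset."""
    ids = [
        row[0]
        for row in conn.execute(
            "SELECT magic_ds_id FROM magicDSRun "
            "WHERE state = 'failed_revert' ORDER BY magic_ds_id"
        )
    ]
    conn.execute("UPDATE magicDSRun SET state = 'new' WHERE state = 'failed_revert'")
    conn.commit()
    return ids

# Stand-in table with invented ids and a mix of states.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE magicDSRun (magic_ds_id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO magicDSRun VALUES (?, ?)",
    [(1, "failed_revert"), (2, "full"), (3, "failed_revert"), (4, "new")],
)

reset_ids = reset_failed_reverts(conn)
states = dict(conn.execute("SELECT magic_ds_id, state FROM magicDSRun"))
print(reset_ids, states)
```

Returning the ids leaves a record of what was reset, so runs that keep landing in 'failed_revert' can still be evaluated by hand.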

Thursday : 2011.03.17

Bill is czar today.

  • 06:30 All nodes are out of space except for the big ones. They are overloaded and generating lots of faults. So far only 299 of 689 science exposures have been downloaded and registered. It looks like we got
  • To reduce the load Bill shut down distribution and restarted it with the usual number of active hosts.
  • ipp035 couldn't speak NFS with ipp015. Force umount fixed the problem.
  • 7:44 registration is not progressing. The pending burntool query is timing out repeatedly. Increased the timeout value for the task to 300 seconds up from 30. It finds nothing to do.
  • 8:00 stopping registration and cleanup to see if the database load shrinks.
  • 8:08 once registration queue emptied the load on ippdb01 dropped significantly and load on other machines went up a bit. It looks like the backup of gpc1 and isp to ipp001 is in progress. Restarting registration.
  • 8:51 Gavin killed the 8am gpc1 dump to let the 4am one finish.
  • 8:52 Nebulous replication slave is now synchronized with its master.
  • 10:05 Installing gpc1 replication on ippc02 (ingestion started at 10:05:10, finished at RAAAAAAAAAAAAAAAAAAAH! I killed mysql server while attempting to install apache/nebulous. #@%$ Gentoo!)
  • 10:10 regtool -pendingburntoolimfile is timing out even with the task timeout set to 5 minutes. I've asked Gavin to restart mysqld on db1.
  • 10:53 After the mysql restart things are moving along fine.
  • 12:22 Installing gpc1 replication on ippc02 (started at 12:22:26; finished at TODO)
  • 16:01 apache server on ippdb00 stopped. mysql server restarted
  • 17:22 serge removed ippc02 from publishing (bill did it from other services)
  • 20:33 heather restarted her stdscience and added magictest.20110316 back into it.
  • 20:33 heather set to clean all mopstestsuite% and magictest% (except for the 2 that are currently interesting) - she did this sometime in the middle of the day
  • 23:28 bill says that things are running smoothly since the database restarts. Since distribution is the bottleneck added another set of hosts to work on it.
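
The lag figures logged on Wednesday (91520 seconds behind at 17:00, 33651 at 20:08) give a rough idea of how fast the ippdb02 slave was catching up. The sketch below extrapolates linearly from those two samples. The constant-rate assumption and the function name estimate_catch_up are illustrative only, and the real rate depends on the load on the master. The log records only that the slave was synchronized by 08:52 on Thursday, not when it actually caught up.

```python
from datetime import datetime, timedelta

def estimate_catch_up(t1, lag1, t2, lag2):
    """Linearly extrapolate two (time, lag in seconds) samples to zero lag.

    Returns (rate, eta): rate is seconds of lag cleared per wall-clock second,
    eta is the projected time of zero lag, or None if the lag is not shrinking.
    """
    wall_seconds = (t2 - t1).total_seconds()
    rate = (lag1 - lag2) / wall_seconds
    if rate <= 0:
        return rate, None
    return rate, t2 + timedelta(seconds=lag2 / rate)

# Samples copied from the Wednesday 17:00 and 20:08 entries.
first = datetime(2011, 3, 16, 17, 0)
second = datetime(2011, 3, 16, 20, 8)
rate, eta = estimate_catch_up(first, 91520, second, 33651)
print(f"lag cleared per second: {rate:.2f}, projected catch-up: {eta:%Y-%m-%d %H:%M}")
```

With these two samples the slave was clearing roughly 5.1 seconds of lag per second, which projects to catching up shortly before 22:00 on Wednesday.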

Friday : 2011.03.18

Bill is czar today

  • 04:00 - 05:30 gpc1 database became very slow. registration/burntool stalled. Stopping and starting mysql "solved" the problem.
  • ippc17 (data store server) unable to serve files from ipp007. force.umount didn't work. Rebooted ipp007.
  • 09:15 stdscience was waiting for nonexistent jobs to finish and wasn't making progress. Restarted it. The problem was that it was waiting for jobs on ipp050 that had already finished. Now most of the stuff left to run wants that node.
  • 10:00 CZW: we had ~80 DOMEFLAT exposures that did not register properly overnight. These were taken as camera-controlled exposures, and although initially they had the data_state set to 'full', a later check in the script reset the data_state to 'pending_burntool'. I fixed this bug in, and manually set the data_state values to 'full', as this is what the corrected script would have done.
  • 11:30 psphot is exploding its memory footprint on chip XY73. These data are on ipp050. This caused the machine to crash twice. Marked the affected chips as quality = 42.
  • 11:40 Shut down ipp for rebuild.

Saturday : 2011.03.19

  • 00:22 Bad weather but busy evening
    • ThreePi.136 finally finished destreak. Set to be cleaned.
    • queued ThreePi.rerun consisting of exposures never released from regions that Bertrand is interested in.
    • started distribution on CNP. It was tricky due to label issues, but it is now taken care of.
  • 07:00 Bad weather continued for most of the night. CNP finished distribution. ThreePi.rerun made progress. Warps were backed up because the book warpPendingSkyCell was full of entries in state DONE. Turned warp off, let the pending jobs finish, and then did warp.reset.
  • 15:21 doubled the number of hosts working on distribution.

Sunday : 2011.03.20

  • 23:10 - heather has been testing new addstar stuff. addstar is running as heather right now. heather also restarted her magictests - they got stalled (?). Currently doing the diffs for the magictest.
