PS1 IPP Czar Logs for the week 2011-03-28 - 2011-04-03


Monday: 2011-03-28

  • (serge) Started gpc1 ingestion for replication on ippc02 (at 09:52:21)
  • 14:00 CZW: Stopped all pantasks to allow for upgrade of ISP database.
  • 14:39 CZW: Beginning flush for binlog status and initial dump.
  • 14:47:31 gpc1 ingestion complete
  • 14:57 gpc1 replication started. Seconds_Behind_Master: 327235
  • 14:58 CZW: Finished updates to ISP database. Beginning flush for binlog status for final dump.
  • 15:04 CZW: Making final dump. Completed while typing. Upgrade complete. See attached logfile for details.
  • 22:44 CZW: ippdb01 had load of ~20. Stopped all pantasks, let jobs finish, and then checked the mysql processlist. Even with all jobs finished or timed out, a large number of queries were still pending. Cleared them, and restarted summitcopy and registration. Removed old dates from the stdscience pantasks, as they might be adding extra queries for completed dates (some dates did not have morning darks taken).
  • 22:47 CZW: restarting stdscience.
  • 22:50 CZW: removed old dates from registration to ensure burntool backlog can be cleared without waiting for the date to cycle around (seven old dates were still being examined).
  • 23:14 CZW: registration promoted more things to stdscience, and task timeouts again started at fake.imfile.load. Registration/summitcopy appears to be fully caught up, so I don't believe it is adding a significant load on the database. Without much else to do, turning stdscience back on. Perhaps a shutdown/restart would resolve things?
  • 23:22 CZW: continuing problems with fake.imfile.load, stopped, waiting for jobs to clear, then will restart stdscience.
  • 23:35 CZW: Removed labels we're not processing to see if that helps.
  • 23:46 CZW: Looking at the slow fake_pendingimfile run, it looks like the issue may just be a set of bad joins that slows the query down. Why this is suddenly an issue is something I don't know.
  • 00:12 CZW: Rewrote the fake_pendingimfile.sql JOINs to run more efficiently.
  • 00:23 CZW: Restarted remaining pantasks using new fake_pendingimfile.sql. Nothing seems to have exploded yet.
  • 01:06 CZW: Reverting pzDownloadImfiles that seem to be clogging up registration. Doesn't seem to be effective, probably a result of ipp050 being down. It should all continue successfully once ipp050 comes back up.
  • 01:17 CZW: Tired of waiting, moved spurious nebulous keys to $key.bak, and reverted the summit copies. This appears to have caught, and we're registering things again.
  • 01:49 CZW: One imfile refused to download. I manually constructed the command ( --uri --filename neb://any/gpc1/20110329/o5649g0208o/o5649g0208o.ota73.fits --exp_name o5649g0208o --inst gpc1 --telescope ps1 --class chip --class_id ota73 --bytes 49432320 --md5 b7fb344afbaaf56a803c7a4d1894945e --dbname gpc1 --timeout 600 --verbose --copies 2 --compress --summit_id 313570 --nebulous ), it succeeded, and things seem to be moving now.
  • 02:08 CZW: Caught up +/- 2 exposures. Calling everything working for the night.
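
The stuck-query cleanup at 22:44 can be scripted rather than done by hand in the mysql shell. A minimal sketch, assuming the tab-separated output of `mysql --batch -e "SHOW PROCESSLIST"`; the 600-second threshold and the dry-run printing (rather than piping the KILLs straight back into mysql) are illustrative choices, not from the log:

```shell
#!/bin/sh
# Turn SHOW PROCESSLIST rows into KILL statements for long-running queries.
# Assumed column order (tab-separated, --batch --skip-column-names):
#   Id, User, Host, db, Command, Time, State, Info
THRESHOLD=600   # seconds; illustrative

kill_stale_queries() {
    # Reads processlist rows on stdin; prints "KILL <id>;" for each
    # active query older than THRESHOLD. Sleeping connections are skipped.
    awk -F'\t' -v limit="$THRESHOLD" \
        '$5 == "Query" && $6 + 0 > limit { print "KILL " $1 ";" }'
}

# Example with canned input. In production, something like:
#   mysql --batch --skip-column-names -e "SHOW PROCESSLIST" \
#       | kill_stale_queries | mysql
printf '12\tgpc1\tipp050:4242\tgpc1\tQuery\t900\tSending data\tSELECT ...\n13\tgpc1\tipp051:4243\tgpc1\tSleep\t30\t\t\n' \
    | kill_stale_queries
```

Printing the statements first makes it easy to eyeball what would be killed before committing.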

Tuesday: 2011-03-29

  • 11:02 CZW: ipp050 going down last night resulted in a number of stuck runs that would not revert cleanly due to nebulous having storage_objects for some files (generally the logs), but no instance. This likely means that the storage_object was created, ipp050 crashed, and then the script terminated when it couldn't get a filehandle. To resolve this, I identified all the storage_objects for these runs (neb-ls $path_base%$class_id), and used neb-mv to rename them to $key.bak. This seems to have unstuck things.
  • 11:28 Bill queued 49 diffRuns with label ThreePi.rerun. Added the label to stdscience and to survey.magic and survey.destreak
  • 12:10 CZW: stopped processing to allow Serge to dump the database for the db replication. Shutdown and restarted distribution pantasks as it was consuming a lot of resources. It should run smoother once processing resumes after the dump.
  • 12:20 Serge: dumping gpc1 to /export/ippdb01.0/mysql_gpc1.backup/gpc1.201103291220.sql
    *************************** 1. row ***************************
                File: mysqld-bin.018874
            Position: 189179
  • 13:05 Dump complete. The size of the dump grew by 500 MB in 5 days.
  • 13:17 gpc1 ingestion started on ippc02 (13:17:50)
  • 18:10 end of gpc1 ingestion
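
If the dump is made with `mysqldump --master-data=2` (a guess consistent with the binlog file/position recorded above), the coordinates are embedded as a commented CHANGE MASTER statement near the top of the dump file and can be recovered mechanically instead of retyped when pointing the slave at the master. A sketch; the canned dump header is illustrative:

```shell
#!/bin/sh
# Recover replication coordinates from a mysqldump made with
# --master-data=2, which writes them as a commented-out
# "-- CHANGE MASTER TO ..." line near the top of the dump.

master_coords() {
    # Prints the first commented CHANGE MASTER line from a dump on stdin,
    # with the leading "-- " stripped so it can be fed back to mysql.
    sed -n 's/^-- \(CHANGE MASTER TO .*\)$/\1/p' | head -n 1
}

# Example with a canned header. In production, something like:
#   master_coords < gpc1.201103291220.sql
cat <<'EOF' | master_coords
-- MySQL dump 10.13
-- CHANGE MASTER TO MASTER_LOG_FILE='mysqld-bin.018874', MASTER_LOG_POS=189179;
EOF
```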

Wednesday: 2011-03-30

  • 08:45 serge tries to start gpc1 replication. The mysql server immediately crashed 8o( ... but once restarted, the replication seems to work. The slave is 69682 seconds behind its master.
  • 09:30 serge sees that o5650g0192o is stuck at chip level.
  • 09:45 gpc1 replication works. The slave is synchronized with its master.
  • 10:45 Killed the running ppImage on ipp030. Set quality to 42 so MOPS data processing can go on... Thanks Bill
    chiptool -updateprocessedimfile -dbname gpc1 -chip_id 208801 -class_id XY47 -fault 0 -set_quality 42
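
The catch-up from 69682 seconds behind at 08:45 to synchronized at 09:45 can be watched with a one-liner that parses `SHOW SLAVE STATUS\G` output. A sketch; the canned status text below stands in for the live mysql output:

```shell
#!/bin/sh
# Extract the replication lag from the \G-style output of
#   mysql -e 'SHOW SLAVE STATUS\G'

slave_lag() {
    # Prints the Seconds_Behind_Master value from status text on stdin.
    awk -F': ' '/Seconds_Behind_Master/ { gsub(/ /, "", $2); print $2 }'
}

# Example with canned input (the 69682 figure is from the log above).
# Live usage would be something like:
#   mysql -e 'SHOW SLAVE STATUS\G' | slave_lag
cat <<'EOF' | slave_lag
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 69682
EOF
```

Wrapping the live call in `watch` (or a sleep loop) gives a cheap catch-up monitor.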

Thursday: 2011-03-31

  • 07:10 bill noticed that burntool was stuck for one chip. It turned out that jobs running on ipp021 were hanging because they could not talk to ipp049. force.umount fixed the problem and burntool completed soon after.
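
force.umount is presumably an IPP-side wrapper; the underlying fix for a mount hung on an unreachable server is a forced (`umount -f`) or lazy (`umount -l`) unmount. A sketch for the first step, finding which NFS mounts point at a given host; the host name and canned /proc/mounts lines are illustrative:

```shell
#!/bin/sh
# List NFS mount points served by a given host, as candidates for a
# forced unmount when that host goes down.

nfs_mounts_from() {
    # Usage: nfs_mounts_from <host>
    # Reads /proc/mounts-style lines on stdin; prints matching mount points.
    awk -v host="$1" '$3 ~ /^nfs/ && $1 ~ "^" host ":" { print $2 }'
}

# Example with canned input. On a live node:
#   nfs_mounts_from ipp049 < /proc/mounts
cat <<'EOF' | nfs_mounts_from ipp049
ipp049:/export/ipp049.0 /data/ipp049.0 nfs rw,hard,intr 0 0
ipp050:/export/ipp050.0 /data/ipp050.0 nfs rw,hard,intr 0 0
EOF
# Each printed path could then be freed with: umount -f <path>
```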

Friday: 2011-04-01

Serge is czar and it's not a joke.

Saturday: 2011-04-02

  • 04:00 With the increased number of hosts working on distribution, it has caught up with yesterday's data and seems to be keeping up with tonight's observations.
  • 04:05 Bill has been processing 72 exposures through magic with label masktest.20110401 using his build on a pantasks running on ippc25. It has been quiet for several hours because all was finished except for one diff that got stuck due to a corrupt warp output file (on ipp049). Fixed the warp and just queued the magicRuns, so the load on the cluster will increase for a bit.
  • 04:20 3 of the MD09.jtrp skycells faulted at the end when trying to replicate the output files. Strange. Reverted them.
  • 13:15 Pantasks thought a chip job was still running on ippc02, which crashed earlier. chip.reset caused pantasks to re-run the job successfully.
  • 17:17 Bill: we have some dist components that are faulting repeatedly due to missing config files. Turned dist.revert off to evaluate the situation. Since these are post-destreak files, this will be an advanced procedure to fix.

Sunday: 2011-04-03

  • 06:44 distribution pantasks had crashed. Saved logs to logs/crash.20110403 and restarted it with 2x the default hosts.
  • 10:00 heather set MD09.jtrp to clean, added MD10.jtrp to stdsci
  • 10:48 EAM burntool was stuck because of an incomplete readout: o5654g0464o. I marked it as 'drop' using pztool -updatepzexp