PS1 IPP Czar Logs for the week 2012.06.04 - 2012.06.10

Monday : 2012-06-04

Mark is czar

  • 06:34 Serge: ingestion of the instance table finished at 5:26 this morning. The storage_object table is now being ingested.
  • 06:50 Mark: limited nightly science data, downloaded and processed.
  • 08:50 Serge: Estimate of 25% of storage_object table ingested (based on so_id assuming there are no "holes" in the table).
  • 12:15 Serge: 50% of storage_object table ingested (based on so_id assuming there are no "holes" in the table).
  • 14:35 Serge: Cleaned ipp009:/tmp
  • 17:15 Mark: replication regularly shows as not running/up on the czarpage when it actually is. it has been very busy; going to restart the pantasks.
  • 20:00 Mark: turning chip.revert.on, please do not revert anything for MD09.
  • 21:50 Serge: ingestion of storage_object table complete. The full ingestion should finish in a couple of hours, likely earlier.
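
The ingestion-progress estimates above (25% and 50%, from the highest so_id seen) can be sketched as a quick shell calculation. This is a minimal sketch assuming so_id is a gap-free auto-increment id; the expected final row count below is a made-up placeholder, and the commented-out mysql query is only an assumption of where max_id would come from:

```shell
#!/bin/sh
# Estimate ingestion progress from the highest ingested auto-increment id,
# assuming the id column (so_id) has no holes.
#
# In practice max_id would come from something like (hypothetical query):
#   mysql -N -e 'SELECT MAX(so_id) FROM storage_object' nebulous
estimate_progress() {
    max_id=$1           # highest so_id ingested so far
    expected_total=$2   # expected final row count (placeholder value below)
    echo $(( max_id * 100 / expected_total ))
}

estimate_progress 95000000 380000000    # prints 25
```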

Tuesday : 2012-06-05

Serge is czar

  • 06:35 Serge: ingestion of nebulous complete (at 0:02am).
  • 06:40 Serge: 5 exposures were not registered. I manually replayed the failing registration from ipp@ippc18: --exp_id 493380 --tmp_class_id ota11 --tmp_exp_name o6082g0100o 
    --uri neb://ipp056.0/gpc1/20120604/o6082g0100o/o6082g0100o.ota11.fits 
    --logfile neb://ipp056.0/gpc1/20120604/o6082g0100o.493380/o6082g0100o.493380.reg.ota11.log --bytes 23716800 
    --md5sum 37827e9387ccfebffff5d115dabaddcf --sunset 03:30:00 --sunrise 17:30:00 
    --summit_dateobs 2012-06-04T07:00:13.000000 --dbname gpc1 --verbose

which fixed the registration problem.
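
Before replaying a registration like the one above, it's worth confirming that the copied file actually matches the byte count and md5sum passed on the command line. A minimal sketch; the check_file helper is hypothetical, and in production the path would come from neb-locate (as used elsewhere in this log) rather than the scratch file used here:

```shell
#!/bin/sh
# Sanity-check a copied file against an expected size and md5sum before
# replaying its registration. check_file is a hypothetical helper.
check_file() {
    path=$1
    expected_bytes=$2
    expected_md5=$3
    actual_bytes=$(wc -c < "$path")
    actual_md5=$(md5sum "$path" | awk '{print $1}')
    [ "$actual_bytes" -eq "$expected_bytes" ] && [ "$actual_md5" = "$expected_md5" ]
}

# In production the path would come from neb-locate, e.g.:
#   path=$(neb-locate --path neb://ipp056.0/gpc1/20120604/o6082g0100o/o6082g0100o.ota11.fits)
# Demo on a scratch file instead:
printf 'test' > /tmp/demo.fits
check_file /tmp/demo.fits 4 098f6bcd4621d373cade4e832627b4f6 && echo OK
```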

  • 07:10 Serge: Just forgot to mention that: I didn't revert anything for MD09 ;)
  • 07:15 Mark: turning on the diffims for MD09, adding 2012-06-01 to pick up MD09-y observed then, tweaking SSdiff run time for 7:30am. (and thanks Serge)
  • 10:30 Serge: Stopping all pantasks from ippc18; czarpoll and roboczar on ipp009; addstar/addstarlap as ippdbvo from ippc18
  • 11:45 Serge: Replaying the binlogs is AWFULLY long! The first one isn't finished yet.
  • 12:10 Serge: Restarted apache and the pantasks, including the one for replication; also apache on ippc17 and the czar monitoring on ipp009.
  • 12:18 Serge: Added date 2012-06-05 for gpc1 (14) since no date was shown in ippMonitor.
  • 13:30 Serge: First binlog ingested... 3 hours. I'm trying to play them without the COMMIT statements that happen every other line.
  • 13:45 Serge: The second binlog ingestion is slow as well... because of the BEGIN statements now.
  • 15:00 Serge: Second binlog ingestion complete. Further ingestions executed with: mysqlbinlog <binlog> | grep -v 'COMMIT/*!*/' | grep -v BEGIN | mysql -u root -p nebulous
  • 23:00 Mark: queuing a few thousand MD09.GR0 nightly stack reprocessing runs. not putting the label in stdscience, to avoid the problem chips reverting (both have the same label), so it will show up as ghost activity.
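
The filtered replay described above — stripping the BEGIN and COMMIT statements that mysqlbinlog emits around every event — can be sketched as follows. The pipeline is the one from the log entry; the sample binlog text and file paths are illustrative only:

```shell
#!/bin/sh
# Replay a binlog without the per-event BEGIN/COMMIT lines that made the
# straight replay painfully slow. The real pipeline (from the log) is:
#   mysqlbinlog <binlog> | grep -v 'COMMIT/*!*/' | grep -v BEGIN | mysql -u root -p nebulous
# Here we demonstrate just the filter on a small sample of binlog-like text.
cat > /tmp/sample_binlog.sql <<'EOF'
BEGIN
/*!*/;
INSERT INTO storage_object VALUES (1);
COMMIT/*!*/;
BEGIN
INSERT INTO storage_object VALUES (2);
COMMIT/*!*/;
EOF

grep -v 'COMMIT/\*!\*/' /tmp/sample_binlog.sql | grep -v BEGIN > /tmp/filtered.sql
cat /tmp/filtered.sql
```

Only the INSERT statements (and the harmless /*!*/; terminators) reach mysql, so the server is no longer forced into a commit after every line.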

Wednesday : 2012-06-06

Serge is czar

  • 08:00 Serge: Nightly science processing complete
  • 08:50 Serge: nebulous binlogs: 9/13 completed.
  • 10:40 Mark: MD09.GR0 nightly stacks finished. turning to loading them onto the datastore without reverting the problematic chips. the final refstack i,r will be started in the deepstack pantasks to run over the next day or so.
  • 13:10 Serge: Changed the ippdb00:/etc/nebdiskd.rc configuration so that it is used. Restarted nebdiskd on ippdb00.
  • 16:15 Serge: ippc45 died around 15:00 but someone is already connected to it...
  • 16:40 Serge: binlogs since yesterday (i.e. with balance and shuffling) have all been played. I'm now playing the 2 binlogs that have been generated since yesterday. Hopefully playing them will not take as long as the other ones...
  • 16:45 Serge: From Gavin: "someone (onsite) disconnected ippc45 from the switch".
  • 16:55 Serge: Looks like ippc45 is back
  • 22:30 Serge: Playing mysqld-bin.000695 (ipp screen session 28803.pts-6.ippdb00)
  • 23:15 Mark: looks like something pushed ipp062 into an odd state around 22:10; it has become unresponsive and cannot be reached via ssh.
  • 23:20 queuing MD09.refstack.20120603.5x20120606 staticsky with the previous ops tag ipp-20120404, since the new tag joins the skycell table, which doesn't have the MD tess_id loaded.
  • 23:45 gave it 30 minutes and it is blocking download/registration/processing. unable to log in to the system via the console either. nothing on the console display; cycling power.
    • /dev/sda3 disk checked (211 days) - ok
    • nothing apparent in logs?

Thursday : 2012-06-07

Serge is czar

  • 07:10 Serge: Nightly science processing complete
  • 07:50 Serge: ippdb01 is down
  • 08:05 Serge: Playing binlog mysqld-bin.000696 (ipp screen session 28803.pts-6.ippdb00).
  • 09:00 Serge: ippdb01 is back (log here)
  • 09:30 Serge: Stopped nebdiskd
  • 09:45 Serge: Stopped pantasks
  • 10:10 Serge: Restarted pstamp to give mops some more bones to gnaw.
  • 10:30 Serge: All pantasks and apache servers including ippc17 stopped (log here)
  • 11:20 Serge: All services (pantasks and apache servers) started
  • 12:00 Serge: I'm running a small test (nebtest.20120607) to make sure everything's working (with Heather's help)
  • 20:00 Serge: ippdb01 is down again.
  • 20:25 Serge: Cindy rebooted ippdb01
  • 20:55 Serge: Restarted czarpoll/roboczar running on ipp009

Friday : 2012-06-08

  • 11:39 CZW: Clear registration issues for a bad burntool table:
    # Run burntool
    funpack -S `neb-locate --path neb://ipp058.0/gpc1/20120608/o6086g0448o/o6086g0448o.ota14.fits` > /tmp/k.fits
    burntool /tmp/k.fits out=/tmp/ tableonly=t persist=t
    # Insert table
    neb-mv neb://ipp058.0/gpc1/20120608/o6086g0448o/o6086g0448o.ota14.burn.tbl neb://ipp058.0/gpc1/20120608/o6086g0448o/o6086g0448o.ota14.burn.tbl.bak
    neb-insert neb://ipp058.0/gpc1/20120608/o6086g0448o/o6086g0448o.ota14.burn.tbl /tmp/ --copies 2
    # Clear database issues
    regtool -updateprocessedimfile -exp_id 495667 -class_id XY14 -burntool_state -14 -set_state full
    regtool -updateprocessedimfile -exp_id 495667 -class_id XY14 -fault 0
  • 15:45 Serge: After the repeated ippdb01 crashes, we will use the science databases on ippdb03. The files that have been changed are the following:
    • ipp001 is now replicating ippdb03 instead of ippdb01
    • czarpoll and roboczar are now using ippdb03
    • condor is configured to use ippdb03
    • all databases which were on ippdb01 (except Heather's test ones) are being ingested (it will take some time though)
    • ippconfig/site.config
    • ippMonitor has been configured to use ippdb03 (read ~ipp/src/ippMonitor/INSTALL)
  • 17:40 Serge: master status on ippdb03:
    *************************** 1. row ***************************
                File: mysqld-bin.000001
            Position: 19880653

Dumping to /export/ippdb03.0/schastel/backup/gpc1_isp_ssp_czardb.20120608.sql
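
The File/Position pair recorded above is exactly what a replica's CHANGE MASTER TO statement needs when re-pointing at ippdb03. Extracting it from the \G-style output can be sketched like this (the sample text is the status shown above; the scratch file path is illustrative):

```shell
#!/bin/sh
# Pull the binlog coordinates out of SHOW MASTER STATUS\G style output;
# these are the values a replica's CHANGE MASTER TO statement would use.
cat > /tmp/master_status.txt <<'EOF'
*************************** 1. row ***************************
            File: mysqld-bin.000001
        Position: 19880653
EOF

binlog=$(awk -F': ' '/File:/ {print $2}' /tmp/master_status.txt)
position=$(awk -F': ' '/Position:/ {print $2}' /tmp/master_status.txt)
echo "binlog=$binlog position=$position"
```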

  • 18:55 Serge: mysql server is set up on ipp064. Dump is still being performed (gpc1/pzDownloadImfile)
  • 20:05 Serge: Backup of gpc1 on ippdb03 complete. Started ingestion on ipp064.
  • 20:15 Serge: Stopping the apache server on ippdb03
  • 20:30 Serge: Commented out entries in ipp crontab on ipp001... Backups were performed from ippdb03.
  • 20:45 Serge: ippdb03 is overwhelmed. I'm stopping all services but summitcopy and registration. I will restart stdscience later.
  • 21:10 Serge: I'm stopping the mysql server on ippdb03... Everything is frozen.
  • 21:20 Serge: Restarted mysql, summitcopy, and registration.
  • 21:23 Serge: Seems to run smoothly. However, query 1118 looks suspect:
    mysql> SELECT id, user, host, db, command, time, state, info FROM INFORMATION_SCHEMA.PROCESSLIST WHERE id=1118;
    | id   | user | host | db   | command | time | state                | info |
    | 1118 | ipp  |      | isp  | Query   |  398 | Copying to tmp table | INSERT INTO summitExp SELECT NULL, incoming.*, NULL, 0, NULL FROM incoming LEFT JOIN summitExp USING(exp_name, camera, telescope) WHERE summitExp.exp_name is NULL AND is NULL AND summitExp.telescope is NULL |

It's likely been generated by summitcopy (running on ipp050)
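
Long-running statements like 1118 can be watched by filtering the processlist on the time column. A minimal sketch; the tab-separated demo rows mimic mysql -B output for the columns id, user, db, command, time, state, and the 300-second threshold is arbitrary:

```shell
#!/bin/sh
# Flag processlist rows whose "time" column exceeds a threshold (here 300 s).
# Demo input mimics tab-separated "mysql -B" output:
#   id  user  db  command  time  state
printf '1118\tipp\tisp\tQuery\t398\tCopying to tmp table\n' >  /tmp/processlist.tsv
printf '5244\tipp\tisp\tQuery\t12\tCopying to tmp table\n'  >> /tmp/processlist.tsv
printf '6001\tipp\tgpc1\tSleep\t3\t\n'                      >> /tmp/processlist.tsv

awk -F'\t' '$5 > 300 {print $1, $5 "s", $6}' /tmp/processlist.tsv
# prints: 1118 398s Copying to tmp table
```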

  • 21:30 Serge: Summitcopy copies. Registration registers. The cluster looks bored: I try to start stdscience
  • 21:35 Serge: A new query (5244) similar to 1118... Note that 1118 is still running.
  • 21:40 Serge: I restarted publishing.

Saturday : 2012-06-09

  • 06:25 Serge: On ipp064, gpc1 is still being ingested (it was at 'magicNodeResult' when I wrote this line; tables are being ingested in alphabetical order).
  • 06:40 Serge: I restarted pstamp on ippc17. I also restarted the apache server for the czartool pages.

  • 18:30 Mark: finally figured out why the MD SSdiffs didn't run after tweaking the time range: the stack pantasks is needed for nightly stacks. also needed to add back last night's date 2012-06-09 in stdscience after the restarts.
  • 19:00 should also be able to push out to distribution before nightly science starts.
  • 20:00 MD nightly stacks and SSdiffs weren't running because of a hard-coded ippdb01 reference. made the change directly in the ops tag bin version for now, and will turn it off in stdscience during the night until morning, when summit/registration is finished.
  • 20:10 it also looks to be picking up missed 3PI diffims to do.
  • 20:45 these also missed publishing for MOPS; some are starting to come out now.
  • 21:00 also turned on update pantasks, suspect it was holding up some MOPS PSS requests.
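
Hard-coded hosts like the ippdb01 reference that broke the MD stacks can be hunted down with a recursive grep over the ops tag's bin directory. A minimal sketch; the scratch tree below stands in for the real ops tag, and the file names and contents are invented:

```shell
#!/bin/sh
# Find scripts that hard-code a database host instead of reading site.config.
# The scratch directory stands in for the real ops tag bin directory.
mkdir -p /tmp/opsbin_demo
cat > /tmp/opsbin_demo/good.pl <<'EOF'
my $dbhost = $config->{dbhost};   # host comes from site.config
EOF
cat > /tmp/opsbin_demo/bad.pl <<'EOF'
my $dbhost = "ippdb01";           # hard-coded host
EOF

grep -rl 'ippdb01' /tmp/opsbin_demo    # lists only bad.pl
```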

Sunday : 2012-06-10

  • 01:00 Mark: things appear to be running smoothly with summit+registration+stdscience/nightly_science+stack+distribution+publication so will leave all running. stopping pstamp and update until morning.
  • 07:30 nightly science done, turning pstamp and update back on.
  • 19:30 turning pstamp+update off during nightly science again -- will turn them on and watch if jobs are submitted while they're up.