PS1 IPP Czar Logs for the week 2015.06.01 - 2015.06.06

(Up to PS1 IPP Czar Logs)

Monday : 2015.06.01

  • 05:50 EAM : pv3fflt does not have enough work queued and pv3ffrt is running very slowly in the bulge, so I've moved all of the machines to pv3ffrt, with a reduced number of jobs per machine to avoid memory problems. Once we are past the bulge, even for the outer galactic plane, we can add more jobs to those machines.
  • 12:00 CZW: power cycling stsci17.
  • 12:05 EAM: I have restarted mysql replication on ippdb03 (from ippdb05). Replication failed last week due to a corrupted binlog file. I rsynced the database yesterday to ippdb05's local disk, then rsynced it last night to ippdb03, and this morning I set up the replication configuration. After discussing it with Heather, we have decided to keep the smaller camera databases (isp, uip, ssp) and the test databases (megacam, ps2_tc3) out of the replication stream to ippdb03. Instead, I have rsynced those databases to ippc42 and will set them up to replicate there (we can conveniently use the same master log position information as for the other setup).
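    • The re-setup above follows the standard MySQL slave pattern: rsync a consistent copy of the data, exclude the unwanted databases in the slave's my.cnf, then point the slave at the master's binlog coordinates. A minimal sketch; the binlog file name and position below are placeholders, not the actual values used:

```sql
-- my.cnf on ippdb03: keep the small camera/test databases out of this
-- slave's replication stream (they replicate to ippc42 instead):
--   replicate-ignore-db = isp
--   replicate-ignore-db = uip
--   replicate-ignore-db = ssp
--   replicate-ignore-db = megacam
--   replicate-ignore-db = ps2_tc3

-- Point ippdb03 at ippdb05's binlog and start replicating.
-- MASTER_LOG_FILE / MASTER_LOG_POS are placeholders; the real values
-- come from SHOW MASTER STATUS on ippdb05 at the time of the rsync.
CHANGE MASTER TO
  MASTER_HOST = 'ippdb05',
  MASTER_LOG_FILE = 'mysql-bin.000123',
  MASTER_LOG_POS = 4;
START SLAVE;
```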
  • 19:51 HAF: neb-host up on ipp065, 066, 067, 068, 069, 070, 071, 072, 073, 074, 075, 076, 077, 078, 079, 080, 081 since they have a lot of free space, we need some of it, and addstar diff needs to think about what it's done.
  • 20:03 EAM: stopping pv3ffrt to bump up the number of machine connections. we are mostly past the depths of the bulge and can handle a bit more processing.

Tuesday : 2015.06.02

  • 10:50 HAF: registration jammed, reverts not happening in regtool; manually ran {{regtool -revertprocessedexp -fault 2 -dbname gpc1}} and sent email
  • 11:30 CZW: Moved 314 ff runs from the .remote to the .left label. They all failed on the last run on the UH Cray, so I suspect they have some fault/quality issue that isn't correctly handled there.
  • 15:10 CZW: Moved 1477 ff runs with dec > +30 from .bulge to .remote. These should be far enough away from the plane to not require special treatment. I'm queuing the remaining 20hr runs to the .left label, and towards the end of the day I'll relabel the non-faulted runs (excluding the 314 moved earlier) to the .remote label as well and relaunch the Cray processing on those. This should clear the "easy" parts of the sky while the bulge consumes the local cluster.
  • 17:50 CZW: Moving ff runs around: 5387 from .left to .remote, in the box 300 < ra < 315, dec > 30. This moves the majority of the non-faulted runs (excluding the 314 above) to the remote label to be queued and run on the UH Cray. There are still 20hr runs being queued to the left queue. /data/ippc19.0/home/watersc1/ffUHcray.20150504/move/left2remote.cmds contains the commands to move these to remote. Based on the area and number moved at 15:10, I suspect the 19hr queue may need to be double-checked to ensure that it ran all the way to the edge of the galactic latitude cut.

Wednesday : 2015.06.03

  • 07:35 MEH: cleared stalled warp -- cannot build growth curve (psf model is invalid everywhere)
    warptool -dbname gpc1 -updateskyfile -set_quality 42 -skycell_id skycell.1829.005 -warp_id  1586420   -fault 0
  • 08:30 MEH: doing regular restart of nightly pantasks
  • 09:15 MEH: there seem to be stalled diffs unfixed since the weekend -- and the warps have been through auto-cleanup, so now it's more work to fix..
  • 10:35 MEH: roboczar's sendmail is configured for ippc18, which is down, so it isn't working currently -- Gavin re-enabled sendmail on ippc18 and now we have nagios warning emails as well
    • apparently the plan was to swap back to ippc18 as the primary home system today -- this will be delayed to a later date
  • 10:40 MEH: manually sending things to cleanup that were missed since Sunday, due to the skip of the initday hour for the auto-cleanup trigger..
  • 14:15 CZW: running shuffle code on stsci16.[0-2] to test loading. This should not substantially impact processing.
  • 14:30 MEH: Haydn needs to replace disk in ippx070 and requires reboot -- turning off in ~ippsky/pv3ffrt
  • 15:40 MEH: Haydn finished and ippx070 up -- on again in ~ippsky/pv3ffrt
  • 00:00 MEH: smooth processing, no hangups like last night -- something was running last night causing problems, gpc1 DB dump? other DB spamming?

Thursday : 2015.06.04

  • 07:35 MEH: stsci16 has been down only ~300s; will power cycle in a bit if it remains unresponsive
    • 08:00 back up again
  • 07:45 MEH: odd exposure o7177g0443o reportedly not registered... summit reports the file is gone
    • Craig reports this is a 0 size file, something went wrong with the guide video and seems to have created a 0 sized file -- has this happened before?
  • 08:05 MEH: bad warps
    warptool -dbname gpc1 -updateskyfile -set_quality 42 -skycell_id skycell.2444.009 -warp_id 1586976   -fault 0
    warptool -dbname gpc1 -updateskyfile -set_quality 42 -skycell_id skycell.1994.000 -warp_id 1586991   -fault 0
    warptool -dbname gpc1 -updateskyfile -set_quality 42 -skycell_id skycell.2072.097 -warp_id 1587006   -fault 0
  • 08:10 MEH: clearing bad diff
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.2278.098 -diff_id 1155673  -fault 0
    difftool -dbname gpc1 -updatediffskyfile -set_quality 42 -skycell_id skycell.1994.002  -diff_id 1155920 -fault 0
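    • The repeated quality-42 updates above can be batched rather than typed one by one. A minimal sketch; the pairs file and the dry-run echo are assumptions for illustration, and the flags are copied verbatim from the commands above:

```shell
# pairs file: <skycell_id> <warp_id>, one per line (values copied from above)
cat > /tmp/bad_warps.txt <<'EOF'
skycell.2444.009 1586976
skycell.1994.000 1586991
skycell.2072.097 1587006
EOF

# emit one warptool command per pair; drop the leading "echo" to run for real
while read skycell warp; do
    echo warptool -dbname gpc1 -updateskyfile -set_quality 42 \
        -skycell_id "$skycell" -warp_id "$warp" -fault 0
done < /tmp/bad_warps.txt
```

The same loop works for the difftool lines by swapping in the -updatediffskyfile form shown above.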
  • 08:40 MEH: stsci15 down -- power cycle
  • 13:00 MEH: Bill had notes on how to clear single OTA problem rather than dropping full exposure -- added to Processing (needs some reorg..)
    • requeue diffims for MOPS to be v1-v2, v3-v4 rather than v1-v3..
      difftool -dbname gpc1 -definewarpwarp -exp_id 922869  -template_exp_id 923080 -backwards -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2015/06/04 -set_dist_group SweetSpot -set_label OSS.nightlyscience -set_data_group OSS.20150604 -set_reduction SWEETSPOT -simple -rerun
      difftool -dbname gpc1 -definewarpwarp -exp_id 922906 -template_exp_id 922924  -backwards -set_workdir neb://@HOST@.0/gpc1/OSS.nt/2015/06/04 -set_dist_group SweetSpot -set_label OSS.nightlyscience -set_data_group OSS.20150604 -set_reduction SWEETSPOT -simple -rerun
  • 14:15 MEH: general restart of nightly pantasks, registration done this morning already
  • 14:25 CZW: restarted ippsky/ffUHcray, as it looks like the books were jammed up. I also shut down ipplanl/stdlanl, which I thought I had done earlier in the week (I had simply stopped it).
  • 17:40 MEH: ipps09 unresponsive -- power cycle
    <Jun/05 05:17 pm>ipps09 login: [5473869.330877] Uhhuh. NMI received for unknown reason 2d on CPU 0.
    <Jun/05 05:17 pm>[5473869.337037] Do you have a strange power saving mode enabled?
    <Jun/05 05:17 pm>[5473869.342911] Dazed and confused, but trying to continue
  • 20:50 EAM: rebooting stsci13.
  • 21:10 MEH: ipps09 down again -- took it out of processing and rebooted -- root is using the console -- looks like it's back up; leaving it unloaded to see how it does overnight

Friday : 2015.06.05

  • 04:55 Bill: ganglia reports that stsci17 has been down for about 3 hours. Set it to down in nebulous.
  • 04:55 Bill: Restarted the postage stamp pantasks. It was acting sluggish.
  • 05:45 EAM: rebooted stsci17, put it in repair
  • 08:10 MEH: ipps09 okay overnight w/o load -- loading 50% during the day
  • 14:00 CZW: running shuffle load tests on stsci01-stsci08. Each machine's index indicates the number of jobs running per volume (e.g., stsci03 has 9 jobs, three on each of its three volumes).

Saturday : 2015.06.06

  • 11:10 EAM: cleared a bad warp:
    warptool -dbname gpc1 -updateskyfile -warp_id 1588035 -skycell_id skycell.2213.012 -fault 0 -set_quality 42
  • 22:10 MEH: stdsci well underloaded; it needed its regular restart earlier. While here, also restarting pstamp, which is another pantasks that requires a regular nightly restart...

Sunday : 2015.06.07

  • 4:50 EAM: summitcopy was jammed up because all attempts to download from PS2 were timing out. For reasons that are a bit unclear to me, this was also blocking PS1 downloads (due to long jobs in the pantasks queue). I've restarted summitcopy with the gpc2 database removed, and at least PS1 data is now flowing.