PS1 IPP Czar Logs for the week 2014-03-10 - 2014-03-16

Monday : 2014-03-10

Bill is czar today

  • 10:15 moved pstamp pantasks back to ippc17 so that the save.status task could succeed. Currently the /data partitions are not visible from the node
  • 10:30 removed minimum ra cut from staticsky pantasks. This will cause a number of northern skycells between ra 16 - 18 to be processed "out of order", but once those are done the march across the sky will continue in the expected order.
  • 13:16 set staticsky to stop in preparation for the network port swaps scheduled for 14:30
  • 14:40 stopped the other pantasks except replication, which isn't in the list in the check_system script. Stopped it around 15:20
  • 17:30 all pantasks restarted with default host configuration, except replication, which is not in the list. Will check with Chris
    • replication set to run by Chris
  • 20:30 some new data from summit. IPP seems to be running smoothly so far.

Tuesday : 2014-03-11

Bill is czar today

  • 06:36 storage.hosts.on in staticsky
    • that macro actually doesn't work quite right. Gene repaired things around 09:28
  • 11:30 cleared staticsky faults
    • fault 3 by deleting inputs with good_frac < 0.05
    • fault 4 for sky_id 477800, 477809, and 477818 by changing label to M31.LAP.ThreePi.20130717. We plan to change code to avoid model fits in the M31 region, but not right now
    • other fault 4s were reverted as there was nothing special in the logs; they probably died due to cluster problems
    • fault 5s all look like NFS problems, so reverted those
  • 15:45 restarted stdscience and pstamp pantasks
  • 16:11 staticsky jobs are starting to take a long time in the current working region 16 < ra < 18 which is getting to a busy portion of the galaxy.
    • Setting ra max back to 14 hours (210 degrees)
    • storage.nodes.off (the jobs will take several hours to flush out)
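
Note on units: the ra limits in this log are quoted in both hours and degrees (the entry above equates 14 hours with 210 degrees). A minimal Python sketch of that conversion follows. The function names are illustrative only and are not part of IPP or pantasks; that set.ra.max takes degrees is an assumption based on the 14 hours = 210 degrees entry.

```python
# Right ascension wraps once per 24 hours, i.e. 360 degrees,
# so one hour of ra corresponds to 15 degrees.
DEGREES_PER_HOUR = 360.0 / 24.0


def ra_hours_to_degrees(hours: float) -> float:
    """Convert a right ascension given in hours to degrees."""
    return hours * DEGREES_PER_HOUR


def ra_degrees_to_hours(degrees: float) -> float:
    """Convert a right ascension given in degrees to hours."""
    return degrees / DEGREES_PER_HOUR


if __name__ == "__main__":
    # The ra boundaries mentioned in this week's entries.
    for hours in (14, 16, 18):
        print(f"{hours} h = {ra_hours_to_degrees(hours):.0f} deg")
```

Under this conversion the 16 < ra < 18 working region corresponds to 240 to 270 degrees.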

Wednesday : 2014-03-12

  • 08:25 Bill: set.ra.max 0 in staticsky
  • 10:15 Bill: ipp035 went down. Nothing on console. Tried to power cycle twice with no luck. Set it to down in nebulous (former state was repair)

Thursday : 2014-03-13

  • 10:02 Bill: staticsky processing drifted into the galactic bulge last night. This caused several systems to use 100GB of virtual memory. Around 7:30 this morning the pantasks died. Gene tried some things to let the processes finish but we decided it was fruitless. Killed all of the psphotStack processes and restarted pantasks with ra limit of 210 degrees. We will need to reconfigure the set of hosts used when the time comes to process the skycells in the bulge.
  • 13:45 Bill: staticsky stopped running jobs because the macro storage.hosts.ignore had the command "control run" commented out. Fixed now.

Friday : 2014-03-14

  • 08:48 Bill: set.ra.max 240 in staticsky. This is to avoid running out of work to do over the weekend.
  • 08:54 Bill: set label STS.rp.2013 distribution bundles to be cleaned. MPE has downloaded them all.
  • 09:03 Bill: restarted pstamp
  • 09:15 Bill: removed 2x ippc30 from staticsky host list. 3x + pstamp processing + pstamp workdir hosting was overloading the server and delaying pstamp processing
  • 14:00 Bill: stopped staticsky pantasks for a bit to rebuild with some changes to psphotFullForce and associated
    • lowered poll limit and processed skycal runs for SAS.20140313.dev to create cff files containing only the de Vaucouleurs models.
    • 15:00 started processing of SAS.20140313.dev. 148 r band skycells in the middle of SAS were chosen. There will be 2546 warps to process.
  • 15:07 Bill: restarted stdscience pantasks

Saturday : 2014-03-15

Sunday : 2014-03-16