PS1 IPP Czar Logs for the week 2014.02.03 - 2014.02.09

(Up to PS1 IPP Czar Logs)

Monday : 2014.02.03

  • 08:20 MEH: stdsci is down since 0330?
    [2014-02-03 03:33:29] pantasks_server[24697]: segfault at 639c2d8 ip 0000000000407d98 sp 000000004145ff50 error 4 in pantasks_server[400000+13000]
  • 10:20 MEH: like yesterday, ipp040,043,050,052 are getting hung up at the end of (sync?), and ganglia often shows high cpu_wait on them (and on ipp053). Took 2/3 out of stdsci processing; that wave may also be overloaded thread-wise.
  • 17:00 MEH: adjusting the hosts in use -- while the loaders are down, reallocating them back to processing with a focus on more staticsky progress
    • stare 3x to staticsky

Tuesday : 2014.02.04

  • 09:00 MEH: continuing to sort nodes in use
  • 11:00 MEH: fixed LAP camera missing .trace.update file; neb-mv'd each file to file.bad (details to be added)
    neb-mv neb://any/gpc1/ThreePi.nt/2012/09/27//o6197g0119o.521755/ neb://any/gpc1/ThreePi.nt/2012/09/27//o6197g0119o.521755.bad/
    neb-mv neb://any/gpc1/ThreePi.nt/2012/09/27//o6197g0113o.521749/ neb://any/gpc1/ThreePi.nt/2012/09/27//o6197g0113o.521749.bad/
    neb-mv neb://any/gpc1/ThreePi.nt/2012/09/27//o6197g0101o.521737/ neb://any/gpc1/ThreePi.nt/2012/09/27//o6197g0101o.521737.bad/
    neb-mv neb://any/gpc1/ThreePi.nt/2012/09/27//o6197g0096o.521732/ neb://any/gpc1/ThreePi.nt/2012/09/27//o6197g0096o.521732.bad/
    neb-mv neb://any/gpc1/ThreePi.nt/2012/09/27//o6197g0095o.521731/ neb://any/gpc1/ThreePi.nt/2012/09/27//o6197g0095o.521731.bad/
    --> also gone.. will need to recover sometime
    neb-mv neb://ipp047.0/gpc1/LAP.ThreePi.20120706/2012/07/22/o5645g0634o.316341/ neb://ipp047.0/gpc1/LAP.ThreePi.20120706/2012/07/22/o5645g0634o.316341.bad/
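    The renames above all follow one pattern: move the instance directory aside to `<path>.bad`. A minimal shell sketch of that batch (exposure names taken from the log; the loop only echoes the `neb-mv` commands so they can be reviewed before running -- drop the `echo` to execute for real):

    ```shell
    # Build and print the neb-mv commands that move each bad instance
    # aside to "<path>.bad". Echo first for a sanity check; remove the
    # echo (or pipe the output through sh) to actually run the moves.
    base="neb://any/gpc1/ThreePi.nt/2012/09/27"
    for exp in o6197g0119o.521755 o6197g0113o.521749 o6197g0101o.521737 \
               o6197g0096o.521732 o6197g0095o.521731; do
        src="${base}/${exp}/"
        dst="${src%/}.bad/"     # strip trailing slash, append .bad
        echo neb-mv "${src}" "${dst}"
    done
    ```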
  • 21:00 MEH: difficult to fully balance without night data, with LAP entering GC and staticsky nearing the Plane. Revised notes and table at pantasks_hosts_summary
    • wave3 running stack may be too much during night processing
    • ipp040,043,050,052 -2x as jobs take longer to run there

Wednesday : 2014.02.05

Bill is czar today

  • 08:30 Wow staticsky made huge progress yesterday and ran out of data inside the ra limit of 30 degrees. (This is a temporary workaround to the fact that the staticsky runs were not queued in a reasonable order. Oh well.)
    • set.ra.limit 60
  • 09:00 Ah the reason staticsky made such epic progress is that it was working on skycells within 20 degrees of the galactic plane. We don't do extended source model fits there and so it zipped right through.
  • There are 4 skycells containing parts of M31 that are progressing slowly. Reduced the number of jobs on ippc53, c25, and c50 until those >30GB jobs finish.
  • 10:44 restarted pstamp pantasks with new code that makes sure updates are run with the correct label priority.
  • 13:06 restarted stdscience
  • 20:30 MEH: again ipp052,050,043,040 hanging onto chip_imfile jobs and weren't taken out of stdscience when it was restarted

Thursday : 2014.02.06

Bill is czar today

  • 14:08 Bill: found 3 LAP warps in an inconsistent state. Faulted, yet in full state. Cleaning up the runs and then will regenerate them. warp_ids: 462703-5
  • warp revert is off; I'm debugging something
  • 16:34 pantasks set to stop in preparation for periodic restart
  • 17:10 the pantasks have been restarted
    • stdscience ipp040,043,050,052 -2x as jobs take longer to run there

Friday : 2014.02.07

  • 06:16 Bill: set ~ippsky/staticsky to stop in preparation for the 9 am shutdown
  • 08:00 Bill: set stack to stop
  • 08:27 Bill: set other pantasks to stop. Stack still has 40 jobs going.

Saturday : 2014.02.08

Sunday : 2014.02.09

  • 14:46 Serge: difftool -updatediffskyfile -set_quality 14006 -skycell_id skycell.2127.022 -diff_id 521568 -fault 0 -dbname gpc1

  • 17:45 MEH: looks like pstamp has been dead since noon and MOPS hasn't received their stamps.. restarting. For some reason roboczar stopped reporting it, so restarted that as well..
  • 18:05 MEH: looks like stdsci is underloaded and well past regular restart, preparing for restart
  • 18:55 MEH: finally able to restart stdscience; it was held up by stalled jobs on overloaded/stalled systems
    • ipp040,043,050,052,053 out because of past issues with chip_imfile
  • 19:25 MEH: ipp053 still being harassed; taking it out of processing.. setting neb-host down to help it recover if possible..
  • 20:50 MEH: looks like nightly finally catching up
  • 21:00 MEH: spoke too soon.. loader still running on ipp054 and jobs are having issues there; taking 3x out of processing as well
    • all summitcopy jobs >500s now as well.. not working well..