PS1 IPP Czar Logs for the week 2014.09.15 - 2014.09.21

Monday : 2014.09.15

  • MEH: until the new consortium starts (10/2014?), all WS diffims are to be turned off -- comment out the label and turn off the query-to-create in the stdsci auto-startup, sending all to cleanup
  • 13:40 MEH: Gene asked for the pv3 stdlocal stack poll to be bumped to 200 and for more data nodes to be added for processing -- adding 4x ippsXX (~60 nodes) for now since they are unused (easier to deal with when nightly starts)
    • ipp058 getting hammered when trying to clean up ~1TB of test processing -- putting into repair for now
    • same for ipp060, both have a nice chunk of space available again
  • 16:00 MEH: adding 1x s3 manually to lanl stdlocal -- noticed many in ignore_s3 running, so the auto on/off script is not properly turning them off??
    • cleanup still running; if there is too much load, it will just take that much longer -- after adding 1x s3, seeing a clear increase in cleanup job times
  • 20:15 MEH: ipp058 behaving badly.. console login up, but get errors -- may need to power cycle
    Cannot execute /bin/tcsh: Too many open files in system
    
    login: failure forking: Cannot allocate memory
    
    • finally cleared up somewhat after a bit and set neb-host down -- many, many "[2014-09-15 20:27:07] VFS: file-max limit 4945516 reached" messages
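    • for reference, a generic way to check how close a node is to the kernel file-handle limit (plain Linux sysctl, not an IPP tool; the raised value below is only an example):
      # current usage: allocated handles, free handles, and the max
      cat /proc/sys/fs/file-nr
      sysctl fs.file-max
      # temporarily raise the limit if the node can't be power cycled right away (needs root); example value only
      sysctl -w fs.file-max=8388608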
  • 20:45 MEH: unclear if gpc1 down for the night, but lanl stdlocal should be fine w/ ippsXX added at least
  • 21:30 MEH: nightly started, ippsXX off in lanl stdlocal, ipp058 heavy load so into repair until morning
  • 22:00 MEH: seeing file-related faults in the warp stage for ipp060 -- also into repair until morning -- may not be robust enough with so few data nodes available
    Unable to perform warp.
     -> psFitsOpen (psFits.c:217): I/O error
         Failed to delete a previously-existing file (/data/ipp060.0/nebulous/e4/40/5106938020.gpc1:OSS.nt:2014:09:16:o6916g0050o.795035:o6916g0050o.795035.wrp.1030977.skycell.1140.098.mask.fits), error 2: No such file or directory
     -> pmFPAfileOpen (pmFPAfileIO.c:816): I/O error
         error opening file /data/ipp060.0/nebulous/e4/40/5106938020.gpc1:OSS.nt:2014:09:16:o6916g0050o.795035:o6916g0050o.795035.wrp.1030977.skycell.1140.098.mask.fits
     -> pmFPAfileWrite (pmFPAfileIO.c:415): I/O error
         failed to open /data/ipp060.0/nebulous/e4/40/5106938020.gpc1:OSS.nt:2014:09:16:o6916g0050o.795035:o6916g0050o.795035.wrp.1030977.skycell.1140.098.mask.fits (PSWARP.OUTPUT.MASK)
     -> pmFPAfileIOChecks (pmFPAfileIO.c:91): I/O error
         failed WRITE in FPA_AFTER block for PSWARP.OUTPUT.MASK
     -> pswarpCleanup (pswarpCleanup.c:84): I/O error
         Unable to write files.
    
    • also seeing the warp stage log issuing a warning for the new keyword (Unable to find (null) --- assuming UTC), will need to clean up
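    • a minimal spot-check (generic shell, not an IPP tool; the test file name is hypothetical) to see whether /data/ipp060.0 is mounted, has space, and is writable from a processing node:
      df -h /data/ipp060.0                    # mounted and has free space?
      ls /data/ipp060.0/nebulous > /dev/null  # does a simple read hang or error?
      touch /data/ipp060.0/czar_write_test && rm /data/ipp060.0/czar_write_test  # basic write test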
  • 22:10 MEH: stdsci having a flurry of timeouts for chiptool -pendingimfile.. ippdb01 has load ~13 -- taking out all extra labels (CNP, STD, PI, ps_ud_*) -- seems improved
  • 22:30 MEH: heavy load even on the apache servers (ippc0x); lanl stdlocal stopped for now until nightly starts moving along faster
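    • quick, generic checks of what the db host is doing (assumes shell access to ippdb01 and working mysql client credentials; not IPP-specific tools):
      uptime                             # confirm the load average
      mysqladmin processlist | head -20  # look for long-running or locked queries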
  • 23:10 MEH: no access to ganglia -- looks like ippops2 rebooted, can't remember how to restart it -- Gavin's notes:
    # restart ganglia monitor
    /etc/init.d/gmond restart
    # restart gmetad (aggregator for the web frontend)
    /etc/init.d/gmetad restart
    
    • ipp svn/wiki also down -- wasn't ippops2 but ipp002 that crashed -- back up once Gavin rebooted it. Will look at possible UPS/PDU issues since it wasn't under any load
  • 23:50 MEH: lanl stdlocal set to run but w/ only ~100 nodes -- nightly processing is slower than normal and possibly hampered by lanl stdlocal, but if weather comes in at least some processing gets done

Tuesday : 2014.09.16

  • 06:30 MEH: ipp036 crash ~0618, nothing on console -- power cycled
  • 06:35 MEH: lanl stdlocal stop until nightly finished, nothing loaded anyways
  • 09:30 MEH: lanl stdlocal back to run, +4x ippsXX back on, +1 s3 back on, stack.poll 200, ipp058+060 up from repair, regular stdsci restart w/ ps_ud_* labels back in
    • again, the auto-on for storage nodes is turning ipp060, ipp064 and other ignored ones on in lanl stdlocal
  • 15:00 MEH: scale back ippsXX adds to lanl stdlocal so cleanup can make progress before nightly starts
  • 19:15 MEH: again many chiptool -pendingimfile failures; took the ps_ud_* label out of stdsci and it's better? ippdb01 load ~10
    • lanl stdlocal turning off after nightly starts, w/ 1500-3ks stacks, will get overloads
  • 20:30 MEH: nightly throughput low, -2x c2, -2x c1a in lanl stdlocal (ippc0x nodes seem to get extra load w/ stacks and are the apache servers..)
  • 21:05 MEH: fun, looks like ipp008,012,013,037 are down.. nothing on console and for some reason cab power is off..
    Outlet                Name                Status          Post-on Delay(s)
    cab1pdu0[17]          ipp013p0            OFF             0.5
    cab1pdu0[18]          ipp013p1            OFF             0.5
    
    • take all out of processing, luckily no new summit data due to fog
    • neb-host down (from up)
    • lanl stdlocal stop
  • 22:30 MEH: Haydn had Scott check for any physical damage; looks okay, so trying to boot -- ipp008,012,037 up normally, still waiting on ipp013
    • keep all in neb-host repair for now and out of processing
    • adding ippsXX machines in to replace the lost nodes
    • weather clearing so new data now
    • lanl stdlocal run w/ reduced nodes
  • 23:10 MEH: already 20 behind in downloading, is there other processing going on? ps2iq?

Wednesday : 2014.09.17

Thursday : 2014.09.18

  • HAF: NFS wedges of ipp013 on stsci10/stsci13/stsci14 -- sorted out by umount -f /data/ipp013.0
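    • a sketch of how such a wedge can be confirmed before forcing the unmount (generic Linux; the timeout wrapper is only there to avoid hanging the shell):
      mount | grep ipp013.0           # confirm the stale mount is still listed
      timeout 10 ls /data/ipp013.0    # a hang/timeout here indicates the wedge
      umount -f /data/ipp013.0        # force unmount, as done above
      umount -l /data/ipp013.0        # lazy unmount as a fallback if -f does not release it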

Friday : 2014.09.19

  • 09:20 EAM : nightly processing finished smoothly. I'm stopping and will restart stdlocal. A number of camera jobs are wedged and need to be cleared / killed.
  • 10:15 EAM : I shut down all processing and cleared out a number of hung mounts (ipp013). I restarted everything. Here is the list of generally running pantasks:
    • ipp : distribution, registration, stack, stdscience, summitcopy, cleanup
    • ipplanl : stdlocal, stdlanl
    • ipptest : pstamp
    • ps2iq : summitcopy, registration

Saturday : 2014.09.20

  • 14:15 MEH: RW asking about missing diffims -- OSS warp stuck with a memory fault, a few reverts cleared it (1038774 skycell.1502.005) -- publishing taking forever to queue up.. so manually created:
    pubtool -dbname gpc1 -definerun -set_label OSS.nightlyscience -data_group OSS.20140920 -client_id 5 -simple -pretend
    

Sunday : 2014.09.21