PS1_IPP_CzarLog_20110912 – Pan-STARRS IPP

PS1 IPP Czar Logs for the week 2011-09-12 - 2011-09-18

(Up to PS1 IPP Czar Logs)

Monday : 2011-09-12

Around 10am Bill finally got all of the pieces checked into the branch for the ppMops memory reduction fix. The pantasks were restarted.
11:00 adjusted pstamp.dependent.run task to not be so aggressive at running. Should reduce the database load that it causes.
15:00 CZW: Reworked host definitions to be more equitable (hopefully) and to run the processors we have as hard as possible without crashing anything. New definitions for all servers are stored in /home/panstarrs/ipp/ippconfig/pantasks_hosts.input

Tuesday : 2011-09-13

09:50 Mark (czar): ipp021 removed from nebulous and processing stopped on it so Cindy can upgrade the motherboard.
13:00 ipp021 back online. Added back to nebulous and pantasks.
15:10 ipp026 went down. Removed from nebulous list until back up. Kernel panic similar to what has happened before; Chris rebooted and added details to ipp026-crash-20110913. Put back into nebulous in same state as before: repair.

21:30 registration trouble; the following crash was reported in pantasks.stdout.log and the command was run manually

crash for: ipp_apply_burntool_single.pl --exp_id 392018 --class_id XY04 --this_uri neb://ipp006.0/gpc1/20110914/o5818g0058o/o5818g0058o.ota04.fits --continue 10 --previous_uri neb://ipp006.0/gpc1/20110914/o5818g0057o/o5818g0057o.ota04.fits --dbname gpc1 --verbose

Wednesday : 2011-09-14

06:00 Mark: stalled at o5818g0440o, ipp_apply_burntool_single.pl running for 8ks so killed.
- stalled again o5818g0448o with check_burntool and ota27 but picked itself up. ipp018 was having trouble connecting to ipp007.0.
11:45 excessive CPU use by distribution pcontrol, restarted distribution.
12:00 diff faulting from error reading FITS file /data/ipp042.0/nebulous/48/67/1297415775.gpc1:ThreePi.nt:2011:09:14:o5818g0508o.392470:o5818g0508o.392470.wrp.254273.skycell.2361.049.mask.fits. Regenerated with
```
perl ~ipp/src/ipp-20110622/tools/runwarpskycell.pl --warp_id 254273 --skycell_id skycell.2361.049 --redirect-output 
```
12:10 diffim (diff_id=165304) running on ipp026 for 46ks, ppSub hanging. Killed ppSub to fault and revert. ipp026 has had timeouts to ippb00,01,02 in the past (seen in dmesg, not sure when). Diff completed.
12:21-12:56 Serge: stopped pstamp; dumped ippRequestServer to /export/ippc17.0/ipp/mysql-dumps/ippRequestServer.20110914.sql; all done in less than 2 minutes. Master coordinates: mysqld-bin.000610, 505678724. Dump copied to /export/ippc19.0/pstamp_replication. Stopped slave on ippc19. Dropped existing database on ippc19. Ingested dump. Changed master coordinates. Restarted slave.
12:30 Mark: stdscience pcontrol on ippc16 100%, restarting stdscience now that last night's data finished (and start habit of restarting regularly to see if improves rates). Waiting for jobs to finish.
13:10 took longer to flush stdscience than normal. Also a hanging warp on ipp026 (warp_id=254454). stdscience now restarted.
13:20 diffim repeatedly faulting (diff_id=165610, skycell_id skycell.0982.067) as described in PS1_IPP_czarLog_20110627 for LAP diff 141693. Set quality=42, fault=0 with
```
difftool -updatediffskyfile -diff_id 165610 -skycell_id skycell.0982.067 -set_quality 42 -set_fault 0 -dbname gpc1
```
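
The slave-side part of the 12:21 replication re-seed above can be sketched as a small helper that only prints SQL for review. This is a minimal sketch, not the procedure actually used: the function name `slave_resync_sql` is hypothetical, the statements are the standard MySQL `STOP SLAVE` / `CHANGE MASTER TO` / `START SLAVE`, and the drop and re-ingest of the database are assumed to happen between the first and second statement.

```shell
#!/bin/sh
# Hypothetical helper: print the slave-side SQL for pointing a replica at
# new master coordinates. It prints only; nothing is sent to a database.
# Arguments: binary log file name, binary log position.
slave_resync_sql() {
    log_file=$1
    log_pos=$2
    # Reject an empty or non-numeric position rather than emit broken SQL.
    case $log_pos in
        ''|*[!0-9]*)
            echo "log position must be a non-negative integer" >&2
            return 1
            ;;
    esac
    printf 'STOP SLAVE;\n'
    # Assumption: the old database is dropped and the dump ingested here.
    printf -- '-- drop existing database and ingest the dump before continuing\n'
    printf "CHANGE MASTER TO MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;\n" \
        "$log_file" "$log_pos"
    printf 'START SLAVE;\n'
}

# Coordinates as recorded in the log entry.
slave_resync_sql mysqld-bin.000610 505678724
```

Printing the statements instead of executing them keeps the master coordinates visible for a second check before anything is applied on ippc19.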

14:16 Bill: Experimenting with pantasks parameters. In the update pantasks, changed LOADEXEC from the default 5 seconds to 20 seconds and upped POLLLIMIT from 32 to 64. The goal is to see whether the database load is noticeably reduced.
14:18 Set LOADEXEC to 30 and POLLLIMIT to 32 in the cleanup pantasks. The previous polllimit was 200, which is silly since the jobs take a long time.
14:34 It turns out that LOADEXEC is applied when the task is created and is not subsequently updated. Restarted the update pantasks.
17:11 CZW: After wondering why none of the lapRuns were completing, I tracked down a stuck magicRun (magic_id = 204696). Since I could see jobs to do by calling magictool -toprocess, I tried resetting the book (magic.reset) in the distribution pantasks. This appears to have unstuck this magicRun.

18:00 Mark: 16 remain faulted in magicDS for ThreePi.nightlyscience from missing Skychip.psf table in the diffim CMFs.

neb://ipp043.0/gpc1/destreak/ThreePi.nightlyscience/392499/diff/392499.mds.697053.165578.skycell.1338.011.log

Regenerated CMF with

perl ~ipp/src/ipp-20110622/tools/rundiffskycell.pl --redirect-output --diff_id 165578 --skycell_id skycell.1338.011

with same odd/bad result.

21:00 Reran sample ppSub (skycell.1338.011,diff_id=165578) redirected to local directory from the 3PI magicDS failing set with missing SkyChip.psf table. Produced table, with few (7) detections.

21:48 CZW: I merged the registration bugfix into the working branch. This of course means that a bug popped up elsewhere. summit_copy.pl exited with a "CRASH" state, which seems to have left a bad entry in the book. Manually running the commands that crashed:

summit_copy.pl --uri http://conductor.ifa.hawaii.edu/ds/gpc1/o5819g0063o/o5819g0063o36.fits --filename neb://ipp044.0/gpc1/20110915/o5819g0063o/o5819g0063o.ota36.fits --summit_id 388294 --exp_name o5819g0063o --inst gpc1 --telescope ps1 --class chip --class_id ota36 --bytes 51831360 --md5 5ce24a1da3a713695bfabd72fa6df8c8 --dbname gpc1 --timeout 600 --verbose --copies 2 --compress --nebulous
summit_copy.pl --uri http://conductor.ifa.hawaii.edu/ds/gpc1/o5819g0065o/o5819g0065o04.fits --filename neb://ipp006.0/gpc1/20110915/o5819g0065o/o5819g0065o.ota04.fits --summit_id 388296 --exp_name o5819g0065o --inst gpc1 --telescope ps1 --class chip --class_id ota04 --bytes 49432320 --md5 6a33bfd0cab134dcfb2563431cabef3f --dbname gpc1 --timeout 600 --verbose --copies 2 --compress --nebulous

cleared up the problems, and burntool started running and finishing registration for subsequent exposures.

Thursday : 2011-09-15

Serge is czar

09:00 Serge: nightly processing finished but a few 3pi at destreak stage. Reverted 4 errors in publishing.

10:00 Mark: still tracking down the 16 or so 3PI magicDS failures such as

failure for: magic_destreak.pl --magic_ds_id 697058 --camera GPC1 --exp_id 392505 --streaks_path_base neb://any/gpc1/20110914/o5818g0543o.392505/o5818g0543o.392505.mgc.204600 --inv_streaks_path_base neb://any/gpc1/20110914/o5818g0566o.392528/o5818g0566o.392528.mgc.204611 --streaks NULL --inv_streaks NULL --stage diff --stage_id 165583 --component skycell.1519.088 --uri NULL --path_base neb://ipp009.0/gpc1/ThreePi.nt/2011/09/14/RINGS.V3/skycell.1519.088/RINGS.V3.skycell.1519.088.dif.165583 --cam_path_base NULL --cam_reduction NULL --outroot neb://ipp009.0/gpc1/destreak/ThreePi.nightlyscience/392505/diff --logfile neb://ipp009.0/gpc1/destreak/ThreePi.nightlyscience/392505/diff/392505.mds.697058.165583.skycell.1519.088.log --recoveryroot neb://any/gpc1/destreak/recover/ThreePi.nightlyscience --replace T --magicked 0 --run-state new --dbname gpc1 --verbose

failed to read table in /data/ipp009.0/nebulous/69/f3/1297857751.gpc1:ThreePi.nt:2011:09:14:RINGS.V3:skycell.1519.088:RINGS.V3.skycell.1519.088.dif.165583.cmf. Chris suggested renaming the .cmf to .cmf.bak. So ran

neb-mv neb://ipp009.0/gpc1/ThreePi.nt/2011/09/14/RINGS.V3/skycell.1519.088/RINGS.V3.skycell.1519.088.dif.165583.cmf neb://ipp009.0/gpc1/ThreePi.nt/2011/09/14/RINGS.V3/skycell.1519.088/RINGS.V3.skycell.1519.088.dif.165583.cmf.bak

and reran

perl ~ipp/src/ipp-20110622/tools/rundiffskycell.pl --redirect-output --diff_id 165583 --skycell_id skycell.1519.088

This still failed to produce a detection table. Running it multiple times did eventually produce one, with 3 detections. So it appears that when there are 0 detections an empty table is not being written, and it is not clear why. The following is a list of the 17 that originally failed magicDS at the diff stage; 3 are noted as still failing because of the missing detection table.

--stage_id 165557 --component skycell.2414.004
--stage_id 165586 --component skycell.1340.067
--stage_id 165583 --component skycell.1519.088
--stage_id 165519 --component skycell.2411.062
--stage_id 165543 --component skycell.2505.023
--stage_id 165527 --component skycell.2411.052
--stage_id 165580 --component skycell.1429.028
--stage_id 165574 --component skycell.1517.056
--stage_id 165610 --component skycell.0982.016 -- still problem
--stage_id 165609 --component skycell.0896.063
--stage_id 164809 --component skycell.1693.014 -- still problem
--stage_id 165551 --component skycell.2622.095 -- still problem
--stage_id 164754 --component skycell.2019.073
--stage_id 165638 --component skycell.1050.064
--stage_id 165537 --component skycell.2576.075
--stage_id 164754 --component skycell.2019.084
--stage_id 165547 --component skycell.2543.053

Even once the diff detection table is fixed, it is not clear whether the reverts will succeed.

11:50 Serge: stopped, shutdown and restarted distribution
11:58 Heather: restarted stack. It crashed on me; I suspect it was because I was running 'status' too frequently. I also added a small number of stacks for test: MD09.haf
14:23 CZW: restarted stdscience, partially to see if that would kick processing rates, partially to add a rate adjustment for the LAP monitor stage to see if that is overloading the database.
16:06 Serge: killed ppMops on ippc12 (-exp_name o5814g0052o)
17:30 Bill: set all runs with label ps_ud% to goto_cleaned. This should free up about 900 chip runs worth of space

Friday : 2011-09-16

Serge is czar

No observation last night
10:00 Serge: All processing stopped for ipp020 mobo replacement
10:50 Serge/Mark: can't connect to ipp044

11:00 Mark/Bill: found a LAP chip fault caused by being unable to access /data/ipp053.0/nebulous/c6/dc/423950490.gpc1:20100830:o5438g0382o:o5438g0382o.ota76.fits. Checked the directory and copied the file from ippb00

ls -l /data/ipp053.0/nebulous/c6/dc
cp /data/ippb00.1/nebulous/c6/dc/1128537216.gpc1:20100830:o5438g0382o:o5438g0382o.ota76.fits /data/ipp053.0/nebulous/c6/dc/423950490.gpc1:20100830:o5438g0382o:o5438g0382o.ota76.fits

11:10 Serge: stopped all apache servers
12:00 Cindy is finished with ipp020 / Serge is not finished with mysql
13:33 Serge: gpc1 optimization is finished
13:48 Serge: stopped nebulous optimization and restarted (in this order): replication slaves; apache servers; czarpoll and roboczar; pantasks.
14:40 Serge: reports for optimization are attached to this page. Optimization of gpc1 lasted 2 hours 51 minutes. It seems that the time required to optimize a table is roughly 1.5 times what it was 6 months ago (see attachments at http://svn.pan-starrs.ifa.hawaii.edu/trac/ipp/wiki/201103_Optimization)

Saturday : 2011-09-17

09:00 Mark: registration stuck with ~80 exposures left. Regpeek.pl reported o5821g0478o.ota11.fits was in state check_burntool and pantasks.stdout.log reported a config_error for exp_id 393399. Ran regtool and registration is now finishing up.
```
regtool -updateprocessedimfile -exp_id  393399 -class_id XY11 -set_state pending_burntool -dbname gpc1
```
11:00 LAP chip fault, funpack error on /data/ipp053.0/nebulous/8b/3b/423999272.gpc1:20100830:o5438g0438o:o5438g0438o.ota76.fits, missing file neb://ipp053.0/gpc1/20100830/o5438g0438o/o5438g0438o.ota76.fits. Ran
```
cp /data/ippb01.0/nebulous/8b/3b/880997586.gpc1:20100830:o5438g0438o:o5438g0438o.ota76.fits /data/ipp053.0/nebulous/8b/3b/423999272.gpc1:20100830:o5438g0438o:o5438g0438o.ota76.fits
```
19:50 set ipp033 to repair after Cindy reported it in a degraded state.
22:50 burntool/registration stalled for the past ~30 min; regpeek reported neb://ipp007.0/gpc1/20110918/o5822g0237o/o5822g0237o.ota05.fits and registration/pantasks.stdout.log reported a system failure for: register_imfile.pl --exp_id 39374. Catching up after running
```
regtool -updateprocessedimfile -exp_id  393746 -class_id XY05 -set_state pending_burntool -dbname gpc1
```
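
Both stalls on this day were cleared with the same regtool call, so the pattern can be sketched as a helper that builds the command line from the exposure id and the stuck OTA. This is only a sketch: the function name `pending_burntool_cmd` is hypothetical, the `otaNN` to `XYNN` mapping is inferred from the entries in this log rather than from regtool documentation, and the command is printed for review instead of being run.

```shell
#!/bin/sh
# Hypothetical helper: print the regtool command that resets one imfile to
# pending_burntool. Arguments: exp_id, OTA name as it appears in the file
# name (e.g. ota11). The otaNN -> XYNN mapping is an assumption inferred
# from the log entries.
pending_burntool_cmd() {
    exp_id=$1
    ota=$2
    # Accept only names of the form ota followed by two digits.
    case $ota in
        ota[0-9][0-9]) ;;
        *)
            echo "expected an OTA name like ota11, got: $ota" >&2
            return 1
            ;;
    esac
    # Strip the leading "ota" and prefix "XY" to form the class_id.
    class_id="XY${ota#ota}"
    printf 'regtool -updateprocessedimfile -exp_id %s -class_id %s -set_state pending_burntool -dbname gpc1\n' \
        "$exp_id" "$class_id"
}

# The 09:00 case from this day: exp_id 393399, file o5821g0478o.ota11.fits.
pending_burntool_cmd 393399 ota11
```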

Sunday : 2011-09-18

08:00 Mark: removed ipp033 from processing and nebulous so Cindy could reboot and re-seat a disk.
08:30 ipp033 pressed back into service.

09:30 another case of 0 size file with 3 copies

-rw-rw-r-- 1 apache 23503680 Jun 17 05:20 /data/ipp016.0/nebulous/aa/0b/1015497360.gpc1:20110617:o5729g0569o:o5729g0569o.ota03.fits
-rw-rw-r-- 1 apache 0 Jul 22 10:46 /data/ipp006.0/nebulous/aa/0b/1123250971.gpc1:20110617:o5729g0569o:o5729g0569o.ota03.fits
-rw-rw-r-- 1 apache 23503680 Jul 27 04:27 /data/ippb00.0/nebulous/aa/0b/1136967851.gpc1:20110617:o5729g0569o:o5729g0569o.ota03.fits

copied over 0 size with valid copy for now

cp /data/ipp016.0/nebulous/aa/0b/1015497360.gpc1:20110617:o5729g0569o:o5729g0569o.ota03.fits /data/ipp006.0/nebulous/aa/0b/1123250971.gpc1:20110617:o5729g0569o:o5729g0569o.ota03.fits
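
The manual check above (list the copies, spot the zero-size one, copy a good replica over it) can be sketched as a helper that takes the replica paths and prints the cp commands it would suggest. This is a minimal sketch under stated assumptions: the function name `plan_replica_repair` is hypothetical, any non-empty replica is treated as valid without comparing sizes or checksums, and paths are assumed to contain no whitespace.

```shell
#!/bin/sh
# Hypothetical helper: given the paths of all replicas of one file, print a
# cp command for each zero-length replica, using the first non-empty replica
# as the source. Nothing is copied; the printed commands are for review.
plan_replica_repair() {
    good=""
    for f in "$@"; do
        # -s is true only for an existing file with size greater than zero.
        if [ -s "$f" ]; then
            good=$f
            break
        fi
    done
    if [ -z "$good" ]; then
        echo "no non-empty replica found" >&2
        return 1
    fi
    for f in "$@"; do
        # A regular file that exists but is empty is a repair candidate.
        if [ -f "$f" ] && [ ! -s "$f" ]; then
            printf 'cp %s %s\n' "$good" "$f"
        fi
    done
}
```

Called with the three paths listed above, it would be expected to suggest overwriting the ipp006.0 copy from the ipp016.0 copy, which matches the cp that was run by hand. Because "non-empty" is the only validity test, a truncated replica would not be caught; comparing sizes or checksums across copies would be a sensible extension.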

Last modified on Oct 25, 2011, 11:53:27 AM

Attachments (2)

gpc1_optimization_20110916 (32.8 KB ) - added by Serge CHASTEL 15 years ago.
neb_optimization_20110916 (3.1 KB ) - added by Serge CHASTEL 15 years ago.
