Miscellaneous notes
rawcheck
The replication pantasks is now running the rawcheck.pro/rawcheck.pl scripts. The goal of this task is to scan the GPC1 database for raw exposure data and pass each exp_id to rawcheck.pl, which identifies all the raw FITS files in nebulous and "does the right thing" such that upon completion:
- One copy of the FITS file is on the cluster.
- One copy of the FITS file is on an ippbXX node.
- No other copies exist (third+ copies are culled away).
This is unfortunately a bit clunky, as there is no state information to iterate over cleanly (and regtool lacks an -exp_id_min option). Instead, the task uses the dateobs field to select jobs to process. The date is initially set to NULL, and the date from the final page of the rawcheckPending book is then used as the next starting date. I've run into cases where the camera took a large number of exposures with the same dateobs value, which causes the iteration to get stuck; this led me to push up the poll option to get over these bumps. However, this seems to have the side effect that the "final page" is not necessarily the one with the largest date, and with large numbers of pages in the book, the date just floats around. Because of this, I've tried to keep the poll down to ~80-100, which is still enough to jump the remaining bumps (70 is the largest number of exposures sharing a dateobs after 2011-07-22).
The current date can be shown and set with:
rawcheck.show.date
2014-01-02
rawcheck.set.date 2011-07-22
I suspect the slow progress is due to redoing the same time range repeatedly. This is caused by the task not clearing pages until there are 2000 running jobs/book pages. I've changed the task to clear at 200 in the hope that it will work better. The date prior to restarting pantasks was "2011-08-18T07:10:45.000000".
convolved stack cleaning
The convolved stack cleaning is running in my own pantasks at /data/ippc18.0/home/watersc1/clean_convolved_stacks.20131210 (symlinked to /data/ippc18.0/home/watersc1/this_is_where_pantasks_lives). This task iterates through a list of commands (PV1.cmds) and passes them to the pclients. Reading from a list was necessary because driving this via stacktool commands was prohibitively expensive; instead, I construct the command list from simple database queries. The current command list only covers the PV1 stacks.
Although I have changes to stack_skycell.pl that should prevent convolved stacks from being constructed in the future, I have not made this the default, nor have I pushed the change into the working tag.
single host colonization
This is a side effort to reproduce the first two points of the rawcheck task, but using only a single host as the source of exposures. The script is located in tools/neb_rawOTA_host_scan.pl. I've just discovered that the ipp user does not have the necessary environment variables set: NEB_USER, NEB_PASS, and NEB_DBSERVER need to be set correctly (NEB_DBSERVER is nebulous.ipp.ifa.hawaii.edu).
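A quick pre-flight check for those variables might look like the following (a Python sketch for illustration; the script itself is Perl):

```python
# Illustrative check that the nebulous environment variables the ipp user
# was missing are actually set before starting a scan.
import os

def missing_neb_env():
    """Return the names of any unset/empty nebulous variables."""
    required = ("NEB_USER", "NEB_PASS", "NEB_DBSERVER")
    return [name for name in required if not os.environ.get(name)]
```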
With those variables set, the command can be run with:
neb_rawOTA_host_scan.pl --host ipp0XY --limit 10000 --min 0 --continue
This scans host ipp0XY for instances that match /ota...fits/. If one is found and has a user.copies xattr greater than 1, the script checks the other instances of the storage_object to see whether one of them is on an ippbXX node. If not, a neb-replicate command is issued to put one on a randomly selected ippb0[4-5] volume. A limit of 10000 seems to work best: it avoids lots of database interactions while still returning from the database quickly. The min value is the ins_id at which to start the search. After scanning the 10000 entries, the script prints a suggested "next iteration" command; with the --continue option, the next iteration is instead begun internally. I've left the print statement in so that after stopping the script, it is easy to determine the next start value.
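The per-instance decision the script makes can be sketched as follows. The helper name and return values are invented for illustration; the real script talks to the nebulous database directly.

```python
# Hypothetical sketch of the per-instance decision in
# neb_rawOTA_host_scan.pl (names invented for illustration).
import random
import re

OTA_RE = re.compile(r"ota...fits")  # same pattern the scan uses

def plan_action(path, copies, other_instance_hosts):
    """Decide what to do for one instance found on the scanned host.

    Returns "skip" when the file is not a raw OTA or has no extra copies,
    "ok" when another copy already lives on a backup (ippbXX) node, and
    otherwise a neb-replicate target on a random ippb04/ippb05 volume.
    """
    if not OTA_RE.search(path) or copies <= 1:
        return "skip"
    if any(h.startswith("ippb") for h in other_instance_hosts):
        return "ok"
    return f"neb-replicate to ippb0{random.choice((4, 5))}"
```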
The current progress is:
| host | state | ins_id | rerun | rerun status |
| ipp033 | done | | 2014-01-02 | done |
| ipp034 | done | 4268003345 | 2014-01-02 | done |
| ipp035 | done | | 2014-01-02 | done |
| ipp036 | done | | 2014-01-02 | done |
| ipp037 | done | | 2014-01-02 | done |
| ipp038 | done | | 2014-01-02 | done |
| ipp039 | done | | 2014-01-02 | done |
| ipp040 | done | | 2014-01-02 | done |
| ipp041 | done | 3151002337 | 2014-01-02 | done |
| ipp042 | done | 2705684563 | 2014-01-02 | done |
| ipp043 | done | 2695734708 | 2014-01-02 | done |
| ipp044 | done | 3025719620 | 2014-01-02 | done |
| ipp045 | done | | 2014-01-02 | done |
| ipp046 | done | 2004251097 | 2014-01-02 | done |
| ipp047 | done | 1968815686 | 2014-01-02 | done |
| ipp048 | done | 0 | 2014-01-02 | done |
| ipp049 | done | | | done |
| ipp050 | done | | | done |
| ipp051 | done | 0 | 2014-01-02 | done |
| ipp052 | done | 2955744679 | | done |
| ipp053 | done | 3074357154 | | done |
| ipp021 | pause | 3984316153 | | |
| ipp017 | run | 0 | run | 3788941610 |
| ipp030 | run | 0 | run | 3444534397 |
| ipp004 | no nebulous data | | | done |
| ipp005 | done | 0 | | done |
| ipp006 | run | 0 | 3553498018 | |
| ipp007 | run | 0 | 3153700875 | |
| ipp009 | run | 0 | 2837354637 | |
| ipp010 | run | 0 | 2955669842 | |
| ipp011 | run | 0 | 2950501963 | |
| ipp015 | run | 0 | 2738594316 | |
| ipp012 | run | 0 | 2351997824 | |
| ipp013 | run | 0 | 2960420621 | |
| ipp008 | run | 0 | 2950497692 | |
| ipp016 | run | 0 | 3410988271 | |
| ipp014 | run | 0 | 2404871920 | |
| ipp018 | run | 0 | 3414468113 | |
| ipp019 | run | 0 | 3409930394 | |
| ipp023 | run | 0 | 1961967219 | |
| ipp024 | run | 0 | 1859606611 | |
| ipp025 | run | 0 | 2000339491 | |
| ipp026 | run | 0 | 3014419607 | |
| ipp027 | run | 0 | 2012018314 | |
| ipp028 | run | 0 | 1767599921 | |
| ipp029 | run | 0 | 2706686377 | |
| stsci16.0 | moved | 0 | 3559337193 | |
| stsci16.1 | moved | 0 | 3558194527 | |
| stsci16.2 | moved | 0 | 3558219673 | |
| stsci17.0 | moved | 0 | 3559524787 | |
| stsci17.1 | moved | 0 | 3558176515 | |
| stsci17.2 | moved | 0 | 3558196966 | |
| stsci18.0 | moved | 0 | 3559598644 | |
| stsci18.1 | moved | 0 | 3558206552 | |
| stsci18.2 | moved | 0 | 3558171508 | |
| stsci19.0 | moved | 0 | 3559215917 | |
| stsci19.1 | moved | 0 | 3558181118 | |
| stsci19.2 | moved | 0 | 3558197059 | |
| stsci10.0 | moved | 0 | 3563650506 | |
| stsci10.1 | moved | 0 | 3562834595 | |
| stsci10.2 | moved | 0 | 3562850632 | |
| stsci11.0 | moved | 0 | 3558241974 | |
| stsci11.1 | moved | 0 | 3557812251 | |
| stsci11.2 | moved | 0 | 3557808722 | |
| stsci12.0 | moved | 0 | 3557781636 | |
| stsci12.1 | moved | 0 | 3557781636 | |
| stsci12.2 | moved | 0 | 3557800135 | |
| stsci13.0 | moved | 0 | 3557801535 | |
| stsci13.1 | moved | 0 | 3557796271 | |
| stsci13.2 | moved | 0 | 3559222108 | |
| stsci14.0 | moved | 0 | 3557787141 | |
| stsci14.1 | moved | 0 | 3557802774 | |
| stsci14.2 | moved | 0 | 3559408358 | |
| stsci15.0 | moved | 0 | 3559023072 | |
| stsci15.1 | moved | 0 | 3557783580 | |
| stsci15.2 | moved | 0 | 3557798567 | |
| stsci06.0 | moved | 0 | ||
| stsci06.1 | moved | 0 | ||
| stsci06.2 | moved | 0 | ||
| stsci07.0 | moved | 0 | ||
| stsci07.1 | moved | 0 | ||
| stsci07.2 | moved | 0 | ||
| stsci08.0 | moved | 0 | ||
| stsci08.1 | moved | 0 | ||
| stsci08.2 | moved | 0 | ||
| stsci09.0 | moved | 0 | ||
| stsci09.1 | moved | 0 | ||
| stsci09.2 | moved | 0 |
The paused/new hosts are in that state because running more than ~5-6 of these scans at once seems to cause issues for the entire cluster: the replications put the disks under slightly more load than they can handle while staying in processing. The other issue is that the scan goes very quickly until ~2012, at which point the previous backup nodes were full and second copies began being placed on other storage nodes. The upside is that when the rawcheck task gets to this point, it will have less work to do on these nodes, as they will already have their required backup-node instance.
extra cleanup
I have a label, goto_cleaned_redo, that I've been adding to the cleanup pantasks when it doesn't have any other jobs to do. This re-cleans all the old chip data, with the goal of removing chip CMF files that have a valid SMF file. This should recover a reasonable amount of disk space, but the operation is slow (there are currently 360k runs to work through). Because this label floods out the standard cleanups (which remove images and are therefore more effective at freeing space), it can't be left in the pantasks permanently.
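The selection rule behind the re-clean can be sketched like this. It is an illustrative Python stand-in (names and the row shape are invented, not the GPC1 schema): a chip CMF is only a removal candidate when the corresponding SMF is valid.

```python
# Illustrative sketch of the goto_cleaned_redo selection rule: a chip-stage
# CMF may be removed only when its camera-stage SMF exists and is valid.
# (Row shape is a hypothetical stand-in for the actual GPC1 schema.)

def cmfs_to_remove(chip_runs):
    """chip_runs: iterable of (chip_id, cmf_path, smf_ok) tuples.

    Keep the CMF whenever the SMF is missing/invalid, since the CMF is
    then the only surviving copy of the per-chip measurements.
    """
    return [cmf for _chip_id, cmf, smf_ok in chip_runs if smf_ok]
```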
