Miscellaneous notes
rawcheck
The replication pantasks is now running the rawcheck.pro/rawcheck.pl scripts. The goal of this task is to scan the GPC1 database for raw exposure data and pass each exp_id to rawcheck.pl, which identifies all the raw FITS files in nebulous and "does the right thing" such that upon completion:
- One copy of the FITS file is on the cluster.
- One copy of the FITS file is on an ippbXX node.
- No other copies exist (third+ copies are culled away).
This is unfortunately a bit clunky, as there is no state information to iterate over cleanly (and regtool lacks an -exp_id_min option). Instead, the task uses the dateobs field to select jobs to process. The date is initially set to NULL, and the date from the final page of the rawcheckPending book is then used as the next starting date. I've run into cases where the camera took a large number of exposures with the same dateobs value, which causes the iteration to get stuck; this led me to push up the poll option to get over these bumps. However, this seems to have the side effect that the "final page" is not necessarily the one with the largest date, and with large numbers of pages in the book, the date just floats around. Because of this, I've tried to keep the poll down to ~80-100, which is still enough to jump the remaining bumps (70 is the largest number of exposures sharing a dateobs after 2011-07-22).
The current date can be shown and set with:
rawcheck.show.date
2014-01-02
rawcheck.set.date 2011-07-22
I suspect the slow progress is due to redoing the same time range repeatedly. This is caused by the task not clearing pages until there are 2000 running jobs/book pages. I've changed the task to clear at 200 in the hope that it will work better. The date prior to restarting pantasks was "2011-08-18T07:10:45.000000".
convolved stack cleaning
The convolved stack cleaning is running in my own pantasks at /data/ippc18.0/home/watersc1/clean_convolved_stacks.20131210 (symlinked to /data/ippc18.0/home/watersc1/this_is_where_pantasks_lives). This task iterates through a list of commands (PV1.cmds) and passes them to the pclients. Reading from a list was necessary because driving this via stacktool commands was prohibitively expensive; instead, I construct the command list from simple database queries. The current command list only covers the PV1 stacks.
Although I have changes to stack_skycell.pl that should prevent convolved stacks from being constructed in the future, I have not made this the default, nor have I pushed the change into the working tag.
single host colonization
This is a side effort to reproduce the first two points of the rawcheck task, but using only a single host as the source of exposures. The script is located in tools/neb_rawOTA_host_scan.pl. I've just discovered that the ipp user does not have the necessary environment variables set: NEB_USER, NEB_PASS, and NEB_DBSERVER need to be set correctly (NEB_DBSERVER is nebulous.ipp.ifa.hawaii.edu).
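A quick pre-flight check for those variables might look like the following (a Python sketch for illustration; the script itself is Perl):

```python
# Illustrative check that the nebulous environment variables the ipp user
# was missing are actually set before starting a scan.
import os

def missing_neb_env():
    """Return the names of any unset/empty nebulous variables."""
    required = ("NEB_USER", "NEB_PASS", "NEB_DBSERVER")
    return [name for name in required if not os.environ.get(name)]
```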
With those variables set, the command can be run with:
neb_rawOTA_host_scan.pl --host ipp0XY --limit 10000 --min 0 --continue
This scans host ipp0XY for instances that match /ota...fits/. If one is found and has a user.copies xattr greater than 1, the script checks the other instances of the storage_object to see whether one of them is on an ippbXX node. If not, a neb-replicate command is issued to put one on a randomly selected ippb0[4-5] volume. A limit of 10000 seems to work best: it avoids lots of database interactions while still returning from the database quickly. The min value is the ins_id at which to start the search. After scanning the 10000 entries, the script prints a suggested "next iteration" command; with the --continue option, the next iteration is instead begun internally. I've left the print statement in so that after stopping the script, it is easy to determine the next start value.
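The per-instance decision the script makes can be sketched as follows. The helper name and return values are invented for illustration; the real script talks to the nebulous database directly.

```python
# Hypothetical sketch of the per-instance decision in
# neb_rawOTA_host_scan.pl (names invented for illustration).
import random
import re

OTA_RE = re.compile(r"ota...fits")  # same pattern the scan uses

def plan_action(path, copies, other_instance_hosts):
    """Decide what to do for one instance found on the scanned host.

    Returns "skip" when the file is not a raw OTA or has no extra copies,
    "ok" when another copy already lives on a backup (ippbXX) node, and
    otherwise a neb-replicate target on a random ippb04/ippb05 volume.
    """
    if not OTA_RE.search(path) or copies <= 1:
        return "skip"
    if any(h.startswith("ippb") for h in other_instance_hosts):
        return "ok"
    return f"neb-replicate to ippb0{random.choice((4, 5))}"
```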
The current progress is:
| host | state | ins_id | rerun | rerun status |
| ipp033 | done | | 2014-01-02 | done |
| ipp034 | done | 4268003345 | 2014-01-02 | done |
| ipp035 | done | | 2014-01-02 | done |
| ipp036 | done | | 2014-01-02 | done |
| ipp037 | done | | 2014-01-02 | done |
| ipp038 | done | | 2014-01-02 | done |
| ipp039 | done | | 2014-01-02 | done |
| ipp040 | done | | 2014-01-02 | done |
| ipp041 | done | 3151002337 | 2014-01-02 | done |
| ipp042 | done | 2705684563 | 2014-01-02 | done |
| ipp043 | done | 2695734708 | 2014-01-02 | done |
| ipp044 | done | 3025719620 | 2014-01-02 | done |
| ipp045 | done | | 2014-01-02 | done |
| ipp046 | done | 2004251097 | 2014-01-02 | done |
| ipp047 | done | 1968815686 | 2014-01-02 | done |
| ipp048 | done | 0 | 2014-01-02 | done |
| ipp049 | done | | | done |
| ipp050 | done | | | done |
| ipp051 | done | 0 | 2014-01-02 | done |
| ipp052 | done | 2955744679 | | done |
| ipp053 | done | 3074357154 | | done |
| ipp021 | pause | 3984316153 | | |
| ipp017 | run | 0 | run | 3788941610 |
| ipp030 | run | 0 | run | 3444534397 |
| ipp004 | no nebulous data | | | done |
| ipp005 | done | 0 | | done |
| ipp006 | run | 0 | 3553498018 | |
| ipp007 | run | 0 | 3153700875 | |
| ipp009 | run | 0 | 2837354637 | |
| ipp010 | run | 0 | 2955669842 | |
| ipp011 | run | 0 | 2950501963 | |
| ipp015 | run | 0 | 2738594316 | |
| ipp012 | run | 0 | 2351997824 | |
| ipp013 | run | 0 | 2960420621 | |
| ipp008 | run | 0 | 2950497692 | |
| ipp016 | run | 0 | 3410988271 | |
| ipp014 | run | 0 | 2404871920 | |
| ipp018 | run | 0 | 3414468113 | |
| ipp019 | run | 0 | 3409930394 | |
| ipp023 | run | 0 | 1961967219 | |
| ipp024 | run | 0 | 1859606611 | |
| ipp025 | run | 0 | 2000339491 | |
| ipp026 | run | 0 | 3014419607 | |
| ipp027 | run | 0 | 2012018314 | |
| ipp028 | run | 0 | 1767599921 | |
| ipp029 | run | 0 | 2706686377 | |
| stsci16.0 | moved | 0 | 3559337193 | |
| stsci16.1 | moved | 0 | 3558194527 | |
| stsci16.2 | moved | 0 | 3558219673 | |
| stsci17.0 | moved | 0 | 3559524787 | |
| stsci17.1 | moved | 0 | 3558176515 | |
| stsci17.2 | moved | 0 | 3558196966 | |
| stsci18.0 | moved | 0 | 3559598644 | |
| stsci18.1 | moved | 0 | 3558206552 | |
| stsci18.2 | moved | 0 | 3558171508 | |
| stsci19.0 | moved | 0 | 3559215917 | |
| stsci19.1 | moved | 0 | 3558181118 | |
| stsci19.2 | moved | 0 | 3558197059 | |
| stsci10.0 | moved | 0 | 3563650506 | |
| stsci10.1 | moved | 0 | 3562834595 | |
| stsci10.2 | moved | 0 | 3562850632 | |
| stsci11.0 | moved | 0 | 3558241974 | |
| stsci11.1 | moved | 0 | 3557812251 | |
| stsci11.2 | moved | 0 | 3557808722 | |
| stsci12.0 | moved | 0 | 3557781636 | |
| stsci12.1 | moved | 0 | 3557781636 | |
| stsci12.2 | moved | 0 | 3557800135 | |
| stsci13.0 | moved | 0 | 3557801535 | |
| stsci13.1 | moved | 0 | 3557796271 | |
| stsci13.2 | moved | 0 | 3559222108 | |
| stsci14.0 | moved | 0 | 3557787141 | |
| stsci14.1 | moved | 0 | 3557802774 | |
| stsci14.2 | moved | 0 | 3559408358 | |
| stsci15.0 | moved | 0 | 3559023072 | |
| stsci15.1 | moved | 0 | 3557783580 | |
| stsci15.2 | moved | 0 | 3557798567 | |
| stsci06.0 | moved | 0 | ||
| stsci06.1 | moved | 0 | ||
| stsci06.2 | moved | 0 | ||
| stsci07.0 | moved | 0 | ||
| stsci07.1 | moved | 0 | ||
| stsci07.2 | moved | 0 | ||
| stsci08.0 | moved | 0 | ||
| stsci08.1 | moved | 0 | ||
| stsci08.2 | moved | 0 | ||
| stsci09.0 | moved | 0 | ||
| stsci09.1 | moved | 0 | ||
| stsci09.2 | moved | 0 |
The paused/new hosts are in that state because running more than ~5-6 of these scans at once seems to cause issues for the entire cluster: the replications put the disks under slightly more load than they can handle while staying in processing. The other issue is that the scan goes very quickly until ~2012, at which point the previous backup nodes were full and second copies began being placed on other storage nodes. The upside is that when the rawcheck task gets to this point, it will have less work to do on these nodes, as they will already have their required backup-node instance.
extra cleanup
I have a label, goto_cleaned_redo, that I've been adding to the cleanup pantasks when it doesn't have any other jobs to do. This re-cleans all the old chip data, with the goal of removing chip CMF files that have a valid SMF file. This should recover a reasonable amount of disk space, but the operation is slow (there are currently 360k runs to work through). Because this label floods out the standard cleanups (which remove images and are therefore more effective at freeing space), it can't be left in the pantasks permanently.
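The selection rule behind the re-clean can be sketched like this. It is an illustrative Python stand-in (names and the row shape are invented, not the GPC1 schema): a chip CMF is only a removal candidate when the corresponding SMF is valid.

```python
# Illustrative sketch of the goto_cleaned_redo selection rule: a chip-stage
# CMF may be removed only when its camera-stage SMF exists and is valid.
# (Row shape is a hypothetical stand-in for the actual GPC1 schema.)

def cmfs_to_remove(chip_runs):
    """chip_runs: iterable of (chip_id, cmf_path, smf_ok) tuples.

    Keep the CMF whenever the SMF is missing/invalid, since the CMF is
    then the only surviving copy of the per-chip measurements.
    """
    return [cmf for _chip_id, cmf, smf_ok in chip_runs if smf_ok]
```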
