wiki:CZW_notes

Miscellaneous notes

rawcheck

The replication pantasks is now running the rawcheck.pro/rawcheck.pl scripts. The goal of this task is to scan the GPC1 database for raw exposure data and pass each exposure's exp_id to the rawcheck.pl script, which identifies all the raw FITS files in nebulous and "does the right thing" such that upon completion:

  • One copy of the FITS file is on the cluster.
  • One copy of the FITS file is on an ippbXX node.
  • No other copies exist (third+ copies are culled away).

This is unfortunately a bit clunky, as there is no state information to iterate on cleanly (and regtool lacks an -exp_id_min option). Instead, the task uses the dateobs field to select jobs to process. The date is initially set to NULL, and each subsequent iteration pulls the date from the final page of the rawcheckPending book. I've run into issues where the camera takes a large number of exposures with the same dateobs value, which causes the iteration to get stuck; this led me to push up the poll option to get over these bumps. However, with large numbers of pages in the book, the "final page" is not necessarily the largest date, so the date just floats around. Because of this, I've tried to keep the poll down to ~80-100, which is enough to jump the remaining bumps (70 is the largest number of exposures with the same dateobs after 2011-07-22).

The current date can be shown and set with:

rawcheck.show.date
rawcheck.set.date 2011-07-22
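The date-based iteration described above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions: the exposure list and function name are illustrative, not the actual regtool/pantasks interfaces.

```python
# Sketch of the dateobs-based iteration used by the rawcheck task.
# Exposures are modelled as (exp_id, dateobs) tuples sorted by dateobs;
# the real task queries the GPC1 database.

def next_batch(exposures, current_date, poll):
    """Select up to `poll` exposures with dateobs >= current_date.

    Returns (batch, next_date); next_date is taken from the last entry
    of the batch, mimicking pulling the date from the final page of the
    rawcheckPending book.
    """
    pending = [e for e in exposures
               if current_date is None or e[1] >= current_date]
    batch = pending[:poll]
    if not batch:
        return [], current_date
    return batch, batch[-1][1]

# If more than `poll` exposures share one dateobs, next_date never
# advances and the iteration gets stuck -- hence keeping poll at
# ~80-100, above the worst observed duplicate count of 70.
```

This makes the stuck condition concrete: with five exposures sharing one dateobs and a poll of three, the returned date never moves past that dateobs.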

2014-01-02

I suspect the slow progress is due to re-doing the same time range. This is caused by the task not clearing pages until there are 2000 running jobs/book pages. I've changed the task to do this at 200 in the hope that it will work better. The date prior to restarting pantasks is "2011-08-18T07:10:45.000000".

convolved stack cleaning

The convolved stack cleaning is running in my own pantasks at /data/ippc18.0/home/watersc1/clean_convolved_stacks.20131210 (symlinked to /data/ippc18.0/home/watersc1/this_is_where_pantasks_lives). This task iterates through a list of commands (PV1.cmds) and passes those commands to the pclients. Reading from a prepared list was necessary because driving this via stacktool commands was prohibitively expensive; instead, I construct the command list from simple database queries. The current command list only covers the PV1 stacks.
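The list-driven approach can be sketched as below. This is a hedged sketch: the PV1.cmds format is assumed to be one command per line (blank lines and `#` comments skipped), which may not match the actual file exactly.

```python
# Sketch of the command-list driver for the convolved stack cleaning:
# read one command per line from a prepared file and hand each line to
# a worker, rather than issuing expensive stacktool queries per stack.
# The one-command-per-line format of PV1.cmds is an assumption.

def load_commands(path):
    """Return the non-empty, non-comment lines of a command file."""
    with open(path) as fh:
        return [line.strip() for line in fh
                if line.strip() and not line.lstrip().startswith("#")]
```

Each returned string would then be dispatched to a pclient as-is, so the expensive per-stack database lookups happen once, up front, when the list is built.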

Although I have changes in stack_skycell.pl that should prevent convolved stacks from being constructed in the future, I have not made this the default, nor have I pushed the change into the working tag.

single host colonization

This is a side effort to reproduce the first two points from the rawcheck task, but using only a single host as the source of exposures. The script is located in tools/neb_rawOTA_host_scan.pl. I've just now discovered that the ipp user does not have the necessary environment variables set: NEB_USER, NEB_PASS, and NEB_DBSERVER need to be set correctly (the database server is nebulous.ipp.ifa.hawaii.edu).

With those variables set, the command can be run with:

neb_rawOTA_host_scan.pl --host ipp0XY --limit 10000 --min 0 --continue

This scans host ipp0XY for instances that match /ota...fits/. If one is found and has a user.copies xattr greater than 1, the script checks the other instances of the storage_object and verifies that one of them is on an ippbXX node. If not, a neb-replicate command is issued to place one on a randomly selected ippb0[4-5] volume. A limit of 10000 seems to work best: it avoids lots of database interactions while still returning from the database quickly. The min value indicates the ins_id at which to start the search. Upon finishing a scan of the 10000 entries, the script prints a suggested "next iteration" command; with the --continue option, the next iteration is begun internally instead. I've left in the print statement so that after stopping the script, it is easy to determine the next start value.
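The per-instance decision described above can be sketched as follows. This is an illustrative Python sketch, not the Perl script itself: the instance record and sibling list are modelled as plain data, and the neb-replicate call is reduced to a returned action string.

```python
# Sketch of the per-instance decision made by neb_rawOTA_host_scan.pl.
# `instance` is a dict with 'uri' and 'copies' (the user.copies xattr);
# `siblings` lists the hosts holding the other instances of the same
# storage_object.  The data shapes here are assumptions; the real
# script works against nebulous directly.

import re

def decide(instance, siblings):
    """Return 'skip' or 'replicate' for one instance on the scanned host."""
    if not re.search(r'ota...fits', instance['uri']):
        return 'skip'        # not a raw OTA file (mirrors /ota...fits/)
    if instance['copies'] <= 1:
        return 'skip'        # no extra copies to account for
    if any(host.startswith('ippb') for host in siblings):
        return 'skip'        # a backup-node copy already exists
    return 'replicate'       # issue neb-replicate to an ippb0[4-5] volume
```

Only the last branch triggers any work, which is why the scan is fast once a host's exposures already have their backup-node instances.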

The current progress is:

host | state | ins_id | rerun | rerun status
ipp033 done 2014-01-02 done
ipp034 done 4268003345 2014-01-02 done
ipp035 done 2014-01-02 done
ipp036 done 2014-01-02 done
ipp037 done 2014-01-02 done
ipp038 done 2014-01-02 done
ipp039 done 2014-01-02 done
ipp040 done 2014-01-02 done
ipp041 done 3151002337 2014-01-02 done
ipp042 done 2705684563 2014-01-02 done
ipp043 done 2695734708 2014-01-02 done
ipp044 done 3025719620 2014-01-02 done
ipp045 done 2014-01-02 done
ipp046 done 2004251097 2014-01-02 done
ipp047 done 1968815686 2014-01-02 done
ipp048 done 0 2014-01-02 done
ipp049 done done
ipp050 done done
ipp051 done 0 2014-01-02 done
ipp052 done 2955744679 done
ipp053 done 3074357154 done
ipp021 pause 3984316153
ipp017 run 0 run 3788941610
ipp030 run 0 run 3444534397
ipp004 no nebulous data done
ipp005 done 0 done
ipp006 run 0 3553498018
ipp007 run 0 3153700875
ipp009 run 0 2837354637
ipp010 run 0 2955669842
ipp011 run 0 2950501963
ipp015 run 0 2738594316
ipp012 run 0 2351997824
ipp013 run 0 2960420621
ipp008 run 0 2950497692
ipp016 run 0 3410988271
ipp014 run 0 2404871920
ipp018 run 0 3414468113
ipp019 run 0 3409930394
ipp023 run 0 1961967219
ipp024 run 0 1859606611
ipp025 run 0 2000339491
ipp026 run 0 3014419607
ipp027 run 0 2012018314
ipp028 run 0 1767599921
ipp029 run 0 2706686377
stsci16.0 moved 0 3559337193
stsci16.1 moved 0 3558194527
stsci16.2 moved 0 3558219673
stsci17.0 moved 0 3559524787
stsci17.1 moved 0 3558176515
stsci17.2 moved 0 3558196966
stsci18.0 moved 0 3559598644
stsci18.1 moved 0 3558206552
stsci18.2 moved 0 3558171508
stsci19.0 moved 0 3559215917
stsci19.1 moved 0 3558181118
stsci19.2 moved 0 3558197059
stsci10.0 moved 0 3563650506
stsci10.1 moved 0 3562834595
stsci10.2 moved 0 3562850632
stsci11.0 moved 0 3558241974
stsci11.1 moved 0 3557812251
stsci11.2 moved 0 3557808722
stsci12.0 moved 0 3557781636
stsci12.1 moved 0 3557781636
stsci12.2 moved 0 3557800135
stsci13.0 moved 0 3557801535
stsci13.1 moved 0 3557796271
stsci13.2 moved 0 3559222108
stsci14.0 moved 0 3557787141
stsci14.1 moved 0 3557802774
stsci14.2 moved 0 3559408358
stsci15.0 moved 0 3559023072
stsci15.1 moved 0 3557783580
stsci15.2 moved 0 3557798567
stsci06.0 moved 0
stsci06.1 moved 0
stsci06.2 moved 0
stsci07.0 moved 0
stsci07.1 moved 0
stsci07.2 moved 0
stsci08.0 moved 0
stsci08.1 moved 0
stsci08.2 moved 0
stsci09.0 moved 0
stsci09.1 moved 0
stsci09.2 moved 0

The paused/new hosts are in that state because running more than ~5-6 of these scans at once causes the entire cluster to have issues; the replications put the disks under slightly more load than they can handle while staying in processing. The other issue is that the scan goes very quickly until ~2012, at which point the previous backup nodes were full and second copies began being placed on other storage nodes. The upside is that when the rawcheck task gets to that point, it will have less work to do on these nodes, as they will already have their required backup-node instance.

extra cleanup

I have a label goto_cleaned_redo that I've been adding to the cleanup pantasks when it doesn't have any other jobs to do. This is re-cleaning all the old chip data, with the goal of removing chip CMF files that have a valid SMF file. This should gain a reasonable amount of disk space, but the operation is slow (there are currently 360k runs to work through). Because this label will flood out the standard cleanups (which remove images and are therefore more effective at freeing space), it can't be left in the pantasks permanently.
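The selection rule behind this re-cleaning can be sketched as below. The run records are illustrative dicts; the real task works from the gpc1 database, so field names here are assumptions.

```python
# Sketch of the goto_cleaned_redo selection rule: a chip run's CMF
# files are removal candidates only when a valid SMF file exists for
# that run.  The 'exp_id'/'smf_valid' fields are illustrative stand-ins
# for the actual gpc1 database columns.

def cmf_removal_candidates(runs):
    """Return exp_ids whose chip CMFs can be dropped (valid SMF exists)."""
    return [r['exp_id'] for r in runs if r.get('smf_valid')]
```

Runs without a valid SMF keep their CMFs, so the pass only reclaims space where the camera-level catalog already supersedes the chip-level one.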

Last modified on Jan 23, 2014, 5:37:56 PM