Difference between revisions of "Parallel Enclosing Circle Algorithm"
Revision as of 16:29, 8 November 2017
Parallel Enclosing Circle Algorithm | |
---|---|
Project Information | |
Project Title | Parallel Enclosing Circle Algorithm |
Owner | Oliver Chang |
Start Date | July 31, 2017 |
Deadline | October 4, 2017 |
Primary Billing | |
Notes | |
Has project status | Complete |
Is dependent on | Enclosing Circle Algorithm |
Copyright © 2016 edegan.com. All Rights Reserved. |
A thin wrapper around the enclosing circle algorithm that allows instance-level parallelization.
This project consists of the Python files in E:\McNair\Projects\OliverLovesCircles\src\python.
Parallelization is implemented via Python 2's subprocess.Popen(), which is non-blocking and available in the standard library.
The Problem
Note that this is not the classical enclosing circle algorithm.
Rather, we seek to minimize the sum of enclosing circles containing at least n points.
Thus, multiple circles are allowed and inclusion in multiple circles is possible.
This algorithm has terrible time-performance characteristics, so we make the assumption that we can divide a large number of points with k-means and then solve those subproblems. In other words, we make the simplifying assumption that the Enclosing Circle Algorithm has Optimal Substructure.
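The splitting step can be sketched with a plain implementation of Lloyd's k-means algorithm. This is only an illustration under assumptions: the function name kmeans_split, its parameters, and the way k and the starting centers are chosen are hypothetical and are not taken from the project's source.

```python
import math
import random


def kmeans_split(points, k, iterations=20, initial_centers=None, seed=0):
    """Partition (x, y) points into k clusters with Lloyd's algorithm.

    Hypothetical helper: the project's real splitting code may differ.
    """
    rng = random.Random(seed)
    # Assumption: start from k randomly sampled points unless centers are given.
    centers = list(initial_centers) if initial_centers else rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: math.hypot(p[0] - centers[i][0], p[1] - centers[i][1]),
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                cx = sum(p[0] for p in cluster) / float(len(cluster))
                cy = sum(p[1] for p in cluster) / float(len(cluster))
                new_centers.append((cx, cy))
            else:
                # Keep the old center if a cluster ends up empty.
                new_centers.append(centers[i])
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return clusters
```

Each returned cluster would then be treated as its own subproblem for the enclosing circle algorithm, which is where the optimal-substructure assumption comes in.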
Parameters
- in circles.py:
  - PATH_SEPARATOR: the string that separates parts of the filename for both input and output files. For example, an input could look like "St. Louis#MO#2017#0.tsv" for PATH_SEPARATOR = '#'
  - ITERATIONS: the number of iterations to attempt for each k to find the minimum for that k
  - MIN_POINTS_PER_CIRCLE (AKA n): the minimum number of data points that must be included in a circle
- in vc_circles.py:
  - NUMBER_INSTANCES: number of parallel instances to run; assume no data races between instances
  - SWEEP_CYCLE_SECONDS: amount of time before removing completed jobs from the current job pool and adding new jobs if any files are left to process
  - TIMEOUT_MINUTES: maximum running time of a parallel instance of the algorithm
  - SPLIT_THRESHOLD: if a dataset has more than this threshold of data points, it will be split via k-means
  - EXECUTABLE_INSTANCE_PATH: the path to circles.py
  - OUTJOINER_INSTANCE_PATH: the path to outjoiner.py
  - DATA_DIRECTORY: the input directory
  - OUTPUT_DIRECTORY: the directory to write the outputs of circles.py to
  - GENERATE_REPORTS: whether or not to call outjoiner.py (writes reports on the output of circles.py)
  - REPORT_DIRECTORY: the directory to write reports to
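The interplay of NUMBER_INSTANCES, SWEEP_CYCLE_SECONDS, and TIMEOUT_MINUTES can be illustrated with a small job pool built on subprocess.Popen. This is a sketch, not the code in vc_circles.py: the function run_pool and its return value are hypothetical, and the real script may handle timeouts and failures differently.

```python
import subprocess
import time


def run_pool(commands, number_instances=2, sweep_cycle_seconds=1.0,
             timeout_minutes=1.0):
    """Run each command as a subprocess, at most number_instances at a time.

    Returns a dict mapping the command's index to its return code,
    or to None if the process was killed after exceeding the timeout.
    Hypothetical helper for illustration only.
    """
    pending = list(enumerate(commands))
    running = {}  # index -> (Popen object, start time)
    results = {}
    while pending or running:
        # Sweep: drop finished jobs and kill jobs that ran too long.
        for index, (proc, started) in list(running.items()):
            if proc.poll() is not None:  # poll() returns immediately
                results[index] = proc.returncode
                del running[index]
            elif time.time() - started > timeout_minutes * 60:
                proc.kill()
                proc.wait()
                results[index] = None
                del running[index]
        # Refill: start new jobs while there is spare capacity.
        while pending and len(running) < number_instances:
            index, command = pending.pop(0)
            running[index] = (subprocess.Popen(command), time.time())
        if running:
            time.sleep(sweep_cycle_seconds)
    return results
```

Because Popen returns as soon as the child process starts, the loop only has to wake up once per sweep cycle to check on its children, so a longer sweep cycle trades responsiveness for less polling.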
Structure and Usage
vc_circles.py
- What it does
  - If given a "master file" through argument infile, splits it into constituent data files, and stores them in DATA_DIRECTORY
  - Takes data files in DATA_DIRECTORY and calls circles.py in parallel for each of these data files, which writes its output files to OUTPUT_DIRECTORY
  - Takes output files in OUTPUT_DIRECTORY and calls outjoiner.py, which writes its report files to REPORT_DIRECTORY
- Command Line Arguments
  - --sweep-time overwrites SWEEP_CYCLE_SECONDS
  - --instances overwrites NUMBER_INSTANCES
  - --min_points overwrites MIN_POINTS_PER_CIRCLE
  - --infile: Path to large master file, e.g. CirclesTestData.txt
  - --split-out overwrites DATA_DIRECTORY
  - --out overwrites OUTPUT_DIRECTORY
  - --report overwrites REPORT_DIRECTORY
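One way these flags could be wired up is with argparse from the standard library. The flag names below come from the list above; everything else is an assumption for illustration: the dest names, the argument types, the use of None to mean "keep the module-level constant", and the directory defaults (inferred from the data/, out/, and reports/ directories mentioned under Example Usage).

```python
import argparse


def build_parser():
    """Build a parser for the vc_circles.py flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Run the enclosing circle algorithm in parallel.")
    # None means "no override": fall back to the constant in the script.
    parser.add_argument("--sweep-time", dest="sweep_cycle_seconds",
                        type=float, default=None)
    parser.add_argument("--instances", dest="number_instances",
                        type=int, default=None)
    parser.add_argument("--min_points", dest="min_points_per_circle",
                        type=int, default=None)
    parser.add_argument("--infile", default=None,
                        help="path to a large master file")
    # Assumed defaults, inferred from the Example Usage section.
    parser.add_argument("--split-out", dest="data_directory", default="data/")
    parser.add_argument("--out", dest="output_directory", default="out/")
    parser.add_argument("--report", dest="report_directory", default="reports/")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```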
circles.py
- What it does
  - Called with two command line arguments, the input path and the output path
  - Calculates points and circles for the input and writes them to the output
outjoiner.py
- What it does
  - Using a given output directory, generates three files: circles.tsv, points.tsv, and summary.tsv, and stores them in a given reports directory
DATA_DIRECTORY
- The filenames in this directory have the format {city}{sep}{state}{sep}{year}{sep}{num}.tsv, where num is a 0-indexed integer identifying one part of a city/state/year infile that was split because it has more than SPLIT_THRESHOLD data points.
- These are files created when vc_circles.py splits up a master file.
OUTPUT_DIRECTORY
- The filenames in this directory have the format {city}{sep}{state}{sep}{year}{sep}{num}.tsv, where num is a 0-indexed integer identifying one part of a city/state/year infile that was split because it has more than SPLIT_THRESHOLD data points.
- These are files created when circles.py processes a file from DATA_DIRECTORY.
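The filename convention shared by DATA_DIRECTORY and OUTPUT_DIRECTORY can be expressed as a pair of helpers. Both function names are hypothetical, and '#' is used as the separator only because it appears in the PATH_SEPARATOR example above.

```python
import os


def build_filename(city, state, year, num, sep="#"):
    """Join the parts into {city}{sep}{state}{sep}{year}{sep}{num}.tsv."""
    return sep.join([city, state, str(year), str(num)]) + ".tsv"


def parse_filename(filename, sep="#"):
    """Recover (city, state, year, num) from a data or output filename."""
    stem, _extension = os.path.splitext(os.path.basename(filename))
    # Split from the right so the city name itself may contain the separator.
    city, state, year, num = stem.rsplit(sep, 3)
    return city, state, int(year), int(num)
```

For example, build_filename("St. Louis", "MO", 2017, 0) gives "St. Louis#MO#2017#0.tsv", and parse_filename turns that string back into its four parts. Note that os.path.splitext only strips the final extension, so the period in a city name such as "St. Louis" is left intact; code that splits on the first period would not have this property.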
REPORT_DIRECTORY
- There are three files in this directory: circles.tsv, points.tsv, and summary.tsv.
Example Usage
$ python vc_circles.py --infile E:/McNair/Projects/OliverLovesCircles/CoLevelForCirclesNotRunGTE200.txt
where CoLevelForCirclesNotRunGTE200.txt is a tab-separated values file with the columns
placestate, place, statecode, year, latitude, longitude, coname, datefirstinv, placens, geoid, city
This command will populate (and overwrite) any files in data/, out/, and reports/.
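As a rough sketch of how a master file with these columns might be broken into per-city/state/year point sets, the following uses csv.DictReader. It rests on assumptions that are not confirmed by the source: that the file has a header row with exactly the column names listed above, and that city, statecode, and year form the grouping key. The function name group_master_rows is hypothetical.

```python
import csv
from collections import defaultdict


def group_master_rows(handle):
    """Group (latitude, longitude) points by (city, statecode, year).

    handle is an open text file (or any iterable of lines) in
    tab-separated format with a header row. Illustrative only.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        key = (row["city"], row["statecode"], row["year"])
        point = (float(row["latitude"]), float(row["longitude"]))
        groups[key].append(point)
    return dict(groups)
```

Each group could then be written to its own file in DATA_DIRECTORY, with groups larger than SPLIT_THRESHOLD split further via k-means.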
Bugs/Issues
- "St. Paul" and "St. Louis" have un-enclosed points; we speculate this is caused by file path issues
- Some place/state/year combinations do not run to completion, regardless of how tractable the number of points is
- How to merge small enclosing circles? This is a better measure of agglomeration regardless
- How to separate outliers?
Makeshift way to plot circles
- Connect to database with command psql -U postgres arc
- password is tabspaceenter I think
- \d lists tables
- Now run SQL script LoadCircles.sql in OliverLovesCircles
- Open ArcMap
- Add data -> Top of file tree -> Database connection -> localhost for instance, database arc -> connect to localhost and table testcirclegeom
- Add points from local files, make sure they are txt or tab files, not tsv