Changes

6,788 bytes added , 13:47, 21 September 2020

no edit summary

{{Project|Has project output=Tool|Has sponsor=McNair ~~Projects~~Center

|Has title=Parallel Enclosing Circle Algorithm

|Has owner=Oliver Chang,

|Has start date=July 31, 2017

|Has deadline=~~August~~ October 4, 2017|Has project status=~~Active~~Complete

|Is dependent on=Enclosing Circle Algorithm,

}}

A thin-wrapper around [[Enclosing_Circle_Algorithm_(Rework)|the enclosing circle algorithm]] which allows for instance-level parallelization.

This project consists of the python files in <code>E:\McNair\Projects\OliverLovesCircles\src\python</code>. There is another version of the project with plotting functionality that uses a slightly different approach (removes duplicate points and uses their counts before running the algorithm) in <code>E:\McNair\Projects\KyranLovesCircles\src\python</code>.

~~Project directory~~Parallelization is implemented via Python2's [https: //docs.python.org/2/library/subprocess.html#subprocess.Popen <code>~~E:\McNair\Projects\OliverLovesCircles~~subprocess.open()</code>] which is non-blocking and available in the standard library.

== The Problem ==

Rather, we seek to minimize the sum of enclosing circles containing at least <code>n</code> points.

Thus, multiple circles are allowed and inclusion in multiple circles is possible.

This algorithm has terrible time-performance characteristics, so we make the assumption that we can divide a large number of points with k-means and then solve those subproblems.

In other words, we make the simplifying assumption that the Enclosing Circle Algorithm has [https://en.wikipedia.org/wiki/Optimal_substructure Optimal Substructure].

== Parameters ==

* in <code>~~iterations_per_k~~circles.py</code>:** <code>PATH_SEPARATOR</code>: the string that separates parts of the filename for both input and output files. For example, an input could look like "St. Louis#MO#2017#0.tsv" for PATH_SEPARATOR = '#'** <code>ITERATIONS</code>: the number of iterations to attempt for each <code>k</code> to find minimum for that <code>k</code>* * <code>MIN_POINTS_PER_CIRCLE</code> (AKA <code>n</code>): the minimum number of data points that must be included in a circle* in <code>vc_circles.py</code>** <code>NUMBER_INSTANCES</code>: number of parallel instances to run; assume no data-races between instances** <code>SWEEP_CYCLE_SECONDS</code>: amount of time before removing completed jobs from the current job and adding new jobs if any files are left to process** <code>TIMEOUT_MINUTES</code>: maximum running time of a parallel instance of the algorithm** <code>SPLIT_THRESHOLD</code>: if a dataset has more than this threshold of data points, it will be split via k-means** <code>EXECUTABLE_INSTANCE_PATH</code>: the path to circles.py** <code>OUTJOINER_INSTANCE_PATH</code>: the path to outjoiner.py** <code>DATA_DIRECTORY</code>: the input directory** <code>OUTPUT_DIRECTORY</code>: the directory to write the outputs of circle.py to** <code>GENERATE_REPORTS</code>: whether or not to call outjoiner.py (writes reports on the output of circles.py)** <code>REPORT_DIRECTORY</code>: the directory to write reports to == Structure and Usage == ==== vc_circles.py ==== *What it does**If given a "master file" through argument infile, splits it into constituent data files, and stores them in DATA_DIRECTORY**Takes data files in DATA_DIRECTORY and calls circles.py in parallel for each of these data files, which writes its output files to OUTPUT_DIRECTORY**Takes output files in OUTPUT_DIRECTORY and calls outjoiner.py, which writes its report files to REPORT_DIRECTORY *Command Line Arguments**--sweep-time overwrites <code>SWEEP_CYCLE_SECONDS</code>**--instances overwrites <code>NUMBER_INSTANCES</code>**--min_points overwrites <code>MIN_POINTS_PER_CIRCLE</code>**--infile: Path to large master file, e.g. CirclesTestData.txt**--split-out overwrites <code>DATA_DIRECTORY</code>**--out overwrites <code>OUTPUT_DIRECTORY</code>**--report overwrites <code>REPORT_DIRECTORY</code> ==== circles.py ==== *What it does **Called with two command line arguments, the input path and the output path**Calculates points and circles for input and writes it to output ==== outjoiner.py ==== *What it does **Using a given output directory, generates three files: circles.tsv, points.tsv, and summary.tsv, and stores them in a given reports directory ==== DATA_DIRECTORY ====*The format of the filenames in this directory are <code>{city}{sep}{state}{sep}{year}{sep}{num}.tsv</code> where <code>num</code> is a 0-indexed integer of a split city/state/year <code>infile</code> that has greater than <code>SPLIT_THRESHOLD</code>. *These are files created when vc_circles.py splits up a master file.==== OUTPUT_DIRECTORY====*The format of the filenames in this directory are <code>{city}{sep}{state}{sep}{year}{sep}{num}.tsv</code> where <code>num</code> is a 0-indexed integer of a split city/state/year <code>infile</code> that has greater than <code>SPLIT_THRESHOLD</code>. *These are files created when circles.py processes a file from DATA_DIRECTORY.==== REPORT_DIRECTORY====*There are three files in this directory: circles.tsv, points.tsv, and summary.tsv. === Example Usage === ==== Splitting a master file and running ==== <nowiki>$ python vc_circles.py --infile E:/McNair/Projects/OliverLovesCircles/CoLevelForCirclesNotRunGTE200.txt</nowiki>where <code>CoLevelForCirclesNotRunGTE200.txt</code> is a tab-separated values file with the columns<code>placestate, place, statecode, year, latitude, longitude, coname, datefirstinv, placens, geoid, city</code> This command will populate (and overwrite) any files in <code>data/</code>, <code>out/</code>, and <code>reports/</code>. ==== Running on already split files ==== <nowiki>$ python vc_circles.py</nowiki> This command will populate (and overwrite) any files in <code>out/</code> and <code>reports/</code>. == Bugs/Issues == # "St. Paul" and "St. Louis" have un-enclosed points--speculate because of weird file path issues# Some place/state/year combinations do not run to completion regardless of how tractable the number of points# How to merge small enclosing circles? This is a better measure of agglomeration regardless# How to separate outliers?# Sometimes circles with 0 radius are created# enclosingcirclealg() returns None sometimes

== ~~Overview~~ Makeshift way to plot circles ==# Connect to database with command <code>psql -U postgres arc</code># password is tabspaceenter I think# <code>\d</code> lists tables# Now run SQL script LoadCircles.sql in OliverLovesCircles# Open ArcMap# Add data -> Top of file tree -> Database connection -> localhost for instance, database arc -> connect to localhost and table testcirclegeom# Add points from local files, make sure they are txt or tab files, not tsv, or they won't be found# Points -> Properties -> Source -> Set data source -> x field: long, y field: lat

~~TODO~~== St. Louis bug ==[[File: ~~explain range of k~~st_louis_bug.png]]

~~TODO: explain difficulty~~ This image shows a rendering of ~~port~~the results of running St. Louis. There are four circles (the centers of circles are green dots), but two have radii of 0.0.

~~TODO~~=== Progress on the bug ===# Removing duplicate points from the data actually removes all of the errors, but this doesn't give you the solution with the smallest area.# I tried removing duplicates but keeping track of a "count." # I narrowed down the bug to the constrained_kmeans method in ckmeans.py [https: ~~add initial cut documentation~~//www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2000-65.pdf (paper here)]## For some reason, this returns clusters with smaller numbers of points than n## [http://adared.ch/constrained-k-means-implementation-in-python/ This is a good overview of the algorithm]# I wrote a plotter, the plot method in circles.py

== Related Pages ==

Ed

Bureaucrats, Interface administrators, Administrators (Semantic MediaWiki), Administrators

7,612

edits

Changes

Parallel Enclosing Circle Algorithm (view source)

Revision as of 13:47, 21 September 2020

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Sites

Sections

Organizations

Help

Tools