Difference between revisions of "Parallel Enclosing Circle Algorithm"

From edegan.com
Jump to navigation Jump to search
 
(21 intermediate revisions by 3 users not shown)
Line 1: Line 1:
{{McNair Projects
+
{{Project
 +
|Has project output=Tool
 +
|Has sponsor=McNair Center
 
|Has title=Parallel Enclosing Circle Algorithm
 
|Has title=Parallel Enclosing Circle Algorithm
 
|Has owner=Oliver Chang,
 
|Has owner=Oliver Chang,
 
|Has start date=July 31, 2017
 
|Has start date=July 31, 2017
|Has deadline=August 4, 2017
+
|Has deadline=October 4, 2017
|Has project status=Active
+
|Has project status=Complete
 
|Is dependent on=Enclosing Circle Algorithm,
 
|Is dependent on=Enclosing Circle Algorithm,
 
}}
 
}}
  
 +
A thin-wrapper around [[Enclosing_Circle_Algorithm_(Rework)|the enclosing circle algorithm]] which allows for instance-level parallelization.
 +
This project consists of the python files in <code>E:\McNair\Projects\OliverLovesCircles\src\python</code>. There is another version of the project with plotting functionality that uses a slightly different approach (removes duplicate points and uses their counts before running the algorithm) in <code>E:\McNair\Projects\KyranLovesCircles\src\python</code>.
  
Project directory: <code>E:\McNair\Projects\OliverLovesCircles</code>
+
Parallelization is implemented via Python2's [https://docs.python.org/2/library/subprocess.html#subprocess.Popen <code>subprocess.open()</code>] which is non-blocking and available in the standard library.
 
 
  
 
== The Problem ==
 
== The Problem ==
Line 17: Line 20:
 
Rather, we seek to minimize the sum of enclosing circles containing at least <code>n</code> points.
 
Rather, we seek to minimize the sum of enclosing circles containing at least <code>n</code> points.
 
Thus, multiple circles are allowed and inclusion in multiple circles is possible.
 
Thus, multiple circles are allowed and inclusion in multiple circles is possible.
 +
 +
This algorithm has terrible time-performance characteristics, so we make the assumption that we can divide a large number of points with k-means and then solve those subproblems.
 +
In other words, we make the simplifying assumption that the Enclosing Circle Algorithm has [https://en.wikipedia.org/wiki/Optimal_substructure Optimal Substructure].
  
 
== Parameters ==
 
== Parameters ==
  
* <code>iterations_per_k</code>: the number of iterations to attempt for each <code>k</code> to find minimum for that <code>k</code>
+
* in <code>circles.py</code>:
* <code>n</code>: the minimum number of data points that must be included in a circle
+
** <code>PATH_SEPARATOR</code>: the string that separates parts of the filename for both input and output files. For example, an input could look like "St. Louis#MO#2017#0.tsv" for PATH_SEPARATOR = '#'
 +
** <code>ITERATIONS</code>: the number of iterations to attempt for each <code>k</code> to find minimum for that <code>k</code>
 +
** <code>MIN_POINTS_PER_CIRCLE</code> (AKA <code>n</code>): the minimum number of data points that must be included in a circle
 +
* in <code>vc_circles.py</code>
 +
** <code>NUMBER_INSTANCES</code>: number of parallel instances to run; assume no data-races between instances
 +
** <code>SWEEP_CYCLE_SECONDS</code>: amount of time before removing completed jobs from the current job and adding new jobs if any files are left to process
 +
** <code>TIMEOUT_MINUTES</code>: maximum running time of a parallel instance of the algorithm
 +
** <code>SPLIT_THRESHOLD</code>: if a dataset has more than this threshold of data points, it will be split via k-means
 +
** <code>EXECUTABLE_INSTANCE_PATH</code>: the path to circles.py
 +
** <code>OUTJOINER_INSTANCE_PATH</code>: the path to outjoiner.py
 +
** <code>DATA_DIRECTORY</code>: the input directory
 +
** <code>OUTPUT_DIRECTORY</code>: the directory to write the outputs of circle.py to
 +
** <code>GENERATE_REPORTS</code>: whether or not to call outjoiner.py (writes reports on the output of circles.py)
 +
** <code>REPORT_DIRECTORY</code>: the directory to write reports to
 +
 
 +
 
 +
== Structure and Usage ==
 +
 
 +
==== vc_circles.py ====
 +
 
 +
*What it does
 +
**If given a "master file" through argument infile, splits it into constituent data files, and stores them in DATA_DIRECTORY
 +
**Takes data files in DATA_DIRECTORY and calls circles.py in parallel for each of these data files, which writes its output files to OUTPUT_DIRECTORY
 +
**Takes output files in OUTPUT_DIRECTORY and calls outjoiner.py, which writes its report files to REPORT_DIRECTORY
 +
 
 +
*Command Line Arguments
 +
**--sweep-time overwrites <code>SWEEP_CYCLE_SECONDS</code>
 +
**--instances overwrites <code>NUMBER_INSTANCES</code>
 +
**--min_points overwrites <code>MIN_POINTS_PER_CIRCLE</code>
 +
**--infile: Path to large master file, e.g. CirclesTestData.txt
 +
**--split-out overwrites <code>DATA_DIRECTORY</code>
 +
**--out overwrites <code>OUTPUT_DIRECTORY</code>
 +
**--report overwrites <code>REPORT_DIRECTORY</code>
 +
 
 +
==== circles.py ====
 +
 
 +
*What it does
 +
**Called with two command line arguments, the input path and the output path
 +
**Calculates points and circles for input and writes it to output
 +
 
 +
==== outjoiner.py ====
 +
 
 +
*What it does
 +
**Using a given output directory, generates three files: circles.tsv, points.tsv, and summary.tsv, and stores them in a given reports directory
 +
 
 +
==== DATA_DIRECTORY ====
 +
*The format of the filenames in this directory are <code>{city}{sep}{state}{sep}{year}{sep}{num}.tsv</code> where <code>num</code> is a 0-indexed integer of a split city/state/year <code>infile</code> that has greater than <code>SPLIT_THRESHOLD</code>.
 +
*These are files created when vc_circles.py splits up a master file.
 +
==== OUTPUT_DIRECTORY====
 +
*The format of the filenames in this directory are <code>{city}{sep}{state}{sep}{year}{sep}{num}.tsv</code> where <code>num</code> is a 0-indexed integer of a split city/state/year <code>infile</code> that has greater than <code>SPLIT_THRESHOLD</code>.
 +
*These are files created when circles.py processes a file from DATA_DIRECTORY.
 +
==== REPORT_DIRECTORY====
 +
*There are three files in this directory: circles.tsv, points.tsv, and summary.tsv.
 +
 
 +
=== Example Usage ===
 +
 
 +
==== Splitting a master file and running ====
 +
<nowiki>
 +
$ python vc_circles.py --infile E:/McNair/Projects/OliverLovesCircles/CoLevelForCirclesNotRunGTE200.txt</nowiki>
 +
where <code>CoLevelForCirclesNotRunGTE200.txt</code> is a tab-separated values file with the columns
 +
<code>placestate, place, statecode, year, latitude, longitude, coname, datefirstinv, placens, geoid, city</code>
 +
 
 +
This command will populate (and overwrite) any files in <code>data/</code>, <code>out/</code>, and <code>reports/</code>.
 +
 
 +
==== Running on already split files ====
 +
<nowiki>
 +
$ python vc_circles.py</nowiki>
 +
 
 +
This command will populate (and overwrite) any files in <code>out/</code> and <code>reports/</code>.
 +
 
 +
== Bugs/Issues ==
 +
 
 +
# "St. Paul" and "St. Louis" have un-enclosed points--speculate because of weird file path issues
 +
# Some place/state/year combinations do not run to completion regardless of how tractable the number of points
 +
# How to merge small enclosing circles? This is a better measure of agglomeration regardless
 +
# How to separate outliers?
 +
# Sometimes circles with 0 radius are created
 +
# enclosingcirclealg() returns None sometimes
  
== Overview ==
+
== Makeshift way to plot circles ==  
 +
# Connect to database with command <code>psql -U postgres arc</code>
 +
# password is tabspaceenter I think
 +
# <code>\d</code> lists tables
 +
# Now run SQL script LoadCircles.sql in OliverLovesCircles
 +
# Open ArcMap
 +
# Add data -> Top of file tree -> Database connection -> localhost for instance, database arc -> connect to localhost and table testcirclegeom
 +
# Add points from local files, make sure they are txt or tab files, not tsv, or they won't be found
 +
# Points -> Properties -> Source -> Set data source -> x field: long, y field: lat
  
TODO: explain range of k
+
== St. Louis bug ==
 +
[[File:st_louis_bug.png]]
  
TODO: explain difficulty of port
+
This image shows a rendering of the results of running St. Louis. There are four circles (the centers of circles are green dots), but two have radii of 0.0.
  
TODO: add initial cut documentation
+
=== Progress on the bug ===
 +
# Removing duplicate points from the data actually removes all of the errors, but this doesn't give you the solution with the smallest area.
 +
# I tried removing duplicates but keeping track of a "count."
 +
# I narrowed down the bug to the constrained_kmeans method in ckmeans.py [https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2000-65.pdf (paper here)]
 +
## For some reason, this returns clusters with smaller numbers of points than n
 +
## [http://adared.ch/constrained-k-means-implementation-in-python/ This is a good overview of the algorithm]
 +
# I wrote a plotter, the plot method in circles.py
  
 
== Related Pages ==
 
== Related Pages ==

Latest revision as of 13:47, 21 September 2020


Project
Parallel Enclosing Circle Algorithm
Project logo 02.png
Project Information
Has title Parallel Enclosing Circle Algorithm
Has owner Oliver Chang
Has start date July 31, 2017
Has deadline date
Has project status Complete
Is dependent on Enclosing Circle Algorithm
Has sponsor McNair Center
Has project output Tool
Copyright © 2019 edegan.com. All Rights Reserved.


A thin-wrapper around the enclosing circle algorithm which allows for instance-level parallelization. This project consists of the python files in E:\McNair\Projects\OliverLovesCircles\src\python. There is another version of the project with plotting functionality that uses a slightly different approach (removes duplicate points and uses their counts before running the algorithm) in E:\McNair\Projects\KyranLovesCircles\src\python.

Parallelization is implemented via Python2's subprocess.open() which is non-blocking and available in the standard library.

The Problem

Note that this is not the classical enclosing circle algorithm. Rather, we seek to minimize the sum of enclosing circles containing at least n points. Thus, multiple circles are allowed and inclusion in multiple circles is possible.

This algorithm has terrible time-performance characteristics, so we make the assumption that we can divide a large number of points with k-means and then solve those subproblems. In other words, we make the simplifying assumption that the Enclosing Circle Algorithm has Optimal Substructure.

Parameters

  • in circles.py:
    • PATH_SEPARATOR: the string that separates parts of the filename for both input and output files. For example, an input could look like "St. Louis#MO#2017#0.tsv" for PATH_SEPARATOR = '#'
    • ITERATIONS: the number of iterations to attempt for each k to find minimum for that k
    • MIN_POINTS_PER_CIRCLE (AKA n): the minimum number of data points that must be included in a circle
  • in vc_circles.py
    • NUMBER_INSTANCES: number of parallel instances to run; assume no data-races between instances
    • SWEEP_CYCLE_SECONDS: amount of time before removing completed jobs from the current job and adding new jobs if any files are left to process
    • TIMEOUT_MINUTES: maximum running time of a parallel instance of the algorithm
    • SPLIT_THRESHOLD: if a dataset has more than this threshold of data points, it will be split via k-means
    • EXECUTABLE_INSTANCE_PATH: the path to circles.py
    • OUTJOINER_INSTANCE_PATH: the path to outjoiner.py
    • DATA_DIRECTORY: the input directory
    • OUTPUT_DIRECTORY: the directory to write the outputs of circle.py to
    • GENERATE_REPORTS: whether or not to call outjoiner.py (writes reports on the output of circles.py)
    • REPORT_DIRECTORY: the directory to write reports to


Structure and Usage

vc_circles.py

  • What it does
    • If given a "master file" through argument infile, splits it into constituent data files, and stores them in DATA_DIRECTORY
    • Takes data files in DATA_DIRECTORY and calls circles.py in parallel for each of these data files, which writes its output files to OUTPUT_DIRECTORY
    • Takes output files in OUTPUT_DIRECTORY and calls outjoiner.py, which writes its report files to REPORT_DIRECTORY
  • Command Line Arguments
    • --sweep-time overwrites SWEEP_CYCLE_SECONDS
    • --instances overwrites NUMBER_INSTANCES
    • --min_points overwrites MIN_POINTS_PER_CIRCLE
    • --infile: Path to large master file, e.g. CirclesTestData.txt
    • --split-out overwrites DATA_DIRECTORY
    • --out overwrites OUTPUT_DIRECTORY
    • --report overwrites REPORT_DIRECTORY

circles.py

  • What it does
    • Called with two command line arguments, the input path and the output path
    • Calculates points and circles for input and writes it to output

outjoiner.py

  • What it does
    • Using a given output directory, generates three files: circles.tsv, points.tsv, and summary.tsv, and stores them in a given reports directory

DATA_DIRECTORY

  • The format of the filenames in this directory are {city}{sep}{state}{sep}{year}{sep}{num}.tsv where num is a 0-indexed integer of a split city/state/year infile that has greater than SPLIT_THRESHOLD.
  • These are files created when vc_circles.py splits up a master file.

OUTPUT_DIRECTORY

  • The format of the filenames in this directory are {city}{sep}{state}{sep}{year}{sep}{num}.tsv where num is a 0-indexed integer of a split city/state/year infile that has greater than SPLIT_THRESHOLD.
  • These are files created when circles.py processes a file from DATA_DIRECTORY.

REPORT_DIRECTORY

  • There are three files in this directory: circles.tsv, points.tsv, and summary.tsv.

Example Usage

Splitting a master file and running

$ python vc_circles.py --infile E:/McNair/Projects/OliverLovesCircles/CoLevelForCirclesNotRunGTE200.txt

where CoLevelForCirclesNotRunGTE200.txt is a tab-separated values file with the columns placestate, place, statecode, year, latitude, longitude, coname, datefirstinv, placens, geoid, city

This command will populate (and overwrite) any files in data/, out/, and reports/.

Running on already split files

$ python vc_circles.py

This command will populate (and overwrite) any files in out/ and reports/.

Bugs/Issues

  1. "St. Paul" and "St. Louis" have un-enclosed points--speculate because of weird file path issues
  2. Some place/state/year combinations do not run to completion regardless of how tractable the number of points
  3. How to merge small enclosing circles? This is a better measure of agglomeration regardless
  4. How to separate outliers?
  5. Sometimes circles with 0 radius are created
  6. enclosingcirclealg() returns None sometimes

Makeshift way to plot circles

  1. Connect to database with command psql -U postgres arc
  2. password is tabspaceenter I think
  3. \d lists tables
  4. Now run SQL script LoadCircles.sql in OliverLovesCircles
  5. Open ArcMap
  6. Add data -> Top of file tree -> Database connection -> localhost for instance, database arc -> connect to localhost and table testcirclegeom
  7. Add points from local files, make sure they are txt or tab files, not tsv, or they won't be found
  8. Points -> Properties -> Source -> Set data source -> x field: long, y field: lat

St. Louis bug

St louis bug.png

This image shows a rendering of the results of running St. Louis. There are four circles (the centers of circles are green dots), but two have radii of 0.0.

Progress on the bug

  1. Removing duplicate points from the data actually removes all of the errors, but this doesn't give you the solution with the smallest area.
  2. I tried removing duplicates but keeping track of a "count."
  3. I narrowed down the bug to the constrained_kmeans method in ckmeans.py (paper here)
    1. For some reason, this returns clusters with smaller numbers of points than n
    2. This is a good overview of the algorithm
  4. I wrote a plotter, the plot method in circles.py

Related Pages

External Links