Enclosing Circle Algorithm

From edegan.com
Revision as of 13:36, 26 April 2017 by ChristyW


McNair Project
Enclosing Circle Algorithm
Project Information
Project Title: Enclosing Circle Algorithm
Owner: Christy Warden
Start Date: 201701
Deadline: 201704
Keywords: Tool
Primary Billing:
Notes:
Project Status: Active
Copyright © 2016 edegan.com. All Rights Reserved.


The objective of this project is to come up with a fast, reliable algorithm that finds the set of circles with the smallest total area that can be drawn around N Cartesian points such that each circle contains at least M<N points. This algorithm will be used in an academic paper on Urban Start-up Agglomeration, where the points will represent venture capital backed firms within cities.

To-Do

1) Cleanup of geocoding

   a) Read input as tab-delimited, not whitespace-delimited
   b) Load into database
   c) Use Google Maps API
   d) Throw out points geocoded to the center of the world or the center of a city
   e) Change script to take Company Year
   f) 100 largest cities

Overview

This program takes in a set of points and the minimum number of points that should be included inside each circle, and returns the circles of smallest total area which together encompass all of the data points. Function make_circle and all of its helper functions were taken from https://www.nayuki.io/res/smallest-enclosing-circle/smallestenclosingcircle.py.

Update 04/26/17: The most recent Enclosing Circle algorithm is EnclosingCircleRemake.py. I believe that in the past there was an issue with the way that I selected the next point to go into the circle. I was previously choosing the point closest to the arbitrary starting point, rather than the point closest to the center of the new circle, which was pretty illogical. EnclosingCircleRemake should fix this error, but it needs to be tested pretty thoroughly before I am ready to run it on the large data set.
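The difference between the two selection rules in the update can be illustrated with a minimal sketch. The function name closest_to and the sample coordinates are hypothetical and are not taken from EnclosingCircleRemake.py; the sketch only shows that the two rules can pick different points once the circle's center has drifted away from the starting point.

```python
import math

def closest_to(reference, candidates):
    """Return the candidate point nearest to the reference point."""
    return min(
        candidates,
        key=lambda p: math.hypot(p[0] - reference[0], p[1] - reference[1]),
    )

# Hypothetical state: the circle began at `start`, but after absorbing a few
# points its center has moved to `center`.
start = (0.0, 0.0)
center = (2.0, 0.0)
remaining = [(-1.5, 0.0), (3.6, 0.0)]

old_choice = closest_to(start, remaining)   # rule used before the fix
new_choice = closest_to(center, remaining)  # rule described in the update
print(old_choice, new_choice)
```

With these sample values the old rule picks the point on the far side of the starting point, while the new rule picks the point nearest the current circle.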


make_circle input: A sequence of pairs of floats or ints, e.g. [(0,5), (3.1,-2.7)]. Output: A triple of floats representing a circle. It returns the smallest circle that encloses all the given points and runs in expected O(n) time (randomized).

Algorithm Description

Location

The original script is located in:

E:\McNair\Software\CodeBase\EnclosingCircle.py

Explanation

Inputs: A set of points and n, the minimum number of points to include in each circle.

1) Find a point on the outside of the set of points by choosing the rightmost point in the set. This is the starting point.

2) If the number of points in the input is less than twice the minimum number of points to include in a circle (2n), return a single circle containing all of the points. This is because there is no way to split up the points without having a circle containing fewer than n points.

3) Sort the points by their distance from the starting point (using the distance formula).

4) Create a list of points called "Core." The core should contain the starting point and the n - 1 points that are closest to it. The circle around the core is taken as the smallest valid circle that contains the starting point.

5) Run the same algorithm on the rest of the points that are not contained in the core, and store the area of the core's circle plus the area of the result on the remaining points.

6) Add one point (the next closest one to the starting point) to the core and then repeat step 5.

7) Repeat step 6 until there are fewer than n points outside of the core, i.e. not enough to constitute a valid circle.

8) Choose the scheme that resulted in the smallest total area.
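The steps above can be sketched in Python as follows. This is a minimal illustration rather than the code in EnclosingCircle.py: the names split_into_circles, enclosing_circle and area are hypothetical, and a small brute-force enclosing-circle helper stands in for make_circle so that the sketch is self-contained. It is only practical for small inputs.

```python
import itertools
import math

EPS = 1e-9  # tolerance for points that lie on a circle's boundary

def _circumcircle(a, b, c):
    """Circle (x, y, r) through three points, or None if they are collinear."""
    d = 2 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = (p[0] ** 2 + p[1] ** 2 for p in (a, b, c))
    x = (a2 * (b[1] - c[1]) + b2 * (c[1] - a[1]) + c2 * (a[1] - b[1])) / d
    y = (a2 * (c[0] - b[0]) + b2 * (a[0] - c[0]) + c2 * (b[0] - a[0])) / d
    return (x, y, math.hypot(a[0] - x, a[1] - y))

def enclosing_circle(points):
    """Stand-in for make_circle: brute-force smallest enclosing circle."""
    if len(points) == 1:
        return (points[0][0], points[0][1], 0.0)
    candidates = []
    for a, b in itertools.combinations(points, 2):
        cx, cy = (a[0] + b[0]) / 2, (a[1] + b[1]) / 2
        candidates.append((cx, cy, math.hypot(a[0] - cx, a[1] - cy)))
    for trio in itertools.combinations(points, 3):
        circle = _circumcircle(*trio)
        if circle is not None:
            candidates.append(circle)
    valid = [c for c in candidates
             if all(math.hypot(p[0] - c[0], p[1] - c[1]) <= c[2] + EPS
                    for p in points)]
    return min(valid, key=lambda c: c[2])

def area(circles):
    """Total area of a list of (x, y, r) circles."""
    return sum(math.pi * r * r for _, _, r in circles)

def split_into_circles(points, n):
    """Recursive scheme search following steps 1 to 8."""
    if len(points) < 2 * n:                       # step 2
        return [enclosing_circle(points)]
    start = max(points, key=lambda p: p[0])       # step 1: rightmost point
    ordered = sorted(                             # step 3
        points, key=lambda p: math.hypot(p[0] - start[0], p[1] - start[1]))
    best = None
    for core_size in range(n, len(ordered) - n + 1):   # steps 4, 6 and 7
        scheme = [enclosing_circle(ordered[:core_size])]
        scheme += split_into_circles(ordered[core_size:], n)  # step 5
        if best is None or area(scheme) < area(best):         # step 8
            best = scheme
    return best

points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
circles = split_into_circles(points, 3)
print(len(circles), area(circles))
```

For the two well-separated clusters in the sample input, the sketch returns one circle per cluster instead of a single large circle around all six points.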

Brute Force

Location

The script is located in:

E:\McNair\Projects\Accelerators\Enclosing_Circle\enclosing_circle_brute_force.py

Explanation

1) Find all combinations of two points in the total data set.

2) For each of these combinations, draw a circle around the two points and determine which other points are contained in it. If the circle contains at least n points, add it to a set of valid circles. There are (number of points choose 2) possible circles, i.e. N(N-1)/2 for N points.

3) Combine the valid circles in all possible ways and check whether each resulting scheme contains all of the points. If it does, add it to a set of valid schemes. I believe that the number of ways to do this is the sum, from i = 1 to the number of valid circles k, of (k choose i), which equals 2^k - 1.

4) Iterate through the valid schemes and calculate the areas of the schemes.

5) Return the scheme with the minimum area.
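A minimal sketch of these five steps is given below. It is not the code in enclosing_circle_brute_force.py; the function name brute_force_circles is hypothetical, and "a circle around those two points" is assumed here to mean the circle that has the two points as its diameter. Because every subset of valid circles is examined, the sketch is usable only for a handful of points.

```python
import itertools
import math

def brute_force_circles(points, n):
    """Return the minimum-area list of (x, y, r) circles covering all points,
    or None if no combination of valid circles covers them."""
    # Steps 1-2: one candidate circle per pair of points (assumed to be the
    # circle with that pair as diameter), kept if it holds at least n points.
    valid = []
    for a, b in itertools.combinations(points, 2):
        cx, cy = (a[0] + b[0]) / 2, (a[1] + b[1]) / 2
        r = math.hypot(a[0] - cx, a[1] - cy)
        inside = frozenset(
            i for i, p in enumerate(points)
            if math.hypot(p[0] - cx, p[1] - cy) <= r + 1e-9
        )
        if len(inside) >= n:
            valid.append(((cx, cy, r), inside))

    # Steps 3-5: try every non-empty subset of valid circles, keep those that
    # cover every point, and remember the one with the smallest total area.
    everyone = frozenset(range(len(points)))
    best, best_area = None, math.inf
    for size in range(1, len(valid) + 1):
        for scheme in itertools.combinations(valid, size):
            covered = frozenset().union(*(inside for _, inside in scheme))
            if covered != everyone:
                continue
            total = sum(math.pi * c[2] ** 2 for c, _ in scheme)
            if total < best_area:
                best, best_area = [c for c, _ in scheme], total
    return best

points = [(0, 0), (2, 0), (10, 0), (12, 0)]
best = brute_force_circles(points, 2)
print(best)
```

For the four collinear sample points, the cheapest valid scheme is two circles of radius 1, one around each close pair.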

Runtime

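On preliminary analysis, the brute force approach to the enclosing circle algorithm was estimated to run in O(n!) time, where n is the number of coordinate pairs (points) in the input sequence. Note that step 3 alone may examine up to 2^k - 1 combinations of k valid circles, and k can be as large as n(n-1)/2, so the worst case grows at least exponentially in n.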
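To get a feel for how quickly the search space in step 3 grows, the counts from the explanation above can be computed directly. This is a rough upper bound that assumes every candidate circle is valid; the function names are hypothetical, and 82 is used because it is the number of coordinates in the trial run reported under "Running Python vs. Cython".

```python
def candidate_circles(num_points):
    """Number of point pairs, i.e. (number of points choose 2)."""
    return num_points * (num_points - 1) // 2

def max_schemes(num_circles):
    """Sum over i = 1..k of (k choose i), which equals 2**k - 1."""
    return 2 ** num_circles - 1

for n in (4, 10, 82):
    k = candidate_circles(n)
    digits = len(str(max_schemes(k)))
    print("points=%d circles=%d schemes has %d decimal digits" % (n, k, digits))
```

Even for modest inputs the number of candidate schemes is far too large to enumerate, which is consistent with the brute force script being useful only as a correctness check on very small data sets.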

Applications

VC Data

The Enclosing Circle Algorithm will be applied to VC data acquired through the SDC Platinum database. The script makes use of the Python GeoPy GeoCoder to get latitude and longitude coordinates to be used by the Enclosing Circle Algorithm.

Geopy Geocoder User Agreements can be found here.

The relevant files are located in:

E:\McNair\Projects\Accelerators\Enclosing_Circle

The results may eventually be plotted to a graph using python as well. Here is documentation for a python library called basemap.

CURRENT STATUS: Bug fixes needed in EnclosingCircle.py. The program fails with a KeyError on line 187 in cases where the length of the dataset is not a multiple of n. I made some temporary fixes to the enclosing circle file located in the above directory, but I am not certain that they are a permanent fix.

Speeding up with C

With the large amount of VC data we have, the enclosing circle algorithm would take an extremely long time to run (on the order of weeks or months). If we can compile the code into C, we may be able to reduce the runtime dramatically. I've listed some possible options for running Python code as C.

PyPy. Documentation here.

Cython

Cython. Documentation here. Basic tutorial for Cython is given here.

Currently, the RDP is missing a compiler to run Cython successfully. The error that appears is "unable to find vcvarsall.bat".

Getting a C++ Compiler

These are proposed fixes for the Cython error shown above. A possible C++ Compiler for Python can be downloaded directly from Microsoft here.

The C++ Compiler for Python has been downloaded and installed. Instructions for installation and uninstallation can be found here.

The installation file is located in:

E:\McNair\Software\Utilities\VCForPython27.msi

Running Python vs. Cython

A trial run on 82 coordinates resulted in the following run times, a speedup of roughly 1.67x for Cython:

EnclosingCircle in Python: 16.069 seconds
EnclosingCircle in Cython:  9.633 seconds

The test files can be found in the following:

EnclosingCircle in python: E:\McNair\Projects\Accelerators\EnclosingCircle\EnclosingCircle.py
EnclosingCircle in cython: E:\McNair\Projects\Accelerators\EnclosingCircle\EnclosingCircleC_Test.py
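The page does not record how the timings were collected. One way such a comparison could be made is with a small timing helper like the sketch below, which assumes Python 3 (time.perf_counter). The names timed and stand_in are hypothetical; stand_in is only a placeholder workload and would be replaced by the entry point of the Python or Cython module being measured.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    begin = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - begin

def stand_in(points):
    # Placeholder workload; swap in the enclosing circle entry point here.
    return sum(x * x + y * y for x, y in points)

points = [(i, i + 1) for i in range(82)]
result, seconds = timed(stand_in, points)
print("elapsed: %.3f seconds" % seconds)
```

Running the same helper against both builds on identical input keeps the comparison like for like.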

Usage

The basic tutorial for Cython can be found at http://docs.cython.org/en/latest/src/tutorial/cython_tutorial.html.

Essentially, a setup.py file needs to be created with the following format:

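# Prefer setuptools; fall back to distutils if setuptools is not installed.
try:
    from setuptools import setup
    from setuptools import Extension
except ImportError:
    from distutils.core import setup
    from distutils.extension import Extension
from Cython.Build import cythonize

# Replace "filename.pyx" with the Cython source file to compile.
setup(
    ext_modules=cythonize("filename.pyx")
)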

Then, after changing to the proper directory, execute the following from the command line:

python setup.py build_ext --inplace

This will compile your Python program to C and produce a filename.pyd extension module (on Windows).

To use this new python code wrapped in C, simply import the pyd file as if it were a python file:

import filename

Treat this file as any other module. It will work just as if it were written in Python, except that it typically runs faster.

NOTE: We fixed Enclosing Circle and drastically improved its runtime, so we decided to go with its Python implementation.


Wrapping C for Python

Documentation for multiple ways in which this can be done can be found here.

A GitHub example can be found here.

An advanced and detailed tutorial can be found here.

A step-by-step Tutorials Point guide can be found here.


Results

A data set with city, state, company name, year, and geocoded coordinates can be found at:

E:\McNair\Projects\Accelerators\Code+Final_Data\ChristyCode\GeoCodedBusinesses.txt


NEXT STEP: We need to determine how many companies we want in each circle, and then we can begin running the enclosing circle algorithm on the city data.

The Top 50 cities with the maximum number of companies in any given year are: 'Santa Monica', 'Nashville', 'Santa Clara', 'Chicago', 'Philadelphia', 'Denver', 'Dallas', 'Burlington', 'San Francisco', 'San Mateo', 'Milpitas', 'Boulder', 'Bellevue', 'Herndon', 'Pittsburgh', 'Mountain View', 'San Diego', 'Fremont', 'Ann Arbor', 'Irvine', 'Brooklyn', 'Durham', 'Los Angeles', 'Atlanta', 'Alpharetta', 'Menlo Park', 'Rockville', 'San Jose', 'Lexington', 'Saint Louis', 'Sunnyvale', 'Palo Alto', 'Richardson', 'Redwood City', 'Austin', 'Waltham', 'Baltimore', 'Cupertino', 'Houston', 'Cambridge', 'Boston', 'Washington', 'Minneapolis', 'Pleasanton', 'New York', 'Cleveland', 'South San Francisco', 'Portland', 'Seattle'.

Data on the Top 50 VC Backed Companies can be found here.

These were determined by the decide_cities.py script located in:

E:\McNair\Projects\Accelerators\EnclosingCircle
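The selection rule stated above (rank cities by their maximum number of companies in any single year) can be sketched as follows. This is not the code in decide_cities.py: the function name top_cities, the (city, year) record format and the sample records are all hypothetical.

```python
from collections import Counter

def top_cities(records, limit=50):
    """Rank cities by their peak yearly company count.

    records: iterable of (city, year) pairs, one per company observed in
    that city in that year (a hypothetical input format).
    """
    per_city_year = Counter(records)
    peak = {}
    for (city, year), count in per_city_year.items():
        peak[city] = max(peak.get(city, 0), count)
    # Highest peak first; ties broken alphabetically for a stable order.
    ranked = sorted(peak, key=lambda city: (-peak[city], city))
    return ranked[:limit]

sample = (
    [("Austin", 2000)] * 3
    + [("Boston", 2000)] * 2
    + [("Boston", 2001)] * 4
    + [("Denver", 1999)]
)
print(top_cities(sample, limit=2))
```

In the sample, Boston ranks first because its best single year has four companies, even though it has fewer than Austin in 2000.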

The final data of cities at a given year and their minimized circles can be found at:

 E:\McNair\Projects\Accelerators\EnclosingCircle\final_vc_circles.txt

You can plot this data onto a map using the script located at:

E:\McNair\Projects\Accelerators\EnclosingCircle\Draw_circles_test\draw_vc_circles.py