Difference between revisions of "Enclosing Circle Algorithm"

From edegan.com
Jump to navigation Jump to search
Line 116: Line 116:
  
 
The Top 50 cities with the maximum number of companies in any given year are: 'Santa Monica', 'Nashville', 'Santa Clara', 'Chicago', 'Philadelphia', 'Denver', 'Dallas', 'Burlington', 'San Francisco', 'San Mateo', 'Milpitas', 'Boulder', 'Bellevue', 'Herndon', 'Pittsburgh', 'Mountain View', 'San Diego', 'Fremont', 'Ann Arbor', 'Irvine', 'Brooklyn', 'Durham', 'Los Angeles', 'Atlanta', 'Alpharetta', 'Menlo Park', 'Rockville', 'San Jose', 'Lexington', 'Saint Louis', 'Sunnyvale', 'Palo Alto', 'Richardson', 'Redwood City', 'Austin', 'Waltham', 'Baltimore', 'Cupertino', 'Houston', 'Cambridge', 'Boston', 'Washington', 'Minneapolis', 'Pleasanton', 'New York', 'Cleveland', 'South San Francisco', 'Portland', 'Seattle'.
 
The Top 50 cities with the maximum number of companies in any given year are: 'Santa Monica', 'Nashville', 'Santa Clara', 'Chicago', 'Philadelphia', 'Denver', 'Dallas', 'Burlington', 'San Francisco', 'San Mateo', 'Milpitas', 'Boulder', 'Bellevue', 'Herndon', 'Pittsburgh', 'Mountain View', 'San Diego', 'Fremont', 'Ann Arbor', 'Irvine', 'Brooklyn', 'Durham', 'Los Angeles', 'Atlanta', 'Alpharetta', 'Menlo Park', 'Rockville', 'San Jose', 'Lexington', 'Saint Louis', 'Sunnyvale', 'Palo Alto', 'Richardson', 'Redwood City', 'Austin', 'Waltham', 'Baltimore', 'Cupertino', 'Houston', 'Cambridge', 'Boston', 'Washington', 'Minneapolis', 'Pleasanton', 'New York', 'Cleveland', 'South San Francisco', 'Portland', 'Seattle'.
 +
 +
Data on the [http://mcnair.bakerinstitute.org/wiki/Top_Cities_for_VC_Backed_Companies Top 50 VC Backed Companies can be found here.]
  
 
These were determined by the decide_cities.py script located in:
 
These were determined by the decide_cities.py script located in:

Revision as of 18:19, 7 March 2017


McNair Project
Enclosing Circle Algorithm
Red-circle.jpg
Project Information
Project Title Enclosing Circle Algorithm
Owner Christy Warden
Start Date 201701
Deadline 201704
Primary Billing
Notes
Has project status Active
Copyright © 2016 edegan.com. All Rights Reserved.


To-Do

1) Cleanup of geocoding

   a) Take from tab delimited not from whitespace
   b) Load in to database
   c) Use google maps API 
   d) Throw out center of world/center of city 
   e) Change script to take Company Year 
   f) 100 largest cities 

Overview

This program takes in a set of points and the minimum number that should be included inside a unit, and returns circles of the smallest total area which encompass all of the data points. Function make_circle and all of its helper functions were taken from https://www.nayuki.io/res/smallest-enclosing-circle/smallestenclosingcircle.py.


Input: A sequence of pairs of floats or ints, e.g. [(0,5), (3.1,-2.7)]. Output: A triple of floats representing a circle. Returns the smallest circle that encloses all the given points. Runs in expected O(n) time, randomized.

Algorithm Description

Location

The original script is located in:

E:\McNair\Software\CodeBase\EnclosingCircle.py


Applications

VC Data

The Enclosing Circle Algorithm will be applied to VC data acquired through the SDC Platinum database. The script makes use of the Python GeoPy GeoCoder to get latitude and longitude coordinates to be used by the Enclosing Circle Algorithm.

Geopy Geocoder User Agreements can be found here.

The relevant files are located in:

E:\McNair\Projects\Accelerators\Enclosing_Circle

The results may eventually be plotted to a graph using python as well. Here is documentation for a python library called basemap.

CURRENT STATUS: Bug fixes needed in EnclosingCircle.py. The program errors with a key error on line 187 in cases where n is not a multiple of the length of the dataset. I made some temporary fixes to the enclosing circle file located in the above directory, but I am not certain if it is a permanent fix.

Speeding up with C

With the large amount of VC data we have, the enclosing circle algorithm would take an extremely long amount of time to run (on the order of weeks/months). If we can compile the code into C, we can speed up runtime dramatically. I've listed some possible sources for running python code as C.

PyPy. Documentation here.

Cython

Cython. Documentation here. Basic tutorial for Cython is given here.

Currently, the RDP is missing a compiler to run Cython successfully. The error that appears is "unable to find vcvarsall.bat".

Getting a C++ Compiler

These are proposed fixes to solve the Cython error shown above. A possible C++ Compiler for Python can be downloaded directly from Windows here.

The C++ Compiler for Python has been downloaded and installed. Instructions for installation and uninstallation can be found here.

The Installation file is located in :

E:\McNair\Software\Utilities\VCForPython27.msi

Running Python vs. Cython

A trial test run on 82 coordinates resulted int the following time stamps:

EnclosingCircle in python: 16.069 seconds
Enclosing Circle in cython:  9.633 seconds

The test files can be found in the following:

EnclosingCircle in python: E:\McNair\Projects\Accelerators\EnclosingCircle\EnclosingCircle.py
EnclosingCircle in cython: E:\McNair\Projects\Accelerators\EnclosingCircle\EnclosingCircleC_Test.py

Usage

The basic tutorial for cython can be found phttp://docs.cython.org/en/latest/src/tutorial/cython_tutorial.html here].

Essentially, a setup.py file needs to be created with the following format:

try:
    from setuptools import setup
    from setuptools import Extension
except ImportError:
    from distutils.core import setup
    from distutils.extension import Extension
from Cython.Build import cythonize
setup(
    ext_modules = cythonize("filename.pyx")
)

Then, after changing to the proper directory, execute the following from the command line

python setup.py build_ext --inplace

This will wrap your python program in C, and produce a filename.pyd file.

To use this new python code wrapped in C, simply import the pyd file as if it were a python file:

import filename

Treat this file as any other module. It will work just as if it were in Python, except it exhibits a faster run time.

NOTE: We fixed Enclosing Circle and drastically improved its runtime, so we decided to go with its Python implementation.

Results

A data set with city, state, company name, year, and geocoded coordinates can be found at:

E:\McNair\Projects\Accelerators\Code+Final_Data\ChristyCode\GeoCodedBusinesses.txt


NEXT STEP: We need to determine how many companies we want in each circle, and then we can begin running the enclosing circle algorithm on the city data.

The Top 50 cities with the maximum number of companies in any given year are: 'Santa Monica', 'Nashville', 'Santa Clara', 'Chicago', 'Philadelphia', 'Denver', 'Dallas', 'Burlington', 'San Francisco', 'San Mateo', 'Milpitas', 'Boulder', 'Bellevue', 'Herndon', 'Pittsburgh', 'Mountain View', 'San Diego', 'Fremont', 'Ann Arbor', 'Irvine', 'Brooklyn', 'Durham', 'Los Angeles', 'Atlanta', 'Alpharetta', 'Menlo Park', 'Rockville', 'San Jose', 'Lexington', 'Saint Louis', 'Sunnyvale', 'Palo Alto', 'Richardson', 'Redwood City', 'Austin', 'Waltham', 'Baltimore', 'Cupertino', 'Houston', 'Cambridge', 'Boston', 'Washington', 'Minneapolis', 'Pleasanton', 'New York', 'Cleveland', 'South San Francisco', 'Portland', 'Seattle'.

Data on the Top 50 VC Backed Companies can be found here.

These were determined by the decide_cities.py script located in:

E:\McNair\Projects\Accelerators\EnclosingCircle

The final data of cities at a given year and their minimized circles can be found at:

 E:\McNair\Projects\Accelerators\EnclosingCircle\final_vc_circles.txt