The objective of this project is to come up with a fast, reliable algorithm that finds the set of circles of smallest total area that can be drawn around N Cartesian points such that each circle contains at least M<N points. This algorithm will be used in an academic paper on [[Urban Start-up Agglomeration]], where the points will represent venture capital backed firms within cities.
e) Change script to take Company Year
f) 100 largest cities
=To-Do 04/26 Update=
a) Thursday 04/27: Plot circles resulting from the new Algorithm and ensure that they are correct, specifically on previously bad outputs.
b) Friday 04/28: Look for ways to speed up algorithm (reinstate memoization). Run algorithm over weekend.
c) Monday: Hope that it ran over the weekend. Check through data. Debug where necessary. Say many prayers to computer gods.
=Overview=
8) Choose the scheme that resulted in the smallest total area.
=K-Means Based Algorithm=
==Location==
The script is located in:
E:\McNair\Software\CodeBase\New Implement of Enclosing Circle (Constrained K Means, Smallest Circle)\enclosing_circle_new_implement.py
==Explanation==
The algorithm relies on an object-oriented implementation of a "cluster". <br>
Each "cluster" has the following associated with it: <br>
1. area of minimum circle enclosing points in the cluster
2. points associated with the cluster
3. minimum circle associated with the cluster (x-coord,y-coord,radius)
4. "parent" cluster
5. "children" cluster(s)
'''Inputs:'''
1. n, the minimum number of points to include in a circle
2. k, the number of clusters to split the points into
3. valid_nodes, an empty list to store the possible clusters to explore
4. explored_nodes, an empty list to store the explored clusters
5. cur_node, initialized to an object with all values set to None
6. end_nodes, initialized to an empty list used to store the clusters which cannot be split further
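For concreteness, the initial state described above might be set up roughly as follows, using the Cluster sketch above; the variable names follow the input list, and the values of n and k are only examples.
<syntaxhighlight lang="python">
# Illustrative initial state for the search (names follow the input list above).
n = 3                # minimum number of points per circle (example value)
k = 4                # number of clusters to split the points into (example value)
valid_nodes = []     # possible clusters still to explore
explored_nodes = []  # clusters already explored
cur_node = Cluster(points=[])   # all other fields default to None / empty
end_nodes = []       # clusters that cannot be split further
</syntaxhighlight>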
'''Algorithm:'''
1. Calculate the area of the minimum circle enclosing the total set of points
2. Split the initial points into "k" clusters of at least "n" points using the Constrained K-Means Algorithm
3. Construct the minimum circles of the "k" clusters and compute their total area
4. If this total area <= the area of the first circle:
5. Add all "k" clusters to "valid_nodes"
6. For each node in valid_nodes, add the node to explored_nodes, change cur_node to that node, and update the respective areas, parents, children, circle, and points
7. Run this algorithm again on cur_node, now with k = 2
8. If cur_node cannot be split into 2 clusters, or cannot be split into 2 circles with a smaller total area than its parent, add cur_node to end_nodes and resume the algorithm with the next valid_node not in explored_nodes as the cur_node
9. If all the valid_nodes are in explored_nodes, return end_nodes
10. Repeat the above algorithm for all k from 2 to int(total number of points / n) and take the end_nodes with the lowest total area
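The split-and-accept step (steps 2-5 and 8 above) could look roughly like the sketch below. For brevity it substitutes plain scikit-learn KMeans for the constrained k-means and approximates each minimum enclosing circle by the centroid plus the farthest point (an over-estimate), so it illustrates the acceptance test rather than the script's actual implementation.
<syntaxhighlight lang="python">
# Illustrative sketch of the split-and-accept step; NOT the script's implementation.
import math
import numpy as np
from sklearn.cluster import KMeans

def enclosing_circle(points):
    """Circle (center, radius) enclosing `points`: centroid plus farthest point.
    This over-estimates the true minimum enclosing circle but keeps the sketch short."""
    center = points.mean(axis=0)
    radius = float(np.linalg.norm(points - center, axis=1).max())
    return center, radius

def circle_area(points):
    _, r = enclosing_circle(points)
    return math.pi * r * r

def split_if_smaller(points, n, k=2):
    """Split into k clusters; accept the split only if every cluster keeps at
    least n points and the summed circle area does not grow."""
    if len(points) < k * n:
        return None                                   # cannot be split further -> end node
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(points)
    children = [points[labels == i] for i in range(k)]
    if any(len(c) < n for c in children):
        return None                                   # a real constrained k-means enforces this directly
    if sum(circle_area(c) for c in children) <= circle_area(points):
        return children                               # valid split: these become new nodes to explore
    return None

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = np.vstack([rng.normal((0, 0), 1, (20, 2)),
                     rng.normal((10, 10), 1, (20, 2))])
    children = split_if_smaller(pts, n=3, k=2)
    print("split accepted" if children is not None else "kept as one circle")
</syntaxhighlight>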
'''Runtime:'''
The runtime of the minimum enclosing circle computation is linear in the number of points, and the runtime of constrained k-means depends on k itself. Here, the value of k is proportional to the total number of points. <br>
We would expect the algorithm to slow as the number of points increases. For reference, a set of 80 points with n = 3 took about 30 minutes. <br>
Further improvements could implement an early-stopping method that converges to a local optimum. <br>
==Visualization==
The K-Means based algorithm returns the optimal solution, albeit more slowly. <br>
E:\McNair\Software\CodeBase\Julia Code for enclosing circle
==Explanation==
5) Return the scheme with the minimum area.
==Problems==
In my first implementation of EnclosingCircle (EnclosingCircle.py), there is a mistake in my memoization scheme: I had been setting a variable equal to "scheme[1].sort()", forgetting that the sort function operates in place and returns None. As a result, the memoization scheme was applied haphazardly and too aggressively, so not all possible circles were calculated and incorrect results were returned (hence the subsuming circle nonsense). Another mistake in the original algorithm was that we were choosing the next point to add to a given circle based on how far it was from the first point chosen for the circle, rather than from the center of the circle. I corrected this by recalculating the circle every time I add a point and then sorting the remaining points by their distance from that center. This corrected code produces correct results where the original algorithm did not, but unfortunately it runs incredibly slowly. I do not think it is feasible to use this algorithm on the volume of data we need to analyze. I am currently running an example with 81 points (Houston 2012) split into groups of at least 4 and will report statistics on it soon.
==Ideas for Fixing==
One major holdup of the new algorithm (besides the fact that it calculates many more circles) is recalculating the center point every time I need to add a new point to the circle, which seems a bit absurd. I think it might be faster to adjust the center of the circle heuristically: for the first two points, take the average of the x and y coordinates; for the next addition, take a weighted average of the three points; and so on. I am not sure if this works out mathematically, but I can work it out and see; a sketch of the idea follows. This might cut down on runtime, but we would still need to calculate the distance from each individual point after that, which also takes a long time. This could possibly be resolved heuristically as well, since points that were really far from the circle before probably aren't going to be chosen anyway, so I don't really need to recalculate all the distances. I can look into implementing some of these algorithmic changes and testing them on Monday and Tuesday.
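A sketch of that weighted-average update (an assumption about how it might be implemented, not code from the script): the running center of the first count points can be nudged toward a newly added point without recomputing the circle, though this tracks the centroid rather than the true minimum-circle center.
<syntaxhighlight lang="python">
# Sketch of the heuristic center update described above (not in the current script).
# Keeps a running centroid, updated as a weighted average when a point is added;
# note the centroid is generally NOT the center of the minimum enclosing circle.
def update_center(center, count, new_point):
    """Centroid of count+1 points, given the centroid of the first `count` points."""
    cx, cy = center
    px, py = new_point
    return ((cx * count + px) / (count + 1),
            (cy * count + py) / (count + 1))

points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (4.0, 4.0)]
center = points[0]
for count, p in enumerate(points[1:], start=1):
    center = update_center(center, count, p)
print(center)   # (1.5, 1.5), the centroid of all four points
</syntaxhighlight>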
In the longer term, I would recommend trying to parallelize the algorithm, possibly with Python threads or by implementing it in Java and using the concurrency constructs there. I do not know if I will have time to start this before the summer, but there are many processes in this algorithm that can be run in parallel, so I think it is a good strategy.
Specifically, all calls to findgroups can be run in parallel so long as you wait to receive the results from all the threads before deciding on an answer.
For example, imagine a system with 5 points split into groups of at least 2. There would be one call to findgroups on all 5 points, then calls to findgroups on a 3-2 split and on a 2-3 split. Those two calls could run in parallel, as long as the call on all 5 points waits to receive the results from both child threads before deciding which is the optimal solution. You would have to be careful to keep threads from writing to the same memory (specifically for the removals in pointset), but I think (depending on how many cores our system has) this could yield an impressive speedup.
Update: Did not see any speedup from implementing the changes to the calculation of the center or the distances. Reading about Python threads: https://pymotw.com/2/threading/
Update: Implemented a threaded version of the program (EnclosingCircleRemake2.py) and am running it alongside the non-threaded version to see how the timing compares.
How the threading works:
1) Split the pointset into all its initial possible schemes (one split).
2) Divide these schemes amongst four threads.
3) Have each thread compute the best outcome of each of those schemes and put them into a global array.
4) Join all the threads.
5) Have the main function iterate through the schemes returned by the threads and choose the best one.
6) Create circles for the schemes and return them.
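A minimal sketch of this scheme follows; it is illustrative only, and findgroups here is a placeholder that merely scores a scheme, standing in for the real recursive search in EnclosingCircleRemake2.py.
<syntaxhighlight lang="python">
# Illustrative four-thread version of the steps above; findgroups is a placeholder.
import threading

def findgroups(scheme):
    return sum(scheme)                       # placeholder "cost" of a scheme

def worker(schemes, results, lock):
    for s in schemes:                        # 3) compute the outcome of each assigned scheme
        cost = findgroups(s)
        with lock:                           # guard the shared results list
            results.append((cost, s))

schemes = [(1, 2), (3, 4), (5, 6), (7, 8), (2, 2)]   # 1) initial one-split schemes
results, lock = [], threading.Lock()
threads = []
for i in range(4):                           # 2) divide the schemes among four threads
    t = threading.Thread(target=worker, args=(schemes[i::4], results, lock))
    t.start()
    threads.append(t)
for t in threads:                            # 4) join all the threads
    t.join()
print(min(results))                          # 5) main thread picks the best scheme
</syntaxhighlight>
One caveat worth noting: CPython's global interpreter lock prevents pure-Python threads from running CPU-bound work in parallel, which may contribute to the lack of speedup; the multiprocessing module is one way around this.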
On preliminary examples, EnclosingCircleRemake2.py produces correct results. It actually runs slower than the original algorithm on small examples, but I suspect that is the overhead of creating the threads and that they will pay off on the large examples.
=== Use PostGIS ===
If the point of the project is to solve an application, the Not-Invented-Here approach is doomed to fail. Instead, I suggest an approach where
# Load the reverse-geolocated points into a PostGIS-enabled PostgreSQL table
# Use [https://postgis.net/docs/ST_MinimumBoundingRadius.html ST_MinimumBoundingRadius] which is an implementation of a linear (read: very fast) minimum bounding circle algorithm. See https://github.com/postgis/postgis/blob/svn-trunk/liblwgeom/lwboundingcircle.c for implementation details.
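A hedged sketch of how this primitive could be called from Python follows; the table and column names (firms, city, geom) and the connection string are hypothetical. Note that ST_MinimumBoundingRadius returns one circle per group, so the at-least-M-points-per-circle constraint would still require a clustering step on top.
<syntaxhighlight lang="python">
# Hypothetical example: assumes a table firms(city text, geom geometry) already loaded
# with the geolocated points; one minimum bounding circle is computed per city.
import psycopg2

conn = psycopg2.connect("dbname=vcdb")       # connection parameters are an assumption
cur = conn.cursor()
cur.execute("""
    SELECT city,
           ST_AsText((mbr).center) AS center,
           (mbr).radius            AS radius
    FROM (SELECT city, ST_MinimumBoundingRadius(ST_Collect(geom)) AS mbr
          FROM firms
          GROUP BY city) AS per_city;
""")
for city, center, radius in cur.fetchall():
    print(city, center, radius)
cur.close()
conn.close()
</syntaxhighlight>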