**Problem:** Given N points on a Cartesian plane, find the two points with the shortest distance between them.

**Solution:** We can use an incremental approach: solve P(n) based on the solution to P(n-1). First, sort the points by their x-coordinates. Assume we have solved the problem for the first *n*-1 points and the shortest distance found so far is *d*. Given the *n*th point, we do not need to naively compute its distance to all of the previous *n*-1 points; we can leverage the important information we have already obtained, namely *d*. If a pair closer than *d* involves the *n*th point, the other point of the pair must lie in a specific area: a rectangle of dimension *d* × 2*d* sitting to the left of the *n*th point. Moreover, the number of points that can fit in this area is limited: at most six (which can be proved using the pigeonhole principle). Our algorithm therefore works as follows:

- Two important data structures are used. The first is an array *X_array* containing all the points sorted by their x-coordinates. The second is a BST *Y_BST* containing the active points sorted by their y-coordinates. Insert the first two points into *X_array* and *Y_BST*, and initialize *d_min* as the distance between these first two points.
- Use a sweep-line method: have a vertical line sweep from left to right, starting from the third point in *X_array*.
- When the line touches a point *p*, given the current *d_min*, first remove from the active region of *X_array* the points whose x-distance to *p* is larger than *d_min*, and remove those same points from *Y_BST*. Then add *p* to *Y_BST* and adjust the active region within *X_array* accordingly.
- Search *Y_BST* for the points whose y-coordinates fall in the range [*p.y* - *d_min*, *p.y* + *d_min*]. For each point in the range, calculate its distance to *p*; if it is smaller than *d_min*, update *d_min*.
- When all the points have been swept, the algorithm finishes.
- The complexity of the whole algorithm is O(n log n). Why? The initial sort of *X_array* costs O(n log n). Each point is inserted into and removed from *Y_BST* at most once, at O(log n) per operation, so the total cost of maintaining *Y_BST* is also O(n log n). Since only a limited number of points (at most six) can lie in the *d_min* × 2*d_min* rectangle, the cost of computing their distances to *p* is constant per step. Therefore the complexity of the whole algorithm is O(n log n).

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <set>
#include <utility>
#include <vector>
using namespace std;

typedef pair<double, double> Coordinate;

bool compX(const Coordinate &p1, const Coordinate &p2) {
    if (p1.first != p2.first) return p1.first < p2.first;
    return p1.second < p2.second;
}

bool compY(const Coordinate &p1, const Coordinate &p2) {
    if (p1.second != p2.second) return p1.second < p2.second;
    return p1.first < p2.first;
}

double cal_distance(const Coordinate &p, const Coordinate &q) {
    double a = p.first - q.first;
    double b = p.second - q.second;
    return sqrt(a * a + b * b);
}

double find_closest(vector<Coordinate> &co) {
    if (co.size() < 2) return -1;  // no pair to measure
    // sort based on x-coordinates
    sort(co.begin(), co.end(), compX);
    double d_min = cal_distance(co[0], co[1]);
    // the y table holds the active points, ordered by y-coordinate
    set<Coordinate, bool (*)(const Coordinate &, const Coordinate &)> y_table(compY);
    y_table.insert(co[0]);
    y_table.insert(co[1]);
    size_t head = 0;  // left edge of the active region in co
    // sweep from left to right, starting from the third point
    for (size_t i = 2; i < co.size(); i++) {
        // shrink the active region: drop points whose x-distance to co[i] exceeds d_min
        while (co[i].first - co[head].first > d_min) {
            y_table.erase(co[head]);
            head++;
        }
        // candidates lie in the y range [co[i].y - d_min, co[i].y + d_min]
        Coordinate y_lc(-numeric_limits<double>::infinity(), co[i].second - d_min);
        for (auto sp = y_table.lower_bound(y_lc);
             sp != y_table.end() && sp->second <= co[i].second + d_min; ++sp) {
            double new_d = cal_distance(co[i], *sp);
            if (new_d < d_min) d_min = new_d;
        }
        y_table.insert(co[i]);
    }
    return d_min;
}
```
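A straightforward O(n²) scan is handy for sanity-checking the sweep-line implementation on small inputs (a minimal sketch; `brute_force_closest` is my own name, not part of the original code):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <utility>
#include <vector>
using namespace std;

typedef pair<double, double> Coordinate;

// Brute-force reference: check every pair of points, O(n^2).
// Useful for validating the O(n log n) sweep on small random inputs.
double brute_force_closest(const vector<Coordinate> &co) {
    if (co.size() < 2) return -1;  // same convention as find_closest
    double d_min = numeric_limits<double>::infinity();
    for (size_t i = 0; i < co.size(); i++)
        for (size_t j = i + 1; j < co.size(); j++) {
            double dx = co[i].first - co[j].first;
            double dy = co[i].second - co[j].second;
            d_min = min(d_min, sqrt(dx * dx + dy * dy));
        }
    return d_min;
}
```

For a random test set, `brute_force_closest` should agree with `find_closest` on the returned distance.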

**Alternative approach:** We can also use a k-d tree, more precisely a 2-d tree. A k-d tree is basically a binary tree, constructed by recursively partitioning the points along a particular axis (dimension). For a 2-d tree, for example, we first partition by *x-coordinate* (partitioning by *y-coordinate* first would also work): among all the points, we find the one whose *x-coordinate* is the median of all the *x-coordinates*, and take that point as the root. The remaining points are divided into two groups: a) those whose *x-coordinates* are smaller than *root.x*, and b) those whose *x-coordinates* are larger than *root.x*. We then recursively partition these two groups, but by *y-coordinate*; this gives us the nodes on the second level (taking *root* as the first level). To find the nodes at the third level, we partition by *x-coordinate* again. We repeat this, alternating axes, until we cannot partition any further.
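The construction described above can be sketched roughly as follows (a sketch under my own naming: `KdNode` and `build_2d_tree` are hypothetical, and `nth_element` is used in place of a full sort to place the median):

```cpp
#include <algorithm>
#include <memory>
#include <utility>
#include <vector>
using namespace std;

typedef pair<double, double> Coordinate;

// Hypothetical 2-d tree node layout (not from the original post).
struct KdNode {
    Coordinate point;
    unique_ptr<KdNode> left, right;
    explicit KdNode(Coordinate p) : point(p) {}
};

// Recursively partition the points, alternating axes by depth:
// even depth splits by x, odd depth splits by y. The point holding
// the median coordinate on the current axis becomes the subtree root.
unique_ptr<KdNode> build_2d_tree(vector<Coordinate> pts, int depth = 0) {
    if (pts.empty()) return nullptr;
    int axis = depth % 2;  // 0: split by x, 1: split by y
    size_t mid = pts.size() / 2;
    nth_element(pts.begin(), pts.begin() + mid, pts.end(),
                [axis](const Coordinate &a, const Coordinate &b) {
                    return axis == 0 ? a.first < b.first : a.second < b.second;
                });
    auto node = make_unique<KdNode>(pts[mid]);
    node->left = build_2d_tree(
        vector<Coordinate>(pts.begin(), pts.begin() + mid), depth + 1);
    node->right = build_2d_tree(
        vector<Coordinate>(pts.begin() + mid + 1, pts.end()), depth + 1);
    return node;
}
```

Using `nth_element` at each level gives an O(n log n) construction in expectation without fully sorting every sublist.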

Then we need to apply Nearest Neighbor Search (NNS) in the 2-d tree. Here I will just briefly explain the algorithm.

- First, we need to locate the query point *q*. This process is similar to a binary search based on *q*'s coordinates; we stop when we encounter a leaf node. Calculate the distance between this leaf node and *q*: this is the *current best* (nearest neighbor so far).
- However, we cannot stop here, since the *current best* may not be optimal. We rewind the recursion and go to the parent of the leaf node. Calculate the distance between this parent node and *q*, and update the *current best* if necessary. We also need to check whether a nearer neighbor could lie in the other half-plane divided by this node. This can be done by checking whether the circle centered at *q* with radius *current best* crosses the splitting line. If it does, we need to check the other branch of this node as well, repeating the previous steps to search for *q* in that subtree.
- Once we are done with a node, we keep going up and check its parent, until we have processed the root node.
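The steps above can be sketched as below (again a sketch under my own naming; it repeats a minimal node type so the snippet stands alone, and it implements the "rewind and check the other branch" logic as a recursive descent rather than an explicit parent walk, which visits the same nodes):

```cpp
#include <cmath>
#include <utility>
using namespace std;

typedef pair<double, double> Coordinate;

// Minimal 2-d tree node (hypothetical layout, not from the original post).
struct Node {
    Coordinate point;
    Node *left = nullptr, *right = nullptr;
};

double dist(const Coordinate &p, const Coordinate &q) {
    return hypot(p.first - q.first, p.second - q.second);
}

// Nearest neighbor search: descend toward q first, then on the way back
// up, visit the far branch only if the circle of radius best_d centered
// at q crosses this node's splitting line.
void nns(const Node *node, const Coordinate &q, int depth,
         const Node *&best, double &best_d) {
    if (!node) return;
    double d = dist(node->point, q);
    if (d < best_d) {  // update the current best if this node is nearer
        best_d = d;
        best = node;
    }
    int axis = depth % 2;  // 0: split by x, 1: split by y
    double diff = axis == 0 ? q.first - node->point.first
                            : q.second - node->point.second;
    const Node *near_side = diff < 0 ? node->left : node->right;
    const Node *far_side  = diff < 0 ? node->right : node->left;
    nns(near_side, q, depth + 1, best, best_d);
    if (fabs(diff) < best_d)  // the circle crosses the splitting line
        nns(far_side, q, depth + 1, best, best_d);
}
```

The pruning test `fabs(diff) < best_d` is exactly the circle-versus-splitting-line check described above: if the nearest point found so far is closer than the line, nothing on the far side can beat it.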

The empirical performance of an NNS in a 2-d tree is O(log n). Running the NNS for all n points is then O(n log n). Since the cost of building the 2-d tree is also O(n log n), the total cost for this approach is O(n log n) as well.

See Also: The closest pair problem
