Spatial and Vector Indexes
So far our focus has been on 1 dimensional indices. In the previous note we explored B+ trees where the search key is a single integer or string. Now we will explore search on multicolumn keys. In this note we introduce the Spatial Index: A type of index that allows you index multiple columns at once, often by treating pairs of values as coordinates in the cartesian plane. Spatial indices are also used to index data that naturally occurs in pairs (e.g. latitudes and longitudes, points in multidimensional space, etc). For this class, we will examine the basics of spatial indices and discuss the implementation of them through the KD Tree.
Motivation
Astronomy
Suppose we have a map of the stars with their x,y coordinates. We want to answer each of the following questions:

Does a given (x, y) coordinate contain a star?
Equality Search 
How many stars are in a region of the map? A region is a area of the map bounded by 4 coordinates.
Range Search 
What are the stars closest to a given (x,y) coordinate?
Nearest Neighbor
Graphics
Suppose we are coding a game with characters and items represented as objects on the screen. We want to answer the following questions about the display:

Do any two objects touch each other? In other words, are two objects located at the same (x, y) coordinate?
Equality Search 
How many objects should be rendered within the boundaries of the screen?
Range Search
In these examples, it is often inefficient and expensive to try to solve these problems with a 1 dimensional index like the B+ Tree. Thus, we introduce the KD tree!
KD Tree
By definition, a KD Tree is a spacepartitioning data structure used to organize points in a kdimensional space. The general idea is to build a balanced binary tree that splits on multiple dimensions. Compare this to the B+ tree, which only splits on one dimension.
Why would we do this?
The goal is to enable fast look up, range queries, and nearest neighbor search on multiple dimensions.
For simplicity we will use 2D points to illustrate a KD tree, but in general we can use any number of dimensions (hence the “k” in KD).
Tree Construction
The general procedure for creating a KD tree is below:
 If we have a single point, then the point forms the tree.
 Otherwise, divide the points in half by a line perpendicular to one of the axes.
 Recursively construct kd trees for the two partitioned sets of points.
But how do we divide the points?
 We could divide in a roundrobin fashion across two dimensions
 We could divide on the median value on one of the dimensions.
(Optional) Other Choices for splitting a KD Tree.
Cost of Operations
For the rest of this note, we will use the following variables to represent the cost of each operation in I/Os.
 B: Number of data blocks
 R: Number of records per block
 D: Average time to read/write disk block
 F: Average internal node fanout
 E: Average # data entries per leaf
Equality Search
Assume there exists a node D in the KD Tree. Since we are working in 2 dimensions, node D has an x value D.x and a y value D.y. Our goal is to find node D in the tree (i.e. an equality search).
 Start from the root node.
 Compare D.y with y1. We see that D.y is less than y1.
 Follow pointer to the left subtree
 Compare d.x with x1. Assume we we see that D.x is greater than x1.
 Follow the pointer to the right subtree
 Recursively repeat process until we reach the leaf node.
I/O Cost: log(BR)
Range search
Now assume we are interested in a range of points in our KD Tree. We can use the following algorithm for range search:
 Search index to find first page to read
 Scan the leaf pages to find which heap pages to read
 Read the corresponding heap pages
 I/Os for 1: log(BR) + 1
Note we have +1 for the index leaf page  I/Os for 3: #pages
#pages: number of pages storing records that matches search criteria  I/Os for 2: #pages
overapproximate and assume same as 3
I/O Cost: (log(BR) + 1 + (2 * #pages)) * D
Insertion
Now assume we want to insert a new entry into our database (and as a result a new node into the KD tree).
 Search index to find heap file page to modify
 Read corresponding page and modify in memory
 Write back the modified index leaf and heap file pages
 I/Os for 1: log(BR) + 1
 I/Os for 2: 1
 I/Os for 3: 2
Total cost : (log(BR) + 4) * D
Table of I/O Costs
Nearest neighbor search
The previous operations are all ones we’ve seen before with B+ Trees. The KD tree allows us to perform another type of search that is incredibly useful in computer science. Nearest Neighbor Search is the search algorithm used to find the closest point in a KD tree to a given input. For instance in the following example, we provide input Q and would expect the NN algorithm to return the point closest to Q (closest defined by Euclidean distance). In this case, that would be D.
General Idea: Traverse down the tree and keep track of the closest point found so far. Stop when we know that there can’t be any other point closer.
 Number of traversals is proportional to log(nodes)
 Stop when we know that there can’t be a closer point in the tree
 Traverse all nodes in the worst case
We can generalize this algorithm to find Nnearest neighbors (but won’t prove that here).
Nearest Neighbor Example
Let’s walk through the example from before to find the closest point for Q. Note that the tree for the example above is below.
 Q is right of y1, so we go right from the root.
 Q is below x4, so we go down.
 Q is below x3, so we go down. This is the first time we encounter a point, so we mark that H is the closest point we’ve seen so far.
 Go back up to x3 and now go up (the other branch). The leaf node is G, which is closer than H so G is our closest point.
 Now we go back up to x4, and traverse the other branch (go up).
 Q is less than y3 so we go Left. The leaf node is E, which is farther than G. We know anything in the region of the other branch (right from y3 will be) farther than our best choice so far, so we can skip that branch and continue our algorithm. How do we know anything in that region will be farther than our best choice? Well, we know the boundaries of that region, so we can compute the euclidean distance to the closest point in that region (i.e. The corner closest to Q) which is farther than our current best. Thus, anything in that region is farther than our nearest neighbor so far.
 Now we can go back up to the root and go left this time.
 Q is down of x1, so we go down.
 Q is right of y2 so we go right to reach the node D. Now our closest node is D. Using the logic from step 6, we can eliminate the rest of the possible branches (because we’ve found the actual closest neighbor).
Conclusion
In the examples in this note, we only worked with KD trees of 2 dimensions. In general though, KD trees can be extended to even higher dimensions. Some variations: Rtree, Quadtree, Octree. The primary application of this structure has been finding nearest neighbors, which is supported by Postgres, Oracle, SQLServer. Being able to efficiently find the nearest neighbors for a given data point can be extremely useful when working on recommendation systems and image recognition.