UPER Neighbors sounds like an awesome new approach to and age old algorithm that honestly still pulls its weight pretty well. But alas, that is not the case. It is instead the focus of this post to take the systematic approach of UPER framework, and combine it with a simple implementation of the K-Nearest Neighbor(KNN) algorithm.
U.P.E.R. stands for Understand, Plan, Execute, and Reflect. This is a project management framework that allows users to strategically find and execute solutions to problems of varying scope. Today, I will be applying the framework to the KNN algorithm.
Understand:
Understanding should take a good proportion of your time. The more you understand the problem at hand, the easier it will be to come up with a plan and execute. Here I walk you through my understanding of what KNN is and how it works for classification. During this step, you want to ask as many questions as you can and find those answers. Once you have a solid foundation of the problem, only then should you proceed to planning.
The KNN algorithm takes in an array of n_samples x m_features, along with some integer input value for k. Depending on the type of problem, it will return either the class or the score of a test variable. This is done by finding the distance, commonly the euclidean distance between all the points provided in the stored array, and the test data. Once the distance is determined, the closest K values are found, and again, depending on the type of problem, the class is determined, or the score (average distance of the K nearest neighbors) is returned. Weights can help in determining the closest neighbors by adding a penalty to points that are further from the test point e.g. 1/d (where d is the distance). Another note is that in the case of classification, an odd value for K should be presented, as it will break any ties that may arise, since classes are chosen based on votes. Our implementation will contain a fit and predict method.
Fitting a classifier is the process of taking in an input, and determining a classifier from a range of possible classifiers that are distinguished by the parameters passed, done using some optimization technique. However, K-NN is a non-parametric classification technique, so it does not perform any additional transformations to any data that is passed to it. Therefore, a fit function for this algorithm is simply used for the purpose of storing the data. The reason for this is that KNN is a lazy learner, so the data determines the classification. Lazy learners are useful in the case where the training data is always being modified/updated. This allows the model to make generalizations based on the query, and not necessarily on the training set. Situations where learning is done eagerly become obsolete when changes are made to the data. This is apparent in recommender systems, where KNN’s are often applied. For example, when Amazon adds new items for sale, or Netflix adds new content, the training data changes, so the model often needs to be update or retrained. With KNN, only the query data needs to be updated. This does come with its list of drawbacks including: inefficiencies with noisy data, large space requirements, slow evaluations of queries, and increased computational costs. However, all of these issues can be mitigated with proper application and heuristics.
For this algorithm, we will be implementing a classification technique. Predicting the class of a new data point is simply a process of finding the nearest k neighbors and having them vote on what the new point is. For example, say we have a feature space that contains a hundred points representing a number of samples that are attributed to 2 classes: ‘cat’ and ‘dog’. We chose k=3, which means when we query for ‘x’, we are saying we want to know which 3 points in that feature space are closest to ‘x’. Once we have those 3 points, we take a vote to decide to which of those 2 classes does ‘x’ belong to. If two points are of class ‘dog’ and 1 is of class ‘cat’, then the query ‘x’ would be classified as a ‘dog’ based on the number of votes. Therefore, predict is a method of determining the distance between a query and all the samples in a feature space, returning the closest k items, and assigning a class based on those items.
Plan:
Planning is about taking all of your newly acquired knowledge and coming up with an implementation. This is not to be confused with execution, although pseudo-code is permissible. Here, we walk through some techniques that can be applied to the problem and try to come up with a quick, naive/brute-force solution. This is often simply a first pass at attacking the problem, then upon further Reflection, can be rebuilt with optimization in mind.
- Build a class object for the model. It will initialize with k nearest neighbors as an integer input value.
- Requires a fit method. As stated previously, the input method is more of a storage function to hold the input data. This should be an array like structure. Take in an X (m x n) and y (1-dimensional) matrix and store them as class attributes for data manipulation. Cast as an array to ensure proper functionality.
- Requires a predict method. Here is the meat of the structure. Take in a query that should be of at least (1 x n), similar to X, and determine the distance from the query points to the points in X. This will be done by calculating the euclidean distance. Euclidean distance is similar to the Pythagorean theroem, where we sum the squared differences of each points, then find the square root of that sum. For classification, we will by finding the class of the passed in data point(s). To do this, we will use the calculated distances and place them in ascending order, and returning the top K data points. These data points will then vote to determine to which class the new data point belongs to. The classes are then stored in a list using the same index’s passed in the query array and returned.
Execute
Here, we take our plan and convert it to cold hard code. If the planning phase was in depth and contained pseudo-code, this step will be a lot easier. That being said, this is still programming, and bugs are a thing. However, you can rest easy knowing most of the work has already been done. You can see my execution here.
Reflect:
For this final step, we want to look back and the entire process, and see what improvements can be made. Similar to the Understand step, we want to ask plenty of questions. Given any time constraints, what provides the maximum value for our efforts? Can we reduce our run-time complexity or space requirements? Are there any user induced errors that are not being handled properly? Has extensive testing been done? These are all things I need to tackle for this project, and more. But it is something I happily look forward to.
Thank you for reading. This model can be installed using this pip code snippet:
pip install -i https://test.pypi.org/simple/ knn-classifier-dmartinez==0.0.2
Once installed, it can be imported as such:
from knn_package import KNearestNeighbors
Applciation can then be done in this manner:
my_model = KNearestNeighbors(num_neighbors=3)
my_model.fit(X,y)
y_pred = my_model.predict(X)
For a deeper walk-through, please check out this python notebook.