1. What is the “K” in KNN algorithm?
When predicting the class of a new, unseen data point, K is the number of nearest neighbors in the training data that the model consults before making its prediction.
2. What is the difference between KNN and K-means?
KNN
1) It is a supervised learning method.
2) Classification is its main use, with regression being employed occasionally as well.
3) In KNN, "K" refers to the number of nearest neighbors used to categorize a test sample or, in the case of a continuous target variable (regression), to forecast its value.
4) It is applied to labeled, pre-existing data, where the target variable is known ahead of time.
5) There is no real training phase in KNN. Instead, a test observation is forecast from its K nearest neighbors (typically measured by Euclidean distance) using a majority vote or a weighted average.
K-means
1) It is an unsupervised learning approach.
2) Clustering is its main use.
3) In K-means, "K" refers to the total number of clusters the algorithm attempts to identify or establish in the input data. Since it is an unsupervised method, the clusters are not known in advance.
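To make the contrast concrete, here is a minimal sketch using scikit-learn; the toy coordinates and labels below are invented for illustration. KNN needs the labels y to classify a new point, while K-means finds K clusters from X alone.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Toy data, invented for illustration: two loose groups of points.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 0.5],
              [5.0, 8.0], [6.0, 9.0], [5.5, 8.5]])
y = np.array([0, 0, 0, 1, 1, 1])  # KNN requires these known labels

# KNN: "K" = number of nearest neighbors consulted for a new point.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[1.1, 1.0]]))  # majority vote of the 3 nearest neighbors

# K-means: "K" = number of clusters to discover; no labels are used.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster assignments learned from X alone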
3. Explain the K Nearest Neighbor Classification in detail
When dealing with a classification problem, KNN essentially separates the data into training and test samples. For each test point, the distance to every training point is measured, and the nearest neighbors are the training points at the smallest distances. The KNN algorithm then forecasts the outcome by a majority vote among those K nearest neighbors.
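As a sketch of those mechanics, the following from-scratch implementation (the function name knn_classify is my own) measures the Euclidean distance from a test point to every training point, sorts by distance, and takes a majority vote among the K closest:

from collections import Counter
import math

def knn_classify(train_X, train_y, test_point, k=3):
    # Distance from the test point to every training point.
    distances = [(math.dist(x, test_point), label)
                 for x, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])  # nearest neighbors first
    k_labels = [label for _, label in distances[:k]]
    # The majority class among the K nearest neighbors is the prediction.
    return Counter(k_labels).most_common(1)[0][0]

train_X = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (6.0, 9.0)]
train_y = ["A", "A", "B", "B"]
print(knn_classify(train_X, train_y, (1.2, 1.5), k=3))  # prints "A"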
4. Explain the K Nearest Neighbor Regression in detail
KNN regression is a non-parametric technique that intuitively approximates the relationship between the independent variables and a continuous outcome by averaging the observations in the same neighborhood. The analyst must choose the size of the neighborhood, or it can be selected by cross-validation, taking the size that minimizes the mean-squared error.
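A minimal scikit-learn sketch of this idea, using cross-validation to pick the neighborhood size that minimizes the mean-squared error (the data here is synthetic):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data: a noisy sine curve stands in for a real dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=100)

best_k, best_mse = None, float("inf")
for k in range(1, 21):
    model = KNeighborsRegressor(n_neighbors=k)  # prediction = neighbor average
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    if mse < best_mse:
        best_k, best_mse = k, mse
print(f"best neighborhood size K = {best_k}, CV MSE = {best_mse:.4f}")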
5. Why is the odd value of “K” preferable in KNN algorithm?
Odd values of K are preferred over even ones because they ensure there are no ties in the majority vote (at least when there are two classes). A common rule of thumb is to start from the square root of the number of data points; if that value is even, add or subtract 1 to make it odd.
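A small helper expressing that rule of thumb (the function name suggest_k is my own invention):

import math

def suggest_k(n_samples):
    # Start from the square root of the sample count...
    k = max(1, round(math.sqrt(n_samples)))
    # ...and nudge it by 1 if it is even, so majority votes cannot tie.
    return k if k % 2 == 1 else k + 1

print(suggest_k(100))  # sqrt(100) = 10 is even, so suggest 11
print(suggest_k(49))   # sqrt(49) = 7 is already odd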
6. How do we decide the value of "K" in KNN algorithm?
There is no definitive way to determine the ideal value of "K"; instead, we must experiment with several values until we find the one that works best.
K = 5 is a commonly used default.
Extremely low values of K, like K = 1 or K = 2, make the model sensitive to noise and outliers. Larger values of K smooth the predictions, but if K grows too large the neighborhoods start to take in points from other classes and the model underfits.
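One standard way to run that experiment is a cross-validated grid search over candidate values of K; here is a sketch with scikit-learn on its built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Try K = 1..30 and keep the value with the best cross-validated accuracy.
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": list(range(1, 31))},
                      cv=5)
search.fit(X, y)
print(search.best_params_)            # the winning K for this dataset
print(round(search.best_score_, 3))   # its mean cross-validated accuracy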
7. What is the difference between Euclidean Distance and Manhattan distance? What is the formula for Euclidean distance and Manhattan distance?
Euclidean distance
The Euclidean distance measures the straight-line separation between two data points in a plane. It is obtained from the Minkowski distance formula by setting the value of p to 2.
ED = sqrt((x1 - x2)^2 + (y1 - y2)^2)
Manhattan distance
The Manhattan distance measures the separation between two data points along a path that follows a grid. It is obtained from the Minkowski distance formula by setting the value of p to 1.
MD = |x1 - x2| + |y1 - y2|
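Both formulas can be checked numerically; a quick sketch using scipy.spatial.distance, whose minkowski function takes the order p as a parameter:

from scipy.spatial.distance import cityblock, euclidean, minkowski

a, b = [1.0, 2.0], [4.0, 6.0]
print(euclidean(a, b))       # sqrt((1-4)^2 + (2-6)^2) = 5.0
print(minkowski(a, b, p=2))  # the same value: Minkowski with p = 2
print(cityblock(a, b))       # |1-4| + |2-6| = 7.0 (Manhattan distance)
print(minkowski(a, b, p=1))  # the same value: Minkowski with p = 1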
8. Why do you need to scale your data for the k-NN algorithm?
Because KNN depends on distances between points, features with larger numeric ranges dominate the distance calculation and drown out the rest; scaling puts every feature on a comparable footing and also reduces the influence of outliers.
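A quick illustration of the effect, assuming scikit-learn and its built-in wine dataset, whose features sit on very different scales:

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
raw = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# Unscaled features let the largest-magnitude column dominate the distance,
# so accuracy typically jumps once every feature is standardized.
print(cross_val_score(raw, X, y, cv=5).mean())
print(cross_val_score(scaled, X, y, cv=5).mean())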
9. Why KNN Algorithm is called as Lazy Learner?
Because it merely stores the dataset and acts on it at classification time, rather than learning from the training set immediately.
KNN is therefore known as a lazy learner: it delays building a model until a prediction is actually requested.
10. What are the advantages and disadvantages of the KNN algorithm?
Advantages of KNN Algorithm
1) It is easy to implement.
2) It is capable of handling noisy training data.
3) It can perform better when there is a lot of training data.
Disadvantages of KNN Algorithm
1) The value of K always has to be determined, which can sometimes be difficult.
2) The computation cost is significant, because the distance from each query point to every training sample must be calculated.
3) It is sensitive to the scale of the data and to noisy or irrelevant features.