1. Why is Random Forest Algorithm popular?
Random Forest is one of the most popular and widely used machine learning algorithms for classification problems. Although it shines on classification tasks, it can also be applied to regression problem statements.
For modern data scientists, it has become a powerful tool for improving predictive models. The best thing about the method is that it relies on very few assumptions about the data, making data preparation easier and saving time.
2. Can Random Forest Algorithm be used both for Continuous and Categorical Target Variables?
Yes, Random Forest can be used with both continuous and categorical target (dependent) variables. A random forest, i.e. an ensemble of decision trees, acts as a classification model when the dependent variable is categorical and as a regression model when the dependent variable is numerical or continuous, as the minimal example below shows.
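As a minimal sketch (assuming scikit-learn and synthetic data), the two target types map onto two estimators:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Categorical target -> classification model
X_clf, y_clf = make_classification(n_samples=200, n_features=8, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_clf, y_clf)
print(clf.predict(X_clf[:3]))  # predicted class labels

# Continuous target -> regression model
X_reg, y_reg = make_regression(n_samples=200, n_features=8, random_state=42)
reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_reg, y_reg)
print(reg.predict(X_reg[:3]))  # predicted continuous values
```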
3. Explain the working of the Random Forest Algorithm.
The following are the steps involved in executing the random forest algorithm:
Step 1: Randomly select K records from the N total records in the dataset.
Step 2: Build and train a decision tree model on these K records.
Step 3: Choose the number of trees you want in your forest and repeat steps 1 and 2 for each of them.
Step 4: For a regression problem, each tree in the forest predicts an output value for the unseen data point; the final prediction is the mean (average) of the values predicted by all the trees.
For a classification problem, each tree in the forest predicts the class to which the new data point belongs; the new data point is then assigned to the class that receives the most votes, i.e. the majority vote. These steps are sketched in code below.
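A hand-rolled sketch of these four steps, assuming scikit-learn for the base trees and non-negative integer class labels; the function name and parameters are illustrative, not a library API:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simple_forest_predict(X_train, y_train, X_new, n_trees=25, seed=0):
    """Illustrative only: classify X_new with a hand-rolled forest."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    all_votes = []
    for _ in range(n_trees):                  # Step 3: repeat for each tree
        idx = rng.integers(0, n, size=n)      # Step 1: K random records (with replacement)
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X_train[idx], y_train[idx])  # Step 2: train a tree on those records
        all_votes.append(tree.predict(X_new))
    votes = np.vstack(all_votes)              # shape: (n_trees, n_new_points)
    # Step 4: majority vote per data point (use the column mean instead for regression);
    # bincount assumes non-negative integer class labels
    return np.array([np.bincount(col).argmax() for col in votes.T])
```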
4. Why do we prefer a Forest (collection of Trees) rather than a single Tree?
Overfitting arises when a model is too flexible. A flexible model has high variance because the learned parameters, such as the structure of a decision tree, change substantially with the training data. Conversely, a rigid model has high bias because it makes strong assumptions about the data, and it may fail to fit even the training set (underfitting). High variance means the model will not generalize well to new, unseen data points, while high bias means it cannot capture the underlying pattern in the first place.
Therefore, we must carefully manage the bias-variance tradeoff when building a model. Rather than restricting a single tree's depth, which raises bias while lowering variance, we can combine many fully grown decision trees into a forest and average their predictions, keeping the low bias of deep trees while reducing the variance; the comparison below illustrates this.
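A quick illustration of the tradeoff, assuming scikit-learn and a synthetic dataset: averaging many deep trees typically yields a higher and more stable cross-validation score than a single unrestricted tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A single unrestricted tree: low bias, high variance
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
# A forest of such trees: low bias kept, variance averaged away
forest_scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)

print(f"single tree:   {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}")
print(f"random forest: {forest_scores.mean():.3f} +/- {forest_scores.std():.3f}")
```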
5. What does random refer to in ‘Random Forest’?
In Random Forest, "random" primarily refers to two processes:
1) Random observations that are used to grow each tree.
2) At each node, random variables are chosen for splitting.
Random Record Selection: Each tree in the forest is trained on a bootstrap sample, i.e. data points drawn at random with replacement from the original training dataset; because of the replacement, such a sample typically contains about 63.2% of the distinct training observations. This sample is the training set on which the tree is grown.
Random Variable Selection: At each node, the split is the best split over a set of m independent variables (predictors) chosen at random from all of the predictor variables. Both kinds of randomness are demonstrated below.
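A quick numeric check of both kinds of randomness, assuming NumPy; n, p, and m are illustrative values. The 63.2% figure is the limit of 1 - (1 - 1/n)^n, i.e. 1 - 1/e.

```python
import numpy as np

rng = np.random.default_rng(42)

# Random record selection: a bootstrap sample of n points drawn with replacement
n = 100_000
bootstrap_idx = rng.integers(0, n, size=n)
print(len(np.unique(bootstrap_idx)) / n)      # ~0.632, i.e. ~63.2% distinct records

# Random variable selection: m predictors sampled per node (sqrt(p) is a common default)
p = 16
m = int(np.sqrt(p))
print(rng.choice(p, size=m, replace=False))   # candidate features for one split
```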
6. Does Random Forest need Pruning? Why or why not?
No. In a random forest there is no pruning: every tree is grown to its full depth. In a single decision tree, pruning is a technique used to prevent overfitting; it is the process of choosing the subtree that yields the fewest test errors. A random forest does not need this step because the high variance of the individual fully grown trees is averaged out when their predictions are combined, so the ensemble controls overfitting while preserving the low bias of deep trees. The small check below shows the unpruned defaults in practice.
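A small check of this behaviour, assuming scikit-learn, where the defaults max_depth=None and ccp_alpha=0.0 mean no depth limit and no cost-complexity pruning:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
# Defaults apply no pruning: each tree expands until its leaves are pure
# or fall below the minimum-split size
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print([tree.get_depth() for tree in forest.estimators_])  # fully grown depths
```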
7. What is the importance of the max_features hyperparameter?
The max_features hyperparameter controls how many features the random forest considers when searching for the best split at each node: a random subset of that size is drawn for every split. Smaller values make the trees more de-correlated (lowering variance at the cost of a little bias), while larger values let each split search over more features. Its effect is sketched below.
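A sketch of the effect, assuming scikit-learn, where the knob is spelled max_features and accepts "sqrt", "log2", a fraction, or None (all features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=25, random_state=0)
for max_features in ("sqrt", "log2", 0.5, None):  # None = consider every feature
    rf = RandomForestClassifier(max_features=max_features, random_state=0)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(f"max_features={max_features}: {score:.3f}")
```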
8. What are the advantages and disadvantages of the Random Forest Algorithm?
Advantages of Random Forest
1. Random Forest can perform both Classification and Regression tasks.
2. It is capable of handling large datasets with high dimensionality.
3. It enhances the accuracy of the model and prevents the overfitting issue.
4. It overcomes the problem of overfitting by averaging or combining the results of different decision trees.
5. Random Forests work well for a wider range of data than a single decision tree does.
6. Random Forest has less variance than a single decision tree.
7. Random forests are very flexible and possess very high accuracy.
8. Scaling of data is not required in a random forest algorithm; it maintains good accuracy even when the data is not scaled.
9. Random Forest algorithms maintain good accuracy even when a large proportion of the data is missing.
Disadvantages of Random Forest
1. Although Random Forest can be used for both classification and regression tasks, it is less suitable for regression tasks, since its predictions cannot extrapolate beyond the range of the training targets.
2. Complexity is the main disadvantage of Random Forest algorithms.
3. Constructing a random forest is much harder and more time-consuming than building a single decision tree.
4. More computational resources are required to implement the Random Forest algorithm.
5. It is less interpretable when we have a large collection of decision trees.
6. The prediction process using random forests is very time-consuming in comparison with other algorithms.
9. List the features of Bagged Trees
1. Lowers variance by averaging the predictions of the ensemble.
2. When considering node splits, each bagged tree uses the entire feature space (unlike a random forest, which samples a subset of features at each split).
3. The trees are grown deep without pruning, which lowers bias but raises the variance of each individual tree; the averaging then reduces that variance, which can help increase predictive power. A minimal sketch follows this list.
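A minimal sketch of bagged trees, assuming a recent scikit-learn (>= 1.2, where BaggingClassifier takes an estimator parameter). This is equivalent to a random forest that considers every feature at each split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # unpruned, low-bias base trees
    n_estimators=100,                    # bootstrap-trained copies to average
    random_state=0,
)
print(f"bagged trees: {cross_val_score(bagged, X, y, cv=5).mean():.3f}")
```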
10. What are the applications of Random Forest?
Random Forest is most commonly used in four significant sectors:
1. Banking: This algorithm is mostly used by the banking industry to identify loan risk.
2. Medicine: This technique can be used to identify disease patterns and the associated risks.
3. Marketing: This algorithm can be used to analyse marketing trends.
4. E-commerce: It can power recommendation engines, which are a useful tool for cross-selling.