If you are looking for a classifier to try first on your own data, your best options are Random Forest or a Support Vector Machine (SVM) with a Gaussian kernel. In a recent study, these two algorithms proved the most effective of nearly 200 algorithms tested on more than 100 publicly available datasets.
In this blog post we highlight some important points from the paper “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” that can help in choosing the right algorithm for your own machine learning problems.
The authors evaluated 179 classifiers from 17 families on 121 datasets (241,637 experiments in total), which makes this an exhaustive study of classifier performance and a significant contribution to our community. The datasets range from 10 to 130,064 patterns, from 3 to 262 inputs, and from 2 to 100 classes. Here are the classifier families in ranked order, best first:
- Random Forests with 8 variants.
- SVM with 10 variants.
- Neural networks with 21 variants.
- Decision trees with 14 variants.
- Bagging with 24 variants.
- Boosting with 20 variants.
- Other Methods with 10 variants.
- Discriminant analysis with 20 variants.
- Nearest neighbor methods with 5 variants.
- Other ensembles with 11 variants.
- Logistic and multinomial regression with 3 variants.
- Multivariate adaptive regression splines with 2 variants.
- Generalized Linear Models with 5 variants.
- Partial least squares and principal component regression with 6 variants.
- Rule-based methods with 12 variants.
- Bayesian approaches with 6 variants.
- Stacking with 2 variants.
The average accuracy of the 25 best-performing classifiers is shown in the graph below. Among them, the best performers are Random Forest and SVM with a Gaussian kernel, with average accuracies of 82.0% (±16.3) and 81.8% (±16.2) respectively. Most of the experiments involved fine-tuning the parameters. The reported average accuracies are computed using 4-fold cross-validation.
The Random Forest classifier has helped people win some Kaggle competitions. Here is what people have to say about using Random Forests:
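To make the 4-fold cross-validation step concrete, here is a minimal standard-library sketch of how such an accuracy can be computed: split the data into four folds, train on three, test on the fourth, and average the four scores. The threshold classifier, the toy data, and all function names are hypothetical stand-ins chosen for illustration; they are not taken from the paper.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k=4, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_accuracy(X, y, fit, predict, k=4, seed=0):
    """Train on k-1 folds, test on the held-out fold, return one accuracy per fold."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        test_set = set(held_out)
        train = [i for i in range(len(X)) if i not in test_set]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores

# Hypothetical toy classifier: a 1-D threshold halfway between the class means.
def fit_threshold(X, y):
    mean_0 = mean(x for x, label in zip(X, y) if label == 0)
    mean_1 = mean(x for x, label in zip(X, y) if label == 1)
    return (mean_0 + mean_1) / 2

def predict_threshold(threshold, x):
    return int(x > threshold)

# Toy data: 40 evenly spaced points, class 0 for the lower half, class 1 for the upper.
X = [0.1 * i for i in range(40)]
y = [0] * 20 + [1] * 20

scores = cross_val_accuracy(X, y, fit_threshold, predict_threshold)
print(f"mean 4-fold accuracy: {100 * mean(scores):.1f}%")
```

Any real classifier can be swapped in by passing different `fit` and `predict` callables.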
“Since they have very few parameters to tune and can be used quite efficiently with default parameter settings (i.e. they are effectively non-parametric). This ease of use also makes it an ideal tool for people without a background in statistics, allowing lay people to produce fairly strong predictions free from many common mistakes, with only a small amount of research and programming.”
Implementations you can try:
- For Random Forest: scikit-learn (Python), Weka (Java), or the randomForest package (R).
- For SVM: LIBSVM.
A few recommendations:
- Try Random Forest or a Gaussian SVM as a baseline, and move on to newer or more advanced methods later.
- Always remember to standardize your data (scaling, transforms); this has been shown to significantly affect performance.
- Fine-tune the classifier parameters to get the best performance, for example by cross-validation.
- If possible, extract better features from the data (this may need domain knowledge) to help the classifier perform better.
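The recommendations above can be sketched with scikit-learn roughly as follows: standardize the inputs, then tune a Gaussian (RBF) SVM and a Random Forest with 4-fold cross-validation. The wine dataset and the parameter grids are assumptions chosen only to keep the example small; they are not the settings used in the paper, and you would substitute your own data and search ranges.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset (assumption); replace with your own feature matrix and labels.
X, y = load_wine(return_X_y=True)

# Putting the scaler inside the pipeline re-fits it on each training fold,
# so no information from the validation fold leaks into the scaling.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": ["scale", 0.01, 0.1]}

# Illustrative grid only; Random Forests are not sensitive to feature scaling.
forest = RandomForestClassifier(random_state=0)
forest_grid = {"n_estimators": [100, 200], "max_features": ["sqrt", None]}

results = {}
candidates = [("Gaussian SVM", svm, svm_grid), ("Random Forest", forest, forest_grid)]
for name, model, grid in candidates:
    # 4-fold cross-validated grid search over the candidate parameters.
    search = GridSearchCV(model, grid, cv=4, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)
    print(f"{name}: accuracy {search.best_score_:.3f} with {search.best_params_}")
```

Because the best parameters are picked on the same folds that produce `best_score_`, that score tends to be slightly optimistic; keeping a separate held-out test set gives a fairer final estimate.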
Drop a comment and let us know which classifiers you are using in your products and what your experience has been so far.