Correcting Previous Mistakes from Imbalanced Datasets:
- Never test on the oversampled or undersampled dataset.
- If we want to use cross-validation, remember to oversample or undersample the training data inside each cross-validation fold, not before splitting!
- Don't use accuracy as a metric with imbalanced datasets (it will usually be high and misleading); use the F1-score, precision/recall, or a confusion matrix instead.
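The cross-validation rule above can be sketched as follows. This is a minimal illustration on a synthetic dataset, using manual random oversampling of the minority class; the dataset, model, and oversampling step are all assumptions for the example (in practice, imbalanced-learn's `RandomOverSampler` inside a pipeline does the same job):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced dataset (~95:5 class ratio).
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

rng = np.random.default_rng(0)
scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # Oversample ONLY inside the training fold, so the held-out fold
    # keeps the original class distribution.
    minority = np.where(y_tr == 1)[0]
    extra = rng.choice(minority, size=len(y_tr) - 2 * len(minority), replace=True)
    X_tr = np.vstack([X_tr, X_tr[extra]])
    y_tr = np.concatenate([y_tr, y_tr[extra]])
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # Evaluate on the untouched fold with F1, not accuracy.
    scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))

print(round(float(np.mean(scores)), 3))
```

Oversampling before splitting would leak copies of the same minority rows into both the training and test folds, inflating the score.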
There are several ways to approach this classification problem that take the class imbalance into account.
- Collect more data? A nice strategy, but not applicable in this case.
- Changing the performance metric:
- Use the confusion matrix to calculate precision and recall
- F1-score (the harmonic mean of precision and recall)
- Use Cohen's Kappa - a classification accuracy normalized by the class imbalance in the data
- ROC curves - plot the true positive rate (sensitivity) against the false positive rate (1 - specificity)
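To see why these metrics matter, here is a sketch with scikit-learn; the labels and predictions below are made up for illustration (90 negatives, 10 positives, and a classifier that finds only 2 of the positives):

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up results: 88 true negatives, 2 false positives,
# 8 false negatives, 2 true positives.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 88 + [1] * 2 + [0] * 8 + [1] * 2

print(accuracy_score(y_true, y_pred))    # 0.9 -- looks great, but misleading
print(precision_score(y_true, y_pred))   # 0.5
print(recall_score(y_true, y_pred))      # 0.2 -- exposes the missed positives
print(round(f1_score(y_true, y_pred), 3))
print(round(cohen_kappa_score(y_true, y_pred), 3))
print(confusion_matrix(y_true, y_pred))  # rows: true class, cols: predicted
```

Accuracy reports 90% even though the classifier misses 8 of the 10 positive cases; recall, F1, and Kappa all flag the problem.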
- Resampling the dataset
- Essentially, this is a method that processes the data to reach an approximately 50-50 class ratio.
- One way to achieve this is OVER-sampling, which adds copies of the under-represented class (better when you have little data)
- Another is UNDER-sampling, which deletes instances from the over-represented class (better when you have lots of data)
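Both resampling strategies can be sketched in a few lines of NumPy; the 90:10 toy data below is an assumption for the example (imbalanced-learn's `RandomOverSampler` and `RandomUnderSampler` wrap the same idea):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)  # toy 90:10 imbalance

# OVER-sampling: duplicate random minority rows until classes match.
minority = np.where(y == 1)[0]
extra = rng.choice(minority, size=90 - 10, replace=True)
X_over = np.vstack([X, X[extra]])
y_over = np.concatenate([y, y[extra]])

# UNDER-sampling: keep only as many majority rows as minority rows.
majority = np.where(y == 0)[0]
keep = rng.choice(majority, size=10, replace=False)
X_under = np.vstack([X[keep], X[minority]])
y_under = np.concatenate([y[keep], y[minority]])

print(np.bincount(y_over))   # [90 90]
print(np.bincount(y_under))  # [10 10]
```

Oversampling keeps all 100 original rows (plus duplicates), while undersampling throws away 80 majority rows, which is why the former suits small datasets and the latter large ones.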