Correcting Previous Mistakes from Imbalanced Datasets:
- Never test on the oversampled or undersampled dataset.
- If we want to use cross-validation, remember to oversample or undersample the training data inside each cross-validation fold, not before splitting!
- Don't use accuracy as a metric with imbalanced datasets (it will usually be high and misleading); use the F1-score, precision/recall, or a confusion matrix instead.
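The cross-validation rule above can be sketched as follows. This is a minimal illustration on a synthetic dataset, using manual random oversampling of the minority class; the dataset, model, and oversampling step are all assumptions for the example (in practice, imbalanced-learn's `RandomOverSampler` inside a pipeline does the same job):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced dataset (~95:5 class ratio).
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

rng = np.random.default_rng(0)
scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # Oversample ONLY inside the training fold, so the held-out fold
    # keeps the original class distribution.
    minority = np.where(y_tr == 1)[0]
    extra = rng.choice(minority, size=len(y_tr) - 2 * len(minority), replace=True)
    X_tr = np.vstack([X_tr, X_tr[extra]])
    y_tr = np.concatenate([y_tr, y_tr[extra]])
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # Evaluate on the untouched fold with F1, not accuracy.
    scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))

print(round(float(np.mean(scores)), 3))
```

Oversampling before splitting would leak copies of the same minority rows into both the training and test folds, inflating the score.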
There are several ways to approach this classification problem that take the class imbalance into account.
- Collect more data? A nice strategy, but not applicable in this case.
- Changing the performance metric:
- Use the confusion matrix to calculate precision and recall
- F1-score (the harmonic mean of precision and recall)
- Use Cohen's Kappa - a classification accuracy normalized by the class imbalance in the data
- ROC curves - plot the true positive rate (sensitivity) against the false positive rate (1 - specificity)
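To see why these metrics matter, here is a sketch with scikit-learn; the labels and predictions below are made up for illustration (90 negatives, 10 positives, and a classifier that finds only 2 of the positives):

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up results: 88 true negatives, 2 false positives,
# 8 false negatives, 2 true positives.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 88 + [1] * 2 + [0] * 8 + [1] * 2

print(accuracy_score(y_true, y_pred))    # 0.9 -- looks great, but misleading
print(precision_score(y_true, y_pred))   # 0.5
print(recall_score(y_true, y_pred))      # 0.2 -- exposes the missed positives
print(round(f1_score(y_true, y_pred), 3))
print(round(cohen_kappa_score(y_true, y_pred), 3))
print(confusion_matrix(y_true, y_pred))  # rows: true class, cols: predicted
```

Accuracy reports 90% even though the classifier misses 8 of the 10 positive cases; recall, F1, and Kappa all flag the problem.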
- Resampling the dataset
- Essentially, this is a method that processes the data to reach an approximately 50-50 class ratio.
- One way to achieve this is OVER-sampling, which adds copies of the under-represented class (better when you have little data)
- Another is UNDER-sampling, which deletes instances from the over-represented class (better when you have lots of data)
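Both resampling strategies can be sketched in a few lines of NumPy; the 90:10 toy data below is an assumption for the example (imbalanced-learn's `RandomOverSampler` and `RandomUnderSampler` wrap the same idea):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)  # toy 90:10 imbalance

# OVER-sampling: duplicate random minority rows until classes match.
minority = np.where(y == 1)[0]
extra = rng.choice(minority, size=90 - 10, replace=True)
X_over = np.vstack([X, X[extra]])
y_over = np.concatenate([y, y[extra]])

# UNDER-sampling: keep only as many majority rows as minority rows.
majority = np.where(y == 0)[0]
keep = rng.choice(majority, size=10, replace=False)
X_under = np.vstack([X[keep], X[minority]])
y_under = np.concatenate([y[keep], y[minority]])

print(np.bincount(y_over))   # [90 90]
print(np.bincount(y_under))  # [10 10]
```

Oversampling keeps all 100 original rows (plus duplicates), while undersampling throws away 80 majority rows, which is why the former suits small datasets and the latter large ones.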