Thief of Wealth

Correcting Previous Mistakes from Imbalanced Datasets:


- Never test on the oversampled or undersampled dataset.

(To start with, never test on an over- or under-sampled dataset.)


- If we want to implement cross-validation, remember to oversample or undersample your training data during cross-validation, not before!

(If you do cross-validation, over/under-sample inside each training fold, not beforehand.)
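The rule above can be sketched with plain NumPy: resample only the training folds and leave each validation fold untouched. The dataset, the fold count, and the `oversample` helper below are all illustrative assumptions, not a fixed recipe:

```python
import numpy as np

def oversample(X, y, rng):
    # Randomly duplicate rows of every minority class until all classes
    # match the size of the largest class.
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        c_idx = np.where(y == c)[0]
        extra = rng.choice(c_idx, size=n_max - n, replace=True)
        parts.append(np.concatenate([c_idx, extra]))
    idx = np.concatenate(parts)
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)  # 90:10 imbalance (toy data)

# K-fold CV: oversample ONLY the training folds; the validation fold
# keeps its original class distribution.
k = 5
folds = np.array_split(rng.permutation(len(y)), k)
for i in range(k):
    val_idx = folds[i]
    tr_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    X_tr, y_tr = oversample(X[tr_idx], y[tr_idx], rng)  # balanced training data
    X_val, y_val = X[val_idx], y[val_idx]               # untouched validation data
    # ... fit a model on (X_tr, y_tr), evaluate on (X_val, y_val)
```

Oversampling before splitting would leak copies of validation rows into the training data, which is exactly the mistake the bullet warns against.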


- Don't use accuracy as a metric with imbalanced datasets (it will usually be high and misleading); instead use the F1-score, precision/recall, or a confusion matrix.

(Don't use accuracy as your evaluation metric.)



There are several ways to approach this classification problem while taking this imbalance into consideration.

  • Collect more data? A nice strategy, but not applicable in this case
  • Changing the performance metric:
    • Use the confusion matrix to calculate Precision and Recall
    • F1 score (the harmonic mean of precision and recall)
    • Use Cohen's Kappa: classification accuracy normalized by the class imbalance in the data
    • ROC curves: plot sensitivity (true-positive rate) against the false-positive rate
  • Resampling the dataset
    • Essentially this is a method that processes the data to reach an approximate 50-50 class ratio.
    • One way to achieve this is OVER-sampling, which adds copies of the under-represented class (better when you have little data)
    • Another is UNDER-sampling, which deletes instances from the over-represented class (better when you have lots of data)
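Both resampling strategies from the list can be written in a few lines of NumPy index arithmetic. The 900:100 toy dataset below is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = np.array([0] * 900 + [1] * 100)  # 90:10 imbalance (toy data)

maj_idx = np.where(y == 0)[0]  # over-represented class
min_idx = np.where(y == 1)[0]  # under-represented class

# OVER-sampling: draw minority rows with replacement until they
# match the majority count (adds copies; better with little data).
over = np.concatenate([maj_idx, rng.choice(min_idx, size=len(maj_idx), replace=True)])
X_over, y_over = X[over], y[over]   # 900 vs 900

# UNDER-sampling: keep only as many majority rows as there are
# minority rows (deletes instances; better with lots of data).
under = np.concatenate([rng.choice(maj_idx, size=len(min_idx), replace=False), min_idx])
X_under, y_under = X[under], y[under]  # 100 vs 100
```

In practice the imbalanced-learn library offers the same ideas as ready-made transformers (`RandomOverSampler`, `RandomUnderSampler`, and synthetic variants such as SMOTE).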

