https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets
Random Under-Sampling:
In this phase of the project we will implement "Random Under-Sampling", which consists of removing data from the majority class in order to obtain a more balanced dataset and thus keep our models from overfitting.
Steps:
- The first thing we have to do is determine how imbalanced our classes are (use "value_counts()" on the class column to count each label).
- Once we know how many instances are labeled fraud (Fraud = "1"), we bring the non-fraud transactions down to the same amount (assuming we want a 50/50 ratio): this is equivalent to 492 cases of fraud and 492 cases of non-fraud transactions.
- After applying this technique, we have a sub-sample of our dataframe with a 50/50 ratio between the classes. The next step is to shuffle the data so we can check whether our models maintain a consistent accuracy every time we run this script.
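The steps above can be sketched in pandas roughly as follows. Since the Kaggle credit-card dataset itself is not loaded here, a small synthetic dataframe with the same `Class` column (1 = fraud, 0 = non-fraud) stands in for it; the column name and the counts are therefore illustrative, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the credit-card dataset: in the real data,
# 'Class' has 492 fraud (1) rows and 284,315 non-fraud (0) rows.
df = pd.DataFrame({
    "Amount": range(1000),
    "Class": [1] * 50 + [0] * 950,
})

# Step 1: measure the imbalance.
print(df["Class"].value_counts())

# Step 2: keep all fraud rows and randomly sample an equal number
# of non-fraud rows (random under-sampling).
fraud = df[df["Class"] == 1]
non_fraud = df[df["Class"] == 0].sample(n=len(fraud), random_state=42)

# Step 3: concatenate and shuffle to get the 50/50 sub-sample.
balanced = pd.concat([fraud, non_fraud]).sample(frac=1, random_state=42)
print(balanced["Class"].value_counts())
```

Fixing `random_state` makes the sub-sample reproducible; dropping it gives a fresh random sub-sample (and hence a slightly different model) on each run, which is what the shuffling step is meant to test.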
Note: The main issue with "Random Under-Sampling" is the risk that our classification models will not perform as accurately as we would like, since there is a great deal of information loss (keeping only 492 of the 284,315 non-fraud transactions).