https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets
Random Under-Sampling:
In this phase of the project we will implement "Random Under-Sampling", which consists of removing data from the majority class in order to obtain a more balanced dataset and thus keep our models from overfitting.
Steps:
- The first thing we have to do is determine how imbalanced our classes are (use "value_counts()" on the class column to count each label).
- Once we know how many instances are labeled fraud (Fraud = "1"), we bring the non-fraud transactions down to the same amount (assuming we want a 50/50 ratio): this is equivalent to 492 cases of fraud and 492 cases of non-fraud transactions.
- After applying this technique, we have a sub-sample of our dataframe with a 50/50 ratio between the classes. The next step is to shuffle the data so we can check whether our models maintain a consistent accuracy every time we run this script.
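The steps above can be sketched in pandas roughly as follows. Since the Kaggle credit-card dataset itself is not loaded here, a small synthetic dataframe with the same `Class` column (1 = fraud, 0 = non-fraud) stands in for it; the column name and the counts are therefore illustrative, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the credit-card dataset: in the real data,
# 'Class' has 492 fraud (1) rows and 284,315 non-fraud (0) rows.
df = pd.DataFrame({
    "Amount": range(1000),
    "Class": [1] * 50 + [0] * 950,
})

# Step 1: measure the imbalance.
print(df["Class"].value_counts())

# Step 2: keep all fraud rows and randomly sample an equal number
# of non-fraud rows (random under-sampling).
fraud = df[df["Class"] == 1]
non_fraud = df[df["Class"] == 0].sample(n=len(fraud), random_state=42)

# Step 3: concatenate and shuffle to get the 50/50 sub-sample.
balanced = pd.concat([fraud, non_fraud]).sample(frac=1, random_state=42)
print(balanced["Class"].value_counts())
```

Fixing `random_state` makes the sub-sample reproducible; dropping it gives a fresh random sub-sample (and hence a slightly different model) on each run, which is what the shuffling step is meant to test.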
Note: The main issue with "Random Under-Sampling" is the risk that our classification models will not perform as accurately as we would like, since there is a great deal of information loss (keeping only 492 of the 284,315 non-fraud transactions).