Anomaly Detection:
Our main aim in this section is to remove "extreme outliers" from features that have a high correlation with our classes. Removing them should have a positive impact on the accuracy of our models.
Interquartile Range Method:
- Interquartile Range (IQR): We calculate this as the difference between the 75th percentile and the 25th percentile. Our aim is to set a threshold beyond the 75th and 25th percentiles; any instance that passes this threshold will be deleted.
- Boxplots: Besides making the 25th and 75th percentiles easy to see (the two ends of the box), boxplots also make extreme outliers easy to spot (points beyond the lower and upper extremes).
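The IQR method described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the original notebook's code: the helper names `iqr_bounds` and `remove_outliers`, the default multiplier `k=1.5`, and the sample values are all assumptions made for the example.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) cut-offs: Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # 75th percentile minus 25th percentile
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR cut-offs."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Hypothetical feature values with one obvious extreme point (102).
sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
lower, upper = iqr_bounds(sample)
cleaned = remove_outliers(sample)
print(f"cut-offs: [{lower}, {upper}]")
print(f"kept {len(cleaned)} of {len(sample)} values")
```

In practice the same cut-offs would be computed per feature (and often per class) before dropping rows from the dataset.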
Outlier Removal Tradeoff:
We have to be careful about how far out we set the threshold for removing outliers. We determine the threshold by multiplying the interquartile range by a number (e.g. 1.5). The higher this threshold is (multiplying by a higher number, e.g. 3), the fewer outliers it will detect, and the lower this threshold is, the more outliers it will detect.
The Tradeoff: The lower the threshold, the more outliers it will remove. However, we want to focus on "extreme outliers" rather than on outliers in general. Why? Because removing too many points risks information loss, which can cause our models to have lower accuracy. You can play with this threshold and see how it affects the accuracy of our classification models.
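To see the tradeoff concretely, the sketch below counts how many points are flagged under two different multipliers. The function name `count_flagged` and the sample data are hypothetical, chosen so that one point is only a mild outlier and another is extreme.

```python
from statistics import quantiles


def count_flagged(values, k):
    """Count the values lying outside Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in values if v < lower or v > upper)


# Hypothetical data: 20 is a mild outlier, 102 is an extreme one.
values = [10, 11, 12, 12, 12, 13, 13, 14, 15, 20, 102]
for k in (1.5, 3.0):
    print(f"multiplier {k}: {count_flagged(values, k)} point(s) flagged")
```

With the smaller multiplier both 20 and 102 are flagged; with the larger one only 102 is, which is the behavior we want when targeting extreme outliers only.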