Mishandling Missing Values @ DS ML models

Subhashini Sharma Tripathi
Dec 15, 2023

Handling missing data is a critical part of building accurate machine learning models. Mishandling it can lead to biased or inaccurate results. Yes, this can be the reason why your models don’t work so well in real life! Here are some “what NOT to do” practices when dealing with missing data, so that your models stay accurate.

Check out the video: https://youtu.be/A5njZ0tDUPY?si=ahr21MMOtbtLG_7o

Contact me at subhashini@pexitics.com / WhatsApp at +91 734966–2320.

1. Ignoring Missing Data (Deletion):

- What NOT to do: Removing rows or columns with missing data without consideration.

- Why: You might lose valuable information and shrink your dataset. If the rows with gaps differ systematically from the complete rows, the remaining sample is no longer representative, leading to biased or less accurate models.
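
Before deleting anything, it can help to measure what deletion would cost. The sketch below uses a made-up toy DataFrame (the column names `age`, `income` and `churned` are assumptions for illustration) to show the share of missing values, the rows lost by dropping, and whether incomplete rows behave differently on the target.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names and values are made up for illustration.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 51, np.nan, 38],
    "income": [40_000, 52_000, np.nan, 88_000, 61_000, 45_000],
    "churned": [0, 1, 0, 1, 1, 0],
})

# Share of missing values per column.
missing_share = df.isna().mean()

# How many rows would survive dropping every row with a gap?
rows_kept = len(df.dropna())
rows_lost = len(df) - rows_kept

# Does the target rate differ between rows with and without gaps?
# A large difference hints that deletion would bias the sample.
has_gap = df.drop(columns="churned").isna().any(axis=1)
target_rate_by_gap = df.groupby(has_gap)["churned"].mean()

print(missing_share)
print(f"rows kept: {rows_kept}, rows lost: {rows_lost}")
print(target_rate_by_gap)
```

On real data, a clear gap between the two target rates is a signal to prefer imputation over deletion.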

2. Filling Missing Data with Arbitrary Values:

- What NOT to do: Replacing missing values with arbitrary constants like 0 or -1.

- Why: This can introduce noise and bias into your model, as the chosen values may not reflect the true nature of the missing data.
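
A small illustration of the problem, using invented temperature readings (the series name `temp_c` is an assumption): filling with 0 drags the mean far away from the observed values, while keeping the gap and adding a flag column preserves the information that a value was missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings; values are illustrative only.
temps = pd.Series([21.5, 22.0, np.nan, 23.1, np.nan, 22.4], name="temp_c")

# Arbitrary constant: 0 is a real, and here very wrong, temperature.
filled_zero = temps.fillna(0)

# pandas skips NaN by default, so this mean reflects observed values only.
mean_observed = temps.mean()
mean_zero_filled = filled_zero.mean()

# A less destructive option: keep the gap and record it in a flag column.
features = pd.DataFrame({
    "temp_c": temps,
    "temp_c_missing": temps.isna().astype(int),
})

print(f"mean of observed values: {mean_observed:.2f}")
print(f"mean after filling with 0: {mean_zero_filled:.2f}")
```

The flag column lets a model learn from the missingness without pretending that 0 was measured.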

3. Using Mean/Median Imputation:

- What NOT to do: Blindly filling missing values in a column with its mean or median computed over the entire dataset.

- Why: This shrinks the variance of the column, weakens its relationship with other features, and makes the data less representative, especially if the data is not missing at random.
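
The effect on the distribution is easy to see on synthetic data. This sketch assumes a made-up, right-skewed "income" column with 30% of its values removed at random; every gap is then replaced by the same number, so the spread shrinks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, right-skewed "income" column; purely illustrative.
income = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=1_000))

# Remove roughly 30% of the values completely at random.
mask = rng.random(len(income)) < 0.3
observed = income.mask(mask)

# Mean imputation: every gap receives the same value.
mean_filled = observed.fillna(observed.mean())

std_observed = observed.std()
std_mean_filled = mean_filled.std()

print(f"std of observed values:    {std_observed:,.0f}")
print(f"std after mean imputation: {std_mean_filled:,.0f}")
```

The second number comes out smaller because the imputed points sit exactly at the mean and add no spread.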

4. Forward or Backward Filling for Time Series Data:

- What NOT to do: Using forward or backward filling to impute missing values in time series data.

- Why: Time series data often exhibits trends and seasonality. Forward filling repeats stale values and flattens those patterns, while backward filling copies future values into the past, which introduces temporal leakage and affects model accuracy.
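
A tiny example makes the difference visible. The daily series below is invented, with a steady upward trend and a two-day gap; it compares forward fill, backward fill and linear interpolation as one possible alternative.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2023-01-01", periods=6, freq="D")

# Hypothetical daily series with an upward trend and a two-day gap.
sales = pd.Series([100.0, 110.0, np.nan, np.nan, 140.0, 150.0], index=idx)

forward = sales.ffill()       # repeats 110 across the gap and flattens the trend
backward = sales.bfill()      # copies 140 from the future into earlier days
linear = sales.interpolate()  # follows the trend, but also relies on the later point

print(pd.DataFrame({"ffill": forward, "bfill": backward, "linear": linear}))
```

Note that interpolation also looks ahead, so for forecasting features it is only safe when the later observation would really be known at prediction time.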

5. Imputing with Predictive Models on the Entire Dataset:

- What NOT to do: Training a predictive imputation model on the entire dataset, including rows (or the target column) that will later be used to evaluate the final model.

- Why: This can lead to data leakage, as the imputed values may rely on information that would not be available when the model is used on new, real-world data.
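
One way to keep a model-based imputer honest is to fit it on the training features only. The sketch below uses scikit-learn's `KNNImputer` on synthetic data; the data, the 10% missing rate and the choice of imputer are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic features and target; stand-ins for a real dataset.
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # roughly 10% of cells go missing

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The imputer learns only from training features. The target is never
# passed in as an input column, and test rows play no part in fitting.
imputer = KNNImputer(n_neighbors=5)
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

print(X_train_imp.shape, X_test_imp.shape)
```

The same fitted imputer can then be reused on new data at prediction time, which mirrors what is available in production.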

6. Ignoring Patterns in Missing Data:

- What NOT to do: Not investigating whether data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR).

- Why: Ignoring patterns in missing data can lead to biased results. Understanding why data is missing can help you choose appropriate imputation methods.
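
A simple first check is to compare the missing rate of one column across groups of another. The survey data below is invented (the columns `age_group` and `income` are assumptions), and shows income being left blank more often in one age group.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: income is blank more often for the oldest group.
df = pd.DataFrame({
    "age_group": ["18-30"] * 4 + ["31-50"] * 4 + ["51+"] * 4,
    "income": [30, 35, 32, np.nan,
               50, np.nan, 55, 52,
               np.nan, np.nan, 70, np.nan],
})

# Missing rate of income within each age group.
miss_by_group = df["income"].isna().groupby(df["age_group"]).mean()

print(miss_by_group)
```

A check like this can reveal that missingness depends on other observed columns. It cannot show whether missingness depends on the unobserved value itself, so ruling out MNAR usually needs domain knowledge about how the data was collected.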

7. Imputing Missing Data Before Data Splitting:

- What NOT to do: Imputing missing values before splitting your dataset into training and testing sets.

- Why: This can introduce data leakage, as information from the test set can influence imputation in the training set, leading to over-optimistic model evaluations.
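
A common way to avoid this is to put the imputer inside a scikit-learn pipeline, so that it is refitted on the training portion of every split. The sketch below uses synthetic data; the median strategy and logistic regression are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)

# Synthetic classification data with about 15% of cells missing.
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)
X[rng.random(X.shape) < 0.15] = np.nan

# Imputer and model travel together: in each fold the medians are
# computed from the training rows only, then applied to the held-out rows.
model = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)

print(f"mean accuracy across folds: {scores.mean():.3f}")
```

The same pipeline object works with a plain train/test split: call `fit` on the training set and `score` on the test set.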

8. Treating Missing Values as a Separate Category:

- What NOT to do: Creating a separate category (e.g., “unknown” or “missing”) for missing data.

- Why: While this may work in some cases, it lumps together records that may be missing for very different reasons and can introduce bias. Use it cautiously, and check whether the missingness itself carries meaningful information before relying on it.
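
If a separate category is used, it is worth checking how large it is and how it relates to the target. The loan-style data below is invented (the columns `employment` and `defaulted` are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical data: employment type is sometimes not recorded.
df = pd.DataFrame({
    "employment": ["salaried", np.nan, "self-employed",
                   "salaried", np.nan, "salaried"],
    "defaulted": [0, 1, 0, 0, 1, 0],
})

# Make the gap an explicit category.
df["employment_filled"] = df["employment"].fillna("missing")

# Size and target rate of every category, including "missing".
summary = df.groupby("employment_filled")["defaulted"].agg(["count", "mean"])

print(summary)
```

A "missing" level with very few rows, or with a target rate that cannot be explained, is a reason to investigate before keeping it as a feature.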

9. Over-Imputing or Over-Engineering Features:

- What NOT to do: Creating too many imputed features or using needlessly complex imputation methods.

- Why: Over-imputation can lead to overfitting and complicate the model, reducing its generalization performance.

10. Skipping Sensitivity Testing:

- What NOT to do: Failing to test the sensitivity of your model to different imputation methods.

- Why: Different imputation methods may yield different results, and it’s essential to evaluate their impact on model performance.
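
A sensitivity test can be as simple as running the same model with several imputers and comparing cross-validated scores. The sketch below uses synthetic regression data; the three imputers, the `Ridge` model and the R² metric are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(7)

# Synthetic regression data with about 20% of cells missing.
X = rng.normal(size=(300, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=300)
X[rng.random(X.shape) < 0.2] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}

# Same model, same folds, different imputation: only the imputer changes.
results = {}
for name, imputer in imputers.items():
    pipe = make_pipeline(imputer, Ridge())
    results[name] = cross_val_score(pipe, X, y, cv=5, scoring="r2").mean()

for name, score in results.items():
    print(f"{name:>6}: mean R^2 = {score:.3f}")
```

If the scores barely move, the model is robust to the imputation choice; if they swing widely, the imputation step deserves as much attention as the model itself.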

Instead of these “what NOT to do” practices, consider using appropriate missing data handling techniques such as multiple imputation, data-specific imputation methods, or even redesigning your model architecture to handle missing data gracefully. Always evaluate the impact of missing data handling on your model’s performance to ensure high accuracy and robustness.
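
As a rough illustration of the multiple imputation idea, the sketch below draws several completed datasets with scikit-learn's `IterativeImputer` (still marked experimental in scikit-learn, hence the extra import) and looks at how much an estimate varies between them. The data is synthetic, and five draws is an arbitrary choice.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)

# Synthetic data: the third column depends on the first two.
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

# Remove about 20% of the third column.
X_missing = X.copy()
X_missing[rng.random(200) < 0.2, 2] = np.nan

# Draw several plausible completions instead of a single "best guess".
col_means = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed, max_iter=10)
    completed = imputer.fit_transform(X_missing)
    col_means.append(completed[:, 2].mean())

pooled_mean = float(np.mean(col_means))
between_var = float(np.var(col_means, ddof=1))

print(f"pooled mean of column 3: {pooled_mean:.3f}")
print(f"between-imputation variance: {between_var:.5f}")
```

This only shows the between-imputation spread. A full multiple imputation analysis also pools the within-imputation variance, and in a modeling workflow the imputer would be fitted on the training data only.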

Do share, like and comment! :)

#DataScience #MachineLearning #MissingValues #DataPreprocessing #DataCleaning #FeatureEngineering #DataQuality #DataAnalysis #AI #DataHandling #DataModeling #DataAnalytics #DataManagement #AIEthics #DataStrategy #MLModels #DataIntegrity #DataMining #AIInsights #DataEthics

Originally published at https://www.linkedin.com.

Subhashini Sharma Tripathi

My passion is Intelligence Amplification: using tech and data to make great decisions. Currently, I am bullish on Generative AI for banking solutions.