Learning, whether by humans or machines, requires a structured approach. While children naturally absorb knowledge through interaction and experience, machines depend on curated datasets, algorithms, and resources. Any gaps in these disrupt learning and compromise performance, especially in AI-powered applications. Challenges like dataset quality, model selection, and system design are pivotal to success. This piece explores these critical hurdles and the overlooked role of infrastructure in shaping effective AI systems.
1. Data Sufficiency Challenges
Data is the backbone of machine learning, with models relying entirely on datasets to learn and predict. Unlike humans, who can infer from just a few examples, machines need vast amounts of high-quality data for accuracy. Yet many organizations invest in AI without assessing data sufficiency, often leading to prolonged efforts and poor outcomes. The quantity and quality of data, free from missing values, noise, and outliers, are critical to reliable predictions.
First, why does it matter?
Insufficient or unrepresentative data leads to underperforming models, biased outcomes, and inaccurate predictions, turning AI investments into sunk costs.
1. Data Volume: Small datasets lack diversity, limiting generalization and accuracy. For instance, facial recognition or medical imaging systems fail to perform reliably without ample data to train robust models—quantity matters.
2. Data Quality: ML models rely on diverse data sources—images, text, voice, and more—often plagued by quality issues like missing values, inconsistent defaults, and erratic sensor readings. Standardizing data quality is essential for accurate and reliable outcomes.
3. Data Diversity: Unrepresentative training data leads to biased predictions. For instance, a loan prediction model focusing only on low-income individuals from specific demographics may unfairly classify them as high-risk, ignoring similar patterns in others. Diversity and accurate categorization matter.
4. Data Distribution: Overly similar training data limits a model's ability to differentiate. For instance, a fraud detection system trained on 90% non-fraudulent transactions may fail to catch subtle, clever fraud patterns. Diversity in data is key.
5. Data Relevance: Accurate predictions need relevant features. Outdated trends or irrelevant factors, like using nose length to assess loan applicants, only add noise. Focus on what truly matters.
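As a quick illustration of the distribution problem above, a few lines of Python can reveal how skewed a label column is before any training starts (the labels below are a hypothetical stand-in for a real transaction dataset):

```python
from collections import Counter

# Hypothetical transaction labels: heavily skewed toward "legit".
labels = ["legit"] * 90 + ["fraud"] * 10

counts = Counter(labels)
majority_share = max(counts.values()) / len(labels)

print(counts)           # class distribution
print(majority_share)   # 0.9 -> 90% of samples belong to one class

# A simple sanity gate before training: flag heavy imbalance.
if majority_share > 0.8:
    print("Warning: dataset is heavily imbalanced; consider resampling.")
```

A check like this takes seconds and can save weeks of training a model that has only ever seen one kind of example.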
Solutions to the above data sufficiency issues:
· Conscious Data Audits:
o Ensure dataset diversity, quality, and volume before training to avoid bias and one-dimensional learning.
o Represent all classes/samples in the dataset for comprehensive coverage.
o Ensure diversity within each sample group, with clear distinctions between groups.
o Maintain consistency within each group, using many examples to ensure thorough learning.
o Diverse, representative data is essential for practical AI training and model performance.
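The audit bullets above can be sketched as a small pre-training check: count rows, missing values per field, and samples per group. The records and field names here are hypothetical; a real audit would run against your actual dataset:

```python
# Minimal data-audit sketch: volume, missing values, and group coverage.
records = [
    {"age": 34, "income": 52000, "group": "A"},
    {"age": None, "income": 61000, "group": "B"},
    {"age": 45, "income": None, "group": "A"},
    {"age": 29, "income": 48000, "group": "B"},
]

# Missing-value count per field.
missing = {k: sum(1 for r in records if r[k] is None) for k in records[0]}

# Samples per group, to confirm every class/segment is represented.
group_counts = {}
for r in records:
    group_counts[r["group"]] = group_counts.get(r["group"], 0) + 1

print("rows:", len(records))
print("missing per field:", missing)
print("samples per group:", group_counts)
```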
· Synthetic Data Generation:
If provision for a training dataset to train a model is scarce, create enough synthetic data to augment real-world data.
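One simple way to augment a scarce numeric dataset is to jitter existing samples with small Gaussian noise. This is only a sketch of the idea (techniques such as SMOTE do something more principled for classification data):

```python
import random

random.seed(0)

# Hypothetical real samples: (temperature, pressure) readings.
real = [(21.5, 101.2), (22.1, 100.8), (20.9, 101.5)]

def jitter(sample, scale=0.1):
    """Create one synthetic sample by adding small Gaussian noise."""
    return tuple(x + random.gauss(0, scale) for x in sample)

# Generate 3 synthetic copies per real sample.
synthetic = [jitter(s) for s in real for _ in range(3)]
augmented = real + synthetic

print(len(augmented))  # 3 real + 9 synthetic = 12
```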
· Data Pre-processing:
Clean and standardize your dataset before training, just as you'd check food quality before feeding your baby.
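A minimal sketch of that cleaning step for a single numeric feature: impute missing values with the mean, then standardize to zero mean and unit variance (in practice you would likely reach for scikit-learn's SimpleImputer and StandardScaler):

```python
# Impute missing values with the column mean, then z-score standardize.
values = [10.0, 12.0, None, 14.0, None, 16.0]

observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)
imputed = [v if v is not None else mean for v in values]

# Standardize: subtract the mean, divide by the (population) std deviation.
mu = sum(imputed) / len(imputed)
var = sum((v - mu) ** 2 for v in imputed) / len(imputed)
std = var ** 0.5
standardized = [(v - mu) / std for v in imputed]

print(standardized)  # mean ~0, unit variance
```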
· Continuous / regular training:
Implement a data feed process (ML-Ops) to regularly retrain models, ensuring relevance for accurate predictions.
· Teamwork & Collaboration:
Data scientists should collaborate with engineers and experts early on to select relevant features, avoiding wasted time on unnecessary exploratory analysis.
2. Model Fitness
Why do we need a model for prediction?
To predict an outcome, we look for patterns—if the event follows a consistent pattern, we can predict it using a statistically significant model. When the model works across different datasets, the pattern is considered generalized. Now, we can explore the concepts of overfitting vs. underfitting.
Fitting refers to how well a model generalizes.
Machine learning models can suffer from underfitting or overfitting, both of which impair the model's ability to generalize to new data, reducing effectiveness.
Overfitting occurs when a model memorizes training data, including noise, reducing its ability to generalize. Consequences include:
· High training accuracy but low test and validation accuracy.
· Overly complex models for simple tasks (e.g., deep neural networks for basic predictions).
· Predictions based on irrelevant features, wasting time and resources.
· Failure to adapt to data variations, leading to unpredictable results.
· Complex models make debugging difficult and obscure the root cause of errors.
· Unreliable predictions can damage customer trust.
· Ethical and safety risks in critical decision-making.
Why it occurs:
· Using overly complex models (e.g., a CNN for simple temperature-pressure prediction, or excessive layers in an ANN).
· Choosing a dataset that doesn't reflect the real-world factors affecting the outcome.
· Training for too many epochs.
· Failing to drop irrelevant features, assuming unnecessary relationships.
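The "too many epochs" cause above is usually countered with early stopping: stop training once validation loss stops improving for a few consecutive epochs. A minimal sketch with a hypothetical validation-loss curve:

```python
# Early-stopping sketch: stop when validation loss hasn't improved
# for `patience` consecutive epochs. The loss values are hypothetical.
val_losses = [0.90, 0.70, 0.55, 0.48, 0.47, 0.49, 0.52, 0.55, 0.60]

patience = 2
best = float("inf")
bad_epochs = 0
stop_epoch = len(val_losses)

for epoch, loss in enumerate(val_losses):
    if loss < best:
        best = loss          # new best: reset the patience counter
        bad_epochs = 0
    else:
        bad_epochs += 1      # no improvement this epoch
        if bad_epochs >= patience:
            stop_epoch = epoch
            break

print("best validation loss:", best)
print("stopped at epoch:", stop_epoch)
```

In a real training loop the same logic wraps the epoch iteration, and you restore the weights saved at the best epoch.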
Underfitting:
Underfitting occurs when a model oversimplifies the data, failing to learn key patterns. It's like a student who doesn’t put in enough effort to understand complex topics, resulting in consistently poor performance.
· Weak predictions due to failure to capture relationships among data and features.
· Oversimplified models with high bias, failing to recognize implicit patterns.
· Low training, testing, and validation accuracy.
· Complex use cases tackled with simple models (e.g., using linear regression where a neural network is needed).
· Underfitted models fail to scale to complex scenarios.
· Low-accuracy predictions waste time and resources.
· Inability to adapt to data variations, making predictions unpredictable.
· Difficulty in debugging and identifying issues: is it model weakness or a data problem?
· Damaged customer trust and reputation due to unreliable predictions.
· Life-critical decisions cannot rely on weak AI.
Why it occurs:
· Incorrect model choice (e.g., a linear model) for complex, non-linear data trends.
· Insufficient training time or iterations.
· Inadequate feature selection, missing key variables that impact outcomes.
· Over-regularized models that fail to capture data nuances and variations.
Overfitting vs. Underfitting: Think of the model as a T-shirt and the data as the body. A model that’s too complex (like a 3XL T-shirt) overfits the data, failing to generalize to new datasets. On the other hand, a model that's too simple (like a Small T-shirt) underfits, missing important complexities.
In machine learning, overfitting means the model fits the training data too closely, capturing noise instead of useful patterns. Underfitting happens when the model is too simplistic to learn from the data. Both lead to poor predictions and lack of generalization. Proper data preparation and feature selection are key to finding the right balance.
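The trade-off can be shown numerically: on the same noisy data, a higher-degree polynomial always achieves lower training error, which is exactly why training error alone cannot reveal overfitting. The data below is synthetic (a quadratic trend plus noise):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic quadratic data with noise.
x = np.linspace(-3, 3, 30)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0, 1.0, size=x.shape)

def train_mse(degree):
    """Fit a polynomial of the given degree and return training MSE."""
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return float(np.mean((y - pred) ** 2))

# Degree 1 underfits, degree 2 matches the trend, degree 9 chases noise.
errors = {d: train_mse(d) for d in (1, 2, 9)}
print(errors)
```

The degree-9 fit has the lowest training error, but the extra wiggles are noise, not signal; a held-out validation set is what exposes this.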
Solutions to the above model fitness issues:
· To solve the problem of overfitting, we need to generalize, regularize, simplify, or constrain the model. The question is, "How do we simplify the model?"
· We can reduce the number of features.
· We can pick a simpler or linear model instead of a polynomial/non-linear one.
· Gather ample data! (Do not be stingy with data.)
· Arrange sample data wisely (stratified sampling might help).
· Pay attention to data quality, along with sufficient exploratory data analysis.
· Tune the regularization hyperparameter (a parameter of the learning process).
· To solve the problem of underfitting, we reduce the constraint.
· Select more complex or powerful models with more parameters.
· Perform better feature engineering and add better features (X) instead of reducing dimensionality.
· Reduce the regularization hyperparameters.
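The effect of the regularization hyperparameter can be seen directly with ridge regression, where a larger alpha shrinks the weights toward zero. This is a pure-NumPy sketch of the closed-form solution on synthetic data, not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data: y = X @ w_true + noise.
X = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, size=50)

def ridge(X, y, alpha):
    """Closed-form ridge solution: (X^T X + alpha I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# The weight norm shrinks as alpha grows: a stronger constraint
# means a simpler, more regularized model.
norms = {a: float(np.linalg.norm(ridge(X, y, a))) for a in (0.0, 1.0, 100.0)}
print(norms)
```

Sweeping alpha like this (typically with cross-validation) is exactly the hyperparameter tuning the bullets above refer to: more alpha to fight overfitting, less to fight underfitting.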
· How do you evaluate your model (after you have selected one)?
· One option to evaluate your model is to deploy/serve it straight to production. Launch your AI app (knowing you will have screaming customers!), let the model learn from actual customer inputs, and eventually you will have happy customers!
· If you cannot stomach the above approach, a safer option is to split your dataset into two sets with roughly a 4:1 split (80-90% training set, 10-20% test set). This is not the validation set discussed later in this article.
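That 4:1 split takes only a couple of lines. A pure-Python sketch (scikit-learn's train_test_split does the same with more options, such as stratification):

```python
import random

random.seed(0)

# Hypothetical dataset of 100 samples.
data = list(range(100))
random.shuffle(data)  # shuffle before splitting to avoid ordering bias

split = int(0.8 * len(data))          # 80/20 split
train_set, test_set = data[:split], data[split:]

print(len(train_set), len(test_set))  # 80 20
```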
· The model generalizes well if its out-of-sample (generalization) error rate is low.
· How do you select a model? How do you tune hyperparameters?
· One option, if you have all the time in the world, is to train candidate models with different hyperparameter settings on the same data and see which one performs best.
Note:
The issue with the above approach is that if we use the same test set to compare multiple models while tuning hyperparameters, the models gradually adapt to that particular test set, so a real comparison is unlikely. The second model checked will give a false sense of performing better than it truly does, and it will falter when new datasets arrive in production. Instead, keep about 10-20% of the remaining training set (after taking the test set out) as a "hold-out validation set" or "dev set," and tune against that.
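Putting that note into numbers: carve the test set out first, then hold out a validation (dev) set from what remains. A sketch with a hypothetical 1,000-sample dataset:

```python
import random

random.seed(0)

data = list(range(1000))
random.shuffle(data)

# 1) Carve out 20% as the untouched test set.
n_test = int(0.2 * len(data))
test_set, remaining = data[:n_test], data[n_test:]

# 2) Hold out 20% of the *remaining* data as the validation (dev) set.
n_val = int(0.2 * len(remaining))
val_set, train_set = remaining[:n_val], remaining[n_val:]

print(len(train_set), len(val_set), len(test_set))  # 640 160 200
```

Hyperparameters are tuned against val_set; test_set is touched exactly once, at the very end, for an honest estimate of generalization error.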