10 Big Mistakes In Data Projects – DSBoost #42
What are the most common mistakes made by Data Scientists?
This Reddit post by u/dhaitz on common mistakes made by junior data scientists sparked an insightful discussion. Here are the top 10 errors:
Data Leakage in Training and Evaluation: A significant mistake is allowing data leakage from training to evaluation sets. This can give a false impression of model performance, especially in cases involving temporal elements.
Inadequate Handling of Temporal Data: Junior data scientists often struggle with temporal data, which is a frequent source of leakage. Commenters suggested entering ML competitions (e.g., on Kaggle) to build this skill.
Preprocessing Data Before Splitting: A common error is preprocessing data (like normalization) before splitting into training and test sets, which leaks test-set statistics into training and inflates evaluation results.
Incorrect Application of Categorical Encoding: Using methods like target encoding on the entire dataset rather than just the training set can lead to leakage.
Overestimating Model Performance: There's a tendency to overestimate a model's performance, especially when initial results seem exceptionally good, without thorough debugging and validation.
Lack of Domain Knowledge and Data Intuition: Junior data scientists often focus too much on modeling without gaining a sufficient understanding of the data and its domain context.
Not Starting with a Simple Baseline: There's a tendency to skip establishing a simple baseline model, which makes it hard to judge whether a more complex model actually adds value.
Overreliance on Complex Models: New data scientists often prefer complex or 'fancy' algorithms without assessing their necessity or effectiveness for the specific data type or problem.
Failure to Recognize Business Impact: Many junior data scientists fail to understand how their models can practically impact business decisions or actions.
Mistaking Features for Causes: There is confusion between identifying important features in a model and interpreting them as the actual causes of outcomes.
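Several of the pitfalls above (preprocessing before splitting, fitting transforms on the full dataset, skipping a baseline) have a common remedy: split first, fit all preprocessing on the training set only, and compare against a trivial model. Here is a minimal sketch using scikit-learn on synthetic data; the variable names and the synthetic dataset are illustrative assumptions, not from the Reddit thread.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier

# Synthetic data: 5 features, label driven by the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# 1. Split BEFORE any preprocessing, so test statistics never leak into training.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. A pipeline fits the scaler on training data only; the same pattern keeps
#    encoders (e.g., target encoding) from seeing the test set.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# 3. Always compare against a trivial baseline before reaching for anything fancy.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(f"baseline accuracy: {baseline.score(X_test, y_test):.2f}")
print(f"model accuracy:    {model.score(X_test, y_test):.2f}")
```

Wrapping the scaler and estimator in a single pipeline also means cross-validation refits the preprocessing inside each fold, which is the safest default.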
These points underscore the importance of a balanced approach in data science that combines technical skills with a deep understanding of data context, careful preprocessing, and an awareness of the business implications of models.
We are excited to announce a collaboration between DSBoost and Train In Data! As part of our launch offer, we are providing a 15% discount on Train In Data courses when you use the code DSBOOSTNL at checkout. Not only will you benefit from the valuable insights and practical knowledge shared in these courses, but you will also be supporting DSBoost for each course purchased through this offer. Don't miss out on this opportunity to enhance your data science skills. Enroll now and take the next step in your data science journey!
Podcast of the week 🎙️
We hear a lot about the bad, destructive consequences of technology, especially AI.
‘It will take my job.’
‘AI will kill humanity.’ It may sound extreme, but it’s definitely part of the conversation.
But reality is not that grim. We should focus more on the good things technology brings to our lives, and this week's podcast does exactly that.
Here are some takeaways:
Achieving a sustainable society is not possible without technological advancements. There is no more powerful tool to tackle global challenges than ethical tech developments.
Technology is not positive or negative. It’s neutral. Its impact depends on how we use it. If we focus on positive outcomes, every technology can be used for a better future.
Trust is essential. Distrust is usually the result of negative incidents, often rooted in a lack of governance.
Yes, AI can take away jobs, but at the same time, it will create some. ‘According to the World Economic Forum, 25% of jobs will be disrupted by AI and emerging technologies, with 75 million jobs disappearing. However, 133 million new jobs will appear.’