

You don’t need to know everything to land your first job - DSBoost #38
This week we interviewed Soledad Galli. She is a data scientist, best-selling instructor, and Top Data Science Voice on LinkedIn!
What inspired you to enter the field of data science, and how did you navigate your early learning journey?
One of the things I enjoyed the most while working as a research scientist in academia was statistics and data analysis. Carrying out the experiments was daunting. But analysing the data… that I did love. That passion was what initially drew me to data science.
My journey into data science began when I noticed its increasing popularity on platforms like LinkedIn. I found some former colleagues who were now data scientists, so I reached out to them to ask how I could make the switch as well.
They recommended starting with online courses. I didn’t believe it was possible at first, but I enrolled in the courses anyway. I mean, what else can you do, right? You have to begin somewhere.
So I enrolled in a couple of online courses, including the Data Science Specialization on Coursera and the Analytics Edge on edX. These courses helped me take the first steps in data science. I then complemented my learning with hands-on data analysis projects, often sourced from the courses or from platforms like Kaggle. That gave me the practical experience to successfully navigate interview questions.
And, probably because of my academic background, I read relevant literature as well, including the works of Leo Breiman, a pioneer in the development of random forests.
Eventually, I secured my first job in the field.
How has your experience in finance and insurance shaped you as a data scientist?
My job at the finance company was also my first job as a data scientist. There was a steep learning curve for me. I needed to learn not just data science topics but also how to work in a corporate environment.
I learned that people consider you an expert even when you don’t feel like one yourself. My colleagues were seeking my input and insight, while I was wondering why they should trust me. After all, I was a biologist, and my knowledge of credit risk was borderline zero.
So what I learned from that is that teamwork is what matters. Perhaps none of us was an expert in the way academia defines one. But all of us had something to bring to the table, including myself. And that was what mattered.
I also learned that data science in real life does not look like the projects you do in online courses. The data comes from various sources, it's huge, and you need to tidy it yourself. And discovering how things get done professionally is really hard.
Now online content has grown and people are sharing more, but a few years back the information was scarcer. It took me an incredible amount of time searching the internet to find out how to preprocess data beyond “mean imputation” and “one hot encoding”, for example. And I thought, wait a minute, all of us are spending an incredible amount of time researching exactly the same things. Why not simply make available the product of the research I had already done, so others can find it more quickly?
That, together with my gratitude to the online education that allowed me to step into data science, motivated me to create my first course on feature engineering. And since then, I have come a long way.
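To make the two basics mentioned above concrete, here is a minimal sketch using Soledad's Feature-engine library; it assumes Feature-engine is installed, and the dataset and column names are invented purely for illustration:

```python
import pandas as pd
from feature_engine.imputation import MeanMedianImputer
from feature_engine.encoding import OneHotEncoder

# Invented toy data: a numeric column with a missing value and a categorical column.
df = pd.DataFrame({
    "age": [25, None, 40, 31],
    "city": ["London", "Madrid", "London", "Paris"],
})

# Mean imputation: replace the missing age with the column mean.
imputer = MeanMedianImputer(imputation_method="mean", variables=["age"])
df = imputer.fit_transform(df)

# One hot encoding: turn the city column into k-1 binary columns.
encoder = OneHotEncoder(variables=["city"], drop_last=True)
df = encoder.fit_transform(df)

print(df)
```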
Can you share a particular challenge you faced while developing and implementing machine learning models in a business setting, and how you overcame it?
I think the biggest challenge, or perhaps the biggest error many of us make, particularly on our first projects, is not understanding how the model will be used once it is developed. For a credit risk model, or an insurance model, that means not understanding which customers it will serve, what will be done with the outputs of the models, what data is available at the time the model is called, or who is going to interpret the results of those models.
When you start developing a model detached from that reality, you could end up using variables that are not available in production, training the model on data that is not representative of the customers it will serve, or creating models that are so complex that people can’t understand or use them.
The funny thing is that you could have spent weeks or months developing the “perfect” model, but once you deploy it and face these issues, you need to implement fixes quickly, and that affects the quality of the final product.
So, what I try to do, before even starting to think about what data I will need, is take time to sit with the various stakeholders and understand how they will interact with the models. I talk to the users of the models, say the fraud investigators and risk analysts, to understand what they need from the model. I discuss with the software engineers how the model will be integrated into production and which data will be available for the model to consume. In short, I try to anticipate as many pain points as possible, to minimise the quick fixes we need to make while deploying. It won’t prevent all uncertainties. But anticipating some is better than none.
Your books and courses cover a range of intermediate topics in machine learning. What advice do you have for beginners in the field who are looking to advance their skills and knowledge?
I think the key is to never stop learning. Never stop questioning. You don’t need to know everything to land your first job. But as you go along, the more you know, the more resourceful you will be to solve problems and troubleshoot things when the results are not what you expect.
Concretely, that means understanding what the machine learning models do, which model is suitable for which scenario, and what the trade-offs of each model are, say interpretability vs performance, or speed vs complexity.
The other thing that was key for me to learn is that, for almost everything I want to code, someone has coded it already and made it available in an open-source library. I spent hours writing code that you could do in one line of pandas.
The open-source ecosystem for machine learning is growing at incredible speed. So, I think it’s worth taking some time to look for what’s out there, because that will save you a lot of time.
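As a small illustration of that one-line-of-pandas point, here is a sketch with made-up data: computing a per-group mean by hand takes a loop and some bookkeeping, while pandas already provides it.

```python
import pandas as pd

# Invented toy data for illustration.
df = pd.DataFrame({
    "city": ["London", "London", "Madrid", "Madrid"],
    "income": [30_000, 45_000, 28_000, 36_000],
})

# The long way: accumulate sums and counts by hand.
totals, counts = {}, {}
for city, income in zip(df["city"], df["income"]):
    totals[city] = totals.get(city, 0) + income
    counts[city] = counts.get(city, 0) + 1
manual_means = {c: totals[c] / counts[c] for c in totals}

# The one line that pandas already ships with.
pandas_means = df.groupby("city")["income"].mean()

print(manual_means)
print(pandas_means)
```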
Can you tell us a bit about “Train In Data”?
Train in Data is an online education school that I created on the back of the popularity of my courses on feature engineering and feature selection. With our courses, we try to bridge the gap between beginner data science courses and what you need as a data science practitioner.
We teach the theory of the subject at hand, how to implement the methods or algorithms in Python, and we provide real-life examples and advice regarding when you should choose one methodology over the other.
The idea is to save data scientists the time they would spend if they were to do the research themselves, and at the same time show simple, elegant and efficient Python implementations that they can adapt and re-use in their projects.
We are excited to announce a collaboration between DSBoost and Train In Data! As part of our launch offer, we are providing a 15% discount on Train In Data courses when you use the code DSBOOSTNL at checkout. Not only will you benefit from the valuable insights and practical knowledge shared in these courses, but you will also be supporting DSBoost for each course purchased through this offer. Don't miss out on this opportunity to enhance your data science skills. Enroll now and take the next step in your data science journey!
Feature-engine is a highly successful open-source project with a large community of contributors. What have been some of the key lessons learned from developing and maintaining this project?
The first and most important lesson is that creating and maintaining Feature-engine was and continues to be a lot of fun and super rewarding.
I can’t explain what it feels like to see a project grow and learn how people are using it, or recommending it to others.
The second lesson is that you don’t need to be a top-notch programmer to create, maintain or contribute to open source. I started Feature-engine with the motivation to become a better developer myself. That tells you a lot about what I thought of my Python skills when I started the project. It has worked wonders for me.
So in summary, I’d say, contributing to open-source can be fun, rewarding and help you become a better programmer. So I’d recommend not shying away from making that first contribution to any library you like. The developers will be more than happy.
How do you balance your roles as an instructor, author, and developer, and what drives you to continue contributing to the field in such a multifaceted way?
To be honest, I just go with the flow. There are times when Feature-engine gets a lot of contributions, so I need to put more time into reviewing the pull requests. When Scikit-learn or pandas make a new release that breaks backward compatibility, and that happens more often than you’d think, I need to drop whatever else I am doing to make Feature-engine compatible with the latest versions. Otherwise, I focus mostly on creating courses and books.
The motivation comes from the continuous reward that I receive from the people who take my courses or read my books. We get amazing reviews on Trustpilot, or they reach out over LinkedIn or Twitter to say that the courses are wonderful, that they changed their lives, or that they helped them succeed at an interview. And many reach out as well with suggestions for new courses.
Can you discuss a particular moment or achievement in your career that has been especially meaningful or rewarding for you?
One of the moments that marked my career as a data scientist was those 15 minutes after my first talk at a tech meetup. It was at PyData, in London, in 2017. I had barely over a year of experience as a data scientist and hardly saw myself as an expert in the field. I was encouraged to present our credit risk model, and so I did. And the result was amazing. I got so many questions at the end of the talk, and people kept coming up throughout the meetup to ask more and more. They were genuinely interested, which reinforced my idea that I should get out more and spread the knowledge.
What trends or developments in data science and artificial intelligence are you most excited about, and how do you see them shaping the future of the field?
I am looking forward to the new and upcoming regulations on how corporations can use machine learning and what their responsibilities to users are. Artificial intelligence has, without question, a lot of potential to solve complex and important problems. And we know as well that it has caused tremendous damage, through, for example, the spread of misinformation, or through biases that more severely affect the most vulnerable sectors of society. If you haven’t heard about this, I invite you to read “Weapons of Math Destruction” by Cathy O’Neil, which provides a nice yet thorough summary.
I hope that this will force corporations to put the interest of users ahead of the interest of profit, and make them create products that are useful for society and not just for their pockets.