5 Tips for aspiring Data Scientists from a Data Engineer - DSBoost #40
This week we interviewed Marco Franzon. He is a Data Engineer and DevOps engineer.
How did your background in biochemistry and nanobiotechnology influence your approach to data science and DevOps in your current role at eXact lab?
I don’t get a direct benefit from my biochemistry knowledge, since I am mostly a software developer at the moment. But my background has certainly helped me reach customers in bio-fields. Sharing a common lingo lets me identify the problem more quickly and tailor the best solution for them. (BTW, I also have a master's degree in Data Science, which is more related to my daily job.)
Transitioning from an academic focus in the life sciences to a career in data engineering and software development is quite a leap. What motivated this transition and how did you prepare for it?
I took a master's degree in Data Science and Scientific Computing in order to acquire all the necessary skills to fill the gap. During the courses, I had different job opportunities, but I chose eXact lab because it looked to me like the perfect place to grow, learn and have fun building great stuff with super experienced people. The motivation for the transition was a growing interest in, and an affinity for, the programming/coding world. At a certain point, I was working in the lab during the day and following coding tutorials at night, so I decided that it was time to change.
You've worked with a range of technologies, from Docker and Kubernetes to Python and R. For data scientists looking to broaden their technical skill set, which technologies would you recommend focusing on? What resources do you use or suggest to use to learn these skills?
In general, I love reading (well-written) documentation. I strongly believe that the best way to learn something is to read the guides/tutorials/docs made by the authors of those technologies. This gives you not only the basics of the language/technology but also the way to think about it. These days we see generative AI being used to learn something new. In general, it is a good thing, but if you can choose between a good doc and a generated answer, I strongly suggest the docs. On the other hand, it is fundamental to practise: 50% is theory, and 50% is practice. I usually learn theoretical and practical concepts in parallel, so that I have a clear picture of what I am doing.
As a data scientist, I strongly suggest learning the most used languages in this field, Python and R, then Docker to be sure that your results will be reproducible for everybody. Databases are another fundamental skill set. SQL and NoSQL are both important and used in many different fields. I suggest understanding the infrastructure and not only how to run a query on it. This way, you will be able to make good design choices for your data when you have to store/organise it.
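To make the point about understanding the infrastructure behind a query a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is an illustration, not something from the interview: the `measurements` table, its columns and the index name are hypothetical. It asks SQLite how it plans to run the same query before and after an index is added.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable steps SQLite plans to use for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row has four columns; the last one is the readable description.
    return [row[3] for row in rows]


# Hypothetical table of sensor readings, kept in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (id INTEGER PRIMARY KEY, sensor TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO measurements (sensor, value) VALUES (?, ?)",
    [(f"sensor_{i % 50}", float(i)) for i in range(5000)],
)

sql = "SELECT avg(value) FROM measurements WHERE sensor = ?"

# Without an index on `sensor`, SQLite has to scan the whole table.
plan_before = query_plan(conn, sql, ("sensor_7",))

# A design choice made at storage time: index the column we filter on.
conn.execute("CREATE INDEX idx_measurements_sensor ON measurements (sensor)")
plan_after = query_plan(conn, sql, ("sensor_7",))

print("before:", plan_before)
print("after: ", plan_after)
```

The query text is identical in both cases. Only the storage design changed, which is the kind of decision that is hard to make well if you only know how to write the query.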
DevOps is a key part of modern software development, especially in data-intensive applications. How do you see DevOps evolving in the context of AI and machine learning workflows?
DevOps is very important in the AI/ML world because it covers a wide range of aspects. Take the reproducibility problem, for example: DevOps technologies give data scientists the right environment to develop their models/algorithms, one that any other data scientist can reproduce. Another aspect is deployment to production. Thanks to DevOps, and to the MLOps specialisation in particular, it is possible to put an ML model into production without rewriting it from scratch, keeping production consistent with the development environment.
As someone who has written for a scientific dissemination website, how important do you think communication skills are in the data science field, and what tips can you give for effectively communicating complex technical concepts?
A data scientist has to understand data and then explain it to others, so communication skills are very important. I think that one of the most important skills is making good graphics such as diagrams, plots and so on. Let the data tell the story. You just have to add some “glue” between the things you are showing to tell a reasonable story, but I think that the results should be clear to everybody without any long explanation.
A common exercise is to explain a complex concept to someone who is completely new to the topic. If they get the point, you are on the right track.
With your varied experience in both software development and scientific research, what advice would you give to data scientists aspiring to develop robust, scalable infrastructure for their data workflows?
Think simple. Don’t overcomplicate your code.
Use well-supported libraries or the core libraries.
Don’t look for “shortcuts”; be sure to understand the problem.
Test your solution VERY WELL.
Make sure your work is reproducible, in terms of scientific results but also in terms of environment and compatibility.
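The tips on testing and reproducible results can be sketched in a few lines of standard-library Python. This is only an illustration under assumptions: `bootstrap_mean` is a hypothetical analysis step, and the tests are plain functions that a runner such as pytest would also pick up.

```python
import random
import statistics


def bootstrap_mean(values, n_resamples=1000, seed=0):
    """Estimate the mean of `values` by bootstrap resampling with a fixed seed."""
    if not values:
        raise ValueError("values must not be empty")
    # A local generator avoids hidden global state, so the result depends
    # only on the inputs and the seed.
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    ]
    return statistics.fmean(means)


def test_same_seed_same_result():
    data = [1.0, 2.0, 3.0, 4.0]
    assert bootstrap_mean(data, seed=42) == bootstrap_mean(data, seed=42)


def test_estimate_is_close_to_sample_mean():
    data = [1.0, 2.0, 3.0, 4.0]
    assert abs(bootstrap_mean(data, seed=42) - 2.5) < 0.1


def test_empty_input_is_rejected():
    try:
        bootstrap_mean([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")
```

Passing the seed explicitly, and only using core libraries, keeps the function simple and makes the result repeatable on another machine.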
How do you approach the challenge of balancing rapid development and deployment with the need to ensure robust and secure operations in your DevOps role?
Having a good team is one of the most important things for making sure that development and operations run smoothly without wasting time. At eXact lab, fortunately, we have a lot of experienced developers, not only in terms of technical skills but also in teamwork, and this is crucial to move fast. So, the answer is that a good team gives you the opportunity to balance development time and DevOps activities.
In the realm of machine learning and AI, data is king. From your experience at eXact lab, what are some best practices for ensuring the quality and integrity of data in machine learning workflows?
Metadata. Without metadata, data is just a collection of numbers or objects. Handling the metadata carefully and saving it properly, in a suitable format, lets you optimise data usage and gain more value from the data.
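One possible way to keep metadata next to the data is a small JSON "sidecar" file. The sketch below is an assumption about how this could look, not the practice used at eXact lab: the function names, the `.meta.json` suffix and the metadata fields are all hypothetical.

```python
import csv
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sidecar_path(path):
    """Hypothetical naming convention: data.csv -> data.csv.meta.json."""
    path = Path(path)
    return path.with_suffix(path.suffix + ".meta.json")


def write_with_metadata(path, rows, columns, units, source):
    """Write rows to a CSV file and describe them in a JSON sidecar file."""
    path = Path(path)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(rows)

    metadata = {
        "file": path.name,
        # The checksum lets a later reader detect silent changes to the data.
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "n_rows": len(rows),
        "columns": columns,
        "units": units,
        "source": source,
    }
    sidecar = sidecar_path(path)
    sidecar.write_text(json.dumps(metadata, indent=2))
    return sidecar


def matches_metadata(path):
    """Check that the data file still has the checksum recorded in its sidecar."""
    path = Path(path)
    metadata = json.loads(sidecar_path(path).read_text())
    return metadata["sha256"] == hashlib.sha256(path.read_bytes()).hexdigest()
```

Calling `write_with_metadata("samples.csv", rows, ["id", "concentration"], {"concentration": "mg/L"}, "lab export")` would produce `samples.csv` plus `samples.csv.meta.json`, and `matches_metadata("samples.csv")` can be run before training to confirm the file is the one that was described.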
The integration of machine learning models into web services, as you've done with FastAPI, is becoming more common. Could you share your insights on the importance of this integration and any potential pitfalls to avoid?
FastAPI is one of the most used frameworks for building REST API services. It is close to ready-made for the ML field, since a lot of work has gone into making it simple to serve ML models. It is a good choice because it is easy to configure and scale. Obviously, it does not solve all the problems. For example, you still have to monitor your model with a dedicated service, in terms of accuracy but also performance.
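The monitoring pitfall can be illustrated without any web framework. The sketch below is a deliberately small, framework-agnostic stand-in for a dedicated monitoring service: `ModelMonitor` and the threshold "model" are hypothetical, and in a real deployment the `predict` call would sit inside the request handler of whatever framework serves the model.

```python
import time
from collections import deque


class ModelMonitor:
    """Track latency and rolling accuracy over the most recent predictions."""

    def __init__(self, window=100):
        self.latencies = deque(maxlen=window)  # seconds per prediction
        self.outcomes = deque(maxlen=window)   # True when a prediction was right

    def predict(self, model, features):
        """Call the model and record how long the call took."""
        start = time.perf_counter()
        prediction = model(features)
        self.latencies.append(time.perf_counter() - start)
        return prediction

    def record_outcome(self, prediction, truth):
        """Record correctness once the true label becomes known."""
        self.outcomes.append(prediction == truth)

    def summary(self):
        """Return mean latency and accuracy, or None where nothing was recorded."""
        return {
            "mean_latency_s": (
                sum(self.latencies) / len(self.latencies) if self.latencies else None
            ),
            "accuracy": (
                sum(self.outcomes) / len(self.outcomes) if self.outcomes else None
            ),
        }


# Hypothetical stand-in for a trained model: a simple threshold rule.
def threshold_model(x):
    return x > 0.5


monitor = ModelMonitor(window=50)
for features, truth in [(0.9, True), (0.2, False), (0.7, False), (0.6, True)]:
    prediction = monitor.predict(threshold_model, features)
    monitor.record_outcome(prediction, truth)

print(monitor.summary())
```

In practice the true labels often arrive much later than the predictions, which is why recording the outcome is a separate step from making the prediction.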
Given your experience with Docker-compose and Kubernetes in creating scalable infrastructures, what would you say are the top considerations data scientists should keep in mind when they transition to a DevOps-oriented role?
Install Docker, then try to wrap your model and make it a service. When you are familiar with this flow, move to something bigger like Kubernetes and try to do the same in a K8S cluster. Once you have done that, you have seen enough to be considered a junior DevOps engineer, I think ;).
During the process, in my opinion, it is important to keep in mind that you should be able to reproduce what you are doing on every machine or environment; cloud or on-prem, it does not matter. This forces you to learn the best practices of DevOps and to drop bad practices that tie you to a specific operating system or language/library version.
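A first step towards not depending on a particular machine is simply knowing what your results were produced with. The following sketch records that information using only the standard library. The function name and the choice of fields are assumptions made for illustration, and it complements a container image rather than replacing one.

```python
import platform
from importlib import metadata


def environment_fingerprint(packages):
    """Describe the interpreter, the OS and the versions of selected packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Recording the absence is still useful when comparing machines.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "machine": platform.machine(),
        "packages": versions,
    }


# The package names below are only examples; list the ones your project uses.
print(environment_fingerprint(["numpy", "pandas"]))
```

Storing this dictionary alongside each result makes it easier to spot when two runs differ because of the environment rather than the code.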
In today’s digital age, platforms like Twitter are invaluable for professional networking and learning. Can you describe how your engagement on Twitter has influenced your career or provided opportunities for growth?
I really like talking with people on Twitter to share thoughts and ideas about tech topics. Every day I learn something new: sometimes it is a new tool that simplifies a task, and sometimes it is a tip or trick to make the code faster. So far it has not changed my career; maybe one day, who knows ;)
Open-source contributions on GitHub are a testament to a developer's work and collaboration in the community. Could you highlight one of your open-source projects on GitHub that you're particularly proud of and the impact it has had on the community or a specific audience?
I love open source. I started my journey thanks to open-source projects and tech communities, and now it is time to give something back. I have several open-source projects on GitHub and I would be more than happy if other devs wanted to join me in the development. One of the most appreciated at the moment is the mojo-is-awesome repository, a collection of public resources for getting started with the Mojo programming language, a very promising language for AI, together with benchmarks that evaluate it against Python. I also want to mention two other projects, pytorch-in-public and docker-in-public, in which I share my knowledge and experience in these fields through ML algorithms written from scratch and recipes for Docker containers.
Finally, in the ever-evolving landscape of AI and machine learning, continuous learning is essential. How do you keep your skills up-to-date and what resources do you often turn to?
Continuous learning is one of the most important aspects for a developer. I try to stay up to date by reading news, and Twitter is a good source of it. Once I find something interesting, I have a look at the publication and/or the repository, if it is public, or at a good blog post. It is not easy to find good resources today because of the abnormal amount of “flash” posts which do not give you any value. In general, Hacker News is a good starting point.