My Journey to Becoming a Data Scientist: Last 12 Months Update

What have I done since my first steps post? What am I planning to do next?

Nov 7, 2024

Backstory

To put it briefly, back in October 2023 I decided to write a letter to my future self and share my process on Medium. In that article, I described what I had done as a data science candidate in my first 5 months and said I would write about my progress every 5 months. A full year has passed since then, so I couldn’t keep that promise, but here we are with a longer update. I value experiences, and I believe other people value them too, so I think it is time to share what I have done in the last twelve months before it fades from my memory.

First of all, did I become a decent data scientist? I think so. Did I gain knowledge in every corner of data science? I don’t think this is possible. Let’s dive into my last twelve months' progress now.

November 2023

I was still in the four-month data science bootcamp that had started in September. While listening to lessons and experiences from our mentors, I was creating notebooks on different areas of data science such as A/B testing, recommendation systems, CRM analytics, and market basket analysis.

I also created several notebooks on Kaggle, independently of the bootcamp, using the well-known datasets that every beginner touches at least once to get a feel for different types of data, such as House Price Prediction, Titanic, and Wine Quality. These notebooks taught me how to put together a well-organized notebook with visual touches like careful markdown cell management.

December 2023

It was the end of the bootcamp program. We formed a team with Burhan Yıldız and Huseyin Baytar and completed the bootcamp with a great project, a Model-Based Carbon Footprint Calculator. It surprises me, but this Streamlit app still gets 5–10 visitors a day and has had over 1k visitors since launch, so I assume the topic matters to a lot of people. This was an intense month for me; after gaining experience with a large number of datasets, I felt ready for future challenges.

I also bought an annual Datacamp membership while there was a decent discount. The platform has many lessons, and I expected them to give me more insight into how I could become a better data scientist.

January 2024

Since the bootcamp was over, I had time to grind through Datacamp lessons and keep gaining experience on datasets with different scopes, so I immediately enrolled in the MLOps track. Then one day my bootcamp mentor Akif called and asked if I could assist him in shaping future data scientists on the platform I had graduated from. Of course, I gladly accepted.

Also this month, as the WeBears team, we decided to join a country-wide competition organized by Iyzico, one of Turkey’s best-known payment gateways. With six months of data manipulation experience behind us, we felt ready for a competition like that. The task was to accurately predict customers’ transaction volumes over the next three months and to find out which customers would churn. My team and I developed various supervised learning and time series models. The competition started at the end of the month, so the details are in the next section.

February 2024

The competition was fiercely contested, with 470 entrants. Since it combined churn modeling and time series forecasting, we used CatBoost to detect which customers would churn in the next three months. Our success rate there was around 93%, which helped us a lot.

On the other hand, we had to predict the next three months' payment counts for the customers who didn’t churn, and for that we had to use time series models. We started with several statistical models such as ETS, ARIMA, SARIMA, and Prophet (which was the worst one), but for a while we couldn't enter the top 10. Then, probably because I was reading every Medium article about time series analysis I could find, an article about AutoGluon showed up in my daily suggestions. I read it: this AutoML framework was developed by Amazon to forecast product-based time series, and I got an idea. Why shouldn’t our customers be products too? I trained an AutoGluon model based on that idea, and guess what, it predicted so well that we jumped into the top 10 on the public leaderboard.
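
To make the “customers as products” idea a bit more concrete, here is a minimal sketch of the reshaping it implies, in plain Python. Multi-series forecasting tools generally expect long-format data with one row per series and timestamp, and as far as I know AutoGluon's time series module names these `item_id` and `timestamp`. The `target` column name, the customer IDs, the counts, and the start month below are made-up assumptions for illustration, not the competition data or our actual pipeline.

```python
from datetime import date


def to_long_format(monthly_counts, start_year, start_month):
    """Turn {customer_id: [count_month_1, count_month_2, ...]} into long rows.

    Each customer becomes one "item" (one time series), exactly as a
    product would be in a product-level forecasting setup.
    """
    rows = []
    for customer_id, counts in sorted(monthly_counts.items()):
        for offset, count in enumerate(counts):
            # Walk forward month by month, rolling over into the next year.
            months = (start_month - 1) + offset
            ts = date(start_year + months // 12, months % 12 + 1, 1)
            rows.append(
                {
                    "item_id": customer_id,       # series identifier
                    "timestamp": ts.isoformat(),  # first day of the month
                    "target": count,              # value to forecast
                }
            )
    return rows


# Hypothetical payment counts for two customers over three months.
history = {"cust_A": [12, 15, 9], "cust_B": [3, 0, 4]}
long_rows = to_long_format(history, start_year=2023, start_month=11)

for row in long_rows:
    print(row)
```

Once the data is in this shape, every customer is just another series for the forecaster, which is why a tool built for product demand could be pointed at customer payment counts.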

We kept our spot on the public leaderboard until the end of the competition by retraining the models with better hyperparameters and similar tweaks. When the competition ended on 11 February, we finished in 3rd place. This was our first serious competition, and we made the top 3 with only 6 months of data science experience. You can imagine how exciting this was for us.

At the end of the month, we were invited to Iyzico’s office to present our solution and receive our reward. Presenting what we had done with their data was a priceless experience. I also had my first job interview, for a Junior Data Scientist role at a Turkish company, with the help of my network. It went very well; we had two more meetings after the first one and both also went well. Since all the meetings took place in February, the outcome comes up in a later month.

March 2024

The mentor assistant role at Miuul (the bootcamp platform) was going very well. Shaping new data scientist candidates was also shaping me into a more skilled data scientist; I believe the easiest way to learn something is to teach it to someone else. By the start of this month, I had also completed my first Datacamp track, MLOps, along with many more lessons related to PyTorch, and I moved straight on to learning about cloud providers like AWS and Azure.

In the middle of the month, there was another country-wide data competition, and of course we joined as the WeBears team. We had made the top 3 once, so why not again? This competition was organized by Turkey’s largest insurance company and had 836 entrants, almost double the previous one. The task was to predict which of 8 insurance packages a customer would buy next month, so we had to build a multi-class model and predict the most likely outcome.

It wasn’t easy. The data was huge and heavily corrupted, and we learned many new skills while cleaning and organizing it. But we made a huge mistake that left us ranked 25th. We should have used a group-aware splitter such as StratifiedGroupKFold (grouped by customer) instead of plain StratifiedKFold. The data had a month column that we ignored, which meant the same customer could appear in every month, so rows from one customer ended up in both the training and validation folds and our validation scores were inflated by data leakage. It was an expensive lesson, but it helped us see where our skills really stood.
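
The leakage is easier to see on a toy example. Below is a minimal, library-free sketch of the idea, not our competition code: it compares a split that ignores customers with one that keeps all rows of a customer in the same fold. The customer IDs and fold counts are made up, and the helper functions are hypothetical. In scikit-learn the group-aware splitter with class balancing is `StratifiedGroupKFold`; this sketch deliberately skips the stratification part and only shows the grouping.

```python
from collections import defaultdict


def chunk_folds(n_rows, n_splits):
    """Naive split: cut the rows into contiguous chunks, ignoring customers."""
    return [i * n_splits // n_rows for i in range(n_rows)]


def group_folds(groups, n_splits):
    """Group-aware split: all rows of one customer land in the same fold."""
    sizes = defaultdict(int)
    for g in groups:
        sizes[g] += 1
    fold_load = [0] * n_splits
    fold_of_group = {}
    # Greedy balancing: place the largest groups first into the emptiest fold.
    for g, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        target = fold_load.index(min(fold_load))
        fold_of_group[g] = target
        fold_load[target] += size
    return [fold_of_group[g] for g in groups]


def leaked_customers(groups, folds, val_fold):
    """Customers that appear in both the training and the validation rows."""
    train = {g for g, f in zip(groups, folds) if f != val_fold}
    val = {g for g, f in zip(groups, folds) if f == val_fold}
    return train & val


# Hypothetical data: 6 customers, each appearing once in 3 monthly snapshots.
customers = [c for _month in range(3) for c in ["c1", "c2", "c3", "c4", "c5", "c6"]]

naive = chunk_folds(len(customers), n_splits=3)
grouped = group_folds(customers, n_splits=3)

print("leaked with naive split:  ", sorted(leaked_customers(customers, naive, 0)))
print("leaked with grouped split:", sorted(leaked_customers(customers, grouped, 0)))
```

With the naive split, every customer in the validation fold has also been seen in training, so the model is partly being scored on customers it already knows. With the grouped split, the overlap is empty.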

April 2024

This month started with another competition, this time organized by Haier Europe, the Europe-based domestic appliance company. I know you might say, a competition again? But believe me, deadlines and out-of-scope tasks are a good way to learn things faster, at least for me.

With 188 entrants, this competition was smaller than the others. I suppose that was due to weak organization, and calling it a “hackathon” instead of a datathon might be another reason. The goal was to forecast product demand over the next 3 months and over the next 12 weeks using two separate models. The data was largely unstructured and irregular, so we did detailed data processing before developing our time series models.

Here our experience from the Iyzico competition helped a lot. As you remember, when there are products to forecast, nothing had worked better for us than an AutoGluon model. With that knowledge we started the competition in 1st place and finished in 1st place.

Haier Europe Competition Result (Image by Author)

After this competition was completed, I had to take a 25-day break for my mandatory military service, and I don’t think this is the right place to share military memories :)

May 2024

I returned from military service on May 17th. While going through my inbox, which was full of Medium articles waiting to be read, I also found an e-mail with the result of my first job interview back in February. I was politely rejected, even though the interviews had been quite pleasant. Sad, but getting a job on the first try would have been too easy, wouldn't it?

It was my second season in my volunteer Data Science Assistant role. My first batch of candidates had graduated just before I enlisted, and new ones signed up this month. This time I had more experience to help them advance, and I prepared plenty of thoughtful presentations and side lessons to shape both them and myself further in the data science field. This month we also received an invitation from Haier Europe to present our solution to the management team on June 10th.

June 2024

There wasn’t any competition that matched my skill set this month, so I kept enrolling in more and more data science lessons on different platforms like Udemy and Datacamp.

The Haier Europe presentation also went well. We kept our first place, got supply chain insights from the team, and did some networking, but no job, of course :)

Haier Europe Presentation (Image by Author)

July 2024

July started with a new challenge. One of Teknofest’s largest and most diverse competitions was starting, and one of its categories was Natural Language Processing. We were expected to extract sentiment and brand names from Turkish customer reviews. This was totally outside my scope, which is good.

I spent my first few weeks learning the methods and libraries used in natural language processing, such as spaCy, NLTK, and, most importantly, transformers. The direction of the project was more or less set in my mind. One of the competition rules was “Don’t use an already created model or project, build your own,” so we avoided fine-tuning a BERT model and instead trained our own transformer-encoder model from scratch with my freshly learned PyTorch skills. To be honest, it wasn’t easy at first, but that didn’t stop me from watching lessons and reading about how to do it, and within two weeks I had learned enough to build it. The competition presentation was in the first week of August, so we spent the rest of the month making our model more robust and more accurate by training it on newly labeled datasets. We also shared the dataset on Hugging Face.

August 2024

The day of the competition arrived. We spent two days in a large complex with 350 other competitors, hackathon-style. After completing the final tests of the project without sleep, we gave our presentation the next morning. It was a very different experience; I had never had to sleep on bean bags at a competition before.

Teknofest NLP Hackathon (Image by Author)

At the end of the competition and presentations, we saw that most of our competitors had used a BERT model fine-tuned for this specific task. We believe that was against the rules, but the jury didn’t mind it as much as we did. We ended up ranked 18th out of 88 teams with our from-scratch transformer model. At least we gained a lot of experience.

This competition awakened my interest in large language models. I decided that I had matured in the data science process and started to educate myself on topics such as orchestration, fine-tuning, and API usage of large language models.

September 2024

It was time for another competition, this time organized by the Information Technologies and Communication Authority (which regulates and supervises the telecommunications sector in Turkey), the Turkish Entrepreneurship Foundation, and Google. With 1,065 entrants, it was the biggest competition we had participated in so far. Even though we had just completed our first year in data science, some of our competitors had been working in the field for more than five years, so it wasn’t going to be an easy one.

The goal was to develop a predictive model that estimates entrepreneurship scores from personal data in submitted forms, so it was a regression problem. We put a lot of work into feature engineering and into deciding which additional data to bring into the existing dataset; because it was real-life data, there was room for many new features.

The competition lasted only 7 days. We finished the public leaderboard in 10th place and jumped to 5th when the private leaderboard was published. The steps we followed are detailed in our Kaggle notebook.

After the competition ended, the notebooks were examined and presentations were made. After the presentations, the jury reordered the top 3, placed us fourth, and deemed us worthy of the jury special award, though we still don’t know what it is.

While volunteering as an assistant for new data science candidates, I had to start earning income from my knowledge somehow. Since I hadn’t landed a job yet, I opened an Upwork profile and started filling it with projects based on the experience I had gained.

I enrolled in several LLM-related certifications this month, for example Generative AI with Large Language Models by DeepLearning.AI on Coursera, and Generative AI & Prompt Engineer by Miuul (the platform where I volunteer as an assistant).

October 2024

Then I got an idea: I have learned new skills and gained a lot of experience, so why don’t I write about it on Medium? This month I started writing about how I build projects. My first two articles might look like a beginner's, but I believe I will get better step by step. They are about LLM orchestration and several of its use cases.

I created a Retrieval-Augmented Generation (RAG)-based AI chatbot that can query PDF documents using the OpenAI API integrated with LangChain. With LangGraph and FAISS, it runs vector-based queries over PDF files, allowing natural language interaction with the data. I wrote an article about how I did it: How to Create a RAG-based PDF Chatbot with LangChain
I developed an AI-powered multi-agent essay writing system using CrewAI and LangChain. I integrated web scraping and summarization tools to gather real-time data from the web, allowing agents to autonomously generate essays based on user prompts. I also explained how crews work and how they integrate with LangChain in this article: Building Autonomous Multi-Agent Systems with CrewAI
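
For readers new to RAG, the retrieval step in the first project boils down to "find the text chunks most similar to the question, then hand them to the model as context". Here is a minimal, library-free sketch of that step. It is a stand-in, not the project code: word counts replace a real embedding model, a sorted list replaces a FAISS index, and the chunks and question are made up.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical chunks, as if extracted from a PDF.
chunks = [
    "the warranty period is two years",
    "returns are accepted within thirty days",
    "shipping takes five business days",
]
question = "how long is the warranty period"

top = retrieve(question, chunks)
# The retrieved text is placed into the prompt that is sent to the LLM.
prompt = f"Answer using only this context:\n{top[0]}\n\nQuestion: {question}"
print(prompt)
```

A real pipeline swaps in learned embeddings and an approximate nearest-neighbor index so that retrieval works on meaning rather than exact word overlap, but the flow stays the same.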

We also entered another competition at the end of this month. It was our first competition where we had to showcase our LLM orchestration skills, by developing an application that could help the education field. The competition isn’t over yet, but we believe in our project, and I will share an article about the steps we followed once it is completed.

Future goals

One of the things I noticed during my time as a data science assistant was that participants had difficulty getting their models live behind a Streamlit interface, so I will start a series of articles explaining how to take various machine learning and deep learning projects from scratch to live. I believe they will be useful guides for students with project assignments and for bootcamp participants on the steps they need to follow.

I will continue my job search and since I have prepared my Upwork profile, I will start applying for various freelance jobs starting this month. I need to earn income somehow now.

I have worked with large language models such as Gemini and OpenAI’s models, but I haven’t yet developed a project with open-source models. My next goal is to build various projects using the power of these models and gain experience with open-source models.

I can say that local competitions no longer challenge me enough to develop my skills, so I plan to start participating in big prize competitions, since it is time for me to reach the Master tier on Kaggle. In the coming period, I may share articles about what I learn from these big competitions. And I definitely plan to keep my GitHub and Kaggle profiles active during this process. You can reach them through the links below.

Kaggle: Mesut Duman | Expert (www.kaggle.com)

GitHub: mesutdmn (github.com)

I am happy that writing this article has taken some pressure off me, since back in October 2023 I promised to report on my progress. I hope I was able to share valuable experiences with you. If you have any questions, please leave a comment.