
Data_Science_projects

Hi! I am Louis. Passionate about data, I share here some of my most interesting projects.

Louis’s Portfolio

This repository presents all my personal projects and work related to data science. The repository is split into three sections: Analytics, Data Science, and template&images. Feel free to click on any project to view more details on GitHub, including the full code, explanations, and notebooks that walk through each step in depth! All projects except the Power BI one are done with Jupyter Notebooks. If you cannot open Jupyter Notebooks, don't worry: an HTML version of each project is included that you can open with any browser.

Analytics projects

This section contains all projects and work related to Analytics, that is, projects covering Data Analysis, Data Mining, and Data Visualization. It currently contains four distinct projects (multiplayer market analysis, Power BI training, stock market trading, and automatic identification of new doctors) plus my Hackerrank SQL solutions. No further projects are planned for now.

Project 1: Multiplayer market analysis

Topics: Survey designing, Data cleaning, Data engineering, Data processing, Data visualization, Python dashboard, Hypothesis testing, Natural Language Processing

This analysis was done as part of a much larger personal (and secret) project about the development of a mobile application. (The whole project is in French; don't worry, it's the only one.) Feel free to look in more detail at the visuals in the notebook and how they are created. The objective of this project was to gather data about the behaviour of multiplayer video game players, in particular about who they play with and how they find people to play with. It involved several technical skills used during the different steps of the project:

  1. Designing and distributing a survey
  2. Connecting to an online dataset
  3. Cleaning and preparing the dataset (Python, pandas, NumPy)
  4. Analysing each answer type, and segmenting responses using the other answers (matplotlib, seaborn)
  5. Formulating hypotheses about the data, and statistically confirming (or rejecting) them (statistics, chi-square testing)
  6. Using Natural Language Processing to analyse the questions with free-text answers (nltk, textblob)
  7. Summarizing the results from a business perspective.
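
Step 5 can be sketched in pure Python with a chi-square test of independence. The counts below are invented for illustration (they are not the survey's real data); the question is whether platform is independent of how players find teammates:

```python
# Chi-square test of independence on a 2x2 contingency table
# (illustrative counts only, not the actual survey data).
observed = [[30, 10],   # PC players:      friends, matchmaking
            [20, 40]]   # console players: friends, matchmaking

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

# Critical value for df = (2-1) * (2-1) = 1 at the 5% level
CRITICAL_5PCT_DF1 = 3.841
reject_independence = chi2 > CRITICAL_5PCT_DF1
```

If `chi2` exceeds the critical value, we reject the hypothesis that platform and teammate-finding behaviour are independent; `scipy.stats.chi2_contingency` does the same computation in one call.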

Example of playing frequency visualization

Key results: The project was a success, as it enabled me to correctly identify the potential customers for the mobile application project and their characteristics. In addition, it even provided new ideas for functionalities!

Project 2: PowerBI training

Topics: Data engineering, Data modelling, Data acquisition, Data cleaning, Data processing, ETL, Business Intelligence, Data visualization, Measure calculation, PowerBI dashboard, Artificial Intelligence

This project was done during my Power BI training. It consists of building dashboards on different datasets. Two distinct dashboards were made: the AW dashboard and the AI dashboard.

AW dashboard

This dashboard is a classical one, and relates to several tables found in the AW files folder. Those tables can be found as resources on the internet, or accessed directly in the Microsoft Power BI resources. Building the dashboard involved four main steps:

  1. Connecting to each of the sources, and formatting them
  2. Data modelling (linking the tables according to their type and their primary and foreign keys)
  3. Creating the metrics to visualize using the DAX language
  4. Visualizing those KPIs in the best way possible (exploring different plot types, and how their contexts interact)

Main interface of the dashboard

AI dashboard

This dashboard connects to a single source (Kickstarter projects and their results). The main objective here was to explore the AI features of Power BI, as shown in the report.

Main interface of the dashboard

Key results: The two Power BI dashboards resulted in clear, easy-to-use visuals that transform raw data into powerful key indicators. In addition, they enabled me to improve my Power BI skills.

Project 3: Stock market trading

Topics: Stock trading, Sentiment analysis, wallstreetbets, Algorithmic trading, Data visualization, Trading bot, MACD, Natural Language Processing, Text summarization, Transformers, Reinforcement Learning, Financial signals

This project's goal is to analyze data from different subreddits (wallstreetbets, stockexchange, …) in order to choose which stock to invest in on a given day, and then to actually invest in it at the right time, in order to make a measurable profit. (This project also involved understanding how the stock market actually works.) The main steps of the project are:

  1. Measuring the evolution of the number of times each stock is mentioned in those subreddits (Python, praw)
  2. Combining those results with the evolution of the sentiment (the positivity) of those mentions, in order to identify the stocks (AAPL, GME, …) that have a good trend from a social-media point of view (pandas, nltk)
  3. Web-scraping Google News to get articles about those stocks, summarizing them, and applying NLP to extract sentiment (Hugging Face)
  4. Adding stocks that are trending on Yahoo Finance, by parsing it (BeautifulSoup, requests, HTML)
  5. Getting live data about those stocks (yfinance)
  6. Choosing when to buy and sell the chosen stocks using a statistical approach (MACD, signal line)
  7. Automating the choosing, buying, and selling by scheduling the tasks with a bot.
  8. Trying a Reinforcement Learning model that trades the stocks based on other financial signals (SMA, OBV, RSI), and comparing it with our MACD bot
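
Step 6 can be sketched with pandas. The MACD is the gap between a fast and a slow exponential moving average, and crossovers with its own smoothed "signal line" trigger trades; the price series below is synthetic, not real market data:

```python
import numpy as np
import pandas as pd

def macd_signals(close: pd.Series, fast=12, slow=26, signal=9) -> pd.DataFrame:
    """Compute the MACD, its signal line, and crossover-based buy/sell flags."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd = ema_fast - ema_slow
    sig = macd.ewm(span=signal, adjust=False).mean()
    above = macd > sig
    out = pd.DataFrame({"close": close, "macd": macd, "signal": sig})
    out["buy"] = above & ~above.shift(fill_value=False)   # MACD crosses above signal
    out["sell"] = ~above & above.shift(fill_value=False)  # MACD crosses below signal
    return out

# Synthetic oscillating price series standing in for live yfinance data
prices = pd.Series(100 + 10 * np.sin(np.linspace(0, 6, 120)))
signals = macd_signals(prices)
```

A scheduled bot would then place an order whenever the latest row has `buy` or `sell` set.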

Sentiment analysis visualizations

Key results: The project was a success, and ended in an operational, automated, and successful trading bot! In addition, it enabled me to improve my knowledge of the trading world.

Project 4: Automatic identification of new doctors in my region

Topics: Webscraping, Automation

This project’s goal is to automatically identify new doctors in my region, in order to get an appointment (which is difficult where I live). The main steps of the project are:

  1. Extracting the names of all the doctors in my region using Ameli.fr
  2. Comparing the names with the last saved version of those names
  3. Sending email alerts automatically to me with the new doctors!
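
Step 2 boils down to a set comparison against the last saved snapshot. The sketch below uses a hypothetical local JSON cache file and invented doctor names; the actual email sending (step 3, e.g. with smtplib) is left out:

```python
import json
from pathlib import Path

SNAPSHOT = Path("doctors_snapshot.json")  # hypothetical local cache file

def find_new_doctors(current: list) -> list:
    """Return doctors present in `current` but absent from the last snapshot."""
    saved = json.loads(SNAPSHOT.read_text()) if SNAPSHOT.exists() else []
    new = sorted(set(current) - set(saved))
    SNAPSHOT.write_text(json.dumps(current))  # update the snapshot for next run
    return new

first_run = find_new_doctors(["Dr. Martin", "Dr. Bernard"])
second_run = find_new_doctors(["Dr. Martin", "Dr. Bernard", "Dr. Petit"])
```

On the second run only the newly listed practitioner comes back, which is exactly what goes into the alert email.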

Key results: Once scheduled on my machine, this will help to easily identify new practitioners that might have available slots.

Hackerrank SQL

Topics: Data engineering, SQL

This file is a simple text file, where I keep my answers to some Hackerrank SQL challenges.
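
The solutions themselves are plain SQL, and a typical HackerRank-style query (this one is a generic example, not one of my actual challenge answers) can be reproduced locally with Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (id INTEGER, name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [(1, "Tokyo", 37_000_000), (2, "Lyon", 520_000), (3, "Delhi", 31_000_000)],
)

# Names of cities with more than one million inhabitants, alphabetically
rows = conn.execute(
    "SELECT name FROM city WHERE population > 1000000 ORDER BY name"
).fetchall()
```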

Shout out

I would like to thank:

Data science projects

This section contains all projects and work related to Data Science, that is, projects covering the following topics: Web scraping, Data exploration, Machine Learning, Deep Learning, and Natural Language Processing. It currently contains four distinct projects: housing price prediction, tennis sentiment analysis, computer vision and image classification using ML and Deep Learning, and weather forecasting in Melbourne. A further project is still work in progress and will soon be added: computer vision (face mask detection) using deep learning.

Project 4: Housing prediction regression (EDA, ML)

Topics: Data cleaning, Exploratory Data Analysis, Data Mining, Pipelines, Python, Feature Engineering, Machine Learning, Supervised regression, Parameter optimization, Ensemble Learning, Dashboard for Machine Learning

This project consists of the analysis and prediction of housing prices (a regression problem). The dataset is a very famous one in the data science community, and the link can also be found directly in the Python notebook. The project consists of several steps:

  1. Importing the data
  2. Exploratory Data Analysis (using pandas, klib, pandas-profiling)
  3. Feature Engineering (creation of the pipelines)
  4. Training of Machine Learning models (sklearn, xgboost)
  5. Fine-tuning those models (grid search, randomized search)
  6. Combining them into one even more performant model using ensemble learning.
  7. Visualizing the results and the performances of the model in a robust dashboard

EDA of the housing prices in the Californian region

Key insights about the target metric using klib

Example of a decision tree applied to our dataset

Key results: The project was successful, and provides a powerful Machine Learning model that can predict the price of a house with high accuracy (40K), given specific features.

Project 5: Tennis sentiment analysis (Web scraping, Twitter scraping, NLP)

Topics: Web scraping, Data visualization, Python, Social network scraping, Natural Language Processing, Sentiment Analysis

This project's goal is to compare the global popularity of the most famous tennis players (tennis being one of my passions) on Twitter. The project consists of several steps:

  1. Web-scraping Wikipedia in order to get the best players of the moment (Python, pandas, BeautifulSoup, HTML)
  2. Creating their Twitter handles using basic string methods
  3. Connecting to Twitter and retrieving tweets about the players (tweepy)
  4. Analysing the content of those tweets, and computing their polarity (how positive they are) using Natural Language Processing (nltk, textblob)
  5. Plotting their popularity as a rolling moving average
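
Steps 4 and 5 can be sketched without the Twitter API. The tiny hand-made lexicon below stands in for textblob's polarity scoring, and the tweets are invented examples:

```python
import pandas as pd

# Toy sentiment lexicon standing in for textblob's polarity model
LEXICON = {"great": 1.0, "love": 0.8, "win": 0.5, "boring": -0.7, "bad": -0.8}

def polarity(text: str) -> float:
    """Average lexicon score of the known words; 0.0 if none are known."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

tweets = pd.Series(["What a great match", "boring game", "Love to see him win"])
pol = tweets.map(polarity)
popularity = pol.rolling(window=2, min_periods=1).mean()  # rolling moving average
```

Plotting `popularity` over time (with a week-long window instead of 2 tweets) gives the trend curves shown below.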

Rolling moving average of Novak Djokovic's popularity on Twitter

Key results: The project was successful, and can be used to accurately measure the popularity of a given tennis player on Twitter over the previous week. You can find a nice presentation of the project and its complete results on GitHub.

Project 6: Computer Vision and Image Classification using Machine Learning and Generative Adversarial Networks

Topics: Computer Vision, Data cleaning, Dimensionality reduction, Clustering, Machine Learning, Ensemble Learning, Deep Learning, Generative Adversarial Networks, Stacking

This project's goal is to classify handwritten pictures using Machine Learning and Deep Learning techniques. Then, in a second phase, we try to generate new pictures using Generative Adversarial Networks. The main steps of the project are:

  1. Importing the data, and visualizing the pictures of the dataset, in order to understand what we are dealing with
  2. Preprocessing the data, using unsupervised learning (PCA, clustering)
  3. Cleaning the images to improve performance
  4. Trying out Machine Learning and Deep Learning models (Logistic regression, SVM, KNN, RandomForest, XGBoost, MultiLayer Perceptron)
  5. Measuring their accuracy, and trying to improve it by fine-tuning their hyperparameters
  6. Applying Ensemble Learning (Bagging, Stacking) to the best-performing models
  7. Data augmentation (improving our dataset) by creating new images from the ones we already have, from simple rotations and translations to GANs.
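
Steps 2 and 4 can be sketched on scikit-learn's small bundled digits dataset (an 8x8 stand-in for the notebook's actual images), with PCA compressing the pixels before an SVM classifies them:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # 1797 handwritten digits, 8x8 pixels each
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale, reduce 64 pixels to 30 PCA components, then classify with an SVM
clf = make_pipeline(StandardScaler(), PCA(n_components=30), SVC())
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

The PCA step is what keeps inference fast: the classifier works in 30 dimensions instead of the raw pixel space.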

PCA training, final confusion matrix, and a GIF of the GAN generating pictures

Key results: The project was successful, and ended in a complex Machine Learning model able to accurately (97%) classify handwritten pictures. It is fast (thanks to PCA) and can compete with a tested deep learning model. In addition, a Generative Adversarial Network was also trained to create handwritten pictures on its own.

Project 7: Weather forecasting in Melbourne

Topics: Time series, Data visualization, Facebook Prophet, Forecasting, Python

This project consists of the analysis and prediction of temperature in the city of Melbourne, using a specific time-series model (Facebook Prophet). The dataset can be found on Kaggle (link in the notebook).

  1. Importing the data
  2. Exploratory Data Analysis (using pandas)
  3. Creation of the final dataset (dealing with time types, filling missing values)
  4. Training of a Time series model (Facebook Prophet)
  5. Visualizing the results, and looking at the trends in the data and the forecasted data (climate change)
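
Step 3 can be sketched with pandas; the dates and temperatures below are invented, not the Kaggle data. The idea is to convert strings to real datetimes, reindex to a regular daily frequency, and fill the gaps before handing the series to Prophet:

```python
import pandas as pd

raw = pd.DataFrame({
    "date": ["1981-01-01", "1981-01-02", "1981-01-04"],  # Jan 3rd is missing
    "temp": [20.7, 17.9, 15.8],
})
raw["date"] = pd.to_datetime(raw["date"])       # string -> datetime dtype
ts = raw.set_index("date")["temp"].asfreq("D")  # reindex to daily frequency
ts = ts.interpolate()                           # fill the gap linearly
```

Prophet then expects this as a two-column frame (`ds`, `y`), which is one `reset_index().rename(...)` away.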


Key results: The project was successful, and provides a fast and powerful model, that can predict the temperature in Melbourne for a chosen number of days.


template&images

This section contains a data science project template, with the important steps and imports already included, plus the images used in this GitHub repository. Feel free to use it!