
This repo contains projects, tasks and other code which I have developed on a Data Scientist course at SkillFactory.


helios12/DataScienceProjects


Data Science Projects

During my Data Scientist course at SkillFactory I solve tasks and work on projects. This repository presents my solutions to those tasks and projects, together with auxiliary code, for potential employers. If you are simply interested in data science, feel free to contact me with any questions.

Highlights

At the moment, the most illustrative example of my work is the project on taxi ride duration analysis, which covers data cleaning, exploratory data analysis, building a regression model and evaluating the quality of the model's predictions: Project-5.Taxi_ride_duration.ipynb

In addition, I am working on a library of reusable code for data science purposes: Data Science Helpers

Repo Structure

Projects

Link: Projects

Project 5. Taxi Ride Duration

The goal of the project is to build a regression model that predicts the duration of a taxi ride, so that the ride price can be calculated. First the data is cleaned, basic exploratory data analysis is performed, categorical features are encoded, important features are selected and the data is normalized. Then several regression models are built and compared using the Root Mean Squared Log Error metric. See: Project-5.Taxi_ride_duration.ipynb

For more details see: Project-5 -> readme
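
As a rough illustration of the comparison metric (a minimal sketch, not the notebook's actual code), the Root Mean Squared Log Error can be computed as below. The function name `rmsle` and the plain-list inputs are assumptions made for this example.

```python
import math


def rmsle(y_true, y_pred):
    """Root Mean Squared Log Error for non-negative targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    # log1p(x) = log(1 + x), so zero-valued durations are handled safely
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Hypothetical ride durations in seconds: actual vs. predicted
score = rmsle([600, 1200, 300], [660, 1000, 330])
```

Because the error is taken on a log scale, the metric penalizes relative rather than absolute deviations, which suits ride durations that span several orders of magnitude.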

Project 4. Machine Learning - Classification

The goal of the project is to build a classification model that predicts whether a customer is likely to open a deposit with the bank. This is achieved by going through the following steps of the CRISP-DM process: requirements analysis, data analysis, data preparation, building a model, model evaluation. See: Project_4.ML.ipynb

For more details see: Project-4 -> readme

Project 3. Hotel reviews analysis

The goal of the project was to perform data cleaning and exploratory data analysis, build a regression model and evaluate the quality of the model's predictions. See: Project-3.Hotel_reviews_analysis.ipynb

For more details see: Project-3 -> readme

Project 2. Analysis of the HeadHunter.ru job openings.

The goal of the project was to practise SQL on a database containing HeadHunter job openings. SQL queries were used to run various analyses that reveal patterns in the data. See: Project-2.HH_Job_openings_analysis.ipynb

For more details see: Project-2 -> readme

Project 1. Analysis of the HeadHunter.ru resumes.

The goal of the project was to perform data preparation and clean-up before building a machine learning model. The model should be able to predict a candidate's salary from the candidate's other attributes. See: Project-1.HH_Resume_analysis.ipynb

For more details see: Project-1 -> readme

Tasks

Link: Tasks

Task-9. Text Classification Using Naive Bayes Classifier

The goal of the task is to use a naive Bayes classifier to determine whether an email is spam. I used the ComplementNB classifier implementation from the sklearn library for that. See: Task-9.MATH-ML-7.Naive_bayes_classifier.ipynb

For more details see: Task-9 -> readme
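
The general approach can be sketched as follows. This is a minimal illustration rather than the notebook's code: the toy emails, the labels and the use of `CountVectorizer` for the bag-of-words features are assumptions made for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = spam, 0 = not spam
emails = [
    "win free money now",
    "claim your free prize today",
    "meeting agenda for monday",
    "please review the attached report",
]
labels = [1, 1, 0, 0]

# Word counts feed the classifier; ComplementNB requires non-negative features
model = make_pipeline(CountVectorizer(), ComplementNB())
model.fit(emails, labels)

predictions = model.predict(["free prize money", "agenda for the meeting"])
```

ComplementNB estimates its parameters from the complement of each class, which tends to help on imbalanced data sets such as spam corpora.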

Task-8. Machine Learning. Stochastic Gradient and Coordinate Descent

The goal of the task was to implement coordinate and stochastic gradient descent. See: Task-8.MATH-ML-5_optimization.ipynb

For more details see: Task-8 -> readme
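
To illustrate one of the two methods, here is a minimal sketch of coordinate descent for least-squares linear regression. It assumes plain Python lists as input and a fixed number of sweeps. The function name and the toy data are hypothetical, and stochastic gradient descent is not shown.

```python
def coordinate_descent(X, y, n_sweeps=200):
    """Minimize sum((y - Xw)^2) by updating one weight at a time."""
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(n_sweeps):
        for j in range(n_features):
            numerator = 0.0
            denominator = 0.0
            for row, target in zip(X, y):
                # Prediction using every feature except feature j
                partial = sum(w[k] * row[k] for k in range(n_features) if k != j)
                numerator += row[j] * (target - partial)
                denominator += row[j] ** 2
            if denominator > 0:
                # Closed-form minimizer of the loss along coordinate j
                w[j] = numerator / denominator
    return w


# Toy data generated from y = 1 + 2 * x; the first column is the intercept term
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
weights = coordinate_descent(X, y)
```

Each inner step solves a one-dimensional least-squares problem exactly, so no learning rate is needed, unlike in gradient-based methods.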

Task-7. Machine Learning. Linear Algebra in the Context of Linear Methods

The goal of the task was to experiment with linear regression, starting at a low level with the Ordinary Least Squares method, proceeding to the standard sklearn linear regression, then adding polynomial features, $L_1$ and $L_2$ regularization, and hyperparameter optimization for the regularization. See: Task-7.MATH-ML-2_linear_algebra.ipynb

For more details see: Task-7 -> readme
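
For reference, the textbook forms of the methods named above are given here. The notation is an assumption made for this summary: $X$ is the feature matrix, $y$ the target vector, $w$ the weight vector and $\alpha \ge 0$ the regularization strength. Library implementations may scale the terms differently.

$$
\hat{w}_{\mathrm{OLS}} = (X^\top X)^{-1} X^\top y
$$

$$
\hat{w}_{L_2} = \arg\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\hat{w}_{L_1} = \arg\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The OLS solution exists in this closed form only when $X^\top X$ is invertible. The $L_2$ penalty also has a closed form, $(X^\top X + \alpha I)^{-1} X^\top y$, while the $L_1$ penalty generally requires an iterative solver.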

Task-6. Machine Learning. Hyper Parameters Optimization

The goal of the task was to apply different hyperparameter optimization methods, namely Grid Search, Random Search and Tree-Structured Parzen Estimators, to two types of models: Logistic Regression and Random Forest. The F1-score is calculated for every applied method so that the methods can be compared with each other on a given data set. See: Task-6.ML_predicting_biological_response.ipynb

For more details see: Task-6 -> readme
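
The difference between the first two methods can be sketched in a few lines of plain Python. This is an illustration only: `toy_score` is a hypothetical stand-in for a cross-validated F1-score, the parameter grid is made up, and Tree-Structured Parzen Estimators are not covered.

```python
import itertools
import random


def grid_search(score, grid):
    """Evaluate every combination in the grid and keep the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


def random_search(score, grid, n_iter, seed=0):
    """Evaluate n_iter randomly sampled combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in grid.items()}
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


def toy_score(params):
    # Hypothetical objective that peaks at C=1.0, max_iter=200
    return -((params["C"] - 1.0) ** 2) - ((params["max_iter"] - 200) ** 2) / 1e4


grid = {"C": [0.01, 0.1, 1.0, 10.0], "max_iter": [100, 200, 500]}
grid_params, grid_score = grid_search(toy_score, grid)
random_params, random_score = random_search(toy_score, grid, n_iter=5)
```

Grid search is exhaustive, so its cost grows multiplicatively with each added parameter. Random search evaluates a fixed budget of combinations and may miss the best one.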

Task-5. Machine Learning. Feature Selection

The goal of the task was to apply RFE and filter-based methods for feature selection, identify the important features and train models on those features. Metrics are then calculated for the models to evaluate which of the feature selection methods is more efficient. See: Task-5.ML_feature_selection.ipynb

For more details see: Task-5 -> readme

Task-4. Supervised Learning: Classification.

The goal of the task was to explore three classification models (logistic regression, decision tree and random forest) on the bank customer churn data set, tune basic settings of the respective models, such as regularization type, C coefficient, max depth and max leaf objects, and identify the optimal probability thresholds. See: Task-4.Classification_Of_Bank_Customers_Churn.ipynb

For more details see: Task-4 -> readme
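
Choosing a probability threshold can be sketched as a simple sweep over candidate values. The description above does not say which metric the notebook optimizes, so the use of the F1-score here is an assumption, as are the function names and the toy data.

```python
def f1_score(y_true, y_pred):
    """F1-score for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, probabilities, thresholds):
    """Return the (threshold, F1) pair with the highest F1-score."""
    scored = [
        (th, f1_score(y_true, [int(p >= th) for p in probabilities]))
        for th in thresholds
    ]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical churn labels and predicted churn probabilities
y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.3, 0.6, 0.8]
threshold, score = best_threshold(y_true, probabilities, [0.2, 0.5, 0.7])
```

In practice the threshold should be selected on validation data rather than on the test set, otherwise the reported metric is optimistic.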

Task-3. Introduction to Comet ML.

The goal of the task was to perform basic data cleaning and exploratory data analysis, and to build a logistic regression model. Diagrams, parameters and metrics were logged to Comet ML. See: Task-3.Introduction-to-comet-ml.ipynb

For more details see: Task-3 -> readme

Task 2. Data science job analysis.

The goal of the task was to perform data cleaning and exploratory data analysis (EDA) on a data set of data science related jobs. The EDA consisted of a visual analysis and of picking and performing statistical tests. See: Task-2.Data_Science_job_analysis.ipynb

For more details see: Task-2 -> readme

Task 1. Research of bank customers churn data.

The goal of the task was to analyze the bank customers data set and, using the plotly visualization library, identify the possible reasons why customers end their relationship with the bank. See: Task-1.Research_Of_Bank_Customers_Churn.ipynb

For more details see: Task-1 -> readme

Data Science Helpers

Link: Data Science Helpers

In this folder I collect auxiliary code for data science purposes. Currently it contains functions for finding attributes with a given ratio of empty values, low-information attributes and outliers.

For more details see: Data Science Helpers -> readme
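
To give an idea of what such helpers can look like, here is a minimal sketch. It is not the library's actual API: the function names, the dict-of-lists table format with `None` as the empty value, and the 1.5 x IQR outlier rule are all assumptions made for this example. Low-information attributes are not covered.

```python
import statistics


def missing_ratio_columns(table, min_ratio):
    """Return {column: ratio} for columns whose share of None is >= min_ratio.

    Assumes every column is a non-empty list.
    """
    result = {}
    for name, values in table.items():
        ratio = sum(value is None for value in values) / len(values)
        if ratio >= min_ratio:
            result[name] = ratio
    return result


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [value for value in values if value < lower or value > upper]


# Hypothetical table: column "a" is half empty, column "b" is complete
sparse = missing_ratio_columns({"a": [1, None, None, 4], "b": [1, 2, 3, 4]}, 0.5)
outliers = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
```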

Technology Stack

In the course of my studies I have mastered:

  • python
  • SQL
  • pandas
  • sklearn
  • numpy
  • plotly express
  • scipy.stats
  • psycopg2
  • BeautifulSoup
  • git

Authors

Dmitriy Golubitskiy | LinkedIn Profile | Codewars Profile
