This project is the final project of CIS545 Big Data Analytics at the University of Pennsylvania.

In this project, we aim to predict the probability of default using data from HomeLoan. The full dataset consists of three parts:

- `application_train.csv`: information about the applicant at the time of application
- `bureau.csv` and `bureau_balance.csv`: past data from the credit bureau
- past payments, such as spending history and credit card balance history

I primarily worked on `bureau.csv` and `bureau_balance.csv`. I first did EDA (Exploratory Data Analysis) and some visualizations on the two datasets using Pandas and Seaborn. Based on the results, I did feature engineering, such as one-hot encoding and imputation, on selected features, and merged them into the training dataset. Then, I ran PCA (Principal Component Analysis) on the dataset, checked the explained variance ratio, and chose the number of components at the point where the explained-variance curve flattens out. Finally, I defined a pipeline for training a logistic regression model using Scikit-learn, ran GridSearch on it, and chose the best set of hyperparameters. The trained model outputs a predicted score based on the features derived from the two bureau datasets.
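The steps described above (imputation, one-hot encoding, PCA, logistic regression, grid search) can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is synthetic, the column names (`credit_sum`, `days_overdue`, `credit_type`) are hypothetical, and the median imputation, ROC AUC scoring, and parameter grid are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the merged bureau features (hypothetical columns).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "credit_sum": rng.normal(size=n),
    "days_overdue": rng.normal(size=n),
    "credit_type": rng.choice(["card", "loan", "mortgage"], size=n),
})
# Knock out some values so the imputation step has something to do.
X.loc[rng.choice(n, size=40, replace=False), "credit_sum"] = np.nan
y = (X["days_overdue"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

numeric = ["credit_sum", "days_overdue"]
categorical = ["credit_type"]

# Impute and scale numeric columns; one-hot encode categorical columns.
preprocess = ColumnTransformer(
    [
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0.0,  # always return a dense array for PCA
)

# Cumulative explained variance: look for where this curve flattens out.
Xt = preprocess.fit_transform(X)
cumulative = np.cumsum(PCA().fit(Xt).explained_variance_ratio_)
print(cumulative)

# Full pipeline, tuned with a grid search over PCA size and regularization.
pipe = Pipeline([
    ("preprocess", preprocess),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "pca__n_components": [2, 3, 4],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)

# Predicted probability of the positive class (default).
scores = search.predict_proba(X)[:, 1]
print(search.best_params_)
```

Putting PCA inside the pipeline lets the grid search treat the number of components as one more hyperparameter, which can serve as a cross-check on the value read off the explained-variance curve.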

The predicted score from my model was combined with the two outputs from my teammates' models using a combo model, which is essentially another logistic regression model.
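A combo model of this kind is a simple form of stacking: the three base-model scores become the input features of a second logistic regression. The sketch below uses synthetic scores and labels purely for illustration; the way the scores are generated and the train/test split are assumptions, not the project's actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 500
y = rng.integers(0, 2, size=n)  # synthetic default labels

# Three hypothetical base-model scores, each weakly correlated with the label.
base_scores = np.column_stack([
    np.clip(0.5 + 0.3 * (y - 0.5) + rng.normal(scale=0.2, size=n), 0, 1)
    for _ in range(3)
])

S_train, S_test, y_train, y_test = train_test_split(
    base_scores, y, test_size=0.3, random_state=0, stratify=y
)

# The combo model: logistic regression on top of the base-model scores.
combo = LogisticRegression()
combo.fit(S_train, y_train)

final_scores = combo.predict_proba(S_test)[:, 1]
print(combo.coef_)  # learned weight for each base model's score
```

In a setup like this, the base-model scores fed to the combo model are usually produced on data the base models were not trained on, so the second-stage model does not learn from overfit scores.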

The Jupyter Notebook of this project can be found in this GitHub repository.