Introduction (please read carefully)

For some of the following tasks you will need to analyze a data set. More information regarding this dataset can be found online here: https://www.kaggle.com/fedesoriano/company-bankruptcy-prediction

Please use the uploaded data (bankruptcy.csv) and do not download the data from online sources, since the data has already been prepared for you.

Practice #1 Tasks

Part 1: Theory

y = f(x) + ε

where y is the observed target variable, f(x) is a model explaining y, and ε is the error.

Part 2: Practical part

Part 2A: Pre-processing

Download the data set "bankruptcy.csv" from the course page on CampusNet or from the myebs Dropbox. The target variable (y) in this dataset is a binary variable indicating whether a company has gone bankrupt. All other variables in the data set can be used for prediction.

c) (5 points) Plot the distribution of your target variable y (Bankruptcy) and also inspect the correlation matrix of all other variables in your data. Explain some problems that could arise with this type of data and their consequences, e.g. class imbalance, correlation between predictors. (A minimal plotting sketch follows this list.)

d) (10 points) K-means I: Using only the numpy package, write a function (or several functions) that performs the K-means clustering algorithm on some general input data x. (A sketch of the algorithm follows this list.)

e) (10 points) K-means II: There is no explicit variable in the data set which describes the sector a company operates in. You are, however, given many other variables which can be used to group similar companies into clusters. Use economic intuition and choose 6 variables from the data set that can be used to separate companies into sectors. Use K-means clustering (your code from question d)) and these 6 variables to group the companies into 10 different sectors.

f) (8 points) Decision tree I: Using only the numpy package, write a Python function that recreates the first two steps/splits of a decision tree. The function should accept as arguments some explanatory data x and target data y and return the optimal feature to split on and the optimal value of that feature to split at. (A split-search sketch follows this list.)

g) (6 points) Decision tree II: Use the function from question f) to find the first 2 splitting features and their respective splitting values of the bankruptcy data set. To keep things simpler, only consider a small subset of the data with the four variables: Interest Coverage Ratio, Current Ratio, Interest-bearing debt interest rate, Cash flow rate.

h) (5 points) Training: If you were to train a model to predict your target variable (Bankruptcy), which cross-validation method is the best choice given the characteristics of this unbalanced data set (e.g. Group K-Fold), and which scoring/evaluation function would be the best choice (e.g. ROC, mean squared error, etc.)? Explain why.
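For question c), a minimal sketch of the two inspections could look like the following. It assumes the prepared file keeps the name bankruptcy.csv and that the target column is called "Bankrupt?" as in the Kaggle original; check the actual column name in your copy and adjust accordingly.

```python
# Hedged sketch for question c): class distribution of the target and a
# correlation heatmap of the remaining variables. The column name
# "Bankrupt?" is an assumption taken from the Kaggle description.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("bankruptcy.csv")

# Bar plot of the two classes; with bankruptcy data expect a strong imbalance.
df["Bankrupt?"].value_counts().plot(kind="bar", title="Class distribution of y")
plt.show()

# Correlation matrix of the predictors, drawn as a heatmap.
corr = df.drop(columns=["Bankrupt?"]).corr()
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="correlation")
plt.title("Correlation matrix of the predictors")
plt.show()
```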
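For questions d) and e), one standard way to implement K-means with nothing but numpy is Lloyd's algorithm: alternate between assigning each observation to its nearest centroid and recomputing each centroid as the mean of its cluster. The function name and keyword arguments below are my own choices, not something prescribed by the assignment; x is assumed to be a 2-D numpy array.

```python
# Minimal NumPy-only K-means sketch (Lloyd's algorithm) for question d).
import numpy as np

def kmeans(x, n_clusters, max_iter=100, seed=0):
    """Cluster the rows of x into n_clusters groups; return labels, centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking random observations without replacement.
    centroids = x[rng.choice(len(x), size=n_clusters, replace=False)]
    for _ in range(max_iter):
        # Assignment step: Euclidean distance of every point to every centroid.
        dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: new centroid = mean of the points assigned to it
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            x[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

For question e) you would standardize the 6 chosen variables first (K-means is scale-sensitive) and call the function with n_clusters=10.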
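For questions f) and g), the core of a classification tree is an exhaustive search over features and candidate thresholds, scored by a purity criterion. The assignment only fixes the interface (x and y in, best feature and split value out); the use of Gini impurity and the helper names below are assumptions. Calling best_split on the full subset gives the first split; calling it again on one of the two resulting child subsets gives the second.

```python
# Hedged sketch of the split search behind question f): brute-force search
# over features and thresholds, minimising the weighted Gini impurity.
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Return (feature index, threshold) of the impurity-minimising split."""
    n, p = x.shape
    best = (None, None, np.inf)
    for j in range(p):
        for value in np.unique(x[:, j]):
            left, right = y[x[:, j] <= value], y[x[:, j] > value]
            if len(left) == 0 or len(right) == 0:
                continue
            # Weighted impurity of the two child nodes.
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, value, score)
    return best[0], best[1]
```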
Part 2B: Prediction

You are allowed to use the scikit-learn package (or any packages of your choice) to answer the following questions.

j) (3 points) Use the cross-validation method you have chosen in Part 2A h) to generate train and validation splits of the data set (5 folds are enough).

k) (10 points) For every split of the data, train a LogisticRegression on the training data and compute the in-sample error as well as the out-of-sample prediction error on the validation part of the data. Use the scoring function you have chosen in Part 2A h) to calculate the prediction error. Average the in-sample error and the out-of-sample error over all 5 folds to obtain a single in-sample and out-of-sample error. Please make use of the scikit-learn LogisticRegression class. (You can use all available x variables to train your model, but it often makes sense to reduce the number of variables before training.) A cross-validation sketch follows this list.

l) (10 points) Two important parameters of the LogisticRegression class are the regularization strength C and the penalty function (l1, l2, ...). More information can be found in the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html. Redo question k) with different values of C. Create a plot with the values of C you have tried on the x-axis and the corresponding prediction errors on the y-axis. Plot both the in-sample error and the out-of-sample error for every C. (A sweep sketch follows this list.)

m) (3 points) Discuss your results from l) in light of the bias-variance trade-off. Which C would you choose and why?
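For questions j) and k), a sketch along the following lines would work, assuming the choices made in Part 2A h) were a stratified K-fold split and ROC AUC as the scoring function; substitute your own choices. X and y are assumed to be numpy arrays holding the predictors and the bankruptcy indicator.

```python
# Hedged sketch for questions j) and k): per-fold LogisticRegression with
# in-sample and out-of-sample ROC AUC, averaged over 5 stratified folds.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cv_scores(X, y, C=1.0, penalty="l2"):
    """Return (mean in-sample AUC, mean out-of-sample AUC) over 5 folds."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    in_sample, out_of_sample = [], []
    for train_idx, val_idx in cv.split(X, y):
        # Note: penalty="l1" needs a solver that supports it, e.g. "liblinear".
        model = LogisticRegression(C=C, penalty=penalty, max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        in_sample.append(
            roc_auc_score(y[train_idx], model.predict_proba(X[train_idx])[:, 1]))
        out_of_sample.append(
            roc_auc_score(y[val_idx], model.predict_proba(X[val_idx])[:, 1]))
    return np.mean(in_sample), np.mean(out_of_sample)
```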
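For question l), one possible sweep reuses the hypothetical cv_scores helper from the previous sketch over a log-spaced grid of C values and plots both curves; the grid bounds are an arbitrary choice.

```python
# Hedged sketch for question l): prediction quality as a function of C,
# plotted for both the in-sample and the out-of-sample average.
import numpy as np
import matplotlib.pyplot as plt

Cs = np.logspace(-3, 3, 13)
results = [cv_scores(X, y, C=C) for C in Cs]
in_sample = [r[0] for r in results]
out_of_sample = [r[1] for r in results]

plt.semilogx(Cs, in_sample, marker="o", label="in-sample")
plt.semilogx(Cs, out_of_sample, marker="o", label="out-of-sample")
plt.xlabel("Regularization strength C")
plt.ylabel("ROC AUC")
plt.legend()
plt.show()
```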
