Analytics and Machine Learning Case Solution
- [10 points] You have hired a data analyst to help your bank build a classification model for detecting which of your customers are likely to apply for a personal loan. Based on a large sample of past customers who applied/did not apply, the analyst plans to build a classification model with the goal of helping you figure out which customers are most likely to apply for a loan. [10 points] The analyst ends up building a classification model that he feels should do a good job of prediction. He applies the model to a test set of 15 customers and gets predicted probabilities of applying for a loan from his model as follows:
Case | Actual Class | Predicted Probability |
1 | Apply | 0.63 |
2 | Not apply | 0.7 |
3 | Not apply | 0.52 |
4 | Not apply | 0.12 |
5 | Apply | 0.65 |
6 | Apply | 0.76 |
7 | Not apply | 0.2 |
8 | Not apply | 0.06 |
9 | Not apply | 0.9 |
10 | Apply | 0.51 |
11 | Not apply | 0.32 |
12 | Not apply | 0.53 |
13 | Not apply | 0.49 |
14 | Not apply | 0.62 |
15 | Not apply | 0.44 |
Suppose he decides to use a threshold value of 0.55 to classify a person as someone who will apply for the loan. Based on the test data, what would be the values of:
Answer:
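Classifying a case as "Apply" when its predicted probability is at least 0.55 gives 6 predicted positives (cases 1, 2, 5, 6, 9 and 14) and 9 predicted negatives.
T.P = 3 (cases 1, 5, 6), F.P = 3 (cases 2, 9, 14), F.N = 1 (case 10), T.N = 8 (cases 3, 4, 7, 8, 11, 12, 13, 15)
- The true positive rate
TPR = TP/(TP+FN) = 3/(3+1)
TPR = 0.75
- The false positive rate
FPR = FP/(FP+TN) = 3/(3+8)
FPR = 0.273
- The positive predictive value
PPV = TP/(TP+FP) = 3/(3+3)
PPV = 0.5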
You must show all your work that helped you arrive at these numbers. Zero credit will be given otherwise.
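A short Python check of the threshold calculation: it tallies the four confusion-matrix cells from the table of 15 test cases and derives the three rates from them. It assumes a case is classified as "Apply" when its predicted probability is greater than or equal to 0.55; no probability in the table equals 0.55, so choosing >= rather than > does not affect the counts.

```python
# Test set from the table: (actual class, predicted probability of applying)
cases = [
    ("Apply", 0.63), ("Not apply", 0.70), ("Not apply", 0.52),
    ("Not apply", 0.12), ("Apply", 0.65), ("Apply", 0.76),
    ("Not apply", 0.20), ("Not apply", 0.06), ("Not apply", 0.90),
    ("Apply", 0.51), ("Not apply", 0.32), ("Not apply", 0.53),
    ("Not apply", 0.49), ("Not apply", 0.62), ("Not apply", 0.44),
]

THRESHOLD = 0.55  # assumed convention: predict "Apply" when prob >= threshold


def confusion_counts(rows, threshold):
    """Return (TP, FP, TN, FN) with "Apply" as the positive class."""
    tp = fp = tn = fn = 0
    for actual, prob in rows:
        predicted_apply = prob >= threshold
        actual_apply = actual == "Apply"
        if predicted_apply and actual_apply:
            tp += 1
        elif predicted_apply:
            fp += 1
        elif actual_apply:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


tp, fp, tn, fn = confusion_counts(cases, THRESHOLD)
tpr = tp / (tp + fn)  # true positive rate
fpr = fp / (fp + tn)  # false positive rate
ppv = tp / (tp + fp)  # positive predictive value

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"TPR={tpr:.3f} FPR={fpr:.3f} PPV={ppv:.3f}")
```

Changing THRESHOLD and rerunning shows how the three rates trade off against each other on the same 15 cases.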
[5 points] The example of the Facebook analysis was discussed, in which the likes on a Facebook user's profile were used to predict the user's score for a certain personality type (for example, their score on how “open” they were). As discussed, what was a source of bias in the training data that was used to build this prediction model? Why might such a model do poorly when used to predict the personality type score of other Facebook users based on the likes seen on their profiles? (Note: your answer must not be more than 4-5 sentences and must be to the point)
Link to Article: https://www.pnas.org/content/pnas/110/15/5802.full.pdf
Answer:
Likes in such data can be misleading because they are ambiguous. Users often click Like due to peer pressure rather than out of genuine interest.
- On pages 12-14 of handout 5 is the description of a cluster analysis of data from a social network. You can see the original data for yourselves in the csv file. This data was collected by a sociologist who wanted to study the attitudes of high school students (the group on whom the data was collected) with regard to the 5 categories described on page 12. The attitudes were measured by counting how often 36 chosen words, which were thought to capture attitudes towards the 5 categories, occurred in the online profiles. The cluster analysis of the standardized version of this data on the 36 words is shown on page 13 in the handout, with 5 clusters being created.
- [5 points] The handout describes how clusters 1, 2 and 5 behave. How would you describe the main characteristic of the high school students who are in cluster 3? (A couple of sentences at most will suffice)
In cluster 3, the words basketball, football, soccer, softball, volleyball, swimming, kissed, dance, band, marching, music, rock, mall, shopping, clothes, Hollister and Abercrombie all occur with above-average frequency.
- [5 points] As stated above, the clustering was done using only the data on the frequency of occurrence of the 36 words mentioned on page 12. However, data on two additional variables, “% of females” and “number of friends”, was also available for each profile (you can see this in the .csv file mentioned above). Note that these two variables were not used when the clustering was done. Page 14 of the handout shows the average of these two variables for each of the 5 profiles. How can the students of cluster 3 be described in terms of these two additional variables?
This cluster consists mostly of females, with an average of 37 friends.
- [10 points] Suppose you run a clothing company that designs clothes intended for high school girls and want to display your ads online for your clothing line. The company that runs the social network will allow you to display your ads online to members who have social media pages on the network but will charge you for each member that you display the ad to. The members of the social network can set their pages to private and hence you cannot necessarily access the text on their profiles. However, the social network is willing to give you access to the number of friends that each female high school student has on their network.
Given the data that you have in the social network data.csv dataset and all the information available to you from the cluster analysis, how will you train a classification model that will help you predict which high school girls are most interested in fashion using only their “number of friends on the network” as the feature in the classification model? Such a model will thus help you decide which female network members you should target with your ads, sorted from most likely to be interested in fashion to least likely to be interested in fashion.
Answer:
Target students through the social network and obtain each student's number of friends. Then check whether the number of friends exceeds 50. If it does, this indicates the student's popularity. Offer timely discounts and sales to attract more fashion enthusiasts.
Please describe your answer briefly but clearly
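One possible way to set up such a model, sketched in Python under stated assumptions. The idea is to use the cluster analysis to label each female profile in the data as fashion-oriented or not (1 if the profile falls in whichever cluster is highest on the fashion words such as mall, shopping, clothes and Hollister; 0 otherwise), then fit a one-feature logistic regression of that label on the number of friends. The rows, the member names and the function fit_logistic below are hypothetical and purely illustrative; real rows would come from the csv file joined with the cluster assignments, and a library such as scikit-learn could replace the hand-written fit.

```python
import math

# HYPOTHETICAL toy rows, not taken from the data set:
# (number of friends, 1 if the profile is in the fashion-oriented cluster else 0)
rows = [(5, 0), (10, 0), (12, 0), (20, 0), (25, 1), (30, 0),
        (35, 1), (40, 1), (45, 0), (55, 1), (60, 1), (80, 1)]


def fit_logistic(rows, lr=0.1, epochs=2000):
    """Fit P(fashion | friends) by batch gradient descent on the log-loss."""
    xs = [x for x, _ in rows]
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    w = b = 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in rows:
            z = (x - mean) / std  # standardize so one learning rate works
            p = 1 / (1 + math.exp(-(w * z + b)))
            grad_w += (p - y) * z
            grad_b += p - y
        w -= lr * grad_w / len(rows)
        b -= lr * grad_b / len(rows)

    def predict(friends):
        z = (friends - mean) / std
        return 1 / (1 + math.exp(-(w * z + b)))

    return predict


predict = fit_logistic(rows)

# Hypothetical private-profile members: only the friend count is known.
candidates = {"member_a": 15, "member_b": 70, "member_c": 38}
ranked = sorted(candidates, key=lambda m: predict(candidates[m]), reverse=True)
print(ranked)
```

Sorting members by predicted probability gives the ordering from most to least likely to be interested in fashion, so the ad budget can be spent from the top of the list down. Whether the number of friends actually separates the fashion-oriented cluster well would have to be checked on held-out profiles.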
[10 points] Yourcabs.com was an aggregation portal for car rentals launched in the early 2010’s in the South Indian city of Bengaluru (formerly, Bangalore), which is a major tech center in India. The tech industry was transforming a lot of things, transportation being one of them, and Your-cabs offered customers a variety of means of ordering a taxi/car: either online, by phone (landline) or on a mobile device. (Uber did not launch its service in Bengaluru till mid 2014). The cars could be ordered at the time of travel or in advance.
One of the issues that Yourcabs faced was that the drivers using their platform would sometimes cancel their rides; if the cancellation did not occur sufficiently in advance, the customer’s trip would be delayed, thus harming the company’s reputation and business.
Your-cabs had collected data on its rides between 2011 and 2013 and posted a challenge, in coordination with the Indian School of Business, to see what could be learned about the factors affecting cancellations of rides. The file Taxicancellationsoriginaldata.csv has the raw data on approximately 43,000 rides booked through Your-cabs along with various features of each of these rides. Approximately 10,000 of these rides were between Bengaluru and another city and I decided to remove these and focus only on rides within the city. I deleted variables such as customer id, vehicle model id, etc. that I felt very clearly had no bearing on driver cancellation. I then reformatted the data and created the features described below (the reformatted data with these features can be seen in the file TaxiCancellationsDataToFitTree.csv). I recommend that you take a look both at the original file as well as the reformatted data to get an appreciation of the idea that in the vast majority of ML applications, a great deal of time and effort has to be spent on cleaning and formatting the data into a usable form as well as potentially creating new features from the existing ones in the raw data (the latter generally needs some brainstorming with domain experts).
Features:
- Location of where the customer is picked up (in latitude and longitude)
- Destination (in latitude and longitude). Note that unlike with Uber/Lyft etc., the driver here knows the destination for which the booking has been made and so it makes sense to include this information in the modelling. There may be some unpopular destinations that drivers may not want to go to, especially during rush hour
- The day the booking was made
- The time within the day that the booking was made (expressed in minutes, starting from 00 minutes and going to 1439 minutes, which would be one minute before the next day starts)
- The day of travel
- The time of travel within the day of travel (expressed in minutes, starting from 00 minutes and going to 1439 minutes, which would be one minute before the next day starts)
- The difference, in minutes, between when the booking was made and when the ride starts (this feature was not present in the original raw data and had to be computed)
- How the booking was made (by landline phone or on a computer or on a mobile device)...
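The time-based features in the list above can be derived from two timestamps per ride. The Python sketch below is illustrative only: the timestamp format, the function names and the output keys are assumptions rather than the column names in the actual files, and "day" is taken here to mean day of the week (Monday = 0), which is one possible reading of the feature list.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"  # ASSUMED timestamp format; the raw file may differ


def minute_of_day(ts):
    """Minutes since midnight, from 0 to 1439."""
    return ts.hour * 60 + ts.minute


def booking_features(booked_at, travel_at):
    """Derive the time features for one ride from two timestamp strings."""
    booked = datetime.strptime(booked_at, FMT)
    travel = datetime.strptime(travel_at, FMT)
    return {
        "booking_day": booked.weekday(),  # assumption: day of week, Monday = 0
        "booking_minute": minute_of_day(booked),
        "travel_day": travel.weekday(),
        "travel_minute": minute_of_day(travel),
        # computed feature: gap between booking and start of the ride
        "lead_minutes": int((travel - booked).total_seconds() // 60),
    }


# Hypothetical ride booked late one evening for early the next morning.
features = booking_features("2013-01-01 22:30", "2013-01-02 06:15")
print(features)
```

A short lead time combined with a rush-hour travel minute is the kind of interaction a classification tree fitted on these features could pick up.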