Young Data Scientist Challenge’16 (ZS) – Probability Approach

ZS Associates organised a data science competition last weekend (29th to 31st July). It was long awaited, and a total of 7,000 participants had registered for it.

Data was released on 29th July at 12:00 a.m. Understanding the data was a big challenge, and it was very different from the train-test setting I was used to working with.

The problem had two parts: Classification and Regression.
A more detailed problem statement can be found here.

Brief overview of the problem statement: Medaplexis leases surgical instruments (syringes, stethoscopes, and so on) to medical facilities and has contracts with hospitals in several districts.
We were given data in four files:
1. Hospital Profile: number of employees in each hospital
2. Hospital Revenue: revenue generated by a hospital in a district from an instrument over a period of one year
3. Projected Revenue: revenue that will be generated by closing one particular deal
4. Solution: the file I needed to submit

From now on I will follow this notation: H = Hospital_ID, D = District_ID, I = Instrument_ID.
In the HospitalRevenue file, we were given several H_D_I tuples that were leased the previous year (say, in 2014).
And in the ProjectedRevenue file, we were given H_D_I tuples confirmed at the start of this year (i.e. at the start of 2015).

So, Medaplexis wanted to use this data to create a more targeted approach towards hospitals, so that their success rate in leasing instruments would increase for the rest of the year. They wanted to find out whether each H_D_I tuple in the Solution file is going to be leased or not (Classification) and, if it is, how much revenue that particular lease will generate (Regression).

The dataset took more than a day, and a lot of discussion on the forum, to understand the prediction task.
However, in the end I was able to build a model using a simple probabilistic approach, which gave me 0.297 accuracy on the Classification task.
I didn't get time to try the Regression task, so my overall score was (0.5)(0.297) + (0.5)(0) = 0.148.

Approach towards the problem:
1. As is my old habit, whenever I get data in CSV format I do a quick visualisation using Excel pivot tables and pivot charts. This time was no different.
– Pivot charts are great for spotting discrepancies in, or insights from, the data.
2. I also read the data into Python to do the tasks that Excel can't.
3. One of the things I noticed early on was that many Hospital_IDs were repeated in HospitalProfiling.csv.
So, I used the approach given in the discussion forum and took the maximum employee count as final.
4. There were some 'No District Available' values in the District_ID field of HospitalRevenue.csv; I treated them as a separate district (because I didn't find any pattern or similarity with any other district).
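As a sketch, the de-duplication step from point 3 looks like this in pandas (the column name Employee_Count is my assumption; the actual file may name it differently):

```python
import pandas as pd

# Hypothetical rows in the shape of HospitalProfiling.csv
# (the column name Employee_Count is assumed)
prof = pd.DataFrame({
    'Hospital_ID': ['H1', 'H1', 'H2'],
    'Employee_Count': [120, 450, 80],
})

# Repeated Hospital_IDs: keep the maximum employee count as final
prof = prof.groupby('Hospital_ID', as_index=False)['Employee_Count'].max()
```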

Observed insights of the data:
1. Hospitals in the same district behave similarly in instrument leasing. I found this by looking at a pivot table of District_ID, Hospital_ID vs. Instrument_ID, with instrument counts as the values.
2. Instrument_2 had the highest count, followed by Instrument_3 and Instrument_1, suggesting that these instruments are a basic need of every hospital.
3. However, there were some districts where no hospital had leased Instrument_1 (e.g. Districts 14, 11 and 12) or Instrument_3 (e.g. District_14).
4. Each hospital follows more or less the same pattern across all districts, but in some districts it leases a particular instrument while in others it doesn't.
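The pivot view behind these insights is easy to reproduce in pandas with a crosstab; the toy data below is made up purely for illustration:

```python
import pandas as pd

# Made-up leases in the shape of HospitalRevenue
rev = pd.DataFrame({
    'District_ID':   ['D1', 'D1', 'D2', 'D2', 'D2'],
    'Hospital_ID':   ['H1', 'H2', 'H1', 'H3', 'H3'],
    'Instrument_ID': ['I2', 'I2', 'I2', 'I3', 'I1'],
})

# District vs. Instrument with lease counts as values,
# the same view as the Excel pivot table
pivot = pd.crosstab(rev['District_ID'], rev['Instrument_ID'])
```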

After observing 3 and 4, I was sure that to predict whether a particular H_D_I tuple is going to be leased or not, the behaviour of the hospital across all districts, as well as the behaviour of the hospitals in that particular district, were going to be crucial fields for my model.

Model building:

1. First I created the different tuples (H_D, H_I, I_D and H_D_I) for all three files (HospitalRevenue, ProjectedRevenue and Solution).

import pandas as pd
import numpy as np

Profiling = pd.read_csv("HospitalProfiling.csv")
Revenue = pd.read_csv("HospitalRevenue.csv")
PRevenue = pd.read_csv("ProjectedRevenue.csv")
Solution = pd.read_csv("Solution.csv")

#Creating tuples of H_D, H_I, I_D and H_D_I for each file
#Here H = Hospital, D = District and I = Instrument
for df in (Revenue, PRevenue, Solution):
    df['H_D'] = df[['Hospital_ID', 'District_ID']].apply(tuple, axis=1)
    df['H_I'] = df[['Hospital_ID', 'Instrument_ID']].apply(tuple, axis=1)
    df['I_D'] = df[['Instrument_ID', 'District_ID']].apply(tuple, axis=1)
    df['H_D_I'] = df[['Hospital_ID', 'District_ID', 'Instrument_ID']].apply(tuple, axis=1)

Now, the task was to predict whether a particular H_D_I combination would be leased or not (or, put differently, whether Hospital H in District D will lease Instrument I).
This basically depends on two things:

  1. Is Instrument I useful in District D? (From my initial visualisation I had observed that hospitals in the same district behave similarly.)
  2. Is Instrument I useful for Hospital H? (Suppose Hospital H is a heart-specialty hospital; why would they need an instrument used by, say, a dentist?)

So, I created these two variables in the form of probabilities and averaged them to get the final probability of an H_D_I tuple being leased.

#Hospital count for the District where the instrument needs to be leased
#(number of distinct hospitals operating in each district)
p1 = Revenue.groupby('District_ID')['Hospital_ID'].nunique().reset_index(name='Hospital_counts')
sol = Solution.merge(p1, how='left', on='District_ID')

#Hospital count with Instrument I in District D
p2 = Revenue.groupby('I_D')['Hospital_ID'].nunique().reset_index(name='Hospital_counts_ID')
sol = sol.merge(p2, how='left', on='I_D')

#Districts and I_D pairs unseen in HospitalRevenue leave NaNs after the merges.
#-1 for Hospital_counts keeps the later division defined: 0/-1 gives 0, not NaN.
sol.Hospital_counts = sol.Hospital_counts.fillna(-1)
sol.Hospital_counts_ID = sol.Hospital_counts_ID.fillna(0)

#Count of districts in which Hospital H operates
p3 = Revenue.groupby('Hospital_ID')['District_ID'].nunique().reset_index(name='District_counts')
sol = sol.merge(p3, how='left', on='Hospital_ID')

#Count of districts in which Hospital H has leased Instrument I
p4 = Revenue.groupby('H_I')['District_ID'].nunique().reset_index(name='District_counts_HI')
sol = sol.merge(p4, how='left', on='H_I')
#Filling missing values for H_I pairs never seen in HospitalRevenue
sol.District_counts_HI = sol.District_counts_HI.fillna(0)

#Probability of Hospital having Instrument I in District D
#(fraction of hospitals in District D that lease Instrument I)
sol['p_h_i'] = np.array(sol.Hospital_counts_ID)/np.array(sol.Hospital_counts)

#Probability of Hospital H having an Instrument I
#(fraction of districts in which Hospital H leases Instrument I)
sol['h_i'] = np.array(sol.District_counts_HI)/np.array(sol.District_counts)
#Averaged probability
sol.Buy_or_not = (np.array(sol.p_h_i)+np.array(sol.h_i))/2.
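One subtlety worth noting: for a district absent from HospitalRevenue, the fillna calls leave Hospital_counts_ID = 0 and Hospital_counts = -1, so the division yields 0 rather than NaN. A minimal check of that behaviour:

```python
import numpy as np

counts_id = np.array([3.0, 0.0])  # hospitals leasing I in D; 0 for an unseen pair
counts = np.array([10.0, -1.0])   # hospitals in D; -1 sentinel for an unseen district
prob = counts_id / counts         # second entry is -0.0, i.e. effectively zero
```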

Then, I put a threshold of 0.4 on the probability to decide Buy_or_not: if avg_prob > 0.4 then Buy_or_not = 1, else Buy_or_not = 0.

#Selected this threshold by trial and error
sol.Buy_or_not = (sol.Buy_or_not > 0.4).astype(int)
sum(sol.Buy_or_not)

When I checked my model on the H_D_I tuples from ProjectedRevenue (all of which were 1), I got 70% classification accuracy, yet on the leaderboard I was getting hardly a 30% f1_score. So I did some calculation, and according to my estimate very few of the deals in the Solution file were going to be leased, because instruments that are hardly ever leased (like Instruments 18, 19 and 20) were each offered 16,000 times (48,000 candidate leases in total).

I was confident that there would be hardly 4,000 deals that could actually be leased, and I used this assumption to explain my low leaderboard score.

There were around 193,000 observations I needed to predict, out of which (by my assumption) hardly 4,000 could be leased. My classification accuracy on class 1 was 70%, and my model predicted 1 for around 10,000 observations in total.

  1. True Positives (TP) = 4000 * 0.7 = 2800
  2. False Negatives (FN) = 4000 - 2800 = 1200 (actual positives my model missed)
  3. False Positives (FP) = 10000 - 2800 = 7200 (predicted positives that were wrong)

So,

  1. Precision (P) = TP/(TP + FP) = 2800/(2800 + 7200) = 0.28
  2. Recall (R) = TP/(TP + FN) = 2800/(2800 + 1200) = 0.7
  3. f1_score (our evaluation metric) = 2PR/(P + R) ≈ 0.4

I was getting an f1_score of 0.3, which was comparable.
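These back-of-the-envelope numbers can be sanity-checked in a few lines (here 1200 is the count of actual positives missed, and 7200 the count of wrong positive predictions):

```python
# Estimated counts: 4000 actual positives, 10000 predicted positives,
# and 70% of actual positives caught by the model
tp = 4000 * 0.7            # 2800
fn = 4000 - tp             # actual positives missed
fp = 10000 - tp            # predicted positives that were wrong
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```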

I completed this competition without even importing sklearn. 😀
