Young Data Scientist Challenge’16 (ZS) – sklearn Approach

ZS Associates hosted a data science competition on HackerRank last weekend. This post walks through my final code submission. The dataset was very tricky this time: most of my time was spent constructing a suitable dataset on which I could train my model and predict the solution.

The evaluation metric for classification was the F1 score; for regression they used a metric of their own.
The maximum classification accuracy I was able to achieve was 30.12%, which gave me a 0.15 score on the leaderboard.
The regression metric was very tricky, and getting a positive score on it was nearly impossible without good classification accuracy: we were supposed to predict revenue in the regression part, and without correctly identifying the leased instrument there was no point in predicting its revenue.
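For reference, the F1 score used for the classification part can be computed with scikit-learn. This is just an illustration on made-up labels, not the competition's scoring code:

```python
# Hedged illustration: F1 = 2 * precision * recall / (precision + recall),
# computed on hypothetical Buy_or_Not labels.
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 1]   # made-up ground truth
y_pred = [1, 0, 0, 1, 1, 1]   # made-up predictions

score = f1_score(y_true, y_pred)
print(round(score, 3))  # → 0.75 (precision 3/4, recall 3/4)
```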

In [3]:
import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt
%matplotlib inline

They provided us with four datasets – hospital profiling, hospital revenue, projected revenue and solution.

In [4]:
hospital_profiling = pd.read_csv('Dataset/HospitalProfiling.csv')
hospital_revenue = pd.read_csv('Dataset/HospitalRevenue.csv')
projected_revenue = pd.read_csv('Dataset/ProjectedRevenue.csv')
solution = pd.read_csv('Dataset/Solution.csv')

There were in total 56 unique District_IDs, 1602 unique Hospital_IDs and 15 Instrument_IDs across the four datasets. I constructed a dataframe containing all the possible combinations.
My initial aim was to extract, out of all these possible combinations, the Hospital_ID–District_ID combinations present in the four datasets. These H-D combinations, crossed with all 15 Instrument_IDs, let me construct a model-trainable dataset containing both the confirmed lease deals and the deals that were absent from the revenue dataset but were still possible for a particular H-D combination.

In [5]:
all_possible_district_id = hospital_revenue['District_ID'].unique()
all_possible_district_id = np.concatenate((all_possible_district_id,['District 53']),axis = 0)
all_possible_district_id = np.concatenate((all_possible_district_id,['District 26','District 22']),axis = 0)
all_possible_district_id.shape
Out[5]:
(56,)
In [6]:
all_possible_hospital_id = hospital_revenue['Hospital_ID'].unique()
all_possible_hospital_id = np.concatenate((all_possible_hospital_id,['Hospital 838']), axis = 0)
all_possible_hospital_id.shape
Out[6]:
(1602,)
In [7]:
all_possible_instrument_id = hospital_revenue['Instrument_ID'].unique()
all_possible_instrument_id.shape
Out[7]:
(15,)
In [8]:
x = {'Hospital_ID':all_possible_hospital_id, 'Instrument_ID':all_possible_instrument_id, 'District_ID':all_possible_district_id}
import itertools
all_possible_combination = pd.DataFrame(list(itertools.product(*x.values())), columns=x.keys())
In [9]:
# all_possible_combination.to_csv('all_pos.csv', index = False)
In [10]:
all_possible_combination.head()
Out[10]:
Instrument_ID Hospital_ID District_ID
0 Instrument 2 Hospital 1 District 12
1 Instrument 2 Hospital 1 District 13
2 Instrument 2 Hospital 1 District 16
3 Instrument 2 Hospital 1 District 18
4 Instrument 2 Hospital 1 District 19

My next aim was to extract the H-D combinations present in the four datasets out of the all-possible-combinations dataframe. For this, I added a column called ‘Exist’ to the all-possible-combinations dataframe, initially filled with zeros. For the combinations present in a given dataframe I set this value to 1, and later kept only the rows where it was 1.
I used this approach because it gave me the flexibility to decide which combinations to include and exclude, and this was a very important aspect of the competition.
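The flagging trick described above can be sketched on toy data (hypothetical IDs, not the competition's files):

```python
import pandas as pd

# Toy sketch of the 'Exist' flag: mark rows of the all-possible frame
# whose (Hospital_ID, District_ID) pair appears in a revenue frame.
all_comb = pd.DataFrame({'Hospital_ID': ['H1', 'H1', 'H2'],
                         'District_ID': ['D1', 'D2', 'D1']})
revenue = pd.DataFrame({'Hospital_ID': ['H1'], 'District_ID': ['D2']})

all_comb['Exist'] = 0
a = all_comb[['Hospital_ID', 'District_ID']].apply(tuple, axis=1)
b = revenue[['Hospital_ID', 'District_ID']].apply(tuple, axis=1)
# .loc keeps the boolean-mask assignment explicit (no chained indexing)
all_comb.loc[a.isin(b), 'Exist'] = 1
print(all_comb['Exist'].tolist())  # → [0, 1, 0]
```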

Something I observed after playing with the datasets was that the H-D combinations present in the hospital profiling dataset were very different from those in the revenue dataset. Including them gave me a much bigger training dataset, but one that was highly imbalanced once I added the ‘Buy_or_Not’ column.
So until the final submission I experimented with both possibilities: one in which I included the H-D combinations of the hospital profiling dataframe, and one in which I did not.

In the end, I went with the one that excluded the profiling combinations because it gave me better accuracy.

In [11]:
all_possible_combination['Exist'] = 0
In [12]:
a = all_possible_combination[['Hospital_ID','District_ID']].apply(tuple, axis = 1)
b = hospital_revenue[['Hospital_ID','District_ID']].apply(tuple, axis = 1)
all_possible_combination.loc[a.isin(b), 'Exist'] = 1
In [13]:
all_possible_combination[all_possible_combination['Exist'] == 1].shape
Out[13]:
(219660, 4)

Below is the commented-out code to include the profiling combinations. Including them added almost 2.2 lakh new combinations to the all-possible dataframe.

In [14]:
# a = all_possible_combination[['Hospital_ID','District_ID']].apply(tuple, axis = 1)
# b = hospital_profiling[['Hospital_ID','District_ID']].apply(tuple, axis = 1)
# all_possible_combination.loc[a.isin(b), 'Exist'] = 1
In [15]:
# all_possible_combination[all_possible_combination['Exist'] == 1].shape
In [16]:
a = all_possible_combination[['Hospital_ID','District_ID']].apply(tuple, axis = 1)
b = projected_revenue[['Hospital_ID','District_ID']].apply(tuple, axis = 1)
all_possible_combination.loc[a.isin(b), 'Exist'] = 1
In [17]:
all_possible_combination[all_possible_combination['Exist'] == 1].shape
Out[17]:
(225180, 4)

Below I included the H-D combinations present in the solution dataset. This covered both the H-D-I combinations that were present in the solution and those that were not.

In [18]:
a = all_possible_combination[['Hospital_ID','District_ID']].apply(tuple, axis = 1)
b = solution[['Hospital_ID','District_ID']].apply(tuple, axis = 1)
all_possible_combination.loc[a.isin(b), 'Exist'] = 1
In [19]:
all_possible_combination[all_possible_combination['Exist'] == 1].shape
Out[19]:
(258105, 4)

Below I added another column to the all-possible-combinations dataframe named ‘train’. I initially filled it with 0’s, and for the H-D-I combinations present in the solution dataset I replaced the 0 with 1. I did this because, when preparing data for training and prediction, I wanted to apply the same treatment to both the training dataset and the dataset I would use for prediction.

In [20]:
all_possible_combination['train'] = 0
a = all_possible_combination[['Hospital_ID','District_ID','Instrument_ID']].apply(tuple, axis = 1)
b = solution[['Hospital_ID','District_ID','Instrument_ID']].apply(tuple, axis = 1)
all_possible_combination.loc[a.isin(b), 'Exist'] = 1
all_possible_combination.loc[a.isin(b), 'train'] = 1
In [21]:
all_possible_combination[all_possible_combination['Exist'] == 1].shape
Out[21]:
(258105, 5)

I separated out all the combinations for which the ‘Exist’ value was 1 and proceeded with those.

In [22]:
all_possible_combination = all_possible_combination[all_possible_combination['Exist'] == 1]

The most important part of any data science competition is feature engineering. Creating new features and transforming them to suit the model were two of the important tasks of this competition. I created many features and experimented with them to improve my accuracy.

There were a few assumptions I made at this stage to construct model-trainable data. In the all-possible-combinations dataframe, I was sure that the combinations present in the revenue table were leased; the others, present in the all-possible dataframe but not in the revenue dataframe, I assumed to be instruments that were not leased.

In the hospital revenue and projected revenue dataframes I created the following features:

  1. Hospital_ID count:
    The number of occurrences of each unique Hospital_ID, giving the model information about a
    particular hospital's lease history.
  2. District_ID count:
    The number of occurrences of each unique District_ID, giving the model information about a
    particular district's lease history.
  3. Instrument_ID count:
    The number of occurrences of each unique Instrument_ID, giving the model information about a
    particular instrument's lease history.

Similarly, I created the following features in the revenue file:

  1. H+D_count:
    Unique Hospital_ID+District_ID count.
  2. H+I_count:
    Unique Hospital_ID+Instrument_ID count.
  3. D+I_count:
    Unique District_ID+Instrument_ID count.
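The idea behind these count features can be sketched on toy data (made-up IDs): each row receives the occurrence count of its own ID, and pairwise counts use a concatenated key, as in the notebook cells below.

```python
import pandas as pd

# Minimal sketch of the count features on toy data, not the real revenue file.
df = pd.DataFrame({'Hospital_ID': ['H1', 'H1', 'H2'],
                   'Instrument_ID': ['I1', 'I2', 'I1']})

# Per-ID occurrence count, mapped back onto every row
df['Hospital_ID_count'] = df['Hospital_ID'].map(df['Hospital_ID'].value_counts())

# Pairwise counts such as H+I_count work the same way on a concatenated key
df['H+I'] = df['Hospital_ID'] + df['Instrument_ID']
df['H+I_count'] = df['H+I'].map(df['H+I'].value_counts())

print(df['Hospital_ID_count'].tolist())  # → [2, 2, 1]
print(df['H+I_count'].tolist())          # → [1, 1, 1]
```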
In [23]:
hospital_revenue.head()
Out[23]:
Hospital_ID Region_ID District_ID Instrument_ID Month 1 Month 2 Month 3 Month 4 Month 5 Month 6 Month 7 Month 8 Month 9 Month 10 Month 11 Month 12 Year Total
0 Hospital 1 Region 1 District 12 Instrument 2 8534 9917 7825 11702 8776 7755 9289 7796 7595 8292 7787 8282 103550
1 Hospital 1 Region 1 District 12 Instrument 3 298 298 214 311 261 223 237 171 173 183 193 0 2562
2 Hospital 1 Region 1 District 13 Instrument 1 37 40 38 43 29 0 0 0 0 0 0 0 187
3 Hospital 1 Region 1 District 13 Instrument 2 2486 3332 3193 2556 2108 2757 2639 2531 2771 2682 12317 1369 40741
4 Hospital 1 Region 1 District 13 Instrument 3 857 892 739 759 736 415 1203 434 448 113 829 1124 8549
In [24]:
hospital_revenue['H+D'] = hospital_revenue['Hospital_ID'] + hospital_revenue['District_ID']
hospital_revenue['H+I'] = hospital_revenue['Hospital_ID'] + hospital_revenue['Instrument_ID']
hospital_revenue['D+I'] = hospital_revenue['District_ID'] + hospital_revenue['Instrument_ID']

lc = pd.DataFrame(hospital_revenue['Hospital_ID'].value_counts())
hospital_revenue = pd.merge(hospital_revenue, lc, how='left', left_on='Hospital_ID',right_index=True)

lc = pd.DataFrame(hospital_revenue['H+D'].value_counts())
hospital_revenue = pd.merge(hospital_revenue, lc, how='left', left_on=['H+D'],right_index=True)

lc = pd.DataFrame(hospital_revenue['H+I'].value_counts())
hospital_revenue = pd.merge(hospital_revenue, lc, how='left', left_on=['H+I'],right_index=True)

lc = pd.DataFrame(hospital_revenue['D+I'].value_counts())
hospital_revenue = pd.merge(hospital_revenue, lc, how='left', left_on=['D+I'],right_index=True)

lc = pd.DataFrame(hospital_revenue['Instrument_ID'].value_counts())
hospital_revenue = pd.merge(hospital_revenue, lc, how='left', left_on='Instrument_ID',right_index=True)

lc = pd.DataFrame(hospital_revenue['District_ID'].value_counts())
hospital_revenue = pd.merge(hospital_revenue, lc, how='left', left_on='District_ID',right_index=True)
In [25]:
hospital_revenue.columns = [u'Hospital_ID', u'Region_ID', u'District_ID', u'Instrument_ID',
       u'Month 1', u'Month 2', u'Month 3', u'Month 4', u'Month 5', u'Month 6',
       u'Month 7', u'Month 8', u'Month 9', u'Month 10', u'Month 11',
       u'Month 12', u'Year Total', u'H+D', u'H+I', u'D+I',
       u'Hospital_ID_count', u'H+D_count', u'H+I_count', u'D+I_count',
       u'Instrument_ID_count', u'District_ID_count']
In [26]:
hospital_revenue.head()
Out[26]:
Hospital_ID Region_ID District_ID Instrument_ID Month 1 Month 2 Month 3 Month 4 Month 5 Month 6 Year Total H+D H+I D+I Hospital_ID_count H+D_count H+I_count D+I_count Instrument_ID_count District_ID_count
0 Hospital 1 Region 1 District 12 Instrument 2 8534 9917 7825 11702 8776 7755 103550 Hospital 1District 12 Hospital 1Instrument 2 District 12Instrument 2 52 2 16 637 13638 1309
1 Hospital 1 Region 1 District 12 Instrument 3 298 298 214 311 261 223 2562 Hospital 1District 12 Hospital 1Instrument 3 District 12Instrument 3 52 2 9 527 8270 1309
2 Hospital 1 Region 1 District 13 Instrument 1 37 40 38 43 29 0 187 Hospital 1District 13 Hospital 1Instrument 1 District 13Instrument 1 52 5 7 878 7399 4878
3 Hospital 1 Region 1 District 13 Instrument 2 2486 3332 3193 2556 2108 2757 40741 Hospital 1District 13 Hospital 1Instrument 2 District 13Instrument 2 52 5 16 1017 13638 4878
4 Hospital 1 Region 1 District 13 Instrument 3 857 892 739 759 736 415 8549 Hospital 1District 13 Hospital 1Instrument 3 District 13Instrument 3 52 5 9 952 8270 4878

5 rows × 26 columns

After creating these features in the hospital revenue dataframe, I merged them into the all-possible-combinations dataframe. While merging the two dataframes, I also brought over the Region_ID from the hospital revenue dataframe.

In [27]:
to_add_2 = hospital_revenue[['Hospital_ID','Region_ID']]
to_add_2 = to_add_2.drop_duplicates(subset = ['Hospital_ID'])
all_possible_combination = pd.merge(all_possible_combination, to_add_2, how = 'left', on = 'Hospital_ID')
all_possible_combination['Region_ID'] = all_possible_combination['Region_ID'].fillna(hospital_revenue['Region_ID'].value_counts().idxmax())
In [28]:
to_add_2 = hospital_revenue[['Hospital_ID','Hospital_ID_count']]
to_add_2 = to_add_2.drop_duplicates(subset = ['Hospital_ID'])
all_possible_combination = pd.merge(all_possible_combination, to_add_2, how = 'left', on = 'Hospital_ID')

to_add_2 = hospital_revenue[['Instrument_ID','Instrument_ID_count']]
to_add_2 = to_add_2.drop_duplicates(subset = ['Instrument_ID'])
all_possible_combination = pd.merge(all_possible_combination, to_add_2, how = 'left', on = 'Instrument_ID')

all_possible_combination['H+D'] = all_possible_combination['Hospital_ID'] + all_possible_combination['District_ID']
all_possible_combination['H+I'] = all_possible_combination['Hospital_ID'] + all_possible_combination['Instrument_ID']
all_possible_combination['D+I'] = all_possible_combination['District_ID'] + all_possible_combination['Instrument_ID']

to_add_2 = hospital_revenue[['H+D','H+D_count']]
to_add_2 = to_add_2.drop_duplicates(subset = ['H+D'])
all_possible_combination = pd.merge(all_possible_combination, to_add_2, how = 'left', on = 'H+D')

to_add_2 = hospital_revenue[['H+I','H+I_count']]
to_add_2 = to_add_2.drop_duplicates(subset = ['H+I'])
all_possible_combination = pd.merge(all_possible_combination, to_add_2, how = 'left', on = 'H+I')

to_add_2 = hospital_revenue[['D+I','D+I_count']]
to_add_2 = to_add_2.drop_duplicates(subset = ['D+I'])
all_possible_combination = pd.merge(all_possible_combination, to_add_2, how = 'left', on = 'D+I')

to_add_2 = hospital_revenue[['District_ID','District_ID_count']]
to_add_2 = to_add_2.drop_duplicates(subset = ['District_ID'])
all_possible_combination = pd.merge(all_possible_combination, to_add_2, how = 'left', on = 'District_ID')

I created the same features in the projected revenue dataframe and later merged it with the all-possible-combinations dataframe. For projected revenue, I named the merged features Hospital_ID_new_count, District_ID_new_count, and so on.

In [29]:
projected_revenue['H+D'] = projected_revenue['Hospital_ID'] + projected_revenue['District_ID']
projected_revenue['H+I'] = projected_revenue['Hospital_ID'] + projected_revenue['Instrument_ID']
projected_revenue['D+I'] = projected_revenue['District_ID'] + projected_revenue['Instrument_ID']

lc = pd.DataFrame(projected_revenue['Hospital_ID'].value_counts())
projected_revenue = pd.merge(projected_revenue, lc, how='left', left_on='Hospital_ID',right_index=True)

lc = pd.DataFrame(projected_revenue['H+D'].value_counts())
projected_revenue = pd.merge(projected_revenue, lc, how='left', left_on=['H+D'],right_index=True)

lc = pd.DataFrame(projected_revenue['H+I'].value_counts())
projected_revenue = pd.merge(projected_revenue, lc, how='left', left_on=['H+I'],right_index=True)

lc = pd.DataFrame(projected_revenue['D+I'].value_counts())
projected_revenue = pd.merge(projected_revenue, lc, how='left', left_on=['D+I'],right_index=True)

lc = pd.DataFrame(projected_revenue['Instrument_ID'].value_counts())
projected_revenue = pd.merge(projected_revenue, lc, how='left', left_on='Instrument_ID',right_index=True)

lc = pd.DataFrame(projected_revenue['District_ID'].value_counts())
projected_revenue = pd.merge(projected_revenue, lc, how='left', left_on='District_ID',right_index=True)

projected_revenue.columns = [u'Hospital_ID', u'District_ID', u'Instrument_ID',
       u'Annual_Projected_Revenue', u'H+D', u'H+I', u'D+I', u'Hospital_ID_new_count', u'H+D_new_count', u'H+I_new_count'
                            , u'D+I_new_count', u'Instrument_ID_new_count', u'District_ID_new_count']
In [30]:
to_add_2 = projected_revenue[['Hospital_ID','Hospital_ID_new_count']]
to_add_2 = to_add_2.drop_duplicates(subset = ['Hospital_ID'])
all_possible_combination = pd.merge(all_possible_combination, to_add_2, how = 'left', on = 'Hospital_ID')

to_add_2 = projected_revenue[['Instrument_ID','Instrument_ID_new_count']]
to_add_2 = to_add_2.drop_duplicates(subset = ['Instrument_ID'])
all_possible_combination = pd.merge(all_possible_combination, to_add_2, how = 'left', on = 'Instrument_ID')

all_possible_combination['H+D'] = all_possible_combination['Hospital_ID'] + all_possible_combination['District_ID']
all_possible_combination['H+I'] = all_possible_combination['Hospital_ID'] + all_possible_combination['Instrument_ID']
all_possible_combination['D+I'] = all_possible_combination['District_ID'] + all_possible_combination['Instrument_ID']

to_add_2 = projected_revenue[['H+D','H+D_new_count']]
to_add_2 = to_add_2.drop_duplicates(subset = ['H+D'])
all_possible_combination = pd.merge(all_possible_combination, to_add_2, how = 'left', on = 'H+D')

to_add_2 = projected_revenue[['H+I','H+I_new_count']]
to_add_2 = to_add_2.drop_duplicates(subset = ['H+I'])
all_possible_combination = pd.merge(all_possible_combination, to_add_2, how = 'left', on = 'H+I')

to_add_2 = projected_revenue[['D+I','D+I_new_count']]
to_add_2 = to_add_2.drop_duplicates(subset = ['D+I'])
all_possible_combination = pd.merge(all_possible_combination, to_add_2, how = 'left', on = 'D+I')

to_add_2 = projected_revenue[['District_ID','District_ID_new_count']]
to_add_2 = to_add_2.drop_duplicates(subset = ['District_ID'])
all_possible_combination = pd.merge(all_possible_combination, to_add_2, how = 'left', on = 'District_ID')

I filled the missing values in the count and new-count features of the all-possible-combinations dataframe with 0, since these were combinations that had never happened in the past but were still possible.

At this stage, I made another assumption. Taken separately, the count and new-count features contained many missing values, but on closer observation I realised they served the same purpose: hinting the model about past deals. So I summed each hospital revenue count feature with its projected revenue counterpart. For example, Hospital_ID_count (from hospital revenue) and Hospital_ID_new_count (from projected revenue) were added to form a single Hospital_ID_count feature.

In [31]:
all_possible_combination = all_possible_combination.fillna(0)
In [32]:
all_possible_combination.isnull().any()
Out[32]:
Instrument_ID              False
Hospital_ID                False
District_ID                False
Exist                      False
train                      False
Region_ID                  False
Hospital_ID_count          False
Instrument_ID_count        False
H+D                        False
H+I                        False
D+I                        False
H+D_count                  False
H+I_count                  False
D+I_count                  False
District_ID_count          False
Hospital_ID_new_count      False
Instrument_ID_new_count    False
H+D_new_count              False
H+I_new_count              False
D+I_new_count              False
District_ID_new_count      False
dtype: bool
In [33]:
all_possible_combination['Hospital_ID_count'] = all_possible_combination['Hospital_ID_count'] + all_possible_combination['Hospital_ID_new_count']
all_possible_combination['District_ID_count'] = all_possible_combination['District_ID_count'] + all_possible_combination['District_ID_new_count']
all_possible_combination['H+D_count'] = all_possible_combination['H+D_count'] + all_possible_combination['H+D_new_count']
all_possible_combination['H+I_count'] = all_possible_combination['H+I_count'] + all_possible_combination['H+I_new_count']
all_possible_combination['D+I_count'] = all_possible_combination['D+I_count'] + all_possible_combination['D+I_new_count']
all_possible_combination['Instrument_ID_count'] = all_possible_combination['Instrument_ID_count'] + all_possible_combination['Instrument_ID_new_count']

After this, I created a few new features in the all-possible dataframe to give my final model more information. Some of these I used in my final model and some I did not, since many correlated features can reduce accuracy. After carefully watching the upward and downward movement of my leaderboard score, I selected the most effective of these features.

In [34]:
all_possible_combination['rating'] = all_possible_combination['Hospital_ID_count'] + all_possible_combination['District_ID_count'] + all_possible_combination['Instrument_ID_count']
all_possible_combination['rating_2'] = all_possible_combination['Hospital_ID_count'] * all_possible_combination['District_ID_count'] * all_possible_combination['Instrument_ID_count']
all_possible_combination['rating_3'] = all_possible_combination['H+D_count'] + all_possible_combination['D+I_count'] + all_possible_combination['H+I_count']
all_possible_combination['rating_4'] = all_possible_combination['H+D_count'] * all_possible_combination['D+I_count'] * all_possible_combination['H+I_count']
all_possible_combination['rating_5'] = all_possible_combination['Hospital_ID_count'] + all_possible_combination['District_ID_count']
all_possible_combination['rating_6'] = all_possible_combination['Hospital_ID_count'] + all_possible_combination['Instrument_ID_count']
all_possible_combination['rating_7'] = all_possible_combination['Instrument_ID_count'] + all_possible_combination['District_ID_count']

I also created a unique Region_ID+Instrument_ID count feature in the all-possible-combinations dataframe.

In [35]:
all_possible_combination['R+I'] = all_possible_combination['Region_ID'] + all_possible_combination['Instrument_ID']
In [36]:
lc = pd.DataFrame(all_possible_combination['R+I'].value_counts())
all_possible_combination = pd.merge(all_possible_combination, lc, how='left', left_on='R+I',right_index=True)
In [37]:
all_possible_combination.columns = [u'Instrument_ID', u'Hospital_ID', u'District_ID', u'Exist',
       u'train', u'Region_ID', u'Hospital_ID_count', u'Instrument_ID_count',
       u'H+D', u'H+I', u'D+I', u'H+D_count', u'H+I_count', u'D+I_count',
       u'District_ID_count', u'Hospital_ID_new_count', u'Instrument_ID_new_count',
       u'H+D_new_count', u'H+I_new_count', u'D+I_new_count', u'District_ID_new_count',
       u'rating', u'rating_2', u'rating_3', u'rating_4', u'rating_5',
       u'rating_6', u'rating_7', u'R+I', u'R+I_count']

I added the ‘Buy_or_Not’ feature to the all-possible-combinations dataframe. This was one of the most important columns, as it provided the positive/negative class labels. All H-D-I combinations present in the hospital revenue or projected revenue dataframes I assumed to be positive; the remaining I assumed to be negative.

In [38]:
all_possible_combination['Buy_or_Not'] = 0

a = all_possible_combination[['Hospital_ID','District_ID','Instrument_ID']].apply(tuple, axis = 1)
b = hospital_revenue[['Hospital_ID','District_ID','Instrument_ID']].apply(tuple, axis = 1)
all_possible_combination.loc[a.isin(b), 'Buy_or_Not'] = 1

b = projected_revenue[['Hospital_ID','District_ID','Instrument_ID']].apply(tuple, axis = 1)
all_possible_combination.loc[a.isin(b), 'Buy_or_Not'] = 1

I created some more ratio and rating features in the all-possible dataframe to give the model information about unique combination counts relative to the other counts.

In [39]:
all_possible_combination['ratio_1'] = all_possible_combination['H+I_count']/all_possible_combination['Hospital_ID_count']
In [40]:
all_possible_combination['ratio_2'] = all_possible_combination['D+I_count']/all_possible_combination['District_ID_count']
In [41]:
#all_possible_combination['ratio_3'] = all_possible_combination['H+D_count'] * all_possible_combination['Instrument_ID_count']
In [42]:
all_possible_combination['ratio_4'] = all_possible_combination['H+I_count']/all_possible_combination['Instrument_ID_count']
all_possible_combination['ratio_5'] = all_possible_combination['D+I_count']/all_possible_combination['Instrument_ID_count']
In [43]:
all_possible_combination['rating_8'] = 0.5*(all_possible_combination['H+I_count']+all_possible_combination['D+I_count'])
all_possible_combination['ratio_6'] = all_possible_combination['R+I_count']/all_possible_combination['Instrument_ID_count']
In [44]:
all_possible_combination['rating_9'] = 0.5*(all_possible_combination['Instrument_ID_count']+all_possible_combination['District_ID_count'])
all_possible_combination['rating_10'] = 0.5*(all_possible_combination['Instrument_ID_count']+all_possible_combination['Hospital_ID_count'])
In [45]:
# all_possible_combination['ratio_7'] = all_possible_combination['D+I_count']/(all_possible_combination['District_ID_count']*all_possible_combination['Instrument_ID_count'])
# all_possible_combination['ratio_8'] = all_possible_combination['H+I_count']/(all_possible_combination['Hospital_ID_new_count']*all_possible_combination['Instrument_ID_count'])

In parallel, I constructed a dataframe for the regression problem. For this I separated out the positively classified rows of the all-possible-combinations dataframe and added the Year Total and revenue features from hospital revenue and projected revenue. I then filled the missing values with 0’s and summed the two revenue features to use as the target for training the regression model. This was purely intuition-driven. The missing values arose because some of the positive H-D-I combinations were present in the hospital revenue dataframe and some in the projected revenue dataframe; for combinations that appeared in both, I simply added the two revenues.

In [46]:
# FOR REGRESSION

# take a copy so the column additions below don't modify a view
for_reg = all_possible_combination[all_possible_combination['Buy_or_Not'] == 1].copy()
for_reg['H+D+I'] = for_reg['Hospital_ID'] + for_reg['District_ID'] + for_reg['Instrument_ID']
hospital_revenue['H+D+I'] = hospital_revenue['Hospital_ID'] + hospital_revenue['District_ID'] + hospital_revenue['Instrument_ID']
projected_revenue['H+D+I'] = projected_revenue['Hospital_ID'] + projected_revenue['District_ID'] + projected_revenue['Instrument_ID']

to_add_2 = hospital_revenue[['H+D+I','Year Total']]
to_add_2 = to_add_2.drop_duplicates(subset = ['H+D+I'])
for_reg = pd.merge(for_reg, to_add_2, how = 'left', on = 'H+D+I')

to_add_2 = projected_revenue[['H+D+I','Annual_Projected_Revenue']]
to_add_2 = to_add_2.drop_duplicates(subset = ['H+D+I'])
for_reg = pd.merge(for_reg, to_add_2, how = 'left', on = 'H+D+I')
In [47]:
solution_from_combination = all_possible_combination[all_possible_combination['train'] == 1].copy()
all_possible_combination = all_possible_combination[all_possible_combination['train'] == 0]
In [48]:
solution_from_combination.head()
Out[48]:
Instrument_ID Hospital_ID District_ID Exist train Region_ID Hospital_ID_count Instrument_ID_count H+D H+I R+I_count Buy_or_Not ratio_1 ratio_2 ratio_4 ratio_5 rating_8 ratio_6 rating_9 rating_10
12 Instrument 2 Hospital 1 District 39 1 1 Region 1 70.0 17452.0 Hospital 1District 39 Hospital 1Instrument 2 4903 0 0.300000 0.300566 0.001203 0.042574 382.0 0.280942 9962.0 8761.0
76 Instrument 2 Hospital 1004 District 35 1 1 Region 3 29.0 17452.0 Hospital 1004District 35 Hospital 1004Instrument 2 3013 0 0.344828 0.948718 0.000573 0.016961 153.0 0.172645 8882.0 8740.5
81 Instrument 2 Hospital 1004 District 8 1 1 Region 3 29.0 17452.0 Hospital 1004District 8 Hospital 1004Instrument 2 3013 0 0.344828 0.463892 0.000573 0.021717 194.5 0.172645 9134.5 8740.5
82 Instrument 2 Hospital 1005 District 50 1 1 Region 1 2.0 17452.0 Hospital 1005District 50 Hospital 1005Instrument 2 4903 0 1.000000 0.262647 0.000115 0.058904 515.0 0.280942 10683.0 8727.0
89 Instrument 2 Hospital 1006 District 2 1 1 Region 3 107.0 17452.0 Hospital 1006District 2 Hospital 1006Instrument 2 3013 0 0.373832 0.314456 0.002292 0.012090 125.5 0.172645 9061.5 8779.5

5 rows × 39 columns

Finally, after separating the solution combinations from the all-possible-combinations dataframe, I obtained a model-ready training dataset. Its size was approximately 62K rows, of which around 47K were positive and the rest negative.
Had I included the H-D combinations from the hospital profiling dataframe, I would have obtained almost 2.5 lakh combinations, of which 47K would be positive and the remaining 2 lakh negative.

In [49]:
all_possible_combination['Buy_or_Not'].value_counts()
Out[49]:
1    47636
0    15877
Name: Buy_or_Not, dtype: int64
In [50]:
solution_from_combination['H+D+I'] = solution_from_combination['Hospital_ID']+solution_from_combination['District_ID']+solution_from_combination['Instrument_ID']
all_possible_combination['H+D+I'] = all_possible_combination['Hospital_ID']+all_possible_combination['District_ID']+all_possible_combination['Instrument_ID']

all_possible_combination = all_possible_combination.drop_duplicates(subset = ['H+D+I'])
solution_from_combination = solution_from_combination.drop_duplicates(subset = ['H+D+I'])

Following were the continuous and categorical features I finally selected for my model.

In [51]:
cont_type = [ u'H+D_count',u'H+I_count',u'D+I_count',u'Instrument_ID_count','Hospital_ID_count','R+I_count'
            , 'ratio_1','ratio_2','ratio_4','ratio_5','ratio_6','rating_8','rating_9', 'rating_10']

cat_type = [u'Hospital_ID',u'Instrument_ID', u'District_ID', u'Region_ID']
In [52]:
solution_copy = solution.copy()
solution_copy = pd.merge(solution_copy, solution_from_combination[cat_type+cont_type], on = ['Hospital_ID','District_ID','Instrument_ID'])

The commented-out code below is one of the experiments I tried: separating the rows for a single instrument out of the all-possible-combinations training dataframe, training a model on just those rows, and predicting the result for that instrument in the solution dataset.

In [53]:
# solution_retain = solution_copy.copy()
# all_possible_combination_retain = all_possible_combination.copy()
# #instrument 1
# all_possible_combination = all_possible_combination_retain[all_possible_combination_retain['Instrument_ID'] == 'Instrument 20']
# solution_copy = solution_retain[solution_retain['Instrument_ID'] == 'Instrument 20']

For categorical features, I could either label encode them or binarize them with one-hot encoding. For Region_ID and Instrument_ID I used one-hot encoding, but District_ID and Hospital_ID were high-cardinality features, so I label encoded them instead.
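A minimal sketch of the two encodings side by side on toy data (the IDs here are illustrative):

```python
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'Region_ID':   ['Region 1', 'Region 2', 'Region 1'],
                   'Hospital_ID': ['Hospital 7', 'Hospital 1', 'Hospital 7']})

# Low-cardinality feature -> one binary column per level via one-hot encoding.
one_hot = pd.get_dummies(df[['Region_ID']])      # 2 columns here

# High-cardinality feature -> a single integer column via label encoding,
# which keeps the feature matrix narrow (1602 hospitals would otherwise
# explode into 1602 one-hot columns).
le = preprocessing.LabelEncoder()
labels = le.fit_transform(df['Hospital_ID'])     # classes are sorted, so [1, 0, 1]
```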

I created data_final and for_reg out of all possible combination dataframe to train my model for classification and regression. Then I created a solution_copy dataframe using solution_from_combination dataframe and solution dataframe for prediction.

In [54]:
cat_type = ['Region_ID','Instrument_ID']
In [55]:
data_final = pd.DataFrame()
In [56]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# Caveat: calling fit_transform separately on each dataframe only yields a
# consistent integer mapping if both contain exactly the same set of IDs;
# fitting the encoder once on the union of values and then transforming
# each dataframe would be safer.
all_possible_combination['Hospital_ID'] = le.fit_transform(all_possible_combination['Hospital_ID'])
for_reg['Hospital_ID'] = le.fit_transform(for_reg['Hospital_ID'])

all_possible_combination['District_ID'] = le.fit_transform(all_possible_combination['District_ID'])
for_reg['District_ID'] = le.fit_transform(for_reg['District_ID'])

# all_possible_combination['Instrument_ID'] = le.fit_transform(all_possible_combination['Instrument_ID'])
# for_reg['Instrument_ID'] = le.fit_transform(for_reg['Instrument_ID'])

# all_possible_combination['Region_ID'] = le.fit_transform(all_possible_combination['Region_ID'])
# for_reg['Region_ID'] = le.fit_transform(for_reg['Region_ID'])
In [57]:
data_final = pd.get_dummies(all_possible_combination[cat_type])
reg_final = pd.get_dummies(for_reg[cat_type])
In [58]:
data_final[cont_type] = all_possible_combination[cont_type]
data_final['Hospital_ID'] = all_possible_combination['Hospital_ID']
data_final['District_ID'] = all_possible_combination['District_ID']
#data_final[cat_type] = all_possible_combination[cat_type]

reg_final[cont_type] = for_reg[cont_type]
#reg_final[cat_type] = for_reg[cat_type]
reg_final['Hospital_ID'] = for_reg['Hospital_ID']
reg_final['District_ID'] = for_reg['District_ID']

I tried experimenting with different algorithms, but there was no improvement. Also, the CV score I was getting on my training dataset was very different from the leaderboard score, so I was not able to properly optimize my model's hyperparameters. This difference also indicated that there were still a few things in the given dataset that I had not understood.

I tried experimenting with various combinations, features and algorithms, but the LB score showed no improvement.
At one point, manually predicting the solution dataset seemed like a better idea, but we were given the dataset to train a model, not to train ourselves.
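One way to at least get a local estimate against the competition metric is stratified cross-validation scored with F1. A sketch on synthetic data (not the competition dataframes), which is roughly what a CV check for the classifier above would look like:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in for the training data.
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                         X, y, cv=cv, scoring='f1')
# scores.mean() gives a local F1 estimate; as noted above, it may still
# diverge badly from the leaderboard score.
```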

In [59]:
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier,GradientBoostingClassifier, RandomForestRegressor
rf = RandomForestClassifier(n_estimators = 500)
rf_reg = RandomForestRegressor(n_estimators = 500)
#gbm = GradientBoostingClassifier(n_estimators = 500)
# xgb = XGBClassifier(n_estimators=180,colsample_bytree=0.2,learning_rate=0.9,max_depth=7,min_child_weight = 10.0,gamma = 0.2,
#                      subsample=0.8,reg_alpha = 0.0006544639254088123,reg_lambda = 0.07683048356144306, objective = 'binary:logistic')
In [60]:
data_final = data_final.fillna(0)
reg_final = reg_final.fillna(0)
for_reg = for_reg.fillna(0)
In [61]:
for_reg['Year Total'] = for_reg['Year Total'] + for_reg['Annual_Projected_Revenue']
In [62]:
rf.fit(data_final, all_possible_combination['Buy_or_Not'])
rf_reg.fit(reg_final, for_reg['Year Total'])
Out[62]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=500, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
In [73]:
predict_final = pd.DataFrame()
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
solution_copy['Hospital_ID'] = le.fit_transform(solution_copy['Hospital_ID'])
solution_copy['District_ID'] = le.fit_transform(solution_copy['District_ID'])
#solution_copy['Region_ID'] = le.fit_transform(solution_copy['Region_ID'])
#solution_copy['Instrument_ID'] = le.fit_transform(solution_copy['Instrument_ID'])

predict_final = pd.get_dummies(solution_copy[cat_type])
#data_sum = predict_final.sum(axis = 1)
#data_mean = predict_final.mean(axis = 1)
# predict_final['sum'] = data_sum
# predict_final['mean'] = data_mean


predict_final[cont_type] = solution_copy[cont_type]
#predict_final[cat_type] = solution_copy[cat_type]
predict_final['Hospital_ID'] = solution_copy['Hospital_ID']
predict_final['District_ID'] = solution_copy['District_ID']
# predict_final['Hospital_employees'] = solution_copy['Hospital_employees']
# predict_final['Hospital_employees'] = predict_final['Hospital_employees'].fillna(-1)
In [74]:
predict_final = predict_final.fillna(0)
In [75]:
# sol_col = rf.predict_proba(predict_final)
# t = pd.DataFrame(sol_col)
# solution['Buy_or_not'] = t[t.columns[0]]<0.3
# solution['Buy_or_not'] = solution['Buy_or_not'].astype(int)
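The commented-out trial above thresholds the class-0 probability at 0.3 instead of using the default 0.5 decision rule; such a cut-off could also be tuned on held-out data to maximize F1. A sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data; the real features/labels come from earlier cells.
X, y = make_classification(n_samples=600, weights=[0.85, 0.15], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]           # P(Buy_or_not == 1)

# Pick the probability cut-off that maximizes validation F1.
thresholds = np.linspace(0.1, 0.9, 17)
best = max(thresholds, key=lambda t: f1_score(y_val, (proba >= t).astype(int)))
```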
In [76]:
solution['Buy_or_not'] = rf.predict(predict_final)
In [78]:
solution['Buy_or_not'].value_counts()
Out[78]:
0    170712
1     23098
Name: Buy_or_not, dtype: int64
In [70]:
# Use .loc instead of chained indexing (avoids SettingWithCopyWarning), and
# predict revenue only for the rows classified as leases.
buy_mask = solution['Buy_or_not'] == 1
solution.loc[buy_mask, 'Revenue'] = rf_reg.predict(predict_final[buy_mask.values])
# solution.loc[buy_mask, 'Revenue'] = 100
solution.loc[~buy_mask, 'Revenue'] = 0
In [71]:
solution.to_csv('final.csv',index = False)
In [72]:
importance = rf.feature_importances_
indices = np.argsort(importance)[::-1]
data_final.columns[indices]
Out[72]:
Index([                   u'rating_8',                   u'D+I_count',
                           u'ratio_2',                     u'ratio_5',
                           u'ratio_1',                   u'H+D_count',
                          u'rating_9',                   u'H+I_count',
                           u'ratio_4',                   u'rating_10',
                       u'District_ID',         u'Instrument_ID_count',
                           u'ratio_6',           u'Hospital_ID_count',
                       u'Hospital_ID',                   u'R+I_count',
        u'Instrument_ID_Instrument 2',          u'Region_ID_Region 1',
                u'Region_ID_Region 2',          u'Region_ID_Region 3',
                u'Region_ID_Region 5', u'Instrument_ID_Instrument 18',
       u'Instrument_ID_Instrument 20',  u'Instrument_ID_Instrument 4',
        u'Instrument_ID_Instrument 1',  u'Instrument_ID_Instrument 5',
                u'Region_ID_Region 4',  u'Instrument_ID_Instrument 7',
        u'Instrument_ID_Instrument 8',  u'Instrument_ID_Instrument 3',
       u'Instrument_ID_Instrument 19', u'Instrument_ID_Instrument 10',
       u'Instrument_ID_Instrument 11',  u'Instrument_ID_Instrument 6',
       u'Instrument_ID_Instrument 15', u'Instrument_ID_Instrument 13',
                u'Region_ID_Region 6'],
      dtype='object')

END NOTE

Using random forest feature importance, I checked the most important features of the dataset and saw that the ratio and rating features were the most helpful ones. I tried a few more features but still was not able to improve my LB score. The solution dataset was heavily imbalanced, containing very few positive results compared to negative ones. I was getting around 23K 1's and 1.7 lakh 0's with an LB score of 0.15, while a friend who was predicting around 10K 1's was getting almost the same LB score. F1 is a tricky metric: if either precision or recall collapses, it drags down the entire result.
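Concretely, F1 is the harmonic mean of precision and recall, so a collapse in either one pulls the whole score down even if the other is perfect. A tiny illustration:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]    # conservative model: few 1's predicted

p  = precision_score(y_true, y_pred)  # 1.0  (every predicted 1 is correct)
r  = recall_score(y_true, y_pred)     # 0.25 (misses 3 of 4 true 1's)
f1 = f1_score(y_true, y_pred)         # 2*p*r/(p+r) = 0.4
```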

I had a few more feature ideas that I was not able to try during the competition. One interesting parameter I never used for classification was revenue. Another feature I never used was hospital employees: it was not helping with prediction, and, as stated earlier, the H-D combinations present in the profiling dataframe were very different from those in the revenue and solution dataframes, so merging them produced a lot of missing values in the hospital-employees column. After a few tries I decided to skip this feature.

In the end, I learned many new things while playing with the data. I faced many new problems while constructing a model-trainable dataset, which pushed me to research a few important new concepts. Training a model on a heavily imbalanced dataset and making good predictions with it is still a big challenge for me.
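For the imbalance problem, one common first step (not used in the submission above) is to reweight classes inversely to their frequency, which random forests support directly. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced stand-in: ~90% negatives, ~10% positives.
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

# class_weight='balanced' weights each class by n_samples / (n_classes * count),
# so errors on the rare positive class cost more during tree construction.
rf = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                            random_state=0).fit(X, y)
```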

 
