Project: Investigate a Dataset - [Medical Appointment No Shows]

Mohamed Ossman

Data Scientist

Data Analyst

Jupyter

Latest commit

History

Introduction

Data Wrangling

Exploratory Data Analysis

Conclusions

In [1]:

# Use this cell to set up import statements for all of the packages that you # plan to use import pandas as pd import numpy as py import matplotlib.pyplot as plt import seaborn as snb # Remember to include a 'magic word' so that your visualizations are plotted # inline with the notebook. See this page for more: # http://ipython.readthedocs.io/en/stable/interactive/magics.html %matplotlib inline

Upgrade pandas to use dataframe.explode() function.

!pip install --upgrade pandas==0.25.0

Data Wrangling

General Properties

In [2]:

# Load your data and print out a few lines. Perform operations to inspect data # types and look for instances of missing or possibly errant data import pandas as pd df=pd.read_csv('noshowappointments-kagglev2-may-2016.csv') df.head()

Out[2]:

PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show 0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 0 0 0 0 No 1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 0 0 0 0 No 2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 0 0 0 0 No 3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0 0 0 0 0 No 4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1 1 0 0 0 No

In [3]:

#explore the shape of data df.shape

Out[3]:

(110527, 14)

In [4]:

#check for dublicates df.duplicated().sum()

Out[4]:

In [5]:

# check if there is a patiennts with the same id df['PatientId'].duplicated().sum()

Out[5]:

48228

In [6]:

#check number of dublicated patients Ids and no show #to check ifthere were patients Ids duplicates but differ in the no show status df.duplicated(['PatientId','No-show']).sum()

Out[6]:

38710

In [ ]:

In [7]:

df.describe()

Out[7]:

PatientId AppointmentID Age Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received count 1.105270e+05 1.105270e+05 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000 110527.000000 mean 1.474963e+14 5.675305e+06 37.088874 0.098266 0.197246 0.071865 0.030400 0.022248 0.321026 std 2.560949e+14 7.129575e+04 23.110205 0.297675 0.397921 0.258265 0.171686 0.161543 0.466873 min 3.921784e+04 5.030230e+06 -1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 25% 4.172614e+12 5.640286e+06 18.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 50% 3.173184e+13 5.680573e+06 37.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 75% 9.439172e+13 5.725524e+06 55.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 max 9.999816e+14 5.790484e+06 115.000000 1.000000 1.000000 1.000000 1.000000 4.000000 1.000000

In [8]:

#identifiying the row index for the age -1 mask=df.query('Age=="-1"') mask

Out[8]:

PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show 99832 4.659432e+14 5775010 F 2016-06-06T08:58:13Z 2016-06-06T00:00:00Z -1 ROMÃO 0 0 0 0 0 0 No

In [9]:

### Data Cleaning

In [10]:

#removing the value -1 df.drop(index=99832,inplace=True)

In [11]:

df.describe()

Out[11]:

PatientId AppointmentID Age Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received count 1.105260e+05 1.105260e+05 110526.000000 110526.000000 110526.000000 110526.000000 110526.000000 110526.000000 110526.000000 mean 1.474934e+14 5.675304e+06 37.089219 0.098266 0.197248 0.071865 0.030400 0.022248 0.321029 std 2.560943e+14 7.129544e+04 23.110026 0.297676 0.397923 0.258266 0.171686 0.161543 0.466874 min 3.921784e+04 5.030230e+06 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 25% 4.172536e+12 5.640285e+06 18.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 50% 3.173184e+13 5.680572e+06 37.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 75% 9.438963e+13 5.725523e+06 55.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 max 9.999816e+14 5.790484e+06 115.000000 1.000000 1.000000 1.000000 1.000000 4.000000 1.000000

In [12]:

#corrections of columns names df.rename(columns={'Hipertension':'Hypertension'},inplace = True) df.rename(columns={'No-show':'No_show'},inplace = True ) df.head()

Out[12]:

PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hypertension Diabetes Alcoholism Handcap SMS_received No_show 0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 0 0 0 0 No 1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 0 0 0 0 No 2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 0 0 0 0 No 3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0 0 0 0 0 No 4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1 1 0 0 0 No

In [13]:

#removing dublictaes Ids with duplicated showing statues ,leaving the dublicated id if they differ in showing statues #as the same patient may have several appointments df.drop_duplicates(['PatientId','No_show'],inplace=True) df.shape

Out[13]:

(71816, 14)

In [14]:

#removing un necessary data df.drop(['PatientId','AppointmentID','ScheduledDay','AppointmentDay'],axis=1,inplace=True) df.head()

Out[14]:

Gender Age Neighbourhood Scholarship Hypertension Diabetes Alcoholism Handcap SMS_received No_show 0 F 62 JARDIM DA PENHA 0 1 0 0 0 0 No 1 M 56 JARDIM DA PENHA 0 0 0 0 0 0 No 2 F 62 MATA DA PRAIA 0 0 0 0 0 0 No 3 F 8 PONTAL DE CAMBURI 0 0 0 0 0 0 No 4 F 56 JARDIM DA PENHA 0 1 1 0 0 0 No

In [15]:

df.describe()

Out[15]:

Age Scholarship Hypertension Diabetes Alcoholism Handcap SMS_received count 71816.000000 71816.000000 71816.000000 71816.000000 71816.000000 71816.000000 71816.000000 mean 36.527501 0.095536 0.195068 0.070959 0.025036 0.020135 0.335566 std 23.378262 0.293956 0.396256 0.256758 0.156236 0.155338 0.472191 min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 25% 17.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 50% 36.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 75% 55.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 max 115.000000 1.000000 1.000000 1.000000 1.000000 4.000000 1.000000

Exploratory Data Analysis

Research Question 1 (explore data and more information and take ageneral look)

In [16]:

df.hist(figsize=(16,6.5));

In [17]:

#dividing the patients into 2 groups according to showning or not the exploring them show=df.No_show=='No' noshow=df.No_show=='Yes' df[show].count(),df[noshow].count()

Out[17]:

(Gender 54153 Age 54153 Neighbourhood 54153 Scholarship 54153 Hypertension 54153 Diabetes 54153 Alcoholism 54153 Handcap 54153 SMS_received 54153 No_show 54153 dtype: int64, Gender 17663 Age 17663 Neighbourhood 17663 Scholarship 17663 Hypertension 17663 Diabetes 17663 Alcoholism 17663 Handcap 17663 SMS_received 17663 No_show 17663 dtype: int64)

In [18]:

df[show].mean(),df[noshow].mean()

Out[18]:

(Age 37.229166 Scholarship 0.091334 Hypertension 0.202944 Diabetes 0.072868 Alcoholism 0.023600 Handcap 0.020904 SMS_received 0.297232 dtype: float64, Age 34.376267 Scholarship 0.108419 Hypertension 0.170922 Diabetes 0.065108 Alcoholism 0.029440 Handcap 0.017777 SMS_received 0.453094 dtype: float64)

investigation for the factors that effect the attendace

In [19]:

plt.figure(figsize=[16,4]) df[show].groupby(['Hypertension','Diabetes']).mean()['Age'].plot(kind='bar',color='blue',label='show') df[noshow].groupby(['Hypertension','Diabetes']).mean()['Age'].plot(kind='bar',color='red',label='noshow') plt.legend(); plt.title('comparison acc.to chronic disesase') plt.xlabel('chronic disease') plt.ylabel('Mean age');

In [20]:

df[show].groupby(['Hypertension','Diabetes']).mean()['Age'],df[noshow].groupby(['Hypertension','Diabetes']).mean()['Age']

Out[20]:

(Hypertension Diabetes 0 0 30.713360 1 53.701370 1 0 60.270517 1 63.764303 Name: Age, dtype: float64, Hypertension Diabetes 0 0 28.768691 1 49.481172 1 0 58.650380 1 62.913282 Name: Age, dtype: float64)

In [21]:

def attendance (df,col_name,attended,absent): plt.figure(figsize=[18,3]) df[col_name][show].hist(alpha=.5,bins=10,color='blue',label='show') df[col_name][noshow].hist(alpha=.5,bins=10,color='red',label='noshow') plt.legend(); plt.title('comparison acc.to age') plt.xlabel('Age') plt.ylabel('Patients Number'); attendance(df,'Age',show,noshow)

In [22]:

#what is the percentage of male and female def attendance (df,col_name,attended,absent): plt.figure(figsize=[12,4]) df[col_name][show].value_counts(normalize=True).plot(kind='pie',label='show') plt.legend(); plt.title('comparison acc.to gender') plt.xlabel('Gender') plt.ylabel('Patients Number'); attendance(df,'Gender',show,noshow)

In [23]:

#does age and chornic disease affect attendance to gether plt.figure(figsize=[12,4]) df[show].groupby('Gender').Age.mean().plot(kind='bar',color='blue',label='show') df[noshow].groupby('Gender').Age.mean().plot(kind='bar',color='red',label='noshow') plt.legend(); plt.title('comparison acc.to age and chronic disesase') plt.xlabel('GENDER') plt.ylabel('Mean age');

In [24]:

print(df[show].groupby(['Gender']).Age.mean(),df[noshow].groupby('Gender').Age.mean(), df[show].groupby(['Gender']).Age.median(),df[noshow].groupby('Gender').Age.median() ) Gender F 39.130292 M 33.766269 Name: Age, dtype: float64 Gender F 36.06501 M 31.22040 Name: Age, dtype: float64 Gender F 39 M 32 Name: Age, dtype: int64 Gender F 34 M 29 Name: Age, dtype: int64

there is no correlation bet. age and gender affect the show rate the mean and the median almost the same

In [25]:

#does receving sms affect the attendance def attendance (df,col_name,attended,absent): plt.figure(figsize=[16,4]) df[col_name][show].hist(alpha=.5,bins=10,color='blue',label='show') df[col_name][noshow].hist(alpha=.5,bins=10,color='red',label='noshow') plt.legend(); plt.title('comparison acc.to reciving SMS') plt.xlabel('SMS') plt.ylabel('Patients Number'); attendance(df,'SMS_received',show,noshow)

number of showig patients whithout receiving sms is greater than showing patients with receving sms

In [26]:

# Continue to explore the data to address your additional research # questions. Add more headers as needed if you have more questions to # investigate.

does nighbourhood affect the attendance

In [27]:

plt.figure(figsize=[14,8]) df.Neighbourhood[show].value_counts().plot(kind='bar',color='blue',label='show') df.Neighbourhood[noshow].value_counts().plot(kind='bar',color='red',label='noshow') plt.legend(); plt.title('comparison acc.to neighbourhood') plt.xlabel('Neighbourhood') plt.ylabel('Patients Number');

sms has response in only 5 neighbourhoods

In [32]:

plt.figure(figsize=[18,3]) df[show].groupby('Neighbourhood').SMS_received.mean().plot(kind='bar',color='blue',label='show') df[noshow].groupby('Neighbourhood').SMS_received.mean().plot(kind='bar',color='red',label='noshow') plt.legend(); plt.title('comparison acc.to neighbourhood,SMS') plt.xlabel('Neighbourhood') plt.ylabel('Patients Number');

sms has response in only 5 neighbourhoods ,ILHAS OCEANICAS DE TRINDADE is the most responsive to the sms

In [31]:

plt.figure(figsize=[18,3]) df[show].groupby('Neighbourhood').Age.mean().plot(kind='bar',color='blue',label='show') df[noshow].groupby('Neighbourhood').Age.mean().plot(kind='bar',color='red',label='noshow') plt.legend(); plt.title('comparison acc.to neighbourhood,SMS') plt.xlabel('Neighbourhood') plt.ylabel('Mean Age');

patients attendance from specific neighbourhood differ according to ages AEROPORTO has the highest age attend

Conclusions

neighbourhood has a great effect on attendance , JARDIM CAMBURI has the greatest number of patients and also the greatest showing rate

number of showing patients from specific neighbourhood affected by receiving sms and age

number of showing patients without receiving sms is greater than showing with receiving sms

Limitations

no clear correlation between showing and gender, chornic disease

Submitting your Project

Like this project

Posted Aug 18, 2023

Dataset Description we have a csv file contain data which we downloaded(no shoes appointment) - InvestigateaDataset/Investigate_a_Dataset.ipynb at main · Ossm…

Likes

Views

Clients

Udacity

Project: Investigate a Dataset - [Medical Appointment No Shows]

Latest commit

History

Table of Contents

Upgrade pandas to use dataframe.explode() function.

Data Wrangling

General Properties

Exploratory Data Analysis

Research Question 1 (explore data and more information and take ageneral look)

investigation for the factors that effect the attendace

does nighbourhood affect the attendance

Conclusions

Limitations

Submitting your Project

Join 50k+ companies and 1M+ independents