Titanic Analysis Project

Ali Hassan


Data Analysis and Modeling Documentation

Introduction

In this analysis report, we will explore and analyze the "Titanic" dataset using R. The goal is to perform various data analysis tasks and answer several questions related to the dataset.

Importing Libraries

We begin by loading the libraries used for data analysis and modeling: dplyr, tidyr, ggplot2, and tidyverse. Loading tidyverse already attaches dplyr, tidyr, and ggplot2, so the first three calls are redundant but harmless.
#Importing Libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(tidyverse)

Loading the Titanic Dataset

We load the Titanic dataset from a CSV file located at "C:\Users\user\Desktop\DSF Project\Titanic.csv" into a variable named Titanic.
# Load the Titanic data set
Titanic <- read.csv("C:\\Users\\user\\Desktop\\DSF Project\\Titanic.csv")
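
By default, read.csv() treats the string NA and blank numeric fields as missing, while blank text fields are read as empty strings. The sketch below illustrates that difference on a small invented CSV written to a temporary file. The column names only loosely mirror the Titanic data, and passing na.strings is an optional variation, not something the original analysis does.

```r
# Toy CSV with one blank Age and one blank Cabin (invented rows)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age,Cabin",
             "A,22,C85",
             "B,,E46",
             "C,30,"), csv_path)

default_read <- read.csv(csv_path)
strict_read  <- read.csv(csv_path, na.strings = c("", "NA"))

# Blank numeric fields are NA either way; blank text fields only with na.strings
colSums(is.na(default_read))
colSums(is.na(strict_read))
```

If blank text cells should count as missing, the na.strings variant makes them visible to is.na() and colSums(is.na(...)).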

Data Pre-processing

Examining the Data

We start by examining the Titanic dataset:
We calculate and display the number of rows and columns in the dataset using the dim() function.
We display the first few rows of the dataset using the head() function.
We display the column names using the colnames() function.
We count the number of missing values in the dataset using sum(is.na(Titanic)).
# Number of rows and columns
dim(Titanic)

# Displaying the first few rows of the data
head(Titanic)

# Displaying the column names
colnames(Titanic)

# Counting the number of missing values
sum(is.na(Titanic))

Handling Missing Values

The dataset contains 87 missing values in total, so dropping every incomplete row outright would discard many rows. Instead, we impute the column responsible for most of them and drop only what remains.
We count the number of missing values in each column using colSums(is.na(Titanic)).
The "Age" column accounts for 86 of the 87 missing values. We replace them with the median of the observed "Age" values (computed with na.rm = TRUE).
We then remove the rows that still contain a missing value (the one remaining NA) using na.omit(Titanic).
# Counting the number of missing values in each column
colSums(is.na(Titanic))

# We can see that the number of missing values for the "Age" column (86) is relatively high.
# Replacing missing values in the "Age" column with the median, ignoring NA values
Titanic$Age[is.na(Titanic$Age)] <- median(Titanic$Age, na.rm = TRUE)

# Removing rows with any missing values
Titanic <- na.omit(Titanic)

# Checking the number of missing values after removing them
colSums(is.na(Titanic))
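
The impute-then-drop sequence above can be sketched on a toy data frame. The values below are invented for illustration and are not taken from the Titanic file.

```r
# Toy data frame (invented values) with missing entries in two columns
toy <- data.frame(Age  = c(20, NA, 40, NA, 30),
                  Fare = c(7.25, 8.05, NA, 13.00, 26.55))

# Impute Age with the median of the observed ages (20, 30, 40)
age_median <- median(toy$Age, na.rm = TRUE)
toy$Age[is.na(toy$Age)] <- age_median

# Drop the rows that still contain an NA (here, the row with missing Fare)
toy_clean <- na.omit(toy)

nrow(toy_clean)        # rows left after na.omit()
sum(is.na(toy_clean))  # missing values left
```

Imputing first keeps the two rows with missing Age; only the row with a missing Fare is dropped.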

Data Structure and Summary Statistics

We check the number of missing values again to ensure that there are none.
We display the structure of the modified dataset using str(Titanic).
We provide summary statistics for the modified dataset using summary(Titanic).
# Displaying the structure of the modified data set
str(Titanic)

# Displaying summary statistics of the modified data set
summary(Titanic)

Prepare the Data for Modeling

Feature Selection and Transformation

We select relevant features (Age, Sex, Pclass, SibSp, Survived, Parch) from the modified dataset into a variable named features.
We convert the "Sex" variable to a factor using as.factor().
We then encode "Sex" numerically from its factor levels (female = 1, male = 2). This is an integer coding rather than a set of 0/1 dummy variables.
# Select the relevant features and target variable
features <- Titanic %>%
  select(Age, Sex, Pclass, SibSp, Survived, Parch)

# Convert categorical variables to factor
features$Sex <- as.factor(features$Sex)

# Encode Sex as numeric codes (female = 1, male = 2)
features <- features %>%
  mutate(Sex = as.numeric(factor(Sex, levels = c("female", "male"))))
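
To make the encoding concrete, the sketch below compares the level-based integer codes used in this report with a 0/1 indicator, on a short invented vector. The name sex_male is hypothetical and does not appear in the analysis.

```r
sex <- c("male", "female", "female", "male")  # invented values

# Integer codes from factor levels: female = 1, male = 2 (as in the report)
sex_codes <- as.numeric(factor(sex, levels = c("female", "male")))

# A 0/1 indicator: 1 for male, 0 for female
sex_male <- as.integer(sex == "male")

data.frame(sex, sex_codes, sex_male)
```

For a two-level variable in a linear or logistic model, the two encodings differ only by a shift of one, which changes the intercept but not the coefficient on the variable.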

Data Splitting

We set the random seed for reproducibility using set.seed(123).
We split the data into training and testing sets, with 70% of the data in the training set and 30% in the testing set. The indices for splitting are generated using sample().
# Split the data into training and testing sets
set.seed(123)
# floor() makes the training-set size an explicit whole number
train_indices <- sample(nrow(features), floor(nrow(features) * 0.7))
train_data <- features[train_indices, ]
test_data <- features[-train_indices, ]
train_target <- Titanic$Survived[train_indices]
test_target <- Titanic$Survived[-train_indices]
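
The same index-based split can be checked on a toy data frame. The sketch below uses invented rows and hypothetical names (toy, train_idx) to confirm that the two subsets together cover every row without overlap.

```r
set.seed(123)
n <- 10  # toy number of rows
toy <- data.frame(id = seq_len(n), y = rep(c(0, 1), length.out = n))

# Draw about 70% of the row indices for training
train_idx <- sample(n, floor(0.7 * n))
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]   # negative indexing keeps the remaining rows

# Every row lands in exactly one of the two sets
nrow(train_toy) + nrow(test_toy)
length(intersect(train_toy$id, test_toy$id))
```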

Linear Regression

Model Building

We build a linear regression model to predict "Survived" using all available features.
The model is constructed using lm().
# Linear Regression
linear_model <- lm(Survived ~ ., data = train_data)
# Predict on the test set with the linear model, then threshold at 0.5
linear_predictions <- predict(linear_model, newdata = test_data)
linear_predictions <- ifelse(linear_predictions > 0.5, 1, 0)

Model Evaluation

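We make predictions on the test data using predict(), and we classify predictions based on a threshold of 0.5.
We calculate evaluation metrics on these thresholded 0/1 predictions: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (computed here as the squared correlation between actual and predicted values). Because both vectors contain only 0s and 1s, the MSE equals the misclassification rate. We also compute the accuracy.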
# Evaluate the model
mse <- mean((test_target - linear_predictions)^2)
rmse <- sqrt(mse)
r_squared <- cor(test_target, linear_predictions)^2

# Print the evaluation metrics
print(paste("Mean Squared Error (MSE):", mse))
print(paste("Root Mean Squared Error (RMSE):", rmse))
print(paste("R-squared:", r_squared))

# Display the summary of the linear regression model
summary(linear_model)
# Calculate accuracy
lin_accuracy <- mean(linear_predictions == test_target)
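
A short worked example shows why the MSE of thresholded predictions carries the same information as accuracy. The two vectors below are invented for illustration.

```r
actual    <- c(0, 1, 1, 0, 1, 0, 0, 1)  # invented labels
predicted <- c(0, 1, 0, 0, 1, 1, 0, 1)  # invented 0/1 predictions

mse      <- mean((actual - predicted)^2)
accuracy <- mean(actual == predicted)

# For 0/1 vectors each squared error is 1 on a mismatch and 0 otherwise,
# so the MSE is the share of mismatches
mse
1 - accuracy
```

Two of the eight predictions are wrong, so both expressions evaluate to 0.25.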

Visualization

We create a data frame with actual and predicted values.
# Create a data frame with predicted and actual values
plot_data <- data.frame(Actual = test_target, Predicted = linear_predictions)
Scatter Plot: We generate a scatter plot with a best-fit line to visualize predicted vs. actual survival.
# Create a scatter plot with a best-fit line
ggplot(plot_data, aes(x = Actual, y = Predicted)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(x = "Actual Survived", y = "Predicted Survived") +
  ggtitle("Linear Regression Model: Predicted vs Actual Survival")
Scatter Plot

Logistic Regression

Model Building

We build a logistic regression model to predict "Survived" using all available features.
The model is constructed using glm() with a binomial family and a maximum of 100 iterations.
We make predictions on the test data using predict(), and we classify predictions based on a threshold of 0.5.
# Logistic Regression
logistic_model <- glm(Survived ~ ., data = train_data, family = "binomial", maxit = 100)
logistic_predictions <- predict(logistic_model, newdata = test_data, type = "response")
logistic_predictions <- ifelse(logistic_predictions > 0.5, 1, 0)
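
One reason to prefer logistic regression for a 0/1 outcome is that its fitted values are probabilities, whereas a straight-line fit can leave the 0 to 1 range. The sketch below illustrates this on invented toy data. The names toy, new_x, lm_fit and glm_fit are hypothetical and separate from the models in this report.

```r
# Invented, non-separable toy data: one predictor and a 0/1 outcome
toy <- data.frame(x = 1:10, y = c(0, 0, 1, 0, 0, 1, 1, 0, 1, 1))
new_x <- data.frame(x = c(-20, 30))  # values far outside the observed range

lm_fit  <- lm(y ~ x, data = toy)
glm_fit <- glm(y ~ x, data = toy, family = "binomial")

lm_pred  <- predict(lm_fit,  newdata = new_x)
glm_pred <- predict(glm_fit, newdata = new_x, type = "response")

lm_pred   # straight line: can fall below 0 or above 1
glm_pred  # fitted probabilities: stay between 0 and 1
```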

Model Evaluation

We calculate the confusion matrix for the logistic regression predictions.
We compute the accuracy of the logistic regression model using the confusion matrix.
We print the accuracy and provide a summary of the logistic regression model using summary().
# Calculate the confusion matrix
confusion_matrix_log <- table(logistic_predictions, test_target)
# Calculate accuracy
log_accuracy <- sum(diag(confusion_matrix_log)) / sum(confusion_matrix_log)

# Print the evaluation metric
print(paste("Accuracy:", log_accuracy))
summary(logistic_model)
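
Accuracy computed from diag() assumes the confusion matrix is a full 2 x 2 table. A minimal sketch on invented vectors shows one way to guarantee that shape by fixing the factor levels, and how sensitivity and specificity can be read from the same table. These two extra metrics are an illustration and are not part of the original evaluation.

```r
actual    <- c(0, 1, 1, 0, 1, 0, 0, 1)  # invented labels
predicted <- c(0, 1, 0, 0, 1, 1, 0, 1)  # invented 0/1 predictions

# Fixing the levels keeps the table 2 x 2 even if a class is never predicted
cm <- table(Predicted = factor(predicted, levels = c(0, 1)),
            Actual    = factor(actual,    levels = c(0, 1)))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # share of actual 1s predicted as 1
specificity <- cm["0", "0"] / sum(cm[, "0"])  # share of actual 0s predicted as 0

cm
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```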

Visualization of Classification Results

We create a data frame containing actual and predicted values.
# Create a data frame with predicted and actual values
plot_data <- data.frame(Actual = test_target, Predicted = logistic_predictions)
Count Plot: We generate a count plot showing the count of passengers by "Pclass" and "Survived."
# Create a count plot
ggplot(Titanic, aes(x = factor(Pclass), fill = factor(Survived))) +
  geom_bar() +
  labs(x = "Pclass", y = "Count") +
  ggtitle("Count of Passengers by Pclass and Survival in Titanic Dataset") +
  scale_fill_manual(values = c("#FF0000", "#0000FF"), labels = c("No", "Yes"))
Count Plot
Box Plot: Distribution of Age by Survival
# Create a box plot
ggplot(Titanic, aes(x = factor(Survived), y = Age, fill = factor(Survived))) +
  geom_boxplot() +
  labs(x = "Survived", y = "Age") +
  ggtitle("Distribution of Age by Survival in Titanic Dataset") +
  scale_fill_manual(values = c("#FF0000", "#0000FF"), labels = c("No", "Yes"))
Box Plot

K-Means Clustering

Model Building

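We perform k-means clustering with k=3 clusters on the selected features, leaving out "Survived" so that the clusters are formed without the target.
We assign cluster labels to the data points and, as a rough heuristic, treat cluster 2 as "did not survive" and the other clusters as "survived". Cluster numbering is arbitrary, so the resulting accuracy is only indicative.
# K-Means Clustering
set.seed(123) # kmeans() starts from randomly chosen centers
k <- 3 # Number of clusters
# Cluster on the predictors only; Survived is held back for comparison
kmeans_model <- kmeans(select(features, -Survived), centers = k)
kmeans_clusters <- kmeans_model$cluster
kmeans_y_pred_class <- ifelse(kmeans_clusters == 2, 0, 1)
# Calculate accuracy against the labels of the same rows that were clustered
kmeans_accuracy <- mean(kmeans_y_pred_class == features$Survived)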
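
Because k-means numbers its clusters arbitrarily, a fixed rule such as "cluster 2 means 0" may not line up with the classes. One alternative, sketched below on invented one-dimensional data, maps each cluster to its most common label. This is not the method used in this report, and since the mapping uses the true labels, the resulting accuracy is optimistic.

```r
set.seed(123)
# Two well-separated invented groups in one dimension, with known labels
x      <- c(1.0, 1.2, 0.8, 1.1, 9.0, 9.3, 8.7, 9.1)
labels <- c(0, 0, 0, 0, 1, 1, 1, 1)

km <- kmeans(matrix(x, ncol = 1), centers = 2, nstart = 10)

# Cluster ids are arbitrary, so map each cluster to its most common label
majority <- tapply(labels, km$cluster,
                   function(l) as.numeric(names(which.max(table(l)))))
predicted <- unname(majority[as.character(km$cluster)])

mean(predicted == labels)
```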

Visualization

We add cluster labels to the Titanic dataset and create scatter plots, histograms, and box plots to visualize clusters and age distribution.
# Add cluster labels to the Titanic dataset
Titanic$cluster <- as.factor(kmeans_clusters)

# kmeans() was run on every row of features, which has the same rows as Titanic,
# so the cluster labels line up with the rows of Titanic and no length adjustment is needed.
K-means Clustering: Age vs. Fare
# Plot the clusters
ggplot(Titanic, aes(x = Age, y = Fare, color = cluster)) +
  geom_point() +
  labs(x = "Age", y = "Fare") +
  ggtitle("K-means Clustering: Age vs Fare")
Clustering by scatter plot
Histogram of Age
# Plot histogram of age
ggplot(Titanic, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(x = "Age", y = "Count") +
  ggtitle("Histogram of Age")
Histogram Plot
Box Plot of Age by Survival
# Plot box plot of age by survival
ggplot(Titanic, aes(x = factor(Survived), y = Age)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(x = "Survived", y = "Age") +
  ggtitle("Box Plot of Age by Survived")
Box Plot

Model Comparison Visualization

We compare the performance of the three models (Logistic Regression, Linear Regression, and K-Means Clustering) by calculating and visualizing their accuracies in a bar chart using ggplot2.
# Model Comparison Visualization
model_names <- c("Logistic Regression", "Linear Regression", "K-Means Clustering")
accuracies <- c(log_accuracy, lin_accuracy, kmeans_accuracy)

result_df <- data.frame(Model = model_names, Accuracy = accuracies)

ggplot(result_df, aes(x = Model, y = Accuracy)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Model Comparison", y = "Accuracy", x = "Model") +
  theme_minimal()
Models Comparison

Conclusion

In this analysis, we framed survival prediction as the predictive problem for the "Titanic" dataset and explored the data to gain insights. We cleaned and prepared the dataset by imputing missing ages, dropping the remaining incomplete rows, and encoding "Sex" numerically. We then fitted a linear regression model and visualized its predictions with a scatter plot. We also performed classification with logistic regression and clustering with K-means. Finally, we compared the three models by accuracy in a bar chart and used count plots and box plots to explore how survival relates to passenger class and age.