FoodHub Data Analysis¶
Context¶
The number of restaurants in New York is increasing day by day. Lots of students and busy professionals rely on those restaurants due to their hectic lifestyles. Online food delivery service is a great option for them. It provides them with good food from their favorite restaurants. A food aggregator company FoodHub offers access to multiple restaurants through a single smartphone app.
The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, they confirm the pick-up in the app and travel to the customer's location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.
Objective¶
The food aggregator company has stored the data of the different orders made by the registered customers in their online portal. They want to analyze the data to get a fair idea about the demand of different restaurants which will help them in enhancing their customer experience. Suppose you are hired as a Data Scientist in this company and the Data Science team has shared some of the key questions that need to be answered. Perform the data analysis to find answers to these questions that will help the company to improve the business.
Data Description¶
The data contains the different data related to a food order. The detailed data dictionary is given below.
Data Dictionary¶
- order_id: Unique ID of the order
- customer_id: ID of the customer who ordered the food
- restaurant_name: Name of the restaurant
- cuisine_type: Cuisine ordered by the customer
- cost: Cost of the order
- day_of_the_week: Indicates whether the order is placed on a weekday or weekend (The weekday is from Monday to Friday and the weekend is Saturday and Sunday)
- rating: Rating given by the customer out of 5
- food_preparation_time: Time (in minutes) taken by the restaurant to prepare the food. This is calculated by taking the difference between the timestamps of the restaurant's order confirmation and the delivery person's pick-up confirmation.
- delivery_time: Time (in minutes) taken by the delivery person to deliver the food package. This is calculated by taking the difference between the timestamps of the delivery person's pick-up confirmation and drop-off information
Let us start by importing the required libraries¶
# import libraries for data manipulation
import numpy as np
import pandas as pd
# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
Understanding the structure of the data¶
# read the data
df = pd.read_csv('foodhub_order.csv')
# returns the first 5 rows
df.head()
| order_id | customer_id | restaurant_name | cuisine_type | cost_of_the_order | day_of_the_week | rating | food_preparation_time | delivery_time | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1477147 | 337525 | Hangawi | Korean | 30.75 | Weekend | Not given | 25 | 20 |
| 1 | 1477685 | 358141 | Blue Ribbon Sushi Izakaya | Japanese | 12.08 | Weekend | Not given | 25 | 23 |
| 2 | 1477070 | 66393 | Cafe Habana | Mexican | 12.23 | Weekday | 5 | 23 | 28 |
| 3 | 1477334 | 106968 | Blue Ribbon Fried Chicken | American | 29.20 | Weekend | 3 | 25 | 15 |
| 4 | 1478249 | 76942 | Dirty Bird to Go | American | 11.59 | Weekday | 4 | 25 | 24 |
Observations:¶
The DataFrame has 9 columns as mentioned in the Data Dictionary. Data in each row corresponds to the order placed by a customer.
How many rows and columns are present in the data?¶
#printing shape of the dataframe
print(df.shape)
(1898, 9)
Observations:¶
There are 1898 rows and 9 columns present in the dataset
What are the datatypes of the different columns in the dataset?¶
# Using info function to print summary of dataframe
print(df.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1898 entries, 0 to 1897 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 1898 non-null int64 1 customer_id 1898 non-null int64 2 restaurant_name 1898 non-null object 3 cuisine_type 1898 non-null object 4 cost_of_the_order 1898 non-null float64 5 day_of_the_week 1898 non-null object 6 rating 1898 non-null object 7 food_preparation_time 1898 non-null int64 8 delivery_time 1898 non-null int64 dtypes: float64(1), int64(4), object(4) memory usage: 133.6+ KB None
Observations:¶
Based on the output from the info function, there are 4 int columns, 1 float column, and 4 object columns.
Are there any missing values in the data?¶
#printing summary table of null values
print(df.isnull().sum())
order_id 0 customer_id 0 restaurant_name 0 cuisine_type 0 cost_of_the_order 0 day_of_the_week 0 rating 0 food_preparation_time 0 delivery_time 0 dtype: int64
Observations:¶
There are no null values in the dataset - therefore it's not necessary to fill or drop any rows.
Check the statistical summary of the data¶
#describing only the food_preparation_time column
print(df['food_preparation_time'].describe())
count 1898.000000 mean 27.371970 std 4.632481 min 20.000000 25% 23.000000 50% 27.000000 75% 31.000000 max 35.000000 Name: food_preparation_time, dtype: float64
Observations:¶
Based on the output from the describe function on the food_preparation_time column, we can extract the following insights (including both the mean and median food preparation times):
- minimum preparation time is 20 minutes
- maximum preparation time is 35 minutes
- mean preparation time is 27.37 minutes
- median preparation time is 27 minutes
How many orders are not rated?¶
#summing the number of rows in the dataframe where rating is not given
total_count = ((df['rating']=='Not given').sum())
#calculating as a percentage
percent_count = round(100*((df['rating']=='Not given').mean()),2)
#printing answers in context
print(f"total orders with no rating given: {total_count}")
print(f"percentage of no rating orders: {percent_count}%")
total orders with no rating given: 736 percentage of no rating orders: 38.78%
Observations:¶
Taking the sum of the orders with no rating given, we can see there are 736 orders with no rating; comparing that to the total number of orders in the dataset, this amounts to 38.78% of the dataset.
Exploratory Data Analysis (EDA)¶
Univariate Analysis¶
Explore all the variables and provide observations on their distributions¶
#calculating total time from order placed to food delivered, for univariate analysis
df['total_time'] = df['food_preparation_time']+df['delivery_time']
#separating dataframe columns into numerical and object
#numerical columns (excluding order id, customer id)
num_cols = ['cost_of_the_order','food_preparation_time',
'delivery_time','total_time']
#describing numerical columns
print(df[num_cols].describe())
#object columns (excluding restaurant name)
obj_cols = ['cuisine_type','day_of_the_week','rating']
#looping numerical columns to show histogram distribution
print('\033[1mUnivariate analysis of numerical columns:\033[0m\n')
for x in num_cols:
#setting number of bins to 10 after trial and error
df.hist(x,bins=5)
plt.show()
#looping object (categorical) columns to show countplot distribution
print('\033[1m\nUnivariate analysis of object (categorical) columns:\033[0m\n')
for x in obj_cols:
#showing countplot of categories
sns.countplot(df[x])
plt.show()
#printing table of values
print(df[x].value_counts())
cost_of_the_order food_preparation_time delivery_time total_time
count 1898.000000 1898.000000 1898.000000 1898.000000
mean 16.498851 27.371970 24.161749 51.533720
std 7.483812 4.632481 4.972637 6.833603
min 4.470000 20.000000 15.000000 35.000000
25% 12.080000 23.000000 20.000000 47.000000
50% 14.140000 27.000000 25.000000 52.000000
75% 22.297500 31.000000 28.000000 56.000000
max 35.410000 35.000000 33.000000 68.000000
Univariate analysis of numerical columns:
Univariate analysis of object (categorical) columns:
cuisine_type American 584 Japanese 470 Italian 298 Chinese 215 Mexican 77 Indian 73 Middle Eastern 49 Mediterranean 46 Thai 19 French 18 Southern 17 Korean 13 Spanish 12 Vietnamese 7 Name: count, dtype: int64
day_of_the_week Weekend 1351 Weekday 547 Name: count, dtype: int64
rating Not given 736 5 588 4 386 3 188 Name: count, dtype: int64
Observations:¶
For this univariate analysis, the columns are separated into numerical and object columns for more specific analysis - a histogram function is applied to the numerical columns, while applying countplots to the object columns and printing the counts of instances by category. Beginning with observations from the numerical columns:
- The histogram of order cost shows that orders are most often between 10.65 and 16.85 dollars, with over 800 instances of this price range in the dataset.
- When plotting food preparation time in a 5 bin histogram, there is a noticable spike in food preparation time in the 32-35 min bucket, with the majority of datapoints found in this bucket.
- Delivery time also sees a noticable spike in the 25.8-29.4 minute bucket, with almost 600 instances in the data falling within this bucket.
- Total time from ordering to delivery is roughly bell-shaped, and could almost be approximated as normally distributed. The most populous bucket in the total time column is the 48.8-54.8 minute bucket.
Turning to the object columns, some interesting observations can be seen:
- American cuisine is the most common type of order in the dataset (584 instances), followed by Japanese (470 instances), Italian (298 instances), and Chinese (215 instances).
- Foodhub orders on the weekend are far more common than orders during the weekday; weekend orders account for 1351 instances in the dataset, compared with 547 instances on weekdays. Over 71% of orders in the dataset are ordered over the weekend.
- The majority of orders in the dataset have no rating (736 instances); of the orders that are rated, the most common rating is 5 stars (588 instances), followed by 4 stars (386 instances), and 3 stars (188 instances). There are no orders with ratings of 1 or 2 stars.
Which are the top 5 restaurants in terms of the number of orders received?¶
#Getting the value counts for the top 5 restaurants, using value_counts and head functions
restaurant_counts = df['restaurant_name'].value_counts().head()
#printing the restaurant counts
print(restaurant_counts)
restaurant_name Shake Shack 219 The Meatball Shop 132 Blue Ribbon Sushi 119 Blue Ribbon Fried Chicken 96 Parm 68 Name: count, dtype: int64
Observations:¶
As can be seen from the output above, the top 5 restaurants in the dataset by order count are:
- Shake Shack (219)
- The Meatball Shop (132)
- Blue Ribbon Sushi (119)
- Blue Ribbon Fried Chicken (96)
- Parm (68)
Which is the most popular cuisine on weekends?¶
#Filtering the dataset for weekend entries, getting the cuisine type column, counting values
cuisine_wknd_counts = df[df['day_of_the_week'] == 'Weekend']['cuisine_type'].value_counts()
#getting most popular cuisine type on weekends using .head(1)
cuisine_wknd_counts.head(1)
cuisine_type American 415 Name: count, dtype: int64
Observations:¶
Restricting the dataset to Weekend-only inputs and counting the top cuisine type in the dataset, we can see the most popular food type on weekends is American food.
What percentage of the orders cost more than 20 dollars?¶
#Calculating percentage of orders greater than 20 dollars
percent_gt_20dollars = round(100*(df['cost_of_the_order'] > 20).mean(),2)
#Printing answer in context
print(f"Percentage total of order greater than 20 dollars: {percent_gt_20dollars}%")
Percentage total of order greater than 20 dollars: 29.24%
Observations:¶
29.24% of the orders in the dataset are greater than 20 dollars.
What is the mean order delivery time?¶
#Calculating mean delivery time
mean_delivery_time = round(df['delivery_time'].mean(),2)
#Printing answer in context
print(f"mean delivery time: {mean_delivery_time} minutes")
mean delivery time: 24.16 minutes
Observations:¶
The mean delivery time is 24.16 minutes.
The company has decided to give 20% discount vouchers to the top 3 most frequent customers. Find the IDs of these customers and the number of orders they placed.¶
#Getting top 3 customers using value counts, taking top 3 using .head(3)
customer_counts = df['customer_id'].value_counts().head(3)
#printing answer
print(customer_counts)
customer_id 52832 13 47440 10 83287 9 Name: count, dtype: int64
Observations:¶
The top 3 customer IDs in the dataset are as follows:
- Customer 52832: 13 orders
- Customer 47440: 10 orders
- Customer 83287: 9 orders
Multivariate Analysis¶
Perform a multivariate analysis to explore relationships between the important variables in the dataset¶
#Showing pairplot of numerical variables in dataframe
sns.pairplot(df[num_cols])
plt.show()
# showing correlation matrix
corr_matrix = df[num_cols].corr().round(2)
print(corr_matrix)
#describing numerical columns of whole dataset for reference
print(df[num_cols].describe())
#using for loops to show boxplots of numerical variables by categorical variables
for cat in obj_cols: #categorical variable
for num_var in num_cols: #numerical variable
#printing variables to keep track of plots
print(f'\n\033[1mCategory Column: {cat}\nNumerical Column: {num_var}\033[0m')
#creating boxplot of cat and num variables
sns.boxplot(x=cat,y=num_var,data=df)
#setting xtick rotation to 45 degrees for legibility
plt.xticks(rotation=45, ha='right')
plt.show()
#printing summary statistics of numerical variables by categorical variables
stats = df.groupby(cat)[num_var].describe().round(2)
print(f'\n\033[1mStatistics table of {num_var} against {cat}:\033[0m\n')
print(stats)
#using crosstab to show relationship between categorical variables
i = 0
while i < len(obj_cols):
j = 0 # reset inner loop for outer loop iterations
while j < len(obj_cols):
if i != j: # avoid same column comparison
contingency_table = pd.crosstab(df[obj_cols[i]], df[obj_cols[j]])
# plotting heatmap, setting heatmap formatting to decimal (avoid scientific notation)
sns.heatmap(contingency_table, annot=True, cmap='Oranges',fmt='d')
plt.title(f'Contingency Table: {obj_cols[i]} vs {obj_cols[j]}')
plt.show()
j += 1
i += 1
cost_of_the_order food_preparation_time \
cost_of_the_order 1.00 0.04
food_preparation_time 0.04 1.00
delivery_time -0.03 0.01
total_time 0.01 0.69
delivery_time total_time
cost_of_the_order -0.03 0.01
food_preparation_time 0.01 0.69
delivery_time 1.00 0.74
total_time 0.74 1.00
cost_of_the_order food_preparation_time delivery_time total_time
count 1898.000000 1898.000000 1898.000000 1898.000000
mean 16.498851 27.371970 24.161749 51.533720
std 7.483812 4.632481 4.972637 6.833603
min 4.470000 20.000000 15.000000 35.000000
25% 12.080000 23.000000 20.000000 47.000000
50% 14.140000 27.000000 25.000000 52.000000
75% 22.297500 31.000000 28.000000 56.000000
max 35.410000 35.000000 33.000000 68.000000
Category Column: cuisine_type
Numerical Column: cost_of_the_order
Statistics table of cost_of_the_order against cuisine_type: count mean std min 25% 50% 75% max cuisine_type American 584.0 16.32 7.46 4.71 12.08 14.12 22.26 33.18 Chinese 215.0 16.31 7.63 4.75 12.10 14.12 21.88 34.19 French 18.0 19.79 6.51 11.98 13.05 20.47 23.66 29.25 Indian 73.0 16.92 7.90 5.34 11.64 16.15 24.20 33.03 Italian 298.0 16.42 7.27 5.05 12.08 14.48 22.26 33.03 Japanese 470.0 16.30 7.42 4.47 12.08 14.07 21.83 33.37 Korean 13.0 14.00 7.56 5.77 11.84 12.18 14.02 30.75 Mediterranean 46.0 15.47 7.60 5.67 12.09 14.00 18.78 35.41 Mexican 77.0 16.93 7.56 4.85 12.08 15.57 22.36 33.32 Middle Eastern 49.0 18.82 8.09 5.77 12.95 16.30 25.27 32.93 Southern 17.0 19.30 7.97 7.38 12.47 16.11 29.15 31.43 Spanish 12.0 18.99 6.66 12.13 12.92 16.52 24.49 29.10 Thai 19.0 19.21 7.58 6.69 12.71 19.35 24.20 32.93 Vietnamese 7.0 12.88 5.32 6.01 10.02 12.13 14.87 22.26 Category Column: cuisine_type Numerical Column: food_preparation_time
Statistics table of food_preparation_time against cuisine_type: count mean std min 25% 50% 75% max cuisine_type American 584.0 27.44 4.66 20.0 23.00 27.0 31.0 35.0 Chinese 215.0 27.51 4.76 20.0 23.50 27.0 32.0 35.0 French 18.0 26.89 5.04 21.0 22.50 25.5 31.0 35.0 Indian 73.0 27.11 4.62 20.0 23.00 27.0 31.0 35.0 Italian 298.0 27.48 4.52 20.0 23.00 28.0 31.0 35.0 Japanese 470.0 27.51 4.60 20.0 24.00 27.5 32.0 35.0 Korean 13.0 25.46 3.97 20.0 23.00 25.0 26.0 33.0 Mediterranean 46.0 27.00 4.91 20.0 23.00 27.0 31.0 35.0 Mexican 77.0 26.73 4.60 20.0 23.00 26.0 31.0 35.0 Middle Eastern 49.0 26.67 4.01 20.0 24.00 27.0 30.0 34.0 Southern 17.0 27.59 5.52 20.0 24.00 26.0 33.0 35.0 Spanish 12.0 26.92 4.50 20.0 23.75 27.5 28.5 35.0 Thai 19.0 27.32 5.50 21.0 22.00 28.0 32.5 35.0 Vietnamese 7.0 25.71 5.15 20.0 21.50 25.0 29.5 33.0 Category Column: cuisine_type Numerical Column: delivery_time
Statistics table of delivery_time against cuisine_type: count mean std min 25% 50% 75% max cuisine_type American 584.0 24.19 4.83 15.0 20.00 25.0 28.00 33.0 Chinese 215.0 23.86 4.97 15.0 20.00 24.0 28.00 33.0 French 18.0 25.33 3.74 17.0 23.25 27.0 28.75 29.0 Indian 73.0 24.08 5.05 15.0 20.00 25.0 29.00 32.0 Italian 298.0 24.57 5.18 15.0 20.00 26.0 28.75 33.0 Japanese 470.0 24.13 4.99 15.0 20.00 25.0 28.00 33.0 Korean 13.0 20.92 4.03 16.0 17.00 20.0 26.00 26.0 Mediterranean 46.0 23.59 5.42 15.0 19.00 24.0 28.75 33.0 Mexican 77.0 24.39 4.56 16.0 21.00 25.0 28.00 33.0 Middle Eastern 49.0 24.08 5.66 15.0 19.00 25.0 29.00 33.0 Southern 17.0 23.82 5.96 15.0 21.00 25.0 28.00 33.0 Spanish 12.0 23.42 4.81 17.0 19.75 23.0 28.25 30.0 Thai 19.0 23.16 4.99 15.0 19.50 24.0 26.50 32.0 Vietnamese 7.0 26.14 3.89 19.0 25.00 26.0 28.50 31.0 Category Column: cuisine_type Numerical Column: total_time
Statistics table of total_time against cuisine_type: count mean std min 25% 50% 75% max cuisine_type American 584.0 51.63 6.62 35.0 47.00 51.0 56.00 68.0 Chinese 215.0 51.37 7.36 35.0 46.00 51.0 56.50 68.0 French 18.0 52.22 6.03 43.0 49.00 50.5 56.75 64.0 Indian 73.0 51.19 6.52 36.0 47.00 51.0 56.00 65.0 Italian 298.0 52.05 6.76 35.0 47.25 53.0 56.00 67.0 Japanese 470.0 51.64 6.99 36.0 47.00 52.0 56.00 68.0 Korean 13.0 46.38 4.52 38.0 44.00 45.0 51.00 52.0 Mediterranean 46.0 50.59 7.21 36.0 46.25 50.0 55.00 64.0 Mexican 77.0 51.12 5.89 36.0 48.00 51.0 55.00 65.0 Middle Eastern 49.0 50.76 7.22 36.0 47.00 53.0 56.00 67.0 Southern 17.0 51.41 8.60 37.0 46.00 53.0 56.00 68.0 Spanish 12.0 50.33 8.06 38.0 45.25 49.0 57.25 64.0 Thai 19.0 50.47 7.15 37.0 47.00 49.0 57.00 61.0 Vietnamese 7.0 51.86 4.74 46.0 48.50 51.0 55.00 59.0 Category Column: day_of_the_week Numerical Column: cost_of_the_order
Statistics table of cost_of_the_order against day_of_the_week: count mean std min 25% 50% 75% max day_of_the_week Weekday 547.0 16.31 7.48 4.75 12.08 14.07 22.28 35.41 Weekend 1351.0 16.57 7.49 4.47 12.08 14.41 22.28 33.37 Category Column: day_of_the_week Numerical Column: food_preparation_time
Statistics table of food_preparation_time against day_of_the_week: count mean std min 25% 50% 75% max day_of_the_week Weekday 547.0 27.21 4.71 20.0 23.0 27.0 31.0 35.0 Weekend 1351.0 27.44 4.60 20.0 23.0 27.0 31.0 35.0 Category Column: day_of_the_week Numerical Column: delivery_time
Statistics table of delivery_time against day_of_the_week: count mean std min 25% 50% 75% max day_of_the_week Weekday 547.0 28.34 2.89 24.0 26.0 28.0 31.0 33.0 Weekend 1351.0 22.47 4.63 15.0 18.5 22.0 27.0 30.0 Category Column: day_of_the_week Numerical Column: total_time
Statistics table of total_time against day_of_the_week: count mean std min 25% 50% 75% max day_of_the_week Weekday 547.0 55.55 5.57 44.0 51.0 55.0 59.5 68.0 Weekend 1351.0 49.91 6.62 35.0 45.0 50.0 55.0 65.0 Category Column: rating Numerical Column: cost_of_the_order
Statistics table of cost_of_the_order against rating: count mean std min 25% 50% 75% max rating 3 188.0 16.22 7.84 4.75 12.04 14.12 21.83 33.37 4 386.0 16.71 7.63 4.75 12.13 14.16 22.79 35.41 5 588.0 16.97 7.45 4.47 12.13 15.20 22.80 33.03 Not given 736.0 16.09 7.33 4.66 11.83 14.07 21.50 34.19 Category Column: rating Numerical Column: food_preparation_time
Statistics table of food_preparation_time against rating: count mean std min 25% 50% 75% max rating 3 188.0 27.43 4.77 20.0 23.0 27.0 32.00 35.0 4 386.0 27.40 4.62 20.0 23.0 27.0 31.75 35.0 5 588.0 27.36 4.69 20.0 23.0 27.5 32.00 35.0 Not given 736.0 27.36 4.56 20.0 23.0 27.0 31.00 35.0 Category Column: rating Numerical Column: delivery_time
Statistics table of delivery_time against rating: count mean std min 25% 50% 75% max rating 3 188.0 24.56 4.65 15.0 21.0 25.0 28.0 33.0 4 386.0 23.87 4.92 15.0 20.0 24.0 28.0 33.0 5 588.0 24.21 5.02 15.0 20.0 25.0 28.0 33.0 Not given 736.0 24.17 5.04 15.0 20.0 25.0 28.0 33.0 Category Column: rating Numerical Column: total_time
Statistics table of total_time against rating:
count mean std min 25% 50% 75% max
rating
3 188.0 51.99 6.43 35.0 47.75 52.0 56.0 67.0
4 386.0 51.26 6.68 36.0 47.00 51.0 56.0 68.0
5 588.0 51.57 6.93 35.0 47.00 51.0 56.0 68.0
Not given 736.0 51.53 6.94 35.0 47.00 52.0 56.0 68.0
Observations:¶
Numerical-Numerical Variable Relationships:¶
The initial pairplot of numerical variables in the dataset doesn't appear to show any noticable trends; looking at the correlation matrix we can see there is no identifiable correlation between preparation/delivery/total time and cost of the order.
Numerical-Categorical Variable Relationships:¶
Some more interesting observations can be found by comparing the categorical data against the numerical data using boxplots; These are given below, by categorical column:
Cuisine Type:¶
- Vietnamese and Korean foods have considerably lower median order costs; 12.13 dollars for Vietnamese and 12.18 dollars for Korean. Vietnamese also has the lowest mean order cost in the dataset, at 12.88 dollars. This is considerably lower than the overall median of the dataset at 14.14 dollars.
- French food has the highest average order cost (both mean and median, at 19.79 dollars and 20.47 dollars respectively).
- Vietnamese and Korean food also share the lowest median preparation time in the dataset, at 25 minutes; this is 2 minutes less than the overall median of 27 minutes.
- Conversely, Thai and Italian food share the highest median preparation time in the dataset, at 28 minutes (3 minutes higher than the overall median).
- In terms of delivery time, French food appears to have the highest median delivery time at 27 minutes, while Korean has the lowest at 20 minutes. This is 20% lower than the overall median of 25 minutes.
- Looking at total time, Korean food has the lowest median total time from order to delivery at 45 minutes; a full 7 minutes lower than the total median of 52 minutes.
Day of the Week:¶
- Order cost and food preparation time do not differ significantly between weekdays and weekends; they are largely the same across all metrics in the stats tables and boxplots.
- Average delivery time is significantly higher on weekdays compared with weekends; median delivery time is 28 minutes on weekdays versus 22 minutes on weekends, a difference of 6 minutes. This is despite the much larger volume of orders on weekends (1351) compared to on weekdays (547).
Rating:¶
- Median total time for 4 star and 5 star orders is a minute less than total time for 3 star and orders with no rating (51 minutes compared to 52 minutes). This suggests that orders with lower total time may receive higher ratings.
- Interestingly, order cost is slightly higher on orders in the 5 star bucket (at 15.20 dollars), suggesting that more expensive orders may lead to a greater likelihood of a 5 star rating.
Categorical-Categorical Variable Relationships:¶
Using the crosstab function and visualising the outputs as a heatmap in seaborn, we can compare categorical variables to each other and take some interesting observations from the data
Cuisine Type vs Day of the Week:¶
- While American cuisine is already very popular in the dataset, it's clear from the heatmap that this is much more pronounced on weekends; American food is ordered 415 times on weekends compared to 169 times on weekdays. The same trend is seen with the other most popular cuisine types (Japanese, Chinese, and Italian), which are all ordered far more often on weekends than weekdays
Cuisine Type vs Rating:¶
- American and Japanese food appear to be most prominent in orders with no rating given; however this may be symptomatic of the overall volume of orders for these types of cuisine, as they also have high numbers of 5 star and 4 star ratings
Rating vs Day of the Week:¶
- Orders with no rating given are most commonly ordered on weekends, though again is most likely due to the larger volume of orders on weekends than any underlying shift in food quality.
The company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Find the restaurants fulfilling the criteria to get the promotional offer¶
#filtering dataframe to rows with ratings, as df_rated
df_rated = df[df['rating'] != 'Not given']
#setting rating column in df_rated to int type
df_rated.loc[:, 'rating'] = df_rated['rating'].astype(int)
#pivoting restaurant name by rating count, average rating
df_rated_pivot = df_rated.pivot_table(index='restaurant_name', values='rating',
aggfunc=['count', 'mean'])
#setting column names in pivoted df
df_rated_pivot.columns = ['rating_count', 'average_rating']
#resetting index
df_rated_pivot.reset_index(inplace=True)
#filtering promo-eligible restaurants only
df_promo = df_rated_pivot[(df_rated_pivot['rating_count']>50) & (df_rated_pivot['average_rating']>4)]
#printing final list of promo-eligible restaurants
print(df_promo)
restaurant_name rating_count average_rating 16 Blue Ribbon Fried Chicken 64 4.328125 17 Blue Ribbon Sushi 73 4.219178 117 Shake Shack 133 4.278195 132 The Meatball Shop 84 4.511905
Observations:¶
Only 4 restaurants are currently eligible for the promotional offer from Foodhub:
- Blue Ribbon Fried Chicken
- Blue Ribbon Sushi
- Shake Shack
- The Meatball Shop
The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders¶
#calculating 25% revenue on orders greater than 20 dollars
orders_gt_20dollars_revenue = 0.25*df[df['cost_of_the_order']>20]['cost_of_the_order'].sum()
#calculating 15% revenue on orders greater than 5 dollars, and less than 20 dollars
orders_gt_5dollars_revenue = 0.15*df[(df['cost_of_the_order']>5) & (df['cost_of_the_order']<20)]['cost_of_the_order'].sum()
#summing and printing total revenue
total_revenue = round(orders_gt_5dollars_revenue+orders_gt_20dollars_revenue,2)
print(f"Total revenue from orders greater than 5 dollars and orders greater than 20 dollars: ${total_revenue}")
Total revenue from orders greater than 5 dollars and orders greater than 20 dollars: $6166.3
Observations:¶
The total revenue from orders greater than 5 dollars and orders greater than 20 dollars is $6166.30; this is assuming that the 15% charge is for orders greater than 5 dollars but less than 20 dollars
The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed?¶
#calculating total time from order placed to food delivered
#previously calculated for univariate analysis
df['total_time'] = df['food_preparation_time']+df['delivery_time']
#calculating percent of deliveries greater than 60 minutes
deliveries_gt_60mins = percent_gt_20dollars = round(100*((df['total_time']>60).sum())/len(df['total_time']),2)
#printing result
print(f"percentage of orders greater than 60 mins from order placed: {deliveries_gt_60mins}%")
percentage of orders greater than 60 mins from order placed: 10.54%
Observations:¶
The percentage of orders delivered greater than 60 minutes from when the order was placed is 10.54%
The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during weekdays and weekends?¶
#Calculating the mean delivery time on weekdays, rounded to 2 decimal points
Wkday_mean_delivery = round(df[df['day_of_the_week']=='Weekday']['delivery_time'].mean(),2)
#Calculating the mean delivery time on weekends, rounded to 2 decimal points
Wknd_mean_delivery = round(df[df['day_of_the_week']=='Weekend']['delivery_time'].mean(),2)
#Printing mean weekday and weekend delivery times in context
print(f"Mean delivery time on weekdays: {Wkday_mean_delivery} minutes")
print(f"Mean delivery time on weekends: {Wknd_mean_delivery} minutes")
#Calculating and printing difference in mean delivery times
print(f"Mean delivery time difference: {round(Wkday_mean_delivery-Wknd_mean_delivery,2)} minutes")
Mean delivery time on weekdays: 28.34 minutes Mean delivery time on weekends: 22.47 minutes Mean delivery time difference: 5.87 minutes
Observations:¶
The weekday mean delivery time is 28.34 minutes, while the weekend mean delivery time is 22.47 minutes. This amounts to a mean difference of 5.87 minutes between weekday and weekend deliveries.
Conclusions and Recommendations from the Analysis¶
Conclusions:¶
- The countplot of orders by day of the week in the univariate analysis shows that weekend orders are far more popular than weekday orders, with 71% of the instances in the dataset being weekend orders. There are a number of possible reasons behind this, first of which being that far fewer people are working on weekends and therefore see this as an opportunity to relax and possibly order takeaway. It's also clear from the difference in mean delivery time that weekend orders arrive much quicker than orders on a weekday (by 5.87 minutes on average).
- From the univariate analysis, American cuisine is the most common type of order in the dataset (584 instances), followed by Japanese (470 instances), Italian (298 instances), and Chinese (215 instances). Looking at the bivariate analysis between cuisine type and day of the week, it's evident that American food is ordered most often on weekends on comparison to other leading cuisine types. This may be due to the perception that American food is an 'unhealthy' choice, leading to customers preferring to order it on the weekends instead.
- The trends in order rating show that customers leave ratings on orders just over 61% of the time, and never less than 3 stars based on instances in this dataset. A significant proportion of the dataset does not have any rating given. This may be due to customers forgetting to rate the restaurant, or not seeing any justification in giving a rating.
- Based on the univariate analysis of order cost, it's evident that the majority of the orders are priced between 10.65 and 16.85 dollars, with over 800 instances of this price range in the dataset. The instances of orders exceeding this price range fall off considerablyorders get more expensive. This implies that food orders higher than 16.85 dollars are less popular with customers, who likely see this as a cutoff point for what they're willing to pay for a takeaway.
Recommendations:¶
- Given the desire for the company to have a promotional offer on restaurant advertising based on rating, it may be advisable to also incentivise the customer to rate their order. There are a number of possible ways to do this, for example the Foodhub app may send a push notification 30 minutes after the order is delivered to prompt the customer to rate their food (and possibly leave a tip); additionally the company may decide to offer discounts to users who rate their orders consistently (this initiative is already in place with giving 20% discount vouchers to customers who order the most takeaways, with a few adjustments this can also be applied to the rating system).
- The weekend is clearly the most profitable time of the week for the company - they may decide to advertise more heavily on Fridays and Saturdays to pull in more customers and orders during this time. Additionally, given the potential perception of the weekend as an 'unhealthy' food time and the weekday as a 'healthy' food time, the company may consider recruiting more health-conscious restaurants and advertising those more prominently at the start of the week (i.e. Monday or Tuesday). This could potentially lead to a rise in orders during the weekday.
- The company may consider altering their restaurant surcharge strategy; earlier it was suggested to charge 15% on order greater than 5 dollars and 25% on orders greater than 20 dollars. Given the distribution of order cost in the univariate analysis and the fact that the majority of orders are between 10.65 and 16.85 dollars, the company may instead wish to also charge 20% on orders between 10 dollars and 17 dollars, and instead charge 25% on orders greater than 17 dollars. This accomplishes 2 things; it brings in additional revenue for the company immediately, and it leads to a less sudden increase in surcharge fees from 15% to 25% by adding a third tier to the surcharge strategy.