Insights from Failed Orders¶
Gett, previously known as GetTaxi, is an Israeli-developed technology platform solely focused on corporate Ground Transportation Management (GTM). Clients order taxis through its application, and drivers accept those rides (offers). When a client clicks the Order button in the application, the matching system searches for the most relevant drivers and offers them the order. In this task, we investigate some matching metrics for orders that were not completed successfully, i.e., where the customer did not end up getting a car.
GitHub: Insights From Failed Orders
In [552]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
In [553]:
data_offers = pd.read_csv("datasets/data_offers.csv")
data_orders = pd.read_csv("datasets/data_orders.csv")
In [554]:
data_offers.head()
Out[554]:
| | order_gk | offer_id |
| --- | --- | --- |
| 0 | 3000579625629 | 300050936206 |
| 1 | 3000627306450 | 300052064651 |
| 2 | 3000632920686 | 300052408812 |
| 3 | 3000632771725 | 300052393030 |
| 4 | 3000583467642 | 300051001196 |
In [555]:
data_orders.head()
Out[555]:
| | order_datetime | origin_longitude | origin_latitude | m_order_eta | order_gk | order_status_key | is_driver_assigned_key | cancellations_time_in_seconds |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 18:08:07 | -0.978916 | 51.456173 | 60.0 | 3000583041974 | 4 | 1 | 198.0 |
| 1 | 20:57:32 | -0.950385 | 51.456843 | NaN | 3000583116437 | 4 | 0 | 128.0 |
| 2 | 12:07:50 | -0.969520 | 51.455544 | 477.0 | 3000582891479 | 4 | 1 | 46.0 |
| 3 | 13:50:20 | -1.054671 | 51.460544 | 658.0 | 3000582941169 | 4 | 1 | 62.0 |
| 4 | 21:24:45 | -0.967605 | 51.458236 | NaN | 3000583140877 | 9 | 0 | NaN |
In [556]:
data_orders.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10716 entries, 0 to 10715
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   order_datetime                 10716 non-null  object
 1   origin_longitude               10716 non-null  float64
 2   origin_latitude                10716 non-null  float64
 3   m_order_eta                    2814 non-null   float64
 4   order_gk                       10716 non-null  int64
 5   order_status_key               10716 non-null  int64
 6   is_driver_assigned_key         10716 non-null  int64
 7   cancellations_time_in_seconds  7307 non-null   float64
dtypes: float64(4), int64(3), object(1)
memory usage: 669.9+ KB
Section 1: Distribution of Orders by Reasons for Failure¶
In [557]:
# order_status_key == 4 -> cancelled by the client; no driver assigned yet
cancelled_by_client_no_driver = data_orders[(data_orders['order_status_key'] == 4) & (data_orders['is_driver_assigned_key'] == 0)]
print(len(cancelled_by_client_no_driver))
4496
In [558]:
# Client cancelled after a driver had already been assigned
cancelled_by_client_driver = data_orders[(data_orders['order_status_key'] == 4) & (data_orders['is_driver_assigned_key'] == 1)]
print(len(cancelled_by_client_driver))
2811
In [559]:
# order_status_key == 9 -> order rejected by the system
cancelled_by_system = data_orders[(data_orders['order_status_key'] == 9)]
print(len(cancelled_by_system))
3409
In [560]:
categories = [
"Cancelled by client\n(no driver assigned)",
"Cancelled by client\n(driver assigned)",
"Cancelled by system"
]
counts = [len(cancelled_by_client_no_driver), len(cancelled_by_client_driver), len(cancelled_by_system)]
plt.figure(figsize=(8, 5))
sns.barplot(x=categories, y=counts, hue=categories, palette="colorblind", legend=False)
plt.ylabel("Number of Cancelled Orders")
plt.title("Counts of Cancelled Orders by Category")
plt.tight_layout()
plt.show()
Analysis¶
- The largest category of failures is clients cancelling before a driver is assigned (4,496 orders); each category's share is quantified in the sketch below.
- This suggests a potential issue with long wait times or high initial ETAs causing customer impatience.
- The second-largest category is system rejections (3,409 orders), which points to a driver-supply problem.
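To make the comparison concrete, here is a minimal sketch (reusing the three subsets computed above) that expresses each failure category as a share of all failed orders; the three counts sum to 10,716, the full dataset, so `len(data_orders)` serves as the denominator:

total_failed = len(data_orders)
category_counts = {
    "Cancelled by client (no driver assigned)": len(cancelled_by_client_no_driver),
    "Cancelled by client (driver assigned)": len(cancelled_by_client_driver),
    "Cancelled by system": len(cancelled_by_system),
}
for category, count in category_counts.items():
    # e.g. 4496 / 10716, roughly 42%, for client cancellations with no driver
    print(f"{category}: {count / total_failed:.1%}")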
Section 2: How do failed orders change throughout the day?¶
In [561]:
order_data_by_hour = data_orders[['order_datetime', 'order_status_key', 'is_driver_assigned_key']].copy()
order_data_by_hour['order_datetime'] = pd.to_datetime(order_data_by_hour['order_datetime'], format='%H:%M:%S')
order_data_by_hour['hour'] = order_data_by_hour['order_datetime'].dt.hour
In [562]:
order_data_by_hour.head()
Out[562]:
| | order_datetime | order_status_key | is_driver_assigned_key | hour |
| --- | --- | --- | --- | --- |
| 0 | 1900-01-01 18:08:07 | 4 | 1 | 18 |
| 1 | 1900-01-01 20:57:32 | 4 | 0 | 20 |
| 2 | 1900-01-01 12:07:50 | 4 | 1 | 12 |
| 3 | 1900-01-01 13:50:20 | 4 | 1 | 13 |
| 4 | 1900-01-01 21:24:45 | 9 | 0 | 21 |
In [563]:
plt.figure(figsize=(10,5))
sns.histplot(order_data_by_hour, x='hour', discrete=True)
plt.xticks(range(24))
plt.title("Total Cancelled Orders per Hour")
plt.xlabel("Hour of Day")
plt.ylabel("Number of Cancelled Orders")
plt.tight_layout()
plt.show()
In [564]:
def cancellation_type(row):
if row['order_status_key'] == 4 and row['is_driver_assigned_key'] == 1:
return 'Cancelled by client (driver assigned)'
elif row['order_status_key'] == 4 and row['is_driver_assigned_key'] == 0:
return 'Cancelled by client (no driver assigned)'
elif row['order_status_key'] == 9:
return 'Cancelled by system'
else:
return None
order_data_by_hour['cancel_type'] = order_data_by_hour.apply(cancellation_type, axis=1)
filtered = order_data_by_hour[order_data_by_hour['cancel_type'].notnull()]
plt.figure(figsize=(12,6))
sns.histplot(data=filtered, x='hour', hue='cancel_type', multiple='stack', discrete=True, palette='colorblind')
plt.xticks(range(24))
plt.title("Cancelled Orders per Hour by Type")
plt.xlabel("Hour of Day")
plt.ylabel("Number of Cancelled Orders")
plt.tight_layout()
plt.show()
Analysis¶
- Cancellations peak around 8 in the morning for both the client and system categories; the hour-by-hour breakdown is tabulated in the sketch below.
- Surprisingly, there are relatively few cancelled orders during the evening rush hours, 5 P.M. to 8 P.M.
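As a quick check on the morning-peak claim, a minimal sketch that tabulates the `filtered` frame from the previous cell by hour and cancellation type:

hourly_breakdown = pd.crosstab(filtered['hour'], filtered['cancel_type'])
# Show the hours around the 8 A.M. peak called out above
print(hourly_breakdown.loc[7:9])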
Section 3: How long are customers waiting before they cancel?¶
In [565]:
data_orders_cancellation = data_orders[['order_datetime', 'is_driver_assigned_key', 'cancellations_time_in_seconds']].copy()
data_orders_cancellation['order_datetime'] = pd.to_datetime(data_orders_cancellation['order_datetime'], format='%H:%M:%S')
data_orders_cancellation['hour'] = data_orders_cancellation['order_datetime'].dt.hour
In [566]:
data_orders_cancellation.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10716 entries, 0 to 10715
Data columns (total 4 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   order_datetime                 10716 non-null  datetime64[ns]
 1   is_driver_assigned_key         10716 non-null  int64
 2   cancellations_time_in_seconds  7307 non-null   float64
 3   hour                           10716 non-null  int32
dtypes: datetime64[ns](1), float64(1), int32(1), int64(1)
memory usage: 293.1 KB
In [567]:
# cancellations_time_in_seconds is recorded only for client cancellations
# (7,307 rows), so dropping NaNs removes the system-rejected orders
data_orders_cancellation = data_orders_cancellation.dropna()
In [568]:
data_orders_cancellation.info()
<class 'pandas.core.frame.DataFrame'>
Index: 7307 entries, 0 to 10715
Data columns (total 4 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   order_datetime                 7307 non-null   datetime64[ns]
 1   is_driver_assigned_key         7307 non-null   int64
 2   cancellations_time_in_seconds  7307 non-null   float64
 3   hour                           7307 non-null   int32
dtypes: datetime64[ns](1), float64(1), int32(1), int64(1)
memory usage: 256.9 KB
In [569]:
# Tukey's rule: treat points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers
Q1 = data_orders_cancellation['cancellations_time_in_seconds'].quantile(0.25)
Q3 = data_orders_cancellation['cancellations_time_in_seconds'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
In [570]:
outliers = data_orders_cancellation[
(data_orders_cancellation['cancellations_time_in_seconds'] < lower_bound) |
(data_orders_cancellation['cancellations_time_in_seconds'] > upper_bound)
]
print(f"Number of outliers: {len(outliers)}")
Number of outliers: 592
In [571]:
data_orders_cancellation_no_outliers = data_orders_cancellation[
(data_orders_cancellation['cancellations_time_in_seconds'] >= lower_bound) &
(data_orders_cancellation['cancellations_time_in_seconds'] <= upper_bound)
]
print(f"Rows after removing outliers: {len(data_orders_cancellation_no_outliers)}")
Rows after removing outliers: 6715
In [572]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# With outliers
sns.boxplot(ax=axes[0], data=data_orders_cancellation, y='cancellations_time_in_seconds')
axes[0].set_title("With Outliers")
# Without outliers
sns.boxplot(ax=axes[1], data=data_orders_cancellation_no_outliers, y='cancellations_time_in_seconds', color='skyblue')
axes[1].set_title("Without Outliers")
plt.tight_layout()
plt.show()
In [573]:
plt.figure(figsize=(10, 5))
# Map the 0/1 flag to readable labels so seaborn builds the legend itself;
# calling plt.legend(labels=...) afterwards can silently mislabel the groups
plot_data = data_orders_cancellation_no_outliers.copy()
plot_data['Driver Assigned'] = plot_data['is_driver_assigned_key'].map({0: 'No', 1: 'Yes'})
sns.histplot(
    data=plot_data,
    x='cancellations_time_in_seconds',
    hue='Driver Assigned',
    multiple='stack',
    bins=30,
    kde=True,
    palette='colorblind'
)
plt.title("Distribution of Cancellation Time")
plt.xlabel("Cancellation Time (seconds)")
plt.ylabel("Frequency")
plt.show()
In [574]:
avg_time_to_cancel = data_orders_cancellation_no_outliers.groupby('is_driver_assigned_key')['cancellations_time_in_seconds'].mean()
In [575]:
avg_time_df = avg_time_to_cancel.reset_index()
plt.figure(figsize=(6, 4))
sns.barplot(
    data=avg_time_df,
    x='is_driver_assigned_key',
    y='cancellations_time_in_seconds',
    hue='is_driver_assigned_key',
    palette='colorblind',
    legend=False
)
plt.title('Average Cancellation Time by Driver Assignment')
plt.xlabel('Driver Assigned')
plt.ylabel('Average Cancellation Time (seconds)')
plt.tight_layout()
plt.show()
In [576]:
# Group by hour and driver assignment, then calculate mean cancellation time
avg_cancel_time_by_hour = data_orders_cancellation_no_outliers.groupby(
['hour', 'is_driver_assigned_key']
)['cancellations_time_in_seconds'].mean().reset_index()
plt.figure(figsize=(12,6))
sns.lineplot(
data=avg_cancel_time_by_hour,
x='hour',
y='cancellations_time_in_seconds',
hue='is_driver_assigned_key',
marker='o',
palette='colorblind'
)
plt.title('Average Cancellation Time by Hour and Driver Assignment')
plt.xlabel('Hour of Day')
plt.ylabel('Average Cancellation Time (seconds)')
plt.xticks(range(24))
plt.legend(title='Driver Assigned')
plt.tight_layout()
plt.show()
Analysis¶
- Many customers with no driver assigned cancel within about 20 seconds of booking, i.e., without really waiting at all; the medians in the sketch below make this asymmetry explicit.
- The 9 P.M. hour alone accounts for more than 400 cancelled orders.
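Because the mean stays sensitive to the long right tail even after trimming, here is a minimal sketch comparing median cancellation times instead, reusing the outlier-filtered frame from above:

median_cancel = (
    data_orders_cancellation_no_outliers
    .groupby('is_driver_assigned_key')['cancellations_time_in_seconds']
    .median()
)
# 0 = no driver assigned, 1 = driver assigned
print(median_cancel)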
Section 4: What do the Estimated Times of Arrival (ETAs) look like by hour?¶
In [577]:
# ETA is only populated when a driver was actually assigned
data_orders_eta = data_orders[data_orders['is_driver_assigned_key'] == 1]
In [578]:
data_orders_eta = data_orders_eta[['order_datetime', 'm_order_eta']].copy()  # .copy() avoids SettingWithCopyWarning below
In [579]:
data_orders_eta.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2814 entries, 0 to 10715
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   order_datetime  2814 non-null   object
 1   m_order_eta     2814 non-null   float64
dtypes: float64(1), object(1)
memory usage: 66.0+ KB
In [580]:
data_orders_eta['order_datetime'] = pd.to_datetime(data_orders_eta['order_datetime'], format='%H:%M:%S')
data_orders_eta['hour'] = data_orders_eta['order_datetime'].dt.hour
In [581]:
data_orders_eta.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2814 entries, 0 to 10715
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   order_datetime  2814 non-null   datetime64[ns]
 1   m_order_eta     2814 non-null   float64
 2   hour            2814 non-null   int32
dtypes: datetime64[ns](1), float64(1), int32(1)
memory usage: 76.9 KB
In [582]:
avg_data_order_eta = data_orders_eta.groupby('hour')['m_order_eta'].mean().reset_index()
avg_data_order_eta.head()
Out[582]:
| | hour | m_order_eta |
| --- | --- | --- |
| 0 | 0 | 357.959016 |
| 1 | 1 | 324.750000 |
| 2 | 2 | 391.012821 |
| 3 | 3 | 388.093750 |
| 4 | 4 | 299.658537 |
In [583]:
plt.figure(figsize=(10,7))
sns.barplot(data=avg_data_order_eta, x='hour', y='m_order_eta', hue='hour', palette='colorblind', legend=False)
plt.title('Estimated Time of Arrival (seconds) by Hours')
plt.xlabel('Hour')
plt.ylabel('Estimated Time of Arrival (seconds)')
plt.tight_layout()
plt.show()
Analysis¶
- The average ETA peaks dramatically at 8 A.M., exceeding 10 minutes (600+ seconds).
- This offers a clear explanation for the morning spike in cancellations: long wait times lead directly to customer frustration and a higher rate of failed orders. The sketch below checks this link across all hours.
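To test the ETA-cancellation link across the whole day rather than just the morning peak, a minimal sketch that joins the hourly average ETA with the hourly cancellation counts from Section 2 (it assumes `order_data_by_hour` from that section is still in scope):

cancels_per_hour = order_data_by_hour.groupby('hour').size().rename('cancel_count')
eta_vs_cancels = avg_data_order_eta.set_index('hour').join(cancels_per_hour)
# A strongly positive correlation would support the ETA-frustration hypothesis
print(eta_vs_cancels['m_order_eta'].corr(eta_vs_cancels['cancel_count']))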
Section 5: Where are the failure hotspots?¶
In [584]:
import h3
import folium
In [585]:
H3_RESOLUTION = 8  # resolution-8 cells average ~0.7 km^2, a useful city-block scale
data_orders['hex_id'] = data_orders.apply(
    lambda row: h3.latlng_to_cell(row['origin_latitude'], row['origin_longitude'], H3_RESOLUTION),
    axis=1
)
In [586]:
data_orders.head()
Out[586]:
| | order_datetime | origin_longitude | origin_latitude | m_order_eta | order_gk | order_status_key | is_driver_assigned_key | cancellations_time_in_seconds | hex_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 18:08:07 | -0.978916 | 51.456173 | 60.0 | 3000583041974 | 4 | 1 | 198.0 | 88195d2b03fffff |
| 1 | 20:57:32 | -0.950385 | 51.456843 | NaN | 3000583116437 | 4 | 0 | 128.0 | 88195d2b19fffff |
| 2 | 12:07:50 | -0.969520 | 51.455544 | 477.0 | 3000582891479 | 4 | 1 | 46.0 | 88195d2b1dfffff |
| 3 | 13:50:20 | -1.054671 | 51.460544 | 658.0 | 3000582941169 | 4 | 1 | 62.0 | 88195d7497fffff |
| 4 | 21:24:45 | -0.967605 | 51.458236 | NaN | 3000583140877 | 9 | 0 | NaN | 88195d2b1dfffff |
In [587]:
hex_total_orders = data_orders.groupby('hex_id').size().reset_index(name='total_orders')
failed_orders = data_orders[data_orders['order_status_key'].isin([4, 9])]
hex_failed_orders = failed_orders.groupby('hex_id').size().reset_index(name='fail_count')
In [588]:
hex_data = pd.merge(hex_total_orders, hex_failed_orders, on='hex_id', how='left')
hex_data['fail_count'] = hex_data['fail_count'].fillna(0).astype(int)
In [589]:
hex_data_sorted = hex_data.sort_values(by='total_orders', ascending=False)
eighty_percent_target = data_orders.shape[0] * 0.80  # 80% of all orders (every row is a failed order)
In [590]:
hex_data_sorted['cumulative_orders'] = hex_data_sorted['total_orders'].cumsum()
# First row where the running total crosses the 80% target; .loc label slicing
# on the unique index keeps every row down to and including that one
cutoff_index = hex_data_sorted[hex_data_sorted['cumulative_orders'] >= eighty_percent_target].index[0]
top_hexes = hex_data_sorted.loc[:cutoff_index]
In [591]:
num_top_hexes = len(top_hexes)
print(f"Number of size 8 hexes containing 80% of all orders: {num_top_hexes}")
Number of size 8 hexes containing 80% of all orders: 24
In [592]:
map_center = [data_orders['origin_latitude'].mean(), data_orders['origin_longitude'].mean()]
m = folium.Map(location=map_center, zoom_start=12, tiles='cartodbpositron')
for index, row in top_hexes.iterrows():
hex_id = row['hex_id']
fail_count = row['fail_count']
total_orders = row['total_orders']
    boundary_tuples = list(h3.cell_to_boundary(hex_id))  # (lat, lon) pairs, the order folium expects
if fail_count > 150:
color = '#d73027'
elif fail_count > 75:
color = '#fdae61'
elif fail_count > 25:
color = '#fee090'
else:
color = '#abd9e9'
folium.Polygon(
locations=boundary_tuples,
color="black",
weight=1,
fill_color=color,
fill_opacity=0.7,
tooltip=f"Hex ID: {hex_id}<br>Failed Orders: {fail_count}<br>Total Orders: {total_orders}"
).add_to(m)
map_filename = 'failure_hotspot_map.html'
m.save(map_filename)
Analysis¶
- Just 24 resolution-8 hexagons account for 80% of all failed orders, highlighting extreme geographic concentration.
- The map below highlights the areas with the most failed orders; the sketch after it lists the worst hexes numerically.
In [593]:
from IPython.display import IFrame
IFrame(src='failure_hotspot_map.html', width=700, height=600)
Out[593]:
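For a non-visual view of the same hotspots, a minimal sketch that lists the ten hexes with the most failed orders, reusing `top_hexes` from above:

worst_hexes = top_hexes.sort_values('fail_count', ascending=False)
print(worst_hexes[['hex_id', 'fail_count', 'total_orders']].head(10))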