SQL & Tableau: Online Chatroom Data Analysis
Background
Booth is a free online mental health service for individuals aged 11-25 years old. Service users can access an online chatroom feature that allows them to speak with trained mental health counselors.
Booth has provided two datasets containing information on service user activity within this chatroom feature. This data was collected during the week of April 13th, 2020. All service users in this data are located within the United States.
We have been asked to investigate trends in service user behavior regarding chat attempts, wait times, and successful chats from queues. First, we will use SQL and Python to join and format the two datasets. After that, we will visualize our findings in Tableau with interactive dashboards. All Python code will be executed in an IPython kernel using Jupyter Notebook.
Key Notes
- ‘Queues’ are synonymous with waits to enter a Booth chatroom.
- Not all successful chats require queues. Users can book chat appointments ahead of time.
- Not all users have a successful chat after initiating a queue.
- Users can be removed from their queues by a practitioner if they are not going to get a chat. They may also exit queues themselves.
- Chatroom availability begins at 9:00am and ends at 9:00pm.
Data Analysis
For this analysis, we will be executing SQL queries within a Python environment using the pandasql Python package, which runs SQLite syntax against pandas dataframes. Before we begin, we must install this package in our Python environment. We can then import all required packages for analysis.
# install pandasql in the notebook environment (run once)
%pip install pandasql
# import required libraries
from pandasql import sqldf
import pandas as pd
Next, we will initialize a SQL querying function in Python. This can be done using a simple lambda function.
# initialize SQL querying function
pysql = lambda q: sqldf(q, globals())
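As an optional sanity check (not part of the original workflow), we can pass a trivial query through this function to confirm it returns a dataframe:
# sanity check: the function should return a one-row dataframe
pysql("SELECT 1 AS sanity_check")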
We will now import the Booth datasets as pandas dataframes. We will also rename the variables in both datasets for better clarity.
# import Booth datasets
queue_df = pd.read_csv("C:/Users/ethan/OneDrive/Documents/Booth/join_queue_events.csv")
chat_df = pd.read_csv("C:/Users/ethan/OneDrive/Documents/Booth/chat_start_events.csv")
# assign new variable names
queue_df = queue_df.rename(columns = {
'Service User ID':'user_id',
'Join Queue Event ID':'queue_id',
'date_su join queue':'queue_start',
'date_su left queue':'queue_end'}
)
chat_df = chat_df.rename(columns = {
'Service User ID':'user_id',
'chats starts':'chat_start',
'chat ends':'chat_end',
'meeting room ID':'meeting_room_id'}
)
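To confirm the renaming worked as intended, we can print the column names of each dataframe (a quick optional check):
# confirm the new column names after renaming
print(queue_df.columns.tolist())
print(chat_df.columns.tolist())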
queue_df contains four variables:
- user_id indicates the unique ID assigned to each Booth service user.
- queue_id represents an ID for each distinct queue initialized by a user.
- queue_start represents the datetime of an initialized queue.
- queue_end represents the datetime of a terminated queue.
chat_df contains four variables:
- user_id indicates the unique ID assigned to each Booth service user.
- chat_start represents the datetime of an initialized chat.
- chat_end represents the datetime of a terminated chat.
- meeting_room_id indicates the unique ID generated for each distinct chat session.
We can see that user_id is a shared variable in both dataframes. Before we join this data together, we will perform a simple deduplication of rows in each dataframe using SQL.
# 0 rows were removed from queue_df
queue_df = sqldf(
'''
SELECT DISTINCT *
FROM queue_df
'''
)
# 1383 duplicate rows were removed from chat_df
chat_df = sqldf(
'''
SELECT DISTINCT *
FROM chat_df
'''
)
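The duplicate counts noted in the comments above can also be verified in pandas. The following check (an optional sketch) would need to run on the freshly imported dataframes, before the SQL deduplication:
# equivalent pandas check for duplicate rows (run before deduplicating)
print(queue_df.duplicated().sum())
print(chat_df.duplicated().sum())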
When considering the Booth chatroom feature, service users fall under one of three events:
- Users who queued and failed to chat.
- Users who queued and chatted successfully.
- Users who did not queue and chatted successfully.
We will filter and join data from queue_df and chat_df by creating distinct tables for each of these three events. After creating these tables, we will combine them to create a master dataset that can be imported into Tableau for further analysis.
Starting with Event 1, we will create a table for user records that initiated a queue but failed to chat. While exploring this data, we found that some users had instantiated multiple queues ending at the same time. This could be due to errors in the data logging system, or perhaps the result of a user opening multiple queues in different tabs of their browser.
Regardless of the cause of this discrepancy, we will only keep the record with the longest queue duration per queue cluster. These records best reflect the true wait time of each user after starting a queue for a distinct chat session.
Using SQL, we will first create a common table expression RankedQueues that left-joins user_id from chat_df onto queue_df. Within RankedQueues, we will construct a field row_num using a window function that partitions by user and queue end time. Within each such queue cluster, records are ranked by queue start time, so the earliest start (and therefore the longest, most accurate queue duration) receives rank 1.
# create a table for records with queues that failed to chat
queue_fail = sqldf(
'''
WITH RankedQueues AS (
SELECT
q.*,
ROW_NUMBER() OVER (PARTITION BY q.user_id, q.queue_end ORDER BY q.queue_start) AS row_num,
c.chat_start,
c.chat_end,
c.meeting_room_id
FROM
queue_df q
LEFT JOIN
chat_df c ON q.user_id = c.user_id
)
SELECT
user_id,
queue_id,
queue_start,
queue_end,
ROUND((1.0 * (strftime('%s', substr(queue_end, 7, 4) || '-' || substr(queue_end, 4, 2) || '-' || substr(queue_end, 1, 2) || ' ' || substr(queue_end, 12))) - strftime('%s', substr(queue_start, 7, 4) || '-' || substr(queue_start, 4, 2) || '-' || substr(queue_start, 1, 2) || ' ' || substr(queue_start, 12))) / 60.0, 2) AS queue_minutes,
chat_start,
chat_end,
ROUND((1.0 * (strftime('%s', substr(chat_end, 7, 4) || '-' || substr(chat_end, 4, 2) || '-' || substr(chat_end, 1, 2) || ' ' || substr(chat_end, 12))) - strftime('%s', substr(chat_start, 7, 4) || '-' || substr(chat_start, 4, 2) || '-' || substr(chat_start, 1, 2) || ' ' || substr(chat_start, 12))) / 60.0, 2) AS chat_minutes,
meeting_room_id
FROM
RankedQueues
WHERE
row_num = 1 AND chat_start IS NULL
ORDER BY
user_id, queue_end
'''
)
Now, we can display the first 5 rows of queue_fail.
queue_fail.head()
user_id | queue_id | queue_start | queue_end | queue_minutes | chat_start | chat_end | chat_minutes | meeting_room_id |
---|---|---|---|---|---|---|---|---|
0015d01f-f8f1-4f8a-9c33-c22b45b3e59a | d6b735f4-dd34-4e1b-aa1b-0c6d1f84ef45 | 17/04/2020 11:23:02 | 17/04/2020 11:56:29 | 33.45 | None | None | None | None |
0070a3e7-fee7-48f8-82d7-af96b1d08082 | bc49da27-124f-410b-a97f-6dc36ac520ff | 17/04/2020 16:03:40 | 17/04/2020 16:03:57 | 0.28 | None | None | None | None |
007e63d9-51e0-4738-89d1-ba968cfa9820 | 1c45e072-284b-4145-b65e-91363ca7c7c6 | 15/04/2020 13:44:49 | 15/04/2020 13:49:26 | 4.62 | None | None | None | None |
007e63d9-51e0-4738-89d1-ba968cfa9820 | b4533ead-0cb9-4f44-bb5f-01e3bc74032e | 15/04/2020 14:14:51 | 15/04/2020 14:15:46 | 0.92 | None | None | None | None |
0086373e-4c0b-4c2a-b82e-9f3057f14de4 | 01bcf559-9928-4dbb-900a-35a73e4716e9 | 17/04/2020 19:57:50 | 17/04/2020 19:59:51 | 2.02 | None | None | None | None |
As we can see, no information is available for chat-related variables because all of these queues failed to reach a chat room. Additionally, we can see that some users instantiated multiple queues that started and ended at different times, indicating multiple queue attempts for that particular user. We have also calculated the variables queue_minutes and chat_minutes to measure the queue and chat duration for each record respectively.
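The queue_minutes and chat_minutes fields were derived in SQL by rebuilding each DD/MM/YYYY HH:MM:SS string into an ISO-style timestamp and differencing Unix epoch values. As an optional cross-check (a sketch, not part of the original workflow), we can recompute queue_minutes in pandas and compare:
# recompute queue durations in pandas and compare against the SQL values
check = queue_fail.copy()
fmt = '%d/%m/%Y %H:%M:%S'
check['queue_minutes_pd'] = (
    (pd.to_datetime(check['queue_end'], format=fmt)
     - pd.to_datetime(check['queue_start'], format=fmt))
    .dt.total_seconds() / 60
).round(2)
# the maximum discrepancy between the two calculations should be zero (or near-zero)
print((check['queue_minutes_pd'] - check['queue_minutes']).abs().max())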
During this investigation, we find one queue that appears twice in the data under the same queue_id. The start times of these two records are the same, but their end times differ…
sqldf(
'''
select *
from queue_fail
where queue_id = "cf7cb7ff-e366-42d4-a226-213851b208df"
''')
user_id | queue_id | queue_start | queue_end | queue_minutes | chat_start | chat_end | chat_minutes | meeting_room_id |
---|---|---|---|---|---|---|---|---|
912b513e-ce66-4cd8-85b4-6509d7d0ee23 | cf7cb7ff-e366-42d4-a226-213851b208df | 15/04/2020 19:22:19 | 15/04/2020 19:25:34 | 3.25 | None | None | None | None |
912b513e-ce66-4cd8-85b4-6509d7d0ee23 | cf7cb7ff-e366-42d4-a226-213851b208df | 15/04/2020 19:22:19 | 15/04/2020 19:38:32 | 16.22 | None | None | None | None |
This is an odd occurrence. We might expect that, provided duplicated records have the same queue_start time, they would have the same queue_end time as well. We will assume that this is another error arising from the data logging system. To preserve accuracy in user queue durations, we will remove the record with the shorter queue time (3.25 min).
# update queue_fail
queue_fail = sqldf(
'''
select *
from queue_fail
where not (queue_id = "cf7cb7ff-e366-42d4-a226-213851b208df" and queue_minutes = 3.25)
''')
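As an optional check, we can confirm that only one record now remains under this queue_id:
# confirm only one record remains for this queue_id
sqldf(
    '''
    select count(*) as remaining_records
    from queue_fail
    where queue_id = "cf7cb7ff-e366-42d4-a226-213851b208df"
    ''')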
We will now look into Event 2 for users who entered a chat via a queue. During our data exploration, we identified 667 records where chats commenced prior to their queues ending. Intuitively, users should not still be waiting in a queue after successfully joining a chatroom. Thus, we will assume that, for all successful chats in this event, queue_end and chat_start are equivalent datetimes.
Using SQL, we will construct a new table for these successful chat records from queues. Again, for cases where a user instantiates multiple queues that end at the same time, only the queue with the longest duration will be kept. We will inner join chat_df onto queue_df on user_id, keeping only records where queue_end equals chat_start.
# create a table for queues that succeeded in chat connections
queue_connect = sqldf(
'''
WITH RankedQueues AS (
SELECT
q.*,
ROW_NUMBER() OVER (PARTITION BY q.user_id, q.queue_end ORDER BY q.queue_start) AS row_num,
c.chat_start,
c.chat_end,
c.meeting_room_id
FROM
queue_df q
INNER JOIN
chat_df c
ON q.user_id = c.user_id
AND queue_end = chat_start
)
SELECT
user_id,
queue_id,
queue_start,
queue_end,
ROUND((1.0 * (strftime('%s', substr(queue_end, 7, 4) || '-' || substr(queue_end, 4, 2) || '-' || substr(queue_end, 1, 2) || ' ' || substr(queue_end, 12))) - strftime('%s', substr(queue_start, 7, 4) || '-' || substr(queue_start, 4, 2) || '-' || substr(queue_start, 1, 2) || ' ' || substr(queue_start, 12))) / 60.0, 2) AS queue_minutes,
chat_start,
chat_end,
ROUND((1.0 * (strftime('%s', substr(chat_end, 7, 4) || '-' || substr(chat_end, 4, 2) || '-' || substr(chat_end, 1, 2) || ' ' || substr(chat_end, 12))) - strftime('%s', substr(chat_start, 7, 4) || '-' || substr(chat_start, 4, 2) || '-' || substr(chat_start, 1, 2) || ' ' || substr(chat_start, 12))) / 60.0, 2) AS chat_minutes,
meeting_room_id
FROM
RankedQueues
WHERE
row_num = 1 AND chat_start IS NOT NULL
ORDER BY
user_id, queue_end
'''
)
Here are the first 5 rows of queue_connect.
queue_connect.head()
user_id | queue_id | queue_start | queue_end | queue_minutes | chat_start | chat_end | chat_minutes | meeting_room_id |
---|---|---|---|---|---|---|---|---|
0027a586-b290-437e-9d33-c7cab0397128 | 22c19016-095e-42de-9a3c-fe49b453e9d1 | 16/04/2020 17:20:21 | 16/04/2020 17:43:28 | 23.12 | 16/04/2020 17:43:28 | 16/04/2020 17:49:35 | 6.12 | 3da7f68e-aafc-4551-ae96-a22a27f1694c |
01321fec-7dbe-48be-afdc-aab1f990a719 | 72dfc99a-3cd4-42f7-9251-f4bfbccebc95 | 15/04/2020 13:42:56 | 15/04/2020 15:06:10 | 83.23 | 15/04/2020 15:06:10 | 15/04/2020 16:02:36 | 56.43 | e01699f4-1f25-4466-9498-433ce1cfd243 |
0145c4a9-99d6-4fdd-81c9-23b28bee8b05 | ca5d5a46-1993-4a7e-a62e-d83889dfb189 | 13/04/2020 12:20:45 | 13/04/2020 12:37:03 | 16.30 | 13/04/2020 12:37:03 | 13/04/2020 13:42:25 | 65.37 | 0334a1db-b370-4d52-8aa2-0a4778d89097 |
01b40aec-50f0-439b-9b82-3f38910ef977 | 7eb87549-e07c-43b7-afc8-0c935decc63a | 17/04/2020 12:26:46 | 17/04/2020 13:16:06 | 49.33 | 17/04/2020 13:16:06 | 17/04/2020 13:43:21 | 27.25 | 22c2dac8-798d-441f-af2a-775a47d4aad4 |
022713bb-2b94-41dd-9327-7d14c8c388cb | a7069744-0260-40de-a12e-0960004aab70 | 14/04/2020 11:23:06 | 14/04/2020 12:04:46 | 41.67 | 14/04/2020 12:04:46 | 14/04/2020 13:04:33 | 59.78 | fe829044-efd9-44f0-b14b-f5611cf81b25 |
Finally, we will create a table for users with booked chats from Event 3. We can left join queue_df onto chat_df on user_id, matching chat_start to queue_end. We must also ensure that queue_end holds a null value, as these records did not require a queue. There were no discrepancies among duplicated records here, so we can ignore queue clusters in this table.
# create a table for chat sessions that were booked in advance
booked = sqldf(
'''
SELECT
c.user_id,
queue_id,
queue_start,
queue_end,
ROUND((1.0 * (strftime('%s', substr(queue_end, 7, 4) || '-' || substr(queue_end, 4, 2) || '-' || substr(queue_end, 1, 2) || ' ' || substr(queue_end, 12))) - strftime('%s', substr(queue_start, 7, 4) || '-' || substr(queue_start, 4, 2) || '-' || substr(queue_start, 1, 2) || ' ' || substr(queue_start, 12))) / 60.0, 2) AS queue_minutes,
chat_start,
chat_end,
ROUND((1.0 * (strftime('%s', substr(chat_end, 7, 4) || '-' || substr(chat_end, 4, 2) || '-' || substr(chat_end, 1, 2) || ' ' || substr(chat_end, 12))) - strftime('%s', substr(chat_start, 7, 4) || '-' || substr(chat_start, 4, 2) || '-' || substr(chat_start, 1, 2) || ' ' || substr(chat_start, 12))) / 60.0, 2) AS chat_minutes,
meeting_room_id
FROM
chat_df c
LEFT JOIN
queue_df q
ON
c.user_id = q.user_id AND c.chat_start = q.queue_end
WHERE
q.queue_end IS NULL
ORDER BY c.user_id
'''
)
Here are the first 5 rows of our new table booked.
booked.head()
user_id | queue_id | queue_start | queue_end | queue_minutes | chat_start | chat_end | chat_minutes | meeting_room_id |
---|---|---|---|---|---|---|---|---|
01620d28-abfa-4784-8767-6a6c2fb6b0c1 | None | None | None | None | 14/04/2020 13:24:00 | 14/04/2020 13:29:31 | 5.52 | af24a750-89c7-4c83-a8db-b87f9282875b |
05651105-744b-4bd9-a3a8-7a36753e6e92 | None | None | None | None | 13/04/2020 19:30:12 | 13/04/2020 20:29:43 | 59.52 | 45282270-43a5-461c-9aa8-ded68026a038 |
066bd671-d2c8-4309-888a-4392a334c70e | None | None | None | None | 14/04/2020 19:00:09 | 14/04/2020 20:08:01 | 67.87 | 3b9c5902-6c82-4fb7-8e9b-0994762d29a3 |
07c0ee48-5e5c-4f5f-bc7d-1db581fcfa63 | None | None | None | None | 15/04/2020 20:05:37 | 15/04/2020 20:15:48 | 10.18 | 0adbb5ba-9ae5-4294-8f8f-8dd10750b83b |
07cbb93a-7f37-5fc8-a2d6-410b897681f2 | None | None | None | None | 15/04/2020 14:31:48 | 15/04/2020 15:30:35 | 58.78 | 4766b561-4770-4230-b75b-cef0d9c108d5 |
As we expected, all queue-related variables hold null values.
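This can be confirmed with a quick null check on the queue-related columns (an optional pandas check):
# confirm that every queue-related field is null for booked chats
print(booked[['queue_id', 'queue_start', 'queue_end', 'queue_minutes']].isna().all())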
Now that we have our three event tables, we will union them together in SQL to create a master dataset. We will also include a new descriptor variable event_type to indicate the event type of each record.
# create master dataset to export
all_events = sqldf('''
WITH all_events AS (
SELECT *
FROM queue_connect
UNION
SELECT *
FROM queue_fail
UNION
SELECT *
FROM booked
)
SELECT *,
CASE
WHEN queue_end IS NULL THEN 'booked'
WHEN chat_start IS NULL THEN 'queue_failure'
ELSE 'queue_success'
END AS event_type
FROM all_events
''')
Here are the first 5 rows of our master dataset all_events.
all_events.head()
user_id | queue_id | queue_start | queue_end | queue_minutes | chat_start | chat_end | chat_minutes | meeting_room_id | event_type |
---|---|---|---|---|---|---|---|---|---|
0015d01f-f8f1-4f8a-9c33-c22b45b3e59a | d6b735f4-dd34-4e1b-aa1b-0c6d1f84ef45 | 17/04/2020 11:23:02 | 17/04/2020 11:56:29 | 33.45 | None | None | NaN | None | queue_failure |
0027a586-b290-437e-9d33-c7cab0397128 | 22c19016-095e-42de-9a3c-fe49b453e9d1 | 16/04/2020 17:20:21 | 16/04/2020 17:43:28 | 23.12 | 16/04/2020 17:43:28 | 16/04/2020 17:49:35 | 6.12 | 3da7f68e-aafc-4551-ae96-a22a27f1694c | queue_success |
0070a3e7-fee7-48f8-82d7-af96b1d08082 | bc49da27-124f-410b-a97f-6dc36ac520ff | 17/04/2020 16:03:40 | 17/04/2020 16:03:57 | 0.28 | None | None | NaN | None | queue_failure |
007e63d9-51e0-4738-89d1-ba968cfa9820 | 1c45e072-284b-4145-b65e-91363ca7c7c6 | 15/04/2020 13:44:49 | 15/04/2020 13:49:26 | 4.62 | None | None | NaN | None | queue_failure |
007e63d9-51e0-4738-89d1-ba968cfa9820 | b4533ead-0cb9-4f44-bb5f-01e3bc74032e | 15/04/2020 14:14:51 | 15/04/2020 14:15:46 | 0.92 | None | None | NaN | None | queue_failure |
Next, we will import all_events into Tableau to visualize our data!
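Before switching to Tableau, we can glance at the distribution of event types and write the master dataset out to a CSV file. This is a minimal sketch; the output filename is illustrative rather than taken from the original project.
# quick look at the distribution of event types
print(all_events['event_type'].value_counts())
# export the master dataset for Tableau (illustrative filename)
all_events.to_csv("all_events.csv", index=False)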
Data Visualization
Using the following Tableau storyboard, we can gain a deeper understanding of user engagement within Booth’s chatroom feature. A detailed description of these findings can be found below the storyboard. Recall that this data was recorded over a one-week period.
What does the distribution of events in the master dataset look like?
Along the top of the first dashboard, we can see the total records for booked chats, queue successes, and queue failures. Failed queues greatly outnumber successful queues.
What queue attempt patterns can we see throughout the day?
On the first dashboard, locate the Summary area chart. In the yellow dropdown box, we will first select Hour to view the distribution of chatroom queues throughout the day. We see that users can attempt to join a chatroom at any hour of the day, even though chatroom availability does not open until 9:00am and closes at 9:00pm. Users may be inclined to start their queues early to increase their odds of joining a chatroom when availability opens.
We can see the first activity spike occurring between 11:00am and 12:00pm. Queue attempts slow down in the early afternoon and rise again after 4:00pm. The most popular time to queue for a chat was between 8:00pm and 9:00pm.
What queue success patterns can we see throughout the day?
Queue successes follow a similar pattern to queue attempts during operational chatroom hours. Using our cursor to hover over the area chart, we can take a closer look at queue success rates throughout the day.
Users that queued for a chat between the hours of 4:00pm and 5:00pm had the greatest success rate at 45.79%.
During operational chatroom hours, users that queued for a chat between 8:00pm and 9:00pm had the worst success rate at 3.19%. Considering that 8:00pm to 9:00pm is the most popular window to queue for a chat, staffing more counselors during this time may increase successful queues.
For users with successful queues, what time of day were they waiting the longest?
On the first dashboard, select Hour from the yellow dropdown box and navigate to the Wait for Successful Queues bar chart. Here, we can view the median queue duration for users that connected to a chatroom.
Note that, rather than average durations, we are looking at median durations in this graphic. This is due to the large variance in queue duration across the dataset: users who waited in a queue for an exceptionally long time (e.g. over 2 hours) inflate the average of this metric. Thus, we will focus on median time metrics throughout these dashboards.
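The skew behind this choice can be illustrated directly from the exported master dataset. The following pandas check (not part of the Tableau workbook) compares the mean and median queue duration for successful queues; given the long waits noted above, the mean should sit well above the median.
# compare mean vs. median queue duration for successful queues
successes = all_events[all_events['event_type'] == 'queue_success']
print(successes['queue_minutes'].mean())
print(successes['queue_minutes'].median())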
Throughout the week, users with successful queues that were initiated between 1:00pm and 2:00pm experienced the longest median queue duration. These users had a median queue duration of approximately 97 minutes.
Conversely, users with successful queues that were initiated between 8:00pm and 9:00pm enjoyed the shortest median queue duration. These users had a median queue duration of 3 minutes. However, keep in mind that the 8:00pm to 9:00pm queue window also had the worst queue success rate.