SQL & Tableau: Online Chatroom Data Analysis

13 minute read

Background

Booth is a free online mental health service for individuals aged 11-25 years old. Service users can access an online chatroom feature allowing them to speak with trained mental health counselers.

Booth has provided two datasets containing information on service user activity within this chatroom feature. This data was collected during the week of April 13th, 2020. All service users in this data are located within the United States.

We have been asked to investigate trends in service user behavior regarding chat attempts, wait times, and successful chats from queues. First, we will use SQL and Python to join and format the two datasets. After that, we will visualize our findings in Tableau with interactive dashboards. All Python code will be executed in an IPython kernel using Jupyter Notebook.

Key Notes

  • ‘Queues’ are synonymous for waits to enter a Booth chatroom.
  • Not all successful chats require queues. Users can book chat appointments ahead of time.
  • Not all users have a successful chat after initiating a queue.
  • Users can be removed from their queues by a practitioner if they are not going to get a chat. They may also exit queues themselves.
  • Chatroom availability begins at 9:00am and ends at 9:00pm.

Data Analysis

For this analysis, we will be executing MySQL queries within a Python environment using the pandasql Python package. Before we begin, we must install this package on our desktop. We can then import all required packages for analysis.

# install pandasql on desktop
pip install pandasql

# import required libraries
from pandasql import sqldf
import pandas as pd

Next, we will initialize a SQL querying function in Python. This can be done using a simple lambda function.

# initialize SQL querying function
pysql = lambda q: sqldf(q, globals())

We will now import the Booth datasets as pandas dataframes. We will also rename the variables in both datasets for better clarity.

# import Booth datasets
queue_df = pd.read_csv("C:/Users/ethan/OneDrive/Documents/Booth/join_queue_events.csv")
chat_df = pd.read_csv("C:/Users/ethan/OneDrive/Documents/Booth/chat_start_events.csv")

# assign new variable names
queue_df = queue_df.rename(columns = {
    'Service User ID':'user_id', 
    'Join Queue Event ID':'queue_id',
    'date_su join queue':'queue_start',
    'date_su left queue':'queue_end'}
)

chat_df = chat_df.rename(columns = {
    'Service User ID':'user_id', 
    'chats starts':'chat_start',
    'chat ends':'chat_end',
    'meeting room ID':'meeting_room_id'}
)

queue_df contains four variables:

  • user_id indicates the unique ID assigned to each Kooth service user.
  • queue_id represents an ID for each distinct queue initialized by a user.
  • queue_start represents the datetime of an initialized queue.
  • queue_end represents the datetime of a terminated queue.

chat_df contains four variables:

  • user_id indicates the unique ID assigned to each Kooth service user.
  • chat_start represents the datetime of an initialized chat.
  • chat_end represents the datetime of a terminated chat.
  • meeting_room_id indicates the unique ID generated for each distinct chat session.

We can see that user_id is a shared variable in both dataframes. Before we join this data togther, we will perform a simple deduplication of rows in each dataframe using SQL.

# 0 rows were removed from queue_df
queue_df = sqldf(
'''
SELECT distinct *
FROM queue_df
'''
)

# 1383 duplicate rows were removed from chat_df
chat_df = sqldf(
'''
SELECT distinct *
FROM chat_df
'''
)

When considering the Kooth chatroom feature, service users fall under one of three events:

  1. Users who queued and failed to chat.
  2. Users who queued and chatted successfully.
  3. Users who did not queue and chatted successfully.

We will filter and join data from queue_df and chat_df by creating distinct tables for each of these three events. After creating these tables, we will combine them to create a master dataset that can be imported into Tableau for further analysis.

Starting with Event 1, we will create a table for user records that initiated a queue but failed to chat. While exploring this data, we found that some users had instantiated multiple queues ending at the same time. This could be the due to errors in the data logging system, or perhaps the result of a user opening multiple queues on different tabs of their browser.

Regardless of the cause for this discrepancy, we will only keep the record with the longest queue duration per queue cluster. These types of records will best reflect the true wait time of each user after starting a queue for a distinct chat session.

Using SQL, we will first create a common table expression RankedQueues that left-joins user_id from chat_df onto queue_df. Within RankedQueues, we will construct a field row_num using a window funciton to partition by users with queues ending at the same time. Within each aforementioned queue cluster, we will rank records by queue start time to indicate the longest (most accurate) queue duration for that distinct queue session.

# create a table for records with queues that failed to chat
queue_fail = sqldf(
'''
WITH RankedQueues AS (
    SELECT
        q.*,
        ROW_NUMBER() OVER (PARTITION BY q.user_id, q.queue_end ORDER BY q.queue_start) AS row_num,
        c.chat_start,
        c.chat_end,
        c.meeting_room_id
    FROM
        queue_df q
    LEFT JOIN
        chat_df c ON q.user_id = c.user_id
)

SELECT
    user_id,
    queue_id,
    queue_start,
    queue_end,
    ROUND((1.0 * (strftime('%s', substr(queue_end, 7, 4) || '-' || substr(queue_end, 4, 2) || '-' || substr(queue_end, 1, 2) || ' ' || substr(queue_end, 12))) - strftime('%s', substr(queue_start, 7, 4) || '-' || substr(queue_start, 4, 2) || '-' || substr(queue_start, 1, 2) || ' ' || substr(queue_start, 12))) / 60.0, 2) AS queue_minutes,
    chat_start,
    chat_end,
    ROUND((1.0 * (strftime('%s', substr(chat_end, 7, 4) || '-' || substr(chat_end, 4, 2) || '-' || substr(chat_end, 1, 2) || ' ' || substr(chat_end, 12))) - strftime('%s', substr(chat_start, 7, 4) || '-' || substr(chat_start, 4, 2) || '-' || substr(chat_start, 1, 2) || ' ' || substr(chat_start, 12))) / 60.0, 2) AS chat_minutes,
    meeting_room_id
FROM
    RankedQueues
WHERE
    row_num = 1 AND chat_start IS NULL
ORDER BY
    user_id, queue_end
'''
)

Now, we can display the first 5 rows of queue_fail.

queue_fail.head()
user_id queue_id queue_start queue_end queue_minutes chat_start chat_end chat_minutes meeting_room_id
0015d01f-f8f1-4f8a-9c33-c22b45b3e59a d6b735f4-dd34-4e1b-aa1b-0c6d1f84ef45 17/04/2020 11:23:02 17/04/2020 11:56:29 33.45 None None None None
0070a3e7-fee7-48f8-82d7-af96b1d08082 bc49da27-124f-410b-a97f-6dc36ac520ff 17/04/2020 16:03:40 17/04/2020 16:03:57 0.28 None None None None
007e63d9-51e0-4738-89d1-ba968cfa9820 1c45e072-284b-4145-b65e-91363ca7c7c6 15/04/2020 13:44:49 15/04/2020 13:49:26 4.62 None None None None
007e63d9-51e0-4738-89d1-ba968cfa9820 b4533ead-0cb9-4f44-bb5f-01e3bc74032e 15/04/2020 14:14:51 15/04/2020 14:15:46 0.92 None None None None
0086373e-4c0b-4c2a-b82e-9f3057f14de4 01bcf559-9928-4dbb-900a-35a73e4716e9 17/04/2020 19:57:50 17/04/2020 19:59:51 2.02 None None None None

As we can see, no information is available for chat-related variables because all of these queues failed to reach a chat room. Additionally, we can see that some users instantiated multiple queues that started and ended at different times, indicating multiple queue attempts for that particular user. We have also calculated the variables queue_minutes and chat_minutes to measure the queue and chat duration for each record respectively.

During this investigation, we find that one user instantiated a queue that appears twice in the data. The start time for these two queues are the same, but their end times differed…

sqldf(
'''
select *
from queue_fail
where queue_id = "cf7cb7ff-e366-42d4-a226-213851b208df"
''')
user_id queue_id queue_start queue_end queue_minutes chat_start chat_end chat_minutes meeting_room_id
912b513e-ce66-4cd8-85b4-6509d7d0ee23 cf7cb7ff-e366-42d4-a226-213851b208df 15/04/2020 19:22:19 15/04/2020 19:25:34 3.25 None None None None
912b513e-ce66-4cd8-85b4-6509d7d0ee23 cf7cb7ff-e366-42d4-a226-213851b208df 15/04/2020 19:22:19 15/04/2020 19:38:32 16.22 None None None None

This is an odd occurance. We might expect that, provided duplicated records have the same queue_start time, they would have the same queue_end time as well. We will assume that this is another error arising from the data logging system. To preserve accuracy in user queue durations, we will remove the record with the shortest queue time (3.25 min).

# update queue_fail
queue_fail = sqldf(
'''
select *
from queue_fail
where not (queue_id = "cf7cb7ff-e366-42d4-a226-213851b208df" and queue_minutes = 3.25)
''')

We will now look into Event 2 for users that entered a chat via queue. During our data exploration, we identified 667 records where chats commenced prior to their queues ending. Intuitively, users should not still be waiting for a chat after successfully joining a chatroom. Thus, we will assume all successful chats in this event to have equivalent queue_end and chat_start datetimes.

Using SQL, we will construct a new table for these successful chat records from queues. Again, for cases where a user instantiates multiple queues that end at the same time, only the queue with the longest duration will be kept. We will inner join the user ID of chat_df on queue_df where queue_end is equivalent to chat_start.

# create a table for queues that succeeded in chat connections
queue_connect = sqldf(
'''
WITH RankedQueues AS (
    SELECT
        q.*,
        ROW_NUMBER() OVER (PARTITION BY q.user_id, q.queue_end ORDER BY q.queue_start) AS row_num,
        c.chat_start,
        c.chat_end,
        c.meeting_room_id
        
    FROM
        queue_df q
    INNER JOIN
        chat_df c
    ON q.user_id = c.user_id
    AND queue_end = chat_start
)

SELECT
    user_id,
    queue_id,
    queue_start,
    queue_end,
    ROUND((1.0 * (strftime('%s', substr(queue_end, 7, 4) || '-' || substr(queue_end, 4, 2) || '-' || substr(queue_end, 1, 2) || ' ' || substr(queue_end, 12))) - strftime('%s', substr(queue_start, 7, 4) || '-' || substr(queue_start, 4, 2) || '-' || substr(queue_start, 1, 2) || ' ' || substr(queue_start, 12))) / 60.0, 2) AS queue_minutes,
    chat_start,
    chat_end,
    ROUND((1.0 * (strftime('%s', substr(chat_end, 7, 4) || '-' || substr(chat_end, 4, 2) || '-' || substr(chat_end, 1, 2) || ' ' || substr(chat_end, 12))) - strftime('%s', substr(chat_start, 7, 4) || '-' || substr(chat_start, 4, 2) || '-' || substr(chat_start, 1, 2) || ' ' || substr(chat_start, 12))) / 60.0, 2) AS chat_minutes,
    meeting_room_id
    
FROM
    RankedQueues
WHERE
    row_num = 1 AND chat_start IS NOT NULL
ORDER BY
    user_id, queue_end
'''
)

Here are the first 5 rows of queue_connect.

queue_connect.head()
user_id queue_id queue_start queue_end queue_minutes chat_start chat_end chat_minutes meeting_room_id
0027a586-b290-437e-9d33-c7cab0397128 22c19016-095e-42de-9a3c-fe49b453e9d1 16/04/2020 17:20:21 16/04/2020 17:43:28 23.12 16/04/2020 17:43:28 16/04/2020 17:49:35 6.12 3da7f68e-aafc-4551-ae96-a22a27f1694c
01321fec-7dbe-48be-afdc-aab1f990a719 72dfc99a-3cd4-42f7-9251-f4bfbccebc95 15/04/2020 13:42:56 15/04/2020 15:06:10 83.23 15/04/2020 15:06:10 15/04/2020 16:02:36 56.43 e01699f4-1f25-4466-9498-433ce1cfd243
0145c4a9-99d6-4fdd-81c9-23b28bee8b05 ca5d5a46-1993-4a7e-a62e-d83889dfb189 13/04/2020 12:20:45 13/04/2020 12:37:03 16.30 13/04/2020 12:37:03 13/04/2020 13:42:25 65.37 0334a1db-b370-4d52-8aa2-0a4778d89097
01b40aec-50f0-439b-9b82-3f38910ef977 7eb87549-e07c-43b7-afc8-0c935decc63a 17/04/2020 12:26:46 17/04/2020 13:16:06 49.33 17/04/2020 13:16:06 17/04/2020 13:43:21 27.25 22c2dac8-798d-441f-af2a-775a47d4aad4
022713bb-2b94-41dd-9327-7d14c8c388cb a7069744-0260-40de-a12e-0960004aab70 14/04/2020 11:23:06 14/04/2020 12:04:46 41.67 14/04/2020 12:04:46 14/04/2020 13:04:33 59.78 fe829044-efd9-44f0-b14b-f5611cf81b25

Finally, we will create a table for users with booked chats from Event 3. We can left join the user ID of queue_df on chat_df. We must also ensure that queue_end holds a null value as these records did not require a queue. There were no discrepancies in duplicated records, so we can ignore queue clusters in this table.

# create a table for chat sessions that were booked in advance
booked = sqldf(
'''
SELECT
    c.user_id,
    queue_id,
    queue_start,
    queue_end,
    ROUND((1.0 * (strftime('%s', substr(queue_end, 7, 4) || '-' || substr(queue_end, 4, 2) || '-' || substr(queue_end, 1, 2) || ' ' || substr(queue_end, 12))) - strftime('%s', substr(queue_start, 7, 4) || '-' || substr(queue_start, 4, 2) || '-' || substr(queue_start, 1, 2) || ' ' || substr(queue_start, 12))) / 60.0, 2) AS queue_minutes,
    chat_start,
    chat_end,
    ROUND((1.0 * (strftime('%s', substr(chat_end, 7, 4) || '-' || substr(chat_end, 4, 2) || '-' || substr(chat_end, 1, 2) || ' ' || substr(chat_end, 12))) - strftime('%s', substr(chat_start, 7, 4) || '-' || substr(chat_start, 4, 2) || '-' || substr(chat_start, 1, 2) || ' ' || substr(chat_start, 12))) / 60.0, 2) AS chat_minutes,
    meeting_room_id
FROM
    chat_df c
LEFT JOIN
    queue_df q 
ON 
    c.user_id = q.user_id AND c.chat_start = q.queue_end
WHERE
    q.queue_end IS NULL
ORDER BY c.user_id
'''
)

Here are the first 5 rows of our new table booked.

booked.head()
user_id queue_id queue_start queue_end queue_minutes chat_start chat_end chat_minutes meeting_room_id
01620d28-abfa-4784-8767-6a6c2fb6b0c1 None None None None 14/04/2020 13:24:00 14/04/2020 13:29:31 5.52 af24a750-89c7-4c83-a8db-b87f9282875b
05651105-744b-4bd9-a3a8-7a36753e6e92 None None None None 13/04/2020 19:30:12 13/04/2020 20:29:43 59.52 45282270-43a5-461c-9aa8-ded68026a038
066bd671-d2c8-4309-888a-4392a334c70e None None None None 14/04/2020 19:00:09 14/04/2020 20:08:01 67.87 3b9c5902-6c82-4fb7-8e9b-0994762d29a3
07c0ee48-5e5c-4f5f-bc7d-1db581fcfa63 None None None None 15/04/2020 20:05:37 15/04/2020 20:15:48 10.18 0adbb5ba-9ae5-4294-8f8f-8dd10750b83b
07cbb93a-7f37-5fc8-a2d6-410b897681f2 None None None None 15/04/2020 14:31:48 15/04/2020 15:30:35 58.78 4766b561-4770-4230-b75b-cef0d9c108d5

As we expected, all queue-related variables hold null values.

Now that we have our three event tables, we will union them together in SQL to create a master dataset. We will also include a new descriptor variable event_type to indicate the event type of each record.

# create master dataset to export
all_events = sqldf('''
WITH all_events AS (
    SELECT *
    FROM queue_connect
    UNION 
    SELECT *
    FROM queue_fail
    UNION
    SELECT *
    FROM booked
)

SELECT *, 
      CASE 
        WHEN queue_end IS NULL THEN 'booked'
        WHEN chat_start IS NULL THEN 'queue_failure'
        ELSE 'queue_success'
        END AS event_type
FROM all_events
''')

Here are the first 5 rows of our master dataset all_events.

all_events.head()
user_id queue_id queue_start queue_end queue_minutes chat_start chat_end chat_minutes meeting_room_id  
0015d01f-f8f1-4f8a-9c33-c22b45b3e59a d6b735f4-dd34-4e1b-aa1b-0c6d1f84ef45 17/04/2020 11:23:02 17/04/2020 11:56:29 33.45 None None NaN None queue_failure
0027a586-b290-437e-9d33-c7cab0397128 22c19016-095e-42de-9a3c-fe49b453e9d1 16/04/2020 17:20:21 16/04/2020 17:43:28 23.12 16/04/2020 17:43:28 16/04/2020 17:49:35 6.12 3da7f68e-aafc-4551-ae96-a22a27f1694c queue_success
0070a3e7-fee7-48f8-82d7-af96b1d08082 bc49da27-124f-410b-a97f-6dc36ac520ff 17/04/2020 16:03:40 17/04/2020 16:03:57 0.28 None None NaN None queue_failure
007e63d9-51e0-4738-89d1-ba968cfa9820 1c45e072-284b-4145-b65e-91363ca7c7c6 15/04/2020 13:44:49 15/04/2020 13:49:26 4.62 None None NaN None queue_failure
007e63d9-51e0-4738-89d1-ba968cfa9820 b4533ead-0cb9-4f44-bb5f-01e3bc74032e 15/04/2020 14:14:51 15/04/2020 14:15:46 0.92 None None NaN None queue_failure

Next, we will import all_events into Tableau to visualize our data!

Data Visualization

Using the following Tableau storyboard, we can gain a deeper understanding of user engagement within Kooth’s chatroom feature. A detailed description of these findings can be located below the storyboard. Recall that this data was recorded over a one-week period.

What does the distribution of events in the master dataset look like?

Along the top of the first dashboard, we can see the total records for booked chats, queue successes, and queue failures. The number of failed queues greatly outnumbers successful queues into chats.

What queue attempt patterns can we see throughout the day?

On the first dashboard, locate the Summary area chart. In the yellow dropdown box, we will first select Hour to view the distribution of chatroom queues throughout the day. We see that users can attempt to join a chatroom at any hour of the day. However, chatroom availability does not open until 9:00am and closes after 9:00pm. Users may be inclined to start their queues early to increase their odds of joining a chatroom when availability opens.

We can see the first activity spike occurring between 11:00am and 12:00pm. Queue attempts slow down in the early afternoon and rise again after 4:00pm. The most popular time to queue for a chat occured between 8:00pm and 9:00pm.

What queue success patterns can we see throughout the day?

Queue successes follow a similar pattern to queue attempts during operational chatroom hours. Using our cursor to hover over the area chart, we can take a closer look at queue success rates throughout the day.

Users that queued for a chat between the hours of 4:00pm and 5:00pm had the greatest success rate at 45.79%.

During operational chatroom hours, users that queued for a chat between 8:00pm and 9:00pm had the worst success rate at 3.19%. Considering that 8:00pm to 9:00pm is the most popular window to queue for a chat, staffing more counselers during this time may increase successful queues.

For users with successful queues, what time of day were they waiting the longest?

On the first dashboard, select Hour from the yellow dropdown box and navigate to the Wait for Successful Queues bar chart. Here, we can view the median queue duration for users that connected to a chatroom.

Note that, rather than average durations, we are looking at median durations in this graphic. This is due to the large variance in queue duration across all records in the dataset. Users who waited in a queue for an exeptionally long time (i.e. over 2 hours) inflate the average of this metric. Thus, we will be focusing on median time metrics throughout these dashboards.

Throughout the week, users with successful queues that were initiated between 1:00pm and 2:00pm experienced the longest median queue duration. These users had a median queue duration of approximately 97 minutes.

On the contrary, users with successful queues that were initiated between 8:00pm and 9:00pm enjoyed the shortest median queue duration. These users had a median queue duration of 3 minutes. However, keep in mind that the 8:00pm to 9:00pm queue window also had the worst queue success rate.