Python: Investigating Guest Stars in The Office
Background
This analysis comes from an unguided project offered by DataCamp. In this project, I utilize the Pandas
and Matplotlib
libraries to create an informative plot of episode data from The Office. The dataset used in this analysis comes from the Kaggle data science community.
Analysis
To begin this project, we can download the pandas
and Matplotlib
packages and alias them accordingly. We also need to import the CSV file containing the episode data. Once this is accomplished, we can take a look at the first few rows of our dataset.
### Import packages
import pandas as pd
import matplotlib.pyplot as plt
### Preview the data
data = pd.read_csv('datasets/office_episodes.csv')
data.head()
episode_number | season | title | description | ratings | votes | viewership_mil | duration | release_date | guest_stars | director | writers | has_guests | scaled_ratings |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Pilot | The premiere episode introduces the boss and s… | 7.5 | 4936 | 11.2 | 23 | 2005-03-24 | NaN | Ken Kwapis | Ricky Gervais, Stephen Merchant, and Greg Daniels | False | 0.28125 |
1 | 1 | Diversity Day | Michael’s off color remark puts a sensitivity … | 8.3 | 4801 | 6.0 | 23 | 2005-03-29 | NaN | Ken Kwapis | B. J. Novak | False | 0.53125 |
2 | 1 | Health Care | Michael leaves Dwight in charge of picking the… | 7.8 | 4024 | 5.8 | 22 | 2005-04-05 | NaN | Ken Whittingham | Paul Lieberstein | False | 0.37500 |
3 | 1 | The Alliance | Just for a laugh, Jim agrees to an alliance wi… | 8.1 | 3915 | 5.4 | 23 | 2005-04-12 | NaN | Bryan Gordon | Michael Schur | False | 0.46875 |
4 | 1 | Basketball | Michael and his staff challenge the warehouse … | 8.4 | 4294 | 5.0 | 23 | 2005-04-19 | NaN | Greg Daniels | Greg Daniels | False | 0.56250 |
To create a scatter plot, we will plot the episode number along the x-axis and viewership (in millions) along the y-axis for each episode. For each point in the scatter plot, we can introduce a color scheme that reflects the scaled ratings of each episode.
- Ratings < 0.25 are colored
red
. - Ratings ≥ 0.25 and < 0.50 are colored
orange
. - Ratings ≥ 0.50 and < 0.75 are colored
lightgreen
. - Ratings ≥ 0.75 are colored
darkgreen
.
### Set plot color based on scaled ratings
ratings_col = data.loc[:, 'scaled_ratings']
col_scheme = []
for i in range(len(ratings_col)):
if ratings_col[i] < 0.25:
col_scheme.append('red')
elif ratings_col[i] >= 0.25 and ratings_col[i] < 0.5:
col_scheme.append('orange')
elif ratings_col[i] >= 0.5 and ratings_col[i] < 0.75:
col_scheme.append('lightgreen')
else:
col_scheme.append('darkgreen')
Next, we can create a sizing system for points in the scatter plot such that…
- episodes with guest appearances have a marker size of 250
- episodes without guest appearances are sized 25
### Set plot sizing based on guest appearances
sizing = []
guest_col = data.loc[:,'has_guests']
for j in range(len(guest_col)):
if guest_col[j] == True:
sizing.append(250.0)
else:
sizing.append(25.0)
Finally, we can display the scatter plot.
### Plot figure
fig = plt.figure()
plt.scatter(data['episode_number'], data['viewership_mil'], color=col_scheme, s=sizing)
plt.title('Popularity, Quality, and Guest Appearances on the Office')
plt.xlabel('Episode Number')
plt.ylabel('Viewership (Millions)')
plt.show()
There appears to be a notable downward trend in viewership after the 125th episode of the show. Further, it is clear that the most watched episode of The Office had at least one guest appearance and was highly rated ( > 0.75). We can easily find which guest stars made an appearence in this episode.
### Identify top stars
most_viewed = data[data.loc[:,'viewership_mil']==max(data.loc[:,'viewership_mil'])]
print(most_viewed['guest_stars'])
77 Cloris Leachman, Jack Black, Jessica Alba
Name: guest_stars, dtype: object
Cloris Leachman, Jack Black, and Jessica Alba all made an appearence in the most watched episode of the show. This may explain why this episode was so popular!