How does the system perform?

This notebook looks at the following questions:

How long do users wait to answer questions?
How long do users wait for the network to respond?
How many simultaneous users do we have?
How many simultaneous requests are served?

[1]:

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Matplotlib >= 3.6 renamed this style to 'seaborn-v0_8'
plt.style.use('seaborn')

[2]:

import caption_contest_data as ccd
responses = ccd.responses(577)
print("Number of responses:", len(responses))
responses.columns

Number of responses: 389898

[2]:

Index(['alg_label', 'network_delay', 'participant_uid', 'response_time',
       'target', 'target_id', 'target_reward', 'timestamp_query_generated',
       'label', 'contest', 'filename'],
      dtype='object')

[3]:

responses.iloc[0]

[3]:

alg_label                                                                KLUCB
network_delay                                                         0.670163
participant_uid              eea58658d9d40ceaf97f0bccbfa324_BuIKwH3xGWWASxp...
response_time                                                           28.324
target                       Your mother and I don't think working from hom...
target_id                                                                 4181
target_reward                                                                2
timestamp_query_generated                           2017-07-28 17:05:39.090118
label                                                           somewhat_funny
contest                                                                    577
filename                                                     577-responses.csv
Name: 0, dtype: object

Response time

How long does the average user wait before providing a response? That data is recorded in the responses, and we can plot a histogram. The cells below keep only responses that took between 0 and 15 seconds.

Waiting times are often modeled as an exponential random variable. Can we fit the PDF of an exponential random variable to the response times we see?

[4]:

most_responses = (responses['response_time'] >= 0) & (responses['response_time'] <= 15)
df = responses[most_responses].copy()

[5]:

# `density=True` replaces the deprecated `normed=True` keyword
df['response_time'].plot.hist(bins=50, density=True)
plt.xlabel('Response time (seconds)')
plt.show()

../_images/example-analyses_Performance_6_1.png
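The notebook asks whether an exponential PDF fits the response times but does not carry out the fit. One possible way to try it is sketched below. The helpers `fit_exponential` and `exponential_pdf` are hypothetical names, not part of `caption_contest_data`, and the samples are synthetic stand-ins for `df['response_time']`. The rate is the textbook maximum-likelihood estimate (one over the sample mean).

```python
import numpy as np

def fit_exponential(samples):
    """Return the maximum-likelihood rate (1 / mean) for non-negative samples."""
    samples = np.asarray(samples, dtype=float)
    if samples.size == 0 or np.any(samples < 0):
        raise ValueError("need at least one sample, all non-negative")
    return 1.0 / samples.mean()

def exponential_pdf(t, rate):
    """PDF of an exponential random variable: rate * exp(-rate * t) for t >= 0."""
    t = np.asarray(t, dtype=float)
    return np.where(t >= 0, rate * np.exp(-rate * t), 0.0)

# Synthetic stand-in for df['response_time']; this is NOT contest data.
rng = np.random.default_rng(0)
fake_times = rng.exponential(scale=4.0, size=50_000)

rate = fit_exponential(fake_times)
grid = np.linspace(0, 15, 200)   # same 0-15 second range as the histogram
pdf = exponential_pdf(grid, rate)
print(f"estimated mean wait: {1 / rate:.2f} s")
```

To compare against the real data, `plt.plot(grid, pdf)` could be drawn over the density histogram. Two caveats apply: the notebook discards responses longer than 15 seconds, so the plain estimate would understate the mean, and real response times include a minimum reading time, so an exponential may fit the left edge of the histogram poorly.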

[6]:

users = df.participant_uid.unique()
num_users = len(users)
print(num_users, "total users")

9628 total users

Network delay

How long does our system take to respond?

[7]:

most_delays = (responses['network_delay'] >= 0) & (responses['network_delay'] <= 2)
df = responses[most_delays]

[8]:

# Matplotlib >= 3.6 renamed this style to 'seaborn-v0_8-talk'
plt.style.use('seaborn-talk')
df['network_delay'].plot.hist(bins=100)
plt.title('Network delay')
plt.show()

../_images/example-analyses_Performance_10_0.png
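The histogram shows the shape of the delays but hides everything above the 2-second cutoff. A few summary numbers can complement it. The sketch below uses a hypothetical helper, `delay_summary`, and made-up delays standing in for `responses['network_delay']`. The choice of median, 95th percentile, and fraction above the cutoff is an assumption about what is useful, not something the notebook computes.

```python
import pandas as pd

def delay_summary(delays, cutoff=2.0):
    """Summarize delays (seconds): median, 95th percentile, share above cutoff."""
    delays = pd.Series(delays, dtype=float)
    return {
        "median": delays.quantile(0.5),
        "p95": delays.quantile(0.95),
        # Share of responses the histogram's cutoff would exclude
        "frac_above_cutoff": (delays > cutoff).mean(),
    }

# Made-up delays; NOT data from the contest.
fake_delays = [0.1, 0.2, 0.3, 0.4, 2.5]
summary = delay_summary(fake_delays)
print(summary)
```

On the real data this would be called as `delay_summary(responses['network_delay'])`, which also reports how much of the data the 0 to 2 second filter leaves out.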

Concurrent users

How many users hit our system within a short time window? The cells below count the questions served and the unique users in 5-second windows.

[9]:

df = responses.copy()
contest_start = df['timestamp_query_generated'].min()
contest_end = df['timestamp_query_generated'].max()
df = df.sort_values(by='timestamp_query_generated')
# Seconds since the first query; `.dt.total_seconds()` is vectorized over the column
df['seconds_elapsed'] = (df['timestamp_query_generated'] - contest_start).dt.total_seconds()

[10]:

total_seconds = (contest_end - contest_start).total_seconds()

[14]:

def find_users_in_range(start, k, total, resolution=None, times=None, participants=None):
    """Count questions served and unique users in the window [start, start + resolution)."""
    if k % 1000 == 0:
        print(k / total, "fraction")  # progress report
    end = start + resolution
    in_window = (times >= start) & (times < end)  # boolean mask over all responses
    n_users = participants[in_window].nunique()

    return {'questions served': in_window.sum(), 'n_users': n_users, 'start': start, 'end': end}

[15]:

from joblib import Parallel, delayed

times = df['seconds_elapsed'].values.astype("float32")
participants = df["participant_uid"].apply(hash)

measure = np.linspace(times.min(), times.max(), num=20_000)
print(np.diff(measure).min())
resolution = 5  # seconds
print(f"Launching {len(measure)//1000}k jobs...")
print(f"Resolution: {resolution} seconds")

30.568269038340077
Launching 20k jobs...
Resolution: 5 seconds

[17]:

kwargs = {"resolution": resolution, "times": times, "participants": participants}
stats = []
for k, m in enumerate(measure):
    stat = find_users_in_range(m, k, len(measure), **kwargs)
    stats.append(stat)

0.0 fraction
0.05 fraction
0.1 fraction
0.15 fraction
0.2 fraction
0.25 fraction
0.3 fraction
0.35 fraction
0.4 fraction
0.45 fraction
0.5 fraction
0.55 fraction
0.6 fraction
0.65 fraction
0.7 fraction
0.75 fraction
0.8 fraction
0.85 fraction
0.9 fraction
0.95 fraction

[18]:

print(stats[:3])

[{'questions served': 1, 'n_users': 1, 'start': 0.0, 'end': 5.0}, {'questions served': 0, 'n_users': 0, 'start': 30.568269038451923, 'end': 35.56826903845192}, {'questions served': 0, 'n_users': 0, 'start': 61.13653807690385, 'end': 66.13653807690385}]
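The loop above builds a boolean mask over every response for each of the 20,000 windows. Because the responses are sorted by time, each window's boundaries can instead be found by binary search. The sketch below shows that idea on toy data. `window_stats` is a hypothetical helper, not part of the notebook or of `caption_contest_data`, and it is meant to return the same fields as `find_users_in_range`.

```python
import numpy as np

def window_stats(times, participants, starts, resolution):
    """Count questions and unique users in each window [start, start + resolution)."""
    times = np.asarray(times, dtype="float64")
    participants = np.asarray(participants)

    # Sort once so that each window is a contiguous slice
    order = np.argsort(times, kind="stable")
    times, participants = times[order], participants[order]

    starts = np.asarray(starts, dtype="float64")
    # First index with time >= start, and first index with time >= end
    lo = np.searchsorted(times, starts, side="left")
    hi = np.searchsorted(times, starts + resolution, side="left")

    rows = []
    for start, a, b in zip(starts, lo, hi):
        rows.append({
            "questions served": int(b - a),
            "n_users": len(np.unique(participants[a:b])),
            "start": float(start),
            "end": float(start + resolution),
        })
    return rows

# Toy data; NOT data from the contest.
toy_times = [0.0, 1.0, 4.9, 5.0, 31.0]
toy_users = ["a", "a", "b", "c", "a"]
toy_stats = window_stats(toy_times, toy_users, starts=[0.0, 30.0], resolution=5)
print(toy_stats)
```

With the notebook's variables the call would be `window_stats(times, participants, measure, resolution)`. Only the responses inside each window are touched when counting unique users, so the work per window no longer grows with the total number of responses.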

[19]:

stats = pd.DataFrame(stats)
stats["minutes"] = stats["start"] / 60
stats["hours"] = stats["minutes"] / 60
stats["days"] = stats["hours"] / 24

[20]:

stats.sample(n=5)

[20]:

       questions served  n_users          start            end      minutes       hours      days
12154                19       17  371526.741893  371531.741893  6192.112365  103.201873  4.300078
13255                 1        1  405182.406105  405187.406105  6753.040102  112.550668  4.689611
16981                 0        0  519079.776542  519084.776542  8651.329609  144.188827  6.007868
17872                 1        1  546316.104255  546321.104255  9105.268404  151.754473  6.323103
2474                  3        2   75625.897601   75630.897601  1260.431627   21.007194  0.875300

[21]:

max_rate = stats['questions served'].max()
print(f'Maximum questions served in any {resolution}-second window: {max_rate}')

Maximum questions served in any 5-second window: 96
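Two pieces of arithmetic help in reading this peak. The sketch below copies its constants from the outputs printed earlier in the notebook (the 5-second resolution, the roughly 30.57-second spacing between window starts, and the peak of 96). The names `peak_per_second` and `coverage` are introduced here for illustration.

```python
resolution = 5                 # seconds per window, as set in the notebook
spacing = 30.568269038340077   # seconds between window starts, as printed by the notebook
max_rate = 96                  # peak questions in one window, as printed by the notebook

# Average request rate inside the busiest sampled window
peak_per_second = max_rate / resolution

# Share of the contest's duration that the sampled windows actually cover
coverage = resolution / spacing

print(f"Peak rate: {peak_per_second:.1f} questions/second (averaged over {resolution} s)")
print(f"Fraction of the contest covered by sampled windows: {coverage:.1%}")
```

The peak works out to about 19 questions per second. Because the windows cover only about 16% of the contest, the true busiest 5-second stretch may have fallen between sampled windows, so 96 is best read as a lower bound on the peak.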

[22]:

import matplotlib.pyplot as plt

w = 4.5
fig, axs = plt.subplots(ncols=2, figsize=(2 * w, 1*w))
ax = stats.plot(x="days", y="questions served", ax=axs[0])
ax.set_title(f"Questions served in {resolution} seconds")
ax.legend_.remove()

ax = stats.plot(x="days", y="n_users", ax=axs[1])
ax.set_title("Concurrent users")
ax.legend_.remove()

../_images/example-analyses_Performance_21_0.png

[23]:

# Zoom in on hours 97.5 through 104
idx = (97.5 <= stats.hours) & (stats.hours <= 104)

w = 4.5
fig, axs = plt.subplots(ncols=2, figsize=(2 * w, 1*w))
ax = stats[idx].plot(x="days", y="questions served", ax=axs[0])
ax.set_title(f"Questions served in {resolution} seconds")
ax.legend_.remove()

ax = stats[idx].plot(x="days", y="n_users", ax=axs[1])
ax.set_title("Concurrent users")
ax.legend_.remove()

../_images/example-analyses_Performance_22_0.png
