See sources
Detect anomalies by fitting the dataset to a normal (Gaussian) distribution with corresponding mean $\mu$ and variance $\sigma^2$.
Given a training set $X = \{x^{(1)}, \dots, x^{(m)}\}$ where each $x^{(j)} \in \mathbb{R}^n$, the distribution parameters for feature $i$ are:
$$\mu_i = \frac{1}{m} \sum_{j = 1}^{m} x^{(j)}_i$$$$\sigma^2_i = \frac{1}{m} \sum_{j = 1}^{m}(x^{(j)}_i - \mu_i)^2$$Use the distribution parameters to calculate the Gaussian probability density for feature $x_i$:
$$p(x_i; \mu_i, \sigma^2_i) = \frac{1}{\sqrt{2 \pi \sigma^2_i}} e^{- \frac{(x_i - \mu_i)^2}{2 \sigma^2_i}}$$The total probability density for a feature vector $x$ is defined as the product of the individual feature probabilities:
$$p(x) = \prod_{i = 1}^{n} p(x_i; \mu_i, \sigma^2_i)$$Alternatively, $p(x)$ can be computed using the covariance matrix:
$$\Sigma = \frac{1}{m} \sum_{j = 1}^{m} (x^{(j)} - \mu) (x^{(j)} - \mu)^T$$This matrix captures the covariances between the individual features $x_i$ of the vector $x$.
$$p(x) = \frac{1}{(2 \pi)^{n/2} |\Sigma|^{1/2}} \exp(- \frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu))$$Where $|\Sigma|$ is the determinant of the covariance matrix and $\Sigma^{-1}$ its inverse.
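As a concrete reference, the univariate route above can be sketched in NumPy. The names here are hypothetical, not necessarily those of the `anomaly_detection` helper module imported below:

```python
import numpy as np

def estimate_gaussian(x):
    """Maximum-likelihood estimates of per-feature mean and variance (1/m)."""
    mu = x.mean(axis=0)
    sigma_squared = ((x - mu) ** 2).mean(axis=0)
    return mu, sigma_squared

def get_probability(x, mu, sigma_squared):
    """p(x) as the product of independent univariate Gaussian densities."""
    densities = np.exp(-((x - mu) ** 2) / (2 * sigma_squared)) \
        / np.sqrt(2 * np.pi * sigma_squared)
    return densities.prod(axis=1)
```

When $\Sigma$ is diagonal, the multivariate formula above reduces to exactly this product.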
# import libs
import sys
sys.path.append("../")
from anomaly_detection import *
import numpy as np
import scipy.io
import matplotlib.pyplot as plt
import math
# load data
def load_performance_data():
    return scipy.io.loadmat('../data/ex8data1.mat')

performance_data = load_performance_data()
x = performance_data['X']
x_cv = performance_data['Xval']
y_cv = performance_data['yval']
print(x.shape)
# calculate distribution parameters
x_mean = get_mean(x)
print("Mean: \n", x_mean)
x_cov = get_covariance(x)
print("Covariance matrix: \n", x_cov)
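`get_mean` and `get_covariance` come from the helper module; plausible implementations, matching the parameter formulas above, might look like this sketch:

```python
import numpy as np

def get_mean(x):
    """Per-feature mean vector mu."""
    return x.mean(axis=0)

def get_covariance(x):
    """Maximum-likelihood covariance matrix (normalized by m, not m - 1)."""
    diff = x - x.mean(axis=0)
    return diff.T @ diff / x.shape[0]
```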
# plot data
def plot_performance_data():
    plt.scatter(x[:, 0], x[:, 1], marker=".")
    plt.xlabel("Latency (ms)")
    plt.ylabel("Throughput (mb/sec)")
    plt.title("Server Performance")

def plot_probability_contours():
    ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    xx = np.linspace(xlim[0], xlim[1], 50)
    yy = np.linspace(ylim[0], ylim[1], 50)
    YY, XX = np.meshgrid(yy, xx)
    xy = np.vstack([XX.ravel(), YY.ravel()]).T
    Z = get_multivariate_probability(xy, x_mean, x_cov).reshape(XX.shape)
    ax.contour(XX, YY, Z, levels=[math.pow(10, i) for i in range(-20, 0, 3)])
plot_performance_data()
plot_probability_contours()
plt.show()
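`get_multivariate_probability` is imported from the helper module; a minimal sketch consistent with the multivariate Gaussian formula above:

```python
import numpy as np

def get_multivariate_probability(x, mu, sigma):
    """Density of the multivariate Gaussian N(mu, sigma) at each row of x."""
    n = mu.size
    diff = x - mu.reshape(1, -1)
    coeff = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma)))
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), one value per row
    exponent = -0.5 * np.sum(diff @ np.linalg.inv(sigma) * diff, axis=1)
    return coeff * np.exp(exponent)
```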
By selecting a threshold $\epsilon$, a given example $x$ is classified as an anomaly if:
$$p(x) < \epsilon$$The $F_1$ score measures classification accuracy on the cross-validation set:
$$F_1 = \frac{2 \cdot prec \cdot rec}{prec + rec}$$Where $prec$ is precision: of the examples flagged as anomalies, how many actually are anomalies?
$tp$ - true positive, $fp$ - false positive.
$$prec = \frac{tp}{tp + fp}$$And $rec$ is recall: of the actual anomalies, how many were flagged?
$fn$ - false negative.
$$rec = \frac{tp}{tp + fn}$$# find threshold which gives the maximum F1 score
p_cv = get_multivariate_probability(x_cv, x_mean, x_cov)
threshold, score = select_threshold(p_cv, y_cv)
print("Threshold:", threshold, "\nF1 score:", score)
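`select_threshold` is also imported from the helper module. Assuming it sweeps candidate values of $\epsilon$ across the range of the cross-validation probabilities and keeps the one with the best $F_1$ score, a sketch:

```python
import numpy as np

def select_threshold(p, y):
    """Sweep thresholds over [min(p), max(p)]; return (best_epsilon, best_f1)."""
    best_threshold, best_f1 = 0.0, 0.0
    y = y.ravel().astype(bool)
    for threshold in np.linspace(p.min(), p.max(), 1000):
        predictions = p < threshold          # flagged as anomaly
        tp = np.sum(predictions & y)
        fp = np.sum(predictions & ~y)
        fn = np.sum(~predictions & y)
        if tp == 0:
            continue                         # precision/recall undefined
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_threshold = f1, threshold
    return best_threshold, best_f1
```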
# plot data with anomalies
def plot_anomalies(x, p, threshold):
    plt.scatter(x[p < threshold, 0], x[p < threshold, 1], marker="x")
p = get_multivariate_probability(x, x_mean, x_cov)
plot_performance_data()
plot_anomalies(x, p, threshold)
plt.show()