How do we measure the correlation of time series? Pearson correlation analysis

When you discover that your time series have the similar trend, you may want to measure how much are they correlated. In that case, the Pearson correlation coefficient is one of the most widely used metric. It is widely used in the sciences. It was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s. Pearson’s correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. (Source: Wikipedia) If you are interested to know more about that, this paper may be relevant to you. For others who want to calculate the Pearson value, Scipy library provides a function named “pearsonr”. Alternatively, numpy library has the function named “corrcoef”. Here is my example:

(1) It can be easily discovered that the two plots have the similar trend, even though the scale of y values are different.

(2) By using pearsonr function of Scipy library, we calculate the Pearson correlation coefficient. Here are the Python codes.

from scipy.stats.stats import pearsonr
import numpy as np
import sys

if __name__ == "__main__":
    list1 = [241, 69, 72, 143, 128, 68, 126, 82, 126, 108, 68, 90, 81, 60, 72, 93, 80, 97, 65, 74, 71]
    list2 = [621711, 190310, 204282, 319612, 367879, 200600, 329108, 226406, 399833, 253989, 233108, 301069, 257548,
             206579, 255322, 268418, 279106, 304694, 216643, 236923, 254406]

    if len(list1) != len(list2):
        print("error, two series should contain same size of elements")
        sys.exit

    # scipy library
    print("scipy result: ", pearsonr(list1, list2))

    # numpy library
    print("numpy result: ", str(np.corrcoef(list1, list2)))

(3) As a result, we see that two series are highly correlated, with a Pearson coefficient value as 0.94. (Pearson’s R has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation)

(4) Alternatively, you may want to see how that value will be affected when we change one single value from the series. To discover that, change the value of the last element for the list1 from 71 to 710.

(5) You will observe that the Pearson score decreased significantly from 0.94 to 0.21.

Topic Discovery – Latent Dirichlet Allocation

Topic discovery is an important research area, and one of the most important algorithms used in this field is Latent Dirichlet Allocation, which is an unsupervised learning algorithm based on statistics, and its inventor is Columbia University Professor David M Blei. In his research paper published in 2013 (click to view the paper), he gives the details of the algorithm. (He is the co-author with Andrew Ng and Michael I. Jordan) Until today, many variations of the algorithm are invented for different needs, but I mostly focused on the LDA algorithms that are capable to discover topics on short texts, such as Tweets. I will share the prominent extensions of LDA from this post.