Correlation is useful for finding commonality between two signals. This commonality can also be considered redundancy.
The auto-correlation is defined as,
For a message, the auto-correlation sequence describes how well a message matches up with a shifted version of itself. When the message is not shifted at all, it matches up very well with itself. For non-repetitive, random or otherwise unpredictable messages, there is little in common with a shifted version of itself. The auto-correlation for a random message of 15 symbols is shown in the following plot.
The auto-correlation can be generalized to the cross-correlation between two messages.
Both are represented by a sequence of values of the number of matches at each shift position. For the purpose of calculation, shifts are barrel shifts, meaning the part that is shifted off the end is prepended back to the beginning.
A plot of cross-correlation between two random messages follows. Notice that even if the two messages each don’t match up very well when compared with a shifted version of themselves, they may still correlate when compared with each other at the appropriate shift amount.
Under an appropriate signaling alphabet (i.e. closed, finite set of unique symbols), the atomic portion of the autocorrelation calculation, x[n] * x[n+i] is simply a compare between two symbols at different points in the message.
Some example Python code for the correlation function is given below. Rxy is the calculated cross-correlation sequence between messages a and b, each length N.
Notice the nested loops. There may be a better way of doing this in Python, a better way to code it or a better solution, such as using lower-level routines like Numpy does, but this illustrates what makes correlation computationally expensive. For two sequences, each N long, it takes N2 operations to calculate the cross-correlation between the two. This is the drawback of using a serial machine, which for all intents and purposes is how most processors approach it. You would probably enjoy a significant speed up by using a highly-parallelized machine like a graphics processor, at the expense of some very low-level coding to trick it into calculating correlation for you, something it was probably not designed to do.
Even though it can take a while to calculate for long sequences, the auto-correlation function can be very useful, especially when analyzing a message of symbols. In a graph of the correlation function, the peaks are instances of repetition. Repetition occurs, for instance in the chorus of a song. Consider an entire song a message — a sequences of symbols. Consider only the lyrics to the song, stripping it entirely of its musical content. Each word of the song is a symbol; the specific order and collection of these symbols uniquely identifies the lyrics of the song.
Take, for example, the lyrics to Michael Jackson’s 1982 smash hit “Beat It.” The start of the chorus is recognizable by the oft-repeated title of the song. However, this is not the entire chorus. Can we find the chorus of the using auto-correlation? In fact we can; the autocorrelation function does a good job of finding the chorus in this example. The compare part of the auto-correlation calculation is a simple string compare: do the two words being compared match exactly or not?
- Flatten the song into a sequence of words.
- Calculate the auto-correlation for the sequence
- Find the index (not 0) of the auto-correlation that has the biggest value.
- Find the longest run of consecutive matches for the index found in step 3.
- Therein lies your chorus!
For “Beat It,” I plotted the auto-correlation.
In fact, it keeps going to index 333. The auto-correlation method finds two consecutive instances of the chorus, which is:
Just beat it! Beat it!No one wants to be defeatedShow them how funky and strong is your fightIt doesn’t matter who’s wrong or right
Notice that it finds the whole chorus, not just the oft-repeated “Beat” or “Beat It.” Other methods might focus on the most frequently appearing word or diagram (two words together) and miss the rest of the chorus, which actually contains much more thematic content. In this example, the most oft-repeated phrase, “Beat It,” is just part of the reason for the high-degree of correlation. The rest of the chorus, probably the less easily remembered (but equally important) part, “No one wants to be defeated…. who’s wrong or right,” matches as well when the message is shifted by the appropriate amount (28 words) and compared with itself. With other methods it can be difficult to tell that “No one wants to be defeated…” is part of the chorus and not the next verse.