# Case study: PCA in financial analysis

Introduction: Suppose you have samples located in a space defined by several variables (usually more than three). If you could visualize all these variables simultaneously, there would be little need for further data manipulation. With more than three dimensions, however, we usually need some help. Loosely speaking, Principal Component Analysis (PCA) takes your cloud of data points and rotates it so that the directions of maximum variability become visible. In other words, it identifies the most important gradients in the data.
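Stated a little more formally, the "rotation" is an eigendecomposition of the covariance or correlation matrix of the variables. The formulation below is the standard textbook one; the symbols R, V, Λ and p are notation chosen here for illustration and are not taken from the library.

```latex
R = V \Lambda V^{\mathsf{T}}, \qquad
\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p), \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p,
\qquad
\text{share of variability along axis } k = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j}.
```

The columns of V are the rotated axes (the principal components), and each eigenvalue measures how much of the total variability lies along its axis. When R is a correlation matrix, the eigenvalues sum to p, the number of variables.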

Case and data: The example follows the description available here. Suppose today is June 30, 2000. We consider a random vector Z whose components represent the simple price returns that specific European currencies will realize versus the US dollar (USD) over the upcoming trading day:

Based on a time series analysis of the historical price data, covariance and correlation matrices for Z can be constructed:
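The two matrices are related by a simple rescaling: each covariance is divided by the product of the two standard deviations involved. The sketch below illustrates that relationship in plain C++; the function name `covarianceToCorrelation` and the `Matrix` alias are hypothetical helpers introduced for this illustration and are not part of the library used in this case study.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Convert a covariance matrix into the corresponding correlation matrix:
// corr[i][j] = cov[i][j] / (sigma_i * sigma_j), with sigma_i = sqrt(cov[i][i]).
// Hypothetical helper for illustration only.
Matrix covarianceToCorrelation(const Matrix& cov) {
    const std::size_t n = cov.size();
    std::vector<double> sigma(n);
    for (std::size_t i = 0; i < n; ++i) {
        assert(cov[i].size() == n);  // matrix must be square
        assert(cov[i][i] > 0.0);     // variances must be positive
        sigma[i] = std::sqrt(cov[i][i]);
    }
    Matrix corr(n, std::vector<double>(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            corr[i][j] = cov[i][j] / (sigma[i] * sigma[j]);
    return corr;
}
```

For a made-up 2×2 covariance matrix with variances 4 and 9 and covariance 2, the off-diagonal correlation is 2 / (2 · 3) ≈ 0.333. Working with correlations puts all currencies on the same scale, so no single high-variance currency dominates the analysis.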

Now we will use PCA to try to reduce the number of variables (currencies) needed to describe the time series stored in Z. All we have to do is create a TMtxPCA control, supply it with the covariance or correlation matrix, run the PCA, and interpret the results.

1) Set up the TMtxPCA control: We begin by creating a TMtxPCA control. Since we will be analyzing the correlation matrix, we set the TMtxPCA.PCAMode property to PCACorrMat and the TMtxPCA.DataFormat property to DataFormatCorrCov. Next, we populate the TMtxPCA.Data matrix with the correlation matrix values.

``````
uses MtxVec, StatTools, Statistics, Math387;

procedure PCAAnalysis;
var
  pca: TMtxPCA;
begin
  pca := TMtxPCA.Create(nil);
  try
    // we'll be doing PCA on correlation matrix
    pca.DataFormat := DataFormatCorrCov;
    pca.PCAMode := PCACorrMat;
    // setup correlation matrix
    pca.Data.SetIt(7, 7, false,
      [1.0000, 0.9488, 0.9530, 0.5804, 0.9197, 0.7682, 0.7388,
       0.9488, 1.0000, 0.9911, 0.5523, 0.9717, 0.8229, 0.7940,
       0.9530, 0.9911, 1.0000, 0.5554, 0.9698, 0.8184, 0.7894,
       0.5804, 0.5523, 0.5554, 1.0000, 0.5276, 0.5031, 0.4512,
       0.9197, 0.9717, 0.9698, 0.5276, 1.0000, 0.8124, 0.7855,
       0.7682, 0.8229, 0.8184, 0.5031, 0.8124, 1.0000, 0.8434,
       0.7388, 0.7940, 0.7894, 0.4512, 0.7855, 0.8434, 1.0000]);
``````

``````
#include "MtxVecCpp.h"
#include "Statistics.hpp"
#include "Math387.hpp"
#include "MtxVec.hpp"
#include "StatTools.hpp"

void PCAAnalysis(void)
{
  TMtxPCA* pca = new TMtxPCA(NULL);
  try
  {
    // we'll be doing PCA on correlation matrix
    pca->DataFormat = DataFormatCorrCov;
    pca->PCAMode = PCACorrMat;
    // setup correlation matrix
    pca->Data->SetIt(7, 7, false, OPENARRAY(TSample,
      (1.0000, 0.9488, 0.9530, 0.5804, 0.9197, 0.7682, 0.7388,
       0.9488, 1.0000, 0.9911, 0.5523, 0.9717, 0.8229, 0.7940,
       0.9530, 0.9911, 1.0000, 0.5554, 0.9698, 0.8184, 0.7894,
       0.5804, 0.5523, 0.5554, 1.0000, 0.5276, 0.5031, 0.4512,
       0.9197, 0.9717, 0.9698, 0.5276, 1.0000, 0.8124, 0.7855,
       0.7682, 0.8229, 0.8184, 0.5031, 0.8124, 1.0000, 0.8434,
       0.7388, 0.7940, 0.7894, 0.4512, 0.7855, 0.8434, 1.0000)));
``````

``````
using Dew.Math;
using Dew.Math.Units;
using Dew.Stats.Units;
using Dew.Stats;

namespace Dew.Tests
{
  private void PCAAnalysis()
  {
    TMtxPCA pca = new TMtxPCA();
    try
    {
      // we'll be doing PCA on correlation matrix
      pca.DataFormat = DataFormatCorrCov;
      pca.PCAMode = PCACorrMat;
      // setup correlation matrix (7 variables)
      pca.Data.SetIt(7, 7, false, new double[]
        {1.0000, 0.9488, 0.9530, 0.5804, 0.9197, 0.7682, 0.7388,
         0.9488, 1.0000, 0.9911, 0.5523, 0.9717, 0.8229, 0.7940,
         0.9530, 0.9911, 1.0000, 0.5554, 0.9698, 0.8184, 0.7894,
         0.5804, 0.5523, 0.5554, 1.0000, 0.5276, 0.5031, 0.4512,
         0.9197, 0.9717, 0.9698, 0.5276, 1.0000, 0.8124, 0.7855,
         0.7682, 0.8229, 0.8184, 0.5031, 0.8124, 1.0000, 0.8434,
         0.7388, 0.7940, 0.7894, 0.4512, 0.7855, 0.8434, 1.0000});
``````
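Since the 49 values are typed in by hand, it can be worth checking that they really form a correlation matrix before running PCA. The following standalone C++ sketch does that without the library; `kCorrelation` and `looksLikeCorrelationMatrix` are hypothetical names introduced here, and the three checks (unit diagonal, symmetry, entries within [-1, 1]) are necessary conditions only, not a full validity test.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// The 7x7 correlation matrix from the listings above, stored row by row.
const std::vector<double> kCorrelation = {
    1.0000, 0.9488, 0.9530, 0.5804, 0.9197, 0.7682, 0.7388,
    0.9488, 1.0000, 0.9911, 0.5523, 0.9717, 0.8229, 0.7940,
    0.9530, 0.9911, 1.0000, 0.5554, 0.9698, 0.8184, 0.7894,
    0.5804, 0.5523, 0.5554, 1.0000, 0.5276, 0.5031, 0.4512,
    0.9197, 0.9717, 0.9698, 0.5276, 1.0000, 0.8124, 0.7855,
    0.7682, 0.8229, 0.8184, 0.5031, 0.8124, 1.0000, 0.8434,
    0.7388, 0.7940, 0.7894, 0.4512, 0.7855, 0.8434, 1.0000};

// Returns true if the n*n values in m have a unit diagonal, are symmetric,
// and lie within [-1, 1]. Hypothetical helper for illustration only.
bool looksLikeCorrelationMatrix(const std::vector<double>& m, std::size_t n,
                                double tol = 1e-12) {
    assert(n > 0);
    if (m.size() != n * n) return false;
    for (std::size_t i = 0; i < n; ++i) {
        if (std::fabs(m[i * n + i] - 1.0) > tol) return false;  // unit diagonal
        for (std::size_t j = i + 1; j < n; ++j) {
            const double upper = m[i * n + j];
            const double lower = m[j * n + i];
            if (std::fabs(upper - lower) > tol) return false;   // symmetry
            if (std::fabs(upper) > 1.0 + tol) return false;     // |r| <= 1
        }
    }
    return true;
}
```

Calling `looksLikeCorrelationMatrix(kCorrelation, 7)` returns true for the values used in this case study; a single mistyped off-diagonal entry would break the symmetry check.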

2) Run PCA: A single call to the TMtxPCA.Recalc() method triggers the PCA calculation. The results are:

• Principal components, stored in the TMtxPCA.PC matrix,
• Eigenvalues, stored in the TMtxPCA.EigValues vector,
• Percentage of the total variance explained by each eigenvalue (eigenvector), stored in the TMtxPCA.TotalVarPct vector.
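To make the eigenvalue and percentage results less of a black box, the sketch below recomputes comparable quantities for the same correlation matrix in plain C++. It uses the classical cyclic Jacobi rotation method, chosen here only because it is short and self-contained; this is an assumption for illustration and says nothing about how the library computes its results. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// The 7x7 correlation matrix used in this case study.
Matrix correlationMatrix() {
    return {{1.0000, 0.9488, 0.9530, 0.5804, 0.9197, 0.7682, 0.7388},
            {0.9488, 1.0000, 0.9911, 0.5523, 0.9717, 0.8229, 0.7940},
            {0.9530, 0.9911, 1.0000, 0.5554, 0.9698, 0.8184, 0.7894},
            {0.5804, 0.5523, 0.5554, 1.0000, 0.5276, 0.5031, 0.4512},
            {0.9197, 0.9717, 0.9698, 0.5276, 1.0000, 0.8124, 0.7855},
            {0.7682, 0.8229, 0.8184, 0.5031, 0.8124, 1.0000, 0.8434},
            {0.7388, 0.7940, 0.7894, 0.4512, 0.7855, 0.8434, 1.0000}};
}

// Eigenvalues of a symmetric matrix via cyclic Jacobi rotations, returned in
// descending order. The copy `a` is driven towards diagonal form.
std::vector<double> symmetricEigenvalues(Matrix a) {
    const std::size_t n = a.size();
    for (const auto& row : a) assert(row.size() == n);  // must be square
    for (int sweep = 0; sweep < 100; ++sweep) {
        double off = 0.0;  // sum of squared upper off-diagonal entries
        for (std::size_t p = 0; p < n; ++p)
            for (std::size_t q = p + 1; q < n; ++q) off += a[p][q] * a[p][q];
        if (off < 1e-22) break;  // effectively diagonal
        for (std::size_t p = 0; p < n; ++p) {
            for (std::size_t q = p + 1; q < n; ++q) {
                if (std::fabs(a[p][q]) < 1e-300) continue;
                // Rotation (c, s) chosen so that the new a[p][q] is zero.
                const double theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q]);
                const double t = (theta >= 0.0 ? 1.0 : -1.0) /
                                 (std::fabs(theta) + std::sqrt(theta * theta + 1.0));
                const double c = 1.0 / std::sqrt(t * t + 1.0);
                const double s = t * c;
                for (std::size_t k = 0; k < n; ++k) {  // A <- A * J
                    const double akp = a[k][p], akq = a[k][q];
                    a[k][p] = c * akp - s * akq;
                    a[k][q] = s * akp + c * akq;
                }
                for (std::size_t k = 0; k < n; ++k) {  // A <- J^T * A
                    const double apk = a[p][k], aqk = a[q][k];
                    a[p][k] = c * apk - s * aqk;
                    a[q][k] = s * apk + c * aqk;
                }
            }
        }
    }
    std::vector<double> eig(n);
    for (std::size_t i = 0; i < n; ++i) eig[i] = a[i][i];
    std::sort(eig.begin(), eig.end(), std::greater<double>());
    return eig;
}

// Share of total variance (in percent) carried by each eigenvalue.
std::vector<double> percentOfVariance(const std::vector<double>& eig) {
    double total = 0.0;
    for (double e : eig) total += e;
    assert(total > 0.0);
    std::vector<double> pct;
    for (double e : eig) pct.push_back(100.0 * e / total);
    return pct;
}
```

Two properties can be checked by hand. The eigenvalues of a correlation matrix sum to its trace, here 7. The largest eigenvalue is at least the sum of all matrix entries divided by 7 (about 5.60) and at most the largest row sum (about 6.08), so the first principal component alone accounts for roughly 80% to 87% of the total variance.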

Using these results, we can decide which and how many variables (currencies) are needed to describe the currency time series.
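One common rule of thumb for the "how many" question is to keep the smallest number of leading components whose cumulative share of variance reaches a chosen threshold. The sketch below shows that rule in plain C++; the function name `componentsNeeded` is hypothetical, the threshold is an analyst's choice rather than something the library prescribes, and the input is assumed to be sorted in descending order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Smallest number of leading components whose cumulative share of variance
// (in percent) reaches thresholdPct. `pct` is assumed sorted in descending
// order. Hypothetical helper for illustration only.
std::size_t componentsNeeded(const std::vector<double>& pct, double thresholdPct) {
    assert(thresholdPct > 0.0 && thresholdPct <= 100.0);
    double cumulative = 0.0;
    for (std::size_t k = 0; k < pct.size(); ++k) {
        cumulative += pct[k];
        if (cumulative >= thresholdPct) return k + 1;
    }
    return pct.size();  // threshold never reached: keep everything
}
```

With made-up shares of 80, 10, 5 and 5 percent, a 90 percent threshold keeps two components and a 75 percent threshold keeps one. Other selection rules exist, so treat the threshold as a starting point rather than a fixed answer.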