Kernel PCA Implementation in Julia

I am trying to implement kernel principal component analysis (kernel PCA) in a Julia notebook. More specifically, I am trying to replicate the process from this tutorial: https://sebastianraschka.com/Articles/2014_kernel_pca.html#References
The tutorial is in Python, however, and I am having trouble replicating the method in Julia.
Here is the code that I have so far in Julia:
using LinearAlgebra, CSV, Plots, DataFrames
function sq_norm(X, rows, cols)
# X should be MxN matrix, and it will do the square norm between all N-dim vectors
# rows is the number of rows (INT)
# cols is the number of columns
result = zeros(rows, rows)
for i in 1:rows
for j in 1:rows
sum = 0.0
for k in 1:cols
sum = (X[i, k] - X[j, k])^2
end
# print("this is the sum at i: ")
# print(i)
# print(" and j: ")
# print(j)
# print(" sum: ")
# print(sum)
# print("\n")
result[i, j] = sum
end
end
return result
end
function kernel_mat_maker(gamma, data, rows)
#data must be a square symmetric matrix
result = zeros(rows, rows)
for i in 1:rows
for j in 1:rows
result[i, j] = exp( (-gamma) * data[i, j])
end
end
return result
end
function center_k(K, rows)
one_N = ones(rows, rows)
one_N = (1/rows) * one_N
return K - one_N*K - K*one_N + one_N*K*one_N
end
function data_splitter(data, filter, key)
# data should be Nx2 matrix
# filter will be a Nx1 matrix composed of 1's and 0's
# sum = 0
# siz = size(filter)
# for i in 1:100
# sum += filter[i]
# end
output1 = DataFrame(A = 1:50, B = 0)
output2 = DataFrame(A = 1:50, B = 0)
print("everything fine where expected\n")
for i in 1:size(data, 1)
if filter[i] == 1
output1 = data[i, :]
print("saved to output1 fine\n")
end
end
return output1
end
# data1 = CSV.read("C:\\Users\\JP-27\\Desktop\\X1data.csv", header=false)
# data2 = CSV.read("C:\\Users\\JP-27\\Desktop\\X2data.csv", header=false)
data = CSV.read("C:\\Users\\JP-27\\Desktop\\data.csv", header=true)
gdf = groupby(data, :a)
plot(gdf[1].x, gdf[1].y, seriestype=:scatter, legend=nothing)
plot!(gdf[2].x, gdf[2].y, seriestype=:scatter)
# select(data, 2:3)
# filter = select(data, :1)
newData = select(data, 2:3)
# print("this is newData:\n")
# print(newData)
# print("\n")
nddf = DataFrame(newData)
# print("this is nddf:\n")
# print(nddf)
# print("\n")
# CSV.write("C:\\Users\\JP-27\\Desktop\\ju_data_preprocessing.csv", nddf)
step1 = sq_norm(data, 100, 2)
# df1 = DataFrame(step1)
# CSV.write("C:\\Users\\JP-27\\Desktop\\ju_sq_dists.csv", df1)
step2 = kernel_mat_maker(15,step1,100)
# df2 = DataFrame(step2)
# CSV.write("C:\\Users\\JP-27\\Desktop\\ju_mat_sq_dists.csv", df2)
step3 = center_k(step2, 100)
# df3 = DataFrame(step3)
# CSV.write("C:\\Users\\JP-27\\Desktop\\juliaK.csv", df3)
e_vals = eigvals(step3)
e_vcts = eigvecs(step3)
e_vcts = real(e_vcts)
# print("this is e_vcts\n")
# print(e_vcts)
# print("\n")
# e_vects = DataFrame(e_vcts)
# CSV.write("C:\\Users\\JP-27\\Desktop\\juliaE_vcts.csv", e_vects)
result = DataFrame(e_vcts[:, 99:100])
# step11 = sq_norm(data1, 50, 2)
# step12 = kernel_mat_maker(15,step11,50)
# step13 = center_k(step12, 50)
# step21 = sq_norm(data2, 50, 2)
# step22 = kernel_mat_maker(15,step21,50)
# step23 = center_k(step22, 50)
# vals1 = eigvals(step13)
# vals2 = eigvals(step23)
# evects1 = eigvecs(step13)
# evects2 = eigvecs(step23)
# evects1 = real(evects1)
# evects2 = real(evects2)
# dataevect1 = DataFrame(evects1[:, 49:50])
# dataevect2 = DataFrame(evects2[:, 49:50])
#now we extract the last two columns of our two processed 50x50 matrices
# plot(dataevect1[1], dataevect1[2], seriestype = :scatter, title = "My Scatter Plot")
# plot!(dataevect2[1], dataevect2[2], seriestype = :scatter, title = "My Scatter Plot")
Could anyone help me correct the implementation above? If you know of an easier way to do this that does not rely on a ready-made kernel PCA function, that would be extremely helpful too.

For your information, kernel PCA is implemented in the MultivariateStats package (https://multivariatestatsjl.readthedocs.io/en/stable/kpca.html).
Here is an implementation from scratch, if you are interested in the details:
https://github.com/Alexander-Barth/MachineLearningNotebooks/blob/master/kernel-pca.ipynb
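For comparison with the Python tutorial, here is a minimal sketch in plain Python (no NumPy) of the three steps that come before the eigendecomposition: pairwise squared distances, the RBF kernel, and kernel centering. Note that the inner loop accumulates with +=, whereas sq_norm in the question assigns with =, so it keeps only the squared difference of the last feature. The question also appears to call sq_norm on data rather than on the two feature columns in newData. The function names and the three sample points below are made up for illustration, and the eigendecomposition itself is left to a linear algebra library.

```python
import math


def squared_distances(points):
    """Pairwise squared Euclidean distances between the rows of `points`."""
    n = len(points)
    dists = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for a, b in zip(points[i], points[j]):
                total += (a - b) ** 2  # accumulate over features, do not overwrite
            dists[i][j] = total
    return dists


def rbf_kernel(dists, gamma):
    """Element-wise exp(-gamma * d) on a squared-distance matrix."""
    return [[math.exp(-gamma * d) for d in row] for row in dists]


def center_kernel(K):
    """K - 1n*K - K*1n + 1n*K*1n, where 1n is an n x n matrix filled with 1/n."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    col_means = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [K[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: three 2-D points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
dists = squared_distances(points)
K_centered = center_kernel(rbf_kernel(dists, gamma=15))
```

After centering, every row and column of the kernel matrix should sum to (numerically) zero, which is a cheap sanity check before computing eigenvectors.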


Use of tail() in out-of-sample prediction

Below you see an out-of-sample rolling window estimation I found here: (https://www.r-bloggers.com/2017/11/formal-ways-to-compare-forecasting-models-rolling-windows/)
Here is my question: I know the tail() function returns the last n rows of a dataset. But I don't understand its purpose when it's used in the random walk in line 13 or when calculating the errors in lines 17 and 18. Any help clarifying this would be much appreciated.
# = Number of windows and window size
w_size = 300
n_windows = nrow(X) - 300
# = Rolling Window Loop = #
forecasts = foreach(i=1:n_windows, .combine = rbind) %do%{
# = Select data for the window (in and out-of-sample) = #
X_in = X[i:(w_size + i - 1), ] # = change to X[1:(w_size + i - 1), ] for expanding window
X_out = X[w_size + i, ]
# = Regression Model = #
m1 = lm(infl0 ~ . - prodl0, data = X_in)
f1 = predict(m1, X_out)
# = Random Walk = #
f2 = tail(X_in$infl0, 1)
return(c(f1, f2))
}
# = Calculate and plot errors = #
e1 = tail(X[ ,"infl0"], nrow(forecasts)) - forecasts[ ,1]
e2 = tail(X[ ,"infl0"], nrow(forecasts)) - forecasts[ ,2]
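Here the function tail is applied to a vector, because you select only the "infl0" column. With n = 1, tail returns the last element of that vector, so f2 is the last in-sample value of infl0, which is the random walk forecast. In the error lines, tail(X[ ,"infl0"], nrow(forecasts)) returns the last nrow(forecasts) observed values, so that each forecast is compared with the observation it was predicting.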
df <- data.frame(A = c(1,2), B = c(3,4))
df[,"A"] # will return c(1,2)
tail(df[,"A"], 1) # will return 2
tail(df$B, 1) # will return 4
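The same logic can be sketched outside R. The Python snippet below is only an illustration under assumed data: the infl series is made up, the window size is shrunk to 3, and tail is a small helper written here to mimic R's tail() on a vector. It shows the two uses from the question: taking the last in-sample value as the random walk forecast, and taking the last len(forecasts) observations to line up with the forecasts.

```python
def tail(values, n=1):
    """Last n items of a sequence, mimicking R's tail() on a vector."""
    return list(values[-n:]) if n > 0 else []


infl = [2.0, 2.1, 1.9, 2.4, 2.6, 2.5]  # made-up series, not real data
w_size = 3
n_windows = len(infl) - w_size

forecasts = []
for i in range(n_windows):
    x_in = infl[i:i + w_size]           # in-sample window
    forecasts.append(tail(x_in, 1)[0])  # random walk: repeat the last in-sample value

# The last len(forecasts) observations are exactly the out-of-sample targets.
actual = tail(infl, len(forecasts))
errors = [a - f for a, f in zip(actual, forecasts)]
```

Here forecasts holds one value per window, and actual has the same length, so the subtraction is element by element.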

Function in R that performs multiple operations over columns of two datasets

I have two datasets, each with 5 columns and 10,000 rows. I want to calculate y from values in matching columns of the two datasets: column 1 in dataset 1 and column 1 in dataset 2, then column 2 in dataset 1 and column 2 in dataset 2, and so on. However, y needs to follow a set of rules before being calculated. What I have done so far doesn't work, and I cannot figure out why, or whether there is an easier way to do all of this.
Create data from t-distributions
mx20 <- as.data.frame(replicate(10000, rt(20,19)))
mx20.50 <- as.data.frame(replicate(10000, rt(20,19)+0.5))
Calculates the mean for each simulated sample
m20 <- apply(mx20, FUN=mean, MARGIN=2)
m20.05 <- apply(mx20.50, FUN=mean, MARGIN=2)
Steps 1 and 2 above are repeated for the remaining sample sizes, drawing from rt(30,29), rt(50,49), rt(100,99) and rt(1000,999).
Bind tables (create data.frame) for each t-distribution specification
tbl <- cbind(m20, m30, m50, m100, m1000)
tbl.50 <- cbind(m20.05, m30.05, m50.05, m100.05, m1000.05)
Finally, I want to calculate the y as specified above. But here is where I get totally lost. Please see below my best attempt so far.
y = (mtheo-m0)/(m1-m0), where y = 0 when m1 < m0 and y = y when m1 >= m0. mtheo is a constant (e.g. 0.50), m1 is value in column 1 of tbl and m0 is value in column 1 of tbl.50.
ycalc <- function(mtheo, m1, m0) {
ifelse(m1>=m0) {
y = (mteo-m0)/(m1-m0)
} ifelse(m1<m0) {
y=0
} returnValue(y)
}
You can try this. I used data frames instead of data tables.
This code is more versatile. You can add or remove parameters. Below are the parameters that you can use to create t distributions.
params = data.frame(
n = c(20, 30, 50, 100, 1000),
df = c(19, 29, 49, 99, 999)
)
And here is a loop that creates the values you need for each t distribution. You can ignore this part if you already have those values (or code to create those values).
tbl = data.frame(i = c(1:10000))
tbl.50 = data.frame(i = c(1:10000))
for (i in 1:nrow(params)) {
mx = as.data.frame(replicate(10000, rt(params[i, 1], params[i, 2])))
m <- apply(mx, FUN=mean, MARGIN=2)
tbl = cbind(tbl, m)
names(tbl)[ncol(tbl)] = paste("m", params[i, 1], sep="")
mx.50 = as.data.frame(replicate(10000, rt(params[i, 1], params[i, 2])+.5))
m.50 <- apply(mx.50, FUN=mean, MARGIN=2)
tbl.50 = cbind(tbl.50, m.50)
names(tbl.50)[ncol(tbl.50)] = paste("m", params[i, 1], ".50", sep="")
}
tbl = tbl[-1]
tbl.50 = tbl.50[-1]
And here is the loop that does the calculations. I save them in a data frame (y). Each column in this data frame is the result of your function applied for all rows.
mtheo = .50
y = data.frame(i = c(1:10000))
for (i in 1:nrow(params)) {
y$dum = 0
idx = which(tbl[, i] >= tbl.50[, i])
y[idx, ]$dum =
(mtheo - tbl.50[idx, i]) /
(tbl[idx, i] - tbl.50[idx, i])
names(y)[ncol(y)] = paste("y", params[i, 1], sep="")
}
y = y[-1]
You could try this, if the first column in tbl is called m1 and the first column in tbl.50 is called m0:
mteo <- 0.5
ycalc <- ifelse(tbl$m1 >= tbl.50$m0, (mteo - tbl.50$m0)/(tbl$m1 - tbl.50$m0),
ifelse(tbl$m1 < tbl.50$m0, 0, "no"))
Using the same column names provided by your code, and transforming your matrices into dataframes:
tbl <- data.frame(tbl)
tbl.50 <- data.frame(tbl.50)
mteo <- 0.5
ycalc <- ifelse(tbl$m20 >= tbl.50$m20.05, (mteo - tbl.50$m20.05)/(tbl$m20 - tbl.50$m20.05),
ifelse(tbl$m20 < tbl.50$m20.05, "0", "no"))
This results in:
head(ycalc)
[1] "9.22491706576716" "0" "0" "0" "0" "1.77027049630147"
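Note that the "0" and "no" strings turn the whole result into character values, as the output above shows. If numeric results are wanted, the rule can be written as a plain element-wise function. The following Python sketch is only an illustration of the rule from the question. The function names are hypothetical, and returning NaN when m1 equals m0 (where the denominator is zero) is an assumption, since the question does not say what should happen there.

```python
import math


def ycalc(mtheo, m1, m0):
    """y = (mtheo - m0) / (m1 - m0) when m1 >= m0, else 0."""
    if m1 < m0:
        return 0.0
    if m1 == m0:
        return math.nan  # assumption: the tie is undefined (zero denominator)
    return (mtheo - m0) / (m1 - m0)


def ycalc_columns(mtheo, col_m1, col_m0):
    """Apply ycalc element by element to two equally long columns."""
    return [ycalc(mtheo, a, b) for a, b in zip(col_m1, col_m0)]


# Made-up values standing in for one column of tbl (m1) and tbl.50 (m0).
y = ycalc_columns(0.5, [1.0, 0.2, 0.75], [0.0, 0.6, 0.25])
```

Each column pair can be passed through ycalc_columns in turn, which keeps the result numeric throughout.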

Two same type of dataframes perform differently in a function

Below is my data
set.seed(100)
toydata <- data.frame(A = sample(1:50,50,replace = T),
B = sample(1:50,50,replace = T),
C = sample(1:50,50,replace = T)
)
Below is my swapping function
derangement <- function(x){
if(max(table(x)) > length(x)/2) return(NA)
while(TRUE){
y <- sample(x)
if(all(y != x)) return(y)
}
}
swapFun <- function(x, n = 10){
inx <- which(x < n)
y <- derangement(x[inx])
if(length(y) == 1) return(NA)
x[inx] <- y
x
}
In the first case, I get the new data toy by swapping the entire data frame. The code is below:
toydata<-as.matrix(toydata)
toy<-swapFun(toydata)
toy<-as.data.frame(toy)
In the second case, I get the new data toy by swapping each column respectively. Below is the code:
toydata<-as.data.frame(toydata)
toy2 <- toydata # Work with a copy
toy2[] <- lapply(toydata, swapFun)
toy<-toy2
Below is the function that outputs the difference between the contingency tables before and after swapping.
# the function to compare contingency tables
f = function(x,y){
table1<-table(toydata[,x],toydata[,y])
table2<-table(toy[,x],toy[,y])
sum(abs(table1-table2))
}
# vectorise your function
f = Vectorize(f)
combn(x=names(toydata),
y=names(toydata), 2) %>%# create all combinations of your column names
t() %>% # transpose
data.frame(., stringsAsFactors = F) %>% # save as dataframe
filter(X1 != X2) %>% # exclude pairs of same
# column
mutate(SumAbs = f(X1,X2)) # apply function
In the second case, this mutate call works.
But in the first case, it does not work. It says:
+ filter(X1 != X2) %>% # exclude pairs of same column
+ mutate(SumAbs = f(X1,X2)) # apply function
Error in combn(x = names(toydata), y = names(toydata), 2) : n < m
However, in both cases toy is a data frame with the same dimensions, the same row names and the same column names. I am confused.
How can I fix it? Thanks.

Fast crosstabs and stats on all pairs of variables

I am trying to calculate a measure of association between all variables in a data.table. (This is not a stats question, but as an aside: the variables are all factors, and the measure is Cramér's V.)
Example dataset:
p = 50; n = 1e5; # actual dataset has p > 1e3, n > 1e5, much wider but barely longer
set.seed(1234)
obs <- as.data.table(
data.frame(
cbind( matrix(sample(c(LETTERS[1:4],NA), n*(p/2), replace=TRUE),
nrow=n, ncol=p/2),
matrix(sample(c(letters[1:6],NA), n*(p/2), replace=TRUE),
nrow=n, ncol=p/2) ),
stringsAsFactors=TRUE ) )
I am currently using the split-apply-combine approach, which involves looping (via plyr::adply) through all pairs of indices and returning one row for each pair. (I attempted to parallelize adply but failed.)
# Calculate Cramér's V between all variables -- my kludgey approach
pairs <- t( combn(ncol(obs), 2) ) # nx2 matrix contains indices of upper triangle of df
# library('doParallel') # I tried to parallelize -- bonus points for help here (Win 7)
# cl <- makeCluster(8)
# registerDoParallel(cl)
library('plyr')
out <- adply(pairs, 1, function(ix) {
complete_cases <- obs[,which(complete.cases(.SD)), .SDcols=ix]
chsq <- chisq.test(x= dcast(data = obs[complete_cases, .SD, .SDcols=ix],
formula = paste( names(obs)[ix], collapse='~'),
value.var = names(obs)[ix][1], # arbitrary
fun.aggregate=length)[,-1, with=FALSE] )
return(data.table(index_1 = ix[1],
var_1 = names(obs)[ix][1],
index_2 = ix[2],
var_2 = names(obs)[ix][2],
cramers_v = sqrt(chsq$statistic /
(sum(chsq$observed) *
(pmin(nrow(chsq$observed),
ncol(chsq$observed) ) -1 ) )
) )
)
})[,-1] #}, .parallel = TRUE)[,-1] # using .parallel returns Error in do.ply(i) :
# task 1 failed - "object 'obs' not found"
out <- data.table(out) # adply won't return a data.table
# stopCluster(cl)
What are my options for speeding up this calculation? My challenge is in passing the row-wise operation on pairs into the column-wise calculations in obs. I am wondering if it is possible to generate the column pairs directly into J, but the Force is just not strong enough with this data.table padawan.
First, I would go with 'long' data format as following:
obs[, id := 1:n]
mobs <- melt(obs, id.vars = 'id')
Next set key on data table setkeyv(mobs, 'id').
Finally, iterate through variables and do calculations on pairs:
out <- list()
for(i in 1:p) {
vari <- paste0('X', i)
tmp <- mobs[mobs[variable == vari]]
nn <- tmp[!(is.na(value) | is.na(i.value)), list(i.variable = i.variable[1], nij = length(id)), keyby = list(variable, value, i.value)]
cj <- nn[, CJ(value = value, i.value = i.value, sorted = FALSE, unique = TRUE), by = variable]
setkeyv(cj, c('variable', 'value', 'i.value'))
nn <- nn[cj]
nn[is.na(nij), nij := 0]
nn[, ni := sum(nij), by = list(variable, i.value)]
nn[, nj := sum(nij), by = list(variable, value)]
nn[, c('n', 'r', 'k') := list(sum(nij), length(unique(i.value)), length(unique(value))), by = variable]
out[[i]] <- nn[, list(i.variable = vari, cramers_v = sqrt(sum((nij - ni * nj / n) ^ 2 / (ni * nj / n)) / (n[1] * min(k[1] - 1, r[1] - 1)))), by = variable]
}
out <- rbindlist(out)
So you need to iterate through the variables only once. As you can see, I also wouldn't use chisq.test and would write the computations myself.
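To make the hand-written computation easier to check on a single pair of variables, here is a small Python sketch of Cramér's V computed directly from counts, with pairwise removal of missing values. It is an illustration only: the function name is made up, None stands in for NA, and no continuity correction is applied, so results for 2x2 tables can differ from R's chisq.test defaults.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V for two categorical sequences; None marks a missing value."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    cell = Counter(pairs)                 # observed count per (x, y) cell
    row = Counter(x for x, _ in pairs)    # marginal counts of x
    col = Counter(y for _, y in pairs)    # marginal counts of y
    r, k = len(row), len(col)
    if n == 0 or min(r, k) < 2:
        return math.nan                   # V is undefined with fewer than 2 levels
    chi2 = 0.0
    for x in row:
        for y in col:
            expected = row[x] * col[y] / n
            chi2 += (cell[(x, y)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(r, k) - 1)))


# Made-up examples: perfect association, then no association.
v_perfect = cramers_v(["A", "A", "B", "B", None], ["a", "a", "b", "b", "a"])
v_none = cramers_v(["A", "A", "B", "B"], ["a", "b", "a", "b"])
```

The two toy cases give the extremes of the statistic, which makes them handy for validating a faster data.table version against.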

How to vectorize a function in R

I need some help vectorizing the following code, because I believe that would make it more efficient. However, I do not know how to begin. I created a loop that goes through z. z has 3 columns and 112847 rows, which might be a reason it takes a long time. The 3 columns contain numbers that are used in the MACD() function.
library(quantmod)
library(TTR)
# get stock data
getSymbols('LUNA')
#Choose the Adjusted Close of a Symbol
stock <- Ad(LUNA)
#Create matrix for returns only
y <- stock
#Create a "MATRIX" by choosing the Adjusted Close
Nudata3 <- stock
#Sharpe Ratio Matrix
SR1<- matrix(NA, nrow=1)
# I want to create a table with all possible combinations from the ranges below
i = c(2:50)
k = c(4:50)
j = c(2:50)
# stores possible combinations into z
z <- expand.grid(i,k,j)
colnames(z)<- c("one","two","three")
n = 1
stretches <- length(z[,1])
while (n < stretches){
# I am trying to go through all the values in "z"
Nuw <- MACD((stock), nFast=z[n,1], nSlow=z[n,2], nSig=z[n,3], maType="EMA")
colnames(Nuw) <- c("MACD","Signal") #change the col names to create signals
x <- na.omit(merge((stock), Nuw))
x$sig <- NA
# Create trading signals
sig1 <- Lag(ifelse((x$MACD <= x$Signal),-1, 0)) # short when MACD < SIGNAL
sig2 <- Lag(ifelse((x$MACD >= x$Signal),1, 0)) # long when MACD > SIGNAL
x$sig <- sig1 + sig2
#calculate Returns
ret <- na.omit(ROC(Ad(x))*x$sig)
colnames(ret)<- c(paste(z[n,1],z[n,2],z[n,3],sep=","))
x <- merge(ret,x)
y <- merge(y,ret) #This creates a MATRIX with RETURNs ONLY
Nudata3 <- merge(Nudata3, x)
((mean(ret)/sd(ret)) * sqrt(252)) -> ANNUAL # Creates a Ratio
ANNUAL->Shrat # stores Ratio into ShRat
SR1 <- cbind(SR1,Shrat) # binds all ratios as it loops
n <- (n+1)
}
I would like to know how to vectorize the MACD() function, to speed up the process since the length of stretches is approx. 112847. It takes my computer quite some time to go through the loop itself.
First and foremost, a case-specific optimization: remove the cases where nFast > nSlow, as they don't make sense technically.
Secondly, you are creating objects and copying them over and over again. This is very expensive.
Thirdly, you can perhaps code this better by creating a matrix of signals in one loop and doing the rest of the operations in a vectorized manner.
I would code what you are doing something like this.
Please read the help pages of mapply, do.call, merge and sapply if you don't understand.
require(quantmod)
getSymbols("LUNA")
#Choose the Adjusted Close of a Symbol
stock <- Ad(LUNA)
# I want to create a table with all possible combinations from the ranges below
i = c(2:50)
k = c(4:50)
j = c(2:50)
# stores possible combinations into z
z <- expand.grid(i,k,j)
IMO: This is where your first optimization should be. Remove cases where i >= k
z <- z[z[,1]<z[,2], ]
It reduces the number of cases from 112847 to 57575
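As a quick check of those counts, the same grid and filter can be reproduced in a few lines of Python. This is only a verification sketch: the variable names are made up, and itertools.product orders the combinations differently from expand.grid, which does not affect the counts.

```python
from itertools import product

fast = range(2, 51)    # i in 2:50
slow = range(4, 51)    # k in 4:50
signal = range(2, 51)  # j in 2:50

grid = list(product(fast, slow, signal))
kept = [(f, s, g) for f, s, g in grid if f < s]  # keep only nFast < nSlow

full_count = len(grid)
kept_count = len(kept)
```

The filter roughly halves the work before any MACD call is made.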
#Calculate only once. No need to calculate this in every iteration.
stockret <- ROC(stock)
getStratRet <- function(nFast, nSlow, nSig, stock, stockret) {
x <- MACD((stock), nFast=nFast, nSlow=nSlow, nSig=nSig, maType="EMA")
x <- na.omit(x)
sig <- Lag(ifelse((x$macd <= x$signal),-1, 0)) + Lag(ifelse((x$macd >= x$signal),1, 0))
return(na.omit(stockret * sig))
}
RETURNSLIST <- do.call(merge, mapply(FUN = getStratRet, nFast = z[,1], nSlow = z[,2], nSig = z[,3], MoreArgs = list(stock = stock, stockret = stockret), SIMPLIFY = TRUE))
getAnnualSharpe <- function(ret) {
ret <- na.omit(ret)
return ((mean(ret)/sd(ret)) * sqrt(252))
}
SHARPELIST <- sapply(RETURNSLIST, FUN = getAnnualSharpe)
Results will be as below. Which column belongs to which combo of i, j, k is trivial.
head(RETURNSLIST[, 1:3])
## LUNA.Adjusted LUNA.Adjusted.1 LUNA.Adjusted.2
## 2007-01-10 0.012739026 -0.012739026 0
## 2007-01-11 -0.051959739 0.051959739 0
## 2007-01-12 -0.007968170 -0.007968170 0
## 2007-01-16 -0.007905180 -0.007905180 0
## 2007-01-17 -0.005235614 -0.005235614 0
## 2007-01-18 0.028315920 -0.028315920 0
SHARPELIST
## LUNA.Adjusted LUNA.Adjusted.1 LUNA.Adjusted.2 LUNA.Adjusted.3 LUNA.Adjusted.4 LUNA.Adjusted.5 LUNA.Adjusted.6
## 0.04939150 -0.07428392 NaN 0.02626382 -0.06789803 -0.22584987 -0.07305477
## LUNA.Adjusted.7 LUNA.Adjusted.8 LUNA.Adjusted.9
## -0.05831643 -0.08864845 -0.08221986
system.time(
+ RETURNSLIST <- do.call(merge, mapply(FUN = getStratRet, nFast = z[1:100,1], nSlow = z[1:100,2], nSig = z[1:100,3], MoreArgs = list(stock = stock, stockret = stockret), SIMPLIFY = TRUE)),
+ SHARPELIST <- sapply(RETURNSLIST, FUN = getAnnualSharpe)
+ )
user system elapsed
2.28 0.00 2.29
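For reference, the annualized Sharpe calculation in getAnnualSharpe can be sketched in Python as follows. This is a minimal illustration with a made-up return series. The function name and the periods_per_year parameter are assumptions, statistics.stdev is the sample standard deviation (like R's sd), and returning NaN for a constant series mirrors the NaN shown above for a strategy whose returns are all zero.

```python
import math
import statistics


def annual_sharpe(returns, periods_per_year=252):
    """mean / sample sd of the returns, scaled by sqrt(periods_per_year)."""
    clean = [r for r in returns if r is not None and not math.isnan(r)]
    if len(clean) < 2:
        return math.nan  # sample sd needs at least two observations
    sd = statistics.stdev(clean)
    if sd == 0:
        return math.nan  # constant returns: the ratio is undefined
    return statistics.mean(clean) / sd * math.sqrt(periods_per_year)


# Made-up daily returns, including one missing value that is dropped.
sharpe = annual_sharpe([0.01, -0.01, 0.02, 0.0, float("nan")])
flat = annual_sharpe([0.0, 0.0, 0.0])
```

Because the function takes a plain sequence, it can be applied column by column in the same way sapply is used above.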
