Using K-Prototypes in R

I'd like to try k-prototypes for my data, but when I use the code:
kpres <- kproto("name of data", "number of cluster", lambda = xx)
I get the following error:
Error in Ops.data.frame(x[, j], rep(protos[i, j], nrows)) : list of
length 3994 not meaningful
Does somebody know how to fix it?
Thanks!!

You need to convert the data into a dataframe first and then pass it to kproto.
For example:
data <- as.data.frame(data)
kpres <- kproto(data, "number of cluster", lambda = xx)
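For what it's worth, a minimal end-to-end sketch, assuming kproto() here is the one from the clustMixType package; the data frame, column names and k = 3 below are placeholders, not taken from the question:
library(clustMixType)

# Placeholder mixed-type data: kproto() expects a data.frame containing both
# numeric and factor columns (purely numeric data would go to kmeans() instead).
df <- data.frame(
  income = rnorm(100, mean = 50000, sd = 10000),
  region = factor(sample(c("north", "south", "east"), 100, replace = TRUE))
)

df <- as.data.frame(df)       # ensure a plain data.frame, not a matrix or list
kpres <- kproto(df, k = 3)    # if lambda is omitted, kproto() estimates it from the data
summary(kpres)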

Related

passing arguments to igraph function within sapply

I am new to network analyses and ERGM models. I have a list containing 100 Erdos-Renyi models, created with the code below:
set.seed(666)
gs <- list()
for (x in seq_len(100L)) {
  gs[[x]] <- erdos.renyi.game(374, 0.0084, type = "gnp", directed = F)
  E(gs[[x]])$weight <- sample(1:5, ecount(gs[[x]]), T)
}
Now I would like to calculate the mean path length between two nodes as well as the average clustering across these 100 models.
For the mean path length I used the following code:
random_mean_paths <- sapply(gs, igraph::mean_distance, 1:100)
mean(random_mean_paths)
However, if I try the same with igraph::transitivity, i.e.
random_mean_clus <- sapply(gs, igraph::transitivity, 1:100)
I get the error
Error in match.arg(arg = arg, choices = choices, several.ok = several.ok) :
'arg' must be of length 1
and when trying to resolve this error by setting type = "global", i.e.
random_mean_clus <- sapply(gs, igraph::transitivity(type = "global", 1:100)
I get the error: argument "graph" is missing, with no default.
I cannot specify gs in the transitivity() function, since it is not an igraph object and I am stuck trying to pass the correct argument to this function.
Thanks in advance.
Either of
random_mean_clus <- sapply(gs, igraph::transitivity, type = "global", 1:100)
random_mean_clus <- sapply(gs, \(s) igraph::transitivity(s, type = "global", 1:100))
will solve the problem.
The first passes the named argument through sapply()'s argument list, and the second defines an anonymous function, \(s), using the lambda shorthand introduced in R 4.1.
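If it helps to see the whole workflow in one place, here is a sketch that rebuilds the graph list with lapply() and averages both quantities; the trailing 1:100 from the question is dropped, since neither mean_distance() nor transitivity() needs an extra argument when each call operates on one whole graph:
library(igraph)

set.seed(666)
gs <- lapply(seq_len(100L), function(i) {
  g <- erdos.renyi.game(374, 0.0084, type = "gnp", directed = FALSE)
  E(g)$weight <- sample(1:5, ecount(g), replace = TRUE)
  g
})

# one value per random graph, then averaged over the 100 models
random_mean_paths <- sapply(gs, mean_distance)
random_mean_clus  <- sapply(gs, transitivity, type = "global")
mean(random_mean_paths)
mean(random_mean_clus)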

How to obtain GARCH volatility for 100 firms in a single CSV in R?

I am a newbie to the R programming environment. Can anyone please help me with the following problem:
I have a .csv file with stock return data for 100-odd firms (207 days each). I need to estimate GARCH volatilities for each firm and save the output for all firms in a single .csv file. My data looks like this:
[screenshot: StockData]
Below is the reproducible code I have tried so far, unsuccessfully:
library(fGarch)
stdata <- read.csv("StockData.csv", header = T)
out <- vector("list",c(437))
for(j in length(names(stdata[,-1]))) {
  fit = garchFit(~arma(1,0)+garch(1,1), data = stdata[,j], trace = F)
  volatility(fit)
  out = as.data.frame(volatility(fit))
}
write.csv(out, 'volatility.csv')
The output .csv file prints the volatility only for the last firm (the 100th firm). I also get the following warning message:
Warning message:
Using formula(x) is deprecated when x is a character vector of length > 1.
Consider formula(paste(x, collapse = " ")) instead.
My expected output will be as follows:
[screenshot: SampleVolatilityOutput]
Please tell me if there is a way to get all volatilities at once in a single .csv file.
You get that output because you overwrite out for each j in the loop. If you store the results in a matrix instead, i.e. each column is the volatility for one firm, you don't need a list:
library(fGarch)
stdata <- matrix(rnorm(100 * 207), ncol = 100)
out <- matrix(rep(NA, 100*207), ncol = 100)
for (j in 1:ncol(stdata)) {
  fit <- garchFit(formula = ~ arma(1, 0) + garch(1, 1),
                  data = stdata[, j], trace = FALSE)
  out[, j] <- volatility(fit)
}
write.csv(out, 'volatility.csv')
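If the real data come from StockData.csv rather than the simulated matrix above, a sketch along the following lines keeps the firm names attached to the output (this assumes, as in the question's own loop, that the first column of the file is a date or ID column and the remaining columns are the firms' returns):
library(fGarch)

stdata <- read.csv("StockData.csv", header = TRUE)
firms  <- names(stdata)[-1]                      # assumes column 1 is a date/ID column
out    <- matrix(NA_real_, nrow = nrow(stdata), ncol = length(firms),
                 dimnames = list(NULL, firms))

for (j in seq_along(firms)) {
  fit <- garchFit(formula = ~ arma(1, 0) + garch(1, 1),
                  data = stdata[[firms[j]]], trace = FALSE)
  out[, j] <- volatility(fit)                    # conditional volatility series per firm
}

write.csv(out, "volatility.csv", row.names = FALSE)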

The function lda() throws an error when passing a subset argument

This error looks common but I can't seem to get my head round it.
I've been given the following code on a course (the code itself isn't graded) as a shortcut to doing LDA. Apparently it works on some computers but not mine. I've upgraded R and RStudio and also the MASS package. Any ideas?
The error I get is:
Error in eval(expr, envir, enclos) : object 'training' not found
The code is
lda.valid <- function(formula, data, ..., train.fraction = 0.75){
  grouping <- model.response(model.frame(formula, data))
  tbl <- table(grouping, lda(formula, data, ..., CV = TRUE)$class)
  CV <- sum(diag(tbl))/sum(tbl)
  n <- nrow(data)
  training <- sample(1:n, n*train.fraction)
  lda.training <- lda(formula, data, ..., subset = training)
  lda.pred <- predict(lda.training, data[-training, ])
  tbl <- table(grouping[-training], lda.pred$class)
  VAL <- sum(diag(tbl))/sum(tbl)
  c(CV = CV, VAL = VAL)
}
I run the following and get the error. Is it related to the "..." (ellipsis)?
lda.valid(Species~.,data=iris,prior=c(1/3,1/3,1/3),train.fraction=0.5)
I was looking at tryCatch to catch the error, but I don't see how I can print the stack trace.
Any hints or suggestions? I probably don't understand the stack trace at this point.
The error occurs where you call lda.training <- lda(...). This seems to be related to internals of the lda() function, and it's not clear to me why this happens.
However, the intent of this code seems to be to perform the LDA using only a training subset of the data.
This is easy enough to specify directly by subsetting the data in advance. So I suggest replacing the offending line with
lda.training <- lda(formula, data[training, ], ...)
Thus the complete function is:
library(MASS)
lda.valid <- function(formula, data, ..., train.fraction = 0.75){
grouping <- model.response(model.frame(formula, data))
tbl <- table(grouping, lda(formula, data, ..., CV = TRUE)$class)
CV <- sum(diag(tbl))/sum(tbl)
n <- nrow(data)
training <- sample(1:n, n*train.fraction)
lda.training <- lda(formula, data[training, ], ...) # <<<--- Changed
lda.pred <- predict(lda.training, data[-training, ])
tbl <- table(grouping[-training], lda.pred$class)
VAL <- sum(diag(tbl))/sum(tbl)
c(CV = CV, VAL = VAL)
}
lda.valid(Species~., data = iris, prior = c(1/3, 1/3, 1/3), train.fraction = 0.5)
This results in:
> lda.valid(Species~., data = iris, prior = c(1/3, 1/3, 1/3), train.fraction = 0.5)
CV VAL
0.98 0.96
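One small usage note (my addition, not part of the original answer): CV comes from leave-one-out cross-validation and is deterministic, but the training rows are drawn with sample(), so VAL will vary from call to call. Setting a seed first makes the split, and hence VAL, reproducible:
set.seed(42)   # any value; fixes the random training/validation split
lda.valid(Species ~ ., data = iris, prior = c(1/3, 1/3, 1/3), train.fraction = 0.5)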

Viterbi algorithm in R - Number of the replaced elements is not a multiple of replacement length error

I'm trying to implement the Viterbi algorithm in R. I've written the following code:
viterbi_impl <- function(y, P, B, pi){
  # Creating required matrices based on dimension of P
  Sk <- matrix(0, nrow = dim(P)[1], ncol = length(y))
  path <- matrix(0, nrow = dim(P)[1], ncol = length(y))
  # creating the first column
  for (i in 1:dim(Sk)[1]) {
    Sk[i, 1] <- log(pi[i]) + log(B[i, y[1]])
  }
  for (x in 2:length(y)) {
    for (z in 1:dim(P)[1]) {
      max_Sk <- max(Sk[, (x - 1)] + log(P[, z]))
      Sk[z, x] <- log(B[z, y[x]]) + max_Sk
      p <- which((Sk[, (x - 1)] + log(P[, z])) == max_Sk)
      path[z, x] <- p
    }
  }
  likelihood <- max(Sk[, length(y)])  # log-likelihood of the optimal path
  start_opt_path <- which(Sk[, length(y)] == max(Sk[, length(y)]))
  backtrace <- vector(length = length(y))
  backtrace[length(backtrace)] <- start_opt_path
  for (i in (length(y) - 1):1) {
    backtrace[i] <- path[backtrace[i + 1], i + 1]
  }
  return(list(backtrace, likelihood))
}
I tried passing the following parameters to the function:
#Computing optimal path log-likelihood for the observed sequence (a,b,c,b,a)
y <- c(1,2,3,2,1)
P <- matrix(c(1/3,0.5,0.5,1/3,1/3,1/3,0.5,0.5,0.5),3,3,byrow = TRUE)
B <- matrix(c(1/3,1/3,1/3,0.5,0.5,1/3,0.5,1/3,1/3),3,3,byrow = TRUE)
pi <-c(1/3,1/3,1/3)
output <- viterbi_impl(y,P,B,pi)
Defining the function itself does not throw any error; however, when I call it with the values above, it throws the following error:
"Number of the replaced elements is not a multiple of replacement length"
I'm not very familiar with R errors yet and I'm not really sure what this is about or how to debug it. Could someone help please?
Thanks in advance!
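No answer is attached to this question here, but a plausible culprit (an assumption on my part, not a confirmed diagnosis) is tie-breaking: which() returns every index at which the maximum is attained, so when two previous states are tied, a vector of length two is assigned to the single cell path[z, x], which triggers exactly this kind of replacement-length complaint. which.max() always returns a single index, so the inner loop of viterbi_impl() could be rewritten along these lines (same objects as in the function above):
for (x in 2:length(y)) {
  for (z in 1:dim(P)[1]) {
    scores <- Sk[, (x - 1)] + log(P[, z])   # candidate scores from each previous state
    Sk[z, x] <- log(B[z, y[x]]) + max(scores)
    path[z, x] <- which.max(scores)         # a single index, even when the maximum is tied
  }
}
The same consideration applies to start_opt_path, which could likewise be taken as which.max(Sk[, length(y)]).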

R: make PLS calibration models from n subsets and use them to predict different test sets

I am trying to apply a function I wrote that uses the 'pls' package to build a model and then use it to predict several test sets (nine in this case), returning the R2, RMSEP and prediction bias of each test set, for n subsets selected from the data frame.
The function is:
cpo <- function(data, newdata1, newdata2, newdata3, newdata4, newdata5, newdata6, newdata7, newdata8, newdata9){
  data.pls <- plsr(protein ~ ., 8, data = data, validation = "LOO")  # making a pls model
  newdata1.pred <- predict(data.pls, 8, newdata = newdata1)  # using the model to predict test sets
  newdata2.pred <- predict(data.pls, 8, newdata = newdata2)
  newdata3.pred <- predict(data.pls, 8, newdata = newdata3)
  newdata4.pred <- predict(data.pls, 8, newdata = newdata4)
  newdata5.pred <- predict(data.pls, 8, newdata = newdata5)
  newdata6.pred <- predict(data.pls, 8, newdata = newdata6)
  newdata7.pred <- predict(data.pls, 8, newdata = newdata7)
  newdata8.pred <- predict(data.pls, 8, newdata = newdata8)
  newdata9.pred <- predict(data.pls, 8, newdata = newdata9)
  pred.bias1 <- mean(newdata1.pred - newdata1[742])  # calculating the prediction bias
  pred.bias2 <- mean(newdata2.pred - newdata2[742])
  pred.bias3 <- mean(newdata3.pred - newdata3[742])  # [742] reference values in column 742
  pred.bias4 <- mean(newdata4.pred - newdata4[742])
  pred.bias5 <- mean(newdata5.pred - newdata5[742])
  pred.bias6 <- mean(newdata6.pred - newdata6[742])
  pred.bias7 <- mean(newdata7.pred - newdata7[742])
  pred.bias8 <- mean(newdata8.pred - newdata8[742])
  pred.bias9 <- mean(newdata9.pred - newdata9[742])
  r <- c(R2(data.pls, "train"), RMSEP(data.pls, "train"), pred.bias1,
         pred.bias2, pred.bias3, pred.bias4, pred.bias5, pred.bias6,
         pred.bias7, pred.bias8, pred.bias9)
  return(r)
}
Selecting n subsets (based on an answer to my question Select several subsets by taking different row interval and apply function to all subsets) and applying the cpo function to each subset, I tried the following (edited based on @Gavin's advice):
FOO3 <- function(data, nSubsets, nSkip){
  outList <- vector("list", 11)
  names(outList) <- c("R2train", "RMSEPtrain", paste("bias", 1:9, sep = ""))
  sub <- vector("list", length = nSubsets)  # sub is the n subsets created by selecting rows
  names(sub) <- c(paste("sub", 1:nSubsets, sep = ""))
  totRow <- nrow(data)
  for (i in seq_len(nSubsets)) {
    rowsToGrab <- seq(i, totRow, nSkip)
    sub[[i]] <- data[rowsToGrab, ]
  }
  for (i in sub) {  # for every subset in sub I want to apply cpo
    outList[[i]] <- cpo(data = sub, newdata1 = gag11p, newdata2 = gag12p, newdata3 = gag13p,
                        newdata4 = gag21p, newdata5 = gag22p, newdata6 = gag23p,
                        newdata7 = gag31p, newdata8 = gag32p, newdata9 = gag33p)  # new data are test sets loaded in the workspace
  }
  return(outList)
}
FOO3(GAGp,10,10)
When I try this I keep getting 'Error in eval(expr, envir, enclos) : object 'protein' not found'.
Protein is used in the plsr formula of cpo, and is in the data set.
I then tried to use the plsr function directly, as seen below:
FOO4 <- function(data, nSubsets, nSkip){
  outList <- vector("list", 11)
  names(outList) <- c("R2train", "RMSEPtrain", paste("bias", 1:9, sep = ""))
  sub <- vector("list", length = nSubsets)
  names(sub) <- c(paste("sub", 1:nSubsets, sep = ""))
  totRow <- nrow(data)
  for (i in seq_len(nSubsets)) {
    rowsToGrab <- seq(i, totRow, nSkip)
    sub[[i]] <- data[rowsToGrab, ]
  }
  cal <- vector("list", length = nSubsets)  # for each subset in sub make a pls model for protein
  names(cal) <- c(paste("cal", 1:nSubsets, sep = ""))
  for (i in sub) {
    cal[[i]] <- plsr(protein ~ ., 8, data = sub, validation = "LOO")
  }
  return(outList)  # return is just used to end the script and check if the error still occurs
}
FOO4(gagpm,10,10)
When I tried this I get the same error 'Error in eval(expr, envir, enclos) : object 'protein' not found'.
Any advice on how to deal with this and make the function work will be much appreciated.
I suspect the problem is immediately at the start of FOO3():
FOO3 <- function(data, nSubsets, nSkip) {
outList <- vector("list", r <- c(R2(data.pls,"train"), RMSEP(data.pls,"train"),
pred.bias1, pred.bias2, pred.bias3, pred.bias4, pred.bias5,
pred.bias6, pred.bias7, pred.bias8, pred.bias9))
Not sure what you are trying to do when creating outList, but vector() has two arguments and you seem to be assigning to r a vector of numerics that you want R to use as the length argument to vector().
Here you are using the object data.pls and this doesn't exist yet - and never will in the frame of FOO3() - it is only ever created in cpo().
Your second loop looks totally wrong - you are not assigning the output from cpo() to anything. I suspect you wanted:
outList <- vector("list", 11)
names(outList) <- c("R2train","RMSEPtrain", paste("bias", 1:9, sep = ""))
....
for(i in subset) {
outList[[i]] <- cpo(....)
}
return(outList)
But that depends on what subset is etc. You also haven't got the syntax for this loop right. You have
for(i in(subset)) {
when it should be
for(i in subset) {
And subset and data aren't great names as these are common R functions and modelling arguments.
There are lots of problems with your code. Try to start simple and build up from there.
I have managed to achieve what I wanted using the following; if there is a better way of doing it (I'm sure there must be) I'm eager to learn. This function performs the following tasks:
1. Select n subsets from a data frame.
2. For each subset created, a plsr model is made.
3. Each plsr model is used to predict 9 test sets.
4. For each prediction, the prediction bias is calculated.
far5 <- function(data, nSubsets, nSkip){
  sub <- vector("list", length = nSubsets)
  names(sub) <- c(paste("sub", 1:nSubsets, sep = ""))
  totRow <- nrow(data)
  for (i in seq_len(nSubsets)) {
    rowsToGrab <- seq(i, totRow, nSkip)
    sub[[i]] <- data[rowsToGrab, ]  # sub is the list of subsets created
  }
  mop <- lapply(sub, cpr2)  # apply cpr2 to every subset; mop holds the results
  names(mop) <- paste("mop", seq_along(mop), sep = "")
  return(mop)
}
Call: far5(data, nSubsets, nSkip)
The first part, selecting the subsets, is based on the answer to my question Select several subsets by taking different row interval and apply function to all subsets.
I was then able to apply the function cpr2 to the subsets using lapply instead of the for loop used previously.
cpr2 is a modification of cpo, for which only data is supplied; the new data to be predicted are used directly in the function, as shown below.
cpr2 <- function(data){
  data.pls <- plsr(protein ~ ., 8, data = data, validation = "LOO")  # make plsr model
  gag11p.pred <- predict(data.pls, 8, newdata = gag11p)  # predict each test set
  gag12p.pred <- predict(data.pls, 8, newdata = gag12p)
  gag13p.pred <- predict(data.pls, 8, newdata = gag13p)
  gag21p.pred <- predict(data.pls, 8, newdata = gag21p)
  gag22p.pred <- predict(data.pls, 8, newdata = gag22p)
  gag23p.pred <- predict(data.pls, 8, newdata = gag23p)
  gag31p.pred <- predict(data.pls, 8, newdata = gag31p)
  gag32p.pred <- predict(data.pls, 8, newdata = gag32p)
  gag33p.pred <- predict(data.pls, 8, newdata = gag33p)
  pred.bias1 <- mean(gag11p.pred - gag11p[742])  # calculate prediction bias
  pred.bias2 <- mean(gag12p.pred - gag12p[742])
  pred.bias3 <- mean(gag13p.pred - gag13p[742])
  pred.bias4 <- mean(gag21p.pred - gag21p[742])
  pred.bias5 <- mean(gag22p.pred - gag22p[742])
  pred.bias6 <- mean(gag23p.pred - gag23p[742])
  pred.bias7 <- mean(gag31p.pred - gag31p[742])
  pred.bias8 <- mean(gag32p.pred - gag32p[742])
  pred.bias9 <- mean(gag33p.pred - gag33p[742])
  r <- signif(c(pred.bias1, pred.bias2, pred.bias3, pred.bias4, pred.bias5,
                pred.bias6, pred.bias7, pred.bias8, pred.bias9), 2)  # signif() rounds the biases to 2 significant figures
  out <- c(R2(data.pls, "train", ncomp = 8), RMSEP(data.pls, "train", ncomp = 8), r)
  return(out)
}
Call: cpr2(data)
I was able to use this to solve my problem; however, since there were only nine sets of new data to predict, it was possible to list them out as I did. If there is a more general way to do this, I'm interested in learning it.
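On that final point about generalising: one possible sketch (untested on the actual data, and assuming the test sets can be collected into a named list and that column 742 holds the reference values, as in cpr2 above) replaces the nine hard-coded newdata arguments with a loop over a list:
library(pls)

cpr_general <- function(data, testsets, ncomp = 8, refcol = 742) {
  data.pls <- plsr(protein ~ ., ncomp, data = data, validation = "LOO")
  biases <- sapply(testsets, function(nd) {
    pred <- predict(data.pls, ncomp, newdata = nd)
    mean(pred - nd[[refcol]])                # prediction bias against the reference column
  })
  c(R2(data.pls, "train", ncomp = ncomp),
    RMSEP(data.pls, "train", ncomp = ncomp),
    signif(biases, 2))
}

# Usage, e.g.:
# tests <- list(gag11p = gag11p, gag12p = gag12p, gag13p = gag13p)  # ... and so on
# results <- lapply(sub, cpr_general, testsets = tests)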
