I was given a task to write a function, which I named my_mode_k.
The input consists of two arguments:
(x, k)
where x is a vector of natural numbers of length n, and the greatest element of x can be k, given that k < n.
my_mode_k outputs the most frequent element of x. If more than one element occurs in x the same (maximal) number of times, the function outputs the smallest of them.
for example:
my_mode_k(x = c(1, 1, 2, 3, 3) , k =3)
1
This is code I wrote:
my_mode_k <- function(x, k){
  n <- length(x)
  x_lemma <- rep(0, k)
  for(i in 1:n){
    x_lemma[i] < x_lemma[i] + 1
  }
  x_lem2 <- 1
  for(j in 2:k){
    if(x_lemma[x_lem2] < x_lemma[j]){
      x_lem2 <- j
    }
  }
  x_lem2
}
which isn't working properly.
for example:
my_mode_k(x = c(2,3,4,3,2,2,5,5,5,5,5,5,5,5), k=5)
[1] 1
as the function is supposed to return 5.
I don't understand why, nor what intuition to develop in order to even recognize that a function isn't working properly (it took me some time to realize that it wasn't executing the needed task), so that I can fix the mistake in it.
Here are a few steps on how you can achieve this.
k <- 5
input <- c(2,3,4,3,3,3,3,3,3,3,2,2,5,5,5,5,5,5,5,5)
# Calculate frequencies of elements.
tbl <- table(input[input <= k])
# Find which is max. Notice that it returns the minimum if there is a tie.
tbl.max <- which.max(tbl)
# Find which value is your result.
names(tbl.max)
input <- c(2,2,3,3,3,5,5,5)
names(which.max(table(input[input <= k])))
# 3
input <- c(2,2,5,5,5,3,3,3)
names(which.max(table(input[input <= k])))
# 3
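Putting the pieces together, a compact version of my_mode_k could look like this (just a sketch based on the table()/which.max() idea above; names() returns a character string, so it is converted back to a number, and which.max() returns the first maximum, i.e. the smallest value on ties):
my_mode_k <- function(x, k){
  tbl <- table(x[x <= k])             # frequencies of the values that are <= k
  as.numeric(names(which.max(tbl)))   # first maximum = smallest value on ties
}
my_mode_k(x = c(1, 1, 2, 3, 3), k = 3)                             # 1
my_mode_k(x = c(2, 3, 4, 3, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5), k = 5)  # 5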
I am following up on an old question without an answer (https://stackoverflow.com/questions/31653029/r-thresholding-networks-with-inputted-p-values-in-q-graph). I'm trying to assess the relations between my variables. For this, I've used a correlation network map. Now I would like to add a significance threshold component; for instance, I want to show only results with p-values < 0.05. Any idea how I could implement this in my code?
Data set: https://www.dropbox.com/s/xntc3i4eqmlcnsj/d100_partition_all3.csv?dl=0
My code:
library(qgraph)
cor_d100_partition_all3<-cor(d100_partition_all3)
qgraph(cor_d100_partition_all3, layout="spring",
label.cex=0.9, labels=names(d100_partition_all3),
label.scale=FALSE, details = TRUE)
Output: (network plot not shown)
Additionally, I have this small piece of code that computes a p-value for each pairwise correlation:
Code:
cor.mtest <- function(mat, ...) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], ...)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
    }
  }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
p.mat <- cor.mtest(d100_partition_all3)
Cheers
There are a few ways to plot only the significant correlations. First, you could pass additional arguments to the qgraph() function; you can look at the documentation for more details. The function call given below should have values that are close to what is needed.
qgraph(cor_d100_partition_all3
       , layout = "spring"
       , label.cex = 0.9
       , labels = names(d100_partition_all3)
       , label.scale = FALSE
       , details = TRUE
       , minimum = 'sig'   # minimum based on statistical significance
       , alpha = 0.05      # significance criterion
       , bonf = FALSE      # should Bonferroni correction be used
       , sampleSize = 6    # number of observations
       )
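One thing worth double-checking in the call above: sampleSize is the number of observations the correlations were computed from, so it should presumably be taken from the data rather than hard-coded, e.g. sampleSize = nrow(d100_partition_all3).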
A second option is to create a modified correlation matrix. When the correlations are not statistically significant based on your cor.mtest() function, the value is set to NA in the modified correlation matrix. This modified matrix is plotted. A main visual difference between the first and second solutions seems to be the relative line weights.
# initializing the modified correlation matrix
cor_d100_partition_all3_mod <- cor_d100_partition_all3
# looping through all elements and setting the value to NA when the p-value is greater than 0.05
for(i in 1:nrow(cor_d100_partition_all3)){
  for(j in 1:nrow(cor_d100_partition_all3)){
    if(p.mat[i,j] > 0.05){
      cor_d100_partition_all3_mod[i,j] <- NA
    }
  }
}
# plotting the result
qgraph(cor_d100_partition_all3_mod
       , layout = "spring"
       , label.cex = 0.7
       , labels = names(d100_partition_all3)
       , label.scale = FALSE
       , details = FALSE
       )
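As a side note, the double loop above can be replaced by a single vectorized assignment, since p.mat has the same dimensions as the correlation matrix (just a sketch of the equivalent operation):
# set non-significant correlations to NA in one step
cor_d100_partition_all3_mod <- cor_d100_partition_all3
cor_d100_partition_all3_mod[p.mat > 0.05] <- NA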
So I need some help with a train and test set that I am creating in R. The goal of the code is to break a data set into k folds, where i is the number of the fold that will be used as the test set. It will then return the training and test sets. We assume that k will be 5 or 10.
This is what I have so far.
create_sets <- function(df, k, i)
{
  n <- dim(df)[1]
  # fold size
  size <- n/k
  # beginning of test set
  test_start <- (size*i) - (size) + 1
  # end of test set
  test_end <- size*i
  indices <- df(test_start, test_end)
  train <- df[indices,]
  test <- df[-indices,]
  return(list(train = train, test = test))
}
df is just a random data frame of x and y. That is:
x<-c(1,6,7,4,3,5,7,8,9,8,7,6,5,4,3,4,5,3,2,1)
y<-c(3,5,6,7,5,4,3,5,7,8,9,0,2,3,4,5,6,7,5,6)
df<-data.frame(x,y)
When I run the code I get an error:
Error in df(test_start, test_end) :
argument "df2" is missing, with no default
This is how I would approach it:
n <- nrow(df)
k <- 5
set.seed(10272015)
s <- sample(1:k, n, replace=TRUE)
result <- rep(NA, k)
for (i in 1:k) {
  train <- df[s != i, ]
  test <- df[s == i, ]
  # fit model
  # evaluate
  # result[i] <- evalscore
}
mean(result)
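For illustration, here is one way the placeholder comments in that loop could be filled in with the data frame from the question; the linear model and the MSE metric are just assumptions to show where the pieces go:
for (i in 1:k) {
  train <- df[s != i, ]
  test <- df[s == i, ]
  fit <- lm(y ~ x, data = train)         # fit a (placeholder) model on the training folds
  pred <- predict(fit, newdata = test)   # evaluate it on the held-out fold
  result[i] <- mean((test$y - pred)^2)   # mean squared error for this fold
}
mean(result)                             # cross-validated MSE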
I think you just need an index for the different subsets, like this:
k <- 5
folds <- sample(rep(1:k,length=nrow(df)))
Then you can get any one of the k subsets (take fold 1, for example):
df[folds==1,]
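For completeness, here is a sketch of how the original create_sets() could be repaired. The error comes from the line indices <- df(test_start, test_end): because of the parentheses this calls the F-distribution density function df() instead of building an index vector, hence the message about a missing df2 argument. Note also that the original assigned the test-fold rows to train and vice versa. This version assumes n is divisible by k and that contiguous folds are acceptable:
create_sets <- function(df, k, i){
  n <- nrow(df)
  size <- n/k                         # fold size
  test_start <- (size*i) - size + 1   # beginning of test set
  test_end <- size*i                  # end of test set
  indices <- test_start:test_end      # row indices of the test fold
  test <- df[indices, ]
  train <- df[-indices, ]
  return(list(train = train, test = test))
}
create_sets(df, k = 5, i = 2)   # rows 5-8 form the test set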
I have the following 3 functions which I would like to make faster. I assume apply functions are the best way to go, but I have never used apply functions, so I have no idea what to do. Any type of hints, ideas and code snippets will be much appreciated.
n, T, dt are global parameters and par is a vector of parameters.
Function 1 is a function to create an (m+1) x n matrix containing Poisson-distributed jumps with exponentially distributed jump sizes. My trouble here is that I have 3 loops and I am not sure how to incorporate the if statement in the inner loop. Also, I have no idea whether it is at all possible to use apply functions on the outer layers of the loops only.
jump <- function(t=0, T=T, par){
  jump <- matrix(0, T/dt+1, n)             # initializing output matrix
  U <- replicate(n, runif(100, t, T))      # matrix used to decide when the jumps will happen
  Y <- replicate(n, rexp(100, 1/par[6]))   # matrix with jump sizes
  for (l in 1:n){
    NT <- rpois(1, par[5]*T)               # number of jumps
    k = 0
    for (j in seq(t, T, dt)){
      k = k+1
      if (NT > 0){
        temp = 0
        for (i in 1:NT){
          u <- vector("numeric", NT)
          if (U[i,l] > j){ u[i] = 0
          } else u[i] = 1
          temp = temp + Y[i,l]*u[i]
        }
        jump[k,l] = temp
      } else jump[k,l] = 0
    }
  }
  return(jump)
}
Function 2 calculates a default intensity based on Brownian motions and the jumps from function 1. Here my trouble is how to use apply functions when the calculation uses the values from the row above in the output matrix, AND how to get the right values from the external matrices that are used in the calculations (BMz_C & J).
lambda <- function(t=0, T=T, par, fit=0){
  lambda <- matrix(0, m+1, n)   # matrix to hold intensity path output
  lambda[1,] <- par[4]          # initializing start value of the intensity path
  J <- jump(t, T, par)          # matrix containing jumps
  for(i in 2:(m+1)){
    dlambda <- par[1]*(par[2]-max(lambda[i-1,],0))*dt + par[3]*sqrt(max(lambda[i-1,],0))*BMz_C[i,] + (J[i,]-J[i-1,])
    lambda[i,] <- lambda[i-1,] + dlambda
  }
  return(lambda)
}
Function 3 calculates a survival probability based on the intensity from function 2. Here a() and B() are functions that return numerical values. My problem here is that both i and j are needed, because i is not always an integer and thus cannot be used to index the external matrix. I have earlier tried to use i/dt, but sometimes it would overwrite one line and skip the next lines in the matrix, most likely due to rounding errors.
S <- function(t=0, T=T, par, plot=0, fit=0){
  S <- matrix(0, (T-t)/dt+1, n)
  if (fit > 0) S.fit <- matrix(0, 1, length(mat)) else S.fit <- 0
  l = lambda(t, T, par, fit)
  j = 0
  for (i in seq(t, T, dt)){
    j = j+1
    S[j,] <- a(i, T, par)*exp(B(i, T, par)*l[j,])
  }
  return(S)
}
Sorry for the long post, any help for any of the functions will be much appreciated.
EDIT:
First of all thanks to digEmAll for the great reply.
I have now worked on vectorising function 2. First I tried
lambda <- function(t=0, T=T, par, fit=0){
  lambda <- matrix(0, m+1, n)   # matrix to hold the intensity path
  J <- jump(t, T, par, fit)
  lambda[1,] <- par[4]
  lambda[2:(m+1),] <- sapply(2:(m+1), function(i){
    lambda[i-1,] + par[1]*(par[2]-max(lambda[i-1,],0))*dt + par[3]*sqrt(max(lambda[i-1,],0))*BMz_C[i,] + (J[i,]-J[i-1,])
  })
  return(lambda)
}
but it would only produce the first column, so I tried a two-step apply function.
lambda <- function(t=0, T=T, par, fit=0){
  lambda <- matrix(0, m+1, n)   # matrix to hold the intensity path
  J <- jump(t, T, par, fit)
  lambda[1,] <- par[4]
  lambda[2:(m+1),] <- sapply(1:n, function(l){
    sapply(2:(m+1), function(i){
      lambda[i-1,l] + par[1]*(par[2]-max(lambda[i-1,l],0))*dt + par[3]*sqrt(max(lambda[i-1,l],0))*BMz_C[i,l] + (J[i,l]-J[i-1,l])
    })
  })
  return(lambda)
}
This seems to work, but only for the first row; all rows after that have an identical non-zero value, as if lambda[i-1,] is not used in the calculation of lambda[i,]. Does anyone have an idea how to manage that?
I'm going to explain, step by step, how to vectorize the first function (this is one possible way of vectorizing it, maybe not the best one for your case).
For the other 2 functions, you can simply apply the same concepts and you should be able to do it.
Here, the key concept is: start to vectorize from the innermost loop.
1) First of all, rpois can generate more than one random value at a time, but you are calling it n times, asking for one random value each time. So, let's take it out of the loop, obtaining this:
jump <- function(t=0, T=T, par){
  jump <- matrix(0, T/dt+1, n)
  U <- replicate(n, runif(100, t, T))
  Y <- replicate(n, rexp(100, 1/par[6]))
  NTs <- rpois(n, par[5]*T)   # note the change
  for (l in 1:n){
    NT <- NTs[l]              # note the change
    k = 0
    for (j in seq(t, T, dt)){
      k = k+1
      if (NT > 0){
        temp = 0
        for (i in 1:NT){
          u <- vector("numeric", NT)
          if (U[i,l] > j){ u[i] = 0
          } else u[i] = 1
          temp = temp + Y[i,l]*u[i]
        }
        jump[k,l] = temp
      } else jump[k,l] = 0
    }
  }
  return(jump)
}
2) Similarly, it is useless/inefficient to call seq(t,T,dt) n times in the loop, since it will always generate the same sequence. So, let's take it out of the loop and store it in a vector, obtaining this:
jump <- function(t=0, T=T, par){
  jump <- matrix(0, T/dt+1, n)
  U <- replicate(n, runif(100, t, T))
  Y <- replicate(n, rexp(100, 1/par[6]))
  NTs <- rpois(n, par[5]*T)
  js <- seq(t, T, dt)      # note the change
  for (l in 1:n){
    NT <- NTs[l]
    k = 0
    for (j in js){         # note the change
      k = k+1
      if (NT > 0){
        temp = 0
        for (i in 1:NT){
          u <- vector("numeric", NT)
          if (U[i,l] > j){ u[i] = 0
          } else u[i] = 1
          temp = temp + Y[i,l]*u[i]
        }
        jump[k,l] = temp
      } else jump[k,l] = 0
    }
  }
  return(jump)
}
3) Now, let's have a look at the innermost loop:
for (i in 1:NT){
  u <- vector("numeric", NT)
  if (U[i,l] > j){ u[i] = 0
  } else u[i] = 1
  temp = temp + Y[i,l]*u[i]
}
this is equivalent to:
u <- as.integer(U[1:NT,l]<=j)
temp <- sum(Y[1:NT,l]*u)
or, as a one-liner:
temp <- sum(Y[1:NT,l] * as.integer(U[1:NT,l] <= j))
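A quick sanity check on some made-up values (purely illustrative) confirms that the loop and the vectorized expression give the same result:
set.seed(1)
NT <- 4; l <- 1; j <- 0.5
U <- matrix(runif(8), 4, 2)
Y <- matrix(rexp(8), 4, 2)
temp <- 0
for (i in 1:NT){                      # the original inner loop
  u <- vector("numeric", NT)
  if (U[i,l] > j){ u[i] = 0
  } else u[i] = 1
  temp <- temp + Y[i,l]*u[i]
}
temp_vec <- sum(Y[1:NT,l]*as.integer(U[1:NT,l] <= j))   # vectorized version
all.equal(temp, temp_vec)             # TRUE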
Hence, the function can now be written as:
jump <- function(t=0, T=T, par){
  jump <- matrix(0, T/dt+1, n)
  U <- replicate(n, runif(100, t, T))
  Y <- replicate(n, rexp(100, 1/par[6]))
  NTs <- rpois(n, par[5]*T)
  js <- seq(t, T, dt)
  for (l in 1:n){
    NT <- NTs[l]
    k = 0
    for (j in js){
      k = k+1
      if (NT > 0){
        jump[k,l] <- sum(Y[1:NT,l]*as.integer(U[1:NT,l] <= j))   # note the change
      } else jump[k,l] = 0
    }
  }
  return(jump)
}
4) Again, let's have a look at the current innermost loop:
for (j in js){
  k = k+1
  if (NT > 0){
    jump[k,l] <- sum(Y[1:NT,l]*as.integer(U[1:NT,l] <= j))   # note the change
  } else jump[k,l] = 0
}
as you can notice, NT does not depend on the iteration of this loop, so the inner if can be moved outside, as follows:
if (NT > 0){
  for (j in js){
    k = k+1
    jump[k,l] <- sum(Y[1:NT,l]*as.integer(U[1:NT,l] <= j))   # note the change
  }
} else {
  for (j in js){
    k = k+1
    jump[k,l] = 0
  }
}
This seems worse than before; well, yes it is, but now the 2 branches can each be turned into a one-liner (note the use of sapply¹):
if (NT > 0){
  jump[1:length(js),l] <- sapply(js, function(j){ sum(Y[1:NT,l]*as.integer(U[1:NT,l] <= j)) })
} else {
  jump[1:length(js),l] <- 0
}
obtaining the following jump function:
jump <- function(t=0, T=T, par){
  jump <- matrix(0, T/dt+1, n)
  U <- replicate(n, runif(100, t, T))
  Y <- replicate(n, rexp(100, 1/par[6]))
  NTs <- rpois(n, par[5]*T)
  js <- seq(t, T, dt)
  for (l in 1:n){
    NT <- NTs[l]
    if (NT > 0){
      jump[1:length(js),l] <- sapply(js, function(j){ sum(Y[1:NT,l]*as.integer(U[1:NT,l] <= j)) })
    } else {
      jump[1:length(js),l] <- 0
    }
  }
  return(jump)
}
5) Finally, we can get rid of the last loop, again using the sapply¹ function, obtaining the final jump function:
jump <- function(t=0, T=T, par){
  U <- replicate(n, runif(100, t, T))
  Y <- replicate(n, rexp(100, 1/par[6]))
  js <- seq(t, T, dt)
  NTs <- rpois(n, par[5]*T)
  jump <- sapply(1:n, function(l){
    NT <- NTs[l]
    if (NT > 0){
      sapply(js, function(j){ sum(Y[1:NT,l]*as.integer(U[1:NT,l] <= j)) })
    } else {
      rep(0, length(js))
    }
  })
  return(jump)
}
(¹) The sapply function is pretty easy to use. For each element of the list or vector passed in the X parameter, it applies the function passed in the FUN parameter, e.g.:
vect <- 1:3
sapply(X = vect, FUN = function(el){ el + 10 })
# [1] 11 12 13
Since by default the simplify parameter is TRUE, the result is coerced to the simplest possible object. So, for example, in the previous case the result becomes a vector, while in the following example the result becomes a matrix (since for each element we return a vector of the same size):
vect <- 1:3
sapply(X=vect,FUN=function(el){rep(el,5)})
# [,1] [,2] [,3]
# [1,] 1 2 3
# [2,] 1 2 3
# [3,] 1 2 3
# [4,] 1 2 3
# [5,] 1 2 3
Benchmark:
The following benchmark just gives you an idea of the speed gain, but the actual performance may differ depending on your input parameters.
As you can imagine, jump_old corresponds to your original function 1, while jump_new is the final vectorized version.
# let's use some random parameters
n = 10
m = 3
T = 13
par = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
dt <- 3
set.seed(123)
system.time(for(i in 1:5000) old <- jump_old(T=T,par=par))
# user system elapsed
# 12.39 0.00 12.41
set.seed(123)
system.time(for(i in 1:5000) new <- jump_new(T=T,par=par))
# user system elapsed
# 4.49 0.00 4.53
# check if last results of the 2 functions are the same:
isTRUE(all.equal(old,new))
# [1] TRUE
I need to build a dependency matrix with all the 91 variables of my data-set.
I tried to use some codes, but I didn't succeed.
Here is part of the relevant code:
p<- length(dati)
chisquare <- matrix(dati, nrow=(p-1), ncol=p)
It should create a square matrix with all the variables.
system.time({for(i in 1:p){
  for(j in 1:p){
    a <- dati[, rn[i+1]]
    b <- dati[, cn[j]]
    chisquare[i, (1:(p-1))] <- chisq.test(dati[,i], dati[, i+1])$statistic
    chisquare[i, p] <- chisq.test(dati[,i], dati[, i+1])$p.value
  }}
})
It should relate the p variables in order to analyze whether they are dependent on each other.
Error in `[.data.frame`(dati, , rn[i + 1]) :
not defined columns selected
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Timing stopped at: 32.23 0.11 32.69
warnings() #let's check
>: In chisq.test(dati[, i], dati[, i + 1]) :
Chi-squared approximation may be incorrect
chisquare # all the cells (except the last column, which seems to hold the p-values) have the same values within each row
I also tried another way, which was provided to me by someone who knows how to manage R much better than me:
#strange values I have in some columns
sum(dati == 'x')
#replacing "x" by x
x <- dati[dati=='x']
#distribution of answers for each question
answers <- t(sapply(1:ncol(dati), function(i) table(factor(dati[, i], levels = -2:9), useNA = 'always')))
rownames(answers) <- colnames(dati)
answers
#correlation for the pairs
I<- diag(ncol(dati))
#empty diagonal matrix
colnames(I) <- rownames(I) <- colnames(dati)
rn <- rownames(I)
cn <- colnames(I)
#loop
system.time({
  for(i in 1:ncol(dati)){
    for(j in 1:ncol(spain)){
      a <- dati[, rn[i]]
      b <- dati[, cn[j]]
      r <- chisq.test(a,b)$statistic
      r <- chisq.test(a,b)$p.value
      I[i, j] <- r
    }
  }
})
user system elapsed
29.61 0.09 30.70
There were 50 or more warnings (use warnings() to see the first 50)
warnings() #let's check
-> : In chisq.test(a, b) : Chi-squared approximation may be incorrect
diag(I)<- 1
#result
head(I)
The columns stop at the 5th variable, whereas I need to check the dependency between all of the variables, each one against each other.
I don't understand where I'm going wrong, but I hope I'm not too far off...
I hope to receive some good help, please.
You are apparently trying to compute the p-value of a chi-squared test for all pairs of variables in your dataset. This can be done as follows.
# Sample data
n <- 1000
k <- 10
d <- matrix(sample(LETTERS[1:5], n*k, replace=TRUE), nc=k)
d <- as.data.frame(d)
names(d) <- letters[1:k]
# Compute the p-values
k <- ncol(d)
result <- matrix(1, nr=k, nc=k)
rownames(result) <- colnames(result) <- names(d)
for(i in 1:k) {
  for(j in 1:k) {
    result[i,j] <- chisq.test( d[,i], d[,j] )$p.value
  }
}
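If you prefer to avoid writing the double loop yourself, the same matrix can be built with outer() and Vectorize(); this is just an equivalent sketch, and it still runs the same k*k tests under the hood:
result2 <- outer(1:k, 1:k,
                 Vectorize(function(i, j) chisq.test(d[,i], d[,j])$p.value))
rownames(result2) <- colnames(result2) <- names(d)
all.equal(result, result2)   # TRUE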
In addition, there may be something wrong with your data, leading to the warnings you get, but we do not know anything about it.
Your code has too many problems for me to try to enumerate them (you start by trying to create a square matrix with a different number of rows and columns, and then I am completely lost).