Combine same-name columns and apply Johansen test in R - r

I have two data sets (multicolumn data, before and after treatment):
Before treatment
Data1<-read.csv("before.csv")
X1 X2 X3
1 0.21 0.32 0.42
2 0.34 0.23 0.33
3 0.42 0.14 0.11
4 0.35 0.25 0.35
5 0.25 0.41 0.44
After treatment
Data2<-read.csv("after.csv")
X1 X2 X3
1 0.33 0.43 0.7
2 0.28 0.51 0.78
3 0.11 0.78 0.34
4 0.54 0.34 0.34
5 0.42 0.64 0.22
I would like to combine the data by columns (i.e. X1 in Data1 with X1 in Data2; similarly X2 in Data1 with X2 in Data2, and so on) and perform the Johansen cointegration test on each pair.
What I tried is to make:
library("urca")
x1 <- cbind(Data1$X1, Data2$X1)
Jo1 <- ca.jo(x1, type = "trace", K = 2, ecdet = "none", spec = "longrun")
summary(Jo1)
x2 <- cbind(Data1$X2, Data2$X2)
Jo2 <- ca.jo(x2, type = "trace", K = 2, ecdet = "none", spec = "longrun")
summary(Jo2)
This gives me what I want but I would like to automate the process, i.e. instead of manually combining data, to have all pair-wise combinations.

Based on krishna's answer, but with the loop modified:
for(i in 1:ncol(Data1)) {
  col <- paste0("X", as.character(i))
  data <- cbind(Data1[, col], Data2[, col])
  colnames(data) <- c(paste0("Data1_", col), paste0("Data2_", col)) # add column names
  Jo <- ca.jo(data, type = "trace", K = 2, ecdet = "none", spec = "longrun")
  print(summary(Jo)) # print the summary to the console
}

You can loop through the column names and run the Johansen cointegration test as follows:
# Create a sample data frame
Data1<- data.frame(X1 = rnorm(10, 0, 1), X2 = rnorm(10, 0, 1), X3 = rnorm(10, 0, 1))
Data2 <-data.frame(X1 = rnorm(10, 0, 1), X2 = rnorm(10, 0, 1), X3 = rnorm(10, 0, 1))
library("urca")
# loop through all column indices
for(i in 1:ncol(Data1)) {
  col <- paste0("X", as.character(i)) # build the column name
  data <- cbind(Data1[, col], Data2[, col]) # all rows of column = col from Data1 and Data2
  # Your method for finding ca.jo ...
  Jo <- ca.jo(data, type = "trace", K = 2, ecdet = "none", spec = "longrun")
  print(summary(Jo)) # summary() on its own is not auto-printed inside a loop
}
You can also loop over the column names directly:
for(col in colnames(Data1)) {
  print(col)
  data <- cbind(Data1[, col], Data2[, col])
  print(data)
  #Jo <- ca.jo(data, type = "trace", K = 2, ecdet = "none", spec = "longrun")
  #summary(Jo)
}
Hope this will help you.
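If you would rather keep the test objects around instead of just printing them, here is a minimal sketch that stores all pairwise tests in a named list via lapply (assuming the shared column names X1, X2, X3, and that your real series are long enough for K = 2):
library(urca)
jo_tests <- lapply(colnames(Data1), function(col) {
  pair <- cbind(Data1[, col], Data2[, col])
  colnames(pair) <- paste0(c("Data1_", "Data2_"), col)
  ca.jo(pair, type = "trace", K = 2, ecdet = "none", spec = "longrun")
})
names(jo_tests) <- colnames(Data1) # one test per shared column
summary(jo_tests$X1)               # inspect any pair by name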

How to iterate through parameters in for loop

I have a model written as a for loop that incorporates a number of parameters that I specify:
## functions needed to run the model
learn <- function(prior, sensi, speci, e){
  out <- ifelse(e == 1, (sensi*prior) / ((sensi*prior) + (1-speci)*(1-prior)),
                ((1-sensi)*prior) / (((1-sensi)*prior) + (speci*(1-prior))))
  out
}
feed <- function(vec){
  prior <- 0.5
  for (i in vec){
    res <- learn(prior, sensi, speci, i)
    prior <- res
  }
  return(prior)
}
## specify parameters
iterations <- 100
N <- 10
BR <- 0.66
sensi <- 0.75
speci <- 0.45
## initialize results object
res <- NULL
## loop for number of iterations
for (j in 1:iterations){
  X <- as.numeric(rbinom(1, 1, BR))
  if (X == 1){ # if X is 1...
    agents <- c(1:N)
    evidence <- vector("list", length(agents))
    for (i in agents) {
      n <- sample(10, 1, replace = TRUE)
      evidence[[i]] <- rbinom(n, 1, sensi)
    }
  } else { # if X is 0...
    agents <- c(1:N)
    evidence <- vector("list", length(agents))
    for (i in agents) {
      n <- sample(10, 1, replace = TRUE)
      evidence[[i]] <- rbinom(n, 1, sensi)
      evidence[[i]] <- ifelse(evidence[[i]]==1, 0, 1) # flip evidence
    }
  }
  # feed vectors of evidence through learn function
  t0 <- sapply(evidence, feed)
  # save dataframe
  df <- data.frame("i" = j,
                   "ID" = c(1:N),
                   "E" = t0,
                   "X" = X,
                   "N" = N,
                   "BR" = BR,
                   "sensi" = sensi,
                   "speci" = speci)
  res <- rbind(res, df)
}
This works fine for a single parameterisation, but I now want to automate the process of specifying different parameter values and re-running the model. So instead of defining each parameter as a single value, I define them as a vector of values and store all the possible parameterisations in a dataframe (paramspace) with each row holding the values for a single parameterisation that I want to run:
## set up for multiple parameterizations
iterations <- 100
N_vec <- c(10, 50)
BR_vec <- c(0.25, 0.50, 0.75)
sensi_vec <- c(0.45, 0.75)
speci_vec <- c(0.45, 0.75)
paramspace <- expand.grid(iterations = iterations, N = N_vec, BR = BR_vec, sensi = sensi_vec, speci = speci_vec)
> paramspace
iterations N BR sensi speci
1 100 10 0.25 0.45 0.45
2 100 50 0.25 0.45 0.45
3 100 10 0.50 0.45 0.45
4 100 50 0.50 0.45 0.45
5 100 10 0.75 0.45 0.45
6 100 50 0.75 0.45 0.45
7 100 10 0.25 0.75 0.45
8 100 50 0.25 0.75 0.45
9 100 10 0.50 0.75 0.45
10 100 50 0.50 0.75 0.45
11 100 10 0.75 0.75 0.45
12 100 50 0.75 0.75 0.45
13 100 10 0.25 0.45 0.75
14 100 50 0.25 0.45 0.75
15 100 10 0.50 0.45 0.75
16 100 50 0.50 0.45 0.75
17 100 10 0.75 0.45 0.75
18 100 50 0.75 0.45 0.75
19 100 10 0.25 0.75 0.75
20 100 50 0.25 0.75 0.75
21 100 10 0.50 0.75 0.75
22 100 50 0.50 0.75 0.75
23 100 10 0.75 0.75 0.75
24 100 50 0.75 0.75 0.75
How can I pass each row of parameter values to my model and automatically run through all the parameterisations stated in paramspace?
As suggested in the comments, you can create a function and then use apply to loop over the parameter combinations:
## functions needed to run the model
learn <- function(prior, sensi, speci, e){
  out <- ifelse(e == 1, (sensi*prior) / ((sensi*prior) + (1-speci)*(1-prior)),
                ((1-sensi)*prior) / (((1-sensi)*prior) + (speci*(1-prior))))
  out
}
feed <- function(vec, sensi, speci){
  prior <- 0.5
  for (i in vec){
    res <- learn(prior, sensi, speci, i)
    prior <- res
  }
  return(prior)
}
runModel <- function(iterations = 100,
                     N = 10,
                     BR = 0.66,
                     sensi = 0.75,
                     speci = 0.45) {
  ## initialize results object
  res <- NULL
  ## loop for number of iterations
  for (j in 1:iterations){
    X <- as.numeric(rbinom(1, 1, BR))
    if (X == 1){ # if X is 1...
      agents <- c(1:N)
      evidence <- vector("list", length(agents))
      for (i in agents) {
        n <- sample(10, 1, replace = TRUE)
        evidence[[i]] <- rbinom(n, 1, sensi)
      }
    } else { # if X is 0...
      agents <- c(1:N)
      evidence <- vector("list", length(agents))
      for (i in agents) {
        n <- sample(10, 1, replace = TRUE)
        evidence[[i]] <- rbinom(n, 1, sensi)
        evidence[[i]] <- ifelse(evidence[[i]]==1, 0, 1) # flip evidence
      }
    }
    # feed vectors of evidence through learn function,
    # passing sensi and speci explicitly
    #t0 <- sapply(evidence, feed)
    t0 <- sapply(evidence, function(e){feed(e, sensi, speci)})
    # save dataframe ("i" = j records the iteration, as in your original code)
    df <- data.frame("i" = j,
                     "ID" = c(1:N),
                     "E" = t0,
                     "X" = X,
                     "N" = N,
                     "BR" = BR,
                     "sensi" = sensi,
                     "speci" = speci)
    res <- rbind(res, df)
  }
  res
}
# Define parameter space
iterations <- 100
N_vec <- c(10, 50)
BR_vec <- c(0.25, 0.50, 0.75)
sensi_vec <- c(0.45, 0.75)
speci_vec <- c(0.45, 0.75)
paramspace <- expand.grid(iterations = iterations, N = N_vec, BR = BR_vec, sensi = sensi_vec, speci = speci_vec)
# Loop over parameter space:
res <- apply(paramspace, 1, function(paramset) {
  iterations = paramset[1]
  N = paramset[2]
  BR = paramset[3]
  sensi = paramset[4]
  speci = paramset[5]
  runModel(iterations = iterations, N = N, BR = BR, sensi = sensi, speci = speci)
})
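res is then a list with one data frame per row of paramspace. A small sketch of one way to stack it into a single data frame for analysis (the paramset column name is purely illustrative, not something the model produces):
all_res <- do.call(rbind, lapply(seq_along(res), function(k) {
  cbind(paramset = k, res[[k]]) # tag each block with its paramspace row
}))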
You can also use the foreach package which, with an appropriate backend registered, offers parallelization capabilities in case your task becomes more intensive. Here is a simple example to understand how it works.
foreach(a=1:3, b=4:6) %do% (a + b)
Then I tried to embed your code into foreach:
require(foreach)
## functions needed to run the model
learn <- function(prior, sensi, speci, e){
  out <- ifelse(e == 1, (sensi*prior) / ((sensi*prior) + (1-speci)*(1-prior)),
                ((1-sensi)*prior) / (((1-sensi)*prior) + (speci*(1-prior))))
  out
}
## feed() now takes sensi and speci as arguments so it does not rely on
## global variables inside the foreach body
feed <- function(vec, sensi, speci){
  prior <- 0.5
  for (i in vec){
    res <- learn(prior, sensi, speci, i)
    prior <- res
  }
  return(prior)
}
## set up for multiple parameterizations
iterations <- 100
N_vec <- c(10, 50)
BR_vec <- c(0.25, 0.50, 0.75)
sensi_vec <- c(0.45, 0.75)
speci_vec <- c(0.45, 0.75)
paramspace <- expand.grid(iterations = iterations, N = N_vec, BR = BR_vec, sensi = sensi_vec, speci = speci_vec)
res <- foreach(iterations = paramspace$iterations,
               N = paramspace$N,
               BR = paramspace$BR,
               sensi = paramspace$sensi,
               speci = paramspace$speci) %do% {
  ## initialize results object
  res <- NULL
  ## loop for number of iterations
  for (j in 1:iterations){
    X <- as.numeric(rbinom(1, 1, BR))
    if (X == 1){ # if X is 1...
      agents <- c(1:N)
      evidence <- vector("list", length(agents))
      for (i in agents) {
        n <- sample(10, 1, replace = TRUE)
        evidence[[i]] <- rbinom(n, 1, sensi)
      }
    } else { # if X is 0...
      agents <- c(1:N)
      evidence <- vector("list", length(agents))
      for (i in agents) {
        n <- sample(10, 1, replace = TRUE)
        evidence[[i]] <- rbinom(n, 1, sensi)
        evidence[[i]] <- ifelse(evidence[[i]]==1, 0, 1) # flip evidence
      }
    }
    # feed vectors of evidence through learn function
    # (sensi and speci passed explicitly)
    t0 <- sapply(evidence, feed, sensi = sensi, speci = speci)
    # save dataframe
    df <- data.frame("i" = j,
                     "ID" = c(1:N),
                     "E" = t0,
                     "X" = X,
                     "N" = N,
                     "BR" = BR,
                     "sensi" = sensi,
                     "speci" = speci)
    res <- rbind(res, df)
  }
  res
}
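Note that %do% runs sequentially; to actually run parameterisations in parallel you register a backend and switch to %dopar%. A minimal sketch, assuming the doParallel package is installed:
library(doParallel)
cl <- makeCluster(2) # two workers, purely for illustration
registerDoParallel(cl)
# same foreach call as above, but with %dopar% instead of %do%;
# export the helper functions so the workers can see them, e.g.
# res <- foreach(..., .export = c("learn", "feed")) %dopar% { ... }
stopCluster(cl)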
Another approach is to make a function and use Map(...). The advantage of Map is that your paramspace will not be coerced into a matrix, which would force everything to the same type (i.e., numeric, character, etc.).
There were also some other changes made in order to let R do the accounting for us. Primarily:
X is now a logical, so we can simplify our if statements. Additionally, the allocation is made all at once instead of in a loop.
We change the feed() function to also generate the evidence. This allows us to...
...use replicate to repeat the loops.
learn2 <- function(prior, sensi, speci, e){
  out <- ifelse(e, (sensi*prior) / ((sensi*prior) + (1-speci)*(1-prior)),
                ((1-sensi)*prior) / (((1-sensi)*prior) + (speci*(1-prior))))
  out
}
feed2 = function(x, N, samp_n = 10L, sensi, speci) {
  evidence = rbinom(sample(samp_n, 1L, replace = TRUE),
                    1,
                    if (x) sensi else 1 - sensi)
  prior = 0.5
  for (i in evidence) {
    res = learn2(prior, sensi, speci, i)
    prior = res
  }
  return(prior)
}
runModel2 <- function(iterations = 2,
                      N = 10,
                      BR = 0.66,
                      sensi = 0.75,
                      speci = 0.45) {
  # draw X with probability BR of TRUE; replace and prob must be named,
  # otherwise BR would silently be taken as the replace argument
  X = sample(c(TRUE, FALSE), N, replace = TRUE, prob = c(BR, 1 - BR))
  ## this is done now so that the columns will be ordered nicer
  ans = list(ID = 1:N,
             N = N,
             BR = BR,
             sensi = sensi,
             speci = speci,
             X = X)
  t0s = replicate(iterations,
                  vapply(X, feed2, FUN.VALUE = 0, N, 10L, sensi, speci, USE.NAMES = FALSE),
                  simplify = FALSE)
  names(t0s) = paste0("E_", 1:iterations)
  return(as.data.frame(c(ans, t0s)))
}
runModel2()
#> ID N BR sensi speci X E_1 E_2
#> 1 1 10 0.66 0.75 0.45 TRUE 0.82967106 0.657648599
#> 2 2 10 0.66 0.75 0.45 FALSE 0.43103448 0.006827641
#> 3 3 10 0.66 0.75 0.45 TRUE 0.43103448 0.775671866
#> 4 4 10 0.66 0.75 0.45 TRUE 0.71716957 0.431034483
#> 5 5 10 0.66 0.75 0.45 FALSE 0.24176079 0.016593958
#> 6 6 10 0.66 0.75 0.45 FALSE 0.30303324 0.008992838
#> 7 7 10 0.66 0.75 0.45 TRUE 0.82967106 0.865405260
#> 8 8 10 0.66 0.75 0.45 FALSE 0.43103448 0.439027817
#> 9 9 10 0.66 0.75 0.45 FALSE 0.57692308 0.050262167
#> 10 10 10 0.66 0.75 0.45 FALSE 0.02178833 0.296208531
This output is a little wider than your original approach. We can always reshape the E_# columns (a sketch follows after the Map() example below), but this may end up being better for your actual use case.
Finally, here is Map() in action:
iterations <- 100
N_vec <- c(10, 50)
BR_vec <- c(0.25, 0.50, 0.75)
sensi_vec <- c(0.45, 0.75)
speci_vec <- c(0.45, 0.75)
paramspace <- expand.grid(iterations = iterations, N = N_vec, BR = BR_vec, sensi = sensi_vec, speci = speci_vec)
res = Map(runModel2, paramspace$iterations, paramspace$N, paramspace$BR, paramspace$sensi, paramspace$speci)
res[[24L]][1:10, 1:8] ## only first 10 rows for demonstration
## ID N BR sensi speci X E_1 E_2
##1 1 50 0.75 0.75 0.75 TRUE 0.500000000 0.500000000
##2 2 50 0.75 0.75 0.75 FALSE 0.001369863 0.035714286
##3 3 50 0.75 0.75 0.75 FALSE 0.250000000 0.900000000
##4 4 50 0.75 0.75 0.75 TRUE 0.750000000 0.250000000
##5 5 50 0.75 0.75 0.75 TRUE 0.987804878 0.500000000
##6 6 50 0.75 0.75 0.75 TRUE 0.964285714 0.250000000
##7 7 50 0.75 0.75 0.75 TRUE 0.750000000 0.750000000
##8 8 50 0.75 0.75 0.75 FALSE 0.012195122 0.035714286
##9 9 50 0.75 0.75 0.75 TRUE 0.750000000 0.500000000
##10 10 50 0.75 0.75 0.75 FALSE 0.250000000 0.001369863
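As mentioned above, the wide E_# columns can be reshaped back to long form. A sketch with base reshape(), run on one element of res (tidyr::pivot_longer would work just as well):
e_cols <- grep("^E_", names(res[[24L]]), value = TRUE)
long <- reshape(res[[24L]], varying = e_cols, v.names = "E",
                timevar = "iteration", times = seq_along(e_cols),
                direction = "long")
head(long) # one row per ID per iteration, as in the original format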

Summarizing count data as proportion in a data.frame

dummy <- data.frame(Q1 = c(0, 1, 0, 1),
                    Q2 = c(1, 1, 0, 1),
                    Q3 = c(0, 1, 1, 0))
df_dummy <- data.frame(Question = c("Q1", "Q2", "Q3"),
                       X1 = c(2/4, 3/4, 2/4),
                       X0 = c(2/4, 1/4, 2/4))
> dummy
Q1 Q2 Q3
1 0 1 0
2 1 1 1
3 0 0 1
4 1 1 0
> df_dummy
Question X1 X0
1 Q1 0.50 0.50
2 Q2 0.75 0.25
3 Q3 0.50 0.50
I have some data (dummy) where I have binary responses to Q1, Q2, and Q3. I want to summarize my data in the format shown in df_dummy, where for each question, column X1 tells me the proportion of people that answered 1 and column X0 tells me the proportion of people that answered 0. I tried prop.table but that didn't return the desired result.
Another way is counting the proportion of 1s and then deducing from that the proportion of 0s:
X1 <- colSums(dummy==1)/nrow(dummy)
df_dummy <- data.frame(X1, X0=1-X1)
df_dummy
# X1 X0
#Q1 0.50 0.50
#Q2 0.75 0.25
#Q3 0.50 0.50
NB, inspired by @akrun's idea of colMeans: you can also use colMeans instead of dividing colSums by the number of rows to define X1:
X1 <- colMeans(dummy==1)
df_dummy <- data.frame(X1, X0=1-X1)
df_dummy
# X1 X0
#Q1 0.50 0.50
#Q2 0.75 0.25
#Q3 0.50 0.50
We can try apply with MARGIN = 2 and divide the counts of each value by the length of the column:
t(apply(dummy, 2, function(x) table(x)/length(x)))
# 0 1
#Q1 0.50 0.50
#Q2 0.25 0.75
#Q3 0.50 0.50
We can do this with table and prop.table
t(sapply(dummy, function(x) prop.table(table(x))))
# 0 1
#Q1 0.50 0.50
#Q2 0.25 0.75
#Q3 0.50 0.50
Or a more efficient approach is to call table once
prop.table(table(stack(dummy)[2:1]),1)
# values
#ind 0 1
# Q1 0.50 0.50
# Q2 0.25 0.75
# Q3 0.50 0.50
Or another option is colMeans (inspired from #Cath's use of colSums)
X0 <- colMeans(!dummy)
data.frame(X1 = 1 - X0, X0)
# X1 X0
#Q1 0.50 0.50
#Q2 0.75 0.25
#Q3 0.50 0.50
Another way to do this would be using do.call and lapply; note that this returns counts rather than proportions (divide by nrow(dummy) to get proportions):
do.call(cbind, lapply(dummy, function(x) data.frame(table(x))[, 2]))
#      Q1 Q2 Q3
# [1,]  2  1  2
# [2,]  2  3  2
Less elegantly than in the answer above:
d <- t(dummy)
cbind(X0 = (ncol(d) - rowSums(d)) / ncol(d), X1 = rowSums(d) / ncol(d))
Or, to avoid computing the same stuff twice, and to get a data frame:
d <- t(dummy)
i <- ncol(d)
j <- rowSums(d)
data.frame(Question = rownames(d), X0 = (i - j) / i, X1 = j / i)
There you go:
Question X0 X1
Q1 Q1 0.50 0.50
Q2 Q2 0.25 0.75
Q3 Q3 0.50 0.50
A tidyverse option:
library(tidyr)
library(janitor)
dummy %>%
gather(question, val) %>% # reshape to long form
tabyl(question, val) %>% # make crosstab table
adorn_percentages("row") %>%
clean_names()
question x0 x1
Q1 0.50 0.50
Q2 0.25 0.75
Q3 0.50 0.50
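If you need the exact df_dummy layout from the question, with Question as a real column rather than rownames, a small variation on the colMeans approach:
X1 <- colMeans(dummy == 1)
df_dummy <- data.frame(Question = names(X1), X1 = X1, X0 = 1 - X1, row.names = NULL)
df_dummy
#   Question   X1   X0
# 1       Q1 0.50 0.50
# 2       Q2 0.75 0.25
# 3       Q3 0.50 0.50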

R Conditional standard deviation

I have a large data set and I need to get the standard deviation for the Main column based on the number of rows in other columns. Here is a sample data set:
df1 <- data.frame(
  Main = c(0.33, 0.57, 0.60, 0.51),
  B = c(NA, NA, 0.09, 0.19),
  C = c(NA, 0.05, 0.07, 0.05),
  D = c(0.23, 0.26, 0.23, 0.26)
)
View(df1)
# Main B C D
# 1 0.33 NA NA 0.23
# 2 0.57 NA 0.05 0.26
# 3 0.60 0.09 0.07 0.23
# 4 0.51 0.19 0.05 0.26
Take column B as an example: since rows 1 and 2 are NA, its standard deviation will be sd(df1[3:4, 1]); for columns C and D it will be sd(df1[2:4, 1]) and sd(df1[1:4, 1]) respectively. Therefore, the result will be:
# B C D
# 1 0.06 0.05 0.12
I did the following but it only returned one number, 0.0636:
df2 <- df1[,-1]!=0
sd(df1[df2,1], na.rm = T)
My data set has many more columns, and I'm wondering if there is a more efficient way to get it done? Many thanks!
Try:
sapply(df1[,-1], function(x) sd(df1[!is.na(x), 1]))
# B C D
# 0.06363961 0.04582576 0.12093387
x <- colnames(df1) # list all columns you want to calculate the sd of
value <- sapply(x, function(col) sd(df1[, col, drop = TRUE], na.rm = TRUE))
value
# Main B C D
# 0.12093387 0.07071068 0.01154701 0.01732051
Note that this computes each column's own standard deviation (NAs removed), not the sd of Main conditioned on the other columns, so it answers a slightly different question.
We can get this with colSds from matrixStats
library(matrixStats)
colSds(`dim<-`(df1[,1][NA^is.na(df1[-1])*row(df1[-1])], dim(df1[,-1])), na.rm = TRUE)
#[1] 0.06363961 0.04582576 0.12093387
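To unpack that one-liner: NA^is.na(df1[-1]) is 1 where a value is present and NA where it is missing; multiplying by row(df1[-1]) keeps the row indices of the non-NA cells; indexing df1[,1] with those picks out the matching Main values; and dim<- reshapes the result so colSds can take column-wise sds with na.rm = TRUE. A more readable sketch of the same idea with vapply:
main_sds <- vapply(df1[-1], function(x) sd(df1$Main[!is.na(x)]), numeric(1))
main_sds
#          B          C          D
# 0.06363961 0.04582576 0.12093387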

What is an alternative to ifelse() in R?

I have a variable (say, VarX) with values 1:4 in a dataset with approximately 2000 rows, along with other variables. I would like to create a new variable (NewVar) so that if the value of VarX is 1, the value of NewVar is 0.32 (the value from myMat[1, 1]); if the value of VarX is 2, the value of NewVar is 0.05 (the value from myMat[2, 1]); and so on...
myMat
VarA VarB VarC
[1,] 0.32 0.34 0.27
[2,] 0.05 0.02 0.11
[3,] 0.11 0.11 0.17
[4,] 0.52 0.52 0.45
I have tried the following and it works fine:
df$NewVar <- ifelse(df$VarX == 1, 0.32,
             ifelse(df$VarX == 2, 0.05,
             ifelse(df$VarX == 3, 0.11,
             ifelse(df$VarX == 4, 0.52, 0))))
However, I have another variable (say, VarY) which has 182 values and another matrix with 182 different values. So, using ifelse() would be quite tedious. Is there another way to perform the task in R? Thank you!
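No answer is quoted here, but as a sketch of the usual alternative: when VarX holds integers that are valid row indices of myMat, you can subscript the matrix directly, which vectorises the whole lookup in one step:
# vectorised table lookup: one row of myMat per element of df$VarX
df$NewVar <- myMat[df$VarX, "VarA"]
The same idiom scales to the 182-value case: build the lookup vector or matrix once and index it with the variable, instead of writing nested ifelse() calls.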

How to delete a duplicate row in R

I have the following data
x y z
1 2 a
1 2
data[2,3] is a factor level, but nothing shows because it is an empty string. The data has a lot of rows like this. How do I delete the rows where z has nothing, i.e. rows such as the second one?
output should be
x y z
1 2 a
OK. Stabbing a little bit in the dark here.
Imagine the following dataset:
mydf <- data.frame(
  x = c(.11, .11, .33, .33, .11, .11),
  y = c(.22, .22, .44, .44, .22, .44),
  z = c("a", "", "", "f", "b", ""))
mydf
# x y z
# 1 0.11 0.22 a
# 2 0.11 0.22
# 3 0.33 0.44
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
From the combination of your title and your description (neither of which seems to fully describe your problem), I would deduce that you want to drop rows 2 and 3, but not row 6. In other words, you want to first check whether the row is duplicated (presumably only on the first two columns) and then, if the third column is empty, drop that row. By those instructions, row 5 should remain (column "z" is not blank) and row 6 should remain (the combination of columns 1 and 2 is not a duplicate).
If that's the case, here's one approach:
# Copy the data.frame, "sorting" by column "z"
mydf2 <- mydf[rev(order(mydf$z)), ]
# Subset according to your conditions
mydf2 <- mydf2[duplicated(mydf2[1:2]) & mydf2$z %in% "", ]
mydf2
# x y z
# 3 0.33 0.44
# 2 0.11 0.22
^^ Those are the data that we want to remove. One way to remove them is using setdiff on the rownames of each dataset:
mydf[setdiff(rownames(mydf), rownames(mydf2)), ]
# x y z
# 1 0.11 0.22 a
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
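For reference, a one-step equivalent on this data. It drops empty-z rows whose (x, y) pair occurs in any other row; the only behavioural difference from the sort-based version above is that if two empty-z rows shared a pair that appeared nowhere else, this would drop both of them:
# flag (x, y) pairs that occur more than once, looking in both directions
dup_pair <- duplicated(mydf[1:2]) | duplicated(mydf[1:2], fromLast = TRUE)
mydf[!(mydf$z %in% "" & dup_pair), ]
#      x    y z
# 1 0.11 0.22 a
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44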
Some example data:
df = data.frame(x = runif(100),
                y = runif(100),
                z = sample(c(letters[0:10], ""), 100, replace = TRUE))
> head(df)
x y z
1 0.7664915 0.86087017 a
2 0.8567483 0.83715022 d
3 0.2819078 0.85004742 f
4 0.8241173 0.43078311 h
5 0.6433988 0.46291916 e
6 0.4103120 0.07511076
Spot row six with the missing value. You can subset using a vector of logicals (TRUE/FALSE):
df[df$z != "",]
And as #AnandaMahto commented, you can even check against multiple conditions:
df[!df$z %in% c("", " "),]
