block diagonal covariance matrix by group of variable - r

Suppose this is my dataset: 3 subjects, measured at two time points t1 and t2; at each time point, each subject is measured twice.
ID Time Observation Y
1 t1 1 3.1
1 t1 2 4.5
1 t2 1 4.2
1 t2 2 1.7
2 t1 1 2.3
2 t1 2 3.1
2 t2 1 1.5
2 t2 2 2.2
3 t1 1 4.1
3 t1 2 4.9
3 t2 1 3.5
3 t2 2 3.2
What is the block diagonal covariance matrix for this dataset? I am assuming the matrix would have three blocks, each block being the 2x2 variance-covariance matrix of subject i at t1 and t2. Is this correct?

Yes: the matrix has three 2x2 blocks, one per subject. The code below computes a covariance matrix for each ID, then assembles a block-diagonal matrix out of them.
make_block_matrix <- function(x, y, fill = NA) {
  # pad with `fill` so that x ends up top-left and y bottom-right
  u <- matrix(fill, nrow = dim(x)[1], ncol = dim(y)[2])
  v <- matrix(fill, nrow = dim(y)[1], ncol = dim(x)[2])
  u <- cbind(x, u)
  v <- cbind(v, y)
  rbind(u, v)
}
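As a quick illustration of what make_block_matrix() does, here is a toy check (not part of the original data):
make_block_matrix(matrix(1, 2, 2), matrix(2, 1, 1), fill = 0)
#>      [,1] [,2] [,3]
#> [1,]    1    1    0
#> [2,]    1    1    0
#> [3,]    0    0    2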
# compute covariance matrices for each ID
sp <- split(df1, df1$ID)
cov_list <- lapply(sp, \(x) {
  y <- reshape(x[-1], direction = "wide", idvar = "Time", timevar = "Observation")
  cov(y[-1])
})
# no longer needed, tidy up
rm(sp)
# fill with NA's, simple call
# res <- Reduce(make_block_matrix, cov_list)
# fill with zeros
res <- Reduce(\(x, y) make_block_matrix(x, y, fill = 0), cov_list)
# finally, the results' row and column names
nms <- paste(sapply(cov_list, rownames), names(cov_list), sep = ".")
dimnames(res) <- list(nms, nms)
res
#> Y.1.1 Y.2.2 Y.1.3 Y.2.1 Y.1.2 Y.2.3
#> Y.1.1 0.605 -1.54 0.00 0.000 0.00 0.000
#> Y.2.2 -1.540 3.92 0.00 0.000 0.00 0.000
#> Y.1.3 0.000 0.00 0.32 0.360 0.00 0.000
#> Y.2.1 0.000 0.00 0.36 0.405 0.00 0.000
#> Y.1.2 0.000 0.00 0.00 0.000 0.18 0.510
#> Y.2.3 0.000 0.00 0.00 0.000 0.51 1.445
Created on 2023-02-20 with reprex v2.0.2
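As an aside, if a sparse result is acceptable, the Matrix package can assemble the same block-diagonal matrix directly. A minimal sketch reusing cov_list and nms from above (note that bdiag() drops dimnames, so they are reattached after densifying):
library(Matrix)
res_sparse <- bdiag(cov_list)        # sparse block-diagonal matrix from a list of blocks
res_dense <- as.matrix(res_sparse)   # densify, then restore row/column names
dimnames(res_dense) <- list(nms, nms)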
Data
df1 <- "ID Time Observation Y
1 t1 1 3.1
1 t1 2 4.5
1 t2 1 4.2
1 t2 2 1.7
2 t1 1 2.3
2 t1 2 3.1
2 t2 1 1.5
2 t2 2 2.2
3 t1 1 4.1
3 t1 2 4.9
3 t2 1 3.5
3 t2 2 3.2"
df1 <- read.table(text = df1, header = TRUE)
Created on 2023-02-20 with reprex v2.0.2

Related

How to calculate values in an R dataframe when columns are dependent on each other

First time poster - so apologies if this question is basic/poorly explained. Grateful to anyone who can help or point me in the right direction!
I would like to do the following within an R dataframe if possible:
Existing Data
Column A is a vector of values, say 10 to 20.
New Data/columns
Column B will be Column A multiplied by Column C.
Column C will be Column C minus Column B from the previous row of data, i.e. data$C[-1] - data$B[-1], apart from the first row of Column C of course, which I will give a fixed value.
I have tried these as separate steps, but I keep overwriting columns B and C, and have a feeling I have been going about this the wrong way! I could share my code, but I think this would confuse matters!
Thanks in advance!
EDIT TO ADD CODE:
library(dplyr)

A <- c(0.1, 0.2, 0.3, 0.4, 0.5)
df1 <- data.frame(A)
df1$B <- 0
df1$C <- 0
df1$C[1] <- 100
df2 <- df1 %>%
  mutate(B = C * A,
         C = lag(C - B))
RESULT FROM THE ABOVE
    A  B  C
1 0.1 10 NA
2 0.2  0 90
3 0.3  0  0
4 0.4  0  0
5 0.5  0  0
EXPECTED OUTPUT
    A     B      C
1 0.1 10.00 100.00
2 0.2 18.00  90.00
3 0.3 21.60  72.00
4 0.4 20.16  50.40
5 0.5 15.12  30.24
C2 = C1 - B1
B2 = C2 * A2
We can use accumulate from purrr to do the recursive update:
library(dplyr)
library(purrr)
tmp <- with(df1, accumulate(A, ~ .x - (.x * .y), .init = first(C)))
df2 <- df1 %>%
  mutate(C = head(tmp, -1), B = -diff(tmp))
df2
# A B C
#1 0.1 10.00 100.00
#2 0.2 18.00 90.00
#3 0.3 21.60 72.00
#4 0.4 20.16 50.40
#5 0.5 15.12 30.24
Or use base R
tmp <- with(df1, Reduce(function(x, y) x - (x * y), A,
                        accumulate = TRUE, init = C[1]))
df2 <- transform(df1, C = head(tmp, -1), B = -diff(tmp))
If you don't mind a mathematical approach, you can first derive the general expression for the recursion and then write the R code from it.
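Concretely, since B_n = A_n C_n and C_{n+1} = C_n - B_n, each step multiplies C by (1 - A_n), so the recursion unrolls to

$$C_n = C_1 \prod_{i=1}^{n-1} (1 - A_i), \qquad B_n = A_n C_n,$$

which is exactly the cumprod(1 - A) term in the code below.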
Below is one implementation with base R
transform(
  transform(
    df1,
    C = C[1] * c(1, cumprod(1 - A)[-nrow(df1)])
  ),
  B = A * C
)
which gives
A B C
1 0.1 10.00 100.00
2 0.2 18.00 90.00
3 0.3 21.60 72.00
4 0.4 20.16 50.40
5 0.5 15.12 30.24
A data.table option in a similar manner is
> setDT(df1)[, C := first(C) * c(1, cumprod(1 - A)[-.N])][, B := A * C][]
A B C
1: 0.1 10.00 100.00
2: 0.2 18.00 90.00
3: 0.3 21.60 72.00
4: 0.4 20.16 50.40
5: 0.5 15.12 30.24

Storing output for nested loop

I'm trying to run a nested loop for logistic regression: an outer loop over discretization values and an inner loop over each class.
Here's the code so far... I'm unable to get an output for each iteration.
class <- c(1, 2, 3, 4, 5)
discretization_value <- seq(0.25, 0.75, by = 0.05)
output <- data.frame(matrix(nrow = 500, ncol = 5))
names(output) <- c("discretization_value", "class", "var1_coef", "var2_coef", "var3_coef")

for (i in discretization_value) {
  for (j in class) {
    df$discretization_value <- ifelse(df$score >= i, 1, 0)
    result <- glm(discretization_value ~ var1 + var2 + var3,
                  data = df[df$class == j, ], family = "binomial")
    output[i, 1] <- i
    output[i, 2] <- j
    output[i, 3] <- coef(summary(result))[c("var1"), c("Estimate")]
    output[i, 4] <- coef(summary(result))[c("var2"), c("Estimate")]
    output[i, 5] <- coef(summary(result))[c("var3"), c("Estimate")]
  }
}
A snippet of my df:
class score var1 var2 var3
1 0.3 0.18 0.33 356
1 0.5 0.22 0.55 33
1 0.6 0.77 0.44 35
2 0.9 0.99 0.55 2
3 0 0 0 0
3 0.4 0.5 0.11 5
4 0 0.6 0 7
4 0 0.6 0 9
4 0.6 0.2 0.1 6
Could this be the problem?
data = df[df$class == j,], family= "binomial"))
I would try removing the comma before the closing square bracket.
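For what it's worth, the comma in df[df$class == j, ] is standard row subsetting, so another likely culprit is the row index: i takes fractional values such as 0.25, and output[0.25, 1] truncates the index to output[0, 1], which assigns nothing. A sketch using an explicit row counter instead (the counter variable `row` is mine, not from the question):
row <- 0
for (i in discretization_value) {
  for (j in class) {
    row <- row + 1
    df$discretization_value <- ifelse(df$score >= i, 1, 0)
    result <- glm(discretization_value ~ var1 + var2 + var3,
                  data = df[df$class == j, ], family = "binomial")
    # write each fit to its own row
    output[row, ] <- c(i, j, coef(result)[c("var1", "var2", "var3")])
  }
}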

dataframe column wise subtraction and division.

I need help with column-wise subtraction and division over N columns. Below are the columns of the input dataframe.
input dataframe:
> df
A B C D
1 1 3 6 2
2 3 3 3 4
3 1 2 2 2
4 4 4 4 4
5 5 2 3 2
Formula: keep column a as-is, then compute (b - a) / (1 - a) for each adjacent pair of columns.
MY CODE
ABC <- cbind.data.frame(DF[1], (DF[-1] - DF[-ncol(DF)])/(1 - DF[-ncol(DF)]))
Expected output:
A B C D
1 Inf -1.5 0.8
3 0.00 0.0 -0.5
1 Inf 0.0 0.0
4 0.00 0.0 0.0
5 0.75 -1.0 0.5
But I don't want to use ncol here, because there is another column after column D in the actual dataframe.
I want to apply the formula only to the first 4 columns; if I use ncol, it will traverse every column in the dataframe.
Please help, thanks.
What about trying:
df <- matrix(c(1,3,6,2,3,3,3,4,1,2,2,2,4,4,4,4,5,2,3,2), nrow = 5, byrow = TRUE)
df_2 <- matrix((df[,2]-df[,1])/(1-df[,1]),5,1)
df_3 <- matrix((df[,3]-df[,2])/(1-df[,2]),5,1)
df_4 <- matrix((df[,4]-df[,3])/(1-df[,3]),5,1)
cbind(df[,1],df_2,df_3,df_4)
edit: a loop version
df <- matrix(c(1,3,6,2,3,3,3,4,1,2,2,2,4,4,4,4,5,2,3,2), nrow = 5, byrow = TRUE)
test_bind <- c()
test_bind <- cbind(test_bind, df[,1])
for (i in 1:3) {
  df_1 <- matrix((df[, i + 1] - df[, i]) / (1 - df[, i]), 5, 1)
  test_bind <- cbind(test_bind, df_1)
}
test_bind
Here is one option with the tidyverse:
library(dplyr)
library(purrr)
map2_df(DF[2:4], DF[1:3], ~ (.x - .y) / (1 - .y)) %>%
  bind_cols(DF[1], .)
# A B C D
#1 1 Inf -1.5 0.8
#2 3 0.00 0.0 -0.5
#3 1 Inf 0.0 0.0
#4 4 0.00 0.0 0.0
#5 5 0.75 -1.0 0.5
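If base R is fine, the same pairwise formula can also be written without ncol() at all, a sketch assuming the target columns are exactly positions 1 through 4 of the data.frame df from the question:
# subtract/divide adjacent columns 1:3 from columns 2:4, keep column 1 as-is
cbind(df[1], (df[2:4] - df[1:3]) / (1 - df[1:3]))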

replacing values in dataframe with another dataframe r

I have a dataframe of values that represent fold changes as such:
> df1 <- data.frame(A=c(1.74,-1.3,3.1), B=c(1.5,.9,.71), C=c(1.1,3.01,1.4))
A B C
1 1.74 1.50 1.10
2 -1.30 0.90 3.01
3 3.10 0.71 1.40
And a dataframe of pvalues as such that matches rows and columns identically:
> df2 <- data.frame(A=c(.02,.01,.8), B=c(NA,.01,.06), C=c(.01,.01,.03))
A B C
1 0.02 NA 0.01
2 0.01 0.01 0.01
3 0.80 0.06 0.03
What I want is to modify the values in df1 so that I retain only the values whose corresponding p-value in df2 is < .05, and replace them with NA otherwise. Note there are also NAs in df2.
> desired <- data.frame(A=c(1.74,-1.3,NA), B=c(NA,.9,NA), C=c(1.1,3.01,1.4))
> desired
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40
I first tried to use vector syntax on these dataframes and that didn't work. Then I tried a for loop over columns and that also failed.
I don't think I understand how to index each i,j position and then replace df1 values with NA based on a logical test of df2.
Or perhaps there is a better way in R.
You can try this:
df1[!(df2 < 0.05) | is.na(df2)] <- NA
Out:
> df1
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40
ifelse and as.matrix seem to do the trick.
df1 <- data.frame(A=c(1.74,-1.3,3.1), B=c(1.5,.9,.71), C=c(1.1,3.01,1.4))
df2 <- data.frame(A=c(.02,.01,.8), B=c(NA,.01,.06), C=c(.01,.01,.03))
x1 <- as.matrix(df1)
x2 <- as.matrix(df2)
as.data.frame( ifelse( x2 >= 0.05 | is.na(x2), NA, x1) )
Result
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40
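A variant of the same matrix-logic idea that leaves df1 itself untouched is base R's replace(), written here as a sketch with the same condition:
# returns a modified copy; df1 is unchanged
desired <- replace(df1, is.na(df2) | df2 >= 0.05, NA)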

Replicate rows by different N

I’ve the following data
mydata <- data.frame(id=c(1,2,3,4,5), n=c(2.63, 1.5, 0.5, 3.5, 4))
1) I need to repeat the rows for each id n times, with n rounded up. For example, n=2.63 for id=1, so I need to replicate the id=1 row three times. If n=0.5, I need to replicate it only once.
2) Create a new variable called t, where the sum of t for each id must equal n.
3) Create another new variable called accumulated.t
Here how the output looks like:
id n t accumulated.t
1 2.63 1 1
1 2.63 1 2
1 2.63 0.63 2.63
2 1.5 1 1
2 1.5 0.5 1.5
3 0.5 0.5 0.5
4 3.5 1 1
4 3.5 1 2
4 3.5 1 3
4 3.5 0.5 3.5
5 4 1 1
5 4 1 2
5 4 1 3
5 4 1 4
Take the ceiling of the 'n' column and use it to expand the rows of 'mydata' (rep(1:nrow(mydata), ceiling(mydata$n))).
Using data.table, we convert the 'data.frame' to a 'data.table' (setDT(mydata1)). Grouped by the 'id' column, we replicate (rep) 1 with times given by the trunc of the first value of 'n' (rep(1, trunc(n[1]))). We then take the difference between the unique value of 'n' per group and the sum of 'tmp' (n[1] - sum(tmp)). If the difference is greater than 0, we concatenate 'tmp' and 'tmp2' (c(tmp, tmp2)); if it is 0, we take only 'tmp'. This is placed in a list to create the two columns: 't' and the cumulative sum of 'tmp3' (cumsum(tmp3)).
library(data.table)
mydata1 <- mydata[rep(1:nrow(mydata), ceiling(mydata$n)), ]
setDT(mydata1)[, c('t', 'taccum') := {
  tmp <- rep(1, trunc(n[1]))
  tmp2 <- n[1] - sum(tmp)
  tmp3 <- if (tmp2 == 0) tmp else c(tmp, tmp2)
  list(tmp3, cumsum(tmp3))
}, by = id]
mydata1
# id n t taccum
# 1: 1 2.63 1.00 1.00
# 2: 1 2.63 1.00 2.00
# 3: 1 2.63 0.63 2.63
# 4: 2 1.50 1.00 1.00
# 5: 2 1.50 0.50 1.50
# 6: 3 0.50 0.50 0.50
# 7: 4 3.50 1.00 1.00
# 8: 4 3.50 1.00 2.00
# 9: 4 3.50 1.00 3.00
#10: 4 3.50 0.50 3.50
#11: 5 4.00 1.00 1.00
#12: 5 4.00 1.00 2.00
#13: 5 4.00 1.00 3.00
#14: 5 4.00 1.00 4.00
An alternative that utilizes base R.
mydata <- data.frame(id=c(1,2,3,4,5), n=c(2.63, 1.5, 0.5, 3.5, 4))
mynewdata <- data.frame(
  id = rep(x = mydata$id, times = ceiling(x = mydata$n)),
  n = mydata$n[match(x = rep(x = mydata$id, ceiling(mydata$n)), table = mydata$id)],
  t = rep(x = mydata$n / ceiling(mydata$n), times = ceiling(mydata$n))
)
mynewdata$t.accum <- unlist(x = by(data = mynewdata$t, INDICES = mynewdata$id, FUN = cumsum))
We start by creating a data.frame with three columns: id, n, and t. id is built using rep and ceiling to repeat the ID variable the appropriate number of times. n is obtained by using match to look up the right value in mydata$n. t is the ratio of n to ceiling(n), repeated the appropriate number of times (ceiling of n again).
Then, we use cumsum to get the cumulative sum, called using by to allow by-group processing for each group of IDs. You could probably use tapply() here as well.
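Since base ave() also does split-apply-recombine while preserving row order, the cumulative-sum line could equivalently be written as below (an alternative I'm suggesting, not from the original answer):
# per-id cumulative sum, result aligned to the original row order
mynewdata$t.accum <- ave(mynewdata$t, mynewdata$id, FUN = cumsum)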
