I've got a dataset with four binary columns. How do I create a 4 x 4 grid with the tally of each pairwise combination of the variables?
Here's an example data frame:
Person <- c("Bob", "Jim", "Sarah", "Dave")
A <- c(1,0,1,1)
B <- c(1,1,1,0)
C <- c(0,0,0,1)
D <- c(1,0,0,0)
So in the 4x4 grid, the intersection of A and B would have a 2 because Bob and Sarah have 1 for A and B.
For two 0/1 vectors A and B, the tally is their inner (dot) product:
res <- A %*% B
or
res <- crossprod(A, B)
To build the matrix of all combinations, use a two-level for loop (or apply):
data <- list(A = A, B = B, C = C, D = D)
n <- length(data)
res <- matrix(NA, nrow = n, ncol = n, dimnames = list(names(data), names(data)))
for (i in 1:n) {
  for (j in 1:i) {
    res[i, j] <- crossprod(data[[i]], data[[j]])
  }
}
Here I fill only one half of the matrix. You then can copy the values across like this:
res[upper.tri(res)] <- t(res)[upper.tri(res)]
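Because the inner product of two 0/1 vectors counts the rows where both are 1, the whole grid can also be obtained in a single call by binding the vectors into a matrix. A minimal sketch, assuming the four vectors from the question (the name `m` is only for illustration):

```r
A <- c(1, 0, 1, 1)
B <- c(1, 1, 1, 0)
C <- c(0, 0, 0, 1)
D <- c(1, 0, 0, 0)

# Bind the vectors as named columns: one row per person, one column per variable
m <- cbind(A = A, B = B, C = C, D = D)

# crossprod(m) computes t(m) %*% m, so entry [i, j] counts the rows
# where column i and column j are both 1
res <- crossprod(m)
res
```

The diagonal then holds each variable's own total, and no loop or triangle copying is needed.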
Script:
a <- c(10, 20)
b <- c(100, 200)
c <- c(50 , 1000)
d <- c(3000, 4300)
for (i in c(a,b,c,d))
{
print(prop.test(a,b))
}
So essentially I want every 2 objects to be paired up. I hope I am somewhat clear.
You can put the vectors in a list and use a for loop as follows:
list_data <- list(a, b, c, d)
result <- vector('list', length(list_data)/2)
for(i in seq_along(result)) {
  n <- (i - 1) * 2 + 1
  result[[i]] <- prop.test(list_data[[n]], list_data[[n+1]])
  print(result[[i]])
}
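The same pairing can be written without index arithmetic by splitting the list into consecutive pairs first. This is only a sketch under the same assumption as the loop above (an even number of vectors, taken two at a time); `pair_id` and `pairs` are names made up for the example:

```r
a <- c(10, 20)
b <- c(100, 200)
c <- c(50, 1000)
d <- c(3000, 4300)

list_data <- list(a, b, c, d)

# Group consecutive elements into pairs: (a, b) and (c, d)
pair_id <- ceiling(seq_along(list_data) / 2)
pairs <- split(list_data, pair_id)

# Run prop.test on each pair: first element = counts of successes,
# second element = counts of trials
result <- lapply(pairs, function(p) prop.test(p[[1]], p[[2]]))
result
```

Each element of `result` is an ordinary `htest` object, so `result[[1]]$p.value` and similar fields can be extracted afterwards.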
I am trying to fit a Poisson regression model in R to a dataset in which two data frame columns are lists holding vectors of different lengths, like so:
test <- data.frame(a = 1:10, b = rnorm(10))
test$c <- list(length = nrow(test))
test$d <- list(length = nrow(test))
for(i in 1:nrow(test)) {
test$c[[i]] <- LETTERS[1:sample(10:11, 1)]
test$d[[i]] <- LETTERS[1:sample(10:11, 1)]
}
I need to build a model to predict a from b and the vectors c and d. As it is not possible to pass lists to a glm, I tried unlisting c and d to feed them into the model, but this just ends up creating one long vector for both c and d, meaning I get this error:
m0.glm <- glm(a ~ b + unlist(c) + unlist(d), data = test)
Error in model.frame.default(formula = a ~ b + unlist(c) + unlist(d), :
variable lengths differ (found for 'unlist(c)')
I feel like there will be a simple solution that I am missing to my problem, but I have not had to attempt to pass a list of vectors to a model before.
Thanks in advance.
If the problem is to create a df out of lists, then:
test <- data.frame(a = 1:10, b = rnorm(10))
test$c <- vector("list", nrow(test))  # one empty list slot per row
test$d <- vector("list", nrow(test))
for(i in 1:nrow(test)) {
  test$c[[i]] <- LETTERS[1:sample(10:11, 1)]
  test$d[[i]] <- LETTERS[1:sample(10:11, 1)]
}
# Pad every list element with NA up to the longest length, then bind the rows into a matrix
do.call(rbind, lapply(test$c, function(x) {
res <- rep(NA, max(vapply(test$c, length, integer(1))))
res[1:length(x)] <- x
res
})) -> test_c_df
do.call(rbind, lapply(test$d, function(x) {
res <- rep(NA, max(vapply(test$d, length, integer(1))))
res[1:length(x)] <- x
res
})) -> test_d_df
test_new <- cbind(test[c("a", "b")], test_c_df, test_d_df)
names(test_new) <- make.unique(names(test_new))
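# Caution: in this toy data most padded columns hold a single repeated letter,
# so glm() may reject them as one-level factors; check that the data is reasonable
m0.glm <- glm(a ~ ., data = test_new)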
I have a matrix of species occurring in sites and I want to compute the following formula for each pair ab of species:
$$C_{ab} = \frac{(R_a - S)(R_b - S)}{R_a R_b}$$
where Ra and Rb are the numbers of occurrences of species a and b respectively, and S is the number of sites where a and b co-occur.
So far, I have this solution which is very slow (actually way too slow for my matrix):
set.seed(1)
# Example of binary matrix with sites in rows and species in columns
mat <- matrix(runif(200), ncol = 20)
mat_bin <- mat
mat_bin[mat_bin > 0.5] <- 1
mat_bin[mat_bin <= 0.5] <- 0
rownames(mat_bin) <- paste0("site_", seq(1:nrow(mat_bin)))
colnames(mat_bin) <- paste0("sp_", seq(1:ncol(mat_bin)))
# Number of occurrences for every species
nbocc <- colSums(mat_bin)
# Number of cooccurrences between species
S <- crossprod(mat_bin)
diag(S) <- 0
# Data frame with all the pair combinations
comb <- data.frame(t(combn(colnames(mat_bin), 2)))
colnames(comb) <- c("sp1", "sp2")
comb$Cscore <- 0
# Slow for_loop to compute the Cscore of each pair
for(i in 1:nrow(comb)){
num <- (nbocc[[comb[i, "sp1"]]] - S[comb[i, "sp1"], comb[i, "sp2"]]) *
(nbocc[[comb[i, "sp2"]]] - S[comb[i, "sp1"], comb[i, "sp2"]])
denom <- nbocc[[comb[i, "sp1"]]] * nbocc[[comb[i, "sp2"]]]
comb[i, "Cscore"] <- num/denom
}
A first solution could be to parallelize the for-loop, but maybe a more optimized solution exists.
As you already did for S, you can do the full calculation in a vectorized manner on matrices.
It would look as follows:
set.seed(1)
# Example of binary matrix with sites in rows and species in columns
mat <- matrix(runif(200), ncol = 20)
mat_bin <- mat
mat_bin[mat_bin > 0.5] <- 1
mat_bin[mat_bin <= 0.5] <- 0
rownames(mat_bin) <- paste0("site_", seq(1:nrow(mat_bin)))
colnames(mat_bin) <- paste0("sp_", seq(1:ncol(mat_bin)))
# Number of occurrences for every species
nbocc <- colSums(mat_bin)
# Number of cooccurrences between species
S <- crossprod(mat_bin)
resMat <- (nbocc - S) * t(nbocc - S) /
outer(nbocc, nbocc, `*`)
# Each pair appears twice in resMat, so keep only the lower triangle
# (its order matches the pair order returned by combn())
resMat[lower.tri(resMat)]
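To convince yourself that the matrix expression matches the pair-by-pair formula, the two can be compared on a tiny example. The sketch below uses a hand-made 4 x 3 matrix rather than the random one above (so no species has zero occurrences), and `vec_scores` / `loop_scores` are names chosen for the illustration:

```r
# Small hand-made example: 4 sites (rows) x 3 species (columns)
mat_bin <- matrix(c(1, 1, 0, 1,
                    1, 0, 1, 1,
                    0, 1, 1, 0),
                  ncol = 3,
                  dimnames = list(paste0("site_", 1:4), paste0("sp_", 1:3)))

nbocc <- colSums(mat_bin)   # occurrences per species
S <- crossprod(mat_bin)     # co-occurrences per pair of species

# Vectorized version: (Ra - S) * (Rb - S) / (Ra * Rb) for all pairs at once
resMat <- (nbocc - S) * t(nbocc - S) / outer(nbocc, nbocc, `*`)
vec_scores <- resMat[lower.tri(resMat)]

# Pair-by-pair reference, in the same order as combn()
pairs <- combn(colnames(mat_bin), 2)
loop_scores <- apply(pairs, 2, function(p) {
  s <- S[p[1], p[2]]
  (nbocc[[p[1]]] - s) * (nbocc[[p[2]]] - s) / (nbocc[[p[1]]] * nbocc[[p[2]]])
})

all.equal(vec_scores, loop_scores)
```

The transpose works because S is symmetric: `nbocc - S` subtracts along rows (giving Ra - S), and its transpose supplies Rb - S for the same cell.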
I have the following problem, where I have a dataframe (df):
df <- data.frame(inp = c("inp1", "inp2", "inp3"), A = c(1,2,3), B = c(1,2,3))
I need to construct an inp x inp square matrix from this dataframe, following specific formulas for the diagonal and off-diagonal elements.
The diagonal elements are calculated as M[i,i] = A[i]^2 + B[i] and the off-diagonal elements as M[i,j] = A[i]*A[j], where i and j range over (inp1, inp2, inp3).
This is what I've got thus far - the function for calculating the off-diagonal values still escapes me.
matFun <- function(df){
x <- matrix(,
nrow = nrow(df),
ncol = nrow(df),
dimnames = list(df$inp, df$inp))
#funOffDiag <- ???
funDiag <- function(A,B){A^2 + B}
d <- apply(df[c("A","B")], 1, function(y) funDiag(y["A"],y["B"]))
diag(x) <- d
x
}
matFun(df)
I need this solution as a function because I have to apply it to a longish list of dataframes.
df <- data.frame(inp = c("inp1", "inp2", "inp3"), A = c(1,2,3), B = c(1,2,3))
mat <- tcrossprod(df$A)
colnames(mat) <- rownames(mat) <- df$inp
diag(mat) <- diag(mat) + df$B
# inp1 inp2 inp3
#inp1 2 2 3
#inp2 2 6 6
#inp3 3 6 12
You should be able to create a function from this yourself ...
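For completeness, one possible wrapper is sketched below. It reuses the `matFun` name from the question; the list `df_list` and its second data frame are made up purely to show the `lapply` call, and the columns are assumed to be named `inp`, `A` and `B` as in the example:

```r
matFun <- function(df) {
  mat <- tcrossprod(df$A)          # entry [i, j] is A[i] * A[j]; diagonal is A[i]^2
  diag(mat) <- diag(mat) + df$B    # diagonal becomes A[i]^2 + B[i]
  # as.character() guards against inp being stored as a factor
  dimnames(mat) <- list(as.character(df$inp), as.character(df$inp))
  mat
}

df <- data.frame(inp = c("inp1", "inp2", "inp3"), A = c(1, 2, 3), B = c(1, 2, 3))

# Hypothetical list of data frames; the second one is invented for the example
df_list <- list(first = df, second = transform(df, B = B * 10))
mats <- lapply(df_list, matFun)
mats$first
```

`lapply` returns one matrix per data frame, under the same names as the input list.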
(Very) amateur coder and statistician working on a problem in R.
I have four integer lists: A, B, C, D.
A <- 1:133
B <- 1:266
C <- 1:266
D <- c(1:133, 267:400)
I want R to generate all of the permutations from picking 1 item from each of these lists (I know this code will take forever to run), and then take the mean of each of those permutations. So, for instance, [1, 100, 200, 400] -> 175.25.
Ideally what I would have at the end is a list of all of these means then.
Any ideas?
Here's how I'd do this for a smaller but similar problem:
A <- 1:13
B <- 1:26
C <- 1:26
D <- c(1:13, 27:40)
mymat <- expand.grid(A, B, C, D)
names(mymat) <- c("A", "B", "C", "D")
mymat <- as.matrix(mymat)
mymeans <- rowSums(mymat)/4
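If only the means are needed, the four-column grid does not have to be built at all: nested `outer()` calls give every sum A + B + C + D directly. This is a sketch on the same reduced problem, with `means_outer` and `means_grid` as illustrative names; it is a swapped-in alternative, not part of the approach above:

```r
A <- 1:13
B <- 1:26
C <- 1:26
D <- c(1:13, 27:40)

# Nested outer() sums build a 4-dimensional array holding A + B + C + D
# for every combination, without materialising the four-column grid
sums <- outer(outer(outer(A, B, "+"), C, "+"), D, "+")
means_outer <- as.vector(sums) / 4

# Reference: the expand.grid() route, whose rows come in the same order
grid <- as.matrix(expand.grid(A = A, B = B, C = C, D = D))
means_grid <- unname(rowSums(grid) / 4)

all.equal(means_outer, means_grid)
length(means_outer)
```

This stores one value per combination instead of four, but the full-size problem would still produce far too many values to hold in memory at once.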
You'll probably crash R if you just up all the indices, but you could probably set up a loop, something like this (not tested):
B <- 1:266
C <- 1:266
D <- c(1:133, 267:400)
for(A in 1:133) {
  mymat <- expand.grid(A, B, C, D)
  names(mymat) <- c("A", "B", "C", "D")
  mymat <- as.matrix(mymat)
  mymeans <- rowSums(mymat)/4
  write.table(mymat, file = paste("matrix", A, "txt", sep = "."))
  write.table(mymeans, file = paste("means", A, "txt", sep = "."))
  rm(mymat, mymeans)
}
to get them all. That still might be too big, in which case you could do a nested loop, or loop over D (since it's the biggest)
Alternatively,
n <- 1e7
A <- sample(133, size = n, replace= TRUE)
B <- sample(266, size = n, replace= TRUE)
C <- sample(266, size = n, replace= TRUE)
D <- sample(x = c(1:133, 267:400), size = n, replace= TRUE)
mymeans <- (A+B+C+D)/4
will give you a large sample of the means and take no time at all.
hist(mymeans)
Even creating a vector of means as large as the number of permutations will likely exhaust your memory: 133 x 266 x 266 x 267 is about 2.5 billion values, or roughly 20 GB as doubles. You will have to split this into smaller problems. Look up how to write objects to Excel and how to remove objects from memory (both are covered on SO).
As for the code to do this, I've tried to keep it as simple as possible so that it's easy to 'grow' your knowledge:
#this is how to create vectors of sequential integers in R
a <- c(1:33)
b <- c(1:33)
c <- c(1:33)
d <- c(1:33,267:300)
#this is how to create an empty vector
means <- rep(NA,length(a)*length(b)*length(c)*length(d))
#set up for a loop
i <- 1
#how you run a loop to perform this operation
for(j in 1:length(a)){
  for(k in 1:length(b)){
    for(l in 1:length(c)){
      for(m in 1:length(d)){
        y <- c(a[j],b[k],c[l],d[m])
        means[i] <- mean(y)
        i <- i+1
      }
    }
  }
}
#and to graph your output
hist(means, col='brown')
#let's put a mean line through the histogram
abline(v=mean(means), col='white', lwd=2)