I have a data set with 5 different variables (e.g., a, b, c, d, and e). I want to have simple code (preferably with the tidyverse) that allows me to take the mean for each possible combination of the variables. For example, the mean of "ab", "ac", ..., all the way to "abcde". Is there a simple way of doing this?
All I've tried is manually creating the code for each variable. However, it seems like something like a loop would be more appropriate.
For example, if my data looked like this:
a <- rnorm(10, 0, 1)
b <- rnorm(10, 0, 1)
c <- rnorm(10, 0, 1)
d <- rnorm(10, 0, 1)
e <- rnorm(10, 0, 1)
data <- cbind.data.frame(a,b,c,d,e)
I want the data to look like the output as if I had done this for each combination of a, b, c, d, e:
data$ab <- (data$a + data$b)/2
.
.
.
data$abcde <- (data$a + data$b + data$c + data$d + data$e)/5
You can generate the combinations with combn and compute means for each combination with rowMeans:
all.combs <- unlist(lapply(2:ncol(data), function(x) combn(colnames(data), x, simplify = FALSE)), recursive = FALSE)
m <- lapply(all.combs, function(x) rowMeans(data[, x]))
data[sapply(all.combs, paste0, collapse = '')] <- m
# example output
data[, c('ab', 'ac', 'abcde')]
# ab ac abcde
# 1 0.9145668 -0.15422891 0.46534449
# 2 1.0593771 0.36834739 -0.28974715
# 3 0.8504790 0.37486041 0.58032864
# 4 0.8392725 1.67687954 0.62420232
# 5 -0.1612623 -0.31145576 0.06580884
# 6 -0.6140748 -0.05931374 -0.01082605
# 7 0.4424551 0.75504165 0.53706206
# 8 -0.1202238 -0.02772524 0.43865296
# 9 -1.3020701 -0.18290837 -0.61781512
# 10 -0.7414824 -1.56409902 -1.12516693
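The same combn/rowMeans idea can be wrapped in a small helper. This is only a sketch: the function name add_combo_means and its min_size argument are made up for illustration, and it assumes every column of the data frame is numeric.

```r
set.seed(1)
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10),
                 d = rnorm(10), e = rnorm(10))

# Hypothetical helper: append the row-wise mean of every combination
# of at least `min_size` columns, named by pasting the column names.
add_combo_means <- function(df, min_size = 2) {
  vars <- colnames(df)
  combos <- unlist(
    lapply(min_size:length(vars),
           function(k) combn(vars, k, simplify = FALSE)),
    recursive = FALSE
  )
  means <- lapply(combos, function(v) rowMeans(df[, v, drop = FALSE]))
  names(means) <- vapply(combos, paste0, character(1), collapse = "")
  cbind(df, as.data.frame(means))
}

res <- add_combo_means(df)
ncol(res)  # 5 original columns plus sum(choose(5, 2:5)) combinations
```

With 5 variables there are sum(choose(5, 2:5)) = 26 combinations of two or more columns, so the column count grows quickly as variables are added.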
I have a dataframe ("md") containing several variables, one of which is binary ("adopter"). I would like to mean center three of the other (continuous) variables, let's say X, Y, and Z, but only for the rows where adopter = 1. The others, for which adopter = 0, should remain unchanged.
In the end I would like to end up with a new dataframe containing all variables as before, but with the X, Y, and Z for which adopter = 1 being mean centered, while leaving the X, Y, and Z for which adopter = 0 being unchanged.
My dataframe looks like this (117 observations in total):
adopter  X    Y    Z     A     B
0        0.5  2.3  4.5   3     4.7
1        1.5  6.5  -2.3  69.3  -2.5
...      ...  ...  ...
So the new dataframe should contain the mean-centered values of X, Y, and Z for the second row in this example, since adopter = 1 there, and leave the rest unchanged.
I know how to mean center all X, Y, and Z:
md_cen <- md
covs_to_center <- c("X", "Y", "Z")
md_cen[covs_to_center] <- scale(md_cen[covs_to_center],
scale = FALSE)
But I cannot figure out how to restrict this to the rows where adopter == 1. I also tried applying a function:
center_apply <- function(x) {
apply(x, 2, function(y) y - mean(y))}
However, this leaves me again with the mean centered versions for all X, Y, Z, of course, and on top the new dataset contains only those three variables.
Can anyone help me out here, please?
The basic way to accomplish what you're trying to do is to use the split-apply-combine workflow. That is:
Split your data frame up into coherent and useful sub-parts.
Do the thing you want to each sub-part.
Reconstitute the parts into the whole.
First, here's a toy dataset:
covs_to_center <- c("X", "Y", "Z")
set.seed(123)
md <- data.frame(
adopter = sample(0:1, 10, replace = TRUE),
X = rnorm(10, 2, 1),
Y = rnorm(10, 3, 2),
Z = rnorm(10, 5, 10),
A = rnorm(10, 40, 50),
B = rnorm(10, 0, 2)
)
md
## adopter X Y Z A B
## 1 0 3.7150650 6.5738263 -11.866933 74.432013 -2.24621717
## 2 0 2.4609162 3.9957010 13.377870 67.695883 -0.80576967
## 3 0 0.7349388 -0.9332343 6.533731 36.904414 -0.93331071
## 4 1 1.3131471 4.4027118 -6.381369 24.701867 1.55993024
## 5 0 1.5543380 2.0544172 17.538149 20.976450 -0.16673813
## 6 1 3.2240818 0.8643526 9.264642 5.264651 0.50663703
## 7 1 2.3598138 2.5640502 2.049285 29.604136 -0.05709351
## 8 1 2.4007715 0.9479911 13.951257 -23.269818 -0.08574091
## 9 0 2.1106827 1.5422175 13.781335 148.447798 2.73720457
## 10 0 1.4441589 1.7499215 13.215811 100.398100 -0.45154197
A base R solution:
md_base <- data.frame(row_num = 1:nrow(md), md)
# append column of row numbers to make it easier to recombine things later
md_split <- split(md_base, md_base$adopter)
# this is a list of 2 data frames, corresponding to the 2 possible outcomes
# of the adopter variable
md_split$`1`[, covs_to_center] <-
apply(md_split$`1`[, covs_to_center], 2, function(y) y - mean(y))
# grab the data frame that had a 1 in the response column; apply the centering
# function to the correct variables in that data frame
md_new <- do.call(rbind, md_split)
# glue the data frame back together; it will be ordered by adopter
rownames(md_new) <- NULL
# remove row name artifact created by joining
md_new <- md_new[order(md_new$row_num), names(md_new) != "row_num"]
# sort by the row_num column, then drop it
This is pretty clunky, and I'm sure it could be improved. Here's a tidyverse equivalent that produces the same output:
library(tidyverse)
md %>%
group_by(adopter) %>%
mutate(across(all_of(covs_to_center), function(y) y - adopter * mean(y))) %>%
ungroup()
The idea behind this is: group by adopter (much like the split() approach), calculate the mean() of the relevant variables within each group, and then subtract the mean of the subgroup multiplied by the adopter variable (meaning when adopter == 0, nothing will be subtracted).
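For comparison, the same "only where adopter == 1" restriction can be sketched in base R with a logical row index, without splitting and recombining. This is a minimal sketch on made-up data: the fixed adopter pattern and the name is_adopter are assumptions for illustration only.

```r
set.seed(123)
md <- data.frame(
  adopter = rep(0:1, 5),        # assumed toy pattern, not the real data
  X = rnorm(10, 2, 1),
  Y = rnorm(10, 3, 2),
  Z = rnorm(10, 5, 10),
  A = rnorm(10, 40, 50)
)
covs_to_center <- c("X", "Y", "Z")

# Logical index of the rows to centre
is_adopter <- md$adopter == 1

md_cen <- md
# Centre X, Y, Z using the mean of the adopter rows only;
# all other rows and columns are left as they were
md_cen[is_adopter, covs_to_center] <-
  scale(md[is_adopter, covs_to_center], scale = FALSE)

colMeans(md_cen[is_adopter, covs_to_center])  # effectively zero
```

Row order is preserved, so no helper column of row numbers is needed.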
I'm working on a project where I have to apply the same transformation to multiple variables. For example
a <- a + 1
b <- b + 1
d <- d + 1
e <- e + 1
I can obviously perform the operations in sequence using
for (i in c(a, b, d, e)) i <- i + 1
However, I can't actually assign the result to each variable this way, since i is a copy of each variable, not a reference.
Is there a way to do this? Obviously, it'd be easier if the variables were merged in a data.frame or something, but that's not possible.
Usually, if you find yourself doing the same thing to multiple objects, they should be stored and thought of as a single object with sub-components. You say that storing these as a data.frame is not possible, so you can use a list instead. This allows you to use lapply/sapply to apply a function to each element of the list in one step.
a <- c(1, 2, 3)
b <- c(1, 4)
c <- 5
d <- rnorm(10)
e <- runif(5)
lstt <- list(a = a, b = b, c = c, d = d, e = e)
lstt$a
# [1] 1 2 3
lstt <- lapply(lstt, '+', 1)
lstt$a
# [1] 2 3 4
The question states that the variables to increment cannot be in a larger structure, but the comments say that this is not so after all, so we will assume they are in a list L.
L <- list(a = 1, b = 2, d = 3, e = 4) # test data
for(nm in names(L)) L[[nm]] <- L[[nm]] + 1
# or
L <- lapply(L, `+`, 1)
# or
L <- lapply(L, function(x) x + 1)
Scalars
If they are all scalars then they can be put in an ordinary vector:
v <- c(a = 1, b = 2, d = 3, e = 4)
v <- v + 1
Vectors
If they are all vectors of the same length, they can be put in a data frame; if they are also of the same type, they can be put in a matrix. In either case we can simply add 1 to the whole object.
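A small sketch of the equal-length case, using made-up numeric vectors (the values and the names m and df are illustrative only):

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)
d <- c(7, 8, 9)

# Same length and same type: bind into a matrix and add 1 to every cell
m <- cbind(a = a, b = b, d = d)
m <- m + 1

# Same length: a data frame of numeric columns also supports arithmetic
df <- data.frame(a, b, d)
df <- df + 1

m[, "a"]
df$d
```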
Environment
If the variables do have to be free in an environment, and nms is a vector of the variable names, then we can iterate over the names and use them to subscript the environment env. If the names follow some pattern, we may be able to use nms <- ls(pattern = "...", envir = env); if they are the only variables in that environment, we can use nms <- ls(env).
a <- b <- d <- e <- 1 # test data
env <- .GlobalEnv # can change this if not being done in global envir
nms <- c("a", "b", "d", "e")
for(nm in nms) env[[nm]] <- env[[nm]] + 1
a;b;d;e # check
## [1] 2
## [1] 2
## [1] 2
## [1] 2
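The loop over names can also be written without an explicit for, by pulling the variables into a list with mget, transforming the list, and writing it back with list2env. This is a sketch that uses a fresh environment rather than the global one so that nothing in the workspace is overwritten; the variable values are made up.

```r
# A fresh environment holding the free-standing variables (test data)
env <- list2env(list(a = 1, b = 2, d = 3, e = 4))
nms <- c("a", "b", "d", "e")

# Collect the variables into a named list, add 1 to each,
# then write the results back under the same names
vals <- mget(nms, envir = env)
invisible(list2env(lapply(vals, function(x) x + 1), envir = env))

unlist(mget(nms, envir = env))
```

To apply this to the global environment instead, env would be .GlobalEnv, as in the loop above.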
Let's assume we have 4 vectors:
a <- c(200,204,209,215)
b <- c(215,220,235,245)
c <- c(230,236,242,250)
d <- c(240,242,243,267)
I want to create a loop that computes the differential between each pair and then calculates the Z scores for those differentials, i.e. something like scale(d-a). How do I write a loop that goes scale(b-a), then scale(c-a), scale(d-a), and so on? Many thanks.
Single named variables don't lend themselves too well to "looping".
Let's use a list() of vectors instead:
vecs <- list(
a = c(200,204,209,215),
b = c(215,220,235,245),
c = c(230,236,242,250),
d = c(240,242,243,267)
)
This allows us to apply a function to all pairs using combn:
scale_diff <- function(subset) {
z <- scale(subset[[1]] - subset[[2]])
colnames(z) <- paste(names(subset), collapse = " - ")
z
}
z_scores <- combn(vecs, 2, scale_diff, simplify = FALSE)
Now z_scores is a list of 6 matrices (column vectors). The column names show you which vectors were subtracted before scaling.
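Note that scale_diff above subtracts the second vector of each pair from the first (a - b), while the question asks for the reverse (b - a); the two differ only in sign. A sketch of the reversed order, with the results bound into one matrix, might look like this (the name z_mat is made up for illustration):

```r
vecs <- list(
  a = c(200, 204, 209, 215),
  b = c(215, 220, 235, 245),
  c = c(230, 236, 242, 250),
  d = c(240, 242, 243, 267)
)

# Later vector minus earlier vector (e.g. b - a), then z-score
scale_diff <- function(pair) {
  z <- scale(pair[[2]] - pair[[1]])
  colnames(z) <- paste(rev(names(pair)), collapse = " - ")
  z
}

z_scores <- combn(vecs, 2, scale_diff, simplify = FALSE)

# One column per pair, labelled by the subtraction that was done
z_mat <- do.call(cbind, z_scores)
colnames(z_mat)
```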
We can place them in a list and use combn to get the combinations and then apply the difference:
lst1 <- list(a = a, b = b, c = c, d = d)
out <- combn(lst1, 2, FUN = function(x) scale(Reduce(`-`, x))[,1])
colnames(out) <- combn(names(lst1), 2, FUN = paste, collapse='_')
out
# a_b a_c a_d b_c b_d c_d
#[1,] 0.9108601 1.2009612 0.1290994 -0.7643506 -0.753390 -0.2219686
#[2,] 0.7759179 0.2401922 0.3872983 -0.9441978 -0.360317 0.3699477
#[3,] -0.5735045 -0.2401922 0.9036961 0.6744270 1.474024 1.1098432
#[4,] -1.1132735 -1.2009612 -1.4200939 1.0341214 -0.360317 -1.2578222
As @AlexR mentioned in the comments, if the attributes are important, then remove [,1] and keep each result as a one-column matrix:
out <- combn(lst1, 2, FUN = function(x) scale(Reduce(`-`, x)), simplify = FALSE)
I asked this question a while ago (Recode dataframe based on one column) and the answer worked perfectly. Now, however, I want to do almost the reverse. Namely, I have a (700k * 2000) matrix of 0/1/2 or NA. In a separate dataframe I have two columns (Ref and Obs). A 0 corresponds to two instances of Ref, 1 is one instance of Ref and one instance of Obs, and 2 is two instances of Obs. To clarify, here is a data snippet:
Genotype File ---
Ref Obs
A G
T C
G C
Ref <- c("A", "T", "G")
Obs <- c("G", "C", "C")
Current Data---
Sample.1  Sample.2  ....  Sample.2000
0         1               2
0         0               0
0         NA              1
mat <- matrix(nrow=3, ncol=3)
mat[,1] <- c(0,0,0)
mat[,2] <- c(1,0,NA)
mat[,3] <- c(2,0,1)
Desired Data format---
Sample.1  Sample.1  Sample.2  Sample.2  Sample.2000  Sample.2000
A         A         A         G         G            G
T         T         T         T         T            T
G         G         0         0         G            C
I think that's right. The desired data format has two columns (space separated) for each sample. 0 in this format (plink ped file for the bioinformaticians out there) is missing data.
MAJOR ASSUMPTION: your data is in 3 element frames, i.e. you want to apply your mapping to the first 3 rows, then the next 3, and so on, which I think makes sense given DNA frames. If you want a rolling 3 element window this will not work (but code can be modified to make it work). This will work for an arbitrary number of columns, and arbitrary number of 3 row groups:
# Make up a matrix with your properties (4 cols, 6 rows)
col <- 4L
frame <- 3L
mat <- matrix(sample(c(0:2, NA_integer_), 2 * frame * col, replace=TRUE), ncol=col)
# Mapping data
Ref <- c("A", "T", "G")
Obs <- c("G", "C", "C")
map.base <- cbind(Ref, Obs)
num.to.let <- matrix(c(1, 1, 1, 2, 2, 2), byrow=TRUE, ncol=2) # how many from each of ref obs
# Function to map 0,1,2,NA to Ref/Obs
re_map <- function(mat.small) { # 3 row matrices, with col columns
t(
mapply( # iterate through each row in matrix
function(vals, map, num.to.let) {
vals.2 <- unlist(lapply(vals, function(x) map[num.to.let[x + 1L, ]]))
ifelse(is.na(vals.2), 0, vals.2)
},
vals=split(mat.small, row(mat.small)), # a row
map=split(map.base, row(map.base)), # the mapping for that row
MoreArgs=list(num.to.let=num.to.let) # general conversion of number to Obs/Ref
) )
}
# Split input data frame into 3 row matrices (assumes frame size 3),
# and apply mapping function to each group
mat.split <- split.data.frame(mat, sort(rep(1:(nrow(mat) / frame), frame)))
mat.res <- do.call(rbind, lapply(mat.split, re_map))
colnames(mat.res) <- paste0("Sample.", rep(1:ncol(mat), each=2))
print(mat.res, quote=FALSE)
# Sample.1 Sample.1 Sample.2 Sample.2 Sample.3 Sample.3 Sample.4 Sample.4
# 1 G G A G G G G G
# 2 C C 0 0 T C T C
# 3 0 0 G C G G G G
# 1 A A A A A G A A
# 2 C C C C T C C C
# 3 C C G G 0 0 0 0
I am not sure but this could be what you need:
First, some simple data:
geno <- data.frame(Ref = c("A", "T", "G"), Obs = c("G", "C", "C"))
data <- data.frame(s1 = c(0,0,0),s2 = c(1, 0, NA))
then a couple of functions:
f <- function(i , x, geno){
x <- x[i]
if(!is.na(x)){
if (x == 0) {y <- geno[i , c(1,1)]}
if (x == 1) {y <- geno[i, c(1,2)]}
if (x == 2) {y <- geno[i, c(2,2)]}
}
else y <- c(0,0)
names(y) <- c("s1", "s2")
y
}
g <- function(x, geno){
Reduce(rbind, lapply(1:length(x), FUN = f , x = x, geno = geno))
}
The way f() is defined may not be the most elegant but it does the job
Then simply run it as a double for loop, written in lapply fashion:
as.data.frame(Reduce(cbind, lapply(data , g , geno = geno )))
hope it helps
Here's one way based on the sample data in your answer:
# create index
idx <- lapply(data, function(x) cbind((x > 1) + 1, (x > 0) + 1))
# list of matrices
lst <- lapply(idx, function(x) {
tmp <- apply(x, 2, function(y) geno[cbind(seq_along(y), y)])
replace(tmp, is.na(tmp), 0)
})
# one data frame
as.data.frame(lst)
# s1.1 s1.2 s2.1 s2.2
# 1 A A A G
# 2 T T T T
# 3 G G 0 0
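The 0/1/2/NA rule from the question can also be sketched with vectorized ifelse calls on the whole matrix. This assumes one Ref/Obs pair per row of the matrix, as in the question's snippet; the sample names and the object names allele1, allele2 and ped are made up for illustration.

```r
Ref <- c("A", "T", "G")
Obs <- c("G", "C", "C")
mat <- cbind(Sample.1 = c(0, 0, 0),
             Sample.2 = c(1, 0, NA),
             Sample.3 = c(2, 0, 1))

# Ref and Obs are recycled down each column, so element i lines up with row i.
# First allele is Obs only for genotype 2; second allele is Obs for 1 or 2.
allele1 <- ifelse(mat == 2, Obs, Ref)
allele2 <- ifelse(mat >= 1, Obs, Ref)

# Interleave the two alleles: two output columns per sample
ped <- matrix("0", nrow(mat), 2 * ncol(mat))
ped[, seq(1, by = 2, length.out = ncol(mat))] <- allele1
ped[, seq(2, by = 2, length.out = ncol(mat))] <- allele2
ped[is.na(ped)] <- "0"        # missing genotypes are coded as 0
colnames(ped) <- rep(colnames(mat), each = 2)

print(ped, quote = FALSE)
```

Because everything is done with whole-matrix operations, there is no per-cell function call, which may matter for a matrix of the size described in the question.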
Friends
I'm trying to set up a matrix or data.frame for a canonical correlation analysis. The original dataset has a column designating one of x conditions and subsequent columns of explanatory variables. I need to set up an array that sets an indicator variable for each condition "x". E.g., the columns in df are:
ID cond task1 taskN
A, x, 12, 14
B, x, 13, 17
C, y, 11, 10
D, z, 10, 13
here "cond" can be x,y,z,... (can vary, so I don't know how many). This needs to go to:
ID, x, y, z, task1, taskN
A, 1, 0, 0, 12, 14
B, 1, 0, 0, 13, 17
C, 0, 1, 0, 11, 10
D, 0, 0, 1, 10, 13
So, I can set up the indicators in an array
iv<-as.data.frame(array(,c(nrow(df),length(levels(cond)))))
and then cbind this to df, but I can't figure out how to go into the array and set the appropriate indicator to "1" and the rest to "0".
Any suggestions?
Thanks
Jon
If you code cond as a factor, you can get R to do the expansion you want via model.matrix. The only complication is that to get the coding you chose (dummy variables coding, or sum contrasts in R) we need to change the default contrasts used by R's model formula code.
## data
dat <- data.frame(ID = LETTERS[1:4], cond = factor(c("x","x","y","z")),
task1 = c(12,13,11,10), taskN = c(14,17,10,13))
dat
## We get R to produce the dummy variables for us,
## but your coding needs the contr.sum contrasts
op <- options(contrasts = c("contr.sum","contr.poly"))
dat2 <- data.frame(ID = dat$ID, model.matrix(ID ~ . - 1, data = dat))
## Levels of cond
lev <- with(dat, levels(cond))
## fix-up the names
names(dat2)[2:(1+length(lev))] <- lev
dat2
## reset contrasts
options(op)
This gives us:
> dat2
ID x y z task1 taskN
1 A 1 0 0 12 14
2 B 1 0 0 13 17
3 C 0 1 0 11 10
4 D 0 0 1 10 13
This should scale automatically as the number of levels in cond changes/increases.
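As a quick check of how this scales, here is a sketch with a fourth level added. It uses a formula containing only cond with the intercept removed, in which case model.matrix returns one indicator column per level of that factor. The extra row and the level "w" are made up for illustration.

```r
dat <- data.frame(ID = LETTERS[1:5],
                  cond = factor(c("x", "x", "y", "z", "w")),
                  task1 = c(12, 13, 11, 10, 9))

# One 0/1 column per level of cond (no intercept column)
ind <- model.matrix(~ cond - 1, data = dat)
colnames(ind) <- levels(dat$cond)

dat2 <- data.frame(ID = dat$ID, ind, task1 = dat$task1)
dat2
```

The indicator columns follow the order of levels(dat$cond), which is alphabetical by default, so "w" comes first here.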
HTH
Another alternative is to use cast in the reshape package:
library(reshape)
l <- length(levels(dat$cond))
dat2 <- merge(cast(dat,ID~cond),dat)[,c(1:(l+1),(l+3):(ncol(dat)+l))]
dat2[,2:(1+l)] <- !is.na(dat2[,2:(1+l)])
This gives you logical values rather than 0 and 1 though:
> dat2
ID x y z task1 taskN
1 A TRUE FALSE FALSE 12 14
2 B TRUE FALSE FALSE 13 17
3 C FALSE TRUE FALSE 11 10
4 D FALSE FALSE TRUE 10 13
That's cool using model.matrix for this. (reshape too.) Always learning something here. A couple more ideas:
indicator1 <- function(groupStrings) {
groupFactors <- factor(groupStrings)
colNames <- levels(groupFactors)
bits <- matrix(0, nrow=length(groupStrings), ncol=length(colNames))
bits[matrix(c(1:length(groupStrings),
unclass(groupFactors)), ncol=2)] <- 1
setNames(as.data.frame(bits), colNames)
}
indicator2 <- function(groupStrings) {
colNames <- unique(groupStrings)
bits <- outer(groupStrings, colNames, "==")
setNames(as.data.frame(bits * 1), colNames)
}
Used as follows
d <- data.frame(cond=c("a", "a", "b"))
d <- cbind(d, indicator2(as.character(d$cond)))
Again, a great example of the greatness of open-source! Thanks so much for your help. The initial solution seemed to work best for me. In case someone else might be interested, here is how I implemented this with my (very large) dataset:
# Load needed libraries if not already so
if (!"package:moments" %in% search()) library(moments)
# Initialize dataframes. DEFINE THE workspace SUBSET TO ANALYZE HERE
df<-stroke
# Make any necessary modifications to the df
df$TrDif <- df$TrBt-df$TrAt
# 0) Set up indicator variables (iv) from the factor you choose.
op <- options(contrasts = c("contr.sum","contr.poly"))
dat<-subset(df,select=c("newcat"))
iv<-data.frame(model.matrix(~.-1,data=dat))
names(iv) <- levels(dat$newcat)
lbl<-levels(dat$newcat) # need this for plot functions below
# Select task variables with n > 1150 to be regressed (THIS CAN PROBABLY BE DONE MORE ELEGANTLY).
taskarr<-subset(df, select=c("B20","B40","FW","Anim","TrAt","TrBt","TrBerr","TrDif","Snod15","tt","GEMS","Clock3","orient","Wlenc","wlfr","wlcr","wlrec","Snod15Rec","GEMSfr"))
## 1) evaluate covariance matrix and extract sub-matrices
## Caution: Covariance samples differ due to missing values.
sig <- cov(cbind(iv,taskarr),use="pairwise.complete.obs")