Friends
I'm trying to set up a matrix or data.frame for a canonical correlation analysis. The original dataset has a column designating one of several conditions and subsequent columns of explanatory variables. I need to set up an array with an indicator variable for each condition "x". E.g., the columns in df are:
ID, cond, task1, taskN
A, x, 12, 14
B, x, 13, 17
C, y, 11, 10
D, z, 10, 13
here "cond" can be x,y,z,... (can vary, so I don't know how many). This needs to go to:
ID, x, y, z, task1, taskN
A, 1, 0, 0, 12, 14
B, 1, 0, 0, 13, 17
C, 0, 1, 0, 11, 10
D, 0, 0, 1, 10, 13
So, I can set up the indicators in an array
iv <- as.data.frame(array(NA, c(nrow(df), length(levels(cond)))))
and then cbind this to df, but I can't figure out how to go into the array and set the appropriate indicator to "1" and the rest to "0".
Any suggestions?
Thanks
Jon
If you code cond as a factor, you can get R to do the expansion you want via model.matrix. The only complication is that to get the coding you chose (dummy-variable coding, or sum contrasts in R) we need to change the default contrasts used by R's model formula code.
## data
dat <- data.frame(ID = LETTERS[1:4], cond = factor(c("x","x","y","z")),
                  task1 = c(12,13,11,10), taskN = c(14,17,10,13))
dat
## We get R to produce the dummy variables for us,
## but your coding needs the contr.sum contrasts
op <- options(contrasts = c("contr.sum","contr.poly"))
dat2 <- data.frame(ID = dat$ID, model.matrix(ID ~ . - 1, data = dat))
## Levels of cond
lev <- with(dat, levels(cond))
## fix-up the names
names(dat2)[2:(1+length(lev))] <- lev
dat2
## reset contrasts
options(op)
This gives us:
> dat2
ID x y z task1 taskN
1 A 1 0 0 12 14
2 B 1 0 0 13 17
3 C 0 1 0 11 10
4 D 0 0 1 10 13
This should scale automatically as the number of levels in cond changes/increases.
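For instance (a quick sketch along the same lines, not part of the original answer), a fourth level simply produces a fourth indicator column:
dat3 <- data.frame(ID = LETTERS[1:5], cond = factor(c("x","x","y","z","w")),
                   task1 = c(12,13,11,10,9), taskN = c(14,17,10,13,8))
## rerunning the model.matrix() block above on dat3 should yield
## indicator columns w, x, y, z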
HTH
Another alternative is to use cast in the reshape package:
library(reshape)
l <- length(levels(dat$cond))
dat2 <- merge(cast(dat, ID~cond), dat)[, c(1:(l+1), (l+3):(ncol(dat)+l))]
dat2[,2:(1+l)] <- !is.na(dat2[,2:(1+l)])
This gives you logical values rather than 0 and 1 though:
> dat2
ID x y z task1 taskN
1 A TRUE FALSE FALSE 12 14
2 B TRUE FALSE FALSE 13 17
3 C FALSE TRUE FALSE 11 10
4 D FALSE FALSE TRUE 10 13
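If you'd rather have 0/1 than TRUE/FALSE, multiplying by 1 (or wrapping in as.integer) converts the logical columns, e.g.:
dat2[, 2:(1+l)] <- dat2[, 2:(1+l)] * 1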
That's cool using model.matrix for this. (reshape too.) Always learning something here. A couple more ideas:
indicator1 <- function(groupStrings) {
  groupFactors <- factor(groupStrings)
  colNames <- levels(groupFactors)
  bits <- matrix(0, nrow=length(groupStrings), ncol=length(colNames))
  # index the matrix with a two-column (row, column) matrix:
  # row i gets a 1 in the column given by the integer code of its group
  bits[matrix(c(1:length(groupStrings),
                unclass(groupFactors)), ncol=2)] <- 1
  setNames(as.data.frame(bits), colNames)
}
indicator2 <- function(groupStrings) {
  colNames <- unique(groupStrings)
  # compare every string against every group name; "* 1" turns TRUE/FALSE into 1/0
  bits <- outer(groupStrings, colNames, "==")
  setNames(as.data.frame(bits * 1), colNames)
}
Used as follows
d <- data.frame(cond=c("a", "a", "b"))
d <- cbind(d, indicator2(as.character(d$cond)))
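For that toy input, d should come out as:
> d
  cond a b
1    a 1 0
2    a 1 0
3    b 0 1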
Again, a great example of the greatness of open-source! Thanks so much for your help. The initial solution seemed to work best for me. In case someone else might be interested, here is how I implemented this with my (very large) dataset:
# Load needed libraries if not already so
if("packages:sciplot" %in% search()) next else library(moments)
# Initialize dataframes. DEFINE THE workspace SUBSET TO ANALYZE HERE
df<-stroke
# Make any necessary modifications to the df
df$TrDif <- df$TrBt-df$TrAt
# 0) Set up indicator variables (iv) from the factor you choose.
op <- options(contrasts = c("contr.sum","contr.poly"))
dat<-subset(df,select=c("newcat"))
iv<-data.frame(model.matrix(~.-1,data=dat))
names(iv) <- levels(dat$newcat)
lbl<-levels(dat$newcat) # need this for plot functions below
# Select task variables with n > 1150 to be regressed (THIS CAN PROBABLY BE DONE MORE ELEGANTLY).
taskarr<-subset(df, select=c("B20","B40","FW","Anim","TrAt","TrBt","TrBerr","TrDif","Snod15","tt","GEMS","Clock3","orient","Wlenc","wlfr","wlcr","wlrec","Snod15Rec","GEMSfr"))
## 1) evaluate covariance matrix and extract sub-matrices
## Caution: Covariance samples differ due to missing values.
sig <- cov(cbind(iv,taskarr),use="pairwise.complete.obs")
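(In case it helps anyone reading later: the sub-matrix extraction that the comment above alludes to might look like the following sketch; p, q, S11, S22, and S12 are my own hypothetical names, not part of the original code.)
p <- ncol(iv)       # number of indicator columns
q <- ncol(taskarr)  # number of task columns
S11 <- sig[1:p, 1:p]                    # covariance of indicators
S22 <- sig[(p+1):(p+q), (p+1):(p+q)]    # covariance of tasks
S12 <- sig[1:p, (p+1):(p+q)]            # cross-covariance block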
I have a dataframe ("md") containing several variables, of which one is binary ("adopter"). I would like to mean-center three of the other (continuous) variables, let's say X, Y, and Z, but only for the ones where adopter = 1. The others, for which adopter = 0, should remain unchanged.
In the end I would like a new dataframe containing all variables as before, but with the X, Y, and Z for which adopter = 1 mean-centered, and the X, Y, and Z for which adopter = 0 unchanged.
My dataframe looks like this (117 observations in total):
adopter    X    Y     Z     A     B
      0  0.5  2.3   4.5     3   4.7
      1  1.5  6.5  -2.3  69.3  -2.5
    ...  ...  ...   ...   ...   ...
So in the new dataframe the X, Y, and Z of the second row in this example should be mean-centered (since adopter = 1), and the rest left unchanged.
I know how to mean center all X, Y, and Z:
md_cen <- md
covs_to_center <- c("X", "Y", "Z")
md_cen[covs_to_center] <- scale(md_cen[covs_to_center],
scale = FALSE)
But I cannot figure out how to get the "only if adopter == 1" condition into it. I also tried applying a function:
center_apply <- function(x) {
  apply(x, 2, function(y) y - mean(y))
}
However, this again gives me the mean-centered versions of all X, Y, and Z, and on top of that the new dataset contains only those three variables.
Can anyone help me out here, please?
The basic way to accomplish what you're trying to do is to use the split-apply-combine workflow. That is:
Split your data frame up into coherent and useful sub-parts.
Do the thing you want to each sub-part.
Reconstitute the parts into the whole.
First, here's a toy dataset:
covs_to_center <- c("X", "Y", "Z")
set.seed(123)
md <- data.frame(
adopter = sample(0:1, 10, replace = T),
X = rnorm(10, 2, 1),
Y = rnorm(10, 3, 2),
Z = rnorm(10, 5, 10),
A = rnorm(10, 40, 50),
B = rnorm(10, 0, 2)
)
md
## adopter X Y Z A B
## 1 0 3.7150650 6.5738263 -11.866933 74.432013 -2.24621717
## 2 0 2.4609162 3.9957010 13.377870 67.695883 -0.80576967
## 3 0 0.7349388 -0.9332343 6.533731 36.904414 -0.93331071
## 4 1 1.3131471 4.4027118 -6.381369 24.701867 1.55993024
## 5 0 1.5543380 2.0544172 17.538149 20.976450 -0.16673813
## 6 1 3.2240818 0.8643526 9.264642 5.264651 0.50663703
## 7 1 2.3598138 2.5640502 2.049285 29.604136 -0.05709351
## 8 1 2.4007715 0.9479911 13.951257 -23.269818 -0.08574091
## 9 0 2.1106827 1.5422175 13.781335 148.447798 2.73720457
## 10 0 1.4441589 1.7499215 13.215811 100.398100 -0.45154197
A base R solution:
md_base <- data.frame(row_num = 1:nrow(md), md)
# append column of row numbers to make it easier to recombine things later
md_split <- split(md_base, md_base$adopter)
# this is a list of 2 data frames, corresponding to the 2 possible outcomes
# of the adopter variable
md_split$`1`[, covs_to_center] <-
apply(md_split$`1`[, covs_to_center], 2, function(y) y - mean(y))
# grab the data frame that had a 1 in the response column; apply the centering
# function to the correct variables in that data frame
md_new <- do.call(rbind, md_split)
# glue the data frame back together; it will be ordered by adopter
rownames(md_new) <- NULL
# remove row name artifact created by joining
md_new <- md_new[order(md_new$row_num), names(md_new) != "row_num"]
# sort by the row_num column, then drop it
This is pretty clunky, and I'm sure it could be improved. Here's a tidyverse equivalent that produces the same output:
library(tidyverse)
md %>%
  group_by(adopter) %>%
  mutate(across(all_of(covs_to_center), function(y) y - adopter * mean(y))) %>%
  ungroup()
The idea behind this is: group by adopter (much like the split() approach), calculate the mean() of the relevant variables within each group, and then subtract the adopter variable multiplied by that group mean (so when adopter == 0, nothing is subtracted).
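If the grouping trick feels too clever, here is an equivalent sketch (untested) that keeps everything in one ungrouped mutate():
md %>%
  mutate(across(all_of(covs_to_center),
                ~ ifelse(adopter == 1, .x - mean(.x[adopter == 1]), .x)))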
I am running a Monte Carlo simulation of a multinomial logit, so I have a function that generates the data and estimates the model. I want to generate different datasets over a grid of values, varying both the number of individuals (n.indiv) and the number of answers by each individual (n.choices).
So far I have managed to solve it, but at some point I ended up with a nested for-loop running over a grid of the possible values for the number of individuals (n.indiv_list) and the number of answers by each individual (n.choices_list). I am worried about the efficiency of this double for-loop over all the combinations; probably there is a vectorized way to do it that I am missing (or maybe not?).
Finally, and this is mostly a matter of style, I ended up with multiple objects holding the models from the grid-search combinations, with informative names, but it would be great to collapse them into a single list; with the current structure I am not sure how to do that. Thank you in advance!
1) Function that generates data and estimates the model.
library(dplyr)
library(VGAM)
library(mlogit)
#function that generates the data and estimates the model.
mlogit_sim_data <- function(...){
# generating number of (n.alter) X (n.choices)
df <- data.frame(id= rep(seq(1,n.choices ),n.alter ))
# id per individual
df <- df %>%
group_by(id) %>%
mutate(altern = sequence(n()))%>%
arrange(id)
#Repeated scheme for each individual + id_ind
df <- cbind(df[rep(1:nrow(df), n.indiv), ], id_ind = rep(1:n.indiv, each = nrow(df)))
## creating attributes
df<- df %>%
mutate(
x1=rlnorm(n.indiv*n.alter),
x2=rlnorm(n.indiv*n.alter),
)%>%
group_by(altern) %>%
mutate(
id_choice = sequence(n()))%>%
group_by(id_ind) %>%
mutate(
z1 = rpois(1,lambda = 25),
z2 = rlnorm(1,meanlog = 5, sdlog = 0.5),
z3 = ifelse(runif(1, min = 0 , max = 1) > 0.5 , 1 , 0)
)
# Observed utility
df$V1 <- with(df, b1 * x1 + b2 * x2 )
#### Generate Response Variable ####
fn_choice_generator <- function(V){
U <- V + rgumbel(length(V), 0, 1)
1L * (U == max(U))
}
# Using fn_choice_generator to generate 'choice' columns
df <- df %>%
group_by(id_choice) %>%
mutate(across(starts_with("V"),
fn_choice_generator, .names = "choice_{.col}")) %>% # generating choice(s)
select(-starts_with("V")) %>% ##drop V variables.
select(-c(id,id_ind))
tryCatch(
{
model_result <- mlogit(choice_V1 ~ 0 + x1 + x2 |1 ,
data = df,
idx = c("id_choice", "altern"))
return(model_result)
},
error = function(e){
return(NA)
}
)
}
2) Grid search over possible combinations of the data
#List with the values that varies in the simulation
#number of individuals
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
# Values that remains constant across simulations
#set number of alternatives
n.alter <- 3
## Real parameters
b1 <- 1
b2 <- 2
#Number of reps
nreps <- 10
#Set seed
set.seed(777)
#iteration over different values in the simulation
for(i in n.indiv_list) {
for(j in n.choices_list) {
n.indiv <- i
n.choices <- j
assign(paste0("m_ind_", i, "_choices_", j), lapply(X = 1:nreps, FUN = mlogit_sim_data))
}
}
You can vectorize using the map2 function of the purrr package:
library(tidyverse)
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
l1 <- length(n.indiv_list)
l2 <- length(n.choices_list)
v1 <- rep(n.indiv_list, each = l2)
v2 <- rep(n.choices_list, l1) #v1, v2 generate all pairs
> v1
[1] 1 1 1 1 1 15 15 15 15 15 100 100 100 100 100 500 500 500 500 500
> v2
[1] 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10
result <- map2(v1, v2, function(i, j) {
  # mlogit_sim_data() reads n.indiv and n.choices from the global
  # environment, so set them before each batch of replications
  n.indiv <<- i
  n.choices <<- j
  lapply(X = 1:nreps, FUN = mlogit_sim_data)
})
names(result) <- paste0("m_ind_", v1, "_choices_", v2)
result will then be a named list of your function outputs, one element per combination.
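As an aside, base R's expand.grid() builds the same set of pairs and may read more clearly than the rep() calls (a sketch, equivalent to v1/v2 above):
grid <- expand.grid(n.choices = n.choices_list, n.indiv = n.indiv_list)
v1 <- grid$n.indiv   # same as rep(n.indiv_list, each = l2)
v2 <- grid$n.choices # same as rep(n.choices_list, l1)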
I have a data set with 5 different variables (e.g., a, b, c, d, and e). I want to have simple code (preferably with the tidyverse) that allows me to take the mean for each possible combination of the variables. For example, the mean of "ab", "ac", ..., all the way to "abcde". Is there a simple way of doing this?
All I've tried is manually creating the code for each variable. However, it seems like something like a loop would be more appropriate.
For example, if my data looked like this:
a <- rnorm(10, 0, 1)
b <- rnorm(10, 0, 1)
c <- rnorm(10, 0, 1)
d <- rnorm(10, 0, 1)
e <- rnorm(10, 0, 1)
data <- cbind.data.frame(a,b,c,d,e)
I want the data to look like the output as if I had done this for each combination of a, b, c, d, e:
data$ab <- (data$a + data$b)/2
.
.
.
data$abcde <- (data$a + data$b + data$c + data$d + data$e)/5
You can generate the combinations with combn and compute means for each combination with rowMeans:
# all combinations of 2, 3, ..., ncol(data) column names
all.combs <- unlist(lapply(2:ncol(data), function(x) combn(colnames(data), x, simplify = FALSE)),
                    recursive = FALSE)
# row-wise mean for each combination
m <- lapply(all.combs, function(x) rowMeans(data[, x]))
# name each new column by pasting the combination together
data[sapply(all.combs, paste0, collapse = '')] <- m
# example output
data[, c('ab', 'ac', 'abcde')]
# ab ac abcde
# 1 0.9145668 -0.15422891 0.46534449
# 2 1.0593771 0.36834739 -0.28974715
# 3 0.8504790 0.37486041 0.58032864
# 4 0.8392725 1.67687954 0.62420232
# 5 -0.1612623 -0.31145576 0.06580884
# 6 -0.6140748 -0.05931374 -0.01082605
# 7 0.4424551 0.75504165 0.53706206
# 8 -0.1202238 -0.02772524 0.43865296
# 9 -1.3020701 -0.18290837 -0.61781512
# 10 -0.7414824 -1.56409902 -1.12516693
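Since the question mentions a preference for the tidyverse, here is a rough purrr/dplyr sketch of the same idea, as an alternative to the lapply/rowMeans assignment above (not from the original answer; reuses all.combs):
library(dplyr)
library(purrr)
new_cols <- map_dfc(set_names(all.combs, map_chr(all.combs, paste, collapse = "")),
                    ~ rowMeans(data[, .x]))
data <- bind_cols(data, new_cols)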
Toy example:
> myfn = function(a,x){sum(a*x)}
> myfn(a=2, x=c(1,2,3))
[1] 12
Good so far. Now:
> df = data.frame(a=c(4,5))
> df$ans = myfn(a=df$a, x=c(1,2,3))
Warning message:
In a * x : longer object length is not a multiple of shorter object length
> df
a ans
1 4 26
2 5 26
What I want to happen is that for the first row, it is as if I called myfn(a=4, x=c(1,2,3)), giving an answer of 24, and for the second row, it is as if I called myfn(a=5, x=c(1,2,3)), giving an answer of 30. How do I do this? Thank you.
EDIT: slightly more complex version. Now suppose that the function is
myfn = function(a,b, x){sum((a+b)*x)}
and that I have the data frame
df = data.frame(a=c(4,5), b=c(6,7), c=c(9,9))
I want to create df$ans such that, for the first row, it is as if I called myfn(a=4, b=6, x=c(1,2,3)), and for the second row, as if I called myfn(a=5, b=7, x=c(1,2,3)); that is, use df$a for a, df$b for b, and ignore df$c.
Something like this would work:
myfn = function(a,x){
return(sum(a*x))
}
df <- data.frame(a=c(4,5))
df$ans <- apply(df, 1, myfn, x = c(1,2,3))
df
a ans
1 4 24
2 5 30
** Edited Based On User Edit **
df = data.frame(a=c(4,5), b=c(6,7), c=c(9,9))
df$ans <- apply(df[, c("a", "b")], 1, function(y) sum((y['a']+y['b'])*c(1,2,3)))
a b c ans
1 4 6 9 60
2 5 7 9 72
There are several ways this can be done, each with its own charms. If you don't want to modify the function I would just do
mapply(myfn, df$a, df$b, MoreArgs = list(x = 1:3)) # [1] 60 72
Alternatively, you can bake the iteration right into the function, e.g,
myfn = function(a,b, x){
sapply(a+b, function(ab) {
sum(ab*x)
})
}
myfn(df$a, df$b, 1:3)
That's probably the way I would do it.
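For completeness, base R's Vectorize() wraps the same mapply() machinery (a sketch, using a hypothetical helper that mirrors the plain two-argument myfn from the edited question):
plainfn <- function(a, b, x) sum((a + b) * x)  # same as the question's myfn
df$ans <- Vectorize(plainfn, vectorize.args = c("a", "b"))(df$a, df$b, x = 1:3)
df$ans # [1] 60 72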
I have two vectors, A and B. For every element in A I want to find the index of the first element in B that is greater and has a higher index. A and B are the same length.
So for vectors:
A <- c(10, 5, 3, 4, 7)
B <- c(4, 8, 11, 1, 5)
I want a result vector:
R <- c(3, 3, 5, 5, NA)
Of course I can do it with two loops, but it's very slow, and I don't know how to use apply() in this situation, when the indices matter. My data set has vectors of length 20000, so the speed is really important in this case.
A few bonus questions:
1) What if I have a sequence of numbers (like seq = 2:10) and I want to find the first number in B that is higher than a+s, for every a in A and every s in seq?
2) As in question 1), but I want to know both the first greater and the first lower value, and create a matrix which stores which one came first. For example, given a from A and 10 from seq, I want to find the first value of B which is higher than a+10 or lower than a-10, and then store its index and value.
sapply(sapply(seq_along(A), function(x) which(B[-seq(x)] > A[x]) + x), "[", 1)
[1] 3 3 5 5 NA
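For readers who find the one-liner dense, here is a spelled-out equivalent (same logic; the helper name is my own):
first_greater_after <- function(A, B) {
  sapply(seq_along(A), function(i) {
    hits <- which(B > A[i]) # positions where B exceeds A[i]
    hits <- hits[hits > i]  # keep only later positions
    if (length(hits)) hits[1] else NA
  })
}
first_greater_after(A, B) # [1] 3 3 5 5 NA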
This is a great example of when sapply is less efficient than loops.
Although the sapply does make the code look neater, you are paying for that neatness with time.
Instead you can wrap a while loop inside a for loop inside a nice, neat function.
Here are benchmarks comparing a nested-sapply loop against a nested for-while loop (and a mixed sapply-while loop, for good measure). Update: added the vapply..match approach mentioned in the comments; it's faster than sapply, but still much slower than the while loop.
BENCHMARK:
test elapsed relative
1 for.while 0.069 1.000
2 sapply.while 0.080 1.159
3 vapply.match 0.101 1.464
4 nested.sapply 0.104 1.507
Notice you save a third of your time; the savings will likely be larger when you start adding the sequences to A.
For the second part of your question:
If you have this all wrapped up in an nice function, it is easy to add a seq to A
# Sample data
A <- c(10, 5, 3, 4, 7, 100, 2)
B <- c(4, 8, 11, 1, 5, 18, 20)
# Sample sequence
S <- seq(1, 12, 3)
# matrix with all index values (with names cleaned up)
indexesOfB <- t(sapply(S, function(s) findIndx.gt(A+s, B)))
dimnames(indexesOfB) <- list(S, A)
Lastly, if you want to instead find values of B less than A, just swap the operation in the function.
(You could include an if-clause in the function and use only a single function; I find it more efficient to have two separate functions.)
findIndx.gt(A, B) # [1] 3 3 5 5 6 NA NA
findIndx.lt(A, B) # [1] 2 4 4 NA NA 7 NA
Then you can wrap it up in one nice package:
rangeFindIndx(A, B, S)
# A S indxB.gt indxB.lt
# 10 1 3 2
# 5 1 3 4
# 3 1 5 4
# 4 1 5 NA
# 7 1 6 NA
# 100 1 NA 7
# 2 1 NA NA
# 10 4 6 4
# 5 4 3 4
# ...
FUNCTIONS
(Note that rangeFindIndx depends on reshape2)
rangeFindIndx <- function(A, B, S) {
  # For each s in S and each a in A, find the first value of B
  # that is higher than a+s, or lower than a-s
  require(reshape2)
  # Create gt & lt matrices; add dimnames for the melting step
  indexesOfB.gt <- sapply(S, function(s) findIndx.gt(A+s, B))
  indexesOfB.lt <- sapply(S, function(s) findIndx.lt(A-s, B))
  dimnames(indexesOfB.gt) <- dimnames(indexesOfB.lt) <- list(A, S)
  # melt the matrices and combine into one
  gtltMatrix <- cbind(melt(indexesOfB.gt), melt(indexesOfB.lt)$value)
  # clean up the names
  names(gtltMatrix) <- c("A", "S", "indxB.gt", "indxB.lt")
  return(gtltMatrix)
}
findIndx.gt <- function(A, B) {
  lng <- length(A)
  ret <- integer(0)
  for (j in seq(lng-1)) {
    i <- j + 1
    # advance i past every B[i] smaller than A[j]
    while (i <= lng && B[[i]] < A[[j]]) {
      i <- i + 1
    }
    ret <- c(ret, ifelse(i <= lng, i, NA)) # i <= lng: a match may sit at the last index
  }
  c(ret, NA) # the last element of A has no later index to match
}
findIndx.lt <- function(A, B) {
  lng <- length(A)
  ret <- integer(0)
  for (j in seq(lng-1)) {
    i <- j + 1
    # identical to findIndx.gt except the comparison direction
    while (i <= lng && B[[i]] > A[[j]]) {
      i <- i + 1
    }
    ret <- c(ret, ifelse(i <= lng, i, NA))
  }
  c(ret, NA)
}
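A quick sanity check against the example from the question (using the i <= lng fix above):
A <- c(10, 5, 3, 4, 7)
B <- c(4, 8, 11, 1, 5)
findIndx.gt(A, B) # [1] 3 3 5 5 NA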