How to mean center variables based on a binary condition in R

I have a dataframe ("md") containing several variables, of which one is binary ("adopter"). I would like to mean center three of the other (continuous) variables, say X, Y, and Z, but only for the rows where adopter = 1. The rows where adopter = 0 should remain unchanged.
In the end I would like a new dataframe containing all variables as before, but with X, Y, and Z mean centered for the rows where adopter = 1 and unchanged for the rows where adopter = 0.
My dataframe looks like this (117 observations in total):
adopter    X    Y     Z     A     B
0        0.5  2.3   4.5   3     4.7
1        1.5  6.5  -2.3  69.3  -2.5
...      ...  ...   ...   ...   ...
So in this example the new dataframe should contain mean-centered values of X, Y, and Z for the second row (where adopter = 1) and leave the rest unchanged.
I know how to mean center all X, Y, and Z:
md_cen <- md
covs_to_center <- c("X", "Y", "Z")
md_cen[covs_to_center] <- scale(md_cen[covs_to_center],
                                scale = FALSE)
But I cannot figure out how to add the "only if adopter == 1" condition to it. I also tried applying a function:
center_apply <- function(x) {
  apply(x, 2, function(y) y - mean(y))
}
However, this again leaves me with mean-centered versions of all X, Y, and Z values, and on top of that the new dataset contains only those three variables.
Can anyone help me out here, please?

The basic way to accomplish what you're trying to do is to use the split-apply-combine workflow. That is:
Split your data frame up into coherent and useful sub-parts.
Do the thing you want to each sub-part.
Reconstitute the parts into the whole.
First, here's a toy dataset:
covs_to_center <- c("X", "Y", "Z")
set.seed(123)
md <- data.frame(
  adopter = sample(0:1, 10, replace = TRUE),
  X = rnorm(10, 2, 1),
  Y = rnorm(10, 3, 2),
  Z = rnorm(10, 5, 10),
  A = rnorm(10, 40, 50),
  B = rnorm(10, 0, 2)
)
md
##    adopter         X          Y          Z          A           B
## 1        0 3.7150650  6.5738263 -11.866933  74.432013 -2.24621717
## 2        0 2.4609162  3.9957010  13.377870  67.695883 -0.80576967
## 3        0 0.7349388 -0.9332343   6.533731  36.904414 -0.93331071
## 4        1 1.3131471  4.4027118  -6.381369  24.701867  1.55993024
## 5        0 1.5543380  2.0544172  17.538149  20.976450 -0.16673813
## 6        1 3.2240818  0.8643526   9.264642   5.264651  0.50663703
## 7        1 2.3598138  2.5640502   2.049285  29.604136 -0.05709351
## 8        1 2.4007715  0.9479911  13.951257 -23.269818 -0.08574091
## 9        0 2.1106827  1.5422175  13.781335 148.447798  2.73720457
## 10       0 1.4441589  1.7499215  13.215811 100.398100 -0.45154197
A base R solution:
md_base <- data.frame(row_num = 1:nrow(md), md)
# append column of row numbers to make it easier to recombine things later
md_split <- split(md_base, md_base$adopter)
# this is a list of 2 data frames, corresponding to the 2 possible outcomes
# of the adopter variable
md_split$`1`[, covs_to_center] <-
  apply(md_split$`1`[, covs_to_center], 2, function(y) y - mean(y))
# grab the data frame that had a 1 in the response column; apply the centering
# function to the correct variables in that data frame
md_new <- do.call(rbind, md_split)
# glue the data frame back together; it will be ordered by adopter
rownames(md_new) <- NULL
# remove row name artifact created by joining
md_new <- md_new[order(md_new$row_num), names(md_new) != "row_num"]
# sort by the row_num column, then drop it
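As a quick sanity check (my addition, a sketch using the objects above): the adopter == 0 rows should be untouched and the centered group should now average zero.
# adopter == 0 rows keep their original values
all(md_new[md_new$adopter == 0, covs_to_center] == md[md$adopter == 0, covs_to_center])
# centered columns for adopter == 1 now have (near-)zero means
colMeans(md_new[md_new$adopter == 1, covs_to_center])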
This is pretty clunky, and I'm sure it could be improved. Here's a tidyverse equivalent that produces the same output:
library(tidyverse)
md %>%
  group_by(adopter) %>%
  mutate(across(all_of(covs_to_center), function(y) y - adopter * mean(y))) %>%
  ungroup()
The idea behind this is: group by adopter (much like the split() approach), calculate the mean() of the relevant variables within each group, and then subtract that group mean multiplied by the adopter variable, so that when adopter == 0 nothing is subtracted.
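An equivalent and arguably more explicit variant (a sketch, using the same md and covs_to_center as above) subtracts the group mean only where adopter is 1:
library(dplyr)
md %>%
  group_by(adopter) %>%
  mutate(across(all_of(covs_to_center),
                ~ if_else(adopter == 1, .x - mean(.x), .x))) %>%
  ungroup()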

Related

Efficient Monte Carlo simulation over a grid in R

I am running a Monte Carlo simulation of a multinomial logit. I have a function that generates the data and estimates the model. Additionally, I want to generate different datasets over a grid of values, in particular varying both the number of individuals (n.indiv) and the number of answers by each individual (n.choices).
So far I have managed to solve it, but I ended up with a nested for-loop structure over a grid search of the possible values for the number of individuals (n.indiv_list) and the number of answers by each individual (n.choices_list). I am worried about the efficiency of this last bit of code, with the double for-loop running over all combinations of the possible values. Probably there is a vectorized way to do it that I am missing (or maybe not?).
Finally, and this is mostly a matter of style: I ended up with multiple objects containing the models from the grid-search combinations, with informative names, but it would be great to collapse all of them into a single list. With the current structure I am not sure how to do that. Thank you in advance!
1) Function that generates data and estimates the model.
library(dplyr)
library(VGAM)
library(mlogit)
# function that generates the data and estimates the model
mlogit_sim_data <- function(...){
  # generating number of (n.alter) X (n.choices)
  df <- data.frame(id = rep(seq(1, n.choices), n.alter))
  # id per individual
  df <- df %>%
    group_by(id) %>%
    mutate(altern = sequence(n())) %>%
    arrange(id)
  # Repeated scheme for each individual + id_ind
  df <- cbind(df[rep(1:nrow(df), n.indiv), ], id_ind = rep(1:n.indiv, each = nrow(df)))
  ## creating attributes
  df <- df %>%
    mutate(
      x1 = rlnorm(n.indiv * n.alter),
      x2 = rlnorm(n.indiv * n.alter)
    ) %>%
    group_by(altern) %>%
    mutate(id_choice = sequence(n())) %>%
    group_by(id_ind) %>%
    mutate(
      z1 = rpois(1, lambda = 25),
      z2 = rlnorm(1, meanlog = 5, sdlog = 0.5),
      z3 = ifelse(runif(1, min = 0, max = 1) > 0.5, 1, 0)
    )
  # Observed utility
  df$V1 <- with(df, b1 * x1 + b2 * x2)
  #### Generate Response Variable ####
  fn_choice_generator <- function(V){
    U <- V + rgumbel(length(V), 0, 1)
    1L * (U == max(U))
  }
  # Using fn_choice_generator to generate 'choice' columns
  df <- df %>%
    group_by(id_choice) %>%
    mutate(across(starts_with("V"),
                  fn_choice_generator, .names = "choice_{.col}")) %>% # generating choice(s)
    select(-starts_with("V")) %>% # drop V variables
    select(-c(id, id_ind))
  tryCatch(
    {
      model_result <- mlogit(choice_V1 ~ 0 + x1 + x2 | 1,
                             data = df,
                             idx = c("id_choice", "altern"))
      return(model_result)
    },
    error = function(e){
      return(NA)
    }
  )
}
2) Grid search over possible combinations of the data
#List with the values that varies in the simulation
#number of individuals
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
# Values that remains constant across simulations
#set number of alternatives
n.alter <- 3
## Real parameters
b1 <- 1
b2 <- 2
#Number of reps
nreps <- 10
#Set seed
set.seed(777)
# iteration over different values in the simulation
for(i in n.indiv_list) {
  for(j in n.choices_list) {
    n.indiv <- i
    n.choices <- j
    assign(paste0("m_ind_", i, "_choices_", j),
           lapply(X = 1:nreps, FUN = mlogit_sim_data))
  }
}
You can vectorize using the map2 function of the purrr package:
library(tidyverse)
n.indiv_list <- c(1, 15, 100, 500 )
#number of choice situations
n.choices_list <- c(1, 2, 4, 8, 10)
l1 <- length(n.indiv_list)
l2 <- length(n.choices_list)
v1 <- rep(n.indiv_list, each = l2)
v2 <- rep(n.choices_list, l1) #v1, v2 generate all pairs
> v1
[1] 1 1 1 1 1 15 15 15 15 15 100 100 100 100 100 500 500 500 500 500
> v2
[1] 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10 1 2 4 8 10
result <- map2(v1, v2, function(v1, v2) assign(paste0("m_ind_", v1, "_choices_", v2), lapply(X = 1:nreps, FUN = mlogit_sim_data)))
result will be a list of your function outputs.
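If you also want the informative names you were building with assign(), you can name the list elements directly instead (a sketch, assuming v1 and v2 from above), which keeps everything in a single object and addresses the "collapse all of them into a list" part of the question:
names(result) <- paste0("m_ind_", v1, "_choices_", v2)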

Calculate euclidean distance with R

I have data where rows are points and columns are coordinates x,y,z.
I'd like to calculate the Euclidean distance between points in pairs, e.g. 3-4, 11-12, 18-19 and so on... For example, I don't need the distance between 3 and 11, 12, or 18.
The problem is that I have to analyse 1074 tables with 1000 rows or more, so I'm looking for a way to do it automatically, perhaps using the fact that I always want the distance between an odd-numbered row and the even row that follows it. I don't care too much about the output format, but please consider that afterwards I have to select only distances < 3.2, so a data frame format would be great.
Thank you!
How about something like this:
First, I'll make some fake data
set.seed(4304)
df <- data.frame(
  x = runif(1000, -1, 1),
  y = runif(1000, -1, 1),
  z = runif(1000, -1, 1)
)
Make a sequence of values from 1 to the number of rows of your dataset by 2s.
s <- seq(1, nrow(df), by=2)
Use sapply() to make the distance between each pair of points.
out <- sapply(s, function(i){
  sqrt(sum((df[i, ] - df[(i+1), ])^2))
})
Organize the distances into a data frame
res <- data.frame(
  pair = paste(rownames(df)[s], rownames(df)[(s+1)], sep = "-"),
  dist = out)
head(res)
# pair dist
# 1 1-2 1.379992
# 2 3-4 1.303511
# 3 5-6 1.242302
# 4 7-8 1.257228
# 5 9-10 1.107484
# 6 11-12 1.392247
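Since the question also needs only the pairs closer than 3.2, that is now a simple subset of res (a sketch):
close_pairs <- res[res$dist < 3.2, ]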
Here is a function that can be applied to a data.frame or matrix holding the data.
DistEucl <- function(X){
  i <- cumsum(seq_len(nrow(X)) %% 2 == 1)
  sapply(split(X, i), function(Y){
    sqrt(sum((Y[1, ] - Y[2, ])^2))
  })
}
DistEucl(df1)
# 1 2 3 4
#1.229293 1.234273 1.245567 1.195319
With the data in DaveArmstrong's answer, the results are the same except for a names attribute in the above function's return value.
out2 <- DistEucl(df)
all.equal(out, out2)
#[1] "names for current but not for target"
identical(out, unname(out2))
#[1] TRUE
Data in the question
x <- c(13.457, 13.723, 15.319, 15.713, 18.446, 19.488, 19.762, 19.743)
y <- c(28.513, 29.656, 28.510, 27.342, 28.827, 28.24, 29.841, 30.942)
z <- c(40.513, 40.147, 43.281, 43.218, 43.095, 43.443, 40.094, 40.559)
df1 <- data.frame(x, y, z)

Trying to make a script calculate a value (using a function) for every 24 rows

I have not been able to find a solution to a problem similar to this on StackOverflow. I hope someone can help!
I am using the R environment.
I have data from turtle nests. There are two types of hourly data in each nest. The first is hourly Temperature, and it has an associated hourly Development (the amount of "anatomical" embryonic development).
I am calculating a weighted median. In this case, the median is temperature and it is weighted by development.
I have a script here that I am using to calculate a weighted median:
weighted.median <- function(x, w, probs = 0.5, na.rm = TRUE) {
  x <- as.numeric(as.vector(x))
  w <- as.numeric(as.vector(w))
  if(anyNA(x) || anyNA(w)) {
    ok <- !(is.na(x) | is.na(w))
    x <- x[ok]
    w <- w[ok]
  }
  stopifnot(all(w >= 0))
  if(all(w == 0)) stop("All weights are zero", call. = FALSE)
  #'
  oo <- order(x)
  x <- x[oo]
  w <- w[oo]
  Fx <- cumsum(w)/sum(w)
  #'
  result <- numeric(length(probs))
  for(i in seq_along(result)) {
    p <- probs[i]
    lefties <- which(Fx <= p)
    if(length(lefties) == 0) {
      result[i] <- x[1]
    } else {
      left <- max(lefties)
      result[i] <- x[left]
      if(Fx[left] < p && left < length(x)) {
        right <- left + 1
        y <- x[left] + (x[right] - x[left]) * (p - Fx[left])/(Fx[right] - Fx[left])
        if(is.finite(y)) result[i] <- y
      }
    }
  }
  names(result) <- paste0(format(100 * probs, trim = TRUE), "%")
  return(result)
}
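For reference, a quick illustrative call (my addition, not from the original post); the heavy weight on 4 pulls the weighted median above the unweighted one:
weighted.median(x = c(1, 2, 3, 4), w = c(1, 1, 1, 5))
## 50%
## 3.2  (interpolated between 3 and 4)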
So from the function you can see that I need two input vectors, x and w (which will be temperature and development, respectively).
The problem I'm having is that I have hourly temperature traces that last anywhere from 5 days to 53 days (i.e., 120 hours to 1272 hours).
I would like to calculate the daily weighted median for all days within a nest (i.e., take the 24 rows of x and w, and calculate the weighted median, then move onto rows 25-48, and so forth.) The output vector would therefore be a list of daily weighted medians with length n/24 (where n is the total number of rows in x).
In other words, I would like to analyse my data automatically, in a fashion equivalent to manually doing this (nest1 is the datasheet for Nest 1, which contains two vectors, temp and devo, where devo is the weight):
weighted.median(nest1$temp[c(1,1:24)], nest1$devo[c(1,1:24)], na.rm=TRUE)
followed by
weighted.median(nest1$temp[c(1,25:48)], nest1$devo[c(1,25:48)], na.rm=TRUE)
followed by
weighted.median(nest1$temp[c(1,49:72)], nest1$devo[c(1,49:72)], na.rm=TRUE)
all the way to
weighted.median(nest1$temp[c(1,n-23:n)], nest1$devo[c(1,n-23:n)], na.rm=TRUE)
I'm afraid I don't even know where to start. Any help or clues would be very much appreciated.
The main idea is to create a new column for day 1, day 2, ..., day n/24, split the dataframe into subsets by day, and apply your function to each subset.
First I create some sample data:
set.seed(123)
n <- 120 # number of rows
nest1 <- data.frame(temp = rnorm(n), devo = rpois(n, 5))
Create the splitting variable:
nest1$day <- rep(1:(nrow(nest1)/24), each = 24)
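One caveat (my addition, not part of the original answer): rep(..., each = 24) assumes the number of rows is an exact multiple of 24. An index-based grouping works for any length:
nest1$day <- (seq_len(nrow(nest1)) - 1) %/% 24 + 1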
Then, use the by() function to split nest1 by nest1$day and apply the function to each subset:
out <- by(nest1, nest1$day, function(d) {
  weighted.median(d$temp, d$devo, na.rm = TRUE)
})
data.frame(day = dimnames(out)[[1]], x = as.vector(out))
# day x
# 1 1 -0.45244433
# 2 2 0.15337312
# 3 3 0.07071673
# 4 4 0.23873174
# 5 5 -0.27694709
Instead of using by, you can also use the group_by + summarise functions from the dplyr package:
library(dplyr)
nest1 %>%
  group_by(day) %>%
  summarise(x = weighted.median(temp, devo, na.rm = TRUE))
# # A tibble: 5 x 2
# day x
# <int> <dbl>
# 1 1 -0.452
# 2 2 0.153
# 3 3 0.0707
# 4 4 0.239
# 5 5 -0.277

R: Rolling window function with adjustable window and step-size for irregularly spaced observations

Say there is a 2-column data frame with a time (or distance) column which sequentially increases and an observation column which may have NAs here and there. How can I efficiently use a sliding window function to get some statistic, say a mean, for the observations in a window of duration X (e.g. 5 seconds), slide the window over Y seconds (e.g. 2.5 seconds), and repeat? The number of observations in the window is based on the time column, thus both the number of observations per window and the number of observations to slide the window by may vary. The function should accept any window size up to the number of observations, plus a step size.
Here is sample data (see "Edit:" for a larger sample set)
set.seed(42)
dat <- data.frame(time = seq(1:20)+runif(20,0,1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:19,2)] <- NA_real_
head(dat)
time measure
1 1.914806 1.0222694
2 2.937075 0.3490641
3 3.286140 NA
4 4.830448 0.8112979
5 5.641746 0.8773504
6 6.519096 1.2174924
Desired Output for the specific case of a 5 second window, 2.5 second step, first window from -2.5 to 2.5, na.rm=FALSE:
[1] 1.0222694
[2] NA
[3] NA
[4] 1.0126639
[5] 0.9965048
[6] 0.9514456
[7] 1.0518228
[8] NA
[9] NA
[10] NA
Explanation: In the desired output the very first window looks for times between -2.5 and 2.5. One observation of measure is in this window, and it is not an NA, thus we get that observation: 1.0222694. The next window is from 0 to 5, and there is an NA in the window, so we get NA. Same for the window from 2.5 to 7.5. The next window is from 5 to 10. There are 5 observations in the window, none are NA. So, we get the average of those 5 observations (i.e. mean(dat[dat$time >5 & dat$time <10,'measure']) )
What I tried: Here is what I tried for the specific case of a window where the step size is 1/2 the window duration:
windo <- 5 # duration in seconds of window
# partition into groups depending on which window(s) an observation falls in
# When step size >= window/2 and < window, need two grouping vectors
leaf1 <- round(ceiling(dat$time/(windo/2))+0.5)
leaf2 <- round(ceiling(dat$time/(windo/2))-0.5)
l1 <- tapply(dat$measure, leaf1, mean)
l2 <- tapply(dat$measure, leaf2, mean)
as.vector(rbind(l2,l1))
Not flexible, not elegant, not efficient. If step size isn't 1/2 window size, the approach will not work, as is.
Any thoughts on a general solution to this kind of problem? Any solution is acceptable. The faster the better, though I prefer solutions using base R, data.table, Rcpp, and/or parallel computation. In my real data set, there are several millions of observations contained in a list of data frames (max data frame is ~400,000 observations).
Below is some extra info: a larger sample set.
Edit: As per request, here is a larger, more realistic example dataset with many more NAs and the minimum time span (~0.03). To be clear, though, the list of data frames contains small ones like the one above, as well as ones like the following and larger:
set.seed(42)
dat <- data.frame(time = seq(1:50000)+runif(50000, 0.025, 1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:50000,1000)] <- NA_real_
dat$measure[c(350:450,3000:3300, 20000:28100)] <- NA_real_
dat <- dat[-c(1000:2000, 30000:35000),]
# a list with a realistic number of observations:
dat <- lapply(1:300,function(x) dat)
Here is an attempt with Rcpp. The function assumes that data is sorted according to time. More testing would be advisable and adjustments could be made.
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector rollAverage(const NumericVector & times,
                          NumericVector & vals,
                          double start,
                          const double winlen,
                          const double winshift) {
  int n = ceil((max(times) - start) / winshift);
  NumericVector winvals;
  NumericVector means(n);
  int ind1(0), ind2(0);
  for(int i = 0; i < n; i++) {
    if (times[0] < (start + winlen)) {
      while((times[ind1] <= start) &
            (times[ind1 + 1] <= (start + winlen)) &
            (ind1 < (times.size() - 1))) {
        ind1++;
      }
      while((times[ind2 + 1] <= (start + winlen)) & (ind2 < (times.size() - 1))) {
        ind2++;
      }
      if (times[ind1] >= start) {
        winvals = vals[seq(ind1, ind2)];
        means[i] = mean(winvals);
      } else {
        means[i] = NA_REAL;
      }
    } else {
      means[i] = NA_REAL;
    }
    start += winshift;
  }
  return means;
}
Testing it:
set.seed(42)
dat <- data.frame(time = seq(1:20)+runif(20,0,1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:19,2)] <- NA_real_
rollAverage(dat$time, dat$measure, -2.5, 5.0, 2.5)
#[1] 1.0222694 NA NA 1.0126639 0.9965048 0.9514456 1.0518228 NA NA NA
With your list of data.frames (using data.table):
set.seed(42)
dat <- data.frame(time = seq(1:50000)+runif(50000, 0.025, 1))
dat <- data.frame(dat, measure=c(diff(dat$time),NA_real_))
dat$measure[sample(1:50000,1000)] <- NA_real_
dat$measure[c(350:450,3000:3300, 20000:28100)] <- NA_real_
dat <- dat[-c(1000:2000, 30000:35000),]
# a list with a realistic number of observations:
dat <- lapply(1:300,function(x) dat)
library(data.table)
dat <- lapply(dat, setDT)
for (ind in seq_along(dat)) dat[[ind]][, i := ind]
#possibly there is a way to avoid these copies?
dat <- rbindlist(dat)
system.time(res <- dat[, rollAverage(time, measure, -2.5, 5.0, 2.5), by=i])
#user system elapsed
#1.51 0.02 1.54
print(res)
# i V1
# 1: 1 1.0217126
# 2: 1 0.9334415
# 3: 1 0.9609050
# 4: 1 1.0123473
# 5: 1 0.9965922
# ---
#6000596: 300 1.1121296
#6000597: 300 0.9984581
#6000598: 300 1.0093060
#6000599: 300 NA
#6000600: 300 NA
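Regarding the in-code comment about avoiding the copies: rbindlist() can create the id column itself via its idcol argument, so the lapply(dat, setDT) call, the explicit loop, and the separate rbindlist() call can be collapsed (a sketch):
dat <- rbindlist(dat, idcol = "i")
res <- dat[, rollAverage(time, measure, -2.5, 5.0, 2.5), by = i]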
Here is a function that gives the same result for your small data frame. It's not particularly quick: it takes several seconds to run on one of the larger datasets in your second dat example.
rolling_summary <- function(DF, time_col, fun, window_size, step_size, min_window = min(DF[, time_col])) {
  # time_col is name of time column
  # fun is function to apply to the subsetted data frames
  # min_window is the start time of the earliest window
  times <- DF[, time_col]
  # window_starts is a vector of the windows' minimum times
  window_starts <- seq(from = min_window, to = max(times), by = step_size)
  # The i-th element of window_rows is a vector that tells us the row numbers of
  # the data-frame rows that are present in window i
  window_rows <- lapply(window_starts, function(x) { which(times >= x & times < x + window_size) })
  window_summaries <- sapply(window_rows, function(w_r) fun(DF[w_r, ]))
  data.frame(start_time = window_starts, end_time = window_starts + window_size, summary = window_summaries)
}
rolling_summary(DF = dat,
                time_col = "time",
                fun = function(DF) mean(DF$measure),
                window_size = 5,
                step_size = 2.5,
                min_window = -2.5)
Here are some functions that will give the same output on your first example:
partition <- function(x, window, step = 0){
  a <- x[x < step]
  b <- x[x >= step]
  ia <- rep(0, length(a))
  ib <- cut(b, seq(step, max(b) + window, by = window))
  c(ia, ib)
}
roll <- function(df, window, step = 0, fun, ...){
  tapply(df$measure, partition(df$time, window, step), fun, ...)
}
roll_steps <- function(df, window, steps, fun, ...){
  X <- lapply(steps, roll, df = df, window = window, fun = fun, ...)
  names(X) <- steps
  X
}
Output for your first example:
> roll_steps(dat, 5, c(0, 2.5), mean)
$`0`
1 2 3 4 5
NA 1.0126639 0.9514456 NA NA
$`2.5`
0 1 2 3 4
1.0222694 NA 0.9965048 1.0518228 NA
You can also ignore missing values this way easily:
> roll_steps(dat, 5, c(0, 2.5), mean, na.rm = TRUE)
$`0`
1 2 3 4 5
0.7275438 1.0126639 0.9514456 0.9351326 NaN
$`2.5`
0 1 2 3 4
1.0222694 0.8138012 0.9965048 1.0518228 0.6122983
This can also be used for a list of data.frames:
> x = lapply(dat2, roll_steps, 5, c(0, 2.5), mean)
Ok, how about this.
library(data.table)
dat <- data.table(dat)
setkey(dat, time)
# function to compute a given stat over a time window on a given data.table
window_summary <- function(start_tm, window_len, stat_fn, my_dt) {
  pos_vec <- my_dt[, which(time >= start_tm & time <= start_tm + window_len)]
  return(stat_fn(my_dt$measure[pos_vec]))
}
# a vector of window start times
start_vec <- seq(from = -2.5, to = dat$time[nrow(dat)], by = 2.5)
# sapply'ing the function above over vector of start times
# (in this case, getting mean over 5 second windows)
result <- sapply(start_vec, window_summary,
                 window_len = 5, stat_fn = mean, my_dt = dat)
On my machine, it processes the first 20,000 rows of your large dataset in 13.06781 secs and all rows in 51.58614 secs.
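The question prefers a data frame, so the window start times can be paired with the computed means (a sketch, assuming start_vec and result from above):
res_df <- data.frame(window_start = start_vec,
                     window_end = start_vec + 5,
                     mean_measure = result)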
Here's another attempt using a pure data.table approach and its between() function.
I have compared Rprof results against the above answers (except @Roland's answer) and it seems to be the most optimized one.
I haven't tested for bugs though, but if you like it, I'll expand the answer.
Using your dat from above
library(data.table)
Rollfunc <- function(dat, time, measure, wind = 5, slide = 2.5, FUN = mean, ...){
  temp <- seq.int(-slide, max(dat$time), by = slide)
  temp <- cbind(temp, temp + wind)
  setDT(dat)[, apply(temp, 1, function(x) FUN(measure[between(time, x[1], x[2])], ...))]
}
Rollfunc(dat, time, measure, 5, 2.5)
## [1] 1.0222694 NA NA 1.0126639 0.9965048 0.9514456 1.0518228 NA NA
## [10] NA
You can also specify the function and its arguments; for example:
Rollfunc(dat, time, measure, 5, 2.5, max, na.rm = TRUE)
will also work
Edit: I did some benchmarks against @Roland's answer and his method clearly wins (by far), so I would go with the Rcpp approach.

R row-wise conditional replacement

Friends
I'm trying to set up a matrix or data.frame for a canonical correlation analysis. The original dataset has a column designating one of x conditions and subsequent columns of explanatory variables. I need to set up an array that sets an indicator variable for each condition "x". E.g., columns in df are:
ID, cond, task1, taskN
A, x, 12, 14
B, x, 13, 17
C, y, 11, 10
D, z, 10, 13
here "cond" can be x,y,z,... (can vary, so I don't know how many). This needs to go to:
ID, x, y, z, task1, taskN
A, 1, 0, 0, 12, 14
B, 1, 0, 0, 13, 17
C, 0, 1, 0, 11, 10
D, 0, 0, 1, 10, 13
So, I can set up the indicators in an array
iv<-as.data.frame(array(,c(nrow(df),length(levels(cond)))))
and then cbind this to df, but I can't figure out how to go into the array and set the appropriate indicator to "1" and the rest to "0".
Any suggestions?
Thanks
Jon
If you code cond as a factor, you can get R to do the expansion you want via model.matrix. The only complication is that to get the coding you chose (dummy variable coding, or sum contrasts in R) we need to change the default contrasts used by R's model formula code.
## data
dat <- data.frame(ID = LETTERS[1:4], cond = factor(c("x","x","y","z")),
                  task1 = c(12,13,11,10), taskN = c(14,17,10,13))
dat
## We get R to produce the dummy variables for us,
## but your coding needs the contr.sum contrasts
op <- options(contrasts = c("contr.sum","contr.poly"))
dat2 <- data.frame(ID = dat$ID, model.matrix(ID ~ . - 1, data = dat))
## Levels of cond
lev <- with(dat, levels(cond))
## fix-up the names
names(dat2)[2:(1+length(lev))] <- lev
dat2
## reset contrasts
options(op)
This gives us:
> dat2
ID x y z task1 taskN
1 A 1 0 0 12 14
2 B 1 0 0 13 17
3 C 0 1 0 11 10
4 D 0 0 1 10 13
This should scale automatically as the number of levels in cond changes/increases.
HTH
Another alternative is to use use cast in the reshape package:
library(reshape)
l <- length(levels(dat$cond))
dat2 <- merge(cast(dat,ID~cond),dat)[,c(1:(l+1),(l+3):(ncol(dat)+l))]
dat2[,2:(1+l)] <- !is.na(dat2[,2:(1+l)])
This gives you logical values rather than 0 and 1 though:
> dat2
ID x y z task1 taskN
1 A TRUE FALSE FALSE 12 14
2 B TRUE FALSE FALSE 13 17
3 C FALSE TRUE FALSE 11 10
4 D FALSE FALSE TRUE 10 13
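If you need 0/1 rather than TRUE/FALSE (my addition), multiplying the logical block by 1 coerces it to numeric:
dat2[, 2:(1+l)] <- dat2[, 2:(1+l)] * 1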
That's cool using model.matrix for this. (reshape too.) Always learning something here. A couple more ideas:
indicator1 <- function(groupStrings) {
  groupFactors <- factor(groupStrings)
  colNames <- levels(groupFactors)
  bits <- matrix(0, nrow = length(groupStrings), ncol = length(colNames))
  bits[matrix(c(1:length(groupStrings),
                unclass(groupFactors)), ncol = 2)] <- 1
  setNames(as.data.frame(bits), colNames)
}
indicator2 <- function(groupStrings) {
  colNames <- unique(groupStrings)
  bits <- outer(groupStrings, colNames, "==")
  setNames(as.data.frame(bits * 1), colNames)
}
Used as follows
d <- data.frame(cond=c("a", "a", "b"))
d <- cbind(d, indicator2(as.character(d$cond)))
Again, a great example of the greatness of open-source! Thanks so much for your help. The initial solution seemed to work best for me. In case someone else might be interested, here is how I implemented this with my (very large) dataset:
# Load needed libraries if not already so
if("packages:sciplot" %in% search()) next else library(moments)
# Initialize dataframes. DEFINE THE workspace SUBSET TO ANALYZE HERE
df<-stroke
# Make any necessary modifications to the df
df$TrDif <- df$TrBt-df$TrAt
# 0) Set up indicator variables (iv) from the factor you choose.
op <- options(contrasts = c("contr.sum","contr.poly"))
dat<-subset(df,select=c("newcat"))
iv<-data.frame(model.matrix(~.-1,data=dat))
names(iv) <- levels(dat$newcat)
lbl<-levels(dat$newcat) # need this for plot functions below
# Select task variables with n > 1150 to be regressed (THIS CAN PROBABLY BE DONE MORE ELEGANTLY).
taskarr<-subset(df, select=c("B20","B40","FW","Anim","TrAt","TrBt","TrBerr","TrDif","Snod15","tt","GEMS","Clock3","orient","Wlenc","wlfr","wlcr","wlrec","Snod15Rec","GEMSfr"))
## 1) evaluate covariance matrix and extract sub-matrices
## Caution: Covariance samples differ due to missing values.
sig <- cov(cbind(iv,taskarr),use="pairwise.complete.obs")
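To complete the "extract sub-matrices" step mentioned in the comment above (a sketch; p is the number of indicator columns, and the task columns follow them in sig):
p <- ncol(iv)
S11 <- sig[1:p, 1:p]        # indicators vs indicators
S22 <- sig[-(1:p), -(1:p)]  # tasks vs tasks
S12 <- sig[1:p, -(1:p)]     # indicators vs tasks (cross-covariance)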
