Vectorized Conditional Random Matching - r

I want to create conditional random pairs without using for-loops so I can use the code with large datasets. At first, I create rows with unique IDs and randomly assign two different "types" to my rows:
df<-data.frame(id=1:10,type=NA,partner=NA)
df[sample(df$id,nrow(df)/2),"type"]<-1 ##random 50% type 1
df[which(is.na(df$type)==TRUE),"type"]<-2 ##other 50% type 2
df
id type partner
1 1 2 NA
2 2 1 NA
3 3 1 NA
4 4 1 NA
5 5 2 NA
6 6 1 NA
7 7 1 NA
8 8 2 NA
9 9 2 NA
10 10 2 NA
Now I want them to receive a random partner of the opposite type. So I randomize my type 1 IDs and match them to some type 2 IDs like so:
df$partner[which(df$type==2)]<-sample(df$id[which(df$type==1)],
nrow(df)/2)
df
id type partner
1 1 2 4
2 2 1 NA
3 3 1 NA
4 4 1 NA
5 5 2 2
6 6 1 NA
7 7 1 NA
8 8 2 6
9 9 2 3
10 10 2 7
And that's where I'm stuck. For some reason I can't think of a vectorized way to tell R "take the IDs of type 1, look where these IDs are in df$partner and return the corresponding row ID as df$partner instead of NA".
One example for a for-loop for conditional random pairing can be found here: click
I'm pretty sure that that's very basic and doable, however, any help appreciated!

Presumably, you want the type 1 and type 2 matched together to have each other's id in their respective partner entries. Fully vectorized solution.
# Define number of ids
n = 100
# Generate startingn data frame
df = data.frame(id = 1:n, type = NA, partner = NA)
# Generate the type column
df$type[(a<-sample(df$id, n/2))] = 1
df$type[(b<-setdiff(1:100, a))] = 2
# Select a random partner id from the other type
df$partner[a] = sample(df$id[b])
# Fill in partner values based on previous line
df$partner[b] = df$id[match(df$id[b], df$partner)]
Output:
id type partner
1 2 11
2 1 13
3 2 19
4 2 10
5 1 17
6 2 28
7 2 27
8 2 21
9 1 22
10 1 4
11 1 1
12 2 20
13 2 2
14 2 25
15 2 24
16 2 30
17 2 5
18 2 29
19 1 3
20 1 12
21 1 8
22 2 9
23 2 26
24 1 15
25 1 14
26 1 23
27 1 7
28 1 6
29 1 18
30 1 16

Related

Assign ID based on a sequence of consecutive days in R

I have a dataset with repeated measures which I want to use to assign IDs. The repeated measures are from a sequence of consecutive days. However, the sequence itself may be unbalanced (e.g., some have more days while others have less, some start with day 1 but a few others may start with 2 or 3). My question is how to create and assign the same ID withinid the same block of sequence. Here is a toy dataset:
days <- data.frame(
day = c(1L,2L,3L,4L,5L,6L,8L,9L,10L,
2L,3L,4L,5L,6L,7L,9L,10L,
1L,2L,4L,5L,6L,8L,9L,10L,
1L,2L,3L,4L,5L,6L,7L,8L,9L,10L)
)
Here is the end result I expect:
id day
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 8
8 1 9
9 1 10
10 2 2
11 2 3
12 2 4
13 2 5
14 2 6
15 2 7
16 2 9
17 2 10
18 3 1
19 3 2
20 3 4
21 3 5
22 3 6
23 3 8
24 3 9
25 3 10
26 4 1
27 4 2
28 4 3
29 4 4
30 4 5
31 4 6
32 4 7
33 4 8
34 4 9
35 4 10
Get the difference between adjacent elements and check if it is less than 0, take the cumulative sum
days$id <- cumsum(c(TRUE, diff(days$day) < 0))

Match values based on multiple conditions from dataframes of different sizes in R

I have two dataframes of different sizes. Example:
t1 <- data.frame("id"=c(1,1,1,2,2,2,4,5,5,5,6,7,8),"condition"=c(3,3,1,5,5,5,10,10,5,5,2,3,1) )
t2 <- data.frame("ind"=c(1,2,4,5,6,7,8),"test_c"=c(3,5,10,10,2,3,1), "time"=c(32,55,21,34,55,22,19))
I would like to match the cases based on two criteria:
t1$id==t2$ind and t1$condition==t2$test_c and create an additional column in t1 based on the outcome of the variable t2$time under these two conditions.
Expected outcome:
t3 <- data.frame("id"=c(1,1,1,2,2,2,4,5,5,5,6,7,8),"condition"=c(3,3,1,5,5,5,10,10,5,5,2,3,1) , "time"=c (32,32,NA,55,55,55,21,34,NA,NA,55,22,19))
I suspect I should use merge or match functions but I am not sure which would be the right approach.
Base R
> out <- merge(t1, t2, by.x=c("id","condition"), by.y=c("ind","test_c"), all.x=TRUE)
> out
id condition time
1 1 1 NA
2 1 3 32
3 1 3 32
4 2 5 55
5 2 5 55
6 2 5 55
7 4 10 21
8 5 5 NA
9 5 5 NA
10 5 10 34
11 6 2 55
12 7 3 22
13 8 1 19
dplyr
library(dplyr)
left_join(t1, t2, by = c("id" = "ind", "condition" = "test_c"))
Differences with your t3
There are some differences between them. For the sake of display, I'll show them side-by-side, arranged so that we have an easier comparison.
cbind(out[with(out,order(id,condition)),], t3[with(t3,order(id,condition)),])
# id condition time id condition time
# 1 1 1 NA 1 1 NA
# 2 1 3 32 1 3 32
# 3 1 3 32 1 3 32
# 4 2 5 55 2 5 55
# 5 2 5 55 2 5 NA
# 6 2 5 55 2 5 NA
# 7 4 10 21 4 10 21
# 8 5 5 NA 5 5 NA
# 9 5 5 NA 5 5 NA
# 10 5 10 34 5 10 34
# 11 6 2 55 6 2 55
# 12 7 3 22 7 3 22
# 13 8 1 19 8 1 19
The only differences are with id=2,condition=5, where all of them in the merge are assigned the same time=55, and your t3 fills only the first of them. I don't think this is a "first only" logic, as there are other repeat id,condition that do not elicit the same response. I suspect this is just a mistake with the sample data, or perhaps there is post-merge processing you haven't told us yet :-)
In case you want to use match you can use in addition interaction (or paste) to use multiple columns.
t1$time <- t2[match(interaction(t1), interaction(t2[-3])), 3]
t1
# id condition time
#1 1 3 32
#2 1 3 32
#3 1 1 NA
#4 2 5 55
#5 2 5 55
#6 2 5 55
#7 4 10 21
#8 5 10 34
#9 5 5 NA
#10 5 5 NA
#11 6 2 55
#12 7 3 22
#13 8 1 19

How to generate new variables based on the name of the variables in the data frame

For example, I have a toy dataset as the one I created below,
a1<-1:10
a2<-11:20
v<-c(1,2,1,NA,2,1,2,1,2,1)
data<-data.frame(a1,a2,v,stringsAsFactors = F)
Then I want to create a new variable y which will be assigned the value a1 or a2 or NA based on the value of variable v. Therefore, the 'y'
should equals to 1 12 3 NA 15 6 17 8 19 10.
I want to generate it with the command similar to the ones I list below, It doesn't work, I guess it's because of the vectorization issue, then how can I fix it?
In reality, I have several as, say 10 and the actual values are characters instead of numeric ones.
data$y[!is.na(data$v)]<-data[,paste0('a',data$v)]
or
data%>%
mutate(y=ifelse(!is.na(v),get(paste0('a',v)),NA))
You could use standard indexing with cbind for that:
dat$y <- dat[cbind(1:nrow(dat), dat$v)]
The result:
> dat
a1 a2 v y
1 1 11 1 1
2 2 12 2 12
3 3 13 1 3
4 4 14 NA NA
5 5 15 2 15
6 6 16 1 6
7 7 17 2 17
8 8 18 1 8
9 9 19 2 19
10 10 20 1 10
(I used dat instead of data, because it is not wise to call a dataframe the same as a function; see ?data)
Only idea that comes to my mind:
data%>%
mutate(y=ifelse(!is.na(v),paste0('a',v),NA)) %>%
mutate(z=ifelse(!is.na(y),(ifelse(y=="a1",get("a1"),get("a2"))),NA))
a1 a2 v y z
1 1 11 1 a1 1
2 2 12 2 a2 12
3 3 13 1 a1 3
4 4 14 NA <NA> NA
5 5 15 2 a2 15
6 6 16 1 a1 6
7 7 17 2 a2 17
8 8 18 1 a1 8
9 9 19 2 a2 19
10 10 20 1 a1 10
or more directly:
data%>%
mutate(y=ifelse(!is.na(v),(ifelse(v==1, get("a1"),get("a2"))),NA))
a1 a2 v y
1 1 11 1 1
2 2 12 2 12
3 3 13 1 3
4 4 14 NA NA
5 5 15 2 15
6 6 16 1 6
7 7 17 2 17
8 8 18 1 8
9 9 19 2 19
10 10 20 1 10
still based on ifelse :(
You need to use a matrix accessor:
# Get the indices of missing values
ind <- which(!is.na(data$v))
# Transform colnames to indices
tab <- structure(match(c("a1", "a2"), names(data)), .Names = c("a1", "a2"))
# Access data with a matrix accessor
data$y[ind] <- data[cbind(ind, tab[paste0('a', data$v[ind])])]

Give unique identifier to consecutive groupings

I'm trying to identify groups based on sequential numbers. For example, I have a dataframe that looks like this (simplified):
UID
1
2
3
4
5
6
7
11
12
13
15
17
20
21
22
And I would like to add a column that identifies when there are groupings of consecutive numbers, for example, 1 to 7 are first consecutive , then they get 1 , the second consecutive set will get 2 etc .
UID Group
1 1
2 1
3 1
4 1
5 1
6 1
7 1
11 2
12 2
13 2
15 3
17 4
20 5
21 5
22 5
none of the existed code helped me to solved this issue
Here is one base R method that uses diff, a logical check, and cumsum:
cumsum(c(1, diff(df$UID) > 1))
[1] 1 1 1 1 1 1 1 2 2 2 3 4 5 5 5
Adding this onto the data.frame, we get:
df$id <- cumsum(c(1, diff(df$UID) > 1))
df
UID id
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 11 2
9 12 2
10 13 2
11 15 3
12 17 4
13 20 5
14 21 5
15 22 5
Or you can also use dplyr as follows :
library(dplyr)
df %>% mutate(ID=cumsum(c(1, diff(df$UID) > 1)))
# UID ID
#1 1 1
#2 2 1
#3 3 1
#4 4 1
#5 5 1
#6 6 1
#7 7 1
#8 11 2
#9 12 2
#10 13 2
#11 15 3
#12 17 4
#13 20 5
#14 21 5
#15 22 5
We can also get the difference between the current row and the previous row using the shift function from data.table, get the cumulative sum of the logical vector and assign it to create the 'Group' column. This will be faster.
library(data.table)
setDT(df1)[, Group := cumsum(UID- shift(UID, fill = UID[1])>1)+1]
df1
# UID Group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 6 1
# 7: 7 1
# 8: 11 2
# 9: 12 2
#10: 13 2
#11: 15 3
#12: 17 4
#13: 20 5
#14: 21 5
#15: 22 5

Averages based on elements from other columns

So, I’ve been trying to get this working but for some reason, I’m just not making any progress on this. And I was hoping if you guys could help me. Pretty much, I have a data frame that I would like to get the average of a specific range of values, where these values are from other columns within the same data frame, for each user.
So, let’s say I have this data frame.
a<-data.frame(user=c(rep(1,10),rep(2,10),rep(3,10)),
values=c(1:30),toot=c(rep(4,10),rep(5,10),rep(3,10)))
user values toot
1 1 4
1 2 4
1 3 4
1 4 4
1 5 4
1 6 4
1 7 4
1 8 4
1 9 4
1 10 4
2 11 5
2 12 5
2 13 5
2 14 5
2 15 5
2 16 5
2 17 5
2 18 5
2 19 5
2 20 5
3 21 3
3 22 3
3 23 3
3 24 3
3 25 3
3 26 3
3 27 3
3 28 3
3 29 3
3 30 3
So, what I would like is to take the average of the values between 2 elements prior of the toot element through the toot element.
Here's what I'm looking for:
user values toot deck
1 1 4 3
1 2 4 3
1 3 4 3
1 4 4 3
1 5 4 3
1 6 4 3
1 7 4 3
1 8 4 3
1 9 4 3
1 10 4 3
2 11 5 14
2 12 5 14
2 13 5 14
2 14 5 14
2 15 5 14
2 16 5 14
2 17 5 14
2 18 5 14
2 19 5 14
2 20 5 14
3 21 3 22
3 22 3 22
3 23 3 22
3 24 3 22
3 25 3 22
3 26 3 22
3 27 3 22
3 28 3 22
3 29 3 22
3 30 3 22
As you see, for user 1, that user’s toot value is 4, so I want to take the average of user’s 1 values at the 4th element and average it with the 2 elements before it.
This is what I have so far (with many variations of this and with the by function):
a$deck<-ave(a$values,a$user,FUN=function(x)
{
z<-a$toot
y<-z-2
mean(x[y:z])
})
But the problem is that it’s not using the toot value as it’s starting position. Here are the warning messages:
> Warning messages:
1: In y:z : numerical expression has 30 elements: only the first used
2: In y:z : numerical expression has 30 elements: only the first used
Error in mean(x[y:z]) :
error in evaluating the argument 'x' in selecting a method for function 'mean': Error in x[y:z] : only 0's may be mixed with negative subscripts
Anything is welcomed and appreciated, thanks.
You can do it with by(). Like:
do.call(rbind, by(a, a$user, function(x) { cbind(x,deck=mean(x$values[x$toot[1]:(x$toot[1]-2)])) }))
library(plyr)
ddply(a,.(user),function(df) {
df$deck <- mean(df$values[(df$toot[1]-2):df$toot[1]])
df
})

Resources