Finding matching pairs of identifiers in a data frame

Finding matching pairs of identifiers in a data frame - r

I have a dataframe df:
id1 id2 action
1 2 10
1 3 11
1 4 21
2 1 6
...
It means, the user id1 do something (10) to user id2, and id2 do something (6) to id1.
Now I want to create a new column, called partner_action, which basically records what did the partner do. So it will look like:
id1 id2 action partner_action
1 2 10 6
2 1 6 10
1 3 11 9
3 1 9 11
I tried:
df$partner_action = df[df$id2 == df$id1,]$action
But of course, it does not work.
I thought about make a copy of df, called df_copy then:
df$partner_action = df_copy[df_copy$id1 == df$id2,]$action
But is there a better way to do it?

Basically you want to merge df with itself, matching pairs of (id2, id1) with pairs of (id1, id2). You can do this in R either with merge or match:
df$partner_action <- with(df, action[match(paste(id2, id1), paste(id1, id2))])
df
# id1 id2 action partner_action
# 1 1 2 10 6
# 2 2 1 6 10
# 3 1 3 11 9
# 4 3 1 9 11
Data:
(df <- data.frame(id1=c(1, 2, 1, 3), id2=c(2, 1, 3, 1), action=c(10, 6, 11, 9)))
# id1 id2 action
# 1 1 2 10
# 2 2 1 6
# 3 1 3 11
# 4 3 1 9

for(i in 1:nrow(df))
{df[i,4]<-df[which(df$id1==df[i,2]&df$id2==df[i,1]),3]}

Related

cumulative product in R across column

I have a dataframe in the following format
> x <- data.frame("a" = c(1,1),"b" = c(2,2),"c" = c(3,4))
> x
a b c
1 1 2 3
2 1 2 4
I'd like to add 3 new columns which is a cumulative product of the columns a b c, however I need a reverse cumulative product i.e. the output should be
row 1:
result_d = 1*2*3 = 6 , result_e = 2*3 = 6, result_f = 3
and similarly for row 2
The end result will be
a b c result_d result_e result_f
1 1 2 3 6 6 3
2 1 2 4 8 8 4
the column names do not matter this is just an example. Does anyone have any idea how to do this?
as per my comment, is it possible to do this on a subset of columns? e.g. only for columns b and c to return:
a b c results_e results_f
1 1 2 3 6 3
2 1 2 4 8 4
so that column "a" is effectively ignored?

One option is to loop through the rows and apply cumprod over the reverse of elements and then do the reverse
nm1 <- paste0("result_", c("d", "e", "f"))
x[nm1] <- t(apply(x, 1,
function(x) rev(cumprod(rev(x)))))
x
# a b c result_d result_e result_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
Or a vectorized option is rowCumprods
library(matrixStats)
x[nm1] <- rowCumprods(as.matrix(x[ncol(x):1]))[,ncol(x):1]

temp = data.frame(Reduce("*", x[NCOL(x):1], accumulate = TRUE))
setNames(cbind(x, temp[NCOL(temp):1]),
c(names(x), c("res_d", "res_e", "res_f")))
# a b c res_d res_e res_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4

Adding NA's where data is missing [duplicate]

This question already has an answer here:
Insert missing time rows into a dataframe
(1 answer)
Closed 5 years ago.
I have a dataset that look like the following
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
data.frame(id,cycle,value)
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 3 8
9 4 2 9
so basically there is a variable called id that identifies the sample, a variable called cycle which identifies the timepoint, and a variable called value that identifies the value at that timepoint.
As you see, sample 3 does not have cycle 2 data and sample 4 is missing cycle 1 and 3 data. What I want to know is there a way to run a command outside of a loop to get the data to place NA's where there is no data. So I would like for my dataset to look like the following:
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
I am able to solve this problem with a lot of loops and if statements but the code is extremely long and cumbersome (I have many more columns in my real dataset).
Also, the number of samples I have is very large so I need something that is generalizable.

Using merge and expand.grid, we can come up with a solution. expand.grid creates a data.frame with all combinations of the supplied vectors (so you'd supply it with the id and cycle variables). By merging to your original data (and using all.x = T, which is like a left join in SQL), we can fill in those rows with missing data in dat with NA.
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
dat <- data.frame(id,cycle,value)
grid_dat <- expand.grid(id = 1:4,
cycle = 1:3)
# or you could do (HT #jogo):
# grid_dat <- expand.grid(id = unique(dat$id),
# cycle = unique(dat$cycle))
merge(x = grid_dat, y = dat, by = c('id','cycle'), all.x = T)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA

A solution based on the package tidyverse.
library(tidyverse)
# Create example data frame
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 4)
cycle <- c(1, 2, 3, 1, 2, 3, 1, 3, 2)
value <- 1:9
dt <- data.frame(id, cycle, value)
# Complete the combination between id and cycle
dt2 <- dt %>% complete(id, cycle)

Here is a solution with data.table doing a cross join:
library("data.table")
d <- data.table(id = c(1,1,1,2,2,2,3,3,4), cycle = c(1,2,3,1,2,3,1,3,2), value = 1:9)
d[CJ(id=id, cycle=cycle, unique=TRUE), on=.(id,cycle)]

Select unique values from a list of 3

I would like to list all unique combinations of vectors of length 3 where each element of the vector can range between 1 to 9.
First I list all such combinations:
df <- expand.grid(1:9, 1:9, 1:9)
Then I would like to remove the rows that contain repetitions.
For example:
1 1 9
9 1 1
1 9 1
should only be included once.
In other words if two lines have the same numbers and the same number of each number then it should only be included once.
Note that
8 8 8 or
9 9 9 is fine as long as it only appears once.

Based on your approach and the idea to remove repetitions:
df <- expand.grid(1:2, 1:2, 1:2)
# Var1 Var2 Var3
# 1 1 1 1
# 2 2 1 1
# 3 1 2 1
# 4 2 2 1
# 5 1 1 2
# 6 2 1 2
# 7 1 2 2
# 8 2 2 2
df2 <- unique(t(apply(df, 1, sort))) #class matrix
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 1 1 2
# [3,] 1 2 2
# [4,] 2 2 2
df2 <- as.data.frame(df2) #class data.frame
There are probably more efficient methods, but if I understand you correct, that is the result you want.

Maybe something like this (since your data frame is not large, so it does not pain!):
len <- apply(df,1,function(x) length(unique(x)))
res <- rbind(df[len!=2,], df[unique(apply(df[len==2,],1,prod)),])
Here is what is done:
Get the number of unique elements per row
Comprises two steps:
First argument of rbind: Those with length either 1 (e.g. 1 1 1, 7 7 7, etc) or 3 (e.g. 5 8 7, 2 4 9, etc) are included in the final results res.
Second argument of rbind: For those in which the number of unique elements are 2 (e.g. 1 1 9, 3 5 3, etc), we apply product per row and take whose unique products (cause, for example, the product of 3 3 5 and 3 5 3 and 5 3 3 are the same)

aggregate dataframe subsets in R

I have the dataframe ds
CountyID ZipCode Value1 Value2 Value3 ... Value25
1 1 0 etc etc etc
2 1 3
3 1 0
4 1 1
5 2 2
6 3 3
7 4 7
8 4 2
9 5 1
10 6 0
and would like to aggregate based on ds$ZipCode and set ds$CountyID equal to the primary county based on the highest ds$Value1. For the above example, it would look like this:
CountyID ZipCode Value1 Value2 Value3 ... Value25
2 1 4 etc etc etc
5 2 2
6 3 3
7 4 9
9 5 1
10 6 0
All the ValueX columns are the sum of that column grouped by ZipCode.
I've tried a bunch of different strategies over the last couple days, but none of them work. The best I've come up with is
#initialize the dataframe
ds_temp = data.frame()
#loop through each subset based on unique zipcodes
for (zip in unique(ds$ZipCode) {
sub <- subset(ds, ds$ZipCode == zip)
len <- length(sub)
maxIndex <- which.max(sub$Value1)
#do the aggregation
row <- aggregate(sub[3:27], FUN=sum, by=list(
CountyID = rep(sub$CountyID[maxIndex], len),
ZipCode = sub$ZipCode))
rbind(ds_temp, row)
}
ds <- ds_temp
I haven't been able to test this on the real data, but with dummy datasets (such as the one above), I keep getting the error "arguments must have the same length). I've messed around with rep() and fixed vectors (eg c(1,2,3,4)) but no matter what I do, the error persists. I also occasionally get an error to the effect of
cannot subset data of type 'closure'.
Any ideas? I've also tried messing around with data.frame(), ddply(), data.table(), dcast(), etc.

You can try this:
data.frame(aggregate(df[,3:27], by=list(df$ZipCode), sum),
CountyID = unlist(lapply(split(df, df$ZipCode),
function(x) x$CountyID[which.max(x$Value1)])))
Fully reproducible sample data:
df<-read.table(text="
CountyID ZipCode Value1
1 1 0
2 1 3
3 1 0
4 1 1
5 2 2
6 3 3
7 4 7
8 4 2
9 5 1
10 6 0", header=TRUE)
data.frame(aggregate(df[,3], by=list(df$ZipCode), sum),
CountyID = unlist(lapply(split(df, df$ZipCode),
function(x) x$CountyID[which.max(x$Value1)])))
# Group.1 x CountyID
#1 1 4 2
#2 2 2 5
#3 3 3 6
#4 4 9 7
#5 5 1 9
#6 6 0 10

In response to your comment on Frank's answer, you can preserve the column names by using the formula method in aggregate. Using Franks's data df, this would be
> cbind(aggregate(Value1 ~ ZipCode, df, sum),
CountyID = sapply(split(df, df$ZipCode), function(x) {
with(x, CountyID[Value1 == max(Value1)]) }))
# ZipCode Value1 CountyID
# 1 1 4 2
# 2 2 2 5
# 3 3 3 6
# 4 4 9 7
# 5 5 1 9
# 6 6 0 10

Remover observations for which there is not a duplicate

I would like to break a dataset into two frames - one for which the original dataset has duplicate observations based on a condition and one for which the original dataset does not have duplicate observations based on a condition. In the following example, I would like to break the frame into one for which there is only one coder for an observation and one for which there are two coders::
frame <- data.frame(id = c(1,1,1,2,2,3), coder = c("A", "A", "B", "A", "B", "A"), y = c(4,5,4,1,1,2))
frame
For this, I would like to produce, such that:
frame1:
id coder y
1 1 A 4
2 1 A 5
3 1 B 4
4 2 A 1
5 2 B 1
frame2:
6 3 A 2

You can use aggregate to determine the ids you want in each data frame:
cts <- aggregate(coder~id, frame, function(x) length(unique(x)))
cts
# id coder
# 1 1 2
# 2 2 2
# 3 3 1
Then you can subset as appropriate based on this:
subset(frame, id %in% cts$id[cts$coder >= 2])
# id coder y
# 1 1 A 4
# 2 1 A 5
# 3 1 B 4
# 4 2 A 1
# 5 2 B 1
subset(frame, id %in% cts$id[cts$coder < 2])
# id coder y
# 6 3 A 2

You may also try:
indx <- !colSums(!table(frame$coder, frame$id))
frame[frame$id %in% names(indx)[indx],]
# id coder y
#1 1 A 4
#2 1 A 5
#3 1 B 4
#4 2 A 1
#5 2 B 1
frame[frame$id %in% names(indx)[!indx],]
# id coder y
#6 3 A 2
Explanation
table(frame$coder, frame$id)
# 1 2 3
# A 2 1 1
# B 1 1 0 #Here for id 3, B==0
If we Negate that, the result would be a logical index
!table(frame$coder, frame$id).
Do the colSums of the above, which results
# 1 2 3
# 0 0 1
Negate again and get the index for ids and subset those ids which are TRUE
From this you can subset by matching with the names of the ids

Categories

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Finding matching pairs of identifiers in a data frame - r

for(i in 1:nrow(df)) {df[i,4]<-df[which(df$id1==df[i,2]&df$id2==df[i,1]),3]}

Related

cumulative product in R across column

Adding NA's where data is missing [duplicate]

Select unique values from a list of 3

aggregate dataframe subsets in R

Remover observations for which there is not a duplicate

Categories

Resources