Extract data based on another list - r

I am trying to extract rows of a dataset based on a list of time points nested within individuals. Some time points are repeated (and therefore have exactly the same variable values), but I still want to keep the duplicated rows. How can I achieve that in base R?
Here is the original dataset:
xx <- data.frame(id=rep(1:3, each=3), time=1:3, y=rep(1:3, each=3))
Here is the list of time points, with one vector per id (the list names are the ids):
lst <- list(`1` = c(1, 1, 2), `2` = c(1, 3, 3), `3` = c(2, 2, 3))
Desirable outcome:
id time y
1 1 1
1 1 1 #this is the duplicated row
1 2 1
2 1 2
2 3 2
2 3 2 #this is the duplicated row
3 2 3
3 2 3 #this is the duplicated row
3 3 3
The code do.call(rbind, Map(function(p, q) subset(xx, id == q & time %in% p), lst, names(lst))) did not work for me because subset removes duplicated rows

The issue is that %in% only tests membership: it returns one TRUE or FALSE per row of xx, so a time point listed twice in p still selects its row only once. To get one row per listed value, we also need to iterate (lapply) over p internally. I'll wrap your inner subset in another do.call(rbind, lapply(p, ...)) to get what you expect:
do.call(rbind, Map(function(p, q) {
do.call(rbind, lapply(p, function(p0) subset(xx, id == q & time %in% p0)))
}, lst, names(lst)))
# id time y
# 1.1 1 1 1
# 1.2 1 1 1
# 1.21 1 2 1
# 2.4 2 1 2
# 2.6 2 3 2
# 2.61 2 3 2
# 3.8 3 2 3
# 3.81 3 2 3
# 3.9 3 3 3
(Row names are a distraction here ...)
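To see why %in% alone cannot repeat rows, here is a small sketch on plain vectors. The names time_col and wanted are made up for illustration and are not part of the question's data. %in% answers "is this left-hand element present on the right?" once per left-hand element, whereas match() returns one position per requested value.

```r
time_col <- c(1, 2, 3)   # hypothetical column of time points
wanted   <- c(1, 1, 2)   # requested time points, with a repeat

mask    <- time_col %in% wanted   # one logical per element of time_col
by_mask <- time_col[mask]         # each matching element appears once

pos    <- match(wanted, time_col) # one position per element of wanted
by_pos <- time_col[pos]           # the repeated request is preserved
```

Indexing with the logical mask can never return more elements than time_col has, which is why the repeated time points disappear.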
An alternative would be to convert your lst into a frame of id and time, and then left-join on it:
frm <- do.call(rbind, Map(function(x, nm) data.frame(id = nm, time = x), lst, names(lst)))
frm
# id time
# 1.1 1 1
# 1.2 1 1
# 1.3 1 2
# 2.1 2 1
# 2.2 2 3
# 2.3 2 3
# 3.1 3 2
# 3.2 3 2
# 3.3 3 3
merge(frm, xx, by = c("id", "time"), all.x = TRUE)
# id time y
# 1 1 1 1
# 2 1 1 1
# 3 1 2 1
# 4 2 1 2
# 5 2 3 2
# 6 2 3 2
# 7 3 2 3
# 8 3 2 3
# 9 3 3 3
Two good resources for learning about merges/joins:
How to join (merge) data frames (inner, outer, left, right)
What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?
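A further base R option, sketched here under the assumption that every (id, time) pair occurs at most once in xx, is to build a text key per row and look up each requested pair with match(). The helper names want_id, want_time, key_xx and key_want are illustrative only. A requested pair that does not exist in xx would come back as a row of NA values.

```r
xx  <- data.frame(id = rep(1:3, each = 3), time = 1:3, y = rep(1:3, each = 3))
lst <- list(`1` = c(1, 1, 2), `2` = c(1, 3, 3), `3` = c(2, 2, 3))

# One requested (id, time) pair per list entry, duplicates included
want_id   <- rep(names(lst), lengths(lst))
want_time <- unlist(lst, use.names = FALSE)

# Build a key for every row of xx and for every requested pair
key_xx   <- paste(xx$id, xx$time, sep = "_")
key_want <- paste(want_id, want_time, sep = "_")

# match() returns one row position per requested pair, so repeats are kept
res <- xx[match(key_want, key_xx), ]
rownames(res) <- NULL
```

Unlike merge(), this keeps the rows in the order in which the time points are listed in lst.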

Related

Combining elements of one column into two columns by group in R

I have a two-column data.frame: one column contains group labels and the other contains integer values ordered from smallest to largest. How can the data be expanded into pairs of combinations of the integer column?
I'm not sure of the best way to state this. I'm not interested in all possible combinations, only in all unique combinations starting from the lowest value.
In R, the combn function gives the desired output when groups are ignored, for example:
t(combn(seq(1:4),2))
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 3
[5,] 2 4
[6,] 3 4
Since the first value is 1, we get the unique combination (1,2) and not the additional combination (2,1), which I don't need. How would one apply a similar method by group?
for example given a data.frame
test <- data.frame(Group = rep(c("A","B"),each=4),
Val = c(1,3,6,8,2,4,5,7))
test
Group Val
1 A 1
2 A 3
3 A 6
4 A 8
5 B 2
6 B 4
7 B 5
8 B 7
I was able to come up with this solution that gives the desired output:
library(dplyr) # provides filter()
test <- data.frame(Group = rep(c("A","B"),each=4),
Val = c(1,3,6,8,2,4,5,7))
j=1
for(i in unique(test$Group)){
if(j==1){
one <- filter(test,i == Group)
two <- data.frame(t(combn(one$Val,2)))
test1 <- data.frame(Group = i,Val1=two$X1,Val2=two$X2)
j=j+1
}else{
one <- filter(test,i == Group)
two <- data.frame(t(combn(one$Val,2)))
test2 <- data.frame(Group = i,Val1=two$X1,Val2=two$X2)
test1 <- rbind(test1,test2)
}
}
test1
Group Val1 Val2
1 A 1 3
2 A 1 6
3 A 1 8
4 A 3 6
5 A 3 8
6 A 6 8
7 B 2 4
8 B 2 5
9 B 2 7
10 B 4 5
11 B 4 7
12 B 5 7
However, this is not elegant and is really slow as the number of groups and length of each group become large. It seems like there should be a more elegant and efficient solution but so far I have not come across anything on SO.
I would appreciate any ideas!
Here is a data.table approach:
library( data.table )
#make test a data.table
setDT(test)
#split by group
L <- split( test, by = "Group")
#get unique combinations of 2 Vals
L2 <- lapply( L, function(x) {
as.data.table( t( combn( x$Val, m = 2, simplify = TRUE ) ) )
})
#merge them back together
data.table::rbindlist( L2, idcol = "Group" )
# Group V1 V2
# 1: A 1 3
# 2: A 1 6
# 3: A 1 8
# 4: A 3 6
# 5: A 3 8
# 6: A 6 8
# 7: B 2 4
# 8: B 2 5
# 9: B 2 7
#10: B 4 5
#11: B 4 7
#12: B 5 7
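You can set simplify = FALSE in combn() and then use unnest_wider() from tidyr.
library(dplyr)
library(tidyr)
test %>%
group_by(Group) %>%
# dplyr >= 1.1.0 recommends reframe() instead of summarise() when a group returns several rows
summarise(Val = combn(Val, 2, simplify = FALSE)) %>%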
unnest_wider(Val, names_sep = "_")
# Group Val_1 Val_2
# <chr> <dbl> <dbl>
# 1 A 1 3
# 2 A 1 6
# 3 A 1 8
# 4 A 3 6
# 5 A 3 8
# 6 A 6 8
# 7 B 2 4
# 8 B 2 5
# 9 B 2 7
# 10 B 4 5
# 11 B 4 7
# 12 B 5 7
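Another option uses purrr together with gtools::combinations():
library(tidyverse)
# split() gives one vector of values per group; n is taken from each group's length
df2 <- split(test$Val, test$Group) %>%
map(~gtools::combinations(n = length(.x), r = 2, v = .x)) %>%
map(~as_tibble(.x, .name_repair = "unique")) %>%
bind_rows(.id = "Group")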
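For comparison, the same split/combine idea can be sketched in base R without any packages. The function name pairs_by_group is hypothetical, and the sketch assumes every group has at least two values (combn() treats a single positive number as a sequence, so a one-element group would not behave as intended).

```r
test <- data.frame(Group = rep(c("A", "B"), each = 4),
                   Val = c(1, 3, 6, 8, 2, 4, 5, 7))

# Hypothetical helper: all pairs of `val` within each level of `group`
pairs_by_group <- function(val, group) {
  # one two-column matrix of pairs per group
  parts <- lapply(split(val, group), function(v) t(combn(v, 2)))
  out <- data.frame(
    # repeat each group label once per pair in that group
    Group = rep(names(parts), vapply(parts, nrow, integer(1))),
    do.call(rbind, parts)
  )
  names(out) <- c("Group", "Val1", "Val2")
  out
}

res <- pairs_by_group(test$Val, test$Group)
```

All pairs are bound together in a single rbind() call at the end, which avoids growing a data frame inside a loop.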

Aggregate data frame/table by all rows, add counts, and do it fast [duplicate]

I have a data frame like the following example
a = c(1, 1, 1, 2, 2, 3, 4, 4)
b = c(3.5, 3.5, 2.5, 2, 2, 1, 2.2, 7)
df <-data.frame(a,b)
I can remove duplicated rows from an R data frame with the following code, but how can I find how many times each duplicated row is repeated? I need the result as a vector.
unique(df)
or
df[!duplicated(df), ]
Here is a solution using the ddply() function from the plyr package:
library(plyr)
ddply(df,.(a,b),nrow)
a b V1
1 1 2.5 1
2 1 3.5 2
3 2 2.0 2
4 3 1.0 1
5 4 2.2 1
6 4 7.0 1
You could always kill two birds with one stone:
aggregate(list(numdup=rep(1,nrow(df))), df, length)
# or even:
aggregate(numdup ~., data=transform(df,numdup=1), length)
# or even:
aggregate(cbind(df[0],numdup=1), df, length)
a b numdup
1 3 1.0 1
2 2 2.0 2
3 4 2.2 1
4 1 2.5 1
5 1 3.5 2
6 4 7.0 1
Here are two approaches.
# an example data set that is not sorted
DF <-data.frame(replicate(sequence(1:3),n=2))
# example using similar idea to duplicated.data.frame
count.duplicates <- function(DF){
x <- do.call('paste', c(DF, sep = '\r'))
ox <- order(x)
rl <- rle(x[ox])
cbind(DF[ox[cumsum(rl$lengths)],,drop=FALSE],count = rl$lengths)
}
count.duplicates(DF)
# X1 X2 count
# 4 1 1 3
# 5 2 2 2
# 6 3 3 1
# a far simpler `data.table` approach
library(data.table)
count.dups <- function(DF){
DT <- data.table(DF)
DT[,.N, by = names(DT)]
}
count.dups(DF)
# X1 X2 N
# 1: 1 1 3
# 2: 2 2 2
# 3: 3 3 1
Using dplyr:
library(dplyr)
summarise(group_by(df,a,b),length(b))
or
group_size(group_by(df,a,b))
#[1] 1 2 2 1 1 1
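Since the question asks for the result as a vector, here is a base R sketch that needs no packages. It assumes the df from the question; the names key, per_row and per_unique are illustrative only.

```r
a  <- c(1, 1, 1, 2, 2, 3, 4, 4)
b  <- c(3.5, 3.5, 2.5, 2, 2, 1, 2.2, 7)
df <- data.frame(a, b)

# interaction() builds one factor level per observed (a, b) combination
key <- interaction(df, drop = TRUE)

# ave() returns, for every row, the size of the group that row belongs to
per_row <- ave(seq_len(nrow(df)), key, FUN = length)

# keep one count per distinct row, in the same order as unique(df)
per_unique <- per_row[!duplicated(df)]
```

Here per_unique lines up row for row with unique(df), so the two can be combined with cbind() if a table is wanted after all.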

How can I subset a dataframe according to group membership?

I want to write a function so that a (potentially large) data frame can be subset according to group membership, where a 'group' is a unique combination of a set of column values.
For example, I would like to subset the following data frame according to unique combination of the first two columns (Loc1 and Loc2).
Loc1 <- c("A","A","A","A","B","B","B")
Loc2 <- c("a","a","b","b","a","a","b")
Dat1 <- c(1,1,1,1,1,1,1)
Dat2 <- c(1,2,1,2,1,2,2)
Dat3 <- c(2,2,4,4,6,5,3)
DF=data.frame(Loc1,Loc2,Dat1,Dat2,Dat3)
Loc1 Loc2 Dat1 Dat2 Dat3
1 A a 1 1 2
2 A a 1 2 2
3 A b 1 1 4
4 A b 1 2 4
5 B a 1 1 6
6 B a 1 2 5
7 B b 1 2 3
I want to return (i) the number of groups (i.e. 4), (ii) the number of rows in each group (i.e. c(2,2,2,1)), and (iii) a relabelled data frame so that I can further analyse the data according to group membership (e.g. for ANOVA and MANOVA), i.e.:
Group<-as.factor(c(1,1,2,2,3,3,4))
Data <- cbind(Group,DF[,-1:-2])
Group Dat1 Dat2 Dat3
1 1 1 1 2
2 1 1 2 2
3 2 1 1 4
4 2 1 2 4
5 3 1 1 6
6 3 1 2 5
7 4 1 2 3
So far all I have managed is to get the number of groups, and I'm suspicious that there's a better way to do even this:
nrow(unique(DF[,1:2]))
I was hoping to avoid for-loops as I am concerned about the function being slow.
I have tried converting to a data matrix so that I could concatenate the row values but I couldn't get that to work either.
Many thanks
You could try the following.
Create a Group column from the unique combinations of Loc1 and Loc2:
indx <- paste(DF[,1], DF[,2])
DF$Group <- as.numeric(factor(indx, unique(indx))) #query No (iii)
DF1 <- DF[-(1:2)][,c(4,1:3)]
# Group Dat1 Dat2 Dat3
#1 1 1 1 2
#2 1 1 2 2
#3 2 1 1 4
#4 2 1 2 4
#5 3 1 1 6
#6 3 1 2 5
#7 4 1 2 3
table(DF$Group) #(No. ii)
#1 2 3 4
#2 2 2 1
length(unique(DF$Group)) #(i)
#[1] 4
Then, if you need to subset the dataset by group, you could split it on Group to create a list of 4 data frames:
split(DF1, DF1$Group)
Update
If you have multiple columns, you could still try:
ColstoGroup <- 1:2
indx <- apply(DF[,ColstoGroup], 1, paste, collapse="")
as.numeric(factor(indx, unique(indx)))
#[1] 1 1 2 2 3 3 4
You could create a function:
fun1 <- function(dat, GroupCols){
FactGroup <- dat[, GroupCols]
if(length(GroupCols)==1){
dat$Group <- as.numeric(factor(FactGroup, levels=unique(FactGroup)))
}
else {
indx <- apply(FactGroup, 1, paste, collapse="")
dat$Group <- as.numeric(factor(indx, unique(indx)))
}
dat
}
fun1(DF, "Loc1")
fun1(DF, c("Loc1", "Loc2"))
This gets all three of your queries.
Begin with a table of the first two columns and then work with that data. Note that (iii) assumes DF is already sorted by Loc1 and then Loc2.
> (tab <- table(DF$Loc1, DF$Loc2))
#
# a b
# A 2 2
# B 2 1
#
> (ct <- c(t(tab))) ## (ii), transposed so the counts run in row order: A a, A b, B a, B b
# [1] 2 2 2 1
> sum(tab > 0) ## (i), the number of non-empty cells
# [1] 4
> cbind(Group = rep(seq_along(ct), ct), DF[-c(1,2)]) ## (iii)
# Group Dat1 Dat2 Dat3
# 1 1 1 1 2
# 2 1 1 2 2
# 3 2 1 1 4
# 4 2 1 2 4
# 5 3 1 1 6
# 6 3 1 2 5
# 7 4 1 2 3
Borrowing a bit from this answer and using some dplyr idioms:
library(dplyr)
Loc1 <- c("A","A","A","A","B","B","B")
Loc2 <- c("a","a","b","b","a","a","b")
Dat1 <- c(1,1,1,1,1,1,1)
Dat2 <- c(1,2,1,2,1,2,2)
Dat3 <- c(2,2,4,4,6,5,3)
DF <- data.frame(Loc1, Loc2, Dat1, Dat2, Dat3)
emitID <- local({
idCounter <- -1L
function(){
idCounter <<- idCounter + 1L
}
})
DF %>% group_by(Loc1, Loc2) %>% mutate(Group=emitID())
## Loc1 Loc2 Dat1 Dat2 Dat3 Group
## 1 A a 1 1 2 0
## 2 A a 1 2 2 0
## 3 A b 1 1 4 1
## 4 A b 1 2 4 1
## 5 B a 1 1 6 2
## 6 B a 1 2 5 2
## 7 B b 1 2 3 3
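A compact base R sketch that covers all three queries without a loop and without depending on row order is shown below. The function name group_id is hypothetical, and the "\r" separator is simply a character assumed not to occur in the data, so that pasted keys from different rows cannot collide.

```r
DF <- data.frame(
  Loc1 = c("A", "A", "A", "A", "B", "B", "B"),
  Loc2 = c("a", "a", "b", "b", "a", "a", "b"),
  Dat1 = c(1, 1, 1, 1, 1, 1, 1),
  Dat2 = c(1, 2, 1, 2, 1, 2, 2),
  Dat3 = c(2, 2, 4, 4, 6, 5, 3)
)

# Hypothetical helper: integer group id per row, numbered by first appearance
group_id <- function(dat, cols) {
  # paste the grouping columns into one key per row
  key <- do.call(paste, c(unname(as.list(dat[cols])), sep = "\r"))
  match(key, unique(key))
}

g        <- group_id(DF, c("Loc1", "Loc2"))
n_groups <- max(g)       # (i)   number of groups
sizes    <- tabulate(g)  # (ii)  rows per group
Data     <- cbind(Group = factor(g), DF[-(1:2)])  # (iii) relabelled data
```

Because match() numbers the groups in order of first appearance, the ids do not depend on DF being sorted by the grouping columns.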
