How to extract paired rows from a data frame in R

I have a large data frame where most subjects have a pair of observations, like this:
set.seed(123)
df <- data.frame(ID = c(letters[1:4], letters[1:6]), x = sample(1:5, 10, TRUE))
ID x
1 a 2
2 b 4
3 c 3
4 d 5
5 a 5
6 b 1
7 c 3
8 d 5
9 e 3
10 f 3
I'd like to extract the rows whose IDs are paired, such as:
ID x
1 a 2
5 a 5
2 b 4
6 b 1
3 c 3
7 c 3
4 d 5
8 d 5
What's the best way to do that in R?

Alternatively, I tend to use duplicated:
> df[df$ID %in% df$ID[duplicated(df$ID)],]
ID x
1 a 2
2 b 4
3 c 3
4 d 5
5 a 5
6 b 1
7 c 3
8 d 5
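Note that the duplicated() subset keeps every ID that occurs more than once. If an ID could appear more than twice and you only want exact pairs, a possible tweak with table() (a sketch, not part of the original answer):
# Hypothetical variant: keep only IDs that occur exactly twice
paired_ids <- names(which(table(df$ID) == 2))
df[df$ID %in% paired_ids, ]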

You can use ave to count how many times each value of df$ID occurs and use that to subset your data.frame:
out <- df[as.numeric(ave(as.character(df$ID), df$ID, FUN = length)) == 2, ]
out
# ID x
# 1 a 2
# 2 b 4
# 3 c 3
# 4 d 5
# 5 a 5
# 6 b 1
# 7 c 3
# 8 d 5
Use order to sort the output if required.
out[order(out$ID), ]
You can also look into using data.table:
library(data.table)
dt <- data.table(df, key = "ID") # keying by ID also sorts the output
dt[, n := .N, by = "ID"][n == 2]
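For completeness, here is a dplyr sketch of the same idea (assuming dplyr is installed; not part of the original answer):
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(n() == 2) %>%  # keep IDs that occur exactly twice
  ungroup() %>%
  arrange(ID)           # optional: put each pair next to the other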

Related

Identify and remove duplicated list elements by their colnames in R

I have a large list object which contains correlation matrices with colnames and rownames, but some of the matrices in the list appear more than once, just with their rows and columns in a different order. How do I remove the duplicates without altering the matrix form or the list names?
> my.list
$list1
A B C
A 1 8 5
B 8 1 2
C 5 2 1
$list2
B A C
B 1 8 2
A 8 1 5
C 2 5 1
$list3
C A B
C 1 5 2
A 5 1 8
B 2 8 1
$list4
X Y
X 1 9
Y 9 1
$list5
Y X
Y 1 9
X 9 1
I would like to match the colnames/rownames of the matrices in the list and remove the ones that appear more than once. I'm expecting the output below:
$list1
A B C
A 1 8 5
B 8 1 2
C 5 2 1
$list4
X Y
X 1 9
Y 9 1
I have tried the code below, but it doesn't do the job:
my.list[!duplicated(my.list)]
You can order the columns and rows according to their names, and then use unique:
lapply(my.list, \(x) x[order(row.names(x)), order(colnames(x))]) |>
  unique()
# [[1]]
# A B C
# A 1 8 5
# B 8 1 2
# C 5 2 1
I used your first two elements as example:
list1 <- read.table(header = T, text = " A B C
A 1 8 5
B 8 1 2
C 5 2 1")
list2 <- read.table(header = T,text = " B A C
B 1 8 2
A 8 1 5
C 2 5 1")
my.list <- list(list1, list2)
# [[1]]
# A B C
# A 1 8 5
# B 8 1 2
# C 5 2 1
#
# [[2]]
# B A C
# B 1 8 2
# A 8 1 5
# C 2 5 1
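One caveat: unique() drops the element names (list1, list4) that the expected output keeps. A possible variant (a sketch) reorders throwaway copies only to detect duplicates and then indexes the original list, so names and the original row/column order are preserved:
# Reorder rows/columns in a temporary copy, then use duplicated() on that
# copy to subset the original list, keeping names and the original layout.
ord <- lapply(my.list, function(m) m[order(rownames(m)), order(colnames(m))])
my.list[!duplicated(ord)]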

How do I group_by if the column that I want to summarize with has all the same values

x l
1 1 a
2 3 b
3 2 c
4 3 b
5 2 c
6 4 d
7 5 f
8 2 c
9 1 a
10 1 a
11 3 b
12 4 d
The above is the input, and below is the desired output.
x l
1 1 a
2 3 b
3 2 c
4 4 d
5 5 f
I know that column l will have the same value within each group_by(x) group; l is a string column.
# Creation of dataset
x <- c(1,3,2,3,2,4,5,2,1,1,3,4)
l<- c("a","b","c","b","c","d","f","c","a","a","b","d")
df <- data.frame(x,l)
# Simply call unique function on your dataframe
dfu <- unique(df)
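If you specifically want the group_by/summarise idiom from the title, a dplyr sketch (assuming dplyr is available; since l is constant within each x, first() just returns that shared value, and the result comes back ordered by x):
library(dplyr)
dfu <- df %>%
  group_by(x) %>%
  summarise(l = first(l))  # l is identical within each x group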

How to drop factors that have fewer than n members

Is there a way to drop factors that have fewer than N rows, like N = 5, from a data table?
Data:
library(data.table)
DT = data.table(x = rep(c("a","b","c"), each = 6), y = c(1,3,6), v = 1:9,
                id = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove rows belonging to any id group that has fewer than 5 rows. The variable "id" is the grouping variable; in DT, the groups with fewer than 5 members are "1" and "4", so their rows need to be removed, leaving:
    x y v id
 1: a 3 5 2
 2: a 6 6 2
 3: b 1 7 2
 4: b 3 8 2
 5: b 6 9 2
 6: b 1 1 3
 7: b 3 2 3
 8: b 6 3 3
 9: c 1 4 3
10: c 3 5 3
11: c 6 6 3
Here's an approach:
Get the number of rows for each id, and flag the ids to keep:
nFactors <- tapply(DT$id, DT$id, length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data table answer
DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
# x y v id
# 1: a 3 5 2
# 2: a 6 6 2
# 3: b 1 7 2
# 4: b 3 8 2
# 5: b 6 9 2
# 6: b 1 1 3
# 7: b 3 2 3
# 8: b 6 3 3
# 9: c 1 4 3
# 10: c 3 5 3
# 11: c 6 6 3
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
# x y v id
# 5 a 3 5 2
# 6 a 6 6 2
# 7 b 1 7 2
# 8 b 3 8 2
# 9 b 6 9 2
# 10 b 1 1 3
# 11 b 3 2 3
# 12 b 6 3 3
# 13 c 1 4 3
# 14 c 3 5 3
# 15 c 6 6 3
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
  group_by(id) %>%
  filter(n() >= 5)
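Another data.table idiom for the same filter is worth knowing (a sketch; note it returns the grouping column first):
library(data.table)
DT[, if (.N >= 5) .SD, by = id]  # keep whole groups with at least 5 rows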

how to mutate a column with ID in group

I have a data.frame like:
a b c
1 a 1 1
2 a 1 2
3 a 2 3
4 b 1 4
5 b 2 5
6 b 3 6
Group by a; the flag starts at 1 in each group. If b equals the previous b, the flag stays the same, otherwise it increases by 1:
a b c flag
1 a 1 1 1 <- group a starts with 1
2 a 1 2 1 <- in group a, 1 (row 2) = 1 (row 1)
3 a 2 3 2 <- in group a, 2 (row 3) != 1 (row 2)
4 b 1 4 1 <- group b starts with 1
5 b 2 5 2 <- in group b, 2 (row 5) != 1 (row 4)
6 b 3 6 3 <- in group b, 3 (row 6) != 2 (row 5)
I'm currently using this:
x$flag <- 1  # initialize; row 1 of each group gets flag 1
for (i in 2:nrow(x)) {
  x[i, 'flag'] <- ifelse(x[i, 'a'] != x[i-1, 'a'], 1,
                         ifelse(x[i, 'b'] == x[i-1, 'b'], x[i-1, 'flag'], x[i-1, 'flag'] + 1))
}
but it is inefficient on a large dataset.
UPDATE
dense_rank in dplyr gives me the answer:
> x %>% group_by(a) %>% mutate(dense_rank(b))
Source: local data frame [10 x 4]
Groups: a
a b c dense_rank(b)
1 a x 1 1
2 a x 2 1
3 a y 3 2
4 b x 4 1
5 b y 5 2
6 b z 6 3
7 c x 7 1
8 c y 8 2
9 c z 9 3
10 c z 10 3
thanks.
I am not entirely sure what you are trying to do, but it seems you want to assign index numbers to the values of b within each group of a.
#I modified your example here.
a <- rep(c("a","b"), each =3)
b <- c(4,4,5,11,12,13)
c <- 1:6
foo <- data.frame(a,b,c, stringsAsFactors = F)
a b c
1 a 4 1
2 a 4 2
3 a 5 3
4 b 11 4
5 b 12 5
6 b 13 6
# Since you referred to dplyr, I will use it (rbindlist comes from data.table).
library(dplyr)
library(data.table)
cats <- list()
for (i in unique(foo$a)) {
  ana <- foo %>%
    filter(a == i) %>%
    arrange(b) %>%
    mutate(indexInb = as.integer(as.factor(b)))
  cats[[i]] <- ana
}
bob <- rbindlist(cats)
a b c indexInb
1: a 4 1 1
2: a 4 2 1
3: a 5 3 2
4: b 11 4 1
5: b 12 5 2
6: b 13 6 3
Here's a quick vectorized way to solve this without any for loops.
Base R solution using ave and transform
transform(x, flag = ave(b, a, FUN = function(x) cumsum(c(1, diff(x)))))
# a b c flag
# 1 a 1 1 1
# 2 a 1 2 1
# 3 a 2 3 2
# 4 b 1 4 1
# 5 b 2 5 2
# 6 b 3 6 3
Or a data.table solution (more efficient)
library(data.table)
setDT(x)[, flag := cumsum(c(1, diff(b))), by = a]
x
# a b c flag
# 1: a 1 1 1
# 2: a 1 2 1
# 3: a 2 3 2
# 4: b 1 4 1
# 5: b 2 5 2
# 6: b 3 6 3
Or a dplyr solution (because you tagged it)
library(dplyr)
x %>%
  group_by(a) %>%
  mutate(flag = cumsum(c(1, diff(b))))
# Source: local data frame [6 x 4]
# Groups: a
#
# a b c flag
# 1 a 1 1 1
# 2 a 1 2 1
# 3 a 2 3 2
# 4 b 1 4 1
# 5 b 2 5 2
# 6 b 3 6 3
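One caveat: the cumsum(diff(b)) trick only reproduces the stated rule when b is numeric and consecutive values differ by 0 or 1. In the updated example b is a character column ("x", "y", "z"), where diff() fails. A possible sketch using data.table::rleid(), which increments whenever b changes from the previous row within a group, covers that case (the OP's dense_rank() also works there because b is sorted within each group):
library(data.table)
setDT(x)[, flag := rleid(b), by = a]  # run-length id of b within each group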

How to merge two datasets by the different values in R?

I have two datasets and want to merge them. How do I add to the first dataset only the rows from the second that are not already in the first?
A row should only be added to the final dataset if its id does not already exist in the other dataset. An example:
x = data.frame(id = c("a","c","d","g"),
               value = c(1,3,4,7))
y = data.frame(id = c("b","c","d","e","f"),
               value = c(5,6,8,9,7))
The merged dataset should look like (the order is not important):
a 1
b 5
c 3
d 4
e 9
f 7
g 7
Using !, %in% and rbind to append the rows of y whose id is not already in x:
rbind(x, y[!y$id %in% x$id, ])
id value
1 a 1
2 c 3
3 d 4
4 g 7
11 b 5
41 e 9
5 f 7
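The same idea with dplyr (a sketch, not in the original answers; assuming the id columns are comparable, e.g. both character): anti_join() keeps the rows of y whose id has no match in x, and bind_rows() appends them to x.
library(dplyr)
bind_rows(x, anti_join(y, x, by = "id"))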
For your example to work, you first need to ensure that the id columns in each data.frame are directly comparable. Since they're factors, you need to ensure they have the same levels/labels, or you can simply convert them to character.
# convert factors to character
x$id <- as.character(x$id)
y$id <- as.character(y$id)
# merge
z <- merge(x,y,by="id",all=TRUE)
# keep first value, if it exists
z$value <- ifelse(is.na(z$value.x),z$value.y,z$value.x)
# keep desired columns
z <- z[,c("id","value")]
z
# id value
# 1 a 1
# 2 b 5
# 3 c 3
# 4 d 4
# 5 e 9
# 6 f 7
# 7 g 7
You already answered your own question, but just didn't realize it right away. :)
> merge(x,y,all=TRUE)
id value
1 a 1
2 c 3
3 c 6
4 d 4
5 d 8
6 g 7
7 b 5
8 e 9
9 f 7
EDIT
I'm a bit dense here and not sure what you're getting at, so I'll provide a shotgun approach. I merged the data.frames by id and copied values from x wherever the y value was missing. Take whichever column you need.
> x = data.frame(id = c("a","c","d","g"),
+ value = c(1,3,4,7))
> y = data.frame(id = c("b","c","d","e","f"),
+ value = c(5,6,8,9,7))
> xy <- merge(x, y, by = "id", all = TRUE)
> xy
id value.x value.y
1 a 1 NA
2 c 3 6
3 d 4 8
4 g 7 NA
5 b NA 5
6 e NA 9
7 f NA 7
> find.na <- is.na(xy[, "value.y"])
> xy$new.col <- xy[, "value.y"]
> xy[find.na, "new.col"] <- xy[find.na, "value.x"]
> xy
id value.x value.y new.col
1 a 1 NA 1
2 c 3 6 6
3 d 4 8 8
4 g 7 NA 7
5 b NA 5 5
6 e NA 9 9
7 f NA 7 7
> xy[order(as.character(xy$id)), ]
id value.x value.y new.col
1 a 1 NA 1
5 b NA 5 5
2 c 3 6 6
3 d 4 8 8
6 e NA 9 9
7 f NA 7 7
4 g 7 NA 7
