R - subset rows by rows in another data frame

Let's say I have a data frame df containing only factors/categorical variables. I have another data frame conditions where each row contains a different combination of the different factor levels of some subset of variables in df (made using expand.grid and levels etc.). I'm trying to figure out a way of subsetting df based on each row of conditions. So for example, if the column names of conditions are c("A", "B", "C") and the first row is c('a1', 'b1', 'c1'), then I want df[df$A == 'a1' & df$B == 'b1' & df$C == 'c1',], and so on.

I'd think this is a great time to use merge (or dplyr::*_join or ...):
df1 <- expand.grid(A = letters[1:4], B = LETTERS[1:4], stringsAsFactors = FALSE)
df1$rn <- seq_len(nrow(df1))
# 'df2' contains the conditions we want to filter (retain)
df2 <- data.frame(
a1 = c('a', 'a', 'c'),
b1 = c('B', 'C', 'C'),
stringsAsFactors = FALSE
)
df1
# A B rn
# 1 a A 1
# 2 b A 2
# 3 c A 3
# 4 d A 4
# 5 a B 5
# 6 b B 6
# 7 c B 7
# 8 d B 8
# 9 a C 9
# 10 b C 10
# 11 c C 11
# 12 d C 12
# 13 a D 13
# 14 b D 14
# 15 c D 15
# 16 d D 16
df2
# a1 b1
# 1 a B
# 2 a C
# 3 c C
Using df2 to define which combinations we need to keep,
merge(df1, df2, by.x=c('A','B'), by.y=c('a1','b1'))
# A B rn
# 1 a B 5
# 2 a C 9
# 3 c C 11
# or
dplyr::inner_join(df1, df2, by=c(A='a1', B='b1'))
(I defined df2 with different column names just to show how it works, but in reality since its purpose is "solely" to be declarative on which combinations to filter, it would make sense to me to have the same column names, in which case the by= argument just gets simpler.)
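To get one subset per row of the conditions data frame, as the question asks, the same merge idea can be applied row by row. The following is a minimal sketch, assuming both data frames share the column names A, B and C; the sample values in df and conditions are made up for illustration.

```r
# Hypothetical data: 'df' holds the observations, 'conditions' the combinations
df <- data.frame(
  A = c("a1", "a2", "a1", "a1"),
  B = c("b1", "b1", "b2", "b1"),
  C = c("c1", "c1", "c1", "c1"),
  stringsAsFactors = FALSE
)
conditions <- data.frame(
  A = c("a1", "a2"),
  B = c("b1", "b1"),
  C = c("c1", "c1"),
  stringsAsFactors = FALSE
)

# One element per row of 'conditions'; each element is the matching subset of 'df'
subsets <- lapply(seq_len(nrow(conditions)), function(i) {
  merge(df, conditions[i, , drop = FALSE], by = names(conditions))
})
```

Here subsets[[i]] holds the rows of df that match the i-th combination, so the list can be passed on to lapply or sapply for per-combination summaries.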

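One option is to build the condition with Reduce, comparing each column of df with the first row of conditions
df[Reduce(`&`, Map(`==`, df[c("A", "B", "C")], conditions[1, c("A", "B", "C")])),]
Or another option is rowSums, keeping the rows where all three columns match
df[rowSums(df[c("A", "B", "C")] ==
conditions[1, c("A", "B", "C")][col(df[c("A", "B", "C")])]) == 3,]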

Related

Reshaping in R: grouping vectors after merge

I have a pretty basic problem that I can't seem to solve:
Take 3 vectors:
V1 V2 V3
1 4 7
2 5 8
3 6 9
I want to merge the 3 columns into a single column and assign a group.
Desired result:
Group Value
A 1
A 2
A 3
B 4
B 5
B 6
C 7
C 8
C 9
My code:
V1 <- c(1,2,3)
V2 <- c(4,5,6)
V3 <- c(7,8,9)
data <- data.frame(group = c("A", "B", "C"),
values = c(V1, V2, V3))
Actual result:
Group Value
A 1
B 2
C 3
A 4
B 5
C 6
A 7
B 8
C 9
How can I reshape the data to get the desired result?
Thank you!
We can stack on a named list of vectors
stack(setNames(mget(paste0("V", 1:3)), LETTERS[1:3]))[2:1]
-output
ind values
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 C 7
8 C 8
9 C 9
Regarding the issue in the OP's data creation: data.frame() recycles a column that is shorter than the others, so the three group labels repeat as A, B, C, A, B, C, ... To give each vector's values a single group label, we may need rep
data <- data.frame(group = rep(c("A", "B", "C"), c(length(V1),
length(V2), length(V3))),
values = c(V1, V2, V3))
-output
> data
group values
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 C 7
8 C 8
9 C 9
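The difference between recycling and rep can be seen side by side in a small sketch. It assumes the three vectors have equal length, so rep(..., each = 3) is enough; the names recycled and grouped are made up for illustration.

```r
V1 <- c(1, 2, 3)
V2 <- c(4, 5, 6)
V3 <- c(7, 8, 9)

# Recycling: the 3 labels are repeated as a block to fill 9 rows (A, B, C, A, B, C, ...)
recycled <- data.frame(group = c("A", "B", "C"),
                       values = c(V1, V2, V3))

# rep() with 'each': every label is repeated 3 times in a row (A, A, A, B, B, B, ...)
grouped <- data.frame(group = rep(c("A", "B", "C"), each = 3),
                      values = c(V1, V2, V3))
```

When the vectors differ in length, the version with explicit lengths shown above is the safer choice.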
Here is another option using pivot_longer(), converting the column names V1, V2, V3 into a factor labeled A, B, C
V1 <- c(1,2,3)
V2 <- c(4,5,6)
V3 <- c(7,8,9)
df <- data.frame(V1, V2, V3)
library(dplyr)
library(tidyr)
# basic answer:
answer <- pivot_longer(df, cols = starts_with("V"), names_to = "Group")
# or, relabeling the column names as groups A, B, C:
answer <- pivot_longer(df, cols = starts_with("V"), names_prefix = "V", names_to = "Group")
answer$Group <- factor(answer$Group, labels = c("A", "B", "C"))
answer %>% arrange(Group, value)

R - Merge and Replace Column If ID Found on Another Data Frame

I have two data frames as below and am trying to improve my code so that the Letters column in df1 is replaced with the Letters2 column in df2 wherever the IDs match.
df1 <- data.frame(ID = c(1,3,2,4,5), Letters = LETTERS[1:5], stringsAsFactors = F)
df2 <- data.frame(ID = c(1,3,4), Letters2 = "F", stringsAsFactors = F)
desired:
ID Letters
1 F
2 C
3 F
4 F
5 E
It would be like doing the following, but in one line:
desired <- merge(df1, df2, by = "ID", all.x = TRUE)
desired$Letters <- ifelse(is.na(desired$Letters2), desired$Letters, desired$Letters2)
desired$Letters2 <- NULL
Try this:
library(tidyverse)
df1%>%
left_join(df2)%>%
mutate(Letters=coalesce(Letters2,Letters),Letters2=NULL)
Joining, by = "ID"
ID Letters
1 1 F
2 3 F
3 2 C
4 4 F
5 5 E
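We could use match() on 'ID' to find the rows of 'df1' to update, and change the values in 'Letters' to those of 'Letters2' (which are all 'F's). Indexing with df2$ID directly would only work if every ID equalled its row position, which is not the case here.
df1$Letters[match(df2$ID, df1$ID)] <- df2$Letters2
df1
# ID Letters
#1 1 F
#2 3 F
#3 2 C
#4 4 F
#5 5 E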
Or using data.table
library(data.table)
setDT(df1)[df2, Letters := Letters2, on = .(ID)]
df1
# ID Letters
#1: 1 F
#2: 3 F
#3: 2 C
#4: 4 F
#5: 5 E

Creating new variable in dataframe based on matching values from other dataframe

I have two dataframes, df1 and df2, of which two columns have partly matching values, however in completely different order; also, the values are unique in df2 but may be repeated in df1.
What I'd like to do is transfer into df1, not the matching values, but values associated with them in another variable in df2; for one value in df1, "G", I do not want the associated value to be transferred but rather just NA.
Consider df1 and df2:
df1 <- data.frame(
x = c("A", NA, "L", "G", "C", "F", NA, "J", "G", "K")
)
df2 <- data.frame(
a = LETTERS[1:10],
b = 1:10 # these are the values to be transferred into df1$z
)
df1$z <- ifelse(df1$x=="G", NA, ifelse(df1$x %in% df2$a, df2$b[df2$a %in% df1$x], NA))
The values to be transferred from df2 into df1 are in df2$b. I've tried the above ifelse() string but the resulting values in df1$z are only partly correct. Where's the mistake?
I think this does what you want:
df1$z <- df2$b[match(df1$x,df2$a)]
df1$z[df1$x=='G'] <- NA
Output:
> df1
x z
1 A 1
2 <NA> NA
3 L NA
4 G NA
5 C 3
6 F 6
7 <NA> NA
8 J 10
9 G NA
10 K NA
Hope this helps!
library(dplyr)
left_join(df1, df2, by = c("x" = "a")) %>% mutate(b = ifelse(x == "G", NA, b))
# x b
# 1 A 1
# 2 <NA> NA
# 3 L NA
# 4 G NA
# 5 C 3
# 6 F 6
# 7 <NA> NA
# 8 J 10
# 9 G NA
# 10 K NA
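The match() lookup used above can also be written as a named-vector lookup, which some readers find easier to follow. This is a minimal sketch, assuming the values in df2$a are unique (as stated in the question); the names lookup and z are made up for illustration.

```r
df1 <- data.frame(
  x = c("A", NA, "L", "G", "C", "F", NA, "J", "G", "K"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  a = LETTERS[1:10],
  b = 1:10,
  stringsAsFactors = FALSE
)

# Named vector: names are the keys (df2$a), values are what gets transferred (df2$b)
lookup <- setNames(df2$b, df2$a)

# Indexing by name returns NA for keys that are missing or NA
z <- unname(lookup[df1$x])

# Blank out the value for "G"; %in% treats NA in x as "not G"
z[df1$x %in% "G"] <- NA
df1$z <- z
```

The result should agree with the match() version: unmatched and missing keys become NA, and both "G" rows are set to NA explicitly.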

Group data frame by elements from a variable containing lists of elements

I would like to perform a non-trivial group_by, grouping and summarizing a data frame by the single elements of the lists found in one of its variables.
df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")
df
x y
1 1 A
2 2 A, B
3 3 C
4 4 B, D, C
5 5 E
Now grouping by y (and say counting no. of rows), which is a variable holding lists of elements, the required end results should be:
data.frame(group = c("A", "B", "C", "D", "E"), n = c(2,2,2,1,1))
group n
1 A 2
2 B 2
3 C 2
4 D 1
5 E 1
Because "A" appears in 2 rows, "B" in 2 rows, etc.
Note: the sum of n is not necessarily equal to number of rows in the data frame.
We can use a simple base R solution with table to calculate the frequencies after unlisting the list, and then create a data.frame based on that table object
tbl <- table(unlist(df$y))
data.frame(group = names(tbl), n = as.vector(tbl))
# group n
#1 A 2
#2 B 2
#3 C 2
#4 D 1
#5 E 1
Or another option with tidyverse
library(dplyr)
library(tidyr)
unnest(df, cols = y) %>%
group_by(group = y) %>%
summarise(n=n())
# group n
# <chr> <int>
#1 A 2
#2 B 2
#3 C 2
#4 D 1
#5 E 1
Or as @alexis_laz mentioned in the comments, an alternative is as.data.frame.table
as.data.frame(table(group = unlist(df$y)), responseName = "n")
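A simple base R solution (this looks like a duplicate question, though I was unable to locate the original). Note that grepl() does pattern matching, so this relies on no group name being a substring of another:
sapply(unique(unlist(df$y)), function(x) sum(grepl(x, df$y)))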
# A B C D E
# 2 2 2 1 1
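If an element can appear more than once within the same row, counting with table(unlist(...)) would count it once per occurrence rather than once per row. A minimal sketch of a per-row count, assuming that is the intended meaning of "appears in 2 rows"; the sample data with a repeated "A" is made up for illustration.

```r
# Hypothetical data where "A" is repeated inside the first row
df <- data.frame(x = 1:3)
df$y <- list(c("A", "A"), c("A", "B"), "B")

# Deduplicate within each row first, then count across rows
counts <- table(unlist(lapply(df$y, unique)))
result <- data.frame(group = names(counts), n = as.vector(counts))
```

Without the unique() step, "A" would be counted three times here instead of twice.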

Reshaping dataframe - two columns from correlation variables

I have the below df
var1 var2 Freq
1 a b 10
2 b a 5
3 b d 10
created from
help <- data.frame(var1 = c("a", "b", "b"), var2 = c("b", "a", "d"), Freq = c(10, 5, 10))
The a-b correlation is the same as b-a, so I am hoping to combine them into one row, to look like
var1 var2 Freq
1 a b 15
2 b d 10
any thoughts?
Here's one way:
setNames(aggregate(help$Freq, as.data.frame(t(apply(help[-3], 1, sort))), sum),
names(help))
# var1 var2 Freq
# 1 a b 15
# 2 b d 10
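In base R :
lev <- sort(unique(as.character(unlist(help[c("var1","var2")]))))
grp <- rowSums(sapply(help[c("var1","var2")], function(v) as.integer(factor(v, levels = lev))))
do.call(rbind,
by(help, grp,
function(x) data.frame(x[1,c("var1","var2")],
Freq = sum(x[,"Freq"]))))
var1 var2 Freq
3 a b 15
5 b d 10
I first create a grouping variable by summing the integer codes of the two columns (taken over a common set of levels), then sum the frequencies by group, and finally bind the results into a new data.frame. Note that sums of codes can coincide for different pairs when there are more levels, so this only suits small cases like this one.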
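An order-independent key can also be built with pmin() and pmax(), which avoids sorting each row with apply. This is a minimal sketch, assuming var1 and var2 are character columns; the names pairs, lo and hi are made up for illustration.

```r
pairs <- data.frame(
  var1 = c("a", "b", "b"),
  var2 = c("b", "a", "d"),
  Freq = c(10, 5, 10),
  stringsAsFactors = FALSE
)

# Put the alphabetically smaller label in 'lo' and the larger in 'hi',
# so that (a, b) and (b, a) end up with the same key
pairs$lo <- pmin(pairs$var1, pairs$var2)
pairs$hi <- pmax(pairs$var1, pairs$var2)

# Sum the frequencies per unordered pair
result <- aggregate(Freq ~ lo + hi, data = pairs, FUN = sum)
names(result) <- c("var1", "var2", "Freq")
```

Because the key keeps both labels rather than a single number, distinct pairs cannot be merged by accident.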
