R: Match a dataframe with 3 others, and create a column

I have a big data frame (df) with a variable id, and 3 other data frames (df1, df2, df3) that contain some values of this id. For example, the big data frame has ids 1:100, while df1 might have 1, 2, 4, 11, etc.
What I need to do is add a column to the big data frame that says which of the smaller data frames each id came from.
df$new[df$id %in% df1$id] <- 1
df$new[df$id %in% df2$id] <- 2
df$new[df$id %in% df3$id] <- 3
df$new<- factor(df$new, labels = c('a', 'b', 'c'))
This is my solution, but I don't really like it. Any other ideas?

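We can use a nested ifelse. Note that the first matching data frame wins here, and ids found in none of them fall back to id (coerced to character):
with(df, ifelse(id %in% df1$id, 'a',
ifelse(id %in% df2$id, 'b',
ifelse(id %in% df3$id, 'c', id))))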
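As an alternative to nesting, the three frames can be stacked into a single lookup table and matched once. This is only a sketch: the example data and the names sources and lookup are made up for illustration, and it assumes each smaller frame has an id column.

```r
# Hypothetical example data: ids 1..10 and three smaller frames
df  <- data.frame(id = 1:10)
df1 <- data.frame(id = c(1, 2, 4))
df2 <- data.frame(id = c(3, 5))
df3 <- data.frame(id = c(7, 9))

# One lookup table: every id paired with the label of its source frame
sources <- list(a = df1, b = df2, c = df3)
lookup <- data.frame(
  id  = unlist(lapply(sources, `[[`, "id"), use.names = FALSE),
  src = rep(names(sources), times = vapply(sources, nrow, integer(1)))
)

# match() returns the first hit, so earlier frames win on duplicate ids;
# ids found in no frame become NA
df$new <- factor(lookup$src[match(df$id, lookup$id)], levels = names(sources))
```

Adding a fourth frame then only means adding one entry to sources, rather than another level of ifelse.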

Related

R: Assign values to column, from a column from another data frame, based on a condition (different sized data frames)

I need to create a new column in df1 named col_2, and assign it values from another data frame (df2). When the value in col_1 from df1 equals a value in col_a from df2, I want the corresponding value of col_b of df2 assigned to col_2.
The data frames are different sizes.
The data:
col_1 <- c(23,31,98,76,47,65,23,76,3,47)
col_2 <- NA
df1 <- data.frame(col_1, col_2)
col_a <- c(1:100)
col_b <- c(runif(100,0,1))
df2 <- data.frame(col_a, col_b)
I tried the following but none seemed to work... I keep running into the same problem, that the data frames are not of the same length.
for (i in 1:10){
if(df1$col_1[i] == df2$col_a[]){
df1$col_2[i] == df2$col_b[]
}
}
df1$col_2 <- ifelse(df2$col_a %in% df1$col_1, df2$col_b, NA)
df1$col_1[df1$col_1 %in% df2$col_a] <- df2$col_b[df1$col_1 %in% df2$col_a]
We can use left_join
library(dplyr)
left_join(df1, df2, by = c('col_1' = 'col_a'))
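If adding a dependency is not desirable, the same lookup can probably be done in base R with match, which also fills col_2 directly instead of adding a separate col_b column. A minimal sketch, assuming col_a has no duplicate values; the set.seed call is not in the question and is there only to make the example reproducible.

```r
set.seed(1)  # assumption: added only for reproducibility
df1 <- data.frame(col_1 = c(23, 31, 98, 76, 47, 65, 23, 76, 3, 47))
df2 <- data.frame(col_a = 1:100, col_b = runif(100, 0, 1))

# match() gives, for each col_1 value, the row of df2 whose col_a equals it
# (NA when there is no such row), so the result has one value per row of df1
df1$col_2 <- df2$col_b[match(df1$col_1, df2$col_a)]
```

Because the index vector has the length of df1, the different sizes of the two data frames no longer matter.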

Find rows in a dataframe which contain all elements of a row of another dataframe

I have a dataframe which contains three columns, and a second which contains two columns.
df1 <- data.frame(X1 = c('A', 'A', 'A', 'A', 'A', 'A', 'B'),
X2 = c('B', 'B', 'B', 'C', 'C', 'D', 'C'),
X3 = c('C', 'D', 'E', 'D', 'E', 'E', 'D'))
df2 <- data.frame(X1 = c('A', 'A'),
X2 = c('B', 'D'))
Questions:
How do I find the rows in df1 which contain all the elements of a row of df2? i.e. rows 1:3 of df1 contain both A and B (first row of df2). I am looking to remove any rows of df1 which contain both elements of the rows of df2. So in the example, I would like to remove rows 1, 2, 3, 4 and 6 of df1 as these include A and B OR A and D.
Is there a quick way to count the number of rows for each row of df2 without looping? i.e. df2 row 1 would have a count of 3 and row 2 a count of 3.
Here is a base R option using outer + intersect
mat <- lengths(
outer(
asplit(df1, 1),
asplit(df2, 1),
Vectorize(intersect)
)
) >= ncol(df2)
and you will obtain
> subset(df1, !rowSums(mat))
X1 X2 X3
5 A C E
7 B C D
> within(df2, cnt <- colSums(mat))
X1 X2 cnt
1 A B 3
2 A D 3
asplit splits the data frames by rows
outer applies the function to all combinations of rows from df1 and df2
intersect gives the elements common to a row of df1 and a row of df2, so mat[i, j] is TRUE when row i of df1 contains all elements of row j of df2
subset keeps the rows of df1 that fully match no row of df2 (rowSums(mat) is 0)
Using apply:
df1[ !apply(df1, 1, function(i) any(apply(df2, 1, function(j) all(j %in% i)))), ]
# X1 X2 X3
# 5 A C E
# 7 B C D
Use similar loops to count the matches for each row of df2:
cbind(df2,
cnt = apply(df2, 1, function(i) sum(apply(df1, 1, function(j) all(i %in% j)))))
# X1 X2 cnt
# 1 A B 3
# 2 A D 3
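Both apply answers above run the same row-by-row comparison twice. One way to avoid that, sketched here with the hypothetical names m1, m2 and hit, is to build the match matrix once and derive the filtered rows and the counts from it.

```r
df1 <- data.frame(X1 = c('A', 'A', 'A', 'A', 'A', 'A', 'B'),
                  X2 = c('B', 'B', 'B', 'C', 'C', 'D', 'C'),
                  X3 = c('C', 'D', 'E', 'D', 'E', 'E', 'D'))
df2 <- data.frame(X1 = c('A', 'A'),
                  X2 = c('B', 'D'))

m1 <- as.matrix(df1)
m2 <- as.matrix(df2)

# hit[i, j] is TRUE when row i of df1 contains every element of row j of df2
hit <- vapply(seq_len(nrow(m2)), function(j)
  apply(m1, 1, function(r) all(m2[j, ] %in% r)),
  logical(nrow(m1)))

kept   <- df1[!rowSums(hit), ]           # rows of df1 matching no row of df2
counts <- cbind(df2, cnt = colSums(hit)) # matches per row of df2
```

This still loops internally, but each pair of rows is compared only once.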
You need to loop somehow. Here is one way to do it using dplyr and purrr:
1.
for(iRow in seq_len(nrow(df2))){
df1 <- df1 %>%
rowwise() %>%
filter(!all(as.character(df2[iRow,]) %in% c_across(everything())))
}
2.
df2 %>%
rowwise() %>%
mutate(n = sum(map_int(transpose(df1), ~all(c_across(everything()) %in% .x))))
Be sure to run the second part before the first, because the first part removes rows from df1. Alternatively, you can first detect which rows of df1 match each row of df2; that way you can count them and remove them afterwards.
df2 <- df2 %>%
rowwise() %>%
mutate(
indices = list(which(map_lgl(transpose(df1), ~all(c_across(everything()) %in% .x))))
) %>%
ungroup() %>%
mutate(n = map_int(indices, length))
df1 <- df2[["indices"]] %>%
unlist() %>%
unique() %>%
"*"(-1) %>%
df1[.,]
df2 <- df2 %>% select(-indices)

Filter Data Frame by Matching Multiple String in Multiple Columns

I have been unsuccessfully trying to filter my data frame with dplyr and grep, using a list of strings across multiple columns of my data frame. I would assume this is a simple task, but either nobody has asked my specific question or it's not as easy as I originally thought.
For the following data frame...
foo <- data.frame(var.1 = c('a', 'b', 'c'),
var.2 = c('b', 'd', 'e'),
var.3 = c('c', 'f', 'g'),
var.4 = c('z', 'a', 'b'))
... I would like to filter row-wise to find rows that contain all three values a, b, and c. The result I am after would return only row 1, as it contains a, b, and c. Rows 2 and 3 should not be returned: each contains two of the three values, but not all three in the same row.
I'm running into issues where grep only allows specifying vectors or one column at a time, when I really just care about finding strings across many columns in the same row.
I've also used dplyr to filter using %in%, but it just returns when any of the variables are present:
foo %>%
filter(var.1 %in% c('a', 'b', 'c') |
var.2 %in% c('a', 'b', 'c') |
var.3 %in% c('a', 'b', 'c'))
Thanks for any and all help and please, let me know if you need any clarification!
Here's an approach in base R where we check if the elements of foo are equal to "a", "b", or "c" successively, add the Booleans and check if the sum of those Booleans for each row is greater than or equal to 3
Reduce("+", lapply(c("a", "b", "c"), function(x) rowSums(foo == x) > 0)) >=3
#[1] TRUE FALSE FALSE
Timings
foo = matrix(sample(letters[1:26], 1e7, replace = TRUE), ncol = 5)
system.time(Reduce("+", lapply(letters[1:20], function(x) rowSums(foo == x) > 0)) >=20)
# user system elapsed
# 3.26 0.48 3.79
system.time(apply(foo, 1, function(x) all(letters[1:20] %in% x)))
# user system elapsed
# 18.86 0.00 19.19
identical(Reduce("+", lapply(letters[1:20], function(x) rowSums(foo == x) > 0)) >=20,
apply(foo, 1, function(x) all(letters[1:20] %in% x)))
#[1] TRUE
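To go from the logical vector to the filtered data frame, the result can be used as a row index. The sketch below is a small variation on the idea above, not the answer's exact code: it combines the per-value checks with a logical AND instead of summing them, and the names targets and has_all are made up for illustration.

```r
foo <- data.frame(var.1 = c('a', 'b', 'c'),
                  var.2 = c('b', 'd', 'e'),
                  var.3 = c('c', 'f', 'g'),
                  var.4 = c('z', 'a', 'b'))
targets <- c('a', 'b', 'c')

# For each target value: does it occur anywhere in the row?
# Reduce with `&` then requires every target to be present.
has_all <- Reduce(`&`, lapply(targets, function(x) rowSums(foo == x) > 0))

# Keep only the rows that contain all target values
foo[has_all, ]
```

Since each check asks only whether a value is present, a row that repeats one value is not counted twice.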
Your problem arises from trying to apply "tidyverse" solutions to data that isn't tidy. Here's the tidy solution, which uses melt to make your data tidy. See how much tidier this solution is?
> library(reshape2)
> rows = foo %>%
mutate(id=1:nrow(foo)) %>%
melt(id="id") %>%
filter(value=="a" | value=="b" | value=="c") %>%
group_by(id) %>%
summarize(N=n()) %>%
filter(N==3) %>%
select(id) %>%
unlist
Warning message:
attributes are not identical across measure variables; they will be dropped
That gives you a vector of matching row indexes, which you can then subset your original data frame with:
> foo[rows,]
var.1 var.2 var.3 var.4
1 a b c z

Filter Dataframe by second dataframe [duplicate]

This question already has answers here:
Subsetting a data frame to the rows not appearing in another data frame
I have two dataframes.
selectedcustomersa is a dataframe with information about 50 customers. The first column is the name (Group.1).
selectedcustomersb is another dataframe (same structure) with information about 2000 customers, and the customers from selectedcustomersa are included there.
I want selectedcustomersb without the customers from selectedcustomersa.
I tried:
newselectedcustomersb<-filter(selectedcustomersb, Group.1!=selectedcustomersa$Group.1)
One way to do this is to use anti_join from dplyr, as follows. With no by argument it joins on all columns the two data frames have in common (here x), so it also works when matching on multiple columns.
library(dplyr)
df1 <- data.frame(x = c('a', 'b', 'c', 'd'), y = 1:4)
df2 <- data.frame(x = c('c', 'd', 'e', 'f'), z = 1:4)
df <- anti_join(df2, df1)
df
x z
1 e 3
2 f 4
Try:
newselectedcustomersb <- filter(selectedcustomersb, !(Group.1 %in% selectedcustomersa$Group.1))
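The original attempt with != does not do this because != compares the two columns element by element, recycling the shorter one, rather than testing membership. A small base R sketch with made-up stand-in data shows the difference; the customer names and the spend column are hypothetical.

```r
# Hypothetical stand-ins for the two customer tables
selectedcustomersa <- data.frame(Group.1 = c("ann", "bob"),
                                 stringsAsFactors = FALSE)
selectedcustomersb <- data.frame(Group.1 = c("cat", "ann", "bob", "dan"),
                                 spend = c(10, 20, 30, 40),
                                 stringsAsFactors = FALSE)

# Element-wise comparison with recycling: cat/ann, ann/bob, bob/ann, dan/bob.
# Every pair differs, so nothing would be filtered out.
recycled <- selectedcustomersb$Group.1 != selectedcustomersa$Group.1

# Membership test: is each customer of b listed anywhere in a?
keep <- !(selectedcustomersb$Group.1 %in% selectedcustomersa$Group.1)
newselectedcustomersb <- selectedcustomersb[keep, ]
```

Here recycled is TRUE for all four rows, while keep drops ann and bob as intended.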

Combine vector and data.frame matching column values and vector values

I have
vetor <- c(1,2,3)
data <- data.frame(id=c('a', 'b', 'a', 'c', 'a'))
I need a data.frame output that matches each id to a specific vector value, resulting in:
id vector1
1 a 1
2 b 2
3 a 1
4 c 3
5 a 1
Here are two approaches I often use for similar situations:
vetor <- c(1,2,3)
key <- data.frame(vetor=vetor, mat=c('a', 'b', 'c'))
data <- data.frame(id=c('a', 'b', 'a', 'c', 'a'))
data$vector1 <- key[match(data$id, key$mat), 'vetor']
#or with merge
merge(data, key, by.x = "id", by.y = "mat")
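A third variant of the same idea, sketched here under the assumption that id is a character column, is to use a named vector as the key and index it by name. The name key is reused from above only for illustration.

```r
data <- data.frame(id = c('a', 'b', 'a', 'c', 'a'), stringsAsFactors = FALSE)

# Named vector: the names are the ids, the values are what to assign
key <- c(a = 1, b = 2, c = 3)

# Indexing by name looks up each id; unname() drops the carried-over names
data$vector1 <- unname(key[data$id])
```

Unlike merge, this keeps the rows of data in their original order.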
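So you want one unique integer for each distinct value in the id column?
This is what a factor is in R. In R versions before 4.0.0, data.frame() converted strings to factors by default, so your id column already was one; from R 4.0.0 on it is a character column unless you convert it.
To convert to a numeric representation, make it a factor and use as.numeric:
data <- data.frame(id=c('a', 'b', 'a', 'c', 'a'))
data$vector1 <- as.numeric(factor(data$id))
This works because a factor stores each value as an integer code for its level; calling as.numeric directly on a character column would give NA.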
Here's an answer I found that follows the "mathematical.coffee" tip:
vector1 <- c('b','a','a','c','a','a') # 3 elements to be labeled: a, b and c
labels <- factor(vector1, labels= c('char a', 'char b', 'char c') )
data.frame(vector1, labels)
The one thing to watch is that factor(vector1, ...) sorts the levels of vector1 (here a, b, c), so the labels must be given in that sorted order.
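To avoid relying on the sort order, the levels can be spelled out next to the labels. A minimal sketch using the same example values; the name codes is made up for illustration.

```r
vector1 <- c('b', 'a', 'a', 'c', 'a', 'a')

# levels and labels are paired by position, so each value gets its own label
# no matter how the values would sort
labels <- factor(vector1,
                 levels = c('a', 'b', 'c'),
                 labels = c('char a', 'char b', 'char c'))

# The underlying integer codes follow the order given in levels
codes <- as.integer(labels)
```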

Resources