I have a data frame in which I'd like to make a new column that is TRUE or FALSE depending on whether the value in an existing column appears in a vector.
So far I've tried two different approaches, the first directly using the %in% operator to check whether elements of test occurred in column apples, the second by putting this in an ifelse statement.
test <- c("a","b","c")
df <- tibble(apples = c("a","d","e","f","z","g","c"))
#First attempt
df_match <- df %>%
mutate(
match = test %in% apples
)
#Second attempt
df_match <- df %>%
mutate(
match = ifelse(test %in% apples,TRUE,FALSE)
)
The desired output for column 'match' would be
> df$match
[1] TRUE FALSE FALSE FALSE FALSE FALSE TRUE
Using base R, with the operands swapped so that each element of apples is looked up in test:
transform(df, match = apples %in% test)
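The two attempts in the question and the answer above differ only in the order of the operands: x %in% table returns one value per element of x. A minimal base R sketch with the question's data (the name wrong_way is made up for illustration):

```r
test <- c("a", "b", "c")
df <- data.frame(apples = c("a", "d", "e", "f", "z", "g", "c"))

# One result per element of `test` (length 3), so it cannot fill a 7-row column
wrong_way <- test %in% df$apples

# One result per element of `apples` (length 7), which is what the new column needs
df$match <- df$apples %in% test
df$match
```

Because %in% already returns a logical vector, wrapping it in ifelse(..., TRUE, FALSE) adds nothing.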
Related
Let's say I have the following data.frame:
df = data.frame(groups =c("A","A","A","B","B","B","C","C","D","D","D","D","D"),
values =c(1,1,5,3,2,1,7,7,9,8,7,6,5))
and another data.frame:
df_t = data.frame(groups=c("A","B","C","D"),
threshold=c(2,5,3,9))
Now I would like to add another column to df indicating whether the values are below the grouping threshold (TRUE) or not (FALSE). In this case:
TRUE,TRUE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE
I am aware that this could easily be done with a for loop. However, I think there must be a more elegant way to achieve this. I would also prefer a base R solution over dplyr or data.table.
Consider joining the two datasets by 'groups' and then creating the column:
library(dplyr)
df %>%
left_join(df_t, by = "groups") %>%
mutate(flag = values < threshold, threshold = NULL)
Or in base R use match to get the corresponding index (or a merge)
df$flag <- with(df, values < df_t$threshold[match(groups, df_t$groups)])
df$flag
#[1] TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
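To see what the match() one-liner does, it can be broken into steps. This is only a sketch; the intermediate names idx and row_threshold are introduced here for illustration.

```r
df <- data.frame(
  groups = c("A", "A", "A", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D"),
  values = c(1, 1, 5, 3, 2, 1, 7, 7, 9, 8, 7, 6, 5)
)
df_t <- data.frame(groups = c("A", "B", "C", "D"), threshold = c(2, 5, 3, 9))

# Position of each row's group within the lookup table df_t
idx <- match(df$groups, df_t$groups)

# Threshold that applies to each row of df, in df's own row order
row_threshold <- df_t$threshold[idx]

df$flag <- df$values < row_threshold
df$flag
```

Unlike merge(), this keeps the rows of df in their original order, and a group missing from df_t gives NA rather than dropping the row.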
I have a dataframe containing a column of strings. I want to check whether any of the elements in each string match any of the elements in one or more predefined vectors, and then return a new logical column. This is easily accomplished using grepl().
However (and this is the part I need help with), I also want to check whether the strings contain any elements other than those contained in the keyword vectors.
Example data:
matchvector1 <- c("Apple","Banana","Orange")
matchvector2 <- c("Strawberry","Kiwi","Grapefruit")
id <- c(1,2,3)
string_column <- c(paste0(c("Apple","Banana"),collapse=", "), paste0(c("Strawberry","Kiwi"), collapse = ", "), paste0(c("Apple","Pineapple"), collapse = ", "))
df <- data.frame(id, string_column)
df$string_column <- as.character(df$string_column)
matches_vector1 <- grepl(paste(matchvector1, collapse = "|"), df$string_column)
matches_vector2 <- grepl(paste(matchvector2, collapse = "|"), df$string_column)
The output should look something like:
matches_vector1: TRUE FALSE TRUE
matches_vector2: FALSE TRUE FALSE
unmatched_words: FALSE FALSE TRUE
I'm stuck on this last part. Is there an easy way to match on anything except something in a list of keywords using grepl() (or another function)? I suspect it will involve using negative lookaround somehow but the few existing threads on this didn't seem to answer my question.
One option is to split 'string_column' into one word per row with separate_rows, group by 'id', and check whether any word in the group is not %in% the concatenated keyword vectors:
library(dplyr)
library(tidyr)
df %>%
separate_rows(string_column) %>%
group_by(id) %>%
summarise(unmatched = any(!string_column %in% c(matchvector1, matchvector2)) )
# A tibble: 3 x 2
# id unmatched
#* <dbl> <lgl>
#1 1 FALSE
#2 2 FALSE
#3 3 TRUE
or in base R
lengths(lapply(strsplit(df$string_column, ",\\s*"),
setdiff, c(matchvector1, matchvector2))) > 0
#[1] FALSE FALSE TRUE
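All three columns from the desired output can also be built from the split words rather than from regular expressions. The sketch below is one possible base R version; has_any is a hypothetical helper, and it assumes the words are separated by a comma and optional whitespace, as in the example data.

```r
matchvector1 <- c("Apple", "Banana", "Orange")
matchvector2 <- c("Strawberry", "Kiwi", "Grapefruit")
df <- data.frame(
  id = c(1, 2, 3),
  string_column = c("Apple, Banana", "Strawberry, Kiwi", "Apple, Pineapple"),
  stringsAsFactors = FALSE
)

# Split each string into its comma-separated words
words <- strsplit(df$string_column, ",\\s*")

# Hypothetical helper: does any word of each row appear in `keywords`?
has_any <- function(word_list, keywords) {
  vapply(word_list, function(w) any(w %in% keywords), logical(1))
}

df$matches_vector1 <- has_any(words, matchvector1)
df$matches_vector2 <- has_any(words, matchvector2)

# TRUE when at least one word is in neither keyword vector
df$unmatched_words <- vapply(
  words,
  function(w) !all(w %in% c(matchvector1, matchvector2)),
  logical(1)
)
df
```

Matching whole words with %in% also avoids partial matches, which a grepl() pattern built with paste(..., collapse = "|") would allow.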
I have a dataframe and a number of conditions. Each condition is supposed to check whether the value in a certain column of the dataframe is within a set of valid values.
This is what I tried:
# create the sample dataframe
age <- c(120, 45)
sex <- c("x", "f")
df <-data.frame(age, sex)
# create the sample conditions
conditions <- list(
list("age", c(18:100)),
list("sex", c("f", "m"))
)
addIndicator <- function (df, columnName, validValues) {
indicator <- vector()
for (row in df[, toString(columnName)]) {
# for some strange reason, %in% doesn't work correctly here, but always returns FALSE
indicator <- append(indicator, row %in% validValues)
}
df <- cbind(df, indicator)
# rename the column
names(df)[length(names(df))] <- paste0("I_", columnName)
return(df)
}
for (condition in conditions){
columnName <- condition[1]
validValues <- condition[2]
df <- addIndicator(df, columnName, validValues)
}
print(df)
However, every condition is reported as not met, which is not what I expect:
age sex I_age I_sex
1 120 x FALSE FALSE
2 45 f FALSE FALSE
I figured that %in% does not return the expected result. I checked typeof(row) and tried to boil this down to a minimal example. In a simple minimal example, with the same types and values of the variables, %in% works properly. So something must be wrong in the context where I apply it. Since this is my first attempt to write anything in R, I am stuck here.
What am I doing wrong and how can I achieve what I want?
If you prefer an approach that uses the tidyverse family of packages:
library(tidyverse)
allowed_values <- list(age = 18:100, sex = c("f", "m"))
df %>%
imap_dfr(~ .x %in% allowed_values[[.y]]) %>%
rename_with(~ paste0('I_', .x)) %>%
bind_cols(df)
imap_dfr allows you to manipulate each column in df using a lambda function. .x references the column content and .y references the name.
rename_with renames the columns using another lambda function and bind_cols combines the results with the original dataframe.
I borrowed the simplified list of conditions from ben's answer. I find my approach slightly more readable but that is a matter of taste and of whether you are already using the tidyverse elsewhere.
conditions appears to be a nested list. When you use:
validValues <- condition[2]
in your for loop, your result is also a list.
To get the vector of values to use with %in%, extract the element itself with [[ instead:
validValues <- condition[[2]]
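The difference between the two extractions can be checked directly. A small sketch, using one condition from the question (in_list and in_vector are illustrative names):

```r
condition <- list("age", 18:100)

# Single brackets keep the list wrapper: a list of length 1 holding the vector
class(condition[2])
# Double brackets return the element itself: the integer vector 18:100
class(condition[[2]])

in_list <- 45 %in% condition[2]      # compares 45 with the list element as a whole
in_vector <- 45 %in% condition[[2]]  # compares 45 with each of 18, 19, ..., 100
c(in_list, in_vector)
```

This is why the loop in the question reported FALSE even for valid values such as 45.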
A simplified approach to obtaining indicators could be with a simple list:
conditions_lst <- list(age = 18:100, sex = c("f", "m"))
And using sapply instead of a for loop:
cbind(df, sapply(setNames(names(df), paste("I", names(df), sep = "_")), function(x) {
df[[x]] %in% conditions_lst[[x]]
}))
Output
age sex I_age I_sex
1 120 x FALSE FALSE
2 45 f TRUE TRUE
An alternative approach using across and cur_column() (and leaning heavily on severin's solution):
library(tidyverse)
df <- tibble(age = c(12, 45), sex = c('f', 'f'))
allowed_values <- list(age = 18:100, sex = c("f", "m"))
df %>%
mutate(across(c(age, sex),
c(valid = ~ .x %in% allowed_values[[cur_column()]])
)
)
Reference: https://dplyr.tidyverse.org/articles/colwise.html#current-column
Related question: Refering to column names inside dplyr's across()
I have a list containing many data frames:
df1 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df2 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df3 <- data.frame(A = 1:5, C = LETTERS[1:5])
my_list <- list(df1, df2, df3)
I want to know if every data frame in this list contains the same columns (i.e., the same number of columns, all having the same names and in the same order).
I know that you can easily find column names of data frames in a list using lapply:
lapply(my_list, colnames)
Is there a way to determine if any differences in column names occur? I realize this is a complicated question involving pairwise comparisons.
You can avoid pairwise comparison by simply checking if the count of each column name is == length(my_list). This simultaneously checks the number of columns and the names of your data frames, although it does not check column order -
library(magrittr)
lapply(my_list, names) %>%
unlist() %>%
table() %>%
# braces stop %>% from also passing the table as the first argument of all()
{ all(. == length(my_list)) }
[1] FALSE
In base R i.e. without %>% -
all(table(unlist(lapply(my_list, names))) == length(my_list))
[1] FALSE
or slightly more optimized -
!any(table(unlist(lapply(my_list, names))) != length(my_list))
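One thing to keep in mind is that counting names ignores column order, which the question asks about. The sketch below uses a made-up two-element list to show the difference; count_check and order_check are illustrative names, and comparing against the first data frame is just one possible order-sensitive check.

```r
my_list <- list(
  data.frame(A = 1:2, B = 3:4),
  data.frame(B = 3:4, A = 1:2)   # same names, different order
)

# Counting approach: every name occurs length(my_list) times
name_counts <- table(unlist(lapply(my_list, names)))
count_check <- all(name_counts == length(my_list))

# Order-sensitive check: compare every data frame's names with the first one's
first_names <- names(my_list[[1]])
order_check <- all(vapply(
  my_list,
  function(d) identical(names(d), first_names),
  logical(1)
))

c(count_check = count_check, order_check = order_check)
```

Here count_check is TRUE while order_check is FALSE, so the counting approach fits best when order does not matter.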
Here's another base solution with Reduce:
!is.logical(
Reduce(function(x,y) if(identical(x,y)) x else FALSE
, lapply(my_list, names)
)
)
You could also account for same columns in a different order with
!is.logical(
Reduce(function(x,y) if(identical(x,y)) x else FALSE
, lapply(my_list, function(z) sort(names(z)))
)
)
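As for what's going on, Reduce() accumulates as it goes through the list. First, identical(names_df1, names_df2) is evaluated. If it is TRUE, the function returns that same vector of names rather than TRUE, so the result can be compared with the next member of the list; if it is FALSE, the function returns FALSE, which can never be identical to a vector of names, so the result stays FALSE for the rest of the list.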
Finally, if everything evaluates as true, we get a character vector returned. Since you probably want a logical output, !is.logical(...) is used to turn that character vector into a boolean.
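The accumulation can be made visible with Reduce's accumulate argument, which keeps every intermediate result. This is a sketch for inspection only; keep_if_identical and steps are names introduced here.

```r
df1 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df2 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df3 <- data.frame(A = 1:5, C = LETTERS[1:5])
my_list <- list(df1, df2, df3)

keep_if_identical <- function(x, y) if (identical(x, y)) x else FALSE

# accumulate = TRUE returns every intermediate result instead of only the last
steps <- Reduce(keep_if_identical, lapply(my_list, names),
                accumulate = TRUE, simplify = FALSE)

# steps[[1]]: names of df1 (the starting value)
# steps[[2]]: names of df1 again, because df2 has identical names
# steps[[3]]: FALSE, because df3 lacks column B
same_columns <- !is.logical(steps[[length(steps)]])
same_columns
```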
See also here as I was very inspired by another post:
check whether all elements of a list are in equal in R
And a similar one that I saw after my edit:
Test for equality between all members of list
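We can use dplyr::bind_rows, which fills columns missing from a data frame with NA (this assumes the data contain no NA values of their own, and it does not check column order):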
!any(is.na(dplyr::bind_rows(my_list)))
# [1] FALSE
Here is my answer:
k <- 1
output <- NULL
for(i in 1:(length(my_list) - 1)) {
for(j in (i + 1):length(my_list)) {
output[k] <- identical(colnames(my_list[[i]]), colnames(my_list[[j]]))
k <- k + 1
}
}
all(output)
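The same pairwise comparison can be written without explicit loops using combn(), which enumerates every pair of indices. A sketch, assuming the list holds at least two data frames (pair_equal is an illustrative name):

```r
df1 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df2 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df3 <- data.frame(A = 1:5, C = LETTERS[1:5])
my_list <- list(df1, df2, df3)

# One logical per pair (1,2), (1,3), (2,3): do the column names match exactly?
pair_equal <- combn(length(my_list), 2, function(p) {
  identical(colnames(my_list[[p[1]]]), colnames(my_list[[p[2]]]))
})

all(pair_equal)
```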
So I have
df=data.frame(age=c(10,12,12,13,13,10), name=c('Maria','anders','anders','per','johanna','Maria'))
dups=df[duplicated(df),]
What does R do when I run df %in% dups?
Output: FALSE FALSE
I do realise for example if I run df$name %in% dups$name
Output: TRUE TRUE TRUE FALSE FALSE TRUE
which compares every name of df with the name of dups and checks if a name is found at least once on dups. I would assume df %in% dups would check every row of df against every row of dups but that doesn't seem to be the case.
When %in% is applied to data frames, the comparison takes place column-wise.
For example
df %in% df["age"]
# [1] TRUE FALSE
compares each column in df with the column in the one-column data frame df["age"]. Since the age column is identical in both data frames, the first value is TRUE.
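The column-wise behaviour follows from a data frame being a list of columns, so %in% returns one value per column. A short sketch with the question's data (cols_in_dups and cols_in_age are illustrative names):

```r
df <- data.frame(
  age = c(10, 12, 12, 13, 13, 10),
  name = c("Maria", "anders", "anders", "per", "johanna", "Maria")
)
dups <- df[duplicated(df), ]

# No column of dups (2 rows) matches a whole column of df (6 rows)
cols_in_dups <- df %in% dups

# The age column of df appears unchanged in df["age"]; the name column does not
cols_in_age <- df %in% df["age"]

length(cols_in_dups) == ncol(df)
```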
For a row-wise comparison, you can use the following (more involved) command, which flags a row of df only if all of its columns match the same row of dups:
sapply(seq_len(nrow(df)),
function(i1) any(sapply(seq_len(nrow(dups)),
function(i2) all(df[i1, ] == dups[i2, ]))))
# [1] TRUE TRUE TRUE FALSE FALSE TRUE
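A shorter way to compare rows is to collapse each row into a single string key and use %in% on the keys. This is a sketch, not the answer's method: row_key is a hypothetical helper, and the "\r" separator is an assumption that this character never occurs in the data.

```r
df <- data.frame(
  age = c(10, 12, 12, 13, 13, 10),
  name = c("Maria", "anders", "anders", "per", "johanna", "Maria")
)
dups <- df[duplicated(df), ]

# Collapse each row into one string, pasting its columns together
row_key <- function(d) do.call(paste, c(unname(as.list(d)), sep = "\r"))
in_dups <- row_key(df) %in% row_key(dups)

# Alternative when dups comes from duplicated(): flag every row that has a twin
has_twin <- duplicated(df) | duplicated(df, fromLast = TRUE)

in_dups
```

Both vectors flag a row only when every column matches, so a row that shares just an age or just a name with dups is not counted.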