Let's say I have a simple data-frame with a list of IDs like this:
df<-tibble(id=1:5)
Then I have a vector of IDs like this:
ids<-3:7
I am trying to write an if or ifelse statement that will check to see if each id is contained in any ids in the ids vector. The result would be something like:
df1<-tibble(id=1:5,included=c("no","no","yes","yes","yes"))
# A tibble: 5 x 2
id included
<int> <chr>
1 1 no
2 2 no
3 3 yes
4 4 yes
5 5 yes
The ifelse function works on vectors. You were probably looking for the %in% operator? Did you try Google?
df$included <- ifelse(df$id %in% ids, "yes", "no")
R base:
df$included <- df$id %in% ids
df
# A tibble: 5 x 2
id included
<int> <lgl>
1 1 FALSE
2 2 FALSE
3 3 TRUE
4 4 TRUE
5 5 TRUE
We can use %in% to create a logical vector and then replace the values with ifelse, case_when, or plain indexing:
library(dplyr)
df <- df %>%
mutate(included = case_when(id %in% ids ~ "yes", TRUE ~ "no"))
df
You can use %in% to subset a vector containing no and yes:
df$included <- c("no", "yes")[1 + df$id %in% ids]
## A tibble: 5 x 2
# id included
# <int> <chr>
#1 1 no
#2 2 no
#3 3 yes
#4 4 yes
#5 5 yes
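The indexing trick above relies on two things: %in% binds more tightly than +, and a logical vector becomes 0/1 when a number is added to it. A minimal base R sketch of the intermediate steps (the names hit, idx, via_index and via_ifelse are only illustrative and do not come from the answers):

```r
id  <- 1:5
ids <- 3:7

# Logical vector: is each id present in ids?
hit <- id %in% ids

# Adding 1 turns FALSE/TRUE into positions 1/2 of c("no", "yes")
idx <- 1 + hit

via_index  <- c("no", "yes")[idx]
via_ifelse <- ifelse(hit, "yes", "no")

# Both approaches give the same character vector
identical(via_index, via_ifelse)
```

Whether to use indexing or ifelse is mostly a matter of readability; ifelse states the intent more directly.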
Related
I would like to identify all rows of a tibble that have been altered after mutate .
My real data has multiple columns and the mutate function changes more than one column at once.
# library
library(tidyverse)
# get df
df <- tibble(name=c("A","B","C","D"),value=c(1,2,3,4))
# mutate df
dfnew <- df %>%
mutate(value=case_when(name=="A" ~ value+1, TRUE ~value)) %>%
mutate(name=case_when(name=="B" ~ "K", TRUE ~name))
Created on 2020-04-26 by the reprex package (v0.3.0)
Now I am looking for a way to compare all rows of df with dfnew and identify all rows with at least one change.
The desired output would be:
# desired output:
#
# # A tibble: 4 x 2
# name value
# <chr> <dbl>
# 1 A 2
# 2 K 2
You can do:
anti_join(dfnew, df)
name value
<chr> <dbl>
1 A 2
2 K 2
@tmfmnk's response does the trick, but if you'd like to use a loop (e.g. for some flexibility using different kinds of messages or warnings depending on what you're checking), you could do:
output <- list()
for (i in seq_len(nrow(dfnew))) {
if (all(df[i, ] == dfnew[i, ])) {
next
}
output[[i]] <- dfnew[i, ]
}
bind_rows(output)
# A tibble: 2 x 2
name value
<chr> <dbl>
1 A 2
2 K 2
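If the loop is only needed to find the changed rows, the same row-by-row comparison can be sketched without a loop in base R. This assumes both data frames have the same columns and the same row order, and that there are no NA values (a comparison with NA yields NA rather than TRUE/FALSE). The name changed is only illustrative.

```r
df <- data.frame(name = c("A", "B", "C", "D"),
                 value = c(1, 2, 3, 4),
                 stringsAsFactors = FALSE)

# Recreate the two changes from the question
dfnew <- df
dfnew$value[dfnew$name == "A"] <- dfnew$value[dfnew$name == "A"] + 1
dfnew$name[dfnew$name == "B"] <- "K"

# df != dfnew is a logical matrix; a row changed if any cell differs
changed <- rowSums(df != dfnew) > 0

dfnew[changed, ]
```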
We can also use setdiff from dplyr
library(dplyr)
setdiff(dfnew, df)
# A tibble: 2 x 2
# name value
# <chr> <dbl>
#1 A 2
#2 K 2
Or using fsetdiff from data.table
library(data.table)
fsetdiff(setDT(dfnew), setDT(df))
I am attempting to change the value of a variable using dplyr::mutate(). I want to change the value of the column 'certainty' from "unsure" to "likely" if the ID from a character vector is found in the ID column in the dataset. If it does not match, I would like to keep the original value. Here is a reprex with my current attempt:
library(dplyr)
library(magrittr)
data <- data.frame(
ID = c("a100", "b100", "c100", "d100", "e100", "f100"),
certainty = c("confirmed", "likely", "unsure", "likely", "unsure", "confirmed")
)
data %<>% as_tibble()
id_list <- c("c100", "e100")
data %<>%
mutate(certainty = if_else(id_list %in% ID, "likely", certainty))
The output should look like this:
ID certainty
<fct> <fct>
1 a100 confirmed
2 b100 likely
3 c100 likely
4 d100 likely
5 e100 likely
6 f100 confirmed
Currently I get this error:
Error: `false` must be length 2 (length of `condition`) or one, not 6
How should I solve this?
The issue is the order of the arguments to %in%. id_list %in% ID returns a logical vector with the length of id_list (2), whereas the condition needs one element per row. It should be the other way round, i.e. ID %in% id_list. For example,
1:3 %in% 1:2
#[1] TRUE TRUE FALSE
and
1:2 %in% 1:3
#[1] TRUE TRUE
Here, it would be
library(dplyr)
data %>%
mutate(certainty = ifelse(ID %in% id_list, "likely", as.character(certainty)))
# A tibble: 6 x 2
# ID certainty
# <fct> <chr>
#1 a100 confirmed
#2 b100 likely
#3 c100 likely
#4 d100 likely
#5 e100 likely
#6 f100 confirmed
NOTE: certainty is a factor here, so it needs to be converted to character (otherwise ifelse would return the factor's internal integer codes). To keep the factor class instead, "likely" has to be one of its levels.
It can also be kept as a factor:
library(forcats)
data %>%
mutate(certainty = fct_collapse(certainty,
likely = as.character(certainty)[ID %in% id_list]))
# A tibble: 6 x 2
# ID certainty
# <fct> <fct>
#1 a100 confirmed
#2 b100 likely
#3 c100 likely
#4 d100 likely
#5 e100 likely
#6 f100 confirmed
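For comparison, keeping the factor class can also be sketched in base R by assigning into the factor directly. This is an alternative to the forcats call above, not the same technique: it only changes the matching rows, whereas collapsing levels changes every row that carries that level. The union() line is a guard that makes sure "likely" is a valid level; in this data it already is one.

```r
ID <- c("a100", "b100", "c100", "d100", "e100", "f100")
certainty <- factor(c("confirmed", "likely", "unsure",
                      "likely", "unsure", "confirmed"))
id_list <- c("c100", "e100")

# Make sure the replacement value is a valid level
levels(certainty) <- union(levels(certainty), "likely")

# Overwrite only the rows whose ID is in id_list
certainty[ID %in% id_list] <- "likely"

# Drop levels that are no longer used (here: "unsure")
certainty <- droplevels(certainty)
certainty
```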
Below I create a function that deletes a specific column if there is only one unique value in it. Can I somehow use lapply within %>% to avoid calling the function three times? Or even call the function for all columns?
df <- tibble(col1 = sample(1:6), col2 = sample(1:6), col3 = 3, col4 = 4)
condDelCol <- function(mycolumn, mydataframe) {
if(length(unique(mydataframe[[mycolumn]])) == 1) { mydataframe[[mycolumn]] = NULL }
mydataframe
}
df %>%
condDelCol("col2", .) %>%
condDelCol("col3", .) %>%
condDelCol("col4", .)
With dplyr, an option is select_if
library(dplyr)
df %>%
select_if(~ n_distinct(.) > 1)
# A tibble: 6 x 2
# col1 col2
# <int> <int>
#1 1 6
#2 6 1
#3 5 5
#4 3 4
#5 4 2
#6 2 3
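In dplyr 1.0.0 and later, select_if is superseded in favour of select() combined with where(). A minimal sketch of the equivalent call, using fixed values instead of sample() so the result is reproducible (the name result is only illustrative):

```r
library(dplyr)

df <- tibble(col1 = 1:6, col2 = 6:1, col3 = 3, col4 = 4)

# Keep only the columns that have more than one distinct value
result <- df %>%
  select(where(~ n_distinct(.x) > 1))

names(result)
```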
Another way is base R: loop over the columns with sapply to count the unique values in each, extract the names of the columns that have only a single unique value, and assign (<-) NULL to them
i1 <- sapply(df, function(x) length(unique(x)))
df[names(which(i1 == 1))] <- NULL
Or with Filter
Filter(var, df)
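Filter(var, df) works because Filter coerces the value returned for each column to logical, so a variance of 0 counts as FALSE and the column is dropped. Variance is only meaningful for numeric columns, so a sketch with an explicit predicate may be safer when other column types are present. The helper varies and the character column col4 are assumptions added for illustration.

```r
df <- data.frame(col1 = 1:6, col2 = 6:1, col3 = 3, col4 = "x")

# TRUE when a column holds more than one distinct value (any column type)
varies <- function(x) length(unique(x)) > 1

# Filter keeps the columns for which the predicate is TRUE
result <- Filter(varies, df)

names(result)
```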
You could use this one as well. It ignores the columns for which the standard deviation is 0.
df[, sapply(df, sd) != 0]
# A tibble: 6 x 2
col1 col2
<int> <int>
1 1 3
2 5 6
3 6 1
4 2 2
5 3 4
6 4 5
or if you want to use the pipe operator
df %>%
select(which(sapply(df, sd) != 0))
I have long data where a given subject has 4 observations. I want to only include a given id that meets the following conditions:
has at least one 3
has at least one of 1,2 OR NA
My data structure:
df <- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3), a=c(NA,1,2,3, NA,3,2,0, NA,NA,1,1))
My unsuccessful attempt (I get an empty data frame):
df %>% dplyr::group_by(id) %>% filter(a==3 & a %in% c(1,2,NA))
One option is to group by 'id' and build a condition that returns a single TRUE/FALSE per group. Based on the OP's post, a group must contain the value 3 and at least one of 1, 2 or NA in column 'a'. 3 %in% a returns a logical vector of length 1. For the second condition, wrap any() around the check for the values 1 and 2 combined with the NA check (is.na), then join both logical results with &
library(dplyr)
df %>%
group_by(id) %>%
filter((3 %in% a) & any(a %in% c(1, 2) | is.na(a)))
# A tibble: 8 x 2
# Groups: id [2]
# id a
# <dbl> <dbl>
#1 1 NA
#2 1 1
#3 1 2
#4 1 3
#5 2 NA
#6 2 3
#7 2 2
#8 2 0
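The original attempt returns nothing because a == 3 & a %in% c(1, 2, NA) is evaluated row by row, and no single value can be 3 and also 1, 2 or NA. The per-group logic can be sketched in base R to make the "one TRUE/FALSE per id" step visible (ids_ok and kept are only illustrative names):

```r
df <- data.frame(id = c(1,1,1,1, 2,2,2,2, 3,3,3,3),
                 a  = c(NA,1,2,3, NA,3,2,0, NA,NA,1,1))

# One logical per id: contains a 3 AND at least one of 1, 2 or NA
ids_ok <- sapply(split(df$a, df$id), function(a) {
  3 %in% a && any(a %in% c(1, 2) | is.na(a))
})

# Keep every row belonging to an id that passed the check
kept <- df[df$id %in% as.numeric(names(ids_ok)[ids_ok]), ]
kept
```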
I have done this a bit of a long way to show how an idea could work. You can consolidate this a bit.
df %>%
group_by(id) %>%
mutate(has_3 = sum(a == 3, na.rm = TRUE) > 0,
keep_me = has_3 & (sum(is.na(a)) > 0 | sum(a %in% c(1, 2)) > 0)) %>%
filter(keep_me == TRUE) %>%
select(id, a)
id a
<dbl> <dbl>
1 1 NA
2 1 1
3 1 2
4 1 3
5 2 NA
6 2 3
7 2 2
8 2 0
As I read it, the filter should keep ids 1 and 2. So I would use a combination of all/any:
df %>%
group_by(id) %>%
filter(all(3 %in% a) & any(c(1,2,NA) %in% a))
I have two data frames (df1 and df2) in R with different information except for two columns, a person number (pnr) and a drug name (name). Row by row in df1, I want to check whether the combination of pnr and name exists somewhere in df2. If this combination exists, I want a "yes" in another column in df1. If not, a "no".
df1
pnr|drug|...|check
---|----|---|-----
1 | 1 |...| no
1 | 2 |...| yes
2 | 2 |...| yes
3 | 2 |...| no
.....
df2
pnr|drug|...|
---|----|---|
1 | 2 |...|
2 | 2 |...|
....
For example, I want to check whether the row combination pnr=1 & drug=1 exists in df2 (no), pnr=1 & drug=2 (yes), etc., and then place a "yes" or "no" in the check column in df1.
I have tried the following for statement without luck. It does place a "yes" or "no" in the "check" column, but it doesn't do it correctly
for(index in 1:nrow(df1)){
if((df1[index,]$pnr %in% df2$pnr)&(df1[index,]$name %in% df2$name)){
check_text="yes"}else{check_text="no"}
df1$check=check_text
}
I have a sense that I should be using apply, but I haven't been able to figure that out. Do any of you have an idea how to solve this?
One way is using base R methods.
Paste the columns pnr and drug together in each data frame and look for a match of each df1 key in df2
df1$check <- ifelse(is.na(match(paste0(df1$pnr, df1$drug),
paste0(df2$pnr, df2$drug))),"No", "Yes")
# pnr drug check
#1 1 1 No
#2 1 2 Yes
#3 2 2 Yes
#4 3 2 No
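One caveat with pasted keys: without a separator, different pairs can produce the same string (1 and 12 paste to "112", and so do 11 and 2). A small sketch of the collision and a separator-based variant follows. The data here is made up for illustration, and the choice of "_" as separator is an assumption that the character does not occur in the values themselves.

```r
df1 <- data.frame(pnr = c(1, 11), drug = c(12, 2))
df2 <- data.frame(pnr = 11, drug = 2)

# Without a separator both rows of df1 become "112"
naive <- paste0(df1$pnr, df1$drug) %in% paste0(df2$pnr, df2$drug)

# With a separator the keys stay distinct: "1_12" and "11_2"
safe <- paste(df1$pnr, df1$drug, sep = "_") %in%
  paste(df2$pnr, df2$drug, sep = "_")

df1$check <- ifelse(safe, "Yes", "No")
df1
```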
This is natural for dplyr::left_join:
library(dplyr) # for left_join, transmute
library(tidyr) # for replace_na
df1 <- expand.grid(pnr = 1:3, drug = 1:3)
df2 <- data.frame(pnr = c(1, 3), drug = c(2, 1))
df1 <- df1 %>%
left_join(df2 %>% transmute(pnr, drug, check = 'yes')) %>%
replace_na(list(check = 'no'))
df1
#> pnr drug check
#> 1 1 1 no
#> 2 2 1 no
#> 3 3 1 yes
#> 4 1 2 yes
#> 5 2 2 no
#> 6 3 2 no
#> 7 1 3 no
#> 8 2 3 no
#> 9 3 3 no
We could use apply, checking for matches using the any function:
df1$check <-
apply(df1, 1, function(x)
ifelse(any(x[1] == df2$pnr & x[2] == df2$drug), 'yes','no'))
# df1
# pnr drug check
# 1 1 1 no
# 2 1 2 yes
# 3 2 2 yes
# 4 3 2 no
data
df1 <- data.frame(pnr = c(1,1,2,3),
drug = c(1,2,2,2))
df2 <- data.frame(pnr = c(1,2),
drug = c(2,2))