R dplyr find all mutated rows - r

I would like to identify all rows of a tibble that have been altered after mutate.
My real data has multiple columns and the mutate function changes more than one column at once.
# library
library(tidyverse)
# get df
df <- tibble(name=c("A","B","C","D"),value=c(1,2,3,4))
# mutate df
dfnew <- df %>%
mutate(value=case_when(name=="A" ~ value+1, TRUE ~value)) %>%
mutate(name=case_when(name=="B" ~ "K", TRUE ~name))
Created on 2020-04-26 by the reprex package (v0.3.0)
Now I am looking for a way to compare all rows of df with dfnew and identify every row with at least one change.
The desired output would be:
# desired output:
#
# # A tibble: 2 x 2
# name value
# <chr> <dbl>
# 1 A 2
# 2 K 2

You can do:
anti_join(dfnew, df)
name value
<chr> <dbl>
1 A 2
2 K 2
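A side note on the call above: without a `by` argument, anti_join matches on every column the two tables share, which is what makes it behave as a whole-row comparison. A minimal sketch that spells this out, reusing the question's data (the name `changed` is only illustrative, and if_else stands in for the question's case_when):

```r
library(dplyr)

df <- tibble(name = c("A", "B", "C", "D"), value = c(1, 2, 3, 4))

dfnew <- df %>%
  mutate(value = if_else(name == "A", value + 1, value)) %>%
  mutate(name = if_else(name == "B", "K", name))

# Match on all columns explicitly; rows of dfnew with no identical row in df remain
changed <- anti_join(dfnew, df, by = names(df))
changed
```

The match is by content rather than by position, so a mutated row that happens to equal some other row of df would not be reported.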

@tmfmnk's response does the trick, but if you'd like to use a loop (e.g. for some flexibility in emitting different messages or warnings depending on what you're checking) you could do:
output <- list()
for (i in 1:nrow(dfnew)) {
if (all(df[i, ] == dfnew[i, ])) {
next
}
output[[i]] <- dfnew[i, ]
}
bind_rows(output)
# A tibble: 2 x 2
name value
<chr> <dbl>
1 A 2
2 K 2
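The same row-by-row comparison can probably be done without the loop. This is a sketch that assumes df and dfnew have the same rows in the same order and contain no NA values (an NA would make the comparison NA rather than TRUE or FALSE); the name `changed` is illustrative:

```r
library(dplyr)

df <- tibble(name = c("A", "B", "C", "D"), value = c(1, 2, 3, 4))

dfnew <- df %>%
  mutate(value = if_else(name == "A", value + 1, value)) %>%
  mutate(name = if_else(name == "B", "K", name))

# df != dfnew is a logical matrix; a row has changed if any of its cells differ
changed <- rowSums(df != dfnew) > 0

# Positions of the changed rows, then the rows themselves
changed_positions <- unname(which(changed))
changed_rows <- dfnew[changed, ]
changed_rows
```

Unlike the join-based answers, this compares rows by position, so it also gives you the row numbers that changed.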

We can also use setdiff from dplyr
library(dplyr)
setdiff(dfnew, df)
# A tibble: 2 x 2
# name value
# <chr> <dbl>
#1 A 2
#2 K 2
Or using fsetdiff from data.table
library(data.table)
fsetdiff(setDT(dfnew), setDT(df))

Related

Ambiguity with tidyverse filter( %in% )

I ran into an annoying issue earlier today where I had a dataframe with hundreds of columns that I had been given. I was then attempting to select rows from this dataframe using a list I had created using a different process. When I attempted to filter using the list, I got a blank dataframe in return. After struggling with this for a while I realized that the massive dataframe I was selecting from also had a column with the same name as my list, and that my filter action was using the column as the priority.
My question is, is there a better way I should be filtering from dataframes rather than the way I am currently? I do not like that it is ambiguous if a column or a list is used. Here is a minimum example to show this:
Consider a dataframe which has two columns, a and b:
library(tidyverse)
df = tibble(a = c("first", "second", "third"),
b = c("2", "3", "4"))
# A tibble: 3 × 2
# a b
# <chr> <chr>
# 1 first 2
# 2 second 3
# 3 third 4
I would then like to select rows from this dataframe using a list of values I created using a different process. Notice that the first list is named b, which is also the name of one of the columns in the df.
b = c("first")
d = c("first")
These two commands are almost the same, except that the first filters based on the column (and therefore returns nothing) and the second filters based on the list (and therefore returns the first row):
# Returns Nothing:
df %>%
filter(a %in% b)
# # A tibble: 0 × 2
# … with 2 variables: a <chr>, b <chr>
# ℹ Use `colnames()` to see all variable names
# Returns Desired Row(s)
df %>%
filter(a %in% d)
# A tibble: 1 × 2
# a b
# <chr> <chr>
# 1 first 2
Is there a better way to filter which is less ambiguous? I guess I would like an error or something like that. I realize this is kind of an edge case.
You can use .data$ and .env$ from rlang to distinguish between the variable in the data set and the object in the environment.
df %>%
filter(a %in% .env$b)
A tibble: 1 × 2
a b
<chr> <chr>
1 first 2
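For completeness, here is a sketch that uses both pronouns at once, so that neither side of %in% is left to the lookup order. It reuses the question's data; the name `res` is just for illustration:

```r
library(dplyr)

df <- tibble(a = c("first", "second", "third"),
             b = c("2", "3", "4"))
b <- c("first")

# .data$a is always the column a; .env$b is always the vector b defined above
res <- df %>%
  filter(.data$a %in% .env$b)
res
```

Both pronouns are available inside dplyr verbs without attaching rlang explicitly.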
You can use !! to evaluate the vector b, rather than use the variable b from the dataset. It also works with vectors that are not also variable names in the data, like d. So, if you imagine this happening a lot, you could always prefix the vector of values you're filtering on with !! and you won't run into this problem.
library(tidyverse)
df = tibble(a = c("first", "second", "third"),
b = c("2", "3", "4"))
b = c("first")
d = c("first")
df %>%
filter(a %in% !!b)
#> # A tibble: 1 × 2
#> a b
#> <chr> <chr>
#> 1 first 2
df %>%
filter(a %in% !!d)
#> # A tibble: 1 × 2
#> a b
#> <chr> <chr>
#> 1 first 2
Created on 2023-02-10 by the reprex package (v2.0.1)

R: Encoding categorical data using across()

I have a dataset with features of type character (not all are binary and one of them represents a region).
In order to avoid having to use the function several times, I was trying to use a pipeline and across() to identify all of the columns of character type and encode them with the function created.
encode_ordinal <- function(x, order = unique(x)) {
x <- as.numeric(factor(x, levels = order, exclude = NULL))
x
}
dataset <- dataset %>%
encode_ordinal(across(where(is.character)))
However, it seems that I am not using across() correctly as I get the error:
Error: across() must only be used inside dplyr verbs.
I wonder if I am overcomplicating this and there is an easier way of achieving it, i.e., identifying all of the features of character type and encoding them.
You should call across and encode_ordinal inside mutate, as illustrated in the following example:
dataset <- tibble(x = 1:3, y = c('a', 'b', 'b'), z = c('A', 'A', 'B'))
# # A tibble: 3 x 3
# x y z
# <int> <chr> <chr>
# 1 1 a A
# 2 2 b A
# 3 3 b B
dataset %>%
mutate(across(where(is.character), encode_ordinal))
# # A tibble: 3 x 3
# x y z
# <int> <dbl> <dbl>
# 1 1 1 1
# 2 2 2 1
# 3 3 2 2
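If you need to pass the order argument as well, a lambda inside across should work. This is only a sketch: sorting the unique values alphabetically is an assumption chosen for illustration, and the name `encoded` is not from the question.

```r
library(dplyr)

encode_ordinal <- function(x, order = unique(x)) {
  as.numeric(factor(x, levels = order, exclude = NULL))
}

dataset <- tibble(x = 1:3, y = c("b", "a", "b"), z = c("A", "A", "B"))

# .x is the current column; here the levels are ordered alphabetically
# instead of by first appearance
encoded <- dataset %>%
  mutate(across(where(is.character),
                ~ encode_ordinal(.x, order = sort(unique(.x)))))
encoded
```

With the default order, y would be encoded by first appearance (b = 1, a = 2); with the sorted order it becomes a = 1, b = 2.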

How to apply functions sequentially with purrr and pipes

I am struggling with the purrr package.
I am trying to apply the function is.factor to a data frame, and then fct_count on those columns that are factors.
I have tried some variations of modify_if and summarise_if. I guess I am using the dots (.) incorrectly when referring to the previous object.
(A guide about purrr, and dots would be really beneficial if you have a link).
For example,
df <- data.frame(f1 = c("men", "woman", "men", "men"),
f2 = c("high", "low", "low", "low"),
n1 = c(1, 3, 3, 6),
stringsAsFactors = TRUE) # needed in R >= 4.0 so that f1 and f2 are factors
Then
map(df, is.factor)
If I use
map_if(df, is.factor, forcats::fct_count)
I get results for every variable, instead of only for the factors.
I think it is a pretty simple problem, and with a bit of understanding of the dots (.) can be solved.
Thanks in advance
:)
The issue is that map_if returns the unmodified columns as well. Hence, when the OP runs the code (repeated here just to show the output)
map_if(df, is.factor, forcats::fct_count)
#$f1
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
#$n1
#[1] 1 3 3 6 ### it is the same column value unchanged
Here, we can specify .else so that the other columns return NULL, and then discard the NULL elements. The result is a list containing only the factor counts.
library(tidyverse)
map_if(df, is.factor, forcats::fct_count, .else = ~ NULL) %>%
discard(is.null)
#$f1
## A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
Or another option is summarise_if and place the output in a list
df %>%
summarise_if(is.factor, list(~ list(fct_count(.)))) %>%
unclass
Or another option would be to gather into 'long' format and then count once
gather(df, key, val, f1:f2) %>%
dplyr::count(key, val)
Or this can be done with lapply from base R
lapply(df[sapply(df, is.factor)], fct_count)
Or using only base R
lapply(df[sapply(df, is.factor)], table)
Or the results can be represented in a different way
table(names(df)[1:2][col(df[1:2])], unlist(df[1:2]))
The issue with map_if/modify_if is that it applies the function only to the columns that satisfy the predicate function; the rest are returned as they are.
Hence, when you try
library(tidyverse)
map_if(df, is.factor, forcats::fct_count)
#$f1
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
#$n1
#[1] 1 3 3 6
fct_count is applied to columns f1 and f2, which are factors, and column n1 is returned as is. If you want only the factor columns in the output, one way is to select them first and then apply the function
df %>%
select_if(is.factor) %>%
map(forcats::fct_count)
#$f1
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
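The "select first, then map" idea can also be written with purrr alone, using keep() as the counterpart of the discard() shown earlier. A sketch, assuming the columns really are factors (hence stringsAsFactors = TRUE, which R 4.0 and later no longer sets by default); the name `counts` is illustrative:

```r
library(purrr)
library(forcats)

df <- data.frame(f1 = c("men", "woman", "men", "men"),
                 f2 = c("high", "low", "low", "low"),
                 n1 = c(1, 3, 3, 6),
                 stringsAsFactors = TRUE)

# keep() retains only the columns for which is.factor() is TRUE,
# then map() applies fct_count to each remaining column
counts <- map(keep(df, is.factor), fct_count)
counts
```

The result is a named list with one count table per factor column, and n1 is dropped before fct_count is ever called.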

Ranking Values of data frame excluding same dates

I have a data frame with Dates and Values:
library(dplyr)
library(lubridate)
df<-tibble(DateTime=ymd(c("2018-01-01","2018-01-01","2018-01-02","2018-01-02","2018-01-03","2018-01-03")),
Value=c(5,10,12,3,9,11),Rank=rep(0,6))
I would like to rank the values of the last two rows, each compared with the other four Value rows (the ones from previous dates).
I have managed to do this:
dfReference<-df%>%filter(DateTime!=max(DateTime))
dfTarget<-df%>%filter(DateTime==max(DateTime))
for (i in 1:nrow(dfTarget)){
tempDf<-rbind(dfReference,dfTarget[i,])%>%
mutate(Rank=rank(Value,ties.method = "first"))
dfTarget$Rank[i]=filter(tempDf,DateTime==max(df$DateTime))$Rank
}
Desired output:
> dfTarget
# A tibble: 2 x 3
DateTime Value Rank
<date> <dbl> <dbl>
1 2018-01-03 9 3
2 2018-01-03 11 4
But I am looking for a more elegant way.
Thanks
This is basically the same idea as your for loop, but instead of a loop it uses map_int, and instead of creating a new data frame using rbind it creates a new vector with c().
library(tidyverse)
is.max <- with(df, DateTime == max(DateTime))
df[is.max,] %>%
mutate(Rank = map_int(Value, ~
c(df$Value[!is.max], .x) %>%
rank(ties.method = 'first') %>%
tail(1)))
# # A tibble: 2 x 3
# DateTime Value Rank
# <date> <dbl> <int>
# 1 2018-01-03 9 3
# 2 2018-01-03 11 4
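Since each target value is appended last and ties are broken with "first", its rank should equal one plus the number of reference values that are less than or equal to it. Under that reading, a vectorised sketch with base R's findInterval is possible (the names `is_max` and `reference` are illustrative, and as.Date replaces lubridate's ymd to keep the example small):

```r
library(dplyr)

df <- tibble(
  DateTime = as.Date(c("2018-01-01", "2018-01-01", "2018-01-02",
                       "2018-01-02", "2018-01-03", "2018-01-03")),
  Value = c(5, 10, 12, 3, 9, 11)
)

is_max <- df$DateTime == max(df$DateTime)

# Sorted values from all earlier dates
reference <- sort(df$Value[!is_max])

# findInterval() counts how many reference values are <= each target value
dfTarget <- df[is_max, ] %>%
  mutate(Rank = findInterval(Value, reference) + 1L)
dfTarget
```

This avoids re-ranking the whole reference set once per target row, which may matter if the reference set is large.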

how to use tidyeval functions with loops?

Consider this simple example
library(dplyr)
dataframe <- tibble(id = c(1,2,3,4),
group = c('a','b','c','c'),
value = c(200,400,120,300))
> dataframe
# A tibble: 4 x 3
id group value
<dbl> <chr> <dbl>
1 1 a 200
2 2 b 400
3 3 c 120
4 4 c 300
and this tidyeval function that uses dplyr to aggregate my dataframe according to some input column.
func_tidy <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
df_agg
}
now, this works
> func_tidy(dataframe, group)
# A tibble: 3 x 3
group mean count
<chr> <dbl> <int>
1 a 200 1
2 b 400 1
3 c 210 2
but doing the same thing from within a loop FAILS
for(col in c(group)){
func_tidy(dataframe, col)
}
Error in grouped_df_impl(data, unname(vars), drop) : Column `col` is unknown
What is the problem here? How can I use my tidyeval function in a loop?
Thanks!
For looping through column names you will need to use character strings.
for(col in "group")
When you pass this variable to your function, you will need to convert it from a character string to a symbol using rlang::sym. You use !! to unquote so the expression is evaluated.
So your loop would look like (I add a print to see the output):
for(col in "group"){
print( func_tidy(dataframe, !! rlang::sym(col) ) )
}
# A tibble: 3 x 3
group mean count
<chr> <dbl> <int>
1 a 200 1
2 b 400 1
3 c 210 2
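If you are free to change the function itself, another option is to have it accept the column name as a string, which fits a loop more naturally. A sketch, where `func_str` is a hypothetical variant of func_tidy and `results` is an illustrative name:

```r
library(dplyr)

dataframe <- tibble(id = c(1, 2, 3, 4),
                    group = c("a", "b", "c", "c"),
                    value = c(200, 400, 120, 300))

# Hypothetical variant of func_tidy that takes the grouping column as a string
func_str <- function(data, group_col) {
  data %>%
    group_by(across(all_of(group_col))) %>%
    summarize(mean = mean(value, na.rm = TRUE),
              count = n(),
              .groups = "drop")
}

# Loop over column names given as character strings and collect the results
results <- list()
for (col in c("group", "id")) {
  results[[col]] <- func_str(dataframe, col)
}
results$group
```

Here all_of() turns the string into a column selection, so no sym() or !! is needed at the call site.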

Resources