If any column in a row meets a condition, then mutate() a column - r

Using dplyr, I am trying to conditionally update values in a column using ifelse and mutate. I am trying to say that, in a data frame, if any variable (column) in a row is equal to 7, then variable c should become 100, otherwise c remains the same.
df <- data.frame(a = c(1,2,3),
b = c(1,7,3),
c = c(5,2,9))
df <- df %>% mutate(c = ifelse(any(vars(everything()) == 7), 100, c))
This gives me the error:
Error in mutate_impl(.data, dots) :
Evaluation error: (list) object cannot be coerced to type 'double'.
The output I'd like is:
a b c
1 1 1 5
2 2 7 100
3 3 3 9
Note: this is an abstract example of a larger data set with more rows and columns.
EDIT:
This code gets me a bit closer, but it does not apply the ifelse statement by each row. Instead, it is changing all values to 100 in column c if 7 is present anywhere in the data frame.
df <- df %>% mutate(c = ifelse(any(select(., everything()) == 7), 100, c))
a b c
1 1 1 100
2 2 7 100
3 3 3 100
Perhaps this is not possible to do using dplyr?

I think this should work. We can check which values in df are equal to 7, then use rowSums to see whether any row has a count larger than 0, which means at least one value in that row is 7.
df <- df %>% mutate(c = ifelse(rowSums(df == 7) > 0, 100, c))
Or we can use apply
df <- df %>% mutate(c = ifelse(apply(df == 7, 1, any), 100, c))
A base R equivalent is like this.
df$c[apply(df == 7, 1, any)] <- 100
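To make the rowSums idea above easier to follow, here is a minimal step-by-step sketch in base R. The intermediate names hits and n_hits are only illustrative, and na.rm = TRUE is an added assumption so that missing values do not turn the row counts into NA.

```r
df <- data.frame(a = c(1, 2, 3),
                 b = c(1, 7, 3),
                 c = c(5, 2, 9))

# Step 1: element-wise comparison gives a logical matrix with the same shape as df
hits <- df == 7

# Step 2: TRUE counts as 1, so rowSums() counts the matches in each row
# (na.rm = TRUE is an assumption added here to guard against missing values)
n_hits <- rowSums(hits, na.rm = TRUE)

# Step 3: overwrite c only in rows that contain at least one 7
df$c <- ifelse(n_hits > 0, 100, df$c)
df
```

Because the test is computed per row before ifelse() is called, only the second row changes, unlike the any() attempt in the question, which collapses the whole data frame into a single TRUE/FALSE.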

You could try with purrr::map_dbl
library(purrr)
df$c <- map_dbl(1:nrow(df), ~ifelse(any(df[.x,]==7), 100, df[.x,]$c))
Output
a b c
1 1 1 5
2 2 7 100
3 3 3 9
In a dplyr::mutate statement this would be
library(purrr)
library(dplyr)
df %>%
mutate(c = map_dbl(1:nrow(df), ~ifelse(any(df[.x,]==7), 100, df[.x,]$c)))

Related

How to select variables with numeric suffixes lower than a value

I have a data frame similar to this one.
df <- data.frame(id=c(1,2,3), tot_1=runif(3, 0, 100), tot_2=runif(3, 0, 100), tot_3=runif(3, 0, 100), tot_4=runif(3, 0, 100))
I want to select, or perform an operation on, only the columns with suffixes lower than 3.
#select
df <- df %>% select(id, tot_1, tot_2)
#or sum
df <- df %>% mutate(sumVar = rowSums(across(c(tot_1, tot_2))))
However, in my real data, there are many more variables and not in order. So how could I select them without doing it manually?
We may use matches
df %>%
mutate(sumVar = rowSums(across(matches('tot_[1-2]$'))))
If we need to be more flexible, extract the numeric part from the column names that start with 'tot', subset based on the condition, and use the resulting names
library(stringr)
nm1 <- str_subset(names(df), 'tot')
nm2 <- nm1[readr::parse_number(nm1) <3]
df %>%
mutate(sumVar = rowSums(across(all_of(nm2))))
Solution with num_range
This is the rare case for the often forgotten num_range selection helper from dplyr, which extracts the numbers from the names in a single step, then selects a range:
determine the threshold
suffix_threshold <- 3
select()
library(dplyr)
df %>% select(id, num_range(prefix='tot_',
range=seq_len(suffix_threshold-1)))
id tot_1 tot_2
1 1 26.75082 26.89506
2 2 21.86453 18.11683
3 3 51.67968 51.85761
mutate() with rowSums()
library(dplyr)
df %>% mutate(sumVar = across(num_range(prefix='tot_', range=seq_len(suffix_threshold-1)))%>%
rowSums)
id tot_1 tot_2 tot_3 tot_4 sumVar
1 1 26.75082 26.89506 56.27829 71.79353 53.64588
2 2 21.86453 18.11683 12.91569 96.14099 39.98136
3 3 51.67968 51.85761 25.63676 10.01408 103.53730
Here is a base R way -
cols <- grep('tot_', names(df), value = TRUE)
#Select
df[c('id', cols[as.numeric(sub('tot_', '',cols)) < 3])]
# id tot_1 tot_2
#1 1 75.409112 30.59338
#2 2 9.613496 44.96151
#3 3 58.589574 64.90672
#Rowsums
df$sumVar <- rowSums(df[cols[as.numeric(sub('tot_', '',cols)) < 3]])
df
# id tot_1 tot_2 tot_3 tot_4 sumVar
#1 1 75.409112 30.59338 59.82815 50.495758 106.00250
#2 2 9.613496 44.96151 84.19916 2.189482 54.57501
#3 3 58.589574 64.90672 18.17310 71.390459 123.49629

using replace_na() with indeterminate number of columns

My data frame looks like this:
df <- tibble(x = c(1, 2, NA),
y = c(1, NA, 3),
z = c(NA, 2, 3))
I want to replace NA with 0 using tidyr::replace_na(). As this function's documentation makes clear, it's straightforward to do this once you know which columns you want to perform the operation on.
df <- df %>% replace_na(list(x = 0, y = 0, z = 0))
But what if you have an indeterminate number of columns? (I say 'indeterminate' because I'm trying to create a function that does this on the fly using dplyr tools.) If I'm not mistaken, the base R equivalent to what I'm trying to achieve using the aforementioned tools is:
df[, 1:ncol(df)][is.na(df[, 1:ncol(df)])] <- 0
But I always struggle to get my head around this code. Thanks in advance for your help.
We can do this by creating a list of 0's with one element per column of the dataset and setting its names to the column names
library(tidyverse)
df %>%
replace_na(set_names(as.list(rep(0, length(.))), names(.)))
# A tibble: 3 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 1 0
#2 2 0 2
#3 0 3 3
Or another option is mutate_all (for selected columns, mutate_at, or based on conditions, mutate_if) and apply replace_na
df %>%
mutate_all(replace_na, replace = 0)
With base R, it is more straightforward
df[is.na(df)] <- 0
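In more recent dplyr versions (1.0.0 and later), the scoped verbs such as mutate_all are superseded by across(). A minimal sketch of the same idea follows; restricting the replacement to numeric columns with where(is.numeric) is an assumption added here so that a 0 is never pushed into a character column, and df_filled is just an illustrative name.

```r
library(dplyr)
library(tidyr)

df <- tibble(x = c(1, 2, NA),
             y = c(1, NA, 3),
             z = c(NA, 2, 3))

# Apply replace_na() to every numeric column, however many there are
df_filled <- df %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0)))

df_filled
```

Since the column selection happens inside across(), this should drop into a function unchanged regardless of how many columns the input has.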

Replace last value in group with corresponding value in other column

Working with grouped data, I want to change the last entry in one column to match the corresponding value for that group in another column. So for my data below, for each 'nest' (group), the last 'Status' entry will equal the 'fate' for that nest.
Data like this:
nest Status fate
1 1 2
1 1 2
2 1 3
2 1 3
2 1 3
Desired result:
nest Status fate
1 1 2
1 2 2
2 1 3
2 1 3
2 3 3
It should be so simple. I tried the following from "dplyr and tail to change last value in a group_by in r"; it works properly for some groups, but in others it substitutes the wrong 'fate' value:
library(data.table)
indx <- setDT(df)[, .I[.N], by = .(nest)]$V1
df[indx, Status := df$fate]
I get various errors trying the approach from "dplyr mutate/replace on a subset of rows":
mutate_last <- function(.data, ...) {
n <- n_groups(.data)
indices <- attr(.data, "indices")[[n]] + 1
.data[indices, ] <- .data[indices, ] %>% mutate(...)
.data
}
df <- df %>%
group_by(nest) %>%
mutate_last(df, Status == fate)
I must be missing something simple from the resources mentioned above?
Something like
library(tidyverse)
df <- data.frame(nest = c(1,1,2,2,2),
status = rep(1, 5),
fate = c(2,2,3,3,3))
df %>%
group_by(nest) %>%
mutate(status = c(status[-n()], tail(fate,1)))
Not sure if this is definitely the best way to do it but here's a very simple solution:
library(dplyr)
dat <- data.frame(nest = c(1,1,2,2,2),
Status = c(1,1,1,1,1),
fate = c(2,2,3,3,3))
dat %>%
arrange(nest, Status, fate) %>% #enforce order
group_by(nest) %>%
mutate(Status = ifelse(is.na(lead(nest)), fate, Status))
E: Made a quick change.
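Another way to express "the last row of each group" is to compare row_number() with n() inside a grouped mutate. This is only a sketch on the example data from the answers above; it assumes the rows are already in the desired order within each nest, and result is an illustrative name.

```r
library(dplyr)

dat <- data.frame(nest   = c(1, 1, 2, 2, 2),
                  Status = c(1, 1, 1, 1, 1),
                  fate   = c(2, 2, 3, 3, 3))

result <- dat %>%
  group_by(nest) %>%
  # row_number() == n() is TRUE only for the last row of each group
  mutate(Status = if_else(row_number() == n(), fate, Status)) %>%
  ungroup()

result
```

if_else() requires both branches to have the same type, which holds here because Status and fate are both numeric.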

Using mutate to output value from named vector

I have a 2x1 data frame with the values 'sent1' and 'sent2'.
test.df <- data.frame(sentence = c('sent1', 'sent2'))
I also have a reference vector that has values for the combination of the 2 sentences and 3 categories (a, b, c).
test.vec <- c(sent1_a = 1,
sent1_b = 0,
sent1_c = 1,
sent2_a = 0,
sent2_b = 1,
sent2_c = 1)
I would like to create a new df that looks like this:
output.df <- data.frame(sentence = c('sent1', 'sent2'),
a = c(1,0),
b = c(0,1),
c = c(1,1))
output.df
# sentence a b c
#1 sent1 1 0 1
#2 sent2 0 1 1
Ideally, I would like to use mutate to select the relevant values from the vector based on the corresponding sentence that I'm looping through
results <- test.df %>%
mutate(a = test.vec[[paste0(sentence, '_a')]])
However, I'm getting an error on this.
Error in mutate_impl(.data, dots) :
Evaluation error: attempt to select more than one element in vectorIndex.
You can reshape test.vec to the output you need:
library(tidyverse)
data.frame(test.vec) %>%
tibble::rownames_to_column() %>%
separate(rowname, c('sentence', 'vars')) %>%
spread(vars, test.vec)
# sentence a b c
#1 sent1 1 0 1
#2 sent2 0 1 1
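As for the error in the question: [[ extracts exactly one element, but inside mutate() paste0(sentence, '_a') is a vector with one name per row. Single-bracket indexing accepts a vector of names, so a minimal sketch of the mutate approach could look like the following. unname() is added here only to drop the names carried over from test.vec.

```r
library(dplyr)

test.df <- data.frame(sentence = c("sent1", "sent2"))
test.vec <- c(sent1_a = 1, sent1_b = 0, sent1_c = 1,
              sent2_a = 0, sent2_b = 1, sent2_c = 1)

results <- test.df %>%
  mutate(a = unname(test.vec[paste0(sentence, "_a")]),  # `[` looks up one name per row
         b = unname(test.vec[paste0(sentence, "_b")]),
         c = unname(test.vec[paste0(sentence, "_c")]))

results
```

With many categories the reshaping approach above is likely less repetitive, since it does not need one line per category.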

filtering data.frame based on row_number()

UPDATE: dplyr has been updated since this question was asked and now performs as the OP wanted
I'm trying to get the second to the seventh line in a data.frame using dplyr.
I'm doing this:
require(dplyr)
df <- data.frame(id = 1:10, var = runif(10))
df <- df %>% filter(row_number() <= 7, row_number() >= 2)
But this throws an error.
Error in rank(x, ties.method = "first") :
argument "x" is missing, with no default
I know I could easily do:
df <- df %>% mutate(rn = row_number()) %>% filter(rn <= 7, rn >= 2)
But I would like to understand why my first try is not working.
Actually dplyr's slice function is made for this kind of subsetting:
df %>% slice(2:7)
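As the update at the top of the question notes, current dplyr versions accept a bare row_number() inside filter(). A small sketch comparing the two approaches, assuming a recent dplyr release (by_filter and by_slice are illustrative names):

```r
library(dplyr)

df <- data.frame(id = 1:10, var = runif(10))

# Position-based filtering with a bare row_number()
by_filter <- df %>% filter(row_number() >= 2, row_number() <= 7)

# The same rows selected by position with slice()
by_slice <- df %>% slice(2:7)

by_filter$id
by_slice$id
```

slice() states the intent more directly; the filter() form may be handier when the row position is combined with other conditions.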
(I'm a little late to the party but thought I'd add this for future readers)
The row_number() function does not simply return the row number of each element and so can't be used like you want:
• ‘row_number’: equivalent to ‘rank(ties.method = "first")’
You're not actually saying what you want the row_number of. In your case:
df %>% filter(row_number(id) <= 7, row_number(id) >= 2)
works because id is sorted and so row_number(id) is 1:10. I don't know what row_number() evaluates to in this context, but when called a second time dplyr has run out of things to feed it and you get the equivalent of:
> row_number()
Error in rank(x, ties.method = "first") :
argument "x" is missing, with no default
That's your error right there.
Anyway, that's not the way to select rows.
You simply need to subscript df[2:7,], or if you insist on pipes everywhere:
> df %>% "["(.,2:7,)
id var
2 2 0.52352994
3 3 0.02994982
4 4 0.90074801
5 5 0.68935493
6 6 0.57012344
7 7 0.01489950
Here is another way to do row-number based filtering in a pipeline.
df <- data.frame(id = 1:10, var = runif(10))
df %>% .[2:7,]
id var
2 2 0.28817
3 3 0.56672
4 4 0.96610
5 5 0.74772
6 6 0.75091
7 7 0.05165
Another option using subset:
df <- data.frame(id = 1:10, var = runif(10))
subset(df, row.names(df) %in% 2:7)
#> id var
#> 2 2 0.75924106
#> 3 3 0.17096427
#> 4 4 0.10886090
#> 5 5 0.98703882
#> 6 6 0.04190195
#> 7 7 0.73268672
Created on 2023-01-13 with reprex v2.0.2

Resources