How can I use mutate_if() to change values of b to NA when a > 25?
I can do it with ifelse(), but I feel like mutate_if() was created for exactly this kind of task.
library(tidyverse)
tbl <- tibble(a = c(10, 20, 30, 40, 10, 60),
              b = c(12, 23, 34, 45, 56, 67))
In this small example, I'm not sure you actually need mutate_if(). The _if part of mutate_if() determines which columns to subset and operate on; it is not an if condition applied while modifying values.
Rather, you can use mutate_at() to select your columns to operate on - either based on their exact name or by using vars(contains('your_string')).
See the help page for more info on the mutate_* functions: https://dplyr.tidyverse.org/reference/mutate_all.html
Here are 3 options, using mutate() and mutate_at():
# using mutate()
tbl %>%
  mutate(b = ifelse(a > 25, NA, b))

# using mutate_at() - we select only column 'b'
tbl %>%
  mutate_at(vars(c('b')), ~ ifelse(a > 25, NA, .))

# using mutate_at() with contains() - select only columns with 'b' in the name
tbl %>%
  mutate_at(vars(contains('b')), ~ ifelse(a > 25, NA, .))
Which all produce the same output:
# A tibble: 6 x 2
      a     b
  <dbl> <dbl>
1    10    12
2    20    23
3    30    NA
4    40    NA
5    10    56
6    60    NA
I know it's not mutate_if, but I suspect you don't actually need it.
The mutate_if() variant applies a predicate function (a function that
returns TRUE or FALSE) to each column to determine the relevant subset of columns; the condition is evaluated against the columns themselves, not against individual values. A typical use is performing a mathematical operation on all numeric columns, for example.
https://dplyr.tidyverse.org/reference/mutate_all.html
function (.tbl, .predicate, .funs, ...)
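For illustration, a minimal sketch of the intended use of mutate_if(), using the tbl from the question: the predicate (here is.numeric) selects the columns, and the function is applied to each of them (here it simply doubles every numeric column).
library(dplyr)

# is.numeric picks the columns; the function is applied to each selected column
tbl %>%
  mutate_if(is.numeric, ~ . * 2)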
library(dplyr)
# The code below gets the job done but, as Hugh Allan explained, it is
# probably not the right approach.
tbl %>%
  mutate_if(colnames(tbl) != 'a', ~ ifelse(a > 25, NA, .))
# A tibble: 6 x 2
      a     b
  <dbl> <dbl>
1    10    12
2    20    23
3    30    NA
4    40    NA
5    10    56
6    60    NA
We can use replace():
library(dplyr)
tbl %>%
  mutate(b = replace(b, a > 25, NA))
-output
# A tibble: 6 x 2
      a     b
  <dbl> <dbl>
1    10    12
2    20    23
3    30    NA
4    40    NA
5    10    56
6    60    NA
And for completeness, here are the base R and data.table variants.
tbl$b[tbl$a > 25] <- NA
tbl
#      a     b
#  <dbl> <dbl>
#1    10    12
#2    20    23
#3    30    NA
#4    40    NA
#5    10    56
#6    60    NA
In data.table -
library(data.table)
setDT(tbl)
tbl[a > 25, b := NA]
tbl
Related
I have a series of dates and I want to number each record in the sequence of dates, while skipping missing values.
Essentially, I want to see the following result, where a holds my dates and b is my index of the date record. You can see that row 5 is my 4th record, and row 7 is my 5th record.
tibble(a = c(12, 24, 32, NA, 55, NA, 73), b = c(1, 2, 3, NA, 4, NA, 5))
      a     b
  <dbl> <dbl>
1    12     1
2    24     2
3    32     3
4    NA    NA
5    55     4
6    NA    NA
7    73     5
It seems that group_by() %>% mutate(sq = sequence(n())) doesn't work in this case, because I don't know how to filter out the missing values while counting. I need to keep those missing values because my data is pretty large.
Is a separate operation of filtering the data, getting the sequence, and using left_join my best option?
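For reference, the filter / re-sequence / left_join approach described above could be sketched as follows (a hypothetical helper column rid is added purely to join the sequence back on); the answers below avoid the join entirely.
library(dplyr)

dat <- tibble(a = c(12, 24, 32, NA, 55, NA, 73))

# sketch of the "filter, sequence, join back" idea
dat %>%
  mutate(rid = row_number()) %>%                 # remember the original row
  left_join(
    dat %>%
      mutate(rid = row_number()) %>%
      filter(!is.na(a)) %>%                      # drop missing dates
      mutate(b = row_number()) %>%               # sequence the remaining rows
      select(rid, b),
    by = "rid"
  ) %>%
  select(-rid)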
library(dplyr)
dat <- tibble(a = c(12, 24, 32, NA, 55, NA, 73))
dat %>%
  mutate(sq = ifelse(is.na(a), NA, cumsum(!is.na(a))))
#> # A tibble: 7 x 2
#>       a    sq
#>   <dbl> <int>
#> 1    12     1
#> 2    24     2
#> 3    32     3
#> 4    NA    NA
#> 5    55     4
#> 6    NA    NA
#> 7    73     5
Cumulatively sum an indicator of non-NA values and then add 0 * a: because NA propagates through arithmetic, this turns any component that was originally NA into NA while adding 0 to the rest (leaving them unchanged).
a <- c(12, 24, 32, NA, 55, NA, 73)
cumsum(!is.na(a)) + 0 * a
## [1] 1 2 3 NA 4 NA 5
Maybe you can try replace + seq_along like below (using the dat defined above):
within(
  dat,
  b <- replace(a, !is.na(a), seq_along(na.omit(a)))
)
We could specify i as the non-NA logical vector and create 'b' by assigning the sequence of rows within that subset:
library(data.table)
setDT(dat)[!is.na(a), b := seq_len(.N)]
-output
dat
#     a  b
#1:  12  1
#2:  24  2
#3:  32  3
#4:  NA NA
#5:  55  4
#6:  NA NA
#7:  73  5
This is an extension to the post Collapse / concatenate / aggregate a column to a single comma separated string within each group.
Goal: aggregate multiple columns according to one grouping variable and separate the individual values by a separator of choice.
Reproducible example:
data <- data.frame(A = c(rep(111, 3), rep(222, 3)),
                   B = c(rep(100, 3), rep(200, 3)),
                   C = rep(c(1, 2, NA), 2),
                   D = 15:20,
                   E = rep(c(1, NA, NA), 2))
data
    A   B  C  D  E
1 111 100  1 15  1
2 111 100  2 16 NA
3 111 100 NA 17 NA
4 222 200  1 18  1
5 222 200  2 19 NA
6 222 200 NA 20 NA
A is the grouping variable, but B should still be displayed in the overall result (B depends on A in my application), and C, D, and E are the variables to be collapsed into separated character strings.
Desired Output
    A   B   C   D        E
1 111 100 1,2 15,16,17   1
2 222 200 1,2 18,19,20   1
I don't have a ton of experience with R. I did try to expand upon the solutions posted by G. Grothendieck to the linked post to meet my requirements but can't quite get it right for multiple columns.
What would be a proper implementation to get the desired output?
I focused specifically on group_by() with summarise_all(), and on aggregate(), in my attempts. They are a complete mess, so I don't believe it would even be helpful to display them.
EDIT:
The solutions posted work great at producing the desired result!
To continue improving the value of this post for those who find it: how could users select their own separator characters, e.g. '-' or '\n'?
The current solutions by akrun and tmfmnk both result in lists instead of a concatenated character string. Please correct me if I said this incorrectly.
> data$A
[1] 111 111 111 222 222 222
> data$B
[1] 100 100 100 200 200 200
> data$C
[1]  1  2 NA  1  2 NA
> data$D
[1] 15 16 17 18 19 20
> data$E
[1]  1 NA NA  1 NA NA
We can group by 'A', 'B', and use summarise_at to paste all the non-NA elements
library(dplyr)
data %>%
  group_by(A, B) %>%
  summarise_at(vars(-group_cols()), ~ toString(.[!is.na(.)]))
# A tibble: 2 x 5
# Groups:   A [2]
#      A     B C     D          E
#  <dbl> <dbl> <chr> <chr>      <chr>
#1   111   100 1, 2  15, 16, 17 1
#2   222   200 1, 2  18, 19, 20 1
If we need to pass a custom delimiter, use paste or str_c:
library(stringr)
data %>%
  group_by(A, B) %>%
  summarise_at(vars(-group_cols()), ~ str_c(.[!is.na(.)], collapse = "_"))
Or using base R with aggregate
aggregate(. ~ A + B, data,
          FUN = function(x) toString(x[!is.na(x)]),
          na.action = NULL)
With dplyr, you can do:
data %>%
  group_by(A, B) %>%
  summarise_all(~ toString(na.omit(.)))
      A     B C     D          E
  <dbl> <dbl> <chr> <chr>      <chr>
1   111   100 1, 2  15, 16, 17 1
2   222   200 1, 2  18, 19, 20 1
I'm cleaning up my dataset, a data frame with two columns obtained after joining multiple data frames. I'm trying to find a way to tell R to create a third column using the following rules:
If both columns contain non-NA values, the third column contains their average.
If one column contains an NA, the third column takes the value of the column without the missing value.
For example:
df1 <- data.frame(Var1 = c(34, 23, 23, NA, 32),
                  Var2 = c(NA, 34, NA, 35, 55))
df1
#   Var1 Var2
# 1   34   NA
# 2   23   34
# 3   23   NA
# 4   NA   35
# 5   32   55
The result I want is:
#   Var1 Var2 Var3
# 1   34   NA 34.0
# 2   23   34 28.5
# 3   23   NA 23.0
# 4   NA   35 35.0
# 5   32   55 43.5
We can use rowMeans here (assuming the missing values are coded as NA):
df1$Var3 <- rowMeans(df1, na.rm = TRUE)
If values above 100 or below 1 should also be changed to NA (that condition wasn't entirely clear), do that first and then take the rowMeans:
rowMeans(replace(df1, df1 < 1| df1 > 100, NA), na.rm = TRUE)
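If this should happen inside a dplyr pipeline, here is a minimal sketch of the same rowMeans idea (it assumes dplyr >= 1.0, which provides across()):
library(dplyr)

df1 <- data.frame(Var1 = c(34, 23, 23, NA, 32),
                  Var2 = c(NA, 34, NA, 35, 55))

# row-wise mean over the selected columns, ignoring NAs
df1 %>%
  mutate(Var3 = rowMeans(across(c(Var1, Var2)), na.rm = TRUE))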
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
#>   a b
#> 1 1 4
#> 2 2 5
#> 3 3 6
I figured out how to overwrite data with base R. Maybe that was day 1 of learning R.
df[2:3, 2] <- c(50, 60)
#>   a  b
#> 1 1  4
#> 2 2 50
#> 3 3 60
I never found an easy way to do it with dplyr. How do I overwrite data with the pipe %>%?
We can use replace within mutate. If we can use the column name, i.e. 'b', replace 'b' by passing the row indices to the list argument of replace and the new values as a vector:
library(dplyr)
df %>%
  mutate(b = replace(b, 2:3, c(50, 60)))
#  a  b
#1 1  4
#2 2 50
#3 3 60
Or specify the index of columns in mutate_at
df %>%
  mutate_at(2, replace, list = 2:3, values = c(50, 60))
Here is a simple example:
> df <- data.frame(sn=rep(c("a","b"), 3), t=c(10,10,20,20,25,25), r=c(7,8,10,15,11,17))
> df
  sn  t  r
1  a 10  7
2  b 10  8
3  a 20 10
4  b 20 15
5  a 25 11
6  b 25 17
Expected result is
  sn  t r
1  a 20 3
2  a 25 1
3  b 20 7
4  b 25 2
I want to group by a specific column ("sn"), leave some columns unchanged ("t" for this example), and apply diff() to remaining columns ("r" for this example).
I explored "dplyr" package to try something like:
df1 %>% group_by(sn) %>% do( ... diff(r)...)
but couldn't figure out correct code.
Can anyone recommend me a clean way to get expected result?
You can do it like this (I don't use diff directly because it returns n-1 values):
library(dplyr)
df %>%
  arrange(sn) %>%
  group_by(sn) %>%
  mutate(r = r - lag(r)) %>%
  slice(2:n())
####       sn     t     r
####   <fctr> <dbl> <dbl>
#### 1      a    20     3
#### 2      a    25     1
#### 3      b    20     7
#### 4      b    25     2
The slice function is there to remove the NA rows created by the differencing at the beginning of each group. One could also use na.omit instead, but it might remove other rows unintentionally.
We can also use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)), set the key to 'sn' (this orders the data by 'sn'), then, grouped by 'sn', take the difference of 'r' and its lag (shift in data.table does that) and remove the NA rows with na.omit.
library(data.table)
na.omit(setDT(df, key = "sn")[, r := r - shift(r), sn])
#   sn  t r
#1:  a 20 3
#2:  a 25 1
#3:  b 20 7
#4:  b 25 2
Or, if we are using diff, make sure the lengths match: the diff output is one element shorter than the original vector, so we pad with NA and later remove those rows with filter.
library(dplyr)
df %>%
  arrange(sn) %>%
  group_by(sn) %>%
  mutate(r = c(NA, diff(r))) %>%
  filter(!is.na(r))
#      sn     t     r
#  <fctr> <dbl> <dbl>
#1      a    20     3
#2      a    25     1
#3      b    20     7
#4      b    25     2