I am trying to find an R/tidyverse equivalent to Stata's replace b = a if missing(b).
Say I have these data:
library(tidyverse)
data <- data.frame(a=c(1:8), b= c(1:5, NA, NA, NA))
I am trying to replace the missing values in b with the values in a. I tried this:
data %<>% mutate(b = replace_na(b, a))
But I get an error. What can I do in the tidyverse to solve this problem?
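Since you want to fill each missing b with the value of a from the same row, I'd use coalesce from dplyr. replace_na expects a single replacement value for a vector, which is why it errors when given a whole column. Note that %<>% comes from magrittr and is not attached by library(tidyverse), so a plain assignment is used here:
data <- data %>% mutate(b = coalesce(b, a))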
Output:
data
a b
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
You can simply use ifelse in mutate:
data %>%
mutate(b = ifelse(is.na(b), a, b))
#> a b
#> 1 1 1
#> 2 2 2
#> 3 3 3
#> 4 4 4
#> 5 5 5
#> 6 6 6
#> 7 7 7
#> 8 8 8
Created on 2020-03-21 by the reprex package (v0.3.0)
In data.table, an option is fcoalesce
library(data.table)
setDT(data)[, b := fcoalesce(b, a)]
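For comparison, the closest literal translation of Stata's replace b = a if missing(b) is probably indexed assignment in base R. This is only a minimal sketch using the data from the question; the helper name miss is made up for illustration.

```r
# Same data as in the question
data <- data.frame(a = 1:8, b = c(1:5, NA, NA, NA))

# 'miss' (hypothetical name) marks the rows where b is missing
miss <- is.na(data$b)

# Overwrite only those rows of b with the matching rows of a
data$b[miss] <- data$a[miss]

data$b
```

Computing miss once before assigning keeps both sides of the assignment aligned on the same rows.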
I have the following data frame:
df <- data.frame(category=c("a","b","b","b","b","a","c"), value=c(1,5,3,6,7,4,6))
and I want to record the number of occurrences of each category so the output would be:
df <- data.frame(category=c("a","b","b","b","b","a","c"), value=c(1,5,3,6,7,4,6),
category_count=c(2,4,4,4,4,2,1))
Is there a simple way to do this?
# load package
library(data.table)
# set as data.table
setDT(df)
# count by category
df[, category_count := .N, category]
With dplyr:
library(dplyr)
df %>%
group_by(category) %>%
mutate(category_count = n()) %>%
ungroup()
# A tibble: 7 × 3
category value category_count
<chr> <dbl> <int>
1 a 1 2
2 b 5 4
3 b 3 4
4 b 6 4
5 b 7 4
6 a 4 2
7 c 6 1
base
df <- data.frame(category=c("a","b","b","b","b","a","c"), value=c(1,5,3,6,7,4,6),
category_count=c(2,4,4,4,4,2,1))
# ave() applies length() within each category and returns one value per row
df$res <- with(df, ave(seq_len(nrow(df)), category, FUN = length))
df
#> category value category_count res
#> 1 a 1 2 2
#> 2 b 5 4 4
#> 3 b 3 4 4
#> 4 b 6 4 4
#> 5 b 7 4 4
#> 6 a 4 2 2
#> 7 c 6 1 1
Created on 2022-02-08 by the reprex package (v2.0.1)
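Another dplyr option, if your version has it, is add_count, which should be equivalent to the group_by/mutate/ungroup chain in one call. A minimal sketch, assuming a reasonably recent dplyr (the name argument is not available in very old releases) and a column spelled category:

```r
library(dplyr)

df <- data.frame(category = c("a", "b", "b", "b", "b", "a", "c"),
                 value = c(1, 5, 3, 6, 7, 4, 6))

# add_count() appends the per-group row count without collapsing rows
res <- df %>% add_count(category, name = "category_count")

res$category_count
```

The row order of df is kept, so the counts line up with the original rows.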
Input
(Say d is the data frame below.)
a b c
1 5 7
2 6 8
3 7 9
I want to shift the contents of column b one position down and put an arbitrary number in the first position in b. How do I do this? I would appreciate any help in this regard. Thank you.
I tried c(6,tail(d["b"],-1)) but it does not produce (6,5,6).
Output
a b c
1 6 7
2 5 8
3 6 9
Use head instead of tail: head(x, -1) drops the last element, whereas tail(x, -1) drops the first.
df$b <- c(6, head(df$b, -1))
# a b c
#1 1 6 7
#2 2 5 8
#3 3 6 9
You could also use lag in dplyr
library(dplyr)
df %>% mutate(b = lag(b, default = 6))
Or shift in data.table
library(data.table)
setDT(df)[, b:= shift(b, fill = 6)]
A dplyr solution uses lag with an explicit default argument, if you prefer:
library(dplyr)
d <- tibble(a = 1:3, b = 5:7, c = 7:9)
d %>% mutate(b = lag(b, default = 6))
#> # A tibble: 3 x 3
#> a b c
#> <int> <dbl> <int>
#> 1 1 6 7
#> 2 2 5 8
#> 3 3 6 9
Created on 2019-12-05 by the reprex package (v0.3.0)
Here is a solution similar to the head approach by @Ronak Shah:
df <- within(df, b <- c(runif(1), b[-length(b)]))
where a uniformly distributed random number is placed in the first position of column b and the last value is dropped:
> df
a b c
1 1 0.6644704 7
2 2 5.0000000 8
3 3 6.0000000 9
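A more general option is lag (or lead) with the n argument, which shifts by any number of positions:
library(dplyr)
d <- data.frame(a=c(1,2,3),b=c(5,6,7),c=c(7,8,9))
# do not group by b here: every b is unique, so lagging within groups of b returns only NA
d1 <- d %>% arrange(b) %>%
mutate(b1 = dplyr::lag(b, n = 1, default = NA))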
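If you would rather avoid extra packages, the same idea can be wrapped in a small base R helper. This is only a sketch; shift_down and its arguments n and fill are hypothetical names, not part of any package.

```r
# Shift a vector down by n positions, padding the top with 'fill'.
# shift_down(), n and fill are illustrative names, not an existing API.
shift_down <- function(x, n = 1, fill = NA) {
  n <- min(n, length(x))                      # never shift further than the vector length
  c(rep(fill, n), x[seq_len(length(x) - n)])  # pad the top, drop the last n values
}

d <- data.frame(a = 1:3, b = c(5, 6, 7), c = 7:9)
d$b <- shift_down(d$b, n = 1, fill = 6)
d$b
# 6 5 6
```

Indexing with seq_len() rather than head(x, -n) is deliberate: head(x, -0) returns an empty vector, so the head form would break for n = 0.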
In dplyr 0.8.0, funs() is deprecated, and the new format is to use list() with ~. However, I have noticed that this no longer updates columns using mutate_at() as previously expected.
> set.seed(5)
> testdf <- data.frame(a = sample(1:9, size = 5, replace = TRUE),
+ b = 1:5,
+ c = LETTERS[1:5])
> testdf
a b c
1 2 1 A
2 7 2 B
3 9 3 C
4 3 4 D
5 1 5 E
Example of old code:
> testdf %>% mutate_at(.vars = c('a','b'), .funs = funs(. + 2))
a b c
1 4 3 A
2 9 4 B
3 11 5 C
4 5 6 D
5 3 7 E
Example of new code:
> testdf %>% mutate_at(.vars = c('a','b'), .funs = lst(~. + 2))
a b c a_~. + 2 b_~. + 2
1 2 1 A 4 3
2 7 2 B 9 4
3 9 3 C 11 5
4 3 4 D 5 6
5 1 5 E 3 7
EDIT: I just noticed that if I use list() this problem is resolved:
> testdf %>% mutate_at(.vars = c('a','b'), .funs = list(~. + 2))
a b c
1 4 3 A
2 9 4 B
3 11 5 C
4 5 6 D
5 3 7 E
However, I want to use lst() because my code regularly needs to unquote variables with !!!, which list() does not support (see here).
I'm not sure of the proper way to use lst() while retaining the original column names.
rlang::list2 is equivalent to list(...) but provides tidy dots semantics:
> testdf %>% mutate_at(.vars = c('a','b'), .funs = rlang::list2(~. + 2))
a b c
1 4 3 A
2 9 4 B
3 11 5 C
4 5 6 D
5 3 7 E
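The difference seems to come down to names: lst() names each element after its expression, and mutate_at() treats a named function list as a request for new suffixed columns, while an unnamed list updates the columns in place. Here is a minimal sketch of that, plus !!! splicing through list2(). The object names auto_named, plain and fns are made up for illustration, the a column is typed in by hand rather than sampled, and it assumes mutate_at() is still available in your dplyr.

```r
library(dplyr)

# lst() auto-names its elements; list() leaves them unnamed
auto_named <- tibble::lst(~ . + 2)
plain <- list(~ . + 2)

names(auto_named)  # a non-empty name derived from the formula text
names(plain)       # NULL

# 'fns' is a hypothetical stored list of functions to splice in with !!!
fns <- list(~ . + 2)

testdf <- data.frame(a = c(2, 7, 9, 3, 1), b = 1:5, c = LETTERS[1:5])

# list2() accepts !!! and returns an unnamed list, so a and b are updated in place
res <- testdf %>%
  mutate_at(.vars = c("a", "b"), .funs = rlang::list2(!!!fns))

names(res)
```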
I'm sure this question has been asked before, but I can't find the answer.
Here's my data:
df <- data.frame(group=c("a","a","a","b","b","c"), value=c(1,2,3,4,5,7))
df
#> group value
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 4
#> 5 b 5
#> 6 c 7
I'd like a 3rd column which has the sum of "value" for each "group", like so:
#> group value group_sum
#> 1 a 1 6
#> 2 a 2 6
#> 3 a 3 6
#> 4 b 4 9
#> 5 b 5 9
#> 6 c 7 7
How can I do this with dplyr?
Using dplyr -
df %>%
group_by(group) %>%
mutate(group_sum = sum(value))
Nobody mentioned data.table yet:
library(data.table)
dat <- data.table(df)
dat[, `:=`(sums = sum(value)), group]
Which transforms dat into:
group value sums
1: a 1 6
2: a 2 6
3: a 3 6
4: b 4 9
5: b 5 9
6: c 7 7
left_join(
df,
df %>% group_by(group) %>% summarise(group_sum = sum(value)),
by = c("group")
)
I don't know how to do it in one step, but
df_avg <- df %>% group_by(group) %>% summarize(group_sum=sum(value))
df %>% full_join(df_avg,by="group")
works. (This is basically equivalent to @KeqiangLi's answer.)
ave(), from base R, is useful here too:
df %>% mutate(group_sum=ave(value,group,FUN=sum))
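To see why the join-based answers need two steps while the mutate answer needs one, it may help to compare summarise and mutate side by side on the question's data. A small sketch; the object names collapsed and kept are made up for illustration.

```r
library(dplyr)

df <- data.frame(group = c("a", "a", "a", "b", "b", "c"),
                 value = c(1, 2, 3, 4, 5, 7))

# summarise() collapses to one row per group, so a join is needed to get back to 6 rows
collapsed <- df %>%
  group_by(group) %>%
  summarise(group_sum = sum(value))

# mutate() keeps every row and repeats the group total on each of them
kept <- df %>%
  group_by(group) %>%
  mutate(group_sum = sum(value)) %>%
  ungroup()

nrow(collapsed)  # 3
nrow(kept)       # 6
```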
How can I get a dense rank of multiple columns in a dataframe? For example,
# I have:
df <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3))
# I want:
res <- data.frame(x = c(1,1,1,1,2,2,2,3,3,3),
y = c(1,2,3,4,2,2,2,1,2,3),
r = c(1,2,3,4,5,5,5,6,7,8))
res
x y r
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 2 2 5
6 2 2 5
7 2 2 5
8 3 1 6
9 3 2 7
10 3 3 8
My hack approach works for this particular dataset:
df %>%
arrange(x,y) %>%
mutate(r = if_else(y - lag(y,default=0) == 0, 0, 1)) %>%
mutate(r = cumsum(r))
But there must be a more general solution, maybe using functions like dense_rank() or row_number(). But I'm struggling with this.
dplyr solutions are ideal.
Right after posting, I think I found a solution here. In my case, it would be:
mutate(df, r = dense_rank(interaction(x,y,lex.order=TRUE)))
But if you have a better solution, please share.
data.table
data.table has you covered with frank(). Use ties.method = 'dense' so that the ranks have no gaps after ties:
library(data.table)
frank(df, x,y, ties.method = 'dense')
[1] 1 2 3 4 5 5 5 6 7 8
You can use df$r <- frank(df, x,y, ties.method = 'dense') to add the result as a new column.
tidyr/dplyr
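Another option (though clunkier) is to use tidyr::unite to collapse your columns into one, then apply dplyr::dense_rank. Because the united column is a character vector, it is ranked in string order, which matches numeric order only when the values have the same number of digits, as they do here.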
library(tidyverse)
df %>%
# add a single column with all the info
unite(xy, x, y) %>%
cbind(df) %>%
# dense rank on that
mutate(r = dense_rank(xy)) %>%
# now drop the helper col
select(-xy)
You can use cur_group_id:
library(dplyr)
df %>%
group_by(x, y) %>%
mutate(r = cur_group_id())
# x y r
# <dbl> <dbl> <int>
# 1 1 1 1
# 2 1 2 2
# 3 1 3 3
# 4 1 4 4
# 5 2 2 5
# 6 2 2 5
# 7 2 2 5
# 8 3 1 6
# 9 3 2 7
# 10 3 3 8
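For completeness, the cumsum idea from the question can be generalised in base R by checking whether either column changes between consecutive sorted rows. This is only a sketch: dense_rank_xy is a hypothetical helper name, and it assumes neither column contains NA.

```r
# Dense rank over two columns: sort, flag rows where the (x, y) pair changes,
# then take the running count of changes. Assumes no NA values.
dense_rank_xy <- function(x, y) {
  o <- order(x, y)
  xs <- x[o]
  ys <- y[o]
  # TRUE for the first sorted row and whenever x or y differs from the previous row
  new_group <- c(TRUE, xs[-1] != xs[-length(xs)] | ys[-1] != ys[-length(ys)])
  r <- integer(length(x))
  r[o] <- cumsum(new_group)  # write ranks back in the original row order
  r
}

df <- data.frame(x = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
                 y = c(1, 2, 3, 4, 2, 2, 2, 1, 2, 3))
df$r <- dense_rank_xy(df$x, df$y)
df$r
# 1 2 3 4 5 5 5 6 7 8
```

Unlike the version in the question, this compares both x and y, so it does not depend on y happening to change whenever x does, and it does not require the data to be sorted beforehand.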