I am new to R, but experienced in Stata. To learn R, I am tracking Covid-19 infections. That requires creating seven-day trailing averages, and I do so with the following loop.
for (mylag in c(1:7)) {
data <- data %>% group_by(state) %>% mutate(!!paste0("deathIncrease", "_", mylag) := lag(deathIncrease, mylag)) %>% ungroup()
}
This works, but then I want to run the same code, not just for deaths, but also for cases. So I tried the following.
var_list <- c("deathIncrease", "positiveIncrease")
for (var in var_list) {
for (mylag in c(1:7)) {
var <- enquo(var)
varname <- enquo( paste0(quo_name(var), "_", mylag) )
data <- data %>% group_by(state) %>% mutate(!!varname := lag(!!var, mylag)) %>% ungroup()
}
}
But that leads to the error "arg must be a symbol". Any help would be much appreciated. In Stata, loops are simpler. Is there no package that gets R to automatically fill in the looping variables everywhere, like so: {{ var }}?
Edit: here is a minimal working example. The first way to create lags works, but only for var1. The second nested loop does not.
df <- tribble(
~group_var, ~var1, ~var2,
"A", 1, 10,
"A", 2, 11,
"A", 3, 12,
"B", 1, 10,
"B", 2, 11,
"B", 3, 12)
for (mylag in c(1:2)) {
df <- df %>% group_by(group_var) %>% mutate(!!paste0("var1", "_lag", mylag) := lag(var1, mylag)) %>% ungroup()
}
## Another loop
var_list <- c("var1", "var2")
for (myvar in var_list) {
for (mylag in c(1:2)) {
myvar <- enquo(myvar)
varname <- enquo( paste0(quo_name(myvar), "_", mylag) )
df <- df %>% group_by(group_var) %>% mutate(!!varname := lag(!!myvar, mylag)) %>% ungroup()
}
}
You can use the function get(), as in lag(get(myvar), mylag), to look up the column whose name is stored in the string myvar:
for(mylag in 1:7){
for(myvar in c('deathIncrease', 'positiveIncrease')){
data <- data %>%
group_by(state) %>%
mutate(
!!paste0(myvar, '_', mylag) := lag(get(myvar), mylag)
) %>%
ungroup()
}
}
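As a hedged alternative to get(), recent versions of dplyr (1.0 or later is assumed here) provide the .data pronoun for looking up a column by a string, and glue-style names on the left of :=, which comes close to the {{ var }} filling the question asks about. A minimal sketch on the question's example data, where the _lag suffix is only an illustrative naming choice:

```r
library(dplyr)

# Example data from the question
df <- tibble::tribble(
  ~group_var, ~var1, ~var2,
  "A", 1, 10,
  "A", 2, 11,
  "A", 3, 12,
  "B", 1, 10,
  "B", 2, 11,
  "B", 3, 12)

for (myvar in c("var1", "var2")) {
  for (mylag in 1:2) {
    df <- df %>%
      group_by(group_var) %>%
      # "{myvar}_lag{mylag}" is filled in from the loop variables;
      # .data[[myvar]] fetches the column named by the string myvar
      mutate("{myvar}_lag{mylag}" := lag(.data[[myvar]], mylag)) %>%
      ungroup()
  }
}
df
```

Unlike get(), .data[[myvar]] only ever looks inside the data frame, so a variable of the same name in the calling environment cannot be picked up by accident.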
My first solution contained a function that did not respect grouped data. I wanted to look into that anyway, so I spent a bit of time making it respect grouped data as well.
This is my solution now. It works as expected on grouped data, but it feels a bit hacky, to be honest.
add_lag <- function(.data, column, days) {
group <- unlist(groups(.data))
if(is.null(group)){
new <- mapply(function(x, y) {
lag(x, y)
}, x = .data[column], y = sort(rep(days, length(column))))
if(is.null(dim(new))){
new <- t(new)
}
new <- as.data.frame(new, stringsAsFactors = FALSE)
names(new) <- paste0(column, "_", sort(rep(days, length(column))))
new <- as_tibble(new) %>%
select(sort(names(new)))
mutate(.data, !!!new)
} else {
tmp <- .data %>%
nest()
tmp$data <- lapply(tmp$data, function(x,y){
x %>%
add_lag(column,y)
}, y = days)
tmp %>% unnest(c(data))
}
}
df <- tribble(
~group_var, ~var1, ~var2,
"A", 1, 10,
"A", 2, 11,
"A", 3, 12,
"B", 1, 10,
"B", 2, 11,
"B", 3, 12)
df %>%
group_by(group_var) %>%
add_lag("var1", 1:2)
# A tibble: 6 x 5
# Groups: group_var [2]
group_var var1 var2 var1_1 var1_2
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 10 NA NA
2 A 2 11 1 NA
3 A 3 12 2 1
4 B 1 10 NA NA
5 B 2 11 1 NA
6 B 3 12 2 1
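For comparison, here is a shorter sketch that may give the same kind of result without nesting. It assumes dplyr 1.0 or later and relies on across() accepting a named list of functions. The helper name lag_funs is hypothetical, chosen for illustration only.

```r
library(dplyr)

df <- tibble::tribble(
  ~group_var, ~var1, ~var2,
  "A", 1, 10,
  "A", 2, 11,
  "A", 3, 12,
  "B", 1, 10,
  "B", 2, 11,
  "B", 3, 12)

lags <- 1:2

# Build one small function per lag (lag_funs is a hypothetical helper name).
# force(k) pins the lag value inside each closure.
lag_funs <- lapply(lags, function(k) {
  force(k)
  function(x) dplyr::lag(x, k)
})
names(lag_funs) <- lags  # the names become the column suffixes

result <- df %>%
  group_by(group_var) %>%
  mutate(across(c(var1, var2), lag_funs, .names = "{.col}_{.fn}")) %>%
  ungroup()
result
```

Because across() runs inside mutate(), the grouping set by group_by() is respected automatically, so no separate code path for grouped data is needed.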
I'm trying to do a rowwise matching for vector elements nested inside a data.table list-column.
library(tibble)
library(data.table)
my_dt <-
tibble::tribble(
~col_x, ~col_y,
"a", c(1, 2, 3),
"b", c(4, 5, 6),
"c", c(7, 8, 9)
) %>%
as.data.table()
external_vec <- 1:9
I want to match each element from col_y against external_vec such that I will get a new column in my_dt that is equivalent to external_vec %in% col_y, such as:
# desired output
## col_x col_y new_col
## 1: a 1,2,3 TRUE, TRUE, TRUE,FALSE,FALSE,FALSE,...
## 2: b 4,5,6 FALSE,FALSE,FALSE, TRUE, TRUE, TRUE,...
## 3: c 7,8,9 FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,...
EDIT
I know how to implement a solution based on tidyverse's purrr, like the following. However, I'm looking for a data.table native functionality.
# this works but relies on purrr and dplyr, which I hope to avoid in this problem
my_dt %>%
tibble::add_column(ext_vec_as_col = list(external_vec)) %>%
mutate(new_col = purrr::map2(.x = ext_vec_as_col, .y = col_y, .f = ~.x %in% .y) )
#> col_x col_y ext_vec_as_col new_col
#> 1: a 1,2,3 1,2,3,4,5,6,... TRUE, TRUE, TRUE,FALSE,FALSE,FALSE,...
#> 2: b 4,5,6 1,2,3,4,5,6,... FALSE,FALSE,FALSE, TRUE, TRUE, TRUE,...
#> 3: c 7,8,9 1,2,3,4,5,6,... FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,...
I've got the tidyverse-solution:
library(tibble)
library(dplyr)
library(tidyr)
library(purrr)
library(magrittr)
my_dt <-
tibble::tribble(
~col_x, ~col_y,
"a", c(1, 2, 3),
"b", c(4, 5, 6),
"c", c(7, 8, 9)
)
external_vec <- 1:9
my_dt %>%
mutate(new_col = col_y %>%
map(~external_vec %in% .x)) %>%
unnest_wider(col_y, names_sep = "_") %>% # this is just to see the result..
unnest_wider(new_col, names_sep = "_") # this is just to see the result..
Usually, library(tidyverse) includes a lot of these, but I just thought that it would be nicer to know exactly what you need to make this work.
data.table solution
I think this is the equivalent:
my_dt[, new_col := lapply(col_y, \(x) external_vec %in% x)]
my_dt
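To check this without any tidyverse packages, the table can be built directly in data.table with a list column. This is only a sketch: it uses function(y) instead of the \(x) shorthand so that it also runs on R versions before 4.1, and the expected vector is written out by hand for the first row.

```r
library(data.table)

# Same data as in the question, built without tibble
my_dt <- data.table(
  col_x = c("a", "b", "c"),
  col_y = list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))
)
external_vec <- 1:9

# lapply() walks the list column row by row; := adds the result as a new list column
my_dt[, new_col := lapply(col_y, function(y) external_vec %in% y)]

# First row: only the first three elements of external_vec are in c(1, 2, 3)
identical(my_dt$new_col[[1]], c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE))
```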
Performance
Since performance is important, here is a benchmark:
bench::mark(
tidy =
my_dt %>%
mutate(new_col = col_y %>%
map(~external_vec %in% .x)),
dt = my_dt[, new_col := lapply(col_y, \(x) external_vec %in% x)],
op_tidy = my_dt %>%
tibble::add_column(ext_vec_as_col = list(external_vec)) %>%
mutate(new_col = purrr::map2(.x = ext_vec_as_col, .y = col_y, .f = ~.x %in% .y)),
check = FALSE
) %>%
plot()
I have a large dataset with about 167k rows. I would like to take a sample of 2000 rows of it while making sure I am taking rows from all groups in two columns (id & quality) in the data.
This is a snapshot of the data
df <- data.frame(id=c(1,2,3,4,5,1,2),
quality=c("a","b","c","d","z","g","t"))
df %>% glimpse()
Rows: 7
Columns: 2
$ id <dbl> 1, 2, 3, 4, 5, 1, 2
$ quality <chr> "a", "b", "c", "d", "z", "g", "t"
So, I need to ensure that the sampled data has rows from all combinations of these two group columns.
I hope someone can help out.
Thanks!
I think that's what you're looking for.
my_df <- data.frame(id = c(1, 2, 3, 4, 5, 1, 2, 2, 2),
quality = c("a", "b", "c", "d", "z", "g", "t", "t", "t"))
my_df <- my_df %>% group_by(id, quality) %>% mutate(Unique = cur_group_id())
my_df$Test <- seq.int(from = 1, to = nrow(my_df), by = 1)
my_a <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_b <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_c <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_d <- my_df %>% group_by(Unique) %>% sample_n(., 1)
my_e <- my_df %>% group_by(Unique) %>% sample_n(., 1)
You don't need that many data frames; they are just examples to show that one row is extracted at random from each unique group. The difference shows up in the column named "Test", especially for id = 2 and quality = t in the sample data.
If you want to make sure that each id and quality is represented in your new sample, you will need to group your data by these variables.
What you are looking for is the following,
df %>%
group_by(id,quality) %>%
sample_n(1, replace = TRUE)
You can change the sample size per group and set replacement as desired.
It gives the following output,
# Groups: id, quality [7]
id quality
<dbl> <chr>
1 1 a
2 1 g
3 2 b
4 2 t
5 3 c
6 4 d
7 5 z
The data that you provided has only unique groups, so sampling this way returns the same number of rows as your data.
Edit: sample_n is superseded by slice_sample, which I wasn't aware of. You can easily change the script to:
df %>%
group_by(id,quality) %>%
slice_sample(
n = 1
)
You can also sample a proportion of your data.frame by setting prop instead of n,
df %>%
group_by(id,quality) %>%
slice_sample(
prop = 0.25
)
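The question also asks for a fixed total (2000 rows) while covering every id/quality combination. One possible sketch, not taken from the answers above: draw one row per group first, then fill up to the target with random rows from what is left. The names target_n and row_id are hypothetical, and target_n is set to 8 here only because the toy data has 9 rows.

```r
library(dplyr)

set.seed(1)  # for a reproducible draw

my_df <- data.frame(id = c(1, 2, 3, 4, 5, 1, 2, 2, 2),
                    quality = c("a", "b", "c", "d", "z", "g", "t", "t", "t"))

target_n <- 8  # hypothetical total sample size; the question would use 2000

# Tag rows so the already-sampled ones can be excluded afterwards
my_df <- my_df %>% mutate(row_id = row_number())

# Step 1: one random row from every id/quality combination
one_per_group <- my_df %>%
  group_by(id, quality) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Step 2: top up with random rows from the remainder
extra <- my_df %>%
  anti_join(one_per_group, by = "row_id") %>%
  slice_sample(n = max(target_n - nrow(one_per_group), 0))

sampled <- bind_rows(one_per_group, extra)
nrow(sampled)
```

If the number of groups already exceeds the target, this sketch returns one row per group and therefore more rows than target_n; whether that is acceptable depends on the use case.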
I have the following dataframe:
id <- c("A", "B", "C")
col1 <- c(1, 3, 5)
col2 <- c(6, 12, 9)
col3 <- c(2, 4, 30)
df <- data.frame(id, col1, col2, col3)
Essentially, I want every i to be replaced by 20, 25, 30, 35, 40. This loop works but it works very, very slowly.
library(dplyr)
library(tibble)
library(foreach)
library(tidyverse)
library(purrr)
id <- c("A", "B", "C")
col1 <- c(1, 3, 5)
col2 <- c(6, 12, 9)
col3 <- c(2, 4, 30)
df <- data.frame(id, col1, col2, col3)
vals <- c(seq(from=20, to=40, by=5))
final <- foreach(i = vals, .combine='cbind') %do% {
# if cell is greater than i, then code 0
df_2 <- df %>% mutate(across(starts_with("col"), ~ +(. < i)))
# transpose the dataset
rownames(df_2) <- df_2$id
df_2$id <- NULL
df_2_t <- as.data.frame(t(df_2))
# sum the rows
df_2_t <- cbind(id = rownames(df_2_t), df_2_t)
rownames(df_2_t) <- 1:nrow(df_2_t)
df_2_t <- df_2_t %>%
mutate(sum = rowSums(.[2:ncol(.)]))
# merge a new column
id2 <- c("col1", "col2", "col3")
D <- c(3, 4, 5)
id_d <- data.frame(id2, D)
df_2_t_d <- left_join(df_2_t, id_d, by = c("id" = "id2"))
# divide D by the number of letters (there are 3 letter columns -- A, B, C)
df_2_t_d$letters <- rep(3)
df_2_t_d <- df_2_t_d %>%
mutate(frac = D/letters)
# recode all 1s to the frac
letters <- grep("^A|^B|^C", names(df_2_t_d))
df_2_t_d[letters] <- apply(df_2_t_d[letters], 2, function(x) ifelse(x == 1, df_2_t_d$frac, 0))
# drop two columns
df_2_t_d <- select(df_2_t_d, -c(D, letters))
# transpose again
rownames(df_2_t_d) <- df_2_t_d$id
df_2_t_d$id <- NULL
df_2_t_d2 <- as.data.frame(t(df_2_t_d))
df_2_t_d2_sum <- df_2_t_d2 %>%
mutate(rowSums(.[1:3])) %>%
transmute(!!paste0('sum_', i) := rowSums(select(., starts_with('col'))))
}
df_2_t_d2 <- cbind(list_name = rownames(df_2_t_d2), df_2_t_d2)
rownames(df_2_t_d2) <- 1:nrow(df_2_t_d2)
df_2_t_d2 <- select(df_2_t_d2, list_name)
abc <- cbind(df_2_t_d2, df_2_t_d2_sum)
View(abc)
If there's any way to speed it up, suggestions are welcome!
Here's a way to do this with map_dfc:
library(dplyr)
library(purrr)
vals <- seq(from=20, to=40, by=5)
bind_cols(
df, map_dfc(vals, function(x) df %>%
mutate(across(starts_with("col"), ~ +(. < x))) %>%
transmute(!!paste0('sum_', x) := rowSums(select(., starts_with('col'))))))
Or in base R :
cols <- grep('col', names(df))
df[paste0('sum_', vals)] <- lapply(vals, function(x) rowSums(+(df[cols] < x)))
df
# id col1 col2 col3 sum_20 sum_25 sum_30 sum_35 sum_40
#1 A 1 6 2 3 3 3 3 3
#2 B 3 12 4 3 3 3 3 3
#3 C 5 9 30 2 2 2 3 3
I am trying to filter on a column that has a numeric/date name, using an as.Date variable.
As an example, consider this small database:
dt <- data.table(
names = c("A", "B", "C"),
`2020-01-01` = c(1, NA, 2),
`2020-01-02` = c(3, 4, 5),
`2020-01-03` = c(6, 7, 8)
)
I am currently filtering the desired date column as follows:
dt1 <- dt %>% filter(!is.na(`2020-01-01`)) %>% select(names)
My idea is to create a meeting_date variable that serves as the date reference throughout my R code.
meeting_date <- as.Date("2020-01-01")
But of course the code:
dt1 <- dt %>% filter(!is.na(meeting_date)) %>% select(names)
does not work. The reason is the missing backticks, so I tried the following, without success:
dt1 <- dt %>% filter(!is.na(paste("`", meeting_date, "`", sep=""))) %>% select(names)
dt1 <- dt %>% filter(!is.na(noquote(paste("`", meeting_date, "`", sep="")))) %>% select(names)
Does anyone know how to proceed? Thanks!
You can do:
meeting_date <- as.Date("2020-01-01")
dt %>%
filter_at(vars(one_of(as.character(meeting_date))), ~ !is.na(.))
names 2020-01-01 2020-01-02 2020-01-03
1 A 1 3 6
2 C 2 5 8
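filter_at() is superseded in current dplyr. A hedged alternative is the .data pronoun, which looks up a column by the string stored in a variable. The sketch below assumes dplyr 1.0 or later and uses a tibble instead of a data.table to stay self-contained; date_col is a hypothetical helper name.

```r
library(dplyr)

dt <- tibble::tibble(
  names = c("A", "B", "C"),
  `2020-01-01` = c(1, NA, 2),
  `2020-01-02` = c(3, 4, 5),
  `2020-01-03` = c(6, 7, 8)
)

meeting_date <- as.Date("2020-01-01")
date_col <- as.character(meeting_date)  # "2020-01-01", matching the column name

# .data[[date_col]] selects the column by name; no backticks are needed
dt1 <- dt %>%
  filter(!is.na(.data[[date_col]])) %>%
  select(names)
dt1
```

The original attempts failed because paste() returns a plain string, and is.na() of a non-missing string is always FALSE, so every row was kept.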
Long data should be easier to work with:
library(data.table)
library(dplyr)
library(tidyr)
dt <- data.table(
names = c("A", "B", "C"),
`2020-01-01` = c(1, NA, 2),
`2020-01-02` = c(3, 4, 5),
`2020-01-03` = c(6, 7, 8)
)
# Make the data 'long', then convert the new 'name' column to a proper date column
dt_long <- dt %>% pivot_longer(-names) %>%
mutate(date = lubridate::ymd(name)) %>%
select(-name)
meeting_date <- as.Date("2020-01-01")
dt_long %>% filter(date == meeting_date & (!is.na(value)))
You can use subset + is.na as below
meeting_date <- "2020-01-01"
dtout <- subset(dt,as.vector(!is.na(dt[, ..meeting_date])))
such that
> dtout
names 2020-01-01 2020-01-02 2020-01-03
1: A 1 3 6
2: C 2 5 8
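A slightly shorter data.table-only variant may be to call get() inside i, which fetches the column named by a string. This is a sketch under the assumption that no other object called "2020-01-01" exists in scope; date_col is a hypothetical helper name.

```r
library(data.table)

dt <- data.table(
  names = c("A", "B", "C"),
  `2020-01-01` = c(1, NA, 2),
  `2020-01-02` = c(3, 4, 5),
  `2020-01-03` = c(6, 7, 8)
)

meeting_date <- as.Date("2020-01-01")
date_col <- format(meeting_date)  # character "2020-01-01"

# get(date_col) is evaluated within dt, so it returns that column's values
dtout <- dt[!is.na(get(date_col))]
dtout
```

Starting from a Date and converting with format() keeps meeting_date usable as a real date elsewhere in the script.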
I am trying to create a function that passes a list of column names to a dplyr function. I know how to do this if the list of column names is given in the ... form, as explained in the tidyeval documentation:
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 2, 1),
a = sample(5),
b = sample(5)
)
my_summarise <- function(df, ...) {
group_var <- quos(...)
df %>%
group_by(!!!group_var) %>%
summarise(a = mean(a))
}
my_summarise(df, g1, g2)
But if I want to list the column names as an argument of the function, the above solution will not work (of course):
my_summarise <- function(df, group_var, sum_var) {
group_var <- quos(group_var) # nor enquo(group_var)
sum_var <- enquo(sum_var)
df %>%
group_by(!!!group_var) %>%
summarise(a = mean(a))
}
my_summarise(df, list(g1, g2), a)
my_summarise(df, list(g1, g2), b)
How can I get the items inside the list to be quoted individually?
This question is similar to Passing dataframe column names in a function inside another function but in the comments it was suggested to use strings, while here I would like to use bare column names.
library(dplyr)
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c(1, 2, 1, 2, 1),
a = sample(5),
b = sample(5)
)
my_summarise = function(df, group_var, fun_name) {
df %>%
group_by(!!! group_var) %>%
summarize_all(fun_name)
}
my_summarise(df, alist(g1, g2), mean)
alist() handles the arguments 'g1' and 'g2' as function arguments (it does not evaluate them), while !!! (same as UQS()) unquotes and splices the list. sum_var is not necessary, as it looks like you want to take the mean of both 'a' and 'b'. You can also generalize it by passing in the function as well.
You could pass your list of arguments using alist instead of list, as it won't evaluate the arguments.
my_summarise = function(df, group_var, sum_var) {
group_var = quos(!!! group_var)
sum_var = enquo(sum_var)
df %>%
group_by(!!! group_var) %>%
summarise(!! quo_name( sum_var) := mean( !! sum_var) )
}
my_summarise(df, alist(g1, g2), b)
# A tibble: 4 x 3
# Groups: g1 [?]
g1 g2 b
<dbl> <dbl> <dbl>
1 1 1 2.0
2 1 2 3.0
3 2 1 4.5
4 2 2 1.0
Another alternative would be to pass that argument directly with quos instead of list, as shown in this answer, which bypasses some complications altogether.
my_summarise = function(df, group_var, sum_var) {
# group_var = quos(!!! group_var)
sum_var = enquo(sum_var)
df %>%
group_by(!!! group_var) %>%
summarise(!! quo_name( sum_var) := mean( !! sum_var) )
}
my_summarise(df, quos(g1, g2), b)
# A tibble: 4 x 3
# Groups: g1 [?]
g1 g2 b
<dbl> <dbl> <dbl>
1 1 1 2.0
2 1 2 3.0
3 2 1 4.5
4 2 2 1.0
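With newer tidyverse releases (dplyr 1.0 or later and a matching rlang are assumed here), the same idea can be sketched with the {{ }} operator and across(), so the caller passes bare names wrapped in c() rather than alist() or quos(). The fixed values for a and b replace sample(5) only to make the result reproducible.

```r
library(dplyr)

df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c(1, 2, 1, 2, 1),
  a = c(1, 2, 3, 4, 5),   # fixed values instead of sample(5)
  b = c(5, 4, 3, 2, 1)
)

my_summarise <- function(df, group_vars, sum_var) {
  df %>%
    # across() turns the tidyselect expression c(g1, g2) into grouping columns
    group_by(across({{ group_vars }})) %>%
    # "{{ sum_var }}" names the output column after the argument that was passed
    summarise("{{ sum_var }}" := mean({{ sum_var }}), .groups = "drop")
}

res <- my_summarise(df, c(g1, g2), b)
res
```

Here .groups = "drop" returns an ungrouped result; leave it out to keep dplyr's default of dropping only the last grouping level.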