I have a dataframe where I would like to keep a row as soon as the cumulative value of a column reaches a certain level. The dataset could look like this:
set.seed(0)
n <- 10
dat <- data.frame(id = 1:n,
                  group = rep(LETTERS[1:2], n/2),
                  age = sample(18:30, n, replace = TRUE),
                  type = factor(paste("type", 1:n)),
                  x = abs(rnorm(n)))
dat
id group age type x
1 1 A 26 type 1 0.928567035
2 2 B 21 type 2 0.294720447
3 3 A 24 type 3 0.005767173
4 4 B 18 type 4 2.404653389
5 5 A 19 type 5 0.763593461
6 6 B 30 type 6 0.799009249
7 7 A 24 type 7 1.147657009
8 8 B 28 type 8 0.289461574
9 9 A 19 type 9 0.299215118
10 10 B 28 type 10 0.411510833
I want to keep a row as soon as the cumulative value of x reaches a threshold (e.g. 1), with the count restarting as soon as a row is retained. This would result in the following output:
id group age type x
2 2 B 21 type 2 0.294720447
4 4 B 18 type 4 2.404653389
6 6 B 30 type 6 0.799009249
7 7 A 24 type 7 1.147657009
10 10 B 28 type 10 0.411510833
I am trying to get a dplyr based solution but can't seem to figure it out. Any tips?
You can use purrr::accumulate to compute a cumulative sum that resets once it passes the threshold, then use dplyr::slice_tail to keep the last row of each group, i.e. the row where the running sum crosses the threshold:
library(dplyr)
library(purrr)
dat %>%
  group_by(gp = cumsum(x == accumulate(x, ~ ifelse(.x <= 1, .x + .y, .y)))) %>%
  slice_tail(n = 1)
# id group age type x gp
# 1 2 B 21 type 2 0.295 1
# 2 4 B 18 type 4 2.40 2
# 3 6 B 30 type 6 0.799 3
# 4 7 A 24 type 7 1.15 4
# 5 10 B 28 type 10 0.412 5
Another option is to use MESS::cumsumbinning, which might be more friendly to use:
library(MESS)
library(dplyr)
dat %>%
  group_by(a = cumsumbinning(x, 1, cutwhenpassed = TRUE)) %>%
  slice_tail(n = 1)
Mael beat me with the cumsumbinning() from the MESS-package...
Here is a data.table option using that function:
library(MESS)
library(data.table)
setDT(dat)[, .SD[.N], by = MESS::cumsumbinning(x, 1, cutwhenpassed = TRUE)]
# MESS id group age type
# 1: 1 2 B 21 type 2
# 2: 2 4 B 18 type 4
# 3: 3 6 B 30 type 6
# 4: 4 7 A 24 type 7
# 5: 5 10 B 28 type 10
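For completeness, the same resetting cumulative sum can be built in base R with Reduce(); a minimal sketch on the dat data frame from the question, assuming the threshold of 1 used above:

thr <- 1
# Running sum that resets after it passes the threshold
run <- Reduce(function(acc, val) if (acc > thr) val else acc + val,
              dat$x, accumulate = TRUE)
# Keep the rows where the running total first exceeds the threshold
dat[run > thr, ]

Note that, unlike the slice_tail() approach, a trailing run that never reaches the threshold is dropped here rather than kept.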
I have a data frame with this structure:
var
A1sometext_r2
BXother_r11
A1sometext_r4
C7sometext_r8
I would like a new column that stores the number that follows the "r":
var new
A1sometext_r2 2
BXother_r11 11
A1sometext_r4 4
C7sometext_r8 8
I'm trying to incorporate this into a pipe, so a tidyverse solution would be better.
Thanks!
You can do it like this:
tibble(var = paste0('lala_r', sample(1:20, 15))) %>%
  dplyr::mutate(
    new = stringr::str_replace_all(var, '.*_r([0-9]*)$', '\\1'),
    new = as.integer(new)
  )
Output:
# A tibble: 15 x 2
var new
<chr> <int>
1 lala_r8 8
2 lala_r11 11
3 lala_r16 16
4 lala_r7 7
5 lala_r1 1
6 lala_r10 10
7 lala_r12 12
8 lala_r9 9
9 lala_r18 18
10 lala_r6 6
11 lala_r3 3
12 lala_r20 20
13 lala_r4 4
14 lala_r14 14
15 lala_r15 15
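Applied to data shaped like the question's (assuming the data frame is called df with a column named var), the same idea in a pipe would be:

library(dplyr)
library(stringr)

df <- tibble::tibble(var = c("A1sometext_r2", "BXother_r11",
                             "A1sometext_r4", "C7sometext_r8"))

# Capture the digits after the final "_r" and convert them to integer
df %>%
  mutate(new = as.integer(str_replace(var, '.*_r([0-9]+)$', '\\1')))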
I have a table with hundreds of columns. Their names end either with .a or .b.
What I need is to rename all of them at once: each column.a should become column.a_new and each column.b should become column.b_new.
I can do it only one pattern at a time, but I don't know how to do it for all columns at once.
rename_at_example <- my_table %>% rename_at(vars(ends_with(".a")),
                                            funs(str_replace(., ".a", ".a_new")))
Any idea how to write it in a compact way for all columns?
Thank you
One dplyr option could be:
df %>%
  rename_at(vars(matches("[ab]$")), ~ paste0(., "_new"))
col1a_new col2a_new col1b_new col2b_new col1c col2c
1 1 11 1 11 1 11
2 2 12 2 12 2 12
3 3 13 3 13 3 13
4 4 14 4 14 4 14
5 5 15 5 15 5 15
6 6 16 6 16 6 16
7 7 17 7 17 7 17
8 8 18 8 18 8 18
9 9 19 9 19 9 19
10 10 20 10 20 10 20
Sample data:
df <- data.frame(col1a = 1:10,
                 col2a = 11:20,
                 col1b = 1:10,
                 col2b = 11:20,
                 col1c = 1:10,
                 col2c = 11:20,
                 stringsAsFactors = FALSE)
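As a side note, in dplyr 1.0 and later rename_at() is superseded by rename_with(); a minimal sketch of the equivalent call on the same df:

library(dplyr)

# rename_with() applies a renaming function to the selected columns
df %>%
  rename_with(~ paste0(.x, "_new"), .cols = matches("[ab]$"))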
If '.a' names and '.b' names don't require the same replacement/action, e.g. adding '_new' to the end, you could use reduce2
library(tidyverse) # dplyr + purrr for reduce2
df <- data.frame(one.a = 1, one.d = 2, twoa = 3, two.b = 4, three.a = 5)
df
# one.a one.d twoa two.b three.a
# 1 1 2 3 4 5
df %>%
  rename_all(~ reduce2(c('\\.a$', '\\.b$'), c('.a_new1', '.b_new2'),
                       str_replace, .init = .x))
# one.a_new1 one.d twoa two.b_new2 three.a_new1
# 1 1 2 3 4 5
I am attempting to reference existing columns in dplyr through a loop. Effectively, I would like the operations defined in one table (evaluation in the example below) to be applied to another table (dt in the example below). I do not want to hardcode the column names on the RHS within mutate(); instead, I want to control which evaluations are performed from the evaluation table, so the process stays dynamic.
Here is a sample dataframe:
dt = data.frame(
  A = c(1:20),
  B = c(11:30),
  C = c(21:40),
  AA = rep(1, 20),
  BB = rep(2, 20)
)
Here is a table of sample operations to be performed:
evaluation = data.frame(
  New_Var = c("AA", "BB"),
  Operation = c("(A*2) > B", "(B*2) <= C"),
  Result = c("True", "False")
) %>% mutate_all(as.character)
What I am trying to do is the following:
for (i in 1:nrow(evaluation)) {
  var = evaluation$New_Var[i]
  dt = dt %>%
    rowwise() %>%
    mutate(!!var := ifelse(eval(parse(text = evaluation$Operation[i])),
                           evaluation$Result[i],
                           !!var))
}
My desired result would be something like this, except that the "AA" entries in the AA column would be the original numeric values of the AA column (1, 1, 1, 1, 1).
UPDATED:
I believe my syntax in the "False" part of the ifelse statement is incorrect. What is the correct syntax to specify "!!var" in the false portion of the ifelse statement?
I know there are other ways to do it using base R, but I would rather do it through dplyr as the code is cleaner to look at. I am leveraging rowwise() to do it element by element.
Modified data to (a) enforce type consistency for columns AA and BB and (b) ensure that at least one row satisfies the second condition.
dt = tibble(
  A = c(1:20),
  B = c(10:29),       ## Note the change
  C = c(21:40),
  AA = rep("a", 20),  ## Note initialization with strings
  BB = rep("b", 20)   ## Ditto
)
To make your loop work, you need to convert your code strings into actual expressions. You can use rlang::sym() for variable names and rlang::parse_expr() for everything else.
for (i in 1:nrow(evaluation)) {
  var <- rlang::sym(evaluation$New_Var[i])
  op <- rlang::parse_expr(evaluation$Operation[i])
  dt = dt %>% rowwise() %>%
    mutate(!!var := ifelse(!!op, evaluation$Result[i], !!var))
}
# # A tibble: 20 x 5
# A B C AA BB
# <int> <int> <int> <chr> <chr>
# 1 1 10 21 a False
# 2 2 11 22 a False
# 3 3 12 23 a b
# 4 4 13 24 a b
# 5 5 14 25 a b
# 6 6 15 26 a b
# 7 7 16 27 a b
# 8 8 17 28 a b
# 9 9 18 29 a b
# 10 10 19 30 True b
# 11 11 20 31 True b
# 12 12 21 32 True b
# 13 13 22 33 True b
# 14 14 23 34 True b
# 15 15 24 35 True b
# 16 16 25 36 True b
# 17 17 26 37 True b
# 18 18 27 38 True b
# 19 19 28 39 True b
# 20 20 29 40 True b
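Since ifelse() works on whole columns, the rowwise() step is not strictly necessary; a minimal sketch of the same loop without it (same evaluation table as above):

for (i in 1:nrow(evaluation)) {
  var <- rlang::sym(evaluation$New_Var[i])
  op <- rlang::parse_expr(evaluation$Operation[i])
  # ifelse() is vectorized, so mutate() can operate on the full column at once
  dt <- dt %>%
    mutate(!!var := ifelse(!!op, evaluation$Result[i], !!var))
}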
Assuming that Felipe's answer was the functionality you desired, here's a more "tidyverse"/pipe-oriented/functional approach.
Data
library(rlang)
library(dplyr)
library(purrr)
operations <- tibble(
  old_var = exprs(A, B),
  new_var = exprs(AA, BB),
  test = exprs(2*A > B, 2*B <= C),
  result = exprs("True", "False")
)
original <- tibble(
  A = sample.int(30, 10),
  B = sample.int(30, 10),
  C = sample.int(30, 10)
)
original
# A tibble: 10 x 3
A B C
<int> <int> <int>
1 4 20 5
2 30 29 11
3 1 27 14
4 2 21 4
5 17 19 24
6 14 25 9
7 5 22 22
8 6 13 7
9 25 4 21
10 12 11 12
Functions
# Here's your reusable functions
generic_mutate <- function(dat, new_var, test, result, old_var) {
  dat %>% mutate(!!new_var := ifelse(!!test, !!result, !!old_var))
}
generic_ops <- function(dat, ops) {
  pmap(ops, generic_mutate, dat = dat) %>%
    reduce(full_join)
}
generic_mutate takes a single original dataframe, a single new_var, etc. It performs the test, adds the new column with the appropriate name and values.
generic_ops is the "vectorized" version. It takes the original dataframe as the first argument and a dataframe of operations as the second. It then maps in parallel over the columns of new variable names, tests, etc., and calls generic_mutate on each set. That results in a list of dataframes, each with one added column. The reduce then combines them all back together with a sequential full_join.
Results
original %>%
  generic_ops(operations)
Joining, by = c("A", "B", "C")
# A tibble: 10 x 5
A B C AA BB
<int> <int> <int> <chr> <chr>
1 4 20 5 4 20
2 30 29 11 True 29
3 1 27 14 1 27
4 2 21 4 2 21
5 17 19 24 True 19
6 14 25 9 True 25
7 5 22 22 5 22
8 6 13 7 6 13
9 25 4 21 True False
10 12 11 12 True 11
The magic here is using exprs(...) so you can store NSE names and operations in a tibble without forcing their evaluation. I think this is a lot cleaner than storing names and operations in strings with quotation marks.
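For illustration, a minimal sketch of what exprs() captures, evaluated with base eval() against hypothetical values:

library(rlang)

e <- exprs(2 * A > B)   # a list holding one unevaluated call
e[[1]]
#> 2 * A > B

# The call only produces a value once evaluated in a data context
eval(e[[1]], list(A = 10, B = 5))
#> [1] TRUE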
How's this:
evaluation = data.frame(
  Old_Var = c('A', 'B'),
  New_Var = c("AA", "BB"),
  Operation = c("(A*2) > B", "(B*2) <= C"),
  Result = c("True", "False")
) %>% mutate_all(as.character)
for (i in 1:nrow(evaluation)) {
  old <- sym(evaluation$Old_Var[i])
  new <- sym(evaluation$New_Var[i])
  op <- sym(evaluation$Operation[i])
  res <- sym(evaluation$Result[i])
  dt <- dt %>%
    mutate(!!new := ifelse(!!op, !!res, !!old))
}
EDIT: My last answer doesn't work because rlang tries to find a variable named !!op (e.g. named (A*2) > B) instead of evaluating the expression. I got this to work using a mix of tidyselect and base R. You can of course follow #Brian's advice and use this solution with pmap. I honestly don't know how well this will perform though, as I think it will evaluate the ifelse once per row, and am not sure it's a vectorized operation...
dt <- tibble(
  A = c(1:20),
  B = c(11:30),
  C = c(21:40),
  AA = rep(1, 20),
  BB = rep(2, 20)
)
evaluation = tibble(
  Old_Var = c('A', 'B'),
  New_Var = c("AA", "BB"),
  Operation = c('(A*2) > B', '(B*2) <= C'),
  Result = c("True", "False")
)
for (i in 1:nrow(evaluation)) {
  old <- evaluation$Old_Var[i]
  new <- evaluation$New_Var[i]
  op <- evaluation$Operation[i]
  res <- evaluation$Result[i]
  dt <- dt %>%
    mutate(!!sym(new) := eval(parse(text = sprintf('ifelse(%s, "%s", %s)', op, res, old))))
}
One way is to rework the conditions first, then pass them to mutate:
conds <- parse(text = evaluation$Operation) %>%
  as.list() %>%
  setNames(evaluation$New_Var) %>%
  imap(~ expr(ifelse(!!., "True", !!sym(.y))))
conds
#> $AA
#> ifelse((A * 2) > B, "True", AA)
#>
#> $BB
#> ifelse((B * 2) <= C, "True", BB)
dt %>% mutate(!!!conds)
#> A B C AA BB
#> 1 1 11 21 1 2
#> 2 2 12 22 1 2
#> 3 3 13 23 1 2
#> 4 4 14 24 1 2
#> 5 5 15 25 1 2
#> 6 6 16 26 1 2
#> 7 7 17 27 1 2
#> 8 8 18 28 1 2
#> 9 9 19 29 1 2
#> 10 10 20 30 1 2
#> 11 11 21 31 True 2
#> 12 12 22 32 True 2
#> 13 13 23 33 True 2
#> 14 14 24 34 True 2
#> 15 15 25 35 True 2
#> 16 16 26 36 True 2
#> 17 17 27 37 True 2
#> 18 18 28 38 True 2
#> 19 19 29 39 True 2
#> 20 20 30 40 True 2
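Note that this version hardcodes "True" rather than reading evaluation$Result. A sketch that also threads the Result column through, assuming purrr and rlang are loaded and the evaluation table includes the Result column as in the question:

conds2 <- pmap(list(as.list(parse(text = evaluation$Operation)),
                    evaluation$Result,
                    syms(evaluation$New_Var)),
               function(op, res, var) expr(ifelse(!!op, !!res, !!var))) %>%
  setNames(evaluation$New_Var)

dt %>% mutate(!!!conds2)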
I can get a frequency table, but how can I save it to a CSV file or, better, sort it or extract the largest values?
library(plyr)
count(birthdaysExample, 'month')
I'm guessing at what the relevant part of your data looks like, but in any case this should get you a frequency table sorted by values:
library(plyr)
birthdaysExample <- data.frame(month = round(runif(200, 1, 12)))
freq_df <- count(birthdaysExample, 'month')
freq_df[order(freq_df$freq, decreasing = TRUE), ]
This gives you:
month freq
5 5 29
9 9 24
3 3 22
4 4 18
6 6 17
7 7 15
2 2 14
10 10 14
11 11 14
8 8 13
1 1 10
12 12 10
To get the highest 3 values:
library(magrittr)
freq_df[order(freq_df$freq, decreasing = TRUE), ] %>% head(., 3)
month freq
5 5 29
9 9 24
3 3 22
Or, with just base R:
head(freq_df[order(freq_df$freq, decreasing = TRUE), ], 3)
With dplyr
dplyr is a newer approach to many routine data manipulations in R (one of many tutorials) that is a bit more intuitive:
library(dplyr)
library(magrittr)
freq_df2 <- birthdaysExample %>%
  group_by(month) %>%
  summarize(freq = n()) %>%
  arrange(desc(freq))
freq_df2
This returns:
Source: local data frame [12 x 2]
month freq
1 5 29
2 9 24
3 3 22
4 4 18
5 6 17
6 7 15
7 2 14
8 10 14
9 11 14
10 8 13
11 1 10
12 12 10
The object it returns is a tibble rather than a plain data frame, so if you want to use base R functions with it, it might be easier to convert it back, with something like:
my_df <- as.data.frame(freq_df2)
And if you really want, you can write this to a CSV file with:
write.csv(my_df, file="foo.csv")