I have a column called "equation" which stores a formula in terms of "t". Another column is "t". I want to calculate each equation's value (y) using the t in the same row. Below is an example.
library(magrittr);library(dplyr)
dt <- data.frame(t = c(1,2,3),
equation = c("t+1", "5*t", "t^3"))
dt %<>%
mutate(y = eval(parse(text = equation)))
However, the results are not what I expected:
t equation y
1 t+1 1
2 5*t 8
3 t^3 27
The expected results for y are 2, 10, 27, so only the third y is correct. What should I do to fix it?
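This is because eval(parse()) isn't vectorised: parse() turns the whole equation column into three expressions, and eval() returns only the value of the last one (t^3), computed over the entire t column. That is why only the third y is correct. You can get around this using rowwise(), which evaluates one row at a time: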
library(magrittr)
library(dplyr, warn.conflicts = FALSE)
dt <- data.frame(
t = c(1,2,3),
equation = c("t+1", "5*t", "t^3")
)
dt %<>%
rowwise() %>%
mutate(y = eval(parse(text = equation))) %>%
ungroup()
#> # A tibble: 3 × 3
#> t equation y
#> <dbl> <chr> <dbl>
#> 1 1 t+1 2
#> 2 2 5*t 10
#> 3 3 t^3 27
Created on 2022-10-14 with reprex v2.0.2
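To see why only the last row came out right, here is a small base R sketch (my own illustration, not part of the answer above) of what parse() and eval() do with a character vector. The argument name t_i is made up for the example, and the mapply() version is just one way to evaluate row by row without dplyr.

```r
t <- c(1, 2, 3)
equation <- c("t+1", "5*t", "t^3")

# parse() turns the whole character vector into one expression vector
exprs <- parse(text = equation)
length(exprs)

# eval() evaluates the expressions in turn and returns only the last one,
# i.e. t^3 computed over the whole vector t
last_only <- eval(exprs)
last_only

# Evaluate each equation with only its own t (t_i is a hypothetical name)
y <- mapply(
  function(eq, t_i) eval(parse(text = eq), list(t = t_i)),
  equation, t,
  USE.NAMES = FALSE
)
y
```

Here last_only is the t^3 column that showed up in the question, while y holds one value per row.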
I am still quite new to R, so I am trying to figure out what I am doing wrong in the following.
I am trying to calculate the expanding mean over time per subgroup for a dataframe. My code works when there is only a single subgroup in the dataframe, but starts to break when multiple subgroups are available within the dataframe.
Apologies if I have overlooked something, but I can't figure out where exactly my code is incorrect. My hunch is that I am not specifying the width correctly, but I have not been able to figure out how to change width to a dynamically expanding window over time per subgroup.
See my data below:
sample file
See my code below:
library(ggplot2)
library(zoo)
library(RcppRoll)
library(dplyr)
x <- read.csv("stackoverflow.csv")
x$datatime <- as.POSIXlt(x$datatime,format="%m/%d/%Y %H:%M",tz=Sys.timezone())
x$Event <- as.factor(x$Event)
x2 <- arrange(x,x$Event,x$datatime) %>%
group_by(x$Event) %>%
mutate(ma=rollapply(data = x$Actual, width=seq_along(x$Actual), FUN=mean,
partial=TRUE, fill=NA,
align = "right"))
Any help is very much appreciated!
Thanks
EDIT:
A fix has been found! Thanks to all the useful feedback.
The working code is:
x <-
arrange(x,x$Event,x$datatime) %>%
group_by(Event) %>%
mutate(ma=rollapply(data = Actual,
width=seq_along(Actual),
FUN=mean,
partial=TRUE,
fill=NA,
align = "right"))
I think the problem here is that you're using x$ to extract columns from the original data in mutate(), rather than using the column name directly to refer to the column in the grouped slice. In dplyr verbs you can (and in the case of grouped operations, must) refer to the columns directly. The solution is to remove all x$ references from your code inside dplyr functions.
Here’s a small example that illustrates what’s going on:
library(dplyr, warn.conflicts = FALSE)
tbl <- tibble(g = c(1, 1, 2, 2, 2), x = 1:5)
tbl
#> # A tibble: 5 x 2
#> g x
#> <dbl> <int>
#> 1 1 1
#> 2 1 2
#> 3 2 3
#> 4 2 4
#> 5 2 5
tbl %>%
group_by(g) %>%
mutate(y = cumsum(tbl$x))
#> Error in `mutate_cols()`:
#> ! Problem with `mutate()` column `y`.
#> i `y = cumsum(tbl$x)`.
#> i `y` must be size 2 or 1, not 5.
#> i The error occurred in group 1: g = 1.
And how to fix it:
tbl %>%
group_by(g) %>%
mutate(y = cumsum(x))
#> # A tibble: 5 x 3
#> # Groups: g [2]
#> g x y
#> <dbl> <int> <int>
#> 1 1 1 1
#> 2 1 2 3
#> 3 2 3 3
#> 4 2 4 7
#> 5 2 5 12
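Applied to the original question, the same idea gives an expanding mean per group. This is a minimal sketch on the toy tibble above rather than the asker's data; it assumes cumsum(x) / seq_along(x) is an acceptable stand-in for the rollapply() call, and the column name ma is borrowed from the question.

```r
library(dplyr, warn.conflicts = FALSE)

tbl <- tibble(g = c(1, 1, 2, 2, 2), x = 1:5)

# Bare column names refer to the current group's slice, so the running
# mean restarts at the first row of every group
res <- tbl %>%
  group_by(g) %>%
  mutate(ma = cumsum(x) / seq_along(x)) %>%
  ungroup()

res
```

Writing tbl$x inside mutate() here would again pull all five rows into every group and fail with the size error shown earlier.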
In the example below, I would like to add column 'value' based on the values of column 'variable' (i.e., 1 and 20).
toy_data <-
tibble::tribble(
~x, ~y, ~variable,
1, 2, "x",
10, 20, "y"
)
Like this:
x   y   variable  value
1   2   x         1
10  20  y         20
However, none of the below works:
toy_data %>%
dplyr::mutate(
value = get(variable)
)
toy_data %>%
dplyr::mutate(
value = mget(variable)
)
toy_data %>%
dplyr::mutate(
value = mget(variable, inherits = TRUE)
)
toy_data %>%
dplyr::mutate(
value = !!variable
)
How can I do this?
If you know in advance which variables are in the data frame: use simple logic like ifelse() or dplyr::case_when() to choose between them.
If not: use functional programming. Below is an example:
library(dplyr)
f <- function(data, variable_col) {
data[[variable_col]] %>%
purrr::imap_dbl(~ data[[.y, .x]])
}
toy_data$value <- f(toy_data, "variable")
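For the first case (columns known in advance), a minimal sketch with dplyr::case_when() could look like the following. It assumes the only possible entries in variable are "x" and "y", as in the toy data; any other entry would give NA.

```r
library(dplyr, warn.conflicts = FALSE)

toy_data <- tibble::tribble(
  ~x, ~y, ~variable,
  1,  2,  "x",
  10, 20, "y"
)

# Pick the column named in `variable`, row by row, from a fixed set of columns
res <- toy_data %>%
  mutate(
    value = case_when(
      variable == "x" ~ x,
      variable == "y" ~ y
    )
  )

res
```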
Here are a few options that should scale well.
First is a base option that works along both the variable column and its index. (I made a copy of the data frame just so I had the original intact for more programming.)
library(dplyr)
toy2 <- toy_data
toy2$value <- mapply(function(v, i) toy_data[[v]][i], toy_data$variable, seq_along(toy_data$variable))
toy2
#> # A tibble: 2 × 4
#> x y variable value
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2 x 1
#> 2 10 20 y 20
Second uses purrr::imap_dbl to iterate along the variable and its index and return a double.
toy_data %>%
mutate(value = purrr::imap_dbl(variable, function(v, i) toy_data[[v]][i]))
#> # A tibble: 2 × 4
#> x y variable value
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2 x 1
#> 2 10 20 y 20
Third is least straightforward, but what I'd most likely use personally, maybe just because it's a process that fits many of my workflows. Pivoting makes a long version of the data, letting you see both the values of variable and the corresponding values of x and y, which you can then filter for rows where those 2 columns match. Then self-join back to the data frame.
inner_join(
toy_data,
toy_data %>%
tidyr::pivot_longer(cols = -variable, values_to = "value") %>%
filter(variable == name),
by = "variable"
) %>%
select(-name)
#> # A tibble: 2 × 4
#> x y variable value
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2 x 1
#> 2 10 20 y 20
Edit: @jpiversen rightly points out that the self-join won't work if variable has duplicates. In that case, add a row number to the data and use that as an additional joining column. Here I first add an additional observation to illustrate.
toy3 <- toy_data %>%
add_row(x = 5, y = 4, variable = "x") %>%
tibble::rowid_to_column()
inner_join(
toy3,
toy3 %>%
tidyr::pivot_longer(cols = c(-rowid, -variable), values_to = "value") %>%
filter(variable == name),
by = c("rowid", "variable")
) %>%
select(-name, -rowid)
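If you would rather avoid the join altogether, base R matrix indexing also copes with duplicates in variable. This is only a sketch of an alternative, not one of the options above; value_cols is a name I introduced, and it assumes the candidate columns are all numeric so that as.matrix() does not coerce them to character.

```r
# Same toy data as above, including the duplicated "x" row
toy3 <- data.frame(
  x = c(1, 10, 5),
  y = c(2, 20, 4),
  variable = c("x", "y", "x")
)

value_cols <- c("x", "y")            # hypothetical: the columns that may be picked
m <- as.matrix(toy3[value_cols])

# A two-column index matrix selects one (row, column) cell per row
idx <- cbind(seq_len(nrow(toy3)), match(toy3$variable, value_cols))
toy3$value <- m[idx]

toy3
```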
I don't mean like finding the min of a column. I mean comparing every value in a column to a number and extracting the minimum for comparison, preferably as a new column? Do I have to use loops, rapply/lapply, or can I do something with vectorisation? Example below.
Input:
Column
1
2
3
Number for comparing in min: 2
Output:
Column
1
2
2
If I understand you right, there are a few ways to do this. Here's a very dplyrish way that works, but I don't think it's necessarily the sexiest:
library(tidyverse)
input <- tibble(
x = 1:3
)
mins_to_compare_across <- c(1,2,2)
input %>%
mutate(mins_to_compare_across = mins_to_compare_across) %>%
rowwise() %>%
transmute(
x = min(
c_across(
c(
x,
mins_to_compare_across
)
)
)
) %>%
ungroup()
#> # A tibble: 3 × 1
#> x
#> <dbl>
#> 1 1
#> 2 2
#> 3 2
Created on 2021-08-18 by the reprex package (v2.0.1)
Here's a purrr way that I like a lot:
library(tidyverse)
input <- tibble(
x = 1:3
)
mins_to_compare_across <- c(1,2,2)
input %>%
mutate(
x = map2_dbl(
x,
mins_to_compare_across,
~ min(.x, .y)
)
)
#> # A tibble: 3 × 1
#> x
#> <dbl>
#> 1 1
#> 2 2
#> 3 2
Created on 2021-08-18 by the reprex package (v2.0.1)
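Since the question asks about vectorisation, base R's pmin() is worth a mention too: it takes the element-wise minimum and recycles a single number across the column. A minimal sketch, assuming the comparison number is the 2 from the question:

```r
library(dplyr, warn.conflicts = FALSE)

input <- tibble(x = 1:3)

# pmin() compares every element of x with 2 and keeps the smaller value
out <- input %>%
  mutate(x = pmin(x, 2))

out
```

No rowwise() or map call is needed here, because pmin() already works on whole vectors.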
I have a data.frame (or tibble or whatever) with an id variable. I often perform some operation per id with dplyr::group_by, like so:
data %>%
group_by(id) %>%
summarise/mutate/...()
Often, I have other non-numeric variables that are unique for each id, such as the project or country to which the id belongs and other characteristics of the id (such as gender, etc.). When I use the summarise function above, these other variables are lost unless I specify them, either with
data %>%
group_by(id) %>%
summarise(across(c(project, country, gender, ...), unique),...)
or
data %>%
group_by(id, project, country, gender, ...) %>%
summarise()
Is there a function that detects the variables which are unique for each id, so that one does not have to specify them?
Thank you!
PS: I am asking mainly about dplyr and group_by related functions, but other environments like base R or data.table are welcome too.
I have not tested it extensively, but it should do the job:
library(dplyr)
myData <- tibble(X = c(1, 1, 2, 2, 2, 3),
Y = LETTERS[c(1, 1, 2, 2, 2, 3)],
R = rnorm(6))
myData
#> # A tibble: 6 x 3
#> X Y R
#> <dbl> <chr> <dbl>
#> 1 1 A 0.463
#> 2 1 A -0.965
#> 3 2 B -0.403
#> 4 2 B -0.417
#> 5 2 B -2.28
#> 6 3 C 0.423
group_by_id_vars <- function(.data, ...) {
  # group by the prespecified ID variables
  .data <- .data %>% group_by(...)
  # how many groups do these IDs determine
  ID_groups <- .data %>% n_groups()
  # Get the number of groups if the initial grouping variables are combined
  # with other variables
  groupVars <- sapply(substitute(list(...))[-1], deparse) # specified grouping variables
  nms <- names(.data) # all variables in .data
  res <- sapply(nms[!nms %in% groupVars],
                function(x) {
                  .data %>%
                    # important to specify .add = TRUE to combine the variable
                    # with the IDs
                    group_by(across(all_of(x)), .add = TRUE) %>%
                    n_groups()
                })
  # which combinations are identical, i.e. this variable does not increase the
  # number of groups in the data if combined with the ID variables
  v <- names(res)[which(res == ID_groups)]
  # group the data accordingly
  .data <- .data %>% ungroup() %>% group_by(across(all_of(c(groupVars, v))))
  return(.data)
}
myData %>%
group_by_id_vars(X) %>%
summarise(n = n())
#> `summarise()` regrouping output by 'X' (override with `.groups` argument)
#> # A tibble: 3 x 3
#> # Groups: X [3]
#> X Y n
#> <dbl> <chr> <int>
#> 1 1 A 2
#> 2 2 B 3
#> 3 3 C 1
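If you only want the names of the columns that are constant within each id (without regrouping), counting distinct values per group is another way to get there. This is a rough sketch, not a tested replacement for the function above; constant_within is a name I made up, and the values in R are fixed numbers instead of rnorm() so the result is reproducible.

```r
library(dplyr, warn.conflicts = FALSE)

myData <- tibble(
  X = c(1, 1, 2, 2, 2, 3),
  Y = LETTERS[c(1, 1, 2, 2, 2, 3)],
  R = c(0.5, -1, 0.2, 0.3, -2, 0.4)
)

# Hypothetical helper: names of the non-id columns that take exactly one
# distinct value inside every group defined by `id`
constant_within <- function(data, id) {
  counts <- data %>%
    group_by(across(all_of(id))) %>%
    summarise(across(everything(), n_distinct), .groups = "drop")
  others <- setdiff(names(counts), id)
  others[vapply(counts[others], function(n) all(n == 1), logical(1))]
}

constant_within(myData, "X")
```

The returned names can then be passed on with group_by(across(all_of(...))) in the same way as in the answer above.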
This is a bit more advanced in application, but what you are looking for are linear combinations of your grouping variables. You can convert these to factors and then use some linear algebra.
You can use findLinearCombos() from caret to locate these. It takes a bit of work to get it all organized how I think you want it though.
Something like this may do the trick. I also have not extensively tested this.
Packages
library(dplyr)
library(caret)
library(purrr)
Function
group_by_lc <- function(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data)) {
  # capture the ... and convert to a character vector
  .groups <- rlang::ensyms(...)
  .groups_chr <- map_chr(.groups, rlang::as_name)
  # convert all character and factor variables to integer codes
  d <- .data %>%
    mutate(across(where(is.factor), as.character),
           across(where(is.character), as.factor),
           across(where(is.factor), as.integer))
  # find linear combinations of the character / factor variables
  lc <- caret::findLinearCombos(d)
  # see if any of your grouping variables have linear combinations
  find_group_match <- function(known_groups, lc_pair) {
    if (any(lc_pair %in% known_groups)) unique(c(lc_pair, known_groups)) else NULL
  }
  # convert column indices to names
  lc_pairs <- map(lc$linearCombos, ~ names(d)[.x])
  # iteratively look for linear combinations of known grouping variables
  lc_cols <- reduce(lc_pairs, find_group_match, .init = .groups_chr)
  # find new grouping variables
  added_groups <- rlang::syms(lc_cols[!(lc_cols %in% .groups_chr)])
  # apply the grouping to your groups and the linear combinations
  group_by(.data, !!!.groups, !!!added_groups, .add = .add, .drop = .drop)
}
Usage
data <- tibble(V = LETTERS[1:10], W = letters[1:10], X = paste0(V, W), Y = rep(LETTERS[1:5], each = 2), Z = runif(10))
group_by_lc(data, W)
Result
You can see how it added in all the other grouping variables. You can rework this all in other ways, the key part is building that added_groups list to find them.
# A tibble: 10 x 5
# Groups: W, X, V [10]
V W X Y Z
<chr> <chr> <chr> <chr> <dbl>
1 A a Aa A 0.884
2 B b Bb A 0.133
3 C c Cc B 0.194
4 D d Dd B 0.407
5 E e Ee C 0.256
6 F f Ff C 0.0976
7 G g Gg D 0.635
8 H h Hh D 0.0542
9 I i Ii E 0.0104
10 J j Jj E 0.464
I used group_map for the first time and I think I am using it correctly. This is my code:
library(dplyr)
library(REAT)
df <- data.frame(value = c(1,1,1, 1,0.5,0.1, 0,0,0,1), group = c(1,1,1, 2,2,2, 3,3,3,3))
haves <- df %>%
group_by(group) %>%
group_map(~gini(.x$value, coefnorm = TRUE))
The thing is that haves is a list rather than a data frame. What would I have to do to obtain this df?
wants <- data.frame(group = c(1,2,3), gini = c(0,0.5625,1))
group gini
1 0.0000
2 0.5625
3 1.0000
Thanks!
You can use dplyr::summarize:
df %>%
group_by(group) %>%
summarize(gini = gini(value, coefnorm = TRUE))
#> # A tibble: 3 x 2
#> group gini
#> <dbl> <dbl>
#> 1 1 0
#> 2 2 0.562
#> 3 3 1
According to the documentation, group_map always produces a list. group_modify is an alternative that produces a tibble if the function does, but gini just outputs a vector. So, you could do something like this...
df %>%
group_by(group) %>%
group_modify(~tibble(gini = gini(.x$value, coefnorm = TRUE)))
# A tibble: 3 x 2
# Groups: group [3]
group gini
<dbl> <dbl>
1 1 0
2 2 0.562
3 3 1
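If you want to stay with group_map, you can also glue the list back onto the group keys yourself. The sketch below uses a made-up function spread() (max minus min) as a stand-in for gini(), so that it runs without REAT installed; swap in your own function.

```r
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(
  value = c(1, 1, 1, 1, 0.5, 0.1, 0, 0, 0, 1),
  group = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
)

# Hypothetical stand-in for REAT::gini(): returns a single number
spread <- function(v) max(v) - min(v)

grouped <- df %>% group_by(group)

# group_map() returns one list element per group, in group order
res_list <- grouped %>%
  group_map(function(rows, key) spread(rows$value))

# group_keys() gives one row per group, in the same order
out <- group_keys(grouped) %>%
  mutate(spread = unlist(res_list))

out
```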
Using data.table
library(data.table)
setDT(df)[, .(gini = gini(value, coefnorm = TRUE)), group]
For grouped datasets, we can use the .data pronoun if we don't want to refer to columns by their bare, unquoted names:
library(dplyr)
df %>%
group_by(group) %>%
summarize(gini = gini(.data$value, coefnorm = TRUE))
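The .data pronoun is mostly useful when the column name is stored as a string. A small sketch of that, using mean() in place of gini() so that it runs without REAT; the variable col is just an illustrative name.

```r
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(
  value = c(1, 1, 1, 1, 0.5, 0.1, 0, 0, 0, 1),
  group = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
)

col <- "value"  # column name held in a string (hypothetical example)

# .data[[col]] looks the column up by name within the current group
out <- df %>%
  group_by(group) %>%
  summarize(avg = mean(.data[[col]]))

out
```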