Expanding mean over time per subgroup in dataframe - r

Still quite new to R, so I'm trying to figure out what I am doing wrong in the following example.
I am trying to calculate the expanding mean over time per subgroup for a dataframe. My code works when there is only a single subgroup in the dataframe, but starts to break when multiple subgroups are present.
Apologies if I have overlooked something, but I can't figure out where exactly my code is incorrect. My hunch is that I am not filling in the width correctly, but I have not been able to figure out how to make width a dynamically expanding window over time per subgroup.
See my data below:
sample file
See my code below:
library(ggplot2)
library(zoo)
library(RcppRoll)
library(dplyr)
x <- read.csv("stackoverflow.csv")
x$datatime <- as.POSIXlt(x$datatime,format="%m/%d/%Y %H:%M",tz=Sys.timezone())
x$Event <- as.factor(x$Event)
x2 <- arrange(x, x$Event, x$datatime) %>%
  group_by(x$Event) %>%
  mutate(ma = rollapply(data = x$Actual, width = seq_along(x$Actual), FUN = mean,
                        partial = TRUE, fill = NA,
                        align = "right"))
Any help is very much appreciated!
Thanks
EDIT:
A fix has been found! Thanks to all the useful feedback.
The working code is:
x <- arrange(x, x$Event, x$datatime) %>%
  group_by(Event) %>%
  mutate(ma = rollapply(data = Actual,
                        width = seq_along(Actual),
                        FUN = mean,
                        partial = TRUE,
                        fill = NA,
                        align = "right"))

I think the problem here is that you're using x$ to extract columns from the original data in mutate(), rather than using the bare column name to refer to the column in the grouped slice.
In dplyr verbs you can (and, in the case of grouped operations, must) refer to the columns directly.
The solution is to remove all x$ references from your code inside the dplyr functions.
Here’s a small example that illustrates what’s going on:
library(dplyr, warn.conflicts = FALSE)
tbl <- tibble(g = c(1, 1, 2, 2, 2), x = 1:5)
tbl
#> # A tibble: 5 x 2
#>       g     x
#>   <dbl> <int>
#> 1     1     1
#> 2     1     2
#> 3     2     3
#> 4     2     4
#> 5     2     5
tbl %>%
  group_by(g) %>%
  mutate(y = cumsum(tbl$x))
#> Error in `mutate_cols()`:
#> ! Problem with `mutate()` column `y`.
#> i `y = cumsum(tbl$x)`.
#> i `y` must be size 2 or 1, not 5.
#> i The error occurred in group 1: g = 1.
And how to fix it:
tbl %>%
  group_by(g) %>%
  mutate(y = cumsum(x))
#> # A tibble: 5 x 3
#> # Groups:   g [2]
#>       g     x     y
#>   <dbl> <int> <int>
#> 1     1     1     1
#> 2     1     2     3
#> 3     2     3     3
#> 4     2     4     7
#> 5     2     5    12
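As a side note: since what you're computing is just a cumulative mean, dplyr's cummean() gets you there without rollapply() or any width bookkeeping at all. A minimal sketch, assuming the same column names as in your data:
library(dplyr)

# Expanding (cumulative) mean per subgroup, assuming columns Event, datatime, Actual
x %>%
  arrange(Event, datatime) %>%
  group_by(Event) %>%
  mutate(ma = cummean(Actual)) %>%
  ungroup()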

Related

Unexpected results using eval() in R

I have a column called "equation" which stores a formula in terms of "t". Another column is "t". I want to calculate each equation's value (y) using the t in the same row. Below is an example.
library(magrittr)
library(dplyr)
dt <- data.frame(t = c(1, 2, 3),
                 equation = c("t+1", "5*t", "t^3"))
dt %<>%
  mutate(y = eval(parse(text = equation)))
However, the results are not what I expected:
  t equation  y
1 1      t+1  1
2 2      5*t  8
3 3      t^3 27
The expected results for y are 2, 10, 27 (only the third value is correct). What should I do to fix this?
This is because eval(parse()) isn't vectorised. You can get around this using rowwise():
library(magrittr)
library(dplyr, warn.conflicts = FALSE)
dt <- data.frame(
  t = c(1, 2, 3),
  equation = c("t+1", "5*t", "t^3")
)
dt %<>%
  rowwise() %>%
  mutate(y = eval(parse(text = equation))) %>%
  ungroup()
#> # A tibble: 3 × 3
#>       t equation     y
#>   <dbl> <chr>    <dbl>
#> 1     1 t+1          2
#> 2     2 5*t         10
#> 3     3 t^3         27
Created on 2022-10-14 with reprex v2.0.2
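If you'd rather avoid rowwise(), a base R alternative is to loop over the rows with mapply(); eval() accepts a list as its evaluation environment, so each row's t can be bound explicitly. A sketch on the same data:
dt <- data.frame(
  t = c(1, 2, 3),
  equation = c("t+1", "5*t", "t^3"),
  stringsAsFactors = FALSE
)

# Evaluate each equation string with that row's t bound in the environment
dt$y <- mapply(
  function(eq, t) eval(parse(text = eq), list(t = t)),
  dt$equation, dt$t,
  USE.NAMES = FALSE
)
dt
#>   t equation  y
#> 1 1      t+1  2
#> 2 2      5*t 10
#> 3 3      t^3 27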

Unexpected dplyr::bind_rows() behavior

Short Version:
I'm encountering an error with dplyr::bind_rows() which I don't understand. I want to split my data based on some condition (e.g. a == 1), operate on one part (e.g. b = b * 10), and bind it back to the other part using dplyr::bind_rows() in a single pipe chain. It works fine if I provide the first input to the two parts explicitly, but if instead I pipe them in with . it complains about the data type of argument 2.
Here's an MRE of the issue:
library(tidyverse)
# sim data
d <- tibble(a = 1:4, b = 1:4)
# works when 'd' is supplied directly to bind_rows()
bind_rows(d %>% filter(a == 1),
          d %>% filter(!a == 1) %>% mutate(b = b * 10))
#> # A tibble: 4 x 2
#>       a     b
#>   <int> <dbl>
#> 1     1     1
#> 2     2    20
#> 3     3    30
#> 4     4    40
# fails when 'd' is piped in to bind_rows()
d %>%
  bind_rows(. %>% filter(a == 1),
            . %>% filter(!a == 1) %>% mutate(b = b * 10))
#> Error: Argument 2 must be a data frame or a named atomic vector.
Long Version:
If I capture what the bind_rows() call is getting as input in a list() instead, I can see that two unexpected (to me) things are happening:
1. Instead of evaluating the pipe chains I provided, it seems to just capture them as functional sequences.
2. The input (.) is invisibly being supplied in addition to the two explicit arguments, so I get 3 items instead of 2 in the list.
# capture intermediate values for diagnostics
d %>%
  list(. %>% filter(a == 1),
       . %>% filter(!a == 1) %>% mutate(b = b * 10))
#> [[1]]
#> # A tibble: 4 x 2
#>       a     b
#>   <int> <int>
#> 1     1     1
#> 2     2     2
#> 3     3     3
#> 4     4     4
#>
#> [[2]]
#> Functional sequence with the following components:
#>
#> 1. filter(., a == 1)
#>
#> Use 'functions' to extract the individual functions.
#>
#> [[3]]
#> Functional sequence with the following components:
#>
#> 1. filter(., !a == 1)
#> 2. mutate(., b = b * 10)
#>
#> Use 'functions' to extract the individual functions.
This leads me to the following inelegant solution: I solve the first problem by piping into the inner function calls, which seems to force evaluation correctly (for reasons I don't understand), and then solve the second problem by subsetting the list prior to performing the bind_rows() operation.
# hack solution to force eval and clean duplicated input
d %>%
  list(filter(., a == 1),
       filter(., !a == 1) %>% mutate(b = b * 10)) %>%
  .[-1] %>%
  bind_rows()
#> # A tibble: 4 x 2
#>       a     b
#>   <int> <dbl>
#> 1     1     1
#> 2     2    20
#> 3     3    30
#> 4     4    40
Created on 2022-01-24 by the reprex package (v2.0.1)
It seems like it might be related to this issue, but I can't quite see how. It would be great to understand why this is happening and to find a way to code this without the need to assign intermediate variables or do this weird hack to subset the intermediate list.
EDIT:
Knowing this was related to curly braces ({}) enabled me to find a few more helpful links:
1, 2, 3
If we want to use ., wrap the call in braces ({}): the braces stop magrittr from inserting the left-hand side as the first argument, so . can be referenced explicitly inside the block.
library(dplyr)
d %>%
  {
    bind_rows({.} %>% filter(a == 1),
              {.} %>% filter(!a == 1) %>% mutate(b = b * 10))
  }
Output:
# A tibble: 4 × 2
      a     b
  <int> <dbl>
1     1     1
2     2    20
3     3    30
4     4    40
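As a design note: for this particular transformation the split/modify/bind dance isn't needed at all; a single conditional mutate() does the same thing in one pass. A sketch using the MRE's data (the as.numeric() cast is there because if_else() insists both branches have the same type):
library(dplyr)

d <- tibble(a = 1:4, b = 1:4)

# Multiply b by 10 everywhere except where a == 1, in one pass
d %>%
  mutate(b = if_else(a == 1, as.numeric(b), b * 10))
#> # A tibble: 4 x 2
#>       a     b
#>   <int> <dbl>
#> 1     1     1
#> 2     2    20
#> 3     3    30
#> 4     4    40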

Fastest way to pad a dataframe with uneven columns

Similar to this previous question, I'm trying to transform a vector into a dataframe in R. I use the trick of turning it into a matrix and then a data frame, but the issue is that some rows potentially have a different number of columns, which throws out my data frame. There can be an arbitrary number of values per row (i.e. not necessarily 3 columns as in the examples), so I check first to work out how many columns I need.
For example, given the example data below, I get a neat data frame.
example <- c(
  "col-a",
  "col-b",
  "col-c",
  "col-a",
  "col-b",
  "col-c",
  "col-a",
  "col-b",
  "col-c")
# Get the number of values between the repeating start == number of columns
ncols <- diff(grep("col-a", example))
data.frame(matrix(example, ncol = ncols[1], byrow = T))
#      X1    X2    X3
# 1 col-a col-b col-c
# 2 col-a col-b col-c
# 3 col-a col-b col-c
That's all well and good until I get a vector that has an extra value in one row (i.e. requires an extra column). For example:
example <- c("col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c",
"WATCH OUT!",
"col-a",
"col-b",
"col-c")
# Get the number of values between the repeating start == number of columns
ncols <- diff(grep("col-a", example))
data.frame(matrix(example, ncol = ncols[1], byrow = T))
#           X1    X2    X3
# 1      col-a col-b col-c
# 2      col-a col-b col-c
# 3 WATCH OUT! col-a col-b
# 4      col-c col-a col-b
Whereas, what I really want is:
#      X1    X2    X3         X4
# 1 col-a col-b col-c         NA
# 2 col-a col-b col-c WATCH OUT!
# 3 col-a col-b col-c         NA
I could deal with this with a double for loop after checking that there is an uneven number of elements between first-column elements, but surely that's not even close to the best option.
The additional complication is that the "extra" column could potentially be anywhere, not necessarily the last column.
Edit: The column ordering is actually arbitrary, so there's no reason why the extra column has to be in the middle; it could be appended at the end. That is one option I considered: pull it out and just append it after padding the rest with NA. The text that should be in the same column is also delimited, so it's clear where each value belongs. I have updated the example below.
Here is some more realistic example data and desired output:
example <- c("name:start",
"date:a",
"value:b",
"name:start",
"date:c",
"desc:WATCH OUT!",
"value:d",
"name:start",
"date:e",
"value:f")
# Desired output
X1 X2 X3 X4
1 name:start date:a NA value:b
2 name:start date:c desc:WATCH OUT! value:d
3 name:start date:e NA value:f
What would be the fastest way to process this?
Thanks in advance!
EDIT: the "blocks" that turn into rows are well defined, so the start and end of a block are clear and finding the size of a block isn't hard, hence my diff(grep(...)) command earlier (could also use dist() for similar result). The WATCH OUT! Text can be arbitrary though, so it's not as simple as searching for WATCH OUT!.
I am not sure if the output in this format is useful:
example <- c("name:start",
"date:a",
"value:b",
"name:start",
"date:c",
"desc:WATCH OUT!",
"value:d",
"name:start",
"date:e",
"value:f")
library(tidyverse)
example %>% as.data.frame() %>% setNames('dummy') %>%
  separate(dummy, into = c('name', 'value'), sep = '\\:') %>%
  mutate(rowid = cumsum(name == first(name))) %>%
  pivot_wider(id_cols = rowid, names_from = name, values_from = value)
#> # A tibble: 3 x 5
#>   rowid name  date  value desc
#>   <int> <chr> <chr> <chr> <chr>
#> 1     1 start a     b     <NA>
#> 2     2 start c     d     WATCH OUT!
#> 3     3 start e     f     <NA>
Or perhaps this?
library(tidyverse)
example %>% as.data.frame() %>% setNames('dummy') %>%
  separate(dummy, into = c('name', 'value'), sep = '\\:', remove = F) %>%
  mutate(rowid = cumsum(name == first(name))) %>%
  pivot_wider(id_cols = rowid, names_from = name, values_from = dummy)
#> # A tibble: 3 x 5
#>   rowid name       date   value   desc
#>   <int> <chr>      <chr>  <chr>   <chr>
#> 1     1 name:start date:a value:b <NA>
#> 2     2 name:start date:c value:d desc:WATCH OUT!
#> 3     3 name:start date:e value:f <NA>
Created on 2021-05-30 by the reprex package (v2.0.0)
For your first example, you could do
example <- c("col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c",
"WATCH OUT!",
"col-a",
"col-b",
"col-c")
library(tidyverse)
example %>% as.data.frame() %>% setNames('dummy') %>%
  group_by(rowid = cumsum(dummy == first(dummy))) %>%
  mutate(name = paste0('X', row_number())) %>%
  pivot_wider(id_cols = rowid, names_from = name, values_from = dummy)
#> # A tibble: 3 x 5
#> # Groups:   rowid [3]
#>   rowid X1    X2    X3    X4
#>   <int> <chr> <chr> <chr> <chr>
#> 1     1 col-a col-b col-c <NA>
#> 2     2 col-a col-b col-c WATCH OUT!
#> 3     3 col-a col-b col-c <NA>
Created on 2021-05-30 by the reprex package (v2.0.0)
Is this useful?
library(tidyverse)
library(rebus)
#>
#> Attaching package: 'rebus'
#> The following object is masked from 'package:stringr':
#>
#> regex
#> The following object is masked from 'package:ggplot2':
#>
#> alpha
example <- c("name:start",
"date:a",
"value:b",
"name:start",
"date:c",
"desc:WATCH OUT!",
"value:d",
"name:start",
"date:e",
"value:f")
example_dirty <- example # kept for the replacement step at the end of the script
custom_pattern <- rebus::or('name:.*', 'date:.', 'value:.')
alien_text_index <- str_detect(example, pattern = custom_pattern) %>%
  as.character()
replacement <- which(alien_text_index == 'FALSE') %>%
  `/`(., 3) %>% # in this case the repetition should start over every three rows
  round()       # round to get an index to modify
example <- str_match(example, pattern = custom_pattern) %>% keep(~ !is.na(.))
df <- c('name:.*', 'date:.', 'value:.') %>%
  map(~ example[str_detect(example, .x)]) %>% reduce(bind_cols) %>%
  mutate(..4 = '')
#> New names:
#> * NA -> ...1
#> * NA -> ...2
#> New names:
#> * NA -> ...3
for (i in seq_along(replacement)) {
  df[replacement[i], 4] <- example_dirty[!as.logical(alien_text_index)][i]
}
df
#> # A tibble: 3 x 4
#>   ...1       ...2   ...3    ..4
#>   <chr>      <chr>  <chr>   <chr>
#> 1 name:start date:a value:b ""
#> 2 name:start date:c value:d "desc:WATCH OUT!"
#> 3 name:start date:e value:f ""
Created on 2021-05-29 by the reprex package (v2.0.0)
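For comparison, here is a base R sketch of the same padding idea. It assumes every block starts with "name:start" and that each element is key:value delimited by the first colon:
example <- c("name:start", "date:a", "value:b",
             "name:start", "date:c", "desc:WATCH OUT!", "value:d",
             "name:start", "date:e", "value:f")

# Split the vector into blocks at each "name:start", then fill one
# key -> value row per block, padding missing keys with NA
blocks <- split(example, cumsum(example == "name:start"))
keys   <- unique(sub(":.*", "", example))
rows   <- lapply(blocks, function(b) {
  out <- setNames(rep(NA_character_, length(keys)), keys)
  out[sub(":.*", "", b)] <- b
  out
})
as.data.frame(do.call(rbind, rows))
#>         name   date   value            desc
#> 1 name:start date:a value:b            <NA>
#> 2 name:start date:c value:d desc:WATCH OUT!
#> 3 name:start date:e value:f            <NA>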

assigning id values from values, not names, with purrr::map_dfr

I think this question is related to Using map_dfr and .id for list names and list of list names but not identical ...
I often use map_dfr for a case where I want to use the value of each argument, not its name, as the .id variable. Here's a silly example: I am computing the mean of mtcars$mpg raised to the second, fourth, and sixth power:
library(tidyverse)
list(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="name")
##   name          x
##   <chr>     <dbl>
## 1 1          439.
## 2 2       262350.
## 3 3    198039783.
I would like the name variable to be 2, 4, 6 instead of 1, 2, 3. I can hack this by including setNames(.data) in the pipeline:
list(2, 4, 6) %>%
  setNames(.data) %>%
  map_dfr(~ tibble(x = mean(mtcars$mpg^.)), .id = "name")
but I wonder if there is a more idiomatic approach I'm missing?
As for the suggestion of using something like ~ tibble(name = ., ...): nice, but slightly less convenient for the case where the mapping function already returns a tibble, because we have to add an otherwise unnecessary tibble() call:
list(2, 4, 6) %>%
  map_dfr(~ tibble(name = .,
                   broom::tidy(lm(mpg ~ cyl, data = mtcars, offset = rep(., nrow(mtcars))))))
OK, I think I found this shortly before posting (so I'll answer). This answer points out that tibble::lst() is a self-naming list function, so as long as we use tibble::lst(2,4,6) instead of list(2,4,6), it Just Works, e.g.
lst(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="name")
This can work too:
library(tidyverse)
## Ben Bolker's answer
lst(2,4,6) %>% map_dfr(~tibble(x=mean(mtcars$mpg^.)), .id="power")
#> # A tibble: 3 x 2
#>   power          x
#>   <chr>      <dbl>
#> 1 2           439.
#> 2 4        262350.
#> 3 6     198039783.
list(2, 4, 6) %>% map_df(~ tibble(power = as.character(.x) , x = mean(mtcars$mpg^.)))
#> # A tibble: 3 x 2
#>   power          x
#>   <chr>      <dbl>
#> 1 2           439.
#> 2 4        262350.
#> 3 6     198039783.
# another option
seq(2, 6, 2) %>%
  map2_df(rerun(length(.), mtcars$mpg), ~ c(x = as.character(.x), mean = round(mean(.y^.x), 0)))
#> # A tibble: 3 x 2
#>   x     mean
#>   <chr> <chr>
#> 1 2     439
#> 2 4     262350
#> 3 6     198039783
Created on 2021-06-06 by the reprex package (v2.0.0)
This is also possible, though it would not have been my first choice, and a single map would suffice:
library(purrr)
list(2, 4, 6) %>%
  pmap_dfr(~ tibble(power = c(...), x = map_dbl(c(...), ~ mean(mtcars$mpg ^ .x))))
# A tibble: 3 x 2
  power          x
  <dbl>      <dbl>
1     2       439.
2     4    262350.
3     6 198039783.
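One more idiomatic option: purrr's set_names() defaults to nm = x, so it self-names a plain list() inline, much like tibble::lst(). A sketch:
library(dplyr)
library(purrr)

# set_names(x) defaults to naming x by its own values: "2", "4", "6"
list(2, 4, 6) %>%
  set_names() %>%
  map_dfr(~ tibble(x = mean(mtcars$mpg^.)), .id = "power")
#> # A tibble: 3 x 2
#>   power          x
#>   <chr>      <dbl>
#> 1 2           439.
#> 2 4        262350.
#> 3 6     198039783.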

Filter data.frame using sqldf and/or dplyr in R

I need to find null values inside a data.frame using the sqldf or dplyr libraries.
I know that I can use na.omit() to do that, but I can't find a way to do the same using the sqldf or dplyr libraries.
Does anyone know how to do that?
Thank you
drop_na from tidyr will drop all rows that contain missing values in any column (or in columns that you specify).
Here's the example from the documentation:
library(dplyr)
library(tidyr) # drop_na() lives in tidyr
df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df %>% drop_na()
#> # A tibble: 1 x 2
#>       x y
#>   <dbl> <chr>
#> 1     1 a
df %>% drop_na(x)
#> # A tibble: 2 x 2
#>       x y
#>   <dbl> <chr>
#> 1     1 a
#> 2     2 NA
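And since the question also asked about sqldf: the same filter can be written in SQL, because sqldf maps R's NA to SQL NULL when it loads the data frame. A sketch against the df above:
library(sqldf)

df <- data.frame(x = c(1, 2, NA), y = c("a", NA, "b"))

# Keep only rows where neither column is NULL (i.e. NA in R)
sqldf("SELECT * FROM df WHERE x IS NOT NULL AND y IS NOT NULL")
#>   x y
#> 1 1 a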
