I'm trying to do something extremely simple, but I can't work out how to do it.
Basically this: each row of B should be the previous row's B plus the current row's A, where A and B are columns in a data frame.
If I use:
df$B <- lag(df$B,1) + df$A
It obviously results in NA because there is no lag of B before row 1.
We could use accumulate
library(tidyverse)
df %>%
mutate(B = accumulate(A, `+`))
Or it could be just cumsum
df %>%
mutate(B = cumsum(A))
data
df <- data.frame(A= c(10, 9, 3, 1, 7))
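For reference, on that sample data both versions give the same running total (a minimal check; the printed output is mine, not from the original post):
df %>% mutate(B = cumsum(A))
   A  B
1 10 10
2  9 19
3  3 22
4  1 23
5  7 30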
Below is a fictional reproducible example of pick-up and drop-off times of four taxis.
Taxis 1, 2, and 3 unfortunately have a missing value in the drop-off time. Fortunately, two of these times (for taxis 1 and 3) can be inferred to be at least 1 second before they pick up new customers (these are non-ride-sharing taxis, very corona-proof):
(the below df is - in the real use case - the result of a group_by and summarise of another df)
library(dplyr)
x <- seq(as.POSIXct('2020/01/01'), # Create sequence of dates
         as.POSIXct('2030/01/01'),
         by = "10 mins") %>%
  head(20) %>%
  sort()
taxi_nr <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4)
drop_of <- x[c(TRUE, FALSE)]
pick_up <- x[c(FALSE, TRUE)]
drop_of[2] <- NA
drop_of[5] <- NA
drop_of[7] <- NA
df <- data.frame(taxi_nr,pick_up,drop_of) %>%
arrange(pick_up)
I wish to fill in the NAs for taxis 1 and 3. I have tried the following:
df <- df %>%
fill(drop_of, .direction = "up")
However, this takes the drop-off value from the row below instead of the pick-up value from the row below, and it does not take the taxi number into account.
I have also thought about:
df <- df %>%
filter(is.na(drop_of)) %>%
mutate(drop_of, ov[,+1])
This seems to run into problems with the taxi_nr 2 case, as there is no [,+1] within the group - or so I believe is the issue. I have tried to add safely(), possibly() and quietly(), but that did not help:
df <- df %>%
filter(is.na(drop_of)) %>%
mutate(drop_of, purr::safely(ov[,+1]))
Does anyone have a solution?
PS: once I get the right column for filling in, it also needs 1 second subtracted and needs to be in the right lubridate format (d/m/y-h/m/s).
THANKS!
You can try to use a temporary variable for it, although it does not look pretty
df <- df %>%
  mutate(temp = ifelse(is.na(drop_of), NA, pick_up)) %>%
  group_by(taxi_nr) %>%
  fill(temp, .direction = "up") %>%
  ungroup() %>%
  mutate(drop_of = ifelse(is.na(drop_of), temp - 1, drop_of),
         drop_of = as.POSIXct(drop_of, origin = "1970-01-01")) %>%
  select(-temp)
And if you need your data in the d/m/y-h/m/s format, you can do that with the format() function (I am not sure whether what you described is exactly what you need, but you should get the idea):
df <- df %>% mutate(drop_of = format(drop_of, "%d/%m/%Y-%H/%M/%S"))
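A shorter sketch of the same idea (my own variant, not from the original answer; it assumes the rows within each taxi are already ordered by pick_up, as they are after the arrange() above): take the taxi's next pick-up with lead() and subtract one second.
library(dplyr)
df %>%
  group_by(taxi_nr) %>%
  mutate(drop_of = if_else(is.na(drop_of),
                           lead(pick_up) - 1, # next pick-up in the same group, minus 1 second
                           drop_of)) %>%
  ungroup()
As with the fill() approach, the missing drop_of for taxi 2 stays NA because that taxi has no later pick-up to infer from.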
Suppose I have a data frame with a bunch of columns where I want to do the same NA replacement:
dd <- data.frame(x = c(NA, LETTERS[1:4]), a = rep(NA_real_, 5), b = c(1:4, NA))
For example, in the data frame above I'd like to do something like replace_na(dd, where(is.numeric), 0) to replace the NA values in columns a and b.
I could do
num_cols <- purrr::map_lgl(dd, is.numeric)
r <- as.list(setNames(rep(0, sum(num_cols)), names(dd)[num_cols]))
replace_na(dd, r)
but I'm looking for something tidier/more idiomatic/nicer ...
If we need to do the replacement dynamically with where(is.numeric), we can wrap it in across:
library(dplyr)
library(tidyr)
dd %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0)))
Or we can specify the replace as a list of key/value pairs
replace_na(dd, list(a = 0, b = 0))
which can be created programmatically by selecting the numeric columns, turning them into a named set of zeros (with summarise across everything, or with deframe), and then passing that to replace_na:
library(tibble)
dd %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), ~ 0)) %>%
  replace_na(dd, .)
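Either way, the expected result (my own check, not from the original post) is that the NAs in the numeric columns a and b become 0 while the NA in the character column x is left alone:
     x a b
1 <NA> 0 1
2    A 0 2
3    B 0 3
4    C 0 4
5    D 0 0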
I'm trying to filter something across a list of data frames for a specific column. Typically, for a single data frame using dplyr I would use:
#creating dataframe
df <- data.frame(a = 0:10, d = 10:20)
# filtering column a for rows greater than 7
df %>% filter(a > 7)
I've tried doing this across a list using the following:
# creating list
x <- list(data.frame(a = 0:10, b = 10:20),
          data.frame(c = 11:20, d = 21:30),
          data.frame(e = 15:25, f = 35:45))
# selecting the appropriate column and trying to filter
# this is not working
x[1][[1]][1] %>% lapply(. %>% {filter(. > 2)})
# however, if I use the min() function it works
x[1][[1]][1] %>% lapply(. %>% {min(.)})
I find the %>% syntax quite easy to understand and carry out. However, in this case, selecting a specific column and doing something quite simple like filtering is not working. I'm guessing map could be equally useful. Any help is appreciated.
You can use filter_at to refer to the column by position:
library(dplyr)
purrr::map(x, ~.x %>% filter_at(1, any_vars(. > 7)))
In filter, you can subset the column and use it
purrr::map(x, ~.x %>% filter(.[[1]] > 7))
In base R, that would be :
lapply(x, function(y) y[y[[1]] > 7, ])
It seems you are interested in checking the condition on the first column of each dataframe in your list.
One solution using dplyr would be
lapply(x, function(df) {df %>% filter_at(1, ~. > 7)})
The 1 in filter_at indicates that I want to check the condition on the first column (1 is a positional index) of each dataframe in the list.
EDIT
After the discussion in the comments, I propose the following solution
lapply(x, function(df) {df %>% filter(a > 7) %>% select(a) %>% slice(1)})
Input data
x <- list(data.frame(a = 0:10, b = 10:20),
          data.frame(a = 11:20, b = 21:30),
          data.frame(a = 15:25, b = 35:45))
Output
[[1]]
  a
1 8

[[2]]
   a
1 11

[[3]]
   a
1 15
Using filter with across
library(dplyr)
library(purrr)
map(x, ~ .x %>%
      filter(across(names(.)[1], ~ . > 7)))
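Note that filter_at() and using across() inside filter() are superseded in current dplyr; a hedged equivalent with if_any() (assuming dplyr >= 1.0.4) would be:
library(dplyr)
library(purrr)
# if_any() applies the predicate to the selected columns (here the first column, by position)
map(x, function(d) filter(d, if_any(1, ~ . > 7)))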
I am a novice R programmer. Below is the dataframe I am using.
I am currently running into a filtering problem with the full_join() from tidyverse.
library(tidyverse)
set.seed(1234)
df <- data.frame(
  trial = rep(0:1, each = 8),
  sex = rep(c('M', 'F'), 4),
  participant = rep(1:4, 4),
  x = runif(16, 1, 10),
  y = runif(16, 1, 10))
df
I am currently doing the following to perform the full_join():
df <- df %>% mutate(k = 1)
df <- df %>%
full_join(df, by = "k")
I am restricting the results to obtain the combination of points for the same participant between the trials
df2 <- filter(df, sex.x == sex.y, participant.x == participant.y, trial.x != trial.y)
df3 <- filter(df2, participant.x == 1)
df3
Here, at this step, I am running into trouble. I do not care about the order of the points. How do I condense the duplicates into one row?
Thank you
Depending on the columns you are considering, use the duplicated() function. The first line below weeds out duplicates based on the first 5 columns; the second, based on columns 7 to 11.
df3[!duplicated(df3[,1:5]),]
df3[!duplicated(df3[,7:11]),]
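Another hedged option (my own variant built on the df2 step from the question): since the mirrored rows differ only in which trial ended up as .x and which as .y, you can keep a single canonical ordering in the filter itself instead of de-duplicating afterwards:
library(dplyr)
# keep each unordered pair exactly once by fixing the orientation of the trials
df2 <- df %>%
  filter(sex.x == sex.y,
         participant.x == participant.y,
         trial.x < trial.y) # instead of trial.x != trial.y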
I have the following problem:
When using dplyr to mutate a numeric column after group_by(), the mutate fails if a group contains only a single value and that value is NaN.
Thus, if the grouped column contains numeric values it is correctly classified as dbl, but as soon as a group contains nothing but an NaN, it fails because dplyr types that group's result as lgl while all the other groups are dbl.
My first (and more general question) is:
Is there a way to tell dplyr, when using group_by(), to always define a column in a certain way?
Secondly, can someone help me with a hack for the problem explained in the MWE below:
# ERROR: This will provide the column defining error mentioned:
df <- data_frame(a = c(rep(LETTERS[1:2], 4), "C"),
                 g = c(rep(LETTERS[5:7], 3)),
                 x = c(7, 8, 3, 5, 9, 2, 4, 7, 8)) %>%
  tbl_df()
df <- df %>% group_by(a) %>% mutate_each(funs(sd(., na.rm = TRUE)), x)
df <- df %>% mutate(Winsorise = ifelse(x > 2, 2, x))
# NO ERROR (as no groups have single entry with NaN):
df2 <- data_frame(a = c(rep(LETTERS[1:2], 4), "C"),
                  g = c(rep(LETTERS[5:7], 3)),
                  x = c(7, 8, 3, 5, 9, 2, 4, 7, 8)) %>%
  tbl_df()
df2 <- df2 %>% group_by(a) %>% mutate_each(funs(sd(., na.rm = TRUE)), x)
# Update the Group for the row with an NA - Works
df2[9,1] <- "A"
df2 <- df2 %>% mutate(Winsorise = ifelse(x>3,3,x))
# REASON FOR ERROR: What happens for groups with one member = NaN, although we want the winsorise column to be dbl not lgl:
df3 <- data_frame(g = "A",x = NaN)
df3 <- df3 %>% mutate(Winsorise = ifelse(x>3,3,x))
The reason is, as you rightly pointed out in df3, that the mutate result is cast as a logical when the source column is NaN/NA.
To circumvent this, cast your answer as numeric:
df <- df %>% mutate(Winsorise = as.numeric(ifelse(x>2,2,x)))
Perhaps #hadley could shed some light on why the mutate result is cast as lgl?
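A likely explanation (my own note, not part of the original answer): base ifelse() allocates its result from the test vector, which is logical, and only promotes the type when it actually writes values taken from yes or no; if every test in a group is NA, nothing is written and that group's result stays logical. Using dplyr's if_else(), or avoiding the branch entirely with pmin(), keeps the column numeric even in that case:
library(dplyr)
# using the df3 defined above (a single group with x = NaN)
df3 %>% mutate(Winsorise = if_else(x > 3, 3, x)) # result column stays dbl
df3 %>% mutate(Winsorise = pmin(x, 3))           # same, without branching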