Using dplyr for multiple mutate on different criteria

This is my reprex:
dates <- seq(as.POSIXct("2015-01-01 13:10:00", tz = "UTC"),
             as.POSIXct("2015-01-01 13:10:10", tz = "UTC"), by = "1 sec")
dates[dst(dates)] <- dates[dst(dates)] - 3600
datavalues <- data.frame(x = c(90, 90, 80, 65, NA, 64, 71, 75, 62, 63, 74))
data <- cbind(dates, datavalues)
data
dates x
1 2015-01-01 13:10:00 90
2 2015-01-01 13:10:01 90
3 2015-01-01 13:10:02 80
4 2015-01-01 13:10:03 65
5 2015-01-01 13:10:04 NA
6 2015-01-01 13:10:05 64
7 2015-01-01 13:10:06 71
8 2015-01-01 13:10:07 75
9 2015-01-01 13:10:08 62
10 2015-01-01 13:10:09 63
11 2015-01-01 13:10:10 74
I need to obtain the following data frame (which I will then bind to data):
results <- data.frame(Duration = c(3, 3, 3, 0, 0, 0, 2, 2, 0, 0, 1),
                      Maxx = c(90, 90, 90, 0, 0, 0, 75, 75, 0, 0, 74),
                      Delta = c(0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 11))
results
Duration Maxx Delta
1 3 90 0
2 3 90 0
3 3 90 0
4 0 0 0
5 0 0 0
6 0 0 0
7 2 75 7
8 2 75 0
9 0 0 0
10 0 0 0
11 1 74 11
I set the threshold at 70.
The Duration column is the number of consecutive observations during which x exceeds the threshold.
The Maxx column is the maximum of x over each run with non-zero duration.
Lastly, the Delta column is the difference between the first x exceeding 70 and the preceding x.
If possible I would like code using dplyr, because the code around this piece is already dplyr code. Thank you in advance.

With the help of data.table's rleid() you can create groups of consecutive values that are above or below the threshold, then calculate the required numbers within each group.
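For reference, a minimal illustration of what rleid() does (toy input, not from the question): it starts a new id each time the value changes, so consecutive rows on the same side of the threshold share a group.
data.table::rleid(c(TRUE, TRUE, FALSE, FALSE, TRUE))
# [1] 1 1 2 2 3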
library(dplyr)
library(data.table)
threshold <- 70
data %>%
  # create a unique group for each run of consecutive values
  group_by(group = rleid(replace(x, is.na(x), 0) < threshold)) %>%
  # if the value is less than the threshold put 0 in Duration, otherwise
  # the number of observations in the group; do the same for the max value
  mutate(Duration = if_else(x < threshold, 0L, n(), missing = 0L),
         # +(Duration > 0) turns the below-threshold groups into 0
         Maxx = max(x, na.rm = TRUE) * +(Duration > 0)) %>%
  ungroup() %>%
  # subtract the previous value from the current value
  mutate(Delta = x - lag(x),
         # keep only the values on the first row of each group
         Delta = replace(Delta, group == lag(group, default = first(group)) |
                           Duration == 0, 0)) %>%
  select(-group)
# dates x Duration Maxx Delta
# <dttm> <dbl> <int> <dbl> <dbl>
# 1 2015-01-01 13:10:00 90 3 90 0
# 2 2015-01-01 13:10:01 90 3 90 0
# 3 2015-01-01 13:10:02 80 3 90 0
# 4 2015-01-01 13:10:03 65 0 0 0
# 5 2015-01-01 13:10:04 NA 0 0 0
# 6 2015-01-01 13:10:05 64 0 0 0
# 7 2015-01-01 13:10:06 71 2 75 7
# 8 2015-01-01 13:10:07 75 2 75 0
# 9 2015-01-01 13:10:08 62 0 0 0
#10 2015-01-01 13:10:09 63 0 0 0
#11 2015-01-01 13:10:10 74 1 74 11
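If pulling in data.table only for rleid() is undesirable, the same run id can be built in plain dplyr; a minimal sketch (flag and group are my names, not from the answer above):
library(dplyr)
data %>%
  mutate(flag = replace(x, is.na(x), 0) < threshold,
         group = cumsum(flag != lag(flag, default = first(flag))))
From here, group_by(group) works exactly as in the answer.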


How to flag/count missing values in RStudio on a row by row basis?

Taking the airquality dataset (from base R's datasets package) as an example:
> head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
I want to create three columns, Missing_Ozone, Missing_Total and Missing_Percent, such that:
Missing_Ozone = 1 if there is a missing value in the Ozone column in the current row.
Missing_Total = total count of missing values in the current row.
Missing_Percent = percentage of missing values in the current row.
So for example, in row 1:
Missing_Ozone = 0, Missing_Total = 0, Missing_Percent = 0
In row 5:
Missing_Ozone = 1, Missing_Total = 2, Missing_Percent = 100*(2/6)
In row 6:
Missing_Ozone = 0, Missing_Total = 1, Missing_Percent = 100*(1/6)
I tried two approaches, without any luck:
The first was to iterate over each row and use an if statement:
library(MASS)
df_test <- airquality
df_test$Missing_Ozone <- 0
for (i in 1:nrow(df_test)) {
  if (is.na(df_test$Ozone)) {
    df_test$Missing_Ozone <- 1
  }
}
The second was to just use the if statement inside that for loop. Neither works, and I just get:
> df_test
Ozone Solar.R Wind Temp Month Day Missing_Ozone
1 41 190 7.4 67 5 1 0
2 36 118 8.0 72 5 2 0
3 12 149 12.6 74 5 3 0
4 18 313 11.5 62 5 4 0
5 NA NA 14.3 56 5 5 0
Any help is appreciated.
Edit: Also, does this type of data manipulation have a certain name? I found it hard to search online for a guide that goes through this type of data manipulation.
Tidyverse approach:
library(dplyr)
airquality <- datasets::airquality
cols <- ncol(airquality)
airquality <- airquality %>%
  mutate(
    Missing_Ozone = as.numeric(is.na(Ozone)),
    Missing_Total = rowSums(is.na(.)),
    Missing_Percent = Missing_Total / cols
  )
> head(airquality)
Ozone Solar.R Wind Temp Month Day Missing_Ozone Missing_Total Missing_Percent
1 41 190 7.4 67 5 1 0 0 0.0000000
2 36 118 8.0 72 5 2 0 0 0.0000000
3 12 149 12.6 74 5 3 0 0 0.0000000
4 18 313 11.5 62 5 4 0 0 0.0000000
5 NA NA 14.3 56 5 5 1 2 0.3333333
6 28 NA 14.9 66 5 6 0 1 0.1666667
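If you prefer to avoid the magrittr dot (which here refers to the data frame entering the pipe, so only the six original columns are checked), a sketch of the same computation with dplyr 1.1's pick(); note that Missing_Ozone already exists at that point in the mutate(), but it contains no NAs, so the total is unchanged:
airquality %>%
  mutate(
    Missing_Ozone = as.numeric(is.na(Ozone)),
    Missing_Total = rowSums(is.na(pick(everything()))),
    Missing_Percent = Missing_Total / cols
  )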
Base R approach:
cols <- ncol(airquality)
airquality$Missing_Ozone <- as.numeric(is.na(airquality$Ozone))
airquality$Missing_Total <- rowSums(is.na(airquality))
airquality$Missing_Percent <- airquality$Missing_Total/cols
> head(airquality)
Ozone Solar.R Wind Temp Month Day Missing_Ozone Missing_Total Missing_Percent
1 41 190 7.4 67 5 1 0 0 0.0000000
2 36 118 8.0 72 5 2 0 0 0.0000000
3 12 149 12.6 74 5 3 0 0 0.0000000
4 18 313 11.5 62 5 4 0 0 0.0000000
5 NA NA 14.3 56 5 5 1 2 0.3333333
6 28 NA 14.9 66 5 6 0 1 0.1666667
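One small mismatch with the question, which asked for percentages such as 100*(2/6): both versions above return proportions (e.g. 0.3333333). Multiplying by 100 converts them, for instance in the base R version:
airquality$Missing_Percent <- 100 * airquality$Missing_Total / cols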
edit: A note on performance
I would advise in general against rowwise operations outside of very specific use cases. The per-row overhead is substantial, so rowwise() code runs orders of magnitude slower than the vectorized alternatives, and the absolute cost keeps growing with the data. A little benchmark with a data set of 6,426 rows instead of 153:
library(dplyr)
library(microbenchmark)
airquality <- datasets::airquality
# Rowwise
approachA <- function(data) {
  result <- data %>%
    mutate(Missing_Ozone = as.integer(is.na(Ozone))) %>%
    rowwise() %>%
    mutate(Missing_Total = sum(is.na(c_across(-Missing_Ozone)))) %>%
    mutate(Missing_Percent = Missing_Total / ncol(data)) %>%
    ungroup()
  return(result)
}
# Tidy
approachB <- function(data) {
  cols <- ncol(data)
  result <- data %>%
    mutate(
      Missing_Ozone = as.numeric(is.na(Ozone)),
      Missing_Total = rowSums(is.na(.)),
      Missing_Percent = Missing_Total / cols
    )
  return(result)
}
# Base R
approachC <- function(data) {
  cols <- ncol(data)
  data$Missing_Ozone <- as.numeric(is.na(data$Ozone))
  data$Missing_Total <- rowSums(is.na(data))
  data$Missing_Percent <- data$Missing_Total / cols
  return(data)
}
Result with data x 42: rowwise() is orders of magnitude slower than either of the other two approaches.
> test_data <- do.call("rbind", replicate(42, airquality, simplify = FALSE))
> set.seed(42)
> microbenchmark::microbenchmark(approachA(test_data), approachB(test_data), approachC(test_data))
Unit: microseconds
expr min lq mean median uq max neval cld
approachA(test_data) 243340.904 251838.3590 259083.8089 256546.9015 260567.8945 405326.615 100 b
approachB(test_data) 577.977 624.0610 723.8304 741.0955 770.3695 2382.756 100 a
approachC(test_data) 102.377 107.9735 139.5595 119.6175 129.4165 2074.231 100 a
Result with data x 420: the execution time of the rowwise approach has grown another 10x.
test_data <- do.call("rbind", replicate(420, airquality, simplify = FALSE))
> set.seed(42)
> microbenchmark::microbenchmark(approachA(test_data), approachB(test_data), approachC(test_data))
Unit: microseconds
expr min lq mean median uq max neval cld
approachA(test_data) 2519480.258 2620528.08 2671419.663 2672263.417 2707896.209 2907659.730 100 b
approachB(test_data) 1266.818 1488.71 1909.167 1576.327 1678.725 21011.147 100 a
approachC(test_data) 808.684 881.09 1220.151 1000.277 1067.907 8218.655 100 a
A solution using the dplyr package. rowwise() and c_across() allow us to do the calculation row by row.
library(dplyr)
dat <- airquality %>%
  mutate(Missing_Ozone = as.integer(is.na(Ozone))) %>%
  rowwise() %>%
  mutate(Missing_Total = sum(is.na(c_across(-Missing_Ozone)))) %>%
  mutate(Missing_Percent = Missing_Total / ncol(airquality)) %>%
  ungroup()
dat
dat
# # A tibble: 153 x 9
# Ozone Solar.R Wind Temp Month Day Missing_Ozone Missing_Total Missing_Percent
# <int> <int> <dbl> <int> <int> <int> <int> <int> <dbl>
# 1 41 190 7.4 67 5 1 0 0 0
# 2 36 118 8 72 5 2 0 0 0
# 3 12 149 12.6 74 5 3 0 0 0
# 4 18 313 11.5 62 5 4 0 0 0
# 5 NA NA 14.3 56 5 5 1 2 0.333
# 6 28 NA 14.9 66 5 6 0 1 0.167
# 7 23 299 8.6 65 5 7 0 0 0
# 8 19 99 13.8 59 5 8 0 0 0
# 9 8 19 20.1 61 5 9 0 0 0
# 10 NA 194 8.6 69 5 10 1 1 0.167
# # ... with 143 more rows

How to calculate events per day in R including dates when no events occurred?

I would like to create a data frame in which the first column holds all the dates from a certain period of time and the second the number of events that occurred on each date, including dates when no events occurred. I would also like counts of the events broken down by the factors assigned to them.
The first data frame holds the events, each with its date:
Row Sex Age Date
1 2 36 2004-01-05
2 1 47 2004-01-06
3 1 26 2004-01-10
4 2 23 2004-01-20
5 1 50 2004-01-27
6 2 35 2004-01-28
7 1 35 2004-01-30
8 1 38 2004-02-06
9 2 29 2004-02-11
In the Sex column, 1 means female and 2 means male.
The second data frame holds the dates of the examined period:
Row Date
1 2004-01-05
2 2004-01-06
3 2004-01-07
4 2004-01-08
5 2004-01-09
6 2004-01-10
7 2004-01-11
8 2004-01-12
9 2004-01-13
10 2004-01-14
I want to get a data frame that looks like this:
Row Date Events (All) Events (Female) Events (Male)
1 2004-01-05 1 0 1
2 2004-01-06 1 1 0
3 2004-01-07 0 0 0
4 2004-01-08 0 0 0
5 2004-01-09 0 0 0
6 2004-01-10 1 1 0
7 2004-01-11 0 0 0
8 2004-01-12 0 0 0
9 2004-01-13 0 0 0
10 2004-01-14 0 0 0
Can anyone help?
Here's one method:
library(data.table)
library(magrittr) # just for %>%
out <- dat1 %>%
  dcast(Date ~ Sex, data = ., fun.aggregate = length) %>%
  setnames(., c("1", "2"), c("Female", "Male")) %>%
  .[dat2[, .(Date)], on = "Date"] %>%
  .[, lapply(.SD, function(a) replace(a, is.na(a), 0))] %>%
  .[, All := Female + Male]
out
out
# Date Female Male All
# 1: 2004-01-05 0 1 1
# 2: 2004-01-06 1 0 1
# 3: 2004-01-07 0 0 0
# 4: 2004-01-08 0 0 0
# 5: 2004-01-09 0 0 0
# 6: 2004-01-10 1 0 1
# 7: 2004-01-11 0 0 0
# 8: 2004-01-12 0 0 0
# 9: 2004-01-13 0 0 0
# 10: 2004-01-14 0 0 0
Note that lapply might not be the overall fastest way to replace NA with 0, but it gets the point across. Also, I use magrittr's %>% merely to break the steps apart; this can easily be done without it.
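As a hedged aside, data.table's setnafill() fills NAs by reference and is typically faster than the lapply() step; a sketch of the same pipeline using it (it must be restricted to the count columns, since setnafill() only handles numeric ones):
library(data.table)
out <- dcast(dat1, Date ~ Sex, fun.aggregate = length)
setnames(out, c("1", "2"), c("Female", "Male"))
out <- out[dat2[, .(Date)], on = "Date"]
setnafill(out, type = "const", fill = 0, cols = c("Female", "Male"))
out[, All := Female + Male]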
Data:
dat1 <- fread(text = "
Row Sex Age Date
1 2 36 2004-01-05
2 1 47 2004-01-06
3 1 26 2004-01-10
4 2 23 2004-01-20
5 1 50 2004-01-27
6 2 35 2004-01-28
7 1 35 2004-01-30
8 1 38 2004-02-06
9 2 29 2004-02-11")
dat2 <- fread(text = "
Row Date
1 2004-01-05
2 2004-01-06
3 2004-01-07
4 2004-01-08
5 2004-01-09
6 2004-01-10
7 2004-01-11
8 2004-01-12
9 2004-01-13
10 2004-01-14")
A tidyverse version:
dat1 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Row Sex Age Date
1 2 36 2004-01-05
2 1 47 2004-01-06
3 1 26 2004-01-10
4 2 23 2004-01-20
5 1 50 2004-01-27
6 2 35 2004-01-28
7 1 35 2004-01-30
8 1 38 2004-02-06
9 2 29 2004-02-11")
dat2 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Row Date
1 2004-01-05
2 2004-01-06
3 2004-01-07
4 2004-01-08
5 2004-01-09
6 2004-01-10
7 2004-01-11
8 2004-01-12
9 2004-01-13
10 2004-01-14")
library(dplyr)
library(tidyr)
as_tibble(dat1) %>%
  group_by(Date, Sex) %>%
  tally() %>%
  ungroup() %>%
  pivot_wider(id_cols = "Date", names_from = "Sex", values_from = "n",
              values_fill = list(n = 0)) %>%
  rename(Female = "1", Male = "2") %>%
  left_join(select(dat2, Date), ., by = "Date") %>%
  mutate_at(vars(Female, Male), ~ replace(., is.na(.), 0)) %>%
  mutate(All = Female + Male)
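Another route to the zero-event dates, closer to the question's title, is tidyr::complete(); a hedged sketch (the final semi_join() restricts the result to the dates in dat2):
library(dplyr)
library(tidyr)
dat1 %>%
  count(Date, Sex) %>%
  complete(Date = dat2$Date, Sex = c(1, 2), fill = list(n = 0)) %>%
  pivot_wider(names_from = Sex, values_from = n, values_fill = 0) %>%
  rename(Female = `1`, Male = `2`) %>%
  mutate(All = Female + Male) %>%
  semi_join(dat2, by = "Date")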

dplyr: average over a range based on first occurrence in a different column

I would like to use dplyr and mutate to create a new variable that is either 0 or the average of the values in column y, conditional on a range taken from column z.
For the column z range, I would like to use the first time z >= 90 as the maximum of the range, and the nearest z == 31 immediately before that as the minimum of the range.
Note: I will be grouping by column x.
For example:
x y z
1 100 0
1 90 0
1 90 31
1 90 60
1 80 31
1 75 60
1 60 90
1 60 60
2 60 0
2 60 30
I want to average y over this range:
x y z
1 80 31
1 75 60
1 60 90
so I would end up with the value 71.7 (I don't care about rounding).
x y z ave
1 100 0 0
1 90 0 0
1 90 31 0
1 90 60 0
1 80 31 71.7
1 75 60 71.7
1 60 90 71.7
1 60 60 0
2 60 0 0
2 60 30 0
We may do
df %>%
  group_by(x) %>%
  mutate(ave = {
    if (any(z >= 90)) {
      idxU <- which.max(z >= 90)
      idxL <- max(which(z[1:idxU] == 31))
      replace(z * 0, idxL:idxU, mean(y[idxL:idxU]))
    } else {
      0
    }
  })
# x y z ave
# 1 1 100 0 0.00000
# 2 1 90 0 0.00000
# 3 1 90 31 0.00000
# 4 1 90 60 0.00000
# 5 1 80 31 71.66667
# 6 1 75 60 71.66667
# 7 1 60 90 71.66667
# 8 1 60 60 0.00000
# 9 2 60 0 0.00000
# 10 2 60 30 0.00000
So idxU is the upper limit of the range and idxL the lower limit; in the last line we replace elements idxL:idxU of the zero vector z * 0 with the required mean of y.
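A hedged aside for data beyond this example: if a group has some z >= 90 but no z == 31 before it, max(which(...)) returns -Inf with a warning and the replacement breaks. A guarded sketch of the same logic (the out and hits names are mine):
df %>%
  group_by(x) %>%
  mutate(ave = {
    out <- rep(0, n())            # default: no qualifying range
    idxU <- which(z >= 90)[1]     # first time z reaches 90, NA if never
    if (!is.na(idxU)) {
      hits <- which(z[seq_len(idxU)] == 31)
      if (length(hits) > 0) {     # only average if a 31 precedes it
        out[max(hits):idxU] <- mean(y[max(hits):idxU])
      }
    }
    out
  }) %>%
  ungroup()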

Calculating days difference on rolling basis depending on another column

I'm trying to create a calculated column using dplyr to get the difference in days between the reference date (the current row) and a future date, on a rolling basis. For example, I have a data frame like:
library(lubridate)
sample <- data.frame(dates = seq(today(), today() + weeks(3), by = 1),
                     qty = floor(100 * rnorm(22)))
What I want to achieve is to create a new column, say days_to, which will be 0 if qty >= 0. However, if qty < 0, then days_to should be the number of days until qty goes back above 0. If qty doesn't go above 0 at any future date, then days_to = NA/Inf (not important). So for the above example it should look something like:
dates qty days_to
10/17/2018 175 0
10/18/2018 -69 2
10/19/2018 -20 1
10/20/2018 113 0
10/21/2018 7 0
10/22/2018 120 0
10/23/2018 48 0
10/24/2018 -31 NA
10/25/2018 -9 NA
10/26/2018 -87 NA
I need to do this for a large number of rows (~2M) on a grouped variable, hence I am trying to use dplyr. Any help is appreciated.
Thanks!
dplyr
library(dplyr)
sampledplyr <- sample %>%
  mutate(grp = cumsum(qty > 0 & lag(qty) < 0)) %>%
  group_by(grp) %>%
  mutate(days_to = if_else(qty < 0, n() - row_number() + 1L, 0L)) %>%
  ungroup() %>%
  select(-grp)
print(sampledplyr, n = 22)
# # A tibble: 22 x 3
# dates qty days_to
# <date> <dbl> <int>
# 1 2018-10-17 -63 1
# 2 2018-10-18 18 0
# 3 2018-10-19 -84 1
# 4 2018-10-20 159 0
# 5 2018-10-21 32 0
# 6 2018-10-22 -83 1
# 7 2018-10-23 48 0
# 8 2018-10-24 73 0
# 9 2018-10-25 57 0
# 10 2018-10-26 -31 1
# 11 2018-10-27 151 0
# 12 2018-10-28 38 0
# 13 2018-10-29 -63 2
# 14 2018-10-30 -222 1
# 15 2018-10-31 112 0
# 16 2018-11-01 -5 2
# 17 2018-11-02 -2 1
# 18 2018-11-03 94 0
# 19 2018-11-04 82 0
# 20 2018-11-05 59 0
# 21 2018-11-06 91 0
# 22 2018-11-07 78 0
data.table
library(data.table)
sampledt <- as.data.table(sample)
sampledt[, days_to := ifelse(qty < 0, .N - seq_len(.N) + 1L, 0L),
         by = cumsum(qty > 0 & shift(qty) < 0)]
(Same output.) One caveat for both versions: if the data ends on a run of negative qty with no later recovery, as in the question's example, that final run counts down to 1 instead of returning NA.
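A minimal hedged fix for that trailing run, applied to the dplyr result (it assumes at least one non-negative qty exists; for grouped data, run it inside group_by()):
library(dplyr)
sampledplyr <- sampledplyr %>%
  mutate(days_to = replace(days_to,
                           qty < 0 & row_number() > max(which(qty >= 0)),
                           NA_integer_))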
Data:
set.seed(1) # always
sample <- data.frame(dates = seq(Sys.Date(), Sys.Date() + 3*7, by = 1),
                     qty = floor(100 * rnorm(22)))

Deduplicate observations based on window of time

I have data on a large number of individuals, and there may be multiple observations per person. I want to deduplicate the data into 'episodes' of 28 days for each individual: I want to drop those records whose observation date is 28 days or less after the start of the prior episode.
Some sample data on 6 observations of a single individual are below. The duplicate and new_episode variables are dummy variables that are not present in the original data; they indicate the logic of the example.
dat <- data.frame(id = rep(1, 6), spec_n = seq(1, 6, 1),
                  spec_date = as.Date(c("2016/01/01", "2016/01/02", "2016/01/30",
                                        "2016/01/31", "2016/02/02", "2016/02/28")),
                  duplicate = c(0, 1, 0, 1, 1, 0),
                  new_episode = c(1, 0, 1, 0, 0, 1),
                  stringsAsFactors = FALSE)
dat
dat
id spec_n spec_date duplicate new_episode
1 1 1 2016-01-01 0 1
2 1 2 2016-01-02 1 0
3 1 3 2016-01-30 0 1
4 1 4 2016-01-31 1 0
5 1 5 2016-02-02 1 0
6 1 6 2016-02-28 0 1
With dplyr I can calculate the time since the last observation and the time since the start of the first episode. Deduplicating on date_diff alone would not provide the data I require.
library(dplyr)
dat <- dat %>%
  group_by(id) %>%
  mutate(date_diff = spec_date - lag(spec_date),
         earliest_spec_date = min(spec_date),
         diff_earliest = spec_date - earliest_spec_date)
dat
dat
id spec_n spec_date duplicate new_episode date_diff earliest_spec_date diff_earliest
<dbl> <dbl> <date> <dbl> <dbl> <time> <date> <time>
1 1 1 2016-01-01 0 1 NA days 2016-01-01 0 days
2 1 2 2016-01-02 1 0 1 days 2016-01-01 1 days
3 1 3 2016-01-30 0 1 28 days 2016-01-01 29 days
4 1 4 2016-01-31 1 0 1 days 2016-01-01 30 days
5 1 5 2016-02-02 1 0 2 days 2016-01-01 32 days
6 1 6 2016-02-28 0 1 26 days 2016-01-01 58 days
However, this does not quite provide what I need: spec_n == 6 is less than 28 days after the previous observation, but more than 28 days after the start of the last episode (spec_n == 3).
Expected output would be those rows where duplicate is 0 or new_episode is 1, e.g.
id spec_n spec_date duplicate new_episode date_diff earliest_spec_date diff_earliest
<dbl> <dbl> <date> <dbl> <dbl> <time> <date> <time>
1 1 1 2016-01-01 0 1 NA days 2016-01-01 0 days
2 1 3 2016-01-30 0 1 28 days 2016-01-01 29 days
3 1 6 2016-02-28 0 1 26 days 2016-01-01 58 days
This should work (it's an implementation of the idea Llopis suggested, I think).
I make some simulated data first:
df <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = 1),
                 data = rnorm(365))
head(df)
date data
1 2015-01-01 -1.4493544
2 2015-01-02 -0.8860342
3 2015-01-03 1.3629541
4 2015-01-04 -2.0131108
5 2015-01-05 -0.4527413
6 2015-01-06 0.8428585
Now we write a function that takes the first date and checks whether subsequent dates are at least 28 days distant from it, returning 0 if they are not and 1 if they are. When a date is at least 28 days away, it becomes the new basis for future comparisons.
dupFinder <- function(x) {
  n <- 1
  N <- length(x)
  res <- rep(1, N)
  start <- x[n]
  while (n < N) {
    if (as.numeric(x[n + 1] - start) >= 28) {
      res[n + 1] <- 1
      n <- n + 1
      start <- x[n]
    } else {
      res[n + 1] <- 0
      n <- n + 1
    }
  }
  return(res)
}
The function dupFinder returns a vector whose length equals the number of rows in your data frame, which you can then use to subset the data frame to the rows of interest. Thus:
df[dupFinder(df$date)==1,]
date data
1 2015-01-01 -1.4493544
29 2015-01-29 0.2084123
57 2015-02-26 1.4541566
85 2015-03-26 0.6794230
113 2015-04-23 -0.8285670
141 2015-05-21 -0.8686872
169 2015-06-18 2.1657994
197 2015-07-16 -1.1802231
225 2015-08-13 0.1808395
253 2015-09-10 -0.4762835
281 2015-10-08 -0.3769593
309 2015-11-05 0.2825544
337 2015-12-03 -0.7132649
365 2015-12-31 -1.8111226
As expected we start with January 1, then January 29, then February 26; since February has 28 days, we next get March 26, etc.
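For completeness, a hedged dplyr sketch of the same rule applied per id to the question's dat, using purrr::accumulate() to carry the running episode start forward (episode_start and new_episode are my names, and the >= 28 cutoff matches dupFinder above):
library(dplyr)
library(purrr)
dat %>%
  group_by(id) %>%
  mutate(
    # running start of the current episode: reset once a date falls 28+
    # days after it, otherwise carried forward unchanged
    episode_start = as.Date(
      accumulate(as.numeric(spec_date),
                 function(start, d) if (d - start >= 28) d else start),
      origin = "1970-01-01"
    ),
    new_episode = as.integer(spec_date == episode_start)
  ) %>%
  ungroup() %>%
  filter(new_episode == 1)
This keeps rows 1, 3 and 6 of dat, matching the expected output.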
