This is my reprex:
dates <- seq(as.POSIXct("2015-01-01 13:10:00", tz = "UTC"), as.POSIXct("2015-01-01 13:10:10", tz="UTC"), by="1 sec")
dates[dst(dates)] <- dates[dst(dates)] - 3600
datavalues <- data.frame(x=c(90,90,80,65,NA,64,71,75,62,63,74))
data <- cbind(dates,datavalues)
data
dates x
1 2015-01-01 13:10:00 90
2 2015-01-01 13:10:01 90
3 2015-01-01 13:10:02 80
4 2015-01-01 13:10:03 65
5 2015-01-01 13:10:04 NA
6 2015-01-01 13:10:05 64
7 2015-01-01 13:10:06 71
8 2015-01-01 13:10:07 75
9 2015-01-01 13:10:08 62
10 2015-01-01 13:10:09 63
11 2015-01-01 13:10:10 74
I would like to obtain the following data frame (which I will column-bind to data):
results <- data.frame(Duration=c(3,3,3,0,0,0,2,2,0,0,1),Maxx=c(90,90,90,0,0,0,75,75,0,0,74),Delta=c(0,0,0,0,0,0,7,0,0,0,11))
results
Duration Maxx Delta
1 3 90 0
2 3 90 0
3 3 90 0
4 0 0 0
5 0 0 0
6 0 0 0
7 2 75 7
8 2 75 0
9 0 0 0
10 0 0 0
11 1 74 11
I set the threshold to 70.
The Duration column is the number of consecutive values exceeding the threshold.
The Maxx column is the maximum of x over each run with non-zero duration.
Lastly, the Delta column is the difference between the first x exceeding 70 and the preceding x.
If possible, I would like code using dplyr, because the code around this piece is already dplyr code. Thank you in advance.
With the help of data.table's rleid you can create groups of consecutive values that are above or below the threshold and calculate the summaries within each group.
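For intuition, rleid assigns a new integer id each time the value changes, so consecutive runs on the same side of the threshold share an id (a quick illustration on the example's x; the NA is treated as below the threshold):
library(data.table)
x <- c(90, 90, 80, 65, NA, 64, 71, 75, 62, 63, 74)
rleid(replace(x, is.na(x), 0) < 70)
#[1] 1 1 1 2 2 2 3 3 4 4 5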
library(dplyr)
library(data.table)
threshold <- 70
data %>%
#Create a unique group of consecutive values
group_by(group = rleid(replace(x, is.na(x), 0) < threshold)) %>%
#If the value is less than threshold put 0 in duration or else
#include number of observations in the group. Do the same for max value.
mutate(Duration = if_else(x < threshold, 0L, n(), missing = 0L),
#+(Duration > 0) is used to turn values less than threshold to 0
Maxx = max(x, na.rm = TRUE) * +(Duration > 0)) %>%
ungroup() %>%
#Subtract current value with previous value
mutate(Delta = x - lag(x),
#Keep only those values that are first row in each group
Delta = replace(Delta, group == lag(group, default = first(group)) |
Duration == 0, 0)) %>%
select(-group)
# dates x Duration Maxx Delta
# <dttm> <dbl> <int> <dbl> <dbl>
# 1 2015-01-01 13:10:00 90 3 90 0
# 2 2015-01-01 13:10:01 90 3 90 0
# 3 2015-01-01 13:10:02 80 3 90 0
# 4 2015-01-01 13:10:03 65 0 0 0
# 5 2015-01-01 13:10:04 NA 0 0 0
# 6 2015-01-01 13:10:05 64 0 0 0
# 7 2015-01-01 13:10:06 71 2 75 7
# 8 2015-01-01 13:10:07 75 2 75 0
# 9 2015-01-01 13:10:08 62 0 0 0
#10 2015-01-01 13:10:09 63 0 0 0
#11 2015-01-01 13:10:10 74 1 74 11
The problem looks so easy that it should have a simple solution, yet I couldn't find any.
I have a long table of records indexed by time. Time intervals are not fixed. There is one categorical variable, and I am interested in calculating the streak for each category (for how many days in a row we had "A", then e.g. "B"; then "A" may return and begin another streak). Doing it in Excel requires simply an if function with a reference to the row above. In R I can do it with a for loop, which I provide in the toy example below. I mainly wonder how it could be done in dplyr.
library(tidyverse)
library(lubridate)
set.seed(33)
# I create a date column - 20 dates starting from "2020-01-31", then uneven intervals, from 1 to 5 weeks
date <- rep(ymd(20200131), 20)
# (btw, I believe this should also be possible to do without a for loop, and I
# also cannot come up with a solution for that; see the commented sketch after the loop):
for (i in 2:length(date)){
date[i] <- date[i-1]+7*sample(1:5, 1)
}
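# (A loop-free sketch for the aside above: cumulative sums of the random week
# gaps build the same kind of irregular date sequence, though the exact dates
# may differ because the random draws are consumed in a different order.)
# date <- ymd(20200131) + 7 * c(0, cumsum(sample(1:5, 19, replace = TRUE)))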
# A categorical column
user <- c(rep("A",3), "B", rep("C",4), rep("B",5), rep("A", 6), "B")
df <- data.frame(date, user)
df$desired_result <- 0
for (i in 2:nrow(df)){
if (df[i, "user"] != df[i-1, "user"]) df[i, "desired_result"] <- 0
else df[i, "desired_result"] <- as.integer(df[i, "date"] - df[i-1, "date"]) + df[i-1, "desired_result"]
}
date user desired_result
1 2020-01-31 A 0
2 2020-03-06 A 35
3 2020-04-03 A 63
4 2020-04-10 B 0
5 2020-04-17 C 0
6 2020-05-08 C 21
7 2020-05-29 C 42
8 2020-06-26 C 70
9 2020-07-03 B 0
10 2020-07-10 B 7
11 2020-07-24 B 21
12 2020-08-28 B 56
13 2020-09-18 B 77
14 2020-10-02 A 0
15 2020-10-09 A 7
16 2020-10-23 A 21
17 2020-11-06 A 35
18 2020-11-13 A 42
19 2020-11-20 A 49
20 2020-12-04 B 0
And now the question: how to do it in dplyr?
# This is wrong: "object 'result' not found":
df %>%
as_tibble() %>%
mutate(result = if_else(user == lag(user),
as.integer(date - lag(date)) + lag(result),
0))
# This is wrong: when the condition is fulfilled, it adds as.integer(date - lag(date)) to 0, not to the result in the row above.
# It doesn't proceed like a loop does, from the top of the column to the bottom; it doesn't "update" values in the column
# as it proceeds.
df %>%
as_tibble() %>%
mutate(result = 0) %>%
mutate(result = if_else(user == lag(user),
as.integer(date - lag(date)) + lag(result),
0))
# A tibble: 20 x 4
date user desired_result result
<date> <fct> <dbl> <dbl>
1 2020-01-31 A 0 NA
2 2020-02-14 A 14 14
3 2020-03-13 A 42 28
4 2020-03-20 B 0 0
5 2020-04-03 C 0 0
6 2020-05-01 C 28 28
7 2020-05-08 C 35 7
8 2020-06-12 C 70 35
9 2020-07-17 B 0 0
10 2020-08-21 B 35 35
11 2020-09-04 B 49 14
12 2020-09-18 B 63 14
13 2020-10-16 B 91 28
14 2020-10-23 A 0 0
15 2020-11-13 A 21 21
16 2020-11-27 A 35 14
17 2020-12-25 A 63 28
18 2021-01-08 A 77 14
19 2021-02-12 A 112 35
20 2021-03-05 B 0 0
I've tried group_by(), which seems not applicable here because categories may return and start new streaks, and cumsum(), which also didn't help me so far. I have a strong feeling that there must be a basic solution :)
We can do a grouped operation with rleid on 'user', get the difference between 'date' and the lag of 'date', and take the cumulative sum (cumsum).
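The core idea, shown on a single run of equal users (a minimal illustration): the lagged date differences are the day gaps, and their cumulative sum is the running streak.
d <- as.Date(c("2020-04-17", "2020-05-08", "2020-05-29", "2020-06-26"))
cumsum(as.integer(d - dplyr::lag(d, default = d[1])))
#[1]  0 21 42 70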
library(dplyr)
library(data.table)
df %>%
group_by(grp = rleid(user)) %>%
mutate(desired_result2 = cumsum(as.integer(date - lag(date,
default = first(date))))) %>%
ungroup %>%
select(-grp)
Output:
# A tibble: 20 x 4
# date user desired_result desired_result2
# <date> <chr> <dbl> <int>
# 1 2020-01-31 A 0 0
# 2 2020-02-14 A 14 14
# 3 2020-03-13 A 42 42
# 4 2020-03-20 B 0 0
# 5 2020-04-03 C 0 0
# 6 2020-05-01 C 28 28
# 7 2020-05-08 C 35 35
# 8 2020-06-12 C 70 70
# 9 2020-07-17 B 0 0
#10 2020-08-21 B 35 35
#11 2020-09-04 B 49 49
#12 2020-09-18 B 63 63
#13 2020-10-16 B 91 91
#14 2020-10-23 A 0 0
#15 2020-11-13 A 21 21
#16 2020-11-27 A 35 35
#17 2020-12-25 A 63 63
#18 2021-01-08 A 77 77
#19 2021-02-12 A 112 112
#20 2021-03-05 B 0 0
NOTE: Here the desired_result is the output from the OP's for loop and desired_result2 is the non-loop output
Or this can be done with rle from base R
df$desired_result2 <- with(df, ave(as.numeric(date), with(rle(user),
rep(seq_along(values), lengths)), FUN = function(x)
cumsum(c(0, diff(x)))))
df$desired_result2
#[1] 0 14 42 0 0 28 35 70 0 35
#[11] 49 63 91 0 21 35 63 77 112 0
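For reference, the rle() trick builds the same per-row run ids that rleid() produces: rle() compresses consecutive repeats into values and lengths, and repeating seq_along(values) by lengths expands a group id back out, one per row.
with(rle(c("A", "A", "A", "B", "C", "C")), rep(seq_along(values), lengths))
#[1] 1 1 1 2 3 3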
I would like to create a data frame in which the first column holds all the dates from a certain period of time and the second holds the number of events that occurred on each date, including dates when no events occurred. I would also like to count the events separately by the factor levels assigned to them.
The first data frame contains the events with their dates:
Row Sex Age Date
1 2 36 2004-01-05
2 1 47 2004-01-06
3 1 26 2004-01-10
4 2 23 2004-01-20
5 1 50 2004-01-27
6 2 35 2004-01-28
7 1 35 2004-01-30
8 1 38 2004-02-06
9 2 29 2004-02-11
In the "Sex" column, 1 means female and 2 means male.
The second data frame contains the dates from the examined period:
Row Date
1 2004-01-05
2 2004-01-06
3 2004-01-07
4 2004-01-08
5 2004-01-09
6 2004-01-10
7 2004-01-11
8 2004-01-12
9 2004-01-13
10 2004-01-14
I want to get a data frame that looks like this:
Row Date Events (All) Events (Female) Events (Male)
1 2004-01-05 1 0 1
2 2004-01-06 1 1 0
3 2004-01-07 0 0 0
4 2004-01-08 0 0 0
5 2004-01-09 0 0 0
6 2004-01-10 1 1 0
7 2004-01-11 0 0 0
8 2004-01-12 0 0 0
9 2004-01-13 0 0 0
10 2004-01-14 0 0 0
Can anyone help?
Here's one method:
library(data.table)
library(magrittr) # just for %>%
out <- dat1 %>%
dcast(Date ~ Sex, data = ., fun.aggregate = length) %>%
setnames(., c("1", "2"), c("Female", "Male")) %>%
.[ dat2[ , .(Date)], on = "Date" ] %>%
.[, lapply(.SD, function(a) replace(a, is.na(a), 0)), ] %>%
.[, All := Female + Male ]
out
# Date Female Male All
# 1: 2004-01-05 0 1 1
# 2: 2004-01-06 1 0 1
# 3: 2004-01-07 0 0 0
# 4: 2004-01-08 0 0 0
# 5: 2004-01-09 0 0 0
# 6: 2004-01-10 1 0 1
# 7: 2004-01-11 0 0 0
# 8: 2004-01-12 0 0 0
# 9: 2004-01-13 0 0 0
# 10: 2004-01-14 0 0 0
Note that the use of lapply might not be the overall fastest method to replace NA with 0, but it gets the point across. Also, I use magrittr's %>% merely to break out the steps; this can be done easily without %>%.
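A possible in-place alternative for the NA step (a sketch, assuming data.table >= 1.12.4, which introduced setnafill()):
# fill NA with 0 in the count columns, by reference
setnafill(out, fill = 0, cols = c("Female", "Male"))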
Data:
dat1 <- fread(text = "
Row Sex Age Date
1 2 36 2004-01-05
2 1 47 2004-01-06
3 1 26 2004-01-10
4 2 23 2004-01-20
5 1 50 2004-01-27
6 2 35 2004-01-28
7 1 35 2004-01-30
8 1 38 2004-02-06
9 2 29 2004-02-11")
dat2 <- fread(text = "
Row Date
1 2004-01-05
2 2004-01-06
3 2004-01-07
4 2004-01-08
5 2004-01-09
6 2004-01-10
7 2004-01-11
8 2004-01-12
9 2004-01-13
10 2004-01-14")
A tidyverse version:
dat1 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Row Sex Age Date
1 2 36 2004-01-05
2 1 47 2004-01-06
3 1 26 2004-01-10
4 2 23 2004-01-20
5 1 50 2004-01-27
6 2 35 2004-01-28
7 1 35 2004-01-30
8 1 38 2004-02-06
9 2 29 2004-02-11")
dat2 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Row Date
1 2004-01-05
2 2004-01-06
3 2004-01-07
4 2004-01-08
5 2004-01-09
6 2004-01-10
7 2004-01-11
8 2004-01-12
9 2004-01-13
10 2004-01-14")
library(dplyr)
library(tidyr)
as_tibble(dat1) %>%
group_by(Date, Sex) %>%
tally() %>%
ungroup() %>%
pivot_wider(id_cols = "Date", names_from = "Sex", values_from = "n",
values_fill = list(n = 0)) %>%
rename(Female = "1", Male = "2") %>%
left_join(select(dat2, Date), ., by = "Date") %>%
mutate_at(vars(Female, Male), ~ replace(., is.na(.), 0)) %>%
mutate(All = Female + Male)
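A slightly more compact variant of the same idea (a sketch, assuming dplyr >= 1.0.0 and tidyr >= 1.1.0; the Sex1/Sex2 names come from names_prefix and are only illustrative):
dat1 %>%
  count(Date, Sex) %>%
  pivot_wider(names_from = Sex, values_from = n,
              names_prefix = "Sex", values_fill = 0) %>%
  right_join(select(dat2, Date), by = "Date") %>%
  mutate(across(starts_with("Sex"), ~ replace(., is.na(.), 0)),
         All = Sex1 + Sex2)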
I am tinkering around with this loop (I'm new to writing loops but trying to learn).
My aim: when x == 1, on the first match store the value of z, then subtract each successive z value from that first value. If x == 0 it should do nothing (not sure if I have to tell the code to do nothing when x == 0?).
This is my dummy data:
x <- c(0,0,1,1,1,0,1,1,1,1,0,0,0)
z <- c(10,34,56,43,56,98,78,98,23,21,45,65,78)
df <- data.frame(x,z)
for (i in 1:nrow(df)) {
  if (df$x[i] == 1) {
    first_price <- df$z[i]
    df$output <- first_price - df$z
  }
}
I have my if (df$x[i] == 1).
Then I want to save the first price... so first_price <- df$z[i].
The i in here, that means the first in the series, right?
Then for my output... I wanted to subtract the first price from each successive price. If I fix the first price with [i], is this the correct way? And if I leave df$z, would that then take the next price each time in the loop and subtract it from
first_price <- df$z[i]?
Here's a visual of my progress:
> for (i in 1:nrow(df)) {
+ if (df$x[i] == 1) {
+ first_price <- df$z[1]
+ df$output <- first_price - df$z
+ }
+ }
> df$output
[1] 0 -24 -46 -33 -46 -88 -68 -88 -13 -17 -35 -55 -68
If I add [1], which selects the first element of df$z, this actually fixes the first element and then subtracts each successive value. Now it needs to be rule-based and understand that this should only be the case when df$x == 1.
This should work for you
library(dplyr)
library(data.table)
ans <- df %>%
mutate(originalrow = row_number()) %>% # original row position
group_by(rleid(x)) %>%
mutate(ans = first(z) - z) %>%
filter(x==1)
# # A tibble: 7 x 5
# # Groups: rleid(x) [2]
# x z originalrow `rleid(x)` ans
# <dbl> <dbl> <int> <int> <dbl>
# 1 1 56 3 2 0
# 2 1 43 4 2 13
# 3 1 56 5 2 0
# 4 1 78 7 4 0
# 5 1 98 8 4 -20
# 6 1 23 9 4 55
# 7 1 21 10 4 57
vans <- ans$ans
# [1] 0 13 0 0 -20 55 57
EDIT
To keep all rows, outputting 0 where x == 0:
ans <- df %>%
mutate(originalrow = row_number()) %>%
group_by(rleid(x)) %>%
mutate(ans = ifelse(x==0, 0, first(z) - z))
# # A tibble: 13 x 5
# # Groups: rleid(x) [5]
# x z originalrow `rleid(x)` ans
# <dbl> <dbl> <int> <int> <dbl>
# 1 0 10 1 1 0
# 2 0 34 2 1 0
# 3 1 56 3 2 0
# 4 1 43 4 2 13
# 5 1 56 5 2 0
# 6 0 98 6 3 0
# 7 1 78 7 4 0
# 8 1 98 8 4 -20
# 9 1 23 9 4 55
# 10 1 21 10 4 57
# 11 0 45 11 5 0
# 12 0 65 12 5 0
# 13 0 78 13 5 0
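The same idea works in base R (a sketch): build run ids with rle() and compute first-minus-current within each run via ave().
grp <- with(rle(df$x), rep(seq_along(values), lengths))
df$ans <- ifelse(df$x == 0, 0, ave(df$z, grp, FUN = function(z) z[1] - z))
df$ans
#[1] 0 0 0 13 0 0 0 -20 55 57 0 0 0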
My data is structured as follows:
DT <- data.table(Id = c(1, 1, 1, 1, 10, 100, 100, 101, 101, 101),
Date = as.Date(c("1997-01-01", "1997-01-02", "1997-01-03", "1997-01-04",
"1997-01-02", "1997-01-02", "1997-01-04", "1997-01-03",
"1997-01-04", "1997-01-04")),
group = c(1,1,1,1,1,2,2,2,2,2),
Price.1 = c(29, 25, 14, 26, 30, 16, 13, 62, 12, 6),
Price.2 = c(4, 5, 6, 6, 8, 2, 3, 5, 7, 8))
>DT
Id Date group Price.1 Price.2
1: 1 1997-01-01 1 29 4
2: 1 1997-01-02 1 25 5
3: 1 1997-01-03 1 14 6
4: 1 1997-01-04 1 26 6
5: 10 1997-01-02 1 30 8
6: 100 1997-01-02 2 16 2
7: 100 1997-01-04 2 13 3
8: 101 1997-01-03 2 62 5
9: 101 1997-01-04 2 12 7
10: 101 1997-01-04 2 6 8
I am trying to cast it (using dcast.data.table):
dcast.data.table(DT, Id ~ Date, fun = sum, value.var = "Price.1")
dcast.data.table(DT, Id ~ group, fun = sum, value.var = "Price.1")
dcast.data.table(DT, Id ~ Date, fun = sum, value.var = "Price.2")
dcast.data.table(DT, Id ~ group, fun = sum, value.var = "Price.2")
but rather than 4 separate outputs I am trying to get the following:
Id 1997-01-01 1997-01-02 1997-01-03 1997-01-04 1 2 Price
1: 1 29 25 14 26 94 0 Price.1
2: 10 0 30 0 0 30 0 Price.1
3: 100 0 16 0 13 0 29 Price.1
4: 101 0 0 62 18 0 80 Price.1
5: 1 4 5 6 6 21 0 Price.2
6: 10 0 8 0 0 8 0 Price.2
7: 100 0 2 0 3 0 5 Price.2
8: 101 0 0 5 15 0 20 Price.2
My current work-around uses rbind, cbind, and merge:
cbind(rbind(merge(dcast.data.table(DT, Id ~ Date, fun = sum, value.var = "Price.1"),
dcast.data.table(DT, Id ~ group, fun = sum, value.var = "Price.1"), by = "Id", all.x = T),
merge(dcast.data.table(DT, Id ~ Date, fun = sum, value.var = "Price.2"),
dcast.data.table(DT, Id ~ group, fun = sum, value.var = "Price.2"), by = "Id", all.x = T)),
Price = c("Price.1","Price.1","Price.1","Price.1","Price.2","Price.2","Price.2","Price.2"))
Is there an existing and cleaner way to do this?
I make the assumption that each Id maps to a unique group and get rid of that variable, but otherwise this is essentially the same as @user227710's answer.
Idg <- unique(DT[,.(Id,group)])
DT[,group:=NULL]
res <- dcast(
melt(DT, id.vars = c("Id","Date")),
variable+Id ~ Date,
value.var = "value",
fill = 0,
margins = "Date",
fun.aggregate = sum
)
# and if you want the group back...
setDT(res) # needed before data.table 1.9.5, where using dcast.data.table is another option
setkey(res,Id)
res[Idg][order(variable,Id)]
which gives
variable Id 1997-01-01 1997-01-02 1997-01-03 1997-01-04 (all) group
1: Price.1 1 29 25 14 26 94 1
2: Price.2 1 4 5 6 6 21 1
3: Price.1 10 0 30 0 0 30 1
4: Price.2 10 0 8 0 0 8 1
5: Price.1 100 0 16 0 13 29 2
6: Price.2 100 0 2 0 3 5 2
7: Price.1 101 0 0 62 18 80 2
8: Price.2 101 0 0 5 15 20 2
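Side note: newer versions of data.table (>= 1.9.6) let dcast() take several value.var columns at once, which avoids the explicit melt at the cost of a different layout (one column per Price/date combination):
dcast(DT, Id ~ Date, fun.aggregate = sum, value.var = c("Price.1", "Price.2"))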
This was really trial and error: I hope it works.
library(data.table) #version 1.9.4
library(reshape2)
kk <- melt(DT,id.vars=c("Id","Date","group"),
measure.vars = c("Price.1","Price.2"),
value.name = "Price")
dcast(kk, Id + variable + group ~ Date, value.var = "Price", fun = sum,margins="Date")
# ^ use of margins borrowed from #Frank.
# Id variable group 1997-01-01 1997-01-02 1997-01-03 1997-01-04 (all)
# 1 1 Price.1 1 29 25 14 26 94
# 2 1 Price.2 1 4 5 6 6 21
# 3 10 Price.1 1 0 30 0 0 30
# 4 10 Price.2 1 0 8 0 0 8
# 5 100 Price.1 2 0 16 0 13 29
# 6 100 Price.2 2 0 2 0 3 5
# 7 101 Price.1 2 0 0 62 18 80
# 8 101 Price.2 2 0 0 5 15 20
And just to compare, a solution in dplyr (as I have yet to learn how to get my brain to melt things properly).
library(dplyr)
library(stringr)
library(tidyr)
# aggregate the data completely
## (rows 9 & 10 need to be collapsed, and spread works on a single key)
DTT <-
DT %>%
group_by(Id, Date, group) %>%
summarise(Price.1 = sum(Price.1), Price.2 = sum(Price.2)) %>%
left_join(DT) %>%
unite(id_grp, Id, group, sep = "_") %>%
group_by(id_grp) %>%
mutate(s1 = sum(Price.1), s2 = sum(Price.2))
# pivot out the index into cartesian (long to wide) for 1st Price set
DW1 <-
DTT %>%
select(-Price.2) %>%
spread(Date, Price.1) %>%
mutate(Price = "Price.1")
# pivot out the index into cartesian (long to wide) for 2nd Price set
DW2 <-
DTT %>%
select(-Price.1) %>%
spread(Date, Price.2) %>%
mutate(Price = "Price.2")
# Bind records back together and make purdy
DWFin <-
bind_rows(DW1,DW2) %>%
separate(id_grp, c("Id", "group")) %>%
mutate(g = group, p = str_sub(Price, -1),
n1 = ifelse(group == 1 & p == 1, s1, ifelse(group == 1 & p == 2, s2, 0)),
n2 = ifelse(group == 2 & p == 2, s2, ifelse(group == 2 & p == 1, s1, 0))) %>%
select(Id, starts_with("19"), "1" = n1, "2" = n2, Price)
DWFin
Source: local data table [8 x 8]
# tbl_dt [8 × 8]
Id `1997-01-01` `1997-01-02` `1997-01-03` `1997-01-04` `1` `2` Price
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 29 25 14 26 94 0 Price.1
2 10 NA 30 NA NA 30 0 Price.1
3 100 NA 16 NA 13 0 29 Price.1
4 101 NA NA 62 18 0 80 Price.1
5 1 4 5 6 6 21 0 Price.2
6 10 NA 8 NA NA 8 0 Price.2
7 100 NA 2 NA 3 0 5 Price.2
8 101 NA NA 5 15 0 20 Price.2
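With current tidyr, the same reshape can be written more directly (a sketch, assuming tidyr >= 1.1.0 and dplyr >= 1.0.0): pivot_longer stacks both Price columns, then two pivot_wider calls, one spread by Date and one by group, are joined back together.
library(dplyr)
library(tidyr)
long <- DT %>%
  pivot_longer(starts_with("Price"), names_to = "Price", values_to = "value")
by_date <- long %>%
  pivot_wider(id_cols = c(Id, Price), names_from = Date,
              values_from = value, values_fn = sum, values_fill = 0)
by_group <- long %>%
  pivot_wider(id_cols = c(Id, Price), names_from = group,
              values_from = value, values_fn = sum, values_fill = 0)
left_join(by_date, by_group, by = c("Id", "Price")) %>%
  relocate(Price, .after = last_col())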