I have a time series with several instances.
Some instances look normal, but some have peaks repeated at nearly the same interval (1 hour in this example).
I need to analyze the data and find the instances with that anomaly - peaks repeated at nearly the same interval.
As a result I expect the name of the instance with that anomaly (only 'A' in my example data), its period of peaks in seconds (3600 in my example data) and the spread between, say, the median and the peaks of that instance.
How can I do that?
Here is my example data:
library(dplyr)
library(lubridate)
library(ggplot2)
set.seed(900)
data1 <- data.frame(
  datetime = seq.POSIXt(as.POSIXct("2020-12-26 10:00:00"), as.POSIXct("2020-12-26 10:00:00") + 15*50001, "15 sec"),
  Value = sample(1:10, 50002, replace = T),
  Instance = "A"
)
data1.1 <- data.frame(
  datetime = seq.POSIXt(as.POSIXct("2020-12-26 10:00:00"), as.POSIXct("2020-12-26 10:00:00") + 15*50001, "hour"),
  Value = sample(10:100, 209, replace = T),
  Instance = "A"
)
data1 <- rbind(data1, data1.1) %>% group_by(datetime, Instance) %>% summarise(Value = max(Value)) %>% ungroup()
data2 <- data.frame(
  datetime = seq.POSIXt(as.POSIXct("2020-12-26 10:00:00"), as.POSIXct("2020-12-26 10:00:00") + 15*50001, "15 sec"),
  Value = sample(1:100, 50002, replace = T),
  Instance = "B"
)
data3 <- data.frame(
  datetime = seq.POSIXt(as.POSIXct("2020-12-26 10:00:00"), as.POSIXct("2020-12-26 10:00:00") + 15*50001, "15 sec"),
  Value = sample(1:100, 50002, replace = T),
  Instance = "C"
)
data4 <- data.frame(
  datetime = seq.POSIXt(as.POSIXct("2020-12-26 10:00:00"), as.POSIXct("2020-12-26 10:00:00") + 15*50001, "15 sec"),
  Value = sample(1:100, 50002, replace = T),
  Instance = "D"
)
data <- do.call("rbind", list(data1, data2, data3, data4))
As I am still not quite sure what the answer is supposed to look like, this is only a preliminary attempt to help. It is probably not exactly what you need, but maybe it can be a step in the right direction?
qt <- data[data$Value >= 99, ]  # find peaks
qt_test <- qt
# length(qt$datetime)
# table(qt$Instance)
# create hourly difference between peak times (seconds were not useful, mins may be better)
x <- vector()
for (i in seq_along(qt_test[[1]])) {
  if (i == 1) {
    a <- 0
    x <- c(x, a)
  }
  if (i > 1) {
    a <- difftime(qt_test[[1]][i], qt_test[[1]][i-1], units = "hours")
    x <- c(x, a)
  }
}
qt_test$difftime_in_hours <- x
# summary(qt_test$difftime_in_hours)
qt2 <- qt_test[between(qt_test$difftime_in_hours, 0.04, 0.05), ]  # timeframe - +/- floor/ceiling mean (0.04647) - only for test purposes
# A tibble: 154 x 4
datetime Instance Value difftime_in_hours
<dttm> <chr> <int> <dbl>
1 2020-12-26 15:44:15 B 100 0.0417
2 2020-12-26 16:44:00 B 99 0.0417
3 2020-12-26 16:57:30 B 100 0.05
4 2020-12-26 17:58:00 B 99 0.0417
5 2020-12-26 19:15:30 B 100 0.05
6 2020-12-26 19:24:30 B 100 0.0417
7 2020-12-27 04:04:45 B 99 0.05
8 2020-12-27 09:37:00 B 99 0.0417
9 2020-12-27 11:16:00 B 100 0.0417
10 2020-12-27 11:55:30 B 99 0.05
# ... with 144 more rows
table(qt2$Instance)
B C D
69 47 38
Results will differ due to the random seed, but this is an answer as far as I understand your question.
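Building on that idea, here is a rough per-instance sketch. The peak threshold (3 x the instance median) and the use of the median gap as the period estimate are assumptions, not part of the question - tune them to your data:

library(dplyr)
peak_summary <- data %>%
  group_by(Instance) %>%
  mutate(med = median(Value),
         is_peak = Value > 3 * med) %>%                  # assumed peak definition
  filter(is_peak) %>%
  arrange(datetime, .by_group = TRUE) %>%
  mutate(gap_secs = as.numeric(difftime(datetime, lag(datetime), units = "secs"))) %>%
  summarise(n_peaks = n(),
            period_secs = median(gap_secs, na.rm = TRUE),  # candidate period (roughly 3600 for 'A' above)
            gap_mad = mad(gap_secs, na.rm = TRUE),         # small value = regular interval
            spread = median(Value) - first(med))           # peak level vs. overall median
peak_summary

Instances with many peaks, a small gap_mad and a large spread are the candidates for the "regular peaks" anomaly; instances whose values never exceed the threshold simply drop out.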
Related
I have a data.frame with 156 variables and I would like to multiply a subset of those variables by another subset of the 156. How can I do that? The variables are of the following form:
ID || quantity_1 || quantity_2 || ... || quantity_156 || priceperunit_1 || ... || priceperunit_156
Essentially, I would like to multiply each quantity by its price per unit.
Given that you first have all the quantity columns and then all the priceperunit columns in the correct order, you can make two different data sets and multiply them using mapply, like this. The result will contain the product of each pair of columns; rename the columns if needed:
quantity_1 <- c(1, 2, 3)
quantity_2 <- c(1, 2, 3)
quantity_3 <- c(1, 2, 3)
quantity_40 <- c(1, 2, 3)
priceperunit_1 <- c(20, 20, 20)
priceperunit_2 <- c(30, 30, 30)
priceperunit_3 <- c(15, 15, 15)
priceperunit_40 <- c(1.25, 1.25, 1.65)
df <- data.frame(quantity_1, quantity_2, quantity_3, quantity_40,
                 priceperunit_1, priceperunit_2, priceperunit_3, priceperunit_40)
df
#> quantity_1 quantity_2 quantity_3 quantity_40 priceperunit_1 priceperunit_2
#> 1 1 1 1 1 20 30
#> 2 2 2 2 2 20 30
#> 3 3 3 3 3 20 30
#> priceperunit_3 priceperunit_40
#> 1 15 1.25
#> 2 15 1.25
#> 3 15 1.65
as.data.frame(mapply("*",
                     df[, grep("^quantity", names(df))],
                     df[, grep("^priceperunit", names(df))]))
#> quantity_1 quantity_2 quantity_3 quantity_40
#> 1 20 30 15 1.25
#> 2 40 60 30 2.50
#> 3 60 90 45 4.95
A simple solution would be to reshape your data from wide to long format, see ?reshape. This will also help you keep your original data format.
The trick is to store all your price data and all your quantity data in the same two columns; records are differentiated by the time variable that reshape creates (1 to 156 according to your post).
You can then compute the volume (price x quantity), and go back to the wide format if desired.
See the reproducible example below.
df <- data.frame(price1 = 1:10,
                 price2 = 11:20,
                 quantity1 = c(1:5, 1:5),
                 quantity2 = c(5:9, 5:9))
df_long <- reshape(df,
                   direction = "long",
                   v.names = c("price", "quantity"),
                   varying = list(paste0("price", 1:2), paste0("quantity", 1:2)),
                   times = c("1", "2"))
df_long$volume <- df_long$price * df_long$quantity
df_wide <- reshape(df_long,
                   direction = "wide",
                   v.names = c("price", "quantity", "volume"),
                   timevar = "time",
                   sep = "")
This is hopefully a simple question about loops in R. I have a dataset made up of results from a simulation. Each column contains the results for a single cow, taken each day for a month (30 days) and then repeated 100 times, so the total length of each column is 3000.
I would like to calculate the mean of the simulated results for each day, to get a single value per day for each cow. So I need the mean of the 1st entry, the 31st entry, the 61st entry and so on, then the mean of the 2nd entry, the 32nd entry, the 62nd entry and so on. I would like to end up with a 30-entry column for each cow. I have been trying to do it with a loop in R but can't work out how. Any advice would be greatly appreciated.
Here is some example data:
a<-seq(from = 1, by = 1, length = 30)
b<-seq(from = 1, by = 0.5, length = 30)
c<-seq(from = 1, by = 2, length = 30)
cow1<-rep(a,100)
cow2<-rep(b,100)
cow3<-rep(c,100)
dat<-as.data.frame(cbind(cow1,cow2,cow3))
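Since each cow column is just the same 30-day block repeated 100 times, one quick base-R check (a sketch using the example data above) is to fold each column into a 30 x 100 matrix and take row means:

# each column of the matrix is one repetition; rowMeans() then averages
# across the 100 repetitions, giving one value per day and per cow
daily_means <- sapply(dat, function(x) rowMeans(matrix(x, nrow = 30)))
head(daily_means)

The answers below generalise this by adding an explicit day column.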
I think it is better to construct a "day" column and then use it with tapply. As Xi'an said, there is no need for a loop; a loop would be slower and less clean. In code this gives us:
a <- seq(from = 1, by = 1, length = 30)
b <- seq(from = 1, by = 0.5, length = 30)
c <- seq(from = 1, by = 2, length = 30)
day <- seq(from = 1, by = 1, length = 30)
day <- rep(day,100)
cow1 <- rep(a,100)
cow2 <- rep(b,100)
cow3 <- rep(c,100)
# Construct a data frame; I find this way is better as it gives names to the columns.
dat <- data.frame(day,cow1,cow2,cow3)
# Here are the results
tapply(dat$cow1, dat$day, mean)
tapply(dat$cow2, dat$day, mean)
tapply(dat$cow3, dat$day, mean)
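If you want all cows at once, a compact variant of the same idea (a sketch) is:

# one tapply per cow column, collected into a 30 x 3 matrix of daily means
sapply(dat[-1], function(x) tapply(x, dat$day, mean))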
I agree with TMat: including a day column is useful.
Here is a working example using the tidyverse:
library(tidyverse)
a <- seq(from = 1, by = 1, length = 30)
b <- seq(from = 1, by = 0.5, length = 30)
c <- seq(from = 1, by = 2, length = 30)
day <- seq(from = 1, by = 1, length = 30)
day <- rep(day,100)
cow1 <- rep(a,100)
cow2 <- rep(b,100)
cow3 <- rep(c,100)
dat <- data.frame(day, cow1, cow2, cow3) %>%
  pivot_longer(cols = 2:4) %>%
  group_by(day, name) %>%
  summarize(mean = mean(value))
#> `summarise()` regrouping output by 'day' (override with `.groups` argument)
dat
#> # A tibble: 90 x 3
#> # Groups: day [30]
#> day name mean
#> <dbl> <chr> <dbl>
#> 1 1 cow1 1
#> 2 1 cow2 1
#> 3 1 cow3 1
#> 4 2 cow1 2
#> 5 2 cow2 1.5
#> 6 2 cow3 3
#> 7 3 cow1 3
#> 8 3 cow2 2
#> 9 3 cow3 5
#> 10 4 cow1 4
#> # ... with 80 more rows
ggplot(dat, aes(x = day, y = mean, fill = name)) +
geom_col(position = "dodge")
Created on 2020-07-08 by the reprex package (v0.3.0)
I am trying to calculate means for some data along a non-regular date sequence. For example, I have minute-level data for specific periods of the day and I am interested in calculating 5-minute averages. However, I am not sure how the width parameter in rollapply works when it is specified as a list.
library(tidyverse)
library(zoo)
length = 16
set.seed(10)
dxf <- data.frame(
date = seq(Sys.time(), by = "59 sec", length.out = length),
value = runif(length)
)
# Create a "discontinuity"
dxf$date[8:length] <- dxf$date[8:length] + 3600*24
# Add some noise
dxf$date <- dxf$date + runif(length, 0, 1)
diff(dxf$date)
dxf %>%
  arrange(date) %>%
  mutate(
    diff = c(as.numeric(diff(date)), NA),
    mean = rollapply(value, width = 5, mean, partial = TRUE, align = "left")
  )
# This is what I need. Therefore, I need a variable width that adjusts to the discontinuity in the rows.
mean1 <- mean(dxf$value[1:5])
mean2 <- mean(dxf$value[2:6])
mean3 <- mean(dxf$value[3:7])
mean4 <- NA # Only have 4 values mean(dxf$value[4:7])
mean5 <- NA # Only have 3 values mean(dxf$value[5:7])
mean6 <- NA # Only have 2 values mean(dxf$value[6:7])
mean7 <- NA # Only have 1 value mean(dxf$value[7:7])
mean8 <- mean(dxf$value[7:11])
etc.
I think this is a tricky problem. Here is one approach:
1. Generate a 1-minute sequence from the first to the last datetime.
2. Interpolate so we have a value at each minute. This includes interpolating across the discontinuity.
3. Calculate the running 5-minute mean based on the 1-minute interpolated values.
4. Remove the values where the gap in the original datetime values is too large.
Also, take care with time zones; it is best to set these to some deliberately chosen value, or to UTC, which the lubridate parsing functions use by default.
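As a small illustration of the time zone point (force_tz() and with_tz() are from lubridate; the zone chosen here is only an example):

library(lubridate)
x <- ymd_hms("2019-08-14 09:06:05")   # parsed as UTC by default
force_tz(x, tzone = "Europe/Berlin")  # same clock time, relabelled to the new zone
with_tz(x, tzone = "Europe/Berlin")   # same instant, displayed in the new zone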
library(tidyverse)
library(RcppRoll)
library(lubridate)
dxf <- tibble(
  date = seq(from = ymd_hms('2019-08-14 09:06:05'), by = "59 sec", length.out = 30),
  value = runif(30)
)
dxf$date[15:30] <- dxf$date[15:30] + 3600*24  # discontinuity
dxf$date <- dxf$date + round(runif(30))       # noise
dxf <- dxf %>%
  mutate(date = ymd_hms(date),
         date_num = as.numeric(date),
         diff = date_num - lag(date_num))
discontinuity <- which(dxf$diff > 70)
n <- nrow(dxf)
date_seq <- seq(from = dxf$date_num[1], to = dxf$date_num[n], by = 60)    # create a 1-min sequence
value_interp <- approx(x = dxf$date_num, y = dxf$value, xout = date_seq)  # interpolate values onto the 1-min sequence
df <- tibble(
  date = as_datetime(date_seq),
  mean_value = RcppRoll::roll_mean(value_interp$y, n = 5, fill = NA, align = 'left'))
df %>%
  filter(date < dxf$date[discontinuity - 1] | date > dxf$date[discontinuity])
We could extract the date, group by it and then use rollmean:
library(dplyr)
dxf %>%
  mutate(d1 = as.Date(date)) %>%
  group_by(d1) %>%
  mutate(mean = zoo::rollmean(value, 5, align = "left", fill = NA)) %>%
  ungroup %>%
  select(-d1)
# date value mean
# <dttm> <dbl> <dbl>
# 1 2019-08-14 12:49:09 0.507 0.404
# 2 2019-08-14 12:50:08 0.307 0.347
# 3 2019-08-14 12:51:07 0.427 0.341
# 4 2019-08-14 12:52:07 0.693 NA
# 5 2019-08-14 12:53:06 0.0851 NA
# 6 2019-08-14 12:54:05 0.225 NA
# 7 2019-08-14 12:55:04 0.275 NA
# 8 2019-08-15 12:56:02 0.272 0.507
# 9 2019-08-15 12:57:01 0.616 0.476
#10 2019-08-15 12:58:01 0.430 0.472
#11 2019-08-15 12:59:00 0.652 0.457
#12 2019-08-15 12:59:58 0.568 0.413
#13 2019-08-15 13:00:58 0.114 NA
#14 2019-08-15 13:01:56 0.596 NA
#15 2019-08-15 13:02:56 0.358 NA
#16 2019-08-15 13:03:54 0.429 NA
data
set.seed(10)
length <- 16
dxf <- data.frame(
  date = seq(Sys.time(), by = "59 sec", length.out = length),
  value = runif(length)
)
dxf$date[8:length] <- dxf$date[8:length] + 3600*24
dxf$date <- dxf$date + runif(length, 0, 1)
Here w[i] is the number of elements of date that are less than or equal to date[i] + 300, minus (i - 1); the 300 refers to 300 seconds, i.e. the 5-minute window.
date <- dxf$date
w <- findInterval(date + 300, date) - seq_along(date) + 1
rollapply(dxf$value, w, mean, align = "left") * ifelse(w < 5, NA, 1)
# same
sapply(seq_along(w), function(i) mean(dxf$value[seq(i, length = w[i])])) *
ifelse(w < 5, NA, 1)
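To see what w contains, here is a tiny toy check (the six timestamps are made up for illustration; three are a minute apart, then there is a large gap):

d <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC") + c(0, 60, 120, 3600, 3660, 3720)
findInterval(d + 300, d) - seq_along(d) + 1
# [1] 3 2 1 3 2 1
# i.e. the number of observations from each point forward that fall within 5 minutes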
There are 10 projects split between groups A and B, each with different start and end dates. For each day within a given period, the sum of outputX and outputY needs to be calculated. I managed to do this for all projects together, but how do I split the results per group?
I've made several attempts with lapply() and purrr::map(), and also looked at filters and splits, but to no avail. An example that doesn't distinguish between groups is given below.
library(tidyverse)
library(lubridate)
df <- data.frame(
  project = 1:10,
  group = c("A","B"),
  outputX = rnorm(2),
  outputY = rnorm(5),
  start_date = sample(seq(as.Date('2018-01-3'), as.Date('2018-1-13'), by="day"), 10),
  end_date = sample(seq(as.Date('2018-01-13'), as.Date('2018-01-31'), by="day"), 10))
df$interval <- interval(df$start_date, df$end_date)
period <- data.frame(date = seq(as.Date("2018-01-08"), as.Date("2018-01-17"), by = 1))
df_sum <- do.call(rbind, lapply(period$date, function(x) {
  index <- x %within% df$interval
  list("X" = sum(df$outputX[index]),
       "Y" = sum(df$outputY[index]))
}))
outcome <- cbind(period, df_sum) %>% gather("id", "value", 2:3)
outcome
Ultimately, it should be a 40 x 4 table. Any suggestions are much appreciated!
If I understand you correctly, you need an inner join. An existing SO answer suggests using sqldf for this; see https://stackoverflow.com/a/11895368/9300556
With your data we can do something like this. There is no need to calculate df$interval, but we need to add an ID to period, otherwise sqldf won't work.
df <- data.frame(
  project = 1:10,
  group = c("A","B"),
  outputX = rnorm(2),
  outputY = rnorm(5),
  start = sample(seq(as.Date('2018-01-3'), as.Date('2018-1-13'), by="day"), 10),
  end = sample(seq(as.Date('2018-01-13'), as.Date('2018-01-31'), by="day"), 10))
# df$interval <- interval(df$start_date, df$end_date)
period <- data.frame(date = seq(as.Date("2018-01-08"), as.Date("2018-01-17"), by = 1)) %>%
  mutate(id = 1:nrow(.))
Then we can use sqldf:
sqldf::sqldf("select * from period inner join df
              on (period.date > df.start and period.date <= df.end)") %>%
  as_tibble() %>%
  group_by(date, group) %>%
  summarise(X = sum(outputX),
            Y = sum(outputY)) %>%
  gather(id, value, -group, -date)
# A tibble: 40 x 4
# Groups: date [10]
date group id value
<date> <fct> <chr> <dbl>
1 2018-01-08 A X 3.04
2 2018-01-08 B X 2.34
3 2018-01-09 A X 3.04
4 2018-01-09 B X 3.51
5 2018-01-10 A X 3.04
6 2018-01-10 B X 4.68
7 2018-01-11 A X 4.05
8 2018-01-11 B X 4.68
9 2018-01-12 A X 4.05
10 2018-01-12 B X 5.84
# ... with 30 more rows
I have a set of data along these lines:
d1 <- data.frame(
  cat1 = sample(c('a', 'b', 'c'), 100, replace = TRUE),
  date = rep(Sys.Date() - sample(1:100)),
  val = rnorm(100, 50, 5)
)
require(data.table)
d2 <- data.table(d1)
I can get a daily sum without a problem:
d2[ , list(.N, sum(val)), by = c("cat1", "date")]
I want to get a sum over 2 days (and then over 7 days).
This works:
d.list <- sort(unique(d2$date))
o.list <- list()
for (i in seq_along(d.list)) {
  o.list[[i]] <- d2[d2$date >= d.list[i] - 1 & d2$date <= d.list[i],
                    list(.N, sum(val), max(date)), by = c("cat1")]
}
do.call(rbind, o.list)
But it slows down on a bigger data set, and doesn't seem to be the best use of data.table.
Is there a more efficient way?
This is a bit faster:
First we join for exact matches and obtain the last index (in case of multiple matches):
setkey(d2, cat1, date)
tmp1 = d2[unique(d2, by=key(d2)), which=TRUE, mult="last", allow.cartesian=TRUE]
Then we construct a copy of d2 and change date to date-1 by reference. Next, we perform a join with roll=-Inf, i.e. "next observation carried backwards": if there is no exact match, it fills in the next available value.
d3 = copy(d2)[, date := date-1]
setkey(d3, cat1, date)
tmp2 = d2[unique(d3, by=key(d2)), roll=-Inf, which=TRUE, allow.cartesian=TRUE]
From here, we put together the indices:
idx1 = tmp1-tmp2+1L
idx2 = data.table:::vecseq(tmp2, idx1, sum(idx1))
Subset d2 from idx2 and generate unique ids from idx1:
ans1 = d2[idx2][, grp := rep(seq_along(idx1), idx1)]
Finally aggregate by grp and get the desired result:
ans1 = ans1[, list(cat1 = cat1[1L], date = date[.N],
                   N = .N, val = sum(val)), by = grp][, grp := NULL]
> head(ans1, 10L)
# cat1 date N val
# 1: a 2014-01-20 1 47.69178
# 2: a 2014-01-25 1 52.01006
# 3: a 2014-02-01 1 46.82132
# 4: a 2014-02-06 1 44.62404
# 5: a 2014-02-11 1 49.63218
# 6: a 2014-02-14 1 48.80676
# 7: a 2014-02-22 1 49.27800
# 8: a 2014-02-23 2 96.17617
# 9: a 2014-02-26 1 49.20623
# 10: a 2014-02-28 1 46.72708
The results are identical to those from your solution. This one took 0.02 seconds on my laptop, whereas yours took 0.58 seconds.
For 7 days, just change:
d3 = copy(d2)[, date := date-1]
to
d3 = copy(d2)[, date := date-6]
What you want is not explained very clearly in the OP, but this seems to be it:
# generate the [date-1,date] sequences for each date
# adjust length.out to suit your needs
dates = d2[, list(date.seq = seq(date, by = -1, length.out = 2)), by = date]
setkey(dates, date.seq)
setkey(d2, date)
# merge and extract info needed
dates[d2][, list(.N, sum(val), date.seq[.N]), by = list(date, cat1)][, !"date"]
# cat1 N V2 V3
# 1: a 1 38.95774 2014-01-21
# 2: a 1 38.95774 2014-01-21
# 3: c 1 55.68445 2014-01-22
# 4: c 2 102.20806 2014-01-23
# 5: c 1 46.52361 2014-01-23
# ---
#164: c 1 50.17986 2014-04-27
#165: b 1 51.43489 2014-04-28
#166: b 2 100.91982 2014-04-29
#167: b 1 49.48493 2014-04-29
#168: c 1 54.93311 2014-04-30
Would it be possible to set up a binned date, and then do by on that?
d2$day7 <- as.integer(d2$date) %/% 7
d2[ , list(.N, sum(val)), by = c("cat1", "day7")]
That would give a binned value - if you want a sliding 7-day window, I'd need to think again. Also, for a binned approach, you might need to subtract an offset before doing the %/% if you want to choose the day of the week the groups start at.
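For example (a sketch; the offset of 4 below is only illustrative and assumes you want Monday-based bins):

# as.integer() on a Date gives days since 1970-01-01, which was a Thursday,
# so subtracting 4 shifts the 7-day bins to start on Mondays
offset <- 4
d2$day7 <- (as.integer(d2$date) - offset) %/% 7
d2[ , list(.N, sum(val)), by = c("cat1", "day7")]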