calculate and plot time interval means - r

I would like to calculate and plot changing numbers of differently colored animals over time using dplyr and ggplot2.
I have observations of different animals on random dates, so I would first like to group those observations into 4-day brackets and then calculate the mean color for each bracket. I created the column Bracket.mean with dummy results for the first few rows just to show what I have in mind. If possible, I would like to add those means to the same data frame (as opposed to creating a new data.frame or vectors) for later analysis and plotting.
For the plot I'm hoping to show the bracket means with some measure of variance around them (SD or boxplots), as well as the daily observations over time (perhaps as a faded overlay in the background).
Below is part of the dataset I'm using (with the made-up 'Bracket.mean' column I'm hoping to calculate). 'Count' is the number of animals of a specific 'Color' on a given 'Date'.
Date Julian Count Color Bracket.mean
4/19/16 110 1 50 mean of 4/19-4/22
4/19/16 110 1 50 mean of 4/19-4/22
4/19/16 110 1 100 mean of 4/19-4/22
4/20/16 111 4 50 mean of 4/19-4/22
4/20/16 111 1 0 mean of 4/19-4/22
4/20/16 111 2 100 mean of 4/19-4/22
4/20/16 111 1 50 mean of 4/19-4/22
4/20/16 111 2 100 mean of 4/19-4/22
4/21/16 112 1 100 mean of 4/19-4/22
4/21/16 112 2 50 mean of 4/19-4/22
4/21/16 112 4 50 mean of 4/19-4/22
4/21/16 112 1 100 mean of 4/19-4/22
4/21/16 112 2 50 mean of 4/19-4/22
4/21/16 112 1 0 mean of 4/19-4/22
4/22/16 113 2 0 mean of 4/19-4/22
4/22/16 113 4 50 mean of 4/23-4/26
4/23/16 114 6 0 mean of 4/23-4/26
4/23/16 114 1 50 mean of 4/23-4/26
4/24/16 115 2 0 mean of 4/23-4/26
4/26/16 117 5 0 mean of 4/23-4/26
4/30/16 121 1 50
5/2/16 123 1 NA
5/2/16 123 1 50
5/7/16 128 2 0
5/7/16 128 3 0
5/7/16 128 3 0
5/8/16 129 4 0
5/8/16 129 1 0
5/10/16 131 1 50
5/10/16 131 4 50
5/12/16 133 1 0
5/13/16 134 1 50
5/14/16 135 1 0
5/14/16 135 2 50
5/14/16 135 2 0
5/14/16 135 1 0
5/17/16 138 1 0
5/17/16 138 2 0
5/23/16 144 1 0
5/24/16 145 4 0
5/24/16 145 1 0
5/24/16 145 1 0
5/27/16 148 3 NA
5/27/16 148 1 0
5/27/16 148 1 50
Any help would be greatly appreciated. Thanks very much in advance!

Something like this should get you started.
library(dplyr)
df <- df %>%
  mutate(Date = as.Date(Date, format = '%m/%d/%y'),
         Start = as.Date(cut(Date, breaks = seq(min(Date), max(Date) + 4, by = 4)))) %>%
  mutate(End = Start + 3) %>%
  group_by(Start, End) %>%
  summarise(meanColor = mean(Color, na.rm = TRUE),
            sdColor = sd(Color, na.rm = TRUE))
df
#Source: local data frame [10 x 4]
#Groups: Start [?]
# Start End meanColor sdColor
# <date> <date> <dbl> <dbl>
#1 2016-04-19 2016-04-22 56.25000 35.93976
#2 2016-04-23 2016-04-26 12.50000 25.00000
#3 2016-04-27 2016-04-30 50.00000 NA
#4 2016-05-01 2016-05-04 50.00000 NA
#5 2016-05-05 2016-05-08 0.00000 0.00000
#6 2016-05-09 2016-05-12 33.33333 28.86751
#7 2016-05-13 2016-05-16 20.00000 27.38613
#8 2016-05-17 2016-05-20 0.00000 0.00000
#9 2016-05-21 2016-05-24 0.00000 0.00000
#10 2016-05-25 2016-05-28 25.00000 35.35534
Then plot using:
library(ggplot2)
ggplot(df) + geom_line(aes(Start, meanColor))
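To also get the variance band and the faded daily observations the question asks for, one option is to keep the raw observations in a separate data frame before summarising (here called obs, with Date already converted by as.Date(); the obs name is an assumption) and layer them behind the bracket means. A sketch:
library(ggplot2)
# faded daily observations in the background, bracket means +/- 1 SD on top
ggplot() +
  geom_point(data = obs, aes(Date, Color), alpha = 0.2) +
  geom_errorbar(data = df, aes(x = Start, ymin = meanColor - sdColor,
                               ymax = meanColor + sdColor), width = 1) +
  geom_line(data = df, aes(Start, meanColor)) +
  geom_point(data = df, aes(Start, meanColor))
Brackets with a single observation have sdColor = NA and will simply show no error bar.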

R - more effective left_join [duplicate]

I have two data frames: one contains names and ranges of limits (only a few hundred rows, 1,000 at most), which need to be assigned to a "measurements" data frame that can consist of millions of rows (or tens of millions).
Currently I am doing a left_join and then filtering to get the specific limit assigned to each measurement. This, however, is quite inefficient and costs a lot of resources; for larger data frames, the code is unable to run at all.
Any ideas for more effective solutions would be helpful.
library(dplyr)
## this one has only a few hundred rows
df_limits <- read.table(text="Title station_id limit_from limit_to
Level_3_Low 1 0 70
Level_2_Low 1 70 90
Level_1_Low 1 90 100
Optimal 1 100 110
Level_1_High 1 110 130
Level_2_High 1 130 150
Level_3_High 1 150 180
Level_3_Low 2 0 70
Level_2_Low 2 70 90
Level_1_Low 2 90 100
Optimal 2 100 110
Level_1_High 2 110 130
Level_2_High 2 130 150
Level_3_High 2 150 180
Level_3_Low 3 0 70
Level_2_Low 3 70 90
Level_1_Low 3 90 100
Optimal 3 100 110
Level_1_High 3 110 130
Level_2_High 3 130 150
Level_3_High 3 150 180
",header = TRUE, stringsAsFactors = TRUE)
# this DF has got millions of rows
df_measurements <- read.table(text="measurement_id station_id value
12121534 1 172
12121618 1 87
12121703 1 9
12121709 2 80
12121760 2 80
12121813 2 115
12121881 3 67
12121907 3 100
12121920 3 108
12121979 1 102
12121995 1 53
12122022 1 77
12122065 2 158
12122107 2 144
12122113 2 5
12122135 3 100
12122187 3 136
12122267 3 130
12122359 1 105
12122366 1 126
12122398 1 143
",header = TRUE, stringsAsFactors = TRUE)
df_results <- left_join(df_measurements, df_limits, by = "station_id") %>%
  filter((value >= limit_from & value < limit_to) | is.na(Title)) %>%
  select(names(df_measurements), Title)
A data.table solution using non-equi joins:
library(data.table)
setDT(df_measurements)
setDT(df_limits)
df_limits[df_measurements, .(station_id, measurement_id, value, Title),
          on = .(station_id = station_id, limit_from < value, limit_to >= value)]
station_id measurement_id value Title
1: 1 12121534 172 Level_3_High
2: 1 12121618 87 Level_2_Low
3: 1 12121703 9 Level_3_Low
4: 2 12121709 80 Level_2_Low
5: 2 12121760 80 Level_2_Low
6: 2 12121813 115 Level_1_High
7: 3 12121881 67 Level_3_Low
8: 3 12121907 100 Level_1_Low
9: 3 12121920 108 Optimal
10: 1 12121979 102 Optimal
11: 1 12121995 53 Level_3_Low
12: 1 12122022 77 Level_2_Low
13: 2 12122065 158 Level_3_High
14: 2 12122107 144 Level_2_High
15: 2 12122113 5 Level_3_Low
16: 3 12122135 100 Level_1_Low
17: 3 12122187 136 Level_2_High
18: 3 12122267 130 Level_1_High
19: 1 12122359 105 Optimal
20: 1 12122366 126 Level_1_High
21: 1 12122398 143 Level_2_High
A simple base R option (no additional packages needed) using subset + merge. Note that merge(..., all = TRUE) materializes the full join on station_id before filtering, so memory can become the bottleneck on very large inputs:
subset(
merge(
df_measurements,
df_limits,
all = TRUE
),
limit_from < value & limit_to >= value
)
gives
station_id measurement_id value Title limit_from limit_to
7 1 12121534 172 Level_3_High 150 180
9 1 12121618 87 Level_2_Low 70 90
15 1 12121703 9 Level_3_Low 0 70
23 1 12122022 77 Level_2_Low 70 90
34 1 12122398 143 Level_2_High 130 150
39 1 12121979 102 Optimal 100 110
43 1 12121995 53 Level_3_Low 0 70
54 1 12122366 126 Level_1_High 110 130
60 1 12122359 105 Optimal 100 110
65 2 12121760 80 Level_2_Low 70 90
75 2 12121813 115 Level_1_High 110 130
79 2 12121709 80 Level_2_Low 70 90
91 2 12122065 158 Level_3_High 150 180
97 2 12122107 144 Level_2_High 130 150
99 2 12122113 5 Level_3_Low 0 70
108 3 12121907 100 Level_1_Low 90 100
116 3 12121920 108 Optimal 100 110
124 3 12122267 130 Level_1_High 110 130
127 3 12121881 67 Level_3_Low 0 70
136 3 12122135 100 Level_1_Low 90 100
146 3 12122187 136 Level_2_High 130 150
Another option is using dplyr with findInterval():
df_measurements %>%
group_by(station_id) %>%
mutate(Title = with(
df_limits,
Title[
findInterval(
value,
unique(unlist(cbind(limit_from, limit_to)[station_id == first(.$station_id)])),
left.open = TRUE
)
]
)) %>%
ungroup()
which gives
# A tibble: 21 x 4
measurement_id station_id value Title
<int> <int> <int> <fct>
1 12121534 1 172 Level_3_High
2 12121618 1 87 Level_2_Low
3 12121703 1 9 Level_3_Low
4 12121709 2 80 Level_2_Low
5 12121760 2 80 Level_2_Low
6 12121813 2 115 Level_1_High
7 12121881 3 67 Level_3_Low
8 12121907 3 100 Level_1_Low
9 12121920 3 108 Optimal
10 12121979 1 102 Optimal
# ... with 11 more rows
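For what it's worth, dplyr (>= 1.1.0) also supports non-equi joins directly through join_by(), which would replace the original left_join + filter in one step. This is not part of the original answers; a sketch:
library(dplyr)  # >= 1.1.0 for join_by()
df_results2 <- df_measurements %>%
  left_join(df_limits,
            by = join_by(station_id, value >= limit_from, value < limit_to)) %>%
  select(all_of(names(df_measurements)), Title)
Unmatched measurements keep Title = NA, matching the intent of the original filter.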
Benchmarking
f_TIC1 <- function() {
subset(
merge(
df_measurements,
df_limits,
all = TRUE
),
limit_from < value & limit_to >= value
)
}
f_TIC2 <- function() {
df_measurements %>%
group_by(station_id) %>%
mutate(Title = with(
df_limits,
Title[
findInterval(
value,
unique(unlist(cbind(limit_from, limit_to)[station_id == first(station_id)])),
left.open = TRUE
)
]
)) %>%
ungroup()
}
dt_limits <- as.data.table(df_limits)
dt_measurements <- as.data.table(df_measurements)
f_Waldi <- function() {
dt_limits[
dt_measurements,
.(station_id, measurement_id, value, Title),
on = .(station_id, limit_from < value, limit_to >= value)
]
}
f_TimTeaFan <- function() {
  setkey(dt_limits, station_id, limit_from, limit_to)
  foverlaps(dt_measurements[, value2 := value],
            dt_limits,
            by.x = c("station_id", "value", "value2"),
            type = "within")[
              value < limit_to,
              .(measurement_id, station_id, value, Title)
            ]
}
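The relative timings below were presumably produced with microbenchmark; the exact call is an assumption, since it was not shown. A sketch:
library(microbenchmark)
bm <- microbenchmark(f_TIC1(), f_TIC2(), f_Waldi(), f_TimTeaFan(), times = 100)
print(bm, unit = "relative")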
you will see:
Unit: relative
expr min lq mean median uq max neval
f_TIC1() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100
f_TIC2() 4.848639 4.909985 4.895588 4.942616 5.124704 2.580819 100
f_Waldi() 3.182027 3.010615 3.069916 3.114160 3.397845 1.698386 100
f_TimTeaFan() 5.523778 5.112872 5.226145 5.112407 5.745671 2.446987 100
Here is one way to do it with foverlaps(). The problematic part was the condition value < limit_to: foverlaps() checks for value <= limit_to, which results in double matches, so here we apply the filter condition after the overlap join and then select the desired columns. Note that the result is not in the same order as the df_results generated with dplyr.
library(data.table)
dt_limits <- as.data.table(df_limits)
dt_measurements <- as.data.table(df_measurements)
setkey(dt_limits, station_id, limit_from, limit_to)
dt_results <- foverlaps(dt_measurements[, value2 := value],
                        dt_limits,
                        by.x = c("station_id", "value", "value2"),
                        type = "within")[value < limit_to,
                                         .(measurement_id, station_id, value, Title)]
dt_results[]
#> measurement_id station_id value Title
#> 1: 12121534 1 172 Level_3_High
#> 2: 12121618 1 87 Level_2_Low
#> 3: 12121703 1 9 Level_3_Low
#> 4: 12121709 2 80 Level_2_Low
#> 5: 12121760 2 80 Level_2_Low
#> 6: 12121813 2 115 Level_1_High
#> 7: 12121881 3 67 Level_3_Low
#> 8: 12121907 3 100 Optimal
#> 9: 12121920 3 108 Optimal
#> 10: 12121979 1 102 Optimal
#> 11: 12121995 1 53 Level_3_Low
#> 12: 12122022 1 77 Level_2_Low
#> 13: 12122065 2 158 Level_3_High
#> 14: 12122107 2 144 Level_2_High
#> 15: 12122113 2 5 Level_3_Low
#> 16: 12122135 3 100 Optimal
#> 17: 12122187 3 136 Level_2_High
#> 18: 12122267 3 130 Level_2_High
#> 19: 12122359 1 105 Optimal
#> 20: 12122366 1 126 Level_1_High
#> 21: 12122398 1 143 Level_2_High
#> measurement_id station_id value Title
Created on 2021-08-09 by the reprex package (v0.3.0)

Turning data long to wide with repeating values

fill W id X T
1 403 29730 100 111
1 8395 10766 100 92
1 4170 14291 100 98
1 2768 20506 200 110
1 3581 15603 100 112
6 1 10504 200 87
9 48 29730 100 89
1 4790 10766 200 80
This is a slightly modified random sample from my actual data. I'd like:
id X T 403 8395 ....
29730 100 111 1
10766 100 92 1
14291 100 98
20506 200 110
15603 100 112
10504 200 87
29730 100 89
10766 200 80
Notice ID 29730 appears at both T 89 and T 111. I think this should just be reshape2::dcast; however,
data_wide <- reshape2::dcast(data_long, id + T + X ~ W, value.var = "fill")
gives an illogical result. Is there generally a way to keep the same ID at T1 and T2 while casting a data frame?
If I understand correctly, this is not a trivial long-to-wide reshape, considering the OP's requirements:
The row order must be maintained.
The columns must be ordered by first appearance of W.
Missing entries should appear blank rather than NA.
This requires
adding a row number to be included in the reshape formula,
turning W into a factor whose levels are ordered by appearance, e.g., using forcats::fct_inorder(),
using an aggregation function which turns NA into "", e.g., toString(),
and removing the row numbers from the reshaped result.
Here, the data.table implementation of dcast() is used as data.table appears a bit more convenient, IMHO.
library(data.table)
dcast(setDT(data_long)[, rn := .I],
      rn + id + T + X ~ forcats::fct_inorder(factor(W)),
      toString, value.var = "fill")[, rn := NULL][]
id T X 403 8395 4170 2768 3581 1 48 4790
1: 29730 111 100 1
2: 10766 92 100 1
3: 14291 98 100 1
4: 20506 110 200 1
5: 15603 112 100 1
6: 10504 87 200 6
7: 29730 89 100 9
8: 10766 80 200 1
Data
library(data.table)
data_long <- fread(" fill W id X T
1 403 29730 100 111
1 8395 10766 100 92
1 4170 14291 100 98
1 2768 20506 200 110
1 3581 15603 100 112
6 1 10504 200 87
9 48 29730 100 89
1 4790 10766 200 80")
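For completeness, the same reshape can be sketched with tidyr's pivot_wider() (tidyr >= 1.1 assumed; this mirrors the data.table answer rather than coming from it):
library(dplyr)
library(tidyr)
data_long %>%
  mutate(rn = row_number(),                    # keep the row order
         W = forcats::fct_inorder(factor(W)),  # columns in order of appearance
         fill = as.character(fill)) %>%        # so blanks can be "" instead of NA
  pivot_wider(names_from = W, values_from = fill, values_fill = "") %>%
  select(-rn)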

Stacking time series data vertically

I am struggling with manipulation of time series data. The first column of the dataset contains the time points of data collection; the second column onwards contains data from different studies. I have several hundred studies. As an example, I have included sample data for 5 studies. I want to stack the dataset vertically, with time and data points for each study. The example dataset looks like the data provided below:
TIME Study1 Study2 Study3 Study4 Study5
0.00 52.12 53.66 52.03 50.36 51.34
90.00 49.49 51.71 49.49 48.48 50.19
180.00 47.00 49.83 47.07 46.67 49.05
270.00 44.63 48.02 44.77 44.93 47.95
360.00 42.38 46.28 42.59 43.25 46.87
450.00 40.24 44.60 40.50 41.64 45.81
540.00 38.21 42.98 38.53 40.08 44.78
I am looking for an output in the form of:
TIME Study ID
0 52.12 1
90 49.49 1
180 47 1
270 44.63 1
360 42.38 1
450 40.24 1
540 38.21 1
0 53.66 2
90 51.71 2
180 49.83 2
270 48.02 2
360 46.28 2
450 44.6 2
540 42.98 2
0 52.03 3
90 49.49 3
180 47.07 3
270 44.77 3
...
This is a classic 'wide to long' dataset manipulation. Below, I show the use of the base function ?reshape for your data:
d.l <- reshape(d, varying=list(c("Study1","Study2","Study3","Study4","Study5")),
v.names="Y", idvar="TIME", times=1:5, timevar="Study",
direction="long")
d.l <- d.l[,c(2,1,3)]
rownames(d.l) <- NULL
d.l
# Study TIME Y
# 1 1 0 52.12
# 2 1 90 49.49
# 3 1 180 47.00
# 4 1 270 44.63
# 5 1 360 42.38
# 6 1 450 40.24
# 7 1 540 38.21
# 8 2 0 53.66
# 9 2 90 51.71
# 10 2 180 49.83
# 11 2 270 48.02
# 12 2 360 46.28
# 13 2 450 44.60
# 14 2 540 42.98
# 15 3 0 52.03
# 16 3 90 49.49
# 17 3 180 47.07
# ...
However, there are many ways to do this in R: the most basic reference on SO (of which this is probably a duplicate) is Reshaping data.frame from wide to long format, but there are many other relevant threads (see this search: [r] wide to long). Beyond using reshape, #lmo's method can be used, as well as methods based on the reshape2, tidyr, and data.table packages (presumably among others).
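As an illustration of the tidyr route mentioned above, a minimal sketch (tidyr >= 1.1 assumed, with the wide data frame named d as in the reshape() answer):
library(tidyr)
d_long <- pivot_longer(d, cols = -TIME, names_to = "Study", values_to = "Y",
                       names_prefix = "Study",
                       names_transform = list(Study = as.integer))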
Here is one method using cbind and stack:
longdf <- cbind(df$TIME, stack(df[,-1]))
names(longdf) <- c("TIME", "Study", "id")
This returns
longdf
TIME Study id
1 0 52.12 Study1
2 90 49.49 Study1
3 180 47.00 Study1
4 270 44.63 Study1
5 360 42.38 Study1
6 450 40.24 Study1
7 540 38.21 Study1
8 0 53.66 Study2
9 90 51.71 Study2
...
If you want to change id to integers as in your example, use
longdf$id <- as.integer(longdf$id)
(as.integer() on a factor returns the underlying level codes, which here run from 1 for Study1 to 5 for Study5).

Calculate mean of respective column values based on condition

I have a data.frame named sampleframe where I have stored all the table values. Inside sampleframe I have columns id, month, sold.
id month SMarch SJanFeb churn
101 1 0.00 0.00 1
101 2 0.00 0.00 1
101 3 0.00 0.00 1
108 2 0.00 6.00 1
103 2 0.00 10.00 1
160 1 0.00 2.00 1
160 2 0.00 3.00 1
160 3 0.50 0.00 0
164 1 0.00 3.00 1
164 2 0.00 6.00 1
I would like to calculate the average sold for the last three months based on ID. If it is month 3, it has to consider the average sold for the last two months based on ID; if it is month 2, it has to consider the average sold for one month based on ID, and so on for all months.
I have used ifelse and mean to achieve this, but some rows are missing when I try to use it for all months.
The query that I have used for execution:
sampleframe$Churn <- ifelse(sampleframe$Month==4|sampleframe$Month==5|sampleframe$Month==6, ifelse(sampleframe$Sold<0.7*mean(sampleframe$Sold[sampleframe$ID[sampleframe$Month==-1&sampleframe$Month==-2&sampleframe$Month==-3]]),1,0),0)
According to the logic of the query, it should compare the current sold value with 70% of the average sold of the previous months and return 1 if the current value is higher, else 0.
The expected output is not clear. Based on the description of calculating the average 'sold' over each 3 months, grouped by 'id', we can use roll_mean from library(RcppRoll). We convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'id', if the number of rows is greater than 1, we take the roll_mean with n = 3 and concatenate it with the averages of the first one and two observations; otherwise (a single observation) we keep the value itself.
library(RcppRoll)
library(data.table)
k <- 3
setDT(df1)[, soldAvg := if (.N > 1) c(cumsum(sold[1:(k-1)]) / 1:(k-1),
                                      roll_mean(sold, n = k, align = 'right'))
                        else as.numeric(sold), id]
df1
# id month sold soldAvg
#1: 101 1 124 124.0000
#2: 101 2 211 167.5000
#3: 104 3 332 332.0000
#4: 105 4 124 124.0000
#5: 101 5 211 182.0000
#6: 101 6 332 251.3333
#7: 101 7 124 222.3333
#8: 101 8 211 222.3333
#9: 101 9 332 222.3333
#10: 102 10 124 124.0000
#11: 102 12 211 167.5000
#12: 104 3 332 332.0000
#13: 105 4 124 124.0000
#14: 102 5 211 182.0000
#15: 102 6 332 251.3333
#16: 106 7 124 124.0000
#17: 107 8 211 211.0000
#18: 102 9 332 291.6667
#19: 103 11 124 124.0000
#20: 103 2 211 167.5000
#21: 108 3 332 332.0000
#22: 108 4 124 228.0000
#23: 109 5 211 211.0000
#24: 103 6 332 222.3333
#25: 104 7 124 262.6667
#26: 105 8 211 153.0000
#27: 103 10 332 291.6667
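For comparison, the same trailing mean (expanding over the first k - 1 observations) can be written with zoo::rollapplyr and partial = TRUE; a sketch, assuming rows within each id are already in chronological order:
library(dplyr)
library(zoo)
df1 %>%
  group_by(id) %>%
  mutate(soldAvg = rollapplyr(sold, width = 3, FUN = mean, partial = TRUE)) %>%
  ungroup()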
The above question can also be solved using library(dplyr), with this query:
resultData <- group_by(data, KId) %>%
arrange(sales_month) %>%
mutate(monthMinus1Qty = lag(quantity_sold,1), monthMinus2Qty = lag(quantity_sold, 2)) %>%
group_by(KId, sales_month) %>%
mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
mutate(result = ifelse(quantity_sold/previous2MonthsQty >= 0.6,0,1)) %>%
select(KId,sales_month, quantity_sold, result)
See the linked answer for the full solution and output.

R: Error while calculating Rolling Median and Rolling Mean

I am trying to calculate 3 period rolling means and rolling medians for the following data:
SiteID Month TotalSessions TotalMinutes
1 201401 132 1334
1 201402 159 2498
1 201403 98 734
1 201404 112 909
2 201402 25 220
2 201404 32 407
4 201401 10 77
4 201402 12 112
4 201403 9 59
However, I am getting an error when I use the following function:
ave(mydf$TotalSessions, mydf$SiteID, FUN = function(x) rollmedian(x,k=3, align = "right", na.pad = T))
Error: k <= n is not TRUE
I understand that the error is because that for some SiteIDs there are less than 3 periods of data and hence the rolling median is not getting calculated.
My question is: is there a way to add the missing months with 0s in TotalSessions and TotalMinutes, so that the data would look as follows:
SiteID Month TotalSessions TotalMinutes
1 201401 132 1334
1 201402 159 2498
1 201403 98 734
1 201404 112 909
2 201401 0 0
2 201402 25 220
2 201403 0 0
2 201404 32 407
4 201401 10 77
4 201402 12 112
4 201403 9 59
4 201404 0 0
Thanks for the help!
Personally I would use one of the solutions proposed in the other answer or in the comments.
Here is a way to modify your data by adding 0s for the missing months (the desired output), mainly using the merge function.
xx <- data.frame(Month = unique(dat$Month))
res <- do.call(rbind,
               by(dat, dat$SiteID, function(x) merge(x, xx, all.y = TRUE)))
res[is.na(res)] <- 0
# Month SiteID TotalSessions TotalMinutes
# 1.1 201401 1 132 1334
# 1.2 201402 1 159 2498
# 1.3 201403 1 98 734
# 1.4 201404 1 112 909
# 2.1 201401 0 0 0
# 2.2 201402 2 25 220
# 2.3 201403 0 0 0
# 2.4 201404 2 32 407
# 4.1 201401 4 10 77
# 4.2 201402 4 12 112
# 4.3 201403 4 9 59
# 4.4 201404 0 0 0
Padding with NAs would be better, but even better than that is rollapply with partial = TRUE:
library(zoo)
ave(mydf$TotalSessions, mydf$SiteID,
    FUN = function(x) rollapply(x, 3, median, align = "right", partial = TRUE))
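If you prefer tidyverse tools for the padding step itself, tidyr::complete() is a compact alternative (a sketch, assuming the data frame is named mydf as in the question):
library(dplyr)
library(tidyr)
mydf_padded <- mydf %>%
  complete(SiteID, Month, fill = list(TotalSessions = 0, TotalMinutes = 0)) %>%
  arrange(SiteID, Month)
Unlike the merge() approach above, this keeps SiteID intact in the padded rows.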
