I am trying to calculate 3 period rolling means and rolling medians for the following data:
SiteID Month TotalSessions TotalMinutes
1 201401 132 1334
1 201402 159 2498
1 201403 98 734
1 201404 112 909
2 201402 25 220
2 201404 32 407
4 201401 10 77
4 201402 12 112
4 201403 9 59
However, I am getting an error when I use the following function:
ave(mydf$TotalSessions, mydf$SiteID, FUN = function(x) rollmedian(x,k=3, align = "right", na.pad = T))
Error: k <= n is not TRUE
I understand that the error occurs because some SiteIDs have fewer than 3 periods of data, so the rolling median cannot be calculated for them.
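A minimal reproduction of the error (a sketch, assuming zoo is loaded; the vector mimics a SiteID with only two months of data):
library(zoo)
x <- c(25, 32)   # e.g. SiteID 2 only has two observations
rollmedian(x, k = 3, align = "right", na.pad = TRUE)
# Error: k <= n is not TRUE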
My question is: is there a way to add the missing months with 0s in TotalSessions and TotalMinutes, so that the data would look as follows:
SiteID Month TotalSessions TotalMinutes
1 201401 132 1334
1 201402 159 2498
1 201403 98 734
1 201404 112 909
2 201401 0 0
2 201402 25 220
2 201403 0 0
2 201404 32 407
4 201401 10 77
4 201402 12 112
4 201403 9 59
4 201404 0 0
Thanks for the help!
Personally, I would use one of the solutions proposed in the other answer or in the comments.
Here is an answer that modifies your data by adding 0 for the missing months (the desired output). I mainly use the merge function.
xx <- data.frame(Month=unique(dat$Month))
res <- do.call(rbind,
by(dat,dat$SiteID,function(x)merge(x,xx,all.y=TRUE)))
res[is.na(res)] <- 0
# Month SiteID TotalSessions TotalMinutes
# 1.1 201401 1 132 1334
# 1.2 201402 1 159 2498
# 1.3 201403 1 98 734
# 1.4 201404 1 112 909
# 2.1 201401 0 0 0
# 2.2 201402 2 25 220
# 2.3 201403 0 0 0
# 2.4 201404 2 32 407
# 4.1 201401 4 10 77
# 4.2 201402 4 12 112
# 4.3 201403 4 9 59
# 4.4 201404 0 0 0
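As an alternative to the merge approach (a sketch, not part of the original answer, assuming the tidyr package is available), tidyr::complete can pad the missing SiteID/Month combinations in one step:
library(tidyr)
# create every SiteID x Month combination, filling the count columns with 0
res2 <- complete(dat, SiteID, Month,
                 fill = list(TotalSessions = 0, TotalMinutes = 0))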
Padding with NAs would be better, but even better than that is rollapply with partial = TRUE:
ave(mydf$TotalSessions, mydf$SiteID,
    FUN = function(x) rollapply(x, 3, median, align = "right", partial = TRUE))
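If NA padding is preferred instead, rollapply can fill the incomplete leading windows rather than shrinking them (a sketch using the same zoo function):
ave(mydf$TotalSessions, mydf$SiteID,
    FUN = function(x) rollapply(x, 3, median, align = "right", fill = NA))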
I have two dataframes: one containing names and ranges of limits (only a few hundred rows, 1000 at most), which need to be assigned to a "measurements" dataframe that can consist of millions of rows (or tens of millions).
Currently I am doing a left_join and then filtering on value to assign a specific limit to each measurement. This, however, is quite inefficient and costs a lot of resources; for larger dataframes the code is not even able to run.
Any ideas for a more efficient solution would be helpful.
library(dplyr)
## this one has only a few hundred rows
df_limits <- read.table(text="Title station_id limit_from limit_to
Level_3_Low 1 0 70
Level_2_Low 1 70 90
Level_1_Low 1 90 100
Optimal 1 100 110
Level_1_High 1 110 130
Level_2_High 1 130 150
Level_3_High 1 150 180
Level_3_Low 2 0 70
Level_2_Low 2 70 90
Level_1_Low 2 90 100
Optimal 2 100 110
Level_1_High 2 110 130
Level_2_High 2 130 150
Level_3_High 2 150 180
Level_3_Low 3 0 70
Level_2_Low 3 70 90
Level_1_Low 3 90 100
Optimal 3 100 110
Level_1_High 3 110 130
Level_2_High 3 130 150
Level_3_High 3 150 180
",header = TRUE, stringsAsFactors = TRUE)
# this DF has got millions of rows
df_measurements <- read.table(text="measurement_id station_id value
12121534 1 172
12121618 1 87
12121703 1 9
12121709 2 80
12121760 2 80
12121813 2 115
12121881 3 67
12121907 3 100
12121920 3 108
12121979 1 102
12121995 1 53
12122022 1 77
12122065 2 158
12122107 2 144
12122113 2 5
12122135 3 100
12122187 3 136
12122267 3 130
12122359 1 105
12122366 1 126
12122398 1 143
",header = TRUE, stringsAsFactors = TRUE)
df_results <- left_join(df_measurements,df_limits, by = "station_id") %>%
filter ((value >= limit_from & value < limit_to) | is.na(Title)) %>%
select(names(df_measurements), Title)
Another data.table solution, using non-equi joins:
library(data.table)
setDT(df_measurements)
setDT(df_limits)
df_limits[df_measurements, .(station_id, measurement_id, value, Title),
on=.(station_id = station_id, limit_from < value, limit_to >= value)]
station_id measurement_id value Title
1: 1 12121534 172 Level_3_High
2: 1 12121618 87 Level_2_Low
3: 1 12121703 9 Level_3_Low
4: 2 12121709 80 Level_2_Low
5: 2 12121760 80 Level_2_Low
6: 2 12121813 115 Level_1_High
7: 3 12121881 67 Level_3_Low
8: 3 12121907 100 Level_1_Low
9: 3 12121920 108 Optimal
10: 1 12121979 102 Optimal
11: 1 12121995 53 Level_3_Low
12: 1 12122022 77 Level_2_Low
13: 2 12122065 158 Level_3_High
14: 2 12122107 144 Level_2_High
15: 2 12122113 5 Level_3_Low
16: 3 12122135 100 Level_1_Low
17: 3 12122187 136 Level_2_High
18: 3 12122267 130 Level_1_High
19: 1 12122359 105 Optimal
20: 1 12122366 126 Level_1_High
21: 1 12122398 143 Level_2_High
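As a side note (a sketch, not part of the original answer): if the only goal is to attach Title to the measurements table, data.table can also do this as an update join by reference, which avoids materialising a new table:
# for each measurement, look up the matching limit row and copy its Title
# (same boundaries as above: limit_from < value <= limit_to)
df_measurements[df_limits,
                Title := i.Title,
                on = .(station_id, value > limit_from, value <= limit_to)]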
A simple base R option (no additional packages needed) using subset + merge:
subset(
merge(
df_measurements,
df_limits,
all = TRUE
),
limit_from < value & limit_to >= value
)
gives
station_id measurement_id value Title limit_from limit_to
7 1 12121534 172 Level_3_High 150 180
9 1 12121618 87 Level_2_Low 70 90
15 1 12121703 9 Level_3_Low 0 70
23 1 12122022 77 Level_2_Low 70 90
34 1 12122398 143 Level_2_High 130 150
39 1 12121979 102 Optimal 100 110
43 1 12121995 53 Level_3_Low 0 70
54 1 12122366 126 Level_1_High 110 130
60 1 12122359 105 Optimal 100 110
65 2 12121760 80 Level_2_Low 70 90
75 2 12121813 115 Level_1_High 110 130
79 2 12121709 80 Level_2_Low 70 90
91 2 12122065 158 Level_3_High 150 180
97 2 12122107 144 Level_2_High 130 150
99 2 12122113 5 Level_3_Low 0 70
108 3 12121907 100 Level_1_Low 90 100
116 3 12121920 108 Optimal 100 110
124 3 12122267 130 Level_1_High 110 130
127 3 12121881 67 Level_3_Low 0 70
136 3 12122135 100 Level_1_Low 90 100
146 3 12122187 136 Level_2_High 130 150
Another option is using dplyr
df_measurements %>%
group_by(station_id) %>%
mutate(Title = with(
df_limits,
Title[
findInterval(
value,
unique(unlist(cbind(limit_from, limit_to)[station_id == first(.$station_id)])),
left.open = TRUE
)
]
)) %>%
ungroup()
which gives
# A tibble: 21 x 4
measurement_id station_id value Title
<int> <int> <int> <fct>
1 12121534 1 172 Level_3_High
2 12121618 1 87 Level_2_Low
3 12121703 1 9 Level_3_Low
4 12121709 2 80 Level_2_Low
5 12121760 2 80 Level_2_Low
6 12121813 2 115 Level_1_High
7 12121881 3 67 Level_3_Low
8 12121907 3 100 Level_1_Low
9 12121920 3 108 Optimal
10 12121979 1 102 Optimal
# ... with 11 more rows
Benchmarking
f_TIC1 <- function() {
subset(
merge(
df_measurements,
df_limits,
all = TRUE
),
limit_from < value & limit_to >= value
)
}
f_TIC2 <- function() {
df_measurements %>%
group_by(station_id) %>%
mutate(Title = with(
df_limits,
Title[
findInterval(
value,
unique(unlist(cbind(limit_from, limit_to)[station_id == first(station_id)])),
left.open = TRUE
)
]
)) %>%
ungroup()
}
dt_limits <- as.data.table(df_limits)
dt_measurements <- as.data.table(df_measurements)
f_Waldi <- function() {
dt_limits[
dt_measurements,
.(station_id, measurement_id, value, Title),
on = .(station_id, limit_from < value, limit_to >= value)
]
}
f_TimTeaFan <- function() {
setkey(dt_limits, station_id, limit_from, limit_to)
foverlaps(dt_measurements[, value2 := value],
dt_limits,
by.x = c("station_id", "value", "value2"),
type = "within",
)[
value < limit_to,
.(measurement_id, station_id, value, Title)
]
}
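The call producing the timings below is not shown; it was presumably something along these lines (a sketch, assuming the microbenchmark package and relative units):
library(microbenchmark)
bm <- microbenchmark(
  f_TIC1(),
  f_TIC2(),
  f_Waldi(),
  f_TimTeaFan(),
  times = 100
)
summary(bm, unit = "relative")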
and you will see:
Unit: relative
expr min lq mean median uq max neval
f_TIC1() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100
f_TIC2() 4.848639 4.909985 4.895588 4.942616 5.124704 2.580819 100
f_Waldi() 3.182027 3.010615 3.069916 3.114160 3.397845 1.698386 100
f_TimTeaFan() 5.523778 5.112872 5.226145 5.112407 5.745671 2.446987 100
Here is one way to do it. The problematic part is the condition value < limit_to: foverlaps checks value <= limit_to, which results in double matches, so we apply the filter condition after the overlap join and then select the desired columns. Note that the result is not in the same order as the df_results generated with dplyr.
library(data.table)
dt_limits <- as.data.table(df_limits)
dt_measurements <- as.data.table(df_measurements)
setkey(dt_limits, station_id, limit_from, limit_to)
dt_results <- foverlaps(dt_measurements[, value2 := value],
dt_limits,
by.x = c("station_id", "value", "value2"),
type = "within",
)[value < limit_to,
.(measurement_id , station_id, value, Title)]
dt_results[]
#> measurement_id station_id value Title
#> 1: 12121534 1 172 Level_3_High
#> 2: 12121618 1 87 Level_2_Low
#> 3: 12121703 1 9 Level_3_Low
#> 4: 12121709 2 80 Level_2_Low
#> 5: 12121760 2 80 Level_2_Low
#> 6: 12121813 2 115 Level_1_High
#> 7: 12121881 3 67 Level_3_Low
#> 8: 12121907 3 100 Optimal
#> 9: 12121920 3 108 Optimal
#> 10: 12121979 1 102 Optimal
#> 11: 12121995 1 53 Level_3_Low
#> 12: 12122022 1 77 Level_2_Low
#> 13: 12122065 2 158 Level_3_High
#> 14: 12122107 2 144 Level_2_High
#> 15: 12122113 2 5 Level_3_Low
#> 16: 12122135 3 100 Optimal
#> 17: 12122187 3 136 Level_2_High
#> 18: 12122267 3 130 Level_2_High
#> 19: 12122359 1 105 Optimal
#> 20: 12122366 1 126 Level_1_High
#> 21: 12122398 1 143 Level_2_High
#> measurement_id station_id value Title
Created on 2021-08-09 by the reprex package (v0.3.0)
I would like to calculate and plot changing numbers of differently colored animals over time using dplyr and ggplot2.
I have observations of different animals on random dates, so I would first like to group those observations into 4-day brackets and then calculate the mean color for each bracket. I created the column Bracket.mean with a mock result for the first few rows just to show what I have in mind. If possible, I would like to add those means to the same data frame (as opposed to creating a new data frame or separate vectors) for later analysis and plotting.
For the plot, I'm hoping to show the bracket means with some measure of variance around them (SD or boxplots), as well as the daily observations over time (perhaps as a faded overlay in the background).
Below is part of the dataset I'm using (with a made-up 'Bracket.mean' column I'm hoping to calculate). 'Count' is the number of animals of a specific 'Color' on a given 'Date'.
Date Julian Count Color Bracket.mean
4/19/16 110 1 50 mean of 4/19-4/22
4/19/16 110 1 50 mean of 4/19-4/22
4/19/16 110 1 100 mean of 4/19-4/22
4/20/16 111 4 50 mean of 4/19-4/22
4/20/16 111 1 0 mean of 4/19-4/22
4/20/16 111 2 100 mean of 4/19-4/22
4/20/16 111 1 50 mean of 4/19-4/22
4/20/16 111 2 100 mean of 4/19-4/22
4/21/16 112 1 100 mean of 4/19-4/22
4/21/16 112 2 50 mean of 4/19-4/22
4/21/16 112 4 50 mean of 4/19-4/22
4/21/16 112 1 100 mean of 4/19-4/22
4/21/16 112 2 50 mean of 4/19-4/22
4/21/16 112 1 0 mean of 4/19-4/22
4/22/16 113 2 0 mean of 4/19-4/22
4/22/16 113 4 50 mean of 4/23-4/26
4/23/16 114 6 0 mean of 4/23-4/26
4/23/16 114 1 50 mean of 4/23-4/26
4/24/16 115 2 0 mean of 4/23-4/26
4/26/16 117 5 0 mean of 4/23-4/26
4/30/16 121 1 50
5/2/16 123 1 NA
5/2/16 123 1 50
5/7/16 128 2 0
5/7/16 128 3 0
5/7/16 128 3 0
5/8/16 129 4 0
5/8/16 129 1 0
5/10/16 131 1 50
5/10/16 131 4 50
5/12/16 133 1 0
5/13/16 134 1 50
5/14/16 135 1 0
5/14/16 135 2 50
5/14/16 135 2 0
5/14/16 135 1 0
5/17/16 138 1 0
5/17/16 138 2 0
5/23/16 144 1 0
5/24/16 145 4 0
5/24/16 145 1 0
5/24/16 145 1 0
5/27/16 148 3 NA
5/27/16 148 1 0
5/27/16 148 1 50
Any help would be greatly appreciated. Thanks very much in advance!
Something like this should get you started.
library(dplyr)
df <- df %>% mutate(Date = as.Date(Date, format='%m/%d/%y'),
Start = as.Date(cut(Date, breaks= seq(min(Date), max(Date)+4, by = 4)))) %>%
mutate(End = Start+3) %>%
group_by(Start,End) %>%
summarise(meanColor = mean(Color, na.rm=T),
sdColor = sd(Color, na.rm=T))
df
#Source: local data frame [10 x 4]
#Groups: Start [?]
# Start End meanColor sdColor
# <date> <date> <dbl> <dbl>
#1 2016-04-19 2016-04-22 56.25000 35.93976
#2 2016-04-23 2016-04-26 12.50000 25.00000
#3 2016-04-27 2016-04-30 50.00000 NA
#4 2016-05-01 2016-05-04 50.00000 NA
#5 2016-05-05 2016-05-08 0.00000 0.00000
#6 2016-05-09 2016-05-12 33.33333 28.86751
#7 2016-05-13 2016-05-16 20.00000 27.38613
#8 2016-05-17 2016-05-20 0.00000 0.00000
#9 2016-05-21 2016-05-24 0.00000 0.00000
#10 2016-05-25 2016-05-28 25.00000 35.35534
Then plot using:
library(ggplot2)
ggplot(df) + geom_line(aes(Start, meanColor))
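The question also asks for a measure of variance around the bracket means; one possibility (a sketch reusing the meanColor and sdColor columns computed above) is a ribbon layer:
# mean color per 4-day bracket with a +/- 1 SD ribbon; brackets with a single
# observation have NA sdColor and are simply dropped from the ribbon
ggplot(df, aes(Start, meanColor)) +
  geom_ribbon(aes(ymin = meanColor - sdColor, ymax = meanColor + sdColor),
              fill = "grey80") +
  geom_line() +
  geom_point()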
I have a data.frame named sampleframe where I have stored all the table values. Inside sampleframe I have columns id, month, sold.
id month SMarch SJanFeb churn
101 1 0.00 0.00 1
101 2 0.00 0.00 1
101 3 0.00 0.00 1
108 2 0.00 6.00 1
103 2 0.00 10.00 1
160 1 0.00 2.00 1
160 2 0.00 3.00 1
160 3 0.50 0.00 0
164 1 0.00 3.00 1
164 2 0.00 6.00 1
I would like to calculate the average sold for the last three months, by ID. If it is month 3, it should consider the average sold for the last two months for that ID; if it is month 2, it should consider the average sold for the last one month; and so on for all months.
I have used ifelse and the mean function to do this, but some rows are missing when I try to apply it to all months.
The query that I have used:
sampleframe$Churn <- ifelse(sampleframe$Month == 4 | sampleframe$Month == 5 | sampleframe$Month == 6,
                            ifelse(sampleframe$Sold < 0.7 * mean(sampleframe$Sold[sampleframe$ID[sampleframe$Month == -1 & sampleframe$Month == -2 & sampleframe$Month == -3]]), 1, 0),
                            0)
To add to that: according to the logic of the query, it should compare against 70% of the previous months' sold value, and if the current value is higher than the previous months' average it should return 1, else 0.
The expected output is not clear. Based on the description of calculating the average 'sold' over each 3 months, grouped by 'id', we can use roll_mean from library(RcppRoll). We convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'id', if the number of rows is greater than 1 we take the roll_mean with n = 3 and concatenate it with the cumulative averages for the first k-1 rows; otherwise (a single observation) we just use the value itself.
library(RcppRoll)
library(data.table)
k <- 3
setDT(df1)[, soldAvg := if(.N>1) c(cumsum(sold[1:(k-1)])/1:(k-1),
roll_mean(sold,n=k, align='right')) else as.numeric(sold), id]
df1
# id month sold soldAvg
#1: 101 1 124 124.0000
#2: 101 2 211 167.5000
#3: 104 3 332 332.0000
#4: 105 4 124 124.0000
#5: 101 5 211 182.0000
#6: 101 6 332 251.3333
#7: 101 7 124 222.3333
#8: 101 8 211 222.3333
#9: 101 9 332 222.3333
#10: 102 10 124 124.0000
#11: 102 12 211 167.5000
#12: 104 3 332 332.0000
#13: 105 4 124 124.0000
#14: 102 5 211 182.0000
#15: 102 6 332 251.3333
#16: 106 7 124 124.0000
#17: 107 8 211 211.0000
#18: 102 9 332 291.6667
#19: 103 11 124 124.0000
#20: 103 2 211 167.5000
#21: 108 3 332 332.0000
#22: 108 4 124 228.0000
#23: 109 5 211 211.0000
#24: 103 6 332 222.3333
#25: 104 7 124 262.6667
#26: 105 8 211 153.0000
#27: 103 10 332 291.6667
The above question can also be solved using library(dplyr); use the following query to get the output:
resultData <- group_by(data, KId) %>%
arrange(sales_month) %>%
mutate(monthMinus1Qty = lag(quantity_sold,1), monthMinus2Qty = lag(quantity_sold, 2)) %>%
group_by(KId, sales_month) %>%
mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
mutate(result = ifelse(quantity_sold/previous2MonthsQty >= 0.6,0,1)) %>%
select(KId,sales_month, quantity_sold, result)
See the linked Answer for the full solution and output.
I have a large data table Divvy (over 2.4 million records) that appears as such (some columns removed):
X trip_id from_station_id.x to_station_id.x
1 1109420 94 69
2 1109421 69 216
3 1109427 240 245
4 1109431 113 94
5 1109433 127 332
3 1109429 240 245
I would like to find the number of trips from each station to each opposing station. So for example,
From X To Y Sum
94 69 1
240 245 2
etc., and then join it back to the initial table using dplyr to make something like the below, and then limit it to distinct from_station_id/to_station_id combos, which I'll use to map routes (I have lat/long for each station):
X trip_id from_station_id.x to_station_id.x Sum
1 1109420 94 69 1
2 1109421 69 216 1
3 1109427 240 245 2
4 1109431 113 94 1
5 1109433 127 332 1
3 1109429 240 245 1
I successfully used count to get some of this, such as:
count(Divvy$from_station_id.x==94 & Divvy$to_station_id.x == 69)
x freq
1 FALSE 2454553
2 TRUE 81
But this is obviously labor-intensive, as there are 300 unique stations and hence well over 44k possible combinations. I created a helper table thinking I could loop over it.
n <- select(Divvy, from_station_id.y )
from_station_id.x
1 94
2 69
3 240
4 113
5 113
6 127
count(Divvy$from_station_id.x==n[1,1] & Divvy$to_station_id.x == n[2,1])
x freq
1 FALSE 2454553
2 TRUE 81
I felt like a loop such as
output <- matrix(ncol=variables, nrow=iterations)
output <- matrix()
for(i in 1:n)(output[i, count(Divvy$from_station_id.x==n[1,1] & Divvy$to_station_id.x == n[2,1]))
should work but come to think of it that will still only return 300 rows, not 44k, so it would have to then loop back and do n[2] & n[1] etc...
I felt like there might also be a quicker dplyr solution that would let me return a count of each combo and append it directly without the extra steps/table creation, but I haven't found it.
I'm fairly new to R and have searched around; I think I'm close, but I can't quite connect the last dot of joining that result back to Divvy. Any help is appreciated.
Here is the data.table solution, which is useful if you are working with large data:
library(data.table)
setDT(DF)[,sum:=.N,by=.(from_station_id.x,to_station_id.x)][] #DF is your dataframe
X trip_id from_station_id.x to_station_id.x sum
1: 1 1109420 94 69 1
2: 2 1109421 69 216 1
3: 3 1109427 240 245 2
4: 4 1109431 113 94 1
5: 5 1109433 127 332 1
6: 3 1109429 240 245 2
Since you said "limit it to distinct from_station_id/to_combos", the following code seems to provide what you are after. Your data is called mydf.
library(dplyr)
group_by(mydf, from_station_id.x, to_station_id.x) %>%
count(from_station_id.x, to_station_id.x)
# from_station_id.x to_station_id.x n
#1 69 216 1
#2 94 69 1
#3 113 94 1
#4 127 332 1
#5 240 245 2
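To append the counts directly to the original table, as the question describes, dplyr's add_count does this in one step (a sketch on the full Divvy table, assuming a reasonably recent dplyr):
# adds a Sum column with the number of trips sharing the same
# from/to station combination, keeping all original rows
Divvy <- Divvy %>%
  add_count(from_station_id.x, to_station_id.x, name = "Sum")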
I'm not entirely sure that's what you're looking for as a result, but this calculates the number of trips having the same origin and destination. Feel free to comment and let me know if that's not quite what you expect as a final result.
dat <- read.table(text="X trip_id from_station_id.x to_station_id.x
1 1109420 94 69
2 1109421 69 216
3 1109427 240 245
4 1109431 113 94
5 1109433 127 332
3 1109429 240 245", header=TRUE)
dat$from.to <- paste(dat$from_station_id.x, dat$to_station_id.x, sep="-")
freqs <- as.data.frame(table(dat$from.to))
names(freqs) <- c("from.to", "sum")
dat2 <- merge(dat, freqs, by="from.to")
dat2 <- dat2[order(dat2$trip_id),-1]
Results
dat2
# X trip_id from_station_id.x to_station_id.x sum
# 6 1 1109420 94 69 1
# 5 2 1109421 69 216 1
# 3 3 1109427 240 245 2
# 4 3 1109429 240 245 2
# 1 4 1109431 113 94 1
# 2 5 1109433 127 332 1
data<- c(100,101,102,103,104,99,98,97,94,93,103,90,104,105,110)
date<- Sys.Date()-15:1
file<- xts(data,date)
colnames(file)<- "CLOSE"
file$high<- cummax(file$CLOSE)
file$trade <- 0
file$trade[file$high*.95>=file$CLOSE] <- 1
file$trade[file$high*.90>=file$CLOSE] <- 2
file$trade[file$high*.85>=file$CLOSE] <- 3
file
CLOSE high trade
2013-07-06 100 100 0
2013-07-07 101 101 0
2013-07-08 102 102 0
2013-07-09 103 103 0
2013-07-10 104 104 0
2013-07-11 99 104 0
2013-07-12 98 104 1
2013-07-13 97 104 1
2013-07-14 94 104 1
2013-07-15 93 104 2
2013-07-16 103 104 0
2013-07-17 90 104 2
2013-07-18 104 104 0
2013-07-19 105 105 0
2013-07-20 110 110 0
I need to modify the trade column so that after I get my first "1", all following elements are zero until I get a 2, then all elements should again be 0 until I get a 3, and so on.
I think you could simply do:
file$trade[duplicated(file$trade)] <- 0
duplicated() flags every repeated value, so only the first occurrence of each trade level (1, 2, 3, ...) is kept and everything else is set to 0.
You don't need a loop to do this. You simply need to find the positions of the first "1", "2", and so on. Try the following code.
rank.trade <- rank(file$trade, ties.method = "first")
marks <- cumsum(head(table(file$trade), -1)) + 1
black.list <- is.na(match(rank.trade, marks))
file$trade[black.list] <- 0