I have a data frame of daily observations from 1963 to 2022. I want to calculate the mean of the observations for each month. However, some months don't have data for every day, and some have only a single data point, which skews the monthly means. How do I find out how many observations were used to calculate the mean for a given month?
Head of Data frame
structure(list(prcp_amt = c(0, 1.8, 6.4, 5.1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 4.3, 0, 0, 0, 0, 4.6, 0, 0, 0, 0, 0, 0, 0, 0.3,
4.8, 0, 0, 4.1, 0, 0, 0, 0.3, 3.6, 6.6, 0, 0, 0, 0, 0, 0, 0.8,
0, 0, 0, 0, 0), ob_date = structure(c(-220838400, -220752000,
-220665600, -220579200, -220492800, -220406400, -220320000, -220233600,
-220147200, -220060800, -219974400, -219888000, -219801600, -219715200,
-219628800, -219542400, -219456000, -219369600, -219283200, -219196800,
-219110400, -219024000, -218937600, -218851200, -218764800, -218678400,
-218592000, -218505600, -218419200, -218332800, -218246400, -218160000,
-218073600, -217987200, -217900800, -217814400, -217728000, -217641600,
-217555200, -217468800, -217382400, -217296000, -217209600, -217123200,
-217036800, -216950400, -216864000, -216777600, -216691200, -216604800
), class = c("POSIXct", "POSIXt"), tzone = "GMT")), row.names = c(NA,
50L), class = "data.frame")
Existing code
# historic monthly rainfall
rainHist$month <- as.numeric(format(rainHist$ob_date, '%m'))
rainHist$year <- as.numeric(format(rainHist$ob_date, '%Y'))
rainHistMean <- aggregate(prcp_amt ~ month + year, rainHist, FUN=mean)
rainHistMean$day <- 01
rainHistMean <- rainHistMean %>%
  mutate(Date = paste(year, month, day, sep = '-'))
rainHistMean[['Date']] <- as.POSIXct(rainHistMean[['Date']],
format='%Y-%m-%d',
tz='GMT'
)
Updated Code
rainHist$month <- as.numeric(format(rainHist$ob_date, '%m'))
rainHist$year <- as.numeric(format(rainHist$ob_date, '%Y'))
rainHistMean <- aggregate(prcp_amt ~ month + year, rainHist, FUN=function(x) c(mean(x), length(x)))
names(rainHistMean) <- c('month', 'year', 'prcp_amt', 'n')
How do I get four separate columns, rather than three columns where the third is a matrix?
Solution
rainHist$month <- as.numeric(format(rainHist$ob_date, '%m'))
rainHist$year <- as.numeric(format(rainHist$ob_date, '%Y'))
rainHistMean <- aggregate(prcp_amt ~ month + year, rainHist, FUN=function(x) c(mean(x), length(x)))
rainHistMean <- data.frame(rainHistMean[1:2], rainHistMean[[3]])
names(rainHistMean) <- c('month', 'year', 'prcp_amt', 'n')
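The reason the extra step is needed: when `FUN` returns a vector, `aggregate()` stores the result as a single matrix column rather than as separate columns. A minimal sketch of that behaviour on a hypothetical toy data frame (`toy`, `agg` and `flat` are illustrative names, not from the question), using `do.call(data.frame, ...)` as one alternative way to flatten it:

```r
# Hypothetical toy data: month 1 has three observations, month 2 has one
toy <- data.frame(
  month    = c(1, 1, 1, 2),
  year     = c(1963, 1963, 1963, 1963),
  prcp_amt = c(0, 1.8, 6.4, 5.1)
)

# FUN returns two values per group, so the result has a matrix column
agg <- aggregate(prcp_amt ~ month + year, toy,
                 FUN = function(x) c(mean = mean(x), n = length(x)))
ncol(agg)            # 3: month, year, and a two-column matrix

# Flatten the matrix column into ordinary columns
flat <- do.call(data.frame, agg)
names(flat) <- c("month", "year", "prcp_amt", "n")
ncol(flat)           # 4
```

The `n` column can then be used to filter out months with too few observations before plotting or further analysis.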
There may be more elegant solutions, but you can use dplyr to group by month and year, then get the count and mean in summarize:
library(dplyr)
library(lubridate)  # provides month() and year()

df %>%
  group_by(month(ob_date), year(ob_date)) %>%
  summarize(mean_prcp = mean(prcp_amt),
            count = n())
Output:
# # Groups: month(ob_date) [2]
# `month(ob_date)` `year(ob_date)` mean_prcp count
# <dbl> <dbl> <dbl> <int>
# 1 1 1963 0.91 30
# 2 2 1963 0.77 20
I have the following data:
dat<- structure(list(Pentad = 1:73, RR = c(0, 0.014285714, 0, 0.088571429,
0.071428571, 0, 0.065714286, 0.028571429, 0.094285714, 0.011428571,
0, 0, 0, 0, 0, 0, 0.04, 0, 0.814285714, 0.285714286, 1.14, 5.334285714,
2.351428571, 1.985714286, 1.494285714, 2.005714286, 20.04857143,
25.00857143, 16.32, 11.06857143, 8.965714286, 3.985714286, 5.202857143,
7.802857143, 4.451428571, 9.22, 32.04857143, 19.50571429, 3.148571429,
2.434285714, 9.057142857, 28.70857143, 34.15142857, 33.02571429,
46.50571429, 70.61714286, 3.168571429, 1.928571429, 7.031428571,
0.902857143, 5.377142857, 11.35714286, 15.04571429, 11.66285714,
21.24, 11.43714286, 11.69428571, 2.977142857, 4.337142857, 0.871428571,
1.391428571, 0.871428571, 1.145714286, 2.317142857, 0.182857143,
0.282857143, 0.348571429, 0, 0.345714286, 0.142857143, 0.18,
4.894285714, 0.037142857), YY = c(0.577142857, 0, 1.282857143,
1.445714286, 0.111428571, 0.36, 0, 0, 0, 1, 0.011428571, 0.008571429,
0.305714286, 0, 0, 0, 0, 0.8, 0.062857143, 0, 0, 0, 0, 0.013333333,
0.043333333, 1.486666667, 0, 2.486666667, 1.943333333, 0.773333333,
8.106666667, 7.733333333, 0.5, 4.356666667, 2.66, 6.626666667,
4.404285714, 7.977142857, 12.94285714, 18.49428571, 7.357142857,
11.08285714, 9.034285714, 14.29142857, 34.61428571, 45.30285714,
6.66, 6.702857143, 5.962857143, 14.85428571, 2.1, 2.837142857,
7.391428571, 32.03714286, 9.005714286, 3.525714286, 12.32, 2.32,
7.994285714, 6.565714286, 4.771428571, 2.354285714, 0.005714286,
2.508571429, 0.817142857, 2.885714286, 0.897142857, 0, 0, 0,
0, 0.145714286, 0.434285714)), class = "data.frame", row.names = c(NA,
-73L))
There are three columns: Pentad, RR, and YY.
I would like to get the following:
(a) Get the first pentad at which the precipitation exceeds the annual mean for at least three consecutive pentads.
(b) Get the last pentad at which the precipitation exceeds the annual mean for at least three consecutive pentads, before it drops below the annual mean.
I was able to do (a) using the following script:
first_exceed_seq <- function(x, thresh = mean(x), len = 3)
{
  # Logical vector, does x exceed the threshold
  exceed_thresh <- x > thresh
  # Indices of transition points; where exceed_thresh[i - 1] != exceed_thresh[i]
  transition <- which(diff(c(0, exceed_thresh)) != 0)
  # Reference index, grouping observations after each transition
  index <- vector("numeric", length(x))
  index[transition] <- 1
  index <- cumsum(index)
  # Break x into groups following the transitions
  exceed_list <- split(exceed_thresh, index)
  # Get the number of values exceeded in each index period
  num_exceed <- vapply(exceed_list, sum, numeric(1))
  # Get the starting index of the first sequence where at least len values exceed thresh
  transition[as.numeric(names(which(num_exceed >= len))[1])]
}
first_exceed_seq(dat$RR)
[Plot of the RR time series omitted.]
The correct answer in (a) is 27.
I would like to ask how can I do this for (b). The correct answer for (b) should be 57.
I'd appreciate any help on doing this in R.
I don't know if I got your problem right.
This is what I tried:
library(dplyr)

dat %>%
  mutate(
    annual_mean = mean(RR),
    exceed_thresh = RR > annual_mean,
    lag1 = lag(exceed_thresh, 1),
    lag2 = lag(exceed_thresh, 2),
    # keep RR only where this pentad and the two before it exceed the mean
    pick_3 = ifelse(exceed_thresh & lag1 & lag2, RR, NA)
  )
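The lag-based flags mark where three consecutive pentads exceed the mean, but they do not yet return the first and last pentad. A minimal sketch of one way to get both, using run-length encoding from base R; `exceed_run_bounds` and `toy` are hypothetical names introduced here, and the toy vector stands in for `dat$RR`:

```r
# Hypothetical helper: start of the first run and end of the last run
# with at least `len` consecutive values above `thresh`
exceed_run_bounds <- function(x, thresh = mean(x), len = 3) {
  runs   <- rle(x > thresh)                 # runs of TRUE/FALSE
  ends   <- cumsum(runs$lengths)            # last index of each run
  starts <- ends - runs$lengths + 1         # first index of each run
  keep   <- runs$values & runs$lengths >= len
  if (!any(keep)) return(c(first = NA_integer_, last = NA_integer_))
  c(first = starts[keep][1], last = ends[keep][sum(keep)])
}

toy <- c(0, 0, 5, 6, 7, 0, 8, 9, 10, 11, 0, 0)   # mean is about 4.67
exceed_run_bounds(toy)
# first  last
#     3    10
```

Calling `exceed_run_bounds(dat$RR)` is intended to give the answers for (a) and (b) in one step; it is worth checking the result against the expected 27 and 57.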
I've been working on this loop and get the error below. I'm not sure whether I can share the data; if needed, I'll do my best to obfuscate it. Here is the loop I am trying to use. Any tips on what I'm doing wrong would be greatly appreciated, as I haven't found a viable solution yet. The exact error is below the code.
decay_function = function(df)
{
  df <- df[order(df$department,df$product,df$region,df$monthnum),]
  for(mk in 1:ncol(levels_department)) {
    newdata <- df[which(df$department==as.character(levels_department[,mk])), ]
    levels_product<-as.data.frame(t(levels(as.factor(newdata$product))))
    for(md in 1:ncol(levels_product)){
      newdata <- newdata[which(newdata$product==as.character(levels_product[,md])), ]
      levels_region<-as.data.frame(t(levels(as.factor(newdata$region))))
      for(dm in 1:ncol(levels_region)){
        newdata <- newdata[which(newdata$region==as.character(levels_region[,dm])), ]
        for(i in 1:(nrow(newdata)-1)){
          start_month = newdata$monthnum[i]
          end_month = newdata$monthnum[nrow(newdata)]
          row_vector = c()
          decay_vector = c()
          for(j in 5:ncol(newdata)){
            k = 0
            for(l in start_month:end_month){
              distance_initial = (l - start_month)
              vector_increment = (l - (start_month-1))
              decay_rate = (0.5)^((1/halflife)*distance_initial)
              decay_value = (decay_rate)*(newdata[[i,j]])
              k = k + decay_value
            }
            df2[i,j] = k
          }
          print(df2)
        }
        if (mk=='1' & md=='1' & dm=='1'){
          outdata<-df2
        } else {
          outdata<-rbind(outdata,df2)
        }
      }
    }
  }
}
output_data = decay_function(tempone)
Error in start_month:end_month : argument of length 0
> dput(head(df))
structure(list(monthnum = c(33, 33, 33, 33, 33, 33), Region = c(2251,
2251, 2251, 2251, 2251, 2251), Department = c("Softlines", "Softlines",
"Softlines", "Softlines", "Softlines", "Softlines"), Product = c("T-Shirt",
"Jacket", "Sweat Shirt", "Tank Top", "Sweat Pants", "Mens Jeans"
), Incentive_Amount = c(5742.43, 108006.61, 459076.67, 34006,
141632.42, 29580.38), Leads_T1 = c(0, 0, 0, 0, 0, 0), DCLeads = c(0,
1, 0, 0, 0, 0), PhoneLeads = c(0, 0, 0, 0, 0, 0), T3_CRM_Leads = c(0,
0, 0, 0, 0, 0), Leads_Third = c(0, 1, 0, 0, 0, 0)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L))
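One plausible cause of the error, offered as an assumption rather than a confirmed diagnosis: `newdata` is overwritten inside each nested loop, so a later iteration filters an already-filtered frame and can end up with zero rows. With zero rows, `1:(nrow(newdata)-1)` still produces indices, and `newdata$monthnum[nrow(newdata)]` has length 0. The sketch below reproduces that situation with a hypothetical empty vector `monthnum`:

```r
monthnum <- numeric(0)            # column of a subset with zero rows (assumed)
n <- length(monthnum)

idx <- 1:(n - 1)                  # 1 0 -1: the loop body would still run
end_month <- monthnum[n]          # monthnum[0] is a zero-length vector
length(end_month)                 # 0

# Using a zero-length value with `:` raises an error, caught here as text
res <- tryCatch(1:end_month, error = function(e) conditionMessage(e))

# seq_len() yields an empty sequence instead, so a loop over it is skipped
safe_idx <- seq_len(max(n - 1, 0))
length(safe_idx)                  # 0
```

Note also that the `dput` output uses capitalised column names (`Region`, `Department`, `Product`) while the function refers to lowercase ones; if the real data is named the same way, those lookups would not find the columns.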
Can someone explain to me why this is not working?
tapply(dat$`lagged Date`, INDEX = dat$Location, FUN = diff(dat$`lagged Date`))
I receive the following error:
Error in match.fun(FUN) : 'diff(dat$`lagged Date`)' is not a function, character or symbol
structure(list(`lagged Date` = structure(c(1466306880, 1466307060,
1466307240, 1466307420, 1466307600, 1466307780), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), Location = c(309, 309, 309, 309, 309,
309), Duration = c(0, 0, 0, 0, 0, 0), Latitude = c(53.50205667,
53.501915, 53.50183667, 53.50178833, 53.50184, 53.50186167),
Longitude = c(-3.354733333, -3.354096667, -3.353838333, -3.353673333,
-3.353711667, -3.353741667), `Number of Records` = c(1, 1,
1, 1, 1, 1), Speed = c(0.9, 0, 0, 0, 0, 0), `Sum of Var` = c(38,
38, 38, 38, 38, 38), check = c(0, 0, 0, 0, 0, 0)), .Names = c("lagged Date",
"Location", "Duration", "Latitude", "Longitude", "Number of Records",
"Speed", "Sum of Var", "check"), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
thank you!
I'm not sure what you want to achieve, but using only diff as the FUN part works and produces this output:
tapply(dat$`lagged Date`, INDEX = dat$Location, FUN = diff)
$`309`
Time differences in mins
[1] 3 3 3 3 3
If you want to convert the output into hours, you can do that by selecting only the values of the difftime-list object and converting those:
as.numeric(tapply(dat$`lagged Date`, INDEX = dat$Location, FUN = diff)[[1]], units = "hours")
Output then looks like this:
[1] 0.05 0.05 0.05 0.05 0.05
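The original call failed because `FUN` must be a function, whereas `diff(dat$...)` is the result of calling one. A minimal sketch on hypothetical data (`times`, `loc` and `gaps` are illustrative names) that passes an anonymous function, so the unit conversion happens per group:

```r
# Hypothetical timestamps: three readings at location A, two at location B
times <- as.POSIXct("2016-06-19 03:28:00", tz = "UTC") + c(0, 180, 360, 0, 600)
loc   <- c("A", "A", "A", "B", "B")

# FUN is a function object; it is applied to each location's timestamps
gaps <- tapply(times, loc, function(x) as.numeric(diff(x), units = "mins"))

gaps[["A"]]   # 3 3
gaps[["B"]]   # 10
```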
I have tried searching Stack Overflow and Google for answers to my question, but I couldn't find anything close enough for me to apply. However, I'm very new to R, so I may just need a little walking through it.
If I use the following code, it works just fine.
> dput(b)
structure(list(DUMP_END_SHIFT_DATE = structure(c(1420070400,
1420070400, 1420156800, 1420156800, 1420243200, 1420243200, 1420329600,
1420329600, 1420416000, 1420416000, 1420502400), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), QUANTITY_REPORTING = c(235, 219, 232,
219, 219, 219, 219, 219, 219, 219, 235), WTRECV = c(32.71, 32.71,
20.19, 33.42, 21.61, 21.61, 21.61, 20.19, 21.61, 20.19, 24.2),
LC12 = c(0, 0, 0, 94, 100, 100, 100, 0, 100, 0, 100), LC34 = c(0,
100, 0, 6, 0, 0, 0, 0, 0, 0, 0), LC5 = c(0, 0, 5, 0, 0, 0,
0, 5, 0, 5, 0), HIS = c(25, 0, 60, 0, 0, 0, 0, 60, 0, 60,
0), UC = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), IBC = c(75,
0, 35, 0, 0, 0, 0, 35, 0, 35, 0)), .Names = c("DUMP_END_SHIFT_DATE",
"QUANTITY_REPORTING", "WTRECV", "LC12", "LC34", "LC5", "HIS",
"UC", "IBC"), class = c("data.table", "data.frame"), row.names = c(NA,
-11L), .internal.selfref = <pointer: 0x0000000005860788>)
library(data.table)
b_daily <- b[,.(d_tons=sum(QUANTITY_REPORTING)),by=DUMP_END_SHIFT_DATE]
b_daily[,"d_WTRECV" := b[,.(d_WTRECV=sum(QUANTITY_REPORTING*WTRECV)),by=DUMP_END_SHIFT_DATE] [,.(round(d_WTRECV/d_tons, digits=2))]]
b_daily[,"d_LC12" := b[,.(d_LC12=sum(QUANTITY_REPORTING*LC12)),by=DUMP_END_SHIFT_DATE] [,.(round(d_LC12/d_tons, digits=2))]]
b_daily[,"d_LC34" := b[,.(d_LC34=sum(QUANTITY_REPORTING*LC34)),by=DUMP_END_SHIFT_DATE] [,.(round(d_LC34/d_tons, digits=2))]]
b_daily[,"d_LC5" := b[,.(d_LC5=sum(QUANTITY_REPORTING*LC5)),by=DUMP_END_SHIFT_DATE] [,.(round(d_LC5/d_tons, digits=2))]]
b_daily[,"d_HIS" := b[,.(d_HIS=sum(QUANTITY_REPORTING*HIS)),by=DUMP_END_SHIFT_DATE] [,.(round(d_HIS/d_tons, digits=2))]]
b_daily[,"d_UC" := b[,.(d_UC=sum(QUANTITY_REPORTING*UC)),by=DUMP_END_SHIFT_DATE] [,.(round(d_UC/d_tons, digits=2))]]
b_daily[,"d_IBC" := b[,.(d_IBC=sum(QUANTITY_REPORTING*IBC)),by=DUMP_END_SHIFT_DATE] [,.(round(d_IBC/d_tons, digits=2))]]
However, it seems very inelegant. I think that I should be able to do this using .SD and .SDcols. I tried the following, just as a test case:
b_daily2 <- b[,lapply(.SD, function (x) sum(x*b[,QUANTITY_REPORTING])/sum(b[,QUANTITY_REPORTING])), by=DUMP_END_SHIFT_DATE, .SDcols=c("WTRECV")] [,.(DUMP_END_SHIFT_DATE,d_WTRECV=round(WTRECV, digits=2))]
The resulting numbers are a little off, and I get the following warning:
"In x * MQD[, QUANTITY_REPORTING] : longer object length is not a multiple of shorter object length"
I understand that this indicates recycling due to objects being different lengths...but I don't understand why or what. Any help would be much appreciated. I apologize in advance if this is an elementary question. Thank you.
This is arguably also inelegant, but at least fits into a single operation:
b_daily <- b[,{
d_tons = sum(QUANTITY_REPORTING)
d_WTRECV = round( sum(QUANTITY_REPORTING*WTRECV)/d_tons, digits = 2 )
list(d_tons = d_tons, d_WTRECV = d_WTRECV)
},by=DUMP_END_SHIFT_DATE]
If there are many columns like d_WTRECV, with names stored in cols = c("WTRECV",...), then...
cols <- c("WTRECV","LC12","LC34","LC5","HIS","UC","IBC")
b_daily2 <- b[,{
d_tons = sum(QUANTITY_REPORTING)
res = lapply(mget(cols), function(x)
round( sum(QUANTITY_REPORTING*x)/d_tons, digits = 2 )
)
c(list(d_tons = d_tons), setNames(res, paste0("d_",cols)))
},by=DUMP_END_SHIFT_DATE]
A similar approach using .SDcols will be possible when a bug related to it is fixed.
Aside. I think there is a feature request to allow for the first column to be used in computing the second, like
# NON-WORKING CODE:
b_daily <- b[,.(
d_tons = sum(QUANTITY_REPORTING),
d_WTRECV = round( sum(QUANTITY_REPORTING*WTRECV) / d_tons, digits = 2)
),by=DUMP_END_SHIFT_DATE]
This is how mutate in the dplyr package works. However, for your multicolumn case, dplyr is more of a hassle than a help, as far as I can figure.
By the way, you may want to wait on rounding. Usually, it's only a good idea for printing purposes and just unnecessarily worsens your later calculations.
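On the recycling warning in the question: inside a `by` group, `x` holds only that group's rows, whereas `b[, QUANTITY_REPORTING]` returns the whole column, so the two vectors differ in length. A minimal base R sketch of the same mismatch, using the first five rows of `b`; the names `w_all`, `x_all` and `grp` are hypothetical, and `grp` stands in for the three dates:

```r
w_all <- c(235, 219, 232, 219, 219)          # QUANTITY_REPORTING, all rows
x_all <- c(32.71, 32.71, 20.19, 33.42, 21.61) # WTRECV, all rows
grp   <- c(1, 1, 2, 2, 3)                    # stand-in for DUMP_END_SHIFT_DATE

# What the test case does per group: group values times the full weight column
x_grp <- x_all[grp == 1]                     # length 2
warn  <- tryCatch(x_grp * w_all,             # lengths 2 and 5: recycling warning
                  warning = function(w) conditionMessage(w))

# Weighting each group by its own rows only
wm <- sapply(split(seq_along(x_all), grp),
             function(i) weighted.mean(x_all[i], w_all[i]))
```

Here `wm` holds one weighted mean per group, computed from weights of matching length, which is what the answer's `sum(QUANTITY_REPORTING*x)/d_tons` does within each group.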
I don't think there is a particularly elegant way to do this. Here's a quick take.
sdc <- c("WTRECV", "LC12", "LC34", "LC5", "HIS", "UC", "IBC")
b2 <- copy(b)
b2[, (sdc) := lapply(.SD, "*", b2[, QUANTITY_REPORTING]), .SDcols=sdc]
b_daily <- b2[, lapply(.SD, sum), by=DUMP_END_SHIFT_DATE]
data.table(
b_daily[, .(DUMP_END_SHIFT_DATE)],
b_daily[, lapply(lapply(.SD, "/", b_daily[,QUANTITY_REPORTING]), round, 2), .SDcols=sdc]
)