Let's say our dataset looks as follows:
demand <- ts(BJsales, start = c(2000, 1), frequency = 12)
plot(demand)
Now I pass the time series object to HoltWinters and plot the fitted data.
hw <- HoltWinters(demand)
plot(hw)
I want to take the difference between the demand and the fitted data to find the Mean Absolute Deviation (MAD).
I took the demand from hw$x.
I took the fit from hw$fit.
accu_Holt_data <- as.data.frame(hw$x)
fore_holt <- as.data.frame(hw$fit)
difference <- accu_Holt_data - fore_holt
I can't take the difference because the row lengths are different.
Following up on my comment above, you can do something like this:
dta <- cbind(hw$fit[, 1], hw$x)
mean(abs(dta[, 2] - dta[, 1]), na.rm = TRUE)
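There are two main issues with your approach. First, hw$fit (which partially matches hw$fitted) is a multi-column time series, not a single series; its first column, xhat, is the filtered series. Second, the two time series have different time indices, because the fitted values start later than the observed data. Hence the need for something like cbind, which aligns the series on time, to merge them.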
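As a small sketch of the same idea, the two series can also be aligned explicitly with ts.intersect, which keeps only the overlapping time window, so no na.rm is needed. The names xhat, aligned and mad_value are only illustrative.

```r
# Rebuild the example series and fit the model as in the question
demand <- ts(BJsales, start = c(2000, 1), frequency = 12)
hw <- HoltWinters(demand)

# First column of the fitted values is the filtered series
xhat <- hw$fitted[, "xhat"]

# Keep only the time points present in both series
aligned <- ts.intersect(observed = demand, fitted = xhat)

# Mean absolute deviation between observed and fitted values
mad_value <- mean(abs(aligned[, "observed"] - aligned[, "fitted"]))
mad_value
```

Arithmetic on two ts objects with the same frequency also restricts itself to the common window, so mean(abs(demand - xhat)) should give the same number.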
I have some lab data and I am looking to calculate the difference between sample measurements over a moving time window, e.g. 2 minutes (as opposed to static 0-2, 2-4, 4-6 minute windows).
The problem is that although the data is sampled every second, some samples are missing (e.g. 1, 2, 4, 6, 7), so I cannot use a fixed lag function, especially for larger time windows.
Here is the most promising approach I have tried. I tried to calculate the difference in row positions and then use that to determine the lag value.
library(tidyverse)
df <- data.frame(sample_group = c(rep("a", 25), rep("b", 25)),t_seconds = c(1:50), measurement = seq(1,100,2))
df <- df[-c(5,10,23,33,44),] #remove samples
t_window = 5
df_diff <- df %>%
group_by(sample_group) %>%
arrange(t_seconds) %>%
mutate(lag_row = min(which(t_seconds >= t_seconds + t_window))- min(which(t_seconds == t_seconds)), #attempt to identify the lag value for each element
Meas_diff = measurement - lag(measurement, lag_row))
In this example (lag_row), I am trying to refer to both a single element of a vector and the whole vector at once, which obviously does not work. To make it clearer, I have added '_v' to mark what I want treated as a vector and '_e' to mark an element of that vector: min(which(t_seconds_v >= t_seconds_e + t_window)) - min(which(t_seconds_v == t_seconds_e))
I have tried to stay away from loops, but I have failed to solve the problem.
I would appreciate any better ideas.
Your first step should be inserting missing observations into your time series. Then you could fill the missing values using a Last-Observation-Carried-Backwards operation. This provides you with a complete regular time series.
Your desired output is very unclear, so the next step after that in the following example is just a guess. Adjust as needed.
#complete time series (using a data.table join):
library(data.table)
setDT(df)
df_fill <- df[, .SD[data.table(t_seconds = min(t_seconds):max(t_seconds)),
on = "t_seconds"],
by = sample_group]
df_fill[, filled := is.na(measurement)]
#last observation carried backwards
library(zoo)
df_fill[, measurement := na.locf(measurement, fromLast = TRUE), by = sample_group]
#differences
df_fill[, diff_value := shift(measurement, -t_window) - measurement, by = sample_group]
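To see what the backward fill and the lead difference do in isolation, here is a tiny sketch on a made-up vector (the values and the window of 2 are only illustrative, not from the question's data):

```r
library(zoo)
library(data.table)

# A regular series where NA marks the inserted (previously missing) samples
x <- c(1, NA, 3, NA, NA, 6)

# Each gap takes the next observed value (fill from the end backwards)
filled <- na.locf(x, fromLast = TRUE)

# Value two steps ahead; the last two positions have nothing ahead of them
ahead <- shift(filled, n = 2, type = "lead")

# Difference over the moving window
diff_value <- ahead - filled
diff_value
```

Using type = "lead" with a positive n is equivalent to the negative shift in the answer.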
I have a column in R with an uneven distribution, similar to an exponential distribution. I want to normalize the data and then bin it into successive buckets.
I saw the following links, which help with normalizing the data but say nothing about binning it into categories.
Normalizing data in R
Standardize data columns in R
Example of how the unevenly distributed column would look, but with a lot more rows:
dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
Qty = c(1,1,1,2,3,13,30,45))
I want the column binned into 5 categories, which may look like:
dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
Qty = c(1,1,1,2,3,13,30,45),
Binned_Category = c(1,1,1,1,2,3,4,5))
The Binned_Category above is only a sample; the real values may differ for the given data. I just wanted to show what I want the output to look like.
This will help:
num_bins <- 5
# quantile-based break points; unique() drops breaks that coincide because of ties
findInterval(dat$Qty, unique(quantile(dat$Qty, probs = seq(0, 1, 1/num_bins))))
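A self-contained sketch of the same quantile-binning idea on the question's sample data follows. The rightmost.closed = TRUE argument is an addition here, not part of the answer: it keeps the maximum value inside the last interval instead of giving it a bin of its own.

```r
dat <- data.frame(Id = 1:8,
                  Qty = c(1, 1, 1, 2, 3, 13, 30, 45))
num_bins <- 5

# Break points at equally spaced quantiles; ties can make some coincide
breaks <- unique(quantile(dat$Qty, probs = seq(0, 1, length.out = num_bins + 1)))

# Bin index for each row; the largest value falls in the last interval
dat$Binned_Category <- findInterval(dat$Qty, breaks, rightmost.closed = TRUE)
dat
```

With heavily tied data such as this, several quantiles coincide, so the number of distinct bins can come out smaller than num_bins.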
I'm trying to store p values from a long nested for loop in an empty column of a data frame. I've looked for examples close to my code, but my code is so long (and maybe even incorrect) that I can't tell whether what works for other for loops applies to mine.
The overview of what I'm trying to do is I'm trying to compare the relatedness of observed paired birds to the relatedness of all possible paired birds in a given year by finding a p value. To do this, I'm writing a for loop where I am selecting a range of years from a huge data set, and then I am applying a bunch of functions to those given years where I'm trying to narrow down the data for observed pairs and then I'm adding a column for relatedness and transferring those relatedness values for the pairs from another data set. I am then applying another for loop function within this in order to create a data frame with all possible paired birds in that given year and also adding and transferring a column of relatedness values for the pairs. From these two data frames of pairs and relatedness within each year, I want to apply the wilcox test to find the p value for each given year. I want to transfer over these p values into a separate data frame that I have created with a year column and a p value column.
Here is my (crazy looking) code:
year <- c(2000:2013)
pvalue <- c(NA)
results <- data.frame(year, pvalue)
for(j in c(2000:2013)) {
allbr_demo_noEPP_year <- subset(allbr_demo_noEPP, Year == j)
allbr_demo_noEPP_year_geno_obs <- allbr_demo_noEPP_year[allbr_demo_noEPP_year$Pairs %in% c(genome$pair1,genome$pair2),]
allbr_demo_noEPP_year_geno_obs$relatedness <- laply(allbr_demo_noEPP_year_geno_obs$Pairs, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
allbr_demo_noEPP_year_geno <- allbr_demo_noEPP_year[c(allbr_demo_noEPP_year$MB_USFWS,allbr_demo_noEPP_year$FB_USFWS) %in% genotyped$V2,]
breeder_list_males <- allbr_demo_noEPP_year_geno_obs[,8]
breeder_list_females <- allbr_demo_noEPP_year_geno_obs[,10]
unq_breeder_list_males <- unique(breeder_list_males)
unq_breeder_list_females <- unique(breeder_list_females)
all_poss_combo <-list()
for(i in unq_breeder_list_males){
print(i)
all_poss_combo[[i]]<-paste0(i, ",", unq_breeder_list_females)}
lapply(X = all_poss_combo, FUN= function(x) length(unique(x)))
all_poss_df<-unlist(all_poss_combo, use.names = F)
all_poss_df <- data.frame("combo"=all_poss_df, "M"=NA, "F"=NA)
all_poss_df$M <- substr(all_poss_df$combo, start = 1, stop = 10)
all_poss_df$F <- substr(all_poss_df$combo, start = 12, stop = 22)
all_poss_df_geno <- all_poss_df[all_poss_df$combo %in% c(genome$pair1,genome$pair2),]
all_poss_df_geno$relatedness <- laply(all_poss_df_geno$combo, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness, all_poss_df_geno$relatedness, alternative='greater')}
To be honest, I'm not even sure if this for loop will work (it seems pretty complex to me, but I am a beginner), but I was told that doing a for loop for this situation should work. I understand there are probably easier or faster ways to do what I am trying to do, which I also welcome, but I would also like to see how I could fix this for loop so it would work and how I could store the results from it into a data frame.
Thank you so much for any help given!
If you are simply looking to save the p value:
str(wilcox.test(rnorm(10), rnorm(10, 2))) # inspect the returned object (see ?wilcox.test)
wilcox.test(rnorm(10), rnorm(10, 2))$p.value # extract just the p value
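So with your dataset, perhaps put this at the bottom of your for loop, in place of the bare wilcox.test call:
# index by matching year; pvalue[j] would write to positions 2000, 2001, ... instead of rows 1 to 14
results$pvalue[results$year == j] <- wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness,
all_poss_df_geno$relatedness, alternative='greater')$p.value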
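Here is a minimal, runnable sketch of that storage pattern. The observed and possible vectors are simulated stand-ins (hypothetical, not the bird data), used only to show how each year's p value lands in the matching row of the results data frame.

```r
set.seed(42)

years <- 2000:2013
results <- data.frame(year = years, pvalue = NA_real_)

for (j in years) {
  # Simulated placeholders for the two relatedness vectors of year j
  observed <- rnorm(20, mean = 0.1)
  possible <- rnorm(200, mean = 0)

  # Store the p value in the row whose year equals j
  results$pvalue[results$year == j] <-
    wilcox.test(observed, possible, alternative = "greater")$p.value
}

results
```

Indexing with results$year == j keeps the data frame at one row per year no matter what values j takes.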
I am trying to calculate some rates for time on condition parameters, and have written the following, which successfully calculates the desired rates. But, I'm sure there must be a more succinct way to do this using the data.table methods. Any suggestions?
Background on what I'm trying to achieve with the code.
For each run number there are 10 record numbers. Each record number refers to a value bin (the full range of values for each parameter is split into 10 equal sized bins). The values are counts of time spent in each bin. I am trying to sum the counts for P1 over each run number (calling this opHours for the run number). I then want to divide each of the bin counts by the opHours to show the proportion of each run that is spent in each bin.
library(data.table)
#### Create dummy parameter values
P1 <- rnorm(2000,400, 50);
Date <- seq(from=as.Date("2010/1/1"), by = "day", length.out = length(P1));
RECORD_NUMBER <- rep(1:10, 200);
RUN_NUMBER <- rep(1:200, each=10, len = 2000);
#### Combine the dummy parameters into a dataframe
data <- data.frame(Date, RECORD_NUMBER, RUN_NUMBER, P1);
#### Calculating operating hours for each run
setDT(data);
running_hours_table <- data[ , .(opHours = sum(P1)), by = .(RUN_NUMBER)];
#### Set the join keys for the data and running_hours tables
setkey(data, RUN_NUMBER);
setkey(running_hours_table, RUN_NUMBER);
#### Join the operating hours onto the data by RUN_NUMBER
data <- data[running_hours_table];
data$P1.countRate <- (data$P1 / data$opHours)
Is it possible to generate the opHours column in the data table without first creating a separate table and then joining them back together?
data[ , opHours := sum(P1), by = .(RUN_NUMBER)]
You should probably read some materials about data.table:
wiki Getting-started
or
data.table.cheat.sheet
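As a sketch of the grouped assignment on dummy data shaped like the question's, the totals and the rates can both be added by reference without a separate table or a join:

```r
library(data.table)

set.seed(1)
data <- data.table(
  RUN_NUMBER    = rep(1:200, each = 10),
  RECORD_NUMBER = rep(1:10, times = 200),
  P1            = rnorm(2000, 400, 50)
)

# Add the per-run total, then the proportion of each run spent in each bin
data[, opHours := sum(P1), by = RUN_NUMBER]
data[, P1.countRate := P1 / opHours]

# The rates within each run should sum to 1
data[, .(total_rate = sum(P1.countRate)), by = RUN_NUMBER]
```

If opHours is not needed afterwards, data[, P1.countRate := P1 / sum(P1), by = RUN_NUMBER] does the same in one step.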
Let's start from the end: the R output will be read in Tableau to create a dashboard, so I need the R output to look a certain way. With that in mind, I'm starting with a data frame in R with n groups of time series. I want to run auto.arima (or another forecasting method from the forecast package) on each by group. I'm using the by function to do that, but I'm not attached to that approach; it's just what seemed to do the job for an R beginner like me.
The output I need would append a (say) 1 period forecast to the original data frame, filling in the date (variable t) and by variable (variable class).
If possible, I'd like the approach to generalize to multiple by variables (i.e. class_1, ..., class_n).
#generate fake data
t<-seq(as.Date("2012/1/1"), by = "month", length.out = 36)
class<-rep(c("A","B"),each=18)
set.seed(1234)
metric<-as.numeric(arima.sim(model=list(order=c(2,1,1),ar=c(0.5,.3),ma=0.3),n=35))
df <- data.frame(t,class,metric)
df$type<-"ORIGINAL"
#sort of what I'd like to do
library(forecast)
ts<-ts(df$metric)
ts<-by(df$metric,df$class,auto.arima)
#extract forecast and relevant other pieces of data
#???
#what I'd like to look like
t<-as.Date(c("2013/7/1","2015/1/1"))
class<-rep(c("A","B"),each=1)
metric<-c(1.111,2.222)
dfn <- data.frame(t,class,metric)
dfn$type<-"FORECAST"
dfinal<-rbind(df,dfn)
I'm not attached to the how-to, as long as it starts with a data frame that looks like what I described, and outputs a data frame like the output I described.
Your description is a little vague, but something along these lines should work:
library(data.table)
library(forecast) # provides auto.arima
dt = data.table(df)
dt[, {result = auto.arima(metric);
rbind(.SD,
list(seq(t[.N], length.out = 2, by = '1 month')[2], result$sigma2, "FORECAST"))},
by = class]
I arbitrarily chose to fill in the sigma^2, since it wasn't clear which variable(s) you want there.
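If the point forecast is what belongs in metric, one possible variation is sketched below. It assumes the forecast package's forecast() function with h = 1 and takes its mean component as the one-step-ahead value; the forecast rows are built separately and then appended. The names fc and dfinal are illustrative.

```r
library(data.table)
library(forecast)

# Fake data as in the question
set.seed(1234)
dt <- data.table(
  t      = seq(as.Date("2012-01-01"), by = "month", length.out = 36),
  class  = rep(c("A", "B"), each = 18),
  metric = as.numeric(arima.sim(model = list(order = c(2, 1, 1),
                                             ar = c(0.5, 0.3), ma = 0.3),
                                n = 35)),
  type   = "ORIGINAL"
)

# One forecast row per group: next month's date and the 1-step point forecast
fc <- dt[, .(
  t      = seq(t[.N], by = "1 month", length.out = 2)[2],
  metric = as.numeric(forecast(auto.arima(metric), h = 1)$mean),
  type   = "FORECAST"
), by = class]

# Append the forecast rows to the original data and sort within groups
dfinal <- rbind(dt, fc, use.names = TRUE)
setorder(dfinal, class, t)
dfinal
```

For several grouping variables, by = .(class_1, class_2) should work the same way, since each combination of values is treated as one group.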