How to create a numerical variable from a quarterly time column in R

I have a dataset which currently looks like this:
Time Var1 Var2
2013 Q4 123 756
2013 Q4 657 987
2014 Q1 746 756
2014 Q1 66 999
2014 Q2 774 542
And I need to convert this categorical 'Time' variable into a numerical variable, potentially something like this:
Time Var1 Var2 n.Time
2013 Q4 123 756 1
2013 Q4 657 987 1
2014 Q1 746 756 2
2014 Q1 66 999 2
2014 Q2 774 542 3
Or something similar that gives the 'Time' column a proportional numerical value.
I have attempted
df$n.Time <- as.yearqtr(df$Time)
but this just gives the same output as the 'Time' column instead of making it numerical.
Any help would be greatly appreciated.

Would something like this work?
df$n.Time <- as.numeric(as.factor(df$Time))
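A minimal check of this suggestion, using the sample values from the question: factor levels sort alphabetically, which for the "YYYY Qn" format happens to coincide with chronological order, so the codes come out in time order.

```r
df <- data.frame(Time = c("2013 Q4", "2013 Q4", "2014 Q1", "2014 Q1", "2014 Q2"),
                 stringsAsFactors = FALSE)
# factor levels sort alphabetically, which for "YYYY Qn" strings
# is also chronological order
df$n.Time <- as.numeric(as.factor(df$Time))
df$n.Time  # 1 1 2 2 3
```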

I think you are looking to split the quarter part out of the Time column and then convert it to a numerical value. Note that substr() needs an integer start position, so regexpr() (not gregexpr(), which returns a list) is the right tool here, and as.numeric() on the digits after "Q" gives an actual number:
df$n.Time <- as.numeric(substr(as.character(df$Time),
                               regexpr("Q", df$Time) + 1,
                               nchar(as.character(df$Time))))
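The OP's as.yearqtr attempt was actually close: wrapping it in as.numeric() (with the zoo package loaded) yields proportional values like 2013.75. The same value can be computed in base R without zoo, assuming the "YYYY Qn" format; this is only a sketch of that idea:

```r
tm <- c("2013 Q4", "2014 Q1", "2014 Q2")
yr <- as.numeric(substr(tm, 1, 4))    # year part
q  <- as.numeric(sub(".*Q", "", tm))  # quarter digit after "Q"
n.time <- yr + (q - 1) / 4
n.time  # 2013.75 2014.00 2014.25
```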

Related

How to find correlation in a data set

I wish to find the correlation of the trip duration and age from the data set below. I am applying the function cor(age, df$tripduration), but it gives me NA as the output. Could you please let me know how to make the correlation work? I computed "age" with the following syntax:
age <- (2017 - as.numeric(df$birth.year))
and the trip duration (in seconds) is df$tripduration.
Below is the data; in the gender column, 1 means male and 2 means female.
tripduration birth year gender
439 1980 1
186 1984 1
442 1969 1
170 1986 1
189 1990 1
494 1984 1
152 1972 1
537 1994 1
509 1994 1
157 1985 2
1080 1976 2
239 1976 2
344 1992 2
I think you are trying to subtract a data frame from a number, which would not work. This worked for me:
birth <- df$birth.year
year <- 2017
age <- year - birth
cor(df$tripduration, age)
[1] 0.08366848
# To check the coefficient against birth.year directly (note the flipped sign)
cor(df$tripduration, df$birth.year)
[1] -0.08366848
By the way, please format the question with easily reproducible data that people can just copy and paste into R. This actually helps you get an answer.
Based on the OP's comment, here is a new suggestion. Try deleting the rows with NA before performing a correlation test.
df <- df[complete.cases(df), ]
age <- (2017-as.numeric(df$birth.year))
cor(age, df$tripduration)
[1] 0.1726607
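For context on why cor() returned NA in the first place: cor() propagates missing values unless told otherwise, so you either drop incomplete rows beforehand or pass the use argument. A minimal illustration with made-up numbers:

```r
x <- c(1, 2, 3, NA)
y <- c(2, 4, 6, 8)
cor(x, y)                        # NA, because x contains a missing value
cor(x, y, use = "complete.obs")  # computed on the complete pairs only
```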

Testing whether n% of data values exist in a variable grouped by POSIX date

I have a data frame with hourly observational climate data over multiple years; I have included a dummy data frame below that will hopefully illustrate my question.
dateTime <- seq(as.POSIXct("2012-01-01"),
as.POSIXct("2012-12-31"),
by=(60*60))
WS <- sample(0:20,8761,rep=TRUE)
WD <- sample(0:390,8761,rep=TRUE)
Temp <- sample(0:40,8761,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[df$WS > 15] <- NA
I need to group by year (or, in this example, by month) to find whether df$WS has 75% or more valid data for that month. My filtering criterion is NA, as 0 is still a valid observation; the NAs are real, since this is observational climate data.
I have tried dplyr piping with the %>% operator to filter by a new column "Month", and have reviewed several questions on here:
Calculate the percentages of a column in a data frame - "grouped" by column,
Making a data frame of count of NA by variable for multiple data frames in a list,
R group by date, and summarize the values
None of these have really answered my question.
My hope is to put something in a longer script that works in a looping function that will go through all my stations and all the years in each station to produce a wind rose if this criteria is met for that year / station. Please let me know if I need to clarify more.
Cheers
There are many ways of doing this. This one appears quite instructive.
First create a new variable which will denote month (and account for year if you have more than one year). Split on this variable and count the number of NAs. Divide this by the number of values and multiply by 100 to get percentage points.
df$monthyear <- format(df$dateTime, format = "%m %Y")
out <- split(df, f = df$monthyear)
sapply(out, function(x) (sum(is.na(x$WS))/nrow(x)) * 100)
01 2012 02 2012 03 2012 04 2012 05 2012 06 2012 07 2012
23.92473 21.40805 24.09152 25.00000 20.56452 24.58333 27.15054
08 2012 09 2012 10 2012 11 2012 12 2012
22.31183 25.69444 23.22148 21.80556 24.96533
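To turn those percentages into the actual 75% test, here is a small self-contained sketch of the same split/sapply approach, with made-up two-month hourly data:

```r
set.seed(1)
df <- data.frame(
  dateTime = seq(as.POSIXct("2012-01-01"), by = "hour", length.out = 24 * 60),
  WS = sample(0:20, 24 * 60, replace = TRUE)
)
df$WS[df$WS > 15] <- NA          # roughly 24% of draws fall above 15
df$monthyear <- format(df$dateTime, format = "%m %Y")
pct_na <- sapply(split(df, df$monthyear),
                 function(x) 100 * sum(is.na(x$WS)) / nrow(x))
valid_months <- names(pct_na)[pct_na <= 25]  # months with >= 75% valid data
```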
You could also use data.table.
library(data.table)
setDT(df)
df[, (sum(is.na(WS))/.N) * 100, by = monthyear]
monthyear V1
1: 01 2012 23.92473
2: 02 2012 21.40805
3: 03 2012 24.09152
4: 04 2012 25.00000
5: 05 2012 20.56452
6: 06 2012 24.58333
7: 07 2012 27.15054
8: 08 2012 22.31183
9: 09 2012 25.69444
10: 10 2012 23.22148
11: 11 2012 21.80556
12: 12 2012 24.96533
Here is a method using dplyr. It will work even if you have missing data.
library(lubridate) #for the days_in_month function
library(dplyr)
df2 <- df %>% mutate(Month=format(dateTime,"%Y-%m")) %>%
group_by(Month) %>%
summarise(No.Obs=sum(!is.na(WS)),
Max.Obs=24*days_in_month(as.Date(paste0(first(Month),"-01")))) %>%
mutate(Obs.Rate=No.Obs/Max.Obs)
df2
Month No.Obs Max.Obs Obs.Rate
<chr> <int> <dbl> <dbl>
1 2012-01 575 744 0.7728495
2 2012-02 545 696 0.7830460
3 2012-03 560 744 0.7526882
4 2012-04 537 720 0.7458333
5 2012-05 567 744 0.7620968
6 2012-06 557 720 0.7736111
7 2012-07 553 744 0.7432796
8 2012-08 568 744 0.7634409
9 2012-09 546 720 0.7583333
10 2012-10 544 744 0.7311828
11 2012-11 546 720 0.7583333
12 2012-12 554 744 0.7446237

Rolling multi regression in R data table

Say I have an R data.table DT which has a list of returns:
Date Return
2016-01-01 -0.01
2016-01-02 0.022
2016-01-03 0.1111
2016-01-04 -0.006
...
I want to do a rolling multi regression of the previous N observations of Return predicting the next Return over some window K. E.g. Over the last K = 120 days do a regression of the last N = 14 observations to predict the next observation. Once I have this regression I want to use the predict function to get a prediction for each row based on the regression. In pseudocode it would be something like:
DT[, Prediction := predict(lm(Return[prev K - N -1] ~ Return[N observations prev for each observation]), Return[N observations previous for this observation])]
To be clear, I want to do a multi regression, so if N were 3 it would be:
lm(Return ~ Return[-1] + Return[-2] + Return[-3]) ## where the negatives refer to previous rows
How do I write this (as efficiently as possible)?
Thanks
If I understand correctly you want a quarterly auto-regression.
There's a related thread on time-series with data.table here.
You can setup a rolling date in data.table like this (see the link above for more context):
#Example for quarterly data
quarterly[, rollDate:=leftBound]
storeData[, rollDate:=date]
setkey(quarterly,"rollDate")
setkey(storeData,"rollDate")
Since you only provided a few rows of example data, I extended the series through 2019 and made up random return values.
First get your data setup:
require(forecast)
require(xts)
DT <- read.table(con<- file ( "clipboard"))
dput(DT) # the dput was too long to display here
DT[,1] <- as.POSIXct(strptime(DT[,1], "%m/%d/%Y"))
DT[,2] <- as.double(DT[,2])
dat <- xts(DT$V2, order.by = DT$V1)
x.ts <- to.quarterly(dat) # 120 days
dat.Open dat.High dat.Low dat.Close
2016 Q1 1292 1292 1 698
2016 Q2 138 1290 3 239
2016 Q3 451 1285 5 780
2016 Q4 355 1243 27 1193
2017 Q1 878 1279 4 687
2017 Q2 794 1283 12 411
2017 Q3 858 1256 9 1222
2017 Q4 219 1282 15 117
2018 Q1 554 1286 32 432
2018 Q2 630 1272 30 46
2018 Q3 310 1288 18 979
2019 Q1 143 1291 10 184
2019 Q2 250 1289 8 441
2019 Q3 110 1220 23 571
Then you can do a rolling ARIMA model with or without re-estimation like this:
fit <- auto.arima(x.ts)
order <- arimaorder(fit)
n <- nrow(x.ts)
fcmat <- matrix(0, nrow = n, ncol = 1)
for(i in 1:n)
{
  x <- window(x.ts, end = 2017.99 + (i - 1)/4)
  refit <- Arima(x, order = order[1:3], seasonal = order[4:6])
  fcmat[i,] <- forecast(refit, h = 1)$mean  # one step ahead
}
Here's a good related resource with several examples of different ways you might construct this: http://robjhyndman.com/hyndsight/rolling-forecasts/
You have to have the lags in columns anyway, so if I understand you correctly you can do something like this, say for a lag of 3:
setkey(DT, date)
lag_max <- 3
for(i in 1:lag_max){
  set(DT, NULL, paste0("lag", i), shift(DT[["return"]], i, type = "lag"))  # shift by i, not always 1
}
DT[, prediction := fitted(lm(return ~ lag1 + lag2 + lag3, na.action = na.exclude))]  # na.exclude keeps the fitted values aligned with the rows
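A self-contained sketch of the same idea in base R, with made-up return values; the point is that na.exclude makes fitted() return a vector padded with NA for the lag-incomplete rows, so it lines up with the original data:

```r
set.seed(42)
ret <- rnorm(20)  # made-up daily returns
d <- data.frame(return = ret)
# build lag-1..lag-3 columns by padding with NA at the top
for (i in 1:3) d[[paste0("lag", i)]] <- c(rep(NA, i), head(ret, -i))
fit <- lm(return ~ lag1 + lag2 + lag3, data = d, na.action = na.exclude)
d$prediction <- fitted(fit)  # NA for the first three rows, aligned thereafter
```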

R add in/populate missing combinations dcast reshape2 table

This is my data table:
Name.1 <- c(rep("IVa",12),rep("VIa",10),rep("VIIb",3),rep("IVa",5))
qrt <- c(rep("Q1",6),rep("Q3",10),rep("Q4",3),rep("Q1",5),rep("Q1",3),rep("Q3",3))
variable <- c(rep("wtTonnes",30))
value <- c(201:230)
df <- data.frame(Name.1,qrt,variable,value)
df1 <- dcast(df, Name.1 ~ qrt, fun.aggregate=sum, value.var="value",margins=TRUE)
It gives me an output like this;
Name.1 Q1 Q3 Q4 (all)
IVa 1674 1944 0 3618
VIa 663 858 654 2175
VIIb 672 0 0 672
(all) 3009 2802 654 6465
The 'qrt' values Q1, Q3, Q4 represent quarters of the year. Basically, I would like the table to include the missing quarters, populated with 0. Each year I run the script there could be wtTonnes values for any combination of quarters, and I don't want to hard-code the missing ones each time.
In this case I would like it to look like:
Name.1 Q1 Q2 Q3 Q4 (all)
IVa 1674 0 1944 0 3618
VIa 663 0 858 654 2175
VIIb 672 0 0 0 672
(all) 3009 0 2802 654 6465
Is it possible, at any stage, to pass a list to the table or the raw data saying which columns I want (i.e. for there always to be Q1, Q2, Q3, Q4), with dummy values if need be?
The following should give you the required output:
df$qrt <- factor(df$qrt, levels = c("Q1", "Q2", "Q3", "Q4"))
df1 <- dcast(df, Name.1 ~ qrt, fun.aggregate=sum, value.var="value",margins=TRUE, drop = F)
First, I tell R that qrt is a factor with the corresponding levels, including the level that does not occur; then I tell dcast to avoid dropping unused combinations. This gives:
Name.1 Q1 Q2 Q3 Q4 (all)
1 IVa 1674 0 1944 0 3618
2 VIa 663 0 858 654 2175
3 VIIb 672 0 0 0 672
4 (all) 3009 0 2802 654 6465
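The same trick works in base R: once the factor carries all four levels, xtabs() keeps the empty Q2 column as 0. A minimal sketch with made-up values:

```r
qrt   <- factor(c("Q1", "Q1", "Q3", "Q4"), levels = paste0("Q", 1:4))
value <- c(10, 20, 30, 40)
xtabs(value ~ qrt)  # Q1 = 30, Q2 = 0, Q3 = 30, Q4 = 40
```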

R ddply error: undefined columns selected

I have a time series data set like below:
age time income
16 to 24 2004 q1 400
16 to 24 2004 q2 500
… …
65 and over 2014 q3 800
It has 60 quarters of income data for each age group. As the income data is seasonal, I am trying to apply a decomposition function to filter out the trends. What I have done so far is below, but R consistently throws an error at me (error message: undefined columns selected). Any idea how to go about it?
fun <- function(x){
  x.ts <- ts(x, frequency = 4, start = c(2004, 1))
  ts.d <- decompose(x.ts, type = 'additive')
  as.vector(ts.d$trend)
}
trend.dt <- ddply(my.dat, .(age), transform, trend = fun(income))
The expected result is below (the NAs appear because, after decomposition, the first and last observations have no trend value, but the rest should):
age time income trend
16 to 24 2004 q1 400 NA
16 to 24 2004 q2 500 489
… …
65 and over 2014 q3 800 760
65 and over 2014 q3 810 NA
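For reference, a self-contained sketch of the decomposition step itself on made-up quarterly income values; note that for quarterly data decompose() leaves NA at both ends of the trend, two values on each side, not just one:

```r
income <- c(400, 500, 450, 480, 420, 520, 470, 500, 430, 540, 480, 510)
inc.ts <- ts(income, frequency = 4, start = c(2004, 1))
trend  <- as.vector(decompose(inc.ts, type = "additive")$trend)
trend  # first two and last two entries are NA; the middle is the moving-average trend
```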
