I just started using R for statistical purposes and I appreciate any kind of help.
As a first step, I ran a time series regression over my columns: the Y columns are the dependent variables and X is the explanatory variable.
# example
Y1 <- runif(100, 5.0, 17.5)
Y2 <- runif(100, 4.0, 27.5)
Y3 <- runif(100, 3.0, 14.5)
Y4 <- runif(100, 2.0, 12.5)
Y5 <- runif(100, 5.0, 17.5)
X <- runif(100, 5.0, 7.5)
df1 <- data.frame(X, Y1, Y2, Y3, Y4, Y5)
# calculating log returns to provide data for the first regression
n <- nrow(df1)
X_logret <- log(X[2:n])-log(X[1:(n-1)])
Y1_logret <- log(Y1[2:n])-log(Y1[1:(n-1)])
Y2_logret <- log(Y2[2:n])-log(Y2[1:(n-1)])
Y3_logret <- log(Y3[2:n])-log(Y3[1:(n-1)])
Y4_logret <- log(Y4[2:n])-log(Y4[1:(n-1)])
Y5_logret <- log(Y5[2:n])-log(Y5[1:(n-1)])
# bringing the calculated log returns together in one data frame
df2 <- data.frame(X_logret, Y1_logret, Y2_logret, Y3_logret, Y4_logret, Y5_logret)
# running the time series regression
Regression <- lm(as.matrix(df2[c('Y1_logret', 'Y2_logret', 'Y3_logret', 'Y4_logret', 'Y5_logret')]) ~ df2$X_logret)
# extracting the coefficients for further calculation
Regression$coefficients[2,(1:5)]
As a second step I want to run a regression row by row, i.e. day by day, since the data contains daily observations. I also have a column "DATE", but I didn't know how to bring it into the example here. The format of the DATE column is POSIXct; maybe someone has an idea how to refer to a certain period in it over which the regression should be run.
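For illustration only, I imagine the date subsetting could look roughly like this (the dates are made up, and the example above has no DATE column, so this is just a sketch):
df_window <- df1[df1$DATE >= as.POSIXct("2020-01-01") & df1$DATE < as.POSIXct("2020-01-21"), ]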
In the row-by-row regression I would like to use the 5 coefficients from the first regression as the explanatory variable, and the 5 Y_logret values of each day as the dependent variable:
Y_logret(1 to 5) = Beta * Regression$coefficients[2, 1:5] + error.
The intercept is not needed, so I would set it to zero by adding + 0 in the lm() formula.
My goal is to run this regression over a period of time, for example 20 days. Day by day, this would give a total of 20 Beta estimates (one regression per day), but I also need all the errors for further calculation, i.e. 5 residuals per day, or 20*5 = 100 error values in total.
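Something like this minimal sketch is roughly what I have in mind, though I am not sure it is even valid R (slopes, betas and errors are just illustrative names; the first 20 rows of df2 stand in for 20 days):
# day-by-day cross-sectional regression, using the first-stage slopes
slopes <- Regression$coefficients[2, 1:5]     # the 5 first-stage slopes
betas <- numeric(20)
errors <- matrix(NA_real_, nrow = 20, ncol = 5)
for (i in 1:20) {
  y <- as.numeric(df2[i, c('Y1_logret', 'Y2_logret', 'Y3_logret', 'Y4_logret', 'Y5_logret')])
  fit <- lm(y ~ slopes + 0)                   # + 0 drops the intercept
  betas[i] <- coef(fit)[1]
  errors[i, ] <- residuals(fit)               # 5 residuals per day
}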
This is just an example; in the original dataset I have 20 Y columns and over 4000 rows, and I would like to run the regression over certain intervals of 900-1000 days. Since I am completely new to R, I have no idea how to proceed, especially how to code this in a few lines.
I really appreciate any kind of help.
I'm trying to build a linear regression model using eight independent variables, but when I run lm() one variable (the one I anticipate being my best predictor!) keeps returning NA. I'm still new to R, and I cannot find a solution.
Here are my independent variables:
TEMPERATURE
HUMIDITY
WIND_SPEED
VISIBILITY
DEW_POINT_TEMPERATURE
SOLAR_RADIATION
RAINFALL
SNOWFALL
My df is training_set and looks like:
I'm not sure whether this matters, but training_set is 75% of my original df, and testing_set is 25%. Created thusly:
set.seed(1234)
split_bike_sharing <- sample(c(rep(0, round(0.75 * nrow(bike_sharing_df))), rep(1, round(0.25 * nrow(bike_sharing_df)))))
This gave me table(split_bike_sharing):
split_bike_sharing
   0    1
6349 2116
And then I did:
training_set <- bike_sharing_df[split_bike_sharing == 0, ]
testing_set <- bike_sharing_df[split_bike_sharing == 1, ]
The structure of training_set is like:
To create the model I run the code:
lm_model_weather <- lm(RENTED_BIKE_COUNT ~ TEMPERATURE + HUMIDITY + WIND_SPEED + VISIBILITY + DEW_POINT_TEMPERATURE +
  SOLAR_RADIATION + RAINFALL + SNOWFALL, data = training_set)
However, as you can see the resultant model returns RAINFALL as NA. Here is the resultant model:
My first thought was to check RAINFALL's datatype, which is numeric with range 0-1 (because at an earlier step I performed min-max normalization). But SNOWFALL is also numeric, and I've done nothing (that I know of!) to the one but not the other. My second thought was to confirm that RAINFALL contains enough values to work, and that does not appear to be an issue: summary(training_set$RAINFALL):
So, how do I correct the NAs in RAINFALL? Truly I will be most grateful for your guidance to a solution.
UPDATE 10 MARCH 2022
I've now checked for collinearity:
X <- model.matrix(RENTED_BIKE_COUNT ~ ., data = training_set)
X2 <- caret::findLinearCombos(X)
print(X2)
This gave me:
I believe this means certain columns are jointly multicollinear. As you can see, columns 8, 13, and 38 are:
[8] is RAINFALL
[13] is SEASONS_WINTER
[38] is HOUR_23
Question: if I want to preserve RAINFALL as a predictor variable (viz., return proper values rather than NAs when I run lm()), what do I do? Remove columns [13] and [38] from the dataset?
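For example, would something like this sketch be the right idea (assuming SEASONS_WINTER and HOUR_23 exist as columns of training_set, as the model-matrix names suggest)?
training_set_reduced <- training_set[, !(names(training_set) %in% c("SEASONS_WINTER", "HOUR_23"))]
lm_model_weather <- lm(RENTED_BIKE_COUNT ~ ., data = training_set_reduced)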
I'm very new both to this platform and to R, but here is a question I need help with.
Suppose you have the following data frame:
ex <- rnorm(10, 3, 1.4)
ei <- rep(1:5, times=2)
eg <- rep(letters[1:2], each=5)
eg.df <- data.frame(eg, ex, ei)
The following code summarizes maximum values of 'ex' and 'ei' in the data frame by group:
library(dplyr)
eg.df <- eg.df %>%
  group_by(eg) %>%
  mutate(Xmax = max(ex), Imax = ei[which.max(ex)])
and the following one is my 'crude' attempt at finding the group standard deviation for the bootstrapped values of 'ex' (which seems to work fine):
eg.df %>% group_by(eg) %>% summarise(sdEx = sd(rep(sample(max(rep(sample(ex, replace = T), 100)), replace = T), 100)))
Now, my challenge is how to obtain the corresponding value for 'ei' (call it sdEi): the standard deviation of the bootstrapped values of 'ei' at which 'ex' was maximum.
I'm totally at a loss as to how to approach it. I would appreciate your help. Many thanks in advance!
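One possible approach (an untested sketch; nboot and boot_imax are illustrative names): resample whole rows within each group, record ei at each bootstrap maximum of ex, and take the SD of those values.
library(dplyr)
nboot <- 100
boot_imax <- function(ex, ei, nboot) {
  sd(replicate(nboot, {
    idx <- sample(length(ex), replace = TRUE)   # resample rows within the group
    ei[idx][which.max(ex[idx])]                 # ei at the resampled maximum of ex
  }))
}
eg.df %>% group_by(eg) %>% summarise(sdEi = boot_imax(ex, ei, nboot))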
I just started using R for statistical purposes and I appreciate any kind of help.
My task is to make calculations on one index and 20 stocks from the index. The data contains 22 columns (DATE, INDEX, S1 .... S20) and about 4000 rows (one row per day).
Firstly, I imported the .csv file, called it "dataset", and calculated log returns this way (doing the same for all stocks S1-S20 plus the INDEX):
n <- nrow(dataset)
S1 <- dataset$S1
S1_logret <- log(S1[2:n])-log(S1[1:(n-1)])
Secondly, I stored the data in a data.frame:
logret_data <- data.frame(INDEX_logret, S1_logret, S2_logret, S3_logret, S4_logret, S5_logret, S6_logret, S7_logret, S8_logret, S9_logret, S10_logret, S11_logret, S12_logret, S13_logret, S14_logret, S15_logret, S16_logret, S17_logret, S18_logret, S19_logret, S20_logret)
Then I ran the regression (S1 to S20) using the log returns:
S1_Reg1 <- lm(S1_logret ~ INDEX_logret)
I couldn't figure out how to write the code in a more efficient way and use some function for repetition.
In a further step I have to run a cross-sectional regression for each day in a selected interval. It is impossible to do manually, and R should provide some quick solution, but I am quite unsure how to do this part. I would also like to use some kind of loop for the previous calculations.
Yet I lack the necessary R coding knowledge. Any help to the point, or advice on literature or tutorials, is highly appreciated! Thank you!
You could provide all the separate dependent variables in a matrix to run your regressions. Something like this:
#example data
Y1 <- rnorm(100)
Y2 <- rnorm(100)
X <- rnorm(100)
df <- data.frame(Y1, Y2, X)
#run all models at once
lm(as.matrix(df[c('Y1', 'Y2')]) ~ X)
Out:

Call:
lm(formula = as.matrix(df[c("Y1", "Y2")]) ~ X)

Coefficients:
             Y1        Y2
(Intercept)  -0.15490  -0.08384
X            -0.15026  -0.02471
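The log returns themselves can also be computed for every column at once instead of copy-pasting. A hedged sketch (price_cols and logret_data are made-up names; dataset is assumed to have columns DATE, INDEX, S1, ..., S20 as described in the question):
#compute log returns for all price columns in one go
price_cols <- setdiff(names(dataset), "DATE")
logret_data <- as.data.frame(lapply(dataset[price_cols], function(p) diff(log(p))))
names(logret_data) <- paste0(names(logret_data), "_logret")
#the matrix trick above then runs all 20 regressions in one call
lm(as.matrix(logret_data[paste0("S", 1:20, "_logret")]) ~ INDEX_logret, data = logret_data)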
I have raw data of power system frequency: 86,400 numbers (one value per second over 24 hours).
frequency=a$Ist_Frq
plot.ts(frequency, main="System frequency [Hz]", xlab="Time [s]")
See example: [plot of the raw data]
Now I have to divide it into quarter-hour time intervals:
frequency=ts(a$Ist_Frq, start=1, frequency=900)
[plot of the quarter-hour time intervals]
My question is: is there any way to determine the standard deviation in every quarter-hour?
Thanks for your answers.
There are probably several solutions to this problem; here is one:
#some data
x <- rnorm(10000)
#identify quarter hour segments
y <- rep(1:ceiling(length(x)/(15 * 60)), each = 15 * 60)[1:length(x)]
#use tapply to find sd of x for every value of y
tapply(x, y, sd)
NB: the last value might be based on fewer than 900 values.
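Applied to the question's data, where the vector frequency holds 86,400 one-second samples, the same idea gives exactly 96 full quarter-hours:
#label each second with its quarter-hour segment, then take sd per segment
segment <- rep(1:ceiling(length(frequency)/900), each = 900)[seq_along(frequency)]
quarter_hour_sd <- tapply(frequency, segment, sd)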
I am using the CausalImpact R package and I would like to get the counterfactual/control time series from the output after estimation. I run the following code, which is basically the same as the example code on the package's website.
library(CausalImpact)
set.seed(1)
x1 <- 100 + arima.sim(model = list(ar = 0.999), n = 100)
y <- 1.2 * x1 + rnorm(100)
y[71:100] <- y[71:100] + 10
data <- cbind(y, x1)
pre.period <- c(1, 70)
post.period <- c(71, 100)
impact <- CausalImpact(data, pre.period, post.period)
The local linear trend is in
impact$model$bsts.model$state.contributions
while the coefficient draws are supposed to be in
impact$model$bsts.model$coefficients
so I run
trend=colMeans(impact$model$bsts.model$state.contributions[1:1000,1,1:100])
trend+mean(impact$model$bsts.model$coefficients[1:1000,2])*x1
to get the counterfactual time series; however, this is far from the actual counterfactual time series when plotting the results with
plot(impact)
Can somebody tell me how I can get back the counterfactual time series?
Thanks in advance!
The point predictions for the entire time series (both pre and post intervention) can be found at
impact$series
in the point.pred column. The counterfactual is the part of the point predictions that falls in the post.period portion of that column. impact$series provides the data for all three graphs in plot(impact).
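For example (a short sketch; it assumes the example's numeric time index from above):
pred <- impact$series[, "point.pred"]                   # point predictions, pre and post
counterfactual <- pred[post.period[1]:post.period[2]]   # the post-period part only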