3-month rolling correlation keeping the date column in R

This is my data: daily return data for different sectors.
I would like to compute the 3-month rolling correlation between sectors, but keep the date field so the results line up.
> head(data)
Date Communication Services Consumer Discretionary Consumer Staples Energy Financials - AREITs Financials - ex AREITs Health Care
1 2003-01-02 -0.0004 0.0016 0.0033 0.0007 0.0073 0.0006 0.0370
2 2003-01-03 -0.0126 -0.0008 0.0057 -0.0019 0.0016 0.0062 0.0166
3 2003-01-06 0.0076 0.0058 -0.0051 0.0044 0.0063 0.0037 -0.0082
4 2003-01-07 -0.0152 0.0052 -0.0024 -0.0042 -0.0037 -0.0014 0.0027
5 2003-01-08 0.0107 0.0017 0.0047 -0.0057 0.0013 -0.0008 -0.0003
6 2003-01-09 -0.0157 0.0019 -0.0020 0.0009 -0.0016 -0.0012 0.0055
My data types are:
$ Date : Date[1:5241], format: "2003-01-02" "2003-01-03" "2003-01-06" "2003-01-07" ...
$ Communication Services : num [1:5241] -0.0004 -0.0126 0.0076 -0.0152 0.0107 -0.0157 0.0057 -0.0131 0.0044 0.0103 ...
$ Consumer Discretionary : num [1:5241] 0.0016 -0.0008 0.0058 0.0052 0.0017 0.0019 -0.0022 0.0057 -0.0028 0.0039 ...
$ Consumer Staples : num [1:5241] 0.0033 0.0057 -0.0051 -0.0024 0.0047 -0.002 0.0043 -0.0005 0.0163 0.004 ...
$ Energy : num [1:5241] 0.0007 -0.0019 0.0044 -0.0042 -0.0057 0.0009 0.0058 0.0167 -0.0026 -0.0043 ...
$ Financials - AREITs : num [1:5241] 0.0073 0.0016 0.0063 -0.0037 0.0013 -0.0016 0 0.0025 -0.0051 0.0026 ...
Currently what I am doing is this:
rollingcor <- rollapply(data, width=60, function(x) cor(x[,2],x[,3]),by=60, by.column=FALSE)
This works fine: it computes the rolling 60-day correlation and shifts the window forward by 60 days. However, it doesn't keep the Date column, and I find it hard to match the dates.
The end goal here is to produce a data frame in which the date is every 3 months and the other columns are the correlations between all the sectors in my data.

Please read the information at the top of the r tag and, in particular, provide the input in an easily reproducible manner using dput. In the absence of that we will use the data shown below, based on the 6x2 BOD data frame that comes with R, and use a width of 4. The names on the correlation columns are the row:column numbers in the correlation matrix. For example, compare the 4th row of the output below with cor(data[1:4, -1]).
fill=NA causes it to output the same number of rows as the input by filling with NA's.
library(zoo)
# test data
data <- cbind(Date = as.Date("2023-02-01") + 0:5, BOD, X = 1:6)
# given data frame x return lower triangular part of cor matrix
# Last 2 lines add row:column names.
Cor <- function(x) {
  k <- cor(x)
  lo <- lower.tri(k)
  k.lo <- k[lo]
  m <- which(lo, arr.ind = TRUE)  # rows & cols of lower tri
  setNames(k.lo, paste(m[, 1], m[, 2], sep = ":"))
}
cbind(data, rollapplyr(data[-1], 4, Cor, by.column = FALSE, fill = NA))
giving:
Date Time demand X 2:1 3:1 3:2
1 2023-02-01 1 8.3 1 NA NA NA
2 2023-02-02 2 10.3 2 NA NA NA
3 2023-02-03 3 19.0 3 NA NA NA
4 2023-02-04 4 16.0 4 0.8280576 1.0000000 0.8280576
5 2023-02-05 5 15.6 5 0.4604354 1.0000000 0.4604354
6 2023-02-06 7 19.8 6 0.2959666 0.9827076 0.1223522
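To get back to the question's non-overlapping 60-row windows while keeping a matching Date column, the window-end positions can be computed alongside rollapplyr. A minimal sketch with made-up stand-in data (columns A and B stand in for two sector-return columns):

```r
library(zoo)

# made-up stand-in data for two sector-return columns
set.seed(1)
data <- data.frame(Date = as.Date("2003-01-02") + 0:119,
                   A = rnorm(120), B = rnorm(120))

w <- 60
# right-aligned, non-overlapping windows: one correlation per 60 rows
rc <- rollapplyr(data[-1], w, function(x) cor(x[, 1], x[, 2]),
                 by = w, by.column = FALSE)
# the window-end dates line up with the correlations
res <- data.frame(Date = data$Date[seq(w, nrow(data), by = w)],
                  cor = coredata(rc))
```

With 120 rows this yields two rows, dated at rows 60 and 120, one correlation each.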


How can I minimize nonlinear objective function with linear equality constraint with R?

Minimize

    f(z) = sum_{t=2}^{I*J} (z_t - z_{t-1})^2

subject to the constraints

    sum_{j=1}^{J} z_{(i-1)*J + j + K} = y^f_i,   i = 1, ..., I-1.

This optimization finds quarterly values from a fiscal-year data series (y^f_i), such that summing the quarterly values recovers the annual figures. I is the number of calendar years in the series interval (2 <= I <= N), J is the number of quarters per year, and K is the number of periods of calendar year i that fall in fiscal year i-1.
In my case, I = 39, J = 4, K = 2
How can I solve this problem using R?
The way I tried to write code is provided below:
library(NlcOptim)
library(readxl)
Calendarization <- read_excel("C:/Users/HP/Desktop/Calendarization.xlsx")
View(Calendarization)
y<-Calendarization$`wholesale price`
objfun = function(z){
return(sum(z[t] - lag(z[t], k=1))^2)
}
for (t in 2:156){
objfun
} -> objfun
p0<-0:39
Aeq<-sum(z[((i-1)*4)+j+2])
for (j in 1:4){
for (i in 1:39){
Aeq
}->Aeq
}
Beq<- y[i]
x=p0
solnl(x, objfun=objfun, Aeq=Aeq, Beq=Beq)
Here is the data I have:
year wholesale price
1970-1971 0.99
1971-1972 1.32
1972-1973 20.9
1973-1974 2.83
1974-1975 5.78
1975-1976 3.38
1976-1977 3.02
1977-1978 2.88
1978-1979 4.08
1979-1980 5.4
1980-1981 4.51
1981-1982 5.91
1982-1983 6.42
1983-1984 7.07
1984-1985 7.68
1985-1986 8.04
1986-1987 9.62
1987-1988 10.05
1988-1989 9.81
1989-1990 9.6
1990-1991 10.59
1991-1992 11.08
1992-1993 9.42
1993-1994 9.6
1994-1995 12.28
1995-1996 12.58
1996-1997 10.87
1997-1998 12.09
1998-1999 13.66
1999-2000 12.28
2000-2001 11.75
2001-2002 11.49
2002-2003 13.08
2003-2004 13.43
2004-2005 15.06
2005-2006 16.5
2006-2007 18.48
2007-2008 24.74
2008-2009 26.69
There seems to be something wrong with the formulation. Just looking at the formulas in the image, the last constraint, for i = I-1 = 39-1 = 38, sums the z elements (38-1)*4+1+6, (38-1)*4+2+6, (38-1)*4+3+6 and (38-1)*4+4+6, i.e. elements 155, 156, 157 and 158; but z runs from 1 to 4*39 and so has only 156 elements. Furthermore, not all of the z values participate in a constraint.
Given the problems cited, let us change the problem to one that makes sense: assume we want to minimize the sum of squares of the successive differences of the I*J elements of z, subject to the first 4 elements of z summing to Cal[1, 2], the next 4 summing to Cal[2, 2], and so on up to the last 4 elements summing to Cal[39, 2]. In that case we can write the Aeq constraint matrix as a block-diagonal matrix using the Kronecker product shown. We ignore K. (Cal is shown reproducibly in the Note at the end.)
library(NlcOptim)
I <- 39; J <- 4
objfun <- function(x) sum(diff(x)^2)  # sum of squared successive differences
Aeq <- diag(I) %x% matrix(1, 1, J)    # block diagonal via Kronecker product
Beq <- Cal[, 2]                       # annual totals
st <- rep(1, I * J)                   # starting values
res <- solnl(st, objfun, Aeq = Aeq, Beq = Beq)
giving
> str(res)
List of 6
$ par : num [1:156, 1] 0.576 0.445 0.182 -0.213 -0.739 ...
$ fn : num 25.5
$ counts : num [1, 1:2] 19332 124
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:2] "nfval" "ngval"
$ lambda :List of 3
..$ lower: num [1:156, 1] 0 0 0 0 0 0 0 0 0 0 ...
..$ upper: num [1:156, 1] 0 0 0 0 0 0 0 0 0 0 ...
..$ eqlin: num [1:39] 0.263 1.486 2.427 1.662 0.68 ...
$ grad : num [1:156, 1] 0.263 0.263 0.263 0.263 -1.486 ...
$ hessian: num [1:156, 1:156] 1.65775 -1.10379 0.62425 -0.09085 0.00878 ...
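As a quick self-contained sanity check of this approach (shrunk to the first three annual values so it runs fast), the equality constraints can be verified by multiplying Aeq by the solution:

```r
library(NlcOptim)

# same setup as above, reduced to I = 3 years for a quick check
I <- 3; J <- 4
Aeq <- diag(I) %x% matrix(1, 1, J)
Beq <- c(0.99, 1.32, 20.9)  # first three annual values from Cal
res <- solnl(rep(1, I * J), function(x) sum(diff(x)^2),
             Aeq = Aeq, Beq = Beq)

# each year's four quarterly values should sum to the annual figure
as.numeric(Aeq %*% res$par)
```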
Note
Lines <- "year wholesale price
1970-1971 0.99
1971-1972 1.32
1972-1973 20.9
1973-1974 2.83
1974-1975 5.78
1975-1976 3.38
1976-1977 3.02
1977-1978 2.88
1978-1979 4.08
1979-1980 5.4
1980-1981 4.51
1981-1982 5.91
1982-1983 6.42
1983-1984 7.07
1984-1985 7.68
1985-1986 8.04
1986-1987 9.62
1987-1988 10.05
1988-1989 9.81
1989-1990 9.6
1990-1991 10.59
1991-1992 11.08
1992-1993 9.42
1993-1994 9.6
1994-1995 12.28
1995-1996 12.58
1996-1997 10.87
1997-1998 12.09
1998-1999 13.66
1999-2000 12.28
2000-2001 11.75
2001-2002 11.49
2002-2003 13.08
2003-2004 13.43
2004-2005 15.06
2005-2006 16.5
2006-2007 18.48
2007-2008 24.74
2008-2009 26.69"
Cal <- read.table(text = Lines, skip = 1, col.names = c("year", "wholesale price"),
check.names = FALSE, strip.white = TRUE)

How can I extract specific data points from a wide-formatted text file in R?

I have datasheets with multiple measurements that look like the following:
FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP
20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4
ANGLES 7.000 23.00 38.00 53.00 68.00
CNTCT# 1.969 1.517 0.981 1.579 1.386
STDDEV 1.632 1.051 0.596 0.904 0.379
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.137 0.192 0.288 0.073 0.025
A 1 08:38:40 31.66 33.63 34.59 39.13 55.86
1 2 08:38:40 -5.0e-006
B 3 08:38:48 25.74 20.71 15.03 2.584 1.716
B 4 08:38:55 0.344 1.107 2.730 0.285 0.265
B 5 08:39:02 3.211 5.105 13.01 4.828 1.943
B 6 08:39:10 8.423 22.91 48.77 16.34 3.572
B 7 08:39:19 12.58 14.90 18.34 18.26 4.125
I would like to read the entire datasheet and extract the values for 'QUAD' and 'LAI' only. For example, for the data above I would only be extracting a QUAD of 1161 and an LAI of 2.80.
In the past the datasheets were formatted as long data, and I was able to use the following code:
library(stringr)
QUAD <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^QUAD).*$")))
LAI <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
QUAD = QUAD[!is.na(QUAD)],
LAI = LAI[!is.na(LAI)]
)
data_extract
Unfortunately, this does not work because of the wide formatting in the current datasheet. Any help would be hugely appreciated. Thanks in advance for your time.
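One possible approach, assuming (as in the sample above) that the column names sit on the first line of each sheet and their values on the line directly below: split both lines on whitespace and match by name. The txt vector below stands in for readLines() on the actual file:

```r
# header line and first data row, standing in for readLines("datasheet.txt")
txt <- c("FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP",
         "20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4")

hdr <- strsplit(trimws(txt[1]), "\\s+")[[1]]  # column names
val <- strsplit(trimws(txt[2]), "\\s+")[[1]]  # matching values

data_extract <- data.frame(QUAD = as.numeric(val[match("QUAD", hdr)]),
                           LAI  = as.numeric(val[match("LAI", hdr)]))
data_extract
#   QUAD LAI
# 1 1161 2.8
```

For many datasheets in one file, the same matching could be applied to each block's first two lines.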

CAPdiscrim error: numeric 'envir' arg not of length one

I'm trying to apply a CAP function to chemical data collected in different years.
I have a data archive:
head(ISPA_data)
SrCa MgCa MnCa RbCa CuCa ZnCa BaCa PbCa NaCa LiCa CoCa NiCa
1 5178 25.101 9.334 0.166 4.869 8.379 34.846 0.194 5464 0.313 2.510 25.181
2 6017 22.922 7.185 0.166 4.685 8.720 24.659 0.154 4600 0.300 2.475 25.060
3 5628 26.232 6.248 0.179 4.628 10.157 23.942 0.166 5378 0.300 2.529 25.252
4 4769 35.598 7.683 0.131 4.370 8.735 50.068 0.180 5938 0.568 2.159 21.645
5 5330 28.284 6.828 0.130 5.370 12.742 34.257 0.220 5614 0.397 2.275 23.852
6 5786 24.603 4.797 0.156 5.317 13.331 66.896 0.117 5001 0.423 2.298 24.361
and an environmental dataset:
head(ISPA.env)
Year OM Code Location
<dbl> <chr> <chr> <chr>
1 1975 0.04349 CSP75_25 CSP
2 1975 0.0433 CSP75_28 CSP
3 1975 0.04553 CSP75_31 CSP
4 1975 0.0439 CSP75_33 CSP
5 1975 0.02998 CSP75_37 CSP
6 1975 0.0246 CSP75_39 CSP
When performing CAPdiscrim,
Ordination.model1 <- CAPdiscrim(ISPA_data~Year,
ISPA.env,
dist="euclidean",
axes=4,
m=0,
add=FALSE,
permutations=999)
this error occurs:
Error in eval(predvars, data, env) :
numeric 'envir' arg not of length one
In addition: Warning message:
In cmdscale(distmatrix, k = nrow(x) - 1, eig = T, add = add) :
only 13 of the first 19 eigenvalues are > 0
All data has the same length.
Can anyone help me? Thanks!
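Without a reproducible dataset this is only a guess, but a frequent cause of this error is that CAPdiscrim (from BiodiversityR) expects the right-hand side of the formula to be a factor, while Year here is numeric (<dbl>). A sketch of the fix, untested against the poster's data:

```r
library(BiodiversityR)  # provides CAPdiscrim

# untested sketch: convert the numeric grouping variable to a factor first
ISPA.env$Year <- factor(ISPA.env$Year)
Ordination.model1 <- CAPdiscrim(ISPA_data ~ Year, data = ISPA.env,
                                dist = "euclidean", axes = 4, m = 0,
                                add = FALSE, permutations = 999)
```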

Error standardising variables in R: 'only defined on a data frame with all numeric variables'

I am simply looking to standardise my set of data frame variables to a 100-point scale. The original variables were on a 10-point scale with 4 decimal places.
I can see that my error is not unheard of, e.g.
Why am I getting a function error in seemingly similar R code?
Error: only defined on a data frame with all numeric variables with ddply on large dataset
but I have verified that all variables are numeric (see the str output below). My code is:
library(foreign)
library(scales)
ches <- read.csv("chesshort15.csv", header = TRUE)
ches2 <- ches[1:244, 3:10]
rescale(ches2, to = c(0,100), from = range(ches2, na.rm = TRUE, finite = TRUE))
This gives the error: Error in FUN(X[[i]], ...) :
only defined on a data frame with all numeric variables
I have verified that all variables are of type numeric using str(ches2) - see below:
'data.frame': 244 obs. of 8 variables:
$ galtan : num 8.8 9 9.65 8.62 8 ...
$ civlib_laworder : num 8.5 8.6 9.56 8.79 8.56 ...
$ sociallifestyle : num 8.89 7.2 9.65 9.21 8.25 ...
$ immigrate_policy : num 9.89 9.6 9.38 9.43 9.13 ...
$ multiculturalism : num 9.9 9.6 9.57 8.77 9.07 ...
$ ethnic_minorities : num 8.8 9.6 9.87 9 8.93 ...
$ nationalism : num 9.4 10 9.82 9 8.81 ...
$ antielite_salience: num 8 9 9.47 8.88 8.38
In short, I'm stumped as to why it refuses to carry out the code.
For info, head(bb) gives:
galtan civlib_laworder sociallifestyle immigrate_policy multiculturalism ethnic_minorities
1 8.800 8.500 8.889 9.889 9.900 8.800
2 9.000 8.600 7.200 9.600 9.600 9.600
3 9.647 9.563 9.647 9.375 9.571 9.867
4 8.625 8.786 9.214 9.429 8.769 9.000
5 8.000 8.563 8.250 9.133 9.071 8.929
6 7.455 8.357 7.923 8.800 7.800 8.455
nationalism antielite_salience
1 9.400 8.000
2 10.000 9.000
3 9.824 9.471
4 9.000 8.882
5 8.813 8.375
6 8.000 8.824
The rescale function is throwing that error because it expects a numeric vector, and you are feeding it a data frame instead. You need to iterate: go through every column of your data frame and scale each one individually.
Try this:
sapply(ches2, rescale, to = c(0,100))
You don't need the range(ches2, na.rm = TRUE, finite = TRUE) portion of your code because rescale is smart enough to remove NA values on its own.
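Note that sapply() returns a matrix; wrapping it in as.data.frame() gets a data frame back. A small reproducible sketch with stand-in values in place of the real ches2:

```r
library(scales)

# two stand-in columns in place of the real ches2
ches2 <- data.frame(galtan = c(8.8, 9, 9.647),
                    nationalism = c(9.4, 10, 9.824))

# rescale each column to 0..100, then restore the data frame shape
scaled <- as.data.frame(sapply(ches2, rescale, to = c(0, 100)))
```

Each column of scaled now runs from 0 to 100.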

boxplots grouped as factor

I'm trying to plot boxplots of multiple variables (columns in a table), grouping the subjects by the levels in cl.med.
This is what I tried:
boxplot(scfa[,c("acetate","propionate")]~as.factor(cl.med),outline=FALSE)
this is my table:
aceticacid.methylester.1 acetate butyrate fumarate caprate propionate X3phenylpropionate valerate formate
01.BA.V 4.509 0.1430 0.0168 4e-04 0.0080 0.0174 0.0008 0.0030 5e-04
01.BA.VG 2.750 0.2736 0.0228 4e-04 0.0047 0.0261 0.0012 0.0014 4e-04
01.BO.VG 15.281 0.1667 0.0159 6e-04 0.0049 0.0191 0.0008 0.0011 4e-04
01.PR.O 0.317 0.2470 0.0327 4e-04 0.0078 0.0293 0.0006 0.0016 4e-04
01.TO.VG 0.210 0.1406 0.0186 4e-04 0.0034 0.0161 0.0006 0.0026 6e-04
and this is my class vector
01.BA.VG 01.BO.VG 01.PR.O 01.TO.VG 02.BA.VG
1 2 3 1 3 2
This produces 3 boxes (for the 3 classes as expected), but the two variables are merged. How could I modify it obtaining 3 boxes for each variable?
Thanks
Your code really shouldn't work at all, let alone produce 3 boxes. Usually you have to add the boxplots one plot at a time using the add=TRUE argument:
boxplot(scfa[,c("acetate")]~as.factor(cl.med), outline=FALSE, ylim=c(0, 0.3))
boxplot(scfa[,c("propionate")]~as.factor(cl.med), outline=FALSE, add=TRUE)
You either need to 'melt' or 'stack' your data so that the 2 response columns are stacked. Or split the data yourself. Here is an example of the second option:
tmp <- split(iris[,c('Sepal.Width','Petal.Width')], iris$Species)
tmp2 <- unlist(tmp, recursive=FALSE)
boxplot(tmp2)
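The 'melt/stack' route mentioned above can also be done in base R with a two-term formula, which draws one box per variable within each group; iris stands in for scfa and cl.med here:

```r
# stack the two response columns into long format, keeping the group
long <- data.frame(value = c(iris$Sepal.Width, iris$Petal.Width),
                   var = rep(c("Sepal.Width", "Petal.Width"),
                             each = nrow(iris)),
                   grp = rep(iris$Species, 2))

# one box per variable per group (2 variables x 3 groups = 6 boxes)
boxplot(value ~ var + grp, data = long, outline = FALSE, las = 2)
```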
