I am confused with the R implementation of lag in regression analysis

Look at this linear regression: Y ~ X + lag(X,1). The meaning is clear: it is a linear regression, and lag(X,1) means the first lag of X. What confuses me is the R implementation of the lag function. In R, lag(X, 1) moves X to the prior time. For example:
> library(zoo)
> x <- c(11, 12, 13, 14)
> str(zoo(x))
‘zoo’ series from 1 to 4
  Data: num [1:4] 11 12 13 14
  Index:  int [1:4] 1 2 3 4
> lag(zoo(x))
 1  2  3 
12 13 14 
When you regress, which value does R use exactly at time 2? I guess R uses the data like this:
time    1   2   3   4
Y      (anything)
X      11  12  13  14
lagX   12  13  14
But this is nonsense! We are supposed to use the first lag of X and the current X at time 2 (or any specific time), that is 11 and 12, not 13 and 12 as above. The first lag of X should be the prior X, shouldn't it? I am so confused! Please explain to me, thanks a lot.

The question starts out with:
look at this linear regression: Y ~ X + lag(X,1) ,the meaning is very clear
that it is trying to do a linear regression. and the lag(X,1) means the first
lag of X
Actually that is not the case. It does not refer to this model:
Y[i] = a + b * X[i] + c * X[i-1] + error[i]
It actually refers to this model:
Y[i] = a + b * X[i] + c * X[i+1] + error[i]
which is not likely what you intended.
It is likely that you wanted lag(X, -1) rather than lag(X, 1). Lagging a series in R means that the lagged series starts earlier, which implies that the series itself moves forward.
The other item to be careful of is that lm does not align series. It knows nothing about the time index. You will need to align the series yourself or use a package which does it for you.
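For illustration, here is a minimal sketch of aligning the series by hand with merge() from zoo before calling lm (the response y and the data values are made up; the dyn example further below does the same alignment automatically):
library(zoo)
x <- zoo(c(11, 12, 13, 14, 15, 16))
y <- zoo(c(1, 3, 2, 5, 4, 6))                # made-up response
m <- merge(y = y, x = x, lagx = lag(x, -1))  # lagx holds the prior value of x
m
##   y  x lagx
## 1 1 11   NA
## 2 3 12   11
## 3 2 13   12
## 4 5 14   13
## 5 4 15   14
## 6 6 16   15
lm(y ~ x + lagx, data = as.data.frame(m))    # lm drops the row with the NA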
More on these points below.
ts
First let us consider lag.ts from the core of R, since lag.zoo and lag.zooreg are based on it and consistent with it. lag.ts lags the times of the series so that the lagged series starts earlier. That is, if we have a series whose values are 11, 12, 13 and 14 at times 1, 2, 3 and 4 respectively, lag.ts lags each time so that the lagged series has the same values 11, 12, 13 and 14 but at times 0, 1, 2 and 3. The original series started at 1 but the lagged series starts at 0. Originally the value 12 was at time 2, but in the lagged series the value 13 is at time 2. In code, we have:
tt <- ts(11:14)
cbind(tt, lag(tt), lag(tt, 1), lag(tt, -1))
gives:
Time Series:
Start = 0
End = 5
Frequency = 1
  tt lag(tt) lag(tt, 1) lag(tt, -1)
0 NA      11         11          NA
1 11      12         12          NA
2 12      13         13          11
3 13      14         14          12
4 14      NA         NA          13
5 NA      NA         NA          14
zoo
lag.zoo is consistent with lag.ts. Note that since zoo represents irregularly spaced series it cannot assume that time 0 comes before time 1. We could only make such an assumption if we knew the series were regularly spaced. Thus if time 1 is the earliest time in a series, the value at this time is dropped, since there is no way to determine what earlier time to lag it to. The new lagged series now starts at the second time value of the original series. This is similar to the lag.ts example, except that in the lag.ts example there was a time 0 and in this example there is no such time. Similarly, we cannot extend the time scale forward in time either.
library(zoo)
z <- zoo(11:14)
merge(z, lag(z), lag(z, 1), lag(z,-1))
giving:
   z lag(z) lag(z, 1) lag(z, -1)
1 11     12        12         NA
2 12     13        13         11
3 13     14        14         12
4 14     NA        NA         13
zooreg
The zoo package also has a zooreg class, which assumes a regularly spaced series (except possibly for some missing values), so it can deduce what comes before, just as ts can. With zooreg it can deduce that time 0 comes before and time 5 comes after.
library(zoo)
zr <- zooreg(11:14)
merge(zr, lag(zr), lag(zr, 1), lag(zr,-1))
giving:
  zr lag(zr) lag(zr, 1) lag(zr, -1)
0 NA      11         11          NA
1 11      12         12          NA
2 12      13         13          11
3 13      14         14          12
4 14      NA         NA          13
5 NA      NA         NA          14
lm
lm does not know anything about zoo and will ignore the time index entirely. If you do not want to ignore it, i.e. you want to align the series involved prior to running the regression, use the dyn (or dynlm) package. Using the former:
library(dyn)
set.seed(123)
zr <- zooreg(rnorm(10))
y <- 1 + 2 * zr + 3 * lag(zr, -1)
dyn$lm(y ~ zr + lag(zr, -1))
giving:
Call:
lm(formula = dyn(y ~ zr + lag(zr, -1)))
Coefficients:
(Intercept)           zr  lag(zr, -1)
          1            2            3
Note 1: Be sure to read the documentation in the help files: ?lag.ts, ?lag.zoo, ?lag.zooreg and help(package = dyn).
Note 2: If the direction of the lag seems confusing, you could define your own function and use that in place of lag. For example, this gives the same coefficients as the lm output shown above:
Lag <- function(x, k = 1) lag(x, -k)
dyn$lm(y ~ zr + Lag(zr))
An additional word of warning: unlike lag.zoo and lag.zooreg, which are consistent with the core of R, lag.xts from the xts package is inconsistent. The lag in dplyr is also inconsistent (and to make things worse, if you load dplyr it will mask R's lag with its own inconsistent version). Also note that L in dynlm works the same as Lag but wisely uses a different name to avoid confusion.
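To see the inconsistency concretely, here is a small sketch (it assumes dplyr is installed and calls it with dplyr:: to avoid the masking issue):
x <- c(11, 12, 13, 14)

# dplyr's lag() works on plain vectors and pads with NA at the start,
# i.e. the value at position i becomes x[i - 1]:
dplyr::lag(x, 1)
## [1] NA 11 12 13

# stats::lag() works on ts objects and shifts the time index instead,
# so the values are unchanged but the series starts one time unit earlier:
stats::lag(ts(x), 1)
## Time Series:
## Start = 0
## End = 3
## Frequency = 1
## [1] 11 12 13 14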

Please consult the manual first:
Description
Compute a lagged version of a time series, shifting the time base back by a given number of observations.
Default S3 method:
lag(x, k = 1, ...)
Arguments
x A vector or matrix or univariate or multivariate time series
k The number of lags (in units of observations).
So, lag does not return a lagged value. It returns the entire lagged time series, shifted back by some k. This is not something a simple lm can work with, and indeed not what you want to use. This, however, does work for me:
library(zoo)
x <- zoo(c(11, 12, 13, 14))
y <- c(1, 2.3, 3.8, 4.2)
lagged <- lag(x, -1)
lagged <- c(lagged, c=0) # first lag is defined as zero
model <- lm(y ~ x + lagged)
summary(model)
Returns:
Call:
lm(formula = y ~ x + lagged)
Residuals:
         1          2          3          4 
-8.327e-17 -1.833e-01  3.667e-01 -1.833e-01 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.86333    4.20149  -2.110    0.282
x            0.89667    0.38456   2.332    0.258
lagged       0.05333    0.08199   0.650    0.633
Residual standard error: 0.4491 on 1 degrees of freedom
Multiple R-squared: 0.9687, Adjusted R-squared: 0.9062
F-statistic: 15.49 on 2 and 1 DF, p-value: 0.1769

Related

Spatial rolling functions (min, max, mean)

I'm currently working on a project where I need to calculate the rolling minimum over a spatial window of 30 meters (a square around the central point). In my data frame, for each point I have the X and Y coordinates and the variable Z for which I'm trying to get the rolling minimum.
So far I have accomplished this using for loops with conditionals and data.table filtering. This takes some time, especially when the data sets have over a million points. I would really appreciate it if you could give me some tips on how to improve the performance of this code.
d = 1
attach(data)
#### OPTION 1 - CONDITIONAL ####
op1 = NULL
for (i in 1:nrow(data)) {
  op1[i] <- min(
    ifelse(POINT_X >= POINT_X[i] - d,
      ifelse(POINT_X <= POINT_X[i] + d,
        ifelse(POINT_Y >= POINT_Y[i] - d,
          ifelse(POINT_Y <= POINT_Y[i] + d, Z, Z[i]), Z[i]), Z[i]), Z[i]),
    na.rm = T)
}
#### OPTION 2 - SUBSET ####
setDT(data)
local_min = function(i){
  x = POINT_X[i]
  y = POINT_Y[i]
  base = data[POINT_X %inrange% c(x - d, x + d) &
              POINT_Y %inrange% c(y - d, y + d)]
  local_min = min(base$Z, na.rm = T)
  return(local_min)
}
op2 = NULL
for (i in 1:nrow(data)) {
  op2[i] <- local_min(i)
}
I've tried other alternatives, but the most common types of rolling statistic functions in R are based on index windows rather than on the values of other variables. Here's some data for you to try the code above with, using d = 1. I would be really grateful if you could help me improve this process.
data = data.frame(POINT_X = rep(1:5, each = 5),
                  POINT_Y = rep(1:5, 5),
                  Z = 1:25)
The desired output should look like this:
> op1
[1] 1 1 2 3 4 1 1 2 3 4 6 6 7 8 9 11 11 12 13 14 16 16 17 18 19
I think it's important to note that currently option 1 is faster than option 2. Thanks in advance for your attention. :)
You could use a non-equi join:
d = 1
data[, `:=`(xmin = POINT_X - d,
            xmax = POINT_X + d,
            ymin = POINT_Y - d,
            ymax = POINT_Y + d)]
data[data, on = .(POINT_X >= xmin,
                  POINT_X <= xmax,
                  POINT_Y >= ymin,
                  POINT_Y <= ymax)][
  , .(rollmin = min(Z)), by = .(POINT_X, POINT_Y)][
  , rollmin]
#[1]  1  1  2  3  4  1  1  2  3  4  6  6  7  8  9 11 11 12 13 14 16 16 17 18 19
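For cross-checking the join on the small example data, a plain base-R version of the same windowed minimum can be useful (a sketch; local_min_base is just an illustrative name, and being quadratic in the number of points it is no substitute for the join on large data):
local_min_base <- function(df, d = 1) {
  vapply(seq_len(nrow(df)), function(i) {
    # points whose coordinates fall within +/- d of point i in both X and Y
    in_window <- abs(df$POINT_X - df$POINT_X[i]) <= d &
                 abs(df$POINT_Y - df$POINT_Y[i]) <= d
    min(df$Z[in_window], na.rm = TRUE)
  }, numeric(1))
}
# local_min_base(data) should reproduce op1 on the example data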

How to total rows with only certain columns

So I have a data set from HR training that asks tech and common questions.
The rows represent employees and the columns represent the score they got on each question. The columns also include demographic data. I only want the row total of the tech and common questions, not including the demographic data.
techs<-grep("^T",rownames(dat))
commons<-grep("^C",rownames(dat))
I used this to try to group the columns together but when I do:
total<-rowsum(commons,techs)
and try to put it in a linear regression:
Mod1Train<-lm(total~.,data=dat[Train,])
it says that there are different variable lengths.
I'm a super newbie to R, so sorry in advance if I'm really off.
In the future it would be ever so helpful if you provided a sample of your data. It's hard for us to help when we're guessing about that. Please see this link https://stackoverflow.com/help/minimal-reproducible-example.
Having said that (LOL) and realizing you're new, I'll take a guess...
Let's make pretend data that I imagine is a smaller imaginary version of yours...
set.seed(2020)
emplid <- 1:10
gender <- sample(c("Male", "Female"), size = 10, replace = TRUE)
Tech1 <- sample(10:20, size = 10, replace = TRUE)
Tech2 <- sample(10:20, size = 10, replace = TRUE)
Tech3 <- sample(10:20, size = 10, replace = TRUE)
Common1 <- sample(10:20, size = 10, replace = TRUE)
Common2 <- sample(10:20, size = 10, replace = TRUE)
Common3 <- sample(10:20, size = 10, replace = TRUE)
Kathryn <- data.frame(emplid, gender, Tech1, Tech2, Tech3, Common1, Common2, Common3)
Kathryn
#> emplid gender Tech1 Tech2 Tech3 Common1 Common2 Common3
#> 1 1 Female 10 17 15 18 17 15
#> 2 2 Female 17 13 11 20 11 13
#> 3 3 Male 17 11 19 18 10 12
#> 4 4 Female 19 16 15 14 15 16
#> 5 5 Female 11 13 20 20 16 13
#> 6 6 Male 15 11 17 19 17 13
#> 7 7 Male 11 13 11 15 14 11
#> 8 8 Female 12 14 10 11 17 19
#> 9 9 Female 11 13 15 18 11 10
#> 10 10 Female 17 20 12 12 14 15
If you're new, you may want to invest some time in learning the tidyverse, which could make this simple, as in Efficiently sum across multiple columns in R.
Per your note in the comments, you have a pattern we can match for summing the questions. You were close with your attempt at grep, but we want the values back, so we need value = TRUE, which we'll store and make use of.
techqs <- grep(x = names(Kathryn), pattern = "^Tech", value = TRUE)
commonqs <- grep(x = names(Kathryn), pattern = "^Common", value = TRUE)
Kathryn$TechScores <- rowSums(Kathryn[,techqs])
Kathryn$CommonScores <- rowSums(Kathryn[,commonqs])
### Commented out: how to do it manually.
# Kathryn$TechScores <- rowSums(Kathryn[, c("Tech1", "Tech2", "Tech3")])
# Kathryn$CommonScores <- rowSums(Kathryn[, c("Common1", "Common2", "Common3")])
Kathryn$TotalScore <- Kathryn$TechScores + Kathryn$CommonScores
Now to regress, which is where the statistical problem comes in. Are you really trying to predict the total score from its components? That's not hard in R, but it leads to silly answers.
Kathryn_model <- lm(formula = TotalScore ~ TechScores + CommonScores, data = Kathryn)
summary(Kathryn_model)
#> Warning in summary.lm(Kathryn_model): essentially perfect fit: summary may be
#> unreliable
#>
#> Call:
#> lm(formula = TotalScore ~ TechScores + CommonScores, data = Kathryn)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.165e-14 -1.905e-15 9.290e-16 8.590e-15 1.183e-14
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 8.089e-14 6.345e-14 1.275e+00 0.243
#> TechScores 1.000e+00 9.344e-16 1.070e+15 <2e-16 ***
#> CommonScores 1.000e+00 1.130e-15 8.853e+14 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.43e-14 on 7 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 9.875e+29 on 2 and 7 DF, p-value: < 2.2e-16
I don't understand your code and what you are searching for.
rowsum() does not make "a row total"; quite the contrary, it adds rows together (within groups) and returns a matrix, not a vector. Is that what you want?
Otherwise, maybe you're looking for rowSums(), which computes the total of every row of a matrix.
(by the way, if you need it, the matrix product is %*% in R)
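A tiny made-up example may make the difference concrete:
m <- matrix(1:6, nrow = 3)   # 3 rows, 2 columns

rowSums(m)                   # one total per row
## [1] 5 7 9

rowsum(m, group = c("a", "a", "b"))   # sums rows *within groups*
##   [,1] [,2]
## a    3    9
## b    3    6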
Are you sure you have understood lm?
In lm, there should be something like
lm(y ~ x, data = adataframe)
adataframe is the data frame (or matrix-like object) in which lm looks for both the response and the input variable, named y and x here. It is optional: if the column names are not found in data, or data is not supplied, they are looked up in the global environment. It is often better, however, to supply such a data frame, to avoid common errors.
So if you want to use lm, you should first obtain two vectors, one for x and one for y, put them in a data.frame with two columns (x and y), and call the code above, if I have understood correctly.
Note: if you want to remove the constant, then use
lm(y ~ x + 0, data = adataframe)
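For instance, a self-contained sketch of that workflow with made-up data:
set.seed(1)
adataframe <- data.frame(x = rnorm(20))
adataframe$y <- 2 * adataframe$x + rnorm(20)   # made-up response

lm(y ~ x, data = adataframe)       # with the constant (intercept)
lm(y ~ x + 0, data = adataframe)   # without the constant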

Trying to add breakpoint lines from strucchange to a plot by "lines" command [duplicate]

This is my first time with strucchange so bear with me. The problem I'm having seems to be that strucchange doesn't recognize my time series correctly but I can't figure out why and haven't found an answer on the boards that deals with this. Here's a reproducible example:
require(strucchange)
# time series
nmreprosuccess <- c(0,0.50,NA,0.,NA,0.5,NA,0.50,0.375,0.53,0.846,0.44,1.0,0.285,
0.75,1,0.4,0.916,1,0.769,0.357)
dat.ts <- ts(nmreprosuccess, frequency=1, start=c(1996,1))
str(dat.ts)
Time-Series [1:21] from 1996 to 2016: 0 0.5 NA 0 NA 0.5 NA 0.5 0.375 0.53 ...
To me this means that the time series looks OK to work with.
# obtain breakpoints
bp.NMSuccess <- breakpoints(dat.ts~1)
summary(bp.NMSuccess)
Which gives:
Optimal (m+1)-segment partition:
Call:
breakpoints.formula(formula = dat.ts ~ 1)
Breakpoints at observation number:
m = 1 6
m = 2 3 7
m = 3 3 14 16
m = 4 3 7 14 16
m = 5 3 7 10 14 16
m = 6 3 7 10 12 14 16
m = 7 3 5 7 10 12 14 16
Corresponding to breakdates:
m = 1 0.333333333333333
m = 2 0.166666666666667 0.388888888888889
m = 3 0.166666666666667
m = 4 0.166666666666667 0.388888888888889
m = 5 0.166666666666667 0.388888888888889 0.555555555555556
m = 6 0.166666666666667 0.388888888888889 0.555555555555556 0.666666666666667
m = 7 0.166666666666667 0.277777777777778 0.388888888888889 0.555555555555556 0.666666666666667
m = 1
m = 2
m = 3 0.777777777777778 0.888888888888889
m = 4 0.777777777777778 0.888888888888889
m = 5 0.777777777777778 0.888888888888889
m = 6 0.777777777777778 0.888888888888889
m = 7 0.777777777777778 0.888888888888889
Fit:
m 0 1 2 3 4 5 6 7
RSS 1.6986 1.1253 0.9733 0.8984 0.7984 0.7581 0.7248 0.7226
BIC 14.3728 12.7421 15.9099 20.2490 23.9062 28.7555 33.7276 39.4522
Here's where I start having the problem. Instead of reporting the actual breakdates, it reports numbers, which then makes it impossible to plot the break lines onto a graph because they're not at the breakdate (2002) but at 0.333.
plot.ts(dat.ts, main="Natural Mating")
lines(fitted(bp.NMSuccess, breaks = 1), col = 4, lwd = 1.5)
Nothing shows up for me in this graph (I think because it's so small for the scale of the graph).
In addition, when I try fixes that may possibly work around this problem,
fm1 <- lm(dat.ts ~ breakfactor(bp.NMSuccess, breaks = 1))
I get:
Error in model.frame.default(formula = dat.ts ~ breakfactor(bp.NMSuccess, :
variable lengths differ (found for 'breakfactor(bp.NMSuccess, breaks = 1)')
I get errors because of the NA values in the data, so the length of dat.ts is 21 while the length of breakfactor(bp.NMSuccess, breaks = 1) is 18 (missing the 3 NAs).
Any suggestions?
The problem occurs because breakpoints() currently can only (a) cope with NAs by omitting them, and (b) cope with times/dates through the ts class. This creates the conflict because when you omit internal NAs from a ts it loses its ts property and hence breakpoints() cannot infer the correct times.
The "obvious" way around this would be to use a time series class that can cope with this, namely zoo. However, I just never got round to fully integrating zoo support into breakpoints() because it would likely break some of the current behavior.
To cut a long story short: your best choice at the moment is to do the book-keeping about the times yourself and not expect breakpoints() to do it for you. The additional work is not so huge. First, we create a data frame with the response and the time vector and omit the NAs:
d <- na.omit(data.frame(success = nmreprosuccess, time = 1996:2016))
d
## success time
## 1 0.000 1996
## 2 0.500 1997
## 4 0.000 1999
## 6 0.500 2001
## 8 0.500 2003
## 9 0.375 2004
## 10 0.530 2005
## 11 0.846 2006
## 12 0.440 2007
## 13 1.000 2008
## 14 0.285 2009
## 15 0.750 2010
## 16 1.000 2011
## 17 0.400 2012
## 18 0.916 2013
## 19 1.000 2014
## 20 0.769 2015
## 21 0.357 2016
Then we can estimate the breakpoint(s) and afterwards transform from the "number" of observations back to the time scale. Note that I'm setting the minimal segment size h explicitly here because the default of 15% is probably somewhat small for this short series. 4 is still small, but possibly enough for estimating a constant mean.
bp <- breakpoints(success ~ 1, data = d, h = 4)
bp
## Optimal 2-segment partition:
##
## Call:
## breakpoints.formula(formula = success ~ 1, h = 4, data = d)
##
## Breakpoints at observation number:
## 6
##
## Corresponding to breakdates:
## 0.3333333
We ignore the break "date" at 1/3 of the observations but simply map back to the original time scale:
d$time[bp$breakpoints]
## [1] 2004
To re-estimate the model with nicely formatted factor levels, we could do:
lab <- c(
paste(d$time[c(1, bp$breakpoints)], collapse = "-"),
paste(d$time[c(bp$breakpoints + 1, nrow(d))], collapse = "-")
)
d$seg <- breakfactor(bp, labels = lab)
lm(success ~ 0 + seg, data = d)
## Call:
## lm(formula = success ~ 0 + seg, data = d)
##
## Coefficients:
## seg1996-2004 seg2005-2016
## 0.3125 0.6911
Or for visualization:
plot(success ~ time, data = d, type = "b")
lines(fitted(bp) ~ time, data = d, col = 4, lwd = 2)
abline(v = d$time[bp$breakpoints], lty = 2)
One final remark: For such short time series where just a simple shift in the mean is needed, one could also consider conditional inference (aka permutation tests) rather than the asymptotic inference employed in strucchange. The coin package provides the maxstat_test() function exactly for this purpose (= short series where a single shift in the mean is tested).
library("coin")
maxstat_test(success ~ time, data = d, dist = approximate(99999))
## Approximative Generalized Maximally Selected Statistics
##
## data: success by time
## maxT = 2.3953, p-value = 0.09382
## alternative hypothesis: two.sided
## sample estimates:
## "best" cutpoint: <= 2004
This finds the same breakpoint and provides a permutation test p-value. If however, one has more data and needs multiple breakpoints and/or further regression coefficients, then strucchange would be needed.

r join two lists and sum their values

I have two lists: x, y
> x
carlo monte simulation model quantum
31 31 9 6 6
> y
model system temperature quantum simulation problem
15 15 15 13 13 12
What function should I use to obtain:
simulation model quantum
22 21 19
I tried to merge them as in the example, but it gives me an error:
merge(x, y, by = intersect(names(x), names(y))) produces:
Error in fix.by(by.x, x) : 'by' must specify uniquely valid columns
There's no argument in that function that says what to do with the values. What would be the best function to use?
intersect(names(x), names(y)) will give the names of the resulting list, but how do I sum the values together?
You can use Map in base R to return a list.
Map("+", x[intersect(names(x),names(y))], y[intersect(names(x),names(y))])
$simulation
[1] 22
$model
[1] 21
$quantum
[1] 19
or mapply to return a named vector which may be more useful.
mapply("+", x[intersect(names(x),names(y))], y[intersect(names(x),names(y))])
simulation model quantum
22 21 19
Using [intersect(names(x), names(y))] will not only subset the contents of x and y to those with intersecting names, but will also properly sort the elements for the operation.
data
x <- list(carlo = 31, monte = 31, simulation = 9, model = 6, quantum = 6)
y <- list(model = 15, system = 15, temperature = 15, quantum = 13, simulation = 13, problem = 12)
Simple name matching does the trick:
# subset those from x which have names in y also
x1 = x[names(x)[names(x) %in% names(y)]]
# x1
# simulation model quantum
# 9 6 6
# similarly do it for y. note the order of names might be different from that in x
y1 = y[names(y)%in%names(x1)]
# y1
# model quantum simulation
# 15 13 13
# now order the names in both and then add.
x1[order(names(x1))]+y1[order(names(y1))]
# model quantum simulation
# 21 19 22
Base function merge() should do this with no issue so long as your fields make sense, but you need to include merge(..., all=TRUE), as in:
y <- data.frame(rbind(c(15,15,15,13,13,12)))
names(y) <- c("model","system","temperature","quantum","simulation","problem")
x <- data.frame(rbind(c(31,31,9,6,6)))
names(x) <- c("carlo","monte","simulation","model","quantum")
merge(x, y, by = c("simulation","model","quantum"), all = TRUE)
results in:
  simulation model quantum carlo monte system temperature problem
1          9     6       6    31    31     NA          NA      NA
2         13    15      13    NA    NA     15          15      12
Note that here you actually have one-row data frames, not lists.
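If x and y are really named numeric vectors (as the printout in the question suggests), another compact base-R option is tapply() on the combined vector; a sketch (note that tapply sorts the names alphabetically):
x <- c(carlo = 31, monte = 31, simulation = 9, model = 6, quantum = 6)
y <- c(model = 15, system = 15, temperature = 15, quantum = 13, simulation = 13, problem = 12)

v <- c(x, y)
common <- names(v) %in% intersect(names(x), names(y))
tapply(v[common], names(v)[common], sum)
##      model    quantum simulation 
##         21         19         22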
