Recursive regression in R (extract residuals) - r

For a different question, this procedure for a recursive regression of Y on X1 and X2, starting with, say, the first 20 observations and expanding the regression window by one observation at a time until it covers the full sample, was suggested:
X1 <- runif(50, 0, 1)
X2 <- runif(50, 0, 10)
Y <- runif(50, 0, 1)
df <- data.frame(X1,X2,Y)
rolling_lms <- lapply( seq(20,nrow(df) ), function(x) lm( Y ~ X1+X2, data = df[1:x , ]) )
This works fine, but is there a way to:
Get the residuals for the first 20 observations.
Add on the residuals one by one for each subsequent regression.
So that the 21st residual is the one from the regression including 21 observations, the 22nd residual is the one from the regression with 22 observations, and so on?

Here is a possible solution to your problem.
set.seed(1)
X1 <- runif(50, 0, 1)
X2 <- runif(50, 0, 10)
Y <- runif(50, 0, 1)
df <- data.frame(X1,X2,Y)
rolling_lms <- lapply(seq(20,nrow(df)), function(x) lm(Y ~ X1+X2, data = df[1:x , ]))
resk <- function(k) if(k==1) rolling_lms[[k]]$residuals else tail(rolling_lms[[k]]$residuals,1)
unlist(sapply(1:length(rolling_lms), resk))
############
1 2 3 4 5 6
0.051243613 -0.284725835 -0.209235819 0.677747763 0.085196300 -0.077111032
7 8 9 10 11 12
-0.185700617 0.016194254 0.422214060 -0.067994796 0.265315143 0.130531648
13 14 15 16 17 18
-0.083662353 -0.098826853 -0.298235953 -0.459746026 0.282954796 -0.281752756
19 20 21 22 23 24
-0.037180134 0.152774597 0.576060893 -0.121303797 0.001336554 -0.357956306
25 26 27 28 29 30
0.205847757 -0.111231524 -0.082662882 -0.291013740 -0.223480493 0.051223304
31 32 33 34 35 36
0.082970698 -0.393398739 -0.428164426 0.122919273 0.457861478 0.148282532
37 38 39 40 41 42
0.081855106 0.023024731 0.500627476 0.005097244 0.189354101 0.092481013
43 44 45 46 47 48
-0.245542247 -0.217881519 0.234771342 -0.023343600 -0.328489644 0.242163946
49 50
-0.358311100 0.373917319
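The same construction can also be written without storing every fitted model. This is only a sketch under the question's setup (the names start, first_fit and rec_resid are made up for illustration): it keeps all residuals from the first 20-observation fit and then only the last residual from each larger fit.

```r
set.seed(1)
df <- data.frame(X1 = runif(50, 0, 1), X2 = runif(50, 0, 10), Y = runif(50, 0, 1))

start <- 20  # size of the first estimation window (taken from the question)

# all residuals from the first window
first_fit <- lm(Y ~ X1 + X2, data = df[1:start, ])

# for each larger window, keep only the residual of the newest observation
later <- vapply((start + 1):nrow(df), function(n) {
  fit <- lm(Y ~ X1 + X2, data = df[1:n, ])
  unname(residuals(fit)[n])
}, numeric(1))

rec_resid <- c(unname(residuals(first_fit)), later)
length(rec_resid)  # one residual per observation
```

Using vapply with numeric(1) makes the loop fail loudly if a fit ever returns something other than a single number.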

Related

Using a loop to create a polynomial model gives R trouble understanding it?

I create a lot of polynomial models to compare them, so I used a loop like this:
library(ISLR)
library(boot)
data(Wage)
list = list()
for (i in 1:10){
list[[i]] = lm(wage ~ poly(age, i), data = Wage)
assign(paste("fit.aov", i, sep = ""), list[[i]])
}
agelims <- range(Wage$age)
age.grid <- seq(agelims[1], agelims[2])
If I run the following code
preds <- predict(fit.aov1, data.frame(age = age.grid), se=TRUE)
I receive the following error:
Error: variable 'poly(age, i)' was fitted with type "nmatrix.1" but type "nmatrix.10" was supplied
In addition: Warning message:
In Z/rep(sqrt(norm2[-1L]), each = length(x)) :
longer object length is not a multiple of shorter object length
However, if I create each model manually like this
fit1 = lm(wage ~ poly(age, 1), data = Wage)
Then the predict() function runs just fine.
Here we need to build the formula as a string (with sprintf or paste) so that the degree is written into it as a literal number:
lst1 <- vector('list', 10)
for (i in 1:10){
fmla <- sprintf("wage~ poly(age,%d)", i)
print(fmla)
lst1[[i]] = lm(as.formula(fmla), data = Wage)
lst1[[i]]$call <- parse(text =fmla )[[1]]
assign(paste("fit.aov", i, sep = ""), lst1[[i]])
}
Testing with predict:
predict(fit.aov1, data.frame(age = age.grid), se=TRUE)
#$fit
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14
# 94.43570 95.14298 95.85025 96.55753 97.26481 97.97208 98.67936 99.38663 100.09391 100.80119 101.50846 102.21574 102.92301 103.63029
# 15 16 17 18 19 20 21 22 23 24 25 26 27 28
#104.33757 105.04484 105.75212 106.45939 107.16667 107.87394 108.58122 109.28850 109.99577 110.70305 111.41032 112.11760 112.82488 113.53215
# 29 30 31 32 33 34 35 36 37 38 39 40 41 42
#114.23943 114.94670 115.65398 116.36126 117.06853 117.77581 118.48308 119.19036 119.89764 120.60491 121.31219 122.01946 122.72674 123.43402
# 43 44 45 46 47 48 49 50 51 52 53 54 55 56
#124.14129 124.84857 125.55584 126.26312 126.97039 127.67767 128.38495 129.09222 129.79950 130.50677 131.21405 131.92133 132.62860 133.33588
# 57 58 59 60 61 62 63
#134.04315 134.75043 135.45771 136.16498 136.87226 137.57953 138.28681
# ...
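The issue was that the formula stores poly(age, i) with the symbol i rather than the values 1, 2, ...; predict() then re-evaluates i, which is 10 once the loop has finished, so the rebuilt basis no longer matches the fitted one.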
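An alternative to pasting strings is to substitute the current value of i into the formula with bquote(). The following is a minimal sketch on simulated data (the data frame d and its coefficients are invented here so the example runs without the ISLR package); it is not the code from the answer above.

```r
set.seed(1)
# simulated stand-in for the Wage data (assumption, for illustration only)
d <- data.frame(age = runif(200, 18, 80))
d$wage <- 50 + 2 * d$age - 0.02 * d$age^2 + rnorm(200, sd = 5)

fits <- vector("list", 4)
for (i in 1:4) {
  # .(i) is replaced by the current value of i, so the formula
  # contains poly(age, 1), poly(age, 2), ... as literal degrees
  f <- eval(bquote(wage ~ poly(age, .(i))))
  fits[[i]] <- lm(f, data = d)
}

grid <- data.frame(age = seq(20, 80, by = 10))
p1 <- predict(fits[[1]], grid)  # works although i is now 4
```

Because the degree is stored as a number inside the formula, later changes to i no longer affect predict().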

How to randomly select rows from a data frame for which the row skewness is larger than a given value in R

I am trying to select random rows from a data frame with 1000 lines (and six columns) where the skewness of the line is larger than a given value (say Sk > 0.3).
I've generated the following data frame
df=data.frame(replicate(6,sample(10:100,1000,rep=TRUE)))
I can get row skewness from the fBasics package:
rowSkewness(df) gives:
[8] -0.2243295435 0.5306809351 0.0707122386 0.0341447417 0.3339384838 -0.3910593364 -0.6443905090
[15] 0.5603809206 0.4406091534 -0.3736108832 0.0397860038 0.9970040772 -0.7702547535 0.2065830354
But now I need to select, say, 10 rows of the df which have row skewness greater than, say, 0.1... Maybe with
for (a in 1:10) {
sample.data[a,] = sample(x=df[which(rowSkewness(df[sample(1:nrow(df),1)>0.1),], size = 1, replace = TRUE)
}
or something like this?
Any thoughts on this will be appreciated.
Thanks in advance.
You can use the sample_n() or sample_frac() function from dplyr, which makes your version a little shorter:
library(dplyr)
library(fBasics)
df=data.frame(replicate(6,sample(10:100,1000,rep=TRUE)))
x=df %>% dplyr::filter(rowSkewness(df)>0.1) %>% dplyr::sample_n(10)
Got it:
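x <- df %>% dplyr::filter(rowSkewness(df) > 0.1)
samplesize <- 10  # assumed from "say 10 rows" above
# draw row indices: sample() applied to the data frame itself would pick columns
sample.data <- x[sample(nrow(x), samplesize, replace = TRUE), ]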
Just do a subset:
res1 <- DF[fBasics::rowSkewness(DF) > .1, ]
head(res1)
# X1 X2 X3 X4 X5 X6
# 7 56 28 21 93 74 24
# 8 33 56 23 44 10 12
# 12 29 19 29 38 94 95
# 13 35 51 54 98 66 10
# 14 12 51 24 23 36 68
# 15 50 37 81 22 55 97
Or with e1071::skewness:
res2 <- DF[apply(as.matrix(DF), 1, e1071::skewness) > .1, ]
stopifnot(all.equal(res1, res2))
Data
set.seed(42); DF <- data.frame(replicate(6, sample(10:100, 1000, rep=TRUE)))
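To then draw the 10 random rows the question asks for, one option is to sample from the indices of the qualifying rows. The sketch below avoids package dependencies by using a hand-written moment skewness (row_skew is a hypothetical helper; its normalisation may differ slightly from fBasics::rowSkewness or e1071::skewness, so rows very close to the threshold could be classified differently).

```r
set.seed(42)
DF <- data.frame(replicate(6, sample(10:100, 1000, replace = TRUE)))

# hypothetical helper: third central moment divided by variance^(3/2)
row_skew <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

sk <- apply(as.matrix(DF), 1, row_skew)
eligible <- which(sk > 0.1)        # row numbers that pass the threshold
picked <- DF[sample(eligible, 10), ]  # 10 distinct qualifying rows
picked
```

Note that sample(eligible, 10) assumes there are at least 10 qualifying rows; with a single qualifying row, sample() would treat the number as an upper bound rather than as a value to pick.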

How to resample and remodel n times by vectorization?

Here's my for-loop version of resampling and refitting the model:
B <- 999
n <- nrow(butterfly)
estMat <- matrix(NA, B+1, 2)
estMat[B+1,] <- model$coef
for (i in 1:B) {
resample <- butterfly[sample(1:n, n, replace = TRUE),]
re.model <- lm(Hk ~ inv.alt, resample)
estMat[i,] <- re.model$coef
}
I tried to avoid the for loop:
B <- 999
n <- nrow(butterfly)
resample <- replicate(B, butterfly[sample(1:n, replace = TRUE),], simplify = FALSE)
re.model <- lapply(resample, lm, formula = Hk ~ inv.alt)
re.model.coef <- sapply(re.model,coef)
estMat <- cbind(re.model.coef, model$coef)
It worked but didn't improve efficiency. Is there an approach that vectorizes this?
Sorry, I'm not very familiar with StackOverflow. Here's the butterfly dataset:
colony alt precip max.temp min.temp Hk
pd+ss 0.5 58 97 16 98
sb 0.8 20 92 32 36
wsb 0.57 28 98 26 72
jrc+jrh 0.55 28 98 26 67
sj 0.38 15 99 28 82
cr 0.93 21 99 28 72
mi 0.48 24 101 27 65
uo+lo 0.63 10 101 27 1
dp 1.5 19 99 23 40
pz 1.75 22 101 27 39
mc 2 58 100 18 9
hh 4.2 36 95 13 19
if 2.5 34 102 16 42
af 2 21 105 20 37
sl 6.5 40 83 0 16
gh 7.85 42 84 5 4
ep 8.95 57 79 -7 1
gl 10.5 50 81 -12 4
(Assuming butterfly$inv.alt <- 1/butterfly$alt)
You get the error because resample is not a list of resampled data.frames, which you can obtain with:
resample <- replicate(B, butterfly[sample(1:n, replace = TRUE),], simplify = FALSE)
Then the following should work:
re.model <- lapply(resample, lm, formula = Hk ~ inv.alt)
To extract coefficients from a list of models, re.model$coef does not work. The correct paths to the coefficients are re.model[[1]]$coef, re.model[[2]]$coef, .... You can get all of them with the following code:
re.model.coef <- sapply(re.model, coef)
Then you can combine them with the observed coefficients:
estMat <- cbind(re.model.coef, model$coef)
In fact, you can put all of them into replicate:
re.model.coef <- replicate(B, {
bf.rs <- butterfly[sample(1:n, replace = TRUE),]
coef(lm(formula = Hk ~ inv.alt, data = bf.rs))
})
estMat <- cbind(re.model.coef, model$coef)
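On the efficiency point: replicate() and lapply() are still loops internally, so most of the time goes into lm() itself. One possible way to cut per-iteration overhead, sketched here and not benchmarked, is to build the design matrix once and call lm.fit() on resampled rows, which skips the formula and model-frame processing. The alt and Hk values are typed in from the table in the question; the seed and the names X, y and boot_coef are assumptions.

```r
# alt and Hk columns copied from the butterfly table in the question
butterfly <- data.frame(
  alt = c(0.5, 0.8, 0.57, 0.55, 0.38, 0.93, 0.48, 0.63, 1.5,
          1.75, 2, 4.2, 2.5, 2, 6.5, 7.85, 8.95, 10.5),
  Hk  = c(98, 36, 72, 67, 82, 72, 65, 1, 40, 39, 9, 19, 42, 37, 16, 4, 1, 4)
)
butterfly$inv.alt <- 1 / butterfly$alt

X <- cbind(1, butterfly$inv.alt)  # intercept column plus predictor
y <- butterfly$Hk
n <- nrow(X)
B <- 999

set.seed(1)  # arbitrary seed, only for reproducibility
boot_coef <- replicate(B, {
  idx <- sample.int(n, n, replace = TRUE)
  lm.fit(X[idx, , drop = FALSE], y[idx])$coefficients
})
dim(boot_coef)  # 2 rows (intercept, slope), one column per resample
```

The result has the same layout as re.model.coef above, so it can be combined with the observed coefficients in the same way.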

Ceil and floor values in R

I have a data.table of integers with values between 1 and 60.
My question is about flooring or ceiling any number to the following values: 12 18 24 30 36 ... 60.
For example, let's say my data.table contains the number 13. I want R to "transform" this number into 12 and 18 as 13 lies in between those numbers. Moreover, if I have 18 I want R to keep it at 18.
If my data.table contains the value 50, I want R to convert that number into 48 and 54 and so on.
My goal is to get two different data.tables. One where the floored values are saved and one where the ceiled values are saved.
Any idea how one could do this in R?
EDIT: Numbers smaller than 12 should always be transformed to 12.
Example output:
If I have the following data.table: data.table(c(1,28,29,41,53,53,17,41,41,53))
I want the following two output data.tables:
floored values: data.table(c(12,24,24,36,48,48,12,36,36,48))
ceiled values: data.table(c(12,30,30,42,54,54,18,42,42,54))
Here is a fairly direct way (edited to round up to 12 if any values are below):
df <- data.frame(nums = 10:20)
df$floors <- with(df,pmax(12,6*floor(nums/6)))
df$ceils <- with(df,pmax(12,6*ceiling(nums/6)))
Leading to:
> df
nums floors ceils
1 10 12 12
2 11 12 12
3 12 12 12
4 13 12 18
5 14 12 18
6 15 12 18
7 16 12 18
8 17 12 18
9 18 18 18
10 19 18 24
11 20 18 24
Here's a way we could do this, using sapply and the which.min functions. From your question, it's not immediately clear how values < 12 should be handled.
x <- 1:60
num_list <- seq(12, 60, 6)
floorr <- sapply(x, function(x){
diff_vec <- x - num_list
diff_vec <- ifelse(diff_vec < 0, Inf, diff_vec)
num_list[which.min(diff_vec)]
})
ceill <- sapply(x, function(x){
diff_vec <- num_list - x
diff_vec <- ifelse(diff_vec < 0, Inf, diff_vec)
num_list[which.min(diff_vec)]
})
tail(cbind(x, floorr, ceill))
x floorr ceill
[55,] 55 54 60
[56,] 56 54 60
[57,] 57 54 60
[58,] 58 54 60
[59,] 59 54 60
[60,] 60 60 60
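As a quick check, the pmax() formula from the first answer can be applied to the example vector given in the question. This sketch uses plain vectors (the names floored and ceiled are made up); the results could be wrapped in data.table() if separate tables are needed.

```r
x <- c(1, 28, 29, 41, 53, 53, 17, 41, 41, 53)

# round down / up to the nearest multiple of 6, never going below 12
floored <- pmax(12, 6 * floor(x / 6))
ceiled  <- pmax(12, 6 * ceiling(x / 6))

floored  # 12 24 24 36 48 48 12 36 36 48
ceiled   # 12 30 30 42 54 54 18 42 42 54
```

Both results match the expected output listed in the question.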

Ordering clustered points using Kmeans and R

I have a set of data (5000 points with 4 dimensions) that I have clustered using kmeans in R.
I want to order the points in each cluster by their distance to the center of that cluster.
Very simply, the data looks like this (I am using a subset to test out various approaches):
id Ans Acc Que Kudos
1 100 100 100 100
2 85 83 80 75
3 69 65 30 29
4 41 45 30 22
5 10 12 18 16
6 10 13 10 9
7 10 16 16 19
8 65 68 100 100
9 36 30 35 29
10 36 30 26 22
Firstly, I used the following method to cluster the dataset into 2 clusters:
(result <- kmeans(data, 2))
This returns a kmeans object that has the following components:
cluster, centers, etc.
But I cannot figure out how to compare each point and produce an ordered list.
Secondly, I tried the seriation approach as suggested by another SO user here
I use these commands:
clus <- kmeans(scale(x, scale = FALSE), centers = 3, iter.max = 50, nstart = 10)
mns <- sapply(split(x, clus$cluster), function(x) mean(unlist(x)))
result <- dat[order(order(mns)[clus$cluster]), ]
Which seems to produce an ordered list but if I bind it to the labeled clusters (using the following cbind command):
result <- cbind(x[order(order(mns)[clus$cluster]), ],clus$cluster)
I get the following result, which does not appear to be ordered correctly:
id Ans Acc Que Kudos clus
1 3 69 65 30 29 1
2 4 41 45 30 22 1
3 5 10 12 18 16 2
4 6 10 13 10 9 2
5 7 10 16 16 19 2
6 9 36 30 35 29 2
7 10 36 30 26 22 2
8 1 100 100 100 100 1
9 2 85 83 80 75 2
10 8 65 68 100 100 2
I don't want to be writing commands willy-nilly; I want to understand how the approach works. If anyone could help out or shed some light on this, it would be really great.
EDIT:
As the clusters can be easily plotted, I'd imagine there is a more straightforward way to get and rank the distances between points and the center.
The centers for the above clusters (when using k = 2) are as follows. But I do not know how to get and compare this with each individual point.
Ans Accep Que Kudos
1 83.33333 83.66667 93.33333 91.66667
2 30.28571 30.14286 23.57143 20.85714
NB:
I don't need to use kmeans, but I want to specify the number of clusters and retrieve an ordered list of points from those clusters.
Here is an example that does what you ask, using the first example from ?kmeans. It is probably not terribly efficient, but is something to build upon.
#Taken straight from ?kmeans
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
cl <- kmeans(x, 2)
x <- cbind(x,cl = cl$cluster)
#Function to apply to each cluster to
# do the ordering
orderCluster <- function(i,data,centers){
#Extract cluster and center
dt <- data[data[,3] == i,]
ct <- centers[i,]
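#Calculate squared distances (sweep subtracts the center from every row;
# a plain dt[,1:2] - ct would recycle ct down the columns instead)
dt <- cbind(dt,dist = rowSums(sweep(dt[,1:2,drop = FALSE],2,ct)^2))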
#Sort
dt[order(dt[,4]),]
}
do.call(rbind,lapply(sort(unique(cl$cluster)),orderCluster,data = x,centers = cl$centers))
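The same idea can be applied to the 10-row sample from the question. This is a sketch, not the answer's code: the values are typed in from the question's table, the seed and nstart are arbitrary choices, and the names feats, km and ordered are made up. Each point's own center is looked up with km$centers[km$cluster, ], so the subtraction is row by row.

```r
dat <- data.frame(
  id    = 1:10,
  Ans   = c(100, 85, 69, 41, 10, 10, 10, 65, 36, 36),
  Acc   = c(100, 83, 65, 45, 12, 13, 16, 68, 30, 30),
  Que   = c(100, 80, 30, 30, 18, 10, 16, 100, 35, 26),
  Kudos = c(100, 75, 29, 22, 16, 9, 19, 100, 29, 22)
)

feats <- as.matrix(dat[, -1])  # cluster on the four measures, not on id

set.seed(1)  # arbitrary seed; kmeans starts from random centers
km <- kmeans(feats, centers = 2, nstart = 10)

dat$cluster <- km$cluster
# Euclidean distance from each point to the center of its own cluster
dat$dist <- sqrt(rowSums((feats - km$centers[km$cluster, ])^2))

# sort by cluster, then by distance to that cluster's center
ordered <- dat[order(dat$cluster, dat$dist), ]
ordered
```

Cluster labels 1 and 2 are assigned arbitrarily by kmeans, so which group is called cluster 1 can change with the seed; the ordering within each cluster does not depend on the label.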
