using predict with a list of lm() objects - r

I have data which I regularly run regressions on. Each "chunk" of the data gets its own regression. Each state, for example, might have a different function that explains the dependent value. This seems like a typical "split-apply-combine" type of problem, so I'm using the plyr package. I can easily create a list of lm() objects, which works well. However, I can't quite wrap my head around how to use those objects later to predict values in a separate data.frame.
Here's a totally contrived example illustrating what I'm trying to do:
# setting up some fake data
set.seed(1)
funct <- function(myState, myYear){
rnorm(1, 100, 500) + myState + (100 * myYear)
}
state <- 50:60
year <- 10:40
myData <- expand.grid( year, state)
names(myData) <- c("year","state")
myData$value <- apply(myData, 1, function(x) funct(x[2], x[1]))
## ok, done with the fake data generation.
require(plyr)
modelList <- dlply(myData, "state", function(x) lm(value ~ year, data=x))
## if you want to see the summaries of the lm() do this:
# lapply(modelList, summary)
state <- 50:60
year <- 50:60
newData <- expand.grid( year, state)
names(newData) <- c("year","state")
## now how do I predict the values for newData$value
# using the regressions in modelList?
So how do I use the lm() objects contained in modelList to predict values using the year and state independent values from newData?

Here's my attempt:
predNaughty <- ddply(newData, "state", transform,
value=predict(modelList[[paste(piece$state[1])]], newdata=piece))
head(predNaughty)
# year state value
# 1 50 50 5176.326
# 2 51 50 5274.907
# 3 52 50 5373.487
# 4 53 50 5472.068
# 5 54 50 5570.649
# 6 55 50 5669.229
predDiggsApproved <- ddply(newData, "state", function(x)
transform(x, value=predict(modelList[[paste(x$state[1])]], newdata=x)))
head(predDiggsApproved)
# year state value
# 1 50 50 5176.326
# 2 51 50 5274.907
# 3 52 50 5373.487
# 4 53 50 5472.068
# 5 54 50 5570.649
# 6 55 50 5669.229
JD Long edit
I was inspired enough to work out an adply() option:
pred3 <- adply(newData, 1, function(x)
predict(modelList[[paste(x$state)]], newdata=x))
head(pred3)
# year state 1
# 1 50 50 5176.326
# 2 51 50 5274.907
# 3 52 50 5373.487
# 4 53 50 5472.068
# 5 54 50 5570.649
# 6 55 50 5669.229

You need to use mdply to supply both the model and the data to each function call:
dataList <- dlply(newData, "state")
preds <- mdply(cbind(mod = modelList, df = dataList), function(mod, df) {
mutate(df, pred = predict(mod, newdata = df))
})

A solution with just base R. The format of the output is different, but all the values are right there.
models <- lapply(split(myData, myData$state), 'lm', formula = value ~ year)
pred4 <- mapply('predict', models, split(newData, newData$state))
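If a long data frame is preferred over the matrix that mapply returns, the same pairing of models and data pieces can be sketched with Map. This is only a minimal base R sketch: the simulated data is a simplified stand-in for the question's myData (so the numbers will differ from the outputs shown elsewhere), and the names pieces, predicted and pred5 are introduced here purely for illustration.

```r
# Simplified stand-in for the question's fake data (assumption: not the same draws)
set.seed(1)
myData <- expand.grid(year = 10:40, state = 50:60)
myData$value <- rnorm(nrow(myData), 100, 500) + myData$state + 100 * myData$year
newData <- expand.grid(year = 50:60, state = 50:60)

# One lm() per state, stored in a list named by state
models <- lapply(split(myData, myData$state),
                 function(d) lm(value ~ year, data = d))

# Split the new data the same way and pair each piece with its model by name
pieces <- split(newData, newData$state)
predicted <- Map(function(mod, d) {
  d$value <- predict(mod, newdata = d)  # add predictions as a column
  d
}, models[names(pieces)], pieces)

# Stack the per-state pieces back into one data frame
pred5 <- do.call(rbind, predicted)
rownames(pred5) <- NULL
head(pred5)
```

Indexing models by names(pieces) makes the pairing explicit, so it does not rely on both lists happening to be in the same order.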

What is wrong with
lapply(modelList, predict, newData)
?
EDIT:
Thanks for explaining what is wrong with that. How about:
newData <- data.frame(year)
ldply(modelList, function(model) {
data.frame(newData, predict=predict(model, newData))
})
Iterate over the models, and apply the new data (which is the same for each state since you just did an expand.grid to create it).
EDIT 2:
If newData does not have the same values for year for every state as in the example, a more general approach can be used. Note that this uses the original definition of newData, not the one in the first edit.
ldply(state, function(s) {
nd <- newData[newData$state==s,]
data.frame(nd, predict=predict(modelList[[as.character(s)]], nd))
})
First 15 lines of this output:
year state predict
1 50 50 5176.326
2 51 50 5274.907
3 52 50 5373.487
4 53 50 5472.068
5 54 50 5570.649
6 55 50 5669.229
7 56 50 5767.810
8 57 50 5866.390
9 58 50 5964.971
10 59 50 6063.551
11 60 50 6162.132
12 50 51 5514.825
13 51 51 5626.160
14 52 51 5737.496
15 53 51 5848.832
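Returning to the first suggestion in this answer, lapply(modelList, predict, newData): the issue is that each model only looks at the year column, so every state's model predicts for every row of newData, including rows that belong to other states. A small sketch that makes this visible, using a simplified stand-in for the question's data (an assumption; the numbers will not match the outputs above) and a base R list in place of the dlply result:

```r
# Simplified stand-in for the question's fake data (assumption)
set.seed(1)
myData <- expand.grid(year = 10:40, state = 50:60)
myData$value <- rnorm(nrow(myData), 100, 500) + myData$state + 100 * myData$year
newData <- expand.grid(year = 50:60, state = 50:60)

# List of per-state models, named "50", "51", ...
modelList <- lapply(split(myData, myData$state),
                    function(d) lm(value ~ year, data = d))

# Every model is applied to all rows of newData
allPreds <- lapply(modelList, predict, newData)
lengths(allPreds)  # each element has nrow(newData) predictions

# Only the rows whose state matches the model are the ones wanted
ownRows <- newData$state == 50
matched <- allPreds[["50"]][ownRows]
length(matched)    # one prediction per year for state 50
```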

I take it the hard part is matching each state in newData to the corresponding model.
Something like this perhaps?
predList <- dlply(newData, "state", function(x) {
predict(modelList[[as.character(min(x$state))]], x)
})
Here I used a "hacky" way of extracting the corresponding state model: as.character(min(x$state))
...There is probably a better way?
Output:
> predList[1:2]
$`50`
1 2 3 4 5 6 7 8 9 10 11
5176.326 5274.907 5373.487 5472.068 5570.649 5669.229 5767.810 5866.390 5964.971 6063.551 6162.132
$`51`
12 13 14 15 16 17 18 19 20 21 22
5514.825 5626.160 5737.496 5848.832 5960.167 6071.503 6182.838 6294.174 6405.510 6516.845 6628.181
Or, if you want a data.frame as output:
predData <- ddply(newData, "state", function(x) {
y <-predict(modelList[[as.character(min(x$state))]], x)
data.frame(id=names(y), value=c(y))
})
Output:
head(predData)
state id value
1 50 1 5176.326
2 50 2 5274.907
3 50 3 5373.487
4 50 4 5472.068
5 50 5 5570.649
6 50 6 5669.229

Maybe I'm missing something, but I believe lmList is the ideal tool here,
library(nlme)
ll = lmList(value ~ year | state, data=myData)
predict(ll, newData)
## Or, to show that it produces the same results as the other proposed methods...
newData[["value"]] <- predict(ll, newData)
head(newData)
# year state value
# 1 50 50 5176.326
# 2 51 50 5274.907
# 3 52 50 5373.487
# 4 53 50 5472.068
# 5 54 50 5570.649
# 6 55 50 5669.229

Related

How do I calculate CV of triplicates in R?

I have 1000+ rows and I want to calculate the CV for each group of rows that share the same condition.
The data look like this:
Condition Y
0.5 25
0.5 26
0.5 27
1 43
1 45
1 75
5 210
5 124
5 20
10 54
10 78
10 10
and then I did:
CV <- function(x){
(sd(x)/mean(x))*100
}
CV_by_condition <- aggregate(Y ~ Condition,
data = df,
FUN = CV)
I have the feeling that what I did uses the mean of the whole column, because the results don't look right.
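One way to check whether aggregate really works per condition is to compare it against an independent per-group calculation and against the whole-column CV. The sketch below rebuilds the sample data from the question; the names cv_by_condition, cv_check and cv_whole are made up for this illustration.

```r
# The twelve rows shown in the question
df <- data.frame(
  Condition = rep(c(0.5, 1, 5, 10), each = 3),
  Y = c(25, 26, 27, 43, 45, 75, 210, 124, 20, 54, 78, 10)
)

# Coefficient of variation in percent
CV <- function(x) sd(x) / mean(x) * 100

# aggregate() calls CV once per Condition, on that group's Y values only
cv_by_condition <- aggregate(Y ~ Condition, data = df, FUN = CV)

# Independent check with tapply(), plus the whole-column CV for contrast
cv_check <- tapply(df$Y, df$Condition, CV)
cv_whole <- CV(df$Y)

cv_by_condition
cv_check
cv_whole
```

For Condition 0.5 the values 25, 26, 27 have mean 26 and standard deviation 1, so the CV is 100/26, about 3.85. The formula interface computes per group, so if the real results look off, the column name (Y versus y) or the contents of df are more likely culprits.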

rpart: How to get the "where" vector for validation dataset?

When fitting with rpart, it returns the "where" vector, which tells which leaf of the tree each record in the training dataset falls into. Is there a function which returns something similar to this "where" vector for a test dataset?
I think the partykit package does what you want
library('rpart')
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit
rpart.plot::rpart.plot(fit)
Check with same data
set.seed(1)
idx <- sample(nrow(kyphosis), 5L)
fit$where[idx]
# 22 30 46 71 16
# 9 3 7 7 3
library('partykit')
fit <- as.party(fit)
predict(fit, kyphosis[idx, ], type = 'node')
# 22 30 46 71 16
# 9 3 7 7 3
Check with new data
dd <- kyphosis[idx, ]
set.seed(1)
dd[] <- lapply(dd, sample)
predict(fit, dd, type = 'node')
# 22 30 46 71 16
# 5 3 7 9 3
## so #46 should meet criteria for the 7th leaf:
with(kyphosis[46, ],
Start >= 8.5 & # node 1
Start < 14.5 & # node 2
Age >= 55 & # node 4
Age >= 111 # node 6
)
# [1] TRUE
As you mention, the function predict.rpart in the rpart package
doesn't have a where option (to show the leaf node number associated
with a prediction).
However, the rpart.predict function in the rpart.plot package
will do this. For example
> library(rpart.plot)
> fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
> rpart.predict(fit, newdata=kyphosis[1:3,], nn=TRUE)
gives (note the node number nn column):
absent present nn
1 0.42105 0.57895 3
2 0.85714 0.14286 22
3 0.42105 0.57895 3
And
> rpart.predict(fit, newdata=kyphosis[1:3,], nn=TRUE)$nn
gives just the where node numbers:
[1] 3 22 3
To show the rule for each prediction use
> rpart.predict(fit, newdata=kyphosis[1:5,], rules=TRUE)
which gives
absent present
1 0.42105 0.57895 because Start < 9
2 0.85714 0.14286 because Start is 9 to 15 & Age >= 111
3 0.42105 0.57895 because Start < 9
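The two answers report different kinds of numbers: fit$where holds row positions in fit$frame, while the nn column appears to use rpart's own node numbers, which are the row names of fit$frame. A short sketch, for the training data only, of translating one into the other (node_nr is a name introduced here for illustration):

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# fit$where: for each training row, the row index in fit$frame of its leaf
head(fit$where)

# The row names of fit$frame are rpart's node numbers,
# so indexing them by fit$where gives the leaf's node number
node_nr <- as.integer(rownames(fit$frame))[fit$where]
names(node_nr) <- names(fit$where)
head(node_nr)

# Sanity check: every referenced frame row is a leaf
all(fit$frame$var[fit$where] == "<leaf>")
```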

How to resample data by clusters (block sampling) with replacement in R using Sampling package

This is my dummy data:
income <- sample(1000:10000, 1000, replace=TRUE)
individuals <- sample(1:50, 1000, replace=TRUE)
datatest <- data.frame(income, individuals)
I know I can sample by individual rows with this code:
sample <- datatest[sample(nrow(datatest), replace=TRUE),]
Now, I want to extract random samples with replacement and equal probabilities of the dataset but sampling complete blocks of observations with the same individual code.
Note that there are 50 individuals, but 1000 observations. Some observations belong to the same individual, so I want to sample by individuals (clusters, in this case), not observations. I don't mind if the extracted samples differ slightly in the number of observations. How can I do that?
I have tried:
library(sampling)
samplecluster <- cluster (datatest, clustername=c("individuals"), size=50,
method="srswr")
But the outcome is not the sampled data. Am I missing something?
Well, it seems I was indeed missing something. After the cluster command you need to apply the getdata command (both from the sampling package). This way I do get the sample I wanted, plus some additional columns.
samplecluster <- cluster(datatest, clustername=c("individuals"), size=50, method="srswr")
Gives you:
head(samplecluster)
individuals ID_unit Replicates Prob
1 1 259 2 0.63583
2 1 178 2 0.63583
3 1 110 2 0.63583
4 1 153 2 0.63583
5 1 941 2 0.63583
6 1 667 2 0.63583
Then using getdata, I also get the original data on income sampled by whole clusters:
datasample <- getdata (datatest, samplecluster)
head(datasample)
income individuals ID_unit Replicates Prob
1 8567 1 259 2 0.63583
2 2701 1 178 2 0.63583
3 4998 1 110 2 0.63583
4 3556 1 153 2 0.63583
5 2893 1 941 2 0.63583
6 7581 1 667 2 0.63583
I am not sure if I am missing something. If you just want some of your individuals, you can create a smaller sample of them:
ind.sample <- sample(1:50, size = 10)
print(ind.sample)
# [1] 17 43 38 39 28 23 35 47 9 13
my.sample <- datatest[datatest$individuals %in% ind.sample, ]
head(my.sample)
# income individuals
#21 9072 17
#97 5928 35
#122 9130 43
#252 4388 43
#285 8083 28
#287 1065 35
I guess a more generic approach would be to generate random indexes;
ind.unique <- unique(individuals)
ind.sample.index <- sample(1:length(ind.unique), size = 10)
ind.sample <- ind.unique[ind.sample.index]
print(ind.sample[order(ind.sample)])
my.sample <- datatest[datatest$individuals %in% ind.sample, ]
ind.counts <- aggregate(income ~ individuals, my.sample, FUN = length)
print(ind.counts)
I think it's important to note that the dataset still needs to be expanded to include all the replicates.
sw<-data.frame(datasample[rep(seq_len(dim(datasample)[1]), datasample$Replicates),, drop = FALSE], row.names=NULL)
Might be helpful to someone
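For comparison, sampling whole clusters with replacement can also be sketched in base R without the sampling package. This is an alternative to the approaches above, not what cluster/getdata do internally; the seed and the names ids, drawn, blocks and boot are assumptions made for the illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Dummy data in the same shape as the question's
datatest <- data.frame(
  income = sample(1000:10000, 1000, replace = TRUE),
  individuals = sample(1:50, 1000, replace = TRUE)
)

# Draw individual codes with replacement (equal probabilities)
ids <- unique(datatest$individuals)
drawn <- sample(ids, size = length(ids), replace = TRUE)

# Collect the full block of rows for every draw;
# an individual drawn twice contributes its block twice
blocks <- split(datatest, datatest$individuals)
boot <- do.call(rbind, lapply(seq_along(drawn), function(i) {
  b <- blocks[[as.character(drawn[i])]]
  b$draw <- i  # which draw this block came from
  b
}))
rownames(boot) <- NULL

nrow(boot)  # varies between resamples because block sizes differ
```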

Calculate mean of each n-rows in a dataframe in r when the first row is varying

First make some example data:
df = data.frame(matrix(rnorm(200), nrow=100))
df1=data.frame(t(c(25,34)))
The starting row is different in each column. For example, in X1 I would like to start from the 25th row, while in X2 I would like to start from row 34. Then I want to calculate the mean of each block of 5 values over the next 50 rows, for all the columns in df.
I am new to R, so this is probably very obvious. Can anyone suggest how I can do this?
You could try Map.
lst <- Map(function(x, y) {
  x1 <- x[y:length(x)]
  tapply(x1, as.numeric(gl(length(x1), 5, length(x1))), FUN = mean)
}, df, df1)
lst
# $X1
# 1 2 3 4 5 6
#-0.16500158 0.11339623 -0.86961872 -0.54985564 0.19958461 0.35234983
# 7 8 9 10 11 12
#0.32792769 0.65989801 -0.30409184 -0.53264725 -0.45792792 -0.59139844
# 13 14 15 16
# 0.03934133 -0.38068187 0.10100007 1.21017392
#$X2
# 1 2 3 4 5 6
# 0.24525622 0.07367300 0.18733973 -0.43784202 -0.45756095 -0.45740178
# 7 8 9 10 11 12
#-0.54086152 0.10439072 0.65660937 0.70623380 -0.51640088 0.46506135
# 13 14
#-0.09428336 -0.86295101
Because of the length difference, it might be better to keep it as a list. But, if you need it in a matrix/data.frame, you can make the lengths equal by padding with NAs.
do.call(cbind,lapply(lst, `length<-`,(max(sapply(lst, length)))))
Update
If you need only 50 rows, then change y:length(x) to y:(y+49) in the Map code.
data
set.seed(24)
df <- data.frame(matrix(rnorm(200), nrow=100))
df1 <- data.frame(t(c(25,34)))
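When exactly 50 rows are taken from each start row, every column yields ten means, so the result fits in a matrix without padding. A minimal sketch of that variant, using a plain named vector start in place of df1 (an assumption made for readability) and reshaping each 50-value window into a 5-row matrix:

```r
set.seed(24)
df <- data.frame(matrix(rnorm(200), nrow = 100))
start <- c(X1 = 25, X2 = 34)  # first row to use in each column

# For each column: take 50 values from its start row, lay them out as
# ten columns of 5 consecutive values, and average each column
block_means <- mapply(function(x, s) {
  colMeans(matrix(x[s:(s + 49)], nrow = 5))
}, df, start)

dim(block_means)  # 10 block means for each of the 2 columns
block_means
```

This relies on matrix() filling column by column, so each column of the temporary matrix holds 5 consecutive rows of the original data.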
The question is not entirely clear, especially the second line of your code, but I think this might be close to what you want to do:
every_fifth_row <- df[seq(1, nrow(df), 5), ]
every_fifth_row
# X1 X2
# 1 -0.09490455 -0.28417104
# 6 -0.14949662 0.12857284
# 11 0.15297366 -0.84428186
# 16 -1.03397309 0.04775516
# 21 -1.95735213 -1.03750794
# 26 1.61135194 1.10189370
# 31 0.12447365 1.80792719
# 36 -0.92344017 0.66639710
# 41 -0.88764143 0.10858376
# 46 0.27761464 0.98382526
# 51 -0.14503359 -0.66868956
# 56 -1.70208187 0.05993688
# 61 0.33828525 1.00208639
# 66 -0.41427863 1.07969341
# 71 0.35027994 -1.46920059
# 76 1.38943839 0.01844205
# 81 -0.81560917 -0.32133221
# 86 1.38188423 -0.77755471
# 91 1.53247872 -0.98660308
# 96 0.45721909 -0.22855622
rowMeans(every_fifth_row)
colMeans(every_fifth_row)
# Alternative
# apply(every_fifth_row, 1, mean) # Row-wise mean
# apply(every_fifth_row, 2, mean) # Column-wise mean

Apply LR models to another dataframe

I searched SO, but I could not seem to find the right code that is applicable to my question. It is similar to this question: Linear Regression calculation several times in one dataframe
I got a dataframe of LR coefficients following Andrie's code:
Cddply <- ddply(test, .(sumtest), function(test)coef(lm(Area~Conc, data=test)))
sumtest (Intercept) Conc
1 -108589.2726 846.0713372
2 -49653.18701 811.3982918
3 -102598.6252 832.6419926
4 -72607.4017 727.0765558
5 54224.28878 391.256075
6 -42357.45407 357.0845661
7 -34171.92228 367.3962888
8 -9332.569856 289.8631555
9 -7376.448899 335.7047756
10 -37704.92277 359.1457617
My question is how to apply each of these LR models (1-10) to specific row intervals in another dataframe in order to get x, the independent variable, into a third column. For example, I would like to apply sumtest1 to Samples 6:29, sumtest2 to Samples 35:50, sumtest3 to Samples 56:79, and so on, in intervals of 24 and 16 samples. The sample numbers repeat after 200, so sumtest9 will be for Samples 6:29 again.
Sample Area
6 236211
7 724919
8 1259814
9 1574722
10 268836
11 863818
12 1261768
13 1591845
14 220322
15 608396
16 980182
17 1415859
18 276276
19 724532
20 1130024
21 1147840
22 252051
23 544870
24 832512
25 899457
26 285093
27 4291007
28 825922
29 865491
35 246707
36 538092
37 767269
38 852410
39 269152
40 971471
41 1573989
42 1897208
43 261321
44 481486
45 598617
46 769240
47 229695
48 782691
49 1380597
50 1725419
The resulting dataframe would look like this:
Sample Area Calc
6 236211 407.5312917
7 724919 985.1525288
8 1259814 1617.363812
9 1574722 1989.564693
10 268836 446.0919309
...
35 246707 365.2452551
36 538092 724.3591324
37 767269 1006.805521
38 852410 1111.736505
39 269152 392.9073207
Thank you for your assistance.
Is this what you want? I made up a slightly larger dummy data set of 'area' to make it easier to see how the code worked when I tried it out.
# create 400 rows of area data
set.seed(123)
df <- data.frame(area = round(rnorm(400, mean = 1000000, sd = 100000)))
# "sample numbers repeats after 200" -> add a sample nr 1-200, 1-200
df$sample_nr <- 1:200
# create a factor which cuts the vector of sample_nr into pieces of length 16, 24, 16, 24...
# repeat until the total length of the pieces is 200
# i.e. 5 repeats of (16, 24)
grp <- cut(df$sample_nr, breaks = c(-Inf, cumsum(rep(c(16, 24), 5))))
# add a numeric version of the chunks to data frame
# this number indicates the model from which coefficients will be used
# row 1-16 (16 rows): model 1; row 17-40 (24 rows): model 2;
# row 41-56 (16 rows): model 3; and so on.
df$mod <- as.numeric(grp)
# read coefficients
coefs <- read.table(text = "intercept beta_conc
1 -108589.2726 846.0713372
2 -49653.18701 811.3982918
3 -102598.6252 832.6419926
4 -72607.4017 727.0765558
5 54224.28878 391.256075
6 -42357.45407 357.0845661
7 -34171.92228 367.3962888
8 -9332.569856 289.8631555
9 -7376.448899 335.7047756
10 -37704.92277 359.1457617", header = TRUE)
# add model number
coefs$mod <- as.numeric(rownames(coefs))  # numeric, to match the type of df$mod
head(df)
head(coefs)
# join area data and coefficients by model number
# (use 'join' instead of merge to avoid sorting)
library(plyr)
df2 <- join(df, coefs)
# calculate conc from area and model coefficients
# area = intercept + beta_conc * conc
# conc = (area - intercept) / beta_conc
df2$conc <- (df2$area - df2$intercept) / df2$beta_conc
head(df2, 41)
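If the sample intervals are known explicitly, as in the question (Samples 6:29 for sumtest1, 35:50 for sumtest2), an interval lookup table is one alternative to cutting a running sample number. The sketch below uses only the first two models and four of the question's rows; the intervals table and its column names first and last are hypothetical, introduced for illustration.

```r
# First two coefficient rows from the question
coefs <- data.frame(
  mod = 1:2,
  intercept = c(-108589.2726, -49653.18701),
  beta_conc = c(846.0713372, 811.3982918)
)

# Hypothetical lookup table: which sample range belongs to which model
intervals <- data.frame(mod = 1:2, first = c(6, 35), last = c(29, 50))

# Four rows from the question's sample data
samples <- data.frame(
  Sample = c(6, 7, 35, 36),
  Area = c(236211, 724919, 246707, 538092)
)

# Find the model whose interval contains each sample number (NA if none)
samples$mod <- sapply(samples$Sample, function(s) {
  hit <- intervals$mod[s >= intervals$first & s <= intervals$last]
  if (length(hit) == 1) hit else NA_integer_
})

# Invert area = intercept + beta_conc * conc for each row
idx <- match(samples$mod, coefs$mod)
samples$Calc <- (samples$Area - coefs$intercept[idx]) / coefs$beta_conc[idx]
samples
```

The Calc values come out close to the ones listed in the question's expected result (roughly 407.5 and 985.2 for Samples 6 and 7), which suggests the interval-to-model mapping matches what the asker intended.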
