How to sample/partition panel data by individuals (preferably with the caret library)? - r

I would like to partition panel data and preserve the panel nature of the data:
library(caret)
library(mlbench)
# example panel data where id is the person's identifier over years
data <- read.table("http://people.stern.nyu.edu/wgreene/Econometrics/healthcare.csv",
                   header = TRUE, sep = ",", na.strings = "NA", dec = ".", strip.white = TRUE)
## Here, for instance, the dependent variable is WORKING
inTrain <- createDataPartition(y = data$WORKING, p = .75, list = FALSE)
# subset into training
training <- data[ inTrain,]
# subset into testing
testing <- data[-inTrain,]
# Here we see some intersections of identifiers
str(training$id[10:20])
str(testing$id)
However, when partitioning or sampling the data, I would like to ensure that the same person (id) is not split across the two data sets. Is there a way to randomly sample/partition the data so that individuals, rather than observations, are assigned to the corresponding partitions?
I tried to sample:
mysample <- data[sample(unique(data$id), 1000,replace=FALSE),]
However, that destroys the panel nature of the data...

I think there's a little bug in the sampling approach using sample(): it treats the id values as row numbers. Instead, the code needs to fetch all rows belonging to a sampled id:
nID <- length(unique(data$id))
p = 0.75
set.seed(123)
inTrainID <- sample(unique(data$id), round(nID * p), replace=FALSE)
training <- data[data$id %in% inTrainID, ]
testing <- data[!data$id %in% inTrainID, ]
head(training[, 1:5], 10)
# id FEMALE YEAR AGE HANDDUM
# 1 1 0 1984 54 0.0000000
# 2 1 0 1985 55 0.0000000
# 3 1 0 1986 56 0.0000000
# 8 3 1 1984 58 0.1687193
# 9 3 1 1986 60 1.0000000
# 10 3 1 1987 61 0.0000000
# 11 3 1 1988 62 1.0000000
# 12 4 1 1985 29 0.0000000
# 13 5 0 1987 27 1.0000000
# 14 5 0 1988 28 0.0000000
dim(data)
# [1] 27326 41
dim(training)
# [1] 20566 41
dim(testing)
# [1] 6760 41
20566/27326
### 75.26% were selected for training
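Because individuals contribute different numbers of rows, the realized training share only approximates p (75.26% here). If you do this often, you could wrap the steps in a small helper; a minimal sketch (split_by_id is just an illustrative name, not a caret function):
split_by_id <- function(df, id.col, p = 0.75, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  ids <- unique(df[[id.col]])
  train.ids <- sample(ids, round(length(ids) * p))
  # all rows of a sampled individual land in the same partition
  list(training = df[df[[id.col]] %in% train.ids, ],
       testing  = df[!df[[id.col]] %in% train.ids, ])
}
sets <- split_by_id(data, "id", p = 0.75, seed = 123)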
Let's check class balances, because createDataPartition would keep the class balance for WORKING equal in all sets.
table(data$WORKING) / nrow(data)
# 0 1
# 0.3229525 0.6770475
#
table(training$WORKING) / nrow(training)
# 0 1
# 0.3226685 0.6773315
#
table(testing$WORKING) / nrow(testing)
# 0 1
# 0.3238166 0.6761834
### virtually equal
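If you want the split to respect the class balance more deliberately, you could stratify the id sampling by each person's modal outcome; a hedged sketch (the modal-class rule and the names are illustrative, not a caret feature):
id.class <- tapply(data$WORKING, data$id, function(x) round(mean(x)))  # modal class per id
inTrainID <- unlist(lapply(split(names(id.class), id.class),
                           function(ids) sample(ids, round(length(ids) * 0.75))))
training <- data[data$id %in% inTrainID, ]   # %in% coerces id to character here
testing  <- data[!data$id %in% inTrainID, ]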

I thought I would point out caret's groupKFold function for anyone looking at this; it is handy for cross-validation with this kind of data. From the documentation:
"To split the data based on groups, groupKFold can be used:
set.seed(3527)
subjects <- sample(1:20, size = 80, replace = TRUE)
folds <- groupKFold(subjects, k = 15)
The results in folds can be used as inputs into the index argument of the trainControl function."
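Applied to the panel data above, a sketch might look like this (the train() call is illustrative only; adjust the formula and method to your model):
folds <- groupKFold(data$id, k = 10)               # ids are never split across folds
ctrl <- trainControl(method = "cv", index = folds)
# fit <- train(factor(WORKING) ~ AGE + FEMALE, data = data,
#              method = "glm", trControl = ctrl)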

Related

Trying to add breakpoint lines from strucchange to a plot with the "lines" command

This is my first time with strucchange so bear with me. The problem I'm having seems to be that strucchange doesn't recognize my time series correctly but I can't figure out why and haven't found an answer on the boards that deals with this. Here's a reproducible example:
require(strucchange)
# time series
nmreprosuccess <- c(0,0.50,NA,0.,NA,0.5,NA,0.50,0.375,0.53,0.846,0.44,1.0,0.285,
0.75,1,0.4,0.916,1,0.769,0.357)
dat.ts <- ts(nmreprosuccess, frequency=1, start=c(1996,1))
str(dat.ts)
Time-Series [1:21] from 1996 to 2016: 0 0.5 NA 0 NA 0.5 NA 0.5 0.375 0.53 ...
To me this means that the time series looks OK to work with.
# obtain breakpoints
bp.NMSuccess <- breakpoints(dat.ts~1)
summary(bp.NMSuccess)
Which gives:
Optimal (m+1)-segment partition:
Call:
breakpoints.formula(formula = dat.ts ~ 1)
Breakpoints at observation number:
m = 1 6
m = 2 3 7
m = 3 3 14 16
m = 4 3 7 14 16
m = 5 3 7 10 14 16
m = 6 3 7 10 12 14 16
m = 7 3 5 7 10 12 14 16
Corresponding to breakdates:
m = 1                 0.3333
m = 2   0.1667               0.3889
m = 3   0.1667                                    0.7778 0.8889
m = 4   0.1667               0.3889               0.7778 0.8889
m = 5   0.1667               0.3889 0.5556        0.7778 0.8889
m = 6   0.1667               0.3889 0.5556 0.6667 0.7778 0.8889
m = 7   0.1667 0.2778        0.3889 0.5556 0.6667 0.7778 0.8889
Fit:
m 0 1 2 3 4 5 6 7
RSS 1.6986 1.1253 0.9733 0.8984 0.7984 0.7581 0.7248 0.7226
BIC 14.3728 12.7421 15.9099 20.2490 23.9062 28.7555 33.7276 39.4522
Here's where I start having the problem. Instead of reporting the actual breakdates, it reports fractions of the series, which makes it impossible to plot the break lines onto a graph because they're not at the breakdate (2002) but at 0.333.
plot.ts(dat.ts, main="Natural Mating")
lines(fitted(bp.NMSuccess, breaks = 1), col = 4, lwd = 1.5)
Nothing shows up for me in this graph (I think because it's so small for the scale of the graph).
In addition, when I try fixes that may possibly work around this problem,
fm1 <- lm(dat.ts ~ breakfactor(bp.NMSuccess, breaks = 1))
I get:
Error in model.frame.default(formula = dat.ts ~ breakfactor(bp.NMSuccess, :
variable lengths differ (found for 'breakfactor(bp.NMSuccess, breaks = 1)')
I get errors because of the NA values in the data: the length of dat.ts is 21, while breakfactor(bp.NMSuccess, breaks = 1) has length 18 (missing the 3 NAs).
Any suggestions?
The problem occurs because breakpoints() currently can only (a) cope with NAs by omitting them, and (b) cope with times/date through the ts class. This creates the conflict because when you omit internal NAs from a ts it loses its ts property and hence breakpoints() cannot infer the correct times.
The "obvious" way around this would be to use a time series class that can cope with this, namely zoo. However, I just never got round to fully integrate zoo support into breakpoints() because it would likely break some of the current behavior.
To cut a long story short: Your best choice at the moment is to do the book-keeping about the times yourself and not expect breakpoints() to do it for you. The additional work is not so huge. First, we create a data frame with the response and the time vector and omit the NAs:
d <- na.omit(data.frame(success = nmreprosuccess, time = 1996:2016))
d
## success time
## 1 0.000 1996
## 2 0.500 1997
## 4 0.000 1999
## 6 0.500 2001
## 8 0.500 2003
## 9 0.375 2004
## 10 0.530 2005
## 11 0.846 2006
## 12 0.440 2007
## 13 1.000 2008
## 14 0.285 2009
## 15 0.750 2010
## 16 1.000 2011
## 17 0.400 2012
## 18 0.916 2013
## 19 1.000 2014
## 20 0.769 2015
## 21 0.357 2016
Then we can estimate the breakpoint(s) and afterwards transform from the "number" of observations back to the time scale. Note that I'm setting the minimal segment size h explicitly here because the default of 15% is probably somewhat small for such a short series; 4 is still small but possibly enough for estimating a constant mean.
bp <- breakpoints(success ~ 1, data = d, h = 4)
bp
## Optimal 2-segment partition:
##
## Call:
## breakpoints.formula(formula = success ~ 1, h = 4, data = d)
##
## Breakpoints at observation number:
## 6
##
## Corresponding to breakdates:
## 0.3333333
We ignore the break "date" at 1/3 of the observations but simply map back to the original time scale:
d$time[bp$breakpoints]
## [1] 2004
To re-estimate the model with nicely formatted factor levels, we could do:
lab <- c(
  paste(d$time[c(1, bp$breakpoints)], collapse = "-"),
  paste(d$time[c(bp$breakpoints + 1, nrow(d))], collapse = "-")
)
d$seg <- breakfactor(bp, labels = lab)
lm(success ~ 0 + seg, data = d)
## Call:
## lm(formula = success ~ 0 + seg, data = d)
##
## Coefficients:
## seg1996-2004 seg2005-2016
## 0.3125 0.6911
Or for visualization:
plot(success ~ time, data = d, type = "b")
lines(fitted(bp) ~ time, data = d, col = 4, lwd = 2)
abline(v = d$time[bp$breakpoints], lty = 2)
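The same manual book-keeping works for confidence intervals; a sketch, assuming confint() on the breakpoints object stores observation numbers in its confint component:
ci <- confint(bp)
d$time[ci$confint]   # lower bound, break estimate, upper bound, mapped to years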
One final remark: For such short time series where just a simple shift in the mean is needed, one could also consider conditional inference (aka permutation tests) rather than the asymptotic inference employed in strucchange. The coin package provides the maxstat_test() function exactly for this purpose (= short series where a single shift in the mean is tested).
library("coin")
maxstat_test(success ~ time, data = d, dist = approximate(99999))
## Approximative Generalized Maximally Selected Statistics
##
## data: success by time
## maxT = 2.3953, p-value = 0.09382
## alternative hypothesis: two.sided
## sample estimates:
## "best" cutpoint: <= 2004
This finds the same breakpoint and provides a permutation test p-value. If however, one has more data and needs multiple breakpoints and/or further regression coefficients, then strucchange would be needed.

rpart: How to get the "where" vector for validation dataset?

When fitting with rpart, the returned object includes a "where" vector which tells which leaf of the tree each record in the training dataset falls into. Is there a function that returns something similar to this "where" vector for a test dataset?
I think the partykit package does what you want
library('rpart')
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit
rpart.plot::rpart.plot(fit)
Check with same data
set.seed(1)
idx <- sample(nrow(kyphosis), 5L)
fit$where[idx]
# 22 30 46 71 16
# 9 3 7 7 3
library('partykit')
fit <- as.party(fit)
predict(fit, kyphosis[idx, ], type = 'node')
# 22 30 46 71 16
# 9 3 7 7 3
Check with new data
dd <- kyphosis[idx, ]
set.seed(1)
dd[] <- lapply(dd, sample)
predict(fit, dd, type = 'node')
# 22 30 46 71 16
# 5 3 7 9 3
## so #46 should meet criteria for the 7th leaf:
with(kyphosis[46, ],
     Start >= 8.5 &   # node 1
     Start < 14.5 &   # node 2
     Age >= 55 &      # node 4
     Age >= 111       # node 6
)
# [1] TRUE
As you mention, the function predict.rpart in the rpart package
doesn't have a where option (to show the leaf node number associated
with a prediction).
However, the rpart.predict function in the rpart.plot package
will do this. For example
> library(rpart.plot)
> fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
> rpart.predict(fit, newdata=kyphosis[1:3,], nn=TRUE)
gives (note the node number nn column):
absent present nn
1 0.42105 0.57895 3
2 0.85714 0.14286 22
3 0.42105 0.57895 3
And
> rpart.predict(fit, newdata=kyphosis[1:3,], nn=TRUE)$nn
gives just the where node numbers:
[1] 3 22 3
To show the rule for each prediction use
> rpart.predict(fit, newdata=kyphosis[1:5,], rules=TRUE)
which gives
absent present
1 0.42105 0.57895 because Start < 9
2 0.85714 0.14286 because Start is 9 to 15 & Age >= 111
3 0.42105 0.57895 because Start < 9
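As a side note, the same package can also print the rule for every leaf at once, which pairs nicely with the node numbers above (a sketch; see the rpart.plot documentation for the available options):
rpart.rules(fit)   # one row per leaf, with the splits that lead to it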

How to resample data by clusters (block sampling) with replacement in R using Sampling package

This is my dummy data:
income <- sample(1000:10000, 1000, replace = TRUE)
individuals <- sample(1:50, 1000, replace = TRUE)
datatest <- data.frame(income, individuals)
I know I can sample by individual rows with this code:
row.sample <- datatest[sample(nrow(datatest), replace = TRUE), ]
Now, I want to extract random samples with replacement and equal probabilities from the dataset, but sampling complete blocks of observations with the same individual code.
Note that there are 50 individuals, but 1000 observations. Some observations belong to the same individual, so I want to sample by individuals (clusters, in this case), not observations. I don't mind if the extracted samples differ slightly in the number of observations. How can I do that?
I have tried:
library(sampling)
samplecluster <- cluster(datatest, clustername = "individuals", size = 50,
                         method = "srswr")
But the outcome is not the sampled data. Am I missing something?
Well, it seems I was indeed missing something. After the cluster command you need to apply the getdata command (both from the sampling package). This way I do get the sample as I wanted, plus some additional columns.
samplecluster <- cluster(datatest, clustername = "individuals", size = 50, method = "srswr")
Gives you:
head(samplecluster)
individuals ID_unit Replicates Prob
1 1 259 2 0.63583
2 1 178 2 0.63583
3 1 110 2 0.63583
4 1 153 2 0.63583
5 1 941 2 0.63583
6 1 667 2 0.63583
Then using getdata, I also get the original data on income sampled by whole clusters:
datasample <- getdata (datatest, samplecluster)
head(datasample)
income individuals ID_unit Replicates Prob
1 8567 1 259 2 0.63583
2 2701 1 178 2 0.63583
3 4998 1 110 2 0.63583
4 3556 1 153 2 0.63583
5 2893 1 941 2 0.63583
6 7581 1 667 2 0.63583
I am not sure if I am missing something. If you just want some of your individuals, you can create a smaller sample of them:
ind.sample <- sample(1:50, size = 10)
print(ind.sample)
# [1] 17 43 38 39 28 23 35 47 9 13
my.sample <- datatest[datatest$individuals %in% ind.sample, ]
head(my.sample)
# income individuals
#21 9072 17
#97 5928 35
#122 9130 43
#252 4388 43
#285 8083 28
#287 1065 35
I guess a more generic approach would be to generate random indexes:
ind.unique <- unique(individuals)
ind.sample.index <- sample(seq_along(ind.unique), size = 10)
ind.sample <- ind.unique[ind.sample.index]
print(ind.sample[order(ind.sample)])
my.sample <- datatest[datatest$individuals %in% ind.sample, ]
ind.counts <- aggregate(income ~ individuals, my.sample, FUN = length)
print(ind.counts)
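Note that the %in% approach draws clusters without replacement. If you need replacement (as method = "srswr" above provides), a base-R sketch along these lines could work (names are illustrative):
set.seed(42)
ids <- unique(datatest$individuals)
boot.ids <- sample(ids, length(ids), replace = TRUE)   # clusters drawn WITH replacement
boot.sample <- do.call(rbind, lapply(seq_along(boot.ids), function(i) {
  block <- datatest[datatest$individuals == boot.ids[i], ]
  block$draw <- i   # tag each draw so repeated clusters stay distinguishable
  block
}))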
I think it's important to note that the dataset still needs to be expanded to include all the replicates:
sw <- data.frame(datasample[rep(seq_len(nrow(datasample)), datasample$Replicates), , drop = FALSE],
                 row.names = NULL)
Might be helpful to someone

using predict with a list of lm() objects

I have data which I regularly run regressions on. Each "chunk" of data gets a different regression fit. Each state, for example, might have a different function that explains the dependent value. This seems like a typical "split-apply-combine" type of problem, so I'm using the plyr package. I can easily create a list of lm() objects, which works well. However, I can't quite wrap my head around how I use those objects later to predict values in a separate data.frame.
Here's a totally contrived example illustrating what I'm trying to do:
# setting up some fake data
set.seed(1)
funct <- function(myState, myYear){
  rnorm(1, 100, 500) + myState + (100 * myYear)
}
state <- 50:60
year <- 10:40
myData <- expand.grid( year, state)
names(myData) <- c("year","state")
myData$value <- apply(myData, 1, function(x) funct(x[2], x[1]))
## ok, done with the fake data generation.
require(plyr)
modelList <- dlply(myData, "state", function(x) lm(value ~ year, data=x))
## if you want to see the summaries of the lm() do this:
# lapply(modelList, summary)
state <- 50:60
year <- 50:60
newData <- expand.grid( year, state)
names(newData) <- c("year","state")
## now how do I predict the values for newData$value
# using the regressions in modelList?
So how do I use the lm() objects contained in modelList to predict values using the year and state independent values from newData?
Here's my attempt:
predNaughty <- ddply(newData, "state", transform,
                     value = predict(modelList[[paste(piece$state[1])]], newdata = piece))
head(predNaughty)
# year state value
# 1 50 50 5176.326
# 2 51 50 5274.907
# 3 52 50 5373.487
# 4 53 50 5472.068
# 5 54 50 5570.649
# 6 55 50 5669.229
predDiggsApproved <- ddply(newData, "state", function(x)
  transform(x, value = predict(modelList[[paste(x$state[1])]], newdata = x)))
head(predDiggsApproved)
# year state value
# 1 50 50 5176.326
# 2 51 50 5274.907
# 3 52 50 5373.487
# 4 53 50 5472.068
# 5 54 50 5570.649
# 6 55 50 5669.229
JD Long edit
I was inspired enough to work out an adply() option:
pred3 <- adply(newData, 1, function(x)
  predict(modelList[[paste(x$state)]], newdata = x))
head(pred3)
# year state 1
# 1 50 50 5176.326
# 2 51 50 5274.907
# 3 52 50 5373.487
# 4 53 50 5472.068
# 5 54 50 5570.649
# 6 55 50 5669.229
You need to use mdply to supply both the model and the data to each function call:
dataList <- dlply(newData, "state")
preds <- mdply(cbind(mod = modelList, df = dataList), function(mod, df) {
  mutate(df, pred = predict(mod, newdata = df))
})
A solution with just base R. The format of the output is different, but all the values are right there.
models <- lapply(split(myData, myData$state), 'lm', formula = value ~ year)
pred4 <- mapply('predict', models, split(newData, newData$state))
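One caveat: the mapply call relies on split() returning the states in the same order as the model list. A variant that matches models to data by name instead (assuming the list is named by state, as both split() and dlply name it) might be:
nd.split <- split(newData, newData$state)
preds <- Map(function(mod, df) cbind(df, value = predict(mod, newdata = df)),
             models[names(nd.split)], nd.split)
pred.df <- do.call(rbind, preds)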
What is wrong with
lapply(modelList, predict, newData)
?
EDIT:
Thanks for explaining what is wrong with that. How about:
newData <- data.frame(year)
ldply(modelList, function(model) {
  data.frame(newData, predict = predict(model, newData))
})
Iterate over the models, and apply the new data (which is the same for each state since you just did an expand.grid to create it).
EDIT 2:
If newData does not have the same values for year for every state as in the example, a more general approach can be used. Note that this uses the original definition of newData, not the one in the first edit.
ldply(state, function(s) {
  nd <- newData[newData$state == s, ]
  data.frame(nd, predict = predict(modelList[[as.character(s)]], nd))
})
First 15 lines of this output:
year state predict
1 50 50 5176.326
2 51 50 5274.907
3 52 50 5373.487
4 53 50 5472.068
5 54 50 5570.649
6 55 50 5669.229
7 56 50 5767.810
8 57 50 5866.390
9 58 50 5964.971
10 59 50 6063.551
11 60 50 6162.132
12 50 51 5514.825
13 51 51 5626.160
14 52 51 5737.496
15 53 51 5848.832
I take it the hard part is matching each state in newData to the corresponding model.
Something like this perhaps?
predList <- dlply(newData, "state", function(x) {
  predict(modelList[[as.character(min(x$state))]], x)
})
Here I used a "hacky" way of extracting the corresponding state model: as.character(min(x$state))
...There is probably a better way?
Output:
> predList[1:2]
$`50`
1 2 3 4 5 6 7 8 9 10 11
5176.326 5274.907 5373.487 5472.068 5570.649 5669.229 5767.810 5866.390 5964.971 6063.551 6162.132
$`51`
12 13 14 15 16 17 18 19 20 21 22
5514.825 5626.160 5737.496 5848.832 5960.167 6071.503 6182.838 6294.174 6405.510 6516.845 6628.181
Or, if you want a data.frame as output:
predData <- ddply(newData, "state", function(x) {
  y <- predict(modelList[[as.character(min(x$state))]], x)
  data.frame(id = names(y), value = c(y))
})
Output:
head(predData)
state id value
1 50 1 5176.326
2 50 2 5274.907
3 50 3 5373.487
4 50 4 5472.068
5 50 5 5570.649
6 50 6 5669.229
Maybe I'm missing something, but I believe lmList is the ideal tool here:
library(nlme)
ll = lmList(value ~ year | state, data=myData)
predict(ll, newData)
## Or, to show that it produces the same results as the other proposed methods...
newData[["value"]] <- predict(ll, newData)
head(newData)
# year state value
# 1 50 50 5176.326
# 2 51 50 5274.907
# 3 52 50 5373.487
# 4 53 50 5472.068
# 5 54 50 5570.649
# 6 55 50 5669.229
