Merge plm fitted values to dataset - R

I'm working with a fixed effects regression model using plm.
The model looks like this:
FE.model <- plm(fml, data = data.reg2,
                index = c('Site.ID', 'date.hour'), # cross-section ID and time-series ID
                model = 'within',    # fixed effects (within) estimator
                effect = 'individual')
summary(FE.model)
"fml" is a formula I defined previously. I have many independent variables, so this made it more efficient.
What I want to do is get my fitted values (my yhats) and join them to my base dataset; data.reg2
I was able to get the fitted values using this code:
Fe.model.fitted <- FE.model$model[[1]] - FE.model$residuals
However, this only gives me a one-column vector of fitted values, with no way of joining it to my base dataset.
Alternatively, I've tried something like this:
Fe.model.fitted <- cbind(data.reg2, resid=resid(FE.model), fitted=fitted(FE.model))
However, I get this error with that:
Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "pseries" to a data.frame
Are there any other ways to get my fitted values in my base dataset? Or can someone explain the error I'm getting and maybe a way to fix it?
I should note that I don't want to manually compute the yhats based on my betas. I have way too many independent variables for that option and my defined formula (fml) may change so that option would not be efficient.
Many thanks!!

Merging plm fitted values back into the original dataset requires some intermediate steps -- plm drops any rows with missing data, and as far as I can tell, a plm object does not contain the index info. The order of the data is not preserved -- see what Giovanni Millo, one of plm's authors, commented in this thread:
"...the input order is not always preserved: observations are always reordered by (individual, time) internally, so that the output you get is ordered accordingly..."
The steps in short:
1) Get the fitted values from the estimated plm object. It is a single vector, but the entries are named; the names correspond to positions in the index.
2) Get the index, using the index() function. It can return both the individual and time indices. Note the index may contain more rows than the fitted values, in case rows were removed for missing data. (It is also possible to generate an index directly from the original data, but I did not see a promise that plm preserves the original order of the data.)
3) Merge into the original data, looking up the id and time values from the index.
Sample code is provided below. Kind of long but I've tried to comment. The code is not optimized, my intention was to list the steps explicitly. Also, I am using data.tables rather than data.frames.
library(data.table); library(plm)
### Generate dummy data. This way we know the "true" coefficients
set.seed(100)
n <- 500 # Run with more data if you want to get closer to the "true" coefficients
DT <- data.table(CJ(id = c("a","b","c","d","e"), time = c(1:(n / 5))))
DT[, x1 := rnorm(n)]
DT[, x2 := rnorm(n)]
DT[, y := x1 + 2 * x2 + rnorm(n) / 10]
setkey(DT, id, time)
# Make it an unbalanced panel & put in some NAs
DT <- DT[!(id == "a" & time == 4)]
DT[.("a", 3), x2 := as.numeric(NA)]
DT[.("d", 2), x2 := as.numeric(NA)]
str(DT)
### Run the model -- both individual and time effects; "within" model
summary(PLM <- plm(data = DT, index = c("id", "time"), formula = y ~ x1 + x2,
                   model = "within", effect = "twoways", na.action = "na.omit"))
### Merge the fitted values back into the data.table DT
# Note that PLM$model$y is shorter than the data, i.e. the row(s) with NA have been dropped
cat("\nRows omitted (due to NA): ", nrow(DT) - length(PLM$model$y))
# Since the objects returned by plm() do not contain the index, need to generate it from the data
# The object returned by plm(), i.e. PLM$model$y, has names that point to the place in the index
# Note: The index can also be done as INDEX <- DT[, j = .(id, time)], but use the longer way with index() in case plm does not preserve the order
INDEX <- data.table(index(x = pdata.frame(x = DT, index = c("id", "time")), which = NULL)) # which = NULL extracts both the individual and time indexes
INDEX[, id := as.character(id)]
INDEX[, time := as.integer(time)] # it is returned as a factor, convert back to integer to match the variable type in DT
# Generate the fitted values as the difference between the y values and the residuals
stopifnot(all(names(PLM$residuals) == names(PLM$model$y))) # should always hold, but fail loudly just in case
FIT <- data.table(
index = as.integer(names(PLM$model$y)), # this index corresponds to the position in INDEX, from where we get the "id" and "time" below
fit.plm = as.numeric(PLM$model$y) - as.numeric(PLM$residuals)
)
FIT[, id := INDEX[index]$id]
FIT[, time := INDEX[index]$time]
# Now FIT has both the id and time variables, can match it back into the original dataset (i.e. we have the missing data accounted for)
DT <- merge(x = DT, y = FIT[, j = .(id, time, fit.plm)], by = c("id", "time"), all = TRUE) # Need all = TRUE, or some data from DT will be dropped!
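A couple of hedged sanity checks on the objects built above: the two rows where x2 was set to NA should be the only ones left without a fitted value.
DT[is.na(fit.plm)] # expect the ("a", 3) and ("d", 2) rows
sum(!is.na(DT$fit.plm)) == length(PLM$residuals) # expect TRUE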

I have a simplified method. The main problem here is twofold:
1) pdata.frame sorts your input by individual, then time. This can be addressed by sorting your data frame the same way before running plm.
2) rows with an NA in any variable included in the formula are dropped. I handle this by creating a second formula that also includes my id and time variables, and then using model.frame to extract the data used in the regression (excluding NAs, but now also including id and time).
library(plm)
set.seed(100)
n <- 10 # Run with more data if you want to get closer to the "true" coefficients
DT <- data.frame(id = c("a","c","b","d","e"), time = c(1:(n / 5)),
                 x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
DT$Y = DT$x2 + 2 * DT$x3 + rnorm(n) / 10 # make Y a function of the other variables
DT$x3[3] = NA # add an NA to show this works with missing data
DT
# inspect the pdata.frame (drop.index = F keeps id and time as columns); note that it is sorted by order(id, time)
pdata.frame(DT,index=c('id','time'),drop.index = F)
# order DT to match pdata.frame that will be used for plm
DT=DT[order(DT$id,DT$time),]
# formulas
formulas = Y ~ x1 + x2 + x3
formulas_dataframe = Y ~ x1 + x2 + x3 + id + time # add id and time for model.frame
# estimate
random <- plm(formulas, data=DT, index=c("id", "time"), model="random",na.action = 'na.omit')
summary(random)
# merge prediction and model.frame
fitted = data.frame(fitted = random$model[[1]] - random$residuals)
model_data = cbind(as.data.frame(as.matrix(random$model)),fitted) # this isn't really needed but shows that input and model.frame are same
model_data = cbind(model_data,na.omit(model.frame(formulas_dataframe,DT)))
model_data
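To finish the join back onto the original data, a hedged follow-up to the code above: model_data now carries id, time, and fitted, so a plain merge works.
DT_with_fit <- merge(DT, model_data[, c("id", "time", "fitted")],
                     by = c("id", "time"), all.x = TRUE)
DT_with_fit # the row dropped for its NA gets NA in 'fitted'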

I wrote a function (predict.out.plm) to do out-of-sample predictions after estimating first-differences or fixed-effects models with plm.
The function also attaches the predicted values to the indices of the original data. This is done by using the row names saved within the plm object - attributes(plmobject)$index - and the row names within the model.matrix.
for more details see the function posted here:
https://stackoverflow.com/a/44185441/2409896
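A hedged sketch of the core idea applied to the original question (this is not the full predict.out.plm from the link; it assumes index() returns the index columns under their original names, in the same row order as the residuals):
idx <- index(FE.model) # (Site.ID, date.hour) pairs for the rows actually used
yhat <- data.frame(idx, fitted = as.numeric(FE.model$model[[1]] - FE.model$residuals))
data.reg2.fit <- merge(data.reg2, yhat, by = names(idx), all.x = TRUE)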

It's been a while for this post, but I believe the easiest way to do this now would be:
Fe.model.fitted <- cbind(FE.model$model,
                         resid = FE.model$residuals,
                         fitted = plm:::fitted_exp.plm(FE.model))
The function fitted_exp.plm is not exported by the plm package, but we can use ::: to access it.

The residuals are the deviation of the model from the value on the LHS of the formula .... which you have not shown to us. There is a fitted.panelmodel function in the 'plm' package, but it appears to expect a fitted value that the plm function does not return by default, nor is it documented to do so, nor do I see a way to make it cough one up.
library(plm)
data("Produc", package = "plm")
zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
data = Produc, index = c("state","year"))
summary(zz) # the example on the plm page:
> str(fitted(zz))
NULL
> names(zz$model)
[1] "log(gsp)" "log(pcap)" "log(pc)" "log(emp)" "unemp"
> Produc[ , c("Yvar", "Fitted")] <- cbind( zz$model[ ,"log(gsp)", drop=FALSE], zz$residuals)
> str(Produc)
'data.frame': 816 obs. of 12 variables:
$ state : Factor w/ 48 levels "ALABAMA","ARIZONA",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : int 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...
$ pcap : num 15033 15502 15972 16406 16763 ...
$ hwy : num 7326 7526 7765 7908 8026 ...
$ water : num 1656 1721 1765 1742 1735 ...
$ util : num 6051 6255 6442 6756 7002 ...
$ pc : num 35794 37300 38670 40084 42057 ...
$ gsp : int 28418 29375 31303 33430 33749 33604 35764 37463 39964 40979 ...
$ emp : num 1010 1022 1072 1136 1170 ...
$ unemp : num 4.7 5.2 4.7 3.9 5.5 7.7 6.8 7.4 6.3 7.1 ...
$ Yvar :Classes 'pseries', 'pseries', 'integer' atomic [1:816] 10.3 10.3 10.4 10.4 10.4 ...
.. ..- attr(*, "index")='data.frame': 816 obs. of 2 variables:
.. .. ..$ state: Factor w/ 48 levels "ALABAMA","ARIZONA",..: 1 1 1 1 1 1 1 1 1 1 ...
.. .. ..$ year : Factor w/ 17 levels "1970","1971",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Resid : num -0.04656 -0.03064 -0.01645 -0.00873 -0.02708 ...
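Fitted values then follow as observed minus residual; a hedged follow-up using the columns just created:
> Produc$Fitted <- Produc$Yvar - Produc$Resid
> head(Produc[, c("state", "year", "Yvar", "Resid", "Fitted")])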

Related

How to use mice for multiple imputation of missing values in longitudinal data?

I have a dataset with a repeatedly measured continuous outcome and some covariates of different classes, like in the example below.
Id y Date Soda Team
1 -0.4521 1999-02-07 Coke Eagles
1 0.2863 1999-04-15 Pepsi Raiders
2 0.7956 1999-07-07 Coke Raiders
2 -0.8248 1999-07-26 NA Raiders
3 0.8830 1999-05-29 Pepsi Eagles
4 0.1303 2005-03-04 NA Cowboys
5 0.1375 2013-11-02 Coke Cowboys
5 0.2851 2015-06-23 Coke Eagles
5 -0.3538 2015-07-29 Pepsi NA
6 0.3349 2002-10-11 NA NA
7 -0.1756 2005-01-11 Pepsi Eagles
7 0.5507 2007-10-16 Pepsi Cowboys
7 0.5132 2012-07-13 NA Cowboys
7 -0.5776 2017-11-25 Coke Cowboys
8 0.5486 2009-02-08 Coke Cowboys
I am trying to multiply impute the missing values in Soda and Team using the mice package. As I understand it, because MI is not a causal model, there is no concept of dependent and independent variables. I am not sure how to set up this MI process using mice. I would like some suggestions or advice from others who have encountered missing data in a repeated-measures setting like this and how they used mice to tackle the problem. Thanks in advance.
Edit
This is what I have tried so far, but this does not capture the repeated measure part of the dataset.
library(mice)
init = mice(dat, maxit=0)
methd = init$method
predM = init$predictorMatrix
methd[c("Soda")] = "logreg"  # binary: Coke/Pepsi
methd[c("Team")] = "polyreg" # Team has three levels, so polyreg rather than logreg
imputed = mice(dat, method=methd, predictorMatrix=predM, m=5)
There are several options to accomplish what you are asking for. I have decided to impute missing values in covariates in the so-called 'wide' format. I will illustrate this with the following worked example, which you can easily apply to your own data.
Let's first make a reprex. Here, I use the longitudinal Mayo Clinic Primary Biliary Cirrhosis Data (pbc2), which comes with the JM package. This data is organized in the so-called 'long' format, meaning that each patient i has multiple rows and each row contains a measurement of variable x measured on time j. Your dataset is also in the long format. In this example, I assume that pbc2$serBilir is our outcome variable.
# install.packages('JM')
library(JM)
# note: use function(x) instead of \(x) if you use a version of R <4.1.0
# missing values per column
miss_abs <- \(x) sum(is.na(x))
miss_perc <- \(x) round(sum(is.na(x)) / length(x) * 100, 1L)
miss <- cbind('Number' = apply(pbc2, 2, miss_abs), '%' = apply(pbc2, 2, miss_perc))
# --------------------------------
> miss[which(miss[, 'Number'] > 0),]
Number %
ascites 60 3.1
hepatomegaly 61 3.1
spiders 58 3.0
serChol 821 42.2
alkaline 60 3.1
platelets 73 3.8
According to this output, 6 variables in pbc2 contain at least one missing value. Let's pick alkaline from these. We also need patient id and the time variable years.
# subset
pbc_long <- subset(pbc2, select = c('id', 'years', 'alkaline', 'serBilir'))
# sort ascending based on id and, within each id, years
pbc_long <- with(pbc_long, pbc_long[order(id, years), ])
# ------------------------------------------------------
> head(pbc_long, 5)
id years alkaline serBilir
1 1 1.09517 1718 14.5
2 1 1.09517 1612 21.3
3 2 14.15234 7395 1.1
4 2 14.15234 2107 0.8
5 2 14.15234 1711 1.0
Just by quickly eyeballing, we observe that years do not seem to differ within subjects, even though variables were repeatedly measured. For the sake of this example, let's add a little bit of time to all rows of years but the first measurement.
set.seed(1)
# add little bit of time to each row of 'years' but the first row
new_years <- lapply(split(pbc_long, pbc_long$id), \(x) {
add_time <- 1:(length(x$years) - 1L) + rnorm(length(x$years) - 1L, sd = 0.25)
c(x$years[1L], x$years[-1L] + add_time)
})
# replace the original 'years' variable
pbc_long$years <- unlist(new_years)
# integer time variable needed to store repeated measurements as separate columns
pbc_long$measurement_number <- unlist(sapply(split(pbc_long, pbc_long$id), \(x) 1:nrow(x)))
# only keep the first 4 repeated measurements per patient
pbc_long <- subset(pbc_long, measurement_number %in% 1:4)
Since we will perform our multiple imputation in wide format (meaning that each participant i has one row and repeated measurements on x are stored in j different columns, so xj columns in total), we have to convert the data from long to wide. Now that we have prepared our data, we can use reshape to do this for us.
# convert long format into wide format
v_names <- c('years', 'alkaline', 'serBilir')
pbc_wide <- reshape(pbc_long,
idvar = 'id',
timevar = "measurement_number",
v.names = v_names, direction = "wide")
# -----------------------------------------------------------------
> head(pbc_wide, 4)[, 1:9]
id years.1 alkaline.1 serBilir.1 years.2 alkaline.2 serBilir.2 years.3 alkaline.3
1 1 1.095170 1718 14.5 1.938557 1612 21.3 NA NA
3 2 14.152338 7395 1.1 15.198249 2107 0.8 15.943431 1711
12 3 2.770781 516 1.4 3.694434 353 1.1 5.148726 218
16 4 5.270507 6122 1.8 6.115197 1175 1.6 6.716832 1157
Now let's multiply impute the missing values in our covariates.
library(mice)
# Setup-run
ini <- mice(pbc_wide, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
visSeq <- ini$visitSequence
# avoid collinearity issues by letting only variables measured
# at the same point in time predict each other
pred[grep("1", rownames(pred), value = TRUE),
grep("2|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("2", rownames(pred), value = TRUE),
grep("1|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("3", rownames(pred), value = TRUE),
grep("1|2|4", colnames(pred), value = TRUE)] <- 0
pred[grep("4", rownames(pred), value = TRUE),
grep("1|2|3", colnames(pred), value = TRUE)] <- 0
# variables that should not be imputed
pred[c("id", grep('^year', names(pbc_wide), value = TRUE)), ] <- 0
# variables should not serve as predictors
pred[, c("id", grep('^year', names(pbc_wide), value = TRUE))] <- 0
# multiply impute the missing values ------------------------------
imp <- mice(pbc_wide, pred = pred, m = 10, maxit = 20, seed = 1)
# Time difference of 2.899244 secs
As can be seen in the traceplots (which can be obtained with plot(imp)), the algorithm has converged nicely. Refer to this section of Stef van Buuren's book for more info on convergence.
Now we need to convert back the multiply imputed data (which is in wide format) to long format, so that we can use it for analyses. We also need to make sure that we exclude all rows that had missing values for our outcome variable serBilir, because we do not want to use imputed values of the outcome.
# need unlisted data
implong <- complete(imp, 'long', include = FALSE)
# 'smart' way of getting all the names of the repeated variables in a usable format
v_names <- as.data.frame(matrix(apply(
expand.grid(grep('ye|alk|ser', names(implong), value = TRUE)),
1, paste0, collapse = ''), nrow = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(v_names) <- names(pbc_long)[2:4]
# convert back to long format
longlist <- lapply(split(implong, implong$.imp),
reshape, direction = 'long',
varying = as.list(v_names),
v.names = names(v_names),
idvar = 'id', times = 1:4)
# logical that is TRUE if our outcome was not observed
# which should be based on the original, unimputed data
orig_data <- reshape(imp$data, direction = 'long',
varying = as.list(v_names),
v.names = names(v_names),
idvar = 'id', times = 1:4)
orig_data$logical <- is.na(orig_data$serBilir)
# merge into the list of imputed long-format datasets:
longlist <- lapply(longlist, merge, y = subset(orig_data, select = c(id, time, logical)))
# exclude rows for which logical == TRUE
longlist <- lapply(longlist, \(x) subset(x, !logical))
Finally, convert longlist back into a mids object using datalist2mids from the miceadds package.
imp <- miceadds::datalist2mids(longlist)
# ----------------
> imp$loggedEvents
NULL
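With a mids object back, downstream analyses pool in the usual mice way. A minimal sketch (the lm model here is just a placeholder, not part of the answer):
fit <- with(imp, lm(serBilir ~ alkaline + time))
summary(pool(fit))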

Why is 'Age' not considered a factor when using the 'rankabuncomp' function

I am trying to create rank abundance curves (RACs) for tree communities that are grouped according to age. I have 15 sampling sites in each of 8 seral stages (15 × 8, N = 120). If I had to factor my data it would look something like this:
Age <- factor(c(rep(1,15), rep(2,15), rep(3,15), rep(4,15), rep(5,15), rep(6,15), rep(7,15), rep(8,15)), labels =c("12","15","19","23","27","31","35","38"))
Now, when I use the function 'rankabuncomp' (BiodiversityR package) to create RAC curves for each seral stage, I get an error message:
Error in diversitycomp(x, y, factor, index = "richness") :
specified factor1 'Age' is not a factor
In addition: Warning message:
In if ((method %in% METHOD) == F) { :
the condition has length > 1 and only the first element will be used
This is the code I used:
Trees_2015 <- read.csv(file = "Trees_2015.csv")
Trees<- Trees_2015[,-1]
Reg_age <- read.csv(file = "Age.csv")
RAC_trees <- rankabundance(Trees,y=Reg_age, factor = "Age", level = c("12","15","19","23","27","31","35","38"))
RAC_trees
rankabunplot(RAC_trees,scale='abundance', addit=FALSE, specnames=c(1,2,3))
rankabuncomp(Trees, y=Reg_age, factor='Age',
scale='proportion', legend=TRUE)
Why is R producing this error? How can I rectify it?
The 'Reg_age' data frame (120 obs. of 1 variable) looks something like this:
Age
1 12
2 12
16 15
20 15
120 38
The 'Trees' data frame has 120 obs. of 75 variables (i.e. 75 different species)
Thanks
Elena
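A hedged guess at a fix (untested against the actual CSV files): read.csv imports Age as an integer column, and the error says diversitycomp demands a factor, so coerce it before calling the BiodiversityR functions.
Reg_age$Age <- factor(Reg_age$Age) # levels "12","15","19","23","27","31","35","38"
str(Reg_age) # Age should now show as: Factor w/ 8 levels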

Formatting life-tables to use in survival analysis

I'm trying to use the 'relsurv' package in R to compare the survival of a cohort to national life tables. The code below shows my problem, using the example from relsurv but changing the life-table data. I've used just two years and two ages in the life-table data below; the actual data is much larger but gives the same error. The error is 'invalid ratetable argument', but I've formatted it as per the example life-tables 'slopop' and 'survexp.us'.
library(survival)
library(relsurv)
data(rdata) # example data from relsurv
raw = read.table(header=T, stringsAsFactors = F, sep=' ', text='
Year Age sex qx
1980 30 1 0.00189
1980 31 1 0.00188
1981 30 1 0.00191
1981 31 1 0.00191
1980 30 2 0.00077
1980 31 2 0.00078
1981 30 2 0.00076
1981 31 2 0.00074
')
ages = c(30,40) # in years
years = c(1980, 1990)
rtab = array(data=NA, dim=c(length(ages), 2, length(years))) # set up blank array: ages, sexes, years
for (y in unique(raw$Year)){
for (s in 1:2){
rtab[ , s, y-min(years)+1] = -1 * log(1-subset(raw, Year==y&sex==s)$qx) / 365.24 # probability of death in next year, transformed to hazard (see ratetables help)
}
}
attributes(rtab)$dimnames[[1]] = as.character(ages)
attributes(rtab)$dimnames[[2]] = c('male','female')
attributes(rtab)$dimnames[[3]] = as.character(years)
attributes(rtab)$dimid <- c("age", "sex", 'year')
attributes(rtab)$dim <- c(length(ages), 2, length(years))
attributes(rtab)$factor = c(0,0,1)
attributes(rtab)$type = c(2,1,4)
attributes(rtab)$cutpoints[[1]] = ages*365.24 # must be in days
attributes(rtab)$cutpoints[[2]] = NULL
attributes(rtab)$cutpoints[[3]] = as.date(paste("1Jan", years, sep='')) # must be date
attributes(rtab)$class = "ratetable"
# example from relsurv
rsmul(Surv(time,cens) ~ sex+as.factor(agegr)+
ratetable(age=age*365.24, sex=sex, year=year),
data=rdata, ratetable=rtab, int=1)
Try using the transrate function from the relsurv package to reformat the data. That should give you a compatible dataset.
Regards,
Josh
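A hedged sketch of what that transrate call could look like for the question's data (per ?transrate, each sex's yearly death probabilities go in a matrix with one row per one-year age group -- the function expects tables starting at age 0, so a real call would cover the full age range -- and columns named by calendar year; yearlim gives the first and last year):
men   <- matrix(subset(raw, sex == 1)$qx, nrow = 2,
                dimnames = list(c("30", "31"), c("1980", "1981")))
women <- matrix(subset(raw, sex == 2)$qx, nrow = 2,
                dimnames = list(c("30", "31"), c("1980", "1981")))
rtab2 <- transrate(men, women, yearlim = c(1980, 1981), int.length = 1)
is.ratetable(rtab2, verbose = TRUE) # sanity-check the result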
Three things to add:
You should set attributes(rtab)$factor = c(0,1,0), since sex (the second dimension) is a factor (i.e., doesn't change over time).
A good way to check whether something is a valid rate table is to use the is.ratetable() function. is.ratetable(rtab, verbose = TRUE) will even return a message stating what was wrong.
Check the result of is.ratetable without verbose first, because the verbose version can "lie" about valid rate tables.
The rest of this answer is about that lie.
If the type attribute isn't given, is.ratetable will calculate it using the factor attribute; you can see this by just printing the function. However, it seems to do so incorrectly. It uses type <- 1 * (fac == 1) + 2 * (fac == 0) + 4 * (fac > 0), where fac is attributes(rtab)$factor.
But the next section, which checks the type attribute if it's provided, says the only valid values are 1, 2, 3, and 4. It's impossible to get 1 from the code above.
For example, let's examine the slopop ratetable provided with the relsurv package.
library(relsurv)
data(slopop)
is.ratetable(slopop)
# [1] TRUE
is.ratetable(slopop, verbose = TRUE)
# [1] "wrong length for cutpoints 3"
I think this is where your rate table is being hung up.
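Applying the first point above to the question's table (a hedged fix; rerun the check afterwards):
attributes(rtab)$factor <- c(0, 1, 0) # age and year vary over time, sex does not
is.ratetable(rtab, verbose = TRUE)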

How do I access a second sub-element within a list and pass it to *apply?

Given a list (list.data.partitions) with 72 elements (dataset_1, dataset_2, etc.), each of which contains two sub-elements (two data frames), $training and $testing; e.g.:
> str(list.data.partitions$dataset_1)
List of 2
$ training:'data.frame': 81 obs. of 20 variables:
..$ b0 : num [1:81] 11.61 9.47 10.61 7.34 12.65 ...
..$ b1 : num [1:81] 11.6 9.94 10.7 10.11 12.2 ...
..$ b2 : num [1:81] 34.2 31 32.7 27.9 36.1 ...
...
..$ index: num [1:81] 0.165 0.276 0.276 0.181 0.201 ...
$ testing :'data.frame': 19 obs. of 20 variables:
..$ b0 : num [1:19] 6.05 12.4 13.99 16.82 8.8 ...
..$ b1 : num [1:19] 12.4 10.8 11.8 13.7 16.3 ...
..$ b2 : num [1:19] 25.4 29.8 31.2 34.1 27.3 ...
...
..$ index: num [1:19] 0.143 1.114 0.201 0.529 1.327 ...
How would I correctly access the $testing dataframe using lapply (or similar functionality) and caret's predict function below:
fun.predict.rf <- function(x, y) {
predict(x, newdata = y$testing)
}
list.predictions <- lapply(list.models, fun.predict.rf, y=list.data.partitions)
The above function "works", but it returns predictions based on $training dataframe (~80 obs), instead of the $testing dataframe (~20 obs) that was specified. Ultimately, I'd expect a list containing predictions for each of the elements in my list, based on the $testing dataframe.
list.models is a list of 72 models based on the $training dataframe using the caret package in R (not shown or included). The number of models (72) in list.models equals the number of elements (72) in list.data.partitions when considering a single sub-element (either $training or $testing). The name of each of the 72 elements in list.data.partitions differs like so: dataset_1, dataset_2, etc., but the structure is identical (see str output above).
list.data.partitions can be downloaded here. In this version, the 72 elements do not have names, but in my version the 72 elements are named (e.g., dataset_1, dataset_2, etc). Each of the sub-elements are still named $training and $testing.
You can declare the function inside the apply call. After reading the question carefully, I think the following will work. Let's assume you have the following data structure:
list.data.partitions
  $dataset_1
    $training
    $testing
    $model # model created using the caret package
  $dataset_2
    $training
    $testing
    $model # model created using the caret package
Let's add $model to each dataset, since it is a one-to-one relationship; it makes sense to keep them together. I am assuming you built the model from $training and are going to test on $testing.
for (i in 1:length(list.data.partitions)) {
  list.data.partitions[[i]]$model <- list.models[[i]]
}
Assuming dataset 1 and 2 are not related, each dataset now has 3 elements (training, testing, and the model built from training; more on this later):
fun.predict.rf <- function(x, y) {
predict(x, newdata = y)
}
lapply(list.data.partitions, function(x){
  # if no model exists yet, you could create it here from x$training
  result <- fun.predict.rf(x$model, x$testing)
  # other things you want to do
  result
})
I believe the simple solution is to use mapply instead of lapply. Alternatively, you could store the model objects in the same list with the training and testing data sets and use lapply as suggested by Steven. Using a modified version of Richard Scriven's example data set with your list names:
set.seed(1)
dataset <- list(training = data.frame(replicate(4, rnorm(10))),
testing = data.frame(replicate(4, rexp(5))))
dataset1 <- list(training = data.frame(replicate(4, rnorm(10))),
testing = data.frame(replicate(4, rexp(5))))
dataset2 <- list(training = data.frame(replicate(4, rnorm(10))),
testing = data.frame(replicate(4, rexp(5))))
list.data.partitions <- c(replicate(2, dataset, simplify = FALSE),list(dataset1), list(dataset2))
names(list.data.partitions) <- paste0("dataset", seq(list.data.partitions))
This gives a list with two identical data sets followed by two unique data sets for explanatory purposes.
Then, creating your model object list with a basic linear fit:
list.models <- lapply(list.data.partitions, function(x) lm(X1 ~ X2+X3+X4, data = x$training))
With these two objects, use mapply:
fun.predict.rf <- function(x, y) {
predict(x, newdata = y$testing)
}
list.predictions <- mapply(fun.predict.rf, list.models, list.data.partitions)
list.predictions
dataset1 dataset2 dataset3 dataset4
1 -0.098696452 -0.098696452 0.09015207 -0.5004038
2 0.103316974 0.103316974 0.11770013 -0.7323202
3 -0.908623491 -0.908623491 -0.06951799 -0.8765770
4 -1.332241452 -1.332241452 -0.20407761 -0.5816534
5 -0.002156741 -0.002156741 -0.24583670 -0.7057936
Note that the first two data sets have identical predictions as we would expect and there are five predicted elements for each dataset, consistent with the number of testing elements.
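Note that mapply simplifies the result to a matrix here; if you prefer a true list with one prediction vector per dataset, Map() (or mapply(..., SIMPLIFY = FALSE)) does that:
list.predictions <- Map(fun.predict.rf, list.models, list.data.partitions)
str(list.predictions, max.level = 1)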
I think there was some confusion because it was not clear in your question that your model objects were stored in a separate list (list.models). Since you were passing lapply your list.models but specifying y=list.data.partitions, your function fun.predict.rf was being passed each model element sequentially, but your entire list.data.partitions with each call. There is no element list.data.partitions$testing, so you were actually specifying newdata = NULL, so the predict function ignored the newdata argument and used the data from the model object. Notice, if you use your lapply code and compare to predictions for individual training elements, they match:
list.predictions <- lapply(list.models, fun.predict.rf, y=list.data.partitions)
list.predictions
predict(list.models[[1]], newdata=list.data.partitions[[1]]$training)
predict(list.models[[2]], newdata=list.data.partitions[[2]]$training)
predict(list.models[[3]], newdata=list.data.partitions[[3]]$training)
predict(list.models[[4]], newdata=list.data.partitions[[4]]$training)
And if you change the data in the list.data.partitions, the lapply call still gives the same result while specifying the list.data.partitions$training data gives a different result:
list.data.partitions[[1]] <- list.data.partitions[[3]]
lapply(list.models, fun.predict.rf, y=list.data.partitions)
predict(list.models[[1]], newdata=list.data.partitions[[1]]$training)

Creating averaged time-bins from an existing dataframe

I have the following dataframe called 'EasyScaled';
str(EasyScaled)
'data.frame': 675045 obs. of 3 variables:
$ Trial : chr "1_easy.wav" "1_easy.wav" "1_easy.wav" "1_easy.wav" ...
$ TrialTime : num 3000 3001 3002 3003 3004 ...
$ PupilBaseCorrect: num 0.784 0.781 0.78 0.778 0.777 ...
The 'TrialTime' numeric variable denotes the time of each data point (3000 = 3000ms, 3001 = 3001 ms, etc.), 'PupilBaseCorrect' is my dependent variable, and the 'Trial' variable refers to the experimental trial.
I would like to create a new object which firstly divides my data into 3 time-bins (TimeBin1 = 3000-8000ms, TimeBin2 = 8001-13000ms, TimeBin3 = 13001 - 18000ms) and then calculate an average value for each timebin (for each trial) so that I would end up with something that looks like this (with the value given reflecting 'PupilBaseCorrect');
Trial TimeBin1 TimeBin2 TimeBin3
1_easy 0.784 0.876 0.767
34_easy 0.781 0.872 0.765
35_easy 0.78 0.871 0.762
...etc ...etc ...etc ....etc
I have tried using cut(), ddply() and some of the suggestions on this blog http://lamages.blogspot.co.uk/2012/01/say-it-in-r-with-by-apply-and-friends.html but haven't been able to find the correct code. I also tried this;
EasyTimeBin <- aggregate(PupilBaseCorrect ~ Trial + TrialTime[3000:8000, 8001:1300,1301:1800], data=EasyScaled, mean)
But got the following error;
Error in TrialTime[3000:8000, 8001:1300, 1301:1800] :
incorrect number of dimensions
Any suggestions or advice would be much appreciated.
cut() and ddply() are the right idea, but here's some vanilla R chicken scratch that will do what you need.
# Generate example data
EasyScaled <- data.frame(
Trial = paste0(c(sapply(1:3, function(x) rep(x, 9))), "_easy.wav"),
TrialTime = c(sapply(seq_len(9)-1, function(x) (floor(x/3))*5000 + x%%3 + 3000)),
PupilBaseCorrect = rnorm(27, 0.78, 0.1)
)
# group means of PupilBaseCorrect by Trial + filename
tmp <- tapply(EasyScaled$PupilBaseCorrect,
paste0(EasyScaled$Trial, ',',
as.integer((EasyScaled$TrialTime - 3000)/5000)+1), mean)
# melt & recast the array manually into a dataframe
EasyTimeBin <- do.call(data.frame,
append(list(row.names = NULL,
Trial = gsub('.wav,.*','',names(tmp)[3*seq_len(length(tmp)/3)])),
structure(lapply(seq_len(3),
function(x) tmp[3*(seq_len(length(tmp)/3)-1) + x]
), .Names = paste0("TimeBin", seq_len(3))
)
)
)
print(EasyTimeBin)
# Trial TimeBin1 TimeBin2 TimeBin3
# 1 1_easy 0.7471973 0.7850524 0.8939581
# 2 2_easy 0.8096973 0.8390587 0.7757359
# 3 3_easy 0.8151430 0.7855042 0.8081268
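For completeness, a hedged sketch of the cut() + aggregate() route hinted at in the question, using the EasyScaled example generated above. Bin edges follow the question's definition (3000-8000, 8001-13000, 13001-18000), so 8000 lands in TimeBin1 here, whereas the tapply code above assigns it to the second bin:
EasyScaled$TimeBin <- cut(EasyScaled$TrialTime,
                          breaks = c(2999, 8000, 13000, 18000),
                          labels = paste0("TimeBin", 1:3))
binned <- aggregate(PupilBaseCorrect ~ Trial + TimeBin, data = EasyScaled, mean)
binned$Trial <- sub("\\.wav$", "", binned$Trial) # "1_easy.wav" -> "1_easy"
# spread bins into columns (named PupilBaseCorrect.TimeBin1, 2, 3)
EasyTimeBin <- reshape(binned, idvar = "Trial", timevar = "TimeBin",
                       direction = "wide")
print(EasyTimeBin)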
