I want to produce a pdf which shows multiple graphs, one for each NetworkTrackingPixelId.
I have a data frame similar to this:
> head(data)
NetworkTrackingPixelId Name Date Impressions
1 2421 Rubicon RTB 2014-02-16 168801
2 2615 Google RTB 2014-02-16 1215235
3 3366 OpenX RTB 2014-02-16 104419
4 3606 AppNexus RTB 2014-02-16 170757
5 3947 Pubmatic RTB 2014-02-16 68690
6 4299 Improve Digital RTB 2014-02-16 701
I was thinking to use a script similar to the one below:
# create a vector which stores the NetworkTrackingPixelIds
tp <- data %.%
  group_by(NetworkTrackingPixelId) %.%
  select(NetworkTrackingPixelId)
# create a for loop to print the line graphs
for (i in tp) {
  print(ggplot(data[which(data$NetworkTrackingPixelId == i), ],
               aes(x = Date, y = Impressions)) +
          geom_point() + geom_line())
}
I was expecting this command to produce many graphs, one for each NetworkTrackingPixelId. Instead, the result is a single graph which aggregates all the NetworkTrackingPixelIds.
Another thing I've noticed is that the variable tp is not a real vector.
> is.vector(tp)
[1] FALSE
Even if I force it..
tp <- as.vector(data %.%
  group_by(NetworkTrackingPixelId) %.%
  select(NetworkTrackingPixelId))
> is.vector(tp)
[1] FALSE
> str(tp)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1397 obs. of 1 variable:
$ NetworkTrackingPixelId: int 2421 2615 3366 3606 3947 4299 4429 4786 6046 6286 ...
- attr(*, "vars")=List of 1
..$ : symbol NetworkTrackingPixelId
- attr(*, "drop")= logi TRUE
- attr(*, "indices")=List of 63
..$ : int 24 69 116 162 205 253 302 351 402 454 ...
..$ : int 1 48 94 140 184 232 281 330 380 432 ...
[I've cut this output a bit]
- attr(*, "group_sizes")= int 29 29 2 16 29 1 29 29 29 29 ...
- attr(*, "biggest_group_size")= int 29
- attr(*, "labels")='data.frame': 63 obs. of 1 variable:
..$ NetworkTrackingPixelId: int 8799 2615 8854 8869 4786 7007 3947 9109 9126 9137 ...
..- attr(*, "vars")=List of 1
.. ..$ : symbol NetworkTrackingPixelId
Since I don't have your dataset, I will use the mtcars dataset to illustrate how to do this using dplyr and data.table. Both packages are among the finest examples of the split-apply-combine paradigm in R. Let me explain:
Step 1: Split data by gear
dplyr uses the function group_by.
data.table uses the argument by.
Step 2: Apply a function
dplyr uses do(), to which you can pass a function that operates on each piece x.
data.table evaluates the expression in j in the context of each piece.
Step 3: Combine
There is no combine step here, since we are saving the charts created to file.
library(dplyr)
library(ggplot2)  # for qplot() and ggsave()
# note: %.% was the pipe in early dplyr; current versions use %>%
mtcars %.%
  group_by(gear) %.%
  do(function(x){ggsave(
    filename = sprintf("gear_%s.pdf", unique(x$gear)), qplot(wt, mpg, data = x)
  )})
library(data.table)
mtcars_dt = data.table(mtcars)
mtcars_dt[,ggsave(
filename = sprintf("gear_%s.pdf", unique(gear)), qplot(wt, mpg)),
by = gear
]
UPDATE: To save all files into one pdf, here is a quick solution.
plots = mtcars %.%
group_by(gear) %.%
do(function(x) {
qplot(wt, mpg, data = x)
})
pdf('all.pdf')
invisible(lapply(plots, print))
dev.off()
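For readers on current dplyr, where %.% and do() no longer work this way, here is a rough equivalent of the one-pdf solution; a sketch assuming dplyr >= 0.8 (for group_split()) plus purrr and ggplot2:
library(dplyr)
library(purrr)
library(ggplot2)
# split mtcars into one tibble per gear, build one plot per piece,
# then print them all into a single pdf
plots <- mtcars %>%
  group_split(gear) %>%
  map(~ ggplot(.x, aes(wt, mpg)) + geom_point())
pdf("all.pdf")
walk(plots, print)
dev.off()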
I recently had a project that required producing a lot of individual pngs, one for each record, and I got a huge speed-up from some pretty simple parallelization. I am not sure if this is more performant than the dplyr or data.table techniques, but it may be worth trying:
require(foreach)
require(doParallel)
workers <- makeCluster(4)
registerDoParallel(workers)
foreach(i = seq_len(nrow(mtcars)), .packages = c('ggplot2')) %dopar% {
  j <- qplot(wt, mpg, data = mtcars[i, ])
  # note: rows sharing a gear value will overwrite the same file;
  # include i in the filename if you truly want one png per row
  png(file = paste(getwd(), '/images/', mtcars[i, c('gear')], '.png', sep = ''))
  print(j)
  dev.off()
}
stopCluster(workers)  # shut down the worker processes when finished
Unless I'm missing something, generating plots by a subsetting variable is very simple. You can use split(...) to split the original data into a list of data frames by NetworkTrackingPixelId, and then pass those to ggplot using lapply(...). Most of the code below is just to create a sample dataset.
# create sample data
set.seed(1)
names <- c("Rubicon","Google","OpenX","AppNexus","Pubmatic")
dates <- as.Date("2014-02-16")+1:10
df <- data.frame(NetworkTrackingPixelId=rep(1:5,each=10),
Name=sample(names,50,replace=T),
Date=dates,
Impressions=sample(1000:10000,50))
# end create sample data
pdf("plots.pdf")
invisible(lapply(split(df, df$NetworkTrackingPixelId),
       function(gg) print(ggplot(gg, aes(x = Date, y = Impressions)) +
         geom_point() + geom_line() +
         ggtitle(paste("NetworkTrackingPixelId:", unique(gg$NetworkTrackingPixelId))))))
dev.off()
This generates a pdf containing 5 plots, one for each NetworkTrackingPixelId.
I think you would be better off writing a function for plotting, then using lapply for every Network Tracking Pixel.
For example, your function might look like:
plot.function <- function(ntpid){
  sub = subset(dataset, dataset$networktrackingpixelid == ntpid)
  ggobj = ggplot(data = sub, aes(...)) + geom...   # fill in your aesthetics and geoms
  ggsave(filename = sprintf("%s.pdf", ntpid), plot = ggobj)
}
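You could then drive it with lapply over the unique ids, e.g. (assuming your data frame is called dataset, as in the sketch above):
# one pdf per NetworkTrackingPixelId
lapply(unique(dataset$networktrackingpixelid), plot.function)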
It would be helpful for you to post a reproducible example, but I hope this works! Not sure about the vector issue, though...
Cheers!
Related
Can someone please explain how to get the list of built-in datasets and the packages they come from?
There are several ways to find the included datasets in R:
1: Using data() will give you a list of the datasets of all loaded packages (and not only the ones from the datasets package); the datasets are ordered by package
2: Using data(package = .packages(all.available = TRUE)) will give you a list of all datasets in the available packages on your computer (i.e. also the not-loaded ones)
3: Using data(package = "packagename") will give you the datasets of that specific package, so data(package = "plyr") will give the datasets in the plyr package
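All three are plain calls to data():
data()                                           # datasets in all loaded packages
data(package = .packages(all.available = TRUE))  # datasets in all installed packages
data(package = "plyr")                           # datasets in the plyr package only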
If you want to know in which package a dataset is located (e.g. the acme dataset), you can do:
dat <- as.data.frame(data(package = .packages(all.available = TRUE))$results)
dat[dat$Item=="acme", c(1,3,4)]
which gives:
Package Item Title
107 boot acme Monthly Excess Returns
I often also need to know the structure of the available datasets, so I created dataStr in my misc package.
dataStr <- function(package="datasets", ...)
{
  # names of all datasets in the package
  d <- data(package=package, envir=new.env(), ...)$results[,"Item"]
  # entries like "beaver1 (beavers)" -> keep only the first word
  d <- sapply(strsplit(d, split=" ", fixed=TRUE), "[", 1)
  d <- d[order(tolower(d))]
  # print each dataset's class and structure
  for(x in d){ message(x, ": ", class(get(x))); message(str(get(x)))}
}
dataStr()
Please mind that the output in the console is quite long.
This is the type of output:
[...]
warpbreaks: data.frame
'data.frame': 54 obs. of 3 variables:
$ breaks : num 26 30 54 25 70 52 51 26 67 18 ...
$ wool : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
$ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
WorldPhones: matrix
num [1:7, 1:7] 45939 60423 64721 68484 71799 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:7] "1951" "1956" "1957" "1958" ...
..$ : chr [1:7] "N.Amer" "Europe" "Asia" "S.Amer" ...
WWWusage: ts
Time-Series [1:100] from 1 to 100: 88 84 85 85 84 85 83 85 88 89 ...
Edit: To get more informative output and use it for unloaded packages or all the packages on the search path, please use the revised online version with
source("https://raw.githubusercontent.com/brry/berryFunctions/master/R/dataStr.R")
Here is a comprehensive list of datasets in R packages, maintained by Prof. Vincent Arel-Bundock.
https://vincentarelbundock.github.io/Rdatasets/
Rdatasets is a collection of 1892 datasets that were originally
distributed alongside the statistical software environment R and some
of its add-on packages. The goal is to make these data more broadly
accessible for teaching and statistical software development.
Run
help(package = "datasets")
in the RStudio console and you'll get all available datasets in the tidy Help tab on the right.
I have multiple data frames (moving temperatures of different durations at 130 observation points) and want to generate monthly averages for all of them by applying the code below to each data frame, then putting the outcomes into one data frame. I have been trying to do this with a for loop, but am not getting anywhere. I'm relatively new to R and would really appreciate it if someone could help me get through this.
Here is the glimpse of a data frame:
head(maxT2016[,1:5])
X X0 X1 X2 X3
1 20160101 26.08987 26.08987 26.08987 26.08987
2 20160102 25.58242 25.58242 25.58242 25.58242
3 20160103 25.44290 25.44290 25.44290 25.44290
4 20160104 26.88043 26.88043 26.88043 26.88043
5 20160105 26.60278 26.60278 26.60278 26.60278
6 20160106 24.87676 24.87676 24.87676 24.87676
str(maxT2016)
'data.frame': 274 obs. of 132 variables:
$ X : int 20160101 20160102 20160103 20160104 20160105 20160106 20160107 20160108 20160109 20160110 ...
$ X0 : num 26.1 25.6 25.4 26.9 26.6 ...
$ X1 : num 26.1 25.6 25.4 26.9 26.6 ...
$ X2 : num 26.1 25.6 25.4 26.9 26.6 ...
$ X3 : num 26.1 25.6 25.4 26.9 26.6 ...
Here is the code that I use for individual data frame:
library(dplyr)
library(lubridate)
library(tidyverse)
maxT10$X <- as.Date(as.character(maxT10$X), format="%Y%m%d")
monthlyAvr <- maxT10 %>%
  group_by(month=floor_date(X, "month")) %>%
  summarise(across(X0:X130, mean, na.rm=TRUE)) %>%
  slice_tail(n=6) %>%
  select(-month)
monthlyAvr2 <- as.data.frame(t(monthlyAvr))
colnames(monthlyAvr2) <- c("meanT_Apr", "meanT_May", "meanT_Jun", "meanT_Jul", "meanT_Aug",
                           "meanT_Sep")
Essentially, I want to put all the data frames into a list, run the function over each data frame, and then sort the outputs into one data frame. I came across the lapply function as an alternative, but felt somewhat more comfortable with a for loop.
d = list(maxT10, maxT20, maxT30, maxT60 ... ...)
for (i in 1:length(d)){
}
MonthlyAvrT <- cbind(maxT10, maxT20, maxT30, maxT60... ... )
Basil. Welcome to StackOverflow.
I was wary of lapply when I first started using R, but you should stick with it. It's almost always more efficient than using a for loop. In your particular case, you can put your individual data frames in a list, and the code you run on each into a function, myFunc say, which takes the data frame to process as its argument.
Then you can simply say
allData <- bind_rows(lapply(dataFrameList, myFunc))
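Here, myFunc could simply wrap your per-frame code; a sketch, assuming every frame shares the maxT2016 layout (a YYYYMMDD integer column X plus temperature columns X0:X130):
library(dplyr)
library(lubridate)
myFunc <- function(df) {
  # convert the YYYYMMDD integer column to Date, then average by month
  df$X <- as.Date(as.character(df$X), format = "%Y%m%d")
  df %>%
    group_by(month = floor_date(X, "month")) %>%
    summarise(across(X0:X130, mean, na.rm = TRUE)) %>%
    slice_tail(n = 6) %>%
    select(-month)
}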
Incidentally, your column names make me think your data isn't yet tidy. I'd suggest you spend a little time making it so before you do much else. It will save you a huge amount of effort in the long run.
The logic in pseudo-code would be:
for each data.frame in list
apply a function
save the results
Applying my_function on each data.frame of the data_set list :
my_function <- function(my_df) {
my_df <- as.data.frame(my_df)
out <- apply(my_df, 2, mean) # compute mean on dimension 2 (columns)
return(out)
}
# 100 data.frames
data_set <- replicate(100, data.frame(X=runif(6, 20160101, 20160131), X0=rnorm(6, 25)))
> dim(data_set)
[1] 2 100
results <- apply(data_set, 2, my_function) # Apply my_function on dimension 2
# Output for first 5 data.frames
> results[, 1:5]
[,1] [,2] [,3] [,4] [,5]
X 2.016012e+07 2.016011e+07 2.016011e+07 2.016012e+07 2.016011e+07
X0 2.533888e+01 2.495086e+01 2.523087e+01 2.491822e+01 2.482142e+01
I would like to plot a shapefile loaded using read.shp from the fastshp package. However, the read.shp function returns a list of lists, not a data.frame. I'm unsure which part of the list I need to extract to obtain a correctly formatted data.frame object. This exact question has already been asked on Stack Overflow; however, that solution (from more than 7 years ago) no longer seems to work. Any help is much appreciated.
remotes::install_github("s-u/fastshp") #fastshp not on CRAN
library(ggplot2); library(fastshp); library(magrittr)  # magrittr supplies %>%
temp <- tempfile()
temp2 <- tempfile()
download.file("https://www2.census.gov/geo/tiger/TIGER2017/COUNTY/tl_2017_us_county.zip",temp)
unzip(zipfile = temp, exdir = temp2)
shp <- list.files(temp2, pattern = ".shp$",full.names=TRUE) %>% read.shp(.)
shp is a list of lists containing a plethora of information. I tried the following solution from the SO post mentioned earlier, but to no avail:
shp.list <- sapply(shp, FUN = function(x) Polygon(cbind(lon = x$x, lat = x$y))) # throws an error: cbind(lon = x$x, lat = x$y) returns NULL
shp.poly <- Polygons(shp.list, "area")
shp.df <- fortify(shp.poly, region = "area")
I also tried the following:
shp.list <- sapply(shp, FUN = function(x) do.call(cbind, x[c("id","x","y")])) #returns NULL value here...
shp.df <- as.data.frame(do.call(rbind, shp.list))
Updated: Still no luck but closer:
file_shp<-list.files(temp2, pattern = ".shp$",full.names=TRUE) %>%
read.shp(., format = c("table"))
ggplot() +
geom_polygon(data = file_shp, aes(x = x, y = y, group = part),
colour = "black", fill = NA)
Looks like the projection is off. I'm not sure how to order the data to map correctly, and also not sure how to read in the CRS data. Tried the following to no avail:
file_prj<-list.files(temp2, pattern = ".prj$",full.names=TRUE) %>%
proj4string(.)
I tried to use the census data you have in your script. However, RStudio somehow kept crashing when I applied read.shp() to the polygon data. Hence, I decided to use the example from the help page of read.shp(), which is also census data. I hope you do not mind. It took some time to figure out how to draw a map from an object of class shp, so let me explain what I went through, step by step.
This part is from the help page. I am basically getting the shapefile and importing it as an shp object.
# Census 2010 TIGER/Line(TM) state shapefile
library(fastshp)
fn <- system.file("shp", "tl_2010_us_state10.shp.xz", package="fastshp")
s <- read.shp(xzfile(fn, "rb"))
Let's see what this object, s, looks like. It contains 52 lists. In each list there are six vectors. id is a unique integer representing a state. x is longitude and y is latitude. The nasty part was parts. In the example below there is only one number, which means there is only one polygon in this state. But some other lists (states) have multiple numbers. These numbers are indices which indicate where new polygons begin in the data.
#> str(s)
#List of 52
# $ :List of 6
# ..$ id : int 1
# ..$ type : int 5
# ..$ box : num [1:4] -111 41 -104 45
# ..$ parts: int 0
# ..$ x : num [1:9145] -109 -109 -109 -109 -109 ...
# ..$ y : num [1:9145] 45 45 45 45 45 ...
Here is the one for Alaska. As you see, there are some numbers in parts. These numbers indicate where new polygon data begin. Alaska has many small islands, hence the need to mark the separate polygons in the data. We will come back to this later when we create the data frames.
#List of 6
# $ id : int 18
# $ type : int 5
# $ box : num [1:4] -179.2 51.2 179.9 71.4
# $ parts: int [1:50] 0 52 88 127 175 207 244 306 341 375 ...
# $ x : num [1:14033] 177 177 177 177 177 ...
# $ y : num [1:14033] 52.1 52.1 52.1 52.1 52.1 ...
What we need is the following. For each list, we need to extract longitude (i.e., x), latitude (i.e., y), and id in order to create a data frame for one state. In addition, we need to use parts so that we can mark all polygons with unique IDs. We need to create a new group variable which contains a unique ID value for each polygon. I used findInterval(), which takes indices, to create the group variable. One tricky part was that we need left.open = TRUE in findInterval() to create the group variable correctly. (This gave me a hard time figuring out what was going on.) The map_dfr() part below handles the job I just described.
library(tidyverse)
map_dfr(.x = s,
.f = function(mylist){
temp <- data.frame(id = mylist$id,
lon = mylist$x,
lat = mylist$y)
ind <- mylist$parts
out <- mutate(temp,
subgroup = findInterval(x = 1:n(), vec = ind, left.open = TRUE),
group = paste(id, subgroup, sep = "_"))
return(out)
}) -> test
Once we have test, we have another job. Some longitude points of Alaska stay in positive numbers (e.g., 179.85). As long as we have numbers like this, ggplot2 draws funny long lines, which you can see even in your example. What we need is to convert these positive numbers to negative ones so that ggplot2 can draw a proper map.
mutate(test,
lon = if_else(lon > 0, lon * -1, lon)) -> out
By this time, out looks like this.
id lon lat subgroup group
1 1 -108.6213 45.00028 1 1_1
2 1 -108.6197 45.00028 1 1_1
3 1 -108.6150 45.00031 1 1_1
4 1 -108.6134 45.00032 1 1_1
5 1 -108.6133 45.00032 1 1_1
6 1 -108.6130 45.00032 1 1_1
Now we are ready to draw a map.
ggplot() +
geom_polygon(data = out, aes(x = lon, y = lat, group = group))
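If the proportions look stretched, ggplot2's coord_quickmap() gives a quick approximate aspect-ratio correction for unprojected longitude/latitude data:
ggplot() +
  geom_polygon(data = out, aes(x = lon, y = lat, group = group)) +
  coord_quickmap()  # approximate aspect-ratio fix for lat/lon coordinates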
Excel allows you to switch rows and columns in its Chart functionality.
I am trying to replicate this in R. My data (shown below) has production for each company in rows. I am unable to figure out how to show Month-1, Month-2, etc. on the x-axis, with a series for each company in the same graph. Any help appreciated.
Data:
tibble::tribble(
  ~Company.Name,  ~`Month-1`,  ~`Month-2`,  ~`Month-3`,  ~`Month-4`,
  "Comp-1",      945.5438986, 1081.417009, 976.7388701,  864.309703,
  "Comp-2",         16448.87,    13913.19,    12005.28,    10605.32,
  "Comp-3",      346.9689321, 398.2297592, 549.1282647, 550.4207169,
  "Comp-4",      748.8806367,  949.463941, 1018.877481, 932.3773791
)
I'm going to skip the part where you want to transpose, and infer that your purpose for that was solely to help with plotting. The part I'm focusing on here is "show the Month-1, Month-2 etc in x-axis, and the series for each company in the same graph".
This is doable in base graphics, but I highly recommend using ggplot2 (or plotly or similar), due to its ease of dealing with dimensional plots like this. The "grammar of graphics" (which both tend to implement) really prefers data like this be in a "long" format, so part of what I'll do is convert to this format.
First, some data:
set.seed(2)
months <- paste0("Month", 1:30)
companies <- paste0("Comp", 1:5)
m <- matrix(abs(rnorm(length(months)*length(companies), sd=1e3)),
nrow = length(companies))
d <- cbind.data.frame(
Company = companies,
m,
stringsAsFactors = FALSE
)
colnames(d)[-1] <- months
str(d)
# 'data.frame': 5 obs. of 31 variables:
# $ Company: chr "Comp1" "Comp2" "Comp3" "Comp4" ...
# $ Month1 : num 896.9 184.8 1587.8 1130.4 80.3
# $ Month2 : num 132 708 240 1984 139
# $ Month3 : num 418 982 393 1040 1782
# $ Month4 : num 2311.1 878.6 35.8 1012.8 432.3
# (truncated)
Reshaping can be done with multiple libraries, including base R, here are two techniques:
library(data.table)
d2 <- melt(as.data.table(d), id = 1, variable.name = "Month", value.name = "Cost")
d2[,Month := as.integer(gsub("[^0-9]", "", Month)),]
d2
# Company Month Cost
# 1: Comp1 1 896.91455
# 2: Comp2 1 184.84918
# 3: Comp3 1 1587.84533
# 4: Comp4 1 1130.37567
# 5: Comp5 1 80.25176
# ---
# 146: Comp1 30 653.67306
# 147: Comp2 30 657.10598
# 148: Comp3 30 549.90924
# 149: Comp4 30 806.72936
# 150: Comp5 30 997.37972
library(dplyr)
# library(tidyr)
d2 <- tbl_df(d) %>%
tidyr::gather(Month, Cost, -Company) %>%
mutate(Month = as.integer(gsub("[^0-9]", "", Month)))
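On current tidyr, gather() is superseded by pivot_longer(); a rough equivalent, assuming tidyr >= 1.0:
library(dplyr)
library(tidyr)
# pivot all month columns into long format, then strip "Month" to an integer
d2 <- as_tibble(d) %>%
  pivot_longer(-Company, names_to = "Month", values_to = "Cost") %>%
  mutate(Month = as.integer(gsub("[^0-9]", "", Month)))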
I also integerized the Month, since that makes sense for an ordinal variable. This isn't strictly necessary; the plot would just treat the months as discrete values.
The plot is anti-climactically simple:
library(ggplot2)
ggplot(d2, aes(Month, Cost, group=Company)) +
geom_line(aes(color = Company))
Bottom line: I don't think you need to worry about transposing your data; doing so has many complications that can just confuse things. Reshaping is a good thing (in my opinion), and with this kind of data it is fast enough that, even if your data is stored in the wide format, you can re-transform it without too much difficulty. (If you are thinking about putting this in a database, however, I'd strongly recommend you re-think "wide": your db schema will be challenging if you keep it.)
Given a list (list.data.partitions) with 72 elements (dataset_1, dataset_2, etc.), each of which contains two sub-elements (2 dataframes), $training and $testing; e.g.:
> str(list.data.partitions$dataset_1)
List of 2
$ training:'data.frame': 81 obs. of 20 variables:
..$ b0 : num [1:81] 11.61 9.47 10.61 7.34 12.65 ...
..$ b1 : num [1:81] 11.6 9.94 10.7 10.11 12.2 ...
..$ b2 : num [1:81] 34.2 31 32.7 27.9 36.1 ...
...
..$ index: num [1:81] 0.165 0.276 0.276 0.181 0.201 ...
$ testing :'data.frame': 19 obs. of 20 variables:
..$ b0 : num [1:19] 6.05 12.4 13.99 16.82 8.8 ...
..$ b1 : num [1:19] 12.4 10.8 11.8 13.7 16.3 ...
..$ b2 : num [1:19] 25.4 29.8 31.2 34.1 27.3 ...
...
..$ index: num [1:19] 0.143 1.114 0.201 0.529 1.327 ...
How would I correctly access the $testing dataframe using lapply (or similar functionality) and caret's predict function below:
fun.predict.rf <- function(x, y) {
predict(x, newdata = y$testing)
}
list.predictions <- lapply(list.models, fun.predict.rf, y=list.data.partitions)
The above function "works", but it returns predictions based on the $training dataframe (~80 obs) instead of the $testing dataframe (~20 obs) that was specified. Ultimately, I'd expect a list containing predictions for each of the elements in my list, based on the $testing dataframe.
list.models is a list of 72 models based on the $training dataframe using the caret package in R (not shown or included). The number of models (72) in list.models equals the number of elements (72) in list.data.partitions when considering a single sub-element (either $training or $testing). The name of each of the 72 elements in list.data.partitions differs like so: dataset_1, dataset_2, etc., but the structure is identical (see str output above).
list.data.partitions can be downloaded here. In this version, the 72 elements do not have names, but in my version the 72 elements are named (e.g., dataset_1, dataset_2, etc). Each of the sub-elements are still named $training and $testing.
You can declare a function within lapply.
After reading the question carefully, I think the following might work.
Let's assume you have the following data structure:
list.data.partitions
..$dataset_1
..$training
..$testing
..$model # model created using the caret package
..$dataset_2
..$training
..$testing
..$model # model created using the caret package
Let's add $model to each dataset, since it is a one-to-one relationship; it makes sense to keep them together. I assume you built the model from $training and are going to test on $testing.
for (i in 1:length(list.data.partitions)) {
  list.data.partitions[[i]]$model <- list.models[[i]]
}
Assuming dataset 1 and 2 are not related, and each dataset has 3 elements (training, testing, and a model built from training; more on this later):
fun.predict.rf <- function(x, y) {
predict(x, newdata = y)
}
lapply(list.data.partitions, function(x){
  # if no model exists yet, you can create it here from x$training
  result <- fun.predict.rf(x$model, x$testing)
  # other things you want to do
  result
})
I believe the simple solution is to use mapply instead of lapply. Alternatively, you could store the model objects in the same list as the training and testing data sets and use lapply, as suggested by Steven. Using a modified version of Richard Scriven's example data set with your list names:
set.seed(1)
dataset <- list(training = data.frame(replicate(4, rnorm(10))),
testing = data.frame(replicate(4, rexp(5))))
dataset1 <- list(training = data.frame(replicate(4, rnorm(10))),
testing = data.frame(replicate(4, rexp(5))))
dataset2 <- list(training = data.frame(replicate(4, rnorm(10))),
testing = data.frame(replicate(4, rexp(5))))
list.data.partitions <- c(replicate(2, dataset, simplify = FALSE),list(dataset1), list(dataset2))
names(list.data.partitions) <- paste0("dataset", seq(list.data.partitions))
This gives a list with two identical data sets followed by two unique data sets for explanatory purposes.
Then, creating your model object list with a basic linear fit:
list.models <- lapply(list.data.partitions, function(x) lm(X1 ~ X2+X3+X4, data = x$training))
With these two objects, use mapply:
fun.predict.rf <- function(x, y) {
predict(x, newdata = y$testing)
}
list.predictions <- mapply(fun.predict.rf, list.models, list.data.partitions)
list.predictions
dataset1 dataset2 dataset3 dataset4
1 -0.098696452 -0.098696452 0.09015207 -0.5004038
2 0.103316974 0.103316974 0.11770013 -0.7323202
3 -0.908623491 -0.908623491 -0.06951799 -0.8765770
4 -1.332241452 -1.332241452 -0.20407761 -0.5816534
5 -0.002156741 -0.002156741 -0.24583670 -0.7057936
Note that the first two data sets have identical predictions as we would expect and there are five predicted elements for each dataset, consistent with the number of testing elements.
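One caveat: mapply() has simplified the result into the matrix printed above. If you want an actual list, with one prediction vector per dataset, use Map() (or mapply() with SIMPLIFY = FALSE):
# keeps the result as a true list, one element per dataset
list.predictions <- Map(fun.predict.rf, list.models, list.data.partitions)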
I think there was some confusion because it was not clear in your question that your model objects were stored in a separate list (list.models). Since you were passing lapply your list.models but specifying y = list.data.partitions, your function fun.predict.rf was passed each model element in turn, but the entire list.data.partitions with every call. There is no element list.data.partitions$testing, so you were actually specifying newdata = NULL; the predict function therefore ignored the newdata argument and used the data stored in the model object. Notice that if you use your lapply code and compare with the predictions for the individual training elements, they match:
list.predictions <- lapply(list.models, fun.predict.rf, y=list.data.partitions)
list.predictions
predict(list.models[[1]], newdata=list.data.partitions[[1]]$training)
predict(list.models[[2]], newdata=list.data.partitions[[2]]$training)
predict(list.models[[3]], newdata=list.data.partitions[[3]]$training)
predict(list.models[[4]], newdata=list.data.partitions[[4]]$training)
And if you change the data in the list.data.partitions, the lapply call still gives the same result while specifying the list.data.partitions$training data gives a different result:
list.data.partitions[[1]] <- list.data.partitions[[3]]
lapply(list.models, fun.predict.rf, y=list.data.partitions)
predict(list.models[[1]], newdata=list.data.partitions[[1]]$training)