best way to store many training data on R - r

I want to randomly split my dataset in R 100 times and see which training and testing sets give the best model result. How should I store these data so I can compare the prediction results? Should I make a separate variable for each training and testing set, or save them in an array? I'm pretty new to R, so I don't really know the best way to do it. I'm using RStudio 1.1.423.
This is how I randomize the data; I use the holdout function from the rminer package:
H=holdout(myData$salary, ratio = 2/3, mode = "random")
trainData <- myData[H$tr,]
testData <- myData[H$ts,]
trainData and testData are the variables I made to store the training and testing data. myData is my dataset.

Whenever I deal with multiple frames of the same structure, I tend to put them into a list and do "one thing" to everything in that list. A good reference for this can be found here: How do I make a list of data frames?.
In this example, there are a couple of ways to proceed. I don't have your data, so I'll use mtcars:
dat <- mtcars[1:3]
ntrain <- (2/3) * nrow(dat)
n <- 3 # 100 for you?
Reproducibility is important, but hard-coding set.seed can be problematic (academically, at least), so here's a randomly-generated seed that we track/store:
(seed <- sample(.Machine$integer.max, size=1L))
seed
# [1] 558990070
I like to store the indices for easy recall later.
set.seed(seed)
inds <- replicate(n, sample(nrow(dat), size=ntrain), simplify=FALSE)
str(inds)
# List of 3
# $ : int [1:21] 22 32 15 16 30 20 21 3 14 1 ...
# $ : int [1:21] 6 11 17 24 22 9 15 4 10 21 ...
# $ : int [1:21] 23 26 4 21 14 10 20 17 32 28 ...
Now these can be used easily to generate your training and test sets:
trains <- lapply(inds, function(i) dat[i,,drop=FALSE])
tests <- lapply(inds, function(i) dat[-i,,drop=FALSE])
str(tests)
# List of 3
# $ :'data.frame': 11 obs. of 3 variables:
# ..$ mpg : num [1:11] 18.1 14.3 24.4 22.8 17.8 32.4 30.4 13.3 19.2 27.3 ...
# ..$ cyl : num [1:11] 6 8 4 4 6 4 4 8 8 4 ...
# ..$ disp: num [1:11] 225 360 147 141 168 ...
# $ :'data.frame': 11 obs. of 3 variables:
# ..$ mpg : num [1:11] 21 18.7 24.4 16.4 17.3 10.4 33.9 19.2 26 15.8 ...
# ..$ cyl : num [1:11] 6 8 4 8 8 8 4 8 4 8 ...
# ..$ disp: num [1:11] 160 360 147 276 276 ...
# $ :'data.frame': 11 obs. of 3 variables:
# ..$ mpg : num [1:11] 21 18.7 18.1 22.8 17.8 17.3 10.4 32.4 30.4 19.2 ...
# ..$ cyl : num [1:11] 6 8 6 4 6 8 8 4 4 8 ...
# ..$ disp: num [1:11] 160 360 225 141 168 ...
Alternatively, you can generate both train/test in each element, though I don't know if this adds much value:
both <- lapply(inds, function(i) list(ind=i, train=dat[i,,drop=FALSE], test=dat[-i,,drop=FALSE]))
str(both)
# List of 3
# $ :List of 3
# ..$ ind : int [1:21] 22 32 15 16 30 20 21 3 14 1 ...
# ..$ train:'data.frame': 21 obs. of 3 variables:
# .. ..$ mpg : num [1:21] 15.5 21.4 10.4 10.4 19.7 33.9 21.5 22.8 15.2 21 ...
# .. ..$ cyl : num [1:21] 8 4 8 8 6 4 4 4 8 6 ...
# .. ..$ disp: num [1:21] 318 121 472 460 145 ...
# ..$ test :'data.frame': 11 obs. of 3 variables:
# .. ..$ mpg : num [1:11] 18.1 14.3 24.4 22.8 17.8 32.4 30.4 13.3 19.2 27.3 ...
# .. ..$ cyl : num [1:11] 6 8 4 4 6 4 4 8 8 4 ...
# .. ..$ disp: num [1:11] 225 360 147 141 168 ...
# $ :List of 3
# ..$ ind : int [1:21] 6 11 17 24 22 9 15 4 10 21 ...
# ..$ train:'data.frame': 21 obs. of 3 variables:
# .. ..$ mpg : num [1:21] 18.1 17.8 14.7 13.3 15.5 22.8 10.4 21.4 19.2 21.5 ...
# .. ..$ cyl : num [1:21] 6 6 8 8 8 4 8 6 6 4 ...
# .. ..$ disp: num [1:21] 225 168 440 350 318 ...
# ..$ test :'data.frame': 11 obs. of 3 variables:
# .. ..$ mpg : num [1:11] 21 18.7 24.4 16.4 17.3 10.4 33.9 19.2 26 15.8 ...
# .. ..$ cyl : num [1:11] 6 8 4 8 8 8 4 8 4 8 ...
# .. ..$ disp: num [1:11] 160 360 147 276 276 ...
# $ :List of 3
# ..$ ind : int [1:21] 23 26 4 21 14 10 20 17 32 28 ...
# ..$ train:'data.frame': 21 obs. of 3 variables:
# .. ..$ mpg : num [1:21] 15.2 27.3 21.4 21.5 15.2 19.2 33.9 14.7 21.4 30.4 ...
# .. ..$ cyl : num [1:21] 8 4 6 4 8 6 4 8 4 4 ...
# .. ..$ disp: num [1:21] 304 79 258 120 276 ...
# ..$ test :'data.frame': 11 obs. of 3 variables:
# .. ..$ mpg : num [1:11] 21 18.7 18.1 22.8 17.8 17.3 10.4 32.4 30.4 19.2 ...
# .. ..$ cyl : num [1:11] 6 8 6 4 6 8 8 4 4 8 ...
# .. ..$ disp: num [1:11] 160 360 225 141 168 ...
From here, it's just a matter of running your model against the data:
library(randomForest)
results <- lapply(trains, function(x) randomForest(mpg~., data=x, ...))
(where ... are your other model parameters). Then something like:
# predict() takes the held-out rows via newdata=; with data= the test set would be ignored
validation <- mapply(function(result, test) predict(result, newdata=test, ...),
results, tests, SIMPLIFY=FALSE)
(You can certainly do more than just predict, perhaps checking yhat or similar.)
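To compare the splits, one option is to reduce each one to a single error number and keep the index of the smallest. The following is only a sketch under assumptions that are not in the original answer: lm stands in for the real model so the example needs no extra packages, RMSE on the held-out rows is the comparison metric, and the seed 42 is arbitrary.

```r
dat <- mtcars[1:3]
n <- 3                                   # 100 in the question
ntrain <- floor((2/3) * nrow(dat))

set.seed(42)                             # arbitrary seed, for illustration only
inds <- replicate(n, sample(nrow(dat), size = ntrain), simplify = FALSE)

# One RMSE per split: fit on the training rows, predict the held-out rows
rmse <- vapply(inds, function(i) {
  fit  <- lm(mpg ~ ., data = dat[i, , drop = FALSE])   # stand-in model (assumption)
  pred <- predict(fit, newdata = dat[-i, , drop = FALSE])
  sqrt(mean((dat$mpg[-i] - pred)^2))
}, numeric(1))

best <- which.min(rmse)                  # position of the split with the lowest error
best_train <- dat[inds[[best]], , drop = FALSE]
best_test  <- dat[-inds[[best]], , drop = FALSE]
```

Because only the indices are stored, any split can be rebuilt on demand, and rmse stays aligned with inds by position.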

Related

Is there a way to split data in r by column then run the same set of commands for each data set

I have a table of data made in excel that i converted to a txt file.
The command I'm using will only let me run it if I have only two columns. I transposed my data into columns but now I need to somehow split it all up so every column 2 to column 189 is a different table with column 1 staying the same in all.
Is it possible to then run the exact same set of commands over and over again for the 188 tables created and save the resulting data into a separate file (or better yet substitute some of the obtained values into an equation).
Sorry if the question is too long or ridiculously easy - I'm a complete newbie to anything beyond basic analysis.
Happy to try and learn other programs if it will solve my problem.
You can do the following in base R (I use the built-in mtcars data.frame as an example)
df <- mtcars
lst <- apply(rbind(1, 2:ncol(df)), 2, function(idx) df[, idx])
This returns a list of data.frames with columns (1,2), (1,3), (1,4) and so on, of the original data.frame.
str(lst)
#List of 10
# $ :'data.frame': 32 obs. of 2 variables:
# ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# ..$ cyl: num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
# $ :'data.frame': 32 obs. of 2 variables:
# ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# ..$ disp: num [1:32] 160 160 108 258 360 ...
# $ :'data.frame': 32 obs. of 2 variables:
# ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
# $ :'data.frame': 32 obs. of 2 variables:
# ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
# $ :'data.frame': 32 obs. of 2 variables:
# ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
# $ :'data.frame': 32 obs. of 2 variables:
# ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
# $ :'data.frame': 32 obs. of 2 variables:
# ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
# $ :'data.frame': 32 obs. of 2 variables:
# ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
# $ :'data.frame': 32 obs. of 2 variables:
# ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
# $ :'data.frame': 32 obs. of 2 variables:
# ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
It's now easy to operate on the list of data.frames using a function from the *apply family.
To generate the basic combinations, you could use Map:
Map(cbind, df[1], df[-1])
To apply a function to each combination you would need to edit the function a bit:
Map(function(a,b) fun(cbind(a,b)), df[1], df[-1])
Or add another level of looping with lapply if you want to keep the code compact.
lapply(Map(cbind, df[1], df[-1]), fun)
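As a sketch of the "run the same commands on every table and collect the results" step, the example below builds the two-column tables by column name, runs one analysis per table, and gathers the values into a single data frame. The per-table analysis (the slope of a simple linear regression) and the names pairs, slopes and out are illustrative assumptions, not part of the question.

```r
df <- mtcars
first  <- names(df)[1]
others <- names(df)[-1]

# Named list of two-column data frames: (first, cyl), (first, disp), ...
pairs <- lapply(setNames(others, others), function(nm) df[c(first, nm)])

# Hypothetical analysis: slope of second column regressed on the first
slopes <- vapply(pairs, function(p) {
  unname(coef(lm(p[[2]] ~ p[[1]]))[2])
}, numeric(1))

# One row per table, ready for write.csv() or further calculation
out <- data.frame(column = names(slopes), slope = slopes, row.names = NULL)
```

Writing the collected values with write.csv(out, "results.csv", row.names = FALSE) would give one file for all tables; the file name here is only a placeholder.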

append n identical dataframes with id [duplicate]

This question already has answers here:
Repeat rows of a data.frame N times
(10 answers)
Closed 4 years ago.
I want to append n identical data frames to each other. This works if n=2:
> d = data.frame(a=1:2)
> dplyr::bind_rows(d,d, .id="id")
# id a
# 1 1
# 1 2
# 2 1
# 2 2
But I don't know how to extend this to larger values of n, without manually typing something like dplyr::bind_rows(d,d,d, .id="id") for n = 3. Is there some smart way to programmatically feed a list of d with length=n to the bind_rows command? This doesn't work: dplyr::bind_rows(rep(d,3), .id="id").
Also - is there a data.table solution?
Here's a solution using data.table::rbindlist():
library(data.table)
l <- list(mtcars, mtcars*2, mtcars*3)
DATA
# Check l
> str(l)
List of 3
$ :'data.frame': 32 obs. of 11 variables:
..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
..$ disp: num [1:32] 160 160 108 258 360 ...
..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
$ :'data.frame': 32 obs. of 11 variables:
..$ mpg : num [1:32] 42 42 45.6 42.8 37.4 36.2 28.6 48.8 45.6 38.4 ...
..$ cyl : num [1:32] 12 12 8 12 16 12 16 8 8 12 ...
..$ disp: num [1:32] 320 320 216 516 720 ...
..$ hp : num [1:32] 220 220 186 220 350 210 490 124 190 246 ...
..$ drat: num [1:32] 7.8 7.8 7.7 6.16 6.3 5.52 6.42 7.38 7.84 7.84 ...
..$ wt : num [1:32] 5.24 5.75 4.64 6.43 6.88 6.92 7.14 6.38 6.3 6.88 ...
..$ qsec: num [1:32] 32.9 34 37.2 38.9 34 ...
..$ vs : num [1:32] 0 0 2 2 0 2 0 2 2 2 ...
..$ am : num [1:32] 2 2 2 0 0 0 0 0 0 0 ...
..$ gear: num [1:32] 8 8 8 6 6 6 6 8 8 8 ...
..$ carb: num [1:32] 8 8 2 2 4 2 8 4 4 8 ...
$ :'data.frame': 32 obs. of 11 variables:
..$ mpg : num [1:32] 63 63 68.4 64.2 56.1 54.3 42.9 73.2 68.4 57.6 ...
..$ cyl : num [1:32] 18 18 12 18 24 18 24 12 12 18 ...
..$ disp: num [1:32] 480 480 324 774 1080 ...
..$ hp : num [1:32] 330 330 279 330 525 315 735 186 285 369 ...
..$ drat: num [1:32] 11.7 11.7 11.55 9.24 9.45 ...
..$ wt : num [1:32] 7.86 8.62 6.96 9.64 10.32 ...
..$ qsec: num [1:32] 49.4 51.1 55.8 58.3 51.1 ...
..$ vs : num [1:32] 0 0 3 3 0 3 0 3 3 3 ...
..$ am : num [1:32] 3 3 3 0 0 0 0 0 0 0 ...
..$ gear: num [1:32] 12 12 12 9 9 9 9 12 12 12 ...
..$ carb: num [1:32] 12 12 3 3 6 3 12 6 6 12 ...
CODE & OUTPUT
dat <- rbindlist(l, use.names = TRUE, fill = TRUE)
# Verify if data looks like what we want
> str(dat)
Classes ‘data.table’ and 'data.frame': 96 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
- attr(*, ".internal.selfref")=<externalptr>
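For the specific case in the question (n copies of the same data frame, with an id column), a possible sketch is to wrap the data frame in a list before repeating it. rep(d, 3) repeats the columns of d, whereas rep(list(d), 3) gives a list of three data frames. The variable names below are illustrative, and the dplyr and data.table calls only run if those packages are installed.

```r
d <- data.frame(a = 1:2)
n <- 3

copies <- rep(list(d), n)   # list of n data frames

# Base R: stack the copies and add the id by hand
stacked <- cbind(id = rep(seq_len(n), each = nrow(d)),
                 do.call(rbind, copies))

# Same idea with dplyr, if available
if (requireNamespace("dplyr", quietly = TRUE)) {
  stacked_dplyr <- dplyr::bind_rows(copies, .id = "id")
}

# Same idea with data.table, if available
if (requireNamespace("data.table", quietly = TRUE)) {
  stacked_dt <- data.table::rbindlist(copies, idcol = "id")
}
```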

writing multiple variables in a short way in r

I have a list of data.frames from the years 2005 - 2016. They are all named the same way, except for the digits of the year:
m =list(X2016_kvish_1_10t = X2016_kvish_1_10t, X2015_kvish_1_10t = X2015_kvish_1_10t, X2014_kvish_1_10t = X2014_kvish_1_10t,
X2013_kvish_1_10t = X2013_kvish_1_10t, X2012_kvish_1_10t = X2012_kvish_1_10t, X2011_kvish_1_10t = X2011_kvish_1_10t,
X2010_kvish_1_10t = X2010_kvish_1_10t, X2009_kvish_1_10t = X2009_kvish_1_10t, X2008_kvish_1_10t = X2008_kvish_1_10t,
X2007_kvish_1_10t = X2007_kvish_1_10t, X2006_kvish_1_10t = X2006_kvish_1_10t, X2005_kvish_1_10t = X2005_kvish_1_10t)
Is there a shorter way to write it, without needing to write all of them separately?
Try mget:
df_names = paste0("X", 2005:2016, "_kvish_1_10t")
m = mget(df_names)
EDIT
As @d.b points out, you don't even need to create df_names
m = mget(ls(pattern="_kvish_1_10t$"))
You can use mget function providing a character vector of the objects names in your workspace.
I made a reproducible example for the purpose of showing how to do it.
df_name <- paste0("x", 2005:2016, "_kvish_1_10t")
df_name
#> [1] "x2005_kvish_1_10t" "x2006_kvish_1_10t" "x2007_kvish_1_10t"
#> [4] "x2008_kvish_1_10t" "x2009_kvish_1_10t" "x2010_kvish_1_10t"
#> [7] "x2011_kvish_1_10t" "x2012_kvish_1_10t" "x2013_kvish_1_10t"
#> [10] "x2014_kvish_1_10t" "x2015_kvish_1_10t" "x2016_kvish_1_10t"
# just create some dummy tables for the example
l <- lapply(df_name, assign, value = mtcars[1:2], envir= .GlobalEnv)
# Use mget to get a list of all the object
m <- mget(df_name, envir = .GlobalEnv)
str(m)
#> List of 12
#> $ x2005_kvish_1_10t:'data.frame': 32 obs. of 2 variables:
#> ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> ..$ cyl: num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
#> $ x2006_kvish_1_10t:'data.frame': 32 obs. of 2 variables:
#> ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> ..$ cyl: num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
#> $ x2007_kvish_1_10t:'data.frame': 32 obs. of 2 variables:
#> ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> ..$ cyl: num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
#> $ x2008_kvish_1_10t:'data.frame': 32 obs. of 2 variables:
#> ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> ..$ cyl: num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
#> $ x2009_kvish_1_10t:'data.frame': 32 obs. of 2 variables:
#> ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> ..$ cyl: num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
#> $ x2010_kvish_1_10t:'data.frame': 32 obs. of 2 variables:
#> ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> ..$ cyl: num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
#> $ x2011_kvish_1_10t:'data.frame': 32 obs. of 2 variables:
#> ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> ..$ cyl: num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
#> $ x2012_kvish_1_10t:'data.frame': 32 obs. of 2 variables:
#> ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> ..$ cyl: num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
#> $ x2013_kvish_1_10t:'data.frame': 32 obs. of 2 variables:
#> ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> ..$ cyl: num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
#> $ x2014_kvish_1_10t:'data.frame': 32 obs. of 2 variables:
#> ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> ..$ cyl: num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
#> $ x2015_kvish_1_10t:'data.frame': 32 obs. of 2 variables:
#> ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> ..$ cyl: num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
#> $ x2016_kvish_1_10t:'data.frame': 32 obs. of 2 variables:
#> ..$ mpg: num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> ..$ cyl: num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
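A variation on the same idea, sketched under the assumption that the objects can live in their own environment instead of the global workspace: collect them with mget and a pattern, then shorten the list names to the year. The environment env and the dummy data are made up for the example.

```r
# Dummy objects in a dedicated environment (assumption: avoids touching .GlobalEnv)
env <- new.env()
for (yr in 2005:2016) {
  assign(paste0("X", yr, "_kvish_1_10t"), mtcars[1:2], envir = env)
}

# Collect everything matching the naming scheme into one named list
m <- mget(ls(env, pattern = "^X[0-9]{4}_kvish_1_10t$"), envir = env)

# Optional: keep only the year as the element name
names(m) <- sub("^X([0-9]{4})_.*$", "\\1", names(m))
```

After this, m[["2010"]] returns the data frame for 2010.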

Count rows of dataframes within a list of dataframes

I have a list of dataframes, str(datalist,max.level = 1) reveals
List of 9
$ :'data.frame': 200 obs. of 21 variables:
$ :'data.frame': 200 obs. of 21 variables:
$ :'data.frame': 200 obs. of 21 variables:
$ :'data.frame': 200 obs. of 21 variables:
$ :'data.frame': 200 obs. of 21 variables:
$ :'data.frame': 200 obs. of 21 variables:
$ :'data.frame': 200 obs. of 21 variables:
$ :'data.frame': 200 obs. of 21 variables:
$ :'data.frame': 41 obs. of 21 variables:
Now some of the variables within the 21 variables of the dataframe are again dataframes. For example, the 18th variable is a dataframe called topics, which in turn contains 3 variables. How do I get the count of rows in each of the topics dataframes?
I tried using the map() function from the purrr package : x <- map(datalist, ~.x[["topics"]]) and thereafter sapply(x, NROW) but this gives me the number of rows of the original dataframe and not the topics dataframe. Any help would be appreciated.
To give you an example of what the topics dataframe looks like, datalist[[1]]$topics[[1]]
urlkey name id
1 selfdefense Self-Defense 443
2 martial Martial Arts 681
3 jujitsu Jiu Jitsu 9615
4 mixed-martial-arts Mixed Martial Arts 15514
5 kickboxing Kickboxing 18225
6 jiu-jitsu Jiu-jitsu 21219
7 brazilian-jiujitsu Brazilian Jiu-Jitsu 22237
8 mma-mixed-martial-arts MMA Mixed Martial Arts 35023
9 brazilian-jiu-jitsu Brazilian Jiu Jitsu 46818
The solution you described works for me:
Make a reproducible example:
datalist <- list(
data.frame(V1 = 1:2, topics = I(list(mtcars, mtcars))),
data.frame(V1 = 1:2, topics = I(list(mtcars, mtcars)))
)
str(datalist)
# List of 2
# $ :'data.frame': 2 obs. of 2 variables:
# ..$ V1 : int [1:2] 1 2
# ..$ topics:List of 2
# .. ..$ :'data.frame': 32 obs. of 11 variables:
# .. .. ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# .. .. ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
# .. .. ..$ disp: num [1:32] 160 160 108 258 360 ...
# .. .. ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
# .. .. ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
# .. .. ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
# .. .. ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
# .. .. ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
# .. .. ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
# .. .. ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
# .. .. ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
# .. ..$ :'data.frame': 32 obs. of 11 variables:
# .. .. ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# .. .. ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
# .. .. ..$ disp: num [1:32] 160 160 108 258 360 ...
# .. .. ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
# .. .. ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
# .. .. ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
# .. .. ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
# .. .. ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
# .. .. ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
# .. .. ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
# .. .. ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
# .. ..- attr(*, "class")= chr "AsIs"
# $ :'data.frame': 2 obs. of 2 variables:
# ..$ V1 : int [1:2] 1 2
# ..$ topics:List of 2
# .. ..$ :'data.frame': 32 obs. of 11 variables:
# .. .. ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# .. .. ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
# .. .. ..$ disp: num [1:32] 160 160 108 258 360 ...
# .. .. ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
# .. .. ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
# .. .. ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
# .. .. ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
# .. .. ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
# .. .. ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
# .. .. ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
# .. .. ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
# .. ..$ :'data.frame': 32 obs. of 11 variables:
# .. .. ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# .. .. ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
# .. .. ..$ disp: num [1:32] 160 160 108 258 360 ...
# .. .. ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
# .. .. ..$ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
# .. .. ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
# .. .. ..$ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
# .. .. ..$ vs : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
# .. .. ..$ am : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
# .. .. ..$ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
# .. .. ..$ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
# .. ..- attr(*, "class")= chr "AsIs"
Your solution:
library(purrr)
map(datalist, ~ sapply(.x[["topics"]], NROW))
# [[1]]
# [1] 32 32
#
# [[2]]
# [1] 32 32
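count_rows <- function(dfs) {
sapply(dfs$topics, NROW)
}
count <- lapply(datalist, count_rows)
The count_rows function extracts the "topics" list column from each dataframe in the list and applies NROW to every dataframe stored in it. (Calling nrow(dfs$topics) directly would return NULL, because topics is a list of dataframes rather than a single dataframe.)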
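To make it visible that the counts come from the nested data frames and not from the parent, here is a small base R sketch with topics of different sizes. The toy data built from head(mtcars, ...) is an assumption for illustration only.

```r
# Each parent data frame has 2 rows; its topics column holds data frames of varying size
datalist <- list(
  data.frame(V1 = 1:2, topics = I(list(head(mtcars, 5), head(mtcars, 3)))),
  data.frame(V1 = 1:2, topics = I(list(head(mtcars, 1), head(mtcars, 9))))
)

# Rows of every nested topics data frame, one integer vector per parent
counts <- lapply(datalist, function(df) vapply(df$topics, NROW, integer(1)))

# Total number of topic rows per parent data frame
totals <- vapply(counts, sum, integer(1))
```

Here counts[[1]] holds 5 and 3, while the parent data frame itself has 2 rows, so the two are easy to tell apart.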

Splitting a dataset into two datasets in R (for ggplot2 channeled through Shiny)

I saw some similar questions here, but none exactly like mine - or if they were the same, I didn't recognize it, as a rank newbie to programming in R (I've programmed in lots of other languages, but not R!)
I have an input dataset from a csv file, that I convert with read.csv. The dataset may or may not, have two groups in it. I found I could split the groups as follows:
datalist <- split(mydata, mydata$group)
but then the list I get back does not play nice with ggplot2 (I get an error that it cannot plot a list variable - although the list variable, if I print it to the console, shows the split data subset?). OK, fine. But if I then do
data = as.data.frame(datalist[1])
And feed that to ggplot2, as.data.frame mangles my column names, and so I lose the name of the variable I want to plot. Augh!
What I ideally want, is to split my input data as read by read.csv, into two separate variables (data frames, I take it?) that ggplot2 can recognize as valid data sets. Actually, I want to overlay them as histograms on the same plot.
There HAS to be an easy way to do this, but I'm not gettin' it? Advice or pointers welcome.
If you just want a single index value then using subset might be easier (at least for interactive use.)
p <- qplot(value, # assuming there is a column named "value"
data = subset(mydata, group==mydata$group[1]),
colour = "cyan")
The result of split(mydata, mydata$group) is a list of data.frames. There is a difference between the [ and [[ notation: [ subsets the list, whereas [[ extracts from the list. So datalist[1] is a list of length 1 consisting of just the first data.frame, and datalist[[1]] is the data.frame in the first position. Since ggplot (and qplot) expects a data.frame, you need the second (double bracket) version, as @Alex mentioned in the comment. I don't know why you got the error you saw and can't diagnose it without a complete example. Using a different data set (mtcars), I don't see it.
datalist <- split(mtcars, mtcars$am)
ggplot(datalist[[1]], aes(x=wt, y=mpg)) + geom_point()
qplot(wt, data=datalist[[1]], colour="cyan")
(I'm guessing you wanted colour=I("cyan"), but that's an unrelated issue.)
The difference in the subsetting/extraction operators can be seen here:
> str(datalist)
List of 2
$ 0:'data.frame': 19 obs. of 11 variables:
..$ mpg : num [1:19] 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 ...
..$ cyl : num [1:19] 6 8 6 8 4 4 6 6 8 8 ...
..$ disp: num [1:19] 258 360 225 360 147 ...
..$ hp : num [1:19] 110 175 105 245 62 95 123 123 180 180 ...
..$ drat: num [1:19] 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 ...
..$ wt : num [1:19] 3.21 3.44 3.46 3.57 3.19 ...
..$ qsec: num [1:19] 19.4 17 20.2 15.8 20 ...
..$ vs : num [1:19] 1 0 1 0 1 1 1 1 0 0 ...
..$ am : num [1:19] 0 0 0 0 0 0 0 0 0 0 ...
..$ gear: num [1:19] 3 3 3 3 4 4 4 4 3 3 ...
..$ carb: num [1:19] 1 2 1 4 2 2 4 4 3 3 ...
$ 1:'data.frame': 13 obs. of 11 variables:
..$ mpg : num [1:13] 21 21 22.8 32.4 30.4 33.9 27.3 26 30.4 15.8 ...
..$ cyl : num [1:13] 6 6 4 4 4 4 4 4 4 8 ...
..$ disp: num [1:13] 160 160 108 78.7 75.7 ...
..$ hp : num [1:13] 110 110 93 66 52 65 66 91 113 264 ...
..$ drat: num [1:13] 3.9 3.9 3.85 4.08 4.93 4.22 4.08 4.43 3.77 4.22 ...
..$ wt : num [1:13] 2.62 2.88 2.32 2.2 1.61 ...
..$ qsec: num [1:13] 16.5 17 18.6 19.5 18.5 ...
..$ vs : num [1:13] 0 0 1 1 1 1 1 0 1 0 ...
..$ am : num [1:13] 1 1 1 1 1 1 1 1 1 1 ...
..$ gear: num [1:13] 4 4 4 4 4 4 4 5 5 5 ...
..$ carb: num [1:13] 4 4 1 1 2 1 1 2 2 4 ...
> str(datalist[1])
List of 1
$ 0:'data.frame': 19 obs. of 11 variables:
..$ mpg : num [1:19] 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 ...
..$ cyl : num [1:19] 6 8 6 8 4 4 6 6 8 8 ...
..$ disp: num [1:19] 258 360 225 360 147 ...
..$ hp : num [1:19] 110 175 105 245 62 95 123 123 180 180 ...
..$ drat: num [1:19] 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 ...
..$ wt : num [1:19] 3.21 3.44 3.46 3.57 3.19 ...
..$ qsec: num [1:19] 19.4 17 20.2 15.8 20 ...
..$ vs : num [1:19] 1 0 1 0 1 1 1 1 0 0 ...
..$ am : num [1:19] 0 0 0 0 0 0 0 0 0 0 ...
..$ gear: num [1:19] 3 3 3 3 4 4 4 4 3 3 ...
..$ carb: num [1:19] 1 2 1 4 2 2 4 4 3 3 ...
> str(datalist[[1]])
'data.frame': 19 obs. of 11 variables:
$ mpg : num 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 ...
$ cyl : num 6 8 6 8 4 4 6 6 8 8 ...
$ disp: num 258 360 225 360 147 ...
$ hp : num 110 175 105 245 62 95 123 123 180 180 ...
$ drat: num 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 ...
$ wt : num 3.21 3.44 3.46 3.57 3.19 ...
$ qsec: num 19.4 17 20.2 15.8 20 ...
$ vs : num 1 0 1 0 1 1 1 1 0 0 ...
$ am : num 0 0 0 0 0 0 0 0 0 0 ...
$ gear: num 3 3 3 3 4 4 4 4 3 3 ...
$ carb: num 1 2 1 4 2 2 4 4 3 3 ...
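The bracket difference can also be checked directly, and for the overlaid histograms the question asks about, one possible approach is to skip the split and map the grouping column to fill. This is a sketch, not the answerer's method: mtcars and its am column stand in for the real data and group variable, and the plotting part only runs if ggplot2 is installed.

```r
datalist <- split(mtcars, mtcars$am)

single_bracket <- datalist[1]     # still a list (of length 1)
double_bracket <- datalist[[1]]   # the data frame itself

is.data.frame(single_bracket)     # FALSE
is.data.frame(double_bracket)     # TRUE

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # One data frame, two overlaid histograms distinguished by the group column
  p <- ggplot(mtcars, aes(x = wt, fill = factor(am))) +
    geom_histogram(position = "identity", alpha = 0.5, bins = 10)
  # print(p) draws the plot
}
```

If the input has only one group, the same code still works and simply draws a single histogram.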
