I have a list of character vectors of varying lengths:
[[1]]
[1] "2009" "2010" "2011" "2012"
[[2]]
[1] "2010" "2011" "2012" "2013"
[[3]]
[1] "2008" "2009" "2010" "2011" "2012"
[[4]]
[1] "2011" "2012"
I would like to get a one-column data.frame like this:
2009
2010
2011
2012
2010
2011
....
I tried the following, without success:
# convert each list element to a data.frame
YearsDf <- lapply(GetYears, data.frame)
Remove colnames (since the list of dataframes gave some weird column names):
YearsOk <- lapply(YearsDf, function(x) "colnames<-"(x, NULL))
All this comes to:
[[1]]
NA
1 2009
2 2010
3 2011
4 2012
[[2]]
NA
1 2010
2 2011
3 2012
4 2013
......
Then I tried to bind everything into a single data.frame with ldply (from plyr), but this gave NAs:
ldply(YearsOk, data.frame)
How do I get a data.frame with a single column?
Did you consider unlist?
myL <- list(as.character(2009:2012),
as.character(2010:2011),
as.character(2009:2014))
data.frame(year = unlist(myL))
# year
# 1 2009
# 2 2010
# 3 2011
# 4 2012
# 5 2010
# 6 2011
# 7 2009
# 8 2010
# 9 2011
# 10 2012
# 11 2013
# 12 2014
If you think it will be important for you to retain which list element the value came from, consider stack (which requires a named list) or melt from the "reshape2" package:
library(reshape2)
melt(myL)
# value L1
# 1 2009 1
# 2 2010 1
# ...SNIP...
# 11 2013 3
# 12 2014 3
## stack requires names, so add some in...
stack(setNames(myL, seq_along(myL)))
# values ind
# 1 2009 1
# 2 2010 1
# ...SNIP...
# 12 2014 3
Finally, this is absolutely not the approach I would take, but based on your example code, perhaps you were trying to do something like:
do.call(rbind, lapply(myL, function(x) data.frame(year = x)))
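If the origin of each value matters and you would rather not load a package, the same idea can be sketched in base R. This is only an illustration: the column name id is my own choice, and lengths() needs R 3.2.0 or later.

```r
# Same example list as in the answer above
myL <- list(as.character(2009:2012),
            as.character(2010:2011),
            as.character(2009:2014))

# One row per value; `id` records which list element the value came from
years <- data.frame(
  year = unlist(myL),
  id   = rep(seq_along(myL), lengths(myL)),
  stringsAsFactors = FALSE
)

nrow(years)      # 12
table(years$id)  # 4, 2 and 6 values from elements 1, 2 and 3
```

Because rep() repeats each position as many times as that element is long, the id column lines up with the unlisted values without needing names on the list.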
It's quite simple. This approach also works for vectors of different lengths:
# a and b as implied by the str() output below
a <- 1:11
b <- 2:30
Q<-list(a=a,b=b)
str(Q)
List of 2
$ a: int [1:11] 1 2 3 4 5 6 7 8 9 10 ...
$ b: int [1:29] 2 3 4 5 6 7 8 9 10 11 ...
Q$a
[1] 1 2 3 4 5 6 7 8 9 10 11
T<-c(Q$a,Q$b)
T
[1] 1 2 3 4 5 6 7 8 9 10 11 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
[28] 18 19 20 21 22 23 24 25 26 27 28 29 30
TT<-data.frame(T)
TT
T
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 2
13 3
14 4
15 5
16 6
17 7
18 8
19 9
20 10
21 11
22 12
23 13
24 14
25 15
26 16
27 17
28 18
29 19
30 20
31 21
32 22
33 23
34 24
35 25
36 26
37 27
38 28
39 29
40 30
The following randomly splits a data frame into halves.
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
head(df, 3)
# dv iv subject item
#1 562 -0.5 1 7
#2 790 0.5 1 21
#3 NA -0.5 1 19
r <- seq_len(nrow(df))
first <- sample(r, 240)
second <- r[!r %in% first]
df_1 <- df[first, ]
df_2 <- df[second, ]
However, in this way, each data frame (df_1 and df_2) is not balanced on subject and item: e.g.,
table(df_1$subject)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
# 7 8 3 5 5 3 8 1 5 7 7 6 7 7 9 8 8 9 6 7 8 5 4 4 5 2 7 6 9
# 30 31 32 33 34 35 36 37 38 39 40
# 7 5 7 7 7 3 5 7 5 3 8
table(df_1$item)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
# 12 11 12 12 9 11 11 8 11 12 10 8 14 7 14 10 8 7 9 9 7 11 9 8
# There are 40 subjects and 24 items, and each subject is assigned to 12 items and each item to 20 subjects.
I would like to know how to split the data frame into halves that are balanced on subject and item (i.e., exactly 6 data points from each subject and 10 data points from each item).
You can use the createDataPartition function from the caret package to create a partition that is balanced on one variable. Sampling is done within the levels of the variable only when it is a factor (a numeric variable is binned by percentiles instead), so subject is converted to a factor here.
The code below creates a partition of the dataset that is balanced on the variable subject:
df <- read.csv("https://raw.githubusercontent.com/HirokiYamamoto2531/data/master/data.csv")
partition <- caret::createDataPartition(factor(df$subject), p = 0.5, list = FALSE)
first.half <- df[partition, ]
second.half <- df[-partition, ]
table(first.half$subject)
table(second.half$subject)
I'm not sure whether it's possible to balance two variables at once. You can try balancing for one variable and checking if you're happy with the partition of the second variable.
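For the one-variable case, the same kind of split can also be sketched in base R by sampling half of the rows within each subject. The toy data frame below is a stand-in I made up (40 subjects with 12 rows each), not the data from the question, and the sketch does nothing to balance item.

```r
set.seed(1)  # only so the sketch is reproducible

# Toy stand-in for the real data: 40 subjects with 12 rows each
toy <- data.frame(subject = rep(1:40, each = 12),
                  dv      = rnorm(480))

# For every subject, draw half of that subject's row indices at random
rows_by_subject <- split(seq_len(nrow(toy)), toy$subject)
first <- unlist(
  lapply(rows_by_subject,
         function(i) i[sample.int(length(i), length(i) %/% 2)]),
  use.names = FALSE
)

toy_1 <- toy[first, ]
toy_2 <- toy[-first, ]

table(toy_1$subject)  # 6 rows for every subject
```

Indexing with sample.int() rather than calling sample() on the indices avoids the surprise sample() gives when a group contains a single row number.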
I have a data frame called df that looks like:
> df
Date A B C
1 2001 1 12 14
2 2002 2 13 15
3 2003 3 14 16
4 2004 4 15 17
5 2005 5 16 18
6 2006 6 17 19
7 2007 7 18 20
8 2008 8 19 21
9 2009 9 20 22
10 2010 10 21 23
and a matrix called index that looks like:
> index
Resample01 Resample02 Resample03 Resample04 Resample05
[1,] 1 7 1 2 7
[2,] 3 9 2 3 8
[3,] 5 1 3 8 1
[4,] 8 3 4 9 4
[5,] 10 4 5 10 9
The numbers in each column stand for the row numbers to be selected.
The aim is to split the data frame into two mutually exclusive groups, "train" and "test", according to the row numbers in each column of the matrix "index". For example, for "Resample01", the result should look like:
> train
Date A B C
1 2001 1 12 14
3 2003 3 14 16
5 2005 5 16 18
8 2008 8 19 21
10 2010 10 21 23
and
> test
Date A B C
2 2002 2 13 15
4 2004 4 15 17
6 2006 6 17 19
7 2007 7 18 20
9 2009 9 20 22
This process should be done for each column in "index", and the results should be saved in two lists, "train" and "test", where "train" looks like:
$train1
Date A B C
1 2001 1 12 14
3 2003 3 14 16
5 2005 5 16 18
8 2008 8 19 21
10 2010 10 21 23
$train2
:
:
$train5
and "test" should be in the same format.
Note that my df actually contains 43,000 observations, and the index matrix has 2000 columns and more than 20,000 rows. I know that subsetting for one column is easy, by doing:
test = df[-c(index[,1]),]
but I don't know how to do it (or loop it) for multiple columns, and saving the results as lists also seems difficult.
You could try something like this. The result is a list of length ncol(index), and each element holds two data frames: the training and the testing dataset.
apply(index, MARGIN = 2, FUN = function(x, data) {
# x is "demoted" from a column to a vector
list(train = data[x, ], test = data[-x, ])
}, data = df)
The solution from akrun solves my problem.
Using @Roman Luštrik's code:
listofsample = apply(index, MARGIN = 2, FUN = function(x, data) {
list(train = data[x, ], test = data[-x, ])
}, data = df)
followed by the code from akrun:
train = sapply(listofsample, `[`,1)
test = sapply(listofsample, `[`,2)
This produces the two lists that I wanted.
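If you want the two lists directly, with the names asked for in the question, a plain lapply over the column positions is one way to sketch it. The small df and index below are cut-down copies of the question's example, and the variable cols is just a helper I introduced.

```r
# Small example data modelled on the question
df <- data.frame(Date = 2001:2010, A = 1:10, B = 12:21, C = 14:23)
index <- cbind(Resample01 = c(1, 3, 5, 8, 10),
               Resample02 = c(7, 9, 1, 3, 4))

# Loop over column positions so each list can be built and named directly
cols  <- seq_len(ncol(index))
train <- lapply(cols, function(j) df[index[, j], ])
test  <- lapply(cols, function(j) df[-index[, j], ])
names(train) <- paste0("train", cols)
names(test)  <- paste0("test", cols)

train$train1
test$test1
```

With 2000 resamples of a 43,000-row data frame this keeps many copies in memory, so it may be preferable to store only the indices and subset inside the modelling loop.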
I have a list of data frames named StatesList (one data frame per state), and I'm trying to pull two columns out of each one, sum them, and return the sums. This is what I have so far:
StatesList <- list(Alabam, Alask, Arizon, Arkansa, Californi, Colorado, Connecticu, Delawar, District_ColUmbi, Florid, Georgi, Hawai, Idah, Illinoi, Indian, Iow, Kansa, Kentuck, Louisian, Main, Marylan, Massachusett, Michiga, Minnesot, Mississipp, Missour, Montan, Nebrask, Nevad, New_Hamp, New_Jer, New_Mex, New_York, North_Carol, North_Dak, Ohi, Oklahom, Orego, Pennsylvani, Rhode_Isl, South_Carol, South_Dak, Tennesse, Texa, Uta,Vermon, Virgini, Washingto, West_Vir, Wisconsi, Wyomin )
my_function <- function(x) {
c <- sum(x + $Clinton_Weighted)
t <- sum(x + $Trump_Weighted)
ans <- list(Clinton = c, Trump = t)
return(print(ans))
}
lapply(StatesList, my_function(x))
I know that x + $Clinton_Weighted won't work, but I'm not sure what will.
How do I pull out that specific column in the function's code? And is trying to combine the names of each list with the $ and the desired column a bad idea?
Here is a simple way to do this using a combination of lapply and apply:
# Create sample data
cols = list(Clinton = 1:10, Trump = 10:1, SomeoneElse = 21:30)
Alabama = data.frame(cols)
Alaska = data.frame(cols)
Arison = data.frame(cols)
Arkansa = data.frame(cols)
Californi = data.frame(cols)
df_list = list(Alabama, Alaska, Arison, Arkansa, Californi)
The list of data frames looks like this:
df_list
[[1]]
Clinton Trump SomeoneElse
1 1 10 21
2 2 9 22
3 3 8 23
4 4 7 24
5 5 6 25
6 6 5 26
7 7 4 27
8 8 3 28
9 9 2 29
10 10 1 30
[[2]]
Clinton Trump SomeoneElse
1 1 10 21
2 2 9 22
3 3 8 23
4 4 7 24
5 5 6 25
6 6 5 26
7 7 4 27
8 8 3 28
9 9 2 29
10 10 1 30
[[3]]
Clinton Trump SomeoneElse
1 1 10 21
2 2 9 22
3 3 8 23
4 4 7 24
5 5 6 25
6 6 5 26
7 7 4 27
8 8 3 28
9 9 2 29
10 10 1 30
[[4]]
Clinton Trump SomeoneElse
1 1 10 21
2 2 9 22
3 3 8 23
4 4 7 24
5 5 6 25
6 6 5 26
7 7 4 27
8 8 3 28
9 9 2 29
10 10 1 30
[[5]]
Clinton Trump SomeoneElse
1 1 10 21
2 2 9 22
3 3 8 23
4 4 7 24
5 5 6 25
6 6 5 26
7 7 4 27
8 8 3 28
9 9 2 29
10 10 1 30
Now sum up the columns of the dataframe, and apply it over the list of dataframes:
# Choose the columns to extract the sum of
cols = c("Clinton", "Trump")
lapply(df_list, function(x) apply(x[cols], 2, sum))
Below is the returned list:
[[1]]
Clinton Trump
55 55
[[2]]
Clinton Trump
55 55
[[3]]
Clinton Trump
55 55
[[4]]
Clinton Trump
55 55
[[5]]
Clinton Trump
55 55
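To connect this back to the function in the question, here is a minimal sketch with the original column names. The helper make_state and the two toy states are invented for illustration. The two points it shows are that a column is reached as x$Clinton_Weighted inside the function, and that the function is passed to lapply or sapply by name, without my_function(x).

```r
# Toy data frames using the column names from the question
make_state <- function() {
  data.frame(Clinton_Weighted = 1:10, Trump_Weighted = 10:1)
}
StatesList <- list(Alabama = make_state(), Alaska = make_state())

# Inside the function, x is one data frame, so columns are reached with x$...
my_function <- function(x) {
  c(Clinton = sum(x$Clinton_Weighted), Trump = sum(x$Trump_Weighted))
}

# Pass the function itself (no parentheses, no argument) to sapply
totals <- t(sapply(StatesList, my_function))
totals
```

Naming the list elements carries the state names through to the row names of the result, which is easier to read than positions 1 to 51.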
I have a large data frame called "df" (with some NA values inside)
dim(df)
[1] 2174 420
I would like to change the dimension of it into 32610 rows and 28 columns (by row), for example:
#df=
a b c d e f g ...
1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 ...
2 .........
3 .........
4 .........
5 .........
6 .........
...........
Into:
#new.df=
r1 r2 r3 r4 r5 r6 r7 ... ... r28
1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
2 29 30 ...
3 .........
4 .........
5 .........
6 .........
...........
Therefore, new dimension:
dim(new.df)
[1] 32610 28
Can anyone help me with the code?
To rearrange the data by row, we can build a matrix from the transposed data frame:
matrix(unlist(t(df)), nrow = 32610, ncol = 28, byrow = TRUE)
Reproducible Example
There is no reason to not have a reproducible example in your question. It is very easy to simplify the problem to understand the underlying solution:
df <- as.data.frame(matrix(1:12, nrow = 3, byrow = TRUE))
df
V1 V2 V3 V4
1 1 2 3 4
2 5 6 7 8
3 9 10 11 12
matrix(unlist(t(df)), nrow = 6, ncol = 2, byrow = TRUE)
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[4,] 7 8
[5,] 9 10
[6,] 11 12
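Since the question asks for a data frame with columns r1 to r28, here is a sketch of the same idea wrapped back into a data frame, with a guard that the dimensions are compatible. It uses a small 4 x 6 stand-in and 3 target columns; new_ncol is a name I made up. Note that t() coerces to a matrix, so mixed column types would all become character.

```r
# Small stand-in: 4 rows x 6 columns, to be rearranged into 8 rows x 3 columns
df <- as.data.frame(matrix(1:24, nrow = 4, byrow = TRUE))

new_ncol <- 3
# The total number of cells must be divisible by the new column count
stopifnot((nrow(df) * ncol(df)) %% new_ncol == 0)

# t() makes row-wise order the storage order; matrix() then refills row by row
new.df <- as.data.frame(matrix(t(df), ncol = new_ncol, byrow = TRUE))
names(new.df) <- paste0("r", seq_len(new_ncol))

dim(new.df)  # 8 3
```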
I have made a start on creating training and test sets using 10-fold cross-validation for an artificial dataset:
rows <- 1000
X1<- sort(runif(n = rows, min = -1, max =1))
occ.prob <- 1/(1+exp(-(0.0 + 3.0*X1)))
true.presence <- rbinom(n = rows, size = 1, prob = occ.prob)
# combine data as data frame and save
data <- data.frame(X1, true.presence)
id <- sample(1:10,nrow(data),replace=TRUE)
ListX <- split(data,id)
fold1 <- data[id==1,]
fold2 <- data[id==2,]
fold3 <- data[id==3,]
fold4 <- data[id==4,]
fold5 <- data[id==5,]
fold6 <- data[id==6,]
fold7 <- data[id==7,]
fold8 <- data[id==8,]
fold9 <- data[id==9,]
fold10 <- data[id==10,]
trainingset <- subset(data, id %in% c(2,3,4,5,6,7,8,9,10))
testset <- subset(data, id %in% c(1))
I am wondering whether there are easier ways to achieve this, and how I could perform stratified cross-validation, which ensures that the class priors (true.presence) are roughly the same in all folds.
The createFolds function of the caret package performs stratified partitioning. Here is a paragraph from the help page:
... The random sampling is done within the levels of y (=outcomes) when y is a factor in an attempt to balance the class distributions within the splits.
Here is the answer to your problem:
library(caret)
folds <- createFolds(factor(data$true.presence), k = 10, list = FALSE)
and the proportions:
> library(plyr)
> data$fold <- folds
> ddply(data, 'fold', summarise, prop=mean(true.presence))
fold prop
1 1 0.5000000
2 2 0.5050505
3 3 0.5000000
4 4 0.5000000
5 5 0.5000000
6 6 0.5049505
7 7 0.5000000
8 8 0.5049505
9 9 0.5000000
10 10 0.5050505
I'm sure that (a) there's a more efficient way to code this, and (b) there's almost certainly a function somewhere in a package that will just return the folds, but here's some simple code that gives you an idea of how one might do this:
rows <- 1000
X1<- sort(runif(n = rows, min = -1, max =1))
occ.prob <- 1/(1+exp(-(0.0 + 3.0*X1)))
true.presence <- rbinom(n = rows, size = 1, prob = occ.prob)
# combine data as data frame and save
dat <- data.frame(X1, true.presence)
require(plyr)
createFolds <- function(x,k){
n <- nrow(x)
x$folds <- rep(1:k,length.out = n)[sample(n,n)]
x
}
folds <- ddply(dat,.(true.presence),createFolds,k = 10)
#Proportion of true.presence in each fold:
ddply(folds,.(folds),summarise,prop = sum(true.presence)/length(true.presence))
folds prop
1 1 0.5049505
2 2 0.5049505
3 3 0.5100000
4 4 0.5100000
5 5 0.5100000
6 6 0.5100000
7 7 0.5100000
8 8 0.5100000
9 9 0.5050505
10 10 0.5050505
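The same within-class assignment can be sketched without plyr by using ave() from base R. The outcome vector y below is a toy stand-in for true.presence, and the names y, k and fold are mine.

```r
set.seed(42)  # only so the sketch is reproducible

# Toy binary outcome standing in for true.presence
y <- rbinom(1000, size = 1, prob = 0.5)
k <- 10

# Within each class, deal out fold labels 1..k as evenly as possible, then shuffle
fold <- ave(seq_along(y), y,
            FUN = function(i) sample(rep_len(seq_len(k), length(i))))

# Class counts per fold; within a class they differ by at most one
table(y, fold)
```

Because the labels are dealt out separately for each class, every fold gets nearly the same number of 0s and of 1s, which is what keeps the class proportions close across folds.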
@joran is right (regarding his assumption (b)). dismo::kfold() is what you are looking for.
So using data from the initial question:
require(dismo)
folds <- kfold(data, k=10, by=data$true.presence)
gives a vector of length nrow(data) containing the fold assignment of each row of data.
Hence, data[folds == 1, ] returns the 1st fold, which can be held out for validation, and data[folds != 1, ] returns the remaining rows for training.
edit 6/2018: I strongly support using the caret package as recommended by @gkcn. It is better integrated in the tidyverse workflow and more actively developed. Go with that!
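Whichever helper produces the fold vector, the cross-validation loop itself looks much the same. Below is a rough base R sketch on the simulated data from the question. The folds vector here is a plain random stand-in (not stratified) so that the example runs without extra packages, and the logistic model and the 0.5 cut-off are my own assumptions for illustration.

```r
set.seed(1)

# Simulated data as in the question
rows <- 1000
X1 <- sort(runif(n = rows, min = -1, max = 1))
occ.prob <- 1 / (1 + exp(-(0.0 + 3.0 * X1)))
data <- data.frame(X1, true.presence = rbinom(n = rows, size = 1, prob = occ.prob))

# Stand-in fold vector (values 1..10); swap in the output of a stratified helper
folds <- sample(rep_len(1:10, nrow(data)))

# Hold out each fold once, fit on the rest, and record the accuracy
acc <- sapply(1:10, function(f) {
  fit  <- glm(true.presence ~ X1, family = binomial, data = data[folds != f, ])
  pred <- predict(fit, newdata = data[folds == f, ], type = "response") > 0.5
  mean(pred == data$true.presence[folds == f])
})

round(acc, 2)
```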
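I found the splitTools package pretty useful for this. Its vignette, https://cran.r-project.org/web/packages/splitTools/vignettes/splitTools.html, may help anyone interested in this topic. By default, create_folds does stratified splitting and returns the in-sample (training) row indices of each fold:
> library(splitTools)
> y <- rep(c(letters[1:4]), each = 5)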
> y
[1] "a" "a" "a" "a" "a" "b" "b" "b" "b" "b" "c" "c" "c" "c" "c" "d" "d" "d" "d" "d"
> create_folds(y)
$Fold1
[1] 1 2 3 5 6 7 8 10 12 13 14 15 17 18 19 20
$Fold2
[1] 1 2 4 5 6 8 9 10 11 12 13 14 16 17 19 20
$Fold3
[1] 2 3 4 5 6 7 9 10 11 12 13 15 16 17 18 20
$Fold4
[1] 1 2 3 4 7 8 9 10 11 13 14 15 16 18 19 20
$Fold5
[1] 1 3 4 5 6 7 8 9 11 12 14 15 16 17 18 19
> create_folds(y, m_rep = 3)
$Fold1.Rep1
[1] 1 2 4 5 6 7 8 10 11 12 13 15 16 17 19 20
$Fold2.Rep1
[1] 2 3 4 5 6 8 9 10 11 12 13 14 16 17 18 20
$Fold3.Rep1
[1] 1 2 3 5 7 8 9 10 11 12 14 15 17 18 19 20
$Fold4.Rep1
[1] 1 2 3 4 6 7 9 10 11 13 14 15 16 18 19 20
$Fold5.Rep1
[1] 1 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19
$Fold1.Rep2
[1] 1 2 3 5 6 8 9 10 11 12 13 14 16 17 18 19
$Fold2.Rep2
[1] 1 2 3 4 6 7 8 10 11 12 14 15 17 18 19 20
$Fold3.Rep2
[1] 2 3 4 5 6 7 8 9 12 13 14 15 16 17 19 20
$Fold4.Rep2
[1] 1 3 4 5 7 8 9 10 11 13 14 15 16 17 18 20
$Fold5.Rep2
[1] 1 2 4 5 6 7 9 10 11 12 13 15 16 18 19 20
$Fold1.Rep3
[1] 1 2 3 4 6 7 9 10 11 12 13 15 16 18 19 20
$Fold2.Rep3
[1] 2 3 4 5 6 8 9 10 11 12 13 14 16 17 18 19
$Fold3.Rep3
[1] 1 2 4 5 6 7 8 9 11 12 14 15 16 17 19 20
$Fold4.Rep3
[1] 1 2 3 5 7 8 9 10 12 13 14 15 17 18 19 20
$Fold5.Rep3
[1] 1 3 4 5 6 7 8 10 11 13 14 15 16 17 18 20
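Each fold in the output above lists 16 of the 20 row numbers, so the held-out rows are the ones that are missing. A small base R sketch of recovering them, with the indices of $Fold1 copied by hand from the first create_folds(y) call (the names fold1_train and fold1_test are mine):

```r
y <- rep(letters[1:4], each = 5)

# In-sample indices copied from $Fold1 in the output above
fold1_train <- c(1, 2, 3, 5, 6, 7, 8, 10, 12, 13, 14, 15, 17, 18, 19, 20)

# The held-out rows are whatever is not in the in-sample set
fold1_test <- setdiff(seq_along(y), fold1_train)

fold1_test     # 4 9 11 16
y[fold1_test]  # "a" "b" "c" "d"
```

The held-out set contains exactly one observation from each of the four classes, which is the stratification at work.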