I have a dataframe that looks like this:
x <- data.frame(a = 6, b = 5:1, c = 7, d = 10:6)
> x
a b c d
1 6 5 7 10
2 6 4 7 9
3 6 3 7 8
4 6 2 7 7
5 6 1 7 6
I am trying to get the sums of columns a & b and c & d in another data frame that should look like:
> new
ab cd
1 11 17
2 10 16
3 9 15
4 8 14
5 7 13
I've tried the rowSums() function but it returns the sum of ALL the columns per row, and I tried rowSums(x[c(1,2), c(3,4)]) but nothing works. Please help!!
You can use rowSums on a column subset.
As a data frame:
data.frame(ab = rowSums(x[c("a", "b")]), cd = rowSums(x[c("c", "d")]))
# ab cd
# 1 11 17
# 2 10 16
# 3 9 15
# 4 8 14
# 5 7 13
As a matrix:
cbind(ab = rowSums(x[1:2]), cd = rowSums(x[3:4]))
For a wider data frame, you can also use sapply over a list of column subsets.
sapply(list(1:2, 3:4), function(y) rowSums(x[y]))
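The result of sapply() here is a matrix without column names; naming the list of subsets carries the names through, e.g.:
sapply(list(ab = 1:2, cd = 3:4), function(y) rowSums(x[y]))
# same values as above, now in a matrix with columns ab and cd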
For all pairwise column combinations:
y <- combn(ncol(x), 2L, function(y) rowSums(x[y]))
colnames(y) <- combn(names(x), 2L, paste, collapse = "")
y
# ab ac ad bc bd cd
# [1,] 11 13 16 12 15 17
# [2,] 10 13 15 11 13 16
# [3,] 9 13 14 10 11 15
# [4,] 8 13 13 9 9 14
# [5,] 7 13 12 8 7 13
Here's another option:
> sapply(split.default(x, 0:(length(x)-1) %/% 2), rowSums)
0 1
[1,] 11 17
[2,] 10 16
[3,] 9 15
[4,] 8 14
[5,] 7 13
The 0:(length(x)-1) %/% 2 step creates a grouping vector that puts the columns into groups of 2, which can be used with split. It will also handle an odd number of columns (treating the final column as a group of its own). Since the default split "method" for data.frames splits by rows, you need to call split.default to split the columns into groups.
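To see what the grouping vector looks like, and how an extra (odd) column would be handled, here is a quick illustration (the column e is made up for the example):
0:(length(x) - 1) %/% 2
# [1] 0 0 1 1
x2 <- cbind(x, e = 1:5)            # hypothetical 5th column
0:(length(x2) - 1) %/% 2
# [1] 0 0 1 1 2
sapply(split.default(x2, 0:(length(x2) - 1) %/% 2), rowSums)
# columns a+b and c+d as before, plus e on its own as group 2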
Related
How do I write an R script that initializes a vector with integers, rearranges the elements by interleaving the
first half with the second half, stores the result in the same vector without using a predefined function, and displays the updated vector?
This sounds like a homework question, and it would be nice to see some effort on your own part, but it's pretty straightforward to do this in R.
Suppose your vector looks like this:
vec <- 1:20
vec
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Then you can just do:
c(t(cbind(vec[1:10], vec[11:20])))
#> [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
This works by joining the two vectors into a 10 x 2 matrix, then transposing that matrix and turning it into a vector.
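Broken into steps, in case the one-liner is hard to follow:
m <- cbind(vec[1:10], vec[11:20])  # 10 x 2 matrix: first half beside second half
tm <- t(m)                         # 2 x 10 matrix: the halves are now the two rows
c(tm)                              # reading column by column interleaves them
#> [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20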
We may use matrix directly and concatenate:
c(matrix(vec, nrow = 2, byrow = TRUE))
# [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
Data:
vec <- 1:20
Or using mapply:
vec <- 1:20
c(mapply(\(x,y) c(x,y), vec[1:10], vec[11:20]))
#> [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
We can try this using order + %%
> vec[order((seq_along(vec) - 1) %% (length(vec) / 2))]
[1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
Another way is to use rbind on the 2 halves of the vector, which creates a matrix with two rows. Then we can turn the matrix into a vector, which reads it column by column (i.e., 1, 11, 2, 12, ...). However, this only works for even-length vectors.
vec <- 1:20
c(rbind(vec[1:10], vec[11:20]))
# [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20
So, for odd-length vectors, we can use order, which returns the indices of the numbers in the two seq_along vectors.
vec2 <- 1:21
order(c(seq_along(vec2[1:10]),seq_along(vec2[11:21])))
# [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20 21
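Note that order() returns indices rather than values (they coincide here only because vec2 is 1:21); for a general vector you would index with the result:
vec2[order(c(seq_along(vec2[1:10]), seq_along(vec2[11:21])))]
# [1] 1 11 2 12 3 13 4 14 5 15 6 16 7 17 8 18 9 19 10 20 21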
I want to create a table like:
1 1 6 6 10 10 ...
2 2 7 7 11 11 ...
3 3 8 8 12 12 ...
4 4 9 9 13 13 ...
5 5 14 14 ...
15 15 ...
I want to use the variables n (the number of repeats), m (the total number of columns), and k (each block starts at the previous block's end number + 1, for example 6 = 5 + 1 and 10 = 9 + 1), with rows of differing lengths, to create the table.
I know I can use something like:
rep(list(1:5, 6:9, 10:15), each = 2)
but how do I express the list list(1:5, 6:9, 10:15, ...) generally in terms of n, m and k instead of writing it out by hand?
I tried a loop, for (i in 1:m) etc., but cannot work it out.
Finally, I want a single sequence by using unlist(): 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, ...
Many thanks.
Maybe the code below can help
len <- c(5,4,6)
res <- unlist(unname(rep(split(1:sum(len),
findInterval(1:sum(len),cumsum(len)+1)),
each = 2)))
which gives
> res
[1] 1 2 3 4 5 1 2 3 4 5 6 7 8 9 6 7 8 9 10 11 12 13 14 15 10 11 12 13 14 15
Probably, something like this would be helpful.
#Number of times to repeat
r <- 2
#Length of each sequence
len <- c(5, 4, 6)
#Get the end of the sequence
end <- cumsum(len)
#Calculate the start of each sequence
start <- c(1, end[-length(end)] + 1)
#Create a sequence of start and end and repeat it r times
Map(function(x, y) rep(seq(x, y), r), start, end)
#[[1]]
# [1] 1 2 3 4 5 1 2 3 4 5
#[[2]]
#[1] 6 7 8 9 6 7 8 9
#[[3]]
# [1] 10 11 12 13 14 15 10 11 12 13 14 15
You could unlist to get it as one vector.
unlist(Map(function(x, y) rep(seq(x, y), r), start, end))
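For the example values above (r = 2, len = c(5, 4, 6)) this should reproduce the res vector from the previous answer:
# [1] 1 2 3 4 5 1 2 3 4 5 6 7 8 9 6 7 8 9 10 11 12 13 14 15 10 11 12 13 14 15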
My data looks like this:
x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18
y is a grouping variable. I would like to see how well this grouping went.
Because of this I want to extract a sample of n pairs of cases that are grouped together by variable y,
and n pairs of cases that are not grouped together by variable y, in order to calculate the number of
false positives and false negatives (cases falsely grouped together or falsely kept apart). How do I extract a sample of grouped pairs
and a sample of not-grouped pairs?
I would like the samples to look like this (for n = 6):
Grouped sample:
x y
2 2
3 2
9 9
10 9
15 14
17 14
Not-grouped sample:
x y
1 1
2 2
6 8
6 8
11 11
19 17
How would I go about this in R?
I'm not entirely clear on what you'd like to do, partly because I feel there is some context missing as to what you're trying to achieve. I also don't quite understand your expected output (for example, the not-grouped sample contains an entry 6 8 that does not exist in your original data...)
That aside, here is a possible approach.
# Maximum number of samples per group
n <- 3;
# Set fixed RNG seed for reproducibility
set.seed(2017);
# Grouped samples
df.grouped <- do.call(rbind.data.frame, lapply(split(df, df$y),
function(x) if (nrow(x) > 1) x[sample(min(n, nrow(x))), ]));
df.grouped;
# x y
#2.3 3 2
#2.2 2 2
#6.6 6 6
#6.7 7 6
#9.10 10 9
#9.9 9 9
#13.13 13 13
#13.14 14 13
#14.15 15 14
#14.17 17 14
# Ungrouped samples
df.ungrouped <- df[sample(nrow(df.grouped)), ];
df.ungrouped;
# x y
#7 7 6
#1 1 1
#9 9 9
#4 4 4
#3 3 2
#2 2 2
#5 5 5
#6 6 6
#10 10 9
#8 8 8
Explanation: Split df based on y, then draw min(n, nrow(x)) samples from each subset x containing more than one row; rbinding the results gives the grouped data frame df.grouped. We then draw nrow(df.grouped) samples from df to produce the ungrouped df.ungrouped.
Sample data
df <- read.table(text =
"x y
1 1
2 2
3 2
4 4
5 5
6 6
7 6
8 8
9 9
10 9
11 11
12 12
13 13
14 13
15 14
16 15
17 14
18 16
19 17
20 18", header = T)
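If you specifically need pairs of rows that are not grouped together (taking "not grouped" to mean two rows with different y values, which is an assumption), here is a rough sketch using the df defined above; one_pair is a made-up helper name, not part of the answer above:
# draw one pair of distinct rows whose y values differ
one_pair <- function() {
  repeat {
    idx <- sample(nrow(df), 2)
    if (df$y[idx[1]] != df$y[idx[2]]) return(idx)
  }
}
set.seed(2017)
ungrouped.pairs <- replicate(6, one_pair())   # 2 x 6 matrix of row indices
df[as.vector(ungrouped.pairs), ]              # the 6 not-grouped pairs, stacked two rows at a time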
I need to extract separate tables from each Excel sheet and have them as a list object. I have two lists: "allsheets", which contains 38 sheets, each including at least 2 tables, and "dataRowMeta", which contains information about which rows belong to each table. For example,
a1 <- data.frame(y1=c(1:15),y2=c(6:20))
a2 <- data.frame(y1=c(3:18),y2=c(2:17))
allsheets <- list(a1, a2)
d1<- data.frame(starthead=c(1,9),endhead=c(2,10),startdata =c(3,11),
enddata = c(7,14),footer = c(8,15))
d2<- data.frame(starthead=c(1,10),endhead=c(2,11),startdata =c(3,12),
enddata = c(8,15),footer = c(9,16))
dataRowMeta <- list(d1, d2)
Printed, allsheets looks like this:
[[1]]
y1 y2
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
6 6 11
7 7 12
8 8 13
9 9 14
10 10 15
11 11 16
12 12 17
13 13 18
14 14 19
15 15 20
[[2]]
y1 y2
1 3 2
2 4 3
3 5 4
4 6 5
5 7 6
6 8 7
7 9 8
8 10 9
9 11 10
10 12 11
11 13 12
12 14 13
13 15 14
14 16 15
15 17 16
16 18 17
and here is dataRowMeta:
[[1]]
starthead endhead startdata enddata footer
1 1 2 3 7 8
2 9 10 11 14 15
[[2]]
starthead endhead startdata enddata footer
1 1 2 3 8 9
2 10 11 12 15 16
I've tried to write a loop function which would subset each sheet according to dataRowMeta, but failed to get the desired output.
I am getting the error
Error in sheet[[a[m]:b[m], ]] : incorrect number of subscripts
I guess that's because I am iterating over a list, not matrices... but how do I tell R to subset the list in this case?
So I need the 1st and 4th columns of dataRowMeta (starthead and enddata) as the "start" and "end" row ids of the future tables.
tables <- function(allsheets,dataRowMeta){
for(i in 1 : length(dataRowMeta)){
for (j in 1 : nrow(dataRowMeta[[i]])){
a <-""
b <- ""
a <- dataRowMeta[[i]][j:j,1]
b <- dataRowMeta[[i]][j:j,4]
for (k in 1 : length(allsheets)){
sheet <- allsheets[k]
for ( m in 1 : length(a)){
tbl <- sheet[[a[m]:b[m],]]
}
}
}
}}
Desired output: I have this for the first element of the first list (sheet1):
sheet1 <- allsheets[[1]]
tmp1 <- sheet1[dataRowMeta[[1]][1, 1]:dataRowMeta[[1]][1, 4], ]
> tmp1
y1 y2
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
6 6 11
7 7 12
And I need a loop which would do this for all sheets. Please help me figure out how to get it. Thank you!
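For reference, here is a sketch of how this might be generalized with Map (assuming, as above, that every element of allsheets is a data frame and every row of the matching dataRowMeta entry describes one table via its starthead and enddata columns; extract_tables is just an illustrative name):
extract_tables <- function(sheet, meta) {
  # one table per row of meta, spanning starthead:enddata, as in tmp1 above
  lapply(seq_len(nrow(meta)), function(j) sheet[meta$starthead[j]:meta$enddata[j], ])
}
all_tables <- Map(extract_tables, allsheets, dataRowMeta)
all_tables[[1]][[1]]   # should reproduce tmp1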
I have made a start at creating some training and test sets using 10-fold cross-validation for an artificial dataset:
rows <- 1000
X1<- sort(runif(n = rows, min = -1, max =1))
occ.prob <- 1/(1+exp(-(0.0 + 3.0*X1)))
true.presence <- rbinom(n = rows, size = 1, prob = occ.prob)
# combine data as data frame and save
data <- data.frame(X1, true.presence)
id <- sample(1:10,nrow(data),replace=TRUE)
ListX <- split(data,id)
fold1 <- data[id==1,]
fold2 <- data[id==2,]
fold3 <- data[id==3,]
fold4 <- data[id==4,]
fold5 <- data[id==5,]
fold6 <- data[id==6,]
fold7 <- data[id==7,]
fold8 <- data[id==8,]
fold9 <- data[id==9,]
fold10 <- data[id==10,]
trainingset <- subset(data, id %in% c(2,3,4,5,6,7,8,9,10))
testset <- subset(data, id %in% c(1))
I am just wondering whether there are easier ways to achieve this, and how I could perform stratified cross-validation, which ensures that the class priors (true.presence) are roughly the same in all folds?
The createFolds function of the caret package performs stratified partitioning. Here is a paragraph from the help page:
... The random sampling is done within the levels of y (=outcomes) when y is a factor in an attempt to balance the class distributions within the splits.
Here is the answer to your problem:
library(caret)
folds <- createFolds(factor(data$true.presence), k = 10, list = FALSE)
and the proportions:
> library(plyr)
> data$fold <- folds
> ddply(data, 'fold', summarise, prop=mean(true.presence))
fold prop
1 1 0.5000000
2 2 0.5050505
3 3 0.5000000
4 4 0.5000000
5 5 0.5000000
6 6 0.5049505
7 7 0.5000000
8 8 0.5049505
9 9 0.5000000
10 10 0.5050505
I'm sure that (a) there's a more efficient way to code this, and (b) there's almost certainly a function somewhere in a package that will just return the folds, but here's some simple code that gives you an idea of how one might do this:
rows <- 1000
X1<- sort(runif(n = rows, min = -1, max =1))
occ.prob <- 1/(1+exp(-(0.0 + 3.0*X1)))
true.presence <- rbinom(n = rows, size = 1, prob = occ.prob)
# combine data as data frame and save
dat <- data.frame(X1, true.presence)
require(plyr)
createFolds <- function(x,k){
n <- nrow(x)
x$folds <- rep(1:k,length.out = n)[sample(n,n)]
x
}
folds <- ddply(dat,.(true.presence),createFolds,k = 10)
#Proportion of true.presence in each fold:
ddply(folds,.(folds),summarise,prop = sum(true.presence)/length(true.presence))
folds prop
1 1 0.5049505
2 2 0.5049505
3 3 0.5100000
4 4 0.5100000
5 5 0.5100000
6 6 0.5100000
7 7 0.5100000
8 8 0.5100000
9 9 0.5050505
10 10 0.5050505
@joran is right (regarding his assumption (b)). dismo::kfold() is what you are looking for.
So using data from the initial question:
require(dismo)
folds <- kfold(data, k=10, by=data$true.presence)
gives a vector of length nrow(data) containing the fold association of each row of data.
Hence, data[folds == 1, ] returns the 1st fold (which you can hold out) and data[folds != 1, ] gives the remaining rows for training.
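A minimal sketch of how the fold vector might then drive a cross-validation loop (the glm model is only an illustration, not part of the original question):
results <- lapply(1:10, function(i) {
  train <- data[folds != i, ]    # 9 folds for fitting
  test  <- data[folds == i, ]    # held-out fold
  fit   <- glm(true.presence ~ X1, data = train, family = binomial)
  predict(fit, newdata = test, type = "response")   # predicted probabilities
})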
edit 6/2018: I strongly support using the caret package as recommended by @gkcn. It is better integrated in the tidyverse workflow and more actively developed. Go with that!
I found splitTools pretty useful; I hope the vignette https://cran.r-project.org/web/packages/splitTools/vignettes/splitTools.html can help anyone interested in this topic.
> y <- rep(c(letters[1:4]), each = 5)
> y
[1] "a" "a" "a" "a" "a" "b" "b" "b" "b" "b" "c" "c" "c" "c" "c" "d" "d" "d" "d" "d"
> create_folds(y)
$Fold1
[1] 1 2 3 5 6 7 8 10 12 13 14 15 17 18 19 20
$Fold2
[1] 1 2 4 5 6 8 9 10 11 12 13 14 16 17 19 20
$Fold3
[1] 2 3 4 5 6 7 9 10 11 12 13 15 16 17 18 20
$Fold4
[1] 1 2 3 4 7 8 9 10 11 13 14 15 16 18 19 20
$Fold5
[1] 1 3 4 5 6 7 8 9 11 12 14 15 16 17 18 19
> create_folds(y, m_rep = 3)
$Fold1.Rep1
[1] 1 2 4 5 6 7 8 10 11 12 13 15 16 17 19 20
$Fold2.Rep1
[1] 2 3 4 5 6 8 9 10 11 12 13 14 16 17 18 20
$Fold3.Rep1
[1] 1 2 3 5 7 8 9 10 11 12 14 15 17 18 19 20
$Fold4.Rep1
[1] 1 2 3 4 6 7 9 10 11 13 14 15 16 18 19 20
$Fold5.Rep1
[1] 1 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19
$Fold1.Rep2
[1] 1 2 3 5 6 8 9 10 11 12 13 14 16 17 18 19
$Fold2.Rep2
[1] 1 2 3 4 6 7 8 10 11 12 14 15 17 18 19 20
$Fold3.Rep2
[1] 2 3 4 5 6 7 8 9 12 13 14 15 16 17 19 20
$Fold4.Rep2
[1] 1 3 4 5 7 8 9 10 11 13 14 15 16 17 18 20
$Fold5.Rep2
[1] 1 2 4 5 6 7 9 10 11 12 13 15 16 18 19 20
$Fold1.Rep3
[1] 1 2 3 4 6 7 9 10 11 12 13 15 16 18 19 20
$Fold2.Rep3
[1] 2 3 4 5 6 8 9 10 11 12 13 14 16 17 18 19
$Fold3.Rep3
[1] 1 2 4 5 6 7 8 9 11 12 14 15 16 17 19 20
$Fold4.Rep3
[1] 1 2 3 5 7 8 9 10 12 13 14 15 17 18 19 20
$Fold5.Rep3
[1] 1 3 4 5 6 7 8 10 11 13 14 15 16 17 18 20
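As the listings above suggest, the indices returned by create_folds() are the in-sample (training) rows for each fold, so a typical loop (a sketch) looks like:
library(splitTools)
folds <- create_folds(y, k = 5)
for (train_idx in folds) {
  train <- y[train_idx]    # stratified training portion
  test  <- y[-train_idx]   # held-out portion
  # fit and evaluate a model here
}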