How to get the combinations of multiple vectors in R - r

I have a data frame x. I want to get the pairwise combinations of all rows, like (x[1,], x[2,]), (x[1,], x[3,]), (x[2,], x[3,]). Here I treat each row as a single unit. I tried functions like combn, but it gave me the combinations of all elements in all rows.

I think with combn you are on the right track:
x <- data.frame(a=sample(letters, 10), b=1:10, c=runif(10), stringsAsFactors=FALSE)
ans <- combn(nrow(x), 2, FUN=function(sub) x[sub,], simplify=FALSE)
Now ans is a list of (in this case 45, in general choose(nrow(x), 2)) data.frames with two rows each.
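As a minimal sketch of why this works (the three-row data frame and the names idx and pairs are made up for illustration): combn() is given the row count rather than the data frame itself, so it enumerates pairs of row indices, and each pair is then used to subset whole rows.

```r
# Toy data frame with three rows (hypothetical values)
x <- data.frame(id = c("r1", "r2", "r3"), val = c(10, 20, 30),
                stringsAsFactors = FALSE)

# Each column of idx is one pair of row indices: (1,2), (1,3), (2,3)
idx <- combn(nrow(x), 2)

# Apply the subsetting function to each index pair; keep the result as a list
pairs <- combn(nrow(x), 2, FUN = function(i) x[i, ], simplify = FALSE)

length(pairs)   # choose(3, 2) two-row data frames
pairs[[2]]      # rows 1 and 3 of x
```

Passing x directly to combn() would treat the data frame as a list of columns, which is not what the question asks for.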

The crossing() function from the tidyr package may help you. (The link contains a StackOverflow example.)

Related

How to find elements with not enough observations in a list

Say I have the following list where each element is a data.frame of different sizes
df1 <- data.frame(matrix(rnorm(12346), ncol = 2))
df2 <- data.frame(matrix(rnorm(14330), ncol = 2))
df3 <- data.frame(matrix(rnorm(2422), ncol = 2))
l <- list(df1, df2, df3)
In my example each data.frame represents a year of observations, and clearly df3 contains a lot fewer observations compared to the other two.
My question is then:
What is the best approach to detect those elements of the list l that do not compare in the number of rows and then remove them from the list?
I've so far tried using the median but as this should always remove half of the elements in l I'm not sure this is the best solution for future use
library(collapse)
library(magrittr)  # provides %>%
cutoff <- input %>%
  vapply(nrow, FUN.VALUE = integer(1)) %>%
  median()
idx <- dapply(X = input, FUN = function(x) nrow(x) >= cutoff)
input[idx]
where input is a list as the above l
NOTE: As this is my first question on SO, please feel free to edit the question if it does not live up to the standards of this community or give feedback on asking better questions. Thanks in advance
EDIT:
The question is not so much about how to use the median to remove elements of the list, but rather IF the median is the right method to remove those data.frames which have far fewer observations than the others
Does this work:
l[sapply(l, function(x) nrow(x) >= median(unlist(lapply(l, nrow))))]
purrr::keep is the way to go when filtering lists with conditions.
library(purrr)
keep(l, ~ nrow(.x) > median(map_dbl(l, nrow)))
It looks like you have a variable number of rows in your data frames, and you want to identify those that are unusually low. This is a statistical problem called outlier detection.
Programmatically, you want to extract the number of rows from your list of data frames, which is easily done with
rows <- sapply(l, nrow)
Statistically, you now want to take a look at your data and its distribution. Good, simple visualisations in R can be
hist(rows)
boxplot(rows)
Note that these will work better if you have many dfs, and are pretty useless with 3.
How to now determine which values are outside an "expected" distribution is not always trivial. Some resources:
outliers tag on CrossValidated
a nice RBloggers post
Note that it's also acceptable for you to choose a cutoff manually if you can reasonably justify it.
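A minimal sketch of the manual-cutoff idea, assuming that "too few observations" means fewer than half the median row count. The 0.5 share (min_share) is an arbitrary, hypothetical threshold chosen for illustration, not a statistical recommendation; it would need to be justified for the data at hand.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
l <- list(
  df1 = data.frame(matrix(rnorm(12346), ncol = 2)),
  df2 = data.frame(matrix(rnorm(14330), ncol = 2)),
  df3 = data.frame(matrix(rnorm(2422),  ncol = 2))
)

# Number of rows in each data frame
rows <- vapply(l, nrow, integer(1))

# Hypothetical rule: keep elements with at least half the median row count
min_share <- 0.5
keep <- rows >= min_share * median(rows)

l_kept <- l[keep]
names(l_kept)
```

Unlike using the median itself as the cutoff, a fraction of the median does not automatically discard roughly half of the list.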

How to apply operation and sum over columns in R?

I want to apply some operations to the values in a number of columns, and then sum the results of each row across columns. I can do this using:
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$a2 <- x$a^2
x$b2 <- x$b^2
x$result <- x$a2 + x$b2
but this will become arduous with many columns, and I'm wondering if anyone can suggest a simpler way. Note that the dataframe contains other columns that I do not want to include in the calculation (in this example, column sample is not to be included).
Many thanks!
I would simply subset the columns of interest and apply everything directly on the matrix using the rowSums function.
x <- data.frame(sample=1:3, a=4:6, b=7:9)
# put column indices and apply your function
x$result <- rowSums(x[,c(2,3)]^2)
This of course assumes your function is vectorized. If not, you would need to use some apply variation (which you are seeing many of). That said, you can still use rowSums if you find it useful like so. Note, I use sapply which also returns a matrix.
# random custom function
myfun <- function(x){
  return(x^2 + 3)
}
rowSums(sapply(x[,c(2,3)], myfun))
I would suggest to convert the data set into the 'long' format, group it by sample, and then calculate the result. Here is the solution using data.table:
library(data.table)
melt(setDT(x),id.vars = 'sample')[,sum(value^2),by=sample]
# sample V1
#1: 1 65
#2: 2 89
#3: 3 117
You can easily replace value^2 by any function you want.
You can use the apply function, selecting the columns you need with c(i1, i2, ...).
apply(( x[ , c(2, 3) ])^2, 1 ,sum )
If you want to apply a function named somefunction to some of the columns, whose indices or colnames are in the vector col_indices, and then sum the results, you can do :
# if somefunction can be vectorized :
x$results<-apply(x[,col_indices],1,function(x) sum(somefunction(x)))
# if not :
x$results<-apply(x[,col_indices],1,function(x) sum(sapply(x,somefunction)))
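To make the pattern above concrete, here is a small sketch in which somefunction and col_indices are filled in with placeholder values (both are assumptions for illustration). It also checks that the row-wise apply() form agrees with the column-wise sapply() plus rowSums() form when the function is vectorized.

```r
x <- data.frame(sample = 1:3, a = 4:6, b = 7:9)

# Hypothetical choices for illustration
col_indices <- c("a", "b")
somefunction <- function(v) v^2 + 3

# Row-wise: apply the function to each row, then sum
by_row <- apply(x[, col_indices], 1, function(r) sum(somefunction(r)))

# Column-wise: transform each column, then sum across columns
by_col <- rowSums(sapply(x[col_indices], somefunction))

x$results <- by_col
same <- isTRUE(all.equal(unname(by_row), unname(by_col)))
```

For large data frames the column-wise form is usually the better fit, since it calls the function once per column rather than once per row.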
I want to come at this one from a "no extensions" R POV.
It's important to remember what kind of data structure you are working with. Data frames are actually lists of vectors--each column is itself a vector. So you can use the handy-dandy lapply function to apply a function to the desired columns in the list/data frame.
I'm going to define a function as the square as you have above, but of course this can be any function of any complexity (so long as it takes a vector as an input and returns a vector of the same length). If it doesn't, it won't fit into the original data.frame!
The steps below are extra pedantic to show each little bit, but obviously it can be compressed into one or two steps. Note that I only retain the sum of the squares of each column, given that you might want to save space in memory if you are working with lots and lots of data.
1. create data; define the function
2. grab the columns you want as a separate (temporary) data.frame
3. apply the function to the data.frame/list you just created.
4. lapply returns a list, so if you intend to retain it separately, make it a temporary data.frame. This is not necessary.
5. calculate the sums of the rows of the temporary data.frame and append it as a new column in x.
6. remove the temp data.frame.
Code:
x <- data.frame(sample=1:3, a=4:6, b=7:9); square <- function(x) x^2 #step 1
x[2:3] #Step 2
temp <- data.frame(lapply(x[2:3], square)) #step 3 and step 4
x$squareRowSums <- rowSums(temp) #step 5
rm(temp) #step 6
Here is another apply solution
cols <- c("a", "b")
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$result <- apply(x[, cols], 1, function(x) sum(x^2))

intersecting across 10 large data sets and merging automatically

I have 10 data.frames with 2 columns with names s and p. s is for sequence and p is for p-values. I want to find the sequences that intersect across all data.frames, so I did this:
# 10 data.frames are a, b, c, ..., j
masterseq_list <- Reduce(intersect, list(a$s, b$s, c$s, d$s, e$s, f$s, g$s,h$s, i$s,j$s))
I'd like to take masterseq_list and merge each dataframe a:j by this new reduced sequence so I am left with each data.frame having masterseq_list as the new column instead of s and the p-values remaining intact. I know I can use this code somehow but I'm really not sure how to do it if the column I want is currently a list.
total <- merge(data frameA,data frameB,by="s")
The files are really big, so I'd like to find a way to automate this. How can I loop through them quickly and efficiently? Thanks so much!
I'd start by putting all the data.frames in a list first:
my_l <- list(a,b,c)
# now get intersection
isect <- Reduce(intersect, lapply(my_l, "[[", 1))
> isect
# [1] "gtcg" "gtcgg" "gggaa" "cttg"
# subset the original data.frames for just these intersecting rows
lapply(my_l, function(x) subset(x, s %in% isect))
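A self-contained sketch of the same idea on toy data, extended with one possible way to merge the reduced data frames into a single table. The sequences, p-values, and the p_a / p_b / p_c column names are invented for illustration; renaming the p columns before merging is an assumption made here to avoid duplicate column names.

```r
# Toy stand-ins for the real data frames (hypothetical values)
dfs <- list(
  a = data.frame(s = c("gtcg", "cttg", "aaaa"), p = c(0.01, 0.20, 0.50),
                 stringsAsFactors = FALSE),
  b = data.frame(s = c("cttg", "gtcg", "cccc"), p = c(0.03, 0.04, 0.90),
                 stringsAsFactors = FALSE),
  c = data.frame(s = c("tttt", "gtcg", "cttg"), p = c(0.70, 0.05, 0.06),
                 stringsAsFactors = FALSE)
)

# Sequences present in every data frame
isect <- Reduce(intersect, lapply(dfs, `[[`, "s"))

# Keep only the intersecting rows in each data frame
reduced <- lapply(dfs, function(d) d[d$s %in% isect, ])

# Give each p column a unique name, then merge everything on s
renamed <- Map(function(d, nm) setNames(d, c("s", paste0("p_", nm))),
               reduced, names(reduced))
wide <- Reduce(function(x, y) merge(x, y, by = "s"), renamed)
wide
```

Keeping the data frames in a named list means the same three steps work unchanged for ten (or more) inputs.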

Mean of elements in a list of data.frames

Suppose I had a list of data.frames (of equal rows and columns)
dat1 <- as.data.frame(matrix(rnorm(25), ncol=5))
dat2 <- as.data.frame(matrix(rnorm(25), ncol=5))
dat3 <- as.data.frame(matrix(rnorm(25), ncol=5))
all.dat <- list(dat1=dat1, dat2=dat2, dat3=dat3)
How can I return a single data.frame that is the mean (or sum, etc.) for each element in the data.frames across the list (e.g., mean of first row and first column from lists 1, 2, 3 and so on)? I have tried lapply and ldply in plyr but these return the statistic for each data.frame within the list.
Edit: For some reason, this was retagged as homework. Not that it matters either way, but this is not a homework question. I just don't know why I can't get this to work. Thanks for any insight!
Edit2: For further clarification:
I can get the results using loops, but I was hoping that there was a simpler and faster way, because the data I am using has data.frames that are 12 rows by 100 columns and there is a list of 1000+ of these data frames.
z <- matrix(0, nrow(all.dat$dat1), ncol(all.dat$dat1))
for(l in 1:nrow(all.dat$dat1)){
  for(m in 1:ncol(all.dat$dat1)){
    z[l, m] <- mean(unlist(lapply(all.dat, `[`, i =l, j = m)))
  }
}
With a result of the means:
> z
[,1] [,2] [,3] [,4] [,5]
[1,] -0.64185488 0.06220447 -0.02153806 0.83567173 0.3978507
[2,] -0.27953054 -0.19567085 0.45718399 -0.02823715 0.4932950
[3,] 0.40506666 0.95157856 1.00017954 0.57434125 -0.5969884
[4,] 0.71972821 -0.29190645 0.16257478 -0.08897047 0.9703909
[5,] -0.05570302 0.62045662 0.93427522 -0.55295824 0.7064439
I was wondering if there was a less clunky and faster way to do this. Thanks!
Here is a one liner with plyr. You can replace mean with any other function that you want.
ans1 = aaply(laply(all.dat, as.matrix), c(2, 3), mean)
You would have an easier time changing the data structure, combining the three two dimensional matrices into a single 3 dimensional array (using the abind library). Then the solution is more direct using apply and specifying the dimensions to average over.
EDIT:
When I answered the question, it was tagged homework, so I just gave an approach. The original poster removed that tag, so I will take him/her at his/her word that it isn't.
library("abind")
all.matrix <- abind(all.dat, along=3)
apply(all.matrix, c(1,2), mean)
I gave one answer that uses a completely different data structure to achieve the result. This answer uses the data structure (list of data frames) given directly. I think it is less elegant, but wanted to provide it anyway.
Reduce(`+`, all.dat) / length(all.dat)
The logic is to add the data frames together element by element (which + will do with data frames), then divide by the number of data frames. Using Reduce is necessary since + can only take two arguments at a time (and addition is associative).
Another approach using only base functions to change the structure of the object:
listVec <- lapply(all.dat, c, recursive=TRUE)
m <- do.call(cbind, listVec)
Now you can calculate the mean with rowMeans or the median with apply:
means <- rowMeans(m)
medians <- apply(m, 1, median)
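Note that rowMeans(m) returns one value per cell as a flat vector, so it still has to be reshaped to the original dimensions. A rough sketch of that last step, using unlist() to build the matrix (a variation on the c(..., recursive=TRUE) call above) and cross-checking against the Reduce approach; the seed and the names mean_mat and reduce_mat are arbitrary.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
all.dat <- list(
  dat1 = as.data.frame(matrix(rnorm(25), ncol = 5)),
  dat2 = as.data.frame(matrix(rnorm(25), ncol = 5)),
  dat3 = as.data.frame(matrix(rnorm(25), ncol = 5))
)

# One column per data frame; cells are flattened column by column
m <- sapply(all.dat, function(d) unlist(d, use.names = FALSE))

# Cell-wise means, reshaped back to the original 5 x 5 layout
mean_mat <- matrix(rowMeans(m), nrow = nrow(all.dat[[1]]))

# Cross-check against the element-wise sum divided by the list length
reduce_mat <- unname(as.matrix(Reduce(`+`, all.dat) / length(all.dat)))
agree <- isTRUE(all.equal(mean_mat, reduce_mat))
```

The reshape relies on matrix() filling column by column, which matches the order in which unlist() flattens the data frame columns.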
I would take a slightly different approach:
library(plyr)
tmp <- ldply(all.dat) # convert to df
tmp$counter <- 1:5 # 1:12 for your actual situation
ddply(tmp, .(counter), function(x) colMeans(x[2:ncol(x)]))
Couldn't you just use nested lapply() calls?
This appears to give the correct result on my machine
mean.dat <- lapply(all.dat, function (x) lapply(x, mean, na.rm=TRUE))

How can I combine datasets in R?

I think my question is very simple.
dat1<-seq(1:100)
dat2<-seq(1:100)
how can I combine dat1 and dat2 and make it look like
dat3<-seq(1:200)
Thanks so much!
How do you want to combine dat1 and dat2? By rows or columns? I'd take a look at the help pages for rbind() (row bind), cbind() (column bind), or c(), which combines arguments to form a vector.
Let me start by a comment.
In order to create a sequence of numbers, one can use the following syntax:
x <- seq(from=, to=, by=)
A shorthand for, e.g., x <- seq(from=1, to=10, by=1) is simply 1:10. So wrapping 1:100 in seq(), as in your code, is redundant.
On the other hand, you can combine two or more vectors using the c() function. Let us say, for example, that a <- c(1, 2) and b <- c(3, 4). Then c <- c(a, b) is the vector (1, 2, 3, 4).
There exist similar functions to combine data sets: rbind() and cbind().
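A short sketch of the three options applied to the vectors from the question. The offset of 100 in shifted is an assumption about what "look like seq(1:200)" means, since plain concatenation repeats 1 to 100 twice.

```r
dat1 <- 1:100
dat2 <- 1:100

# Concatenate into one vector of length 200 (values 1..100, then 1..100 again)
dat3 <- c(dat1, dat2)

# If the goal is literally 1..200, shift the second vector before combining
shifted <- c(dat1, dat2 + 100L)

# Bind as columns (100 x 2 matrix) or as rows (2 x 100 matrix)
by_col <- cbind(dat1, dat2)
by_row <- rbind(dat1, dat2)
```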
