I have 10 data.frames, each with 2 columns named s and p: s is for sequence and p is for p-values. I want to find the sequences that intersect across all data.frames, so I did this:
# 10 data.frames are a, b, c, ..., j
masterseq_list <- Reduce(intersect, list(a$s, b$s, c$s, d$s, e$s, f$s, g$s,h$s, i$s,j$s))
I'd like to take masterseq_list and merge each data.frame a:j down to this reduced set of sequences, so that each data.frame keeps only the sequences in masterseq_list as its s column and the corresponding p-values remain intact. I know I can use something like the code below, but I'm not sure how to do it when the column I want to merge on is currently a list.
total <- merge(dataframeA, dataframeB, by="s")
The files are really big, so I'd like to automate this. How can I loop through it quickly and efficiently? Thanks so much!
I'd start by putting all the data.frames in a list:
my_l <- list(a,b,c)
# now get intersection
isect <- Reduce(intersect, lapply(my_l, "[[", 1))
> isect
# [1] "gtcg" "gtcgg" "gggaa" "cttg"
# subset the original data.frames to just these intersecting rows
lapply(my_l, function(x) subset(x, s %in% isect))
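The question actually has ten data.frames (a through j); one way to collect them without typing each name is base R's mget() - a sketch of my own, not part of the original answer:
# collect the ten data.frames by name into a list
my_l <- mget(letters[1:10])
# intersection of the s columns
isect <- Reduce(intersect, lapply(my_l, "[[", "s"))
# keep only the intersecting sequences in every data.frame
pruned <- lapply(my_l, function(x) subset(x, s %in% isect))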
I have two lists of dataframes - df_quintile and disease_df_quintile. I do not know how to represent them concisely, but this is what they look like in RStudio:
Notice that df_quintile consists of 5 dataframes (dataframes 1 through 5), while disease_df_quintile consists of 4 (dataframes 2 through 5). I would like to cross-check both lists and remove any dataframes that are not shared by both lists; in this case, that means removing the first dataframe from the df_quintile list. How can I achieve this?
Thank you.
Independently of the content of the lists, you can first find the common names and then subset the lists:
##-- Fake lists
l1 <- as.list(1:5)
names(l1) <- 1:5
l2 <- as.list(2:5)
names(l2) <- 2:5
##-- Common names and subsetting
common_names <- intersect(names(l1), names(l2))
l1 <- l1[common_names]
l2 <- l2[common_names]
You can match the lists' names and keep the common ones.
keep <- match(names(disease_df_quintile), names(df_quintile))
new_df_quintile <- df_quintile[keep]
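For instance, with the fake lists l1 and l2 from the previous answer, this match()-based approach would look like the following (a small sketch, assuming the shorter list drives the matching):
keep <- match(names(l2), names(l1))  # positions of l2's names within l1
l1[keep]                             # only the elements named "2" through "5" remain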
I have a list of lists, where some of the inner lists are NULL (contain nothing) and some contain 12 columns and 1 row. Let's say this list of lists is named pages.
I would like to merge the lists that contain the 12 columns and 1 row into a dataframe, so that I have a final dataframe of 12 columns and x rows.
I first tried:
final_df <- Reduce(function(x,y) merge(x, y, all=TRUE), pages)
which yielded a dataframe with the right 12 columns, but no rows, so it was empty.
I then tried:
listofvectors <- list()
for (i in 1:length(pages)) {listofvectors <- c(listofvectors, pages[[i]])}
which just pasted the lists one after another.
I finally tried playing with:
final<-do.call(c, unlist(pages, recursive=FALSE))
which only resulted in a very long value.
What am I missing? Who can help me out? Thanks a lot for your input.
The merge function is for joining data on common column values (commonly called a join). You need to use rbind instead (the r is for row; use cbind to stick columns together).
do.call(rbind, pages) # equivalent to rbind(pages[[1]], pages[[2]], ...)
do.call(rbind, pages[lengths(pages) > 0]) # removing the 0-length elements
If you have additional issues, please provide a reproducible example in your question. This code works on this example:
x = list(data.frame(x = 1), NULL, data.frame(x = 2))
do.call(rbind, x)
# x
# 1 1
# 2 2
I have 10 data frames with 2 columns each; I'm calling the data frames a, b, c, d, e, f, g, h, i and j.
The first column in each data frame is called s for sequences and the second is p for p-values corresponding to each sequence. The s column contains the same sequences across all 10 data frames, essentially the only difference is in the p-values.
Below is a short version of data frame a, which has 600,000 rows.
s p
gtcg 0.06
gtcgg 0.05
gggaa 0.07
cttg 0.05
I want to rank each dataframe by p-value: the smallest p-value should get a rank of 1, and equal p-values should get the same rank. Each final data frame should be in this format:
s p_rank_a
gtcg 2
gtcgg 1
gggaa 3
cttg 1
I've used this to do one:
r<-rank(a$p)
cbind(a$s,r)
but I'm not very familiar with loops and I don't know how to do this automatically. Ultimately I would like a final file that has the s column and in the next column the rank sum of all the ranks across all data frames for each specific sequence.
So basically this:
s ranksum_P_a-j
gtcg 34
gtcgg 5
gggaa 5009093
cttg 499
Please help and thanks!
For a single data.frame, you can do it in one line, as follows:
Credit to @Arun for pointing out the use of as.numeric(factor(p)).
library(data.table)
aDT <- data.table(a)[, p_rank := as.numeric(factor(p))]
I would suggest keeping all the data.frames in a single list, so that you can easily iterate over them.
Since your data.frames are named with single letters, it's easy to collect all ten of them:
# collect them all
allOfThem <- lapply(letters[1:10], get, envir=.GlobalEnv)
# keep in mind you named an object `c`
# convert to DT and create the ranks
allOfThem <- lapply(allOfThem, function(x) data.table(x)[, p_rank := as.numeric(factor(p))])
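From here, to get the per-sequence rank sums the question ultimately asks for, one possible follow-up (my sketch, not part of the original answer) is to stack the data.tables with rbindlist() and sum p_rank by sequence:
combined <- rbindlist(allOfThem)
ranksums <- combined[, list(ranksum = sum(p_rank)), by = s]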
On a separate note: it might be a good habit to avoid naming objects "c" or after other common R functions. Otherwise, you will find that you start encountering many "unexplainable" behaviors, and after you've beaten your head against a wall for an hour trying to debug them, you realize that you've overwritten the name of a function. This has never happened to me :)
I'd put all the data.frames in a list and then use lapply and transform as follows:
my_l <- list(a,b,c) # all your data.frames
# you can use rank but it'll give you the average in case of ties
# lapply(my_l, function(x) transform(x, rank_p = rank(p)))
# I prefer this method instead
my_o <- lapply(my_l, function(x) transform(x, p = as.numeric(factor(p))))
# now bind them in to a single data.frame
my_o <- do.call(rbind, my_o)
# now paste them
aggregate(data = my_o, p ~ s, function(x) paste(x, collapse=","))
# s p
# 1 cttg 1,1,1
# 2 gggaa 3,3,3
# 3 gtcg 2,2,2
# 4 gtcgg 1,1,1
Edit: since you've asked for a potentially faster solution (due to large data), I'd suggest, like @Ricardo, a data.table solution:
require(data.table)
# bind all your data.frames together
dt <- rbindlist(my_l) # my_l is your list of data.frames
# replace p-value with their "rank"
dt[, p := as.numeric(factor(p))]
# set key
setkey(dt, "s")
# combine them using `,`
dt[, list(p_ranks = paste(p, collapse=",")), by=s]
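If you ultimately want the rank sum per sequence rather than the pasted ranks, the same data.table can be summarised with sum() instead of paste() (a sketch based on the question's stated goal, not part of the original answer):
dt[, list(ranksum = sum(p)), by=s]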
I have two dataframes and I would like to do independent 2-group t-tests on the rows (i.e. t.test(y1, y2) where y1 is a row in dataframe1 and y2 is the matching row in dataframe2).
What's the best way of accomplishing this?
EDIT:
I just found the format dataframe1[i, ] and dataframe2[i, ], which will work in a loop. Is that the best solution?
The approach you outlined is reasonable; just make sure to preallocate your storage vector. I'd double-check that you really want to compare the rows instead of the columns. Most datasets I work with have each row as a unit of observation and the columns as separate responses/variables of interest. Regardless, it's your data - so if that's what you need to do, here's an approach:
#Fake data
df1 <- data.frame(matrix(runif(100),10))
df2 <- data.frame(matrix(runif(100),10))
#Preallocate results
testresults <- vector("list", nrow(df1))
#For loop
for (j in seq(nrow(df1))){
testresults[[j]] <- t.test(df1[j,], df2[j,])
}
You now have a list that is as long as you have rows in df1. I would then recommend using lapply and sapply to easily extract things out of the list object.
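For example, to pull the p-values or test statistics out of that list (t.test() returns an htest object with named components), a sketch like this should work:
pvals <- sapply(testresults, function(res) res$p.value)
stats <- sapply(testresults, function(res) res$statistic)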
It would make more sense to have your data stored as columns.
You can transpose a data.frame with:
df1_t <- as.data.frame(t(df1))
df2_t <- as.data.frame(t(df2))
Then you can use mapply to cycle through the two data.frames a column at a time:
t.test_results <- mapply(t.test, x= df1_t, y = df2_t, SIMPLIFY = F)
Or you could use Map, which is a simple wrapper around mapply with SIMPLIFY = FALSE (thus saving keystrokes!):
t.test_results <- Map(t.test, x = df1_t, y = df2_t)
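As with the loop-based version, individual pieces of each test can then be extracted from the named list, e.g.:
sapply(t.test_results, "[[", "p.value")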
I have 2 datasets.
A = 3085 rows, 1 column.
B = 527 rows, 1000 columns.
All values in both of these datasets are the names of shapefiles.
I would like to create a new list of A - B[,1], i.e. remove from A any values that appear in the first column of B.
I will eventually be looping this for all 1000 columns.
If anyone can help, it would be much appreciated.
Regards,
If A and B are data.frames or matrices, you can use the following procedure:
A[!(A[,1] %in% B[,1]), 1]
I only now fully understood your question. To loop over all columns of B you can use an apply-family function. This call iterates over each column of B as the x parameter and returns a list whose length equals the number of columns of B; each element of the list is a vector of the elements of A that do not appear in the corresponding column of B.
apply(B, 2, function(x) A[!(A[,1] %in% x), 1])
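A minimal reproducible sketch with made-up shapefile names (the real A and B are assumed to be character data with the shapes described in the question):
A <- data.frame(name = c("roads", "rivers", "parks", "lakes"), stringsAsFactors = FALSE)
B <- data.frame(col1 = c("roads", "lakes"), col2 = c("parks", "towns"), stringsAsFactors = FALSE)
apply(B, 2, function(x) A[!(A[, 1] %in% x), 1])
# $col1
# [1] "rivers" "parks"
#
# $col2
# [1] "roads"  "rivers" "lakes"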
Something simple (but untested):
x <- A[, 1]
keep <- seq_along(x)
for (i in seq_len(ncol(B))) {
  # drop the positions whose value appears in column i of B
  keep <- keep[!(x[keep] %in% B[, i])]
}
A[keep, ]