Combine data.frames in R using only common row.names - r

I have five data.frames with gene expression data for different sets of samples. I have a different number of rows in each data.set and therefore only partly overlapping row.names (genes).
Now I want
a) to filter the five data.frames to contain only genes that are present in all data.frames and
b) to combine the gene expression data for those genes to one data.frame.
All I could find so far was merge, but that can only merge two data.frames, so I'd have to use it multiple times. Is there an easier way?

Merging is not very efficient if you want to exclude row names which are not present in every data frame. Here's a different proposal.
First, three example data frames:
df1 <- data.frame(a = 1:5, b = 1:5,
                  row.names = letters[1:5]) # letters a to e
df2 <- data.frame(a = 1:5, b = 1:5,
                  row.names = letters[3:7]) # letters c to g
df3 <- data.frame(a = 1:5, b = 1:5,
                  row.names = letters[c(1,2,3,5,7)]) # letters a, b, c, e, and g
# row names being present in all data frames: c and e
Put the data frames into a list:
dfList <- list(df1, df2, df3)
Find common row names:
idx <- Reduce(intersect, lapply(dfList, rownames))
Extract data:
df1[idx, ]
#   a b
# c 3 3
# e 5 5
PS. If you want to keep the corresponding rows from all data frames, you could replace the last step, df1[idx, ], with the following command:
do.call(rbind, lapply(dfList, "[", idx, ))
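For expression data where each data frame holds different samples as columns, binding side by side may be closer to what the question asks for in (b). The following is a minimal sketch under that assumption; the table names (expr1 to expr3), gene names, and sample names are made up for illustration.

```r
# Hypothetical expression tables: rows are genes, columns are samples
expr1 <- data.frame(s1 = c(1.2, 3.4, 5.6), s2 = c(2.1, 4.3, 6.5),
                    row.names = c("geneA", "geneB", "geneC"))
expr2 <- data.frame(s3 = c(7.8, 9.0, 1.1),
                    row.names = c("geneB", "geneC", "geneD"))
expr3 <- data.frame(s4 = c(0.5, 0.6, 0.7), s5 = c(1.5, 1.6, 1.7),
                    row.names = c("geneC", "geneB", "geneE"))
exprList <- list(expr1, expr2, expr3)

# Genes present in every table
common <- Reduce(intersect, lapply(exprList, rownames))

# Subset each table to the common genes, in the same order;
# drop = FALSE keeps single-column tables as data frames
subsets <- lapply(exprList, function(d) d[common, , drop = FALSE])

# Bind the samples side by side into one data frame
combined <- do.call(cbind, subsets)
combined
```

Indexing by row name puts every subset in the same gene order, so cbind lines the rows up correctly even when the input tables list the genes in different orders.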

Check out the uppermost answer in this SO post. Just list your data frames and apply the following line of code:
Reduce(function(...) merge(..., by = "x"), list.of.dataframes)
You just have to adjust the by argument to specify by which common column the data frames should be merged.
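Since the question keys on row names rather than a column, the Reduce/merge idea needs a small wrapper: merge(by = "row.names") moves the key into a Row.names column, which has to be turned back into row names before the next merge. The sketch below is one possible way to do that; merge_by_rownames and the tables t1 to t3 are hypothetical names, not part of the linked answer.

```r
# Hypothetical tables with partly overlapping row names
t1 <- data.frame(x1 = 1:5,   row.names = letters[1:5])
t2 <- data.frame(x2 = 11:15, row.names = letters[3:7])
t3 <- data.frame(x3 = 21:25, row.names = letters[c(1, 2, 3, 5, 7)])

# Merge two data frames on their row names and restore the row names,
# so the result can be fed into the next merge
merge_by_rownames <- function(x, y) {
  m <- merge(x, y, by = "row.names")        # inner join by default
  rownames(m) <- as.character(m$Row.names)  # key column back to row names
  m$Row.names <- NULL
  m
}

merged <- Reduce(merge_by_rownames, list(t1, t2, t3))
merged
```

Because merge performs an inner join unless all = TRUE is set, only row names present in every table survive the chain.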

Related

Adding a column to every dataframe in a list with the name of the list element

I have a list containing multiple data frames, and each list element has a unique name. The structure is similar to this dummy data
a <- data.frame(z = rnorm(20), y = rnorm(20))
b <- data.frame(z = rnorm(30), y = rnorm(30))
c <- data.frame(z = rnorm(40), y = rnorm(40))
d <- data.frame(z = rnorm(50), y = rnorm(50))
my.list <- list(a,b,c,d)
names(my.list) <- c("a","b","c","d")
I want to create a column in each of the data frames that has the name of its respective list element. My goal is to merge all the list elements into a single data frame, and know which data frame they came from originally. The end result I'm looking for is something like this:
            z           y group
1   0.6169132  0.09803228     a
2   1.1610584  0.50356131     a
3   0.6399438  0.84810547     a
4   1.0878453  1.00472105     b
5  -0.3137200 -1.20707112     b
6   1.1428834  0.87852556     b
7  -1.0651735 -0.18614224     c
8   1.1629891 -0.30184443     c
9  -0.7980089 -0.35578381     c
10  1.4651651 -0.30586852     d
11  1.1936547  1.98858128     d
12  1.6284174 -0.17042835     d
My first thought was to use mutate to assign the list element name to a column in each respective data frame, but it appears that when used within lapply, names() refers to the column names, not the list element names
test <- lapply(my.list, function(x) mutate(x, group = names(x)))
Error: Column `group` must be length 20 (the number of rows) or one, not 2
Any suggestions as to how I could approach this problem?
There is no need to mutate; just bind the rows with dplyr's bind_rows:
library(tidyverse)
my.list %>%
  bind_rows(.id = "groups")
The labels in the new column are taken from the list names, so this requires that the list is named.
We can use Map from base R
Map(cbind, my.list, group = names(my.list))
Or with imap from purrr
library(dplyr)
library(purrr)
imap(my.list, ~ .x %>% mutate(group = .y))
Or if the intention is to create a single data.frame
library(data.table)
rbindlist(my.list, idcol = 'groups')
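The attempt in the question fails because lapply hands the function only each element's value, not its name, so names(x) inside the function returns the data frame's column names. Iterating over the elements and their names in parallel avoids this. Below is a base R sketch of the whole task with no extra packages; the list is a shortened stand-in for the question's data.

```r
set.seed(1)
# Shortened stand-in for the named list in the question
my.list <- list(a = data.frame(z = rnorm(2), y = rnorm(2)),
                b = data.frame(z = rnorm(3), y = rnorm(3)))

# Walk over the data frames and their names together,
# adding the name as a constant column
tagged <- Map(function(df, nm) {
  df$group <- nm
  df
}, my.list, names(my.list))

# Stack everything into one data frame and drop the "a.1"-style row names
combined <- do.call(rbind, tagged)
rownames(combined) <- NULL
combined
```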

join two lists of data frames into a single list of binded_rows data frames

I have two lists of dataframes. I would like to combine each dataframe in each of the lists, one on top of the other, using something like bind_rows or just rbind. The two lists have columns with exactly the same names and order.
Something like combined <- map_df(rapheys_df_list, XGB_models_Prep, bind_rows) which resulted in "Error: Index 1 must have length 1".
How can I join the two lists of dataframes into a combined single list where each dataframe in one list is combined by rows on top of the other?
We need map2 for binding two corresponding lists
library(purrr)
map2_dfr(rapheys_df_list, XGB_models_Prep, bind_rows)
data
rapheys_df_list <- list(data.frame(col1 = 1:3, col2 = 4:6),
                        data.frame(col1 = 7:9, col2 = 10:12))
XGB_models_Prep <- list(data.frame(col1 = 2:5, col2 = 3:6),
                        data.frame(col1 = 4:6, col2 = 0:2))
base R
Reduce(rbind, Map(rbind, rapheys_df_list, XGB_models_Prep))
# or, with the same result:
do.call(rbind, Map(rbind, rapheys_df_list, XGB_models_Prep))
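The snippets above collapse everything into a single data frame. If the goal is the list asked for in the question, with one row-bound data frame per position, the inner Map call on its own may already be enough. A small sketch, using the placeholder names list_a and list_b for the two input lists:

```r
# Placeholder inputs mirroring the example data
list_a <- list(data.frame(col1 = 1:3, col2 = 4:6),
               data.frame(col1 = 7:9, col2 = 10:12))
list_b <- list(data.frame(col1 = 2:5, col2 = 3:6),
               data.frame(col1 = 4:6, col2 = 0:2))

# Map recycles shorter inputs silently, so check the lengths first
stopifnot(length(list_a) == length(list_b))

# Element i of the result is list_a[[i]] stacked on top of list_b[[i]]
combined_list <- Map(rbind, list_a, list_b)

length(combined_list)        # one data frame per pair
sapply(combined_list, nrow)  # rows per combined data frame
```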

Extract specific information from a list of dataframes

I am facing the following challenge:
I have a list of dataframes in R and I'd like to extract some specific information from it. Here is an example:
df_1 <- data.frame(A = c(1,2), B = c(3,4), D = c(5,6))
df_2 <- data.frame(A = c(7,8), B = c(9,10), D = c(11,12))
df_3 <- data.frame(A = c(0,1), B = c(2,3), D = c(4,5))
L <- list(df_1, df_2, df_3)
What I'd like to extract are the values at position (1,1) in each of these dataframes. In the above case this would be: 1, 7, 0.
Is there a way to extract this information easily, probably with one line of code?
As Ronak has suggested, you can use a function like lapply and wrap it in unlist to get the desired output.
unlist(lapply(L,function(x) x[1,1]))
In addition to the *apply methods shown above, you can also do this in a vectorized manner. Since all the data frames in your list have the same column names, and you want the first element of the first column, i.e. 'A1', you can simply unlist the whole list (which creates a named vector) and grab the values named A1.
v1 <- unlist(L)
v1[names(v1) == 'A1']
#A1 A1 A1
# 1 7 0
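A type-checked variant of the lapply approach is vapply, which states up front that each data frame should yield exactly one numeric value and fails loudly otherwise. This is a sketch of that alternative, reusing the example data from the question:

```r
df_1 <- data.frame(A = c(1, 2), B = c(3, 4), D = c(5, 6))
df_2 <- data.frame(A = c(7, 8), B = c(9, 10), D = c(11, 12))
df_3 <- data.frame(A = c(0, 1), B = c(2, 3), D = c(4, 5))
L <- list(df_1, df_2, df_3)

# numeric(1) declares the expected shape of each result:
# a single numeric value per data frame
vals <- vapply(L, function(x) x[1, 1], numeric(1))
vals
```

Unlike the name-based lookup, this does not depend on the column being called A, only on its position.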

Most efficient way to find common values for one column across many data frames

I have many files and I am trying to find the most efficient way of reading the data frames and finding common values in one column.
For now I have:
1. I read a list of files using:
files = c("test1.txt", "test2.txt", "test3.txt")
my.data <- lapply(files, read.table, header=TRUE)
Each containing columns e.g.
df1 = data.frame(id=c("a", "b", "c"), v = c(1:3), c=c(10:12))
df2 = data.frame(id=c("x", "b", "c"), v = c(2:4), c=c(13:15))
df3 = data.frame(id=c("a", "n", "c"), v = c(4:6), c=c(16:18))
my.data = list(df1, df2, df3)
And now I am trying to subset the list of data frames to return the same list of data frames each containing only the common rows for the first column called "id", e.g.
df1, df2, and df3 in this case would be a list containing only "id" common to all read files, i.e. a row with only "c" in this case:
intersect(intersect(df1$id, df2$id), df3$id);
list(df1[3,], df2[3,], df3[3,])
But I can't figure out how to do this with a list of data frames. Maybe this is a longer or more difficult process than reading all files, merging them first by the common column "id", and then splitting the result back into a list of data frames? Does anybody have any insight into the most efficient way? Thank you!
To find the common intersection of the id columns, you can use
common <- Reduce(intersect, Map("[[", my.data, "id"))
Then we can use that to subset the list elements.
lapply(my.data, function(x) x[x$id %in% common, ])
# [[1]]
# id v c
# 3 c 3 12
#
# [[2]]
# id v c
# 3 c 4 15
#
# [[3]]
# id v c
# 3 c 6 18
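To show the read and filter steps end to end, here is a sketch that writes three small tables to temporary files first, since the files in the question are not available. The file names and contents are stand-ins; only the last three steps correspond to the approach above.

```r
# Stand-in tables written to temporary files
tmp <- tempfile(c("test1_", "test2_", "test3_"), fileext = ".txt")
tables <- list(
  data.frame(id = c("a", "b", "c"), v = 1:3),
  data.frame(id = c("x", "b", "c"), v = 2:4),
  data.frame(id = c("a", "n", "c"), v = 4:6)
)
invisible(Map(function(d, f) write.table(d, f, row.names = FALSE, quote = FALSE),
              tables, tmp))

# Read every file into a list of data frames
my.data <- lapply(tmp, read.table, header = TRUE, stringsAsFactors = FALSE)

# ids present in every file
common <- Reduce(intersect, lapply(my.data, `[[`, "id"))

# Keep only the rows with a common id in each data frame
filtered <- lapply(my.data, function(x) x[x$id %in% common, , drop = FALSE])

unlink(tmp)  # clean up the temporary files
filtered
```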

R: Looping through list of dataframes in a vector

I have a dataset where I only want to loop through certain columns in a dataframe one at a time to create a graph. The structure of my dataframe consists of data that I parsed from a larger dataset into a vector containing multiple dataframes.
I want to call one column from one dataframe in the vector. I want to loop on the dataframes to call each column.
See example below:
d1 <- data.frame(y1=c(1,2,3),y2=c(4,5,6))
d2 <- data.frame(y1=c(3,2,1),y2=c(6,5,4))
my.list <- list(d1, d2)
All I have to work with is my.list
How would I do this?
You can use lapply to plot each of the individual data frames in your list. For example,
d1 <- data.frame(y1=c(1,2,3),y2=c(4,5,6),y3=c(7,8,9))
d2 <- data.frame(y1=c(3,2,1),y2=c(6,5,4),y3=c(11,12,13))
mylist <- list(d1, d2)
par(mfrow=c(2,1))
# lapply on a subset of columns
lapply(mylist, function(x) plot(x$y2, x$y3))
You don't need a for loop to get the data points. You can select the columns by name.
# a toy dataframe
d <- data.frame(A = 1:20, B = sample(c(FALSE, TRUE), 20, replace = TRUE),
                C = LETTERS[1:20], D = rnorm(20, 0, 1))
col_names <- c("A", "B", "D") # names of columns I want to get
d[,col_names] # returns a dataset with the values of the columns you want
Here is a solution to your problem using a for loop:
# a toy list of data frames
mylist <- list(dat1 = data.frame(A = 1:20, B = LETTERS[1:20]),
               dat2 = data.frame(A = 21:40, B = LETTERS[1:20]),
               dat3 = data.frame(A = 41:60, B = LETTERS[1:20]))
col_names <- c("A") # names of the columns I want to get
for (i in seq_along(mylist)) {
  # you can do whatever you want with what is returned;
  # here I am just printing it out
  print(names(mylist)[i]) # name of the data frame
  print(mylist[[i]][, col_names]) # values in column A
}
I think the simplest answer to your question is to use double brackets.
for (i in seq_along(my.list)) {
  print(my.list[[i]]$column)
}
That works assuming the data frames in your list all share the same column names. You could also refer to the column by its position in the data frame if you wanted.
Yes, lapply can be more elegant, but in some situations a for loop makes more sense.
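Putting the pieces together, a nested loop can visit each selected column of each data frame in turn. The sketch below uses the question's data and stores a column mean so the result can be inspected; the mean is only a stand-in for whatever plotting call is needed.

```r
d1 <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4))
my.list <- list(d1, d2)

cols <- c("y1", "y2")  # columns to visit in every data frame

# One row per data frame, one column per selected variable
col_means <- matrix(NA_real_, nrow = length(my.list), ncol = length(cols),
                    dimnames = list(NULL, cols))

for (i in seq_along(my.list)) {  # seq_along is safe for empty lists
  for (col in cols) {
    values <- my.list[[i]][[col]]     # one column of one data frame
    col_means[i, col] <- mean(values) # swap in plot(values) to draw instead
  }
}
col_means
```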

Resources