Extract specific information from a list of dataframes - r

I am facing the following challenge:
I have a list of dataframes in R and I'd like to extract some specific information from it. Here is an example:
df_1 <- data.frame(A = c(1,2), B = c(3,4), D = c(5,6))
df_2 <- data.frame(A = c(7,8), B = c(9,10), D = c(11,12))
df_3 <- data.frame(A = c(0,1), B = c(2,3), D = c(4,5))
L <- list(df_1, df_2, df_3)
What I'd like to extract are the values at position (1,1) in each of these dataframes. In the above case this would be: 1, 7, 0.
Is there a way to extract this information easily, probably with one line of code?

As Ronak suggested, you can use a function like lapply and wrap it in unlist to get the desired output.
unlist(lapply(L,function(x) x[1,1]))
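A type-checked variant, as a rough sketch: vapply does the same job as unlist(lapply(...)) but declares up front that each result must be a single number, so an unexpected value raises an error instead of silently changing the output type. The name first_vals is only illustrative.

```r
# Example data from the question
df_1 <- data.frame(A = c(1, 2), B = c(3, 4), D = c(5, 6))
df_2 <- data.frame(A = c(7, 8), B = c(9, 10), D = c(11, 12))
df_3 <- data.frame(A = c(0, 1), B = c(2, 3), D = c(4, 5))
L <- list(df_1, df_2, df_3)

# Take the element at row 1, column 1 of each data frame;
# numeric(1) states that every result must be one numeric value
first_vals <- vapply(L, function(x) x[1, 1], numeric(1))
first_vals
# [1] 1 7 0
```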

In addition to the *apply methods shown above, you can also do this in a vectorized manner. Since all the data frames in your list have the same column names and you want the first element of the first column, i.e. 'A1', you can simply unlist (which creates a named vector) and grab the values named A1.
v1 <- unlist(L)
v1[names(v1) == 'A1']
#A1 A1 A1
# 1 7 0

Related

Adding a column to every dataframe in a list with the name of the list element

I have a list containing multiple data frames, and each list element has a unique name. The structure is similar to this dummy data
a <- data.frame(z = rnorm(20), y = rnorm(20))
b <- data.frame(z = rnorm(30), y = rnorm(30))
c <- data.frame(z = rnorm(40), y = rnorm(40))
d <- data.frame(z = rnorm(50), y = rnorm(50))
my.list <- list(a,b,c,d)
names(my.list) <- c("a","b","c","d")
I want to create a column in each of the data frames that holds the name of its respective list element. My goal is to merge all the list elements into a single data frame and know which data frame each row came from originally. The end result I'm looking for is something like this:
            z           y group
1   0.6169132  0.09803228     a
2   1.1610584  0.50356131     a
3   0.6399438  0.84810547     a
4   1.0878453  1.00472105     b
5  -0.3137200 -1.20707112     b
6   1.1428834  0.87852556     b
7  -1.0651735 -0.18614224     c
8   1.1629891 -0.30184443     c
9  -0.7980089 -0.35578381     c
10  1.4651651 -0.30586852     d
11  1.1936547  1.98858128     d
12  1.6284174 -0.17042835     d
My first thought was to use mutate to assign the list element name to a column in each respective data frame, but it appears that when used within lapply, names() refers to the column names, not the list element names:
test <- lapply(my.list, function(x) mutate(x, group = names(x)))
Error: Column `group` must be length 20 (the number of rows) or one, not 2
Any suggestions as to how I could approach this problem?
There is no need to mutate; just bind the data frames using dplyr's bind_rows:
library(tidyverse)
my.list %>%
bind_rows(.id = "groups")
This requires that the list be named.
We can use Map from base R
Map(cbind, my.list, group = names(my.list))
Or with imap from purrr
library(dplyr)
library(purrr)
imap(my.list, ~ .x %>% mutate(group = .y))
Or if the intention is to create a single data.frame
library(data.table)
rbindlist(my.list, idcol = 'groups')
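For completeness, a base R sketch that goes straight to a single data frame by combining the Map approach above with do.call(rbind, ...). The tiny my.list below is made-up data so the result is easy to check, and the name combined is only illustrative.

```r
# Small named list of data frames (illustrative data, not from the question)
my.list <- list(a = data.frame(z = 1:2, y = 3:4),
                b = data.frame(z = 5:7, y = 8:10))

# Add a 'group' column holding each element's name, then stack the pieces
with_group <- Map(cbind, my.list, group = names(my.list))
combined <- do.call(rbind, with_group)

# rbind builds row names like "a.1"; reset them to plain 1..n
rownames(combined) <- NULL
combined
```

This needs no extra packages, although bind_rows or rbindlist will likely be faster on large lists.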

R reduced subsetting a List

I have a question of subsetting a nested list by names.
I have an example list like:
test_list <- list(a = list(A1 = c(1,2,3), A2 = c(4,5,6)),
b = c(7,8,9),
c = list(C1 = c(10,11,12), C2 = list(C21 =c(13,14,15))))
And I want to subset values based on a vector like lnames <- c('c','C2','C21'). The way I can think of doing this is using:
exp_str <- paste0('test_list','$',paste0(lnames, collapse = '$'))
eval(parse(text = exp_str))
But this seems a little bit clunky to me. I am just wondering if there's a functional way to do this, such as using a reduce function.
You can just do
test_list[[lnames]]
# [1] 13 14 15
This is somewhat cryptically described in the ?Extract help page.
[[ can be applied recursively to lists, so that if the single index i is a vector of length p, alist[[i]] is equivalent to alist[[i1]]...[[ip]] providing all but the final indexing results in a list.
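Since the question asks about a reduce-style approach, here is a minimal sketch using base R's Reduce, which descends one level per name. The helper name get_nested is made up for illustration; for this example it should give the same value as test_list[[lnames]].

```r
test_list <- list(a = list(A1 = c(1, 2, 3), A2 = c(4, 5, 6)),
                  b = c(7, 8, 9),
                  c = list(C1 = c(10, 11, 12), C2 = list(C21 = c(13, 14, 15))))
lnames <- c("c", "C2", "C21")

# Hypothetical helper: start from the whole list and index with one name per step
get_nested <- function(lst, path) {
  Reduce(function(acc, nm) acc[[nm]], path, lst)
}

result <- get_nested(test_list, lnames)
result
# [1] 13 14 15
```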

R: Looping through list of dataframes in a vector

I have a dataset where I only want to loop through certain columns in a dataframe one at a time to create a graph. The structure of my dataframe consists of data that I parsed from a larger dataset into a vector containing multiple dataframes.
I want to call one column from one dataframe in the vector. I want to loop on the dataframes to call each column.
See example below:
d1 <- data.frame(y1=c(1,2,3),y2=c(4,5,6))
d2 <- data.frame(y1=c(3,2,1),y2=c(6,5,4))
my.list <- list(d1, d2)
All I have to work with is my.list
How would I do this?
You can use lapply to plot each of the individual data frames in your list. For example,
d1 <- data.frame(y1=c(1,2,3),y2=c(4,5,6),y3=c(7,8,9))
d2 <- data.frame(y1=c(3,2,1),y2=c(6,5,4),y3=c(11,12,13))
mylist <- list(d1, d2)
par(mfrow=c(2,1))
# lapply on a subset of columns
lapply(mylist, function(x) plot(x$y2, x$y3))
You don't need a for loop to get the data points. You can select columns by their names.
# a toy dataframe
d <- data.frame(A = 1:20, B = sample(c(FALSE, TRUE), 20, replace = TRUE),
C = LETTERS[1:20], D = rnorm(20, 0, 1))
col_names <- c("A", "B", "D") # names of columns I want to get
d[,col_names] # returns a dataset with the values of the columns you want
Here is a solution to your problem using a for loop:
# a toy list of data frames
mylist <- list(dat1 = data.frame(A = 1:20, B = LETTERS[1:20]),
dat2 = data.frame(A = 21:40, B = LETTERS[1:20]),
dat3 = data.frame(A = 41:60, B = LETTERS[1:20]))
col_names <- c("A") # name of columns I want to get
for (i in seq_along(mylist)) {
# you can do whatever you want with what is returned;
# here I am just printing the values
print(names(mylist)[i]) # name of the data frame
print(mylist[[i]][,col_names]) # values in Column A
}
I think the simplest answer to your question is to use double brackets.
for (i in seq_along(my.list)) {
print(my.list[[i]]$column)
}
That works assuming all of the columns in your list of data frames have the same names. You could also call the position of the column in the data frame if you wanted.
Yes, lapply can be more elegant, but in some situations a for loop makes more sense.
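As a sketch of the lapply alternative for pulling one column out of every data frame, using the d1 and d2 from the question: passing "[[" as the function extracts the named column from each list element. The variable names col and y1_values are only illustrative.

```r
d1 <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4))
my.list <- list(d1, d2)

# Name of the column to extract from each data frame
col <- "y1"

# One vector per data frame, in the same order as my.list
y1_values <- lapply(my.list, "[[", col)
y1_values[[1]]
# [1] 1 2 3
```

Each element of y1_values can then be passed to plot() or any other function inside a loop or a further lapply call.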

Combine data.frames in R using only common row.names

I have five data.frames with gene expression data for different sets of samples. I have a different number of rows in each data.set and therefore only partly overlapping row.names (genes).
Now I want
a) to filter the five data.frames to contain only genes that are present in all data.frames and
b) to combine the gene expression data for those genes to one data.frame.
All I could find so far was merge, but that can only merge two data.frames, so I'd have to use it multiple times. Is there an easier way?
Merging is not very efficient if you want to exclude row names which are not present in every data frame. Here's a different proposal.
First, three example data frames:
df1 <- data.frame(a = 1:5, b = 1:5,
row.names = letters[1:5]) # letters a to e
df2 <- data.frame(a = 1:5, b = 1:5,
row.names = letters[3:7]) # letters c to g
df3 <- data.frame(a = 1:5, b = 1:5,
row.names = letters[c(1,2,3,5,7)]) # letters a, b, c, e, and g
# row names being present in all data frames: c and e
Put the data frames into a list:
dfList <- list(df1, df2, df3)
Find common row names:
idx <- Reduce(intersect, lapply(dfList, rownames))
Extract data:
df1[idx, ]
  a b
c 3 3
e 5 5
PS. If you want to keep the corresponding rows from all data frames, you could replace the last step, df1[idx, ], with the following command:
do.call(rbind, lapply(dfList, "[", idx, ))
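If the goal is instead to place the data sets side by side (one row per common gene, one block of columns per data frame), a possible sketch is to subset each data frame to the common row names and then cbind the pieces. This assumes a column-wise combination is what part b) of the question asks for; the prefixes df1, df2, df3 and the name combined are only illustrative.

```r
df1 <- data.frame(a = 1:5, b = 1:5, row.names = letters[1:5])
df2 <- data.frame(a = 1:5, b = 1:5, row.names = letters[3:7])
df3 <- data.frame(a = 1:5, b = 1:5, row.names = letters[c(1, 2, 3, 5, 7)])
dfList <- list(df1 = df1, df2 = df2, df3 = df3)

# Row names present in every data frame
idx <- Reduce(intersect, lapply(dfList, rownames))

# Keep only the common rows, in the same order, in each data frame
subsets <- lapply(dfList, function(d) d[idx, , drop = FALSE])

# Bind column-wise; the list names are used to tell the column blocks apart
combined <- do.call(cbind, subsets)
combined
```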
Check out the uppermost answer in this SO post. Just list your data frames and apply the following line of code:
Reduce(function(...) merge(..., by = "x"), list.of.dataframes)
You just have to adjust the by argument to specify by which common column the data frames should be merged.

R: control auto-created column names in call to rbind()

If I do something like this:
> df <- data.frame()
> rbind(df, c("A","B","C"))
  X.A. X.B. X.C.
1    A    B    C
You can see the row gets added to the empty data frame. However, the columns get named automatically based on the content of the data.
This causes problems if I later want to:
> df <- rbind(df, c("P", "D", "Q"))
Is there a way to control the names of the columns that get automatically created by rbind? Or some other way to do what I'm attempting to do here?
@baha-kev has a good answer regarding strings and factors.
I just want to point out the weird behavior of rbind for data.frame:
# This "should work", but it doesn't:
# Create an empty data.frame with the correct names and types
df <- data.frame(A=numeric(), B=character(), C=character(), stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Messes up names!
rbind(df, list(A=42, B='foo', C='bar')) # OK...
# If you have at least one row, names are kept...
df <- data.frame(A=0, B="", C="", stringsAsFactors=FALSE)
rbind(df, list(42, 'foo', 'bar')) # Names work now...
But if you only have strings then why not use a matrix instead? Then it works fine to start with an empty matrix:
# Create a 0x3 matrix:
m <- matrix('', 0, 3, dimnames=list(NULL, LETTERS[1:3]))
# Now add a row:
m <- rbind(m, c('foo','bar','baz')) # This works fine!
m
# Then optionally turn it into a data.frame at the end...
as.data.frame(m, stringsAsFactors=FALSE)
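Another way to keep control of the column names, sketched under the assumption that all rows are character vectors of the same length: collect the rows in a list, bind them once, and set the names explicitly at the end. The column names first, second and third are placeholders.

```r
# Rows collected over time (the example values from the question)
rows <- list(c("A", "B", "C"),
             c("P", "D", "Q"))

# Bind all rows into a character matrix in one step
m <- do.call(rbind, rows)

# Convert to a data frame and assign the column names explicitly
df <- as.data.frame(m, stringsAsFactors = FALSE)
names(df) <- c("first", "second", "third")
df
```

Binding once at the end also avoids repeatedly copying the data frame, which can become slow when rows are added one at a time in a loop.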
Set the stringsAsFactors argument to FALSE, which stores the values as character strings:
df <- data.frame(first = 'A', second = 'B', third = 'C', stringsAsFactors = FALSE)
df2 <- rbind(df, c('Horse', 'Dog', 'Cat'))
df2
  first second third
1     A      B     C
2 Horse    Dog   Cat
sapply(df2, class)
      first      second       third
"character" "character" "character"
Later, if you want to use factors, you could convert the columns like this:
df2[] <- lapply(df2, factor)
