Suppose I have 3 dataframes in the current R environment, named as d1f, df2, df_3. There is no pattern for their names. How can I access one dataframe by its name?
For example, I have a for loop to process the three dataframes. How can I do something like this?
df_names<-c("d1f", "df2", "df_3")
for(name in df_names)
{
df<-some_function(name)
# ... some action on df ...
}
It is best to store the data frames in a list, like so:
set.seed(1)
d1f = rnorm(10)
df2 = rnorm(10)
df_3 = rnorm(10)
dfs = list(d1f, df2, df_3)
for (i in seq_along(dfs)){
dfs[[i]] = dfs[[i]] + 1 # e.g. add 1 to every element of each object in the list
}
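If the objects already exist under their own names, the list can also be built from the names themselves. The following is a minimal sketch with toy data (the column x is made up for illustration): mget() fetches several objects by name and returns a named list, while get() does the same for a single name.

```r
# Three data frames with unrelated names (toy data for illustration)
d1f  <- data.frame(x = 1:3)
df2  <- data.frame(x = 4:6)
df_3 <- data.frame(x = 7:9)

df_names <- c("d1f", "df2", "df_3")

# mget() returns a named list holding the objects with those names.
# environment() is the current environment (the global one at top level).
dfs <- mget(df_names, envir = environment())

for (name in df_names) {
  df <- dfs[[name]]    # a single lookup could also use get(name)
  df$x <- df$x + 1     # some action on df
  dfs[[name]] <- df    # store the result back in the list
}
```

Keeping the results in the named list dfs, rather than writing them back to separate variables, means later steps can keep using dfs[[name]].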
I have a list of strings named control_for and a data frame sampleTable, some of whose columns are named after strings in control_for. I also have a third object, dge_obj (a DGEList object), to which I want to append those columns. What I want to do is use lapply to loop through control_for and, for each string, find the column in sampleTable with the same name and add that column (as a factor) to the DGEList object. Doing it manually for just one column looks like this, and it works:
group <- as.factor(sampleTable[,3])
dge_obj$samples$group <- group
And I tried something like this:
lapply(control_for, function(x) {
x <- as.factor(sampleTable[, x])
dge_obj$samples$x <- x
})
Which doesn't work. I guess the problem is that R can't recognize addressing columns like this. Can someone help?
Here are two base R ways of doing it. The data set is the example from help("DGEList"), together with a mock-up data.frame sampleTable.
Define a vector common_vars holding the names in control_for that are also column names of sampleTable. Then create the new columns.
library(edgeR)
sampleTable <- data.frame(a = 1:4, b = 5:8, no = letters[21:24])
control_for <- c("a", "b")
common_vars <- intersect(control_for, names(sampleTable))
1. for loop
for(x in common_vars){
y <- sampleTable[[x]]
dge_obj$samples[[x]] <- factor(y)
}
2. *apply loop.
# lapply keeps each column a factor; sapply would simplify them to a character matrix
tmp <- lapply(sampleTable[common_vars], factor)
dge_obj$samples <- cbind(dge_obj$samples, tmp)
This code can be rewritten as a one-liner.
Data
set.seed(2021)
y <- matrix(rnbinom(10000,mu=5,size=2),ncol=4)
dge_obj <- DGEList(counts=y, group=rep(1:2,each=2))
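To see why the attempt in the question does not work, here is a small sketch that uses a plain list as a stand-in for the DGEList object (the names obj and f are hypothetical, and edgeR is not needed). It illustrates two separate issues: `$` does not evaluate the name on its right-hand side, and an assignment inside a function only changes a local copy.

```r
# Stand-ins for the real objects (hypothetical toy data, no edgeR needed)
sampleTable <- data.frame(a = 1:4, b = 5:8)
obj <- list(samples = data.frame(id = 1:4))

x <- "a"

# `$` takes its right-hand side literally: this creates a column called "x"
obj$samples$x <- factor(sampleTable[[x]])

# `[[` evaluates x first: this creates a column called "a"
obj$samples[[x]] <- factor(sampleTable[[x]])

names(obj$samples)   # "id" "x" "a"

# An assignment inside a function only modifies a local copy of obj
f <- function(col) obj$samples[[col]] <- factor(sampleTable[[col]])
invisible(lapply("b", f))
"b" %in% names(obj$samples)   # FALSE: the outer obj is unchanged
```

This is why both solutions above use `[[` with the column name and assign to dge_obj outside of any function.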
I already have a list of data frames (mylist) and need to switch the first and second column for all the data frames in the list.
Test Data Frame in List
[reads] [phylum]
1 phylum1
2 phylum2
3 phylum3
Into....
[phylum] [reads]
phylum1 1
phylum2 2
phylum3 3
I know I need to use lapply, but not sure what to input for the FUN=
mylist <- lapply(mylist, FUN = mylist[ ,c("phylum", "reads")])
errors saying incorrect number of dimensions
Sorry if this is a simple question and thanks in advance for your help!
-Brand new R user
The FUN argument expects a function that lapply can apply to every element of the list. You are passing mylist[ ,c("phylum", "reads")], which is not a function.
# sample data
df1 <- data.frame(reads = sample(10,4), phylum = sample(10,4))
df2 <- data.frame(reads = sample(10,4), phylum = sample(10,4))
df3 <- data.frame(reads = sample(10,4), phylum = sample(10,4))
df4 <- data.frame(reads = sample(10,4), phylum = sample(10,4))
ldf <- list(df1,df2,df3,df4)
ldf_re <- lapply(ldf, FUN = function(X){X[c('phylum', 'reads')]})
In the last line, lapply iterates through all the data frames, passing each one as the X argument of the function given in FUN. The resulting data frames, with their columns rearranged, are stored in the list ldf_re.
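If the column names are not known in advance, the same idea works by position. This is only a sketch: the helper name swap_first_two and the toy data are made up for illustration, and it assumes every data frame has at least two columns.

```r
# Toy list of data frames (hypothetical data)
mylist <- list(
  data.frame(reads = 1:3, phylum = c("phylum1", "phylum2", "phylum3")),
  data.frame(reads = 4:6, phylum = c("phylum4", "phylum5", "phylum6"),
             extra = TRUE)
)

# Swap columns 1 and 2, keep any further columns in their original order
swap_first_two <- function(d) d[c(2, 1, seq_along(d)[-(1:2)])]

swapped <- lapply(mylist, swap_first_two)
lapply(swapped, names)
```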
I have several data frames and want to pass them to a function for processing. Say there are 4 data frames and I want to rename the first column of each to 'ROWNUM'.
df1 = data.frame(c(1:10),sample(1:100,10))
df2 = data.frame(c(1:10),sample(1:100,10))
df3 = data.frame(c(1:10),sample(1:100,10))
df4 = data.frame(c(1:10),sample(1:100,10))
function(df) colnames(df)[1] = 'ROWNUM'
My objective is to rename them all in one shot rather than passing them one by one.
Thanks.
We can use lapply after collecting the datasets in a list:
nm1 <- ls(pattern="^df\\d+$") # anchored so that only names like df1, df2, ... match
lst <- lapply(mget(nm1), function(x) {
colnames(x)[1] <- 'ROWNUM'
x})
It is better to keep the datasets in a list, but if we need to update the original datasets
list2env(lst, envir=.GlobalEnv)
Or we use assign
for(j in seq_along(nm1)){
assign(nm1[j], `names<-`(get(nm1[j]),
c("ROWNUM", names(get(nm1[j]))[-1])))
}
Suppose I have a list object such as:
set.seed(123)
df <- data.frame(x = rnorm(5), y = rbinom(5,2,0.5))
rownames(df) <- LETTERS[1:5]
ls <- list(df1 = df, df2 = df, df3 = df)
My question is how to quickly check whether the row names are identical across the three elements (data frames) in ls.
You can try
all(sapply(ls, rownames) == rownames(ls[[1]]))
To check only the name of the ith row, you can modify this to
all(sapply(ls, rownames)[i, ] == rownames(ls[[1]])[i])
You can get a list of row names with:
Map(rownames, ls)
so you can check that all the data frames have the same row names by verifying that this list contains only one unique vector of row names:
length(unique(Map(rownames, ls))) == 1
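The element-wise `==` comparison relies on all data frames having the same number of rows. As an alternative sketch, identical() can compare each set of row names against the first one and also copes with differing row counts. The helper name same_rownames and the list name dfl are made up for illustration.

```r
df <- data.frame(x = 1:5, y = 6:10)
rownames(df) <- LETTERS[1:5]
dfl <- list(df1 = df, df2 = df, df3 = df[1:4, ])   # df3 has one row fewer

# TRUE only if every data frame has exactly the row names of the first one
same_rownames <- function(lst) {
  ref <- rownames(lst[[1]])
  all(vapply(lst, function(d) identical(rownames(d), ref), logical(1)))
}

same_rownames(dfl[1:2])   # TRUE
same_rownames(dfl)        # FALSE
```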
If I am working with dataframes in a loop, how can I use a variable data frame name (and additionally, variable column names) to access data frame contents?
dfnames <- c("df1","df2")
df1 <- df2 <- data.frame(X = sample(1:10),Y = sample(c("yes", "no"), 10, replace = TRUE))
for (i in seq_along(dfnames)){
curr.dfname <- dfnames[i]
#how can I do this:
curr.dfname$X <- 42:51
#...this
dfnames[i]$X <- 42:51
#or even this doubly variable call
for (j in seq_along(colnames(curr.dfname))){
curr.dfname$[colnames(temp[j])] <- 42:51
}
}
You can use get() to look up a variable by its name given as a string. Note that it returns the object's value (effectively a copy), not a reference to it:
> x <- 1:10
> get("x")
[1] 1 2 3 4 5 6 7 8 9 10
So, yes, you could iterate through dfnames like:
dfnames <- c("df1","df2")
df1 <- df2 <- data.frame(X = sample(1:10), Y = sample(c("yes", "no"), 10, replace = TRUE))
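for (cur.dfname in dfnames)
{
cur.df <- get(cur.dfname) # a copy of the data frame with that name
# for a fixed column name (10 values, one per row)
cur.df$X <- 42:51
# iterating through column names as well
for (j in colnames(cur.df))
{
cur.df[, j] <- 42:51
}
# write the modified copy back under its original name
assign(cur.dfname, cur.df)
}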
I really think that this is going to be a painful approach, though. As the commenters say, if you can get the data frames into a list and then iterate through that, it will probably perform better and be more readable. get() itself isn't vectorised (mget() is its vectorised counterpart), so if you only have a character vector of data frame names, you can either call mget(dfnames) or iterate through the names to build the list:
# build data frame list
df.list <- list()
for (i in seq_along(dfnames))
{
df.list[[i]] <- get(dfnames[i])
}
# iterate through data frames (by index, so the change is stored in the list)
for (i in seq_along(df.list))
{
df.list[[i]]$X <- 42:51
}
Hope that helps!
2018 Update: I probably wouldn't do something like this anymore. Instead, I'd put the data frames in a list and then use purrr::map(), or the base equivalent, lapply():
library(tidyverse)
stuff_to_do = function(mydata) {
mydata$somecol = 42:51 # one value per row (the data frames have 10 rows)
# … anything else I want to do to the current data frame
mydata # return it
}
df_list = list(df1, df2)
map(df_list, stuff_to_do)
This returns a list of modified data frames. You can also use the map() variants map_dfr() and map_dfc() to automatically bind the list of processed data frames row-wise or column-wise, respectively. The former matches columns by name rather than by position, and it can also add an ID column using the .id argument and the names of the input list. So it comes with some nice added functionality over lapply()!
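For comparison, here is a rough base R sketch of the same workflow, including an ID column taken from the list names. It only approximates what map_dfr() with .id provides and is not meant to reproduce purrr's behaviour exactly. The list names first and second and the toy columns are made up for illustration.

```r
df1 <- data.frame(X = 1:3)
df2 <- data.frame(X = 4:6)
df_list <- list(first = df1, second = df2)   # names are illustrative

stuff_to_do <- function(mydata) {
  mydata$somecol <- mydata$X * 2
  mydata   # return the modified data frame
}

# lapply() returns a list of modified data frames, keeping the list names
processed <- lapply(df_list, stuff_to_do)

# Add an ID column from the list names, then bind the pieces row-wise
with_id <- Map(function(d, id) cbind(id = id, d), processed, names(processed))
combined <- do.call(rbind, with_id)
rownames(combined) <- NULL
```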