Convert a list into multiple data frames by list element - r

I have imported an Excel file with multiple worksheets. It comes in as a list.
names(mysheets)
#[1] "test_sheet1" "test_sheet2"
test_sheet1 and test_sheet2 each contain a different matrix.
I need to put each worksheet into an individual data frame.
If I do it manually, the code looks like this:
s_1 <- data.frame(mysheets[1])
s_2 <- data.frame(mysheets[2])
I tried to write a function to do it, because I have many Excel files and each file has multiple worksheets.
My function:
p_fun <- function(y) {
  for (s_i in 1:2) {
    for (i in 1:2) {
      s_i <- data.frame(y[i])
      return(s_i)
    }
  }
}
It didn’t work correctly.
I'd appreciate it if anyone can help.

You could use mget to get the objects and then convert them to data frames:
list_df <- lapply(mget(names(mysheets)), data.frame)
If you want them as separate data frames, we can do
names(list_df) <- paste0('s_', seq_along(list_df))
list2env(list_df, .GlobalEnv)

We can use assign if we are doing this in a for loop
for(i in seq_along(mysheets)) assign(paste0("s", i), data.frame(mysheets[i]))
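If the worksheets still have to be read from the Excel file first (the question doesn't say which package produced mysheets), a minimal sketch using the readxl package could look like this; the file name my_file.xlsx is a placeholder:
library(readxl)
path <- "my_file.xlsx"                 # placeholder path to one workbook
sheet_names <- excel_sheets(path)      # all worksheet names in the file
mysheets <- lapply(sheet_names, function(s) read_excel(path, sheet = s))
# name the list elements s_1, s_2, ... and push them into the global environment
names(mysheets) <- paste0("s_", seq_along(mysheets))
list2env(lapply(mysheets, data.frame), .GlobalEnv)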

Related

R Export several dataframes to different Excel files

I have several dataframes (mydf1, mydf2, mydf3, etc.). How can I export each dataframe to a separate Excel file so that the name of the file is the name of the dataframe (e.g. mydf1.xlsx)?
I've tried to put them in a list and do a loop as below. It nearly gives me what I want, but I don't know how to make R name the Excel files properly instead of 1.xlsx, 2.xlsx, etc. Any ideas?
install.packages("writexl")
library(writexl)
list_of_dfs <- lapply(ls(pattern="mydf"), function(x) get(x))
for (i in c(1:length(list_of_dfs))) {
  write_xlsx(list_of_dfs[i], paste(i, ".xlsx"))
}
Try the following:
Use mget to get all the df's in one go; there is no need for lapply.
The list of df's is a named list, and the names can be used to assemble the filenames.
The corrected code is then:
library(writexl)
list_of_dfs <- mget(ls(pattern = "mydf"))
for (i in seq_along(list_of_dfs)) {
  filename <- paste0(names(list_of_dfs)[i], ".xlsx")
  write_xlsx(list_of_dfs[[i]], filename)
}
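As a variant (just a sketch using the same list_of_dfs), the loop can also be replaced with Map, which pairs each data frame with its name:
library(writexl)
list_of_dfs <- mget(ls(pattern = "mydf"))
# write each data frame to <its name>.xlsx; invisible() suppresses the returned list
invisible(Map(function(df, nm) write_xlsx(df, paste0(nm, ".xlsx")),
              list_of_dfs, names(list_of_dfs)))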

Apply function to all dataframes

I work with SAS files (sas7bdat = data frames) and SAS formats (sas7bcat).
My sas7bdat files are in a "data" folder, so I can get a list of them in the object files_names.
Here is the first part of my code, which works perfectly:
files_names <- list.files(here("data"))
nb_files <- length(files_names)
data_names <- vector("list", length = nb_files)
for (i in 1:nb_files) {
  data_names[i] <- strsplit(files_names[i], split = ".sas7bdat")
}
for (i in 1:nb_files) {
  assign(data_names[[i]],
         read_sas(paste(here("data", files_names[i])), "formats/formats.sas7bcat"))
}
But I get some issues when trying to apply the function as_factor from the haven package (in order to apply labels to my new data frames and get, for example, SEX = "Male" instead of SEX = 1).
I can make it work data frame by data frame with the code below:
df_labelled <- haven::as_factor(df, only_labelled = TRUE)
I would like to create a loop, but it didn't work because data_names[i] isn't a data frame and as_factor requires a data frame as its first argument.
I'm quite new to R; thank you very much if someone could help me.
You might want to think about using different data structures. For example, you can use a named list to save your data frames; then you can easily loop through them.
In fact, you could do everything in one loop. I'm sure there's a more efficient way to do this, but here's an example of one way without changing your code too much:
files_names <- list.files(here("data"))
raw_dfs <- list()
labelled_dfs <- list()
for (file_name in files_names) {
  # # strsplit returns a list; either extract the first element
  # # like this
  # df_name <- (strsplit(file_name, split = ".sas7bdat"))[[1]]
  # # or use something else like gsub
  df_name <- gsub(".sas7bdat", "", file_name)
  # use [[ ]] so each data frame is stored as a single list element
  raw_dfs[[df_name]] <- read_sas(paste(here("data", file_name)), "formats/formats.sas7bcat")
  labelled_dfs[[df_name]] <- haven::as_factor(raw_dfs[[df_name]], only_labelled = TRUE)
}
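A variant that skips the intermediate raw_dfs list and builds the labelled data frames in one lapply call (only a sketch, assuming the same directory layout and formats catalog as in the question):
library(here)
library(haven)
files_names <- list.files(here("data"), pattern = "\\.sas7bdat$")
labelled_dfs <- lapply(files_names, function(file_name) {
  df <- read_sas(here("data", file_name), "formats/formats.sas7bcat")
  haven::as_factor(df, only_labelled = TRUE)
})
# name each list element after its file, minus the extension
names(labelled_dfs) <- gsub("\\.sas7bdat$", "", files_names)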

For loop to read table R

I would like to loop through a vector of directory names, insert each directory name into the read.table call, and 'see' the tables outside the loop.
I have a vector of directory names:
dir_names <- c("SRR2537079","SRR2537080","SRR2537081","SRR2537082", "SRR2537083","SRR2537084")
I now want to loop through these and read the tables in each directory.
I have:
list.data <- list()
for (i in dir_names) {
  #print(i)
  list.data[[i]] <- read.table('dir_names[i]/circularRNA_known.txt', header=FALSE, sep="\t", stringsAsFactors=FALSE)
}
but this isn't recognizing dir_names[i]. Do I have to use paste somehow??
You are right, you need to paste the value. i will also be the list element itself, not a number, so you don't need to refer to it as dir_names[i], just i:
list.data <- list()
for (i in dir_names) {
  #print(i)
  list.data[[i]] <- read.table(paste0(i, '/circularRNA_known.txt'), header=FALSE, sep="\t", stringsAsFactors=FALSE)
}
Can I also suggest (just for your info, if you wanted a more elegant solution) that you could use plyr's llply instead of a loop? It means it can all happen within one line, and you could easily change the output to combine all files into a data.frame (using ldply) if they are in consistent formats:
list.data.2 <- llply(dir_names, function(x) read.table(paste0(x,"/circularRNA_known.txt"), header=FALSE, sep="\t",stringsAsFactors=FALSE))
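A base-R equivalent without plyr (a sketch that assumes the same directory layout) keeps everything in one named list as well:
list.data <- setNames(
  lapply(dir_names, function(d)
    read.table(file.path(d, "circularRNA_known.txt"),
               header = FALSE, sep = "\t", stringsAsFactors = FALSE)),
  dir_names  # name each element after its directory
)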
dir_names[i] should be used as a variable, not quoted inside the string:
list.data <- list()
for (i in (1:length(dir_names))) {
  #print(i)
  list.data[[i]] <- read.table(paste0(dir_names[i], '/circularRNA_known.txt'), header=FALSE, sep="\t", stringsAsFactors=FALSE)
}

How to read multiple .xlsx files and generate multiple data frames in R?

I want to read three different xlsx files and save them in three different data frames called excel1, excel2 and excel3. How can I do that? I think it should be something like this:
files = list.files(pattern='[.]xlsx') # There are three files.
for (i in 1:files){
  "excel" + i = read.xlsx(files[i])
}
I suggest you use a list instead of creating 3 variables in the current workspace:
dfList <- list()
for (i in seq_along(files)) {
  dfList[[paste0("excel", i)]] <- read.xlsx(files[i])
}
Then you can access to them in this way :
dfList$excel1
dfList$excel2
dfList$excel3
or :
dfList[[1]]
dfList[[2]]
dfList[[3]]
But, if you really really want to create new variables, you can use assign function :
for (i in seq_along(files)) {
  assign(paste0("excel", i), read.xlsx(files[i]))
}
# now excel1, excel2, excel3 variables exist...
You can also use plyr, and it's good practice to specify the environment in which you want to create the variables:
library(plyr)
l_ply(1:length(files), function(i) assign(paste0('excel',i),read.xlsx(files[i]), envir=globalenv()))
If someone tries to use this code, these parameters are really helpful:
library(xlsx)
files = list.files(pattern='[.]xlsx')
dfList <- list()
for (i in 1:length(files)) {
  dfList[[paste0("excel", i)]] <- read.xlsx(files[i], header = TRUE, stringsAsFactors = FALSE, sheetIndex = 1)
}
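The same named list can also be built without an explicit loop; this is only a sketch, passing the same read.xlsx arguments as above:
library(xlsx)
files <- list.files(pattern = '[.]xlsx')
dfList <- setNames(
  lapply(files, read.xlsx, sheetIndex = 1, header = TRUE, stringsAsFactors = FALSE),
  paste0("excel", seq_along(files))  # names become excel1, excel2, ...
)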

R function that iterates over a data.frame, opens/merges files, and returns another dataframe

I would like to know how to solve the following problem using higher-order functions like ddply, ldply, and dlply, and avoid using problematic for loops.
The problem:
I have a .csv file representing a dataset loaded into a data.frame, with each row containing the path to a directory where more information is stored in files. I want to use the directory information in the data.frame to open the files ("file1.txt", "file2.txt") in that directory, merge them, and then combine the merged files from each entry into one large data frame.
Something like this:
df =
entryName,dir
1,/home/guest/data/entry1
2,/home/guest/data/entry2
3,/home/guest/data/entry3
4,/home/guest/data/entry4
What I would like to do is apply a function to the data frame that takes the directory,
appends a couple of file names ("file1.txt", "file2.txt"), and then merges the two files together based on a given field.
For example, file1.txt could be:
entry,subEntry,value
1,A,2
1,B,3
1,C,4
1,D,5
1,E,3
1,F,3
For example, file2.txt could be:
entry,subEntry,value
1,A,8
1,B,7
1,C,8
1,D,9
1,E,8
1,F,7
The output would look something like this:
entryName,subEntry,valueFromFile1,valueFromFile2
1,A,2,8
1,B,3,7
1,C,4,8
1,D,5,9
1,E,3,8
1,F,3,7
2,A,4,8
2,B,5,9
2,C,6,7
2,D,3,7
2,E,6,8
2,F,5,9
Right now I am using a for loop, but for obvious reasons I would like to use a higher-order function. Here is what I have so far:
allCombined <- data.frame()
df <- read.csv(file = "allDataEntries.csv", header = TRUE)
numberOfEntries <- dim(df)[1]
for (i in 1:numberOfEntries) {
  dir <- df$dir[i]
  file1String <- paste(dir, "/file1.txt", sep = '')
  file2String <- paste(dir, "/file2.txt", sep = '')
  file1.df <- read.csv(file = file1String, header = TRUE)
  file2.df <- read.csv(file = file2String, header = TRUE)
  localMerged <- merge(file1.df, file2.df, by = "value")
  allCombined <- rbind(allCombined, localMerged)
}
#rest of my analysis...
Here is one way to do it. The idea is to create a list with contents of all the files, and then use Reduce to merge them sequentially using the common columns entry and subEntry.
# READ DIRECTORIES, FILES AND ENTRIES
dirs <- read.csv(file = "allDataEntries.csv", header = TRUE, as.is = TRUE)$dir
files <- as.vector(outer(dirs, c('file1.txt', 'file2.txt'), 'file.path'))
entries <- lapply(files, 'read.csv', header = TRUE)
# APPLY CUSTOM MERGE FUNCTION TO COMBINE ENTRIES
merge_by <- function(x, y){
  merge(x, y, by = c('entry', 'subEntry'))
}
Reduce('merge_by', entries)
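Note that Reduce() merges the list elements pairwise from left to right. If the goal is instead to merge file1.txt and file2.txt within each directory and then stack the per-directory results (as in the requested output), a per-directory sketch could look like this, assuming the same file layout as in the question:
dirs <- read.csv(file = "allDataEntries.csv", header = TRUE, as.is = TRUE)$dir
merged_per_dir <- lapply(dirs, function(d) {
  f1 <- read.csv(file.path(d, "file1.txt"), header = TRUE)
  f2 <- read.csv(file.path(d, "file2.txt"), header = TRUE)
  # suffixes turn the overlapping "value" columns into valueFromFile1 / valueFromFile2
  merge(f1, f2, by = c("entry", "subEntry"), suffixes = c("FromFile1", "FromFile2"))
})
allCombined <- do.call(rbind, merged_per_dir)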
I've not tested this, but it seems like it should work. The anonymous function takes a single row from df, reads in the two associated files, and merges them together by value. Using ddply will take these data frames and make a single one out of them by rbinding (since the requested output is a data frame). It does assume entryName is not repeated in df. If it is, you can add a unique row to group over instead.
ddply(df, .(entryName), function(DF) {
  dir <- DF$dir
  file1String <- paste(dir, "/file1.txt", sep = '')
  file2String <- paste(dir, "/file2.txt", sep = '')
  file1.df <- read.csv(file = file1String, header = TRUE)
  file2.df <- read.csv(file = file2String, header = TRUE)
  merge(file1.df, file2.df, by = "value")
})
