R: Loading data from folder with multiple files

I have a folder with multiple files to load.
Every file contains a list, and I want to combine all the loaded lists into a single list. I am using the following code (the variable loaded from each file is called TotalData):
Filenames <- paste0('DATA3_',as.character(1:18))
Data <- list()
for (ii in Filenames){
load(ii)
Data <- append(Data,TotalData)
}
Is there a more elegant way to write it? For example using apply functions?

You can use lapply. I assume that your files have been stored using save, because you use load to get them. I create two files to use in my example as follows:
TotalData<-list(1:10)
save(TotalData,file="DATA3_1")
TotalData<-list(11:20)
save(TotalData,file="DATA3_2")
And then I read them in by
Filenames <- paste0('DATA3_',as.character(1:2))
Data <- lapply(Filenames,function(fn) {
load(fn)
return (TotalData)
})
After this, Data will be a list that contains the lists from the files as its elements. Since you are using append in your example, I assume this is not what you want. I remove one level of nesting with
Data <- unlist(Data,recursive=FALSE)
For my two example files, this gave the same result as your code. Whether it is more elegant can be debated, but I would claim that it is more R-ish than the for-loop.
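A possible variation, sketched here under the assumption that every file was written with save() and holds exactly one object named TotalData: load each file into its own temporary environment so that nothing is left behind in the workspace, and join the per-file lists with do.call(c, ...). The helper name read_total_data and the use of a temporary folder are invented for this illustration.

```r
# Create two example files in a temporary folder (illustration only).
folder <- tempdir()
TotalData <- list(1:10)
save(TotalData, file = file.path(folder, "DATA3_1"))
TotalData <- list(11:20)
save(TotalData, file = file.path(folder, "DATA3_2"))
rm(TotalData)

# Hypothetical helper: load one file into a private environment and
# return the object called TotalData from it.
read_total_data <- function(fn) {
  env <- new.env()
  load(fn, envir = env)
  get("TotalData", envir = env)
}

Filenames <- file.path(folder, paste0("DATA3_", 1:2))
# lapply() gives a list of lists; do.call(c, ...) joins them into one flat list.
Data <- do.call(c, lapply(Filenames, read_total_data))
```

Because the helper never loads into the global environment, a stale TotalData from a previous file cannot leak into the result.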

Related

Write excels to path with variable in R [duplicate]

I want to manipulate different .csv files through a loop and a list. That works fine, but for the output I have to create many .xlsx files, and each file has to be named according to the value of a certain variable.
I've already tried piping the write_xlsx function with ifelse condition like:
for (i in 1:length(files)) {
files[[i]] %>%
write_xlsx(files[[i]], paste(ifelse(x="test1", "/Reportings/test1.xlsx",
ifelse(x="test2", "/Reportings/test2.xlsx", "test3")
}
I expect that multiple .xlsx files will be created in the folder Reportings.
Not easy to answer precisely with the information you gave, but here is a minimal example that seems to do what you want :
This assumes that your list is made up of data frames, that x is a variable in each of them, and that x has the same value in every row of a given data frame.
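library(openxlsx)  # assumption: write.xlsx() is taken from the openxlsx package here
df <- data.frame(x = rep("test1", 3), y = rep("test1", 3))
df2 <- data.frame(x = rep("test2", 3), y = rep("test2", 3))
files <- list(df, df2)
files[[1]]$x[1]  # the value used to name the first file
for (i in seq_along(files)) {
  # the Reportings folder must already exist
  write.xlsx(files[[i]], paste0("Reportings/", files[[i]]$x[1], ".xlsx"))
}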
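The core idea, building each output path from a value stored in the data, can be tried without any extra packages. The following is only a sketch: write.csv() stands in for the xlsx writer so that the snippet runs with base R alone, the Reportings folder is created inside a temporary directory, and out_dir and out_path are names invented here.

```r
df  <- data.frame(x = rep("test1", 3), y = 1:3)
df2 <- data.frame(x = rep("test2", 3), y = 4:6)
files <- list(df, df2)

# Create the output folder if it is missing (temporary location for the demo).
out_dir <- file.path(tempdir(), "Reportings")
dir.create(out_dir, showWarnings = FALSE)

for (d in files) {
  # The first value of x decides the file name.
  out_path <- file.path(out_dir, paste0(d$x[1], ".csv"))
  write.csv(d, out_path, row.names = FALSE)  # stand-in for the xlsx writer
}

written <- list.files(out_dir)
```

To produce real .xlsx files, the write.csv() call and the ".csv" extension would be replaced by the xlsx writer of your choice and ".xlsx".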

parameter not passed to the function when using walk function in PURRR package

I am using the purrr:walk to read multiple excel files and it failed. I have 3 questions:
(1) I used the function list.files to read the list of Excel files in one folder, but the returned values also included the subfolders. I tried setting the parameters recursive= and include.dirs=, but it didn't work.
setwd(file_path)
files<-as_tibble(list.files(file_path,recursive=F,include.dirs=F)) %>%
filter(str_detect(value,".xlsx"))
files
(2) When I used the following piece of code, it can run without any error or warning message, but there is no returned data.
###read the excel data
file_read <- function(value1) {
print(value1)
file1<-read_excel(value1,sheet=1)
}
walk(files$value,file_read)
When I used the following, it worked. Not sure why.
test<-read_excel(files$value,sheet=1)
(3) In Q2, actually I want to create file1 to file6, suppose there are 6 excel files. How can I dynamically assign the dataset name?
list.files has a pattern argument where you can specify what kind of files you are looking for. This lets you avoid the filter(str_detect(value,".xlsx")) step. Also, list.files only lists the entries directly inside the main directory (file_path), not the contents of its subdirectories, unless you specify recursive = TRUE. In a non-recursive listing the subdirectory names themselves are always included, so it is the pattern that filters them out.
library(readxl)
setwd(file_path)
files <- list.files(pattern = '\\.xlsx')
In the function you need to return the object.
file_read <- function(value1) {
data <- read_excel(value1,sheet=1)
return(data)
}
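Now use map (or lapply) instead of walk to read the files. walk is meant for side effects and returns its input invisibly, so whatever file_read returns is discarded.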
result <- purrr::map(files,file_read)
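For the third question (file1 to file6), a common alternative to creating separate variables is to name the elements of the result list. The sketch below is an illustration under stated assumptions: small CSV files and read.csv() stand in for the Excel files and read_excel() so that it runs on its own, and the folder and file names are invented.

```r
# Two stand-in files in a temporary folder (illustration only).
folder <- file.path(tempdir(), "excel_demo")
dir.create(folder, showWarnings = FALSE)
write.csv(data.frame(a = 1:2), file.path(folder, "first.csv"), row.names = FALSE)
write.csv(data.frame(a = 3:4), file.path(folder, "second.csv"), row.names = FALSE)

files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
result <- lapply(files, read.csv)

# Name the elements file1, file2, ... in the order the files were read.
names(result) <- paste0("file", seq_along(result))
```

Each data set is then available as result$file1, result$file2 and so on, and the whole collection can still be processed with lapply or map.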

Reading multiple offline html files to a list in R

I have rawdata as 20 offline html files stored in following format
../rawdata/1999_table.html
../rawdata/2000_table.html
../rawdata/2001_table.html
../rawdata/2002_table.html
.
.
../rawdata/2017_table.html
These files contain tables that I am extracting and reshaping to a particular format.
I want to read these files at once to a list and process them one by one through a function that I have written.
What I tried:
I put the names of these files into an Excel file called filestoread.xlsx and used a for loop to load these files using the names mentioned in the sheet. But it doesn't seem to work
filestoread <- fread("../rawdata/filestoread.csv")
x <- list()
for (i in nrow(filestoread)) {
x[[i]] <- read_html(paste0("../rawdata/", filestoread[i]))
}
How can this be done?
Also, after reading the HTML files I want to extract the tables from them and reshape them using a function I wrote after converting it to a data table.
My final objective is to rbind all the tables and have a single data table with year wise entries of the tables in the html file.
First, store the paths to your files in one of the following ways.
Either hardcoded:
filestoread <- paste0("../rawdata/", 1999:2017, "_table.html")
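or by listing all HTML files in the directory (full.names = TRUE keeps the directory prefix, so the paths can be read directly):
filestoread <- list.files(path = "../rawdata/", pattern="\\.html$", full.names = TRUE)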
Then use lapply()
library(rvest)
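pages <- lapply(filestoread, function(x) try(read_html(x)))
Note: wrapping the call in try() lets lapply() continue when a file is missing. The error is reported, and an object of class "try-error" is stored in place of that file's result.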
The second part of your question is a little broad, depends on the content of your files, and there are already some answers, you could consider e.g. this answer. In principle you use a combination of ?html_nodes and ?html_table.
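As a rough sketch of that combination, assuming each file holds one table of interest as its first table element and that the year can be taken from the file name: the helper read_year_table, the year column and the tiny stand-in files are all invented for this illustration, and the reshaping step would be replaced by your own function.

```r
library(rvest)

# Two tiny stand-in files in a temporary folder (illustration only).
rawdir <- file.path(tempdir(), "rawdata")
dir.create(rawdir, showWarnings = FALSE)
for (yr in 1999:2000) {
  html <- sprintf(
    "<html><body><table><tr><th>item</th><th>value</th></tr><tr><td>a</td><td>%d</td></tr></table></body></html>",
    yr
  )
  writeLines(html, file.path(rawdir, paste0(yr, "_table.html")))
}

filestoread <- list.files(rawdir, pattern = "_table\\.html$", full.names = TRUE)

# Hypothetical helper: read one file, take its first table, add the year.
read_year_table <- function(path) {
  page <- read_html(path)
  tbl <- as.data.frame(html_table(html_nodes(page, "table")[[1]]))
  tbl$year <- as.integer(sub("_table\\.html$", "", basename(path)))
  tbl
}

# One data frame with the rows of all files stacked on top of each other.
all_tables <- do.call(rbind, lapply(filestoread, read_year_table))
```

The combined data frame can be converted with data.table::as.data.table() afterwards if a data table is needed.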

How to create an object by adding a variable to a fixed value?

I am trying to write a program to open a large number of files and run them through a function I made called "sort". Every one of my file names starts with "sa1", but the characters after that vary from file to file. I was hoping to do something along the lines of this:
for(x in c("Put","Characters","which","Vary","by","File","here")){
sa1+x <- read.csv("filepath/sa1+x",header= FALSE)
sa1+x=sort(sa1+x)
return(sa1+x)
}
In this case, say that x was 88. It would open the file sa188, name that dataframe sa188, and then run it through the function sort. I don't think that writing sa1+x is the correct way to bind two values together, but I don't know how to do it.
You need to use a list to contain the data in each csv file, and loop over the filenames using paste0.
file_suffixes <- c("put","characters","which","vary","by","file","here")
numfiles <- length(file_suffixes)
list_data <- list()
sorted_data <- list()
filename <- "filepath/sa1"
for (x in 1:numfiles) {
list_data[[x]] <- read.csv(paste0(filename, file_suffixes[x]), header=FALSE)
sorted_data[[x]] <- sort(list_data[[x]])
}
I am not sure why you use return in that loop. If you're writing a function, you should be returning the sorted_data list which contains all your post-sorting data.
Note: you shouldn't call your function sort because there is already a base R function called sort.
Additional note: you can use dir() and regex parsing to find all the files which start with "sa1" and loop over all of them, thus freeing you from having to specify the file_suffixes.
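A minimal sketch of that dir() approach, under a few assumptions: the files live in a temporary stand-in for "filepath", the example file names are invented, and sort_rows is a hypothetical placeholder for your own sorting function (here it orders the rows by the first column).

```r
# Stand-in folder with two matching files and one that should be ignored.
folder <- file.path(tempdir(), "filepath")
dir.create(folder, showWarnings = FALSE)
write.table(data.frame(3:1), file.path(folder, "sa188"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(6:4), file.path(folder, "sa1abc"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(9:7), file.path(folder, "other"),
            sep = ",", row.names = FALSE, col.names = FALSE)

# "^sa1" matches only file names that start with sa1.
paths <- dir(folder, pattern = "^sa1", full.names = TRUE)

list_data <- lapply(paths, read.csv, header = FALSE)
names(list_data) <- basename(paths)  # e.g. list_data[["sa188"]]

# Hypothetical placeholder for the user's own function.
sort_rows <- function(d) d[order(d[[1]]), , drop = FALSE]
sorted_data <- lapply(list_data, sort_rows)
```

Naming the list elements after the files gives the same convenience as separate variables such as sa188, without having to create them dynamically.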

Assigning unknown variable to new variable name

I have to load in many files and transform their data. Each file contains only one data.table, but the tables have various names.
I would like to run a single script over all of the files. To do so, I must assign the unknown data.table to a common name, say blob.
What is the R way of doing this? At present, my best guess (which seems like a hack, but works) is to load the data.table into a new environment, and then: assign('blob', get(objects(envir=newEnv)[1], env=newEnv)).
In a reproducible context this is:
newEnv <- new.env()
assign('a', 1:10, envir = newEnv)
assign('blob', get(objects(envir=newEnv)[1], env=newEnv))
Is there a better way?
The R way is to create a single object, i.e. a single list of data tables.
Here is some pseudocode that contains three steps:
Use list.files() to create a list of all files in a folder.
Use lapply() and read.csv() to read your files and create a list of data frames. Replace read.csv() with read.table() or whatever is appropriate for your data.
Use lapply() again, this time with as.data.table() to convert the data frames to data tables.
The pseudocode:
library(data.table)
filenames <- list.files("path/to/files", full.names = TRUE)  # full paths, so read.csv can find the files
dat <- lapply(filenames, read.csv)
dat <- lapply(dat, as.data.table)
Your result should be a single list, called dat, containing a data table for each of your original files.
I assume that you saved the data.tables using save() somewhat like this:
d1 <- data.table(value=1:10)
save(d1, file="data1.rdata")
and your problem is that when you load the file you don't know the name (here: d1) that you used when saving the file. Correct?
I suggest you use instead saveRDS() and readRDS() for saving/loading single objects:
d1 <- data.table(value=1:10)
saveRDS(d1, file="data1.rds")
blob <- readRDS("data1.rds")
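If the files already exist in save() format and cannot be rewritten, one option is to rely on the fact that load() returns the names of the objects it created. The sketch below assumes each file holds exactly one object; load_single_object is a name invented here, and a plain data frame stands in for the data.table so that the snippet needs no extra packages.

```r
# Example file with one object whose name we pretend not to know.
d1 <- data.frame(value = 1:10)
path <- file.path(tempdir(), "data1.rdata")
save(d1, file = path)
rm(d1)

# Hypothetical helper: load into a private environment and return the
# single object found there, whatever it was called when saved.
load_single_object <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)  # load() returns the object names
  stopifnot(length(obj_names) == 1)
  get(obj_names, envir = env)
}

blob <- load_single_object(path)
```

Applied with lapply() over a vector of file paths, the same helper gives a list with one table per file.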
