I'd like to read a SELECTED subset of files with sparklyr. I have several csv files (e.g. a1.csv, a2.csv, a3.csv, a4.csv, a5.csv) in a folder, and I'd like to read only a2.csv, a3.csv and a4.csv at once, if possible.
I know I can read a single csv file with spark_read_csv(sc, "cash", "/dir1/folder1/a2"), so I tried
a_all <- data.frame(col1=integer(),col2=integer())
a_all <- sdf_copy_to(sc, a_all, "a_all")
for(i in 2:4){
tmp1 <- spark_read_csv(sc=sc, name="tmp1", paste0("/dir1/folder1/a",i))
a_all <- sdf_bind_rows(a_all, tmp1)
}
As a result I get a spark_tbl that binds a2.csv, a3.csv and a4.csv together, like rbind(a2,a3,a4).
I think there is an easier way to do this (maybe without a for loop) by using path=, but I am not sure how to select only a few csv files in a folder. Please help!
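One way to avoid the manual loop, sketched under a few assumptions: the files carry a .csv extension (the paths in the question omit it), and read_selected is a hypothetical helper name, not part of sparklyr. It builds the three paths with a vectorised paste0() and binds the resulting tables in one call. Spark readers generally also accept Hadoop-style glob patterns in path (for example "/dir1/folder1/a[2-4].csv"), but whether that works on a given cluster is worth checking rather than assuming.

```r
# Build the paths of the selected files in one vectorised step.
# Assumption: the files are named a2.csv, a3.csv, a4.csv under /dir1/folder1.
ids <- 2:4
paths <- paste0("/dir1/folder1/a", ids, ".csv")

# Hypothetical helper: read each path into its own Spark table, then bind rows.
# `sc` is an existing connection created with sparklyr::spark_connect().
read_selected <- function(sc, paths) {
  tbls <- lapply(seq_along(paths), function(k) {
    sparklyr::spark_read_csv(sc, name = paste0("tmp", k), path = paths[k])
  })
  do.call(sparklyr::sdf_bind_rows, tbls)
}

# Usage (requires a live Spark connection):
# a_all <- read_selected(sc, paths)
```

Changing ids (for example to c(2, 4)) is then enough to pick a different subset.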
I am trying to read multiple Excel files in RStudio so I can append them into one large dataset. For some reason none of the Excel files are loaded. I have the correct working directory, but the resulting R object is always empty.
The files are named sbir_award_1, sbir_award_2, etc. Any help would be greatly appreciated.
library(tidyverse)
library(readxl)
library(writexl)
sbir <- list.files(pattern = "data/raw/sbir/sbir_award_.xlsx")
data <- list()
df.list <- lapply(file.list,read.xlsx)
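A likely cause, offered as a sketch rather than a confirmed diagnosis: pattern is matched against file names only (not paths) and is a regular expression, file.list is never defined, and readxl exports read_excel() rather than read.xlsx(). The following assumes the files are named sbir_award_1.xlsx, sbir_award_2.xlsx, and so on inside data/raw/sbir, and that they share the same columns.

```r
# Regular expression for the file names (assumed naming scheme)
sbir_pattern <- "^sbir_award_.*\\.xlsx$"

# The folder goes in `path`; `pattern` only sees the file name itself
sbir_files <- list.files(path = "data/raw/sbir", pattern = sbir_pattern,
                         full.names = TRUE)

# Read every workbook (first sheet by default) into a list of data frames
df.list <- lapply(sbir_files, function(f) readxl::read_excel(f))

# Append them into one large dataset (assumes identical columns)
sbir_all <- do.call(rbind, df.list)
```

If sbir_files comes back empty, checking getwd() and the folder path is the first thing to try.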
I have many txt files that I want to import into R. These files are imported one by one, I do the operations that I want, and then I import the next file.
All these files are located in a database system where all the folders have almost the same names, e.g.
database\type4\system50
database\type6\system50
database\type4\system30
database\type4\system50
Similarly, the names of the files are also almost the same, referring to the folder where they are located, e.g.
type4.system50.txt
type6.system50.txt
type4.system30.txt
type4.system50.txt
I have heard that there should be an easier way of importing these many files one by one than simply using multiple setwd and read.csv2 commands. As far as I understand, this is possible with the macro import function in SAS, where you specify an overall path and then, each time you want to import a file, you specify what is specific about this file name/folder name.
Is there a similar function in R? I tried to look at "Importing Data in R like SAS macro", but this question did not really show me how to specify the folder name/file name.
Thank you for your help.
If you want to specify the folder name / file name, try this:
databasepath <- "path/to/database"
## list all files under the database folder (files only, with full paths)
tmp <- list.files(databasepath, recursive = TRUE, full.names = TRUE)
## filter the files you want to read: keep paths matching both patterns
readmyfile <- function(foldername, filename){
  tmp[grepl(foldername, tmp) & grepl(filename, tmp)]
}
files_to_read <- readmyfile("type4", "system50")
some_files <- lapply(files_to_read, read.csv2)
## Or you can read all of them (if memory is large enough to hold them)
all_files <- lapply(tmp, read.csv2)
Instead of calling setwd repeatedly, you could build the path for each file, save all of the paths to a vector, loop through that vector and load the files into a list:
library(data.table)
file_dir <- "path/to/files/"
## pattern is a regular expression, so match the .txt extension explicitly
file_vec <- list.files(path = file_dir, pattern = "\\.txt$")
file_list <- list()
## loop over the file names (not over the still-empty list)
for (n in seq_along(file_vec)){
  file_list[[n]] <- fread(input = paste0(file_dir, file_vec[n]))
}
I have multiple files whose names are not in a consistent format. For example, one file is named "TEST_1.XLSX", another "test_2.xlsx" and, worse still, another "tEsT_3.XlsX".
When I tried to read a file using:
df <- xlsx::read.xlsx(file.choose(), sheetIndex = 1)
it reads the files whose names and extensions are in lower case but fails to read all the others.
Is there a way to avoid such issues and read all the files regardless of the case of their names/extensions?
Import all files in your folder and store the file names (with full paths) in a vector:
file_names <- list.files(path = "path/where/files/are", full.names = TRUE)
Then import each file and store it in a list:
df_list <- list()
for(i in seq_along(file_names)){
  df_list[[i]] <- xlsx::read.xlsx(file_names[i], sheetIndex = 1)
}
To avoid further issues like that, you can apply tolower() to the file names when saving the files again.
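If only a subset of the folder should be read, list.files() can do the case-insensitive matching itself through its ignore.case argument. A minimal sketch, using a temporary folder with empty placeholder files so it runs anywhere; the folder name and the read_all helper are illustrative assumptions.

```r
# Create a throwaway folder with mixed-case file names (placeholders only)
demo_dir <- file.path(tempdir(), "case_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir,
                      c("TEST_1.XLSX", "test_2.xlsx", "tEsT_3.XlsX", "notes.txt")))

# Match "test_<number>.xlsx" regardless of upper or lower case
xlsx_files <- list.files(demo_dir, pattern = "^test_[0-9]+\\.xlsx$",
                         ignore.case = TRUE, full.names = TRUE)

# Hypothetical helper: read the first sheet of each matched workbook
read_all <- function(files) {
  lapply(files, function(f) xlsx::read.xlsx(f, sheetIndex = 1))
}
# df_list <- read_all(xlsx_files)  # needs real workbooks and the xlsx package
```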
UPDATE
Thanks for the suggestions. This is how far I have got, but I still can't work out how to get the loop variable to work within the file path name.
setwd("//tsclient/C/Users/xxx")
folders <- list.files("TEST")
--> This gives me a list of my folder names
for(f in folders){
setwd("//tsclient/C/xxx/[f]")
files[f] <- list.files("//tsclient/C/Users/xxx/TEST/[f]", pattern="*.TXT")
mergedfile[f] <- do.call(rbind, lapply(files[f], read.table))
write.table(mergedfile[f], "//tsclient/C/Users/xxx/[f].txt", sep="\t")
}
I have around 100 folders, each containing multiple txt files. I want to create 1 merged file per folder and save that elsewhere. However, I do not want to manually adapt the folder name in my code for each folder.
I created the following code to load in all files from a single folder (which works) and merge these files.
setwd("//tsclient/C/xxx")
files <- list.files("//tsclient/C/Users/foldername", pattern="*.TXT")
file.list <- lapply(files, read.table)
setattr(file.list, "names", files)
masterfilesales <- rbindlist(file.list, idcol="id")[, id := substr(id,1,4)]
write.table(masterfilesales, "//tsclient/C/Users/xxx/datasets/foldername.txt", sep="\t")
If I wanted to do this manually, I would have to adapt "foldername" every time. The folder names are numeric: 100 numbers between 2500 and 5000 (always 4 digits).
I looked into repeat loops, but I could not get them to work within a file path.
If anyone could point me in the right direction, I would be very grateful.
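One possible approach, sketched with base R only: list the folders once, then build each path with file.path() instead of switching directories with setwd. merge_folder is a hypothetical helper; the root and output paths are taken from the question, and the id column mirrors the substr(id,1,4) step used there.

```r
# Hypothetical helper: merge all .TXT files of one folder into one file
merge_folder <- function(folder, out_dir) {
  files <- list.files(folder, pattern = "\\.txt$", ignore.case = TRUE,
                      full.names = TRUE)
  if (length(files) == 0) return(invisible(NULL))  # nothing to merge
  # Read each file and tag its rows with the first 4 characters of its name
  parts <- Map(function(f) data.frame(id = substr(basename(f), 1, 4),
                                      read.table(f)),
               files)
  merged <- do.call(rbind, unname(parts))
  # Name the output after the folder, e.g. 2500.txt
  out <- file.path(out_dir, paste0(basename(folder), ".txt"))
  write.table(merged, out, sep = "\t")
  out
}

root <- "//tsclient/C/Users/xxx/TEST"             # parent of the ~100 folders
folders <- list.dirs(root, recursive = FALSE)     # full path of each folder
for (f in folders) {
  merge_folder(f, "//tsclient/C/Users/xxx/datasets")
}
```

Because the folder path is an ordinary character string, no placeholder syntax such as [f] is needed inside the path.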
I currently have a folder containing only Excel (.xlsx) files, and using R I would like to automatically convert all of these files to CSV files using the "openxlsx" package (or some variation). I currently have the following code to convert one of the files and place it in the same folder:
convert("team_order\\team_1.xlsx", "team_order\\team_1.csv")
I would like to automate the process so that it converts all the files in the folder and also removes the current xlsx files, so only the csv files remain. Thanks!
You can try this using rio, since it seems like that's what you're already using:
library("rio")
xls <- dir(pattern = "xlsx")
created <- mapply(convert, xls, gsub("xlsx", "csv", xls))
unlink(xls) # delete xlsx files
library(readxl)
# Create a vector of Excel files to read
files.to.read = list.files(pattern="xlsx")
# Read each file and write it to csv
lapply(files.to.read, function(f) {
df = read_excel(f, sheet=1)
write.csv(df, gsub("xlsx", "csv", f), row.names=FALSE)
})
You can remove the files with the command below. However, this is dangerous to run automatically right after the previous code. If the previous code fails for some reason, the code below will still delete your Excel files.
lapply(files.to.read, file.remove)
You could wrap it in a try/catch block to be safe.
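In R the relevant construct is tryCatch(). A minimal sketch of that idea, assuming the same working-directory layout as above: each Excel file is deleted only if its conversion finished without an error and the csv file exists. convert_then_remove and its reader argument are hypothetical names introduced for illustration.

```r
# Hypothetical helper: convert one file, delete the original only on success
convert_then_remove <- function(f, reader = readxl::read_excel) {
  csv <- sub("\\.xlsx$", ".csv", f, ignore.case = TRUE)
  ok <- tryCatch({
    df <- reader(f)                        # may fail on a corrupt workbook
    write.csv(df, csv, row.names = FALSE)
    file.exists(csv)                       # TRUE only if the csv was written
  }, error = function(e) {
    message("Skipping ", f, ": ", conditionMessage(e))
    FALSE
  })
  if (ok) file.remove(f)                   # the xlsx survives any failure
  ok
}

files.to.read <- list.files(pattern = "\\.xlsx$", ignore.case = TRUE)
results <- vapply(files.to.read, convert_then_remove, logical(1))
```

The named logical vector results then shows which files were converted and which were left untouched.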