I have many txt files that I want to import into R. I import these files one by one, do the operations that I want on each, and then import the next file.
All these files are located in a database system where all the folders have almost the same names, e.g.
database\type4\system50
database\type6\system50
database\type4\system30
database\type4\system50
Similarly, the names of the files are also almost the same, referring to the folder in which they are stored, e.g.
type4.system50.txt
type6.system50.txt
type4.system30.txt
type4.system50.txt
I have heard that there should be an easier way of importing these many files one by one than simply repeating setwd and read.csv2 commands. As far as I understand, this is possible with the macro import function in SAS, where you specify an overall path and then, each time you want to import a file, you specify what is specific about that file name/folder name.
Is there a similar function in R? I tried to look at Importing Data in R like SAS macro, but this question did not really show me how to specify the folder name/file name.
Thank you for your help.
If you want to specify folder name / file name, try this
databasepath <- "path/to/database"
## list all files under the database folder
tmp <- list.files(databasepath, recursive = TRUE, full.names = TRUE)
## filter the files you want to read (the full paths include the folder names, so we can filter on them)
readmyfile <- function(foldername, filename){
  tmp[grepl(foldername, tmp) & grepl(filename, tmp)]
}
files_to_read <- readmyfile("type4", "system50")
some_files <- lapply(files_to_read, read.csv2)
## Or you can read all of them (if memory is large enough to hold them)
all_files <- lapply(tmp,read.csv2)
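If it helps, you can also name the list elements after the files they came from, so you can pull a single data frame out by name later (a small optional addition, using one of the file names from your question as the example):
names(all_files) <- basename(tmp)
all_files[["type4.system50.txt"]]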
Instead of using setwd repeatedly, you could build the absolute path for each file, save all of the paths in a vector, loop through that vector of paths, and load the files into a list:
library(data.table)
file_dir <- "path/to/files/"
file_vec <- list.files(path = file_dir, pattern = "\\.txt$")
file_list <- list()
for (n in seq_along(file_vec)){
  file_list[[n]] <- fread(input = paste0(file_dir, file_vec[n]))
}
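If you eventually want one big table instead of a list (this assumes the files all share the same columns, which may not be the case for you), data.table can stack the list in one call and keep track of where each row came from:
names(file_list) <- file_vec  # label each element with its file name
all_data <- rbindlist(file_list, idcol = "source_file")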
I have several PDF files in my directory. I have downloaded them previously, no big deal so far.
I want to read all those files in R. My idea was to use the "pdf_text" function from the "pdftools" package and write a formula like this:
mypdftext <- pdf_text(files)
Where "files" is an object that gathers all the PDF file names, so that I don't have to write manually all the names. Because I have actually downlaoded a lot of files, it would avoid me to write:
mypdftext <- pdf_text("file1.pdf", "file2.pdf", and many more files...)
To create the object "files", I used files <- list.files(pattern = "pdf$").
The “files” vector contains all the PDF file names.
But "files" does not work with pdf_text function, probably because it's a vector. What can I do instead?
Maybe this is not the best solution, but this works for me:
library(pdftools)
# Set your path here.
your_path = 'C:/Users/.../pdf_folder'
setwd(your_path)
getwd()
lf = list.files(path = getwd(), pattern = "\\.pdf$", all.files = FALSE,
                full.names = FALSE)
# Create a list to fill while iterating.
my_pdfs = list()
# Iterate: assign the text of each PDF to one element of the list.
for (i in seq_along(lf)){ my_pdfs[[i]] <- pdf_text(lf[i]) }
# Calling the first pdf of the list.
my_pdfs[[1]]
Then you can assign each of the PDFs to its own object if you want; for now, the text of each file is stored in its own element of the list. Does this solve your problem?
You could try using lapply over the vector that contains the location of every pdf file (files). I would recommend using list.files(..., full.names = T) to get the complete location of each pdf file. This should work.
mypdfs<-lapply(files, pdf_text)
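For completeness, a minimal sketch of that approach (the folder path is a placeholder for wherever your PDFs live):
library(pdftools)
files <- list.files("path/to/pdf_folder", pattern = "\\.pdf$", full.names = TRUE)
mypdfs <- lapply(files, pdf_text)
names(mypdfs) <- basename(files)  # optional: keep track of which text came from which file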
I have multiple files whose names are not in a consistent format. For example, one file might have a name like "TEST_1.XLSX", another has a name like "test_2.xlsx" and, worse still, yet another file is named "tEsT_3.XlsX".
When I tried to read file using:
df <- xlsx::read.xlsx(file.choose(), sheetIndex = 1)
it reads the files whose names and extensions are in lower case but fails to read all the others.
Is there a way to avoid such issues and read all the files regardless of the case of their names/extensions?
First, list all the files in your folder and store the file names in a vector:
file_names <- list.files(path = "path/where/files/are", full.names = TRUE)
Then import each file and store it in a list:
df_list <- list()
for(i in seq_along(file_names)){
  df_list[[i]] <- xlsx::read.xlsx(file_names[i], sheetIndex = 1)
}
To avoid further issues like that, you can use tolower() on the file names when saving the files again.
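As a side note, list.files() can already match the extension case-insensitively, which sidesteps the upper/lower case problem when building the vector of names (a sketch, assuming all the files sit in one folder):
file_names <- list.files(path = "path/where/files/are", pattern = "\\.xlsx$",
                         ignore.case = TRUE, full.names = TRUE)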
I'm writing a loop script which involves reading a file from a workbook (using the package XLConnect). The challenge is that the file names contain characters (representing time) that I want to ignore.
For example, here are 3 paths to those files:
G://User//Documents//daily_data//Op_Schedule_20160520_132025.xlsx
G://User//Documents//daily_data//Op_Schedule_20160521_142805.xlsx
G://User//Documents//daily_data//Op_Schedule_20160522_103052.xlsx
I need to import hundreds of those files. I can easily account for the character string representing the date (e.g. 20160522), but not the time.
Is there a way to tell R to ignore some characters located in the file path? Here is how I was thinking of writing my script (the "???" is where I need help). I know a loop is probably not the most efficient way, but I'm open to suggestions, should you have any:
require(XLConnect)
path= "G://User//Documents//daily_data//Op_Schedule_"
wd.seq = format(seq(as.Date("2014-01-01"),as.Date("2016-12-31"),"days"),format="%Y%m%d")
scheduleList = rep(list(matrix(1,1,1)),length(wd.seq))
for(i in 1:length(wd.seq)) {
wb = loadWorkbook(file= paste0(path,wd.seq[i],"???",".xlsx"))
scheduleList[[i]] = readWorksheet(wb,sheet='=SCHEDULE', header = TRUE)
}
Thanks for reading and suggestions, if any.
Mathieu
I don't know if this is helpful, but if you want to read all the files in a certain directory (which it seems to me is what you're after), you can read all the filenames into a list using the list.files() function, for example
fileList <- list.files(""G://User//Documents//daily_data//")
And then load the xlsx files by looping through the list with a for loop:
for(i in fileList) {
loadWorkbook(file = i)
}
I haven't used the XLConnect functions before, so that exact code probably doesn't work, but the loop will iterate through all the files in that directory, so you can construct your loading call using the i variable for the filename. It won't be an absolute path though, so you might need to use paste to add the first part of the file path, for example:
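For example (a sketch, reusing the folder from your question), either of these should give a usable path:
wb <- loadWorkbook(file = paste0("G://User//Documents//daily_data//", i))
## or ask list.files for the complete path up front
fileList <- list.files("G://User//Documents//daily_data//", full.names = TRUE)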
I realize there might be other files in the directory that are not Excel files; you could use grepl to select only files containing "Op_Schedule_":
fileListClean <- fileList[grepl("Op_Schedule_",fileList)]
or perhaps only selecting .xlsx files in the directory:
fileListClean <- fileList[grepl(".xlsx",fileList)]
Edit to fit your reply:
Since you need to fit it to a sequence, you can do it as you did earlier:
wd.seq = format(seq(as.Date("2014-01-01"),as.Date("2016-12-31"),"days"),format="%Y%m%d")
wd.seq2 <- paste("Op_Schedule_", wd.seq, sep = "")
And then use grepl to pick only the files starting with those prefixes:
fileListClean <- fileList[grepl(paste(wd.seq2, collapse = "|"), fileList)]
Full disclosure: the last part I got from this SO answer: grep using a character vector with multiple patterns
Using this script I have created a specific folder for each csv file and then saved all my further analysis results in this folder. The name of the folder and the csv file are the same. The csv files are stored in the main/master directory.
Now, I have created a csv file in each of these folders which contains a list of all the fitted values.
I would now like to do the following:
Set the working directory to the particular filename
Read fitted values file
Add a row/column stating the name of the site/ unique ID
Add it to the masterfile which is stored in the main directory with a title specifying site name/filename. It can be stacked by rows or by columns it doesn't really matter.
Come to the main directory to pick the next file
Repeat the loop
Using merge(), rbind() or cbind() combines all the data under one set of column names. I want to keep all the sites separate for comparison at a later stage.
This is what I'm using at the moment and I'm lost on how to proceed further.
setwd( "path") # main directory
path <-"path" # need this for convenience while switching back to main directory
# import all files and create a character type array
files <- list.files(path = path, pattern = "\\.csv$")
for(i in seq(1, length(files), by = 1)){
fileName <- read.csv(files[i]) # repeat to set the required working directory
base <- strsplit(files[i], ".csv")[[1]] # getting the filename
setwd(file.path(path, base)) # setting the working directory to the same filename
master <- read.csv(paste0(base, "_fitted_values curve.csv"))
# read the fitted values csv file for the site and store it in a list
}
I want to construct a for loop to make one master file with the files in different directories. I do not want to merge all under one column name.
For example, if I have 50 similar csv files and each has two columns of data, I would like to have one csv file which accommodates all of them, but in their original format rather than appending to the existing rows/columns. So I would then have 100 columns of data.
Please tell me what further information I can provide.
For reading a group of files from a number of different directories, with pathnames patha, pathb, and pathc:
paths = c('patha','pathb','pathc')
files = unlist(sapply(paths, function(path) list.files(path, pattern = "\\.csv$", full.names = TRUE)))
listContainingAllFiles = lapply(files, read.csv)
If you want to be really quick about it, you can grab fread from data.table:
library(data.table)
listContainingAllFiles = lapply(files, fread)
Either way this will give you a list of all objects, kept separate. If you want to join them together vertically/horizontally, then:
do.call(rbind, listContainingAllFiles)
do.call(cbind, listContainingAllFiles)
EDIT: Note that the latter makes no sense unless your rows actually correspond to each other across files. It makes far more sense to just create a field tracking which location the data is from.
If you want to include the names of the files as the method of determining sample location (I don't see where you're getting this info from in your example), then you want to do this as you read in the files, so:
listContainingAllFiles = lapply(files,
function(file) data.frame(filename = file,
read.csv(file)))
Then later you can split that column to get your details (assuming, of course, you have a standard naming convention).
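For instance, if the file names were to look something like site_date.csv (a hypothetical convention, just for illustration), you could pull the site out of the filename column after combining:
combined <- do.call(rbind, listContainingAllFiles)
## first piece of the file name, assumed here to be the site ID
combined$site <- sapply(strsplit(basename(as.character(combined$filename)), "_"), `[`, 1)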
I need to download a few hundred Excel files each day and import them into R. Each one should be its own data frame. I have a csv file with all the addresses (the addresses remain static).
The csv file looks like this:
http://www.www.somehomepage.com/chartserver/hometolotsoffiles%a
http://www.www.somehomepage.com/chartserver/hometolotsoffiles%b
http://www.www.somehomepage.com/chartserver/hometolotsoffiles%a0
http://www.www.somehomepage.com/chartserver/hometolotsoffiles%aa11
etc.....
I can do it with a single file like this:
library(XLConnect)
my.url <- "http://www.somehomepage.com/chartserver/hometolotsoffiles%a"
loc.download <- "C:/R/lotsofdata/" # each file probably needs to have its own name here?
download.file(my.url, loc.download, mode="wb")
df.import.x1 = readWorksheetFromFile(loc.download, sheet = 2)
# This kind of import works on all the files, if you run them individually
But I have no idea how to download each file, and place it separately in a folder, and then import them all into R as individual data frames.
It's hard to answer your question as you haven't provided a reproducible example and it isn't clear what exactly you want. Anyway, the code below should point you in the right direction.
You have a list of urls you want to visit:
urls = c("http://www/chartserver/hometolotsoffiles%a",
"http://www/chartserver/hometolotsoffiles%b")
In your example, you would load this from the csv file.
Next we download each file and put it in a separate directory (you mentioned that in your question):
for(url in urls) {
split_url = strsplit(url, "/")[[1]]
##Extract final part of URL
dir = split_url[length(split_url)]
##Create a directory
dir.create(dir)
##Download the file into that directory (download.file needs a file name, not just a directory;
##the .xlsx extension is an assumption, given the files are Excel workbooks)
download.file(url, file.path(dir, paste0(dir, ".xlsx")), mode="wb")
}
Then we loop over the directories and files and store the results in a list.
##Read in files
l = list(); i = 1
dirs = list.dirs("/data/", recursive=FALSE)
for(dir in dirs){
file = list.files(dir, full.names=TRUE)
##Do something?
##Perhaps store sheets as a list
l[[i]] = readWorksheetFromFile(file, sheet=2)
i = i + 1
}
We could of course combine steps two and three into a single loop. Or drop the loops and use sapply.
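For example, a combined version might look roughly like this (an untested sketch reusing the urls vector from above, and assuming the downloads are .xlsx workbooks):
library(XLConnect)
l = lapply(urls, function(url) {
  split_url = strsplit(url, "/")[[1]]
  ## Use the final part of the URL as the directory name
  dir = split_url[length(split_url)]
  dir.create(dir)
  ## Give the downloaded file an explicit name inside that directory
  dest = file.path(dir, paste0(dir, ".xlsx"))
  download.file(url, dest, mode = "wb")
  ## Read the second sheet of the downloaded workbook
  readWorksheetFromFile(dest, sheet = 2)
})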