Iterating over non-existent files in a directory - R

I have around five files in my directory that I want to read in R. Each file name follows the pattern "mypagex.html", where x = 1, 2, 3 and so on. However, a few files in the sequence are missing. I want a loop that reads all the files and, whenever a file does not exist, jumps to the next file in the sequence. At the moment, my loop stops as soon as it encounters the first non-existent file.
Here is the loop:
ids = c(1:10)
for (i in ids) {
  myurl = paste("mypage", i, ".html")
  myurl = gsub(" ", "", myurl)
  pointer = read_html(myurl)
  if (is_null(pointer)) {
    next
  }
}
This is the error:
Error: 'mypage3.html' does not exist in current working directory ('E:/My_projects/mydb').
How can I make the loop skip the non-existent files?

Instead of looping over your ids vector, which may include IDs for files that don't exist, lapply over the actual files in the directory, obtained from list.files().
You can restrict the result to the html files with list.files(pattern = "\\.html$") - note that pattern is a regular expression, not a glob, so "*.html" is not what you want.
Here is an example:
library(xml2)  # provides read_html() (rvest re-exports it)

html_files = list.files(pattern = "\\.html$")
pages = lapply(html_files, function(x) {
  read_html(x)
})
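If you would rather keep iterating over the numeric sequence, a minimal sketch (assuming the files sit in the working directory) is to test each candidate with file.exists() before reading it:
library(xml2)  # provides read_html()

ids = 1:10
pages = list()
for (i in ids) {
  myurl = paste0("mypage", i, ".html")
  if (!file.exists(myurl)) {
    next  # skip missing files instead of erroring
  }
  pages[[myurl]] = read_html(myurl)
}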

Related

Import directory of docx files

I have a directory with .docx files that I want to import via textreadr's read_docx function.
First I set the working directory and create a list of files:
setwd("C:/R")
files <- list.files("C:/R", pattern = "\\.docx")
Now I want to iterate through the list and import every file individually, into an object named data_<file>:
for (file in files) {
assign("data_", file, sep = "") <- read_docx("file")
}
Alternatively, I tried creating a list of lists:
data_list <- lapply(files, function(v){
read_docx("v")
})
Neither variant works, and I'm not sure what I'm doing wrong.
The full path may be missing; we can add full.names = TRUE:
files <- list.files("C:/R", pattern = "\\.docx", full.names = TRUE)
The issue is that v (and likewise file) is quoted: the code tries to read the literal string "v" instead of the value of the variable. Thus, the code in the OP's post can be corrected to
data_list <- lapply(files, function(v){
read_docx(v)
})
or, in the for loop,
for (file in files) {
  assign(paste0("data_", file), read_docx(file))  # paste0() takes no sep argument
}
Also, as noted in the comments, if there are 1000 files, assign creates 1000 new objects, which gets messy when we want to gather them all again. Instead, just as lapply returns a single list, the output of the for loop can be stored in a list:
data_list2 <- vector('list', length(files))
names(data_list2) <- files
for (file in files) {
  data_list2[[file]] <- read_docx(file)
}
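As a hedged aside (assuming read_docx returns a character vector of document lines, as in textreadr), the pieces can later be gathered back into one named vector:
all_text <- unlist(data_list2)  # element names are prefixed with the file names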
First off, you need to grab the full paths instead of just the file names from list.files:
files <- list.files("C:/R", pattern = "\\.docx$", full.names = TRUE)
Then the lapply solution works if you pass the parameter v to read_docx instead of the literal string "v". You don't even need the anonymous wrapper function:
data_list <- lapply(files, read_docx)
As an aside, there’s no need for setwd in your code, and its use is strongly discouraged.
Furthermore, using the assign function as in your code doesn't work, and even after fixing the syntax its use here is simply inappropriate: at best it is a hack that poorly approximates the functionality of lists. The correct solution, ten times out of ten, is to use a named list or vector instead.
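Following that advice, a small sketch (assuming files holds the full paths from above) that names each list element after its file:
library(textreadr)

files <- list.files("C:/R", pattern = "\\.docx$", full.names = TRUE)
data_list <- setNames(lapply(files, read_docx), basename(files))
data_list[["report.docx"]]  # "report.docx" is a hypothetical name; access one document by name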

Skip empty files inside zip files

I am reading a lot of .csv files inside a .zip file with the following code:
for (i in unzip("data.zip", list = TRUE)$Name) {
  read.csv(unz("data.zip", i))
}
The problem is that some of the .csv files are empty, which leads to a "no lines available in input" error that interrupts the loop. How can I skip those empty files?
Try this:
flist <- unzip("data.zip", list = TRUE)
Now flist$Length gives you the uncompressed size of each file in bytes, so e.g.
keep <- flist$Length > 100  # or whatever threshold separates files with data from empty ones
Now you can read the non-empty ones (through unz, since they live inside the zip) and save them to a list:
AllFiles <- lapply(flist$Name[keep], function(f) read.csv(unz("data.zip", f)))
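Alternatively, a hedged sketch that simply catches the error and skips any file that cannot be parsed, whatever its size:
flist <- unzip("data.zip", list = TRUE)
AllFiles <- lapply(flist$Name, function(f) {
  tryCatch(read.csv(unz("data.zip", f)),
           error = function(e) NULL)  # return NULL for unreadable files
})
AllFiles <- Filter(Negate(is.null), AllFiles)  # drop the skipped entries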

Why does "write.dat" (R) save data files within folders?

In order to conduct some analysis using a particular piece of software, I am required to have separate ".dat" files for each participant, with each file named after the participant number, all saved in one directory.
I have tried to do this using the "write.dat" function in R (from the 'multiplex' package).
I have written a loop that outputs a ".dat" file for each participant in a dataset. I would like each file that is outputted to be named the participant number, and for them all to be stored in the same folder.
## Using write.dat
participants_ID <- unique(newdata$SJNB)
for (i in 1:length(participants_ID)) {
  data_list[[i]] <- newdata %>%
    filter(SJNB == participants_ID[i])
  write.dat(data_list[[i]], paste0("/Filepath/Directory/", participants_ID[i], ".dat"))
}
## Using write_csv this works perfectly:
participants_ID <- unique(newdata$SJNB)
for (i in 1:length(participants_ID)) {
  newdata %>%
    filter(SJNB == participants_ID[i]) %>%
    write_csv(paste0("/Filepath/Directory/", participants_ID[i], ".csv"), append = FALSE)
}
If I use the function write_csv, this works perfectly (saving .csv files for each participant). However, if I use write.dat, each participant's file is saved inside a separate folder: the folder name is the participant number, and the file inside the folder is called "data_list[[i]]". To get all of the files into the same directory, I then have to rename and move them, which is time consuming.
I could theoretically output the files to .csv and then convert them to .dat, but I'm intrigued to know whether there's anything I could do differently to get the write.dat function to work the way I'm trying :)
The documentation on write.dat is subminimal, but it would appear that you have confused a directory path with a file name. You have inadvertently created a directory named "/Filepath/Directory/[participants_ID[i]].dat", and that's where each output file is placed. That you cannot assign a name to the .dat file itself appears to be a defect in the package as supplied.
However, not all is lost. Inside your loop, replace your write.dat line with the following lines, or something similar (not tested):
Edit
It occurs to me that there's a smoother solution, albeit using the dreaded eval. Again inside the loop (and assuming participants_ID[i] is a character string that is a valid R name):
eval(parse(text = paste0(participants_ID[i], " <- data_list[[i]]")))
eval(parse(text = paste0("write.dat(", participants_ID[i], ", '/Filepath/Directory/')")))
Previous answer
write.dat(data_list[[i]], "/Filepath/Directory/")
# write.dat names its output after the deparsed first argument, so the file is
# literally called "data_list[[i]]"; quote it for the shell before renaming
thecommand = paste0('mv "/Filepath/Directory/data_list[[i]]" "/Filepath/Directory/', participants_ID[i], '.dat"')
system(thecommand)
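As a portable alternative to shelling out to mv (assuming the same file locations as above), base R's file.rename does the same job on any platform:
file.rename(from = "/Filepath/Directory/data_list[[i]]",
            to = paste0("/Filepath/Directory/", participants_ID[i], ".dat"))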

Looping through folder and finding specific file in R

I am trying to loop through many folders in a directory, looking for a particular xml file buried in one of them. I would then like to save the location of that file and run my code against it (that code is not included here). What I am asking is how to loop through all the folders and then open the specific file.
For example:
My main folder would be: C:\Parsing
It has two folders named "folder1" and "folder2".
Each folder has an xml file that I am interested in; let's say it's called "needed.xml".
I would like to have a script that loops through the directory and finds those particular files.
Do you know how I could do that in R?
Using list.files and grepl you could look recursively through all sub-folders:
rootPath = "C:/Parsing"  # use forward slashes (or "C:\\Parsing") in R paths
listFiles = list.files(rootPath, recursive = TRUE)
searchFileName = "needed.xml"
presentFile = listFiles[grepl(searchFileName, listFiles, fixed = TRUE)]
if (length(presentFile)) cat("File", searchFileName, "is present at", presentFile, "\n")
Is this what you're looking for?
require(XML)
fol <- list.files("C:/Parsing")
for (i in fol) {
  dir <- file.path("C:/Parsing", i, "needed.xml")  # paste() with sep = "" would drop the slash after the root
  if (file.exists(dir)) {
    needed <- xmlToList(dir)
  }
}
This will locate your xml file and read it into R as a list. It wasn't clear from your question whether you wanted the output to be the data itself or just the location of the data, which could then be supplied to another function/script. If you just want the location, remove the xmlToList call.
I would do something like this (pattern is a regular expression, so adjust it to your exact file name if you like):
list.files(path = "C:/Parsing", pattern = "\\.xml$", recursive = TRUE, full.names = TRUE)
This recursively looks for files with the .xml extension under C:/Parsing and returns the full paths of the matches.
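To match only the one file the question asks about, an anchored pattern can be used (assuming the name from the question):
list.files(path = "C:/Parsing", pattern = "^needed\\.xml$", recursive = TRUE, full.names = TRUE)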

How to read a csv file or load an excel workbook by ignoring some characters in the file path?

I'm writing a loop script that involves reading files from workbooks (using the XLConnect package). The challenge is that the file names contain characters (representing a time) that I want to ignore.
For example, here are the paths to 3 of those files:
G://User//Documents//daily_data//Op_Schedule_20160520_132025.xlsx
G://User//Documents//daily_data//Op_Schedule_20160521_142805.xlsx
G://User//Documents//daily_data//Op_Schedule_20160522_103052.xlsx
I need to import hundreds of those files. I can easily account for the character string representing the date (e.g. 20160522), but not the one representing the time.
Is there a way to tell R to ignore some characters in a file path? Here is how I was thinking of writing my script (the "???" is where I need help). I know a loop is probably not the most efficient way, but I'm open to suggestions, should you have any:
require(XLConnect)
path = "G://User//Documents//daily_data//Op_Schedule_"
wd.seq = format(seq(as.Date("2014-01-01"), as.Date("2016-12-31"), "days"), format = "%Y%m%d")
scheduleList = rep(list(matrix(1, 1, 1)), length(wd.seq))
for (i in 1:length(wd.seq)) {
  wb = loadWorkbook(file = paste0(path, wd.seq[i], "???", ".xlsx"))
  scheduleList[[i]] = readWorksheet(wb, sheet = '=SCHEDULE', header = TRUE)
}
Thanks for reading, and for any suggestions.
Mathieu
I don't know if this is helpful, but if you want to read all the files in a certain directory (which seems to be what you're after), you can read the file names into a vector using the list.files() function, for example:
fileList <- list.files("G://User//Documents//daily_data//")
And then load the xlsx files by looping through the vector with a for loop:
for (i in fileList) {
  wb <- loadWorkbook(file = i)
}
I haven't used the XLConnect functions before, so that exact code may not work, but the loop will iterate through all the files in the directory, and you can construct your loading call using the i variable for the file name (it won't be an absolute path, though, so you may need paste to prepend the directory, or pass full.names = TRUE to list.files).
I realize there might be other files in the directory that are not Excel files; you could use grepl to select only the files containing "Op_Schedule_":
fileListClean <- fileList[grepl("Op_Schedule_", fileList)]
or perhaps select only the .xlsx files in the directory:
fileListClean <- fileList[grepl("\\.xlsx$", fileList)]
Edit to fit your reply:
Since you need to restrict the files to a date sequence, you can build the prefixes as you did earlier:
wd.seq = format(seq(as.Date("2014-01-01"),as.Date("2016-12-31"),"days"),format="%Y%m%d")
wd.seq2 <- paste("Op_Schedule_", wd.seq, sep = "")
And then use grepl to pick only the files whose names contain one of those prefixes:
fileListClean <- fileList[grepl(paste(wd.seq2, collapse = "|"), fileList)]
Full disclosure: the last part I got from this SO answer: grep using a character vector with multiple patterns
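Putting the pieces together, a hedged end-to-end sketch (assuming the directory from the question, and keeping the sheet specification from the original code verbatim):
require(XLConnect)

dailyDir <- "G://User//Documents//daily_data//"
wd.seq <- format(seq(as.Date("2014-01-01"), as.Date("2016-12-31"), "days"), format = "%Y%m%d")
fileList <- list.files(dailyDir, full.names = TRUE)
keep <- grepl(paste(paste0("Op_Schedule_", wd.seq), collapse = "|"), fileList)
scheduleList <- lapply(fileList[keep], function(f) {
  wb <- loadWorkbook(f)
  readWorksheet(wb, sheet = '=SCHEDULE', header = TRUE)  # sheet spec copied from the question
})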
