Import directory of docx files - r

I have a directory with .docx files that I want to import via textreadr's read_docx function.
First I set the working directory and create list of files:
setwd("C:/R")
files <- list.files("C:/R", pattern = "\\.docx")
Now I want to iterate through the list and import each file individually into an object named data_<file>:
for (file in files) {
  assign("data_", file, sep = "") <- read_docx("file")
}
Optionally, I tried creating a list of lists:
data_list <- lapply(files, function(v){
  read_docx("v")
})
Neither variant works, and I'm not sure what I'm doing wrong.

Maybe the full path is not present; we can add
files <- list.files("C:/R", pattern = "\\.docx", full.names = TRUE)
The issue is that v (or file) is quoted: read_docx("v") tries to read the literal string "v" rather than the value of the variable. Thus, the code in the OP's post can be corrected to
data_list <- lapply(files, function(v){
  read_docx(v)
})
or in the for loop
for (file in files) {
  assign(paste0("data_", file), read_docx(file))  # paste0 has no 'sep' argument
}
Also, as noted in the comments, if there are 1000 files, assign creates 1000 new objects, which gets messy when we want to gather them all again. Instead, just as lapply returns a single list, the output of the for loop can be stored in a list:
data_list2 <- vector('list', length(files))
names(data_list2) <- files
for(file in files) {
  data_list2[[file]] <- read_docx(file)
}
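Since files now holds full paths, the list keys will be full paths too. A small sketch of tidier keys, using the base-R helpers basename and tools::file_path_sans_ext ("report" below is a hypothetical document name):
names(data_list2) <- tools::file_path_sans_ext(basename(files))
data_list2[["report"]]  # access one document by its (hypothetical) bare name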

First off, you need to grab the full path instead of just the filenames from list.files:
files <- list.files("C:/R", pattern = "\\.docx$", full.names = TRUE)
Then the lapply solution works if you pass the parameter v to read_docx instead of a literal string "v". You don’t even need the wrapper function:
data_list <- lapply(files, read_docx)
As an aside, there’s no need for setwd in your code, and its use is strongly discouraged.
Furthermore, the assign call in your code doesn’t work, and even after fixing the syntax this use of assign is simply inappropriate: at best it is a hack that badly approximates the functionality of lists. The correct solution, ten times out of ten, is to use a named list or vector in its place.
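For instance, a minimal named-list version (naming by basename is just one reasonable choice):
data_list <- setNames(lapply(files, read_docx), basename(files))
data_list[["report.docx"]]  # access one document by (hypothetical) file name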

Related

Importing files with almost identical paths and names

I have many txt files that I want to import into R. I import these files one by one, do the operations that I want, and then import the next file.
All these files are located in a database system where all the folders have almost the same names, e.g.
database\type4\system50
database\type6\system50
database\type4\system30
database\type4\system50
Similarly, the names of the files are also almost the same, referring to the folder where they are located, e.g.
type4.system50.txt
type6.system50.txt
type4.system30.txt
type4.system50.txt
I have heard that there should be an easier way of importing these many files one by one than repeated setwd and read.csv2 commands. As far as I understand, this is possible with the macro import function in SAS, where you specify an overall path and then, each time you want to import a file, you specify only what is specific about that file name/folder name.
Is there a similar function in R? I tried to look at "Importing Data in R like SAS macro", but that question did not really show me how to specify the folder name/file name.
Thank you for your help.
If you want to specify the folder name / file name, try this:
databasepath <- "path/to/database"
## list all files under the database folder
tmp <- list.files(databasepath, recursive = TRUE, full.names = TRUE)
## filter the files you want to read
readmyfile <- function(foldername, filename){
  tmp[grepl(foldername, tmp) & grepl(filename, tmp)]
}
files_to_read <- readmyfile("type4", "system50")
some_files <- lapply(files_to_read, read.csv2)
## Or you can read all of them (if memory is large enough to hold them)
all_files <- lapply(tmp, read.csv2)
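If the folder and file names really do follow the pattern shown above, a SAS-macro-style helper that builds each path directly is another option. A sketch, assuming the database/typeX/systemY/typeX.systemY.txt layout from the question:
read_system <- function(type, system, base = "path/to/database") {
  # build e.g. "path/to/database/type4/system50/type4.system50.txt"
  path <- file.path(base, type, system, paste0(type, ".", system, ".txt"))
  read.csv2(path)
}
dat <- read_system("type4", "system50")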
Instead of calling setwd repeatedly, you can build the absolute path for each file, save all of the paths in a vector, loop over that vector, and load the files into a list:
library(data.table)
file_dir <- "path/to/files/"
file_vec <- list.files(path = file_dir, pattern = "\\.txt$")
file_list <- list()
for (n in seq_along(file_vec)){  # not 1:length(file_list), which is empty here
  file_list[[n]] <- fread(input = paste0(file_dir, file_vec[n]))
}
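If the files all share the same columns, the list can then be collapsed into a single table. A sketch using data.table's rbindlist, where idcol records which file each row came from:
names(file_list) <- file_vec
all_data <- rbindlist(file_list, idcol = "file")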

Loop over tif-files in a folder but excluding tif.aux files in R

I want to loop through all .TIF files in a folder and process them with a function. While looping through the files, the problem is that my loop picks up not only the .TIF files but also the TIF.aux.xml, TIF.xml and TIF.ovr files. As soon as the loop tries to process those, it stops, because my processing only works on the real TIF files. When I tried the regular expression '.tif' it did not recognize any file; when I used '.TIF' it recognized every file, including 'TIF.aux.xml', 'TIF.xml' and 'TIF.ovr'. There must be a trick with regular expressions to prevent that and make the expression stop after the F. Does anyone have an idea how to work around this?
The code I use (the function does not matter so far... it's just about the regular expression I guess):
library(raster)
library(sp)
library(rgdal)
files <- list.files(input_dir, pattern = '*.TIF', full.names = TRUE)
for (i in 1:length(files)){
  file_name <- files[i]
  file_raster <- brick(file_name)  # full.names = TRUE, so files[i] is already the complete path
  # function...
}
Try this:
library(tools)
files <- list_files_with_exts(input_dir, "TIF")  # extension matching is case-sensitive, so match the uppercase .TIF
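Alternatively, the regex trick the question asks for is to escape the dot and anchor the pattern at the end of the filename with $, so names that merely contain "TIF" no longer match:
files <- list.files(input_dir, pattern = "\\.TIF$", full.names = TRUE)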

R - iterate over list of files

I have a list of files with some data that I want to read in to R and then iterate over each file for some calculations.
So far I was able to read the files with the following code:
METHOD1
filenames<-list.files(pattern="*.txt")
mynames<-gsub(".txt$", "", filenames)
for (i in 1:length(mynames)) assign(mynames[i], read.table(filenames[i]))
However, when I try to apply some function to "mynames", it just returns NULL:
lapply(mynames, nrow)
I know that it could be easier to read the files directly into a list
METHOD2
temp<-list.files(pattern="*.txt")
myfiles<-lapply(temp, read.table,skip="#")
and then call lapply on that list, lapply(myfiles, nrow), but this loses the information about which file produced each list element.
Is there any way to circumvent this with either method, so that I can keep track of the relation between list elements and files?
Thanks
For the first method, try
sapply(mynames, function(nameoffile){ nrow(get(nameoffile)) })
For method 2 you could easily use something like
temp <- list.files(pattern = "\\.txt$")
myfiles <- lapply(temp, read.table)  # skip expects a number; '#' lines are already ignored by default (comment.char = "#")
names(myfiles) <- temp
In this way the names attribute stores the filenames and you do not clutter your working environment with new variables.
So when you want to iterate over the content you can use lapply(myfiles, function(.) nrow(.)), or if you need to iterate over both the filename and the content you could do something like lapply(names(myfiles), function(.) nrow(myfiles[[.]]))
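A sketch of that second form using Map, which pairs each filename with its content (row_counts is a hypothetical result name):
row_counts <- Map(function(fname, df) {
  cat(fname, "has", nrow(df), "rows\n")  # report per-file row counts
  nrow(df)
}, names(myfiles), myfiles)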

Read, process and export analysis results from multiple .csv files in R

I have a bunch of CSV files and I would like to perform the same analysis (in R) on the data within each file. Firstly, I assume each file must be read into R (as opposed to running a function on the CSV and providing output, like a sed script).
What is the best way to input numerous CSV files to R, in order to perform the analysis and then output separate results for each input?
Thanks (btw I'm a complete R newbie)
You could go for Sean's option, but it's going to lead to several problems:
You'll end up with a lot of unrelated objects in the environment, with the same names as the files they belong to. This is a problem because...
For loops can be pretty slow, and because you've got this big pile of unrelated objects, you're going to have to rely on for loops over the filenames for each subsequent piece of analysis - otherwise, how the heck are you going to remember what the objects are named so that you can call them?
Calling objects by pasting their names in as strings - which you'll have to do, because, again, your only record of what the object is called is in this list of strings - is a real pain. Have you ever tried to call an object when you can't write its name in the code? I have, and it's horrifying.
A better way of doing it might be with lapply().
# List files
filelist <- list.files(pattern = "\\.csv$")
# Now we use lapply to perform a set of operations
# on each entry in the list of filenames.
to_dispose_of <- lapply(filelist, function(x) {
  # Read in the file specified by 'x' - an entry in filelist
  data.df <- read.csv(x, skip = 1, header = TRUE)
  # Store the filename, minus .csv. This will be important later.
  filename <- substr(x = x, start = 1, stop = (nchar(x) - 4))
  # Your analysis work goes here. You only have to write it out once
  # to perform it on each individual file.
  ...
  # Eventually you'll end up with a data frame or a vector of analysis
  # to write out. Great! Since you've kept the value of x around,
  # you can do that trivially.
  write.table(x = data_to_output,
              file = paste0(filename, "_analysis.csv"),
              sep = ",")
})
And done.
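If you also want the analysis results back in memory rather than only on disk, return the value from the function and lapply will collect everything into one named list. A sketch, where analyse() stands in for your (hypothetical) analysis code:
results <- lapply(filelist, function(x) {
  data.df <- read.csv(x, skip = 1, header = TRUE)
  analyse(data.df)  # hypothetical analysis function; returned value is collected
})
names(results) <- filelist  # keep track of which file produced each result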
You can try the following code after putting all csv files in the same directory:
names <- list.files(pattern = "\\.csv$")  # csv file names
for (i in 1:length(names)) {
  assign(names[i], read.csv(names[i], skip = 1, header = TRUE))
}
Hope this helps!

Open 100 files in R

I need to read many files with data, but I can't make it work.
For example: I have 6 ASCII files named "rain", "wind", etc.
This is what I thought:
namelist<-c("rain","wind","sunshine hour","radiation","soil moisture","pressure")
for (i in 1:6){
  metedata <- read.table('d:/namelist[i].txt')
  metedata
}
But that didn't work. What should I do?
Try this:
namelist <- c("rain","wind","sunshine hour","radiation","soil moisture","pressure")
for (name in namelist){
  metedata <- read.table(paste0('d:/', name, '.txt'))
  metedata
}
Or read them into a list using lapply. Assuming your working directory is in the location of the files:
dat = lapply(list.files(pattern = "\\.txt$"), read.table)
This makes a list of all the .txt files in your working directory and calls read.table on each, returning a list of their contents.
Or directly read them into one big data.frame:
library(plyr)
dat = ldply(list.files(pattern = "\\.txt$"), read.table)
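The same one-big-data.frame result is available in base R without plyr; a minimal sketch, assuming the files share the same columns:
dat <- do.call(rbind, lapply(list.files(pattern = "\\.txt$"), read.table))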
