I am trying to extract the text from a large number of docx files (1500) placed in one folder, using readtext (after creating a list of the files with list.files).
You can find similar examples here: https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html
I am getting errors with some files (examples below); the problem is that when such an error occurs, the whole extraction process stops. I can identify the problematic file by setting verbosity = 3, but then I have to restart the extraction to find the next problematic file.
My question is whether there is a way to avoid interrupting the process when an error is encountered.
I changed ignore_missing_files = TRUE, but this did not fix the problem.
Examples of the errors encountered:
write error in extracting from zip file
Error: 'C:\Users--- c/word/document.xml' does not exist.
Sorry for not posting a reproducible example, but I do not know how to post an example with large docx files. This is the code:
library(readtext)
data_files <- list.files(path = "PATH", full.names = T, recursive = T) # PATH = the path to the folder where the documents are located
extracted_texts <- readtext(data_files, docvarsfrom = "filepaths", dvsep = "/", verbosity = 3, ignore_missing_files = TRUE) # this is to extract the text in the files
write.csv2(extracted_texts, file = "data/text_extracts.csv", fileEncoding = "UTF-8") # this is to export the files into csv
Let's first put together a reproducible example:
download.file("https://file-examples-com.github.io/uploads/2017/02/file-sample_1MB.docx", "test1.docx")
writeLines("", "test2.docx")
The first file I produced here should be a proper docx file, the second one is rubbish.
I would wrap readtext in a small function that deals with the errors and warnings:
readtext_safe <- function(f) {
out <- tryCatch(readtext::readtext(f),
error = function(e) "fail",
warning = function(e) "fail")
if (isTRUE("fail" == out)) {
write(f, "errored_files.txt", append = TRUE)
} else {
return(out)
}
}
Note that I treat errors and warnings the same, which might not be what you actually want. We can use this function to loop through your files:
files <- list.files(pattern = ".docx$", ignore.case = TRUE, full.names = TRUE)
x <- lapply(files, readtext_safe)
x
#> [[1]]
#> readtext object consisting of 1 document and 0 docvars.
#> # Description: df[,2] [1 × 2]
#> doc_id text
#> <chr> <chr>
#> 1 test1.docx "\"Lorem ipsu\"..."
#>
#> [[2]]
#> NULL
In the resulting list, failed files simply have a NULL entry, since nothing is returned for them. I like to keep a record of these errored files, and the function above writes a txt file that looks like this:
readLines("errored_files.txt")
#> [1] "./test2.docx"
I have been working on a dataset of folders and subfolders (folder -> subfolder -> file).
I have trouble reading the first 10 folders of data. I have used the code below, but it doesn't work. Please help.
> for(i in seq_along(my_folders)){
+ my_data[[[i]]] = list.files(path = "~/dataset1", recursive = TRUE)
Below, see the problem with reading a txt file in a subfolder:
> for(i in 1:13){
+ current_dir = dirs[i]
+ lines = readLines(mydata[[i]])}
This gives the error: Error in file(con, "r") : invalid 'description' argument
But outside of the loop this works:
> lines <- readLines(my_data[[1]])
What do you think of this:
dirs = list.dirs(recursive = FALSE) # reads all directories/folders
mydata = list() # create empty list
for (i in 1:10) { # only takes the first 10 directories
current_dir = dirs[i]
mydata[[i]] = list.files(path = file.path("~/dataset1", current_dir), recursive = TRUE)
}
You only have to adapt it to your folder structure.
@sequoia's answer works, but in R you can take advantage of concise functional programming, which @langtang's answer gets at with lapply(). Try this one-liner:
library(tidyverse)
library(fs)
d <- dir_ls("path/to/folders", recurse = TRUE) %>% map(read_lines) # map() keeps the contents; walk() would discard them
Use dir to get a vector of file names, for example all .txt files in folder "f" and all its subfolders:
files = dir("f", pattern = ".txt", full.names = T, recursive = T)
files
[1] "f/f1/f1_1/f1_1.txt"
[2] "f/f1/f1_2/f1_2.txt"
[3] "f/f2/f2_1/f2_1.txt"
[4] "f/f2/f2_2/f2_2.txt"
Then read them using readLines
lapply(files, readLines)
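If you also want to keep track of which lines came from which file, one option (not part of the original answer) is to name the list elements with the file paths:
texts <- setNames(lapply(files, readLines), files)
texts[["f/f1/f1_1/f1_1.txt"]]  # look up one file's lines by its path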
I am using RStudio (running R 4.0.1) and Stata 12 for Windows and have got a large number of folders with Stata 16 .dta files (and other types of files not relevant to this question). I want to create an automated process of converting all Stata 16 .dta files into Stata 12 format (keeping all labels) to then analyze.
Ideally, I want to keep the names of the original folders and files but save the converted versions into a new location.
This is what I have got so far:
library(haven) # for read_dta() and write_dta()
setwd("C:/FilesLocation")
#vector with names of files to be converted
all_files <- list.files(pattern = "*.dta", full.names = TRUE)
for (i in all_files){
#Load file to be converted into STATA12 version
data <- read_dta("filename.dta",
encoding = NULL,
col_select = NULL,
skip = 0,
n_max = Inf,
.name_repair = "unique")
#Write as .dta
write_dta(data,"c:/directory/filename.dta", version = 12, label = attr(data, "label"))
}
Not sure this is the best approach. I know the commands inside the loop work for a single file, but I have not been able to automate this for all files.
Your code only needs some very minor modifications. I've indicated the changes (along with comments explaining them) in the snippet below.
library(haven)
mypath <- "C:/FilesLocation"
all_files <- list.files(path = mypath, pattern = "*.dta", full.names = TRUE)
for (i in 1:length(all_files)){
#(Above) iterations need the length of the vector to be specified
#Load file to be converted into STATA12 version
data <- read_dta(all_files[i], #You want to read the ith element in all_files
encoding = NULL,
col_select = NULL,
skip = 0,
n_max = Inf,
.name_repair = "unique")
#Add a _v12 suffix to the filename to
#specify that it is version 12 now
new_fname <- paste0(unlist(strsplit(basename(all_files[i]), "\\."))[1],
"_v12.", unlist(strsplit(basename(all_files[i]), "\\."))[2])
#Write as .dta
#with this new filename
write_dta(data, path = paste0(mypath, "/", new_fname),
version = 12, label = attr(data, "label"))
}
I tried this out with some .dta files from here, and the script ran without throwing up errors. I haven't tested this on Windows, but in theory it should work fine.
Edit: here is a more complete solution with read_dta and write_dta wrapped into a single function dtavconv. This function also allows the user to convert version numbers to arbitrary values (default is 12).
#----
#.dta file version conversion function
dtavconv <- function(mypath = NULL, myfile = NULL, myver = 12){
#Function to convert .dta file versions
#Default version files are converted to is v12
#Default directory is whatever is specified by getwd()
if(is.null(mypath)) mypath <- getwd()
#Main code block wrapped in a tryCatch()
myres <- tryCatch(
{
#Load file to be converted into STATA12 version
data <- haven::read_dta(paste0(mypath, "/", myfile),
encoding = NULL,
col_select = NULL,
skip = 0,
n_max = Inf,
.name_repair = "unique")
#Add a _v<myver> suffix to the filename to
#indicate the new version (v12 by default)
new_fname <- paste0(unlist(strsplit(basename(myfile), "\\."))[1],
"_v", myver, ".", unlist(strsplit(basename(myfile), "\\."))[2])
#Write as .dta
#with this new filename
haven::write_dta(data, path = paste0(mypath, "/", new_fname),
version = myver, label = attr(data, "label"))
message("\nSuccessfully converted ", myfile, " to ", new_fname, "\n")
},
error = function(cond){
#message("Unable to write file", myfile, " as ", new_fname)
message("\n", cond, "\n")
return(NA)
}
)
return(myres)
}
#----
The function can then be run on as many files as desired by invoking it via lapply or a for loop, as the example below illustrates:
#----
#Example run
library(haven)
#Set your path here below
mypath <- paste0(getwd(), "/", "dta")
#Check to see if this directory exists
#if not, create it
if(!dir.exists(mypath)) dir.create(mypath)
list.files(mypath)
# character(0)
#----
#Downloading some valid example files
myurl <- c("http://www.principlesofeconometrics.com/stata/airline.dta",
"http://www.principlesofeconometrics.com/stata/cola.dta")
lapply(myurl, function(x){ download.file(url = x, destfile = paste0(mypath, "/", basename(x)))})
#Also creating a negative test case
file.create(paste0(mypath, "/", "anegcase.dta"))
list.files(mypath)
# [1] "airline.dta" "anegcase.dta" "cola.dta"
#----
#Getting list of files in the directory
all_files <- list.files(path = mypath, pattern = "*.dta")
#Converting files using dtavconv via lapply
res <- lapply(all_files, dtavconv, mypath = mypath)
#
# Successfully converted airline.dta to airline_v12.dta
#
#
# Error in df_parse_dta_file(spec, encoding, cols_skip, n_max, skip,
# name_repair = .name_repair): Failed to parse /my/path/
# /dta/anegcase.dta: Unable to read from file.
#
#
#
# Successfully converted cola.dta to cola_v12.dta
#
list.files(mypath)
# [1] "airline_v12.dta" "airline.dta" "anegcase.dta" "cola_v12.dta"
# "cola.dta"
#Example for converting to version 14
res <- lapply(all_files, dtavconv, mypath = mypath, myver = 14)
#
# Successfully converted airline.dta to airline_v14.dta
#
#
# Error in df_parse_dta_file(spec, encoding, cols_skip, n_max, skip,
# name_repair = .name_repair): Failed to parse /my/path
# /dta/anegcase.dta: Unable to read from file.
#
#
#
# Successfully converted cola.dta to cola_v14.dta
#
list.files(mypath)
# [1] "airline_v12.dta" "airline_v14.dta" "airline.dta" "anegcase.dta"
# "cola_v12.dta" "cola_v14.dta" "cola.dta"
#----
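For completeness, the same conversion can also be driven by a for loop instead of lapply, as mentioned above; a minimal equivalent sketch using the all_files and mypath objects already defined:
#Equivalent to the lapply() call above
res <- vector("list", length(all_files))
for (i in seq_along(all_files)) {
  res[[i]] <- dtavconv(mypath = mypath, myfile = all_files[i], myver = 12)
}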
I want to read in several fixed width format txt files into R but I first need to unzip them.
Since they are very large files I want to use read_fwf from the readr package because it's very fast.
When I do:
read_fwf(unz(zipfileName, fileName), fwf_widths(colWidths, col_names = colNames))
I get this error Error in isOpen(con) : invalid connection
However when I do:
read.table(unz(zipfileName, fileName)) without specifying widths, it reads into R just fine. Any thoughts as to why this isn't working with read_fwf?
I am having trouble making a reproducible example. Here is what I've got:
df <- data.frame(
rnorm(100),
rnorm(100)
)
write.table(df, "data.txt", row.names=F, col.names = F)
zip(zipfile = "data.zip", files = "data.txt")
colWidths <- rep(2, 100)
colNames <- c("thing1","thing2")
zipfileName <- "data.zip"
fileName <- "data.csv"
I also had trouble getting read_fwf to read zip files when passing an unz-ed connection to it, but on reading the ?read_fwf page I see that zipped files are promised to be handled automagically. Your example file is not a valid fixed-width file, since neither of the columns has constant positions, but that is apparent from the output:
read_fwf(file = "~/data.zip", fwf_widths(widths = rep(16, 2), col_names = colNames))
Warning: 1 parsing failure.
row col expected actual
3 thing2 16 chars 14
# A tibble: 100 x 2
thing1 thing2
<chr> <chr>
1 1.37170820802141 -0.58354018425322
2 0.03608988699566 7 -0.402708262870141
3 1.02963272114 -1 .0644333112294
4 0.73546166509663 8 0.607941664550652
5 -1.5285547658079 -0.319983522035755
6 -1.4673290956901 0.523579231857175
7 0.24946312418273 9 -0.574046655188405
8 0.58126541455159 5 -0.406516495600345
9 1.5074477698981 -0.496512994239183
10 -2.2999905645658 8 -0.662667854341041
# ... with 90 more rows
The error you were getting was from the unz function, because it expects a full path to a zip file (and apparently won't accept an implicit working-directory location) as the "description" argument. Its second argument is the name of the compressed file inside the zip file. I think it returns a connection, but not of a type that read_fwf is able to process. Stepping through by hand, I see that the errors both of us got came from this section of code in read_connection:
> readr:::read_connection
function (con)
{
stopifnot(is.connection(con))
if (!isOpen(con)) {
open(con, "rb")
on.exit(close(con), add = TRUE)
}
read_connection_(con)
}
<environment: namespace:readr>
You didn't give unz a valid "description" argument, and even if we had, the attempt to open the connection with open(con, "rb") fails because of the lack of standardization of arguments across the various file-handling functions.
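One workaround, which is not part of the answer above, is to avoid the unz() connection entirely: extract the member to a temporary directory with unzip() and pass read_fwf() a plain file path. A sketch, assuming zipfileName names the archive and fileName the compressed file inside it (as in the question):
library(readr)
#Extract the compressed file to a temporary directory; unzip() returns the extracted path
tmp <- unzip(zipfileName, files = fileName, exdir = tempdir())
read_fwf(tmp, fwf_widths(colWidths, col_names = colNames))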
Following is the code that I've written for applying a moving average forecast to all the .csv files in a directory.
fileNames <- Sys.glob("*.csv")
for (fileName in fileNames) {
abc <- read.csv(fileName, header = TRUE, sep = ",")
nrows <- sapply(fileNames, function(f) nrow(read.csv(f)))
if (nrows>=as.vector(10)) {
library(stats)
library(graphics)
library(forecast)
library(TTR)
library(zoo)
library(tseries)
abc1 = abc[,1]
abc1 = t(t(abc1))
abc1 = as.vector(abc1)
abc2 = ts(abc1, frequency = 12,start = c(2014,1))
abc_decompose = decompose(abc2)
plot(abc_decompose)
forecast = (abc_decompose$trend)
x <- data.frame(abc, forecast)
write.csv (x, file = fileName, row.names=FALSE, col.names=TRUE)
}
}
Now when I exclude line 5, i.e. if (nrows >= as.vector(10)), the code works fine on files which have enough entries (I tested around 20 files, all having more than 10 rows).
But I have some csv files in the directory which contain 2 or fewer entries, so when the code runs on the whole directory it gives the following error message:
Error in decompose(abc2) : time series has no or less than 2 periods. As excluding those files manually is hard, I need something like line 5.
Now nrows gives me a list of all the file names in the directory with their number of rows, but when I run the whole code I get 148 warning messages (the directory has 148 csv files), each one saying:
In if (nrows >= as.vector(10)) { ... :
the condition has length > 1 and only the first element will be used, and I'm not getting the output.
So I'm definitely doing something wrong in line 5. Please help.
Change nrows <- sapply(fileNames, function(f) nrow(read.csv(f))) to:
nrows <- nrow(abc)
Why do you need to take the number of rows of all the files at each iteration? The warning is telling you what is going wrong: nrows is a vector, so the condition only uses the number of rows of the first file every time.
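In other words, a minimal sketch of the corrected loop (only the nrows line changes; everything else stays as in the question):
for (fileName in fileNames) {
  abc <- read.csv(fileName, header = TRUE, sep = ",")
  nrows <- nrow(abc)  # a single number, so the if() condition has length 1
  if (nrows >= 10) {
    #... decompose, forecast and write.csv as before ...
  }
}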
I have a simple for loop to write the past 100 tweets of a few usernames to .csv files:
library(twitteR)
mclist <- read.table('usernames.txt')
for (mc in mclist)
{
tweets <- userTimeline(mc, n = 100)
df <- do.call("rbind", lapply(tweets, as.data.frame))
write.csv(df, file=paste("Desktop/", mc, ".csv", sep = ""), row.names = F)
}
I mostly followed what I've read on StackOverflow but I continue to get this error message:
Error in file(file, ifelse(append, "a", "w")) :
invalid 'description' argument
In addition: Warning message:
In if (file == "") file <- stdout() else if (is.character(file)) { :
the condition has length > 1 and only the first element will be used
Where did I go wrong?
I just cleaned up the code a bit, and everything started working.
Step 1: Let's set the working directory and load the 'twitteR' package.
library(twitteR)
setwd("C:/Users/Dinre/Desktop") # Replace with your desired directory
Step 2: First, we need to load a list of user names from a flat text file. I'm assuming that each line in the text file has one username, like so:
[contents of usernames.txt]
edclef
notch
dkanaga
Let's load it using the 'scan' function to read each line into a character vector:
mclist <- scan("usernames.txt", what="", sep="\n")
Step 3: We'll loop through the usernames, just like you did before, but we're not going to refer to the directory, since we're going to use the same directory for output as for input. The original code had an error in the way it referred to the desktop directory, and we're just going to sidestep that.
for (mc in mclist){
tweets <- userTimeline(mc, n = 100)
df <- do.call("rbind", lapply(tweets, as.data.frame))
write.csv(df, file=paste(mc, ".csv", sep = ""), row.names = F)
}
I end up with three files on the desktop, and all the data seems to be correct.
edclef.csv
notch.csv
dkanaga.csv
Update: If you really want to refer to different directories within your code, use the '.' character to refer to the current working directory. For instance, if your working directory is your Windows user profile, you would refer to the 'Desktop' folder like so:
setwd("C:/Users/Dinre")
...
write.csv(df, file=paste("./Desktop/". mc, ".csv", sep = ""), row.names = F)
There's a convenience function in the package, twListToDF, which will handle the conversion of the list of tweets to a data.frame.
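Assuming the tweets object from the loop above, that convenience function would replace the do.call()/lapply() line; a minimal sketch:
df <- twListToDF(tweets)  # same result as do.call("rbind", lapply(tweets, as.data.frame))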
Since your mclist is a data.frame, you can replace your for loop with apply:
apply(mclist, 1, function(mc){
tweets <- userTimeline(mc, n = 100)
df <- do.call("rbind", lapply(tweets, as.data.frame))
write.csv(df, file=paste("Desktop/", mc, ".csv", sep = ""), ##!! Change Desktop to
## something like Desktop/tweets/
row.names = F)
})
PS:
The userTimeline function will only work if the requested user has a public timeline, or you have previously registered an OAuth object using registerTwitterOAuth.