Reading a subset of files within a folder using R

I'm quite new to R and am looking to build an R script that takes a csv file containing 3 elements:
Id
Type
Filename
The contents of the DataFrame look something like:
14261336 5 Test1.xml
16767594 8 Test2.xml
13601470 7 Test3.xml
12963658 5 Test4.xml
17771952 6 Test5.xml
I've tried to use the following code to get the filenames, and then use these to be able to parse the XML, but I seem to be hitting a bit of a wall (down to my inexperience with R):
headerNames <- c('Id','Type','Filename')
GetNames <- read.csv(file= 'c:/temp/XML/myXMLFiles.csv', header = FALSE, col.names = headerNames) #
list(c(GetNames[3])) %>%
map(read_xml)
The outcome is that I get the message:
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "list"
Can one of you experts point me in the right direction please?
Many Thanks

You can normally only read data from a character string (a file path), not a list. To read the XML you can also use xmlParse() from the XML package:
library(XML) # install it with install.packages("XML") if needed
files_inp <- as.character(GetNames[,3]) # you will need the filenames as character
for (f in files_inp) {
assign(paste0("file",f), xmlParse(file = f)) # I never read XML files, but that should work! :-)
}
Your output should be variables named after each file, e.g. fileTest1.xml, fileTest2.xml, and so on.
Hope that helps!
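Alternatively, here is a minimal sketch of how the read_xml()/map() approach from the question could be made to work, assuming the xml2 and purrr packages and that the XML files live in c:/temp/XML/ alongside the csv:
library(xml2)  # provides read_xml()
library(purrr) # provides map()

# Build full paths from the Filename column and parse each file;
# read_xml() expects a single path, so map() applies it to each element.
xml_docs <- map(
  file.path("c:/temp/XML", as.character(GetNames$Filename)),
  read_xml
)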

Related

creating a loop for "load" and "save" processes

I have a data.frame (dim: 100 x 1) containing a list of url links, each url looks something like this: https:blah-blah-blah.com/item/123/index.do .
The list (a data.frame called my_list with 100 rows and a single column named col, stored as character: $ col: chr) looks like this:
1 "https:blah-blah-blah.com/item/123/index.do"
2" https:blah-blah-blah.com/item/124/index.do"
3 "https:blah-blah-blah.com/item/125/index.do"
etc.
I am trying to import each of these url's into R and collectively save the object as an object that is compatible for text mining procedures.
I know how to successfully convert each of these url's (that are on the list) manually:
library(pdftools)
library(tidytext)
library(textrank)
library(dplyr)
library(tm)
#1st document
url <- "https:blah-blah-blah.com/item/123/index.do"
article <- pdf_text(url)
Once this "article" file has been successfully created, I can inspect it:
str(article)
chr [1:13]
It looks like this:
[1] "abc ....."
[2] "def ..."
etc etc
[15] "ghi ...:
From here, I can successfully save this as an RDS file:
saveRDS(article, file = "article_1.rds")
Is there a way to do this for all 100 articles at the same time? Maybe with a loop?
Something like :
for (i in 1:100) {
url_i <- my_list[i,1]
article_i <- pdf_text(url_i)
saveRDS(article_i, file = "article_i.rds")
}
If this was written correctly, it would save each article as an RDS file (e.g. article_1.rds, article_2.rds, ... article_100.rds).
Would it then be possible to save all these articles into a single rds file?
Please note that list is not a good name for an object, as this will
temporarily overwrite the list() function. I think it is usually good
to name your variables according to their content. Maybe url_df would be
a good name.
library(pdftools)
#> Using poppler version 20.09.0
library(tidyverse)
url_df <-
data.frame(
url = c(
"https://www.nimh.nih.gov/health/publications/autism-spectrum-disorder/19-mh-8084-autismspecdisordr_152236.pdf",
"https://www.nimh.nih.gov/health/publications/my-mental-health-do-i-need-help/20-mh-8134-mymentalhealth-508_161032.pdf"
)
)
Since the urls are already in a data.frame we could store the text data in
an additional column. That way the data will be easily available for later
steps.
text_df <-
url_df %>%
mutate(text = map(url, pdf_text))
Instead of saving each text in a separate file we can now store all of the data
in a single file:
saveRDS(text_df, "text_df.rds")
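Reading everything back in later is then a single call, for example:
text_df <- readRDS("text_df.rds")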
For historical reasons, for loops are not very popular in the R community.
base R has the *apply() function family that provides a functional
approach to iteration. The tidyverse has the purrr package and the map*()
functions that improve upon the *apply() functions.
I recommend taking a look at
https://purrr.tidyverse.org/ to learn more.
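If you do want one .rds file per article, as in the question, a purrr-style version might look like this. It is only a sketch built on the text_df from above, using walk2(), purrr's map2() variant for side effects:
library(purrr)

# Save each parsed article to its own numbered .rds file.
walk2(
  text_df$text,
  seq_along(text_df$text),
  ~ saveRDS(.x, file = paste0("article_", .y, ".rds"))
)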
It seems that certain urls in your data are not valid pdf files. You can wrap the call in tryCatch to handle the errors. If your dataframe is called df with a url column in it, you can do:
library(pdftools)
lapply(seq_along(df$url), function(x) {
tryCatch({
saveRDS(pdf_text(df$url[x]), file = sprintf('article_%d.rds', x))
}, error = function(e) {})
})
Say you have a data.frame called my_df with a column that contains the URLs of the pdf locations. As per your comments, it seems that some URLs lead to broken PDFs. You can use tryCatch in these cases to report back which links were broken and check manually what's wrong with them.
You can do this in a for loop like this:
my_df <- data.frame(url = c(
"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf", # working pdf
"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pfd" # broken pdf
))
# make some useful new columns
my_df$id <- seq_along(my_df$url)
my_df$status <- NA
for (i in my_df$id) {
my_df$status[i] <- tryCatch({
message("downloading ", i) # put a status message on screen
article_i <- suppressMessages(pdftools::pdf_text(my_df$url[i]))
saveRDS(article_i, file = paste0("article_", i, ".rds"))
"OK"
}, error = function(e) {return("FAILED")}) # return the string FAILED if something goes wrong
}
my_df$status
#> [1] "OK" "FAILED"
I included a broken link in the example data on purpose to showcase how this would look.
Alternatively, you can use a loop from the apply family. The difference is that instead of iterating through a vector and applying the same code until the end of the vector, *apply takes a function, applies it to each element of a list (or objects which can be transformed to lists) and returns the results from each iteration in one go. Many people find *apply functions confusing at first because usually people define and apply functions in one line. Let's make the function more explicit:
s_download_pdf <- function(link, id) {
tryCatch({
message("downloading ", id) # put a status message on screen
article_i <- suppressMessages(pdftools::pdf_text(link))
saveRDS(article_i, file = paste0("article_", id, ".rds"))
"OK"
}, error = function(e) {return("FAILED")})
}
Now that we have this function, let's use it to download all files. I'm using mapply which iterates through two vectors at once, in this case the id and url columns:
my_df$status <- mapply(s_download_pdf, link = my_df$url, id = my_df$id)
my_df$status
#> [1] "OK" "FAILED"
I don't think it makes much of a difference which approach you choose as the speed will be bottlenecked by your internet connection instead of R. Just thought you might appreciate the comparison.

Large file processing - error using chunked::read_csv_chunked with dplyr::filter

When using the function chunked::read_csv_chunked and dplyr::filter in a pipe, I get an error every time the filter returns an empty dataset on any of the chunks. In other words, this occurs when all the rows from a given chunk of the dataset are filtered out.
Here is a modified example, drawn from the package chunked help file:
library(chunked); library(dplyr)
# create csv file for demo purpose
in_file <- file.path(tempdir(), "in.csv")
write.csv(women, in_file, row.names = FALSE, quote = FALSE)
# reading chunkwise and filtering
women_chunked <-
read_chunkwise(in_file, chunk_size = 3) %>% #read only a few lines for the purpose of this example
filter(height > 150) # This basically filters out most lines of the dataset,
# so for instance the first chunk (first 3 rows) should return an empty table
# Trying to read the output returns an error message
women_chunked
# >Error in UseMethod("groups") :
# >no applicable method for 'groups' applied to an object of class "NULL"
# As does of course trying to write the output to a file
out_file <- file.path(tempdir(), "processed.csv")
women_chunked %>%
write_chunkwise(file=out_file)
# >Error in read.table(con, nrows = nrows, sep = sep, dec = dec, header = header, :
# >first five rows are empty: giving up
I am working on many csv files, each with 50 million rows, and will thus often end up in a similar situation where the filtering returns (at least for some chunks) an empty table.
I couldn't find a solution or any post related to this problem. Any suggestions?
I do not think the sessionInfo output is useful in this case, but please let me know if I should post it anyway. Thanks a lot for any help!
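One possible workaround, not from the original thread and assuming readr is an acceptable substitute for chunked, is readr::read_csv_chunked(), whose DataFrameCallback simply drops empty chunks when combining the results:
library(readr)
library(dplyr)

# Filter each 3-row chunk; chunks where nothing survives the filter
# contribute zero rows to the combined result instead of erroring.
women_filtered <- read_csv_chunked(
  in_file,
  callback = DataFrameCallback$new(function(chunk, pos) filter(chunk, height > 150)),
  chunk_size = 3
)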

How to load multiple JSON files into a quanteda corpus using readtext?

I'm trying to load a large number of JSON files from a news website into a quanteda corpus using readtext. To simplify the process, the JSON files are all in the working directory. But I have also tried them in their own directory.
When using c() to create a variable that explicitly defines a small subset of files, readtext works as hoped and a corpus is properly created with corpus().
When attempting to create a variable using list.files() to list all of the 1500+ JSON files, readtext does not work as hoped, errors are returned, and a corpus is not created.
I tried to inspect the results of the two methods of defining the set of texts (i.e. c() and list.files()) as well as paste0().
# Load libraries
library(readtext)
library(quanteda)
# Define a set of texts explicitly
a <- c("border_2020_05_10__1589150513.json","border_2020_05_10__1589143358.json","border_2020_05_07__1589170960.json")
# This produces a corpus
extracted_texts <- readtext(a, text_field = "maintext")
my_corpus <- corpus(extracted_texts)
# Define a set of all texts in working directory
b <- list.files(pattern = "*.json", full.names = F)
# This, which I hope to use, produces an error
extracted_texts <- readtext(b, text_field = "maintext")
my_corpus <- corpus(extracted_texts)
The error produced by extracted_texts <- readtext(b, text_field = "maintext") is as follows
File doesn't contain a single valid JSON object.
Error: This JSON file format is not supported.
This is perplexing because the same files called with a do not produce an error. I validated several of the JSON files which in every case returned VALID (RFC 8259), the IETF standard for JSON.
Inspecting the differences between a and b:
typeof() returns "character" for both a and b.
is.vector() and is.atomic() return TRUE for both.
is.list() returns FALSE for both.
they look similar in RStudio and when called in the console
I'm really confused why a works and b does not.
Lastly, attempting to exactly mimic the procedure from the readtext documentation, the following was also tried:
# XXXX = my username
data_dir <- file.path("C:/Users/XXXX/Documents/R/")
d <- readtext(paste0(data_dir, "/corpus_linguistics/*.json"), text_field = "maintext")
This also returned the error
File doesn't contain a single valid JSON object.
Error: This JSON file format is not supported.
At this point I'm stumped. Thanks in advance for any insight on how to move forward.
Solution and Summary
Unclean Data: A few of the input JSON files have a null maintext field. These are not useful for analysis and should be removed. All of the files contain a JSON field called "title_rss" that is null. This can be eliminated through a directory-level find and replace with Notepad++, or probably with R or Python, though I still lack the skills for this (one possible R approach is sketched after the code below). Additionally, the files were not in UTF-8 encoding; that was resolved with Codepage Converter.
Method to call directory string: The list.files() method is employed in the readtext How to Use documentation and several third-party tutorials. This method works with *.txt files, but for some reason it does not seem to work with these particular JSON files. Once the JSON files are properly cleaned and encoded, the method below works without errors. If the data_dir is wrapped in a list.files() call it produces the following error:
Error in list_files(file, ignore_missing, TRUE, verbosity) : File '' does not exist.
I'm not sure why that is, but leaving it out works for these JSON files.
# Load libraries
library(readtext)
library(quanteda)
# Define a set of texts explicitly
data_dir <- "C:/Users/Nathan/Documents/R/corpus_linguistics/"
extracted_texts <- readtext(paste0(data_dir, "texts_unmodified/*.json"), text_field = "maintext", verbosity = 3)
my_corpus <- corpus(extracted_texts)
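As an aside, the title_rss cleanup mentioned above could likely also be done in R rather than Notepad++. A rough sketch with jsonlite, reusing the data_dir and texts_unmodified paths from above:
library(jsonlite)

json_files <- list.files(paste0(data_dir, "texts_unmodified"),
                         pattern = "\\.json$", full.names = TRUE)

for (f in json_files) {
  doc <- fromJSON(f)            # parse the file into a named list
  doc$title_rss <- NULL         # drop the unwanted null field
  write_json(doc, f, auto_unbox = TRUE, pretty = TRUE)  # overwrite in place (UTF-8)
}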
Test with unmodified files, one known to have empty fields
Input: 5 files, 4 without an empty or null text_field and 1 with a null text field. In addition, all of the files have Western European (Windows) 1252 encoding.
Errors:
Reading texts from C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/*.json
, using glob pattern
... reading (json) file: C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/border_2014_02_17__1589147645.json
File doesn't contain a single valid JSON object.
... reading (json) file: C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/border_2014_03_13__1589150325.json
File doesn't contain a single valid JSON object.
Column 14 ['maintext'] of item 1 is length 0. This (and 0 others like it) has been filled with NA (NULL for list columns) to make each item uniform. ... read 5 documents.
Result: a properly formed corpus consisting of 5 documents. One document lacks either tokens or types. The corpus seems to build properly despite the errors. Perhaps some special characters don't display properly because of the encoding issue. I was not able to check this.
Test with cleaned files known to have no empty fields
Input files: 4 files that have no empty or null JSON fields. In all cases, text_field contains text and the title_rss field was removed. Each of the files was converted from Western European (Windows) 1252 into Unicode UTF-8-65001.
Errors: NONE!
Result: A properly formed corpus.
Many thanks to the two developers for detailed feedback and useful leads. The assistance is deeply appreciated.
There are a few possibilities here, but the most likely are:
One of your files has a malformed JSON structure, from the point of view of readtext(). Even though the file might be acceptable as far as the JSON format goes, if one of your text fields is empty, for instance, then this will cause the error. (See below for a demonstration and a solution.)
While readtext() can take a "glob" pattern match, list.files() takes a regular expression. It's possible (but unlikely) that you are then picking up something you don't want in list.files(pattern = "*.json", ...). But this should not be necessary with readtext() -- see below.
To demonstrate, let's write out each document in data_corpus_inaugural as a separate JSON file, and then read them in using readtext().
library("quanteda", warn.conflicts = FALSE)
## Package version: 2.0.1
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
tmpdir <- tempdir()
corpdf <- convert(data_corpus_inaugural, to = "data.frame")
for (d in corpdf$doc_id) {
cat(jsonlite::toJSON(dplyr::filter(corpdf, doc_id == d)),
file = paste0(tmpdir, "/", d, ".json")
)
}
head(list.files(tmpdir))
## [1] "1789-Washington.json" "1793-Washington.json" "1797-Adams.json"
## [4] "1801-Jefferson.json" "1805-Jefferson.json" "1809-Madison.json"
To read them in, you can use the "glob" pattern match here and just read the JSON files.
rt <- readtext::readtext(paste0(tmpdir, "/*.json"),
text_field = "text", docid_field = "doc_id"
)
summary(corpus(rt), n = 5)
## Corpus consisting of 58 documents, showing 5 documents:
##
## Text Types Tokens Sentences Year President FirstName
## 1789-Washington.json 625 1537 23 1789 Washington George
## 1793-Washington.json 96 147 4 1793 Washington George
## 1797-Adams.json 826 2577 37 1797 Adams John
## 1801-Jefferson.json 717 1923 41 1801 Jefferson Thomas
## 1805-Jefferson.json 804 2380 45 1805 Jefferson Thomas
## Party
## none
## none
## Federalist
## Democratic-Republican
## Democratic-Republican
So that all worked fine.
But if we add to this one file whose text field is empty, then this produces the error in question:
cat('[ { "doc_id" : "d1", "text" : "this is a file" },
{ "doc_id" : "d2", "text" : } ]',
file = paste0(tmpdir, "/badfile.json")
)
rt <- readtext::readtext(paste0(tmpdir, "/*.json"),
text_field = "text", docid_field = "doc_id"
)
## File doesn't contain a single valid JSON object.
## Error: This JSON file format is not supported.
True, that was not a valid JSON file, since it contained a key with no value. But I suspect you have something like that in one of your files.
Here's how you can identify the problem: loop through your b (from the question, not as I've specified it below).
b <- tail(list.files(tmpdir, pattern = ".*\\.json", full.names = TRUE))
for (f in b) {
cat("Reading:", f, "\n")
rt <- readtext::readtext(f, text_field = "text", docid_field = "doc_id")
}
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2001-Bush.json
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2005-Bush.json
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2009-Obama.json
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2013-Obama.json
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2017-Trump.json
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/badfile.json
## File doesn't contain a single valid JSON object.
## Error: This JSON file format is not supported.

How to get a vector of the file names contained in a tempfile in R?

I am trying to automatically download a bunch of zipfiles using R. These archives contain a wide variety of files; I only need to load one as a data.frame to post-process it. It has a unique name, so I could catch it with str_detect(). However, using tempfile(), I cannot get a list of all files within the archive using list.files().
This is what I've tried so far:
temp <- tempfile()
download.file("https://url/file.zip", destfile = temp)
files <- list.files(temp) # this is where I only get "character(0)"
# After, I'd like to use something along the lines of:
data <- read.table(unz(temp, str_detect(files, "^file123.txt"), header = TRUE, sep = ";")
unlink(temp)
I know that the read.table() command probably won't work, but I think I'll be able to figure that out once I get a vector with the list of the files within temp.
I am on a Windows 7 machine and I am using R 3.6.0.
Following what was said before, this structure lets you download the archive to a temporary file and check its contents. Note that list.files() only lists a directory; to see what is inside the zip itself you need unzip(..., list = TRUE):
temp <- tempfile(fileext = ".zip")
download.file("https://url/file.zip", destfile = temp)
files <- unzip(temp, list = TRUE)$Name
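From there, a sketch of reading only the file of interest (the file name and the separator are placeholders taken from the question):
library(stringr)

# Pick the archive member whose name matches the file we want.
target <- files[str_detect(files, "^file123\\.txt$")]

# unz() opens a connection to a single file inside the zip without extracting everything.
data <- read.table(unz(temp, target), header = TRUE, sep = ";")
unlink(temp)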

Batch reading compressed CSV files in R

Newbie here. I have 1000 compressed CSV files that I need to read and row bind. My problem is similar to this one, but with two differences:
a) File names are of different lengths and not sequential, in this form:
"members_[name of company]_[state code].csv"`
I have two vectors, company and states with the required codes. So, I've built a vector of all the files I need with this code:
combinations <- expand.grid(company, states)
csvfiles <- paste0("members_" ,
combinations$Var1, "_",
combinations$Var2,".csv" )
so it has all the filenames I need (20 companies X 50 states). But I am lost as to how to cycle through all zip files. There are 10 other CSVs inside those zip files, but I only need the ones described above.
b) When decompressed, the files expand to a directory structure such as this:
/files/member_database/members/state/members_[name of company]_[state code].csv
but when I try to read the CSV from the zip file using
data <- read.csv(unz("members_GE_FL.zip", "members_GE_FL.csv"), header=F, sep=":")
it returns the 'cannot open connection' message. Adding the path such as ./files/member_database/members/state/members_GE_FL.csv doesn't work either.
I'm also not sure whether read.csv(unz(csvfiles, ...)) would read the names in my csvfiles, or whether the failure above is due to the path or to the command being wrong altogether.
Any help is appreciated -- insights, docs I should look at, etc. Again, I'm NOT trying to get people to do my work. As I type, I have 37 tabs open (many from SO), and have already spent 22 hours on this thing alone. I've learned from this post and others how to read a file within a ZIP, and from this post how to extract and import data. Still, I can't piece it all together. I've only started with R a few months ago, and have no prior experience as a programmer.
I suspect all that was missing was the correct path to the file in the archive: neither "members_GE_FL.csv" nor "./files/member_database/members/state/members_GE_FL.csv" will work.
But "files/member_database/members/state/members_GE_FL.csv" (without the initial dot) should.
For the sake of completeness, here is a complete example:
Let's create some dummy data, three files named out-1.csv, out-2.csv, out-3.csv and zip them in dummy-archive.zip:
if (!dir.exists("data")) dir.create("data")
if (!dir.exists("data/dummy-files")) dir.create("data/dummy-files")
for (i in 1:3)
write.csv(data.frame(foo = 1:2, bar = 7:8), paste0("data/dummy-files/out-", i, ".csv"), row.names = FALSE)
zip("data/dummy-archive.zip", "data/dummy-files")
Now let's assume we're looking for 3 other files, two of which are in the archive, one is not:
files_to_find <- c("out-2.csv", "out-3.csv", "out-4.csv")
List the files in the archive, and name them for the sake of clarity:
files_in_archive <- unzip("data/dummy-archive.zip", list = TRUE)$Name
files_in_archive <- setNames(files_in_archive, basename(files_in_archive))
# dummy-files out-2.csv
# "data/dummy-files/" "data/dummy-files/out-2.csv"
# out-3.csv out-1.csv
# "data/dummy-files/out-3.csv" "data/dummy-files/out-1.csv"
Find the indices of files we're looking for in the archive, and read them like you intended to (with read.csv(unz(....))):
i <- basename(files_in_archive) %in% files_to_find
res <- lapply(files_in_archive[i], function(f) read.csv(unz("data/dummy-archive.zip", f)))
# $`out-2.csv`
# foo bar
# 1 1 7
# 2 2 8
#
# $`out-3.csv`
# foo bar
# 1 1 7
# 2 2 8
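Since the original goal was to row-bind everything, the resulting list can then be combined in one step, for instance with dplyr (base do.call(rbind, res) works too):
library(dplyr)

# Stack the per-file data.frames, keeping the source file name as a column.
members_all <- bind_rows(res, .id = "source_file")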
Clean-up:
unlink(c("data/dummy-files/", "data/dummy-archive.zip"), recursive = TRUE)
