creating a loop for "load" and "save" processes - r

I have a data.frame (dim: 100 x 1) containing a list of url links, each url looks something like this: https:blah-blah-blah.com/item/123/index.do .
The list (a data.frame called my_list, with 100 rows and a single character column named col, i.e. $ col: chr) looks like this:
1 "https:blah-blah-blah.com/item/123/index.do"
2 "https:blah-blah-blah.com/item/124/index.do"
3 "https:blah-blah-blah.com/item/125/index.do"
etc.
I am trying to import each of these URLs into R and save the results together as a single object that is suitable for text mining.
I know how to convert each of the URLs on the list manually:
library(pdftools)
library(tidytext)
library(textrank)
library(dplyr)
library(tm)
#1st document
url <- "https:blah-blah-blah.com/item/123/index.do"
article <- pdf_text(url)
Once this "article" file has been successfully created, I can inspect it:
str(article)
chr [1:13]
It looks like this:
[1] "abc ....."
[2] "def ..."
etc etc
[13] "ghi ..."
From here, I can successfully save this as an RDS file:
saveRDS(article, file = "article_1.rds")
Is there a way to do this for all 100 articles at the same time? Maybe with a loop?
Something like :
for (i in 1:100) {
url_i <- my_list[i,1]
article_i <- pdf_text(url_i)
saveRDS(article_i, file = "article_i.rds")
}
If this was written correctly, it would save each article as an RDS file (e.g. article_1.rds, article_2.rds, ... article_100.rds).
Would it then be possible to save all these articles into a single rds file?

Please note that list is not a good name for an object, as it masks the
name of the built-in list() function. I think it is usually good
to name your variables according to their content. Maybe url_df would be
a good name.
library(pdftools)
#> Using poppler version 20.09.0
library(tidyverse)
url_df <-
data.frame(
url = c(
"https://www.nimh.nih.gov/health/publications/autism-spectrum-disorder/19-mh-8084-autismspecdisordr_152236.pdf",
"https://www.nimh.nih.gov/health/publications/my-mental-health-do-i-need-help/20-mh-8134-mymentalhealth-508_161032.pdf"
)
)
Since the URLs are already in a data.frame, we could store the text data in
an additional column. That way the data will be easily available for later
steps.
text_df <-
url_df %>%
mutate(text = map(url, pdf_text))
Instead of saving each text in a separate file we can now store all of the data
in a single file:
saveRDS(text_df, "text_df.rds")
For historical reasons, for loops are not very popular in the R community.
Base R has the *apply() function family, which provides a functional
approach to iteration. The tidyverse has the purrr package and its map*()
functions, which improve upon the *apply() functions.
I recommend taking a look at
https://purrr.tidyverse.org/ to learn more.
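To make the comparison concrete, here is a minimal base R sketch of the same iteration written as a for loop, with lapply() and with vapply(). The urls vector and the use of toupper() as a stand-in for pdf_text() are assumptions made purely for illustration, so that the example runs without downloading anything.

```r
# Hypothetical inputs; toupper() stands in for pdf_text() so nothing is downloaded
urls <- c("a", "b", "c")

# for loop: preallocate the result, then fill it element by element
out_loop <- vector("list", length(urls))
for (i in seq_along(urls)) {
  out_loop[[i]] <- toupper(urls[i])
}

# lapply(): same result, the iteration is handled for you
out_lapply <- lapply(urls, toupper)

# vapply(): like lapply(), but checks that every result is a single string
out_vapply <- vapply(urls, toupper, character(1), USE.NAMES = FALSE)
```

All three produce the same values; the functional versions simply remove the bookkeeping around the index.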

It seems that certain URLs in your data do not point to valid PDF files. You can wrap the call in tryCatch to handle the errors. If your dataframe is called df and has a url column, you can do:
library(pdftools)
lapply(seq_along(df$url), function(x) {
  tryCatch({
    saveRDS(pdf_text(df$url[x]), file = sprintf('article_%d.rds', x))
  }, error = function(e) {}) # silently skip URLs that cannot be read as PDFs
})

So say you have a data.frame called my_df with a column that contains the URLs of your PDFs. Judging by your comments, it seems that some URLs lead to broken PDFs. You can use tryCatch in these cases to report back which links were broken and then check manually what's wrong with them.
You can do this in a for loop like this:
my_df <- data.frame(url = c(
"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf", # working pdf
"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pfd" # broken pdf
))
# make some useful new columns
my_df$id <- seq_along(my_df$url)
my_df$status <- NA
for (i in my_df$id) {
my_df$status[i] <- tryCatch({
message("downloading ", i) # put a status message on screen
article_i <- suppressMessages(pdftools::pdf_text(my_df$url[i]))
saveRDS(article_i, file = paste0("article_", i, ".rds"))
"OK"
}, error = function(e) {return("FAILED")}) # return the string FAILED if something goes wrong
}
my_df$status
#> [1] "OK" "FAILED"
I included a broken link in the example data on purpose to showcase how this would look.
Alternatively, you can use a loop from the apply family. The difference is that instead of iterating through a vector and applying the same code until the end of the vector, *apply takes a function, applies it to each element of a list (or objects which can be transformed to lists) and returns the results from each iteration in one go. Many people find *apply functions confusing at first because usually people define and apply functions in one line. Let's make the function more explicit:
s_download_pdf <- function(link, id) {
tryCatch({
message("downloading ", id) # put a status message on screen
article_i <- suppressMessages(pdftools::pdf_text(link))
saveRDS(article_i, file = paste0("article_", id, ".rds"))
"OK"
}, error = function(e) {return("FAILED")})
}
Now that we have this function, let's use it to download all files. I'm using mapply which iterates through two vectors at once, in this case the id and url columns:
my_df$status <- mapply(s_download_pdf, link = my_df$url, id = my_df$id)
my_df$status
#> [1] "OK" "FAILED"
I don't think it makes much of a difference which approach you choose as the speed will be bottlenecked by your internet connection instead of R. Just thought you might appreciate the comparison.
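On the question of getting everything into a single RDS file afterwards: one possible approach, sketched below, is to read the per-article files back into a named list and save that list. The temporary directory, the two small stand-in character vectors (used in place of real pdf_text() output) and the all_articles.rds file name are assumptions for illustration only.

```r
# Work in a temporary folder so the sketch does not touch real files
out_dir <- file.path(tempdir(), "articles_demo")
dir.create(out_dir, showWarnings = FALSE)

# Stand-ins for the character vectors that pdf_text() would return
saveRDS(c("page 1", "page 2"), file.path(out_dir, "article_1.rds"))
saveRDS(c("only page"), file.path(out_dir, "article_2.rds"))

# Find the per-article files, read each one, and name the list by file
files <- list.files(out_dir, pattern = "^article_[0-9]+\\.rds$", full.names = TRUE)
articles <- lapply(files, readRDS)
names(articles) <- sub("\\.rds$", "", basename(files))

# Save the whole list as one file and read it back
saveRDS(articles, file.path(out_dir, "all_articles.rds"))
restored <- readRDS(file.path(out_dir, "all_articles.rds"))
```

Because list.files() sorts names alphabetically (article_10 comes before article_2), the list is named by file so each article can still be matched to its id.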

Related

Looping over rows of self-reported job titles to gather publicly available data

Suppose I have a data frame with one case of self-reported job titles (this is done in R):
x <- data.frame("job.title" = c("psychologist"))
I'd like to have this job title entered into a search engine on a website (this part I can do) in order to have data on these jobs pulled into a data frame (this part I can also do).
The following function does this for me:
onet.sum <- function(x) {
obj1 <- as.list(ONETr::keySearch(x)) # enter self-reported job title into ONET's search engine
job.title <- obj1[["title"]][1] # pull best-matching title
soc.code <- obj1[["code"]][1] # pull best matching title's SOC code
obj4 <- as.data.frame(cbind(job.title,soc.code))
return(obj4)
}
However, once I add a second job title in a second row...
x <- data.frame("job.title" = c("psychologist", "social worker"))
...I get this system error that I'm not sure how to diagnose.
Space required after the Public Identifier
SystemLiteral " or ' expected
SYSTEM or PUBLIC, the URI is missing
Any advice?
UPDATE
So it turns out that there are two solutions that work if I pass job titles that do not contain spaces:
Using lapply(). Make sure that the job titles do not contain spaces.
So this works:
final_data <- lapply(c("psychologist","socialworker"), onet.sum) %>%
bind_rows
...but this doesn't work:
final_data <- lapply(c("psychologist","social worker"), onet.sum) %>%
bind_rows
Using purrr's map_df(), which is more flexible:
result <- purrr::map_df(gsub('\\s', '', x$job.title), onet.sum)
You can try with a lapply statement -
result <- do.call(rbind, lapply(x$job.title, onet.sum))
Using purrr::map_df might be shorter.
result <- purrr::map_df(x$job.title, onet.sum)
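Since the update suggests that spaces in the job title are what breaks the request, here is a small base R sketch of cleaning the titles before iterating. lookup_title() is a hypothetical stand-in for onet.sum() (which needs ONET access), and whether the real service accepts percent-encoded titles is an assumption that would need checking.

```r
# Hypothetical stand-in for onet.sum(): returns a one-row data frame per title
lookup_title <- function(title) {
  data.frame(job.title = title, n.chars = nchar(title), stringsAsFactors = FALSE)
}

x <- data.frame(job.title = c("psychologist", "social worker"),
                stringsAsFactors = FALSE)

# Option 1: strip all whitespace (note the doubled backslash inside an R string)
no_spaces <- gsub("\\s", "", x$job.title)

# Option 2: percent-encode the title instead of dropping the space
encoded <- vapply(x$job.title, utils::URLencode, character(1), USE.NAMES = FALSE)

# Apply the lookup to every cleaned title and stack the rows
result <- do.call(rbind, lapply(no_spaces, lookup_title))
```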

Nested Json with Different Attribute Names in R

I am playing with the Kaggle Star Trek Scripts dataset but I am struggling with converting from json to a dataframe in R. Ideally I would convert it in a long form dataset with index columns for episodes and characters with their lines on individual rows. I did find this answer, however it is not in R.
Currently the json looks like the photo below. Sorry it is not a full example, but I put a small mocked version below as well. If you want you can download the data yourselves from here.
Current JSON View
Mock Example
{
  "ENT": {
    "episode_0": {
      "KLAANG": {
        "\"Pung ghap! Pung ghap!\"": {},
        "\"DujDaj Hegh!\"": {}
      }
    },
    "episode_1": {
      "ARCHER": {
        "\"Warp me!\"": {},
        "\"To boldly go!\"": {}
      }
    }
  }
}
The issue I have is that the keys at the second level, the episodes, are individually numbered. So my regular bag of tricks for flattening by attribute name is not working. I am unsure how to loop through a level rather than an attribute name.
What I would ideally want is a long form data set that looks like this:
Series  Episode    Character  Lines
ENT     episode_0  KLAANG     Pung ghap! Pung ghap!
ENT     episode_0  KLAANG     DujDaj Hegh!
ENT     episode_1  ARCHER     Warp me!
ENT     episode_1  ARCHER     To boldly go!
My current code looks like the below, which is what I would normally start with, but it is obviously not working or not going far enough.
your_df <- result[["ENT"]] %>%
purrr::flatten() %>%
map_if(is_list, as_tibble) %>%
map_if(is_tibble, list) %>%
bind_cols()
I have also tried using stack() and map_dfr() but with no success. So I yet again come humbly to you, dear reader, for expertise. Json is the bane of my existence. I struggle with applying other answers to my circumstance, so any advice or examples I can reverse engineer and learn from are most appreciated.
Also happy to clarify or expand on anything if possible.
-Jake
So I was able to brute force it thanks to an answer from Michael on this thread called How to flatten a list of lists? so shout out to them.
The function allowed me to convert the JSON to a list of lists.
flattenlist <- function(x){
morelists <- sapply(x, function(xprime) class(xprime)[1]=="list")
out <- c(x[!morelists], unlist(x[morelists], recursive=FALSE))
if(sum(morelists)){
Recall(out)
}else{
return(out)
}
}
So putting it all together, I ended up with the following solution. Annotated for your entertainment.
library(jsonlite)
library(tidyverse)
library(dplyr)
library(data.table)
library(rjson)
result <- fromJSON(file = "C:/Users/jacob/Downloads/all_series_lines.json")
# Mike's function to get to a list of lists
flattenlist <- function(x){
morelists <- sapply(x, function(xprime) class(xprime)[1]=="list")
out <- c(x[!morelists], unlist(x[morelists], recursive=FALSE))
if(sum(morelists)){
Recall(out)
}else{
return(out)
}
}
# Mike's function applied
final<-as.data.frame(do.call("rbind", flattenlist(result)))
# Turn all the lists into a master data frame and ensure the index becomes a column I can separate later for context.
final <- cbind(Index_Name = rownames(final), final)
rownames(final) <- 1:nrow(final)
# So the output takes the final elements at the end of the JSON and makes those the variables in a dataframe so I need to force it back to a long form dataset.
final2<-gather(final,"key","value",-Index_Name)
# I separate each element of index name into my three mapping variables; Series,Episode and Character. I can also keep the original column names from above as script line id
final2$Episode<-gsub(".*\\.(.*)\\..*", "\\1", final2$Index_Name)
final2$Series<-substr(final2$Index_Name, start = 1, stop = 3)
final2$Character<-sub('.*\\.', "", final2$Index_Name) # keep only the text after the last dot (the character name)
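As an alternative to flattening and then parsing the index names, the structure can also be walked level by level using names(). The sketch below is only an illustration under assumptions: it uses a small hand-built list shaped like the mock example (series, then episode, then character, with the lines stored as the names of the innermost list), not the real Kaggle file, and the variable names scripts, rows and long_df are made up here.

```r
# Hand-built list mirroring the mock example (assumed shape):
# series -> episode -> character -> named list whose names are the lines
scripts <- list(
  ENT = list(
    episode_0 = list(
      KLAANG = list("Pung ghap! Pung ghap!" = list(), "DujDaj Hegh!" = list())
    ),
    episode_1 = list(
      ARCHER = list("Warp me!" = list(), "To boldly go!" = list())
    )
  )
)

# Loop over each level by name and collect one small data frame per character
rows <- list()
for (series in names(scripts)) {
  for (episode in names(scripts[[series]])) {
    for (character in names(scripts[[series]][[episode]])) {
      lines <- names(scripts[[series]][[episode]][[character]])
      if (length(lines) == 0) next  # skip characters without lines
      rows[[length(rows) + 1]] <- data.frame(
        Series = series, Episode = episode, Character = character,
        Lines = lines, stringsAsFactors = FALSE
      )
    }
  }
}

# Stack everything into one long-form data frame
long_df <- do.call(rbind, rows)
```

Looping over names() at each level is what lets the code ignore the fact that every episode key is different.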

Displaying data from a list in R without dynamically changing variable names

I'm writing some code in R that builds a list of data frames. While it runs, it needs to display each of the data frames it creates in a separate tab. The data frames and the list are both created by several nested for loops, along the lines of:
df.list <- vector("list", length(e))
i <- 1
for (...){
data <- as.data.frame(stuff)
j <- 1
for (...){
for (...){
[loop stuff]
data[j,] <- [more stuff]
}
}
df.list[[i]] <- data
i <- i + 1
}
The question is where to put the "View" function. If I add a second loop at the end that runs through the list and displays the data frames, then they all get named "df.list". If I put View(data) right before df.list[[i]] <- data then they all get named "data". Having them all have the same name is not an acceptable situation for this context. Ideally, I would be able to name them whatever string I want, but I would settle for anything that is reasonably understandable and distinguishable from the other data frames.
I know I can solve this by dynamically changing the variable name to be datai where i is the list index, but that's almost always the wrong way to do things.
I thought I'd never post an answer using eval(parse()), but it's the only way I can think to make this work:
# sample data
df.list = list(mtcars, iris)
# name your list however you want the tabs to be named
names(df.list) = c("mtcars data", "this is iris")
for (i in seq_along(df.list)) eval(parse(text = sprintf("View(df.list[['%s']])", names(df.list)[i])))
This might be what you meant by "dynamically changing the variable name to be datai where i is the list index", and I agree that it's almost always wrong. In this case it may also be by far the most expedient way to do it as well.
Posting the solution from the comments so I can close:
The View() function takes an optional title argument! View(data, title) will display data and use title as the name of the tab.
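Applied to a list of data frames, that could look like the sketch below. The sample list and the tab title format are assumptions for illustration, and the View() calls are wrapped in interactive() so the code also runs harmlessly in a non-interactive session.

```r
# Sample list of data frames (stand-ins for the ones built in the nested loops)
df.list <- list(mtcars = head(mtcars), iris = head(iris))

# Build whatever tab titles you like, one per data frame
tab_titles <- paste0("Tab ", seq_along(df.list), ": ", names(df.list))

# Open each data frame in its own viewer tab, labelled with its title
if (interactive()) {
  for (i in seq_along(df.list)) {
    View(df.list[[i]], title = tab_titles[i])
  }
}
```

The same call can go inside the original loop, right before df.list[[i]] <- data, with any string as the title.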

R: subset() function altered character data into strange code

i read some data into R with the read.xlsx() in openxlsx package, and here's my code for reading the data:
data_all = read.xlsx(xlsxFile = paste0(path, EoLfileName), sheet = 1, detectDates = T, skipEmptyRows = F)
now, when i access one name cell in my data, it will print the name in characters:
> data_all[1,'name']
[1] "76-ES+ADVIP-20G"
now, let's say i want to subset out some rows based on a condition on another column:
data_sub = subset(data_all, !is.na(data_all$amount))
however, then if i print this subset data, i'd get:
> data_sub[1,'name']
[1] "A94198.10"
i've also tried to do subsetting using the following method:
data_sub = data_all[!is.na(data_all$amount),]
but i get the same thing: the expected output of "76-ES+ADVIP-20G" would be turned into "A94198.10"
I've checked many times with mode() and str() for data_all$name and data_sub$name, both return character, so they are in correct format.
here's a link to sample data to play with:
https://drive.google.com/file/d/0BwIbultIWxeVY1VtdDU5NFp1Tkk/view?usp=sharing
Please please help me! I am quite stuck, and i dont see other posts with similar problem.
Why is this happening? Subsetting shouldn't change data formatting, correct?
Thank you in advance for your help!
additional note (if its helpful):
so when i tried to debug, i noticed that, when i was viewing the data_all in RStudio, and if i copy and paste the name "76-ES+ADVIP-20G" into the filter bar, it actually cannot find it; i'd have to type in "76-ES" and as soon as i type in the next character which is "+", RStudio data view filter would say "no matching records found"
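One thing that may be worth ruling out (this is an assumption, since the sample file is not reproduced here): subsetting drops rows, so row 1 of data_sub is not necessarily row 1 of data_all. The sketch below uses a small made-up data frame, with the two names from the question and one invented name, to show how the first row of the subset can legitimately hold a different name.

```r
# Made-up data: the first row has a missing amount, so it is dropped by the subset
data_all <- data.frame(
  name = c("76-ES+ADVIP-20G", "A94198.10", "B-200"),
  amount = c(NA, 5, 7),
  stringsAsFactors = FALSE
)

data_sub <- subset(data_all, !is.na(amount))

first_name <- data_sub[1, "name"]   # first remaining row, i.e. original row 2
kept_rows <- rownames(data_sub)     # original row numbers survive as row names
still_there <- "76-ES+ADVIP-20G" %in% data_sub$name  # was the name filtered out?
```

Comparing rownames(data_sub) with the original row numbers shows whether the values changed or whether the rows were simply renumbered by position.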

r-project create a data frame function and probably use *apply somewhere too

trying to create a function that looks up a bunch of CSV files in a directory and then, taking the file ID as an argument, outputs a table (actually data frame - new to R language) where there are 2 columns, one titled ID for the corresponding id parameter and the second column will be the count of rows in that file.
The files are all titled 001.csv - 332.csv
e.g. output would look like column title: ID, first record: 001 (derived from 001.csv), second column: title "count of rows", first record
The function looks like so: myfunction(directory,id)
Directory is the folder where the csv files are, and id can be a single number (e.g. 1, 9 or 100) or a vector such as 200:300.
In the latter case, 200:300, the output would be a table with 101 rows, where the 1st row would be 200 with, say, 10 rows of data within it.
So far:
complete <- function(directory,id = 1:332) {
# create an object to help read the appropriate csv files later in the function
csvfilespath <- sprintf("/Users/gcameron/Desktop/%s/%03d.csv", directory, id)
colID <- sprintf('%03d', id)
# now, how do I tell R to create a table with 2 columns titled ID and countrows?
# Now, how would I take each instance of an ID and add to this table the id and count of rows in each?
}
I apologize if this seems really basic. The tutorial I'm on moves fast and I have watched each video lecture and done a fair amount of research too.
SO is by far my favourite resource and I learn better by using it. Perhaps because it's personalised and directly applicable to my immediate tasks. I hope my questions also benefit others who are learning R.
BASED ON FEEDBACK BELOW
I now have the following script:
complete <- function(directory,id = 1:332) {
csvfiles <- sprintf("/Users/gcameron/Desktop/%s/%03d.csv", directory, id)
nrows <- sapply( csvfiles, function(f) nrow(read.csv(f)))
data.frame(ID=id, countrows=sapply(csvfiles,function(x) length(count.fields(x)))
}
Does this look like I'm on the right track?
I'm receiving an error "Error: unexpected '}' in:
"data.frame(ID=id, countrows=sapply(csvfiles,function(x) length(count.fields(x)))
}"
I cannot see where the extra "}" is coming from?
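There is no extra "}": the data.frame() call is missing its closing parenthesis, so R reaches the "}" while it is still inside the call. With the parenthesis added:
data.frame(ID=id, countrows=sapply(csvfiles, function(x) length(count.fields(x))))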
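For reference, here is one way the whole function could look once the parenthesis is fixed. This is a sketch under assumptions: directory is treated as a full path rather than a folder name under the Desktop, rows are counted with nrow(read.csv()) so the header line is not included, and the two small CSV files in a temporary folder are made up just to exercise it.

```r
complete <- function(directory, id = 1:332) {
  # Build the file paths, e.g. <directory>/001.csv
  csvfiles <- file.path(directory, sprintf("%03d.csv", id))
  # Count the data rows in each file (the header line is not counted)
  countrows <- vapply(csvfiles, function(f) nrow(read.csv(f)), integer(1))
  data.frame(ID = id, countrows = unname(countrows))
}

# Made-up demo files in a temporary folder
demo_dir <- file.path(tempdir(), "complete_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(a = 1:3), file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(a = 1:5), file.path(demo_dir, "002.csv"), row.names = FALSE)

res <- complete(demo_dir, id = 1:2)
```

Note that length(count.fields(x)) counts every line in the file, including the header, so it can come out one higher than nrow(read.csv(x)).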

Resources