Naming of downloaded pdf-files does not work - r

When downloading many (48) pdf-files the nameing using str_match(myurl, "UniqueID=(.+)) fails. I see that the downloads goes fine but name does not work and when its done I have only one file named "NA".
I am downloading a number of pdfs from a UN organisation database. This is going fine as I see that all files being downloaded. However, all files naming goes wrong and in the end I have only one file called "NA".
library(downloader)
library(stringr)
for (myurl in pdfscollect) {
filename<-paste("collected/", str_match(myurl, "UniqueID=(.+)")[2], ".pdf", sep="")
download(myurl, filename)
Sys.sleep(2)
}
I would expect all pdfs being named uniquely, but no naming happens and only one file in the end with "NA".
pdfscollect is file with all links. Example:
pdfstest<-c("http://www.ilo.org/evalinfo/product/download.do;?type=document&id=8287", "http://www.ilo.org/evalinfo/product/download.do;?type=document&id=10523",….)

If I understand correctly (?) the problem is that
paste("collected/", str_match(myurl, "UniqueID=(.+)")[2]
is returning a vector of NA when you are expecting the document ids:
[1] "8287" "10523"
I suggest using instead something like the following (which does get the expected output):
str_extract(pdfstest, "(?<=id=)\\d+")
Here we use regular expressions to match any number of digits that follow immediately after the first id= of the urls in your vector.

Thanks for the suggestion, #sindri_baldur. Acutally the result turns out the same, except that the name of the pdf file changes. I cannot open the pdf-file it either, I realise now. I think some of the problem is that the pdf-link is an "..download.do..." link (ilo.org/evalinfo/product/download.do;?type=document&id=8287). I guess I should fine another way to collect these pdfs.

Related

Unable to use here() to get to the correct project directory

I have this weird problem. I was able to get here() to work but it stopped working and I can't figure out how to fix it.
So basically the structure of my file is like:
C:/First/Second/Third/Analysis/Scripts
C:/First/Second/Third/Analysis/Data
I want to easily bounce between data and scripts in the code
If I type here("C:/First/")
and then follow up with here(), R says that I'm at
C:/First/Documents
I'm not able to type here("Second", "Third", "Analysis", "Data", "todayscode.R")
because it puts me in the folder:
C:/First/Documents/Second/Third/Analysis/Data/
Which obviously does not exist.
It's unclear why you say that C:/First/Documents/Second/Third/Analysis/Data/ "does not exist". You will need to create a subdirectory if it doesn't exist, but at the beginning of your problem presentation you implied that was an existing directory.
Sounds like you want something like this possibilities:
setwd("C:/First/Second/Third/Analysis/")
You should first clarify where your version of the here function is coming from by typing here at the console input line and noting which namespace it comes from. If you then wanted to refer to a directory with that starting point you could use this (using the example set up in the here package at the here help page):
library(here); library(readr)
readr::read_csv(
here("data", "penguins.csv"),
col_types = list(.default = readr::col_guess()),
n_max = 3
)

How to Read Data from .rda with read.table [duplicate]

I am trying to load an .rda file in r which was a saved dataframe. I do not remember the name of it though.
I have tried
a<-load("al.rda")
which then does not let me do anything with a. I get the error
Error:object 'a' not found
I have also tried to use the = sign.
How do I load this .rda file so I can use it?
I restared R with load("al.rda) and I know get the following error
Error: C stack usage is too close to the limit
Use 'attach' and then 'ls' with a name argument. Something like:
attach("al.rda")
ls("file:al.rda")
The data file is now on your search path in position 2, most likely. Do:
search()
ls(pos=2)
for enlightenment. Typing the name of any object saved in al.rda will now get it, unless you have something in search path position 1, but R will probably warn you with some message about a thing masking another thing if there is.
However I now suspect you've saved nothing in your RData file. Two reasons:
You say you don't get an error message
load says there's nothing loaded
I can duplicate this situation. If you do save(file="foo.RData") then you'll get an empty RData file - what you probably meant to do was save.image(file="foo.RData") which saves all your objects.
How big is this .rda file of yours? If its under 100 bytes (my empty RData files are 42 bytes long) then I suspect that's what's happened.
I had to reinstall R...somehow it was corrupt. The simple command which I expected of
load("al.rda")
finally worked.
I had a similar issue, and it was solved without reinstall R. for example doing
load("al.rda) works fine, however if you do
a <- load("al.rda") will not work.
The load function does return the list of variables that it loaded. I suspect you actually get an error when you load "al.rda". What exactly does R output when you load?
Example of how it should work:
d <- data.frame(a=11:13, b=letters[1:3])
save(d, file='foo.rda')
a <- load('foo.rda')
a # prints "d"
Just to be sure, check that the load function you actually call is the original one:
find("load") # should print "package:base"
EDIT Since you now get an error when you load the file, it is probably corrupt in some way. Try this and say what it prints:
file.info("a1.rda") # Prints the file size etc...
readBin("a1.rda", "raw", 50) # reads first 50 bytes from the file
Without having access to the file, it's hard to investigate more... Maybe you could share the file somehow (http://www.filedropper.com or similar)?
I usually use save to save only a single object, and I then use the following utility method to retrieve that object into a given variable name using load, but into a temporary namespace to avoid overwriting existing objects. Maybe it will be helpful for others as well:
load_first_object <- function(fname){
e <- new.env(parent = parent.frame())
load(fname, e)
return(e[[ls(e)[1]]])
}
The method can of course be extended to also return named objects and lists of objects, but this simple version is for me the most useful.

Why can I only read one .json file at a time?

I have 500+ .json files that I am trying to get a specific element out of. I cannot figure out why I cannot read more than one at a time..
This works:
library (jsonlite)
files<-list.files(‘~/JSON’)
file1<-fromJSON(readLines(‘~/JSON/file1.json),flatten=TRUE)
result<-as.data.frame(source=file1$element$subdata$data)
However, regardless of using different json packages (eg RJSONIO), I cannot apply this to the entire contents of files. The error I continue to get is...
attempt to run same code as function over all contents in file list
for (i in files) {
fromJSON(readLines(i),flatten = TRUE)
as.data.frame(i)$element$subdata$data}
My goal is to loop through all 500+ and extract the data and its contents. Specifically if the file has the element ‘subdata$data’, i want to extract the list and put them all in a dataframe.
Note: files are being read as ASCII (Windows OS). This does bot have a negative effect on single extractions but for the loop i get ‘invalid character bytes’
Update 1/25/2019
Ran the following but returned errors...
files<-list.files('~/JSON')
out<-lapply(files,function (fn) {
o<-fromJSON(file(i),flatten=TRUE)
as.data.frame(i)$element$subdata$data
})
Error in file(i): object 'i' not found
Also updated function, this time with UTF* errors...
files<-list.files('~/JSON')
out<-lapply(files,function (i,fn) {
o<-fromJSON(file(i),flatten=TRUE)
as.data.frame(i)$element$subdata$data
})
Error in parse_con(txt,bigint_as_char):
lexical error: invalid bytes in UTF8 string. (right here)------^
Latest Update
Think I found out a solution to the crazy 'bytes' problem. When I run readLines on the .json file, I can then apply fromJSON),
e.x.
json<-readLines('~/JSON')
jsonread<-fromJSON(json)
jsondf<-as.data.frame(jsonread$element$subdata$data)
#returns a dataframe with the correct information
Problem is, I cannot apply readLines to all the files within the JSON folder (PATH). If I can get help with that, I think I can run...
files<-list.files('~/JSON')
for (i in files){
a<-readLines(i)
o<-fromJSON(file(a),flatten=TRUE)
as.data.frame(i)$element$subdata}
Needed Steps
apply readLines to all 500 .json files in JSON folder
apply fromJSON to files from step.1
create a data.frame that returns entries if list (fromJSON) contains $element$subdata$data.
Thoughts?
Solution (Workaround?)
Unfortunately, the fromJSON still runs in to trouble with the .json files. My guess is that my GET method (httr) is unable to wait/delay and load the 'pretty print' and thus is grabbing the raw .json which in-turn is giving odd characters and as a result giving the ubiquitous '------^' error. Nevertheless, I was able to put together a solution, please see below. I want to post it for future folks that may have the same problem with the .json files not working nicely with any R json package.
#keeping the same 'files' variable as earlier
raw_data<-lapply(files,readLines)
dat<-do.call(rbind,raw_data)
dat2<-as.data.frame(dat,stringsasFactors=FALSE)
#check to see json contents were read-in
dat2[1,1]
library(tidyr)
dat3<-separate_rows(dat2,sep='')
x<-unlist(raw_data)
x<-gsub('[[:punct:]]', ' ',x)
#Identify elements wanted in original .json and apply regex
y<-regmatches(x,regexc('.*SubElement2 *(.*?) *Text.*',x))
for loops never return anything, so you must save all valuable data yourself.
You call as.data.frame(i) which is creating a frame with exactly one element, the filename, probably not what you want to keep.
(Minor) Use fromJSON(file(i),...).
Since you want to capture these into one frame, I suggest something along the lines of:
out <- lapply(files, function(fn) {
o <- fromJSON(file(fn), flatten = TRUE)
as.data.frame(o)$element$subdata$data
})
allout <- do.call(rbind.data.frame, out)
### alternatives:
allout <- dplyr::bind_rows(out)
allout <- data.table::rbindlist(out)

Scraping multiple urls from a list of urls. Obtaining the data (text) behind each url, and writing out a text file

I am learning python (using 3.5). I realize I will probably take a bit of heat for posting my question. Here goes: I have literally reviewed several hundred posts, help docs, etc. all in an attempt to construct the code I need. No luck thus far. I hope someone can help me. I have a set of URLs say, 18 or more. Only 2 illustrated here:
[1] "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/111915.html"
[2] "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/092215.htm"
I need to scrape all the data (text) behind each url and write out to individual text files (one for each URL) for future topic model analysis. Right now, I pull in the urls through R using rvest. I then take each url (one at a time, by code) into python and do the following:
soup = BeautifulSoup(urlopen('http://www.senate.mo.gov/media/14info/chappelle-nadal/Columns/012314-Condensed.html').read())
txt = soup.find('div', {'class' : 'body'})
print(soup.get_text())
#print(soup.prettify()) not much help
#store the info in an object, then write out the object
test=print(soup.get_text())
test=soup.get_text()
#below does write a file
#how to take my BS object and get it in
open_file = open('23Jan2014cplNadal1.txt', 'w')
open_file.write(test)
open_file.close()
The above gets me partially to my target. It leaves me just a little clean up regarding the text, but that's okay. The problem is that it is labor intensive.
Is there a way to
Write a clean text file (without invisibles, etc.) out from R with all listed urls?
For python 3.5: Is there a way to take all the urls, once they are in a clean single file (the clean text file, one url per line), and have some iterative process retrieve the text behind each url and write out a text file for each URL's data(text) to a location on my hard drive?
I have to do this process for approximately 1000 state-level senators. Any help or direction is greatly appreciated.
Edit to original: Thank you so much all. To N. Velasquez: I tried the following:
urls<-c("http://www.senate.mo.gov/media/14info/Chappelle-Nadal/releases/120114.html",
"http://www.senate.mo.gov/media/14info/Chappelle-Nadal/releases/110614.htm"
)
for (url in urls) {
download.file(url, destfile = basename(url), method="curl", mode ="w", extra="-k")
}
html files are then written out to my working directory. However, is there a way to write out text files instead of html files? I've read download.file info and can't seem to figure out a way to push out individual text files. Regarding the suggestion for a for loop: Is what I illustrate what you mean for me to attempt? Thank you!
The answer for 1 is: Sure!
The following code will loop you through the html list and export atomic TXTs, as per your request.
Note that through rvest and html_node() you could get a much more structure datset, with recurring parts of the html stored separately. (header, office info, main body, URL, etc...)
library(rvest)
urls <- (c("http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/111915.html", "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/092215.htm"))
for (i in 1:length(urls))
{
ht <- list()
ht[i] <- html_text(html_node(read_html(urls[i]), xpath = '//*[#id="mainContent"]'), trim = TRUE)
ht <- gsub("[\r\n]","",ht)
writeLines(ht[i], paste("DOC_", i, ".txt", sep =""))
}
Look for the DOC_1.txt and DOC_2.txt in your working directory.

How to import multiple matlab files into R (Using package R.Matlab)

Thank you in advance for your're help. I am using R to analyse some data that is initially created in Matlab. I am using the package "R.Matlab" and it is fantastic for 1 file, but I am struggling to import multiple files.
The working script for a single file is as follows...
install.packages("R.matlab")
library(R.matlab)
x<-("folder_of_files")
path <- system.file("/home/ashley/Desktop/Save/2D Stream", package="R.matlab")
pathname <- file.path(x, "Test0000.mat")
data1 <- readMat(pathname)
And this works fantastic. The format of my files is 'Name_0000.mat' where between files the name is a constant and the 4 digits increase, but not necesserally by 1.
My attempt to load multiple files at once was along these lines...
for (i in 1:length(temp))
data1<-list()
{data1[[i]] <- readMat((get(paste(temp[i]))))}
And also in multiple other ways that included and excluded path and pathname from the loop, all of which give me the same error:
Error in get(paste(temp[i])) :
object 'Test0825.mat' not found
Where 0825 is my final file name. If you change the length of the loop it is always just the name of the final one.
I think the issue is that when it pastes the name it looks for that object, which as of yet does not exist so I need to have the pasted text in speach marks, yet I dont know how to do that.
Sorry this was such a long post....Many thanks

Resources