Webscraping in R: "... does not exist in current working directory" error

I'm trying to use the xml2 package to scrape a few tables from ESPN.com. For the sake of example, I'd like to scrape the week 7 fantasy quarterback rankings into R; the URL is:
http://www.espn.com/fantasy/football/story/_/page/16ranksWeek7QB/fantasy-football-week-7-quarterback-rankings
I'm trying to use the "read_html()" function to do this because it is what I am most familiar with. Here is my syntax and its error:
> wk.7.qb.rk = read_html("www.espn.com/fantasy/football/story/_/page/16ranksWeek7QB/fantasy-football-week-7-rankings-quarterbacks", which = 1)
Error: 'www.espn.com/fantasy/football/story/_/page/16ranksWeek7QB/fantasy-football-week-7-rankings-quarterbacks' does not exist in current working directory ('C:/Users/Brandon/Documents/Fantasy/Football/Daily').
I've also tried "read_xml()", only to get the same error:
> wk.7.qb.rk = read_xml("www.espn.com/fantasy/football/story/_/page/16ranksWeek7QB/fantasy-football-week-7-rankings-quarterbacks", which = 1)
Error: 'www.espn.com/fantasy/football/story/_/page/16ranksWeek7QB/fantasy-football-week-7-rankings-quarterbacks' does not exist in current working directory ('C:/Users/Brandon/Documents/Fantasy/Football/Daily').
Why is R looking for this URL in the working directory? I've tried this function with other URLs and had some success. What is it about this specific URL that makes it look in a different location than it does for others? And, how do I change that?
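For what it's worth, the usual cause of this error is a missing scheme: read_html() only treats the string as a URL when it starts with http:// or https://, and otherwise looks for a local file with that name. A minimal sketch of the fix (note that read_html() has no "which" argument; table extraction happens afterwards with rvest):
library(xml2)
library(rvest)
# With the scheme included, read_html() fetches the page instead of
# looking for a local file; html_table() then extracts the tables.
wk.7.qb.rk <- read_html("http://www.espn.com/fantasy/football/story/_/page/16ranksWeek7QB/fantasy-football-week-7-quarterback-rankings")
tables <- html_table(html_nodes(wk.7.qb.rk, "table"), fill = TRUE)
tables[[1]]  # the first table on the page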

I got this error while I was running read_html() in a loop to navigate through 20 pages. After the 20th page the loop was still running, but with no URLs left, so it started calling read_html() with NAs for the remaining iterations. Hope this helps!
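For the loop case, a small guard helps; this is only a sketch of the structure implied above (the urls vector here is hypothetical):
library(rvest)
pages <- list()
for (i in seq_along(urls)) {
  if (is.na(urls[i])) next      # skip iterations with no URL instead of calling read_html(NA)
  pages[[i]] <- read_html(urls[i])
}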

Related

Error when trying to read_html from a website with delayed loading

I'm trying to scrape the public consultations from the European Commission's website with R for my master's thesis. However, the loading screen that appears for the first second after the website opens seems to prevent this.
By running
test <- read_html("https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives")
I get the following error:
Fehler in readBin(3L, "raw", 65536L) :
Failure when receiving data from the peer
("Fehler in" translates to "error in")
I couldn't find a solution yet, and I also struggle to identify the specific problem, because I would actually expect the function to return the HTML of the loading screen that appears in the first second.
Does anyone have an idea how to get beyond the loading part with R?
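One approach (a sketch, not verified against this particular site) is to render the page in a real browser with RSelenium and wait for the JavaScript-driven loading screen to finish before grabbing the HTML:
library(RSelenium)
library(rvest)
driver <- rsDriver(browser = "firefox", port = 4545L)   # port number is arbitrary
remote <- driver$client
remote$navigate("https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives")
Sys.sleep(5)   # give the loading screen time to disappear
page <- read_html(remote$getPageSource()[[1]])
remote$close()
driver$server$stop()
read_html() alone cannot wait for JavaScript, which is why an automated browser is usually needed for pages like this.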

Using speech to text with googlelanguageR produces NULL transcripts

I'm using the R package 'googleLanguageR' to transcribe various 30-second audio files (over 500, so I want to automate this). I've followed all the steps in the googleLanguageR tutorials, got my key, and authenticated through R.
I'm able to transcribe the test audio (.wav) that comes with the package, but whenever I apply the same function to my files (.mp3), I get NULL for both transcript and timings.
This is the code provided in tutorials:
# get the sample source file
test_audio <- system.file("woman1_wb.wav", package = "googleLanguageR")
gl_speech(test_audio)$transcript
If I use the same for my file, I get an empty element, so I've tried the following with no luck:
test_audio <- "/audio_location/filename.mp3"
gl_speech(test_audio)$transcript
Has anybody encountered a similar problem with this package or have any suspicions of why it produces NULL transcripts?
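One possible explanation (an assumption, not something confirmed in the question) is the audio encoding: the Speech API works with formats such as LINEAR16 or FLAC, and raw .mp3 input is a common reason for empty results. A sketch of converting to mono 16 kHz WAV with the av package first, then passing the encoding and sample rate explicitly:
library(av)
library(googleLanguageR)
wav_file <- tempfile(fileext = ".wav")
av_audio_convert("/audio_location/filename.mp3", wav_file,
                 channels = 1, sample_rate = 16000)   # mono, 16 kHz
result <- gl_speech(wav_file,
                    encoding = "LINEAR16",
                    sampleRateHertz = 16000,
                    languageCode = "en-US")
result$transcript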

Problems parsing StreamR JSON data

I am attempting to use the streamR package in R to download and analyze Twitter data, on the premise that this library can overcome the limitations of the twitteR package.
When downloading data, everything seems to work fabulously using the filterStream function (to clarify: the function captures Twitter data, and simply running it produces the JSON file, saved in the working directory, that is used in the subsequent steps):
filterStream( file.name="tweets_test.json",
track="NFL", tweets=20, oauth=credential, timeout=10)
Capturing tweets...
Connection to Twitter stream was closed after 10 seconds with up to 21 tweets downloaded.
However, when moving on to parse the json file, I keep getting all sorts of errors:
readTweets("tweets_test.json", verbose = TRUE)
0 tweets have been parsed.
list()
Warning message:
In readLines(tweets) : incomplete final line found on 'tweets_test.json'
Or with this function from the same package:
tweet_df <- parseTweets(tweets='tweets_test.json')
Error in `$<-.data.frame`(`*tmp*`, "country_code", value = NA) :
replacement has 1 row, data has 0
In addition: Warning message:
In stream_in_int(path.expand(path)) : Parsing error on line 0
I have tried reading the json file with jsonlite and rjson with the same results.
Originally, it seemed that the error came from special characters ({, then \) within the JSON file, which I tried to clean up following the suggestion from this post; however, not much came out of it.
I found out about the streamR package from this post, which shows the process as very straightforward and simple (which it is, except for the parsing part!).
If any of you have experience with this library and/or these parsing issues, I'd really appreciate your input. I have been searching nonstop but haven't been able to locate a solution.
Thanks!
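A hedged way to narrow the problem down (this assumes the file is newline-delimited JSON, one tweet per line, which is what filterStream writes): read the raw lines yourself, drop blank or truncated ones, and parse each line individually to see which records fail.
library(jsonlite)
lines <- readLines("tweets_test.json", warn = FALSE)
lines <- lines[nzchar(trimws(lines))]   # drop empty lines, including an incomplete final line
parsed <- lapply(lines, function(x) tryCatch(fromJSON(x), error = function(e) NULL))
sum(!vapply(parsed, is.null, logical(1)))   # how many lines parse cleanly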

Reading SDMX in R - parse error?

I've been trying to develop a Shiny app in R with data from INEGI (the Mexican statistics agency) through their recently launched SDMX service. I went as far as contacting the developers themselves, and they gave me the following, unworkable, code:
require(devtools)
require(RSQLite)
require(rsdmx)
require(RCurl)
url <- paste("http://www.snieg.mx/opendata/NSIRestService/Data/ALL,DF_PIB_PB2008,ALL/ALL/INEGI");
sdmxObj <- readSDMX(url)
df_pib <- as.data.frame(sdmxObj)
Which brings me to the following errors:
sdmxObj <- readSDMX(url)
Opening and ending tag mismatch: ad line 1 and Name
Opening and ending tag mismatch: b3 line 1 and Name
Opening and ending tag mismatch: b3 line 1 and Department
Opening and ending tag mismatch: c3 line 1 and Contact
Opening and ending tag mismatch: a1 line 1 and Sender
Opening and ending tag mismatch: c3 line 1 and Header
Opening and ending tag mismatch: b3 line 1 and GenericData
... etc, you get the point.
I tried to use another URL (maybe this one was too broad, bringing in every GDP measurement), but I get the same result:
url<-"http://www.snieg.mx/opendata/NSIRestService/Data/ALL,DF_PIB_PB2008,ALL/.MX.........C05.......0101/INEGI?format=compact"
If I download the file directly with my browser I seem to be getting useful structures.
Any ideas? Does this seem like a faulty definition directly from the source or an issue with the package "rsdmx", if so, has anyone found a way to parse similar structures correctly?
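Since the browser download looks intact, one hedged workaround (separate from the firewall explanation given below) is to fetch the file first and parse it locally; readSDMX() accepts a local file when isURL = FALSE:
library(rsdmx)
url <- "http://www.snieg.mx/opendata/NSIRestService/Data/ALL,DF_PIB_PB2008,ALL/ALL/INEGI"
local_file <- tempfile(fileext = ".xml")
download.file(url, local_file, mode = "wb")   # or point this at the file saved from the browser
sdmxObj <- readSDMX(local_file, isURL = FALSE)
df_pib  <- as.data.frame(sdmxObj)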
The code you pasted above, using rsdmx, works perfectly fine. The issue you had was with your workplace firewall, as you correctly figured out.
You only need to load the rsdmx package (the other packages do not need to be explicitly declared):
require(rsdmx)
and do this code:
url <- paste("http://www.snieg.mx/opendata/NSIRestService/Data/ALL,DF_PIB_PB2008,ALL/ALL/INEGI");
sdmxObj <- readSDMX(url)
df_pib <- as.data.frame(sdmxObj)
I've checked for any potential issue related to this data source, but there is none. Staying strictly within the scope of your post, your code is fine.
That being said, if you find a bug in rsdmx, you can submit a ticket directly at https://github.com/opensdmx/rsdmx/issues; prompt feedback is provided to users. You can also send suggestions or feature requests there or on the rsdmx mailing list.
You could try RJSDMX.
To download all the time series of the DF_PIB_PB2008 dataflow, you just need to run:
library(RJSDMX)
result = getSDMX('INEGI', 'DF_PIB_PB2008/.................')
or equivalently:
result = getSDMX('INEGI', 'DF_PIB_PB2008/ALL')
If you need time series as a result, you're done. Otherwise, if you prefer a data.frame, you can get one by calling:
dfresult = sdmxdf(result, meta=T)
You can find more information about the package and its configuration in the project wiki.

Using R package BerkeleyEarth

I'm working for the first time with the R package BerkeleyEarth and attempting to use its convenience functions to access the BEST data. I think maybe it's just a problem with their servers (a matter I've separately raised with the package's maintainer), but I wanted to know if it's instead something silly I'm doing.
To reproduce my problem:
library(BerkeleyEarth)
downloadBerkeley()
which provides the following error message
trying URL 'http://download.berkeleyearth.org/downloads/TAVG/LATEST%20-%20Non-seasonal%20_%20Quality%20Controlled.zip'
Error in download.file(urls$Url[thisUrl], destfile = file.path(destDir, :
cannot open URL 'http://download.berkeleyearth.org/downloads/TAVG/LATEST%20-%20Non-seasonal%20_%20Quality%20Controlled.zip'
In addition: Warning message:
In download.file(urls$Url[thisUrl], destfile = file.path(destDir, :
InternetOpenUrl failed: 'A connection with the server could not be established'
Has anyone had a better experience using this package?
The error message points to a different URL from the ones listed at http://berkeleyearth.org/data/ for the zip-formatted files. There is also another set of .nc files that appears to be more recent. I would replace the entries in the BerkeleyUrls data frame with the ones that match your analysis strategy:
This is the current URL that should be in position 1,1:
http://berkeleyearth.lbl.gov/downloads/TAVG/LATEST%20-%20Non-seasonal%20_%20Quality%20Controlled.zip
And this is the one that is in the package dataframe:
> BerkeleyUrls[1,1]
[1] "http://download.berkeleyearth.org/downloads/TAVG/LATEST%20-%20Non-seasonal%20_%20Quality%20Controlled.zip"
I suppose you could try:
BerkeleyUrls[, 1] <- sub( "download\\.berkeleyearth\\.org", "berkeleyearth.lbl.gov", BerkeleyUrls[, 1])
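A hedged follow-up check (using httr, which is not part of BerkeleyEarth itself): confirm the rewritten URL actually responds before re-running downloadBerkeley().
library(httr)
patched <- sub("download\\.berkeleyearth\\.org", "berkeleyearth.lbl.gov", BerkeleyUrls[, 1])
status_code(HEAD(patched[1]))   # expect 200 if the new host serves the file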
