Reading SDMX in R - parse error?

I've been trying to develop a Shiny app in R with INEGI (the Mexican statistics agency) data through their recently launched SDMX service. I went as far as contacting the developers themselves, and they gave me the following code, which I can't get to work:
require(devtools)
require(RSQLite)
require(rsdmx)
require(RCurl)
url <- paste("http://www.snieg.mx/opendata/NSIRestService/Data/ALL,DF_PIB_PB2008,ALL/ALL/INEGI");
sdmxObj <- readSDMX(url)
df_pib <- as.data.frame(sdmxObj)
Running this gives me the following errors:
sdmxObj <- readSDMX(url)
Opening and ending tag mismatch: ad line 1 and Name
Opening and ending tag mismatch: b3 line 1 and Name
Opening and ending tag mismatch: b3 line 1 and Department
Opening and ending tag mismatch: c3 line 1 and Contact
Opening and ending tag mismatch: a1 line 1 and Sender
Opening and ending tag mismatch: c3 line 1 and Header
Opening and ending tag mismatch: b3 line 1 and GenericData
... etc, you get the point.
I tried another URL (maybe the first one was too broad, since it pulls in every GDP measurement), but I get the same result:
url<-"http://www.snieg.mx/opendata/NSIRestService/Data/ALL,DF_PIB_PB2008,ALL/.MX.........C05.......0101/INEGI?format=compact"
If I download the file directly with my browser, I seem to get useful structures.
Any ideas? Does this look like a faulty definition at the source, or an issue with the "rsdmx" package? If so, has anyone found a way to parse similar structures correctly?

The code you pasted above, using rsdmx, works perfectly fine. The issue you had was your workplace firewall, as you correctly figured out.
You only need to load the rsdmx package (the other packages do not need to be explicitly declared):
require(rsdmx)
and run this code:
url <- paste("http://www.snieg.mx/opendata/NSIRestService/Data/ALL,DF_PIB_PB2008,ALL/ALL/INEGI");
sdmxObj <- readSDMX(url)
df_pib <- as.data.frame(sdmxObj)
I've checked for any potential issue related to this datasource, but there is none. Staying strictly within the scope of your post, your code is fine.
This being said, if you find a bug in rsdmx, you can submit a ticket directly at https://github.com/opensdmx/rsdmx/issues. Prompt feedback is provided to users. You can also send suggestions or feature requests there or on the rsdmx mailing list.
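If a workplace firewall or proxy is indeed the blocker, a possible workaround (a sketch only, assuming an HTTP proxy whose host and port you know; "proxy.example.com" and 8080 below are placeholders) is to fetch the XML through the proxy with httr and then parse the local copy with readSDMX, which accepts a file path when isURL = FALSE:
library(httr)
library(rsdmx)

url <- "http://www.snieg.mx/opendata/NSIRestService/Data/ALL,DF_PIB_PB2008,ALL/ALL/INEGI"

# fetch through the corporate proxy (host and port are placeholders)
resp <- GET(url, use_proxy("proxy.example.com", 8080))
writeBin(content(resp, "raw"), "pib.xml")

# parse the downloaded file instead of the remote URL
sdmxObj <- readSDMX("pib.xml", isURL = FALSE)
df_pib  <- as.data.frame(sdmxObj)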

You could try RJSDMX.
To download all the time series of the DF_PIB_PB2008 dataflow, you just need to run:
library(RJSDMX)
result = getSDMX('INEGI', 'DF_PIB_PB2008/.................')
or equivalently:
result = getSDMX('INEGI', 'DF_PIB_PB2008/ALL')
If you need time series as a result, you're done. Otherwise, if you prefer a data.frame, you can get one by calling:
dfresult = sdmxdf(result, meta=T)
You can find more information about the package and its configuration in the project wiki.
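For completeness, a minimal usage sketch, assuming getSDMX() returns a named list of zoo time series as described in the RJSDMX documentation:
library(RJSDMX)

result <- getSDMX('INEGI', 'DF_PIB_PB2008/ALL')
names(result)[1:3]                       # inspect a few series keys
plot(result[[1]], main = names(result)[1])

dfresult <- sdmxdf(result, meta = TRUE)  # flatten to a data.frame with metadata columns
str(dfresult)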

Related

Problems parsing StreamR JSON data

I am attempting to use streamR in R to download and analyze Twitter data, on the premise that this library can overcome the limitations of the twitteR package.
When downloading data, everything seems to work fabulously using the filterStream function (just to clarify: the function captures Twitter data, and simply running it produces the JSON file, saved in the working directory, that is needed in the next steps):
filterStream(file.name = "tweets_test.json",
             track = "NFL", tweets = 20, oauth = credential, timeout = 10)
Capturing tweets...
Connection to Twitter stream was closed after 10 seconds with up to 21 tweets downloaded.
However, when moving on to parse the json file, I keep getting all sorts of errors:
readTweets("tweets_test.json", verbose = TRUE)
0 tweets have been parsed.
list()
Warning message:
In readLines(tweets) : incomplete final line found on 'tweets_test.json'
Or with this function from the same package:
tweet_df <- parseTweets(tweets='tweets_test.json')
Error in `$<-.data.frame`(`*tmp*`, "country_code", value = NA) :
replacement has 1 row, data has 0
In addition: Warning message:
In stream_in_int(path.expand(path)) : Parsing error on line 0
I have tried reading the json file with jsonlite and rjson with the same results.
Originally, it seemed that the error came from special characters ({, then \) within the JSON file, which I tried to clean up following the suggestion from this post; however, not much came of it.
I found out about the streamR package from this post, which shows the process as very straightforward and simple (which it is, except for the parsing part!).
If any of you have experience with this library and/or these parsing issues, I'd really appreciate your input. I have been searching non-stop but haven't been able to locate a solution.
Thanks!
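One hedged thing to try (a sketch, not a confirmed fix): the streamed file holds one JSON object per line, and the warning suggests the final line was cut off, so parsing line by line with jsonlite and skipping blank or malformed lines may recover the rest:
library(jsonlite)

raw_lines  <- readLines("tweets_test.json", warn = FALSE)
json_lines <- raw_lines[nzchar(raw_lines)]   # drop blank lines between tweets
parsed <- lapply(json_lines, function(l) tryCatch(fromJSON(l), error = function(e) NULL))
parsed <- parsed[!vapply(parsed, is.null, logical(1))]  # drop lines that failed to parse
length(parsed)   # number of tweets parsed cleanly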

Error reading hdf file in R

I have two HDF4 files, namely file 1: "MYD04_L2.A2011001.2340.006.2014078044212.hdf" and file 2: "MYD04_L2.A2011031.mosaic.006.AOD_550_DT_DB_Combined.hdf". The first is a raw data file with 72 sub-datasets, and the second is the file I obtained after ordering (i.e. post-processed). For the first file, the following R code works:
layer_name <- getSds("MYD04_L2.A2011001.2340.006.2014078044212.hdf",method="mrt")
layer_name$SDSnames[66:68]
[1] "AOD_550_Dark_Target_Deep_Blue_Combined"
[2] "AOD_550_Dark_Target_Deep_Blue_Combined_QA_Flag"
[3] "AOD_550_Dark_Target_Deep_Blue_Combined_Algorithm_Flag"
It works OK with method="gdal" as well. However, when I try to read file 2, a window pops up saying gdalinfo.exe has stopped working (method = "gdal"). The same kind of problem arises with mrt, where it says sdslist.exe has stopped working. I get the following error message:
Error in sds[[i]] <- substr(sdsRaw[i], 1, 11) == "SDgetinfo: " :
attempt to select less than one element in integerOneIndex
Is the single layer the issue here? The first file has 72 sub-datasets while the second has only one (I assume, from the file name, since I couldn't read it), so has R failed to read the data file for that reason? Can anyone propose a solution for reading such data files? If the ncdf4 package built with HDF4 enabled is the solution, can anyone explain, step by step, how I can enable HDF4 and build ncdf4 on Windows?
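A hedged alternative to the ncdf4 route (a sketch, assuming the gdalUtils package and a GDAL build that includes the HDF4 driver; if GDAL itself crashes on this file, this will fail too): list the subdatasets and convert one to GeoTIFF so it can be read with raster():
library(gdalUtils)
library(raster)

f   <- "MYD04_L2.A2011031.mosaic.006.AOD_550_DT_DB_Combined.hdf"
sds <- get_subdatasets(f)   # likely a single entry for the post-processed file
print(sds)

gdal_translate(sds[1], "aod550.tif")  # convert the subdataset to GeoTIFF
aod <- raster("aod550.tif")
plot(aod)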

Webscraping in R, "... does not exist in current working directory" error

I'm trying to use the xml2 package to scrape a few tables from ESPN.com. For the sake of example, I'd like to scrape the week 7 fantasy quarterback rankings into R, the URL for which is:
http://www.espn.com/fantasy/football/story/_/page/16ranksWeek7QB/fantasy-football-week-7-quarterback-rankings
I'm trying to use the "read_html()" function to do this because it is what I am most familiar with. Here is my syntax and its error:
> wk.7.qb.rk = read_html("www.espn.com/fantasy/football/story/_/page/16ranksWeek7QB/fantasy-football-week-7-rankings-quarterbacks", which = 1)
Error: 'www.espn.com/fantasy/football/story/_/page/16ranksWeek7QB/fantasy-football-week-7-rankings-quarterbacks' does not exist in current working directory ('C:/Users/Brandon/Documents/Fantasy/Football/Daily').
I've also tried "read_xml()", only to get the same error:
> wk.7.qb.rk = read_xml("www.espn.com/fantasy/football/story/_/page/16ranksWeek7QB/fantasy-football-week-7-rankings-quarterbacks", which = 1)
Error: 'www.espn.com/fantasy/football/story/_/page/16ranksWeek7QB/fantasy-football-week-7-rankings-quarterbacks' does not exist in current working directory ('C:/Users/Brandon/Documents/Fantasy/Football/Daily').
Why is R looking for this URL in the working directory? I've tried this function with other URLs and had some success. What is it about this specific URL that makes it look in a different location than it does for others? And, how do I change that?
I got this error while I was running read_html in a loop to navigate through 20 pages. After the 20th page the loop was still running with no URLs left, so it started calling read_html with NAs for the remaining iterations. Hope this helps!
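Another hedged possibility, based on the error text itself: read_html() treats a string without an "http://" or "https://" scheme as a local file path, which is why it looks in the working directory. Adding the scheme should make it fetch the page over the network:
library(xml2)

# Same page as in the question, with the scheme prepended so read_html()
# recognizes the string as a URL rather than a file path.
wk.7.qb.rk <- read_html("http://www.espn.com/fantasy/football/story/_/page/16ranksWeek7QB/fantasy-football-week-7-quarterback-rankings")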

How do I close unused connections after read_html in R

I am quite new to R and am trying to access some information on the internet, but am having problems with connections that don't seem to be closing. I would really appreciate it if someone here could give me some advice...
Originally I wanted to use the WebChem package, which theoretically delivers everything I want, but when some of the output data is missing from the webpage, WebChem doesn't return any data from that page. To get around this, I have taken most of the code from the package but altered it slightly to fit my needs. This worked fine for about the first 150 uses, but now, although I have changed nothing, when I use the command read_html, I get the warning message "closing unused connection 4 (http:...". Although this is only a warning message, read_html doesn't return anything after the warning is generated.
I have written simplified code, given below, which has the same problem.
Closing R completely (or even rebooting my PC) doesn't seem to make a difference - the warning message now appears the second time I use the code. I can run the queries one at a time outside of the loop with no problems, but as soon as I try to use the loop, the error occurs again on the 2nd iteration.
I have tried to vectorise the code, and again it returned the same error message.
I tried showConnections(all=TRUE), but only got connections 0-2 for stdin, stdout, stderr.
I have tried searching for ways to close the html connection, but I can't define the url as a con, and close(qurl) and close(ttt) also don't work. (They return the errors "no applicable method for 'close' applied to an object of class "character"" and "no applicable method for 'close' applied to an object of class "c('xml_document', 'xml_node')"", respectively.)
Does anybody know a way to close these connections so that they don't break my routine? Any suggestions would be very welcome. Thanks!
PS: I am using R version 3.3.0 with RStudio Version 0.99.902.
CasNrs <- c("630-08-0","463-49-0","194-59-2","86-74-8","148-79-8")
tit = character()
for (i in 1:length(CasNrs)){
  CurrCasNr <- as.character(CasNrs[i])
  baseurl <- 'http://chem.sis.nlm.nih.gov/chemidplus/rn/'
  qurl <- paste0(baseurl, CurrCasNr, '?DT_START_ROW=0&DT_ROWS_PER_PAGE=50')
  ttt <- try(read_html(qurl), silent = TRUE)
  tit[i] <- xml_text(xml_find_all(ttt, "//head/title"))
}
After researching the topic I came up with the following solution:
url <- "https://website_example.com"
con <- url(url, "rb")    # open an explicit read-only connection
html <- read_html(con)
close(con)               # close it yourself once the page has been parsed
# + Whatever you wanna do with the html since it's already saved!
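Building on that pattern, a hedged sketch (read_html_closed is a made-up helper name, and it assumes read_html() accepts an open connection, as in the snippet above) that wraps the open/read/close steps and drops them into the loop from the question; on.exit() is used so the connection is closed even if read_html() errors:
read_html_closed <- function(qurl) {
  con <- url(qurl, "rb")
  on.exit(close(con), add = TRUE)  # always release the connection
  xml2::read_html(con)
}

tit <- character(length(CasNrs))
for (i in seq_along(CasNrs)) {
  qurl <- paste0('http://chem.sis.nlm.nih.gov/chemidplus/rn/',
                 CasNrs[i], '?DT_START_ROW=0&DT_ROWS_PER_PAGE=50')
  ttt <- try(read_html_closed(qurl), silent = TRUE)
  if (!inherits(ttt, "try-error")) {
    tit[i] <- xml2::xml_text(xml2::xml_find_first(ttt, "//head/title"))
  }
}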
I haven't found a good answer for this problem. The best work-around that I came up with is to include the function below, with Secs = 3 or 4. I still don't know why the problem occurs or how to stop it without building in a large delay.
CatchupPause <- function(Secs){
  Sys.sleep(Secs)        # pause to let connection work
  closeAllConnections()
  gc()
}
I found this post as I was running into the same problems when I tried to scrape multiple datasets in the same script. The script would get progressively slower, and I suspect it was due to the connections. Here is a simple loop that closes out all of the connections (note that closeAllConnections() takes no arguments):
for (i in seq_along(df$URLs)){
  # scrape df$URLs[i] here, then release any connections left open
  closeAllConnections()
}

Reproduce weather research, download and extract file using RStudio and wgrib2

tl;dr
First of all, all the code necessary to reproduce the problem is available below. I'm experiencing some challenges with the grib part, but also with more basic things like download.file().
Problem
There's a nice article on using GFS weather data in R here. It would be a great starting point for further analysis, but I'm having several problems. The article contains all the R code required to reproduce it, but here are the first few lines, which are also the most important part for getting things going:
#STEP 1 (doesn't work either way because of bad link)
loc=file.path("ftp://ftp.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/gfs.2009121700/gfs.t00z.sfluxgrbf03.grib2")
My first problem was that the link didn't work, but here's the url to a file of the same format from the same source, so this should work:
#PART 1 with working link (at least in google chrome):
loc=file.path("ftp://ftp.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/gfs.2015081112/gfs.t12z.sfluxgrbf00.grib2")
download.file(loc,"temp.grb",mode="wb")
#PART 2
shell("wgrib2 -s temp03.grb | grep :LAND: | wgrib2 -i temp00.grb -netcdf LAND.nc",intern=T)
#PART3
library(ncdf)
landFrac <-open.ncdf("LAND.nc")
land <- get.var.ncdf(landFrac,"LAND_surface")
x <- get.var.ncdf(landFrac,"longitude")
y <- get.var.ncdf(landFrac,"latitude")
#PART4
rgb.palette <- colorRampPalette(c("snow1","snow2","snow3","seagreen","orange","firebrick"), space = "rgb")#colors
image.plot(x,y,t2m.mean,col=rgb.palette(200),axes=F,main=as.expression(paste("GFS 24hr Average 2M Temperature",day,"00 UTC",sep="")),axes=F,legend.lab="o C")
contour(x,y,land,add=TRUE,lwd=1,levels=0.99,drawlabels=FALSE,col="grey30")
My system
Windows 7, 64 bit
Rstudio 0.99.467
Question 1 - download.file()
I'm having problems with download.file(), even with a link (or ftp path) that works fine in Chrome. Here's a part of the error message:
In download.file(url = loc, destfile = "temp.grb", mode = "wb") :
InternetOpenUrl failed:
Anyone know what causes this?
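A hedged note on Question 1: on Windows, download.file() defaults to the "wininet" method, which sometimes fails on FTP servers with exactly this InternetOpenUrl error; switching the download method is worth a try (method = "libcurl" needs R >= 3.2.0, and method = "curl" needs the curl tool on your PATH):
loc <- "ftp://ftp.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/gfs.2015081112/gfs.t12z.sfluxgrbf00.grib2"

# try libcurl instead of the default Windows-internal method
download.file(loc, destfile = "temp.grb", mode = "wb", method = "libcurl")

# or, if the curl command-line tool is installed:
# download.file(loc, destfile = "temp.grb", mode = "wb", method = "curl")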
Question 2 - wgrib2
I've downloaded a sample file manually to get things going, but get stuck again in Part 2. How would the following R command look directly in a command window?
shell("wgrib2 -s temp03.grb | grep :LAND: | wgrib2 -i temp00.grb -netcdf LAND.nc",intern=T)
I've installed the necessary DLLs available under the links below, and I've also got wgrib2 up and running (apparently). Information on wgrib2 is provided in the article and is available here (source for Windows 7).
I hope some of you find this interesting!
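A hedged note on Question 2: in a plain Windows command window, grep is usually not available, so findstr can do the filtering instead. Assuming wgrib2.exe is on the PATH and the downloaded file is named temp.grb (the article's snippet mixes temp03.grb and temp00.grb), the pipeline might look roughly like this, with the equivalent R call via shell() included:
# In cmd.exe:
#   wgrib2 -s temp.grb | findstr ":LAND:" | wgrib2 -i temp.grb -netcdf LAND.nc

# The same pipeline invoked from R:
shell('wgrib2 -s temp.grb | findstr ":LAND:" | wgrib2 -i temp.grb -netcdf LAND.nc', intern = TRUE)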
