I download some JSON files from Twitter with this command:
library(RCurl)
getURL("http://search.twitter.com/search.json?since=2012-12-06&until=2012-12-07&q=royalbaby&result_type=recent&rpp=10&page=1")
But now all double quotes are transformed to \". In some special cases this destroys the JSON format. I think getURL or curl makes this change. Is there any way to suppress it?
Thanks
Markus
Your page itself contains the \" sequences; this is not RCurl behavior (try opening the page in a browser).
library(RJSONIO)
library(RCurl)
raw_data <- getURL(your.url)
data <- fromJSON(raw_data)
The data is well formatted.
Use cat() to print the string without the \ escapes that print() adds for display.
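To illustrate the print versus cat point, here is a minimal sketch using only base R. The JSON string is a made-up sample, not the Twitter response from the question.

```r
# Hypothetical sample: a string holding JSON; the quotes inside it are plain " characters.
json <- '{"query": "royalbaby", "page": 1}'

print(json)      # shows the R string-literal form, with \" escapes
cat(json, "\n")  # writes the raw characters, no backslashes

# The backslashes are only a display artifact: the data contains none.
has_backslash <- grepl("\\", json, fixed = TRUE)
print(has_backslash)
```

If the string looks right under cat, a JSON parser such as fromJSON should receive it unchanged.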
Related
How do I properly download and load in R an OData dataset?
I tried the OData package, and even though the documentation is really simple, I am sure I am missing something trivial.
I am trying to download and parse this dataset in R, but I cannot work out how it is structured. Is it in XML format? If so, what is the separator argument for?
library(OData)
#What is the correct argument for the separator?
downloadResourceCsv("https://data.nasa.gov/OData.svc/gh4g-9sfh", sep = "")
As hrbrmstr suggests, use the RSocrata package
e.g., go to 1, click on ... in the top right,
click on "Access this Dataset via OData", click
on "Copy" to copy the OData endpoint, save it:
url <- "https://data.cdc.gov/api/odata/v4/9bhg-hcku"
library(RSocrata)
dat <- read.socrata(url)
It's in XML format, so download it first.
Try using the httr package.
library(httr)
r <- GET("http://httpbin.org/get")
Visit this site for quick-start.
After downloading, use xmlParse from the XML package.
Thank you
If I try to import a public spreadsheet like this example into R:
using:
library(httr)
url <- "https://docs.google.com/spreadsheets/d/1qIOv7MlpQAuBBgzV9SeP3gu0jCyKkKZapPrZHD7DUyQ/pub?gid=0&single=true&output=tsv"
GET(url)
The accented words come out wrong.
How can I get the right encoding?
I know I can use googlesheets package, but for public data I prefer to work with direct download, so I don't have to handle user login authentication and token refresh.
I don't know why httr::GET does not work, but this works:
data <- utils::read.csv(url, header=TRUE, sep="\t", stringsAsFactors=FALSE)
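If the accents still come out wrong, one thing to try is stating the encoding explicitly. The sketch below assumes the sheet is served as UTF-8 (not confirmed in this thread) and uses a small made-up TSV file in place of the real download, so it runs offline.

```r
# Hypothetical sample data standing in for the downloaded sheet.
tsv_text <- enc2utf8("city\tcountry\nS\u00e3o Paulo\tBrasil\nBogot\u00e1\tColombia\n")
tmp <- tempfile(fileext = ".tsv")
writeBin(charToRaw(tsv_text), tmp)  # write the UTF-8 bytes unchanged

# encoding = "UTF-8" marks the strings as UTF-8 instead of the native encoding.
dat <- read.delim(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)
print(dat)
```

With httr, the analogous step would be content(GET(url), as = "text", encoding = "UTF-8"), again assuming the response really is UTF-8.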
If you have a *nix operating system you could use
curl -o data.tsv 'https://docs.google.com/spreadsheets/d/1qIOv7MlpQAuBBgzV9SeP3gu0jCyKkKZapPrZHD7DUyQ/pub?gid=0&single=true&output=tsv'
Taking the following url
URL <- "http://www.google.de/complete/search?output=toolbar&q=TDS38311DE"
doc <- read_xml(URL)
I get the following error:
Error: Input is not proper UTF-8, indicate encoding !
Bytes: 0xDF 0x20 0x2F 0x20 [9]
Using read_html instead everything is fine.
Am I doing something wrong? Why does this error occur?
First: rvest uses xml2 to acquire content, so you should file any related issue under that package's GitHub repository rather than rvest's.
Second, read_xml takes an encoding parameter for a reason and says so: "Unless otherwise specified XML documents are assumed to be in UTF-8 or UTF-16. If the document is not UTF-8/16, and lacks an explicit encoding directive, this allows you to supply a default."
XML files can declare an encoding, but this "AJAX-y" response from Google clearly doesn't (it isn't something Google expects you to be pilfering, and it assumes it is being read, usually, by an HTML parsing engine [a.k.a. a browser], not an XML parsing engine).
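The 0xDF byte in the error message is what a sharp s looks like in ISO-8859-1, so a plausible guess (an assumption, not something the response declares) is that the document is Latin-1. The base R sketch below shows why such bytes are rejected as UTF-8 and how naming the source encoding fixes them. The sample bytes are made up.

```r
# Hypothetical sample: "Strasse" spelled with a sharp s, encoded as ISO-8859-1 (0xDF).
latin1_bytes <- as.raw(c(0x53, 0x74, 0x72, 0x61, 0xDF, 0x65))
txt <- rawToChar(latin1_bytes)

# 0xDF followed by an ASCII byte is not a valid UTF-8 sequence.
is_utf8 <- validUTF8(txt)
print(is_utf8)

# Declaring the real source encoding lets iconv produce valid UTF-8.
fixed <- iconv(txt, from = "latin1", to = "UTF-8")
print(validUTF8(fixed))
```

Under the same assumption, the xml2 equivalent would be passing encoding = "ISO-8859-1" to read_xml.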
rvest used to do this:
encoding <- encoding %||% default_encoding(x)
xml2::read_xml(httr::content(x, "raw"), encoding = encoding, base_url = x$url,
as_html = as_html)
And default_encoding does this:
default_encoding <- function(x) {
type <- httr::headers(x)$`Content-Type`
if (is.null(type)) return(NULL)
media <- httr::parse_media(type)
media$params$charset
}
but rvest now only exposes read_xml methods for session and response objects (where it does the encoding guessing).
So, you can either:
do some manual introspection prior to scraping (after reading a site's ToS),
use httr to grab a page and pass that to read_xml, or
hook up your own reader function into your script with the same idiom
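For the last option, the header-sniffing step can be sketched in base R. The helper name charset_from_content_type is hypothetical, and it only mimics what default_encoding gets from httr::parse_media; it is not rvest's code.

```r
# Hypothetical helper: pull the charset parameter out of a Content-Type header value.
charset_from_content_type <- function(content_type) {
  if (is.null(content_type)) return(NULL)
  pattern <- "charset[[:space:]]*=[[:space:]]*\"?([^\";[:space:]]+)"
  m <- regmatches(content_type,
                  regexec(pattern, content_type, ignore.case = TRUE))[[1]]
  if (length(m) < 2) return(NULL)  # no charset parameter present
  m[2]                             # first capture group: the charset name
}

print(charset_from_content_type("text/xml; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

The returned value, when not NULL, is what you would pass as the encoding argument of read_xml.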
I came across this service that will format my local markdown files rather nicely. Per the example, it is easy to get a nicely formatted response with a sample curl command.
What I am looking to do is utilize some of the options available, namely the version and "name" parameters. How would I go about structuring the curl command? Below are the code samples that I have used within R.
This code works nicely, but lacks the specified options:
doc.up <- "curl -X POST --data-urlencode content@test-markdown.md http://documentup.com/compiled > index.html"
system(doc.up)
I tried to specify the name option, but no dice:
doc.up <- "curl -X POST --data-urlencode name@mynamevar content@test-markdown.md http://documentup.com/compiled > index.html"
system(doc.up)
Any help will be greatly appreciated!
EDIT: Per some of the suggestions below, I have attempted a few ways to use RCurl and httr. I am using the default Markdown template within RStudio, but for completeness' sake, I saved it as test-markdown.Rmd and compiled it to test-markdown.md.
Using RCurl, I attempted:
## attempt 1
f <- paste(readLines('test-markdown.md'),collapse="\n" )
h <- dynCurlReader()
wp <- curlPerform(url="http://documentup.com/compiled",
postfields = c(content=f))
## attempt 2
postForm("http://documentup.com/compiled",
"content" = fileUpload('test-markdown.md'))
Using httr, I tried:
## attempt 3
tmp <- POST("http://documentup.com/compiled", body = list(content= upload_file(f)))
content(tmp)
## attempt 4
tmp <- POST("http://documentup.com/compiled", body = list(content= upload_file("test-markdown.md")))
The following code works for me:
library(httr)
url <- "http://documentup.com/compiled"
contents <- readLines("README.md")
resp <- POST(url, body = list(content = contents, name = "plyr"))
content(resp)
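For the curl route from the question, curl expects one --data-urlencode flag per form field: name=value for a literal value and name@filename to read the value from a file. The sketch below only builds the command string. The field values are placeholders, and it assumes the service accepts name as a form field, as the httr call above suggests.

```r
# Placeholder values; "name" mirrors the field used in the httr call above.
fields <- c(
  "name=mynamevar",           # literal value: name=value
  "content@test-markdown.md"  # file contents: name@filename
)

# Each form field gets its own --data-urlencode flag.
args <- paste("--data-urlencode", shQuote(fields), collapse = " ")
doc.up <- paste("curl -X POST", args, "http://documentup.com/compiled > index.html")
cat(doc.up, "\n")

# system(doc.up)  # uncomment to actually send the request
```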
@Btibert3 Actually, you can tie it to RStudio yourself using the rstudio.markdownToHTML option. See http://www.rstudio.com/ide/docs/authoring/markdown_custom_rendering
Why do I get garbled characters when parsing a web page?
I have used encoding="big-5\\IGNORE" to get normal characters, but it doesn't work.
require(XML)
url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
options(encoding="big-5")
data=htmlParse(url,isURL=TRUE,encoding="big-5\\IGNORE")
tdata=xpathApply(data,"//table[@class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header=TRUE, stringsAsFactors=FALSE)
How should I revise my code so that the garbled characters display correctly?
@MartinMorgan (below) suggested using
htmlParse(url,isURL=TRUE,encoding="big-5")
Here is an example of what is going on:
require(XML)
url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
options(encoding="big-5")
data=htmlParse(url,isURL=TRUE,encoding="big-5")
tdata=xpathApply(data,"//table[@class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header=TRUE, stringsAsFactors=FALSE)
stock
The total number of records should be 1335. In the case above it is 309; many records appear to have been lost.
This is a complicated problem. There are a number of issues:
A badly formed HTML file
The page is not standard, well-formed HTML. Let me prove my point.
Please run:
url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
txt=download.file(url,destfile="stockbig-5",quiet = TRUE)
Now try opening the downloaded file stockbig-5 with Firefox.
iconv function bug in R
If an HTML file is well formed, you can use
data=readLines(file)
datachange=iconv(data,from="source encoding",to="target encoding//IGNORE")
When an HTML file is not well formed, you cannot do it that way. In this example,
please run:
data=readLines("stockbig-5")
A warning will occur:
1: In readLines("stockbig-5") :
invalid input found on input connection 'stockbig-5'
You can't use the iconv function in R to change the encoding of a badly formed HTML file.
You can, however, do this in the shell.
I solved it myself after one hard night.
System: Debian 6 (locale UTF-8) + R 2.15 (locale UTF-8) + GNOME Terminal (locale UTF-8).
Here is the code:
require(XML)
url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
txt=download.file(url,destfile="stockbig-5",quiet = TRUE)
system('iconv -f big-5 -t UTF-8//IGNORE stockbig-5 > stockutf-8')
data=htmlParse("stockutf-8",isURL=FALSE,encoding="utf-8\\IGNORE")
tdata=xpathApply(data,"//table[@class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header=TRUE, stringsAsFactors=FALSE)
stock
I want my code to be more elegant; the shell command inside the R code is ugly:
system('iconv -f big-5 -t UTF-8//IGNORE stockbig-5 > stockutf-8')
I tried to replace it with pure R code but failed. How can I replace it with pure R code?
You can reproduce the result on your computer with the code above.
Half done, half successful; I will keep trying.
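On the pure R question, one possible direction (a sketch, not tested against the HKEX page) is to skip readLines, read the file as raw bytes, and let iconv drop unconvertible bytes through its sub argument, which is roughly what //IGNORE does in the shell. The helper name read_as_utf8 is hypothetical, and the demo uses a tiny made-up file so that it runs anywhere.

```r
# Hypothetical helper: read a file as raw bytes and convert it to UTF-8,
# dropping any bytes that cannot be converted from the source encoding.
read_as_utf8 <- function(path, from) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  txt <- rawToChar(bytes)  # fails if the file contains embedded NUL bytes
  iconv(txt, from = from, to = "UTF-8", sub = "")
}

# Small self-contained demo: "a", an invalid byte (0xFF), then "b".
tmp <- tempfile()
writeBin(as.raw(c(0x61, 0xFF, 0x62)), tmp)
cleaned <- read_as_utf8(tmp, from = "UTF-8")
print(cleaned)
```

For the file in this thread the call would be something like read_as_utf8("stockbig-5", from = "BIG5"); the accepted encoding names depend on the platform (see iconvlist()). The result could then be passed to htmlParse with asText = TRUE and encoding = "UTF-8".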