I would like to use a website from R. The website is http://soundoftext.com/ where I can download WAV. files with audios from a given text and a language (voice).
There are two steps to download the voice in WAV:
1) Insert text and Select language. And Submit
2) On the new window, click Save and select folder.
Until now, I could get the xml tree, convert it to list and modify the values of text and language. However, I don't know how to convert the list to XML (with the new values) and execute it. Then, I would need to do the second step too.
Here is my code so far:
require(RCurl)
require(XML)
webpage <- getURL("http://soundoftext.com/")
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
x<-xmlToList(pagetree)
# Inserting word
x$body$div$div$div$form$div$label$.attrs[[1]]<-"Raúl"
x$body$div$div$div$form$div$label$.attrs[[1]]
# Select language
x$body$div$div$div$form$div$select$option$.attrs<-"es"
x$body$div$div$div$form$div$select$option$.attrs
I have follow this approach but there is an error with "tag".
UPDATED: I just tried to use rvest to download the audio file, however, it does not respond or trigger anything. What am I doing wrong (missing)?
url <- "http://soundoftext.com/"
s <- html_session(url)
f0 <- html_form(s)
f1 <- set_values(f0[[1]], text="Raúl", lang="es")
attr(f1, "type") <- "Submit"
s[["fields"]][["submit"]] <- f1
attr(f1, "Class") <- "save"
test <- submit_form(s, f1)
I see nothing wrong with your approach and it was worth a try.. that's what I'd write too.
The page is somewhat annoying in that uses jquery to append new divs at each request. I still think that should be possible to do with rvest, but I found a fun workaround using the httr package:
library(httr)
url <- "http://soundoftext.com/sounds"
fd <- list(
submit = "save",
text = "Banana",
lang="es"
)
resp<-POST(url, body=fd, encode="form")
id <- content(resp)$id
download.file(URLencode(paste0("http://soundoftext.com/sounds/", id)), destfile = 'test.mp3')
Essentially when it send the POST request to the server, an ID come back, if we simply GET that id when can download the file.
Creator of Sound of Text here. Sorry it took so long for me to find this post.
I just redesigned Sound of Text, so your html parsing probably won't work anymore.
However, there is now an API that you can use which should make things considerably easier for you.
You can find the documentation here: https://soundoftext.com/docs
I apologize if it's not very good. Please let me know if you have any questions.
Related
This seems like a simple problem but I've been struggling with it for a few days. This is a minimum working example rather than the actual problem:
This question seemed similat but I was unable to use the answer to solve my problem.
In a browser, I go to this url, and click on [Search] (no need to make any choices from the lists), and then on [Download Results] (choosing, for example, the Xlsx option). The file then downloads.
To automate this in R I have tried:
library(rvest)
url1 <- "https:/secure.gamblingcommission.gov.uk/PublicRegister/Search"
sesh1 <- html_session(url1)
form1 <-html_form(sesh1)[[1]]
subform <- submit_form(sesh1, form1)
Using Chrome Developer tools I find the url being used to initiate the download, so I try:
url2 <- "https:/secure.gamblingcommission.gov.uk/PublicRegister/Search/Download"
res <- GET(url = url2, query = list(format = "xlsx"))
However this does not download the file:
> res$content
raw(0)
I also tried
download.file(url = paste0(url2, "?format=xlsx") , destfile = "down.xlsx", mode = "wb")
But this downloads nothing:
> Content type '' length 0 bytes
> downloaded 0 bytes
Note that, in the browser, pasting url2 and adding the format query does initiate the download (after doing the search from url1)
I thought that I should somehow be using the session info from the initial code block to do the download, but so far I can't see how.
Thanks in advance for any help !
You are almost there and your intuition is correct about using the session info.
You just need to use rvest::jump_to to navigate to the second url and then write it to disk:
library(rvest)
url1 <- "https:/secure.gamblingcommission.gov.uk/PublicRegister/Search"
sesh1 <- html_session(url1)
form1 <-html_form(sesh1)[[1]]
subform <- submit_form(sesh1, form1)
url2 <- "https://secure.gamblingcommission.gov.uk/PublicRegister/Search/Download"
#### The above is your original code - below is the additional code you need:
download <- jump_to(subform, paste0(url2, "?format=xlsx"))
writeBin(download$response$content, "down.xlsx")
I'm attempting web scraping. Before posting my question, I looked up several similar questions such as this, and this. However, I still get stuck in my problem.
Speicifically, I'm trying to extract the listed prices on a second-hand cars website. # In case you are unable to see the data because you're not a registered user of this website, I also attached the screenshot of this website's html elements:
the screen shot.
The code I executed are:
library(httr)
library(XML)
url <- "https://www.sahibinden.com/vasita?query_text_mf=alfa+romeo+giulietta&query_text=alfa+romeo+giulietta"
htmlresponse <- GET(url)
htmlcontent <- content(htmlresponse, as="text")
parsedhtml <- htmlParse(htmlcontent, asText = TRUE)
# The above is just following the conventions, and seems okay.
prices <- xpathSApply(doc = parsedhtml, path = "//div/td[#class='searchResultsPriceValue']", fun = xmlValue)
# This command returned me an empty list.
Can someone have a look and give me some advices? Thank you very much!
There are 2 parts of my questions as I explored 2 methods in this exercise, however I succeed in none. Greatly appreciated if someone can help me out.
[PART 1:]
I am attempting to scrape data from a webpage on Singapore Stock Exchange https://www2.sgx.com/derivatives/negotiated-large-trade containing data stored in a table. I have some basic knowledge of scraping data using (rvest). However, using Inspector on chrome, the html hierarchy is much complex then I expected. I'm able to see that the data I want is hidden under < div class= "table-container" >,and here's what I've tied:
library(rvest)
library(httr)
library(XML)
SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl, stringsASfactors = FALSE)
html_nodes(SGXdata,".table-container")
However, nothing has been picked up by the code and I'm doubt if I'm using these code correctly.
[PART 2:]
As I realize that there's a small "download" button on the page which can download exactly the data file i want in .csv format. So i was thinking to write some code to mimic the download button and I found this question Using R to "click" a download file button on a webpage, but i'm unable to get it to work with some modifications to that code.
There's a few filtera on the webpage, mostly I will be interested downloading data for a particular business day while leave other filters blank, so i've try writing the following function:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
crawlSGXdata = function(date){
POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
body = NULL
encode = "form",
write_disk("SGXdata.csv")) -> resfile
res = read.csv(resfile)
return(res)
}
I was intended to put the function input "date" into the “body” argument, however i was unable to figure out how to do that, so I started off with "body = NULL" by assuming it doesn't do any filtering. However, the result is still unsatisfactory. The file download is basically empty with the following error:
Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
The content is loaded dynamically from an API call returning json. You can find this in the network tab via dev tools.
The following returns that content. I find the total number of pages of results and loop combining the dataframe returned from each call into one final dataframe containing all results.
library(jsonlite)
url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
r <- jsonlite::fromJSON(url)
num_pages <- r$meta$totalPages
df <- r$data
url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'
if(num_pages > 1){
for(i in seq(1, num_pages)){
newUrl <- gsub("placeholder", i , url2)
newdf <- jsonlite::fromJSON(newUrl)$data
df <- rbind(df, newdf)
}
}
I have a url:
url <- "http://www.railroadpm.org/home/RPM/Performance%20Reports/BNSF.aspx"
that contains a link to a csv file that I would like to download. The "Export to CSV" link on the above page. The problem is that the csv file is not part of a url, but rather it's javascript. What I would like to do is access the link and create a dataframe out of the csv file. The javascript is:
javascript:__doPostBack('ctl11$btnCSV','')
and from that I can tell that the id is
"ctl11_btnCSV"
but I am unsure of how this fits into RCUrl, which from SO seems to be the best way to access this data. Any help would be appreciated.
Thanks.
There was zero effort put into this question (esp since the OP came to the conclusion that RCurl is the current best practice for web wrangling in R) but anytime an SO web scraping question that involves a SharePoint site can actually be answered (Microsoft SharePoint is one of the worst things invented ever next to Windows) it's worth posting an answer.
library(rvest)
library(httr)
# make an initial connection to get cookies
httr::GET(
"http://www.railroadpm.org/home/RPM/Performance%20Reports/BNSF.aspx"
) -> res
# retrieve some hidden bits we need to pass b/c SharePoint is a wretched thing.
pg <- content(res, as = "parsed")
for_post <- html_nodes(pg, "input[type='hidden']")
# post the hidden form & save out the CSV
httr::POST(
"http://www.railroadpm.org/home/RPM/Performance%20Reports/BNSF.aspx",
body = as.list(
c(
setNames(
html_attr(for_post, "value"),
html_attr(for_post, "id")
),
`__EVENTTARGET` = "ctl11$btnCSV"
)
),
write_disk("meaures.csv"),
progress()
) -> res
I'm stuck on this one after much searching....
I started with scraping the contents of a table from:
http://www.skatepress.com/skates-top-10000/artworks/
Which is easy:
data <- data.frame()
for (i in 1:100){
print(paste("page", i, "of 100"))
url <- paste("http://www.skatepress.com/skates-top-10000/artworks/", i, "/", sep = "")
temp <- readHTMLTable(stringsAsFactors = FALSE, url, which = 1, encoding = "UTF-8")
data <- rbind(data, temp)
} # end of scraping loop
However, I need to additionally scrape the detail that is contained in a pop-up box when you click on each name (and on the artwork title) in the list on the site.
I can't for the life of me figure out how to pass the breadcrumb (or artist-id or painting-id) through in order to make this happen. Since straight up using rvest to access the contents of the nodes doesn't work, I've tried the following:
I tried passing the painting id through in the url like this:
url <- ("http://www.skatepress.com/skates-top-10000/artworks/?painting_id=576")
site <- html(url)
But it still gives an empty result when scraping:
node1 <- "bread-crumb > ul > li.activebc"
site %>% html_nodes(node1) %>% html_text(trim = TRUE)
character(0)
I'm (clearly) not a scraping expert so any and all assistance would be greatly appreciated! I need a way to capture this additional information for each of the 10,000 items on the list...hence why I'm not interested in doing this manually!
Hoping this is an easy one and I'm just overlooking something simple.
This will be a more efficient base scraper and you can get progress bars for free with the pbapply package:
library(xml2)
library(httr)
library(rvest)
library(dplyr)
library(pbapply)
library(jsonlite)
base_url <- "http://www.skatepress.com/skates-top-10000/artworks/%d/"
n <- 100
bind_rows(pblapply(1:n, function(i) {
mutate(html_table(html_nodes(read_html(sprintf(base_url, i)), "table"))[[1]],
`Sale Date`=as.Date(`Sale Date`, format="%m.%d.%Y"),
`Premium Price USD`=as.numeric(gsub(",", "", `Premium Price USD`)))
})) -> skatepress
I added trivial date & numeric conversions.
I belive your main issue is that the site requires a login to get the additional data. You should give that (i.e. logging in) a shot using httr and grab the wordpress_logged_inXXXXXXX… cookie from that endeavour. I just grabbed it from inspecting the session with Developer Tools in Chrome and that will also work for you (but it's worth the time to learn how to do it via httr).
You'll need to scrape two additional <a … tags from each table row. The one for "artist" looks like:
Pablo Picasso
You can scrape the contents with:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artist.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id="pab_pica_1881"),
verbose()) -> artist_response
fromJSON(content(artist_response, as="text"))
(The return value is too large to post here)
The one for "artwork" looks like:
Les femmes d′Alger (Version ′O′)
and you can get that in similar fashion:
POST("http://www.skatepress.com/wp-content/themes/skatepress/scripts/query_artwork.php",
set_cookies(wordpress_logged_in_XXX="userid%XXXXXreallylongvalueXXXXX…"),
encode="form",
body=list(id=576),
verbose()) -> artwork_response
fromJSON(content(artwork_response, as="text"))
That's not huge but I won't clutter the response with it.
NOTE that you can also use rvest's html_session to do the login (which will get you cookies for free) and then continue to use that session in the scraping (vs read_html) which will mean you don't have to do the httr GET/PUT.
You'll have to figure out how you want to incorporate that data into the data frame or associate it with it via various id's in the data frame (or some other strategy).
You can see it call those two php scripts via Developer Tools and it also shows the data it passes in. I'm also really shocked that site doesn't have any anti-scraping clauses in their ToS but they don't.