rvest jump_to with special characters - r

I'm having difficulty getting rvest to jump_to URLs with special characters in them. When I type the link into Chrome it works, but in R/rvest I get an error:
Error in curl::curl_fetch_memory(url, handle=handle) :
Could not resolve host: NA
URLs that have issues:
http://incrediblewinestore.com/ProductDetail.asp?title=-You-Had-Me-At-Merlot--Napkins&UPCCode=876718049392
http://incrediblewinestore.com/ProductDetail.asp?title=10-BARREL-RASPBERRY-CRUSH-6PK&UPCCode=`851538002611
http://incrediblewinestore.com/ProductDetail.asp?title=14-HANDS-CABERNET-SAUVIGNON&UPCCode=\088586001895
URL that works:
http://incrediblewinestore.com/ProductDetail.asp?title=Cuarenta-y-Tres-Liqueur-43&UPCCode=029929115411
The code I've tried:
library(stringr)
library(rvest)

# Load first page, try to go to search, but expect age-check
iws_ac_url <- "http://incrediblewinestore.com"
iws_session <- html_session(iws_ac_url)

age_gate <- iws_session %>%
  html_node("form[name='AgeGate']")
age_gate <- html_form(age_gate)
age_gate <- set_values(age_gate, PageAction = 'Yes21')

# Submit form and enter the rest of the site
iws_site <- submit_form(iws_session, age_gate)

# Non-working links
temp_link <- paste0("http://incrediblewinestore.com", "/ProductDetail.asp?title=-You-Had-Me-At-Merlot--Napkins&UPCCode=876718049392")
iws_site %>% jump_to(temp_link)

temp_link <- paste0("http://incrediblewinestore.com", "/ProductDetail.asp?title=10-BARREL-RASPBERRY-CRUSH-6PK&UPCCode=`851538002611")
iws_site %>% jump_to(temp_link)

# Working link
temp_link <- paste0("http://incrediblewinestore.com", "/ProductDetail.asp?title=Cuarenta-y-Tres-Liqueur-43&UPCCode=029929115411")
iws_site %>% jump_to(temp_link)

As usual, once I found the answer I was shocked at its simplicity. I just needed the right function: URLencode(url, reserved = FALSE)
temp_link <- paste0("http://incrediblewinestore.com", URLencode("/ProductDetail.asp?title=10-BARREL-RASPBERRY-CRUSH-6PK&UPCCode=`851538002611", reserved = FALSE))
The key was that I needed a function that would not encode reserved characters such as =, ?, and &. The other function I tried converted all the characters, reserved ones included.
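As a quick check of the difference, here is a minimal sketch using the query string from the second problematic link above:
# reserved = FALSE escapes only characters that are illegal in a URL (the backtick
# here), while reserved = TRUE also escapes ?, = and &, which breaks the query string.
path <- "/ProductDetail.asp?title=10-BARREL-RASPBERRY-CRUSH-6PK&UPCCode=`851538002611"
URLencode(path, reserved = FALSE)
URLencode(path, reserved = TRUE)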

Related

Problems extracting data using JSON in R (getting a lexical error)

Related to the question asked here: R - Using SelectorGadget to grab a dataset
library(rvest)
library(jsonlite)
library(magrittr)
library(stringr)
library(purrr)
library(dplyr)

get_state_index <- function(states, state) {
  return(match(T, map(states, ~ {
    .x$name == state
  })))
}

s <- read_html("https://www.opentable.com/state-of-industry") %>% html_text()
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])
fullbook <- all_data$covidDataCenter$fullbook

hawaii_dataset <- tibble(
  date = fullbook$headers %>% unlist() %>% as.Date(),
  yoy = fullbook$states[get_state_index(fullbook$states, "Hawaii")][[1]]$yoy %>% unlist()
)
I am trying to grab the Hawaii dataset from the State tab. The code was working before, but now it is throwing an error on this part of the code:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])
I am getting the error:
Error: lexical error: invalid char in json text. NA (right here) ------^
Any proposed solutions? The website seems to have stayed the same over the past year, so what kind of change is causing the code to break?
EDIT: The solution proposed by @QHarr:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2])
This was working for a while, but then it seems their website changed the underlying HTML again.
Change the regex pattern as shown below to ensure it correctly captures the desired string within the response text, i.e. the JavaScript object to use for all_data:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2])
Note: in R the single backslash is doubled, e.g. \\s in the code rather than the \s you would write in plain regex.
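For illustration, here is a small sketch of why the pattern change matters; the exact failure mode on the live page is an assumption, but a JavaScript object spanning multiple lines is a typical culprit, since . does not match newlines by default while [\s\S] matches any character (and the new pattern also drops the ;w\. anchor):
# Toy example: the old lazy pattern fails once the object spans lines,
# while the new pattern still captures the full object.
txt <- "__INITIAL_STATE__ = {\"a\":1,\n\"b\":{}};w."
stringr::str_match(txt, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2]   # NA: "." stops at the newline
stringr::str_match(txt, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2] # the full object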

Web Scraping in R using rvest and finding the html_node

I am trying to find the right html_node to fetch the replies count for each post in this forum: https://d.cosx.org/. I used the CSS selector tool and it suggested .DiscussionListItem-count, but that does not seem to work.
My code:
library(rvest)
library(tidyverse)

COS_link <- read_html("https://d.cosx.org/")
COS_link %>%
  # The relevant tag
  html_nodes(css = '.DiscussionListItem-count') %>%
  html_text()
I would like to fetch the replies count, for example 1k for the 1st post and 30 for the 2nd post. Am I missing something, or does anyone have a better idea?
You can use the API and parse the JSON response for the title and participantCount attributes. The API endpoint returning that info is:
https://d.cosx.org/api
Substring the response to remove the trailing 0 and leading ac76, then parse with a JSON library of your choice.
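A minimal sketch of that route, assuming the response arrives wrapped as described above and that the payload has the usual Flarum shape (the data/attributes field names are assumptions and may need adjusting after inspecting the actual payload):
library(httr)
library(jsonlite)
library(stringr)

resp <- GET("https://d.cosx.org/api")
txt <- content(resp, as = "text", encoding = "UTF-8")

# Strip any framing around the JSON (e.g. the leading "ac76" / trailing "0"
# mentioned above) by keeping everything from the first "{" to the last "}".
json_txt <- str_match(txt, "\\{[\\s\\S]*\\}")[, 1]
api <- fromJSON(json_txt)

# In a Flarum-style payload the discussions sit under data/attributes.
titles <- api$data$attributes$title
counts <- api$data$attributes$participantCount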
A less optimal approach is to regex the JSON string out of the original page:
library(rvest)
library(jsonlite)
library(stringr)

url <- "https://d.cosx.org/"
r <- read_html(url) %>%
  html_nodes('body') %>%
  html_text() %>%
  toString()
x <- str_match_all(r, 'flarum\\.core\\.app\\.load\\((.*)\\);')
json <- jsonlite::fromJSON(x[[1]][, 2])
counts <- json$resources$attributes$participantCount
For those wishing to pair the titles with the counts, and who don't have Chinese locale settings, a colleague helped me write the following:
library(rvest)
library(jsonlite)
library(stringr)
library(corpus)

url <- "https://d.cosx.org/"
r <- read_html(url) %>%
  html_nodes('body') %>%
  html_text() %>%
  toString()
x <- str_match_all(r, 'flarum\\.core\\.app\\.load\\((.*)\\);')
json <- jsonlite::fromJSON(x[[1]][, 2])
titles <- json$resources$attributes$title
counts <- json$resources$attributes$participantCount
cf <- corpus_frame(name = titles, text = counts)
names(cf) <- c("titles", "counts")
print(cf[which(!is.na(cf$counts)), ], 100)

Adding ifelse() into a Map function

I've got a simple Map function that scrapes text files from a blog site. It's pretty easy to write a scraper that gets all of the text files and downloads them to my working directory. My goal: use an ifelse() or a plain if statement to only scrape a file posted on a certain date.
E.g., if four files were posted on 1/31/19 and I pointed my ifelse at that date, the function would return those four files. Code:
library(tidyverse)
library(lubridate)  # for parse_date_time()
library(rvest)

# URL set up
url <- "https://www.example-blog/posts.aspx"
page <- html_session(url, config(ssl_verifypeer = FALSE))

# Picking elements
links <- page %>%
  html_nodes("td") %>%
  html_nodes("a") %>%
  html_attr("href")

# Getting date elements
dates <- page %>%
  html_nodes("node.dates") %>%
  html_text()
dates <- parse_date_time(dates, "%m/%d/%Y", tz = "EST",
                         locale = Sys.getlocale("LC_TIME"))

# Function
out <- Map(function(ln) {
  fun1 <- html_session(URLencode(
    paste0("https://www.example-blog", ln)),
    config(ssl_verifypeer = FALSE))
  write <- writeBin(fun1$response$content)
  ifelse(dates == '2019-01-31', write, "He's dead, Jim")
}, links)
I've tried various ways to get that if statement in there, and have also tried moving the writeBin() around. (Usually writeBin() would not be vectorized; I did it this way for easy viewing in my ifelse.) The error:
Error in ans[test & ok] <- rep(yes, length.out = length(ans))[test & ok] :
replacement has length zero
If I leave out the if code, everything works great; it just returns many text files, when I only want the ones from the specified date.
Based on the description, it seems like we need to check the corresponding 'dates' for each of the 'links' and then apply the if/else. If that is the case, we can pass two arguments to Map:
Map(function(ln, y) {
  fun1 <- html_session(URLencode(
    paste0("https://www.example-blog", ln)),
    config(ssl_verifypeer = FALSE))
  write <- writeBin(fun1$response$content)
  if (y == '2019-01-31') {
    write
  } else "He's dead, Jim"
},
links, dates)
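If you actually want the matching files written to disk, here is a hedged variant of the same idea; writeBin() needs a destination connection, which the snippet above leaves out (it was only there for easy viewing), and the file name built from ln is an assumption for illustration:
Map(function(ln, y) {
  fun1 <- html_session(URLencode(paste0("https://www.example-blog", ln)),
                       config(ssl_verifypeer = FALSE))
  if (as.Date(y) == as.Date("2019-01-31")) {
    dest <- basename(ln)                   # assumed file name taken from the link
    writeBin(fun1$response$content, dest)  # write the raw response body to disk
    dest
  } else {
    "He's dead, Jim"
  }
}, links, dates)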

rvest limits the results to 24 items

Good evening everyone,
I am currently trying to scrape the Zalando website to get the names of all the products that appear on the first two pages of the following URL: (https://www.zalando.nl/damesschoenen-sneakers/)
Here is my code:
require(rvest)
require(dplyr)
url <- read_html('https://www.zalando.nl/damesschoenen-sneakers/')
selector_name <- '.z-nvg-cognac_brandName-2XZRz'
output <- html_nodes(x = url, css = selector_name) %>% html_text
The result is a list of 24 items, while there are 86 products on the page. Has anyone encountered this issue before? Any idea on how to solve it?
Thank you for your help.
Thomas
I just tried what Nicolas Velasqueaz suggested:
url <- read_html('https://www.zalando.nl/damesschoenen-sneakers/')
write_html(url, file = "test_url.html")
selector_name <- '.z-nvg-cognac_brandName-2XZRz'
test_file <- read_html("test_url.html")
output <- html_nodes(x = test_file, css = selector_name) %>% html_text
The results are the same. I still have only 24 items that show up.
So if anyone has a solution, it would be very much appreciated.
Thank you for your kind answer. I will dive into that direction.
I also found a way to get the brand names without RSelenium; here is my code:
library('httr')
library('magrittr')
library('rvest')

################# FUNCTION #################
extract_data <- function(firstPosition, lastPosition){
  mapply(function(first, last){
    substr(pageContent, first, last) %>%
      gsub("\\W", "\\1 ", .) %>%
      gsub("^ *|(?<= ) | *$", "", ., perl = TRUE)
  },
  firstPosition, lastPosition)
}
############################################

url <- 'https://www.zalando.nl/damesschoenen-sneakers/'
page <- GET(url)
pageContent <- content(page, as = 'text')

# Get the brand name of the products
firstPosition <- unlist(gregexpr('brand_name', pageContent)) + nchar('brand_name') + 1
lastPosition <- unlist(gregexpr('is_premium', pageContent)) - 2
extract_data(firstPosition, lastPosition)
Unfortunately it starts getting difficult when you want something other than the brand name, so maybe the best solution is to do it with RSelenium.
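As another RSelenium-free option, here is a hedged sketch that stays regex-based but avoids the position bookkeeping: pull the values that follow the "brand_name" key directly (the key name comes from the page content used above; the exact quoting around it is an assumption, and other fields could be extracted the same way):
library(httr)
library(stringr)

url <- 'https://www.zalando.nl/damesschoenen-sneakers/'
pageContent <- content(GET(url), as = 'text')

# Capture everything between the quotes that follow "brand_name":
brands <- str_match_all(pageContent, '"brand_name":"(.*?)"')[[1]][, 2]
head(brands)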

need help in extracting the first google search result using html_node in R

I have a list of hospital names for which I need to extract the first Google search result URL. Here is the code I'm using:
library(rvest)
library(urltools)
library(RCurl)
library(httr)

getWebsite <- function(name) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page <- read_html(url)
  results <- page %>%
    html_nodes("cite") %>%
    html_text()
  result <- results[1]
  return(as.character(result))
}

# c is the character vector of hospital names
websites <- data.frame(Website = sapply(c, getWebsite))
View(websites)
For short URLs this code works fine, but when the link is long it appears in R with "..." (e.g. www.medicine.northwestern.edu/divisions/allergy-immunology/.../fellowship.html), and it ends up in the dataframe the same way, with "...". How can I extract the actual URLs without the "..."? Appreciate your help!
This is a working example, tested on my computer:
library("rvest")
# Load the page
main.page <- read_html(x = "https://www.google.com/search?q=software%20programming")
links <- main.page %>%
html_nodes(".r a") %>% # get the a nodes with an r class
html_attr("href") # get the href attributes
#clean the text
links = gsub('/url\\?q=','',sapply(strsplit(links[as.vector(grep('url',links))],split='&'),'[',1))
# as a dataframe
websites <- data.frame(links = links, stringsAsFactors = FALSE)
View(websites)
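To wire this back into the original getWebsite() helper, a hedged adaptation might look like the following; it returns the first full result URL instead of the truncated <cite> text (Google's markup changes frequently, so the .r a selector may need updating):
library(rvest)

getWebsite <- function(name) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page <- read_html(url)
  links <- page %>%
    html_nodes(".r a") %>%
    html_attr("href")
  # Keep only result links, strip the /url?q= prefix and trailing parameters
  links <- gsub('/url\\?q=', '',
                sapply(strsplit(links[grep('url', links)], split = '&'), '[', 1))
  if (length(links) == 0) NA_character_ else links[1]
}

# hospital_names stands in for the vector of hospital names (an assumption)
# websites <- data.frame(Website = sapply(hospital_names, getWebsite))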
