Rvest won't return data - web-scraping

I have been trying to scrape the following table:

Your problem was that your request returned an HTML page, not a JSON response, so parsing it as JSON failed with the error you saw.
(I can't tell you exactly whether that was because you left out accept_json() or because the URL you used was a bit off.)
Either way, reverse-engineering the essentials of the API request behind the table you linked, you'd have to put together something like this:
library(httr)
library(dplyr)
library(purrr)

first_req <- GET("https://www.barchart.com")
xsrf_token <- cookies(first_req) %>%
  filter(name == 'XSRF-TOKEN') %>%
  pull(value) %>%
  URLdecode()

req <- GET(
  "https://www.barchart.com/proxies/core-api/v1/quotes/get",
  query = list(
    lists = "stocks.optionable.by_sector.all.us",
    fields = "symbol,symbolName,lastPrice,priceChange,percentChange,highPrice,lowPrice,volume,tradeTime,symbolCode,symbolType,hasOptions",
    orderBy = "symbol",
    orderDir = "asc",
    meta = "field.shortName,field.type,field.description",
    hasOptions = TRUE,
    #page = 1,
    #limit = 100,
    raw = 1
  ),
  content_type_json(),
  accept_json(),
  add_headers(
    "x-xsrf-token" = xsrf_token,
    "referrer" = "https://www.barchart.com/options/stocks-by-sector?page=1"
  )
)

table_data <- req %>%
  content() %>%
  .$data %>%
  map_dfr(unlist)
This will get you the full list of 4258 items and coerce it into a tibble for convenience :)
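As an aside, the commented-out page and limit parameters suggest the endpoint can also paginate. If you'd rather make smaller requests, here is a rough sketch of that variant; it reuses xsrf_token from above and assumes (untested) that page and limit behave the way they appear to on the site:

# hypothetical paginated variant: fetch 100 rows per page, reusing xsrf_token from above
fetch_page <- function(page_number) {
  GET(
    "https://www.barchart.com/proxies/core-api/v1/quotes/get",
    query = list(
      lists = "stocks.optionable.by_sector.all.us",
      fields = "symbol,symbolName,lastPrice,priceChange,percentChange,volume,tradeTime",
      orderBy = "symbol",
      orderDir = "asc",
      hasOptions = TRUE,
      page = page_number,
      limit = 100,
      raw = 1
    ),
    content_type_json(),
    accept_json(),
    add_headers(
      "x-xsrf-token" = xsrf_token,
      "referrer" = "https://www.barchart.com/options/stocks-by-sector?page=1"
    )
  ) %>%
    content() %>%
    .$data %>%
    map_dfr(unlist)
}

first_three_pages <- map_dfr(1:3, fetch_page)  # e.g. the first 300 rows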

Related

Scraping reviews from Multiple pages in R

I was struggling to get the scraping done on a web page. My task is to scrape the reviews from the website and run a sentiment analysis on them, but I have only managed to scrape the first page. How can I scrape all the reviews of the same movie when they are spread across multiple pages?
This is my code:
library(rvest)

read_html("https://www.rottentomatoes.com/m/dune_2021/reviews") %>%
  html_elements(xpath = "//div[@class='the_review']") %>%
  html_text2()
This only gets me the reviews from the first page, but I need the reviews from all the pages. Any help would be highly appreciated.
You could avoid the expensive overhead of a browser and use httr2. The page uses a query-string GET request to grab the reviews in batches. For each batch, the offset parameters startCursor and endCursor can be picked up from the previous request, and there is a hasNextPage flag which can be used to stop requesting additional reviews. For the initial request, the title id needs to be picked up and the offset parameters can be set to ''.
After collecting all the reviews (in a list, in my case), I apply a custom function to extract some items of possible interest from each review and generate a final dataframe.
Acknowledgments: I took the idea of using repeat() from @flodal here
library(tidyverse)
library(httr2)

get_reviews <- function(results, n) {
  r <- request("https://www.rottentomatoes.com/m/dune_2021/reviews") %>%
    req_headers("user-agent" = "mozilla/5.0") %>%
    req_perform() %>%
    resp_body_html() %>%
    toString()

  title_id <- str_match(r, '"titleId":"(.*?)"')[, 2]
  start_cursor <- ""
  end_cursor <- ""

  repeat {
    r <- request(sprintf("https://www.rottentomatoes.com/napi/movie/%s/criticsReviews/all/:sort", title_id)) %>%
      req_url_query(f = "", direction = "next", endCursor = end_cursor, startCursor = start_cursor) %>%
      req_perform() %>%
      resp_body_json()

    results[[n]] <- r$reviews
    nextPage <- r$pageInfo$hasNextPage

    if (!nextPage) break

    start_cursor <- r$pageInfo$startCursor
    end_cursor <- r$pageInfo$endCursor
    n <- n + 1
  }
  return(results)
}

n <- 1
results <- list()
data <- get_reviews(results, n)

df <- purrr::map_dfr(data %>% unlist(recursive = F), ~
  data.frame(
    date = .x$creationDate,
    reviewer = .x$publication$name,
    url = .x$reviewUrl,
    quote = .x$quote,
    score = if (is.null(.x$scoreOri)) {
      NA_character_
    } else {
      .x$scoreOri
    },
    sentiment = .x$scoreSentiment
  ))
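Since the question mentions running a sentiment analysis on the reviews, here is a minimal sketch of that step, assuming the syuzhet package (not part of the answer above) is installed; it simply scores the quote column of the df built above:

library(dplyr)
library(syuzhet)   # assumption: syuzhet is available; any sentiment scorer would do

df_scored <- df %>%
  mutate(quote_sentiment = get_sentiment(quote, method = "bing"))  # numeric sentiment score per review quote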

R - Use Twitter API to get every tweet from an account

My goal is to get EVERY tweet ever for any twitter account. I picked the NYTimes for this example.
The code below works, but it only pulls the last 100 tweets. max_results does not allow you to put a value over 100.
The code below is almost fully copy-paste-able; you would just need your own bearer token.
How can I expand this to give me every tweet from an account?
One idea is that I can loop it for every day since the account was created, but that seems tedious if there is a faster way.
# NYT Example --------------------------------------------------------------------
library(httr)
library(jsonlite)
library(tidyverse)

bearer_token <- "insert your bearer token here"
headers <- c(`Authorization` = sprintf('Bearer %s', bearer_token))
params <- list(`user.fields` = 'description')
handle <- 'nytimes'
url_handle <- sprintf('https://api.twitter.com/2/users/by?usernames=%s', handle)

response <- httr::GET(url = url_handle,
                      httr::add_headers(.headers = headers),
                      query = params)
json_data <- fromJSON(httr::content(response, as = "text"), flatten = TRUE)
json_data %>%
  as_tibble()

NYT_ID <- json_data$data$id
url_handle <- paste0("https://api.twitter.com/2/users/", NYT_ID, "/tweets")
params <- list(`tweet.fields` = 'id,text,author_id,created_at,attachments,public_metrics',
               `max_results` = '100')

response <- httr::GET(url = url_handle,
                      httr::add_headers(.headers = headers),
                      query = params)
json_data <- fromJSON(httr::content(response, as = "text"), flatten = TRUE)

NYT_tweets <- json_data$data %>%
  as_tibble() %>%
  select(-id, -author_id, -9)
NYT_tweets
For anyone that finds this later on, I found a solution that works for me.
Using the start_time and end_time parameters you can specify the date range the tweets must fall between. I was able to pull all the tweets from November, for example, and then rbind those to the ones from December, and so on. Sometimes I had to do two pulls (first half of March, second half of March) to get all of them, but it worked for this.
params <- list(`tweet.fields` = 'id,text,author_id,created_at,attachments,public_metrics',
               `max_results` = '100',
               `start_time` = '2021-11-01T00:00:01.000Z',
               `end_time` = '2021-11-30T23:58:21.000Z')
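A minimal sketch of how those monthly pulls could be strung together, reusing url_handle and headers from the code above. The month boundaries are illustrative, bind_rows stands in for the rbind step, and each window is assumed to contain at most 100 tweets (otherwise you would still need to split it further, as described above):

# hypothetical helper: pull one date window and return it as a tibble
get_window <- function(start_time, end_time) {
  params <- list(`tweet.fields` = 'id,text,author_id,created_at,attachments,public_metrics',
                 `max_results` = '100',
                 `start_time` = start_time,
                 `end_time` = end_time)
  response <- httr::GET(url = url_handle,
                        httr::add_headers(.headers = headers),
                        query = params)
  fromJSON(httr::content(response, as = "text"), flatten = TRUE)$data %>%
    as_tibble()
}

# e.g. November and December 2021, bound into one tibble
NYT_tweets_all <- bind_rows(
  get_window('2021-11-01T00:00:01.000Z', '2021-11-30T23:58:21.000Z'),
  get_window('2021-12-01T00:00:01.000Z', '2021-12-31T23:59:59.000Z')
)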

Passing many values to an API using R

I wish to scale my working API query so that it queries many IDs and stores the results in a nice rectangular data frame.
I need some help understanding how I can make my code take many input values and how to store what is returned.
My working code is as follows:
pacman::p_load(tidyverse, httr, jsonlite, purrr)

path <- "https://npiregistry.cms.hhs.gov/api/?"

request <- httr::GET(url = path,
                     query = list(version = "2.0",
                                  number = 1154328938))

response <- content(request, as = "text", encoding = "UTF-8")

df <- jsonlite::fromJSON(response, flatten = TRUE) %>%
  data.frame()

providerData <- df %>%
  select(results.number,
         results.basic.name,
         results.basic.gender,
         results.basic.credential,
         results.taxonomies) %>%
  unnest_wider(results.taxonomies) %>%
  rename(Provider_NPI = results.number,
         Provider_Name = results.basic.name,
         Provider_Gender = results.basic.gender,
         Provider_Credentials = results.basic.credential,
         Provider_Taxonomy = desc,
         Provider_State = state) %>%
  select(-code, -license, -primary)
I now wish to query the 4 IDs below and store the results in the same data format as the example above.
I have tried using lapply and building my own function, but I don't fully understand how to create objects that store the returned values.
My function looks as follows:
getNPI <- function(object) {
  httr::GET(url = path,
            query = list(version = "2.0",
                         number = object))
}

providerIDs <- c('1073666335',
                 '1841395357',
                 '1104023381',
                 '1477765634')

test <- lapply(providerIDs, getNPI)
I'm pretty certain I need some sort of object, like a list or data frame, to store the values returned by httr::GET, but this is where I am falling down. The other piece is how to pull the appropriate values out of the returned objects and store them in a neat data frame.
Your help would be greatly appreciated.
You have to add the "cleaning" steps inside your getNPI function so that it returns a data frame; then you can use do.call to combine all the data into a final data frame:
Example
getNPI <- function(object) {
  request <- httr::GET(url = path,
                       query = list(version = "2.0",
                                    number = object))

  df <- content(request, as = "text", encoding = "UTF-8") %>%
    jsonlite::fromJSON(., flatten = TRUE) %>%
    data.frame()

  df %>%
    select(results.number,
           results.basic.name,
           results.basic.gender,
           results.basic.credential,
           results.taxonomies) %>%
    unnest_wider(results.taxonomies)
  # Add more selections or mutations as needed
}

test <- lapply(providerIDs, getNPI)

# Use do.call to rbind the list elements into the final df
final_df <- do.call("rbind", test)
Hope this can help you.
NOTE: for rbind to work with do.call as expected, all the column names have to be the same.
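As a hedged alternative to the do.call("rbind", ...) step (not part of the answer above), purrr::map_dfr() will loop and combine in one go and, unlike rbind, fill any columns missing from a particular response with NA:

library(purrr)
library(dplyr)

# same idea as lapply() + do.call("rbind", ...), but combined in a single step
final_df <- map_dfr(providerIDs, getNPI)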

r Web scraping: Unable to read the main table

I am new to web scraping. I am trying to scrape a table with the following code, but I am unable to get it. The source of the data is
https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1
library(XML)

url <- "https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1"
urlYAnalysis <- paste(url, sep = "")

webpage <- readLines(urlYAnalysis)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")
Tab <- readHTMLTable(tableNodes[[1]])
I copied this approach from the link (Web scraping of key stats in Yahoo! Finance with R), where it is applied to Yahoo Finance data.
In my opinion it should be table 12, i.e. readHTMLTable(tableNodes[[12]]), but when I try tableNodes[[12]] it always gives me an error:
Error in do.call(data.frame, c(x, alis)) :
variable names are limited to 10000 bytes
Please suggest a way to extract the table and to combine the data from the other tabs as well (Fundamental, Technical and Performance).
This data is returned dynamically as JSON. In R (which behaves differently from Python's requests here) you get HTML back, from which you can extract a given page's results as JSON. A page includes the info for all the tabs and 50 records. The first page gives you the total record count, so you can calculate the total number of pages to loop over to get all the results. You can combine them into a final dataframe during a loop over the total number of pages, altering the pn param of the XHR POST body to the appropriate page number for each new POST request. There are two required headers.
It is probably a good idea to write a function that accepts a page number in its signature and returns the given page's JSON as a dataframe, then apply it via a tidyverse package to handle the loop and the combining of results into the final dataframe.
library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)

headers = c(
  'User-Agent' = 'Mozilla/5.0',
  'X-Requested-With' = 'XMLHttpRequest'
)

data = list(
  'country[]' = '6',
  'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
  'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
  'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
  'exchange[]' = '109',
  'exchange[]' = '127',
  'exchange[]' = '51',
  'exchange[]' = '108',
  'pn' = '1', # this is the page number and should be altered in a loop over all pages. 50 results (rows) per page
  'order[col]' = 'eq_market_cap',
  'order[dir]' = 'd'
)

r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks', httr::add_headers(.headers = headers), body = data)
s <- r %>% read_html() %>% html_node('p') %>% html_text()

page1_data <- jsonlite::fromJSON(str_match(s, '(\\[.*\\])')[1, 2])
total_rows <- str_match(s, '"totalCount\":(\\d+),')[1, 2] %>% as.integer()
num_pages <- ceiling(total_rows / 50)
My current attempt at combining, which I would welcome feedback on, is below. It returns all the columns, for all pages, and I have to handle missing columns and different column orderings, as well as one column being a data.frame. As the number of columns returned is far greater than those visible on the page, you could simply revise this to subset the returned columns with a mask containing just the columns present in the tabs (a sketch of that follows the code below).
library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)
library(tidyverse)
library(data.table)

headers = c(
  'User-Agent' = 'Mozilla/5.0',
  'X-Requested-With' = 'XMLHttpRequest'
)

data = list(
  'country[]' = '6',
  'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
  'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
  'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
  'exchange[]' = '109',
  'exchange[]' = '127',
  'exchange[]' = '51',
  'exchange[]' = '108',
  'pn' = '1', # this is the page number and is altered in the loop over all pages. 50 results (rows) per page
  'order[col]' = 'eq_market_cap',
  'order[dir]' = 'd'
)

get_data <- function(page_number){
  data['pn'] = page_number
  r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks', httr::add_headers(.headers = headers), body = data)
  s <- r %>% read_html() %>% html_node('p') %>% html_text()
  if (page_number == 1) {
    return(s)
  } else {
    return(data.frame(jsonlite::fromJSON(str_match(s, '(\\[.*\\])')[1, 2])))
  }
}

clean_df <- function(df){
  interim <- df['viewData']
  df_minus <- subset(df, select = -c(viewData))
  df_clean <- cbind.data.frame(c(interim, df_minus))
  return(df_clean)
}

initial_data <- get_data(1)
df <- clean_df(data.frame(jsonlite::fromJSON(str_match(initial_data, '(\\[.*\\])')[1, 2])))
total_rows <- str_match(initial_data, '"totalCount\":(\\d+),')[1, 2] %>% as.integer()
num_pages <- ceiling(total_rows / 50)

dfs <- map(.x = 2:num_pages,
           .f = ~clean_df(get_data(.)))

r <- rbindlist(c(list(df), dfs), use.names = TRUE, fill = TRUE)
write_csv(r, 'data.csv')
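To act on the note above about keeping only the columns shown in the site's tabs, one option is to subset the combined result with dplyr::select(any_of(...)). The names in keep_cols below are purely illustrative placeholders, not names taken from the actual response:

library(dplyr)

# keep_cols is a hypothetical mask of the columns you actually want;
# any_of() silently skips any names that were not returned
keep_cols <- c("name", "symbol", "last", "eq_market_cap")
r_subset <- r %>% select(any_of(keep_cols))
write_csv(r_subset, 'data_subset.csv')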

How to pass multiple values in a rvest submission form

This is a follow-up to a prior thread. The code works fantastically for a single value, but when I try to pass more than one value I get the following error based on the length of the function:
Error in vapply(elements, encode, character(1)) :
  values must be length 1,
 but FUN(X[[1]]) result is length 3
Here is a sample of the code. In most instances I have been able just to name an object and scrape that way.
library(httr)
library(rvest)
library(dplyr)

b <- c('48127', '48180', '49504')

POST(
  url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
  body = list(zipcode = b),
  encode = "form"
) -> res
I was wondering if a loop to insert the values into the form would be the right way to go? However, my loop-writing skills are still in development and I am unsure of where to place it; in addition, when I call the loop it doesn't print line by line, it just returns NULL results.
# d isn't listed in the above code as it returns NULL
d <- for (i in 1:3) { nrow(b) }
Here is an approach to send multiple POST requests
library(httr)
library(rvest)
b <- c('48127','48180','49504')
For each element in b, apply a function that sends the appropriate POST request:
res <- lapply(b, function(x){
  res <- POST(
    url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
    body = list(zipcode = x),
    encode = "form"
  )
  res <- read_html(content(res, as = "raw"))
})
Now for each element of the list res you should do the parsing steps explained by hrbrmstr: How can I Scrape a CGI-Bin with rvest and R?
library(tidyverse)
I will use hrbrmstr's code since he is king and it is already clear to you. The only thing we are doing here is applying it to each element of the res list.
res_list <- lapply(res, function(x){
  rows <- html_nodes(x, "table[width='300'] > tr > td")
  ret <- data_frame(
    record = !is.na(html_attr(rows, "bgcolor")),
    text = html_text(rows, trim = TRUE)
  ) %>%
    mutate(record = cumsum(record)) %>%
    filter(text != "") %>%
    group_by(record) %>%
    summarise(x = paste0(text, collapse = "|")) %>%
    separate(x, c("store", "address1", "city_state_zip", "phone_and_or_distance"), sep = "\\|", extra = "merge")
  return(ret)
})
or using map from purrr
res %>%
  map(function(x){
    rows <- html_nodes(x, "table[width='300'] > tr > td")
    data_frame(
      record = !is.na(html_attr(rows, "bgcolor")),
      text = html_text(rows, trim = TRUE)
    ) %>%
      mutate(record = cumsum(record)) %>%
      filter(text != "") %>%
      group_by(record) %>%
      summarise(x = paste0(text, collapse = "|")) %>%
      separate(x, c("store", "address1", "city_state_zip", "phone_and_or_distance"),
               sep = "\\|", extra = "merge") -> ret
    return(ret)
  })
If you would like this in a data frame:
res_df <- data.frame(do.call(rbind, res_list),                    # rbinds the list elements
                     b = rep(b, times = sapply(res_list, nrow)))  # labels each row with the element of b it came from
You can put the values inside the POST as below:
b <- c('48127', '48180', '49504')

for (i in 1:length(b)) {
  POST(
    url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
    body = list(zipcode = b[i]),
    encode = "form"
  ) -> res
  # YOUR CODE HERE (for getting the content of the page, etc.)
}
But since the "res" value will be different for every zipcode, you need to put the rest of your code inside the area I commented; otherwise you only get the last value.
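A minimal sketch of that idea, keeping each zipcode's parsed page in a list inside the loop instead of overwriting res each time (the parsing of each page would then follow the steps from the first answer):

library(httr)
library(rvest)

b <- c('48127', '48180', '49504')
res_list <- vector("list", length(b))

for (i in seq_along(b)) {
  res <- POST(
    url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
    body = list(zipcode = b[i]),
    encode = "form"
  )
  # store each parsed page so the next iteration doesn't overwrite it
  res_list[[i]] <- read_html(content(res, as = "raw"))
}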
