Can't scrape all the rows using rvest - css

My goal is to scrape all of the diamond data from bluenile.com. I have some code that seems to do that, but it only grabs the first 61 rows.
By the way, I am using the SelectorGadget Chrome plugin to get the CSS selectors. If I scroll down a little, the highlighting stops. Is that something to do with the website?
library('rvest')
le_url <- "https://www.bluenile.com/diamonds/round-cut?track=DiaSearchRDmodrn"
webpage <- read_html(le_url)
shape_data_html <- html_nodes(webpage,'.shape')
price_data_html <- html_nodes(webpage,'.price')
carat_data_html <- html_nodes(webpage,'.carat')
cut_data_html <- html_nodes(webpage,'.cut')
color_data_html <- html_nodes(webpage,'.color')
clarity_data_html <- html_nodes(webpage,'.clarity')
#Converting data to text
shape_data <- html_text(shape_data_html)
price_data <- html_text(price_data_html)
carat_data <- html_text(carat_data_html)
cut_data <- html_text(cut_data_html)
color_data <- html_text(color_data_html)
clarity_data <- html_text(clarity_data_html)
# combine into a matrix; the first row holds the header text, so use it as column names
le_mat <- cbind(shape_data, price_data, carat_data, cut_data, color_data, clarity_data)
le_df <- le_mat[-1, ]
colnames(le_df) <- le_mat[1, ]
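As an aside, a sketch of assembling those scraped vectors directly into a data.frame (assuming, as the code above implies, that the first element of each vector is the column's header text and that all vectors have the same length; diamonds_df is just an illustrative name):
diamonds_df <- data.frame(shape   = shape_data[-1],
                          price   = price_data[-1],
                          carat   = carat_data[-1],
                          cut     = cut_data[-1],
                          color   = color_data[-1],
                          clarity = clarity_data[-1],
                          stringsAsFactors = FALSE)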

Data is dynamically added via an API call as you scroll down the page. The API call has a query string that lets you specify the start row (startIndex) and the number of results per page (pageSize); the maximum page size appears to be 1000. The return is JSON, from which you can extract all the info you want, including the total number of rows (available under the countRaw key). So you can request the initial 1000 results, parse out the total row count from countRaw, and then loop, adjusting the startIndex parameter each time, until you have all the results.
You can use a json parser e.g. jsonlite to handle the json response.
Example API endpoint call for first 1000 results:
https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=0&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus=
library(jsonlite)
url <- 'https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=0&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus='
r <- jsonlite::fromJSON(url)
print(r$countRaw)
You get a list of 8 elements from each call. r$results is a dataframe containing info of main interest.
(Part of the response was shown here.)
Given the indicated result count I was expecting I could do something like (bearing in mind my limited R experience):
total <- r$countRaw
url2 <- 'https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=placeholder&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus='
if (total > 1000) {
  for (i in seq(1000, total + 1, by = 1000)) {
    newUrl <- gsub("placeholder", i, url2)
    newdf <- jsonlite::fromJSON(newUrl)$results
    # do something with df e.g. merge
  }
}
However, it seems that there are only results for first two calls i.e. the initial df from r$results shown above and then:
url2 <- 'https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=1000&pageSize=1000&_=1562612289615&sortDirection=asc&sortColumn=default&shape=RD&hasVisualization=true&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN&skus='
r <- jsonlite::fromJSON(url2)
df2 <- r$results
Searching the page with the CSS selector .row yields 1002 results, versus the much larger "All diamonds" total indicated on the page; so I think there is some exploration to do around filters.
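For reference, a minimal sketch of the full paging loop described above (assuming the endpoint keeps returning rows for every startIndex, which, per the note above, it currently does not seem to do beyond the second page; the query string is the same as in the example call, with startIndex substituted in):
library(jsonlite)
library(dplyr)

base <- paste0('https://www.bluenile.com/api/public/diamond-search-grid/v2',
               '?startIndex=%d&pageSize=1000&_=1562612289615&sortDirection=asc',
               '&sortColumn=default&shape=RD&hasVisualization=true',
               '&isFiltersExpanded=false&astorFilterActive=false&country=USA',
               '&language=en-us&currency=USD&productSet=BN&skus=')

first <- fromJSON(sprintf(base, 0))
total <- first$countRaw                      # total row count reported by the API
pages <- list(first$results)

if (total > 1000) {
  for (start in seq(1000, total - 1, by = 1000)) {
    page <- fromJSON(sprintf(base, start))$results
    if (is.null(page) || nrow(page) == 0) break   # stop if the API stops returning rows
    pages[[length(pages) + 1]] <- page
  }
}

# combine; if results contains nested data-frame columns, do.call(rbind, pages) may be safer
all_results <- bind_rows(pages)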

Related

R - scraping nodes from website pages - how to insert individual pages link in the final data frame

Dear Stackoverflow users,
I am trying to scrape two nodes from different pages of a website (Psychology Today; the pages refer to mental health professionals, MHPs).
First, I create a scraping function and then I create a loop that includes this function.
Eventually, I am able to create a data frame. However, I would like to include--as a third variable--the full link to individual pages I have scraped.
How can I include this information in the data frame?
This is the loop:
library(rvest)
library(plyr)   # for rbind.fill()

j <- 1                              # running index; the MHP id increases by one each iteration
MHP_codes <- c(150130:150170)       # therapist identifier range
df_list <- vector(mode = "list", length = length(MHP_codes))  # collects individual MHP information

for (code1 in MHP_codes) {
  delayedAssign("do.next", {next})
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  # Reading the HTML code from the website.
  # tryCatch also catches codes with a missing URL; however, if an error occurs
  # on those pages the loop stops, which is why we need delayedAssign and
  # force(do.next) inside tryCatch.
  URL <- tryCatch(read_html(URL),
                  error = function(e) force(do.next))
  df_list[[j]] <- getProfile(URL)   # the function puts the scraped data into a row
  na.omit(df_list)                  # eliminates rows with only NAs, which happens if the URL does not exist
  j <- j + 1
}

final_df <- rbind.fill(df_list)     # gather the rows into one data set
Should I modify the scraping function, getProfile(), or can I create the new variable within the loop but outside of the scraping function? And how would I do that? I have been thinking about it for days and, now that I have the complete data set, I am still stuck on this issue.
This post is related to two others, with additional information on the scraping function: post1 and post2.
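One possibility (a minimal sketch building on the loop above; it assumes getProfile() returns a one-row data frame, and profile_link is just an illustrative column name) is to attach the link to the scraped row inside the loop:
j <- 1
df_list <- vector(mode = "list", length = length(MHP_codes))

for (code1 in MHP_codes) {
  link <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  page <- tryCatch(read_html(link), error = function(e) NULL)
  if (is.null(page)) { j <- j + 1; next }   # skip codes whose page cannot be read
  row <- getProfile(page)                   # assumed to return a one-row data frame
  row$profile_link <- link                  # the full link as a third variable
  df_list[[j]] <- row
  j <- j + 1
}

final_df <- plyr::rbind.fill(df_list)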

Faster download for paginated nested JSON data from API in R?

I am downloading nested json data from the UN's SDG Indicators API, but using a loop for 50,006 paginations is waaayyy too slow to ever complete. Is there a better way?
https://unstats.un.org/SDGAPI/swagger/#!/Indicator/V1SdgIndicatorDataGet
I'm working in RStudio on a Windows laptop. Getting to the json nested data and structuring into a dataframe was a hard-fought win, but dealing with the paginations has me stumped. No response from the UN statistics email.
Maybe an 'apply' would do it? I only need data from 2004, 2007, and 2011 - maybe I can filter, but I don't think that would help the fundamental issue.
I'm probably misunderstanding the API structure - I can't see how querying 50,006 pages individually can be functional for anyone. Thanks for any insight!
library(dplyr)
library(httr)
library(jsonlite)
# Get data from the first page, and initialize the data frame and data types
page1 <- fromJSON("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data", flatten = TRUE)

# Get the number of pages from the metadata on the first page
pages <- page1$totalPages

SDGdata <- data.frame()
for (j in 1:25) {
  SDGdatarow <- rbind(page1$data[j, 1:16])
  SDGdata <- rbind(SDGdata, SDGdatarow)
}
SDGdata[1] <- as.character(SDGdata[[1]])
SDGdata[2] <- as.character(SDGdata[[2]])
SDGdata[3] <- as.character(SDGdata[[3]])
# Loop through all the remaining pages
baseurl <- "https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data"
for (i in 2:pages) {
  mydata <- fromJSON(paste0(baseurl, "?page=", i), flatten = TRUE)
  message("Retrieving page ", i)
  for (j in 1:25) {
    SDGdatarow <- rbind(mydata$data[j, 1:16])
    rownames(SDGdatarow) <- as.numeric((i - 1) * 25 + j)
    SDGdata <- rbind.data.frame(SDGdata, SDGdatarow)
  }
}
I do get the data I want, in a nice data frame, but inevitably the query hits a connection issue after a few hundred pages, or my laptop shuts down, etc. At about 5 seconds a page, 5 * 50,006 / 3600 ~= 70 hours.
I think I (finally) figured out a workable solution: I can set the number of elements per page, which results in a manageable number of pages to call. (I also filtered for just the 3 years I want, which reduces the data.) Through experimentation I found that about 1/10th of the elements download OK in a single request, so I set each call to 1/10 of the elements and loop over 10 pages. It takes about 20 minutes, but that is better than 70 hours, and it works without losing the connection.
# Initialize the data frame
SDGdata <- data.frame()

# Call the API once (with the year filters) to get the total number of elements
page1 <- fromJSON("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?timePeriod=2004&timePeriod=2007&timePeriod=2011", flatten = TRUE)
perpage <- ceiling(page1$totalElements / 10)

ptm <- proc.time()
for (i in 1:10) {
  SDGpage <- fromJSON(paste0("https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?timePeriod=2004&timePeriod=2007&timePeriod=2011&pageSize=", perpage, "&page=", i), flatten = TRUE)
  message("Retrieving page ", i, ": ", (proc.time() - ptm)[3], " seconds")
  SDGdata <- rbind(SDGdata, SDGpage$data[, 1:16])
}
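If the repeated rbind() inside the loop ever becomes a bottleneck, a variant worth trying (a sketch, reusing the same filtered endpoint and the perpage value computed above) is to collect each page in a list and combine once at the end:
library(jsonlite)
library(dplyr)

base <- "https://unstats.un.org/SDGAPI/v1/sdg/Indicator/Data?timePeriod=2004&timePeriod=2007&timePeriod=2011"

# fetch all 10 pages into a list, then bind once
# (avoids copying the growing SDGdata data frame on every iteration)
pages_list <- lapply(1:10, function(i) {
  message("Retrieving page ", i)
  fromJSON(paste0(base, "&pageSize=", perpage, "&page=", i), flatten = TRUE)$data[, 1:16]
})
SDGdata <- bind_rows(pages_list)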

Reading nodes from multiple html and storing result as a vector

I have a list of locally saved HTML files. I want to extract multiple nodes from each HTML file and save the results in a vector; afterwards, I would like to combine them in a data frame. I have a piece of code for one node, which works (see below), but it seems quite long and inefficient if I apply it to ~20 variables. Also, something really strange happens when saving to the vector (XXX_name): it starts with the last observation and then continues with the first, second, .... Do you have any suggestions for simplifying the code or making it more efficient?
# Extracts the name variable and stores it in a vector
XXX_name <- c()
for (i in 1:216) {
  XXX_name <- c(XXX_name, name)
  mydata <- read_html(files[i], encoding = "latin-1")
  reads_name <- html_nodes(mydata, 'h1')
  name <- html_text(reads_name)
  # print(i)
  # print(name)
}
Many thanks!
You can put the workings inside a function, then apply that function to each of your variables with map.
First, create the function:
read_names <- function(var, node) {
  mydata <- read_html(files[var], encoding = "latin-1")
  reads_name <- html_nodes(mydata, node)
  name <- html_text(reads_name)
}
Then we create a data frame with all possible combinations of the inputs and apply the function to that:
library(tidyverse)

# vector_of_nodes is a placeholder for the selectors you want to extract, e.g. c('h1', ...)
inputs <- crossing(var = 1:216, node = vector_of_nodes)
output <- map2(inputs$var, inputs$node, read_names)
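To get from that list back to a data frame (a sketch building on the inputs data frame above, assuming you want one row per file and one column per node; the paste() collapse guards against a selector matching several elements in one file):
library(tidyr)   # for pivot_wider(); already loaded via tidyverse above

# one character value per (file, node) combination
inputs$value <- map2_chr(inputs$var, inputs$node,
                         ~ paste(read_names(.x, .y), collapse = "; "))

# reshape: one row per file, one column per node
result_df <- pivot_wider(inputs, names_from = node, values_from = value)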

R – Using a loop on a list of Twitter handles to extract tweets and create multiple data frames

I’ve got a df that consists of Twitter handles that I wish to scrape on a regular basis.
df=data.frame(twitter_handles=c("#katyperry","#justinbieber","#Cristiano","#BarackObama"))
My Methodology
I would like to run a for loop that loops over each of the handles in my df and creates multiple dataframes:
1) By using the rtweet library, I would like to gather tweets using the search_tweets function.
2) Then I would like to merge the new tweets to existing tweets for each dataframe, and then use the unique function to remove any duplicate tweets.
3) For each dataframe, I'd like to add a column with the name of the Twitter handle used to obtain the data. For example: For the database of tweets obtained using the handle #BarackObama, I'd like an additional column called Source with the handle #BarackObama.
4) In the event that the API returns 0 tweets, I would like Step 2) to be ignored. Very often, when the API returns 0 tweets, I get an error as it attempts to merge an empty dataframe with an existing one.
5) Finally, I would like to save the results of each scrape to the different dataframe objects. The name of each dataframe object would be its Twitter handle, in lower case and without the #
My Desired Output
My desired output would be 4 dataframes, katyperry, justinbieber, cristiano & barackobama.
My Attempt
library(rtweet)
library(ROAuth)

# Accessing the Twitter API using my Twitter credentials
key    <- "yKxxxxxxxxxxxxxxxxxxxxxxx"
secret <- "78EUxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
setup_twitter_oauth(key, secret)

# Data frame of Twitter handles
df <- data.frame(twitter_handles = c("#katyperry", "#justinbieber", "#Cristiano", "#BarackObama"))

# Setting up the query
query <- as.character(df$twitter_handles)
query <- unlist(strsplit(query, ","))

tweets.dataframe <- list()

# Loop through the Twitter handles & store the results as individual data frames
for (i in 1:length(query)) {
  # include_rts = FALSE strips tweets that contain RTs
  result <- search_tweets(query[i], n = 10000, include_rts = FALSE)
  tweets.dataframe <- c(tweets.dataframe, result)
  tweets.dataframe <- unique(tweets.dataframe)
}
However, I have not been able to figure out how to make the for loop skip the concatenation step when the API returns 0 tweets for a given handle.
Also, my for loop does not return 4 data frames in my environment; it stores the results as a large list.
I identified a post that addresses a problem very similar to the one I face, but I find it difficult to adapt to my question.
Your inputs would be greatly appreciated.
Edit: I have added Step 3) in My Methodology, in case you are able to help with that too.
tweets.dataframe <- list()

# Loop through the twitter handles & store the results as individual data frames
for (i in 1:length(query)) {
  result <- search_tweets(query[i], n = 10, include_rts = FALSE)
  if (nrow(result) > 0) {   # only if the result has data
    tweets.dataframe <- c(tweets.dataframe, list(result))
  }
}

# tweets.dataframe is now a list where each element is a data frame containing
# the results from an individual query; for example...
tweets.dataframe[[1]]

# to combine them into one data frame
do.call(rbind, tweets.dataframe)
In response to a reply:
twitter_handles <- c("#katyperry", "#justinbieber", "#Cristiano", "#BarackObama")

# Loop through the twitter handles & store the results as individual data frames
for (handle in twitter_handles) {
  result <- search_tweets(handle, n = 15, include_rts = FALSE)
  result$Source <- handle            # record which handle produced these tweets
  df_name <- substring(handle, 2)    # data frame name: the handle without the leading #
  if (exists(df_name)) {
    assign(df_name, unique(rbind(get(df_name), result)))
  } else {
    assign(df_name, result)
  }
}
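An alternative worth considering (a sketch, not from the original answers) keeps the results in a named list rather than using assign()/get(), and only at the end exposes each element as its own data frame object with list2env():
library(rtweet)

twitter_handles <- c("#katyperry", "#justinbieber", "#Cristiano", "#BarackObama")
results <- list()

for (handle in twitter_handles) {
  result <- search_tweets(handle, n = 15, include_rts = FALSE)
  if (nrow(result) == 0) next                   # step 4: skip handles that return no tweets
  result$Source <- handle                       # step 3: record the originating handle
  name <- tolower(substring(handle, 2))         # step 5: "katyperry", "barackobama", ...
  results[[name]] <- rbind(results[[name]], result)   # rbind with any rows already stored under this name (NULL on the first pass)
}

# optionally expose each list element as its own data frame in the global environment
list2env(results, envir = .GlobalEnv)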

Automate webscraping with r

I have managed to scrape content for a single url, but am struggling to automate it for multiple urls.
Here how it is done for a single page:
library(XML); library(data.table)
theurl <- paste("http://google.com/",url,"/ul",sep="")
convertUTF <- htmlParse(theurl, encoding = "UTF-8")
tables <- readHTMLTable(convertUTF)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
table <- tables[[which.max(n.rows)]]
TableData <- data.table(table)
Now I have a vector of urls and want to scrape each for the corresponding table:
Here, I read in data comprising multiple http links:
ur.l <- data.frame(read.csv(file.choose(), header=TRUE, fill=TRUE))
theurl <- matrix(NA, nrow=nrow(ur.l), ncol=1)
for (i in 1:nrow(ur.l)) {
  url <- as.character(ur.l[i, 2])
}
Each of the three additional urls that you provided refer to pages that contain no tables, so it's not a particularly useful example dataset. However, a simple way to handle errors is with tryCatch. Below I've defined a function that reads in tables from url u, calculates the number of rows for each table at that url, then returns the table with the most rows as a data.table.
You can then use sapply to apply this function to each url (or, in your case, each org ID, e.g. 36245119) in a vector.
library(XML); library(data.table)

scrape <- function(u) {
  tryCatch({
    tabs <- readHTMLTable(file.path("http://finstat.sk", u, "suvaha"),
                          encoding = 'utf-8')
    tab <- tabs[[which.max(sapply(tabs, function(x) nrow(x)))]]
    data.table(tab)
  }, error = function(e) e)
}
urls <- c('36245119', '46894853', '46892460', '46888721')
res <- sapply(urls, scrape)
Take a look at ?tryCatch if you want to improve the error handling. Presently the function simply returns the errors themselves.
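As a possible follow-up (a sketch; org_id is just an illustrative column name), you can drop the error objects and stack the successful scrapes, keyed by org ID:
library(data.table)

# keep only the elements that are not error objects, then stack them,
# using the list names (the org IDs) as an id column
ok <- res[!sapply(res, inherits, "error")]
combined <- rbindlist(ok, idcol = "org_id", fill = TRUE)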
