I wish to extract the table with the ranks and returns from a sample URL
https://www.valueresearchonline.com/funds/fundSelector/returns.asp?cat=10&exc=susp%2Cclose&rettab=st
So far I have tried rvest:
# Reading the HTML code from the website
webpage <- read_html(urlString)
# Using CSS selectors to scrape the section
tables <- webpage %>% html_node("tr") %>% html_text()
tables <- webpage %>% html_node(".fundtool_cat") %>% html_text()
I need a dataframe/table with the name of the scheme along with ranks and returns for all the periods mentioned.
library(rvest)
urlString <- "https://www.valueresearchonline.com/funds/fundSelector/returns.asp?cat=10&exc=susp%2Cclose&rettab=st"
urlString %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="fundCatData"]/table[1]') %>%
  html_table(fill = TRUE)
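Note that html_table() returns a list of data frames even when only one node matches, so to get the data frame itself you can pull out the first element. A minimal sketch, assuming the XPath above matches the table:
library(rvest)

fund_tbl <- urlString %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="fundCatData"]/table[1]') %>%
  html_table(fill = TRUE) %>%
  .[[1]]  # html_table() returns a list; take the first (only) table

head(fund_tbl)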
This is no doubt very simple, so apologies, but I am new to web scraping and am trying to extract multiple data points in one call using rvest. Let's take for example the following code (NB: I have not used the actual website, which I have replaced in this snippet with xxxxxx.com):
univsalaries <- lapply(paste0('https://xxxxxx.com/job/p', 1:20, '/key=%F9%80%76&final=1&jump=1&PGTID=0d3408-0000-24gf-ac2b-810&ClickID=2'),
  function(url_base){
    url_base %>% read_html() %>%
      html_nodes('.salary') %>%
      html_text()
  })
Let's say there is another html node I want to scrape (.company). Obviously I can make a separate call and fetch that data, but I want to understand the syntax of how I could extract the information in the same call.
I tried to put it in the following structure, but the code sent me to the debugger:
.... function(url_base){
url_base %>% read_html() %>%
Salary <- univsalaries %>%
html_nodes('.salary') %>% html_text()
Company <- univsalaries %>%
html_nodes('.company') %>% html_text()
dt<-tibble(Salary,Company)
})
Read the webpage once and then you can extract multiple values from the same page.
library(purrr)
library(rvest)
univsalaries <- map(paste0('https://xxxxxx.com/job/p', 1:20, '/key=%F9%80%76&final=1&jump=1&PGTID=0d3408-0000-24gf-ac2b-810&ClickID=2'),
  function(url_base){
    webpage <- url_base %>% read_html()
    data.frame(Salary = webpage %>% html_nodes('.salary') %>% html_text(),
               Company = webpage %>% html_nodes('.company') %>% html_text())
  })
This would give you a list of dataframes (one for every link); if you need one combined dataframe, use map_df instead of map, as in the sketch below.
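For example, a minimal sketch identical to the answer above except for the mapping verb:
library(purrr)
library(rvest)

# map_df row-binds the per-page data frames into one
univsalaries <- map_df(paste0('https://xxxxxx.com/job/p', 1:20, '/key=%F9%80%76&final=1&jump=1&PGTID=0d3408-0000-24gf-ac2b-810&ClickID=2'),
  function(url_base){
    webpage <- url_base %>% read_html()
    data.frame(Salary = webpage %>% html_nodes('.salary') %>% html_text(),
               Company = webpage %>% html_nodes('.company') %>% html_text())
  })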
I am trying to scrape a certain element from multiple websites. One by one, rvest works fine, but is there a possibility to scrape all the URLs at once? I have a csv file with all the URLs, but I can't insert more than a single string value into read_html(). Do you have an idea? Thanks in advance.
Right now I work like this:
test1<- read_html("https://www.startnext.com/higchic")
Site1 <- test1 %>%
html_nodes(".js-accordeon:nth-child(4) .accordeon__answer") %>%
html_text() %>%
as.character()
test2<- read_html("https://www.startnext.com/sauberkasten")
Site2 <- test2 %>%
html_nodes(".js-accordeon:nth-child(4) .accordeon__answer") %>%
html_text() %>%
as.character()
read_html() only accepts a single URL, so you can't hand it the whole vector in one call; instead, iterate over the URLs, for example with purrr::map().
Here's the code:
library(rvest)
library(purrr)

url <- c("https://www.vox.com/", "https://www.bbc.com/")
page <- map(url, ~ read_html(.x) %>% html_nodes("p") %>% html_text())
str(page)
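If the URLs live in a csv file, you can read them into a vector first and map over that. A sketch, where the file name "urls.csv" and its column name "url" are assumptions to adapt to your file:
library(rvest)
library(purrr)
library(readr)

urls <- read_csv("urls.csv")$url  # "urls.csv" and the "url" column are placeholders

sites <- map(urls, ~ read_html(.x) %>%
               html_nodes(".js-accordeon:nth-child(4) .accordeon__answer") %>%
               html_text())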
I am trying to extract a table from https://www.basketball-reference.com/leagues/NBA_2018.html. The table I want is the Team Per Game Stats table. This webpage has multiple tables, and when I try to extract the tables from it, it only returns the first two tables on the page.
How can I get the table I want using R? The code I used is below:
library(rvest)
url <- "https://www.basketball-reference.com/leagues/NBA_2018.html"
# read the link
html <- read_html(url)
tables <- html %>% html_table(fill = TRUE)
View(tables)
The table is commented out in the page source. You can grab the comments with XPath and then pull out the table you want:
library(rvest)
page <- read_html('https://www.basketball-reference.com/leagues/NBA_2018.html')
df <- page %>% html_nodes(xpath = '//comment()') %>%
html_text() %>%
paste(collapse = '') %>%
read_html() %>%
html_node('#team-stats-per_game') %>%
html_table()
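If you are not sure of the table's id, a small extension of the same idea (reusing page from above) lists every table hidden inside the comments so you can pick the right selector:
# parse the comments once, then list the ids of all commented-out tables
commented <- page %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html()

commented %>% html_nodes('table') %>% html_attr('id')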
I know how to loop when a page is paginated, but I wish to scrape multiple pieces of information/html_nodes in one loop function, and I am not sure whether that can be set up. So far I have tried the following. It's basically a job-search website where I want the company name, company description, and number of open positions.
I use sprintf to generate the URLs for pages 1-14:
urlingtek <- sprintf("https://www.jobindex.dk/virksomhedsoversigt/kanal/ingenioer?page=%d", 1:14)
I have made a function, which works to scrape one data source:
virksomhed <- function(url){
  url %>% read_html() %>%
    html_nodes('.jix_company_name_link a') %>%
    html_text()
}
virk <- lapply(urlingtek, virksomhed)
But I wish to scrape all three elements at once if possible. So far I have tried:
jobvirksom <- function(alt){
alt %>%
read_html() %>%
html_nodes('.jix_company_name_link a') %>%
html_text()
html_nodes('.jix_companyindex_overview_ad_content') %>%
html_text()
html_nodes('.jix_active a') %>%
html_text()
}
So far without any luck. It would be a lot better if I could scrape it all at once, pass it through lapply, and turn it into one list.
Here is the start of a solution. In this case, with only 14 web pages to parse, it is sometimes easier to just use a loop; with this number of pages, the time difference between a for loop and lapply is insignificant.
I notice the web pages are not consistently formatted, so this solution will need additional work where the data is missing or inconsistent. As written it will work for the first two pages and fail on the third, where the overview is missing (a sketch of one way to handle that follows the code).
library(rvest)

urlingtek <- sprintf("https://www.jobindex.dk/virksomhedsoversigt/kanal/ingenioer?page=%d", 1:14)

# define empty data frame to store all data
alllistings <- data.frame()

for (i in urlingtek){
  print(i)
  # read the page just once
  page <- read_html(i)
  # parse company name
  company <- page %>% html_nodes('.jix_company_name_link a') %>% html_text()
  # remove blank company names
  company <- trimws(company)
  company <- company[nchar(company) > 1]
  # parse company overview
  overv <- page %>% html_nodes('.jix_companyindex_overview_ad_content') %>% html_text()
  # parse active information
  active <- page %>% html_nodes('.jix_active a') %>% html_text()
  # create temporary dataframe to store data from this loop
  tempdf <- data.frame(company, overv, active)
  # combine temp with all data
  alllistings <- rbind(alllistings, tempdf)
}
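One way to make the loop robust to missing overviews is to iterate over each listing's container node and use html_node() (singular), which yields NA for an absent child instead of silently shortening the vector. A sketch, where the container selector '.jix_companyindex_overview_ad' is a guess you should confirm by inspecting the page:
library(rvest)

alllistings <- data.frame()
for (i in urlingtek){
  page <- read_html(i)
  # one node per listing; this container class is an assumption
  listings <- page %>% html_nodes('.jix_companyindex_overview_ad')
  tempdf <- data.frame(
    company = listings %>% html_node('.jix_company_name_link a') %>% html_text(trim = TRUE),
    overv   = listings %>% html_node('.jix_companyindex_overview_ad_content') %>% html_text(trim = TRUE),
    active  = listings %>% html_node('.jix_active a') %>% html_text(trim = TRUE)
  )
  alllistings <- rbind(alllistings, tempdf)
}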
I am new to data scraping and trying to use rvest to scrape all the salary data from the long table on this website:
https://www.fedsdatacenter.com/federal-pay-rates/
As expected, the following code gives me the variable names of the data:
url <- "https://www.fedsdatacenter.com/federal-pay-rates/"
names <- url %>%
read_html() %>%
html_node('thead') %>%
html_text()
However, why does this code give me no data?
url <- "https://www.fedsdatacenter.com/federal-pay-rates/"
dat <- url %>%
read_html() %>%
html_node('tbody') %>%
html_text()
I followed an example in this article: http://bradleyboehmke.github.io/2015/12/scraping-html-tables.html
url <- "https://www.fedsdatacenter.com/federal-pay-rates/"
sal <- url %>%
read_html() %>%
html_node('#table-example') %>%
html_table(fill=TRUE)
Again, it produces only the column names with no data.
Also, how should I read through all the tens of thousands of pages to get all the data from the table? I suspect that I need to use the information in "#table-example_wrapper > div:nth-child(2) > div", but I don't know how. Could anyone help?
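A likely explanation: the page appears to build the table with JavaScript (the #table-example id and the _wrapper div are typical of the DataTables library), so the tbody in the raw HTML that rvest downloads is empty and the rows arrive later via AJAX. One workaround is to render the page in a real browser and scrape the rendered source. A sketch with RSelenium, untested against this site:
library(RSelenium)
library(rvest)

# start a browser session (downloads a driver on first run)
rD <- rsDriver(browser = "firefox", port = 4545L)
remDr <- rD$client
remDr$navigate("https://www.fedsdatacenter.com/federal-pay-rates/")

# scrape the browser-rendered HTML, which includes the AJAX-loaded rows
sal <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_node('#table-example') %>%
  html_table(fill = TRUE)
To walk the tens of thousands of pages you would repeatedly click the paginator from RSelenium, or, often simpler, look in your browser's network tab for the JSON endpoint the table queries and request that directly.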