R programming Web Scraping

I tried to scrape a webpage from the link below using the rvest package in R.
The link that I scraped is http://dk.farnell.com/c/office-computer-networking-products/prl/results
My code is:
library("xml2")
library("rvest")
url <- read_html("http://dk.farnell.com/c/office-computer-networking-products/prl/results")
tbls_ls <- url %>%
  html_nodes("table") %>%
  html_table(fill = TRUE) %>%
  gsub("^\\s\\n\\t+|\\s+$n+$t+$", "", .)
View(tbls_ls)
I want to remove the \\n and \\t characters from the result. I also want to handle pagination, so that I can scrape multiple pages of this listing.

I'm intrigued by these kinds of questions so I'll try to help you out. Be forewarned, I am not an expert with this stuff (or anything close to it). Anyway, I think it should be kind of like this...
library(rvest)
library(tidyverse)

base_url <- "http://dk.farnell.com/c/office-computer-networking-products/prl/results/"
pag <- 1:5                          # page numbers to visit
urls <- paste0(base_url, pag)       # build one URL per page
p <- map(urls, read_html)           # parse each page into a list
Now, I didn't see any '\\n' or '\\t' patterns in the data sets. Nevertheless, if you want to look for a specific string, you can do it like this.
library(stringr)
str_which(urls, "[your]string_here")
The link below is very useful!
http://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/webscrape.html
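If the tables do contain stray newlines and tabs, here is one way (just a sketch, reusing the list p built above plus stringr's str_squish) to pull every table from every page and collapse the whitespace:

library(stringr)

clean_ws <- function(df) {
  df[] <- lapply(df, function(col) str_squish(as.character(col)))  # collapse \n, \t and repeated spaces
  df
}

tbls_ls <- p %>%
  map(~ html_table(html_nodes(.x, "table"), fill = TRUE)) %>%  # the tables found on each page
  flatten() %>%                                                # one flat list of data frames
  map(clean_ws)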

Related

Web Scraping in R Timeout

I am doing a project where I need to download FAFSA completion data from this website: https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school
I am using rvest to scrape that data, but when I try to use read_html on the link, it never finishes loading and eventually I have to stop execution. I can read other websites, so I'm not sure whether it is a website-specific issue or whether I'm doing something wrong. Here is my code so far:
library(rvest)
fafsa_link <- "https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school"
read_html(fafsa_link)
Any help would be greatly appreciated! Thank you!
A user-agent header is required. The download links are also given in a JSON file. You could regex the links out (or properly parse them out); or, as I do below, regex out one link and then substitute the state code inside it to get the other download URLs, since the URLs only vary in that respect.
library(magrittr)
library(httr)
library(stringr)
data <- httr::GET('https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school.json',
                  add_headers("User-Agent" = "Mozilla/5.0")) %>%  # browser-like user agent is required
  content(as = "text")
ca <- data %>% stringr::str_match(': "(.*?CA\\.xls)"') %>% .[2] %>% paste0('https://studentaid.gov', .)  # CA download link
ma <- gsub('CA\\.xls', 'MA\\.xls', ca)  # derive the MA link by swapping the state code
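To actually retrieve and read one of those workbooks, a minimal sketch (the local file name and the readxl call are just an illustration, not part of the answer above):

httr::GET(ca,
          add_headers("User-Agent" = "Mozilla/5.0"),
          write_disk("fafsa_CA.xls", overwrite = TRUE))  # save the workbook locally (file name is an example)
ca_data <- readxl::read_excel("fafsa_CA.xls")            # read it into R; you may need skip = n for header rows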

Read HTML table using R directly from a website

I want to read COVID data directly from a government website: https://pikobar.jabarprov.go.id/distribution-case#
I did that using the rvest library:
library(rvest)

url <- "https://pikobar.jabarprov.go.id/distribution-case#"
df <- url %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(fill = T)
I saw someone using lapply to turn the result into a tidy table, but when I tried it the output looked like a mess, because I'm new to this.
Can anybody help me? I'm really frustrated.
You can't scrape the data in the table with rvest because the page requests it from this endpoint, with an api-key attached:
https://dashboard-pikobar-api.digitalservice.id/v2/sebaran/pertumbuhan?wilayah=kota&=32
pg <- httr::GET(
  "https://dashboard-pikobar-api.digitalservice.id/v2/sebaran/pertumbuhan?wilayah=kota&=32",
  config = httr::add_headers(`api-key` = "480d0aeb78bd0064d45ef6b2254be9b3")
)
data <- httr::content(pg)$data
I don't know whether the api-key will keep working in the future, but it works for now as far as I can see.
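If you then want those records as a data frame, a minimal sketch (this assumes each element of data is a named list of scalar fields, possibly with NULLs, which I have not verified against the live API):

library(dplyr)

df <- bind_rows(lapply(data, function(rec) {
  rec[sapply(rec, is.null)] <- NA                # keep columns aligned when a field is missing
  as.data.frame(rec, stringsAsFactors = FALSE)   # one-row data frame per record
}))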

Rvest and xpath returns misleading information

I am struggling with some scraping issues, using rvest and xpath.
The objective is to scrape the following page
https://www.barchart.com/futures/quotes/BT*0/futures-prices
and to extract the names of the futures
BTF21
BTG21
BTH21
etc for the full list of names.
The XPath for those elements seems to be xpath='//a'.
The following code returns nothing of relevance, hence my question:
library(rvest)

url <- 'https://www.barchart.com/futures/quotes/BT*0'
valuation_col <- url %>%
  read_html() %>%
  html_nodes(xpath = '//a')
value <- valuation_col %>% html_text()
Any hint on how to proceed further to get the information would be much appreciated. Thanks in advance!
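For what it's worth, here is a sketch of filtering the anchor text for contract-style codes; be aware that if the futures list is rendered by JavaScript, read_html will never see it in the static HTML, and you would need the site's underlying data request instead (the BTF21-style pattern below is only an assumption):

library(rvest)
library(stringr)

page <- read_html('https://www.barchart.com/futures/quotes/BT*0/futures-prices')
links <- page %>% html_nodes("a") %>% html_text(trim = TRUE)
futures <- str_subset(links, "^BT[A-Z][0-9]{2}$")   # keep text shaped like BTF21, BTG21, ... (assumed pattern)
futures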

webscraping a table with no html class

I am exploring web scraping some weather data, specifically the table in the right panel of this page: https://wrcc.dri.edu/cgi-bin/cliMAIN.pl?ak4988
I'm able to navigate to the appropriate location (see below), but have not been able to pull out the table with, e.g., html_nodes("table").
library(tidyverse)
library(rvest)
url <- read_html("https://wrcc.dri.edu/cgi-bin/cliMAIN.pl?ak4988")
url %>%
  html_nodes("frame") %>%
  magrittr::extract2(2)
# {html_node}
# <frame src="/cgi-bin/cliRECtM.pl?ak4988" name="Graph">
I've also looked at the namespace with no luck
xml_ns(url)
# <->
This works for me.
library(rvest)
library(magrittr)
library(plyr)

# Doing URLs one by one
url <- "https://wrcc.dri.edu/cgi-bin/cliRECtM.pl?ak4988"

# pull the first table on the page
pricesdata <- read_html(url) %>%
  html_nodes(xpath = "//table[1]") %>%
  html_table(fill = TRUE)

df <- ldply(pricesdata, data.frame)
Originally I was hitting the wrong URL; the comment from Mogzol pointed me in the right direction. I'm still not sure exactly how or why the different URLs feed into the same page, but it presumably has to do with the framed scrolling windows inside the single page: the main page is a frameset, and the table sits inside the frame whose src is /cgi-bin/cliRECtM.pl, which is why reading that URL directly works. I would be interested in hearing more about how this works if someone has insight into it. Thanks!!
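A small sketch that follows that frame programmatically instead of hard-coding the inner URL (assuming the frame order on the page stays the same and the first table is the one wanted):

library(rvest)
library(magrittr)

main <- read_html("https://wrcc.dri.edu/cgi-bin/cliMAIN.pl?ak4988")
frame_src <- main %>% html_nodes("frame") %>% html_attr("src") %>% .[2]   # "/cgi-bin/cliRECtM.pl?ak4988"
tbl <- paste0("https://wrcc.dri.edu", frame_src) %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(fill = TRUE) %>%
  .[[1]]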

Web Scraping using Rvest on a Tennis table from Wiki

Here I am, a total beginner in R. I am trying to learn more about rvest and how to scrape the web. Here is the wiki page (https://en.wikipedia.org/wiki/Andy_Murray) containing the table I want to transfer to R.
Using a CSS selector, I found that the particular table matches ".wikitable". Following some tutorials on other web pages, here is the code that I used:
library(rvest)
tennis <- read_html("https://en.wikipedia.org/wiki/Andy_Murray")
trial <- tennis %>% html_nodes(".wikitable") %>% html_table(fill = T)
trial
I could not isolate the result down to the table that I wanted. Can someone please show me how? One other thing: what does the pipe (%>%) do?
You were almost there. What you extracted was a list. To get to your desired element you need to use indexing:
trial[[2]]
To clean it further use:
df <- trial[[2]]      # keep the second table in the list
df <- df[-1, ]        # drop the first row
df[, 17:20] <- NULL   # drop columns 17 to 20
%>% is called a pipe and comes from the magrittr package (it is also re-exported by dplyr); it passes the result of the expression on its left as the first argument to the function on its right.
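As a quick illustration of what the pipe does, these two lines are equivalent; the piped version reads left to right, while the unpiped version nests the calls:

# with the pipe: the left-hand result becomes the first argument of the next call
trial <- tennis %>% html_nodes(".wikitable") %>% html_table(fill = TRUE)

# without the pipe: the same extraction written as nested function calls
trial <- html_table(html_nodes(tennis, ".wikitable"), fill = TRUE)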
