Scrape forum using R

I would like to seek help with scraping information from HardwareZone.
This is the link: http://www.hardwarezone.com.sg/search/forum/camera
I would like to get all the camera-related information on the forum.
library(RSelenium)
library(magrittr)
base_url <- "http://www.hardwarezone.com.sg/search/forum/camera"
checkForServer()   # download the standalone Selenium server if it is missing
startServer()      # start the Selenium server
remDrv <- remoteDriver()
remDrv$open()
I tried the code above for the first part, but the last line throws an error (I am on a Mac).
The error is an undefined RCurl call; I have looked at many possible solutions but still cannot resolve it.
library(rvest)
url <- "http://www.hardwarezone.com.sg/search/forum/camera"
result <- url %>%
  read_html() %>%                                    # html() is deprecated; read_html() is the current function
  html_nodes(xpath = '//*[@id="cse"]/table[1]') %>%  # note the @ before id
  html_table()
result
I tried another approach (the code above), but it still did not work.
Can anyone guide me through this?
Thank you.

Related

Web scraping a table: Error using rvest and splash

I am trying to scrape a table from this link (https://www.sac-isc.gc.ca/eng/1620925418298/1620925434679). I have tried two approaches so far:
Using rvest, though I couldn't find HTML nodes on the page for the table
Using splash, assuming the page is being loaded dynamically
Neither approach seems to be working for me. After hours of trying and reading posts, blogs, etc., I can't seem to figure out what I'm missing. Can someone please point me in the right direction, or at least tell me what I'm doing wrong? Thanks!
Scraping using rvest:
library(rvest)
library(dplyr)  # for as.tbl()
loadedTable <- read_html("https://www.sac-isc.gc.ca/eng/1620925418298/1620925434679") %>%
  html_nodes("table") %>%
  html_table() %>%
  as.tbl()
Error:
Error in matrix(unlist(values), ncol = width, byrow = TRUE) :
'data' must be of a vector type, was 'NULL'
Scraping using splashr:
library(splashr)
library(reticulate)
install_splash()   # pulls the Splash Docker image (requires Docker and the python 'docker' module)
splash("localhost") %>% splash_active()
sp <- start_splash()
pg <- render_html(url = 'https://www.sac-isc.gc.ca/eng/1620925418298/1620925434679')
stop_splash(sp)
loadedTable1 <- pg %>%
  html_node('table') %>%
  html_table()
loadedTable1
Error:
Did not find required python module 'docker'
Error in curl::curl_fetch_memory(url, handle = handle) :
Failed to connect to localhost port 8050: Connection refused
Error in stop_splash(sp) : object 'sp' not found
Error in html_element(...) : object 'pg' not found
Error: object 'loadedTable1' not found
The table data is pulled dynamically from another URI. There is some work to do to write a process that extracts just what you see on the page, or whatever you actually want, as a lot of information is returned, including all the additional detail.
The unix timestamp on the end is largely there to prevent cached results from being served. You can generate that yourself and sprintf it into the rest of the URL, as sketched after the code below.
library(jsonlite)
data <- jsonlite::read_json('https://www.sac-isc.gc.ca/DAM/DAM-ISC-SAC/DAM-WTR/STAGING/texte-text/lTDWA_map_data_1572010201618_eng.txt?date=1622135739812')
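As a rough sketch of that cache-busting step (the base URI and the date parameter are taken from the call above; the exact timestamp format is an assumption):
# Build the URL with a fresh millisecond timestamp instead of a hard-coded one.
ts <- format(round(as.numeric(Sys.time()) * 1000), scientific = FALSE)
url <- sprintf(
  'https://www.sac-isc.gc.ca/DAM/DAM-ISC-SAC/DAM-WTR/STAGING/texte-text/lTDWA_map_data_1572010201618_eng.txt?date=%s',
  ts
)
data <- jsonlite::read_json(url)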
To access all info about First Nation Kapawe'no, for example, you would do the following:
print(data$data[82])

Web Scraping in R Timeout

I am doing a project where I need to download FAFSA completion data from this website: https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school
I am using rvest to web-scrape that data, but when I try to call read_html on the link it never finishes, and eventually I have to stop execution. I can read in other websites, so I'm not sure if this is a website-specific issue or if I'm doing something wrong. Here is my code so far:
library(rvest)
fafsa_link <- "https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school"
read_html(fafsa_link)
Any help would be greatly appreciated! Thank you!
A user-agent header is required. The download links are also given in a JSON file. You could regex the links out (or indeed parse them out); or, as I do here, regex out one and then substitute the state code within it to get the other download URLs (the URLs only vary in the state code).
library(magrittr)
library(httr)
library(stringr)
# Request the JSON that backs the page; the user-agent header is required.
data <- httr::GET('https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school.json', add_headers("User-Agent" = "Mozilla/5.0")) %>%
  content(as = "text")
# Pull out the California download link, then swap the state code for Massachusetts.
ca <- data %>% stringr::str_match(': "(.*?CA\\.xls)"') %>% .[2] %>% paste0('https://studentaid.gov', .)
ma <- gsub('CA\\.xls', 'MA\\.xls', ca)
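To actually download one of those workbooks, a minimal follow-up sketch (assuming the file URLs want the same user-agent header; the local file name is arbitrary):
# Save the California workbook to the working directory.
httr::GET(ca,
          add_headers("User-Agent" = "Mozilla/5.0"),
          write_disk("fafsa_CA.xls", overwrite = TRUE))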

Cannot download a webpage using download.file in R

I tried the following code to download an HTML file. The code runs without error, but the file returned is very small (~2 KB) and cannot be opened.
url <- "http://racing.hkjc.com/racing/information/english/Horse/OtherHorse.aspx?HorseNo=L042#htop"
download.file(url, destfile)
I am not sure whether the connection speed affects whether download.file returns the correct result, because sometimes the page can be downloaded after several tries. Any help or an alternative solution would be appreciated. Thanks.
Lots of clean-up to do (a rough sketch of that is below), but here's the basic method:
library(rvest)
read_html(url) %>%
  html_nodes(xpath = '/html/body/div/form/table[3]') %>%
  html_table(fill = TRUE)
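A rough sketch of the clean-up step mentioned above (the exact column handling depends on the page, so treat the row filtering as an assumption):
# html_table() returns a list; take the first data frame and drop rows that are
# entirely empty or NA.
horse_tbl <- read_html(url) %>%
  html_nodes(xpath = '/html/body/div/form/table[3]') %>%
  html_table(fill = TRUE) %>%
  .[[1]]
horse_tbl <- horse_tbl[rowSums(!is.na(horse_tbl) & horse_tbl != "") > 0, ]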

WebScraping dynamic pages in R

I have changed the website to make this question better. I am still facing similar issues: I can't do this with the rvest package alone, and the answer may be easier to obtain with RSelenium. The website is http://ravimaailma.fi/cg/tulokset/20/ and I want to obtain links from the main article listing that direct me to individual race results. The links look something like this: http://ravimaailma.fi/article/tulokset/pori-18-11-2017-tulokset/8718/
I'm trying to use plain rvest, as I thought that would be all that's needed here. SelectorGadget gives the CSS for the links as .article-title a, so my code is simply
library(rvest)
url <- "http://ravimaailma.fi/cg/tulokset/20/"
url %>%
  read_html() %>%
  html_nodes(".article-title a") %>%
  html_text()
This returns nothing. The website loads more results when you scroll down, but I thought I would at least get the first results out. The code below returns some links, and links 28:32 look promising, but I think those are links from the sidebar, not from the article listing.
url %>%
  read_html() %>%
  html_nodes("a") %>%
  html_attr("href")
What am I doing wrong here, and can RSelenium help me?
Here is my partial answer; I'm still not getting everything, but maybe it helps someone. The code returns one link for the first result. I'm not sure why it isn't giving them all. I'm using
library(RSelenium)
rD <- rsDriver(port = 4444L, browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate("http://ravimaailma.fi/cg/tulokset/20/")
elem <- remDr$findElement(using="css selector", value=".article-title a")
elemtxt <- elem$getElementAttribute("href")
#Click button to load more results
#button <- remDr$findElement(using="id", value="loadmore")
#button$clickElement()
remDr$close()
I haven't used the button click yet, but it seemed to be working as well. The only problem is that I can't get all the results from the site.
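One likely reason only a single link comes back is that findElement() returns just the first match. A hedged sketch using findElements() (plural), assuming the same selector:
# Collect every currently loaded article link, not just the first one.
elems <- remDr$findElements(using = "css selector", value = ".article-title a")
links <- sapply(elems, function(e) e$getElementAttribute("href")[[1]])
links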
[I'm not (yet) allowed to write comments, so I chose to make this post an answer]
RSelenium is not always necessary; you can also interact with a website directly through PhantomJS (see e.g. this example). A sketch of that route follows below.
If you provide an example from the website instead of a local link to a .pdf, I can try to find out how to retrieve the data.
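A minimal sketch of that PhantomJS route, assuming phantomjs is installed and on the PATH (the script file name render.js is arbitrary):
# Write a small PhantomJS script that loads the page, lets its JavaScript run,
# and prints the rendered HTML to stdout.
js <- "
var page = require('webpage').create();
page.open('http://ravimaailma.fi/cg/tulokset/20/', function () {
  console.log(page.content);
  phantom.exit();
});
"
writeLines(js, "render.js")
system("phantomjs render.js > rendered.html")
# Parse the rendered HTML with rvest as usual.
library(rvest)
read_html("rendered.html") %>%
  html_nodes(".article-title a") %>%
  html_attr("href")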

Rvest not seeing xpath in website

I am attempting to scrape this website using the rvest package in R. I have done it successfully with several other websites, but this one doesn't seem to work and I am not sure why.
I copied the XPath from inside Chrome's inspector tool, but when I specify it in the rvest script it shows that it doesn't exist. Does it have anything to do with the fact that the table is generated dynamically rather than being static?
I appreciate the help!
library(rvest)
library(tidyverse)
library(stringr)
library(readr)
a <- read_html("http://www.diversitydatakids.org/data/profile/217/benton-county#ind=10,12,15,17,13,20,19,21,24,2,22,4,34,35,116,117,123,99,100,127,128,129,199,201")
a <- html_node(a, xpath = "//*[@id='indicator10']")
a <- html_table(a)
a
Regarding your question: yes, you are unable to get it because the table is being generated dynamically. In these cases, it's better to use the RSelenium library:
#Loading libraries
library(rvest) # to read the html
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the website
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
# Specify the URL of the website to be scraped
url <- "http://www.diversitydatakids.org/data/profile/217/benton-county#ind=10,12,15,17,13,20,19,21,24,2,22,4,34,35,116,117,123,99,100,127,128,129,199,201"
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# get the element you are looking for
a <- html_node(html_obj, xpath = "//*[@id='indicator10']")
I guess that you are trying to get the first table. In that case, it may be better to just grab the table directly with html_table():
# get the table with the indicator10 id
indicator10_table <- html_node(html_obj, "#indicator10 table") %>% html_table()
I'm using a CSS selector this time instead of the XPath.
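One small housekeeping addition (not in the original answer): close the remote session when you are done, so the browser and chromedriver processes do not keep running in the background.
# Shut down the browser session started above.
remDr$close()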
Hope it helps! Happy scraping!
