I would like to get the href information from the page below.
http://www.mitbbs.com/bbsdoc1/USANews_101_0.html
I'd prefer to get something like this from each topic:
/USANews/31587637.html
/USANews/31587633.html
/USANews/31587631.html
...
The code I used is below, but it doesn't work.
library("XML")
library("httr")
library("stringr")
data <- list()
for (i in 101:201) {
  url <- paste('bbsdoc1/USANews_', i, '_0.html', sep = '')
  html <- content(GET("http://www.mitbbs.com/", path = url), as = 'parsed')
  url.list <- xpathSApply(html, "//td[#align='left' height=26]/[#class='news1' href]", xmlAttrs)
  data <- rbind(data, url.list)
}
Your suggestions are really appreciated!
You should look into the rvest package, which simplifies things a lot:
library(rvest); library(dplyr)
myList <- read_html("http://www.mitbbs.com/bbsdoc1/USANews_101_0.html") %>%
  html_nodes(".news1") %>%
  html_attr("href")
myList
myList %>% gsub("/article_t", "", .)
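If you then want the full range of pages from your original loop (101 to 201), a minimal sketch building on the same .news1 selector (untested against the live site; the pause between requests is just a polite assumption) could be:
library(rvest)
library(dplyr)

# build the page URLs the question loops over
pages <- paste0("http://www.mitbbs.com/bbsdoc1/USANews_", 101:201, "_0.html")

all_links <- lapply(pages, function(u) {
  Sys.sleep(1)                  # small delay so we don't hammer the server
  read_html(u) %>%
    html_nodes("a.news1") %>%   # same class selector, restricted to <a> tags
    html_attr("href") %>%
    gsub("/article_t", "", .)   # strip the /article_t prefix as in the answer
}) %>%
  unlist()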
Retrieve the document
library(XML)
html = htmlParse("http://www.mitbbs.com/bbsdoc1/USANews_101_0.html")
and extract the links and text you're interested in using the appropriate XPath queries:
href = "//a[./#class='news1']/#href"
text = "//a[./#class='news1']/text()"
df = data.frame(
  url = sub("article_t/", "", sapply(html[href], as.character)),
  text = trimws(sapply(html[text], xmlValue)))
trimws() is a function in recent versions of R.
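If you are on an older R installation that doesn't have trimws() yet, an equivalent drop-in is easy to define:
# drop-in replacement for trimws() on older R versions
trim <- function(x) gsub("^\\s+|\\s+$", "", x)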
I'm attempting to automate scraping the practice words from this site https://www.livechatinc.com/typing-speed-test/#/ but get a result of character(0).
I read the URL with read_html(), then use that as the x in html_nodes() along with the CSS selector for the practice words, and then read it with html_text(), but I get character(0) every time.
No clue what I'm doing wrong, here is the code:
library('rvest')
url <- read_html("https://www.livechatinc.com/typing-speed-test/#/")
wbpg_html <- html_nodes(url,".test-prompt")
wbpg_txt <- html_text(wbpg_html)
> wbpg_txt
character(0)
I'd just like to get the practice words into R; I'll figure out how to automate it later.
Thanks for any help.
The word list comes from this js file: https://cdn.livechatinc.com/gtt/app.3.8.min.js
You can try to regex it out in R using:
e\\.exports=\\{words:\\[(.*?)\\]
I ran a quick test with python:
import requests, re
r = requests.get('https://cdn.livechatinc.com/gtt/app.3.8.min.js')
p = re.compile(r'e\.exports={words:\[(.*?)\]')
words = p.findall(r.text)
print(words)
With R:
library(rvest)
library(stringr)
library(readr)
library(dplyr)
urlmatrix <- paste(readLines('https://cdn.livechatinc.com/gtt/app.3.8.min.js', warn = FALSE),
                   collapse = " ") %>%
  str_match('e\\.exports=\\{words:\\[(.*?)\\]')
words <- strsplit(as.character(as.list(urlmatrix[,2])[[1]]), '","')
words[[1]][1] <- substring(words[[1]][1], 2, nchar(words[[1]][1]))
words[[1]][length(words[[1]])] <- gsub('\\"', "", words[[1]][length(words[[1]])])
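To sanity-check the result:
length(words[[1]])   # number of practice words extracted
head(words[[1]])     # first few words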
I have tried scraping data from a real estate site and arranging it in a way that can then easily be filtered and checked using a spreadsheet. I'm actually a little embarrassed that I can't move this R code forward.
Now that I have all the links to the posts, I can't work out how to loop through the previously compiled dataframe and get the details from all the URLs.
Could you please help me with it? Thanks a lot.
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the fully loaded html of the page
library(xml2)
complete <- data.frame()
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
URL.base <- "https://www.sreality.cz/hledani/prodej/byty?strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?stari=dnes&strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?stari=tyden&strana="
for (i in 1:10000) {
  # specifying the url of the page to be scraped
  main_link <- paste0(URL.base, i)
  # go to the website
  remDr$navigate(main_link)
  # get the page source and save it as an html object with rvest
  main_page <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
  # get the data
  name <- html_nodes(main_page, css = ".name.ng-binding") %>% html_text()
  locality <- html_nodes(main_page, css = ".locality.ng-binding") %>% html_text()
  norm_price <- html_nodes(main_page, css = ".norm-price.ng-binding") %>% html_text()
  sreality_url <- main_page %>% html_nodes(".title") %>% html_attr("href")
  sreality_url2 <- sreality_url[c(4:24)]
  name2 <- name[c(4:24)]
  record <- data.frame(cbind(name2, locality, norm_price, sreality_url2))
  complete <- rbind(complete, record)
}
# Write CSV in R
write.csv(complete, file = "MyData.csv")
I would do this differently:
I would create a function, say 'scraper', that groups together all the scraping functions you have already defined. From there I would build a list with str_c() of all the possible links (say 30), and then use a simple lapply() over it. As said, I would not use RSelenium. (Libraries: rvest, stringr, tibble, dplyr.)
url = 'https://www.sreality.cz/hledani/prodej/byty?strana='
This is the base URL; starting from here you should be able to build the URL strings for all the pages (1 to however many) you are interested in (and likewise for all the other possible URLs, for Praha, Olomouc, Ostrava, etc.).
main_page = read_html('https://www.sreality.cz/hledani/prodej/byty?strana=')
Here you create all the links according to the number of pages you want:
list.of.pages = str_c(url, 1:30)
Then define a single function for each piece of data you are interested in; this way you are more precise, debugging errors is easier, and so is checking data quality. (I assume your CSS selectors are right, otherwise you will obtain empty objects.)
For names:
name = function(url) {
  data = html_nodes(url, css = ".name.ng-binding") %>%
    html_text()
  return(data)
}
For locality:
locality = function(url) {
  data = html_nodes(url, css = ".locality.ng-binding") %>%
    html_text()
  return(data)
}
For normprice:
normprice = function(url) {
  data = html_nodes(url, css = ".norm-price.ng-binding") %>%
    html_text()
  return(data)
}
For hrefs:
sreality_url = function(url) {
  data = html_nodes(url, css = ".title") %>%
    html_attr("href")
  return(data)
}
Those are the single functions (the CSS selectors, even though I didn't test them, don't look correct to me, but this will give you the right framework to work with). After that, combine them into a tibble object:
get.data.table = function(html){
  name = name(html)
  locality = locality(html)
  normprice = normprice(html)
  hrefs = sreality_url(html)
  combine = tibble(adtext = name,
                   loc = locality,
                   price = normprice,
                   URL = hrefs) %>%
    select(adtext, loc, price, URL)
  return(combine)
}
Then the final scraper:
scrape.all = function(urls){
  urls %>%
    lapply(read_html) %>%
    lapply(get.data.table) %>%
    bind_rows() %>%
    write.csv(file = 'MyData.csv')
}
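A usage sketch, assuming the functions above are defined in the same session. Note that whether read_html() actually sees the listings depends on the site; if the content is rendered by JavaScript you may still need RSelenium, as in your original code.
scrape.all(list.of.pages)   # writes MyData.csv with one row per listing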
I'm trying to scrape the speakers for this year's SXSW: https://schedule.sxsw.com/2019/speakers/alpha/A
The end of the link has an A, but it goes through Z (i.e., add a B, or a C, etc., to the end of the link).
Here's my attempt:
library(RCurl)
library(httr)
library(rvest)
library(tidyverse)
sxsw <- 'https://schedule.sxsw.com/2019/speakers/alpha/A'
page <- read_html(sxsw)
for (i in length(LETTERS)) {
  sxsw <- paste0('https://schedule.sxsw.com/2019/speakers/alpha/', LETTERS[i])
  names <- page %>%
    html_nodes(".px1 a") %>%
    html_text()
}
I'm simply trying to append the entire range so it returns all of the speaker names. If you take the names vector out of the loop and run it, it pops up with all of the A names. I think this is a quick fix; it probably has something to do with LETTERS. Thanks!
This should do the trick...
library(tidyverse)
library(rvest)
tibble(
  url = paste0('https://schedule.sxsw.com/2019/speakers/alpha/', LETTERS[1:26])
) %>%
  mutate(
    names = map(url, read_html),
    names = map(names, html_nodes, ".px1 a"),
    names = map(names, html_text)
  ) %>%
  unnest()
Code using lapply. I would recommend avoiding explicit loops in R:
library(RCurl)
library(httr)
library(rvest)
library(tidyverse)
sxsw <- lapply(LETTERS, function(x){
  read_html(paste0("https://schedule.sxsw.com/2019/speakers/alpha/", x)) %>%
    html_nodes(".px1 a") %>%
    html_text()
})
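If you want a single character vector with every speaker's name afterwards, flatten the list:
speakers <- unlist(sxsw)   # one vector with all speaker names, A through Z
length(speakers)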
I am trying to get the "Team Offense" table into R. I have tried multiple techniques and I cannot seem to get it to work. It looks like R is only reading the first two tables. The link is below.
https://www.pro-football-reference.com/years/2018/index.htm
This is what I have tried...
library(XML)
library(RCurl)
TeamData = 'https://www.pro-football-reference.com/years/2018/index.htm'
URL = TeamData
URLdata = getURL(URL)
table = readHTMLTable(URLdata, stringsAsFactors=F, which = 5)
Scraping Sports Reference sites can be tricky but they are great sources:
library(rvest)
library(httr)
link <- "https://www.pro-football-reference.com/years/2018/index.htm"
doc <- GET(link)
cont <- content(doc, "text") %>%
  gsub(pattern = "<!--\n", "", ., fixed = TRUE) %>%
  read_html() %>%
  html_nodes(".table_outer_container table") %>%
  html_table()
# Team Offense table is the fifth one
df <- cont[[5]]
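The later tables on Sports Reference pages are wrapped in HTML comments, which is why the comment openers are stripped before parsing above. If the position of the Team Offense table ever changes, inspect what came back rather than trusting the hard-coded index; a quick check could be:
length(cont)                                     # how many tables were parsed
lapply(cont, function(tbl) head(names(tbl), 3))  # peek at each table's first columns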
I have a list of hospital names for which I need to extract the first google search URL. Here is the code I'm using
library(rvest)
library(urltools)
library(RCurl)
library(httr)
getWebsite <- function(name) {
  url = URLencode(paste0("https://www.google.com/search?q=", name))
  page <- read_html(url)
  results <- page %>%
    html_nodes("cite") %>%
    html_text()
  result <- results[1]
  return(as.character(result))
}
websites <- data.frame(Website = sapply(c, getWebsite))
View(websites)
For short URLs this code works fine, but when the link is long and appears in R with "...", e.g. www.medicine.northwestern.edu/divisions/allergy-immunology/.../fellowship.html, it appears in the dataframe the same way, with "...". How can I extract the actual URLs without the "..."? Appreciate your help!
This is a working example, tested on my computer:
library("rvest")
# Load the page
main.page <- read_html(x = "https://www.google.com/search?q=software%20programming")
links <- main.page %>%
  html_nodes(".r a") %>%   # get the a nodes with an r class
  html_attr("href")        # get the href attributes
# clean the text
links = gsub('/url\\?q=', '',
             sapply(strsplit(links[as.vector(grep('url', links))], split = '&'), '[', 1))
# as a dataframe
websites <- data.frame(links = links, stringsAsFactors = FALSE)
View(websites)
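To plug this back into your getWebsite() helper, here is a hedged sketch applying the same idea per search. The ".r a" selector and the /url?q= prefix are whatever Google happens to serve at the moment, so verify them before relying on this:
library(rvest)

getWebsite <- function(name) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  links <- read_html(url) %>%
    html_nodes(".r a") %>%   # result links, as in the answer above
    html_attr("href")
  # keep only the redirect links, then drop the /url?q= prefix and the trailing &... part
  links <- links[grepl("url", links)]
  links <- sapply(strsplit(sub("/url\\?q=", "", links), "&"), '[', 1)
  links[1]                   # first (top) result as a full URL, no "..."
}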