Result of character(0) when trying to web scrape text - r

I'm attempting to automate scraping the practice words from this site https://www.livechatinc.com/typing-speed-test/#/ but get a result of character(0).
I read the URL with read_html(), pass that to html_nodes() along with the CSS selector for the practice words, and then call html_text(), but I get character(0) every time.
I have no clue what I'm doing wrong; here is the code:
library('rvest')
url <- read_html("https://www.livechatinc.com/typing-speed-test/#/")
wbpg_html <- html_nodes(url,".test-prompt")
wbpg_txt <- html_text(wbpg_html)
> wbpg_txt
character(0)
I'd just like to get the practice words into R; I'll figure out how to automate it later.
Thanks for any help.

The practice words aren't in the static HTML (the page is rendered client-side), which is why html_nodes() finds nothing. The word list comes from this js file: https://cdn.livechatinc.com/gtt/app.3.8.min.js
You can try to regex it out in R using:
e\\.exports=\\{words:\\[(.*?)\\]
I ran a quick test with Python:
import requests, re
r = requests.get('https://cdn.livechatinc.com/gtt/app.3.8.min.js')
p = re.compile(r'e\.exports={words:\[(.*?)\]')
words = p.findall(r.text)
print(words)
And with R:
library(rvest)
library(stringr)
library(dplyr)

urlmatrix <- paste(readLines('https://cdn.livechatinc.com/gtt/app.3.8.min.js', warn = FALSE),
                   collapse = " ") %>%
  str_match(., 'e\\.exports=\\{words:\\[(.*?)\\]')

words <- strsplit(as.character(urlmatrix[, 2]), '","')
words[[1]][1] <- substring(words[[1]][1], 2)                                     # drop the leading quote
words[[1]][length(words[[1]])] <- gsub('"', "", words[[1]][length(words[[1]])])  # drop the trailing quote
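Since the captured group is just the inside of a JSON array, another option is to let a JSON parser handle the unquoting; a minimal sketch, assuming the jsonlite package is installed:
library(jsonlite)

# Wrap the captured group back into a JSON array and parse it;
# fromJSON() takes care of the quotes and any escaping.
words <- fromJSON(paste0("[", urlmatrix[, 2], "]"))
head(words)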

Related

R: Reading XML as data.frame

I'm facing this issue: I can't read an .xml file into a data.frame in R. I know this question already has great answers here and here, but I'm not able to adapt those answers to my needs, so sorry if it's a duplicate.
I have an .xml file like this:
<?xml version='1.0' encoding='UTF-8'?>
<LexicalResource>
  <GlobalInformation label="Created with the standard propagation algorithm"/>
  <Lexicon languageCoding="UTF-8" label="sentiment" language="-">
    <LexicalEntry id="id_0" partOfSpeech="adj">
      <Lemma writtenForm="word"/>
      <Sense>
        <Confidence score="0.333333333333" method="automatic"/>
        <Sentiment polarity="negative"/>
        <Domain/>
      </Sense>
    </LexicalEntry>
  </Lexicon>
</LexicalResource>
It's stored locally. So I tried this:
library(XML)
doc <- xmlParse("...\\test2.xml")
xmldf <- xmlToDataFrame(nodes = getNodeSet(doc, "//LexicalEntry/Lemma/Sense/Confidence/Sentiment"))
but the result is this:
> xmldf
data frame with 0 columns and 0 rows
So I tried the xml2 package:
library(xml2)
pg <- read_xml("...test2.xml")
recs <- xml_find_all(pg, "LexicalEntry")
> recs
{xml_nodeset (0)}
I don't have much experience manipulating .xml files, so I think I'm missing something. What am I doing wrong?
You need the attributes, not the element values; that's why the methods you used return nothing (and the XPath //LexicalEntry/Lemma/Sense/Confidence/Sentiment nests elements that are actually siblings, so it matches no nodes at all). Try something like this:
data.frame(as.list(xpathApply(doc, "//Lemma", fun = xmlAttrs)[[1]]),
           as.list(xpathApply(doc, "//Confidence", fun = xmlAttrs)[[1]]),
           as.list(xpathApply(doc, "//Sentiment", fun = xmlAttrs)[[1]]))

  writtenForm          score    method polarity
1        word 0.333333333333 automatic negative
Another option is to get all the attributes of the XML and build a data.frame from them:
df <- data.frame(as.list(unlist(xmlToList(doc, addAttributes = TRUE, simplify = TRUE))))
colnames(df) <- unlist(lapply(strsplit(colnames(df), "\\."), function(x) x[length(x)]))
df
                                            label writtenForm          score    method
1 Created with the standard propagation algorithm        word 0.333333333333 automatic
  polarity   id partOfSpeech languageCoding     label language
1 negative id_0          adj          UTF-8 sentiment        -
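For completeness, roughly the same one-row data frame with xml2 (a sketch; note that xml_find_all()/xml_find_first() need an absolute XPath such as //LexicalEntry, which is why the earlier call with just "LexicalEntry" matched nothing):
library(xml2)

pg <- read_xml("test2.xml")   # illustrative path to the file above

# Pull the attributes of the nodes of interest and combine them into one row
lemma <- xml_attrs(xml_find_first(pg, "//Lemma"))
conf  <- xml_attrs(xml_find_first(pg, "//Confidence"))
sent  <- xml_attrs(xml_find_first(pg, "//Sentiment"))

data.frame(as.list(c(lemma, conf, sent)), stringsAsFactors = FALSE)
#   writtenForm          score    method polarity
# 1        word 0.333333333333 automatic negative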

rvest limits the results to 24 items

Good evening everyone,
I am currently trying to scrape the Zalando website to get the name of every product that appears on the first two pages of the following URL: https://www.zalando.nl/damesschoenen-sneakers/
Here is my code:
require(rvest)
require(dplyr)
url <- read_html('https://www.zalando.nl/damesschoenen-sneakers/')
selector_name <- '.z-nvg-cognac_brandName-2XZRz'
output <- html_nodes(x = url, css = selector_name) %>% html_text
The result is a character vector of 24 items, while there are 86 products on the page. Has anyone encountered this issue before? Any idea how to solve it?
Thank you for your help.
Thomas
I just tried what Nicolas Velasqueaz suggested
url <- read_html('https://www.zalando.nl/damesschoenen-sneakers/')
write_html(url, file = "test_url.html")
selector_name <- '.z-nvg-cognac_brandName-2XZRz'
test_file <- read_html("test_url.html")
output <- html_nodes(x = test_file, css = selector_name) %>% html_text
The results are the same. I still only get 24 items.
So if anyone has a solution, it would be much appreciated.
Thank you for your kind answer. I will dig into that direction.
I also found a way to get the brand names without RSelenium; here is my code:
library('httr')
library('magrittr')
library('rvest')
################# FUNCTION #################
extract_data <- function(firstPosition, lastPosition) {
  mapply(function(first, last) {
    substr(pageContent, first, last) %>%
      gsub("\\W", " ", .) %>%                       # replace non-word characters with spaces
      gsub("^ *|(?<= ) | *$", "", ., perl = TRUE)   # collapse repeated spaces and trim
  },
  firstPosition, lastPosition)
}
############################################
url <- 'https://www.zalando.nl/damesschoenen-sneakers/'
page <- GET(url)
pageContent <- content(page, as='text')
# Get the brand name of the products
firstPosition <- unlist(gregexpr('brand_name', pageContent)) + nchar('brand_name') + 1
lastPosition  <- unlist(gregexpr('is_premium', pageContent)) - 2

extract_data(firstPosition, lastPosition)
Unfortunately it starts getting difficult when you want something other than the brand name, so maybe the best solution is to do it with RSelenium.
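If the product data is embedded in the page as JSON (which the brand_name/is_premium keys suggest), a lighter-weight alternative to fixed character offsets might be a single regular expression over the page text; a rough sketch, where the exact "brand_name":"..." formatting is an assumption:
library(httr)
library(stringr)

pageContent <- content(GET('https://www.zalando.nl/damesschoenen-sneakers/'), as = 'text')

# Assumes the embedded product data contains literal "brand_name":"..." pairs;
# capture the value of every such pair in one pass.
brands <- str_match_all(pageContent, '"brand_name":"(.*?)"')[[1]][, 2]
head(brands)
The same pattern could be pointed at other keys in the embedded data, which may be easier to maintain than pairs of offsets.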

R Regex seemingly not working properly in Linux

I'm trying to scrape the Fangraphs webpage of alphabetical player indices to get a single-column data frame of each letter reference.
I have been able to get the code below to work successfully on Windows with R 3.4.1, but I cannot get it to work on Linux at all, and I can't figure out what exactly is going wrong or different.
library(XML)
# Scrape to get the webpage
url <- paste0("http://www.fangraphs.com/players.aspx?")
table <- readHTMLTable(url, stringsAsFactors = FALSE)
letterz <- table[[2]]
letterz <- as.character(letterz)
letterz <- strsplit(letterz, split=", ")
letterz <- as.data.frame(letterz)
names(letterz) <- c("letters")
letterz$letters <- as.character(letterz$letters)
# Below this is where I can notice that the code is not operating the same
# as on my Windows machine. None of the gsub commands seem to impact
# the strings at all.
# Stripping the trailing whitespace
letterz$letters <- gsub("[[:space:]]+$", "", letterz$letters)
# Replacing patterns like "AzB Ba" to instead have "Az,Ba"
letterz$letters <- gsub("[[:upper:]]+?[[:space:]]+?[[:space:]]+?[[:space:]]+", ",", letterz$letters)
# Final cleaning up
letterz <- as.character(letterz)
letterz <- strsplit(letterz, split=",")
letterz <- as.data.frame(letterz)
names(letterz) <- c("letters")
letterz$letters <- as.character(letterz$letters)
letterz$letters <- gsub('c\\("|"\\)|"', "", letterz$letters)
letterz$letters <- gsub('^$', NA, letterz$letters)
letterz$letters <- gsub("^[[:space:]]+","", letterz$letters)
letterz$letters <- gsub("[[:space:]]+$","", letterz$letters)
letterz$letters <- gsub("'", "%27", letterz$letters)
letterz <- na.omit(letterz)
From what I could find, the only real difference between Windows and Linux regex handling would be the line-break implementation, so I went back and checked whether that was making the difference... but still saw no change.
I also tried to substitute the R-specific "[[:space:]]" and "[[:upper:]]" style notation with the more standardized "\s" to see if that would fix anything.
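A sketch of what that substitution looks like (inside an R string \s has to be written as "\\s", and adding perl = TRUE switches gsub() to PCRE, which may behave more consistently across platforms):
# Trailing- and leading-whitespace cleanup with \s instead of [[:space:]]
letterz$letters <- gsub("\\s+$", "", letterz$letters, perl = TRUE)
letterz$letters <- gsub("^\\s+", "", letterz$letters, perl = TRUE)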
As for fixes, I know there are a handful of other packages I can look into to get the result I'm after, but more generally: are there differences in how Windows and Linux implement regular expressions that I'm unaware of? And if so, how would I account for them in gsub to get the same result I get on Windows?
Thanks.

need help in extracting the first google search result using html_node in R

I have a list of hospital names for which I need to extract the first Google search result URL. Here is the code I'm using:
library(rvest)
library(urltools)
library(RCurl)
library(httr)
getWebsite <- function(name) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page <- read_html(url)
  results <- page %>%
    html_nodes("cite") %>%
    html_text()
  result <- results[1]
  return(as.character(result))
}

websites <- data.frame(Website = sapply(c, getWebsite))  # 'c' here is (presumably) the vector of hospital names
View(websites)
For short URLs this code works fine, but when the link is long and shows up truncated with "..." (e.g. www.medicine.northwestern.edu/divisions/allergy-immunology/.../fellowship.html), it appears in the data frame the same way, with the "...". How can I extract the actual URLs without the "..."? I appreciate your help!
This is a working example, tested on my computer:
library("rvest")
# Load the page
main.page <- read_html(x = "https://www.google.com/search?q=software%20programming")
links <- main.page %>%
html_nodes(".r a") %>% # get the a nodes with an r class
html_attr("href") # get the href attributes
#clean the text
links = gsub('/url\\?q=','',sapply(strsplit(links[as.vector(grep('url',links))],split='&'),'[',1))
# as a dataframe
websites <- data.frame(links = links, stringsAsFactors = FALSE)
View(websites)
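Adapted to the original getWebsite() helper, it might look like the sketch below (hospital_names stands in for the question's vector of names, and Google's result markup changes often, so the ".r a" selector is an assumption that may need updating):
library(rvest)

getWebsite <- function(name) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  links <- read_html(url) %>%
    html_nodes(".r a") %>%   # full result links instead of the truncated <cite> text
    html_attr("href")
  # keep only the /url?q= redirect links, strip the prefix and tracking parameters
  links <- grep("/url\\?q=", links, value = TRUE)
  links <- sub("^/url\\?q=", "", links)
  links <- sub("&.*$", "", links)
  if (length(links) == 0) NA_character_ else links[1]
}

# hospital_names: the character vector of hospital names (assumed)
# websites <- data.frame(Website = sapply(hospital_names, getWebsite))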

Use xpathSApply in R

I would like to get the href information from the page below.
http://www.mitbbs.com/bbsdoc1/USANews_101_0.html
I would like to get something like this from each topic:
/USANews/31587637.html
/USANews/31587633.html
/USANews/31587631.html
...
The code I used is below, but it doesn't work:
library("XML")
library("httr")
library("stringr")
data <- list()
for (i in 101:201) {
  url <- paste('bbsdoc1/USANews_', i, '_0.html', sep = '')
  html <- content(GET("http://www.mitbbs.com/", path = url), as = 'parsed')
  url.list <- xpathSApply(html, "//td[@align='left' height=26]/[@class='news1' href]", xmlAttrs)
  data <- rbind(data, url.list)
}
Your suggestions are really appreciated!
You should look into the rvest package, which simplifies things a lot:
library(rvest); library(dplyr)

myList <- read_html("http://www.mitbbs.com/bbsdoc1/USANews_101_0.html") %>%
  html_nodes(".news1") %>%
  html_attr("href")
myList

myList %>% gsub("/article_t", "", .)
Retrieve the document
library(XML)
html = htmlParse("http://www.mitbbs.com/bbsdoc1/USANews_101_0.html")
and extract the links and text you're interested in using the appropriate xpath query
href = "//a[./#class='news1']/#href"
text = "//a[./#class='news1']/text()"
df = data.frame(
url=sub("article_t/", "", sapply(html[href], as.character)),
text=trimws(sapply(html[text], xmlValue)))
trimws() is a function in recent versions of R.
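To cover the range of index pages the question loops over (101 to 201), the rvest approach could be wrapped in a loop; a rough sketch, with a Sys.sleep() pause added as a courtesy to the server:
library(rvest)

pages <- 101:201
all_links <- lapply(pages, function(i) {
  url <- paste0("http://www.mitbbs.com/bbsdoc1/USANews_", i, "_0.html")
  Sys.sleep(1)  # be polite to the server
  read_html(url) %>%
    html_nodes(".news1") %>%
    html_attr("href") %>%
    gsub("/article_t", "", .)
})
all_links <- unlist(all_links)
head(all_links)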
