I've tried to read the text of this article, but I only get character(0):
library(rvest)
tex <- read_html("http://semanaeconomica.com/article/sectores-y-empresas/transporte/360660-renegociar-si-anular-no/")
p_text <- tex %>%
  html_nodes("section") %>%
  html_nodes("#text") %>%
  html_text() %>%
  print()
I'm not an expert in web scraping, so I would be very grateful for your help!
I have been able to obtain the text on the page using the following code:
library(RDCOMClient)
url <- "http://semanaeconomica.com/article/sectores-y-empresas/transporte/360660-renegociar-si-anular-no/"

# Open an Internet Explorer COM instance, load the page and wait for it to render
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)

# Select the article body from the rendered document and read its text
doc <- IEApp$Document()
web_Obj <- doc$querySelector("body > div.container.se-container.se-container--sm.mb60 > div > div.se-article__body.fixed-to-pos.pl55-md.mb60 > div")
txt <- web_Obj$innerText()
txt <- strsplit(txt, "\n|\r")[[1]]
txt <- txt[txt != ""]
txt
[1] "Como una veleta que se mueve según los vientos de la indignación ciudadana, el alcalde Jorge Muñoz anunció que el Concejo Metropolitano evaluará la anulación de los contratos de Rutas de Lim... "
[2] "¿QUIERE LEER LA HISTORIA COMPLETA?"
[3] "Regístrese y obtenga 3 artículos gratis al mes y el boletín informativo."
[4] "Suscríbase para acceso ilimitado"
[5] " "
[6] " DNI "
[7] " Carnet de extranjería "
[8] " "
[9] " "
[10] " "
[11] " "
[12] " "
[13] " "
[14] "Se requiere al menos 8 caracteres, una mayúscula, una minúscula y un número"
[15] " "
[16] " Acepto los términos y condiciones y las políticas de privacidad "
[17] "Regístrese y continúe leyendo "
[18] "¿Ya tiene una cuenta? Inicie sesión "
[19] " grecaptcha.ready(function() { grecaptcha.execute('6LfO_LAZAAAAANQMr4R1KnhUFziP2QJsCQqUCHXR', {action: 'submit'}).then(function(token) { if (token) { document.getElementById('recaptcha').value = token; } }); }); "
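Note that the output above shows the article is partly paywalled, which is why only the opening of the text is returned in full. If the article body happens to be present in the static HTML, a plain rvest call using the same container class as the querySelector above might also work; this is only a sketch, assuming that node exists in the page served to R:
library(rvest)
url <- "http://semanaeconomica.com/article/sectores-y-empresas/transporte/360660-renegociar-si-anular-no/"
# Assumes div.se-article__body is present in the HTML returned by the server
body_text <- read_html(url) %>%
  html_node("div.se-article__body") %>%
  html_text()
body_text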
I need to remove some URLs from a data frame. So far I have been able to eliminate those with the pattern http://. However, there are still some websites in my corpus with the format www.stackoverflow.com or stackoverflow.org.
Here is my code:
#Sample of text
test_text <- c("la primera posibilidad real de acabar con la violencia del país es www.jorgeorlandomelo.com y luego desatar")
#Trying to remove the website with no results
test_text <- gsub("www[.]//([a-zA-Z]|[0-9]|[$-_#.&+]|[!*\\(\\),])//[.]com", "", test_text)
The outcome should be:
test_text
"la primera posibilidad real de acabar con la violencia del país es y luego desatar"
The following regex removes the test URLs:
test_text <- c("la primera posibilidad real de acabar con la violencia del país es www.jorgeorlandomelo.com y luego desatar",
"bla1 bla2 www.stackoverflow.org etc",
"this that www.nameofthewebiste.com one more"
)
gsub("(^[^w]*)www\\.[^\\.]*\\.[[:alpha:]]{2,3}(.*$)", "\\1\\2", test_text)
#[1] "la primera posibilidad real de acabar con la violencia del país es y luego desatar"
#[2] "bla1 bla2 etc"
#[3] "this that one more"
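The pattern above only targets www.-prefixed addresses. For the bare-domain format also mentioned in the question (e.g. stackoverflow.org), a slightly broader pattern could be used; this is only a sketch, and the short TLD list is an assumption that may need extending:
# Sketch: also strip bare domains such as "stackoverflow.org";
# the TLD alternation (com|org|net|edu|gov) is an assumption
gsub("[[:space:]]*(www\\.)?[[:alnum:]_-]+\\.(com|org|net|edu|gov)(/[^[:space:]]*)?", "", test_text)
#[1] "la primera posibilidad real de acabar con la violencia del país es y luego desatar"
#[2] "bla1 bla2 etc"
#[3] "this that one more"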
I'm learning to scrape data and I'm using transfermarkt for it, but today I've run into two problems. I've used SelectorGadget. My code is this:
library(rvest)
url <- "https://www.transfermarkt.es/fc-granada/startseite/verein/16795"
webpage <- read_html(url)
players_html <- html_nodes(webpage,"#yw1 .tooltipstered")
players <- html_text(players_html)
players
valores_html <- html_nodes(webpage,'.rechts.hauptlink')
valores <- html_text(valores_html)
valores
valores <- gsub(" miles €","000", valores)
valores <- gsub(" mill. €","0000", valores)
valores
valores <- gsub(",","",valores)
valores <- gsub(" ","", valores)
valores
The first problem came up when selecting the players. This is the output:
> players_html <- html_nodes(webpage,"#yw1 .tooltipstered")
> players <- html_text(players_html)
> players
character(0)
I think the problem is in the CSS selector, but it's the one SelectorGadget shows me when I select the players, so I don't know how to solve this.
The other problem occurs when selecting their market values: gsub doesn't remove some trailing whitespace, which prevents converting the characters to numbers. This is the output:
> valores_html <- html_nodes(webpage,'.rechts.hauptlink')
> valores <- html_text(valores_html)
> valores
[1] "700 miles € " "300 miles € " "800 miles € " "500 miles € "
"300 miles € "
[6] "300 miles € " "1,00 mill. € " "300 miles € " "1,20 mill. €
" "500 miles € "
[11] "1,70 mill. € " "1,50 mill. € " "1,00 mill. € " "800 miles €
" "800 miles € "
[16] "300 miles € " "2,00 mill. € " "800 miles € " "700 miles €
" "400 miles € "
[21] "700 miles € " "1,00 mill. € " "800 miles € "
> valores <- gsub(" miles €","000", valores)
> valores <- gsub(" mill. €","0000", valores)
> valores
[1] "700000 " "300000 " "800000 " "500000 " "300000 "
"300000 " "1,000000 "
[8] "300000 " "1,200000 " "500000 " "1,700000 " "1,500000 "
"1,000000 " "800000 "
[15] "800000 " "300000 " "2,000000 " "800000 " "700000 "
"400000 " "700000 "
[22] "1,000000 " "800000 "
> valores <- gsub(",","",valores)
> valores <- gsub(" ","", valores)
> valores
[1] "700000 " "300000 " "800000 " "500000 " "300000 "
"300000 " "1000000 " "300000 "
[9] "1200000 " "500000 " "1700000 " "1500000 " "1000000 "
"800000 " "800000 " "300000 "
[17] "2000000 " "800000 " "700000 " "400000 " "700000 "
"1000000 " "800000 "
Basically, the last gsub, which is meant to remove that trailing whitespace, does nothing in this case. Could someone give me a hand with these two problems?
PS: I'm using transfermarkt in Spanish.
As for gsub, we may use:
valores <- html_text(valores_html)
valores <- gsub(" miles €", "000", valores)
valores <- gsub(" mill. €", "0000", valores)
valores <- gsub("\\D", "", valores)
valores
# [1] "700000" "300000" "800000" "500000" "300000" "300000" "1000000" "300000" "1200000"
# [10] "500000" "1700000" "1500000" "1000000" "800000" "800000" "300000" "2000000" "800000"
# [19] "700000" "400000" "700000" "1000000" "800000"
where \\D matches anything other than a digit, so it also strips the stubborn trailing whitespace (most likely a non-breaking space, which is why gsub(" ", "", valores) did not remove it).
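If numeric market values are needed afterwards, the cleaned strings can simply be converted; this small follow-up is my addition, not part of the original answer:
# Convert the cleaned character vector into numeric market values
valores_num <- as.numeric(valores)
summary(valores_num)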
For player names we may write
players_html <- html_nodes(webpage,"#yw1 span.hide-for-small a.spielprofil_tooltip")
players <- html_text(players_html)
players
# [1] "Rui Silva" "Aarón Escandell" "Bernardo Cruz"
# [4] "José Antonio Martínez" "Germán Sánchez" "Pablo Vázquez"
# [7] "Álex Martínez" "Adrián Castellano" "Víctor Díaz"
# [10] "Quini" "Nicolás Aguirre" "Fede San Emeterio"
# [13] "Ángel Montoro" "Fran Rico" "Alberto Martín"
# [16] "José Antonio González" "Alejandro Pozo" "Antonio Puertas"
# [19] "Fede Vico" "Daniel Ojeda" "Álvaro Vadillo"
# [22] "Adrián Ramos" "Rodri"
In this way we also get only one set of (full) names. Using, e.g., "#yw1 a.spielprofil_tooltip" would also return their short versions.
I have a list of files and want to select only those with certain words within the file name, in this case all files that contain "Semana".
I have this code, but I'm unsure what to put in the pattern argument:
malfiles11<-list.files(path = "./", pattern = , recursive = FALSE, full.names = TRUE, ignore.case= FALSE )
Here is a section of the list:
[1] "./GBD2016_2_1000_Venezuela_MoH_Epi_2008_13.xlsxEntidades Federales10.csv"
[2] "./GBD2016_2_1000_Venezuela_MoH_Epi_2008_13.xlsxESTADO12.csv"
[3] "./GBD2016_2_1000_Venezuela_MoH_Epi_2008_13.xlsxVenezuela, Semana Epidemilógica 01 hasta la semana 13 del Año 2.00811.csv"
[4] "./GBD2016_2_1001_Venezuela_MoH_Epi_2008_14.xlsxESTADO12.csv"
[5] "./GBD2016_2_1001_Venezuela_MoH_Epi_2008_14.xlsxVenezuela, Semana Epidemilógica 01 hasta la semana 14 del Año 2.00811.csv"
[6] "./GBD2016_2_1001_Venezuela_MoH_Epi_2008_14.xlsxVenezuela, Semana Epidemiológica 14 de 2.007, Semana Epidemiológica 14 de año 200810.csv"
[7] "./GBD2016_2_1002_Venezuela_MoH_Epi_2008_15.xlsxESTADO12.csv"
[8] "./GBD2016_2_1002_Venezuela_MoH_Epi_2008_15.xlsxVenezuela, Semana Epidemilógica 01 hasta la semana 15 del Año 2.00811.csv"
Use grep:
malfiles11_semana <- malfiles11[grep(pattern = "Semana",malfiles11)]
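Alternatively, the pattern can go directly into the pattern argument of list.files(), which is what the question was asking about; this sketch keeps the other arguments from the question:
# Only list files whose names contain "Semana"
malfiles11_semana <- list.files(path = "./", pattern = "Semana",
                                recursive = FALSE, full.names = TRUE,
                                ignore.case = FALSE)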
I have been trying to build a web scraper in R to scrape the main table on https://www.binance.com/. All I have so far is this:
library(rvest)
url <- read_html("https://www.binance.com/")
binance <- url %>%
  # html_node() %>%
  html_table() %>%
  as.data.frame()
I commented out the line in the code that caused issues.
This pulls a table with headers, but the data in the table itself is just one row with what looks like some sort of code I don't understand.
I have tried different approaches, and I believe the data is actually a child of the table, but the simple code above is the only one with which I've managed to pull anything that remotely resembles the table.
I wouldn't usually ask such an open ended question but I seem to be stuck. Any help would be appreciated!
Thank you!
Here is one approach that can be considered:
library(pagedown)
library(pdftools)

# Print the rendered page to a PDF with headless Chrome, then read the text back
path_To_PDF <- "C:/test125.pdf"
chrome_print("https://www.binance.com/fr", path_To_PDF)

# Split the extracted text into lines and drop the empty ones
text <- pdftools::pdf_text(path_To_PDF)
text <- strsplit(text, "\n")[[1]]
text <- text[text != ""]
text
[1] " Inscrivez-vous maintenant - Profitez de récompenses de bienvenue jusqu’à 100 $ ! (pour les"
[2] " Racheter le cadeau"
[3] " utilisateurs vérifiés)"
[4] "Achetez et tradez plus de"
[5] "600 cryptomonnaies sur Binance"
[6] " Votre Adresse e-mail ou numéro de téléphone"
[7] " Commencez"
[8] "76 milliards $ + de 350"
[9] "Volume d’échanges sur 24 h sur la plateforme Binance Cryptomonnaies listées"
[10] "90 millions <0,10 %"
[11] "D'utilisateurs inscrits font confiance à Binance Frais de transaction les moins élevés du marché"
[12] "Cryptomonnaies populaires Voir plus"
[13] "Nom Dernier prix Variation 24h"
[14] " BNB BNB €286,6 +1,09%"
[15] " Bitcoin BTC €19 718 -1,07%"
[16] " BUSD BUSD €1,03 +0,01%"
[17] " Ethereum ETH €1 366 -0,51%"
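Since the homepage table is rendered by JavaScript, another option worth considering (my suggestion, not part of the approach above) is to skip the page entirely and query Binance's public REST API, which returns the market data as JSON:
# Sketch: fetch 24h ticker statistics from Binance's public API
# (endpoint taken from the public API documentation) and inspect a few columns
library(jsonlite)
tickers <- fromJSON("https://api.binance.com/api/v3/ticker/24hr")
head(tickers[, c("symbol", "lastPrice", "priceChangePercent")])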
I'm trying to scrape comments from this website:
http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/
This is my code for this task:
library(rvest)
url <- 'http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/'
webpage <- read_html(url)
data_html <- html_nodes(webpage, "gig-comment-body")
Unfortunately, it seems that rvest doesn't recognize the nodes through the CSS selector (gig-comment-body). data_html comes out as an empty node set, so it's not scraping anything.
Here is another solution, using RSelenium without Docker:
install.packages("RSelenium")
library(RSelenium)

# Start a local Selenium driver and browse to the article
driver <- rsDriver()
remDr <- driver[["client"]]
remDr$navigate("http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/")

# Locate the comments container, either by id or by class name
elem <- remDr$findElement(using = "id", value = "commentsDiv-779453")
# or
elem <- remDr$findElement(using = "class name", "gig-comments-comments")

elem$highlightElement() # just for interactive use in the browser
elemtxt <- elem$getElementAttribute("outerHTML") # gets us the HTML
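From there, one possible follow-up (a sketch, not part of the original answer) is to parse the retrieved HTML with rvest and pull out the comment text:
# Parse the outerHTML string returned by Selenium and extract the comment bodies
library(rvest)
comments <- read_html(elemtxt[[1]]) %>%
  html_nodes(".gig-comment-body") %>%
  html_text()
comments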
@r2evans is correct. The page builds the comment <div>s with JavaScript and it also requires a delay. I prefer Splash to Selenium (though I made splashr, so I'm not exactly impartial):
library(rvest)
library(splashr)
URL <- 'http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/'
# Needs Docker => https://www.docker.com/
# Then needs splashr::install_splash()
start_splash()
splash_local %>%
splash_response_body(TRUE) %>%
splash_go(URL) %>%
splash_wait(10) %>%
splash_html() -> pg
html_nodes(pg, "div.gig-comment-body")
## {xml_nodeset (10)}
## [1] <div class="gig-comment-body"><p><span>Algunosdesubicados comentan y se refieren a la UE<span> </span>como si en alguna forma Chil ...
## [2] <div class="gig-comment-body">Si buscan información se encontrarán que la unión Europea se está desmorona ndo por asunto de la inmi ...
## [3] <div class="gig-comment-body">Pocos inmigrantes tiene Chile en función de su población. En España hay 4.5 mill de inmigrantes. 800. ...
## [4] <div class="gig-comment-body">Chao chilenois idiotas tanto hablan y dicen que hacer cuando ni su pais les pertenece esta gobernado ...
## [5] <div class="gig-comment-body">\n<div> Victor Hugo Ramirez Lillo, de Conchalí, exiliado en Goiania, Brasil, pecha bono de exonerado, ...
## [6] <div class="gig-comment-body">Les escribo desde mi 2do pais, USA. Mi PDTE. TRUMP se bajó del TPP y Chile se va a la cresta. La o ...
## [7] <div class="gig-comment-body">En CHILE siempre fuimos muy cuidadosos con le emigración, solo lo MEJOR de Alemania, Francia, Suecia, ...
## [8] <div class="gig-comment-body"><span>Basta de inmigración!!! Santiago está lleno de vendedores ambulantes extranieros!!!¿¿esos son l ...
## [9] <div class="gig-comment-body">IGNOREN A JON LESCANO, ESE ES UN CHOLO QUE FUE DEPORTADO DE CHILE.<div>IGNOREN A LOS EXTRANJEROS MET ...
## [10] <div class="gig-comment-body">Me pregunto qué dirá el nacionalista promedio cuando agarre un libro de historia y se dé cuenta de qu ...
killall_splash()
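To get just the comment text from those nodes, a short follow-up (my addition, not shown in the original answer) would be:
# Extract the plain text of each scraped comment
comments <- html_text(html_nodes(pg, "div.gig-comment-body"))
comments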