Web scraping techniques to obtain the links contained in a website of interest - r

I am working with the following website:
http://www.crowdrise.com/skollsechallenge
Specifically, this page lists 57 crowdfunding campaigns. Each of those campaigns has text that details why they want to raise money, the total money raised so far, and the team members. Some of the campaigns also specify a fundraising goal. I want to write some R code that will scrape and organize this information from each of the 57 sites.
For now, I am trying to scrape each of the 57 links that lead to the 57 different campaign pages.
Below is the code I tried:
library("RCurl")
library("XML")
library("stringr")
url <- "http://www.crowdrise.com/skollSEchallenge"
cat("URL:", url)
url.data <- readLines(url)
doc <- htmlTreeParse(url.data, useInternalNodes=TRUE)
xp_exp <- "//a[@href]"
links <- xpathSApply(doc, xp_exp, xmlValue)
The variable links, however, does not contain links to the 57 campaign pages... I am a little confused. Can someone help me?
Thanks,

Using this, for example:
xpathApply(doc, '//*[@id="teams-results"]/div/div/div/h4/a', xmlGetAttr, 'href')
you will get the 16 links of the first page. But you still have the problem of triggering the JavaScript behind the "Show More Teams" button to see the rest of the links; one way around that is sketched below.
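If you do need the teams hidden behind that button, one option is to drive a real browser with RSelenium and click the button before parsing. This is only a sketch: the button locator and the number of clicks are assumptions that would need to be checked against the live page.
library(RSelenium)
library(XML)

driver <- rsDriver(browser = "firefox")  # starts a Selenium server plus a browser
remDr  <- driver$client
remDr$navigate("http://www.crowdrise.com/skollSEchallenge")

# click "Show More Teams" a few times; the locator text is an assumption
for (i in 1:5) {
  btn <- tryCatch(
    remDr$findElement(using = "partial link text", value = "MORE TEAMS"),
    error = function(e) NULL
  )
  if (is.null(btn)) break
  btn$clickElement()
  Sys.sleep(2)  # give the extra teams time to load
}

# parse the fully expanded page with the same XPath as above
doc   <- htmlParse(remDr$getPageSource()[[1]])
links <- xpathSApply(doc, '//*[@id="teams-results"]//h4/a', xmlGetAttr, 'href')

remDr$close()
driver$server$stop()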

This admittedly ugly solution gets 32 of them; it is very verbose, but it does not need to evaluate JavaScript. A slightly cleaner variant follows after the code.
library(httr)
# fetch the page and split the response into individual lines
x <- as.character(GET("http://www.crowdrise.com/skollSEchallenge"))
x <- unlist(strsplit(x, split = "\n", fixed = TRUE))
# keep only the lines that contain a campaign profile block
x <- gsub("\t", "", grep('class="profile">', x, value = TRUE, fixed = TRUE))
x <- unlist(strsplit(x, split = 'class="profile">', fixed = TRUE))[-1]
# strip the surrounding markup so only the campaign slug remains
x <- gsub("\r<div class=\"content\">\r<a href=\"/", "", x, fixed = TRUE)
x <- substr(x, 1, as.integer(regexpr('\"><img', x)) - 1)
# rebuild the full URLs
x <- paste("www.crowdrise.com/", x, sep = '')


Google Search in R [duplicate]

I used the following code:
library(XML)
library(RCurl)
getGoogleURL <- function(search.term, domain = '.co.uk', quotes = TRUE)
{
  search.term <- gsub(' ', '%20', search.term)
  if (quotes) search.term <- paste('%22', search.term, '%22', sep = '')
  getGoogleURL <- paste('http://www.google', domain, '/search?q=', search.term, sep = '')
}
getGoogleLinks <- function(google.url)
{
  doc <- getURL(google.url, httpheader = c("User-Agent" = "R(2.10.0)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...){})
  nodes <- getNodeSet(html, "//a[@href][@class='l']")
  return(sapply(nodes, function(x) x <- xmlAttrs(x)[[1]]))
}
search.term <- "cran"
quotes <- "FALSE"
search.url <- getGoogleURL(search.term=search.term, quotes=quotes)
links <- getGoogleLinks(search.url)
I would like to find all the links that resulted from my search, but I get the following result:
> links
list()
How can I get the links?
In addition, I would like to get the headlines and summaries of the Google results; how can I get those?
And finally, is there a way to get the links that reside in the ChillingEffects.org results?
If you look at the html variable, you can see that the search result links are all nested in <h3 class="r"> tags.
Try changing your getGoogleLinks function to:
getGoogleLinks <- function(google.url) {
  doc <- getURL(google.url, httpheader = c("User-Agent" = "R (2.10.0)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...){})
  nodes <- getNodeSet(html, "//h3[@class='r']//a")
  return(sapply(nodes, function(x) xmlAttrs(x)[["href"]]))
}
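For the headline part of the question, a hedged sketch: the visible headline is the text of the same <h3 class="r"> node, so xmlValue on those nodes should return it. The summary snippets sit in other nodes whose class names Google changes regularly, so they are left out here.
library(RCurl)
library(XML)

getGoogleHeadlines <- function(google.url) {
  doc  <- getURL(google.url, httpheader = c("User-Agent" = "R (2.10.0)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...){})
  # the headline is the link text inside each <h3 class="r"> result node
  nodes <- getNodeSet(html, "//h3[@class='r']")
  sapply(nodes, xmlValue)
}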
I created this function to read in a list of company names and then get the top website result for each. It will get you started; you can then adjust it as needed.
# libraries: URLencode() comes from base utils, so only rvest needs to be attached
library(rvest)

# load data
d <- read.csv("P:\\needWebsites.csv")
c <- as.character(d$Company.Name)

# function for getting the top website result for a company name
getWebsite <- function(name)
{
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page <- read_html(url)
  results <- page %>%
    html_nodes("cite") %>% # get all nodes of type cite; change this to grab other node types
    html_text()
  result <- results[1]
  return(as.character(result)) # returns only the top hit; drop the [1] above if you want them all
}

# apply the function to the list of company names
websites <- data.frame(Website = sapply(c, getWebsite))
The other solutions here don't work for me. Here's my take on @Bryce-Chamberlain's answer, which works for me in August 2019. It also answers another closed question: company name to URL in R.
# install.packages("rvest")
get_first_google_link <- function(name, root = TRUE) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page <- xml2::read_html(url)
  # extract all links
  nodes <- rvest::html_nodes(page, "a")
  links <- rvest::html_attr(nodes, "href")
  # extract the first link of the search results
  link <- links[startsWith(links, "/url?q=")][1]
  # clean it
  link <- sub("^/url\\?q\\=(.*?)\\&sa.*$", "\\1", link)
  # get the root if relevant
  if (root) link <- sub("^(https?://.*?/).*$", "\\1", link)
  link
}

companies <- data.frame(company = c("apple acres llc", "abbvie inc", "apple inc"))
companies <- transform(companies, url = sapply(company, get_first_google_link))
companies
#> company url
#> 1 apple acres llc https://www.appleacresllc.com/
#> 2 abbvie inc https://www.abbvie.com/
#> 3 apple inc https://www.apple.com/
Created on 2019-08-10 by the reprex package (v0.2.1)
The free solutions don't work anymore, and they don't let you search for regions outside your own location. Here's a solution using the Google Custom Search API. The API allows 100 free calls per day, and each call returns at most 10 results, so the function below returns only page 1 (10 results).
library(dplyr)  # the function body uses mutate/rename/select and the pipe; data.table is called via ::

Google.Search.API <- function(keyword, google.key, google.cx, country = "us")
{
  # keyword = keywords[10]; country = "us"
  url <- paste0("https://www.googleapis.com/customsearch/v1?"
                , "key=", google.key
                , "&q=", gsub(" ", "+", keyword)
                , "&gl=", country          # country
                , "&hl=en"                 # language from browser, English
                , "&cx=", google.cx
                , "&fields=items(link)"
  )
  d2 <- url %>%
    httr::GET(config = httr::config(ssl_verifypeer = TRUE)) %>%
    httr::content(.) %>% .[["items"]] %>%
    data.table::rbindlist(.) %>%
    mutate(keyword, SERP = row_number(), search.engine = "Google API") %>%
    rename(source = link) %>%
    select(search.engine, keyword, SERP, source)
  pause <- round(runif(1, min = 1.1, max = 5), 1)
  if (nrow(d2) == 0) {
    cat("\nPausing", pause, "seconds. Failed for:", keyword)
  } else {
    cat("\nPausing", pause, "seconds. Successful for:", keyword)
  }
  Sys.sleep(pause)
  rm(keyword, country, pause, url, google.key, google.cx)
  return(d2)
}
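A minimal usage sketch, assuming dplyr is attached as above; the key, the search engine ID, and the keywords below are placeholders:
google.key <- "YOUR_API_KEY"           # placeholder
google.cx  <- "YOUR_SEARCH_ENGINE_ID"  # placeholder
keywords   <- c("cran", "rvest web scraping")

# one API call per keyword, then stack the per-keyword tables
serps <- lapply(keywords, Google.Search.API,
                google.key = google.key, google.cx = google.cx)
serps <- dplyr::bind_rows(serps)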

I'm trying to scrape multiple pages

I'm trying to scrape reviews from multiple pages of the same gaming website.
I tried running and altering the code from one of the answers I found on here: R web scraping across multiple pages.
library(tidyverse)
library(rvest)
url_base <- "https://www.metacritic.com/browse/games/score/metascore/all/ps4?sort=desc&page=0"
map_df(1:17, function(i) {
  cat(".")
  pg <- read_html(sprintf(url_base, i))
  data.frame(Name = html_text(html_nodes(pg, "#main .product_title a")),
             MetaRating = as.numeric(html_text(html_nodes(pg, "#main .positive"))),
             UserRating = as.numeric(html_text(html_nodes(pg, "#main .textscore"))),
             stringsAsFactors = FALSE)
}) -> ps4games_metacritic
The result is that the first page is scraped 17 times, instead of the 17 pages on the website.
I have made three changes to your code:
- Since their page numbering starts at 0, map_df(1:17...) should be map_df(0:16...).
- As proposed by BigDataScientist, url_base should be set like this: url_base <- "https://www.metacritic.com/browse/games/score/metascore/all/ps4?sort=desc&page=%d"
- If you use "#main .positive" you will get an error while scraping the 7th page, since games without positive scores start there; unless you only want to scrape games with positive evaluations (which would mean slightly different code), you should use "#main .game" instead.
library(tidyverse)
library(rvest)

url_base <- "https://www.metacritic.com/browse/games/score/metascore/all/ps4?sort=desc&page=%d"

map_df(0:16, function(i) {
  cat(".")
  pg <- read_html(sprintf(url_base, i))
  data.frame(Name = html_text(html_nodes(pg, "#main .product_title a")),
             MetaRating = as.numeric(html_text(html_nodes(pg, "#main .game"))),
             UserRating = as.numeric(html_text(html_nodes(pg, "#main .textscore"))),
             stringsAsFactors = FALSE)
}) -> ps4games_metacritic

JSON applied over a dataframe in R

I used the code below on one search and it returned a perfect result, looking for the keyword Emaar, pasted at the end of the query:
library(httr)
library(jsonlite)
query<-"https://www.googleapis.com/customsearch/v1?key=AIzaSyA0KdZHRkAjmoxKL14eEXp2vnI4Yg_po38&cx=006431301429107149113:as7yqcm2qc8&q=Emaar"
result11 <- content(GET(query))
print(result11)
result11_JSON <- toJSON(result11)
result11_JSON <- fromJSON(result11_JSON)
result11_df <- as.data.frame(result11_JSON)
Now I want to apply the same function over a data frame containing keywords, so I made the testing .csv file below:
Company Name
[1] ADES International Holding Ltd
[2] Emirates REIT (CEIC) Limited
[3] POLARCUS LIMITED
I called it Testing Website Extraction.csv. Code used:
test_companies <- read.csv("... \\Testing Website Extraction.csv")

# replace spaces with "+" and paste the query before each name
# (the query already has my unique Google key and search engine ID)
test_companies$plus <- gsub(" ", "+", test_companies$Company.Name)
query <- "https://www.googleapis.com/customsearch/v1?key=AIzaSyCmD6FRaonSmZWrjwX6JJgYMfDSwlR1z0Y&cx=006431301429107149113:as7yqcm2qc8&q="
test_companies$plus <- paste0(query, test_companies$plus)
a <- test_companies$plus
length(a)

function_webs_search <- function(web_search) {content(GET(web_search))}
result <- lapply(as.character(a), function_webs_search)
The result here is a list of length 3 (one per search term), and the sublists within each term contain url (a list of 2), queries (a list of 2), ... items (a list of 10); these are the same for each search term (same lengths separately). My issue here is applying the remainder of the code.
# when I run:
result_JSON <- toJSON(result)
result_JSON <- as.list(fromJSON(result_JSON))
I get a list of 6 lists that contain sublists, and putting them into a tidy data frame where the results are listed under each other (not separately) is proving to be difficult.
Also note that I tried taking each of the 3 separate lists within "result" one by one, but that is a lot of manual labor if I have a longer list of keywords.
The expected end result should include 30 observations of 37 variables (for each search term, 10 observations of 37 variables, all underneath each other).
Things I have tried unsuccessfully; these do flatten the list, but not into the shape I need:
#do.call(c , result)
#all.equal(listofvectors, res, check.attributes = FALSE)
#unlist(result, recursive = FALSE)
# for (i in 1:length(result)) {listofvectors <- c(listofvectors, result[[i]])}
#rbind()
#rbind.fill()
Even after flattening, I don't know how to organize the results into a tidy final output that a non-R user can interact with.
Any help here would be greatly appreciated.
I am here in case anything is not clear about my question.
Always happy to learn more about R, so please bear with me as I am just starting to catch up.
All the best and thanks in advance!
Basically, what I did was extract only the columns I need from the list of data frames; below is the final code:
library(httr)
library(jsonlite)
library(tidyr)
library(stringr)
library(purrr)
library(plyr)
test_companies <- read.csv("c:\\users\\... Companies Without Websites List.csv")
test_companies$plus <- gsub(" ", "+", test_companies$Company.Name)
query <- "https://www.googleapis.com/customsearch/v1?key=AIzaSyCmD6FRaonSmZWrjwX6JJgYMfDSwlR1z0Y&cx=006431301429107149113:as7yqcm2qc8&q="
test_companies$plus <- paste0(query, test_companies$plus)
a <- test_companies$plus
length(a)
function_webs_search <- function(web_search) {content(GET(web_search))}
result <- lapply(as.character(a), function_webs_search)
function_toJSONall <- function(all) {toJSON(all)}
a <- lapply(result, function_toJSONall)
function_fromJSONall <- function(all) {fromJSON(all)}
b <- lapply(a, function_fromJSONall)
function_dataframe <- function(all) {as.data.frame(all)}
c <- lapply(b, function_dataframe)
# keep only the columns I need, then stack the three data frames
function_column <- function(all) {all[, 15:30]}
result_final <- lapply(c, function_column)
results_df <- rbind.fill(result_final)
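As a possible alternative to the toJSON/fromJSON round trip, a hedged sketch: ask httr for the raw JSON text and let jsonlite flatten the items in one step. The fetch_items helper and the query_id column are illustrative names, and the resulting columns are whatever the Custom Search API returns, so the column selection above would become a selection by name.
library(httr)
library(jsonlite)
library(dplyr)

# hypothetical helper: one call per query URL, returning the 10 items as a data frame
fetch_items <- function(query_url) {
  txt <- content(GET(query_url), as = "text", encoding = "UTF-8")
  fromJSON(txt, flatten = TRUE)$items
}

# "a" is the vector of query URLs built earlier; query_id records which search term each row came from
results_df <- bind_rows(lapply(as.character(a), fetch_items), .id = "query_id")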

rvest limits the results to 24 items

Good evening everyone,
I am currently trying to scrape the Zalando website to get the names of all the products that appear on the first two pages of the following URL: https://www.zalando.nl/damesschoenen-sneakers/
Here is my code:
require(rvest)
require(dplyr)
url <- read_html('https://www.zalando.nl/damesschoenen-sneakers/')
selector_name <- '.z-nvg-cognac_brandName-2XZRz'
output <- html_nodes(x = url, css = selector_name) %>% html_text
The result is a list of 24 items, while there are 86 products on the page. Has anyone encountered this issue before? Any idea how to solve it?
Thank you for your help.
Thomas
I just tried what Nicolas Velasqueaz suggested
url <- read_html('https://www.zalando.nl/damesschoenen-sneakers/')
write_html(url, file = "test_url.html")
selector_name <- '.z-nvg-cognac_brandName-2XZRz'
test_file <- read_html("test_url.html")
output <- html_nodes(x = test_file, css = selector_name) %>% html_text
The results are the same: I still have only 24 items showing up.
So if anyone has a solution, it would be very much appreciated.
Thank you for your kind answer. I will dive into that direction.
I also found a way to get the brand names without RSelenium; here is my code:
library('httr')
library('magrittr')
library('rvest')

################# FUNCTION #################
extract_data <- function(firstPosition, lastPosition){
  mapply(function(first, last){
    substr(pageContent, first, last) %>%
      gsub("\\W", "\\1 ", .) %>%
      gsub("^ *|(?<= ) | *$", "", ., perl = TRUE)
  },
  firstPosition, lastPosition)
}
############################################

url <- 'https://www.zalando.nl/damesschoenen-sneakers/'
page <- GET(url)
pageContent <- content(page, as = 'text')

# Get the brand name of the products
firstPosition <- unlist(gregexpr('brand_name', pageContent)) + nchar('brand_name') + 1
lastPosition <- unlist(gregexpr('is_premium', pageContent)) - 2
extract_data(firstPosition, lastPosition)
Unfortunately, it starts getting difficult when you want something other than the brand name, so maybe the best solution is to do it with RSelenium; a sketch of that approach follows below.
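A hedged RSelenium sketch along those lines: load the page in a real browser, scroll to the bottom so the lazily loaded products render, then hand the final HTML to rvest. The class selector is the one used above and may change over time.
library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "firefox")
remDr  <- driver$client
remDr$navigate("https://www.zalando.nl/damesschoenen-sneakers/")

# press End a few times so the lazily loaded products are rendered
body <- remDr$findElement(using = "css selector", value = "body")
for (i in 1:5) {
  body$sendKeysToElement(list(key = "end"))
  Sys.sleep(2)
}

# hand the fully rendered HTML back to rvest
page   <- read_html(remDr$getPageSource()[[1]])
output <- html_text(html_nodes(page, ".z-nvg-cognac_brandName-2XZRz"))

remDr$close()
driver$server$stop()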

Trouble Scraping Whole chart from HTML

I'm trying to scrape the entire chart from this website:
http://stats.ncaa.org/team/stats/12021?org_id=749&sport_year_ctl_id=12021
But when I run this code:
library(XML)
library(gsubfn)
URL = 'http://stats.ncaa.org/team/stats?org_id=381&sport_year_ctl_id=12021'
Player_Stats = readHTMLTable(URL, header = T, stringsAsFactors = F)
Player_Stats
Player_Stats only returns the data for the players, up until and not including the Total line.
What I want is the Team Totals and Opponent Totals.
Thanks
That information is in a <tfoot> element at the bottom of the table, which is why readHTMLTable() isn't picking up on it. You can extract the <tfoot> bit separately using getNodeSet() as follows. I've bound the two bits of the table together at the end, but you may want to keep the different kinds of information apart for your application.
library(XML)
library(gsubfn)

URL <- 'http://stats.ncaa.org/team/stats?org_id=381&sport_year_ctl_id=12021'

# the player rows, as before
Player_Stats <- readHTMLTable(URL, header = TRUE, stringsAsFactors = FALSE)
stats <- Player_Stats$stat_grid

# pull the <tfoot> (Team Totals / Opponent Totals) out separately
doc <- htmlTreeParse(URL, useInternalNodes = TRUE)
foot <- getNodeSet(doc, "//tfoot")
totals <- readHTMLTable(unlist(foot)[[1]])

# give the totals the same column names and bind everything together
colnames(totals) <- colnames(stats)
fulltable <- rbind(stats, totals)
