Web scraping in R using for loop

I would like to scrape the data from this link, and I have written the following code in R to do so. This, however, does not work and only returns the first page of the results. Apparently, the loop does not work. Does anybody know what's wrong with the loop?
library('rvest')
for (i in 1:40) {
webpage <- read_html(paste0(("http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page=, i"))
rank_data_html <- html_nodes(webpage,'tr+ tr td:nth-child(1)')
rank_data <- html_text(rank_data_html)
rank_data<-as.numeric(rank_data)
title_data_html <- html_nodes(webpage,'.censo_list font')
title_data <- html_text(title_data_html)
author_data_html <- html_nodes(webpage,'.censo_list+ td font')
author_data <- html_text(author_data_html)
country_data_html <- html_nodes(webpage,'.censo_list~ td:nth-child(4) font')
rcountry_data <- html_text(country_data_html)
year_data_html <- html_nodes(webpage,'tr+ tr td:nth-child(5) font')
year_data <- html_text(year_data_html)
type_data_html <- html_nodes(webpage,'tr+ tr td:nth-child(6) font')
type_data <- html_text(type_data_html)
}
censorship_df<-data.frame(Rank = rank_data, Title = title_data, Author = author_data, Country = rcountry_data, Type = type_data, Year = year_data)
write.table(censorship_df, file="sample.csv",sep=",",row.names=F)

Are you sure there's anything wrong with the loop? I would expect it to get the first page of results 40 times. Look at
webpage <- read_html(paste0(("http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page=, i"))
Shouldn't that be the following? (The difference is in the last few characters of the string: the closing quotation mark moves before the comma.)
webpage <- read_html(paste0(("http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page=", i))
What paste0 does in R is stitch strings together without any separator. But you only have one string, so every iteration tries to fetch results for the literal text page=, i. What you want is page=1 through page=40. So place the quotation mark as page=", i so that paste0 joins the URL and i together.
I'm not an R programmer, but that simply leaps out at me.
Source for paste0 behavior.
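For illustration, here is a minimal sketch of the difference, using a short stand-in URL (example.com) rather than the real search URL:
# with the comma and i inside the quotes, i stays literal text:
paste0("http://example.com/result.html?page=, i")
# [1] "http://example.com/result.html?page=, i"
# with the quote closed before the comma, paste0 joins the URL and the loop counter:
i <- 3
paste0("http://example.com/result.html?page=", i)
# [1] "http://example.com/result.html?page=3"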

Related

Scraping multiple articles by using purrr::map, not for loop in R

Hi dear community members.
I'm trying to get the article titles from this website (https://yomou.syosetu.com/search.php?&type=er&order_former=search&order=new&notnizi=1&p=1) using R.
I executed the following code.
### read HTML ###
html_narou <- rvest::read_html("https://yomou.syosetu.com/search.php?&type=er&order_former=search&order=new&notnizi=1&p=1",
encoding = "UTF-8")
### create the common part object of CSS ###
base_css_former <- "#main_search > div:nth-child("
base_css_latter <- ") > div > a"
### create NULL objects ###
art_css <- NULL
narou_titles <- NULL
### extract the title data and store them into the NULL object ###
#### The article titles don't exist in " #main_search > div:nth-child(1~4) > div > a ", so i in the loop starts at 5 ####
for (i in 5:24) {
art_css <- paste0(base_css_former, as.character(i), base_css_latter)
narou_title <- rvest::html_element(x = html_narou,
css = art_css) %>%
rvest::html_text()
narou_titles <- base::append(narou_titles, narou_title)
}
But this takes a long time with a for loop in R, and I want to use the "map" function from "purrr" instead. However, I'm not familiar with purrr::map and the process is complicated.
How can I substitute map for for-loop?
The real issue is that you’re increasing the size of your narou_titles vector on every iteration, which is notoriously slow in R. Instead, you should pre-allocate the vector to its final length, then assign elements by index. purrr does this behind the scenes, which can make it appear faster, but you can do the same thing without purrr.
With your for loop:
library(rvest)
narou_titles <- vector("character", 20)
for (i in 5:24) {
  art_css <- paste0(base_css_former, as.character(i), base_css_latter)
  # i runs from 5 to 24, so shift it back by 4 to fill slots 1:20 of the pre-allocated vector
  narou_titles[[i - 4]] <- html_element(
    x = html_narou,
    css = art_css
  ) %>%
    html_text()
}
With purrr::map_chr():
library(rvest)
library(purrr)
get_title <- function(i) {
  art_css <- paste0(base_css_former, as.character(i), base_css_latter)
  html_element(
    x = html_narou,
    css = art_css
  ) %>%
    html_text()
}
narou_titles <- map_chr(5:24, get_title)

Web Scraping through R

I have an Excel file that contains certain keywords that need to be searched on Google through R.
The output to be created is a data frame which contains the following variables:
Keyword; Position (position of the URL in the search results); Title (title of the ith search result); Text (text in that search result); URL; Domain
The keywords and some example of the output are given in the link below:
https://drive.google.com/file/d/1AM3d5Hbf5nBpbRG1ydnZM7ZG2AdUyy-6/view?usp=sharing
(Sheet 1 has the keywords and sheet 2 has the sample output)
I tried to create a similar output but there seems to be an error.
Code:
# Web Scraping in R
library(XML)
library(RCurl)
library(dplyr)
library(rvest)
library(urltools)
library(htm2txt)
library(readxl)
data <- read_excel(file.choose()) # Importing the data
output <- data.frame(matrix(ncol=6,nrow=0))
colnames(output) <- c("Name","Position","Title","Text","URL","Domain")
for (i in 1:nrow(data)) {
search.term <- data[i,1]
getGoogleURL <- function(search.term, domain = '.com', quotes=TRUE)
{
search.term <- gsub(' ', '%20', search.term) # Cleaning the Search Term
if(quotes) search.term <- paste('%22', search.term, '%22', sep='')
getGoogleURL <- paste('http://www.google', domain, '/search?q=',
search.term, sep='')
}
quotes <- "False"
search.url <- getGoogleURL(search.term=search.term, quotes=quotes)
page <- read_html(search.url)
links <- page %>% html_nodes("a") %>% html_attr("href")
link <- links[startsWith(links, "/url?q=")]
link <- sub("^/url\\?q\\=(.*?)\\&sa.*$","\\1", link)
for (j in 1:length(link)) {
page1 <- read_html(link[j])
name <- data[i,1]
position <- j
title <- page1 %>% html_node("title") %>% html_text()
text <- gettxt(link[j])
url <- link[j]
domain <- suffix_extract(domain(link[j]))$host
vect <- c(name,position,title,text,url,domain)
output <- rbind(output,vect)
}
}
The error being shown is:
Error in match.names(clabs, nmi) : names do not match previous names
Please help, I'm new to R.
That error comes from rbind when the columns don't line up perfectly, for instance when a column is missing or extra. In this case, it might be because one of the elements going into your vect is empty/NULL or has length greater than 1.
rbind(data.frame(a=1,b=2), data.frame(b=3))
# Error in rbind(deparse.level, ...) :
# numbers of columns of arguments do not match
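When the column counts match but the names differ, you get the message from the question instead (again, just a sketch to show where the error comes from):
rbind(data.frame(a = 1, b = 2), data.frame(b = 3, c = 4))
# Error in match.names(clabs, nmi) : names do not match previous names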
Since iteratively adding rows to a frame is expensive (a complete copy of the frame is made every time even one row is added, which is grossly inefficient), it's generally better to append to a list and convert it into a frame in one call.
out <- list()
for (i in seq_len(nrow(data))) {
  # ...
  for (j in seq_along(link)) {
    # ...
    vect <- c(name, position, title, text, url, domain)
    stopifnot(length(vect) == 6L)
    out <- c(out, list(vect))
  }
}
output <- do.call(rbind.data.frame, out)
colnames(output) <- c("Name", "Position", "Title", "Text", "URL", "Domain")
(In reality, instead of stopifnot, one might record the url and data retrieved into a different list for forensic purposes. Or find the missing element and NA it before adding to the list. Either way, stopifnot is intended here as a placeholder for something more contextually relevant to you and your process.)

change a for loop to a function to scrape a website

I am trying to scrape a website using the following:
industryurl <- "https://finance.yahoo.com/industries"
library(rvest)
read <- read_html(industryurl) %>%
html_table()
library(plyr)
industries <- ldply(read, data.frame)
industries = industries[-1,]
read <- read_html(industryurl)
industryurls <- html_attr(html_nodes(read, "a"), "href")
links <- industryurls[grep("/industry/", industryurls)]
industryurl <- "https://finance.yahoo.com"
links <- paste0(industryurl, links)
links
##############################################################################################
store <- NULL
tbl <- NULL
for(i in links){
store[[i]] = read_html(i)
tbl[[i]] = html_table(store[[i]])
}
#################################################################################################
I am mostly interested in the code between the ########## markers. I want to apply a function instead of a for loop, since I am running into timeout issues with Yahoo, and I want to make the extraction more human-like (it is not too much data).
My question is, how can I take links apply a function and set a sort of delay timer to read in the contents of the for loop?
I can paste my own version of the for loop which does not work.
This is the function I came up with
##First argument is the link you need
##The second argument is the total time for Sys.sleep
extract_function <- function(define_link, define_time){
  print(paste0("The system will stop for: ", define_time, " seconds"))
  Sys.sleep(define_time)
  first <- read_html(define_link)
  print(paste0("It will now return the table for link: ", define_link))
  return(html_table(first))
}
##I added the following tryCatch function
link_try_catch <- function(define_link, define_time){
  out <- tryCatch(extract_function(define_link, define_time), error = function(e) NA)
  return(out)
}
##You can now retrieve the data using the links vector in two ways
##Picking the first ten, so it should not crash on link 5
p <- lapply(1:10, function(i) link_try_catch(links[i], 1))
##OR (I subset the vector just for demo purposes)
p2 <- lapply(links[1:10], function(i) extract_function(i, 1))
Hope it helps
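If the goal is also to make the pauses look less mechanical, one small variation on the above (not part of the original answer, just a suggestion) is to pass a randomized wait time instead of a fixed one:
## wait somewhere between 1 and 3 seconds before each request
p3 <- lapply(links[1:10], function(i) link_try_catch(i, runif(1, min = 1, max = 3)))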

for loop read html challenge in R

I would like to loop a parse query.
The thing that stops me is that I need to insert a number into the URL that R then reads and parses. The URL has to be between quotation marks; does anyone know how to insert the "i" from the for loop so that it is substituted and R can still retrieve the page?
This is the code (I would like a list with all the artists in the charts for the 52 weeks):
library(rvest)
weeknummer = 1:52
l <- c()
b <- c()
for (i in weeknummer){
htmlpage <- read_html("http://www.top40.nl/top40/2015/week-"[i]"")
Top40html <- html_nodes(htmlpage,".credit")
top40week1 <- html_text(Top40html)
b <- top40week1
l <- c(l,b)
}
You need to turn the URL into one string.
pageurl <- paste0("http://www.top40.nl/top40/2015/week-",i)
htmlpage <- read_html(pageurl)
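Dropped into the original loop, that looks roughly like this (a sketch that keeps the rest of the question's code as it was):
library(rvest)
weeknummer <- 1:52
l <- c()
for (i in weeknummer) {
  pageurl <- paste0("http://www.top40.nl/top40/2015/week-", i)
  htmlpage <- read_html(pageurl)
  Top40html <- html_nodes(htmlpage, ".credit")
  top40week <- html_text(Top40html)
  l <- c(l, top40week)
}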

Loop for a string

This code will be used to count the number of links in my tweets collection. The collection was gathered from 10 accounts. The question is, how could I loop through the ten accounts in one piece of code and drop the output into a table or graph? "Unames" represents the names of the accounts. Thanks in advance.
mydata <- read.csv("tweets.csv",sep=",", header=TRUE)
head(mydata)
dim(mydata)
colnames(mydata)
# tweets for each university
table(mydata$University)
Unames<- unique(mydata$University)
mystring <- function(Uname, string){
  mydata_temp <- subset(mydata, University == Uname)
  mymatch <- rep(NA, dim(mydata_temp)[1])
  for(i in 1:dim(mydata_temp)[1]){
    mymatch[i] <- length(grep(string, mydata_temp[i, 2]))
  }
  return(mymatch)
}
# web link, e.g. (Here I would like to see the total links for all universities in a table or graph. The code below only gives me the output one account at a time!)
mylink <- mystring(Unames[1],"http://")
So my suspicions are wrong, and you do have a body of data for which this command produces the desired results (and you expect the other accounts to behave the same way):
mylink <- mystring(Unames[1],"http://")
In that case, you should just do this:
links_list <- lapply(Unames, mystring, "http://")
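Each element of links_list is then the match vector for one account. To get the totals the question asks for as a table or a quick graph, one possible follow-up (a sketch using only base R and the objects defined above) is:
# total number of link-containing tweets per account
link_counts <- sapply(links_list, sum)
names(link_counts) <- Unames
link_counts                    # table-style summary
barplot(link_counts, las = 2)  # quick bar chart of the totals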
