I have an Excel file that contains certain keywords that need to be searched on Google through R.
The output to be created is a data frame which contains the following variables:
Keyword; Position (position of the URL in the search results); Title (title of the i-th search result); Text (text in that search result); URL; Domain
The keywords and some example of the output are given in the link below:
https://drive.google.com/file/d/1AM3d5Hbf5nBpbRG1ydnZM7ZG2AdUyy-6/view?usp=sharing
(Sheet 1 has the keywords and sheet 2 has the sample output)
I tried to create a similar output but there seems to be an error.
Code:
# Web Scraping in R
library(XML)
library(RCurl)
library(dplyr)
library(rvest)
library(urltools)
library(htm2txt)
library(readxl)
data <- read_excel(file.choose()) # Importing the data
output <- data.frame(matrix(ncol=6,nrow=0))
colnames(output) <- c("Name","Position","Title","Text","URL","Domain")
for (i in 1:nrow(data)) {
  search.term <- data[i,1]
  getGoogleURL <- function(search.term, domain = '.com', quotes=TRUE)
  {
    search.term <- gsub(' ', '%20', search.term) # Cleaning the Search Term
    if(quotes) search.term <- paste('%22', search.term, '%22', sep='')
    getGoogleURL <- paste('http://www.google', domain, '/search?q=',
                          search.term, sep='')
  }
  quotes <- "False"
  search.url <- getGoogleURL(search.term=search.term, quotes=quotes)
  page <- read_html(search.url)
  links <- page %>% html_nodes("a") %>% html_attr("href")
  link <- links[startsWith(links, "/url?q=")]
  link <- sub("^/url\\?q\\=(.*?)\\&sa.*$","\\1", link)
  for (j in 1:length(link)) {
    page1 <- read_html(link[j])
    name <- data[i,1]
    position <- j
    title <- page1 %>% html_node("title") %>% html_text()
    text <- gettxt(link[j])
    url <- link[j]
    domain <- suffix_extract(domain(link[j]))$host
    vect <- c(name,position,title,text,url,domain)
    output <- rbind(output,vect)
  }
}
The error being shown is:
Error in match.names(clabs, nmi) : names do not match previous names
Please help, I'm new to R.
That error comes from rbind when the columns don't line up perfectly, for instance when a column is missing or extra. In this case, it is likely because one of the elements going into your vect is empty/NULL or has length greater than 1.
rbind(data.frame(a=1,b=2), data.frame(b=3))
# Error in rbind(deparse.level, ...) :
# numbers of columns of arguments do not match
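The exact message in your question typically shows up when the two objects have the same number of columns but the column names disagree; for instance:
rbind(data.frame(a = 1, b = 2), data.frame(a = 3, c = 4))
# Error in match.names(clabs, names(xi)) :
#   names do not match previous names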
Since iteratively adding rows to a data frame gets expensive (a complete copy of the frame is made every time even a single row is added, which is grossly inefficient), it's generally better to append each row to a list and convert the whole thing into a data frame in one call.
out <- list()
for (i in seq_len(nrow(data))) {
  # ...
  for (j in seq_along(link)) {
    # ...
    vect <- c(name, position, title, text, url, domain)
    stopifnot(length(vect) == 6L)
    out <- c(out, list(vect))
  }
}
output <- do.call(rbind.data.frame, out)
colnames(output) <- c("Name", "Position", "Title", "Text", "URL", "Domain")
(In reality, instead of stopifnot, one might record the url and data retrieved into a different list for forensic purposes. Or find the missing element and NA it before adding to the list. Either way, stopifnot is intended here as a placeholder for something more contextually relevant to you and your process.)
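For instance, a rough sketch of the "NA it" idea, using a small hypothetical helper fix1 to force every field to length one before the row is collected:
fix1 <- function(x) if (length(x) == 1) x else NA  # hypothetical helper: NULL or multi-valued fields become NA
vect <- c(fix1(name), fix1(position), fix1(title),
          fix1(text), fix1(url), fix1(domain))
out <- c(out, list(vect))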
Hi dear community members.
I'm now trying to get the article titles from this website (https://yomou.syosetu.com/search.php?&type=er&order_former=search&order=new&notnizi=1&p=1) with R.
I executed the following code.
### read HTML ###
html_narou <- rvest::read_html("https://yomou.syosetu.com/search.php?&type=er&order_former=search&order=new&notnizi=1&p=1",
                               encoding = "UTF-8")
### create the common part object of CSS ###
base_css_former <- "#main_search > div:nth-child("
base_css_latter <- ") > div > a"
### create NULL objects ###
art_css <- NULL
narou_titles <- NULL
### extract the title data and store them into the NULL object ###
#### The titles of the articles don't exist in " #main_search > div:nth-child(1~4) > div > a ", so i in the loop starts from five ####
for (i in 5:24) {
  art_css <- paste0(base_css_former, as.character(i), base_css_latter)
  narou_title <- rvest::html_element(x = html_narou,
                                     css = art_css) %>%
    rvest::html_text()
  narou_titles <- base::append(narou_titles, narou_title)
}
But it takes a long time to do this with a for loop in R, and I want to use the "map" function from "purrr" instead. However, I'm not familiar with purrr::map and the process is complicated.
How can I substitute map for for-loop?
The real issue is that you’re increasing the size of your narou_titles vector on every iteration, which is notoriously slow in R. Instead, you should pre-allocate the vector to its final length, then assign elements by index. purrr does this behind the scenes, which can make it appear faster, but you can do the same thing without purrr.
With your for loop:
library(rvest)
narou_titles <- vector("character", 20)
for (i in 5:24) {
  art_css <- paste0(base_css_former, as.character(i), base_css_latter)
  # shift the index by 4 so the results fill positions 1:20 of the pre-allocated vector
  narou_titles[[i - 4]] <- html_element(
    x = html_narou,
    css = art_css
  ) %>%
    html_text()
}
With purrr::map_chr():
library(rvest)
library(purrr)
get_title <- function(i) {
  art_css <- paste0(base_css_former, as.character(i), base_css_latter)
  html_element(
    x = html_narou,
    css = art_css
  ) %>%
    html_text()
}
narou_titles <- map_chr(5:24, get_title)
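As an aside, if every title link really does sit under #main_search > div > div > a, the selector itself can do the looping in a single call; this is only a sketch and may also pick up the first few non-title nodes mentioned in the question, so the result would need checking:
library(rvest)
# Sketch: match all candidate nodes at once instead of looping over nth-child()
narou_titles <- html_narou %>%
  html_elements("#main_search > div > div > a") %>%
  html_text()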
I have written a script which uses a list of URLs as input and then scrapes certain information from the websites. I have done this with a for loop, but the processing time is already very long, and I expect the list to get bigger over time, so I wanted to re-code my script in a more efficient way. My idea was to eliminate the for loop and use pipe operators to reduce the processing time. My original (working) code is as follows:
imo <- c()
mmsi <- c()
for(i in 1:nrow(data)){
  url <- sprintf("https://www.marinevesseltraffic.com/vessels?vessel=%s&flag=&page=1&sort=lenght&direction=desc", data$NAME[i])
  page <- read_html(url)
  CSSextract1 <- html_nodes(page, '.td_imo')
  CSSextract2 <- html_nodes(page, '.td_mmsi')
  imos <- html_text(CSSextract1)[2]
  imo[i] <- imos
  mmsis <- html_text(CSSextract2)[2]
  mmsi[i] <- mmsis
}
data$IMO <- gsub("[\r \n \t]", "", imo)
data$MMSI <- gsub("[\r \n \t]", "", mmsi)
data$NAME <- gsub("\\+", " ", data$NAME)
I have re-written the code, trying to eliminate the for loop, as follows:
CSSex1 <- function(page){
  CSSextract <- html_nodes(page, '.td_imo')
  return(CSSextract)
}
data$url <- sprintf("https://www.marinevesseltraffic.com/vessels?vessel=%s&flag=&page=1&sort=lenght&direction=desc",data$NAME)
data$mmsi <- data$url %>% read_html() %>% CSSex1() %>% html_text()[2]
However, it gives me the error:
Error: `x` must be a string of length 1
I assume that, the way I coded it, the whole vector (data$url) is taken as input at once, so my question is:
Is it possible, and if so how, to take each element from data$url as input without using a (for) loop?
You may wish to set up url as a column of a data frame (data) to try:
mmsi_func <- function(x) {
  z <- x %>%
    read_html() %>%
    CSSex1() %>%
    html_text()
  z[2]
}
data <- data %>%
  rowwise() %>%
  dplyr::mutate(mmsi = mmsi_func(url))
or something along those lines. I am not sure what the expected output is supposed to look like, but if it is a list rather than a vector, you can use this minor adjustment for a list column in the dataframe:
mmsi_func <- function(x) {
  z <- x %>%
    read_html() %>%
    CSSex1() %>%
    html_text()
  z[2]
}
data <- data %>%
  rowwise() %>%
  dplyr::mutate(mmsi = list(mmsi_func(url)))
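If a plain character vector is enough, the same helper can also be mapped over the url column directly, for example with purrr (a sketch, reusing mmsi_func as defined above):
library(purrr)
# Sketch: apply mmsi_func to each url and collect the results as a character vector
data$mmsi <- map_chr(data$url, mmsi_func)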
Hi, I'm new to R and am trying to fetch the Yahoo Finance tickers/symbols for a text file that contains company names like Adidas, BMW, etc., in order to run an event study later. The file contains about 800 names. Some of them can be found on Yahoo and some cannot (that's OK).
My loop works so far, but missing results aren't displayed. Further, it only creates a table with the results that could be found. I would instead like to create a list that shows the loop variable i ("firmen") together with the result that was found, or an NA in case there was no result.
Hope you guys can help me. Thank you !!!
my code:
library(rvest)
# company_names
firmen <- c(read.table("Mappe1.txt"))
# init
df <- NULL
# loop for search names in Yahoo Ticker Lookup
for(i in firmen){
  # find url
  url <- paste0("https://finance.yahoo.com/lookup/all?s=", i, "/")
  page <- read_html(url, as = "text")
  # grab table
  table <- page %>%
    html_nodes(xpath = "//*[@id='lookup-page']/section/div/div/div/div[1]/table/tbody/tr[1]/td[1]") %>%
    html_text() %>%
    as.data.frame()
  # bind to dataframe
  df <- rbind(df, table)
}
I solved the first problem, and now empty nodes (if "i" has not been found on the Yahoo page) are displayed as "NA".
Here is the code:
library(rvest)
# teams
firmen <- c(read.table("Mappe1.txt"))
# init
df <- NULL
table <- NULL
# loop
for(i in firmen){
  # find url
  url <- paste0("https://finance.yahoo.com/lookup/all?s=", i, "/")
  page <- read_html(url, as = "text")
  # grab ticker from yahoo finance
  table <- page %>%
    html_nodes(xpath = "//*[@id='lookup-page']/section/div/div/div/div[1]/table/tbody/tr[1]/td[1]") %>%
    html_text(trim = TRUE) %>%
    replace(!nzchar(table), NA) %>%
    as.data.frame()
  # bind to dataframe
  df <- rbind(df, table)
}
Now there is just one question left:
How can I merge "df" and "firmen" into one table with the columns
"tickers" = df and "firmen" = firmen?
Because df has just one column, named ".", with the results, and the list firmen contains a number of companies placed in many columns but with just one row.
Basically, I need to transform the list "firmen", but I don't know how.
Thank you for the help
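One possible way to line the two objects up, assuming firmen is the one-row, many-column object produced by c(read.table(...)) and df holds exactly one scraped ticker (or NA) per company in its single column (a sketch, not tested against the actual files):
tickers_df <- data.frame(firmen  = unname(unlist(firmen)),  # flatten the one-row list of names into a vector
                         tickers = df[[1]],                 # the single "." column of scraped tickers
                         stringsAsFactors = FALSE)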
I am trying to scrape a website using the following:
industryurl <- "https://finance.yahoo.com/industries"
library(rvest)
read <- read_html(industryurl) %>%
html_table()
library(plyr)
industries <- ldply(read, data.frame)
industries = industries[-1,]
read <- read_html(industryurl)
industryurls <- html_attr(html_nodes(read, "a"), "href")
links <- industryurls[grep("/industry/", industryurls)]
industryurl <- "https://finance.yahoo.com"
links <- paste0(industryurl, links)
links
##############################################################################################
store <- NULL
tbl <- NULL
for(i in links){
  store[[i]] = read_html(i)
  tbl[[i]] = html_table(store[[i]])
}
#################################################################################################
I am mostly interested in the code between the ########## markers. I want to apply a function instead of a for loop, since I am running into timeout issues with Yahoo, and I want to make the extraction more human-like (it is not too much data).
My question is: how can I take links, apply a function, and set a sort of delay timer to read in the contents of the for loop?
I can paste my own version of the for loop which does not work.
This is the function I came up with
##First argument is the link you need
##The second argument is the total time for Sys.sleep
extract_function <- function(define_link, define_time){
  print(paste0("The system will stop for: ", define_time, " seconds"))
  Sys.sleep(define_time)
  first <- read_html(define_link)
  print(paste0("It will now return the table for link", define_link))
  return(html_table(first))
}
##I added the following tryCatch function
link_try_catch <- function(define_link, define_time){
  out <- tryCatch(extract_function(define_link, define_time),
                  error = function(e) NA)
  return(out)
}
## You can now retrieve the data using the links vector in two ways
## Picking the first ten, so it should not crash on link 5
p <- lapply(1:10, function(i) link_try_catch(links[i], 1))
## OR (I subset the vector just for demo purposes)
p2 <- lapply(links[1:10], function(i) extract_function(i, 1))
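If a single data frame is wanted at the end, the list returned by lapply can be stacked afterwards; a sketch, assuming successful elements are the lists of tables that html_table() returns and that failed lookups came back as NA:
ok <- vapply(p, function(x) is.list(x) && length(x) > 0, logical(1))  # failed lookups returned NA, not a list of tables
first_tables <- lapply(p[ok], `[[`, 1)                                # keep the first table found on each page
combined <- do.call(rbind, first_tables)                              # stack them; assumes the tables share the same columns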
Hope it helps
I am new to web scraping. The URL I am working with is this (https://tsmc.tripura.gov.in/doc_list). At present, I am able to extract data from the first page. Since the URL is unchanging, I don't have an identifier for the other pages to create a loop for data table extraction.
Here is my code:
install.packages("XML")
install.packages("RCurl")
install.packages("rlist")
install.packages("bitops")
library(bitops)
library(XML)
library(RCurl)
library(rlist) # needed for list.clean() below
url1 <- getURL("https://tsmc.tripura.gov.in/doc_list",
               .opts = list(ssl.verifypeer = FALSE))
table1 <- readHTMLTable(url1)
table1 <- list.clean(table1, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(table1, function(t) dim(t)[1]))
table1[[which.max(n.rows)]]
View(table1)
table11= table1[["NULL"]]
Please help. Thanks!
Perhaps try this solution:
url <- "https://tsmc.tripura.gov.in/doc_list?page="
sq <- seq(1, 30) # There appears to be 30 pages so we create a sequence of 1:30 results
links <- paste0(url, sq) #Paste the sequence after the url "page="
store <- NULL
tbl <- NULL
library(rvest) #extract the tables
for(i in links){
  store[[i]] = read_html(i)
  tbl[[i]] = html_table(store[[i]])
}
library(plyr)
df <- ldply(tbl, data.frame) #combine the list of data frames into one large data frame
df$`.id` <- gsub("https://tsmc.tripura.gov.in/doc_list?page=", " ", df$`.id`, fixed = TRUE)
Which gives 846 observations across 8 variables.
EDIT: I found that the first page's URL does not follow the ?page= sequence. In order to add the first page and rbind it with the rest of the data, use the following:
firsturl <- "https://tsmc.tripura.gov.in/doc_list"
first_store = read_html(firsturl)
first_tbl = html_table(first_store)
first_df <- as.data.frame(first_tbl)
first_df$`.id` <- 0
df2 <- rbind(first_df, df)