I would like to loop a parse query.
What stops me is that I need to insert a number into the URL that R then reads and parses. The URL has to be a quoted string, so does anyone know how to insert the i from the for loop in a way that gets substituted into the string while R is still able to retrieve the page?
This is the code (I would like a list with all the artists of the charts of the 52 weeks):
library(rvest)
weeknummer = 1:52
l <- c()
b <- c()
for (i in weeknummer){
  htmlpage <- read_html("http://www.top40.nl/top40/2015/week-"[i]"")
  Top40html <- html_nodes(htmlpage, ".credit")
  top40week1 <- html_text(Top40html)
  b <- top40week1
  l <- c(l, b)
}
You need to turn the URL into one string before passing it to read_html(); paste0() stitches the base URL and i together:
pageurl <- paste0("http://www.top40.nl/top40/2015/week-",i)
htmlpage <- read_html(pageurl)
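Putting it together, the corrected loop from the question would look something like this:
library(rvest)
l <- c()
for (i in 1:52) {
  pageurl <- paste0("http://www.top40.nl/top40/2015/week-", i)
  htmlpage <- read_html(pageurl)
  l <- c(l, html_text(html_nodes(htmlpage, ".credit")))
}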
I would like to scrape the data from this link, and I have written the following code in R to do so. This, however, does not work and only returns the first page of the results. Apparently, the loop does not work. Does anybody know what's wrong with the loop?
library('rvest')
for (i in 1:40) {
  webpage <- read_html(paste0(("http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page=, i"))
  rank_data_html <- html_nodes(webpage, 'tr+ tr td:nth-child(1)')
  rank_data <- html_text(rank_data_html)
  rank_data <- as.numeric(rank_data)
  title_data_html <- html_nodes(webpage, '.censo_list font')
  title_data <- html_text(title_data_html)
  author_data_html <- html_nodes(webpage, '.censo_list+ td font')
  author_data <- html_text(author_data_html)
  country_data_html <- html_nodes(webpage, '.censo_list~ td:nth-child(4) font')
  rcountry_data <- html_text(country_data_html)
  year_data_html <- html_nodes(webpage, 'tr+ tr td:nth-child(5) font')
  year_data <- html_text(year_data_html)
  type_data_html <- html_nodes(webpage, 'tr+ tr td:nth-child(6) font')
  type_data <- html_text(type_data_html)
}
censorship_df<-data.frame(Rank = rank_data, Title = title_data, Author = author_data, Country = rcountry_data, Type = type_data, Year = year_data)
write.table(censorship_df, file="sample.csv",sep=",",row.names=F)
Are you sure there's anything wrong with the loop? I would expect it to get the first page of results 40 times. Look at
webpage <- read_html(paste0(("http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page=, i"))
Shouldn't that be (the quotation mark moves so that it closes the string just before the comma, making i a second argument to paste0; the stray extra parenthesis after paste0 also has to go):
webpage <- read_html(paste0("http://search.beaconforfreedom.org/search/censored_publications/result.html?author=&cauthor=&title=&country=7327&language=&censored_year=&censortype=&published_year=&censorreason=&sort=t&page=", i))
What paste0 does in R is stitch strings together without any separator. But you only have one string, so it tries to fetch results for page=, i literally. You want it to fetch page=1 through page=40, so put the quotation mark like page=", i so that paste0 joins the URL and i together.
I'm not an R programmer, but that simply leaps out at me.
Source for paste0 behavior.
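A quick illustration of the difference (with a shortened, hypothetical URL, just to show the paste0 behavior):
paste0("...page=", 1)    # "...page=1"   -- i is pasted onto the end of the URL
paste0("...page=, i")    # "...page=, i" -- a single literal string; i is never evaluated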
I am trying to scrape a website using the following:
industryurl <- "https://finance.yahoo.com/industries"
library(rvest)
read <- read_html(industryurl) %>%
html_table()
library(plyr)
industries <- ldply(read, data.frame)
industries = industries[-1,]
read <- read_html(industryurl)
industryurls <- html_attr(html_nodes(read, "a"), "href")
links <- industryurls[grep("/industry/", industryurls)]
industryurl <- "https://finance.yahoo.com"
links <- paste0(industryurl, links)
links
##############################################################################################
store <- NULL
tbl <- NULL
for(i in links){
  store[[i]] = read_html(i)
  tbl[[i]] = html_table(store[[i]])
}
#################################################################################################
I am mostly interested in the code between the ########## markers. I want to apply a function instead of a for loop, since I am running into timeout issues with Yahoo, and I want to make the extraction more human-like (it is not too much data).
My question is: how can I take links, apply a function over it, and set a sort of delay timer so it reads in the contents the way the for loop does?
I can paste my own version of the for loop, which does not work.
This is the function I came up with:
##First argument is the link you need
##The second argument is the total time for Sys.sleep
extract_function <- function(define_link, define_time){
  print(paste0("The system will stop for: ", define_time, " seconds"))
  Sys.sleep(define_time)
  first <- read_html(define_link)
  print(paste0("It will now return the table for link: ", define_link))
  return(html_table(first))
}
##I added the following tryCatch function
link_try_catch <- function(define_link, define_time){
  out <- tryCatch(extract_function(define_link, define_time), error = function(e) NA)
  return(out)
}
##You can now retrieve the data using the links vector in two ways
##Picking the first ten, so it should not crash on link 5
p <- lapply(1:10, function(i) link_try_catch(links[i], 1))
##OR (I subset the vector just for demo purposes)
p2 <- lapply(links[1:10], function(i) extract_function(i, 1))
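One more idea for the "more human like" part: instead of a fixed delay, you could draw a random one per request (a sketch, reusing link_try_catch from above):
##Random delay between 1 and 3 seconds for each link
p3 <- lapply(links[1:10], function(l) link_try_catch(l, runif(1, 1, 3)))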
Hope it helps
I used the code below on one website and it returned a perfect result,
looking for the key word Emaar pasted at the end of the query:
library(httr)
library(jsonlite)
query<-"https://www.googleapis.com/customsearch/v1?key=AIzaSyA0KdZHRkAjmoxKL14eEXp2vnI4Yg_po38&cx=006431301429107149113:as7yqcm2qc8&q=Emaar"
result11 <- content(GET(query))
print(result11)
result11_JSON <- toJSON(result11)
result11_JSON <- fromJSON(result11_JSON)
result11_df <- as.data.frame(result11_JSON)
Now I want to apply the same process over a data.frame containing key words,
so I made the below testing .csv file:
Company Name
[1] ADES International Holding Ltd
[2] Emirates REIT (CEIC) Limited
[3] POLARCUS LIMITED
I called it Testing Website Extraction.csv.
Code used:
test_companies <- read.csv("... \\Testing Website Extraction.csv")
#removing space and adding "+" sign, then pasting the query before it (the query already has my unique google key and search engine ID)
test_companies$plus <- gsub(" ", "+", test_companies$Company.Name)
query <- "https://www.googleapis.com/customsearch/v1?key=AIzaSyCmD6FRaonSmZWrjwX6JJgYMfDSwlR1z0Y&cx=006431301429107149113:as7yqcm2qc8&q="
test_companies$plus <- paste0(query, test_companies$plus)
a <- test_companies$plus
length(a)
function_webs_search <- function(web_search) {content(GET(web_search))}
result <- lapply(as.character(a), function_webs_search)
The result here is a list of length 3 (one per search term); within each term there are sublists containing url (list[2]), queries (list[2]), ... items (list[10]), and these are the same for each search term (same lengths separately). My issue here is applying the remainder of the code.
#when I run:
result_JSON <- toJSON(result)
result_JSON <- as.list(fromJSON(result_JSON))
I get a list of 6 lists that have sublists,
and putting it into a tidy dataframe where the results are listed under each other (not separately) is proving to be difficult.
Also note that I tried taking the 3 separate lists inside "result" one by one, but that is a lot of manual labor if I have a longer list of keywords.
The expected end result should include 30 observations of 37 variables (for each search term, 10 observations of 37 variables, all underneath each other).
Things I have tried unsuccessfully:
These work to flatten the list:
#do.call(c , result)
#all.equal(listofvectors, res, check.attributes = FALSE)
#unlist(result, recursive = FALSE)
# for (i in 1:length(result)) {listofvectors <- c(listofvectors, result[[i]])}
#rbind()
#rbind.fill()
Even after flattening, I don't know how to organize them into a tidy final output for a non-R user to interact with.
Any help here would be greatly appreciated.
I am here in case anything is not clear about my question.
Always happy to learn more about R, so please bear with me as I am just starting to catch up.
All the best, and thanks in advance!
Basically, what I did is extract only the columns I need from the dataframe list; below is the final code:
library(httr)
library(jsonlite)
library(tidyr)
library(stringr)
library(purrr)
library(plyr)
test_companies <- read.csv("c:\\users\\... Companies Without Websites List.csv")
test_companies$plus <- gsub(" ", "+", test_companies$Company.Name)
query <- "https://www.googleapis.com/customsearch/v1?key=AIzaSyCmD6FRaonSmZWrjwX6JJgYMfDSwlR1z0Y&cx=006431301429107149113:as7yqcm2qc8&q="
test_companies$plus <- paste0(query, test_companies$plus)
a <- test_companies$plus
length(a)
function_webs_search <- function(web_search) {content(GET(web_search))}
result <- lapply(as.character(a), function_webs_search)
function_toJSONall <- function(all) {toJSON(all)}
a <- lapply(result, function_toJSONall)
function_fromJSONall <- function(all) {fromJSON(all)}
b <- lapply(a, function_fromJSONall)
function_dataframe <- function(all) {as.data.frame(all)}
c <- lapply(b, function_dataframe)
function_column <- function(all) {all[ ,15:30]}
result_final <- lapply(c, function_column)
results_df <- rbind.fill(result_final)
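For reference, the same pipeline can be collapsed with purrr (already loaded above); a sketch, assuming the same result list and the same column range:
dfs <- map(result, ~ as.data.frame(fromJSON(toJSON(.x)))[ ,15:30])
results_df <- rbind.fill(dfs)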
I'm attempting to create a loop in R that will use a vector of dates, run them through a loop that includes a SQL query, and then generate a separate dataframe for each output. Here is as far as I've gotten:
library(RODBC)
dvect <- as.Date("2015-04-13") + 0:2
d <- list()
for(i in list(dvect)){
queryData <- sqlQuery(myconn, paste("SELECT
WQ_hour,
sum(calls) as calls
FROM database
WHERE DDATE = '", i,"'
GROUP BY 1
", sep = ""))
d[i] <- rbind(d, queryData)
}
From what I can tell, the query portion of the code runs fine, since I've tested it by itself. Where I'm stumbling is the last line, where I try to save each pass through the query separately, labelled with the date that was used in that iteration.
I'd appreciate any help. I've only been using R consistently for about 2 months now so I'm definitely open to alternative ways of doing this that are cleaner and more efficient.
Thanks.
I'd suggest making the SQL query a function, and use lapply to apply it and return your result as a list.
userSQLquery = function(i) {
  sqlQuery(myconn, paste("SELECT
    WQ_hour,
    sum(calls) as calls
    FROM database
    WHERE DDATE = '", i, "'
    GROUP BY 1", sep = ""))
}
dvect = as.Date("2015-04-13") + 0:2
d = as.list(dvect)
names(d) = dvect
results = lapply(d, userSQLquery)
I have very little experience with SQL though, so this may not work. Maybe it could start you off?
Looks like a job for lapply (see the lapply documentation) instead of a for loop. (In R it's often good to avoid a for loop by using vectorization.)
If you want each date to return a separate data frame, and then have each data frame labelled with the original date, try:
dates <- c("Jan 1", "Oct 31", "Dec 25")
queryData <- function(date){
  #dummy data
  return(runif(5))
}
results <- lapply(dates, queryData)
names(results) <- dates
Either use (assuming the loop runs over indices, i.e. for(i in seq_along(dvect)) rather than for(i in list(dvect))):
d[[i]] <- queryData
if you want each data.frame (query result) as a separate element in the list output d.
Or use:
d <- rbind(d, queryData)
if you want a single data.frame with all the query outputs combined. In this case you should declare d as a data.frame (i.e. d <- data.frame()).
You can also store each data.frame (i.e. the query result) with its corresponding date in a list as:
d[[i]] <- list(date = dvect[[i]], queryResult = queryData)
I think the last one is what you are looking for.
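Putting it together, a sketch of the corrected loop using that last option (same query as in the question, with the loop running over indices):
d <- list()
for (i in seq_along(dvect)) {
  queryData <- sqlQuery(myconn, paste("SELECT
    WQ_hour,
    sum(calls) as calls
    FROM database
    WHERE DDATE = '", dvect[i], "'
    GROUP BY 1", sep = ""))
  d[[i]] <- list(date = dvect[[i]], queryResult = queryData)
}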
This code will be used to count the number of links in my tweets collection. The collection was gathered from 10 accounts. The question is: how could I loop through the ten accounts in one piece of code and drop the output into a table or graph? "Unames" represents the names of the accounts. Thanks in advance,
mydata <- read.csv("tweets.csv",sep=",", header=TRUE)
head(mydata)
dim(mydata)
colnames(mydata)
#tweets for each university
table(mydata$University)
Unames<- unique(mydata$University)
mystring <- function(Uname, string){
  mydata_temp <- subset(mydata, University==Uname)
  mymatch <- rep(NA, dim(mydata_temp)[1])
  for(i in 1:dim(mydata_temp)[1]){
    mymatch[i] <- length(grep(string, mydata_temp[i,2]))
  }
  return(mymatch)
}
#web link e.g. (Here I would like to see the total links for all universities in a table or graph. The below code is only giving me the output one by one!)
mylink <- mystring(Unames[1],"http://")
So my suspicions are wrong, and you do have a body of data for which this command produces the desired results (and you expect the same for all the other universities):
mylink <- mystring(Unames[1],"http://")
In that case, you should just do this:
links_list <- lapply(Unames, mystring, "http://")
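To get the totals for all universities in one table or graph (a sketch; note mystring returns one 0/1 match indicator per tweet, so the sums count tweets containing at least one link):
names(links_list) <- Unames
link_totals <- sapply(links_list, sum)
link_totals            # totals per university, as a table
barplot(link_totals)   # or as a graph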