I'm a beginner in R. I have a code that scrapes some data from a website (Rvest package) and then stores it as text in a matrix(which I'll later combine with other matrices to form a data frame).
To populate the matrix I'm using a for loop. The code had been working fine, but when I opened it today, it started filling in only the last iteration, and I don't get what the problem is:
all = matrix(nrow = length(urls))
for (i in length(urls)) {
p = read_html (urls[i]) %>%
html_nodes("p") %>%
html_text() %>%
as.character();
all[i,] = paste(p, collapse="_")
Any help would be much appreciated! Thanks (:
Related
I am scraping NBA play by play data using the play_by_play function in the nbastatR package. The problem is this function only collects data for 1 game ID at a time, and there are 1230 game IDs in a complete season. When I enter more than 15 game ID's in the play_by_play function, R just keeps loading and showing the wheel of death forever.
I tried to get around this by making a for loop which binds each game id to one cumulative dataframe. However, I run into the same error where R will endlessly load around the 16'th game- very peculiar. I could clean the data in the loop and try that out (I do not need all play by play data, just every shot from the season), but does anyone know why this is happening and how/if I could get around this?
Thanks
full<- play_by_play(game_ids = 21400001, nest_data = F, return_message = T)
for(i in 21400002:21400040){
data <- play_by_play(game_ids = c(i), nest_data = F, return_message = F)
full <- bind_rows(full,data)
cat(i)
}
This code will stop working at around the the 16th game ID. I tried using bind_rows from dplyr but that did not help at all.
Try this [untested code as I don't have your data]:
full <- lapply(
21400001:21400040,
function(i) {
play_by_play(game_ids = c(i), nest_data = F, return_message = F)
}
) %>%
bind_rows()
You can get more information on lazy evaluation here.
My goal is to get the weather data from one of the websites. I (with a little help of kind stack users, thank you) already created the vector consisting list of 1440 links and decided to try and use the 'for' loop to iterate over them.
Additionaly, every page has weather for each week so it's 7 rows of data (one for each day) which i have to obtain, which are marked as num0/num1/num2/num3.
That's what I came up with:
Links <- #here are the 1440 links i need to iterate over
library("rvest")
for (index in seq(from=1, to=length(Links), by=1)) {
link = paste(Links[index])
for (num in 0:7) {
node_date <-paste(".num",num," .date",sep="")
node_conditions<-paste(".num",num," .cond span",sep="")
#here I tried to create an 'embeded for loop' to iterate 7 times over various nodes consisting data
page = read_html(link)
DayOfWeek = page %>% html_nodes(node_date) %>% html_text()
Conditions = page %>% html_nodes(node_conditions) %>% html_text()
}
}
For now I get an error
error in command 'open.connection(x, "rb")':HTTP error 502
and I'm really quite confused what should I do now.
Are there other ways to accomplish this goal? Or maybe I'm making some rookie mistakes in here?
Thank you in advance!
I'm building a web scraper for some News websites in Switzerland. After some trial & error and a lot of help from StackOverflow (thx everyone!), I've gotten to a point where I can get text data from all articles.
#packages instalieren
install.packages("rvest")
install.packages("tidyverse")
install.packages("dplyr")
library(rvest)
library(stringr)
#seite einlesen
apisrf<- read_xml('https://www.srf.ch/news/bnf/rss/1646')
urls_srf <- apisrf %>% html_nodes('link') %>% html_text()
zeit_srf <- apisrf %>% html_nodes('pubDate') %>% html_text()
#data.frame basteln
dfsrf_titel_text <- data.frame(Text = character())
#scrape
for(i in 1:length(urls_srf)) {
link <- urls_srf[i]
artikel <- read_html(link)
#Informationen entnehmen
textsrf<- artikel %>% html_nodes('p') %>% html_text()
#In Dataframe strukturieren
dfsrf_text <- data.frame(Text = textsrf)
dfsrf_titel_text <- rbind(dfsrf_titel_text, cbind(dfsrf_text))
}
running this gives me dfsrf_titel_text. (I'm going to combine it with the titles of the articles at some point but let that be my problem.)
however, now my data is pretty untidy and I can't really figure out how to clean it in a way so it works for me. Especially annoying is that the texts from the different articles are not really structured in that way but get a new line whenever there is a paragraph in the texts. I can't combine the paragraphs because all the texts have different lengths. (The first article, starting at point 3, is super long because it's a live ticker covering the corona crisis so don't get confused if you run my code.)
how can I get R to create a new row in my dataframe only if the text is from a new article (meaning from a new URL?
thx for your help!
can you provide a sample of your data? you can use the strsplit(string, pattern) function where the pattern you specify is something that only happens between articles. Perhaps the URL?
strsplit(dfsrf_text,"www.\\w+.ch")
That will split your text anytime a URL in the .ch domain is found. you can use this regular expression cheat sheet to help you identify the pattern that seperates your articles.
You should correct this while creating dataframe itself. Here I am binding this all the data for each article together using paste0 adding new line character between them (\n\n).
library(rvest)
for(i in 1:length(urls_srf)) {
link <- urls_srf[i]
artikel <- read_html(link)
#Informationen entnehmen
textsrf<- paste0(artikel %>% html_nodes('p') %>% html_text(), collapse = "\n\n")
#In Dataframe strukturieren
dfsrf_text <- data.frame(Text = textsrf)
dfsrf_titel_text <- rbind(dfsrf_titel_text, cbind(dfsrf_text))
}
However, growing data in a loop is highly inefficient and can slow the process terribly especially when you have large data to scrape like this. Try using sapply.
dfsrf_titel_text <- data.frame(text = sapply(urls_srf, function(x) {
paste0(read_html(x) %>% html_nodes('p') %>% html_text(), collapse = "\n\n")
}))
So this will give you number of rows same as length of urls_srf .
I am working on a web scraping project, which aims to extract Google + reviews from a set of children's hospitals. My methodology is as follows:
1) Define a list of Google + urls to navigate to for review scraping. The urls are in a dataframe along with other variables defining the hospital.
2) Scrape reviews, number of stars, and post time for all reviews related to a given url.
3) Save these elements in a dataframe, and name the dataframe after another variable in the dataframe corresponding to the url.
4) Move on to the next url ... and so on till all urls are scraped.
Currently, the code is able to scrape from a single url. I have tried to create a function using map from the purrr package. However it doesn't seem to be working, I am doing something wrong.
Here is my attempt, with comments on the purpose of each step
#Load the necessary libraries
devtools::install_github("ropensci/RSelenium")
library(purrr)
library(dplyr)
library(stringr)
library(rvest)
library(xml2)
library(RSelenium)
#To avoid any SSL error messages
library(httr)
set_config( config( ssl_verifypeer = 0L ) )
Defining the URL dataframe
#Now to define the dataframe with the urls
urls_df =data.frame(Name=c("CHKD","AIDHC")
,ID=c("AAWZ12","AAWZ13")
,GooglePlus_URL=c("https://www.google.co.uk/search?ei=fJUKW9DcJuqSgAbPsZ3gDQ&q=Childrens+Hospital+of+the+Kings+Daughter+&oq=Childrens+Hospital+of+the+Kings+Daughter+&gs_l=psy-ab.3..0i13k1j0i22i10i30k1j0i22i30k1l7.8445.8445.0.9118.1.1.0.0.0.0.144.144.0j1.1.0....0...1c.1.64.psy-ab..0.1.143....0.qDMr7IDA-uA#lrd=0x89ba9869b87f1a69:0x384861b1e3a4efd3,1,,,",
"https://www.google.co.uk/search?q=Alfred+I+DuPont+Hospital+for+Children&oq=Alfred+I+DuPont+Hospital+for+Children&aqs=chrome..69i57.341j0j8&sourceid=chrome&ie=UTF-8#lrd=0x89c6fce9425c92bd:0x80e502f2175fb19c,1,,,"
))
Creating the function
extract_google_review=function(googleplus_urls) {
#Opens a Chrome session
rmDr=rsDriver(browser = "chrome",check = F)
myclient= rmDr$client
#Creates a sub-dataframe for the filtered hospital, which I will later use to name the dataframe
urls_df_sub=urls_df %>% filter(GooglePlus_URL %in% googleplus_urls)
#Navigate to the url
myclient$navigate(googleplus_urls)
#click on the snippet to switch focus----------
webEle <- myclient$findElement(using = "css",value = ".review-snippet")
webEle$clickElement()
# Save page source
pagesource= myclient$getPageSource()[[1]]
#simulate scroll down for several times-------------
count=read_html(pagesource) %>%
html_nodes(".p13zmc") %>%
html_text()
#Stores the number of reviews for the url, so we know how many times to scroll down
scroll_down_times=count %>%
str_sub(1,nchar(count)-5) %>%
as.numeric()
for(i in 1 :scroll_down_times){
webEle$sendKeysToActiveElement(sendKeys = list(key="page_down"))
#the content needs time to load,wait 1.2 second every 5 scroll downs
if(i%%5==0){
Sys.sleep(1.2)
}
}
#loop and simulate clicking on all "click on more" elements-------------
webEles <- myclient$findElements(using = "css",value = ".review-more-link")
for(webEle in webEles){
tryCatch(webEle$clickElement(),error=function(e){print(e)})
}
pagesource= myclient$getPageSource()[[1]]
#this should get the full review, including translation and original text
reviews=read_html(pagesource) %>%
html_nodes(".review-full-text") %>%
html_text()
#number of stars
stars <- read_html(pagesource) %>%
html_node(".review-dialog-list") %>%
html_nodes("g-review-stars > span") %>%
html_attr("aria-label")
#time posted
post_time <- read_html(pagesource) %>%
html_node(".review-dialog-list") %>%
html_nodes(".dehysf") %>%
html_text()
#Consolidating everything into a dataframe
reviews=head(reviews,min(length(reviews),length(stars),length(post_time)))
stars=head(stars,min(length(reviews),length(stars),length(post_time)))
post_time=head(post_time,min(length(reviews),length(stars),length(post_time)))
reviews_df=data.frame(review=reviews,rating=stars,time=post_time)
#Assign the dataframe a name based on the value in column 'Name' of the dataframe urls_df, defined above
df_name <- tolower(urls_df_sub$Name)
if(exists(df_name)) {
assign(df_name, unique(rbind(get(df_name), reviews_df)))
} else {
assign(df_name, reviews_df)
}
} #End function
Feeding the urls into the function
#Now that the function is defined, it is time to create a vector of urls and feed this vector into the function
googleplus_urls=urls_df$GooglePlus_URL
googleplus_urls %>% map(extract_google_review)
There seems to be an error in the function ,which is preventing it from scraping and storing the data into separate dataframes like intended.
My Intended Output
2 dataframes, each with 3 columns
Any pointers on how this can be improved will be greatly appreciated.
I have some code that scrapes data off this link (http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280) and runs some calculations.
What I want to do is cycle through every team and collect and run the manipulations on every team. I have a dataframe with every team link, like the one above.
Psuedo code:
for (link in teamlist)
{scrape, manipulate, put into a table}
However, I can't figure out how to run loop through the links.
I've tried doing URL = teamlist$link[i], but I get an error when using readhtmltable(). I have no trouble manually pasting each team individual URL into the script, just only when trying to pull it from a table.
Current code:
library(XML)
library(gsubfn)
URL= 'http://stats.ncaa.org/team/stats?org_id=575&sport_year_ctl_id=12280'
tx<- readLines(URL)
tx2<-gsub("</tbody>","",tx)
tx2<-gsub("<tfoot>","",tx2)
tx2<-gsub("</tfoot>","</tbody>",tx2)
Player_Stats = readHTMLTable(tx2,asText=TRUE, header = T, which = 2,stringsAsFactors = F)
Thanks.
I agree with #ialm that you should check out the rvest package, which makes it very fun and straightforward to loop through links. I will create some example code here using similar subject matter for you to check out.
Here I am generating a list of links that I will iterate through
rm(list=ls())
library(rvest)
mainweb="http://www.basketball-reference.com/"
urls=html("http://www.basketball-reference.com/teams") %>%
html_nodes("#active a") %>%
html_attrs()
Now that the list of links is complete I iterate through each link and pull a table from each
teamdata=c()
j=1
for(i in urls){
bball <- html(paste(mainweb, i, sep=""))
teamdata[j]= bball %>%
html_nodes(paste0("#", gsub("/teams/([A-Z]+)/$","\\1", urls[j], perl=TRUE))) %>%
html_table()
j=j+1
}
Please see the code below, which basically builds off your code and loops through two different team pages as identified by the vector team_codes. The tables are returned in a list where each list element corresponds to a team's table. However, the tables look like they will need more cleaning.
library(XML)
library(gsubfn)
Player_Stats <- list()
j <- 1
team_codes <- c(575, 580)
for(code in team_codes) {
URL <- paste0('http://stats.ncaa.org/team/stats?org_id=', code, '&sport_year_ctl_id=12280')
tx<- readLines(URL)
tx2<-gsub("</tbody>","",tx)
tx2<-gsub("<tfoot>","",tx2)
tx2<-gsub("</tfoot>","</tbody>",tx2)
Player_Stats[[j]] = readHTMLTable(tx2,asText=TRUE, header = T, which = 2,stringsAsFactors = F)
j <- j + 1
}