Error in 1:maxResults : argument of length 0 in gdscrapeR

I am using gdscrapeR to extract Glassdoor reviews, but I cannot get past the very first call in gdscrapeR:
library(gdscrapeR)
df <- get_reviews(companyNum = "E40371")
Number of web pages to scrape:
StartingError in 1:maxResults : argument of length 0
I went to the creator's blog here, which shows a breakdown of the get_reviews function. I think the problem is occurring here:
totalReviews <- read_html(paste(baseurl, companyNum, sort, sep = "")) %>%
html_nodes(".tightVert.floatLt strong, .margRtSm.minor") %>%
html_text() %>%
sub("Found | reviews", "", .) %>%
sub(",", "", .) %>%
as.integer()
maxResults <- as.integer(ceiling(totalReviews/10)) #10 reviews per page, round up to whole number
I don't know what I have to do to fix this issue. I just want to extract Glassdoor Reviews. Please help!
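The error itself is informative: `1:maxResults` only fails with "argument of length 0" when maxResults is integer(0), which happens when the CSS selectors above match nothing (for example, because Glassdoor has changed its page markup). A minimal diagnostic sketch, using a hypothetical reviews URL for companyNum E40371 (substitute the exact URL you see in your browser):

library(rvest)

# Hypothetical reviews URL for companyNum E40371 -- replace with the URL from your browser
url <- "https://www.glassdoor.com/Reviews/Company-Reviews-E40371.htm"

page <- read_html(url)

# If this prints character(0), the selectors used inside get_reviews() no longer match
# Glassdoor's markup, so totalReviews becomes integer(0), maxResults has length 0,
# and 1:maxResults fails with "argument of length 0"
page %>%
  html_nodes(".tightVert.floatLt strong, .margRtSm.minor") %>%
  html_text()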

Related

Extracting a table with htmltab in R

I'm attempting to scrape the second table from
https://fbref.com/en/comps/9/passing/Premier-League-Stats
I have used
URLPL <- "https://fbref.com/en/comps/9/passing/Premier-League-Stats"
Tab <- htmltab(doc = URLPL, which = 2)
which returns
"Error: Couldn't find the table. Try passing (a different) information
to the which argument"
and also
URLPL <- "https://fbref.com/en/comps/9/passing/Premier-League-Stats"
Tab <- htmltab(doc = URLPL, which = "//table[2]")
which returns
"Error in Node[1] : subscript out of bounds"
There are 2 tables on the webpage. If anyone can point me in the right direction here, thanks.
Edit: I've now realised that there's only 1 table on the webpage, and what I thought was a table is not. Now I'm even more confused about where to go with this.
Answering my own question here, for anyone who may have the same problem: on any of the sports-reference websites (Hockey/Basketball/Baseball), anything other than the top table is stored inside HTML comments.
library(rvest)

PremLeague = "https://fbref.com/en/comps/12/stats/La-Liga-Stats"
Prem = PremLeague %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%  # the hidden tables live inside HTML comments
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%                        # re-parse the comment text as HTML
  html_node("#stats_standard") %>%
  html_table()
This worked for me.
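If you need the same trick for other tables on these sites, the pattern above can be wrapped in a small helper (hypothetical function name, same rvest calls as in the snippet):

library(rvest)

# Hypothetical helper: pull a table that sports-reference sites hide inside HTML comments
read_commented_table <- function(url, css_id) {
  url %>%
    read_html() %>%
    html_nodes(xpath = "//comment()") %>%  # grab every HTML comment
    html_text() %>%
    paste(collapse = "") %>%               # stitch them into one document
    read_html() %>%                        # re-parse the comment text as HTML
    html_node(css_id) %>%
    html_table()
}

# Usage, mirroring the answer above
standard_stats <- read_commented_table(
  "https://fbref.com/en/comps/12/stats/La-Liga-Stats",
  "#stats_standard"
)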

Map a tbl of hyperlinks into read_html

I have a tibble containing one column which stores a hyperlink in each row. Now I want to map over these links using map_dfr, passing the links one after another through read_html(.x[.x]) %>%
html_node(".body-copy-lg") %>% html_text. If I do so I always end up with the error:
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Expecting a single string value: [type=character; extent=3].
which tells me that read_html is basically saying: "Hey, stop throwing more than one string at me at the same time."
So did I make a mistake in the mapper? Is this a bug? I really can't see why the mapper function does not grab each element one after another.
What I tried so far:
target_regex <- "(xtm)|((k|K)(i|I|1|11)(d|D)(n|N).)|(Ar<e)\\s(you)\\s(in)|
(LOAN)|(AR(\\s|\\S)[0-9])|((B|b)(i|1|l)tc.)|(Coupon)|(Plastic.King)|(organs)|(SILI)|(Electric.Cigarette.Machine)"
adverts <- function(df) df[!grepl(target_regex, df$...1,perl = T), ]
bribe <- read_html(paste("http://ipaidabribe.com/reports/paid?page", 10, sep = "="))
report <- map(".read-more", ~html_nodes(bribe, .x) %>%
html_attr(.x[[1]][[1]][[1]], name = "href"))[[1]] %>%
as_tibble(.name_repair = "unique") %>%
bind_rows() %>%
rename( ...1 = value) %>%
adverts() %>%
map_dfr(~read_html(.x[.x]) %>%
html_node(".body-copy-lg") %>%
html_text)
Do not mind the call to rename(), which is basically just something that needed to be done to make adverts() usable in this case.
You're forgetting that most functions in R are vectorized, so using map or apply functions is often unnecessary. In your case, it is only needed in the final step of getting the html text.
The syntax you are using in map is also puzzling, and I think you should review ?map to get a better handle on it. For instance, you use multiple .x or extracted values where you should just be using .x to refer to the sub-element of the object you are iterating over.
library(tidyverse)
library(rvest)
target_regex <- "(xtm)|((k|K)(i|I|1|11)(d|D)(n|N).)|(Ar<e)\\s(you)\\s(in)|
(LOAN)|(AR(\\s|\\S)[0-9])|((B|b)(i|1|l)tc.)|(Coupon)|(Plastic.King)|(organs)|(SILI)|(Electric.Cigarette.Machine)"
adverts <- function(df) df[!grepl(target_regex, df$...1,perl = T), ]
bribe <- read_html(paste("http://ipaidabribe.com/reports/paid?page", 10, sep = "="))
report <- html_nodes(bribe, ".read-more") %>%
html_attr("href") %>%
as_tibble(.name_repair = "unique") %>%
filter(str_detect(value, target_regex, negate = TRUE)) %>%
mutate(text = map_chr(value, ~read_html(.x) %>%
html_node(".body-copy-lg") %>%
html_text))
report
# A tibble: 3 x 2
value text
<chr> <chr>
1 http://ipaidabribe.com/reports/paid/paid-bribe-to-settle-matter… "\r\n Place: Nelamangala Police Station, Bangalore\nDate of incident: 5th Jan 2020, 3PM…
2 http://ipaidabribe.com/reports/paid/paid-500-rs-bribe-at-nizamu… "\r\n My Brother Mahesh Prasad travelling on PNR number 4822171124 train no 12721 Ni…
3 http://ipaidabribe.com/reports/paid/drone-air-follow-focus-wire… "\r\n This new Silencer Air+ is a tremendously versatile and resourceful follow focus, z…
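The underlying point is that read_html() is not vectorised: it takes exactly one URL as a single string, so a vector of URLs has to be walked element by element. A tiny illustration, using placeholder URLs:

library(rvest)
library(purrr)

urls <- c("https://example.com", "https://example.org")

# read_html() expects a single string, so passing the whole vector reproduces the
# "Expecting a single string value" error:
# read_html(urls)

# Mapping feeds read_html() one URL at a time and returns one string of page text per URL
texts <- map_chr(urls, ~ read_html(.x) %>% html_node("body") %>% html_text())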

How do I add a loop when using R to scrape data?

I'm trying to create a database of crime data by zip code based on Trulia.com's data. I have the code below, but so far it only produces 1 line of data. In the code below, Zipcodes is just a list of US zip codes. Can anyone tell me what I need to add to make this run through my entire list "i"?
Here is a link to one of the Trulia pages for reference: https://www.trulia.com/real_estate/20004-Washington/crime/
UPDATE:
Here are zip codes for download: https://www.dropbox.com/s/uxukqpu0v88d7tf/Zip%20Code%20Database%20wo%20Boston.xlsx?dl=0
I also changed the code a bit this time after realizing the crime stats appear in different orders depending on the zip code. Is it possible to have the loop produce 4 lines per zip code? This currently works but only produces the last zip code in the dataset. I can't figure out how to make sure each zip code's data is recorded on separate lines, so the results don't overwrite each other and leave only one line for the last zip code.
Please help!!
library(rvest)
data=data.frame(Zipcodes)
for(i in data$Zip.Code)
{
site <- paste("https://www.trulia.com/real_estate/",i,"-Boston/crime/", sep="")
site <- html(site)
crime<- data.frame(zip =i,
type =site %>% html_nodes(".brs") %>% html_text() ,
stringsAsFactors=FALSE)
}
View(crime)
If that code doesn't work, try this:
data=data.frame(Zillow_Data_for_R_Test)
for(i in data$Zip.Code)
site <- paste("https://www.trulia.com/real_estate/",i,"-Boston/crime/", sep="")
site <- read_html(site)
crime<- data.frame(zip =i,
theft =site %>% html_nodes(".crime-text-0") %>% html_text() ,
assault =site %>% html_nodes(".crime-text-1") %>% html_text() ,
arrest =site %>% html_nodes(".crime-text-2") %>% html_text() ,
vandalism =site %>% html_nodes(".crime-text-3") %>% html_text() ,
robbery =site %>% html_nodes(".crime-text-4") %>% html_text() ,
type =site %>% html_nodes(".clearfix") %>% html_text() ,
stringsAsFactors=FALSE)
View(crime)
The comment of @r2evans already provides an answer. Since @ShanCham asked how to actually implement this, I wanted to provide some guidance with the following code, which is just more verbose than the comment and could therefore not be posted as an additional comment.
library(rvest)
#only two exemplary zipcodes, could be more, of course
zipcodes <- c("02110", "02125")
crime <- lapply(zipcodes, function(z) {
site <- read_html(paste0("https://www.trulia.com/real_estate/",z,"-Boston/crime/"))
#for illustrative purposes:
#introduced as.numeric to numeric columns
#excluded some of your other columns and shortened the current text in type
data.frame(zip = z,
theft = site %>% html_nodes(".crime-text-0") %>% html_text() %>% as.numeric(),
assault = site %>% html_nodes(".crime-text-1") %>% html_text() %>% as.numeric() ,
type = site %>% html_nodes(".clearfix") %>% html_text() %>% paste(collapse = " ") %>% substr(1, 50) ,
stringsAsFactors=FALSE)
})
class(crime)
#list
#Output are lists that can be bound together to one data.frame
crime <- do.call(rbind, crime)
#crime is a data.frame, hence, classes/types are kept
class(crime$type)
# [1] "character"
class(crime$assault)
# [1] "numeric"
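As a follow-up on the design choice: purrr's map_dfr() collapses the "loop over a list, then bind the pieces" pattern into a single step. A minimal equivalent sketch of the answer above, assuming the same Trulia selectors still resolve:

library(rvest)
library(purrr)

zipcodes <- c("02110", "02125")

# map_dfr() visits each zip code and row-binds the per-page data frames directly
crime <- map_dfr(zipcodes, function(z) {
  site <- read_html(paste0("https://www.trulia.com/real_estate/", z, "-Boston/crime/"))
  data.frame(
    zip     = z,
    theft   = site %>% html_nodes(".crime-text-0") %>% html_text() %>% as.numeric(),
    assault = site %>% html_nodes(".crime-text-1") %>% html_text() %>% as.numeric(),
    stringsAsFactors = FALSE
  )
})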

How to use rvest to web crawl correctly?

I am trying to crawl this page http://www.funda.nl/en/koop/leiden/ to get the maximum page number it can show, which is 29. I followed some online tutorials, located where 29 is in the HTML code, and wrote this R code:
url <- read_html("http://www.funda.nl/en/koop/leiden/")
url %>% html_nodes("#pagination-number.pagination-last") %>% html_attr("data-pagination-page") %>% as.numeric()
However, what I got is numeric(0). If I remove as.numeric(), I get character(0).
How is this done?
I believe that both your identification of the HTML node and your parsing of it are wrong. To easily find the right CSS selector, you can use a Chrome extension called SelectorGadget. In your case, it also requires some parsing, accomplished with the str_extract_all() function.
This will work:
url <- read_html("http://www.funda.nl/en/koop/leiden/")
pagination.last <- url %>%
html_node(".pagination-last") %>%
html_text() %>%
stringr::str_extract_all("[:number:]{1,2}", simplify = TRUE) %>%
as.numeric()
> pagination.last
[1] 29
You might find this other question helpful as well: R: Rvest - got hidden text i don't want
I've been dealing with the same issue and this worked for me:
> url = "http://www.funda.nl/en/koop/leiden/"
> last_page <-
+ last(read_html(url) %>%
+ html_nodes(css = ".pagination-pages") %>%
+ html_children()) %>%
+ html_text(trim = T) %>%
+ str_extract("[0-9]+") %>%
+ as.numeric()
> last_page
[1] 23

Extract data from multiple webpages from a website which reloads automatically in R

I have seen other posts which show how to extract data from multiple webpages.
But the problem is that for my website, when I scroll down to see how many pages the data is divided into, the page automatically refreshes and loads the next batch of data, making it impossible to identify the number of webpages. I don't have good enough knowledge of HTML and JavaScript to easily identify the attribute on which that method is being called, so I have identified another way to get the number of pages.
The website, when loaded in a browser, shows the number of records present; take that number and divide it by 30 (the number of records per page). For example, if the number of records is 90, then 90/30 = 3 pages.
Here is the code to get the number of records found on that page:
active_name_data1 <- html_nodes(webpage,'.active')
active1 <- html_text(active_name_data1)
as.numeric(gsub("[^\\d]+", "", word(active1[1],start = 1,end =1), perl=TRUE))
Another approach is to get the attribute for the number of pages, i.e.
url='http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'
webpage <- read_html(url)
active_data_html <- html_nodes(webpage,'a.act')
active <- html_text(active_data_html)
Here active gives me the page numbers, i.e. "1" " 2" " 3" " 4".
So here I'm unable to work out how to get the active page's data and then iterate over the other pages so as to get the entire data set.
Here is what I have tried (uuu_df2 is the data frame with the multiple links for which I want to crawl data):
library(rvest)
library(plyr)   # for llply / ldply
library(dplyr)  # for bind_rows
uuu_df2 <- data.frame(x = c('http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=5-Lacs',
'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs',
'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'))
urlList <- llply(uuu_df2[,1], function(url){
this_pg <- read_html(url)
results_count <- this_pg %>%
xml_find_first(".//span[#id='resultCount']") %>%
xml_text() %>%
as.integer()
if(!is.na(results_count) & (results_count > 0)){
cards <- this_pg %>%
xml_find_all('//div[@class="SRCard"]')
df <- ldply(cards, .fun=function(x){
y <- data.frame(wine = x %>% xml_find_first('.//span[@class="agentNameh"]') %>% xml_text(),
excerpt = x %>% xml_find_first('.//div[@class="postedOn"]') %>% xml_text(),
locality = x %>% xml_find_first('.//span[@class="localityFirst"]') %>% xml_text(),
society = x %>% xml_find_first('.//div[@class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .))
return(y)
})
} else {
df <- NULL
}
return(df)
}, .progress = 'text')
names(urlList) <- uuu_df2[,1]
a=bind_rows(urlList)
But this code just gives me the data from the active page and does not iterate through the other pages of the given link.
P.S.: If a link doesn't have any records, the code skips that link and moves on to the next link in the list.
Any suggestions on what changes should be made to the code would be helpful. Thanks in advance.
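One way forward, following the record-count idea described above: compute the number of pages from the result count and request each page explicitly. The sketch below reuses the resultCount and agentNameh selectors from the question, but the "&page=" query parameter is an assumption (a hypothetical pagination parameter — check the real one in the browser's network tab before relying on it):

library(rvest)
library(xml2)

base_url <- "http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs"

first_pg <- read_html(base_url)

# Number of records shown on the page, extracted as in the question
results_count <- first_pg %>%
  xml_find_first(".//span[@id='resultCount']") %>%
  xml_text() %>%
  as.integer()

# 30 records per page, rounded up
n_pages <- ceiling(results_count / 30)

# "&page=" is an assumed pagination parameter; verify the real one in the browser
all_pages <- lapply(seq_len(n_pages), function(p) {
  pg <- read_html(paste0(base_url, "&page=", p))
  data.frame(
    agent = pg %>% xml_find_all('.//span[@class="agentNameh"]') %>% xml_text(),
    stringsAsFactors = FALSE
  )
})

listings <- do.call(rbind, all_pages)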
