This is a follow-up to a prior thread. The code works great for a single value, but when I try to pass more than one value I get the following error about the length of the function's result:
Error in vapply(elements, encode, character(1)) :
values must be length 1,
but FUN(X[[1]]) result is length 3
Here is a sample of the code. In most instances I have been able to just name an object and scrape that way.
library(httr)
library(rvest)
library(dplyr)
b<-c('48127','48180','49504')
POST(
url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
body = list(zipcode = b),
encode = "form"
) -> res
I was wondering if a loop to insert the values into the form would be the right way to go? However, my loop-writing skills are still in development and I am unsure of where to place it; in addition, when I call the loop it doesn't print line by line, it just returns NULL results.
# d isn't listed in the above code as it returns NULL
d <- for (i in 1:3) { nrow(b) }
Here is an approach to send multiple POST requests
library(httr)
library(rvest)
b <- c('48127','48180','49504')
For each element in b, run a function that sends the appropriate POST request:
res <- lapply(b, function(x){
  res <- POST(
    url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
    body = list(zipcode = x),
    encode = "form"
  )
  res <- read_html(content(res, as="raw"))
})
Now for each element of the list res you should do the parsing steps explained by hrbrmstr: How can I Scrape a CGI-Bin with rvest and R?
library(tidyverse)
I will use hrbrmstr's code since he is king and it is already clear to you. The only thing we are doing here is running it on each element of the res list.
res_list <- lapply(res, function(x){
  rows <- html_nodes(x, "table[width='300'] > tr > td")
  ret <- data_frame(
    record = !is.na(html_attr(rows, "bgcolor")),
    text = html_text(rows, trim=TRUE)
  ) %>%
    mutate(record = cumsum(record)) %>%
    filter(text != "") %>%
    group_by(record) %>%
    summarise(x = paste0(text, collapse="|")) %>%
    separate(x, c("store", "address1", "city_state_zip", "phone_and_or_distance"),
             sep="\\|", extra="merge")
  return(ret)
})
or using map from purrr
res %>%
  map(function(x){
    rows <- html_nodes(x, "table[width='300'] > tr > td")
    data_frame(
      record = !is.na(html_attr(rows, "bgcolor")),
      text = html_text(rows, trim=TRUE)
    ) %>%
      mutate(record = cumsum(record)) %>%
      filter(text != "") %>%
      group_by(record) %>%
      summarise(x = paste0(text, collapse="|")) %>%
      separate(x, c("store", "address1", "city_state_zip", "phone_and_or_distance"),
               sep="\\|", extra="merge") -> ret
    return(ret)
  })
If you would like this in a data frame:
res_df <- data.frame(do.call(rbind, res_list),                 # rbinds list elements
                     b = rep(b, times = sapply(res_list, nrow)))  # repeats each zip code once per row it produced; note length() would count columns, not rows
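A tidier alternative, a sketch assuming res_list holds one tibble per zip code, is to name the list and let dplyr::bind_rows() record the names in an .id column:
names(res_list) <- b
res_df <- dplyr::bind_rows(res_list, .id = "b")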
You can put the values inside the POST request as below:
b <- c('48127','48180','49504')
for (i in seq_along(b)) {
  POST(
    url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
    body = list(zipcode = b[i]),
    encode = "form"
  ) -> res
  # YOUR CODE HERE (for getting the content of the page etc.)
}
But since the "res" value will be different for every zipcode value, you need to put the rest of the code inside the area I commented; otherwise you get only the last value.
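For example, a minimal sketch that pre-allocates a list and stores each parsed page inside the loop (reusing the read_html(content(...)) step from the lapply() answer above):
results <- vector("list", length(b))
for (i in seq_along(b)) {
  POST(
    url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
    body = list(zipcode = b[i]),
    encode = "form"
  ) -> res
  # store each parsed page so nothing is overwritten on the next pass
  results[[i]] <- read_html(content(res, as = "raw"))
}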
Related
I have been trying to understand how to use possibly() to wrap a lambda/anonymous function within map_dfr() so that my iterations continue should an error be encountered. I am currently iterating over a large number of webpages and using rvest to scrape them; however, some are not compiled correctly or do not work. I would simply like to note that error so that I can return to it at a later time while continuing to collect data from the remainder of the webpages. My current code is posted below, in addition to what I've tried:
df <- tibble(df, map_dfr(df$link, ~ {
# Replicate Human Input by Forcing Random Pauses
Sys.sleep(runif(1,1,3))
# Read in the html links
url <- .x %>% html_session(user_agent(user_agents)) %>% read_html()
# Full Job Description Text
description <- url %>%
html_elements(xpath = "//div[#id = 'jobDescriptionText']") %>%
html_text() %>% tolower()
description <- as.character(description)
# Hiring Insights
hiring_insights <- url %>%
html_elements(xpath = "//div[#id = 'hiringInsightsSectionRoot']") %>%
html_text() %>% str_extract("#REGEX") %>%
str_extract("#REGEX") %>%
str_trim()
hiring_insights <- as.character(hiring_insights)
### Extract Number of Hires
hiring_insights <- str_trim(str_extract(hiring_insights,"#REGEX"))
hiring_insights <- tolower(hiring_insights)
### Fill in all Missing Values with 1
hiring_insights[which(is.na(hiring_insights))] <- "1"
tibble(description, hiring_insights)
}))
I have tried wrapping the lambda function a few different ways but without success:
# First Attempt
df <- tibble(df, map_dfr(df$link, possibly(~ {——}, otherwise = "error")))
# Second Attempt
df <- tibble(df, map_dfr(df$link, possibly(function(x) {——}, otherwise = "error")))
# Third Attempt
df <- tibble(df, possibly(map_dfr(df$link, ~ {——}), otherwise = "error"))
# Fourth Attempt
df <- tibble(df, possibly(map_dfr(df$link, function(x) {——}), otherwise = "error"))
When writing the function with function(x) rather than with the ~, I update .x to x within the lambda function when defining the url variable. However, with each of these iterations I encountered a bad link and received an HTTP 403 error, which then stops the iteration and discards all of the data scraped for the previous variables. What I would like is to either have a dummy variable which notes whether or not the link was bad, and then, if it is bad, fill in the values for the scraped variables with NA or simply whatever the otherwise argument is set to. Thank you in advance! I've really hit a wall here.
map_dfr() expects a dataframe or named vector on every iteration. Your otherwise value isn’t named, so it throws an error. To illustrate:
library(purrr)
vals <- list(1, 2, "bad", 4, 5)
map_dfr(
vals,
possibly(
~ data.frame(x = .x^2),
otherwise = NA_real_
)
)
Error in `dplyr::bind_rows()`:
! Argument 3 must have names.
But if you change otherwise to return a dataframe:
map_dfr(
vals,
possibly(
~ data.frame(x = .x^2),
otherwise = data.frame(x = NA_real_)
)
)
x
1 1
2 4
3 NA
4 16
5 25
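Applied to your scraper, a minimal sketch, assuming the tibble(description, hiring_insights) built by your lambda: keep your extraction steps as the body and have otherwise return a one-row tibble with the same columns, so a bad link simply yields an NA row you can revisit later.
safe_scrape <- possibly(
  ~ {
    Sys.sleep(runif(1, 1, 3))
    url <- .x %>% html_session(user_agent(user_agents)) %>% read_html()
    # ... your description / hiring_insights extraction from above ...
    tibble(description, hiring_insights)
  },
  # one NA row per failed link, matching the columns of the success case
  otherwise = tibble(description = NA_character_, hiring_insights = NA_character_)
)
df <- tibble(df, map_dfr(df$link, safe_scrape))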
This question is a follow-up to an earlier problem on Stack Overflow.
I have these two example HTML files: url1.html; url2.html
url3.html is another example with more IPC codes.
In url2.html there is no field (51) information, while in url1.html there is.
I'm using this code in R:
library(rvest)
library(tidyverse)
library(stringr)
x <- data.frame(
  URL = c(1:2),
  page = c(paste(readLines("url1.html"), collapse="\n"),
           paste(readLines("url2.html"), collapse="\n"))
)
for (i in 1:nrow(x)){
  html <- x$page[i] %>% unclass() %>% unlist()
  read_html(html, encoding = "ISO-8859-1") %>%
    rvest::html_elements(xpath = '//*[@id="principal"]/table[2]') %>%
    html_nodes(xpath='//div[@id="classificacao0"]') %>%
    html_text(trim=T) %>%
    str_replace_all(.,"[\\n\\r\\t]+", "") %>%
    stringr::str_trim() -> tmp
  if(length(tmp) == 0) tmp <- "ND"
  x$ipc_0[i] <- tmp %>% str_replace_all(.,"\\s+", " ") %>% str_replace_all(.," \\)", "\\)")
}
for (i in 1:nrow(x)){  # the original had nrow(htm_temp); x is the data frame built above
  html <- x$page[i] %>% unclass() %>% unlist()
  read_html(html, encoding = "ISO-8859-1") %>%
    rvest::html_elements(xpath = '//*[@id="principal"]/table[2]') %>%
    html_nodes(xpath='//div[@id="classificacao1"]') %>%
    html_text(trim=T) %>%
    str_replace_all(.,"[\\n\\r\\t]+", "") %>%
    stringr::str_trim() -> tmp
  if(length(tmp) == 0) tmp <- "ND"
  x$ipc_1[i] <- tmp %>% str_replace_all(.,"\\s+", " ") %>% str_replace_all(.," \\)", "\\)")
}
Result: partially correct
Desired result: create a new data frame with the following structure.
URL  IPC
1    B62B 1/16 (1968.09)...
1    B62B 1/00 (1968.09)...
2    ND
Problem: There are URLs that have the code (51) and others that do not. When a page has the code (51), the structure can contain "n" ids of the form xpath='//div[@id="classificacao0"]'; the classification id can take values from 0 to "n". How can I optimize this code to capture the necessary information without having to write a separate for (variable in vector) loop for each "n"?
Any idea how to solve this problem?
You can use a CSS attribute = value selector list with the ^ (starts with) operator to capture/exclude elements with specific id values.
Convert your current extraction code into a function which accepts (in this case) a URL as its argument. Extend the regex to remove the other characters not shown in your desired output.
Have that function return a tibble of the URL and the IPCs found; wrap the whole thing in a map_dfr() call to generate a single data frame result.
library(rvest)
library(tidyverse)
urls <- sprintf("https://prequest.websiteseguro.com/example/url%i.html", 1:3)
get_ipc <- function(url) {
  ipc <- read_html(url, encoding = "ISO-8859-1") %>%
    html_elements("div[id^=classificacao]:not([id^=classificacaoc]) .normal > b") %>%
    html_text(trim = T) %>%
    str_replace_all(., "[\\n\\r\\t]+|\\(|\\s{2,}|\\)", "") %>%
    stringr::str_trim()
  if(length(ipc) == 0) ipc <- "ND"
  return(tibble(url = url, ipc))
}
df <- purrr::map_dfr(urls, get_ipc)
print(df)
I have written a script which uses a list of URLs as input and then scrapes certain information from the websites. I have done this with a for loop, but the processing time is already very long, and I expect the list to grow over time, so I wanted to re-code my script in a more efficient way. My idea was to eliminate the for loop and use pipe operators to reduce the processing time. My original (working) code is as follows:
imo <- c()
mmsi <- c()
for(i in 1:nrow(data)){
  url <- sprintf("https://www.marinevesseltraffic.com/vessels?vessel=%s&flag=&page=1&sort=lenght&direction=desc", data$NAME[i])
  page <- read_html(url)
  CSSextract1 <- html_nodes(page, '.td_imo')
  CSSextract2 <- html_nodes(page, '.td_mmsi')
  imos <- html_text(CSSextract1)[2]
  imo[i] <- imos
  mmsis <- html_text(CSSextract2)[2]
  mmsi[i] <- mmsis
}
data$IMO <- gsub("[\r \n \t]", "", imo)
data$MMSI <- gsub("[\r \n \t]", "", mmsi)
data$NAME <- gsub("\\+", " ", data$NAME)
I have re-written the code, trying to eliminate the for loop, as follows:
CSSex1 <- function(page){
  CSSextract <- html_nodes(page, '.td_imo')
  return(CSSextract)
}
data$url <- sprintf("https://www.marinevesseltraffic.com/vessels?vessel=%s&flag=&page=1&sort=lenght&direction=desc",data$NAME)
data$mmsi <- data$url %>% read_html() %>% CSSex1() %>% html_text()[2]
However it gives me the error;
Error: `x` must be a string of length 1
I assume that, the way I coded it, the list (data$url) as a whole is now taken as input, so my question is:
Is it possible, and if yes how, to take each element from data$url as input without using a (for) loop?
You may wish to set up url as a column of a data frame (data) to try:
mmsi_func <- function(x) {
  z <- x %>%
    read_html() %>%
    CSSex1() %>%
    html_text()
  z[2]
}
data <- data %>%
  rowwise() %>%
  dplyr::mutate(mmsi = mmsi_func(url))
or something along those lines. I am not sure what the expected output is supposed to look like, but if it is a list rather than a vector, you can use this minor adjustment for a list column in the dataframe:
mmsi_func <- function(x) {
  z <- x %>%
    read_html() %>%
    CSSex1() %>%
    html_text()
  z[2]
}
data <- data %>%
  rowwise() %>%
  dplyr::mutate(mmsi = list(mmsi_func(url)))
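If you would rather avoid rowwise(), a sketch of the same idea with purrr, assuming mmsi_func() as defined above:
library(purrr)
# applies mmsi_func to each URL and enforces a length-one character result
data$mmsi <- map_chr(data$url, mmsi_func)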
I have a dataframe with 7 rows and 1 column, which contains links to a website. I'm trying to extract data from those various links and store it in a data frame, but I am not able to append it. I'm also checking whether a link has no records (which I check through an html attribute of that link); if so, I skip that link and proceed to the next one. I'm also trying to fetch data for multiple pages of each link.
This is reproducible data
text1="http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom="
text3="&proptype="
text4="Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment"
text5="&cityName=Thane&BudgetMin="
text6="&BudgetMax="
bhk=c("1","2","3","4","5",">5")
budg_min=c("5-Lacs","10-Lacs","20-Lacs","30-Lacs","40-Lacs","50-Lacs","60-Lacs","70-Lacs","80-Lacs","90-Lacs","1-Crores","1.2-Crores","1.4-Crores","1.6-Crores","1.8-Crores","2-Crores","2.3-Crores","2.6-Crores","3-Crores","3.5-Crores","4-Crores","4.5-Crores","5-Crores","10-Crores","20-Crores")
budg_max=c("5-Lacs","10-Lacs","20-Lacs","30-Lacs","40-Lacs","50-Lacs","60-Lacs","70-Lacs","80-Lacs","90-Lacs","1-Crores","1.2-Crores","1.4-Crores","1.6-Crores","1.8-Crores","2-Crores","2.3-Crores","2.6-Crores","3-Crores","3.5-Crores","4-Crores","4.5-Crores","5-Crores","10-Crores","20-Crores")
eg <- expand.grid(bhk = bhk, budg_min = budg_min, budg_max = budg_max)
eg <- eg[as.integer(eg$budg_min) <= as.integer(eg$budg_max),]
uuu <- sprintf("%s%s%s%s%s%s%s%s", text1,eg[,1],text3,text4,text5,eg[,2],text6,eg[,3])
uuu_df1=data.frame(x=uuu[1:7,])
dput(uuu_df1)
I have 3 solutions for this but none seems to be working fine.
Solution #1
urlList <- llply(uuu_df1[,1], function(url){
  this_pg <- read_html(url)
  results_count <- this_pg %>%
    xml_find_first(".//span[@id='resultCount']") %>%
    xml_text() %>%
    as.integer()
  if(results_count > 0){
    cards <- this_pg %>%
      xml_find_all('//div[@class="SRCard"]')
    df <- ldply(cards, .fun=function(x){
      y <- data.frame(wine = x %>% xml_find_first('.//span[@class="agentNameh"]') %>% xml_text(),
                      excerpt = x %>% xml_find_first('.//div[@class="postedOn"]') %>% xml_text(),
                      locality = x %>% xml_find_first('.//span[@class="localityFirst"]') %>% xml_text(),
                      society = x %>% xml_find_first('.//div[@class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .))
      return(y)
    })
  } else {
    df <- NULL
  }
  return(df)
}, .progress = 'text')
names(urlList) <- uuu_df1[,1]
a <- bind_rows(urlList)
The above code gives me the error: Error in if (results_count > 0) { : missing value where TRUE/FALSE needed
Solution #2
urlList <- lapply(uuu_df1[,1], function(url){
  UrlPage <- html(as.character(url))
  ImgNode <- UrlPage %>% html_node("div.noResultHead")
  u <- paste("No", word(string = as(ImgNode, "character"), start=4, end=5), sep=" ")
  cat(".")
  pg <- read_html(url)
  if(u != "No Results Found!") {
    df <- data.frame(wine=html_text(html_nodes(pg, ".agentNameh")),
                     excerpt=html_text(html_nodes(pg, ".postedOn")),
                     locality=html_text(html_nodes(pg, ".localityFirst")),
                     society=html_text(html_nodes(pg, '.labValu .stop-propagation:nth-child(1)')),
                     stringsAsFactors=FALSE)
  } else {
    # ASSIGN EMPTY DATAFRAME (FOR CONSISTENT STRUCTURE)
    df <- data.frame(wine=character(), excerpt=character(), locality=character(), society=character())
  }
  # RETURN NAMED LIST
  return(list(UrlPage=UrlPage, ImgNode=ImgNode, u=u, df=df))
})
# ROW BIND ONLY DATAFRAME ELEMENT FROM LIST
wines <- map_df(urlList, function(u) u$df)
The above code gives an empty data frame.
Solution #3
uuu_df1 <- data.frame(x=uuu_df[1:7,])
wines <- data.frame()
url_test <- c()
UrlPage_test <- c()
u <- c()
ImgNode <- c()
pg <- c()
for(i in 1:dim(uuu_df1)[1]) {
  url_test[i] <- as.character(uuu_df1[i,])
  UrlPage_test[i] <- html(url_test[i])
  ImgNode[i] <- UrlPage_test[i] %>% html_node("div.noResultHead")
  u[i] <- ImgNode[i]
  u[i] <- as(u[i], "character")
  u[i] <- paste("No", word(string = u, start = 4, end = 5), sep = " ")
  if(u[i] == "No Results Found!") next
  {
    map_df(1:5, function(i) # here 1:5 is the number of webpages of a website
    {
      # simple but effective progress indicator
      cat(".")
      pg[i] <- read_html(sprintf(url_test[i], i))
      data.frame(wine=html_text(html_nodes(pg[i], ".agentNameh")),
                 excerpt=html_text(html_nodes(pg[i], ".postedOn")),
                 locality=html_text(html_nodes(pg[i], ".localityFirst")),
                 society=html_text(html_nodes(pg[i], '.labValu .stop-propagation:nth-child(1)')),
                 stringsAsFactors=FALSE)
    }) -> wines
  }
}
Above code also gives an error
Error in UseMethod("xml_find_first") :
no applicable method for 'xml_find_first' applied to an object of class "list"
In addition: Warning messages:
1: 'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")
2: In UrlPage_test[i] <- html(url_test[i]) :
number of items to replace is not a multiple of replacement length
Any suggestions on which code can be corrected so that my requirement is met? Thanks in advance.
Solution #1
That missing value where TRUE/FALSE needed is printed when you do something like this:
if (NA > 0) {
  # do something
}
So replace your if condition
if(results_count > 0)
with
(!is.na(results_count) & (results_count > 0))
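In context, a minimal sketch of the patched check (using && rather than the & above, so the comparison short-circuits and results_count > 0 is never evaluated when the count is NA):
if (!is.na(results_count) && results_count > 0) {
  # parse the result cards as before
} else {
  df <- NULL
}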
I've got a set of functions that I'm trying to work with and I'm struggling to figure out why the assignment isn't working. Here are the functions I'm using:
new_timeline <- function() {
  timeline <- structure(list(), class="timeline")
  timeline$title <- list("text" = list("headline" = NULL, "text" = NULL),
                         "start_date" = list("year" = NULL, "month" = NULL, "day" = NULL),
                         "end_date" = list("year" = NULL, "month" = NULL, "day" = NULL))
  return(timeline)
}
.add_date <- function(self, date, time_type) {
  valid_date <- stringr::str_detect(date, "^[0-9]{4}(-[0-9]{1,2}){0,2}$")
  if (!valid_date) {
    stringr::str_interp("Your ${time_type} date does not appear to be formatted correctly. It must be of the form 'yyyy-mm-dd'. Only the year is required.") %>% stop()
  }
  date_elements <- date %>% as.character() %>% stringr::str_split(" ") %>% unlist()
  date <- date_elements[1] %>% stringr::str_split("-") %>% unlist()
  stringr::str_interp("self$title$${time_type}_date$year <- date[1]") %>% parse(text = .) %>% eval()
  if (!is.na(date[2])) stringr::str_interp("self$title$${time_type}_date$month <- date[2]") %>% parse(text = .) %>% eval()
  if (!is.na(date[3])) stringr::str_interp("self$title$${time_type}_date$day <- date[3]") %>% parse(text = .) %>% eval()
  return(self)
}
edit_title <- function(self, headline = NULL, text = NULL, start_date = NULL, end_date = NULL) {
  if (class(self) != "timeline") stop("The object passed must be a timeline object.")
  if (is.null(headline) && is.null(self$title$text$headline)) stop("Headline cannot be empty when adding a new title.")
  if (!is.null(headline)) self$title$text$headline <- headline
  if (!is.null(text)) self$title$text$text <- text
  if (!is.null(start_date)) self <- .add_date(self, date = start_date, time_type = "start")
  if (!is.null(end_date)) self <- .add_date(self, date = end_date, time_type = "end")
  return(self)
}
EDIT: The above code has been severely reduced per a request in the comments. The code is still sufficient to reproduce the error.
I know that's a bit long-winded, so I apologize. The first function establishes a new timeline object. The third function allows us to change the title of the timeline object and the second function is a helper function that handles dates. The code would be used like this:
library(magrittr)
#devtools::install_github("hadley/stringr")
library(stringr)
tl <- new_timeline()
tl <- tl %>% edit_title(headline = "My Timeline", text = "Example", start_date = "2015-10-18")
The code runs with no errors, but when I call tl$title$start_date$year, it comes back as NULL. Using an answer I got in this previous question I asked, I tried to set envir = globalenv() within the eval function. When I do that, the function returns an error saying that object self cannot be found.
So I'm under the impression that self is held in the parent.frame(). So I add both of these to a list: envir = list(globalenv(), parent.frame()). This causes the function to run without error, but there's still no assignment.
Where am I going wrong? Thanks in advance!
As mentioned in the comments, I think you could probably do away with all of the code parsing and just pass variables in [[ for your assignments. Anyway, when you use the pipe operator a bunch of function wrapping happens so determining how many frames to go back is painful. Here are a couple solutions modifying the .add_date function.
You already found one, using <<-, since it searches back through the parent environments until it finds the variable (or doesn't, and assigns it in the global environment).
Another would be just storing the function environment() and passing that to eval.
A third would be counting how many frames deep you go, and using sys.frame to tell eval which environment to look in.
.add_date <- function(self, date, time_type) {
valid_date <- stringr::str_detect(date, "^[0-9]{4}(-[0-9]{1,2}){0,2}$")
if (!valid_date) {
stringr::str_interp("Your ${time_type} date does not appear to be formatted correctly. It must be of the form 'yyyy-mm-dd'. Only the year is required.") %>% stop()
}
## Examining environments
e <- environment() # current env
efirst <- sys.nframe() # frame number
print(paste("Currently in frame", efirst))
envs <- stringr::str_interp("${date}") %>% parse(text=.) %>% {.; sys.frames()} # list of frames
elast <- stringr::str_interp("${date}") %>% parse(text=.) %>% {.; sys.nframe()} # number of last
print(paste("Went", elast, "frames deep."))
## Go back this many frames in eval
goback <- efirst-elast
date_elements <- date %>% as.character() %>% stringr::str_split(" ") %>% unlist()
date <- date_elements[1] %>% stringr::str_split("-") %>% unlist()
## Solution 1: use sys.frame
stringr::str_interp("self$title$${time_type}_date$year <- date[1]") %>%
parse(text = .) %>% eval(envir=sys.frame(goback))
## Solution 2: use environment defined in function
if (!is.na(date[2])) stringr::str_interp("self$title$${time_type}_date$month <- date[2]") %>%
parse(text = .) %>% eval(envir=e)
return(self)
}
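And for completeness, a sketch of the [[ approach mentioned at the top, assuming the timeline structure from new_timeline(); it avoids parse()/eval() and the environment bookkeeping entirely:
.add_date <- function(self, date, time_type) {
  valid_date <- stringr::str_detect(date, "^[0-9]{4}(-[0-9]{1,2}){0,2}$")
  if (!valid_date) {
    stop(stringr::str_interp("Your ${time_type} date does not appear to be formatted correctly. It must be of the form 'yyyy-mm-dd'. Only the year is required."))
  }
  date <- date %>% as.character() %>% stringr::str_split("-") %>% unlist()
  slot <- paste0(time_type, "_date")   # "start_date" or "end_date"
  self$title[[slot]]$year <- date[1]   # ordinary list assignment; no eval() required
  if (!is.na(date[2])) self$title[[slot]]$month <- date[2]
  if (!is.na(date[3])) self$title[[slot]]$day <- date[3]
  return(self)
}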