I am following up on my question from yesterday, "harvesting data via drop down list in R".
First, I need to obtain all 50k strings of details for all doctors from this page: http://www.lkcr.cz/seznam-lekaru-426.html#seznam
I know how to obtain them from a single page:
library(httr)
oborID <- "48"
okresID <- "3702"
web <- "http://www.lkcr.cz/seznam-lekaru-426.html"
# POST the search form for one specialty (obor) and district (okres); return the page as text
extractHTML <- function(oborID, okresID) {
  query <- list('filterObor' = oborID, 'filterOkresId' = okresID, 'do[findLekar]' = 1)
  html <- POST(url = web, body = query)
  content(html, "text")
}
# Cut out the ID strings located between "filterId" and "DETAIL"
IDfromHTML <- function(html) {
  starting <- unlist(gregexpr("filterId", html))
  ending <- unlist(gregexpr("DETAIL", html))
  # gregexpr() returns -1 when nothing matches; stop before indexing in that case
  if (starting[1] == -1 || ending[1] == -1) return(NULL)
  starting <- starting[seq(2, length(starting), 2)]  # keep every second "filterId" hit
  strings <- c()
  for (i in seq_along(starting)) {
    strings[i] <- substr(html, starting[i] + 9, ending[i] - 18)
  }
  list(strings)
}
Still, I am aware that downloading a whole page for only a few lines of text is quite inefficient (but it works!). Could you give me a tip on how to make this process more efficient?
I have also encountered some pages with more than 20 doctors listed (e.g. the combination of "Brno-město" and "chirurgie"). Such data are split across pages that are reached via a list of hyperlinks at the end of the form. I need to access each of these pages and apply the code I presented here, but I guess I have to pass some cookies along.
Other than that, the combination of "Praha" and "chirurgie" is problematic as well: there are more than 200 records, so the page applies some script, and I then need to click the "další" ("next") button and use the same method as in the previous paragraph.
Can you help me, please?
My Goal: Using R, scrape all light bulb model numbers and prices from Home Depot.
My Problem: I cannot find the URLs for ALL the light bulb pages. I can scrape one page, but I need a way to get the URLs so I can scrape them all.
Ideally I would like the individual product pages, such as
https://www.homedepot.com/p/TOGGLED-48-in-T8-16-Watt-Cool-White-Linear-LED-Tube-Light-Bulb-A416-40210/205935901
but even getting the list pages like this one would be OK:
https://www.homedepot.com/b/Lighting-Light-Bulbs/N-5yc1vZbmbu
I tried crawlr: it does not work on homedepot (maybe because of https?). I tried to get specific pages.
I tried rvest: I used html_form and set_values to put "light bulb" in the search box, but the form comes back as
[[1]]
<form> 'headerSearchForm' (GET )
<input hidden> '': 21
<input text> '':
<button > '<unnamed>
and set_values will not work because the name of the text input is '', so this error comes back:
error: attempt to use zero-length variable name.
I also tried using the paste function and lapply
tmp <- lapply(0:696, function(page) {
  url <- paste0("https://www.homedepot.com/b/Lighting-Light-Bulbs/N-5yc1vZbmbu?Nao=", page, "4&Ns=None")
  page <- read_html(url)
  html_table(html_nodes(page, "table"))[[1]]
})
I got the error: Error in html_table(html_nodes(page, "table"))[[1]]: subscript out of bounds.
I am seriously at a loss and any advice or tips would be so fantastic.
You can do it through rvest and tidyverse.
You can find a listing of all bulbs starting on this page, paginated at 24 bulbs per page across 30 pages:
https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79
Take a look at the pagination grid at the bottom of the initial page (I drew an ugly yellow oval around it in the screenshot).
You could extract the link to each page listing 24 bulbs by following the links in that pagination grid.
Yet just by comparing the URLs it becomes evident that all pages follow a pattern, with "https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79" as the root and a tail, such as "?Nao=24", whose digits give the index of the first light bulb displayed.
So you could simply infer the URL of each listing page. The following command creates such a list in R:
library(rvest)
library(tidyverse)
# 30 pages of 24 bulbs each: offsets 0, 24, ..., 696
index_list <- as.list(seq(0, (24*29), 24)) %>% paste0("https://www.homedepot.com/b/Lighting-Light-Bulbs-LED-Bulbs/N-5yc1vZbm79?Nao=", . )
Now, to extract the URL of each light bulb page, a combination of a function and purrr's map function comes in handy.
To extract the individual bulb URLs from the index pages, we can call this:
scrap_bulbs <- function(url){
  object <- read_html(as.character(url))
  # Select the product links; XPath attribute tests use "@"
  object <- html_nodes(x = object, xpath = "//a[@data-pod-type='pr']")
  object <- html_attr(x = object, 'href')
  Sys.sleep(10) ## Courtesy pause of 10 seconds, prevents the website from possibly blocking your IP
  paste0('https://www.homedepot.com', object)
}
Now we store the results in a list created by map().
bulbs_list <- map(.x = index_list, .f = scrap_bulbs)
unlist(bulbs_list)
Done!
I am using RStudio 3.4.4 on a Windows 10 machine.
I have a vector of artist names and I am trying to get genre information for all of them from Spotify. I have successfully set up the API, and the Rspotify package is working as expected.
I am trying to build up to a function, but I am failing pretty early on.
So far I have the following, but it is returning unexpected results:
len <- nrow(Artist_Nam)
artist_info <- character(len)  # one empty string per artist
for (i in 1:len) {
  if (nrow(searchArtist(Artist_Nam$ArtistName[i], token = keys)) >= 1) {
    artist_info[i] <- searchArtist(Artist_Nam$ArtistName[i], token = keys)$genres[1]
  } else {
    artist_info[i] <- ""
  }
}
artist_info
I was expecting this to return a list of genres, with an empty entry "" for artists that have no match on Spotify.
What is returned is indeed a list populated with genres, and on inspection these genres are correct, with "" where there is no match. However, something odd happens from [73] onwards (I have over 3,000 artists): the list only contains "",
even though there are matches when I check these artists manually with searchArtist().
I wonder if anyone has any suggestions or has experienced anything like this before?
There may be a limit on the number of requests you can make per minute, and you may simply be hitting it. Try adding a small delay with Sys.sleep() inside your loop so that you do not hit the API hard enough to be throttled.
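A minimal sketch of that idea, assuming the lookup is passed in as a function. The helper name lookup_with_delay, its lookup argument and the 0.5 second default are illustrative assumptions, not part of Rspotify, and the actual rate limit is not known here:

```r
# Hypothetical helper: call `lookup` once per name and pause between requests.
# `lookup` is assumed to return a data frame with a `genres` column
# (zero rows when nothing is found), like the searchArtist() result in the question.
lookup_with_delay <- function(names, lookup, delay = 0.5) {
  result <- character(length(names))  # defaults to "" for every artist
  for (i in seq_along(names)) {
    # Treat a failed request as "no match" instead of aborting the whole loop
    found <- tryCatch(lookup(names[i]), error = function(e) NULL)
    if (!is.null(found) && NROW(found) >= 1) {
      result[i] <- as.character(found$genres[1])
    }
    Sys.sleep(delay)  # assumed delay; tune it to the API's real limit
  }
  result
}

# Possible usage with the objects from the question (not run here):
# artist_info <- lookup_with_delay(Artist_Nam$ArtistName,
#                                  function(n) searchArtist(n, token = keys))
```

This version also calls the lookup only once per artist, which halves the number of requests compared with calling it in both the test and the assignment.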
I'm using iMacros because I want to scrape a certain site for IDs which are used in the URL, after which I want to press a button.
I know you can't use regular expressions or globbing in the syntax for URL GOTO.
But I figured there might be a way to put variables into the URL GOTO=?
Preferably I wouldn't randomize the variable, but have it try every page from [1 - 99999].
This is what I currently have:
VERSION BUILD=8940826 RECORDER=FX
TAB T=1
SET !ERRORIGNORE YES
SET !VAR3 ("Math.floor(Math.random()*99999 + 1);")
URL GOTO=http://example.com/id/ "randomized_variable_here"
TAG POS=1 TYPE=SPAN ATTR=TXT:press<SP>button
I have tried a few things, but I don't seem to be able to get this to work.
I have very little experience creating things from scratch; I just modify scripts to fit my purposes. Should I look towards an HTML document or something like that to randomize the variable for me?
Thanks in advance!
It's pretty simple to get the string with a randomized variable:
' ...
SET !VAR3 EVAL("Math.floor(Math.random()*99999 + 1);")
URL GOTO=http://example.com/id/{{!VAR3}}
' ...
And the following code is for looping through [1 - 'Max:' value on the 'iMacros' sidebar]:
' ...
SET !LOOP 1
URL GOTO=http://example.com/id/{{!LOOP}}
' ...
Just play this macro in loop mode.
I'm trying to populate a list from a dataset and set the selected option with a helper function that compares the current data with another object's data (the 2 objects are linked).
I made the same type of list population with static variables:
Jade-
select(name='status')
  option(value='Newly Acquired' selected='{{isCurrentState "Newly Acquired"}}') Newly Acquired
  option(value='Currently In Use' selected='{{isCurrentState "Currently In Use"}}') Currently In Use
  option(value='Not In Use' selected='{{isCurrentState "Not In Use"}}') Not In Use
  option(value='In Storage' selected='{{isCurrentState "In Storage"}}') In Storage
Coffeescript-
"isCurrentState" : (state) ->
  return @status == state
This uses a helper, isCurrentState, to match a given parameter against the same object that my other code is linked to, so I know that part works.
The code I'm trying to get to work is:
Jade-
select.loca(name='location')
  each locations
    option(value='#{siteName}' selected='{{isCurrentLocation {{siteName}} }}') #{siteName}
Coffeescript-
"isCurrentLocation": (location) ->
  return @locate == location
All the other parts are functioning 100%, but the selected part is not.
I've also tried changing the way I write the selected='' part in a number of ways, such as:
selected='{{isCurrentLocation "#{siteName}" }}'
selected='{{isCurrentLocation "#{siteName} }}'
selected='{{isCurrentLocation {{siteName}} }}'
selected='#{isCurrentLocation "{{siteName}}" }'
selected='#{isCurrentLocation {{siteName}} }'
selected='#{isCurrentLocation #{siteName} }'
Is what I'm trying to do even possible?
Is there a better way of achieving this?
Any help would be greatly appreciated
UPDATE:
Thanks @david-weldon for the quick reply. I've tried this out a bit and realised that I wasn't exactly clear in my question about what I was trying to accomplish.
I have a template "update_building" created with a parameter (a building object) with a number of attributes, one of which is "locate".
Locations is another object with a number of attributes as well, one of which is "siteName". One of the siteName values equals locate, so I need to pass in the siteName from locations to match it against the current building's locate attribute.
Though your answer doesn't work in the context I want to use it in, it definitely pointed me in a direction I hadn't thought of. I am looking into passing the parent template's (the building's) data context as a parameter into the locations template and using it from within the locations template. This is easily done in normal HTML spacebars with:
{{>locations parentDataContext/variable}}
Something like that in jade would easily solve this
Short answer
selected='{{isCurrentLocation siteName}}'
Long answer
You don't really need to pass the current location because the helper should know its own context. Here's a simple (tested) example:
jade
template(name='myTemplate')
  select.location(name='location')
    each locations
      option(value=this selected=isCurrentLocation) #{this}
coffee
LOCATIONS = [
  'Newly Acquired'
  'Currently In Use'
  'Not In Use'
  'In Storage'
]
Template.myTemplate.helpers
  locations: LOCATIONS
  isCurrentLocation: ->
    @toString() is Template.instance().location.get()
Template.myTemplate.onCreated ->
  @location = new ReactiveVar LOCATIONS[1]
I looked into the data contexts some more and ended up moving the options that populate the select into a separate template with its own helper. The helper accesses the parent template's data context and uses it to determine which location the building has saved, so that I can set that option to selected.
Jade-
template(name="location_building_option")
  option(value='#{siteName}' selected='{{isSelected}}') #{siteName}
Coffeescript -
Template.location_building_option.helpers
  'isSelected': ->
    parent = Template.parentData(1)
    buildSite = parent.locate
    return @siteName == buildSite
Thanks @david-weldon, your answer helped me immensely to head in the right direction.
I have a portion of code where I pick the value in a button and use it for other purposes. Or, at least, this is what I'd like to do.
The button changes value at every refresh of the page (it's a webpage).
For example: at the first access the button's value (or label) is "Results List (51)" but, if I refresh the page, the value becomes "Results List (11)".
What changes is the number inside the brackets (that identifies the number of results inside the list).
This is the relevant code:
ok = Browser("Bwr").Page("Page").Frame("Frame").WebButton("name:=Results List OK").GetToProperty("name")
ko = Browser("Bwr").Page("Page").Frame("Frame").WebButton("name:=Results List KO").GetToProperty("name")
If InStr(ko, "0") > 0 And InStr(ok, "0") = 0 Then
    Reporter.ReportEvent 0, "Riabbinamento effettuato", "Operazione effettuata con esito positivo: tutte le misure sono state riabbinate"
Else
    Reporter.ReportEvent 1, "Riabbinamento fallito", "Operazione effettuata con esito negativo: ci sono misure su cui l'operazione è fallita"
End If
Don't pay attention to the reporter (I'm Italian, it's written in my language).
If I execute the above code, QTP puts the string "Results List OK" in ok, but I want ok to contain the string "Results List OK (n)" (with n being the number that changes at every refresh of the page).
Basically I only need the number inside the brackets in order to make the IF really work...
Any idea?!
You want to use a regular expression to map the property (the brackets are escaped so that they match literally):
Results List \(\d+\)
or just Results List.*
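To illustrate how such a pattern picks the number out of the label, here is a small sketch in R rather than QTP. The labels follow the "Results List OK (n)" form from the question; the helper name count_in_label and the sample counts are illustrative assumptions:

```r
# Example labels in the form described in the question (counts are made up)
labels <- c("Results List OK (51)", "Results List KO (0)")

# \\( and \\) match the literal brackets; (\\d+) captures the count
pattern <- "^Results List (OK|KO) \\((\\d+)\\)$"

# Hypothetical helper: return the count inside the brackets, or NA if absent
count_in_label <- function(label) {
  m <- regmatches(label, regexec(pattern, label))[[1]]
  if (length(m) == 0) NA_integer_ else as.integer(m[3])
}

counts <- vapply(labels, count_in_label, integer(1), USE.NAMES = FALSE)
```

Comparing the extracted number with 0 is stricter than searching the label for the character "0", which would also match counts such as 10 or 205.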
OK, problem solved.
I used GetRoProperty instead of GetToProperty and changed the value in the brackets after WebElement from "name:=Results List OK" to "name:=Results List OK.*"
Thanks to gigatropolis for the useful tips (I upvoted your answer), but it was only half the solution :)