I wanted to pick up some data from health.usnews.com. I ran the following two lines, but both behave the same way: there is no response at all, R hangs on the line, and I have to manually click "Interrupt R".
page_response <- httr::GET("https://health.usnews.com/")
# or
page <- xml2::read_html("https://health.usnews.com/")
What am I missing?
The website uses the User-Agent header to detect web scrapers. Add a browser-like User-Agent header and you'll be able to get the result:
page_response <- httr::GET(
  "https://health.usnews.com/",
  config = httr::add_headers(
    `user-agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
  )
)
The problem, though, is that most of the data is generated by JavaScript. I don't know what info you need, but you'll probably need something like the V8 package to help you.
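For comparison, here is a minimal Python sketch of the same idea using only the standard library: send a browser-like User-Agent and set a timeout so a stalled connection raises an error instead of hanging. The helper names build_ua_request and fetch are made up for illustration, and whether this particular site accepts the header is an assumption, not something verified here.

```python
import urllib.request

# Browser-like User-Agent string taken from the answer above.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
)


def build_ua_request(url, user_agent=BROWSER_UA):
    """Return a Request carrying a browser-like User-Agent (no network I/O)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download the page as text; a socket stalled for `timeout` seconds raises."""
    with urllib.request.urlopen(build_ua_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Building the request does not touch the network; only fetch() does.
req = build_ua_request("https://health.usnews.com/")
print(req.get_header("User-agent"))  # urllib stores the header name as "User-agent"
```

Calling fetch("https://health.usnews.com/") would then perform the actual download. It only returns the static HTML, so anything rendered by JavaScript would still be missing.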
I am struggling to scrape an interactive map, or its coordinates, from a website.
Below is an example of the map (or coordinates) I would like to scrape with requests / bs4.
The idea is to scrape 100 or so map locations and plot them on a map.
Could you please advise on how to scrape the map at the bottom of this page:
https://www.njuskalo.hr/nekretnine/gradevinsko-zemljiste-zagreb-lucko-5000-m2-oglas-34732559
The location data is hidden in a script tag within the HTML. You can get it out like this:
import requests
import json
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
url = 'https://www.njuskalo.hr/nekretnine/gradevinsko-zemljiste-zagreb-lucko-5000-m2-oglas-34732559'
resp = requests.get(url,headers=headers)
start = '"defaultMarker":'
end = ',"cluster":{"icon1'
s = resp.text
dirty_json = s[s.find(start)+len(start):s.rfind(end)].strip() #get the json out the html
clean_json = json.loads(dirty_json)
print(clean_json['lat'],clean_json['lng'])
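If the end marker ever changes, the slice above stops being valid JSON. A possible alternative, sketched here with only the standard library, is to let json.JSONDecoder.raw_decode parse the first JSON value after the start marker, so no end marker is needed. The helper name extract_json_after and the sample HTML are made up for illustration; the real page's markup may differ.

```python
import json


def extract_json_after(html, marker):
    """Parse the first JSON value that follows `marker` in `html`."""
    pos = html.find(marker)
    if pos == -1:
        raise ValueError(f"marker {marker!r} not found")
    pos += len(marker)
    # raw_decode does not skip leading whitespace, so do it by hand.
    while pos < len(html) and html[pos].isspace():
        pos += 1
    # raw_decode stops at the end of the first complete JSON value.
    value, _end = json.JSONDecoder().raw_decode(html, pos)
    return value


# Made-up snippet shaped like the script tag described above.
sample = (
    '<script>app.init({"defaultMarker": {"lat":45.75,"lng":15.88},'
    '"cluster":{"icon1":"x"}});</script>'
)
marker = extract_json_after(sample, '"defaultMarker":')
print(marker["lat"], marker["lng"])
```

With the real page, you would pass resp.text instead of sample.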
A tool we use has regular updates. To keep the screenshots in our documentation up-to-date, I would like to start using Rmd and the webshot package. A first step would be to be able to login (and next to redirect to the desired page).
Based on the example in the package I tried the code below to login, but this triggers an "element not found error".
https://dev79379.service-now.com/login.do is the login page where I took the selectors from.
https://dev79379.service-now.com/home would be one of the URLs of interest.
So, I have two questions:
What would be the correct selector to find the element?
How can I redirect from login.do to /home?
library(webshot)
url <- "https://dev79379.service-now.com/home"
fn <- tools::file_path_sans_ext(basename(url))
webshot(url, paste0(fn,".png"),
useragent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36", eval = "casper.then(function() {
// Enter username and password
this.sendKeys('#user_name.form-control input[type=\"text\"]', 'test');
this.sendKeys('#user_password.form-control input[type=\"password\"]', 'test');
// Now click the login button to submit the form
this.click('#sysverb_login.pull-right.btn.bt-primary input[type=\"submit\"]');
// Wait 500ms
this.wait(500);
});"
)
I'm new to Python and I'm trying to get the temperature from The Weather Network, but I receive no value for the temperature. Can someone please help me with this? I've been stuck on it for a while. :( Thank you in advance!
import time
import schedule
import requests
from bs4 import BeautifulSoup
def FindTemp ():
    myurl = "https://www.theweathernetwork.com/ca/36-hour-weather-forecast/ontario/toronto"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    }
    r = requests.get(myurl, headers = headers)
    c = r.content
    soup = BeautifulSoup(c,"html.parser")
    all = soup.find("div",{"class":"obs-area"}).find("span",{'class': 'temp'})
    todaydate = time.asctime()
    TorontoTemp = all.text
    print("The temperature in Toronto is" ,TorontoTemp, "on", todaydate)
    print(TorontoTemp)
print(FindTemp())
This may not work at all, even if you did nothing wrong. Many sites use JavaScript to fetch data, so you'd need a different scraper, one that has Chromium built in and works on the same DOM you'd see if you were using the site yourself. And many sites with valuable data, such as weather data, actively protect themselves from scraping, since the data they provide has monetary value (i.e. you can buy access to the data feed).
In any case, you should start with some site that's known to scrape well. Beautifulsoup's own webpage is a good start :)
And you should use a debugger to see the intermediate values your code generated, and investigate at which point they diverge from your expectations.
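One way to check an intermediate value, sketched below with only the standard library, is to ask whether the element you are looking for exists in the downloaded HTML at all before reading its text. The class FirstTextByClass, the function find_temp, and both sample pages are made up for illustration. As a simplification, the sketch looks for any span with class "temp" and ignores the enclosing div.

```python
from html.parser import HTMLParser


class FirstTextByClass(HTMLParser):
    """Collect the text of the first <tag class="..."> element encountered."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0    # > 0 while inside the matched element
        self.text = None  # stays None if the element never appears

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested tags of the same name so we know when to stop.
            if tag == self.tag:
                self.depth += 1
        elif self.text is None and tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1
                self.text = ""

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text += data


def find_temp(html):
    """Return the text of the first span.temp, or None if it is absent."""
    parser = FirstTextByClass("span", "temp")
    parser.feed(html)
    parser.close()
    return parser.text.strip() if parser.text is not None else None


# Made-up pages: one with the value in the HTML, one filled in later by JavaScript.
static_page = '<div class="obs-area"><span class="temp">-3</span></div>'
js_page = '<div id="app"></div><script src="bundle.js"></script>'

print(find_temp(static_page))  # -3
print(find_temp(js_page))      # None
```

A None result on the real page would suggest the temperature is inserted by JavaScript after load, which plain HTTP scraping cannot see.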
Recently, Yahoo changed their authentication mechanism to a two-step one. Now, when I log in to a Yahoo site, I enter my username, and it then asks me to open the Yahoo mobile app to get a code. Alternatively, you can have it email or text you as another way around this. As a result, code that used to log in to Yahoo sites programmatically no longer works. The code below just redirects to the login form. I've tried with and without a user-agent string, and with and without countrycode=1 in the form values. I'm fine with entering a code after looking at my mobile app, but it doesn't forward me to the page where I would enter that code. How do we log in to Yahoo these days using R?
url <- "http://mail.yahoo.com"
uastring <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
s <- rvest::html_session(url, httr::user_agent(uastring))
s_form <- rvest::html_form(s)[[1]]
filled_form <- rvest::set_values(s_form, username="myusername",
passwd="mypassword")
out <- rvest::submit_form(session=s, filled_form, submit="signin",
httr::add_headers("Content-Length"=0))
Okay, I've stumbled upon the answer here. I was using httr::add_headers("Content-Length"=0) in response to a warning that rvest would throw:
Warning message:
In request_POST(session, url = url, body = request$values, encode = request$encode, :
Length Required (HTTP 411).
As it turns out, despite the warning, everything worked fine and in fact, if I add the content-length header, the login fails. So, my code to login to yahoo ends up looking like this:
username <- "some_username#yahoo.com"
league_id <- "some league id to complete the fantasy football url"
uastring <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
url <- "http://football.fantasysports.yahoo.com/f1/"
url <- paste0(url, league_id)
s <- rvest::html_session(url, httr::user_agent(uastring))
myform <- rvest::html_form(s)[[1]]
myform <- rvest::set_values(myform, username=username)
s <- suppressWarnings(rvest::submit_form(s, myform, submit="signin"))
s <- rvest::jump_to(s, s$response$url)
myform <- rvest::html_form(s)[[1]]
if("code" %in% names(myform$fields)){
code <- readline(prompt="In your Yahoo app, find and click on the Account Key icon.\nGet the 8 character code and\nenter it here: ")
}else{
print("Unable to login")
return(NULL)
}
myform <- rvest::set_values(myform, code=code)
s <- suppressWarnings(rvest::submit_form(s, myform, submit="verify"))
if(grepl("authorize\\/verify", s$url)){
print("Wrong code entered, unable to login")
return(NULL)
}else{
print("Login successful")
}
s <- rvest::jump_to(s, s$response$url)
It's a two-step process: submit your username, then go to your Yahoo app to get the login code. No Yahoo password is needed. I use readline to get the login code. It seems to work well; I'm able to scrape my fantasy football data after completing the login. It's just curious that the warning asking for a Content-Length header leads you down a path that doesn't work. By the way, the same situation applies when trying to log in to Google: you have to ignore the warning, and it works fine.
I am attempting to scrape a mobile-formatted webpage using RCurl, at the following URL:
http://m.fire.tas.gov.au/?pageId=incidentDetails&closed_incident_no=161685
Using this code:
library(RCurl)
options( RCurlOptions = list(verbose = TRUE, useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13"))
inurl <- getURL("http://m.fire.tas.gov.au/?pageId=incidentDetails&closed_incident_no=161685")
Note that I have attempted to set the user-agent to look like a Chrome browser - the results I get are the same with or without doing this. When I view the URL in Chrome, the dates come out formatted like this, with a time stamp as well:
And the HTML source matches that:
Last Updated: 24-Aug-2009 11:36<br>
First Reported: 24-Aug-2009 11:24<br>
But within R, after I've retrieved the data from the URL, the dates are formatted like this:
Last Updated: 2009-08-24<br>
First Reported: 2009-08-24<br>
Any ideas what's going on here? I figure the server is responding to the browser/Curl's user-agent or region or language or something similar, and returning different data, but can't figure out what I need to set in RCurl's options to change this.
Looks like the server is expecting an 'Accept-Language' header:
library(RCurl)
getURL("http://m.fire.tas.gov.au/?pageId=incidentDetails&closed_incident_no=161685",
httpheader = c("Accept-Language" = "en-US,en;q=0.5"))
This works for me (it returns First Reported: 24-Aug-2009 11:24<br> etc.). I discovered this by using the HttpFox Firefox plugin.
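For reference, the header value used above is a comma-separated list of language tags with optional q weights. The Python sketch below, standard library only, builds that value and attaches it to a request without sending it. The helper names accept_language and build_lang_request are made up for illustration.

```python
import urllib.request


def accept_language(prefs):
    """Format (language_tag, q) pairs as an Accept-Language header value."""
    parts = []
    for tag, q in prefs:
        # A weight of 1 is the default and is normally left implicit.
        parts.append(tag if q >= 1 else f"{tag};q={q:g}")
    return ",".join(parts)


def build_lang_request(url, prefs):
    """Return a Request carrying the Accept-Language header (no network I/O)."""
    return urllib.request.Request(
        url, headers={"Accept-Language": accept_language(prefs)}
    )


url = "http://m.fire.tas.gov.au/?pageId=incidentDetails&closed_incident_no=161685"
req = build_lang_request(url, [("en-US", 1.0), ("en", 0.5)])
print(req.get_header("Accept-language"))  # en-US,en;q=0.5
```

Passing req to urllib.request.urlopen would send it. Whether the server still varies its date format on this header is an assumption based on the answer above.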