I am attempting to scrape a mobile-formatted webpage using RCurl, at the following URL:
http://m.fire.tas.gov.au/?pageId=incidentDetails&closed_incident_no=161685
Using this code:
library(RCurl)
options( RCurlOptions = list(verbose = TRUE, useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13"))
inurl <- getURL(http://m.fire.tas.gov.au/?pageId=incidentDetails&closed_incident_no=161685)
Note that I have attempted to set the user-agent to look like a Chrome browser - the results I get are the same with or without doing this. When I view the URL in Chrome, the dates come out formatted like this, with a time stamp as well:
And the HTML source matches that:
Last Updated: 24-Aug-2009 11:36<br>
First Reported: 24-Aug-2009 11:24<br>
But within R, after I've retrieved the data from the URL, the dates are formatted like this:
Last Updated: 2009-08-24<br>
First Reported: 2009-08-24<br>
Any ideas what's going on here? I figure the server is responding to the browser/Curl's user-agent or region or language or something similar, and returning different data, but can't figure out what I need to set in RCurl's options to change this.
Looks like the server is expecting 'Accept-Language' header:
library(RCurl)
getURL("http://m.fire.tas.gov.au/?pageId=incidentDetails&closed_incident_no=161685",
httpheader = c("Accept-Language" = "en-US,en;q=0.5"))
works for me (returns First Reported: 24-Aug-2009 11:24<br> etc.). I discovered this by using HttpFox Firefox plugin.
Related
I'm making text files consisting of the author, date of publication and main text of news articles. I have code to do this, but I need for Newspaper3k to identify the relevant information from these articles first. Since user agent specification has been an issue before, I also specify the user agent. Here's my code so you can follow along. This is version 3.9.0 of Python.
import time, os, random, nltk, newspaper
from newspaper import Article, Config
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
config = Config()
config.browser_user_agent = user_agent
url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'
article = Article(url, config=config)
article.download()
#article.html #
article.parse()
article.nlp()
article.authors
article.publish_date
article.text
To better understand why this case is particularly puzzling, please substitute the link I've provided above with this one, and re-run the code. With this link, the code now runs correctly, returning the author, date and text. With the link in the code above, it doesn't. What am I overlooking here?
Apparently, Newspaper demands that we specify the language we're interested in. The code here still doesn't extract the author for some strange reason, but this is enough for me. Here's the code, if anyone else would benefit from it.
#
# Imports our modules
#
import time, os, random, nltk, newspaper
from newspaper import Article
from googletrans import Translator
translator = Translator()
# The link we're interested in
url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'
#
# Extracts the meta-data
#
article = Article(url, language='es')
article.download()
article.parse()
article.nlp()
#
# Makes these into strings so they'll get into the list
#
authors = str(article.authors)
date = str(article.publish_date)
maintext = translator.translate(article.summary).text
# Makes the list we'll append
elements = [authors+ "\n", date+ "\n", maintext+ "\n", url]
for x in elements:
print(x)
A tool we use has regular updates. To keep the screenshots in our documentation up-to-date, I would like to start using Rmd and the webshot package. A first step would be to be able to login (and next to redirect to the desired page).
Based on the example in the package I tried the code below to login, but this triggers an "element not found error".
https://dev79379.service-now.com/login.do is the login page where I took the selectors
https://dev79379.service-now.com/home would be one the urls of interest
So, I have two questions:
what would be the correct selector to find the element?
how can I redirect from login.do to /home?
library(webshot)
url <- "https://dev79379.service-now.com/home"
fn <- tools::file_path_sans_ext(basename(url))
webshot(url, paste0(fn,".png"),
useragent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36", eval = "casper.then(function() {
// Enter username and password
this.sendKeys('#user_name.form-control input[type=\"text\"]', 'test');
this.sendKeys('#user_password.form-control input[type=\"password\"]', 'test');
// Now click in the search box. This results in a box expanding below
this.click('#sysverb_login.pull-right.btn.bt-primary input[type=\"submit\"]');
// Wait 500ms
this.wait(500);
});"
)
I'm new to Python and I'm trying to take the temperature from The Weather Network however I receive no value for my temperature. Can someone please help me with this because I've been stuck on this for a while? :( Thank you in advance!
import time
import schedule
import requests
from bs4 import BeautifulSoup
def FindTemp ():
myurl = "https://www.theweathernetwork.com/ca/36-hour-weather-forecast/ontario/toronto"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
}
r = requests.get(myurl, headers = headers)
c = r.content
soup = BeautifulSoup(c,"html.parser")
all = soup.find("div",{"class":"obs-area"}).find("span",{'class': 'temp'})
todaydate = time.asctime()
TorontoTemp = all.text
print("The temperature in Toronto is" ,TorontoTemp, "on", todaydate)
print(TorontoTemp)
print(FindTemp())
It doesn't have to work at all, even if you didn't do anything wrong. Many sites use Javascript to fetch data, so you'd need to use some other scraper that has Chromium built-in and uses the same DOM that you'd see if you were interacting with the site yourself, in-person. And many sites with valuable data, such as weather data, actively protect themselves from scraping, since the data they provide has monetary value (i.e. you can buy the data feed access).
In any case, you should start with some site that's known to scrape well. Beautifulsoup's own webpage is a good start :)
And you should use a debugger to see the intermediate values your code generated, and investigate at which point they diverge from your expectations.
I wanted to pick up some data from the health.usnews.com. I ran the following lines but both lines give me the exact same result - No response at all, R gets stuck on the line and I have to manually click on "interrupt R".
page_response <- httr::GET("https://health.usnews.com/")
# or
page <- xml2::read_html("https://health.usnews.com/")
What am I missing?
The website uses the user-agent header to detect web-scraper. Add a fake user-agent header, you'll able to get the result:
page_response <- httr::GET(
"https://health.usnews.com/",
config = httr::add_headers(
`user-agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
)
)
The problem though is most of the data is generated by JS. Idk what info you need but you're probably gonna need the V8 package to help you.
Recently, Yahoo changed their authentication mechanism to a two step one. So now, when I login to a yahoo site, I put in my username, and then it asks me to open my yahoo mobile app to give it a code. Alternatively, you can have it email or text you some other way around this. The result of this is that code that used to work to programatically login to Yahoo sites no longer works. This code just redirects to the login form. I've tried with and without a useragent string and with and without the countrycode=1 in the form values. I'm fine with entering a code after looking at my mobile app, but it doesn't forward me to the page to enter that code. How do we login to Yahoo these days using R?
url <- "http://mail.yahoo.com"
uastring <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
s <- rvest::html_session(url, httr::user_agent(uastring))
s_form <- rvest::html_form(s)[[1]]
filled_form <- rvest::set_values(s_form, username="myusername",
passwd="mypassword")
out <- rvest::submit_form(session=s, filled_form, submit="signin",
httr::add_headers("Content-Length"=0))
Okay, I've stumbled upon the answer here. I was using the httr::add_headers("Content-Length"=0) in response to a warning that rvest would throw: Warning message:
In request_POST(session, url = url, body = request$values, encode = request$encode, :
Length Required (HTTP 411).
As it turns out, despite the warning, everything worked fine and in fact, if I add the content-length header, the login fails. So, my code to login to yahoo ends up looking like this:
username <- "some_username#yahoo.com"
league_id <- "some league id to complete the fantasy football url"
uastring <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
url <- "http://football.fantasysports.yahoo.com/f1/"
url <- paste0(url, league_id)
s <- rvest::html_session(url, httr::user_agent(uastring))
myform <- rvest::html_form(s)[[1]]
myform <- rvest::set_values(myform, username=username)
s <- suppressWarnings(rvest::submit_form(s, myform, submit="signin"))
s <- rvest::jump_to(s, s$response$url)
myform <- rvest::html_form(s)[[1]]
if("code" %in% names(myform$fields)){
code <- readline(prompt="In your Yahoo app, find and click on the Account Key icon.\nGet the 8 character code and\nenter it here: ")
}else{
print("Unable to login")
return(NULL)
}
myform <- rvest::set_values(myform, code=code)
s <- suppressWarnings(rvest::submit_form(s, myform, submit="verify"))
if(grepl("authorize\\/verify", s$url)){
print("Wrong code entered, unable to login")
return(NULL)
}else{
print("Login successful")
}
s <- rvest::jump_to(s, s$response$url)
It's a two step process... Submit your username, then go to your yahoo app to get the login code. There's no yahoo password needed. I use readline to get the login code. Seems to work well... I'm able to scrape my fantasy football data after completing the login. It's just very curious that the warning asking for a content length header would lead you down a path that doesn't work. By the way, this same situation applies when trying to login to google. You have to ignore the warning and it works fine.