Webscraping of Weather Website returns nil - web-scraping

I'm new to Python and I'm trying to take the temperature from The Weather Network however I receive no value for my temperature. Can someone please help me with this because I've been stuck on this for a while? :( Thank you in advance!
import time
import schedule
import requests
from bs4 import BeautifulSoup
def FindTemp ():
myurl = "https://www.theweathernetwork.com/ca/36-hour-weather-forecast/ontario/toronto"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
}
r = requests.get(myurl, headers = headers)
c = r.content
soup = BeautifulSoup(c,"html.parser")
all = soup.find("div",{"class":"obs-area"}).find("span",{'class': 'temp'})
todaydate = time.asctime()
TorontoTemp = all.text
print("The temperature in Toronto is" ,TorontoTemp, "on", todaydate)
print(TorontoTemp)
print(FindTemp())

It doesn't have to work at all, even if you didn't do anything wrong. Many sites use Javascript to fetch data, so you'd need to use some other scraper that has Chromium built-in and uses the same DOM that you'd see if you were interacting with the site yourself, in-person. And many sites with valuable data, such as weather data, actively protect themselves from scraping, since the data they provide has monetary value (i.e. you can buy the data feed access).
In any case, you should start with some site that's known to scrape well. Beautifulsoup's own webpage is a good start :)
And you should use a debugger to see the intermediate values your code generated, and investigate at which point they diverge from your expectations.

Related

Web Scrape interactive map coordinates

I am struggling with how to scrape an interactive map or coordinates from the website,
below is an example of the map (or coordinates) I would like to scrape with requests / bs4.
The idea is to scrape like 100 or so map locations and plot them a map graph.
Could you please advise on how to scrape the map bottom of the website:
https://www.njuskalo.hr/nekretnine/gradevinsko-zemljiste-zagreb-lucko-5000-m2-oglas-34732559
The location data is hidden in a script tag within the HTML, you can get it out like this:
import requests
import json
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
url = 'https://www.njuskalo.hr/nekretnine/gradevinsko-zemljiste-zagreb-lucko-5000-m2-oglas-34732559'
resp = requests.get(url,headers=headers)
start = '"defaultMarker":'
end = ',"cluster":{"icon1'
s = resp.text
dirty_json = s[s.find(start)+len(start):s.rfind(end)].strip() #get the json out the html
clean_json = json.loads(dirty_json)
print(clean_json['lat'],clean_json['lng'])

Newspaper3k, User Agents and Scraping

I'm making text files consisting of the author, date of publication and main text of news articles. I have code to do this, but I need for Newspaper3k to identify the relevant information from these articles first. Since user agent specification has been an issue before, I also specify the user agent. Here's my code so you can follow along. This is version 3.9.0 of Python.
import time, os, random, nltk, newspaper
from newspaper import Article, Config
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
config = Config()
config.browser_user_agent = user_agent
url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'
article = Article(url, config=config)
article.download()
#article.html #
article.parse()
article.nlp()
article.authors
article.publish_date
article.text
To better understand why this case is particularly puzzling, please substitute the link I've provided above with this one, and re-run the code. With this link, the code now runs correctly, returning the author, date and text. With the link in the code above, it doesn't. What am I overlooking here?
Apparently, Newspaper demands that we specify the language we're interested in. The code here still doesn't extract the author for some strange reason, but this is enough for me. Here's the code, if anyone else would benefit from it.
#
# Imports our modules
#
import time, os, random, nltk, newspaper
from newspaper import Article
from googletrans import Translator
translator = Translator()
# The link we're interested in
url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'
#
# Extracts the meta-data
#
article = Article(url, language='es')
article.download()
article.parse()
article.nlp()
#
# Makes these into strings so they'll get into the list
#
authors = str(article.authors)
date = str(article.publish_date)
maintext = translator.translate(article.summary).text
# Makes the list we'll append
elements = [authors+ "\n", date+ "\n", maintext+ "\n", url]
for x in elements:
print(x)

Why is Beautifulsoup Displaying Scrape URLs with Unnecessary Characters

I tried other types of css selectors and xpaths, so I am assuming I may be using the library incorrectly but there is no documentations that is not telling me otherwise. I also tried other bs4 functions such as find_all, but many do not return any other results. Any type of help would be appreciated, Cheers!
Code:
import bs4 as bs
from requests import get
query = input('Please Enter Your Topic of intrest: ')
first_part = query.replace(" ", "%20")
second_part = query.replace(" ", "+")
results= "0"
num_of_pages = int(input('How many pages do you want scraped? '))
for i in range(num_of_pages):
results= int(results)
results += 10
gsearch_url = "https://www.google.com/search?q={}#q={}%3F&start={}&*".format(first_part, second_part, results)
sauce = get(gsearch_url)
soup = bs.BeautifulSoup(sauce.text, 'lxml')
for url in soup.select('.r a'):
print(url.get('href'))
Return:
/url?q=http://www.codingdojo.com/blog/9-most-in-demand-programming-languages-of-2016/&sa=U&ved=0ahUKEwja3a21w7fSAhWSZiYKHdLGA9gQFggdMAI&usg=AFQjCNFmDl_1epVQRmDfc4y5MWFeNvrPQg
/url?q=https://fossbytes.com/best-popular-programming-languages-2017/&sa=U&ved=0ahUKEwja3a21w7fSAhWSZiYKHdLGA9gQFgghMAM&usg=AFQjCNEKhYqx1FbKl_Wu-9EoMYd3e9i_Dw
/url?q=http://www.bestprogramminglanguagefor.me/&sa=U&ved=0ahUKEwja3a21w7fSAhWSZiYKHdLGA9gQFggnMAQ&usg=AFQjCNHmbzuLwFo_egaWnbXSOW4p-Fva3g
/url?q=http://www.codingdojo.com/blog/9-most-in-demand-programming-languages-of-2016/&sa=U&ved=0ahUKEwja3a21w7fSAhWSZiYKHdLGA9gQFggyMAU&usg=AFQjCNFmDl_1epVQRmDfc4y5MWFeNvrPQg
etc....
First off, scraping Google's search results breaks their terms of service. So, somewhere on the internets the great Alphabet is wagging a finger and furrowing its brow. Oh yeah, you'll probably get slapped with a captcha at some point too.
Second, and ahem purely to resolve any remaining academic curiosity, the results you're getting are not caused by BeautifulSoup. It's actually what Google is returning. You can check it by doing a print(soup) and perusing the html. You'll notice all your href's match exactly to what you're printing out.
Why does this look different than what you see in your browser? The magic of javascript! Which the requests library does not handle, so you're seeing the results without all the client-side processing.
You're looking for this:
# calls for ".yuRUbf a" css selector and grabs "href" attribute (link)
soup.select_one('.yuRUbf a')['href']
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "samurai cop what does katana mean", # query
"gl": "us", # country to search from
"hl": "en" # language
}
html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
title = result.select_one('.DKV0Md').text
link = result.select_one('.yuRUbf a')['href']
print(title, link, sep='\n')
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data from the structured JSON rather than figuring out why things don't work and then maintain it over time if some selectors will change.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "samurai cop what does katana mean",
"hl": "en",
"gl": "us",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
print(result['title'])
print(result['link'])
Disclaimer, I work for SerpApi.

Yahoo login using rvest

Recently, Yahoo changed their authentication mechanism to a two step one. So now, when I login to a yahoo site, I put in my username, and then it asks me to open my yahoo mobile app to give it a code. Alternatively, you can have it email or text you some other way around this. The result of this is that code that used to work to programatically login to Yahoo sites no longer works. This code just redirects to the login form. I've tried with and without a useragent string and with and without the countrycode=1 in the form values. I'm fine with entering a code after looking at my mobile app, but it doesn't forward me to the page to enter that code. How do we login to Yahoo these days using R?
url <- "http://mail.yahoo.com"
uastring <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
s <- rvest::html_session(url, httr::user_agent(uastring))
s_form <- rvest::html_form(s)[[1]]
filled_form <- rvest::set_values(s_form, username="myusername",
passwd="mypassword")
out <- rvest::submit_form(session=s, filled_form, submit="signin",
httr::add_headers("Content-Length"=0))
Okay, I've stumbled upon the answer here. I was using the httr::add_headers("Content-Length"=0) in response to a warning that rvest would throw: Warning message:
In request_POST(session, url = url, body = request$values, encode = request$encode, :
Length Required (HTTP 411).
As it turns out, despite the warning, everything worked fine and in fact, if I add the content-length header, the login fails. So, my code to login to yahoo ends up looking like this:
username <- "some_username#yahoo.com"
league_id <- "some league id to complete the fantasy football url"
uastring <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
url <- "http://football.fantasysports.yahoo.com/f1/"
url <- paste0(url, league_id)
s <- rvest::html_session(url, httr::user_agent(uastring))
myform <- rvest::html_form(s)[[1]]
myform <- rvest::set_values(myform, username=username)
s <- suppressWarnings(rvest::submit_form(s, myform, submit="signin"))
s <- rvest::jump_to(s, s$response$url)
myform <- rvest::html_form(s)[[1]]
if("code" %in% names(myform$fields)){
code <- readline(prompt="In your Yahoo app, find and click on the Account Key icon.\nGet the 8 character code and\nenter it here: ")
}else{
print("Unable to login")
return(NULL)
}
myform <- rvest::set_values(myform, code=code)
s <- suppressWarnings(rvest::submit_form(s, myform, submit="verify"))
if(grepl("authorize\\/verify", s$url)){
print("Wrong code entered, unable to login")
return(NULL)
}else{
print("Login successful")
}
s <- rvest::jump_to(s, s$response$url)
It's a two step process... Submit your username, then go to your yahoo app to get the login code. There's no yahoo password needed. I use readline to get the login code. Seems to work well... I'm able to scrape my fantasy football data after completing the login. It's just very curious that the warning asking for a content length header would lead you down a path that doesn't work. By the way, this same situation applies when trying to login to google. You have to ignore the warning and it works fine.

Retrieving URL using RCurl gives different date format than in browser

I am attempting to scrape a mobile-formatted webpage using RCurl, at the following URL:
http://m.fire.tas.gov.au/?pageId=incidentDetails&closed_incident_no=161685
Using this code:
library(RCurl)
options( RCurlOptions = list(verbose = TRUE, useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13"))
inurl <- getURL(http://m.fire.tas.gov.au/?pageId=incidentDetails&closed_incident_no=161685)
Note that I have attempted to set the user-agent to look like a Chrome browser - the results I get are the same with or without doing this. When I view the URL in Chrome, the dates come out formatted like this, with a time stamp as well:
And the HTML source matches that:
Last Updated: 24-Aug-2009 11:36<br>
First Reported: 24-Aug-2009 11:24<br>
But within R, after I've retrieved the data from the URL, the dates are formatted like this:
Last Updated: 2009-08-24<br>
First Reported: 2009-08-24<br>
Any ideas what's going on here? I figure the server is responding to the browser/Curl's user-agent or region or language or something similar, and returning different data, but can't figure out what I need to set in RCurl's options to change this.
Looks like the server is expecting 'Accept-Language' header:
library(RCurl)
getURL("http://m.fire.tas.gov.au/?pageId=incidentDetails&closed_incident_no=161685",
httpheader = c("Accept-Language" = "en-US,en;q=0.5"))
works for me (returns First Reported: 24-Aug-2009 11:24<br> etc.). I discovered this by using HttpFox Firefox plugin.

Resources