Newspaper3k, User Agents and Scraping

I'm making text files consisting of the author, date of publication, and main text of news articles. I have code to do this, but first I need Newspaper3k to identify the relevant information in these articles. Since user-agent specification has been an issue before, I also specify the user agent. Here's my code so you can follow along; I'm running Python 3.9.0.
import time, os, random, nltk, newspaper
from newspaper import Article, Config
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
config = Config()
config.browser_user_agent = user_agent
url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'
article = Article(url, config=config)
article.download()
# article.html  # uncomment to inspect the downloaded HTML
article.parse()
article.nlp()
article.authors
article.publish_date
article.text
To better understand why this case is particularly puzzling, please substitute the link I've provided above with this one, and re-run the code. With this link, the code now runs correctly, returning the author, date and text. With the link in the code above, it doesn't. What am I overlooking here?

It turns out Newspaper needs us to specify the language we're interested in. The code below still doesn't extract the author, for some strange reason, but this is enough for me. Here's the code, in case anyone else would benefit from it.
#
# Imports our modules
#
import time, os, random, nltk, newspaper
from newspaper import Article
from googletrans import Translator
translator = Translator()
# The link we're interested in
url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'
#
# Extracts the meta-data
#
article = Article(url, language='es')
article.download()
article.parse()
article.nlp()
#
# Makes these into strings so they'll get into the list
#
authors = str(article.authors)
date = str(article.publish_date)
maintext = translator.translate(article.summary).text
# Builds the list of elements we'll print out
elements = [authors + "\n", date + "\n", maintext + "\n", url]
for x in elements:
    print(x)
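For completeness, here is a variant that keeps the custom user agent from the original attempt, still sets the language, and writes everything to a text file (the original goal). This is only a sketch: it assumes newspaper3k merges keyword arguments such as language into the supplied Config, which current versions do.
from newspaper import Article, Config

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
config = Config()
config.browser_user_agent = user_agent

url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'

# language='es' is applied on top of the config (assumption: newspaper3k
# copies keyword arguments onto the supplied Config object)
article = Article(url, config=config, language='es')
article.download()
article.parse()

# Write the author, date, text and URL to a text file, as in the original goal
with open('article.txt', 'w', encoding='utf-8') as f:
    f.write(str(article.authors) + "\n")
    f.write(str(article.publish_date) + "\n")
    f.write(article.text + "\n")
    f.write(url + "\n")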

Related

Scraping: No attribute find_all for <p>

Good morning :)
I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=438653&CID=102313
The text I am trying to get seems to be located inside some <p> and separated by <br>.
For some reason, whenever I try to access a <p>, I get the following error: "ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?", and this happens even if I use find() instead of find_all().
My code is below (it is a very simple thing with no loop yet; I just want to identify where the mistake comes from):
from selenium import webdriver
import time
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path='MYPATH/chromedriver',options=options)
url= "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("div", class_="col-sm-12")
people_in_column = column.find_all("p").find_all("br")
Is there anything obvious I am not understanding here?
Thanks a lot in advance for your help!
You are calling find_all() on a ResultSet (a list of elements), which is incorrect: find_all() returns a list, and you then call find_all() again on that list instead of iterating over it. The correct way is as follows. Hopefully it works for you.
columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)
Full working code as an example:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
options = Options()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
webdriver_service = Service("./chromedriver")  # your chromedriver path
driver = webdriver.Chrome(service=webdriver_service, options=options)
url= "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5)  # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")
columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    people_in_column = column.find("p").get_text(strip=True)
    print(people_in_column)
Output:
Notice of NIH Policy to All Applicants: Meeting rosters are provided for information purposes only. Applicant investigators and institutional officials must not communicate directly with study section members about an application before or after the review. Failure to observe this policy will create a serious breach of integrity in the peer review process, and may lead to actions outlined in NOT-OD-22-044, including removal of the application from immediate review.
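Note that get_text(strip=True) concatenates everything inside the <p>, so the <br>-separated names run together. If you need the individual entries, one option (a sketch, assuming the roster entries really are separated by <br> tags inside each <p>) is to pass a separator to get_text() and split on it:
for column in soup.find_all("div", class_="col-sm-12"):
    p = column.find("p")
    if p is None:
        continue
    # "\n" is inserted where the <br> tags were, so each entry lands on its own line
    for entry in p.get_text("\n", strip=True).split("\n"):
        print(entry)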

unknown characters "سقوط" are scraped instead of encoding utf-8

I'm trying to scrape a non-English website (https://arzdigital.com/). Here is my spider code. The problem is that although I import "urllib.parse" at the beginning, and in the settings.py file I wrote
FEED_EXPORT_ENCODING='utf-8'
the spider doesn't encode the output properly (it comes out like this: "سقوط ۱۰ هزار دلاری بیت کوین در عرض یک ساعت؛ علت چه بود؟"). Even using the .encode() function didn't work.
So, here is my spider code:
# -*- coding: utf-8 -*-
import scrapy
import logging
import urllib.parse
parts = urllib.parse.urlsplit(u'http://fa.wikipedia.org/wiki/صفحهٔ_اصلی')
parts = parts._replace(path=urllib.parse.quote(parts.path.encode('utf8')))
encoded_url = parts.geturl().encode('ascii')
# the percent-encoded form of the URL above:
'https://fa.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C'
class CriptolernSpider(scrapy.Spider):
    name = 'criptolern'
    allowed_domains = ['arzdigital.com']
    start_urls = [f'https://arzdigital.com/latest-posts/page/{i}/' for i in range(1, 353)]

    def parse(self, response):
        posts = response.xpath("//a[@class='arz-last-post arz-row']")
        try:
            for post in posts:
                post_title = post.xpath(".//@title").get()
                yield {
                    'post_title': post_title
                }
        except AttributeError:
            logging.error("The element didn't exist")
Can anybody tell me where the problem is? Thank you so much!
In the response headers there is a charset; if there isn't one, it defaults to Windows-1252.
If you find a charset of ISO-8859-1, substitute it with Windows-1252.
With that you have the right encoding to read the page.
It's best to store everything in full Unicode (UTF-8), so that every script is possible.
It may also be that you are looking at the output in a console (on Windows, most likely not UTF-8), in which case you will see each multi-byte sequence as two weird characters. Store the output in a file and open it with Notepad++ or the like, where you can see the encoding and change it. Nowadays even Windows Notepad sometimes recognizes UTF-8.
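As a quick sanity check of the console-versus-file point above, here is a small sketch (plain Python, independent of Scrapy): write one of the scraped titles to a file with an explicit UTF-8 encoding and open that file in an editor such as Notepad++. If it reads correctly there, the data itself is fine and only the console rendering is off.
# write the text to a UTF-8 file instead of relying on the console
title = "سقوط ۱۰ هزار دلاری بیت کوین در عرض یک ساعت؛ علت چه بود؟"
with open("titles.txt", "w", encoding="utf-8") as f:
    f.write(title + "\n")

# reading it back with the same encoding round-trips cleanly
with open("titles.txt", encoding="utf-8") as f:
    print(f.read())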

Webscraping of Weather Website returns nil

I'm new to Python and I'm trying to take the temperature from The Weather Network, but I get no value back for the temperature. Can someone please help me with this? I've been stuck on it for a while. :( Thank you in advance!
import time
import schedule
import requests
from bs4 import BeautifulSoup
def FindTemp():
    myurl = "https://www.theweathernetwork.com/ca/36-hour-weather-forecast/ontario/toronto"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    }
    r = requests.get(myurl, headers=headers)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    all = soup.find("div", {"class": "obs-area"}).find("span", {'class': 'temp'})
    todaydate = time.asctime()
    TorontoTemp = all.text
    print("The temperature in Toronto is", TorontoTemp, "on", todaydate)
    print(TorontoTemp)

print(FindTemp())
It may simply not work, even if you didn't do anything wrong. Many sites use JavaScript to fetch data, so you'd need some other scraper that has Chromium built in and uses the same DOM you'd see if you were interacting with the site yourself, in person. And many sites with valuable data, such as weather data, actively protect themselves from scraping, since the data they provide has monetary value (i.e. you can buy access to the data feed).
In any case, you should start with some site that's known to scrape well. Beautifulsoup's own webpage is a good start :)
And you should use a debugger to see the intermediate values your code generated, and investigate at which point they diverge from your expectations.
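If you still want to try a browser-based approach for this particular page, here is a minimal sketch with Selenium, which runs the page's JavaScript before scraping. The CSS selector is taken from the original attempt and is an assumption: the site may have changed its markup or may block automated access entirely.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)  # assumes a chromedriver is available on your PATH

try:
    driver.get("https://www.theweathernetwork.com/ca/36-hour-weather-forecast/ontario/toronto")
    # wait until the temperature element has been filled in by the page's JavaScript
    temp_el = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.obs-area span.temp"))
    )
    print("The temperature in Toronto is", temp_el.text)
finally:
    driver.quit()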

Why is BeautifulSoup Displaying Scraped URLs with Unnecessary Characters

I tried other types of CSS selectors and XPaths, so I'm assuming I may be using the library incorrectly, but there is no documentation telling me otherwise. I also tried other bs4 functions such as find_all, but many do not return any results either. Any kind of help would be appreciated. Cheers!
Code:
import bs4 as bs
from requests import get
query = input('Please Enter Your Topic of intrest: ')
first_part = query.replace(" ", "%20")
second_part = query.replace(" ", "+")
results= "0"
num_of_pages = int(input('How many pages do you want scraped? '))
for i in range(num_of_pages):
    results = int(results)
    results += 10
    gsearch_url = "https://www.google.com/search?q={}#q={}%3F&start={}&*".format(first_part, second_part, results)
    sauce = get(gsearch_url)
    soup = bs.BeautifulSoup(sauce.text, 'lxml')
    for url in soup.select('.r a'):
        print(url.get('href'))
Return:
/url?q=http://www.codingdojo.com/blog/9-most-in-demand-programming-languages-of-2016/&sa=U&ved=0ahUKEwja3a21w7fSAhWSZiYKHdLGA9gQFggdMAI&usg=AFQjCNFmDl_1epVQRmDfc4y5MWFeNvrPQg
/url?q=https://fossbytes.com/best-popular-programming-languages-2017/&sa=U&ved=0ahUKEwja3a21w7fSAhWSZiYKHdLGA9gQFgghMAM&usg=AFQjCNEKhYqx1FbKl_Wu-9EoMYd3e9i_Dw
/url?q=http://www.bestprogramminglanguagefor.me/&sa=U&ved=0ahUKEwja3a21w7fSAhWSZiYKHdLGA9gQFggnMAQ&usg=AFQjCNHmbzuLwFo_egaWnbXSOW4p-Fva3g
/url?q=http://www.codingdojo.com/blog/9-most-in-demand-programming-languages-of-2016/&sa=U&ved=0ahUKEwja3a21w7fSAhWSZiYKHdLGA9gQFggyMAU&usg=AFQjCNFmDl_1epVQRmDfc4y5MWFeNvrPQg
etc....
First off, scraping Google's search results breaks their terms of service. So, somewhere on the internets the great Alphabet is wagging a finger and furrowing its brow. Oh yeah, you'll probably get slapped with a captcha at some point too.
Second, and ahem, purely to resolve any remaining academic curiosity: the results you're getting are not caused by BeautifulSoup. It's actually what Google is returning. You can check this by doing a print(soup) and perusing the HTML; you'll notice all your hrefs match exactly what you're printing out.
Why does this look different from what you see in your browser? The magic of JavaScript, which the requests library does not handle, so you're seeing the results without any of the client-side processing.
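(As an aside: if all you want is to recover the destination addresses from the /url?q=... redirect links you are already getting, the standard library can pull them apart. A small sketch, assuming the hrefs keep the format shown in the question:)
from urllib.parse import urlparse, parse_qs

href = "/url?q=http://www.codingdojo.com/blog/9-most-in-demand-programming-languages-of-2016/&sa=U&ved=0ahUKEwja3a21w7fSAhWSZiYKHdLGA9gQFggdMAI&usg=AFQjCNFmDl_1epVQRmDfc4y5MWFeNvrPQg"
# the real target sits in the "q" query parameter of the redirect link
target = parse_qs(urlparse(href).query)["q"][0]
print(target)  # http://www.codingdojo.com/blog/9-most-in-demand-programming-languages-of-2016/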
You're looking for this:
# calls for ".yuRUbf a" css selector and grabs "href" attribute (link)
soup.select_one('.yuRUbf a')['href']
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "samurai cop what does katana mean",  # query
    "gl": "us",  # country to search from
    "hl": "en"   # language
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(title, link, sep='\n')
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data from the structured JSON, rather than figuring out why things don't work and then maintaining the scraper over time as selectors change.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "samurai cop what does katana mean",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['title'])
    print(result['link'])
Disclaimer, I work for SerpApi.

Retrieving URL using RCurl gives different date format than in browser

I am attempting to scrape a mobile-formatted webpage using RCurl, at the following URL:
http://m.fire.tas.gov.au/?pageId=incidentDetails&closed_incident_no=161685
Using this code:
library(RCurl)
options( RCurlOptions = list(verbose = TRUE, useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13"))
inurl <- getURL("http://m.fire.tas.gov.au/?pageId=incidentDetails&closed_incident_no=161685")
Note that I have attempted to set the user agent to look like a Chrome browser; the results I get are the same with or without doing this. When I view the URL in Chrome, the dates come out in a day-month-year format with a time stamp as well, and the HTML source matches that:
Last Updated: 24-Aug-2009 11:36<br>
First Reported: 24-Aug-2009 11:24<br>
But within R, after I've retrieved the data from the URL, the dates are formatted like this:
Last Updated: 2009-08-24<br>
First Reported: 2009-08-24<br>
Any ideas what's going on here? I figure the server is responding to the browser/Curl's user-agent or region or language or something similar, and returning different data, but can't figure out what I need to set in RCurl's options to change this.
Looks like the server is expecting 'Accept-Language' header:
library(RCurl)
getURL("http://m.fire.tas.gov.au/?pageId=incidentDetails&closed_incident_no=161685",
httpheader = c("Accept-Language" = "en-US,en;q=0.5"))
works for me (it returns First Reported: 24-Aug-2009 11:24<br> etc.). I discovered this by using the HttpFox Firefox plugin.
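For anyone doing the same thing from Python, the equivalent request with the requests library looks like this (a sketch; it assumes the server keys the date format off the Accept-Language header in the same way):
import requests

url = "http://m.fire.tas.gov.au/?pageId=incidentDetails&closed_incident_no=161685"
resp = requests.get(url, headers={"Accept-Language": "en-US,en;q=0.5"})
print(resp.text)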
