Scraping a large number of Google Scholar pages by URL - web-scraping

I'm trying to get the full author list of all publications from an author on Google Scholar using BeautifulSoup. Since the author's home page only shows a truncated list of authors for each paper, I have to open each paper's link to get the full list. As a result, I run into a CAPTCHA every few attempts.
Is there a way to avoid the CAPTCHA (e.g. pause for 3 seconds after every request)? Or to make the original Google Scholar profile page show the full author list?

I recently faced a similar issue. I at least eased my collection process with a simple workaround: a random, fairly long sleep between requests, like this:
import time
import numpy as np
time.sleep((30 - 5) * np.random.random() + 5)  # sleep for a random 5 to 30 seconds
If you have enough time (say you launch your parser at night), you can make the pause even longer (3+ times longer) to make it less likely you'll hit a CAPTCHA.
Furthermore, you can randomly change the user agent in your requests to the site, which masks you even more; a sketch of both ideas combined follows below.
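A minimal sketch of that combination, assuming the requests library; the helper name polite_get and the user-agent strings are just illustrative placeholders:
import random
import time

import requests

# Illustrative pool of user-agent strings; swap in whichever browsers you like.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0",
]

def polite_get(url):
    # Random long-ish pause (5 to 30 seconds) plus a random user agent.
    time.sleep(25 * random.random() + 5)
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)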

Related

London Stock Exchange Company News: recent changes to the website disrupt acquisition of RNS from the website, need a new way to get the news

Recently the London Stock Exchange website has changed.
It used to be possible to get the links to the RNS news for each company by parsing the HTML of, for example,
'https://www.londonstockexchange.com/news?tab=news-explorer&sources=RNS&period=lastweek'
and looking for company tickers such as SHEL or BDEV, or other indicators of interest in the HTML, e.g. newsitem. From there you could extract, from the HTML, the link to the RNS (regulatory news item) and download the news item for further examination.
Now this is not possible: the data is blocked, and company tickers and the like do not appear in the source.
RNS news is essential for investors, and whether large or small they should have equal access. Some days there are a great number of RNS items, and only by downloading them can a small investor scan them for items relevant to their investing strategy in the hour before the market opens.
Can anyone help with a method to regain access to RNS news?
P.S. If I haven't put this question in the correct place, or if there is something wrong with it, please tell me, as I haven't written many questions before.
I had a look at that URL, and to be honest I'm not entirely sure how you would move forward on this one. I expected you to have some sort of list of articles that could be used as a scraping scaffold. Nonetheless, given the URL you provided, this is one way you could go about it: check the Dev tools - Network tab and see if any XHR calls are being made to some API; if you find one, you scrape that API endpoint, like below:
import requests
from bs4 import BeautifulSoup
url = 'https://api.londonstockexchange.com/api/v1/pages?path=news-article&parameters=newsId%253D15574524'
r = requests.get(url)
print(r.json())
This returns a rather large JSON object, which you can dissect to get the information you need. For example:
html = r.json()['components'][1]['content'][0]['value']['body']
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('title').get_text(strip=True))
print(soup.select_one('body').get_text(strip=True))
As long as you have a list of newsIds, you could scrape the info for every such newsId by modifying the API endpoint used above (a sketch follows after the sample output). This prints in the terminal:
Purchase of Own Ordinary Shares
8 August 2022abrdn Property Income Trust Limited (“the Company”)Legal Entity Identifier (LEI): 549300HHFBWZRKC7RW84PURCHASE OF OWN ORDINARY SHARESOn 5 August 2022 the Company purchased 345,935 Ordinary Shares at a price of 79.07 pence per share. These shares will be held in treasury.Following the transaction, the Company’s issued ordinary share capital comprises:386,018,977 Issued Ordinary shares (excluding treasury....
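If you do have such a list, a minimal sketch of that loop might look like this; the news_ids list is a placeholder (only the newsId from the URL above is real), and you still need your own way of collecting the IDs:
import requests
from bs4 import BeautifulSoup

news_ids = [15574524]  # placeholder list; extend with the IDs you gather

for news_id in news_ids:
    # Reuse the API endpoint shown above, substituting the newsId.
    url = (
        "https://api.londonstockexchange.com/api/v1/pages"
        f"?path=news-article&parameters=newsId%253D{news_id}"
    )
    r = requests.get(url)
    html = r.json()['components'][1]['content'][0]['value']['body']
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.select_one('title').get_text(strip=True))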

Retrieve a number from each page of a paginated website

I have a list of approx. 36,000 URLs, ranging from https://www.fff.fr/la-vie-des-clubs/1/infos-cles to https://www.fff.fr/la-vie-des-clubs/36179/infos-cles (a few of those pages return 404 errors).
Each of those pages contains a number (the number of teams the soccer club contains). In the HTML file, the number appears as <p class="number">5</p>.
Is there a reasonably simple way to compile an Excel or CSV file with the URL and the associated number of teams as fields?
I've tried looking into PhantomJS, but my method took 10 seconds to open a single web page and I don't really want to spend 100 hours doing this. I was not able to figure out how (or whether it was at all possible) to use scraping tools such as import.io to do this.
Thanks!
For the goal you want to achieve, I can see two solutions:
Code it in Java: Jsoup + any CSV library
In a few minutes, the 36,000+ URLs can be downloaded easily.
Use a tool like Portia from scrapinghub.com
Portia is a WYSIWYG tool that quickly helps you create your project and run it. They offer a free plan which can handle the 36,000+ links.
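The question is not tied to a particular language, so here is a comparable sketch in Python (requests + BeautifulSoup + csv) rather than Jsoup; the p.number selector comes from the markup described in the question, and the output filename is arbitrary:
import csv

import requests
from bs4 import BeautifulSoup

with open('teams.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'number_of_teams'])
    for club_id in range(1, 36180):
        url = f'https://www.fff.fr/la-vie-des-clubs/{club_id}/infos-cles'
        r = requests.get(url)
        if r.status_code == 404:  # a few pages return 404 errors
            continue
        soup = BeautifulSoup(r.text, 'html.parser')
        tag = soup.select_one('p.number')
        if tag is not None:
            writer.writerow([url, tag.get_text(strip=True)])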

Need help web scraping web pages and their links with an automated function in R

I am interested in extracting the data on paranormal activity reported in the news, so that I can analyze the space and time of the appearances for any correlations. This project is just for fun, to learn and use web scraping, text extraction, and spatial and time correlation analysis. So please forgive my choice of topic; I wanted to do something interesting and challenging.
First, I found that this website has a collection of reported paranormal incidents; they have collections for 2009, 2010, 2011 and 2012.
The structure of the website goes like this: for every year they have pages 1..10, and the links go like this.
For year 2009:
link http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm
On each page they have collected the stories under headings like this:
Internal structure
Paranormal Activity, Posted 03-14-09
Each of these headlines has two pages inside it; they go like this:
link http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
On each of these pages they have the actual reported stories collected under various headlines, and links to the original websites for those stories. I am interested in collecting the reported text and extracting information about the kind of paranormal activity (ghost, demon or UFO) and the time, date and place of the incidents. I wish to analyze this data for any spatial and time correlations. If UFOs or ghosts are real, they must show some behavior and correlations in space or time in their movements. That is the long version of the story...
I need help with web scraping the text from the pages described above. Here I have written code that follows one page and its links down to the final text I want. Can anyone let me know whether there is a better and more efficient way to get the clean text from the final page, and how to automate collecting the text by following all 10 pages for the whole of 2009?
library(XML)
#source of paranormal news from about.com
#first page to start
#2009 - http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm
pn.url<-"http://paranormal.about.com/od/paranormalgeneralinfo/tp/2009-paranormal-activity.htm"
pn.html<-htmlTreeParse(pn.url,useInternalNodes=T)
pn.h3=xpathSApply(pn.html,"//h3",xmlValue)
#extracting the links of the headlines to follow to the story
pn.h3.links=xpathSApply(pn.html,"//h3/a/@href")
#Extracted the links of the Internal structure to follow ...
#Paranormal Activity, Posted 01-03-09 (following this head line)
#http://paranormal.about.com/od/paranormalgeneralinfo/a/news_090314n.htm
pn.l1.url<-pn.h3.links[1]
pn.l1.html<-htmlTreeParse(pn.l1.url,useInternalNodes=T)
pn.l1.links=xpathSApply(pn.l1.html,"//p/a/@href")
#Extracted the links of the Internal structure to follow ...
#British couple has 'black-and-white-twins' twice (following this head line)
#http://www.msnbc.msn.com/id/28471626/
pn.l1.f1.url=pn.l1.links[7]
pn.l1.f1.html=htmlTreeParse(pn.l1.f1.url,useInternalNodes=T)
pn.l1.f1.text=xpathSApply(pn.l1.f1.html,"//text()[not(ancestor::script)][not(ancestor::style)]",xmlValue)
I sincerely thank you in advance for reading my post and for your time helping me.
I will be grateful to any expert who would like to mentor me on this whole project.
Regards
Sathish
Try the Scrapy and BeautifulSoup libraries. Despite being Python based, they are considered among the best in the scraping domain. You can use the command line interface to connect the two; for more details about connecting R and Python, have a look here.
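If you follow that suggestion, a minimal BeautifulSoup sketch of the first crawl step could look like the following; it assumes the requests library and the page structure described in the question (h3 headlines whose links lead to the story pages), so adjust the selectors if the markup differs:
import requests
from bs4 import BeautifulSoup

start_url = ('http://paranormal.about.com/od/paranormalgeneralinfo/'
             'tp/2009-paranormal-activity.htm')

# Fetch the 2009 index page and collect the headline links (the h3/a
# structure is taken from the question's R code).
soup = BeautifulSoup(requests.get(start_url).text, 'html.parser')
headline_links = [a['href'] for a in soup.select('h3 a[href]')]

for link in headline_links:
    story_soup = BeautifulSoup(requests.get(link).text, 'html.parser')
    # Plain visible text, roughly equivalent to the XPath text() query in R.
    print(story_soup.get_text(separator=' ', strip=True)[:500])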

Scrapy: How to recrawl a page after some time?

Being lazy, I'm trying to use Scrapy instead of implementing my own scraping service with celery + requests (been there, done that). Let's say I have a list of N pages that I'd like to monitor. After retrieving page X and reading its content, I want to tell the system to rescan it some time later (depending on its content), say once two hours have passed.
Is such a thing possible with Scrapy?
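As far as I know, Scrapy does not ship a "revisit this URL in two hours" scheduler, so one common pattern is to keep a next-crawl time per URL in your own storage and run the spider periodically (e.g. from cron), only requesting URLs that are due. A hedged sketch of that pattern; the sqlite table and column names are assumptions:
import datetime
import sqlite3

import scrapy


class MonitorSpider(scrapy.Spider):
    name = "monitor"

    def start_requests(self):
        # Only request pages whose scheduled recrawl time has passed.
        con = sqlite3.connect("pages.db")  # hypothetical schedule store
        now = datetime.datetime.utcnow().isoformat()
        rows = con.execute(
            "SELECT url FROM schedule WHERE next_crawl <= ?", (now,)
        ).fetchall()
        con.close()
        for (url,) in rows:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # ... read the content and decide, based on it, when to look again ...
        delay_hours = 2
        next_crawl = (
            datetime.datetime.utcnow() + datetime.timedelta(hours=delay_hours)
        ).isoformat()
        con = sqlite3.connect("pages.db")
        con.execute(
            "UPDATE schedule SET next_crawl = ? WHERE url = ?",
            (next_crawl, response.url),
        )
        con.commit()
        con.close()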

Best approach for fetching news from websites?

I have a function which scrapes all the latest news from a website (approximately 10 news items; the exact number depends on the website). Note that the news is in chronological order.
For example, yesterday I got 10 news items and stored them in the database. Today I get 10 news items, but 3 of them were not there yesterday (7 stayed the same, 3 are new).
My current approach is to extract each news item until I find an old one (the first of the 7 unchanged items); then I stop extracting, only update the field "lastUpdateDate" of the old news, and add the new items to the database. I think this approach is somewhat complicated and it takes time.
I'm actually getting news from 20 websites with the same content structure (Moodle), so each request lasts about 2 minutes, which my free host doesn't support.
Would it be better to delete all the news and then extract everything from scratch (this actually inflates the ID numbers in the database by a huge amount)?
First, check to see if the website has a published API. If it has one, use it.
Second, check the website's terms of service, which may specifically and explicitly disallow scraping the website.
Third, look at a module in your programming language of choice that handles both the fetching of the pages and the extraction of the content from the pages. In Perl, you would start with WWW::Mechanize or Web::Scraper.
Whatever you do, don't fall into the trap that so many who post to StackOverflow fall into: fetching the web page and then trying to parse the content themselves, most often with regular expressions, which are an inadequate tool for the job. Browse the SO tag html-parsing for tales of sorrow from those who have tried to roll their own HTML parsing systems instead of using existing tools.
It depends on your requirements, i.e. whether you want to show old news to users or not.
For scraping, you can create a custom local script run as a cron job which will grab the data from those news websites and store it in the database.
You can also check by subject whether an item already exists or not, as in the sketch below.
Finally, make a custom news block which will show the database feed.
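A minimal sketch of that "insert new, update existing" step, combining the questioner's lastUpdateDate idea with the check-by-subject suggestion; the sqlite schema and column names are assumptions:
import datetime
import sqlite3

def store_news(items):
    # items: list of dicts like {"subject": ..., "body": ...} scraped elsewhere.
    now = datetime.datetime.utcnow().isoformat()
    con = sqlite3.connect("news.db")  # hypothetical database
    for item in items:
        row = con.execute(
            "SELECT id FROM news WHERE subject = ?", (item["subject"],)
        ).fetchone()
        if row is None:
            # New item: insert it.
            con.execute(
                "INSERT INTO news (subject, body, lastUpdateDate) VALUES (?, ?, ?)",
                (item["subject"], item["body"], now),
            )
        else:
            # Already known: just refresh lastUpdateDate, as in the question.
            con.execute(
                "UPDATE news SET lastUpdateDate = ? WHERE id = ?", (now, row[0])
            )
    con.commit()
    con.close()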
