Regex issue using Web Scraper extension of Chrome - web-scraping

I'm trying to scrape ads from realestate.com.au, an Australian platform that lists houses for sale. I'm using the Google Chrome extension Web Scraper, and everything is pretty intuitive.
However, what I'm most interested in is the number of bedrooms of each house for sale. When I select the number of bedrooms, the regex that shows up is [aria-label='5 bedrooms'] p
So when I finish scraping, the column "bedrooms_n" of my Excel file contains only the values "null" and "5": the scraper picked up every house with 5 bedrooms, and every house with a different number of bedrooms was scraped as "null".
I tried different regex options such as [aria-label=['1 bedrooms', '2 bedrooms','3 bedrooms', '4 bedrooms','5 bedrooms', '6 bedrooms']] p
But none of them works and I couldn't find a solution online. Is anyone familiar with the Chrome extension Web Scraper?
IMG: the part of the ad I'm trying to scrape: #n of bedrooms
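A note on the selector: [aria-label='5 bedrooms'] p is a CSS attribute selector rather than a regex, and = matches only that exact value, which would explain the "5" and "null" results. CSS has no list syntax for attribute values, but it does have substring operators: *= (contains), ^= (starts with) and $= (ends with). Something like [aria-label*='bedroom'] p should therefore match any bedroom count, assuming the site labels every count the same way. The sketch below shows the same "match any count" idea in Python with only the standard library. The HTML is made up for illustration and is not the real page markup.

```python
from html.parser import HTMLParser
import re

# Hypothetical markup, loosely modelled on the selector quoted in the question.
SAMPLE_HTML = """
<div class="listing"><div aria-label="5 bedrooms"><p>5</p></div></div>
<div class="listing"><div aria-label="3 bedrooms"><p>3</p></div></div>
<div class="listing"><div aria-label="1 bedroom"><p>1</p></div></div>
<div class="listing"><div aria-label="2 bathrooms"><p>2</p></div></div>
"""

# Accept any count, singular or plural, instead of the literal "5 bedrooms".
BEDROOM_LABEL = re.compile(r"^(\d+) bedrooms?$")


class BedroomParser(HTMLParser):
    """Collects the count from every aria-label of the form 'N bedroom(s)'."""

    def __init__(self):
        super().__init__()
        self.bedrooms = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        label = dict(attrs).get("aria-label") or ""
        match = BEDROOM_LABEL.match(label)
        if match:
            self.bedrooms.append(int(match.group(1)))


parser = BedroomParser()
parser.feed(SAMPLE_HTML)
bedrooms = parser.bedrooms
print(bedrooms)
```

The bathrooms label is skipped because it does not match the pattern, which is the behaviour you would want from the substring selector as well.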

Related

Web scraping : Extracting of papers links

I would like to collect political articles from this newspaper website: https://www.seneweb.com/news/politique/ . There is no way to get the links to the older articles. The last one that shows up is from 2019, but the website goes back further than that.
There isn't an option to load more. I have also watched the API calls, but I didn't find anything.
Does anyone have other ideas?
url <- "https://www.seneweb.com/news/politique/"
newgrel <- "/news/Politique/"
link <- Rcrawler::LinkExtractor(url, urlbotfiler= FALSE, urlregexfilter=newgrel)$InternalLinks

London Stock Exchange Company News, recent changes to website disrupt acquisition of RNS from website , need new way to get the news

Recently the London Stock Exchange website has changed.
It was possible to get the links to the RNS news for each company by parsing the html on, for example,
'https://www.londonstockexchange.com/news?tab=news-explorer&sources=RNS&period=lastweek'
and looking for company tickers such as SHEL or BDEV, or other indicators of interest in the HTML, e.g. newsitem. From there one could extract the link to the RNS (regulatory news item) from the HTML and then download the news item for further examination.
Now this is no longer possible: the data is blocked, and company tickers and the like do not appear in the source.
The RNS news is essential for investors, and whether they are large or small there should be equal access. Some days there are a great number of RNS, and only by downloading them is it possible for the small investor to scan them for news items relevant to their investing strategy in the hour before the market opens.
Can anyone help with a method to regain access to RNS news?
P.S. If I haven't put this question in the correct place, or if there is something wrong with it please tell me as I haven't written many questions before.
I had a look at that URL, and I'm not really sure how you will move forward on this one, tbh. I expect you have some sort of list of articles that could be used in some sort of scraping scaffold. Nonetheless, given the URL you provided, this is one way you could go about it: check Dev Tools, Network tab, and see if any XHR calls are being made to some API; if you find one, you can scrape that API endpoint, like below:
import requests
from bs4 import BeautifulSoup
url = 'https://api.londonstockexchange.com/api/v1/pages?path=news-article&parameters=newsId%253D15574524'
r = requests.get(url)
print(r.json())
This returns a rather large JSON object, which you can dissect to get the information you need. For example:
html = r.json()['components'][1]['content'][0]['value']['body']
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('title').get_text(strip=True))
print(soup.select_one('body').get_text(strip=True))
As long as you have a list of newsIds, you could scrape the info for each one by substituting it into the API endpoint used above. The code above prints in the terminal:
Purchase of Own Ordinary Shares
8 August 2022abrdn Property Income Trust Limited (“the Company”)Legal Entity Identifier (LEI): 549300HHFBWZRKC7RW84PURCHASE OF OWN ORDINARY SHARESOn 5 August 2022 the Company purchased 345,935 Ordinary Shares at a price of 79.07 pence per share. These shares will be held in treasury.Following the transaction, the Company’s issued ordinary share capital comprises:386,018,977 Issued Ordinary shares (excluding treasury....
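To make the "list of newsIds" idea concrete, here is a minimal sketch using only the standard library. It assumes the endpoint keeps the shape of the URL above, where newsId=<id> appears percent-encoded twice. build_news_url, extract_body_html and fetch_news are names made up for illustration, and the API may well require extra headers or change without notice.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.londonstockexchange.com/api/v1/pages"


def build_news_url(news_id):
    """Build the endpoint URL for one newsId, in the same form as the URL above."""
    # First encoding: "newsId=<id>" becomes "newsId%3D<id>".
    inner = urllib.parse.quote(f"newsId={news_id}", safe="")
    # urlencode encodes the "%" again, giving "newsId%253D<id>".
    query = urllib.parse.urlencode({"path": "news-article", "parameters": inner})
    return f"{API_BASE}?{query}"


def extract_body_html(payload):
    """Pull the article HTML out of the JSON, using the path shown above."""
    return payload["components"][1]["content"][0]["value"]["body"]


def fetch_news(news_id):
    """Fetch one news item as parsed JSON (network call, not run here)."""
    with urllib.request.urlopen(build_news_url(news_id), timeout=30) as resp:
        return json.load(resp)


# Only this one ID is known from the example above; extend with your own list.
news_ids = [15574524]
urls = [build_news_url(n) for n in news_ids]
print(urls[0])
```

Calling fetch_news(n) for each ID and passing the result to extract_body_html would then give the HTML to hand to BeautifulSoup, as in the snippet above.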

Scraping titles and subtitles from different pages

I'm a student attending university lessons from home. My teacher just gave me this job: collect all the titles and subtitles from an online Italian journal that include the words "coronavirus" and/or "covid 19" in a certain time period (from the 22nd to the 29th of January, plus just the 1st and the 8th of April), and transcribe them to an Excel file to analyze the words used.
I searched online and assumed that this could be considered scraping, and that made my day, considering that I should find something like 100-150 titles plus subtitles and I have a very short deadline. Unfortunately, I am also a beginner at this, and all I could do by myself was find a way to collect just the titles from the webpage. I'm using, like a beginner is supposed to, Data Miner with Google Chrome.
In practice, I need to find all titles and subtitles on the website "La Gazzetta dello Sport" (link below) that contain the words "coronavirus" and/or "covid 19", but there is a problem: I can see just the titles on the search page, and to get the subtitles I have to click on each article and go to another page. Is there a way to obtain all the results with Data Miner, or should I use another program?
So, to keep it simple: I can't figure out how to make Data Miner collect the title from the search page, click it to go to the article page, collect the subtitle, go back to the search page, and repeat for the next title and subtitle. I don't know if this is possible or just sci-fi. Like I said, I'm a total newbie at this and it's my first time using this kind of tool.
Url: https://www.gazzetta.it/nuovaricerca/home.shtml?q=coronavirus&dateFrom=2020-01-22&dateTo=2020-01-29
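The loop described above (take each title and link from the search page, open the article, read the subtitle, move on) can be sketched in Python. This is only an illustration under loud assumptions: the pages are stand-in strings, the tags and class names ("result", "subtitle") are invented, and fetch is a placeholder for a real HTTP request. The real site's markup will differ.

```python
from html.parser import HTMLParser

# Stand-in pages; every tag, class and path here is an assumption.
PAGES = {
    "/search": '<a class="result" href="/a1">Coronavirus: match postponed</a>'
               '<a class="result" href="/a2">Covid 19 and the league</a>',
    "/a1": '<h1>Coronavirus: match postponed</h1>'
           '<h2 class="subtitle">New date to be set</h2>',
    "/a2": '<h1>Covid 19 and the league</h1>'
           '<h2 class="subtitle">Clubs meet on Monday</h2>',
}


class TextByClass(HTMLParser):
    """Collects (href, text) for each element with the given tag and class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag, self.css_class = tag, css_class
        self.items = []
        self._inside = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == self.tag and attrs.get("class") == self.css_class:
            self._inside, self._href, self._text = True, attrs.get("href"), []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._inside and tag == self.tag:
            self.items.append((self._href, "".join(self._text).strip()))
            self._inside = False


def collect(html, tag, css_class):
    parser = TextByClass(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.items


def fetch(path):
    """Placeholder for a real HTTP request."""
    return PAGES[path]


rows = []
for href, title in collect(fetch("/search"), "a", "result"):
    # Follow the link from the search page to the article page.
    subtitles = collect(fetch(href), "h2", "subtitle")
    rows.append((title, subtitles[0][1] if subtitles else ""))

for row in rows:
    print(row)
```

The rows could then be written out with Python's csv module and opened in Excel.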

Problems with showing rating stars in google search results - Wordpress site

I have a problem with displaying rating stars in Google search results.
Here, as an example, are links to two pages (driving schools in Denmark):
1. https://www.koereskoleoplysningen.dk/listing/skrivers-koreskole-allerod/ (rating stars not working)
2. https://www.koereskoleoplysningen.dk/listing/trekantens-koreskole-vejle/ (rating stars working).
I checked whether the star rating shows up (using Google search) by simply searching for:
1. koereskole oplysningen Skrivers Køreskole – Allerød
2. koereskole oplysningen Trekantens Køreskole – Vejle
The first school shows only the link, whereas the second one correctly displays the overall rating with the total number of ratings. I used https://search.google.com/structured-data/testing-tool to check whether both pages contain an AggregateRating field (it holds all the information about ratings), and both pages return the same correct results (both contain the AggregateRating field).
This issue also occurs for other pages (I have more than 1000 listings; some of them, like the examples above, show rating stars properly and some do not).
I validated the sitemaps, website rankers give me the highest scores, and Yoast SEO is configured and didn't return any errors either.
Well, from our analysis there are three factors at play:
1 Trusted review sites
2 Schema markup
3 Site authority
Source

google maps API for a place how many people have made reviews and rating?

I made an app that uses the Google Maps API. When you make a request for a place, the API returns the last 5 reviews, each with its reviews.rating, plus an overall rating. How many reviews is this overall rating calculated from? Do you know how I can get this information?
I calculated the average of the 5 reviews.rating values, and it does not match the overall rating. So how can I find out how many reviews this average is based on? Thanks
Edit: in this question (4 years ago): how to get total number of reviews from google reviews, I tried the user_ratings_total solution, but it doesn't work
Edit 2: is it possible that nobody knows?
It is now possible to get the total number of reviews using a Place Details call of the Places API: https://developers.google.com/places/web-service/details#fields
As of Jan 2019, it returns the user_ratings_total field: https://developers.google.com/maps/documentation/javascript/releases#335
which contains the total number of reviews.
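As a rough sketch of such a Place Details request in Python (standard library only): the place ID, API key and sample response are placeholders made up for illustration, and build_details_url and review_summary are hypothetical helper names. It assumes the user_ratings_total field is available for your API version and key.

```python
import json
import urllib.parse

DETAILS_ENDPOINT = "https://maps.googleapis.com/maps/api/place/details/json"


def build_details_url(place_id, api_key):
    """Build a Place Details URL that asks only for the rating fields."""
    params = {
        "place_id": place_id,
        "fields": "name,rating,user_ratings_total",
        "key": api_key,
    }
    return f"{DETAILS_ENDPOINT}?{urllib.parse.urlencode(params)}"


def review_summary(response_text):
    """Return (rating, user_ratings_total) from a Place Details JSON response."""
    result = json.loads(response_text).get("result", {})
    return result.get("rating"), result.get("user_ratings_total")


# Made-up response body, shaped like a Place Details reply.
SAMPLE_RESPONSE = (
    '{"result": {"name": "Example Cafe", "rating": 4.3, '
    '"user_ratings_total": 128}, "status": "OK"}'
)

url = build_details_url("PLACE_ID_HERE", "YOUR_API_KEY")
rating, total = review_summary(SAMPLE_RESPONSE)
print(url)
print(rating, total)
```

If the field is missing from the response, review_summary returns None for the total rather than raising, so you can tell "not provided" apart from zero reviews.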
If this isn't a long term project, give my API a shot:
http://reviewsmaker.com/api/google/?business=mumbai%20cafe&api_key=4a2819f3-2874-4eee-9c46-baa7fa17971c
You can just swap the business name. I built it for US businesses, though by the looks of your images you seem to be looking at CA. user_ratings_total was indeed removed from Places, but the GMB API still has access to this data; I just tweaked it a little.
Here's a tip on how you can get the data: if you create a custom RSS feed with the URLs for the places (not sure what language you're using), you can parse through the URLs and get the metadata out. Or, if you use Google CSE (Custom Search Engine), the PageMap for the schemas 'review' and 'aggregatedreviews' will be easy to parse through as well. These are just clever workarounds; it sucks that they omit this data from the official API, since it was very useful.
