Web Scraping Google Scholar Author profiles - web-scraping

I have used the scholarly package and its search-by-author-name method on the author names generated in question 3 to get the author profiles, including all the citation information, for all of the professors. I was able to load the data into a final dataframe, with NA values for those who do not have a Google Scholar profile.
However, there is an issue: for approximately 8 authors the citation information does not match what is shown on the Google Scholar website, because the scholarly package is retrieving the citation information of other authors with the same name. I believe I can fix this by using the search_author_id function, but the question is how to get the author_ids of all the professors in the first place.
Any help would be appreciated.
Cheers,
Yash

This solution may not be suitable if you want to stay within the scholarly package; it uses BeautifulSoup instead.
The author ID is located in the href attribute of the <a> tag that wraps the author's name. Here's how to grab the IDs:
# assumes the request has already been made and the `soup` object created from the response
link = soup.select_one('.gs_ai_name a')['href']
# https://stackoverflow.com/a/6633693/15164646
_id = link
# the text "user=" marks where the ID starts, so partition the href into 3 parts around it
id_identifier = 'user='
before_keyword, keyword, after_keyword = _id.partition(id_identifier)
# everything AFTER "user=" is the author ID
author_id = after_keyword
# RlANTZEAAAAJ
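A standard-library alternative (just a sketch, using an example href value) is to parse the query string instead of splitting the text manually:
from urllib.parse import urlparse, parse_qs

link = '/citations?hl=en&user=RlANTZEAAAAJ'   # example href taken from a profile name link
# parse_qs maps each query parameter to a list of its values
author_id = parse_qs(urlparse(link).query)['user'][0]
print(author_id)  # RlANTZEAAAAJ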
Code that goes a bit beyond the scope of your question (full example in the online IDE under the bs4 folder -> get_profiles.py):
from bs4 import BeautifulSoup
import requests, lxml, os

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

proxies = {
    'http': os.getenv('HTTP_PROXY')
}

html = requests.get('https://scholar.google.com/citations?view_op=view_org&hl=en&org=9834965952280547731', headers=headers, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')

for result in soup.select('.gs_ai_chpr'):
    name = result.select_one('.gs_ai_name a').text
    link = result.select_one('.gs_ai_name a')['href']

    # https://stackoverflow.com/a/6633693/15164646
    _id = link
    id_identifier = 'user='
    before_keyword, keyword, after_keyword = _id.partition(id_identifier)
    author_id = after_keyword

    affiliations = result.select_one('.gs_ai_aff').text
    email = result.select_one('.gs_ai_eml').text

    # not every profile lists an interest, so fall back to None
    try:
        interests = result.select_one('.gs_ai_one_int').text
    except AttributeError:
        interests = None

    cited_by = result.select_one('.gs_ai_cby').text.split(' ')[2]

    print(f'{name}\nhttps://scholar.google.com{link}\n{author_id}\n{affiliations}\n{email}\n{interests}\n{cited_by}\n')
Output:
Jeong-Won Lee
https://scholar.google.com/citations?hl=en&user=D41VK7AAAAAJ
D41VK7AAAAAJ
Samsung Medical Center
Verified email at samsung.com
Gynecologic oncology
107516
Alternatively, you can do the same thing with the Google Scholar Profiles API from SerpApi, without having to think about how to solve CAPTCHAs, find proxies, or maintain the parser over time.
It's a paid API with a free plan.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    "api_key": os.getenv("API_KEY"),      # your serpapi API key
    "engine": "google_scholar_profiles",  # search engine
    "mauthors": "samsung"                 # search query
}

search = GoogleSearch(params)  # where data extraction happens
results = search.get_dict()    # JSON -> Python dictionary

for result in results.get('profiles'):
    name = result.get('name')
    email = result.get('email')
    author_id = result.get('author_id')
    affiliation = result.get('affiliations')
    cited_by = result.get('cited_by')
    interests = result['interests'][0]['title']
    interests_link = result['interests'][0]['link']

    print(f'{name}\n{email}\n{author_id}\n{affiliation}\n{cited_by}\n{interests}\n{interests_link}\n')
Part of the output:
Jeong-Won Lee
Verified email at samsung.com
D41VK7AAAAAJ
Samsung Medical Center
107516
Gynecologic oncology
https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:gynecologic_oncology
Disclaimer, I work for SerpApi.
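As a follow-up to the original question: once you have an ID (scraped as above or via SerpApi), you can pass it back to the scholarly package's search_author_id function mentioned in the question. A rough sketch, assuming a recent scholarly release that exposes search_author_id and fill (check the version you have installed):
from scholarly import scholarly

author_id = 'D41VK7AAAAAJ'                      # an ID obtained as shown above
author = scholarly.search_author_id(author_id)  # look the profile up by its unique ID
author = scholarly.fill(author)                 # populate the profile's citation data
print(author.get('name'), author.get('citedby'))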

Related

web scraping of concurrent review pages

How can I scrape consecutive web pages of customer reviews in Python, whether the pages follow A) a regular order or B) an irregular order? Let me explain:
In this link, pageNumber=2 means the second page of reviews:
https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2
When I click the next button, the link becomes '.....pageNumber=3..' and so on. Sometimes I can find the last page, sometimes not.
In any case, I want to write code that covers all the pages instead of generating every URL and pasting them into my Jupyter notebook.
My code is below (the number of URLs has been reduced):
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urlencode
import csv

# Define a list of URL's that will be scraped.
list_of_urls = ['https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=1',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=3',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=4',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=5',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=6',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=7',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=8',
                'https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=9'
                ]

# Retrieve each of the url's HTML data and convert the data into a beautiful soup object.
# Find, extract and store reviewer names and review text into a list.
names = []
reviews = []
data_string = ""

for url in list_of_urls:
    params = {'api_key': "f00ffd18cb3cb9e64c315b9aa54e29f3", 'url': url}
    response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
    soup = BeautifulSoup(response.text, 'html.parser')

    for item in soup.find_all("span", class_="a-profile-name"):
        data_string = data_string + item.get_text()
        names.append(data_string)
        data_string = ""

    for item in soup.find_all("span", {"data-hook": "review-body"}):
        data_string = data_string + item.get_text()
        reviews.append(data_string)
        data_string = ""

# Create the dictionary.
reviews_dict = {'Reviewer Name': names, 'Reviews': reviews}

# Print the lengths of each list.
print(len(names), len(reviews))

# Create a new dataframe.
df = pd.DataFrame.from_dict(reviews_dict, orient='index')
df.head()

# Delete all the columns that have missing values.
df.dropna(axis=1, inplace=True)
df.head()

# Transpose the dataframe.
prod_reviews = df.T
print(prod_reviews.head(10))

# Remove special characters from review text.
prod_reviews['Reviews'] = prod_reviews['Reviews'].astype(str)
prod_reviews.head(5)

# Convert dataframe to CSV file.
prod_reviews.to_csv('Review.csv', index=False, header=True)
So the list of URLs to be scraped runs to hundreds of pages. I want to shorten it; I don't want to paste every URL by hand. How can I do this?
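As a sketch of the idea (untested against Amazon's rate limiting): since only the pageNumber value changes between pages, the URL list can be generated in a loop instead of being pasted by hand, up to however many pages you want to cover:
base_url = ('https://www.amazon.com/NOUHAUS-Ergo-Flip-Computer-Chair/product-reviews/'
            'B07SG3FK4W/ref=cm_cr_arp_d_paging_btm_next_2'
            '?ie=UTF8&reviewerType=all_reviews&pageNumber={}')
number_of_pages = 9  # raise this, or break out of the scraping loop once a page returns no reviews
list_of_urls = [base_url.format(page) for page in range(1, number_of_pages + 1)]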

Find sub class of a class and return list of elements

I intend to scrape certain countries listed under Chapter 4 of a webpage and return them as a list. The challenge is that I cannot retrieve the tag I need using the approach below:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

reqUS = Request('https://www.state.gov/reports/country-reports-on-terrorism-2019/', headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'})
US = urlopen(reqUS).read()
# print(US)

# Create a soup object
soup = BeautifulSoup(US, 'html.parser')

# find class "floated-right well"
#Terrorist_list = soup.find_all(attrs={"class": "report__section-title"})
Chapter4 = soup.find('h2', class_="report__section-title", id="report-toc__section-7")
#print(Chapter4)

# Give location where text is stored which you wish to alter
unordered_list = soup.find("h2", {"id": "report-toc__section-7"})
print(unordered_list)
You could use #report-toc__section-7 as an anchor, with "TERRORIST SAFE HAVENS" as the start point and "COUNTERING TERRORISM ON THE ECONOMIC FRONT" as the endpoint. Pass those strings to :-soup-contains in a CSS selector to keep only the p tags between them that have a child strong tag (using :has). You also need :not to drop the p tags with child strong tags from the endpoint onwards. From that filtered set, pull out the child strong tags, which hold the localities and countries.
Loop over the returned list and test whether each strong text is all uppercase; if so, it is a locality and becomes a key of a dictionary, to which you append the following strong values as countries, repeating as you encounter each new locality. You can then pull out specific countries by locality.
For older bs4 versions, replace :-soup-contains with :contains.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get(
    'https://www.state.gov/reports/country-reports-on-terrorism-2019/')
soup = bs(r.content, 'lxml')

items = soup.select('section:has(>#report-toc__section-7) p:has(strong:-soup-contains("TERRORIST SAFE HAVENS")) ~ p:has(strong):not(p:has(strong:-soup-contains("COUNTERING TERRORISM ON THE ECONOMIC FRONT")), p:has(strong:-soup-contains("COUNTERING TERRORISM ON THE ECONOMIC FRONT")) ~ p) > strong')

d = {}

for i in items:
    if i.text.isupper():
        key = i.text
        d[key] = []
    else:
        value = i.text.strip()
        if value:
            d[key].append(value)

print(d)
This prints a dictionary of locality headings mapped to their lists of countries.
Read more about css selectors here: https://developer.mozilla.org/en-US/docs/Web/CSS/Pseudo-classes
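For instance, to pull out the countries recorded under one locality heading (the exact uppercase keys depend on the report's wording, so "AFRICA" below is only illustrative):
print(list(d.keys()))        # the locality headings that were captured
print(d.get('AFRICA', []))   # countries listed under that heading, if it exists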

How to scrape data by searching for keyword and selecting dropdown using selenium in python?

I am trying to scrape information with Selenium in Python from the website below, which has search bars and dropdown menus. I want to scrape the results (name, address, phone number) of the clinics in a specific region: for example, entering "Frankfurt, Germany" in the "Ihr Standort" search bar and selecting the "Hausärzte" option in the Allgemeinmedizin dropdown menu. I am able to print the results when using the search bar keywords "Frankfurt, Germany", but I am unable to write the code to select an option from the dropdown menu.
Can anyone help me include the code to select the "Hausärzte" option from the Allgemeinmedizin dropdown and extract the results (name, address, phone number) of the clinics?
Website:
https://www.kvwl.de/earzt/index.htm
Code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from shutil import which
chrome_options = Options()
chrome_options.add_argument("--headless")
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(executable_path=PATH, options=chrome_options)
driver.get("https://www.kvwl.de/earzt/index.htm")
print(driver.title)
search_input = driver.find_element_by_id("doc-search-search-location")
search_input.send_keys("Bielefeld, Germany")
search_input.send_keys(Keys.ENTER)
print(driver.page_source)
driver.close()
Ok, I did a write-up for you using QHarr's suggestion to use the API. The API takes latitude/longitude input, so let's use geopy to retrieve those from the place name. We can then pass them, together with the code for Hausärzte, in the POST request to the website's API and load the response as JSON with json.loads. I'm not sure how you want to process the data, so for convenience I've loaded it into a pandas dataframe. The dataframe then runs a function on the Id column that passes each Id in a second API request to retrieve the details for that specific ID and concatenates them to the dataframe.
from geopy.geocoders import Nominatim
import requests
import pandas as pd
import json
import time
location = "München, Germany"
Fachgebiet = '12001_SID' # This code is for Hausärzte, look up other codes here https://www.kvwl.de/DocSearchService/DocSearchService/getExpertiseAreaStructure
geolocator = Nominatim(user_agent="KVWL_retrieval")
location = geolocator.geocode(location)
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",'content-type': 'application/json; charset=UTF-8'}
data = '{"Latitude":' + str(location.latitude) + ',"Longitude":' + str(location.longitude) + ',"DocGender":"","DocNamePattern":"","ExpertiseAreaStructureId":"' + Fachgebiet + '","ApplicableQualificationId":"","SpecialServiceId":"","LanguageId":"","BarrierFreeAttributeFilter":{"ids":[]},"PageId":0,"PageSize":100}'
response = requests.post('https://www.kvwl.de/DocSearchService/DocSearchService/searchDocs', headers=headers, data=data)
r = json.loads(response.content)
df = pd.json_normalize(r['DoctorAbstracts']['DoctorAbstract'])
def get_doctor(id_nr):
data = '{"Id":"' + id_nr + '"}'
response = requests.post('https://www.kvwl.de/DocSearchService/DocSearchService/getDoctor', headers=headers, data=data)
r = json.loads(response.content)
time.sleep(2) # don't overload the site
return pd.json_normalize(r)
df.join(df.apply(lambda x: pd.Series(get_doctor(x.Id).to_dict()), 1), rsuffix='_right')
The dataframe can be explored with df.head() or exported to csv or excel with df.to_csv('filename.csv') or df.to_excel('filename.xlsx').
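If you would still rather drive the dropdown through Selenium, and the menu turns out to be a native <select> element (an assumption you would need to verify in the page source; the element ID below is a placeholder), Selenium's Select helper can pick an option by its visible text:
from selenium.webdriver.support.ui import Select

# "doc-search-expertise" is a hypothetical ID; inspect the page to find the real one
dropdown = Select(driver.find_element_by_id("doc-search-expertise"))
dropdown.select_by_visible_text("Hausärzte")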

How do I find the complete list of url-paths within a website for scraping?

Is there a way I can use python to see the complete list of url-paths for a website I am scraping?
The structure of the url doesn't change just the paths:
https://www.broadsheet.com.au/{city}/guides/best-cafes-{area}
Right now I have a function that lets me define {city} and {area} using an f-string literal, but I have to do this manually; for example, city = melbourne and area = fitzroy.
I'd like the function to iterate through all the available paths for me, but I need to work out how to get the complete list of paths first.
Is there a way a scraper can do it?
You can parse the sitemap for the required URLs, for example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.broadsheet.com.au/sitemap'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for loc in soup.select('loc'):
    if not loc.text.strip().endswith('/guide'):
        continue
    soup2 = BeautifulSoup(requests.get(loc.text).content, 'html.parser')
    for loc2 in soup2.select('loc'):
        if '/best-cafes-' in loc2.text:
            print(loc2.text)
Prints:
https://www.broadsheet.com.au/melbourne/guides/best-cafes-st-kilda
https://www.broadsheet.com.au/melbourne/guides/best-cafes-fitzroy
https://www.broadsheet.com.au/melbourne/guides/best-cafes-balaclava
https://www.broadsheet.com.au/melbourne/guides/best-cafes-preston
https://www.broadsheet.com.au/melbourne/guides/best-cafes-seddon
https://www.broadsheet.com.au/melbourne/guides/best-cafes-northcote
https://www.broadsheet.com.au/melbourne/guides/best-cafes-fairfield
https://www.broadsheet.com.au/melbourne/guides/best-cafes-ascot-vale
https://www.broadsheet.com.au/melbourne/guides/best-cafes-west-melbourne
https://www.broadsheet.com.au/melbourne/guides/best-cafes-flemington
https://www.broadsheet.com.au/melbourne/guides/best-cafes-windsor
https://www.broadsheet.com.au/melbourne/guides/best-cafes-kensington
https://www.broadsheet.com.au/melbourne/guides/best-cafes-prahran
https://www.broadsheet.com.au/melbourne/guides/best-cafes-essendon
https://www.broadsheet.com.au/melbourne/guides/best-cafes-pascoe-vale
https://www.broadsheet.com.au/melbourne/guides/best-cafes-albert-park
https://www.broadsheet.com.au/melbourne/guides/best-cafes-port-melbourne
https://www.broadsheet.com.au/melbourne/guides/best-cafes-armadale
https://www.broadsheet.com.au/melbourne/guides/best-cafes-brighton
https://www.broadsheet.com.au/melbourne/guides/best-cafes-malvern
https://www.broadsheet.com.au/melbourne/guides/best-cafes-malvern-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-glen-iris
https://www.broadsheet.com.au/melbourne/guides/best-cafes-camberwell
https://www.broadsheet.com.au/melbourne/guides/best-cafes-hawthorn-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-brunswick-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-bentleigh
https://www.broadsheet.com.au/melbourne/guides/best-cafes-coburg
https://www.broadsheet.com.au/melbourne/guides/best-cafes-richmond
https://www.broadsheet.com.au/melbourne/guides/best-cafes-bentleigh-east
https://www.broadsheet.com.au/melbourne/guides/best-cafes-collingwood
https://www.broadsheet.com.au/melbourne/guides/best-cafes-elwood
https://www.broadsheet.com.au/melbourne/guides/best-cafes-abbotsford
https://www.broadsheet.com.au/melbourne/guides/best-cafes-south-yarra
https://www.broadsheet.com.au/melbourne/guides/best-cafes-yarraville
https://www.broadsheet.com.au/melbourne/guides/best-cafes-thornbury
https://www.broadsheet.com.au/melbourne/guides/best-cafes-west-footscray
https://www.broadsheet.com.au/melbourne/guides/best-cafes-footscray
https://www.broadsheet.com.au/melbourne/guides/best-cafes-south-melbourne
https://www.broadsheet.com.au/melbourne/guides/best-cafes-hawthorn
https://www.broadsheet.com.au/melbourne/guides/best-cafes-carlton-north
https://www.broadsheet.com.au/melbourne/guides/best-cafes-brunswick
https://www.broadsheet.com.au/melbourne/guides/best-cafes-carlton
https://www.broadsheet.com.au/melbourne/guides/best-cafes-elsternwick
https://www.broadsheet.com.au/sydney/guides/best-cafes-bronte
https://www.broadsheet.com.au/sydney/guides/best-cafes-coogee
https://www.broadsheet.com.au/sydney/guides/best-cafes-rosebery
https://www.broadsheet.com.au/sydney/guides/best-cafes-ultimo
https://www.broadsheet.com.au/sydney/guides/best-cafes-enmore
https://www.broadsheet.com.au/sydney/guides/best-cafes-dulwich-hill
https://www.broadsheet.com.au/sydney/guides/best-cafes-leichhardt
https://www.broadsheet.com.au/sydney/guides/best-cafes-glebe
https://www.broadsheet.com.au/sydney/guides/best-cafes-annandale
https://www.broadsheet.com.au/sydney/guides/best-cafes-rozelle
https://www.broadsheet.com.au/sydney/guides/best-cafes-paddington
https://www.broadsheet.com.au/sydney/guides/best-cafes-balmain
https://www.broadsheet.com.au/sydney/guides/best-cafes-erskineville
https://www.broadsheet.com.au/sydney/guides/best-cafes-willoughby
https://www.broadsheet.com.au/sydney/guides/best-cafes-bondi-junction
https://www.broadsheet.com.au/sydney/guides/best-cafes-north-sydney
https://www.broadsheet.com.au/sydney/guides/best-cafes-bondi
https://www.broadsheet.com.au/sydney/guides/best-cafes-potts-point
https://www.broadsheet.com.au/sydney/guides/best-cafes-mosman
https://www.broadsheet.com.au/sydney/guides/best-cafes-alexandria
https://www.broadsheet.com.au/sydney/guides/best-cafes-crows-nest
https://www.broadsheet.com.au/sydney/guides/best-cafes-manly
https://www.broadsheet.com.au/sydney/guides/best-cafes-woolloomooloo
https://www.broadsheet.com.au/sydney/guides/best-cafes-newtown
https://www.broadsheet.com.au/sydney/guides/best-cafes-vaucluse
https://www.broadsheet.com.au/sydney/guides/best-cafes-chippendale
https://www.broadsheet.com.au/sydney/guides/best-cafes-marrickville
https://www.broadsheet.com.au/sydney/guides/best-cafes-redfern
https://www.broadsheet.com.au/sydney/guides/best-cafes-camperdown
https://www.broadsheet.com.au/sydney/guides/best-cafes-darlinghurst
https://www.broadsheet.com.au/adelaide/guides/best-cafes-goodwood
https://www.broadsheet.com.au/perth/guides/best-cafes-northbridge
https://www.broadsheet.com.au/perth/guides/best-cafes-leederville
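Since each URL follows the {city}/guides/best-cafes-{area} pattern from the question, you could also split each result back into its city and area parts to feed your existing function; a small sketch:
url = 'https://www.broadsheet.com.au/melbourne/guides/best-cafes-fitzroy'  # one of the sitemap results
parts = url.rstrip('/').split('/')
city = parts[3]                              # e.g. 'melbourne'
area = parts[-1].replace('best-cafes-', '')  # e.g. 'fitzroy'
print(city, area)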
You are essentially trying to create a spider, just like search engines do, so why not use one that already exists? It's free for up to 100 daily queries. You will have to set up a Google Custom Search and define a search query.
Get your API key here: https://developers.google.com/custom-search/v1/introduction/?apix=true
Define a new search engine at https://cse.google.com/cse/all using the URL https://www.broadsheet.com.au/
Click the public URL and copy the part starting at cx=123456:abcdef
Place your API key and the cx part in the Google URL below.
Adjust the query below to get results for different cities. I set it up to find results for Melbourne, but you can easily put a placeholder there and format the string.
import requests

# replace {your_custom_search_key} and {your_custom_search_id} with the values from your Custom Search setup
google = 'https://www.googleapis.com/customsearch/v1?key={your_custom_search_key}&cx={your_custom_search_id}&q=site:https://www.broadsheet.com.au/melbourne/guides/best+%22best+cafes+in%22+%22melbourne%22&start={}'

results = []

with requests.Session() as session:
    start = 1
    while True:
        result = session.get(google.format(start)).json()
        results += result['items']  # collect this page's items before checking for a next page
        if 'nextPage' in result['queries'].keys():
            start = result['queries']['nextPage'][0]['startIndex']
            print(start)
        else:
            break
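Each item returned by the Custom Search JSON API carries the matched URL in its link field, so once the loop finishes you can reduce results to just the guide URLs, for example:
guide_urls = [item['link'] for item in results]
print(len(guide_urls), guide_urls[:5])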

How to loop in dropdown menu in Aspx dynamic websites using python requests and BeautifulSoup and scrape data

Before asking this question I read the posts "request using python to asp.net page" and "Data Scraping, aspx", and I found most of what I was looking for, but there are some minor items still to solve.
I want to scrape the website http://up-rera.in/, which is a dynamic ASPX website. Inspecting the page shows that it loads its data from a different link: http://upreraportal.cloudapp.net/View_projects.aspx
How can I loop over all the dropdown options and click Search to get the page content? For example, I am able to scrape Agra and get its page details.
Since I am still in the learning phase, I am avoiding Selenium for getting the page details.
Here is my code:
import requests
from bs4 import BeautifulSoup
import os
import time
import csv
final_data = []
url = "http://upreraportal.cloudapp.net/View_projects.aspx"
headers= {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Content-Type':'application/x-www-form-urlencoded',
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
formfields={'__VIEWSTATE':'9VAv5iAKM/uLKHgQ6U91ShYmoKdKfrPqrxB2y86PhSY8pOPAulcgfrsPDINzwmvXGr+vdlE7FT6eBQCKtAFsJPQ9gQ9JIBTBCGCIjYwFuixL3vz6Q7R0OZTH2cwYmyfPHLOqxh8JbLDfyKW3r3e2UgP5N/4pI1k6DNoNAcEmNYGPzGwFHgUdJz3LYfYuFDZSydsVrwSB5CHAy/pErTJVDMmOackTy1q6Y+TNw7Cnq2imnKnBc70eldJn0gH/rtkrlPMS+WP3CXke6G7nLOzaUVIlnbHVoA232CPRcWuP1ykPjSfX12hAao6srrFMx5GUicO3Dvpir+z0U1BDEjux86Cu5/aFML2Go+3k9iHiaS3+WK/tNNui5vNAbQcPiZrnQy9wotJnw18bfHZzU/77uy22vaC+8vX1cmomiV70Ar33szSWTQjbrByyhbFbz9PHd3IVebHPlPGpdaUPxju5xkFQIJRnojsOARjc76WzTYCf479BiXUKNKflMFmr3Fp5S3BOdKFLBie1fBDgwaXX4PepOeZVm1ftY0YA4y8ObPxkJBcGh5YLxZ4vJr2z3pd8LT2i/2fyXJ9aXR9+SJzlWziu9bV8txiuJHSQNojr10mQv8MSCUAKUjT/fip8F3UE9l+zeQBOC++LEeQiTurHZD0GkNix8zQAHbNpGLBfvgocXZd/4KqqnBCLLwBVQobhRbJhbQJXbGYNs6zIXrnkx7CD9PjGKvRx9Eil19Yb5EqRLJQHSg5OdwafD1U+oyZwr3iUMXP/pJw5cTHMsK3X+dH4VkNxsG+KFzBzynKPdF17fQknzqwgmcQOxD6NN6158pi+9cM1UR4R7iwPwuBCOK04UaW3V1A9oWFGvKLls9OXbLq2DS4L3EyuorEHnxO+p8rrGWIS4aXpVVr4TxR3X79j4i8OVHhIUt8H+jo5deRZ6aG13+mXgZQd5Qu1Foo66M4sjUGs7VUcwYCXE/DP/NHToeU0hUi0sJs7+ftRy07U2Be/93TZjJXKIrsTQxxeNfyxQQMwBYZZRPPlH33t3o3gIo0Hx18tzGYj2v0gaBb+xBpx9mU9ytkceBdBPnZI1kJznArLquQQxN3IPjt6+80Vow74wy4Lvp7D+JCThAnQx4K8QbdKMWzCoKR63GTlBwLK2TiYMAVisM77XdrlH6F0g56PlGQt/RMtU0XM1QXgZvWr3KJDV8UTe0z1bj29sdTsHVJwME9eT62JGZFQAD4PoiqYl7nAB61ajAkcmxu0Zlg7+9N9tXbL44QOcY672uOQzRgDITmX6QdWnBqMjgmkIjSo1qo/VpUEzUXaVo5GHUn8ZOWI9xLrJWcOZeFl0ucyKZePMnIxeUU32EK/NY34eE6UfSTUkktkguisYIenZNfoPYehQF9ASL7t4qLiH5jca4FGgZW2kNKb3enjEmoKqbWDFMkc8/1lsk2eTd/GuhcTysVSxtvpDSlR0tjg8A2hVpR67t2rYm8iO/L1m8ImY48=',
'__VIEWSTATEGENERATOR':'4F1A7E70',
'__EVENTVALIDATION':'jVizPhFNJmo9F/GVlIrlMWMsjQe1UKHfYE4jlpTDfXZHWu9yAcpHUvT/1UsRpbgxYwZczJPd6gsvas8ilVSPkfwP1icGgOTXlWfzykkU86LyIEognwkhOfO1+suTK2e598vAjyLXRf555BXMtCO+oWoHcMjbVX2cHKtpBS1GyyqyyVB8IchAAtDEMD3G5bbzhvof6PX4Iwt5Sv1gXkHRKOR333OcYzmSGJvZgLsmo3qQ+5EOUIK5D71x/ZENmubZXvwbU0Ni6922E96RjCLh5cKgFSne5PcRDUeeDuEQhJLyD04K6N45Ow2RKyu7HN1n1YQGFfgAO3nMCsP51i7qEAohXK957z3m/H+FasHWF2u05laAWGVbPwT35utufotpPKi9qWAbCQSw9vW9HrvN01O97scG8HtWxIOnOdI6/nhke44FSpnvY1oPq+BuY2XKrb2404fKl5EPR4sjvNSYy1/8mn6IDH0eXvzoelNMwr/pKtKBESo3BthxTkkx5MR0J42qhgHURB9eUKlsGulAzjF27pyK4vjXxzlOlHG1pRiQm/wzB4om9dJmA27iaD7PJpQGgSwp7cTpbOuQgnwwrwUETxMOxuf3u1P9i+DzJqgKJbQ+pbKqtspwYuIpOR6r7dRh9nER2VXXD7fRfes1q2gQI29PtlbrRQViFM6ZlxqxqoAXVM8sk/RfSAL1LZ6qnlwGit2MvVYnAmBP9wtqcvqGaWjNdWLNsueL6DyUZ4qcLv42fVcOrsi8BPRnzJx0YiOYZ7gg7edHrJwpysSGDR1P/MZIYFEEUYh238e8I2EAeQZM70zHgQRsviD4o5r38VQf/cM9fjFii99E/mZ+6e0mIprhlM/g69MmkSahPQ5o/rhs8IJiM/GibjuZHSNfYiOspQYajMg0WIGeKWnywfaplt6/cqvcEbqt77tIx2Z0yGcXKYGehmhyHTWfaVkMuKbQP5Zw+F9X4Fv5ws76uCZkOxKV3wj3BW7+T2/nWwWMfGT1sD3LtQxiw0zhOXfY1bTB2XfxuL7+k5qE7TZWhKF4EMwLoaML9/yUA0dcXhoZBnSc',
'ctl00$ContentPlaceHolder1$DdlprojectDistrict':'Agra',
'ctl00$ContentPlaceHolder1$txtProject':'',
'ctl00$ContentPlaceHolder1$btnSearch':'Search'}
#here in form details check agra , i am able to scrape one city only,
# how to loop for all cities
r = requests.post(url, data=formfields, headers=headers)
data=r.text
soup = BeautifulSoup(data, "html.parser")
get_list = soup.find_all('option') #gets list of all <option> tag
for element in get_list:
    cities = element["value"]
    #final.append(cities)
    #print(final)

get_details = soup.find_all("table", attrs={"id":"ContentPlaceHolder1_GridView1"})
for details in get_details:
    text = details.find_all("tr")[1:]
    for tds in text:
        td = tds.find_all("td")[1]
        rera = td.find_all("span")
        rnumber = ""
        for num in rera:
            rnumber = num.text
            print(rnumber)
Try the code below. It will give you all the results you are after; only a little tweak was needed. I scraped the different district names from the dropdown menu and used them in a loop so that you get all the data one district at a time. I did nothing else except add a few lines. Your code could have been better if you had wrapped it in a function.
Btw, I've put the two giant strings into two variables so that you don't need to worry about them, which also makes the code a little slimmer.
This is the rectified code:
import requests
from bs4 import BeautifulSoup
url = "http://upreraportal.cloudapp.net/View_projects.aspx"
response = requests.get(url).text
soup = BeautifulSoup(response,"lxml")
VIEWSTATE = soup.select("#__VIEWSTATE")[0]['value']
EVENTVALIDATION = soup.select("#__EVENTVALIDATION")[0]['value']
for title in soup.select("#ContentPlaceHolder1_DdlprojectDistrict [value]")[:-1]:
    search_item = title.text
    # print(search_item)

    headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
               'Content-Type':'application/x-www-form-urlencoded',
               'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

    formfields = {'__VIEWSTATE':VIEWSTATE, # the value extracted above goes in here
                  '__VIEWSTATEGENERATOR':'4F1A7E70',
                  '__EVENTVALIDATION':EVENTVALIDATION, # the value extracted above goes in here
                  'ctl00$ContentPlaceHolder1$DdlprojectDistrict':search_item, # this is where the district name changes on each iteration
                  'ctl00$ContentPlaceHolder1$txtProject':'',
                  'ctl00$ContentPlaceHolder1$btnSearch':'Search'}

    res = requests.post(url, data=formfields, headers=headers).text
    soup = BeautifulSoup(res, "html.parser")

    get_list = soup.find_all('option') # gets list of all <option> tags
    for element in get_list:
        cities = element["value"]
        #final.append(cities)
        #print(final)

    get_details = soup.find_all("table", attrs={"id":"ContentPlaceHolder1_GridView1"})
    for details in get_details:
        text = details.find_all("tr")[1:]
        for tds in text:
            td = tds.find_all("td")[1]
            rera = td.find_all("span")
            rnumber = ""
            for num in rera:
                rnumber = num.text
                print(rnumber)
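The question's original snippet already imports csv and defines final_data, so if you want the registration numbers saved rather than only printed, one possible sketch is to collect a [district, number] row in the innermost loop and write everything out afterwards:
import csv

final_data = []
# inside the innermost loop, in addition to (or instead of) printing:
#     final_data.append([search_item, rnumber])

with open('rera_projects.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['District', 'RERA number'])  # header row
    writer.writerows(final_data)                  # one row per scraped project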

Resources