I need help web scraping a comments section - web-scraping

I am currently doing a project on football players. I am trying to scrape some public comments on football players for sentiment analysis, but I can't seem to scrape the comments. Any help would be MUCH appreciated. Everything else works; it is only the comments part I can't get. Weirdly enough, I had it working, but then it stopped and I can't get it scraping comments again. The website I am scraping from is: https://sofifa.com/player/192985/kevin-de-bruyne/200025/
likes = []
dislikes = []
follows = []
comments = []

driver_path = '/Users/niallmcnulty/Desktop/GeneralAssembly/Lessons/DSI11-lessons/week05/day2_web_scraping_and_apis/web_scraping/selenium-examples/chromedriver'
driver = webdriver.Chrome(executable_path=driver_path)

# i = 0
for url in tqdm_notebook(urls):
    driver.get(url)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(0.2)
    soup1 = BeautifulSoup(driver.page_source, 'lxml')

    try:
        dislike = soup1.find('button', attrs={'class': 'bp3-button bp3-minimal bp3-intent-danger dislike-btn need-sign-in'}).find('span', {'class': 'count'}).text.strip()
        dislikes.append(dislike)
    except:
        pass
    try:
        like = soup1.find('button', attrs={'class': 'bp3-button bp3-minimal bp3-intent-success like-btn need-sign-in'}).find('span', {'class': 'count'}).text.strip()
        likes.append(like)
    except:
        pass
    try:
        follow = soup1.find('button', attrs={'class': 'bp3-button bp3-minimal follow-btn need-sign-in'}).find('span', {'class': 'count'}).text.strip()
        follows.append(follow)
    except:
        pass
    try:
        comment = soup1.find_all('p').text[0:10]
        comments.append(comment)
    except:
        pass

    # i += 1
    # if i % 5 == 0:
    #     sentiment = pd.DataFrame({"dislikes":dislikes,"likes":likes,"follows":follows,"comments":comments})
    #     sentiment.to_csv('/Users/niallmcnulty/Desktop/GeneralAssembly/Lessons/DSI11-lessons/projects/cap-csv/sentiment.csv')

sentiment_final = pd.DataFrame({"dislikes":dislikes,"likes":likes,"follows":follows,"comments":comments})
# df_sent = pd.merge(df, sentiment, left_index=True, right_index=True)

The comments section is dynamically loaded. You can try to capture it through the driver instead:
try:
    comment_elements = driver.find_elements_by_tag_name('p')
    for comment in comment_elements:
        comments.append(comment.text)
except:
    pass

print(comments)
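If the paragraphs have not rendered by the time you read the page source, an explicit wait can help. This is only a sketch to drop into your existing loop: the '.comments p' selector is an assumption, so check the real comment container in the browser dev tools first.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    # Wait up to 10 seconds for at least one comment paragraph to appear.
    # The selector '.comments p' is a guess -- replace it with the real one.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.comments p'))
    )
    soup1 = BeautifulSoup(driver.page_source, 'lxml')
    comments.append([p.get_text(strip=True) for p in soup1.select('.comments p')])
except Exception:
    pass  # no comments rendered within the timeout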

Related

Control display of an ipyvuetify page by a dropdown works in notebook not in voila

I have encountered another "works in the notebook but not in Voila" issue. I have tried for a couple of hours but feel like I am still missing something, so I am seeking expert opinions here.
I have a function create_pages_and_run() that takes a dictionary, d, as input and generates a dashboard (the data type is ipyvuetify.generated.App.App). The dictionary is retrieved from a JSON file, scenario_dict, using a country name as the key, and I designed a dropdown to collect the country name.
The purpose is to ask the user to select a country name so that the page is redrawn/refreshed. The following code works in the notebook but not in Voila. (Works means that when a new country name is selected, the dashboard is displayed with widgets using data from that country.)
scenario_dropdown = widgets.Dropdown(
    options=all_scenarios,
    value=initial_scenario,
    description="Scenario",
    layout=widgets.Layout(margin="0 20px 0 0", height="39px", width="15%"),
)
d = scenario_dict[initial_scenario]
app = create_pages_and_run(d)

# the below code works for notebook
def on_change(change):
    global d, app
    if change["name"] == "value" and (change["new"] != change["old"]):
        d = scenario_dict[change["new"]]
        app = create_pages_and_run(d)
        clear_output()
        display(app)

scenario_dropdown.observe(on_change)
My failed code using ipywidgets.Output is below. (Failed in the sense that, after selecting a country name in the dropdown, no change is observed.)
scenario_dropdown = widgets.Dropdown(
    options=all_scenarios,
    value=initial_scenario,
    description="Scenario",
    layout=widgets.Layout(margin="0 20px 0 0", height="39px", width="15%"),
)
d = scenario_dict[initial_scenario]
app = create_pages_and_run(d)

out = widgets.Output()
with out:
    display(app)

# the code below fails in Voila
def on_change(change):
    global d, app, out
    if change["name"] == "value" and (change["new"] != change["old"]):
        d = scenario_dict[change["new"]]
        app = create_pages_and_run(d)
        out.clear_output()
        with out:
            display(app)

display(out)
scenario_dropdown.observe(on_change)
I appreciate your help, thanks.
I'm not sure why your code didn't work; maybe it was the use of globals, which can be avoided. Could you provide a runnable example to test? Here is a working example based on your code that works in Voila.
import ipywidgets as widgets

all_scenarios = ['aa','bb','cc']
initial_scenario = all_scenarios[0]
scenario_dict = {}
scenario_dict['aa'] = 'do_this'
scenario_dict['bb'] = 'do_that'
scenario_dict['cc'] = 'do_what'

def create_pages_and_run(action):
    print(action)

scenario_dropdown = widgets.Dropdown(
    options=all_scenarios,
    value=initial_scenario,
    description="Scenario",
    layout=widgets.Layout(margin="0 20px 0 0", height="39px", width="15%"),
)
d = scenario_dict[initial_scenario]

out = widgets.Output()
with out:
    create_pages_and_run(d)

app = widgets.VBox([scenario_dropdown, out])

def on_change(change):
    if change["name"] == "value" and (change["new"] != change["old"]):
        d = scenario_dict[change["new"]]
        out.clear_output()
        with out:
            create_pages_and_run(d)

scenario_dropdown.observe(on_change)
app
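If the bare app on the last line does not render under your Voila setup, an explicit display call is a safe fallback (a small addition of mine, not part of the original answer):
from IPython.display import display

display(app)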

R: search_fullarchive() and Twitter Academic research API track

I was wondering whether anyone has found a way to use search_fullarchive() from the "rtweet" package in R with the new Twitter academic research project track?
The problem is whenever I try to run the following code:
search_fullarchive(q = "sunset", n = 500, env_name = "AcademicProject", fromDate = "202010200000", toDate = "202010220000", safedir = NULL, parse = TRUE, token = bearer_token)
I get the following error "Error: Not a valid access token". Is that because search_fullarchive() is only for paid premium accounts and that doesn't include the new academic track (even though you get full archive access)?
Also, can you retrieve more than 500 tweets (e.g., n = 6000) when using search_fullarchive()?
Thanks in advance!
I've got the same problem with the Twitter academic research API. I think if you set n = 100 or just skip the argument, the command will return 100 tweets. Also, the rtweet package does not (yet) support the academic research API.
Change your code to this:
search_fullarchive(q = "sunset", n = 500, env_name = "Your Environment Name attained in the Dev Dashboard", fromDate = "202010200000", toDate = "202010220000", safedir = NULL, parse = TRUE, token = t)
Also, the token must be created like this:
t <- create_token(
  app = "App Name",
  consumer_key = 'Key',
  consumer_secret = 'Secret',
  access_token = '',
  access_secret = '',
  set_renv = TRUE
)

Wrong page parsed by BeautifulSoup?

I want to enter two values on the website https://hausratversicherung.friday.de/ and retrieve the resulting value after submitting the form. I wrote the following code:
import requests, re
from robobrowser import RoboBrowser
br = RoboBrowser(parser='html.parser')
br.open("https://hausratversicherung.friday.de/")
form = br.get_form()
form['area'] = 100
form['postalCode'] = 44326
br.submit_form(form)
src = str(br.parsed())
start = '<div class="Typography-sc-3c3fuf-0 jEIicc" data-testid="totalPrice">'
end = ' €</div>'
result = re.search('%s(.*)%s' % (start, end), src).group(1)
print(result)
But the browser br is not opening the mentioned page or taking these values.
The postal code 44326 isn't accepted by the server. For other postal codes you can query their API directly:
import json
import requests
area = 100
postalcode = 44309
url = 'https://fdy2-policycenter-production.k8s.blue.friday-prod.de/rest/friday/hc/price?area={area}&postalCode={postalcode}'
data = requests.get(url.format(area=area, postalcode=postalcode)).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
# print some info to screen:
print(data['basicCoverages']['coverages'][0]['insuredSum']['amount'])
print(data['basicCoverages']['coverages'][0]['price']['amount'])
Prints:
65000.0
7.81
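If you want the script to cope with postal codes the server rejects, you could guard the call. This is just a sketch, and it assumes a rejected request comes back with a non-200 status, which is worth verifying against the live API:
resp = requests.get(url.format(area=area, postalcode=postalcode))
if resp.ok:
    data = resp.json()
else:
    # The server refused the query (e.g. an unsupported postal code).
    print('Request failed with status', resp.status_code)
    data = None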

How to import data from an HTML table on a website to Excel?

I would like to do some statistical analysis with Python on the live casino game called Crazy Time from Evolution Gaming. There is a website that has the data to do this: https://tracksino.com/crazytime. I want the data from the lowest table, 'Spin History', to be imported into Excel. However, I do not know how this can be done. Could anyone give me an idea of where to start?
Thanks in advance!
Try the below code:
import json
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
import csv
import datetime

def scrap_history():
    csv_headers = []
    file_path = ''  # mention the path on your system where the file should be saved
    file_name = 'spin_history.csv'  # filename
    page_number = 1

    while True:
        # Dynamic URL, fetching data in chunks of 100
        url = 'https://api.tracksino.com/crazytime_history?filter=&sort_by=&sort_desc=false&page_num=' + str(page_number) + '&per_page=100&period=24hours'
        print('-' * 100)
        print('URL created : ', url)
        response = requests.get(url, verify=False)
        result = json.loads(response.text)  # loading the response as JSON
        history_data = result['data']
        print(history_data)

        if history_data != []:
            with open(file_path + file_name, 'a+') as history:
                # Headers for the file
                csv_headers = ['Occured At', 'Slot Result', 'Spin Result', 'Total Winners', 'Total Payout']
                csvwriter = csv.DictWriter(history, delimiter=',', lineterminator='\n', fieldnames=csv_headers)
                if page_number == 1:
                    print('Writing CSV header now...')
                    csvwriter.writeheader()
                # write the extracted data into the csv file one row at a time
                for item in history_data:
                    value = datetime.datetime.fromtimestamp(item['when'])
                    occured_at = f'{value:%d-%B-%Y # %H:%M:%S}'
                    csvwriter.writerow({'Occured At': occured_at,
                                        'Slot Result': item['slot_result'],
                                        'Spin Result': item['result'],
                                        'Total Winners': item['total_winners'],
                                        'Total Payout': item['total_payout'],
                                        })
            print('-' * 100)
            page_number += 1
            print(page_number)
            print('-' * 100)
        else:
            break
Explanation:
I have implemented the above script using the Python requests approach. The API URL https://api.tracksino.com/crazytime_history?filter=&sort_by=&sort_desc=false&page_num=1&per_page=50&period=24hours was extracted from the website itself. The script builds a dynamic URL in which the page number changes on every iteration: first page_num = 1, then page_num = 2, and so on, until all the data has been extracted.
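Since the goal is to get the data into Excel, one option is to convert the generated CSV afterwards. This is a minimal sketch, assuming pandas and openpyxl are installed; the file name matches the one used above:
import pandas as pd

# Read the CSV written by scrap_history() and save it as an Excel workbook.
df = pd.read_csv('spin_history.csv')
df.to_excel('spin_history.xlsx', index=False)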

How to iterate over multiple links, scrape every one of them one by one, and save the output to CSV using Python, BeautifulSoup and requests

I have this code but don't know how to read the links from a CSV or a list. I want to read the links, scrape details from every single link, and then save the data into an output CSV with columns corresponding to each link.
Here is the code I built to get the specific data:
from bs4 import BeautifulSoup
import requests

url = "http://www.ebay.com/itm/282231178856"
r = requests.get(url)
x = BeautifulSoup(r.content, "html.parser")
# print(x.prettify().encode('utf-8'))
# time to find some tags!!
# y = x.find_all("tag")

z = x.find_all("h1", {"itemprop": "name"})
# print z
# for loop done to extracting the title.
for item in z:
    try:
        print item.text.replace('Details about ', '')
    except:
        pass

# category extraction done
m = x.find_all("span", {"itemprop": "name"})
# print m
for item in m:
    try:
        print item.text
    except:
        pass

# item condition extraction done
n = x.find_all("div", {"itemprop": "itemCondition"})
# print n
for item in n:
    try:
        print item.text
    except:
        pass

# sold number extraction done
k = x.find_all("span", {"class": "vi-qtyS vi-bboxrev-dsplblk vi-qty-vert-algn vi-qty-pur-lnk"})
# print k
for item in k:
    try:
        print item.text
    except:
        pass

# Watchers extraction done
u = x.find_all("span", {"class": "vi-buybox-watchcount"})
# print u
for item in u:
    try:
        print item.text
    except:
        pass

# returns details extraction done
t = x.find_all("span", {"id": "vi-ret-accrd-txt"})
# print t
for item in t:
    try:
        print item.text
    except:
        pass

# per hour day view done
a = x.find_all("div", {"class": "vi-notify-new-bg-dBtm"})
# print a
for item in a:
    try:
        print item.text
    except:
        pass

# trending at price
b = x.find_all("span", {"class": "mp-prc-red"})
# print b
for item in b:
    try:
        print item.text
    except:
        pass
Your question is kind of vague!
Which links are you talking about? There are a hundred on a single eBay page. And which info would you like to scrape? Again, there is a ton of it.
But anyway, here is how I would proceed:
# First, create a list of urls you want to iterate on
urls = []
soup = BeautifulSoup(r.text, "html.parser")
# Assuming your links of interest are values of "href" attributes within <a> tags
a_tags = soup.find_all("a")
for tag in a_tags:
    urls.append(tag["href"])

# Second, start to iterate while storing the info
info_1, info_2 = [], []
for link in urls:
    # Do stuff here -- maybe it's time to define your existing loops as functions?
    info_a, info_b = YourFunctionReturningValues(link)
    info_1.append(info_a)
    info_2.append(info_b)
Then, if you want a nice CSV output:
# Don't forget to import the csv module
with open(r"path_to_file.csv", "wb") as my_file:
    csv_writer = csv.writer(my_file, delimiter=",")
    csv_writer.writerows(zip(urls, info_1, info_2))
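And since you mentioned reading the links from a CSV rather than from a page, here is a minimal sketch of that side; it assumes a hypothetical urls.csv with one URL per row in the first column:
import csv

# Read one URL per row from the first column of urls.csv (hypothetical file name).
urls = []
with open(r"urls.csv", "rb") as links_file:  # "rb" to match the Python 2 style used above
    reader = csv.reader(links_file)
    for row in reader:
        if row:  # skip blank rows
            urls.append(row[0])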
Hope this helps!
Of course, don't hesitate to give additional info so I can add more detail.
On attributes with BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
About the csv module: https://docs.python.org/2/library/csv.html
