"requests.get" not saving useful youtube links

"requests.get" not saving useful youtube links - web-scraping

I'm trying to run requests.get() to save a collection of youtube playlists, but somehow it is not working.
import pandas as pd
import numpy as np
import re
import time
import bs4 as bs4
import json
import requests as rq
queries = ["machine+learning", "data+science", "kaggle"]
url = "https://www.youtube.com/results?search_query={query}&sp=CAI%253D&p={page}"
for query in queries:
for page in range(1,21):
urll = url.format(query=query, page=page)
print(urll)
response = rq.get(urll)
with open("./dados_brutos/{}_{}.html".format(query,page), 'w+') as output:
output.write(response.text)
time.sleep(2)
It works well to save the pages, but when I try to load the page in the browser, the header of the page (Youtube) appears, but no info is displayed.
Any suggestions?

Related

Python Requests doesn't render full code from page

Im trying to capture each agents data from this page using python requests.
Capture These
But the response.text doesn't render the code shown in the code inspector. **See snapshot.**a
Below is my script.
import requests
import re
response = requests.get('https://exprealty.com/agents/#/?city=Aberdeen+WA&country=US')
result = re.search('Mike Arthur',response.text)
try:
print (result.group())
except:
print('Nothing found.')

Can you display a PlainTextResponse in the Swagger UI for a FastAPI API?

Right now, I can only view a PlainTextResponse by manually entering the API URL path into my browser. However, I would like to be able to view PlainTextResponses in my Swagger UI. It seems like the OpenAPI loads indefinitely every time I try to request a PlainTextResponse
Here is a sample:
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse
import pandas as pd
app = FastAPI()
#app.get("/plain_text", response_class=PlainTextResponse)
async def plain_text():
url = 'https://raw.githubusercontent.com/ccodwg/Covid19Canada/master/official_datasets/can/phac_n_tests_performed_timeseries_prov.csv'
df = pd.read_csv(url, index_col=0)
return PlainTextResponse(df.to_csv(index=False), media_type="text/plain")
This sample actually works. I'm assuming its because this specific CSV file is relatively small.
However, once you start using larger CSV files, it seems like you are unable to display the response in the UI. For example, try https://raw.githubusercontent.com/Schlumberger/hackathon/master/backend/dataset/data-large.csv instead and it will load forever on the UI but displays relatively quickly if you use the URL path.

I don't know what is happening in at your end, but here is a mvp showing how PlainTextRepsonse comes through in the auto generated docs.
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse
app = FastAPI()
#app.get("/")
def root():
return PlainTextResponse("Plain Response!")
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000, )
Executing the operation (using the "Try out" button) in the generated docs, yields the following result:

Turns out, it's not PlainTextResponse's issue but rather SwaggerUI, credits to this user's answer. Disabling syntax highlighting seems to significantly improve performance and hence the Swagger UI no longer hangs on large responses.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import pandas as pd
app = FastAPI(swagger_ui_parameters={"syntaxHighlight": False})
#app.get("/plain_text", response_class=StreamingResponse)
async def plain_text():
url = 'https://raw.githubusercontent.com/Schlumberger/hackathon/master/backend/dataset/data-large.csv'
df = pd.read_csv(url, index_col=0)
return StreamingResponse(iter([df.to_csv(index=False)]), media_type="text/csv")

How to get data with BeautifulSoup without having "None"?

I am new to web scraping and I have a problem with it.
I want to get the name of the courses in specific search results on Udemy (from this link https://www.udemy.com/courses/search/?src=ukw&q=veri+bilimi).
Here is my code:
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.udemy.com/courses/search/?src=ukw&q=veri+bilimi")
print(result.status_code)
src = result.content
soup = BeautifulSoup(src, "lxml")
print(soup.find("div", attrs={"class":"udlite-focus-visible-target udlite-heading-md course-card--course-title--2f7tE"}))
It turns "None" instead of course names. Unfortunately, I didn't understand and see my mistake.
Can you help me?

The udemy website is using javascript to load course title that requests won't access. You need to use selenium
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
url ="https://www.udemy.com/courses/search/?src=ukw&q=veri+bilimi"
import time
webdriver =webdriver.Chrome()
webdriver.get(url)
time.sleep(6) # delay 6 sec
soup = BeautifulSoup(webdriver.page_source, "lxml")
course_titles = soup.find_all("div", attrs={"class":"udlite-focus-visible-target udlite-heading-md course-card--course-title--2f7tE"})
for title in course_titles:
print(title.get_text())
Selenium Installation if you need it.

Scraping "older" pages with scrapy, rules and link extractors

I have been working on a project with scrapy. With help, from this lovely community I have managed to be able to scrape the first page of this website: http://www.rotoworld.com/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav. I am trying to scrape information from the "older" pages as well. I have researched "crawlspider", rules and link extractors, and believed I had the proper code. I want the spider to perform the same loop on subsequent pages. Unfortunately at the moment when I run it, it just spits out the 1st page, and doesn't continue to the "older" pages.
I am not exactly sure what I need to change and would really appreciate some help. There are posts going all the way back to February of 2004... I am new to data mining, and not sure if it is actually a realistic goal to be able to scrape every post. If it is I would like to though. Please any help is appreciated. Thanks!
import scrapy
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors import LinkExtractor
class Roto_News_Spider2(crawlspider):
name = "RotoPlayerNews"
start_urls = [
'http://www.rotoworld.com/playernews/nfl/football/',
]
Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[#id="cp1_ctl00_btnNavigate1"]',)), callback="parse_page", follow= True),)
def parse(self, response):
for item in response.xpath("//div[#class='pb']"):
player = item.xpath(".//div[#class='player']/a/text()").extract_first()
position= item.xpath(".//div[#class='player']/text()").extract()[0].replace("-","").strip()
team = item.xpath(".//div[#class='player']/a/text()").extract()[1].strip()
report = item.xpath(".//div[#class='report']/p/text()").extract_first()
date = item.xpath(".//div[#class='date']/text()").extract_first() + " 2018"
impact = item.xpath(".//div[#class='impact']/text()").extract_first().strip()
source = item.xpath(".//div[#class='source']/a/text()").extract_first()
yield {"Player": player,"Position": position, "Team": team,"Report":report,"Impact":impact,"Date":date,"Source":source}

If your intention is to fetch the data traversing multiple pages, you don't need to go for scrapy. If you still want to have any solution related to scrapy then I suggest you opt for splash to handle the pagination.
I would do something like below to get the items (assuming you have already installed selenium in your machine):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("http://www.rotoworld.com/playernews/nfl/football/")
wait = WebDriverWait(driver, 10)
while True:
for item in wait.until(EC.presence_of_all_elements_located((By.XPATH,"//div[#class='pb']"))):
player = item.find_element_by_xpath(".//div[#class='player']/a").text
player = player.encode() #it should handle the encoding issue; I'm not totally sure, though
print(player)
try:
idate = wait.until(EC.presence_of_element_located((By.XPATH, "//div[#class='date']"))).text
if "Jun 9" in idate: #put here any date you wanna go back to (last limit: where the scraper will stop)
break
wait.until(EC.presence_of_element_located((By.XPATH, "//input[#id='cp1_ctl00_btnNavigate1']"))).click()
wait.until(EC.staleness_of(item))
except:break
driver.quit()

My suggestion: Selenium
If you want to change of page automatically, you can use Selenium WebDriver.
Selenium makes you to be able to interact with the page click on buttons, write on inputs, etc. You'll need to change your code to scrap the data an then, click on the older button. Then, it'll change the page and keep scraping.
Selenium is a very useful tool. I'm using it right now, on a personal project. You can take a look at my repo on GitHub to see how it works. In the case of the page that you're trying to scrap, you cannot go to older just changing the link to be scraped, so, you need to use Selenium to do change between pages.
Hope it helps.

No need to use Selenium in current case. Before scraping you need to open url in browser and press F12 to inspect code and to see packets in Network Tab. When you press next or "OLDER" in your case you can see new set of TCP packets in Network tab. It provide to you all you need. When you understand how it work you can write working spider.
import scrapy
from scrapy import FormRequest
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors import LinkExtractor
class Roto_News_Spider2(CrawlSpider):
name = "RotoPlayerNews"
start_urls = [
'http://www.<DOMAIN>/playernews/nfl/football/',
]
Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[#id="cp1_ctl00_btnNavigate1"]',)), callback="parse", follow= True),)
def parse(self, response):
for item in response.xpath("//div[#class='pb']"):
player = item.xpath(".//div[#class='player']/a/text()").extract_first()
position= item.xpath(".//div[#class='player']/text()").extract()[0].replace("-","").strip()
team = item.xpath(".//div[#class='player']/a/text()").extract()[1].strip()
report = item.xpath(".//div[#class='report']/p/text()").extract_first()
date = item.xpath(".//div[#class='date']/text()").extract_first() + " 2018"
impact = item.xpath(".//div[#class='impact']/text()").extract_first().strip()
source = item.xpath(".//div[#class='source']/a/text()").extract_first()
yield {"Player": player,"Position": position, "Team": team,"Report":report,"Impact":impact,"Date":date,"Source":source}
older = response.css('input#cp1_ctl00_btnNavigate1')
if not older:
return
inputs = response.css('div.aspNetHidden input')
inputs.extend(response.css('div.RW_pn input'))
formdata = {}
for input in inputs:
name = input.css('::attr(name)').extract_first()
value = input.css('::attr(value)').extract_first()
formdata[name] = value or ''
formdata['ctl00$cp1$ctl00$btnNavigate1.x'] = '42'
formdata['ctl00$cp1$ctl00$btnNavigate1.y'] = '17'
del formdata['ctl00$cp1$ctl00$btnFilterResults']
del formdata['ctl00$cp1$ctl00$btnNavigate1']
action_url = 'http://www.<DOMAIN>/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav&rw=1'
yield FormRequest(
action_url,
formdata=formdata,
callback=self.parse
)
Be carefull you need to replace all to corrent one in my code.

url doesn't change in next page- web scraping

I want to get some data from this page. when I navigate to the next page url doesn't change. here is my code for scraping the first page
import requests
from bs4 import BeautifulSoup
url="https://www.airyrooms.com/search?s=26-02-2018.28-02-2018.GEO.103859.Bandung"
r=requests.get(url)
soup=BeautifulSoup(r.content)
g_data=soup.find_all("div",{"class":"styles-propertySearchResultDisplayContainer-1XpMp"})
i=0
for item in g_data:
try:
i=i+1
print (item.contents[0].find_all("div",{"class": "styles-titlePopUp-17tHZ"})[0].text)#name
print (item.contents[0].find_all("span",{"class": "styles-propertyLocationLink-1iVPv"})[0].text)#location
print (item.contents[0].find_all("span",{"class": "styles-lineThrough-xyCPH"})[0].text)# initial price
print (item.contents[0].find_all("div",{"class": "styles-value-3pvw_"})[1].text)# price after disount
except:
pass
print(i)
I don't know how to get data from the other pages.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

"requests.get" not saving useful youtube links - web-scraping

Related

Python Requests doesn't render full code from page

Can you display a PlainTextResponse in the Swagger UI for a FastAPI API?

How to get data with BeautifulSoup without having "None"?

Scraping "older" pages with scrapy, rules and link extractors

url doesn't change in next page- web scraping

Categories

Resources