I am new to scraping and just learning.
Yesterday I was able to scrape Craigslist with BeautifulSoup; today I am unable to.
Here is my code to scrape the first page of rental housing search results on Craigslist.
from requests import get
from bs4 import BeautifulSoup
# get the first page of the San Diego rental housing results
url = 'https://sandiego.craigslist.org/search/apa?hasPic=1&availabilityMode=0&sale_date=all+dates'
response = get(url)  # the link excludes posts with no pictures
html_soup = BeautifulSoup(response.text, 'html.parser')
#get the macro-container for the housing posts
posts = html_soup.find_all('li', class_="result-row")
print(type(posts)) #to double check that I got a ResultSet
print(len(posts)) #to double check I got 120 (elements/page)
The html_soup is not the same as what I see at the actual URL. It actually contains the following:
<script>
    window.cl.specialCurtainMessages = {
        unsupportedBrowser: [
            "We've detected you are using a browser that is missing critical features.",
            "Please visit craigslist from a modern browser."
        ],
        unrecoverableError: [
            "There was an error loading the page."
        ]
    };
</script>
Any help would be much appreciated.
I am not sure whether I've somehow been 'blocked' from scraping. I read an article about proxies and rotating IP addresses, but I do not want to break any rules if I have been blocked, and I also do not want to spend money on this. Is it not allowed to scrape Craigslist? I have seen so many educational tutorials on it, so I thought it was okay.
import requests
from pprint import pp

def main(url):
    with requests.Session() as req:
        # query the JSON search API directly instead of parsing the HTML page
        params = {
            "availabilityMode": "0",
            "batch": "8-0-360-0-0",
            "cc": "US",
            "hasPic": "1",
            "lang": "en",
            "sale_date": "all dates",
            "searchPath": "apa"
        }
        r = req.get(url, params=params)
        for i in r.json()['data']['items']:
            pp(i)
            break  # print only the first posting

main('https://sapi.craigslist.org/web/v7/postings/search/full')
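A short hedged follow-up on the answer above: the search page appears to be rendered client-side (hence the "unsupported browser" curtain in the raw HTML), so the listing data comes from the sapi.craigslist.org JSON endpoint rather than the page markup. Because that payload isn't formally documented, it can help to inspect the shape of the response before extracting fields; the sketch below only prints whatever the API actually returns.
import requests

params = {
    "availabilityMode": "0",
    "batch": "8-0-360-0-0",
    "cc": "US",
    "hasPic": "1",
    "lang": "en",
    "sale_date": "all dates",
    "searchPath": "apa",
}
r = requests.get("https://sapi.craigslist.org/web/v7/postings/search/full", params=params)
payload = r.json()
print(list(payload))                                  # top-level keys of the response
items = payload.get("data", {}).get("items", [])
print(len(items))                                     # number of postings returned in this batch
if items:
    print(items[0])                                   # inspect one posting before picking fields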
Related
I'm new to web scraping and I was trying to scrape the FUTBIN (FUT 22) player database at https://www.futbin.com/players. My code is below, and I don't know why it can't get any results from the FUTBIN page, while it was successful on other webpages like IMDB.
Code:
import requests
from bs4 import BeautifulSoup
request = requests.get("https://www.futbin.com/players")
src = request.content
soup = BeautifulSoup(src, features="html.parser")
results = soup.find("a", class_="player_name_players_table get-tp")
print(results)
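One hedged thing to check first (an assumption, not a confirmed cause): some sites serve an error or challenge page to requests that don't look like a browser, so it is worth printing the status code and retrying with a browser-like User-Agent header before assuming the selector is wrong.
import requests
from bs4 import BeautifulSoup

# Hedged sketch: see what futbin.com actually returns, then retry with a
# browser-like User-Agent. The header string is just an example value.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
request = requests.get("https://www.futbin.com/players", headers=headers)
print(request.status_code)   # anything other than 200 points to blocking, not parsing

soup = BeautifulSoup(request.content, features="html.parser")
results = soup.find_all("a", class_="player_name_players_table get-tp")
print(len(results))          # 0 here would suggest the table is built by JavaScript instead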
I am currently working on a spider that crawls an e-commerce website and extracts data. Meanwhile, I need to save the URL trail in the product as well, such as:
{
    'product_name': "apple iphone 12",
    'trail': ["https://www.apple.com/", "https://www.apple.com/iphone/", "https://www.apple.com/iphone-12/"]
}
i.e. the same path a user would take from the start page to the product.
I am using Scrapy 2.4.1.
I passed the previous URL as a keyword argument in the callback.
Source: https://docs.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )
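Building on that, the same cb_kwargs mechanism can carry the whole trail rather than a single previous URL. Below is a minimal, hedged sketch; the spider name, start URL, and CSS selectors are placeholders for illustration, not taken from a real site.
import scrapy

class TrailSpider(scrapy.Spider):
    name = "trail_example"
    start_urls = ["https://www.apple.com/"]

    def parse(self, response):
        # start the trail with the page we are on
        trail = [response.url]
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_category,
                                  cb_kwargs={"trail": trail})

    def parse_category(self, response, trail):
        # extend the trail (copy the list so sibling requests don't share it)
        trail = trail + [response.url]
        for href in response.css("a.product::attr(href)").getall():  # placeholder selector
            yield response.follow(href, callback=self.parse_product,
                                  cb_kwargs={"trail": trail})

    def parse_product(self, response, trail):
        yield {
            "product_name": response.css("h1::text").get(),  # placeholder selector
            "trail": trail + [response.url],
        }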
I am trying to scrape Yahoo Finance's recommendation rating using BeautifulSoup, but it keeps returning None.
E.g. the recommendation rating for AAPL is '2':
https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL
Please advise. Thank you!
Below is the code:
from requests import get
from bs4 import BeautifulSoup

tickers = ['AAPL']
ticker = tickers[0]  # the original snippet referenced `ticker` without defining it
url = 'https://sg.finance.yahoo.com/quote/%s/profile?p=%s' % (ticker, ticker)
print(url)
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

# yf_rec refers to the Yahoo Finance recommendation
yf_rec = None
try:
    yf_rec = html_soup.find('div', attrs={'class': 'B(8px) Pos(a) C(white) Py(2px) Px(0) Ta(c) Bdrs(3px) Trstf(eio) Trsde(0.5) Arrow South Bdtc(i)::a Fw(b) Bgc($buy) Bdtc($buy)'}).text.strip()
except AttributeError:
    pass  # find() returned None, so there is no .text to read
print(yf_rec)
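As a hedged alternative (an assumption about what is actually needed, not a fix to the BeautifulSoup code): the long auto-generated class string tends to change and much of the page is filled in by JavaScript, so a library that reads Yahoo's API, such as yfinance, may be more reliable for the recommendation number. The recommendationMean / recommendationKey fields are read defensively with .get(), since their presence for every ticker is not guaranteed.
import yfinance as yf

tickers = ['AAPL']
for ticker in tickers:
    info = yf.Ticker(ticker).info
    # recommendationMean is assumed to be the 1-5 analyst rating (around 2 for AAPL)
    print(ticker, info.get('recommendationMean'), info.get('recommendationKey'))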
I am new to Scrapy and am trying to scrape the https://socialblade.com/ website to get the channel ID of the most-viewed and most-subscribed YouTubers in a country.
The way I am doing it is to follow the link to a YouTuber on the main listing page (e.g. https://socialblade.com/youtube/top/country/pk/mostsubscribed). That opens a new page, and the last part of the newly opened page's URL contains the channel ID (e.g. https://socialblade.com/youtube/channel/UC4JCksJF76g_MdzPVBJoC3Q).
Here is my code:
import scrapy

class SocialBladeSpider(scrapy.Spider):
    name = "socialblade"

    def start_requests(self):
        urls = [
            'https://socialblade.com/youtube/top/country/pk/mostviewed',
            'https://socialblade.com/youtube/top/country/pk/mostsubscribed'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse_url(self, response):
        data = {
            'url': response.url.split('/')[-1],
            'displayName': response.css('div#YouTubeUserTopInfoBlockTop div h1::text').extract_first()
        }
        yield {
            response.meta['country']: {
                response.meta['key']: data
            }
        }

    def parse(self, response):
        key = response.url.split("/")[-1]
        country = response.url.split("/")[-2]
        for a in response.css('a[href^="/youtube/user/"]'):
            request = scrapy.Request(url='https://socialblade.com' + a.css('::attr(href)').extract_first(), callback=self.parse_url)
            request.meta['key'] = key
            request.meta['country'] = country
            yield request
The issue is: after scraping these two URLs I should get 500 records in total, but I am only getting 348. I did some research but was unable to find a solution.
Does anyone have any advice on how to solve this?
Pass dont_filter=True to your requests if you do not want to filter out duplicate requests.
For more information, see the documentation about Request.
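The likely cause of the missing records is that the same channel appears in both the most-viewed and most-subscribed lists, so Scrapy's duplicate filter drops the second request. A hedged sketch of how the parse method above could pass dont_filter=True:
    def parse(self, response):
        key = response.url.split("/")[-1]
        country = response.url.split("/")[-2]
        for a in response.css('a[href^="/youtube/user/"]'):
            yield scrapy.Request(
                url='https://socialblade.com' + a.css('::attr(href)').extract_first(),
                callback=self.parse_url,
                meta={'key': key, 'country': country},
                dont_filter=True,  # keep channels that appear in both lists
            )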
I am using Scrapy to retrieve information about projects on https://www.indiegogo.com. I want to scrape all pages with the url format www.indiegogo.com/projects/[NameOfProject]. However, I am not sure how to reach all of those pages during a crawl. I can't find a master page that hardcodes links to all of the /projects/ pages. All projects seem to be accessible from https://www.indiegogo.com/explore (through visible links and the search function), but I cannot determine the set of links/search queries that would return all pages. My spider code is given below. These start_urls and rules scrape about 6000 pages, but I hear that there should be closer to 10x that many.
About the urls with parameters: The filter_quick parameter values used come from the "Trending", "Final Countdown", "New This Week", and "Most Funded" links on the Explore page and obviously miss unpopular and poorly funded projects. There is no max value on the per_page url parameter.
Any suggestions? Thanks!
class IndiegogoSpider(CrawlSpider):
    name = "indiegogo"
    allowed_domains = ["indiegogo.com"]
    start_urls = [
        "https://www.indiegogo.com/sitemap",
        "https://www.indiegogo.com/explore",
        "http://go.indiegogo.com/blog/category/campaigns-2",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=countdown&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=new&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=most_funded&per_page=50000",
        "https://www.indiegogo.com/explore?filter_browse_balance=true&filter_quick=popular_all&per_page=50000"
    ]

    rules = (
        Rule(LinkExtractor(allow=('/explore?'))),
        Rule(LinkExtractor(allow=('/campaigns-2/'))),
        Rule(LinkExtractor(allow=('/projects/')), callback='parse_item'),
    )

    def parse_item(self, response):
        [...]
Sidenote: there are other URL formats www.indiegogo.com/projects/[NameOfProject]/[OtherStuff] that either redirect to the desired URL format or give 404 errors when I try to load them in the browser. I am assuming that Scrapy is handling the redirects and blank pages correctly, but would be open to hearing ways to verify this.
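Regarding the sidenote about verifying redirects: with Scrapy's default RedirectMiddleware, the chain of URLs that led to a response is stored in response.meta under 'redirect_urls', so parse_item can simply log it, as in the hedged snippet below.
    def parse_item(self, response):
        redirect_chain = response.meta.get('redirect_urls', [])
        if redirect_chain:
            # the request was redirected at least once before reaching this URL
            self.logger.info("redirected %s -> %s", redirect_chain, response.url)
        # ... existing parsing continues here ...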
Well, if you have the link to the sitemap, then it will be faster to let Scrapy fetch pages from there and process them.
This will work something like the code below.
from scrapy.contrib.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    # you can set rules for extracting URLs under sitemap_rules
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass  # ... scrape shop here ...
Try the code below; it will crawl the site and only follow links under "indiegogo.com/projects/".
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from sitemap.items import myitem

class DmozSpider(CrawlSpider):
    name = 'indiego'
    allowed_domains = ['indiegogo.com']
    start_urls = [
        'http://indiegogo.com'
    ]

    # extract and follow only links whose path contains /projects/
    rules = (Rule(LinkExtractor(allow=('/projects/',)), callback='parse_items', follow=True),)

    def parse_items(self, response):
        item = myitem()
        item['link'] = response.request.url
        item['title'] = response.xpath('//title').extract()
        yield item