In this program I'm trying to get all the prices of rentals in Ottawa, but it returns only one price, and a different one each time. Why?
import scrapy

class RentalPricesSpider(scrapy.Spider):
    name = 'rental_prices'
    allowed_domains = ['www.kijiji.ca']
    start_urls = ['https://www.kijiji.ca/b-real-estate/ottawa/c34l1700185']

    def parse(self, response):
        rental_price = response.xpath('normalize-space(//div[#class="price"]/text())').getall()
        yield {
            'rent': rental_price,
        }
You are selecting with the wrong XPath, which is why you are not getting the expected output: #class is not valid XPath (attribute tests are written @class), and wrapping the whole expression in normalize-space() collapses the node-set to a single string, so getall() can only ever return one price. Use the CSS selector div.price::text instead:
import scrapy
from scrapy.crawler import CrawlerProcess

class RentalPricesSpider(scrapy.Spider):
    name = 'rental_prices'
    allowed_domains = ['www.kijiji.ca']
    start_urls = ['https://www.kijiji.ca/b-real-estate/ottawa/c34l1700185']

    def parse(self, response):
        rental_price = response.css('div.price::text').getall()
        # Drop whitespace-only strings and trim the rest
        rental_price = [x.strip() for x in rental_price if x.strip()]
        yield {
            'rent': rental_price,
        }

process = CrawlerProcess(settings={
    "USER_AGENT": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})
process.crawl(RentalPricesSpider)
process.start()
Good evening everybody,
I am in a fight with Power Query. I am trying to receive an answer from a website in Power BI (Power Query). For obvious reasons I renamed passwords and such. When I run the code below, Power Query answers with:
Expression.Error: The content-length heading must be changed using the
appropriate property or method. Parameter name: name Details: 103
It does not matter if I use another number. I also tried removing the header entirely, but that results in:
DataSource.Error: Web.Contents could not retrieve the content from
'https://xxxx.nl/api/token/' (500): Internal Server Error Details:
DataSourceKind=Web DataSourcePath=https://xxxx.nl/api/token
Url=https://xxxx.nl/api/token/
I must be missing something, but I cannot figure out what it is. Could you spot it? Thanks in advance!
let
url = "https://xxxxx.nl/api/token/",
body = Text.ToBinary("{""username"":""xxxx"",""password"":""xxxx"",""group"":""xxxx"",""deleteOtherSessions"":false}"),
Data = Web.Contents(
url,
[
Headers = [
#"authority" = "xxx.nl",
#"method" = "POST",
#"path" = "/api/Token",
#"scheme" = "https",
#"accept" = "application/json",
#"accept-encoding" = "gzip, deflate",
#"transfer-encoding" = "deflate",
#"accept-language" = "nl-NL,nl;q=0.7",
#"cache-control" = "no-cache",
#"content-length" = "103",
#"content-type" = "application/json",
#"expires" = "Sat, 01 Jan 2000 00:00:00 GMT",
#"origin" = "https://xxxx.nl",
#"pragma" = "no-cache",
#"referer" = "https://xxxxx.nl/",
#"sec-fetch-dest" = "empty",
#"sec-fetch-mode" = "cors",
#"sec-fetch-site" = "same-origin",
#"sec-gpc" = "1",
#"user-agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
],
Content = body
]
)
in
Data
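The error message itself points at the likely cause: content-length is one of the headers the underlying HTTP stack manages itself (along with pseudo-headers like method, path, and scheme), so setting it manually in Web.Contents is rejected; dropping those entries and letting Web.Contents compute the length from Content is the probable fix. As a cross-check outside Power Query, this sketch shows Python's requests doing the same bookkeeping automatically (the URL and credentials are placeholders mirroring the question, not real values):

```python
import requests

# Placeholder payload mirroring the question's renamed credentials
payload = '{"username":"xxxx","password":"xxxx","group":"xxxx","deleteOtherSessions":false}'

req = requests.Request(
    "POST",
    "https://example.nl/api/token/",          # placeholder URL
    headers={"Content-Type": "application/json"},  # note: no manual Content-Length
    data=payload.encode("utf-8"),
)
prepared = req.prepare()  # builds the request locally; no network traffic

# The library fills in Content-Length itself from the body
print(prepared.headers["Content-Length"])  # equals len(payload)
```

The same division of labour applies in Power Query: supply the body and content type, and let the stack compute the restricted headers.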
I'm trying to scrape this address:
http://extranet.artesp.sp.gov.br/TransporteColetivo/OrigemDestino?fbclid=IwAR3_hZwajHk_iyU085S1LDTqLCOYLHIZ5K825XgPGcB4tMI0EuCJpQNrJHM#
There are two drop down ("Origem" and "Destino"). I need to generate a database with all possible combinations of "Origem" and "Destino".
Below is part of the code in R. I'm not able to select an option within the drop-down menu, so I can't create a loop and extract the data I need.
Any suggestions?
library(RSelenium) # activate Selenium server
library(rJava)

# Start the Selenium server and browser first (rs_driver_object was never defined)
rs_driver_object <- rsDriver(browser = "chrome")
remDr <- rs_driver_object$client
remDr$open()
remDr$navigate("http://extranet.artesp.sp.gov.br/TransporteColetivo/OrigemDestino?fbclid=IwAR3_hZwajHk_iyU085S1LDTqLCOYLHIZ5K825XgPGcB4tMI0EuCJpQNrJHM#")

Origem <- remDr$findElement(using = 'id', 'Origem')
Destino <- remDr$findElement(using = 'id', 'Destino')
botão_pesquisar <- remDr$findElement(using = 'id', 'btnPesquisar')
Grab the values (which are the location IDs) from each combo box into two arrays (from and to), and keep the labels as well. The page posts those IDs as parameters to an endpoint; the call looks like this:
library(RCurl)
headers = c(
  "Accept" = "application/json, text/javascript, */*; q=0.01",
  "Accept-Language" = "en-US,en;q=0.9",
  "Connection" = "keep-alive",
  "Content-Type" = "application/x-www-form-urlencoded; charset=UTF-8",
  "Cookie" = "__RequestVerificationToken_L1RyYW5zcG9ydGVDb2xldGl2bw2=tY-yKlWmbZvAJzMHmITkohPiIos5XkjDBwf1ZBfP_bYWdXJMBF2Qw3z_B-LRVo0kXjdnHqDqsbZ04Zij_PM-wAf4DWVKfnQskOhqo4ANSRc1",
  "Origin" = "http://extranet.artesp.sp.gov.br",
  "Referer" = "http://extranet.artesp.sp.gov.br/TransporteColetivo/OrigemDestino?fbclid=IwAR3_hZwajHk_iyU085S1LDTqLCOYLHIZ5K825XgPGcB4tMI0EuCJpQNrJHM",
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
  "X-Requested-With" = "XMLHttpRequest"
)
params = "origem=387&destino=388&__RequestVerificationToken=Z-wXmGOb9pnQbmkfcQXmChT-6uc3YfGjftHwK4HnC9SDCaKmzIafo7AI3lChBY6YDBHdpT_X98mSHGAr_YrTNgKiepKxKraGu7p6PI7dV4g1"
res <- postForm("http://extranet.artesp.sp.gov.br/TransporteColetivo/OrigemDestino/GetGrid", .opts=list(postfields = params, httpheader = headers, followlocation = TRUE), style = "httppost")
cat(res)
See the origem= and destino= parameters? Those are the values from the static combo-box fields, so the whole job can be done with plain web requests. The response for each call will look like this:
[
{
"Codigo": 0,
"Empresa": {
"Codigo": 447,
"Descricao": "VIAÇÃO VALE DO TIETE LTDA",
"FlagCNPJ": false,
"CNPJ": null,
"CPF": null,
"Fretamento": null,
"Escolar": null,
"Municipio": null,
"UF": null,
"Endereco": null,
"Bairro": null,
"CEP": null,
"Telefone": null,
"Email": null
},
"CodigoMunicipioOrigem": 387,
"CodigoMunicipioDestino": 388
}
]
So when a trip is found, you'll get an array of entries (ticket or route records, I'm assuming); the array comes back empty when the origin and destination have no schedules.
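The enumeration described above can be sketched as follows. This builds one form-encoded GetGrid body per (origem, destino) pair without sending anything; the ID lists and the token are placeholders you would first scrape from the page (the real token expires per session, so a literal value won't work):

```python
from itertools import product
from urllib.parse import urlencode

# Placeholder IDs; in practice, scrape every <option value="..."> from the
# Origem and Destino select elements on the page
origens = [387, 388]
destinos = [387, 388]
token = "SCRAPED_TOKEN"  # the page's __RequestVerificationToken, scraped per session

def grid_body(origem, destino):
    """Form-encoded body for one GetGrid request."""
    return urlencode({
        "origem": origem,
        "destino": destino,
        "__RequestVerificationToken": token,
    })

# One request body per (origem, destino) pair, skipping origem == destino
bodies = [grid_body(o, d) for o, d in product(origens, destinos) if o != d]
print(bodies[0])
```

Each body would then be POSTed to /TransporteColetivo/OrigemDestino/GetGrid with the cookie and headers shown in the R snippet above, and the JSON responses accumulated into the database.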
I'm attempting to scrape a JSON object. Referencing Extract data from an unreachable JsonObject(), I'm able to get the status code 200 but I'm not getting any results. I assume there is something wrong with how I'm constructing the query.
Code I have so far:
library(httr) # web scraping
library(jsonlite) #parsing json data
library(rvest) # web scraping
library(polite) # check robot.txt files
library(tidyverse) # data wrangling
library(curlconverter) # decode curl commands
library(urltools) # URL encoding
r <-
  POST(
    url = "https://vfm4x0n23a-dsn.algolia.net/1/indexes/*/queries",
    add_headers(
      #.headers=c(
      #'Accept' = "*/*",
      #'Accept-Encoding' = 'gzip, deflate, br',
      #'Accept-Language' = "en-US,en;q=0.9",
      #'Cache-Control' = "no-cache",
      #'Connection' = "keep-alive",
      #'Content-Length' = '450',
      #'Content-Type' = 'application/x-www-form-urlencoded',
      #'Host' = 'vfm4x0n23a-dsn.algolia.net',
      #'Origin' = "https://www.iheartjane.com",
      #'Pragma' = "no-cache",
      #'Referer' = "https://www.iheartjane.com/",
      #'sec-ch-ua' = '".Not/A)Brand"";v=99""Google Chrome";v="103",Chromium";v="103"',
      #'sec-ch-ua-mobile' = "?0",
      #'sec-ch-ua-platform' = "Windows",
      #'Sec-Fetch-Dest' = "empty",
      #'Sec-Fetch-Mode' = "cors",
      #'Sec-Fetch-Site' = "cross-site",
      'User-Agent' = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"#,
      #'x-algolia-api-key' = "b499e29eb7542dc373ec0254e007205d",
      #'x-algolia-application-id' = "VFM4X0N23A"
      #)
    ),
    config = list(
      'x-algolia-agent' = 'Algolia for JavaScript (4.13.0); Browser; JS Helper (3.7.4); react (16.14.0); react-instantsearch (6.23.1)',
      'x-algolia-application-id' = 'VFM4X0N23A',
      'x-algolia-api-key' = 'b499e29eb7542dc373ec0254e007205d'
    ),
    body = FALSE,
    encode = 'form',
    query = '{"requests":[{"indexName":"menu-products-production","params":"query=highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&page=0&hitsPerPage=48&filters=store_id%20%3D%201641%20AND%20kind%3A%22sale%22%20OR%20root_types%3A%22sale%22&optionalFilters=brand%3AVerano%2Cbrand%3AAvexia%2Cbrand%3AEncore%20Edibles%2Croot_types%3AFeatured%2Croot_types%3ASale&userToken=Zu0iU4Uo2whpmqNBjUGOJ&facets=%5B%5D&tagFilters="}]}',
    verbose()
  )

json <- content(r, type = "application/json")
How can I restructure my code to send the query correctly?
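A likely culprit: the JSON is passed as query = while body = FALSE, so it ends up in the URL's query string and the server never receives a request body. Algolia's multi-query endpoint expects that JSON as the POST body, with the x-algolia-* credentials as URL parameters. This Python sketch builds (but does not send) the request to show where each piece belongs; the app ID, API key, and URL are the ones from the question, the simplified params string is an assumption:

```python
import json
import requests

url = "https://vfm4x0n23a-dsn.algolia.net/1/indexes/*/queries"

# Credentials travel as URL parameters, as the browser sends them
params = {
    "x-algolia-agent": "Algolia for JavaScript (4.13.0); Browser",
    "x-algolia-application-id": "VFM4X0N23A",
    "x-algolia-api-key": "b499e29eb7542dc373ec0254e007205d",
}

# The multi-query JSON goes in the POST body, not the query string
payload = {"requests": [{"indexName": "menu-products-production",
                         "params": "query=&page=0&hitsPerPage=48"}]}

# prepare() assembles the request locally; no network traffic
prepared = requests.Request("POST", url, params=params, json=payload).prepare()

print(prepared.url)                              # credentials appended as ?x-algolia-...
print(json.loads(prepared.body)["requests"][0])  # JSON sits in the body
```

In httr terms that means moving the JSON string to body = and keeping the three x-algolia-* pairs in query =.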
I'm trying to run the code below, which I got from this site. However, it keeps giving "AttributeError: 'NoneType' object has no attribute 'find'" (on line 40). I'd be so glad if you could help me solve this issue.
import requests
from bs4 import BeautifulSoup

USER_AGENT = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

def obtener_resultados(termino_busqueda, numero_resultados, codigo_lenguaje):
    url_google = 'https://www.google.com/search?q={}&num={}&hl={}'.format(termino_busqueda, numero_resultados, codigo_lenguaje)
    respuesta = requests.get(url_google, headers=USER_AGENT)
    respuesta.raise_for_status()
    return termino_busqueda, respuesta.text

def procesar_resultados(html, palabra):
    soup = BeautifulSoup(html, 'html.parser')
    resultados_encontrados = []
    bloque = soup.find_all("div", class_="g")
    for resultado in bloque:
        titulo = resultado.find('h3').string
        resultados_encontrados.append(titulo)
    return resultados_encontrados

def scrape(termino_busqueda, numero_resultados, codigo_lenguaje):
    palabra, html = obtener_resultados(termino_busqueda, numero_resultados, codigo_lenguaje)
    resultados = procesar_resultados(html, palabra)
    return resultados

if __name__ == '__main__':
    palabra = 'Quantika14'
    h5 = (palabra, 1, "es")
    h6 = (h5[0])
    username = h6
    url = 'https://www.twitter.com/' + username
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    f = soup.find('li', class_="ProfileNav-item--followers")
    title = f.find('a')['title']
    print(title)
    g = soup.find_all('title', limit=1)
    h = soup.select('.bio', limit=1)
    title2 = g
    print(title2)
    title3 = h
    print(title3)
To get rid of the NoneType error you can apply an if/else None guard.
Example (assuming your element selection is otherwise correct; note that f itself is what comes back as None here, so guard it too):
title = f.find('a')['title'] if f and f.find('a') else None
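The pattern in full, self-contained: when soup.find misses, it returns None, and every later .find or [...] on that None raises the AttributeError from the question. The HTML snippets below are invented for illustration:

```python
from bs4 import BeautifulSoup

# One snippet where the lookup succeeds, one where every step misses
with_followers = '<li class="ProfileNav-item--followers"><a title="42 Followers">x</a></li>'
without = '<div>nothing relevant here</div>'

def follower_title(html):
    """Return the link's title, or None when any step of the lookup misses."""
    soup = BeautifulSoup(html, 'html.parser')
    f = soup.find('li', class_='ProfileNav-item--followers')
    a = f.find('a') if f else None      # guard f before calling .find on it
    return a['title'] if a else None    # guard a before indexing it

print(follower_title(with_followers))  # 42 Followers
print(follower_title(without))         # None
```

Separately, note that Twitter profile pages are rendered by JavaScript these days, so even with the guards the selector will likely find nothing in the raw HTML that requests downloads.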
I am trying to scrape a table of recent events from the following link: https://www.tapology.com/fightcenter
When visiting the link, the table shows upcoming events, so you have to click under schedule and change the option to "result".
I have scraped what appears to be the raw data below in the variable resp, but I don't know what language that code is written in and don't know how to parse it.
library(httr)

url <- "https://www.tapology.com/fightcenter_events"

fd <- list(
  group = "all",
  region = "",
  schedule = "results",
  sport = "all"
)

postdata <- POST(
  url = url, query = fd, encode = "form",
  add_headers(
    "Accept" = "text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01",
    "Content-Type" = "application/x-www-form-urlencoded; charset=UTF-8",
    "Cookie" = "_ga=GA1.2.1873043703.1537368153; __utmz=88071069.1563301531.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); remember_id=149246; remember_token=315e68b7a95fa6cda391fc3e2ae0e1fb1466335ed9a15480558bd4ef8d52d832; __utmc=88071069; __utma=88071069.1873043703.1537368153.1563983348.1563985208.3; _tapology_mma_session=Z2RWaU1XZ0hOQmIwcUhjN1Bac0twN0JZQktnVUlLUjVsVkdMMDR4bTBITGdnSDFlRW9WeHprQ2lRaWdJM0lRbW5PNTFYSG9kbVlaMWFlR3liZmEyZWhnRWVVNm03UVIwRUJLWHl1MmJXRlQ1dEFJTGJsTnVLQWx4MWpUMTJOYlBxQ1N1Y0pQREZlZTNzMDA0NTJINEpLS2FMNXZvaXZjQ3g2dFMzM1dJeTRmekc4TG5JTk9YZDlZdWx5WnpZd3luZlY1ZXliQ0RWS1B1aXJYQnpqVVp4UT09LS10am5XNVI0c0pXa2p1dHJ5OW9PME5nPT0%3D--7488fef85f733279f15da594ea47f0345aa16938",
    "Host" = "www.tapology.com",
    "Origin" = "https://www.tapology.com",
    "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
    "Referer" = "https://www.tapology.com/fightcenter",
    "X-CSRF-Token" = "NS9M1Y5RMShdIfFaIKpYiqr+JuOZ8kwZvn9KSW7daZmgT9eJ4Q0ZyGLZSUHR4wjCdiE840HcQzLHHZSe0WgVJw==",
    "X-Requested-With" = "XMLHttpRequest"
  )
)

resp <- content(postdata, "text")
substr(resp, 1, 200)
[1] "$(\".fightcenterEvents\").html(\"<h3>\\n<span>Event Results<\\/span>\\n<span class=\\'moreLink\\'> <nav class=\\\"pagination\\\" role=\\\"navigation\\\" aria-label=\\\"pager\\\">\\n \\n \\n <span class=\\\"page "
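That response is JavaScript, not HTML: a jQuery $(...).html("...") call whose argument is the real table markup as an escaped string literal. One way to recover and parse it is to pull out that string and undo the JavaScript escapes; this Python sketch runs on a shortened stand-in for resp (the regex and the sample are assumptions about the response's shape, not the full payload):

```python
import codecs
import re

# Shortened stand-in for the JS response shown above
resp = '$(".fightcenterEvents").html("<h3>\\n<span>Event Results<\\/span>\\n</h3>");'

# Pull out the string literal passed to .html(...)
m = re.search(r'\.html\("(.*)"\);?\s*$', resp, flags=re.S)
raw = m.group(1)

# Undo the JavaScript string escapes: \n becomes a newline, and the
# \/ sequences (which unicode_escape leaves alone) become plain slashes
html = codecs.decode(raw, "unicode_escape").replace("\\/", "/")

print(html)
```

Once unescaped, the result is ordinary HTML that rvest (or BeautifulSoup) can parse for the event rows; the same unescaping can of course be done in R with gsub.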