I am trying to scrape a table of recent events from the following link: https://www.tapology.com/fightcenter
When visiting the link, the table shows upcoming events, so you have to click the "Schedule" filter and change the option to "Results".
I have scraped what appears to be the raw data (stored below in the variable resp), but I don't know what language that code is written in or how to parse it.
library(httr)

url <- "https://www.tapology.com/fightcenter_events"
fd <- list(
  group = "all",
  region = "",
  schedule = "results",
  sport = "all"
)
postdata <- POST(
  url = url, query = fd, encode = "form",
  add_headers(
    "Accept" = "text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01",
    "Content-Type" = "application/x-www-form-urlencoded; charset=UTF-8",
    "Cookie" = "_ga=GA1.2.1873043703.1537368153; __utmz=88071069.1563301531.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); remember_id=149246; remember_token=315e68b7a95fa6cda391fc3e2ae0e1fb1466335ed9a15480558bd4ef8d52d832; __utmc=88071069; __utma=88071069.1873043703.1537368153.1563983348.1563985208.3; _tapology_mma_session=Z2RWaU1XZ0hOQmIwcUhjN1Bac0twN0JZQktnVUlLUjVsVkdMMDR4bTBITGdnSDFlRW9WeHprQ2lRaWdJM0lRbW5PNTFYSG9kbVlaMWFlR3liZmEyZWhnRWVVNm03UVIwRUJLWHl1MmJXRlQ1dEFJTGJsTnVLQWx4MWpUMTJOYlBxQ1N1Y0pQREZlZTNzMDA0NTJINEpLS2FMNXZvaXZjQ3g2dFMzM1dJeTRmekc4TG5JTk9YZDlZdWx5WnpZd3luZlY1ZXliQ0RWS1B1aXJYQnpqVVp4UT09LS10am5XNVI0c0pXa2p1dHJ5OW9PME5nPT0%3D--7488fef85f733279f15da594ea47f0345aa16938",
    "Host" = "www.tapology.com",
    "Origin" = "https://www.tapology.com",
    "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
    "Referer" = "https://www.tapology.com/fightcenter",
    "X-CSRF-Token" = "NS9M1Y5RMShdIfFaIKpYiqr+JuOZ8kwZvn9KSW7daZmgT9eJ4Q0ZyGLZSUHR4wjCdiE840HcQzLHHZSe0WgVJw==",
    "X-Requested-With" = "XMLHttpRequest"
  )
)
resp <- content(postdata, "text")
substr(resp, 1, 200)
[1] "$(\".fightcenterEvents\").html(\"<h3>\\n<span>Event Results<\\/span>\\n<span class=\\'moreLink\\'> <nav class=\\\"pagination\\\" role=\\\"navigation\\\" aria-label=\\\"pager\\\">\\n \\n \\n <span class=\\\"page "
Good evening everybody,
I am in a fight with Power Query. I am trying to receive an answer from a website in Power BI (Power Query). For obvious reasons I renamed passwords and such. When I run the code below, Power Query answers:
Expression.Error: The 'Content-Length' header must be modified using the
appropriate property or method. Parameter name: name Details: 103
It does not matter what number I use. I also tried removing the header entirely, but that results in:
DataSource.Error: Web.Contents could not retrieve the content from
'https://xxxx.nl/api/token/' (500): Internal Server Error Details:
DataSourceKind=Web DataSourcePath=https://xxxx.nl/api/token
Url=https://xxxx.nl/api/token/
I must be missing something, but I cannot figure out what it is. Could you help me find it? Thanks in advance!
let
    url = "https://xxxxx.nl/api/token/",
    body = Text.ToBinary("{""""username"""":""""xxxx"""",""""password"""":""""xxxx"""",""""group"""":""""xxxx"""",""""deleteOtherSessions"""":false"),
    Data = Web.Contents(
        url,
        [
            Headers = [
                #"authority" = "xxx.nl",
                #"method" = "POST",
                #"path" = "/api/Token",
                #"scheme" = "https",
                #"accept" = "application/json",
                #"accept-encoding" = "gzip, deflate",
                #"transfer-encoding" = "deflate",
                #"accept-language" = "nl-NL,nl;q=0.7",
                #"cache-control" = "no-cache",
                #"content-length" = "103",
                #"content-type" = "application/json",
                #"expires" = "Sat, 01 Jan 2000 00:00:00 GMT",
                #"origin" = "https://xxxx.nl",
                #"pragma" = "no-cache",
                #"referer" = "https://xxxxx.nl/",
                #"sec-fetch-dest" = "empty",
                #"sec-fetch-mode" = "cors",
                #"sec-fetch-site" = "same-origin",
                #"sec-gpc" = "1",
                #"user-agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
            ],
            Content = body
        ]
    )
in
    Data
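For what it's worth: Content-Length (like Transfer-Encoding) is a restricted header that the underlying web stack computes itself, which is exactly what the Expression.Error above is complaining about, so the fix is to drop it rather than change the number. Separately, the JSON body as written is missing its closing } and, because "" escapes a single quote inside an M string, the quadrupled quotes produce doubled quotes in the JSON the server sees, which would plausibly explain the 500. A minimal sketch with both repaired, leaving out the browser-mimicking headers on the assumption the API doesn't require them:
let
    url = "https://xxxxx.nl/api/token/",
    // "" escapes one quote inside an M string, so each JSON quote needs
    // exactly two; the original string also lacked the closing }
    body = Text.ToBinary("{""username"":""xxxx"",""password"":""xxxx"",""group"":""xxxx"",""deleteOtherSessions"":false}"),
    Data = Web.Contents(
        url,
        [
            Headers = [
                #"accept" = "application/json",
                #"content-type" = "application/json"
            ],
            Content = body
        ]
    ),
    Parsed = Json.Document(Data)
in
    Parsed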
I'm trying to scrape data from this address:
http://extranet.artesp.sp.gov.br/TransporteColetivo/OrigemDestino?fbclid=IwAR3_hZwajHk_iyU085S1LDTqLCOYLHIZ5K825XgPGcB4tMI0EuCJpQNrJHM#
There are two drop-downs ("Origem" and "Destino"). I need to generate a database with all possible combinations of "Origem" and "Destino".
Below is part of my code in R. I'm not able to select an option within the drop-down menus, which I need in order to build a loop and extract the data.
Any suggestions?
library(RSelenium)

# start a Selenium server and browser session (rs_driver_object was not
# defined in the original snippet; this is the usual setup, and rsDriver
# already opens the browser, so no separate remDr$open() is needed)
rs_driver_object <- rsDriver(browser = "chrome", verbose = FALSE)
remDr <- rs_driver_object$client
remDr$navigate("http://extranet.artesp.sp.gov.br/TransporteColetivo/OrigemDestino?fbclid=IwAR3_hZwajHk_iyU085S1LDTqLCOYLHIZ5K825XgPGcB4tMI0EuCJpQNrJHM#")

Origem <- remDr$findElement(using = 'id', 'Origem')
Destino <- remDr$findElement(using = 'id', 'Destino')
botão_pesquisar <- remDr$findElement(using = 'id', 'btnPesquisar')
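For the record, selecting an entry in one of those drop-downs with RSelenium can be done by clicking the <option> element directly; a short sketch (the value '387' is just an example, matching the request shown below):
# click an <option> inside the "Origem" <select> by its value attribute
opcao <- remDr$findElement(using = 'css selector', "#Origem option[value='387']")
opcao$clickElement()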
Grab the values (which are the location IDs) in each combo box, giving you two arrays (from and to); make sure to capture the labels as well. This page calls an endpoint with the two IDs posted as parameters - the call looks like this:
library(RCurl)

headers = c(
  "Accept" = "application/json, text/javascript, */*; q=0.01",
  "Accept-Language" = "en-US,en;q=0.9",
  "Connection" = "keep-alive",
  "Content-Type" = "application/x-www-form-urlencoded; charset=UTF-8",
  "Cookie" = "__RequestVerificationToken_L1RyYW5zcG9ydGVDb2xldGl2bw2=tY-yKlWmbZvAJzMHmITkohPiIos5XkjDBwf1ZBfP_bYWdXJMBF2Qw3z_B-LRVo0kXjdnHqDqsbZ04Zij_PM-wAf4DWVKfnQskOhqo4ANSRc1",
  "Origin" = "http://extranet.artesp.sp.gov.br",
  "Referer" = "http://extranet.artesp.sp.gov.br/TransporteColetivo/OrigemDestino?fbclid=IwAR3_hZwajHk_iyU085S1LDTqLCOYLHIZ5K825XgPGcB4tMI0EuCJpQNrJHM",
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
  "X-Requested-With" = "XMLHttpRequest"
)
params = "origem=387&destino=388&__RequestVerificationToken=Z-wXmGOb9pnQbmkfcQXmChT-6uc3YfGjftHwK4HnC9SDCaKmzIafo7AI3lChBY6YDBHdpT_X98mSHGAr_YrTNgKiepKxKraGu7p6PI7dV4g1"
res <- postForm(
  "http://extranet.artesp.sp.gov.br/TransporteColetivo/OrigemDestino/GetGrid",
  .opts = list(postfields = params, httpheader = headers, followlocation = TRUE),
  style = "httppost"
)
cat(res)
See the origem= and destino= parameters? Those are the values from the static combo box fields, so it would be easy to do this whole thing via simple web requests. The response for each call will look like this:
[
  {
    "Codigo": 0,
    "Empresa": {
      "Codigo": 447,
      "Descricao": "VIAÇÃO VALE DO TIETE LTDA",
      "FlagCNPJ": false,
      "CNPJ": null,
      "CPF": null,
      "Fretamento": null,
      "Escolar": null,
      "Municipio": null,
      "UF": null,
      "Endereco": null,
      "Bairro": null,
      "CEP": null,
      "Telefone": null,
      "Email": null
    },
    "CodigoMunicipioOrigem": 387,
    "CodigoMunicipioDestino": 388
  }
]
So when a trip is found, you'll get an array of entries (records for the carriers serving that pair, I assume); the endpoint returns an empty (null) array when the origin and destination have no schedules.
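Putting the pieces together, a sketch of the whole loop with httr and rvest (the selectors and token handling are assumptions; the antiforgery token may also need its matching cookie, in which case do the initial GET with httr and reuse the session cookies):
library(httr)
library(jsonlite)
library(rvest)

base <- "http://extranet.artesp.sp.gov.br/TransporteColetivo/OrigemDestino"
page <- read_html(base)

# option values are the location IDs; drop the empty placeholder entries
origens  <- setdiff(html_attr(html_nodes(page, "#Origem option"), "value"), "")
destinos <- setdiff(html_attr(html_nodes(page, "#Destino option"), "value"), "")
token    <- html_attr(html_node(page, "input[name='__RequestVerificationToken']"), "value")

results <- list()
for (o in origens) {
  for (d in destinos) {
    res <- POST(
      paste0(base, "/GetGrid"),
      body = list(origem = o, destino = d, `__RequestVerificationToken` = token),
      encode = "form"
    )
    rows <- fromJSON(content(res, "text"))
    if (length(rows) > 0) results[[paste(o, d, sep = "-")]] <- rows
    Sys.sleep(1)  # be polite to the server
  }
}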
I'm attempting to scrape a JSON object. Referencing Extract data from an unreachable JsonObject(), I'm able to get a status code of 200, but I'm not getting any results. I assume there is something wrong with how I'm constructing the query.
Code I have so far:
library(httr) # web scraping
library(jsonlite) #parsing json data
library(rvest) # web scraping
library(polite) # check robot.txt files
library(tidyverse) # data wrangling
library(curlconverter) # decode curl commands
library(urltools) # URL encoding
r <- POST(
  url = "https://vfm4x0n23a-dsn.algolia.net/1/indexes/*/queries",
  add_headers(
    # .headers = c(
    #   'Accept' = "*/*",
    #   'Accept-Encoding' = 'gzip, deflate, br',
    #   'Accept-Language' = "en-US,en;q=0.9",
    #   'Cache-Control' = "no-cache",
    #   'Connection' = "keep-alive",
    #   'Content-Length' = '450',
    #   'Content-Type' = 'application/x-www-form-urlencoded',
    #   'Host' = 'vfm4x0n23a-dsn.algolia.net',
    #   'Origin' = "https://www.iheartjane.com",
    #   'Pragma' = "no-cache",
    #   'Referer' = "https://www.iheartjane.com/",
    #   'sec-ch-ua' = '".Not/A)Brand"";v=99""Google Chrome";v="103",Chromium";v="103"',
    #   'sec-ch-ua-mobile' = "?0",
    #   'sec-ch-ua-platform' = "Windows",
    #   'Sec-Fetch-Dest' = "empty",
    #   'Sec-Fetch-Mode' = "cors",
    #   'Sec-Fetch-Site' = "cross-site",
    'User-Agent' = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"#,
    #   'x-algolia-api-key' = "b499e29eb7542dc373ec0254e007205d",
    #   'x-algolia-application-id' = "VFM4X0N23A"
    # )
  ),
  config = list(
    'x-algolia-agent' = 'Algolia for JavaScript (4.13.0); Browser; JS Helper (3.7.4); react (16.14.0); react-instantsearch (6.23.1)',
    'x-algolia-application-id' = 'VFM4X0N23A',
    'x-algolia-api-key' = 'b499e29eb7542dc373ec0254e007205d'
  ),
  body = FALSE,
  encode = 'form',
  query = '{"requests":[{"indexName":"menu-products-production","params":"query=highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&page=0&hitsPerPage=48&filters=store_id%20%3D%201641%20AND%20kind%3A%22sale%22%20OR%20root_types%3A%22sale%22&optionalFilters=brand%3AVerano%2Cbrand%3AAvexia%2Cbrand%3AEncore%20Edibles%2Croot_types%3AFeatured%2Croot_types%3ASale&userToken=Zu0iU4Uo2whpmqNBjUGOJ&facets=%5B%5D&tagFilters="}]}',
  verbose()
)
json <- content(r, type = "application/json")
[Screenshots of the payload and query parameters, and of the website, omitted.]
How can I restructure my code to send the query correctly?
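One likely culprit, judging from the code: the JSON payload is passed as query (so it ends up URL-encoded on the query string) with body = FALSE, and config = list(...) is not how httr sends query parameters. Algolia's multi-query REST endpoint expects the JSON as the POST body, with the application ID and API key sent as x-algolia-* query parameters or headers. A sketch of that restructuring (the params string here is trimmed down; reuse the full one from above):
library(httr)

payload <- '{"requests":[{"indexName":"menu-products-production","params":"query=&page=0&hitsPerPage=48&filters=store_id%20%3D%201641"}]}'

r <- POST(
  url = "https://vfm4x0n23a-dsn.algolia.net/1/indexes/*/queries",
  query = list(
    `x-algolia-agent` = "Algolia for JavaScript (4.13.0); Browser",
    `x-algolia-application-id` = "VFM4X0N23A",
    `x-algolia-api-key` = "b499e29eb7542dc373ec0254e007205d"
  ),
  body = payload,        # the JSON goes in the body, not the query string
  content_type_json()
)
json <- content(r, type = "application/json")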
In this program I'm trying to get all the prices of rentals in Ottawa, but it just returns a single price, seemingly random each time. Why?
import scrapy

class RentalPricesSpider(scrapy.Spider):
    name = 'rental_prices'
    allowed_domains = ['www.kijiji.ca']
    start_urls = ['https://www.kijiji.ca/b-real-estate/ottawa/c34l1700185']

    def parse(self, response):
        rental_price = response.xpath('normalize-space(//div[@class="price"]/text())').getall()
        yield {
            'rent': rental_price,
        }
You are selecting with the wrong XPath; that is why you are not getting the expected output. normalize-space() collapses the whole matched node-set into one string, so .getall() only ever returns a single item. Use the CSS selector div.price::text instead of that XPath.
import scrapy
from scrapy.crawler import CrawlerProcess

class RentalPricesSpider(scrapy.Spider):
    name = 'rental_prices'
    allowed_domains = ['www.kijiji.ca']
    start_urls = ['https://www.kijiji.ca/b-real-estate/ottawa/c34l1700185']

    def parse(self, response):
        # grab every price text node, then drop whitespace-only entries
        rental_price = response.css('div.price::text').getall()
        rental_price = [x.strip() for x in rental_price if x.strip()]
        yield {
            'rent': rental_price,
        }

process = CrawlerProcess(settings={
    "USER_AGENT": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})
process.crawl(RentalPricesSpider)
process.start()
This is my code; the last request gets a 500 error most of the time. Can somebody explain to me why this is happening? I'm a newbie here!
headers = {"Accept-Encoding": "gzip, deflate, sdch, br",
"Accept-Language": "en-GB,en-US;q=0.8,en;q=0.6",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Connection": "keep-alive --compressed"}
url_ = "https://www.apct.gov.in/apportal/index.aspx"
r = requests.get(url_,headers=headers)
#open("/tmp/1.html","w").write(r.content)
c1 = r.cookies
headers = {
'Accept-Language': 'en-GB,en-US;q=0.8,en;q=0.6',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Referer': 'https://www.apct.gov.in/apportal/index.html',
'Connection': 'keep-alive --compressed'}
url = "https://www.apct.gov.in/apportal/index.aspx"
r1 = requests.get(url,headers=headers,cookies=c1)
c2 = r1.cookies
# f = open("/tmp/1.html","w+")
# f.write(r.content)
url = "https://www.apct.gov.in/apportal/Search/ViewAPVATdealers.aspx"
time.sleep(2.5)
headers = { 'Accept-Encoding': 'gzip, deflate, sdch, br',
'Accept-Language': 'en-GB,en-US;q=0.8,en;q=0.6',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Referer': 'https://www.apct.gov.in/apportal/index.html',
'Connection': 'keep-alive'}
#r = requests.get(url,headers=headers,cookies=c2,timeout=60,verify=True)
Cert ='certifi.old_where()'
r = requests.get('https://www.apct.gov.in/apportal/Search/ViewAPVATdealers.aspx', headers=headers, cookies=c2,verify=certifi.old_where())
r = requests.get('https://www.apct.gov.in/apportal/Search/ViewAPVATdealers.aspx', headers=headers, cookies=c2, timeout=60,verify=True)
soup = BeautifulSoup(r.content)
try:
view_state = soup.find('input', attrs={'id': '__VIEWSTATE'}).get("value")
except:
time.sleep(2.5)
r = requests.get('https://www.apct.gov.in/apportal/Search/ViewAPVATdealers.aspx', headers=headers, cookies=c2, timeout=60,verify=True)
#pdb.set_trace()
# r = requests.get(url,headers=headers,cookies=c2,timeout=60, verify=False)
c3 = r.cookies
soup = BeautifulSoup(r.content)
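There's no definitive answer without the server's logs, but two observations: 'keep-alive --compressed' in the original headers is a curl flag pasted into a header value, and the manual cookie juggling can be handed off to requests.Session, which carries cookies across requests automatically. A sketch with a simple retry, assuming the 500s are intermittent on the server side:
import time

import requests
from bs4 import BeautifulSoup

# A Session keeps cookies between requests, replacing the manual c1/c2/c3 passing.
s = requests.Session()
s.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-GB,en-US;q=0.8,en;q=0.6",
})

# Seed the session cookies from the portal home page first.
s.get("https://www.apct.gov.in/apportal/index.aspx", timeout=60)

# Retry the flaky page a few times with a growing pause between attempts.
r = None
for attempt in range(3):
    r = s.get("https://www.apct.gov.in/apportal/Search/ViewAPVATdealers.aspx", timeout=60)
    if r.status_code == 200:
        break
    time.sleep(2.5 * (attempt + 1))

soup = BeautifulSoup(r.content, "html.parser")
view_state = soup.find("input", attrs={"id": "__VIEWSTATE"}).get("value")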