Get the text of a span in a span using BeautifulSoup - web-scraping

I'm trying to get the City, Country and Region for an IP address from this site using Beautiful Soup:
https://www.geodatatool.com/en/?ip=82.47.160.231
(Don't worry, that's not my IP; it's a dummy IP.)
This is what I'm trying:
url = "https://www.geodatatool.com/en/?ip="+ip
# Getting site's data in plain text..
sourceCode = requests.get(url)
plainText = sourceCode.text
soup = BeautifulSoup(plainText)
tags = soup('span')
# Parsing data.
data_item = soup.body.findAll('div','data-item')
#bold_item = data_item.findAll('span')
for tag in tags:
    print(tag.contents)
I just get back a list of every span's contents. I'm trying to narrow it down to the three fields I need, but I'm not getting anywhere.
Can someone help me out with this?

This should work. We find all divs with the class 'data-item'. Each of them contains two spans: the first span is the label (City:, Country:, etc.) and the second span contains the data.
data_items = soup.findAll('div', {'class': 'data-item'})
# Country
country = data_items[2].findAll('span')[1].text.strip()
# City
city = data_items[5].findAll('span')[1].text.strip()
# Region
region = data_items[4].findAll('span')[1].text.strip()
In general this works, but if the website shows different data or orders the data differently per search, we might want to make the code a bit more robust. We can do this by using regex to find the country, city and region fields. The solution to that would look as follows:
import re
# Country
country = soup.find(text=re.compile('country', re.IGNORECASE)).parent.parent.findAll('span')[1].text.strip()
# City
city = soup.find(text=re.compile('city', re.IGNORECASE)).parent.parent.findAll('span')[1].text.strip()
# Region
region = soup.find(text=re.compile('region', re.IGNORECASE)).parent.parent.findAll('span')[1].text.strip()
We search the HTML for text matching 'country', 'city' or 'region'. We then go up two parents to reach the same div that data_items held in the previous code block, and perform the same operations to get to the answer.

It's easier to do it with css selectors:
data_items = soup.select('div.sidebar-data div.data-item')
targets = ['Country:','City:','Region:']
for item in data_items:
    if item.select('span.bold')[0].text in targets:
        print(item.select('span.bold')[0].text, item.select('span')[1].text.strip())
Output:
Country: United Kingdom
Region: England
City: Plymouth
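Both answers rely on the same idea: pair the label span with the value span inside each data-item div. Below is a minimal sketch of that pairing, assuming each div holds a label span followed by a value span. The sample HTML is made up to mirror that layout and is not copied from the site. It uses the standard library's html.parser instead of BeautifulSoup only so that it runs without extra packages; DataItemParser and extract_fields are hypothetical names.

```python
from html.parser import HTMLParser


class DataItemParser(HTMLParser):
    """Collect the text of every <span> inside each <div class="data-item">."""

    def __init__(self):
        super().__init__()
        self.items = []       # one list of span texts per data-item div
        self._div_depth = 0   # > 0 while we are inside a data-item div
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            if self._div_depth:
                self._div_depth += 1          # nested div inside a data-item
            elif "data-item" in classes:
                self._div_depth = 1
                self.items.append([])
        elif tag == "span" and self._div_depth:
            self._in_span = True
            self.items[-1].append("")

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1
        elif tag == "span":
            self._in_span = False

    def handle_data(self, data):
        if self._in_span and self._div_depth:
            self.items[-1][-1] += data


def extract_fields(html):
    """Return {label: value}, assuming span 1 is the label and span 2 the value."""
    parser = DataItemParser()
    parser.feed(html)
    fields = {}
    for spans in parser.items:
        if len(spans) >= 2:
            fields[spans[0].strip().rstrip(":")] = spans[1].strip()
    return fields


# Assumed markup, written for illustration only.
SAMPLE_HTML = """
<div class="sidebar-data">
  <div class="data-item"><span class="bold">Country:</span> <span>United Kingdom</span></div>
  <div class="data-item"><span class="bold">Region:</span> <span>England</span></div>
  <div class="data-item"><span class="bold">City:</span> <span> Plymouth </span></div>
</div>
"""

fields = extract_fields(SAMPLE_HTML)
print(fields["Country"], fields["Region"], fields["City"])
```

Looking fields up by label rather than by position means the code keeps working if the site reorders the items, which is the same robustness the regex answer is after.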

Related

How to scrape options from dropdown list and store them in table?

I am trying to make an interactive dashboard with analysis, based on a car listings site. I would like the user to be able to pick a car brand, for example BMW, Audi etc., and based on this choice only be able to pick models of that brand. My problem is that after selecting a brand, I am not able to scrape the models that belong to it. The page I am scraping from:
main page --> https://www.otomoto.pl/osobowe/
sub car brand page example --> https://www.otomoto.pl/osobowe/audi/
I have tried to scrape every option, so later on I can maybe somehow clean the data to store only models
code:
otomoto_models <- paste0("https://www.otomoto.pl/osobowe/", "audi/")
models <- read_html(otomoto_models) %>%
html_nodes("option") %>%
html_text()
But it only scrapes the brands, along with the other options available on the page (engine type etc.), while after inspecting the element I can clearly see the model types.
otomoto <- "https://www.otomoto.pl/osobowe/"
brands <- read_html(otomoto) %>%
html_nodes("option") %>%
html_text()
brands <- data.frame(brands)
for (i in 1:nrow(brands)){
  no_marka_pojazdu <- i
  if(brands[i,1] == "Marka pojazdu"){
    break
  }
}
no_marka_pojazdu <- no_marka_pojazdu + 1
for (i in 1:nrow(brands)){
  zuk <- i
  if(substr(brands[i,1],1,3) == "Żuk"){
    break
  }
}
Modele_pojazdow <- as.character(brands[no_marka_pojazdu:zuk,1])
Modele_pojazdow <- removeNumbers(Modele_pojazdow)
Modele_pojazdow <- substr(Modele_pojazdow,1,nchar(Modele_pojazdow)-2)
Modele_pojazdow <- data.frame(Modele_pojazdow)
The code above only picks the supported car brands from the webpage and stores them in a data frame. With that I am able to build the URL for one selected brand.
I would like to have a similar object to "Modele_pojazdow", but with the models limited to the previously selected car brand.
Dropdown list with models appears as white box with text "Model pojazdu" next to the "Audi" box on the right side.
Some may frown on the solution language being Python, but the aim of this was to give some pointers (the high-level process). I haven't written R in a long time, so Python was quicker.
EDIT: R script now added
General outline:
The first dropdown options can be grabbed from the value attribute of each node returned by using a css selector of #param571 option. This uses an id selector (#) to target the parent dropdown select element, and then option type selector in descendant combination, to specify the option tag elements within. The html to apply this selector combination to can be retrieved by an xhr request to the url you initially provided. You want a nodeList returned to iterate over; akin to applying selector with js document.querySelectorAll.
The page uses ajax POST requests to update the second dropdown based on your first dropdown choice. Your first dropdown choice determines the value of a parameter search[filter_enum_make], which is used in the POST request to the server. The subsequent response contains a list of the available options (it includes some case alternatives which can be trimmed out).
I captured the POST request by using fiddler. This showed me the request headers and params in the request body. Screenshot sample shown at end.
The simplest way to extract the options from the response text, IMO, is to regex the appropriate string out (I wouldn't normally recommend regex for working with html but in this case it serves us nicely). If you don't want to use regex, you can grab the relevant info from the data-facets attribute of the element with id body-container. For the non-regex version you need to handle unquoted nulls, and retrieve the inner dictionary whose key is filter_enum_model. I show a function re-write, at the end, to handle this.
The retrieved string is a string representation of a dictionary. This needs converting to an actual dictionary object which you can then extract the option values from. Edit: As R doesn't have a dictionary object a similar structure needs to be found. I will look at this when converting.
I create a user-defined function, getOptions(), to return the options for each make. Each car make value comes from the list of possible items in the first dropdown. I loop over those possible values, use the function to return a list of options for each make, and add those lists as values to a dictionary, results, whose keys are the car makes. Again, for R an object with similar functionality to a Python dictionary needs to be found.
That dictionary of lists needs converting to a dataframe which includes a transpose operation to make a tidy output of headers, which are the car makes, and columns underneath each header, which contain the associated models.
The whole thing can be written to csv at the end.
So, hopefully that gives you an idea of one way to achieve what you want. Perhaps someone else can use this to help write you a solution.
Python demonstration of this below:
import requests
from bs4 import BeautifulSoup as bs
import re
import ast
import pandas as pd
headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}
def getOptions(make): #function to return options based on make
    data = {
        'search[filter_enum_make]': make,
        'search[dist]' : '5',
        'search[category_id]' : '29'
    }
    r = requests.post('https://www.otomoto.pl/ajax/search/list/', data = data, headers = headers)
    try:
        # verify the regex here: https://regex101.com/r/emvqXs/1
        data = re.search(r'"filter_enum_model":(.*),"new_used"', r.text ,flags=re.DOTALL).group(1) #regex to extract the string containing the models associated with the car make filter
        aDict = ast.literal_eval(data) #convert string representation of dictionary to python dictionary
        d = len({k.lower(): v for k, v in aDict.items()}.keys()) #find length of unique keys when accounting for case
        dirtyList = list(aDict)[:d] #trim to unique values
        cleanedList = [item for item in dirtyList if item != 'other' ] #remove 'other' as doesn't appear in dropdown
    except:
        cleanedList = [] # sometimes there are no associated values in 2nd dropdown
    return cleanedList
r = requests.get('https://www.otomoto.pl/osobowe/')
soup = bs(r.content, 'lxml')
values = [item['value'] for item in soup.select('#param571 option') if item['value'] != '']
results = {}
# build a dictionary of lists to hold options for each make
for value in values:
    results[value] = getOptions(value) #function call to return options based on make
# turn into a dataframe and transpose so each column header is the make and the options are listed below
df = pd.DataFrame.from_dict(results,orient='index').transpose()
#write to csv
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig',index = False )
Sample of csv output:
Example as sample json for alfa-romeo:
Example of regex match for alfa-romeo:
{"145":1,"146":1,"147":218,"155":1,"156":118,"159":559,"164":2,"166":39,"33":1,"Alfasud":2,"Brera":34,"Crosswagon":2,"GT":89,"GTV":7,"Giulia":251,"Giulietta":378,"Mito":224,"Spider":24,"Sportwagon":2,"Stelvio":242,"alfasud":2,"brera":34,"crosswagon":2,"giulia":251,"giulietta":378,"gt":89,"gtv":7,"mito":224,"spider":24,"sportwagon":2,"stelvio":242}
Example of the filter option list returned from function call with make parameter value alfa-romeo:
['145', '146', '147', '155', '156', '159', '164', '166', '33', 'Alfasud', 'Brera', 'Crosswagon', 'GT', 'GTV', 'Giulia', 'Giulietta', 'Mito', 'Spider', 'Sportwagon', 'Stelvio']
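The case-trimming step can be sketched in isolation. The version below is a minimal alternative, not the code used above: instead of slicing the key list by the number of unique lowercase keys (which relies on the capitalised variants being listed first), it keeps the first spelling seen for each lowercase key. The sample string is an abridged version of the alfa-romeo match shown above, with an "other" entry added as an assumption to show the filtering; unique_models is a hypothetical name.

```python
import json


def unique_models(facet_counts):
    """Return model names in order, dropping case variants and 'other'."""
    seen = set()
    models = []
    for name in facet_counts:
        key = name.lower()
        if key == "other" or key in seen:
            continue  # skip the placeholder entry and repeated spellings
        seen.add(key)
        models.append(name)
    return models


# Abridged sample; the "other" entry is assumed for illustration.
raw = '{"147":218,"GT":89,"Giulia":251,"Mito":224,"giulia":251,"gt":89,"mito":224,"other":3}'
models = unique_models(json.loads(raw))
print(models)
```

Because the matched string is valid JSON, json.loads is used here in place of ast.literal_eval; both give a dictionary for this input.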
Sample of fiddler request:
Sample of ajax response html containing options:
<section id="body-container" class="om-offers-list"
data-facets='{"offer_seek":{"offer":2198},"private_business":{"business":1326,"private":872,"all":2198},"categories":{"29":2198,"161":953,"163":953},"categoriesParent":[],"filter_enum_model":{"145":1,"146":1,"147":219,"155":1,"156":116,"159":561,"164":2,"166":37,"33":1,"Alfasud":2,"Brera":34,"Crosswagon":2,"GT":88,"GTV":7,"Giulia":251,"Giulietta":380,"Mito":226,"Spider":25,"Sportwagon":2,"Stelvio":242,"alfasud":2,"brera":34,"crosswagon":2,"giulia":251,"giulietta":380,"gt":88,"gtv":7,"mito":226,"spider":25,"sportwagon":2,"stelvio":242},"new_used":{"new":371,"used":1827,"all":2198},"sellout":null}'
data-showfacets=""
data-pagetitle="Alfa Romeo samochody osobowe - otomoto.pl"
data-ajaxurl="https://www.otomoto.pl/osobowe/alfa-romeo/?search%5Bbrand_program_id%5D%5B0%5D=&search%5Bcountry%5D="
data-searchid=""
data-keys=''
data-vars=""
Alternative version of function without regex:
from bs4 import BeautifulSoup as bs
def getOptions(make): #function to return options based on make
    data = {
        'search[filter_enum_make]': make,
        'search[dist]' : '5',
        'search[category_id]' : '29'
    }
    r = requests.post('https://www.otomoto.pl/ajax/search/list/', data = data, headers = headers)
    soup = bs(r.content, 'lxml')
    data = soup.select_one('#body-container')['data-facets'].replace('null','"null"')
    aDict = ast.literal_eval(data)['filter_enum_model'] #convert string representation of dictionary to python dictionary
    d = len({k.lower(): v for k, v in aDict.items()}.keys()) #find length of unique keys when accounting for case
    dirtyList = list(aDict)[:d] #trim to unique values
    cleanedList = [item for item in dirtyList if item != 'other' ] #remove 'other' as doesn't appear in dropdown
    return cleanedList
print(getOptions('alfa-romeo'))
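The unquoted nulls in data-facets are valid JSON, so one possible alternative to replacing 'null' and calling ast.literal_eval is to parse the attribute with json.loads, which maps null to None. This is a minimal sketch, assuming the attribute value is well-formed JSON as in the sample response above; the sample string is abridged from that response and models_from_facets is a hypothetical name.

```python
import json


def models_from_facets(data_facets):
    """Parse a data-facets attribute value and return the model-count dict."""
    facets = json.loads(data_facets)  # JSON null becomes Python None
    # Fall back to an empty dict when the make has no models listed.
    return facets.get("filter_enum_model") or {}


# Abridged from the sample ajax response; only a few keys are kept.
sample = (
    '{"offer_seek":{"offer":2198},'
    '"filter_enum_model":{"145":1,"146":1,"147":219},'
    '"new_used":{"new":371,"used":1827,"all":2198},'
    '"sellout":null}'
)

model_counts = models_from_facets(sample)
print(list(model_counts))
```

In the function above, the string passed in would be the value of soup.select_one('#body-container')['data-facets'] before any replace call.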
R conversion and improved python:
Whilst converting to R I found a better way of extracting the parameters from a js file on the server. If you open dev tools you can see the file listed in the sources tab.
R (To be improved):
library(httr)
library(jsonlite)
url <- 'https://www.otomoto.pl/ajax/jsdata/params/'
r <- GET(url)
contents <- content(r, "text")
data <- strsplit(contents, "var searchConditions = ")[[1]][2]
data <- strsplit(as.character(data), ";var searchCondition")[[1]][1]
source <- fromJSON(data)$values$'573'$'571'
makes <- names(source)
for(make in makes){
  print(make)
  print(source[make][[1]]$value)
  #break
}
Python:
import requests
import json
import pandas as pd
r = requests.get('https://www.otomoto.pl/ajax/jsdata/params/')
data = r.text.split('var searchConditions = ')[1]
data = data.split(';var searchCondition')[0]
items = json.loads(data)
source = items['values']['573']['571']
makes = [item for item in source]
results = {}
for make in makes:
    df = pd.DataFrame(source[make]) ## build a dictionary of lists to hold options for each make
    results[make] = list(df['value'])
dfFinal = pd.DataFrame.from_dict(results,orient='index').transpose() # turn into a dataframe and transpose so each column header is the make and the options are listed below
mask = dfFinal.applymap(lambda x: x is None) #tidy up None values to empty strings https://stackoverflow.com/a/31295814/6241235
cols = dfFinal.columns[(mask).any()]
for col in dfFinal[cols]:
    dfFinal.loc[mask[col], col] = ''
print(dfFinal)
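The string handling in the script above can be tried offline on a small stand-in for the js file. This is a minimal sketch under stated assumptions: the sample text is invented, the name of the second variable (searchConditionsOther) is hypothetical, and each make is assumed to map to a list of records with a "value" field, which is what the DataFrame step above implies. parse_search_conditions and models_by_make are hypothetical names.

```python
import json


def parse_search_conditions(js_text):
    """Cut the JSON literal assigned to searchConditions out of the js text."""
    start_marker = "var searchConditions = "
    end_marker = ";var searchCondition"
    body = js_text.split(start_marker, 1)[1]
    body = body.split(end_marker, 1)[0]
    return json.loads(body)


def models_by_make(conditions):
    """Map each make to its list of model values (assumed record layout)."""
    source = conditions["values"]["573"]["571"]
    return {
        make: [entry["value"] for entry in entries]
        for make, entries in source.items()
    }


# Invented stand-in for the server's js file, for illustration only.
sample_js = (
    'var searchConditions = {"values":{"573":{"571":{'
    '"audi":[{"value":"a3"},{"value":"a4"}],'
    '"bmw":[{"value":"x5"}]}}}};'
    'var searchConditionsOther = {};'
)

results = models_by_make(parse_search_conditions(sample_js))
print(results)
```

A plain dictionary of lists like this can be passed to pd.DataFrame.from_dict(results, orient='index').transpose() in the same way as in the script above.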

Matching Multiple Dog Breed DropDown Lists with Results Logic

I'm totally stuck on this. I hope some expert here can help me out.
I have a page that lists survey results. The user has to guess the top 3 breeds of a dog. Then, the results are shown. Important: the user is guessing the TOP 3 breeds of the dog and they can be in any order.
For example, the user is shown a photo of a dog and underneath the photo is a list of three dropdowns:
Dropdown_1 Dropdown_2 Dropdown_3
Each of these dropdowns contains the same list of breeds, such as Beagle, German Shepard, Pug, etc. The user then selects one (and only one) breed for each of the dropdowns.
So, in the example above, the user would select:
German Shepard Beagle Pug
Now, when the answer/response page is displayed, they will see if their guesses match the correct answers.
Obviously, it would be easy to write something like:
If (BreedChoice1 = BreedChoice1Answer And BreedChoice2 = BreedChoice2Answer And BreedChoice3 = BreedChoice3Answer) Or
(BreedChoice1 = BreedChoice1Answer And BreedChoice2 = BreedChoice3Answer And BreedChoice3 = BreedChoice2Answer) Or
(BreedChoice1 = BreedChoice2Answer And BreedChoice2 = BreedChoice1Answer And BreedChoice3 = BreedChoice3Answer) Or
(BreedChoice1 = BreedChoice2Answer And BreedChoice2 = BreedChoice3Answer And BreedChoice3 = BreedChoice1Answer) Or
(BreedChoice1 = BreedChoice3Answer And BreedChoice2 = BreedChoice1Answer And BreedChoice3 = BreedChoice2Answer) Or
(BreedChoice1 = BreedChoice3Answer And BreedChoice2 = BreedChoice2Answer And BreedChoice3 = BreedChoice1Answer) Then
Response.Write("You Guessed ALL breeds correctly!")
End If
But how would I display a message that says "You guessed two breeds correctly", and one that says "You guessed one breed correctly"?
Remember that choice 1, 2, and 3 can match the answers 1, 2 and 3 in any order.
Any advice would be appreciated! Thank you in advance.
-- Chris
To solve this it would be simplest to use an Intersect method and to section up your code a little differently.
Firstly I would suggest that we put the "answers" into a list rather than keep them as separate variables (since you suggest they can be in any order). Then you should put your submitted answers also into a collection.
You can then do a very simple Intersect to get a collection that contains the elements common to both:
List<string> breeds = new List<string>() { "Beagle", "German Shepard", "Pug" };
List<string> choices = new List<string>() { "Beagle", "Pug", "Greyhound" };
int correctAnswers = breeds.Intersect(choices).Count();
The int "correctAnswers" then tells you how many they got right. (Obviously if you are using something more complicated than a string, such as a custom "Breed" class, for the breeds you could use some Linq to check the breed name property).
You could then use a neat bit of string interpolation (the $ sign in front of the string declaration) to get your result message:
$"Congratulations, you guessed {correctAnswers} breeds correctly!";
Hope this helps!
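For comparison, the same order-independent count can be sketched in Python with a set intersection. This is only an illustration of the idea, not a port of the page's code; count_correct, result_message and NUMBER_WORDS are hypothetical names, and the wording for zero matches is an assumption.

```python
NUMBER_WORDS = {1: "one", 2: "two"}


def count_correct(answers, guesses):
    """Count guesses that appear among the answers, ignoring order."""
    return len(set(answers) & set(guesses))


def result_message(correct, total=3):
    """Build the message for a given number of correct guesses."""
    if correct == total:
        return "You guessed ALL breeds correctly!"
    if correct == 0:
        return "You did not guess any breeds correctly."  # assumed wording
    noun = "breed" if correct == 1 else "breeds"
    return f"You guessed {NUMBER_WORDS[correct]} {noun} correctly."


answers = ["Beagle", "German Shepard", "Pug"]
guesses = ["Beagle", "Pug", "Greyhound"]

correct = count_correct(answers, guesses)
message = result_message(correct)
print(message)
```

Using sets also means that picking the same breed in all three dropdowns counts as at most one correct guess.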

The headquarters label is in one <p> tag and its name is in another <p> tag whose position changes; how can I get all headquarters names correctly?

I need help correctly scraping the headquarters data from all the links on the http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/ website.
import scrapy
class ProjectSpider(scrapy.Spider):
    name = "cin100"
    allowed_domains = ['cincinnati.com']
    start_urls = ['http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/']
    def parse(self, response):
        # get selector for all 100 companies
        sel_companies = response.xpath('//p[contains(.,"Here are the companies")]/following-sibling::p/a')
        # create request for every single company detail page from href
        for sel_companie in sel_companies:
            href = sel_companie.xpath('./@href').extract_first()
            url = response.urljoin(href)
            request = scrapy.Request(url, callback=self.parse_company_detail)
            yield request
    def parse_company_detail(self, response):
        # On detail page create item
        item = ProjectItem()
        # get detail information with specific XPath statements
        # e.g. title is the first paragraph
        item['title'] = response.xpath('//div[@role="main"]/p[1]//text()').extract_first().rsplit('-')[1]
        # e.g. family owned has a label we can select
        item['owned'] = response.xpath('//div[@role="main"]/p[contains(.,"Family owned")]/text()').extract_first()
        item['Revenue2014'] ='$'+response.xpath('//div[@role="main"]/p[contains(.,"2014")]/text()').extract_first().rsplit('$')[1]
        item['Revenue2015'] ='$'+response.xpath('//div[@role="main"]/p[contains(.,"$")]/text()').extract_first().rsplit('$')[1]
        item['Website'] = response.xpath('//div[@role="main"]/p/a[contains(.,"com")]/text()').extract_first()
        item['Rank'] = response.xpath('//div[@role="main"]/p[contains(.,"rank")]/text()').extract_first()
        item['Employees'] = response.xpath('//div[@role="main"]/p[contains(.,"Employ")]/text()').extract_first()
        item['headquarters'] = response.xpath('//div[@role="main"]/p[10]//text()').extract()
        item['FoundedYear'] = response.xpath('//div[@role="main"]/p[contains(.,"founded")]/text()').extract()
        # Finally: yield the item
        yield item
Since the headquarters has a preceding paragraph labeled "Headquarters", I would use that as an anchor and select the content of the next <p> tag, like so:
//p[contains(.,"Headquarters")]/following-sibling::p[1]
Maybe you want to have a look at this XPath Tutorial to get a better understanding of the commands (and find a better solution).

If Else in R, if product ID is X then change Product Name

I am trying to figure out what is working and why the other way is not working for me.
At the moment I have a list of shops I use and I need to change the naming every time; so I have decided to go by the product_id which never changes, but my code is not working.
product_id <- vector()
This one is not working:
product_name[product_id == '40600000003'] <- 'my cool store']
but this one does work:
product_name[product_name == 'my#cool#Store'] <- 'my cool store'
Now, I am not sure what am I doing wrong, I tried to do:
if (product_id == '40600000003') {
product_name = 'my cool shop'
}
I have a list of 15 shops that I need to change the naming as they arrive in the wrong format from the api connection.
Try 40600000003 instead of '40600000003'. More than likely R is reading your vector as numbers rather than strings if it doesn't contain any characters.

xpathSApply skip if text equals postseason

I'm running into a road block here and I can't figure out what I'm doing wrong. I need to skip over the link if the text equals postseason. The text is in the second li in the xpaths below in my code.
I tried li[not(.,"postseason")] as I thought that is what I needed to exclude the postseason link but it doesn't work.
This link shows an example of what I want to exclude, under standard batting > game logs > postseason:
http://www.baseball-reference.com/players/j/jeterde01.shtml
Place http://www.baseball-reference.com/players/j/jeterde01.shtml in playerURLs and you should see the postseason link returned. How can I skip over the postseason link? Thanks!
#GET YEARS PLAYED LINKS
yplist = NULL
playerURLs <- paste("http://www.baseball-reference.com",datafile17[,c("hrefs")],sep="")
for(thisplayerURL in playerURLs){
  doc <- htmlParse(thisplayerURL)
  yplinks <- data.frame(
    names = xpathSApply(doc, '//*[@id="all_standard_batting"]/div//ul/li[2]/ul/li/a',xmlValue),
    hrefs = xpathSApply(doc, '//*[@id="all_standard_batting"]/div/ul/li[2]/ul/li/a',xmlGetAttr,'href'))
  yplist = rbind(yplist, yplinks)
}
I'm not familiar with the R language specifically, but from an XPath point of view you can use a . != "..." or not(contains(.,"...")) predicate to exclude elements with a specific inner text value.
The following will exclude <li> elements whose inner text exactly equals "postseason":
li[. != "postseason"]
This one will exclude <li> elements whose inner text contains "postseason":
li[not(contains(.,"postseason"))]
