Retrieving between > < soup

I am trying to retrieve this information: https://i.stack.imgur.com/uEV1g.png
from this website: https://www.skiinfo.fr/vorarlberg/golm/station-de-ski.html
Below is the code I have so far. I successfully find the element on the page and retrieve the name of each item, but I cannot work out how to retrieve the values (30, 50, 20, 14, 44, 9.2, 44): they have no identifier, and all I know is that they sit between "> <". (I should say I'm quite new to this!)
import requests
from bs4 import BeautifulSoup

link = "https://www.skiinfo.fr/vorarlberg/golm/station-de-ski.html"
soup = BeautifulSoup(requests.get(link).content, "html.parser")
# The <ul> that holds the trail statistics
div = soup.find('ul', {"class": "rt_trail circles"})
children = div.findChildren("li", recursive=False)
for child in children:
    # child.p is the first <p> inside the <li>
    print(child.p.string)
which gives me:
Pistes vertes
Pistes bleues
Pistes rouges
Pistes noires
Pistes
Km pistes
Snowparks
Piste la plus longue
Domaine skiable
Neige de culture
Canons à neige (km pistes)
Could someone explain what to do?

for child in children:
    print(child.p.string)
is equivalent to
for child in children:
    print(child.find("p").string)
so you are assuming that the values are inside the first p tag of each child, which is not the case: the first p tag holds the name. Instead, search for the first p tag with the CSS class value:
for child in children:
    print(child.find("p", class_="value").text)
This yields
30%
50%
20%
14
44 km
9.2 km
44 km
Can you go on from here?
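One possible way to go on from here is to pair each name with its value in a dictionary. The sketch below runs on a small inline HTML snippet rather than the live page; the markup, including the label class, is an assumption modelled on the question, and the real page may be structured differently. It also guards against list items that have no value tag at all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question; the real page may differ.
html = """
<ul class="rt_trail circles">
  <li><p class="label">Pistes vertes</p><p class="value">30%</p></li>
  <li><p class="label">Pistes bleues</p><p class="value">50%</p></li>
  <li><p class="label">Snowparks</p></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
trail_list = soup.find("ul", {"class": "rt_trail circles"})

stats = {}
for item in trail_list.find_all("li", recursive=False):
    label = item.find("p")                  # first <p>: the name
    value = item.find("p", class_="value")  # may be missing for some items
    stats[label.get_text(strip=True)] = value.get_text(strip=True) if value else None

print(stats)
```

Checking for a missing value tag avoids an AttributeError on items that only carry a name.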

Related

Get the text of a span in a span using BeautifulSoup

I'm trying to get the city, country and region from this site using Beautiful Soup:
https://www.geodatatool.com/en/?ip=82.47.160.231
(Don't worry, that's not my IP; it's a dummy one.)
This is what I'm trying:
import requests
from bs4 import BeautifulSoup

ip = "82.47.160.231"  # the dummy IP from the URL above
url = "https://www.geodatatool.com/en/?ip=" + ip
# Get the site's data as plain text.
sourceCode = requests.get(url)
plainText = sourceCode.text
soup = BeautifulSoup(plainText, "html.parser")
tags = soup('span')
# Parse the data.
data_item = soup.body.findAll('div', 'data-item')
#bold_item = data_item.findAll('span')
for tag in tags:
    print(tag.contents)
I just get back an array with the content of every span. I'm trying to narrow it down to just what I need, but I'm not getting anywhere.
Can someone help me out with this?

This should work. We find all divs with the class 'data-item'; each one contains two spans, where the first span holds the label (City:, Country:, etc.) and the second span holds the data.
data_items = soup.findAll('div', {'class': 'data-item'})
# Country
country = data_items[2].findAll('span')[1].text.strip()
# City
city = data_items[5].findAll('span')[1].text.strip()
# Region
region = data_items[4].findAll('span')[1].text.strip()
In general this works, but if the website shows different data or orders the data differently per search, the fixed indices will break, so we might want to make the code more robust. We can do this by using a regex (this needs import re) to find the country, city and region fields. That solution looks as follows:
# Country
country = soup.find(text=re.compile('country', re.IGNORECASE)).parent.parent.findAll('span')[1].text.strip()
# City
city = soup.find(text=re.compile('city', re.IGNORECASE)).parent.parent.findAll('span')[1].text.strip()
# Region
region = soup.find(text=re.compile('region', re.IGNORECASE)).parent.parent.findAll('span')[1].text.strip()
We search the HTML for the pattern 'country', 'city' or 'region', then go up two parents to reach the same element as the data_items in the previous code block, and apply the same operations to get the answer.

It's easier to do it with CSS selectors:
data_items = soup.select('div.sidebar-data div.data-item')
targets = ['Country:', 'City:', 'Region:']
for item in data_items:
    if item.select('span.bold')[0].text in targets:
        print(item.select('span.bold')[0].text, item.select('span')[1].text.strip())
Output:
Country: United Kingdom
Region: England
City: Plymouth
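If you want the values in variables rather than printed, the same selector can feed a dictionary keyed by label. This is only a sketch on an inline snippet: the markup is an assumption reconstructed from the selectors used above, and the fourth item is a made-up placeholder to show that unwanted fields are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reconstructed from the selectors above; the live page may differ.
html = """
<div class="sidebar-data">
  <div class="data-item"><span class="bold">Country:</span> <span>United Kingdom</span></div>
  <div class="data-item"><span class="bold">Region:</span> <span>England</span></div>
  <div class="data-item"><span class="bold">City:</span> <span>Plymouth</span></div>
  <div class="data-item"><span class="bold">Other:</span> <span>ignored</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
wanted = {"Country:", "City:", "Region:"}

location = {}
for item in soup.select("div.sidebar-data div.data-item"):
    spans = item.find_all("span")
    if len(spans) < 2:
        continue  # skip items that do not have a label/value pair
    label = spans[0].get_text(strip=True)
    if label in wanted:
        # Drop the trailing colon so the keys read "Country", "City", "Region".
        location[label.rstrip(":")] = spans[1].get_text(strip=True)

print(location)
```

Looking fields up by label means the code keeps working if the site reorders them.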

How to include child classes in OQL query in ewam?

I have a parent class, say A, with four child classes, say B, C, D and E. All the child classes are persistent. However, when I write the query below, instances of the child classes are not picked up.
OQL select * from A using the cursor
Do I have to write an individual query for each child class?

Use ++
Appending ++ to the class name makes the query look through all instances of the class as well as all instances of the class's descendants (that is, it includes all child classes in the search).
Examples:
forEach curCar in OQL select * from x in aVehicle++
    where x.Color = cRed
    curCar.Price += 100
endFor
forEach curPerson in OQL select * from x in aPerson++
    where x.myAddress.City like '%New ' Order by x.Name
    WriteLn(curPerson)
endFor
Please also see wTECH101 day 5 "101A-OQL-Search.pptx" as well as
eWAM help - search for OQL

Arangodb traversal to include head vertex

I'm using GRAPH_TRAVERSAL to get the path from a list of nodes to the head of the tree. This works perfectly except when the example happens to be the head of the tree. In this case, the edgeCollection doesn't have an inbound entry for this object so it doesn't appear in the results.
FOR v IN GRAPH_TRAVERSAL('gdp2',
[{_id:'pmsite/14419285155'}],
'inbound',{edgeCollection:'child'})
RETURN v
The result is an empty list: []
Is there a way I can guarantee that the starting node is on the list? It would be a pain to go through the list of examples to segregate which ones are at the head of a tree.

The problem is within the query itself. It contains a subtle error which is hard to spot:
[{_id:pmsite/14419285155}],
This is missing the quotes around pmsite/14419285155.
What this query really does is divide (probably the count of) the collection pmsite by the number 14419285155 and pass the result in as {_id: divcount}.
If you add the missing quotes, the query should do exactly what you want there. (Edit: the quotes were present in the original query; I fixed the post.)
Hint: db._explain() gives information about that.
Trying to reproduce this with the knows sample graph:
arangosh> var examples = require("org/arangodb/graph-examples/example-graph.js");
arangosh> var g = examples.loadGraph("knows_graph");
arangosh> db._query("FOR e IN GRAPH_TRAVERSAL('knows_graph', [{_id: 'persons/eve'}], 'inbound', {edgeCollection: 'knows'}) return e").toArray()
[
  [
    {
      "vertex" : {
        "_id" : "persons/eve",
        "_rev" : "1405497100114",
        "_key" : "eve",
        "name" : "Eve"
      }
    }
  ]
]
However what creates a somewhat similar behaviour is to use a collection not part of the graph definition:
arangosh> db._create("othercol")
arangosh> db.othercol.save({_key: "1" })
arangosh> db._query("FOR e IN GRAPH_TRAVERSAL('knows_graph', [{_id: 'othercol/1'}], 'inbound', {edgeCollection: 'knows'}) return e").toArray()
[ ]
As pointed out in the comments, edge relations have a direction. If you want edges pointing in both directions, you need to create a second relation in the other direction. Edges that do not fulfil the edge definitions may be ignored.
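To make the expected behaviour concrete, here is a plain-Python model of an inbound traversal that always returns the start vertex first, even when nothing points at it. This is not ArangoDB's API: the function name and the edge list (taken to be the edges of the knows sample graph) are assumptions for illustration, and it returns visited vertices rather than the path objects GRAPH_TRAVERSAL produces.

```python
from collections import deque

# Assumed edges of the 'knows' sample graph as (from, to) pairs, for illustration only.
KNOWS = [
    ("alice", "bob"),
    ("bob", "charlie"),
    ("bob", "dave"),
    ("eve", "alice"),
    ("eve", "bob"),
]


def inbound_traversal(start, edges):
    """Return the start vertex followed by every vertex that can reach it."""
    visited = [start]          # the start vertex is always part of the result
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for source, target in edges:
            # Follow edges against their direction: from target back to source.
            if target == current and source not in visited:
                visited.append(source)
                queue.append(source)
    return visited


print(inbound_traversal("eve", KNOWS))
print(inbound_traversal("charlie", KNOWS))
```

In this model a head vertex such as eve, which has no inbound edges, still comes back as a one-element result, which mirrors the output shown above for persons/eve.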

Calculating distance of context node from nearest preceding text node

I wish to record how many elements intervene between a context node and the nearest preceding text node. What I do now is the following:
xquery version "3.0";
let $text :=
<text>
<p n="1">text-node<pb/><lb/><seg>text-node<context-node>context-node</context-node>text-node</seg></p>
<p n="2">text-node<pb/><lb/><seg><context-node>context-node</context-node>text-node</seg></p>
<p n="3">text-node<pb/><seg><lb/><context-node>context-node</context-node>text-node</seg></p>
<p n="4">text-node<pb/><seg>text-node<lb/><context-node>context-node</context-node>text-node</seg></p>
<p n="5">text-node<pb/>text-node<seg><lb/><context-node>context-node</context-node>text-node</seg></p>
<p n="6"><seg>text-node<pb/><lb/><context-node>context-node</context-node>text-node</seg></p>
<p n="7">text-node<seg><pb/><lb/><context-node>context-node</context-node>text-node</seg></p>
<p n="8">text-node<seg><pb/><cb/><lb/><context-node>context-node</context-node>text-node</seg></p>
</text>
let $predicate := 'context-node'
return
    for $element at $i in $text/element()
    let $context-node := $element//element()[. eq $predicate]
    return
        <context-text-distance n="{$i}">{
            if ($context-node/preceding-sibling::node()[1] instance of text() or $context-node/parent::element()/child::node()[1] is $context-node)
            then 1
            else
                if ($context-node/preceding-sibling::node()[2] instance of text() or $context-node/parent::element()/child::node()[2] is $context-node)
                then 2
                else
                    if ($context-node/preceding-sibling::node()[3] instance of text() or $context-node/parent::element()/child::node()[3] is $context-node)
                    then 3
                    else ()
        }</context-text-distance>
This returns the right answers, and of course I could go on like this (a number higher than 5 is extremely unlikely), but I am curious whether it is possible to calculate this distance without testing each possibility individually.
I take the context node, an element node, as my starting point. I want to see how many preceding sibling nodes have to be traversed before reaching either a text node sibling of the context node or the start of its parent element. If the context node is immediately preceded by a text node, or is the first child of its parent, the distance is 1. Thus in example 1 the preceding sibling of the context node is a text node, so the distance is 1. In example 2, a seg element is the parent of the context node, and since the context node is its first child, the distance is again 1. In example 3, the start of the enclosing seg element comes before the empty lb element, so the distance is 2. In example 4, a text node comes before the empty lb element, so the distance is again 2. Example 5 is effectively a repeat of example 3 and could be deleted. In example 6, two empty elements intervene between the context node and the nearest preceding text node, so the distance is 3. In example 7, the start of the parent seg element has the same effect on the count as the text node in example 6. Example 8 returns empty, since the distance is 4, which my code does not cover.
I use this to determine the order in which to extract and insert standoff markup in text. Empty elements have the same offset, but have to be inserted in a determinate order. All offsets are calculated relative to text nodes. A context node that is the first child of its parent has an offset of 0, as if there were a text node.

I think the return statement has to be something like the following in order to get the same result:
if ($context-node/preceding-sibling::text())
then
    <context-text-distance type="1" n="{$context-node/ancestor::p/@n}">{
        for $node at $i in $context-node/preceding-sibling::text()[1]/following-sibling::node()
        return
            if ($node is $context-node) then $i else ()
    }</context-text-distance>
else
    <context-text-distance type="2" n="{$context-node/ancestor::p/@n}">{
        count($context-node/preceding-sibling::node()) + 1
    }</context-text-distance>
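The same rule can be cross-checked outside XQuery. The following is a minimal Python sketch using the standard library's xml.dom.minidom, not a translation of the query above; the function name distance_to_text is made up for illustration. It walks back through the preceding siblings of each context node and stops at the first text node or at the start of the parent.

```python
from xml.dom import Node, minidom

# The eight sample paragraphs from the question, without added whitespace.
SAMPLE = (
    '<text>'
    '<p n="1">text-node<pb/><lb/><seg>text-node<context-node>context-node</context-node>text-node</seg></p>'
    '<p n="2">text-node<pb/><lb/><seg><context-node>context-node</context-node>text-node</seg></p>'
    '<p n="3">text-node<pb/><seg><lb/><context-node>context-node</context-node>text-node</seg></p>'
    '<p n="4">text-node<pb/><seg>text-node<lb/><context-node>context-node</context-node>text-node</seg></p>'
    '<p n="5">text-node<pb/>text-node<seg><lb/><context-node>context-node</context-node>text-node</seg></p>'
    '<p n="6"><seg>text-node<pb/><lb/><context-node>context-node</context-node>text-node</seg></p>'
    '<p n="7">text-node<seg><pb/><lb/><context-node>context-node</context-node>text-node</seg></p>'
    '<p n="8">text-node<seg><pb/><cb/><lb/><context-node>context-node</context-node>text-node</seg></p>'
    '</text>'
)


def distance_to_text(context):
    """Count steps back to the nearest preceding text sibling or the parent's start."""
    steps = 1
    sibling = context.previousSibling
    # Each non-text sibling we have to skip adds one step.
    while sibling is not None and sibling.nodeType != Node.TEXT_NODE:
        steps += 1
        sibling = sibling.previousSibling
    return steps


doc = minidom.parseString(SAMPLE)
distances = [distance_to_text(node) for node in doc.getElementsByTagName("context-node")]
print(distances)
```

For the eight paragraphs this gives 1, 1, 2, 2, 2, 3, 3 and 4, which matches the values described in the question and also covers example 8 without a hard-coded limit.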

XQuery error: Attribute must follow the root element

Happy New Year to All!
I am learning XQuery with BaseX and face the following problem now.
I am parsing the factbook.xml file, which is part of the distribution.
The following query runs ok:
for $country in db:open('factbook')//country
where $country/@population < 1000000 and $country/@population > 500000
return <country name="{$country/name}" population="{$country/@population}">
    {
    for $city in $country/city
    let $pop := number($city/population)
    order by $pop descending
    return <city population="{$city/population/text()}"> {$city/name/text()}
    </city>
    }
</country>
but when I try to generate HTML with the second query, putting "{$country/@population}" inside the <h2>Country population: </h2> tag gives the error message "Attribute must follow the root element".
<html><head><title>Some Countries</title></head><body>
{
    for $country in db:open('factbook')//country
    let $pop_c := $country/@population
    where $pop_c < 1000000 and $pop_c > 500000
    return
        <p>
        <h1>Country: {$country/name/text()}</h1>
        <h2>Country population: #error comes if I put it here!#</h2>
        {
            for $city in $country/city
            let $pop := number($city/population)
            order by $pop descending
            return ( <h3>City: {$city/name/text()}</h3>,
                     <p>City population: {$city/population/text()}</p>
                   )
        }
        </p>
}
</body></html>
Where is my mistake?
Thank you!

Just using:
{$country/@population}
copies the population attribute itself into the result. An attribute must come immediately after the element it belongs to (or after other attributes of that element), but this one follows a text node, which raises the error.
Use:
<h2>Country population: {string($country/@population)} </h2>

When you write {$country/@population}, you insert not the text of the population attribute but the attribute itself. Without the "Country population: " text before it, {$country/@population} would simply attach a population attribute to the enclosing h2 element.
If you want its value, use:
{data($country/@population)}
Or
{data($pop_c)}
since you already have it in a variable. (The number or string functions can also be used instead of data, but I think data is the fastest.)
