BeautifulSoup: append data in columns instead of one string - web-scraping

I am using the following code to get some data from a website:
#find a list of all span elements
spans = page_body.find_all('span', {'class' : 'vehicle-tech-spec'})
#create a list of lines corresponding to element texts
model_details = [span.get_text() for span in spans]
print (model_details)
As a result I am getting this:
['05.2006', "137'800 km", 'Neu', '50 km', '10.2013', "97'000 km", '09.2015', "160'000 km", '04.2016', "138'000 km", '12.2017', "45'000 km", '05.2013', "90'000 km", '03.2013', "39'000 km", '01.2011', "131'400 km", '09.2017', "39'100 km", '05.2020', "9'900 km", '12.2015', "123'700 km", 'Neu', '15 km', '12.2019', "12'000 km", '06.2020', "22'200 km", 'Neu', '50 km', 'Neu', '50 km', '08.2016', "44'918 km", '03.2019', "25'725 km", '12.2017', "27'000 km"]
But I would like to get it like this:
| Reg.Date | Mileage    |
|----------|------------|
| 05.2006  | 137'800 km |
| Neu      | 50 km      |
| 10.2013  | 97'000 km  |
| etc.     | etc.       |
I am quite new to this and have been stuck on this problem for a few days. I am pretty sure this is basic for you guys!

Just split the list into even chunks.
Here's how:
from tabulate import tabulate

model_details = [
    '05.2006', "137'800 km", 'Neu', '50 km', '10.2013', "97'000 km", '09.2015',
    "160'000 km", '04.2016', "138'000 km", '12.2017', "45'000 km", '05.2013',
    "90'000 km", '03.2013', "39'000 km", '01.2011', "131'400 km", '09.2017',
    "39'100 km", '05.2020', "9'900 km", '12.2015', "123'700 km", 'Neu', '15 km',
    '12.2019', "12'000 km", '06.2020', "22'200 km", 'Neu', '50 km', 'Neu',
    '50 km', '08.2016', "44'918 km", '03.2019', "25'725 km", '12.2017', "27'000 km",
]

def chunk_it(list_to_chop: list, chunk_size: int = 2) -> list:
    # split the flat list into consecutive chunks of chunk_size items
    return [
        list_to_chop[index:index + chunk_size]
        for index in range(0, len(list_to_chop), chunk_size)
    ]

table = tabulate(
    chunk_it(model_details),
    headers=["Column One", "Column Two"],
    tablefmt="pretty",
)
print(table)
Output:
+------------+------------+
| Column One | Column Two |
+------------+------------+
|  05.2006   | 137'800 km |
|    Neu     |   50 km    |
|  10.2013   | 97'000 km  |
|  09.2015   | 160'000 km |
|  04.2016   | 138'000 km |
|  12.2017   | 45'000 km  |
|  05.2013   | 90'000 km  |
|  03.2013   | 39'000 km  |
|  01.2011   | 131'400 km |
|  09.2017   | 39'100 km  |
|  05.2020   |  9'900 km  |
|  12.2015   | 123'700 km |
|    Neu     |   15 km    |
|  12.2019   | 12'000 km  |
|  06.2020   | 22'200 km  |
|    Neu     |   50 km    |
|    Neu     |   50 km    |
|  08.2016   | 44'918 km  |
|  03.2019   | 25'725 km  |
|  12.2017   | 27'000 km  |
+------------+------------+

You did not mention what kind of output you are looking for, but here is a quick example for your target.
Consider using the csv module if you just want to write the data into a .csv file instead of using pandas; otherwise you can use PrettyTable if you want a table. A short csv sketch follows the pandas example below.
pip install pandas more_itertools beautifulsoup4 requests lxml
from bs4 import BeautifulSoup
import requests
from more_itertools import chunked
import pandas as pd

def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    spans = [x.get_text(strip=True) for x in soup.select('.vehicle-tech-spec')]
    data = list(chunked(spans, 2))
    print(pd.DataFrame(data, columns=['Reg.Date', 'Mileage']))

main('Url - Here')
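If you just want a .csv file rather than a printed DataFrame, here is a minimal sketch of the csv-module approach mentioned above; it assumes model_details is the flat list from the question, and output.csv is only an illustrative filename:
import csv

# pair every registration date with the mileage that follows it, one row per pair
rows = zip(model_details[0::2], model_details[1::2])
with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Reg.Date', 'Mileage'])  # header row
    writer.writerows(rows)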

Can you share the parent element of those spans? That way we might come up with a better solution. But the fix below will work for now, assuming none of the table cells are empty.
# find a list of all span elements
spans = page_body.find_all('span', {'class': 'vehicle-tech-spec'})
# create a list of dictionaries, one per table row
model_details = []
for i in range(0, len(spans), 2):
    model_details.append({'reg_date': spans[i].get_text(), 'mileage': spans[i + 1].get_text()})
print(model_details)  # list of dictionaries, each dictionary corresponding to a line
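If you then want the two-column layout from the question, one option is to feed that list of dictionaries to pandas; this is just a minimal sketch on top of the answer above (pandas is an extra assumption, and the renamed headers simply mirror the question):
import pandas as pd

# each dictionary becomes one row; the keys become the columns
df = pd.DataFrame(model_details).rename(columns={'reg_date': 'Reg.Date', 'mileage': 'Mileage'})
print(df.to_string(index=False))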

Related

Problem with replacing a comma with a period

I replace the comma with a period in the data.frame column
data[,22] <- as.numeric(sub(",", ".", sub(".", "", data[,22], fixed=TRUE), fixed=TRUE))
But I have values that look like this: 110.00, 120.00, 130.00...
When replacing, I get the values: 11000.0, 12000.0, 13000.0
But I would like to get: 110.0, 120.0, 130.0...
My column 22 data.frame:
| n      |
|--------|
| 92,5   |
| 94,5   |
| 96,5   |
| 110.00 |
| 120.00 |
| 130.00 |
What I want to get:
| n      |
|--------|
| 92.5   |
| 94.5   |
| 96.5   |
| 110.0  |
| 120.0  |
| 130.0  |
or
| n      |
|--------|
| 92.5   |
| 94.5   |
| 96.5   |
| 110.00 |
| 120.00 |
| 130.00 |
Don't replace the periods, since they are already in the format that you want. Replace only the commas with periods and convert the data to numeric.
data[[22]] <- as.numeric(sub(',', '.', fixed = TRUE, data[[22]]))
Using str_replace
library(stringr)
data[[22]] <- as.numeric(str_replace(data[[22]], ",", "."))
You can use gsub like below
transform(
  df,
  n = as.numeric(gsub("\\D", ".", n))
)
where non-digit characters, i.e. "," or ".", are replaced by "."

How do I pretty print or visualise an object of class 'CoreNLP_pb2.ParseTree' in Python/Jupyter Notebook?

I'm using Stanza's CoreNLP client in a Jupyter notebook to do constituency parsing on a string. The final output came in the form of an object of class 'CoreNLP_pb2.ParseTree'.
>>> print(type(result))
<class 'CoreNLP_pb2.ParseTree'>
How should I print this in a visible way? When I directly call print(result), there is no output.
You can convert the CoreNLP_pb2.ParseTree into an nltk.tree.Tree and then call pretty_print() to print the parse tree in a readable way.
from nltk.tree import Tree

def convert_parse_tree_to_nltk_tree(parse_tree):
    # leaves have no children, so return their value as a plain string
    return Tree(parse_tree.value, [convert_parse_tree_to_nltk_tree(child) for child in parse_tree.child]) if parse_tree.child else parse_tree.value

convert_parse_tree_to_nltk_tree(result).pretty_print()  # result is the ParseTree from the question
The result is as follows:
               ROOT
                |
                S
      __________|_________________
     |               VP          |
     |        _______|__         |
     NP      |          NP       |
  ____|___   |    ______|____    |
 NNP    NNP VBZ DT  JJ    NN   .
  |      |   |  |   |     |    |
Chris Manning is  a  nice person .

How to apply conditional number format?

With this modelData.amount.toLocaleCurrencyString() I get:
+---------+---------+
|  Input  | Output  |
+---------+---------+
| 100000  | 100,000 |
| 1000000 | 1e+06   |
| -10000  | -10,000 |
| 0       | 0       |
+---------+---------+
Why do I get scientific notation for numbers above 999,999? That isn't useful for me. What I need is conditional formatting like #,##0;(#,##0);- that puts negatives in parentheses, shows 0 as -, and uses normal comma separators for positive numbers.
I also don't want a currency symbol in my numbers.

How to decode "Сверд..." data format (name of region in Russia) in .csv file to English in R?

I am working on a machine learning project. When I download the .csv file, some of the features have values in an unknown format, something like СвердловÑÐºÐ°Ñ Ð¾Ð±Ð»Ð°ÑÑ‚ÑŒ and Личные вещÐ. These represent the names of regions in Russia. Can anyone tell me how to convert them into plain English in R? I tried the following:
df <- read.csv(file.choose(), sep = ',', header = TRUE, encoding = "russian",
stringsAsFactors = FALSE)
Doesn't work
Sample of data:
| region | City |
|---|---|
| ÐижегородÑÐºÐ°Ñ Ð¾Ð±Ð»Ð°ÑÑ‚ÑŒ | КраÑнодар |
| ВоронежÑÐºÐ°Ñ Ð¾Ð±Ð»Ð°ÑÑ‚ÑŒ | ЧелÑбинÑк |
| ÐижегородÑÐºÐ°Ñ Ð¾Ð±Ð»Ð°ÑÑ‚ÑŒ | Воронеж |
| ÐижегородÑÐºÐ°Ñ Ð¾Ð±Ð»Ð°ÑÑ‚ÑŒ | КраÑнодар |
| КраÑноÑÑ€Ñкий край | Самара |
| РоÑтовÑÐºÐ°Ñ Ð¾Ð±Ð»Ð°ÑÑ‚ÑŒ | Тюмень |

Dividing time into 30-minute periods

I have a DataFrame that contains a "time" column, and I want to add a new column containing the period number after dividing the time into 30-minute periods.
For example, the original DataFrame:
l = [('A','2017-01-13 00:30:00'),('A','2017-01-13 00:00:01'),('E','2017-01-13 14:00:00'),('E','2017-01-13 12:08:15')]
df = spark.createDataFrame(l,['test','time'])
df1 = df.select(df.test,df.time.cast('timestamp'))
df1.show()
+----+-------------------+
|test| time|
+----+-------------------+
| A|2017-01-13 00:30:00|
| A|2017-01-13 00:00:01|
| E|2017-01-13 14:00:00|
| E|2017-01-13 12:08:15|
+----+-------------------+
The desired DataFrame is as follows:
+----+-------------------+------+
|test| time|period|
+----+-------------------+------+
| A|2017-01-13 00:30:00| 2|
| A|2017-01-13 00:00:01| 1|
| E|2017-01-13 14:00:00| 29|
| E|2017-01-13 12:08:15| 25|
+----+-------------------+------+
Is there a way to achieve that?
You can use the built-in hour and minute functions, together with the when function, to get the final result:
from pyspark.sql import functions as F

df1.withColumn(
    'period',
    (F.hour(df1['time']) * 2) + 1 + F.when(F.minute(df1['time']) >= 30, 1).otherwise(0)
).show(truncate=False)
You should be getting
+----+---------------------+------+
|test|time |period|
+----+---------------------+------+
|A |2017-01-13 00:30:00.0|2 |
|A |2017-01-13 00:00:01.0|1 |
|E |2017-01-13 14:00:00.0|29 |
|E |2017-01-13 12:08:15.0|25 |
+----+---------------------+------+
I hope the answer is helpful
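An equivalent way to express the same arithmetic, offered only as an alternative sketch (it reuses df1 and the time column from the example above), is to count whole 30-minute blocks since midnight:
from pyspark.sql import functions as F

# minutes since midnight, floor-divided into 30-minute buckets, then made 1-based
df1.withColumn(
    'period',
    (F.floor((F.hour('time') * 60 + F.minute('time')) / 30) + 1).cast('integer')
).show(truncate=False)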
