I have written the script below to scrape a website.
I have left out the URLs; if you need them, write to me and I will supply them.
The current output is a bit messy, but it does the job.
I'm very new to scraping, so if you have any suggestions on how to improve the scraping itself, please tell me.
I'm looking for help structuring the results into a table that looks like this:
| source | columns... |
| -------- | -------------- |
| url1 | values |
| url2 | values |
Columns: Antal aktier, Börsvärde MSEK, Direktavkastning %, P/E-tal, P/S-tal, etc...
values from data1: 59840000, 5084,00, 0,00, 11,11, 0,59, etc...
values from data2: 14532434, 2284,50, 2,70, 9,73, 0,52, etc...
Ideas on how to solve this are very welcome.
Script:
```python
import requests
from bs4 import BeautifulSoup as bs

URL1 = "XXX"
URL2 = "YYY"

r1 = requests.get(URL1)
r2 = requests.get(URL2)

# Pass an explicit parser to avoid the "no parser was explicitly specified" warning.
soup1 = bs(r1.content, "html.parser")
soup2 = bs(r2.content, "html.parser")

data1 = soup1.find_all('dl', attrs={"class": "border XSText rightAlignText noMarginTop highlightOnHover thickBorderBottom noTopBorder"})
data2 = soup2.find_all('dl', attrs={"class": "border XSText rightAlignText noMarginTop highlightOnHover thickBorderBottom noTopBorder"})

print(data1[1])
print(data2[1])
```
Webscraping output:
```html
<dl class="border XSText rightAlignText noMarginTop highlightOnHover thickBorderBottom noTopBorder">
<dt><span>Antal aktier</span></dt>
<dd><span>59 840 000</span></dd>
<dt><span>Börsvärde MSEK</span></dt>
<dd><span>5 084,00</span></dd>
<dt><span>Direktavkastning %</span></dt>
<dd><span>0,00</span></dd>
<dt><span>P/E-tal</span></dt>
<dd><span>11,11</span></dd>
<dt><span>P/S-tal</span></dt>
<dd><span>0,59</span></dd>
<dt><span>Kurs/eget kapital </span></dt>
<dd><span>2,60</span></dd>
<dt><span>Omsättning/aktie SEK</span></dt>
<dd><span>132,00</span></dd>
<dt><span>Vinst/aktie SEK</span></dt>
<dd><span>6,98</span></dd>
<dt><span>Eget kapital/aktie SEK</span></dt>
<dd><span>29,55</span></dd>
<dt><span>Försäljning/aktie SEK</span></dt>
<dd><span>-</span></dd>
<dt><span>Effektivavkastning %</span></dt>
<dd><span>0,00</span></dd>
<dt><span>Antal ägare hos Avanza</span></dt>
<dd><span>16 041</span></dd>
</dl>
<dl class="border XSText rightAlignText noMarginTop highlightOnHover thickBorderBottom noTopBorder">
<dt><span>Antal aktier</span></dt>
<dd><span>14 532 434</span></dd>
<dt><span>Börsvärde MSEK</span></dt>
<dd><span>2 284,50</span></dd>
<dt><span>Direktavkastning %</span></dt>
<dd><span>2,70</span></dd>
<dt><span>P/E-tal</span></dt>
<dd><span>9,73</span></dd>
<dt><span>P/S-tal</span></dt>
<dd><span>0,52</span></dd>
<dt><span>Kurs/eget kapital </span></dt>
<dd><span>2,73</span></dd>
<dt><span>Omsättning/aktie SEK</span></dt>
<dd><span>303,47</span></dd>
<dt><span>Vinst/aktie SEK</span></dt>
<dd><span>16,16</span></dd>
<dt><span>Eget kapital/aktie SEK</span></dt>
<dd><span>58,34</span></dd>
<dt><span>Försäljning/aktie SEK</span></dt>
<dd><span>-</span></dd>
<dt><span>Effektivavkastning %</span></dt>
<dd><span>2,70</span></dd>
<dt><span>Antal ägare hos Avanza</span></dt>
<dd><span>3 994</span></dd>
</dl>
```
There are many ways you can solve this!
Looking at your output shows that the `<dt>` elements hold the column names and the `<dd>` elements hold the values, so we can iterate through them and append the data to lists.
```python
column_list = []
value_list = []

columns = soup1.find_all('dt')
for col in columns:
    column_list.append(col.text.strip())  # strip() removes surrounding whitespace

values = soup1.find_all('dd')
for val in values:
    value_list.append(val.text.strip())

for i in range(len(column_list)):
    print(column_list[i] + ': ' + value_list[i])
```
Now you can use the data in your lists however you wish. It currently gives output like this:
```
Kortnamn: AAPL
ISIN: US0378331005
Marknad: NASDAQ
Bransch: Teknik
Handlas i: USD
Beta: 1,1927
Volatilitet %: 24,99
Belåningsvärde %: 60
Säkerhetskrav %: 150
Superränta: Ja
Blankningsbar: Nej
Antal aktier: 17 001 802 000
Börsvärde MUSD: 2 226 555,99
Direktavkastning %: 0,62
P/E-tal: 38,24
P/S-tal: 8,16
Kurs/eget kapital: 31,05
Omsättning/aktie USD: 16,05
Vinst/aktie USD: 3,42
Eget kapital/aktie USD: 4,25
Försäljning/aktie USD: -
Effektivavkastning %: 0,62
Antal ägare hos Avanza: 34 331
```
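To get the exact table shape you asked for (one row per source URL, one column per label), you can pair each `<dt>` with the `<dd>` that follows it, build one dict per page, and hand the dicts to pandas. Here is a minimal sketch; the embedded HTML snippets are taken from the scraped output above and stand in for the live pages, so with your real script you would call the helper on `data1[1]` and `data2[1]` with `URL1`/`URL2` instead.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Stand-ins for the live pages, copied from the scraped output in the question.
html1 = """<dl>
<dt><span>Antal aktier</span></dt><dd><span>59 840 000</span></dd>
<dt><span>P/E-tal</span></dt><dd><span>11,11</span></dd>
</dl>"""
html2 = """<dl>
<dt><span>Antal aktier</span></dt><dd><span>14 532 434</span></dd>
<dt><span>P/E-tal</span></dt><dd><span>9,73</span></dd>
</dl>"""

def dl_to_row(tag, source):
    # Pair every <dt> label with the <dd> value that follows it.
    row = {"source": source}
    for dt, dd in zip(tag.find_all("dt"), tag.find_all("dd")):
        row[dt.get_text(strip=True)] = dd.get_text(strip=True)
    return row

rows = [
    dl_to_row(BeautifulSoup(html1, "html.parser"), "url1"),
    dl_to_row(BeautifulSoup(html2, "html.parser"), "url2"),
]
df = pd.DataFrame(rows)
print(df)
```

A nice side effect of building dicts keyed by label: pandas aligns rows on the shared column names, so if one page has a label the other lacks, you still get one table (with `NaN` in the gaps) instead of misaligned columns.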
I'm very new to web scraping in Python. I want to extract the movie name, release year, and rating from the IMDB database. This is the website for IMDB with 250 movies and ratings: https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm. I use the modules BeautifulSoup and requests. Here is my code:
```python
movies = bs.find('tbody', class_='lister-list').find_all('tr')
```
When I tried to extract the movie name, rating & year, I got the same attribute error for all of them.
```html
<td class="titleColumn">
Glass Onion: une histoire à couteaux tirés
<span class="secondaryInfo">(2022)</span>
<div class="velocity">1
<span class="secondaryInfo">(
<span class="global-sprite telemeter up"></span>
1)</span>
</div>
</td>
<td class="ratingColumn imdbRating">
<strong title="7,3 based on 207 962 user ratings">7,3</strong>
</td>
```
```python
title = movies.find('td', class_='titleColumn').a.text
rating = movies.find('td', class_='ratingColumn imdbRating').strong.text
year = movies.find('td', class_='titleColumn').span.text.strip('()')
```
```
AttributeError                            Traceback (most recent call last)
<ipython-input-9-2363bafd916b> in <module>
----> 1 title = movies.find('td',class_='titleColumn').a.text
      2 title

~\anaconda3\lib\site-packages\bs4\element.py in __getattr__(self, key)
   2287     def __getattr__(self, key):
   2288         """Raise a helpful exception to explain a common code fix."""
-> 2289         raise AttributeError(
   2290             "ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
   2291         )

AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
```
Can someone help me to solve the problem? Thanks in advance!
To process the ResultSet row by row, you can try the next example:
```python
from bs4 import BeautifulSoup
import requests
import pandas as pd

data = []
res = requests.get("https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm")
soup = BeautifulSoup(res.content, "html.parser")

for card in soup.select('.chart.full-width tbody tr'):
    data.append({
        "title": card.select_one('.titleColumn a').get_text(strip=True),
        "year": card.select_one('.titleColumn span').text,
        "rating": card.select_one('td[class="ratingColumn imdbRating"]').get_text(strip=True)
    })

df = pd.DataFrame(data)
print(df)
# df.to_csv('out.csv', index=False)
```
Output:
```
                                                title    year rating
0                            Avatar: The Way of Water  (2022)    7.9
1                                          Glass Onion  (2022)    7.2
2                                             The Menu  (2022)    7.3
3                                          White Noise  (2022)    5.8
4                                    The Pale Blue Eye  (2022)    6.7
..                                                 ...     ...    ...
95                                           Zoolander  (2001)    6.5
96                       Once Upon a Time in Hollywood  (2019)    7.6
97   The Lord of the Rings: The Fellowship of the Ring  (2001)    8.8
98                                      New Year's Eve  (2011)    5.6
99                             Spider-Man: No Way Home  (2021)    8.2

[100 rows x 3 columns]
```
Update: to extract the data using the find_all and find methods:
```python
from bs4 import BeautifulSoup
import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
data = []
res = requests.get("https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm", headers=headers)
soup = BeautifulSoup(res.content, "html.parser")

for card in soup.table.tbody.find_all("tr"):
    data.append({
        "title": card.find("td", class_="titleColumn").a.get_text(strip=True),
        "year": card.find("td", class_="titleColumn").span.get_text(strip=True),
        "rating": card.find("td", class_="ratingColumn imdbRating").get_text(strip=True)
    })

df = pd.DataFrame(data)
print(df)
```
find_all returns a list-like ResultSet, meaning that movies is a collection of elements. You need to iterate over it with for movie in movies:
```python
for movie in movies:
    title = movie.find('td', class_='titleColumn').a.text
    rating = movie.find('td', class_='ratingColumn imdbRating').strong.text
    year = movie.find('td', class_='titleColumn').span.text.strip('()')
```
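Putting that loop together end to end, here is a minimal runnable sketch. The embedded snippet mirrors the cell classes from the question's HTML (titleColumn, ratingColumn imdbRating) and stands in for the live chart page, whose markup may of course differ:

```python
from bs4 import BeautifulSoup

# Stand-in for the live page, shaped like the markup in the question.
html = """
<table><tbody class="lister-list">
<tr><td class="titleColumn"><a>Glass Onion</a> <span>(2022)</span></td>
<td class="ratingColumn imdbRating"><strong>7,3</strong></td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
movies = soup.find("tbody", class_="lister-list").find_all("tr")

rows = []
for movie in movies:  # iterate the ResultSet, one <tr> at a time
    rows.append({
        "title": movie.find("td", class_="titleColumn").a.text,
        "rating": movie.find("td", class_="ratingColumn imdbRating").strong.text,
        "year": movie.find("td", class_="titleColumn").span.text.strip("()"),
    })
print(rows)
```

Collecting each row into a dict keeps the three values together, which also makes it trivial to feed the list straight into pandas later if you want a DataFrame.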
I have a character vector that is a string of letters and punctuation. I want to create a data frame where each column is made up of a letter/character from this string.
e.g.
Character string = I WENT TO THE FAIR
Dataframe = | I | | W | E | N | T | | T | O | | T | H | E | | F | A | I | R |
I thought I could do this using a loop with substr, but I can't work out how to get R to write into separate columns rather than just writing over the previous letter. I'm new to writing loops, so I'm struggling a bit to get my head around how to compose what I need.
Thanks for any help and advice that you can offer.
Best wishes,
Natalie
This should get that result
```r
string <- "I WENT TO THE FAIR"
df <- as.data.frame(t(as.data.frame(strsplit(string, ""))), row.names = "1")
```
Suppose I have a dataframe with values:
Mtemp:
```
-----+
code |
-----+
Ram  |
John |
Tracy|
Aman |
```
I want to compare it with the dataframe
M2:
```
------+
code  |
------+
Vivek |
Girish|
Rum   |
Rama  |
Johny |
Stacy |
Jon   |
```
I want a result where, for each value in Mtemp, I get at most 2 possible matches in M2 within Levenshtein distance 2.
I have used:
```r
tp <- as.data.frame(amatch(Mtemp$code, M2$code, method = "lv", maxDist = 2))
tp$orig <- Mtemp$code
colnames(tp) <- c('Res', 'orig')
```
and I am getting the following result:
```
Res  |orig
-----+-----
3    |Ram
5    |John
6    |Tracy
4    |Aman
```
Please let me know a way to get 2 values (if possible) for every Mtemp string with Levenshtein distance = 2.
Reading data from Twitter and then saving it in MongoDB:
```r
data.list <- searchTwitter('#demonetization', n = 10)
data.df <- twListToDF(data.list)
temp <- mongo.bson.from.df(data.df)
mongo <- mongo.create()
DB_Details <- paste("twitter", "filterstream", sep = ".")
mongo.insert.batch(mongo, DB_Details, temp)
```
Reading the data back from MongoDB and saving it in the dataset variable (all columns of the table are stored in this variable):
```r
mongo <- mongo(db = "twitter", collection = "filterstream", url = "mongodb://localhost")
dataset <- mongo$find()
```
When I print the content of the dataset variable there is no problem (see OUTPUT-1), but when I print a single column from it, the output of that column (see OUTPUT-2) differs from the previous output (OUTPUT-1).
OUTPUT-1
```
> dataset
--------------------------------------------------
| id | text |
--------------------------------------------------
| 1  | <ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD>
       <ed><U+00B8><U+0082><ed><U+00A0><U+00BD>
       <ed><U+00B1><U+0087>\nSome great jokes on #DeMonetization on
       my TL today.\n\nThank you, Modi ji. <ed><U+00A0><U+00BD>
       <ed><U+00B1><U+0087> |
--------------------------------------------------
| 2  | should be one |
--------------------------------------------------
```
OUTPUT-2
```
> dataset$text
| id | text |
--------------------------------------------------
| 1  | \xed��\xed�\u0082\xed��\xed�\u0082\xed��\xed�\u0087\nSome great jokes on #DeMonetization on my TL today.\n\nThank you, Modi ji. \xed��\xed�\u0087 |
--------------------------------------------------
| 2  | should be one |
--------------------------------------------------
```
Detecting these weird characters in OUTPUT-2 and getting rid of them is difficult. I am able to remove the special characters (tags) and obtain clean text using a regex on the content of the text column in OUTPUT-1, but the content of the text column in OUTPUT-2 is quite different, and I am not able to remove those weird characters.
Why does the content suddenly change when printing a particular column from the dataset? What am I doing wrong?
I have a file with data laid out like this:
```
Name: abcdef
Value:40
Id:34
Size: 1000
Name: xyz
Value:4
Id:765
Size: 5561000
Name: qwerty
Value:0
Id:4
Size: 1000
```
But I would like something like this:
| Name | Value | Id | Size |
| ------ | ----- | --- | ------- |
| abcdef | 40 | 34 | 1000 |
| xyz | 4 | 765 | 5561000 |
| qwerty | 0 | 4 | 1000 |
Is it possible to do that with standard R commands?
I couldn't find the function I imagined in splitstackshape, nor could I find the duplicate question on SO that I also imagined I had seen (using "attribute value" or "label value" as search terms), but I can offer a solution based on scan's ability to handle multi-line data and sub to trim out the excess text. You can obviously remove the dangling column:
```r
> inp <- scan(text=txt, what=list("n", "v", "i", "s", "blank"), sep="\n")
Read 3 records
> names(inp) <- lapply(inp, function(col) sub("\\:.+", "", col[1]))
> inp <- data.frame(lapply(inp, function(col) sub(".+\\:[ ]{0,1}", "", col)))
> inp
    Name Value  Id    Size c............
1 abcdef    40  34    1000
2    xyz     4 765 5561000
3 qwerty     0   4    1000
```
This will require that the data be very regular: each section needs to be 5 lines, and the order of the values inside a section needs to be constant, although blank values should be handled correctly.
Data used:
```r
txt <- "Name: abcdef
Value:40
Id:34
Size: 1000
Name: xyz
Value:4
Id:765
Size: 5561000
Name: qwerty
Value:0
Id:4
Size: 1000
"
```