Image "src" text scrape and tablescrape from a webpage using beautifulsoup - web-scraping

I am trying to web scrape this page
There are 2 problems with it:
1) I am trying to grab the data from the table in the package details tab, yet I am getting no result. My selector path is correct, but no output comes up. The required output is given below.
2) Although I am getting the image "src" text, I am not getting the required values that are used for the images. The required output is below.
import requests
from bs4 import BeautifulSoup
result = []
response = requests.get("https://www.ikea.com/sa/en/catalog/products/00361049/")
assert response.ok
page = BeautifulSoup(response.text, "html.parser")
for record in page.find_all('.packages-specification-table tr:last-child'):
    for data in record.find_all('td'):
        print(data.text)
for record1 in page.find_all('.packages-specification-table tr:first-child'):
    for data1 in record1.find_all('th'):
        print(data1)
for des in page.find_all('img'):
    image = des.get('src')
    print(image)
Required table output:
Article Number 00361049
Packages 1
Width 74 cm
Height 48 cm
Length 106 cm
Diameter -
Weight 30.00 kg
Required image output src:
/PIAimages/0618875_PE688687_S1.JPG
/PIAimages/0325432_PE517964_S1.JPG
/PIAimages/0690287_PE723209_S1.JPG
/PIAimages/0513996_PE639275_S1.JPG
/PIAimages/0325450_PE517970_S1.JPG

This page uses JavaScript to load the data.
This code gets the URLs for the images:
import requests
url = 'https://www.ikea.com/sa/en/iows/catalog/products/?catalog=departments&category=10687&type=json&dataset=small,allImages,prices&count=11&sort=relevance&sortorder=ascending&startIndex=0'
r = requests.get(url)
data = r.json()
for item in data['products']:
    print(item['item']['name'])
    for image in item['item']['images']['large']:
        print(image)
Other information may be in other files loaded by JavaScript.
You can find them in DevTools in Chrome/Firefox - tab: Network, filter: XHR.
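Once you find a promising XHR request there, you can usually replay it directly with requests. A minimal sketch, assuming a hypothetical endpoint URL and typical headers (copy the real values from the request shown in the Network tab):

import requests

# Hypothetical XHR endpoint spotted in the DevTools Network tab.
url = 'https://www.ikea.com/sa/en/iows/catalog/products/?type=json'

# Some servers reject the default requests User-Agent, so mimic a browser.
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'application/json',
}

r = requests.get(url, headers=headers)
r.raise_for_status()  # fail loudly on HTTP errors
data = r.json()
print(data)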
EDIT:
This page uses JavaScript, but BS doesn't run JavaScript.
When I turn off JavaScript in the web browser, I see the elements in different tags than in your code.
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.ikea.com/sa/en/catalog/products/00361049/")
soup = BeautifulSoup(r.text, "html.parser")
html = soup.select('div#productDimensionsContainer div#metric')[0].encode_contents().decode().strip()
data = list(filter(None, html.split('<br/>')))
print(data)
# ['Width: 82 cm', 'Depth: 96 cm', 'Height: 101 cm', 'Seat width: 49 cm', 'Seat depth: 54 cm', 'Seat height: 45 cm']
html = soup.select('div#custMaterials')[0].encode_contents().decode().strip()
data = list(filter(None, html.split('<br/>')))
print(data)
# ['Total composition: 100% polyester', 'Frame: Solid wood, Plywood, Particleboard, Polyurethane foam 25 kg/cu.m., Polyurethane foam 35 kg/cu.m., Polyester wadding', 'Seat cushion: Polyurethane foam 35 kg/cu.m., Polyester wadding', 'Leg: Solid beech, Clear lacquer']
EDIT:
There is also a <script> with var jProductData=... and it contains the information from the table.
import requests
from bs4 import BeautifulSoup
import json
r = requests.get("https://www.ikea.com/sa/en/catalog/products/00361049/")
soup = BeautifulSoup(r.text, "html.parser")
# var jProductData = {"product":{"items": ... }};
all_scripts = soup.select('script')
for script in all_scripts:
    script = script.encode_contents().decode().strip()
    if 'var jProductData' in script:
        for row in script.split('\n'):
            if 'var jProductData' in row:
                # strip the leading "var jProductData = " (19 chars) and the trailing ";"
                data = json.loads(row.strip()[19:-1])
                for item in data['product']['items']:
                    #print(item['pkgInfoArr'][0])
                    print('articleNumber:', item['pkgInfoArr'][0]['articleNumber'])
                    print('weightMet:', item['pkgInfoArr'][0]['pkgInfo'][0]['weightMet'])
                    print('widthMet:', item['pkgInfoArr'][0]['pkgInfo'][0]['widthMet'])
                    print('quantity:', item['pkgInfoArr'][0]['pkgInfo'][0]['quantity'])
                    print('consumerPackNo:', item['pkgInfoArr'][0]['pkgInfo'][0]['consumerPackNo'])
                    print('lengthMet:', item['pkgInfoArr'][0]['pkgInfo'][0]['lengthMet'])
                    print('heightMet:', item['pkgInfoArr'][0]['pkgInfo'][0]['heightMet'])
                    print('---')
Result:
articleNumber: 20343224
weightMet: 30.40 kg
widthMet: 74 cm
quantity: 1
consumerPackNo: 1
lengthMet: 106 cm
heightMet: 48 cm
---
articleNumber: 00361049
weightMet: 30.00 kg
widthMet: 74 cm
quantity: 1
consumerPackNo: 1
lengthMet: 106 cm
heightMet: 48 cm
---
articleNumber: 90361894
weightMet: 29.70 kg
widthMet: 74 cm
quantity: 1
consumerPackNo: 1
lengthMet: 106 cm
heightMet: 47 cm
---
articleNumber: 80359844
weightMet: 30.00 kg
widthMet: 74 cm
quantity: 1
consumerPackNo: 1
lengthMet: 106 cm
heightMet: 53 cm
---
articleNumber: 40359855
weightMet: 31.00 kg
widthMet: 75 cm
quantity: 1
consumerPackNo: 1
lengthMet: 106 cm
heightMet: 49 cm
---
articleNumber: 10413953
weightMet: 29.90 kg
widthMet: 74 cm
quantity: 1
consumerPackNo: 1
lengthMet: 106 cm
heightMet: 47 cm
---
articleNumber: 40433247
weightMet: 29.90 kg
widthMet: 74 cm
quantity: 1
consumerPackNo: 1
lengthMet: 106 cm
heightMet: 47 cm
---
There is probably other information, like the URLs for the images, but I didn't dig into var jProductData to find it.
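If you want to hunt for them, here is a minimal sketch that walks the parsed jProductData dict and collects anything that looks like an image path; the real key names are unknown, so it just matches string values by file extension:

def find_image_paths(obj, found=None):
    # Recursively walk nested dicts/lists and collect image-like strings.
    if found is None:
        found = []
    if isinstance(obj, dict):
        for value in obj.values():
            find_image_paths(value, found)
    elif isinstance(obj, list):
        for value in obj:
            find_image_paths(value, found)
    elif isinstance(obj, str) and obj.lower().endswith(('.jpg', '.png')):
        found.append(obj)
    return found

# `data` is the dict parsed from var jProductData above.
print(find_image_paths(data))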

Related

bs4 AttributeError while scraping table python

I am trying to scrape a table using bs4, but whenever I iterate over the <tbody> elements, I get the following error:
Traceback (most recent call last):
  File "f:\Python Programs\COVID-19 Notifier\main.py", line 28, in <module>
    for tr in soup.find('tbody').findAll('tr'):
AttributeError: 'NoneType' object has no attribute 'findAll'
I am new to bs4 and have faced this error many times before. This is the code I am using. Any help would be greatly appreciated, as this is an official project to be submitted in a competition and the deadline is near. Thanks in advance. Versions: beautifulsoup4==4.8.2, bs4==0.0.4 and soupsieve==2.0.
My code:
from plyer import notification
import requests
from bs4 import BeautifulSoup
import time
def notifyMe(title, message):
    notification.notify(
        title = title,
        message = message,
        app_icon = ".\\icon.ico",
        timeout = 6
    )

def getData(url):
    r = requests.get(url)
    return r.text

if __name__ == "__main__":
    while True:
        # notifyMe("Harry", "Lets stop the spread of this virus together")
        myHtmlData = getData('https://www.mohfw.gov.in/')
        soup = BeautifulSoup(myHtmlData, 'html.parser')
        #print(soup.prettify())
        myDataStr = ""
        for tr in soup.find('tbody').find_all('tr'):
            myDataStr += tr.get_text()
        myDataStr = myDataStr[1:]
        itemList = myDataStr.split("\n\n")
        print(itemList)
        states = ['Chandigarh', 'Telengana', 'Uttar Pradesh']
        for item in itemList[0:22]:
            dataList = item.split('\n')
            if dataList[1] in states:
                nTitle = 'Cases of Covid-19'
                nText = f"State {dataList[1]}\nIndian : {dataList[2]} & Foreign : {dataList[3]}\nCured : {dataList[4]}\nDeaths : {dataList[5]}"
                notifyMe(nTitle, nText)
                time.sleep(2)
        time.sleep(3600)
This line raises the error:
for tr in soup.findAll('tbody').findAll('tr'):
You can only call find_all on a single tag, not on a result set returned by another find_all. (findAll is the same as find_all; the latter is preferred because it follows the PEP 8 style guide.)
According to the documentation:
The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
If you're looping through a single table, simply replace the first findAll with find. If there are multiple tables, store the result set in a variable, loop through it, and apply find_all to each individual tag.
This should fix it:
for tr in soup.find('tbody').find_all('tr'):
Multiple tables:
tables = soup.find_all('tbody')
for table in tables:
    for tr in table.find_all('tr'):
        ...
There are a few issues here.
The <tbody> tag is inside an HTML comment. BeautifulSoup skips comments unless you specifically pull them out (a minimal sketch follows these points).
Why bother with the getData() function? It's just one line; why not put it directly into the code? The extra function doesn't really add efficiency or readability.
Even when you pull the <tbody> tag, your dataList doesn't have 6 items (you call dataList[5], which will throw an error). I adjusted it, but I don't know if those are the correct numbers. I don't know what each of those values represents, so you may need to fix that. The headers for the data you are pulling are ['S. No.','Name of State / UT','Active Cases*','Cured/Discharged/Migrated*','Deaths**'], so I don't know what Indian : {dataList[2]} & Foreign : are supposed to be.
With that said, I don't know what those numbers represent, but is it even the correct data? It looks like you can pull new data here, but it's not the same numbers as in the <tbody>.
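A minimal sketch of the comment-pulling idea on a toy document (the HTML string here is made up for illustration):

from bs4 import BeautifulSoup, Comment

# A table hidden inside an HTML comment is not part of the parse tree.
html = "<div><!-- <table><tbody><tr><td>42</td></tr></tbody></table> --></div>"
soup = BeautifulSoup(html, 'html.parser')

# The Comment strings can be extracted and re-parsed as HTML.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    inner = BeautifulSoup(comment, 'html.parser')
    if inner.find('tbody'):
        print(inner.find('td').get_text())  # -> 42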
So, here's how to get that other data source... maybe it's more accurate?
import requests
import pandas as pd
jsonData = requests.get('https://www.mohfw.gov.in/data/datanew.json').json()
df = pd.DataFrame(jsonData)
Output:
print(df.to_string())
sno state_name active positive cured death new_active new_positive new_cured new_death state_code
0 2 Andaman and Nicobar Islands 153 5527 5309 65 146 5569 5358 65 35
1 1 Andhra Pradesh 66944 997462 922977 7541 74231 1009228 927418 7579 28
2 3 Arunachal Pradesh 380 17296 16860 56 453 17430 16921 56 12
3 4 Assam 11918 231069 217991 1160 13942 233453 218339 1172 18
4 5 Bihar 69869 365770 293945 1956 76420 378442 300012 2010 10
5 6 Chandigarh 4273 36404 31704 427 4622 37232 32180 430 04
6 7 Chhattisgarh 121555 605568 477339 6674 123479 622965 492593 6893 22
7 8 Dadra and Nagar Haveli and Daman and Diu 1668 5910 4238 4 1785 6142 4353 4 26
8 10 Delhi 91618 956348 851537 13193 92029 980679 875109 13541 07
9 11 Goa 10228 72224 61032 964 11040 73644 61628 976 30
10 12 Gujarat 92084 453836 355875 5877 100128 467640 361493 6019 24
11 13 Haryana 58597 390989 328809 3583 64057 402843 335143 3643 06
12 14 Himachal Pradesh 11859 82876 69763 1254 12246 84065 70539 1280 02
13 15 Jammu and Kashmir 16094 154407 136221 2092 16993 156344 137240 2111 01
14 16 Jharkhand 40942 184951 142294 1715 43415 190692 145499 1778 20
15 17 Karnataka 196255 1247997 1037857 13885 214330 1274959 1046554 14075 29
16 18 Kerala 156554 1322054 1160472 5028 179311 1350501 1166135 5055 32
17 19 Ladakh 2041 12937 10761 135 2034 13089 10920 135 37
18 20 Lakshadweep 803 1671 867 1 920 1805 884 1 31
19 21 Madhya Pradesh 84957 459195 369375 4863 87640 472785 380208 4937 23
20 22 Maharashtra 701614 4094840 3330747 62479 693632 4161676 3404792 63252 27
21 23 Manipur 513 30047 29153 381 590 30151 29180 381 14
22 24 Meghalaya 1133 15488 14198 157 1238 15631 14236 157 17
23 25 Mizoram 608 5220 4600 12 644 5283 4627 12 15
24 26 Nagaland 384 12800 12322 94 457 12889 12338 94 13
25 27 Odisha 32963 388479 353551 1965 36718 394694 356003 1973 21
26 28 Puducherry 5923 50580 43931 726 6330 51372 44314 728 34
27 29 Punjab 40584 319719 270946 8189 43943 326447 274240 8264 03
28 30 Rajasthan 107157 467875 357329 3389 117294 483273 362526 3453 08
29 31 Sikkim 640 6970 6193 137 693 7037 6207 137 11
30 32 Tamil Nadu 89428 1037711 934966 13317 95048 1051487 943044 13395 33
31 34 Telengana 52726 379494 324840 1928 58148 387106 326997 1961 36
32 33 Tripura 563 34302 33345 394 645 34429 33390 394 16
33 35 Uttarakhand 26980 138010 109058 1972 29949 142349 110379 2021 05
34 36 Uttar Pradesh 259810 976765 706414 10541 273653 1013370 728980 10737 09
35 37 West Bengal 68798 700904 621340 10766 74737 713780 628218 10825 19
36 11111 2428616 16263695 13648159 186920 2552940 16610481 13867997 189544
Here's your code with the comments pulled out.
Code:
import requests
from bs4 import BeautifulSoup, Comment
from plyer import notification  # needed for notifyMe()
import time

def notifyMe(title, message):
    notification.notify(
        title = title,
        message = message,
        app_icon = ".\\icon.ico",
        timeout = 6
    )

if __name__ == "__main__":
    while True:
        # notifyMe("Harry", "Lets stop the spread of this virus together")
        myHtmlData = requests.get('https://www.mohfw.gov.in/').text
        soup = BeautifulSoup(myHtmlData, 'html.parser')
        comments = soup.find_all(string=lambda text: isinstance(text, Comment))
        myDataStr = ""
        for each in comments:
            if 'tbody' in str(each):
                soup = BeautifulSoup(each, 'html.parser')
                for tr in soup.find('tbody').findAll('tr'):
                    myDataStr += tr.get_text()
        myDataStr = myDataStr[1:]
        itemList = myDataStr.split("\n\n")
        print(itemList)
        states = ['Chandigarh', 'Telengana', 'Uttar Pradesh','Meghalaya']
        for item in itemList[0:22]:
            dataList = item.split('\n')
            if dataList[1] in states:
                nTitle = 'Cases of Covid-19'
                nText = f"State {dataList[1]}\nIndian : {dataList[0]} & Foreign : {dataList[2]}\nCured : {dataList[3]}\nDeaths : {dataList[4]}"  # <-- I changed this
                notifyMe(nTitle, nText)
                time.sleep(2)
        time.sleep(3600)

How to add a new column in data frame using calculation in R?

I want to add a new column with a calculation. In the data frame below,
library(dplyr)  # for select() and %>%

Env <- c("High_inoc","High_NO_inoc","Low_inoc","Low_NO_inoc")
CV1 <- c(30,150,20,100)
CV2 <- c(74,99,49,73)
CV3 <- c(78,106,56,69)
CV4 <- c(86,92,66,70)
CV5 <- c(74,98,57,79)
Data <- data.frame(Env,CV1,CV2,CV3,CV4,CV5)
Data$Mean <- rowMeans(Data %>% select(-Env))
Data <- rbind(Data, c("Mean", colMeans(Data %>% select(-Env))))
I'd like to add a new column named 'Env_index', calculated as each value of the 'Mean' column minus the overall mean (76.3), e.g. 68.4 - 76.3, 109 - 76.3, ..., 78.2 - 76.3.
So I did it like this and obtained what I want:
Data$Env_index <- c(68.4-76.3,109-76.3,49.6-76.3,78.2-76.3, 76.3-76.3)
But I want to calculate it directly in code, so if I write:
Data$Env_index <- with (data, data$Mean - 76.3)
it generates an error. Could you let me know how to do the calculation?
Thanks,
To make the calculation dynamic, so that it works on any data, you can do:
Data$Mean <- as.numeric(Data$Mean)
Data$Env_index <- Data$Mean - Data$Mean[nrow(Data)]
Data
# Env CV1 CV2 CV3 CV4 CV5 Mean Env_index
#1 High_inoc 30 74 78 86 74 68.4 -7.9
#2 High_NO_inoc 150 99 106 92 98 109.0 32.7
#3 Low_inoc 20 49 56 66 57 49.6 -26.7
#4 Low_NO_inoc 100 73 69 70 79 78.2 1.9
#5 Mean 75 73.75 77.25 78.5 77 76.3 0.0
Data$Mean[nrow(Data)] selects the last value of Data$Mean.

Scraping data from public Google sheet - same url for different tabs

I want to scrape data from a public web page of a Google sheet. This is the link.
I am specifically interested in the data in the 4th tab, "US daily 4 pm ET", but the URL for that tab is the same as for all the other tabs (at least according to the address bar of the browsers I've tried - both Chrome and Firefox). When I try to scrape the data using the rvest package in R, I end up with the data from the 2nd tab, "States current".
I did a right-click to inspect the 1st tab, "README", to see if I could figure something out about the tab names. It looks like the name of the 4th tab is sheet-button-916628299. But entering URLs in my browser that ended with /pubhtml#gid=sheet-button-916628299 or /pubhtml#gid=916628299 didn't take me to the 4th tab.
How can I find a URL that takes me (and, more importantly, the rvest package in R) to the data in the 4th tab?
This is fairly straightforward: the data for all the tabs is already loaded on the page rather than being fetched by XHR requests. The contents of each tab are just hidden or unhidden by CSS.
If you use the developer pane in your browser, you can see that each tab is in a div with a numerical id which is given by the number in the id of each tab.
We can get the page and make a dataframe of the correct css selectors to get each tab's contents like this:
library(rvest)
url <- paste0("https://docs.google.com/spreadsheets/u/2/d/e/",
              "2PACX-1vRwAqp96T9sYYq2-i7Tj0pvTf6XVHjDSMIKBdZ",
              "HXiCGGdNC0ypEU9NbngS8mxea55JuCFuua1MUeOj5/pubhtml#")
page <- read_html(url)
tabs <- html_nodes(page, xpath = "//li")
tab_df <- data.frame(name = tabs %>% html_text,
                     css = paste0("#", gsub("\\D", "", html_attr(tabs, "id"))),
                     stringsAsFactors = FALSE)
tab_df
#> name css
#> 1 README #1600800428
#> 2 States current #1189059067
#> 3 US current #294274214
#> 4 States daily 4 pm ET #916628299
#> 5 US daily 4 pm ET #964640830
#> 6 States #1983833656
So now we can get the contents of, say, the fourth tab like this:
html_node(page, tab_df$css[4]) %>% html_nodes("table") %>% html_table()
#> [[1]]
#>
#> 1 1 Date State Positive Negative Pending Death Total
#> 2 NA
#> 3 2 20200314 AK 1 143 144
#> 4 3 20200314 AL 6 22 46 74
#> 5 4 20200314 AR 12 65 26 103
#> 6 5 20200314 AZ 12 121 50 0 183
#> 7 6 20200314 CA 252 916 5 1,168
#> 8 7 20200314 CO 101 712 1 814
#> 9 8 20200314 CT 11 125 136
#> 10 9 20200314 DC 10 49 10 69
#> 11 10 20200314 DE 6 36 32 74
#> 12 11 20200314 FL 77 478 221 3 776
#> 13 12 20200314 GA 66 1 66
#> 14 13 20200314 HI 2 2
#> 15 14 20200314 IA 17 83 100
#> .... (535 rows in total)

breakdates in strucchange gives float instead of date

I want to detect breakpoints in a dataset using the strucchange library.
The dataset is a time series object made with xts, as below:
[,1]
2009-12-18 145
2010-01-08 100
2010-02-09 120
2010-03-02 150
2010-03-09 110
2010-03-23 180
2010-03-30 120
2010-04-06 135
2010-05-11 150
2010-05-25 155
2010-06-01 90
I use the code below to detect breakdates, but it gives me floats (doubles in R) as breakdates.
bp_ts <- breakpoints(duration ~ 1, breaks = 2)
summary(bp_ts)
The output is:
Corresponding to breakdates:
m = 1 0.168604651162791
m = 2 0.145348837209302 0.372093023255814
I want the output to be dates. The output should be:
Corresponding to breakdates:
m = 1 2010-03-23
m = 2 2010-03-23 2010-05-11
I cannot understand why the dates become floats after the breakpoints function is applied.
Thanks in advance :)

R - Scraping with rvest package

I'm trying to get the data from the "Team Statistics" table on this webpage:
https://www.hockey-reference.com/teams/CGY/2010.html
I don't have a lot of experience with web scraping, but I have made a few attempts with the XML package and now with the rvest package:
library(rvest)
url <- html("https://www.hockey-reference.com/teams/CGY/2010.html")
url %>%
    html_node(xpath = "//*[@id='team_stats']")
And end up with what appears to be a single node:
{xml_node}
<table class="sortable stats_table" id="team_stats" data-cols-to-freeze="1">
[1] <caption>Team Statistics Table</caption>
[2] <colgroup>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col>\ ...
[3] <thead><tr>\n<th aria-label="Team" data-stat="team_name" sco ...
[4] <tbody>\n<tr>\n<th scope="row" class="left " data-stat="team ...
How do I parse this to get just the header and the information in the two-row table?
You just need to add html_table at the end of the chain:
library(rvest)
url <- read_html("https://www.hockey-reference.com/teams/CGY/2010.html")
url %>%
    html_node(xpath = "//*[@id='team_stats']") %>%
    html_table()
Alternatively:
library(rvest)
url %>%
    html_table() %>%
    .[[1]]
Both solutions return:
Team AvAge GP W L OL PTS PTS% GF GA SRS SOS TG/G PP PPO PP% PPA PPOA PK% SH SHA S S% SA SV% PDO
1 Calgary Flames 28.8 82 40 32 10 90 0.549 201 203 -0.03 0.04 5.05 43 268 16.04 54 305 82.30 7 1 2350 8.6 2367 0.916 100.1
2 League Average 27.9 82 41 31 10 92 0.561 233 233 0.00 0.00 5.68 56 304 18.23 56 304 81.77 6 6 2486 9.1 2479 0.911 NA
