bs4 Attribute Error while scraping table python - web-scraping

I am trying to scrape a table using bs4. But whenever I iterate over the <tbody> elements, i get the following error: Traceback (most recent call last): File "f:\Python Programs\COVID-19 Notifier\main.py", line 28, in <module> for tr in soup.find('tbody').findAll('tr'): AttributeError: 'NoneType' object has no attribute 'findAll'
I am new to bs4 and have faced this error many times before too. This is the code I am using. Any help would be greatly appreciated as this is an official project to be submitted in a competition and the deadline is near. Thanks in advance. And beautifulsoup4=4.8.2, bs4==0.0.4 and soupsieve==2.0.
My code:
from plyer import notification
import requests
from bs4 import BeautifulSoup
import time
def notifyMe(title, message):
notification.notify(
title = title,
message = message,
app_icon = ".\\icon.ico",
timeout = 6
)
def getData(url):
r = requests.get(url)
return r.text
if __name__ == "__main__":
while True:
# notifyMe("Harry", "Lets stop the spread of this virus together")
myHtmlData = getData('https://www.mohfw.gov.in/')
soup = BeautifulSoup(myHtmlData, 'html.parser')
#print(soup.prettify())
myDataStr = ""
for tr in soup.find('tbody').find_all('tr'):
myDataStr += tr.get_text()
myDataStr = myDataStr[1:]
itemList = myDataStr.split("\n\n")
print(itemList)
states = ['Chandigarh', 'Telengana', 'Uttar Pradesh']
for item in itemList[0:22]:
dataList = item.split('\n')
if dataList[1] in states:
nTitle = 'Cases of Covid-19'
nText = f"State {dataList[1]}\nIndian : {dataList[2]} & Foreign : {dataList[3]}\nCured : {dataList[4]}\nDeaths : {dataList[5]}"
notifyMe(nTitle, nText)
time.sleep(2)
time.sleep(3600)

This line raises the error:
for tr in soup.findAll('tbody').findAll('tr'):
You can only call find_all on a single tag, not a result set returned by another find_all. (findAll is the same as find_all - the latter one is preferably used because it meets the Python PEP 8 styling standard)
According to the documentation:
The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
If you're looping through a single table, simply replace the first findAll with find. If multiple tables, store the result set in a variable and loop through it, and you can apply the findAll on a single tag.
This should fix it:
for tr in soup.find('tbody').find_all('tr'):
Multiple tables:
tables = soup.find_all('tbody')
for table in tables:
for tr in table.find_all('tr'):
...

There are a few issues here.
The <tbody> tag is within the comments of the html. BeautifulSoup skips comments, unless you specifically pull those.
Why bother with the getData() function? It's just one line, why not just put that into the code. The extra function doesn't really add efficiency or more readability in the code.
Even when you pull the <tbody> tag, your dataList doesn't have 6 items (you call dataList[5], which will throw and error). I adjusted it, but I don't know if those are the corect numbers. I don't know what each of those vlaues represent, so you may need to fix that. The headers for that data you are pulling are ['S. No.','Name of State / UT','Active Cases*','Cured/Discharged/Migrated*','Deaths**'], so I don't know what Indian : {dataList[2]} & Foreign : are suppose to be.
With that, I don't what those numbers represent, but is it the correct data? Looks like you can pull new data here, but it's not the same numbers in the <tbody>
So, here's to get that other data source...maybe it's more accurate?
import requests
import pandas as pd
jsonData = requests.get('https://www.mohfw.gov.in/data/datanew.json').json()
df = pd.DataFrame(jsonData)
Output:
print(df.to_string())
sno state_name active positive cured death new_active new_positive new_cured new_death state_code
0 2 Andaman and Nicobar Islands 153 5527 5309 65 146 5569 5358 65 35
1 1 Andhra Pradesh 66944 997462 922977 7541 74231 1009228 927418 7579 28
2 3 Arunachal Pradesh 380 17296 16860 56 453 17430 16921 56 12
3 4 Assam 11918 231069 217991 1160 13942 233453 218339 1172 18
4 5 Bihar 69869 365770 293945 1956 76420 378442 300012 2010 10
5 6 Chandigarh 4273 36404 31704 427 4622 37232 32180 430 04
6 7 Chhattisgarh 121555 605568 477339 6674 123479 622965 492593 6893 22
7 8 Dadra and Nagar Haveli and Daman and Diu 1668 5910 4238 4 1785 6142 4353 4 26
8 10 Delhi 91618 956348 851537 13193 92029 980679 875109 13541 07
9 11 Goa 10228 72224 61032 964 11040 73644 61628 976 30
10 12 Gujarat 92084 453836 355875 5877 100128 467640 361493 6019 24
11 13 Haryana 58597 390989 328809 3583 64057 402843 335143 3643 06
12 14 Himachal Pradesh 11859 82876 69763 1254 12246 84065 70539 1280 02
13 15 Jammu and Kashmir 16094 154407 136221 2092 16993 156344 137240 2111 01
14 16 Jharkhand 40942 184951 142294 1715 43415 190692 145499 1778 20
15 17 Karnataka 196255 1247997 1037857 13885 214330 1274959 1046554 14075 29
16 18 Kerala 156554 1322054 1160472 5028 179311 1350501 1166135 5055 32
17 19 Ladakh 2041 12937 10761 135 2034 13089 10920 135 37
18 20 Lakshadweep 803 1671 867 1 920 1805 884 1 31
19 21 Madhya Pradesh 84957 459195 369375 4863 87640 472785 380208 4937 23
20 22 Maharashtra 701614 4094840 3330747 62479 693632 4161676 3404792 63252 27
21 23 Manipur 513 30047 29153 381 590 30151 29180 381 14
22 24 Meghalaya 1133 15488 14198 157 1238 15631 14236 157 17
23 25 Mizoram 608 5220 4600 12 644 5283 4627 12 15
24 26 Nagaland 384 12800 12322 94 457 12889 12338 94 13
25 27 Odisha 32963 388479 353551 1965 36718 394694 356003 1973 21
26 28 Puducherry 5923 50580 43931 726 6330 51372 44314 728 34
27 29 Punjab 40584 319719 270946 8189 43943 326447 274240 8264 03
28 30 Rajasthan 107157 467875 357329 3389 117294 483273 362526 3453 08
29 31 Sikkim 640 6970 6193 137 693 7037 6207 137 11
30 32 Tamil Nadu 89428 1037711 934966 13317 95048 1051487 943044 13395 33
31 34 Telengana 52726 379494 324840 1928 58148 387106 326997 1961 36
32 33 Tripura 563 34302 33345 394 645 34429 33390 394 16
33 35 Uttarakhand 26980 138010 109058 1972 29949 142349 110379 2021 05
34 36 Uttar Pradesh 259810 976765 706414 10541 273653 1013370 728980 10737 09
35 37 West Bengal 68798 700904 621340 10766 74737 713780 628218 10825 19
36 11111 2428616 16263695 13648159 186920 2552940 16610481 13867997 189544
Here's your code with pulling the comments out
Code:
import requests
from bs4 import BeautifulSoup, Comment
import time
def notifyMe(title, message):
notification.notify(
title = title,
message = message,
app_icon = ".\\icon.ico",
timeout = 6
)
if __name__ == "__main__":
while True:
# notifyMe("Harry", "Lets stop the spread of this virus together")
myHtmlData = requests.get('https://www.mohfw.gov.in/').text
soup = BeautifulSoup(myHtmlData, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
myDataStr = ""
for each in comments:
if 'tbody' in str(each):
soup = BeautifulSoup(each, 'html.parser')
for tr in soup.find('tbody').findAll('tr'):
myDataStr += tr.get_text()
myDataStr = myDataStr[1:]
itemList = myDataStr.split("\n\n")
print(itemList)
states = ['Chandigarh', 'Telengana', 'Uttar Pradesh','Meghalaya']
for item in itemList[0:22]:
w=1
dataList = item.split('\n')
if dataList[1] in states:
nTitle = 'Cases of Covid-19'
nText = f"State {dataList[1]}\nIndian : {dataList[0]} & Foreign : {dataList[2]}\nCured : {dataList[3]}\nDeaths : {dataList[4]}" #<-- I changed this
notifyMe(nTitle, nText)
time.sleep(2)
time.sleep(3600)

Related

Natural Neighbor Interpolation in R

I need to conduct Natural Neighbor Interpolation (NNI) via R in order to smooth my numeric data. For example, say I have very spurious data, my goal is to use NNI to model the data neatly.
I have several hundred rows of data (one observation for each postcode), alongside latitudes and longitudes. I've made up some data below:
Postcode lat lon Value
200 -35.277272 149.117136 7
221 -35.201372 149.095065 38
800 -12.801028 130.955789 27
801 -12.801028 130.955789 3
804 -12.432181 130.84331 29
810 -12.378451 130.877014 20
811 -12.376597 130.850489 3
812 -12.400091 130.913672 42
814 -12.382572 130.853877 32
820 -12.410444 130.856124 39
821 -12.426641 130.882367 39
822 -12.799278 131.131697 49
828 -12.474896 130.907378 38
829 -14.460879 132.280002 34
830 -12.487233 130.972637 8
831 -12.480066 130.984006 49
832 -12.492269 130.990891 29
835 -12.48138 131.029173 33
836 -12.525546 131.103025 40
837 -12.460094 130.842663 39
838 -12.709507 130.995407 28
840 -12.717562 130.351316 22
841 -12.801028 130.955789 8
845 -13.038663 131.072091 19
846 -13.226806 131.098416 50
847 -13.824123 131.835799 11
850 -14.464497 132.262021 2
851 -14.464497 132.262021 23
852 -14.92267 133.064654 36
854 -16.81839 137.14707 17
860 -19.648306 134.186642 3
861 -18.94406 134.318373 8
862 -20.231104 137.762232 28
870 -12.436101 130.84059 24
871 -12.436101 130.84059 16
Is there any kind of package that will do this? I should mention, that the only predictors I am using in this model are latitude and longitude. If there isn't a package than can do this, how can I implement it manually. I've searched extensively and I can't figure out how to implement this in R. I have seen one or two other SO posts, but they haven't assisted me in figuring this out.
Please let me know if there's anything I must add to the question. Thanks.
I suggest the following:
Reproject the data to the corresponding UTM Zone.
Use R WhiteboxTools package to process the data using natural neighbour interpolation.

Store values in a cell dataframe

I am trying to store in multiple cells in a dataframe. But, my code is storing the data in the last cell (on the dd array). Please see my output below.
Can somebody please correct me? Cannot figure out what I am doing wrong.
Thanks in advance,
MyData <- read.csv(file="Pat_AR_035.csv", header=TRUE, sep=",")
dd <- unique(MyData$POLICY_NUM)
for (j in length(dd)) {
myDF <- data.frame(i=1:length(dd), m=I(vector('list', length(dd))))
myDF$m[[j]] <- data.frame(j,MyData[which(MyData$POLICY_NUM==dd[j] & MyData$ACRES), ],ncol(MyData),nrow(MyData))
}
[[60]]
NULL
[[61]]
NULL
[[62]]
NULL
[[63]]
j OBJECTID DIVISION POLICY_SYM POLICY_NUM YIELD_ID LINE_ID RH_CLU_ID ACRES PLANT_DATE ACRE_TYPE CLU_DETERM STATE COUNTY FARM_SERIA TRACT
1646 63 1646 8 MP 754033 3 20 39565604 8.56 5/3/2014 PL A 3 35 109 852
1647 63 1647 8 MP 754033 1 10 39565605 30.07 4/19/2014 PL A 3 35 109 852
1648 63 1648 8 MP 754033 1 10 39565606 56.59 4/19/2014 PL A 3 35 109 852
CLU_NUMBER FIELD_ACRE RMA_CLU_ID UPDATE_DAT Percent_Ar RHCLUID Field1 OBJECTID_1 DIVISION_1 STATE_1 COUNTY_1
1646 3 8.56 F68E591A-ECC2-470B-A012-201C3BB20D7F 9/21/2014 63.4990 39565604 1646 1646 8 3 35
1647 1 30.07 eb04cfc0-e78b-415f-b447-9595c81ef09e 9/21/2014 100.0000 39565605 1647 1647 8 3 35
1648 2 56.59 5922d604-e31c-4b9d-b846-9f38e2d18abe 9/21/2014 92.1442 39565606 1648 1648 8 3 35
POLICY_N_1 YIELD_ID_1 RH_CLU_ID_ short_dist coords_x1 coords_x2 optional SHAPE_Leng SHAPE_Area ncol.MyData. nrow.MyData.
1646 754033 3 39565604 5.110837 516747.8 -221751.4 TRUE 831.3702 34634.73 35 1757
1647 754033 1 39565605 5.606284 515932.1 -221702.0 TRUE 1469.4800 121611.46 35 1757
1648 754033 1 39565606 5.325399 516380.1 -221640.9 TRUE 1982.8757 228832.22 35 1757
for (j in length(dd))
This doesn’t iterate over dd — it iterates over a single number: the length of dd. Not much of an iteration. You probably meant to write the following or something similar:
for (j in seq_along(dd))
However, there are more issues with your code. For instance, the myDF variable is continuously overwritten inside your loop, which probably isn’t what you intended at all. Instead, you should probably create objects in an lapply statement and forego the loop.

Selecting pairs of odd even values in R

I have a large dataset as follows:
head(humic)
SUERC.No GU.Number d13.C Age.(BP) error Batch.Number AMS.USED Year Type
Sampletype
400 32691 535 -28 3382.981 34.74480 1 S3 2011 2 ha
401 32701 536 -28 3375.263 34.86087 1 S3 2011 2 ha
402 32711 537 -28 3308.103 34.83100 1 S3 2011 2 ha
403 32721 538 -28 3368.721 31.58641 1 S3 2011 2 ha
404 32731 539 -28 3368.604 34.72326 1 S3 2011 2 ha
405 32741 540 -28 3314.713 32.83147 1 S3 2011 2 ha
tail(humic)
SUERC.No GU.Number d13.C Age.(BP) error Batch.Number AMS.USED Year Type Sampletype
5445 70880 3962 -28.4 3390.458 29.12815 34 S4 2016 2 ha
5446 70890 3963 -28.5 3358.861 37.14896 34 S4 2016 2 ha
5447 70900 3964 -28.5 3363.626 26.71573 34 S4 2016 2 ha
5448 70910 3965 -28.5 3408.907 26.69665 34 S4 2016 2 ha
5449 70920 3966 -28.5 3348.463 29.01492 34 S4 2016 2 ha
5450 70930 3967 -28.4 3375.247 26.78261 34 S4 2016 2 ha
I am looking to create a variable to identify pairs of odd and even based on the variable GU.Number. These numbers identify duplicates of the same object - have same d13.C values.
For example,
535 - 536
537 - 538
3963-3964
3965-3966 are pairs.
Note, the column of GU.Number is not a sequence, some numbers are missing.
even.rows <- which(!(humic$GU.Number %% 2))
has.pair <- rep(0,nrow(humic))
for(i in even.rows){
has.pair[i] <- max((humic$GU.Number[i] + c(1,-1)) %in% humic$GU.Number)
}
# add as column of data
humic$has.pair <- has.pair
The has.pair column will be 1 if the GU.Number is even and there exists an odd GU.Number one less or one greater than the given GU.Number. Otherwise it will be 0. As a one-liner:
humic$has.pair <- sapply(1:nrow(humic),
function(x) with(humic,(!(GU.Number[x] %% 2))*max((GU.Number[x] + c(1,-1)) %in% GU.Number)))

Subset timeseries (date sequence) into a list

I have a dataframe with a series of dates, here's a simplified version of it:
> eventdates
dr.rank dr.start dr.end
1 14 1964-09-30 1964-10-06
2 16 1964-11-01 1964-12-24
I also have a time series of dates with values etc. associated with that, here's a much simplified version of the timeseries:
ts1964 <- data.frame(DATE = seq(from = as.Date("1964-01-01"), to = as.Date("1964-12-31"), by = "days"),
Q = 1:366)
What I am trying to do is subset by each date in eventdates, i.e.:
> filter(ts1964, ts1964$DATE >= eventdates[1,2] & ts1964$DATE <= eventdates[1,3])
DATE Q
1 1964-09-30 274
2 1964-10-01 275
3 1964-10-02 276
4 1964-10-03 277
5 1964-10-04 278
6 1964-10-05 279
7 1964-10-06 280
8 1964-10-07 281
9 1964-10-08 282
10 1964-10-09 283
11 1964-10-10 284
12 1964-10-11 285
13 1964-10-12 286
14 1964-10-13 287
15 1964-10-14 288
16 1964-10-15 289
17 1964-10-16 290
18 1964-10-17 291
19 1964-10-18 292
20 1964-10-19 293
21 1964-10-20 294
22 1964-10-21 295
23 1964-10-22 296
24 1964-10-23 297
25 1964-10-24 298
26 1964-10-25 299
27 1964-10-26 300
28 1964-10-27 301
29 1964-10-28 302
30 1964-10-29 303
31 1964-10-30 304
32 1964-10-31 305
33 1964-11-01 306
>
But I need to do this hundreds of times. What I would like to do is have each subset form an element in a list. I would normally be considering to using something like dlply in plyr but this isn't an option when I'm using dplyr. Could anyone advise on how I might achieve this otherwise? Thanks
We can use Map
Map(function(x,y) filter(ts1964, DATE >= x & DATE <= y),
eventdates$dr.start, eventdates$dr.end)

How ask R not to combine the X axis values for a bar chart?

I am a beginner with R . My data looks like this:
id count date
1 210 2009.01
2 400 2009.02
3 463 2009.03
4 465 2009.04
5 509 2009.05
6 861 2009.06
7 872 2009.07
8 886 2009.08
9 725 2009.09
10 687 2009.10
11 762 2009.11
12 748 2009.12
13 678 2010.01
14 699 2010.02
15 860 2010.03
16 708 2010.04
17 709 2010.05
18 770 2010.06
19 784 2010.07
20 694 2010.08
21 669 2010.09
22 689 2010.10
23 568 2010.11
24 584 2010.12
25 592 2011.01
26 548 2011.02
27 683 2011.03
28 675 2011.04
29 824 2011.05
30 637 2011.06
31 700 2011.07
32 724 2011.08
33 629 2011.09
34 446 2011.10
35 458 2011.11
36 421 2011.12
37 459 2012.01
38 256 2012.02
39 341 2012.03
40 284 2012.04
41 321 2012.05
42 404 2012.06
43 418 2012.07
44 520 2012.08
45 546 2012.09
46 548 2012.10
47 781 2012.11
48 704 2012.12
49 765 2013.01
50 571 2013.02
51 371 2013.03
I would like to make a bar graph like graph that shows how much what is the count for each date (dates in format of Month-Y, Jan-2009 for instance). I have two issues:
1- I cannot find a good format for a bar-char like graph like that
2- I want all of my data-points to be present in X axis(date), while R aggregates it to each year only (so I inly have four data-points there). Below is the current command that I am using:
plot(df$date,df$domain_count,col="red",type="h")
and my current plot is like this:
Ok, I see some issues in your original data. May I suggest the following:
Add the days in your date column
df$date=paste(df$date,'.01',sep='')
Convert the date column to be of date type:
df$date=as.Date(df$date,format='%Y.%m.%d')
Plot the data again:
plot(df$date,df$domain_count,col="red",type="h")
Also, may I add one more suggestion, have you used ggplot for ploting chart? I think you will find it much easier and resulting in better looking charts. Your example could be visualized like this:
library(ggplot2) #if you don't have the package, run install.packages('ggplot2')
ggplot(df,aes(date, count))+geom_bar(stat='identity')+labs(x="Date", y="Count")
First, you should transform your date column in a real date:
library(plyr) # for mutate
d <- mutate(d, month = as.numeric(gsub("[0-9]*\\.([0-9]*)", "\\1", as.character(date))),
year = as.numeric(gsub("([0-9]*)\\.[0-9]*", "\\1", as.character(date))),
Date = ISOdate(year, month, 1))
Then, you could use ggplot to create a decent barchart:
library(ggplot2)
ggplot(d, aes(x = Date, y = count)) + geom_bar(fill = "red", stat = "identity")
You can also use basic R to create a barchart, which is however less nice:
dd <- setNames(d$count, format(d$Date, "%m-%Y"))
barplot(dd)
The former plot shows you the "holes" in your data, i.e. month where there is no count, while for the latter it is even wuite difficult to see which bar corresponds to which month (this could however be tweaked I assume).
Hope that helps.

Resources