Scrape a paginated table where the URL does not change - web-scraping

I am trying to scrape the table on the following web page:
https://postcodebijadres.nl/3800
It shows the first 25 results, and for the rest you need to click the next button to view them.
I have a Python script that uses requests and BeautifulSoup to scrape the table, but only the first 25 results can be scraped directly from the HTML. I am totally new to this, and after some Google searches I still can't figure out how to retrieve the data from all the pages.
The problem is that the URL does not change when a new page of results is selected.
Could someone please point me in the right direction?
Kind regards,
Ewoud

You don't need to handle pagination on this page: all rows are already present inside the <table> element, and the paging is only applied client-side.
import requests
from bs4 import BeautifulSoup

url = "https://postcodebijadres.nl/3800"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0"
}

# One request is enough; the full table is in the initial HTML.
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

# [1:] skips the header row
for idx, tr in enumerate(soup.select("#postcodes-table tr")[1:], 1):
    tds = [td.get_text(strip=True) for td in tr.select("td")]
    print(idx, *tds)
Prints:
1 3800 AA postbus 0 - 10000
2 3800 AB postbus 0 - 10000
3 3800 AC postbus 0 - 10000
4 3800 AD postbus 0 - 10000
5 3800 AE postbus 0 - 10000
6 3800 AG postbus 0 - 10000
7 3800 AH postbus 0 - 10000
8 3800 AJ postbus 0 - 10000
9 3800 AK postbus 0 - 10000
10 3800 AL postbus 0 - 10000
11 3800 AM postbus 0 - 10000
12 3800 AN postbus 0 - 10000
13 3800 AP postbus 0 - 10000
14 3800 AR postbus 0 - 10000
15 3800 AS postbus 0 - 10000
16 3800 AT postbus 0 - 10000
17 3800 AV postbus 0 - 10000
18 3800 AW postbus 0 - 10000
19 3800 AX postbus 0 - 10000
20 3800 AZ postbus 0 - 10000
21 3800 BA postbus 0 - 10000
22 3800 BB postbus 0 - 10000
23 3800 BC postbus 0 - 10000
24 3800 BD postbus 0 - 10000
25 3800 BE postbus 0 - 10000
26 3800 BG postbus 0 - 10000
27 3800 BH postbus 0 - 10000
28 3800 BJ postbus 0 - 10000
29 3800 BK postbus 0 - 10000
30 3800 BL postbus 0 - 10000
31 3800 BM postbus 0 - 10000
32 3800 BN postbus 0 - 10000
33 3800 BP postbus 0 - 10000
34 3800 BR postbus 0 - 10000
35 3800 BS postbus 0 - 10000
36 3800 BT postbus 0 - 10000
37 3800 BV postbus 0 - 10000
38 3800 CA postbus 0 - 10000
39 3800 CB postbus 0 - 10000
40 3800 CC postbus 0 - 10000
41 3800 CD postbus 0 - 10000
42 3800 CE postbus 0 - 10000
43 3800 DA postbus 0 - 10000
44 3800 DB postbus 0 - 10000
45 3800 EA postbus 0 - 10000
46 3800 GB postbus 0 - 10000
47 3800 GC postbus 0 - 10000
48 3800 GD postbus 0 - 10000
49 3800 GE postbus 0 - 10000
50 3800 GG postbus 0 - 10000
51 3800 GJ postbus 0 - 10000
52 3800 GK postbus 0 - 10000
53 3800 HA postbus 0 - 10000
54 3800 HD postbus 0 - 10000
55 3800 NA postbus 0 - 10000

Related

I am unable to import a downloaded xls file in R

I am trying to directly import the .xls file that comes from this link (French electricity distributor).
I have built, based on this question, the following code:
library(rio)

Chemin <- "F:/DGTresor/00.Refontes/06.Electricite_HauteFrequence"  # whatever path you like

## RTE, current month
temporaire <- tempfile()
download.file("https://eco2mix.rte-france.com/download/eco2mix/eCO2mix_RTE_En-cours-TR.zip",
              temporaire)
unzip(zipfile = temporaire,
      files = "eCO2mix_RTE_En-cours-TR.xls",
      exdir = Chemin)
RTE_EnCours <- import(paste0(Chemin, "/eCO2mix_RTE_En-cours-TR.xls"))
The file exists, but I am unable to read it. I get the following error: libxls error: Unable to open file
I am not sure why this happens, but when I try to open the .xls file manually, it gives an error along the lines of "The file format and its extension do not match". To work around the issue, I renamed the file to .csv with the code below.
file.rename(paste0(Chemin, "/eCO2mix_RTE_En-cours-TR.xls"),
            paste0(Chemin, "/eCO2mix_RTE_En-cours-TR.csv"))
After that, importing the file works:
# header = FALSE prevents the columns from shifting
RTE_EnCours <- read.csv(paste0(Chemin, "/eCO2mix_RTE_En-cours-TR.csv"),
                        sep = "\t", header = FALSE, row.names = NULL)
# drop the last column, which is entirely NA
RTE_EnCours <- RTE_EnCours[, -ncol(RTE_EnCours)]
# assign the first row as the column names
colnames(RTE_EnCours) <- as.character(unlist(RTE_EnCours[1, ]))
# remove the first row
RTE_EnCours <- RTE_EnCours[-1, ]
head(RTE_EnCours)
gives:
Périmètre Nature Date Heures Consommation Prévision J-1 Prévision J Fioul Charbon Gaz Nucléaire Eolien Solaire Hydraulique
2 France Données temps réel 2020-10-01 00:00 46957 46500 47100 134 286 4524 35004 4327 0 4645
3 France Données temps réel 2020-10-01 00:15 46342 45350 45950 149 318 4727 35278 4336 0 4953
4 France Données temps réel 2020-10-01 00:30 44689 44200 44800 149 304 4380 34732 4428 0 4580
5 France Données temps réel 2020-10-01 00:45 43277 42950 43700 165 308 4244 34644 4528 0 4147
6 France Données temps réel 2020-10-01 01:00 42511 41700 42600 165 302 4012 34780 4488 0 4096
7 France Données temps réel 2020-10-01 01:15 42714 41650 42750 165 297 4114 35145 4630 0 3758
Pompage Bioénergies Ech. physiques Taux de Co2 Ech. comm. Angleterre Ech. comm. Espagne Ech. comm. Italie Ech. comm. Suisse
2 -751 1087 -2299 58 179 -914 -1732 -1283
3 -750 1055 -3724 59
4 -920 1045 -4009 58 179 -914 -1732 -1283
5 -1861 1048 -3946 59
6 -1857 1039 -4514 56 497 -1759 -2279 -2217
7 -2005 1037 -4427 57
Ech. comm. Allemagne-Belgique Fioul - TAC Fioul - Cogén. Fioul - Autres Gaz - TAC Gaz - Cogén. Gaz - CCG Gaz - Autres
2 -79 0 21 113 -2 585 3941 0
3 0 21 128 -1 580 4148 0
4 -159 0 21 128 -1 580 3801 0
5 0 21 144 -1 582 3663 0
6 1252 0 21 144 -1 579 3434 0
7 0 21 144 -1 581 3534 0
Hydraulique - Fil de l'eau + éclusée Hydraulique - Lacs Hydraulique - STEP turbinage Bioénergies - Déchets Bioénergies - Biomasse
2 3355 1288 2 183 447
3 3336 1615 2 174 435
4 3242 1338 0 174 434
5 3155 992 0 174 437
6 3060 1036 0 172 434
7 2992 766 0 177 436
Bioénergies - Biogaz
2 301
3 294
4 294
5 294
6 294
7 294
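The underlying cause is that the downloaded "xls" file is not a real Excel workbook but tab-separated text, which is why libxls cannot open it. Under that assumption (which the successful read.csv call above supports), here is a minimal sketch that skips the rename and reads the file directly; the fileEncoding value is a guess for the accented French headers:
# Read the mislabelled .xls directly as tab-separated text.
RTE_EnCours <- read.delim(paste0(Chemin, "/eCO2mix_RTE_En-cours-TR.xls"),
                          header = FALSE, row.names = NULL,
                          fileEncoding = "latin1")
# Same cleanup as above: drop the trailing all-NA column,
# then promote the first row to column names.
RTE_EnCours <- RTE_EnCours[, -ncol(RTE_EnCours)]
colnames(RTE_EnCours) <- as.character(unlist(RTE_EnCours[1, ]))
RTE_EnCours <- RTE_EnCours[-1, ]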

How to add columns with particular calculations and make a plot based on that in R?

Say for example I have the following data set:
timestamp open close ID
2000 1000 1100 5
2060 1100 1150 5
2120 1150 1200 5
2180 1200 1150 5
2240 1150 1100 8
2300 1100 1000 8
2360 1000 950 8
2420 950 900 8
2480 900 950 5
2540 950 1000 5
2600 1000 1050 5
2660 1050 1100 4
2720 1100 1150 4
2780 1150 1200 4
How can I add another column that shows how many times a particular ID has shown up so far (Number_ID below)? And how can I add a column that gives the percentage change since the start of each new ID run? The first open of a run is the baseline, and the closes are used to calculate %_change. The result would look something like this (the "because" calculation doesn't have to be included; I added it so you can see the arithmetic):
timestamp open close ID Number_ID %_change
2000 1000 1100 5 1 10 (because (1100-1000)*100/1000)
2060 1100 1150 5 2 15 (because (1150-1000)*100/1000)
2120 1150 1200 5 3 20 (because (1200-1000)*100/1000)
2180 1200 1150 5 4 15 (because (1150-1000)*100/1000)
2240 1150 1100 8 1 -4 (because (1100-1150)*100/1150)
2300 1100 1000 8 2 -13 (because (1000-1150)*100/1150)
2360 1000 950 8 3 -17 (because (950-1150)*100/1150)
2420 950 900 8 4 -21 (because (900-1150)*100/1150)
2480 900 950 5 1 5 (because (950-900)*100/900)
2540 950 1000 5 2 11 (because (1000-900)*100/900)
2600 1000 1050 5 3 16 (because (1050-900)*100/900)
2660 1050 1100 4 1 4 (because (1100-1050)*100/1050)
2720 1100 1150 4 2 9 (because (1150-1050)*100/1050)
2780 1150 1200 4 3 14 (because (1200-1050)*100/1050)
And once I have these two columns, how can I make a graph that plots the highest positive and negative % change per ID? For that I would first need a calculation that gives the percentage difference between the open and the close of each ID run. That would look something like this:
timestamp open close ID Number_ID %_change %_change_opencloseID
2000 1000 1100 5 1 10
2060 1100 1150 5 2 15
2120 1150 1200 5 3 20
2180 1200 1150 5 4 15 15 (because (1150-1000)*100/1000)
2240 1150 1100 8 1 -4
2300 1100 1000 8 2 -13
2360 1000 950 8 3 -17
2420 950 900 8 4 -21 -21 (because (900-1150)*100/1150)
2480 900 950 5 1 5
2540 950 1000 5 2 11
2600 1000 1050 5 3 16 16 (because (1050-900)*100/900)
2660 1050 1100 4 1 4
2720 1100 1150 4 2 9
2780 1150 1200 4 3 14 14 (because (1200-1050)*100/1050)
If I have this, how can I make a graph that automatically plots the 16% change for ID 5 rather than the 15% change, with timestamp on the x-axis and %_change on the y-axis?
Thanks!
This is how you can do your first step:
library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(Number_ID = row_number(),
         perc_change = (close - first(open)) / first(open) * 100)
# timestamp open close ID Number_ID perc_change
# <int> <int> <int> <int> <int> <dbl>
# 1 2000 1000 1100 5 1 10
# 2 2060 1100 1150 5 2 15
# 3 2120 1150 1200 5 3 20
# 4 2180 1200 1150 5 4 15
# 5 2240 1150 1100 8 1 -4.35
# 6 2300 1100 1000 8 2 -13.0
# 7 2360 1000 950 8 3 -17.4
# 8 2420 950 900 8 4 -21.7
# 9 2480 900 950 5 5 -5
#10 2540 950 1000 5 6 0
#11 2600 1000 1050 5 7 5
#12 2660 1050 1100 4 1 4.76
#13 2720 1100 1150 4 2 9.52
#14 2780 1150 1200 4 3 14.3
In data.table:
library(data.table)

setDT(df)[, c("Number_ID", "perc_change") := list(seq_len(.N),
          (close - first(open)) / first(open) * 100), by = ID]
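Note that group_by(ID) pools every row with the same ID, which is why rows 9 to 11 in the output above keep counting for ID 5 instead of restarting. If each consecutive run should restart, as in your expected output, and you then want only the strongest run per ID plotted, a hedged sketch using data.table::rleid to build a run identifier might look like this (the run_id and pct names are mine, introduced for illustration):
library(dplyr)
library(ggplot2)

plot_dat <- df %>%
  mutate(run_id = data.table::rleid(ID)) %>%   # new run each time ID changes
  group_by(run_id, ID) %>%
  summarise(timestamp = last(timestamp),
            pct = (last(close) - first(open)) / first(open) * 100,
            .groups = "drop") %>%
  group_by(ID) %>%
  slice_max(abs(pct), n = 1)                   # keep the strongest run per ID

# e.g. the 16% run is kept for ID 5, not the 15% one
ggplot(plot_dat, aes(timestamp, pct, colour = factor(ID))) +
  geom_point()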

How would I convert columns values to 0 and 1 for a specific range

This is my dataset, named st:
> head(st)
sales0 sales1 sales2 sales3 sales4 country State CouSub countytownname population
1: 848 588 666 1116 1133 9 23 19770 town 423
2: 925 717 780 1283 1550 1 50 29575 town 298
3: 924 616 739 1154 1314 13 25 8470 town 3609
4: 924 646 683 1292 1297 35 6 99999 County 34895
5: 1017 730 735 1208 1326 27 50 60100 town 1139
6: 1494 1071 1196 1861 2023 9 25 37995 town 5136
state_alpha store_Type store data CN_AF CN_GL CN_MR CN_SZ SC_M SC_N AN_AF
1: ME Supermarket Type1 0 train 0 1 0 0 0 1 0
2: VT Supermarket Type1 0 train 1 0 0 0 0 1 0
3: MA Supermarket Type1 1 train 0 1 0 0 1 0 0
4: CA Supermarket Type3 0 train 0 1 0 0 0 1 1
5: VT Supermarket Type1 0 train 0 0 0 1 0 1 0
6: MA Supermarket Type3 0 train 1 0 0 0 1 0 0
The frequency table for the state_alpha column is:
> table(st$state_alpha)
AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD
29 67 75 15 58 64 169 1 3 67 159 1 5 99 44 102 92 105 120 64 351 25
ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD
535 83 87 116 82 56 100 53 93 259 21 33 17 62 88 77 36 67 78 39 46 66
TN TX UT VA VI VT WA WI WV WY
95 254 29 135 3 255 39 72 55 23
I want to recode these counts into ranges: 1 for frequencies between 0 and 100, and 0 for frequencies greater than 100. But when I run my code, every value comes out as 0. Could someone help me with both of the methods I used?
1.
st$state_alpha=ifelse((st$state_alpha>=0 & st$state_alpha<=100),1,0)
> table(st$state_alpha)
0
4769
2.
st$state_alpha=(st$state_alpha<=100) + 0
> table(st$state_alpha)
0
4769
Fixes for both versions would be appreciated.
Any answer probably revolves around a group_by and case_when operation. Note that your attempts compare the state_alpha strings themselves with numbers; character comparison is lexicographic, and every two-letter state code sorts after "100", so the condition is FALSE for every row. What you want to compare is each state's frequency. I would need reproducible data (via dput, etc.), but a solution might look like this:
library(dplyr)

st %>%
  group_by(state_alpha) %>%
  mutate(st_range = case_when(n() <= 100 ~ 1,   # n() is the group size
                              TRUE ~ 0))
Alternatively, if all you want is the frequency table converted to 0/1 flags:
library(dplyr)
library(magrittr)

st %$%
  table(state_alpha) %>%
  data.frame() %>%
  mutate(Freq = case_when(Freq <= 100 ~ 1, TRUE ~ 0))
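If instead you want the 0/1 flag as a new column on st itself, one value per row based on how often that row's state occurs, here is a hedged sketch with dplyr::add_count; the n_state and state_flag names are illustrative, not anything your data requires:
library(dplyr)

st %>%
  add_count(state_alpha, name = "n_state") %>%         # frequency of each state
  mutate(state_flag = as.integer(n_state <= 100)) %>%  # 1 if 100 rows or fewer
  select(-n_state)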

Error in R code for LPPL model

I am learning R and ran into a problem when trying to fit an LPPL model with nls. I am using monthly data for the KLSE.
library(tseries)
library(zoo)

ts <- read.table(file.choose(), header = TRUE)
ts
rdate Close Date
1 8/1998 302.91 0
2 9/1998 373.52 100
3 10/1998 405.33 200
4 11/1998 501.47 300
5 12/1998 586.13 400
6 1/1999 591.43 500
7 2/1999 542.23 600
8 3/1999 502.82 700
9 4/1999 674.96 800
10 5/1999 743.04 900
11 6/1999 811.10 1000
12 7/1999 768.69 1100
13 8/1999 767.06 1200
14 9/1999 675.45 1300
15 10/1999 742.87 1400
16 11/1999 734.66 1500
17 12/1999 812.33 1600
18 1/2000 922.10 1700
19 2/2000 982.24 1800
20 3/2000 974.38 1900
21 4/2000 898.35 2000
22 5/2000 911.51 2100
23 6/2000 833.37 2200
24 7/2000 798.83 2300
25 8/2000 795.84 2400
26 9/2000 713.51 2500
27 10/2000 752.36 2600
28 11/2000 729.95 2700
29 12/2000 679.64 2800
30 1/2001 727.73 2900
31 2/2001 709.39 3000
32 3/2001 647.48 3100
33 4/2001 584.50 3200
34 5/2001 572.88 3300
35 6/2001 592.99 3400
36 7/2001 659.40 3500
37 8/2001 687.16 3600
38 9/2001 615.34 3700
39 10/2001 600.07 3800
40 11/2001 638.02 3900
41 12/2001 696.09 4000
42 1/2002 718.82 4100
43 2/2002 708.91 4200
44 3/2002 756.10 4300
45 4/2002 793.99 4400
46 5/2002 741.76 4500
47 6/2002 725.44 4600
48 7/2002 721.59 4700
49 8/2002 711.36 4800
50 9/2002 638.01 4900
51 10/2002 659.57 5000
52 11/2002 629.22 5100
53 12/2002 646.32 5200
54 1/2003 664.77 5300
55 2/2003 646.80 5400
56 3/2003 635.72 5500
57 4/2003 630.37 5600
58 5/2003 671.46 5700
59 6/2003 691.96 5800
60 7/2003 720.56 5900
61 8/2003 743.30 6000
62 9/2003 733.45 6100
63 10/2003 817.12 6200
64 11/2003 779.28 6300
65 12/2003 793.94 6400
66 1/2004 818.94 6500
67 2/2004 879.24 6600
68 3/2004 901.85 6700
69 4/2004 838.21 6800
70 5/2004 810.67 6900
71 6/2004 819.86 7000
72 7/2004 833.98 7100
73 8/2004 827.98 7200
74 9/2004 849.96 7300
75 10/2004 861.14 7400
76 11/2004 917.19 7500
77 12/2004 907.43 7600
78 1/2005 916.27 7700
79 2/2005 907.38 7800
80 3/2005 871.35 7900
81 4/2005 878.96 8000
82 5/2005 860.73 8100
83 6/2005 888.32 8200
84 7/2005 937.39 8300
85 8/2005 913.56 8400
86 9/2005 927.54 8500
87 10/2005 910.76 8600
88 11/2005 896.13 8700
89 12/2005 899.79 8800
90 1/2006 914.01 8900
91 2/2006 928.94 9000
92 3/2006 926.63 9100
93 4/2006 949.23 9200
94 5/2006 927.78 9300
95 6/2006 914.69 9400
96 7/2006 935.85 9500
97 8/2006 958.12 9600
98 9/2006 967.55 9700
99 10/2006 988.30 9800
100 11/2006 1080.66 9900
101 12/2006 1096.24 10000
102 1/2007 1189.35 10100
103 2/2007 1196.45 10200
104 3/2007 1246.87 10300
105 4/2007 1322.25 10400
106 5/2007 1346.89 10500
107 6/2007 1354.38 10600
108 7/2007 1373.71 10700
109 8/2007 1273.93 10800
110 9/2007 1336.30 10900
111 10/2007 1413.65 11000
112 11/2007 1396.98 11100
113 12/2007 1445.03 11200
df <- data.frame(ts)
df <- data.frame(Date = df$Date, Y = df$Close)
df <- df[!is.na(df$Y), ]

library(minpack.lm)
library(ggplot2)

f <- function(pars, xx) {
  pars$a + pars$b * (pars$tc - xx)^pars$m *
    (1 + pars$c * cos(pars$omega * log(pars$tc - xx) + pars$phi))
}
resids <- function(p, observed, xx) observed - f(p, xx)  # use observed, not df$Y

nls.out <- nls.lm(par = list(a = 7.048293, b = -8.8e-5, tc = 112000, m = 0.5,
                             omega = 3.03, phi = -9.76, c = -14),
                  fn = resids, observed = df$Y, xx = df$Date,
                  control = nls.lm.control(maxiter = 1024, ftol = 1e-6, maxfev = 1e6))
par <- nls.out$par

nls.final <- nls(Y ~ a + (tc - Date)^m * (b + c * cos(omega * log(tc - Date) + phi)),
                 data = df, start = par, algorithm = "plinear",
                 control = nls.control(maxiter = 1024, minFactor = 1e-8))
Error in qr.solve(QR.B, cc) : singular matrix 'a' in solve
I get a singular-matrix error. What do I need to change to avoid it?
Your problem is that the cosine term passes through zero for some parameter values, which makes the matrix singular; you basically need to constrain the parameter space. Additionally, I would read more of the literature: some straightforward trigonometric work removes the phi parameter, and that improves the nonlinear optimization enough to give useful, reproducible results.
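To make that last point concrete, here is a minimal sketch of the reparameterisation, using the identity cos(x + phi) = cos(phi)cos(x) - sin(phi)sin(x) so that phi is absorbed into two coefficients c1 and c2 that enter linearly (this follows the parameterisation of your nls call; the starting values are untuned guesses):
library(minpack.lm)

# c * cos(omega * log(tc - t) + phi) becomes
# c1 * cos(omega * log(tc - t)) + c2 * sin(omega * log(tc - t))
f2 <- function(p, t) {
  lt <- log(p$tc - t)
  p$a + (p$tc - t)^p$m * (p$b + p$c1 * cos(p$omega * lt) + p$c2 * sin(p$omega * lt))
}
resids2 <- function(p, observed, t) observed - f2(p, t)

nls.out2 <- nls.lm(par = list(a = 7, b = -1e-4, tc = 12000, m = 0.5,
                              omega = 3, c1 = 0.1, c2 = 0.1),  # untuned guesses
                   fn = resids2, observed = df$Y, t = df$Date,
                   control = nls.lm.control(maxiter = 1024))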

Error in is.constant(y) : (list) object cannot be coerced to type 'double'

I have a sample file that I am using to forecast as a panel series.
The steps followed are:
library(plm)
library(Formula)
library(forecast)
library(timeDate)
library(zoo)

df1 <- read.csv("full panel Test Input.csv", header = TRUE, sep = ",")
pdf1 <- plm.data(df1, index = c("state", "time"))
fmodel1 <- plm(M4 ~ M1 + M2 + M3, data = pdf1, model = "within")
fcast <- forecast(fmodel1, data = pdf1)
I get the error stated in the subject exactly at step nine, the forecast() call. The process and the data are fine, since we have checked them in Stata.
Any help is appreciated.
M1 M2 M3 M4 state time
466 63 14 10 AZ 2013w31
0 63 0 0 AZ 2013w32
480 63 77 270 AZ 2013w33
0 63 0 0 AZ 2013w34
10 0 742 40 AZ 2013w35
0 0 0 0 AZ 2013w36
0 0 210 10 AZ 2013w37
1049 28 168 30 AZ 2013w38
1148 35 203 20 AZ 2013w39
5130 182 21 10 AZ 2013w40
8667 427 0 10 AZ 2013w41
460000 10731 14 20 AZ 2013w42
1000000 27608 0 120 AZ 2013w43
0 27608 0 0 AZ 2013w44
18494 1344 7 30 AZ 2013w45
15775 1176 21 10 AZ 2013w46
15516 1197 0 40 AZ 2013w47
0 1197 0 0 AZ 2013w48
0 1197 0 0 AZ 2013w49
11280 700 0 30 AZ 2013w50
6320 336 14 50 AZ 2013w51
765 35 0 20 AZ 2013w52
230 0 0 10 AZ 2014w1
0 0 0 0 NJ 2013w31
0 0 0 0 NJ 2013w32
6 0 0 10 NJ 2013w33
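As far as I can tell, forecast() has no method for plm objects, so the fitted model falls through to forecast's default method, which tries to coerce the model (a list) to a numeric series; that is where the error comes from. A hedged sketch of one workaround under that reading, forecasting each state's M4 series separately (frequency = 52 and h = 4 are illustrative guesses for weekly data):
library(forecast)

# Forecast each state's M4 series on its own instead of the panel model.
fcasts <- lapply(split(df1$M4, df1$state), function(y) {
  forecast(auto.arima(ts(y, frequency = 52)), h = 4)
})
fcasts[["AZ"]]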
