How to get the next page on a website using requests and bs4 - web-scraping

I want to web-scrape the info from a leaderboard on this website, but it only shows 25 entries at once and the URL doesn't change when you press the "next" button to get the next 25 entries. What I want to do is get the "number of rescues" for every entry in the leaderboard so I can check whether the "number of rescues" follows a Pareto distribution (meaning that, for example, the top 20% of all people are responsible for 80% of all rescues).
So I can get the first 25 entries no problem like this:
import requests
from bs4 import BeautifulSoup
url = 'https://fuelrats.com/leaderboard'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
rows = soup.find_all('div', {'class': "rt-tr-group", 'role': "rowgroup"})
print(len(rows))
but after that, I don't know how to press "next" from python and then get the next 25 entries. How can I do that? Is it even possible with just requests and bs4?

There is another way of getting that data: scrape the API endpoint the page is hydrated from. The API URL can be found in Dev Tools, in the Network tab under the 'XHR' section.
import requests
import pandas as pd
r = requests.get('https://fuelrats.com/api/fr/leaderboard?page%5Boffset%5D=0&page%5Blimit%5D=5000')
df = pd.json_normalize(r.json()['data'])
print(df)
Result in terminal:
type id attributes.preferredName attributes.ratNames attributes.joinedAt attributes.rescueCount attributes.codeRedCount attributes.isDispatch attributes.isEpic links.self
0 leaderboard-entries e9520722-02d2-4d69-9dba-c4e3ea727b14 Aleethia [Aanyath, Aleethia, Konisho, Ravenov] 2015-07-28T00:06:53.000Z 15467 2228 False False https://api.fuelrats.com/leaderboard-entries/e...
1 leaderboard-entries 5ed94356-bdcc-4139-9208-3cec320d51c9 Elysiumchains [Alysianfolly, Elysianfields, Elysiumchains, E... 2018-06-22T00:07:59.000Z 8476 810 True False https://api.fuelrats.com/leaderboard-entries/5...
2 leaderboard-entries bb1d04cd-2889-4994-ad6a-3fb881a5d243 Caleb Dume [Agamedes, Alcamenes, Andocides, Argonides, Ca... 2019-09-14T03:40:37.101Z 3163 309 True False https://api.fuelrats.com/leaderboard-entries/b...
3 leaderboard-entries 4a3ebe7a-35e6-4371-94f7-50db34c0167a Falcon JSDF [Falcon JSDF] 2016-03-01T02:44:51.265Z 2639 210 True False https://api.fuelrats.com/leaderboard-entries/4...
4 leaderboard-entries 9257c634-8c79-4d0c-b64c-f9acf1672f3a JERRYCLARK [JERRYCLARK] 2017-05-26T15:58:34.000Z 2452 229 False False https://api.fuelrats.com/leaderboard-entries/9...
... ... ... ... ... ... ... ... ... ... ...
3455 leaderboard-entries 55e8f803-9005-4c99-a5c3-d8ea49ac365a Boonlike [Boonlike] 2020-11-28T20:52:22.250Z 1 0 False False https://api.fuelrats.com/leaderboard-entries/5...
3456 leaderboard-entries 46da00f1-0ac4-475c-9fc9-9da81624a5cd gamerjackiechan2 [gamerjackiechan2] 2018-06-04T18:51:10.000Z 1 0 False False https://api.fuelrats.com/leaderboard-entries/4...
3457 leaderboard-entries e1d78729-3813-451e-95ff-f64c76bce4ba DutchProjectz [DutchProjectz] 2015-09-06T02:46:26.000Z 1 0 False False https://api.fuelrats.com/leaderboard-entries/e...
3458 leaderboard-entries 57417a4e-b7a9-4612-8d66-59b69b33447e dalam88 [dalam88] 2017-07-25T14:48:13.000Z 1 0 False False https://api.fuelrats.com/leaderboard-entries/5...
3459 leaderboard-entries 233a183a-b270-441f-bd55-258511cd9541 Glyc [Glyc, Glyca94] 2021-03-22T19:36:24.121Z 1 0 False False https://api.fuelrats.com/leaderboard-entries/2...
3460 rows × 10 columns
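With the full table in hand, the Pareto check the question describes is a few lines of pandas. A sketch with made-up counts standing in for `df['attributes.rescueCount']` (the 20%/80% figures are just the question's example):

```python
import pandas as pd

# Hypothetical rescue counts standing in for df['attributes.rescueCount']
counts = pd.Series([15467, 8476, 3163, 2639, 2452, 10, 5, 1, 1, 1])

counts = counts.sort_values(ascending=False)
top20 = counts.head(int(len(counts) * 0.2))  # top 20% of rats
share = top20.sum() / counts.sum()           # their share of all rescues
print(f"top 20% account for {share:.1%} of rescues")
# → top 20% account for 74.3% of rescues
```

On the real 3460-row frame the same two lines answer the question directly.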


format_datetime inside dynamic data type kusto

I have a function that takes startTs and endTs as inputs. From within the function I perform the below operations:
use format_datetime to convert the input dates into a specified format
use the sql_request plugin and pass these formatted dates as SQL parameters.
let start_time = todynamic(format_datetime(startTs, 'yyyy-MM-dd'));
let end_time = todynamic(format_datetime(endTs, 'yyyy-MM-dd'));
let query = 'select * from [db].[Table] where date > #param0 and date < #param1';
let result = evaluate sql_request(auth, query, dynamic({'param0': start_time, 'param1': end_time}))
However, in the editor I get a syntax error on start_time saying "expected '}'", and also on sql_request saying that sql_request expects 4 arguments.
How do I pass formatted datetime as part of dynamic data type? Thanks!
Unfortunately, at this point plugins support only constant arguments.
This might change in the future.
Some additional remarks:
Parameter names must start with # (both in the SQL query and in the SQL parameters definition)
Parameters may have a datetime type; there is no need to convert them to string.
dynamic() can only be used with constants, e.g. dynamic({"x":1,"y":2}).
For non-constants use pack(), bag_pack() or pack_dictionary() (all aliases of each other), e.g.
let x_val = 1;
let y_val = 2;
let my_dict = pack_dictionary("x", x_val, "y", y_val);
print my_dict
print_0
{"x":1,"y":2}
Demo:
let connection_string = h'Server=tcp:dumarkov.database.windows.net,1433;Initial Catalog=mydb;Encrypt=True;Authentication="Active Directory Integrated";';
let sql_query = "select * from sys.tables where create_date >= #mydatetime";
let sql_parameters = dynamic({"#mydatetime":datetime("2022-06-21 00:00:00")});
evaluate sql_request(connection_string, sql_query, sql_parameters)
(output truncated: sql_request returns every sys.tables column; the two matching user tables are)
name   type_desc    create_date                 modify_date
hello  USER_TABLE   2022-06-23T14:22:23.723Z    2022-06-23T14:22:23.723Z
world  USER_TABLE   2022-06-25T09:21:28.36Z     2022-06-25T09:21:28.36Z

Webscraping RequestGet from Airbnb not working properly

This query returns 0 or 20 seemingly at random every time I run it. Yesterday, when I looped through the pages, I always got 20 and was able to scrape 20 listings across 15 pages. But now I can't run my code properly, because sometimes the listings return 0.
I tried adding headers to the GET request and a random 5-10 s sleep before each request, but I'm still facing the same issue. I also tried connecting to a hotspot to change my IP, with the same result. Does anyone understand why?
import time
from random import randint
from bs4 import BeautifulSoup
import requests #to connect to url
airbnb_url = 'https://www.airbnb.com/s/Mayrhofen--Austria/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&date_picker_type=calendar&query=Mayrhofen%2C%20Austria&place_id=ChIJbzLYLzjdd0cRDtGuTzM_vt4&checkin=2021-02-06&checkout=2021-02-13&adults=4&source=structured_search_input_header&search_type=autocomplete_click'
soup = BeautifulSoup(requests.get(airbnb_url).content, 'html.parser')
listings = soup.find_all('div', '_8s3ctt')
print(len(listings))
It seems Airbnb returns two versions of the page: one "normal" HTML version, and another where the listings are stored inside a <script> tag. To parse the <script> version of the page you can use the next example:
import json
import requests
from bs4 import BeautifulSoup
def find_listing(d):
    if isinstance(d, dict):
        if "__typename" in d and d["__typename"] == "DoraListingItem":
            yield d["listing"]
        else:
            for v in d.values():
                yield from find_listing(v)
    elif isinstance(d, list):
        for v in d:
            yield from find_listing(v)

airbnb_url = "https://www.airbnb.com/s/Mayrhofen--Austria/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&date_picker_type=calendar&query=Mayrhofen%2C%20Austria&place_id=ChIJbzLYLzjdd0cRDtGuTzM_vt4&checkin=2021-02-06&checkout=2021-02-13&adults=4&source=structured_search_input_header&search_type=autocomplete_click"
soup = BeautifulSoup(requests.get(airbnb_url).content, "html.parser")
listings = soup.find_all("div", "_8s3ctt")

if len(listings):
    # normal page:
    print(len(listings))
else:
    # page that has listings stored inside <script>:
    data = json.loads(soup.select_one("#data-deferred-state").contents[0])
    for i, l in enumerate(find_listing(data), 1):
        print(i, l["name"])
Prints (when the <script> version was returned):
1 Mariandl (MHO103) for 36 persons.
2 central and friendly! For Families and Friends
3 Sonnenheim for 5 persons.
4 MO's Apartments
5 MO's Apartments
6 Beautiful home in Mayrhofen with 3 Bedrooms
7 Quaint Apartment in Finkenberg near Ski Lift
8 Apartment 2 Villa Daringer (5 pax.)
9 Modern Apartment in Schwendau with Garden
10 Holiday flats Dornau, Mayrhofen
11 Maple View
12 Laubichl Lodge by Apart Hotel Therese
13 Haus Julia - Apartment Edelweiß Mayrhofen
14 Melcherhof,
15 Rest coke
16 Vacation home Traudl
17 Luxurious Apartment near Four Ski Lifts in Mayrhofen
18 Apartment 2 60m² for 2-4 persons "Binder"
19 Apart ZEMMGRUND, 4-9 persons in Mayrhofen/Tirol
20 Apartment Ahorn View
EDIT: To print lat, lng:
...
for i, l in enumerate(find_listing(data), 1):
    print(i, l["name"], l["lat"], l["lng"])
Prints:
1 Mariandl (MHO103) for 36 persons. 47.16522 11.85723
2 central and friendly! For Families and Friends 47.16209 11.859691
3 Sonnenheim for 5 persons. 47.16809 11.86694
4 MO's Apartments 47.166969 11.863186
...

How can I split one row into several rows using a LINQ 'UNION ALL'?

1. Context:
A client calls Joe because his app needs to print "the bill" (~15 years ago).
But the "bill" has different information depending on the client type.
There are 3 types of client: (a=1, b=2, c=4).
And a client can be a combination of those types, so it's a bitfield.
+----+------+----------+----------+----------+-----------+-----------+-----------+
| id | type | is_SendA | is_SendB | is_SendC | is_PayedA | is_PayedB | is_PayedC |
+----+------+----------+----------+----------+-----------+-----------+-----------+
| 01 |  1   |  true    |  true    |  true    |  true     |  false    |  false    |
| 02 |  2   |  false   |  true    |  true    |  false    |  true     |  false    |
| 03 |  4   |  false   |  false   |  false   |  false    |  false    |  true     |
| 04 |  3   |  true    |  false   |  false   |  true     |  false    |  false    |
| 05 |  3   |  true    |  true    |  true    |  false    |  true     |  false    |
| 06 |  5   |  false   |  true    |  true    |  false    |  false    |  true     |
| 07 |  5   |  false   |  false   |  false   |  true     |  false    |  false    |
| 08 |  6   |  true    |  false   |  false   |  false    |  true     |  false    |
| 09 |  7   |  true    |  true    |  true    |  false    |  false    |  true     |
+----+------+----------+----------+----------+-----------+-----------+-----------+
Type: a bitfield.
is_SendA/B/C: hold the same kind of information, one per type, as the values can differ between types.
is_PayedA/B/C: likewise.
Every piece of information on the bill is multiplied by the number of types.
2. What do we need:
As killing Joe is not an option, and neither is dropping the database,
the expected result is the following:
ID type is_Send is_Payed
09 1 true false
09 2 true false
09 4 true true
The idea is to make LINQ generate something like this:
SELECT Id, 1 AS Type
, is_SendA AS Is_Send
, is_PayedA AS Is_Payed
WHERE Id = 9
AND Type & 1 = 1
UNION ALL
SELECT Id, 2 AS Type
, is_SendB AS Is_Send
, is_PayedB AS Is_Payed
WHERE Id = 9
AND Type & 2 = 2
UNION ALL
SELECT Id, 4 AS Type
, is_SendC AS Is_Send
, is_PayedC AS Is_Payed
WHERE Id = 9
AND Type & 4 = 4
Why do we need this? Because right now, to get the bill for Id=9, Type=A, we have to write out 35 column names.
With something able to produce output like the one shown above, a simple WHERE id= AND type= will give the correct result. And a grid able to display one type will be able to display any other type, since the column names will now be the same.
Fun fact:
There are more than 90 columns in the table. Only 3 are unique; everything else is duplicated per type.
While writing column names, Joe experimented with naming conventions, and sometimes he just mistyped them.
Nota bene:
First, this is not for code shaming; the column counts are here to indicate the complexity of implementing any solution you may come up with.
Second, the mix of naming conventions means there is no way to simply generate the column names.
If I understood your question correctly, you can use the Enumerable.SelectMany() method to unfold your row into multiple rows like this:
var results = myView.SelectMany(v =>
{
    var list = new List<Tuple<int, string>>();
    if ((v.DEV_Dest & 1) == 1)
        list.Add(Tuple.Create(v.Id, v.DataA));
    if ((v.DEV_Dest & 2) == 2)
        list.Add(Tuple.Create(v.Id, v.DataB));
    if ((v.DEV_Dest & 4) == 4)   // 4, not 3: the type bits are 1, 2, 4
        list.Add(Tuple.Create(v.Id, v.DataC));
    return list;
}).ToList();
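The same unfold works in any language: test each bit of the type field and emit one normalized row per set bit. A Python sketch of the logic (the row dict and field names are made up for illustration, following the question's table):

```python
# Each row carries per-type columns; 'type' is a bitfield over 1, 2, 4.
row = {"id": 9, "type": 7,
       "is_SendA": True, "is_PayedA": False,
       "is_SendB": True, "is_PayedB": False,
       "is_SendC": True, "is_PayedC": True}

def unfold(row):
    # One output row per bit set in row["type"], normalising column names.
    for bit, suffix in ((1, "A"), (2, "B"), (4, "C")):
        if row["type"] & bit:
            yield {"id": row["id"], "type": bit,
                   "is_Send": row[f"is_Send{suffix}"],
                   "is_Payed": row[f"is_Payed{suffix}"]}

for r in unfold(row):
    print(r)
```

For id 09 (type 7, all three bits set) this yields the three rows from the expected result: type 1 and 2 with is_Payed false, type 4 with is_Payed true.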

Gzip encoded content URL

I am having trouble trying to retrieve the gzipped content of the following URL:
https://www.lendingclub.com/browse/browseNotesAj.action?method=getResultsInitial&startindex=0&pagesize=1
I can see that the content is encoded using gzip by looking at the response headers:
HTTP/1.1 200 OK
Content-Encoding: gzip
I have tried RCurl using getURL, as well as this post, with no luck. Can someone help me get the content into a variable (hopefully without writing to and reading from a file)?
Or in httr
library(httr)
library(jsonlite)
out <- GET("https://www.lendingclub.com/browse/browseNotesAj.action?method=getResultsInitial&startindex=0&pagesize=1")
jsonlite::fromJSON(content(out, "text"))
$result
[1] "success"
$searchresult
$searchresult$loans
loanGrade purpose loanAmtRemaining loanUnfundedAmount noFee primeTotalInvestment title
1 C5 debt_consolidation 25 25 0 0 Debt consolidation
isInCurrentOrder alreadySelected primeFractions fico wholeLoanTimeRemaining loanType primeUnfundedAmount
1 FALSE FALSE 0 720-724 -69999 Personal 0
hasCosigner amountToInvest loan_status alreadyInvestedIn loanLength searchrank loanRateDiff loanGUID
1 FALSE 0 INFUNDING FALSE 36 1 .00 35783459
isWholeLoan loanAmt loanAmountRequested primeMarkedInvestment loanRate loanTimeRemaining
1 0 7650 7650 0 14.99 1199721001
$searchresult$totalRecords
[1] 1472
Turns out RCurl handles gzip encoding:
getURL('https://www.lendingclub.com/browse/browseNotesAj.action?method=getResultsInitial&startindex=0&pagesize=1',
encoding="gzip")
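For what it's worth, Python's requests behaves like httr here and decompresses gzip transparently. What `Content-Encoding: gzip` means on the wire can be shown with the standard library alone (the JSON body below is a stand-in, not the real LendingClub response):

```python
import gzip

body = b'{"result": "success"}'
wire = gzip.compress(body)            # what the server actually sends over HTTP

print(wire[:2] == b"\x1f\x8b")        # gzip magic bytes → True
print(gzip.decompress(wire) == body)  # client-side decoding restores the JSON → True
```

Libraries that honour the Content-Encoding header do the `gzip.decompress` step for you, which is why the httr answer above can pass the body straight to fromJSON.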

Check if a column has a value; if so, write true or false to the column next to it

I was wondering how to make something that checks whether the column Lair in the data
is below or above a certain threshold; let's say below 0.5 is called LOH and
above is called imbalance. The calls LOH and INBALANCE should be written to a new column. I tried something like the code below.
detection <- function(assay, method, thres) {
  if (method == "threshold") {
    idx <- ifelse(segmenten["intensity"] < 1.1000000 & segmenten["intensity"] > 0.900000 & segmenten["Lair"] > thres, TRUE, FALSE)
  }
  if (method == "cnloh") {
    idx <- ifelse(segmenten["intensity"] < 1.1000000 & segmenten["intensity"] > 0.900000 & segmenten["Lair"] < thres, TRUE, FALSE)
  }
  if (method == "gain") {
    idx <- ifelse(segmenten["intensity"] > 1.1000000 & segmenten["Lair"] < thres, TRUE, FALSE)
  }
  if (method == "loss") {
    idx <- ifelse(segmenten["intensity"] < 0.900000 & segmenten["Lair"] < thres, TRUE, FALSE)
  }
  if (method == "bloss") {
    idx <- ifelse(segmenten["intensity"] < 0.900000 & segmenten["Lair"] > thres, TRUE, FALSE)
  }
  if (method == "bgain") {
    idx <- ifelse(segmenten["intensity"] > 1.100000 & segmenten["Lair"] > thres, TRUE, FALSE)
  }
  return(idx)
}
After this part, the next step is to write the data from the function to the existing table.
Anyone have an idea?
Since your desired result is not clear enough, I made some assumptions and wrote something that might be useful or not.
First of all, inside your function there is an object segmenten which is not defined; I suppose this is the data set supplied as input. Then you used ifelse, and the returned results are TRUE or FALSE, but you want either LOH or INBALANCE when some conditions are met.
You want INBALANCE when ... & segmenten["Lair"] > thres and LOH otherwise (here ... means the other part of the condition). This will give a vector, but you want it in the main dataset as an additional column, don't you? So maybe this could be a new starting point for you to improve your code.
detection <- function(assay, method = c('threshold', 'cnloh', 'gain', 'loss', 'bloss', 'bgain'),
                      thres = 0.5) {
  x <- assay
  idx <- switch(match.arg(method),
                threshold = ifelse(x["intensity"] < 1.1 & x["intensity"] > 0.9 & x["Lair"] > thres, 'INBALANCE', 'LOH'),
                cnloh     = ifelse(x["intensity"] < 1.1 & x["intensity"] > 0.9 & x["Lair"] < thres, 'LOH', 'INBALANCE'),
                gain      = ifelse(x["intensity"] > 1.1 & x["Lair"] < thres, 'LOH', 'INBALANCE'),
                loss      = ifelse(x["intensity"] < 0.9 & x["Lair"] < thres, 'LOH', 'INBALANCE'),
                bloss     = ifelse(x["intensity"] < 0.9 & x["Lair"] > thres, 'INBALANCE', 'LOH'),
                bgain     = ifelse(x["intensity"] > 1.1 & x["Lair"] > thres, 'INBALANCE', 'LOH'))
  colnames(idx) <- 'Checking'
  return(cbind(x, as.data.frame(idx)))
}
Example:
Data <- read.csv("japansegment data.csv", header=T)
result <- detection(Data, method='threshold', thres=0.5) # 'threshold' is the default value for method
head(result)
SNP_NAME x0 x1 y pos.start pos.end chrom count copynumber intensity allele.B Lair uncertain sample_id
1 SNP_A-1656705 0 0 0 836727 27933161 1 230 2 1.0783 1 0.9218 FALSE GSM288035
2 SNP_A-1677548 0 0 0 28244579 246860994 1 4408 2 0.9827 1 0.9236 FALSE GSM288035
3 SNP_A-1669537 0 0 0 100819 159783145 2 3480 2 0.9806 1 0.9193 FALSE GSM288035
4 SNP_A-1758569 0 0 0 159783255 159791136 2 5 2 1.7244 1 0.9665 FALSE GSM288035
5 SNP_A-1662168 0 0 0 159817465 168664268 2 250 2 0.9786 1 0.9197 FALSE GSM288035
6 SNP_A-1723506 0 0 0 168721411 168721920 2 2 2 1.8027 -4 NA FALSE GSM288035
Checking
1 INBALANCE
2 INBALANCE
3 INBALANCE
4 LOH
5 INBALANCE
6 LOH
Using the match.arg and switch functions helps you avoid a lot of if statements.
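The same pattern (validate the method name, then dispatch, instead of chaining ifs) translates directly to Python with a dict of rules. A sketch using the question's column semantics; the numeric rows below are taken from the example output above, everything else is illustrative:

```python
def detect(intensity, lair, method="threshold", thres=0.5):
    # Dispatch table: method -> (predicate, label when it matches, label otherwise),
    # mirroring the six branches of the R switch() above.
    rules = {
        "threshold": (lambda i, l: 0.9 < i < 1.1 and l > thres, "INBALANCE", "LOH"),
        "cnloh":     (lambda i, l: 0.9 < i < 1.1 and l < thres, "LOH", "INBALANCE"),
        "gain":      (lambda i, l: i > 1.1 and l < thres, "LOH", "INBALANCE"),
        "loss":      (lambda i, l: i < 0.9 and l < thres, "LOH", "INBALANCE"),
        "bloss":     (lambda i, l: i < 0.9 and l > thres, "INBALANCE", "LOH"),
        "bgain":     (lambda i, l: i > 1.1 and l > thres, "INBALANCE", "LOH"),
    }
    if method not in rules:  # the match.arg equivalent: reject unknown methods
        raise ValueError(f"method must be one of {sorted(rules)}")
    pred, if_true, if_false = rules[method]
    return if_true if pred(intensity, lair) else if_false

print(detect(1.0783, 0.9218))  # row 1 of the example → INBALANCE
print(detect(1.7244, 0.9665))  # row 4 of the example → LOH
```

As with match.arg, unknown method names fail loudly instead of silently returning nothing.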
