Gzip encoded content URL - R

I am having trouble trying to retrieve the gzip'd content of the following URL:
https://www.lendingclub.com/browse/browseNotesAj.action?method=getResultsInitial&startindex=0&pagesize=1
I can see that the content is encoded using gzip by looking at the response headers:
HTTP/1.1 200 OK
Content-Encoding: gzip
I have tried RCurl's getURL as well as the approach from this post, with no luck. Can someone help me get the content into a variable (hopefully without having to write to and read from a file)?

Or, in httr:
library(httr)
library(jsonlite)
out <- GET("https://www.lendingclub.com/browse/browseNotesAj.action?method=getResultsInitial&startindex=0&pagesize=1")
jsonlite::fromJSON(content(out, "text"))
$result
[1] "success"
$searchresult
$searchresult$loans
loanGrade purpose loanAmtRemaining loanUnfundedAmount noFee primeTotalInvestment title
1 C5 debt_consolidation 25 25 0 0 Debt consolidation
isInCurrentOrder alreadySelected primeFractions fico wholeLoanTimeRemaining loanType primeUnfundedAmount
1 FALSE FALSE 0 720-724 -69999 Personal 0
hasCosigner amountToInvest loan_status alreadyInvestedIn loanLength searchrank loanRateDiff loanGUID
1 FALSE 0 INFUNDING FALSE 36 1 .00 35783459
isWholeLoan loanAmt loanAmountRequested primeMarkedInvestment loanRate loanTimeRemaining
1 0 7650 7650 0 14.99 1199721001
$searchresult$totalRecords
[1] 1472

Turns out RCurl handles gzip encoding:
getURL('https://www.lendingclub.com/browse/browseNotesAj.action?method=getResultsInitial&startindex=0&pagesize=1',
       encoding = "gzip")
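To get the parsed content straight into a variable, as asked, that call can be combined with jsonlite. A minimal sketch, assuming the endpoint still returns the JSON shown above:
library(RCurl)
library(jsonlite)
url <- "https://www.lendingclub.com/browse/browseNotesAj.action?method=getResultsInitial&startindex=0&pagesize=1"
# getURL() returns the decompressed body as a single character string,
# which fromJSON() can parse without touching the file system
txt <- getURL(url, encoding = "gzip")
dat <- fromJSON(txt)
dat$searchresult$totalRecords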


How to get the next page on a website using requests and bs4

I want to scrape the info from a leaderboard on this website, but it only shows 25 entries at once and the URL doesn't change when you press the "next" button to get the next 25 entries. What I want to do is get all the "number of rescues" values from all the entries in the leaderboard, so I can check whether the "number of rescues" follows a Pareto distribution (meaning that, for example, the top 20% of all people are responsible for 80% of all rescues).
So I can get the first 25 entries no problem like this:
import requests
from bs4 import BeautifulSoup
url = 'https://fuelrats.com/leaderboard'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
rows = soup.findAll('div',{'class':"rt-tr-group",'role':"rowgroup"})
print(len(rows))
but after that, I don't know how to press "next" from Python and then get the next 25 entries. How can I do that? Is it even possible with just requests and bs4?
There is another way of getting that data: scraping the API endpoint the page is hydrated from. The API URL can be found in the browser's Dev Tools, under the Network tab's 'XHR' section.
import requests
import pandas as pd
r = requests.get('https://fuelrats.com/api/fr/leaderboard?page%5Boffset%5D=0&page%5Blimit%5D=5000')
df = pd.json_normalize(r.json()['data'])
print(df)
Result in terminal:
type id attributes.preferredName attributes.ratNames attributes.joinedAt attributes.rescueCount attributes.codeRedCount attributes.isDispatch attributes.isEpic links.self
0 leaderboard-entries e9520722-02d2-4d69-9dba-c4e3ea727b14 Aleethia [Aanyath, Aleethia, Konisho, Ravenov] 2015-07-28T00:06:53.000Z 15467 2228 False False https://api.fuelrats.com/leaderboard-entries/e...
1 leaderboard-entries 5ed94356-bdcc-4139-9208-3cec320d51c9 Elysiumchains [Alysianfolly, Elysianfields, Elysiumchains, E... 2018-06-22T00:07:59.000Z 8476 810 True False https://api.fuelrats.com/leaderboard-entries/5...
2 leaderboard-entries bb1d04cd-2889-4994-ad6a-3fb881a5d243 Caleb Dume [Agamedes, Alcamenes, Andocides, Argonides, Ca... 2019-09-14T03:40:37.101Z 3163 309 True False https://api.fuelrats.com/leaderboard-entries/b...
3 leaderboard-entries 4a3ebe7a-35e6-4371-94f7-50db34c0167a Falcon JSDF [Falcon JSDF] 2016-03-01T02:44:51.265Z 2639 210 True False https://api.fuelrats.com/leaderboard-entries/4...
4 leaderboard-entries 9257c634-8c79-4d0c-b64c-f9acf1672f3a JERRYCLARK [JERRYCLARK] 2017-05-26T15:58:34.000Z 2452 229 False False https://api.fuelrats.com/leaderboard-entries/9...
... ... ... ... ... ... ... ... ... ... ...
3455 leaderboard-entries 55e8f803-9005-4c99-a5c3-d8ea49ac365a Boonlike [Boonlike] 2020-11-28T20:52:22.250Z 1 0 False False https://api.fuelrats.com/leaderboard-entries/5...
3456 leaderboard-entries 46da00f1-0ac4-475c-9fc9-9da81624a5cd gamerjackiechan2 [gamerjackiechan2] 2018-06-04T18:51:10.000Z 1 0 False False https://api.fuelrats.com/leaderboard-entries/4...
3457 leaderboard-entries e1d78729-3813-451e-95ff-f64c76bce4ba DutchProjectz [DutchProjectz] 2015-09-06T02:46:26.000Z 1 0 False False https://api.fuelrats.com/leaderboard-entries/e...
3458 leaderboard-entries 57417a4e-b7a9-4612-8d66-59b69b33447e dalam88 [dalam88] 2017-07-25T14:48:13.000Z 1 0 False False https://api.fuelrats.com/leaderboard-entries/5...
3459 leaderboard-entries 233a183a-b270-441f-bd55-258511cd9541 Glyc [Glyc, Glyca94] 2021-03-22T19:36:24.121Z 1 0 False False https://api.fuelrats.com/leaderboard-entries/2...
3460 rows × 10 columns

Removing some part of a string from a URL

I want to remove the JSESSIONID from a given URL string, along with the slash at the start:
/product.screen?productId=BS-AG-G09&JSESSIONID=SD1SL6FF6ADFF6510
so that the output would look like
product.screen?productId=BS-AG-G09
More example data:
1 /product.screen?productId=WC-SH-A02&JSESSIONID=SD0SL6FF7ADFF4953
2 /oldlink?itemId=EST-6&JSESSIONID=SD0SL6FF7ADFF4953
3 /product.screen?productId=BS-AG-G09&JSESSIONID=SD0SL6FF7ADFF4953
4 /product.screen?productId=FS-SG-G03&JSESSIONID=SD0SL6FF7ADFF4953
5 /cart.do?action=remove&itemId=EST-11&productId=WC-SH-A01&JSESSIONID=SD0SL6FF7ADFF4953
6 /oldlink?itemId=EST-14&JSESSIONID=SD0SL6FF7ADFF4953
7 /cart.do?action=view&itemId=EST-6&productId=MB-AG-T01&JSESSIONID=SD1SL6FF6ADFF6510
8 /product.screen?productId=BS-AG-G09&JSESSIONID=SD1SL6FF6ADFF6510
9 /product.screen?productId=WC-SH-A02&JSESSIONID=SD1SL6FF6ADFF6510
10 /cart.do?action=view&itemId=EST-6&productId=WC-SH-A02&JSESSIONID=SD1SL6FF6ADFF6510
11 /product.screen?productId=WC-SH-A02&JSESSIONID=SD1SL6FF6ADFF6510
You may use:
library(stringi)
lf1 = "/product.screen?productId=BS-AG-G09&JSESSIONID=SD0SL6FF7ADFF4953"
stri_replace_all_regex(lf1, "&JSESSIONID=.*", "")
Here the pattern &JSESSIONID=.* (everything from &JSESSIONID= up to the end of the string) gets replaced with nothing ("").
Or simply: gsub("&JSESSIONID=.*", "", lf1)
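Both functions are vectorised, so if the paths are stored in a character vector (called urls here just for illustration), the whole set can be cleaned in one call:
# hypothetical vector holding two of the paths from the question
urls <- c("/product.screen?productId=WC-SH-A02&JSESSIONID=SD0SL6FF7ADFF4953",
          "/oldlink?itemId=EST-6&JSESSIONID=SD0SL6FF7ADFF4953")
# gsub() drops everything from "&JSESSIONID=" to the end of each element
gsub("&JSESSIONID=.*", "", urls)
# [1] "/product.screen?productId=WC-SH-A02" "/oldlink?itemId=EST-6"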

How to set the right RCurl options to download from the NSE website

I am trying to download files from the NSE India website (nseindia.com). The problem is that the webmaster does not like scraping programs downloading files or reading pages from the website; there seems to be a user-agent-based restriction.
The file I am trying to download is http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip
I am able to download it from the Linux shell using
curl -v -A "Mozilla" http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip
The output is this:
* About to connect() to www.nseindia.com port 80 (#0)
*   Trying 115.112.4.12... connected
> GET /archives/equities/bhavcopy/pr/PR280815.zip HTTP/1.1
> User-Agent: Mozilla
> Host: www.nseindia.com
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: Oracle-iPlanet-Web-Server/7.0
< Content-Length: 374691
< X-frame-options: SAMEORIGIN
< Last-Modified: Fri, 28 Aug 2015 12:20:02 GMT
< ETag: "5b7a3-55e051f2"
< Accept-Ranges: bytes
< Content-Type: application/zip
< Date: Sat, 29 Aug 2015 17:56:05 GMT
< Connection: keep-alive
{ [data not shown]
This allows me to download the file.
The code I am using in RCurl is this:
library("RCurl")
jurl <- "http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip"
juseragent <- "Mozilla"
myOpts = curlOptions(verbose = TRUE, header = TRUE, useragent = juseragent)
jfile <- getURL(jurl,.opts=myOpts)
This, too, does not work.
I have also unsuccessfully tried using download.file from the base library with the user agent changed.
Any help will be appreciated.
library(curl)  # note: this is the curl package, not RCurl
# download the file into the working directory
curl_download("http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip",
              "tt.zip", handle = new_handle(useragent = "my_user_agent"))
First, your problem is not setting the user agent, but downloading binary data. This works:
jfile <- getURLContent(jurl, .opts=myOpts, binary=TRUE)
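getURLContent() with binary=TRUE returns the zip as a raw vector, so one way to finish the job could be to write it to disk and unzip it (the file name here is only illustrative):
# jfile from the call above is a raw vector; as.vector() strips any
# attributes getURLContent() may have attached before writing it out
writeBin(as.vector(jfile), "PR280815.zip")
files <- unzip("PR280815.zip", exdir = tempdir())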
Here is a (more) complete example using httr instead of RCurl.
library(httr)
url <- "http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip"
response <- GET(url, user_agent("Mozilla"))
response$status # 200 OK
# [1] 200
tf <- tempfile()
writeBin(content(response, "raw"), tf) # write response content (the zip file) to a temporary file
files <- unzip(tf, exdir=tempdir()) # unzips to system temp directory and returns a vector of file names
df.lst <- lapply(files[grepl("\\.csv$",files)],read.csv) # convert .csv files to list of data.frames
head(df.lst[[2]])
# SYMBOL SERIES SECURITY HIGH.LOW INDEX.FLAG
# 1 AGRODUTCH EQ AGRO DUTCH INDUSTRIES LTD H NA
# 2 ALLSEC EQ ALLSEC TECHNOLOGIES LTD H NA
# 3 ALPA BE ALPA LABORATORIES LTD H NA
# 4 AMTL EQ ADV METERING TECH LTD H NA
# 5 ANIKINDS BE ANIK INDUSTRIES LTD H NA
# 6 ARSHIYA EQ ARSHIYA LIMITED H NA

Reading data from URL

Is there a reasonably easy way to get data from some URL? I tried the most obvious version, which does not work:
readcsv("https://dl.dropboxusercontent.com/u/.../testdata.csv")
I did not find any usable reference. Any help?
If you want to read a CSV from a URL, you can use the Requests package as @waTeim shows and then read the data through an IOBuffer. See the example below.
Or, as @Colin T Bowers comments, you could use the currently (December 2017) more actively maintained HTTP.jl package like this:
julia> using HTTP
julia> res = HTTP.get("https://www.ferc.gov/docs-filing/eqr/q2-2013/soft-tools/sample-csv/transaction.txt");
julia> mycsv = readcsv(res.body);
julia> for (colnum, myheader) in enumerate(mycsv[1,:])
println(colnum, '\t', myheader)
end
1 transaction_unique_identifier
2 seller_company_name
3 customer_company_name
4 customer_duns_number
5 tariff_reference
6 contract_service_agreement
7 trans_id
8 transaction_begin_date
9 transaction_end_date
10 time_zone
11 point_of_delivery_control_area
12 specific location
13 class_name
14 term_name
15 increment_name
16 increment_peaking_name
17 product_name
18 transaction_quantity
19 price
20 units
21 total_transmission_charge
22 transaction_charge
Using the Requests.jl package:
julia> using Requests
julia> res = get("https://www.ferc.gov/docs-filing/eqr/q2-2013/soft-tools/sample-csv/transaction.txt");
julia> mycsv = readcsv(IOBuffer(res.data));
julia> for (colnum, myheader) in enumerate(mycsv[1,:])
println(colnum, '\t', myheader)
end
(this prints the same 22 column names as the HTTP.jl example above)
If you are looking to read the data into a DataFrame, this will also work in Julia:
using CSV
dataset = CSV.read(download("https://mywebsite.edu/ml/machine-learning-databases/my.data"))
The Requests package seems to work pretty well. There are others (see the entire package list) but Requests is actively maintained.
Obtaining it
julia> Pkg.add("Requests")
julia> using Requests
Using it
You can use one of the exported functions that correspond to the various HTTP verbs (get, post, etc.), each of which returns a Response object:
julia> res = get("http://julialang.org")
Response(200 OK, 21 Headers, 20913 Bytes in Body)
julia> typeof(res)
Response (constructor with 8 methods)
And then, for example, you can print the data using @printf
julia> @printf("%s", res.data);
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-us" lang="en-us">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
...
If it is directly a CSV file, something like this should work:
A = readdlm(download(url),';')
Nowadays you can also use UrlDownload.jl, which is pure Julia, takes care of download details, processes data in memory, and can also work with compressed files.
Usage is straightforward:
using UrlDownload
A = urldownload("https://data.ok.gov/sites/default/files/unspsc%20codes_3.csv")

How can I apply a Fisher test to this set of data (nominal variables)?

I'm pretty new to statistics:
fisher = function(idxToTest, idxATI){
  idxDependent = c()
  dependent = c()
  p = c()
  for(i in c(1:length(idxToTest))){
    tbl = table(data[[idxToTest[i]]], data[[idxATI]])
    rez = fisher.test(tbl, workspace = 20000000000)
    if(rez$p.value < 0.1){
      dependent = c(dependent, TRUE)
      idxDependent = c(idxDependent, idxToTest[i])
    } else {
      dependent = c(dependent, FALSE)
    }
    p = c(p, rez$p.value)
  }
}
This is the function I use. It seems to work.
What I understood so far is that I have to pass, as the first parameter, data like:
Men Women
Dieting 10 30
Non-dieting 5 60
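A contingency table of exactly that shape can be built by hand and passed straight to fisher.test(); a small sketch using the numbers above:
tbl <- matrix(c(10, 30,
                 5, 60),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Dieting", "Non-dieting"), c("Men", "Women")))
fisher.test(tbl)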
My data comes from a CSV:
data = read.csv('***.csv', header = TRUE, sep=',');
My first problem is that I don't know how to convert from:
Loan.Purpose Home.Ownership
lp_value_1 ho_value_2
lp_value_1 ho_value_2
lp_value_2 ho_value_1
lp_value_3 ho_value_2
lp_value_2 ho_value_3
lp_value_4 ho_value_2
lp_value_3 ho_value_3
to:
ho_value_1 ho_value_2 ho_value_3
lp_value1 0 2 0
lp_value2 1 0 1
lp_value3 0 1 1
lp_value4 0 1 0
The second issue is that I don't know what the second parameter should be.
POST UPDATE: This is what I get using fisher.test(myTable):
Error in fisher.test(test) : FEXACT error 501.
The hash table key cannot be computed because the largest key
is larger than the largest representable int.
The algorithm cannot proceed.
Reduce the workspace size or use another algorithm.
where myTable is:
MORTGAGE NONE OTHER OWN RENT
car 18 0 0 5 27
credit_card 190 0 2 38 214
debt_consolidation 620 0 2 87 598
educational 5 0 0 3 7
...
Basically, Fisher tests only work on smallish data sets because they require a lot of memory. But all is good, because chi-square tests make minimal additional assumptions and are easier on the computer. Just do:
chisq.test(data$Loan.Purpose, data$Home.Ownership)
to get your p-values.
Make sure you read through and understand the help page for chisq.test, especially the examples at the bottom.
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html
Then look at a mosaic plot to see the counts, e.g.:
mosaicplot(table(data$Loan.Purpose, data$Home.Ownership))
This reference explains how mosaic plots work.
http://alumni.media.mit.edu/~tpminka/courses/36-350.2001/lectures/day12/
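As for the first half of the question, getting from the two raw columns to a contingency table: table() does the cross-tabulation. A minimal sketch, assuming data is the data frame read from the CSV:
tbl <- table(data$Loan.Purpose, data$Home.Ownership)  # cross-tabulate the two factors
chisq.test(tbl)                                       # same test, starting from the table
# If an exact test is still wanted on a table this size, a Monte Carlo
# p-value avoids FEXACT error 501:
# fisher.test(tbl, simulate.p.value = TRUE, B = 1e5)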
