I do some data analysis in R. At the end of the script I want to save my results to a file. I know there are several ways to do it, but none of them work properly for me. When I try sink() it works, but it gives me:
<MySQLResult:1,5,1>
host logname user time request_fline status
1 142.4.5.115 - - 2018-01-03 12:08:58 GET /phpmyadmin?</script><script>alert('<!--VAIBS-->');</script><script> HTTP/1.1 400
size_varchar referer agent ip_adress size_int cookie time_microsec filename
1 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 159 -
request_protocol keepalive request_method contents_of_foobar contents_of_notefoobar port child_id
<MySQLResult:1,5,1>
[1] host logname user time request_fline status
[7] size_varchar referer agent ip_adress size_int cookie
<0 rows> (or 0-length row.names)
which is totally unusable because I can't export that type of data. If I try write.table, it gives me a file with one readable row, but after that one row the R script stops with the error: Error in isOpen(file, "w") : invalid connection. When I try write.csv the result is the same, and when I try lapply it just gives me an empty file.
Here is my code:
fileConn <- file("outputX.txt")
fileCon2 <- file("outputX.csv")
sink("outputQuery.txt")
for (i in 1:length(awq)) {
  sql <- paste("SELECT * FROM mtable ORDER BY cookie LIMIT ", awq[i], ",1")
  nb <- dbGetQuery(mydb, sql)
  print(nb)
  write.table(nb, file = fileConn, append = TRUE, quote = FALSE, sep = " ", eol = "\n", na = "NA", row.names = FALSE, col.names = FALSE)
  write.csv(nb, file = fileCon2, row.names = FALSE, sep = " ")
  lapply(nb, write, fileConn, append = TRUE, ncolumns = 7)
  writeLines(unlist(lapply(nb, paste, collapse = " ")))
}
sink()
close(fileConn)
close(fileCon2)
I am new to R, so I don't know what else I should try. What I want is one file where the data is printed in a form that is easy to read and export. For example like this:
142.4.5.115 - - 2018-01-03 12:08:58 GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 400 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 145 - HTTP/1.1 0 GET - - 80 7216 ?/><!--VAIBS--> GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 - 0 /phpmyadmin - 354 0
142.4.5.115 - - 2018-01-03 12:10:23 GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 400 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 145 - HTTP/1.1 0 GET - - 80 7216 ?/><!--VAIBS--> GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 - 0 /phpmyadmin - 354 0
142.4.5.115 - - 2018-01-03 12:12:41 GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 400 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 145 - HTTP/1.1 0 GET - - 80 7216 ?/><!--VAIBS--> GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 - 0 /phpmyadmin - 354 0
142.4.5.115 - - 2018-01-03 12:15:29 GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 400 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 145 - HTTP/1.1 0 GET - - 80 7216 ?/><!--VAIBS--> GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 - 0 /phpmyadmin - 354 0
or this:
host,logname,user,time, request_fline status,size_varchar,referer agent,ip_adress,size_int,cookie,time_microsec,filename,request_protocol,keepalive,request_method,contents_of_foobar,contents_of_notefoobar port child_id
1 142.4.5.115 - - 2018-01-03 12:08:58 GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 400 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 145 - HTTP/1.1 0 GET - - 80 7216 ?/><!--VAIBS--> GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 - 0 /phpmyadmin - 354 0
2 142.4.5.115 - - 2018-01-03 12:10:23 GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 400 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 145 - HTTP/1.1 0 GET - - 80 7216 ?/><!--VAIBS--> GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 - 0 /phpmyadmin - 354 0
3 142.4.5.115 - - 2018-01-03 12:12:41 GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 400 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 145 - HTTP/1.1 0 GET - - 80 7216 ?/><!--VAIBS--> GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 - 0 /phpmyadmin - 354 0
4 142.4.5.115 - - 2018-01-03 12:15:29 GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 400 Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 445 - 142.4.5.115 445 - 145 - HTTP/1.1 0 GET - - 80 7216 ?/><!--VAIBS--> GET /phpmyadmin?/><!--VAIBS--> HTTP/1.1 - 0 /phpmyadmin - 354 0
or something similar. Best of all would be some help with calling write.table in a loop without the error, but I will welcome any working solution. The best I have so far is:
sql <- paste("SELECT * FROM idsaccess ORDER BY cookie LIMIT ", awq[1], ",1")
nb <- dbGetQuery(mydb, sql)
write.table(nb, file = fileConn, append = TRUE, quote = FALSE, sep = " ", eol = "\n", na = "NA", row.names = FALSE, col.names = FALSE)
fileConn <- file("outputX1.txt")
sql <- paste("SELECT * FROM idsaccess ORDER BY cookie LIMIT ", awq[2], ",1")
nb <- dbGetQuery(mydb, sql)
write.table(nb, file = fileConn, append = TRUE, quote = FALSE, sep = " ", eol = "\n", na = "NA", row.names = FALSE, col.names = FALSE)
But this writes every query to its own file, and I don't want every query in a separate file. Any help?
Simply concatenate all the query data frames into one large data frame, since they all share the same structure, and then output to file in one call, which is the typical way to use write.table or its wrapper, write.csv. (Incidentally, the invalid connection error comes from passing an unopened file() connection to write.table inside a loop: the first call opens and closes the connection, so the next call finds it closed.)
Specifically, turn the for loop:
for (i in 1:length(awq)) {
  sql <- paste("SELECT * FROM mtable ORDER BY cookie LIMIT ", awq[i], ",1")
  nb <- dbGetQuery(mydb, sql)
}
into an lapply that returns a list of data frames:
df_list <- lapply(1:length(awq), function(i) {
  sql <- paste0("SELECT * FROM mtable ORDER BY cookie LIMIT ", awq[i], ",1")
  dbGetQuery(mydb, sql)
})
Then row-bind with do.call to stack all the data frames into a single one and output it to file:
final_df <- do.call(rbind, df_list)
write.table(final_df, file = "outputX.txt", append = FALSE, quote = FALSE, sep = " ",
            eol = "\n", na = "NA", row.names = FALSE, col.names = FALSE)
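If you prefer the comma-separated layout with a header row (the second sample format above), write.csv does the same job on the combined data frame; a minimal sketch:

write.csv(final_df, file = "outputX.csv", row.names = FALSE, na = "NA")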
Related
I am scraping the NSE India results calendar site using a curl command, and even then it gives me a "Resource not found" error. Here's my code:
url = f"https://www.nseindia.com/api/event-calendar?index=equities"
header1 ="Host:www.nseindia.com"
header2 = "User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0"
header3 ="Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
header4 ="Accept-Language:en-US,en;q=0.5"
header5 ="Accept-Encoding:gzip, deflate, br"
header6 ="DNT:1"
header7 ="Connection:keep-alive"
header8 ="Upgrade-Insecure-Requests:1"
header9 ="Pragma:no-cache"
header10 ="Cache-Control:no-cache"
def run_curl_command(curl_command, max_attempts):
    # retry until the "Resource not found" placeholder disappears or attempts run out
    result = os.popen(curl_command).read()
    count = 0
    while "Resource not found" in result and count < max_attempts:
        result = os.popen(curl_command).read()
        count += 1
        time.sleep(1)
    print("API Read")
    result = json.loads(result)
    result = pd.DataFrame(result)
    return result

def init():
    max_attempts = 100
    curl_command = f'curl "{url}" -H "{header1}" -H "{header2}" -H "{header3}" -H "{header4}" -H "{header5}" -H "{header6}" -H "{header7}" -H "{header8}" -H "{header9}" -H "{header10}" --compressed '
    print(f"curl_command : {curl_command}")
    return run_curl_command(curl_command, max_attempts)

init()
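For what it's worth, this endpoint tends to reject bare requests; a common workaround is to open the NSE homepage first inside a requests session so the API call carries the site's cookies. A minimal sketch under that assumption, reusing the same URL and User-Agent:

import requests
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0",
    "Accept-Language": "en-US,en;q=0.5",
}

with requests.Session() as s:
    s.headers.update(headers)
    # assumption: visiting the homepage sets the cookies the API checks
    s.get("https://www.nseindia.com")
    r = s.get("https://www.nseindia.com/api/event-calendar?index=equities")
    df = pd.DataFrame(r.json())
    print(df)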
I am trying to use Beautiful Soup to web-scrape the list of ticker symbols from this page: https://www.barchart.com/options/most-active/stocks
My code returns a lot of HTML from the page, but I can't find any of the ticker symbols with CTRL+F. It would be much appreciated if someone could let me know how I can access these!
Code:
from bs4 import BeautifulSoup as bs
import requests
headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"}
url = "https://www.barchart.com/options/most-active/stocks"
page = requests.get(url, headers=headers)
html = page.text
soup = bs(html, 'html.parser')
print(soup.find_all())
The ticker table is rendered by JavaScript, so the HTML that requests sees never contains the symbols. Query the site's internal JSON API directly instead, passing along the XSRF token it sets as a cookie:

import requests
from urllib.parse import unquote
import pandas as pd
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0",
}
def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        # hit the site root first (url[:25] is "https://www.barchart.com/") to pick up the XSRF-TOKEN cookie
        r = req.get(url[:25])
        req.headers.update(
            {'X-XSRF-TOKEN': unquote(r.cookies.get_dict()['XSRF-TOKEN'])})
        params = {
            "list": "options.mostActive.us",
            "fields": "symbol,symbolType,symbolName,hasOptions,lastPrice,priceChange,percentChange,optionsImpliedVolatilityRank1y,optionsTotalVolume,optionsPutVolumePercent,optionsCallVolumePercent,optionsPutCallVolumeRatio,tradeTime,symbolCode",
            "orderBy": "optionsTotalVolume",
            "orderDir": "desc",
            "between(lastPrice,.10,)": "",
            "between(tradeTime,2021-08-03,2021-08-04)": "",
            "meta": "field.shortName,field.type,field.description",
            "hasOptions": "true",
            "page": "1",
            "limit": "500",
            "raw": "1"
        }
        r = req.get(url, params=params).json()
        df = pd.DataFrame(r['data']).iloc[:, :-1]
        print(df)

main('https://www.barchart.com/proxies/core-api/v1/quotes/get?')
Output:
symbol symbolType ... tradeTime symbolCode
0 AMD 1 ... 08/03/21 STK
1 AAPL 1 ... 08/03/21 STK
2 TSLA 1 ... 08/03/21 STK
3 AMC 1 ... 08/03/21 STK
4 PFE 1 ... 08/03/21 STK
.. ... ... ... ... ...
495 BTU 1 ... 08/03/21 STK
496 EVER 1 ... 08/03/21 STK
497 VRTX 1 ... 08/03/21 STK
498 MCHP 1 ... 08/03/21 STK
499 PAA 1 ... 08/03/21 STK
[500 rows x 14 columns]
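Since the original question only needs the ticker symbols, they can be read straight off that frame inside main(), right after building df:

tickers = df['symbol'].tolist()
print(tickers[:10])  # first few of the 500 returned symbols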
I want to change the number of results on this page: https://fifatracker.net/players/ to more than 100, and then export the table to Excel, which would make things much easier for me. I tried to scrape it with Python following a tutorial, but I can't make it work. If there is a way to extract the table from all the pages, that would also help me.
As stated, the site is restricted to 100 results per request. Simply iterate the page number in the query payload sent to the API to get each page:
import pandas as pd
import requests
url = 'https://fifatracker.net/api/v1/players/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
page = 1
payload = {
    "pagination": {
        "per_page": "100", "page": page},
    "filters": {
        "attackingworkrate": [],
        "defensiveworkrate": [],
        "primarypositions": [],
        "otherpositions": [],
        "nationality": [],
        "order_by": "-overallrating"},
    "context": {
        "username": "guest",
        "slot": "1", "season": 1},
    "currency": "eur"}
jsonData = requests.post(url, headers=headers, json=payload).json()
current_page = jsonData['pagination']['current_page']
last_page = jsonData['pagination']['last_page']
dfs = []
for page in range(1, last_page + 1):
    if page == 1:
        pass  # page 1 was already fetched above
    else:
        payload['pagination']['page'] = page
        jsonData = requests.post(url, headers=headers, json=payload).json()
    players = pd.json_normalize(jsonData['result'])
    dfs.append(players)
    print('Page %s of %s' % (page, last_page))

df = pd.concat(dfs).reset_index(drop=True)
Output:
print(df)
slug ... info.contract.loanedto_clubname
0 lionel-messi ... NaN
1 cristiano-ronaldo ... NaN
2 robert-lewandowski ... NaN
3 neymar-jr ... NaN
4 kevin-de-bruyne ... NaN
... ... ...
19137 levi-kaye ... NaN
19138 phillip-cancar ... NaN
19139 julio-pérez ... NaN
19140 alan-mclaughlin ... NaN
19141 tatsuki-yoshitomi ... NaN
[19142 rows x 92 columns]
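And to produce the Excel file the question asks for, the combined frame can be written out with pandas (this assumes an Excel writer engine such as openpyxl is installed):

df.to_excel('players.xlsx', index=False)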
httr 1.4.1
R version 3.6.1 (also tried with 3.5.3)
Edit: added verbose() output below.
I've got a request as follows:
r <- GET("https://my.cool.domain",add_headers(.headers = c('x-api-key' = 'abcdefg', 'Accept' = "text/csv")), verbose())
On my machine it responds with:
-> GET / HTTP/1.1
-> Host: https://my.cool.domain
-> User-Agent: libcurl/7.54.0 r-curl/4.2 httr/1.4.1
-> Accept-Encoding: deflate, gzip
-> x-api-key: abcdefg
-> Accept: text/csv
->
<- HTTP/1.1 200 OK
<- Date: Tue, 26 Nov 2019 17:50:15 GMT
<- Content-Type: text/csv
<- Content-Length: 24902
<- Connection: keep-alive
<- x-amzn-RequestId: ...
<- Content-Encoding: deflate
<- x-amz-apigw-id: ...
<- X-Amzn-Trace-Id: ...
Response [https://my.cool.domain]
Date: 2019-11-26 17:20
Status: 200
Content-Type: text/csv
Size: 209 kB
cats,dogs...
yes,no...
yes,yes...
no,no...
However, on my colleague's machine (same versions of httr and R; we also tried an updated version of R) I get the following:
-> GET / HTTP/2
-> Host: https://my.cool.domain
-> User-Agent: libcurl/7.64.1 r-curl/4.2 httr/1.4.1
-> Accept-Encoding: deflate, gzip
-> x-api-key: abcdefg
-> Accept: text/csv
->
<- HTTP/2 200
<- date: Tue, 26 Nov 2019 17:46:17 GMT
<- content-type: application/json
<- content-length: 21501
<- x-amzn-requestid: ...
<- content-encoding: deflate
<- x-amz-apigw-id: ...
<- x-amzn-trace-id: ...
Response [https://my.cool.domain]
Date: 2019-11-26 17:30
Status: 200
Content-Type: application/json
Size: 377 kB
I'm working with the developer of the https://my.cool.domain domain, and I can confirm that the request header params (x-api-key and 'Accept' = "text/csv") are correct. The request works on my machine, and several others, but not on this one colleague's machine.
What's going wrong here and how can I debug this?
Thanks
This was fixed by running httr::set_config(httr::config(http_version = 1.1)) to force HTTP/1.1 instead of HTTP/2.
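If you'd rather not change the global configuration, the same curl option should also work per request; a sketch, assuming the usual httr pattern of passing config() into the verb:

library(httr)
# force HTTP/1.1 for this single request only
r <- GET("https://my.cool.domain",
         config(http_version = 1.1),
         add_headers(.headers = c('x-api-key' = 'abcdefg', 'Accept' = "text/csv")))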
I've been using httr to export data from REDCap databases into R for a few months now. We recently upgraded our RStudio Server to the most recent version (v0.98.1049) and upgraded to R 3.1.1 at the same time. After that upgrade, my httr::POST calls stopped working, but only sometimes. The error I keep getting is
Error in function (type, msg, asError = TRUE) :
GnuTLS recv error (-9): A TLS packet with unexpected length was received.
At first I thought it might be an SSL issue, but the error only occurs in certain databases, and in those databases I can still download the data using RCurl. That is, this code works:
RCurl::postForm(uri = [URL],
                .params = list(token = [TOKEN],
                               content = 'record',
                               format = 'csv'))
But this code does not:
httr::POST(url = [URL],
           body = list(token = [TOKEN],
                       content = 'record',
                       format = 'csv'))
Adding further to my confusion, even though I can't export data in projects where this error occurs, I can import data.
I'm out of ideas on where to start, and I'd much appreciate any thoughts on what might be going wrong here.
(I'd like to provide a reproducible example, but I'm afraid I'm working with healthcare data. Sorry.)
As requested, here's the verbose() output. It's a slightly different call, but produces the same error. (I used a call that wouldn't risk exposing confidential information)
> httr::POST(url=whi$url,
+ body=list(token=whi$token,
+ content='metadata',
+ format='csv'),
+ httr::verbose(data_in=TRUE, info=TRUE))
* Hostname was found in DNS cache
* Hostname in DNS cache was stale, zapped
* Trying 172.26.30.4...
* Connected to [URL] (172.26.30.4) port 443 (#7)
* found 153 certificates in /home/nutterb/R/x86_64-unknown-linux-gnu-library/3.1/httr/cacert.pem
* SSL re-using session ID
* server certificate verification OK
* common name: [URL] (matched)
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: RSA
* certificate version: #3
* subject: OU=Domain Control Validated,CN=[URL]
* start date: Thu, 10 Apr 2014 17:06:17 GMT
*
* expire date: Sat, 21 Mar 2015 16:35:07 GMT
*
* issuer: C=US,ST=Arizona,L=Scottsdale,O=Starfield Technologies\, Inc.,OU=http://certs.starfieldtech.com/repository/,CN=Starfield Secure Certificate Authority - G2
* compression: NULL
* cipher: AES-128-CBC
* MAC: SHA1
-> POST /redcap/api/ HTTP/1.1
-> User-Agent: curl/7.35.0 Rcurl/1.95.4.1 httr/0.5.0.9000
-> Host: [URL]
-> Accept-Encoding: gzip
-> accept: application/json, text/xml, */*
-> Content-Length: 374
-> Expect: 100-continue
-> Content-Type: multipart/form-data; boundary=------------------------05c968969cc362a9
->
<- HTTP/1.1 100 Continue
>> --------------------------05c968969cc362a9
>> Content-Disposition: form-data; name="token"
>>
>> [TOKEN]
>> --------------------------05c968969cc362a9
>> Content-Disposition: form-data; name="content"
>>
>> metadata
>> --------------------------05c968969cc362a9
>> Content-Disposition: form-data; name="format"
>>
>> csv
>> --------------------------05c968969cc362a9--
<- HTTP/1.1 200 OK
<- Date: Sat, 06 Sep 2014 09:55:43 GMT
<- Expires: 0
<- cache-control: no-store, no-cache, must-revalidate
<- Pragma: no-cache
<- Access-Control-Allow-Origin: *
<- Vary: Accept-Encoding
<- Content-Type: text/html; charset=utf-8
<- Connection: close
<- Content-Encoding: gzip
<-
<< NA
<< NA
<< NA
<< NA
<< NA
<< NA
<< NA
<< NA
<< NA
<< NA
<< NA
* GnuTLS recv error (-9): A TLS packet with unexpected length was received.
* Closing connection 7
Error in function (type, msg, asError = TRUE) :
GnuTLS recv error (-9): A TLS packet with unexpected length was received.
In addition: There were 11 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
2: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
3: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
4: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
5: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
6: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
7: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
8: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
9: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
10: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
11: In strsplit(x, "\n", fixed = TRUE) : input string 1 is invalid in this locale
>