I am trying to scrape data from an API using the getURL function of the RCurl package in R. My problem is that I can't replicate in R the response I get when I open the URL in Chrome. Essentially, when I open the API page (URL below) in Chrome it works fine, but if I request it using getURL in R (or open it in Chrome's incognito mode) I get a '500 Internal Server Error' response instead of the pretty JSON I'm looking for.
URL/API in question:
http://www.bluenile.com/api/public/loose-diamond/diamond-details/panel?country=USA&currency=USD&language=en-us&productSet=BN&sku=LD04077082
Here is my (failed) request in R:
library(RCurl)
library(RJSONIO)  # for fromJSON()

test2 <- fromJSON(getURL("http://www.bluenile.com/api/public/loose-diamond/diamond-details/panel?country=USA&currency=USD&language=en-us&productSet=BN&sku=LD04077082",
                         ssl.verifypeer = FALSE,
                         useragent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36"))
My Research so Far
First I looked at this prior question on Stack Overflow and added a user agent to my request (this did not solve the problem, but it may still be necessary):
ViralHeat API issues with getURL() command in RCurl package
Next I looked at this helpful post which guides my rationale:
R Disparity between browser and GET / getURL
My Ideas About the Solution
This is not my area of expertise, but my guess is that the request is lacking a cookie needed to complete it (which would also explain why it fails in incognito mode). I compared the requests and responses of the successful and unsuccessful attempts:
(The screenshots comparing the successful and unsuccessful requests are not reproduced here.)
Anyone have any ideas? Should I try the RSelenium package that MrFlick suggested in the second post I made?
This is a courteous site. It would like to know where you come from, what currency you use, and so on, so it can give you a better user experience. It does this by setting a multitude of cookies on the landing page. So we follow suit: we navigate to the landing page first to pick up the cookies, then go to the page we want:
library(RCurl)
myURL <- "http://www.bluenile.com/api/public/loose-diamond/diamond-details/panel?country=USA&currency=USD&language=en-us&productSet=BN&sku=LD04077082"
agent="Mozilla/5.0 (Windows NT 6.3; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0"
#Set RCurl pars
curl = getCurlHandle()
curlSetOpt(cookiejar="cookies.txt", useragent = agent, followlocation = TRUE, curl=curl)
firstPage <- getURL("http://www.bluenile.com", curl=curl)
myPage <- getURL(myURL, curl = curl)
library(RJSONIO)
> names(fromJSON(myPage))
[1] "diamondDetailsHeader" "diamondDetailsBodies" "pageMetadata" "expandedUrl"
[5] "newVersion" "multiDiamond"
and the cookies:
> getCurlInfo(curl)$cookielist
[1] ".bluenile.com\tTRUE\t/\tFALSE\t2412270275\tGUID\tDA5C11F5_E468_46B5_B4E8_D551D4D6EA4D"
[2] ".bluenile.com\tTRUE\t/\tFALSE\t1475342275\tsplit\tver~3&presetFilters~TEST"
[3] ".bluenile.com\tTRUE\t/\tFALSE\t1727630275\tsitetrack\tver~2&jse~0"
[4] ".bluenile.com\tTRUE\t/\tFALSE\t1425230275\tpop\tver~2&china~false&french~false&ie~false&internationalSelect~false&iphoneApp~false&survey~false&uae~false"
[5] ".bluenile.com\tTRUE\t/\tFALSE\t1475342275\tdsearch\tver~6&newUser~true"
[6] ".bluenile.com\tTRUE\t/\tFALSE\t1443806275\tlocale\tver~1&country~IRL¤cy~EUR&language~en-gb&productSet~BNUK"
[7] ".bluenile.com\tTRUE\t/\tFALSE\t0\tbnses\tver~1&ace~false&isbml~false&fbcs~false&ss~0&mbpop~false&sswpu~false&deo~false"
[8] ".bluenile.com\tTRUE\t/\tFALSE\t1727630275\tbnper\tver~5&NIB~0&DM~-&GUID~DA5C11F5_E468_46B5_B4E8_D551D4D6EA4D&SESS-CT~1&STC~32RPVK&FB_MINI~false&SUB~false"
[9] "#HttpOnly_www.bluenile.com\tFALSE\t/\tFALSE\t0\tJSESSIONID\tB8475C3AEC08205E5AC6252C94E4B858"
[10] ".bluenile.com\tTRUE\t/\tFALSE\t1727630278\tmigrationstatus\tver~1&redirected~false"
Related
I have seen that there are many approaches to scraping data from Google Scholar out there. I tried to put together my own code; however, I cannot get access to Google Scholar using free proxies. I am interested in understanding why that is (and, secondarily, what to change). Below is my code. I know it is not the most elegant; it's my first try at data scraping...
This is a list of proxies I got from https://free-proxy-list.net/, and I did test that they worked by accessing http://icanhazip.com through them.
live_proxies = ['193.122.71.184:3128', '185.76.10.133:8081', '169.57.1.85:8123', '165.154.235.76:80', '165.154.235.156:80']
Then I made the URLs I want to scrape and tried to get the content of the pages with one random proxy:
import random
import requests

search_terms = ['Acanthizidae', 'mammalia']

for i in range(len(search_terms)):
    url = 'https://scholar.google.de/scholar?hl=en&as_sdt=0%2C5&q={}&btnG='.format(search_terms[i])
    session = requests.Session()
    proxy = random.choice(live_proxies)
    session.proxies = {"http": proxy, "https": proxy}
    ua = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
    output = session.get(url, headers=ua, timeout=1.5).text
However, I get:
requests.exceptions.ProxyError: HTTPSConnectionPool(host='scholar.google.de', port=443): Max retries exceeded with url: /scholar?hl=en&as_sdt=0%2C5&q=Acanthizidae&btnG= (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 400 Bad Request')))
Like I said, I did test the proxies before with a different site. What is the problem?
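One thing worth checking, given that the traceback complains about a failed CONNECT tunnel: http://icanhazip.com is a plain-HTTP check, while scholar.google.de requires the proxy to tunnel HTTPS. A small sketch (my own suggestion, not from the original post) that re-tests each proxy against an HTTPS URL would separate proxies that only speak HTTP from ones that can actually tunnel:

import requests

# live_proxies as defined above
for proxy in live_proxies:
    proxies = {"http": "http://" + proxy, "https": "http://" + proxy}
    try:
        r = requests.get("https://icanhazip.com", proxies=proxies, timeout=5)
        print(proxy, "tunnels HTTPS, exit IP:", r.text.strip())
    except requests.exceptions.RequestException as exc:
        print(proxy, "failed over HTTPS:", exc)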
I think I'm following the instructions in the documentation exactly (https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html) but I can't get the add_headers functionality to work. A simple example is:
library(httr)
res <- GET('http://www.google.com', httr::add_headers(Referer= 'https://www.google.com/'), user_agent('Mozilla/5.0 (X11; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0'))
str(content(res)$headers)
The last line is supposed to print the headers of the request, but I am getting NULL.
It's because google.com returns HTML, and content() by default parses it with xml2 into an xml_document, which you can't index with $headers. headers is a field returned by httpbin.org in its JSON body, but not by google.com. The response headers from Google (as from most sites) are available via res$headers.
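To see the difference concretely: httpbin.org/headers echoes the request headers back inside the JSON body (that is the headers field the vignette indexes), while google.com returns an HTML page with no such field. A hedged illustration in Python, since httpbin is language-agnostic and the same point applies to httr:

import requests

ua = {"User-Agent": "Mozilla/5.0"}

# httpbin echoes the request headers back inside the JSON *body*
echoed = requests.get("https://httpbin.org/headers", headers=ua).json()
print(echoed["headers"]["User-Agent"])

# google.com returns HTML, so there is no "headers" field in the body;
# the *response* headers are still available on the response object itself
resp = requests.get("https://www.google.com", headers=ua)
print(resp.headers["Content-Type"])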
I'm using PhantomJS and Selenium to convert YouTube videos to MP3s using anything2mp3.com, and then attempting to download the files.
I'm trying to use urllib in Python 3 to download a .mp3 file. However, when I try:
import urllib.request

url = 'example.com'
fileName = 'testFile.mp3'
urllib.request.urlretrieve(url, fileName)
I get the error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
From hours of searching, I have found that this is likely due to the website not liking the user agent used to access it. I've tried to alter the user agent but haven't had any luck, since I can't simply supply a header to urlretrieve.
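For what it's worth, a header can still be supplied with plain urllib by building a Request object and calling urlopen instead of urlretrieve — a minimal sketch with a placeholder URL, separate from the requests-based answer below:

import urllib.request

url = 'https://example.com/file.mp3'  # placeholder URL
file_name = 'testFile.mp3'

req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(req) as resp, open(file_name, 'wb') as out:
    out.write(resp.read())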
Use the requests library:
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

SERVICE_URL = 'http://anything2mp3.com/'
YOUTUBE_URL = 'https://youtu.be/AqCWi_-vnTg'
FILE_NAME = 'song.mp3'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'

# Get the mp3 link using Selenium
browser = webdriver.PhantomJS()
browser.get(SERVICE_URL)
search = browser.find_element_by_css_selector('#edit-url')
search.send_keys(YOUTUBE_URL)
submit = browser.find_element_by_css_selector('#edit-submit--2')
submit.click()
a = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#block-system-main > a')))
download_link = a.get_attribute('href')

# Download the file using requests
# http://docs.python-requests.org/en/latest/
r = requests.get(download_link, stream=True, headers={'User-Agent': USER_AGENT})
with open(FILE_NAME, 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)
This is a follow-up question to RCurl getURL with loop - link to a PDF kills looping:
I have the following getURL command:
require(RCurl)
#set a bunch of options for curl
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
agent="Firefox/23.0"
curl = getCurlHandle()
curlSetOpt(
  cookiejar = "cookies.txt",
  useragent = agent,
  followlocation = TRUE,
  autoreferer = TRUE,
  httpauth = 1L,  # "basic" http authorization version -- this seems to make a difference for India servers
  curl = curl
)
x = getURLContent('http://timesofindia.indiatimes.com//articleshow/2933019.cms')
class(x)
#[1] "character"
attr(x, "Content-Type")
#"text/plain"
In a browser, the link above ends up redirecting to:
x = getURLContent('http://timesofindia.indiatimes.com/photo.cms?msid=2933009')
class(x)
#[1] "raw"
attr(x, "Content-Type")
#"application/pdf"
Assuming I know only the first link, how can I detect that the final location of the redirect (or redirects) is of a certain type (in this case PDF)?
Thanks!!
Maybe there's a better solution, but one way could be this:
# ...
h <- basicTextGatherer()
x = getBinaryURL('http://timesofindia.indiatimes.com//articleshow/2933019.cms',
                 headerfunction = h$update, curl = curl)
r <- gregexpr("Content-Type:.*?\n", h$value())
tail(regmatches(h$value(), r)[[1]], 1)
# [1] "Content-Type: application/pdf\r\n"
I have run into a similar issue trying to run getURLContent with digest authentication to get at binary data (with a non-standard MIME type). I am running RCurl v1.95-4.1 on R 2.15.3.
If I run getURLContent without the binary=TRUE flag, it won't auto-switch to binary mode because of the non-standard MIME header for this data type, so it attempts rawToChar() and throws an 'embedded NULL in string' error. However, the authentication does work.
If I add a binary=TRUE flag to the getURLContent call, it seems to cause issues with the authentication step, since I then get an 'Error: Unauthorized' response.
What finally worked was to replace getURLContent() with getBinaryURL() [as in the example above], which allowed the userpwd="u:p" authorization to work and delivered the binary data to my assigned object.
I think that the author of RCurl has made improvements to getURLContent's handling of binary data for v1.97, based on what I see in GitHub, so this may become a thing of the past...except for those of us still running older R setups.
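For anyone reproducing this outside R: fetching binary data behind digest authentication is straightforward with Python's requests, which sidesteps the rawToChar() issue entirely because the payload is kept as bytes. The endpoint and credentials below are placeholders, not from the original post:

import requests
from requests.auth import HTTPDigestAuth

url = "https://example.com/data.bin"       # placeholder endpoint
auth = HTTPDigestAuth("user", "password")  # placeholder credentials

r = requests.get(url, auth=auth, timeout=60)
r.raise_for_status()

with open("data.bin", "wb") as f:
    f.write(r.content)  # r.content is raw bytes, so the binary payload survives intact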
I'm trying to create a fail2ban filter that bans a host when it sends more than 100 POST requests within a 30-second interval.
jail.local:
[nginx-postflood]
enabled = false
filter = nginx-postflood
action = myaction
logpath = /var/log/nginx/access.log
findtime = 30
bantime = 100
maxretry = 100
nginx-postflood.conf:
[Definition]
failregex = ^<HOST>.*"POST.*
ignoreregex =
Using grep I was able to test the regular expression, and it does indeed match the host and the POST requests.
The problem is that it bans any host that performs even a single POST request, which likely means it is not taking the findtime or maxretry options into consideration. In my opinion it's a timestamp issue.
Sample line of nginx log:
5.5.5.5 - user [05/Aug/2014:00:00:09 +0200] "POST /auth HTTP/1.1" 200 6714 "http://referer.com" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0"
Any help?
I guess it may be too late for an answer, but anyway...
The excerpt you have posted has the jail disabled:
enabled = false
You do not mention your Fail2Ban version, and the syslog/fail2ban logs for this jail are missing.
I tested your filter on fail2ban 0.9.3-1 and it works fine, although I had to enable it and drop the action = myaction line, since you have not said what you expect that action to do.
Therefore the filter should work fine, provided it's enabled and the action is correct as well.
What is happening in the provided example is that your jail is disabled, and fail2ban is using another filter which checks the same log file and matches your regex but has more restrictive rules, i.e. it bans after one request.