I wanted to scrape this webpage:
http://protected.to/f-42cbf8ce2521d615
But I have to click on "continue to folder" to get to those links. I cannot see these links in the HTML source, only when I physically use a mouse to click on the "continue to folder" button.
How can I avoid that physical click to get to those links in the website?
I am new to web scraping so please help me solve this issue.
Thanks for your attention and time.
Ozooha
import requests
from bs4 import BeautifulSoup
s = requests.Session()
url='http://protected.to/f-c9036f7a236b1511'
r = s.get(url)
soup = BeautifulSoup(r.text, features="html.parser")
params = {i['name']:i.get('value') for i in soup.find('div', {'class':'col-md-12 text-center'}).find_all('input')}
headers = {
    "Host": "protected.to",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Cookie": r.headers['Set-Cookie'],
    "Upgrade-Insecure-Requests": "1",
    "Sec-GPC": "1",
    "DNT": "1",
}
print(params)
# Send the form fields in the POST body (data=), not in the query string (params=)
r_ = s.post(url, headers=headers, cookies=r.cookies, data=params)
print(r_.status_code)
You can use a full browser-automation library such as Selenium, which is written to behave like a user. But I would simply call .click() on the button and then parse the HTML.
const button = document.querySelector('[value="Continue to folder"]');
button.click();
// Parse the HTML
"Continue to Folder" is a submit button for the form which POSTs the "__RequestVerificationToken" value and the slug token to the page to display the contents of the folder.
So, in theory, you have to parse the HTML at http://protected.to/f-42cbf8ce2521d615 to extract the value of the hidden field named "__RequestVerificationToken", which is the input holding the token value. To obtain the slug token, look between the tags; you will see that the page dynamically creates a slug token when it loads.
Once you have those values, make a POST to the same URL, http://protected.to/f-42cbf8ce2521d615, with the token and slug. The body will look something like this: __RequestVerificationToken=8BYeNPftVEEivO2imhtWIuWAb0mjhPg-5pAhq1mlpL_pTyYR1AyScbfqB8QZDudwGY_1LkV79FCDgpyffRPuktApd2ZQYBdi2ySA5ATUZ601&Slug=42cbf8ce2521d615
That POST returns the page with the folder contents. You can replicate this by opening Dev Tools and inspecting what happens when you hit "Continue to Folder": you will see the POST request and its body, along with the page elements that hold the values needed for the POST (the verification token and the slug token).
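The steps above could be sketched roughly as follows, using only the Python standard library. This is a minimal sketch, not a tested scraper for this site: it assumes that both "__RequestVerificationToken" and "Slug" appear as hidden input fields in the page, and the names HiddenInputParser, hidden_fields and fetch_folder are made up for illustration.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs from every <input type="hidden"> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def hidden_fields(html):
    """Return a dict of the hidden form fields found in the HTML text."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


def fetch_folder(url):
    """GET the page, then POST its hidden fields back to the same URL."""
    # A shared cookie jar carries the session cookie from the GET to the POST.
    opener = build_opener(HTTPCookieProcessor(CookieJar()))
    page = opener.open(url).read().decode("utf-8", errors="replace")
    body = urlencode(hidden_fields(page)).encode("ascii")
    # Supplying data= turns the request into a form-encoded POST.
    request = Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    return opener.open(request).read().decode("utf-8", errors="replace")
```

Calling fetch_folder("http://protected.to/f-42cbf8ce2521d615") should then return the HTML of the folder page, which can be parsed for the links. If the slug turns out not to be a hidden input, it would have to be extracted separately and added to the dict before encoding.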
Here are my steps:
Click a button
A report is generated and then downloaded.
I am using Cypress. After clicking the button, the file opens in a new browser tab, and I can't validate the content. Is there any way to force download the file instead of displaying the content in a new browser tab in Cypress?
Removing the target attribute will prevent the browser from opening a separate new tab. Maybe this will let the download continue.
cy
.get('a')
.invoke('removeAttr', 'target')
Also doing a cypress request may help.
// Read the link's href, then request it directly instead of clicking
cy.get('[data-cy=orange-vcard]')
  .invoke('attr', 'href')
  .then((href) => {
    cy.request({
      url: href,
      encoding: 'base64'
    }).then((response) => {
      expect(response.status).to.equal(200);
      // check for the response body here
    });
  });
Ref: https://www.leonelngande.com/testing-file-download-links-with-cypress/
I have a Link where I want to pass certain params in the URL but I don't want the browser to display the params.
I'm using Link's as for this:
<Link href={`/link?foo=bar`} as ={`/link`}>
<a>Link</a>
</Link>
But when I click this link and I try to access the params via router, I can't access foo=bar:
const router = useRouter()
console.log(router.query)
Returns
{
slug: ["link"],
}
And not
{
slug: ["link"],
foo: "bar",
}
So how can I access the URL params in href when using as for Link?
TL;DR You can't use as like that.
This is an incorrect usage of href and as. It would be cool if we could hide state from the end users to keep our URLs nice, clean, and compact, but obviously if you do that, you'll actually lose the state when copy/pasting the URL. That's why you can't hide query parameters in any way (except by excluding them).
Here's the docs on href and as (dynamic routes, has little to do with hiding query params):
https://nextjs.org/docs/tag/v9.5.2/api-reference/next/link#dynamic-routes
And to further bring up my point, imagine if we could hide state, and we redirect to this URL:
https://example.com/stateful/
Presumably there would be some behind-the-scenes browser action that persists the state.
Now we copy/paste the URL:
https://example.com/stateful/
Oops! We don't have the state anymore because the browser has no previous state to keep track of! That's why you use query parameters, because they keep the state in the URL itself.
I'd like to be able to redirect to a given URL if it fails to load with Fancybox 3. The users supply the URLs, so it could be a YouTube video, an image, or some other arbitrary link. In the case of something like a Google Doc, Google prevents you from loading those inside iframes, so I'd like to catch that error, stop the Fancybox viewer from loading at all, and instead redirect to that URL directly in the browser. I can kind of get it working, but I can't seem to stop the Fancybox dialog from showing before the redirect happens:
$.fancybox.defaults.afterLoad = (instance, current) ->
  if current.hasError
    window.location = current.src
    instance.close()
I've tried returning false.
This is the best I have come up with so far:
$.fancybox.defaults.defaultType = 'iframe'
$.fancybox.defaults.beforeLoad = (instance, current) ->
  $.ajax
    url: current.src
    method: 'HEAD'
  .catch ->
    window.location = current.src
    instance.close()
The default type for a URL is ajax if it fails all the other tests (like image and video), so we need to switch this to iframe first. Then I do an ajax HEAD call to see if the result is successful; if not, we can just redirect to the src. instance.close() is the best way I could find to stop Fancybox from loading (it could already be loaded if this is a slideshow/gallery anyway). There is a brief flash before the page redirects to the URL.
As @misorude mentions, there isn't a way to detect whether the iframe failed to load for cross-site requests. In the end I decided to do away with previewing off-site links completely and do a redirect like so:
$.fancybox.defaults.afterLoad = (instance, slide) ->
  if !slide.redirecting && slide.contentType == 'html'
    slide.redirecting = true
    message = """<div class="fancybox-error"><p>Redirecting...</p></div>"""
    instance.setContent slide, message
    window.location = slide.src
This displays a nice redirecting message and then sends the user on to that link via the browser. contentType is only html when it's not image, video, map etc... from the other media type plugins. This means fancybox can still show youtube links without trouble even though these are iframe and html based.
How to override "Cache-Control" values in an HTTP response
I have a web page that returns the following header when I access material:
Cache-Control:no-cache, no-store
Using a Firefox extension (something like Force CORS, which I can't get working), I want to modify this response header so that the material is actually cached instead of wasting bandwidth.
I have recently spent some hours on trying to get a file cached, and discovered that the chrome.webRequest and chrome.declarativeWebRequest APIs cannot force resources to be cached. In no way.
The Cache-Control (and other) response headers can be changed, but it will only be visible in the getResponseHeader method. Not in the caching behaviour.
From some reddit thread:
Install the "Modify Response Headers" addon for FF:
https://addons.mozilla.org/en-US/firefox/addon/modify-response-headers/
In the addon's options, go to headers. Select Action -> Filter. In the header name field, type cache-control, then click Add. Do the same again but with the header name pragma. Then click the Start button (big button on the top-left).
Set these values in about:config:
browser.cache.disk.enable = false
browser.cache.memory.capacity = 200000 (you will probably need to create this field: right click empty space -> New -> Integer)
Restart Firefox.