2sxc: 404 error when adding content module

When I add a content module to a page, I get a 404 error.
In the logbook I see the following entry:
TabId:
PortalAlias:staging.2-le-marche.com/nl-nl
OriginalUrl:/nl-nl/desktopmodules/2sxc/api/view/Module/GetSelectableTemplates
Referer:http://staging.2-le-marche.com/desktopmodules/tosic_sexycontent/dist/dnn/ui.html?sxcver=8.5.1.26679
Url:http://staging.2-le-marche.com/nl-nl/desktopmodules/2sxc/api/view/Module/GetSelectableTemplates
UserAgent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36

My guess is a misconfiguration of DNN and portal domains. That kind of setup can work for quite a while (pages do load), but the JS framework of DNN will then report a (slightly) different URL than is actually in use, and end up pointing to a URL that doesn't work.

Related

How to get around 403 Forbidden error with R download.file

I am trying to download multiple files from a website. I'm scraping the website to come up with the individual URLs, but when I put the URLs into download.file, I'm getting a 403 Forbidden error.
It looks like there's a user validation step on the website, but adding a header doesn't help.
Any help getting around this is appreciated. Here's what I'm trying with one sample URL and file:
headers = c(
  `user-agent` = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36'
)
download.file("https://gibsons.civicweb.net/filepro/document/125764/Regular%20Council%20-%2006%20Dec%202022%20-%20Minutes%20-%20Pdf.pdf",
              "file",
              mode = "wb",
              headers = headers)
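One way to narrow this down is to reproduce the request outside of download.file and inspect the response directly. Below is a minimal Python sketch with requests, offered only as a debugging aid; it uses the same URL and user agent as above, and the output filename is an arbitrary choice:

import requests

# Reproduce the download to inspect exactly what the server returns;
# same URL and browser-style user agent as the R attempt above.
url = "https://gibsons.civicweb.net/filepro/document/125764/Regular%20Council%20-%2006%20Dec%202022%20-%20Minutes%20-%20Pdf.pdf"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36"}

resp = requests.get(url, headers=headers)
print(resp.status_code)  # 403 here would mean the header alone is not enough
if resp.ok:
    with open("minutes.pdf", "wb") as f:  # filename is illustrative
        f.write(resp.content)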

Keep receiving 503 error while scraping with rvest

I am trying to scrape fanfiction.net with rvest and keep getting a 503 server error.
The robots.txt file allows scraping the site with a delay of 5 seconds, and the Terms of Service only forbid it for commercial use, whereas I intend to use it for research purposes (Digital Literary Studies here).
The following chunk of code already results in an error:
library(httr)
library(rvest)
url <- "https://www.fanfiction.net"
userAgent <- "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
read_html(url, user_agent(userAgent))
Most of the advice regarding that error recommends incorporating a delay between scraping requests, or providing a user agent.
Since I get the error from the first URL, incorporating a delay doesn't seem to solve the problem.
I provided different user agents:
"Mozilla/5.0"
agents of the "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0" sort, but the server seems to react to the brackets, and the long version gives me a 403 error instead.
Finally, I provided my institution, name and email address as a user agent, and once (!!) received a 200 status code.
What would be the next step?
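For reference, the "delay plus identifying user agent" advice looks roughly like the sketch below. It is a generic Python illustration of the pattern, not an rvest-specific answer, and a 503 on the very first request often indicates bot protection that headers and delays alone may not satisfy; the contact string and delay are example values:

import time
import requests

# Polite scraping: identify yourself and respect the crawl delay
# stated in robots.txt. All names and values here are examples.
HEADERS = {"User-Agent": "research-crawler (Digital Literary Studies; contact: you@example.edu)"}
CRAWL_DELAY = 5  # seconds, per the site's robots.txt

def fetch(url):
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()  # surface 403/503 instead of parsing an error page
    time.sleep(CRAWL_DELAY)  # pause before the caller's next request
    return resp.text

html = fetch("https://www.fanfiction.net")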

How to launch a web site in headless Chrome when the site blocks headless mode?

Below is the XPath I am using to scrape the price from the Myntra site. I can scrape all other sites except Myntra, and the same XPath works on my local Windows system with Selenium, Python 3 and ChromeDriver.
Driver path : driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", options=chrome_options);
variable_name = driver.find_element_by_xpath('//*[@id="mountRoot"]/div/div/div/main/div[2]/div[2]/div[1]/p[1]/span/strong').text
link for reference: https://www.myntra.com/beauty-gift-set/kama-ayurveda/kama-ayurveda-round-the-clock-skincare-gift-set/12800176/buy
When hosted on an EC2 Ubuntu machine, I get the error below:
Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="mountRoot"]/div/div/div/main/div[2]/div[2]/div[1]/p[1]/span/strong"}
I tried changing the XPath, e.g. driver.find_element_by_xpath('//*[@class="pdp-price"]//*').text, but no luck.
Use the XPath below:
driver.find_element_by_xpath('//span[#class="pdp-price"]//strong').text
Or use the CSS selector below:
driver.find_element_by_css_selector('.pdp-price strong').text
The above works only when the site runs in GUI mode; in headless mode it displays "Access Denied" (screenshot attached below for your reference), since the application blocks headless mode.
Add the user-agent argument below to your Chrome driver options, then load the web driver:
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument("--headless")
chrome_options.add_argument('user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')
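Putting those flags together, a complete script might look like the sketch below. The driver path, version string, and URL are the ones from this thread, the locator is the CSS selector suggested above, and everything should be adjusted for your own system:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Headless Chrome with a desktop user agent, assembled from the
# snippets above (Selenium 3 style API, as used in this thread).
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--headless')
chrome_options.add_argument('user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')

driver = webdriver.Chrome('/usr/lib/chromium-browser/chromedriver', options=chrome_options)
driver.get('https://www.myntra.com/beauty-gift-set/kama-ayurveda/kama-ayurveda-round-the-clock-skincare-gift-set/12800176/buy')
price = driver.find_element_by_css_selector('.pdp-price strong').text
print(price)
driver.quit()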
Just add this user-agent option when launching headless Chrome:
--user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
chromeLauncher.launch({
  chromeFlags: ['--headless', '--disable-gpu', '--user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"'],
  chromePath: '/usr/bin/google-chrome'
})

Instagram blocks me for the requests with 429

I have made a lot of requests to https://www.instagram.com/{username}/?__a=1 to check whether a username exists, and now I am getting 429.
Before, I just had to wait a few minutes to make the 429 disappear. Now it is persistent! :( I'm trying once a day; it doesn't work anymore.
Do you know anything about Instagram's request limits?
Do you have any workaround, please? Thanks
Code ...
import requests
r = requests.get('https://www.instagram.com/test123/?__a=1')
res = str(r.status_code)
Try adding the user-agent header; otherwise, the website thinks you're a bot and will block you.
import requests
URL = "https://www.instagram.com/bla/?__a=1"
HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}
response = requests.get(URL, headers=HEADERS)
print(response.status_code) # <- Output: 200
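Beyond the header, a persistent 429 usually means the rate limit itself has to be honoured rather than retried immediately. The sketch below shows generic 429 handling with backoff; the delays and attempt count are arbitrary, it assumes a numeric Retry-After value, and none of it is specific to Instagram, which may keep throttling the ?__a=1 endpoint regardless:

import time
import requests

HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}

def get_with_backoff(url, attempts=4):
    # Retry on 429, preferring the server's Retry-After hint when present.
    delay = 60  # initial fallback wait in seconds; arbitrary starting point
    for _ in range(attempts):
        response = requests.get(url, headers=HEADERS)
        if response.status_code != 429:
            return response
        wait = int(response.headers.get("Retry-After", delay))  # assumes seconds
        time.sleep(wait)
        delay *= 2  # double the fallback wait between attempts
    return response  # still 429 after all attempts

r = get_with_backoff("https://www.instagram.com/test123/?__a=1")
print(r.status_code)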

Change user-agent for Symfony Panther Chromeclient

How do I change the user-agent in a headless Chrome created by Symfony's Panther createChromeClient()?
When I create a Chrome client with
$client = \Symfony\Component\Panther\Client::createChromeClient();
I see in the access_log a user-agent of
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/77.0.3865.90 Safari/537.36"
I searched for solutions and think I have to change the user-agent string via Chrome's arguments, but I can't find the right way, because the answers on the web aren't for PHP or Panther.
Cheers!
I found it:
$client = \Symfony\Component\Panther\Client::createChromeClient(null, [
    '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
    '--window-size=1200,1100',
    '--headless',
    '--disable-gpu',
]);
This question gave me the idea.
