And how many URLs are stored?
There are lots of system design posts, but I doubt the numbers they give. I found nothing precise from Google.
I want to prevent or hamper the parsing of the classifieds website that I'm improving.
The website uses an API with JSON responses. As a countermeasure, I want to interleave useless (decoy) records among the real data, since scrapers will probably parse records by ID, and give no clue about it in either the JSON response body or the headers, so they won't be able to distinguish the decoys without close inspection.
To keep real users from seeing it, I will only return a decoy record if it is requested explicitly by its ID. From an SEO perspective, I understand that Google won't crawl a page of decoy data if no internal or external link points to it.
How reliable would that technique be? And what problems/disadvantages/drawbacks do you think could occur in terms of user experience or SEO? Any ideas or suggestions will be very much appreciated.
P.S. I'm also rate-limiting large numbers of requests made in a short time, but it doesn't help much, which is why I'm considering this technique.
I think banning scrapers won't work any better, because they can change IPs and so on.
Maybe I could do better by requiring a login to access more than, say, 50 item details (and maybe that would also work against Selenium?). Registering raises the bar, and even if scrapers do register, I can identify those accounts and slow down their response times.
I have been web scraping for about 3 months now, and I have noticed that many of my spiders need constant babysitting because the websites keep changing. I use Scrapy, Python, and Crawlera to scrape my sites. For example, 2 weeks ago I created a spider and just had to rebuild it because the website changed its meta tags from singular to plural (so location became locations). Such a small change shouldn't really be able to break my spiders, so I would like to take a more defensive approach to my collections going forward. Does anyone have any advice for web scraping that allows for less babysitting? Thank you in advance!
Since you didn't post any code, I can only give general advice.
Look for a hidden API that retrieves the data you're after.
Load the page in Chrome, open the developer tools with F12, and look under the Network tab. Press Ctrl+F to search for the on-screen text you want to collect. If you find a request under the Network tab that contains the data as JSON, that is more reliable, since the backend of a web page changes less frequently than the frontend.
Be less specific with selectors. Instead of body > .content > #datatable > .row::text, you can use #datatable > .row::text. Your spider will then be less likely to break on small changes.
Handle errors with try/except so that a piece of inconsistent data doesn't abort the whole parse function.
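To make those last two tips concrete, here is a minimal sketch of the pattern, shown with R's rvest and tryCatch rather than scrapy; the url argument and the #datatable selector are placeholders for illustration, not anything from a real site.

library(rvest)

scrape_rows <- function(url) {
  tryCatch({
    page <- read_html(url)
    # Anchor on the stable id only, not the full body > .content > ... path,
    # so small layout changes are less likely to break the selector.
    page %>% html_nodes("#datatable .row") %>% html_text(trim = TRUE)
  },
  error = function(e) {
    # Log the failure and return nothing instead of killing the whole crawl.
    message("Failed on ", url, ": ", conditionMessage(e))
    character(0)
  })
}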
I want to scrape all the URLs from this page:
http://www.domainia.nl/QuarantaineList.aspx
I am able to scrape the first page; however, I cannot move to the next page, because the page number is not part of the URL. So how can I change pages while scraping? I've been looking into RSelenium, but could not get it working.
I'm running the following code to get at least the first page:
# Constructing the URL to scrape
library(rvest)
library(stringr)

baseURL <- "http://www.domainia.nl/quarantaine/"
date <- gsub("-", "/", Sys.Date())
URL <- paste0(baseURL, date)

# Scraping the page
page <- read_html(URL) %>% html_nodes("td") %>% html_text()
links <- str_subset(page, pattern = "^\r\n.*.nl$")
links <- gsub(pattern = "\r\n", "", links) %>% trimws()
I've looked at the site; it uses a JavaScript POST to refresh its contents.
Originally, an HTTP POST was meant to send information to a server, for example the contents of a form somebody filled in. As such, it often includes information about the page you are coming from, which means you will probably need to send more than just "page n".
If you want to get another page the way your browser would show it, you need to send a similar request. The httr package includes a POST function; I think you should take a look at that.
To know what to post, it's most useful to capture what your browser does and copy that. In Chrome, you can use Inspect and the Network tab to see what is sent and received; I bet other browsers have similar tools.
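For example, a rough sketch of replaying the pager's POST with httr; the form-field names and values below are placeholders you would copy from the request captured in the Network tab (ASP.NET pages like this typically send fields such as __VIEWSTATE and __EVENTTARGET).

library(httr)
library(rvest)

url  <- "http://www.domainia.nl/QuarantaineList.aspx"
resp <- POST(
  url,
  body = list(
    `__EVENTTARGET`   = "...",  # copied from the captured request
    `__EVENTARGUMENT` = "...",
    `__VIEWSTATE`     = "..."   # the long opaque string from the current page
  ),
  encode = "form"
)

# Parse the returned HTML the same way as before.
page <- content(resp, as = "text") %>% read_html() %>%
  html_nodes("td") %>% html_text()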
However, it looks like that website makes its money by showing that information, and if some other source showed the same things, they'd lose money. So I doubt it's that easy to emulate; I suspect some part of the request differs every time yet needs to be exactly right. For example, they could build in checks to see whether the entire page was rendered, instead of discarded as you do. So I wouldn't be surprised if they intentionally make it very hard to do what you are trying to do.
Which brings me to an entirely different solution: ask them!
When I tried scraping a website with dynamically generated content for the first time, I was struggling as well. Until I explored the website some more, and saw that they had a link where you could download the entire thing, tidied up, in a nice csv-format.
And for the people running a web server, scraping is often inconvenient: it demands resources from the server, a lot more than someone downloading a file.
It's quite possible they'll tell you "no", but if they really don't want you to get their data, I bet they've made it difficult to scrape. Maybe you'll just get banned if you make too many requests from the same IP, maybe some other method.
And it's also entirely possible that they don't want their data in the hands of a competitor, but that they'll give it to you if you only use it for a particular purpose.
(too big for a comment, but not an answer, per se)
Emil is spot on, except that this is an ASP.NET/SharePoint-esque site with binary "view states" and other really daft web practices that will make it nigh impossible to scrape with just httr.
When you do use the Network tab (again, as Emil astutely suggests), you can also use curlconverter to automatically build httr VERB functions out of requests "Copied as cURL".
For this site (assuming it's legal to scrape: it has no robots.txt, and I am not fluent in Dutch and did not see an obvious "terms and conditions"-like link), you can use something like splashr or Selenium to navigate, click, and scrape, since they act like a real browser.
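A minimal sketch of that browser route with RSelenium (splashr would be similar); the #next-page selector is a placeholder, so inspect the real pager element on the page before using it.

library(RSelenium)
library(rvest)

# Start a browser and the Selenium server (the port is arbitrary).
rD    <- rsDriver(browser = "firefox", port = 4567L, verbose = FALSE)
remDr <- rD$client

remDr$navigate("http://www.domainia.nl/QuarantaineList.aspx")

# Click whatever element triggers the JavaScript POST for the next page.
nxt <- remDr$findElement(using = "css selector", "#next-page")
nxt$clickElement()
Sys.sleep(2)  # give the page time to refresh

# Parse the refreshed DOM with rvest as before.
page <- remDr$getPageSource()[[1]] %>% read_html() %>%
  html_nodes("td") %>% html_text()

remDr$close()
rD$server$stop()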
I am trying to debug some problems with very picky/complex web services where clients that are theoretically making the same requests are getting different results. A debugging proxy like Charles helps a lot, but since the requests are complex (lots of headers, cookies, query strings, form data, etc.) and the clients create the headers in different orders (which should be perfectly acceptable), it's an extremely tedious process to do manually.
I'm pondering writing something to do this myself but I was hoping someone else had already solved this problem?
As an aside, does anyone know of any Charles-like debugging proxies that are completely open source? If Charles were open source, I would definitely contribute any work I did on this front back to the project. If there is something similar out there, I would much rather build on it than write a separate program from scratch (especially since I imagine Charles or any analog already has all of the data structures I might need).
Edit:
Just to be clear: text diffing will not work, because the order of lines (headers, at least) may differ and the order of values within a line (cookies, at least) can differ. In both cases, as long as the names, values, and metadata are all the same, the different ordering should not cause requests that are otherwise identical to be considered different.
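Something like this normalize-then-diff approach is what I'm picturing, sketched roughly in R; the file names are placeholders, and it assumes each captured request has been saved as plain "Name: value" header lines. It sorts the lines, explodes multi-valued headers such as Cookie into one entry each, and compares the two requests as sets so ordering stops mattering.

norm_request <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # Split "Cookie: a=1; b=2" into one "Cookie: ..." entry per cookie.
  cookies <- grep("^Cookie:", lines, value = TRUE)
  others  <- grep("^Cookie:", lines, value = TRUE, invert = TRUE)
  pairs   <- unlist(strsplit(sub("^Cookie:\\s*", "", cookies), ";\\s*"))
  sort(c(others, paste0("Cookie: ", pairs)))
}

a <- norm_request("request_a.txt")
b <- norm_request("request_b.txt")
setdiff(a, b)  # present in A but not in B
setdiff(b, a)  # present in B but not in A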
Fiddler has such an option, if you have WinDiff in your path. I don't know whether it will suit your needs, because at first glance it's just doing text comparisons. But perhaps it normalizes the sessions before that, so I can't say.
If there's nothing purpose-built for the job, you can use packet capture to get the message content saved to a text file (something that inserts itself into the IP stack, like CommView). Then you can text-diff the results for different messages.
Can the open-source proxy Squid maybe help?