Fill out search on website and screen-scrape result in R

This is my first post, so if my question is too vague or unclear, please tell me.
I'm trying to scrape a website with news articles for a research project. But the link to the modified search on that webpage won't work, because the intranet authentication spits out an error.
So my idea was to fill out the search form and use the resulting link to scrape the website.
Since my boss likes to work with R, he would like me to write an R script to do so, but I have no idea how and haven't found anything that works.

You need two packages: RCurl and XML.
The RCurl package is used for internet browsing. It can submit HTML forms with GET or POST arguments, so with it you can log in or fill out any form.
The output from the server will be HTML. If you want to grab the links, you can use the XML package. It helps you extract data from HTML/XML documents.
But before you start, you have to find out where the search form is on the webpage (and what arguments it expects). The Firefox browser can be useful here. You need two add-ons: Live HTTP Headers and Firebug. With those add-ons you can inspect the webpage much more easily.
I know this does not fully solve your problem, but I cannot say much more, since it depends on the particular situation and webpage structure. I believe the tools I have mentioned are enough to achieve what you want. A rough sketch of the approach is below.
Best regards.
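As a minimal sketch, assuming the search form does a plain POST: the URL and the form field names below are made up for illustration and have to be replaced with whatever you find via Live HTTP Headers / Firebug.

library(RCurl)
library(XML)
# Hypothetical search URL and form fields -- replace with the ones the
# real search form actually uses.
search_url <- "https://intranet.example.com/news/search"
html <- postForm(search_url,
                 query    = "your search term",
                 fromDate = "2015-01-01",
                 style    = "POST")
# Parse the returned HTML and pull out all article links.
doc   <- htmlParse(html, asText = TRUE)
links <- xpathSApply(doc, "//a/@href")
head(links)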

Related

Web scraping using rvest works partially ok

I'm new to web scraping with rvest in R and I'm trying to access the match names in the left column of this betting site using XPath. I know the names are under the tag, but I can't access them with the following code:
html="https://www.supermatch.com.uy/live#/5009370062"
a=read_html(html)
a %>% html_nodes(xpath="//span") %>% html_text()
But I only get some of the text. I read that this may be because the website dynamically pulls data from databases using JavaScript and jQuery. Do you know how I can access these match names? Thanks in advance.
Some generic notes about basic scraping strategies
The following refers to Google Chrome and Chrome DevTools, but the same concepts apply to other browsers and their built-in developer tools. One thing to remember about rvest is that it can only handle the response delivered for that specific request, i.e. it does not see content that is fetched / transformed / generated by JavaScript running on the client side.
Loading the page and inspecting elements to extract an XPath or CSS selector for rvest seems to be the most common approach, though the static content behind that URL and the rendered page (with its elements in the inspector) can be quite different. To take some guesswork out of the process, it's better to start by checking what content rvest might actually receive - open the page source and skim through it, or just search for a term you are interested in. At the time of writing Viettel is playing, but they are not listed anywhere in the source:
Meaning there's no reason to expect that rvest would be able to extract that data.
You could also disable JavaScript for that particular site in your browser and check if that particular piece of information is still there. If not, it's not there for rvest either.
If you want to go a step further and/or suspect that rvest receives something different from your browser session (the target site checks request headers and delivers an anti-scraping notice when it doesn't like the user-agent, for example), you can always check the actual content rvest was able to retrieve: for example read_html(some_url) %>% as.character() to dump the whole response, read_html(some_url) %>% xml2::html_structure() to get the formatted structure of the page, or read_html(some_url) %>% xml2::write_html("temp.html") to save the page content and inspect it in an editor or browser.
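A small sketch of that kind of check; sending an explicit user-agent here is just an example, not something this particular site is known to require.

library(rvest)
library(httr)
# Fetch with an explicit user-agent, then look at what was actually returned.
url  <- "https://www.supermatch.com.uy/live#/5009370062"
resp <- GET(url, user_agent("Mozilla/5.0"))
page <- read_html(content(resp, as = "text", encoding = "UTF-8"))
substr(as.character(page), 1, 500)     # first part of the raw response
xml2::write_html(page, "temp.html")    # or save it and open it in a browser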
Coming back to Supermatch & DevTools. That data in the left pane must be coming from somewhere. What usually works is a search on the network pane - open the network tab, clear the current content, refresh the page and make sure the page is fully loaded; then run a search (for "Viettel", for example):
And you'll have the URL from there. There are some IDs in that request (https://www.supermatch.com.uy/live_recargar_menu/32512079?_=1656070333214) and it's wise to assume those values might be tied to the current session or are just short-lived. So sometimes it's worth trying what happens if we just clean it up a bit, i.e. remove 32512079?_=1656070333214. In this case it happens to work, as in the sketch below.
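A minimal rvest sketch against that cleaned-up endpoint; the span selector mirrors the one from the question, and the exact markup of the returned fragment is an assumption.

library(rvest)
# Request the endpoint found via the network pane, with the
# session-specific ID and timestamp stripped off.
menu <- read_html("https://www.supermatch.com.uy/live_recargar_menu/")
menu %>% html_nodes("span") %>% html_text()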
While here it's just a fragment of HTML and it makes sense to parse it with rvest, in most cases you'll end up landing on JSON and the process turns into working with APIs. When that happens it's time to switch from rvest to something more appropriate for JSON - jsonlite + httr, for example.
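For the JSON case, a generic sketch; the endpoint below is made up and the shape of the response will depend on the actual API.

library(httr)
library(jsonlite)
# Hypothetical JSON endpoint -- substitute the one found in the network pane.
resp <- GET("https://example.com/api/live-matches")
data <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(data)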
Sometimes plain rvest is not enough and you either want or need to work with the page as it would have been rendered in your JavaScript-enabled browser. For this there's RSelenium.
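A minimal RSelenium sketch, assuming a local Selenium-driven Firefox can be started; the fixed sleep is a crude placeholder for a proper wait.

library(RSelenium)
library(rvest)
# Drive a real browser, let the JavaScript run, then hand the rendered
# page source to rvest.
driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr  <- driver$client
remDr$navigate("https://www.supermatch.com.uy/live#/5009370062")
Sys.sleep(5)   # crude wait for the client-side rendering to finish
page <- read_html(remDr$getPageSource()[[1]])
page %>% html_nodes(xpath = "//span") %>% html_text()
remDr$close()
driver$server$stop()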

Facebook shares not using og:url when clicked in Facebook?

One of the purposes of og:url -- I thought -- was that it was a way to make sure session variables, or any other personal information that might find its way into a URL, would not be passed along when sharing in places like Facebook. According to the best practices on Facebook's developer pages: "URL: A URL with no session id or extraneous parameters. All shares on Facebook will use this as the identifying URL for this article."
(under good examples: developers.facebook.com/docs/sharing/best-practices)
This does NOT appear to be working, and I am puzzled as to how I misunderstood it and/or what I have wrong in my code. Here's an example:
https://vault.sierraclub.org/fb/test.html?name=adrian
When I drop things into the debugger, it seems to be working fine...
https://developers.facebook.com/tools/debug/sharing/?q=https%3A%2F%2Fvault.sierraclub.org%2Ffb%2Ftest.html%3Fname%3Dadrian
og:url reads as expected (without name=adrian).
But if I share this on Facebook and then click the link, the URL goes to the one with name=adrian in it, not the og:url.
Am I doing something incorrectly here, or have I misunderstood? If the latter, how does one keep things like session variables out of shares?
Thanks for any insight.
UPDATE
Facebook replied to a bug report on this, and I learned that I was indeed reading the documentation incorrectly:
developers.facebook.com/bugs/178234669405574/
The question then remains: is there any other method of keeping session variables/authentication tokens out of shares?

How to see parameters passed in iframe through source

I'm trying to pull up data (for scraping purposes) from a certain website. The data I need is in an iframe, and it's much faster to just load the iframe rather than the entire site. I can access the iframe directly, no problem; however, it does not include any filtered results, which the user would normally control through the form. Clearly the form is posting something to the iframe to tell it which results to display. However, as I cannot see the full URL, I don't know exactly what it's passing.
I tried to poke around the source but haven't been able to figure it out. Is there something specific I should be looking for?
Thanks

How to find out upload/post time of an special website URL?

Often when searching for information I hit the problem that the author of an article/website/blog post doesn't give a date.
Is there any way (maybe a special meta search engine, web archives, or Google search operators) to find out at least in which month and year a website URL was uploaded?
Thanks
Putting
javascript:alert(document.lastModified)
in the address bar of a browser with the page loaded pops up a date and time. Where this time data comes from I have no idea - probably the time the HTML or PHP file was created on the server. On the other hand, I thought JavaScript cannot access the filesystem, but I'm no expert...
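For what it's worth, that value generally mirrors the Last-Modified HTTP response header when the server sends one, so it can also be checked outside the browser. A small R sketch (the URL is a placeholder, and many servers don't send the header at all):

library(httr)
# Ask for the response headers only and read the Last-Modified field, if any.
resp <- HEAD("https://example.com/some-article.html")
headers(resp)[["last-modified"]]   # NULL when the server doesn't provide it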
Still curious if someone knows a reliable method of finding out when a specific .html page was created, as I find it useful for enquiry.

Using ASP.Net, is there a programmatic way to take a screenshot of the browser content?

I have an ASP.Net application where, as a desired feature, users would like to be able to take a screenshot. While I know this can be simulated, it would be really great to have a way to take a URL (or the currently rendered page) and turn it into an image which can be stored on the server.
Is this crazy? Is there a way to do it? If so, any references?
I can tell you right now that there is no way to do it from inside the browser, nor should there be. Imagine that your page embeds GMail in an iframe. You could then steal a screenshot of the person's GMail inbox!
This could be made safe by having the browser "black out" all iframes and embeds that would violate cross-domain restrictions.
You could certainly write an extension to do this, but be aware of the security considerations outlined above.
Update: You can use a canvas utility function to get a screenshot of a page on the same origin as your code. There's even a lib to allow you to do this: http://experiments.hertzen.com/jsfeedback/
You can find other possible answers here: Using HTML5/Canvas/JavaScript to take screenshots
Browsershots has an XML-RPC interface and available source code (in Python).
I used the free assembly UrlScreenshot.dll which you can download here.
Works nicely!
There is also WebSiteScreenShot but it's not free.
You could try a browser plugin like IE7 Pro for Internet Explorer, which allows you to save a screenshot of the current site to a file on disk. I'm sure there is a comparable plugin for Firefox out there as well.
If you want to do something like you described, you need to call an external process that prints the IE output, as described here.
Why don't you take another approach?
If users need to be able to view the same content over again, then it sounds like that is a business requirement for your application, and so you should be building it into your application.
Structure the URL so that when the same user (assuming you have sessions and the application shows different things to different users) visits the same URL, they always see same thing. They can then bookmark the URL locally, or you can even have an application feature that saves it in a user profile.
Part of this would mean making "clean urls", eg, site.com/view/whatever-information-needed-here.
If you are dealing with time-based data, where it changes as it gets older, there are probably a couple of possible approaches.
If your data is not changing on a regular basis, then you could make the "current" page always be, e.g., site.com/view/2008-10-20 (add hour/minute/second as appropriate).
If it is refreshing and/or updating more regularly, have the "current" page as site.com/view, but allow specifying the exact time afterwards. In this case, you'd have to have a "link to this page" type function, which would link to the permanent URL with the full date/time. Look to Google Maps for inspiration here: if you scroll across a map, you can always click "link to here" and it will provide a link that includes the coordinates, objects on the map, etc. In that case it's not a very friendly URL, but it does work quite well. :)
