I'm trying to build a web scraper that downloads videos from "fmovies.se".
I haven't been able to fully extract the video URL from the webpage.
The webpage I'm considering is "https://fmovies.se/film/la-cage-doree.5283j".
Two queries are required to retrieve the video URL.
The initial one is 'https://fmovies.se/ajax/episode/info?ts=1483027200&_=2399&id=9076jn&update=0'.
The query is composed of "ts", "_", "id" and "update" parameters. Everything except the "_" part is present in the HTML code of the webpage.
I couldn't figure out where the "_=2399" part was coming from.
Can anyone help me with this?
Even if you figure out how those parameters are computed, the site can change its algorithm at any moment, which this particular site has done in the past; see this thread.
You need a longer-lasting solution: a headless browser.
You can use a headless browser to simulate user interactions programmatically and intercept the XHR request that you are looking for (e.g. https://fmovies.se/ajax/episode/info?ts=1483027200&_=2399&id=9076jn&update=0).
One of the best-known headless browser tools out there is Puppeteer, and there's a lot of information on how to use it.
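For example, here's a minimal sketch with Puppeteer (Node/TypeScript) that loads the page from the question and logs the body of any response coming from that ajax endpoint. The URL pattern is taken from the question and may well have changed since:

import puppeteer from 'puppeteer';

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Log the body of any XHR response from the episode-info endpoint.
  page.on('response', async (response) => {
    if (response.url().includes('/ajax/episode/info')) {
      console.log(await response.text());
    }
  });

  await page.goto('https://fmovies.se/film/la-cage-doree.5283j', {
    waitUntil: 'networkidle2', // wait until the XHR traffic has settled
  });

  await browser.close();
})();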
I'm new to web scraping using rvest in R, and I'm trying to access the match names in the left column of this betting site using XPath. I know the names are under the <span> tag, but I can't access them using the following code:
library(rvest)

url <- "https://www.supermatch.com.uy/live#/5009370062"
page <- read_html(url)
page %>% html_nodes(xpath = "//span") %>% html_text()
But I only get some of the text. I've read that this may be because the website dynamically pulls data from databases using JavaScript and jQuery. Do you know how I can access these match names? Thanks in advance.
Some generic notes about basic scraping strategies
The following refers to Google Chrome and Chrome DevTools, but the same concepts apply to other browsers and their built-in developer tools too. One thing to remember about rvest is that it can only handle the response delivered for that specific request, i.e. it cannot see content that is fetched / transformed / generated by JavaScript running on the client side.
Loading the page and inspecting elements to extract an XPath or CSS selector for rvest seems to be the most common approach, though the static content behind that URL and the rendered page with its elements in the inspector can be quite different. To take some guesswork out of the process, it's better to start by checking what content rvest might actually receive: open the page source and skim through it, or just search for a term you are interested in. At the time of writing, Viettel is playing, but they are not listed anywhere in the source.
Meaning there's no reason to expect that rvest would be able to extract that data.
You could also disable JavaScript for that particular site in your browser and check whether that piece of information is still there. If not, it's not there for rvest either.
If you want to go a step further and/or suspect that rvest receives something different from your browser session (the target site checks request headers and delivers an anti-scraping notice when it doesn't like the user agent, for example), you can always check the actual content rvest was able to retrieve: read_html(some_url) %>% as.character() dumps the whole response, read_html(some_url) %>% xml2::html_structure() prints the formatted structure of the page, and read_html(some_url) %>% xml2::write_html("temp.html") saves the page content so you can inspect it in an editor or browser.
Coming back to Supermatch and DevTools: that data in the left pane must be coming from somewhere, and what usually works is a search in the Network pane. Open it, clear the current content, refresh the page and make sure it is fully loaded, then run a search (for "Viettel", for example).
And you'll have the URL from there. There are some IDs in that request (https://www.supermatch.com.uy/live_recargar_menu/32512079?_=1656070333214), and it's wise to assume those values might be tied to the current session or just short-lived. So sometimes it's worth trying what happens if we clean the URL up a bit, i.e. remove 32512079?_=1656070333214. In this case it happens to work.
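That check is language-agnostic even though this answer is R-centric; as an illustration, here's the same request as a TypeScript sketch (Node 18+ with built-in fetch), using the cleaned-up URL from above:

// Request the cleaned-up URL and peek at the returned HTML fragment to
// confirm it still works without the session-looking ID and timestamp.
async function checkEndpoint(): Promise<void> {
  const res = await fetch('https://www.supermatch.com.uy/live_recargar_menu/');
  console.log(res.status, (await res.text()).slice(0, 500));
}

checkEndpoint();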
While here it's just a fragment of HTML and it makes sense to parse it with rvest, in most cases you'll end up landing on JSON and the process transforms into working with APIs. When that happens, it's time to switch from rvest to something more appropriate for JSON, such as jsonlite + httr.
Sometimes plain rvest is not enough and you either want or need to work with the page as it would have been rendered in your JavaScript-enabled browser. For that there's RSelenium.
I am currently working on some automation to retrieve all the currency rates from a specific bank's website.
It used to work, as the website provided the rates in HTML when I used HTTP GET.
However, it seems that they have changed how the website is built. The HTML no longer contains the rates; from my understanding, they are inside a table.
Is there a way to retrieve the table content via HTTP GET?
Can someone show me how to access the table contents with a direct link, if possible?
Below is the webpage I'm having problems with.
https://www.dbs.com.sg/personal/rates-online/foreign-currency-foreign-exchange.page
It seems that they changed their website to fetch data via Ajax now. You can use your browser's developer tools and check the Network tab to see the additional data being loaded, e.g.
https://www.dbs.com.sg/flplscsapi/personal/default.page?q=(path:=templatedata/MMContent/RatesSGFX/data/personal/en/fx_rates.xml)&max=10&start=0&format=json&includeDCRContent=true to get a JSON holding information about the display name of each currency, the image to be displayed, as well as a shortcut for the currency
https://www.dbs.com.sg/sg-rates-api/v1/api/sgrates/getSGFXRates to get a JSON which holds information about the currency rates (see the sketch below).
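A minimal sketch of calling that second endpoint (TypeScript, Node 18+ with built-in fetch); the shape of the returned JSON isn't documented here, so treat the output as something to inspect rather than a known schema:

const url = 'https://www.dbs.com.sg/sg-rates-api/v1/api/sgrates/getSGFXRates';

async function fetchRates(): Promise<void> {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  // Dump the JSON so you can work out which fields hold the rates.
  const data = await response.json();
  console.log(JSON.stringify(data, null, 2));
}

fetchRates();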
I am trying to pull pricing data from a website, but each time the page is loaded, the class is regenerated as a different sequence of letters, and the price shows up blank instead of as a number. Is there a technique I can use to get around this? Thanks! Here are the lines of HTML as they appear when I inspect the element:
<div class="zlgJQq">$</div>
<div class="qFwqmC hkVukg2 njGalW"> </div>
Your help would be much appreciated!
Perhaps that website is actively discouraging you from scraping their data. That would explain the apparently random class names. You might want to read their terms of use to be sure that it's OK to scrape their site.
However, if the raw HTML does not contain the price data but it is visible when the page is rendered, then it's likely that JavaScript is being used to insert the prices after the page has loaded. You could try opening the developer tools in your browser and monitoring the network activity while the page is loading. That might reveal that the site is using dynamic Ajax queries to populate the price data, and you could then write code to interact with the Ajax resource directly.
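Purely as an illustration, since the question doesn't name the site: once the Network tab shows you the real Ajax URL that returns the prices, you can usually call it directly. The endpoint below is hypothetical:

// Hypothetical endpoint; substitute whatever URL the Network tab reveals.
const endpoint = 'https://example.com/api/prices?sku=12345';

fetch(endpoint)
  .then((res) => res.json())
  .then((prices) => console.log(prices));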
It's also possible that the price data is embedded somewhere in the HTML, possibly obfuscated, and then loaded dynamically by JavaScript.
That's just a couple of suggestions. You will need to analyse the site to see whether automated scraping is feasible. If you can let us know which website you're dealing with, then someone might be able to suggest something more specific.
I'm running into a couple of issues and wondered if anyone had any insight. I'm using the latest PHP SDK to develop a canvas app that has a number of different steps, spread across multiple pages. When I first enter the app everything seems to work fine: the access token is there and I can call the API functions. On the second page (which is linked to in the same iframe) I get OAuth errors. If I use this on the second page:
$me = $facebook->getUser();
var_dump($me);
it returns the correct user ID, but I still get errors when trying to run an API query (specifically an FQL one in this instance).
Now, bear in mind these links are within the iframe, so I was assuming the signed_request is getting lost somewhere; I know Facebook normally issues it via a POST. If I set all my links to target="_parent" with a URL such as http://apps.facebook.com/myapp/page2.php, then everything works fine. Facebook clearly posts the correct info this time. Subsequently, when I use links that only redirect the iframe, it seems to work fine again (implying a cookie is being set somewhere).
Now, I've seen other apps that don't use target="_parent" and seem to work correctly, only ever reloading the iframe on subsequent clicks and not the full Facebook site, so I can only assume they are storing this info somewhere. I've tried to inspect these apps using HttpFox, but I can't see anything obvious. Does anyone have any links for best practices with multi-page apps? I know I can get around this using full URLs and target="_blank", but I would like to know what's going on here. I've looked through the developer docs and the canvas page examples, but nothing stands out to me.
Any help or info would be appreciated
Many Thanks
There are a couple of ways to achieve this:
using the Facebook JavaScript SDK, which will set a cookie for you that the PHP SDK can rely on (see the sketch below)
issuing a POST request to your pages that includes the signed_request from the initial page loaded in the canvas
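A minimal sketch of the first option, assuming the classic JavaScript SDK of that era (parameter names may differ in newer SDK versions). With cookie: true, the JS SDK persists the auth data in a cookie that the PHP SDK can read on subsequent page loads inside the iframe:

declare const FB: any; // provided by the SDK script tag

(window as any).fbAsyncInit = function () {
  FB.init({
    appId: 'YOUR_APP_ID', // placeholder: your app's ID
    cookie: true,         // persist the session in a cookie for the PHP SDK
    xfbml: true,
  });
};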
I have an ASP.NET application with a desired feature: users would like to be able to take a screenshot. While I know this can be simulated, it would be really great to have a way to take a URL (or the currently rendered page) and turn it into an image which can be stored on the server.
Is this crazy? Is there a way to do it? If so, any references?
I can tell you right now that there is no way to do it from inside the browser, nor should there be. Imagine that your page embeds GMail in an iframe. You could then steal a screenshot of the person's GMail inbox!
This could be made safe by having the browser "black out" all iframes and embeds that would violate cross-domain restrictions.
You could certainly write an extension to do this, but be aware of the security considerations outlined above.
Update: You can use a canvas utility function to get a screenshot of a page on the same origin as your code. There's even a lib to allow you to do this: http://experiments.hertzen.com/jsfeedback/
You can find other possible answers here: Using HTML5/Canvas/JavaScript to take screenshots
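As an illustration, here's a minimal sketch with html2canvas, the modern descendant of the library linked above (the promise-based API shown belongs to current versions, not the original 2010-era experiment):

import html2canvas from 'html2canvas';

// Render the visible page into a canvas, then turn it into a PNG data URL
// that could be uploaded to the server for storage.
html2canvas(document.body).then((canvas) => {
  const dataUrl = canvas.toDataURL('image/png');
  console.log(`captured ${dataUrl.length} characters of PNG data URL`);
});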
Browsershots has an XML-RPC interface and available source code (in Python).
I used the free assembly UrlScreenshot.dll, which you can download here.
Works nicely!
There is also WebSiteScreenShot but it's not free.
You could try a browser plugin like IE7 Pro for Internet Explorer, which allows you to save a screenshot of the current site to a file on disk. I'm sure there is a comparable plugin for Firefox out there as well.
If you want to do something like you described, you need to call an external process that prints the IE output, as described here.
Why don't you take another approach?
If users need to be able to view the same content over again, then that sounds like a business requirement for your application, and you should build it into your application.
Structure the URL so that when the same user (assuming you have sessions and the application shows different things to different users) visits the same URL, they always see the same thing. They can then bookmark the URL locally, or you can even have an application feature that saves it in a user profile.
Part of this would mean making "clean URLs", e.g. site.com/view/whatever-information-needed-here.
If you are dealing with time-based data, where it changes as it gets older, there are a couple of possible approaches.
If your data is not changing on a regular basis, then you could always give the "current" page a dated URL, e.g. site.com/view/2008-10-20 (adding hour/minute/second as appropriate).
If it is refreshing and/or updating more regularly, have the "current" page as site.com/view, but allow the exact time to be specified afterwards. In this case, you'd need a "link to this page" type of function, which would link to the permanent URL with the full date/time. Look to Google Maps for inspiration here: if you scroll across a map, you can always click "link to here" and it will provide a link that includes the GPS coordinates, objects on the map, etc. In that case it's not a very friendly URL, but it works quite well. :)
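For illustration, a tiny sketch of that "link to this page" idea (the function name and URL scheme are made up, mirroring the site.com/view examples above):

// Build a permanent URL that pins the current view to an exact timestamp,
// so the same link always shows the same data.
function permalinkFor(baseUrl: string, at: Date): string {
  return `${baseUrl}/view/${at.toISOString()}`; // e.g. https://site.com/view/2008-10-20T14:35:00.000Z
}

console.log(permalinkFor('https://site.com', new Date()));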