I'm using colly to web scrape YouTube Charts. The site uses Polymer and, as a result, I'm having trouble capturing the DOM elements. A simple test I did was document.querySelector("#search-native") in the console, and it returns null.
I found an element called ytmc-app and I could get that element, but it's not possible to keep querying below it.
Does anyone have an idea how to proceed?
Related
I'm new to web scraping using rvest in R, and I'm trying to access the match names in the left column of this betting site using XPath. I know the names are under a span tag, but I can't access them using the following code:
html="https://www.supermatch.com.uy/live#/5009370062"
a=read_html(html)
a %>% html_nodes(xpath="//span") %>% html_text()
But I only get some of the text. I read that this may be because the website dynamically pulls data from a database using JavaScript and jQuery. Do you know how I can access these match names? Thanks in advance.
Some generic notes about basic scraping strategies
The following refers to Google Chrome and Chrome DevTools, but the same concepts apply to other browsers and their built-in developer tools too. One thing to remember about rvest is that it can only handle the response delivered for that specific request, i.e. content that is not fetched, transformed, or generated by JavaScript running on the client side.
Loading the page and inspecting elements to extract an XPath or CSS selector for rvest seems to be the most common approach, though the static content behind that URL and the rendered page (and the elements in the inspector) can be quite different. To take some guesswork out of the process, it's better to start by checking what content rvest might actually receive: open the page source and skim through it, or just search for a term you are interested in. At the time of writing Viettel is playing, but they are not listed anywhere in the source.
Meaning there's no reason to expect that rvest would be able to extract that data.
You could also disable JavaScript for that particular site in your browser and check whether that piece of information is still there. If not, it's not there for rvest either.
If you want to go a step further and/or suspect that rvest receives something different from your browser session (the target site checks request headers and delivers some anti-scraping notice when it doesn't like the user agent, for example), you can always check the actual content rvest was able to retrieve: read_html(some_url) %>% as.character() dumps the whole response, read_html(some_url) %>% xml2::html_structure() prints the formatted structure of the page, and read_html(some_url) %>% xml2::write_html("temp.html") saves the page content so you can inspect it in an editor or browser.
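Put together, a minimal sketch of those checks (some_url is just a placeholder for whatever page you are probing):

library(rvest)
library(xml2)

some_url <- "https://www.supermatch.com.uy/live"  # placeholder URL

page <- read_html(some_url)
as.character(page)              # dump the whole response as a string
html_structure(page)            # print the parsed structure of the page
write_html(page, "temp.html")   # save it to inspect in an editor or browser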
Coming back to Supermatch & DevTools. The data in that left pane must be coming from somewhere. What usually works is a search in the Network pane: open the Network tab, clear its current content, refresh the page and make sure it is fully loaded, then run a search (for "Viettel", for example).
And you'll have the URL from there. There are some IDs in that request (https://www.supermatch.com.uy/live_recargar_menu/32512079?_=1656070333214), and it's wise to assume those values might be tied to the current session or just short-lived. So sometimes it's worth checking what happens if we clean it up a bit, i.e. remove 32512079?_=1656070333214. In this case it happens to work.
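A minimal sketch of that check with rvest, assuming the trimmed endpoint keeps responding the way it did at the time of writing:

library(rvest)

# fetch the trimmed endpoint directly and pull the text out of the spans
menu <- read_html("https://www.supermatch.com.uy/live_recargar_menu/")
menu %>% html_nodes(xpath = "//span") %>% html_text()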
While here it's just a fragment of HTML and it makes sense to parse it with rvest, in most cases you'll end up landing on JSON and the process turns into working with APIs. When that happens it's time to switch from rvest to something more appropriate for JSON, jsonlite + httr for example.
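A rough sketch of that workflow; the endpoint below is a hypothetical placeholder, not something this particular site exposes:

library(httr)
library(jsonlite)

# hypothetical JSON endpoint spotted through the Network pane
resp <- GET("https://example.com/api/live_matches",
            add_headers(`User-Agent` = "Mozilla/5.0"))
matches <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(matches)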
Sometimes plain rvest is not enough and you either want or need to work with the page as it would have been rendered in your JavaScript-enabled browser. For this there's RSelenium.
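A minimal RSelenium sketch, assuming a local setup that rsDriver() can start; the browser choice and the fixed wait are arbitrary:

library(RSelenium)
library(rvest)

# start a Selenium server plus a Firefox session
driver <- rsDriver(browser = "firefox", verbose = FALSE)
remote <- driver$client

remote$navigate("https://www.supermatch.com.uy/live")
Sys.sleep(5)  # crude wait for the client-side rendering to finish

# hand the rendered DOM over to rvest
page <- read_html(remote$getPageSource()[[1]])
page %>% html_nodes(xpath = "//span") %>% html_text()

remote$close()
driver$server$stop()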
I'm trying to build a web scraper which downloads videos from "fmovies.se".
I was not able to fully extract the video URL from the webpage.
The webpage I'm considering is "https://fmovies.se/film/la-cage-doree.5283j".
Two queries are required to retrieve the video url.
The initial one is 'https://fmovies.se/ajax/episode/info?ts=1483027200&_=2399&id=9076jn&update=0'.
The query is composed of "ts", "_", "id" and "update" parameters. Everything except the "_" part is mentioned in the HTML code of the webpage.
I couldn't figure out where the "_=2399" part was coming from.
Can anyone help me with this?
Even if you figure out how those parameters are computed, they can change their algorithm at any moment; this site specifically has done that in the past (see this thread).
You need a long-lasting solution: a headless browser.
You can use a headless browser to simulate user interactions programmatically and intercept the XHR request that you are looking for (e.g. https://fmovies.se/ajax/episode/info?ts=1483027200&_=2399&id=9076jn&update=0).
One of the best-known tools for this is Puppeteer, a Node.js library that drives headless Chrome, and there's a lot of information on how to use it.
I am trying to scrape data from an eCommerce website and have looked at all the major possible solutions. The best I found is a web-scraping extension for Google Chrome. I actually want to pull out all the data available on the website.
For example, I am trying to scrape data from the eCommerce site www.bigbasket.com. While trying to create a sitemap, I am stuck at the part where I have to choose elements from a page. The page for, say, category A contains various products as it is scrolled down, and one category page is further split into page 1, page 2, and for a few categories page 3 and so on.
If I am selecting multiple elements on the same page, say page 1, it's totally fine, but when I try to select an element from page 2 or page 3, the scraper prompts that selecting a different element type is disabled and asks me to enable it by ticking a checkbox; after that I am able to select different elements. But when I run the sitemap and start scraping, the scraper returns null values and no data is pulled. I don't know how to overcome this problem so that I can build a generalized sitemap and pull the data in one go.
To deter web scraping, many websites now render their content with JavaScript. The website you're targeting (bigbasket.com) also uses JS to render information into various elements. To scrape websites like these you will need to use Selenium instead of traditional methods (like BeautifulSoup in Python).
You will also have to check various legal aspects of this and whether the website wants you crawling this data.
How would I use the view Google has in the test center (where I test my frontend)?
When a user browses to site/search.aspx I want them to get the view the test center shows, search boxes and everything. I would also like to add my own JavaScript and CSS to the page.
Is this possible?
For now I have created a search box with an UpdatePanel to show the results, but this approach forces me to do a lot of parsing and variable setting for the dynamic navigation, i.e. a lot of logic Google already provides in the test center.
By the way, I don't want to use the McA+ library supporting GSA 6.14.
I serialized the XML result from the GSA into C# objects and then fed them to my frontend, where I could handle them.
An example of converting XML to HTML using XSL in C# ASP.NET can be found at: http://www.codeproject.com/Articles/469723/Rendering-XML-Data-as-HTML-using-XSL-Transformatio
I'm running into a couple of issues and wondered if anyone had any insight. I'm using the latest PHP SDK and I'm developing a canvas app that has a number of different steps. These steps are spread across multiple pages. When I first enter the app everything seems to work fine: the access token is there and I can call the API functions. On the second page (which is linked to in the same iframe) I get OAuth errors. Now, if I use this on the second page:
$me = $facebook->getUser();
var_dump($me);
it returns the correct user ID, but I still get errors when trying to run an API query (specifically an FQL one in this instance).
Bear in mind these links are within the iframe, so I was assuming the signed_request is getting lost somewhere; I know Facebook normally issues this via a POST. If I set all my links to target="_parent" with a URL such as http://apps.facebook.com/myapp/page2.php then everything works fine: Facebook clearly posts the correct info this time. Subsequently, when I use links that only redirect the iframe it seems to work fine again (implying a cookie is being set somewhere).
Now I've seen other apps that don't use target="_parent" and seem to work correctly, only ever loading the iframe on subsequent clicks and not the full Facebook site, so I can only assume they are storing this info somewhere. I've tried to inspect these apps using HttpFox but I can't see anything obvious. Does anyone have any links for best practices with multi-page apps? I know I can get around this using full URLs and target="_blank", but I would like to know what's going on here. I've looked through the developer docs and the canvas page examples, but there's nothing obvious to me.
Any help or info would be appreciated
Many Thanks
There are a few ways to achieve this:
using the Facebook JavaScript SDK (which will set a cookie for you, so the PHP SDK can rely on it)
issuing a POST request to your pages that includes the signed_request from the initial page loaded in the canvas