I'm trying to pull up data (for scraping purposes) from a certain website. The data I need is in an iframe, and it's much faster to just load the iframe rather than the entire site. I can access the iframe directly with no problem; however, it does not include any of the filtered results which the user would normally control through the form. Clearly the form is posting something to the iframe to tell it which results to display, but as I cannot see the full URL, I don't know exactly what it's passing.
I tried poking around the source but haven't been able to figure it out. Is there something specific I should be looking for?
Thanks
I am trying to pull pricing data from a website, but each time the page is loaded, the class is regenerated to a different sequence of letters, and the price shows up blank instead of as a number. Is there a technique I can use to bypass this in any way? Thanks! Here is the line of HTML as it appears when I inspect the element:
<div class="zlgJQq">$</div>
<div class="qFwqmC hkVukg2 njGalW"> </div>
Your help would be much appreciated!
Perhaps that website is actively discouraging you from scraping their data. That would explain the apparently random class names. You might want to read their terms of use to be sure that it's OK to scrape their site.
However, if the raw HTML does not contain the price data but it is visible when the page is rendered, then it's likely that JavaScript is being used to insert the prices after the page has loaded. You could try enabling the developer tools in your browser and monitoring the network activity while the page is loading. That might reveal that the site is using dynamic Ajax queries to populate the price data, and you could then write code to interact with the Ajax resource directly.
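For example, if the network tab shows the prices being fetched from a JSON endpoint, you can often call that endpoint directly. Here is a minimal sketch in Python, assuming a hypothetical endpoint; the real URL, parameters, and headers would be whatever you actually observe in the developer tools:

# Hypothetical example: query the JSON endpoint seen in the network tab directly.
# The URL, parameters, and headers below are placeholders, not the real site's API.
import requests

resp = requests.get(
    "https://example.com/api/prices",        # endpoint observed in the network tab (placeholder)
    params={"product_id": "12345"},          # query parameters copied from the real request
    headers={"User-Agent": "Mozilla/5.0"},   # some sites expect browser-like headers
)
resp.raise_for_status()
data = resp.json()
print(data)                                  # inspect the structure to find the price field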
It's also possible that the price data is embedded somewhere in the HTML, possibly obfuscated, and then inserted into the page dynamically by JavaScript.
Those are just a couple of suggestions. You will need to analyse the site to see whether automated scraping is feasible. If you can let us know what website you're dealing with, then someone might be able to suggest something more specific.
This is my first post, so if my question is too vague or unclear, please tell me.
I'm trying to scrape a website with news articles for a research project, but the link to the modified search on that webpage won't work because the intranet authentication spits out an error.
So my idea was that I could fill out the search form and use the resulting link to scrape the website.
Since my boss likes to work with R, he would like me to write an R script to do this, but I have no idea how and haven't found anything that works.
You need two packages: RCurl and XML.
The RCurl package handles the web requests. It can submit HTML forms with GET or POST arguments, so with it you can log in or fill out any form.
The output from the server will be HTML. If you want to extract the links, you can use the XML package; it helps you get data out of HTML/XML documents.
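To make the workflow concrete, here is a rough sketch of submitting a search form and collecting the result links. It is written in Python purely for illustration; in R, RCurl::postForm() and XML::htmlParse()/XML::getHTMLLinks() play the same roles. The URL and form field names are placeholders to replace once you know the real form:

# Illustrative sketch: submit the search form, then collect the article links
# from the result page. The URL and field names are placeholders.
import requests
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in the page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Submit the search form (placeholder URL and field names).
resp = requests.post(
    "https://example.com/search",
    data={"query": "my search term", "from": "2020-01-01"},
)

# Extract the links from the returned HTML.
collector = LinkCollector()
collector.feed(resp.text)
print(collector.links)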
But before you start, you have to find out where the search form is on the webpage (and what arguments it submits). The Firefox browser can be useful here; you need two add-ons: Live HTTP Headers and Firebug. With those add-ons you can inspect the webpage much more easily.
I know this does not fully solve your problem, but I can't say much more, since it depends on the particular situation and webpage structure. I believe the tools I have mentioned are enough to achieve what you want.
Best regards.
I have a WordPress-managed site and I would like to embed a specific part of another website in one of my pages. I have identified the div of the part I'm interested in with Firebug. How can I get it to appear on one of my web pages? Simply using an iframe displays the whole page, but I'd rather have the specific div only.
The webpage I want to fetch the div from is outfitpoints.com
and the particular div is div id="outfitdetails" (the box of details seen on the page). It's this box I'd like to have appear on one of my pages.
I'm the developer of Outfitpoints, so I thought I'd answer with a few options.
Firstly, the data I use comes straight out of the Planetside2 API (census.soe.com), so you could write something that pulls some data yourself straight from the source.
However, yours is not the first request to provide this sort of data in a way people can embed on their own site, so I will be looking into providing the data in that format in the very near future.
What sort of site are you looking to embed it in (enjin, guildlaunch, vbulletin etc.)? Just so I can test and make sure it works with it.
Also, feel free to give me a shout using the contact method on the site if you want to discuss it more.
I want to make a site where the user can basically navigate the web from within an iframe. The catch is that I'd like to have more control over what is rendered within the iframe. Specifically,
I'd like to be able to filter out images or text, disable forms etc.
I'd also like to be able to gather feedback such as what links the users clicked on.
Question 1:
Is this even possible using a standard back-end scripting language (like php), with html and javascript on the frontend?
Question 2:
Would I first need to grab the source of the site before it is rendered, then do whatever manipulation is necessary, and finally re-render it somehow?
Question 3:
Could somebody please explain the programming flow that would occur here (assuming it's possible)?
I think you would probably want to grab the source of the site (with server-side code) before rendering it. You might run into cross-site scripting issues if you try to use JavaScript. Your iframe would load a page like render.php and pass the address of the page to render as a querystring parameter. Then use regular expressions to find elements in the HTML that render.php downloads from the address. Rewrite the HTML as necessary and then write it all out to the iframe.
Rewrite links so that the user is taken to a page you control and then redirected on to the target site if you want to track where people are going. Example: a link in the page needs to go to google.com. You would send them to tracker.php?target=http://google.com. You control tracker.php, so you can log each load of this page and then redirect the user to the target site.
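To make that flow concrete, here is a rough sketch in Python (Flask) rather than PHP; the route names, the target querystring parameter, and the link-rewriting regex are illustrative placeholders, not a finished solution:

# Rough sketch of the render/track flow described above. /render fetches the
# target page, rewrites its links to go through /tracker, and returns the HTML;
# /tracker logs the click and redirects to the real destination.
import re
import requests
from flask import Flask, Response, redirect, request

app = Flask(__name__)

@app.route("/render")
def render():
    target = request.args.get("target", "")
    html = requests.get(target).text
    # Rewrite every absolute link so clicks pass through /tracker first.
    html = re.sub(
        r'href="(https?://[^"]+)"',
        lambda m: 'href="/tracker?target=' + m.group(1) + '"',
        html,
    )
    return Response(html, mimetype="text/html")

@app.route("/tracker")
def tracker():
    target = request.args.get("target", "")
    app.logger.info("click: %s", target)  # record where the user is going
    return redirect(target)               # then send them on to the target site

if __name__ == "__main__":
    app.run()

In practice you would also need to URL-encode the target parameter and handle relative links, which is part of why this kind of proxying gets brittle.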
Update:
Another possible solution is to use Apache or other server to proxy the target website. There are modules like mod_proxy for this. There may also be modules that let you parse the HTML or you could roll your own.
I should point out that even the best solutions offered to your question will be somewhat brittle if you do not have full control over the target site. You will want to have lots of error handling or alerting.
You can have a look at this. It uses iframes really well, and you could maybe even use the library it has.
I have an ASP.Net application where, as a desired feature, users would like to be able to take a screenshot. While I know this can be simulated, it would be really great to have a way to take a URL (or the currently rendered page) and turn it into an image which can be stored on the server.
Is this crazy? Is there a way to do it? If so, any references?
I can tell you right now that there is no way to do it from inside the browser, nor should there be. Imagine that your page embeds GMail in an iframe. You could then steal a screenshot of the person's GMail inbox!
This could be made safe by having the browser "black out" all iframes and embeds that would violate cross-domain restrictions.
You could certainly write an extension to do this, but be aware of the security considerations outlined above.
Update: You can use a canvas utility function to get a screenshot of a page on the same origin as your code. There's even a lib to allow you to do this: http://experiments.hertzen.com/jsfeedback/
You can find other possible answers here: Using HTML5/Canvas/JavaScript to take screenshots
Browsershots has an XML-RPC interface and available source code (in Python).
I used the free assembly UrlScreenshot.dll which you can download here.
Works nicely!
There is also WebSiteScreenShot but it's not free.
You could try a browser plugin like IE7 Pro for Internet Explorer, which allows you to save a screenshot of the current site to a file on disk. I'm sure there is a comparable plugin for Firefox out there as well.
If you want to do something like you described, you need to call an external process that prints the IE output, as described here.
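As an alternative to driving IE as an external process, a scripted browser can do the same job. Here is a minimal sketch using Python and Selenium (a different tool than the one the linked article describes; it assumes a browser and matching driver are installed on the server, and the URL and file name are placeholders):

# Minimal sketch: capture a URL as a PNG using Selenium.
# Requires Firefox plus geckodriver (or another browser/driver pair) on the server.
from selenium import webdriver

driver = webdriver.Firefox()
try:
    driver.get("https://example.com/page-to-capture")   # placeholder URL
    driver.save_screenshot("screenshot.png")            # image written to the server's disk
finally:
    driver.quit()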
Why don't you take another approach?
If users need to be able to view the same content over again, then that sounds like a business requirement for your application, and so you should build it into your application.
Structure the URL so that when the same user (assuming you have sessions and the application shows different things to different users) visits the same URL, they always see the same thing. They can then bookmark the URL locally, or you can even have an application feature that saves it in a user profile.
Part of this would mean making "clean urls", eg, site.com/view/whatever-information-needed-here.
If you are doing time-based data, where it changes as it gets older, there are probably a couple possible approaches.
If your data is not changing on a regular basis, then you could make the "current" page always be a dated URL, e.g. site.com/view/2008-10-20 (adding hour/minute/second as appropriate).
If it is refreshing and/or updating more regularly, have the "current" page as site.com/view but allow specifying the exact time afterwards. In this case, you'd need a "link to this page" type function, which would link to the permanent URL with the full date/time. Look to Google Maps for inspiration here: if you scroll across a map, you can always click "link to here" and it will provide a link that includes the GPS coordinates, objects on the map, etc. In that case it's not a very friendly URL, but it does work quite well. :)
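To sketch the URL scheme (shown with Python/Flask purely for illustration; the route names and the data lookup are placeholders, and the same routing idea carries over to ASP.Net):

# Rough sketch of the "clean, bookmarkable URL" idea: /view redirects to a
# permanent, timestamped URL, and the timestamped URL always shows the same thing.
from datetime import datetime, timezone
from flask import Flask, redirect, url_for

app = Flask(__name__)

@app.route("/view")
def view_current():
    # "Current" page: redirect to the permanent URL for this moment so the
    # user can bookmark or share exactly what they are seeing.
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S")
    return redirect(url_for("view_at", timestamp=now))

@app.route("/view/<timestamp>")
def view_at(timestamp):
    # Always render the same content for the same timestamp, e.g. by querying
    # the data as of that moment (placeholder rendering here).
    return f"Snapshot of the data as of {timestamp}"

if __name__ == "__main__":
    app.run()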