How do I scrape links to web pages that open in another tab when an HTML element is clicked, using Scrapy and Playwright? - web-scraping

I want to scrape this page: https://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Mumbai to get the link for each property.
The links to the individual property pages are not in the HTML source code. Instead, each page is opened by a click event. How do I get the links to the pages that open, using Scrapy and Playwright?

Each website is different and needs to be treated differently. Usually the journey starts from the Elements panel of the browser's developer tools.
Looking closely at the Elements panel for the URL you shared, we can see that each card is inside a div, and that div also contains a script tag holding JSON. That JSON has the URL you are looking for.
Here is the code to extract the URLs; you can run it inside the page.evaluate function.
await page.evaluate(async () => {
  const urls = [];
  // parent card elements
  const cardElements = [...document.querySelectorAll('[class="mb-srp__list"]')];
  for (const cardElement of cardElements) {
    // get the nested script tag inside each card element that contains the url
    const script = cardElement.querySelector('script');
    // the content of the tag is a string, so we need to parse it
    const cardJSON = JSON.parse(script.innerHTML);
    // finally save whatever data we want
    urls.push(cardJSON.url);
  }
  return urls;
});
Here is a shorter version of the code that can go inside page.evaluate:
[...document.querySelectorAll('[class="mb-srp__list"]')].map(card=>JSON.parse(card.querySelector('script').innerHTML).url)
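Both snippets above assume every card has a script tag with valid JSON. A more defensive sketch is below. extractUrls is a hypothetical helper name, and it takes the raw script text of each card so the parsing can be checked outside the browser. Only the url field mentioned above is assumed; the sample data is made up.

```javascript
// Hypothetical helper: takes the text of each card's <script> tag
// (or null when a card has none) and returns the URLs it can recover.
function extractUrls(scriptTexts) {
  const urls = [];
  for (const text of scriptTexts) {
    if (!text) continue; // card without a script tag
    try {
      const cardJSON = JSON.parse(text);
      if (cardJSON && typeof cardJSON.url === 'string') {
        urls.push(cardJSON.url);
      }
    } catch (err) {
      // skip cards whose script content is not valid JSON
    }
  }
  return urls;
}

// Made-up card contents: one good card, one without a script,
// one with broken JSON, and one without a url field.
const sample = ['{"url":"https://example.com/a"}', null, 'not json', '{"name":"x"}'];
console.log(extractUrls(sample)); // [ 'https://example.com/a' ]
```

To use it with page.evaluate, the function would have to be defined inside the evaluate callback, since that callback runs in the page and cannot see variables from the calling script.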

Related

How do I style a NodeJS response with CSS?

I have this code for a weather app that I am building in Node.js. When I receive the response from Node.js, I get the details (temperature, rain) in plain text, so I cannot style it. Is there any method I can use to get the response as a site styled with CSS? I cannot use prebuilt HTML because the weather is always changing.
app.post('/', function (req, res) {
  const query = req.body.cityName;
  const apiKey = ' ';
  const url = "https://api.openweathermap.org/data/2.5/weather?units=metric&q=" + query + "&appid=" + apiKey;
  https.get(url, function (response) {
    console.log(response.statusCode);
    response.on("data", function (data) {
      const weatherData = JSON.parse(data);
      console.log(weatherData);
      try {
        const temp = weatherData.main.temp;
        console.log(temp);
        const weatherDescription = weatherData.weather[0].description;
        console.log(weatherDescription);
        const icon = weatherData.weather[0].icon;
        const iconUrl = "http://openweathermap.org/img/wn/" + icon + "@2x.png";
        res.write("<h1> The Weather is currently " + weatherDescription + "</h1>");
        res.write("<h1> Temperature is " + temp + "</h1>");
        res.write("<img src=" + iconUrl + ">");
        res.send();
      } catch (e) {
        res.send("Enter a Valid City Name");
      }
    });
  });
});
Here are some of the options you have:
Your API can fetch JSON data (no HTML) and then you programmatically insert that into your page with Javascript in the web page by inserting that data into the already styled page to replace the data that is already there. This should then inherit the styling that you already have, using the existing CSS rules. If you have a client-side template system such as EJS, you could use that to generate HTML from a template stored in your page with the new data inserted into the template and then insert that generated HTML into the page. Or, you can insert the data manually into the existing HTML with your own Javascript.
Your API can fetch a piece of styled HTML that uses CSS classes and ids that will inherit the existing CSS rules already in the page. You then use client-side Javascript to insert this piece of styled HTML into your existing page and it will automatically be able to use the CSS rules already in the page.
Your API can fetch a whole new HTML body, which you can then insert into the page. You can either include CSS rules in the new HTML body or use the existing CSS rules from the page.
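As a rough sketch of the second option, the function below builds a fragment of HTML that relies on CSS classes. renderWeatherCard, escapeHtml and the class names (weather-card, weather-description, weather-temp, weather-icon) are all hypothetical; they assume matching rules exist in the page's stylesheet. The field names come from the code in the question.

```javascript
// Escape text before placing it in HTML so strings from the API cannot inject markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: turns the parsed weather JSON into a styled fragment.
// The class names are assumptions and must match rules in your own CSS.
function renderWeatherCard(weatherData) {
  const temp = weatherData.main.temp;
  const description = weatherData.weather[0].description;
  const icon = weatherData.weather[0].icon;
  const iconUrl = 'https://openweathermap.org/img/wn/' + encodeURIComponent(icon) + '@2x.png';
  return (
    '<div class="weather-card">' +
    '<h1 class="weather-description">The weather is currently ' + escapeHtml(description) + '</h1>' +
    '<p class="weather-temp">Temperature is ' + escapeHtml(temp) + '</p>' +
    '<img class="weather-icon" src="' + iconUrl + '" alt="' + escapeHtml(description) + '">' +
    '</div>'
  );
}
```

On the server this could be sent with res.send(renderWeatherCard(weatherData)), and client-side JavaScript could then place the returned fragment into an existing element of the already styled page.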
Hi @Pasindu sathsara,
My understanding is that you want the CSS styling to change depending on the response (correct me if I have misunderstood).
The idea is:
Understand the API response.
Write a style tag for each weather condition and keep each one as a string in Node.
Use a simple switch to compare the condition in the response and select the matching style.
Call res.write with the selected style tag before res.end(), so that the styling changes with the response.
Regards,
Muhamed
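The steps above could be sketched roughly as follows. This uses a lookup object instead of a switch, which is a swapped-in but equivalent technique. STYLES, DEFAULT_STYLE and styleForCondition are hypothetical names, and the condition names and colors are illustrative assumptions rather than a complete list of what the API returns.

```javascript
// Hypothetical mapping from a weather condition name to a <style> tag.
// Condition names and colors are illustrative assumptions.
const STYLES = {
  clear: '<style>body { background: #fdf6d8; color: #333; }</style>',
  rain: '<style>body { background: #5b6b7a; color: #fff; }</style>',
  snow: '<style>body { background: #eef3f7; color: #223; }</style>',
};
const DEFAULT_STYLE = '<style>body { background: #fff; color: #000; }</style>';

// Returns the style tag for a condition, or the default when it is unknown.
function styleForCondition(condition) {
  const key = String(condition).toLowerCase();
  return Object.prototype.hasOwnProperty.call(STYLES, key) ? STYLES[key] : DEFAULT_STYLE;
}
```

In the route handler this might be used as res.write(styleForCondition(weatherData.weather[0].main)) before the other res.write calls, assuming the response carries the condition name in that field.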

FlowRouter without page reload

I am following this example https://kadira.io/academy/meteor-routing-guide/content/rendering-blaze-templates
When I click on my links the whole page is being reloaded. Is there any way to load only the template part that is needed and not the whole page?
Edit: Also I noted another problem. Everything that is outside {{> Template.dynamic}} is being rendered twice.
Here is my project sample. https://github.com/hayk94/UbMvp/tree/routing
EDIT: Putting the contents in the mainLayout template and starting the rendering from there fixed the double render problems. However the reload problems happen because of this code
Template.mainLayout.events({
  "click *": function (event, template) {
    event.stopPropagation();
    console.log('body all click log');
    // console.log(c0nnIp);
    var clickedOne = $(event.target).html().toString();
    console.log('This click ' + clickedOne);
    // getting the connID
    var clientIp = null; // headers.getClientIP(); // no need for this anymore
    var clientConnId = Meteor.connection._lastSessionId;
    console.log(clientIp);
    console.log(clientConnId);
    Meteor.call("updateDB", {clientIp, clientConnId, clickedOne}, function (error, result) {
      if (error) {
        console.log("error", error);
      }
      if (result) {
      }
    });
  }, // click *
}); // events
Without this event attached to the template, the routing works without any reloads; as soon as I attach it, the reloads return.
Do you have any ideas why this code causes such problems?
EDIT 2 following question Rev 3:
event.stopPropagation() on "click *" event probably prevents the router from intercepting the click on link.
Then your browser performs the default behaviour, i.e. navigates to that link, reloading the whole page.
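If the click logging has to keep its stopPropagation() call, one possible approach (a sketch, not tested against Meteor or FlowRouter) is to skip that call when the click lands on or inside a link, so the router can still see it. isInsideLink is a hypothetical helper name.

```javascript
// Hypothetical helper: returns true when the clicked node is a link
// or sits inside one, by walking up the parentNode chain.
function isInsideLink(node) {
  for (let current = node; current; current = current.parentNode) {
    if (current.tagName === 'A') return true;
  }
  return false;
}

// Intended use inside the "click *" handler (assumption, shown as a comment):
//   if (!isInsideLink(event.target)) {
//     event.stopPropagation();
//   }
```

The simpler fix is to remove event.stopPropagation() from the handler altogether.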
EDIT following question Rev 2:
Not sure you can directly use your body as BlazeLayout target layout.
Notice in the first code sample of BlazeLayout Usage that they use an actual template as layout (<template name="layout1">), targeted in JS as BlazeLayout.render('layout1', {});.
In the tutorial you mention, they similarly use <template name="mainLayout">.
That layout template is then appended to your page's body and filled accordingly. You can also change the placeholder for that layout with BlazeLayout.setRoot() by the way.
But strange things may happen if you try to directly target the body? In particular, that may explain why you have content rendered twice.
Original answer:
If your page is actually reloaded, then your router might not be configured properly, as your link is not being intercepted and your browser makes you actually navigate to that page. In that case, we would need to see your actual code if you need further help.
In case your page does not actually reload, but only your whole content is changed (whereas you wanted to change just a part of it), then you should make sure you properly point your dynamic templates.
You can refer to kadira:blaze-layout package doc to see how you set up different dynamic template targets in your layout, and how you can change each of them separately (or several of them simultaneously).
You should have something similar in case you use kadira:react-layout package.

Can not display base64 encoded images in an HTML fragment in WinJS app

I'm writing a WinJS app that takes an HTML fragment the user has copied to the clipboard, replaces their
Later, when I go to display the .html, I create an iFrame element (using jQuery $(''), and attempt to source the .html into it, and get the following error
0x800c001c - JavaScript runtime error: Unable to add dynamic content. A script attempted to inject dynamic content, or elements previously modified dynamically, that might be unsafe. For example, using the innerHTML property to add script or malformed HTML will generate this exception. Use the toStaticHTML method to filter dynamic content, or explicitly create elements and attributes with a method such as createElement. For more information, see http://go.microsoft.com/fwlink/?LinkID=247104.
I don't get the exception if I don't base64-encode the images, i.e. if I leave them intact; in that case I can display iframes on the page with the images showing.
If I take the html after subbing the urls for base64 and run it through toStaticHTML, it removes the src= attribute completely from the tags.
I know the .html with the encoded pngs is right b/c I can open it in Chrome and it displays fine.
My question: why does it strip the src= attributes from the tags, and how do I fix it? For instance, could I create the iframe without jQuery using some MS voodoo, or use a different technique to sanitize the HTML?
So, a solution I discovered (I'm not 100% convinced it's the best, and I'm still looking for something a little less M$ specific) is the MS WebView:
http://msdn.microsoft.com/en-us/library/windows/apps/bg182879.aspx#WebView
I use some code like below (where content is the html string with base64 encoded images)
var loadHtmlSuccess = function (content) {
  var webview = document.createElement("x-ms-webview");
  webview.navigateToString(content);
  assetItem.append(webview);
};
I believe you want to use execUnsafeLocalFunction. For example:
var target = document.getElementById('targetDIV');
MSApp.execUnsafeLocalFunction(function () {
  target.innerHTML = content;
});

Apply "onclick" to all elements in an iFrame

How do I use the JavaScript DOM to apply onclick events to links inside of an iframe?
Here's what I'm trying that isn't working:
document.getElementById('myIframe').contentDocument.getElementsByTagName('a').onclick = function();
No errors seem to be thrown, and I have complete control of the stuff in the iframe.
Here is some code to test and see if I can at least count how many div's are in my iframe.
// access body
var docBody = document.getElementsByTagName("body")[0];
// create and load iframe element
var embed_results = document.createElement('iframe');
embed_results.id = "myIframe";
embed_results.setAttribute("src", "http://www.mysite.com/syndication/php/embed.php");
// append to body
docBody.appendChild(embed_results);
// count the divs in iframe and alert
alert(document.getElementById("myIframe").contentDocument.getElementsByTagName('div').length);
It is possible for an iframe to source content from another website on a different domain.
Being able to access content on other domains would represent a security vulnerability to the user, so it is not possible to do this via JavaScript.
For this reason, you cannot attach events from your page to content within an iframe that is served from a different origin. If the iframe content comes from the same origin as your page, its document is accessible.
getElementsByTagName returns an HTMLCollection, not a single element, so you have to iterate through the collection and add the onclick handler to every node in it. The code below should work.
var links = document.getElementById('myIframe').contentDocument.getElementsByTagName('a');
for (var i = 0; i < links.length; ++i) {
  links[i].onclick = function () {};
}
Also make sure you run this code after the frame's content has loaded:
embed_results.onload = function () {
  // your code
};
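Putting the two pieces together, a minimal sketch might look like this. attachLinkHandlers is a hypothetical helper name, and it assumes the iframe is served from the same origin as the page (for a cross-origin frame, contentDocument is null).

```javascript
// Hypothetical helper: once the iframe has loaded, attaches `handler`
// as a click listener to every link in its document.
// Assumes a same-origin iframe; contentDocument is null otherwise.
function attachLinkHandlers(iframe, handler) {
  iframe.addEventListener('load', function () {
    const links = iframe.contentDocument.getElementsByTagName('a');
    for (let i = 0; i < links.length; i++) {
      links[i].addEventListener('click', handler);
    }
  });
}
```

With the code from the question, it would be called as attachLinkHandlers(embed_results, function (event) { /* ... */ }) before the iframe is appended to the body, so the load event is not missed.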

Auto ajax selectors with Jquery

I'm trying to make a proof of concept website, but I want perfect degradation. As such I'm going to code the website in plain XHTML first and then add classes & ids to hook them in jQuery.
One thing I want to do is eventually enable Ajax xmlhttprequest for all my links, so they display in a viewport div. I want this viewport to be a "universal" dump for any xmlhttprequest from multiple external pages.
I was wondering if I'm able to hardcode something like:
<a href="blah.html" class="ajax">, <a href="bleat.html" class="ajax">
etc. As you can see, I give the class ajax to every link tag that I want to trigger an Ajax request. In my jQuery-based JS, I want to code it so that every element matching $("a").filter(".ajax") automatically loads its href as an Ajax request.
Please help. I'm a n00b.
With your example, you should be able to do:
$('.ajax').click(function () {
  // Your code here. You should be able to get the href variable and
  // do your ajax request based on it. Something like:
  var url = $(this).attr('href');
  $.ajax({
    type: "GET",
    url: url
  });
  return false; // You need to return false so the link
                // doesn't actually fire.
});
I would suggest using a class different from "ajax" because it makes the code a little strange to read, because $('.ajax') could be misread as $.ajax().
The $('.ajax').click() part registers an onClick event handler for every element on the page with the class "ajax" which is exactly what you want. Then you use $(this).attr('href') to get the href of the particular link clicked and then do whatever you need!
Something like:
function callback(responseText) {
  // load the returned html into a dom object and get the contents of #content
  var html = $('#content', responseText)[0].innerHTML;
  // assign it to the #content div
  $('#content').html(html);
}
$('a.ajax').click(function () {
  $.get(this.href, callback);
  return false;
});
You need to strip out everything outside of the #content div so that the navigation isn't displayed more than once. I was thinking about a regexp, but it is probably easier to use jQuery to do it, so I updated the example.
