Can't get the full content of the HTML page - web-scraping

Hi, I am trying to scrape web content from a news website. When I use the "requests.get(url)" and "soup(html.content, 'html.parser')" methods, I cannot get the exact information that can be seen with the "Inspect Element" function of the web page. Inspect Element shows the actual content of the web page,
but I cannot see all of that content when using "soup(html.content, 'html.parser')".
I tried the above methods but failed to get the full content.
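For reference, here is a minimal sketch of the approach described above (the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

# Placeholder URL for the news article being scraped
url = "https://example.com/news-article"

# Fetch the HTML exactly as the server returns it
html = requests.get(url)

# Parse only the server-rendered markup; anything that the page's
# JavaScript injects after loading will not be in this tree
soup = BeautifulSoup(html.content, "html.parser")
print(soup.prettify())

If the missing information is inserted by JavaScript after the page loads, it will never appear in the response that requests receives, no matter which parser is used; a browser-driven tool such as Selenium (see the related answer further down) is one way around that.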

Related

WordPress Elementor Page issue with binding REST API

We are trying to integrate a REST API into one of the pages created using Elementor and are having a hard time binding the values, because we cannot give an ID or name to HTML elements the way we do on a normal page when using JavaScript/jQuery for DOM (Document Object Model) manipulation. We have to use syntax like this:
document.querySelectorAll('#advisor-contact > div > div> div > div')[1].querySelector('a > span').innerText
Such code is brittle and requires repeated maintenance whenever the structure of the page changes. We have read some articles on the web saying that Elementor Pro has a way to provide tags for dynamic content. I am not sure whether that means giving IDs to HTML elements on a page created using Elementor Pro, or something else. Considering our use case, please suggest/guide us on how we can bind the response of a REST API on a page created using Elementor. Please provide links to documentation if available.

Displaying values from a Google spreadsheet on a WordPress page

I have followed the tutorial on this website, https://www.wp-tweaks.com/display-a-single-cell-from-google-sheets-wordpress/, which allows you to dynamically display values from a Google spreadsheet on a WordPress page using a simple shortcode:
[get_sheet_value location="Cell Location"]
This solution worked seamlessly until a single page contained hundreds of those shortcodes (I basically need the whole content of the page to be editable via the spreadsheet). I started getting 100% "Errors by API method" (based on the Google metrics) and the content was no longer displayed properly. I realize that sending hundreds of read requests on each page load is not ideal, will inevitably affect load performance, and that Google imposes quota limits too. Is there a way to work around this issue, for example by pulling the values from the Google spreadsheet only once a day? Unfortunately, I don't have much coding experience, but I'm open to all solutions.
Thanks in advance!
You could publish the sheet to the web and embed it in your website:
In your sheet, go to File > Publish to the web
In the window that appears, click Embed.
Click Publish.
Copy the code in the text box and paste it into your site.
To show or hide parts of the spreadsheet, edit the HTML on your site.
It would look like this:
<iframe src="https://docs.google.com/spreadsheets/d/e/2PACX-1vR3UbHTtAkR8TGNtXU3o4hzkVVhSwhnckMp7tQVCl1Fds3AnU5WoUJZxTfJBZgcpBP0VqTJ9n_ptk6J/pubhtml?gid=1223818634&single=true&widget=true&headers=false"></iframe>
You could try reading the entire spreadsheet as a JSON file and parsing it within your code.
https://www.freecodecamp.org/news/cjn-google-sheets-as-json-endpoint/
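A rough illustration of that idea in Python, assuming the sheet is published/shared publicly; the spreadsheet ID, gid, and cell position below are placeholders, and the CSV export endpoint is used instead of JSON purely for simplicity:

import csv
import io
import requests

# Placeholder spreadsheet ID and sheet gid -- replace with your own
SHEET_ID = "YOUR_SPREADSHEET_ID"
GID = "0"

# A publicly shared sheet can be downloaded in one request as CSV
url = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/export?format=csv&gid={GID}"

response = requests.get(url)
response.raise_for_status()

# Parse the whole sheet once, instead of one API read per cell
rows = list(csv.reader(io.StringIO(response.text)))

# Example: the value of cell B2 (row index 1, column index 1)
print(rows[1][1])

The point is that the whole sheet comes down in a single request, which can then be cached (for example, refreshed once a day) instead of issuing hundreds of per-cell API reads on every page load.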

Web scraping of an eCommerce website using Google Chrome extension

I am trying to do web scraping of an eCommerce website and have looked at all the major possible solutions. The best I have found is the web scraping extension for Google Chrome. I actually want to pull out all the data available on the website.
For example, I am trying to scrape data from the eCommerce site www.bigbasket.com. While trying to create a sitemap, I am stuck at the part where I have to choose elements from a page. The same page for, say, category A contains various products as you scroll down, and one category page is further split into page 1, page 2, and for a few categories page 3 and so on.
Now if I am selecting multiple elements of the same page, say page 1, it's totally fine, but when I try to select an element from page 2 or page 3, the scraper prompts that a different type of element selection is disabled and asks me to enable it by selecting a checkbox; after that I am able to select different elements. But when I run the sitemap and start scraping, the scraper returns null values and no data is pulled out. I don't know how to overcome this problem so that I can build a generalized sitemap and pull the data in one go.
To deter web scraping, many websites now render their content with JavaScript. The website you are targeting (bigbasket.com) also uses JS to render information into various elements. To scrape websites like these you will need to use Selenium instead of traditional methods (like BeautifulSoup in Python).
You will also have to check the various legal aspects of this and whether the website permits you to crawl this data.
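A minimal Selenium sketch along those lines, in Python; the URL and the CSS selector are placeholders, and a Chrome/ChromeDriver installation is assumed:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Assumes Chrome and a matching ChromeDriver are available
driver = webdriver.Chrome()

# Give dynamically rendered elements some time to appear
driver.implicitly_wait(10)

# Placeholder page; Selenium runs the page's JavaScript,
# so content rendered client-side ends up in the DOM
driver.get("https://www.bigbasket.com/")

# Placeholder selector -- adjust it to the element holding product names
products = driver.find_elements(By.CSS_SELECTOR, "div.product-name")
for product in products:
    print(product.text)

driver.quit()

Once the page has rendered, driver.page_source can also be handed to BeautifulSoup, so any existing parsing code can be reused.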

Custom Parser for Nutch (or open source .NET Crawler)

I have been using Nutch/Solr/SolrNet for my search solutions and, I must say, it works a treat. On a new site I'm working on I am using master pages; as a result, content in the header and footer is getting indexed and distorts the results. For example, I have a link to the Contact Us page in the header. Now, when I search for 'Contact', the results return all the pages on the site.
Is there a customizable Nutch parser to which I can pass a div ID so that it only indexes content inside that div?
Or are there .NET-based crawlers that I can customize?
See https://issues.apache.org/jira/browse/NUTCH-585
and https://issues.apache.org/jira/browse/NUTCH-961
By the way, you'd get a more relevant audience by posting to the Nutch user list.
You can implement a Nutch filter (I like Jericho HTML Parser) to extract only the parts of the page you need to index using DOM manipulation. You can use the TextExtractor class to grab clean text (sans HTML tags) to be used in your index. I usually save that data in custom fields.

<!-- #include virtual="/footer.asp"--> in ASP.NET application

I need to include a header and footer that are currently located in an ASP page. The page takes the language ID and gives you the correct header for the page you are viewing.
I was going through this thread: http://forums.asp.net/t/1420472.aspx and this particular fragment seemed to explain it best, though I could not wrap my mind around it:
Hi, instead of using include tags, you could compose your page this way:
Your .NET application here
You can then implement in codebehind the remote header and footer download logic and set them in the Literals' Text. After downloading from the remote site, I would suggest to store the header and footer in the application's Cache to avoid too many connections to the remote server. If the same header-and-footer are shared by many pages in your project, moving this structure to a MasterPage could be useful.
Kindly assist.
Well, it would be applicable if the header/footer content is coming from some other (remote) server. So the suggested solution is to:
Write code to download the header/footer content from the remote server.
Cache the content so that you don't have to download it again and again.
Use Literal controls as placeholders on the page and set their text to the downloaded content from code-behind.
Now, this may or may not be applicable to your problem. Where do you get the content for the header/footer from? If it's some helper class/method, then you can call it directly to set the Literal text. You can even do that in the master page, making things even simpler.
Edit: Additional information requested by OP
You can use WebRequest for downloading content. See this article to get started: http://www.west-wind.com/presentations/dotnetwebrequest/dotnetwebrequest.htm
Refer below to get started on caching:
http://www.asp.net/general/videos/how-do-i-use-the-aspnet-cache-object-to-cache-application-information
You can use HttpWebRequest to get the required footer text from the ASP page and then use a Literal control to display this text.
This page has example code showing how you can submit a value to a page and get the response.
