Best approach for fetching news from websites? - web-scraping

I have a function that scrapes the latest news from a website (approximately 10 items; the exact number depends on the website). Note that the news items are in chronological order.
For example, yesterday I got 10 news items and stored them in the database. Today I get 10 items, but 3 of them were not there yesterday (7 stayed the same, 3 are new).
My current approach is to extract each item until I reach an old one (the first of the 7), then stop extracting, update only the "lastUpdateDate" field of the old news, and add the new items to the database. I think this approach is somewhat complicated, and it takes time.
I'm actually getting news from 20 websites with the same content structure (Moodle), so each request lasts about 2 minutes, which my free host doesn't support.
Would it be better to delete all the news and then extract everything from scratch (this would rapidly inflate the ID numbers in the database)?

First, check to see if the website has a published API. If it has one, use it.
Second, check the website's terms of service, which may specifically and explicitly disallow scraping the website.
Third, look at a module in your programming language of choice that handles both the fetching of the pages and the extraction of the content from the pages. In Perl, you would start with WWW::Mechanize or Web::Scraper.
Whatever you do, don't fall into the trap that so many who post to StackOverflow fall into: fetching the web page and then trying to parse the content themselves, most often with regular expressions, which are an inadequate tool for the job. Browse the SO tag html-parsing for tales of sorrow from those who have tried to roll their own HTML parsing systems instead of using existing tools.
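As a rough illustration of using an existing parser rather than regular expressions, here is a minimal sketch in Python (not the Perl modules named above), built on the standard library's html.parser. The names LinkExtractor and extract_links are made up for this example, and it assumes each news item is an ordinary link; a real Moodle page would need its own selection logic.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Nested tags such as <b> still deliver their text here.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Hypothetical helper: return all (href, text) pairs found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

The parser copes with nested markup and attribute ordering, which is where hand-written regular expressions tend to break.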

It depends on your requirements: whether or not you want to show old news to the users.
For scraping, you can write a custom local script, run as a cron job, that grabs the data from those news websites and stores it in the database.
You can also check by subject whether an item already exists.
Finally, make a custom news block that shows the whole feed from the database.
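The "check whether it already exists" step could look something like the sketch below. It uses SQLite from the Python standard library and keys each item on its URL; the table layout, the column names, and the store_news function are assumptions made for this example, not anything taken from the question.

```python
import sqlite3


def store_news(conn, items, today):
    """Insert unseen items and refresh last_update_date on known ones.

    items is a list of (url, title) pairs; today is a date string.
    Returns the number of newly inserted rows.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news ("
        "url TEXT PRIMARY KEY, title TEXT, last_update_date TEXT)"
    )
    new_count = 0
    for url, title in items:
        # Try to touch an existing row first.
        cur = conn.execute(
            "UPDATE news SET last_update_date = ? WHERE url = ?", (today, url)
        )
        if cur.rowcount == 0:
            # Nothing was updated, so this item has not been seen before.
            conn.execute(
                "INSERT INTO news (url, title, last_update_date) VALUES (?, ?, ?)",
                (url, title, today),
            )
            new_count += 1
    conn.commit()
    return new_count
```

Because existing rows are updated in place, nothing is deleted and re-created, so the ID growth described in the question does not arise.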

Related

How to Limit Access in an Amazon S3 Bucket to a Specific Folder Containing Course Information Through WooCommerce

Rookie S3 user here, looking to troubleshoot a problem I encountered while helping some friends with their business. The business revolves around selling courses; they use WooCommerce and attach course files through WordPress. Each course works like this: there is a live video call that people like to join, so the WooCommerce product initially holds the details for the upcoming call, and afterward additional audio and transcripts are added to the product for sale. The problem is that people who bought the course before the call do not receive these files unless they are manually given permission to see them.
As this is redundant and troublesome, my thought was to change the purchase so that it instead gives a link into an Amazon S3 bucket labeled courses, with access to a specific folder within it. Ideally, this link would let them see new files live and would also reduce the amount of data on the website, which is hosted on a dedicated server (save some $$$ on hosting fees, 2 birds 1 stone).
The problem, however, is that since I am a complete novice at this style of coding, I am unsure how to do this, although I think it is possible if an answer is already out there or if I can bull and jam my way through a section of code.
The reason I want to organize courses as folders inside one bucket instead of as individual buckets is that the website currently has nearly 200 courses; moving them to separate buckets would go well over the 100-bucket limit, in addition to being an exercise in repetition. Any advice or help would be greatly appreciated, thanks!
If I understand you correctly, you want to host content on S3, but want to achieve some degree of access control on that content.
The most straightforward way to do this, the one that involves minimal S3 integration, is to presign an S3 URL for the user. The presigned URL would be valid for a limited time and could be generated by your WordPress site, which would hold the AWS access credentials, immediately before it redirects the user to that URL.
https://docs.amazonaws.cn/zh_cn/aws-sdk-php/guide/latest/service/s3-presigned-url.html explains more about this from a PHP perspective, which I'm guessing is the right lens for you.
This allows a modicum of access control (users can still share the document after they've accessed it, but at least it's not just public).
If you don't need access control, you can make the s3 object public and omit the signing altogether.

How can I pull data from Google Analytics to see the top pages visited from the current page?

I would like to create a small sidebar on each page of my website that contains related/popular pages with perhaps the top five pages users visit after reading the current page.
I could track and record user movements across the site myself and build the list that way, but as my site already uses Google Analytics and I know the data is there I'd rather access that if all possible.
The trouble is that I don't have the faintest idea whether it is possible or not.
Remember that the Google Analytics Reporting API is not real-time: it can take 24 to 48 hours for the data to finish processing and become available in the API.
The Google Analytics Real Time API is real-time, but the data is still about 5 minutes old, and it is very limited in the dimensions and metrics you can request.
Quota: with either of those APIs you are limited to 10,000 requests per day per profile/view. I have no idea how many pages there are on your site or how many users are on it, but this could quickly blow through this non-extendable quota.
Options: accept that it's not real-time data and use the Reporting API. Every night, run a request against the API, get everything for two days ago, and show your users data that's two days old. Store the data in your database; then you are showing them data from your DB and won't have an issue with the quota, as you only requested it once.
But this isn't exactly what you want, as it doesn't show an individual user's activity across the site. TBH, I am not sure you can use Google Analytics to track a user exactly, as the data is not user-specific.
If you don't want to get involved with learning the API and develop this from the ground up, check out EmbeddedAnalytics (disclaimer: I created the service). We could provide such a widget.
You may find This Article useful. It provides the necessary query to find the "next page visited" using the page of interest as a filter. Ultimately your query would look like this:
https://www.googleapis.com/analytics/v3/data/ga?ids=ga%3Aabc&start-date=30daysAgo&end-date=yesterday&metrics=ga%3Apageviews&dimensions=ga%3ApreviousPagePath%2Cga%3AnextPagePath&sort=-ga%3Apageviews&filters=ga%3ApreviousPagePath%3D%40pricing
The query above will give you the "Next Page" along with pageviews assuming the "previous" page contains the word "pricing".
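For anyone assembling that URL in code, the sketch below builds the same query string with Python's urllib.parse. The function name next_page_query_url and its parameters are invented for this example; it only constructs the URL, and an actual request would also need authorization, which is not shown. It also assumes the page fragment contains no characters that are special in the filter syntax.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/analytics/v3/data/ga"


def next_page_query_url(view_id, page_fragment,
                        start="30daysAgo", end="yesterday"):
    """Build the "next page visited" query for pages whose path
    contains page_fragment (the =@ operator means "contains")."""
    params = {
        "ids": view_id,
        "start-date": start,
        "end-date": end,
        "metrics": "ga:pageviews",
        "dimensions": "ga:previousPagePath,ga:nextPagePath",
        "sort": "-ga:pageviews",  # most-viewed next pages first
        "filters": "ga:previousPagePath=@" + page_fragment,
    }
    # urlencode percent-encodes ":", ",", "=" and "@" in the values.
    return BASE_URL + "?" + urlencode(params)


print(next_page_query_url("ga:abc", "pricing"))
```

Calling it with "ga:abc" and "pricing" reproduces the URL shown above.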
We could easily build such a report widget for you:
You would insert a JavaScript snippet into your page. The JavaScript would pass the page URL to our server, and we would return the most popular pages visited next.
The pages could be "linkified" so that someone could click the link to go to that page.
We already have a caching mechanism in place, so each pageview would not require a new query to Google (making it quicker and also staying clear of the API quota mentioned above). For pages that are hardly ever looked at (e.g. less than once a week), we could make "on-demand" calls to get the statistics.
In my experience with the API, the lag is only a couple of hours. It may be longer for larger sites.
Please let me know if you are interested in such a widget and I can work with you.

Adjusting relevance of Indexing Service web search

I run a website that uses the Windows Indexing Service to create a catalog for the search page. I return the top 30 results.
I was asked by a user why a certain page was not returned. The phrase searched was "Papal Blessing Form". That is the exact title of a link that points to a PDF form. I tried having the search return all the matches and the page was not returned. I did however get most every page that had the words "form", "Blessing" & "Papal" on them. I even rebuilt the catalog thinking the page was new and not yet indexed.
How do I modify the index settings so better results are returned?
Mike
I have written a blog post about the Indexing Service which addresses your question and some other points.
Specifically to answer your question:
-Cannot adjust page ranking.
The ranking system is closed and no API or boosting mechanism exists.
-Indexing PDF documents requires the Adobe IFilter (another link in the chain).
My claim that you cannot adjust the weighting is based in part on, and supported by, this post by George Cheng: http://objectmix.com/inetserver/291307-how-exactly-does-indexing-service-determine-rank.html

Questions on building RSS feed

I am building an RSS feed for the first time, and I have some simple, direct questions whose answers I was unable to find on the web, at least in a form that was clear to me. Can you help me understand the following?
Which items should I include in RSS generation? Should I always put in all the articles, or what criteria should I use when I query my articles for the feed?
What value should I set for pubDate? The specification says "The publication date for the content in the channel. For example, the New York Times publishes on a daily basis, the publication date flips once every 24 hours. That's when the pubDate of the channel changes.". I do not quite understand how to apply this to my feed. I have new articles daily; should I set the pubDate to, let's say, 06:00 AM today and update it every day?
lastBuildDate: if I understand this right, it is the date of the latest updated item?
Which items should I include in RSS generation?
You should have one generic feed with all the new articles you post (for example: news). Additionally, if your site is split into categories, or you have some specific feeds (e.g. a calendar of events), it's good to create an additional separate RSS feed for each of them.
What value should I set for pubDate? I do not quite understand how to apply this to my feed. I have new articles daily; should I set the pubDate to, let's say, 06:00 AM today and update it every day?
Always set pubDate to the time when your news/articles went online. So if you have new articles daily, pubDate should be the date when they were released to the public. Not a random hour in the morning. Not the moment when you started writing them.
lastBuildDate: if I understand this right is the date of the latest updated item?
lastBuildDate is the most recent date when any of the items was posted or modified. Usually you should skip it, especially if your lastBuildDate would simply be the most recent pubDate. It's an optional parameter.
I use lastBuildDate only for calendar RSS feeds, to show when the calendar was updated (in calendars you not only add new entries but also often edit existing ones).
You should put every article, but the best is to provide different feeds for different categories, even search keywords. You can build it like any dynamic page, with a query string.
pubDate: that's not super important, you can put whatever. I don't think many feed readers use it.
lastBuildDate: theoretically it's the date the content changed. So the date of the latest updated item should work.
Something super important, since people are going to poll this page (meaning a lot of requests for it):
- Cache it on your server.
- Serve an ETag header and/or a Last-Modified header. That way your server can respond with just a "not modified" if the client already has it in its cache.
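The conditional-request idea can be sketched as below, independent of any particular web framework. The function conditional_response is hypothetical: it derives an ETag from a hash of the feed body and compares it with the If-None-Match value the client sent. Weak validators and the "*" wildcard are deliberately left out to keep the example short.

```python
import hashlib


def conditional_response(body, if_none_match=None):
    """Return (status, headers, payload) for a feed request.

    body is the feed as bytes; if_none_match is the raw value of the
    client's If-None-Match header, or None if it was not sent.
    """
    # A strong ETag is a quoted opaque string; a content hash works.
    etag = '"' + hashlib.sha256(body).hexdigest() + '"'
    headers = {"ETag": etag}
    if if_none_match is not None:
        # The header may list several tags separated by commas.
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if etag in candidates:
            return 304, headers, b""   # client copy is still current
    return 200, headers, body
```

A feed reader that polls every few minutes then receives a short 304 response most of the time instead of the full document.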

RSS for Future Items

This may be a simple question, but for some reason I don't know this answer. Is it possible to create an RSS feed file that contains contents for an entire year but only publishes the current date and previous date information?
I have a client that wants to do a "this day in history" post. Currently, I am using IFTTT, and created around sixty dated posts for the next two months. Of course, this works -- but it is very labor intensive.
Is it possible to create an RSS feed that you could put all 365 days of data into, but that only shows today's item and prior days when someone pulls up the feed?
Or is RSS not the proper technology to do this? The reason I am using RSS is for ease of use, and IFTTT will take those RSS feeds and pump it in to Facebook and Twitter for automatic status updates for my client.
There are various tools that let you define Facebook and Twitter posts in advance, to be published at a specified date and time in the future. Why not use one of those instead of writing your own?
A quick search for "scheduled twitter post" uncovered Later Bro, Twuffer and twAitter but there must be dozens to choose from.
If you're looking for just posting on Facebook and Twitter, and not an RSS feed as well, I'd follow Matthew's suggestion. If you want an RSS feed, there is a feed for each Twitter feed. But if you want actual RSS, you need to add something in between. An RSS feed is just an XML file; it's not a process. I suggest having a file of some type (maybe RSS, or other XML, or a database table, or even a CSV file) with all the posts and relevant information, including the date. Then a small script that runs as a cron job (or IFTTT, if it supports a date as the trigger and running a script as the "then" part) pulls the day's posts and updates the actual RSS feed. Pretty simple.
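A small script of that kind might look like the following sketch. It assumes the posts are already loaded as (date, title, text) tuples, for example from a CSV file, and the build_feed name, the channel title, and the example.com link are placeholders. Only posts dated today or earlier are written to the feed, and each pubDate is formatted as RSS expects.

```python
import datetime as dt
import xml.etree.ElementTree as ET
from email.utils import format_datetime


def build_feed(posts, today, title="This Day in History",
               link="https://example.com/"):
    """Return RSS 2.0 XML containing only posts dated today or earlier.

    posts is a list of (date, item_title, text) tuples covering the
    whole year; future-dated entries are left out of the output.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = title

    visible = [p for p in posts if p[0] <= today]
    visible.sort(key=lambda p: p[0], reverse=True)  # newest first

    for day, item_title, text in visible:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "description").text = text
        # Midnight UTC is an arbitrary choice for this sketch.
        published = dt.datetime.combine(day, dt.time(0, 0),
                                        tzinfo=dt.timezone.utc)
        ET.SubElement(item, "pubDate").text = format_datetime(
            published, usegmt=True)
    return ET.tostring(rss, encoding="unicode")
```

Run once a day from cron, with the output written to the public feed file, this keeps the full year of content in one place while exposing only what is due.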
Here is what I ended up doing:
Using the Drupal backend of my website, I created a content type specifically for these posts.
I created individual articles for each day, and used the schedule module to schedule the publish date to the date I wanted.
I created an RSS feed of these posts through Drupal.
I linked the newly created RSS feed to IFTTT.
I created an IFTTT recipe to post the text from the RSS feed to Facebook/Twitter/etc.
It wasn't the best solution, but it worked. I was really trying to do this without having to rely on a third-party such as IFTTT, but never really figured out a good way to do it.

Resources