Webpage snapshots on Internet Archive (Wayback Machine) link to future snapshots - web-scraping

How does a page (for example, this one) link to a snapshot in the future (for example, clicking "Journal Policies" in the left panel leads to this snapshot)?

According to USING THE WAYBACK MACHINE:
Not every date for every site archived is 100% complete. When you are
surfing an incomplete archived site the Wayback Machine will grab the
closest available date to the one you are in for the links that are
missing.
Therefore, the fact that the first snapshot (taken 2013/10/07) links to the second snapshot (taken 2014/04/19) does not mean the linked page looked like the 2014/04/19 version back on 2013/10/07; the Wayback Machine has simply substituted the closest capture it has for the missing link.
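This substitution can be checked programmatically. As a minimal sketch, the Wayback Machine's public availability API returns the capture closest to a requested date (the target URL and timestamp below are placeholders for illustration):

import requests
# Ask the Wayback Machine which capture is closest to the requested date.
resp = requests.get(
    "https://archive.org/wayback/available",
    params={"url": "example.com/journal-policies", "timestamp": "20131007"},
    timeout=30,
)
resp.raise_for_status()
closest = resp.json().get("archived_snapshots", {}).get("closest")
if closest:
    # The capture returned may be dated well after the requested timestamp
    # if no earlier capture exists, which is exactly the behavior described above.
    print(closest["timestamp"], closest["url"])
else:
    print("No capture available for that URL.")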

Related

Linking to product page gives "Not Available for bots to index" error on screen instead of the old style App Details

https://apps.microsoft.com/store/detail/name/ID
Linking to my app product page gives "Not Available for bots to index"
What is the proper syntax to place a href link to the product so that users online can view it as html?
I don't have a final solution, just a workaround.
We've been seeing the same issue with our own app for the past few days, and we've started seeing reports of people encountering it with all kinds of apps, including Microsoft's own AppInstaller. It seems to be a server-side rate-limiting/caching issue at Microsoft: changing anything in the URL fixes the problem temporarily, only for it to come back a few hours later. We also found that using a VPN to some locations (mostly the US) helped, and the throttling seems to be specific to an app and an availability region (when our app got blocked, it was blocked across all of Europe but not in the US, for example). It seems to come and go, as we have not seen a big drop in new install counts.
As a temporary solution we ended up adding a random string to the end of the URL. We used GCLID as we found it to be the least offensive way -- it could just as well be a legitimate tracking ID we pass over. So now the URL we link to looks like this:
https://apps.microsoft.com/store/detail/[APPNAME]/[APPID]?gclid=f-3579a28842a7bcac7ba630d698829e9b
Where the "f-xxxx" value is generated using the md5() of the timestamp -- but it could be any ever-changing value, even a random number.
We've reached out to our MS contact about this but we haven't heard back yet.
I encountered the same issue half an hour ago, but it seems to have recovered now.
If you own an app, you can confirm Product identity in Partner Center:
You can share the direct link and Store ID to help customers find your app in the Store:
URL: https://www.microsoft.com/store/apps/*12CharsAppId*
Store ID: 12CharsAppId
Store protocol link: ms-windows-store://pdp/?productid=12CharsAppId
The addresses above are the recommended ones, but the apps.microsoft.com URL should also work.
I work on apps.microsoft.com. I can confirm that there was an issue on our end that appeared on August 23. This issue has been resolved.
And the proper link, of course, is this:
https://apps.microsoft.com/store/detail/[your product id here]

Classic ASP code using more resources

I am using Classic ASP for a web application, which I run in Internet Explorer.
I have developed a few reports related to sales data. All the sales reports are linked to a Sales Dashboard. Every report has some selection criteria, such as customer, date period, product group, and a few others.
Now, the problem I am facing:
I open a total sales report for the entire year, which takes almost 15 minutes to load. While that report is executing, if I try to open any other report from the Sales Dashboard, the page with the selection criteria only appears after the first report has finished executing. If I copy the link for the second report and open it in a new window of Internet Explorer, it opens normally.
I am not able to trace the problem. Has anyone else faced the same problem?
First, I agree with this comment posted under the question:
IIS/ASP only allows one concurrent request per session. This is why the second request does not happen until after the first one completes. If you open a new browser instance or a different browser then this is treated as a different session.
Second, if all that is being asked here is whether other people have similar issues or not, then the answer is yes, due to what johna said in the comment.
If you're looking for a way to get around that for yourself, the way described in the comment (open a new browser instance or a different browser) will work.
However, if you're after a way to bypass the 15-minute wait entirely, give some thought to preparing the data before the report is called. What I mean by that is to either schedule the report to run after close of business each day and store the resulting HTML or data separately, and/or to provide a button that prepares the report from current data, which the user can run whenever they want.
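As a rough illustration of the first option, here is a sketch of a nightly job that runs the heavy query once and writes the result out as a static HTML file the report page can serve instantly. The original application is Classic ASP; this sketch only shows the pattern in Python, and the connection string, query, and output path are all hypothetical:

import pyodbc
# Run the slow sales query once, off-hours, and save the rendered table
# so the report page only has to serve a pre-built file.
conn = pyodbc.connect("DSN=SalesDB;UID=report;PWD=secret")
rows = conn.execute(
    "SELECT product_group, SUM(amount) AS total "
    "FROM sales WHERE sale_year = 2024 GROUP BY product_group"
).fetchall()
html = ["<table><tr><th>Product group</th><th>Total</th></tr>"]
for product_group, total in rows:
    html.append(f"<tr><td>{product_group}</td><td>{total}</td></tr>")
html.append("</table>")
# The ASP report page can include or link to this pre-built file.
with open(r"C:\inetpub\wwwroot\reports\total_sales_2024.html", "w") as f:
    f.write("\n".join(html))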

Schedule Later - Check version/status of page published - AEM/CQ

AEM provides OOTB functionality to Activate Later. The following is the scenario where the confusion happens:
1. The user schedules a page for Activate Later (say, 5 minutes after the current time). This internally creates a page version and waits for the selected time.
2. The user modifies some content on the page. In the /siteadmin console, the modified timestamp is updated and the modified icon turns "Blue".
3. The Schedule Later workflow now publishes the version created in step 1 to the publish instance (the changes made in step 2 are not published, which is fine and is the expected behavior). The Replicate API creates another version while replicating in case of a content change.
4. After the page is published, the /siteadmin console shows a "Green" icon under the Published column, but the "Blue" icon under the Modified column is removed.
This creates a bad user experience (I am not sure whether this is a bug in AEM; the status under the Modified column should have stayed "Blue", which would tell the author that the page is currently in a modified state and the published version is older). My question is: is there a way to verify which version of the page from the author instance is currently present on the publish instance (so that we can at least be sure the modified version is not yet published), or to control the Modified column in the /siteadmin console?
AEM has a so-called Timewarp feature, which allows you to see the publishing activities for a page.
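Besides Timewarp, one lightweight check is to compare the page's last-modified and last-replicated timestamps on the author instance. A minimal sketch in Python, assuming a reachable author instance, placeholder credentials and page path, and the standard cq:lastModified / cq:lastReplicated properties on the page's jcr:content node:

import requests
AUTHOR = "http://localhost:4502"        # placeholder author instance
PAGE = "/content/mysite/en/home"        # hypothetical page path
# Fetch the page's jcr:content node as JSON via the Sling GET servlet.
resp = requests.get(f"{AUTHOR}{PAGE}/jcr:content.json", auth=("admin", "admin"))
resp.raise_for_status()
props = resp.json()
print("cq:lastModified:  ", props.get("cq:lastModified"))
print("cq:lastReplicated:", props.get("cq:lastReplicated"))
# If cq:lastModified is later than cq:lastReplicated, the copy on the
# publish instance is older than the current author content.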

Adjusting relevance of Indexing Service web search

I run a website that uses the Windows Indexing Service to create a catalog for the search page. I return the top 30 results.
I was asked by a user why a certain page was not returned. The phrase searched was "Papal Blessing Form", which is the exact title of a link that points to a PDF form. I tried having the search return all matches and the page was still not returned. I did, however, get nearly every page containing the words "form", "Blessing" and "Papal". I even rebuilt the catalog, thinking the page was new and not yet indexed.
How do I modify the index settings so better results are returned?
Mike
I have written a blog post about the Indexing Service which addresses your question and some other points.
Specifically to answer your question:
- Cannot adjust page ranking. The ranking system is closed and no API or boosting mechanism exists.
- Indexing PDF documents requires the Adobe IFilter (another link in the chain).
My claim that you cannot adjust weight is based in part on, and supported by, this post by George Cheng: http://objectmix.com/inetserver/291307-how-exactly-does-indexing-service-determine-rank.html

Best approach for fetching news from websites?

I have a function that scrapes all the latest news from a website (approximately 10 items; the exact number depends on the website). Note that the news items are in chronological order.
For example, yesterday I got 10 items and stored them in the database. Today I get 10 items, but 3 of them were not there yesterday (7 items stayed the same, 3 are new).
My current approach is to extract each item until I find an old one (the first of the 7 existing items), then stop extracting, update the "lastUpdateDate" field of the old items, and add the new items to the database. I think this approach is somewhat complicated and it takes time.
I'm actually getting news from 20 websites with the same content structure (Moodle), so each run takes about 2 minutes, which my free host doesn't support.
Is it better to delete all the news and re-extract everything from scratch (this would inflate the auto-increment IDs in the database by a huge amount)?
First, check to see if the website has a published API. If it has one, use it.
Second, check the website's terms of service, which may specifically and explicitly disallow scraping the website.
Third, look at a module in your programming language of choice that handles both the fetching of the pages and the extraction of the content from the pages. In Perl, you would start with WWW::Mechanize or Web::Scraper.
Whatever you do, don't fall into the trap that so many who post to Stack Overflow fall into: fetching the web page and then trying to parse the content themselves, most often with regular expressions, which are an inadequate tool for the job. Browse the SO html-parsing tag for tales of sorrow from those who have tried to roll their own HTML parsing instead of using existing tools.
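For reference, a minimal sketch of that approach in Python (the answer names Perl modules; requests plus BeautifulSoup are rough Python equivalents, and the URL and CSS selector here are hypothetical):

import requests
from bs4 import BeautifulSoup
# Fetch the page and parse it with a real HTML parser instead of regexes.
resp = requests.get("https://example.com/news", timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
# Collect (title, link) pairs for the latest news items.
items = [(a.get_text(strip=True), a["href"]) for a in soup.select("article h2 a")]
for title, link in items:
    print(title, "->", link)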
It depends on your requirements, i.e. whether you want to show old news to users or not.
For scraping, you can create a custom local script, run as a cron job, which grabs the data from those news websites and stores it in the database.
You can also check by subject whether an item already exists or not.
Finally, make a custom news block that shows the feed from the database.
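A minimal sketch of that duplicate check, assuming a SQLite table keyed on the item URL (a subject column would work the same way; table and column names are hypothetical):

import sqlite3
conn = sqlite3.connect("news.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS news (
        url        TEXT PRIMARY KEY,   -- unique key prevents duplicates
        title      TEXT,
        fetched_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
def store(url, title):
    # INSERT OR IGNORE skips rows whose url already exists, so re-running
    # the scraper never creates duplicates or burns through new IDs.
    conn.execute("INSERT OR IGNORE INTO news (url, title) VALUES (?, ?)", (url, title))
    conn.commit()
store("https://example.com/news/1", "Example headline")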
