Unpublished items showing in Drupal search results (Google Search Appliance)

I inherited a Drupal 5 site recently and have a series of enhancements to make. Several of them revolve around search results.
Unpublished pages showing up in search engine results. Some of these are old pages, others are recently unpublished. All are correctly marked as unpublished in the CMS and are still showing up.
Outdated pages are showing up in the search results. The URL path structure changed, and those items are old entries still sitting in the search database.
From what I can tell the site uses Google Search Appliance (GSA) for the search rather than the default Drupal search. Is there a way I can be certain that it's using GSA other than seeing the module enabled?
If it is GSA, it seems that I could get someone with access to the GSA to rebuild the search results for the site. Is this correct?
If rebuilding the search results is the right way to go about it, it seems I'll need to get someone to rebuild the search whenever a fair amount of content is removed from the site. Is there a better/automatic way?

Sounds like it's Drupal that is handling the search. Google would need database access to show unpublished nodes. It could be that you are using Views to do the search but forgot to restrict it to published nodes only.
If Drupal is handling the search, you just need to flush and rebuild the search index. This can be done without too much trouble if you don't have too much content.
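If a View or custom query is behind those results, the usual culprit is a missing published-status filter. In the Views UI that is roughly the node "Published" filter; in a hand-rolled Drupal 5-style query it looks like the sketch below (the query itself is illustrative, not your site's actual code):

<?php
// Hypothetical example: restrict a custom search/listing query to published nodes.
// In Drupal 5/6, {node}.status = 1 means published; unpublished nodes have status = 0.
$result = db_query(
  "SELECT n.nid, n.title
   FROM {node} n
   WHERE n.status = 1
   ORDER BY n.changed DESC"
);

while ($row = db_fetch_object($result)) {
  // Only published nodes reach this point.
  print check_plain($row->title) . "\n";
}
?>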

The GSA could still be showing deleted content depending on what your data source is.
If the content is coming from a database feed and is then excluded from the query, it will be dropped. If the content is coming from a natural crawl or through a custom connector feed, it will not be removed from the index on delete; instead it needs to cycle out of the index naturally, which can take a while.
One way to block deleted URLs from being displayed is to do it through the front end. In the GSA admin interface go to Serving > Front Ends, then choose your front end and click the Remove URL tab. You can either list your URLs individually or block a group of URLs with regular expressions.
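As an illustration only (the host and paths below are made up, and the exact pattern syntax should be checked against your appliance's documentation), the box accepts plain URL prefixes as well as regexp: patterns, e.g.:

www.example.com/old-section/retired-page.html
regexp:www\.example\.com/old-path/.*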

I have posted an answer to your more general question concerning node access. The problem with your search results might well be related to that.

In order to keep the Google Appliance more up to date, you might try out the XML Sitemap module, which publishes a proper XML sitemap for all your content.
For a public website, publishing a sitemap is a good way to keep search engines up to date, as they can use it to learn about new pages and to purge old ones. I'm assuming that the Google Appliance would use this too.
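For reference, the sitemap itself is just an XML file following the sitemaps.org protocol, with one url entry per page that should be indexed; unpublished or deleted pages are simply left out. A minimal example with placeholder URLs:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/about-us</loc>
    <lastmod>2012-01-15</lastmod>
    <changefreq>monthly</changefreq>
  </url>
  <url>
    <loc>http://www.example.com/news/latest-announcement</loc>
    <lastmod>2012-02-01</lastmod>
  </url>
</urlset>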

Related

Woocommerce Parameters Google Indexing

I'm currently using WooCommerce, and by chance, when looking through Google Search Console, I noticed well over 15,000 pages had been indexed. This instantly raised a concern because I know I should have no more than 400 actual pages.
After looking into it, I noticed that absolutely every possible parameter (variations, grid styles, shipping methods, etc.) is being indexed, turning what should be 400 pages into 15,000 variations.
Does this have an effect on Google ranking, showing 15,000 pages when really there are only 400?
I cannot find a single resource that explains whether Google indexing so many variations has a positive or negative impact on rankings.
Finally, how do I prevent Google or any other search engine from indexing URLs with parameters? I've seen advice about using robots.txt, but no recommendation of which standard WooCommerce filters should be excluded, or whether that's a bad idea.
I have a feeling I am losing a lot of link juice by having so many indexed pages with parameters.
WordPress by default expects that you wish to share everything, because why not? If you have 15,000 items that are indexable, you'd arguably be more annoyed if your site decided on a whim to hide parts of them without you knowing. It is understandable that this is confusing, though, and quite rightly it is always better to have control over which parts of your site get seen by search engines.
Anyway, there are plugins which allow you to customise which parts of a website are visible or hidden to search engines by providing a mechanism for generating an XML sitemap (an index of which pages should be crawlable). A very common plugin I see people use is called Yoast SEO. Another one I have had some success with is called Google XML Sitemaps; you may find another works better for you, but that should give you an idea of what you need.
Also, if you are not being visited often by Googlebot, take the sitemap generated by one of those plugins and submit it via Google Search Console to help Google better understand your site.
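On the robots.txt idea raised in the question, a hedged sketch is below. The parameter names (orderby, filter_, min_price, add-to-cart) are common WooCommerce query parameters, but verify them against your own URLs before blocking anything, and keep in mind that robots.txt stops crawling rather than removing URLs that are already indexed.

# Illustrative only - check these parameters actually exist on your site first.
User-agent: *
Disallow: /*?orderby=
Disallow: /*?filter_
Disallow: /*?min_price=
Disallow: /*?add-to-cart=
# Add matching '&' variants (e.g. /*&orderby=) if a parameter can appear after another one.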

How to refresh facebook scrape cache for a whole site

I need to make Facebook re-scrape its cache for every page on my website (3,000+ pages).
The only way I know how to do that is to use the Open Graph Debugger.
I cannot run that by hand for 3000...
I read on the Facebook developer support page that this (Stack Overflow) is the place to ask questions, but there is little to no knowledge here about refreshing Facebook's URL cache.
Can you please suggest any working solution to re-scrape a page?
My website: Mentallica
One possible answer, given the number of URLs you've got, is to use the batch invalidator. You could get hold of an access list of your URLs, or maybe do a recursive directory listing and replace the folders with URLs (if it's a flat site), or the like. At least you don't have to do them one at a time. Once you have a list, paste it into the invalidator (multiple lines).
The batch invalidator is here:
https://developers.facebook.com/tools/debug/sharing/batch/
It IS frustrating. I've searched several places, and don't really see a solution. We have a website with all of the proper tags, yet FB refuses to refresh any past posts with the new website data.
Three years later, but this can help someone: paste your URL here and click "Fetch new scrape information": https://developers.facebook.com/tools/debug/og/object/
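If you need to refresh thousands of URLs without pasting them by hand, the Graph API also accepts a forced re-scrape per URL (POST the URL with scrape=true). A rough PHP sketch, assuming you have an app access token and a urls.txt file with one page URL per line:

<?php
// Hypothetical batch re-scrape: force Facebook to re-fetch each URL's Open Graph data.
// The token and urls.txt file are assumptions - substitute your own app token and URL list.
$accessToken = 'APP_ID|APP_SECRET';
$urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach ($urls as $url) {
  $ch = curl_init('https://graph.facebook.com/');
  curl_setopt_array($ch, array(
    CURLOPT_POST           => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POSTFIELDS     => http_build_query(array(
      'id'           => $url,
      'scrape'       => 'true',
      'access_token' => $accessToken,
    )),
  ));
  echo $url . ' => ' . curl_exec($ch) . "\n";
  curl_close($ch);
  sleep(1); // be gentle with rate limits when looping over 3,000+ pages
}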

Adding Google's standard search (not custom) to my website

My intention is to embed Google results in my website. I don't want to customise the domain/s on which the search is performed or anything, just a 'bog standard' Google search based on search parameters I pass it.
2 questions:
How do I display Google results on my website as a response to search criteria entered into a textbox I have?
Is there any legislation I need to take into account?
I know my second question sounds rather strange, but I'm aware that what I appear to be doing here is presenting content driven by Google as though it's my own, so I want to avoid breaching any copyright or 'same-origin policy' type restrictions.
What I've Tried/Ways I Know I Could Achieve This
Screen scraping Google's response to a simple web request with the necessary query parameters (but seems a bit excessive)
Google's custom search (but I don't want to customise anything)
I've tagged this question for some more context.
As is mentioned here,
you can use your own XML parser to customize the display for your search users.
with an HTTP request like this:
GET /search?q=bill+material&output=xml&client=test&site=operations
But it has a limit on the number of requests per day (500 or 1,000, I guess).
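A rough sketch of consuming that XML feed from a PHP backend. The host name and the client/site parameters simply mirror the request quoted above, and the element names (RES, R, U, T) are assumptions about the appliance's result format, so check them against a real response:

<?php
// Hypothetical example: fetch the appliance's XML results and render them ourselves.
$query = urlencode('bill material');
$xml = simplexml_load_file(
  "http://search.example.com/search?q={$query}&output=xml&client=test&site=operations"
);

if ($xml !== false && isset($xml->RES->R)) {
  foreach ($xml->RES->R as $result) {
    $url   = (string) $result->U;  // result URL (assumed element name)
    $title = (string) $result->T;  // result title (assumed element name)
    echo '<p><a href="' . htmlspecialchars($url) . '">' . htmlspecialchars($title) . '</a></p>';
  }
}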
Custom Search can be configured to include the entire Web in its results:
From the Google Custom Search homepage, click New search engine.
In the Sites to Search box, enter at least one valid URL (e.g. www.google.com).
Click Create.
On the next page, under Optional next steps, click Edit.
On the Basics tab, under Search Preferences, select Search the entire web but emphasize included sites.
Click Save Changes.
In the left-hand menu, under Control Panel, click Sites.
Delete the site you entered during the initial setup process.

Drupal Site Index - not crawling through "Blocks"?

I created a "View"* in Drupal to grab all the content and essentially make a site map, but I realized that it doesn't have an option to grab content from the Blocks I have created. Does anyone have an idea if I can even do that?
If not, should I essentially make each block a page so that it can crawl through the pages? I worry that this will end up becoming unmanageable in the end... What are some other options/workarounds? My end goal is to make a site map - maybe I am making this too complicated?
*To make my view I did:
Administration->Structure->Views->Add. Then I made it a page, called it "site-index", and made it "show Content of type All" (with tagged field empty). Then I chose "Content: Title" for my Fields and my Filter Criteria is set as: "Content: Published (Yes):" - That way, it will grab the titles of my web pages.
Thanks, and please reply if further clarification is needed!
Apologies if I'm wrong but I think there might be a bit of confusion over terminology here. In the context of a view Content means nodes, not all HTML content on the site. Your view will return a list of all published nodes, which are essentially the pages on your site.
On a normal sitemap (if there is such a thing) you would only link to full pages, not to parts of pages like a block; sitemaps are essentially used to provide a hierarchical overview of your site to aid navigation for users and, probably more importantly these days, search engines (you can submit an XML sitemap to the major search engines instead of this, but that's really a topic for another question).
Rather than doing this yourself, I'd actually recommend you download and install the Sitemap module, which will do all of the work for you, as well as arranging the content in its respective hierarchy.

Flex 3: Project Architecture & SEO

I've got a Flex 3 project. One of the problems I have is that not very much of its content is indexed by Google. Currently, I pull data from a MySQL database, so the Googlebot doesn't see most of the site.
My goal is to increase the amount of content indexed by Google, improve the SEO, and improve SERPs.
I thought that instead of pulling the data from the database, I would change the project's architecture and create separate "pages". So, in my case, I would compile each puzzle separately and upload it to the server in its own directory. This way the info in each puzzle would get indexed.
The negative is that if I add a puzzle, I'd have to add a link to it in all of the puzzles that are already on the server. I would have to add the link, re-compile each puzzle and upload it to the server. Is there a way to get around this problem? Also, if I wanted to communicate some data from one puzzle to another in the future, I wouldn't be able to do so.
Any suggestions?
Thank you.
-Laxmidi
The usual way to achieve this goal is to develop a hidden parallel site in HTML.
On the first page you will have your Flash and, hidden by JavaScript, a list of links to the other pages. These links will be parsed by the robots. Ideally, the href pages are virtual (look up "URL rewriting"). On each "fake" page, your server-side language will print content or links from your database on the page, AND the Flash. The Flash will be provided with a string explaining where it is and what it's supposed to show.
Ex: http://www.mysite.com/category1/content7. The URL rewriting sends this request to http://www.mysite.com/index.php?uri=category1/content7. The page should display the Flash with the FlashVar "uri=category1/content7". The Flash then knows which content it has to display, so when a user comes from Google by following this link, he will find the content he was looking for.
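A minimal sketch of that index.php, assuming an Apache rewrite rule along the lines of RewriteRule ^(.+)$ index.php?uri=$1 [L,QSA], and with hypothetical table/column names for the database part:

<?php
// index.php - hypothetical fallback page for the hidden parallel HTML site.
// The assumed rewrite rule maps /category1/content7 to index.php?uri=category1/content7.
$uri = isset($_GET['uri']) ? $_GET['uri'] : '';

// Print crawlable HTML for the robots: title and body pulled from the database.
$pdo  = new PDO('mysql:host=localhost;dbname=example', 'user', 'password');
$stmt = $pdo->prepare('SELECT title, body FROM puzzles WHERE path = ?');
$stmt->execute(array($uri));
if ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
  echo '<h1>' . htmlspecialchars($row['title']) . '</h1>';
  echo '<div>' . $row['body'] . '</div>';
}

// Embed the Flash and tell it where it is via FlashVars.
$flashvars = 'uri=' . rawurlencode($uri);
echo '<object type="application/x-shockwave-flash" data="main.swf" width="800" height="600">';
echo '<param name="flashvars" value="' . htmlspecialchars($flashvars) . '" />';
echo '</object>';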
All linking and content for SEO should be in HTML; don't trust the robots' ability to read Flash.
Have a look at Adobe's reference on deep linking.
You can generate the website's sitemap.xml with a (daily) cron process, such that the URLs encode the state of the application you need. Each URL encodes whatever content you need to retrieve from the db, with just one index.html page.
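A hedged sketch of such a cron job in PHP (table, column and domain names are placeholders):

<?php
// sitemap_cron.php - hypothetical daily job that regenerates sitemap.xml from the database,
// so every piece of content gets a crawlable URL even though there is only one index page.
$pdo = new PDO('mysql:host=localhost;dbname=example', 'user', 'password');

$xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";

foreach ($pdo->query('SELECT path, updated_at FROM puzzles') as $row) {
  $xml .= "  <url>\n";
  $xml .= '    <loc>http://www.example.com/' . htmlspecialchars($row['path']) . "</loc>\n";
  $xml .= '    <lastmod>' . date('Y-m-d', strtotime($row['updated_at'])) . "</lastmod>\n";
  $xml .= "  </url>\n";
}

$xml .= "</urlset>\n";
file_put_contents('/var/www/html/sitemap.xml', $xml);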
good luck!
