I run a website that uses the Windows Indexing Service to create a catalog for the search page. I return the top 30 results.
I was asked by a user why a certain page was not returned. The phrase searched for was "Papal Blessing Form", which is the exact title of a link that points to a PDF form. I tried having the search return all matches, and the page still was not returned. I did, however, get almost every page that had the words "form", "Blessing" and "Papal" on it. I even rebuilt the catalog, thinking the page was new and not yet indexed.
How do I modify the index settings so better results are returned?
Mike
I have written a blog post about the Indexing Service which addresses your question and some other points.
Specifically, to answer your question:
- You cannot adjust page ranking. The ranking system is closed, and no API or boosting mechanism exists.
- Indexing PDF documents requires the Adobe IFilter (another link in the chain).
My claim that you cannot adjust the weighting is based in part on, and supported by, this post by George Cheng: http://objectmix.com/inetserver/291307-how-exactly-does-indexing-service-determine-rank.html
When I search for a keyword on google.com, I see this URL in the browser:
https://www.google.com/search?q=harry+potter&sxsrf=AOaemvJzqEslTi5rksHz8Da7pgdZ1J3uMw%3A1634810260185&source=hp&ei=lDlxYYaCCNaL9u8Popq2-AQ&iflsig=ALs-wAMAAAAAYXFHpA2d9PU58mYXikU2pl90IN7Z8wXq&ved=0ahUKEwiGnNLmntvzAhXWhf0HHSKNDU8Q4dUDCAg&uact=5&oq=harry+potter&gs_lcp=Cgdnd3Mtd2l6EAMyCAguEIAEEJMCMgUILhCABDIFCAAQgAQyBQguEIAEMgUIABCABDIFCAAQgAQyBQguEIAEMgUIABCABDIFCC4QgAQyBQgAEIAEOgcIIxDqAhAnOgQIIxAnOgUIABCRAjoLCC4QgAQQxwEQowI6CwguEIAEEMcBEK8BOgsILhCABBDHARDRA1D3GliFJmDtJmgAcAB4AIABowGIAeQKkgEDNi43mAEAoAEBsAEK&sclient=gws-wiz
I understand that virtually all websites work via the Hypertext Transfer Protocol, and that some of the most common HTTP methods are GET and POST.
I assume the above is a POST request, since it has a request payload (my search query) and a response payload (the web page returned).
The parameter "q" is clearly my search keyword.
What do
sxsrf=AOaemvJzqEslTi5rksHz8Da7pgdZ1J3uMw%3A1634810260185
source=hp
ei=lDlxYYaCCNaL9u8Popq2-AQ
iflsig=ALs-wAMAAAAAYXFHpA2d9PU58mYXikU2pl90IN7Z8wXq
ved=0ahUKEwiGnNLmntvzAhXWhf0HHSKNDU8Q4dUDCAg
uact=5
oq=harry+potter
gs_lcp=Cgdnd3Mtd2l6EAMyCAguEIAEEJMCMgUILhCABDIFCAAQgAQyBQguEIAEMgUIABCABDIFCAAQgAQyBQguEIAEMgUIABCABDIFCC4QgAQyBQgAEIAEOgcIIxDqAhAnOgQIIxAnOgUIABCRAjoLCC4QgAQQxwEQowI6CwguEIAEEMcBEK8BOgsILhCABBDHARDRA1D3GliFJmDtJmgAcAB4AIABowGIAeQKkgEDNi43mAEAoAEBsAEK
sclient=gws-wiz
represent, and how does one know?
There are two ways to find that out: a semi-automated way using the Unfurl tool, and a manual way of reading through a list of explanations.
Semi-automated way: using Unfurl
There is a project called Unfurl, a free URL parser and browser for checking and decoding Google Search URLs: https://dfir.blog/introducing-unfurl/
An online hosted version of Unfurl is available here: https://dfir.blog/unfurl/
It is a visual 2D browser of URL parameters: use the mouse wheel to zoom in and out, drag the nodes with the mouse to unclutter the graph, and read the explanations for the query parameters (not only Google's).
Below is some information I collected by searching the web now, in September 2022.
Be aware that explanations of Google's query parameters can become outdated within a few years, so the only thing you can do is search the web again for newer explanations when you need them.
Manual way: reading a list of explanations
Explanations of Google query parameters from [2021]:
q= the query sent to the search engine
oq= the 'original query': the text last typed by the user into the search box before selecting a suggested search term; it coincides with q= if the whole query was typed manually
ei= search session start date/time; represents the time the user's session started, in "Google time" (so it does not depend on the local system time)
ved= page load date/time
sxsrf= previous page load date/time
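As a concrete check of the timestamp claims above: the sxsrf value in the example URL ends in %3A1634810260185, where %3A is a URL-encoded ':' and the digits after it look like a Unix timestamp in milliseconds. A tiny Python sketch (my own illustration, following the interpretation in the [2021] source) decodes it:

from datetime import datetime, timezone

millis = 1634810260185  # trailing digits of the sxsrf parameter in the URL above
print(datetime.fromtimestamp(millis / 1000, tz=timezone.utc))
# prints a date in October 2021, i.e. roughly when that example search was made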
Explanations from [2016]:
Here is a list of the URL parameters that we would commonly see:
q= the query string (keyword) that the user searched
oq= tracks the characters that were last typed into the search box before the user selected a suggested search term
hl= controls the interface language
redir_esc= unknown
sa= user search behavior
rct= unknown; seems to be related to Google AdWords
gbv= controls the presence of JavaScript on the page
gs_l= unknown; seems to be related to what type of search is being done (i.e., mobile, serp, img, youtube, etc.)
esrc= set to ‘s’ for secure search
frm= unknown
source= where the search originated (i.e., google.com, toolbar, etc.)
v= unknown
qsubts= unknown
action= unknown
ct= click location
oi= unknown
cd= ranking position of the search result that was clicked
cad= unknown; appears to be a referrer, affiliate or client token
sqi= unknown
ved= contains information about the search result link that was clicked (see https://moz.com/blog/inside-googles-ved-parameter)
url= the URL that Google will redirect the user to after a search result link is clicked
ei= passes an alphanumeric parameter that decodes the originating SERP where user clicked on a related search
usg= unknown; possibly handling the encrypted search string
bvm= unknown; possibly a location tracker
ie= input encoding (default: utf-8)
oe= output encoding
sig2= unknown
Sources:
[2021]: Analyzing Timestamps in Google Search URLs - Magnet Forensics
[2016]: The Approaching Darkness: The Google Referral URL In 2016
I hope others will update this list and add newer explanations here later.
I will also leave here a few articles with outdated explanations:
A 2008 article, Moz's The Ultimate Guide to the Google Search Parameters; it is very similar to (or the same as) the blog post Google Search URL Parameters [Ultimate Guide] by SEOquake mentioned in the neighbouring answer, which also appears to date from around 2008.
A 2014 article, How to Use the Information Inside Google's Ved Parameter, from Moz.
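Whichever list you consult, the first step of the manual route is simply splitting the query string into its name/value pairs, which the standard library of most languages can do. Here is a short sketch in Python (the language choice and the shortened URL are mine; urllib.parse is part of the standard library):

import urllib.parse

# shortened copy of the URL from the question, for readability
url = ("https://www.google.com/search?q=harry+potter"
       "&sxsrf=AOaemvJzqEslTi5rksHz8Da7pgdZ1J3uMw%3A1634810260185"
       "&source=hp&oq=harry+potter&sclient=gws-wiz")

query = urllib.parse.urlparse(url).query   # everything after the "?"
params = urllib.parse.parse_qs(query)      # dict of name -> list of decoded values

for name, values in params.items():
    print(name, "=", values[0])
# e.g. q = harry potter, sxsrf = AOaemvJzqEslTi5rksHz8Da7pgdZ1J3uMw:1634810260185, ...

Each decoded name can then be looked up in the lists above.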
First of all, the request you posted uses GET as its request method.
You can easily check that in the Network tab of the developer tools in any browser.
Second, the difference between GET and POST (there are many other methods, but that's another topic) isn't the one you describe. The main difference between these two methods is whether or not the request carries a body (you can technically send a body with a GET request, but that is strongly discouraged).
The purpose of the request method is to indicate to the destination server how it should treat the request.
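As a small illustrative sketch (using Python's requests library and the public httpbin.org test service, neither of which comes from your question): a GET puts the parameters in the URL's query string, while a POST carries them in the request body.

import requests

# GET: parameters travel in the URL's query string (like Google's ?q=harry+potter)
get_resp = requests.get("https://httpbin.org/get", params={"q": "harry potter"})
print(get_resp.url)            # the parameters are visible in the final URL

# POST: parameters travel in the request body, not in the URL
post_resp = requests.post("https://httpbin.org/post", data={"q": "harry potter"})
print(post_resp.request.url)   # no query string here
print(post_resp.request.body)  # the body carries the form-encoded parameters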
Now, focusing on your question: you could have discovered the meaning of all of these parameters with a simple Google search, but in any case, here is a blog post where all the parameters of the Google Search URL are explained:
Google Search URL Parameters [Ultimate Guide]
Say I have two pages on a site called “Page 1” and “Page 10”. I'd like to be able to see the paths visitors take to get from “Page 1” to “Page 10” with full URLs intact. Many of the URLs (including those for “Page 1” and “Page 10”) will include query strings that are important.
Is this possible? If so, how?
Try using Behavior Flow reports. The report basically shows you how visitors click through your website. There are a lot of ways to customize the report, and you will need to play around with them to really answer your question. By default, the behavior flow focuses on the entry and exit points of visitors, regardless of how many times they hit the different subpages in between. However, I'm sure you can set appropriate filters and settings to answer your question.
I use two methods for tracking where people have been on my website:
Track and store the information in my own SQL database. (details below)
Lead Forensics (paid subscription, but you can do a trial).
For tracking and storing my own data, I record unique visitors based upon the IP address they are connecting from, and I have a separate table that records all page views and links back to the unique-visitor table (a rough sketch of such a schema is shown below).
Lead Forensics data simply allows me to link up those unique visitors with actual companies that have viewed my website.
Doing it yourself means you don't have to rely on Google for your records to work. In my experience, Google Analytics tends to round numbers, so you don't get a true indication of the figures, and by tracking the user-agent string you can also remove bots and website crawlers from your data.
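For anyone wanting to try the do-it-yourself route, here is a minimal sketch of that two-table layout using Python's built-in sqlite3 module; the table and column names are my own assumptions for illustration, not the actual schema described above:

import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("visitors.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS visitors (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    ip_address TEXT UNIQUE,   -- unique visitor keyed by IP, as described above
    user_agent TEXT           -- kept so bots/crawlers can be filtered out later
);
CREATE TABLE IF NOT EXISTS page_views (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    visitor_id INTEGER REFERENCES visitors(id),
    url        TEXT,
    viewed_at  TEXT
);
""")

def record_page_view(ip, user_agent, url):
    """Look up (or create) the visitor row for this IP, then log the page view."""
    conn.execute(
        "INSERT OR IGNORE INTO visitors (ip_address, user_agent) VALUES (?, ?)",
        (ip, user_agent),
    )
    visitor_id = conn.execute(
        "SELECT id FROM visitors WHERE ip_address = ?", (ip,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO page_views (visitor_id, url, viewed_at) VALUES (?, ?, ?)",
        (visitor_id, url, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()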
As a somewhat ugly hack, you could use transaction tracking. If you use the same transaction ID multiple times, subsequent products are added to the existing data. So assign an ID at the start of the visit and, on each page, record a transaction with the current page URL as the product name (and the ID as the transaction ID). This gives you the complete path per user. (I am frankly not sure how useful this is - at some point you probably want aggregated data. Also, each transaction and product counts towards your quota for interaction counts, so on a large site you might run over the 10 million hits limit.)
You can do it programmatically:
Have a map in the backend which stores the user ID (assuming you give each user a unique ID at login) against a list of strings, each string being a URL visited by that user (a rough sketch of this is shown at the end of this answer).
Whenever the user hits another URL from Page 1 (and only from Page 1; check this using JS), send a POST request to the backend with the new URL in its data section.
In the backend, check whether the URL is Page 10; if it is not, add the URL as a string to the map entry for that user.
Finally, when the user clicks the Page 10 URL, you know the URLs on the way from Page 1 to Page 10 and can use them.
Also, if JS is an option and I have not misunderstood your question, you can get the previous URL from the request header information using document.referrer.
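A minimal sketch of the backend map described above, written in Python with Flask; the /track endpoint, the userId and url JSON fields, and the Page 10 URL are all hypothetical names chosen for illustration:

from collections import defaultdict
from flask import Flask, request, jsonify

app = Flask(__name__)

# user id -> list of URLs visited since leaving Page 1
paths = defaultdict(list)

PAGE_10_URL = "/page-10"  # assumed target page

@app.route("/track", methods=["POST"])
def track():
    data = request.get_json()
    user_id = data["userId"]
    url = data["url"]
    if url == PAGE_10_URL:
        # The user reached Page 10: the stored list is the path taken.
        path = paths.pop(user_id, [])
        return jsonify({"path": path + [url]})
    paths[user_id].append(url)
    return jsonify({"status": "recorded"})

if __name__ == "__main__":
    app.run()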
Are you trying to do it from Google Tag Manager? I am not sure whether you are trying to trace the URLs on the client side or on the server side.
This is my first time posting here. I do not have much experience (less than a week) with HTML parsing/web scraping, and I am having difficulty parsing this web page:
https://www.jobsbank.gov.sg/
What I want to do is parse the content of all the available job listings on the site.
My approach:
Click search with an empty search bar, which should return all listed records. The resulting web page is: https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do
Provide the search-result web address to R and identify all the job listing links.
Supply the job listing links to R and ask R to go to each listing and extract the content.
Look for the next page and repeat steps 2 and 3.
However, the problem is that the resulting web page I get from step 1 does not take me to the search result page; instead, it directs me back to the home page.
Is there any way to overcome this problem?
Supposing I manage to get the web address of the search results, I intend to use the following code:
library(RCurl)  # provides getURLContent()
base_url <- "https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do"
base_html <- getURLContent(base_url, cainfo = "cacert.pem")[[1]]
# crude link extraction: split the page text on every "a href="
links <- strsplit(base_html, "a href=")[[1]]
Learn to use the web developer tools in your web browser (hint: Use Chrome or Firefox).
Learn about HTTP GET and HTTP POST requests.
Notice the search box sends a POST request.
See what the Form Data parameters are (they seem to be {actionForm.checkValidRequest}: YES and {actionForm.keyWord}: my search string).
Construct a POST request using one of the R http packages with that form data in.
Hope the server doesn't care about the cookies; if it does, get the cookies and feed it the cookies.
Hence you end up using postForm from the RCurl package:
p = postForm(url, .params = list(actionForm.checkValidRequest = "YES", actionForm.keyWord = "finance"))
And then just extract the table from p. Getting the next page involves constructing another form request with a bunch of different form parameters.
Basically, a web request is more than just a URL: there is a whole conversation going on between the browser and the server, involving form parameters and cookies, and sometimes there are AJAX requests happening inside the web page that update parts of it.
There are a lot of "I can't scrape this site" questions on SO, and although we could spoonfeed you the precise answer to this exact problem, I do feel the world would be better served if we just told you to go learn about the HTTP protocol, and forms, and cookies, and then you'll understand how to use the tools better.
Note: I've never seen a job site or a financial site that is happy for you to scrape its content. Although I can't see a warning about it on this site, that doesn't mean it isn't there, and I would be careful about breaking the Terms and Conditions of Use; otherwise you might find all your requests failing.
I have a function which scrapes all the latest news from a website (approximately 10 news items; the exact number is up to that website). Note that the news items are in chronological order.
For example, yesterday I got 10 news items and stored them in the database. Today I get 10 news items, but 3 of them were not there yesterday (7 items stayed the same, 3 are new).
My current approach is to extract each news item until I find an old one (the first of the 7 unchanged items), then stop extracting, only update the "lastUpdateDate" field of the old items, and add the new items to the database. I think this approach is somewhat complicated and it takes time.
I'm actually getting news from 20 websites with the same content structure (Moodle), so each run takes about 2 minutes, which my free host doesn't support.
Is it better if I delete all the news and then extract everything from scratch (this inflates the ID numbers in the database by a huge amount)?
First, check to see if the website has a published API. If it has one, use it.
Second, check the website's terms of service, which may specifically and explicitly disallow scraping the website.
Third, look at a module in your programming language of choice that handles both the fetching of the pages and the extraction of the content from the pages. In Perl, you would start with WWW::Mechanize or Web::Scraper.
Whatever you do, don't fall into the trap that so many who post to Stack Overflow fall into: fetching the web page and then trying to parse the content themselves, most often with regular expressions, which are an inadequate tool for the job. Surf the SO tag html-parsing for tales of sorrow from those who have tried to roll their own HTML parsing systems instead of using existing tools.
It depends on your requirements - whether you want to show old news to the users or not.
For scraping, you can create a custom local script, run as a cron job, which will grab the data from those news websites and store it in the database.
You can also check, via the subject, whether an item already exists or not (a rough sketch of this check is shown below).
Finally, make a custom news block which will show the whole database feed.
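As a rough sketch of the "check whether it already exists" step, here is a small Python/SQLite example; the table layout and the choice of the item URL as the unique key (rather than the subject) are my own assumptions for illustration:

import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("news.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS news (
        url TEXT PRIMARY KEY,   -- assumed unique key per news item
        title TEXT,
        body TEXT,
        lastUpdateDate TEXT
    )
""")

def upsert_news(url, title, body):
    """Insert a new item, or just refresh lastUpdateDate if it is already stored."""
    now = datetime.now(timezone.utc).isoformat()
    existing = conn.execute("SELECT 1 FROM news WHERE url = ?", (url,)).fetchone()
    if existing:
        conn.execute("UPDATE news SET lastUpdateDate = ? WHERE url = ?", (now, url))
    else:
        conn.execute(
            "INSERT INTO news (url, title, body, lastUpdateDate) VALUES (?, ?, ?, ?)",
            (url, title, body, now),
        )
    conn.commit()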
I am developing a web application in which I have the following types of search functionality:
Normal search: the user enters a search keyword to search the records.
Popular: this is not really a search; it displays the popular records on the website, much as Digg and other social bookmarking sites do.
Recent: this displays recently added records on my website.
City search: here I present city names to the user, such as "Delhi", "Mumbai", etc., and when the user clicks a link, all records from that particular city are displayed.
Tag search: same as the city search; I have tag links, and when the user clicks a tag, all records marked with that tag are displayed.
Alphabet search: like the city and tag searches, this has links for the letters "A", "B", etc., and when the user clicks any letter link, all records starting with that particular letter are displayed.
Now, my problem is that I have to provide all of the searches listed above to the user, but I am not able to decide on an approach. One option is to go with a single page (result.aspx) that displays the records for every kind of search, figuring out from the query string which search the user is using and which data I have to display. For example, if I am searching for the city delhi and the tag delhi-hotels, the URLs for the two would be:
For City: www.example.com/result.aspx?search_type=city&city_name=delhi
For Tags: www.example.com/result.aspx?search_type=tag&tag_name=delhi-hotels
For Normal Search: www.example.com/result.aspx?search_type=normal&q=delhi+hotels+and+bar&filter=hotlsOnly
Now, I feel the above idea of using a single page for all searches is messy, so I thought of a cleaner idea: using separate pages for each type of search, as follows:
For City: www.example.com/city.aspx?name=delhi
For Tags: www.example.com/tag.aspx?name=delhi-hotels
For Normal Search: www.example.com/result.aspx?q=delhi+hotels+and+bar&filter=hotlsOnly
For Recent: www.example.com/recent.aspx
For Popular: www.example.com/popular.aspx
My new idea is cleaner: it tells the user specifically which page is for what, and it also gives them an idea of where they are and which records they are seeing. But the new idea has one problem: if I have to change anything in my search-result display, I have to make the change on every page, one by one. I have thought of a solution for that too, which is to use a user control inside a repeater control; I'll pass my values one by one to the user control to render the HTML for each record.
Everything is fine with the new idea, but I am still not able to decide which idea to go with. Can anyone share their thoughts on this problem?
I want to implement an idea that will be easy to maintain, SEO-friendly (giving my website a good ranking), and user-friendly (easy for users to use and understand).
Thanks.
One thing to mention on the SEO front:
As a lot of the "results" pages will be linking through to the same content, there are a couple of advantages to appearing* to have different URLs for these pages:
Some search engines get cross if you appear to have duplicate content on the site, or if there's the possibility of almost infinite lists.
Analysing traffic flow.
So for point 1, as an example, you'll notice that SO has numerous ways of finding questions, including:
On the home page
Through /questions
Through /tags
Through /unanswered
Through /feeds
Through /search
If you take a look at the robots.txt for SO, you'll see that spiders are not allowed to visit (among other things):
Disallow: /tags
Disallow: /unanswered
Disallow: /search
Disallow: /feeds
Disallow: /questions/tagged
So the search engine should only find one route to the content rather than three or four.
Having them all go through the same page doesn't allow you to filter like this. Ideally you want the search engine to index the list of Cities and Tags, but you only need it to index the actual details once - say from the A to Z list.
For point 2, when analysing your site traffic, it will be a lot easier to see how people are using your site if the URLs are meaningful and the results aren't hidden in a form post: many decent stats packages allow you to report on query string values, and if you have "nice" URLs this is even easier. Having this sort of information will also make selling advertising easier, if that's what you're interested in.
Finally, as I mentioned in the comments to other responses, users may well want to bookmark a particular search; having the query baked into the URL one way or another (query strings or a rewritten URL) is the simplest way to allow this.
*I say "appearing" because as others have pointed out, URL rewriting would enable this without actually having different pages on the server.
There are a few issues that need to be addressed to properly answer your question:
You do not necessarily need to redirect to the Result page before being able to process the data. The page or control that contains the search interface could, on submit, process the submitted search parameters (and the type of search) and initiate a call to the database or intermediary web service that supplies the search result. You could then use a single Results page to display the retrieved data.
If you must pass the submitted search parameters via querystring to the result page, then you would be much better off using a single Result page which parses these parameters and displays the result conditionally.
Most users do not rely on the url/querystring information in the browser's address bar to identify their current location in a website. You should have something more visually indicative (such as a Breadcrumbs control or header labels) to indicate current location. Also, as you mentioned, the maintainability issue is quite significant here.
I would definitely not recommend the second option (using separate result pages for each kind of search). If you are concerned about SEO, use URL rewriting to construct URL "slugs" to create more intuitive paths.
I would stick with the original result.aspx result page. My reasoning for this from a user point of view is that the actual URL itself communicates little information. You would be better off creating visual cues on the page that states stuff like "Search for X in Category Y with Tags Z".
As for coding and maintenance, since everything is so similar besides the category it would be wise to just keep it in one tight little package. Breaking it out as you proposed with your second idea just complicates something that doesn't need to be complicated.
Ditch the query strings and use URL rewriting to handle your "sections": much better for SEO, and clearer from a bookmarking/user-readability standpoint.
City: www.example.com/city/delhi/
Tag: www.example.com/tag/delhi-hotels/
Recent: www.example.com/recent/
Popular: www.example.com/popular/
Regular search can just go to www.example.com/search.aspx or something.