How to automate image search with 2 different types of queries? - web-scraping

I need to do some data scraping and it would be very useful for me if there was a way to implement an algorithm that downloads a subset of images matching a certain query, but only within a specific 'root' website (for example all images in www.example.com including all subdirectories such as www.example.com/sub1).
I already know that it might be impossible to find all subdirectories of a root website unless they're listed somewhere. Since I do not know all the subdirectories, I think I should avoid looping over subdirectories and extracting all images (with an online image extractor, for instance).
So in my opinion the easiest thing to do is to let Google do most of the work, so that it outputs all (or at least most) of the images contained in any subdirectory of the 'root', and then run a query on that set.
The problem is thus divided into 2 parts:
Get all the images from Google Image Search that come from a specific website
Only get the subset of images matching the query. This, I guess, would be possible with some AI recognition (all images that are labeled as animals, or buildings, and so on)
I know that this is a very broad question, so I do not expect any answers with code.
What I would like to know is:
Do you think it is even possible to do that?
What programs would you suggest using for this purpose (both for the search and the image recognition)?
If you think this question belongs more to another stack site let me know, I'm trying my best to be compliant with the rules. Thanks.
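For what it's worth, here is a minimal Python sketch of the crawl-it-yourself alternative, assuming the requests and BeautifulSoup libraries: it stays inside the root host, collects image URLs from every page it can reach, and leaves the recognition step as a placeholder, since the choice of classifier is still open. The root URL is the example domain from the question.

import urllib.parse
from collections import deque

import requests
from bs4 import BeautifulSoup

ROOT = "https://www.example.com"                 # the 'root' website
ROOT_HOST = urllib.parse.urlparse(ROOT).netloc

def crawl_images(start_url, max_pages=100):
    """Breadth-first crawl that stays on the root host and collects <img> URLs."""
    seen_pages, image_urls = {start_url}, set()
    queue = deque([start_url])
    while queue and len(seen_pages) <= max_pages:
        page = queue.popleft()
        try:
            html = requests.get(page, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        # Collect every image on the page, resolved to an absolute URL.
        for img in soup.find_all("img", src=True):
            image_urls.add(urllib.parse.urljoin(page, img["src"]))
        # Follow only links that stay inside the root host (covers unknown subdirectories).
        for a in soup.find_all("a", href=True):
            link = urllib.parse.urljoin(page, a["href"])
            if urllib.parse.urlparse(link).netloc == ROOT_HOST and link not in seen_pages:
                seen_pages.add(link)
                queue.append(link)
    return image_urls

def matches_query(image_url, query):
    """Placeholder for the recognition step, e.g. a pretrained image classifier."""
    raise NotImplementedError

if __name__ == "__main__":
    for url in crawl_images(ROOT):
        print(url)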

Related

How to check if any URLs within my website contain specific Google Analytics UTM codes?

The website I manage uses Google Analytics to track URLs. Recently I found out that some of the URLs contain UTM codes and should not. I need some way of determining whether or not URLs that contain the following UTM codes utm_source=redirect or utm_source=redirectfolder are currently on the website and being redirected within the same website. If so, I will need to remove the UTM codes on those URLs, because Google Analytics automatically tracks URLs that redirect within the same domain. So it does not require UTM codes (and this actually hurts the analytics).
My apologies if I sound a little broken here, I am still trying to understand it all myself, as I am a new graduate with a CS degree and I am now the only web developer. I am not asking for anyone to write this for me, just if I could be pointed in the right direction to writing a ColdFusion script that may help with this.
So if I understand correctly, your codebase is riddled with problematic URLs. To clean up the URLs programmatically you'll need to do a couple of things up front.
Identify the querystring parameter variable/value pair that needs to be eliminated.
Create a worker file to access all your .cfm and .cfc files (of interest).
Create a loop that goes through the directories and reads, edits, and saves your files (be careful here not to go crazy: maybe do not overwrite existing files, write uniquely named copies instead, unless you are sure).
Create a find/replace function or regex expression to target and remove your troublesome parameters (a rough sketch of such a regex follows below).
Save your file and move on in the loop.
OR:
You can use an IDE like Dreamweaver or Sublime Text to locate these via a regex search, then spot-check and remove them.
I would selectively remove the URL parameters, but if you have so many pages that it makes no sense, then programmatic removal would be the way to go.
You will be using cfdirectory, cffile, and either rematch() (creating an array and rebuilding) or a find/replace with replaceNoCase().
Your cfdirectory call will return a query-like variable that you can loop over just as you would a normal query with cfoutput.
Pull one or two files out of your repo to develop your code against until you are comfortable. I would code in exit strategies (fail gracefully), like adding a locatable comment at the change spot so you can check it later manually, or escaping out if a file won't write, and many other try/catch opportunities.
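As a rough illustration of the loop-and-replace idea above, here is a sketch in Python rather than ColdFusion; the directory name is an assumption, and the file walk and regex stand in for the cfdirectory/cffile and replaceNoCase()/rematch() steps described above. HTML-encoded ampersands (&amp;) would need a similar extra pattern.

import pathlib
import re

# Matches '?utm_source=redirect' or '&utm_source=redirectfolder' plus a trailing '&' if present.
# 'redirectfolder' must come before 'redirect' in the alternation, since one is a prefix of the other.
UTM_PATTERN = re.compile(r"(\?|&)utm_source=(?:redirectfolder|redirect)(&?)", re.IGNORECASE)

def _fix(match):
    lead, trail = match.group(1), match.group(2)
    if lead == "?":
        return "?" if trail else ""   # keep '?' only when other parameters follow
    return "&" if trail else ""       # keep '&' only when other parameters follow

def strip_utm(text):
    return UTM_PATTERN.sub(_fix, text)

for path in pathlib.Path("wwwroot").rglob("*.cfm"):   # hypothetical folder; repeat for *.cfc
    original = path.read_text(encoding="utf-8")
    cleaned = strip_utm(original)
    if cleaned != original:
        # Write a uniquely named copy instead of overwriting, so every change can be reviewed.
        path.with_name(path.name + ".cleaned").write_text(cleaned, encoding="utf-8")
        print("edited:", path)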
I hope this helps.

Is static URL better than dynamic URL in terms of SERP?

I've been reading up on SEO and how to construct my links in terms of getting better SERP.
I'm using WordPress as the framework for my site and have custom templates retrieving data from my DB.
What makes a URL dynamic is the usage of ? and &. Nothing more, nothing less. Google recommends that I should not have too many attributes in my URL - and that's understandable.
Dynamic: www.mysite.com/?id=123&name=some+store+name&city=london
Static: www.mysite.com/london/some+store+name/123
Q1: I don't feel that adding the store ID in this static URL looks nice. But I do need it in order to fetch data from the DB, right?
Reading various blogs, I see many SEO (experts) saying different things, but I feel most of it is just talk without actually proving their statements. We can all agree that static URLs are good in terms of usability (and readability).
Q2: But many claim that static URLs prevent duplicate content. I don't agree with that, as all my content has a unique ID. Can anyone comment on this?
Q3: In the end, for the Google search engine (and others) it really doesn't matter whether the URL is static or dynamic. But since Google is working towards user-friendly content, is that the only argument for having static URLs?
1) There's no problem using DB ids alongside static URLs. Many huge e-commerce and other commercial sites do this (Amazon, eBay... hell, everyone really.)
2) A static URL in and of itself does not prevent duplicate content. There are hundreds of ways duplicates can happen (child pages, external copy, javascript, form fields, ajax, archive sections... the list goes on.)
3) It doesn't matter whether it's static or dynamic for indexing. But in terms of ranking well, static URLs containing informative terms (relevant to the targeted keywords) are hugely beneficial. Multivariate testing I've done shows users are also generally reassured by clean-looking URLs in terms of usability.
If you give me some more examples, I can probably help out a bit more.
URLs without parameters are always better. Parameters won't absolutely kill SEO - but it is better not to have them.
10 years ago Google would ignore parameters and penalize you for URLs with parameters. Today they are really good at figuring out these DB parameters - but not perfect. Among other things, Google has to try to figure out which URL parameters matter, which don't, and whether parameter order matters.
E.g. you may have URL parameters that store user preferences, navigation state etc. This will just proliferate URLs that Google has to try to decode. So what you should do is:
Right before generating a URL, at least sort your parameters.
Convert parameters that matter into things that don't look like parameters. So if I had a shoe store with URLs like http://mystore.com/mypage?category=boots&brand=great&color=red I'd rewrite that to something like http://mystore.com/mypage/category/boots/brand/great/color/red or even better:
http://mystore.com/mypage/boots/great/red
Then you can add the parameters that don't matter for the page content at the end. Google will figure out they don't matter.
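A small Python sketch of that rewrite idea; the parameter names and their ordering are just for illustration of the shoe-store example above.

from urllib.parse import urlencode

# Parameters that define the page content, in the order they should appear in the path.
PATH_PARAMS = ["category", "brand", "color"]

def build_url(base, params):
    """Turn content-defining parameters into path segments; sort the rest into a query string."""
    path_parts = [params[name] for name in PATH_PARAMS if name in params]
    leftovers = {k: v for k, v in params.items() if k not in PATH_PARAMS}
    url = base.rstrip("/") + "/" + "/".join(path_parts)
    if leftovers:
        url += "?" + urlencode(sorted(leftovers.items()))   # stable parameter order
    return url

# build_url("http://mystore.com/mypage",
#           {"color": "red", "brand": "great", "category": "boots", "sessionpref": "grid"})
# -> http://mystore.com/mypage/boots/great/red?sessionpref=grid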
The other reason to fix your URLs is that Google displays them to users in the SERP, and people are more likely to click on readable URLs than database URLs.
Why do big stores like Amazon use database URLs? Because they are giant, bad URLs don't hurt them, and their systems are so large and complex that it is the only way to manage them. But for smaller sites with fewer products, readable URLs are achievable and are one of the few advantages a small site can have over a big one.
If you observe Google SERP results closely, you will notice that parts of them are highlighted and bold. Looking further, you can easily see that the search query gets highlighted or bolded in the title, description, and URL of results that use that same query in their title, description, and URL.
The thing is, if a website's URLs are dynamic and come with a parameter ID, they are losing keywords from the title, description, and URL.
Ex:
http://www.johnzaccheofineart.com/catagory-2/?id=4
http://www.johnzaccheofineart.com/painting/johnzaccheo
Sample Search : Painting for Sale
Now we can easily see the difference between static and dynamic URL performance. One URL contains words with no search value; the other contains the category name as well as the painter's name.
So, as a user, I would give preference to the second one, which is understandable from the URL itself.

Crawl only articles/content

I want a crawler to be able to identify which pages on, for example, a news site, are actual content (i.e. articles), as opposed to About, Contact, category listings, etc.
I've found no elegant way around this so far, as the criteria for content seem to vary by site (no common tags/layouts/protocols, etc.). Can anyone direct me either to libraries or methods that can identify, with some level of certainty, whether a page is a piece of content? It's perfectly acceptable to make this distinction after I have crawled the candidate page.
Barring anything that already exists, I'd also appreciate any starting points to existing/ongoing research in this area.
You can start by checking out the Boilerpipe framework. There is an online extraction demo available from the project's page. If the extraction result is not very good for your case, you will need to extend their algorithm.
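If Boilerpipe turns out to be hard to integrate, a crude version of the same idea (keep text blocks that are long and not link-heavy) can be sketched in a few lines of Python with BeautifulSoup; the thresholds here are arbitrary and would need tuning per site.

from bs4 import BeautifulSoup

def extract_main_text(html, min_words=25, max_link_density=0.3):
    """Very rough boilerplate filter: keep paragraphs that are long and not link-heavy."""
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for block in soup.find_all("p"):
        text = block.get_text(" ", strip=True)
        words = text.split()
        if len(words) < min_words:
            continue                        # short blocks are usually navigation/boilerplate
        link_text = " ".join(a.get_text(" ", strip=True) for a in block.find_all("a"))
        density = len(link_text.split()) / len(words)
        if density <= max_link_density:     # link-heavy blocks are usually menus/listings
            kept.append(text)
    return "\n\n".join(kept)

# A page that yields very little text here is more likely a listing/About/Contact page
# than an actual article.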

Interpreting Search Results

I am tasked with writing a program that, given a search term and the HTML source of a page representing search results of some unknown search engine (it can really be anything: a blog, a shop, Google, eBay, ...), needs to build a data structure of the results containing "what's in the results": a title for each result, the "details" link, the position within the results, etc. It is not known whether the results page contains any of the data at all, or whether there are any search results. The goal is to feed the data structure into another program that extracts meaning.
What I am looking for is not BeautifulSoup or a RegExp but rather some clever ideas or algorithms on how to interpret the HTML source. What do I do to find out what part of the page constitutes a single result item? How do I filter the markup noise to extract the important bits? What would you do? Pointers to fields of research covering what I am trying to do are also greatly appreciated.
Thanks, Simon
I doubt that there exists a silver-bullet algorithm that, without any training, will just work on any arbitrary search results output.
However, this task can be solved, and it is actually solved in many applications, just with a different approach. First you have to define the general structure of a single search result item based on what you are actually going to do with it (it could be name, date, link, description snippet, etc.), and then write a number of HTML parsers that extract the necessary fields from the search result output of particular web sites.
I know it is not a super sexy solution, but it is probably the only one that works. And it is not rocket science. Writing parsers is actually extremely simple; you can make a dozen per day. If you look into the HTML source of search results, you will notice that the output is typically very structured and marked with specific div sections or class attributes, so it is very easy to find in the document. You don't even have to use any complicated HTML parsing library for that; something grep-like will be enough.
For example, on this particular page your question starts with <div class="post-text"> and ends with </div>. Everything in between is the post text, with some HTML formatting that you may want to remove along with extra spaces and "\n". And this <div class="post-text"> appears on the page only once.
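To make the "one small parser per site" idea concrete, a Python sketch along these lines might help; the site name and CSS selectors are invented, and the real ones would come from inspecting each site's markup.

from bs4 import BeautifulSoup

# One entry per search engine/site: CSS selectors for a result item and its parts.
# These selectors are made up for illustration.
SITE_RULES = {
    "example-shop": {"item": "div.result", "title": "h3 a", "link": "h3 a", "snippet": "p.desc"},
}

def parse_results(html, site):
    """Extract title, link, snippet, and position for each result item of a known site."""
    rules = SITE_RULES[site]
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for position, item in enumerate(soup.select(rules["item"]), start=1):
        title = item.select_one(rules["title"])
        link = item.select_one(rules["link"])
        snippet = item.select_one(rules["snippet"])
        results.append({
            "position": position,
            "title": title.get_text(strip=True) if title else None,
            "link": link.get("href") if link else None,
            "snippet": snippet.get_text(strip=True) if snippet else None,
        })
    return results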
Once you go to a larger scale with your retrieval application, you will find that there is not that big a variety of different search engines across sites, and you will be able to re-use already created parsers for sites using similar search engines.
The only thing you have to remember is built-in self-testing. Sites tend to upgrade and change their design from time to time. If your application is going to live for some time, you will need to include in your parsers some logic that checks the validity of their results and notifies you every time the search output has changed and is no longer compatible with your parser. Then you will have to modify the particular parser or write a new one.
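The self-check can be as simple as running each parser against a saved reference page and flagging it when it stops finding results; a sketch building on the parse_results function above, with arbitrary thresholds:

def check_parser(site, reference_html, parse_results, min_expected=1):
    """Run a parser against a known reference page and warn if the output looks broken."""
    results = parse_results(reference_html, site)
    broken = len(results) < min_expected or any(r["title"] is None for r in results)
    if broken:
        print(f"WARNING: parser for {site} may be out of date - the markup has probably changed")
    return not broken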
Hope this helps.

REST URL design - multiple resources in one HTTP call [duplicate]

Possible Duplicate:
Rails 3 Custom Route that takes multiple ids as a parameter
From what I understand, a good REST URL for getting a resource would look like this:
/resource/{id}
The problem I have is, that I often need to get a large number of resources at the same time and do not want to make a separate HTTP call for each one of them.
Is there a neat URL design that would cater for that or is this just not suitable for a REST API?
Based on your response, the answer to your question is to create a new resource that contains that single set of information. e.g.
GET /Customer/1212/RecentPurchases
Creating composite URLs that have many identifiers in a single URL limits the benefits of caches and adds unnecessary complexity to the server and client. When you load a web page that has a bunch of graphics, you don't see
GET /MyPage/image1.jpg;image2.jpg;image3.jpg
It just isn't worth the hassle.
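A minimal Flask sketch of that sub-resource approach, using the route from the example above; the data source is hypothetical and would normally be a database query.

from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical data; in practice this would come from the database.
RECENT_PURCHASES = {1212: [{"id": 987, "item": "widget"}, {"id": 988, "item": "gadget"}]}

@app.route("/Customer/<int:customer_id>/RecentPurchases")
def recent_purchases(customer_id):
    # One URL returns the whole set, so it stays simple and cache-friendly.
    return jsonify({"purchases": RECENT_PURCHASES.get(customer_id, [])})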
I'd say /resources/foo,bar,baz (separator may vary depending on IDs' nature and your aesthetic preferences, "foo+bar+baz", "foo:bar:baz", etc.). Looks a bit "semantically" neater than foo/bar/baz ("baz of bar of foo"?)
If resource IDs are numeric, maybe, even with a range shortcut like /resources/1,3,5-9,12
Or, if you need to query not on resources with specific IDs but on a group of resources having specific properties, maybe something like /resources/state=complete/size>1GiB/!active/...
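Expanding a compact ID list like the 1,3,5-9,12 shortcut above back into numbers on the server takes only a few lines; a Python sketch:

def parse_id_list(segment):
    """Expand '1,3,5-9,12' into [1, 3, 5, 6, 7, 8, 9, 12]."""
    ids = []
    for part in segment.split(","):
        if "-" in part:
            low, high = part.split("-", 1)
            ids.extend(range(int(low), int(high) + 1))
        else:
            ids.append(int(part))
    return ids

# parse_id_list("1,3,5-9,12") -> [1, 3, 5, 6, 7, 8, 9, 12]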
I have used something like this in the past:
/resources/a/d/
and that would return a list of the resources between the two values.
something like
<resources>
<resource>a</resource>
<resource>b</resource>
<resource>c</resource>
<resource>d</resource>
</resources>
You could also put more advanced searches into the URL depending on what the resource actually is.
Maybe you could try with:
[GET]/purchases/user:123;limit:30;sort_date:DESC
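Splitting that kind of matrix-style segment back into filters is straightforward; a small Python sketch:

def parse_matrix_params(segment):
    """Turn 'user:123;limit:30;sort_date:DESC' into a dict of filters."""
    return dict(pair.split(":", 1) for pair in segment.split(";") if pair)

# parse_matrix_params("user:123;limit:30;sort_date:DESC")
# -> {'user': '123', 'limit': '30', 'sort_date': 'DESC'}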
