How can I scrape product prices dynamically? - web-scraping

For my master's thesis, I was asked to create a generic web scraper that goes through a list of e-commerce sites and gets every product's title, price, and description. The challenge is that I should NOT use the inspection tool to get the XPath or CSS selector of the product's title or price.
The scraper should fetch this information on its own without any prior knowledge of the website.
When I searched online, I found that some solutions use regular expressions for this. That works when the price sits in a single element such as <span>10.50$</span>, BUT some prices are split across elements like this:
<span>10</span>
<span>.</span>
<span>50</span>
<span>$</span>
so I need to combine them to get the full price. Another challenge is that the expression can return two prices, the real one and the discounted one; how can I tell them apart?
Another solution is to use a generic CSS selector such as find_all('.price'). This returns every element whose class name is price, but there is no guarantee that all sites will use this name! Should I use machine learning, or am I over-engineering this?
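A minimal sketch of the "combine the pieces" step, assuming the HTML fragment that holds the price has already been located (finding that fragment generically is the hard part and is not addressed here). The names `extract_prices` and `PRICE_RE`, and the currency set `$€£`, are illustrative assumptions; only the standard library is used:

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


# Amount followed by a currency sign (10.50$) or preceded by one ($10.50).
# The currency set is an assumption made for this sketch.
PRICE_RE = re.compile(
    r"(?P<amount>\d+(?:[.,]\d{1,2})?)\s*(?P<currency>[$€£])"
    r"|(?P<currency2>[$€£])\s*(?P<amount2>\d+(?:[.,]\d{1,2})?)"
)


def extract_prices(fragment):
    """Return (amount, currency) pairs found in an HTML fragment."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    # Join text nodes without separators so "10", ".", "50", "$" become "10.50$".
    text = "".join(collector.parts)
    prices = []
    for match in PRICE_RE.finditer(text):
        amount = match.group("amount") or match.group("amount2")
        currency = match.group("currency") or match.group("currency2")
        prices.append((amount, currency))
    return prices


split_markup = "<span>10</span>\n<span>.</span>\n<span>50</span>\n<span>$</span>"
print(extract_prices(split_markup))  # [('10.50', '$')]
```

When a fragment holds both an old and a discounted price, this returns both pairs in document order; deciding which one is current still needs a separate rule. Joining text nodes without separators is only safe on a small price container, not on a whole page.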

Related

Multi Store with Subdomains

I want to create multiple online shops for selling merchandise products for companies. The products are basically identical but should be personalized depending on the company I am building the shop for. Because I do not want to build a new shop every time a new company joins the program, I am looking for something like this:
www.myshop.com : One shop with the underlying product database and checkout system - not showing any products, just as a parent structure
www.company1.myshop.com : A slightly personalized shop where only a selection of the product catalogue is available
www.company2.myshop.com : A slightly personalized shop where a different selection of the product catalogue is available
Do you get it?
Does anybody know a tool for that?
Thanks in advance!
I already looked into WooCommerce, Shopify and even WiX. As far as I understand, what I am looking for is not supported.
Since your example is based on subdomains, you can choose to assign a Shopify store to each subdomain. Each store feeds from your inventory and accounting, giving your customers the illusion of a custom experience. Or you can just simplify your life: have one store, and let each customer view only the collections specific to them. That is the smart move. You may not like that, but it would work a peach for you. You just tag customers so that they see their specific collections and the products in those collections. Simple.
I can also think of a dozen other ways to pull this off with Shopify, but that is me, not you. SO is not the right place for an opinion question like this, but I answered anyway. Your mileage may vary of course.

Analytics Experiments Dynamic URLs

I have a lot of product pages like this:
www.example.com/catalog001/item123
www.example.com/catalog002/item321
www.example.com/catalog002/item567
Every catalog and product (item) has its own numeric id.
Product pages are similar; only the product image, price, and title differ.
I tried to use regular expressions to set up the original URL pattern in Analytics Experiments:
www.example.com/catalog(\d+)?/item(\d+)?
Is there any way to set up an original URL pattern like this?
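For reference, a sketch of how a tightened version of that pattern behaves in Python's `re` module. The dots are escaped and the numeric groups are made mandatory; both changes are assumptions about the intent, and whether Analytics Experiments accepts a regular expression in this field is a separate matter. `parse_product_url` is a hypothetical helper name:

```python
import re

# Escaped dots match literal "." only; each id must be one or more digits.
URL_PATTERN = re.compile(r"^www\.example\.com/catalog(\d+)/item(\d+)$")


def parse_product_url(url):
    """Return (catalog_id, item_id) as strings, or None if the URL does not match."""
    match = URL_PATTERN.match(url)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_product_url("www.example.com/catalog001/item123"))  # ('001', '123')
print(parse_product_url("www.example.com/catalog/item"))        # None
```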
I'm not quite sure what you're asking. It sounds like you want to test many different product pages without setting up many different experiments, presumably to test two different product page layouts.
If so, you can use relative URLs in the experiments interface for that; there is no need for regular expressions. Create an experiment for one product page, select relative URLs for the variations, and enter a query string (?foo=bar) or fragment identifier (#foo=bar) that triggers the variation page. Then add the experiment code to all the originals, and the test will be enabled for all your product pages, not just the one URL you entered in the interface.
If you were after something else I suggest you re-word the question to explain the actual problem rather than your attempt to solve it.

Make search URL search engine friendly: hash -> what?

I am developing a flight search engine for a customer, and currently the URLs look as follows (ad = destination airport, ao = origin airport, dates and number of passengers are not specified here):
http://example.com/#ad=S%C3%A3o+Paulo+-+Todos+os+aeroportos+(SAO),+Brasil&ao=Recife+-+Guararapes+Intl+(REC),+Brasil
My customer wants to make search pages more search engine friendly (SEO). The idea is that Brazilians who are looking for flights from, say, SAO to REC by e.g. Google should have a higher chance of finding that particular flight search engine.
The first step is probably replacing the fragment identifier (#) by a query string (?). The server then dynamically generates nice text content that can be viewed without JavaScript (search results would still be loaded via XHR). In my opinion, that makes a lot of sense.
Now, to make the URLs more search engine friendly:
(A) My customer proposes adding additional keywords into the URL, something like:
http://example.com?flights+to+Porto+Alegre&S%C3%A3o+Paulo+-+Todos+os+aeroportos+(SAO),+Brasil&ao=Recife+-+Guararapes+Intl+(REC),+Brasil
(B) I propose adding a slug instead, which can easily be internationalized and which is also easy for humans to read. Example:
http://example.com/pt_BR?ad=REC&ao=SAO/voos_de_Sao_Paulo_para_Recife
(C) Or, perhaps without a slug (but - due to parsability - only for a limited parameter set, which has the disadvantage of limiting sharing of URLs by users):
http://example.com/pt_BR/voos_de_Sao_Paulo_(SAO)_para_Recife_(REC)
What do you suggest? Any examples of good URLs for similar use cases?
That all being said: I understand that links from highly ranked pages are still the most important ranking measure. In the end, I wonder if all that complexity really is worth the effort. When I look at Google's own search pages, they are rather simple. For example, there is no summary of the search query in an H1 tag, which is what my customer wants. Of course, Google doesn't search itself...
Don't use _ (underscore) to delimit words. Google interprets hello_world as one word but hello-world as two words.
Don't put your human-readable keywords in the query string (after the ?). Instead, make it a normal URL: http://example.com/pt_BR/search/voos-de-Sao-Paulo-(SAO)-para-Recife-(REC)
I would go for something like: http://example.com/pt_BR/2012-10-28/voos-de-Sao-Paulo-(SAO)-para-Recife-(REC)
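A rough sketch of how such a hyphen-delimited path could be generated on the server, assuming accents are stripped and parentheses are kept as in the example URLs above. `slugify` and `flight_search_path` are hypothetical helper names, and the Portuguese phrase template is taken from the question:

```python
import re
import unicodedata


def slugify(text):
    """Turn free text into a hyphen-delimited, ASCII-only slug."""
    # Decompose accented characters and drop the combining marks (ã -> a).
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    # Replace every run of characters other than letters, digits, and
    # parentheses with a single hyphen.
    slug = re.sub(r"[^A-Za-z0-9()]+", "-", ascii_text)
    return slug.strip("-")


def flight_search_path(locale, origin_city, origin_code, dest_city, dest_code):
    """Build a path such as /pt_BR/search/voos-de-...-para-..."""
    phrase = f"voos de {origin_city} ({origin_code}) para {dest_city} ({dest_code})"
    return f"/{locale}/search/{slugify(phrase)}"


print(flight_search_path("pt_BR", "São Paulo", "SAO", "Recife", "REC"))
# /pt_BR/search/voos-de-Sao-Paulo-(SAO)-para-Recife-(REC)
```

Keeping the airport codes in the slug means the server can still parse the origin and destination back out of the path.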

Drill Down Search using Sql

It seems like a lot of e-commerce sites these days provide product filters to search for items. For example, you can search items by WIDTH, HEIGHT, TV SIZE, Furniture Type, etc.
If it were a simple website with just a few searchable filters, it would be easy to do, but I am managing a website which sells furniture, appliances & electronics, and every category has a lot of subcategories as well. For example:
Appliance:
    Laundry
        Searchable Attributes (Washer, Dryer, Washer Type..Microwave, Width, Height)
Electronics:
    Tv (Tv Size, Width)
    Games (ps3, Genre, Sale Date)
I am sure you get the idea. An e-commerce site offers basic categories, and every category can have subcategories OR searchable filters to drill down your search.
What would be the best way to do this using MS SQL Server & ASP.NET? I am interested in creating an optimized searchable schema in SQL.
Any hints or suggestions are welcome.
Thanks
You can use the Entity-Attribute-Value model.
The simple concept is that instead of having a column for each of your model's attributes (such as genre, sale date, etc. for a ps3 game), you'll have another table, named Attributes, where the attributes and their types are listed, and a third table, where your main model instances (ps3 game) are linked with attributes via 3 columns:
Model Id (the id of the ps3 game)
Attribute Id
Attribute Value
This concept might be harder to manage and require more complicated queries, but it allows new products / categories to be added in the easiest way.
Of course, with this model, if several products share a common attribute (such as a pc game and a ps3 game sharing a genre), you'll have the attribute defined only once, and both models will be linked to it, allowing a common search query across different products.
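To make the three-table idea concrete, here is a small sketch using Python's built-in `sqlite3` as a stand-in for MS SQL Server (an assumption made so the example is self-contained; the SQL is kept generic). The table names `product`, `attribute`, `product_attribute`, the sample rows, and the helper `find_products` are all hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE attribute (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- One row per (model id, attribute id, attribute value).
CREATE TABLE product_attribute (
    product_id   INTEGER NOT NULL REFERENCES product(id),
    attribute_id INTEGER NOT NULL REFERENCES attribute(id),
    value        TEXT NOT NULL,
    PRIMARY KEY (product_id, attribute_id)
);
""")
conn.executemany("INSERT INTO product VALUES (?, ?)", [
    (1, "PS3 racing game"), (2, "PC strategy game"), (3, "PS3 strategy game"),
])
conn.executemany("INSERT INTO attribute VALUES (?, ?)", [(1, "Genre"), (2, "Platform")])
conn.executemany("INSERT INTO product_attribute VALUES (?, ?, ?)", [
    (1, 1, "Racing"), (1, 2, "PS3"),
    (2, 1, "Strategy"), (2, 2, "PC"),
    (3, 1, "Strategy"), (3, 2, "PS3"),
])


def find_products(conn, filters):
    """Return names of products that match every attribute/value filter."""
    if not filters:
        return [row[0] for row in conn.execute("SELECT name FROM product ORDER BY name")]
    conditions = " OR ".join("(a.name = ? AND pa.value = ?)" for _ in filters)
    params = [part for pair in filters.items() for part in pair]
    sql = f"""
        SELECT p.name
        FROM product p
        JOIN product_attribute pa ON pa.product_id = p.id
        JOIN attribute a ON a.id = pa.attribute_id
        WHERE {conditions}
        GROUP BY p.id, p.name
        HAVING COUNT(*) = ?   -- every requested filter must have matched
        ORDER BY p.name
    """
    return [row[0] for row in conn.execute(sql, params + [len(filters)])]


print(find_products(conn, {"Genre": "Strategy"}))
# ['PC strategy game', 'PS3 strategy game']
print(find_products(conn, {"Genre": "Strategy", "Platform": "PS3"}))
# ['PS3 strategy game']
```

The shared Genre attribute is defined once and used by both PC and PS3 games, and each extra filter only adds one condition to the query rather than a new column to the schema.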
Too much for a single question. Look for a book on database design. For a drill down you can have a table with as many PK columns as drill downs. But when it comes to details you will need separate tables as TV does not have the same details as a stereo.

Using Yahoo! Pipes

Have you used pipes.yahoo.com to quickly and easily do... anything? I've recently created a quick mashup of StackOverflow tags (via rss) so that I can browse through new questions in fields I like to follow.
This has been around for some time, but I've just recently revisited it and I'm completely impressed with its ease of use. It's almost to the point where I could set up a pipe and then give a client privileges to go in and edit feed sources... and I didn't have to write more than a few lines of code.
So, what other practical uses can you think of for pipes?
It's nice for aggregating feeds, yes, but the other handy thing to do is filtering the feeds. A while back, I created a feed for Digg (before Digg fell into the Fark pit of despair). I didn't care about the overwhelming Apple and Ubuntu news, so I filtered those keywords out of Technology, which I then combined with Science and World & Business feeds.
Anyway, you can do a lot more than just combine things. If you wanted to be smart about it, you could set up per-subfeed and whole-feed filters to give granular or over-arching filtering abilities as the news changes and you get bored with one topic or another.
The one thing I have really used Y! Pipes for (rather than just playing around with it) is to clean up item titles, merge and finally de-dupe the feeds I got from querying multiple blog search engines with the same search term. This is something I’ve done in several very different contexts, e.g. for my own ego surfing, in another case for the planet site set up by some conference’s organisers to keep an eye on their conference’s buzz, etc. Highly recommended.
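For readers who want to see the clean-up, filter, merge, and de-dupe flow spelled out, here is a rough Python equivalent. This is not how Pipes works internally; it only mirrors the steps described above. Feed items are assumed to be plain dicts with `title` and `link` keys, and `clean_title` and `merge_feeds` are hypothetical names:

```python
import re


def clean_title(title):
    """Collapse runs of whitespace and trim the ends of an item title."""
    return re.sub(r"\s+", " ", title).strip()


def merge_feeds(feeds, blocked_keywords=()):
    """Merge several feeds, drop items with blocked keywords, de-dupe by link."""
    blocked = [keyword.lower() for keyword in blocked_keywords]
    seen_links = set()
    merged = []
    for feed in feeds:
        for item in feed:
            title = clean_title(item["title"])
            # Filter step: skip items whose title mentions a blocked keyword.
            if any(keyword in title.lower() for keyword in blocked):
                continue
            # De-dupe step: the same link from two sources is kept only once.
            if item["link"] in seen_links:
                continue
            seen_links.add(item["link"])
            merged.append({"title": title, "link": item["link"]})
    return merged


feeds = [
    [{"title": "  Apple   news ", "link": "http://example.com/a"},
     {"title": "Conference buzz", "link": "http://example.com/b"}],
    [{"title": "Conference  buzz", "link": "http://example.com/b"},
     {"title": "Science story", "link": "http://example.com/c"}],
]
for item in merge_feeds(feeds, blocked_keywords=["apple"]):
    print(item["title"], item["link"])
# Conference buzz http://example.com/b
# Science story http://example.com/c
```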
You can do tons of things with pipes. For example for sites like digg or reddit, you can make one to bypass the site and go directly to the linked article (rewriting the RSS).
I also like to filter webcomics' feeds to keep just the comics, and then mix them all into a single feed.
I've taken the liberty of copying your pipe and rearranging it a bit so that it's easier to add and remove tags:
Yahoo Pipe: StackOverflow Merge Tags
Tags are now listed in a string builder, so to add a tag you just have to hit the + button on the string builder and type in the tag preceded by a slash.
Well, pipes are real fast and useful.
Other effective uses might be:
1) combine many feeds into one, then sort, filter and translate it.
2) geocode your favorite feeds and browse the items on an interactive map.
3) power widgets/badges on your web site.
4) grab the output of any Pipes as RSS, JSON, KML, and other formats.
This is by no means a comprehensive list.
One of my favorite things to do with Yahoo! Pipes is to aggregate multiple craigslist feeds into a single feed. You can make a feed out of any category or search criteria on craigslist. I live in a university town and am always on the lookout for tickets to sporting events, for example. I have a half-dozen craigslist searches all being combined into a single feed via Yahoo! Pipes. This works a lot better for me than simply monitoring the entire "Tickets" category; it filters out most of the tickets I am not interested in. Yes, this is another aggregating feeds example, but the craigslist usage is quite valuable because you can aggregate feeds that are themselves based upon searches.
I've used Pipes to translate blogs into English. I would have liked to use it to fetch the full text for blogs which only provide a summary of the content in the feed, but unfortunately they don't provide any input which fetches the content from a parameterizable source :-(.
Just stumbled on this while looking for ways to connect Excel to Pipes. A bit necromancer-ish, but here goes.
One thing I've done is take an HTML page (science data) which has links to tons of CSV files for a bunch of Army Corps measurement stations. Each station has a big table of datafiles, all organized individually by month and year. I use YQL to parse out and organize the links to the individual CSV files in a way that Pipes can read them. Then, I use that as input into a Pipe, which has a user input for "Station" and "Date."
Using this, I can go to the Pipes page, type in those values and get the values only for a specific station and date, rather than have to find the station on a website, find the year and month in a big table, click the link, open the CSV file, and find the values for a day within that month's worth of data. I can even change the pipe to specify the hour, and the parameter, and then get a single value returned.
Now, I wish I could figure out how to program Excel so that I can use "=yahoo_function(station, datetime)" to place that value automatically into a cell given the values of other columns!
