Scraping E-Commerce sites and aggregating same products - web-scraping

I am trying to learn about web-scraping and as an application I figured I'd build an aggregator that crawls retailers for certain products and sets up a price comparison for the same product from different retailers.
As I got started on this, I realized exactly how large a task this is.
First, I need to crawl sites that differ not only in their DOM structures but also in how they name the same products and how they format prices, including sale prices.
Second, after I've somehow decoded the DOM for x number of sites (doing it for one or two is easy, but I want the crawler to be scalable!) and fetched the data for various items, I need to be able to match the different names for the same product so I can compare the differing prices between retailers (convert them to the same currency, check whether the returned price is the original or the on-sale price, etc.).
I am trying to write my crawlers using Scrapy. Can someone recommend an approach for adapting the crawler to a variety of retailers, and any libraries or approaches that would work well for the second problem of matching like (and unlike) items?
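As a sketch of the price-normalization half of this: before any cross-retailer comparison, the scraped price strings have to be parsed and converted to one currency. The rate table below and the assumption that every price string starts with a currency symbol are placeholders for illustration, not part of any real retailer's format:

```python
import re

# Placeholder exchange rates; a real aggregator would fetch current ones.
RATES_TO_USD = {"$": 1.0, "€": 1.08, "£": 1.27}

def normalize_price(text):
    """Parse a price string like '$1,299.99' or '€49.90' (assuming it
    starts with a currency symbol) and convert it to USD."""
    symbol = text.strip()[0]
    amount = float(re.sub(r"[^\d.]", "", text))  # drop symbol and commas
    return round(amount * RATES_TO_USD[symbol], 2)
```

With that in place, '$1,299.99' and '€49.90' become directly comparable numbers; telling a sale price apart from the original price would need a second field per listing.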

For the comparison, you can convert the product-name strings to lists of tokens, compare them, and apply a threshold to decide whether two products are the same.
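A minimal sketch of that idea in Python, where the 0.6 threshold is an arbitrary starting point you would tune against real data:

```python
def tokenize(name):
    """Lowercase a product name and split it into a set of word tokens."""
    return set(name.lower().replace("-", " ").split())

def similarity(name_a, name_b):
    """Jaccard similarity of the two token sets, from 0.0 to 1.0."""
    a, b = tokenize(name_a), tokenize(name_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def same_product(name_a, name_b, threshold=0.6):
    """Treat two listings as the same product above the threshold."""
    return similarity(name_a, name_b) >= threshold
```

Libraries such as fuzzywuzzy/RapidFuzz implement more robust versions of this (token-sort and token-set ratios) if the simple Jaccard score proves too brittle.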

Related

How can I Count Instances of Search Terms in Google Data Studio?

I'm working with internal site search terms from Google Analytics in Google Data Studio. I need to count how many times users searched specific terms on the website. The problem is, the data is case sensitive and users often misspell words when they search, so that won't get tallied in a normal count function. For example, "careers", "Careers", "cAREERS", and "carers" are all different searches. What formula can I use to easily count how many times users searched different terms?
First add a field with the LOWER formula. Then add a field with CASE WHEN to correct each possible spelling error.
Another route would be to create a "sounds like" field. BigQuery offers a nice function for this, SOUNDEX. Data Studio does not offer anything like it, but you can build one with regexes: take the first character of the word and then only its vowels, having removed duplicated vowels first.
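Outside Data Studio, that "first character plus de-duplicated vowels" key is easy to prototype, for example in Python, to check that it really groups the misspellings from the question before building the regex field:

```python
import re

def sounds_like(term):
    """Build the rough 'sounds like' key described above: lowercase the
    term, keep its first character, then append its vowels with
    consecutive duplicates collapsed."""
    term = term.lower().strip()
    if not term:
        return ""
    vowels = re.findall(r"[aeiou]", term)
    deduped = []
    for v in vowels:
        if not deduped or deduped[-1] != v:  # drop consecutive repeats
            deduped.append(v)
    return term[0] + "".join(deduped)
```

"careers", "Careers", "cAREERS", and "carers" all map to the same key ("cae"), so counting by this field tallies them together.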

web scraping for counting news about stocks

For an assignment I need to count the number of news items about different stocks and compare them in a graph. Below is the link to the site, and I attached an example for one of the stocks, where you can see 49 items for that specific stock.
What would be the best way to attack this? What type of package should I use (bs4 or another one), and what is the best approach here?
url='https://www.calcalist.co.il/stocks/home/0,7340,L-3959-1102532--4,00.html'
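bs4 (BeautifulSoup) is a reasonable fit for a static page like this. A sketch follows, with the caveat that `div.news-item` is a made-up selector; you would need to inspect the Calcalist page in the browser to find the real class of the news rows:

```python
from bs4 import BeautifulSoup

def count_news_items(html, selector="div.news-item"):
    """Count elements matching the (assumed) news-item CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(selector))

# Fetching would look like this (network call, shown for illustration only):
# import requests
# url = 'https://www.calcalist.co.il/stocks/home/0,7340,L-3959-1102532--4,00.html'
# print(count_news_items(requests.get(url).text))
```

If the page renders its list with JavaScript, bs4 alone won't see the items, and something like Selenium (or the site's underlying JSON endpoint, if one exists) would be needed instead.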

Unique Users in Google Analytics

I'm trying to get all unique visitors for a selected time period, but I want to filter them by date on the server. However, the sum of unique visitors for each day isn't the number of unique visitors for the time period.
For example:
Monday: 2 unique visitors
Tuesday: 3 unique visitors
The unique visitors for the two days period isn't necessarily 5.
Is there a way to get the results I want using the Google Analytics API (v3)?
You're right that Users aren't additive, so you can't simply add them day by day. There are several ways around this.
The first and most obvious is that if you've implemented the User-ID feature, you should be able to straight up pull and interrogate the data about which users saw your site on which days.
Another way I've implemented before is to dynamically pull the number of Users from the Google Analytics API whenever you need it. Obviously this only works if you're populating a live web dashboard or similar, but since it's just the one figure you're asking for, it wouldn't slow down the load time by much. E.g. if you're using a dashboarding tool such as Klipfolio, you may be able to define a dynamic data source and query Google whenever you need the figure (https://support.klipfolio.com/hc/en-us/articles/216183237-BETA-Working-with-dynamic-data-sources)
You could also limit the number of ways that the data can be interrogated, and calculate all of them. For example, if you only allow users to look at data month-by-month or day-by-day, then you only need those figures.
Finally, you can estimate the figure with reasonable accuracy by splitting it into two parts. New Users are equal to New Sessions (you're only new on your first Session), which is additive, so that figure can be separated out and combined as required.
Then, you could take a rough ratio of new to returning Users (% New Users) from, say, 1 year of data, and use that with the New Users figure to generate an average on any level.
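The arithmetic of that estimate is just a ratio. A tiny sketch (the 25% share below is an invented example figure, not a GA default):

```python
def estimate_users(new_users, pct_new_users):
    """Estimate total unique Users for a period from the additive New
    Users figure and a long-run share of new users (e.g. 0.25 if 25%
    of Users are historically new)."""
    if pct_new_users <= 0:
        raise ValueError("share of new users must be positive")
    return new_users / pct_new_users

# e.g. 400 New Users in the period, 25% of Users historically new:
# estimate_users(400, 0.25) -> 1600.0
```

Since New Users is additive across days, this works at any date granularity; only the ratio is an approximation.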

WordPress query multiple post_types/categories with weighted results

For a WordPress project I'm looking for a better solution to this problem:
The query should get a set of different post_types and taxonomies (like categories), based on the site visitor's choice. For example, the user wants results from normal posts, but also products (from WooCommerce) and other post_types like events and news (both separate post_types). The tricky part is that the user wants to assign a weight factor to each. So if they select posts = 1, products = 3, news = 4, they should get a base number of posts, three times as many products, and four times as many news items.
The next step will be to include categories, also with a weight factor, which makes it even more complex, because for posts I need to query a different taxonomy than for products.
The only way I have found to solve this is to run a separate query for each post_type (fetching, say, 10 items from posts, 30 from products, and 40 from news to match the weight factors) and then combine the results. But this will not scale well once I need pagination (for example, showing 50 entries on the first page and the next 50 on the second).
I thought about collecting these single queries into a temporary table, but such a table will be available for the current session only, so it won't help with the pagination (as the temporary table would no longer exist, when the second page is shown).
Does anybody have an idea how I could approach this task? I would like to do it according to the WordPress coding standards, using WP_Query and the provided filters, because the site also uses geolocation and WPML for translation, and I really want to avoid writing a low-level query where I have to handle all of that manually.
I'm not looking for a final solution, just want to collect some ideas.
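One idea for the pagination problem: since the weights are fixed, you can compute deterministically how many items of each post_type belong to any global page, then run one paginated WP_Query per type with the matching offset. The arithmetic is language-agnostic, so it is sketched here in Python (names are illustrative):

```python
def page_slices(weights, page, per_page):
    """For a 1-based page number, return {post_type: (offset, limit)}:
    how many items of each type to fetch, and from which offset, so
    that consecutive pages line up without overlap."""
    total_weight = sum(weights.values())
    start = (page - 1) * per_page
    end = page * per_page
    slices = {}
    for ptype, weight in weights.items():
        # share of this type among the first n global results: n*w/total
        first = start * weight // total_weight
        last = end * weight // total_weight
        slices[ptype] = (first, last - first)
    return slices

# weights posts=1, products=3, news=4, 50 entries per page:
# page 1 -> posts (0, 6), products (0, 18), news (0, 25)
```

Because of integer flooring a page can come up an item or two short (6 + 18 + 25 = 49 here); topping up from the heaviest type is one simple fix. In WordPress, each slice would map to one WP_Query using `post_type`, `offset`, and `posts_per_page`, so the standard filters (WPML, geolocation) still apply.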

How to setup Google Analytics account structure for large collection of sites

We manage a large collection of websites that each fit into one of a couple dozen categories. Some of the categories have hundreds of sites in them; the largest has over 800. I would like to set up my Google Analytics account structure to facilitate easily reporting on individual sites, categories, and overall enterprise-wide traffic.
For example, let's say our categories are food, sports, and politics each with 300 sites. I will need to be able to report on the number of site visits for each individual website, aggregate numbers for all food sites, all sports sites, and all political sites, and aggregate numbers for all 900 sites combined.
To get the overall numbers (all sites/categories combined) it seems easy enough to create a rollup account and use that tracking code along with another tracking code on each page. Where I'm struggling is how to get the category-wide numbers and the site-level numbers. I've thought about two approaches:
Create a custom dimension for category and then build custom reports keyed off the custom dimension. This would allow me to group the accounts however I see fit as long as the custom dimension is reported to the rollup account.
Create multiple properties for each category and associate each with a "lead" site for that category. This would require site-level reports being based on hostname or something similar.
It seems the limiting factor is each account cannot have more than 50 views. Following Google's advice to always keep an unfiltered view of your data plus views for whatever filters are desired means I will need dozens of accounts unless there is some creative approach I haven't thought about.
Any suggestions on how to best setup the accounts, properties and views to facilitate the site-level, category-level and enterprise-wide reports?
