Search movie database by rating - imdbpy

I've been trying to find a module or library (or API) that will let me query a movie database by rating. Rotten Tomatoes or IMDb are fine. I have requested a Rotten Tomatoes API key, and I do have an IMDb one, but I can't find any documentation on how to query the database and return a list of all movies with a certain rating.
If anyone has a solution, I'd very much appreciate it.

If you want to operate offline, you can consider the IMDbPY library, which lets you work with both the online and the offline data sets of IMDb.
Working offline, you basically use it to import the plain text data files distributed by IMDb into a SQL database, and then access that data (rating, number of votes and vote distribution are available).
The library doesn't let you query movies by rating programmatically, but once the data is imported, that information is easy to pull from the database itself.
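For instance, here is a minimal sketch using IMDbPY's online access system to read the rating data for a single title (listing all movies with a given rating is not exposed by the library, so for that you would query the SQL database built with imdbpy2sql.py directly; the IMDb ID below is just an example):
from imdb import IMDb

ia = IMDb()                      # online access system; IMDbPY also has access systems for a locally imported copy
movie = ia.get_movie('0133093')  # IMDb ID for The Matrix, without the "tt" prefix
print(movie.get('title'), movie.get('rating'), movie.get('votes'))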

Related

Which services to use to periodically load data from multiple data sources, aggregate it, and provide fast search?

Please propose a solution design for my case. The data comes from various sources, some from APIs, some from CSV files. A user will search using filters.
Ex: Product data (source 1) and Product Reviews (source 2). A user will search for a product by its name and rating.
Considerations:
Product reviews will keep changing, so the search should reflect that information.
In the future, more sources and additional filters will be added. Ultimately, only products will be shown to the users.
Product prices will change. The search results should give updated information.
The search should be very fast.
I looked into using Apache Airflow, MongoDB and Elasticsearch. However, I felt I was overcomplicating the solution.
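For reference, the Elasticsearch piece of such a stack can stay quite small: periodically merge each product (source 1) with its reviews (source 2) into one denormalized document and re-index it, so the user-facing search is a single fast query. The sketch below assumes the elasticsearch-py 8.x client; the index name, field names and cluster URL are placeholders, not a recommendation.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

product = {"id": "p1", "name": "Trail Running Shoe", "price": 89.99}  # from source 1 (API)
reviews = [{"rating": 5}, {"rating": 4}]                              # from source 2 (CSV)

# One denormalized document per product; re-index whenever the price or reviews change
doc = {
    **product,
    "avg_rating": sum(r["rating"] for r in reviews) / len(reviews),
    "review_count": len(reviews),
}
es.index(index="products", id=product["id"], document=doc)

# User-facing search: match on name, filter on rating
hits = es.search(
    index="products",
    query={
        "bool": {
            "must": [{"match": {"name": "running shoe"}}],
            "filter": [{"range": {"avg_rating": {"gte": 4}}}],
        }
    },
)
print(hits["hits"]["total"])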

Is it possible to scrape multiple data points from multiple URLs with data on different pages into a CSV?

I'm trying to build a directory on my website and want to get that data from SERPs. The sites from my search results could have data on different pages.
For example, I want to build a directory of adult sports leagues in the US. I get my SERPs to gather my URLs for leagues. Then from that list, I want to search those individual URLs for: name of league, location, sports offered, contact info, description, etc.
Each website will have that info in different places, obviously. But I'd like to be able to get the data I'm looking for (which not every site will have) and put that in a CSV and then use it to build the directory on my website.
I'm not a coder but trying to find out if this is even feasible from my limited understanding of data scraping. Would appreciate any feedback!
I've looked at some data scraping software and put requests on Fiverr with no response.
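It is feasible, but every site marks up its details differently, so a scraper needs per-site rules (or a lot of manual cleanup). The Python sketch below shows the general shape of the task, URL list in and CSV out; the URLs and CSS selectors are purely hypothetical placeholders and would have to be adapted per site.
import csv
import requests
from bs4 import BeautifulSoup

urls = ["https://example-league-1.com", "https://example-league-2.com"]  # gathered from your SERPs

def first_text(soup, selector):
    # Return the text of the first element matching a CSS selector, or "" if absent
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else ""

with open("leagues.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "name", "location", "contact"])
    for url in urls:
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        writer.writerow([
            url,
            first_text(soup, "h1"),                  # hypothetical: league name in the main heading
            first_text(soup, ".location"),           # hypothetical selector, varies per site
            first_text(soup, "a[href^='mailto:']"),  # grab an email link if one exists
        ])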

Ingesting Google Analytics data into S3 or Redshift

I am looking for options to ingest Google Analytics data (historical data as well) into Redshift. Any suggestions regarding tools or APIs are welcome. I searched online and found Stitch as one of the ETL tools; help me understand this option better, and other options if you have them.
Google Analytics has an API (the Core Reporting API). This is good for getting the occasional KPIs, but due to API limits it's not great for exporting large amounts of historical data.
For big data dumps it's better to use the link to BigQuery ("link" because I want to avoid the word "integration", which implies a larger level of control than you actually have).
Setting up the link to BigQuery is fairly easy: you create a project in the Google Cloud Console, enable billing (BigQuery comes with a fee; it's not part of the GA360 contract), add your email address as a BigQuery Owner in the "IAM & Admin" section, then go to your GA account and enter the BigQuery project ID in the GA Admin section under "Property Settings / Product Linking / All Products / BigQuery Link". The process is described here: https://support.google.com/analytics/answer/3416092
You can select between standard updates and streaming updates; the latter comes with an extra fee but gives you near real-time data. The former updates the data in BigQuery three times a day, every eight hours.
The exported data is not raw data; it is already sessionized (i.e. while you will get one row per hit, things like the traffic attribution for that hit will be session-based).
You will pay three different kinds of fees - one for the export to BigQuery, one for storage, and one for the actual querying. Pricing is documented here: https://cloud.google.com/bigquery/pricing.
Pricing depends on the region, among other things. The region where the data is stored might also be important when it comes to legal matters - e.g. if you have to comply with the GDPR, your data should be stored in the EU. Make sure you get the region right, because moving data between regions is cumbersome (you need to export the tables to Google Cloud Storage and re-import them in the proper region) and kind of expensive.
You cannot just delete data and do a new export - on your first export, BigQuery will backfill the data for the last 13 months; however, it will do this only once per view. So if you need historical data, better get this right, because if you delete data in BQ you won't get it back.
I don't actually know much about Redshift, but as per your comment you want to display data in Tableau, and Tableau directly connects to BigQuery.
We use custom SQL queries to get the data into Tableau (Google Analytics data is stored in daily tables, and custom SQL seems the easiest way to query data over many tables). BigQuery has a per-user cache that lasts 24 hours as long as the query does not change, so you won't pay for the query every time the report is opened. It is still a good idea to keep an eye on the cost - cost is based not on the result size but on the amount of data that has to be scanned to produce the wanted result, so if you query over a long timeframe and maybe do a few joins, a single query can run into the dozens of euros (multiplied by the number of users who run the query).
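As an illustration, a query over the daily export tables might look like the one below (shown through the BigQuery Python client; the project ID, dataset and date range are placeholders, and the dataset name is normally your GA view ID):
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumes default application credentials

sql = """
    SELECT trafficSource.source AS source, COUNT(*) AS sessions
    FROM `my-project.123456789.ga_sessions_*`
    WHERE _TABLE_SUFFIX BETWEEN '20230101' AND '20230131'
    GROUP BY source
    ORDER BY sessions DESC
"""

for row in client.query(sql):  # runs the query and iterates the result rows
    print(row.source, row.sessions)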
scitylana.com has a service that can deliver Google Analytics Free data to S3.
You can get 3 years or more.
The extraction is done through the API. The schema is hit level and has 100+ dimensions/metrics.
Depending on the amount of data in your view, I think this could be done with GA360 too.
Another option is to use Stitch's own specification, singer.io, and related open source packages:
https://github.com/singer-io/tap-google-analytics
https://github.com/transferwise/pipelinewise-target-redshift
The way you'd use them is piping data from one into the other:
tap-google-analytics -c ga.json | target-redshift -c redshift.json
I like Skyvia tool: https://skyvia.com/data-integration/integrate-google-analytics-redshift. It doesn't require coding. With Skyvia, I can create a copy of Google Analytics report data in Amazon Redshift and keep it up-to-date with little to no configuration efforts. I don't even need to prepare the schema — Skyvia can automatically create a table for report data. You can load 10000 records per month for free — this is enough for me.

Adobe Audience Manager External data

At my organization we are starting to use Adobe Audience Manager. We need to read online data from the website, but also to load data from our private database. Today we do it using FTP, but it actually takes almost 3 days to load all the information before we can use it, which is a lot of time for us. I would like to know the best way, or some alternatives, to load information in a more agile and fast way, and ideally to read it in as close to real time as possible from other sources (like our database or similar).
Thanks a lot for your help.
AAM offline data can be uploaded either on an FTP location or to an AWS S3 bucket, and unfortunately both of them take from 12 to 24 hours to load on AAM (Adobe Audience Manager), and then it takes another 12 to 24 hours to load them into your DSP (Demand Side Platform).
Given that the only real-time-like signals in AAM (that I know of) come from the online data sources, the best way to achieve your requirement is to do the following:
Send as much information as possible through the online channel.
Build an integration between your CRM data (the database in your case) and the online data (user behaviour data on your website).
The CRM data should contain the user details that do not change much, such as demographics (age, gender, ...etc.), and it should also contain the data that are collected via the non-online channels (e.g. retail purchases, customer service phone calls, ...etc.). On the other hand, the online data should contain all the user behaviour data, collected from the online channel. For example, the user search parameters, visited page names, purchased items, clicked links, …etc.
The integration between the online and CRM data can be done by using the same user ID in both activities. [Diagram: simple AAM integration, high-level view]
Here is an example of passing the user ID and online behavior data to AAM:
var user_id = "<add your website user ID here>"; // ex: user1234
// Add all your online data here
var my_object = {
    color : "blue",
    price : "900",
    page_name : "cart details"
};
// Create the DIL object
var adobe_dil = DIL.create({
    partner : "<partner name goes here>",
    declaredId : {
        dpid : '<add your online data source ID here>',
        dpuuid : user_id
    }
});
// Load the object, append "c_" to all keys in the key-value pairs, and send the data to Audience Manager.
adobe_dil.api.signals(my_object, "c_").submit();
And here is an example of the offline data upload
user1234 "age_range"="1","gender"="F","city"="LA","status"="active"
user5678 "age_range"="2","gender"="M","city"="CA","status"="inactive"
Another idea, which I haven't tried before and don't really recommend, is to send all your CRM data as online transactions by calling the online API directly from your back end. It may cost you more though, given the number of calls you will make to AAM from the back end.
References:
https://marketing.adobe.com/resources/help/en_US/aam/c_dil_send_page_objects.html
https://marketing.adobe.com/resources/help/en_US/aam/r_dil_create.html

University data aggregation

I have a client who wants to build a web application targeted towards college students. They want the students to be able to pick which class they're in from a valid list of classes and teachers. Websites like koofers, schedulizer, and noteswap all have lists from many universities that are accurate year after year.
How do these companies aggregate this data? Do these universities have some api for this specific purpose? Or, do these companies pay students from these universities to input this data every year?
We've done some of this for a client, and in each case we had to scrape the data. If you can get an API, definitely use it, but my guess is that the vast majority will need to be scraped.
I would guess that these companies have some kind of agreements and use an API for data exchange. If you don't have access to that API though you can still build a simple webscraper that extracts that data for you.

Resources