Scrapy best practice: Connect to database in crawler or in pipeline?

I am scraping a main page that has a list of items. Within my pipeline I connect to a database to store the items. My next task is to go to each individual item page and scrape comments. I need to connect to the database again to see if I've already scraped the comments.
Is it more efficient for me to connect to the database in the pipeline or in the crawl script?
Is there a way to return from the pipeline and tell the crawler to crawl the comments?
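For concreteness, here is a minimal sketch of the pipeline side of that trade-off: one connection opened when the spider starts and closed when it finishes. SQLite and the `items` table/columns are assumptions made for the example, since the question doesn't name the actual database or schema.

```python
# Sketch of the "connect in the pipeline" pattern, assuming SQLite and a
# hypothetical items table keyed by URL; adapt to your real DB and schema.
import sqlite3


class ItemStoragePipeline:
    def open_spider(self, spider):
        # One connection per crawl, opened when the spider starts.
        self.conn = sqlite3.connect("scraped.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items "
            "(url TEXT PRIMARY KEY, comments_scraped INTEGER DEFAULT 0)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Record what has been stored so a later crawl can check whether the
        # comments for this item still need to be fetched.
        self.conn.execute(
            "INSERT OR IGNORE INTO items (url) VALUES (?)", (item["url"],)
        )
        return item
```

For the "have I already scraped the comments?" check, one common arrangement is to query the same table from the spider before yielding the request for the item page, so that decision stays in the crawl code rather than being routed back out of the pipeline.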

Related

Is there any way to run Firebase realtime database queries with multiple keys?

I am working on a personal project to recreate the news feed of Facebook. What I am trying to do is recreate the scenario where, when the user goes to the news feed, they only get posts from the people they follow. Is there any way to run a query like that against the Firebase Realtime Database using a list of "followings"?
I can successfully display a single user's posts in the Android Studio app using a snapshot and a RecyclerView.
If you're asking whether you can get posts from multiple userUID values with a single query, that is not possible.
If you're asking whether you can pass a list of postUID values to retrieve, that is also not possible.
In both cases the solution is to execute a separate query/read operation for each of the values, and merge the results in your application code. This is not nearly as slow as you may think, since Firebase pipelines the requests over a single web socket connection - which is quite efficient. For more on this, see Speed up fetching posts for my social network app by using query instead of observing a single event repeatedly
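To illustrate the "one read per key, merge in application code" pattern, here is a sketch using the Python `firebase_admin` SDK. The `/posts/{uid}` layout and the `timestamp` field are assumptions made for the example; the question itself is an Android app, where the client SDK applies the same idea and pipelines the reads over its single connection.

```python
# Fan-out-and-merge sketch: one read per followed user, merged client-side.
import firebase_admin
from firebase_admin import credentials, db

cred = credentials.Certificate("service-account.json")
firebase_admin.initialize_app(
    cred, {"databaseURL": "https://your-project.firebaseio.com"}
)


def posts_for_followings(following_uids):
    merged = []
    for uid in following_uids:
        # One read per followed user; assumes posts are stored under /posts/{uid}.
        posts = db.reference(f"posts/{uid}").get() or {}
        merged.extend(posts.values())
    # Build the combined feed, e.g. newest first by a hypothetical timestamp field.
    merged.sort(key=lambda p: p.get("timestamp", 0), reverse=True)
    return merged
```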

Reference external data source from AI/Kusto query?

tl;dr: I want to reference an external data source from a Kusto query in Application Insights.
My application is writing logs to Application Insights, and we're querying it using Kusto in the Azure portal. To give an example of what I'm trying to do:
We're currently looking at these logs to find an event that is logged when a visitor views a blog post on our site. This is working well on a per-blog-post level, but now we want to group this data by the category these blog posts are in, or by the tags they have, and that's not information I have within the logs.
The information we log contains unique info about that blog post (unique url, our internal id, etc) that I could use to look up this information in another data source (e.g. our SQL DB where this relation is stored), but I have no idea if/how this is possible. So that's the question, is this possible? Can I query a SQL DB, or get data in JSON via a URL or something?
Alternative solutions would be to move the reporting elsewhere (e.g. PowerBI) and just use AI as a data source, or to actually log all the category/tag info, but I really don't want to go down that route.
Kusto supports accessing external data (blobs, Azure SQL, Cosmos DB); however, Application Insights / Azure Monitor and other multi-tenant services block this functionality due to security and resource-governance concerns.
You could try setting up your own Azure Data Explorer (Kusto) cluster, where this functionality is available, and then access your Application Insights data using a cross-cluster query, or by exporting the data from Application Insights and hooking up Event Grid ingestion into your Kusto cluster.
Relevant links:
Kusto supporting external data:
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/schema-entities/externaltables
Querying data inside Application Insights:
https://learn.microsoft.com/en-us/azure/data-explorer/query-monitor-data
Continuous export data from Application Insights:
https://learn.microsoft.com/en-us/azure/azure-monitor/app/export-telemetry
Data ingestion into Kusto from EventGrid:
https://learn.microsoft.com/en-us/azure/data-explorer/ingest-data-event-grid
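As a rough, untested sketch of what this looks like once you have your own Data Explorer cluster: a query run against that cluster can reach Application Insights through the ADE proxy URL described in the "Querying data inside Application Insights" link above, and join it with an external table defined over your Azure SQL DB. All of the names below (BlogPostMetadata, the BlogPostViewed event, the PostId/Category columns, the placeholder URLs) are made up for illustration.

```python
# Sketch: run a cross-cluster Kusto query from your own ADX cluster that joins
# Application Insights telemetry with an external table over Azure SQL.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

CLUSTER = "https://<your-adx-cluster>.<region>.kusto.windows.net"
AI_PROXY = (
    "https://ade.applicationinsights.io/subscriptions/<sub-id>"
    "/resourcegroups/<rg>/providers/microsoft.insights/components/<app-name>"
)

QUERY = f"""
let BlogPosts = external_table('BlogPostMetadata');   // external table defined on the ADX cluster over Azure SQL
cluster('{AI_PROXY}').database('<app-name>').customEvents
| where name == 'BlogPostViewed'
| extend PostId = tostring(customDimensions.PostId)
| join kind=inner BlogPosts on PostId
| summarize Views = count() by Category
"""

client = KustoClient(KustoConnectionStringBuilder.with_az_cli_authentication(CLUSTER))
for row in client.execute("<your-adx-database>", QUERY).primary_results[0]:
    print(row["Category"], row["Views"])
```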

How does a process running SQLite know that a particular page has been updated by another process?

How do two independent SQLite cache modules get notified about a change in the DB? More specifically, how does a cache module know that a page has to be fetched from disk because its content has been updated in the DB by some other process?
SQLite writes all changed pages when a transaction finishes; once another connection is allowed to read, there are no dirty pages.
To detect changes made by other connections, there is the file change counter in the database header. However, it does not apply to specific pages but to the entire database.
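As a small sketch of that change counter: it is a 4-byte big-endian field at offset 24 of the database header, bumped on commits, so comparing two readings only tells you "something changed", not which pages changed. (Behaviour differs under WAL mode, where commits land in the -wal file first.) The helper below also shows `PRAGMA data_version`, which gives a similar signal from within a connection.

```python
# Sketch: two ways to detect "the database was changed by someone else".
import sqlite3
import struct


def file_change_counter(db_path):
    # File change counter: 4-byte big-endian integer at offset 24 of the header.
    with open(db_path, "rb") as f:
        f.seek(24)
        return struct.unpack(">I", f.read(4))[0]


def data_version(db_path):
    # PRAGMA data_version changes when the database has been modified by
    # another connection; compare successive values on the same connection.
    with sqlite3.connect(db_path) as conn:
        return conn.execute("PRAGMA data_version").fetchone()[0]
```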

Putting records into the Elasticsearch index before the relational database

I have an application which consumes RSS feeds and makes them searchable by performing the following steps:
pulling article from the feed URL
storing that data in a relational DB
indexing the data in Elasticsearch
I'd like to reverse this process so that I can use the RSS River Elasticsearch plugin to pull data from feeds. However, this plugin integrates directly with Elasticsearch, bypassing my relational DB (which is a problem for other parts of the application which rely on each article having a record in the DB).
How can I have Elasticsearch notify the DB when a new article has been indexed (and de-indexed)?
Edit
Currently I'm using Ruby on Rails 4 with a PostgreSQL DB. RSS feeds are fetched in the background using Sidekiq to manage jobs. They go directly into PG and are then indexed by Elasticsearch. I'm using Chewy to provide an interface to the ES index. It doesn't support callbacks like I'm looking for (no Ruby library does afaik?).
A search queries ES for matches and then loads the matching records from PG to display the results.
It sounds like you are looking for the sort of notification/trigger functionality described in this feature request. In the absence of that feature I think the approach suggested in that thread by the user "cravergara" is your best bet - that is, you can alter the RSS river Elasticsearch plugin to update your DB whenever an article is indexed.
That would handle the indexing requirement. To sync the de-indexing, you should make sure that any code that deletes your Elasticsearch documents also deletes the corresponding DB records.
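Not the Ruby/Chewy stack from the question, but as a language-agnostic illustration of keeping the de-indexing in sync, here is a sketch of a single "remove article" code path that deletes both the Elasticsearch document and the DB row, so the two stores can't drift apart. The index, table, and column names are hypothetical.

```python
# Illustration only: de-index the Elasticsearch document and delete the
# corresponding DB row in one place. Names are placeholders.
import psycopg2
from elasticsearch import Elasticsearch, NotFoundError

es = Elasticsearch("http://localhost:9200")
pg = psycopg2.connect("dbname=app")


def remove_article(article_id):
    try:
        es.delete(index="articles", id=article_id)
    except NotFoundError:
        pass  # already de-indexed; still make sure the DB row goes away
    with pg.cursor() as cur:
        cur.execute("DELETE FROM articles WHERE id = %s", (article_id,))
    pg.commit()
```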

Outbound E-mail Profile API: Get list of e-mail addresses

I'm working on a Java console application that needs to go through all the e-mail addresses in the frontend database in Tridion Outbound E-mail 2011 and change a certain extended field of each contact.
I've gone through the Subscription API documentation for clues on how to get a listing of all the e-mail addresses, but I'm getting stuck there. Is there any clean way to do this through the API, without resorting to database queries?
It is not possible to get a list of Contacts using the Subscription API. It is meant primarily for working with single Contacts, who update their profile on your website.
For bulk management of Contacts, you should use Tridion.AudienceManagement.API on your Content Management server instead. The changes will then be synchronized to all of your websites.
You should not change anything directly in the database, as you will get issues with synchronization.
