Metascraper Consolidated Data - web-scraping

I'm using metascraper in a project I'm working on, passing custom rules into the constructor. It does scrape actual content from the page. The problem is that it appears to find every tag that matches the CSS selector and combine the text() content from all of them. I checked the metascraper website and GitHub and couldn't find any option that changes this behavior. I made sure that each scrape request creates a new instance of metascraper, in case it was reusing member variables across uses of the object, but that made no difference. Any thoughts?
Edit: Ideally, metascraper would also return an array of arrays, one per set of selector matches it finds. I have 4 selectors that appear together in groups throughout a page. I need it to iterate over the selectors in order until it cannot find any more instances of the 1st selector (i.e. the groups have stopped appearing on the page).
type4: async (page: Page): Promise<Extract[]> => {
  const html = await page.content()
  const url = await page.url()
  const type4MetascraperInstance = createType4MetaScraperInstance()
  const metadata = await type4MetascraperInstance({ html: html, url: url })
  console.log('metadata: ', metadata)
  const extract: Extract[] = [{
    fingerprint: 'type4',
    author: metadata.author,
    body: metadata.description,
    images: null,
    logo: null,
    product: null,
    rating: null,
    title: metadata.title,
    videos: null
  }]
  return extract
}
The function for creating the Type4 metascraper instance is:
function createType4MetaScraperInstance() {
  const toDescription = toRule(description)
  const toAuthor = toRule(author)
  const toTitle = toRule(title, { removeSeparator: false })
  const type4MetaScraperInstance = metaScraper([ {
    author: [
      toAuthor($ => $('.a-profile-name').text()),
    ],
    title: [
      toTitle($ => $('a[data-hook="review-title"] > span').text()),
    ],
    description: [
      toDescription($ => $('.review-text-content').text()),
    ]
  } ])
  return type4MetaScraperInstance
}

I decided to architect a different solution that uses a Python script to parse the reviews properly; it will need to read from and write to Google Cloud Datastore. Some people suggested writing my own calls to cheerio (https://cheerio.js.org/) instead of using metascraper at all.

Related

How to structurally compare the previous and new value of nested objects that are being used in `watch`, in Options API?

My question involves both the Composition API and the Options API.
What I want to do: watch an object that is deeply nested and holds all kinds of data types.
Whenever any of the nested properties inside change, I want the watch to be triggered.
(This can be done using the deep: true option.)
I also want to be able to see both the previous value and the current value of the object.
(This doesn't seem to be possible, because Vue stores object references, so value and prevValue point to the same object.)
The Vue 3 docs for the watch API say this:
However, watching a reactive object or array will always return a reference to the
current value of that object for both the current and previous value of the state.
To fully watch deeply nested objects and arrays, a deep copy of values may be required.
This can be achieved with a utility such as lodash.cloneDeep
The following example is given:
import _ from 'lodash'
import { reactive, watch } from 'vue'

const state = reactive({
  id: 1,
  attributes: {
    name: ''
  }
})

// The getter returns a fresh deep copy, so state and prevState are distinct objects
watch(
  () => _.cloneDeep(state),
  (state, prevState) => {
    console.log(state.attributes.name, prevState.attributes.name)
  }
)

state.attributes.name = 'Alex' // Logs: "Alex" ""
Link to docs here - https://v3.vuejs.org/guide/reactivity-computed-watchers.html#watching-reactive-objects
However, this is the Composition API (if I'm not wrong).
How do I use this cloneDeep approach in a watch defined with the Options API?
As an example, this is my code:
watch: {
  items: {
    handler(value, prevValue) {
      // check if value and prevValue are STRUCTURALLY EQUAL
      let isEqual = this.checkIfStructurallyEqual(value, prevValue)
      if (isEqual) return
      else this.doSomething()
    },
    deep: true,
  },
}
I'm using Vue 3 with Options API.
How would I go about doing this in Options API?
Any help would be appreciated! If there's another way of doing this then please do let me know!
I also asked this question on the Vue forums and it was answered.
We can use the same syntax as provided in the docs in Options API using this.$watch()
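import _ from 'lodash'

export default {
  data() {
    return {
      id: 1,
      attributes: {
        name: ''
      }
    }
  },
  created() {
    // Registering the watcher in created() is one option; any place with access to `this` works
    this.$watch(
      () => _.cloneDeep(this.attributes),
      (state, prevState) => {
        console.log(state.name, prevState.name)
      }
    )
    this.attributes.name = 'Alex' // Logs: "Alex" ""
  }
}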
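The handler in the question calls a checkIfStructurallyEqual method that is never shown. Here is one possible sketch in plain JavaScript, assuming the watched data contains only plain objects, arrays, and primitives. The function name isStructurallyEqual is made up for illustration; lodash's _.isEqual covers more cases (dates, maps, sets) if lodash is already a dependency.

```javascript
// Hypothetical helper: deep structural comparison for plain objects,
// arrays, and primitives. Not suitable for Date, Map, Set, or cyclic data.
function isStructurallyEqual(a, b) {
  if (Object.is(a, b)) return true;
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    key =>
      Object.prototype.hasOwnProperty.call(b, key) &&
      isStructurallyEqual(a[key], b[key])
  );
}

const prev = { id: 1, attributes: { name: '' }, tags: ['x'] };
const same = { id: 1, attributes: { name: '' }, tags: ['x'] };
const changed = { id: 1, attributes: { name: 'Alex' }, tags: ['x'] };

console.log(isStructurallyEqual(prev, same)); // true
console.log(isStructurallyEqual(prev, changed)); // false
```

Such a comparison only makes sense when the watcher receives two distinct snapshots, which is what the cloneDeep getter provides.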

Updating array within array redux

I've got state with a nested array that looks like the following:
{
  list: [
    {
      id: '3546f44b-457e-4f87-95f6-c6717830294b',
      title: 'First Nest',
      key: '0',
      children: [
        {
          id: '71f034ea-478b-4f33-9dad-3685dab09171',
          title: 'Second Nest',
          key: '0-0',
          children: [
            {
              id: '11d338c6-f222-4701-98d0-3e3572009d8f',
              title: 'Q. Third Nest',
              key: '0-0-0',
            }
          ],
        }
      ],
    }
  ],
  selectedItemKey: '0'
}
Where the goal of the nested array is to mimic a tree and the selectedItemKey/key is how to access the tree node quickly.
I wrote code to update the title of a nested item with the following logic:
let list = [...state.list];
let keyArr = state.selectedItemKey.split('-');
let idx = keyArr.shift();
let currItemArr = list;
while (keyArr.length > 0) {
  currItemArr = currItemArr[idx].children;
  idx = keyArr.shift();
}
currItemArr[idx] = {
  ...currItemArr[idx],
  title: action.payload
};
return {
  ...state,
  list
};
Things work properly for the first nested item, but for the second and third level nesting, I get the following Immer console errors
An immer producer returned a new value *and* modified its draft.
Either return a new value *or* modify the draft.
I feel like I'm messing up something pretty big here in regards to my nested array access/update logic, or in the way I'm trying to make a new copy of the state.list and modifying that. Please note the nested level is dynamic, and I do not know the depth of it prior to modifying it.
Thanks again in advance!
Immer allows you to modify the existing draft state OR return a new state, but not both at once.
It looks like you are trying to return a new state, which is OK so long as there is no mutation. However, you make a modification when you assign currItemArr[idx] = {...}. For nested levels this is a mutation, because list is only a shallow copy: once the loop descends into children, currItemArr refers to the very same array that lives inside state.list.
But you don't need to worry about shallow copies and mutations because the easier approach is to just modify the draft state and not return anything.
You just need to find the correct object and set its title property. I came up with a shorter way to do that using array.reduce().
const keyArr = state.selectedItemKey.split("-");
const target = keyArr.reduce(
  (accumulator, idx) => accumulator.children[idx],
  { children: state.list }
);
target.title = action.payload;
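If returning a new state is preferred over mutating the draft (for example in a reducer that is not wrapped by Immer), the nodes along the path can be copied level by level instead. This is only a sketch under the assumption that the state has the list/children/key shape shown in the question; updateTitleAtKey is a hypothetical helper name.

```javascript
// Hypothetical helper: returns a new list in which the node addressed by
// `key` (e.g. "0-0-0") has a new title. Only nodes on the path are copied;
// everything else is reused by reference.
function updateTitleAtKey(list, key, title) {
  const [head, ...rest] = key.split('-').map(Number);
  return list.map((node, i) => {
    if (i !== head) return node; // untouched siblings are reused as-is
    if (rest.length === 0) return { ...node, title };
    return {
      ...node,
      children: updateTitleAtKey(node.children || [], rest.join('-'), title),
    };
  });
}

const state = {
  list: [
    {
      title: 'First Nest',
      key: '0',
      children: [
        {
          title: 'Second Nest',
          key: '0-0',
          children: [{ title: 'Q. Third Nest', key: '0-0-0' }],
        },
      ],
    },
  ],
  selectedItemKey: '0-0-0',
};

const nextState = {
  ...state,
  list: updateTitleAtKey(state.list, state.selectedItemKey, 'Renamed'),
};

console.log(nextState.list[0].children[0].children[0].title); // "Renamed"
console.log(state.list[0].children[0].children[0].title); // "Q. Third Nest"
```

Because the original state is never touched, this version works for any nesting depth without triggering the Immer error, at the cost of being longer than the draft-mutating approach.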

Where to store Record meta data with Redux and Immutable JS

I switched over to a Redux + Immutable JS project from Ember a few months ago and am overall enjoying the experience.
One problem I still have not found a nice solution for when working with Records is storing meta data for that Record.
For example, let's say I have a User record:
const userRecord = Immutable.Record({
  id: null,
  name: '',
  email: ''
});
For the User, I may also wish to store properties like isLoading or isSaved. The first solution would be to store these in the userRecord. Although this would be the easiest solution by far, this feels wrong to me.
Another solution might be to create a User Map, which contains the User Record, as well as meta data about the User.
Ex.
const userMap = Immutable.Map({
  record: Immutable.Record({
    id: null,
    name: '',
    email: ''
  }),
  isLoading: false,
  isSaved: true
});
I think this is more elegant, but I don't like how all the user properties become even more deeply nested, so accessing User properties becomes very verbose.
What I miss most about Ember is being able to access Model properties easily.
Ex. user.get('isSaved') or user.get('name')
Is it possible to recreate something like this with Redux and Immutable? How have you approached this situation before?
I might be misunderstanding the problem, because you wrote:
What I miss most about Ember is being able to access Model properties easily.
Ex. user.get('isSaved') or user.get('name')
This does work for Immutable records.
If you don't want to add too many properties to your record, you could have a single status property and add some getters (assuming your statuses are mutually exclusive):
const STATUS = {
  INITIAL: 'INITIAL',
  LOADING: 'LOADING',
  SAVING: 'SAVING'
};

class UserRecord extends Immutable.Record({
  id: null,
  name: '',
  email: '',
  status: STATUS.INITIAL
}) {
  isLoading() {
    return this.get('status') === STATUS.LOADING;
  }

  isSaving() {
    return this.get('status') === STATUS.SAVING;
  }
}

new UserRecord().isLoading(); // returns false
new UserRecord({status: STATUS.LOADING}).isLoading(); // returns true
new UserRecord().set('status', STATUS.LOADING).isLoading(); // returns true

How to get multiple objects in list at a point in time

I want to provide my users with an API (pointing to my server) that will fetch data from Firebase and return it to them. I want it to be a 'normal' point-in-time request (as opposed to streaming).
My data is 'boxes' within 'projects'. A user can query my API to get all boxes for a project.
My data is normalised, so I will look up the project and get a list of keys of boxes in that project, then go get each box record individually. Once I have them all, I will return the array to the user.
My question: what is the best way to do this?
Here's what I have, and it works. But it feels so hacky.
const projectId = req.params.projectId; // this is passed in by the user in their call to my server.
const boxes = [];
let totalBoxCount = 0;
let fetchedBoxCount = 0;
const projectBoxesRef = db
  .child('data/projects')
  .child(projectId)
  .child('boxes'); // a list of box keys
function getBox(boxSnapshot) {
  totalBoxCount++;
  db
    .child('data/boxes') // a list of box objects
    .child(boxSnapshot.key())
    .once('value')
    .then(boxSnapshot => {
      boxes.push(boxSnapshot.val());
      fetchedBoxCount++;
      if (fetchedBoxCount === totalBoxCount) {
        res.json(boxes); // leap of faith that getBox() has been called for all boxes
      }
    });
}
projectBoxesRef.on('child_added', getBox);
// 'value' fires after all initial 'child_added' things are done
projectBoxesRef.once('value', () => {
  projectBoxesRef.off('child_added', getBox);
});
There are some other questions/answers on separating the initial set of child_added objects, and they have influenced my current decision, but they don't seem to relate directly.
Thanks a truck-load for any help.
Update: JavaScript version of Jay's answer below:
db
  .child('data/boxes')
  .orderByChild(`projects/${projectId}`)
  .equalTo(true)
  .once('value', boxSnapshot => {
    const result = boxSnapshot.val(); // some parsing of response goes here
    res.json(result);
  });
This may be too simple a solution, but suppose you have projects and each project has boxes.
Your projects node:
projects
  project_01
    boxes
      box_id_7: true
      box_id_9: true
      box_id_34: true
  project_37
    boxes
      box_id_7: true
      box_id_14: true
      box_id_42: true
and the boxes node
boxes
  box_id_7
    name: "a 3D box"
    shape: "Parallelepiped"
    belongs_to_project
      project_01: true
  box_id_14
    name: "I have unequal lengths"
    shape: "Rhumboid"
    belongs_to_project
      project_37: true
  box_id_34
    name: "Kinda like a box but with rectangles"
    shape: "cuboid"
    belongs_to_project
      project_01: true
With that, just one (deep) query on the boxes node will load all of the boxes that belong to project_01, which in this case is box_id_7 and box_id_34.
You could go the other way: since you know the box ids for each project from the projects node, you could set up a series of observers to load each box via its specific path (/boxes/box_id_7 etc.). I like the query better; it is faster and uses less bandwidth.
You could expand on this if a box can belong to multiple projects:
  box_id_14
    name: "I have unequal lengths"
    shape: "Rhumboid"
    belongs_to_project
      project_01: true
      project_37: true
Now a query on the boxes node for all boxes that are part of project_01 will return box_id_7, box_id_14 and box_id_34.
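To make the effect of such a query concrete, here is a plain-JavaScript sketch that filters an in-memory copy of the boxes node in the same way. It does not call Firebase at all; boxesNode mirrors the structure above (with box_id_14 in both projects) and boxesForProject is a hypothetical helper.

```javascript
// In-memory stand-in for the boxes node described above (illustration only).
const boxesNode = {
  box_id_7: { name: 'a 3D box', belongs_to_project: { project_01: true } },
  box_id_14: {
    name: 'I have unequal lengths',
    belongs_to_project: { project_01: true, project_37: true },
  },
  box_id_34: {
    name: 'Kinda like a box but with rectangles',
    belongs_to_project: { project_01: true },
  },
};

// Hypothetical helper: ids of all boxes whose belongs_to_project/<projectId> is true.
function boxesForProject(boxes, projectId) {
  return Object.keys(boxes).filter(boxId => {
    const membership = boxes[boxId].belongs_to_project || {};
    return membership[projectId] === true;
  });
}

console.log(boxesForProject(boxesNode, 'project_01')); // [ 'box_id_7', 'box_id_14', 'box_id_34' ]
console.log(boxesForProject(boxesNode, 'project_37')); // [ 'box_id_14' ]
```

The real query performs this filtering on the server side, so only the matching boxes are sent over the wire.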
Edit:
Once that structure is in place, use a Deep Query to then get the boxes that belong to the project in question.
For example: suppose you want to craft a Firebase Deep Query to return all boxes where the box's belongs_to_project list contains an item with key "project_37"
boxesRef.queryOrderedByChild("belongs_to_project/project_37")
    .queryEqualToValue(true)
    .observeSingleEventOfType(.Value, withBlock: { snapshot in
        print(snapshot)
    })
OK I think I'm happy with my approach, using Promise.all to respond once all the individual 'queries' are returned:
I've changed my approach to use promises, then call Promise.all() to indicate that all the data is ready to send.
const projectId = req.params.projectId;
const boxPromises = [];
const projectBoxesRef = db
  .child('data/projects')
  .child(projectId)
  .child('boxes');
function getBox(boxSnapshot) {
  boxPromises.push(db
    .child('data/boxes')
    .child(boxSnapshot.key())
    .once('value')
    .then(boxSnapshot => boxSnapshot.val())
  );
}
projectBoxesRef.on('child_added', getBox);
projectBoxesRef.once('value', () => {
  projectBoxesRef.off('child_added', getBox);
  Promise.all(boxPromises).then(boxes => res.json(boxes));
});
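The same Promise.all idea can be sketched without the child_added listener by reading the key list once and mapping each key to a promise. The example below is a self-contained illustration, not Firebase code: fakeDb, fetchBoxKeys, fetchBox and getBoxesForProject are all hypothetical stand-ins for the database reads above.

```javascript
// Hypothetical in-memory data standing in for the database.
const fakeDb = {
  projects: { project_01: { boxes: { box_id_7: true, box_id_34: true } } },
  boxes: {
    box_id_7: { name: 'a 3D box' },
    box_id_34: { name: 'Kinda like a box but with rectangles' },
  },
};

// Stand-in for a single read of the project's list of box keys.
function fetchBoxKeys(projectId) {
  const project = fakeDb.projects[projectId];
  return Promise.resolve(Object.keys(project ? project.boxes : {}));
}

// Stand-in for reading one box record.
function fetchBox(boxKey) {
  return Promise.resolve(fakeDb.boxes[boxKey]);
}

// Read the key list once, start all box reads in parallel, and resolve
// when every read has finished. Results keep the order of the keys.
function getBoxesForProject(projectId) {
  return fetchBoxKeys(projectId).then(keys =>
    Promise.all(keys.map(key => fetchBox(key)))
  );
}

const boxesPromise = getBoxesForProject('project_01');
boxesPromise.then(boxes => console.log(boxes.map(box => box.name)));
```

In a request handler, the final step would be res.json(boxes) instead of console.log, with a catch that sends an error response if any single read fails.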

Meteor Framework Subscribe/Publish according to document variables

I have a game built on Meteor framework. One game document is something like this:
{
  ...
  participants : [
    {
      "name":"a",
      "character":"fighter",
      "weapon" : "sword"
    },
    {
      "name":"b",
      "character":"wizard",
      "weapon" : "book"
    },
    ...
  ],
  ...
}
I want the fighter not to see the character of user "b" (and "b" not to see a's). There are about 10 fields like character and weapon; their values can change during the game, and so can the restrictions.
Right now I am using Session variables to avoid displaying that information. However, that is not very safe. How can I subscribe/publish documents according to the values based on characters?
There are 2 possible solutions that come to mind:
1. Publishing all combinations for different field values and subscribing according to the current state of the user. However, I am using Iron Router's waitOn feature to load subscriptions before rendering the page, so I am not confident that I can change subscriptions during the game. Also, because it is a time-sensitive game, I suspect changing subscriptions mid-game would take time and spoil the experience.
My problem right now is that a user can type
Collection.find({})
into the console and see other users' fields. If I change my collection name to something difficult to guess, can somebody still discover it? I could not find a command that lists collections on the client side.
The way this is usually solved in Meteor is by using two publications. If your game state is represented by a single document, this may be hard to implement, so for the sake of an example I will temporarily assume that you have a Participants collection in which you're storing the corresponding data.
So anyway, you should have one subscription with data available to all the players, e.g.
Meteor.publish('players', function (gameId) {
  return Participants.find({ gameId: gameId }, { fields: {
    // exclude the "character" field from the result
    character: 0
  }});
});
and another subscription for private player data:
Meteor.publish('myPrivateData', function (gameId) {
  // NOTE: not excluding anything, because we are only
  //       publishing a single document here, whose owner
  //       is the current user ...
  return Participants.find({
    userId: this.userId,
    gameId: gameId,
  });
});
Now, on the client side, the only thing you need to do is subscribe to both datasets, so:
Meteor.subscribe('players', myGameId);
Meteor.subscribe('myPrivateData', myGameId);
Meteor will be clever enough to merge the incoming data into a single Participants collection, in which other players' documents will not contain the character field.
EDIT
If your fields' visibility is going to change dynamically, I suggest the following approach:
put all the restricted properties in a separate collection that tracks exactly who can view which field
on the client side, use observe to integrate that collection into your local player representation for easier access to the data
Data model
For example, the collection may look like this:
PlayerProperties = new Mongo.Collection('playerProperties');
/* schema:
userId : String
gameId : String
key : String
value : *
whoCanSee : [String]
*/
Publishing data
First you will need to expose each player's own properties:
Meteor.publish('myProperties', function (gameId) {
  return PlayerProperties.find({
    userId: this.userId,
    gameId: gameId
  });
});
then the other players' properties:
Meteor.publish('otherPlayersProperties', function (gameId) {
  if (!this.userId) return [];
  return PlayerProperties.find({
    gameId: gameId,
    whoCanSee: this.userId,
  });
});
Now the only thing you need to do during the game is to make sure you add the corresponding userId to the whoCanSee array as soon as that user gets the ability to see the property.
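Taken together, the two publications select the documents a player owns plus the documents that list that player in whoCanSee. The following plain-JavaScript sketch reproduces that selection on an in-memory array, purely to illustrate the rule; it is not Meteor code, and propertiesVisibleTo and the sample data are assumptions.

```javascript
// Assumed document shape: { userId, gameId, key, value, whoCanSee: [userId, ...] }
// Hypothetical helper mirroring the combined effect of both publications.
function propertiesVisibleTo(propertyDocs, viewerId, gameId) {
  return propertyDocs.filter(doc =>
    doc.gameId === gameId &&
    (doc.userId === viewerId || doc.whoCanSee.includes(viewerId))
  );
}

const docs = [
  { userId: 'a', gameId: 'g1', key: 'character', value: 'fighter', whoCanSee: [] },
  { userId: 'b', gameId: 'g1', key: 'character', value: 'wizard', whoCanSee: [] },
  { userId: 'b', gameId: 'g1', key: 'weapon', value: 'book', whoCanSee: ['a'] },
];

const visibleToA = propertiesVisibleTo(docs, 'a', 'g1');
console.log(visibleToA.map(doc => `${doc.userId}.${doc.key}`)); // [ 'a.character', 'b.weapon' ]
```

Here player a sees their own character and b's weapon, but not b's character, because a was never added to that document's whoCanSee array.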
Improvements
In order to keep your data in order I suggest having a client-side-only collection, e.g. IntegratedPlayerData, which you can use to arrange the player properties into some manageable structure:
var IntegratedPlayerData = new Mongo.Collection(null);
var cache = {};
PlayerProperties.find().observe({
  added: function (doc) {
    IntegratedPlayerData.upsert({ _id : doc.userId }, {
      $set: _.object([ doc.key ], [ doc.value ])
    });
  },
  changed: function (doc) {
    IntegratedPlayerData.update({ _id : doc.userId }, {
      $set: _.object([ doc.key ], [ doc.value ])
    });
  },
  removed: function (doc) {
    IntegratedPlayerData.update({ _id : doc.userId }, {
      $unset: _.object([ doc.key ], [ true ])
    });
  }
});
This data "integration" is only a draft and can be refined in many different ways. It could potentially be done on server-side with a custom publish method.
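The end result of that integration can be pictured with a small, Meteor-independent sketch: each property document contributes one field to the object of the player it belongs to. The function name integrateProperties and the sample documents are assumptions made for illustration.

```javascript
// Plain-JavaScript sketch of the "integration" step, independent of Meteor.
// Each property document has the assumed shape { userId, key, value }.
function integrateProperties(propertyDocs) {
  const players = {};
  for (const doc of propertyDocs) {
    if (!players[doc.userId]) players[doc.userId] = { _id: doc.userId };
    players[doc.userId][doc.key] = doc.value;
  }
  return players;
}

// Documents the current client is allowed to see (sample data).
const visibleDocs = [
  { userId: 'a', key: 'character', value: 'fighter' },
  { userId: 'a', key: 'weapon', value: 'sword' },
  { userId: 'b', key: 'weapon', value: 'book' },
];

const integrated = integrateProperties(visibleDocs);
console.log(integrated.a); // { _id: 'a', character: 'fighter', weapon: 'sword' }
console.log(integrated.b); // { _id: 'b', weapon: 'book' }
```

Fields the client was never sent simply do not appear on the integrated object, so templates can read player data without knowing about the visibility rules.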
