Web Scraping in React & MongoDB Stitch App - web-scraping

I'm moving a MERN project into React + MongoDB Stitch after seeing it allows for easy user authentication, quick deployment, etc.
However, I am having a hard time understanding where and how I can call a site-scraping function. Previously, I scraped pages in Express.js with cheerio like this:
app.post("/api/getTitleAtURL", (req, res) => {
if (req.body.url) {
request(req.body.url, function(error, response, body) {
if (!error && response.statusCode == 200) {
const $ = cheerio.load(body);
const webpageTitle = $("title").text();
const metaDescription = $("meta[name=description]").attr("content");
const webpage = {
title: webpageTitle,
metaDescription: metaDescription
};
res.send(webpage);
} else {
res.status(400).send({ message: "THIS IS AN ERROR" });
}
});
}
});
But obviously with Stitch no Node & Express is needed. Is there a way to fetch another site's content without having to host a node.js application just serving that one function?
Thanks

Turns out you can build Functions in MongoDB Stitch that allow you to upload external dependencies.
However, there are limitations: for example, cheerio didn't work as an uploaded external dependency, while request did. One solution, therefore, is to create a serverless function in AWS Lambda and then connect MongoDB Stitch to it (MongoDB Stitch can connect to many third-party services, including AWS services such as Lambda, S3, and Kinesis).
AWS Lambda lets you upload any external dependencies. If MongoDB Stitch supported all of them, we wouldn't need Lambda, but Stitch's dependency support is still limited. In my case, I had a Node function with cheerio and request as external dependencies. To upload it to Lambda: make an account, create a new Lambda function, and pack your node modules and code into a zip file to upload. Your zip should look like this:
and your file containing the function should look like:
const cheerio = require("cheerio");
const request = require("request");
// Lambda handler: fetches event.requestURL and reports the page title and meta description.
exports.rss = function(event, context, callback) {
  request(event.requestURL, function(error, response, body) {
    if (!error && response.statusCode == 200) {
      const $ = cheerio.load(body);
      const webpageTitle = $("title").text();
      const metaDescription = $("meta[name=description]").attr("content");
      const webpage = {
        title: webpageTitle,
        metaDescription: metaDescription
      };
      // Hand the result back through the callback; a value returned from
      // inside the request callback would be ignored.
      callback(null, webpage);
    } else {
      callback(null, { message: "THIS IS AN ERROR" });
    }
  });
};
In MongoDB Stitch, connect to a third-party service, choose AWS, and enter the secret keys you got from creating an AWS IAM user. In Rules -> Actions, choose Lambda as your API and allow all actions. Now your MongoDB Stitch functions can connect to Lambda. In my case, the function looks like this:
exports = async function(requestURL) {
const lambda = context.services.get('getTitleAtURL').lambda("us-east-1");
const result = await lambda.Invoke({
FunctionName: "getTitleAtURL",
Payload: JSON.stringify({requestURL: requestURL})
});
console.log(result.Payload.text());
return EJSON.parse(result.Payload.text());
};
Note: this slowed performance down considerably, though; generally, the call took roughly twice as long to finish.
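On the React side, a Stitch function like the one above can be called through the Stitch SDK's callFunction method. The following is only a minimal sketch: it assumes the Stitch function was saved under the hypothetical name getTitleAtURL and that client is an already-authenticated Stitch app client; neither detail is stated in the answer above.

```javascript
// Sketch: call a Stitch function from client code.
// Assumptions (not from the original answer): the Stitch function is named
// "getTitleAtURL", and `client` is an authenticated Stitch app client
// exposing callFunction(name, args).
async function fetchPageInfo(client, url) {
  if (!url) {
    throw new Error("A URL is required");
  }
  // Arguments are passed as an array, in the order the Stitch function declares them.
  const info = await client.callFunction("getTitleAtURL", [url]);
  return { title: info.title, metaDescription: info.metaDescription };
}
```

Taking the client as a parameter keeps the helper independent of how the app initializes Stitch, and makes it easy to exercise with a stub.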

Related

Build fails while building SSR/ISR pages with new API routes

I am running into issues when building new ISR/SSR pages with getStaticProps and getStaticPaths.
Brief explanation:
When I create ISR/SSR pages and add a new API route that never existed before, the build on Vercel fails because the pages are built before the API routes (the /pages/api folder) are available.
Detailed explanation:
A. Creating a Next.js SSR page with this code (/pages/item/[pid].tsx)
export async function getStaticProps(context) {
const pid = context.params.pid;
//newly created API route
const res = await fetch(process.env.APIpath + '/api/getItem?pid=' + (pid));
const data = await res.json();
return {
props: {item: data}
}
}
export async function getStaticPaths(context) {
//newly created API route
let res = await fetch(process.env.APIpath + '/api/getItemsList')
const items = await res.json()
let paths = []
//multi-language support for the pages
for (const item of items){
for (const locale of context.locales){
paths.push({params: {pid: item.url }, locale: locale})
}
}
return { paths, fallback: false }
}
B. Local checks work, so I deploy to Vercel
C. During deployment, Vercel throws an error because it tries to get data from an API route that doesn't exist yet (Vercel deploys /pages/item/[pid].tsx first and the /api/getItemsList file after). Vercel tries to get data from https://yourwebsite.com/api/getItemsList, which does not exist.
The only way I have found to avoid this error:
1. Create the API routes needed
2. Deploy the project to Vercel
3. Create the [pid].tsx page(s) and then deploy them
4. Deploy the final version of the code
The big issue with my approach is that you make one deployment you don't actually need. The problem also appears when reworking the code of your API routes.
Question: is there a way to force Vercel to deploy the routes first and then the pages?
Any help appreciated

access to places api from a cloud function

As part of my Firebase app, I'm using a cloud function to get data from the Google Places API.
For some reason, I'm getting 403 errors when trying to retrieve data, even though the service account I'm using is the default one (App Engine default service account with Editor role), which seems to exist on the API credentials list and also on the specific cloud function I'm using.
Here's the code I'm using to retrieve data from the API -
class GoogleMapsRestApiClass {
client = new Client({});
getPlaceInfo(placeId: string) {
return this.client.placeDetails({
params: {
place_id: placeId,
fields: ["name", "rating", "geometry", "photo"],
key: environment.googleMapsJsApi.apiKey
}
} as PlaceDetailsRequest);
}
}
export const GoogleMapsRestApi = new GoogleMapsRestApiClass();
and the cloud function itself -
export const place = functions.https.onRequest(async (request, response) => {
const placeId = request.query.place_id as string;
const resp: AxiosResponse = await GoogleMapsRestApi.getPlaceInfo(placeId);
const result = resp.data.result;
response.send({result});
});
Any ideas what I'm missing here?
Update -
If I don't restrict the API key, I do manage to retrieve the data (I had restricted it to my host address).
How should I protect the API key used by the cloud function?

How to create a VM instance from an instance template within Cloud Functions?

I need to start a Google Compute instance based off a template I've made which has a startup script which downloads the latest game-server executable from my server and runs it. This all works perfectly fine.
Now, my custom-built matchmaker will determine if all current games (which are instances of the game server) are full, and if so I want it to run a Cloud Function that creates another new instance from the template I've mentioned above (which basically acts as a lobby/game for 12 players). Once the instance is created, I need the Cloud Function to return the IP of the newly created instance back to whatever called it (which would be my game).
I know the first part is possible via HTTP POST but I cannot find anywhere in the cloud functions docs/compute docs/admin SDK docs that allows me to create instances and get the IP, is this possible?
EDIT: I have found this documentation but I have not yet found a function to start a VM from a template which then returns the VM's object - which includes its IP...
You can use the APIs directly. First create the VM, then wait for it to reach the RUNNING state to get the internal and external IPs.
const {GoogleAuth} = require('google-auth-library');
async function main() {
  const auth = new GoogleAuth({
    scopes: 'https://www.googleapis.com/auth/cloud-platform'
  });
  const client = await auth.getClient();
  const url = `https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/europe-west1-b/instances`;
  const template = 'projects/PROJECT_ID/global/instanceTemplates/instance-template-1';
  const instanceName = 'example-instance';
  const body = JSON.stringify({ name: instanceName });
  // Create the VM from the instance template
  let res = await client.request({
    url: url + "?sourceInstanceTemplate=" + template,
    method: 'POST',
    headers: {
      'Content-Type': 'application/json'
    },
    body: body
  });
  // Poll the instance until it reports RUNNING
  res = await client.request({ url: url + "/" + instanceName, method: 'GET' });
  while (res.data['status'] != 'RUNNING') {
    // A bare setTimeout does not pause the loop; await a promise that resolves after 1 second
    await new Promise(resolve => setTimeout(resolve, 1000));
    res = await client.request({ url: url + "/" + instanceName, method: 'GET' });
  }
  // Internal IP (interface 0)
  console.log(res.data.networkInterfaces[0].networkIP);
  // External IP
  console.log(res.data.networkInterfaces[0].accessConfigs[0].natIP);
}
main().catch(console.error);
My NodeJs skill is low (style, format, idioms,...), but this works.
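Because a Cloud Function has a limited execution time, the "wait for the running state" step may be safer with an upper bound. The following is a rough sketch of a bounded polling helper, not part of the answer above; the names waitForStatus, fetchStatus, intervalMs and timeoutMs are made up for illustration.

```javascript
// Sketch: poll an async status check until it returns the wanted value
// or a deadline passes. All names here are illustrative.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function waitForStatus(fetchStatus, wanted, { intervalMs = 1000, timeoutMs = 120000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const status = await fetchStatus();
    if (status === wanted) {
      return status;
    }
    // Give up if another wait would take us past the deadline.
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`Timed out waiting for status ${wanted}; last status was ${status}`);
    }
    await sleep(intervalMs);
  }
}
```

In the code above, fetchStatus could be a function that performs the GET request on the instance URL and returns res.data.status, with 'RUNNING' as the wanted value.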

NextJS special characters routes do not work from browser

Using NextJS, I am defining some routes in getStaticPaths by making an API call:
/**
* #dev Fetches the article route and exports the title and id to define the available routes
*/
const getAllArticles = async () => {
const result = await fetch("https://some_api_url");
const articles = await result.json();
return articles.results.map((article) => {
const articleTitle = `${article.title}`;
return {
params: {
title: articleTitle,
id: `${article.id}`,
},
};
});
};
/**
* #dev Defines the paths available to reach directly
*/
export async function getStaticPaths() {
const paths = await getAllArticles();
return {
paths,
fallback: false,
};
}
Everything works most of the time: I can access most of the articles, Router.push works with all URLs defined.
However, when the article name includes a special character such as &, Router.push keeps working, but copy/pasting the URL that worked from inside the app to another tab returns a page:
An unexpected error has occurred.
In the Network tab of the inspector, a 404 error for the GET request appears.
The component code is mostly made of API calls such as:
await API.put(`/set_article/${article.id}`, { object });
With API being defined by axios.
Any idea why it happens and how to make the getStaticPaths work with special characters?
When you transport values in URLs, they need to be URL-encoded. (When you transport values in HTML, they need to be HTML encoded. In JSON, they need to be JSON-encoded. And so on. Any text-based system that can transport structured data has an encoding scheme that you need to apply to data. URLs are not an exception.)
Turn your raw values in your client code
await API.put(`/set_article/${article.id}`)
into encoded ones
await API.put(`/set_article/${encodeURIComponent(article.id)}`)
It might be tempting, but don't pre-encode the values on the server-side. Do this on the client end, at the time you actually use them in a URL.
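As a small illustration of what the encoding does to a value containing &, here is a sketch with a made-up title and a hypothetical /articles/ path (neither comes from the question):

```javascript
// Sketch: build a URL path from a raw value that contains special characters.
// The title and the /articles/ prefix are made-up example values.
const rawTitle = "Tom & Jerry";
const encodedTitle = encodeURIComponent(rawTitle);
const path = `/articles/${encodedTitle}`;
console.log(path); // /articles/Tom%20%26%20Jerry

// Decoding restores the original value on the receiving side.
console.log(decodeURIComponent(encodedTitle)); // Tom & Jerry
```

Without the encoding, the & would be read as a separator rather than as part of the value.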

Google cloud function returning 204 status when accessing realtime database

I have a website for testing purposes hosted via Firebase, storing client information in a Realtime Database which needs to be accessed later. When I do this via a single HTML document with a script that accesses my Realtime Database, I am able to find the information successfully, but when I copied and pasted that same logic into a cloud function it did not work. I have tried everything I can think of, and now when I run the function it executes twice (I am not sure why). The first execution finishes with an HTTP 204 status (no content). The second execution returns an HTTP 500 internal server error. When I checked the logs on Firebase, it said the error was because "accounts.getValue() is not a function". I think what is happening is that on the first execution the function is unable to locate accounts, and it executes again without trying to find the accounts, which might be why it can't run accounts.getValue().
I guess my main question is why is my function unable to locate accounts?
geturl is the function I am having trouble with
The structure of my realtime database is
database name
-accounts
-some data
-more data
-more account data
-ActiveQRs
-some data...
My index.js file for cloud functions is
const functions = require('firebase-functions');
const express = require('express');
const cors = require('cors')({origin: true});
var firebase = require("firebase");
var admin = require("firebase-admin");
require("firebase/auth");
require("firebase/database");
//require("firebase/firestore");
//require("firebase/messaging");
require("firebase/functions");
var serviceAccount = require("./serviceKey.json");
// Initialize the app with a service account, granting admin
//privileges
admin.initializeApp({
credential: admin.credential.cert(serviceAccount),
databaseURL: "https://databaseName.firebaseio.com"
});
const displayqr = express();
const geturl = express();
displayqr.get('/displayqr', (request, response) => {
console.log("response sent");
response.send("testio/qrdisplay.html");
});
exports.displayqr = functions.https.onRequest(displayqr);
exports.geturl = functions.https.onCall((email) => {
const mail = email.toString();
var result = "";
result = result + mail;
var accounts =
admin.database().ref("livsuiteform/accounts");
result = (accounts.getValue());
accounts.orderByKey().on("value", function(snapshot) {
snapshot.forEach(function(data) {
if (data.child("Email").val() == mail) {
var firstName = data.child("FirstName").val();
var lastName = data.child("LastName").val();
result = firstname;
result = "if loop entered";
} // end if
// return "name not found";
}); // end for each
}); // end order by
return result;
});
TLDR; follow this tutorial on how to build and deploy callable functions for your mobile app.
There are multiple reasons for why your functions aren't working as you expect.
1. You are including the client-side version of Firebase (var firebase = require("firebase");). You shouldn't use or even require the client-side version. Instead, just use Firebase Admin (docs) to access any data. If you need certain user permissions when accessing the DB from the Admin SDK, here is a good example of how to achieve that (scroll down to "You can still perform user-authorized changes...").
2. You have mixed different Admin SDK references. getValue() is part of the Admin SDK for Java. You should use the JavaScript equivalent, val(). Also, in your code, accounts is a Reference and not a DataSnapshot.
3. You aren't returning your Promises. This can be a source of inconsistency in your function execution later (SO question).
4. You aren't returning anything from your initial function. If you don't return anything, then nothing will get returned to your app. The solution is the same as 3's solution: return your Promise.
5. You shouldn't use on in Firebase Functions. You should use once. The difference is that on doesn't return a Promise while once does. Instead, on returns the callback function, which is used to detach the listener.
I know this is a lot of bullet points pointing out problems in your code, but I just didn't want to give a shallow answer that resulted in you asking another question and waiting another ~2 hours (at the time of writing) for an answer.
I hope this helps!
Code
exports.geturl = functions.https.onCall((email) => {
const mail = email.toString();
var result = "";
result = result + mail;
var accounts = admin.database().ref("livsuiteform/accounts");
return accounts.orderByKey().once("value")
.then(function (snapshot) {
snapshot.forEach(function (data) {
if (data.child("Email").val() == mail) {
var firstName = data.child("FirstName").val();
var lastName = data.child("LastName").val();
result = firstName;
result = "if loop entered";
} // end if
// return "name not found";
}); // end for each
return result;
}); // end order by
});
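The lookup inside the forEach above can also be written as a small pure function over the plain object that snapshot.val() returns, which makes it easy to check in isolation. This is only a sketch: the function name findFirstNameByEmail is made up, and it assumes each account is an object carrying the Email and FirstName fields used in the question.

```javascript
// Sketch: find the first name for an email in an accounts object.
// `accounts` is assumed to have the shape returned by snapshot.val():
// { <accountKey>: { Email: "...", FirstName: "...", ... }, ... }
function findFirstNameByEmail(accounts, mail) {
  // snapshot.val() is null when the location holds no data.
  for (const account of Object.values(accounts || {})) {
    if (account && account.Email === mail) {
      return account.FirstName;
    }
  }
  return null; // no account with that email
}
```

Inside the callable function, this could be used as return accounts.once("value").then((snapshot) => findFirstNameByEmail(snapshot.val(), mail));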

Resources