How to pass any URL to an APIFY task? - web-scraping

There is a box to configure the "Start URL" in APIFY, but what happens if I don't know the start URL in advance because it depends on user input? I would like to be able to pass a variable URL to "Start URL".
Configuration of Start URL in APIFY:
I want to pass any URL automatically to an APIFY task and then scrape it.
I tried to do this automatically through Zapier. In its configuration it is possible to select the URL input and pass it to APIFY, but the task ultimately stops because it cannot read the format that was passed. Data out log from Zapier:
I think APIFY probably allows dynamic input URLs, but as a beginner there is probably something that escapes my knowledge.
I want to be able to pass variable URLs to be scraped by APIFY.

You can check what the input looks like in JSON format by using the Editor/JSON switcher at the top of the input configuration.
After you switch to JSON, you can easily check the structure of startUrls.
If you want to override startUrls, for example in the Zapier integration, you can do so using the Input JSON overrides field in the Run Task action of the Apify<>Zapier integration.
You can override the input the same way when using the API to run the task: pass the JSON as the POST payload of the API request.
If you want to read more about the Apify<>Zapier integration, see the article Scrape single URL using Zapier.
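As a rough sketch of the API route mentioned above (not an official client), the snippet below only builds the POST request that would run a task with an overridden startUrls, without sending it. The task ID, the token and the example URL are placeholders, and the shape of startUrls (a list of objects with a url key) is an assumption that should be checked against the JSON view of your own task input.

```python
import json
import urllib.request


def build_task_run_request(task_id: str, token: str, start_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that runs an Apify task
    with startUrls overridden by a single URL."""
    # Assumed input shape: verify it in the Editor/JSON switcher of your task.
    payload = {"startUrls": [{"url": start_url}]}
    return urllib.request.Request(
        url=f"https://api.apify.com/v2/actor-tasks/{task_id}/runs",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder values for illustration only.
request = build_task_run_request("my-user~my-task", "MY_API_TOKEN", "https://example.com/page")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

To actually start the run, the request would be passed to urllib.request.urlopen(request). The same JSON object is what would go into the Input JSON overrides field in Zapier.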

Related

language change in nextjs without changing url

Is there any way to switch language in Next.js without passing a language parameter in the URL, like baseurl/ar or baseurl/en? If I want to change the language from a dropdown, the URL should not change.
Query parameters are defined by the routes of your API, so when you change the URL, a new call is made to the API with the parameters given in the route.
You can pass objects to the API route instead, something like:
const response = await api.get('yourUrl', { language: 'en' })
This way the information will not appear in the URL, but the backend needs to be changed so that it knows where to get the parameters from.
You can also choose to keep the translations in JSON files and just switch between them.
Hope this helps.

Purpose of tilde delimited values in URL fragment instead of GET params

I came across an unusual URL structure on a site. It looked like this:
https://www.agilealliance.org/glossary/xp/#q=~(infinite~false~filters~(postType~(~'post~'aa_book~'aa_event_session~'aa_experience_report)~tags~(~'xp))~searchTerm~'~sort~false~sortDirection~'asc~page~1)
It seems the category, pagination and sort options of a widget on the page inject and read these values. Does this format for storing data in the URL have a name, or is this an esoteric format someone made up?
What's the purpose of doing this over using regular GET params, or at least using a more conventional format after the fragment?
If you inspect the URL carefully, you'll see that the parameters you describe are placed after the fragment (#), meaning they're not sent to the server but used by the client instead.
In this case, the client (JavaScript) builds them into something like an ElasticSearch query that is then POSTed to the server in order to update the listing you see on your screen.
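To illustrate the point about the fragment staying on the client, here is a small sketch using Python's standard urllib.parse. The URL is a shortened, illustrative variant of the one in the question, and the helper name is made up for this example.

```python
from urllib.parse import urlsplit


def split_client_and_server_parts(url: str) -> tuple:
    """Return (request_target, fragment).

    request_target is the path plus query that a browser puts in the
    HTTP request line; the fragment is kept by the browser and is only
    visible to client-side code.
    """
    parts = urlsplit(url)
    request_target = parts.path or "/"
    if parts.query:
        request_target += "?" + parts.query
    return request_target, parts.fragment


target, fragment = split_client_and_server_parts(
    "https://www.agilealliance.org/glossary/xp/#q=~(sortDirection~'asc~page~1)"
)
print("sent to server:", target)
print("kept on client:", fragment)
```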

Extract part of an URL behind a login page with Paw

I'm a newbie but I think Paw can do what I need:
I need to extract a session id behind a login page.
I go to https://admin.booking.com, fill in the form (login and password), and the landing page behind it includes a session id:
https://admin.booking.com/pc/index.html?ses=xxxxyyyyyzzzzz11112222233333
I'd like to:
1) Push credentials with Paw as part of my request,
2) get the above item (ses) as a response so I can use the PHP script extension provided by Paw and then call this script "on demand".
Is this possible? If so, what should I do?
Thanks for your help
UPDATE*: we've added a documentation article to describe the process a little more: Login via a web form in Paw. We've detailed the process to deal with CSRF tokens too.
Paw isn't quite ready yet for handling web/HTML forms. There is, though, one way to do it properly: if you inspect the form with the Chrome dev tools, you'll find the names of the inputs in the DOM/HTML:
In your case, you have the inputs: loginname, password, lang.
Also, find the <form…> tag to see what its action attribute is. If there's no action attribute (as in your example), the target URL for your form is the current page's URL (https://admin.booking.com/ in your case). Also make sure method="POST" is present in the <form…> tag, otherwise this approach won't work.
Then jump into Paw and set:
URL (in your case https://admin.booking.com/)
method to POST
go to the Body tab, use "Form URL-Encoded" and fill in the fields from your form
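Outside Paw, the same request can be sketched in Python to show what "Form URL-Encoded" means for the body. This is only an illustration: the field names come from the form described above, the credential values are placeholders, and the request is built but not sent.

```python
import urllib.request
from urllib.parse import urlencode


def build_login_request(url: str, loginname: str, password: str, lang: str) -> urllib.request.Request:
    """Build (but do not send) a form URL-encoded POST request."""
    # urlencode percent-encodes special characters and joins pairs with '&'.
    body = urlencode({"loginname": loginname, "password": password, "lang": lang})
    return urllib.request.Request(
        url,
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder credentials for illustration only.
login = build_login_request("https://admin.booking.com/", "user@example.com", "s3cret pass", "en")
print(login.data.decode("ascii"))
```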
If all works, you'll see Paw show a redirection request, and if you go to the right-hand side panel under "Response" > "Headers", you should see a Location header with a value similar to the URL you initially mentioned (https://admin.booking.com/pc/index.html?ses=xxxxyyyyyzzzzz11112222233333). Hurray! You got your value into Paw!
Now that you have that, you can create a new request (click on the + button at the bottom of the left-hand side list). Wherever you want to use this session token/ID, you can insert a dynamic value to retrieve that URL value. There is more information in our docs, but I'll describe the steps here:
On whichever field you want to insert the token, right-click and pick Responses > Response Header.
Make sure you pick the first request in the "Request" dropdown menu, and enter Location in the "Header" field:
You should see the value of the Location header of the previous response appear here.
Now what you want to do is to extract only the part you want (i.e. the value of the ses param in your case). For that you'll need that extension for Paw, so please install it now: https://luckymarmot.com/paw/extensions/RegExMatch
Copy the dynamic value you have just inserted (the blue token), and right-click on that field to insert a new dynamic value, and pick Extensions > RegExp match:
In the Input field, paste the dynamic value you copied earlier. Then use the RegExp field to write a regular expression that extracts the part of the URL you want (this should work in your case: ses=(.*)).
Now you're set up. You should be able to use this new blue token wherever you like and automagically extract the value from the previous form. And whenever you send the initial request again and get a new token, everything else will update too! :)
This was a rather long guide, but I hope it will help you and hopefully others too.
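For readers who want to check the regular expression outside Paw, here is a minimal Python sketch of the same extraction applied to the Location value from the question. The second function is an assumption added for illustration: a stricter pattern that stops at the next parameter, in case the redirect URL ever carries more than ses.

```python
import re
from typing import Optional


def extract_session_id(location: str) -> Optional[str]:
    """Apply the pattern from the answer: everything after 'ses='."""
    match = re.search(r"ses=(.*)", location)
    return match.group(1) if match else None


def extract_session_id_strict(location: str) -> Optional[str]:
    """Stricter variant: stop at the next '&' or '#'."""
    match = re.search(r"[?&]ses=([^&#]*)", location)
    return match.group(1) if match else None


location = "https://admin.booking.com/pc/index.html?ses=xxxxyyyyyzzzzz11112222233333"
print(extract_session_id(location))
print(extract_session_id_strict(location + "&lang=en"))
```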

Read and modify POST fields "on-the-fly" using Fiddler

I need to use Fiddler to modify the POST fields sent by a browser. I know I can do that using the Fiddler UI but I want to create a script to do it automatically.
I need to insert the code inside the OnBeforeRequest method, and I know I can use regular expressions to parse the POST fields, but maybe there is something already available for this, such as a POST object holding all the current fields, e.g. POST["field1"], POST["field2"], etc.
So... is it possible, or do I have to do it manually?
Thanks!
Fiddler itself does not contain a script-accessible POST body parser, which means you'd either need to import one, write one, or use string processing to accomplish this task.
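The "write one" option comes down to three steps: split the URL-encoded body into fields, change the ones you need, and encode the result again. FiddlerScript itself is normally written in JScript.NET, so the Python sketch below is not something to paste into OnBeforeRequest; it only illustrates the logic, and the function name is made up for this example.

```python
from urllib.parse import parse_qsl, urlencode


def rewrite_form_body(body: str, overrides: dict) -> str:
    """Replace or add fields in a form URL-encoded POST body."""
    # parse_qsl decodes the body into (name, value) pairs, keeping order.
    pairs = parse_qsl(body, keep_blank_values=True)
    seen = set()
    result = []
    for name, value in pairs:
        if name in overrides:
            result.append((name, overrides[name]))
            seen.add(name)
        else:
            result.append((name, value))
    # Fields that were not in the original body are appended at the end.
    for name, value in overrides.items():
        if name not in seen:
            result.append((name, value))
    return urlencode(result)


print(rewrite_form_body("field1=a&field2=b+c", {"field1": "x"}))
```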

Designing proper REST URIs

I have a Java component which scans through a set of folders (input/processing/output) and returns the list of files in JSON format.
The REST URL for the same is:
GET http://<baseurl>/files/<foldername>
Now, I need to perform certain actions on each of the files, like validate, process, delete, etc. I'm not sure of the best way to design the REST URLs for these actions.
Since it's direct file manipulation, I don't have any unique identifier for the files, except their paths. So I'm not sure if the following is a good URL:
POST http://<baseurl>/file/validate?path=<filepath>
Edit: I would have ideally liked to use something like /file/fileId/validate. But the only unique id for a file is its path, and I don't think I can use that as part of the URL itself.
And finally, I'm not sure which HTTP verb to use for such custom actions like validate.
Thanks in advance!
Regards,
Anand
When you implement a route like http://<baseurl>/file/validate?path=<filepath>, you encode the action in your resource, which is not desirable when modelling a resource service.
You could do the following for read operations:
GET http://api.example.com/files will return all files as URL references, such as
http://api.example.com/files/path/to/first
http://api.example.com/files/path/to/second
...
GET http://api.example.com/files/path/to/first will return validation results for the file (I'm using JSON for readability)
{
  "name": "first",
  "valid": true
}
That was the simple read only part. Now to the write operations:
DELETE http://api.example.com/files/path/to/first will of course delete the file
Modelling the file processing is the hard part, but you could model it as a top-level resource, so that:
POST http://api.example.com/FileOperation?operation=somethingweird will create a virtual file processing resource and execute the operation given by the URL parameter 'operation'. Modelling these file operations as resources lets you perform the operations asynchronously and return a result that gives additional information about the progress of the operation, and so on.
You can take a look at the Amazon S3 REST API for additional examples and inspiration on how to model resources. I can highly recommend reading RESTful Web Services.
Now, I need to perform certain actions on each of the files, like validate, process, delete, etc. I'm not sure of the best way to design the REST URLs for these actions. Since it's direct file manipulation, I don't have any unique identifier for the files, except their paths. So I'm not sure if the following is a good URL: POST http://<baseurl>/file/validate?path=<filepath>
It's not. /file/validate doesn't describe a resource, it describes an action. That means it is functional, not RESTful.
Edit: I would have ideally liked to use something like /file/fileId/validate. But the only unique id for a file is its path, and I don't think I can use that as part of the URL itself.
Oh yes you can! And you should do exactly that. Except for that final validate part; that is not a resource in any way, and so should not be part of the path. Instead, clients should POST a message to the file resource asking it to validate itself. Luckily, POST allows you to send a message to the file as well as receive one back; it's ideal for this sort of thing (unless there's an existing verb to use instead, whether in standard HTTP or one of the extensions such as WebDAV).
And finally, I'm not sure which HTTP verb to use for such custom actions like validate.
POST, with the action to perform determined by the content of the message that was POSTed to the resource. Custom “do something non-standard” actions are always mapped to POST when they can't be mapped to GET, PUT or DELETE. (Alas, a clever POST is not hugely discoverable and so causes problems for the HATEOAS principle, but that's still better than violating basic REST principles.)
REST requires a uniform interface, which in HTTP means limiting yourself to GET, PUT, POST, DELETE, HEAD, etc.
One way you can check on each file's validity in a RESTful way is to think of the validity check not as an action to perform on the file, but as a resource in its own right:
GET /file/{file-id}/validity
This could return a simple True/False, or perhaps a list of the specific constraint violations. The file-id could be a file name, an integer file number, a URL-encoded path, or perhaps an unencoded path like:
GET /file/bob/dir1/dir2/somefile/validity
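As a small sketch of the "URL-encoded path" option, the snippet below builds the validity URL for a file path with Python's urllib.parse.quote. The helper name is hypothetical. One caveat worth checking before relying on this: some servers and proxies treat an encoded slash (%2F) in the path specially, which is one reason to consider the unencoded form instead.

```python
from urllib.parse import quote, unquote


def validity_url(file_path: str) -> str:
    """Build /file/{file-id}/validity with the path percent-encoded
    so that it occupies a single URL segment."""
    # safe="" makes quote() encode '/' as %2F as well.
    file_id = quote(file_path, safe="")
    return "/file/" + file_id + "/validity"


url = validity_url("bob/dir1/dir2/somefile")
print(url)
# The server side would reverse the encoding to recover the path.
print(unquote("bob%2Fdir1%2Fdir2%2Fsomefile"))
```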
Another approach would be to ask for a list of the invalid files:
GET /file/invalid
And still another would be to prevent invalid files from being added to your service in the first place, i.e. when your service processes a PUT request with bad data:
PUT /file/{file-id}
it rejects it with an HTTP 400 (Bad Request). The body of the 400 response could contain information on the specific error.
Update: To delete a file you would of course use the standard HTTP REST verb:
DELETE /file/{file-id}
To 'process' a file, does this create a new file (resource) from one that was uploaded? For example Flickr creates several different image files from each one you upload, each with a different size. In this case you could PUT an input file and then trigger the processing by GET-ing the corresponding output file:
PUT /file/input/{file-id}
GET /file/output/{file-id}
If the processing isn't near-instantaneous, you could generate the output files asynchronously: every time a new input file is PUT into the web service, the web service starts up an asynchronous activity that eventually results in the output file being created.

Resources