Goutte / Web Scraping - How to intercept and download a file - web-scraping

Firstly, thanks in advance for your help here, it's really appreciated!
I've successfully managed to get Goutte to authenticate, hit a URL, change a select field and click a submit button.
The page then reloads and as it finishes loading, it downloads a file to the client.
How do I intercept this file within Goutte? I've read as much documentation as I can but can't find an answer.
Depending on the file type, I then want to either traverse it or save it locally.
Thanks :-)

It is not easy to achieve this. In my situation, I open the URL where the file is (after authentication), and the server returns the file as a Page object; you can then read the content of that page.
// $url contains the path to the file.
$session->visit($url);
$page = $session->getPage();
$saved = file_put_contents($targetFilePath, $page->getContent());
In my case, I am downloading a zip file. In your case, you would probably save it to a temporary location, detect the type, then move it to the desired directory.
Hope this helps.
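The "detect the type" step above is language-neutral, so here is a minimal sketch of one way to do it, written in JavaScript (Node.js) rather than PHP. It inspects the leading "magic" bytes of the downloaded content. The helper name detectFileType and the list of signatures are hypothetical and deliberately short; a real project would more likely rely on a MIME-detection library.

```javascript
// Hypothetical helper: guess a file type from its leading "magic" bytes.
// The signature table is illustrative only and far from complete
// (for example, an empty zip archive starts with different bytes).
const SIGNATURES = [
  { type: 'zip', bytes: [0x50, 0x4b, 0x03, 0x04] },  // "PK\x03\x04"
  { type: 'pdf', bytes: [0x25, 0x50, 0x44, 0x46] },  // "%PDF"
  { type: 'png', bytes: [0x89, 0x50, 0x4e, 0x47] },  // "\x89PNG"
  { type: 'gzip', bytes: [0x1f, 0x8b] },
];

function detectFileType(buffer) {
  for (const { type, bytes } of SIGNATURES) {
    const matches =
      buffer.length >= bytes.length &&
      bytes.every((b, i) => buffer[i] === b);
    if (matches) {
      return type;
    }
  }
  return 'unknown';
}

// Usage: decide what to do with content that was saved to a temp file.
// const fs = require('fs');
// const type = detectFileType(fs.readFileSync('/tmp/download.bin'));
```

Once the type is known, the temporary file can be moved to its final directory or parsed, as described above.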

SilverStripe 4.1 Email->addAttachment()?

I have a contact form that accepts a file input, I'd like to attach the file to the email that gets sent from the form.
Looking at the API reference isn't really helping, it states that the function expects a filepath with no clarification on anything beyond that.
The submit action saves a record of the submission into the database, and this works correctly. It looks something like:
$submission = MyDataObject::create();
$form->saveInto($submission);
$submission->write();
An Email object then gets created and sent. Both of these are working as expected.
To attach the file, I've tried:
$email->addAttachment($submission->MyFile()->Link());
which is the closest I can get to a filepath for the document. Pasting the path output by that call into a browser downloads the file, but that line throws an error and can't seem to locate the file.
I suspect that I'm misunderstanding what's supposed to be given to the function, clarification would be very much appreciated.
P.S. I don't currently have access to the code, I'm looking for some clarification on the function itself not an exact answer :).
In SilverStripe 4 the assets are abstracted away, so you can't guarantee that the file exists on your webserver. It generally will, but it could equally exist on a CDN somewhere for example.
When you handle files in SilverStripe 4 you should always use the contents of the file and whatever other metadata you have available, rather than relying on filesystem calls to load it.
This is how the silverstripe/userforms module attaches files to emails:
/** @var SilverStripe\Control\Email\Email $email */
$email->addAttachmentFromData(
$file->getString(), // file contents as a string
$file->getFilename(), // original filename
$file->getMimeType() // mime type
);
I would try $email->addAttachment($submission->MyFile()->Filename). If that doesn't work, you may need to prepend $_SERVER['DOCUMENT_ROOT'] to the filename:
$email->addAttachment($_SERVER['DOCUMENT_ROOT'] . $submission->MyFile()->Filename);

Max execution work-around Google App Script

I'm looking to retrieve the entire folder/subfolder directory from MyDrive and write it out to a file of some kind (not sure what's possible). I ran the following code and received a "max execution timeout" error.
// Log the name of every folder in the user's Drive.
var folders = DriveApp.getFolders();
while (folders.hasNext()) {
var folder = folders.next();
Logger.log(folder.getName());
}
MyDrive contains over 20,000 files, not sure on the folder/sub-folder count. I'm assuming the above code is meant as a starting point and not the actual full script.
Any help getting around this issue would be appreciated. I only need to run the script once.
Thanks,
Ryan.
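One common way around the execution limit is to process the folders in batches and resume from a continuation token on the next run. The sketch below assumes that approach; it is not taken from the thread. The function name listFoldersResumable, the property key FOLDER_LIST_TOKEN and the sink callback are hypothetical. The drive and props arguments are expected to behave like DriveApp and PropertiesService.getScriptProperties(), and are passed in so the function can also be exercised outside Apps Script.

```javascript
// Sketch: list Drive folder names in resumable batches.
// Returns true when every folder has been visited, false when the time
// budget ran out and a continuation token was stored for the next run.
function listFoldersResumable(drive, props, budgetMs, sink) {
  const TOKEN_KEY = 'FOLDER_LIST_TOKEN'; // hypothetical property name
  const started = Date.now();

  // Resume from the saved position if an earlier run was cut short.
  const token = props.getProperty(TOKEN_KEY);
  const folders = token
    ? drive.continueFolderIterator(token)
    : drive.getFolders();

  while (folders.hasNext()) {
    if (Date.now() - started > budgetMs) {
      // Out of time: remember where we stopped and exit cleanly.
      props.setProperty(TOKEN_KEY, folders.getContinuationToken());
      return false;
    }
    sink(folders.next().getName());
  }

  // Finished: clear the token so a later run starts from the beginning.
  props.deleteProperty(TOKEN_KEY);
  return true;
}

// Possible use inside Apps Script (assumed budget of about five minutes):
// listFoldersResumable(DriveApp, PropertiesService.getScriptProperties(),
//                      5 * 60 * 1000, function (name) { Logger.log(name); });
```

Re-running the function, manually or from a time-driven trigger, until it returns true should cover all folders. Note that, like the original snippet, this lists folder names flat rather than reconstructing the hierarchy.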

Upload file with CMIS Service on st:site

I have been uploading files to Company Home pretty easily with this url:
http://myhost.com:8080/alfresco/s/api/path/workspace/SpacesStore/app:company_home/children
Now I am trying to upload to a folder within a site
http://myhost.com:8080/alfresco/s/api/path/workspace/SpacesStore/app:company_home/st:sites/cm:mysite/children
And keep getting this
Cannot find object for NodePathReference[storeRef=workspace://SpacesStore,path=app:company_home/st:sites/cm:mysite]
Am I missing a special way to declare the path of a site?
I'm not sure how you are uploading to that path, but I suppose you need to go into the site's 'documentLibrary':
http://myhost.com:8080/alfresco/s/api/path/workspace/SpacesStore/app:company_home/st:sites/cm:mysite/cm:documentLibrary/children
I found out that there are 6 webscripts related to file manipulation, and it seems each one takes the path in a different way.
I ended up using:
http://example.com:8080/alfresco/s/cmis/p/Sites/mySite/Test/children
This particular service takes Display Names as path segments, and the p itself represents the Company Home segment.
I also obtained the same results with this one:
http://example.com:8080/alfresco/s/cmis/s/workspace:SpacesStore/i/2aa692bd-0dab-4514-a629-ad36382189f2/children
which, as you can see, takes the nodeRef ID as a parameter.
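The two URL shapes above can be assembled programmatically. The following is only a sketch in JavaScript: the function names are hypothetical, and percent-encoding each display-name segment is an assumption about what the service accepts, not something confirmed in this thread.

```javascript
// Hypothetical helpers that build the two "children" URLs shown above.

// Path-based form: display names as path segments under /cmis/p/,
// where "p" stands for Company Home.
function cmisChildrenUrlByPath(baseUrl, segments) {
  // Assumption: each display name is percent-encoded individually.
  const path = segments.map((s) => encodeURIComponent(s)).join('/');
  return `${baseUrl}/alfresco/s/cmis/p/${path}/children`;
}

// Id-based form: the node's id within workspace://SpacesStore.
function cmisChildrenUrlById(baseUrl, nodeId) {
  return `${baseUrl}/alfresco/s/cmis/s/workspace:SpacesStore/i/${nodeId}/children`;
}

console.log(cmisChildrenUrlByPath('http://example.com:8080', ['Sites', 'mySite', 'Test']));
console.log(cmisChildrenUrlById('http://example.com:8080', '2aa692bd-0dab-4514-a629-ad36382189f2'));
```

Both calls reproduce the example URLs from the answer.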

download file from remote location

Hey, I am in trouble, please help me out. I want to download a file from another website to my own location, and I used the code below:
Dim wc As New System.Net.WebClient
wc.DownloadFile(pathUrl, fileName)
pathUrl and fileName are both correct, I'm 100% sure.
After these two lines execute, my browser's progress bar goes into a waiting state as if something is being retrieved, but the file is not downloaded anywhere. What should I do next?
Not enough rep to leave a comment so:
@AZHAR, the file save location is the second parameter. In your example it is fileName; in NiL's example it is "uploads/myPath.doc".
If you use wc.DownloadFileAsync, make sure to include an AsyncCompletedEventHandler so you know when it's done.
I'm not sure that what you did is right for your goal (I don't mean the code is incorrect; it is syntactically correct, otherwise it wouldn't compile).
If you want to retrieve a file from a remote location and save it to your local machine, this is surely the worst way!
If, instead, you want to download the file onto your server, then your problem is patience :)
I mean, the DownloadFile method is blocking and can take hours if you are downloading something like a Blu-ray rip or a Linux ISO, no matter how fast your server is.
You could think about using an asynchronous job in this case...
The code you wrote does download the file; I tested it and it works.
The DownloadFile method is used as follows:
wc.DownloadFile("http://www.domaine.com/uploads/file.doc", "uploads/myPath.doc");
If you are trying to download a big file, you can use:
wc.DownloadFileAsync
which is called the same way, except that the address is passed as a Uri rather than a string.

No error message available, result code: E_FAIL(0x80004005)

My application uses Windows authentication. Users log in with their username and password and upload an Excel sheet.
The issue is that one user is able to upload the Excel file, but another user gets this error:
No error message available, result code: E_FAIL(0x80004005)
The code is the same. I don't know what the actual problem is. Please help.
Not 100% sure, but can you check:
The user has permissions on the folder where the Excel file is uploaded.
If you are using OleDbCommand and the file name is invalid, you might get the same error too.
// User was neither granted nor denied read access.
// Pass the callback method the integer
// value of E_FAIL.
hr = unchecked((int)0x80004005);
This is how the return value is usually implemented. The comment may point you to the possible problem.
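The unchecked cast above is needed because 0x80004005 does not fit in a signed 32-bit integer, so the same code shows up in logs either in hex or as a negative decimal. As a small illustration of that arithmetic (written in JavaScript rather than C#, with hypothetical function names):

```javascript
// Convert an HRESULT between its unsigned hex form and its signed
// 32-bit decimal form.
function hresultToSigned(code) {
  return code | 0; // bitwise OR coerces the value to a signed 32-bit integer
}

function hresultToHex(code) {
  // >>> 0 reinterprets the value as an unsigned 32-bit integer.
  return '0x' + (code >>> 0).toString(16).toUpperCase().padStart(8, '0');
}

console.log(hresultToSigned(0x80004005)); // -2147467259
console.log(hresultToHex(-2147467259));   // 0x80004005
```

So an error reported as -2147467259 is the same E_FAIL as 0x80004005.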
Only a temporary solution: this problem arises if you try to upload the same file name multiple times, so try to upload a distinct file name every time.
I had the same problem and found these causes:
1 => Timeout
(try to insert or update part by part)
2 => Cannot overwrite
(if you are trying to create a sheet with the same name...)