Download all files in all folders from URL - r

I'd like to recursively download all files from nested folders from this URL to my computer in the same nested structure:
https://hazardsdata.geoplatform.gov/?prefix=Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings%20HYDA/
I've tried several different approaches, using curl and RCurl, including this and some others. There are multiple file types within this folder. But I keep running into cryptic error messages such as: Error in function (type, msg, asError = TRUE) : error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
I'm not even sure how to begin.

In their JavaScript you'll find the URL https://hazards-geoplatform.s3.amazonaws.com/, and there you'll find an XML file containing the paths to (seemingly?) all their files. From there it shouldn't be hard:
1: download the XML list of files from https://hazards-geoplatform.s3.amazonaws.com
2: each of the XML's <Contents> tags describes a file or a folder. Filter out all the tags that are not relevant to you: if the Contents->Key tag does not contain the text Brookings HYDA, filter it out.
3: the remaining Contents tags contain your download path and save path. Every Key that ends with / is a "folder"; you can't download a folder, just create the path. For example, if the key is
<Contents>
<Key>Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Correspondence/</Key>
this means you should create the folders Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Correspondence and move on. However, if the key's value does not end with /, you should download it. For example, if you find
<Contents>
<Key>Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Correspondence/200724-CityBrookings-AirportInfo_Email.pdf</Key>
<LastModified>2022-03-04T17:54:48.000Z</LastModified>
<ETag>"9fe9af393f043faaa8e368f324c8404a"</ETag>
<Size>303737</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>
it means the save filepath is Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Correspondence/200724-CityBrookings-AirportInfo_Email.pdf
and the url to download the file is https://hazards-geoplatform.s3.amazonaws.com/ + urlencode(key), in this case:
https://hazards-geoplatform.s3.amazonaws.com/Region8%2FR8_MIT%2FRisk_MAP%2FData%2FBLE%2FSouth_Dakota%2F60601300_BrookingsCO%2FBrookings%20HYDA%2FHydraulics_DataCapture%2FCorrespondence%2F200724-CityBrookings-AirportInfo_Email.pdf
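As a rough illustration of steps 2 and 3, here is a minimal Python sketch that works on an already downloaded listing. The names plan_downloads, local_name and SAMPLE_LISTING are made up for this example, the sample XML is a shortened stand-in for the real listing, and the namespace handling assumes the usual S3 ListBucketResult layout. Note that an S3 listing returns at most 1,000 keys per response, so a complete solution would probably have to page through the results.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Shortened, made-up stand-in for the bucket listing (hypothetical keys).
SAMPLE_LISTING = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Contents><Key>Region8/Brookings HYDA/Correspondence/</Key></Contents>
  <Contents><Key>Region8/Brookings HYDA/Correspondence/Email.pdf</Key></Contents>
  <Contents><Key>Region8/Other/readme.txt</Key></Contents>
</ListBucketResult>"""


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def plan_downloads(xml_text, needle, base_url):
    """Return (folders, files) for every Key containing `needle`.

    folders: local directories to create (keys ending with '/').
    files:   (download_url, local_save_path) pairs for everything else.
    """
    folders, files = [], []
    for element in ET.fromstring(xml_text).iter():
        key = element.text or ""
        if local_name(element.tag) != "Key" or needle not in key:
            continue
        if key.endswith("/"):
            folders.append(key.rstrip("/"))
        else:
            # safe="" also encodes '/', matching the %2F style shown above.
            files.append((base_url + quote(key, safe=""), key))
    return folders, files


if __name__ == "__main__":
    found_folders, found_files = plan_downloads(
        SAMPLE_LISTING, "Brookings HYDA", "https://hazards-geoplatform.s3.amazonaws.com/"
    )
    print(found_folders)
    print(found_files)
```

The function only plans the work; creating the folders and fetching each URL is left to whatever download routine you prefer.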
I don't know how to do it with curl/R, but here's how to do it in PHP. Happy porting:
<?php
declare(strict_types=1);
function curl_get(string $url): string
{
    echo "fetching {$url}\n";
    static $ch = null;
    if ($ch === null) {
        $ch = curl_init();
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_ENCODING => '',
            CURLOPT_FOLLOWLOCATION => 1,
            CURLOPT_VERBOSE => 0
        ));
    }
    curl_setopt($ch, CURLOPT_URL, $url);
    $ret = curl_exec($ch);
    if (curl_errno($ch)) {
        throw new Exception("curl error ".curl_errno($ch).": ".curl_error($ch));
    }
    return $ret;
}
$base_url = 'https://hazards-geoplatform.s3.amazonaws.com/';
$xml = curl_get($base_url);
$domd = new DOMDocument();
// loadHTML() lowercases tag names, hence "//key" below; @ silences warnings about the non-HTML tags
@$domd->loadHTML($xml);
$xp = new DOMXPath($domd);
foreach ($xp->query("//key[contains(text(),'Brookings HYDA')]") as $node) {
    $relative = $node->nodeValue;
    if ($relative[-1] === '/') {
        // it's a folder, ignore
        continue;
    }
    $dir = dirname($relative);
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    $url = $base_url . urlencode($node->nodeValue);
    file_put_contents($relative, curl_get($url));
}
after running that for a few seconds i have
$ find
.
./fuk.php
./Region8
./Region8/R8_MIT
./Region8/R8_MIT/Risk_MAP
./Region8/R8_MIT/Risk_MAP/Data
./Region8/R8_MIT/Risk_MAP/Data/BLE
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Correspondence
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Correspondence/200724-CityBrookings-AirportInfo_Email.pdf
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Correspondence/2D_Exceptions_2021Update.pdf
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/DCS_Checklist_Hydraulics_BrookingsCoSD.xlsx
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Simulations
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Simulations/RAS
./Region8/R8_MIT/Risk_MAP/Data/BLE/South_Dakota/60601300_BrookingsCO/Brookings HYDA/Hydraulics_DataCapture/Simulations/RAS/0.2PAC
so it seems to be working.
the last output from the command is
fetching https://hazards-geoplatform.s3.amazonaws.com/Region8%2FR8_MIT%2FRisk_MAP%2FData%2FBLE%2FSouth_Dakota%2F60601300_BrookingsCO%2FBrookings+HYDA%2FHydraulics_DataCapture%2FSimulations%2FRAS%2F0.2PAC%2FPostProcessing.hdf
PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 65019904 bytes) in /home/hans/test/fuk.php on line 17
meaning some of their files are too large to hold in memory under PHP's 128 MB limit (134217728 bytes). It's easy to optimize the curl code to write directly to disk instead of storing the entire file in RAM before writing to disk, but since you want to do this in R anyway, I won't bother optimizing the sample PHP script.
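For what the "write directly to disk" variant could look like, here is a minimal Python sketch using only the standard library. The function name download_to_disk and the 1 MiB default chunk size are assumptions made for this example; it is not a port of the PHP script above.

```python
import os
import shutil
import urllib.request


def download_to_disk(url, dest_path, chunk_size=1024 * 1024):
    """Stream `url` into `dest_path` chunk by chunk and return the bytes written.

    Only one chunk is held in memory at a time, so the size of the file
    does not matter. Missing parent folders are created first.
    """
    parent = os.path.dirname(dest_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return os.path.getsize(dest_path)
```

Called once per (url, save path) pair, this would recreate the nested folder structure while keeping memory use flat.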

Related

How to return binary data from custom wordpress rest api endpoint

I am writing a custom endpoint for a REST api in wordpress, following the guide here: https://developer.wordpress.org/rest-api/extending-the-rest-api/adding-custom-endpoints/
I am able to write an endpoint that returns JSON data. But how can I write an endpoint that returns binary data (PDF, PNG, and similar)?
My endpoint function returns a WP_REST_Response (or WP_Error in case of error).
But I do not see what I should return if I want to respond with binary data.
Late to the party, but I feel the accepted answer does not really answer the question, and Google found this question when I searched for the same solution, so here is how I eventually solved the same problem (i.e. avoiding WP_REST_Response and killing the PHP script before WP tries to send anything other than my binary data).
function download(WP_REST_Request $request) {
    $dir = $request->get_param("dir");
    // The following is for security, but my implementation is out
    // of scope for this answer. You should either skip this line if
    // you trust your client, or implement it the way you need it.
    $dir = sanitize_path($dir);
    $file = $request->get_param("file");
    // See above...
    $file = sanitize_path($file);
    $sandbox = "/some/path/with/shared/files";
    // full path to the file
    $path = $sandbox.$dir.$file;
    $name = basename($path);
    // get the file mime type
    $finfo = finfo_open(FILEINFO_MIME_TYPE);
    $mime_type = finfo_file($finfo, $path);
    // tell the browser what it's about to receive
    // (the file name is quoted so that names with spaces survive)
    header('Content-Disposition: attachment; filename="' . $name . '"');
    header("Content-Type: $mime_type");
    header("Content-Description: File Transfer");
    header("Content-Transfer-Encoding: binary");
    header('Content-Length: ' . filesize($path));
    header("Cache-Control: no-cache private");
    // stream the file without loading it into RAM completely
    $fp = fopen($path, 'rb');
    fpassthru($fp);
    // kill WP
    exit;
}
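The sanitize_path helper is deliberately left out of the answer above. Purely to illustrate the kind of check such a helper might perform, here is a small Python sketch (not PHP, and not the author's implementation). The name resolve_in_sandbox is hypothetical; it rejects any requested path that would resolve outside the shared folder.

```python
import os


def resolve_in_sandbox(sandbox, *parts):
    """Join `parts` below `sandbox` and refuse anything that escapes it.

    Leading slashes are stripped so an absolute-looking part cannot
    replace the sandbox root, and realpath() collapses '..' segments
    and symlinks before the containment check.
    """
    root = os.path.realpath(sandbox)
    cleaned = [part.lstrip("/\\") for part in parts]
    candidate = os.path.realpath(os.path.join(root, *cleaned))
    if os.path.commonpath([root, candidate]) != root:
        raise ValueError("requested path escapes the sandbox")
    return candidate
```

Checking the final resolved path, rather than filtering individual characters, is one common way to guard against "../" tricks; whether it is enough depends on your setup.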
I would look at something called DOMPDF. In short, it streams any HTML DOM straight to the browser.
We use it to generate live copies of invoices straight from the woo admin, generate brochures based on $wp_query results etc. Anything that can be rendered by a browser can be streamed via DOMPDF.

Connection to this site is not Fully Secure

I have a website on which users can write blog posts. I'm using stackoverflow pagedown editor to allow users to add content & also the images by inserting their link.
But the problem is that in case a user inserts a link starting with http:// such as http://example.com/image.jpg, browser shows a warning saying,
Your Connection to this site is not Fully Secure.
Attackers might be able to see the images you are looking at
& trick you by modifying them
I was wondering how can we force the browser to use the https:// version of site only from which image is being inserted, especially when user inserts a link starting with http://?
Or is there any other solution of this issue?
Unfortunately, browsers expect all loaded resources to be served over SSL. In your case you have no choice other than to store all the images yourself or to proxy the requests from http to https. But I am not sure whether it is really safe to do it this way.
For example, you can do something like this:
I assume the code is PHP and is served over https.
<?php
define('CHUNK_SIZE', 1024*1024); // Size (in bytes) of tiles chunk
// Read a file and display its content chunk by chunk
function readfile_chunked($filename, $retbytes = TRUE) {
    $buffer = '';
    $cnt = 0;
    $handle = fopen($filename, 'rb');
    if ($handle === false) {
        return false;
    }
    while (!feof($handle)) {
        $buffer = fread($handle, CHUNK_SIZE);
        echo $buffer;
        ob_flush();
        flush();
        if ($retbytes) {
            $cnt += strlen($buffer);
        }
    }
    $status = fclose($handle);
    if ($retbytes && $status) {
        return $cnt; // return num. bytes delivered like readfile() does.
    }
    return $status;
}
$filename = 'http://domain.ltd/path/to/image.jpeg';
$mimetype = 'image/jpeg';
header('Content-Type: '.$mimetype );
readfile_chunked($filename);
Credit for code sample
_ UPDATE 1 _
Alternate solution to proxy a streamed file download in Python
_ UPDATE 2 _
With the following code, you can stream data from the remote server to your front-end client; if your Django application is served over https, the content will be delivered correctly.
The goal is to read your original image in chunks of 1024 bytes, then stream each chunk to the browser. This approach avoids timeout issues when you try to load a heavy image.
I recommend adding another layer with a local cache instead of doing download -> proxy on each request.
import requests
from django.http import StreamingHttpResponse

# have this function in file where you keep your util functions
def url2yield(url, chunksize=1024):
    s = requests.Session()
    # Note: here i enabled the streaming
    response = s.get(url, stream=True)
    while True:
        chunk = response.raw.read(chunksize)
        if not chunk:
            break
        yield chunk

# Then create your view using StreamingHttpResponse
def get_image(request, img_id):
    # requests needs an explicit scheme on the URL
    img_url = "http://domain.ltd/lorem.jpg"
    return StreamingHttpResponse(url2yield(img_url), content_type="image/jpeg")
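The local cache layer mentioned above could look roughly like the following sketch. The names cache_path_for and cached_fetch are hypothetical, and the cache key (a SHA-256 hash of the URL) and the absence of any expiry are simplifying assumptions. The fetch argument is any callable that yields byte chunks for a URL, such as url2yield.

```python
import hashlib
import os


def cache_path_for(url, cache_dir):
    """Map a URL to a stable file name inside cache_dir."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)


def cached_fetch(url, cache_dir, fetch):
    """Return the path of a local copy of `url`, downloading it only once.

    `fetch(url)` must yield the content as byte chunks. The download is
    written to a temporary '.part' file and renamed when complete, so a
    half-written file is never served from the cache.
    """
    path = cache_path_for(url, cache_dir)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        partial_path = path + ".part"
        with open(partial_path, "wb") as out:
            for chunk in fetch(url):
                out.write(chunk)
        os.replace(partial_path, path)
    return path
```

The view could then stream the cached file from disk instead of contacting the remote server on every request. A real deployment would probably also want an expiry policy and a size limit.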

How to download a ZIP file into the wp-plugins folder programmatically?

I am creating a WordPress plugin which should download a ZIP file from a remote location and place it into the wp-plugins folder.
So I created a method which downloads the file using Curl (this works fine) and should then place the file into the wp-plugins folder. I am using the WP_Filesystem in order to make sure I have the rights to place a file on the server. This is what I have until now:
public function download_plugin($url, $path)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $data = curl_exec($ch);
    curl_close($ch);
    global $wp_filesystem;
    if(defined('FS_CHMOD_FILE'))
    {
        $chmod = FS_CHMOD_FILE;
    }
    else
    {
        $chmod = 0755;
    }
    if (empty($wp_filesystem))
    {
        require_once (ABSPATH . '/wp-admin/includes/file.php');
        WP_Filesystem();
    }
    if(!$wp_filesystem->put_contents(
        $path,
        $data,
        $chmod
    ))
    {
        return new \WP_Error('writing_error', 'Error when writing file');
    }
}
When I run the method no file is being created on the server. The $path variable however has the right path and the $data variable does contain the ZIP file as a string. The WordPress method put_contents however does nothing. It returns null even when I change the method's parameter to something that should definitely work like $wp_filesystem->put_contents(WP_PLUGIN_DIR.'example.txt','Some text',FS_CHMOD_FILE));.
Is there anything I am doing wrong? It's really difficult to debug since I don't get any errors and put_contents always returns null.

Serving files via symfony2 - path to uploads directory

I have a problem setting a correct path to symfony2 uploads directory.
I am trying to provide user with files that they previously uploaded.
Firstly I tried the following code:
$response = new Response();
$d = $response->headers->makeDisposition(
ResponseHeaderBag::DISPOSITION_ATTACHMENT,
$document->getWebPath()
);
$response->headers->set('Content-Disposition', $d);
as advised in the cookbook and How to get web directory path from inside Entity?.
This however resulted in the following error:
The filename and the fallback cannot contain the "/" and "\" characters.
Therefore I decided to switch to:
$filename = $this->get('kernel')->getRootDir() . '/../web' . $document->getWebPath();
return new Response(file_get_contents($filename), 200, $headers);
this however results in:
Warning: file_get_contents(/***.pl/app/../web/uploads/documents/2.pdf) [<a href='function.file-get-contents'>function.file-get-contents</a>]: failed to open stream: No such file or directory
My file that I want to serve is located in
/web/uploads/documents/2.pdf
What code should I use to provide this file to end users?
In order to serve binary files, it's better to use the BinaryFileResponse, which accepts the absolute file path as its argument. The setContentDisposition() doesn't accept file paths but file names ... and that argument is optional (you should only use it in case you want to change the name of the file being served to end-users):
use Symfony\Component\HttpFoundation\BinaryFileResponse;
use Symfony\Component\HttpFoundation\ResponseHeaderBag;
$response = new BinaryFileResponse($filePath);
$response->setContentDisposition(
    ResponseHeaderBag::DISPOSITION_ATTACHMENT, $fileName
); // This line is optional to modify file name
Regarding the file path, you can keep using the code you showed, but slightly changed:
$filePath = $this->container->getParameter('kernel.root_dir')
.'/../web/'
.$document->getWebPath();

Unable to decode json file in a controller

I'm a new Symfony user, and I'm currently training myself on it.
Here is my problem:
echo $this->container->get('templating.helper.assets')->getUrl('bundles/tlfront/js/channels.json');
$channels = json_decode($this->container->get('templating.helper.assets')->getUrl('bundles/tlfront/js/channels.json'));
var_dump($channels);
I would like to decode a JSON file in my controller, but here is what the echo and vardump give me:
/Symfony/web/bundles/tlfront/js/channels.json ( the echo)
null (the var_dump)
It seems that the path to the file is correct but json_decode doesn't work.
json_last_error() returns 4 which is JSON_ERROR_SYNTAX
But, when I run the json string in http://jsonlint.com/ it returns Valid JSON
Any ideas or advice?
Dude, it has nothing to do with Symfony... json_decode expects the JSON text itself as a string, not the path to a file:
http://php.net/manual/es/function.json-decode.php
That's why you're getting the syntax error. Just read the file with file_get_contents and you're done:
$path = $this->container->get('templating.helper.assets')->getUrl('bundles/tlfront/js/channels.json');
$content = file_get_contents($path);
$json = json_decode($content, true);
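The same distinction shows up in other languages. As a small Python illustration (the helper names decodes_as_json and load_json_file are made up for this example), handing the decoder a path instead of the file's contents fails in the same way, while reading the file first works:

```python
import json


def decodes_as_json(text):
    """Return True if `text` itself is valid JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def load_json_file(path):
    """Read the file at `path` and decode its contents as JSON."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A path such as /Symfony/web/bundles/tlfront/js/channels.json is just text that happens not to be JSON, which is why the decoder reports a syntax error instead of a missing file.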
BTW, Symfony comes bundled with Finder, a really nice tool to search and get all the files you need: http://symfony.com/doc/current/components/finder.html

Resources