Obtain filename from URL in R

I have a URL like http://example.com/files/01234 that, when I click it in the browser, downloads a zip file named something like file-08.zip.
With wget I can download it under its real file name by running
wget --content-disposition http://example.com/files/01234
Functions such as basename do not work in this case, for example:
> basename("http://example.com/files/01234")
[1] "01234"
I'd like to obtain just the filename from the URL in R and build a tibble with the zip file names. Using a package or a system(...) call are both fine. Any ideas? What I'd like to obtain is something like:
url                            | file
-------------------------------|------------
http://example.com/files/01234 | file-08.zip
http://example.com/files/03210 | file-09.zip
...

Using the httr library, you can make a HEAD request and then parse the Content-Disposition header. For example:
library(httr)

# HEAD fetches only the response headers, not the file itself
hh <- HEAD("https://example.com/01234567")

# Strip everything up to and including "filename=" from the header value
get_disposition_filename <- function(x) {
  sub(".*filename=", "", headers(x)$`content-disposition`)
}

get_disposition_filename(hh)
This function doesn't check that the header actually exists, so it's not very robust, but it should work in the case where the server returns an alternate name for the downloaded file.
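The same idea, sketched in Python with requests for readers outside R; the URLs and the bare filename= parsing are illustrative assumptions, not a production-grade header parser:

import requests

def disposition_filename(url):
    """Return the filename from the Content-Disposition header, or None.

    Assumes the simple 'attachment; filename=...' form; real-world headers
    may quote the name or use the RFC 5987 filename*= variant.
    """
    resp = requests.head(url, allow_redirects=True)
    cd = resp.headers.get("Content-Disposition")
    if cd is None or "filename=" not in cd:
        return None
    return cd.split("filename=", 1)[1].strip('"')

# Placeholder URLs, mirroring the url | file table the question asks for
for url in ["http://example.com/files/01234", "http://example.com/files/03210"]:
    print(url, "|", disposition_filename(url))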

Building on @Sathish's contribution:
When the URL string itself doesn't name the file to download, a valid solution is
system("curl -IXGET -r 0-10 https://example.com/01234567 | grep attachment | sed 's/^.\\+filename=//'")
The idea is to read only the first few bytes of the zip instead of the full file before obtaining the file name; the command will print file-789456.zip, or whatever the real zip name at that URL is.
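The same ranged read, sketched with Python's requests (the URL is again a placeholder): a Range header keeps the transfer to a handful of bytes while still exposing the response headers.

import requests

# A ranged, streamed GET reads the headers while keeping the body
# down to a few bytes, much like curl -r 0-10
resp = requests.get(
    "https://example.com/01234567",     # placeholder URL
    headers={"Range": "bytes=0-10"},
    stream=True,
)
print(resp.headers.get("Content-Disposition"))
resp.close()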

Related

How to handle a file having a header in between the records after removing duplicates from the file

We have a file that has been processed by a Unix command to remove duplicates. After the de-duplication, the new file has the header in between the records. Please help to solve this; thanks in advance for any input.
Unix command: sort -u >
I would do something like this, where input.txt stands for the de-duplicated file:
grep "headers" input.txt > output.txt
grep -v "headers" input.txt >> output.txt
The idea is the following: first take the headers and put them into output.txt, and afterwards take everything which is not a header and append it to that output file.
First you need to put the information into the output file (which means creating the output file, hence the single >), and secondly you need to append the information to the already existing output file (hence the double >>).
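A minimal Python sketch of the same two-pass idea, assuming the header rows contain the literal string "headers" and the input file is input.txt (both placeholders):

# Two passes over the de-duplicated file: header rows first, then the rest
with open("input.txt") as src:
    lines = src.readlines()

with open("output.txt", "w") as out:
    out.writelines(line for line in lines if "headers" in line)
    out.writelines(line for line in lines if "headers" not in line)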

How can I list all existing workspaces in JupyterLab?

I know that one can view the current workspace name in the URL.
When you create a workspace, this creates a file in ~/.jupyter/lab/workspaces. The name of your workspace is in the ['metadata']['id'] key of the corresponding JSON file.
A simple piece of code to list all workspaces is therefore:
import os, glob, json

# Each workspace is stored as a JSON file under ~/.jupyter/lab/workspaces
for fname in glob.glob(os.path.join(os.environ['HOME'], ".jupyter/lab/workspaces/*")):
    with open(fname, "r") as read_file:
        print(json.load(read_file)['metadata']['id'])
For convenience, I created a gist with that bit of code. I have also added some cosmetics to directly generate the different URLs:
$ list_workspaces.py -u
http://10.164.5.234:8888/lab
http://10.164.5.234:8888/lab/workspaces/BBCP
http://10.164.5.234:8888/lab/workspaces/blog
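That URL generation amounts to prefixing each workspace id with the server's base address; a rough sketch, with the host and port taken as placeholders from the sample output above:

import os, glob, json

base = "http://10.164.5.234:8888"   # placeholder server address
for fname in glob.glob(os.path.join(os.environ['HOME'], ".jupyter/lab/workspaces/*")):
    with open(fname) as fh:
        workspace_id = json.load(fh)['metadata']['id']   # e.g. /lab/workspaces/BBCP
    print(base + workspace_id)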
I think you could try
ls ~/.jupyter/lab/workspaces
Each time you create a new workspace, a corresponding file is generated there. The JupyterLab documentation describes workspaces in more detail.
As others have pointed out, workspace files are located at ~/.jupyter/lab/workspaces. Each workspace is represented by a .jupyterlab-workspace file, which is actually just a JSON file.
If you have the CLI tool jq installed, the following one-liner gives you a quick list of workspaces:
cat ~/.jupyter/lab/workspaces/* | jq -r '.metadata.id'
Sample output:
/lab
/lab/workspaces/aaaaaaaaaaaa
/lab/workspaces/xxxxxxxxxx
With the most basic shell commands:
grep metadata ~/.jupyter/lab/workspaces/* | sed -e 's/"/ /g' | awk '{print $(NF-1)}'
Output will look like:
/lab
/lab/workspaces/auto-x
/lab/workspaces/foo

To find the files that use a particular account name on a Unix server

I have to prepare a sheet like the one below. I want to find the files where we have used the EDW account, but when I use the find command it returns everything that contains the word EDW, even files that I don't have permission to open (printing unnecessary lines).
I only need to print the files that use the account names EDW, GDW, etc., with the path name and file name, so that I can prepare a sheet like this:
AccountName | Server      | Path name                        | File name
XCM         | uk0300uv550 | /home/super/MKBP/scripts/xtc/rap | proc_build.sql
Can anyone please help me?
If I understand your question correctly, you want a list of all files in which "EDW" occurs? Try -l to list just the file names and -r to grep recursively:
echo "Text with EDW inside." > file.txt
grep -lr EDW .
If you would like to find files that are owned by a specific user/group, you could simply use:
find /path/ -user EDW -group EDW
If you would like to find only files that are both owned by your user and contain the word EDW in their name you could:
find /path/ -user EDW -name "*EDW*"
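If it helps, here is a small Python sketch of the same content search that also emits rows for the sheet; the account list, the search root, and using the hostname as the server column are all placeholder assumptions:

import os
import socket

accounts = ["EDW", "GDW"]   # account names to search for (assumption)
root = "/home/super"        # search root (placeholder)
server = socket.gethostname()

for dirpath, _, filenames in os.walk(root):
    for name in filenames:
        try:
            with open(os.path.join(dirpath, name), errors="ignore") as fh:
                text = fh.read()
        except OSError:
            continue        # skip unreadable files instead of printing errors
        for account in accounts:
            if account in text:
                print(account, server, dirpath, name)

Skipping OSError quietly is what suppresses the "permission denied" noise the question complains about.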

Find missing URL routes using the command-line

I'm trying to automate a check for missing routes in a Play! web application.
The routing table is in a file in the following format:
GET /home Home.index
GET /shop Shop.index
I've already managed to use my command line-fu to crawl through my code and make a list of all the actions that should be present in the file. This list is in the following format:
Home.index
Shop.index
Contact.index
About.index
Now I'd like to pipe the output of this text into another command that checks if each line is present in the route file. I'm not sure how to proceed though.
The result should be something like this:
Contact.index
About.index
Does someone have a helpful suggestion on how I can accomplish this?
try this line:
awk 'NR==FNR{a[$NF];next}!($0 in a)' routes.txt list.txt
While reading the first file (NR==FNR), awk stores the last field of each route as a key in the array a; for the second file, it prints only those lines that are not among the stored keys.
EDIT
if you want the above line to accept list from stdin:
cat list.txt | awk 'NR==FNR{a[$NF];next}!($0 in a)' routes.txt -
replace cat list.txt with your magic command
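The same set-difference check in Python, in case the awk line feels opaque; the file names mirror the answer above:

# Build the set of routed actions (last field of each route line),
# then report the actions from the list that have no route
with open("routes.txt") as fh:
    routed = {line.split()[-1] for line in fh if line.strip()}

with open("list.txt") as fh:
    for line in fh:
        action = line.strip()
        if action and action not in routed:
            print(action)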

Getting a list of files on a web server

All,
I would like to get a list of files off of a server with the full URL intact. For example, I would like to get all the TIFFs from here.
http://hyperquad.telascience.org/naipsource/Texas/20100801/*
I can download all the .tif files with wget, but what I am looking for is just the full URL to each file, like this:
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_2_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_3_20100424.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_4_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_05_1_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_05_2_20100430.tif
Any thoughts on how to get all these files into a list using something like curl or wget?
Adam
You'd need the server to be willing to give you a page with a listing on it. This would normally be an index.html, or you can simply ask for the directory:
http://hyperquad.telascience.org/naipsource/Texas/20100801/
It looks like you're in luck in this case, so, at the risk of upsetting the web master, the solution would be to use wget's recursive option (-r). Specify a maximum recursion depth of 1 (-l 1) to keep it constrained to that single directory.
I would use the lynx shell web browser to get the list of links, plus the grep and awk shell tools to filter the results, like this:
lynx -dump -listonly <URL> | grep http | grep <regexp> | awk '{print $2}'
...where:
URL is the start URL, in your case: http://hyperquad.telascience.org/naipsource/Texas/20100801/
regexp is the regular expression that selects only the files that interest you, in your case: '\.tif$'
Complete example command line to get links to the TIF files on this SO page:
lynx -dump -listonly http://stackoverflow.com/questions/6989681/getting-a-list-of-files-on-a-web-server | grep http | grep '\.tif$' | awk '{print $2}'
...which now returns:
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_2_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_4_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_05_2_20100430.tif
If you wget http://hyperquad.telascience.org/naipsource/Texas/20100801/, the HTML that is returned contains the list of files. If you don't need this to be general, you could use regexes to extract the links. If you need something more robust, you can use an HTML parser (e.g. BeautifulSoup), and programmatically extract the links on the page (from the actual HTML structure).
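For instance, a short sketch of that parser-based approach with requests and BeautifulSoup (assuming both libraries are installed; the URL is the directory from the question):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = "http://hyperquad.telascience.org/naipsource/Texas/20100801/"
soup = BeautifulSoup(requests.get(base).text, "html.parser")

# Keep only the .tif anchors, resolved against the base URL so the
# printed links come out fully qualified
for a in soup.find_all("a", href=True):
    if a["href"].lower().endswith(".tif"):
        print(urljoin(base, a["href"]))

urljoin keeps relative hrefs fully qualified, which is exactly the output format asked for.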
WinSCP has a find window that can search for all files in the directories and subdirectories below a given directory on your own site. Afterwards you can select all of the results and copy them, which gives you the links to all the files as text. You need the username and password to connect over FTP:
https://winscp.net/eng/download.php
I have a client-server system that retrieves the file names from an assigned folder in the app server's folder, then displays thumbnails in the client.
CLIENT (slThumbnailNames is a string list):

slThumbnailNames := funGetThumbnailNames(sThumbNailPath);

function TfMFFBClient.funGetThumbnailNames(sThumbnailPath: string): TStringList;
var
  slThisStringList: TStringList;
begin
  slThisStringList := TStringList.Create;
  // Ask the server for the names, then capture its multi-line response
  dmMFFBClient.tcpMFFBClient.SendCmd('GetThumbnailNames,' + sThumbnailPath, 700);
  dmMFFBClient.tcpMFFBClient.IOHandler.Capture(slThisStringList);
  Result := slThisStringList;
end;

=== on the server side ===
A TIdCmdTCPServer has a command handler GetThumbnailNames (a command handler is a procedure).
Hints: sMFFBServerPictures is generated in the OnCreate method of the app server; sThumbnailDir is passed to the app server from the client.

procedure TfMFFBServer.MFFBCmdTCPServercmdGetThumbnailNames(ASender: TIdCommand);
var
  sRec: TSearchRec;
  sThumbnailDir: string;
  iNumFiles: Integer;
begin
  try
    ASender.Response.Clear;
    sThumbnailDir := ASender.Params[0];
    iNumFiles := FindFirst(sMFFBServerPictures + sThumbnailDir + '*_t.jpg', faAnyFile, sRec);
    if iNumFiles = 0 then
      try
        // The loop adds every match, including the first one found
        while iNumFiles = 0 do
        begin
          if (sRec.Attr and faDirectory <> faDirectory) then
            ASender.Response.Add(sRec.Name);
          iNumFiles := FindNext(sRec);
        end;
      finally
        FindClose(sRec)
      end
    else
      ASender.Response.Add('NO THUMBNAILS');
  except
    on e: Exception do
      MessageDlg('Error in procedure TfMFFBServer.MFFBCmdTCPServercmdGetThumbnailNames' + #13 +
        'Error msg: ' + e.Message, mtError, [mbOK], 0);
  end;
end;
