getting file names in R - r

I am trying to get all the file names stored in Hadoop HDFS. All I can find is the bash command for listing files:
hadoop fs -ls
Is there any way to get them in R? Please guide me.
Thanks!

setwd("/directory/of/choice")
list.files()
The result is a character vector of the file names in the current working directory.
But Hadoop is a special case, so the following may work in your situation:
system("hadoop fs -ls", intern=TRUE)
The result is again a character vector of file names, assuming "hadoop fs -ls" actually returns something similar to "ls" in a system console.
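Note that if you go the system() route, hadoop fs -ls prints a table (permissions, owner, group, size, date, path) rather than bare names, so you usually want just the last column. A minimal sketch of that extraction, run here on simulated output since a live cluster isn't assumed:

```shell
# Simulated `hadoop fs -ls` output; the first line is a "Found N items"
# header, and the HDFS path is the last field of every entry line.
sample='Found 2 items
-rw-r--r--   1 user supergroup  204632 2015-01-13 22:45 /LICENSES.txt
drwxr-xr-x   - user supergroup       0 2014-12-20 19:51 /SA'

# Skip the header line and print only the path column.
paths=$(printf '%s\n' "$sample" | awk 'NR > 1 {print $NF}')
printf '%s\n' "$paths"
```

From R, the same filter could be applied in one call, e.g. system("hadoop fs -ls | awk 'NR>1 {print $NF}'", intern=TRUE).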

Check out the RHadoop project. In particular, the package you need to list files is rhdfs.

In case people find this answer useful, here is the code to get file names from a specific hdfs folder into R using rhdfs.
R Code:
# Load required library and set hadoop environment
library(rhdfs)
Sys.setenv("HADOOP_CMD"="/opt/cloudera/parcels/CDH/bin/hadoop")
# Initialise
hdfs.init()
# Extract files names from a given hdfs folder to a data frame
files <- as.data.frame(hdfs.ls('/'))
Output:
> files #Print data frame
permission owner group size modtime file
1 -rw-r--r-- manohar supergroup 204632 2015-01-13 22:45 /LICENSES.txt
2 drwxr-xr-x manohar supergroup 0 2014-12-20 19:51 /SA
3 drwxr-xr-x manohar supergroup 0 2015-01-10 18:16 /in

I used Rhipe's rhlist("/") command and it returned a data frame.

Related

Failed to concatenate global layer netCDF data using NCO

I am using monthly global potential evapotranspiration data from TerraClimate from 1958-2020 (available as one nc file per year) and am planning to concatenate them all into a single nc file.
The data has a variable pet with three dimensions, pet(time,lat,lon).
I managed to combine all of the data using cdo mergetime TerraClimate_*.nc, generating an output file of around 100GB.
For analysis purposes on a Windows machine, I need a single netCDF file with dimension order lat,lon,time. What I have done is as follows:
Reorder the dimension from time,lat,lon into lat,lon,time using ncpdq command
for fl in *.nc; do ncpdq -a lat,lon,time $fl ../pet2/$fl; done
Loop all file in the folder to make time the record dimension/variable used for concatenating files using ncks command
for fl in *.nc; do ncks -O --mk_rec_dmn time $fl $fl; done
Concatenates all nc files in the folder into one nc file using ncrcat command
ncrcat -h TerraClimate_*.nc -O TerraClimate_pet_1958_2020.nc
It worked, but the result is not what I expected: it generated a 458KB file, and when I check the result in Panoply it shows wrong values, all -3276.7. See the picture below.
I have checked the files from steps 1 and 2, and everything is correct.
I also tried concatenating only 2 files, using the 1958 and 1959 data (each file 103MB), but the result is still not what I expected.
ncrcat -h TerraClimate_pet_1958.nc TerraClimate_pet_1959.nc -O ../TerraClimate_pet_1958_1959.nc
Did I miss something in the commands, or did I write them wrongly? Any suggestions on how to solve the problem?
UPDATE 1 (22 Oct 2021):
Here's the metadata of the original data downloaded from the link above.
UPDATE 2 (23 Oct 2021):
Following the suggestion from Charlie, I unpacked all the data from point 2 above using the command below:
for fl in *.nc4; do ncpdq --unpack $fl ../unpack/$fl; done
Here's the example metadata from unpack process.
And the data visualised using Panoply.
Then I did test to concatenate again using 2 data from unpack process (1958 and 1959)
ncrcat -h TerraClimate_pet_1958.nc TerraClimate_pet_1959.nc -O ../TerraClimate_pet_1958_1959.nc
Unfortunately the result remains the same: I get a 1MB file. Below is the metadata,
and the ncrcat result visualised in Panoply.
Your commands appear to be correct, however I suspect that the data in the input files is packed. As explained in the ncrcat documentation here, the input data should be unpacked (e.g., with ncpdq --unpack) prior to concatenating all the input files (unless they all share the same values of scale_factor and add_offset). If that does not solve the problem, then (1) there is likely an issue with _FillValue and (2) please post the pet metadata from a sample input file.
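The suspicious constant supports this diagnosis. Assuming a typical TerraClimate packing of scale_factor = 0.1 and add_offset = 0 (these are assumptions; verify them against the pet metadata of your own files), the short-integer fill value -32767 decodes to exactly -3276.7, the number Panoply shows everywhere. A quick check of that arithmetic:

```shell
# Packed netCDF values are decoded as: value = packed * scale_factor + add_offset
# Assumed packing attributes (verify with `ncks -m` on an input file):
packed=-32767       # typical short _FillValue
scale_factor=0.1
add_offset=0
decoded=$(awk -v p="$packed" -v s="$scale_factor" -v o="$add_offset" \
    'BEGIN { printf "%.1f", p * s + o }')
echo "$decoded"     # the constant seen in Panoply
```

In other words, the output looks like fill values whose packing metadata was lost during concatenation, which is exactly what unpacking first avoids.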

Rename many images with names from file

I have a file with a list of names. Let's call it nameFile. For example:
John Doe
John Benjamin
Benjamin Franklin
...
I also have a folder of pictures. The pictures are named like:
pic001.jpg
pic002.jpg
pic003.jpg
...
I want to rename each picture with the corresponding name from the nameFile. Thus, pic001.jpg will become 'John Doe.jpg', pic002.jpg will become 'John Benjamin.jpg', etc.
Is there an easy UNIX command to do this? I know mv can be used to rename, I'm just a bit unsure how to apply it to this situation.
Most people do this by writing a simple shell script.
These two links will help you do it:
Bulk renaming of files in unix
Rename a group of files with one command
mv is a Unix command that renames one or more files or directories. The original filename or directory name is no longer accessible afterwards. Write permission is required on all directories and files being modified.
mv command syntax
You need to use the mv command to rename a file as follows:
mv old-file-name new-file-name
mv file1 file2
mv source target
mv [options] source target
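Applied to this question, mv needs a loop that pairs each picture (in sorted order) with the matching line of nameFile. A minimal sketch on mock files, assuming the names in nameFile line up one-to-one with the sorted pic*.jpg files:

```shell
# Set up mock pictures and a name file, as in the question.
mkdir -p rename_demo && cd rename_demo
touch pic001.jpg pic002.jpg pic003.jpg
printf '%s\n' 'John Doe' 'John Benjamin' 'Benjamin Franklin' > nameFile

# Rename the i-th picture to the i-th name; quoting "$name.jpg"
# protects the spaces inside the new file names.
i=1
for pic in pic*.jpg; do
    name=$(sed -n "${i}p" nameFile)
    mv "$pic" "$name.jpg"
    i=$((i + 1))
done
```

The glob is expanded once before the loop starts, so renaming files inside the loop does not disturb the iteration.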

R Linux Shell convert multi-sheet xls to csv in batch

In R I have a script that gets the content of multiple xls files <Loop over directory to get Excel content>.
All files are about 2 MB. The script takes a few seconds for 3 files, but has now been running for 6 hours on a Debian i7 system without results on 120 files.
A better solution is therefore [hopefully] to convert all xls files to csv using ssconvert, via a bash script <Linux Shell Script For Each File in a Directory Grab the filename and execute a program>:
for f in *.xls ; do xls2csv "$f" "${f%.xls}.csv" ; done
This script does the job; however, my content is in sheet nr 14, whereas the csv files produced by this script only contain the first sheet [I replaced 'xls2csv' with 'ssconvert'].
Can this script be adapted to pick up only sheet nr 14 in the workbook?
If you know the worksheet name, you can do this:
for f in *.xls ; do xls2csv -x "$f" -w sheetName -c "${f%.xls}.csv" ; done
For all the xls2csv details, see here.
EDIT
The OP found the right answer, so I am editing mine to add it:
for f in *.xls ; do xls2csv -x "$f" -f -n 14 -c "${f%.xls}.csv" ; done
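The `${f%.xls}.csv` part of these loops is plain shell parameter expansion: `%` deletes the shortest matching suffix, so the extension is swapped without touching the rest of the name. A standalone illustration:

```shell
f="quarterly-report.xls"      # hypothetical file name
out="${f%.xls}.csv"           # strip the ".xls" suffix, append ".csv"
echo "$out"
```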
For this job I use a python script named ssconverter.py (which you can find here, scroll down and download the two attachments, ssconverter.py and ooutils.py), which I call directly from R using system().
It can extract a specific sheet in the workbook, not only by name but also by sheet number, for example:
ssconverter.py infile.xls:2 outfile.csv
to extract the second sheet.
You need to have python and python-uno installed.

How to use mv command to rename multiple files in unix?

I am trying to rename multiple files with extension xyz[n] to extension xyz.
Example:
mv *.xyz[1] to *.xyz
but the error comes up as "*.xyz No such file or directory".
I don't know whether mv can work directly with *, but this will:
find ./ -name "*.xyz\[*\]" | while read line
do
mv "$line" "${line%.*}.xyz"
done
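The reason the original `mv *.xyz[1] *.xyz` fails is that `[1]` is a glob character class, so the pattern `*.xyz[1]` matches names ending in `.xyz1` rather than the literal `.xyz[1]`, and with no match mv reports "No such file or directory". Escaping the brackets makes them literal; a sketch on mock files:

```shell
# Mock files whose names literally end in ".xyz[1]".
mkdir -p xyz_demo && cd xyz_demo
touch 'a.xyz[1]' 'b.xyz[1]'

# Escaped brackets match the literal characters "[1]";
# ${f%.*} strips everything from the last dot onward.
for f in *.xyz\[1\]; do
    mv "$f" "${f%.*}.xyz"
done
```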
Let's say we have some files as shown below. Now I want to remove the -(ab...) part from those file names.
> ls -1 foo*
foo-bar-(ab-4529111094).txt
foo-bar-foo-bar-(ab-189534).txt
foo-bar-foo-bar-bar-(ab-24937932201).txt
So the expected file names would be :
> ls -1 foo*
foo-bar-foo-bar-bar.txt
foo-bar-foo-bar.txt
foo-bar.txt
>
Below is a simple way to do it.
> ls -1 | nawk '/foo-bar-/{old=$0;gsub(/-\(.*\)/,"",$0);system("mv \""old"\" "$0)}'
For a detailed explanation, check here.
Here is another way, using the automated tools of StringSolver. Let us say your first file is named abc.xyz[1], a second is named def.xyz[1], and a third is ghi.jpg (not the same extension as the previous two).
First, filter the files you want by giving examples (ok and notok are any words such that the first describes the accepted files):
filter abc.xyz[1] ok def.xyz[1] ok ghi.jpg notok
Then perform the move with the filter it created:
mv abc.xyz[1] abc.xyz
mv --filter --all
The second line generalizes the first transformation on all files ending with .xyz[1].
The last two lines can also be abbreviated into just one, which performs the moves and immediately generalizes them:
mv --filter --all abc.xyz[1] abc.xyz
DISCLAIMER: I am a co-author of this work for academic purposes. Other examples are available on youtube.
I think mv can't operate on multiple files directly without a loop.
Use the rename command instead. It uses regular expressions, but it is easy to use once mastered and more powerful.
rename 's/^text-to-replace/new-text-you-want/' text-to-replace*
e.g. to rename all .jar files in a directory to .jar_bak (the suffix must be anchored at the end of the name):
rename 's/\.jar$/.jar_bak/' *.jar

output redirection in UNIX

I am a beginner in UNIX and am finding input/output redirection somewhat difficult.
ls -l >temp
cat temp
Why is the temp file shown in the listing here, and moreover, why does it show 0 characters?
wc temp >temp
cat temp
Here the output is 0 0 0 temp.
Why are the lines, words, and characters all 0?
Please help me to understand this concept.
Because ls reads all the names and sorts them before printing anything, and because the output file is created before the command is executed, at the time when ls checks the size of temp, it is empty, so it shows up in the list as an empty file.
When wc reads the file, it is empty, so it reports 0 characters in 0 words on 0 lines, and writes this information into the file after it has finished reading the empty file.
When you redirect output to a file, that file is created first, then the command is run (so ls lists it as an empty file, and wc counts the characters in the empty file), and then the output is added to the file.
… in that order.
You cannot write and read from the same file at the same time.
So:
wc file > file # NOT WORKING
# but this works:
wc file > file.stats
mv file.stats file # if you want that
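The truncate-before-run behaviour is easy to see on a throwaway file; this sketch also shows the write-then-rename workaround:

```shell
# The shell opens (and truncates) the output file BEFORE the command
# runs, so wc sees an empty file and reports a zero count.
printf 'hello world\n' > demo.txt
wc -c demo.txt > demo.txt
broken=$(awk '{print $1}' demo.txt)

# Safe pattern: write the counts somewhere else, then rename.
printf 'hello world\n' > demo2.txt
wc -c < demo2.txt > demo2.stats
mv demo2.stats demo2.txt
fixed=$(tr -d '[:space:]' < demo2.txt)
echo "$broken $fixed"
```

Here broken is the count wc reported after the truncation (zero), while fixed is the real byte count of "hello world" plus its newline.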
