Convert xlsx file to csv file in R when xlsx file is present in HDFS

I want to know how we can convert an .xlsx file residing in HDFS to a .csv file using an R script.
I tried the XLConnect and xlsx packages, but both give me a 'file not found' error. I am passing the HDFS location as input to these packages in the R script. I am able to read .csv files from HDFS in R (with read.csv()).
Do I need to install any new packages to read an .xlsx file that is in HDFS?
Sharing the code I used:
library(XLConnect)
d1=readWorksheetFromFile(file='hadoop fs -cat hdfs://............../filename.xlsx', sheet=1)
"Error: FileNotFoundException (Java): File 'filename.xlsx' could not be found - you may specify to automatically create the file if not existing."
I am sure the file is present in the specified location.
Hope my question is clear. Please suggest a method to resolve it.
Thanks in Advance!

hadoop fs isn't a file; it is the Hadoop shell command for interacting with HDFS, and readWorksheetFromFile expects a path on the local filesystem. Copy the file from HDFS to the local filesystem first (for example with hadoop fs -get), either from outside R or from inside it using system(), and then open the spreadsheet.
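For example, a minimal sketch of that approach (the HDFS and local paths are placeholders, and it assumes the hadoop command-line client is available on the machine running R):
library(XLConnect)
# Copy the .xlsx out of HDFS into a local temp directory (placeholder paths)
local_xlsx <- file.path(tempdir(), "filename.xlsx")
system(paste("hadoop fs -get hdfs:///path/to/filename.xlsx", local_xlsx))
# Read the first sheet and write it back out as CSV
d1 <- readWorksheetFromFile(file = local_xlsx, sheet = 1)
write.csv(d1, file = "filename.csv", row.names = FALSE)
The resulting CSV can then be pushed back into HDFS with hadoop fs -put if needed.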

Related

How to convert .xls or .xlsx file to csv file without any plugins or tools using Unix command

I have to convert an .xls or .xlsx file to a .csv file using Unix commands, without using any plugins or tools.
Is there any way to do this?
I tried the approach below, but it is not working:
Change the character set of the .xls file to UTF-8 encoding,
then create the file again with the extension changed:
cp temp.xls temp.csv
It is possible, but you need to realise that an *.xlsx file is a zipped directory structure of XML files (just unzip such a file with WinZip or 7-Zip to see; the older binary *.xls format is different again). The unzipping can also be done using UNIX commands.
But what then? The directory structure is quite complicated, and writing a script or program that parses it without any external tools would be a tremendous amount of work. I'd propose you either use external tools anyway, or make sure the files you receive are already in CSV format.
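If using R counts as an acceptable external tool here, a minimal sketch of the conversion (it assumes the readxl package is installed; the file names are placeholders):
library(readxl)
# read_excel() handles both the old binary .xls and the zipped-XML .xlsx formats
df <- read_excel("temp.xlsx", sheet = 1)
write.csv(df, file = "temp.csv", row.names = FALSE)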

How to read in an xls file that requires permission to open in R

I have several xls files I need to read in and combine into one dataframe. I try
df <- readxl::read_excel("file.xls")
or
df <- readxl::read_xls("file.xls")
but neither works. I get the following error:
Error:
filepath: /Users/.../.../file.xls
libxls error: Unable to open file
I believe the issue is that every time I open the file in Excel, I am asked whether I trust this file before I can open it. Is there any way around it?
I am also working on a Mac, and I want to avoid library(xlsx) or other packages that have Java dependencies.
UPDATE: I had the idea of just going into each file, clicking "Save As...", and changing the format to xlsx instead of xls, but the default file format shown was an Excel 2004 XML Spreadsheet (.xml). Does that suggest that my file is actually an XML file even though the extension in its name is .xls?
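One way to check that suspicion (a hedged sketch, not from the original post; the file name is a placeholder): the Excel 2004 XML format starts with an XML declaration, whereas a genuine .xls file is binary.
# Peek at the first bytes to see whether the file is really XML
first_bytes <- readBin("file.xls", what = "raw", n = 5)
identical(first_bytes, charToRaw("<?xml"))  # TRUE suggests an XML file mislabelled as .xls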

Save R output to a different directory

I am running some R code on a Windows computer using RStudio, and my code periodically generates Excel files and netCDF files (eventually dozens of them). I don't want them to clutter my working directory. Is there a way to save the files to a directory called "Output" (e.g. C:/.../original file path/Output) inside the parent directory? I understand there are getwd() and setwd(), but how do I point the output at that directory without typing out the entire Windows path (for example setwd(current source file path for Windows or Mac/Output))? My collaborator uses a Mac, and he would have his output stored there as well.
The write* functions have a file argument. If your Output directory is inside your working directory, it works like this:
write.xlsx(df, file = "Output/table.xlsx")
write.csv(df, file = "Output/table.csv")
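If the Output folder might not exist yet, you can create it first (a minimal sketch; the relative path is resolved against the current working directory):
# Create the Output directory if it does not already exist, then write into it
dir.create("Output", showWarnings = FALSE)
write.csv(df, file = file.path("Output", "table.csv"), row.names = FALSE)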
You can specify the output path in the file argument of write.csv and other similar write functions; note that file expects a file name, not just a directory.
# Output path
OutPath <- "C:/blah/blahblah/op"
# Table to dump as output
OutTbl <- iris
write.csv(OutTbl, file = file.path(OutPath, "iris.csv"))
Source: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/write.table.html

Read a .csv file with Sparklyr in R

I have a couple of .csv files in C:\Users\USER_NAME\Documents which are more than 2 GB in size. I want to use Apache Spark to read the data out of them in R. I am using Microsoft R Open 3.3.1 with Spark 2.0.1.
I am stuck reading the .csv files with the function spark_read_csv(...) from the sparklyr package. It asks for a file path starting with file://. I want to know the proper file path for my case: one that starts with file:// and ends with the name of a file in the .../Documents directory.
I had a similar problem. In my case it was necessary to put the .csv file into the HDFS file system before calling spark_read_csv on it.
I think you probably have a similar problem.
If your cluster is also running with HDFS, you need to use:
hdfs dfs -put
Best,
Felix
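A minimal sketch of both routes (the Spark connection, table names, and paths below are placeholders, not from the original answer):
library(sparklyr)
sc <- spark_connect(master = "local")
# Option 1: read straight from the local filesystem with a file:// URI
df_local <- spark_read_csv(sc, name = "mydata",
                           path = "file:///C:/Users/USER_NAME/Documents/data.csv")
# Option 2: push the file into HDFS first, then read it from there
system("hdfs dfs -put /local/path/data.csv /user/me/data.csv")
df_hdfs <- spark_read_csv(sc, name = "mydata_hdfs",
                          path = "hdfs:///user/me/data.csv")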

How to access an HDFS file path (installed packages: rmr2, rhdfs) in normal R commands?

I have zip files in HDFS, and I am going to write a MapReduce program in R. R has a command to unzip a zip file:
unzip("filepath")
but it does not accept my HDFS file path. I tried
unzip(hdfs.file("HDFS file path"))
and it throws an error:
invalid path argument
Is there any way to pass an HDFS file path to my R commands?
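The approach suggested for the .xlsx case above should apply here as well (a hedged sketch; the paths are placeholders and it assumes the hadoop command-line client is available): copy the file out of HDFS to the local filesystem before calling unzip().
# Pull the zip archive out of HDFS to a local temp file, then unzip it locally
local_zip <- file.path(tempdir(), "archive.zip")
system(paste("hadoop fs -get hdfs:///path/to/archive.zip", local_zip))
unzip(local_zip, exdir = tempdir())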
