Read a .csv file with Sparklyr in R

I have a couple of .csv files in C:\Users\USER_NAME\Documents which are more than 2 GB in size. I want to use Apache Spark to read their data in R. I am using Microsoft R Open 3.3.1 with Spark 2.0.1.
I am stuck reading the .csv files with the function spark_read_csv(...) from the sparklyr package. It asks for a file path that starts with file://. What is the proper path in my case, starting with file:// and ending with the name of a file in the .../Documents directory?

I had a similar problem. In my case the .csv file had to be copied into the HDFS file system before spark_read_csv could read it.
I think you probably have the same problem.
If your cluster also runs on HDFS, copy the file there first with:
hdfs dfs -put <local_file> <hdfs_destination>
Best,
Felix
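
For the local-mode case the question asks about, the file:// path can be built from the Windows path. The following is only a sketch: to_file_uri and read_local_csv are hypothetical helpers, data.csv is a placeholder file name, and it assumes Spark runs in local mode so that the path is visible to Spark itself. It does not percent-encode spaces or other special characters.

```r
# Hypothetical helper: turn a Windows or Unix path into a file:// URI.
to_file_uri <- function(path) {
  p <- gsub("\\\\", "/", path)                   # backslashes -> forward slashes
  if (!startsWith(p, "/")) p <- paste0("/", p)   # "C:/..." -> "/C:/..."
  paste0("file://", p)
}

# data.csv is a placeholder file name, not taken from the question.
uri <- to_file_uri("C:\\Users\\USER_NAME\\Documents\\data.csv")
print(uri)  # "file:///C:/Users/USER_NAME/Documents/data.csv"

# Hypothetical wrapper; needs sparklyr and a working Spark installation.
read_local_csv <- function(sc, name, path) {
  sparklyr::spark_read_csv(sc, name = name, path = to_file_uri(path))
}

# Usage (not run here):
# sc  <- sparklyr::spark_connect(master = "local")
# tbl <- read_local_csv(sc, "my_data", "C:\\Users\\USER_NAME\\Documents\\data.csv")
```

On a real cluster a file:// path only works if every worker can see the same file at that location, which is why copying the file into HDFS is the usual route there.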

Related

How to convert .xls or .xlsx file to csv file without any plugins or tools using Unix command

I have to convert an .xls or .xlsx file to a .csv file using only Unix commands, without any plugins or tools.
Is there any way to do this?
I tried the following, but it does not work:
Change the character set of the .xls file to UTF-8 encoding
Then create the file again with a changed extension:
cp temp.xls temp.csv

It is possible, but you need to realise that an *.xlsx file is a zipped directory structure of XML files (just unzip such a file, using WinZip or 7-Zip), while an older *.xls file is a binary format that cannot be unzipped this way. The unzipping can also be done using UNIX commands.
But what then? The directory structure is quite complicated to understand, and writing a script or a program that converts it (without using any external tools) is a tremendous amount of work. So I'd suggest that you either use external tools anyway, or make sure the files you receive are already in CSV format.

In Bluesky Statistics How do I write output to a csv file

I can't get write.csv or write.table to work in the R editor of BlueSky Statistics.
I usually just use this format in RStudio and it works perfectly.
write.csv(df, "zzz.csv")
Any hints?

The default install location for BlueSky Statistics is 'C:\Program Files', where by default, there is no write permission (for creating or deleting files). Also, saving a file in the install location is not safe, as the file may get lost/deleted when the application is uninstalled. So it is always good to save your file(s) in your own folder(s) where you also have write permission.
In short, try to provide a writable location/path in write.csv() or other similar functions/commands.
For example, to save your file to the Desktop folder:
write.csv(df, "C:/Users/<YourUsername>/Desktop/zzz.csv")
Note: use a forward slash (/) as the path separator.
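
The same idea can be sketched without hard-coding a user name. This is only an illustration: the data frame is made up, and using the home directory with a fallback to tempdir() is an assumption, not something BlueSky Statistics requires. Note that file.access() is not always reliable on Windows, so treat the check as a hint rather than a guarantee.

```r
# Made-up example data, standing in for the df from the question.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Pick a directory we can write to: the user's home directory if it is
# writable, otherwise R's per-session temporary directory.
out_dir <- path.expand("~")
if (file.access(out_dir, mode = 2) != 0) {
  out_dir <- tempdir()
}

# file.path() joins with forward slashes, which also works on Windows.
out_file <- file.path(out_dir, "zzz.csv")
write.csv(df, out_file, row.names = FALSE)

message("Wrote ", out_file)
```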

Convert xlsx file to csv file in R when xlsx file is present in hdfs

I want to know how to convert an .xlsx file residing in HDFS to a .csv file using an R script.
I tried the XLConnect and xlsx packages, but they give me a 'file not found' error. I am passing the HDFS location as input to these packages in the R script. I am able to read .csv files from HDFS with an R script (read.csv()).
Do I need to install any new packages to read an .xlsx file stored in HDFS?
Here is the code I used:
library(XLConnect)
d1=readWorksheetFromFile(file='hadoop fs -cat hdfs://............../filename.xlsx', sheet=1)
"Error: FileNotFoundException (Java): File 'filename.xlsx' could not be found - you may specify to automatically create the file if not existing."
I am sure the file is present in the specified location.
Hope my question is clear. Please suggest a method to resolve it.
Thanks in Advance!

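The string you pass as file isn't a file path; it is a hadoop fs shell command, and readWorksheetFromFile expects the path of a file on the local filesystem. Copy the file from HDFS to your local filesystem first (for example with hadoop fs -get), either from outside R or from inside it using system, and then open the local copy of the spreadsheet.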
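
A rough sketch of that approach from inside R follows. The helper names hdfs_get_args and fetch_from_hdfs are hypothetical, and the code assumes the hadoop command is on the PATH of the machine running R. It uses system2 rather than system, which is a substitution made here for easier argument handling.

```r
# Hypothetical helper: arguments for `hadoop fs -get <hdfs_src> <local_dst>`.
hdfs_get_args <- function(hdfs_path, local_path) {
  c("fs", "-get", shQuote(hdfs_path), shQuote(local_path))
}

# Hypothetical helper: copy a file out of HDFS and return the local path.
fetch_from_hdfs <- function(hdfs_path,
                            local_path = tempfile(fileext = ".xlsx")) {
  status <- system2("hadoop", hdfs_get_args(hdfs_path, local_path))
  if (status != 0) {
    stop("hadoop fs -get failed with exit status ", status)
  }
  local_path
}

# Usage (not run here; the HDFS path is elided as in the question):
# local_file <- fetch_from_hdfs("hdfs://............../filename.xlsx")
# d1 <- XLConnect::readWorksheetFromFile(local_file, sheet = 1)
# write.csv(d1, "filename.csv", row.names = FALSE)
```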

Error processing gps file in R script

Newbie R question: I have been trying to test the R script posted on FlowingData, but the script gives the following error:
Error: XML content does not seem to be XML: 'NA'
I am running R on my Windows machine, with the .gpx files in the same directory as the script. Any help is appreciated.

Not sure whether you ever found the answer to this, but the XML error occurs because R does not know where your .gpx files are. The FlowingData post says the script will work if the .gpx files are in the same folder as your saved copy of the R script, but that is not enough: you must also set your working directory to that folder so that R can see the .gpx files.
If your FlowingData R script and .gpx files are in C:\Users\leon\Documents\R, add this line under the library(plotKML) line to set your working directory:
setwd("C:\\Users\\leon\\Documents\\R")
One more note: make sure you only use RunKeeper .gpx files covering a fairly small geographic area, or the plotted data will be insanely small.
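
A quick way to check this before running the script is to list the .gpx files R can actually see. The sketch below is only an illustration: the folder is the example path from this answer, and the guard with dir.exists() is added here so the snippet does not fail on a machine where that folder is missing.

```r
# Example folder from the answer above; adjust to your own location.
gpx_dir <- "C:/Users/leon/Documents/R"

# Only change the working directory if the folder really exists.
if (dir.exists(gpx_dir)) {
  setwd(gpx_dir)
}

# List the .gpx files visible from the current working directory.
files <- list.files(pattern = "\\.gpx$", ignore.case = TRUE)

if (length(files) == 0) {
  message("No .gpx files found in ", getwd())
} else {
  message("Found ", length(files), " .gpx file(s) in ", getwd())
}
```

If the list is empty, the script has nothing to parse, which is consistent with the 'NA' in the error message.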

R workspaces i.e. .R files

How do I start a new .R file by default in a new session, for the new objects created in that session?

Workspaces are .RData files, not .R files. .R files are source files, i.e. text files containing code.
It's a bit tricky. If you saved the workspace, then R saves two files in the current working directory: an .RData file with the objects and an .Rhistory file with the history of commands. In earlier versions of R, these were saved in the R directory itself. With my version, 2.11.1, it uses the desktop.
If you start up R and it says "[Previously saved workspace restored]", then it loaded the files ".RData" and ".Rhistory" from the default working directory. You can find that directory with the command
getwd()
If it is not the desktop, you can use
dir()
to see what's inside. For me that doesn't help, as the only file I have there is "desktop.ini" (thank you, bloody Windoze).
Now there are two options: you manually rename the workspace file, or you use the command
save.image(file="filename.RData")
to save the workspace before you exit. Alternatively, you can set those options in the file Rprofile.site. This is a text file containing code that R runs at startup. The file resides in the subdirectory /etc of your R directory. You can add something like this to the bottom of the file:
fn <- paste("Wspace",Sys.Date(),sep="")
nfiles <- length(grep(paste(fn,".*.RData",sep=""),dir()))
fn <- paste(fn,"_",nfiles+1,".RData",sep="")
options(save.image.defaults=list(file=fn))
Beware: this doesn't do a thing if you save the workspace by clicking "yes" on the message box. You have to use the command
save.image()
right before you close your R-session. If you click "yes", it will still save the workspace as ".RData", so you'll have to rename it again.
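
A more explicit alternative is to pass the file name to save.image() yourself. As far as I know, the documented entries of the save.image.defaults option are ascii, compress, safe and version, so I would not rely on it to set the file name. The function name save_dated_workspace below is hypothetical; the naming scheme mirrors the Wspace<date>_<n>.RData pattern used above.

```r
# Hypothetical helper: save the global workspace under a dated, numbered
# name such as "Wspace2024-01-31_1.RData" and return the file path.
save_dated_workspace <- function(dir = getwd()) {
  prefix <- paste0("Wspace", Sys.Date())

  # Count workspaces already saved today in `dir`.
  existing <- list.files(
    dir,
    pattern = paste0("^", prefix, "_[0-9]+\\.RData$")
  )

  file <- file.path(dir, paste0(prefix, "_", length(existing) + 1, ".RData"))
  save.image(file = file)   # saves the objects in the global environment
  file
}

# Usage: call this right before quitting, then answer "no" to the
# "Save workspace image?" prompt.
# save_dated_workspace()
```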

I believe that you can save your current workspace using save.image(), which will default to the name ".RData". You can load a workspace simply using load().
If you're loading a pre-existing workspace and you don't want that to happen, rename or delete the .RData file in the current working directory.
If you want to have different projects with different workspaces, the easiest thing to do is create multiple directories.

There is no connection between sessions, objects and .R files, so in short: there is no need to.
You may enjoy walking through the worked example at the end of An Introduction to R (Appendix A, "A sample session").
Fire up R in your preferred environment and execute the commands one-by-one.
