Request user to identify file location and auto-extract variable name from file location in R - r

I am EXTREMELY new to R, and programming in general, so thank you for your patience.
I am trying to write a script which reads values from a .txt file and after some manipulation plots the results. I have two questions which are somewhat coupled.
First, is there a function which asks the user to identify the location of a file? i.e. User runs script. Script opens up file navigation prompt and requests user to navigate to and select relevant file.
Currently, I have to manually identify the file and location in R. e.g.
spectra.raw <- read.table("C:/Users/.../file1.txt", row.names = NULL, header = TRUE)
I'd rather have the user identify the file location each time the script is run. This will be used by non-tech people, and I don't trust them to copy/paste file locations into R.
The second question I've been struggling with is, is it possible to create a variable name based off the file selected? For example, if the user selects "file1.txt" I'd like R to assign the output of read.table() to a variable named "file1.raw" much like the above "spectra.raw"
If it helps, all the file names will have the exact same number of characters, so if it's possible to select the last say 5 characters from the file location, that would work.
Thank you very much, and please excuse my ignorance.

See file.choose. Though I believe it behaves slightly differently on different platforms, so beware of that.
See assign, i.e. assign("fileName",value). You'll want to parse the file path that file.choose spits back using string manipulation functions like substr or strsplit.
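The two hints above can be combined into a short sketch. The helper name read_chosen_file and its envir argument are made up for illustration, and tools::file_path_sans_ext is used here instead of substr or strsplit to strip the extension. Treat it as one possible approach rather than the definitive one.

```r
# Hypothetical helper: read a table and store it under a name derived
# from the file name, e.g. "file1.txt" -> variable "file1.raw".
read_chosen_file <- function(path = file.choose(), envir = globalenv()) {
  # basename() drops the directory; file_path_sans_ext() drops ".txt"
  stem <- tools::file_path_sans_ext(basename(path))
  var_name <- paste0(stem, ".raw")
  data <- read.table(path, row.names = NULL, header = TRUE)
  assign(var_name, data, envir = envir)
  invisible(var_name)  # return the name so the caller can get() it later
}

# Interactive use: opens a file dialog, then creates e.g. `file1.raw`
# created <- read_chosen_file()
# head(get(created))
```

If the script processes several files, storing the data frames in a named list is often easier to work with than creating variables via assign, but that is a design preference.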

Try
file.choose
I think it can do what you want.
For example,
myfile <- file.choose()
Enter file name: adataset.Rdata
load(myfile)
myfile contains the name of the file so you don't have to do anything special.

Related

In R and Sparklyr, writing a table to .CSV (spark_write_csv) yields many files, not one single file. Why? And can I change that?

Background
I'm doing some data manipulation (joins, etc.) on a very large dataset in R, so I decided to use a local installation of Apache Spark and sparklyr to be able to use my dplyr code to manipulate it all. (I'm running Windows 10 Pro; R is 64-bit.) I've done the work needed, and now want to output the sparklyr table to a .csv file.
The Problem
Here's the code I'm using to output a .csv file to a folder on my hard drive:
spark_write_csv(d1, "C:/d1.csv")
When I navigate to the directory in question, though, I don't see a single csv file d1.csv. Instead I see a newly created folder called d1, and when I click inside it I see ~10 .csv files all beginning with "part".
The folder also contains the same number of .csv.crc files, which I see from Googling are "used to store CRC code for a split file archive".
What's going on here? Is there a way to put these files back together, or to get spark_write_csv to output a single file like write.csv?
Edit
A user below suggested that this post may answer the question, and it nearly does, but it seems like the asker is looking for Scala code that does what I want, while I'm looking for R code that does what I want.
I had the exact same issue.
In simple terms, the partitions are done for computational efficiency. If you have partitions, multiple workers/executors can write the table on each partition. In contrast, if you only have one partition, the csv file can only be written by a single worker/executor, making the task much slower. The same principle applies not only for writing tables but also for parallel computations.
For more details on partitioning, you can check this link.
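Suppose I want to save table as a single file under the path path/to/table.csv. I would do this as follows (note that the repartitioned table has to be assigned back, and the path must be a quoted string):
table <- table %>% sdf_repartition(partitions = 1)
spark_write_csv(table, "path/to/table.csv")  # plus any further arguments you need
Spark still creates a folder at that path, but it will contain only one part file.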
You can check full details of sdf_repartition in the official documentation.
The data is divided into multiple partitions. When you save the data frame to CSV, you get one file per partition. To get a single file, you need to bring all the data into a single partition before calling spark_write_csv.
You can use coalesce to achieve this. In sparklyr the function is sdf_coalesce (coalesce(df, 1) is the SparkR form):
sdf_coalesce(df, 1)
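For the other half of the question (putting already-written part files back together), a base R sketch could look like the following. combine_part_files is a hypothetical helper, not a sparklyr function. It assumes every part file is non-empty and has the same header row, and that the combined data fits in R's memory, which may not hold for a very large dataset.

```r
# Sketch: stitch Spark's part*.csv files back into one CSV file.
# `combine_part_files` is a made-up helper name for illustration.
combine_part_files <- function(dir, out_file) {
  # Match the part files only; the .csv.crc checksum files are skipped
  parts <- sort(list.files(dir, pattern = "^part.*\\.csv$", full.names = TRUE))
  # Assumes each part file carries the same header row
  pieces <- lapply(parts, read.csv, stringsAsFactors = FALSE)
  combined <- do.call(rbind, pieces)
  write.csv(combined, out_file, row.names = FALSE)
  invisible(combined)
}

# Example call (paths are placeholders):
# combine_part_files("C:/d1.csv", "C:/d1_single.csv")
```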

BlueSky Statistics - File naming conventions

When I open the file 'TestFile.RData' in BlueSky Statistics, it is opened under its own name plus Dataset3 attached. In the tab it looks like this: TestFile.RData(Dataset3)
I would like to use my original name when writing R code in the R command editor, but from what I see BlueSky wants me to use the Dataset3 name.
Please clarify this file name issue for me.
If my original name is changed I see issues with reproducing things - as the given name of Dataset3 is not controllable.
Regards
Your observation is correct. Whenever a file that is not an R data file is opened in BlueSky Statistics, we create a dataframe object in R. We name these objects sequentially: Dataset1, Dataset2, Dataset3, etc. We could have used the name of the original file, but we went with Dataset1, Dataset2, Dataset3 for compatibility with SPSS. Many of our users come from SPSS, and that is exactly what SPSS does. There is a simple workaround, see below.
To work around this, you need to change the default code we use to open the dataset. To see that code in the output window, go to the top-level menu Tools -> Configuration settings, select the Output tab, and tick the checkbox next to the text "Show syntax in output window".
The code you will see when you open a dataset in the output Window is
BSkyloadDataset(fullpathfilename='C:/Users/Aaron_2/Documents/BlueSky Statistics/Sample Datasets/IRT/engagement.csv', filetype='CSV', worksheetName='',load.missing=FALSE, character.to.factor=FALSE, csvHeader=TRUE, isBasketData=FALSE, trimSPSStrailing=FALSE, sepChar=',', deciChar='.', datasetName='Dataset2')
All you need to do is change the datasetName parameter to the name you want to use
I will also add an enhancement to make the default behavior of naming the dataset when opening files to be the name of the file. This is easy to do.
With R datasets this is not a problem, because we load all dataframe objects into the grid. The name of the dataset in the grid continues to be the name of the dataset object.
BlueSky is one of the few packages that use R and allow you to open and work on multiple data files at once. This naming approach is its way of allowing that while using files that have not yet been stored as R data files (.RData). After importing data from a non-R file, simply use "File> Save as" and save it as an R Object (.RData). The next time you open that file, it will maintain the name you've given it.

How can I source an R script with errors

I have two R scripts. The first reads csv files, cleans the data, checks for mathematical errors and corrects them ("errorcheck.R"). The second script gets the clean data from the first, combines column names, expressions and values and creates csv files ("createTables.R").
Originally, the first script was created for importing 5 csv files. But for some projects I might have only 4 or 3 csv files to import, which is fine for the final output. But that throws me an error and when I try to source the first script from the second script, I don't get the clean csv files. How can I source the clean datasets from the first script, even with errors? The errors come only from calling csv files that don't exist.
I'm not sure if this is the same question as:
Is there a way to `source()` and continue after an error?
Can I have some ideas on this please?
Thanks in advance
I am not sure whether this answers your question:
Situation:
1. According to your description, your first script is written for a fixed input of 5 .csv files.
Solution:
I don't know how the first script takes its .csv files as input. I suggest creating a character vector of file names, passing it to the first script, and using the length of that vector to decide how many times the operation should run. The input can then be of any length.
That way you can handle any number of .csv files rather than exactly 5. Try to avoid hard-coding.
Please let me know if this answers your question. If you run into any difficulty, just let me know.
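A minimal sketch of that suggestion, assuming the files are plain CSVs readable with read.csv. The helper name read_available_csvs and the file names in the example call are hypothetical. The file.exists check is an addition that skips missing files, so the "file does not exist" error never occurs.

```r
# Hypothetical helper: read whichever of the expected CSV files exist.
read_available_csvs <- function(paths) {
  found <- paths[file.exists(paths)]
  missing <- setdiff(paths, found)
  if (length(missing) > 0) {
    message("Skipping missing files: ",
            paste(basename(missing), collapse = ", "))
  }
  datasets <- lapply(found, read.csv, stringsAsFactors = FALSE)
  # Name each data frame after its file, minus the extension
  names(datasets) <- tools::file_path_sans_ext(basename(found))
  datasets
}

# Example call (file names are placeholders):
# datasets <- read_available_csvs(c("a.csv", "b.csv", "c.csv", "d.csv", "e.csv"))
```

The cleaning steps in errorcheck.R could then loop over datasets with lapply, whatever its length happens to be.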

How to converge multiple R files into one single file

Situation
I wrote an R program which I split up into multiple R-files for the sake of keeping a good code structure.
There is a Main.R file which references all the other R-files with the 'source()' command, like this:
source(paste(getwd(), dirname1, 'otherfile1.R', sep="/"))
source(paste(getwd(), dirname3, 'otherfile2.R', sep="/"))
...
As you can see, the working directory needs to be set correctly in advance, otherwise, this could go wrong.
Now, if I want to share this R program with someone else, I have to pass all the R files and folders in relative order of each other for things to work. Hence my next question.
Question
Is there a way to replace all the 'source' commands with the actual R script code which it refers to? That way, I have a SINGLE R script file, which I can simply pass along without having to worry about setting the working directory.
I'm not looking for a solution which is an 'R package' (which by the way is one single directory, so I would lose my own directory structure). I simply wondering if there is an easy way to combine these self-referencing R files into one single file.
Thanks,
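OK, I think you could read all the files and then write them out again into a single new one. This can be done using readLines and sink:
sink("mynewRfile.R")
for (i in seq_len(Nfiles)) {
  # read the i-th script and append it under a banner comment
  current_file <- readLines(filedir[i])
  cat("\n\n#### Current file:", filedir[i], "\n\n")
  cat(current_file, sep = "\n")
}
sink()
Here I have assumed that all your file paths are in a vector filedir of length Nfiles, listed in the order in which they should run; I guess you can adapt that. Remember to remove the source() lines from the combined file, since the code they referred to is now included directly.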

How to read a file into a data frame and print some columns in R

I have a question about reading a file into a data frame using R.
I don't understand "getwd" and "setwd". Do we have to call these before reading files?
I also need to print some of the columns in the data frame, and only need to print 1 to 30. How do I do this?
Kind regards
getwd tells you what your current working directory is. setwd is used to change your working directory to a specified path. See the relevant documentation here or by typing ?getwd or ?setwd in your R console.
Using these allows you to shorten what you type into, e.g., read.csv by just specifying a filename without specifying its full path, like:
setwd('C:/Users/Me/Documents')
read.csv('myfile.csv')
instead of:
read.csv('C:/Users/Me/Documents/myfile.csv')
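For the second part of the question (printing only some columns, and only 1 to 30), a small sketch follows. The example file and the column names id, height, and weight are made up so that the code is self-contained. It reads "1 to 30" as rows 1 to 30, which is an assumption.

```r
# Build a small example file so the sketch is self-contained
csv_path <- file.path(tempdir(), "myfile.csv")
example <- data.frame(id = 1:50, height = seq(150, 199), weight = seq(50, 99))
write.csv(example, csv_path, row.names = FALSE)

df <- read.csv(csv_path)

# Rows 1 to 30 of two chosen columns (column names are made up here)
first_rows <- df[1:30, c("id", "height")]
print(first_rows)

# If "1 to 30" meant columns instead, index the second position,
# e.g. df[, 1:30] for a data frame with at least 30 columns
```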
