I have two R scripts. The first reads csv files, cleans the data, checks for mathematical errors and corrects them ("errorcheck.R"). The second script gets the clean data from the first, combines column names, expressions and values and creates csv files ("createTables.R").
Originally, the first script was written to import 5 csv files. But for some projects I might have only 4 or 3 csv files to import, which is fine for the final output. That throws an error, though, and when I try to source the first script from the second script, I don't get the clean datasets. How can I source the clean datasets from the first script even when there are errors? The errors come only from reading csv files that don't exist.
I'm not sure if this is the same question as:
Is there a way to `source()` and continue after an error?
Could I have some ideas on this, please?
Thanks in advance
I am not sure if this fully answers your question, but here is an idea:
Situation:
1. According to your description, your first script is written for a static input of length 5 (i.e. exactly 5 .csv files).
Solution:
I don't know how the first script currently takes its .csv input. I suggest building a character vector of file paths, passing that to the first script, and using the length of the vector to decide how many times the import/cleaning operation should run. Then the input can be of any length.
That way you can handle any number of .csv files rather than exactly 5; try to avoid hard-coding. A sketch of the idea is below.
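A minimal sketch of that idea (untested; the file names and the cleaning steps are placeholders for whatever errorcheck.R actually does):

# errorcheck.R -- accept any number of csv files instead of exactly 5
csv_files <- c("data1.csv", "data2.csv", "data3.csv")  # build this vector per project
csv_files <- csv_files[file.exists(csv_files)]         # drop files that don't exist, so nothing errors

clean_list <- vector("list", length(csv_files))
for (i in seq_along(csv_files)) {
  dat <- read.csv(csv_files[i])
  # ... cleaning and error-correction steps go here ...
  clean_list[[i]] <- dat
}
names(clean_list) <- basename(csv_files)

# createTables.R can now source("errorcheck.R") and work with clean_list,
# whether it contains 3, 4 or 5 data frames.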
Please let me know if this answers your question, or if you run into any difficulty.
Related
I need some help for my master thesis
I have a very large set of xlsx files and must calculate a series of indices for each file. I have the code for doing it one Excel file at a time, but it would take many days to do them one by one. Does anyone know how to open several Excel files, loop over them for the calculation, and put all the indices in a matrix?
This is the code for one file at a time:
install.packages("nparACT")
library(nparACT)
(Import the data set manually of one file [I am new to R])
Nuevo <- data.frame(as.factor(P1_a_completo_Tmov$Datetime), P1_a_completo_Tmov$Dist)
(P1_a_completo_Tmov is the name of the file, example)
nparACT_base("Nuevo", SR=1/30)
(This last command gives me many options, what I need is the data.frame, so what I do now is to copy nparACT_base("Nuevo", SR=1/30) in the console and then I get the data frame)
Now I am stuck with a very inefficient, time-consuming way of working, but I hope one of you R experts can shed some light on how to speed up the process. Thank you
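A sketch of what such a loop could look like (untested; it assumes the readxl package, that all the files sit in one folder, and that every file has Datetime and Dist columns like the example above):

library(readxl)
library(nparACT)

files <- list.files("path/to/xlsx/folder", pattern = "\\.xlsx$", full.names = TRUE)

results <- lapply(files, function(f) {
  dat <- read_excel(f)
  # same two-column construction as in the single-file version
  Nuevo <- data.frame(as.factor(dat$Datetime), dat$Dist)
  # nparACT_base() is called with the object's name, as in the original code,
  # so put the object where the function can find it
  assign("Nuevo", Nuevo, envir = .GlobalEnv)
  nparACT_base("Nuevo", SR = 1/30)
})

# one row of indices per file
indices <- do.call(rbind, results)
rownames(indices) <- basename(files)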
Background
I'm doing some data manipulation (joins, etc.) on a very large dataset in R, so I decided to use a local installation of Apache Spark and sparklyr to be able to use my dplyr code to manipulate it all. (I'm running Windows 10 Pro; R is 64-bit.) I've done the work needed, and now want to output the sparklyr table to a .csv file.
The Problem
Here's the code I'm using to output a .csv file to a folder on my hard drive:
spark_write_csv(d1, "C:/d1.csv")
When I navigate to the directory in question, though, I don't see a single csv file d1.csv. Instead I see a newly created folder called d1, and when I click inside it I see ~10 .csv files all beginning with "part".
The folder also contains the same number of .csv.crc files, which I see from Googling are "used to store CRC code for a split file archive".
What's going on here? Is there a way to put these files back together, or to get spark_write_csv to output a single file like write.csv?
Edit
A user below suggested that this post may answer the question, and it nearly does, but it seems like the asker is looking for Scala code that does what I want, while I'm looking for R code that does what I want.
I had the exact same issue.
In simple terms, the partitions are done for computational efficiency. If you have partitions, multiple workers/executors can write the table on each partition. In contrast, if you only have one partition, the csv file can only be written by a single worker/executor, making the task much slower. The same principle applies not only for writing tables but also for parallel computations.
For more details on partitioning, you can check this link.
Suppose I want to save the table as a single file at the path path/to/table.csv. I would do it as follows (note that the repartitioned table has to be the one that gets written, so the calls are chained):
table %>%
  sdf_repartition(partitions = 1) %>%
  spark_write_csv("path/to/table.csv", ...)
You can check full details of sdf_repartition in the official documentation.
Data will be divided into multiple partitions. When you save the dataframe to CSV, you get one file from each partition. Before calling the spark_write_csv method, you need to bring all the data into a single partition to get a single file.
You can use a method called coalesce to achieve this; in sparklyr it is exposed as sdf_coalesce().
sdf_coalesce(df, 1)
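For example, a sketch of how this could be combined with the write (d1 is the sparklyr table from the question; the output path is just illustrative):

d1 %>%
  sdf_coalesce(partitions = 1) %>%
  spark_write_csv("C:/d1_single")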
I am having an issue where I am trying to get a list of all of the file names in a directory in R. Usually I would use list.files(). However, this particular folder has ~300,000 very small files in it, so when I call list.files(pattern="*.csv", recursive = FALSE), my RStudio session just hangs. I waited 24 hours last time and it never sorted itself out... Though oddly enough, my computer doesn't seem to be running out of memory at all.
My question is: is there a way to get the list of file names more efficiently? Or is there a way to import a smaller chunk of the file names at a time, e.g. the first thousand, then the second thousand, then the third thousand, and so on?
I wasn't sure what code samples to include--if there is something that'd be helpful, please let me know :)
Please see the picture. I've started using R, and know how/that it can read files from Excel, but can it read something formatted like this?
http://www.flickr.com/photos/68814612#N05/8632809494/
(my apologies, upload was not working for me)
Elaborating on some of what's in the comments:
If you load the file into Excel, you can save it as a fixed-width or comma-delimited text file. Either should be easy to read into R.
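For instance, once the sheet has been exported, something like this should work (a sketch; the file names and the fixed-width column widths are made up):

# comma-delimited export
d1 <- read.csv("exported_data.csv", header = TRUE, stringsAsFactors = FALSE)

# fixed-width export, with hypothetical column widths of 10, 8 and 8 characters
d2 <- read.fwf("exported_data.txt", widths = c(10, 8, 8), header = FALSE)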
The following may be obvious to you already.
(First, a question: Are you sure that you can't get the data in a format that has one set of data per line? Is it possible that the file you're getting was generated from a different file format that is more conducive to loading the data into R?)
Whether you should start rearranging the data in R or instead manipulate the raw text depends on what comes naturally to you (or to people you have around who can help). For me, personally, I would rearrange the text file outside of R before loading it into R. That's what's easiest for me. Perl is a great language for this purpose, but you could also do it with Unix shell scripts if that's accessible to you, or using a powerful editor such as Vim or Emacs. If you have no preference, I'd suggest Perl. If you have any significant programming experience, you'll be able to learn what you need. On the other hand, you're already loading it into R, so maybe it would be better to process the data there.
For example, you could write a loop that goes through the text file line by line and does something like this:
while (still have lines to read) {
  read the first header line into a vector if this is the first time through the loop
    otherwise, read it and throw it away
  read data line 1 into a vector
  read the second header line into a vector if this is the first time
    otherwise, read it and throw it away
  read data line 2 into a vector
  read the third header line into a vector if this is the first time
    otherwise, read it and throw it away
  read data line 3 into a vector
  if this is the first time through, concatenate the header vectors and store them as the next row
    in something (a file, a matrix, a dataframe, etc.)
  concatenate the data vectors you've been saving, and store them as the next row in the same thing
}
write out the whole 2D data structure
Or if the headers will never change, then you could just embed them literally into the script before the loop, and throw them out no matter what. That will make the code cleaner. Or read the first few lines of the file separately to get the headers, and then have a separate script to read the data and add it to the file with the headers in it. (The headers will probably be useful in R, so I would suggest preserving them at the top of the text file.)
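If you do the rearranging in R, a rough sketch of that loop could look like the following. It is heavily assumption-laden: it assumes the file is called input.txt, that each record is six comma-separated lines (header line, data line, repeated three times), and that the headers repeat for every record; adjust it to the real layout.

lines <- readLines("input.txt")
records <- split(lines, ceiling(seq_along(lines) / 6))   # 6 lines per record

# headers: the 1st, 3rd and 5th line of the first record, concatenated
header <- unlist(strsplit(records[[1]][c(1, 3, 5)], ","))

# data: the 2nd, 4th and 6th line of every record, concatenated into one row
rows <- lapply(records, function(rec) unlist(strsplit(rec[c(2, 4, 6)], ",")))

out <- do.call(rbind, rows)
colnames(out) <- header
write.csv(out, "rearranged.csv", row.names = FALSE)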
I apologize in advance if this has a simple answer somewhere. It seems like the kind of thing that would, but I can't seem to locate it in the help files, by searching SO, or by Googling.
I'm working with some datasets that are several GB right now. It's enough to fit in memory on one of the cluster nodes I have access to, but takes quite a bit of time to load. For many debugging/programming activities with this data, I don't need the entire file loaded, just the first few thousand observations to have a dataset on which to test code. I can of course just read the whole file in and subset, but I was wondering if there's a way to tell read.dta() to only read in the first N rows? This would of course be much faster.
I could also use a more standard format like .csv and then use read.csv()'s nrows argument, but then I'd lose the factor labels in the Stata dataset (and I'd have to recreate quite a few GB of data from someone else's code that feeds into this project). So a direct solution on .dta files is preferred.
Stata's binary files are written row-by-row, so you could change the R_LoadStataData function in stataread.c to limit the number of rows read in. However, this will only work if you do not need the value labels because they are written at the end of the file and would require you to read the entire file--which wouldn't save any time.
That's going to be a difficult one, as the do_readStata function under the hood is compiled code that can only take in the whole file. I believe that, in general, binary files are hard to read line by line, and .dta is a binary format. The native binary format of R doesn't let you select a number of lines from the dataset while reading it in, either.
In my humble opinion, you are better off creating a set of test files from within Stata (e.g. the Stata command sample 1000, count will give you a sample of 1000 observations from the loaded dataset) and working with them. And if you have no access to Stata, someone else in the project should be able to do that for you.
To follow up on Joris Meys: for this kind of thing, I use a "test" data set and the "real" data set, each in a separate folder. I keep a macro at the top of the .do file (with if/then statements below) to (1) take a sample of the data and (2) point input/output to the right folder containing one or the other. I probably do it differently for every project, but it looks something like this:
data creation .do file
blah blah blah
save data/myfile.dta
keep if uniform() < .05   // or bsample, then save for panel data
save test_data/myfile.dta
analysis .do file
local test = "test_"
// when you're ready to run the file with all the data, use the following
// local test = ""
use `test'data/myfile.dta
blah blah blah
outreg2 ... using `test'output/mytable.txt