Open many xlsx files and run a package that calculates a set of non parametric variables for each file - r

I need some help with my master's thesis.
I have a very large set of xlsx files and must calculate a series of indices for each file. I have code that handles one Excel file at a time, but doing them one by one would take many days. Does anyone know how to open several Excel files, loop over them to run the calculation, and put all the indices in a matrix?
This is the code for one file at a time:
install.packages("nparACT")
library(nparACT)
(I import the data set for one file manually; I am new to R)
Nuevo <- data.frame(as.factor(P1_a_completo_Tmov$Datetime), P1_a_completo_Tmov$Dist)
(P1_a_completo_Tmov is the name of the imported file, as an example)
nparACT_base("Nuevo", SR=1/30)
(This last command produces a lot of output; what I need is the data frame, so for now I paste nparACT_base("Nuevo", SR=1/30) into the console to get it)
Right now I am stuck with a very inefficient, time-consuming workflow, and I hope one of you R experts can shed some light on how to speed up the process. Thank you
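A minimal sketch of one way to loop over the files, not a definitive implementation. It assumes that each workbook has the Datetime and Dist columns used above, that the readxl package is installed for reading xlsx files, and that nparACT_base() looks up the data frame by name in the global environment and accepts plot = FALSE. The helper names collect_indices and npar_one and the folder name xlsx_folder are hypothetical.

```r
# Apply `compute` to every file and stack the per-file results into one table,
# with a leading column that records which file each row came from.
collect_indices <- function(files, compute) {
  rows <- lapply(files, function(path) {
    data.frame(file = basename(path), compute(path))
  })
  do.call(rbind, rows)
}

# Hypothetical per-file step: read one workbook and run nparACT on it.
npar_one <- function(path) {
  raw <- readxl::read_excel(path)
  Nuevo <- data.frame(as.factor(raw$Datetime), raw$Dist)
  # nparACT_base() takes the *name* of a data frame, so (assumption) the
  # object is made visible in the global environment before the call.
  assign("Nuevo", Nuevo, envir = .GlobalEnv)
  nparACT::nparACT_base("Nuevo", SR = 1/30, plot = FALSE)
}

# Usage (not run here; needs readxl, nparACT and a folder of xlsx files):
# files <- list.files("xlsx_folder", pattern = "\\.xlsx$", full.names = TRUE)
# indices <- collect_indices(files, npar_one)
# write.csv(indices, "indices.csv", row.names = FALSE)
```

Keeping the per-file step in its own function makes it easy to try on a single file first before running the whole folder.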

Related

Is there a way to compare the structure/architecture of .nc files in R?

I have a sample .nc file that contains a number of variables (5 to be precise) and is being read into a program. I want to create a new .nc file containing different data (and different dimensions) that will also be read into that program.
I have created a .nc file that looks the same as my sample file (I have included all of the necessary attributes for each of the variables that were included in the original file).
However, my file is still not being ingested.
My question is: is there a way to test for differences in the layout/structure of .nc files?
I have examined each of the variables/attributes in RStudio and have also opened both files in Panoply, and they look the same. There are obviously differences (besides the actual data they contain), since the file is not being read.
I see that there are options to compare the actual data within .nc files online (Comparison of two netCDF files), but that is not what I want. I want to compare the variable/attributes names/states/descriptions/dimensions to see where my file differs. Is that possible?
The ideal situation would be to create a .nc template from the variables in the original file and then fill in my data. I could do this by defining the dimensions (ncdim_def), creating the file (nc_create), getting my data (ncvar_get) and putting it in the file (ncvar_put), but that is what I have done so far, and it relies too heavily on my not making an error (which I obviously have, since the files are not the same).
If you are on Unix, this is more easily achieved using CDO. See the Information section of the reference card: https://code.mpimet.mpg.de/projects/cdo/embedded/cdo_refcard.pdf.
For example, if you wanted to check that the grid descriptions are the same in both files, just run:
cdo griddes example1.nc
cdo griddes example2.nc
You can easily wrap this in R using system().
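As a rough sketch of that wrapping, assuming CDO is installed and on the PATH: the code below captures the text printed by cdo griddes for each file and reports the lines that differ. It uses system2() rather than system() so the output comes back as a character vector. The helper names cdo_griddes and description_diff are hypothetical.

```r
# Run `cdo griddes` on one file and return its output as lines of text.
cdo_griddes <- function(nc_file) {
  system2("cdo", c("griddes", shQuote(nc_file)), stdout = TRUE)
}

# Lines that appear in one description but not in the other.
description_diff <- function(a, b) {
  list(only_in_first = setdiff(a, b), only_in_second = setdiff(b, a))
}

# Usage (requires CDO; not run here):
# description_diff(cdo_griddes("example1.nc"), cdo_griddes("example2.nc"))
```

Two empty vectors in the result would mean the two grid descriptions are line-for-line identical; it says nothing about attributes that griddes does not print.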

Time series data homogenization using climatol package in R - .dah file missing

I am trying to homogenize rainfall time series data for 12 stations in R (RStudio) using the homogen tool in the climatol package. I used monthly total series computed with the dd2m tool. The homogen command runs well and generates results, including the .rda and .pdf files. But the .dah (homogenized data with missing data filled) and .esh files are not being created in the working folder as expected.
Any help on what might have happened, and how I can get this result, would be appreciated.
Cheers
I just figured out that we can export the 'would be' content of the dah file by loading the rda content to R and then writing to a text file, i.e.
load('rTest_1950-2000.rda')
write.csv(dah, "C:/Test/Test-dah.csv")
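A slightly more defensive sketch of the same idea, assuming only base R: it loads the .rda into a separate environment, checks that the requested object is actually in the file, and only then writes it out. The helper name export_rda_object is hypothetical.

```r
# Load an .rda file into its own environment and export one object as CSV.
export_rda_object <- function(rda_file, object, csv_file) {
  env <- new.env()
  loaded <- load(rda_file, envir = env)  # names of the objects in the file
  if (!object %in% loaded) {
    stop("'", object, "' not found; file contains: ",
         paste(loaded, collapse = ", "))
  }
  write.csv(get(object, envir = env), csv_file)
  invisible(loaded)
}

# Usage (not run here):
# export_rda_object("rTest_1950-2000.rda", "dah", "C:/Test/Test-dah.csv")
```

Loading into a separate environment avoids overwriting objects of the same name in the current workspace.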

How can I source an R script with errors

I have two R scripts. The first reads csv files, cleans the data, checks for mathematical errors and corrects them ("errorcheck.R"). The second script gets the clean data from the first, combines column names, expressions and values and creates csv files ("createTables.R").
Originally, the first script was written to import 5 csv files. For some projects I might have only 4 or 3 csv files to import, which is fine for the final output, but the first script then throws an error, and when I source it from the second script I don't get the clean csv files. How can I source the clean datasets from the first script, even with errors? The errors come only from reading csv files that don't exist.
I'm not sure if this is the same question as:
Is there a way to `source()` and continue after an error?
Can I have some ideas on this please?
Thanks in advance
I am not sure whether this answers your question:
Situation:
1. According to your description, your first script is written for a fixed input of 5 .csv files.
Solution:
I don't know how you read the .csv files in the first script. I suggest creating a character vector of file names, passing it to the first script, and using the length of that vector to decide how many times the operation should run. The input can then be of any length.
That way you can handle any number of .csv files rather than exactly 5. Try to avoid hard-coding.
Please let me know if this answers your question. If you run into any difficulty, just let me know.
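A minimal sketch of that suggestion, assuming the files are plain csv files readable with read.csv(): the function takes a vector of paths, skips the ones that do not exist, and returns the rest as a named list, so the first script no longer fails when a file is absent. The function name read_existing_csvs and the file names in the usage lines are hypothetical.

```r
# Read every csv in `paths` that exists; report and skip the missing ones.
read_existing_csvs <- function(paths) {
  found <- paths[file.exists(paths)]
  missing <- setdiff(paths, found)
  if (length(missing) > 0) {
    message("Skipping missing files: ", paste(missing, collapse = ", "))
  }
  data <- lapply(found, read.csv)
  names(data) <- basename(found)
  data
}

# Usage (not run here):
# tables <- read_existing_csvs(c("file1.csv", "file2.csv", "file3.csv",
#                                "file4.csv", "file5.csv"))
# length(tables)  # number of files that were actually found
```

The cleaning and error-checking steps can then loop over the list instead of referring to five fixed objects.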

Running jobs in background in R

I am working with a 250 by 250 matrix. However, it takes a very long time to compute, at least an hour.
Is it possible to store this matrix in memory in R, so that every time I open R it is already there?
Ideally, I would like to know whether it is possible to run a job in the background in R, so that I don't have to wait an hour to get the matrix and be able to play around with it.
1) You can save the R workspace when closing R. R usually asks "Save workspace image?" when you close it. If you answer "Yes", it saves the workspace in a file named ".RData" and loads it when you start a new R instance in the same working directory.
2) The better (safer) option is to save the matrix explicitly. There are several ways to do this. One of them is to save it as an Rdata file:
save(m, file = "matrix.Rdata")
where m is your matrix.
You can load the matrix at any time with
load("matrix.Rdata")
if you are in the same working directory.
3) There is no such option as background computing in R. But you can open several R instances, run the computation in one instance, and do something else in another.
What would help is to write the matrix to a file once you have computed it and then read that file every time you open R. Write yourself a computeMatrix() function or script to produce a file with the matrix stored in a sensible format. Also write yourself a loadMatrix() function or script to read that file and load the matrix into memory, then call or run loadMatrix every time you start R and want to use the matrix.
In terms of running an R job in the background, you can run an R script from the command line with the syntax "R CMD BATCH scriptName" with scriptName replaced by the name of your script.
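A minimal sketch of the computeMatrix()/loadMatrix() idea, under stated assumptions: the body of computeMatrix() is a hypothetical stand-in for the real hour-long computation, the cache file name matrix.rds is made up, and saveRDS()/readRDS() are used here instead of save()/load() so the matrix can be assigned to any variable name.

```r
# Hypothetical stand-in for the expensive computation.
computeMatrix <- function() {
  outer(1:250, 1:250)
}

# Return the cached matrix if the file exists; otherwise compute and cache it.
loadMatrix <- function(cache_file = "matrix.rds") {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  m <- computeMatrix()
  saveRDS(m, cache_file)
  m
}

# Usage: slow on the first call, fast on every later call or session.
# m <- loadMatrix()
```

A script that only calls loadMatrix() is also a natural candidate for running once from the command line, after which interactive sessions just read the cached file.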
It might be better to use the ff package and save the matrix as an ff object. This means that the actual matrix will be saved on the disk in an efficient manner, then when you start a new R session you can point to that same file without loading the entire matrix into memory. When you need part of the matrix, only the part you need will be loaded so it will be much quicker. Even if you need the entire matrix loaded into memory it should load faster than reading a text file.

Reading in only part of a Stata .DTA file in R

I apologize in advance if this has a simple answer somewhere. It seems like the kind of thing that would, but I can't seem to locate it in the help files, by searching SO, or by Googling.
I'm working with some datasets that are several GB right now. It's enough to fit in memory on one of the cluster nodes I have access to, but takes quite a bit of time to load. For many debugging/programming activities with this data, I don't need the entire file loaded, just the first few thousand observations to have a dataset on which to test code. I can of course just read the whole file in and subset, but I was wondering if there's a way to tell read.dta() to only read in the first N rows? This would of course be much faster.
I could also use a proper format like .csv and then use read.csv()'s nrows argument, but then I'd lose the factor labels in the Stata dataset (and have to recreate quite a few GB of data from someone else's code that's feeding into this project). So a direct solution on .dta files is preferred.
Stata's binary files are written row-by-row, so you could change the R_LoadStataData function in stataread.c to limit the number of rows read in. However, this will only work if you do not need the value labels because they are written at the end of the file and would require you to read the entire file--which wouldn't save any time.
That's going to be a difficult one, as the do_readStata function under the hood is compiled code that can only take in the whole file. I believe that binary files are in general hard to read line by line, and .dta is a binary format. The native binary format of R doesn't allow selecting a number of lines from the dataset while reading it in either.
In my humble opinion, you would be better off creating a set of test files from within Stata (e.g. the Stata code sample 1000, count will give you a sample of 1000 observations from the loaded dataset) and working with them. If you have no access to Stata, someone else in the project should be able to do that for you.
To follow up on Joris Meys: for this kind of thing, I use a "test" data set and the "real" data set, each in a separate folder. I keep a macro at the top of the .do file (with if/then statements below) to (1) take a sample of the data and (2) point input/output to the folder containing one or the other. I probably do it differently for every project, but something like this:
// data creation .do file
blah blah blah
save data/myfile.dta, replace
// keep a 5% random sample for testing (or bsample, then save for panel data)
keep if runiform() < .05
save test_data/myfile.dta, replace
// analysis .do file
local test = "test_"
// when you're ready to run the file with all the data, use the following
// local test = ""
use `test'data/myfile.dta
blah blah blah
outreg2 ... using `test'output/mytable.txt
