I am trying to make my current project reproducible, so I am creating a master document (eventually a .Rmd file) that will be used to call and execute several other documents. That way, other investigators and I only need to open and run one file.
There are three layers to the current setup: master file, 2 read-in files, 2 databases. The master file calls the read-in files using source(), and the read-in files parse the .csv databases and apply labels.
The read-in files and the databases are generated automatically with the data management software I'm currently using (REDCap) each time I download the updated data.
However, the read-in files have a line of code that removes all of the objects in my environment. I would like to edit the read-in files directly from the master file so that I do not have to open the read-in files individually each time I run my report. Specifically, since all the read-in files are the same, I would like to remove line #2 in each.
I've tried searching Google, and tried file.edit(), but have been unable to find anything. Not even sure it is possible, but figured I would ask. Let me know if I can improve this question or if you need any additional code to answer it. Thanks!
Current relevant master code (edited for generality):
source("read-in1")
source("read-in2")
Current relevant read-in file code (same in each file, except for the database name):
#Clear existing data and graphics
rm(list=ls())
graphics.off()
#Load Hmisc library
library(Hmisc)
#Read Data
data=read.csv('database.csv')
#Setting Labels
[read-in code truncated]
Additional details:
OS: Windows 7 Professional x86
R version: 3.1.3
R Studio version: 0.99.441
You might try readLines() and something like the following (which was simplified greatly by a suggestion from @Hong Ooi below):
eval(parse(text = readLines("read-in1.R")[-2]))
My original solution which was much more pedantic:
f <- file("read-in1.R", open="r")
t <- readLines(f)
close(f)
for (l in t[-2]) { eval(parse(text=l)) }
The for() loop just parses and evaluates each line from the text file except for the second one (that's what the -2 index does). If you're reading and writing longer files, the following will be much faster than the loop above, though it is still less preferable than @Hong Ooi's one-liner:
f <- file("read-in1.R", open="r")
t <- readLines(f)
close(f)
f <- file("out.R", open="w")
o <- writeLines(t[-2], f)
close(f)
source("out.R")
Sorry I'm so late in noticing this question, but you may want to investigate getting access to the REDCap API and using either the redcapAPI package or the REDCapR package. Both packages let you export the data from REDCap directly into R without having to use the download scripts. redcapAPI will even apply all the formats and dates (REDCapR might do this now too; it was in the plan, but I haven't used it in a while).
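A rough sketch of that route, assuming the usual redcapAPI entry points (redcapConnection() and exportRecords()); the URL and token below are placeholders for your own project's values:
library(redcapAPI)

# Placeholders: substitute your institution's API URL and your project's token
rcon <- redcapConnection(url   = "https://redcap.example.edu/api/",
                         token = "YOUR_API_TOKEN")

# Pull the records straight into a data frame, labels and formats applied,
# so the auto-generated download scripts (and their rm(list=ls())) aren't needed
data <- exportRecords(rcon)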
You could try this. It just calls some shell commands: (1) rename the file, then (2) copy all lines not containing rm(list=ls()) to a new file with the same name as the original, then (3) delete the renamed original.
files_to_change <- c("read-in1.R", "read-in2.R")
for (f in files_to_change) {
  old <- paste0(f, ".old")
  system(paste("cmd.exe /c ren", f, old))
  system(paste0('cmd.exe /c findstr /v "rm(list=ls())" ', old, " > ", f))
  system(paste("cmd.exe /c del", old))
}
After calling this loop you should have
#Clear existing data and graphics
graphics.off()
#Load Hmisc library
library(Hmisc)
#Read Data
data=read.csv('database.csv')
#Setting Labels
in your read-in*.R files. You could put this in a batch script
@echo off
ren "%~f1" "%~nx1.old"
findstr /v "rm(list=ls())" "%~f1.old" > "%~f1"
del "%~f1.old"
say, "example.bat", and call it in the same way using system().
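A hedged usage sketch, assuming example.bat and the read-in files sit in the R working directory:
# Run the batch file once per auto-generated read-in script
for (f in c("read-in1.R", "read-in2.R")) {
  system(paste("cmd.exe /c example.bat", f))
}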
Related
Okay, so I like to use R projects in RStudio for the scripts and data I'm working with. However, let's say I want to source those scripts from another directory... R does not detect the .Rproj file unless the script is called from the directory where it is housed. Is there any way to source an R script that is part of an R project from another directory?
This is relevant as I have a system where I perform analyses and make figures in one directory, but then produce LaTeX documents that use those figures in another directory. I like to be able to source the R scripts that make the figures and save them to the directory where I'm writing in LaTeX.
Here's an MRE:
With an R project already created in a directory via RStudio... let's call it ~/test.
Create some data:
a <- 1:10
dat <- data.frame(a = a, b = a + rnorm(length(a), 10, 2))
save(dat, file = "test.RData")
Place the following script in ~/test. Let's call it test.R.
load("test.RData")
pdf(file = "plot.pdf")
plot(b ~ a, data = dat)
dev.off()
Works great, right? But if we try the following from any other directory, R can't figure it out.
cd ~
Rscript ~/test/test.R
Any thoughtful solutions? I suppose it's easy enough to just setwd() in the script that I'm sourcing the original script from, but this sort of defeats the whole purpose of using R projects.
You could use setwd("~/test/") at the beginning of the script and, if necessary, change it back later on.
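A minimal sketch of that version of test.R, reusing the files from the example above; it records the caller's working directory and restores it at the end:
old_wd <- getwd()   # remember where the calling session was
setwd("~/test")     # switch to the project directory

load("test.RData")
pdf(file = "plot.pdf")
plot(b ~ a, data = dat)
dev.off()

setwd(old_wd)       # restore the original working directory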
Situation
I wrote an R program which I split up into multiple R-files for the sake of keeping a good code structure.
There is a Main.R file which references all the other R-files with the 'source()' command, like this:
source(paste(getwd(), dirname1, 'otherfile1.R', sep="/"))
source(paste(getwd(), dirname3, 'otherfile2.R', sep="/"))
...
As you can see, the working directory needs to be set correctly in advance; otherwise this could go wrong.
Now, if I want to share this R program with someone else, I have to pass along all the R files and folders, keeping their relative structure intact, for things to work. Hence my next question.
Question
Is there a way to replace all the 'source' commands with the actual R script code which it refers to? That way, I have a SINGLE R script file, which I can simply pass along without having to worry about setting the working directory.
I'm not looking for a solution which is an 'R package' (which, by the way, is one single directory, so I would lose my own directory structure). I'm simply wondering if there is an easy way to combine these self-referencing R files into one single file.
Thanks,
OK, I think you could scan all the files and then write them out again into one new file. This can be done using readLines and sink:
sink("mynewRfile.R")
for (i in seq_len(Nfiles)) {
  current_file <- readLines(filedir[i])
  cat("\n\n#### Current file:", filedir[i], "\n\n")
  cat(current_file, sep = "\n")
}
sink()
Here I have assumed all your file paths are in a vector filedir of length Nfiles; I guess you can adapt that to your setup.
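For instance, reusing the directory and file names from the question (adjust to your own layout), the inputs could be built like this:
# List the script paths in the order they should appear in the combined file
filedir <- c(file.path("dirname1", "otherfile1.R"),
             file.path("dirname3", "otherfile2.R"))
Nfiles  <- length(filedir)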
I have been trying hard to solve this, but I cannot get my head around how to read zipped .csv files in R. I could first unzip the files and then read them, but since the amount of unzipped data is around 22GB, I guess it is more practical to handle zipped files.
I basically have many .csv files, which I ZIPPED ONE BY ONE into single .7z files. Every file is named like: file1.csv, file2.csv, etc., which zipped became respectively: file1.csv.7z, file2.csv.7z, etc.
If I use the following command:
data <- read.table(unz("substn-20100101.csv.7z", "substn-20100101.csv"), nrows=10, header=T, quote="\"", sep=",")
I get the message:
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") : cannot open zip file 'substn-20100101.7z'
Any help would be much appreciated, thank you in advance.
First of all, if your problem is RAM, using compressed files won't resolve it: as you said, the unzipped data is around 22GB, and after read.table(), for example, the whole file is loaded into memory. If you are using these files for some kind of modeling, I advise you to look at the ff and bigmemory packages.
Another solution is to use Revolution R, which has a free academic licence. Revolution R provides big-data capabilities, and you can manage these files easily with packages like RevoScaleR.
Yet another solution is Postgres + MADlib + PivotalR. After ingesting the data into Postgres, use the PivotalR package to access it and build models with the MADlib library, directly from the R console.
BUT, if you are planning something that can be done on chunks of data (a summary, for example), you can use the iterators package. I will provide a use case to show how this can be done. Get the Airlines data for 1988 and follow this code:
> install.packages('iterators')
> library(iterators)
> con <- bzfile('1988.csv.bz2', 'r')
OK, now you have a connection to your file. Let's create an iterator:
> it <- ireadLines(con, n=1) ## read just one line from the connection (n=1)
Just to test:
> nextElem(it)
and you will see something like:
1 "1988,1,9,6,1348,1331,1458,1435,PI,942,NA,70,64,NA,23,17,SYR,BWI,273,NA,NA,0,NA,0,NA,NA,NA,NA,NA"
> nextElem(it)
and you will see the next line, and so on. Be aware that you are reading one line at a time, so you are not loading the whole file into RAM.
If you want to read line by line till the end of the file you can use
> tryCatch(expr=nextElem(it), error=function(e) return(FALSE))
for example. When the file ends, it returns FALSE.
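A small sketch of such a loop, building on the connection and iterator created above; process_line() here is a hypothetical stand-in for whatever per-line work you need:
repeat {
  line <- tryCatch(nextElem(it), error = function(e) FALSE)
  if (identical(line, FALSE)) break   # iterator exhausted: end of file
  process_line(line)                  # hypothetical per-line processing
}
close(con)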
If I understand the question correctly, at least on Windows, you could use the 7-Zip command-line tool (7za.exe).
For the sake of simplicity, put 7za.exe (and your .7z files) in your R working directory and create a .bat file with the following line in it:
"7za e *.7z -y"
...then in R you run the following code:
my_batch <- "your_bat_file_name.bat"
shell.exec(shQuote(my_batch, type = "cmd"))
Then you just read.table()...
It works for me.
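As a follow-up sketch, once the batch file has extracted the archives, the plain .csv from the question can be read the usual way:
data <- read.table("substn-20100101.csv", nrows = 10, header = TRUE,
                   quote = "\"", sep = ",")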
According to the readr package documentation, readr::read_csv and friends will automatically decompress files ending in .gz, .bz2, .xz, or .zip. Although .7z is not mentioned, perhaps a solution is to switch to one of these compression formats and then use readr (which also offers a number of other benefits). If your data is compressed with zip, your code would be:
library(readr)
data <- read_csv("substn-20100101.csv.zip", n_max=10)
I received a .Rnw file that gives errors when trying to build the package it belongs to. The problem is that when checking the package using the tools in RStudio, I get no useful error information whatsoever, so I first need to figure out on which line of code the error occurs.
In order to do that, I wrote this 5-minute hack to get all the code chunks into a separate file. I have the feeling, though, that I'm missing something. What is the clean way of extracting all the code in an Rnw file so that it can be run like a script? Is there a function to either extract it all, or run it all, in such a way that you can find out at which line the error occurs?
My hack:
ExtractChunks <- function(file.in, file.out, ...) {
  isRnw <- grepl("\\.Rnw$", file.in)
  if (!isRnw) stop("file.in should be an Rnw file")

  thelines <- readLines(file.in)
  startid <- grep("^[^%].+>>=$", thelines)
  nocode <- grep("^<<", thelines[startid + 1]) # when using labels.
  codestart <- startid[-nocode]

  out <- sapply(codestart, function(i) {
    tmp <- thelines[-seq_len(i)]
    # chunk ends at "@", possibly followed by trailing spaces / a % comment
    endid <- grep("^@[[:blank:]]*(%.*)?$", tmp)[1]
    c("# Chunk", tmp[seq_len(endid - 1)])
  })
  writeLines(unlist(out), file.out)
}
The two strategies are Stangle() (for Sweave documents) and knitr::purl() (for knitr documents). My impression is that for .Rnw files they are more or less equivalent, but purl() should work for other types of files as well.
Some simple examples:
f <- 'somefile.Rnw'
knitr::purl(f)
Stangle(f)
Either way, you can then run the created code file using source().
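Since the goal is to locate the failing line, one small sketch: run the extracted file with echoing turned on so each expression is printed before it is evaluated (somefile.R is whatever purl()/Stangle() wrote out):
source("somefile.R", echo = TRUE, max.deparse.length = Inf)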
Note: This post describes a chunk option for knitr to selectively purl chunks, which may be helpful, too.
I want to process a file (1.9GB) that contains 100,000,000 datasets in R.
Actually I only want to have every 1000th dataset.
Each dataset contains 3 columns, separated by a tab.
I tried data <- read.delim("file.txt"), but R was not able to manage all datasets at once.
Can I tell R directly to load only every 1000th dataset from the file?
After reading the file I want to bin the data of column 2.
Is it possible to directly bin the number written in column 2?
Is it possible to read the file line by line, without loading the whole file into memory?
Thanks for your help.
Sven
You should pre-process the file using another tool before reading it into R.
To write every 1000th line to a new file, you can use sed, like this:
sed -n '0~1000p' infile > outfile
Then read the new file into R:
datasets <- read.table("outfile", sep = "\t", header = F)
You may want to look at the manual devoted to R Data Import/Export.
Naive approaches always load all the data. You don't want that. You may want another script which reads line-by-line (written in awk, perl, python, C, ...) and emits only every N-th line. You can then read the output from that program directly in R via a pipe -- see the help on Connections.
In general, very large memory setups require some understanding of R. Be patient, you will get this to work but once again, a naive approach requires lots of RAM and a 64-bit operating system.
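A hedged sketch of that pipe approach, using awk to emit every 1000th line of the file.txt from the question:
# awk filters the stream, so R only ever sees one of every 1000 lines
datasets <- read.table(pipe("awk 'NR % 1000 == 0' file.txt"),
                       sep = "\t", header = FALSE)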
Maybe the colbycol package could be useful to you.