I want to read an R file or script, modify the name of the external data file being read and export the modified R code into a new R file or script. Other than the name of the data file being read (and the name of the new R file) I want the two R scripts to be identical.
I can come close, except that I cannot figure out how to retain the blank lines I use for readability and error reduction.
Here is the original R file being read. Note that some of the code in this file is nonsensical, but to me that is irrelevant; this code does not need to run.
# apple.pie.all.purpose.flour.arsc.Jun23.2013.r

library(my.library)

aa <- 10 # aa
bb <- c(1:7) # bb

my.data = convert.txt("../applepieallpurposeflour.txt",
                      group.df = data.frame(recipe =
                          c("recipe1", "recipe2", "recipe3", "recipe4", "recipe5")),
                      covariates = c(paste("temp", seq_along(1:aa), sep="")))

ingredient <- c('all purpose flour')

function(make.pie){ make a pie }
Here is the R code I use to read the above file, modify it, and export the result. This code runs and is the only code that needs to run to achieve the desired result, except that I cannot get the format of the new R script to match that of the original exactly: blank lines present in the original R script are not present in the new R script:
setwd('c:/users/mmiller21/simple r programs/')

# define new fruit
new.fruit <- 'peach'

# read flour file for original fruit
flour <- readLines('apple.pie.all.purpose.flour.arsc.Jun23.2013.r')

# create new file name
output.flour <- paste(new.fruit, ".pie.all.purpose.flour.arsc.Jun23.2013.r", sep="")

# add new file name
flour.a <- gsub("# apple.pie.all.purpose.flour.arsc.Jun23.2013.r",
                paste("# ", output.flour, sep=""), flour)

# add line to read new data file
cat(file = output.flour,
    gsub("my.data = convert.txt\\(\"../applepieallpurposeflour.txt",
         paste("my.data = convert.txt\\(\"../", new.fruit, "pieallpurposeflour.txt",
               sep=""), flour.a),
    sep=c("","\n"), fill = TRUE
)
Here is the resulting new R script:
# peach.pie.all.purpose.flour.arsc.Jun23.2013.r
library(my.library)
aa <- 10 # aa
bb <- c(1:7) # bb
my.data = convert.txt("../peachpieallpurposeflour.txt",
group.df = data.frame(recipe =
c("recipe1", "recipe2", "recipe3", "recipe4", "recipe5")),
covariates = c(paste( "temp", seq_along(1:aa), sep="")))
ingredient <- c('all purpose flour')
function(make.pie){ make a pie }
There is one blank line in the newly created R file, but how can I insert all of the blank lines present in the original R script? Thank you for any advice.
EDIT: I cannot seem to reproduce the blank lines here on StackOverflow; they seem to be deleted automatically. StackOverflow is even deleting the indentation I am using, and I cannot seem to restore it. Sorry about this. Automatic deletion of blank lines and indentation is problematic when the issue at hand is specifically about formatting. I cannot seem to fix the post to display the R code as it is formatted in my script, although the code does display correctly while I am actively editing the post.
EDIT: June 27, 2013: The deletion of empty rows and indentation in the code for the original R file and in the code for the middle R file appears to be associated with my laptop rather than with StackOverflow. When I view this post and my answers on my office desktop the format is correct. When I view this post and my answers with my laptop the empty rows and indentation are gone. Perhaps my laptop monitor is malfunctioning. Sorry about assuming initially that the problem was with StackOverflow.
Here is a function that will create a new R file for every combination of two variables. Sorry the formatting of the code below is not better. The code does run and works as intended (provided the name of the original R file ends in ".arsc.Jun26.2013.r" rather than the ".arsc.Jun23.2013.r" used in the original post):
setwd('c:/users/mmiller21/simple r programs/')
# define fruits of interest
fruits <- c('apple', 'pumpkin', 'pecan')
# define ingredients of interest
ingredients <- c('all.purpose.flour', 'sugar', 'ground.cinnamon')
# define every combination of fruit and ingredient
fruits.and.ingredients <- expand.grid(fruits, ingredients)
old.fruit <- as.character(rep('apple', nrow(fruits.and.ingredients)))
old.ingredient <- as.character(rep('all.purpose.flour', nrow(fruits.and.ingredients)))
fruits.and.ingredients2 <- cbind(old.fruit, as.character(fruits.and.ingredients[,1]),
                                 old.ingredient, as.character(fruits.and.ingredients[,2]))
colnames(fruits.and.ingredients2) <- c('old.fruit', 'new.fruit', 'old.ingredient', 'new.ingredient')
# begin function
make.pie <- function(old.fruit, new.fruit, old.ingredient, new.ingredient) {

    new.ingredient2 <- gsub('\\.', '',  new.ingredient)
    old.ingredient2 <- gsub('\\.', '',  old.ingredient)
    new.ingredient3 <- gsub('\\.', ' ', new.ingredient)
    old.ingredient3 <- gsub('\\.', ' ', old.ingredient)

    # file name
    old.file <- paste(old.fruit, ".pie.", old.ingredient, ".arsc.Jun26.2013.r", sep="")
    new.file <- paste(new.fruit, ".pie.", new.ingredient, ".arsc.Jun26.2013.r", sep="")

    # read original fruit and original ingredient
    flour <- readLines(old.file)

    # add new file name
    flour.a <- gsub(paste("# ", old.file, sep=""),
                    paste("# ", new.file, sep=""), flour)

    # read new data file
    old.data.file <- print(paste("my.data = convert.txt(\"../", old.fruit, "pie", old.ingredient2, ".txt\",", sep=""), quote=FALSE)
    new.data.file <- print(paste("my.data = convert.txt(\"../", new.fruit, "pie", new.ingredient2, ".txt\",", sep=""), quote=FALSE)

    flour.b <- ifelse(flour.a == old.data.file, new.data.file, flour.a)

    flour.c <- ifelse(flour.b == paste('ingredient <- c(\'', old.ingredient3, '\')', sep=""),
                      paste('ingredient <- c(\'', new.ingredient3, '\')', sep=""), flour.b)

    cat(flour.c, file = new.file, sep=c("\n"))
}
apply(fruits.and.ingredients2, 1, function(x) make.pie(x[1], x[2], x[3], x[4]))
Here is one solution that reproduces the original R script (except for the two desired changes) while also preserving the formatting of that original R script in the new R script.
setwd('c:/users/mmiller21/simple r programs/')

new.fruit <- 'peach'

flour <- readLines('apple.pie.all.purpose.flour.arsc.Jun23.2013.r')

output.flour <- paste(new.fruit, ".pie.all.purpose.flour.arsc.Jun23.2013.r", sep="")

flour.a <- gsub("# apple.pie.all.purpose.flour.arsc.Jun23.2013.r",
                paste("# ", output.flour, sep=""), flour)

flour.b <- gsub("my.data = convert.txt\\(\"../applepieallpurposeflour.txt",
                paste("my.data = convert.txt\\(\"../", new.fruit, "pieallpurposeflour.txt", sep=""), flour.a)

for(i in 1:length(flour.b)) {
    if(i == 1) cat(flour.b[i], file = output.flour, sep=c("\n"), fill=TRUE)
    if(i > 1)  cat(flour.b[i], file = output.flour, sep=c("\n"), fill=TRUE, append = TRUE)
}
Again, I apologize for my inability to format the above R code in a readable way. I have never encountered this problem on StackOverflow before and do not know the solution. Regardless, the above R script solves the problem I described in the original post.
To see the formatting of the original R script you will have to click the edit button under the original post.
EDIT: June 25, 2013
I do not know what I was doing differently yesterday, but today I found that the following simple cat statement, in place of the for-loop immediately above, creates the new R script while preserving the formatting of the original R script.
cat(flour.b, file = output.flour, sep=c("\n"))
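As an aside, writeLines() should behave the same way here, since readLines() keeps each blank line as an empty string in the character vector; a minimal equivalent sketch:

# writeLines() writes one element of flour.b per line,
# so empty strings come back out as blank lines
writeLines(flour.b, output.flour)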
Related
Take a look at the "Estimated Global Trend daily values" file on this NOAA web page. It is a .txt file with something like 50 header lines (identified with leading #s) followed by several thousand lines of tabular data. The link to download the file is embedded in the code below.
How can I read this file so that I end up with a data frame (or tibble) with the appropriate column names and data?
All the text-to-data functions I know get stymied by those header lines. Here's what I just tried, riffing off of this SO Q&A. My thought was to read the file into a list of lines, then drop the lines that start with # from the list, then do.call(rbind, ...) the rest. The downloading part at the top works fine, but when I run the function, I'm getting back an empty list.
temp <- paste0(tempfile(), ".txt")
download.file("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_trend_gl.txt",
destfile = temp, mode = "wb")
processFile = function(filepath) {
    dat_list <- list()
    con = file(filepath, "r")
    while ( TRUE ) {
        line = readLines(con, n = 1)
        if ( length(line) == 0 ) {
            break
        }
        append(dat_list, line)
    }
    close(con)
    return(dat_list)
}
dat_list <- processFile(temp)
Here's a possible alternative
processFile = function(filepath, header = TRUE, ...) {
    lines <- readLines(filepath)
    comments <- which(grepl("^#", lines))
    header_row <- gsub("^#", "", lines[tail(comments, 1)])
    data <- read.table(text = c(header_row, lines[-comments]), header = header, ...)
    return(data)
}
processFile(temp)
The idea is that we read in all the lines, find the ones that start with "#" and ignore them except for the last one which will be used as the header. We remove the "#" from the header (otherwise it's usually treated as a comment) and then pass it off to read.table to parse the data.
Here are a few options that bypass your function and that you can mix & match.
In the easiest (albeit unlikely) scenario where you know the column names already, you can use read.table and enter the column names manually. The default option of comment.char = "#" means those comment lines will be omitted.
read.table(temp, col.names = c("year", "month", "day", "cycle", "trend"))
More likely is that you don't know those column names, but can get them by figuring out how many comment lines there are, then reading just the last of those lines. That saves you having to read more of the file than you need; this is a small enough file that it shouldn't make a huge difference, but in a larger file it might. I'm doing the counting by accessing the command line, only because that's the way I know how. Note also that I saved the file at an easier path; you could instead paste the command together with the temp variable.
Again, the comments are omitted by default.
n_comments <- as.numeric(system("grep '^# ' co2.txt | wc -l", intern = TRUE))
hdrs <- scan(temp, skip = n_comments - 1, nlines = 1, what = "character")[-1]
read.table(temp, col.names = hdrs)
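If you would rather not shell out to grep (for instance on Windows), the comment lines can be counted in R itself; a sketch, assuming all the comment lines sit together at the top of the file:

# count the lines that start with "#" using R alone
n_comments <- sum(grepl("^#", readLines(temp)))
hdrs <- scan(temp, skip = n_comments - 1, nlines = 1, what = "character")[-1]
read.table(temp, col.names = hdrs)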
Or with dplyr and stringr, read all the lines, separate out the comments to extract column names, then filter to remove the comment lines and separate into fields, assigning the column names you've just pulled out. Again, with a bigger file, this could become burdensome.
library(dplyr)
lines <- data.frame(text = readLines(temp), stringsAsFactors = FALSE)
comments <- lines %>%
    filter(stringr::str_detect(text, "^#"))
hdrs <- strsplit(comments[nrow(comments), 1], "\\s+")[[1]][-1]
lines %>%
    filter(!stringr::str_detect(text, "^#")) %>%
    mutate(text = trimws(text)) %>%
    tidyr::separate(text, into = hdrs, sep = "\\s+") %>%
    mutate_all(as.numeric)
I have code in RStudio which imports a csv based on criteria, using the paste function.
Name <- "Sam"
Location <- "Barnsley"
Code <- "A"
Test2 <- read_csv(paste("C:/Users/....,Opposition , " (",Code,")/Vs ",Location, " (",Code,") Export for ",Name,".csv",sep = ""),skip = 8)
I usually follow this import code with a few lines of code for calculations. For argument's sake, call it: Run Code Series
I would like to adapt this code to work through a list of names, importing each one's file and running the calculation code after each import before moving on to the next name.
Desired:
Name <- c("Sam","David","Paul","John")
Then be able to run the import code and have it Run Code Series after each import before importing the next name.
I believe from your question that you want to end with a separate dataframe for each name. If so, you could do it like this:
Names <- c("Sam","David","Paul","John")
Location <- "Barnsley"
Code <- "A"
for(i in Names){
    Test2 <- read_csv(paste("C:/Users/....,Opposition", " (", Code, ")/Vs ", Location, " (", Code, ") Export for ", i, ".csv", sep = ""), skip = 8)
    # Run Code Series goes here
    assign(paste("df_for_", i, sep = ""), Test2)
}
This will go through your list of names and, within the loop, open each file as Test2. You perform your calculations on Test2 and then assign it to a dataframe named for the particular name in the list, using paste. Also, the quotes in your read_csv line do not match up, so that will need to be corrected.
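As a design note, a named list is often easier to work with downstream than variables created with assign(); here is a rough sketch of that variant, with the calculation step left as a placeholder as in the question:

results <- list()
for(i in Names){
    dat <- read_csv(paste("C:/Users/....,Opposition", " (", Code, ")/Vs ", Location,
                          " (", Code, ") Export for ", i, ".csv", sep = ""), skip = 8)
    # Run Code Series on dat here
    results[[i]] <- dat
}
# then, e.g., results[["Sam"]] retrieves Sam's dataframe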
I downloaded data from the internet and wanted to extract it and create a data frame. You can find the data via the following filtered data set link: http://www.esrl.noaa.gov/gmd/dv/data/index.php?category=Ozone&type=Balloon . At the bottom of the site page, from the 9 filtered data sets, you can choose any station, say Suva, Fiji (SUV):
I have written the following code to create a data frame that includes the Launch Date as part of the data frame for each file.
setwd("C:/Users/")
path = "~C:/Users/"
files <- lapply(list.files(pattern = '\\.l100'), readLines)
test.sample <- do.call(rbind, lapply(files, function(lines){
    data.frame(datetime = as.POSIXct(sub('^.*Launch Date : ', '', lines[grep('Launch Date :', lines)])),
               # and the data, read in as text
               read.table(text = lines[(grep('Sonde Total', lines) + 1):length(lines)]))
}))
The files are from an FTP server. The pattern of the files doesn't look familiar to me; even though I tried it with .txt, it didn't work. Can you please tweak the above code, or suggest other code, to get a data frame?
Thank you in advance.
I think the problem is that the search string "Launch Date :" does not match what is in the files (at least the one I checked).
This should work
lines <- "Launch Date : 11 June 1991"
lubridate::dmy(sub('^.*Launch Date.*: ', '', lines[grep('Launch Date', lines)]))
The code would probably be easier to debug if you broke the problem down into steps rather than writing it as one long expression.
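For instance, a broken-down version of the date extraction might look like this (a sketch, assuming files was built with readLines() as in the question, so files[[1]] is one file's text):

lines <- files[[1]]                                       # the text of one file
launch_line <- lines[grep('Launch Date', lines)]          # 1. find the line
date_text <- sub('^.*Launch Date.*: ', '', launch_line)   # 2. strip the label
launch_date <- lubridate::dmy(date_text)                  # 3. parse the date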
I took the following approach:
td <- tempdir()
setwd(td)
ftp <- 'ftp://ftp.cmdl.noaa.gov/ozwv/Ozonesonde/Suva,%20Fiji/100%20Meter%20Average%20Files/'
files <- RCurl::getURL(ftp, dirlistonly = T)
files <- strsplit(files, "\n")
files <- unlist(files)
dat <- list()
for (i in 1:length(files)) {
    download.file(paste0(ftp, files[i]), 'data.txt')
    df <- read.delim('data.txt', sep = "", skip = 17)
    ld <- as.character(read.delim('data.txt')[9, ])
    ld <- strsplit(ld, ":")[[1]][2]
    df$launch.date <- stringr::str_trim(ld)
    dat[[i]] <- df ; rm(df)
}
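Since the original question asked for a single data frame, the per-file results can then be stacked; this assumes all the files share the same columns:

# combine the list of data frames into one
test.sample <- do.call(rbind, dat)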
I wrote some R code to run analysis on my research project. I coded it in such a way that there was an output text file with the status of the program. Now the header of the output file looks like this:
start time: 2014-10-23 19:15:04
starting analysis on state model: 16
current correlation state: 1
>>>em_prod_combs
em_prod_combs
H3K18Ac_H3K4me1 1.040493e-50
H3K18Ac_H3K4me2 3.208806e-77
H3K18Ac_H3K4me3 0.0001307375
H3K18Ac_H3K9Ac 0.001904384
The ">>>em_prod_combs" is on line 4, and on line 5 it is repeated again (that is how the R code writes it). I'd like the data starting from line 6. This data goes on for 36 more rows, so it ends at line 42. Then there is some other text in the file, all the way until line 742, which looks like this:
(742) >>>em_prod_combs
(743) em_actual_perc
(744) H3K18Ac_H3K4me1 0
H3K18Ac_H3K4me2 0
H3K18Ac_H3K4me3 0.0001976819
H3K18Ac_H3K9Ac 0.001690382
And again I'd like to select the data starting from line 744 (the actual data, not the headers) and go for another 36 rows, ending at line 780. Here is my part of the code:
filepath <- paste(folder_directory, corr_folders[fi], filename, sep="")
con <- file(filepath)
open(con)
results.list <- list()
current.line <- 0
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    if(line == ">>>em_prod_combs"){
        storethenext <- TRUE
    }
}
close(con)
Here, I was trying to see if the line read had the ">>>" mark. If so, I would set a variable to TRUE, store the next 36 lines (using another counter variable) in a data frame or list, and then set the storethenext variable back to FALSE. I was kind of hoping that there is a better way of doing this...
So I realized that readLines has an n parameter I could use to skip lines by reading them and throwing them away. Based on that, I got this:
df <- data.frame(name = character(40),
                 params = numeric(40),
                 stringsAsFactors = FALSE)
con <- file(filepath)
open(con)
results.list <- list()
current.line <- 0
firstblock <- readLines(con, n = 5, warn = FALSE)   # throwaway: the header lines
firstblock <- readLines(con, n = 36, warn = FALSE)  # the 36 data lines
firstblock <- as.list(firstblock)                   # convert to list
for(k in 1:36){
    splitstring = strsplit(firstblock[[k]], " ", fixed = TRUE)
    ## put the data in the df
}
But it turns out from Ben's answer that read.table can do the same thing, so I've reduced it down to the following one-liner:
firstblock2 <- read.table(filepath, header = FALSE, sep = " ", skip = 5, nrows = 36)
This also makes it a data frame implicitly and does all the dirty work for me.
The documentation for read.table is here:
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
In tidyverse (readr)
If you don't wish to convert the data into a dataframe, you can just read the slice of text with read_lines().
(Note that the n_max = argument asks for how many lines you want to read in, not the number of the row you want to stop at. To be honest, I ultimately found this preferable, as I usually need to manage the file length more than I need to isolate sections of code on the read-in.)
firstblock <- read_lines(filepath, skip = 5, n_max = 31)
If you don't want to think in terms of file size, you could modify your code thus:
start_line = 5
end_line = 36
line_count = end_line - start_line
firstblock <- read_lines(filepath, skip = start_line, n_max = line_count)
In any case, here are some additional things I found helpful for working with these file formats once I got to know them a little better:
If you want to read files immediately into lists as you did above, just use:
read_lines_raw(filepath, skip = 5, n_max = 31)
and you will get a 31-element list as firstblock, instead of the character vector that read_lines() returns.
Additional super-cool features I stumbled upon there (and was moved to share because I thought they rock):
it automatically decompresses .gz, .bz2, .xz, or .zip files;
it will make the connection and complete the download automatically if the filepath starts with http://, https://, ftp://, or ftps://;
and if the file is both compressed and remote, as above, it does both.
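So, for example, the NOAA file from the question earlier on this page could be read straight from its URL, with no separate download.file() step (assuming the server still responds):

# read_lines() opens the FTP connection and downloads for you
co2_lines <- readr::read_lines("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_trend_gl.txt")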
If things need to go back where they came from, the write_lines function turned out to be much more enjoyable to use than the base version. Specifically, there are no file connections to open and close: just specify the object and the filename you wish to write it into. So, for example, the below:
fileConn <- file("directory/new.sql")
writeLines(new_sql, fileConn)
close(fileConn)
gets to just be:
write_lines(new_sql, "directory/new.sql")
Enjoy, and hope this helps!
I am a bit new to R and trying to learn, but I am confused about how to fix a problem I have stumbled upon. I am trying to input multiple files so that I can make one histogram per file. The code works well with just one file, but I have encountered a problem when I enter multiple files.
EDIT: Ending code
library("scales")
library("tcltk")
File.names<-(tk_choose.files(default="", caption="Choose your files", multi=TRUE, filters=NULL, index=1))
Num.Files<-NROW(File.names)
dat <- lapply(File.names,read.table,header = TRUE)
names(dat) <- paste("f", seq_along(dat), sep="")
tmp <- stack(lapply(dat,function(x) x[,14]))
require(ggplot2)
ggplot(tmp, aes(x = values)) +
    facet_wrap(~ind) +
    geom_histogram(aes(y = ..count../sum(..count..)))
Well, here's something to get you started (but I can't be sure it will work exactly for you, since your code isn't reproducible):
dat <- lapply(File.names,read.table,header = TRUE)
names(dat) <- paste("f", seq_along(dat), sep="")
tmp <- stack(lapply(dat,function(x) x[,14]))
require(ggplot2)
ggplot(tmp, aes(x = values)) +
    facet_wrap(~ind) +
    geom_histogram()
Ditch everything you wrote after this line:
File.names<-(tk_choose.files(default="", caption="Choose your files", multi=TRUE, filters=NULL, index=1))
and use the above code instead.
A few other explanations (BlueTrin explained the first error):
for (i in Num.Files){
    f <- read.table(File.names[i], header=TRUE)
}
This will loop through your file names and read each one, but it will overwrite the previous file each time through the loop. What you'll be left with is only the last file stored in f.
colnames(f) <- c(1:18)
histoCol <- c(f$'14')
You don't need the c() function here. Just 1:18 is sufficient. But numbers as column names are generally awkward, and should probably be avoided.
f(Num.Files) <- paste("f", 1:length(Num.Files), sep = "") : could not find function "f<-"
This specific error happens because f(Num.Files) <- ... asks R to call a replacement function named f<-, which does not exist.
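To illustrate with a toy example (not something the original code needs), assignment into a function call only works when a matching replacement function exists:

# a replacement function must be defined under the special name "name<-"
`second<-` <- function(x, value) { x[2] <- value; x }
v <- c(10, 20, 30)
second(v) <- 99   # works, because `second<-` exists
v                 # 10 99 30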
This should load the values into a list:
library("lattice");
library("tcltk");
File.names<-(tk_choose.files(default="", caption="Choose your files", multi=TRUE, filters=NULL, index=1));
Num.Files<-NROW(File.names);
result_list = list();
#f(Num.Files)<-paste("f", 1:length(Num.Files), sep="");
#ls();
for (i in Num.Files) {
full_path = File.names[i];
short_name = basename(full_path);
result_list[[short_name]] = read.table(full_path,header=TRUE);
}
Once you run this program, you can type result_list$ (without the quotes) and press TAB for completion. Alternatively, you can use result_list[[1]], for example, to access the first table.
result_list is a variable of type list: a container that supports indexing by a label, which in this case is the filename. (I replaced the full filename with the short filename, as the full filename is a bit ugly in a list, but feel free to change it back.)
Be careful with the f(...) <- ... construction from before: f is not actually a reserved keyword, but writing f(Num.Files) <- ... asks R for a replacement function named f<-, which was never defined. If you rewrite the program above in that style, it should fail in the same way.
I hope this, together with the other solution, is enough to get you started!