R file inputs and histogram

I am a bit new to R and trying to learn, but I am confused about how to fix a problem I have stumbled upon. I am trying to input multiple files so that I can make one histogram per file. The code works well with just one file, but I run into a problem when I enter multiple files.
EDIT: Final code:
library("scales")
library("tcltk")
File.names <- tk_choose.files(default = "", caption = "Choose your files", multi = TRUE, filters = NULL, index = 1)
Num.Files <- NROW(File.names)
dat <- lapply(File.names, read.table, header = TRUE)
names(dat) <- paste("f", seq_along(dat), sep = "")  # one name per file; 1:length(Num.Files) would give just "f1"
tmp <- stack(lapply(dat, function(x) x[, 14]))
require(ggplot2)
ggplot(tmp, aes(x = values)) +
  facet_wrap(~ind) +
  geom_histogram(aes(y = ..count../sum(..count..)))

Well, here's something to get you started (but I can't be sure it will work exactly for you, since your code isn't reproducible):
dat <- lapply(File.names, read.table, header = TRUE)
names(dat) <- paste("f", seq_along(dat), sep = "")  # one name per file
tmp <- stack(lapply(dat, function(x) x[, 14]))
require(ggplot2)
ggplot(tmp, aes(x = values)) +
  facet_wrap(~ind) +
  geom_histogram()
Ditch everything you wrote after this line:
File.names <- tk_choose.files(default = "", caption = "Choose your files", multi = TRUE, filters = NULL, index = 1)
and use the above code instead.
A few other explanations (BlueTrin explained the first error):
for (i in Num.Files) {
  f <- read.table(File.names[i], header = TRUE)
}
Because Num.Files is a single number, this loops only once, with i equal to Num.Files, so only the last file gets read. And even if you loop over seq_len(Num.Files), f is overwritten on every pass, so either way you are left with only the last file stored in f.
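If you do want an explicit loop, store each file in its own list element instead of overwriting f; a minimal sketch, using the File.names and Num.Files already defined above:
f_list <- vector("list", Num.Files)
for (i in seq_len(Num.Files)) {
  f_list[[i]] <- read.table(File.names[i], header = TRUE)  # each file keeps its own slot
}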
colnames(f) <- c(1:18)
histoCol <- c(f$'14')
You don't need the c() function here. Just 1:18 is sufficient. But numbers as column names are generally awkward, and should probably be avoided.
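For instance, you can pull out the 14th column directly, with no renaming at all:
histoCol <- f[[14]]  # same column as f[, 14]; avoids numeric column names entirely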

f(Num.Files) <- paste("f", 1:length(Num.Files), sep = "") : could not find function "f<-"
This specific error happens because assigning into the result of a function call, f(Num.Files) <- ..., makes R look for a replacement function named f<-, which does not exist.
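In other words, f(x) <- y is shorthand for x <- `f<-`(x, value = y), so R goes hunting for a function named f<-. A toy sketch (the tag attribute here is made up purely for illustration):
`tag<-` <- function(x, value) { attr(x, "tag") <- value; x }  # a replacement function takes x and value
y <- 1:3
tag(y) <- "my label"   # works now, because a function named `tag<-` exists
attributes(y)$tag      # "my label"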
This should load the values into a list:
library("lattice");
library("tcltk");
File.names<-(tk_choose.files(default="", caption="Choose your files", multi=TRUE, filters=NULL, index=1));
Num.Files<-NROW(File.names);
result_list = list();
#f(Num.Files)<-paste("f", 1:length(Num.Files), sep="");
#ls();
for (i in Num.Files) {
full_path = File.names[i];
short_name = basename(full_path);
result_list[[short_name]] = read.table(full_path,header=TRUE);
}
Once you run this program, you can type 'result_list$' without the quotes and press TAB for completion. Alternatively you can use result_list[[1]] for example to access the first table.
result_list is a variable of type list: a container that supports indexing by a label, which in this case is the filename. (I replaced the full filename with the short filename, since the full path is a bit ugly as a list label, but feel free to change it back.)
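For example, assuming one of the chosen files was called data1.txt (a made-up name):
names(result_list)                  # the short file names used as labels
head(result_list[["data1.txt"]])    # access one table by label
head(result_list[[1]])              # or by position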
Also, be careful using f as a variable name. It is not actually a reserved word, but as the error above shows, the syntax f(Num.Files) <- ... is parsed as a call to a replacement function named f<- rather than as an assignment to a variable, so terse, function-like names invite exactly this confusion.
I hope this, together with the other answer, is enough to get you started!

Related

How can I read a table in a loosely structured text file into a data frame in R?

Take a look at the "Estimated Global Trend daily values" file on this NOAA web page. It is a .txt file with something like 50 header lines (identified with leading #s) followed by several thousand lines of tabular data. The link to download the file is embedded in the code below.
How can I read this file so that I end up with a data frame (or tibble) with the appropriate column names and data?
All the text-to-data functions I know get stymied by those header lines. Here's what I just tried, riffing off of this SO Q&A. My thought was to read the file into a list of lines, then drop the lines that start with # from the list, then do.call(rbind, ...) the rest. The downloading part at the top works fine, but when I run the function, I'm getting back an empty list.
temp <- paste0(tempfile(), ".txt")
download.file("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_trend_gl.txt",
              destfile = temp, mode = "wb")
processFile = function(filepath) {
  dat_list <- list()
  con = file(filepath, "r")
  while ( TRUE ) {
    line = readLines(con, n = 1)
    if ( length(line) == 0 ) {
      break
    }
    append(dat_list, line)  # bug: append() returns a new list and this result is discarded,
                            # so dat_list stays empty; it would need dat_list <- append(dat_list, line)
  }
  close(con)
  return(dat_list)
}
dat_list <- processFile(temp)
Here's a possible alternative
processFile = function(filepath, header=TRUE, ...) {
  lines <- readLines(filepath)
  comments <- which(grepl("^#", lines))
  header_row <- gsub("^#", "", lines[tail(comments, 1)])
  data <- read.table(text = c(header_row, lines[-comments]), header = header, ...)
  return(data)
}
processFile(temp)
The idea is that we read in all the lines, find the ones that start with "#", and ignore them, except for the last one, which is used as the header. We remove the "#" from the header row (otherwise it would usually be treated as a comment) and then pass everything to read.table to parse the data.
Here are a few options that bypass your function and that you can mix & match.
In the easiest (albeit unlikely) scenario where you know the column names already, you can use read.table and enter the column names manually. The default option of comment.char = "#" means those comment lines will be omitted.
read.table(temp, col.names = c("year", "month", "day", "cycle", "trend"))
More likely is that you don't know those column names, but can get them by figuring out how many comment lines there are, then reading just the last of those lines. That saves you having to read more of the file than you need; this is a small enough file that it shouldn't make a huge difference, but in a larger file it might. I'm doing the counting by accessing the command line, only because that's the way I know how. Note also that I saved the file at an easier path; you could instead paste the command together with the temp variable.
Again, the comments are omitted by default.
n_comments <- as.numeric(system("grep '^# ' co2.txt | wc -l", intern = TRUE))
hdrs <- scan(temp, skip = n_comments - 1, nlines = 1, what = "character")[-1]
read.table(temp, col.names = hdrs)
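If you can't shell out (the grep | wc pipeline assumes a Unix-like system), a base-R version of the same counting idea is a one-liner; a sketch along the same lines:
n_comments <- sum(grepl("^#", readLines(temp)))  # count the leading comment lines
hdrs <- scan(temp, skip = n_comments - 1, nlines = 1, what = "character")[-1]
read.table(temp, col.names = hdrs)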
Or with dplyr and stringr, read all the lines, separate out the comments to extract column names, then filter to remove the comment lines and separate into fields, assigning the column names you've just pulled out. Again, with a bigger file, this could become burdensome.
library(dplyr)
lines <- data.frame(text = readLines(temp), stringsAsFactors = FALSE)
comments <- lines %>%
  filter(stringr::str_detect(text, "^#"))
hdrs <- strsplit(comments[nrow(comments), 1], "\\s+")[[1]][-1]
lines %>%
  filter(!stringr::str_detect(text, "^#")) %>%
  mutate(text = trimws(text)) %>%
  tidyr::separate(text, into = hdrs, sep = "\\s+") %>%
  mutate_all(as.numeric)

Converting argument to partial string

I'm sure this is pretty basic, but I haven't been able to find an answer on stackoverflow.
The basics of what I'm working with is
f1 <- function(x) {
  setwd("~/Rdir/x")
  col1 <- f2(...)
  col2 <- f3(...)
  genelist <- data.frame(col1, col2)
  write.csv(genelist, file = "x.csv")
}
Essentially, what I want is for x to be replaced by whatever I input. For example, f1(test) would save a file called test.csv into the directory Rdir/test.
I would post a more complete code sample of what I'm working with, but it is very long.
You can use ?paste:
setwd(paste("~/Rdir/", x, sep=""))
and
write.csv(genelist, file=paste(x, ".csv", sep=""))
in your example. However, it might be more straightforward not to change the working directory and instead just specify the full path when saving:
write.csv(genelist, file=paste("~/Rdir/", x, "/", x, ".csv", sep=""))
but be aware that this will fail if the directory does not exist. Have a look at ?dir.create for creating the directory first, in case it does not exist.
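Putting those pieces together, here is a minimal sketch of f1 (assuming x arrives as a string, e.g. f1("test"); the genelist placeholder stands in for the data built by f2() and f3() in your function):
f1 <- function(x) {
  dir <- file.path("~/Rdir", x)
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)  # no-op if the directory already exists
  genelist <- data.frame(col1 = NA, col2 = NA)             # placeholder for the real data
  write.csv(genelist, file = file.path(dir, paste0(x, ".csv")))
}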
You can create the filename with paste0 and the path with file.path:
x <- "test"
file.path("~/Rdir", x, paste0(x, ".csv"))
# "~/Rdir/test/test.csv"

selecting text from the middle of a text file with known line numbers

I wrote some R code to run analysis on my research project. I coded it in such a way that there was an output text file with the status of the program. Now the header of the output file looks like this:
start time: 2014-10-23 19:15:04
starting analysis on state model: 16
current correlation state: 1
>>>em_prod_combs
em_prod_combs
H3K18Ac_H3K4me1 1.040493e-50
H3K18Ac_H3K4me2 3.208806e-77
H3K18Ac_H3K4me3 0.0001307375
H3K18Ac_H3K9Ac 0.001904384
The `>>>em_prod_combs` marker is on line 4; on line 5 it is repeated again (by the R code). I'd like the data starting from line 6. This data goes on for 36 more rows, so it ends at line 42. Then there is other text in the file, all the way until line 742, which looks like this:
(742) >>>em_prod_combs
(743) em_actual_perc
(744) H3K18Ac_H3K4me1 0
H3K18Ac_H3K4me2 0
H3K18Ac_H3K4me3 0.0001976819
H3K18Ac_H3K9Ac 0.001690382
And again I'd like to select data from line 744 (actual data, not headers) and go for another 36 rows and end at line 780. Here is my part of the code:
filepath <- paste(folder_directory, corr_folders[fi], filename, sep = "")
con <- file(filepath)
open(con)
results.list <- list()
current.line <- 0
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  if (line == ">>>em_prod_combs") {
    storethenext <- TRUE
  }
}
close(con)
close(con)
Here, I was trying to see if the line read had the ">>>" mark. If so, set a variable to TRUE and store the next 36 lines (using another counter variable) in a data frame or list and set the storethenext variable back to F. I was kind of hoping that there is a better way of doing this....
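(For what it's worth, the flag idea can be completed. Here is a minimal sketch, assuming each >>> marker is followed by the repeated header line and then 36 data rows:)
con <- file(filepath)
open(con)
blocks <- list()
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  if (grepl("^>>>", line)) {
    readLines(con, n = 1, warn = FALSE)  # skip the repeated header line
    blocks[[length(blocks) + 1]] <- readLines(con, n = 36, warn = FALSE)
  }
}
close(con)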
So I realized that readLines has a parameter you can set for skipping lines. Based on that, I got this:
df <- data.frame(name = character(40),
                 params = numeric(40),
                 stringsAsFactors = FALSE)
con <- file(filepath)
open(con)
results.list <- list()
current.line <- 0
firstblock <- readLines(con, n = 5, warn = FALSE)  # read the five header lines...
firstblock <- NULL                                 # ...and throw them away
firstblock <- readLines(con, n = 36, warn = FALSE)
firstblock <- as.list(firstblock)                  # convert to list
for (k in 1:36) {
  splitstring = strsplit(firstblock[[k]], " ", fixed = TRUE)
  ## put the data in the df
}
close(con)
But it turns out from Ben's answer that read.table can do the same thing in one line, so I've reduced it down to the following one-liner:
firstblock2 <- read.table(filepath, header = FALSE, sep = " ", skip = 5, nrows = 36)
This also makes it a data frame implicitly and does all the dirty work for me.
The documentation for read.table is here:
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
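The same one-liner also reaches the second block: per the line numbers in the question, its data starts at line 744, so (a sketch):
secondblock <- read.table(filepath, header = FALSE, sep = " ", skip = 743, nrows = 36)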
In tidyverse (readr)
If you don't wish to convert the data into a data frame, you can read just the slice of text with read_lines().
(Be sure to note that the n_max argument asks how many lines you want to read in, not the number of the row you want to stop at. Ultimately I found this preferable too, since I usually need to manage the file length more than I need to isolate sections on read-in.)
firstblock <- read_lines(filepath, skip = 5, n_max = 36)
If you would rather think in terms of start and end lines than in line counts, you could write it as:
start_line = 5
end_line = 41
line_count = end_line - start_line
firstblock <- read_lines(filepath, skip = start_line, n_max = line_count)
In any case, here are some additional things I found helpful for working with these files once I got to know readr a little better:
If you want to read the file straight into a list, as you did above, just use:
read_lines_raw(filepath, skip = 5, n_max = 36)
and you will get a 36-element list as your firstblock object, instead of the character vector you get from read_lines().
Additional super-cool features I stumbled upon there (and was moved to share, because I thought they rock):
it automatically decompresses .gz, .bz2, .xz, or .zip files;
it will make the connection and complete the download automatically if the filepath starts with http://, https://, ftp://, or ftps://;
and if the file has both an extension and a prefix as above, it does both.
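So, for example, something like this should download and decompress in a single call (the URL is made up for illustration):
read_lines("https://example.com/archive/mydata.txt.gz", skip = 5, n_max = 36)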
If things need to go back where they came from, the write_lines function turned out to be much more enjoyable to use than the base version. Specifically, there are no file connections to open and close: just specify the object and the filename you wish to write it to.
So, for example, this:
fileConn <- file("directory/new.sql")
writeLines(new_sql, fileConn)
close(fileConn)
gets to just be:
write_lines(new_sql, "directory/new.sql")
Enjoy, and hope this helps!

using cat in R to create a formatted R script

I want to read an R file or script, modify the name of the external data file being read and export the modified R code into a new R file or script. Other than the name of the data file being read (and the name of the new R file) I want the two R scripts to be identical.
I can come close, except that I cannot figure out how to retain the blank lines I use for readability and error reduction.
Here is the original R file being read. Note that some of the code in this file is nonsensical, but that is irrelevant here: this code does not need to run.
# apple.pie.all.purpose.flour.arsc.Jun23.2013.r
library(my.library)
aa <- 10 # aa
bb <- c(1:7) # bb
my.data = convert.txt("../applepieallpurposeflour.txt",
group.df = data.frame(recipe =
c("recipe1", "recipe2", "recipe3", "recipe4", "recipe5")),
covariates = c(paste( "temp", seq_along(1:aa), sep="")))
ingredient <- c('all purpose flour')
function(make.pie){ make a pie }
Here is the R code I use to read the file above, modify it, and export the result. This code runs, and it is the only code that needs to run to achieve the desired result. The only problem is that the format of the new R script does not match the original exactly: blank lines present in the original R script are missing from the new one.
setwd('c:/users/mmiller21/simple r programs/')
# define new fruit
new.fruit <- 'peach'
# read flour file for original fruit
flour <- readLines('apple.pie.all.purpose.flour.arsc.Jun23.2013.r')
# create new file name
output.flour <- paste(new.fruit, ".pie.all.purpose.flour.arsc.Jun23.2013.r", sep="")
# add new file name
flour.a <- gsub("# apple.pie.all.purpose.flour.arsc.Jun23.2013.r",
                paste("# ", output.flour, sep = ""), flour)
# add line to read new data file
cat(file = output.flour,
    gsub("my.data = convert.txt\\(\"../applepieallpurposeflour.txt",
         paste("my.data = convert.txt\\(\"../", new.fruit, "pieallpurposeflour.txt",
               sep = ""), flour.a),
    sep = c("", "\n"), fill = TRUE)
Here is the resulting new R script:
# peach.pie.all.purpose.flour.arsc.Jun23.2013.r
library(my.library)
aa <- 10 # aa
bb <- c(1:7) # bb
my.data = convert.txt("../peachpieallpurposeflour.txt",
group.df = data.frame(recipe =
c("recipe1", "recipe2", "recipe3", "recipe4", "recipe5")),
covariates = c(paste( "temp", seq_along(1:aa), sep="")))
ingredient <- c('all purpose flour')
function(make.pie){ make a pie }
There is one blank line in the newly-created R file, but how can I insert all of the blank lines present in the original R script? Thank you for any advice.
EDIT: I cannot seem to duplicate the blank lines here on StackOverflow. They seem to be deleted automatically. StackOverflow is even deleting the indentation I am using and I cannot seem to replace it. Sorry about this. Automatic deletion of blank lines and indentation is problematic when the issue at hand is specifically about formatting. I cannot seem to fix the post to display the R code as formatted in my script. However, the code does display correctly when I am actively editing the post.
EDIT: June 27, 2013: The deletion of empty rows and indentation in the code for the original R file and in the code for the middle R file appears to be associated with my laptop rather than with StackOverflow. When I view this post and my answers on my office desktop the format is correct. When I view this post and my answers with my laptop the empty rows and indentation are gone. Perhaps my laptop monitor is malfunctioning. Sorry about assuming initially that the problem was with StackOverflow.
Here is a function that will create a new R file for every combination of two variables. Sorry the formatting of the code below is not better. The code does run and does work as intended (provided the name of the original R file ends in ".arsc.Jun26.2013.r" instead of the ".arsc.Jun23.2013.r" used in the original post):
setwd('c:/users/mmiller21/simple r programs/')

# define fruits of interest
fruits <- c('apple', 'pumpkin', 'pecan')

# define ingredients of interest
ingredients <- c('all.purpose.flour', 'sugar', 'ground.cinnamon')

# define every combination of fruit and ingredient
fruits.and.ingredients <- expand.grid(fruits, ingredients)
old.fruit <- as.character(rep('apple', nrow(fruits.and.ingredients)))
old.ingredient <- as.character(rep('all.purpose.flour', nrow(fruits.and.ingredients)))
fruits.and.ingredients2 <- cbind(old.fruit, as.character(fruits.and.ingredients[,1]),
                                 old.ingredient, as.character(fruits.and.ingredients[,2]))
colnames(fruits.and.ingredients2) <- c('old.fruit', 'new.fruit', 'old.ingredient', 'new.ingredient')

# begin function
make.pie <- function(old.fruit, new.fruit, old.ingredient, new.ingredient) {
  new.ingredient2 <- gsub('\\.', '', new.ingredient)
  old.ingredient2 <- gsub('\\.', '', old.ingredient)
  new.ingredient3 <- gsub('\\.', ' ', new.ingredient)
  old.ingredient3 <- gsub('\\.', ' ', old.ingredient)
  # file names
  old.file <- paste(old.fruit, ".pie.", old.ingredient, ".arsc.Jun26.2013.r", sep = "")
  new.file <- paste(new.fruit, ".pie.", new.ingredient, ".arsc.Jun26.2013.r", sep = "")
  # read the file for the original fruit and ingredient
  flour <- readLines(old.file)
  # add new file name
  flour.a <- gsub(paste("# ", old.file, sep = ""),
                  paste("# ", new.file, sep = ""), flour)
  # swap in the new data file
  old.data.file <- paste("my.data = convert.txt(\"../", old.fruit, "pie", old.ingredient2, ".txt\",", sep = "")
  new.data.file <- paste("my.data = convert.txt(\"../", new.fruit, "pie", new.ingredient2, ".txt\",", sep = "")
  flour.b <- ifelse(flour.a == old.data.file, new.data.file, flour.a)
  flour.c <- ifelse(flour.b == paste('ingredient <- c(\'', old.ingredient3, '\')', sep = ""),
                    paste('ingredient <- c(\'', new.ingredient3, '\')', sep = ""), flour.b)
  cat(flour.c, file = new.file, sep = "\n")
}

apply(fruits.and.ingredients2, 1, function(x) make.pie(x[1], x[2], x[3], x[4]))
Here is one solution that reproduces the original R script (except for the two desired changes) while also preserving the formatting of that original R script in the new R script.
setwd('c:/users/mmiller21/simple r programs/')
new.fruit <- 'peach'
flour <- readLines('apple.pie.all.purpose.flour.arsc.Jun23.2013.r')
output.flour <- paste(new.fruit, ".pie.all.purpose.flour.arsc.Jun23.2013.r", sep="")
flour.a <- gsub("# apple.pie.all.purpose.flour.arsc.Jun23.2013.r",
                paste("# ", output.flour, sep = ""), flour)
flour.b <- gsub("my.data = convert.txt\\(\"../applepieallpurposeflour.txt",
                paste("my.data = convert.txt\\(\"../", new.fruit, "pieallpurposeflour.txt", sep = ""), flour.a)
for (i in 1:length(flour.b)) {
  if (i == 1) cat(flour.b[i], file = output.flour, sep = "\n", fill = TRUE)
  if (i > 1)  cat(flour.b[i], file = output.flour, sep = "\n", fill = TRUE, append = TRUE)
}
Again, I apologize for my inability to format the above R code in a readable way. I have never encountered this problem on StackOverflow and do not know the solution. Regardless, the above R script solves the problem I described in the original post.
To see the formatting of the original R script you will have to click the edit button under the original post.
EDIT: June 25, 2013
I do not know what I was doing differently yesterday, but today I found that the following simple cat statement, in place of the for-loop immediately above, creates the new R script while preserving the formatting of the original R script.
cat(flour.b, file = output.flour, sep=c("\n"))

Executing function on objects of name 'i' within for-loop in R

I am still pretty new to R and very new to for-loops and functions, but I searched quite a bit on stackoverflow and couldn't find an answer to this question. So here we go.
I'm trying to create a script that will (1) read in multiple .csv files and (2) apply a function that strips Twitter handles from URLs and does some other things to these files. I have developed the script for these two tasks separately, so I know that most of my code works, but something goes wrong when I try to combine them. I prepare for doing so using the following code:
# specify directory for your files and replace 'file' with the first, unique part of the
# files you would like to import
library(stringr)  # needed for str_match() below

mypath <- "~/Users/you/data/"
mypattern <- "file+.*csv"
# Get a list of the files
file_list <- list.files(path = mypath,
                        pattern = mypattern)
# List of names to be given to data frames
data_names <- str_match(file_list, "(.*?)\\.")[,2]
# Define function for preparing datasets
handlestripper <- function(data){
  data$handle <- str_match(data$URL, "com/(.*?)/status")[,2]
  data$rank <- c(1:500)
  names(data) <- c("dateGMT", "url", "tweet", "twitterid", "rank")
  data <- data[, c(4, 1:3, 5)]
}
That all works fine. The problem comes when I try to execute the function handlestripper() within the for-loop.
# Read in data
for (i in data_names) {
  filepath <- file.path(mypath, paste(i, ".csv", sep = ""))
  assign(i, read.delim(filepath, colClasses = "character", sep = ","))
  i <- handlestripper(i)
}
When I execute this code, I get the following error: Error in data$URL : $ operator is invalid for atomic vectors. I know that this means that my function is being applied to the string I called from within the vector data_names, but I don't know how to tell R that, in this last line of my for-loop, I want the function applied to the objects of name i that I just created using the assign command, rather than to i itself.
Inside your loop, you can change this:
assign(i, read.delim(filepath, colClasses = "character", sep = ","))
i <- handlestripper(i)
to
tmp <- read.delim(filepath, colClasses = "character", sep = ",")
assign(i, handlestripper(tmp))
I think you should make as few get and assign calls as you can, but there's nothing wrong with indexing your loop with names as you are doing. I do it all the time, anyway.
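For what it's worth, a list-based version that skips assign() entirely might look like this (a sketch using the objects defined in the question):
files <- file.path(mypath, paste0(data_names, ".csv"))
data_list <- lapply(files, read.delim, colClasses = "character", sep = ",")  # read every file
data_list <- lapply(data_list, handlestripper)                               # clean every file
names(data_list) <- data_names                                               # label by dataset name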