R write to file / append at start of file - r

I am trying to write an input file that requires a single line in the first row telling if the file is sparse and if so how many variable levels there are. I know how to append a single line to the end of a file, but can't find a way to append to the first line of a file. Any suggestions?
library(e1071)
library(caret)
library(Matrix)
library(SparseM)
iris2 <- iris
iris2$sepalOver5 <- ifelse(iris2$Sepal.Length >= 5, 1, -1)
head(iris2)
summary(iris2)
trainRows <- sample(1:nrow(iris2), nrow(iris2) * .66, replace = F)
testRows <- which(!(1:nrow(iris2) %in% trainRows))
sum(testRows %in% trainRows)
sum(trainRows %in% testRows)
vtu1 <- c('Sepal.Width','Petal.Length','Petal.Width','Species')
dv1 <- dummyVars( ~., data = iris2[,vtu1], sparse = T)
train <- iris2[trainRows,]
test <- iris2[testRows,]
trainX <- as.matrix.csr(predict(dv1, train))
testX <- as.matrix.csr(predict(dv1, test))
trainY <- train[,'sepalOver5']
testY <- test[,'sepalOver5']
write.matrix.csr( as(trainX , "matrix.csr"), file= "amz.train" , fac = TRUE)
headString <- paste('sparse ',max(trainX#ja),sep = '')
I'd basically like to insert/append headString into amz.train in the first row. Any suggestions?

It is generally not possible to prepend to the start of a file (and if there are ways, they would be really inefficient, since the information of the start of the file in memory is generally unknown. This holds for any programming language).
Three options come to mind:
Read in the file, write the other information first, followed by the rest of the content of the file (might also be inefficient)
Write the information you want to prepend first
In the case you have a writer that cannot append (write.matrix for instance has no append option), you could try to merge this meta information with the data frame, and then writing it as a whole.
Since you are using a specialized format, I wouldn't recommend storing this meta-information this way.
Your file would look like:
sparse 6
1:3 2:5.2 3:2 6:1
1:3.7 2:1.5 3:0.2 4:1
1:3.2 2:6 3:1.8 6:1
And then there is option 4:
Rather, consider having a meta file which contains information such as file name, whether it is sparse or not and the number of levels. Here you could append, and if you would repeat this process it would be preferable. It will avoid problems of reading in weirdly formatted files.

Related

Saving data frames using a for loop with file names corresponding to data frames

I have a few data frames (colors, sets, inventory) and I want to save each of them into a folder that I have set as my wd. I want to do this using a for loop, but I am not sure how to write the file argument such that R understands that it should use the elements of the vector as the file names.
I might write:
DFs <- c("colors", "sets", "inventory")
for (x in 1:length(DFs)){
save(x, file = "x.Rda")
}
The goal would be that the files would save as colors.Rda, sets.Rda, etc. However, the last element to run through the loop simply saves as x.Rda.
In short, perhaps my question is: how do you tell R that I am wanting to use elements being run through a loop within an argument when that argument requires a character string?
For bonus points, I am sure I will encounter the same problem if I want to load a series of files from that folder in the future. Rather than loading each one individually, I'd also like to write a for loop. To load these a few minutes ago, I used the incredibly clunky code:
sets_file <- "~/Documents/ME teaching/R notes/datasets/sets.csv"
sets <- read.csv(sets_file)
inventories_file <- "~/Documents/ME teaching/R notes/datasets/inventories.csv"
inventories <- read.csv(inventories_file)
colors_file <- "~/Documents/ME teaching/R notes/datasets/colors.csv"
colors <- read.csv(colors_file)
For compactness I use lapply instead of a for loop here, but the idea is the same:
lapply(DFs, \(x) save(list=x, file=paste0(x, ".Rda"))))
Note that you need to generate the varying file names by providing x as a variable and not as a character (as part of the file name).
To load those files, you can simply do:
lapply(paste0(DFs, ".Rda"), load, envir = globalenv())
To save you can do this:
DFs <- list(color, sets, inventory)
names(DFs) = c("color", "sets", "inventory")
for (x in 1:length(DFs)){
dx = paste(names(DFs)[[x]], "Rda", sep = ".")
dfx = DFs[[x]]
save(dfx, file = dx)
}
To specify the path just inform in the construction of the dx object as following to read.
To read:
DFs <- c("colors", "sets", "inventory")
# or
DFs = dir("~/Documents/ME teaching/R notes/datasets/")
for(x in 1:length(DFs)){
arq = paste("~/Documents/ME teaching/R notes/datasets/", DFs[x], ".csv", sep = "")
DFs[x] = read.csv(arq)
}
It will read as a list, so you can access using [[]] indexation.

RSME on dataframe of multiple files in R

My goal is to read many files into R, and ultimately, run a Root Mean Square Error (rmse) function on each pair of columns within each file.
I have this code:
#This calls all the files into a dataframe
filnames <- dir("~/Desktop/LGsampleHUCsWgraphs/testRSMEs", pattern = "*_45Fall_*")
#This reads each file
read_data <- function(z){
dat <- read_excel(z, skip = 0, )
return(dat)
}
#This combines them into one list and splits them by the names in the first column
datalist <- lapply(filnames, read_data)
bigdata <- rbindlist(datalist, use.names = T)
splitByHUCs <- split(bigdata, f = bigdata$HUC...1 , sep = "\n", lex.order = TRUE)
So far, all is working well. Now I want to apply an rmse [library(Metrics)] analysis on each of the "splits" created above. I don't know what to call the "splits". Here I have used names but that is an R reserved word and won't work. I tried the bigdata object but that didn't work either. I also tried to use splitByHUCs, and rMSEs.
rMSEs <- sapply(splitByHUCs, function(x) rmse(names$Predicted, names$Actual))
write.csv(rMSEs, file = "~/Desktop/testRMSEs.csv")
The rmse code works fine when I run it on a single file and create a name for the dataframe:
read_excel("bcc1_45Fall_1010002.xlsm")
bcc1F1010002 <- read_excel("bcc1_45Fall_1010002.xlsm")
rmse(bcc1F1010002$Predicted, bcc1F1010002$Actual)
The "splits" are named by the "splitByHUCs" script, like this:
They are named for the file they came from, appropriately. I need some kind of reference name for the rmse formula and I don't know what it would be. Any ideas? Thanks. I made some small versions of the files, but I don't know how to add them here.
As it is a list, we can loop over the list with sapply/lapply as in the OP's code, but the names$ is incorrect as the lambda function object is x which signifies each of the elements of the list (i.e. a data.frame). Therefore, instead of names$, use x$
sapply(splitByHUCs, function(x) rmse(x$Predicted, x$Actual))

Looping Over a Set of Files

I cooked up some code that is supposed to find all my .txt files (they're outputs of ODE simulations), open them all up as data frames with "read.table" and then perform some calculations on them.
files <- list.files(path="/Users/redheadmammoth/Desktop/Ultimate_Aging_F2016",
pattern=".txt",full.names=TRUE)
ldf <- lapply(files, read.table)
tuse <- seq(from=0,to=100,by=0.1)
for(files in ldf)
findR <- function(r){
with(files,(sum(exp(-r*age)*fecund*surv*0.1)-1)^2)
}
{
R0 <- with(files,(sum(fecund*surv*age)))
GenTime <- with(files,(sum(tuse*fecund*surv*0.1))/R0)
r <- optimize(f=findR,seq(-5,5,.0001),tol=0.00000001)$minimum
RV <- with(files,(exp(r*tuse)/surv)*(exp(-r*tuse)*(fecund*surv)))
plot(log(surv) ~ age,files,type="l")
tmp.lm <- lm(log(surv) ~ age + I(age^2),files) #Fit log surv to a quadratic
lines(files$age,predict(tmp.lm),col="red")
}
However, the problem is that it seems to only be performing the calculations contained in my "for" loop on one file, rather than all of them. I'd like it to perform the calculations on all of my files, then save all the files together as one big data frame so I can access the results of any particular set of my simulations. I suspect the error is that I'm not indexing the files correctly in order to loop over all of them.
How about using plyr::ldply() for this. It takes a list (in your case your list of files) and performs the same function on them and then returns a data frame.
The main thing to remember to do is create a column for the ID of each file you read in so you know which data comes from which file. The simplest way to do this is just to call it the file name and then you can edit it from there.
If you have additional arguments in your function they go after the function you want to use in ldply.
# create file list
files <- list.files(path="/Users/redheadmammoth/Desktop/Ultimate_Aging_F2016",
pattern=".txt",full.names=TRUE)
tuse <- seq(from=0,to=100,by=0.1)
load_and_edit <- function(file, tuse){
temp <- read.table(file)
# here put all your calculations you want to do on each file
temp$R0 <- sum(temp$fecund*temp$surv*temp*age)
# make a column for each file name so you know which data comes from which file
temp$id <- file
return(temp)
}
new_data <- plyr::ldply(list.files, load_and_edit, tuse)
This is the easiest way I have found to read in and wrangle multiple files in batch.
You can then plot each one really easily.

Delete row after row in for loop

I have a large character-vector file and I need to draw a random sample from it. This works fine. But I need to draw sample after sample. For that I want to shorten file by every element that is already drawn out of it (that I can draw a new sample without drawing the same element more than once).
I've got some solution, but I'm interested in anything else that might work faster and even more important, maybe correctly.
Here are my tries:
Approach 1
file <- rep(1:10000)
rand_no <- sample(file, 100)
library(car)
a <- data.frame()
for (i in 1:length(rand_no)){
a <- rbind(a, which.names(rand_no[i], file))
file <- file[-a[1,1]]
}
Problem:
Warning message:
In which.names(rand_no[i], file) : 297 not matched
Approach 2
file <- rep(1:10000)
rand_no <- sample(file, 100)
library(car)
deleter <- function(i) {
a <- which.names(rand_no[i], file)
file <- file[-a]
}
lapply(1:length(rand_no), deleter)
Problem:
This doesn't work at all. Maybe I should split the quesion, because the second problem clearly lies with me not fully understanding lapply.
Thanks for any suggestions.
Edit
I hoped that it will work with numbers, but of course file looks like this:
file <- c("Post-19960101T000000Z-1.tsv", "Post-19960101T000000Z-2.tsv", "Post-19960101T000000Z-3.tsv","Post-19960101T000000Z-4.tsv", "Post-19960101T000000Z-5.tsv", "Post-19960101T000000Z-6.tsv", "Post-19960101T000000Z-7.tsv","Post-19960101T000000Z-9.tsv")
Of course rand_no can't be over 100 files with such a small sample. Therefore:
rand_no <- sample(file, 2)
Use list instead of c. Then you can set the values to NULL and they will be removed.
file[file %in% rand_no] <- NULL This find all instances from rand_no in file and removes them.
file <- list("Post-19960101T000000Z-1.tsv",
"Post-19960101T000000Z-2.tsv",
"Post-19960101T000000Z-3.tsv",
"Post-19960101T000000Z-4.tsv",
"Post-19960101T000000Z-5.tsv",
"Post-19960101T000000Z-6.tsv",
"Post-19960101T000000Z-7.tsv",
"Post-19960101T000000Z-9.tsv")
rand_no <- sample(file, 2)
library(car) #From poster's code.
file[file %in% rand_no] <- NULL
If you are working with a large list of files, using %in% to compare strings may bog you down. In that case I would use indexes.
file <- list("Post-19960101T000000Z-1.tsv",
"Post-19960101T000000Z-2.tsv",
"Post-19960101T000000Z-3.tsv",
"Post-19960101T000000Z-4.tsv",
"Post-19960101T000000Z-5.tsv",
"Post-19960101T000000Z-6.tsv",
"Post-19960101T000000Z-7.tsv",
"Post-19960101T000000Z-9.tsv")
rand_no <- sample(1:length(file), 2)
library(car) #From poster's code.
file[rand_no] <- NULL
Sample() already returns values in a permuted order with no replacements (unless you set replace=T). So it will never pick a value twice.
So if you want three sets of 100 samples that don't share any elements, you can use
file <- rep(1:10000)
rand_no <- sample(seq_along(file), 300)
s1<-file[rand_no[1:100]]
s2<-file[rand_no[101:200]]
s3<-file[rand_no[201:300]]
Or if you wanted to decease the total size by 100 each time you could do
s1<-file[-rand_no[1:100]]
s2<-file[-rand_no[1:200]]
s3<-file[-rand_no[1:300]]
A simple approach would be to select random indices and then remove those indices:
file <- 1:10000 # Build sample data
ind <- sample(seq(length(file)), 100) # Select random indices
rand_no <- file[ind] # Compute the actual values selected
file <- file[-ind] # Remove selected indices
I think using sample and split could be a nice way of doing this, without having to alter your files variable. I'm not a big fan of mutation, unless you really need to, and this would let you know exactly which files you used for each chunk of the analysis going forward.
files<-paste("file",1:100,sep="_")
randfiles<-sample(files, 50)
randfiles_chunks<-split(randfiles,seq(1,length(randfiles), by=10))

Convert R read.csv to a readLines batch?

I have a fitted model that I'd like to apply to score a new dataset stored as a CSV. Unfortunately, the new data set is kind of large, and the predict procedure runs out of memory on it if I do it all at once. So, I'd like to convert the procedure that worked fine for small sets below, into a batch mode that processes 500 lines at a time, then outputs a file for each scored 500.
I understand from this answer (What is a good way to read line-by-line in R?) that I can use readLines for this. So, I'd be converting from:
trainingdata <- as.data.frame(read.csv('in.csv'), stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)
newdata <- as.data.frame(read.csv('newstuff.csv'), stringsAsFactors=F)
preds <- predict(fit,newdata)
write.csv(preds, file=filename)
to something like:
trainingdata <- as.data.frame(read.csv('in.csv'), stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)
con <- file("newstuff.csv", open = "r")
i = 0
while (length(mylines <- readLines(con, n = 500, warn = FALSE)) > 0) {
i = i+1
newdata <- as.data.frame(mylines, stringsAsFactors=F)
preds <- predict(fit,newdata)
write.csv(preds, file=paste(filename,i,'.csv',sep=''))
}
close(con)
However, when I print the mylines object inside the loop, it doesn't get auto-columned correctly the same way read.csv produces something that is---headers are still a mess, and whatever modulo column-width happens under the hood that wraps the vector into an ncol object isn't happening.
Whenever I find myself writing barbaric things like cutting the first row, wrapping the columns, I generally suspect R has a better way to do things. Any suggestions for how I can get a read.csv-like output form a readLines csv connection?
If you want to read your data into memory in chunks using read.csv by using the skip and nrows arguments. In pseudo-code:
read_chunk = function(start, n) {
read.csv(file, skip = start, nrows = n)
}
start_indices = (0:no_chunks) * chunk_size + 1
lapply(start_indices, function(x) {
dat = read_chunk(x, chunk_size)
pred = predict(fit, dat)
write.csv(pred)
}
Alternatively, you could put the data into an sqlite database, and use the sqlite package to query the data in chunks. See also this answer, or do some digging with [r] large csv on SO.

Resources