I have a fitted model that I'd like to apply to score a new dataset stored as a CSV. Unfortunately, the new data set is kind of large, and the predict procedure runs out of memory on it if I do it all at once. So, I'd like to convert the procedure that worked fine for small sets below, into a batch mode that processes 500 lines at a time, then outputs a file for each scored 500.
I understand from this answer (What is a good way to read line-by-line in R?) that I can use readLines for this. So, I'd be converting from:
trainingdata <- read.csv('in.csv', stringsAsFactors = FALSE)
fit <- mymodel(Y ~ ., data = trainingdata)
newdata <- read.csv('newstuff.csv', stringsAsFactors = FALSE)
preds <- predict(fit, newdata)
write.csv(preds, file = filename)
to something like:
trainingdata <- read.csv('in.csv', stringsAsFactors = FALSE)
fit <- mymodel(Y ~ ., data = trainingdata)
con <- file("newstuff.csv", open = "r")
i <- 0
while (length(mylines <- readLines(con, n = 500, warn = FALSE)) > 0) {
  i <- i + 1
  newdata <- as.data.frame(mylines, stringsAsFactors = FALSE)
  preds <- predict(fit, newdata)
  write.csv(preds, file = paste(filename, i, '.csv', sep = ''))
}
close(con)
However, when I print the mylines object inside the loop, it isn't parsed into columns the way read.csv output is: the header row is just another string, and the splitting that turns each line into the columns of a data frame never happens.
Whenever I find myself writing barbaric things like cutting off the first row and wrapping the columns by hand, I generally suspect R has a better way to do things. Any suggestions for how I can get a read.csv-like output from a readLines CSV connection?
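One way to get read.csv-style parsing out of lines that readLines has already pulled in is to feed them back through read.csv via its text argument. A rough sketch (it assumes the header is the first line of the file, so it is read once before the loop):
con <- file("newstuff.csv", open = "r")
header <- readLines(con, n = 1)  # consume the header once
i <- 0
while (length(mylines <- readLines(con, n = 500, warn = FALSE)) > 0) {
  i <- i + 1
  # Re-attach the header so read.csv names the columns on every chunk
  newdata <- read.csv(text = paste(c(header, mylines), collapse = "\n"),
                      stringsAsFactors = FALSE)
  preds <- predict(fit, newdata)
  write.csv(preds, file = paste(filename, i, '.csv', sep = ''))
}
close(con)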
You can read your data into memory in chunks with read.csv, using its skip and nrows arguments. In pseudo-code:
read_chunk <- function(file, start, n) {
  # header = FALSE because, after the first chunk, skip lands mid-data
  read.csv(file, skip = start, nrows = n, header = FALSE)
}

start_indices <- (0:no_chunks) * chunk_size + 1
lapply(start_indices, function(x) {
  dat <- read_chunk(file, x, chunk_size)
  pred <- predict(fit, dat)
  write.csv(pred, file = paste0('preds', x, '.csv'))
})
Alternatively, you could put the data into an sqlite database, and use the RSQLite package to query the data in chunks. See also this answer, or do some digging with [r] large csv on SO.
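For example, a rough sketch of the database route, assuming the DBI and RSQLite packages (the database and table names are illustrative, and the one-time load shown here would itself need chunking for a truly huge file):
library(DBI)
db <- dbConnect(RSQLite::SQLite(), "newstuff.sqlite")
# One-time load of the CSV into the database
dbWriteTable(db, "newstuff", read.csv("newstuff.csv"))
res <- dbSendQuery(db, "SELECT * FROM newstuff")
i <- 0
while (!dbHasCompleted(res)) {
  i <- i + 1
  chunk <- dbFetch(res, n = 500)  # pull 500 rows at a time
  preds <- predict(fit, chunk)
  write.csv(preds, file = paste0("preds_", i, ".csv"))
}
dbClearResult(res)
dbDisconnect(db)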
Related
So I have a large CSV file that my computer cannot open without RStudio terminating.
To solve this I am trying to iterate through the rows of the file, doing my calculations on one row at a time, storing the value, and then moving on to the next row.
I can normally achieve this (e.g. on a smaller file) by simply reading and storing the whole CSV file within RStudio and running a simple for loop.
It is, however, exactly that in-memory storage that I am trying to avoid, hence I am trying to read the rows of the CSV file one at a time instead.
(I think that makes sense)
This was suggested here.
I have managed to get my calculations to be read and work quickly for the first row of my data file.
It is the looping over this that I am struggling with: I am trying to use a for loop (though potentially I should be using a while/if statement), but I have nowhere for the "i" value to be drawn from within the loop. Part of my code is below:
con = file(FileName, "r")
for (row in 1:nrow(con)) {
  data <- read.csv(con, nrow=1) #reading of file
  "insert calculations here"
}
So the "row" is not called upon so the loop only goes through once. I also have an issue with the "1:nrow(con)" as clearly the nrow(con) simply returns NULL
Any help with this would be great,
thanks.
read.csv() will generate an error if it tries to read past the end of the file. So you could do something like this:
con <- file(FileName, "rt")
repeat {
  data <- try(read.csv(con, nrows = 1, header = FALSE), silent = TRUE) #reading of file
  if (inherits(data, "try-error")) break
  "insert calculations here"
}
close(con)
It will be really slow going one line at a time, but you can do it in larger batches if your calculation code supports that. And I'd recommend specifying the column types using colClasses in the read.csv() call, so that R doesn't guess a different type for the same column in different batches.
Edited to add:
We've been told that there are 3000 columns of integers in the dataset. The first row only has partial header information. This code can deal with that:
n <- 1                           # desired batch size
col.names <- paste0("C", 1:3000) # desired column names

con <- file(FileName, "rt")
readLines(con, 1)  # skip over bad header row
repeat {
  data <- try(read.csv(con, nrows = n, header = FALSE,
                       col.names = col.names,
                       colClasses = "integer"),
              silent = TRUE) #reading of file
  if (inherits(data, "try-error")) break
  "insert calculations here"
}
close(con)
You could read in your data in batches of, say, 10,000 rows at a time (but you can change n to do as much as you want), do your calculations and then write the changes to a new file, appending each batch to the end of the file.
Something like:
i <- 0
n <- 10000
# Read the header once so every chunk can reuse the column names; otherwise
# each later pass would swallow a data row as its header
cnames <- names(readr::read_csv('my_file.csv', n_max = 0))
while (TRUE) {
  # skip = i + 1 skips the header line plus the rows already processed
  df <- readr::read_csv('my_file.csv', skip = i + 1, n_max = n,
                        col_names = cnames)
  # If the number of rows in the file is divisible by n, it may be the case
  # that the next pass will result in an empty data.frame being returned
  if (nrow(df) > 0) {
    # do your calculations
    # If you have performed calculations on df and want to save those results,
    # save the data.frame to a file, appending to avoid overwriting prior
    # results (the first pass writes the header, later passes append)
    readr::write_csv(df, 'my_new_file.csv', append = i > 0)
  } else {
    break
  }
  # Check to see if we need to keep going; if so, add n to i
  if (nrow(df) < n) {
    break
  } else {
    i <- i + n
  }
}
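As an aside, newer versions of readr also ship a helper built for exactly this read-process-write pattern. A sketch, assuming the same illustrative file names (pos is the row number of the first row in each chunk):
readr::read_csv_chunked(
  'my_file.csv',
  readr::SideEffectChunkCallback$new(function(chunk, pos) {
    # do your calculations on `chunk`, then append the results;
    # the first chunk (pos == 1) writes the header
    readr::write_csv(chunk, 'my_new_file.csv', append = pos > 1)
  }),
  chunk_size = 10000
)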
When I run this loop I can print the results, and I want to create a data frame with this data, but I can't. Until now I have this:
filenames <- list.files(path=getwd())
numfiles <- length(filenames)
for (i in 1:numfiles) {
  file <- read.table(filenames[i], header = TRUE)
  ts = subset(file, file$name == "plantNutrientUptake")
  tss = subset(ts, ts$path == "//plants/nitrate")
  tssc = tss[,2:3]
  d40 = tssc[41,2]
  print(d40)
  print(filenames[i])
}
This is not the most efficient way to do this, but it takes advantage of what code you've already written. First, you'll create an empty data frame with the columns you want, but filled with NA. Then, in each iteration of the loop, you'll fill one row of the data frame.
filenames <- list.files(path=getwd())
numfiles <- length(filenames)
# Create an empty data.frame
df <- data.frame(filename = rep(NA, numfiles), d40 = rep(NA, numfiles))
for (i in 1:numfiles){
  file <- read.table(filenames[i], header = TRUE)
  ts = subset(file, file$name == "plantNutrientUptake")
  tss = subset(ts, ts$path == "//plants/nitrate")
  tssc = tss[,2:3]
  d40 = tssc[41,2]
  # Fill row i of the data frame
  df[i,"filename"] = filenames[i]
  df[i,"d40"] = d40
}
Hope that does it! Good luck :)
There are a lot of ways to do what you are asking. Also, without a reproducible example it is difficult to validate that code will run. I couldn't tell what type of data was in each of your variables, so in the declarations below I guessed character for the file name and numeric for the extracted value. You'll need to change the code if that's not true.
The following method is using base R (no other packages). It builds off of what you have done. There are other ways to do this using map, do.call, or apply. But it's important to be able to run through a loop.
As someone commented, your code just overwrites its results on every pass through the loop. Luckily you have the loop variable i that you can use to specify where things go.
filenames <- list.files(path=getwd())
numfiles <- length(filenames)

# Declare an empty dataframe for efficiency purposes
df <- data.frame(
  filename = rep(NA_character_, numfiles),
  d40 = rep(NA_real_, numfiles),
  stringsAsFactors = FALSE
)

# Loop through the files and fill in the data
for (i in 1:numfiles){
  file <- read.table(filenames[i], header = TRUE)
  # The intermediate subsets stay as ordinary local variables;
  # only the scalar result and the file name go into the dataframe
  ts <- subset(file, file$name == "plantNutrientUptake")
  tss <- subset(ts, ts$path == "//plants/nitrate")
  tssc <- tss[,2:3]
  df$filename[i] <- filenames[i]
  df$d40[i] <- tssc[41,2]
  print(df$d40[i])
  print(filenames[i])
}
You'll notice a few things about this code that are extra.
First, I'm declaring the variable type for each column explicitly. You can use rep(NA, numfiles), but that leaves R to guess what the column should be. This may not be a problem for you if all of your variables are obviously of the same type. But imagine you have a variable a = c("1","A","B") of all characters. R will go through the first iteration of the loop and guess that the column is numeric, then on the second run of the loop it will crash when it runs into a character.
Next, I'm declaring the entire dataframe before entering the loop. When people tell you that loops in [modern] R are slow it is often because you are re-allocating memory every loop. By declaring the entire dataframe up front you speed up the loop significantly. This also allows you to reference any cell in the dataframe...which is exactly what you want to do in the loop.
Finally, I'm using the $ syntax to make things clear. Writing df[i,"d40"] <- d40 is the same as writing df$d40[i] <- d40. I just think it is clear to use the second method. This is a matter of personal preference.
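For completeness, here is what the apply route mentioned above might look like with vapply. Just a sketch, assuming the same filenames vector and file structure as before:
d40s <- vapply(filenames, function(f) {
  file <- read.table(f, header = TRUE)
  ts <- subset(file, file$name == "plantNutrientUptake")
  tss <- subset(ts, ts$path == "//plants/nitrate")
  tss[41, 3]  # the same cell as tssc[41, 2] in the loop version
}, numeric(1))
df <- data.frame(filename = filenames, d40 = d40s, stringsAsFactors = FALSE)
Like the declared dataframe, vapply pre-allocates its result, so it carries the same efficiency benefit as the explicit loop above.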
I am VERY new to R and am having a very difficult time getting an answer to this, so I finally caved to post - so apologies ahead of time.
I am using a genetic algorithm to optimize the shape of an object, and want to gather the intermediate steps for prototyping. The package I am using, genalg, allows a monitor function to track the data, which I can print just fine. But when I try to stash it in a data frame for other uses, I keep watching each iteration overwrite the previous one. Here's my code for the monitor function:
monitor <- function(obj){
  # Make empty data frame in which to store data
  resultlist <- data.frame(matrix(nrow = 200, ncol = 10, byrow = TRUE))
  # If statement evaluating each iteration of algorithm
  if (obj$iter > 0){
    # Put results into list corresponding to number of iteration
    resultlist[, obj$iter] <- obj$population[which.min(obj$best), ]
  }
  # Make data frame available at global level for prototyping, output, etc.
  resultlistOutput <<- resultlist
}
Based on my searches I know this pattern works in a for loop with no issues, so I must be doing something wrong, or is the if syntax not capable of this?
Sincere thanks in advance for your time.
Not being sure what error you are getting, I am guessing you are getting only the result from the last iteration. This is happening because you are overwriting your global dataframe in each call to the monitor function. You should first initialize it with resultlistOutput <<- data.frame() and then do this:
monitor <- function(obj){
  # Make empty data frame in which to store data
  resultlist <- data.frame(matrix(nrow = 200, ncol = 10, byrow = TRUE))
  # If statement evaluating each iteration of algorithm
  if (obj$iter > 0){
    # Put results into list corresponding to number of iteration
    resultlist[, obj$iter] <- obj$population[which.min(obj$best), ]
  }
  # Append the dataframe to the old result instead of overwriting it
  resultlistOutput <<- rbind(resultlistOutput, resultlist)
}
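If the extra NA-filled rows from rbind-ing a fresh 200 x 10 frame on every call are unwanted, a leaner variant (again just a sketch, assuming at most 10 iterations as in the matrix above) is to pre-allocate the global once and have monitor fill one column per iteration:
# Pre-allocate once, before calling genalg's rbga/rbga.bin
resultlistOutput <- data.frame(matrix(nrow = 200, ncol = 10))
monitor <- function(obj){
  if (obj$iter > 0){
    # Fill the column for this iteration in the global data frame
    resultlistOutput[, obj$iter] <<- obj$population[which.min(obj$best), ]
  }
}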
I am trying to write an input file that requires a single line in the first row telling if the file is sparse and if so how many variable levels there are. I know how to append a single line to the end of a file, but can't find a way to append to the first line of a file. Any suggestions?
library(e1071)
library(caret)
library(Matrix)
library(SparseM)
iris2 <- iris
iris2$sepalOver5 <- ifelse(iris2$Sepal.Length >= 5, 1, -1)
head(iris2)
summary(iris2)
trainRows <- sample(1:nrow(iris2), nrow(iris2) * .66, replace = F)
testRows <- which(!(1:nrow(iris2) %in% trainRows))
sum(testRows %in% trainRows)
sum(trainRows %in% testRows)
vtu1 <- c('Sepal.Width','Petal.Length','Petal.Width','Species')
dv1 <- dummyVars( ~., data = iris2[,vtu1], sparse = T)
train <- iris2[trainRows,]
test <- iris2[testRows,]
trainX <- as.matrix.csr(predict(dv1, train))
testX <- as.matrix.csr(predict(dv1, test))
trainY <- train[,'sepalOver5']
testY <- test[,'sepalOver5']
write.matrix.csr( as(trainX , "matrix.csr"), file= "amz.train" , fac = TRUE)
headString <- paste('sparse ', max(trainX@ja), sep = '')
I'd basically like to insert headString into amz.train as the first row. Any suggestions?
It is generally not possible to prepend to the start of a file, and any way of doing so is really inefficient, since everything after the insertion point has to be rewritten. This holds for any programming language.
Three options come to mind:
Read in the file, write the other information first, followed by the rest of the content of the file (might also be inefficient; see the sketch after this list)
Write the information you want to prepend first
In the case you have a writer that cannot append (write.matrix for instance has no append option), you could try to merge this meta information with the data frame, and then writing it as a whole.
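A minimal sketch of the first option, assuming amz.train has already been written and headString is defined as above:
# Read the whole file back, then rewrite it with the meta line first
# (simple, but it does rewrite the entire file)
body <- readLines("amz.train")
writeLines(c(headString, body), "amz.train")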
Since you are using a specialized format, I wouldn't recommend storing this meta-information this way.
Your file would look like:
sparse 6
1:3 2:5.2 3:2 6:1
1:3.7 2:1.5 3:0.2 4:1
1:3.2 2:6 3:1.8 6:1
And then there is option 4:
Rather, consider having a meta file which contains information such as the file name, whether it is sparse or not, and the number of levels. To that file you could append, and if you repeat this process it would be preferable: it avoids the problems of reading in weirdly formatted files.
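Such a meta file could be as simple as this sketch (the file name and columns are illustrative):
meta <- data.frame(file = "amz.train",
                   sparse = TRUE,
                   levels = max(trainX@ja))
# write.table rather than write.csv, so the header can be suppressed
# when appending a row for each new training file
new_file <- !file.exists("training_meta.csv")
write.table(meta, "training_meta.csv", sep = ",", row.names = FALSE,
            col.names = new_file, append = !new_file)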
I am working on a similar issue to the one described in this other posting, and have tried adapting the code to select the columns I am interested in and to fit my data file.
My issue, however, is that the resulting file has become larger than the original one, and I'm not sure the code is working the way I intended.
When I open it with SPSS, the dataset seems to have taken in the header line and then made endless copies of the second line (I had to force stop the process).
I noticed there's no counter in the while loop specifying the line; might this be the problem? My background in programming with R is very limited. The file is a .csv of 4.8GB with 329 variables and millions of rows, and I only need to keep around 30 of the variables.
This is the code I used:
##Open separate connections to hold cursor position
file.in <- file('npidata_20050523-20130707.csv', 'rt')
file.out <- file('Mainoutnpidata.txt', 'wt')
line <- readLines(file.in, n=1)
line.split <- strsplit(line, ',')
##Column picking
cat(line.split[[1]][1:11], line.split[[1]][23:25], line.split[[1]][31:33], line.split[[1]][308:311], sep = ",", file = file.out, fill = TRUE)
##Use a loop to read in the rest of the lines
line <- readLines(file.in, n=1)
while (length(line)){
  line.split <- strsplit(line, ',')
  if (length(line.split[[1]]) > 1) {
    cat(line.split[[1]][1:11], line.split[[1]][23:25], line.split[[1]][31:33], line.split[[1]][308:311], sep = ",", file = file.out, fill = TRUE)
  }
}
close(file.in)
close(file.out)
One thing wrong that jumps out is that you are missing a line <- readLines(file.in, n=1) at the end of your while loop. You are now stuck in an infinite loop that keeps writing the same line. Also, reading only one line at a time is going to be terribly slow.
If in your file (unlike the one in the example you linked to) every row contains the same number of columns, you could use my LaF package. This should result in something along the lines of:
library(LaF)
m <- detect_dm_csv("npidata_20050523-20130707.csv", header=TRUE)
laf <- laf_open(m)
begin(laf)
con <- file("Mainoutnpidata.txt", 'wt')
while(TRUE) {
d <- next_block(laf, columns = c(1:11, 23:25, 31:33, 308:311))
if (nrow(d) == 0) break;
write.csv(d, file=con, row.names=FALSE, header=FALSE)
}
close(con)
close(laf)
If your 30 columns fit into memory you could even do:
library(LaF)
m <- detect_dm_csv("npidata_20050523-20130707.csv", header=TRUE)
laf <- laf_open(m)
d <- laf[, c(1:11, 23:25, 31:33, 308:311)]
close(laf)
I couldn't test the code above on your file, so can't guarantee there are no errors (let me know if there are).