I am VERY new to R and am having a very difficult time getting an answer to this, so I finally caved to post - so apologies ahead of time.
I am using a genetic algorithm to optimize the shape of an object, and want to gather the intermediate steps for prototyping. The package I am using genalg, allows a monitor function to track the data which I can print just fine. But I'd like to stash it in a data frame for other uses and keep watching it overwrite the other iterations. Here's my code for the monitor function:
monitor <- function(obj){
#Make empty data frame in which to store data
resultlist <- data.frame(matrix(nrow = 200, ncol = 10, byrow = TRUE))
#If statement evaluating each iteration of algorithm
if (obj$iter > 0){
#Put results into list corresponding to number of iteration
resultlist[,obj$iter] <- obj$population[which.min(obj$best),]}
#Make data frame available at global level for prototyping, output, etc.
resultlistOutput <<- resultlist}
I know this works in a for loop with no issues based on searches, so I must be doing something wrong or the if syntax is not capable of this?
Sincere thanks in advance for your time.
Being not sure what error you are getting, I am guessing you are getting only the result from last iteration. This is happening because you are overwriting your global dataframe in each call to monitor function. You should first initialize resultlistOutput <<- data.frame() this way and then do this:
monitor <- function(obj){
#Make empty data frame in which to store data
resultlist <- data.frame(matrix(nrow = 200, ncol = 10, byrow = TRUE))
#If statement evaluating each iteration of algorithm
if (obj$iter > 0){
#Put results into list corresponding to number of iteration
resultlist[,obj$iter] <- obj$population[which.min(obj$best),]}
#Make data frame available at global level for prototyping, output, etc.
# append the dataframe to the old result
resultlistOutput <<- rbind(resultlistOutput , resultlist)
}
Related
I currently have a list which is made up of around 80+ data frames, what I would like to do is to loop a chunk of code for each individual data frame within the list, without naming each one individually, or splitting them into individual data frames to work on.
Currently I split the list into each individual data frame using the below code:
dat5split <- setNames(split(dat5, dat5$CODE), paste0("df", unique(dat5$CODE)))
list2env(dat5split, globalenv())
I then work through each data frame individually:
# call in SPC function and write to 'results10000'
results10000<-SPC_XBAR(df10000,vol_n,seasonality)
results10000 = results10000 %>%
cbind(Spec = df10000$CODE) %>%
subset(`table_n` == 1)
results10000 <- results10000[order(results10000$tpd),]
results10000$Date <- as.Date(cbind(Date = df10000$CENSUS_DATE))
# call in SPC function and write to 'results10001'
results10001<-SPC_XBAR(df10001,vol_n,seasonality)
results10001 = results10001 %>%
cbind(Spec = df10001$CODE) %>%
subset(`table_n` == 1)
results10001 <- results10001[order(results10001$tpd),]
results10001$Date <- as.Date(cbind(Date = df10001$CENSUS_DATE))
Currently I call in the function 'SPC_XBAR' to where vol_n and seasonality are set earlier in the code. The script then passes the values to the function which then assigns the results to 'results10000, results10001' etc etc. Upon which I do a small bit of data wrangling on each newly created data frame before feeding the results back into sql server at the end.
As you can see each one is being individually hard coded which is not efficient.
What I would like to do is to loop a chunk of code for each individual data frame within the list, without naming each one individually.
I believe a loop would solve this issue but I am a little inexperienced when it comes to the ability to create a loop around it. Any advice would be much appreciated.
Cheers
Have you considered using lapply instead of a loop throughout the list? Check it here...
EDIT: I try to elaborate a bit more... What happens if you do this:
myFunction <- function(x) {
results<-SPC_XBAR(x,vol_n,seasonality)
results = results %>%
cbind(Spec = x$CODE) %>%
subset(`table_n` == 1)
results <- results[order(results$tpd),]
results$Date <- as.Date(cbind(Date = x$CENSUS_DATE))
results
}
lapply(dat5split, myFunction)
I would expect it to return a list of the resulting datasets
I am currently working on an imputation project where I need to evaluate my methods of imputation. I have my incomplete dataframe with NAs from which I calculate the missing rate for every column/variable. My second data frame contains the complete cases which I extracted from the first data frame. I now want to simulate the missingness structure of the real data in the frame containing the complete cases. the data frame with the generated NAs get stored in the object "result" as you can see in the code. If I now want to replicate this code and thus generate 100 different data frames like "result", how do I replicate and save them separately?
I'm a beginner and would be really thankful for your answers!
I tried to put my loop which generates the NAs in another loop which contains the replicate() command and counts from 1:100 and saves these 100 replicated data frames but it didn't work at all.
result = data.frame(res0=rep(NA, dim(comp_cas)[1]))
for (i in 1:length(Z32_miss_item$miss_per_item)) {
dat = comp_cas[,i]
missRate = Z32_miss_item$miss_per_item[i]
cat (i, " ", paste0(dat, collapse=",") ," ", missRate, "!\n")
df <- data.frame("res"= GenMiss(x=dat, missrate = missRate), stringsAsFactors = FALSE)
colnames(df) = gsub("res", paste0("Var", i), colnames(df))
result = cbind(result, df)
}
result = result[,-1]
I expect that every data frame of the 100 runs get saved in a separate .rda file in my project folder.
also, is imputation and the evaluation of fitness of the latter beginner stuff in r or at what level of proficiency am I if you take a look at the code that I posted?
It is difficult to guess what exactly you are doing without some dummy data. But it is fine to have loops within loops and to save data.frames. Firstly, I would avoid the replicate function here as it has a strange syntax and just stick with plain loops. Secondly, you must make sure that the loops have different indexes (i.e. for(i ... should be surrounded by, say, for(j ... since functions can loop outside their scope in R. Finally, use saveRDS rather than save, as you can then have each object (data.frame) saved in separate .rds files. The save function is designed for saving your whole workspace so that you can pick up where you left off.
fun <- function(i){
df <- data.frame(x=rnorm(5))
names(df) <- paste0("x",i)
df
}
for(j in 1:100){
res <- data.frame(id=1:5)
for(i in 1:10){
res <- cbind(res, fun(i))
}
saveRDS(res, sprintf("replication_%s.rds",j))
}
I am trying to create a loop where I select one file name from a list of file names, and use that one file to run read.capthist and subsequently discretize, fit, derived, and save the outputs using save. The list contains 10 files of identical rows and columns, the only difference between them are the geographical coordinates in each row.
The issue I am running into is that capt needs to be a single file (in the secr package they are 'captfile' types), but I don't know how to select a single file from this list and get my loop to recognize it as a single entity.
This is the error I get when I try and select only one file:
Error in read.capthist(female[[i]], simtraps, fmt = "XY", detector = "polygon") :
requires single 'captfile'
I am not a programmer by training, I've learned R on my own and used stack overflow a lot for solving my issues, but I haven't been able to figure this out. Here is the code I've come up with so far:
library(secr)
setwd("./")
files = list.files(pattern = "female*")
lst <- vector("list", length(files))
names(lst) <- files
for (i in 1:length(lst)) {
capt <- lst[i]
femsimCH <- read.capthist(capt, simtraps, fmt = 'XY', detector = "polygon")
femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = 'proximity')
fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = 'HEX', method = 'BFGS', trace = FALSE, CL = TRUE)
save(fit, file="C:/temp/fit.Rdata")
D.fit <- derived(fit)
save(D.fit, file="C:/temp/D.fit.Rdata")
}
simtraps is a list of coordinates.
Ideally I would also like to have my outputs have unique identifiers as well, since I am simulating data and I will have to compare all the results, I don't want each iteration to overwrite the previous data output.
I know I can use this code by bringing in each file and running this separately (this code works for non-simulation runs of a couple data sets), but as I'm hoping to run 100 simulations, this would be laborious and prone to mistakes.
Any tips would be greatly appreciated for an R novice!
When I run this Loop I can print the results and I want to create a data frame with this data but I cant. Until now I have this:
filenames <- list.files(path=getwd())
numfiles <- length(filenames)
for (i in 1:numfiles) {
file <- read.table(filenames[i],header = TRUE)
ts = subset(file, file$name == "plantNutrientUptake")
tss = subset (ts, ts$path == "//plants/nitrate")
tssc = tss[,2:3]
d40 = tssc[41,2]
print(d40)
print(filenames[i])
}
This is not the most efficient way to do this, but it takes advantage of what code you've already written. First, you'll create an empty data frame with the columns you want, but filled with NA. Then, in each iteration of the loop, you'll fill one row of the data frame.
filenames <- list.files(path=getwd())
numfiles <- length(filenames)
# Create an empty data.frame
df <- data.frame(filename = rep(NA, numfiles), d40 = rep(NA, numfiles))
for (i in 1:numfiles){
file <- read.table(filenames[i],header = TRUE)
ts = subset(file, file$name == "plantNutrientUptake")
tss = subset (ts, ts$path == "//plants/nitrate")
tssc = tss[,2:3]
d40 = tssc[41,2]
# Fill row i of the data frame
df[i,"filename"] = filenames[i]
df[i,"d40"] = d40
}
Hope that does it! Good luck :)
There are a lot of ways to do what you are asking. Also, without a reproducible example it is difficult to validate that code will run. I couldn't tell what type of data was in each of your variable so I just guessed that they were mostly characters with one numeric. You'll need to change the code if that's not true.
The following method is using base R (no other packages). It builds off of what you have done. There are other ways to do this using map, do.call, or apply. But it's important to be able to run through a loop.
As someone commented, your code is just re-writing itself every loop. Luckily you have the variable i that you can use to specify where things go.
filenames <- list.files(path=getwd())
numfiles <- length(filenames)
# Declare an empty dataframe for efficiency purposes
df <- data.frame(
ts = rep(NA_character_,numfiles),
tss = rep(NA_character_,numfiles),
tssc = rep(NA_character_,numfiles),
d40 = rep(NA_real_,numfiles),
stringsAsFactors = FALSE
)
# Loop through the files and fill in the data
for (i in 1:numfiles){
file <- read.table(filenames[i],header = TRUE)
df$ts[i] <- subset(file, file$name == "plantNutrientUptake")
df$tss[i] <- subset (ts, ts$path == "//plants/nitrate")
df$tssc[i] <- tss[,2:3]
df$d40[i] <- tssc[41,2]
print(d40)
print(filenames[i])
}
You'll notice a few things about this code that are extra.
First, I'm declaring the variable type for each column explicitly. You can use rep(NA,numfiles) but that leave R to guess what the column should be. This may not be a problem for you if all of your variables are obviously of the same type. But imagine you have a variable a = c("1","A","B") of all characters. R will go through the first iteration of the loop and guess that the column is numeric. Then on the second run of the loop will crash when it runs into a character.
Next, I'm declaring the entire dataframe before entering the loop. When people tell you that loops in [modern] R are slow it is often because you are re-allocating memory every loop. By declaring the entire dataframe up front you speed up the loop significantly. This also allows you to reference any cell in the dataframe...which is exactly what you want to do in the loop.
Finally, I'm using the $ syntax to make things clear. Writing df[i,"d40"] <- d40 is the same as writing df$d40[i] <- d40. I just think it is clear to use the second method. This is a matter of personal preference.
I have a fitted model that I'd like to apply to score a new dataset stored as a CSV. Unfortunately, the new data set is kind of large, and the predict procedure runs out of memory on it if I do it all at once. So, I'd like to convert the procedure that worked fine for small sets below, into a batch mode that processes 500 lines at a time, then outputs a file for each scored 500.
I understand from this answer (What is a good way to read line-by-line in R?) that I can use readLines for this. So, I'd be converting from:
trainingdata <- as.data.frame(read.csv('in.csv'), stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)
newdata <- as.data.frame(read.csv('newstuff.csv'), stringsAsFactors=F)
preds <- predict(fit,newdata)
write.csv(preds, file=filename)
to something like:
trainingdata <- as.data.frame(read.csv('in.csv'), stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)
con <- file("newstuff.csv", open = "r")
i = 0
while (length(mylines <- readLines(con, n = 500, warn = FALSE)) > 0) {
i = i+1
newdata <- as.data.frame(mylines, stringsAsFactors=F)
preds <- predict(fit,newdata)
write.csv(preds, file=paste(filename,i,'.csv',sep=''))
}
close(con)
However, when I print the mylines object inside the loop, it doesn't get auto-columned correctly the same way read.csv produces something that is---headers are still a mess, and whatever modulo column-width happens under the hood that wraps the vector into an ncol object isn't happening.
Whenever I find myself writing barbaric things like cutting the first row, wrapping the columns, I generally suspect R has a better way to do things. Any suggestions for how I can get a read.csv-like output form a readLines csv connection?
If you want to read your data into memory in chunks using read.csv by using the skip and nrows arguments. In pseudo-code:
read_chunk = function(start, n) {
read.csv(file, skip = start, nrows = n)
}
start_indices = (0:no_chunks) * chunk_size + 1
lapply(start_indices, function(x) {
dat = read_chunk(x, chunk_size)
pred = predict(fit, dat)
write.csv(pred)
}
Alternatively, you could put the data into an sqlite database, and use the sqlite package to query the data in chunks. See also this answer, or do some digging with [r] large csv on SO.