Loop to dynamically fill dataframe R - r

I am running a for loop to fill dynamically a dataframe (I know a baby seal dies somewhere because I am using a for loop)
I have something like this in mind (the 5 is a placeholder for a function that returns a scalar):
results<-data.frame(matrix(NA, nrow = length(seq(1:10)), ncol =
length(seq(1:10))))
rows<-data.frame(matrix(NA, nrow = 1, ncol = 1))
for (j in seq(1:10)){
rows<-data.frame()
for (i in seq(1:10)){
rows<-cbind(rows,5)
}
results<-cbind(results,rows)
}
I get the following error message with my approach above.
Error in match.names(clabs, names(xi)) :
names do not match previous names
Is there an easier way?

Dynamically filling an object using a for loop is fine - what causes problems is when you dynamically build an object using a for loop (e.g. using cbind and rbind rows).
When you build something dynamically, R has to go and request new memory for the object in each loop, because it keeps increasing in size. This causes a for loop to slow down with every iteration as the object gets bigger.
When you create the object beforehand (e.g. a data.frame with the right number of rows and columns), and fill it in by index, the for loop doesn't have this problem.
One final thing to keep in mind is that for data.frames (and matrices) each column is stored as a vector in memory – so its usually more efficient to fill these in one column at a time.
With all that in mind we can revise your code as follows:
results <- data.frame(matrix(NA, nrow = length(seq(1:10)),
ncol = length(seq(1:10))))
for (rowIdx in 1:nrow(results)) {
for (colIdx in 1:ncol(results)) {
results[rowIdx, colIdx] <- 5 # or whatever value you want here
}
}

Not sure what is your intention. Now keeping your intention and way of implementation a way to fix the problem to change for-loop so that rows is initialized with 1st value. The second for-loop should be from seq(2:10).
The error is occurring because attempting to cbind a blank data.frame with valid value.
for (j in seq(1:10)){
rows<-data.frame(5) #Initialization with 1st value
for (i in seq(2:10)){ #Loop 2nd on wards.
rows<-cbind(rows,5)
}
results<-cbind(results,rows)
}

Related

How do I run a for loop over all columns of a data frame and return the result as a separate data frame or matrix

I am trying to obtain the number of cases for each variable in a df. There are 275 cases in the df but most columns have some missing data. I am trying to run a for loop to obtain the information as follows:
idef_id<-readxl::read_xlsx("IDEF.xlsx")
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(i))
275-nas
}
however the output for casenums is
> summary(casenums)
Length Class Mode
0 NULL NULL
Any help would be much appreciated!
A for loop isn't a function - it doesn't return anything, so x <- for(... doesn't ever make sense. You can do that with, e.g., sapply, like this
casenums <- sapply(idef_id, function(x) sum(!is.na(x)))
Or you can do it in a for loop, but you need to assign to a particular value inside the loop:
casenums = rep(NA, ncol(idef_id))
names(casenums) = names(idef_id)
for(i in names(idef_id)) {
casenums[i] = sum(!is.na(idef_id[[i]]))`
}
You also had a problem that i is taking on column names, so sum(is.na(i)) is asking if the value of the column name is missing. You need to use idef_id[[i]] to access the actual column, not just the column name, as I show above.
You seem to want the answer to be the number of non-NA values, so I switched to sum(!is.na(...)) to count that directly, rather than hard-coding the number of rows of the data frame and doing subtraction.
The immediate fix for your for loop is that your i is a column name, not the data within. On your first pass through the for loop, your i is class character, always length 1, so sum(is.na(i)) is going to be 0. Due to how frames are structured, there is very little likelihood that a name is NA (though it is possible ... with manual subterfuge).
I suggest a literal fix for your code could be:
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(idef_id[[i]]))
275-nas
}
But this has the added problem that for loops don't return anything (as Gregor's answer also discusses). For the sake of walking through things, I'll keep that (for the first bullet), and then fix it (in the second):
Two things:
hard-coding 275 (assuming that's the number of rows in the frame) will be problematic if/when your data ever changes. Even if you're "confident" it never will ... I still recommend not hard-coding it. If it's based on the number of rows, then perhaps
OUT_OF <- 275 # should this be nrow(idef_id)?
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(idef_id[[i]]))
OUT_OF - nas
}
at least in a declarative sense, where the variable name (please choose something better) is clear as to how you determined 275 and how (if necessary) it should be fixed in the future.
(Or better, use Gregor's logic of sum(!is.na(...)) if you just need to count not-NA.)
doing something for each column of a frame is easily done using sapply or lapply, perhaps
OUT_OF <- 275 # should this be nrow(idef_id)?
OUT_OF - sapply(idef_id, function(one_column) sum(is.na(one_column)))
## or
sapply(idef_id, function(one_column) OUT_OF - sum(is.na(one_column)))

I have a question about writing loops in R

I have two databases, one includes 2283 rows of products information (USDA) and the second one is 621 flavor type of products (Flavor). I wanted to use grepl code to recognize the flavor in each row of my first dataset. However, I do not want to write the code for each flavor one by one. Therefore, I decided to write a loop. However, my loop is not showing multiple columns of results for each flavor check. Instead, it is showing the result of the last match. Would you please help me with this problem?
for (i in 2:length(Flavor$Flavor_names){
result <- cbind(USDA, Flavor=grepl(paste0(Flavor_names$FLAVOR.SCENT[i], collapse="|") , USDA$long_name)))
Before starting a loop like this, you need to create an empty object to fill up with all the results of the loop. Such as result <- NULL. Second, when you run the loop, index the output object as you do the input objects, like result[i]. Your loop would look like:
result <- NULL
for (i in 2:length(Flavor$Flavor_names){
result[i] <- cbind(USDA, Flavor=grepl(paste0(Flavor_names$FLAVOR.SCENT[i], collapse="|") , USDA$long_name)))
}
Now results is length i, and in positions 2 through length(Flavor$Flavor_names) you have the results of the i-th loop. Note that index 1 will be NA, because you started your loop with 2. You could avoid this if your loop had contained result[i - 1] instead.

Storing matrix after every iteration

I have following code.
for(i in 1:100)
{
for(j in 1:100)
R[i,j]=gcm(i,j)
}
gcm() is some function which returns a number based on the values of i and j and so, R has all values. But this calculation takes a lot of time. My machine's power was interrupted several times due to which I had to start over. Can somebody please help, how can I save R somewhere after every iteration, so as to be safe? Any help is highly appreciated.
You can use the saveRDS() function to save the result of each calculation in a file.
To understand the difference between save and saveRDS, here is a link I found useful. http://www.fromthebottomoftheheap.net/2012/04/01/saving-and-loading-r-objects/
If you want to save the R-workspace have a look at ?save or ?save.image (use the first to save a subset of your objects, the second one to save your workspace in toto).
Your edited code should look like
for(i in 1:100)
{
for(j in 1:100)
R[i,j]=gcm(i,j)
save.image(file="path/to/your/file.RData")
}
About your code taking a lot of time I would advise trying the ?apply function, which
Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix
You want gmc to be run for-each cell, which means you want to apply it for each combination of row and column coordinates
R = 100; # number of rows
C = 100; # number of columns
M = expand.grid(1:R, 1:C); # Cartesian product of the coordinates
# each row of M contains the indexes of one of R's cells
# head(M); # just to see it
# To use apply we need gmc to take into account one variable only (that' not entirely true, if you want to know how it really works have a look how at ?apply)
# thus I create a function which takes into account one row of M and tells gmc the first cell is the row index, the second cell is the column index
gmcWrapper = function(x) { return(gmc(x[1], x[2])); }
# run apply which will return a vector containing *all* the evaluated expressions
R = apply(M, 1, gmcWrapper);
# re-shape R into a matrix
R = matrix(R, nrow=R, ncol=C);
If the apply-approach is again slow try considering the snowfall package which will allow you to follow the apply-approach using parallel computing. An introduction to snowfall usage can be found in this pdf, look at page 5 and 6 in particular

Reading variable name of dataframe into another dataframe using loops

So using a for loop I was able to break my 1.1 million row dataset in r into 110 tables of approximately 10,000 rows each in hopes of getting r to handle the data better. I now want to run another for loop that assigns the values in each of these tables to a different dataframe name.
My table names are:
Pom_1
Pom_2
Pom_3
...
Pom_110
What I want to do is create a for loop like the following:
for (i in 1:110)
{
Pom <- read.table(paste("Pom",i,sep = "_"))
for (j in 1:nrows(Pom))
{do something}
}
So I want to loop through the array and assign the values of each Pom table to "Pom" so that I can then run a for loop on each subsection of Pom. This problem is the read.table function does not seem to be the right one. Any ideas?
Can you give a more specific example of what you want to do withing each dataframe? You should avoid using the inner loop when possible and if you really need to have a look at ?apply
nrow instead of nrows
This is a generic solution using a example data.frame. The function you're looking for is assign, check it's help page:
Pom = data.frame(x = rnorm(30)) #original data.frame
n.tables = 3 # number of new data.frames you want to creat
Pom.names = paste("Pom",1:3,sep="") # name of all new data.frames
breaks = nrow(Pom)/n.tables * 0:n.tables # breaks of the original data.frame
for (i in 1:n.tables) {
rows = (breaks[i]+1):breaks[i+1] # which rows from Pom are going to be assign to the new data.frame?
assign(Pom.names[i],Pom[rows,]) # create new data.frame
}
ls()
[1] "breaks" "i" "n.tables" "Pom" "Pom.names" "Pom1"
[7] "Pom2" "Pom3" "rows"
I'm willing to bet the problem with your table call is that you aren't specifying the file extension (assuming Pom_1 - Pom_110 are files in your working directory, which I think they are since you're using read.table).
You can fix it by the following
fileExtension<-".xls" #specify your extension, I assume xls
for (i in 1:110)
{
tablename<-paste("Pom",i,sep = "_")
Pom <- read.table(paste(tablename, fileExtension, sep=""))
for (j in 1:nrows(Pom))
{do something}
}
Of course that's assuming a couple things about how everything in your problem is set up, but it's my best guess based on your description and code

is.na() in R for loop not quite understood

I am confused by the behavior of is.na() in a for loop in R.
I am trying to make a function that will create a sequence of numbers, do something to a matrix, summarize the resulting matrix based on the sequence of numbers, then modify the sequence of numbers based on the summary and repeat. I made a simple version of my function because I think it still gets at my problem.
library(plyr)
test <- function(desired.iterations, max.iterations)
{
rich.seq <- 4:34 ##make a sequence of numbers
details.table <- matrix(nrow=length(rich.seq), ncol=1, dimnames=list(rich.seq))
##generate a table where the row names are those numbers
print(details.table) ##that's what it looks like
temp.results <- matrix(nrow=10, ncol=2, dimnames=list(1:10))
##generate some sample data to summarize and fill into details.table
temp.results[,1] <- rep(5:6, 5)
temp.results[,2] <- rnorm(10)
print(temp.results) ##that's what it looks like
details.table[,1][row.names(details.table) %in% count(temp.results[,1])$x] <-
count(temp.results[,1])$freq
##summarize, subset to the appropriate rows in details.table, and fill in the summary
print(details.table)
for (i in 1:max.iterations)
{
rich.seq <- rich.seq[details.table < desired.iterations | is.na(details.table)]
## the idea would be to keep cutting this sequence of numbers down with
## successive iterations until the desired number of iterations per row in
## details.table was reached. in other words, in the real code i'd do
## something to details.table in the next line
print(rich.seq)
}
}
##call the function
test(desired.iterations=4, max.iterations=2)
On the first run through the for loop the rich.seq looks like I'd expect it to, where 5 & 6 are no longer in the sequence because both ended up with more than 4 iterations. However, on the second run, it spits out something unexpected.
UPDATE
Thanks for your help and also my apologies. After re-reading my original post it is not only less than clear, but I hadn't realized count was part of the plyr package, which I call in my full function but wasn't calling here. I'll try and explain better.
What I have working at the moment is a function that takes a matrix, randomizes it (in any of a number of different ways), then calculates some statistics on it. These stats are temporarily stored in a table--temp.results--where temp.results[,1] is the sum of the non zero elements in each column, and temp.results[,2] is a different summary statistic for that column. I save these results to a csv file (and append them to the same file at subsequent iterations), because looping through it and rbinding hogs a lot of memory.
The problem is that certain column sums (temp.results[,1]) are sampled very infrequently. In order to sample those sufficiently requires many many iterations, and the resulting .csv files would stretch into the hundreds of gigabytes.
What I want to do is create and then update a table (details.table) at each iteration that keeps track of how many times each column sum actually got sampled. When a given element in the table reaches the desired.iterations, I want it to be excluded from the vector rich.seq, so that only columns that haven't received the desired.iterations are actually saved to the csv file. The max.iterations argument will be used in a break() statement in case things are taking too long.
So, what I was expecting in the example case is the exact same line for rich.seq for both iterations, since I didn't actually do anything to change it. I believe that flodel is definitely right that my problem lies in comparing a matrix (details.table) of length longer than rich.seq, leading to unexpected results. However, I don't want the dimensions of details.table to change. Perhaps I can solve the problem implementing %in% somehow when I redefine rich.seq in the for loop?
I agree you should improve your question. However, I think I can spot what is going wrong.
You compute details.table before the for loop. It is a matrix with same length as rich.seq when it was first initialized (length(4:34), i.e. 31).
Inside the for loop, details.table < desired.iterations | is.na(details.table) is then a logical vector of length 31. On the first loop iteration,
rich.seq <- rich.seq[details.table < desired.iterations | is.na(details.table)]
will result in reducing the length of rich.seq. But on the second loop iteration, unless details.table is redefined (not the case), you are trying to subset rich.seq by a logical vector of longer length than rich.seq. This will certainly lead to unexpected results.
You probably meant to redefine details.table as part of your for loop.
(Also I am surprised to see you never used temp.results[,2].)
Thanks to flodel for setting me off on the right track. It had nothing to do with is.na but rather the lengths of vectors I was comparing.
That said, I set the initial values of the details.table to zero to avoid the added complexity of the is.na statement.
This code works, and can be modified to do what I described above.
library(plyr)
test <- function(desired.iterations, max.iterations)
{
rich.seq <- 4:34 ##make a sequence of numbers
details.table <- matrix(nrow=length(rich.seq), ncol=1, dimnames=list(rich.seq)) ##generate a table where the row names are those numbers
details.table[,1] <- 0
print(details.table) ##that's what it looks like
temp.results <- matrix(nrow=10, ncol=2, dimnames=list(1:10)) ##generate some sample data to summarize and fill into details.table
temp.results[,1] <- rep(5:6, 5)
temp.results[,2] <- rnorm(10)
print(temp.results) ##that's what it looks like
details.table[,1][row.names(details.table) %in% count(temp.results[,1])$x] <- count(temp.results[,1])$freq ##summarize, subset to the appropriate rows in details.table, and fill in the summary
print(details.table)
for (i in 1:max.iterations)
{
rich.seq <- row.names(details.table)[details.table[,1] < desired.iterations]
print(rich.seq)
}
}
Rather than trying to cut down the rich.seq I just redefine it every iteration based on whatever happens with details.table during the previous iteration.

Resources