is.na() in R for loop not quite understood

I am confused by the behavior of is.na() in a for loop in R.
I am trying to make a function that will create a sequence of numbers, do something to a matrix, summarize the resulting matrix based on the sequence of numbers, then modify the sequence of numbers based on the summary and repeat. I made a simple version of my function because I think it still gets at my problem.
library(plyr)
test <- function(desired.iterations, max.iterations)
{
rich.seq <- 4:34 ##make a sequence of numbers
details.table <- matrix(nrow=length(rich.seq), ncol=1, dimnames=list(rich.seq))
##generate a table where the row names are those numbers
print(details.table) ##that's what it looks like
temp.results <- matrix(nrow=10, ncol=2, dimnames=list(1:10))
##generate some sample data to summarize and fill into details.table
temp.results[,1] <- rep(5:6, 5)
temp.results[,2] <- rnorm(10)
print(temp.results) ##that's what it looks like
details.table[,1][row.names(details.table) %in% count(temp.results[,1])$x] <-
count(temp.results[,1])$freq
##summarize, subset to the appropriate rows in details.table, and fill in the summary
print(details.table)
for (i in 1:max.iterations)
{
rich.seq <- rich.seq[details.table < desired.iterations | is.na(details.table)]
## the idea would be to keep cutting this sequence of numbers down with
## successive iterations until the desired number of iterations per row in
## details.table was reached. in other words, in the real code i'd do
## something to details.table in the next line
print(rich.seq)
}
}
##call the function
test(desired.iterations=4, max.iterations=2)
On the first run through the for loop the rich.seq looks like I'd expect it to, where 5 & 6 are no longer in the sequence because both ended up with more than 4 iterations. However, on the second run, it spits out something unexpected.
UPDATE
Thanks for your help and also my apologies. After re-reading my original post it is not only less than clear, but I hadn't realized count was part of the plyr package, which I call in my full function but wasn't calling here. I'll try and explain better.
What I have working at the moment is a function that takes a matrix, randomizes it (in any of a number of different ways), then calculates some statistics on it. These stats are temporarily stored in a table--temp.results--where temp.results[,1] is the sum of the non zero elements in each column, and temp.results[,2] is a different summary statistic for that column. I save these results to a csv file (and append them to the same file at subsequent iterations), because looping through it and rbinding hogs a lot of memory.
The problem is that certain column sums (temp.results[,1]) are sampled very infrequently. In order to sample those sufficiently requires many many iterations, and the resulting .csv files would stretch into the hundreds of gigabytes.
What I want to do is create and then update a table (details.table) at each iteration that keeps track of how many times each column sum actually got sampled. When a given element in the table reaches the desired.iterations, I want it to be excluded from the vector rich.seq, so that only columns that haven't received the desired.iterations are actually saved to the csv file. The max.iterations argument will be used in a break() statement in case things are taking too long.
So, what I was expecting in the example case is the exact same line for rich.seq for both iterations, since I didn't actually do anything to change it. I believe that flodel is definitely right that my problem lies in comparing a matrix (details.table) of length longer than rich.seq, leading to unexpected results. However, I don't want the dimensions of details.table to change. Perhaps I can solve the problem implementing %in% somehow when I redefine rich.seq in the for loop?

I agree you should improve your question. However, I think I can spot what is going wrong.
You compute details.table before the for loop. It is a matrix with same length as rich.seq when it was first initialized (length(4:34), i.e. 31).
Inside the for loop, details.table < desired.iterations | is.na(details.table) is then a logical vector of length 31. On the first loop iteration,
rich.seq <- rich.seq[details.table < desired.iterations | is.na(details.table)]
will result in reducing the length of rich.seq. But on the second loop iteration, unless details.table is redefined (not the case), you are trying to subset rich.seq by a logical vector that is longer than rich.seq. R does not raise an error in that case: every TRUE at a position beyond the end of rich.seq returns NA, which is why the output looks unexpected.
You probably meant to redefine details.table as part of your for loop.
(Also I am surprised to see you never used temp.results[,2].)
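As a minimal sketch of the length mismatch described above (the vector x and the mask keep are made-up illustration data, not taken from the question), reusing a logical index that was built for the original vector on the already shortened vector produces NA values rather than an error:

```r
# Toy data: a short sequence and a logical mask of the same length.
x <- 4:8
keep <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

# First pass: mask and vector have the same length, so this works as expected.
x <- x[keep]
first_pass <- x          # 4 7 8

# Second pass: the mask still has length 5 but x now has length 3.
# Positions 4 and 5 do not exist in x, so R returns NA for them.
second_pass <- x[keep]   # 4 NA NA
print(first_pass)
print(second_pass)
```

Recomputing the mask from data that has the same length as the vector being subset, on every iteration, avoids the problem.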

Thanks to flodel for setting me off on the right track. It had nothing to do with is.na but rather the lengths of vectors I was comparing.
That said, I set the initial values of the details.table to zero to avoid the added complexity of the is.na statement.
This code works, and can be modified to do what I described above.
library(plyr)
test <- function(desired.iterations, max.iterations)
{
rich.seq <- 4:34 ##make a sequence of numbers
details.table <- matrix(nrow=length(rich.seq), ncol=1, dimnames=list(rich.seq)) ##generate a table where the row names are those numbers
details.table[,1] <- 0
print(details.table) ##that's what it looks like
temp.results <- matrix(nrow=10, ncol=2, dimnames=list(1:10)) ##generate some sample data to summarize and fill into details.table
temp.results[,1] <- rep(5:6, 5)
temp.results[,2] <- rnorm(10)
print(temp.results) ##that's what it looks like
details.table[,1][row.names(details.table) %in% count(temp.results[,1])$x] <- count(temp.results[,1])$freq ##summarize, subset to the appropriate rows in details.table, and fill in the summary
print(details.table)
for (i in 1:max.iterations)
{
rich.seq <- row.names(details.table)[details.table[,1] < desired.iterations]
print(rich.seq)
}
}
Rather than trying to cut down the rich.seq I just redefine it every iteration based on whatever happens with details.table during the previous iteration.
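One way the per-iteration update might look is sketched below. This is only an illustration under assumptions: update_counts is a hypothetical helper, base R's table() stands in for plyr's count(), and the same fake batch rep(5:6, 5) is fed in on every pass instead of real randomization results.

```r
# Hypothetical helper: add one batch of sampled column sums to the running counts.
# Rows are matched by name, so the dimensions of details.table never change.
update_counts <- function(details.table, sampled) {
  freq <- table(sampled)
  hit <- intersect(names(freq), rownames(details.table))
  details.table[hit, 1] <- details.table[hit, 1] + as.vector(freq[hit])
  details.table
}

rich.seq <- 4:34
details.table <- matrix(0, nrow = length(rich.seq), ncol = 1,
                        dimnames = list(rich.seq, NULL))
desired.iterations <- 4

for (i in 1:2) {
  # Stand-in for the real sampling step.
  details.table <- update_counts(details.table, rep(5:6, 5))
  # Rebuild the sequence from the table instead of trimming the old one.
  remaining <- as.integer(rownames(details.table)[details.table[, 1] < desired.iterations])
  print(remaining)
}
```

Because remaining is derived from details.table each time, its length can never drift out of step with the table.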

Related

How do I run a for loop over all columns of a data frame and return the result as a separate data frame or matrix

I am trying to obtain the number of cases for each variable in a df. There are 275 cases in the df but most columns have some missing data. I am trying to run a for loop to obtain the information as follows:
idef_id<-readxl::read_xlsx("IDEF.xlsx")
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(i))
275-nas
}
however the output for casenums is
> summary(casenums)
Length Class Mode
0 NULL NULL
Any help would be much appreciated!
A for loop isn't a function: it always evaluates to NULL, so x <- for(...) never makes sense. You can do that with, e.g., sapply, like this
casenums <- sapply(idef_id, function(x) sum(!is.na(x)))
Or you can do it in a for loop, but you need to assign to a particular value inside the loop:
casenums = rep(NA, ncol(idef_id))
names(casenums) = names(idef_id)
for(i in names(idef_id)) {
casenums[i] = sum(!is.na(idef_id[[i]]))
}
You also had a problem that i is taking on column names, so sum(is.na(i)) is asking if the value of the column name is missing. You need to use idef_id[[i]] to access the actual column, not just the column name, as I show above.
You seem to want the answer to be the number of non-NA values, so I switched to sum(!is.na(...)) to count that directly, rather than hard-coding the number of rows of the data frame and doing subtraction.
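The points above can be checked on a small made-up data frame (df below is illustration data, not the asker's IDEF.xlsx). The sketch also shows colSums(!is.na(df)) as another base R way to get the same counts, which is an alternative not mentioned in the answer:

```r
# Toy data frame with some missing values.
df <- data.frame(a = c(1, NA, 3),
                 b = c(NA, NA, "x"),
                 c = 1:3)

# Assigning the loop itself captures nothing: the value of a for loop is NULL.
loop_value <- for (i in names(df)) sum(!is.na(df[[i]]))
print(is.null(loop_value))

# Count non-missing values per column.
by_sapply <- sapply(df, function(x) sum(!is.na(x)))
by_colsums <- colSums(!is.na(df))
print(by_sapply)
print(by_colsums)
```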
The immediate fix for your for loop is that your i is a column name, not the data within. On your first pass through the for loop, your i is class character, always length 1, so sum(is.na(i)) is going to be 0. Due to how frames are structured, there is very little likelihood that a name is NA (though it is possible ... with manual subterfuge).
I suggest a literal fix for your code could be:
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(idef_id[[i]]))
275-nas
}
But this has the added problem that for loops don't return anything (as Gregor's answer also discusses). For the sake of walking through things, I'll keep that (for the first bullet), and then fix it (in the second):
Two things:
hard-coding 275 (assuming that's the number of rows in the frame) will be problematic if/when your data ever changes. Even if you're "confident" it never will ... I still recommend not hard-coding it. If it's based on the number of rows, then perhaps
OUT_OF <- 275 # should this be nrow(idef_id)?
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(idef_id[[i]]))
OUT_OF - nas
}
at least in a declarative sense, where the variable name (please choose something better) is clear as to how you determined 275 and how (if necessary) it should be fixed in the future.
(Or better, use Gregor's logic of sum(!is.na(...)) if you just need to count not-NA.)
doing something for each column of a frame is easily done using sapply or lapply, perhaps
OUT_OF <- 275 # should this be nrow(idef_id)?
OUT_OF - sapply(idef_id, function(one_column) sum(is.na(one_column)))
## or
sapply(idef_id, function(one_column) OUT_OF - sum(is.na(one_column)))

For loop returns last result

I have a small number of csv files, each containing two columns with numeric values. I want to write a for loop that reads the files, sums the columns, and stores the sum totals for each csv in a numeric vector. This is the closest I've come:
allfiles <- list.files()
for (i in seq(allfiles)) {
total <- numeric()
total[i] <- sum(subset(read.csv(allfiles[i]), select=Gift.1), subset(read.csv(allfiles[i]), select=Gift.2))
total
}
My result is all NA's save a value for the last file. I understand that I'm overwriting each iteration each time the for loop executes and I think I need to do something with indexing.
The first problem is that you are not pre-allocating the right length of (or properly appending to) total. Regardless, I recommend against that method.
There are several ways to do this, but the R-onic way (my term, based on pythonic ... I know, it doesn't flow well) is based on vectors/lists.
alldata <- sapply(allfiles, read.csv, simplify = FALSE)
totals <- sapply(alldata, function(a) sum(subset(a, select=Gift.1), subset(a, select=Gift.2)))
I often like to do that, keeping the "raw/unaltered" data in one list and then repeatedly extracting from it. For instance, if the files are huge and reading them takes a non-trivial amount of time, and you then realize you also need Gift.3, doing it your way would mean re-reading the entire dataset. Using my method, however, you just update the second sapply to include the change and rerun it on the already-loaded data. (Most of my rationale is based on untrusted data, portions that are typically unused, or other factors that may not be there for you.)
If you really wanted to reduce the code to a single line, something like:
totals <- sapply(allfiles, function(fn) {
x <- read.csv(fn)
sum(subset(x, select=Gift.1), subset(x, select=Gift.2))
})
allfiles <- list.files()
total <- numeric()
for (i in seq(allfiles)) {
total[i] <- sum(subset(read.csv(allfiles[i]), select=Gift.1), subset(read.csv(allfiles[i]), select=Gift.2))
}
total
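The key change is that total is created once, before the loop, instead of being reset to an empty vector on every iteration. If possible, give total a known length beforehand, i.e. total <- numeric(length(allfiles)).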
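A self-contained sketch of the pre-allocated loop follows. The two CSV files are made-up test data written to a temporary directory so the example can run anywhere; the column names Gift.1 and Gift.2 come from the question.

```r
# Create two small CSV files in a temporary directory (illustration data only).
dir <- tempfile("gifts")
dir.create(dir)
write.csv(data.frame(Gift.1 = c(1, 2), Gift.2 = c(3, 4)),
          file.path(dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(Gift.1 = c(10, 20), Gift.2 = c(30, 40)),
          file.path(dir, "b.csv"), row.names = FALSE)

allfiles <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# Allocate the result once, outside the loop.
total <- numeric(length(allfiles))
for (i in seq_along(allfiles)) {
  x <- read.csv(allfiles[i])           # read each file only once
  total[i] <- sum(x$Gift.1, x$Gift.2)
}
print(total)
```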

Binding data to a dataframe in each 'for' run under different columns each time to compute average of each column finally

I am trying to do 10-fold-cross-validation in R. In each for run a new row with several columns will be generated, each column will have an appropriate name, I want the results of each 'for' to go under the appropriate column, so that at end I will be able to compute the average value for each column. In each 'for' run results that are generated belong to different columns than the previous for, therefore the names of the columns should also be checked. Is it possible to do it anyway? Or maybe it would be better to just compute the averages for the columns on the spot?
for(i in seq(from=1, to=8200, by=820)){
fold <- df_vector[i:i+819,]
y_fold_vector <- df_vector[!(rownames(df_vector) %in% rownames(folding)),]
alpha_coefficient <- solve(K_training, y_fold_vector)
test_points <- df_matrix[rownames(df_matrix) %in% rownames(K_training), colnames(df_matrix) %in% rownames(folding)]
predictions <- rbind(predictions, crossprod(alpha_coefficient,test_points))
}
You are having problems with the precedence of binary operators in R. The line should be:
fold <- df_vector[ i:(i+819), ]
Consider:
> i=1
> i:i+189
[1] 190
Lack of a simple example (or any comments on what your code is supposed to be doing) prevents any testing of the rest of the code, but you can find the precedence of operators at ?Syntax. Unary "+" and "-" have higher precedence than ":", but binary "+" and "-" have lower precedence, so i:i+819 is evaluated as (i:i)+819.
(It's also unclear what the folding vector is supposed to be. You only defined a fold value and it wasn't a vector since you addressed it as you would a dataframe.)
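The precedence issue can be seen directly with the fold size from the question. In this sketch the names wrong, right, starts, and fold_rows are introduced only for illustration:

```r
i <- 1

# ":" binds tighter than binary "+", so this is (1:1) + 819, a single number.
wrong <- i:i + 819

# Parentheses give the intended range of 820 row indices.
right <- i:(i + 819)

# Row indices for all ten folds of 820 rows each.
starts <- seq(from = 1, to = 8200, by = 820)
fold_rows <- lapply(starts, function(s) s:(s + 819))

print(wrong)
print(range(right))
print(lengths(fold_rows))
```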

Indexing in nested loops

I am new to R and this site. My aim with the following, assuredly unnecessarily-arcane code is to create an R function that produces a special type of box plot in ggplot2. I first need to process potential input thereinto by calculating the variables that I shall later wish to have plotted.
I start by generating some random data, called datos:
c1=rnorm(98,47,23)
c2=rnorm(98,56,13)
c3=rnorm(98,52,7)
fila1=as.matrix(t(c(-2,15,30)))
colnames(fila1)=c("c1","c2","c3")
fila2=as.matrix(t(c(-20,5,20)))
colnames(fila2)=c("c1","c2","c3")
datos=rbind(data.frame(c1,c2,c3),fila1,fila2)
rm(c1,c2,c3,fila1,fila2)
Then, I calculate the variables to later be plotted, which include for each of the present columns in datos the mean (puntoMedio), the first and third quartiles (cuar1,cuar3), the interquartile range (iqr), the lower bound of potential submean whiskers (limInf), the upper bound of potential supermean whiskers (limSup) and outliers (submean outliers vAtInf and supermean outliers vAtSup to be combined in vAt):
puntoMedio=apply(datos,MARGIN=2,FUN=mean)
cuar1=apply(datos,MARGIN=2,FUN=quantile,probs=.25)
cuar3=apply(datos,MARGIN=2,FUN=quantile,probs=.75)
cuar=rbind(cuar1,cuar3)
iqr=apply(cuar,MARGIN=2,FUN=diff)
cuar=rbind(cuar,iqr,puntoMedio)
limInf=array(dim=ncol(datos))
for(i in 1:ncol(datos)){
limInf0=as.matrix(t(cuar[1,]-1.5*cuar[3,]))
if(length(datos[datos[,i]<limInf0[,i],i])>0){
limInf[i]=limInf0[,i]
}else{limInf[i]=min(datos[,i])}
}
limSup=array(dim=ncol(datos))
for(i in 1:ncol(datos)){
limSup0=as.matrix(t(cuar[2,]+1.5*cuar[3,]))
if(length(datos[datos[,i]>limSup0[,i],i])>0){
limSup[i]=limSup0[,i]
}else{limSup[i]=max(datos[,i])}
}
d=data.frame(t(rbind(cuar,limInf,limSup)))
rm(cuar)
vAtInf=datos
for(i in 1:ncol(vAtInf)){
vAtInf[vAtInf[,i]>limInf0[,i],i]=NA
}
colnames(vAtInf)=c("vAtInfc1","vAtInfc2","vAtInfc3")
vAtSup=datos
for(i in 1:ncol(vAtSup)){
vAtSup[vAtSup[,i]<limSup0[,i],i]=NA
}
colnames(vAtSup)=c("vAtSupc1","vAtSupc2","vAtSupc3")
datos=cbind(datos,vAtInf,vAtSup)
rm(limInf0,limSup0,cuar1,cuar3,i,iqr,limInf,limSup,puntoMedio)
Everything works as desired up until here. I have two data frames d and datos, the former of no interest here, and the latter, which in this specific case comprises nine columns: three of all values, three of the corresponding submean outliers and three of the corresponding supermean outliers (these latter six padded with NA). I now wish to extract all outliers by column, wherefore I have tried formulating the following loop. While it does work giving neither error nor warning, it also does not give the desired output in vAt (again, the by-column [columns 4:9] outliers from datos). The problem, then, as far as I have been able to discern, occurs in the nested for-loop, upon attempting to input i into vAt: each iteration of the loop erases the last, such that upon completion of the entire loop, vAt only contains NA and the outliers from the last column/of the last iteration.
for(i in ((ncol(datos)/3)+1):ncol(datos)){
vAt=matrix(nrow=.25*nrow(datos),ncol=ncol(datos)-(ncol(datos)/3))
colnames(vAt)=c(((ncol(datos)/3)+1):ncol(datos))
if(length(datos[,i][is.na(datos[,i])==F])>0){
for(j in 1:(length(datos[,i][is.na(datos[,i])==F]))){
nom=as.character(i)
vAt[j,nom]=datos[,i][is.na(datos[,i])==F][j]
}
}else{next}
}
I have not been able to find any existent thread that answers my question. Thanks for any help.
The problem is that you are initialising vAt inside the loop here.
Moving the initialisation statements outside the for loop will fix the problem that you are facing:
vAt=matrix(nrow=.25*nrow(datos),ncol=ncol(datos)-(ncol(datos)/3))
colnames(vAt)=c(((ncol(datos)/3)+1):ncol(datos))
for(i in ((ncol(datos)/3)+1):ncol(datos)){
if(length(datos[,i][is.na(datos[,i])==F])>0){
for(j in 1:(length(datos[,i][is.na(datos[,i])==F]))){
nom=as.character(i)
vAt[j,nom]=datos[,i][is.na(datos[,i])==F][j]
}
}else{next}
}
However, there are various improvements which you can make to the code as it stands:
Using vectorisation and *ply functions instead of for loops.
Not comparing logical vectors to ==F but instead only using !is.na(...).
Using sum(!is.na(...)) instead of length(d[,i][!is.na(...)])
And some more. These will not change the correctness of the code, but will make it more efficient and more idiomatic.
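As a rough sketch of what those improvements could look like, the nested loops can be replaced by lapply over the NA-padded outlier columns. The small datos frame below is made-up illustration data with one value column and two outlier columns, and the result is kept as a list (rather than an NA-padded matrix) because each column can hold a different number of outliers; that choice is an assumption, not something the question asked for.

```r
# Toy stand-in for datos: one value column plus NA-padded outlier columns.
datos <- data.frame(
  c1       = c(5, -20, 90, 7),
  vAtInfc1 = c(NA, -20, NA, NA),
  vAtSupc1 = c(NA, NA, 90, NA)
)

# Keep only the outlier columns.
outlier_cols <- datos[, -1, drop = FALSE]

# Non-missing values per column, without == F and without an inner loop.
vAt <- lapply(outlier_cols, function(x) x[!is.na(x)])

# Number of outliers per column, counted directly.
n_outliers <- colSums(!is.na(outlier_cols))

print(vAt)
print(n_outliers)
```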

Efficient function to return varying length vector from lookup table

I have three data sources:
types<-c(1,3,3)
places<-list(c(1,2,3),1,c(2,3))
lookup.counts<-as.data.frame(matrix(runif(9,min=0,max=10),nrow=3,ncol=3))
assigned.places<-rep.int(0,length(types))
The numbers in the "types" vector tell me what 'type' a given observation is. The vectors in the places list tell me which places the observation can be found in (some observations are found in only one place, others in all places). By definition there is one entry in types and one list in places for each observation. lookup.counts tells me how many observations of each type are located in each place (generated from another data source).
I want to randomly assign each observation to a place based on a probability generated from lookup.counts. Using for loops it looks something like:
for (i in 1:length(types)){
row<-types[i]
columns<-places[[i]]
this.obs<-lookup.counts[row,columns] #the counts of this type in each place
total<-sum(this.obs)
this.obs<-this.obs/total #the share of observations of this type in these places
pick<-runif(1,min=0,max=1)
#the following should really be a 'while' loop, but regardless it needs help
for(j in 1:length(this.obs[])){
if(this.obs[j] > pick){
#pick is less than this county so assign
pick<- 100 #just a way of making sure an observation doesn't get assigned twice
assigned.places[i]<-colnames(lookup.counts)[j]
}else{
#pick is greater, move to the next category
pick<- pick-this.obs[j]
}
}
}
I have been trying to vectorize this somehow, but am getting hung up on the variable length of 'places' and of 'this.obs'
In practice, of course, the lookup.counts table is quite a bit bigger (500 x 40) and I have some 900K observations with places lists of length 1 through length 39.
To vectorize the inner loop, you can use sample or sample.int to choose from several alternatives with prescribed probabilities. Unless I read your code incorrectly, you want something like this:
assigned.places[i] <- sample(colnames(this.obs), 1, prob = this.obs)
I'm a bit surprised that you're using colnames(lookup.counts) instead. Shouldn't this be subset by columns as well? It seems that either I missed something, or there is a bug in your code.
The different lengths of your lists are a severe obstacle to vectorizing your outer loops. Perhaps you could use the Matrix package to store that information as sparse matrices. Then you could simply multiply probabilities by that vector to exclude those columns which are not in the places list of a given observation. But as you'd probably still use apply for the above sampling code, you might as well keep the list and use some form of apply to iterate over that.
The overall result might look somewhat like this:
assigned.places <- colnames(lookup.counts)[
apply(cbind(types, places), 1, function(x) {
sample(x[[2]], 1, prob=lookup.counts[x[[1]],x[[2]]])
})
]
The use of cbind and apply isn't particularly beautiful, but seems to work. Each x is a list of two items, x[[1]] being the type and x[[2]] being the corresponding places. We use these to index lookup.counts just as you did. Then we use the found counts as relative probabilities when choosing the index of one of the columns we used in the subscript. Only after all these numbers have been assembled into a single vector by apply will the indices be turned into names based on colnames.
You can check whether things are faster if you don't cbind stuff together, but instead iterate over the indices only:
assigned.places <- colnames(lookup.counts)[
sapply(1:length(types), function(i) {
sample(places[[i]], 1, prob=lookup.counts[types[i],places[[i]]])
})
]
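One caveat with calling sample on the places vector directly: as documented in ?sample, when x is a single number of at least 1, sample(x, ...) draws from 1:x rather than returning x itself. For an observation whose only place is, say, 3, that would clash with a prob vector of length 1. A hedged sketch that sidesteps this by sampling a position with sample.int follows; pick_place is a hypothetical helper, lookup.counts is made a matrix with made-up column names V1..V3, and the second entry of places is changed to 3 to exercise the single-place case.

```r
types <- c(1, 3, 3)
places <- list(c(1, 2, 3), 3, c(2, 3))   # second observation has a single place

set.seed(42)
lookup.counts <- matrix(runif(9, min = 0, max = 10), nrow = 3,
                        dimnames = list(NULL, paste0("V", 1:3)))

# Hypothetical helper: pick one candidate place for observation i,
# weighting by the counts of its type in the candidate places.
pick_place <- function(i) {
  cand <- places[[i]]
  w <- lookup.counts[types[i], cand]
  # Sample a position in cand, so a single candidate is returned as-is.
  cand[sample.int(length(cand), 1, prob = w)]
}

assigned.places <- colnames(lookup.counts)[
  vapply(seq_along(types), pick_place, numeric(1))
]
print(assigned.places)
```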
This appears to work as well:
# More convenient if lookup.counts is a matrix.
lookup.counts<-matrix(runif(9,min=0,max=10),nrow=3,ncol=3)
colnames(lookup.counts)<-paste0('V',1:ncol(lookup.counts))
# A function that does what the for loop does for each i
test<-function(i) {
this.places<-colnames(lookup.counts)[places[[i]]]
this.obs<-lookup.counts[types[i],this.places]
sample(this.places,size=1,prob=this.obs)
}
# Applies the function for all i
sapply(1:length(types),test)

Resources