Modifying Data Set within a function but data set is not changed - r

My code is the following in R:
replaceNA<- function(myData,limit){
numNA<- rowsum(is.na(myData))
targetRows<- which(numNA<=limit)
targetCols<- length(names(myData))
for(row in targetRows){
for(col in 1:targetCols){
myData[row,col][is.na(myData[row,col])]<-1
}
}
}
I am trying to iterate through each element in myData and replace all NAs of a row with 1 IF the row does not have more than the number of NAs. I have tested my code with print statements and found that the iteration works perfectly (although not the most efficient code) and if I examine the modified myData by putting in a fix(myData) before the last bracket of the function, I see that my function worked perfectly(the NAs are replaced with 1s for the rows that meet the limit condition). However, when I examine myData after the function terminates, myData does not show the changes replaceNA made.
I know there is a problem in storing the modified myData but I am not sure how to store it properly.

The condition is not clear ( English problem). In any case you don't need a for loop here.
To compute the number of missing values for each row :
rowSums(is.na(myData))
Then you just test your condition and you replace all the row:
mm <- myData[rowSums(is.na(myData)) <= limit ,]
mm[is.na(mm)] <- 1
myData[rowSums(is.na(myData)) <= limit ,] <- mm

You should make your function explicitly return the modified data,
replaceNA<- function(myData,limit){
numNA<- rowsum(is.na(myData))
targetRows<- which(numNA<=limit)
targetCols<- length(names(myData))
for(row in targetRows){
for(col in 1:targetCols){
myData[row,col][is.na(myData[row,col])]<-1
}
}
return(myData)
}
then assign the modified data. You could overwrite your old data
myData <- replaceNA(myData, limit = 2)
or make a copy to compare
myData_no_na <- replaceNA(myData, limit = 2)
You can also avoid the loop entirely, which is much more R-like. #agstudy's answer seems to be covering that approach nicely.

Related

How to pass a column name in a for loop concatenating i with a string?

I need to subset a data frame in several others based in the values of several columns of the original data frame.
Here's my for loop:
for (i in 1:qtde_erros_esti){
temp_esti <- erro_esti[(paste0("erro_esti$" , "erro", i) == "1"),]
assign(paste0("erro", i,"_esti"), temp_esti)
rm(temp_esti)
}
The last piece of the puzzle for me is to pass the column name which value I must check (1st line in the for loop).
I'm trying to pass it with the function paste0, but the result of the function is a string that will never be equal to "1", hence never getting any data.
How can I pass the column names (erro_esti$erro1, erro_esti$erro2, and so on...) in this case?
Observation: I'm aware that this may not be the best approach using R, but I'm a noobie, coming from SAS, so I have limited knowledge.
Secondary question: is the way that I formulated the question (topic title) good? Accepting criticism on that too, please, aiming to improve future questions.
Thanks in advance for anyone who take some time to read this.
We can use [[ instead of $ to subset the column dynamically
erro_esti[[paste0("erro", i)]]
-full code
for(i in seq_len(qtde_erros_esti)) {
temp_esti <- erro_esti[erro_esti[[paste0("erro", i)]] == 1,]
assign(paste0("erro", i,"_esti"), temp_esti)
rm(temp_esti)
}
You are probably going about things a bit too complicated most likely, considert his approach:
for (i in 1:qtde_erros_esti){
column.name <- paste0("erro", i)
column.data <- erro_esti[, column.name ]
## do things with the column.data vector here
}
Now you can do what needs to be done with the data from column i, using the column.data variable.
If you just want to work with every column of your data.frame, also consider this further simplified pattern:
for( column.data in erro_esti ) {
## work with column.data here
}
You can just iterate over the columns of erro_esti directly, no need to use a counter, unless you need that counter for something else.

How do I run a for loop over all columns of a data frame and return the result as a separate data frame or matrix

I am trying to obtain the number of cases for each variable in a df. There are 275 cases in the df but most columns have some missing data. I am trying to run a for loop to obtain the information as follows:
idef_id<-readxl::read_xlsx("IDEF.xlsx")
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(i))
275-nas
}
however the output for casenums is
> summary(casenums)
Length Class Mode
0 NULL NULL
Any help would be much appreciated!
A for loop isn't a function - it doesn't return anything, so x <- for(... doesn't ever make sense. You can do that with, e.g., sapply, like this
casenums <- sapply(idef_id, function(x) sum(!is.na(x)))
Or you can do it in a for loop, but you need to assign to a particular value inside the loop:
casenums = rep(NA, ncol(idef_id))
names(casenums) = names(idef_id)
for(i in names(idef_id)) {
casenums[i] = sum(!is.na(idef_id[[i]]))`
}
You also had a problem that i is taking on column names, so sum(is.na(i)) is asking if the value of the column name is missing. You need to use idef_id[[i]] to access the actual column, not just the column name, as I show above.
You seem to want the answer to be the number of non-NA values, so I switched to sum(!is.na(...)) to count that directly, rather than hard-coding the number of rows of the data frame and doing subtraction.
The immediate fix for your for loop is that your i is a column name, not the data within. On your first pass through the for loop, your i is class character, always length 1, so sum(is.na(i)) is going to be 0. Due to how frames are structured, there is very little likelihood that a name is NA (though it is possible ... with manual subterfuge).
I suggest a literal fix for your code could be:
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(idef_id[[i]]))
275-nas
}
But this has the added problem that for loops don't return anything (as Gregor's answer also discusses). For the sake of walking through things, I'll keep that (for the first bullet), and then fix it (in the second):
Two things:
hard-coding 275 (assuming that's the number of rows in the frame) will be problematic if/when your data ever changes. Even if you're "confident" it never will ... I still recommend not hard-coding it. If it's based on the number of rows, then perhaps
OUT_OF <- 275 # should this be nrow(idef_id)?
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(idef_id[[i]]))
OUT_OF - nas
}
at least in a declarative sense, where the variable name (please choose something better) is clear as to how you determined 275 and how (if necessary) it should be fixed in the future.
(Or better, use Gregor's logic of sum(!is.na(...)) if you just need to count not-NA.)
doing something for each column of a frame is easily done using sapply or lapply, perhaps
OUT_OF <- 275 # should this be nrow(idef_id)?
OUT_OF - sapply(idef_id, function(one_column) sum(is.na(one_column)))
## or
sapply(idef_id, function(one_column) OUT_OF - sum(is.na(one_column)))

Deleting a row from a data set

I am trying to create a function that deletes n rows from a data set in R. The rows that I want to delete are the minimum values from the column time in the data set my_data_set.
I currently have
delete_data <- function(n)
{
k=1
while(k <= n)
{
my_data_set = my_data_set[-(which.min(my_data_set$time)),]
k=k+1
}
}
When I input these lines manually (without the use of the while loop) it works perfectly but I am not able to get the loop to work.
I am calling the function by:
delete_data(n = 2)
Any help is appreciated!
Thanks
Try:
my_data_set[ ! my_data_set$time == min(my_data_set$time), ]
Or if you are using data.table and wish to use the more direct syntax that data.table provides:
library(data.table)
setDT( my_data_set )
my_data_set[ ! time == min(time) ]
Then review how R work. R is a vectorized language that pretty much does what you mean without having to resort to complicated loops.
Also try:
my_data_set <- my_data_set[which(my_data_set$time > min(my_data_set$time)),]
By the way, which.min() will only pick up the first record if there is more than one record matching the minimum value.

Subset function in R wont work with vector selection

I have this weird problem where I have something like this in my code:
#(2,1,6,3)
states.vector <- unique(data$state)
I am iterating through the vector to subset data for each value in the "state" column. At some point through my iteration, the following line of code gives me an empty data frame:
#When state == 1
data.state <- subset(data,state==states.vector[state])
If state is == 1, it means that states.vector[state] == 2. But when I do the following, it works just fine:
subset(data,state==2)
What is weird is that I used this process multiple times, and it worked fine for the exact same task, with the same format for "data", but with some different values inside.
What am I doing wrong?
I think jlhoward has already explained what the problem is.
Why don't you use something like the following lines of code to loop through your states?
states.vector <- unique(data$state)
for (selected_state in states.vector) {
data.state <- subset(data,state==selected_state)
#...
}

is.na() in R for loop not quite understood

I am confused by the behavior of is.na() in a for loop in R.
I am trying to make a function that will create a sequence of numbers, do something to a matrix, summarize the resulting matrix based on the sequence of numbers, then modify the sequence of numbers based on the summary and repeat. I made a simple version of my function because I think it still gets at my problem.
library(plyr)
test <- function(desired.iterations, max.iterations)
{
rich.seq <- 4:34 ##make a sequence of numbers
details.table <- matrix(nrow=length(rich.seq), ncol=1, dimnames=list(rich.seq))
##generate a table where the row names are those numbers
print(details.table) ##that's what it looks like
temp.results <- matrix(nrow=10, ncol=2, dimnames=list(1:10))
##generate some sample data to summarize and fill into details.table
temp.results[,1] <- rep(5:6, 5)
temp.results[,2] <- rnorm(10)
print(temp.results) ##that's what it looks like
details.table[,1][row.names(details.table) %in% count(temp.results[,1])$x] <-
count(temp.results[,1])$freq
##summarize, subset to the appropriate rows in details.table, and fill in the summary
print(details.table)
for (i in 1:max.iterations)
{
rich.seq <- rich.seq[details.table < desired.iterations | is.na(details.table)]
## the idea would be to keep cutting this sequence of numbers down with
## successive iterations until the desired number of iterations per row in
## details.table was reached. in other words, in the real code i'd do
## something to details.table in the next line
print(rich.seq)
}
}
##call the function
test(desired.iterations=4, max.iterations=2)
On the first run through the for loop the rich.seq looks like I'd expect it to, where 5 & 6 are no longer in the sequence because both ended up with more than 4 iterations. However, on the second run, it spits out something unexpected.
UPDATE
Thanks for your help and also my apologies. After re-reading my original post it is not only less than clear, but I hadn't realized count was part of the plyr package, which I call in my full function but wasn't calling here. I'll try and explain better.
What I have working at the moment is a function that takes a matrix, randomizes it (in any of a number of different ways), then calculates some statistics on it. These stats are temporarily stored in a table--temp.results--where temp.results[,1] is the sum of the non zero elements in each column, and temp.results[,2] is a different summary statistic for that column. I save these results to a csv file (and append them to the same file at subsequent iterations), because looping through it and rbinding hogs a lot of memory.
The problem is that certain column sums (temp.results[,1]) are sampled very infrequently. In order to sample those sufficiently requires many many iterations, and the resulting .csv files would stretch into the hundreds of gigabytes.
What I want to do is create and then update a table (details.table) at each iteration that keeps track of how many times each column sum actually got sampled. When a given element in the table reaches the desired.iterations, I want it to be excluded from the vector rich.seq, so that only columns that haven't received the desired.iterations are actually saved to the csv file. The max.iterations argument will be used in a break() statement in case things are taking too long.
So, what I was expecting in the example case is the exact same line for rich.seq for both iterations, since I didn't actually do anything to change it. I believe that flodel is definitely right that my problem lies in comparing a matrix (details.table) of length longer than rich.seq, leading to unexpected results. However, I don't want the dimensions of details.table to change. Perhaps I can solve the problem implementing %in% somehow when I redefine rich.seq in the for loop?
I agree you should improve your question. However, I think I can spot what is going wrong.
You compute details.table before the for loop. It is a matrix with same length as rich.seq when it was first initialized (length(4:34), i.e. 31).
Inside the for loop, details.table < desired.iterations | is.na(details.table) is then a logical vector of length 31. On the first loop iteration,
rich.seq <- rich.seq[details.table < desired.iterations | is.na(details.table)]
will result in reducing the length of rich.seq. But on the second loop iteration, unless details.table is redefined (not the case), you are trying to subset rich.seq by a logical vector of longer length than rich.seq. This will certainly lead to unexpected results.
You probably meant to redefine details.table as part of your for loop.
(Also I am surprised to see you never used temp.results[,2].)
Thanks to flodel for setting me off on the right track. It had nothing to do with is.na but rather the lengths of vectors I was comparing.
That said, I set the initial values of the details.table to zero to avoid the added complexity of the is.na statement.
This code works, and can be modified to do what I described above.
library(plyr)
test <- function(desired.iterations, max.iterations)
{
rich.seq <- 4:34 ##make a sequence of numbers
details.table <- matrix(nrow=length(rich.seq), ncol=1, dimnames=list(rich.seq)) ##generate a table where the row names are those numbers
details.table[,1] <- 0
print(details.table) ##that's what it looks like
temp.results <- matrix(nrow=10, ncol=2, dimnames=list(1:10)) ##generate some sample data to summarize and fill into details.table
temp.results[,1] <- rep(5:6, 5)
temp.results[,2] <- rnorm(10)
print(temp.results) ##that's what it looks like
details.table[,1][row.names(details.table) %in% count(temp.results[,1])$x] <- count(temp.results[,1])$freq ##summarize, subset to the appropriate rows in details.table, and fill in the summary
print(details.table)
for (i in 1:max.iterations)
{
rich.seq <- row.names(details.table)[details.table[,1] < desired.iterations]
print(rich.seq)
}
}
Rather than trying to cut down the rich.seq I just redefine it every iteration based on whatever happens with details.table during the previous iteration.

Resources