I need to iteratively evaluate the variance of a dataset while I reduce the data.frame row by row in each step. As an example:
data <- matrix(runif(100),10,10)
perc <- list("vector")
sums <- sum(data)
for (i in 1:nrow(data)) {
data <- data[-1,]
perc[[i]] <- sum(data)/sums # in reality, here are ~8 additional lines of code
}
I don't like that data is reassigned in every step, and that the loop breaks with an error when data is emptied.
So the questions are:
1. How can I express data <- data[-1,] in an incrementing way (something like tmp <- data[-c(1:i),], which doesn't work)?
2. Is there a way to stop the loop before the last row is removed from data?
You could try
set.seed(123)
data <- matrix(runif(100),10,10)
sums <- sum(data)
perc <- lapply(2:nrow(data), function(x) sum(data[x:nrow(data),])/sums)
The above code yields the same result as your original code, but without error message and without modifying data.
perc1 <- list()
for (i in 1:nrow(data)) {
data <- data[-1,]
perc1[[i]] <- sum(data)/sums
}
identical(perc,perc1)
#[1] TRUE
If you wish to preserve the for loop in order to perform other calculations within the loop, you could try:
for (i in 2:nrow(data)) {
perc[[i-1]] <- sum(data[i:nrow(data),])/sums
# do more stuff here
}
identical(perc,perc1)
#[1] TRUE
If you are using the loop index i for other calculations within the loop, you will most probably need to replace it with i-1. It may depend on what is calculated.
You can use lapply
res <- lapply(2:nrow(data), function(i)sum(data[i:nrow(data),])/sums)
You can write the loop part like this:
for (i in 2:nrow(data)) {
perc[[i - 1]] <- sum(data[i:nrow(data),])/sums # in reality, here are ~8 additional lines of code
}
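If only the running share of the total is needed (and not the other lines inside the loop), the same numbers can be obtained without any subsetting in the loop. The following is a minimal sketch under that assumption: it uses a reversed cumulative sum of the row sums, which is a different technique from the answers above, and the names row_sums, tail_sums and perc_fast are made up for illustration.

```r
set.seed(123)
data <- matrix(runif(100), 10, 10)
sums <- sum(data)

# tail_sums[j] is the sum of rows j..nrow(data)
row_sums  <- rowSums(data)
tail_sums <- rev(cumsum(rev(row_sums)))

# Dropping the first element gives the totals left after removing
# 1, 2, ..., nrow(data) - 1 rows from the top
perc_fast <- tail_sums[-1] / sums

# Reference result from the index-based loop, for comparison
perc_loop <- numeric(nrow(data) - 1)
for (i in 2:nrow(data)) {
  perc_loop[i - 1] <- sum(data[i:nrow(data), ]) / sums
}
```

The two results agree up to floating-point rounding, so all.equal() is the appropriate comparison here rather than identical().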
I am putting together a summary table from a larger data frame. I noticed that I was re-using the following code but with different %like% characters:
# This code creates a df of values where the row name matches the character
df <- (data[which(data$`col_name` %like% "Total"),])
df <- df[3:ncol(df)]
df[is.na(df)] <- 0
# This creates a row composed of the sum of each column
for (i in seq_along(df)) {
df[10, i] <- sum(df[i])
}
# This inserts the resulting values into a separate summary table
summary[1, 2:ncol(summary)] <- df[nrow(df),]
To keep the code DRY and avoid repetition, I thought it would be best to translate this into a custom function that I could then call with different strings:
create_row <- function(x) {
df <- (data[which(data$`Crop year` %like% as.character(x)),])
df <- df[3:ncol(df)]
df[is.na(df)] <- 0
for (i in seq_along(df)) {
df[10, i] <- sum(df[i])
}
}
# Then populate the summary table as before with the results
total <- create_row("Total")
summary[1, 2:ncol(summary)] <- total[nrow(total),]
However, when attempting to run this, it simply returns an empty variable.
Through trial and error, I have found that the line of code causing this is:
df[is.na(df)] <- 0
The code works absolutely fine when run line by line outside of this custom function.
As mentioned in the comments, the function will work if you add return(df) at the end. This is needed because a function returns the value of its last evaluated expression, and here that expression is the for loop, which evaluates to NULL.
Also, as @alan mentioned in the comments, you can use colSums to get the sum of each column directly, instead of looping over the columns with a for loop.
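Putting both suggestions together, the function could look roughly like the sketch below. The demo data frame and its column names are invented for illustration, and grepl() is used in place of %like% only so that the example runs without loading an extra package.

```r
# Hypothetical demo data; only the "Crop year" column name comes from the question
data <- data.frame(
  `Crop year` = c("Total 2019", "Total 2020", "Other 2020"),
  region      = c("a", "b", "c"),
  wheat       = c(1, NA, 5),
  barley      = c(2, 3, NA),
  check.names = FALSE
)

create_row <- function(data, pattern) {
  # Keep matching rows and drop the first two (non-numeric) columns
  df <- data[grepl(pattern, data[["Crop year"]]), 3:ncol(data), drop = FALSE]
  df[is.na(df)] <- 0
  # colSums() is the last expression, so its value is what the function returns
  colSums(df)
}

total <- create_row(data, "Total")
```

Because the result is a named numeric vector with one entry per column, it can be assigned directly into a row of the summary table, provided the column counts match.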
I'm trying to apply a for-loop to a dataframe in R, using it to take the row number, which will be used in a t-test, along with specified column indices.
When I run the code I currently have, it only takes the last value specified in the for-loop. How do I fix this? (sorry I'm a complete novice)
This is my code:
x represents the dataset
for(i in 1:nrow(x)){
test<- t.test(x[i, 1:5], x[i, 6:10])
return(test$p.value)
}
I want it to run a t-test on every row, using i (as the row number) and the specified column indices as the input, to provide me with the p value from each test
This happens because you overwrite test on every iteration. If you really want to use a for loop for this purpose and extract the p-values afterwards, this would work better:
set.seed(1)
x <- matrix(sample(1:100,100), nrow = 10)
test = list()
a = 0
for(i in 1:nrow(x)){
a <- a + 1
test[[a]] <- t.test(x[i, 1:5], x[i, 6:10])
}
lapply(test, "[[", "p.value")
However, using apply the way nadizan proposed is preferable in this case.
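The apply-based answer referred to is not reproduced here. As a rough sketch of what such an approach might look like (an assumption, not necessarily the code that was proposed), each row is passed to a small function that returns only the p-value:

```r
set.seed(1)
x <- matrix(sample(1:100, 100), nrow = 10)

# One p-value per row: columns 1-5 are compared against columns 6-10
p_values <- apply(x, 1, function(row) t.test(row[1:5], row[6:10])$p.value)
```

The result is a plain numeric vector with one element per row of x, so no list handling is needed afterwards.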
I think that in order to use return you have to define a function (I am actually surprised you don't get an error). What happens is that the loop performs all the tests as you want but it overwrites them on the same variable test, so at the end you have only the last result.
Edit: In fact, I checked and the return should let you exit at the first iteration, thus getting only the result of the first test.
One simple way to fix this is to create, for example, a vector and then store each new result in the position corresponding to its row:
test <- c()
for(i in 1:nrow(x)){
test[i] <- t.test(x[i, 1:5], x[i, 6:10])$p.value # keep only the p-value
}
Notice that appending to an empty vector/list is quite expensive as its final length increases, so you may want to initialize it with NAs with the same length as the number of rows of the dataframe:
test <- rep(NA, nrow(x))
I'm trying to create a loop that fills a vector "naVals" with the number of "NA" values by column in a data frame "wb". So far I have this:
naVals <- rep(NA, 24)
for (i in 1:24){
naVals[i] <- sum(is.na(wb$v[i]))
}
This is not working and I can't figure out why!
naVals <- apply(wb,2,function(wb)sum(is.na(wb)))
(I know that this code does the same thing, but I'd like to do it as a loop if possible.) Thanks so much!
As mentioned by @joran in the comments, your for loop would work if you replace wb$v[i] with wb[[i]].
In the form that you have currently posted, you have already extracted column v and are subsetting on rows rather than columns. What you want is:
naVals <- rep(NA, 24)
for (i in 1:24) {
naVals[i] <- sum(is.na(wb[[i]]))
}
Some extra advice, though. If you want this code to be adaptable to different data frames and not so specific then I recommend the following:
naVals <- numeric(ncol(wb)) # one slot per column
for (i in seq_len(ncol(wb))) { # iterate over every column index, not just the last
naVals[i] <- sum(is.na(wb[[i]]))
}
This way you don't have to constantly edit both the for loop and the naVals initialization. You only need to use it on a new wb.
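For comparison, the same counts can be obtained without a loop, since is.na() applied to a data frame returns a logical matrix. This is only a sketch, and the small wb below is made-up demo data:

```r
# Hypothetical stand-in for the real wb
wb <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, 8, 9))

# Loop version, independent of the number of columns
naLoop <- numeric(ncol(wb))
for (i in seq_len(ncol(wb))) {
  naLoop[i] <- sum(is.na(wb[[i]]))
}

# Vectorised version: count TRUE values per column of the logical matrix
naVals <- colSums(is.na(wb))
```

The vectorised result additionally carries the column names, which can make the output easier to read.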
I am trying to write a function that evaluates each term within a matrix against a condition. If the condition is met for any term, the entire row is added to a second matrix.
(context: I am doing so to compare outliers for all attributes. If any row has outlier data for any attribute (their z-score > 3), then the entire row would be added to the Outlier data matrix)
Please see my code below. I really don't understand why it isn't working.
outliers <- matrix()
x <- 1
for(r in nrow(all_z_stats)){
for(c in ncol(all_z_stats)){
if(all_z_stats[r,c]>3){
outliers[x,] <- all_z_stats[r,]
x <- x + 1
}}
}
Thanks very much in advance for any information or input.
Test data: all_z_stats <- replicate(20, rnorm(20))
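First, for (r in nrow(all_z_stats)) yields only one value of r, namely the number of rows. Instead, loop r over all values from 1 to nrow(all_z_stats): for (r in seq_len(nrow(all_z_stats))), which is equivalent to for (r in 1:nrow(all_z_stats)) as long as the matrix has at least one row (the same applies to c).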
First improvement:
outliers <- matrix(nrow = 0, ncol = ncol(all_z_stats)) # Empty matrix with as many columns as all_z_stats
for(r in seq_len(nrow(all_z_stats))){
if(any(all_z_stats[r, ] > 3)){ # any is useful, try ?any in R console
outliers <- rbind(outliers, all_z_stats[r, ]) # add line to outliers
}
}
But you could do this without a for loop. First, find all the row indices where an entry > 3 is present (using sapply). Then, extract only these rows from all_z_stats:
keep <- sapply(seq_len(nrow(all_z_stats)), function(r) any(all_z_stats[r, ] > 3))
outliers <- all_z_stats[keep, , drop = FALSE] # drop = FALSE keeps a matrix even for a single row
Well, you could define outliers as numeric() and use rbind() to build up your matrix like this:
outliers <- numeric()
for(r in seq_len(nrow(all_z_stats))){
for(c in seq_len(ncol(all_z_stats))){
if(all_z_stats[r,c]>3){
outliers <- rbind(outliers,all_z_stats[r,])
break
}
}
}
There are better ways to achieve this kind of subsetting.
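One such way, sketched here as a suggestion rather than as the approach the answer had in mind, is to build a logical row index with rowSums() and subset once. Two values above 3 are planted in the test data so that the result is guaranteed to be non-empty; the name has_outlier is made up for illustration.

```r
set.seed(42)
all_z_stats <- replicate(20, rnorm(20))
# Planted outliers, only so that the demo has something to find
all_z_stats[3, 5]  <- 3.5
all_z_stats[17, 2] <- 4.1

# TRUE for every row that contains at least one value above 3
has_outlier <- rowSums(all_z_stats > 3) > 0

# drop = FALSE keeps the result a matrix even if only one row matches
outliers <- all_z_stats[has_outlier, , drop = FALSE]
```

If the z-scores can contain NA, rowSums(..., na.rm = TRUE) would be needed to avoid NA entries in the index.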
I want to create a new dataframe and keep adding variables in R with in a for loop. Here is a pseudo code on what I want:
X <- 100
For(i in 1:X)
{
#do some processing and store it a variable named "temp_var"
temp_var[,i] <- z
#make it as a dataframe and keep adding the new variables until the loop completes
}
After completion of the above loop the vector "temp_var" by itself should be a dataframe containing 100 variables.
I would avoid using for loops as much as possible in R. If you know your columns beforehand, lapply is something you want to use.
However, if this is constrained by your program, you can do something like this:
X <- 100
tmp_list <- list()
for (i in 1:X) {
tmp_list[[i]] <- your_input_column
}
data <- data.frame(tmp_list)
names(data) <- paste0("col", 1:X)
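The lapply variant mentioned at the start of this answer could look like the sketch below. make_column() is a hypothetical stand-in for whatever processing produces each column, and the column length of 5 is an arbitrary choice for the demo.

```r
X <- 100

# Hypothetical per-column processing; replace with the real computation
make_column <- function(i) rnorm(5, mean = i)

# Build all columns as a list, name them, then convert once
tmp_list <- lapply(seq_len(X), make_column)
names(tmp_list) <- paste0("col", seq_len(X))
data <- as.data.frame(tmp_list)
```

Converting the list once at the end avoids repeatedly growing a data frame inside the loop.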