I can't get this for loop to run.
loopLength <- length(vector_X)
i <- 1
for (x in 1:loopLength)
vector_Y <- Frame_X$column_a == vector_X[i]
Frame_Y <- Frame_X[Vector_Y,]
Frame_A <- Frame_Y$column_b == vector_X[i]
Frame_Z <- Frame_Y[Frame_A,]
Vector_T <- Frame_Y$column_c == Frame_Z[1,2]
Frame_Z2 <- Frame_Y[Vector_T,]
returnSum1[i] <- sum(Frame_Z2$column_d)
Frame_Z3 <- Frame_Y[!(Frame_Z1),]
returnSum2[i] <- sum(Frame_X3$column_d)
I can run the code block on its own by replacing i with an integer (it only runs from 1 to 20), and when I cross-check the results against the database they are correct. However, I can't get it to iterate.
I think I'm missing something glaring about setting up a loop, but I've looked and can't find it.
It doesn't work when I run it as for (i in 1:20) either.
Adding or removing braces around the code block doesn't help either.
The variable you defined in your for loop is named x, not i. If that isn't it, then the error might come from the fact that if Frame_Z happens to have 0 rows, then Frame_Z[1,2] doesn't exist! I think that step in particular is not very clear. I could help more if you posted an example data.frame and said what you want to do. Also, your code would be easier to read if you used fewer steps and didn't give index vectors names that start with Frame (as in Frame_A and Frame_Z1). Finally, I think using dplyr would be easier. Something like:
library(dplyr)
loopLength <- length(vector_X)
for(i in 1:loopLength){
xval <- vector_X[i]
Frame_Z <- Frame_X %>%
filter(column_a == xval, column_b == xval)
...
}
I can't post more because I don't quite get what you are trying to do though.
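For what it's worth, here is a minimal sketch of just the loop mechanics, not of the full filtering logic. The data values are invented and only the names mirror the question. It shows the three things that matter here: the loop variable matches the index used in the body, the body is wrapped in braces, and the result vector is allocated before the loop. The check for an empty subset is my assumption about how you might want to handle "no match".

```r
# Hypothetical toy data: column names mirror the question, values are invented.
Frame_X <- data.frame(column_a = c(1, 1, 2, 3), column_d = c(10, 20, 30, 40))
vector_X <- c(1, 2, 5)

returnSum1 <- numeric(length(vector_X))    # allocate the result before the loop
for (i in seq_along(vector_X)) {           # loop variable is i, and the body uses i
  Frame_Y <- Frame_X[Frame_X$column_a == vector_X[i], ]
  if (nrow(Frame_Y) == 0) {                # nothing matched: record NA and move on
    returnSum1[i] <- NA
    next
  }
  returnSum1[i] <- sum(Frame_Y$column_d)
}
returnSum1
```

Without the braces, only the first statement after for (...) is repeated, and the remaining lines run once after the loop has finished.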
I am working on some code, but one step is just super slow. Basically, I need to compare two columns and, wherever their values are the same in a given row, mark a 1 in a third column, as in the code below:
#FLAG_REPETIDOS
df1$FLAG_REPETIDOS <- ""
j <-1
for (j in 1:nrow(df1)) {
df1$FLAG_REPETIDOS[[j]] <- ifelse(df1$DATO[[j]]==df1$DATO_ANT[[j]], 1, df1$FLAG_REPETIDOS[[j]])
df1$FLAG_REPETIDOS[[j]] <- ifelse(is.na(df1$FLAG_REPETIDOS[[j]])==TRUE, "", df1$FLAG_REPETIDOS[[j]])
x <- j/100
if ((x == round(x))==TRUE){
print(paste(j, "/", nrow(df1)))
}
}
print(paste("Check 11:", Sys.time(), sep=" "))
Some more information: I am using a data.table, not a data frame. My computer is not the best one, with only 8 GB of RAM, and the data has about 1M rows. By my estimate, this step alone would take around 72 hours to finish, which is unreasonable.
Is my code doing something that could be done more easily and faster? Is there any way to optimize it? I am new to R, so I don't know much about optimization.
Thanks in advance
I already changed from a data frame to a data.table; I researched optimization on Google, and that was one of the things I could try.
The way to make R code go fast is to vectorize your code.
Assuming df is a dataframe, you could probably replace all your included code with something like:
library(dplyr)
df %>%
  mutate(
    FLAG_REPETIDOS = case_when(
      is.na(DATO) | is.na(DATO_ANT) ~ "",
      DATO == DATO_ANT ~ "1",  # every branch must return the same type (character)
      TRUE ~ ""
    )
  )
However, I'm not able to check since you did not include any data with your question.
Your loop is equivalent to this much simpler and faster code.
df1$FLAG_REPETIDOS <- ""
df1$FLAG_REPETIDOS[which(df1$DATO == df1$DATO_ANT)] <- "1"
Note that which() drops the positions where the comparison is NA, so the index in the second line never contains NA values.
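A small sketch of how which() behaves when either column contains NA. The values below are invented; only the column names come from the question.

```r
# Invented example values; only the column names come from the question.
df1 <- data.frame(DATO = c(1, 2, NA, 4), DATO_ANT = c(1, 3, 3, NA))

same <- df1$DATO == df1$DATO_ANT   # TRUE FALSE NA NA
idx <- which(same)                 # which() keeps only the positions that are TRUE

df1$FLAG_REPETIDOS <- ""           # default for every row
df1$FLAG_REPETIDOS[idx] <- "1"     # one vectorized assignment, no loop
df1
```

Only the first row is flagged; the rows with an NA on either side keep the empty string.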
Hard to know without sample data, but this should work using data.table
library(data.table)
dt <- data.table(x=c(1,3,5,7,9), y=c(1,2,5,6,7)) # example
dt[, z:='']
dt[x==y, z:='1']
I have never used for loops before and I would like to use it for my data. However, I still don't know how to use it properly. Could anyone tell me how to use for loops correctly?
For item 1 to 9
the results I wanted to get
real<lower=0>l1_0+l1_11
real<lower=0>l2_0+l2_11
real<lower=0>l3_0+l3_11
..
real<lower=0>l9_0+l9_11
For item 10 to 18
real<lower=0>l10_0+l10_12
real<lower=0>l11_0+l11_12
real<lower=0>l12_0+l12_12
..
real<lower=18>l18_0+l18_12
What I tried to do..
for(i in 1:9){
i=l[i]"_0"+l[i]"_11"
print(paste("real<lower=0>",i))
}
for (i in 1:9){
i<-paste('l',i,'_0',sep='')
print(paste("real<lower=0>",i)
}
I am assuming you have no background in programming and just want to know how to use a for loop, so I have created a very simple data frame and will do something easy with it.
I want to have the sum of each row in the data-frame (luckily we also have the apply family to do this simply).
df <- data.frame(x=c(1,4,2,6,7,1,8,9,1),
y=c(4,7,2,8,9,1,9,2,8))
This is the example shown everywhere, which is highly unsatisfactory.
for(i in 1:10){
print(i)
}
This loop only prints the sum of each row.
for(i in 1:nrow(df)){
print(df$x[i]+df$y[i])
}
This is the part that is often explained horribly everywhere (I don't get why; perhaps I just used the wrong search terms?). Fortunately, there was a good example here on Stack Exchange that showed me how, so the credit goes to someone else. This part is fairly easy, but for someone with no background in modeling, R, or any programming whatsoever, it can be a pain in the ass to figure out. To make a for loop and store the results, you NEED to create an object that can store the data of the loop.
Here is a simple for loop storing the results in a data frame.
loopdf <- as.data.frame(matrix(ncol = 1, nrow = 0))
for(i in 1:nrow(df)){
loopdf[i,] <- df$x[i]+df$y[i]
}
loopdf
Here is a simple for loop storing the results in a list.
looplist <- list()
for(i in 1:nrow(df)){
looplist[[i]] <- df$x[i]+df$y[i]
}
do.call(rbind, looplist)
Here is a loop concatenating the results in an atomic vector.
loopvec <- NULL
for(i in 1:nrow(df)){
loopvec <- c(loopvec, df$x[i]+df$y[i])
}
loopvec
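As a variation on the loop above (a sketch, not the only way to do it): because the number of rows is known in advance, the result vector can be created at its final length and filled by position instead of being grown with c(). The name loopvec2 is only used here to keep it apart from the vector above.

```r
df <- data.frame(x = c(1, 4, 2, 6, 7, 1, 8, 9, 1),
                 y = c(4, 7, 2, 8, 9, 1, 9, 2, 8))

loopvec2 <- numeric(nrow(df))        # one slot per row, created up front
for (i in seq_len(nrow(df))) {       # seq_len() also behaves for a 0-row data frame
  loopvec2[i] <- df$x[i] + df$y[i]
}
loopvec2
```

seq_len(nrow(df)) is used instead of 1:nrow(df) because 1:0 would count down (1, 0) and run the body twice on an empty data frame.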
Here is the apply approach (two versions).
apply(df, 1, sum)
apply(df, 1, function(x) sum(x))
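Applied to the strings asked for in the question, the same "create a container, then fill it inside the loop" pattern could look like the sketch below. I am assuming the suffix is 11 for items 1 to 9 and 12 for items 10 to 18, and that lower=0 is wanted on every line (the question shows lower=18 on the last line, which I take to be a typo). The name decl is made up.

```r
decl <- character(18)                     # container for the 18 lines
for (i in 1:18) {
  suffix <- if (i <= 9) 11 else 12        # items 1-9 use _11, items 10-18 use _12
  decl[i] <- paste0("real<lower=0>l", i, "_0+l", i, "_", suffix)
}
print(decl)
```

paste0() is paste() with sep = "", so no spaces are inserted between the pieces.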
These are the steps I am following:
subset two matrices by a range of proportions (e.g. 80-85, 85-90)
run two separate distance measure functions for each subset of data
run a mantel using the distance matrix produced by each subset of data
produce a list of each test result, each with a unique name
produce a data frame of all the mantel-r results and their corresponding p-values
I have written code that will complete this process, but I feel there is a more elegant and better way to do so. What I have works, but I would like to improve my R-skills, so any advice/ideas would be welcomed. I am not new to R, but I am far from being where I would like to be.
Also, my code produces unnecessary objects (i.e. SS, HB, sp.dis, epa.dis, and nam in the code below). They are not a big deal, but it would be nice to have code that doesn’t produce this side effect. A reproducible example (modeled after how my data is formatted) and the packages I’m using are below:
library(tidyverse)
library(betapart)
library(vegan)
set.seed(2)
spe2<-data.frame(replicate(10,sample(0:100,100,replace=T)))
spe2$Ag<-round(runif(100, min=0.4, max=1),2)
epa2<-data.frame(replicate(3,sample(1:20,100,replace=T)))
epa2$Ag<-spe2$Ag
Mantel.List<-list()
List.names <- list()
for(i in seq(from=0.85, to=0.95,by=0.05 )){
SS<-spe2 %>%
filter(Ag >= i & Ag < i+0.05)
HB<-epa2 %>%
filter(Ag >= i & Ag < i+0.05)
sp.dis<-beta.pair(decostand(SS[,1:ncol(SS)-1],'pa'))
epa.dis<-vegdist(HB[,1:ncol(HB)-1],
method = 'euclidean')
mnt<-mantel(sp.dis$beta.sor,epa.dis)
Mantel.List[[length(Mantel.List)+1]] <- mnt
nam<-paste('M.tt',i*100,sep='')
List.names[[length(List.names)+1]] <- nam
}
names(Mantel.List)<-List.names
Mantel.Results<-cbind(sapply(Mantel.List, function(x) x$statistic),sapply(Mantel.List, function(x) x$signif))
colnames(Mantel.Results)<-c('Mantel-r', 'p-value')
Mantel.Results
Thank you!
I've done two things to try to make this code a little better. First, I eliminated all the unnecessary objects by using the data.table package, which is usually an efficient way to handle data.frames because it can filter rows and select columns in a single call, so no intermediate subsets need to be stored.
Secondly, instead of using a for loop, I'm using an apply function. Note the superassignment operator <<- inside doit(), which modifies the objects defined in the enclosing function doitAll() rather than creating local copies.
Here's my suggestion:
library(data.table)
library(betapart) # beta.pair()
library(vegan)    # decostand(), vegdist(), mantel()
set.seed(2)
spe2<-as.data.table(data.frame(replicate(10,sample(0:100,100,replace=T))))
spe2$Ag<-round(runif(100, min=0.4, max=1),2)
epa2<-as.data.table(data.frame(replicate(3,sample(1:20,100,replace=T))))
epa2$Ag<-spe2$Ag
doitAll=function(dt1,dt2){
Mantel.List<-list()
List.names <- list()
doit=function(x,dt1,dt2){
mnt<-mantel(beta.pair(decostand(dt1[Ag >= x & Ag < x+0.05,1:(ncol(dt1)-1),with=F],'pa'))$beta.sor,
vegdist(dt2[Ag >= x & Ag < x+0.05,1:(ncol(dt2)-1),with=F],
method = 'euclidean'))
Mantel.List[[length(Mantel.List)+1]] <<- mnt
nam<-paste('M.tt',x*100,sep='')
List.names[[length(List.names)+1]] <<- nam
}
sapply(seq(from=0.85, to=0.95,by=0.05 ),doit,dt1=dt1,dt2=dt2)
names(Mantel.List)<-List.names
Mantel.Results<-cbind(sapply(Mantel.List, function(x) x$statistic),sapply(Mantel.List, function(x) x$signif))
colnames(Mantel.Results)<-c('Mantel-r', 'p-value')
return(Mantel.Results)
}
doitAll(dt1=spe2,dt2=epa2)
It might be a little hard to read, but it should be more efficient.
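If you would rather avoid <<- altogether, one option is to let lapply() return one result per interval and name the list afterwards. The sketch below only illustrates that pattern in base R: the data are invented, and cor.test() is a stand-in for the mantel() call so that it runs without vegan or betapart.

```r
set.seed(2)
# Invented stand-in data: two numeric columns plus the Ag proportion column.
dat <- data.frame(a = rnorm(120), b = rnorm(120),
                  Ag = rep((40:99) / 100, 2))
breaks <- seq(from = 0.85, to = 0.95, by = 0.05)

# One list element per interval; nothing is written outside the function.
results <- lapply(breaks, function(lo) {
  sub <- dat[dat$Ag >= lo & dat$Ag < lo + 0.05, ]
  cor.test(sub$a, sub$b)                 # stand-in for the mantel() call
})
names(results) <- paste0("M.tt", breaks * 100)

Results <- cbind("r" = sapply(results, function(x) unname(x$estimate)),
                 "p-value" = sapply(results, function(x) x$p.value))
Results
```

The intermediate objects live only inside the anonymous function, so nothing like SS, HB or nam is left behind in the workspace.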
I am currently working on a large dataset (~1.5M of entries) using R - a language I am not yet completely familiar with.
Basically, what I try to do is the following :
I want to check what happens during a time interval after "Start".
"Start" represents a few temporal values within every "Trial", and "Trial" represents all of the trials recorded for one "Reference".
So for each Reference, I want to check all Trials and see what happens after "Start" during each Trial.
It's not so important if what I'm trying to do is still obscure; the point is that I want to check every row in my data frame.
My instinctive (understand, R-noob-ish) way of programming this function led me to a piece of code which I know is far from being optimized, and takes a LOT of time to run.
My_Function <- function(DataFrame){
counts <- data.frame()
for (reference in DataFrame$Ref){
ref_tested <- subset(DataFrame, Ref == reference)
ref_count <- data.frame()
for (trial in ref_tested$Trial){
trial_tested <- subset(ref_tested, Trial == trial)
for (timing in trial_tested$Start){
interesting <- subset(DataFrame, Start > timing & Start <= timing + some_time & Trial == trial)
ref_count <- rbind(ref_count,as.data.frame(table(interesting$ele)))
}
}
temp <- aggregate(Freq~Var1,data=ref_count,FUN=sum);
counts <- rbind (counts, temp)
}
return(counts)
}
Here, as.data.frame(table(interesting$ele)) can have different lengths, and thus so does ref_count.
I failed to find a way to grow my dataframe without using rbind, but I also know that given the size of my output it is not time-efficient at all.
Also, I have already programmed in other languages such as Python or C++ (a long time ago) and also know that having 3 consecutive for loops usually means that you're doing it wrong. But then again, I did not find a way to avoid doing that in this particular case.
So, do you have any advice on how to use R, or one of its package, to avoid such a situation?
Thank you in advance,
K.
EDIT :
Thank you for the initial advice.
I tried the 'plyr' package and was able to reduce the size of my code chunk; it does what I expect and is more understandable. Plus, I was able to produce some example data for reproducibility. See:
#Example Input
library(plyr) # provides ddply()
DF <- data.frame(c(sample(1:400,500000, replace = TRUE)),c(sample(1:25,500000, replace = TRUE)), rnorm(n=500000, m=1, sd=1) )
colnames(DF)<-c("Trial","Ref","Start")
DF$rn<-rownames(DF)
tempDF <- DF[sample(nrow(DF), 100), ] #For testing purposes
Test<- ddply(.data = tempDF, "rn", function(x){
interesting <- subset(DF,
Trial == x$Trial &
Start > x$Start &
Start < x$Start + some_time )
interesting$Elec <- x$Ref
return(interesting)
})
This is nice, but I still feel like it is not the way to go: in this example we only browse 100 observations, which takes ~4 s (measured with system.time()), so scanning all 500000 observations of DF would take more than 5 hours.
I have checked data.table but I am still trying to understand how to use it for now.
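One base R idea that avoids a subset() call per row, sketched under a simplifying assumption: that only the number of later events inside the window is needed for each row, not the rows themselves. Within a trial, sorting the start times once and calling findInterval() gives those counts for all rows at the same time. The helper name count_following and the argument window (playing the role of some_time) are made up for this sketch, and the toy data are invented.

```r
# For each start time, count the other starts that fall in (start, start + window].
count_following <- function(start, window) {
  o <- order(start)
  s <- start[o]
  upper <- findInterval(s + window, s)   # how many starts are <= s[i] + window
  lower <- findInterval(s, s)            # how many starts are <= s[i]
  n <- integer(length(start))
  n[o] <- upper - lower                  # put the counts back in the original row order
  n
}

# Invented toy data with the question's column names.
DF <- data.frame(Trial = c(1, 1, 1, 2, 2), Start = c(1, 0, 0.5, 0, 5))

# ave() applies the helper separately within each Trial.
DF$n_after <- ave(DF$Start, DF$Trial,
                  FUN = function(s) count_following(s, window = 1))
DF
```

This uses the closed upper bound from the first version of the function (Start <= timing + some_time). Getting the per-element table from the question would need an additional grouping step on top of this.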
This may seem like a novice question, but I'm struggling to understand why this doesn't work.
answer = c()
for(i in 1:8){
answer = c()
knn.pred <- knn(data.frame(train_week$Lag2), data.frame(test_week$Lag2), train_week$Direction, k=i)
test <- mean(knn.pred == test_week$Direction)
append(answer, test)
}
I want the results for k = 1 to 8 in a vector called answer. It should loop 8 times, so ideally my output would be a vector with 8 numbers. When I run the for loop, I only get the final answer, meaning it isn't appending. Any help would be appreciated. Sorry for the novice question, I'm really trying to learn R.
First of all, please include a reproducible example in your question next time. See How to make a great R reproducible example?.
Second, you set answer to c() in the first line of your loop, so this happens in each iteration.
Third, append, just like almost all functions in R, does not modify its argument in place, but it returns a new object. So the correct code is:
answer = c()
for (i in 1:8){
knn.pred <- knn(data.frame(train_week$Lag2), data.frame(test_week$Lag2),
train_week$Direction, k = i)
test <- mean(knn.pred == test_week$Direction)
answer <- append(answer, test)
}
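To see the third point in isolation, here is a tiny sketch that does not need knn() at all; the numbers are arbitrary.

```r
answer <- c()
append(answer, 0.5)             # returns a new vector, but the result is thrown away
length(answer)                  # answer itself is unchanged

answer <- append(answer, 0.5)   # assigning the result is what makes it stick
answer <- append(answer, 0.7)
length(answer)
```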
While this wasn't the question, I can't help noting that growing a vector with append() inside a loop is a very inefficient way of creating vectors and lists. It is an anti-pattern. If you know the length of the result vector, then allocate it and set its elements. For example:
answer = numeric(8)
for (i in 1:8){
knn.pred <- knn(data.frame(train_week$Lag2), data.frame(test_week$Lag2),
train_week$Direction, k = i)
test <- mean(knn.pred == test_week$Direction)
answer[i] <- test
}
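Another way to write the same idea, as a sketch: vapply() allocates the result for you and checks that every iteration returns exactly one number. The function accuracy_for_k below is a hypothetical stand-in for the knn() accuracy computation, so that the example runs without the class package or your data.

```r
# Hypothetical stand-in for: mean(knn(..., k = k) == test_week$Direction)
accuracy_for_k <- function(k) {
  1 / k
}

# numeric(1) declares that each call must return a single number.
answer <- vapply(1:8, accuracy_for_k, numeric(1))
answer
```

In the real code, the body of accuracy_for_k would be the two lines from inside the loop, returning test.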
You are overwriting answer inside the for loop. Try removing that line. Also, append doesn't act on its arguments directly; it returns the modified vector. So you need to assign it.
answer <- c()
for(i in 1:8){
knn.pred <- knn(data.frame(train_week$Lag2), data.frame(test_week$Lag2), train_week$Direction, k=i)
test <- mean(knn.pred == test_week$Direction)
answer <- append(answer, test)
}