Is there a cleaner way to do what I have already done? This code works, but I would like to learn a better approach from people with more experience.
Suppose we have a data frame with a participant identifier column, called indiv for this example. There are 36 subjects, numbered 1 to 36, and I want to create a new column that assigns every two consecutive individuals to a dyad, with a zero-padded running number (dyad01, dyad02, and so on).
indiv <- c(1:36)
freq <- data.frame(indiv)
freq$ID <- character(nrow(freq))
for (i in 1:nrow(freq)) {
  # assign the label directly; wrapping it in print() would also echo it on every iteration
  if (0 < freq$indiv[i] && freq$indiv[i] <= 2) {freq$ID[i] <- "dyad01"}
  else if (2 < freq$indiv[i] && freq$indiv[i] <= 4) {freq$ID[i] <- "dyad02"}
  else if (4 < freq$indiv[i] && freq$indiv[i] <= 6) {freq$ID[i] <- "dyad03"}
  else if (6 < freq$indiv[i] && freq$indiv[i] <= 8) {freq$ID[i] <- "dyad04"}
  else if (8 < freq$indiv[i] && freq$indiv[i] <= 10) {freq$ID[i] <- "dyad05"}
  else if (10 < freq$indiv[i] && freq$indiv[i] <= 12) {freq$ID[i] <- "dyad06"}
  else if (12 < freq$indiv[i] && freq$indiv[i] <= 14) {freq$ID[i] <- "dyad07"}
  else if (14 < freq$indiv[i] && freq$indiv[i] <= 16) {freq$ID[i] <- "dyad08"}
  else if (16 < freq$indiv[i] && freq$indiv[i] <= 18) {freq$ID[i] <- "dyad09"}
  else if (18 < freq$indiv[i] && freq$indiv[i] <= 20) {freq$ID[i] <- "dyad10"}
  else if (20 < freq$indiv[i] && freq$indiv[i] <= 22) {freq$ID[i] <- "dyad11"}
  else if (22 < freq$indiv[i] && freq$indiv[i] <= 24) {freq$ID[i] <- "dyad12"}
  else if (24 < freq$indiv[i] && freq$indiv[i] <= 26) {freq$ID[i] <- "dyad13"}
  else if (26 < freq$indiv[i] && freq$indiv[i] <= 28) {freq$ID[i] <- "dyad14"}
  else if (28 < freq$indiv[i] && freq$indiv[i] <= 30) {freq$ID[i] <- "dyad15"}
  else if (30 < freq$indiv[i] && freq$indiv[i] <= 32) {freq$ID[i] <- "dyad16"}
  else if (32 < freq$indiv[i] && freq$indiv[i] <= 34) {freq$ID[i] <- "dyad17"}
  else {freq$ID[i] <- "dyad18"}
}
This works and does what I expected. However, I was wondering if anyone knows a cleaner way, just for the sake of learning and improving.
Thanks in advance!
You can use sprintf(), which is vectorized and can zero-pad numbers, together with some integer division:
sprintf("dyad%02d", (indiv-1) %/% 2 + 1)
[1] "dyad01" "dyad01" "dyad02" "dyad02" "dyad03" "dyad03" "dyad04" "dyad04" "dyad05" "dyad05" "dyad06" "dyad06" "dyad07" "dyad07" "dyad08" "dyad08"
[17] "dyad09" "dyad09" "dyad10" "dyad10" "dyad11" "dyad11" "dyad12" "dyad12" "dyad13" "dyad13" "dyad14" "dyad14" "dyad15" "dyad15" "dyad16" "dyad16"
[33] "dyad17" "dyad17" "dyad18" "dyad18"
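To store the labels as the new column, the same expression can be applied to the data frame column directly. This is a minimal sketch that reuses the freq data frame from the question; pair_size is a hypothetical name introduced here only to show how the grouping could generalize to other group sizes.

```r
indiv <- 1:36
freq <- data.frame(indiv)

# Hypothetical parameter (not in the original): individuals per group, 2 for dyads
pair_size <- 2

# Integer division maps 1,2 -> 1; 3,4 -> 2; and so on.
# "%02d" zero-pads the group number to two digits.
freq$ID <- sprintf("dyad%02d", (freq$indiv - 1) %/% pair_size + 1)

head(freq, 4)
```

Because both sprintf() and %/% work on whole vectors, no loop or if/else chain is needed.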
i<-c(1:44)
diff_arbeitnehmer <- for(x in i){if(x == 44) {diff_arbeitnehmer[x] <- 0} else{diff_arbeitnehmer[x] <- 100/erwerbstaetige[x,2]*erwerbstaetige[x,4]-100/erwerbstaetige[x+1,2]*erwerbstaetige[x+1,4]}}
My data frame has 44 entries.
I am using an R script. Could someone tell me what the reason could be?
I am lost with this.
I can't run your code because I don't have your data frame, but the likely reason is that you are trying to assign a for loop to the variable diff_arbeitnehmer. A for loop in R evaluates to NULL, so that assignment cannot hold the results. I made this change and hope that it now works:
i <- c(1:44)
diff_arbeitnehmer <- c()
for (x in i) {
  if (x == 44) {
    diff_arbeitnehmer[x] <- 0
  } else {
    diff_arbeitnehmer[x] <- 100/erwerbstaetige[x,2]*erwerbstaetige[x,4]-100/erwerbstaetige[x+1,2]*erwerbstaetige[x+1,4]
  }
}
One piece of advice: check that the expression in the else branch computes what you intend. You may need to add some parentheses.
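The same differences can probably be computed without a loop at all. The sketch below assumes that columns 2 and 4 of erwerbstaetige are numeric; the data frame built here, including its column names, is a made-up stand-in because the original data is not available.

```r
# Hypothetical stand-in for the asker's data frame (only columns 2 and 4 are used)
set.seed(42)
erwerbstaetige <- data.frame(
  jahr         = 1:44,
  gesamt       = runif(44, 1000, 2000),
  sonstige     = runif(44),
  arbeitnehmer = runif(44, 500, 900)
)

# Percentage for every row at once, same formula as inside the loop
anteil <- 100 / erwerbstaetige[, 2] * erwerbstaetige[, 4]

# Row x minus row x + 1; the last row has no successor, so it gets 0
diff_arbeitnehmer <- c(-diff(anteil), 0)
```

diff() returns each element minus its predecessor, so the sign is flipped to match the loop, which subtracts the following row from the current one.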
I'm trying to save each iteration of this for loop in a vector.
for (i in 1:177) {
a <- geomean(er1$CW[1:i])
}
Basically, I have a list of 177 values and I'd like the script to find the cumulative geometric mean of the list, going one by one. Right now it only gives me the final value; it doesn't save each loop iteration as a separate value in a list or vector.
The reason your code does not work is that the object a is overwritten in each iteration. The following code, for instance, does precisely what you want:
a <- c()
for (i in 1:177) {
  a[i] <- geomean(er1$CW[1:i])
}
Alternatively, this would work as well:
for (i in 1:177) {
  if (i != 1) {
    a <- rbind(a, geomean(er1$CW[1:i]))
  }
  if (i == 1) {
    a <- geomean(er1$CW[1:i])
  }
}
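Growing a vector or matrix inside the loop works, but the result can also be allocated up front. The sketch below makes two assumptions that are not in the original: geomean() is replaced by a simple stand-in, since the question does not say which package it comes from, and cw is a short made-up vector in place of er1$CW.

```r
# Assumption: stand-in for geomean(); valid for strictly positive values only
geomean <- function(x) exp(mean(log(x)))

# Hypothetical data in place of er1$CW
cw <- c(2, 8, 4, 1, 16)

# Preallocate the result, then fill it in place
a <- numeric(length(cw))
for (i in seq_along(cw)) {
  a[i] <- geomean(cw[1:i])
}

# The same computation without an explicit loop
a2 <- vapply(seq_along(cw), function(i) geomean(cw[1:i]), numeric(1))
```

seq_along(cw) avoids hard-coding the length 177, so the code keeps working if the data changes.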
I started down a similar path with rbind as @nate_edwinton did, but couldn't figure it out. I did, however, come up with something effective. Hmmmm, geo_mean. Cool. The last line coerces the result back to a list.
MyNums <- data.frame(x=(1:177))
a <- data.frame(x=integer())
for(i in 1:177){
a[i,1] <- geomean(MyNums$x[1:i])
}
a<-as.list(a)
You can try defining a variable to hold the results first:
b <- c()
for (i in 1:177) {
a <- geomean(er1$CW[1:i])
b <- c(b,a)
}
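For this particular problem the loop can probably be dropped entirely, because the geometric mean of the first i values equals the exponential of the running mean of their logarithms. This is a sketch under the assumption that all values are strictly positive; cw is a made-up vector standing in for er1$CW.

```r
# Hypothetical data in place of er1$CW (must be strictly positive)
cw <- c(2, 8, 4, 1, 16)

# Running sum of logs divided by the running count, then back-transformed
cum_geomean <- exp(cumsum(log(cw)) / seq_along(cw))
```

This does a single pass over the data, whereas recomputing the geometric mean of cw[1:i] for every i repeats most of the work on each iteration.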
UPDATE
Thanks to the help and suggestions of @CarlWitthoft, my code was simplified to this:
model <- unlist(sapply(1:length(model.list),
function(i) ifelse(length(model.list[[i]][model.lookup[[i]]] == "") == 0,
NA, model.list[[i]][model.lookup[[i]]])))
ORIGINAL POST
I recently read an article saying that vectorizing operations in R is better practice than using for loops. I have a piece of code built around a big for loop and I am trying to turn it into a vectorized operation, but I cannot work out how. Could someone help me? Is it possible, or do I need to change my approach? My code works fine with the for loop, but I want to try the other way.
model <- c(0)
price <- c(0)
size <- c(0)
reviews <- c(0)
for(i in 1:length(model.list)) {
if(length(model.list[[i]][model.lookup[[i]]] == "") == 0) {
model[i] <- NA
} else {
model[i] <- model.list[[i]][model.lookup[[i]]]
}
if(length(model.list[[i]][price.lookup[[i]]] == "") == 0) {
price[i] <- NA
} else {
price[i] <- model.list[[i]][price.lookup[[i]]]
}
if(length(model.list[[i]][reviews.lookup[[i]]] == "") == 0) {
reviews[i] <- NA
} else {
reviews[i] <- model.list[[i]][reviews.lookup[[i]]]
}
size[i] <- product.link[[i]][size.lookup[[i]]]
}
Basically, model.list is a list of vectors, and from each vector I want to extract one particular element. The position of that element is given by model.lookup, price.lookup and reviews.lookup, which contain logical vectors with just one TRUE value each. On every cycle of the for loop, the extracted values are stored in the variables model, price, size and reviews.
Could this be changed to a vector operation?
In general, try to avoid if when it is not needed. I think your desired output can be built as follows:
model <- unlist(sapply(1:length(model.list), function(i) model.list[[i]][model.lookup[[i]]]))
model[model=='']<-NA
And the same for your other variables. This assumes that every model.lookup[[i]] selects exactly one element. If it doesn't, you won't be able to write the output to a single element of model in the first place.
I would also note that you are overcoding: e.g. x <- 0 is better than x <- c(0), and there is no need to evaluate length on a single item.
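If some lookups select nothing, unlist() silently drops those empty results and the output ends up shorter than model.list. One way to keep one value per list element is sketched below. The data and the helper name extract_one are hypothetical and only illustrate the idea; they are not taken from the original code.

```r
# Hypothetical data: the second lookup has no TRUE, so nothing is selected
model.list   <- list(c("A1", "9.99"), c("B2", "19.99"), c("C3", "4.50"))
model.lookup <- list(c(TRUE, FALSE), c(FALSE, FALSE), c(TRUE, FALSE))

# Hypothetical helper: return the selected element, or NA when the selection is empty
extract_one <- function(values, lookup) {
  hit <- values[lookup]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# mapply walks both lists in parallel and simplifies the result to a character vector
model <- mapply(extract_one, model.list, model.lookup)
```

The same helper could be reused with price.lookup and reviews.lookup, which keeps the NA handling in one place.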
I am backtesting an investment strategy using R. I have the script below:
set.seed(1)
output.df <- data.frame(action=sample(c("initial_buy","sell","buy"),
10000,replace=TRUE),stringsAsFactors=FALSE)
output.df[,"uid"] <- 1:nrow(output.df)
cutrow.fx <- function(output.df) {
loop.del <- 2
while (loop.del <= nrow(output.df)) {
if ((output.df[loop.del,"action"]=="initial_buy" &
output.df[loop.del-1,"action"]=="initial_buy")|
(output.df[loop.del,"action"]=="sell" &
output.df[loop.del-1,"action"]=="sell")|
(output.df[loop.del,"action"]=="buy" &
output.df[loop.del-1,"action"]=="sell")|
(output.df[loop.del,"action"]=="initial_buy" &
output.df[loop.del-1,"action"]=="buy")){
output.df <- output.df[-loop.del,]
} else {
loop.del <- loop.del + 1
}
}
output.df<<-output.df
}
print(system.time(cutrow.fx(output.df=output.df)))
The strategy determines: 1) when to start buying a stock; 2) when to make an additional contribution to the stock; and 3) when to sell all of the stock. I have a data frame with the price of a stock for the past 10 years. I wrote 3 scripts to indicate on which dates I should buy or sell the stock, then combined the 3 results and ordered them.
I need to remove the "impossible actions", e.g. I cannot sell the same stock twice without buying new units in between, so I used the script above to delete them. But the loop is rather slow.
Any suggestions for speeding it up?
Update 01
I have updated cutrow.fx to the following, but it fails:
cutrow.fx <- function(output.df) {
output.df[,"action_pre"] <- "NIL"
output.df[2:nrow(output.df),"action_pre"] <- output.df[1:(nrow(output.df)-1),"action"]
while (any(output.df[,"action_pre"]=="initial_buy" & output.df[,"action"]=="initial_buy")|
any(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="sell")|
any(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="buy")|
any(output.df[,"action_pre"]=="buy" & output.df[,"action"]=="initial_buy")) {
output.df <- output.df[!(output.df[,"action_pre"]=="initial_buy" & output.df[,"action"]=="initial_buy"),]
output.df <- output.df[!(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="sell"),]
output.df <- output.df[!(output.df[,"action_pre"]=="sell" & output.df[,"action"]=="buy"),]
output.df <- output.df[!(output.df[,"action_pre"]=="buy" & output.df[,"action"]=="initial_buy"),]
output.df[,"action_pre"] <- "NIL"
output.df[2:nrow(output.df),"action_pre"] <- output.df[1:(nrow(output.df)-1),"action"]
}
output.df[,"action_pre"] <- NULL
output.df<<-output.df
}
I used the vector comparison suggested in John's answer (I am not sure I understood exactly what he meant) and wrapped it in a while loop to repeat it. But the output is not the same.
Is a loop inevitable here?
It looks like all you're doing is checking the last action. This doesn't require a loop at all. All you have to do is shift the vector and do straight vector comparisons. Here's an artificial example.
x <- sample(1:11)
buysell <- sample(c('buy', 'sell'), 11, replace = TRUE)
So I have 11 samples, x, and whether I bought or sold each of them. I want to make a boolean that shows whether I bought or sold the previous sample.
bought <- c(NA, buysell[1:10])
which( bought == 'buy' )
Examine the x and buysell variables and you'll see that the results are the indices of the x items where a buy was made on the prior item.
Also, you might want to check out the function %in%.
I tried to do something clever with vectorization, but failed because previous iterations of the loop can change the data relationships for later iterations through. So I couldn't lag the data by a set amount and compare lagged to real results.
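A three-row example shows the dependency. This is only an illustrative sketch with made-up data: comparing every row with its original predecessor removes more rows than comparing it with the last row that was actually kept.

```r
action <- c("sell", "buy", "initial_buy")

# One-shot vectorized check against the ORIGINAL previous row
prev <- c(NA, action[-length(action)])
bad <- (prev %in% c("initial_buy", "buy") & action == "initial_buy") |
  (prev %in% "sell" & action %in% c("buy", "sell"))
one_shot <- action[!bad]

# Sequential check against the last row that was KEPT
keep <- 1
for (i in seq_along(action)[-1]) {
  p <- action[keep[length(keep)]]
  cur <- action[i]
  is_bad <- (p %in% c("initial_buy", "buy") && cur == "initial_buy") ||
    (p == "sell" && cur %in% c("buy", "sell"))
  if (!is_bad) keep <- c(keep, i)
}
sequential <- action[keep]

one_shot    # "sell"
sequential  # "sell" "initial_buy"
```

Once the "buy" row is dropped, "initial_buy" follows "sell" and becomes valid, which the one-shot comparison cannot know.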
What I can do is minimize the copying involved. R has copy-on-modify semantics, so a statement like output.df <- output.df[-loop.del,] creates a new copy of the whole data frame for each row that is deleted. Instead of changing (and copying) the data frame, I made changes to a logical vector. Some other attempts at speed-up include using the scalar, short-circuiting and (&&) instead of the elementwise and (&), using %in% to make fewer comparisons, and minimizing accesses on output.df.
To compare the two functions I slightly modified the OP's solution so that the original data frame is not overwritten. It looks like this can improve speed by a factor of 10 or more, but it still takes a noticeable amount of time (about 0.5 sec). I'd love to see any faster solutions.
OP's solution (slightly modified to return the value instead of using a global assignment)
cutrow.fx <- function(output.df) {
loop.del <- 2
while (loop.del <= nrow(output.df)) {
if ((output.df[loop.del,"action"]=="initial_buy" &
output.df[loop.del-1,"action"]=="initial_buy")|
(output.df[loop.del,"action"]=="sell" &
output.df[loop.del-1,"action"]=="sell")|
(output.df[loop.del,"action"]=="buy" &
output.df[loop.del-1,"action"]=="sell")|
(output.df[loop.del,"action"]=="initial_buy" &
output.df[loop.del-1,"action"]=="buy")){
output.df <- output.df[-loop.del,]
} else {
loop.del <- loop.del + 1
}
}
return(output.df)
}
ans1 <- cutrow.fx(output.df)
My solution
cutrow.fx2 <- function(output.df) {
  ##edge case if output.df has too few rows
  if (nrow(output.df) < 2) return(output.df)
  ##logical vector of indices of rows to keep
  idx <- c(TRUE,logical(nrow(output.df)-1))
  ##keeps track of the previous row
  prev.row <- 1
  prev.act <- output.df[prev.row,"action"]
  for (current.row in seq_len(nrow(output.df))[-1]) {
    ##access output.df only once per iteration
    current.act <- output.df[current.row,"action"]
    ##checks to see if current row is bad
    ##if so, continue to next row and leave previous row as is
    if ( (prev.act %in% c("initial_buy","buy")) &&
         (current.act == "initial_buy") ) {
      next
    } else if ( (prev.act == "sell") &&
                (current.act %in% c("buy","sell")) ) {
      next
    }
    ##if current row is good, mark it in idx and update previous row
    idx[current.row] <- TRUE
    prev.row <- current.row
    prev.act <- current.act
  }
  return(output.df[idx,])
}
ans2 <- cutrow.fx2(output.df)
Check that the answers are the same
identical(ans1,ans2)
## [1] TRUE
#benchmarking
require(microbenchmark)
mb <- microbenchmark(
ans1=cutrow.fx(output.df)
,ans2=cutrow.fx2(output.df),times=50)
print(mb)
# Unit: milliseconds
# expr min lq median uq max
# 1 ans1 9630.1671 9743.1102 9967.6442 10264.7000 12396.5822
# 2 ans2 481.8821 491.6699 500.6126 544.4222 645.9658
plot(mb)
require(ggplot2)
ggplot2::qplot(y=time, data=mb, colour=expr) + ggplot2::scale_y_log10()
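One further variation that may be worth trying is to pull the action column out into a plain character vector once, so the loop never indexes the data frame. This is an untimed sketch, not a measured result; the function name cutrow_vec is hypothetical, and the small data frame is generated the same way as in the question.

```r
# Hypothetical variant: same keep/skip rules, but the loop reads a plain vector
cutrow_vec <- function(output.df) {
  act <- output.df[["action"]]
  n <- length(act)
  if (n < 2) return(output.df)
  keep <- logical(n)
  keep[1] <- TRUE
  prev <- act[1]
  for (i in 2:n) {
    cur <- act[i]
    bad <- (prev %in% c("initial_buy", "buy") && cur == "initial_buy") ||
      (prev == "sell" && cur %in% c("buy", "sell"))
    if (!bad) {
      # row is valid: keep it and make it the new reference row
      keep[i] <- TRUE
      prev <- cur
    }
  }
  output.df[keep, , drop = FALSE]
}

# Small example built like the data in the question
set.seed(1)
small.df <- data.frame(action = sample(c("initial_buy", "sell", "buy"),
                                       300, replace = TRUE),
                       stringsAsFactors = FALSE)
small.df[, "uid"] <- 1:nrow(small.df)
res <- cutrow_vec(small.df)
```

Whether this is faster on a given machine would need to be checked with the same benchmark as above.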
Here is some code that is a bit simpler and much faster. It does not loop over all elements, but only loops between matches. It matches forward rather than backward.
First, modify your cutrow.fx function: remove the <<-output.df on the last line and simply return the result. Then you can run the two functions and compare the results.
cutrow.fx1 <- function(d) {
  len <- length(d[,1])
  o <- logical(len)
  f <- function(a) {
    switch(a,
           initial_buy=c('buy', 'sell'),
           buy=c('buy', 'sell'),
           sell='initial_buy'
    )
  }
  cur <- 1
  o[cur] <- TRUE
  while (cur < len) {
    nxt <- match(f(d[cur,1]), d[(cur+1):len,1])
    if (all(is.na(nxt))) {
      break
    } else {
      cur <- cur + min(nxt, na.rm=TRUE)
      o[cur] <- TRUE
    }
  }
  d[o,]
}
Show that the results are correct:
identical(cutrow.fx1(output.df), cutrow.fx(output.df))
## [1] TRUE
And it is quite a bit faster. This is due to the partial vectorization of the problem, using match to find the next row to keep, rather than iterating to discard rows.
print(system.time(cutrow.fx(output.df)))
## user system elapsed
## 5.688 0.000 5.720
print(system.time(cutrow.fx1(output.df)))
## user system elapsed
## 1.050 0.000 1.056
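To see how match() picks the next row to keep, here is a tiny illustration with made-up values. The variable names remaining and allowed are hypothetical; they stand for d[(cur+1):len,1] and f(d[cur,1]) in the function above.

```r
# Rows that come after the current one (hypothetical values)
remaining <- c("initial_buy", "initial_buy", "sell", "buy")

# Actions that may legally follow a "buy" or an "initial_buy"
allowed <- c("buy", "sell")

# First position of each allowed action in the remaining rows
nxt <- match(allowed, remaining)
nxt
# [1] 4 3

# The nearest valid row is the smallest of those positions
offset <- min(nxt, na.rm = TRUE)
offset
# [1] 3
```

The two invalid "initial_buy" rows are skipped in a single step. If none of the allowed actions occurs in the remaining rows, every entry of nxt is NA and the loop stops.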