I am currently working with 500,000 observations of data and I have a step in my R code that does the following:
attach(ds)
weight <- rep(NA,length(date))
sales_base <- rep(NA,length(date))
cumsales <- rep(NA,length(date))
weight[dup_no!=0 & month(date)==7] = lag_sales[dup_no!=0 & month(date)==7]
sales_base[dup_no!=0 & month(date)==7] = cumsales[dup_no!=0 & month(date)==7]
cumsales [dup_no!=0 & month(date)==7] = 1+ disc[dup_no!=0 & month(date)==7]
for(i in 2:length(permno))
{
if(dup_no[i]!=0 & month(date[i])!=6 & !is.na(lag_sales[i]) & (lag_sales[i])>0)
{
cumsales[i] = cumsales[i-1]*(1+disc[i])
weight[i] = cumsales[i]*sales_base[i-1]
}
if(dup_no[i]!=0 & month(date[i])!=6 & (lag_sales[i])<=0)
{
cumsales[i] = cumsales[i-1]*(1+disc[i])
weight_port[i] = NA
}
}
(The formulae might not make sense as I haven't shown you the entire code.)
The first three lines create three columns filled with NA. The next three lines fill in the cells of those columns where a set of conditions is met. The for loop then tries to fill in the remaining empty cells by calculating new values from the previously filled cells (obtained from lines 5, 6, 7).
The for loop takes a lot of time because of the data size, and I need to optimize this code because it will run on a much larger dataset. Is there any alternative that can be used instead of this for loop?
Thanks in advance!
Explicit element-by-element loops are generally slow in R, so it is best to avoid them wherever a vectorized alternative exists. If you search for "vectorization" you will find plenty of threads and tutorials discussing the topic.
Just a brief example with your code:
index <- dup_no!=0 & month(date)!=6 & !is.na(lag_sales) & (lag_sales)>0
cumsales[index] <- cumsales[which(index)-1]*(1+disc[index])
weight[index] <- cumsales[index]*sales_base[which(index)-1]
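This can replace the first part of your for loop wherever the previous row's values are already filled in. Where cumsales[i-1] is itself computed inside the loop, the values carry forward from row to row, and a cumulative function such as cumprod() is needed instead.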
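As a minimal sketch of the carry-forward case: a running product of the form cumsales[i] = cumsales[i-1]*(1+disc[i]) can be written with cumprod() inside groups. The `reset` vector below is a hypothetical stand-in for the rows where the product restarts (the July rows in the question), and the toy numbers are made up for illustration.

```r
# Toy data; 'reset' is a hypothetical flag marking rows where the product restarts
disc  <- c(0.10, 0.20, 0.05, 0.00, 0.10, 0.30)
reset <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)

# Loop version, for comparison
cum_loop <- numeric(length(disc))
for (i in seq_along(disc)) {
  cum_loop[i] <- if (reset[i]) 1 + disc[i] else cum_loop[i - 1] * (1 + disc[i])
}

# Vectorized version: the group id increases at every reset,
# and cumprod() runs separately within each group
grp     <- cumsum(reset)
cum_vec <- ave(1 + disc, grp, FUN = cumprod)
```

Both versions give the same vector here; the extra conditions from the question (dup_no, lag_sales) would still have to be folded into how the groups are defined.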
I am very new to R and am trying to make a function which with each "generation" replicates one individual of a population and terminates another. Here is what I have done so far,
simulate_moran<-function(n){
population <- c(rep(0, (n - 1)), 1)
ind_die<-(sample(1:n,size=1))
ind_repr<-(sample(1:n,size=1))
while(sum(population)>=1 & sum(population)<=4){
population[ind_die]<-population[ind_repr]
}
return(population)
}
simulate_moran(5)
How can I set the condition for the while loop so that it loops until there is only one type (0 or 1) of individual?
If you choose to help, thank you very much.
There is an all() function in base (see ?base::all), which you could use to check whether all elements are equal to the first. For example:
while (!all(population == population[[1]])) {
...
}
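Putting that condition into the function might look like the sketch below. It assumes the dying and reproducing individuals should be redrawn in every generation, so the two sample() calls are moved inside the loop; that is an interpretation of the question, not something it states.

```r
simulate_moran <- function(n) {
  # Start with n - 1 individuals of type 0 and one of type 1
  population <- c(rep(0, n - 1), 1)
  # Keep going until every individual has the same type
  while (!all(population == population[[1]])) {
    ind_die  <- sample.int(n, size = 1)  # redrawn each generation (assumption)
    ind_repr <- sample.int(n, size = 1)
    population[ind_die] <- population[ind_repr]
  }
  population
}

set.seed(1)
result <- simulate_moran(5)
```

The returned vector contains either all 0s or all 1s, whichever type reached fixation.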
I am currently working on a large dataset (~1.5M of entries) using R - a language I am not yet completely familiar with.
Basically, what I try to do is the following :
I want to check what happens during a time interval after "Start".
"Start" represents a few temporal values within every "Trial", and "Trial" represents all of the trials recorded for one "Reference".
So for each Reference, I want to check all Trials and see what happens after "Start" during each Trial.
It's not so important if what I'm trying to do is still obscure; the point is that I want to check every row in my dataframe.
My instinctive (understand, R-noob-ish) way of programming this function led me to a piece of code which I know is far from being optimized, and takes a LOT of time to run.
My_Function <- function(DataFrame){
counts <- data.frame()
for (reference in DataFrame$Ref){
ref_tested <- subset(DataFrame, Ref == reference)
ref_count <- data.frame()
for (trial in ref_tested$Trial){
trial_tested <- subset(ref_tested, Trial == trial)
for (timing in trial_tested$Start){
interesting <- subset(DataFrame, Start > timing & Start <= timing + some_time & Trial == trial)
ref_count <- rbind(ref_count,as.data.frame(table(interesting$ele)))
}
}
temp <- aggregate(Freq~Var1,data=ref_count,FUN=sum);
counts <- rbind (counts, temp)
}
return(counts)
}
Here, as.data.frame(table(interesting$ele)) can have different lengths, and thus so does ref_count.
I failed to find a way to grow my dataframe without using rbind, but I also know that given the size of my output it is not time-efficient at all.
Also, I have already programmed in other languages such as Python or C++ (a long time ago) and know that having 3 nested for loops usually means that you're doing it wrong. But then again, I did not find a way to avoid doing that in this particular case.
So, do you have any advice on how to use R, or one of its packages, to avoid such a situation?
Thank you in advance,
K.
EDIT :
Thank you for the initial advice.
I tried the 'plyr' package and was able to reduce the size of my code chunk. It does what I expected and is more understandable. Plus, I was able to produce some example data for reproducibility. See:
#Example Input
DF <- data.frame(c(sample(1:400,500000, replace = TRUE)),c(sample(1:25,500000, replace = TRUE)), rnorm(n=500000, m=1, sd=1) )
colnames(DF)<-c("Trial","Ref","Start")
DF$rn<-rownames(DF)
tempDF <- DF[sample(nrow(DF), 100), ] #For testing purposes
Test<- ddply(.data = tempDF, "rn", function(x){
interesting <- subset(DF,
Trial == x$Trial &
Start > x$Start &
Start < x$Start + some_time )
interesting$Elec <- x$Ref
return(interesting)
})
This is nice, but I still feel like it is not the way to go. In this example, we only browse 100 observations, which takes ~4 sec (measured with system.time()), but if I want to scan the 500000 observations of DF, it would take more than 5 hours.
I have checked data.table but I am still trying to understand how to use it for now.
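One way to avoid rescanning the whole data frame for every row is to sort the Start values once per Trial and look up the window boundaries with findInterval(). The sketch below only counts the rows falling in each window; it does not build the per-element table from the original function. The value of `some_time` and the helper name `count_window` are assumptions made for illustration.

```r
set.seed(42)
n  <- 2000
DF <- data.frame(Trial = sample(1:20, n, replace = TRUE),
                 Ref   = sample(1:5,  n, replace = TRUE),
                 Start = rnorm(n, mean = 1, sd = 1))
some_time <- 0.1  # hypothetical window length

# For each value t in 'start', count the values lying in (t, t + width]
count_window <- function(start, width) {
  s <- sort(start)
  # findInterval(x, s) returns how many elements of the sorted s are <= x
  findInterval(start + width, s) - findInterval(start, s)
}

# Apply the helper separately within each Trial
DF$n_after <- ave(DF$Start, DF$Trial,
                  FUN = function(s) count_window(s, some_time))
```

Each Trial is sorted once and every lookup is a binary search, so the cost grows roughly as n log n rather than with the square of the number of rows.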
I am only a learner in R and have a fairly basic question.
I have a dataset called edata with two columns relevant to the posted question. These are GazeCue and TargetLocation. I wish to create a new column called CueType that shows as "Cued" or "Uncued" based on the values of the other two columns.
When GazeCue is equal to RightGazePic1.png and TargetLocation is equal to TargetR, the new CueType column should show as "Cued". Similarly, when GazeCue is equal to LeftGazePic1.png and TargetLocation is equal to TargetL, the CueType column should again show as "Cued". Any other combination of values should show in CueType as "Uncued".
An example of what I would like is pasted below.
GazeCue TargetLocation CueType
RightGazePic1.png TargetR Cued
LeftGazePic1.png TargetL Cued
RightGazePic1.png TargetL Uncued
LeftGazePic1.png TargetR Uncued
I have been trying to complete this code using ifelse but with no luck. Any advice would be greatly appreciated.
This is pretty basic. One way would be to extract the L and R from both the png and the Target, and compare those using ifelse:
CueType <- ifelse(substr(GazeCue, 1,1) == substr(TargetLocation, 7,7),
"Cued",
"Uncued")
If the names can vary a bit more, take a look at gsub to extract the relevant information from the strings before making the comparison.
You can also use the comparison result to index a lookup vector, since FALSE + 1 is 1 and TRUE + 1 is 2:
ix <- (substr(df$GazeCue,1,1) == substring(df$TargetLocation,7)) + 1
df$CueType <- c("Uncued","Cued")[ix]
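If you would rather spell out the two "Cued" combinations exactly as described in the question instead of relying on character positions, a minimal sketch with ifelse() could look like this. The four-row edata below is rebuilt from the example table in the question.

```r
edata <- data.frame(
  GazeCue        = c("RightGazePic1.png", "LeftGazePic1.png",
                     "RightGazePic1.png", "LeftGazePic1.png"),
  TargetLocation = c("TargetR", "TargetL", "TargetL", "TargetR"),
  stringsAsFactors = FALSE
)

# TRUE where the gaze direction and the target side agree
cued <- (edata$GazeCue == "RightGazePic1.png" & edata$TargetLocation == "TargetR") |
        (edata$GazeCue == "LeftGazePic1.png"  & edata$TargetLocation == "TargetL")

edata$CueType <- ifelse(cued, "Cued", "Uncued")
```

Note the single & and |, which work element-wise on whole columns; && and || only handle a single value.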
You can try this:
edata$CueType <- NA  # add the new column by name
for (i in seq_len(nrow(edata))) {
  # && is the scalar "and" intended for if() conditions
  if (edata$GazeCue[i] == 'RightGazePic1.png' && edata$TargetLocation[i] == 'TargetR') {
    edata$CueType[i] <- "Cued"
  } else if (edata$GazeCue[i] == 'LeftGazePic1.png' && edata$TargetLocation[i] == 'TargetL') {
    edata$CueType[i] <- "Cued"
  } else {
    edata$CueType[i] <- "Uncued"
  }
}
Test it; it should work properly!
I am doing a case-control study and I want to match cases to controls in a 1:5 ratio, by a number of important variables. However, when I try to do this, I get a number of duplicate controls in my data set. Does anyone know how to remove duplicates in R while running through the for loop?
case=case data set
con= control data set
So far my code looks like this:
out <- NULL
for (i in 1: length(case[,5]) ){
g <-case$Sex[i]
y <-case$Age[i]
x <- sample((which(con$Sex==g & con$Age>=y-1 & con$Age<=y+1)), size=5 )
out <- c(out, x)
if (duplicated(out)=="TRUE")
i=i-1;
out = out(size(out)-1,);
end
}
But the part trying to remove duplicates is wrong and I am not sure how to fix it. Any suggestions?
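One possible approach, sketched under assumptions, is to prevent duplicates rather than remove them: drop the controls that have already been picked from the eligible pool before sampling. The toy `case` and `con` data frames below are made up for illustration, and stopping with an error when fewer than 5 unused controls remain is just one way to handle that situation.

```r
# Toy data (hypothetical): 3 cases, and 5 controls for every Sex/Age combination
case <- data.frame(Sex = c("F", "M", "F"), Age = c(30, 45, 31),
                   stringsAsFactors = FALSE)
con  <- expand.grid(Sex = c("F", "M"), Age = 28:47, copy = 1:5,
                    stringsAsFactors = FALSE)

out <- integer(0)
for (i in seq_len(nrow(case))) {
  # Controls of the same sex whose age is within one year of the case
  eligible <- which(con$Sex == case$Sex[i] & abs(con$Age - case$Age[i]) <= 1)
  # Exclude controls already matched to an earlier case
  eligible <- setdiff(eligible, out)
  if (length(eligible) < 5) stop("not enough unused controls for case ", i)
  # Sample positions, so a pool of length 1 is never treated as 1:n by sample()
  x   <- eligible[sample.int(length(eligible), 5)]
  out <- c(out, x)
}
```

Here `out` holds row numbers of `con`, five per case, with no row used twice. Because earlier cases get first pick, the order of the cases can affect which controls remain for later ones.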
I have created a while loop that is being executed across a sizable data set. The loop is as such:
i = 1
while(i<=m){
Date = London.Events$start_time[i]
j=1
while(j<=n){
Hotel = London.Hotels$AS400.ID[j]
Day.Zero[i,j] = sum(London.Bookings$No.of.Rooms[London.Bookings$Stay.Date == Date & London.Bookings$Legacy.Hotel.Code == Hotel])
j=j+1
}
i=i+1
}
Where:
m = 9957 #Number of Events
n = 814 #Number of Hotels
Day.Zero = as.data.frame(matrix(0, 9957, 814))
Briefly explained, for each combination of date and hotel (pulled from two other data frames), produce the sum from the column London.Bookings$No.of.Rooms and deposit that into the corresponding row of the matrix.
The loop appears to run without error, however when stopping it after 5 mins+ it is still running and nowhere near complete!
I would like to know how one of the apply family of functions could be used as a replacement here for much faster completion.
Thanks!
Probably,
xtabs(No.of.Rooms ~ Stay.Date + Legacy.Hotel.Code, data = London.Bookings)
gets you something similar to what you want.
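As a small illustration of what xtabs() returns, here is a sketch on made-up bookings. The column names follow the question, but the values, the character dates and the `event_dates` and `hotels` vectors are assumptions.

```r
# Toy bookings (hypothetical values)
London.Bookings <- data.frame(
  Stay.Date         = c("2015-01-01", "2015-01-01", "2015-01-02", "2015-01-02"),
  Legacy.Hotel.Code = c("H1", "H1", "H1", "H2"),
  No.of.Rooms       = c(2, 3, 1, 4),
  stringsAsFactors  = FALSE
)

# One pass over the data: rows are dates, columns are hotels, cells are room sums
rooms <- xtabs(No.of.Rooms ~ Stay.Date + Legacy.Hotel.Code, data = London.Bookings)

# Pick out the date/hotel combinations of interest by name
event_dates <- c("2015-01-02", "2015-01-01")
hotels      <- c("H1", "H2")
Day.Zero    <- rooms[event_dates, hotels]
```

Combinations with no bookings come out as 0. Indexing by name fails for dates or hotels that never appear in the bookings, so those would need to be handled separately.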
Using the dplyr package, you can do something like the following (assuming your input data frame has these column names, which I have loosely inferred from your code and question):
library(dplyr)
London.Bookings %>% group_by(Legacy.Hotel.Code, Stay.Date) %>% summarise(Total.No.of.Rooms = sum(No.of.Rooms))