I have read up on vectorization as a solution for speeding up a for-loop. However, the data structure I am creating within a for-loop seems to need to be a data.frame/table.
Here is the scenario:
I have a large table of serial numbers and timestamps. Several timestamps can apply to the same serial number. I only want the latest timestamp for every serial number.
My approach now is to create a vector of unique serial numbers. Then, on each pass through this vector, I create a temporary table ('temp') that holds all serial number/timestamp observations for that serial number. I then take the last entry of this temporary table (using the tail command) and put it into another table that will eventually hold all unique serial numbers and their latest timestamps ('last.pass'). Finally, I simply remove rows from the starting table whose serial number/timestamp combination cannot be found in 'last.pass'.
Here is my code:
#create list of unique serial numbers found in merged 9000 table
hddsn.unique <- unique(merge.data$HDDSN)
#create empty data.table to populate
last.pass <- data.table(HDDSN=as.character(1:length(hddsn.unique)),
ENDDATE=as.character(1:length(hddsn.unique)))
#populate last.pass with the combination of serial numbers and their latest timestamps
for (i in 1:length(hddsn.unique)) {
#create temporary table that finds all serial number/timestamp combinations
temp <- merge.data[merge.data$HDDSN %in% hddsn.unique[i],][,.(HDDSN, ENDDATE)]
#populate last.pass with the latest timestamp record for every serial number
last.pass[i,] <- tail(temp, n=1)
}
match <- which(merge.data[,(merge.data$HDDSN %in% last.pass$HDDSN) &
(merge.data$ENDDATE %in% last.pass$ENDDATE)]==TRUE)
final <- merge.data[match]
My ultimate question is, how do I maintain the automated nature of this script while speeding it up, say, through vectorization or by turning it into a function?
Thank you!!!
How about this? Without a clear idea of what your input data looks like, I took a guess.
# make some dummy data with multiple visits per serial
merge.data <- data.frame(HDDSN = 1001:1020,
timestamps = sample(1:9999, 100))
# create a function to find the final visit for a given serial
fun <- function(serial) {
this.serial <- subset(merge.data, HDDSN==serial)
this.serial[which.max(this.serial$timestamps), ]
}
# apply the function to each serial number and clean up the result
final <- as.data.frame(t(sapply(unique(merge.data$HDDSN), fun)))
This data has several ENDDATEs for each HDDSN:
merge.data <- data.frame(HDDSN = 1001:1100, ENDDATE = sample(9999, 1000))
Place it in order, first by HDDSN and then by ENDDATE:
df = merge.data[do.call("order", merge.data),]
Then find the last entry for each HDDSN:
df[!duplicated(df[["HDDSN"]], fromLast=TRUE),]
The following illustrates the key steps:
> head(df, 12)
HDDSN ENDDATE
701 1001 4
101 1001 101
1 1001 1225
301 1001 2800
201 1001 6051
501 1001 6714
801 1001 6956
601 1001 7894
401 1001 8234
901 1001 8676
802 1002 247
402 1002 274
> head(df[!duplicated(df[["HDDSN"]], fromLast=TRUE),])
HDDSN ENDDATE
901 1001 8676
902 1002 6329
803 1003 9947
204 1004 8825
505 1005 8472
606 1006 9743
If there are composite keys, then look for duplicates on a data.frame rather than a vector, !duplicated(df[, c("key1", "key2")]), as illustrated in the following:
> df = data.frame(k0=c(1:3, 1:6), k1=1:3)
> df[!duplicated(df, fromLast=TRUE),]
k0 k1
1 1 1
2 2 2
3 3 3
7 4 1
8 5 2
9 6 3
(The row numbers are from the original data frame, so rows 4-6 were duplicates.) Some care might be needed, especially if one of the columns is numeric, because duplicated.data.frame pastes the columns together into a single string, so rounding error may creep in.
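A quick illustration of that numeric caveat (a toy example, not the data above): two values that differ only by floating-point error collapse to the same string once pasted, so they can end up treated as duplicates even though the numbers are not equal:
> 0.1 + 0.2 == 0.3
[1] FALSE
> paste(0.1 + 0.2) == paste(0.3)
[1] TRUE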
Related
I have data representing stock prices, with 1 minute bars. I need to delete the row corresponding to the first minute of each day and the following 29 rows.
The first row of each day always has value >60 at the time_difference variable.
If I write sel <- df[which(df$time_difference>60),] and then df_new = anti_join(df, sel, by="Time"), I select the first row of each day. However, I need to remove the next 29 rows as well.
Here is a sample of the df; I also added a time_difference variable, computed as the difference between each row and the next row of the Time variable (not displayed here). The df file can be downloaded from here
Time Open High Low Close Volume Wap Gap Count
1 1536154200 234.61 234.95 234.57 234.76 302 234.600 0 31
2 1536154260 234.76 235.23 234.76 235.16 135 235.008 0 94
3 1536154320 235.09 235.33 234.88 235.33 121 235.010 0 109
4 1536154380 235.24 235.35 235.08 235.35 24 235.203 0 22
5 1536154440 235.27 235.47 235.22 235.42 62 235.340 0 35
6 1536154500 235.39 235.81 235.39 235.63 136 235.633 0 110
My original answer only works on one set of rows at a time. Thanks for sharing your data. Below is an updated answer. Note that this is sensitive to your data being in chronological order as we are using row indices rather than the actual time!
dat <- read.csv("MSFT.3years.csv")
startofday <- which(dat$time_difference>60)
removerows <- unlist(Map(`:`, startofday, startofday+29))
dat_new <- dat[-removerows,]
Inspired by this question: Generate sequence between each element of 2 vectors
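To see what that Map call builds, here is a toy illustration (the start rows 1 and 391 are made up): each start index expands into a run of 30 row numbers, and unlist() flattens them into a single vector of rows to drop:
> startofday <- c(1, 391)
> removerows <- unlist(Map(`:`, startofday, startofday + 29))
> head(removerows)
[1] 1 2 3 4 5 6
> length(removerows)
[1] 60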
I want to create a for loop that will replace a value in a row with the value from the previous year (same month), based on whether two columns match.
I have created the structure of a for loop, but I have not made progress in determining how to get the for loop to reference a value from a previous year.
Here is an example dataset:
fish <- c("A","A","B","B","C","C")
fish_wt<-c(2,3,4,5,5,7)
fish_count<-c(2,200,47,78,5,845)
date <- as.Date(c('2010-11-1','2009-11-1','2009-11-1','2008-11-1','2008-2-1','2007-2-1'))
data <- data.frame(fish,fish_wt,fish_count,date)
data$newcount<-0
Here is my for loop so far:
for (i in 1:nrow(data)) {
if (data$fish_wt[i] == data$fish_count[i]) {
data$newcount[i] <- 10
} else {
data$newcount[i] <- data$fish_count[i]
}
}
Right now, I am using the value at row i-1, which is fine for this small dataset, but it does not work for a larger one where the two fish A rows will not be next to one another.
for (i in 1:nrow(data)) {
if (data$fish_wt[i] == data$fish_count[i]) {
data$newcount[i] <- data$newcount[data$date==data$date[i-1]]
} else {
data$newcount[i] <- data$fish_count[i]
}
}
This is what I want my dataset to look like:
fish fish_wt fish_count date newcount
1 A 2 2 2010-11-01 200
2 A 3 200 2009-11-01 200
3 B 4 47 2009-11-01 47
4 B 5 78 2008-11-01 78
5 C 5 5 2008-02-01 845
6 C 7 845 2007-02-01 845
I have thought of separating rows by fish, then using the row-1 solution. I am just wondering if there is something easier.
As a solution to this problem, I set up a table of mean temperature by fish, year, and month (long format), then merged the dataset and used the average value for any row where fish_wt==fish_count.
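For reference, here is a rough sketch of that lookup-and-merge idea applied to the toy data above (the helper names shifted, lookup, and prev_count are made up for illustration): build a table of each fish's count keyed to the date one year later, merge it back on, and use that value wherever fish_wt == fish_count.
# shift each date forward one year so it lines up with the row that should use it
shifted <- as.POSIXlt(data$date)
shifted$year <- shifted$year + 1
lookup <- data.frame(fish = data$fish,
                     match_date = as.Date(shifted),
                     prev_count = data$fish_count)
# left-join the previous year's count onto the original data
merged <- merge(data, lookup,
                by.x = c("fish", "date"), by.y = c("fish", "match_date"),
                all.x = TRUE)
# use the previous-year count only where fish_wt equals fish_count and a match exists
merged$newcount <- ifelse(merged$fish_wt == merged$fish_count & !is.na(merged$prev_count),
                          merged$prev_count, merged$fish_count)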
I have data containing quotations of indexes (S&P500, CAC40, ...) for every 5 minutes of the last 3 years, which makes it quite huge. I am trying to create new columns containing the performance of each index at each time (i.e. (quotation at [TIME] / quotation at yesterday's close) - 1). I began this way (my data is named temp):
listIndexes<-list("CAC","SP","MIB") # there are a lot more
listTime<-list(900,905,910,...1735) # every 5 minutes
for (j in 1:length(listTime)){
Time<-listTime[j]
for (i in 1:length(listIndexes)) {
Index<-listIndexes[i]
temp[[paste0(Index,"perf",Time)]]<-temp[[paste0(Index,Time)]]/temp[[paste0(Index,"close")]]-1
# other stuff to do but with the same concept
}
}
but it takes quite a long time. Is there a way to get rid of the for loops or to make the creation of those variables quicker? I read some things about the apply functions and their derivatives, but I do not see whether or how they should be used here.
My data looks like this:
date CACcloseyesterday CAC1000 CAC1005 ... CACclose ... SP1000 ... SPclose
20140105 3999 4000 4001.2 4005 .... 2000 .... 2003
20140106 4005 4004 4003.5 4002 .... 2005 .... 2002
...
and my desired output would be a new column (more exactly, a new column for each time and each index) that would be added to temp:
date CACperf1000 CACperf1005... SPperf1000...
20140106 (4004/4005)-1 (4003.5/4005)-1 .... (2005/2003)-1 # the close used is the one of the day before
and likewise for the following days. I wrote (4004/4005)-1 just to show the calculation, but the result should be a number: -0.0002496879
It looks like you want to generate every combination of Index and Time. Each Index-Time combination is a column in temp and you want to calculate a new perf column by comparing each Index-Time column against a specific Index close column. And your problem is that you think there should be an easier (less error-prone) way to do this.
We can remove one of the for-loops by generating all the necessary column names beforehand using something like expand.grid.
listIndexes <-list("CAC","SP","MIB")
listTime <- list(900, 905, 910, 915, 920)
df <- expand.grid(Index = listIndexes, Time = listTime,
stringsAsFactors = FALSE)
df$c1 <- paste0(df$Index, "perf", df$Time)
df$c2 <- paste0(df$Index, df$Time)
df$c3 <- paste0(df$Index, "close")
head(df)
#> Index Time c1 c2 c3
#> 1 CAC 900 CACperf900 CAC900 CACclose
#> 2 SP 900 SPperf900 SP900 SPclose
#> 3 MIB 900 MIBperf900 MIB900 MIBclose
#> 4 CAC 905 CACperf905 CAC905 CACclose
#> 5 SP 905 SPperf905 SP905 SPclose
#> 6 MIB 905 MIBperf905 MIB905 MIBclose
Then only one loop is required, and it's for iterating over each batch of column names and doing the calculation.
for (row_i in seq_len(nrow(df))) {
this_row <- df[row_i, ]
temp[[this_row$c1]] <- temp[[this_row$c2]] / temp[[this_row$c3]] - 1
}
An alternative solution would be to reshape your data into a form that makes this transformation much simpler. For instance, convert it into a long, tidy format with columns for Date, Index, Time, Value, and ClosingValue, and operate directly on just the two relevant columns there.
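For illustration, a minimal sketch of that long format, reusing the df of column names built above (it assumes temp has a date column as in the sample data; the names long and perf are made up):
# stack each Index/Time column into rows of (date, Index, Time, Value, ClosingValue)
long <- do.call(rbind, lapply(seq_len(nrow(df)), function(row_i) {
  this_row <- df[row_i, ]
  data.frame(date         = temp$date,
             Index        = as.character(this_row$Index),
             Time         = as.numeric(this_row$Time),
             Value        = temp[[this_row$c2]],
             ClosingValue = temp[[this_row$c3]])
}))
# with everything in one place, the performance is a single vectorised expression
long$perf <- long$Value / long$ClosingValue - 1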
I have a large data set of vehicles. They were recorded every 0.1 seconds, so their IDs repeat in the Vehicle ID column. In total there are 2169 vehicles. I filtered the 'Vehicle velocity' column for every vehicle (using a for loop), which resulted in a new column with the first and last 30 values removed (per vehicle). In order to bind it with the original data frame, I removed the first and last 30 rows of the table too and then combined them using cbind(). This works, but only for the last vehicle. I want this smoothing and column binding for all vehicles, and finally I want to combine all the vehicles' data frames into one single table. That means row-binding in the sequence of vehicle IDs. This is what I wrote so far:
traj1 <- read.csv('trajectories-0750am-0805am.txt', sep=' ', header=F)
head(traj1)
names (traj1)<-c('Vehicle ID', 'Frame ID','Total Frames', 'Global Time','Local X', 'Local Y', 'Global X','Global Y','Vehicle Length','Vehicle width','Vehicle class','Vehicle velocity','Vehicle acceleration','Lane','Preceding Vehicle ID','Following Vehicle ID','Spacing','Headway')
# TIME COLUMN
Time <- sapply(traj1$'Frame ID', function(x) x/10)
traj1$'Time' <- Time
# SMOOTHING VELOCITY
smooth <- function (x, D, delta){
z <- exp(-abs(-D:D/delta))
r <- convolve (x, z, type='filter')/convolve(rep(1, length(x)),z,type='filter')
r
}
for (i in unique(traj1$'Vehicle ID')){
veh <- subset (traj1, traj1$'Vehicle ID'==i)
svel <- smooth(veh$'Vehicle velocity',30,10)
svel <- data.frame(svel)
veh <- head(tail(veh, -30), -30)
fta <- cbind(veh,svel)
}
'fta' now only holds the data frame for the last vehicle, but I want all the data frames (for all vehicles 'i') combined by row. Maybe a for loop is not the right way to do it, but I don't know how I can use tapply (or any other apply function) to do so many things at the same time.
EDIT
I can't reproduce my dataset here, but the 'Orange' data set in R provides a good analogy. Using the same smoothing function, the for loop would look like this (if the 'age' column is smoothed and the 'Tree' column is equivalent to my 'Vehicle ID' column):
for (i in unique(Orange$Tree)){
tre <- subset (Orange, Orange$'Tree'==i)
age2 <- round(smooth(tre$age,2,0.67),digits=2)
age2 <- data.frame(age2)
tre <- head(tail(tre, -2), -2)
comb <- cbind(tre,age2)
}
Umair, I am not sure I understood what you want.
If I understood right, you want to combine all the results by row. To do that you could save all the results in a list and then do.call an rbind:
comb <- list() ### create list to save the results
length(comb) <- length(unique(Orange$Tree))
##Your loop for smoothing:
for (i in 1:length(unique(Orange$Tree))){
tre <- subset (Orange, Tree==unique(Orange$Tree)[i])
age2 <- round(smooth(tre$age,2,0.67),digits=2)
age2 <- data.frame(age2)
tre <- head(tail(tre, -2), -2)
comb[[i]] <- cbind(tre,age2) ### save results in the list
}
final.data<-do.call("rbind", comb) ### combine all results by row
This will give you:
Tree age circumference age2
3 1 664 87 687.88
4 1 1004 115 982.66
5 1 1231 120 1211.49
10 2 664 111 687.88
11 2 1004 156 982.66
12 2 1231 172 1211.49
17 3 664 75 687.88
18 3 1004 108 982.66
19 3 1231 115 1211.49
24 4 664 112 687.88
25 4 1004 167 982.66
26 4 1231 179 1211.49
31 5 664 81 687.88
32 5 1004 125 982.66
33 5 1231 142 1211.49
Just for fun, a different way to do it using plyr::ddply and sapply with split:
library(plyr)
data<-ddply(Orange, .(Tree), tail, n=-2)
data<-ddply(data, .(Tree), head, n=-2)
data<- cbind(data,
age2=matrix(sapply(split(Orange$age, Orange$Tree), smooth, D=2, delta=0.67), ncol=1, byrow=FALSE))
If I have a dataframe of election results by district and candidate, is there an easy way to find the winner in each district in R? That is, for each row, select both the maximum value and the column name for that maximum value?
District CandidateA CandidateB CandidateC
1 702 467 35
2 523 642 12
...
So I'd want to select not only 702 in row 1 and 642 in row 2, but also "CandidateA" from row 1 and "CandidateB" in row 2.
I'm asking this as a learning question, as I know I can do this with any general-purpose scripting language like Perl or Ruby. Perhaps R isn't the tool for this, but it seems like it could be. Thank you.
d <- read.table(textConnection(
"District CandidateA CandidateB CandidateC
1 702 467 35
2 523 642 12"),
header=TRUE)
d2 <- d[,-1] ## drop district number
data.frame(winner=names(d2)[apply(d2,1,which.max)],
votes=apply(d2,1,max))
result:
winner votes
1 CandidateA 702
2 CandidateB 642
Do you need to worry about ties? See the help for which and which.max; they treat ties differently ...
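For example, which.max returns only the first maximum, while which(x == max(x)) returns every tied position:
> x <- c(5, 5, 3)
> which.max(x)
[1] 1
> which(x == max(x))
[1] 1 2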
If this isn't too messy, you can try running a for loop and printing out the results using cat. So if your data.frame object is x:
for(i in 1:length(x$District)) {
row <- x[i,]
max_row <- max(row[2:length(row)])
winner_row <- names(x)[which(row==max_row)]
cat(winner_row, max_row, "\n")
}