Can anyone give me a hint to speed up the following program?
Situation: I have a huge amount of measurement data. I need to extract the data for "10 minutes of stable operation conditions" across 5 parameters, i.e. column values.
Here is my (working, but really slow) solution:
- Take the first 10 rows from the dataframe
- Compare the min and max of each column to the first value of the column
- If at least one column's min or max is not within tolerance, delete the first row and repeat
- If they are within tolerance, calculate the mean of the results, store it, delete the 10 rows, and repeat
- Break when the dataframe has fewer than 10 rows
Since I am using a repeat loop, this takes 30 minutes to extract 610 operation points from 86,220 minutes of data.
Any help is appreciated. Thanks!
Edit: I created some code to explain. Please note that I removed the checking routines for NA values and standby operation (values around 0):
n_cons <- 5 # Number of consistent minutes

### Function that returns TRUE when a window is NOT within tolerance
### (i.e. its min or max deviates from the first value by more than tol)
f_cons <- function(min, max, value, tol) {
  z <- max > (value + tol) | min < (value - tol)
  return(z)
}
# Define the +/- tolerances
Vu_1_tol<-5 # F_HT
Vu_2_tol<-5 # F_LT
# Create empty result map
map<-c(rep(NA,3))
dim(map)<- c(1,3)
colnames(map)<-list("F_HT","F_LT","Result")
system.time(
  repeat{
    # Criterion to break
    if (nrow(t6) < n_cons) {break}
    # Subset of the data to check
    t_check <- NULL
    t_check <- cbind(t6$F_HT[1:n_cons],
                     t6$F_LT[1:n_cons])
    # Check for consistency: if a column is out of tolerance, drop the first row and retry
    if (f_cons(min(t_check[,1]), max(t_check[,1]), t_check[1,1], Vu_1_tol)) {
      t6 <- t6[-1,]
      next
    }
    if (f_cons(min(t_check[,2]), max(t_check[,2]), t_check[1,2], Vu_2_tol)) {
      t6 <- t6[-1,]
      next
    }
    # If the repeat loop passes the consistency check, store the means
    attach(t6[1:n_cons,])
    # Create a new row with the means of the steady block
    new_row <- c(mean(F_HT), mean(F_LT), mean(Result))
    new_row[-1] <- round(as.numeric(new_row[-1]), 2)
    map <- rbind(map, new_row) # attach the new steady point to the map
    detach(t6[1:n_cons,])
    t6 <- t6[-(1:n_cons),] # delete the evaluated lines from the data
  }
)
The data I am using looks like this
t6<-structure(list(F_HT = c(1499.71, 1500.68, 1500.44, 1500.19, 1500.31,
1501.76, 1501, 1551.22, 1500.01, 1500.52, 1499.53, 1500.78, 1500.65,
1500.96, 1500.25, 1500.76, 1499.49, 1500.24, 1500.47, 1500.25,
1735.32, 2170.53, 2236.08, 2247.48, 2250.71, 2249.59, 2246.68,
2246.69, 2248.27, 2247.79), F_LT = c(2498.96, 2499.93, 2499.73,
2494.57, 2496.94, 2507.71, 2495.67, 2497.88, 2499.63, 2506.18,
2495.57, 2504.28, 2497.38, 2498.66, 2502.17, 2497.78, 2498.38,
2501.06, 2497.75, 2501.32, 2500.79, 2498.17, 2494.82, 2499.96,
2498.5, 2503.47, 2500.57, 2501.27, 2501.17, 2502.33), Result = c(9125.5,
8891.5, 8624, 8987, 9057.5, 8840.5, 9182, 8755.5, 9222.5, 9079,
9175.5, 9458.5, 9058, 9043, 9045, 9309, 9085.5, 9230, 9346, 9234,
9636.5, 9217.5, 9732.5, 9452, 9358, 9071.5, 9063.5, 9016.5, 8591,
8447.5)), .Names = c("F_HT", "F_LT", "Result"), row.names = 85777:85806, class = "data.frame")
With this code and data, I get 3 steady operation points, which is what I want, but it is very slow.
Hopefully, this helps to better explain my problem.
Eureka!
Thanks to the comment from Carl Witthoft, I was able to speed up the process by a factor of 15!
I used rollapply a lot, because rollmean and rollmax had some problems with NA values that did not occur when using rollapply.
Thanks for your help!
Here is what I did, using the same data as before:
library(zoo) # rollapply comes from the zoo package

# Use only the values needed to check for stability
t7 <- as.data.frame(cbind(t6$F_HT, t6$F_LT))
n_cons <- 5 # Number of consistent minutes
# Calculate the mean values for each column over 5 rows
t7_rm<-rollapply(t7,n_cons,mean,align = "left")
colnames(t7_rm)<-c("mean_F_HT","mean_F_LT")
# idem with maximum
t7_max<-rollapply(t7,width=n_cons,FUN=max, na.rm = F,align = "left")
colnames(t7_max)<-c("max_F_HT","max_F_LT")
# idem with minimum
t7_min<-rollapply(t7,width=n_cons,FUN=min, na.rm = F,align = "left")
colnames(t7_min)<-c("min_F_HT","min_F_LT")
# create table with the maximum absolute deviation from the mean values
t7_dif<-pmax((t7_max-t7_rm[1:nrow(t7_max),]),(t7_rm[1:nrow(t7_min),]-t7_min))
colnames(t7_dif)<-c("diff_F_HT","diff_F_LT")
# Enter tolerance limits
V1_tol<-50 # F_HT
V2_tol<-50 # F_LT
# Create a tolerance table
t7_tol<-cbind(rep(V1_tol,nrow(t7_dif)),rep(V2_tol,nrow(t7_dif)))
# Create a logical table with TRUE or FALSE depending on whether the max deviation is within tolerance
t7_check<-(t7_dif<t7_tol)
# Replace all "FALSE" with "NA" (in order to use is.na)
t7_check_NA<-apply(t7_check,c(1,2),function(x) {ifelse(x==FALSE,NA,x)})
# Create rolling mean over complete data
t6_rm<-rollapply(t6,n_cons,mean,na.rm=TRUE,align = "left")
# Create a map of stable operation points with means of parameters and result
t6_map<-t6_rm[complete.cases(t7_check_NA),]
The result differs from my original one, because no lines are omitted. But this works for me.
Related
I'm working with a 4-dimensional matrix (Year, Simulation, Flow, Time instant: 10x5x20x10) in R. I need to remove some values from the matrix. For example, for year 1 I need to remove simulations number 1 and 2; for year 2 I need to remove simulation number 5.
Can anyone suggest how I can make such changes?
Arrays (which is how R documentation usually refers to higher-dimensional 'matrices') can be indexed with negative values in the same way as matrices or vectors: a negative value removes the corresponding row/column/slice. So if you wanted to remove year 1 completely (for example), you could use a[-1,,,]; to remove simulation 5 completely, a[,-5,,].
However, arrays can't be "ragged", there has to be something in every row/column/slice combination. You could replace the values you want to remove with NAs (and then make sure to account for the NAs appropriately when computing, e.g. using na.rm = TRUE in sum()/min()/max()/median()/etc.): a[1,1:2,,] <- NA or a[2,5,,] <- NA in your examples.
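For instance, a small sketch with made-up dimensions (3 years x 5 simulations x 2 flows x 4 time instants) showing both options:
# Toy array in the same (Year, Simulation, Flow, Time) order, sizes are made up
a <- array(rnorm(3 * 5 * 2 * 4), dim = c(3, 5, 2, 4))
# Negative indices drop whole slices
a_no_year1 <- a[-1, , , ]   # everything except year 1
a_no_sim5  <- a[, -5, , ]   # everything except simulation 5
# NA assignment blanks out specific year/simulation combinations
a[1, 1:2, , ] <- NA         # year 1, simulations 1 and 2
a[2, 5, , ]   <- NA         # year 2, simulation 5
# Remember na.rm = TRUE when summarising afterwards
mean(a[1, , , 1], na.rm = TRUE)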
If you knew that all values of Flow and Time would always be present, you could store your data as a list of lists of matrices: e.g.
results <- list(Year1 = list(Simulation1 = matrix(...),
                             Simulation2 = matrix(...),
                             ...),
                Year2 = list(Simulation1 = matrix(...),
                             Simulation2 = matrix(...),
                             ...))
Then you could easily remove years or simulations within years by setting them to NULL, but it would make indexing a little bit harder (e.g. "retrieve Simulation1 values for all years" would require an lapply or a loop across years).
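A runnable version of that sketch, with small random matrices (Flow x Time = 20 x 10, as in the question) standing in for the real data:
results <- list(
  Year1 = list(Simulation1 = matrix(rnorm(20 * 10), 20, 10),
               Simulation2 = matrix(rnorm(20 * 10), 20, 10)),
  Year2 = list(Simulation1 = matrix(rnorm(20 * 10), 20, 10),
               Simulation2 = matrix(rnorm(20 * 10), 20, 10))
)
# Removing a simulation within a year is just a NULL assignment
results$Year1$Simulation2 <- NULL
# ...but "Simulation1 for all years" now needs an lapply across years
sim1_all_years <- lapply(results, `[[`, "Simulation1")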
I am having some trouble coming up with a solution that properly handles classifying a variable number of neighbors for any given observation in a data frame based on some condition. I would like to be able to add a simple, binary indicator variable to a data frame that will equal 1 if the condition is satisfied, and 0 if it is not.
Where I am getting stuck is that I am unsure how to iteratively check the condition against neighboring observations only, in either direction (i.e., to check whether, out of 4 neighboring observations in a given column of my data frame, at least 3 of them contain the same value). I first tried creating another indicator variable that marks whether the condition is satisfied (1 = yes, 0 = no). Then I tried setting up a series of ifelse() statements within a loop to assign the proper categorization to the observation where the initial condition is satisfied, +/- 2 observations in either direction. However, when I inspect the data frame after running the loop, only the observation where the condition is satisfied receives the value, not its neighbors. Here is my code:
# sample data
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
sample_dat$violate <- NULL

for (i in 1:nrow(sample_dat)) {
  sample_dat$violate[i] <- ifelse(sample_dat$initial_ind[i] == 1 &
                                    ((sample_dat$initial_ind[i-2] == 1 |
                                        sample_dat$initial_ind[i-1] == 1) &
                                       (sample_dat$initial_ind[i+2] == 1 |
                                          sample_dat$initial_ind[i+1] == 1)),
                                  "trending",
                                  "non-trending")
}
This loop correctly identifies one of the four points that needs to be labelled "trending", but it does not also assign "trending" to the correct neighbors. In other words, I expect the output to be "trending" for observations 7-10, since 3 out of 4 observations in that group have a value of 1 in the initial indicator column. I feel like there might be an easier way to accomplish this, but what I need to ensure is that my code is robust enough to identify and assign observations to a group regardless of whether I want 3/4 to indicate a group, 5/6, 2/5, etc.
Thank you for any and all advice.
You can use the rollapply function from the zoo package to apply a function over a rolling window of fixed width in your data. The question then becomes about creating a function that satisfies your needs. I'm not sure if I've understood correctly, but it seems you want a function that checks whether the condition is true for at least 3 out of 5 of the observation plus its four closest neighbors. In that case, just adding up the 1s and checking whether the sum is above 2 works.
library(zoo)
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
trend_test = function(x){
ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test, width = 5, fill = NA)
Edit: If you want a function that checks whether the observation and the next 3 observations have at least three 1s, you can do something very similar, just by changing the width and align arguments of rollapply:
trend_test_2 = function(x){
ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test_2, width = 4,
fill = NA, align = "left")
I have a table of data, where I've labeled the rows based on the cluster they fall into, and calculated the average of each row's column values. I would like to select the median row for each cluster.
For example's sake, just looking at one cluster, I would like to use:
median(as.numeric(as.vector(subset(df,df$cluster == i )$avg)))
I can see that
> as.numeric(as.vector(subset(df,df$cluster == i )$avg))
[1] 48.11111111 47.77777778 49.44444444 49.33333333 47.55555556 46.55555556 47.44444444 47.11111111 45.66666667 45.44444444
And yet, the median is
> median(as.numeric(as.vector(subset(df,df$cluster == i )$avg)))
[1] 47.5
I would like to find the median record by matching the returned median against the values in the average column, but that isn't possible with this return value.
I've found some documentation and questions on rounding with the mean function, but that doesn't seem to apply here, unfortunately.
I could also limit the number of decimal places in the data, but some records are so close together that duplicates would be common if I rounded to one decimal.
When the input has an even number of values (like the 10 values you have) then there is not a value directly in the middle. The standard definition of a median (which R implements) averages the two middle values in the case of an even number of inputs. You could rank the data, and in the case of an even-length input select either the n/2 or n/2 + 1 record.
So, if your data was x = c(8, 6, 7, 5), the median is 6.5. You seem to want the index of "the median", that is either 2 or 3.
If we assume there are no ties, then we can get these answers with
which(rank(x) == length(x) / 2)
# [1] 2
which(rank(x) == length(x) / 2 + 1)
# [1] 3
If ties are a possibility, then rank's default tie-breaking method will cause you some problems. Have a look at ?rank and figure out which option you'd like to use.
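For instance, with the default ties.method = "average" the rank you are searching for may simply not exist:
x_tied <- c(8, 6, 6, 5)
rank(x_tied)                                   # 4.0 2.5 2.5 1.0
which(rank(x_tied) == length(x_tied) / 2)      # integer(0): no rank equals 2
which(rank(x_tied, ties.method = "first") == length(x_tied) / 2)  # [1] 2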
We can, of course, turn this into a little utility function:
median_index = function(x) {
  lx = length(x)
  if (lx %% 2 == 1) {
    return(match(median(x), x))
  }
  which(rank(x, ties.method = "first") == lx/2 + 1)
}
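A quick check with the earlier example vector:
median_index(c(8, 6, 7, 5))   # even length: returns 3, the index of the upper middle value 7
median_index(c(7, 5, 6))      # odd length: the median 6 is in the data, so returns 3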
There is an easier way to do that: use dplyr
library(dplyr)
df %>%
  group_by(cluster) %>%
  summarise(Median = median(avg))
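If you need the median row itself rather than the median value, a possible variant of the same pipeline (assuming the columns are named cluster and avg, as in the question) keeps, per cluster, the row whose avg is closest to its group median:
library(dplyr)
df %>%
  group_by(cluster) %>%
  slice(which.min(abs(avg - median(avg))))   # one representative row per cluster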
I am a bit stuck with this basic problem, but I cannot find a solution.
I have two data frames (dummies below):
x<- data.frame("Col1"=c(1,2,3,4), "Col2"=c(3,3,6,3))
y<- data.frame("ColA"=c(0,0,9,4), "ColB"=c(5,3,20,3))
I need to use the location of the median value of one column in df x to then retrieve a value from df y. For this, I am trying to get the row number of the median value in e.g. x$Col1 to then retrieve the value using something like y[,"ColB"][row.number]
Is there an elegant way/function for doing this? Solutions might need to account for two cases: when the sample has an even number of values, and when it has an odd number (with an even count, the median might be a value that does not appear in the sample, since it is the mean of the two middle values).
The problem is a little underspecified.
What should happen when the median isn't in the data?
What should happen if the median appears in the data multiple times?
Here's a solution which takes the (absolute) difference between each value and the median, then returns the index of the first row for which that difference vector achieves its minimum.
with(x, which.min(abs(Col1 - median(Col1))))
# [1] 2
The quantile function with type = 1 (i.e. no averaging) may also be of interest, depending on your desired behavior. It returns the lower of the two "sides" of the median, while the which.min method above can depend on the ordering of your data.
quantile(x$Col1, .5, type = 1)
# 50%
# 2
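Whichever way you obtain the row number, it can be used directly to index the second data frame, which is what the question is ultimately after; with the dummy frames x and y from the question:
idx <- with(x, which.min(abs(Col1 - median(Col1))))
idx            # [1] 2
y$ColB[idx]    # [1] 3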
An option using quantile is
with(x, which(Col1 == quantile(Col1, .5, type = 1)))
# [1] 2
This could possibly return multiple row-numbers.
Edit:
If you want it to only return the first match, you could modify it as shown below
with(x, which.min(Col1 != quantile(Col1, .5, type = 1)))
Here, something like y$ColB[which(x$Col1 == round(median(x$Col1)))] would do the trick.
The problem is that x has an even number of rows, so the median 2.5 is not an integer. In this case you have to choose between 2 and 3.
Note: The above works for your example, not for general cases (e.g. c(-2L,2L) or with rational numbers). For the more general case see #IceCreamToucan's solution.
The language I'm using is R, but you don't necessarily need to know about R to answer the question.
Question:
I have a sequence that can be considered the ground truth, and another sequence that is a shifted version of the first, with some missing values. I'd like to know how to align the two.
setup
I have a sequence ground.truth that is basically a set of times:
ground.truth <- rep( seq(1,by=4,length.out=10), 5 ) +
rep( seq(0,length.out=5,by=4*10+30), each=10 )
Think of ground.truth as times where I'm doing the following:
{take a sample every 4 seconds for 10 times, then wait 30 seconds} x 5
I have a second sequence observations, which is ground.truth shifted with 20% of the values missing:
nSamples <- length(ground.truth)
idx_to_keep <- sort(sample( 1:nSamples, .8*nSamples ))
theLag <- runif(1)*100
observations <- ground.truth[idx_to_keep] + theLag
nObs <- length(observations)
If I plot these vectors (remember, think of these as times), observations looks like ground.truth shifted by a constant, with some of the points missing.
What I've tried. I want to:
calculate the shift (theLag in my example above)
calculate a vector idx such that ground.truth[idx] == observations - theLag
First, assume we know theLag. Note that ground.truth[1] is not necessarily observations[1] - theLag. In fact, we have observations[1] - theLag == ground.truth[1 + lagI] for some lagI.
To calculate this, I thought I'd use cross-correlation (ccf function).
However, whenever I do this I get a lag with a max. cross-correlation of 0, meaning ground.truth[1] == observations[1] - theLag. But I've tried this in examples where I've explicitly made sure that observations[1] - theLag is not ground.truth[1] (i.e. modify idx_to_keep to make sure it doesn't have 1 in it).
The shift theLag shouldn't affect the cross-correlation (isn't ccf(x,y) == ccf(x,y-constant)?) so I was going to work it out later.
Perhaps I'm misunderstanding though, because observations doesn't have as many values in it as ground.truth? Even in the simpler case where I set theLag==0, the cross correlation function still fails to identify the correct lag, which leads me to believe I'm thinking about this wrong.
Does anyone have a general methodology for me to go about this, or know of some R functions/packages that could help?
Thanks a lot.
For the lag, you can compute all the differences (distances) between your two sets of points:
diffs <- outer(observations, ground.truth, '-')
Your lag should be the value that appears length(observations) times:
which(table(diffs) == length(observations))
# 55.715382960625
# 86
Double check:
theLag
# [1] 55.71538
The second part of your question is easy once you have found theLag:
idx <- which(ground.truth %in% (observations - theLag))
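Both steps compare doubles for exact equality. That works here because observations was built directly from ground.truth, but if rounding noise ever creeps in, a hedged variant rounds the differences before tabulating and matches with a small tolerance (diffs_r and theLag_est are just illustrative names, and six decimals is an arbitrary choice, much finer than the 4-second sampling grid):
# Estimate the lag from rounded pairwise differences
diffs_r <- round(outer(observations, ground.truth, '-'), 6)
theLag_est <- as.numeric(names(which(table(diffs_r) == length(observations))))
# Match with a small tolerance instead of exact equality
idx <- which(sapply(ground.truth, function(g) any(abs(observations - theLag_est - g) < 1e-6)))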
The following should work if your time series are not too long.
You have two vectors of time-stamps, the second one being a shifted and incomplete copy of the first, and you want to find by how much it was shifted.
# Sample data
n <- 10
x <- cumsum(rexp(n,.1))
theLag <- rnorm(1)
y <- theLag + x[sort(sample(1:n, floor(.8*n)))]
We can try all possible lags and, for each one, compute how bad the alignment is, by matching each observed timestamp with the closest "truth" timestamp.
# Loss function
library(sqldf)
f <- function(u) {
  # Put all the values in a data.frame
  d1 <- data.frame(g = "truth", value = x)
  d2 <- data.frame(g = "observed", value = y + u)
  d <- rbind(d1, d2)
  # For each observed value, find the next truth value
  # (we could take the nearest, on either side,
  # but it would be more complicated)
  d <- sqldf("
    SELECT A.g, A.value,
           ( SELECT MIN(B.value)
             FROM d AS B
             WHERE B.g = 'truth'
             AND B.value >= A.value
           ) AS next
    FROM d AS A
    WHERE A.g = 'observed'
  ")
  # If u is greater than the lag, there are missing values.
  # If u is smaller, the differences decrease
  # as we approach the lag.
  if (any(is.na(d))) {
    return(Inf)
  } else {
    return( sum(d$`next` - d$value, na.rm = TRUE) )
  }
}
We can now search for the best lag.
# Look at the loss function
sapply( seq(-2,2,by=.1), f )
# Minimize the loss function.
# Change the interval if it does not converge,
# i.e., if it seems in contradiction with the values above
# or if the minimum is Inf
(r <- optimize(f, c(-3,3)))
-r$minimum
theLag # Same value, most of the time