In a large dataframe (1 million+ rows), I am counting the number of elements (rows) that are within a particular range and satisfy a third criteria. I have 33 of those ranges and use a very slow for loop to get me the answer, no problem.
As speed is of massive concern, I would appreciate any help to get this to run faster. Can I get rid of the for loop and "vectorise" or any sort of "apply" solution?
Thanks in advance
Code:
N.data<-c(1:33)
Lower<-c(0,100000,125000,150000,175000,200000,225000,250000,275000,300000,325000,350000,375000,400000,425000,450000,475000,500000,550000,600000,650000,700000,750000,800000,850000,900000,950000,1000000,1100000,1200000,1300000,1400000,1500000)
Upper<-c(100000,125000,150000,175000,200000,225000,250000,275000,300000,325000,350000,375000,400000,425000,450000,475000,500000,550000,600000,650000,700000,750000,800000,850000,900000,950000,1000000,1100000,1200000,1300000,1400000,1500000, 5000000)
for (i in 1:(length(N.data))){
N.data[i]<-nrow(dataset[dataset$Z==c & dataset$X > Lower[i] & dataset$X < Upper[i],])
}
A more efficient approach:
# first logical index (vector)
idx1 <- dataset$Z == c
# second logical index (matrix)
idx2 <- mapply(function(l, u) dataset$X > l & dataset$X < u, Lower, Upper)
# combine both indices and count number of rows
N.data <- colSums(idx1 & idx2)
apply functions are not VECTORIZED. They are merely more efficient implementations of a for loop. To achieve what you seek using vectorization, here is one approach.
# Create a Dummy Dataset and Breaks
dataset = data.frame(
X = rpois(100, 10),
Z = rpois(100, 20)
)
breaks = seq(0, max(dataset$Z), length = 5)
# Add Column with Breaks
dataset = transform(dataset, Z2 = cut(Z, breaks, labels = FALSE))
# Use Aggregate to compute length for each value of Z2
c = 10
aggregate(Z ~ Z2, data = dataset, length, subset = (X == c))
This should be more efficient that using mapply, as it is completely vectorized.
Related
I have a data frame with 30 rows and 4 columns (namely, x, y, z, u). It is given below.
mydata = data.frame(x = rnorm(30,4), y = rnorm(30,2,1), z = rnorm(30,3,1), u = rnorm(30,5))
Further, I have a sequence values, which represent row number in my data frame.
myseq = c(seq(1, 30, by = 5))
myseq
[1] 1 6 11 16 21 26
Now, I wanted to compute the prob values for each segment of 99 rows.
filt= subset(mydata[1:6,], mydata[1:6,]$x < mydata[1:6,]$y & mydata[1:6,]$z < mydata[1:6,]$u
filt
prob = length(filt$x)/30
prob
Then I need to compute the above prob for 1:6,.., 27:30 and so on . Here, I have only 6 prob values. So, I can do one by one. If I have 100 values it would be tedious. Are there any way to compute the prob values?.
Thank you in advance.
BTW: in subset(DF[1:99,], ...), use DF[1:99,] in the first argument, not again, ala
subset(DF[1:99,], cumsuml < inchivaluel & cumsumr < inchivaluer)
Think about how to do this in a list.
The first step is to break your data into the va starting points. I'll start with a list of the indices to break it into:
inds <- mapply(seq, va, c(va[-1], nrow(DF)), SIMPLIFY=FALSE)
this now is a list of sequences, starting with 1:99, then 100:198, etc. See str(inds) to verify.
Now we can subset a portion of the data based on each element's vector of indices:
filts <- lapply(inds, function(ind) subset(DF[ind,], cumsuml < inchivaluel & cumsumr < inchivaluer))
We now have a list of vectors, let's summarize it:
results <- sapply(filts, function(filt) length(filt$cumsuml)/length(alpha))
Bottom line, it helps to think about how to break this problem into lists, examples at http://stackoverflow.com/a/24376207/3358272.
BTW: instead of initially making a list of indices, we could just break up the data in that first step, ala
DF2 <- mapply(function(a,b) DF[a:b,], va, c(va[-1], nrow(DF)), SIMPLIFY=FALSE)
filts <- lapply(DF2, function(x) subset(x, cumsuml < inchivaluel & cumsumr < inchivaluer))
results <- sapply(filts, function(filt) length(filt$cumsuml)/length(alpha))
In trying to avoid using the for loop in R, I wrote a function that returns an average value from one data frame given row-specific values from another data frame. I then pass this function to sapply over the range of row numbers. My function works, but it returns ~ 2.5 results per second, which is not much better than using a for loop. So, I feel like I've not fully exploited the vectorized aspects of the apply family of functions. Can anyone help me rethink my approach? Here is a minimally working example. Thanks in advance.
#Creating first dataframe
dates<-seq(as.Date("2013-01-01"), as.Date("2016-07-01"), by = 1)
n<-length(seq(as.Date("2013-01-01"), as.Date("2016-07-01"), by = 1))
df1<-data.frame(date = dates,
hour = sample(1:24, n,replace = T),
cat = sample(c("a", "b"), n, replace = T),
lag = sample(1:24, n, replace = T))
#Creating second dataframe
df2<-data.frame(date = sort(rep(dates, 24)),
hour = rep(1:24, length(dates)),
p = runif(length(rep(dates, 24)), min = -20, max = 100))
df2<-df2[order(df2$date, df2$hour),]
df2$cat<-"a"
temp<-df2
temp$cat<-"b"
df2<-rbind(df2,temp)
#function
period_mean<-function(x){
tmp<-df2[df$cat == df1[x,]$cat,]
#This line extracts the row name index from tmp,
#in which the two dataframes match on date and hour
he_i<-which(tmp$date == df1[x,]$date & tmp$hour == df1[x,]$hour)
#My lagged period is given by the variable "lag". I want the average
#over the period hour - (hour - lag). Since df2 is sorted such hours
#are consecutive, this method requires that I subset on only the
#relevant value for cat (hence the creation of tmp in the first line
#of the function
p<-mean(tmp[(he_i - df1[x,]$lag):he_i,]$p)
print(x)
print(p)
return(p)
}
#Execute function
out<-sapply(1:length(row.names(df1)), period_mean)
EDIT I have subsequently learned that part of the reason my original problem was iterating so slowly is that my data classes between the two dataframes were not the same. df1$date was a date field, while df2$date was a character field. Of course, this wasn't apparent with the example I posted because the data types were the same by construction. Hope this helps.
Here's one suggestion:
getIdx <- function(i) {
date <- df1$date[i]
hour <- df1$hour[i]
cat <- df1$cat[i]
which(df2$date==date & df2$hour==hour & df2$cat==cat)
}
v_getIdx <- Vectorize(getIdx)
df1$index <- v_getIdx(1:nrow(df1))
b_start <- match("b", df2$cat)
out2 <- apply(df1[,c("cat","lag","index")], MAR=1, function(x) {
flr <- ifelse(x[1]=="a", 1, b_start)
x <- as.numeric(x[2:3])
mean(df2$p[max(flr, (x[2]-x[1])):x[2]])
})
We make a function (getIdx) to retrieve the rows from df2 that match the values from each row in df1, and then Vectorize the function.
We then run the vectorized function to get a vector of rownames. We set b_start to be the row where the "b" category starts.
We then iterate through the rows of df1 with apply. In the mean(...) function, we set the "floor" to be either row 1 (if cat=="a") or b_start (if cat=="b"), which eliminates the need to subset (what you were doing with tmp).
Performance:
> system.time(out<-sapply(1:length(row.names(df1)), period_mean))
user system elapsed
11.304 0.393 11.917
> system.time({
+ df1$index <- v_getIdx(1:nrow(df1))
+ b_start <- match("b", df2$cat)
+ out2 <- apply(df1[,c("cat","lag","index")], MAR=1, function(x) {
+ flr <- ifelse(x[1]=="a", 1, b_start)
+ x <- as.numeric(x[2:3])
+ mean(df2$p[max(flr, (x[2]-x[1])):x[2]])
+ })
+ })
user system elapsed
2.839 0.405 3.274
> all.equal(out, out2)
[1] TRUE
Given a data frame or matrix with arbitrary number of rows and columns, what is the fastest way to apply a function to all pairwise combinations of columns?
For example, if I have a data table:
N <- 3
K <- 3
data <- data.table(id=seq(N))
for(k in seq(K)) {
data[[k]] <- runif(N)
}
And I want to compute the simple difference between all pairs of columns, I could loop (or lapply) over columns:
differences = data.table(foo=seq(N))
for(var1 in names(data)) {
for(var2 in names(data)) {
if (var1==var2) next
if (which(names(data)==var1)>which(names(data)==var2)) next
combo <- paste0(var1, var2)
differences[[combo]] <- data[[var1]]-data[[var2]]
}
}
But as K gets larger, this becomes absurdly slow.
One solution I've considered is to make two new data tables using combn and subtract them:
a <- data[,combn(colnames(data),2)[1,],with=F]
b <- data[,combn(colnames(data),2)[2,],with=F]
differences <- a-b
But as N and K get larger, this becomes very memory intensive (though faster than looping).
It seems to me that the outer product of the matrix with itself is probably the best way to go, but I can't piece it together. This is especially hard if I want to apply an arbitrary function (RMSE for example), instead of just the difference.
What's the fastest way?
If it is necessary to have the data in a matrix first, you can do the following:
library(data.table)
data <- matrix(runif(300*500), nrow = 300, ncol = 500)
data.DT <- setkey(data.table(c(data), colId = rep(1:500, each = 300), rowId = rep(1:300, times = 500)), colId)
diff.DT <- data.DT[
, {
ccl <- unique(colId)
vv <- V1
data.DT[colId > ccl, .(col2 = colId, V1 - vv)]
}
, keyby = .(col1 = colId)
]
I have a data.frame, ordered by mean column that looks like this:
10SE191_2 10SE207 10SE208 mean
7995783 12.64874 13.06391 12.69378 12.73937
8115327 12.69979 12.52285 12.41582 12.50363
8108370 12.58685 12.87818 12.66021 12.45720
7945680 12.46392 12.26087 11.77040 12.36518
7923547 11.98463 11.96649 12.50666 12.33138
8016718 12.81610 12.71548 12.48164 12.32703
I would like to apply a t.test to each row, using as input the intensity values: df[i,1:3] and the mean values from the rows with lower intensities. For example, for the first row I want to compute a t.test for df[1,1:3] vs _mean values_ from row 2 to row 6. My code uses a for loop but my current data.frame has more than 20,000 rows and 24 columns and it takes a long time. Any ideas for improving the code?
Thanks
Code:
temp <- matrix(-9, nrow=dim(matrix.order)[1], ncol=2) #create a result matrix
l <- dim(matrix.order)[1]
for (i in 1:l){
j <- 1+i
if (i < l | j +2 == l) { #avoid not enough y observations
mean.val <- matrix.order[j:l,4]
p <- t.test(matrix.order[i, 1:3], mean.val)
temp[i,1] <- p$p.value
}
else {temp[i,1] <- 1}
}
dput for my df
structure(list(`10SE191_2` = c(12.6487418898415, 12.6997932097351,12.5868508174491, 12.4639169398277, 11.9846348627906, 12.8160978540904), `10SE207` = c(13.0639063105224, 12.522848114011, 12.8781769160682, 12.260865493177, 11.9664905651469, 12.7154788700468), `10SE208` = c(12.6937808736673, 12.4158248856386, 12.6602128982717, 11.7704045448312, 12.5066604109231, 12.4816357798965), mean = c(12.7393707471856, 12.5036313008127, 12.4572035036992, 12.3651842840775, 12.3313821056582, 12.3270331271091)), .Names = c("10SE191_2", "10SE207", "10SE208", "mean"), row.names = c("7995783", "8115327", "8108370", "7945680", "7923547", "8016718"), class = "data.frame")
You can obtain all p-values (if possible) with this command:
apply(df, 1, function(x) {
y <- df$mean[df$mean < x[4]]
if(length(y) > 1)
t.test(x[1:3], y)$p.value
else NA
})
The function will return NA if there are not enough values for y.
7995783 8115327 8108370 7945680 7923547 8016718
0.08199794 0.15627947 0.04993244 0.50885253 NA NA
Running 2E4 t.tests probably takes a lot of time no matter what. Try using Rprof to find the hot spots. You might also want to use mcapply or similar parallel processing tools, since your analysis of each row is independent of all other data (which means this is a task well-suited to multicore parallel processing).
I have a data frame df with 2 variables A and B. I would like to split A in groups 1 and 2 so that mean(df$B[df$group==1]) as close as possible to mean(df$B[df$group==2])
Or just to express it otherwise, what I would like is to find a cut point (cutp) in df$A that would minimize the abs(mean(df$B[df$A<cutp])-mean(df$B[df$A>=cutp]))
Any ideas?
If you want to find a threshold on variable A, to split the data into two groups, so that the means of B in those two groups be similar, you can compute these means for all possible cut-points, and check when the distance between those means is minimal.
# Sample data
n <- 10
d <- data.frame(
A = rnorm(n),
B = rnorm(n)
)
# The quantity to minimize
# (You can use a loop instead of apply.)
d$differences <- apply(
d, 1,
# Compute the difference of the means for each value of A
function (u) {
i <- d$A <= u[1];
abs( mean( d$B[which(i)]) - mean(d$B[which(!i)] ) )
}
)
# The mean of an empty vector is NaN: discard those values
d$differences[ ! is.finite( d$differences ) ] <- Inf
# Take the minimum
threshold <- d$A[ which.min( d$differences ) ]
# Build the groups
d$group <- ifelse( d$A <= threshold, "group 1", "group 2" )
I'm still not sure how column A factors into it. It seems you want to create a new column that has two levels which create ~= mean values for column B. Column A is obviously associated with the new column created, but does not directly factor into the calculation needed. Am I missing something?
Regardless, here's a start (note this can be made much more robust, but proof of concept should work). Define a tolerance that you find acceptable and then set up a while loop to create new groups until the condition is met, i.e.
FUN <- function(tol){
df$groups <- sample(1:2, nrow(df), TRUE)
while(abs(mean(df$B[df$groups == 1]) - mean(df$B[df$groups == 2])) > tol) {
df$groups <- sample(1:2, nrow(df), TRUE)
}
return(df)
}
set.seed(101)
df <- data.frame(A=runif(20),B=runif(20))
#Test it. Means should be less than .02 different and have roughly equivalent sample sizes.
set.seed(101)
out <- FUN(.02)
library(plyr)
> ddply(out, "groups", summarize, n = length(B), mean = mean(B))
groups n mean
1 1 11 0.5229024
2 2 9 0.5037279
I should note that you could create a runaway function if you set tol super low so don't blame me if your computer crashes.