As this is my first time asking a question on SO, I apologize in advance for any improper formatting.
I am very new to R and am trying to create a function that will return the row value of a data frame column once a running total in another column has met or exceeded a given value (the row that the running sum begins in is also an argument).
For example, given the following data frame, if given a starting parameter of x=3 and stop parameter of y=17, the function should return 5 (the x value of the row where the sum of y >= 17).
X Y
1 5
2 10
3 5
4 10
5 5
6 10
7 5
8 10
The function as I've currently written it returns the correct answer, but I have to believe there is a much more 'R-ish' way to accomplish this, instead of using loops and incrementing temporary variables, and would like to learn the right way, rather than form bad habits that I will have to correct later.
A very simplified version of the function:
myFunction<-function(DataFrame,StartRow,Total){
df<-DataFrame[DataFrame[[1]] >= StartRow,]
i<-0
j<-0
while (j < Total) {
i<-i+1
j<-sum(df[[2]][1:i])
}
x<-df[[1]][i]
return(x)
}
All the solutions posted so far compute the cumulative sum of the entire Y variable, which can be inefficient in cases where the data frame is really large but the index is near the beginning. In this case, a solution with Rcpp could be more efficient:
library(Rcpp)
get_min_cum2 = cppFunction("
int gmc2(NumericVector X, NumericVector Y, int start, int total) {
double running = 0.0;
for (int idx=0; idx < Y.size(); ++idx) {
if (X[idx] >= start) {
running += Y[idx];
if (running >= total) {
return X[idx];
}
}
}
return -1; // Running total never exceeds limit
}")
Comparison with microbenchmark:
get_min_cum <-
function(start,total)
with(dat[dat$X>=start,],X[min(which(cumsum(Y)>total))])
get_min_dt <- function(start, total)
dt[X >= start, X[cumsum(Y) >= total][1]]
set.seed(144)
dat = data.frame(X=1:1000000, Y=abs(rnorm(1000000)))
dt = data.table(dat)
get_min_cum(3, 17)
# [1] 29
get_min_dt(3, 17)
# [1] 29
get_min_cum2(dat$X, dat$Y, 3, 17)
# [1] 29
library(microbenchmark)
microbenchmark(get_min_cum(3, 17), get_min_dt(3, 17),
get_min_cum2(dat$X, dat$Y, 3, 17))
# Unit: milliseconds
# expr min lq median uq max neval
# get_min_cum(3, 17) 125.324976 170.052885 180.72279 193.986953 418.9554 100
# get_min_dt(3, 17) 100.990098 149.593250 162.24523 176.661079 399.7531 100
# get_min_cum2(dat$X, dat$Y, 3, 17) 1.157059 1.646184 2.30323 4.628371 256.2487 100
In this case, it's about 100x faster to use the Rcpp solution than other approaches.
Try this for example, I am using cumsum and vectorized logical subsetting:
get_min_cum <-
function(start,total)
with(dat[dat$X>=start,],X[min(which(cumsum(Y)>total))])
get_min_cum(3,17)
5
Here you go (using data.table because of ease of syntax):
library(data.table)
dt = data.table(df)
dt[X >= 3, X[cumsum(Y) >= 17][1]]
#[1] 5
Well, here's one way:
i <- 3
j <- 17
min(df[i:nrow(df),]$X[cumsum(df$Y[i:nrow(df)])>j])
# [1] 5
This takes df$X for rows i:nrow(df) and indexes that based on cumsum(df$Y) > j, starting also at row i. This returns all df$X for which the cumsum > j. min(...) then returns the smallest value.
with(df, which( cumsum( (x>=3)*y) >= 17)[1] )
Related
I am trying to sum the odds numbers of a specific number (but excluding itself), for example: N = 5 then 1+3 = 4
a<-5
sum<-function(x){
k<-0
for (n in x) {
if(n %% 2 == 1)
k<-k+1
}
return(k)
}
sum(a)
# [1] 1
But the function is not working, because it counts the odds numbers instead of summing them.
We may use vectorized approach
a1 <- head(seq_len(a), -1)
sum(a1[a1%%2 == 1])
[1] 4
If we want a loop, perhaps
f1 <- function(x) {
s <- 0
k <- 1
while(k < x) {
if(k %% 2 == 1) {
s <- s + k
}
k <- k + 1
}
s
}
f1(5)
The issue in OP's code is
for(n in x)
where x is just a single value and thus n will be looped once - i.e. if our input is 5, then it will be only looped once and 'n' will be 5. Instead, it would be seq_len(x -1). The correct loop would be something like
f2<-function(x){
k<- 0
for (n in seq_len(x-1)) {
if(n %% 2 == 1) {
k <- k + n
}
}
k
}
f2(5)
NOTE: sum is a base R function. So, it is better to name the custom function with a different name
Mathematically, we can try the following code to calculate the sum (N could be odd or even)
(ceiling((N - 1) / 2))^2
It's simple and it does what it says:
sum(seq(1, length.out = floor(N/2), by = 2))
The multiplication solution is probably gonna be quicker, though.
NB - an earlier version of this answer was
sum(seq(1, N - 1, 2))
which as #tjebo points out, silently gives the wrong answer for N = 1.
We could use logical statement to access the values:
a <- 5
a1 <- head(seq_len(a), -1)
sum(a1[c(TRUE, FALSE)])
output:
[1] 4
Fun benchmarking. Does it surprise that Thomas' simple formula is by far the fastest solution...?
count_odds_thomas <- function(x){
(ceiling((x - 1) / 2))^2
}
count_odds_akrun <- function(x){
a1 <- head(seq_len(x), -1)
sum(a1[a1%%2 == 1])
}
count_odds_dash2 <- function(x){
sum(seq(1, x - 1, 2))
}
m <- microbenchmark::microbenchmark(
akrun = count_odds_akrun(10^6),
dash2 = count_odds_dash2(10^6),
thomas = count_odds_thomas(10^6)
)
m
#> Unit: nanoseconds
#> expr min lq mean median uq max neval
#> akrun 22117564 26299922.0 30052362.16 28653712 31891621 70721894 100
#> dash2 4016254 4384944.0 7159095.88 4767401 8202516 52423322 100
#> thomas 439 935.5 27599.34 6223 8482 2205286 100
ggplot2::autoplot(m)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.
Moreover, Thomas solution works on really big numbers (also no surprise)... on my machine, count_odds_akrun stuffs the memory at a “mere” 10^10, but Thomas works fine till Infinity…
count_odds_thomas(10^10)
#> [1] 2.5e+19
count_odds_akrun(10^10)
#> Error: vector memory exhausted (limit reached?)
This question is related to my previous one. Here is a small sample data. I have used both data.table and data.frame to find a faster solution.
test.dt <- data.table(strt=c(1,1,2,3,5,2), end=c(2,1,5,5,5,4), a1.2=c(1,2,3,4,5,6),
a2.3=c(2,4,6,8,10,12), a3.4=c(3,1,2,4,5,1), a4.5=c(5,1,15,10,12,10),
a5.6=c(4,8,2,1,3,9))
test.dt[,rown:=as.numeric(row.names(test.dt))]
test.df <- data.frame(strt=c(1,1,2,3,5,2), end=c(2,1,5,5,5,4), a1.2=c(1,2,3,4,5,6),
a2.3=c(2,4,6,8,10,12), a3.4=c(3,1,2,4,5,1), a4.5=c(5,1,15,10,12,10),
a5.6=c(4,8,2,1,3,9))
test.df$rown <- as.numeric(row.names(test.df))
> test.df
strt end a1.2 a2.3 a3.4 a4.5 a5.6 rown
1 1 2 1 2 3 5 4 1
2 1 1 2 4 1 1 8 2
3 2 5 3 6 2 15 2 3
4 3 5 4 8 4 10 1 4
5 5 5 5 10 5 12 3 5
6 2 4 6 12 1 10 9 6
I want to use the start and end column values to determine the range of columns to subset (columns from a1.2 to a5.6) and obtain the mean. For example, in the first row, since strt=1 and end=2, I need to get the mean of a1.2 and a2.3; in the third row, I need to get the mean of a2.3, a3.4, a4.5, and a5.6
The output should be a vector like this
> k
1 2 3 4 5 6
1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
Here, is what I tried:
Solution 1: This uses the data.table and applies a function over it.
func.dt <- function(rown, x, y) {
tmp <- paste0("a", x, "." , x+1)
tmp1 <- paste0("a", y, "." , y+1)
rowMeans(test.dt[rown,get(tmp):get(tmp1), with=FALSE])
}
k <- test.dt[, func.dt(rown, strt, end), by=.(rown)]
Solution 2: This uses the data.frame and applies a function over it.
func.df <- function(rown, x, y) {
rowMeans(test.df[rown,(x+2):(y+2), drop=FALSE])
}
k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
Solution 3: This uses the data.frame and loops through it.
test.ave <- rep(NA, length(test1$strt))
for (i in 1 : length(test.df$strt)) {
test.ave[i] <- rowMeans(test.df[i, as.numeric(test.df[i,1]+2):as.numeric(test.df[i,2]+2), drop=FALSE])
}
Benchmarking shows that Solution 2 is the fastest.
test replications elapsed relative user.self sys.self user.child sys.child
1 sol1 100 0.67 4.786 0.67 0 NA NA
2 sol2 100 0.14 1.000 0.14 0 NA NA
3 sol3 100 0.15 1.071 0.16 0 NA NA
But, this is not good enough for me. Given the size of my data, these functions would need to run for a few days before I get the output. I am sure that I am not fully utilizing the power of data.table and I also know that my functions are crappy (they refer to the dataset in the global environment without passing it). Unfortunately, I am out of my depth and do not know how to fix these issues and make my functions fast. I would greatly appreciate any suggestions that help in improving my function(s) or point to alternate solutions.
I was curious how fast I could make this without resorting to writing custom C or C++ code. The best I could come up with is below. Note that using mean.default will provide greater precision, since it does a second pass over the data for error correction.
f_jmu <- compiler::cmpfun({function(m) {
# remove start/end columns from 'm' matrix
ma <- m[,-(1:2)]
# column index for each row in 'ma' matrix
cm <- col(ma)
# logical index of whether we need the column for each row
i <- cm >= m[,1L] & cm <= m[,2L]
# multiply the input matrix by the index matrix and sum it
# divide by the sum of the index matrix to get the mean
rowSums(i*ma) / rowSums(i)
}})
The Rcpp function is still faster (not surprisingly), but the function above gets respectably close. Here's an example on 50 million observations on my laptop with an i7-4600U and 12GB of RAM.
set.seed(21)
N <- 5e7
test.df <- data.frame(strt = 1L,
end = sample(5, N, replace = TRUE),
a1.2 = sample(3, N, replace = TRUE),
a2.3 = sample(7, N, replace = TRUE),
a3.4 = sample(14, N, replace = TRUE),
a4.5 = sample(8, N, replace = TRUE),
a5.6 = sample(30, N, replace = TRUE))
test.df$strt <- pmax(1L, test.df$end - sample(3, N, replace = TRUE) + 1L)
test.m <- as.matrix(test.df)
Also note that I take care to ensure that test.m is an integer matrix. That helps reduce the memory footprint, which can help make things faster.
R> system.time(st1 <- MYrcpp(test.m))
user system elapsed
0.900 0.216 1.112
R> system.time(st2 <- f_jmu(test.m))
user system elapsed
6.804 0.756 7.560
R> identical(st1, st2)
[1] TRUE
Unless you can think of a way to do this with a clever subsetting approach, I think you've reached R's speed barrier. You'll want to use a low-level language like C++ for this problem. Fortunately, the Rcpp package makes interfacing with C++ in R simple. Disclaimer: I've never written a single line of C++ code in my life. This code may be very inefficient.
library(Rcpp)
cppFunction('NumericVector MYrcpp(NumericMatrix x) {
int nrow = x.nrow(), ncol = x.ncol();
NumericVector out(nrow);
for (int i = 0; i < nrow; i++) {
double avg = 0;
int start = x(i,0);
int end = x(i,1);
int N = end - start + 1;
while(start<=end){
avg += x(i, start + 1);
start = start + 1;
}
out[i] = avg/N;
}
return out;
}')
For this code I'm going to pass the data.frame as a matrix (i.e. testM <- as.matrix(test.df))
Let's see if it works...
MYrcpp(testM)
[1] 1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
How fast is it?
Unit: microseconds
expr min lq mean median uq max neval
f2() 1543.099 1632.3025 2039.7350 1843.458 2246.951 4735.851 100
f3() 1859.832 1993.0265 2642.8874 2168.012 2493.788 19619.882 100
f4() 281.541 315.2680 364.2197 345.328 375.877 1089.994 100
MYrcpp(testM) 3.422 10.0205 16.7708 19.552 21.507 56.700 100
Where f2(), f3() and f4() are defined as
f2 <- function(){
func.df <- function(rown, x, y) {
rowMeans(test.df[rown,(x+2):(y+2), drop=FALSE])
}
k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
}
f3 <- function(){
test.ave <- rep(NA, length(test.df$strt))
for (i in 1 : length(test.df$strt)) {
test.ave[i] <- rowMeans(test.df[i,as.numeric(test.df[i,1]+2):as.numeric(test.df[i,2]+2), drop=FALSE])
}
}
f4 <- function(){
lapply(
apply(test.df,1, function(x){
x[(x[1]+2):(x[2]+2)]}),
mean)
}
That's roughly a 20x increase over the fastest.
Note, to implement the above code you'll need a C complier which R can access. For windows look into Rtools. For more on Rcpp read this
Now let's see how it scales.
N = 5e3
test.df <- data.frame(strt = 1,
end = sample(5, N, replace = TRUE),
a1.2 = sample(3, N, replace = TRUE),
a2.3 = sample(7, N, replace = TRUE),
a3.4 = sample(14, N, replace = TRUE),
a4.5 = sample(8, N, replace = TRUE),
a5.6 = sample(30, N, replace = TRUE))
test.df$rown <- as.numeric(row.names(test.df))
test.dt <- as.data.table(test.df)
microbenchmark(f4(), MYrcpp(testM))
Unit: microseconds
expr min lq mean median uq max neval
f4() 88647.256 108314.549 125451.4045 120736.073 133487.5295 259502.49 100
MYrcpp(testM) 196.003 216.533 242.6732 235.107 261.0125 499.54 100
With 5e3 rows MYrcpp is now 550x faster. This partially due to the fact that f4() is not going to scale well as Richard discusses in the comment. The f4() is essentially invoking a nested for loop by calling an apply within a lapply. Interestingly, the C++ code is also invoking a nested loop by utilizing a while loop inside a for loop. The speed disparity is due in large part to the fact that the C++ code is already complied and does not need to be interrupted into something the machine can understand at run time.
I'm not sure how big your data set is, but when I run MYrcpp on a data.frame with 1e7 rows, which is the largest data.frame I could allocate on my crummy laptop, it ran in 500 milliseconds.
Update: R equivalent of C++ code
MYr <- function(x){
nrow <- nrow(x)
ncol <- ncol(x)
out <- matrix(NA, nrow = 1, ncol = nrow)
for(i in 1:nrow){
avg <- 0
start <- x[i,1]
end <- x[i,2]
N <- end - start + 1
while(start<=end){
avg <- avg + x[i, start + 2]
start = start + 1
}
out[i] <- avg/N
}
out
}
Both MYrcpp and MYr are similar in many ways. Let me discuss a couple of the differences
The first line of MYrcpp is different from the MYr. In words the first line of MYrcpp, NumericVector MYrcpp(NumericMatrix x), means that we are defining a function whose name is MYrcpp which returns an output of class NumericVector and takes an input x of class NumericMatrix.
In C++ you have to define the class of a variable when you introduce it, i.e. int nrow = x.row() is a variable whose name is nrow whose class is int (i.e. integer) and is assigned to be x.nrow() i.e. the number of rows of x. (IGNORE if you're overwhelmed, nrow() is a method for instances of class `NumericVector. Like in Python you call a method by attaching it to the instance. The R equivalent is S3 and S4 methods)
When you subset in C++ you use () instead of [] like in R. Also, indexing begins at zero (like in Python). For example, x(0,1) in C++ is equivalent to x[1,2] in R
++ is an operator that means increment by 1, i.e. j++ is the same as j + 1. += is an operator that means add to together and assign, i.e. a += b is the same as a = a + b
My solution is the first one in the benchmark
library(microbenchmark)
microbenchmark(
lapply(
apply(test.df,1, function(x){
x[(x[1]+2):(x[2]+2)]}),
mean),
test.dt[, func.dt(rown, strt, end), by=.(rown)]
)
min lq mean median uq max neval
138.654 175.7355 254.6245 201.074 244.810 3702.443 100
4243.641 4747.5195 5576.3399 5252.567 6247.201 8520.286 100
It seems to be 25 times faster, but this is a small dataset. I am sure there is a better way to do this than what I have done.
I have a matrix that contains 3 columns and in total 10,000 elements. First and second columns are indexes and third column is the score. I want to normalize the score column based on this formula:
Normalized_score_i_j = score_i_j / ((sqrt(score_i_i) * (sqrt(score_j_j))
score_i_j = the current score itself
score_i_i = look at current score's index in first column, and in the dataset look for a score that has that index in both its first and second columns
score_j_j = look at current score's index in second column, and in the dataset look for a score that has that index in both its first and second columns
An example is for instance, if df is as follow:
df <- read.table(text = "
First.Protein,Second.Protein,Score
1,1,25
1,2,90
1,3,82
1,4,19
2,1,90
2,2,99
2,3,76
2,4,79
3,1,82
3,2,76
3,3,91
3,4,33
4,1,28
4,2,11
4,3,99
4,4,50
", header = TRUE, sep = ",")
If we are normalizing this row:
First.Protein Second.Protein Score
4 3 99
The normalized score will be:
The score itself divided by the sqrt of a score that its First.Protein and Second.Protein index are both 4 multiplied by the sqrt of a score where its First.Protein and Second.Protein indexes are both 3.
Therefore:
Normalized = 99 / (sqrt(50) * sqrt(91)) = 1.467674
I have the code below, but it is behaving very weirdly and is giving me values that are not at all normalized and are in fact very odd:
for(i in 1:nrow(Smith_Waterman_Scores))
{
Smith_Waterman_Scores$Score[i] <-
Smith_Waterman_Scores$Score[i] /
(sqrt(Smith_Waterman_Scores$Score[which(Smith_Waterman_Scores$First.Protein==Smith_Waterman_Scores$First.Protein[i] & Smith_Waterman_Scores$Second.Protein==Smith_Waterman_Scores$First.Protein[i])])) *
(sqrt(Smith_Waterman_Scores$Score[which(Smith_Waterman_Scores$First.Protein==Smith_Waterman_Scores$Second.Protein[i] & Smith_Waterman_Scores$Second.Protein==Smith_Waterman_Scores$Second.Protein[i])]))
}
Here's a re-write of your original attempt (which() is not necessary; just use the logical vector for sub-setting; with() allows you to refer to variables in the data frame without having to re-type the name of the data.frame -- easier to read but also easier to make a mistake)
orig0 <- function(df) {
for(i in 1:nrow(df)) {
df$Score[i] <- with(df, {
ii <- First.Protein == First.Protein[i] &
Second.Protein == First.Protein[i]
jj <- First.Protein == Second.Protein[i] &
Second.Protein == Second.Protein[i]
Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
})
}
df$Score
}
The problem is that Score[ii] and Score[jj] appear on the right-hand side both before and after they've been updated. Here's a revision where the original columns are interpreted as 'read-only'
orig1 <- function(df) {
normalized <- numeric(nrow(df)) # pre-allocate
for(i in 1:nrow(df)) {
normalized[i] <- with(df, {
ii <- First.Protein == First.Protein[i] &
Second.Protein == First.Protein[i]
jj <- First.Protein == Second.Protein[i] &
Second.Protein == Second.Protein[i]
Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
})
}
normalized
}
I think the results are now correct (see below). A better implementation would use sapply (or vapply) to avoid having to worry about the allocation of the return value
orig2 <- function(df) {
sapply(seq_len(nrow(df)), function(i) {
with(df, {
ii <- First.Protein == First.Protein[i] &
Second.Protein == First.Protein[i]
jj <- First.Protein == Second.Protein[i] &
Second.Protein == Second.Protein[i]
Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
})
})
}
Now that the results are correct, we can ask about performance. Your solution requires a scan of, e.g., First.Protein, each time through the loop. There are N=nrow(df) elements of First.Protein, and you're going through the loop N times, so you'll be making a multiple of N * N = N^2 comparisons -- if you increase the size of the data frame from 10 to 100 rows, the time taken will change from 10 * 10 = 100 units, to 100 * 100 = 10000 units of time.
Several of the answers attempt to avoid that polynomial scaling. My answer does this using match() on a vector of values; this probably scales as N (each look-up occurs in constant time, and there are N look-ups), which is much better than polynomial.
Create a subset of data with identical first and second proteins
ii = df[df$First.Protein == df$Second.Protein,]
Here's the ijth score from the original data frame
s_ij = df$Score
Look up First.Protein of df in ii and record the score; likewise for Second.Protein
s_ii = ii[match(df$First.Protein, ii$First.Protein), "Score"]
s_jj = ii[match(df$Second.Protein, ii$Second.Protein), "Score"]
The normalized scores are then
> s_ij / (sqrt(s_ii) * sqrt(s_jj))
[1] 1.0000000 1.8090681 1.7191871 0.5374012 1.8090681 1.0000000 0.8007101
[8] 1.1228571 1.7191871 0.8007101 1.0000000 0.4892245 0.7919596 0.1563472
[15] 1.4676736 1.0000000
This will be fast, using a single call to match() instead of many calls to which() inside a for loop or tests for identity inside an apply() -- both of the latter make N^2 comparisons and so scale very poorly.
I summarized some of the proposed solutions as
f0 <- function(df) {
contingency = xtabs(Score ~ ., df)
diagonals <- unname(diag(contingency))
i <- df$First.Protein
j <- df$Second.Protein
idx <- matrix(c(i, j), ncol=2)
contingency[idx] / (sqrt(diagonals[i]) * sqrt(diagonals[j]))
}
f1 <- function(df) {
ii = df[df$First.Protein == df$Second.Protein,]
s_ij = df$Score
s_ii = ii[match(df$First.Protein, ii$First.Protein), "Score"]
s_jj = ii[match(df$Second.Protein, ii$Second.Protein), "Score"]
s_ij / (sqrt(s_ii) * sqrt(s_jj))
}
f2 <- function(dt) {
dt.lookup <- dt[First.Protein == Second.Protein]
setkey(dt,"First.Protein" )
setkey(dt.lookup,"First.Protein" )
colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score1")
dt <- dt[dt.lookup]
setkey(dt,"Second.Protein" )
setkey(dt.lookup,"Second.Protein")
colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score2")
dt[dt.lookup][
, Normalized := Score / (sqrt(Score1) * sqrt(Score2))][
, .(First.Protein, Second.Protein, Normalized)]
}
f3 <- function(dt) {
eq = dt[First.Protein == Second.Protein]
dt[eq, Score_ii := i.Score, on = "First.Protein"]
dt[eq, Score_jj := i.Score, on = "Second.Protein"]
dt[, Normalised := Score/sqrt(Score_ii * Score_jj)]
dt[, c("Score_ii", "Score_jj") := NULL]
}
I know how to programmatically check that the first two generate consistent results; I don't know data.table well enough to get the normalized result out in the same order as the input columns for f2() so can't compare with the others (though they look correct 'by eye'). f3() produces numerically similar but not identical results
> identical(orig1(df), f0(df))
[1] TRUE
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), { f3(dt3); dt3[["Normalized"]] }) # pass by reference!
[1] FALSE
> all.equal(f0(df), { f3(dt3); dt3[["Normalized"]] })
[1] TRUE
There are performance differences
library(data.table)
dt2 <- as.data.table(df)
dt3 <- as.data.table(df)
library(microbenchmark)
microbenchmark(f0(df), f1(df), f2(dt2), f3(dt3))
with
> microbenchmark(f0(df), f1(df), f2(df), f3(df))
Unit: microseconds
expr min lq mean median uq max neval
f0(df) 967.117 992.8365 1059.7076 1030.9710 1094.247 2384.360 100
f1(df) 176.238 192.8610 210.4059 207.8865 219.687 333.260 100
f2(df) 4884.922 4947.6650 5156.0985 5017.1785 5142.498 6785.975 100
f3(df) 3281.185 3329.4440 3463.8073 3366.3825 3443.400 5144.430 100
The solutions f0 - f3 are likely to scale well (especially data.table) with real data; the fact that the times are in microseconds probably means that speed is not important (now that we are not implementing an N^2 algorithm).
On reflection, a more straight-forward impelementation of f1() just looks up the 'diagonal' elements
f1a <- function(df) {
ii = df[df$First.Protein == df$Second.Protein, ]
d = sqrt(ii$Score[order(ii$First.Protein)])
df$Score / (d[df$First.Protein] * d[df$Second.Protein])
}
You may be doing this in a very round-about manner. Can you see if this works for you:
R> xx
First Second Score
1 1 1 25
2 1 2 90
3 1 3 82
4 1 4 19
5 2 1 90
6 2 2 99
7 2 3 76
8 2 4 79
9 3 1 82
10 3 2 76
11 3 3 91
12 3 4 33
13 4 1 28
14 4 2 11
15 4 3 99
16 4 4 50
R> contingency = xtabs(Score ~ ., data=xx)
R> contingency
Second
First 1 2 3 4
1 25 90 82 19
2 90 99 76 79
3 82 76 91 33
4 28 11 99 50
R> diagonals <- unname(diag(contingency))
R> diagonals
[1] 25 99 91 50
R> normalize <- function (i, j, contingencies, diagonals) {
+ contingencies[i, j] / (sqrt(diagonals[i]) * sqrt(diagonals[j]))
+ }
R> normalize(4, 3, contingency, diagonals)
[1] 1.467674
Here's how I'd approach using data.table. Hopefully #MartinMorgan finds this easier to understand :-).
require(data.table) # v1.9.6+
dt = as.data.table(df) # or use setDT(df) to convert by reference
eq = dt[First.Protein == Second.Protein]
So far, I've just created a new data.table eq which contains all rows where both columns are equal.
dt[eq, Score_ii := i.Score, on = "First.Protein"]
dt[eq, Score_jj := i.Score, on = "Second.Protein"]
Here we add columns Score_ii and Score_jj while joining on columns First.Protein and Second.Protein. That it is a join operation should be clear because of on= argument. The i. refers to the Score column in the data.table provided in the i-argument (here, eq's Score).
Note that we can use match() here as well. But that wouldn't work if you've to lookup directly (and as efficiently) based on more than one column. Using on=, we can extend this quite easily, and is also much easier to read/understand.
Once we've all the required columns, the task is just to get the final Normalised column (and delete the intermediates if they're not necessary).
dt[, Normalised := Score/sqrt(Score_ii * Score_jj)]
dt[, c("Score_ii", "Score_jj") := NULL] # delete if you don't want them
I'll leave out the micro- and milli- second benchmarks as I'm not interested in them.
PS: The columns Score_ii and Score_jj are added above on purpose under the assumption that you might need them. If you don't want them at all, you can also do:
Score_ii = eq[dt, Score, on = "First.Protein"] ## -- (1)
Score_jj = eq[dt, Score, on = "Second.Protein"]
(1) reads: for each row in dt get matching row in eq while matching on column First.Protein and extract eq$Score corresponding to that matching row.
Then, we can directly add the Normalised column as:
dt[, Normalised := Score / sqrt(Score_ii * Score_jj)]
You can can implement this with joins, here is an example using data.table:
library(data.table)
dt <- data.table(df)
dt.lookup <- dt[First.Protein == Second.Protein]
setkey(dt,"First.Protein" )
setkey(dt.lookup,"First.Protein" )
colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score1")
dt <- dt[dt.lookup]
setkey(dt,"Second.Protein" )
setkey(dt.lookup,"Second.Protein")
colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score2")
dt <- dt[dt.lookup][
, Normalized := Score / (sqrt(Score1) * sqrt(Score2))][
, .(First.Protein, Second.Protein, Normalized)]
Just make sure you don't use for loops.
Loop through rows using apply:
#compute
df$ScoreNorm <-
apply(df, 1, function(i){
i[3] /
(
sqrt(df[ df$First.Protein == i[1] &
df$Second.Protein == i[1], "Score"]) *
sqrt(df[ df$First.Protein == i[2] &
df$Second.Protein == i[2], "Score"])
)
})
#test output
df[15, ]
# First.Protein Second.Protein Score ScoreNorm
# 15 4 3 99 1.467674
I have a little problem in a function.
The aim of it is to remove outliers I've detected in my data.frame. They are detected when there's a too big difference with the previous correct value (e.g c(1,2,3,20,30,4,5,6): "20" and "30" are the outliers). But my data is much more complex than this.
My idea is to consider the first two numeric values of my column as "correct". Then, I want to test each next value:
if the difference between the tested value and the previous one is <20, then it's a new correct one, and the test must start again from this new correct value (and not from the previous correct one)
if the same difference is >20, then it's a wrong one. An index must be put next to the wrong value, and the test must still continue from this same correct value, until a new correct value is detected
Here's an example with my function and a fake DF:
myts <- data.frame(x=c(12,12,35,39,46,45,33,5,26,28,29,34,15,15),z=NA)
test <- function(x){
st1 = NULL
temp <- st1[1] <- x[1]
st1 <- numeric(length(x))
for (i in 2:(length(x))){
if((!is.na(x[i])) & (!is.na(x[i-1]))& (abs((x[i])-(temp)) > 20)){
st1[i] <- 1
} }
return(st1)
}
myts[,2] <- apply(as.data.frame(myts[,1]),2,test)
myts[,2] <- as.numeric(myts[,2])
It does nearly the job, but the problem is that the last correct value is not memorized. It still does the test from the first correct value.
Due to this, rows 9 to 11 in my example are not detected. I let you imagine the problem on a 500,000 rows data.frame.
How can I solve this little problem? The rest of the function may be OK.
You just need to update temp for any indices that aren't outliers:
test <- function(x) {
temp <- x[1]
st1 <- numeric(length(x))
for (i in 2:(length(x))){
if(!is.na(x[i]) & !is.na(x[i-1]) & abs(x[i]-temp) > 20) {
st1[i] <- 1
} else {
temp <- x[i]
}
}
return(st1)
}
myts[,2] <- apply(as.data.frame(myts[,1]),2,test)
myts[,2] <- as.numeric(myts[,2])
myts
# x z
# 1 12 0
# 2 12 0
# 3 35 1
# 4 39 1
# 5 46 1
# 6 45 1
# 7 33 1
# 8 5 0
# 9 26 1
# 10 28 1
# 11 29 1
# 12 34 1
# 13 15 0
# 14 15 0
One thing to note is that for loops in R will be quite slow compared to vectorized functions. However, because each element in your vector depends on a complicated way on the previous ones, it's tough to use R's built-in vectorized functions to efficiently compute your vector. You can convert this code nearly verbatim to C++ and use the Rcpp package to regain the efficiency:
library(Rcpp)
test2 <- cppFunction(
"IntegerVector test2(NumericVector x) {
const int n = x.length();
IntegerVector st1(n, 0);
double temp = x[0];
for (int i=1; i < n; ++i) {
if (!R_IsNA(x[i]) && !R_IsNA(x[i]) && fabs(x[i] - temp) > 20.0) {
st1[i] = 1;
} else {
temp = x[i];
}
}
return st1;
}")
all.equal(test(myts[,1]), test2(myts[,1]))
# [1] TRUE
# Benchmark on large vector with some NA values:
set.seed(144)
large.vec <- c(0, sample(c(1:50, NA), 1000000, replace=T))
all.equal(test(large.vec), test2(large.vec))
# [1] TRUE
library(microbenchmark)
microbenchmark(test(large.vec), test2(large.vec))
# Unit: milliseconds
# expr min lq mean median uq max neval
# test(large.vec) 2343.684164 2468.873079 2667.67970 2604.22954 2747.23919 3753.54901 100
# test2(large.vec) 9.596752 9.864069 10.97127 10.23011 11.68708 16.67855 100
The Rcpp code is about 250x faster on a vector of length 1 million. Depending on your use case this speedup may or may not be important.
I am very new in using R and trying to get my hear around different commands. I have this simple code:
setwd("C:/Research")
tempdata=read.csv("temperature_humidity.csv")
Thour=tempdata$t
RHhour=tempdata$RH
weather=data.frame(cbind(hour,Thour,RHhour))
head(weather)
if (Thour>25) {
y=0 else {
y=3
}
x=Thour+y*2
x
I simply want the code to read the Thour(temperature) from CSV file and if it is higher than 25 then uses y=0 in the formula, if its lower than 25 then uses y=3
I tried ifelse but it doesn't work as well.
Thanks for your help.
I've said that too many times today already, but avoid ifelse statements as much as possible (very inefficient and unnecessary in most cases), try this instead:
c(3, 0)[(Thour >= 25) + 1]
This solution will return a logical vector of TRUE/FALSE which will be coerced to 0/1 when added to 1 and become 1/2 which will be the indexes in c(3, 0)
Or even better solution (posted by #BondedDust in comments) would be:
3*(Thour <= 25)
This solution will return a logical vector of TRUE/FALSE which will be coerced to 0/1 when multiplied by 3
Benchmark comparison:
Thour <- sample(1:100000)
require(microbenchmark)
microbenchmark(ifel = {ifelse(Thour < 25 , 0 , 3)}, Bool = {3*(Thour >= 25)})
Unit: microseconds
expr min lq median uq max neval
ifel 38633.499 41643.768 41786.978 55153.050 59169.69 100
Bool 901.135 1848.091 1972.434 2010.841 20754.74 100
This should work for you. Just replace what you're naming Thour with the appropriate code.
Thour <- sample(1:100, 1)
Thour
# [1] 8
y <- ifelse(Thour >= 25, 0, 3)
y
# [1] 3
And:
Thour <- sample(1:100, 1)
Thour
# [1] 37
y <- ifelse(Thour >= 25, 0, 3)
y
# [1] 0
You may need to change the logical operator (>=) to match your exact circumstance since it's unclear, which if any of the higher or lower range you want to be inclusive.
R has a very flexible syntax. So you can write this in many ways:
# ifelse() function
y <- ifelse(Thour > 25, 0, 3)
# more ifelse()
y <- 3 * ifelse(Thour > 25, 0, 1)
# The simpler way:
y <- 3 * (Thour > 25)
By the way, use <- instead of = for assignment... it's the "preferred" style