Comparing two lists of values of different lengths - r

I have a long list of random numbers between 1 and 100, and I would like to count how many of them are larger than 10, 20, 30, etc.
x <- c(sample(1:100, 500, replace = T))
y <- seq(0,100, by = 10)
I am looking for this to return an output such as:
Total  10  20  30  40  50
Count   7  13  17  28  42
Where Count is the number of x values that are larger than Total (each y value).
So far, I have tried
Count = ifelse(x > y, 1, 0)
However, this returns a binary 1/0 result for each of the 500 values of x.
I'd appreciate any help

This answer assumes you're looking for counts per interval, not for the cumulative count of numbers greater than each threshold.
cut + table are useful here:
table(cut(x, breaks = y))
  (0,10]  (10,20]  (20,30]  (30,40]  (40,50]  (50,60]  (60,70]  (70,80]  (80,90] (90,100]
      51       66       36       44       54       49       55       46       58       41
findInterval + table will give you the same result
table(findInterval(x, y, left.open = TRUE))
Data
set.seed(505)
x <- c(sample(1:100, 500, replace = T))
y <- seq(0,100, by = 10)
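If the cumulative counts asked for in the question are wanted after all, they can be derived from the interval counts. The following is only a sketch, assuming every x lies in (0, 100] so that no value falls outside the breaks; the name greater is introduced here for illustration.

```r
set.seed(505)
x <- sample(1:100, 500, replace = TRUE)
y <- seq(0, 100, by = 10)

# counts per interval (0,10], (10,20], ..., (90,100]
tab <- table(cut(x, breaks = y))

# number of x above each lower break:
# add up the count of that interval and of all higher intervals
greater <- rev(cumsum(rev(as.vector(tab))))

data.frame(Total = head(y, -1), Count = greater)
```

The first entry counts the values larger than 0, the second those larger than 10, and so on up to 90.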

With base R this is one approach
rbind(Total = y, Count = rowSums(sapply(x, ">", y)))
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
Total    0   10   20   30   40   50   60   70   80    90   100
Count  500  444  381  329  279  241  198  150  104    52     0
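The attempt in the question, ifelse(x > y, 1, 0), compares the two vectors element by element and recycles the shorter y, which is why it returns 500 values instead of one count per threshold. A minimal sketch of another base R route, assuming the same x and y as in the question, builds the full comparison table with outer() and sums each column (the names greater and counts are introduced here):

```r
set.seed(505)
x <- sample(1:100, 500, replace = TRUE)
y <- seq(0, 100, by = 10)

# outer() compares every x with every y: a 500 x 11 logical matrix
greater <- outer(x, y, ">")

# column j holds the comparisons against y[j], so its sum is the count
counts <- colSums(greater)

data.frame(Total = y, Count = counts)
```

This is the same computation as the rowSums(sapply(...)) version, with the matrix oriented the other way.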

If I understood correctly, this might work:
x <- c(sample(1:100, 500, replace = T))
y <- seq(0,100, by = 10)
is_bigger_than <- function(y){
data.frame(y, n = sum(x > y,na.rm = TRUE))
}
purrr::map_df(y,is_bigger_than)
     y   n
1    0 500
2   10 450
3   20 403
4   30 359
5   40 305
6   50 264
7   60 201
8   70 155
9   80 100
10  90  52
11 100   0

Related

R: Counting rows in a dataframe in which all values fall within individual ranges

I have a dataframe (A rows x K columns). For each column, I get the 5th and 95th percentile value. I want to know how many rows in the df have all K of their values within these K sets of 5th and 95th percentile values.
Example code below works, removing rows that do not fall within the bounds (1 and 9 here, but in practice will be percentiles), and then counting what remains. But A will be 10K and K will be 40, and I am simulating this dataframe 10K times, so I am wondering if there is code that will run faster.
data <- rbind(c(1,2,3,4,5), c(3,5,7,8,5), c(2,8,9,5,9), c(9,1,1,8,9),
c(3,5,6,7,5))
Lower_Bound <- rbind(1,1,1,1,1)
Upper_Bound <- rbind(9,9,9,9,9)
for (i in c(1:5)) {
data <- data[data[,i] > Lower_Bound[i,],]
data <- data[data[,i] < Upper_Bound[i,],]
}
N <- nrow(data)
If I understand correctly, the OP is only interested in the number of rows which fulfill the condition. So, there is no need to actually remove rows from data that do not fall within the bounds. It is sufficient to count the number of rows which do fall within the bounds.
This answer contains solutions for
matrices
data.frames
and a benchmark which compares
OP's approach,
apply() with matrices and data.frames,
an approach using purrr's map() and reduce() functions.
apply() with matrices
Let's start with the provided sample data and fixed Lower_Bound and Upper_Bound. Please, note that all three objects are matrices created by rbind(). This is in contrast to the text of the question which refers to a dataframe (A rows x K columns). Anyhow, we will provide solutions for both cases.
apply(data, 1, function(x) all(x > Lower_Bound & x < Upper_Bound))
returns a vector of type logical
[1] FALSE TRUE FALSE FALSE TRUE
The number of rows which fulfill the condition can be derived by
N <- sum(apply(data, 1, function(x) all(x > Lower_Bound & x < Upper_Bound)))
N
[1] 2
because TRUE is coerced to 1L and FALSE to 0L.
The next step is to also compute the bounds for each column as 5th and 95th percentile. For this, we have to create a new sample dataset mat, again as matrix
# create sample data
n_col <- 5
n_row <- 10
set.seed(42) # required for reproducible results
mat <- sapply(1:n_col, function(x) rnorm(n_row, mean = x))
mat
[,1] [,2] [,3] [,4] [,5]
[1,] 2.3709584 3.3048697 2.693361 4.455450 5.205999
[2,] 0.4353018 4.2866454 1.218692 4.704837 4.638943
[3,] 1.3631284 0.6111393 2.828083 5.035104 5.758163
[4,] 1.6328626 1.7212112 4.214675 3.391074 4.273295
[5,] 1.4042683 1.8666787 4.895193 4.504955 3.631719
[6,] 0.8938755 2.6359504 2.569531 2.282991 5.432818
[7,] 2.5115220 1.7157471 2.742731 3.215541 4.188607
[8,] 0.9053410 -0.6564554 1.236837 3.149092 6.444101
[9,] 3.0184237 -0.4404669 3.460097 1.585792 4.568554
[10,] 0.9372859 3.3201133 2.360005 4.036123 5.655648
For demonstration, each column has a different mean.
# count number of rows
probs <- c(0.05, 0.95)
bounds <- apply(mat, 2, quantile, probs)
idx <- apply(mat, 1, function(x) all(x > bounds[1, ] & x < bounds[2, ]))
N <- sum(idx)
N
[1] 5
If required, the subset of mat which fulfills the condition can be derived by
mat[idx, ]
[,1] [,2] [,3] [,4] [,5]
[1,] 2.3709584 3.304870 2.693361 4.455450 5.205999
[2,] 1.6328626 1.721211 4.214675 3.391074 4.273295
[3,] 0.8938755 2.635950 2.569531 2.282991 5.432818
[4,] 2.5115220 1.715747 2.742731 3.215541 4.188607
[5,] 0.9372859 3.320113 2.360005 4.036123 5.655648
The bounds are
bounds
[,1] [,2] [,3] [,4] [,5]
5% 0.641660 -0.5592606 1.226857 1.899532 3.882318
95% 2.790318 3.8517060 4.588960 4.886484 6.135429
apply() with data.frames
In case the dataset is a data.frame we can use the same code, i.e.,
df <- as.data.frame(mat)
probs <- c(0.05, 0.95)
bounds <- apply(df, 2, quantile, probs)
idx <- apply(df, 1, function(x) all(x > bounds[1, ] & x < bounds[2, ]))
N <- sum(idx)
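The row-wise apply() calls an R function once per row. As a sketch of an alternative (not part of the benchmark in this answer, and not timed here), the comparisons can be done on the whole matrix at once with sweep(). The names inside, idx_sweep and N_sweep are introduced here for illustration.

```r
set.seed(42)
mat <- sapply(1:5, function(k) rnorm(10, mean = k))
bounds <- apply(mat, 2, quantile, probs = c(0.05, 0.95))

# compare every column against its own lower and upper bound in one step
inside <- sweep(mat, 2, bounds[1, ], ">") & sweep(mat, 2, bounds[2, ], "<")

# a row counts only if all of its values are inside their bounds
idx_sweep <- rowSums(inside) == ncol(mat)
N_sweep <- sum(idx_sweep)

# cross-check against the row-wise apply() version
idx_apply <- apply(mat, 1, function(x) all(x > bounds[1, ] & x < bounds[2, ]))
stopifnot(identical(idx_sweep, idx_apply))
```

For a data.frame, converting with as.matrix() first should allow the same code to be used, assuming all columns are numeric.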
Benchmark
The OP is looking for code which is faster than OP's own approach because the OP wants to replicate the simulation 10000 times.
So, here is a benchmark which compares
OP1: OP's own approach using matrices
OP2: a slightly modified version of OP1
apply_mat: the apply() function with matrices
apply_df: the apply() function with data.frames
purrr: using map(), pmap(), and reduce() from the purrr package
(Note that the list of methods is not exhaustive)
The benchmark is repeated for varying problem sizes, i.e., 5, 10, and 40 columns as well as 100, 1000, and 10000 rows. The largest problem size corresponds to the size of OP's simulations. As some codes modify the input dataset, all runs start with a fresh copy of the input data.
library(bench)
library(purrr)
library(ggplot2)
bm <- press(
n_col = c(5L, 10L, 40L)
, n_row = 10L^(2:4)
, {
set.seed(42)
mat0 <- sapply(1:n_col, function(x) rnorm(n_row, mean = x))
df0 <- as.data.frame(mat0)
mark(
OP1 = {
data <- data.table::copy(mat0)
Lower_Bound <- as.matrix(apply(data, 2, quantile, probs = 0.05), ncol = 1L)
Upper_Bound <- as.matrix(apply(data, 2, quantile, probs = 0.95), ncol = 1L)
for (i in seq_len(ncol(data))) {
data <- data[data[, i] > Lower_Bound[i, ], ]
data <- data[data[, i] < Upper_Bound[i, ], ]
}
nrow(data)
},
OP2 = {
data <- data.table::copy(mat0)
Lower_Bound <- as.matrix(apply(data, 2, quantile, probs = 0.05), ncol = 1L)
Upper_Bound <- as.matrix(apply(data, 2, quantile, probs = 0.95), ncol = 1L)
for (i in seq_len(ncol(data))) {
data <- data[data[, i] > Lower_Bound[i, ] & data[, i] < Upper_Bound[i, ], ]
}
nrow(data)
},
apply_mat = {
mat <- data.table::copy(mat0)
probs <- c(0.05, 0.95)
bounds <- apply(mat, 2, quantile, probs)
idx <- apply(mat, 1, function(x) all(x > bounds[1, ] & x < bounds[2, ]))
sum(idx)
},
apply_df = {
df <- data.table::copy(df0)
probs <- c(0.05, 0.95)
bounds <- apply(df, 2, quantile, probs)
idx <- apply(df, 1, function(x) all(x > bounds[1, ] & x < bounds[2, ]))
sum(idx)
},
purrr = {
data.table::copy(df0) %>%
map2(map_dfc(., quantile, probs), ~ (.x > .y[1L] & .x < .y[2L])) %>%
pmap(all) %>%
reduce(`+`)
}
)
}
)
autoplot(bm)
Note the logarithmic time scale
print(bm[, 1:11], n = Inf)
# A tibble: 45 x 11
expression n_col n_row min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <int> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 OP1 5 100 1.46ms 1.93ms 493. 88.44KB 0 248 0 503ms
2 OP2 5 100 1.34ms 1.78ms 534. 71.56KB 0 267 0 500ms
3 apply_mat 5 100 1.16ms 1.42ms 621. 26.66KB 2.17 286 1 461ms
4 apply_df 5 100 1.41ms 1.8ms 526. 34.75KB 0 263 0 500ms
5 purrr 5 100 2.34ms 2.6ms 374. 17.86KB 0 187 0 500ms
6 OP1 10 100 2.42ms 2.78ms 344. 205.03KB 0 172 0 500ms
7 OP2 10 100 2.37ms 2.71ms 354. 153.38KB 2.07 171 1 484ms
8 apply_mat 10 100 1.76ms 2.12ms 457. 51.64KB 0 229 0 501ms
9 apply_df 10 100 2.31ms 2.63ms 367. 67.78KB 0 184 0 501ms
10 purrr 10 100 3.44ms 4.1ms 222. 34.89KB 2.09 106 1 477ms
11 OP1 40 100 9.4ms 10.57ms 92.9 955.41KB 0 47 0 506ms
12 OP2 40 100 9.18ms 10.08ms 96.8 638.92KB 0 49 0 506ms
13 apply_mat 40 100 5.44ms 6.46ms 146. 429.95KB 2.12 69 1 472ms
14 apply_df 40 100 6.12ms 6.75ms 141. 608.66KB 0 71 0 503ms
15 purrr 40 100 10.43ms 11.8ms 84.9 149.53KB 0 43 0 507ms
16 OP1 5 1000 1.75ms 1.94ms 478. 837.55KB 2.10 228 1 477ms
17 OP2 5 1000 1.69ms 1.94ms 487. 674.36KB 0 244 0 501ms
18 apply_mat 5 1000 4.84ms 5.62ms 176. 255.17KB 0 89 0 506ms
19 apply_df 5 1000 6.37ms 7.66ms 122. 333.58KB 0 62 0 506ms
20 purrr 5 1000 9.86ms 11.22ms 87.7 165.52KB 2.14 41 1 467ms
21 OP1 10 1000 3.35ms 3.91ms 253. 1.89MB 0 127 0 503ms
22 OP2 10 1000 3.33ms 3.72ms 256. 1.41MB 2.06 124 1 484ms
23 apply_mat 10 1000 5.86ms 6.93ms 142. 491.09KB 0 72 0 508ms
24 apply_df 10 1000 7.74ms 10.08ms 99.2 647.86KB 0 50 0 504ms
25 purrr 10 1000 14.55ms 15.44ms 62.5 323.17KB 2.23 28 1 448ms
26 OP1 40 1000 13.8ms 16.28ms 58.8 8.68MB 2.18 27 1 459ms
27 OP2 40 1000 13.29ms 14.72ms 67.9 5.84MB 0 34 0 501ms
28 apply_mat 40 1000 12.17ms 13.85ms 68.5 4.1MB 2.14 32 1 467ms
29 apply_df 40 1000 14.61ms 15.86ms 62.9 5.78MB 0 32 0 509ms
30 purrr 40 1000 41.85ms 43.66ms 22.7 1.25MB 0 12 0 529ms
31 OP1 5 10000 5.57ms 6.55ms 147. 8.15MB 2.07 71 1 482ms
32 OP2 5 10000 5.38ms 6.27ms 157. 6.55MB 2.06 76 1 485ms
33 apply_mat 5 10000 43.98ms 46.9ms 20.7 2.48MB 0 11 0 532ms
34 apply_df 5 10000 53.59ms 56.53ms 17.8 3.24MB 3.57 5 1 280ms
35 purrr 5 10000 86.32ms 88.83ms 11.1 1.6MB 0 6 0 540ms
36 OP1 10 10000 12.03ms 13.63ms 72.3 18.97MB 2.07 35 1 484ms
37 OP2 10 10000 11.66ms 12.97ms 76.5 14.07MB 4.25 36 2 471ms
38 apply_mat 10 10000 50.31ms 51.77ms 18.5 4.77MB 0 10 0 541ms
39 apply_df 10 10000 62.09ms 65.17ms 15.1 6.3MB 0 8 0 528ms
40 purrr 10 10000 125.82ms 128.3ms 7.35 3.13MB 2.45 3 1 408ms
41 OP1 40 10000 53.38ms 56.34ms 16.2 87.79MB 5.41 6 2 369ms
42 OP2 40 10000 46.24ms 47.43ms 20.3 58.82MB 2.25 9 1 444ms
43 apply_mat 40 10000 78.25ms 83.79ms 11.4 40.94MB 2.85 4 1 351ms
44 apply_df 40 10000 95.66ms 97.02ms 10.3 57.58MB 2.06 5 1 486ms
45 purrr 40 10000 361.26ms 373.23ms 2.68 12.31MB 0 2 0 746ms
Conclusions
To my surprise, OP's approach performs quite well despite the repeated copy operations. In fact, for OP's problem size of 10000 rows and 40 columns, the modified version OP2 is nearly two times faster than apply_mat.
A possible explanation (which needs to be verified, though) is that OP's approach shrinks the data as it goes: the number of rows to be checked is reduced with every column that has been processed.
Interestingly, the purrr variant has the lowest memory requirements.
Taking the median run time of about 50 ms for the OP2 method from this benchmark, 10000 repetitions of the simulation may take less than 10 minutes.
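One way to look into the explanation given above is to record how many rows survive after each column has been filtered. The sketch below assumes simulated data of the same shape as the largest benchmark case; the vector remaining is a name introduced here. It uses drop = FALSE so that data stays a matrix even if only a single row is left, which the original loop does not guard against.

```r
set.seed(42)
n_col <- 40L
n_row <- 10000L
mat <- sapply(seq_len(n_col), function(k) rnorm(n_row, mean = k))

# bounds are computed once on the full data, as in OP1 and OP2
bounds <- apply(mat, 2, quantile, probs = c(0.05, 0.95))

remaining <- integer(n_col)
data <- mat
for (i in seq_len(n_col)) {
  keep <- data[, i] > bounds[1, i] & data[, i] < bounds[2, i]
  data <- data[keep, , drop = FALSE]
  remaining[i] <- nrow(data)   # rows still to be checked for later columns
}

head(remaining)
```

If remaining drops quickly, later columns are checked on far fewer rows, which would be consistent with the timings, though it does not prove the cause.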

How to multiply each column by each scalar in R?

I have the following variable Q
a = c(1,2,3,4)
b = c(45,4,3,2)
c = c(34,23,12,45)
Q = cbind(a,b,c)
I also have another variable r
r = c(10,20,30)
I would like to multiply each column of Q by the respective value in r (for example, the first column of Q multiplied by the first value in r, the second column of Q multiplied by the second value in r, and so on).
Specifically for this example, the output I am looking for is:
10 900 1020
20 80 690
30 60 360
40 40 1350
I am new to R and looking for the most optimal way to do this.
Try this:
Q %*% diag(r)
giving:
[,1] [,2] [,3]
[1,] 10 900 1020
[2,] 20 80 690
[3,] 30 60 360
[4,] 40 40 1350
or any of these:
t(t(Q) * r)
Q * r[col(Q)]
sweep(Q, 2, r, "*")
Q * rep(r, each = nrow(Q))
mapply("*", as.data.frame(Q), r)
See this answer for the same question except using division:
How to divide each row of a matrix by elements of a vector in R
You just need a double transpose:
t(r*t(Q))
a b c
[1,] 10 900 1020
[2,] 20 80 690
[3,] 30 60 360
[4,] 40 40 1350
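A plain Q * r does not give this result, because R recycles r down the columns of Q rather than across them. The following sketch checks that several of the alternatives listed above agree with Q %*% diag(r), and that Q * r does not. The names expected, results and agree are introduced here for illustration.

```r
Q <- cbind(a = c(1, 2, 3, 4), b = c(45, 4, 3, 2), c = c(34, 23, 12, 45))
r <- c(10, 20, 30)

expected <- Q %*% diag(r)

results <- list(
  transpose = t(t(Q) * r),
  col_index = Q * r[col(Q)],
  sweep     = sweep(Q, 2, r, "*"),
  rep_each  = Q * rep(r, each = nrow(Q))
)

# compare values only; the variants differ in whether they keep column names
agree <- vapply(
  results,
  function(m) isTRUE(all.equal(unname(m), unname(expected))),
  logical(1)
)
agree

# recycling goes down the columns, so this is NOT the column-wise product
naive_matches <- isTRUE(all.equal(unname(Q * r), unname(expected)))
naive_matches
```

Because 12 elements is a multiple of 3, Q * r runs without a warning, which makes the mistake easy to miss.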

Extracting chunks from a matrix by columns

Say I have a matrix with 1000 columns. I want to create a new matrix from every other block of n columns of the original matrix, starting from column i.
So let's say that n=3 and i=5, then the columns I need from the old matrix are 5,6,7,11,12,13,17,18,19 and so on.
Use two seq() calls to create the start and stop bounds, then mapply() over those to build the actual column index intervals. After that, normal bracket notation extracts the columns from your matrix.
set.seed(1)
# using 67342343's test case
M <- matrix(runif(100^2), ncol = 100)
n <- 3
i <- 5
starts <- seq(i, ncol(M), n*2)
stops <- seq(i+(n-1), ncol(M), n*2)
col_index <- c(mapply(seq, starts, stops)) # thanks Jaap and Sotos
col_index
[1] 5 6 7 11 12 13 17 18 19 23 24 25 29 30 31 35 36 37 41 42 43 47 48 49 53 54 55 59 60 61 65 66 67 71 72 73 77 78
[39] 79 83 84 85 89 90 91 95 96 97
M[, col_index]
Another solution is based on the fact that R uses index recycling:
i <- 5; n <- 3
M <- matrix(runif(100^2), ncol = 100)
id <- seq(i, ncol(M), by = 1)[rep(c(TRUE, FALSE), each = n)]
M_sub <- M[, id]
I would write a function that determines the indices of the columns you want, and then call that function as needed.
col_indexes <- function(mat, start = 1, by = 1){
n <- ncol(mat)
inx <- seq(start, n, by = 2*by)
inx <- c(sapply(inx, function(i) i:(i + by -1)))
inx[inx <= n]
}
m <- matrix(0, nrow = 1, ncol = 20)
icol <- col_indexes(m, 5, 3)
icol
[1] 5 6 7 11 12 13 17 18 19
Here is a method using outer.
c(outer(5:7, seq(0L, 95L, 6L), "+"))
[1] 5 6 7 11 12 13 17 18 19 23 24 25 29 30 31 35 36 37 41 42 43 47 48 49 53
[26] 54 55 59 60 61 65 66 67 71 72 73 77 78 79 83 84 85 89 90 91 95 96 97
To generalize this, you could do
idx <- c(outer(seq(i, i + n - 1), seq(0L, ncol(M) - i, 2 * n), "+"))
The idea is to construct the initial set of columns (5:7 or seq(i, i + n - 1)), calculate the starting points for every subsequent set (seq(0L, 95L, 6L) or seq(0L, ncol(M) - i, 2 * n)), then use outer to calculate the sum of every combination of these two vectors. If the last block is incomplete, drop any indices larger than ncol(M).
You can then subset the matrix using [ like M[, idx].
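The same set of columns can also be described with modular arithmetic: counting from column i, the columns repeat in cycles of 2 * n, and only the first n of each cycle are kept. A minimal sketch, where block_cols is a hypothetical helper that is not taken from any of the answers:

```r
# hypothetical helper: indices of every other block of n columns,
# starting at column i, for a matrix with n_col columns
block_cols <- function(n_col, i, n) {
  stopifnot(i >= 1, i <= n_col, n >= 1)
  cols <- seq(i, n_col)
  # offset within the current cycle of 2 * n columns; keep the first n
  cols[(cols - i) %% (2 * n) < n]
}

block_cols(20, i = 5, n = 3)
# [1]  5  6  7 11 12 13 17 18 19

M <- matrix(runif(100^2), ncol = 100)
M_sub <- M[, block_cols(ncol(M), i = 5, n = 3)]
dim(M_sub)
```

An incomplete last block is truncated automatically, since only existing column numbers are ever considered.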

Flip Every Nth Coin in R [duplicate]

My friend gave me a brain teaser that I wanted to try on R.
Imagine 100 coins in a row, with heads facing up for all coins. Now every 2nd coin is flipped (thus becoming tails). Then every 3rd coin is flipped. How many coins are now showing heads?
To create the vector, I started with:
flips <- rep('h', 100)
levels(flips) <- c("h", "t")
Not sure how to proceed from here. Any help would be appreciated.
Try this:
coins <- rep(1, 100) # 1 = Head, 0 = Tail
n = 3 # run till the time when you flip every 3rd coin
invisible(sapply(2:n, function(i) {indices <- seq(i, 100, i); coins[indices] <<- (coins[indices] + 1) %% 2}) )
which(coins == 1)
# [1] 1 5 6 7 11 12 13 17 18 19 23 24 25 29 30 31 35 36 37 41 42 43 47 48 49 53 54 55 59 60 61 65 66 67 71 72 73 77 78 79 83 84 85 89 90 91 95 96 97
sum(coins==1)
#[1] 49
If you run till n = 100, only the coins at the positions which are perfect squares will be showing heads.
coins <- rep(1, 100) # 1 = Head, 0 = Tail
n <- 100
invisible(sapply(2:n, function(i) {indices <- seq(i, 100, i); coins[indices] <<- (coins[indices] + 1) %% 2}) )
which(coins == 1)
# [1] 1 4 9 16 25 36 49 64 81 100
sum(coins==1)
# [1] 10
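The result can also be reached without modifying a vector in place: a coin is flipped once for every k between 2 and n that divides its position, and it shows heads when that number of flips is even. The sketch below follows that reasoning; the helpers flips and heads are names introduced here, and n is assumed to be at least 2.

```r
positions <- 1:100

# number of times each coin is flipped when every 2nd, 3rd, ..., nth coin is flipped
flips <- function(n) {
  vapply(positions, function(p) sum(p %% (2:n) == 0), integer(1))
}

# positions still showing heads: flipped an even number of times
heads <- function(n) positions[flips(n) %% 2 == 0]

length(heads(3))
# [1] 49

heads(100)
# [1]   1   4   9  16  25  36  49  64  81 100
```

Perfect squares are the only numbers with an odd number of divisors, so after excluding the divisor 1 they are the only positions flipped an even number of times.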

Loop over matrix using n consecutive rows in R

I have a matrix that consists of two columns and a number (n) of rows, while each row represents a point with the coordinates x and y (the two columns).
This is what it looks like (LINK):
V1 V2
146 17
151 19
153 24
156 30
158 36
163 39
168 42
173 44
...
Now, I would like to use a subset of three consecutive points starting from 1 to do some fitting, save the values from this fit in another list, and then go on to the next three points, and the next three, ... till the list is finished. Something like this:
Data_Fit_Kasa_1 <- CircleFitByKasa(Data[1:3,])
Data_Fit_Kasa_2 <- CircleFitByKasa(Data[3:6,])
....
Data_Fit_Kasa_n <- CircleFitByKasa(Data[i:i+2,])
I have tried to construct a loop, but I can't make it work. R either tells me that there's an "unexpected '}' in "}" " or that the "subscript is out of bounds". This is what I've tried:
minimal runnable code
install.packages("conicfit")
library(conicfit)
CFKasa <- NULL
Data.Fit <- NULL
for (i in 1:length(Data)) {
row <- Data[i:(i+2),]
CFKasa <- CircleFitByKasa(row)
Data.Fit[i] <- CFKasa[3]
}
RStudio Version 0.99.902 – © 2009-2016 RStudio, Inc.; Win10 Edu.
The third element of the fitted circle (CFKasa[3]) represents the radius, which is what I am really interested in. I am really stuck here, please help.
Many thanks in advance!
Best, David
Turn your data into a 3D array and use apply:
DF <- read.table(text = "V1 V2
146 17
151 19
153 24
156 30
158 36
163 39", header = TRUE)
a <- t(DF)
dim(a) <-c(nrow(a), 3, ncol(a) / 3)
a <- aperm(a, c(2, 1, 3))
# , , 1
#
# [,1] [,2]
# [1,] 146 17
# [2,] 151 19
# [3,] 153 24
#
# , , 2
#
# [,1] [,2]
# [1,] 156 30
# [2,] 158 36
# [3,] 163 39
center <- function(m) c(mean(m[,1]), mean(m[,2]))
t(apply(a, 3, center))
# [,1] [,2]
#[1,] 150 20
#[2,] 159 35
center(DF[1:3,])
#[1] 150 20
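Two details in the question's code are worth noting: i:i+2 is parsed as (i:i)+2 because : binds more tightly than +, and length(Data) on a matrix is the number of elements, not the number of rows. If a loop-like structure is preferred over the array reshaping, the row indices can be split into consecutive groups of three. The sketch below uses center as a stand-in for the real fitting function, and the names rows, groups and fits are introduced here.

```r
DF <- read.table(text = "V1 V2
146 17
151 19
153 24
156 30
158 36
163 39", header = TRUE)

# stand-in for the fitting function; swap in the real one as needed
center <- function(m) c(mean(m[, 1]), mean(m[, 2]))

# row indices grouped into consecutive, non-overlapping triples
rows <- seq_len(nrow(DF))
groups <- split(rows, ceiling(rows / 3))

# apply the function to each block of three rows
fits <- lapply(groups, function(idx) center(as.matrix(DF[idx, ])))
fit_matrix <- do.call(rbind, fits)
fit_matrix
```

If the number of rows is not a multiple of three, the last group is simply shorter, so the fitting function has to cope with fewer than three points or that group has to be dropped.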
