I am trying to make a piece-wise function. This is a really basic one. I want y to be a list of values (preferably not just a list of integers but a list of real numbers like (1.34, 20.92) in the future).
How might I make a piece-wise function?
y <- 1:10
if (y < 2){
print("CAN'T COMPUTE")
} else if (y >= 2 & y < 6){
print(y^2)
} else {
print(y * 2)
}
Let me give it a try:
library("dplyr")
y <- 1:10
y %>%
as_tibble() %>%
mutate(res = case_when(y < 2 ~ "CAN'T COMPUTE",
y >= 2 & y < 6 ~ as.character(y^2),
TRUE ~ as.character(y*2)))
Here are the results:
# A tibble: 10 x 2
value res
<int> <chr>
1 1 CAN'T COMPUTE
2 2 4
3 3 9
4 4 16
5 5 25
6 6 12
7 7 14
8 8 16
9 9 18
10 10 20
Here are some base R approaches. We use NA instead of a character string so that the result is a numeric vector. The first uses a nested ifelse. The second uses a single ifelse to select between NA and the other values and computes the other values using a formula. The third computes which leg of the result is wanted (1, 2 or 3) and then uses switch to select that leg. The fourth is a variation of the third that uses findInterval to compute the leg number.
ifelse(y < 2, NA, ifelse(y < 6, y^2, 2*y))
## [1] NA 4 9 16 25 12 14 16 18 20
ifelse(y < 2, NA, (y < 6) * y^2 + (y >= 6) * 2*y)
## [1] NA 4 9 16 25 12 14 16 18 20
mapply(switch, 1 + (y >= 2) + (y >= 6), NA, y^2, 2*y)
## [1] NA 4 9 16 25 12 14 16 18 20
mapply(switch, findInterval(y, c(-Inf, 2, 6, Inf), left.open = FALSE), NA, y^2, 2*y)
## [1] NA 4 9 16 25 12 14 16 18 20
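Since the question also asks about real-valued input, the nested ifelse approach can be wrapped in a small function. This is only a sketch: the name piecewise is hypothetical, and values below 2 are mapped to NA as in the approaches above.

```r
# Hypothetical helper: vectorised piecewise function built on nested ifelse().
# Values below 2 have no defined result here, so they map to NA.
piecewise <- function(y) {
  ifelse(y < 2, NA_real_, ifelse(y < 6, y^2, 2 * y))
}

vals <- c(1.34, 3.5, 20.92)  # real numbers, not just integers
res <- piecewise(vals)
print(res)
```

Because ifelse is vectorised, the same function should work unchanged on a single number or on a long vector.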
I am practicing a simple R loop. From a vector "m" with values 1 to 20, I want to create a loop that saves a selected value in an object "a" and the remaining values in an object "b".
This is what I did:
a=NULL
b=NULL
m <- c(1:20)
for (i in m)
if (i == 4){
a[[i]] <- i
} else {
b[[i]] <- i
}
This is the output:
> a
[1] NA NA NA 4
> b
[1] 1 2 3 NA 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
My question is: how can I improve my loop so that the output does not contain NAs, without using the function "na.omit"?
Thanks
a=NULL
b=NULL
m <- c(1:20)
for (i in m){
if (i == 4){
a <- i
} else {
b <- append(b, i)
}
}
This will put a single value (in this case 4) in object a, and will consecutively add the other values to b.
Result:
> a
[1] 4
> b
[1] 1 2 3 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Another way to solve it is with vector operations. We don't need a loop to solve this kind of classification problem.
In your case, we can use:
m = c(1:20)
subset_with = m[m == 4] # Returns the values matching the condition (m == 4)
subset_without = m[m != 4] # Returns the values matching the condition (m != 4)
I hope this helps you.
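If more than one value has to be pulled out, the same idea can be written with a single logical mask that is computed once and then negated. The following is a sketch only; the vector selected and its contents are assumptions made for illustration.

```r
m <- 1:20
selected <- c(4, 7)              # assumed example: values to pull out

is_selected <- m %in% selected   # one logical mask, computed once

a <- m[is_selected]              # the selected values
b <- m[!is_selected]             # everything else, with no NAs
print(a)
print(b)
```

Logical indexing keeps only the positions where the mask is TRUE, so neither result contains the NA gaps produced by assigning to a[[i]] inside the loop.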
I have a data.table as follows
library(data.table)
set.seed(5)
x <- data.table(x=sample(1:20,15))
> x
x
1: 5
2: 14
3: 17
4: 20
5: 2
6: 11
7: 8
8: 15
9: 12
10: 16
11: 3
12: 18
13: 10
14: 4
15: 13
and I would like to start at 1 and cumulate values iteratively such that the value of cumsum() determines the next number to be added to the sum.
In the example I want to add the first value of x, here 5, then jump to value number 5 and add that, here 2, then jump to value number 5+2=7, here 8, then value number 5+2+8=15, here 13.
That is, I want to get a vector
> res
[1] 1 5 7 15
Has anyone any idea for this problem?
We can use Reduce with accumulate = TRUE
accum <- Reduce(function(i, j) i + x$x[i], x$x, accumulate = TRUE)
c(1, accum[!is.na(accum)])
# [1] 1 5 7 15 28
or purrr::accumulate
library(purrr)
accum <- accumulate(x$x, ~ .x + x$x[.x])
c(1, accum[!is.na(accum)])
# [1] 1 5 7 15 28
A base R solution:
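i <- 1
v <- integer(0)  # positions visited so far
total <- 0       # running sum; avoids masking base::sum
while (i <= nrow(x)) {
  v <- c(v, i)
  total <- total + x$x[i]
  i <- total     # the running sum is the next position
}
v
# [1]  1  5  7 15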
Here's a function that takes how long you want your vector to be and produces a vector of that length:
recursiveadd <- function(x, n) {
  k <- x$x[1]
  for (i in 1:(n-1)) {
    k[i+1] <- sum(x$x[k[i]], k[i])
  }
  k
}
recursiveadd(x,4)
[1] 5 7 15 28
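The output of set.seed(5) followed by sample() can differ between R versions, so the sketch below hard-codes the values printed in the question. It wraps the jump rule in a function on a plain vector and stops as soon as the next position falls outside the data. The name jump_positions is hypothetical, and the code assumes all values are positive integers (a zero would make the loop run forever).

```r
# Values copied from the data.table printed in the question
vals <- c(5, 14, 17, 20, 2, 11, 8, 15, 12, 16, 3, 18, 10, 4, 13)

# Hypothetical helper: start at position 1; the running sum of the
# visited values is always the next position to visit.
jump_positions <- function(values) {
  visited <- integer(0)
  total <- 0
  pos <- 1
  while (pos >= 1 && pos <= length(values)) {
    visited <- c(visited, pos)
    total <- total + values[pos]
    pos <- total
  }
  visited
}

res <- jump_positions(vals)
print(res)
```

Unlike recursiveadd, this version does not need the desired length up front and never returns a position beyond the end of the data.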
I have two data frames:
set.seed(123)
myData<-data.frame(id=1:10, pos=21:30)
refData<-data.frame(id=letters[1:15], pos=sample(10:40,15))
looking like that
> myData
id1 pos1
1 21
2 22
3 23
4 24
5 25
6 26
7 27
8 28
9 29
10 30
> refData
id2 pos2
a 18
b 33
c 21
d 34
e 35
f 11
g 23
h 31
i 22
j 20
k 30
l 19
m 32
n 39
o 36
I want an extended data frame of myData. For each row in myData I want to check whether there is an entry in refData at a distance of at most 2, and if so, I want the IDs of those refData entries pasted into a new column of myData.
In the end my new data frame should look like that:
id1 pos1 newColumn
1 21 c, g, i, j, l
2 22 c, g, i, j
3 23 c, g, i
4 24 g, i
5 25 g
6 26
7 27
8 28 k
9 29 h, k
10 30 h, k, m
Obviously, I could do that with the following loop, which works fine:
myData$newColumn<-rep(NA, nrow(myData))
for(i in 1:nrow(myData)){
ww<-which(abs(refData$pos - myData$pos[i]) <= 2)
myData$newColumn[i]<-paste(refData[ww,1],collapse=", ")
}
But I'm looking for a really fast way to do that, since my real data has about 10^6 entries and my real refData has about 10^7 entries.
I would really appreciate any help and ideas for a fast way to do that!
You could try:
myData$newColumn = lapply(myData$pos,
function(x) {paste(refData$id[abs(refData$pos-x)<3],collapse=', ')})
Output:
id pos newColumn
1 1 21 c, g, i, j, l
2 2 22 c, g, i, j
3 3 23 c, g, i
4 4 24 g, i
5 5 25 g
6 6 26
7 7 27
8 8 28 k
9 9 29 h, k
10 10 30 h, k, m
Hope this helps!
Another option would be
myData$newColumn <- sapply(myData$pos, function(x) paste(refData$id[refData$pos >= x-2 & refData$pos <= x+2], collapse = ", "))
A benchmark with n = 1000 shows @Florian's solution slightly ahead:
set.seed(123)
myData<-data.frame(id=1:1000, pos=sample(21:30, 1000, replace = T))
refData<-data.frame(id=sample(letters[1:15], 1000, replace = T), pos=sample(10:40, 1000, replace = T))
myData$newColumn<-rep(NA, nrow(myData))
library(microbenchmark)
microbenchmark(for(i in 1:nrow(myData)){
ww<-which(abs(refData$pos - myData$pos[i]) <= 2)
myData$newColumn[i]<-paste(refData[ww, "id"],collapse=", ")
},
myData$newColumn2 <- sapply(myData$pos, function(x) paste(refData$id[refData$pos >= x-2 & refData$pos <= x+2], collapse = ", ")),
myData$newColumn3 <- lapply(myData$pos, function(x) paste(refData$id[abs(refData$pos - x) < 3], collapse = ", ")))
Unit: milliseconds
expr
for (i in 1:nrow(myData)) { ww <- which(abs(refData$pos - myData$pos[i]) <= 2) myData$newColumn[i] <- paste(refData[ww, "id"], collapse = ", ") }
myData$newColumn2 <- sapply(myData$pos, function(x) paste(refData$id[refData$pos >= x - 2 & refData$pos <= x + 2], collapse = ", "))
myData$newColumn3 <- lapply(myData$pos, function(x) paste(refData$id[abs(refData$pos - x) < 3], collapse = ", "))
min lq mean median uq max neval cld
62.97657 64.74155 70.01541 68.81024 71.02023 206.80477 100 c
46.55872 47.90585 50.75397 50.42333 53.42990 58.01813 100 b
36.69362 37.34244 39.70480 38.54905 42.49614 46.27513 100 a
Your current problem has two main bottlenecks -- 1) the nrow(myData) * nrow(refData) computations and, 2) the creation of possibly large character vectors by concatenating refData$id.
To overcome the first one, one way (since myData$pos is, or can be, sorted) is to use findInterval to locate the range in which each refData$pos falls relative to myData$pos +/- the allowed distance (here 2). This reduces the computational complexity to about nrow(refData) * log(nrow(myData)) or possibly less.
To save some typing:
a = myData$pos
b = refData$pos
As a start, we need to find the interval of a + 2 where each b is found:
i = findInterval(b, a + 2L, all.inside = TRUE, left.open = TRUE)
#> i
# [1] 1 9 1 9 9 1 1 8 1 1 7 1 9 9 9
We specify the intervals as (lower, upper] and avoid falling outside of the 1:(length(a) - 1) range, so that we can easily compute the first index of a that is within 2 units of each b:
i1 = ifelse(abs(b - a[i + 1L]) <= 2, i + 1L, NA)
i2 = ifelse(abs(b - a[i]) <= 2, i, NA)
ii = pmin(i1, i2, na.rm = TRUE)
#> ii
# [1] NA NA 1 NA NA NA 1 9 1 1 8 1 10 NA NA
We also need to locate the [lower, upper) interval of a - 2 in which each b falls, which gives the last index of a that is within 2 units of b:
j = findInterval(b, a - 2L, all.inside = TRUE, left.open = FALSE)
j1 = ifelse(abs(b - a[j + 1L]) <= 2, j + 1L, NA)
j2 = ifelse(abs(b - a[j]) <= 2, j, NA)
jj = pmax(j1, j2, na.rm = TRUE)
#> jj
# [1] NA NA 3 NA NA NA 5 10 4 2 10 1 10 NA NA
Now, we are left with the location of the first (ii) and last (jj) index of myData$pos (a) where each refData$pos (b) is located +/- 2 units away (where the missing values denote no matching).
One way to overcome the second bottleneck is to avoid it altogether, if later steps can work directly with the index format above.
Nonetheless, to represent the matches as concatenated refData$id values, we could probably use the IRanges package from here on and hope for something efficient:
library(IRanges)
nr = 1:nrow(myData)
myrng = IRanges(nr, nr)
refrng = IRanges(ifelse(is.na(ii), 0L, ii), ifelse(is.na(jj), 0L, jj)) ## replace NA with 0
ovrs = findOverlaps(myrng, refrng)
tapply(refData$id[subjectHits(ovrs)], factor(queryHits(ovrs), nr), toString)
# 1 2 3 4 5
#"c, g, i, j, l" "c, g, i, j" "c, g, i" "g, i" "g"
# 6 7 8 9 10
# NA NA "k" "h, k" "h, k, m"
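A simpler variant of the same findInterval idea can be sketched in base R alone by sorting refData once and looking up, for every myData$pos, the first and last sorted reference position within the allowed distance. This is not the code from the answer above, only an illustration under stated assumptions: the data are hard-coded from the question, and the variable names (ref_pos, sorted_id, lo, hi, new_column) are hypothetical.

```r
# Data copied from the question
ref_id  <- letters[1:15]
ref_pos <- c(18, 33, 21, 34, 35, 11, 23, 31, 22, 20, 30, 19, 32, 39, 36)
my_pos  <- 21:30
max_dist <- 2

# Sort the reference positions once
ord <- order(ref_pos)
sorted_pos <- ref_pos[ord]
sorted_id  <- ref_id[ord]

# lo: index of the first sorted position >= my_pos - max_dist
#     (left.open = TRUE counts the positions strictly below the bound)
# hi: index of the last sorted position <= my_pos + max_dist
lo <- findInterval(my_pos - max_dist, sorted_pos, left.open = TRUE) + 1L
hi <- findInterval(my_pos + max_dist, sorted_pos)

# Paste the ids in each [lo, hi] window; an empty window gives ""
new_column <- mapply(function(l, h) {
  if (l > h) "" else toString(sort(sorted_id[l:h]))
}, lo, hi)
print(new_column)
```

The two findInterval calls are binary searches, so only the final pasting step grows with the number of matches per row.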
The seq function in R would give me a sequence from x to y with a constant step m:
seq(x, y, m)
E.g. seq(1,9,2) = c(1,3,5,7,9).
What would be the most elegant way to get a sequence from x to y with alternating steps m1 and m2, such that something like "seq(x, y, c(m1, m2))" would give me c(x, x + m1, (x + m1) + m2, (x + m1 + m2) + m1, ..., y), each time adding one of the steps (not necessarily reaching up to y, of course, as in seq)?
Example: x = 1; y = 19; m1 = 2; m2 = 4 and I get c(1,3,7,9,13,15,19).
I arrived at the solution by:
1. Use cumsum on the vector c(from, rep(by, times)), where times = ceiling((to-from)/sum(by)).
2. Truncate the sequence with !(seq > to).
seq_alt <- function(from, to, by) {
seq <- cumsum(c(from,rep(by,ceiling((to-from)/sum(by)))))
return(seq[! seq > to])
}
First n terms of this sequence you can generate with
x = 1; m1 = 2; m2 = 4
n <- 0:10 # first 11 terms
x + ceiling(n/2)*m1 + ceiling((n-1)/2)*m2
# [1] 1 3 7 9 13 15 19 21 25 27 31
Here is another idea,
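fun1 <- function(x, y, j, z) {
  # If even the first step overshoots y, the sequence is just x
  if (x + j > y) {
    return(x)
  }
  s1 <- seq(x, y, j + z)      # x, x + (j + z), ...
  s2 <- seq(x + j, y, j + z)  # x + j, x + j + (j + z), ...
  sort(c(s1, s2))
}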
fun1(1, 19, 2, 4)
#[1] 1 3 7 9 13 15 19
fun1(1, 40, 4, 3)
#[1] 1 5 8 12 15 19 22 26 29 33 36 40
fun1(3, 56, 7, 10)
#[1] 3 10 20 27 37 44 54
fun1(1, 2, 2, 4)
#[1] 1
Here is an alternative that uses diffinv. This method over-allocates the values, so as a stopping rule I keep only the elements that are less than or equal to the stopping value.
seqAlt <- function(start, stop, by1, by2) {
out <- diffinv(rep(c(by1, by2), ceiling(stop / (by1 + by2))), xi=start)
return(out[out <= stop])
}
seqAlt(1, 19, 2, 4)
[1] 1 3 7 9 13 15 19
You could use Reduce with accumulate = TRUE to iteratively add either 2 or 4:
Reduce(`+`, rep(c(2,4), 10), init = 1, accumulate = TRUE)
# [1] 1 3 7 9 13 15 19 21 25 27 31 33 37 39 43 45 49 51 55 57 61
The number of times you repeat c(2,4) determines the sequence length; since it is 10 above, there are 20 increments, so the sequence has length 21 including the initial value.
The purrr package has an accumulate wrapper, if you prefer the syntax:
purrr::accumulate(rep(c(2,4), 10), `+`, .init = 1)
## [1] 1 3 7 9 13 15 19 21 25 27 31 33 37 39 43 45 49 51 55 57 61
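To stop at an upper bound y as in the question, the number of repetitions can be derived from the bound instead of being fixed at 10. The sketch below assumes positive steps; the names steps, reps and res are made up for illustration.

```r
x <- 1; y <- 19
steps <- c(2, 4)  # alternating increments m1, m2 (assumed positive)

# Enough full (m1, m2) cycles to reach or pass y
reps <- ceiling((y - x) / sum(steps))

out <- Reduce(`+`, rep(steps, reps), init = x, accumulate = TRUE)
res <- out[out <= y]  # drop any values that overshoot y
print(res)
```

Since reps is rounded up, the accumulated sequence may run past y by up to one cycle, which is why the final filter is needed.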
This is a perfect example of vector recycling in R:
# 1.
x = 1; y = 19; m1 = 2; m2 = 4
(x:y)[c(TRUE, rep(FALSE, m1-1), TRUE, rep(FALSE,m2-1))]
# [1] 1 3 7 9 13 15 19
# 2.
x = 3; y = 56; m1 = 7; m2 = 10
(x:y)[c(TRUE, rep(FALSE, m1-1), TRUE, rep(FALSE,m2-1))]
# [1] 3 10 20 27 37 44 54
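The recycling trick can be wrapped in a function. One caveat worth guarding against: when x:y is shorter than the logical pattern, indexing with the full pattern returns NA entries, so the sketch below cuts or recycles the pattern to the exact length with rep_len. The function name seq_recycle is hypothetical, and the approach assumes whole-number steps of at least 1.

```r
# Hypothetical wrapper around the logical-recycling idea
seq_recycle <- function(from, to, m1, m2) {
  stopifnot(m1 >= 1, m2 >= 1, m1 == round(m1), m2 == round(m2))
  vals <- from:to
  # One cycle: keep, skip m1 - 1, keep, skip m2 - 1
  pattern <- c(TRUE, rep(FALSE, m1 - 1), TRUE, rep(FALSE, m2 - 1))
  # Match the pattern length to vals to avoid NA padding
  vals[rep_len(pattern, length(vals))]
}

r1 <- seq_recycle(1, 19, 2, 4)
r2 <- seq_recycle(3, 56, 7, 10)
r3 <- seq_recycle(1, 2, 2, 4)  # shorter than one cycle
print(r1); print(r2); print(r3)
```

Note that this materialises every integer between from and to, so it is best suited to short ranges.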