ifelse over each element of a vector - r

Looking at this post, I thought ifelse is vectorized in the sense that f(c(x1, x2, x3)) = c(f(x1), f(x2), f(x3)).
So I expected the code for z1 (provided below) to do the following for each element of the vector y:
Test whether it is unity or not.
If YES, draw a random number from {1, 3, 5, 7, 9}.
If NO, draw a random number from {0, 2, 4, 6, 8}.
Unfortunately, it doesn't do that. It draws once for each case and returns that same number every time.
What exactly am I doing wrong? Or is this actually the expected behaviour of ifelse?
Note that if I wrap the same ifelse call in sapply, I get the expected output z2: unlike z1, where one occurrence of each case tells you every other value, it varies from element to element, as you can see below.
y <- rbinom(n = 20,
size = 1,
prob = 0.5)
z1 <- ifelse(test = (y == 1),
yes = sample(x = c(1, 3, 5, 7, 9),
size = 1),
no = sample(x = c(0, 2, 4, 6, 8),
size = 1))
z2 <- sapply(X = y,
FUN = function(w)
{
ifelse(test = (w == 1),
yes = sample(x = c(1, 3, 5, 7, 9),
size = 1),
no = sample(x = c(0, 2, 4, 6, 8),
size = 1))
})
data.frame(y, z1, z2)
#> y z1 z2
#> 1 0 2 2
#> 2 1 1 3
#> 3 1 1 9
#> 4 1 1 7
#> 5 0 2 0
#> 6 0 2 2
#> 7 1 1 7
#> 8 1 1 7
#> 9 0 2 0
#> 10 1 1 5
#> 11 0 2 0
#> 12 0 2 0
#> 13 0 2 6
#> 14 0 2 0
#> 15 0 2 2
#> 16 1 1 7
#> 17 1 1 7
#> 18 0 2 2
#> 19 0 2 2
#> 20 0 2 0
unique(x = z1[y == 1])
#> [1] 1
unique(x = z1[y == 0])
#> [1] 2
Created on 2019-03-13 by the reprex package (v0.2.1)
Any help will be appreciated.

ifelse isn't a function of one vector; it is a function of three vectors of the same length. The first vector, test, is logical; the second vector, yes, and the third vector, no, supply the elements of the result, chosen item by item according to the value of test.
A sample of size = 1 is a different size than test (unless the length of test is 1), so it will be recycled by ifelse (see note below). Instead, draw samples of the same size as test from the start:
ifelse(
test = (y == 1),
yes = sample(x = c(1, 3, 5, 7, 9), size = length(y), replace = TRUE),
no = sample(x = c(0, 2, 4, 6, 8), size = length(y), replace = TRUE)
)
The vectors don't actually have to be of the same length. The help page ?ifelse explains: "If yes or no are too short, their elements are recycled." This is the behavior you observed with "It generates once for each case, and returns that very random number always.".
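To see the recycling concretely, here is a small sketch. The names odd, even, z_recycled and z_full are made up for illustration, and the seed is arbitrary. With length-one yes and no, every selected element is the same draw; with full-length vectors, each position gets its own draw.

```r
set.seed(42)  # arbitrary seed, only so the run is repeatable

y <- rbinom(n = 20, size = 1, prob = 0.5)
odd  <- c(1, 3, 5, 7, 9)
even <- c(0, 2, 4, 6, 8)

# Length-one yes/no: each is evaluated once, then recycled to length(y)
z_recycled <- ifelse(y == 1, sample(odd, size = 1), sample(even, size = 1))

# Full-length yes/no: one independent draw per element of y
z_full <- ifelse(y == 1,
                 sample(odd,  size = length(y), replace = TRUE),
                 sample(even, size = length(y), replace = TRUE))

# The recycled version has at most one distinct value per case
length(unique(z_recycled[y == 1]))
length(unique(z_recycled[y == 0]))
```

Note that the full-length version draws length(y) values for both yes and no and then discards the ones that are not selected, which is usually fine for cheap draws like these.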

Related

change numeric vector

I have a numeric vector (see below). I would like to change all numbers that are assigned to high_ to 1 and all low_ to 2.
c(high_X17 = 3, high_X18 = 4, high_X19 = 5, high_X20 = 3, high_X21 = 1,
high_X22 = 1, high_X23 = 2, high_X24 = 2, low_X25 = 6, low_X26 = 4,
low_X27 = 6, low_X28 = 5, low_X29 = 2, low_X30 = 1, low_X31 = 1,
low_X32 = 2)
result
high_X17 high_X18 high_X19 high_X20 high_X21 high_X22 high_X23 high_X24  low_X25  low_X26
       1        1        1        1        1        1        1        1        2        2
 low_X27  low_X28  low_X29  low_X30  low_X31  low_X32
       2        2        2        2        2        2
Try the code below
x[] <- startsWith(names(x),"low_") + 1
You can use -
x[] <- as.integer(sub('_.*', '', names(x)) == 'low') + 1
x
#high_X17 high_X18 high_X19 high_X20 high_X21 high_X22 high_X23 high_X24
# 1 1 1 1 1 1 1 1
# low_X25 low_X26 low_X27 low_X28 low_X29 low_X30 low_X31 low_X32
# 2 2 2 2 2 2 2 2
sub('_.*', '', names(x)) removes everything after underscore keeping only 'high' and 'low' values.
Using grepl
grepl("low_", names(x)) + 1
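As a quick check, the three ideas can be compared on the example vector. This is only a sketch: x is rebuilt from the question, and by_prefix, by_sub, by_grepl and result are names chosen here. The grepl pattern is anchored with ^ so that only names that start with low_ match.

```r
x <- c(high_X17 = 3, high_X18 = 4, high_X19 = 5, high_X20 = 3, high_X21 = 1,
       high_X22 = 1, high_X23 = 2, high_X24 = 2, low_X25 = 6, low_X26 = 4,
       low_X27 = 6, low_X28 = 5, low_X29 = 2, low_X30 = 1, low_X31 = 1,
       low_X32 = 2)

# TRUE + 1 is 2 and FALSE + 1 is 1, so "low_" names map to 2 and the rest to 1
by_prefix <- startsWith(names(x), "low_") + 1
by_sub    <- as.integer(sub("_.*", "", names(x)) == "low") + 1
by_grepl  <- grepl("^low_", names(x)) + 1

# Assigning into result[] replaces the values but keeps the element names
result <- x
result[] <- by_prefix
result
```

All three produce the same values; they differ only in how the prefix is detected.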

R - extracting beta and alpha from regression within a window

I have this dataframe:
structure(list(X_ = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), Y_ = c(0.00485082338451504,
-0.0168046606213001, 0.0271922543834244, 0.00553894528785559,
0.0459064669618974, 0.0735144938632293, 0.0368605806880207, 0.0597490764776278,
0.0244300474780141, 0.00904348896641594), Window_5 = c(-4, -3,
-2, -1, 0, 1, 2, 3, 4, 5), Window_2 = c(-1, 0, 1, 2, 3, 4, 5,
6, 7, 8)), row.names = c(NA, -10L), class = c("tbl_df", "tbl",
"data.frame"), na.action = structure(c(`1` = 1L), class = "exclude"))
      X_       Y_ Window_5 Window_2
   <dbl>    <dbl>    <dbl>    <dbl>
 1     1  0.00485       -4       -1
 2     2 -0.0168        -3        0
 3     3  0.0272        -2        1
 4     4  0.00554       -1        2
 5     5  0.0459         0        3
 6     6  0.0735         1        4
 7     7  0.0369         2        5
 8     8  0.0597         3        6
 9     9  0.0244         4        7
10    10  0.00904        5        8
Here Window_5 = X_-5 and Window_2 = X_-2. I'm looking for the alpha (intercept) and beta (slope) of a simple regression line of Y_ on X_.
However, the challenge is that I need those parameters for each row, given a window for X_. For example, for X_=7 the regression line should use only the rows whose X_ starts at Window_5 and ends at Window_2, which here means from X_=2 to X_=5.
So, the expected output would be: (I did this in excel and double-checked it, so the values should be right)
PS: If you can add the error, that would be great, but it is not strictly needed.
If all you need is a sliding window regression of fixed width (4 in your case, but not exactly as your example is formatted), maybe look at the rollRegres package?
library(rollRegres)
roll_regres(Y_ ~ X_, data = dd, width=4)
I have code that looks like it should do what you want. It's not "tidy" though ... (it should use purrr::pmap + broom::tidy + ...)
wfun <- function(i, data) {
start <- data$Window_5[i]
end <- data$Window_2[i]
if (start<=0) return(data.frame(Alpha=NA,Alpha_SE=NA,
Beta=NA, Beta_SE=NA))
cc <- coef(summary(lm(Y_ ~ X_, data=data[start:end,])))
data.frame(Alpha=cc["(Intercept)","Estimate"],
Alpha_SE=cc["(Intercept)","Std. Error"],
Beta=cc["X_","Estimate"],
Beta_SE=cc["X_","Std. Error"])
}
res <- lapply(seq(nrow(dd)),wfun, data=dd)
do.call(rbind,res)
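To sanity-check a single window by hand, the sketch below uses the standard closed-form least-squares estimates for a one-predictor regression (slope = cov/var, intercept = mean(y) - slope * mean(x)). These may or may not be the exact formulas the question had in mind. dd is rebuilt here as a plain data.frame from the question's data, and i, win, alpha, beta and fit are illustrative names.

```r
dd <- data.frame(
  X_ = 1:10,
  Y_ = c(0.00485082338451504, -0.0168046606213001, 0.0271922543834244,
         0.00553894528785559, 0.0459064669618974, 0.0735144938632293,
         0.0368605806880207, 0.0597490764776278, 0.0244300474780141,
         0.00904348896641594)
)
dd$Window_5 <- dd$X_ - 5
dd$Window_2 <- dd$X_ - 2

# Window for the row with X_ == 7: rows Window_5..Window_2, i.e. rows 2..5
i <- 7
win <- dd[dd$Window_5[i]:dd$Window_2[i], ]

# Closed-form OLS estimates for a single predictor
beta  <- cov(win$X_, win$Y_) / var(win$X_)
alpha <- mean(win$Y_) - beta * mean(win$X_)

# Same estimates from lm(), for comparison
fit <- coef(lm(Y_ ~ X_, data = win))
c(alpha = alpha, beta = beta)
fit
```

Indexing rows by start:end only works because X_ equals the row number in this data; with other X_ values you would filter on X_ instead.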

index from one vector to another by closest values

Given two sorted vectors, how can you get, for each value in one, the index of the closest value in the other?
For example, given:
a = 1:20
b = seq(from=1, to=20, by=5)
how can I efficiently get the vector
c = (1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)
which, for each value in a, provides the index of the largest value in b that is less than or equal to it. But the solution needs to work for unpredictable (though sorted) contents of a and b, and needs to be fast when a and b are large.
You can use findInterval, which constructs a sequence of intervals given by breakpoints in b and returns the interval indices in which the elements of a are located (see also ?findInterval for additional arguments, such as behavior at interval boundaries).
a = 1:20
b = seq(from = 1, to = 20, by = 5)
findInterval(a, b)
#> [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
We can use cut
as.integer(cut(a, breaks = unique(c(b-1, Inf)), labels = seq_along(b)))
#[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
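One difference between the two answers shows up with non-integer input. The sketch below uses made-up values in a: findInterval compares against b directly, while the cut version shifts the breaks by 1, which works for integer a but misplaces a value such as 5.9.

```r
b <- seq(from = 1, to = 20, by = 5)   # 1, 6, 11, 16
a <- c(1, 1.5, 5.9, 6, 10.99, 11, 16, 20)

# Index of the largest element of b that is <= each element of a
idx <- findInterval(a, b)
idx
#> [1] 1 1 1 2 2 3 4 4

# Values below min(b) get index 0, so guard before using idx to subset b
below <- findInterval(0.5, b)
below
#> [1] 0

# The shifted-breaks cut() approach puts 5.9 into interval 2 instead of 1
shifted <- as.integer(cut(5.9, breaks = c(b - 1, Inf)))
shifted
#> [1] 2
```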

Is there a function to know how many times a column has the best value?

I have a data.frame like this :
A B C
4 8 2
1 3 5
5 7 6
It could have more columns and rows.
So what I'd like to know is, for each column, how many times it holds the lowest value of a row (in my example the result should be 2 for A and 1 for C).
d = data.frame(a = c(4, 1, 5), b = c(8, 3, 7), c = c(2, 5, 6))
row_mins = apply(d, 1, min)
# alternately, slightly more efficient
row_mins = do.call(pmin, d)
colSums(d == row_mins)
# a b c
# 2 0 1
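One thing to be aware of: d == row_mins counts every column that ties for the row minimum. If a tie should be credited to a single column only, a possible alternative is which.min, which returns the first minimum in each row. This is a sketch with illustrative names (min_col, counts), not the only way to do it.

```r
d <- data.frame(a = c(4, 1, 5), b = c(8, 3, 7), c = c(2, 5, 6))

# Column index of the (first) minimum in each row
min_col <- apply(d, 1, which.min)

# Tabulate against all column names so columns with zero wins still appear
counts <- table(factor(names(d)[min_col], levels = names(d)))
counts
```

On the example data there are no ties, so both approaches give 2, 0 and 1 for a, b and c.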

calculating simple retention in R

For the dataset test, my objective is to find out how many unique users carried over from one period to the next on a period-by-period basis.
> test
user_id period
1 1 1
2 5 1
3 1 1
4 3 1
5 4 1
6 2 2
7 3 2
8 2 2
9 3 2
10 1 2
11 5 3
12 5 3
13 2 3
14 1 3
15 4 3
16 5 4
17 5 4
18 5 4
19 4 4
20 3 4
For example, in the first period there were four unique users (1, 3, 4, and 5), two of which were active in the second period. Therefore the retention rate would be 0.5. In the second period there were three unique users, two of which were active in the third period, and so the retention rate would be 0.666, and so on. How would one find the percentage of unique users that are active in the following period? Any suggestions would be appreciated.
The output would be the following:
> output
period retention
1 1 NA
2 2 0.500
3 3 0.666
4 4 0.500
The test data:
> dput(test)
structure(list(user_id = c(1, 5, 1, 3, 4, 2, 3, 2, 3, 1, 5, 5,
2, 1, 4, 5, 5, 5, 4, 3), period = c(1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)), .Names = c("user_id", "period"
), row.names = c(NA, -20L), class = "data.frame")
How about this? First split the users by period, then write a function that calculates the proportion carryover between any two periods, then loop it through the split list with mapply.
splt <- split(test$user_id, test$period)
carryover <- function(x, y) {
length(unique(intersect(x, y))) / length(unique(x))
}
mapply(carryover, splt[1:(length(splt) - 1)], splt[2:length(splt)])
1 2 3
0.5000000 0.6666667 0.5000000
Here is an attempt using dplyr, though it also uses some standard syntax in the summarise:
library(dplyr)
test %>%
group_by(period) %>%
summarise(retention=length(intersect(user_id,test$user_id[test$period==(first(period)+1)]))/n_distinct(user_id)) %>%
mutate(retention=lag(retention))
This returns:
period retention
<dbl> <dbl>
1 1 NA
2 2 0.5000000
3 3 0.6666667
4 4 0.5000000
This isn't so elegant but it seems to work. Assuming df is the data frame:
# make a list to hold unique IDs by period
uniques = list()
for(i in 1:max(df$period)){
uniques[[i]] = unique(df$user_id[df$period == i])
}
# hold the retention rates
retentions = rep(NA, times = max(df$period))
for(j in 2:max(df$period)){
retentions[j] = mean(uniques[[j-1]] %in% uniques[[j]])
}
Basically the %in% creates a logical of whether or not each element of the first argument is in the second. Taking a mean gives us the proportion.
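For completeness, the same %in% idea can be assembled into the requested output data frame without explicit loops. This is a sketch that assumes the periods are consecutive with no gaps; users, rate and output are names chosen here, and test is rebuilt from the question's dput.

```r
test <- data.frame(
  user_id = c(1, 5, 1, 3, 4, 2, 3, 2, 3, 1, 5, 5, 2, 1, 4, 5, 5, 5, 4, 3),
  period  = rep(1:4, each = 5)
)

# Unique users per period, ordered by period
users <- lapply(split(test$user_id, test$period), unique)

# Share of the previous period's users who appear again in the current one
rate <- vapply(seq_along(users)[-1],
               function(k) mean(users[[k - 1]] %in% users[[k]]),
               numeric(1))

# The first period has no predecessor, so its retention is NA
output <- data.frame(period = as.integer(names(users)),
                     retention = c(NA, rate))
output
```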
