R - extracting beta and alpha from regression within a window - r

I have this dataframe:
structure(list(X_ = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), Y_ = c(0.00485082338451504,
-0.0168046606213001, 0.0271922543834244, 0.00553894528785559,
0.0459064669618974, 0.0735144938632293, 0.0368605806880207, 0.0597490764776278,
0.0244300474780141, 0.00904348896641594), Window_5 = c(-4, -3,
-2, -1, 0, 1, 2, 3, 4, 5), Window_2 = c(-1, 0, 1, 2, 3, 4, 5,
6, 7, 8)), row.names = c(NA, -10L), class = c("tbl_df", "tbl",
"data.frame"), na.action = structure(c(`1` = 1L), class = "exclude"))
      X_       Y_ Window_5 Window_2
   <dbl>    <dbl>    <dbl>    <dbl>
 1     1  0.00485       -4       -1
 2     2 -0.0168        -3        0
 3     3  0.0272        -2        1
 4     4  0.00554       -1        2
 5     5  0.0459         0        3
 6     6  0.0735         1        4
 7     7  0.0369         2        5
 8     8  0.0597         3        6
 9     9  0.0244         4        7
10    10  0.00904        5        8
Where Window_5 = X_-5 and Window_2 = X_-2. I'm looking for the beta and alpha of a simple regression line, where alpha and beta:
However, the challenge is that I need those parameters for each row, given a window on X_. For example, for X_=7 the regression should use only the rows whose X_ lies between that row's Window_5 and Window_2, so the window runs from X_=2 to X_=5.
So, the expected output would be: (I did this in excel and double-checked it, so the values should be right)
PS: If you can add the error as well, that would be great, but it is not strictly needed.

If all you need is a sliding window regression of fixed width (4 in your case, but not exactly as your example is formatted), maybe look at the rollRegres package?
library(rollRegres)
roll_regres(Y_ ~ X_, data = dd, width=4)

I have code that looks like it should do what you want. It's not "tidy" though ... (it should use purrr::pmap + broom::tidy + ...)
wfun <- function(i, data) {
  start <- data$Window_5[i]
  end <- data$Window_2[i]
  # no complete window yet for the first rows: return NAs
  if (start <= 0) return(data.frame(Alpha=NA, Alpha_SE=NA,
                                    Beta=NA, Beta_SE=NA))
  cc <- coef(summary(lm(Y_ ~ X_, data=data[start:end,])))
  data.frame(Alpha=cc["(Intercept)","Estimate"],
             Alpha_SE=cc["(Intercept)","Std. Error"],
             Beta=cc["X_","Estimate"],
             Beta_SE=cc["X_","Std. Error"])
}
res <- lapply(seq(nrow(dd)), wfun, data=dd)
do.call(rbind, res)
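As a quick sanity check on the window logic, the fit for a single row can be reproduced by hand. The sketch below rebuilds the question's data under the name dd (the name the answers assume) and fits row 7, whose window runs from X_ = 2 to X_ = 5. The names i, win and fit are illustrative only.

```r
# Rebuild the question's data; `dd` is the name the answers assume
dd <- data.frame(
  X_ = 1:10,
  Y_ = c(0.00485082338451504, -0.0168046606213001, 0.0271922543834244,
         0.00553894528785559, 0.0459064669618974, 0.0735144938632293,
         0.0368605806880207, 0.0597490764776278, 0.0244300474780141,
         0.00904348896641594)
)
dd$Window_5 <- dd$X_ - 5
dd$Window_2 <- dd$X_ - 2

i <- 7                                   # row to inspect (illustrative)
# keep only the rows whose X_ falls inside row i's window
win <- dd[dd$X_ >= dd$Window_5[i] & dd$X_ <= dd$Window_2[i], ]
fit <- lm(Y_ ~ X_, data = win)           # regression on the window only
coef(summary(fit))                       # estimates and standard errors
```

Filtering on the values of X_ rather than on row positions is a design choice here: it keeps working if the rows are not numbered 1 to n.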

Related

ifelse over each element of a vector

Looking at this post, I thought ifelse is vectorized in the sense that f(c(x1, x2, x3)) = c(f(x1), f(x2), f(x3)).
So I thought the code for z1 (provided below) would do the following for each element of the vector y:
Test whether it is unity or not.
If YES, draw a random number from {1, 3, 5, 7, 9}.
If NO, draw a random number from {0, 2, 4, 6, 8}.
But, unfortunately it doesn't do that. It generates once for each case, and returns that very random number always.
What exactly am I doing wrong? Or is this actually the expected behaviour of ifelse?
Just to note, if I use this as a wrapper function inside sapply, I get the expected output z2 (in the sense that it is not deterministic as z1 where observing one occurrence of each case is enough), as you can see below.
y <- rbinom(n = 20,
            size = 1,
            prob = 0.5)
z1 <- ifelse(test = (y == 1),
             yes = sample(x = c(1, 3, 5, 7, 9),
                          size = 1),
             no = sample(x = c(0, 2, 4, 6, 8),
                         size = 1))
z2 <- sapply(X = y,
             FUN = function(w)
             {
               ifelse(test = (w == 1),
                      yes = sample(x = c(1, 3, 5, 7, 9),
                                   size = 1),
                      no = sample(x = c(0, 2, 4, 6, 8),
                                  size = 1))
             })
data.frame(y, z1, z2)
#> y z1 z2
#> 1 0 2 2
#> 2 1 1 3
#> 3 1 1 9
#> 4 1 1 7
#> 5 0 2 0
#> 6 0 2 2
#> 7 1 1 7
#> 8 1 1 7
#> 9 0 2 0
#> 10 1 1 5
#> 11 0 2 0
#> 12 0 2 0
#> 13 0 2 6
#> 14 0 2 0
#> 15 0 2 2
#> 16 1 1 7
#> 17 1 1 7
#> 18 0 2 2
#> 19 0 2 2
#> 20 0 2 0
unique(x = z1[y == 1])
#> [1] 1
unique(x = z1[y == 0])
#> [1] 2
Created on 2019-03-13 by the reprex package (v0.2.1)
Any help will be appreciated.
ifelse isn't a function of one vector; it is a function of three vectors of the same length. The first vector, called test, is logical. The second vector, yes, and the third vector, no, supply the elements of the result, chosen item by item based on the test value.
A sample of size = 1 is a different size than test (unless the length of test is 1), so it will be recycled by ifelse (see note below). Instead, draw samples of the same size as test from the start:
ifelse(
  test = (y == 1),
  yes = sample(x = c(1, 3, 5, 7, 9), size = length(y), replace = TRUE),
  no = sample(x = c(0, 2, 4, 6, 8), size = length(y), replace = TRUE)
)
The vectors don't actually have to be of the same length. The help page ?ifelse explains: "If yes or no are too short, their elements are recycled." This is the behavior you observed with "It generates once for each case, and returns that very random number always.".
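The recycling is easy to see with fixed values instead of random draws. This is a minimal sketch; the names flag, yes, no, r1 and r2 are made up for illustration.

```r
flag <- c(TRUE, FALSE, TRUE, FALSE)

# Length-1 `yes`/`no` values are recycled to the length of `test`
r1 <- ifelse(flag, 1, 0)
r1
#> [1] 1 0 1 0

# Full-length vectors are picked position by position
yes <- c(11, 13, 15, 17)
no  <- c(20, 22, 24, 26)
r2 <- ifelse(flag, yes, no)
r2
#> [1] 11 22 15 26
```

In the second call each position takes its own element of yes or no, which is why drawing length(y) samples up front gives a different random number per element.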

average calculation for many data in R

I have a file to which the results are saved:
4
4
4
4
5
4
4
5
6
4
4
5
5
6
4
I would like to calculate the average for each group of 5 items.
Unfortunately, I have only managed to calculate the average over all the values.
The values are saved in the file wynik2.txt.
wynik_epidemii <- read.table(file="wynik2.txt")
wynik_epidemii<- mean(as.numeric(unlist(wynik_epidemii)))
You can use tapply, defining the grouping factor with a cumsum trick.
meanN <- function(x, n = 5){
  f <- cumsum(rep(c(1, rep(0, n - 1)), length.out = length(x)))
  tapply(x, f, mean)
}
meanN(x)
# 1 2 3
#4.2 4.6 4.8
DATA.
x <-
c(4, 4, 4, 4, 5, 4, 4, 5, 6, 4, 4, 5, 5, 6, 4)
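To see what the cumsum trick produces, it can help to print the grouping factor on its own. The sketch below uses the same x; the intermediate name marks is illustrative, and the commented-out line assumes wynik2.txt holds a single column without a header, which read.table would name V1.

```r
x <- c(4, 4, 4, 4, 5, 4, 4, 5, 6, 4, 4, 5, 5, 6, 4)
# x <- read.table("wynik2.txt")$V1   # assumed layout: one unnamed column
n <- 5

# A 1 marks the start of each block of n values, zeros fill the rest
marks <- rep(c(1, rep(0, n - 1)), length.out = length(x))
# The running sum turns the marks into group ids
f <- cumsum(marks)
f
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3

group_means <- tapply(x, f, mean)
```

If length(x) is not a multiple of n, the last group simply contains fewer values.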

calculating simple retention in R

For the dataset test, my objective is to find out how many unique users carried over from one period to the next on a period-by-period basis.
> test
user_id period
1 1 1
2 5 1
3 1 1
4 3 1
5 4 1
6 2 2
7 3 2
8 2 2
9 3 2
10 1 2
11 5 3
12 5 3
13 2 3
14 1 3
15 4 3
16 5 4
17 5 4
18 5 4
19 4 4
20 3 4
For example, in the first period there were four unique users (1, 3, 4, and 5), two of which were active in the second period. Therefore the retention rate would be 0.5. In the second period there were three unique users, two of which were active in the third period, and so the retention rate would be 0.666, and so on. How would one find the percentage of unique users that are active in the following period? Any suggestions would be appreciated.
The output would be the following:
> output
period retention
1 1 NA
2 2 0.500
3 3 0.666
4 4 0.500
The test data:
> dput(test)
structure(list(user_id = c(1, 5, 1, 3, 4, 2, 3, 2, 3, 1, 5, 5,
2, 1, 4, 5, 5, 5, 4, 3), period = c(1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)), .Names = c("user_id", "period"
), row.names = c(NA, -20L), class = "data.frame")
How about this? First split the users by period, then write a function that calculates the proportion carryover between any two periods, then loop it through the split list with mapply.
splt <- split(test$user_id, test$period)
carryover <- function(x, y) {
length(unique(intersect(x, y))) / length(unique(x))
}
mapply(carryover, splt[1:(length(splt) - 1)], splt[2:length(splt)])
1 2 3
0.5000000 0.6666667 0.5000000
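To put these rates into the shape of the requested output, they can be shifted by one period and padded with NA. This is a sketch that rebuilds the question's data; the names rates and output are illustrative, and it assumes the periods are consecutive.

```r
test <- data.frame(
  user_id = c(1, 5, 1, 3, 4, 2, 3, 2, 3, 1, 5, 5, 2, 1, 4, 5, 5, 5, 4, 3),
  period  = rep(1:4, each = 5)
)
splt <- split(test$user_id, test$period)

# share of the unique users in x that also appear in y
carryover <- function(x, y) length(intersect(x, y)) / length(unique(x))
rates <- mapply(carryover, splt[-length(splt)], splt[-1])

# The first period has no predecessor, so its retention is NA
output <- data.frame(period    = as.numeric(names(splt)),
                     retention = c(NA, unname(rates)))
output
```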
Here is an attempt using dplyr, though it also uses some base R syntax inside summarise:
library(dplyr)
test %>%
  group_by(period) %>%
  summarise(retention=length(intersect(user_id,test$user_id[test$period==(period+1)]))/n_distinct(user_id)) %>%
  mutate(retention=lag(retention))
This returns:
period retention
<dbl> <dbl>
1 1 NA
2 2 0.5000000
3 3 0.6666667
4 4 0.5000000
This isn't so elegant but it seems to work. Assuming df is the data frame:
# make a list to hold the unique IDs by period
uniques = list()
for(i in 1:max(df$period)){
  uniques[[i]] = unique(df$user_id[df$period == i])
}
# hold the retention rates
retentions = rep(NA, times = max(df$period))
for(j in 2:max(df$period)){
  retentions[j] = mean(uniques[[j-1]] %in% uniques[[j]])
}
Basically the %in% creates a logical of whether or not each element of the first argument is in the second. Taking a mean gives us the proportion.
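For instance, with the unique users of periods 1 and 2 from the question (prev and curr are illustrative names):

```r
prev <- c(1, 3, 4, 5)   # unique users in period 1
curr <- c(2, 3, 1)      # unique users in period 2

kept <- prev %in% curr  # TRUE where a period-1 user shows up again
kept
# [1]  TRUE  TRUE FALSE FALSE

mean(kept)              # TRUE counts as 1, FALSE as 0
# [1] 0.5
```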

Create all pairs within groups and maintaining variables

I have a dataframe with around 30k observations, divided in 300 groups. For example
id, group, x, y
1, 1, 2, 3
2, 1, 4, 3
3, 1, 2, 4
4, 2, 5, 4
5, 2, 5, 3
6, 2, 6, 4
I want to make it so
pair, group, x_i, x_j, y_i, y_j
12, 1, 2, 4, 3, 3
13, 1, 2, 2, 3, 4
23, 1, 4, 2, 3, 4
45, 2, 5, 5, 4, 3
and so on. I've found a few topics, but they don't seem to apply exactly to my problem.
The combn function can be used to generate each corresponding pair of x and y values. We operate by group using lapply. lapply returns a list so we use rbind to put each list element (the results for each group) back together in a single data frame.
new.dat = lapply(unique(dat$group), function(g) {
  data.frame(pairs = apply(t(combn(dat$id[dat$group==g], 2)), 1, paste, collapse=""),
             group=g,
             x = t(combn(dat$x[dat$group==g], 2)),
             y = t(combn(dat$y[dat$group==g], 2)))
})
do.call(rbind, new.dat)
pairs group x.1 x.2 y.1 y.2
1 12 1 2 4 3 3
2 13 1 2 2 3 4
3 23 1 4 2 3 4
4 45 2 5 5 4 3
5 46 2 5 6 4 4
6 56 2 5 6 3 4
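The building block here is combn followed by t. A small sketch on three ids shows the shapes involved (ids, by_col and by_row are illustrative names):

```r
ids <- c(1, 2, 3)

by_col <- combn(ids, 2)   # one pair per column
by_col
#      [,1] [,2] [,3]
# [1,]    1    1    2
# [2,]    2    3    3

by_row <- t(by_col)       # transpose: one pair per row
by_row
#      [,1] [,2]
# [1,]    1    2
# [2,]    1    3
# [3,]    2    3
```

Because combn always walks the positions in the same order, applying it separately to id, x and y within a group keeps the rows of the three results aligned.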
You could also use split, which saves some typing, but is about 10% slower on my machine:
lapply(split(dat, dat$group), function(df) {
  data.frame(pairs = apply(t(combn(df$id, 2)), 1, paste, collapse=""),
             group=df$group[1],  # g is not defined here; take the group from the subset
             x = t(combn(df$x, 2)),
             y = t(combn(df$y, 2)))
})
I won't say this is an optimal solution, but it should work:
df <- read.table(text="id, group, x, y
1,1,2,3
2,1,4,3
3,1,2,4
4,2,5,4
5,2,5,3
6,2,6,4", header=T, sep=",")
df.new <- do.call(rbind,lapply(tapply(df$id, df$group, combn, m=2), FUN=function(x) data.frame(pairi=x[1,], pairj=x[2,])))
df.new <- do.call(rbind,apply(df.new, 1, FUN=function(x) data.frame(pair=paste0(x[1], x[2]),group=df[df$id==x[1], 'group'], x_i=df[df$id==x[1],'x'], x_j=df[df$id==x[2],'x'], y_i=df[df$id==x[1],'y'], y_j=df[df$id==x[2],'y'] )))
df.new
pair group x_i x_j y_i y_j
1.1 12 1 2 4 3 3
1.2 13 1 2 2 3 4
1.3 23 1 4 2 3 4
2.1 45 2 5 5 4 3
2.2 46 2 5 6 4 4
2.3 56 2 5 6 3 4

Exchange two elements of a vector in one call

I have a vector c(9,6,3,4,2,1,5,7,8), and I want to switch the elements at index 2 and at index 5 in the vector. However, I don't want to have to create a temporary variable and would like to make the switch in one call. How would I do that?
How about just x[c(i,j)] <- x[c(j,i)]? Similar to replace(...), but perhaps a bit simpler.
swtch <- function(x,i,j) {x[c(i,j)] <- x[c(j,i)]; x}
swtch(c(9,6,3,4,2,1,5,7,8) , 2,5)
# [1] 9 2 3 4 6 1 5 7 8
You could use replace().
x <- c(9, 6, 3, 4, 2, 1, 5, 7, 8)
replace(x, c(2, 5), x[c(5, 2)])
# [1] 9 2 3 4 6 1 5 7 8
And if you don't even want to assign x, you can use
replace(
c(9, 6, 3, 4, 2, 1, 5, 7, 8),
c(2, 5),
c(9, 6, 3, 4, 2, 1, 5, 7, 8)[c(5, 2)]
)
# [1] 9 2 3 4 6 1 5 7 8
but that's a bit silly. You will probably want x assigned to begin with.
If you actually want to do it without creating a temporary copy of the vector, you would need to write a short C function.
library(inline)
swap <- cfunction(c(i = "integer", j = "integer", vec="integer"),"
  int *v = INTEGER(vec);
  int ii = INTEGER(i)[0]-1, jj = INTEGER(j)[0]-1;
  int tmp = v[ii];
  v[ii] = v[jj];
  v[jj] = tmp;
  return R_NilValue;
")
vec <- as.integer(c(9,6,3,4,2,1,5,7,8))
swap(2L, 5L, vec)
vec
# [1] 9 2 3 4 6 1 5 7 8
