Conditional cumsum in R?

Hi, this is an extension of the question asked here:
Conditional cumulative sum
Suppose I have the following vector. I'd like to calculate the running total within each block delimited by zeros.
d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1);
The expected result is d <- c(0,0,0,1,4,8,13,12,14,17,12,20,0,0,-2,-5,-2,3,0,0,0,-1,-2,-3,-4).
I'd like to do it in a vectorized way as my vector is rather large. So far I have been trying to use rle to achieve this without much success.
Many thanks.

This will work:
aux <- split(d, cumsum(d == 0))
v <- unlist(sapply(aux, cumsum))
1 2 31 32 33 34 35 36 37 38 39 310 4 51 52 53 54 55 6 7 81 82 83 84 85
0 0 0 1 4 8 13 12 14 17 12 20 0 0 -2 -5 -2 3 0 0 0 -1 -2 -3 -4
as.vector(v)
[1] 0 0 0 1 4 8 13 12 14 17 12 20 0 0 -2 -5 -2 3 0 0 0 -1 -2 -3 -4
Here as.vector() just drops the element names that unlist() attached (e.g. "31" above is element 1 of group "3").
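The same grouped running total can also be written as a one-liner with ave(), which applies cumsum within each zero-delimited block (a base-R sketch equivalent to the split/unlist approach above):
ave(d, cumsum(d == 0), FUN = cumsum)
# [1] 0 0 0 1 4 8 13 12 14 17 12 20 0 0 -2 -5 -2 3 0 0 0 -1 -2 -3 -4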

This should work, with no loops. It is very fast because all the work happens in vectorized primitives (compiled code) rather than in an R-level loop:
sum_from <- function(value, from) {
  # position of the most recent reset point (where `from` is TRUE); 0 before the first
  i <- cummax(seq_along(value) * from)
  # cumulative sum, zeroing out anything before the first reset point
  cv <- cumsum(value * cummax(from))
  # subtract the running total accumulated up to the last reset point
  cv - c(0, 0, cv[-length(cv)])[i + 1]
}
d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1)
The results match:
> all(sum_from(d, d == 0) == c(0,0,0,1,4,8,13,12,14,17,12,20,0,0,-2,-5,-2,3,0,0,0,-1,-2,-3,-4))
[1] TRUE

Related

Count next n rows that meet a condition in R

Let's say I have a df that looks like this
ID X_Value
1 40
2 13
3 75
4 83
5 64
6 43
7 74
8 45
9 54
10 84
What I would like is a rolling function: if, among the current row and the last 4 rows, there are 2 or more values higher than X (say 70 for this example), return 1, else 0.
So the output would be something like the following:
ID X_Value Next_4_2
1 40 0
2 13 0
3 75 0
4 83 1
5 64 1
6 43 1
7 74 1
8 45 0
9 54 0
10 84 1
I think this should be possible with a rolling function, but I have tried and am not sure how to do it. Thank you in advance.
Given your expected output, I suppose you meant "in the actual and previous 3 rows". Then using some rolling function indeed does the job:
library(zoo)
thr1 <- 70     # value threshold
thr2 <- 2      # how many values in the window must exceed it
last <- 3 + 1  # window width: the current row plus the previous 3
df$Next_4_2 <- 1 * (rollsum(df$X_Value > thr1, last, align = "right", fill = 0) >= thr2)
df
# ID X_Value Next_4_2
# 1 1 40 0
# 2 2 13 0
# 3 3 75 0
# 4 4 83 1
# 5 5 64 1
# 6 6 43 1
# 7 7 74 1
# 8 8 45 0
# 9 9 54 0
# 10 10 84 1
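If you'd rather avoid the zoo dependency, stats::filter() in base R computes the same trailing window sum (a sketch reusing thr1, thr2 and last from above; the leading NAs from the incomplete windows play the role of fill = 0):
hits <- stats::filter(df$X_Value > thr1, rep(1, last), sides = 1)
df$Next_4_2 <- as.integer(!is.na(hits) & hits >= thr2)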
The indexing using max(1, i-3) is perhaps the only part of the code worth remembering. It might help in later situations where a for-loop is really needed.
dat$X_Next_4_2 <- integer(length(dat$X_Value))
for (i in seq_along(dat$X_Value)) {
  # window: the current row and up to 3 previous rows, clamped at the start
  dat$X_Next_4_2[i] <- as.integer(sum(dat$X_Value[max(1, i - 3):i] > 70) >= 2)
}
(Not very pretty and clearly inferior to the rollsum answer already posted.)

How to get value from upcoming row if condition is met?

I searched on Google and SO but could not find an answer to my question.
I'm trying to get a value from the first upcoming row where a condition is met.
Example:
Pupil participation bonus
2 55 6
2 33 3
2 88 9
2 0 -100
2 44 4
2 66 7
2 0 -33
to
Pupil participation bonus bonusAtNoParti sumBonusTillParticipation=0
2 55 6 -94 6+3+9 = 18
2 33 3 -97 3+9 = 12
2 88 9 -91 9
2 0 -100 0 0
2 44 4 -29 4+7=11
2 66 7 -26 7
2 0 -33 0 0
So I need to do this:
Iterate through the dataframe, check the following rows until participation equals 0, take the bonus from that row, add the bonus from the current row, and write the result to bonusAtNoParti.
My problem here is the "check the following rows until participation equals 0 and get the bonus from that row" part.
I know how to iterate through the whole list, but not from the current point (row) onward.
I would need to apply this to the whole list, where participation values can occur in any order.
Has anyone any idea how to realize it?
Edit: I also added another column ("sumBonusTillParticipation=0"; only the summed value is required), which is even harder to realize. R is such a hard language to learn =(
You can use which() to get the row numbers where participation is 0.
df <- read.table(text = 'Pupil participation bonus
2 55 6
2 33 3
2 88 9
2 0 -100
2 44 4
2 66 7
2 0 -33', header = T)
# row numbers of the participation == 0 rows, with 0 as a left boundary
index <- c(0, which(df$participation == 0))
diffs <- diff(index)
# repeat each block's closing bonus across the rows of that block
# (indexing with 0 silently drops it, so df$bonus[index] picks only the zero rows)
df$tp <- rep(df$bonus[index], times = diffs)
df$bonusAtNoParti <- df$bonus + df$tp
df$bonusAtNoParti[index] <- 0  # the participation == 0 rows themselves get 0
df$tp <- NULL
Pupil participation bonus bonusAtNoParti
1 2 55 6 -94
2 2 33 3 -97
3 2 88 9 -91
4 2 0 -100 0
5 2 44 4 -29
6 2 66 7 -26
7 2 0 -33 0
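For the sumBonusTillParticipation=0 column from your edit, a reversed grouped cumulative sum is one sketch (assuming, as in your example, that each block ends at a participation == 0 row, and that those rows themselves get 0):
# group id: rows that share the same upcoming participation == 0 row
g <- rev(cumsum(rev(df$participation == 0)))
# cumulative sum from each row down to the end of its block, excluding the zero row's bonus
s <- ave(df$bonus * (df$participation != 0), g, FUN = function(x) rev(cumsum(rev(x))))
df$sumBonusTillParticipation <- ifelse(df$participation == 0, 0, s)
# 18 12 9 0 11 7 0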

transform values in data frame, generate new values as 100 minus current value

I'm currently working on a script which will eventually plot the accumulation of losses from cell divisions. Firstly I generate a matrix of values and then I add the number of times 0 occurs in each column - a 0 represents a loss.
However, I am now thinking that a nice plot would be a degradation curve. So, given the following example;
> losses_plot_data <- melt(full_losses_data, id=c("Divisions", "Accuracy"), value.name = "Losses", variable.name = "Size")
> full_losses_data
Divisions Accuracy 20 15 10 5 2
1 0 0 0 0 3 25
2 0 0 0 1 10 39
3 0 0 1 3 17 48
4 0 0 1 5 23 55
5 0 1 3 8 29 60
6 0 1 4 11 34 64
7 0 2 5 13 38 67
8 0 3 7 16 42 70
9 0 4 9 19 45 72
10 0 5 11 22 48 74
Is there a way I can easily turn this table into 100 minus the numbers shown? If I could plot that data instead of my current data, I would have a lovely degradation curve from 100% down to however many cells have been lost.
Assuming you do not want to do that for the first column:
fld <- full_losses_data
# subtract every column except the first from 100
fld[, 2:ncol(fld)] <- 100 - fld[, -1]
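From there, the degradation curve itself could be drawn with the melt() call you already have, applied to the transformed data (a ggplot2 sketch; the aesthetic mappings assume the column names from your melt output):
library(reshape2)
library(ggplot2)
losses_plot_data <- melt(fld, id = c("Divisions", "Accuracy"), value.name = "Losses", variable.name = "Size")
ggplot(losses_plot_data, aes(Divisions, Losses, colour = Size)) + geom_line()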

Countif in R: Relational Vectors

I'm new to R and I'm having an issue trying to perform a "countif" as in Excel. What I have is below. There are two vectors: vector1 holds the possible values of vector2. Vector1 contains the team_ids of the teams that may win a game in a tournament. Vector2 is the result of a simulation.
The reason why I can't use a table to summarize the simulation is that many teams won't be represented in game63, but I would still like to return a 0.
In the end, I would like to add a vector possible_teams_prob that counts how many times each item of possible_teams appears in game63. This way I can build a final table that lists the teams along with their probabilities of winning game63.
> possible_teams <- seq(1,64)
> possible_teams
[1] 1 2 3 4 5 6 7 8 9 10 11 12
[13] 13 14 15 16 17 18 19 20 21 22 23 24
[25] 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48
[49] 49 50 51 52 53 54 55 56 57 58 59 60
[61] 61 62 63 64
> game63[1:20]
[1] 4 8 4 4 3 20 2 3 3 1 3 20
[13] 3 8 2 4 3 1 14 3
Interesting question. In general, one can use the fact that R evaluates TRUE as 1 and FALSE as 0 to do a lot of COUNTIF-type work. In this case, though, you want it along the vector. Writing a loop would certainly work, but this is R, so we need to use some vectorized version, which leads one to the apply family. In this case, the following seems to be what you want:
f2 <- function(V1, V2) sum(V1 == V2)
vapply(possible_teams, f2, V2 = game63, FUN.VALUE = double(1))
which returns
[1] 2 2 7 4 0 0 0 2 0 0 0 0 0 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
This works by setting up a function which creates the "countif" between vectors. It won't do the right thing on its own, as it expects two vectors and they don't recycle nicely, but vapply iterates the function down the length of the first vector, which is what you wanted.
sapply will work as well, and doesn't require a "target value" definition, but can be slower because of that. Your case is small enough it doesn't really matter.
> microbenchmark(sapply(possible_teams, f2, V2 = game63), vapply(possible_teams, f2, V2 = game63, FUN.VALUE = double(1)), times = 1000L, control=list(order='block'))
Unit: microseconds
expr min lq mean median uq max neval
sapply(possible_teams, f2, V2 = game63) 89.351 92.926 103.31433 95.309 100.371 945.629 1000
vapply(possible_teams, f2, V2 = game63, FUN.VALUE = double(1)) 61.057 64.631 73.80298 67.610 71.779 1223.510 1000
Try this:
# recreate your data
allteams <- seq(64)
# summarize the game63 data to get counts by team
temp <- tapply(game63, game63, length)
# initialize return vector
answer <- integer(length(allteams)); names(answer) <- 1:64
# fill in the observed counts
answer <- temp[match(allteams, names(temp))]
# replace missing values
answer[is.na(answer)] <- 0
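For what it's worth, there is also a one-line route: coercing game63 to a factor whose levels span all of possible_teams makes table() report the zero counts you were missing:
possible_teams_prob <- as.vector(table(factor(game63, levels = possible_teams)))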

Calculating Confidence Intervals for two datasets

Quite a number of questions I've asked today.
I'd like to calculate the confidence interval (99% level, not 95%) for the mean of the variable age in two dataframes, infert_control and infert_patient, where:
infert_control = subset(infert$age, infert$case == 0)
infert_patient = subset(infert$age, infert$case == 1)
infert is a built-in R dataset; for those not familiar with it, here it is below: case 0 denotes the control-group patients, case 1 the actual patients.
> infert
education age parity induced case spontaneous stratum pooled.stratum
1 0-5yrs 26 6 1 1 2 1 3
2 0-5yrs 42 1 1 1 0 2 1
3 0-5yrs 39 6 2 1 0 3 4
4 0-5yrs 34 4 2 1 0 4 2
5 6-11yrs 35 3 1 1 1 5 32
6 6-11yrs 36 4 2 1 1 6 36
7 6-11yrs 23 1 0 1 0 7 6
8 6-11yrs 32 2 0 1 0 8 22
9 6-11yrs 21 1 0 1 1 9 5
10 6-11yrs 28 2 0 1 0 10 19
11 6-11yrs 29 2 1 1 0 11 20
...
239 12+ yrs 38 6 0 0 2 74 63
240 12+ yrs 26 2 1 0 1 75 49
241 12+ yrs 31 1 1 0 0 76 45
242 12+ yrs 31 2 0 0 1 77 53
243 12+ yrs 25 1 0 0 1 78 41
244 12+ yrs 31 1 0 0 1 79 45
245 12+ yrs 34 1 0 0 0 80 47
246 12+ yrs 35 2 2 0 0 81 54
247 12+ yrs 29 1 0 0 1 82 43
248 12+ yrs 23 1 0 0 1 83 40
What would be the correct way to solve this?
I've already calculated the Mean value of column age for both infert_control and infert_patient, plus the standard deviation of each subset.
You could use bootstrap for this:
library(boot)
set.seed(42)
# resample the mean 10,000 times, then take the 0.5% and 99.5% quantiles for a 99% CI
boot_mean <- boot(infert_control, function(x, i) mean(x[i]), R=1e4)
quantile(boot_mean$t, probs=c(0.005, 0.995))
# 0.5% 99.5%
# 30.47273 32.58182
Or if you don't want to use a library:
set.seed(42)
R <- 1e4
boot_mean <- colMeans(
  matrix(
    sample(infert_control, R * length(infert_control), TRUE),
    ncol = R))
quantile(boot_mean, probs=c(0.005, 0.995))
# 0.5% 99.5%
#30.42424 32.55152
So many answers...
The standardized mean of a random sample has a t-distribution, not a normal one, although t -> N as df -> Inf.
cl <- function(data, p) {
  n <- length(data)
  # half-width of the (1-p) interval from the t quantile
  cl <- qt(p/2, n - 1, lower.tail = FALSE) * sd(data) / sqrt(n)
  m <- mean(data)
  return(c(lower = m - cl, upper = m + cl))
}
cl.control <- cl(infert_control,0.01)
cl.control
# lower upper
# 30.42493 32.55689
cl.patient <- cl(infert_patient,0.01)
cl.patient
# lower upper
# 30.00221 33.05803
aggregate(age~case,data=infert,cl,p=0.01) # much better way...
# case age.lower age.upper
# 1 0 30.42493 32.55689
# 2 1 30.00221 33.05803
Also, the quantile functions (e.g. qt(...) and qnorm(...)) return the lower tail by default, so your limits would be reversed unless you set lower.tail=FALSE.
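For completeness, t.test() computes this same t-based interval directly, which makes a handy check on the helper above (conf.level = 0.99 for the 99% level):
t.test(infert_control, conf.level = 0.99)$conf.int
# [1] 30.42493 32.55689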
You could easily calculate the confidence interval manually:
infert_control <- subset(infert$age, infert$case == 0)
# calculate needed values
m <- mean(infert_control)
s <- sd(infert_control)
n <- length(infert_control)
# calculate error for the normal distribution (choose your distribution here, e.g. qt for the t-distribution)
a <- 0.995 # 99% CI => 0.5% on both sides
error <- qnorm(a)*s/sqrt(n)
# calculate CI
ci_lower <- m-error
ci_upper <- m+error
See also http://en.wikipedia.org/wiki/Confidence_interval (sorry for the Wikipedia link, but it has a good explanation and shows you the formula).
... or as small function:
cifun <- function(data, ALPHA){
c(mean(data) - qnorm(1-ALPHA/2) * sd(data)/sqrt(length(data)),
mean(data) + qnorm(1-ALPHA/2) * sd(data)/sqrt(length(data)))
}
cifun(infert_control, 0.01)
