Cumulative conditional product with reset - r

I have a large xts object, but the example below is a two-column data.frame subset of the data. I would like to calculate (in a new column) the cumulative product of the first column df$rt whenever the second column df$dd is less than 0. Whenever df$dd is 0 I want to reset the accumulation to 0 again, so that the next time df$dd is less than 0 the cumulative product of df$rt starts afresh.
The following example dataframe adds the desired outcome as column three df$crt, for reference. Note that some rounding has been applied.
df <- data.frame(
  rt = c(0, 0.0171, 0.0796, 0.003, 0.0754, -0.0314, 0.0275, -0.0323, 0.0364, 0.0473, -0.0021),
  dd = c(0, -0.0657, -0.0013, 0, -0.018, -0.0012, 0, 0, 0, -0.0016, -0.0856),
  crt = c(0, 0.171, 0.0981, 0, 0.0754, 0.0415, 0, 0, 0, 0.473, 0.045)
)
I have tried various combinations of with, ifelse and cumprod like:
df$crt <- with(df, ifelse(df$dd<0, cumprod(1+df$rt)-1, 0))
However, this does not reset the cumulative product after a 0 in df$dd; it only writes a 0 and then continues the previous accumulation of df$rt once df$dd is below zero again.
I think I am missing a counter of some sort to initiate the reset. Note that the data frame I'm working with to implement this is large.

Create a grouping column by taking the cumulative sum of the logical vector (dd == 0), so that the group index increments by 1 at each position where dd is 0, then use replace to compute the cumulative product of 'rt' only at the positions where 'dd' is not equal to 0:
library(dplyr)
df %>%
  group_by(grp = cumsum(dd == 0)) %>%
  mutate(crt1 = replace(dd, dd != 0, cumprod(1 + rt[dd != 0]) - 1)) %>%
  ungroup %>%
  select(-grp)
-output
# A tibble: 11 x 4
        rt      dd    crt   crt1
     <dbl>   <dbl>  <dbl>  <dbl>
 1  0       0      0      0
 2  0.0171 -0.0657 0.171  0.0171
 3  0.0796 -0.0013 0.0981 0.0981
 4  0.003   0      0      0
 5  0.0754 -0.018  0.0754 0.0754
 6 -0.0314 -0.0012 0.0415 0.0416
 7  0.0275  0      0      0
 8 -0.0323  0      0      0
 9  0.0364  0      0      0
10  0.0473 -0.0016 0.473  0.0473
11 -0.0021 -0.0856 0.045  0.0451
Or using base R
with(df, ave(rt * (dd != 0), cumsum(dd == 0), FUN = function(x)
replace(x, x != 0, (cumprod(1 + x[x != 0]) - 1))))
-output
[1] 0.00000000 0.01710000 0.09806116 0.00000000 0.07540000 0.04163244 0.00000000 0.00000000 0.00000000 0.04730000 0.04510067
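Since the real data is large (an xts object originally), here is a minimal data.table sketch of the same grouping idea; it assumes the xts data has already been converted to a data.frame/data.table with the rt and dd columns used above:
library(data.table)
dt <- as.data.table(df)
# group index that increases by 1 at every row where dd is 0
dt[, grp := cumsum(dd == 0)]
# default the new column to 0, then fill the cumulative product only where dd != 0
dt[, crt1 := 0]
dt[dd != 0, crt1 := cumprod(1 + rt) - 1, by = grp]
dt[, grp := NULL]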

Related

How to compute a single mean of multiple columns?

I have a data frame with 4 columns and 8 observations:
> df1
  Rater1 Rater2 Rater4 Rater5
1      3      3      3      3
2      3      3      2      3
3      3      3      2      2
4      0      0      1      0
5      0      0      0      0
6      0      0      0      0
7      0      0      1      0
8      0      0      0      0
I would like to have the mean, median, IQR and SD of all Rater1 and Rater4 observations (16 values) and of all Rater2 and Rater5 observations (16 values), without creating a new df with 2 variables like this:
> df2
   var1 var2
1     3    3
2     3    3
3     3    3
4     0    0
5     0    0
6     0    0
7     0    0
8     0    0
9     3    3
10    2    3
11    2    2
12    1    0
13    0    0
14    0    0
15    1    0
16    0    0
I would like to obtain this (without creating a new data frame, just working on the first one):
> stat.desc(df2)
                   var1       var2
nbr.val      16.0000000 16.0000000
nbr.null      8.0000000 10.0000000
nbr.na        0.0000000  0.0000000
min           0.0000000  0.0000000
max           3.0000000  3.0000000
range         3.0000000  3.0000000
sum          18.0000000 17.0000000
median        0.5000000  0.0000000
mean          1.1250000  1.0625000
SE.mean       0.3275541  0.3590352
CI.mean.0.95  0.6981650  0.7652653
var           1.7166667  2.0625000
std.dev      1.3102163  1.4361407
coef.var     1.1646367  1.3516618
How can I do this in R?
Thank you in advance
Another solution, using a for loop to compute the statistics in one go.
First, create vectors for the raters you want to combine:
# Raters 1 and 4:
r14 <- as.integer(unlist(df1[, c("Rater1", "Rater4")]))
# Raters 2 and 5:
r25 <- as.integer(unlist(df1[, c("Rater2", "Rater5")]))
Combine these vectors in a data frame:
df <- data.frame(r14, r25)
And calculate the statistics:
for(i in 1:ncol(df)){
  print(c(mean(df[, i]), IQR(df[, i]), median(df[, i]), sd(df[, i])))
}
[1] 1.125000 2.250000 0.500000 1.310216
[1] 1.062500 3.000000 0.000000 1.436141
A possible base approach:
df <- data.frame( # construct your original dataframe
  Rater1 = c(3, 3, 3, 0, 0, 0, 0, 0),
  Rater2 = c(3, 3, 3, 0, 0, 0, 0, 0),
  Rater4 = c(3, 2, 2, 1, 0, 0, 1, 0),
  Rater5 = c(3, 3, 2, 0, 0, 0, 0, 0)
)
combined <- data.frame( # make a new dataframe with your desired variables
  R14 = with(df, c(Rater1, Rater4)),
  R25 = with(df, c(Rater2, Rater5))
)
sapply(combined, mean) # compute mean of each column
sapply(combined, median) # median
sapply(combined, sd) # standard deviation
sapply(combined, IQR) # interquartile range
We can loop over the pairs of similar column names, convert each pair to a single vector, and get the mean, median, IQR and SD:
out <- do.call(rbind, Map(function(x, y) {
  v1 <- c(df1[[x]], df1[[y]])
  data.frame(Mean = mean(v1), Median = median(v1),
             IQR = IQR(v1), SD = sd(v1))
}, names(df1)[1:2], names(df1)[3:4]))
row.names(out) <- paste(names(df1)[1:2], names(df1)[3:4], sep = "_")
out
#                Mean Median  IQR       SD
#Rater1_Rater4 1.1250    0.5 2.25 1.310216
#Rater2_Rater5 1.0625    0.0 3.00 1.436141
data
df1 <- structure(list(Rater1 = c(3, 3, 3, 0, 0, 0, 0, 0),
                      Rater2 = c(3, 3, 3, 0, 0, 0, 0, 0),
                      Rater4 = c(3, 2, 2, 1, 0, 0, 1, 0),
                      Rater5 = c(3, 3, 2, 0, 0, 0, 0, 0)),
                 class = "data.frame", row.names = c(NA, -8L))
A tidyverse/dplyr solution.
library(dplyr)
bind_rows(select(df, r14 = Rater1, r25 = Rater2),
          select(df, r14 = Rater4, r25 = Rater5)) %>%
  summarise_all(list(
    mean = mean,
    median = median,
    sd = sd,
    iqr = IQR
  ))
#>   r14_mean r25_mean r14_median r25_median   r14_sd   r25_sd r14_iqr r25_iqr
#> 1    1.125   1.0625        0.5          0 1.310216 1.436141    2.25       3
In case you want the output similar to the one in your question, use t() to transpose the result.
t(.Last.value)
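If you specifically want the stat.desc() layout shown in the question, a further sketch (assuming the pastecs package, which provides stat.desc, is installed) combines the raters inline without saving a separate df2:
library(pastecs)
stat.desc(data.frame(var1 = c(df1$Rater1, df1$Rater4),
                     var2 = c(df1$Rater2, df1$Rater5)))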

Subsetting data in R based on a test

I would like to subset a dataframe based on a test performed. For instance, I ran the test
CheckUnsystematic(dat = long, deltaq = 0.025, bounce = 0.1, reversals = 0, ncons0 = 2)
It gave me this:
> CheckUnsystematic(dat = long, deltaq = 0.025, bounce = 0.1, reversals = 0, ncons0 = 2)
   id TotalPass DeltaQ DeltaQPass Bounce BouncePass Reversals ReversalsPass NumPosValues
1   2         3 0.9089       Pass 0.0000       Pass         0          Pass           15
2   3         3 0.6977       Pass 0.0000       Pass         0          Pass           16
3   4         2 0.0000       Fail 0.0000       Pass         0          Pass           18
4   5         3 0.2107       Pass 0.0000       Pass         0          Pass           18
5   6         3 0.2346       Pass 0.0000       Pass         0          Pass           18
6   7         3 0.9089       Pass 0.0000       Pass         0          Pass           16
7   8         3 0.9622       Pass 0.0000       Pass         0          Pass           15
8   9         3 0.8620       Pass 0.0000       Pass         0          Pass           11
9  10         3 0.9089       Pass 0.0000       Pass         0          Pass           12
10 11         3 0.9089       Pass 0.0000       Pass         0          Pass           11
I want to keep only the observations that have a "3" in "TotalPass".
I tried this:
CleanAPT <- long[ which(long$TotalPass==3),]
Since you tagged this as a dplyr question, let's use it:
library(dplyr)
check_df <- CheckUnsystematic(dat = long, deltaq = 0.025,
                              bounce = 0.1, reversals = 0, ncons0 = 2)
CleanAPT <- check_df %>%
  filter(TotalPass == 3)
The reason CleanAPT <- long[which(long$TotalPass == 3), ] is not working is that you are subsetting the long data frame, which is unmodified by the CheckUnsystematic function. In the code above, I save the function's results to check_df, so CleanAPT <- check_df[which(check_df$TotalPass == 3), ] works as well.
Merging back with the original data (it is hard to say exactly how without knowing the column names of long, so I assume id is present and unique) can be done with a semi_join from dplyr:
long_filtered <- long %>%
  mutate(id = as.character(id)) %>%
  semi_join(CleanAPT %>%
              mutate(id = as.character(id)),
            by = "id")
Try this with your long dataset.
CleanAPT <- subset(long, TotalPass == 3)
CheckUnsystematic(dat = CleanAPT, deltaq = 0.025, bounce = 0.1, reversals = 0, ncons0 = 2)

Calculating percentage of particular value against sum of all values when the other values are all 0s

I have a data frame of questionnaire data which has undergone processing. Each column measures a particular construct in binary terms (1 represents yes, 0 represents no, and NA is a blank).
A sample of the data frame is as follow:
df <- data.frame(qol1 = c(1, 0, 0, 1, NA, 0, 0, 1, NA, 0),
                 qol2 = c(0, 0, 0, 0, NA, 1, 0, 0, 0, 0),
                 qol3 = c(1, 0, NA, NA, NA, 0, 0, 0, 1, 1))
df
   qol1 qol2 qol3
1     1    0    1
2     0    0    0
3     0    0   NA
4     1    0   NA
5    NA   NA   NA
6     0    1    0
7     0    0    0
8     1    0    0
9    NA    0    1
10    0    0    1
I would like to calculate the percentage of 1s over the total number of 1s and 0s (ignoring the NAs) for each column.
I have attempted to use the following code, but it did not result in the correct answer, because the 0s add nothing to the sum, so the denominator ends up equal to the numerator:
library(dplyr)
df2 <- df %>%
  summarise_all(funs(sum(. == 1, na.rm = TRUE)/sum(., na.rm = TRUE)*100))
I have thought of using nrow, count, etc, but they do not have an argument for na.rm.
The desired outcome I would like is:
qol1 qol2 qol3
37.5 11.11 42.85
Thanks and much appreciated!
We can use is.na and sum over its negation to count the non-NA values:
library(dplyr)
df %>%
  summarise_all(funs(sum(. == 1, na.rm = TRUE)/sum(!is.na(.))*100))
# qol1 qol2 qol3
#1 37.5 11.11111 42.85714
A base R option with same logic
colSums(df == 1, na.rm = TRUE)/colSums(!is.na(df)) * 100
# qol1 qol2 qol3
#37.50000 11.11111 42.85714
Or even simpler, since the input contains only 1,0 and NAs
colMeans(df, na.rm = TRUE) * 100
# qol1 qol2 qol3
#37.50000 11.11111 42.85714
Using mean() in base R:
sapply(df, function(x) mean(x, na.rm = TRUE) * 100)
qol1 qol2 qol3
37.50000 11.11111 42.85714
# or more concisely:
sapply(df, mean, na.rm = TRUE) * 100
Same logic in dplyr
summarise_all(df, mean, na.rm = TRUE) * 100
qol1 qol2 qol3
1 37.5 11.11111 42.85714
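As an aside, funs() and the summarise_all() verbs are superseded in newer dplyr releases; a sketch of the same calculation with across() (assuming dplyr >= 1.0 is available):
library(dplyr)
# mean of a 0/1 column is the proportion of 1s among non-NA values
df %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE) * 100))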

create data.frame satisfying some condition

Input
final_table =
  Chr     start       end   num seg.mean seg.mean.1 seg.mean.2
    1  68580000  68640000 A8430   0.7000     0.1440     0.1032
    1 115900000 116260000 B8430   0.0039     2.7202     2.7202
    1 173500000 173680000    C5  -1.7738    -0.0746    -0.2722
How can I make a new data.frame where the values of columns 5 through 7 are set to:
-1, if value < -0.679
0, if -0.679 <= value <= 0.450
+1, if value > 0.450
Expected output
  Chr     start       end   num seg.mean seg.mean.1 seg.mean.2
    1  68580000  68640000 A8430        1          0          0
    1 115900000 116260000 B8430        0          1          1
    1 173500000 173680000    C5       -1          0          0
try this:
# read the data in
df <- read.table(header = TRUE, text="Chr start end num seg.mean seg.mean.1 seg.mean.2
1 68580000 68640000 A8430 0.7000 0.1440 0.1032
1 115900000 116260000 B8430 0.0039 2.7202 2.7202
1 173500000 173680000 C5 -1.7738 -0.0746 -0.2722")
# get the names of the columns you want to change
cols <- names(df[5:length(df)])
# set a function for the different values you want for the value ranges
fun_cond <- function(x) {
  ifelse(x < -0.679, -1, ifelse(
    x >= -0.679 & x <= 0.450, 0, 1))
}
# copy the data frame so the old one doesn't get overwritten
new_df <- df
# work with data.table to apply the function to the columns
library(data.table)
setDT(new_df)[, (cols) := lapply(.SD, fun_cond), .SDcols = cols]
output:
Chr start end num seg.mean seg.mean.1 seg.mean.2
1: 1 68580000 68640000 A8430 1 0 0
2: 1 115900000 116260000 B8430 0 1 1
3: 1 173500000 173680000 C5 -1 0 0
same thing without using any additional packages:
cols <- names(df[5:length(df)])
fun_cond <- function(x) {
  ifelse(x < -0.679, -1, ifelse(
    x >= -0.679 & x <= 0.450, 0, 1))
}
new_df <- df
new_df[5:length(df)] <- lapply(new_df[5:length(df)], fun_cond)
I'd use the cut function and apply it to the last three columns individually.
Here's a simple example:
original = data.frame(a=c(rep("A", 2), rep("B", 2)), seg.mean=c(-1, 0, 0.4, 0.5));
original$segmented = cut(original$seg.mean, c(-Inf, -0.679, 0.450, Inf), labels = c(-1,0,1))
One thing to be careful about: the new column will be a factor. If you need numerical values, use as.numeric(as.character(...)), since calling as.numeric() directly on a factor returns the underlying level codes (1, 2, 3) rather than -1, 0, 1.
You can also try to use labels=FALSE which will give you numerical values (but likely 1,2,3 rather than -1,0,1). You can fix that by subtracting 2:
original$segmented = cut(original$seg.mean, c(-Inf, -0.679, 0.450, Inf), labels = FALSE)-2
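Putting that together for the data in the question, a sketch that applies cut() to the three seg.mean columns (names taken from the question) and converts the factor labels back to numbers:
seg_cols <- c("seg.mean", "seg.mean.1", "seg.mean.2")
# recode each column into -1 / 0 / 1 using the thresholds from the question
df[seg_cols] <- lapply(df[seg_cols], function(x)
  as.numeric(as.character(cut(x, c(-Inf, -0.679, 0.450, Inf), labels = c(-1, 0, 1)))))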
You can directly replace fields in the data frame by subsetting
df[, 5:7] <- ifelse(df[, 5:7] < -0.679, -1,
                    ifelse(df[, 5:7] <= 0.450, 0, 1))

Create counter within consecutive runs of certain values

I have an hourly value. I want to count how many consecutive hours the value has been zero since the last time it was not zero. This is an easy job for a spreadsheet or for loop, but I am hoping for a snappy vectorized one-liner to accomplish the task.
x <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0)
df <- data.frame(x, zcount = NA)
df$zcount[1] <- ifelse(df$x[1] == 0, 1, 0)
for(i in 2:nrow(df))
df$zcount[i] <- ifelse(df$x[i] == 0, df$zcount[i - 1] + 1, 0)
Desired output:
R> df
   x zcount
1  1      0
2  0      1
3  1      0
4  0      1
5  0      2
6  0      3
7  1      0
8  1      0
9  0      1
10 0      2
William Dunlap's posts on R-help are the place to look for all things related to run lengths. His f7 from this post is
f7 <- function(x) { tmp <- cumsum(x); tmp - cummax((!x) * tmp) }
and in the current situation f7(!x). In terms of performance there is
> x <- sample(0:1, 1000000, TRUE)
> system.time(res7 <- f7(!x))
user system elapsed
0.076 0.000 0.077
> system.time(res0 <- cumul_zeros(x))
user system elapsed
0.345 0.003 0.349
> identical(res7, res0)
[1] TRUE
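Applied to the example vector from the question, f7(!x) reproduces the desired zcount column:
x <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0)
f7(!x)
# [1] 0 1 0 1 2 3 0 0 1 2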
Here's a way, building on Joshua's rle approach: (EDITED to use seq_len and lapply as per Marek's suggestion)
> (!x) * unlist(lapply(rle(x)$lengths, seq_len))
[1] 0 1 0 1 2 3 0 0 1 2
UPDATE. Just for kicks, here's another way to do it, around 5 times faster:
cumul_zeros <- function(x) {
  x <- !x
  rl <- rle(x)
  len <- rl$lengths
  v <- rl$values
  cumLen <- cumsum(len)
  z <- x
  # replace the 0 at the end of each zero-block in z by the
  # negative of the length of the preceding 1-block....
  iDrops <- c(0, diff(v)) < 0
  z[cumLen[iDrops]] <- -len[c(iDrops[-1], FALSE)]
  # ... to ensure that the cumsum below does the right thing.
  # We zap the cumsum with x so only the cumsums for the 1-blocks survive:
  x * cumsum(z)
}
Try an example:
> cumul_zeros(c(1,1,1,0,0,0,0,0,1,1,1,0,0,1,1))
[1] 0 0 0 1 2 3 4 5 0 0 0 1 2 0 0
Now compare times on a million-length vector:
> x <- sample(0:1, 1000000,T)
> system.time( z <- cumul_zeros(x))
user system elapsed
0.15 0.00 0.14
> system.time( z <- (!x) * unlist( lapply( rle(x)$lengths, seq_len)))
user system elapsed
0.75 0.00 0.75
Moral of the story: one-liners are nicer and easier to understand, but not always the fastest!
rle will "count how many consecutive hours the value has been zero since the last time it was not zero", but not in the format of your "desired output".
Note the lengths for the elements where the corresponding values are zero:
rle(x)
# Run Length Encoding
# lengths: int [1:6] 1 1 1 3 2 2
# values : num [1:6] 1 0 1 0 1 0
A simple base R approach:
ave(!x, cumsum(x), FUN = cumsum)
#[1] 0 1 0 1 2 3 0 0 1 2
One-liner, not exactly super elegant:
x <- c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0)
unlist(lapply(split(x, c(0, cumsum(abs(diff(!x == 0))))), function(x) (x[1] == 0) * seq(length(x))))
Using purrr::accumulate() is very straightforward, so this tidyverse solution may add some value here. I must acknowledge it is definitely not the fastest, as it calls the same function length(x) times.
library(purrr)
accumulate(x==0, ~ifelse(.y!=0, .x+1, 0))
[1] 0 1 0 1 2 3 0 0 1 2

Resources