The following code provides example data:
library(TTR)
set.seed(15)
r <- rnorm(1000, 0, .01)
P_1 <- 100
P <- P_1*cumprod(1+r)
zz <- ZigZag(P, change = 5, percent = TRUE)
set.seed(15)
volume <- round(runif(1000, 50, 550), digits = 0)
data <- as.data.frame(cbind(P, zz, volume))
plot(P, type = "l")
lines(zz, col = "red")
In the end, I would like to create a cumulative sum of volume in a new column, where the sum resets whenever the zigzag line (zz) changes direction. I have tried s <- sign(diff(data$zz, lag = 1)), which would show those turning points, but I haven't been able to combine it with cumsum.
Here is a solution that uses dplyr:
library(dplyr)
data %>%
mutate(
zz_up = (zz - lag(zz) > 0),
zz_switch = zz_up != lag(zz_up),
zz_switch = ifelse(is.na(zz_switch), FALSE, zz_switch),
group = cumsum(zz_switch)
) %>%
group_by(group) %>%
mutate(cum_volume = cumsum(volume))
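The same grouping idea can be sketched in base R, starting from the sign(diff()) attempt in the question. This is only a minimal illustration on made-up toy vectors (the values of zz and volume below are hypothetical), and it assumes zz contains no NA values and no flat steps:

```r
# Toy zigzag line: rises, falls, then rises again (hypothetical values)
zz     <- c(1, 2, 3, 2, 1, 2, 3)
volume <- c(10, 20, 30, 40, 50, 60, 70)

# Direction of each step; the first row has no predecessor,
# so it reuses the direction of the first step
s <- sign(diff(zz))
s <- c(s[1], s)

# A new leg starts wherever the direction differs from the previous row
leg <- cumsum(c(TRUE, diff(s) != 0))

# Cumulative volume, restarted at the beginning of every leg
cum_volume <- ave(volume, leg, FUN = cumsum)
cum_volume
```

As in the dplyr version, the turning point itself still belongs to the leg that led up to it, and the reset happens on the first row that moves in the new direction.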
Attempt with RcppRoll:
Code
Vectorize(require)(package = c("magrittr", "dplyr", "RcppRoll"),
character.only = TRUE)
data %<>%
# Create difference for ZigZag
mutate(diffZZ = c(0,diff(zz))) %>%
# Use it as a group
group_by(diffZZ) %>%
# Use RcppRoll to compute that sum
mutate(sumVolByDiff = roll_sum(x = volume, n = 2, fill = NA)) %>%
# Clean / not important
ungroup()
Preview
> head(data)
Source: local data frame [6 x 5]
P zz volume diffZZ sumVolByDiff
(dbl) (dbl) (dbl) (dbl) (dbl)
1 100.2588 100.2588 351 0.000000 NA
2 102.0947 100.5596 148 0.300785 523
3 101.7480 100.8604 533 0.300785 1077
4 102.6608 101.1612 375 0.300785 609
5 103.1618 101.4620 234 0.300785 692
6 101.8668 101.7627 544 0.300785 938
I have a set of 10,000 (x, y) data points. These data points are to be partitioned along the x-axis into non-overlapping bins of 10 data points each. From this, I need a new dataset in which x is the mean of the 10 x values in a bin and y is the maximum of the 10 y values in that bin. The final dataset should contain 1,000 (x, y) pairs.
I have a sample in Excel and want to perform this task in R.
In tidyverse:
library(tidyverse)
df %>%
arrange(x) %>%
group_by(grp = gl(n(), 10, n())) %>%
summarise(x = mean(x), y = max(y))
In Base R
n <- nrow(df)
do.call(rbind.data.frame, by(df[order(df$x),], gl(n, 10, n),
function(x) cbind(x = mean(x$x), y = max(x$y))))
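The binning step can also be written with an explicit bin index, which may be easier to follow than gl(). The sketch below uses a small deterministic data frame (30 made-up points, not the asker's data) and assumes the number of rows is a multiple of 10:

```r
# Hypothetical data: 30 points, deliberately not sorted by x
df <- data.frame(x = 30:1, y = 1:30)

# Sort along the x-axis, then label consecutive runs of 10 rows
df  <- df[order(df$x), ]
bin <- ceiling(seq_len(nrow(df)) / 10)

# One row per bin: mean of x and maximum of y
binned <- data.frame(
  x = as.vector(tapply(df$x, bin, mean)),
  y = as.vector(tapply(df$y, bin, max))
)
binned
```

With 10,000 rows the same code would produce 1,000 rows, one per bin.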
I created some sample data as you did not provide those.
I use the library data.table but you could do similar in dplyr or base.
library(data.table)
dt <- data.table(
x = sample(40:50, 50, replace = TRUE),
y = sample(1000:3000, 50)
)
dt[, grp := gl(.N, 10, .N)] # edit based on Onyambu's solution
dt[, .(x_avg = mean(x), y_max = max(y)), by = grp]
# grp x_avg y_max
# 1: 1 44.7 2765
# 2: 2 45.3 2861
# 3: 3 44.7 2831
# 4: 4 46.2 2947
# 5: 5 46.7 2684
I'm not very familiar with dplyr in R, but I want to filter my dataset on certain conditions.
Let's say my dataset has more than 100 attributes (columns), and I want to filter on multiple conditions.
Can I write the filter using the positions of the columns instead of their names, as follows?
y = filter(retag, c(4:50) != 8 & c(90:110) == 8)
I've tried code similar to this a few times, but I still haven't got the result I want.
I also tried the following, but I'm not sure how to add further conditions to the rowSums call.
retag[rowSums((retag!=8)[,c(4:50)])>=1,]
The only examples I found used column names instead of positions.
Is there any way to filter by column position, given that my data is quite large?
You can use a combination of filter() and across() from dplyr. I didn't have your version of the retag data frame, so I created my own as an example:
set.seed(2000)
retag <- tibble(
col1 = runif(n = 1000, min = 0, max = 10) %>% round(0),
col2 = runif(n = 1000, min = 0, max = 10) %>% round(0),
col3 = runif(n = 1000, min = 0, max = 10) %>% round(0),
col4 = runif(n = 1000, min = 0, max = 10) %>% round(0),
col5 = runif(n = 1000, min = 0, max = 10) %>% round(0)
)
# filter where the first, second, and third columns all equal 5 and the fourth column does not equal 5
# (newer dplyr releases deprecate across() inside filter() in favour of if_all() / if_any())
retag %>%
filter(
across(1:3, function(x) x == 5),
across(4, function(x) x != 5)
)
if_all() and if_any() were recently introduced into the tidyverse for the purpose of filtering across multiple variables.
library(dplyr)
filter(retag, if_all(X:Y, ~ .x > 10 & .x < 35))
# # A tibble: 5 x 2
# X Y
# <int> <int>
# 1 11 30
# 2 12 31
# 3 13 32
# 4 14 33
# 5 15 34
filter(retag, if_any(X:Y, ~ .x == 2 | .x == 25))
# # A tibble: 2 x 2
# X Y
# <int> <int>
# 1 2 21
# 2 6 25
Data
retag <- structure(list(X = 1:20, Y = 20:39), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
Here's a base R option.
This will select rows where there is no 8 in columns 4 to 50 and there is at least one 8 in columns 90 to 110.
result <- retag[rowSums(retag[4:50] == 8, na.rm = TRUE) == 0 &
rowSums(retag[90:110] == 8,na.rm = TRUE) > 0, ]
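To see how the two rowSums() conditions combine, here is a scaled-down sketch on a made-up four-column data frame (retag_small and its values are hypothetical). Positions 1:2 stand in for columns 4 to 50 and positions 3:4 for columns 90 to 110:

```r
# Hypothetical miniature of retag
retag_small <- data.frame(
  a = c(1, 8, 2, 3),
  b = c(2, 1, 8, 4),
  c = c(8, 8, 8, 5),
  d = c(0, 0, 8, 6)
)

# TRUE where none of the first two columns equals 8
no_8_first <- rowSums(retag_small[1:2] == 8, na.rm = TRUE) == 0

# TRUE where at least one of the last two columns equals 8
any_8_last <- rowSums(retag_small[3:4] == 8, na.rm = TRUE) > 0

# Keep only rows that satisfy both conditions
result <- retag_small[no_8_first & any_8_last, ]
result
```

Only the first row survives: rows 2 and 3 contain an 8 in the first block, and row 4 has no 8 in the second block.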
I have some patient data, where the individual patients change treatment groups over time. My goal is to visualize the sequence of group changes and aggregate this data into a "sequence profile" for each treatment group.
For each treatment group, I would like to show when it generally occurs in the treatment cycle (say, rather at the beginning or at the end). To account for the differing sequence lengths, I would like to standardize these profiles between 0 (very beginning) and 1 (end).
I would like to find an efficient way to prepare and visualize the data.
Minimal Example
Structure of Data
library(dplyr)
library(purrr)
library(ggplot2)
# minimal data
cj_df_raw <- tibble::tribble(
~id, ~group,
1, "A",
1, "B",
2, "A",
2, "B",
2, "A"
)
# compute "intervals" for each person [start, end]
cj_df_raw %>%
group_by(id) %>%
mutate(pos = row_number(),
len = length(id),
start = (pos - 1) / len,
end = pos / len) %>%
filter(group == "A")
#> # A tibble: 3 x 6
#> # Groups: id [2]
#> id group pos len start end
#> <dbl> <chr> <int> <int> <dbl> <dbl>
#> 1 1 A 1 2 0 0.5
#> 2 2 A 1 3 0 0.333
#> 3 2 A 3 3 0.667 1
(So id 1 was in group A during the first 50% of their sequence, and id 2 was in group A during the first 33% and the last 33% of their sequence. This means that 2 ids were in group A between 0-33% of the sequence, 1 between 33-50%, 0 between 50-66%, and 1 above 66%.)
This is the result I would like to achieve, but I have not found a way to transform my data effectively.
Desired outcome
profile_treatmen_a <- tibble::tribble(
~x, ~y,
0, 0L,
0.33, 2L,
0.5, 1L,
0.66, 0L,
1, 1L,
1, 0L
)
profile_treatmen_a %>%
ggplot(aes(x, y)) +
geom_step(direction = "vh") +
expand_limits(x = c(0, 1), y = 0)
(Ideally the area under the curve would be shaded)
Ideal solution: via ggridges
The goal of the visualization would be to compare the "sequence profile" of many treatment groups at the same time. If I could prepare the data accordingly, I would like to use the ggridges package for a striking visual comparison of the treatment groups.
library(ggridges)
data.frame(group = rep(letters[1:2], each=20),
mean = rep(2, each=20)) %>%
mutate(count = runif(nrow(.))) %>%
ggplot(aes(x=count, y=group, fill=group)) +
geom_ridgeline(stat="binline", binwidth=0.5, scale=0.9)
You could build helper intervals and then just plot a histogram. Since each patient is in either group A or group B at any point, both groups sum up to 100%. With these helper intervals you could also easily switch to other geoms.
library(tidyverse, warn.conflicts = FALSE)
library(ggplot2)
# create sample data
set.seed(42)
id <- 1:10 %>% map(~ rep(x = .x, times = runif(n = 1, min = 1, max = 6))) %>%
unlist()
group <- sample(x = c("A", "B"), size = length(id), replace = TRUE) %>%
as_factor()
df <- tibble(id, group)
glimpse(df)
#> Observations: 37
#> Variables: 2
#> $ id <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5,...
#> $ group <fct> A, B, B, A, A, B, B, A, A, B, B, A, B, B, A, B, A, B, A,...
# tidy data
df <- df %>%
group_by(id) %>%
mutate(from = (row_number() - 1) / n(),
to = row_number() / n()) %>%
ungroup() %>%
rowwise() %>%
mutate(list = seq(from + 1/60, to, 1/60) %>% list()) %>%
unnest(list)
# plot
df %>%
ggplot(aes(x = list, fill = group)) +
geom_histogram(binwidth = 1/60) +
ggthemes::theme_hc()
Created on 2018-09-16 by the [reprex package](http://reprex.tidyverse.org) (v0.2.0).
My attempt at an answer. Although it is probably not the nicest, fastest, or most efficient way, I think it might help you in your efforts.
library(data.table)
# compute "intervals" for each person [start, end]
df <- cj_df_raw %>%
group_by(id) %>%
mutate(pos = row_number(),
len = length(id),
from = (pos - 1) / len,
to = pos / len,
value = 1)
dt <- as.data.table(df)
setkey(dt, from, to)
#create intervals
dt.interval <- data.table(from = seq( from = 0, by = 0.01, length.out = 100),
to = seq( from = 0.01, by = 0.01, length.out = 100))
#perform overlap join on intervals
dt2 <- foverlaps( dt.interval, dt, type = "within", nomatch = NA)[, sum(value), by = c("i.from", "group")]
#some melting and casting to fill in '0' on empty intervals
dt3 <- melt( dcast(dt2, ... ~ group, fill = 0), id.vars = 1 )
#plot
ggplot( dt3 ) +
geom_step( aes( x = i.from, y = value, color = variable ) ) +
facet_grid( .~variable )
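For the minimal example in the question, the step profile for group A can also be computed exactly from the interval breakpoints instead of a fixed 0.01 grid. The sketch below is one possible approach, not the method used in the answers above; the start and end vectors are copied by hand from the question's group A rows, and the names breaks, mids and profile are made up for illustration:

```r
# Intervals [start, end) during which an id was in group A
start <- c(0, 0, 2/3)
end   <- c(1/2, 1/3, 1)

# Every distinct breakpoint splits [0, 1] into segments
# on which the count cannot change
breaks <- sort(unique(c(start, end)))

# Count the covering intervals at the midpoint of each segment
mids  <- head(breaks, -1) + diff(breaks) / 2
count <- sapply(mids, function(m) sum(start <= m & m < end))

profile <- data.frame(from = head(breaks, -1), to = breaks[-1], count = count)
profile
```

The counts come out as 2, 1, 0, 1 for the four segments, matching the description in the question. A data frame in this from/to/count form could then be passed to geom_step() or geom_rect().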
I would like to perform a groupwise summarise() operation in dplyr in which a different calculation applies when a group is an edge case.
I have count data that looks like this. The concentration and the standard deviation are calculated like this:
library(dplyr)
testdata <- tibble(sample = sort(rep(1:3, 4)),
volume = rep(c(1e-1, 1e-1, 1e-2, 1e-2), 3),
count = c(400, 400, 40, 40, 0, 0, 0, 0, 400, 400, 400, 400))
testdata %>%
group_by(sample) %>%
summarise(concentration = sum(count) / sum(volume),
sd = sqrt(sum(count)))
However, only counts with values between 25 and 250 are to be included in the calculation, which I could achieve with:
testdata %>%
group_by(sample) %>%
filter((count >= 25) & (count <= 250)) %>%
summarise(concentration = sum(count) / sum(volume),
sd = sqrt(sum(count)))
But then samples 2 & 3 have no concentration.
The edge cases for each group might be calculated with something like:
if (all(count <= 25)){
summarise(concentration = 25 / min(volume),
sd = NA)
}
else if (all(count >= 250)){
summarise(concentration = 250 / max(volume),
sd = NA)
}
Can such edge cases be integrated into the summarise() function?
I would ideally also like a flag to indicate an edge case which returns result = "OK" for all cases except edge cases that return:
if (all(count <= 25)){
summarise(concentration = 25 / min(volume),
sd = NA,
result = "LOW")
}
else if (all(count >= 250)){
summarise(concentration = 250 / max(volume),
sd = NA,
result = "HIGH")
}
One way is to encode your logic within summarise using ifelse:
library(dplyr)
result <- testdata %>% group_by(sample) %>%
summarise(concentration = ifelse(all(count <= 25),
25 / min(volume),
ifelse(all(count >= 250),
250 / max(volume),
sum(count) / sum(volume))),
sd = ifelse(all(count <= 25),
NA,
ifelse(all(count >= 250),
NA,
sqrt(sum(count)))),
result = ifelse(all(count <= 25),
"LOW",
ifelse(all(count >= 250),
"HIGH",
"OK")))
print(result)
### A tibble: 3 x 4
## sample concentration sd result
## <int> <dbl> <dbl> <chr>
##1 1 4000 29.66479 OK
##2 2 2500 NA LOW
##3 3 2500 NA HIGH
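Because all() collapses each group to a single TRUE or FALSE, the flag logic can be checked in isolation with an ordinary if/else before it is wired into summarise(). The following base R sketch rebuilds the question's test data as a plain data frame; the helper name classify is made up for illustration:

```r
# Same values as testdata in the question
testdata <- data.frame(
  sample = rep(1:3, each = 4),
  volume = rep(c(1e-1, 1e-1, 1e-2, 1e-2), 3),
  count  = c(400, 400, 40, 40, 0, 0, 0, 0, 400, 400, 400, 400)
)

# Hypothetical helper: one flag per group of counts
classify <- function(count) {
  if (all(count <= 25)) {
    "LOW"
  } else if (all(count >= 250)) {
    "HIGH"
  } else {
    "OK"
  }
}

# Apply the helper to the counts of each sample
flags <- tapply(testdata$count, testdata$sample, classify)
flags
```

The flags agree with the result column printed above: sample 1 is "OK", sample 2 is "LOW", and sample 3 is "HIGH".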
Updated approach
Another approach, which is hopefully closer to what the OP asks, is to define a function:
summarise.func <- function(count, volume) {
if (all(count <= 25)) {
concentration <- 25 / min(volume)
sd <- NA
result <- "LOW"
} else if (all(count >= 250)) {
concentration <- 250 / max(volume)
sd <- NA
result <- "HIGH"
} else {
concentration <- sum(count) / sum(volume)
sd <- sqrt(sum(count))
result <- "OK"
}
data.frame(concentration=concentration, sd=sd, result=result, stringsAsFactors=FALSE)
}
that handles both the regular case and the edge cases. The key is that this function returns a data.frame containing the summarized results. Then, summarise will create a list column containing these data frames, which can then be unnested with tidyr::unnest:
library(dplyr)
library(tidyr)
result <- testdata %>% group_by(sample) %>%
summarise(csr=list(summarise.func(count, volume))) %>%
unnest(csr)
print(result)
### A tibble: 3 x 4
## sample concentration sd result
## <int> <dbl> <dbl> <chr>
##1 1 4000 29.66479 OK
##2 2 2500 NA LOW
##3 3 2500 NA HIGH
Suppose I have data that looks like this:
set.seed(031915)
myDF <- data.frame(
Name= rep(c("A", "B"), times = c(10,10)),
Group = rep(c("treatment", "control", "treatment", "control"), times = c(5,5,5,5)),
X = c(rnorm(n=5,mean = .05, sd = .001), rnorm(n=5,mean = .02, sd = .001),
rnorm(n=5,mean = .08, sd = .02), rnorm(n=5,mean = .03, sd = .02))
)
I want to create a t.test table with one row for "A" and one for "B".
I can write my own function that does that:
ttestbyName <- function(Name) {
b <- t.test(myDF$X[myDF$Group == "treatment" & myDF$Name==Name],
myDF$X[myDF$Group == "control" & myDF$Name==Name],
conf.level = 0.90)
dataNameX <- data.frame(Name = Name,
treatment = round(b$estimate[[1]], digits = 4),
control = round(b$estimate[[2]], digits = 4),
CI = paste('(',round(b$conf.int[[1]],
digits = 4),', ',
round(b$conf.int[[2]],
digits = 4), ')',
sep=""),
pvalue = round(b$p.value, digits = 4),
ntreatment = nrow(myDF[myDF$Group == "treatment" & myDF$Name==Name,]),
ncontrol = nrow(myDF[myDF$Group == "control" & myDF$Name==Name,]))
}
library(parallel)
Test_by_Name <- mclapply(unique(myDF$Name), ttestbyName)
Test_by_Name <- do.call("rbind", Test_by_Name)
and the output looks like this:
Name treatment control CI pvalue ntreatment ncontrol
1 A 0.0500 0.0195 (0.0296, 0.0314) 0.0000 5 5
2 B 0.0654 0.0212 (0.0174, 0.071) 0.0161 5 5
I'm wondering if there is a cleaner way of doing this with dplyr. I thought about using group_by, but I'm a little lost.
Thanks!
Not much cleaner, but here's an improvement:
library(dplyr)
ttestbyName <- function(myName) {
bt <- filter(myDF, Group=="treatment", Name==myName)
bc <- filter(myDF, Group=="control", Name==myName)
b <- t.test(bt$X, bc$X, conf.level=0.90)
dataNameX <- data.frame(Name = myName,
treatment = round(b$estimate[[1]], digits = 4),
control = round(b$estimate[[2]], digits = 4),
CI = paste('(',round(b$conf.int[[1]],
digits = 4),', ',
round(b$conf.int[[2]],
digits = 4), ')',
sep=""),
pvalue = round(b$p.value, digits = 4),
ntreatment = nrow(bt), # changes only in
ncontrol = nrow(bc)) # these 2 nrow() args
}
You should really replace the do.call function with rbindlist from data.table:
library(data.table)
Test_by_Name <- lapply(unique(myDF$Name), ttestbyName)
Test_by_Name <- rbindlist(Test_by_Name)
or, even better, use the %>% pipes:
Test_by_Name <- myDF$Name %>%
unique %>%
lapply(., ttestbyName) %>%
rbindlist
> Test_by_Name
Name treatment control CI pvalue ntreatment ncontrol
1: A 0.0500 0.0195 (0.0296, 0.0314) 0.0000 5 5
2: B 0.0654 0.0212 (0.0174, 0.071) 0.0161 5 5
An old question, but the broom package has since been made available for this exact purpose (as well as other statistical tests):
library(broom)
library(dplyr)
myDF %>% group_by(Name) %>%
do(tidy(t.test(X~Group, data = .)))
Source: local data frame [2 x 9]
Groups: Name [2]
Name estimate estimate1 estimate2 statistic p.value
(fctr) (dbl) (dbl) (dbl) (dbl) (dbl)
1 A -0.03050475 0.01950384 0.05000860 -63.838440 1.195226e-09
2 B -0.04423181 0.02117864 0.06541046 -3.104927 1.613625e-02
Variables not shown: parameter (dbl), conf.low (dbl), conf.high (dbl)
library(tidyr)
library(dplyr)
myDF %>% group_by(Group) %>% mutate(rowname=1:n())%>%
spread(Group, X) %>%
group_by(Name) %>%
do(b = t.test(.$control, .$treatment)) %>%
mutate(
treatment = round(b[['estimate']][[2]], digits = 4),
control = round(b[['estimate']][[1]], digits = 4),
CI = paste0("(", paste(b[['conf.int']], collapse=", "), ")"),
pvalue = b[['p.value']]
)
# Name treatment control CI pvalue
#1 A 0.0500 0.0195 (-0.031677109707283, -0.0293323994902097) 1.195226e-09
#2 B 0.0654 0.0212 (-0.0775829100729602, -0.010880719830447) 1.613625e-02
You can add ncontrol, ntreatment manually.
You can do it with a custom t.test function and do:
my.t.test <- function(data, formula, ...)
{
tt <- t.test(formula=formula, data=data, ...)
ests <- tt$estimate
names(ests) <- sub("mean in group ()", "\\1",names(ests))
counts <- xtabs(formula[c(1,3)],data)
names(counts) <- paste0("n",names(counts))
cbind(
as.list(ests),
data.frame(
CI = paste0("(", paste(tt$conf.int, collapse=", "), ")"),
pvalue = tt$p.value,
stringsAsFactors=FALSE
),
as.list(counts)
)
}
myDF %>% group_by(Name) %>% do(my.t.test(.,X~Group))
Source: local data frame [2 x 7]
Groups: Name
Name control treatment CI pvalue ncontrol ntreatment
1 A 0.01950384 0.05000860 (-0.031677109707283, -0.0293323994902097) 1.195226e-09 5 5
2 B 0.02117864 0.06541046 (-0.0775829100729602, -0.010880719830447) 1.613625e-02 5 5
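For comparison, the same per-Name table can be sketched in base R with split() and lapply(), without any packages. This is only an illustration under the question's setup: the data are regenerated with the question's seed, the output column names (conf_low, conf_high and so on) are made up, and the 90% confidence level follows the asker's original function:

```r
set.seed(31915)
myDF <- data.frame(
  Name  = rep(c("A", "B"), each = 10),
  Group = rep(rep(c("treatment", "control"), each = 5), times = 2),
  X     = c(rnorm(5, mean = .05, sd = .001), rnorm(5, mean = .02, sd = .001),
            rnorm(5, mean = .08, sd = .02),  rnorm(5, mean = .03, sd = .02))
)

# One t-test per Name; with the formula interface the groups are ordered
# alphabetically, so estimate[1] is control and estimate[2] is treatment,
# and the confidence interval is for control minus treatment
rows <- lapply(split(myDF, myDF$Name), function(d) {
  tt <- t.test(X ~ Group, data = d, conf.level = 0.90)
  data.frame(
    Name       = d$Name[1],
    control    = unname(tt$estimate[1]),
    treatment  = unname(tt$estimate[2]),
    conf_low   = tt$conf.int[1],
    conf_high  = tt$conf.int[2],
    pvalue     = tt$p.value,
    ncontrol   = sum(d$Group == "control"),
    ntreatment = sum(d$Group == "treatment")
  )
})

Test_by_Name <- do.call(rbind, rows)
Test_by_Name
```

Keeping the interval bounds as two numeric columns, rather than pasting them into a string, leaves them available for further calculation or plotting.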