Grouped operation on all groups relative to "baseline" group, with multiple observations - r

Starting with data containing multiple observations for each group, like this:
set.seed(1)
my.df <- data.frame(
timepoint = rep(c(0, 1, 2), each= 3),
counts = round(rnorm(9, 50, 10), 0)
)
> my.df
timepoint counts
1 0 44
2 0 52
3 0 42
4 1 66
5 1 53
6 1 42
7 2 55
8 2 57
9 2 56
To perform a summary calculation at each timepoint relative to timepoint == 0, I need to pass, for each group, a vector of counts for timepoint == 0 and a vector of counts for that group (e.g. timepoint == 1) to an arbitrary function, e.g.
NonsenseFunction <- function(x, y){
(mean(x) - mean(y)) / (1 - mean(y))
}
I can get the required output from this table, either with dplyr:
library(dplyr)
my.df %>%
group_by(timepoint) %>%
mutate(rep = paste0("r", 1:n())) %>%
left_join(x = ., y = filter(., timepoint == 0), by = "rep") %>%
group_by(timepoint.x) %>%
summarise(result = NonsenseFunction(counts.x, counts.y))
or data.table:
library(data.table)
my.dt <- data.table(my.df)
my.dt[, rep := paste0("r", 1:length(counts)), by = timepoint]
merge(my.dt, my.dt[timepoint == 0], by = "rep", all = TRUE)[
, NonsenseFunction(counts.x, counts.y), by = timepoint.x]
This only works if every group has the same number of observations. Besides, the observations aren't matched, so using the temporary rep variable seems hacky.
For a more general case, where I need to pass vectors of the baseline values and the group's values to an arbitrary (more complicated) function, is there an idiomatic data.table or dplyr way of doing so with a grouped operation for all groups?

Here's the straightforward data.table approach (f stands for your function, e.g. NonsenseFunction):
my.dt[, f(counts, my.dt[timepoint==0, counts]), by=timepoint]
This probably grabs my.dt[timepoint==0, counts] again and again, for each group. You could instead save that value ahead of time:
v = my.dt[timepoint==0, counts]
my.dt[, f(counts, v), by=timepoint]
... or if you don't want to add v to the environment, maybe
with(list(v = my.dt[timepoint==0, counts]),
my.dt[, f(counts, v), by=timepoint]
)
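A minimal sketch of the "save the baseline first" variant, written out so it runs end to end. The data below are hypothetical (not the question's my.df) and deliberately have unequal group sizes, to illustrate that no matching of observations is needed; baseline and res are illustrative names.

```r
library(data.table)

# Hypothetical data with unequal group sizes (assumption, not from the question)
dt <- data.table(
  timepoint = c(0, 0, 0, 1, 1, 2, 2, 2, 2),
  counts    = c(44, 52, 42, 66, 53, 42, 55, 57, 56)
)

# Function from the question
NonsenseFunction <- function(x, y) {
  (mean(x) - mean(y)) / (1 - mean(y))
}

# Extract the baseline vector once, then reuse it inside every group
baseline <- dt[timepoint == 0, counts]
res <- dt[, .(result = NonsenseFunction(counts, baseline)), by = timepoint]
res
```

Because baseline is an ordinary vector in the calling environment, the grouped call sees the same values for every group, whatever the group's own size.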

You could pass the vector from your baseline group as a constant second argument:
my.df %>%
group_by(timepoint) %>%
mutate(response = NonsenseFunction(counts, my.df$counts[my.df$timepoint == 0]))
Or if you want to make it beforehand:
constant <- my.df$counts[my.df$timepoint == 0]
my.df %>%
group_by(timepoint) %>%
mutate(response = NonsenseFunction(counts, constant))
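If one row per group is wanted rather than a value repeated on every row, summarise() can replace mutate(). The sketch below assumes hypothetical data with unequal group sizes and a made-up function, FractionAboveBaselineMax, that genuinely needs the whole baseline vector rather than just its mean.

```r
library(dplyr)

# Hypothetical data with unequal group sizes (assumption, not from the question)
my.df2 <- data.frame(
  timepoint = c(0, 0, 0, 1, 1, 2, 2, 2, 2),
  counts    = c(44, 52, 42, 66, 53, 42, 55, 57, 56)
)

# Hypothetical function: share of a group's values above the baseline maximum
FractionAboveBaselineMax <- function(x, y) mean(x > max(y))

# Baseline vector computed once, outside the grouped operation
baseline <- my.df2$counts[my.df2$timepoint == 0]

res <- my.df2 %>%
  group_by(timepoint) %>%
  summarise(result = FractionAboveBaselineMax(counts, baseline))
res
```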

You can try,
library(dplyr)
my.df %>%
mutate(new = mean(counts[timepoint == 0])) %>%
group_by(timepoint) %>%
summarise(result = NonsenseFunction(counts, new))
# A tibble: 3 × 2
# timepoint result
# <dbl> <dbl>
#1 0 0.0000000
#2 1 0.1398601
#3 2 0.2097902

Related

How to summarise grouped value increases

I have this type of data:
df <- data.frame(
Utt = c(rep("oh", 10), rep("ah", 10)),
name = rep(LETTERS[1:2], 10),
value = c(0.5,2,2,2,2,1,0,1,3.5,1,
2.2,2.3,1.9,0.1,0.3,1.8,3,4,3.5,2)
)
I need to know whether, within each group of Utt and name, there are continuous value increases and how large these increases are.
EDIT: I've cobbled together this code, which produces the right result but seems convoluted:
library(dplyr)
library(data.table) # for rleid()
library(stringr) # for str_c()
df %>%
# order by name:
arrange(name) %>%
group_by(name, Utt) %>%
# mutate:
mutate(
# is there an increase from one value to the next?
is_increase = ifelse(lag(value) < value, value, NA),
# what's the difference between these values?
diff = is_increase - lag(value)) %>%
group_by(name, Utt, grp = rleid(!is.na(diff))) %>%
# sum the contiguous values:
summarise(increase_size = sum(diff, na.rm = TRUE)) %>%
# remove 0 values:
filter(!increase_size == 0) %>%
# put same-group increase_sizes in the same row:
summarise(
increase_size = str_c(increase_size, collapse = ', '))
# A tibble: 3 x 3
# Groups: name [2]
name Utt increase_size
<chr> <chr> <chr>
1 A ah 3.2
2 A oh 1.5, 3.5
3 B ah 3.9
NOTE: Ideally, the expected outcome would be:
1 A ah 3.2
2 A oh 1.5, 3.5
3 B ah 3.9
4 B oh NA
Is there a better (i.e., more concise, more clever) dplyr solution?
Use this function to find what you want.
f <- function(x) {
ind <- which(x > lag(x))
if (length(ind) == 0) {
return(NA)
}
ind2 <- ind[which(lead(ind, default = max(ind) + 2) - ind > 1)]
ind1 <- ind[which(ind - lag(ind, default = min(ind) - 2) > 1)] - 1
return(paste0(x[ind2] - x[ind1], collapse = ", "))
}
And use the function in summarise:
df %>% group_by(name, Utt) %>% summarise(increase = f(value))
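For comparison, the same run logic can be sketched in base R with diff() and rle(): consecutive positive differences are summed into one value per increasing run. This is an illustrative alternative, not the answer's method; increase_runs is a hypothetical helper name, and the sketch assumes the values contain no NA.

```r
df <- data.frame(
  Utt = c(rep("oh", 10), rep("ah", 10)),
  name = rep(LETTERS[1:2], 10),
  value = c(0.5, 2, 2, 2, 2, 1, 0, 1, 3.5, 1,
            2.2, 2.3, 1.9, 0.1, 0.3, 1.8, 3, 4, 3.5, 2)
)

# Hypothetical helper: size of each run of consecutive increases in x
increase_runs <- function(x) {
  d <- diff(x)
  if (length(d) == 0) return(numeric(0))
  runs <- rle(d > 0)                                # runs of rising / non-rising steps
  id <- rep(seq_along(runs$lengths), runs$lengths)  # run id for every step
  sums <- as.numeric(tapply(d, id, sum))            # total change within each run
  sums[runs$values]                                 # keep only the rising runs
}

# One string per name/Utt combination; an empty string means "no increase"
with(df, tapply(value, list(name, Utt),
                function(v) toString(round(increase_runs(v), 2))))
```

Groups without any increase yield an empty result, which can be converted to NA afterwards if needed.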
Using tidyverse, my solution was similar to yours. One possible modification might be to subset your columns before summing instead of filtering. This will keep all combinations of name and Utt and allow for NA for increase_size in the end. Since the column increase_size is character type, you can convert an empty string to NA.
library(data.table)
library(tidyverse)
df %>%
arrange(name) %>%
group_by(name, Utt) %>%
mutate(diff = c(0, diff(value))) %>%
group_by(grp = rleid(diff < 0), .add = TRUE) %>%
summarise(increase_size = sum(diff[diff > 0], na.rm = TRUE)) %>%
group_by(name, Utt) %>%
summarise(increase_size = toString(increase_size[increase_size > 0])) %>%
mutate(increase_size = na_if(increase_size, ""))
Output
name Utt increase_size
<chr> <chr> <chr>
1 A ah 3.2
2 A oh 1.5, 3.5
3 B ah 3.9
4 B oh NA

Creating data partitions over a selected range of data to be fed into caret::train function for cross-validation

I want to create jack-knife data partitions for the data frame below, with the partitions to be used in caret::train (like the caret::groupKFold() produces). However, the catch is that I want to restrict the test points to say greater than 16 days, whilst using the remainder of these data as the training set.
df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
Time = seq(1:20))
The reason I want to do this is that I am only really interested in how well the model is predicting the upper bound, as this is the region of interest. I feel like there is a way to do this with the caret::groupKFold() function but I am not sure how. Any help would be greatly appreciated.
An example of what each CV fold would comprise:
TrainSet1 <- subset(df, Time != 16)
TestSet1 <- subset(df, Time == 16)
TrainSet2 <- subset(df, Time != 17)
TestSet2 <- subset(df, Time == 17)
TrainSet3 <- subset(df, Time != 18)
TestSet3 <- subset(df, Time == 18)
TrainSet4 <- subset(df, Time != 19)
TestSet4 <- subset(df, Time == 19)
TrainSet5 <- subset(df, Time != 20)
TestSet5 <- subset(df, Time == 20)
Albeit in the format that the caret::groupKFold function outputs, so that the folds could be fed into the caret::train function:
CVFolds <- caret::groupKFold(df$Time)
CVFolds
Thanks in advance!
For customized folds, I find that built-in functions are usually not flexible enough, so I usually produce the folds with tidyverse. One approach to your problem would be:
library(tidyverse)
df %>%
mutate(id = row_number()) %>% #use the row number as a column called id
filter(Time > 15) %>% #filter Time as per your need
split(.$Time) %>% #split df to a list by Time
map(~ .x %>% select(id)) #select row numbers for each list element
Example with two rows per time point:
df <- data.frame(Effect = seq(from = 0.025, to = 1, by = 0.025),
Time = rep(1:20, each = 2))
df %>%
mutate(id = row_number()) %>%
filter(Time > 15) %>%
split(.$Time) %>%
map(~ .x %>% select(id)) -> test_folds
test_folds
#output
$`16`
id
1 31
2 32
$`17`
id
3 33
4 34
$`18`
id
5 35
6 36
$`19`
id
7 37
8 38
$`20`
id
9 39
10 40
Example with an unequal number of rows per time point:
df <- data.frame(Effect = seq(from = 0.55, to = 1, by = 0.05),
Time = c(rep(1, 5), rep(2, 3), rep(3, 2)))
df %>%
mutate(id = row_number()) %>%
filter(Time > 1) %>%
split(.$Time) %>%
map(~ .x %>% select(id))
$`2`
id
1 6
2 7
3 8
$`3`
id
4 9
5 10
Now you can define these hold-out folds inside trainControl with the argument indexOut.
EDIT: to get output in a format similar to that of caret::groupKFold, one can:
df %>%
mutate(id = row_number()) %>%
filter(Time > 1) %>%
split(.$Time) %>%
map(~ .x %>%
select(id) %>%
unlist %>%
unname) %>%
unname
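A base R sketch of how the training and hold-out indices might be paired up for trainControl, which accepts index (rows used for fitting) and indexOut (rows held out) as lists. The names held_out_times, test_idx, train_idx and ctrl are illustrative, and the sketch assumes that each late time point forms its own fold, as in the question's example.

```r
df <- data.frame(Effect = seq(from = 0.05, to = 1, by = 0.05),
                 Time = 1:20)

# Time points that should each form one hold-out fold
held_out_times <- sort(unique(df$Time[df$Time > 15]))

# Row numbers held out in each fold
test_idx <- lapply(held_out_times, function(t) which(df$Time == t))

# Row numbers used for fitting in each fold: all remaining rows
train_idx <- lapply(test_idx, function(i) setdiff(seq_len(nrow(df)), i))

names(test_idx) <- names(train_idx) <- paste0("Fold", seq_along(test_idx))

# Only build the control object if caret is installed
if (requireNamespace("caret", quietly = TRUE)) {
  ctrl <- caret::trainControl(index = train_idx, indexOut = test_idx)
}
```

The resulting ctrl object would then be passed to caret::train through its trControl argument.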

issues calculating rowwise maximum

Suppose I have the tibble dat below. For each row, I would like to calculate the maximum of (x 2, x 3) and then subtract x 1, where x can be either a or b. My real data have more than 3 columns, so something like 2:n (e.g., 2:3) would be great. I have tried many things, but none works the way I want; I am still struggling with the string vs. column name distinction.
dat <- tibble(`a 1` = c(0, 0, 0), `a 2` = 1:3, `a 3` = 3:1,
`b 1` = rep(1, 3), `b 2` = 4:6, `b 3` = 6:4)
foo <- function(x = 'a')
{
???
}
end result:
if x == `a`
c(3, 2, 3)
if x == `b`
c(5, 4, 5)
Solution 1
This solution uses only base R. The idea is to define a function (max_minus_first) to calculate the answer. The max_minus_first function has two arguments. The first argument, dat, is a data frame for analysis with the same format as the OP provided. group is the name of the group for analysis. The end product is a vector with the answer.
max_minus_first <- function(dat, group){
# Get all column names with starting string "group"
col_names <- colnames(dat)
dat2 <- dat[, col_names[grepl(paste0("^", group), col_names)]]
# Get the maximum values from all columns except the first column
max_value <- apply(dat2[, -1], 1, max, na.rm = TRUE)
# Calculate max_value minus the values from the first column
final_value <- max_value - unlist(dat2[, 1], use.names = FALSE)
return(final_value)
}
max_minus_first(dat, "a")
# [1] 3 2 3
max_minus_first(dat, "b")
# [1] 5 4 5
Solution 2
A solution using the tidyverse. The end product (dat2) is a tibble with the output from each group (a, b, ...)
library(tidyverse)
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid, -ends_with(" 1")) %>%
separate(Column, into = c("Group", "Column_Number")) %>%
gather(Column_1, Value_1, ends_with(" 1")) %>%
separate(Column_1, into = c("Group_1", "Column_Number_1")) %>%
filter(Group == Group_1) %>%
group_by(rowid, Group, Value_1) %>%
summarise(Value = max(Value, na.rm = TRUE)) %>%
mutate(Final = Value - Value_1) %>%
ungroup() %>%
select(-starts_with("Value")) %>%
spread(Group, Final)
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5
Explanation
rowid_to_column() is from the tibble package, a way to create a new column based on row ID.
gather is from the tidyr package and converts the data frame from wide format to long format. I used gather twice because the first column of each group is treated differently from the other columns in the same group. ends_with(" 1") is a select helper from dplyr that selects columns whose names end in " 1". Notice that the space in " 1" is important, because "1" alone would also select columns such as a 11 if they existed.
separate is from the tidyr package to separate a column into two columns. I used it to separate the Group name and column numbers in each Group.
filter(Group == Group_1) is to filter rows with Group == Group_1.
group_by(rowid, Group, Value_1) and then summarise(Value = max(Value, na.rm = TRUE)) make sure the maximum from each Group is calculated.
mutate(Final = Value - Value_1) is to calculate the difference between maximum from each Group and the value from the first column. The results are stored in the Final column.
select(-starts_with("Value")) removes any columns with a name beginning with "Value".
spread from the tidyr package converts the data frame from long format to wide format.
Solution 3
Another tidyverse solution, which is similar to Solution 2. It uses do to apply the operation to each Group, which makes the code more concise.
dat2 <- dat %>%
rowid_to_column() %>%
gather(Column, Value, -rowid) %>%
separate(Column, into = c("Group", "Column_Number")) %>%
group_by(rowid, Group) %>%
do(tibble(Max = max(.$Value[.$Column_Number != "1"]),
First = .$Value[.$Column_Number == "1"])) %>%
mutate(Final = Max - First) %>%
select(-Max, -First) %>%
spread(Group, Final) %>%
ungroup()
dat2
# # A tibble: 3 x 3
# rowid a b
# * <int> <dbl> <dbl>
# 1 1 3 5
# 2 2 2 4
# 3 3 3 5
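As a vectorised alternative to the apply() call in Solution 1, the row-wise maximum can be sketched with pmax() in base R. max_minus_first2 is a hypothetical name; the sketch assumes that the first matching column of each group is the "x 1" column and uses a plain data frame with check.names = FALSE to keep the spaces in the column names.

```r
dat <- data.frame(`a 1` = c(0, 0, 0), `a 2` = 1:3, `a 3` = 3:1,
                  `b 1` = rep(1, 3), `b 2` = 4:6, `b 3` = 6:4,
                  check.names = FALSE)

# Hypothetical helper: row-wise max of columns 2..n of a group, minus column 1
max_minus_first2 <- function(dat, group) {
  # Columns whose names start with the group letter followed by a space
  cols <- grep(paste0("^", group, " "), names(dat), value = TRUE)
  # pmax() takes the element-wise maximum across all remaining columns
  row_max <- do.call(pmax, c(unname(as.list(dat[cols[-1]])), na.rm = TRUE))
  row_max - dat[[cols[1]]]
}

max_minus_first2(dat, "a")
max_minus_first2(dat, "b")
```

Because pmax() works on whole columns at once, this avoids the row-by-row loop that apply() performs.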

counts of grouped variables using dplyr

I would like to create a dataframe with confidence intervals for proportions as a final result. I have introduced a variable (tp in my example) as a cut off value to calculate the proportions for. I would like to use the dplyr package to produce the final dataframe.
Below is a simplified example:
library(dplyr)
my_names <- c("A","B")
dt <- data.frame(
Z = sample(my_names,100,replace = TRUE),
X = sample(1:10, 100, replace = TRUE),
Y = sample(c(0,1), 100, replace = TRUE)
)
my.df <- dt%>%
mutate(tp = (X >8)* 1) %>% #multiply by one to convert into numeric
group_by(Z, tp) %>%
summarise(n = n()) %>%
mutate(prop.tp= n/sum(n)) %>%
mutate(SE.tp = sqrt((prop.tp*(1-prop.tp))/n))%>%
mutate(Lower_limit = prop.tp-1.96 * SE.tp)%>%
mutate(Upper_limit = prop.tp+1.96 * SE.tp)
output:
Source: local data frame [4 x 7]
Groups: Z
Z tp n prop.tp SE.tp Lower_limit Upper_limit
1 A 0 33 0.6346154 0.08382498 0.4703184 0.7989123
2 A 1 19 0.3653846 0.11047236 0.1488588 0.5819104
3 B 0 27 0.5625000 0.09547033 0.3753782 0.7496218
4 B 1 21 0.4375000 0.10825318 0.2253238 0.6496762
However, I would like to calculate the standard error and the CIs using the total sample for each group in column Z, not the sample split by the categorical variable tp.
So the total sample for A in my example should be n = 33 + 19.
Any ideas?
Not quite sure I get which group you want to compare with which here, but at any rate you have two grouping variables tp = X > 8 and Z.
If you want to compare the rows with X > 8 and Z == "A" to all rows with X > 8 you can do it like this
merge(
dt %>%
group_by(X > 8) %>%
summarize(n.X = n()),
dt %>%
group_by(X > 8, Z) %>%
summarise(n.XZ = n()),
by = "X > 8"
) %>%
mutate(prop.XZ = n.XZ/n.X) %>%
mutate(SE = sqrt((prop.XZ*(1-prop.XZ))/n.X))%>%
mutate(Lower_limit = prop.XZ-1.96 * SE) %>%
mutate(Upper_limit = prop.XZ+1.96 * SE)
X > 8 n.X Z n.XZ prop.XZ SE Lower_limit Upper_limit
1 FALSE 70 A 37 0.5285714 0.05966378 0.4116304 0.6455124
2 FALSE 70 B 33 0.4714286 0.05966378 0.3544876 0.5883696
3 TRUE 30 A 16 0.5333333 0.09108401 0.3548087 0.7118580
4 TRUE 30 B 14 0.4666667 0.09108401 0.2881420 0.6451913
If you want to turn the problem around and compare X > 8 and Z == "A" to all rows with Z == "A" you can do it like this
merge(
dt %>%
group_by(Z) %>%
summarize(n.Z = n()),
dt %>%
group_by(X > 8, Z) %>%
summarise(n.XZ = n()),
by = "Z"
) %>%
mutate(prop.XZ = n.XZ/n.Z) %>%
mutate(SE = sqrt((prop.XZ*(1-prop.XZ))/n.Z))%>%
mutate(Lower_limit = prop.XZ-1.96 * SE) %>%
mutate(Upper_limit = prop.XZ+1.96 * SE)
Z n.Z X > 8 n.XZ prop.XZ SE Lower_limit Upper_limit
1 A 53 FALSE 37 0.6981132 0.06305900 0.5745176 0.8217088
2 A 53 TRUE 16 0.3018868 0.06305900 0.1782912 0.4254824
3 B 47 FALSE 33 0.7021277 0.06670743 0.5713811 0.8328742
4 B 47 TRUE 14 0.2978723 0.06670743 0.1671258 0.4286189
It is a bit messy to have to merge two separate groupings, but I don't know whether it is possible to ungroup and re-group in the same statement. I am surprised at how difficult it seems to be to use groupings on two different levels (if you can call it that), and I hope someone else can come up with a better solution.
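One way the two grouping levels might be combined in a single pipeline is to count at the fine level first and then compute the total within Z using a grouped mutate(). The sketch below uses hypothetical, deterministic data whose counts mirror the question's printed output (A: 33 and 19, B: 27 and 21); N and res are illustrative names.

```r
library(dplyr)

# Hypothetical data reproducing the question's counts (assumption)
dt <- data.frame(
  Z = rep(c("A", "B"), times = c(52, 48)),
  X = c(rep(c(5, 9), times = c(33, 19)),
        rep(c(5, 9), times = c(27, 21)))
)

res <- dt %>%
  mutate(tp = as.integer(X > 8)) %>%
  count(Z, tp) %>%                 # n per Z/tp combination
  group_by(Z) %>%
  mutate(N = sum(n),               # total sample per Z
         prop.tp = n / N,
         SE.tp = sqrt(prop.tp * (1 - prop.tp) / N),
         Lower_limit = prop.tp - 1.96 * SE.tp,
         Upper_limit = prop.tp + 1.96 * SE.tp) %>%
  ungroup()
res
```

Here the standard error is based on N, the whole sample for each level of Z, rather than on the per-cell count n.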

Avoiding missing row after summarise

I'm using RStudio Version 0.98.1028 on Windows. When summarising a multi-level data frame with sum() in the dplyr package, I lost a row that should have had sum = 0. In other words, if my original data frame was something like
group <- as.factor(rep(c('X', 'Y'), each = 1, times = 6))
type <- as.factor(rep(c('a', 'b'), each = 2, times = 3))
day <- as.factor(rep(1:3, each = 4))
df = data.frame(type = type, day = day, value = abs(rnorm(12)))
df = df[day != 1 | type != 'a',]
and I summarise it
df1 = df %>%
group_by(day, type) %>%
summarise(sum = sum(value))
then I get one missing row, which is the interaction between day = 1 and type = a, which I would like to have (even if it's 0...)
Thanks in advance!
EB
You could try left_join
library(dplyr)
left_join(expand.grid(type=unique(df$type), day=unique(df$day)), df1) %>%
group_by(day, type) %>%
summarise(sum=sum(value, na.rm=TRUE))
# day type sum
#1 1 a 0.0000000
#2 1 b 0.5132914
#3 2 a 1.2482210
#4 2 b 0.9232343
#5 3 a 2.0381779
#6 3 b 0.7558351
where df1 is
df1 <- df[day != 1 | type != 'a',]
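With a more recent dplyr (0.8.0 or later), another option is to keep empty factor combinations directly in group_by() via .drop = FALSE. The sketch below assumes day and type are factors, as in the question; the seed is arbitrary.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only to make the sketch reproducible
type <- factor(rep(c("a", "b"), each = 2, times = 3))
day  <- factor(rep(1:3, each = 4))
df <- data.frame(type = type, day = day, value = abs(rnorm(12)))

# Drop the day 1 / type a rows, as in the question
df <- df[df$day != "1" | df$type != "a", ]

# .drop = FALSE keeps factor-level combinations that have no rows;
# sum() over an empty group returns 0
df1 <- df %>%
  group_by(day, type, .drop = FALSE) %>%
  summarise(sum = sum(value), .groups = "drop")
df1
```

This relies on the unused factor levels still being present after subsetting, which is the default behaviour for factors in a data frame.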
