Split up grouped binomial data in R

I have data that looks like this
samplesize <- 6
group <- c(1,2,3)
total <- rep(samplesize,length(group))
outcomeTrue <- c(2,1,3)
df <- data.frame(group,total,outcomeTrue)
and would like my data to look like this
group2 <- c(rep(1,6),rep(2,6),rep(3,6))
outcomeTrue2 <- c(rep(1,2),rep(0,6-2),rep(1,1),rep(0,6-1),rep(1,3),rep(0,6-3))
df2 <- data.frame(group2,outcomeTrue2)
That is to say, I have binary data where I am told the total number of observations and the number of successes, but would prefer it organised as individual observations with an explicit 0 or 1 outcome, as in the desired result above.
Is there an easy way to do this in R, or will I need to write a loop to automate this myself?

Here is one option with tidyverse. We use uncount() to expand the rows according to the 'total' column, then, grouped by 'group', create a binary outcome from a logical condition based on row_number() and the value of 'outcomeTrue'.
library(tidyverse)
df %>%
  uncount(total) %>%
  group_by(group) %>%
  mutate(outcomeTrue = as.integer(row_number() <= outcomeTrue[1]))
# A tibble: 18 x 2
# Groups: group [3]
# group outcomeTrue
# <dbl> <int>
# 1 1 1
# 2 1 1
# 3 1 0
# 4 1 0
# 5 1 0
# 6 1 0
# 7 2 1
# 8 2 0
# 9 2 0
#10 2 0
#11 2 0
#12 2 0
#13 3 1
#14 3 1
#15 3 1
#16 3 0
#17 3 0
#18 3 0
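If you are on dplyr >= 1.1.0, a one-step alternative is reframe(), which (unlike mutate()) may return any number of rows per group. This is a sketch under that version assumption, not part of the answer above:
library(dplyr)
# build each group's run of 1s and 0s directly; .by supplies the per-group split
df %>%
  reframe(outcomeTrue = rep(c(1L, 0L), c(outcomeTrue, total - outcomeTrue)), .by = group)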

You are almost there: just use the group2 variable with the "[" function in the row position:
df[ group2 , ]
group total outcomeTrue
1 1 6 2
1.1 1 6 2
1.2 1 6 2
1.3 1 6 2
1.4 1 6 2
1.5 1 6 2
2 2 6 1
2.1 2 6 1
2.2 2 6 1
2.3 2 6 1
2.4 2 6 1
2.5 2 6 1
3 3 6 3
3.1 3 6 3
3.2 3 6 3
3.3 3 6 3
3.4 3 6 3
3.5 3 6 3
When a number (or a character value that matches a row name) is put in the row position of "[", it replicates the entire row.
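Note that this only replicates the rows; if you also want the explicit 0/1 outcome, a follow-up sketch (my own addition, using ave() for within-group row numbers) could be:
# keep the relevant columns of the replicated frame
df2 <- df[group2, c("group", "outcomeTrue")]
# within-group row number: 1..6 inside each group
within_row <- ave(seq_along(df2$group), df2$group, FUN = seq_along)
# the first 'outcomeTrue' rows of each group become 1, the rest 0
df2$outcomeTrue <- as.integer(within_row <= df2$outcomeTrue)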

Here is a base R solution.
do.call(rbind, lapply(split(df, df$group), function(x) {
  data.frame(group2 = x$group,
             outcome2 = rep(c(1, 0), times = c(x$outcomeTrue, x$total - x$outcomeTrue)))
}))
# group2 outcome2
# 1.1 1 1
# 1.2 1 1
# 1.3 1 0
# 1.4 1 0
# 1.5 1 0
# 1.6 1 0
# 2.1 2 1
# 2.2 2 0
# 2.3 2 0
# 2.4 2 0
# 2.5 2 0
# 2.6 2 0
# 3.1 3 1
# 3.2 3 1
# 3.3 3 1
# 3.4 3 0
# 3.5 3 0
# 3.6 3 0
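For this particular shape of data, a shorter base R sketch (same idea, assuming only the columns shown in the question) avoids split() entirely:
# repeat each group id 'total' times, and build each group's 1/0 run with rep()
df2 <- data.frame(
  group2 = rep(df$group, df$total),
  outcomeTrue2 = unlist(Map(function(k, n) rep(c(1, 0), c(k, n - k)),
                            df$outcomeTrue, df$total))
)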

Related

Make duplicate rows as replicates in R dataframe

I have a data frame with duplicated rows, having one continuous-variable column and two factor columns (0/1). The goal is to find the duplicated rows and identify them as replicates in a new column.
Here is the structure of the data frame
cont.var fact1 fact2
1 1.0 1 0
2 1.0 0 1
3 1.5 1 0
4 1.5 1 0
5 1.5 0 1
6 1.5 0 1
Now let's say:
If cont.var has the value 1.0 in two rows but the rows have different values for fact1 and fact2, they are assigned two different replicate identifiers.
If cont.var has the value 1.5 and fact1/fact2 are also the same for successive rows, those rows are given the same replicate identifier.
Expected Output
cont.var fact1 fact2 rep
1 1.0 1 0 1
2 1.0 0 1 2
3 1.5 1 0 3
4 1.5 1 0 3
5 1.5 0 1 4
6 1.5 0 1 4
What I have tried
library(dplyr)
sample.df <- data.frame(
  cont.var = c(1, 1, 1.5, 1.5, 1.5, 1.5, 2, 2, 2, 3),
  fact1 = c(1, 0, 1, 1, 0, 0, 1, 1, 0, 1),
  fact2 = c(0, 1, 0, 0, 1, 1, 0, 0, 1, 0)
)
sample.df %>%
  group_by(cont.var, fact1, fact2) %>%
  mutate(replicate = make.unique(as.character(cont.var), "_"))
Incorrect Output
I would expect row 1 and row 2 to have different replicate identifiers.
I would expect the replicate identifier for row 3 == row 4 and row 5 == row 6, but row 5 != row 3.
cont.var fact1 fact2 replicate
1 1.0 1 0 1
2 1.0 0 1 1
3 1.5 1 0 1.5
4 1.5 1 0 1.5_1
5 1.5 0 1 1.5
6 1.5 0 1 1.5_1
I couldn't find a straightforward solution to this; I would really appreciate any help.
Thanks in advance.
You can use data.table::rleid:
library(dplyr)
df %>%
  mutate(rleid = data.table::rleid(cont.var, fact1, fact2))
cont.var fact1 fact2 rleid
1 1.0 1 0 1
2 1.0 0 1 2
3 1.5 1 0 3
4 1.5 1 0 3
5 1.5 0 1 4
6 1.5 0 1 4
If you have dplyr >= 1.1.0, you can also use consecutive_id(), the dplyr equivalent of data.table::rleid():
library(dplyr)
df %>%
  mutate(rleid2 = consecutive_id(cont.var, fact1, fact2))
Finally, a base R option would be to match each row against the unique rows:
# paste all columns together rowwise, then match each row's key against the unique keys
df$rleid <- match(do.call(paste, df), do.call(paste, unique(df)))
Another dplyr method, in case you're already grouped:
quux %>%
  group_by(cont.var, fact1, fact2) %>%
  mutate(rep = group_indices()) %>%
  ungroup()
# # A tibble: 6 x 4
# cont.var fact1 fact2 rep
# <dbl> <int> <int> <int>
# 1 1 1 0 2
# 2 1 0 1 1
# 3 1.5 1 0 4
# 4 1.5 1 0 4
# 5 1.5 0 1 3
# 6 1.5 0 1 3
While the actual values are not the same, the spirit of your request is retained.
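As a hedged aside: newer dplyr versions (>= 1.0) recommend cur_group_id() over group_indices() inside verbs, so an equivalent sketch (quux being the same sample data) would be:
quux %>%
  group_by(cont.var, fact1, fact2) %>%
  mutate(rep = cur_group_id()) %>%   # same group ids, current recommended API
  ungroup()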
Here is another base R solution:
sample.df <- data.frame(
  cont.var = c(1, 1, 1.5, 1.5, 1.5, 1.5, 2, 2, 2, 3),
  fact1 = c(1, 0, 1, 1, 0, 0, 1, 1, 0, 1),
  fact2 = c(0, 1, 0, 0, 1, 1, 0, 0, 1, 0)
)
sample.df$replicate <- cumsum(!duplicated(sample.df))
sample.df
#> cont.var fact1 fact2 replicate
#> 1 1.0 1 0 1
#> 2 1.0 0 1 2
#> 3 1.5 1 0 3
#> 4 1.5 1 0 3
#> 5 1.5 0 1 4
#> 6 1.5 0 1 4
#> 7 2.0 1 0 5
#> 8 2.0 1 0 5
#> 9 2.0 0 1 6
#> 10 3.0 1 0 7
EDIT
To ensure duplicates are contiguous, sort first:
sample.df <- sample.df[with(sample.df, order(fact2,fact1,cont.var)),]
sample.df$replicate <- cumsum(!duplicated(sample.df))
sample.df
#> cont.var fact1 fact2 replicate
#> 1 1.0 1 0 1
#> 3 1.5 1 0 2
#> 4 1.5 1 0 2
#> 7 2.0 1 0 3
#> 8 2.0 1 0 3
#> 10 3.0 1 0 4
#> 2 1.0 0 1 5
#> 5 1.5 0 1 6
#> 6 1.5 0 1 6
#> 9 2.0 0 1 7
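If you need the replicate ids but want to keep the original row order, a sketch (my addition, starting again from the unsorted sample.df without the replicate column) is:
ord <- with(sample.df, order(fact2, fact1, cont.var))
key <- sample.df[ord, c("cont.var", "fact1", "fact2")]  # sorted copy of the key columns
sample.df$replicate <- NA_integer_
sample.df$replicate[ord] <- cumsum(!duplicated(key))    # ids computed in sorted order, stored back in original order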

Count interactions with unique accounts in financial transaction dataset

I have a question about a dataset with financial transactions:
Account_from Account_to Value
1 1 2 25.0
2 1 3 30.0
3 2 1 28.0
4 2 3 10.0
5 2 3 12.0
6 3 1 40.0
7 3 1 30.0
8 3 1 20.0
Each row represents a transaction. I would like to create extra columns containing the number of unique accounts each account sends to and receives from.
So that it would look like the following:
Account_from Account_to Value Count_interactions_out Count_interactions_in
1 1 2 25.0 2 2
2 1 3 30.0 2 2
3 2 1 28.0 2 1
4 2 3 10.0 2 1
5 2 3 12.0 2 1
6 3 1 40.0 1 2
7 3 1 30.0 1 2
8 3 1 20.0 1 2
Account 3 only interacts with account 1, therefore Count_interactions_out is 1. However, it receives interactions from accounts 1 and 2, therefore Count_interactions_in is 2.
How can I apply this to the whole dataset?
Thanks
Here's an approach using dplyr
library(dplyr)
financial.data %>%
  group_by(Account_from) %>%
  mutate(Count_interactions_out = nlevels(factor(Account_to))) %>%
  ungroup() %>%
  group_by(Account_to) %>%
  mutate(Count_interactions_in = nlevels(factor(Account_from))) %>%
  ungroup()
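A hedged caveat: grouping by Account_to attributes the in-count to each row's receiving account, whereas the expected output attributes it to the row's Account_from (compare row 1, where Count_interactions_in is 2 for account 1). A sketch with n_distinct() and a join that reproduces the expected output:
library(dplyr)
# unique senders per receiving account
in_counts <- financial.data %>%
  group_by(Account_to) %>%
  summarise(Count_interactions_in = n_distinct(Account_from))
financial.data %>%
  group_by(Account_from) %>%
  mutate(Count_interactions_out = n_distinct(Account_to)) %>%
  ungroup() %>%
  left_join(in_counts, by = c("Account_from" = "Account_to"))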
Here is a solution with base R, where ave() is used:
df <- cbind(df,
  with(df, list(
    # per sending account, count the unique receivers
    Count_interactions_out = ave(Account_to, Account_from,
                                 FUN = function(x) length(unique(x))),
    # per receiving account, count the unique senders, then look that
    # count up for each row's Account_from
    Count_interactions_in = ave(Account_from, Account_to,
                                FUN = function(x) length(unique(x)))[match(Account_from, Account_to)])))
such that
> df
Account_from Account_to Value Count_interactions_out Count_interactions_in
1 1 2 25 2 2
2 1 3 30 2 2
3 2 1 28 2 1
4 2 3 10 2 1
5 2 3 12 2 1
6 3 1 40 1 2
7 3 1 30 1 2
8 3 1 20 1 2
or
df <- within(df, list(
  Count_interactions_out <- ave(Account_to, Account_from,
                                FUN = function(x) length(unique(x))),
  Count_interactions_in <- ave(Account_from, Account_to,
                               FUN = function(x) length(unique(x)))[match(Account_from, Account_to)]))
such that
> df
Account_from Account_to Value Count_interactions_in Count_interactions_out
1 1 2 25 2 2
2 1 3 30 2 2
3 2 1 28 1 2
4 2 3 10 1 2
5 2 3 12 1 2
6 3 1 40 2 1
7 3 1 30 2 1
8 3 1 20 2 1

Count number of values which are less than current value

I'd like to count, for each row, how many values in the column input are smaller than the current row's value (please see the desired results below). The tricky part is that the condition is based on the current row's value, so it is quite different from the general case where the condition is a fixed number.
data <- data.frame(input = c(1,1,1,1,2,2,3,5,5,5,5,6))
input
1 1
2 1
3 1
4 1
5 2
6 2
7 3
8 5
9 5
10 5
11 5
12 6
The results I expect are as follows. For example, for observations 5 and 6 (with value 2), there are 4 observations with value 1, which is less than 2; hence count is 4.
input count
1 1 0
2 1 0
3 1 0
4 1 0
5 2 4
6 2 4
7 3 6
8 5 7
9 5 7
10 5 7
11 5 7
12 6 11
Edit: as I am dealing with grouped data in dplyr, the ultimate result I wish to get is as below; that is, the condition should be applied within each group.
data <- data.frame(id = c(1,1,2,2,2,3,3,4,4,4,4,4),
                   input = c(1,1,1,1,2,2,3,5,5,5,5,6),
                   count = c(0,0,0,0,2,0,1,0,0,0,0,4))
id input count
1 1 1 0
2 1 1 0
3 2 1 0
4 2 1 0
5 2 2 2
6 3 2 0
7 3 3 1
8 4 5 0
9 4 5 0
10 4 5 0
11 4 5 0
12 4 6 4
Here is an option with tidyverse
library(tidyverse)
data %>%
  mutate(count = map_int(input, ~ sum(.x > input)))
# input count
#1 1 0
#2 1 0
#3 1 0
#4 1 0
#5 2 4
#6 2 4
#7 3 6
#8 5 7
#9 5 7
#10 5 7
#11 5 7
#12 6 11
Update
With the updated data, add a group_by(id) to the above code:
data %>%
  group_by(id) %>%
  mutate(count1 = map_int(input, ~ sum(.x > input)))
# A tibble: 12 x 4
# Groups: id [4]
# id input count count1
# <dbl> <dbl> <dbl> <int>
# 1 1 1 0 0
# 2 1 1 0 0
# 3 2 1 0 0
# 4 2 1 0 0
# 5 2 2 2 2
# 6 3 2 0 0
# 7 3 3 1 1
# 8 4 5 0 0
# 9 4 5 0 0
#10 4 5 0 0
#11 4 5 0 0
#12 4 6 4 4
In base R, we can use sapply and, for each value of input, count how many values are smaller than it.
data$count <- sapply(data$input, function(x) sum(x > data$input))
data
# input count
#1 1 0
#2 1 0
#3 1 0
#4 1 0
#5 2 4
#6 2 4
#7 3 6
#8 5 7
#9 5 7
#10 5 7
#11 5 7
#12 6 11
With dplyr, one way would be to use the rowwise() function, following the same logic.
library(dplyr)
data %>%
  rowwise() %>%
  mutate(count = sum(input > data$input))
1. outer and rowSums
data$count <- with(data, rowSums(outer(input, input, `>`)))
2. table and cumsum
tt <- cumsum(table(data$input))  # running count of values <= each distinct value
# shift by one distinct value: the count of values < x is the cumulative count up to the previous value
v <- setNames(c(0, head(tt, -1)), names(tt))
data$count <- v[match(data$input, names(v))]
3. data.table non-equi join
Perhaps more efficient with a non-equi join in data.table. Count number of rows (.N) for each match (by = .EACHI).
library(data.table)
setDT(data)
data[data, on = .(input < input), .N, by = .EACHI]
If your data is grouped by 'id', as in your update, join on that variable as well:
data[data, on = .(id, input < input), .N, by = .EACHI]
# id input N
# 1: 1 1 0
# 2: 1 1 0
# 3: 2 1 0
# 4: 2 1 0
# 5: 2 2 2
# 6: 3 2 0
# 7: 3 3 1
# 8: 4 5 0
# 9: 4 5 0
# 10: 4 5 0
# 11: 4 5 0
# 12: 4 6 4
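As a further hedged sketch (my addition, not from the answers above): findInterval() on the sorted vector counts strictly smaller elements without the quadratic pairwise comparison, and ave() extends it to the grouped case, assuming the 'id' column from the edit:
# count of values strictly less than x = position of x in the sorted vector
count_smaller <- function(x) findInterval(x, sort(x), left.open = TRUE)
data$count <- count_smaller(data$input)                      # ungrouped case
data$count <- ave(data$input, data$id, FUN = count_smaller)  # grouped case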

Reshaping a data frame and setting flag variables

I want to reshape my data frame from the df1 to df2 as appears below:
df1 <-
ID TIME RATEALL CL V1 Q V2
1 0 0 2.4 10 6 20
1 1 2 0.6 10 6 25
2 0 0 3.0 15 7 30
2 5 3 3.0 16 8 15
into a long format like this:
df2 <-
ID var TIME value
1 1 0 0
1 1 1 2
1 2 0 2.4
1 2 1 10
1 3 0 6
1 3 1 6
1 4 0 20
1 4 1 20
2 1 0 3.0
2 1 1 3.0
AND so on ...
Basically I want to assign a flag variable (1 for RATEALL, 2 for CL, 3 for V1, 4 for Q, and 5 for V2) and then melt the values for each subject ID. Is there an easy way to do this in R?
You can try
df2 <- reshape2::melt(df1, c("ID", "TIME"))
codes <- c("RATEALL" = 1, "CL" = 2, "V1" = 3, "Q" = 4, "V2" = 5)
# index by name rather than by factor code, so the mapping is explicit
df2$variable <- unname(codes[as.character(df2$variable)])
You could use tidyr/dplyr
library(tidyr)
library(dplyr)
res <- gather(df1, var, value, RATEALL:V2) %>%
  mutate(var = as.numeric(factor(var, levels = c("RATEALL", "CL", "V1", "Q", "V2"))))
head(res)
# ID TIME var value
#1 1 0 1 0.0
#2 1 1 1 2.0
#3 2 0 1 0.0
#4 2 5 1 3.0
#5 1 0 2 2.4
#6 1 1 2 0.6
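Since gather() is superseded in current tidyr, here is a sketch of the same reshape with pivot_longer() (assuming tidyr >= 1.0; the explicit vector in match() keeps the requested 1-5 mapping):
library(dplyr)
library(tidyr)
res <- df1 %>%
  pivot_longer(RATEALL:V2, names_to = "var", values_to = "value") %>%
  mutate(var = match(var, c("RATEALL", "CL", "V1", "Q", "V2"))) %>%
  arrange(ID, var, TIME)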

Adding an extra row for each subject ID and maintaining the values in the other columns

I want to add an extra row for each subject ID in the data frame (below). This row should have TIME=0 and DV=0. Other values in the other columns should stay the same. The data frame looks like the following:
ID TIME DV DOSE pH
1 1 5 50 4.6
1 5 10 50 4.6
2 1 6 100 6.0
2 7 10 100 6.0
After adding the extra row, it should look like this:
ID TIME DV DOSE pH
1 0 0 50 4.6
1 1 5 50 4.6
1 5 10 50 4.6
2 0 0 100 6.0
2 1 6 100 6.0
2 7 10 100 6.0
How could I achieve this in R?
Try this:
#dummy data
df <- read.table(text="ID TIME DV DOSE pH
1 1 5 50 4.6
1 5 10 50 4.6
2 1 6 100 6.0
2 7 10 100 6.0",header=TRUE)
#data with zeros
df1 <- df
df1[,c(2,3)] <- 0
df1 <- unique(df1)
#rowbind and sort
res <- rbind(df,df1)
res <- res[order(res$ID,res$TIME),]
res
# ID TIME DV DOSE pH
# 11 1 0 0 50 4.6
# 1 1 1 5 50 4.6
# 2 1 5 10 50 4.6
# 31 2 0 0 100 6.0
# 3 2 1 6 100 6.0
# 4 2 7 10 100 6.0
Here's another possible data.table solution
library(data.table)
setDT(df)[, .SD[c(1L, seq_len(.N))], ID][
  , indx := seq_len(.N), ID][indx == 1L, 2:3 := 0][]
# ID TIME DV DOSE pH indx
# 1: 1 0 0 50 4.6 1
# 2: 1 1 5 50 4.6 2
# 3: 1 5 10 50 4.6 3
# 4: 2 0 0 100 6.0 1
# 5: 2 1 6 100 6.0 2
# 6: 2 7 10 100 6.0 3
I changed the indexing from c(.N + 1, 1:.N) (in @David Arenburg's post) to c(1L, 1:.N), as it is easier this way :-)
library(data.table)
setDT(df)[, .SD[c(1L, 1:.N)], by = ID][
  , 2:3 := .SD * (!duplicated(.SD, fromLast = TRUE)) + 0L, .SDcols = 2:3][]
# ID TIME DV DOSE pH
#1: 1 0 0 50 4.6
#2: 1 1 5 50 4.6
#3: 1 5 10 50 4.6
#4: 2 0 0 100 6.0
#5: 2 1 6 100 6.0
#6: 2 7 10 100 6.0
Or you could use set(), which updates by reference (useful if there are many columns):
DT <- setDT(df)[, .SD[c(1L, 1:.N)], by = ID]
indx <- DT[, !duplicated(.SD, fromLast = TRUE), .SDcols = 2:3]
for (j in 2:3) {
  set(DT, i = NULL, j = j, value = DT[[j]] * (indx + 0L))
}
A concise approach using plyr:
library(plyr)
ldply(split(df, df$ID), function(u) {
  x <- u[1, ]
  x[c("DV", "TIME")] <- 0
  rbind(x, u)
})
# .id ID TIME DV DOSE pH
#1 1 1 0 0 50 4.6
#2 1 1 1 5 50 4.6
#3 1 1 5 10 50 4.6
#4 2 2 0 0 100 6.0
#5 2 2 1 6 100 6.0
#6 2 2 7 10 100 6.0
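For completeness, a hedged dplyr sketch of the same idea (my addition): build the TIME = 0 baseline rows with distinct() and bind them back on:
library(dplyr)
df %>%
  distinct(ID, DOSE, pH) %>%   # one baseline row per subject, keeping the constant columns
  mutate(TIME = 0, DV = 0) %>%
  bind_rows(df) %>%
  arrange(ID, TIME) %>%
  select(ID, TIME, DV, DOSE, pH)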
