Collapsing by groups and calculating n in a data.frame using R

I have a data frame that contains three variables: treatment, dose, and outcome (plus or minus). I have multiple observations for each treatment and dose. I'm trying to output a contingency table that would collapse the data to indicate the number of each outcome as a function of the treatment and dose, as well as the number of observations. For example:
treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1
The desired output would be:
treatment dose outcome n
control 0 1 4
treatmentA 1 2 3
treatmentA 2 3 3
I've played around with this all day and haven't had much luck beyond getting a frequency for each outcome for each observation. Any suggestions would be appreciated (including pointing out which parts of the R manual and/or examples I've overlooked).
Thanks!

Here is a solution using the wonderful data.table package:
library(data.table)
x <- data.table(read.table( text = "treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1", header = TRUE))
x[, list(outcome = sum(outcome), count = .N), by = 'treatment,dose']
produces
treatment dose outcome count
1: control 0 1 4
2: treatmentA 1 2 3
3: treatmentA 2 3 3

If you don't want to use extra libraries as suggested in other answers, you can try the following.
> df
treatment dose outcome
1 control 0 0
2 control 0 0
3 control 0 0
4 control 0 1
5 treatmentA 1 0
6 treatmentA 1 1
7 treatmentA 1 1
8 treatmentA 2 1
9 treatmentA 2 1
10 treatmentA 2 1
> dput(df)
structure(list(treatment = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("control", "treatmentA"), class = "factor"),
dose = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 2L), outcome = c(0L,
0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L)), .Names = c("treatment",
"dose", "outcome"), class = "data.frame", row.names = c(NA, -10L
))
Now we use the aggregate function to get the count and the sum of the outcome column:
> nObs <- aggregate(outcome ~ treatment + dose, data = df, length)
> sObs <- aggregate(outcome ~ treatment + dose, data = df, sum)
Change the names of the aggregated columns appropriately:
> names(nObs) <- c('treatment', 'dose', 'count')
> names(sObs) <- c('treatment', 'dose', 'sum')
> nObs
treatment dose count
1 control 0 4
2 treatmentA 1 3
3 treatmentA 2 3
> sObs
treatment dose sum
1 control 0 1
2 treatmentA 1 2
3 treatmentA 2 3
Use merge to combine the two results. By default it joins on all columns that share a name, treatment and dose in this case.
> result <- merge(nObs, sObs)
> result
treatment dose count sum
1 control 0 4 1
2 treatmentA 1 3 2
3 treatmentA 2 3 3

If I understand correctly, this is straightforward with the data.table library. First, load the library and read the data in:
library(data.table)
data <- read.table(header=TRUE, text="
treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1")
Next, create a data.table with the treatment and dose columns as the table keys (indices).
data <- data.table(data, key="treatment,dose")
Then aggregate using data.table syntax.
data[, list(outcome=sum(outcome), n=length(outcome)), by=list(treatment,dose)]
treatment dose outcome n
1: control 0 1 4
2: treatmentA 1 2 3
3: treatmentA 2 3 3

imho, sql is underrated. :)
# read in your example data as `x`
x <- read.table( text = "treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1", header = TRUE)
# load the sql data frame library
library(sqldf)
# create a new table of all unique `treatment` and `dose` columns,
# summing the `outcome` column and
# counting the number of records in each combo
y <- sqldf( 'SELECT treatment, dose ,
sum( outcome ) as outcome ,
count(*) as n
FROM x
GROUP BY treatment, dose' )
# check the results
y

Here are another couple of options (even though the data.table approach clearly wins in succinctness of syntax).
The first uses ave inside within. ave applies a function to a variable (its first argument) grouped by one or more other variables. We wrap the output in unique after dropping the now unnecessary "outcome" column.
unique(within(df, {
SUM <- ave(outcome, treatment, dose, FUN = sum)
COUNT <- ave(outcome, treatment, dose, FUN = length)
rm(outcome)
}))
# treatment dose COUNT SUM
# 1 control 0 4 1
# 5 treatmentA 1 3 2
# 8 treatmentA 2 3 3
A second solution in base R is very similar to @geektrader's answer, except that it calculates both sum and length in one call to aggregate. There is a "downside" though: the result of that cbind is a "column" in your data.frame that is actually a matrix. Look at the output of str to see what I mean.
temp <- aggregate(outcome ~ treatment + dose, df,
function(x) cbind(sum(x), length(x)))
str(temp)
# 'data.frame': 3 obs. of 3 variables:
# $ treatment: Factor w/ 2 levels "control","treatmentA": 1 2 2
# $ dose : int 0 1 2
# $ outcome : int [1:3, 1:2] 1 2 3 4 3 3
colnames(temp$outcome) <- c("SUM", "COUNT")
temp
# treatment dose outcome.SUM outcome.COUNT
# 1 control 0 1 4
# 2 treatmentA 1 2 3
# 3 treatmentA 2 3 3
I mention storage structure as a "downside" mostly because you might not get what you expect when you try to access the data in ways you might be accustomed to.
temp$outcome.SUM
# NULL
temp$outcome
# SUM COUNT
# [1,] 1 4
# [2,] 2 3
# [3,] 3 3
Instead, you have to access it via:
temp$outcome[, "SUM"] ## or temp$outcome[, 1]
# [1] 1 2 3
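If the matrix column gets in the way, one possible workaround (a sketch, not part of the answer above) is to flatten the result with do.call(data.frame, ...), which turns each column of the matrix into an ordinary column. The object names temp2 and flat are made up for illustration, and the example data is re-created so the snippet runs on its own.

```r
# Re-create the example data from the question
df <- read.table(header = TRUE, text = "treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1")

# One aggregate call returning a named vector per group;
# 'outcome' becomes a two-column matrix inside the data.frame
temp2 <- aggregate(outcome ~ treatment + dose, data = df,
                   FUN = function(x) c(SUM = sum(x), COUNT = length(x)))

# Flatten the matrix column into regular columns
flat <- do.call(data.frame, temp2)
names(flat)
flat$outcome.SUM   # now works with the usual $ access
```

After flattening, the columns are called outcome.SUM and outcome.COUNT, so the access pattern that returned NULL above behaves as expected.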

Related

Grouping first few rows with positive value followed by another group with negative values and so on using R

I have a dataframe looks like this:
name strand
thrL 1
thrA 1
thrB 1
yaaA -1
yaaJ -1
talB 1
mog 1
I would like to put the first run of positive values into one group, the following negative values into a second group, and the next positive values into a third group, so that it looks like this:
name strand direction
thrL 1 1
thrA 1 1
thrB 1 1
yaaA -1 2
yaaJ -1 2
talB 1 3
mog 1 3
I am thinking of using dplyr, but I need some help with the R code. Thank you so much.
Using rle :
df$direction <- with(rle(sign(df$strand)), rep(seq_along(values), lengths))
df
# name strand direction
#1 thrL 1 1
#2 thrA 1 1
#3 thrB 1 1
#4 yaaA -1 2
#5 yaaJ -1 2
#6 talB 1 3
#7 mog 1 3
This can be made shorter with data.table rleid.
df$direction <- data.table::rleid(sign(df$strand))
We can also do this with inverse.rle:
df1$direction <- inverse.rle(within.list(rle(sign(df1$strand)),
values <- seq_along(values)))
df1$direction
#[1] 1 1 1 2 2 3 3
data
df1 <- structure(list(name = c("thrL", "thrA", "thrB", "yaaA", "yaaJ",
"talB", "mog"), strand = c(1L, 1L, 1L, -1L, -1L, 1L, 1L)),
class = "data.frame", row.names = c(NA,
-7L))
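To see why the rle-based answers work, it can help to look at the intermediate pieces. The following is a small base R sketch on the same strand values; the variable names runs, direction and direction2 are only illustrative, and the cumsum variant is an alternative idea that does not appear in the answers.

```r
strand <- c(1L, 1L, 1L, -1L, -1L, 1L, 1L)

# rle() splits the vector into runs of identical values
runs <- rle(sign(strand))
runs$values    # the sign of each run
runs$lengths   # how many rows each run covers

# Number the runs 1, 2, 3, ... and repeat each number over its run
direction <- rep(seq_along(runs$values), runs$lengths)

# Alternative: start a new group whenever the sign changes
direction2 <- cumsum(c(TRUE, diff(sign(strand)) != 0))

direction
```

Both variants give one group id per consecutive run, so a later positive run gets a new id instead of being merged with the first one.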

How to find columns that fit a specific range (per individual) and add 1, else 0, using R

I have a data frame with three initial columns: ID, start and end positions. The rest of the columns are numeric chromosomal positions, and it looks like this:
ID start end 1 2 3 4 5 6 7 ... n
ind1 2 4
ind2 1 3
ind3 5 7
What I want is to fill in the empty columns (1:n) based on the range for every individual (start:end). For example, for the first individual (ind1) the range goes from position 2 to 4, so the positions inside the range are filled with one (1) and the positions outside the range with zero (0). To simplify, the desired output should look like this:
ID start end 1 2 3 4 5 6 7 ... n
ind1 2 4 0 1 1 1 0 0 0 ... 0
ind2 1 3 1 1 1 0 0 0 0 ... 0
ind3 5 7 0 0 0 0 1 1 1 ... 0
I will appreciate any comment.
Supposing you know the number of columns you could use the between function from the data.table package:
cols <- paste0('c',1:7)
library(data.table)
setDT(DF)[, (cols) := lapply(1:7, function(x) +(between(x, start, end)))][]
which gives:
ID start end c1 c2 c3 c4 c5 c6 c7
1: ind1 2 4 0 1 1 1 0 0 0
2: ind2 1 3 1 1 1 0 0 0 0
3: ind3 5 7 0 0 0 0 1 1 1
Notes:
It is better not to name your columns with just numbers. Therefore I added a c at the start of the column names.
Using + in +(between(x, start, end)) is a kind of trick. The more idiomatic way is as.integer(between(x, start, end)).
Used data:
DF <- read.table(text="ID start end
ind1 2 4
ind2 1 3
ind3 5 7", header=TRUE)
If you were to begin with the data frame df, without the columns already added,
ID start end
1 ind1 2 4
2 ind2 1 3
3 ind3 5 7
you could do
mx <- max(df[-1])
M <- Map(function(x, y) replace(integer(mx), x:y, 1L), df$start, df$end)
cbind(df, do.call(rbind, M))
# ID start end 1 2 3 4 5 6 7
# 1 ind1 2 4 0 1 1 1 0 0 0
# 2 ind2 1 3 1 1 1 0 0 0 0
# 3 ind3 5 7 0 0 0 0 1 1 1
The number of new columns will equal the maximum of the start and end columns.
Data:
df <- structure(list(ID = structure(1:3, .Label = c("ind1", "ind2",
"ind3"), class = "factor"), start = c(2L, 1L, 5L), end = c(4L,
3L, 7L)), .Names = c("ID", "start", "end"), class = "data.frame", row.names = c(NA,
-3L))
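Another way to express the same idea in base R is to compare every start and end value against every position at once with outer. This is only a sketch under the assumption that positions run from 1 to the largest end value; the names pos, inside and flags are hypothetical.

```r
df <- data.frame(ID = c("ind1", "ind2", "ind3"),
                 start = c(2L, 1L, 5L),
                 end = c(4L, 3L, 7L))

# Assumed set of positions: 1 up to the largest end value
pos <- seq_len(max(df$end))

# inside[i, j] is TRUE when position j lies within start[i]..end[i]
inside <- outer(df$start, pos, "<=") & outer(df$end, pos, ">=")

# Convert TRUE/FALSE to 1/0 and give the columns non-numeric names
flags <- matrix(as.integer(inside), nrow = nrow(df),
                dimnames = list(NULL, paste0("c", pos)))

res <- cbind(df, flags)
res
```

This avoids an explicit loop over rows, at the cost of building a full rows-by-positions logical matrix in memory.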

Expanding a data.frame by replacing missing values with set of all possible values in R

I want to expand my dataset by replacing each incomplete row with the set of all possible rows. Does anyone have any suggestions for an efficient way to do this?
For example, suppose X and Z can each take values 0 or 1.
Input:
id y x z
1 1 0 0 NA
2 2 1 NA 0
3 3 0 1 1
4 4 1 NA NA
Output:
id y x z
1 1 0 0 0
2 1 0 0 1
3 2 1 0 0
4 2 1 1 0
5 3 0 1 1
6 4 1 0 0
7 4 1 0 1
8 4 1 1 0
9 4 1 1 1
At the moment I just work through the original dataset row by row:
for(i in 1:N){
  if(is.na(temp.dat$x[i]) && !is.na(temp.dat$z[i])){
    # only x is missing: two candidate rows, x = 0 and x = 1
    augment <- matrix(rep(temp.dat[i,],2),ncol=ncol(temp.dat),byrow=TRUE)
    augment[,3] <- c(0,1)
  }else if(!is.na(temp.dat$x[i]) && is.na(temp.dat$z[i])){
    # only z is missing: two candidate rows, z = 0 and z = 1
    augment <- matrix(rep(temp.dat[i,],2),ncol=ncol(temp.dat),byrow=TRUE)
    augment[,4] <- c(0,1)
  }else if(is.na(temp.dat$x[i]) && is.na(temp.dat$z[i])){
    # both are missing: four candidate rows, one per (x, z) combination
    augment <- matrix(rep(temp.dat[i,],4),ncol=ncol(temp.dat),byrow=TRUE)
    augment[,3] <- c(0,0,1,1)
    augment[,4] <- c(0,1,0,1)
  }
}
You could try the following steps:
1. Create an "indx" holding the count of "NAs" in each row (rowSums(is.na(...))).
2. Use the "indx" to expand the rows of the original dataset (df[rep(1:nrow...)
3. Loop over (sapply) the "indx", use each value as the "times" argument in rep, and do expand.grid of the values 0,1 to create the "lst".
4. Split the expanded dataset, "df1", by "id" to get "lst2".
5. Use Map to replace the corresponding "NA" values in "lst2" with the values in "lst".
6. rbind the list elements.
indx <- rowSums(is.na(df[-1]))
df1 <- df[rep(1:nrow(df), 2^indx),]
lst <- sapply(indx, function(x) expand.grid(rep(list(0:1), x)))
lst2 <- split(df1, df1$id)
res <- do.call(rbind,Map(function(x,y) {x[is.na(x)] <- as.matrix(y);x},
lst2, lst))
row.names(res) <- NULL
res
# id y x z
#1 1 0 0 0
#2 1 0 0 1
#3 2 1 0 0
#4 2 1 1 0
#5 3 0 1 1
#6 4 1 0 0
#7 4 1 1 0
#8 4 1 0 1
#9 4 1 1 1
data
df <- structure(list(id = 1:4, y = c(0L, 1L, 0L, 1L), x = c(0L, NA,
1L, NA), z = c(NA, 0L, 1L, NA)), .Names = c("id", "y", "x", "z"
), class = "data.frame", row.names = c("1", "2", "3", "4"))
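A somewhat more direct variant of the same idea is to build, for each row, the list of candidate values per column and hand it to expand.grid. This is a sketch rather than a drop-in replacement: the helper name expand_row is made up, and it assumes that every missing value, in whatever column it occurs, can only be 0 or 1.

```r
df <- data.frame(id = 1:4,
                 y = c(0L, 1L, 0L, 1L),
                 x = c(0L, NA, 1L, NA),
                 z = c(NA, 0L, 1L, NA))

# Hypothetical helper: expand one row into all of its possible completions
expand_row <- function(row) {
  # For each column: both candidates 0 and 1 if missing, else the observed value
  cands <- lapply(row, function(v) if (is.na(v)) 0:1 else v)
  expand.grid(cands, KEEP.OUT.ATTRS = FALSE)
}

res <- do.call(rbind, lapply(seq_len(nrow(df)),
                             function(i) expand_row(df[i, ])))
row.names(res) <- NULL
res
```

Rows without missing values pass through unchanged, because expand.grid of single values yields exactly one row.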

R - cumulative frequency when not every combination appears

I need to get the cumulative number of clients by number of calls, up to each day.
An example table would be:
> data
dia cli llam elegidos cumllam
1 1-11 a 1 1 1
2 3-11 a 1 1 2
3 1-11 b 2 1 2
4 2-11 b 1 1 3
5 2-11 c 2 0 2
As you can see, client a wasn't called on day 2-11, so the combination client a + day 2-11 doesn't appear in the table. If I run:
series<-data.frame(dcast(data, elegidos+dia~cumllam , length))
I get:
> series
elegidos dia X1 X2 X3
1 0 2-11 0 1 0
2 1 1-11 1 1 0
3 1 2-11 0 0 1
4 1 3-11 0 1 0
But if you consider how many clients had been called once up to the 2nd day, client a should appear, and it doesn't, because the previous table has no row for the combination of client a and day 2-11.
The table should look like:
elegidos dia X1 X2 X3
1 0 2-11 0 1 0
2 1 1-11 1 1 0
3 1 2-11 1 0 1
4 1 3-11 0 1 1
X1 is the number of clients who had received exactly 1 call up to and including the day in the row.
X2 is the number of clients who had received exactly 2 calls up to and including the day in the row.
And so on.
The explanation is:
Client "a" gets a call on the 1st and 3rd days; client "b" receives 2 calls on the 1st day and 1 call on the 2nd day. So, on the 1st day we have 1 client with 1 call and another with 2 calls.
On the 2nd day, since the count is cumulative, client a stays at one call and client b gets one more call, reaching 3 calls.
On the 3rd day, client a receives another call and climbs to 2 cumulative calls, which is why he is in X2, while client b stays in X3.
Is there a way to do this cumulative count to each day, without having to create a row for each client day combination?
Thanks.
Try this:
dat1 <-data[!!data$elegidos,]
dat2 <- expand.grid(dia=sort(unique(dat1$dia)), cli=unique(dat1$cli))
dat3 <- merge(data,dat2, all=TRUE)
dat3N <- dat3[with(dat3, order( cli, dia)),]
library(zoo)
dat3N[,c('elegidos', 'cumllam')] <- lapply(dat3N[,
c('elegidos', 'cumllam')], na.locf)
library(reshape2)
dcast(dat3N, elegidos+dia~cumllam, length, value.var='cumllam')
# elegidos dia 1 2 3
#1 0 2-11 0 1 0
#2 1 1-11 1 1 0
#3 1 2-11 1 0 1
#4 1 3-11 0 1 1
Update
You could also do this in data.table
library(data.table)
DT <- data.table(data)
setkey(DT, dia, cli)
DT1 <- rbind(DT[!!elegidos, CJ(dia=unique(dia),
cli=unique(cli))], DT[elegidos==0, 1:2, with=FALSE])
nm1 <- c('elegidos', 'cumllam')
#There is also a roll option but unfortunately I couldn't get it right here.
# So, I am using na.locf from zoo.
DT2 <- DT[DT1[order(cli, dia)]][,(nm1):= lapply(.SD, na.locf), .SDcols=nm1]
dcast.data.table(DT2, elegidos+dia~cumllam, length, value.var='cumllam')
# elegidos dia 1 2 3
#1: 0 2-11 0 1 0
#2: 1 1-11 1 1 0
#3: 1 2-11 1 0 1
#4: 1 3-11 0 1 1
data
data <- structure(list(dia = c("1-11", "3-11", "1-11", "2-11", "2-11"
), cli = c("a", "a", "b", "b", "c"), llam = c(1L, 1L, 2L, 1L,
2L), elegidos = c(1L, 1L, 1L, 1L, 0L), cumllam = c(1L, 2L, 2L,
3L, 2L)), .Names = c("dia", "cli", "llam", "elegidos", "cumllam"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
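For the elegidos == 1 clients only, the cumulative counts can also be sketched in base R without filling in the missing client/day rows by hand: xtabs puts a zero wherever a client had no call, and a cumulative sum along the days carries the totals forward. This is a simplified illustration, not the approach of the answer above. It assumes the day labels sort chronologically as strings, and the names sel, calls, cum and counts are hypothetical.

```r
data <- data.frame(dia = c("1-11", "3-11", "1-11", "2-11", "2-11"),
                   cli = c("a", "a", "b", "b", "c"),
                   llam = c(1L, 1L, 2L, 1L, 2L),
                   elegidos = c(1L, 1L, 1L, 1L, 0L))

# Restrict to the selected clients (assumption: handle elegidos == 1 only)
sel <- data[data$elegidos == 1, ]

# Calls per client and day; absent combinations become 0
calls <- xtabs(llam ~ cli + dia, data = sel)

# Cumulative calls per client across days (clients in rows, days in columns)
cum <- t(apply(calls, 1, cumsum))

# For each day, count the clients with exactly 1, 2, ... cumulative calls
counts <- sapply(colnames(cum),
                 function(d) tabulate(cum[, d], nbins = max(cum)))
rownames(counts) <- paste0("X", seq_len(nrow(counts)))

t(counts)
```

The rows of t(counts) correspond to the elegidos == 1 rows of the desired table; the elegidos == 0 clients would need the same treatment on their own subset.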

Using R to find a T/F answer for one observation based on other observations sharing the same level of a different variable

So I have a very large (confidential, hence the vague names) data set containing a number of variables, and let's call the relevant ones group and dummy1. What I want to do is create a new variable, dummy2, that determines whether dummy1 is true (or 1 in this case) for at least one observation with the same value for group. This variable must have a value for every observation, even when dummy1 is NA for someone in the group (there are no NAs in group). I am very new to R and programming generally so I have not been able to figure out how to extract this information from aggregate for use in a variable, which seems like what you'd want to do, but I'm stuck.
So here is a chunk of what my data would hypothetically end up looking like:
Obs. Group Dummy1 Dummy2
1 101 0 1
2 101 1 1
3 101 0 1
4 102 0 0
5 102 0 0
6 103 1 1
7 103 1 1
8 103 1 1
So the idea here is that since at least one person in Group 101 has a value of 1 for dummy1, all members of that group get a 1 in dummy2, and likewise since nobody in Group 102 has dummy1, all members of group 102 have a 0 value for dummy2. The dataset has close to 7k observations over 1300 groups, so I suspect I need some kind of loop setup, but can anybody help me?
Thanks!
I think plyr and its ddply function are a better bet here:
require(plyr)
ddply(data, .(Group), transform, Dummy2 = 1 * any(Dummy1, na.rm = TRUE))
## Obs. Group Dummy1 Dummy2
## 1 1 101 0 1
## 2 2 101 1 1
## 3 3 101 0 1
## 4 4 102 0 0
## 5 5 102 0 0
## 6 6 103 1 1
## 7 7 103 1 1
## 8 8 103 1 1
If you need more speed when processing your data, data.table can be used:
require(data.table)
data <- as.data.table(data)
data[, Dummy2:= 1 * any(Dummy1, na.rm = TRUE), by = "Group"]
data
## Obs. Group Dummy1 Dummy2
## 1: 1 101 0 1
## 2: 2 101 1 1
## 3: 3 101 0 1
## 4: 4 102 0 0
## 5: 5 102 0 0
## 6: 6 103 1 1
## 7: 7 103 1 1
## 8: 8 103 1 1
EDIT: Added na.rm = TRUE in any to deal with missing values, thanks to @Dwin.
df$Dummy2 <- with(df, ave(Dummy1 , Group,
FUN=function(x) max(c(0,x), na.rm=TRUE) ) )
Test object:
df <- structure(list(Obs. = 1:8, Group = c(101L, 101L, 101L, 102L,
102L, 103L, 103L, 103L), Dummy1 = c(0L, NA, 0L, NA, NA, 1L, NA,
1L), Dummy2 = c(0, 0, 0, 0, 0, 1, 1, 1)), .Names = c("Obs.",
"Group", "Dummy1", "Dummy2"), row.names = c(NA, -8L), class = "data.frame")
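On the question of how to get a per-group result back into a variable on every row: one base R pattern is to compute the flag once per group with tapply and then look it up by group name. The sketch below uses the data shown in the question; the name flag is only illustrative.

```r
df <- data.frame(Obs. = 1:8,
                 Group = c(101L, 101L, 101L, 102L, 102L, 103L, 103L, 103L),
                 Dummy1 = c(0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L))

# One TRUE/FALSE per group: does any member have Dummy1 == 1?
# na.rm = TRUE makes a group with only 0s and NAs come out as FALSE
flag <- tapply(df$Dummy1, df$Group, function(x) any(x == 1, na.rm = TRUE))

# Look up each row's group by name and store the result as 0/1
df$Dummy2 <- as.integer(flag[as.character(df$Group)])
df
```

No explicit loop over the groups is needed; the lookup by name assigns each row the value computed for its group.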
