Cumulative frequency in R when not every combination appears

I need to get the cumulative number of clients by number of calls, up to and including each day.
An example table would be:
> data
dia cli llam elegidos cumllam
1 1-11 a 1 1 1
2 3-11 a 1 1 2
3 1-11 b 2 1 2
4 2-11 b 1 1 3
5 2-11 c 2 0 2
As you can see, client a wasn't called on day 2-11, so the combination client a + day 2-11 doesn't appear in the table. If I run:
series<-data.frame(dcast(data, elegidos+dia~cumllam , length))
I get:
> series
elegidos dia X1 X2 X3
1 0 2-11 0 1 0
2 1 1-11 1 1 0
3 1 2-11 0 0 1
4 1 3-11 0 1 0
But if you consider how many clients had been called exactly once up to the 2nd day, client a should be counted, and it isn't, because the previous table has no row for the combination of client a and day 2-11.
The table should look like:
elegidos dia X1 X2 X3
1 0 2-11 0 1 0
2 1 1-11 1 1 0
3 1 2-11 1 0 1
4 1 3-11 0 1 1
X1 is the number of clients who had received exactly 1 call up to and including the day in the row.
X2 is the number of clients who had received exactly 2 calls up to and including the day in the row.
And so on.
The explanation is:
Client "a" gets a call on the 1st and 3rd days; client "b" receives 2 calls on the 1st day and 1 call on the 2nd day. So on the 1st day we have 1 client with 1 call and another with 2 calls.
On the 2nd day, since the count is cumulative, client a stays at one call and client b gets one more call, reaching 3 calls.
On the 3rd day, client a receives another call and climbs to 2 cumulative calls, which is why he is in X2, while client b stays in X3.
Is there a way to do this cumulative count for each day without having to create a row for each client-day combination?
Thanks.

Try this:
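# Keep the selected clients (elegidos != 0) and build every day/client combination for them
dat1 <- data[!!data$elegidos,]
dat2 <- expand.grid(dia=sort(unique(dat1$dia)), cli=unique(dat1$cli))
# Merge the full grid back in; combinations missing from the data get NA
dat3 <- merge(data,dat2, all=TRUE)
dat3N <- dat3[with(dat3, order( cli, dia)),]
# Carry the last observed value forward over the NA rows (rows are ordered by client, then day)
library(zoo)
dat3N[,c('elegidos', 'cumllam')] <- lapply(dat3N[, c('elegidos', 'cumllam')], na.locf)
library(reshape2)
dcast(dat3N, elegidos+dia~cumllam, length, value.var='cumllam')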
# elegidos dia 1 2 3
#1 0 2-11 0 1 0
#2 1 1-11 1 1 0
#3 1 2-11 1 0 1
#4 1 3-11 0 1 1
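The key step above is carrying the last observed cumulative count forward into the rows that the merge filled with NA. As a rough illustration of that idea only, here is a base R sketch; locf is a hypothetical helper written for this example, not the zoo function, and unlike some implementations it leaves leading NAs as NA.

```r
# Hypothetical helper (not part of zoo): carry the last non-NA value forward.
locf <- function(x) {
  seen <- cumsum(!is.na(x))        # number of non-NA values seen so far
  c(NA, x[!is.na(x)])[seen + 1]    # index 1 (NA) is used before any value is seen
}

# Client "a" after merging with the full day grid: there is no row for day 2-11,
# so the cumulative call count is NA there until it is filled in.
cumllam_a <- c(1, NA, 2)           # days 1-11, 2-11, 3-11
filled <- locf(cumllam_a)
filled
# [1] 1 1 2
```

With the gap filled, client a is counted as having exactly one call on day 2-11, which is what the corrected table requires.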
Update
You could also do this in data.table
library(data.table)
DT <- data.table(data)
setkey(DT, dia, cli)
DT1 <- rbind(DT[!!elegidos, CJ(dia=unique(dia),
cli=unique(cli))], DT[elegidos==0, 1:2, with=FALSE])
nm1 <- c('elegidos', 'cumllam')
#There is also a roll option but unfortunately I couldn't get it right here.
# So, I am using na.locf from zoo.
DT2 <- DT[DT1[order(cli, dia)]][,(nm1):= lapply(.SD, na.locf), .SDcols=nm1]
dcast.data.table(DT2, elegidos+dia~cumllam, length, value.var='cumllam')
# elegidos dia 1 2 3
#1: 0 2-11 0 1 0
#2: 1 1-11 1 1 0
#3: 1 2-11 1 0 1
#4: 1 3-11 0 1 1
data
data <- structure(list(dia = c("1-11", "3-11", "1-11", "2-11", "2-11"
), cli = c("a", "a", "b", "b", "c"), llam = c(1L, 1L, 2L, 1L,
2L), elegidos = c(1L, 1L, 1L, 1L, 0L), cumllam = c(1L, 2L, 2L,
3L, 2L)), .Names = c("dia", "cli", "llam", "elegidos", "cumllam"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))

Related

Grouping first few rows with positive value followed by another group with negative values and so on using R

I have a dataframe looks like this:
name strand
thrL 1
thrA 1
thrB 1
yaaA -1
yaaJ -1
talB 1
mog 1
I would like to put the first run of positive values into one group, the following negative values into a second group, and the next positive values into another group, so that it looks like this:
name strand direction
thrL 1 1
thrA 1 1
thrB 1 1
yaaA -1 2
yaaJ -1 2
talB 1 3
mog 1 3
I am thinking of using dplyr, but I need some help with the R code. Thank you so much.
Using rle :
df$direction <- with(rle(sign(df$strand)), rep(seq_along(values), lengths))
df
# name strand direction
#1 thrL 1 1
#2 thrA 1 1
#3 thrB 1 1
#4 yaaA -1 2
#5 yaaJ -1 2
#6 talB 1 3
#7 mog 1 3
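To see why this works, it may help to look at the pieces that rle returns. The sketch below uses a standalone vector named strand (an assumption standing in for df$strand) and rebuilds the group ids step by step.

```r
strand <- c(1L, 1L, 1L, -1L, -1L, 1L, 1L)

runs <- rle(sign(strand))
runs$lengths   # length of each run of identical signs: 3 2 2
runs$values    # sign of each run: 1 -1 1

# Number the runs 1, 2, 3, ... and repeat each number for the length of its run
direction <- rep(seq_along(runs$values), runs$lengths)
direction
# [1] 1 1 1 2 2 3 3
```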
This can be made shorter with data.table rleid.
df$direction <- data.table::rleid(sign(df$strand))
We can also do this as
df1$direction <- inverse.rle(within.list(rle(sign(df1$strand)),
values <- seq_along(values)))
df1$direction
#[1] 1 1 1 2 2 3 3
data
df1 <- structure(list(name = c("thrL", "thrA", "thrB", "yaaA", "yaaJ",
"talB", "mog"), strand = c(1L, 1L, 1L, -1L, -1L, 1L, 1L)),
class = "data.frame", row.names = c(NA,
-7L))

restart counting under conditions in R [duplicate]

I have a flag column that contains consecutive runs of 1s and 0s. I want a running count over each run of 1s. When it encounters 0s, the count should stop and show 0. For the next run of 1s, counting should start afresh.
I have tried cumsum(negread_flag == 1), but this continues to sum after the 0s.
negread_flag result
1 1
1 2
1 3
1 4
0 0
0 0
0 0
1 1
1 2
1 3
0 0
We can use rleid (run-length id, which generates a new id whenever adjacent elements differ) as a grouping variable, then take the sequence within each group and assign it to 'result' where 'negread_flag' is 1. Finally, remove the 'grp' column by assigning it to NULL.
library(data.table)
setDT(df1)[, grp := rleid(negread_flag)
][, result := 0
][negread_flag == 1,
result := seq_len(.N), grp][, grp := NULL][]
# negread_flag result
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 1 4
# 5: 0 0
# 6: 0 0
# 7: 0 0
# 8: 1 1
# 9: 1 2
#10: 1 3
#11: 0 0
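The same grouping idea can be tried without any packages. This is only a sketch under the assumption that the flag is a plain 0/1 vector; flag, grp and result are illustrative names, and cumsum over the change points is used here in place of rleid.

```r
flag <- c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L)

# A new group starts at the first element and wherever the value changes
grp <- cumsum(c(TRUE, diff(flag) != 0))

# Count within each group, then zero out the groups where the flag is 0
result <- ave(flag, grp, FUN = seq_along) * flag
result
# [1] 1 2 3 4 0 0 0 1 2 3 0
```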
Or a similar idea with tidyverse, using rleid (from data.table) for the grouping: create 'result' by multiplying row_number() by 'negread_flag', so that rows where 'negread_flag' is 0 become 0.
library(tidyverse)
df1 %>%
  group_by(grp = data.table::rleid(negread_flag)) %>%
  mutate(result = row_number() * negread_flag) %>%
  ungroup() %>%
  select(-grp)
# A tibble: 11 x 2
# negread_flag result
# <int> <int>
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 0 0
# 6 0 0
# 7 0 0
# 8 1 1
# 9 1 2
#10 1 3
#11 0 0
Or using base R
i1 <- df1$negread_flag != 0
df1$result <- 0L  # start from zeros so rows where the flag is 0 stay 0
df1$result[i1] <- with(rle(df1$negread_flag), sequence(lengths * values))
Or as @markus commented
df1$result <- sequence(rle(df1$negread_flag)$lengths) * df1$negread_flag
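For readers unfamiliar with sequence, the intermediate values of the rle approach can be inspected on a standalone vector. This is a sketch with illustrative names (flag, r, counts, result), assuming the flag only contains 0s and 1s.

```r
flag <- c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L)

r <- rle(flag)
r$lengths             # 4 3 3 1
r$values              # 1 0 1 0
r$lengths * r$values  # 4 0 3 0: runs of 0s contribute nothing

# sequence() concatenates 1:4, nothing, 1:3, nothing
counts <- sequence(r$lengths * r$values)

# Place the counts at the positions where the flag is 1; the rest stay 0
result <- integer(length(flag))
result[flag != 0] <- counts
result
# [1] 1 2 3 4 0 0 0 1 2 3 0
```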
data
df1 <- structure(list(negread_flag = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L,
1L, 1L, 0L)), row.names = c(NA, -11L), class = "data.frame")

How to find columns that fit an specific range (per individual) and add 1, else 0, using R

I have a data frame with three initial columns: ID, start and end positions. The rest of the columns are numeric chromosomal positions, and it looks like this:
ID start end 1 2 3 4 5 6 7 ... n
ind1 2 4
ind2 1 3
ind3 5 7
What I want is to fill in the empty columns (1:n) based on the range for every individual (start:end). For example, for the first individual (ind1) the range goes from position 2 to 4, so the positions inside the range are filled with one (1) and the positions outside the range with zero (0). The desired output should look like this:
ID start end 1 2 3 4 5 6 7 ... n
ind1 2 4 0 1 1 1 0 0 0 ... 0
ind2 1 3 1 1 1 0 0 0 0 ... 0
ind3 5 7 0 0 0 0 1 1 1 ... 0
I would appreciate any comments.
Supposing you know the number of columns you could use the between function from the data.table package:
cols <- paste0('c',1:7)
library(data.table)
setDT(DF)[, (cols) := lapply(1:7, function(x) +(between(x, start, end)))][]
which gives:
ID start end c1 c2 c3 c4 c5 c6 c7
1: ind1 2 4 0 1 1 1 0 0 0
2: ind2 1 3 1 1 1 0 0 0 0
3: ind3 5 7 0 0 0 0 1 1 1
Notes:
It is better not to name your columns with just numbers. Therefore I added a c at the start of the column names.
Using + in +(between(x, start, end)) is a kind of trick. The more idiomatic way is as.integer(between(x, start, end)).
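The coercion mentioned in the second note can be checked on its own. In this sketch the range test is written out in base R as an inclusive comparison, which is an assumption about what between does with its default settings; start, end and in_range are illustrative names.

```r
start <- c(2L, 1L, 5L)
end   <- c(4L, 3L, 7L)

# Is position 3 inside each individual's range (bounds included)?
in_range <- 3 >= start & 3 <= end
in_range
# [1]  TRUE  TRUE FALSE

# Both forms turn the logical vector into 1s and 0s
+in_range
as.integer(in_range)
# [1] 1 1 0
```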
Used data:
DF <- read.table(text="ID start end
ind1 2 4
ind2 1 3
ind3 5 7", header=TRUE)
If you were to begin with the data frame df, without the columns already added,
ID start end
1 ind1 2 4
2 ind2 1 3
3 ind3 5 7
you could do
mx <- max(df[-1])
M <- Map(function(x, y) replace(integer(mx), x:y, 1L), df$start, df$end)
cbind(df, do.call(rbind, M))
# ID start end 1 2 3 4 5 6 7
# 1 ind1 2 4 0 1 1 1 0 0 0
# 2 ind2 1 3 1 1 1 0 0 0 0
# 3 ind3 5 7 0 0 0 0 1 1 1
The number of new columns will equal the maximum of the start and end columns.
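The building block of this approach is replace applied to a vector of zeros. A minimal sketch for a single individual, with mx fixed to 7 here purely for illustration:

```r
mx <- 7L

# Start from seven zeros and set positions 2 to 4 to one (the range of ind1)
row1 <- replace(integer(mx), 2:4, 1L)
row1
# [1] 0 1 1 1 0 0 0
```

Map then repeats this for each start/end pair, and do.call(rbind, M) stacks the resulting vectors into a matrix with one row per individual.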
Data:
df <- structure(list(ID = structure(1:3, .Label = c("ind1", "ind2",
"ind3"), class = "factor"), start = c(2L, 1L, 5L), end = c(4L,
3L, 7L)), .Names = c("ID", "start", "end"), class = "data.frame", row.names = c(NA,
-3L))

Expanding a data.frame by replacing missing values with set of all possible values in R

I want to expand my dataset by replacing each incomplete row with the set of all possible rows. Does anyone have any suggestions for an efficient way to do this?
For example, suppose X and Z can each take values 0 or 1.
Input:
id y x z
1 1 0 0 NA
2 2 1 NA 0
3 3 0 1 1
4 4 1 NA NA
Output:
id y x z
1 1 0 0 0
2 1 0 0 1
3 2 1 0 0
4 2 1 1 0
5 3 0 1 1
6 4 1 0 0
7 4 1 0 1
8 4 1 1 0
9 4 1 1 1
At the moment I just work through the original dataset row by row:
for(i in 1:N){
if(is.na(temp.dat$x[i]) & !is.na(temp.dat$z[i])){
augment <- matrix(rep(temp.dat[i,],2),ncol=ncol(temp.dat),byrow=TRUE)
augment[,3] <- c(0,1)
}else
if(!is.na(temp.dat$x[i]) & is.na(temp.dat$z[i])){
augment <- matrix(rep(temp.dat[i,],2),ncol=ncol(temp.dat),byrow=TRUE)
augment[,4] <- c(0,1)
}else{
if(is.na(temp.dat$x[i]) & is.na(temp.dat$z[i])){
augment <- matrix(rep(temp.dat[i,],4),ncol=ncol(temp.dat),byrow=TRUE)
augment[,3] <- c(0,0,1,1)
augment[,4] <- c(0,1,0,1)
}
}
You could try the following:
1. Create an "indx" with the count of NAs in each row (rowSums(is.na(...))).
2. Use "indx" to expand the rows of the original dataset (df[rep(1:nrow...)).
3. Loop over "indx" (sapply), use each value as the "times" argument of rep, and call expand.grid on the values 0,1 to create "lst".
4. Split the expanded dataset, "df1", by "id" to get "lst2".
5. Use Map to replace the NA values in each element of "lst2" with the corresponding values in "lst".
6. rbind the list elements.
indx <- rowSums(is.na(df[-1]))
df1 <- df[rep(1:nrow(df), 2^indx),]
lst <- sapply(indx, function(x) expand.grid(rep(list(0:1), x)))
lst2 <- split(df1, df1$id)
res <- do.call(rbind,Map(function(x,y) {x[is.na(x)] <- as.matrix(y);x},
lst2, lst))
row.names(res) <- NULL
res
# id y x z
#1 1 0 0 0
#2 1 0 0 1
#3 2 1 0 0
#4 2 1 1 0
#5 3 0 1 1
#6 4 1 0 0
#7 4 1 1 0
#8 4 1 0 1
#9 4 1 1 1
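The row order for id 4 follows from how expand.grid enumerates combinations: the first variable changes fastest. A small sketch for a row with two missing values (combos is an illustrative name):

```r
# All combinations of 0/1 for two missing values
combos <- expand.grid(rep(list(0:1), 2))
combos
#   Var1 Var2
# 1    0    0
# 2    1    0
# 3    0    1
# 4    1    1
```

Because x[is.na(x)] <- as.matrix(y) fills the NA cells column by column, Var1 ends up in x and Var2 in z, which gives the same set of rows as the requested output, just in a different order.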
data
df <- structure(list(id = 1:4, y = c(0L, 1L, 0L, 1L), x = c(0L, NA,
1L, NA), z = c(NA, 0L, 1L, NA)), .Names = c("id", "y", "x", "z"
), class = "data.frame", row.names = c("1", "2", "3", "4"))

collapsing by groups and calculating n in data.frame using R

I have a data frame that contains three variables: treatment, dose, and outcome (plus or minus). I have multiple observations for each treatment and dose. I'm trying to output a contingency table that would collapse the data to indicate the number of each outcome as a function of the treatment and dose, as well as the number of observations. For example:
treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1
The desired output would be:
treatment dose outcome n
control 0 0 1 4
treatmentA 1 2 3
treatmentA 2 3 3
I've played around with this all day and haven't had much luck beyond being able to get a frequency for each outcome for each observation. Any suggestions would be appreciated (including pointing out what parts of the R manual and/or examples I've overlooked).
Thanks!
Here is a solution using the wonderful data.table package:
library(data.table)
x <- data.table(read.table( text = "treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1", header = TRUE)
x[, list(outcome = sum(outcome), count = .N), by = 'treatment,dose']
produces
treatment dose outcome count
1: control 0 1 4
2: treatmentA 1 2 3
3: treatmentA 2 3 3
If you don't want to use extra libraries as suggested in other answers, you can try the following.
> df
treatment dose outcome
1 control 0 0
2 control 0 0
3 control 0 0
4 control 0 1
5 treatmentA 1 0
6 treatmentA 1 1
7 treatmentA 1 1
8 treatmentA 2 1
9 treatmentA 2 1
10 treatmentA 2 1
> dput(df)
structure(list(treatment = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("control", "treatmentA"), class = "factor"),
dose = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 2L), outcome = c(0L,
0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L)), .Names = c("treatment",
"dose", "outcome"), class = "data.frame", row.names = c(NA, -10L
))
Now we use the aggregate function to get the count and the sum of the outcome column:
> nObs <- aggregate(outcome ~ treatment + dose, data = df, length)
> sObs <- aggregate(outcome ~ treatment + dose, data = df, sum)
Rename the aggregated columns appropriately
> names(nObs) <- c('treatment', 'dose', 'count')
> names(sObs) <- c('treatment', 'dose', 'sum')
> nObs
treatment dose count
1 control 0 4
2 treatmentA 1 3
3 treatmentA 2 3
> sObs
treatment dose sum
1 control 0 1
2 treatmentA 1 2
3 treatmentA 2 3
Use merge to combine the two results; by default it joins on all columns with the same name, treatment and dose in this case.
> result <- merge(nObs, sObs)
> result
treatment dose count sum
1 control 0 4 1
2 treatmentA 1 3 2
3 treatmentA 2 3 3
If I understand correctly, this is straightforward with the data.table library. First, load the library and read the data in:
library(data.table)
data <- read.table(header=TRUE, text="
treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1")
Next, create a data.table with the treatment and dose columns as the table keys (indices).
data <- data.table(data, key="treatment,dose")
Then aggregate using data.table syntax.
data[, list(outcome=sum(outcome), n=length(outcome)), by=list(treatment,dose)]
treatment dose outcome n
1: control 0 1 4
2: treatmentA 1 2 3
3: treatmentA 2 3 3
imho, sql is underrated. :)
# read in your example data as `x`
x <- read.table( text = "treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1", header = TRUE)
# load the sql data frame library
library(sqldf)
# create a new table of all unique `treatment` and `dose` columns,
# summing the `outcome` column and
# counting the number of records in each combo
y <- sqldf( 'SELECT treatment, dose ,
sum( outcome ) as outcome ,
count(*) as n
FROM x
GROUP BY treatment, dose' )
# check the results
y
Here are another couple of options (even though the data.table approach clearly wins in succinctness of syntax).
The first uses ave inside within. ave can apply a function to a variable (the first variable mentioned) grouped by one or more variables. We wrap the output in unique after dropping the now unnecessary "outcome" column.
unique(within(df, {
SUM <- ave(outcome, treatment, dose, FUN = sum)
COUNT <- ave(outcome, treatment, dose, FUN = length)
rm(outcome)
}))
# treatment dose COUNT SUM
# 1 control 0 4 1
# 5 treatmentA 1 3 2
# 8 treatmentA 2 3 3
A second solution in base R is very similar to @geektrader's answer, except it calculates both sum and length in one call to aggregate. There is a "downside" though: the result of that cbind is a "column" in your data.frame that is actually a matrix. See the result of str to see what I mean.
temp <- aggregate(outcome ~ treatment + dose, df,
function(x) cbind(sum(x), length(x)))
str(temp)
# 'data.frame': 3 obs. of 3 variables:
# $ treatment: Factor w/ 2 levels "control","treatmentA": 1 2 2
# $ dose : int 0 1 2
# $ outcome : int [1:3, 1:2] 1 2 3 4 3 3
colnames(temp$outcome) <- c("SUM", "COUNT")
temp
# treatment dose outcome.SUM outcome.COUNT
# 1 control 0 1 4
# 2 treatmentA 1 2 3
# 3 treatmentA 2 3 3
I mention storage structure as a "downside" mostly because you might not get what you expect when you try to access the data in ways you might be accustomed to.
temp$outcome.SUM
# NULL
temp$outcome
# SUM COUNT
# [1,] 1 4
# [2,] 2 3
# [3,] 3 3
Instead, you have to access it via:
temp$outcome[, "SUM"] ## or temp$outcome[, 1]
# [1] 1 2 3
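If the matrix column is inconvenient, one common idiom is to flatten it into ordinary columns with do.call(data.frame, ...). The sketch below rebuilds the example data on the assumption that it matches the table in the question; df_demo, temp2 and flat are names made up for this illustration, and the aggregating function returns a named vector instead of cbind so the column names are set in one step.

```r
# Example data, assumed to match the table in the question
df_demo <- data.frame(
  treatment = rep(c("control", "treatmentA"), times = c(4, 6)),
  dose      = c(0, 0, 0, 0, 1, 1, 1, 2, 2, 2),
  outcome   = c(0, 0, 0, 1, 0, 1, 1, 1, 1, 1)
)

# One aggregate call; "outcome" is a matrix column with SUM and COUNT
temp2 <- aggregate(outcome ~ treatment + dose, df_demo,
                   function(x) c(SUM = sum(x), COUNT = length(x)))

# Flatten the matrix column into two regular columns
flat <- do.call(data.frame, temp2)
names(flat)
flat$outcome.SUM
```

After flattening, flat$outcome.SUM and flat$outcome.COUNT can be accessed like any other column.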
