R data.table - add rows based on value

I have data.table x below
x <- data.table(id=c('A1', 'B1'), start=c(1,1), stop=c(4,5))
   id start stop
1: A1     1    4
2: B1     1    5
I would like to expand each row into steps of one unit from start to stop. Is it possible to use rbindlist together with Map to generate the data.table below?
   id start stop
   A1     1    2
   A1     2    3
   A1     3    4
   B1     1    2
   B1     2    3
   B1     3    4
   B1     4    5

You can create a sequence from start to stop for each id, use shift to get the next value as the new stop, and drop the NA rows.
library(data.table)
x <- x[, .(start = seq(start, stop)), id]
x[, stop := shift(start, type = 'lead'), id]
x[!is.na(stop)]
# id start stop
#1: A1 1 2
#2: A1 2 3
#3: A1 3 4
#4: B1 1 2
#5: B1 2 3
#6: B1 3 4
#7: B1 4 5
Here's an equivalent tidyverse way -
library(tidyverse)

x %>%
  mutate(start = map2(start, stop, seq)) %>%
  unnest(start) %>%
  group_by(id) %>%
  mutate(stop = lead(start)) %>%
  ungroup() %>%
  filter(!is.na(stop))
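To answer the literal question: yes, the rbindlist() + Map() route works as well. A minimal sketch, assuming x is the original two-row table from the question (before the reassignment above):

library(data.table)

# Build one small data.table of unit steps per row, then stack them
rbindlist(Map(
  function(i, s, e) data.table(id = i, start = s:(e - 1), stop = (s + 1):e),
  x$id, x$start, x$stop
))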

Related

R: creating combinations of elements within a group and adding up numbers associated with combinations in a new data frame

I have the following dataset:
Letter ID Number
A A1 1
A A2 2
A A3 3
B B1 1
B B2 2
B B3 3
B B4 4
My aim is first to create all possible combinations of IDs within the same "Letter" group. For example, for the letter A, there would be only three combinations: A1-A2, A2-A3, and A1-A3. The same IDs ordered differently don't count as a new combination, so for example A1-A2 is the same as A2-A1.
Then, within those combinations, I want to add up the numbers from the "Number" column associated with those IDs. So for the combination A1-A2, which are associated with 1 and 2 in the "Number" column, this would result in the number 1+2=3.
Finally, I want to place the ID combinations, added numbers and original Letter in a new data frame. Something like this:
Letter Combination Add.Number
A A1-A2 3
A A2-A3 5
A A1-A3 4
B B1-B2 3
B B2-B3 5
B B3-B4 7
B B1-B3 4
B B2-B4 6
B B1-B4 5
How can I do this in R, ideally using the package dplyr?
library(dplyr)

letter <- c("A","A","A","B","B","B","B")

df <- data.frame(letter) %>%
  group_by(letter) %>%
  mutate(
    number = row_number(),
    id = paste0(letter, number)
  )

df %>%
  full_join(df, by = "letter") %>%
  filter(number.x < number.y) %>%
  mutate(
    combination = paste0(id.x, "-", id.y),
    add_number = number.x + number.y) %>%
  select(letter, combination, add_number)
# A tibble: 9 x 3
# Groups: letter [2]
letter combination add_number
<chr> <chr> <int>
1 A A1-A2 3
2 A A1-A3 4
3 A A2-A3 5
4 B B1-B2 3
5 B B1-B3 4
6 B B1-B4 5
7 B B2-B3 5
8 B B2-B4 6
9 B B3-B4 7
In base R, using combn:
df <- data.frame(
Letter = c("A","A","A","B","B","B","B"),
Id = c("A1","A2","A3","B1","B2","B3","B4"),
Number = c(1,2,3,1,2,3,4))
# combinations
l <- lapply(split(df$Id, df$Letter), function(x)
  setNames(data.frame(t(combn(x, 2))), c("L1", "L2")))
n <- lapply(split(df$Number, df$Letter), function(x)
  setNames(data.frame(t(combn(x, 2))), c("N1", "N2")))
# rbind all
result <- do.call(rbind, mapply(cbind, Letter = names(l), l, n, SIMPLIFY = FALSE))
result$combination <- paste(result$L1, result$L2, sep = "-")
result$sum <- result$N1 + result$N2
result
#> Letter L1 L2 N1 N2 combination sum
#> A.1 A A1 A2 1 2 A1-A2 3
#> A.2 A A1 A3 1 3 A1-A3 4
#> A.3 A A2 A3 2 3 A2-A3 5
#> B.1 B B1 B2 1 2 B1-B2 3
#> B.2 B B1 B3 1 3 B1-B3 4
#> B.3 B B1 B4 1 4 B1-B4 5
#> B.4 B B2 B3 2 3 B2-B3 5
#> B.5 B B2 B4 2 4 B2-B4 6
#> B.6 B B3 B4 3 4 B3-B4 7
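As a side note (not part of the original answers), combn() also accepts a FUN argument, so the pair labels and sums can be computed directly within each group. A sketch, assuming the same df as above:

# Let combn() apply paste()/sum() to each pair of rows within a Letter group
result2 <- do.call(rbind, lapply(split(df, df$Letter), function(d) {
  data.frame(
    Letter      = d$Letter[1],
    Combination = combn(d$Id, 2, paste, collapse = "-"),
    Add.Number  = combn(d$Number, 2, sum)
  )
}))
result2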

R calculating the sum of values according to condition

Here is a data frame:
ID<-c(rep("A",3),rep("B",2), rep("C",3),rep("D",5))
cell<-c("a1","a2","a3","a1","a2","a1","a2", "a3","a1","a2","a1","a2","a3")
value<-c(2,5,3,4,5,6,9,8,7,2,5,2,4)
df<-as.data.frame(cbind(ID, cell, value))
I want to calculate the sum of all values for each ID up to cell "a2" (inclusive). The sequence of cells and IDs must be taken into account. Rows that are not followed by a cell "a2" (so no sum can be calculated for them) should not be taken into account.
As a result I would like to get this table:
Could you please help me to code this condition?
Thanks in advance.
Best regards, Inna
Assuming the data is already correctly ordered by cell:
library(tidyverse)

df %>%
  group_by(ID) %>%
  mutate(value = cumsum(value)) %>%
  filter(cell == "a2")
# # A tibble: 5 x 3
# # Groups: ID [4]
# ID cell value
# <chr> <chr> <dbl>
# 1 A a2 7
# 2 B a2 9
# 3 C a2 15
# 4 D a2 9
# 5 D a2 16
Treating each occurrence of "a2" as a different group, we can do:
library(dplyr)

df %>%
  # Create a group column, with every occurrence of cell == 'a2' starting a new group
  group_by(ID, grp = cumsum(lag(cell == 'a2', default = TRUE))) %>%
  # Remove groups that do not have 'a2' in them
  filter(any(cell == 'a2')) %>%
  # Sum up to the 'a2' value
  summarise(value = sum(value[seq_len(match('a2', cell))]),
            cell = last(cell)) %>%
  select(-grp)
# ID value cell
# <chr> <dbl> <chr>
#1 A 7 a2
#2 B 9 a2
#3 C 15 a2
#4 D 9 a2
#5 D 7 a2
A succinct solution using ave.
r <- transform(df, value=ave(value, ID, FUN=cumsum))[df$cell == "a2", ]
r
# ID cell value
# 2 A a2 7
# 5 B a2 9
# 7 C a2 15
# 10 D a2 9
# 12 D a2 16
An option with data.table
library(data.table)
setDT(df)[, value := cumsum(value) , ID][cell == 'a2']
Output:
# ID cell value
#1: A a2 7
#2: B a2 9
#3: C a2 15
#4: D a2 9
#5: D a2 16

How to get the top element per group with multiple columns?

I have the use-case shown below. Basically I have a data frame with three columns. I want to group by two columns (c1, c2) and sum the third one, c3. Then, for each c1, I want to pick only the row with the maximum c3 (among all c2), i.e. sorting would be unnecessary since I'm only interested in the max.
library(plyr)
df <- data.frame(c1=c('a','a','a','b','b','c'),c2=c('x','y','y','x','y','x'),c3=c(1,2,3,4,5,6))
df
c1 c2 c3
1 a x 1
2 a y 2
3 a y 3
4 b x 4
5 b y 5
6 c x 6
sel <- plyr::ddply(df, c('c1','c2'), plyr::summarize,c3=sum(c3))
sel[with(sel, order(c1,-c3)),]
c1 c2 c3
2 a y 5 <<< this one highest c3 for (c1,c2) combination
1 a x 1
4 b y 5 <<< this one highest c3 for (c1,c2) combination
3 b x 4
5 c x 6 <<< this one highest c3 for (c1,c2) combination
I could do this in a loop, but I'm wondering how it can be done in a vectorized fashion or using a higher-level function.
Here's a base R approach:
df2 <- aggregate(c3~c1+c2, df, sum)
subset(df2[order(-df2$c3),], !duplicated(c1))
# c1 c2 c3
#3 c x 6
#4 a y 5
#5 b y 5
Another solution from dplyr.
library(dplyr)

df2 <- df %>%
  group_by(c1, c2) %>%
  summarise(c3 = sum(c3)) %>%
  filter(c3 == max(c3))
df2
# A tibble: 3 x 3
# Groups: c1 [3]
c1 c2 c3
<fctr> <fctr> <dbl>
1 a y 5
2 b y 5
3 c x 6
Here is another option with data.table
library(data.table)
setDT(df)[, .(c3 = sum(c3)) , .(c1, c2)][, .SD[which.max(c3)], .(c1)]
# c1 c2 c3
#1: a y 5
#2: b y 5
#3: c x 6
Using dplyr:
df %>%
  group_by(c1, c2) %>%
  summarise(c3 = sum(c3)) %>%
  top_n(1, c3)
Or the last line can be slice(which.max(c3)), which will guarantee a single row.
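Spelled out (after summarise() the result is grouped by c1, so which.max() is applied within each c1 group):

df %>%
  group_by(c1, c2) %>%
  summarise(c3 = sum(c3)) %>%
  slice(which.max(c3))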

calculate summary by group and bring value back in the dataframe [duplicate]

This question already has answers here:
Calculate group mean, sum, or other summary stats. and assign column to original data
(4 answers)
Closed 5 years ago.
df <- data.frame(
id = c('A1','A2','A4','A2','A1','A4','A3','A2','A1','A3'),
value = c(4,3,1,3,4,6,6,1,8,4))
I want to get the max value within each id group. I tried the following, but got an error saying the replacement has 4 rows and the data has 10, which I understand but don't know how to correct:
df$max.by.id <- aggregate(value ~ id, df, max)
This is how I ended up doing it successfully:
max.by.id <- aggregate(value ~ id, df, max)
names(max.by.id) <- c("id", "max")
df2 <- merge(df,max.by.id, by.x = "id", by.y = "id")
df2
# id value max
#1 A1 4 8
#2 A1 4 8
#3 A1 8 8
#4 A2 3 3
#5 A2 3 3
#6 A2 1 3
#7 A3 6 6
#8 A3 4 6
#9 A4 1 6
#10 A4 6 6
Any better way? Thanks in advance.
ave() is the function for that task:
df$max.by.id <- ave(df$value, df$id, FUN=max)
example:
df <- data.frame(
id = c('A1','A2','A4','A2','A1','A4','A3','A2','A1','A3'),
value = c(4,3,1,3,4,6,6,1,8,4))
df$max.by.id <- ave(df$value, df$id, FUN=max)
The result of ave() has the same length as the original vector of values (which is also the length of the grouping variable). The resulting values are placed in the right positions with respect to the grouping variable. For more information, read the documentation of ave().
With data.table, you can compute the max by id "inside" the data, automatically adding the newly computed value (unique by id) as a column:
library(data.table)
setDT(df)[, max.by.id := max(value), by=id]
df
# id value max.by.id
# 1: A1 4 8
# 2: A2 3 3
# 3: A4 1 6
# 4: A2 3 3
# 5: A1 4 8
# 6: A4 6 6
# 7: A3 6 6
# 8: A2 1 3
# 9: A1 8 8
#10: A3 4 6
tapply(df$value, df$id, max)
# A1 A2 A3 A4
#  8  3  6  6
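Since the tapply() result is a named vector, it can also be mapped back onto the original rows (this step is not in the original answer) by indexing with the id values:

mx <- tapply(df$value, df$id, max)
df$max.by.id <- mx[as.character(df$id)]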
library(plyr)
ddply(df, .(id), function(df){max(df$value)})
# id V1
# 1 A1 8
# 2 A2 3
# 3 A3 6
# 4 A4 6
library(dplyr)
df %>% group_by(id) %>% arrange(desc(value)) %>% do(head(., 1))
# Source: local data frame [4 x 2]
# Groups: id [4]
# id value
# (fctr) (dbl)
# 1 A1 8
# 2 A2 3
# 3 A3 6
# 4 A4 6
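As a side note, in current dplyr (>= 1.0.0) the same "top row per group" step is usually written with slice_max(); a sketch, not part of the original answer:

library(dplyr)
df %>%
  group_by(id) %>%
  slice_max(value, n = 1, with_ties = FALSE)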
UPDATE:
If you need to keep the raw value, use the following code.
library(plyr)
ddply(df, .(id), function(df) {
  df$max.val <- max(df$value)
  return(df)
})
library(dplyr)
df %>% group_by(id) %>% mutate(max.val=max(value))
# Source: local data frame [10 x 3]
# Groups: id [4]
# id value max.val
# (fctr) (dbl) (dbl)
# 1 A1 4 8
# 2 A2 3 3
# 3 A4 1 6
# 4 A2 3 3
# 5 A1 4 8
# 6 A4 6 6
# 7 A3 6 6
# 8 A2 1 3
# 9 A1 8 8
# 10 A3 4 6

Changing dcast to show multiple columns

I have the following situation. Consider the following df:
mymatrix <- as.data.frame(matrix(data = 0, nrow = 7, ncol = 4))
colnames(mymatrix) <- c("Patient", "marker", "Number", "Visit")
mymatrix[,1] <- c("B1","B1","C1","C1","D1","D1","D1")
mymatrix[,2] <- c("A","A","A","A","A","A","A")
mymatrix[,3] <- c(1,0,0,15,1,2,13)
mymatrix[,4] <- c("baseline","followup","baseline","followup","baseline","followup","followup")
> mymatrix
Patient marker Number Visit
1 B1 A 1 baseline
2 B1 A 0 followup
3 C1 A 0 baseline
4 C1 A 15 followup
5 D1 A 1 baseline
6 D1 A 2 followup
7 D1 A 13 followup
If I do dcast on the first 6 rows I get:
> dcast(mymatrix[1:6,], Patient +marker~Visit, value.var = "Number")
Patient marker baseline followup
1 B1 A 1 0
2 C1 A 0 15
3 D1 A 1 2
If I do dcast on all the rows I get:
> dcast(mymatrix, Patient +marker~Visit, value.var = "Number")
Aggregation function missing: defaulting to length
Patient marker baseline followup
1 B1 A 1 1
2 C1 A 1 1
3 D1 A 1 2
Is there a way, instead of defaulting to length, to make it add a second followup column, so that the data would show as follows?
Patient marker baseline followup.1 followup.2
1 B1 A 1 0 NA
2 C1 A 0 15 NA
3 D1 A 1 2 13
Thanks!
It's not clear what you are asking, because it seems like you want to combine two different functions in dcast at the same time. It seems to me that you want to improve your first output rather than the second. If so, a simple solution would be just to add an automatic index to the values in the Visit column and then dcast. Here's a simple approach using the data.table package (though the output is not exactly what you want, because I've also added an index to baseline; it can get you started):
library(data.table)
setDT(mymatrix)[, Visit := paste(Visit, seq_len(.N), sep = "."), list(Patient, Visit)]
dcast.data.table(mymatrix, Patient + marker ~ Visit, value.var = "Number")
# Patient marker baseline.1 followup.1 followup.2
# 1: B1 A 1 0 NA
# 2: C1 A 0 15 NA
# 3: D1 A 1 2 13
You could also use base R
d1 <- transform(mymatrix, Visit = paste0(Visit, ave(seq_along(Number),
                                                    Patient, Visit, FUN = seq_along)))
reshape(d1, idvar = c('Patient', 'marker'), timevar = 'Visit', direction = 'wide')
# Patient marker Number.baseline1 Number.followup1 Number.followup2
#1 B1 A 1 0 NA
#3 C1 A 0 15 NA
#5 D1 A 1 2 13
Or dplyr/tidyr
library(dplyr)
library(tidyr)

mymatrix %>%
  group_by(Patient, Visit) %>%
  mutate(indx = row_number()) %>%
  ungroup() %>%
  unite(Visit1, Visit, indx) %>%
  spread(Visit1, Number)
# Patient marker baseline_1 followup_1 followup_2
#1 B1 A 1 0 NA
#2 C1 A 0 15 NA
#3 D1 A 1 2 13
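As a side note (not from the original answers), newer data.table versions (>= 1.9.8) can generate the per-group index directly in the dcast formula via rowid(). A sketch, assuming a fresh mymatrix as defined in the question:

library(data.table)

# rowid(Patient, Visit) numbers repeated Patient/Visit combinations 1, 2, ...
# so the two followup rows of D1 get indices 1 and 2, yielding columns like
# baseline_1, followup_1, followup_2
dcast(setDT(mymatrix), Patient + marker ~ Visit + rowid(Patient, Visit),
      value.var = "Number")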
