Counting occurrences of a variable without taking duplicates into account - R

I have a big data frame, called data, with 1,004,490 observations, and I want to analyse the success of a treatment.
ID POSITIONS TREATMENT
1 0 A
1 1 A
1 2 B
2 0 C
2 1 D
3 0 B
3 1 B
3 2 C
3 3 A
3 4 A
3 5 B
So first, I want to count the number of times each treatment was applied to a patient (ID), but one treatment can be given several times to the same ID. Do I need to delete all the duplicates first and then count, or is there a function that does not take the duplicates into account?
What I want to get:
A : 2
B : 2
C : 2
D : 1
Then, I want to know how many times each treatment was given at the last position, keeping in mind that the last position differs from one ID to another.
What I want to get:
A : 0
B : 2 (for ID = 1 and 3)
C : 0
D : 1 (for ID = 2)
Thanks for your help; I am a new user of R!

Using base R, we can do,
merge(aggregate(ID ~ TREATMENT, df1, FUN = function(i) length(unique(i))),
      aggregate(ID ~ TREATMENT, df1[!duplicated(df1$ID, fromLast = TRUE), ], toString),
      by = 'TREATMENT', all = TRUE)
Which gives,
TREATMENT ID.x ID.y
1 A 2 <NA>
2 B 2 1, 3
3 C 2 <NA>
4 D 1 2
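If plain counts per treatment are preferred for the second question (instead of the ID lists in ID.y), a small base R follow-up along the same lines, sketched against the df1 data shared below:
# keep the last row of each ID (rows are ordered by POSITIONS within ID here)
last_rows <- df1[!duplicated(df1$ID, fromLast = TRUE), ]
table(factor(last_rows$TREATMENT, levels = sort(unique(df1$TREATMENT))))
# A B C D
# 0 2 0 1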

Here is a tidyverse approach, where we get the distinct rows based on 'ID' and 'TREATMENT' and then count 'TREATMENT':
library(tidyverse)
df1 %>%
  distinct(ID, TREATMENT) %>%
  count(TREATMENT)
# A tibble: 4 x 2
# TREATMENT n
# <chr> <int>
#1 A 2
#2 B 2
#3 C 2
#4 D 1
And for the second output, after grouping by 'ID', slice the last row (n()), create a column 'ind', fill it with 0 for all missing combinations of 'TREATMENT' using complete, then get the sum of 'ind' after grouping by 'TREATMENT':
df1 %>%
  group_by(ID) %>%
  slice(n()) %>%
  mutate(ind = 1) %>%
  complete(TREATMENT = unique(df1$TREATMENT), fill = list(ind = 0)) %>%
  group_by(TREATMENT) %>%
  summarise(n = sum(ind))
# A tibble: 4 x 2
# TREATMENT n
# <chr> <dbl>
#1 A 0
#2 B 2
#3 C 0
#4 D 1
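If the rows are not guaranteed to be ordered by POSITIONS within each ID, slice_max() (available in dplyr 1.0.0 and later) can pick the last position explicitly; a sketch of the same pipeline with that swap:
df1 %>%
  group_by(ID) %>%
  slice_max(POSITIONS, n = 1) %>%   # row with the highest POSITIONS per ID
  mutate(ind = 1) %>%
  complete(TREATMENT = unique(df1$TREATMENT), fill = list(ind = 0)) %>%
  group_by(TREATMENT) %>%
  summarise(n = sum(ind))
which gives the same result as above for this data.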
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L), POSITIONS = c(0L, 1L, 2L, 0L, 1L, 0L, 1L, 2L, 3L, 4L, 5L
), TREATMENT = c("A", "A", "B", "C", "D", "B", "B", "C", "A",
"A", "B")), .Names = c("ID", "POSITIONS", "TREATMENT"),
class = "data.frame", row.names = c(NA, -11L))


How to randomise order of group within group in R/dplyr?

I have a group nested within another group in my data. I would like to randomise the order of the nested groups while preserving the order of the rows within each nested group. (This will be a step within an existing pipe, so a tidyverse solution would be ideal.)
In the example below, how do I randomise the order of block within participant_id, while also preserving the order of both participant_id and trial?
library(dplyr)
set.seed(123)
# dummy data
data <- tibble::tribble(
~participant_id, ~block, ~trial,
1L, "a", 1L,
1L, "a", 2L,
1L, "a", 3L,
1L, "b", 1L,
1L, "b", 2L,
1L, "b", 3L,
2L, "a", 1L,
2L, "a", 2L,
2L, "a", 3L,
2L, "b", 1L,
2L, "b", 2L,
2L, "b", 3L
)
# something along the lines of...
new_data <- data %>%
  group_by(participant_id) %>%
  # ? step here to randomise order within 'block', while preserving order within 'trial'.
Thanks.
Here's one approach:
# Randomise within one participant
# (in group_map, .x is the group's data without the grouping column, .y is the one-row group key)
randomiseGroup <- function(.x, .y) {
  # Generalise so that any number of blocks can be handled
  r <- .x %>%
    distinct(block) %>%
    mutate(random = runif(nrow(.)))
  # Randomise
  .y %>%
    bind_cols(
      .x %>%
        ungroup() %>%
        left_join(r, by = "block") %>%
        arrange(random, trial) %>%
        select(-random)
    )
}
# Randomise all participants
data %>%
  group_by(participant_id) %>%
  group_map(randomiseGroup) %>%
  bind_rows()
# A tibble: 12 × 3
participant_id block trial
<int> <chr> <int>
1 1 a 1
2 1 a 2
3 1 a 3
4 1 b 1
5 1 b 2
6 1 b 3
7 2 b 1
8 2 b 2
9 2 b 3
10 2 a 1
11 2 a 2
12 2 a 3
One option could be:
data %>%
  group_by(participant_id) %>%
  mutate(rleid = cumsum(block != lag(block, default = first(block))),
         block_random = sample(n())) %>%
  group_by(participant_id, rleid) %>%
  mutate(block_random = min(block_random)) %>%
  ungroup()
participant_id block trial rleid block_random
<int> <chr> <int> <int> <int>
1 1 a 1 0 2
2 1 a 2 0 2
3 1 a 3 0 2
4 1 b 1 1 1
5 1 b 2 1 1
6 1 b 3 1 1
7 2 a 1 0 2
8 2 a 2 0 2
9 2 a 3 0 2
10 2 b 1 1 1
11 2 b 2 1 1
12 2 b 3 1 1
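To actually reorder the rows by that key (a final step not shown above, so treat it as a sketch), one could arrange on block_random and trial and then drop the helper columns:
data %>%
  group_by(participant_id) %>%
  mutate(rleid = cumsum(block != lag(block, default = first(block))),
         block_random = sample(n())) %>%
  group_by(participant_id, rleid) %>%
  mutate(block_random = min(block_random)) %>%
  ungroup() %>%
  arrange(participant_id, block_random, trial) %>%  # random block order, trial order preserved
  select(-rleid, -block_random)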

Formatting a data.frame with binary values

I have a dataframe with 4 columns and 4 rows. For simplicity, I changed it to numeric format. The schema is as follows:
df <- structure(list(a = c(1, 2, 2, 0),
                     b = c(2, 1, 0, 2),
                     c = c(2, 2, 1, 1),
                     d = c(0, 2, 0, 1)), row.names = c(NA, -4L), class = "data.frame")
a b c d
1 1 2 2 0
2 2 1 2 2
3 2 0 1 0
4 0 2 1 1
I would like to change this data frame and obtain the following:
1 2
1 a b/c
2 b a/c/d
3 c a
4 c/d b
Is there a function or package I should look into? I have been doing lots of text processing in R recently. I'd appreciate your assistance!
tapply fun with some row and col indexes (stealing df from Ronak's answer):
tapply(
  colnames(df)[col(df)],
  list(row(df), unlist(df)),
  FUN = paste, collapse = "/"
)[, -1]
# 1 2
#1 "a" "b/c"
#2 "b" "a/c/d"
#3 "c" "a"
#4 "c/d" "b"
Basically I'm taking one long vector representing each column name in df and tabulating it by the combination of the row of df and the original values in df; the trailing [, -1] then drops the column for the value 0.
One way with dplyr and tidyr could be to get the data in long format, remove 0 values, and paste the column names together for each row and value combination. Finally, get the data back into wide format.
library(dplyr)
library(tidyr)
df %>%
  mutate(row = row_number()) %>%
  pivot_longer(cols = -row) %>%
  filter(value != 0) %>%
  group_by(row, value) %>%
  summarise(val = paste(name, collapse = "/")) %>%
  pivot_wider(names_from = value, values_from = val)
# row `1` `2`
# <int> <chr> <chr>
#1 1 a b/c
#2 2 b a/c/d
#3 3 c a
#4 4 c/d b
data
df <- structure(list(a = c(1L, 2L, 2L, 0L), b = c(2L, 1L, 0L, 2L),
c = c(2L, 2L, 1L, 1L), d = c(0L, 2L, 0L, 1L)), class = "data.frame",
row.names = c("1", "2", "3", "4"))

Counting the elements in rows and mapping them to columns in R

I would like to summarize my data by counting the entities and creating a count column for each entity.
Let's say:
df:
id class
1 A
1 B
1 A
1 A
1 B
1 c
2 A
2 B
2 B
2 D
I want to create a table like
id A B C D
1 3 2 1 0
2 1 2 0 1
How can I do this in R using the apply function?
df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
class = structure(c(1L, 2L, 1L, 1L, 2L, 3L, 1L, 2L, 2L, 4L
), .Label = c("A", "B", "C", "D"), class = "factor")), .Names = c("id",
"class"), class = "data.frame", row.names = c(NA, -10L))
with(df, table(id, class))
# class
#id A B C D
# 1 3 2 1 0
# 2 1 2 0 1
xtabs(~ id + class, df)
# class
#id A B C D
# 1 3 2 1 0
# 2 1 2 0 1
tapply(rep(1, nrow(df)), df, length, default = 0)
# class
#id A B C D
# 1 3 2 1 0
# 2 1 2 0 1
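Not apply-based, but for completeness, a dplyr/tidyr sketch of the same reshape on the df defined above (assuming a recent tidyr with pivot_wider), giving one row per id with a column per class and zeros where a class is absent:
library(dplyr)
library(tidyr)
df %>%
  count(id, class) %>%                      # count each id/class pair
  pivot_wider(names_from = class, values_from = n,
              values_fill = 0)              # one column per class, 0 where absent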
This seems like a very strange requirement, but if you insist on using apply, then the function count counts the number of rows for which id equals x and class equals y. It is applied to every combination of id and class to get a, using nested apply calls. Finally we add the row and column names.
uid <- unique(DF$id)
uclass <- unique(DF$class)
count <- function(x, y, DF) sum(x == DF$id & y == DF$class)
a <- apply(matrix(uclass), 1, function(u) apply(matrix(uid), 1, count, u, DF))
dimnames(a) <- list(uid, uclass)
giving:
> a
A B c D
1 3 2 1 0
2 1 2 0 1
Note
We used this for DF
Lines <- "id class
1 A
1 B
1 A
1 A
1 B
1 c
2 A
2 B
2 B
2 D"
DF <- read.table(text = Lines, header = TRUE)
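If the result should be a plain data.frame with id as its own column (as the desired table suggests) rather than a table object, one possible conversion of the table() output, using the first answer's df:
tab <- with(df, table(id, class))
data.frame(id = rownames(tab), as.data.frame.matrix(tab), row.names = NULL)
#   id A B C D
# 1  1 3 2 1 0
# 2  2 1 2 0 1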

Cumulative Count Paste

I have this dataset:
ID Set Type Count
1 1 1 A NA
2 2 1 R NA
3 3 1 R NA
4 4 1 U NA
5 5 1 U NA
6 6 1 U NA
7 7 2 A NA
8 8 3 R NA
9 9 3 R NA
As a dput:
mystart <- structure(list(ID = 1:9, Set = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
3L, 3L), Type = structure(c(1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L
), .Label = c("A", "R", "U"), class = "factor"), Count = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("ID", "Set", "Type",
"Count"), class = "data.frame", row.names = c(NA, -9L))
Using the dplyr package, how can I obtain this:
ID Set Type Count
1 1 1 A A1
2 2 1 R A1R1
3 3 1 R A1R2
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1
8 8 3 R R1
9 9 3 R R2
Again, as a dput:
myend <- structure(list(ID = 1:9, Set = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
3L, 3L), Type = structure(c(1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L
), .Label = c("A", "R", "U"), class = "factor"), Count = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 1L, 7L, 8L), .Label = c("A1", "A1R1", "A1R2",
"A1R2U1", "A1R2U2", "A1R2U3", "R1", "R2"), class = "factor")), .Names = c("ID",
"Set", "Type", "Count"), class = "data.frame", row.names = c(NA,
-9L))
In short, I want to count the observations of the column "Type" within the column "Set" and print this count cumulatively as text.
Examining similar posts, I got close to this:
myend <- structure(list(ID = 1:9, Set = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
3L, 3L), Type = structure(c(1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L
), .Label = c("A", "R", "U"), class = "factor"), Count = c(1L,
1L, 2L, 1L, 2L, 3L, 1L, 1L, 2L)), .Names = c("ID", "Set", "Type",
"Count"), class = "data.frame", row.names = c(NA, -9L))
With the code:
library(dplyr)
myend <- read.table("mydata.txt", header=TRUE, fill=TRUE)
myend %>%
  group_by(Set, Type) %>%
  mutate(Count = seq(n())) %>%
  ungroup()
Thank you very much for your help,
Base R version:
aggregateGroup <- function(x){
  vecs <- Reduce(f = function(a, b){ a[b] <- sum(a[b], 1L, na.rm = TRUE); a },
                 init = integer(0),
                 as.character(x),
                 accumulate = TRUE)
  # vecs is a list with something like this:
  # [[1]]
  # integer(0)
  # [[2]]
  # A
  # 1
  # [[3]]
  # A R
  # 1 1
  # ...
  # so we simply turn those vectors into characters using vapply and paste
  # (excluding the first)
  vapply(vecs, function(y) paste0(names(y), y, collapse = ''), FUN.VALUE = '')[-1]
}
# split<- writes each per-Set result back into the matching rows of mystart$Count
split(mystart$Count, mystart$Set) <- lapply(split(mystart$Type, mystart$Set), aggregateGroup)
> mystart
ID Set Type Count
1 1 1 A A1
2 2 1 R A1R1
3 3 1 R A1R2
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1
8 8 3 R R1
9 9 3 R R2
A dplyr version:
mystart %>%
  group_by(Set) %>%
  mutate(Count = paste0('A', cumsum(Type == 'A'),
                        'R', cumsum(Type == 'R'),
                        'U', cumsum(Type == 'U'))) %>%
  ungroup()
Which yields
# A tibble: 9 x 4
ID Set Type Count
<int> <int> <chr> <chr>
1 1 1 A A1R0U0
2 2 1 R A1R1U0
3 3 1 R A1R2U0
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1R0U0
8 8 3 R A0R1U0
9 9 3 R A0R2U0
If you want to omit the variables with count zero, you'd need to wrap a function around it like so:
mygroup <- function(lst) {
  name <- names(lst)
  vectors <- lapply(seq_along(lst), function(i) {
    x <- lst[[i]]
    char <- name[i]
    x <- ifelse(x == 0, "", paste0(char, x))
    return(x)
  })
  return(do.call("paste0", vectors))
}
mystart %>%
  group_by(Set) %>%
  mutate(Count = mygroup(list(A = cumsum(Type == 'A'),
                              R = cumsum(Type == 'R'),
                              U = cumsum(Type == 'U')))) %>%
  ungroup()
This yields
# A tibble: 9 x 4
ID Set Type Count
<int> <int> <chr> <chr>
1 1 1 A A1
2 2 1 R A1R1
3 3 1 R A1R2
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1
8 8 3 R R1
9 9 3 R R2
A one-liner with data.table.
You first have to do
require(data.table)
mystart <- as.data.table(mystart)
and then just use one line:
mystart[, .(Type,
            count = paste0('A', cumsum(Type == 'A'),
                           'R', cumsum(Type == 'R'),
                           'U', cumsum(Type == 'U'))),
        by = c('Set')]
First, cumsum each type and paste the results together by 'Set'.
cumsum(Type == 'A') equals the running count, since it is 1 when Type == 'A' and 0 otherwise.
You also wanted them pasted into one column, so paste0() is a good fit.
You still wanted the Type column, so I included Type in the call.
The output:
Set Type count
1: 1 A A1R0U0
2: 1 R A1R1U0
3: 1 R A1R2U0
4: 1 U A1R2U1
5: 1 U A1R2U2
6: 1 U A1R2U3
7: 2 A A1R0U0
8: 3 R A0R1U0
9: 3 R A0R2U0
Hope this helps.
By the way, if you want zero counts ignored, you have to design some if-else logic yourself:
basically, if cumsum(something) == 0 emit nothing, else paste0('something', cumsum(something)), and then paste0() the pieces together, as sketched below.
It gets a bit messy, but you get the idea.
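One rough sketch of that idea, with a small helper (here called piece, an illustrative name) that emits an empty string for zero counts before pasting within each Set:
# hypothetical helper: returns "" when the running count is 0
piece <- function(lbl, n) ifelse(n == 0, "", paste0(lbl, n))
mystart[, Count := paste0(piece('A', cumsum(Type == 'A')),
                          piece('R', cumsum(Type == 'R')),
                          piece('U', cumsum(Type == 'U'))),
        by = Set]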
Here's a base solution.
We can paste the raw letters to seq_along of the letter groups to get the last two characters, then paste the result onto the last element of the previous result, using Reduce.
On top of this we use ave to compute by group.
fun <- function(x, y) paste0(x[length(x)], y, seq_along(y))
mystart$Count <- ave(as.character(mystart$Type), mystart$Set,
                     FUN = function(x) unlist(Reduce(fun, split(x, x), init = NULL, accumulate = TRUE)))
# ID Set Type Count
# 1 1 1 A A1
# 2 2 1 R A1R1
# 3 3 1 R A1R2
# 4 4 1 U A1R2U1
# 5 5 1 U A1R2U2
# 6 6 1 U A1R2U3
# 7 7 2 A A1
# 8 8 3 R R1
# 9 9 3 R R2
Details
split(x, x) splits the letters as shown here for the first Set:
with(subset(mystart,Set==1),split(Type,Type))
# $A
# [1] "A"
#
# $R
# [1] "R" "R"
#
# $U
# [1] "U" "U" "U"
Then fun does this type of operation, helped by Reduce:
fun(NULL,"A") # [1] "A1"
fun("A1",c("R","R")) # [1] "A1R1" "A1R2"
fun(c("A1R1","A1R2"),c("U","U","U")) # [1] "A1R2U1" "A1R2U2" "A1R2U3"
Bonus solution
This other base solution, using rle and avoiding split, gives the same output for the given example (and whenever Type values are grouped within Sets), but not with mystart2 <- rbind(mystart, mystart), for instance.
fun2 <- function(x){
  rle_ <- rle(x)
  suffix <- paste0(x, sequence(rle_$lengths))
  # note: lag() below is dplyr::lag (a shift by n), not stats::lag
  prefix <- unlist(mapply(rep,
                          lag(unlist(
                            Reduce(paste0, paste0(rle_$values, rle_$lengths), accumulate = TRUE)
                          ), rle_$lengths[1]),
                          each = rle_$lengths))
  prefix[is.na(prefix)] <- ""
  paste0(prefix, suffix)
}
mystart$Count2 <- ave(as.character(mystart$Type), mystart$Set, FUN = fun2)
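As a quick sanity check on this example (assuming the dplyr lag() noted above is in use):
identical(mystart$Count, mystart$Count2)
# [1] TRUE (true for this example; as noted above, equality is not guaranteed in general)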
Many elegant solutions have been provided for the problem. Still, I was looking for a dplyr way that does not rely on cumsum over a fixed set of types; the function below is generic enough to handle additional values of Type.
A solution with the help of a custom function:
library(dplyr)
cumCat <- function(x, y){
  retVal <- character(length(x))
  prevVal <- ""
  lastGrpVal <- ""
  for (i in seq_along(x)){
    if (y[i] == 1){
      lastGrpVal <- prevVal
    }
    retVal[i] <- paste0(lastGrpVal, x[i])
    prevVal <- retVal[i]
  }
  retVal
}
mystart %>%
  group_by(Set, Type) %>%
  mutate(type_count = row_number()) %>%
  mutate(TypeMod = paste0(Type, type_count)) %>%
  group_by(Set) %>%
  mutate(Count = cumCat(TypeMod, type_count)) %>%
  select(-type_count, -TypeMod)
# # Groups: Set [3]
# ID Set Type Count
# <int> <int> <fctr> <chr>
# 1 1 1 A A1
# 2 2 1 R A1R1
# 3 3 1 R A1R2
# 4 4 1 U A1R2U1
# 5 5 1 U A1R2U2
# 6 6 1 U A1R2U3
# 7 7 2 A A1
# 8 8 3 R R1
# 9 9 3 R R2
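For intuition, an illustrative call of cumCat() on a small hand-made input:
cumCat(c("A1", "R1", "R2", "U1"), c(1, 1, 2, 1))
# [1] "A1"     "A1R1"   "A1R2"   "A1R2U1"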

How to get the top cases for each group using dplyr? [duplicate]

This question already has answers here:
Getting the top values by group (6 answers)
Transpose / reshape dataframe without "timevar" from long to wide format (9 answers)
Convert data from long format to wide format with multiple measure columns (6 answers)
Closed 4 years ago.
I have a data table with 3 columns: ID, Type, and Count. For each ID, I want to get the Types with the top 2 Counts within that ID and flatten the result into one row. For example, if my data table looks like this:
ID Type Count
A 1 8
B 1 3
A 2 5
A 3 2
B 2 1
B 3 4
Then I want my output to be two rows like below:
ID Top1Type Top1TypeCount Top2Type Top2TypeCount
A 1 8 2 5
B 3 4 1 3
Can anyone tell me how to achieve this using the dplyr library in R? Thank you very much.
It's usually better to keep your data in a long/tidy format. To achieve that, you can use:
df1 %>% group_by(ID) %>% top_n(2, Count) %>% arrange(ID)
which gives:
ID Type Count
(fctr) (int) (int)
1 A 1 8
2 A 2 5
3 B 1 3
4 B 3 4
When you have ties, you can use slice to select an equal number of observations for each group:
# some example data
df2 <- structure(list(ID = structure(c(1L, 2L, 1L, 1L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
Type = c(1L, 1L, 2L, 3L, 2L, 3L),
Count = c(8L, 3L, 8L, 8L, 1L, 4L)),
.Names = c("ID", "Type", "Count"), class = "data.frame", row.names = c(NA, -6L))
Without slice():
df2 %>% group_by(ID) %>% top_n(2, Count) %>% arrange(ID)
gives:
ID Type Count
(fctr) (int) (int)
1 A 1 8
2 A 2 8
3 A 3 8
4 B 1 3
5 B 3 4
With the use of slice():
df2 %>% group_by(ID) %>% top_n(2, Count) %>% arrange(ID) %>% slice(1:2)
gives:
ID Type Count
(fctr) (int) (int)
1 A 1 8
2 A 2 8
3 B 1 3
4 B 3 4
With arrange you can determine the order of the cases and thus which are selected by slice. The following:
df2 %>% group_by(ID) %>% top_n(2, Count) %>% arrange(ID, -Type) %>% slice(1:2)
gives this result:
ID Type Count
(fctr) (int) (int)
1 A 3 8
2 A 2 8
3 B 3 4
4 B 1 3
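In newer dplyr (1.0.0 and later), slice_max() is the recommended replacement for top_n(); a minimal sketch of the same selection on df2, assuming that version is available:
df2 %>%
  group_by(ID) %>%
  slice_max(Count, n = 2, with_ties = FALSE) %>%  # keep exactly two rows per ID, breaking ties by row order
  arrange(ID)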
Using data.table, we convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'ID', order 'Count' in descending order and subset the first two rows (head(.SD, 2)). Then we create a sequence column ('N') grouped by 'ID' and dcast from 'long' to 'wide'; the data.table dcast can take multiple value.var columns.
library(data.table)  # v1.9.6+
DT <- setDT(df1)[order(-Count), head(.SD, 2), by = ID]
DT[, N := 1:.N, by = ID]
dcast(DT, ID ~ paste0('Top', N),
      value.var = c('Type', 'Count'), fill = 0)
# ID Type_Top1 Type_Top2 Count_Top1 Count_Top2
#1: A 1 2 8 5
#2: B 3 1 4 3
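A tidyverse route to the same wide layout is also possible; a sketch with dplyr and tidyr (assuming slice_max and pivot_wider are available):
library(dplyr)
library(tidyr)
df1 %>%
  group_by(ID) %>%
  slice_max(Count, n = 2, with_ties = FALSE) %>%
  arrange(ID, desc(Count)) %>%
  mutate(rank = paste0("Top", row_number())) %>%  # Top1, Top2 within each ID
  ungroup() %>%
  pivot_wider(names_from = rank, values_from = c(Type, Count))
which reproduces the ID / Type_Top1 / Type_Top2 / Count_Top1 / Count_Top2 layout shown above.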
data
df1 <- structure(list(ID = c("A", "B", "A", "A", "B", "B"),
Type = c(1L,
1L, 2L, 3L, 2L, 3L), Count = c(8L, 3L, 5L, 2L, 1L, 4L)),
.Names = c("ID",
"Type", "Count"), class = "data.frame", row.names = c(NA, -6L))
