I would like to create a new variable "Count" that is a count of the unique values of a factor "Period", by grouping variable "ID". The following data includes a column with the values I would want in "Count":
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("a", "b"), class = "factor"), Period = c(1.1, 1.1,
1.2, 1.3, 1.2, 1.3, 1.5, 1.5), Count = c(1L, 1L, 2L, 3L, 1L,
2L, 3L, 3L)), .Names = c("ID", "Period", "Count"), class = "data.frame", row.names = c(NA,
-8L))
I tried to use mutate with Count = 1:length(Period) but it creates a cumulative count of each value of "Period", whereas I want a cumulative count of only unique values. This is what I tried:
library(plyr)
samp1<-ddply(samp, .(ID, Period), mutate, Count = 1:length(Period))
Could anyone provide the correct function to use?
Edit - new answer
Now that I think about it some more, my initial approach won't return correct results if each group's elements aren't consecutive, so for example for
v <- c(1, 3, 2, 2, 1, 2)
My function will put the non-consecutive 1s and 2s into different groups
myrleid(v)
## [1] 1 2 3 3 4 5
Thus, the best approach seems to be
match(v, unique(v))
## [1] 1 2 3 3 1 3
This will both preserve the order of appearance and keep non-consecutive repeats in the same group.
Thus, I would recommend just doing
library(data.table)
setDT(df)[, Count2 := match(Period, unique(Period)), by = ID]
or (with base R)
with(df, ave(Period, ID, FUN = function(x) match(x, unique(x))))
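Both give the per-ID counts in order of appearance; as a quick check, on the example data the base R call returns
## [1] 1 1 2 3 1 2 3 3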
Old answer
Looks like a good candidate for the rleid function from the data.table development version on GitHub
### Devel version installation instructions
# library(devtools)
# install_github("Rdatatable/data.table", build_vignettes = FALSE)
library(data.table) # v 1.9.5+
setDT(df)[, Count2 := rleid(Period), by = ID]
df
# ID Period Count Count2
# 1: a 1.1 1 1
# 2: a 1.1 1 1
# 3: a 1.2 2 2
# 4: a 1.3 3 3
# 5: b 1.2 1 1
# 6: b 1.3 2 2
# 7: b 1.5 3 3
# 8: b 1.5 3 3
Or, if you don't want to load external packages, we could define this function on our own
myrleid <- function(x) {
  temp <- rle(x)$lengths
  rep.int(seq_along(temp), temp)
}
with(df, ave(Period, ID, FUN = myrleid))
## [1] 1 1 2 3 1 2 3 3
Or if the groups are in increasing order, you could try ranking them too
library(data.table) ## V1.9.5+
setDT(df)[, Count2 := frank(Period, ties.method = "dense"), by = ID]
Or
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Count2 = dense_rank(Period))
samp <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("a", "b"), class = "factor"), Period = c(1.1, 1.1,
1.2, 1.3, 1.2, 1.3, 1.5, 1.5), Count = c(1L, 1L, 2L, 3L, 1L,
2L, 3L, 3L)), .Names = c("ID", "Period", "Count"), class = "data.frame", row.names = c(NA,
-8L))
select(samp, -Count) %>%
  arrange(ID, Period) %>%
  group_by(ID) %>%
  mutate(dup = !duplicated(Period),
         Count = cumsum(dup))
The key steps are to arrange by ID and Period, and then to flag the first appearance of each Period within an ID as "not duplicated", so that the cumulative sum only advances there.
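For reference, on the example data this yields roughly the following (the helper column dup can be dropped afterwards):
#   ID Period   dup Count
# 1  a    1.1  TRUE     1
# 2  a    1.1 FALSE     1
# 3  a    1.2  TRUE     2
# 4  a    1.3  TRUE     3
# 5  b    1.2  TRUE     1
# 6  b    1.3  TRUE     2
# 7  b    1.5  TRUE     3
# 8  b    1.5 FALSE     3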
A solution in base R with transform:
transform(df, Count2 = unlist(
tapply(df$Period, df$ID, function(x)
as.numeric(factor(x)))
))
ID Period Count Count2
a1 a 1.1 1 1
a2 a 1.1 1 1
a3 a 1.2 2 2
a4 a 1.3 3 3
b1 b 1.2 1 1
b2 b 1.3 2 2
b3 b 1.5 3 3
b4 b 1.5 3 3
As David suggested, this solution does not work well if the Period values are not monotonically increasing within each ID.
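A minimal sketch of that caveat, on a hypothetical non-monotonic Period: factor() numbers its levels in sorted order, whereas the match() approach above numbers values in order of appearance.
p <- c(1.3, 1.1, 1.1, 1.2)
as.numeric(factor(p))
# [1] 3 1 1 2
match(p, unique(p))
# [1] 1 2 2 3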
Related
Consider:
Pair X
1 2
1 3
2 1
2 2
If I have a dataframe that is structured like the example above, how do I normalize column X given the Pair column?
In other words, the two elements should be translated to 0.4 (2/(2+3)) and 0.6 (3/(2+3)) for Pair 1, and to 0.333 (1/(1+2)) and 0.667 (2/(1+2)) for Pair 2.
We can use base R
df$norm <- with(df, X/rowsum(X, Pair)[Pair])
df$norm
#[1] 0.4000000 0.6000000 0.3333333 0.6666667
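For what it's worth, a peek at how the indexing works: rowsum(X, Pair) returns one summed row per Pair, and [Pair] expands those totals back to one value per original row before the division (this relies on Pair being 1 and 2; with arbitrary group labels you would index the rows by name instead, e.g. rowsum(X, Pair)[as.character(Pair), ]).
with(df, rowsum(X, Pair))
#  [,1]
#1    5
#2    3
with(df, rowsum(X, Pair)[Pair])
#[1] 5 5 3 3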
data
df <- structure(list(Pair = c(1L, 1L, 2L, 2L), X = c(2L, 3L, 1L, 2L
)), class = "data.frame", row.names = c(NA, -4L))
You can do this in base R:
df$norm <- with(df, X/ave(X, Pair, FUN = sum))
df
# Pair X norm
#1 1 2 0.40
#2 1 3 0.60
#3 2 1 0.33
#4 2 2 0.67
dplyr
library(dplyr)
df %>% group_by(Pair) %>% mutate(norm = X/sum(X))
and data.table:
library(data.table)
setDT(df)[, norm := X/sum(X), Pair]
data
df <- structure(list(Pair = c(1L, 1L, 2L, 2L), X = c(2L, 3L, 1L, 2L
)), class = "data.frame", row.names = c(NA, -4L))
I have a large dataset. The example below is a much abbreviated version.
There are two data frames, df1 and df2. I would like to map, to each row of df1, a derived value using conditions from df2 with arguments from df1.
Hopefully the example below makes this clearer.
year <- rep(1996:1997, each=3)
age_group <- rep(c("20-24","25-29","30-34"),2)
df1 <- as.data.frame(cbind(year,age_group))
df1 is a data frame with all combinations of year and age group.
df2 <- as.data.frame(rbind(c(111,1997,"20-24"),c(222,1997,"30-34")))
names(df2) <- c("id","year","age.group")
df2 is a data frame where each row represents an individual in a particular year.
I would like to use arguments from df1, conditional on values from df2, and then map the results back onto df1. My attempt is as follows:
each_yr <- map(df1, function(year, age_group) case_when(
  as.character(df1$year) == as.character(df2$year) &
    as.character(df1$age_group) == as.character(df2$age.group) ~ 0,
  TRUE ~ 1))
The output I get is wrong and is shown below:
structure(list(year = c(1, 1, 1, 1, 1, 0), age_group = c(1, 1,
1, 1, 1, 0)), .Names = c("year", "age_group"))
The output I would ideally like is something like this (a data frame as an example, but I would be happy with a list):
structure(list(year = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("1996",
"1997"), class = "factor"), age_group = structure(c(1L, 2L, 3L,
1L, 2L, 3L), .Label = c("20-24", "25-29", "30-34"), class = "factor"),
v1 = structure(c(2L, 2L, 2L, 1L, 2L, 2L), .Label = c("0",
"1"), class = "factor"), v2 = structure(c(2L, 2L, 2L, 2L,
2L, 1L), .Label = c("0", "1"), class = "factor")), .Names = c("year",
"age_group", "v1", "v2"), row.names = c(NA, -6L), class = "data.frame")
I have used map before when 'df1' is a vector, but in this scenario it is a data frame where both columns are used as arguments. Can Map handle this?
In df3 (the desired output above), column v1 is the result of conditions based on df1 and df2, mapped onto df1 for patient '111'. Likewise column v2 is the outcome for patient '222'.
Thanks in advance
Looks like some work for pmap instead. And a touch of tidyr to get the suggested result.
purrr::pmap(list(df2$id, as.character(df2$year), as.character(df2$age.group)),
            function(id, x, y)
              data.frame(df1,
                         key = paste0("v", id),
                         value = 1 - as.integer((x == df1$year) & (y == df1$age_group)),
                         stringsAsFactors = FALSE)) %>%
  replyr::replyr_bind_rows() %>%
  tidyr::spread(key, value)
# year age_group v1 v2
#1 1996 20-24 1 1
#2 1996 25-29 1 1
#3 1996 30-34 1 1
#4 1997 20-24 0 1
#5 1997 25-29 1 1
#6 1997 30-34 1 0
Within the tidyverse you can do it this way:
library(tidyverse)
#library(dplyr)
#library(tidyr)
df2 %>%
  mutate(tmp = 0) %>%
  spread(id, tmp, fill = 1, sep = "_") %>%
  right_join(df1, by = c("year", "age.group" = "age_group")) %>%
  mutate_at(vars(-c(1, 2)), coalesce, 1)
# year age.group id_111 id_222
# 1 1996 20-24 1 1
# 2 1996 25-29 1 1
# 3 1996 30-34 1 1
# 4 1997 20-24 0 1
# 5 1997 25-29 1 1
# 6 1997 30-34 1 0
#Warning messages:
# 1: Column `year` joining factors with different levels, coercing to character vector
# 2: Column `age.group`/`age_group` joining factors with different levels, coercing to
# character vector
I have this dataset:
ID Set Type Count
1 1 1 A NA
2 2 1 R NA
3 3 1 R NA
4 4 1 U NA
5 5 1 U NA
6 6 1 U NA
7 7 2 A NA
8 8 3 R NA
9 9 3 R NA
As dput output:
mystart <- structure(list(ID = 1:9, Set = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
3L, 3L), Type = structure(c(1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L
), .Label = c("A", "R", "U"), class = "factor"), Count = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("ID", "Set", "Type",
"Count"), class = "data.frame", row.names = c(NA, -9L))
By using the dplyr package, how can I obtain this:
ID Set Type Count
1 1 1 A A1
2 2 1 R A1R1
3 3 1 R A1R2
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1
8 8 3 R R1
9 9 3 R R2
Again, as dput output:
myend <- structure(list(ID = 1:9, Set = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
3L, 3L), Type = structure(c(1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L
), .Label = c("A", "R", "U"), class = "factor"), Count = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 1L, 7L, 8L), .Label = c("A1", "A1R1", "A1R2",
"A1R2U1", "A1R2U2", "A1R2U3", "R1", "R2"), class = "factor")), .Names = c("ID",
"Set", "Type", "Count"), class = "data.frame", row.names = c(NA,
-9L))
In short, I want to count the observations of column "Type" within column "Set" and print this count as cumulative text.
Examining similar posts, I got close to this:
myend <- structure(list(ID = 1:9, Set = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
3L, 3L), Type = structure(c(1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L
), .Label = c("A", "R", "U"), class = "factor"), Count = c(1L,
1L, 2L, 1L, 2L, 3L, 1L, 1L, 2L)), .Names = c("ID", "Set", "Type",
"Count"), class = "data.frame", row.names = c(NA, -9L))
With the code:
library(dplyr)
myend <- read.table("mydata.txt", header=TRUE, fill=TRUE)
myend %>%
group_by(Set, Type) %>%
mutate(Count = seq(n())) %>%
ungroup(myend)
Thank you very much for your help,
Base R version :
aggregateGroup <- function(x){
  vecs <- Reduce(f = function(a, b){ a[b] <- sum(a[b], 1L, na.rm = TRUE); a },
                 init = integer(0),
                 as.character(x),
                 accumulate = TRUE)
  # vecs is a list with something like this:
  # [[1]]
  # integer(0)
  # [[2]]
  # A
  # 1
  # [[3]]
  # A R
  # 1 1
  # ...
  # so we simply turn those vectors into characters using vapply and paste
  # (excluding the first)
  vapply(vecs, function(y) paste0(names(y), y, collapse = ''), FUN.VALUE = '')[-1]
}
split(mystart$Count,mystart$Set) <- lapply(split(mystart$Type,mystart$Set), aggregateGroup)
> mystart
ID Set Type Count
1 1 1 A A1
2 2 1 R A1R1
3 3 1 R A1R2
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1
8 8 3 R R1
9 9 3 R R2
A dplyr version:
mystart %>%
  group_by(Set) %>%
  mutate(Count = paste0('A', cumsum(Type == 'A'),
                        'R', cumsum(Type == 'R'),
                        'U', cumsum(Type == 'U'))) %>%
  ungroup()
Which yields
# A tibble: 9 x 4
ID Set Type Count
<int> <int> <chr> <chr>
1 1 1 A A1R0U0
2 2 1 R A1R1U0
3 3 1 R A1R2U0
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1R0U0
8 8 3 R A0R1U0
9 9 3 R A0R2U0
If you want to omit the types whose count is zero, you'd need to wrap a function around it, like so:
mygroup <- function(lst) {
  name <- names(lst)
  vectors <- lapply(seq_along(lst), function(i) {
    x <- lst[[i]]
    char <- name[i]
    x <- ifelse(x == 0, "", paste0(char, x))
    return(x)
  })
  return(do.call("paste0", vectors))
}
mystart %>%
  group_by(Set) %>%
  mutate(Count = mygroup(list(A = cumsum(Type == 'A'),
                              R = cumsum(Type == 'R'),
                              U = cumsum(Type == 'U')))) %>%
  ungroup()
This yields
# A tibble: 9 x 4
ID Set Type Count
<int> <int> <chr> <chr>
1 1 1 A A1
2 2 1 R A1R1
3 3 1 R A1R2
4 4 1 U A1R2U1
5 5 1 U A1R2U2
6 6 1 U A1R2U3
7 7 2 A A1
8 8 3 R R1
9 9 3 R R2
A one-line solution with data.table.
First convert mystart to a data.table:
require(data.table)
mystart <- as.data.table(mystart)
Then it's just one line:
mystart[, .(Type,
            count = paste0('A', cumsum(Type == 'A'),
                           'R', cumsum(Type == 'R'),
                           'U', cumsum(Type == 'U'))),
        by = c('Set')]
First you cumsum() each Type and paste the results together, by 'Set'.
cumsum(Type == 'A') equals the running count, since the comparison is 1 when Type is A and 0 otherwise (see the tiny illustration below).
You also wanted everything pasted into one column, so paste0() is the tool for that.
You still wanted the Type column, so I included Type in the call.
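As a quick aside, a tiny illustration of the cumsum point on a throwaway vector:
tp <- c("A", "R", "R", "U")
tp == "R"
# [1] FALSE  TRUE  TRUE FALSE
cumsum(tp == "R")
# [1] 0 1 2 2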
The output:
Set Type count
1: 1 A A1R0U0
2: 1 R A1R1U0
3: 1 R A1R2U0
4: 1 U A1R2U1
5: 1 U A1R2U2
6: 1 U A1R2U3
7: 2 A A1R0U0
8: 3 R A0R1U0
9: 3 R A0R2U0
Hope this helps.
By the way, if you want counts of 0 ignored, you'd have to design some if-else logic yourself:
basically, if cumsum(something) == 0 emit nothing, else paste0('something', cumsum(something)), and then paste0() the pieces together.
It gets messy, so I won't write it out in full, but you get the idea; a rough sketch follows.
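A rough sketch of what that could look like (my own hedged take, assuming mystart is already a data.table as above and hard-coding the three Type values):
mystart[, Count := {
  parts <- lapply(c('A', 'R', 'U'), function(tp) {
    n <- cumsum(Type == tp)
    ifelse(n == 0, '', paste0(tp, n))  # drop a piece while its running count is still 0
  })
  do.call(paste0, parts)  # glue the remaining pieces together row by row
}, by = Set]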
Here's a base solution.
We can paste the raw letters to seq_along of each letter group to get the last two characters, then paste that onto the last element of the previous result, using Reduce.
On top of this we use ave to compute by group.
fun <- function(x,y) paste0(x[length(x)],y,seq_along(y))
mystart$Count <- ave(as.character(mystart$Type),mystart$Set,
FUN = function(x) unlist(Reduce(fun,split(x,x),init=NULL,accumulate = TRUE)))
# ID Set Type Count
# 1 1 1 A A1
# 2 2 1 R A1R1
# 3 3 1 R A1R2
# 4 4 1 U A1R2U1
# 5 5 1 U A1R2U2
# 6 6 1 U A1R2U3
# 7 7 2 A A1
# 8 8 3 R R1
# 9 9 3 R R2
Details
split(x,x) splits letters as shown here for first Set:
with(subset(mystart,Set==1),split(Type,Type))
# $A
# [1] "A"
#
# $R
# [1] "R" "R"
#
# $U
# [1] "U" "U" "U"
Then fun does this type of operation, helped by Reduce:
fun(NULL,"A") # [1] "A1"
fun("A1",c("R","R")) # [1] "A1R1" "A1R2"
fun(c("A1R1","A1R2"),c("U","U","U")) # [1] "A1R2U1" "A1R2U2" "A1R2U3"
Bonus solution
This other base solution, using rle and avoiding split, gives the same output for the given example (and whenever Type values are consecutive within Sets), but not with mystart2 <- rbind(mystart, mystart), for instance.
fun2 <- function(x){
  # note: lag() as used here behaves like dplyr::lag() (shift by n); stats::lag() would not shift a plain vector
  rle_ <- rle(x)
  suffix <- paste0(x, sequence(rle_$lengths))
  prefix <- unlist(mapply(rep,
                          lag(unlist(
                            Reduce(paste0, paste0(rle_$values, rle_$lengths), accumulate = TRUE)
                          ), rle_$lengths[1]),
                          each = rle_$lengths))
  prefix[is.na(prefix)] <- ""
  paste0(prefix, suffix)
}
mystart$Count2 <-ave(as.character(mystart$Type), mystart$Set,FUN=fun2)
Many elegant solutions have been provided for the problem. Still, I was looking for a dplyr way (without cumsum on hard-coded types). The function below is generic enough to handle additional values of Type.
A solution with the help of a custom function (defined first so the pipeline runs top to bottom):
library(dplyr)
cumCat <- function(x, y){
  retVal <- character(length(x))
  prevVal <- ""
  lastGrpVal <- ""
  for (i in seq_along(x)){
    if (y[i] == 1){
      lastGrpVal <- prevVal
    }
    retVal[i] <- paste0(lastGrpVal, x[i])
    prevVal <- retVal[i]
  }
  retVal
}
mystart %>% group_by(Set, Type) %>%
  mutate(type_count = row_number()) %>%
  mutate(TypeMod = paste0(Type, type_count)) %>%
  group_by(Set) %>%
  mutate(Count = cumCat(TypeMod, type_count)) %>%
  select(-type_count, -TypeMod)
# # Groups: Set [3]
# ID Set Type Count
# <int> <int> <fctr> <chr>
# 1 1 1 A A1
# 2 2 1 R A1R1
# 3 3 1 R A1R2
# 4 4 1 U A1R2U1
# 5 5 1 U A1R2U2
# 6 6 1 U A1R2U3
# 7 7 2 A A1
# 8 8 3 R R1
# 9 9 3 R R2
I have a dataframe which looks like -
Id Result
A 1
B 2
C 1
B 1
C 1
A 2
B 1
B 2
C 1
A 1
B 2
Now I need to calculate how many 1's and 2's there are for each Id, and then select the value whose frequency of occurrence is the greatest.
Id Result
A 1
B 2
C 1
How can I do that? I have tried using the table function in some way but have not been able to use it effectively. Any help would be appreciated.
Here you can use aggregate in one step:
df <- structure(list(Id = structure(c(1L, 2L, 3L, 2L, 3L, 1L, 2L, 2L,
3L, 1L, 2L), .Label = c("A", "B", "C"), class = "factor"),
Result = c(1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L)),
.Names = c("Id", "Result"), class = "data.frame", row.names = c(NA, -11L)
)
res <- aggregate(Result ~ Id, df, FUN=function(x){which.max(c(sum(x==1), sum(x==2)))})
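# (works because Result takes only the values 1 and 2, so the index returned by which.max is the value itself)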
res
Result:
Id Result
1 A 1
2 B 2
3 C 1
With data.table you can try (df is your data.frame):
require(data.table)
dt<-as.data.table(df)
dt[,list(times=.N),by=list(Id,Result)][,list(Result=Result[which.max(times)]),by=Id]
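# (count occurrences of each Id/Result pair, then keep, per Id, the Result with the largest count)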
# Id Result
#1: A 1
#2: B 2
#3: C 1
Using dplyr, you can try
library(dplyr)
df %>% group_by(Id, Result) %>% summarize(n = n()) %>% group_by(Id) %>%
filter(n == max(n)) %>% summarize(Result = Result)
Id Result
1 A 1
2 B 2
3 C 1
An option using table and ave
subset(as.data.frame(table(df1)),ave(Freq, Id, FUN=max)==Freq, select=-3)
# Id Result
# 1 A 1
# 3 C 1
# 5 B 2
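For what it's worth, a peek at the intermediate pieces (df1 here is the question's data frame, called df in the dput above): as.data.frame(table(df1)) gives one row per Id/Result combination with its Freq, and the ave(Freq, Id, FUN=max) == Freq condition keeps, for each Id, the row with the highest Freq.
as.data.frame(table(df1))
#   Id Result Freq
# 1  A      1    2
# 2  B      1    2
# 3  C      1    3
# 4  A      2    1
# 5  B      2    3
# 6  C      2    0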
I have a data.frame v that I would like to use the unique rows from
#v
DAY MONTH YEAR
1 1 1 2000
2 1 1 2000
3 2 2 2000
4 2 2 2000
5 2 3 2001
to subset a data.frame w.
# w
DAY MONTH YEAR V1 V2 V3
1 1 1 2000 1 2 3
2 1 1 2000 3 2 1
3 2 2 2000 2 3 1
4 2 2 2001 1 2 3
5 3 4 2001 3 2 1
The result is the data.frame vw, where only the rows in w that match the unique rows (on DAY, MONTH, YEAR) of v remain.
# vw
DAY MONTH YEAR V1 V2 V3
1 1 1 2000 1 2 3
2 2 2 2000 2 3 1
Right now I am using the code below, where I merge the data.frames and then use ddply to pick only the unique/first instance of a row. This works, but will become cumbersome if I have to include V1=x$V1[1], etc. for all of my variables in the ddply part of the code. Is there a way to use the first instance of (DAY, MONTH, YEAR) and take the rest of the columns from that row?
Or, is there another way to approach the problem of using unique rows from one data.frame to subset another data.frame?
v <- structure(list(DAY = c(1L, 1L, 2L, 2L, 2L), MONTH = c(1L, 1L,
2L, 2L, 3L), YEAR = c(2000L, 2000L, 2000L, 2000L, 2001L)), .Names = c("DAY",
"MONTH", "YEAR"), class = "data.frame", row.names = c(NA, -5L
))
w <- structure(list(DAY = c(1L, 1L, 2L, 2L, 3L), MONTH = c(1L, 1L,
2L, 2L, 4L), YEAR = c(2000L, 2000L, 2000L, 2001L, 2001L), V1 = c(1L,
3L, 2L, 1L, 3L), V2 = c(2L, 2L, 3L, 2L, 2L), V3 = c(3L, 1L, 1L,
3L, 1L)), .Names = c("DAY", "MONTH", "YEAR", "V1", "V2", "V3"
), class = "data.frame", row.names = c(NA, -5L))
vw_example <- structure(list(DAY = 1:2, MONTH = 1:2, YEAR = c(2000L, 2000L),
V1 = 1:2, V2 = 2:3, V3 = c(3L, 1L)), .Names = c("DAY", "MONTH",
"YEAR", "V1", "V2", "V3"), class = "data.frame", row.names = c(NA,
-2L))
wv_inter <- merge(v, w, by=c("DAY","MONTH","YEAR"))
vw <- ddply(wv_inter, .(DAY, MONTH, YEAR), function(x) data.frame(DAY=x$DAY[1], MONTH=x$MONTH[1], YEAR=x$YEAR[1], V1=x$V1[1], V2=x$V2[1], V3=x$V3[1]))
In base R, I would take unique of v first before merging. The merge command will by default merge on common column names, so by is unnecessary here.
vw <- merge(unique(v), w)
With your approach (take the first row from each combination), I think you could do (untested):
vw <- ddply(wv_inter, .(DAY, MONTH, YEAR), function(x) x[1,])
library(data.table)
v <- data.table(v)
w <- data.table(w)
setkey(v)
setkeyv(w, names(v))
# if you want to capture ALL unique values of `v`, use:
w[unique(v, by=NULL)]
# if you want only values that mutually exist in `v` and `w` use:
w[unique(v, by=NULL), nomatch=0L]
EDITED:
Rather than merging a unique v with w: to get a unique vw, first merge v and w and then keep only the rows unique on the DAY, MONTH, YEAR columns.
vw <- merge(v, w, by=c("DAY","MONTH","YEAR"))
vw <- vw[which( ! duplicated(vw[,c("DAY","MONTH","YEAR")]) ), ]