Count character values by group in a data.frame - r

I have got a data.frame which contains two columns: ID and Letter. I need to summarize the Letter observations by ID.
Here is an example:
df = read.table(text = 'ID Letter
1 A
1 A
1 B
1 A
1 C
1 D
1 B
2 A
2 B
2 B
2 B
2 D
2 F
3 B
3 A
3 A
3 C
3 D', header = TRUE)
My output should be 3 data.frames as follows:
df_1
A 3
B 2
C 1
D 1
df_2
A 1
B 3
D 1
F 1
df_3
A 2
B 1
C 1
D 1
It is just the count of the letters within each ID group. I think I could use a combination of the functions table and aggregate, but how?

Thanks to @akrun, this is how I managed to do it:
# create a list of data.frames, one per ID
library(dplyr)
lst = lapply(split(df, df$ID), function(x) count(x, ID, Letter) %>% ungroup() %>% select(-ID))
lst = lapply(lst, as.data.frame) # convert the results into data.frames

This will also work (with base R):
lapply(split(df, df$ID), function(x) subset(as.data.frame(table(x$Letter)), Freq != 0))
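The Freq != 0 filter in the base R answer matters when Letter is a factor: each piece produced by split() keeps every factor level, so table() also reports letters that never occur for that ID. Here is a minimal sketch of that behavior, assuming a small made-up data frame (df_small, pieces, and the Letter/Count column names are illustrative and not from the question):

```r
# Hypothetical data, smaller than the example in the question
df_small <- data.frame(
  ID = c(1, 1, 1, 2, 2),
  Letter = factor(c("A", "A", "B", "B", "D"))
)

# One data.frame per ID; each piece keeps all factor levels (A, B, D)
pieces <- split(df_small, df_small$ID)

# Without the filter, ID 1 also gets a row for "D" with a count of 0
with_zeros <- as.data.frame(table(pieces[["1"]]$Letter))

# Count per piece, give the columns readable names, drop zero counts
counts <- lapply(pieces, function(x) {
  out <- as.data.frame(table(x$Letter), stringsAsFactors = FALSE)
  names(out) <- c("Letter", "Count")
  out[out$Count != 0, ]
})
```

If Letter is a plain character column, table() only reports the values that are present, so the filter is harmless but not strictly needed.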

Related

Is it possible to filter rows of one dataframe based on another dataframe?

I have these two data frames:
df_node <- data.frame( id= c("a","b","c","d","e","f","g","h","i"),
group= c(1,1,1,2,2,2,3,3,3))
df_link <- data.frame(from = c("a","d","f","i","b"),
to = c("d","f","i","b","h"))
I would like to delete the rows of the first data frame whose id does not appear anywhere in the second data frame.
Here is a basic way to do that:
df_node <- data.frame( id= c("a","b","c","d","e","f","g","h","i"),
group= c(1,1,1,2,2,2,3,3,3))
df_link <- data.frame(from = c("a","d","f","i","b"),
to = c("d","f","i","b","h"))
library(dplyr)
df_result <- df_node%>%
filter(id%in%c(df_link$from,df_link$to))
df_result
# > df_result
# id group
# 1 a 1
# 2 b 1
# 3 d 2
# 4 f 2
# 5 h 3
# 6 i 3
We could use a semi_join:
library(dplyr)
df_node |>
semi_join(tibble(id = c(df_link$from, df_link$to)))
Output:
id group
1 a 1
2 b 1
3 d 2
4 f 2
5 h 3
6 i 3
Here is a one-liner with base R:
df_node[df_node$id %in% unlist(df_link),]
id group
1 a 1
2 b 1
4 d 2
6 f 2
8 h 3
9 i 3
But you could also use a join:
library(dplyr)
df_uniqueID <- data.frame(id = unique(c(df_link$from,df_link$to)) )
right_join(df_node,df_uniqueID)
Joining, by = "id"
id group
1 a 1
2 b 1
3 d 2
4 f 2
5 h 3
6 i 3
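The filtering approaches and the join approach above agree on the question's data, but they can differ when df_link mentions an id that df_node does not contain. The following is a rough sketch of that difference in base R only, with merge(..., all.y = TRUE) standing in for right_join; the data is made up and the id "z" is a hypothetical value that exists only in df_link:

```r
# Hypothetical data: "z" appears in df_link but not in df_node
df_node <- data.frame(id = c("a", "b", "c"), group = c(1, 1, 2))
df_link <- data.frame(from = c("a", "z"), to = c("b", "a"))

# All ids mentioned in either column of df_link
ids <- unique(c(df_link$from, df_link$to))

# Filtering keeps only rows that already exist in df_node
kept <- df_node[df_node$id %in% ids, ]

# A right-join style merge keeps every id from the lookup table,
# so "z" shows up with a missing group
joined <- merge(df_node, data.frame(id = ids), by = "id", all.y = TRUE)
```

If ids missing from df_node should never appear in the result, the %in% filter or semi_join is the safer choice.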

Apply function to a row in a data.frame using dplyr

In base R I would do the following:
d <- data.frame(a = 1:4, b = 4:1, c = 2:5)
apply(d, 1, which.max)
With dplyr I could do the following:
library(dplyr)
d %>% mutate(u = purrr::pmap_int(list(a, b, c), function(...) which.max(c(...))))
If there is another column in d I need to specify it, but I want this to work with an arbitrary number of columns.
Conceptually, I’d like something like
pmap_int(list(everything()), ...)
pmap_int(list(.), ...)
But this obviously does not work. How would I solve that canonically with dplyr?
We just need to pass the data as ., because a data.frame is a list whose elements are its columns. Wrapping it in list(.) would produce a nested list instead.
library(dplyr)
library(purrr) # provides pmap_int()
d %>%
mutate(u = pmap_int(., ~ which.max(c(...))))
# a b c u
#1 1 4 2 2
#2 2 3 3 2
#3 3 2 4 3
#4 4 1 5 3
Or we can use cur_data() (deprecated as of dplyr 1.1.0 in favor of pick())
d %>%
mutate(u = pmap_int(cur_data(), ~ which.max(c(...))))
Or if we want to use everything(), place it inside select, because list(everything()) does not say which data the columns should be selected from
d %>%
mutate(u = pmap_int(select(., everything()), ~ which.max(c(...))))
Or using rowwise
d %>%
rowwise %>%
mutate(u = which.max(cur_data())) %>%
ungroup
# A tibble: 4 x 4
# a b c u
# <int> <int> <int> <int>
#1 1 4 2 2
#2 2 3 3 2
#3 3 2 4 3
#4 4 1 5 3
Or this is more efficient with max.col
max.col(d, 'first')
#[1] 2 2 3 3
Or with collapse
library(collapse)
dapply(d, which.max, MARGIN = 1)
#[1] 2 2 3 3
max.col can also be used inside a dplyr pipeline:
d %>%
mutate(u = max.col(cur_data(), 'first'))
Here are some data.table options
library(data.table)
setDT(d)[, u := which.max(unlist(.SD)), 1:nrow(d)]
or
setDT(d)[, u := max.col(.SD, "first")]
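Several of the answers above pass "first" to max.col. That argument is what makes it agree with which.max on ties, since max.col breaks ties at random by default. A small base R sketch on the question's data, where row 2 has a tie between b and c (the names by_apply and by_maxcol are illustrative):

```r
d <- data.frame(a = 1:4, b = 4:1, c = 2:5)

# Row-wise which.max: returns the first maximum in each row
by_apply <- unname(apply(d, 1, which.max))

# Vectorised equivalent; ties.method = "first" mirrors which.max
by_maxcol <- max.col(as.matrix(d), ties.method = "first")

# Both give the column index of the row maximum
stopifnot(identical(by_apply, by_maxcol))
```

Leaving out ties.method would make the result for tied rows vary between runs.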

lapply aggregate columns in multiple dataframes R

I have several data frames in a list in R. There are entries in each of those data frames that I would like to summarise. I'm trying to learn lapply, so that would be my preferred way (though if there's a better solution, I would be happy to know it and why).
My Sample data:
df1 <- data.frame(Count = c(1,2,3), ID = c("A","A","C"))
df2 <- data.frame(Count = c(1,1,2), ID = c("C","B","C"))
dfList <- list(df1,df2)
> head(dfList)
[[1]]
Count ID
1 1 A
2 2 A
3 3 C
[[2]]
Count ID
1 1 C
2 1 B
3 2 C
I tried to implement this in lapply with
dfList_agg<-lapply(dfList, function(i) {
aggregate(i[[1:length(i)]][1L], by=list(names(i[[1:length(i)]][2L])), FUN=sum)
})
However, this gives me an error: "arguments must have same length". What am I doing wrong?
My desired output would be the sum of Column "Count" by "ID" which looks like this:
>head(dfList_agg)
[[1]]
Count ID
1 3 A
2 3 C
[[2]]
Count ID
1 3 C
2 1 B
I think you've overcomplicated it. Try this...
dfList_agg<-lapply(dfList, function(i) {
aggregate(i[,1], by=list(i[,2]), FUN=sum)
})
dfList_agg
[[1]]
Group.1 x
1 A 3
2 C 3
[[2]]
Group.1 x
1 B 1
2 C 3
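The Group.1 and x column names in the output above appear because aggregate() was given an unnamed vector and an unnamed grouping list. A minimal sketch of one way to keep the original names is to pass a one-column data frame and a named list (this is a variation on the answer, not the answerer's own code):

```r
df1 <- data.frame(Count = c(1, 2, 3), ID = c("A", "A", "C"))
df2 <- data.frame(Count = c(1, 1, 2), ID = c("C", "B", "C"))
dfList <- list(df1, df2)

dfList_agg <- lapply(dfList, function(d) {
  # d["Count"] stays a data frame, so the column name survives;
  # naming the grouping list names the group column
  aggregate(d["Count"], by = list(ID = d$ID), FUN = sum)
})
```

The grouping values must be the column itself (d$ID), not its name. Passing the name gives a grouping vector of length 1, which is what triggers the "arguments must have same length" error in the question.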
Here is a third option
lapply(dfList, function(x) aggregate(. ~ ID, data = x, FUN = "sum"))
#[[1]]
# ID Count
#1 A 3
#2 C 3
#
#[[2]]
#ID Count
#1 B 1
#2 C 3
I guess this is what you need
library(plyr) # ddply() comes from plyr, not dplyr
lapply(dfList,function(x) ddply(x,.(ID),summarize,Count=sum(Count)))
An option with tidyverse would be
library(tidyverse)
map(dfList, ~ .x %>%
group_by(ID) %>%
summarise(Count = sum(Count)) %>%
select(names(.x)))
#[[1]]
# A tibble: 2 x 2
# Count ID
# <dbl> <fctr>
#1 3.00 A
#2 3.00 C
#[[2]]
# A tibble: 2 x 2
# Count ID
# <dbl> <fctr>
#1 1.00 B
#2 3.00 C

Collapse duplicate rows by median value in R

I have a data frame with two columns. I would like to remove rows where there are duplicate entries in the first column. However, I would like to select a specific row to remain based on the value in the second column.
Specifically, if there are 2 duplicate entries in column 1, I would like the row removed with the lower value in column 2.
Or, if there are more than 2 identical entries in column 1, I would like the row with the median value in column 2 to remain.
So for data frame
a <- c(rep("A", 3), rep("B", 3), rep("C",1), rep("D",1), rep("D",1))
b <- c(1,2,3,4,5,6,4,7,6)
df <-data.frame(a,b)
would become
a <- c(rep("A", 1), rep("B", 1), rep("C",1), rep("D",1))
b <- c(2,5,4,7)
df <-data.frame(a,b)
I have tried functions unique() and duplicated() but can't seem to find arguments that meet these criteria. Any help much appreciated.
You can try
library(data.table)
setDT(df)[, list(b=if(.N==2) min(b) else median(b)) , by = a]
# a b
#1: A 2
#2: B 5
#3: C 4
#4: D 6
Or a similar option with aggregate
aggregate(b~a, df, FUN=function(x) if(length(x)==2) min(x) else median(x))
# a b
#1 A 2
#2 B 5
#3 C 4
#4 D 6
Or
library(sqldf)
sqldf('select a,
case
when count(b) is 2 then min(b)
else median(b)
end b
from df
group by a')
# a b
#1 A 2
#2 B 5
#3 C 4
#4 D 6
Based on the expected output shown, the last row is D 7, so if we are selecting the first observation when the group length is 2,
setDT(df)[, list(b=if(.N==2) b[1L] else median(b)) , by = a]
# a b
#1: A 2
#2: B 5
#3: C 4
#4: D 7
Or
aggregate(b~a, df, FUN=function(x) if(length(x)==2) x[1L] else median(x))
# a b
#1 A 2
#2 B 5
#3 C 4
#4 D 7
Or
sqldf('select a,
case
when count(b) is 2 and min(rowid) then b
else median(b)
end b
from df
group by a')
# a b
#1 A 2
#2 B 5
#3 C 4
#4 D 7
EDIT: changed first observation to min after I saw @eipi10's post. I didn't read the OP's post correctly, and the OP's expected output does not match the description.
Using dplyr:
library(dplyr)
df %>% group_by(a) %>%
summarise(b = ifelse(n() == 2, min(b), median(b)))
a b
1 A 2
2 B 5
3 C 4
4 D 6
In your question, you said you want the "lower" value, in case there are two rows, which would give D=6, rather than D=7. If you meant the first row that appears in the data frame, you can do this:
df %>% group_by(a) %>%
summarise(b = ifelse(n() == 2, b[1], median(b)))
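The two readings discussed in this thread (keep the lower value, or keep the first row, when a group has exactly two entries) differ only for group D. Here is a base R sketch using tapply, with helper names (keep_lower, keep_first) that are illustrative and not from the answers:

```r
a <- c(rep("A", 3), rep("B", 3), "C", "D", "D")
b <- c(1, 2, 3, 4, 5, 6, 4, 7, 6)
df <- data.frame(a, b)

# Reading 1: for exactly two rows keep the lower value, else the median
keep_lower <- function(x) if (length(x) == 2) min(x) else median(x)

# Reading 2: for exactly two rows keep the first value, else the median
keep_first <- function(x) if (length(x) == 2) x[1] else median(x)

by_lower <- tapply(df$b, df$a, keep_lower)
by_first <- tapply(df$b, df$a, keep_first)
```

A plain if/else is enough here because each helper receives one whole group and returns a single value.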

Create a variable capturing the most frequent occurrence by group

Define:
df1 <-data.frame(
id=c(rep(1,3),rep(2,3)),
v1=as.character(c("a","b","b",rep("c",3)))
)
s.t.
> df1
id v1
1 1 a
2 1 b
3 1 b
4 2 c
5 2 c
6 2 c
I want to create a third variable freq that contains the most frequent observation in v1 by id s.t.
> df2
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c
You can do this using ddply (from the plyr package) and a custom function to pick out the most frequent value:
library(plyr)
myFun <- function(x){
tbl <- table(x$v1)
x$freq <- rep(names(tbl)[which.max(tbl)],nrow(x))
x
}
ddply(df1,.(id),.fun=myFun)
Note that which.max will return the first occurrence of the maximum value, in the case of ties. See ??which.is.max in the nnet package for an option that breaks ties randomly.
Another way uses tidyverse functions:
grouping first, using group_by(), and counting the occurrence of the second variable using tally()
arranging by the number of occurrences with arrange()
summarizing and picking out the first row with summarize() and first()
Therefore:
library(dplyr)
df1 %>%
group_by(id, v1) %>%
tally() %>%
arrange(id, desc(n)) %>%
summarize(freq = first(v1))
This will give you just the mapping (which I find cleaner):
# A tibble: 2 x 2
id freq
<dbl> <fctr>
1 1 b
2 2 c
You can then left_join your original data frame with that table.
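The join step mentioned above is not shown in the answer. As a rough sketch, here it is in base R, with merge(..., all.x = TRUE) standing in for left_join; the freq_map table is typed in by hand to mirror the mapping printed above:

```r
df1 <- data.frame(
  id = c(rep(1, 3), rep(2, 3)),
  v1 = c("a", "b", "b", rep("c", 3))
)

# Mapping of id to most frequent value, copied from the output above
freq_map <- data.frame(id = c(1, 2), freq = c("b", "c"))

# Keep every row of df1 and attach the matching freq value
df2 <- merge(df1, freq_map, by = "id", all.x = TRUE)
```

With dplyr loaded, left_join(df1, freq_map, by = "id") should give the same columns.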
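Another option in base R uses ave() with a small helper (named most_frequent here to avoid masking base::mode):
most_frequent <- function(x) names(table(x))[ which.max(table(x)) ]
df1$freq <- ave(df1$v1, df1$id, FUN=most_frequent)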
> df1
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c
