count number of events grouped by id [duplicate] - r

This question already has answers here:
Counting unique / distinct values by group in a data frame
(12 answers)
Closed 2 years ago.
DF<-data.frame(id=c(1,1,2,3,3),code=c("A","A","A","E","E"))
> DF
id code
1 1 A
2 1 A
3 2 A
4 3 E
5 3 E
Now I want to count the number of distinct ids for each code. Desired output:
# A tibble: 2 x 2
code count
1 A 2
2 E 1
I've been trying:
> DF%>%group_by(code)%>%summarize(count=n())
# A tibble: 2 x 2
code count
<fct> <int>
1 A 3
2 E 2
> DF%>%group_by(code,id)%>%summarize(count=n())
# A tibble: 3 x 3
# Groups: code [2]
code id count
<fct> <dbl> <int>
1 A 1 2
2 A 2 1
3 E 3 2
>
Neither of which gives me the desired output.
Best H

Being pedantic, I'd rephrase your question as "count the number of distinct IDs per code". With that mindset, the answer becomes clearer.
library(dplyr)
DF %>%
  group_by(code) %>%
  summarize(count = n_distinct(id))
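A more compact spelling of the same idea (just a sketch using dplyr's distinct() and count()) is to drop duplicate id/code pairs first and then count rows per code:
DF %>% distinct(id, code) %>% count(code, name = "count")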

An option with data.table is uniqueN (the analogue of n_distinct from dplyr), after converting DF to a data.table with setDT and grouping by 'code':
library(data.table)
setDT(DF)[, .(count = uniqueN(id)), code]
# code count
#1: A 2
#2: E 1

A simple base R solution also works:
#Data
DF<-data.frame(id=c(1,1,2,3,3),code=c("A","A","A","E","E"))
#Classic base R sol
aggregate(id~code,data=DF,FUN = function(x) length(unique(x)))
code id
1 A 2
2 E 1
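The same distinct-then-count idea works in base R with table() alone (a quick sketch): unique(DF) drops duplicated (id, code) rows, and table() counts what remains per code.
table(unique(DF)$code)
#A E
#2 1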


Sample n random rows per group in a dataframe with dplyr when some observations have less than n rows

I have a data frame with two categorical variables.
samples<-c("A","A","A","A","B","B")
groups<-c(1,1,1,2,1,1)
df<- data.frame(samples,groups)
df
samples groups
1 A 1
2 A 1
3 A 1
4 A 2
5 B 1
6 B 1
The result that I would like is, for each combination (sample-group), to downsample the data frame randomly (this is important) to a maximum of X rows, while keeping all observations for combinations that appear fewer than X times. In the example here, X = 2. Is there an easy way to do this? The issue I have is that observation 4 (A, 2) appears only once, so dplyr's sample_n would not work.
desired output
samples groups
1 A 1
2 A 1
3 A 2
4 B 1
5 B 1
For each group, you can sample the minimum of the group's number of rows and x:
library(dplyr)
x <- 2
df %>% group_by(samples, groups) %>% sample_n(min(n(), x))
# samples groups
# <chr> <dbl>
#1 A 1
#2 A 1
#3 A 2
#4 B 1
#5 B 1
However, note that sample_n() has been superseded in favor of slice_sample(), but n() doesn't work with slice_sample(). There is an open issue here for it.
However, as @tmfmnk mentioned, we don't need to call n() here. Try:
df %>% group_by(samples, groups) %>% slice_sample(n = x)
One option with data.table, after converting df with setDT():
library(data.table)
setDT(df)
df[df[, .I[sample(.N, min(.N, x))], by = .(samples, groups)]$V1]
samples groups
1: A 1
2: A 1
3: A 2
4: B 1
5: B 1
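For completeness, a base R sketch of the same idea (assuming the df and x defined above): split the row indices by group, sample at most x of them, and subset. Using sample.int() on the index length avoids the classic sample() surprise when a group has a single row.
set.seed(1) # only for reproducibility
idx <- lapply(split(seq_len(nrow(df)), interaction(df$samples, df$groups, drop = TRUE)),
              function(i) i[sample.int(length(i), min(length(i), x))])
df[sort(unlist(idx)), ]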

R reset counter based on two columns [duplicate]

This question already has an answer here:
R code to assign a sequence based off of multiple variables [duplicate]
(1 answer)
Closed 3 years ago.
I have the following kind of data and I need the output shown in the second data frame:
a <- c(1,1,1,1,2,2,2,2,2,2,2)
b <- c(1,1,1,2,3,3,3,3,4,5,6)
d <- c(1,2,3,4,1,2,3,4,5,6,7)
df <- as.data.frame(cbind(a,b,d))
output <- c(1,1,1,2,1,1,1,1,2,3,4)
df_output <- as.data.frame(cbind(df,output))
I have tried cumsum and I am not able to get the desired results. Please guide. Regards, Enthu.
Whenever the value in column a changes, the counter based on b should reset, starting from 1. As long as b keeps the same value, the counter stays at 1, and each change in b increments it by 1. For example, in the 5th row, column b has the value 3, so the counter resets to 1; it stays at 1 while b is unchanged (rows 5 to 8) and then increments with every change in b.
We can group by column 'a' and then create the new column, either by matching 'b' against its unique values:
library(dplyr)
df2 <- df %>%
group_by(a) %>%
mutate(out = match(b, unique(b)))
df2
# A tibble: 11 x 4
# Groups: a [2]
# a b d out
# <dbl> <dbl> <dbl> <int>
# 1 1 1 1 1
# 2 1 1 2 1
# 3 1 1 3 1
# 4 1 2 4 2
# 5 2 3 1 1
# 6 2 3 2 1
# 7 2 3 3 1
# 8 2 3 4 1
# 9 2 4 5 2
#10 2 5 6 3
#11 2 6 7 4
Or, as another option, convert 'b' to a factor and coerce it to integer:
df %>%
group_by(a) %>%
mutate(out = as.integer(factor(b)))
data
df <- data.frame(a, b, d)
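A data.table sketch of the same idea is rleid(), which increments whenever b changes within a group; on this data it matches the match()/factor() approaches because a value of b never reappears after changing:
library(data.table)
setDT(df)[, out := rleid(b), by = a][]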

Dense ranking of column based on order of second column

I am beating my brains out on something that is probably straightforward. I want to get a "dense" ranking (as defined for the data.table::frank function) on a column in a data frame, but not based on that column's own order; the order should be given by another column (val in my example).
I managed to get the dense ranking with @Prasad Chalasani's solution, like this:
library(dplyr)
foo_df <- data.frame(id = c(4,1,1,3,3), val = letters[1:5])
foo_df %>% arrange(val) %>% mutate(id_fac = as.integer(factor(id)))
#> id val id_fac
#> 1 4 a 3
#> 2 1 b 1
#> 3 1 c 1
#> 4 3 d 2
#> 5 3 e 2
But I would like the factor levels to be ordered based on val. Desired output:
foo_desired <- foo_df %>% arrange(val) %>% mutate(id_fac = as.integer(factor(id, levels = c(4,1,3))))
foo_desired
#> id val id_fac
#> 1 4 a 1
#> 2 1 b 2
#> 3 1 c 2
#> 4 3 d 3
#> 5 3 e 3
I tried data.table::frank.
I tried both methods by @Prasad Chalasani.
I tried setting the order of id using id[rank(val)] (and sort(val), and order(val)).
Finally, I also tried to sort the levels using rank(val) etc, but this throws an error (Evaluation error: factor level [3] is duplicated.)
I know that one can specify the level order; I used that to create the desired output. That solution is, however, not great, as my data has many more rows and levels.
I need that for convenience, in order to produce a table with a specific order, not for computations.
Created on 2018-12-19 by the reprex package (v0.2.1)
You can do this with first():
foo_df %>%
  arrange(val) %>%
  group_by(id) %>%
  mutate(id_fac = first(val)) %>%
  ungroup() %>%
  mutate(id_fac = as.integer(factor(id_fac)))
# A tibble: 5 x 3
id val id_fac
<dbl> <fctr> <int>
1 4 a 1
2 1 b 2
3 1 c 2
4 3 d 3
5 3 e 3
Why do you even need factors? Not sure if I am missing something, but this gives your desired output.
You can use match to get id_fac based on the occurrence of ids.
library(dplyr)
foo_df %>%
mutate(id_fac = match(id, unique(id)))
# id val id_fac
#1 4 a 1
#2 1 b 2
#3 1 c 2
#4 3 d 3
#5 3 e 3
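Note that this works because foo_df happens to be ordered by val already; if it were not, arrange first so that the occurrence order (and hence the ranking) follows val:
foo_df %>%
  arrange(val) %>%
  mutate(id_fac = match(id, unique(id)))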

Find difference between rows by id, but place difference on first row in R

I have read a few different posts about finding the difference between two different rows in R using dplyr. However, the posts I have seen do not give me quite what I want. I would like to find the difference between the times, and place that difference between n and n+1 in a new variable, on the same row as n, kind of like the duration between n and n+1. All other posts place the elapsed time on the same row as n+1.
Here is some sample data:
df <- read.table(text = c("
id time
1 1
1 4
1 7
2 5
2 10"), header = T)
My desired output:
# id time duration
# 1 1 3
# 1 4 3
# 1 7 NA
# 2 5 5
# 2 10 NA
I have the following code at the moment:
df %>% arrange(id, time) %>% group_by(id) %>% mutate(duration = time - lag(time))
Please let me know how I should change this around. Thanks!
You can use diff(), appending the NA to each group. Just change your mutate() call to
mutate(duration = c(diff(time), NA))
Edit: To clarify, the code above is only the mutate() call at the end of the pipe in the code shown in the question. The entire operation, based on that code, is:
df %>%
arrange(id, time) %>%
group_by(id) %>%
mutate(duration = c(diff(time), NA))
# Source: local data frame [5 x 3]
# Groups: id [2]
#
# id time duration
# <dbl> <dbl> <dbl>
# 1 1 1 3
# 2 1 4 3
# 3 1 7 NA
# 4 2 5 5
# 5 2 10 NA
We can swap lag with lead
df %>%
group_by(id) %>%
mutate(duration = lead(time)- time)
# id time duration
# <int> <int> <int>
#1 1 1 3
#2 1 4 3
#3 1 7 NA
#4 2 5 5
#5 2 10 NA
A corresponding option in data.table would be shift with type = "lead"
library(data.table)
setDT(df)[, duration := shift(time, type = "lead") - time, by = id]
NOTE: In the example, 'id' and 'time' were already in order. If they are not, add the arrange step as the OP showed in the post.
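For completeness, the same lead/diff idea also works in base R with ave() (a sketch, assuming df is still the plain data frame read in above and the rows are ordered by id and time):
df$duration <- ave(df$time, df$id, FUN = function(x) c(diff(x), NA))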

How can I get the most common combination of several columns, aggregating by others, in a data.frame?

Let's say I have a dataframe with the following structure:
id A B
1 1 1
1 1 2
1 1 2
1 2 2
1 2 3
1 2 4
1 2 5
2 1 2
2 2 2
2 3 2
2 3 5
2 3 5
2 4 6
I'd like to get the most common combination of values in A and B for each id:
id A B
1 1 2
2 3 5
I need to do this for a fairly big dataset (several million rows). I've got to a couple of horrible, slow, and very un-idiomatic solutions; I'm sure there is an easy, R-ish way.
I think I should be using aggregate, but I can't find a way to do it that works:
> aggregate(cbind(A, B) ~ id, d, Mode)
id A B
1 1 2 2
2 2 3 2
> # wrong!
> aggregate(interaction(A, B) ~ id, d, Mode)
id interaction(A, B)
1 1 1.2
2 2 3.5
> # close, but I need the original columns
Using dplyr:
library(dplyr)
df %>%
group_by(id, A, B) %>%
mutate(n = n()) %>%
group_by(id) %>%
slice(which.max(n)) %>%
select(-n)
#Source: local data frame [2 x 3]
#Groups: id
#
# id A B
#1 1 1 2
#2 2 3 5
And a similar data.table approach:
library(data.table)
setDT(df)[, .N, by=.(id, A, B)][, .SD[which.max(N)], by = id]
# id A B N
#1: 1 1 2 2
#2: 2 3 5 2
Edit to include a brief explanation:
Both approaches do essentially the same:
Group the data by id, A and B.
Add a column with the number of rows per group.
Group the data by id (only) and return the (first) row with the maximum count per id.
In the data.table version, you start with setDT(df) to convert the data.frame to a data.table object.
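With more recent dplyr, the same three steps can be written with count() and slice_max() (a sketch; with_ties = FALSE keeps only the first row in case of ties, like the which.max() versions above):
library(dplyr)
df %>%
  count(id, A, B) %>%                        # rows per (id, A, B) combination
  group_by(id) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>% # most frequent combination per id
  select(-n)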
