Subsetting first Observation per id and date in R

I want to subset the first date per observation per id. For example, just get the rows for the first date in which observations A and B appeared. If we have the following dataset:
df =
id date Observation
1 3 A
1 2 B
1 8 B
2 5 B
2 3 A
2 9 A
the outcome should look like this:
df =
id date Observation
1 3 A
1 2 B
2 5 B
2 3 A
thanks

If you don't mind the order being different, it can be accomplished using dplyr by grouping then slicing:
library(tidyverse)
df <- read_table("id date Observation
1 3 A
1 2 B
1 8 B
2 5 B
2 3 A
2 9 A")
df %>%
group_by(id, Observation) %>%
slice(1)
#> # A tibble: 4 x 3
#> # Groups: id, Observation [4]
#> id date Observation
#> <dbl> <dbl> <chr>
#> 1 1 3 A
#> 2 1 2 B
#> 3 2 3 A
#> 4 2 5 B
Created on 2021-04-12 by the reprex package (v1.0.0)

library(dplyr)
df %>%
group_by(id, Observation) %>%
slice(1) %>%
ungroup()
# OR
df %>%
group_by(id, Observation) %>%
filter(row_number() == 1) %>%
ungroup()
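
If by "first date" you mean the earliest date rather than the first row that happens to appear in the data, a small variation with slice_min() (available in dplyr 1.0+) should pick the minimum date per group instead; a sketch:
library(dplyr)
df %>%
  group_by(id, Observation) %>%
  slice_min(date, n = 1, with_ties = FALSE) %>%
  ungroup()
With the example data this returns the same four rows, since the first occurrence of each (id, Observation) pair also happens to be its earliest date.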

Related

How to count exact matches across two data frames within IDs in R

I have two datasets similar to the one below (but with 4m observations) and I want to count the number of matching sample days between the two data frames (see example below).
DF1
ID date
1 1992-10-15
1 2010-02-17
2 2019-09-17
2 2015-08-18
3 2020-10-27
3 2020-12-23
DF2
ID date
1 1992-10-15
1 2001-04-25
1 2010-02-17
3 1990-06-22
3 2014-08-18
3 2020-10-27
Expected output
ID Count
1 2
2 0
3 1
I have tried the aggregate function (though I'm unsure what to put inside which()):
test <- aggregate(date~ID, rbind(DF1, DF2), length(which(exact?)))
and the table function:
Y<-table(DF1$ID)
X <- table(DF2$ID)
Y2 <- DF1[Y %in% X,]
I am having trouble finding an example to help my situation.
Your help is appreciated!
In base R
data.frame(table(factor(merge(df1,df2)$ID, unique(df1$ID))))
Var1 Freq
1 1 2
2 2 0
3 3 1
Using tidyverse
library(dplyr)
library(tidyr)
inner_join(df1, df2) %>%
complete(ID = unique(df1$ID)) %>%
reframe(Freq = sum(!is.na(date)), .by = "ID")
Output:
# A tibble: 3 × 2
ID Freq
<int> <int>
1 1 2
2 2 0
3 3 1
Here is one way to do it with 'dplyr' and 'tidyr':
library(dplyr)
library(tidyr)
DF1 %>%
semi_join(DF2) %>%
count(ID) %>%
complete(ID = DF1$ID,
fill = list(n = 0))
#> Joining with `by = join_by(ID, date)`
#> # A tibble: 3 × 2
#> ID n
#> <dbl> <int>
#> 1 1 2
#> 2 2 0
#> 3 3 1
data
DF1 <- tibble(ID = c(1,1,2,2,3,3),
date = c("1992-10-15", "2010-02-17", "2019-09-17",
"2015-08-18", "2020-10-27", "2020-12-23"))
DF2 <- tibble(ID = c(1,1,1,3,3,3),
date = c("1992-10-15", "2001-04-25", "2010-02-17",
"1990-06-22", "2014-08-18", "2020-10-27"))
Created on 2023-02-16 with reprex v2.0.2
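
If the merge()/table() one-liner above feels opaque, here is a more explicit base R sketch of the same idea, using the DF1/DF2 data above. It assumes dates are not duplicated within an ID (otherwise the %in% count and the join-based count can differ), and it loops over IDs, so it may be slow on 4m rows:
ids <- unique(DF1$ID)
Count <- sapply(ids, function(i) {
  d1 <- DF1$date[DF1$ID == i]  # sample days for this ID in DF1
  d2 <- DF2$date[DF2$ID == i]  # sample days for this ID in DF2
  sum(d1 %in% d2)              # how many of DF1's days also occur in DF2
})
data.frame(ID = ids, Count = Count)
For the example data this reproduces the expected counts of 2, 0 and 1.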

Filtering every positive value for every negative in R

I have a dataset with financial data. Sometimes, a product gets refunded, resulting in a negative count of the product (so the money gets returned). I want to conditionally filter these rows out of the dataset.
Example:
library(tidyverse)
set.seed(1)
df <- tibble(
count = sample(c(-1,1),80,replace = TRUE,prob=c(.2,.8)),
id = rep(1:4,20)
)
df %>%
group_by(id) %>%
summarize(total = sum(count))
# A tibble: 4 x 2
id total
<int> <dbl>
1 1 10
2 2 14
3 3 16
4 4 10
id = 1 has 15 positive counts and 5 negatives (15 - 5 = 10). I want to keep 10 rows with id = 1, all with positive values.
id = 2 has 17 positive counts and 3 negatives (17 - 3 = 14). I want to keep 14 rows with id = 2, all with positive values.
In the end, this condition should be TRUE: nrow(df) == sum(df$count)
Unfortunately, a filtering join such as anti_join() will remove all the rows. For some reason I cannot think of another option to filter the tibble.
Thanks for helping me!
You can "uncount" using the total column to get the number of repeats of each row.
df %>%
group_by(id) %>%
summarize(total = sum(count)) %>%
uncount(total) %>%
mutate(count = 1)
#> # A tibble: 50 x 2
#> id count
#> <int> <dbl>
#> 1 1 1
#> 2 1 1
#> 3 1 1
#> 4 1 1
#> 5 1 1
#> 6 1 1
#> 7 1 1
#> 8 1 1
#> 9 1 1
#> 10 1 1
#> # ... with 40 more rows
Created on 2022-10-21 with reprex v2.0.2
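
As a quick sanity check of the condition stated in the question (nrow(df) == sum(df$count)), the uncounted result can be compared against the original totals; a minimal sketch:
library(dplyr)
library(tidyr)

res <- df %>%
  group_by(id) %>%
  summarize(total = sum(count)) %>%
  uncount(total) %>%
  mutate(count = 1)

nrow(res) == sum(df$count)
#> [1] TRUE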

Which groups have exactly the same rows

If I have a data frame like the following
group1 group2 col1 col2
A 1 ABC 5
A 1 DEF 2
B 1 AB 1
C 1 ABC 5
C 1 DEF 2
A 2 BC 8
B 2 AB 1
We can see that the (A, 1) and (C, 1) groups have the same rows (since col1 and col2 are the same within this group). The same is true for (B, 1) and (B, 2).
So really we are left with 3 distinct "larger groups" (call them categories) in this data frame, namely:
category group1 group2
1 A 1
1 C 1
2 B 1
2 B 2
3 A 2
And I am wondering how can I return the above data frame in R given a data frame like the first? The order of the "category" column doesn't matter here, for example (A,2) could be group 1 instead of {(A,1), (C,1)}, as long as these have a distinct category index.
I have tried a few very long/inefficient ways of doing this in dplyr, but I'm sure there must be a more efficient way. Thanks
You can use pivot_wider first to handle identical groups over multiple rows.
library(tidyverse)
df %>%
group_by(group1, group2) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = id, values_from = c(col1, col2)) %>%
group_by(across(-c(group1, group2))) %>%
mutate(category = cur_group_id()) %>%
ungroup() %>%
select(category, group1, group2) %>%
arrange(category)
category group1 group2
<int> <chr> <int>
1 1 B 1
2 1 B 2
3 2 A 1
4 2 C 1
5 3 A 2
You could first group_by "col1" and "col2" and select the duplicated rows. Next, you can create a unique ID using cur_group_id like this:
library(dplyr)
library(tidyr)
df %>%
group_by(col1, col2) %>%
filter(n() != 1) %>%
mutate(ID = cur_group_id()) %>%
ungroup() %>%
select(-starts_with("col"))
#> # A tibble: 6 × 3
#> group1 group2 ID
#> <chr> <int> <int>
#> 1 A 1 2
#> 2 A 1 3
#> 3 B 1 1
#> 4 C 1 2
#> 5 C 1 3
#> 6 B 2 1
Created on 2022-08-12 by the reprex package (v2.0.1)
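
Another option, if the pivot_wider reshape feels heavy, is to collapse each group's rows into one sorted key string and then number the distinct keys. This is only a sketch (the key name and the separator are arbitrary choices), and like the accepted approach it treats two groups as identical only when their full set of rows matches:
library(dplyr)

df %>%
  group_by(group1, group2) %>%
  summarise(key = paste(sort(paste(col1, col2)), collapse = " | "),
            .groups = "drop") %>%
  mutate(category = match(key, unique(key))) %>%
  select(category, group1, group2) %>%
  arrange(category)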

dplyr: divide all values in group by group's first value

My df looks something like this:
ID Obs Value
1 1 26
1 2 13
1 3 52
2 1 1,5
2 2 30
Using dplyr, I want to add an additional column Col, which is the result of dividing all values in the Value column by the group's first value in that column.
ID Obs Value Col
1 1 26 1
1 2 13 0,5
1 3 52 2
2 1 1,5 1
2 2 30 20
How do I do that?
After grouping by 'ID', use mutate to create a new column by dividing the 'Value' by the first of 'Value'
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(Col = Value/first(Value))
If the first 'Value' is 0 and we don't want to use it, then subset the 'Value' with a logical expression and then take the first of that
df1 %>%
group_by(ID) %>%
mutate(Col = Value/first(Value[Value != 0]))
Or in base R
df1$Col <- with(df1, Value/ave(Value, ID, FUN = head, 1))
NOTE: The comma in 'Value' suggests it is a character column. If that is the case, first change the decimal comma (,) to a dot (.), convert to numeric, and then do the division. This can also be done while reading the data.
Or, without creating an additional column:
library(tidyverse)
df = data.frame(ID=c(1,1,1,2,2), Obs=c(1,2,3,1,2), Value=c(26, 13, 52, 1.5, 30))
df %>%
group_by(ID) %>%
mutate_at('Value', ~./first(.))
#> # A tibble: 5 x 3
#> # Groups: ID [2]
#> ID Obs Value
#> <dbl> <dbl> <dbl>
#> 1 1 1 1
#> 2 1 2 0.5
#> 3 1 3 2
#> 4 2 1 1
#> 5 2 2 20
### OR ###
df %>%
group_by(ID) %>%
mutate_at('Value', function(x) x/first(x))
#> # A tibble: 5 x 3
#> # Groups: ID [2]
#> ID Obs Value
#> <dbl> <dbl> <dbl>
#> 1 1 1 1
#> 2 1 2 0.5
#> 3 1 3 2
#> 4 2 1 1
#> 5 2 2 20
Created on 2020-01-04 by the reprex package (v0.3.0)
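
Note that mutate_at() has since been superseded in dplyr 1.0+; the equivalent with across() would look roughly like this:
library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(across(Value, ~ .x / first(.x))) %>%
  ungroup()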

R add rows to grouped df using dplyr

I have a grouped df and I would like to add additional rows to the top of the groups that match with a variable (item_code) from the df.
The additional rows do not have an id column. The additional rows should not be duplicated within the groups of df.
Example data:
df <- as_tibble(data.frame(id=rep(1:3,each=2),
item_code=c("A","A","B","B","B","Z"),
score=rep(1,6)))
additional_rows <- as_tibble(data.frame(item_code=c("A","Z"),
score=c(6,6)))
What I tried
I found this post and tried to apply it:
Add row in each group using dplyr and add_row()
df %>% group_by(id) %>% do(add_row(additional_rows %>%
filter(item_code %in% .$item_code)))
What I get:
# A tibble: 9 x 3
# Groups: id [3]
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 Z 6
3 1 NA NA
4 2 A 6
5 2 Z 6
6 2 NA NA
7 3 A 6
8 3 Z 6
9 3 NA NA
What I am looking for:
# A tibble: 8 x 3
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 A 1
3 1 A 1
4 2 B 1
5 2 B 1
6 3 B 1
7 3 Z 6
8 3 Z 1
This should do the trick:
library(plyr)
df %>%
join(subset(df, item_code %in% additional_rows$item_code, select = c(id, item_code)) %>%
join(additional_rows) %>%
subset(!duplicated(.)), type = "full") %>%
arrange(id, item_code, -score)
Not sure if it's the best way, but it works.
Edit: to get the score in the same order added the other arrange terms
Edit 2: alright, there should now be no duplicated rows added from the additional rows as per your comment
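
If you prefer to stay within dplyr (loading plyr alongside dplyr can mask several of its verbs), a join-plus-bind_rows sketch along the same lines appears to reproduce the expected output; the intermediate name extra is just for illustration:
library(dplyr)

# one row per (id, item_code) combination that also exists in additional_rows
extra <- df %>%
  distinct(id, item_code) %>%
  inner_join(additional_rows, by = "item_code")

# prepend those rows and sort so the added score-6 rows sit on top of each group
bind_rows(extra, df) %>%
  arrange(id, item_code, desc(score))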
