Hi, I have a df as below:
ID | Gender
1 | M
1 | F
2 | F
2 | F
2 | F
3 | M
3 | M
3 | F
4 | M
4 | M
4 | M
I'd like to get the distinct rows for IDs that have more than one Gender (to catch dirty data, since a person can't have more than one Gender).
Results should be:
ID | Gender
1 | M
1 | F
3 | M
3 | F
How can I go about this in R using dplyr?
Using dplyr,
library(dplyr)
df %>%
group_by(ID) %>%
filter(n_distinct(Gender) > 1) %>%
distinct(Gender)
which gives,
# A tibble: 4 x 2
# Groups: ID [2]
Gender ID
<chr> <int>
1 M 1
2 F 1
3 M 3
4 F 3
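If you prefer the result ungrouped, with ID as the first column, a small variant of the same logic:
df %>%
  group_by(ID) %>%
  filter(n_distinct(Gender) > 1) %>%
  ungroup() %>%
  distinct(ID, Gender)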
Related
How can I count the number of organisations sharing each combination of unique values, so that I go from:
organisation <- c("A","A","A","A","B","B","B","B","C","C","C","C","D","D","D","D")
variable <- c("0","0","1","2","0","0","1","1","0","0","1","1","0","0","2","2")
df <- data.frame(organisation,variable)
organisation | variable
A | 0
A | 0
A | 1
A | 2
B | 0
B | 0
B | 1
B | 1
C | 0
C | 0
C | 1
C | 1
D | 0
D | 0
D | 2
D | 2
To:
unique_values | frequency
0,1,2 | 1
0,1 | 2
0,2 | 1
There are only 3 possible sequences:
0,1,2
0,1
0,2
Try this:
s <- aggregate(. ~ organisation, data = df, \(x) names(table(x)))
s$variable <- sapply(s$variable, \(x) paste0(x, collapse = ","))
setNames(aggregate(. ~ variable, data = s, length), c("unique_values", "frequency"))
Output:
unique_values frequency
1 0,1 2
2 0,1,2 1
3 0,2 1
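Note that the \(x) lambda shorthand requires R 4.1 or later; on older R versions, use function(x) instead.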
You can do something simple like this:
library(dplyr)
library(stringr)
distinct(df) %>%
arrange(variable) %>%
group_by(organisation) %>%
summarize(unique_values = str_c(variable, collapse = ",")) %>%
count(unique_values)
Output:
unique_values n
<chr> <int>
1 0,1 2
2 0,1,2 1
3 0,2 1
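For comparison, a rough data.table equivalent (just a sketch, assuming the same df as above; note that setDT modifies df in place):
library(data.table)
setDT(df)
unique(df)[order(variable),
           .(unique_values = paste(variable, collapse = ",")),
           by = organisation][, .(frequency = .N), by = unique_values]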
I have two data frames.
df1 is like this:
| NOC | 2007 | 2008 |
|:---- |:------:| -----:|
| A | 100 | 5 |
| B | 100 | 5 |
| C | 100 | 5|
| D | 20 | 2 |
| E | 10 | 12 |
| F | 2 | 1 |
df2:
| NOC | GROUP |
|:---- |:------:|
| A | aa|
| B | aa |
| C | aa |
| D | bb |
| E | bb |
| F | cc |
I would like to create a new df3 that aggregates the 2007 and 2008 columns by group identity, assigning each row the sum over its group, so that df3 would look like this:
| NOC | 2007 | 2008 | GROUP | s2007 | s2008 |
|:---- |:------:| -----:|:------:|:------:| -----:|
| A | 100 | 5 | aa | 300 | 15 |
| B | 100 | 5 | aa | 300 | 15 |
| C | 100 | 5 | aa | 300 | 15 |
| D | 20 | 2 | bb | 30 | 14 |
| E | 10 | 12 | bb | 30 | 14 |
| F | 2 | 1 | cc | 2 | 1 |
My code is not very efficient. I first merged df1 with df2 by NOC into df3:
df3 <- merge(df1, df2, by = "NOC", all.x = TRUE)
then used dplyr's summarise to create s2007 and s2008 in df4:
df3 %>%
  group_by(GROUP) %>%
  summarise(num = n(),
            s2007 = sum(`2007`), s2008 = sum(`2008`)) -> df4
then I merged df1 with df4 again to create my final data frame.
I am wondering about two things:
Is there a more efficient way?
Since my data frame contains annual data for 2007-2030, I am currently writing out the summarise call for every year. Is there a faster way to summarise all the columns except NOC?
Thank you!
Before anything else, a small piece of advice: never use purely numeric column names, as they can cause many glitches.
library(tidyverse)
df1 %>% left_join(df2, by = 'NOC') %>%
group_by(GROUP) %>%
mutate(across(c(`2007`, `2008`), ~sum(.), .names = 's.{.col}' ))
# A tibble: 6 x 6
# Groups: GROUP [3]
NOC `2007` `2008` GROUP s.2007 s.2008
<chr> <int> <int> <chr> <int> <int>
1 A 100 5 aa 300 15
2 B 100 5 aa 300 15
3 C 100 5 aa 300 15
4 D 20 2 bb 30 14
5 E 10 12 bb 30 14
6 F 2 1 cc 2 1
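To address the second question (year columns running from 2007 to 2030), one option is to let across() pick up every numeric column rather than listing each year by hand; a sketch, assuming all the year columns are numeric:
df1 %>%
  left_join(df2, by = 'NOC') %>%
  group_by(GROUP) %>%
  mutate(across(where(is.numeric), ~ sum(.x), .names = 's.{.col}')) %>%
  ungroup()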
I will post a reproducible example.
id <- c(1,1,1,1,2,2,1,1)
group <- c("a","b","c","d","a","b","c","d")
df <- data.frame(id, group)
I want something like this as end result.
+====+========+========+
| id | group1 | group2 |
+====+========+========+
| 1 | a | b |
+----+--------+--------+
| 1 | b | c |
+----+--------+--------+
| 1 | c | d |
+----+--------+--------+
| 1 | d | - |
+----+--------+--------+
| 2 | a | b |
+----+--------+--------+
| 2 | b | - |
+----+--------+--------+
| 1 | c | d |
+----+--------+--------+
| 1 | d | - |
+----+--------+--------+
Just to mention, the order of the IDs matters; I have another column with a timestamp.
One solution with dplyr and rleid from data.table:
library(dplyr)
df %>%
mutate(id2 = data.table::rleid(id)) %>%
group_by(id2) %>%
mutate(group2 = lead(group))
# A tibble: 8 x 4
# Groups: id2 [3]
id group id2 group2
<dbl> <fct> <int> <fct>
1 1.00 a 1 b
2 1.00 b 1 c
3 1.00 c 1 d
4 1.00 d 1 NA
5 2.00 a 2 b
6 2.00 b 2 NA
7 1.00 c 3 d
8 1.00 d 3 NA
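If you want the literal "-" from the desired output instead of NA, a possible tweak (converting the factor to character so that "-" is a valid value):
df %>%
  mutate(id2 = data.table::rleid(id)) %>%
  group_by(id2) %>%
  mutate(group2 = lead(as.character(group), default = "-")) %>%
  ungroup() %>%
  select(-id2)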
If I understood your question correctly, you can use the following function:
id <- c(1,1,1,1,2,2,1,1)
group <- c("a","b","c","d","a","b","c","d")
df <- data.frame(id, group)
add_group2 <- function(df) {
  n <- nrow(df)
  # shift group up by one row, padding the end with "-"
  group2 <- c(as.character(df$group[2:n]), "-")
  # blank out positions where the id changes between consecutive rows
  group2[which(c(df$id[-n] - df$id[2:n], 0) != 0)] <- "-"
  return(data.frame(df, group2))
}
add_group2(df)
Result should be:
id group group2
1 1 a b
2 1 b c
3 1 c d
4 1 d -
5 2 a b
6 2 b -
7 1 c d
8 1 d -
I have a data.table like so:
id | id2 | val
--------------
1 | 1 | A
1 | 2 | B
2 | 3 | C
2 | 4 | D
3 | 5 | E
3 | 6 | F
I want to group by the id column and return the maximum id2 for that id, like so:
id | id2 | val
--------------
1 | 2 | B
2 | 4 | D
3 | 6 | F
It's easy in SQL:
SELECT id, MAX(id2) FROM tbl GROUP BY id;
But I want to know how to do this with data.table. So far I have:
tbl[, .(id2 = max(id2)), by = id]
but I don't know how to get the val part.
df <- read.table(header = TRUE, text = "id id2 val
1 1 A
1 2 B
2 3 C
2 4 D
3 5 E
3 6 F")
library(data.table)
setDT(df)
df[, max_id2 := max(id2), by = id]
df <- df[id2 == max_id2, ]
df[, max_id2 := NULL]
id id2 val
1: 1 2 B
2: 2 4 D
3: 3 6 F
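A more compact data.table idiom does it in one step (assuming you want one row per id and ties on the max don't matter):
df[, .SD[which.max(id2)], by = id]
On large tables, df[df[, .I[which.max(id2)], by = id]$V1] is a commonly used faster variant of the same idea.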
There are so many posts on how to get the group-wise min or max with SQL. But how do you do it in R?
Let's say, you have got the following data frame
ID | t | value
a | 1 | 3
a | 2 | 5
a | 3 | 2
a | 4 | 1
a | 5 | 5
b | 2 | 2
b | 3 | 1
b | 4 | 5
For every ID, I don't want the min t, but the value at the min t.
ID | value
a | 3
b | 2
Assuming df is your data.frame:
library(data.table)
setDT(df) # convert to data.table in place
df[, value[which.min(t)], by = ID]
Output:
> df[, value[which.min(t)], by = ID]
ID V1
1: a 3
2: b 2
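To get a proper column name instead of the default V1, wrap the expression in .():
df[, .(value = value[which.min(t)]), by = ID]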
You are looking for tapply:
df <- read.table(textConnection("
ID | t | value
a | 1 | 3
a | 2 | 5
a | 3 | 2
a | 4 | 1
a | 5 | 5
b | 2 | 2
b | 3 | 1
b | 4 | 5"), header = TRUE, sep = "|", strip.white = TRUE)
m <- tapply(1:nrow(df), df$ID, function(i) {
df$value[i[which.min(df$t[i])]]
})
# a b
# 3 2
Two more solutions (with sgibb's df):
sapply(split(df, df$ID), function(x) x$value[which.min(x$t)])
#a b
#3 2
library(plyr)
ddply(df, .(ID), function(x) x$value[which.min(x$t)])
# ID V1
#1 a 3
#2 b 2
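For completeness, a dplyr equivalent (assuming dplyr >= 1.0, which introduced slice_min()):
library(dplyr)
df %>%
  group_by(ID) %>%
  slice_min(t, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(ID, value)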