How can I count the number of unique values such that I go from:
organisation <- c("A","A","A","A","B","B","B","B","C","C","C","C","D","D","D","D")
variable <- c("0","0","1","2","0","0","1","1","0","0","1","1","0","0","2","2")
df <- data.frame(organisation,variable)
organisation | variable
A | 0
A | 0
A | 1
A | 2
B | 0
B | 0
B | 1
B | 1
C | 0
C | 0
C | 1
C | 1
D | 0
D | 0
D | 2
D | 2
To:
unique_values | frequency
0,1,2 | 1
0,1 | 2
0,2 | 1
There are only 3 possible sequences:
0,1,2
0,1
0,2
Try this (the \(x) lambda syntax needs R 4.1 or later):
s <- aggregate(. ~ organisation , data = df , \(x) names(table(x)))
s$variable <- sapply(s$variable , \(x) paste0(x , collapse = ","))
setNames(aggregate(. ~ variable , data = s , length) , c("unique_values" , "frequency"))
output
unique_values frequency
1 0,1 2
2 0,1,2 1
3 0,2 1
You can do something simple like this:
library(dplyr)
library(stringr)
distinct(df) %>%
  arrange(variable) %>%
  group_by(organisation) %>%
  summarize(unique_values = str_c(variable, collapse = ",")) %>%
  count(unique_values)
Output:
unique_values n
<chr> <int>
1 0,1 2
2 0,1,2 1
3 0,2 1
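For comparison, here is a minimal base R sketch of the same idea without aggregate or dplyr. It assumes the df built in the question; the names sets and freq are only illustrative. Each organisation is first reduced to one comma-separated string of its sorted unique values, and those strings are then tabulated.

```r
organisation <- c("A","A","A","A","B","B","B","B","C","C","C","C","D","D","D","D")
variable <- c("0","0","1","2","0","0","1","1","0","0","1","1","0","0","2","2")
df <- data.frame(organisation, variable)

# One string per organisation: its sorted unique values joined by commas
sets <- tapply(df$variable, df$organisation,
               function(x) paste(sort(unique(x)), collapse = ","))

# Count how many organisations share each set of values
freq <- as.data.frame(table(unique_values = sets),
                      responseName = "frequency",
                      stringsAsFactors = FALSE)
freq
```

Sorting before pasting matters: without it, "1,0" and "0,1" would be counted as different sets.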
I have data that look like this
+---+-------+
| | col1 |
+---+-------+
| 1 | A |
| 2 | A,B |
| 3 | B,C |
| 4 | B |
| 5 | A,B,C |
+---+-------+
Expected Output
+---+---+---+---+
|   | A | B | C |
+---+---+---+---+
| 1 | 1 | 0 | 0 |
| 2 | 1 | 1 | 0 |
| 3 | 0 | 1 | 1 |
| 4 | 0 | 1 | 0 |
| 5 | 1 | 1 | 1 |
+---+---+---+---+
How can I encode it like this?
Maybe this could help
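library(dplyr)
library(tidyr)
df %>%
  mutate(r = 1:n()) %>%   # row id, so each original row can be told apart
  unnest(col1) %>%        # one row per (row id, value) pair
  table() %>%             # cross-tabulate col1 against r
  t()                     # transpose so rows are r and columns are values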
which gives
col1
r A B C
1 1 0 0
2 1 1 0
3 0 1 1
4 0 1 0
5 1 1 1
Data
df <- tibble(
col1 = list(
"A",
c("A", "B"),
c("B", "C"),
"B",
c("A", "B", "C")
)
)
If your data is given in the following format
df <- data.frame(
col1 = c("A", "A,B", "B,C", "B", "A,B,C")
)
then you can try
with(
df,
table(rev(stack(setNames(strsplit(col1, ","), seq_along(col1)))))
)
which gives
values
ind A B C
1 1 0 0
2 1 1 0
3 0 1 1
4 0 1 0
5 1 1 1
You could use table() with map_df() from purrr to count the occurrences
in each element of a list and return a data frame. Putting it into a
function with some post-processing, and using dplyr's data frame unpacking in
mutate(), you could do something like this to stay within a data frame
context:
library(tidyverse)
one_hot <- function(x) {
  map_df(x, table) %>%
    mutate_all(as.integer) %>%
    mutate_all(replace_na, 0L)
}
df <- data.frame(col1 = c("A", "A,B", "B,C", "B", "A,B,C"))
df %>%
  mutate(
    one_hot(strsplit(col1, ","))
  )
#> col1 A B C
#> 1 A 1 0 0
#> 2 A,B 1 1 0
#> 3 B,C 0 1 1
#> 4 B 0 1 0
#> 5 A,B,C 1 1 1
An additional base R solution:
+(
  with(
    df,
    sapply(unique(unlist(strsplit(col1, ","))), grepl, col1)
  )
)
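One caveat with the grepl approach: grepl matches substrings, so it is only safe while no label is contained in another (a label "A" would also match an entry "AB"). A hedged sketch that compares whole labels instead, assuming the comma-separated col1 from above; parts, labels and one_hot are illustrative names:

```r
col1 <- c("A", "A,B", "B,C", "B", "A,B,C")

# grepl does substring matching: "A" is found inside "AB"
grepl("A", c("AB", "B"))

# Split each entry into its labels and compare whole labels with %in%
parts <- strsplit(col1, ",", fixed = TRUE)
labels <- sort(unique(unlist(parts)))
one_hot <- t(sapply(parts, function(p) as.integer(labels %in% p)))
colnames(one_hot) <- labels
one_hot
```

For the single-letter labels in the question both approaches give the same matrix; they only differ once labels overlap.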
I have a df like this:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
3 | B |
3 | C |
4 | D |
4 | C |
In R, how do I filter for VisitIDs as long as they contain Item A & B?
Expected Outcome:
VisitID | Item |
1 | A |
1 | B |
1 | C |
1 | D |
2 | A |
2 | D |
2 | B |
I tried df %>% group_by(VisitID) %>% filter(any(Item == 'A' & Item == 'B')), but it doesn't work.
df <- read_delim("ID | Item
1 | A
1 | B
2 | A
3 | B
1 | C
4 | C
5 | B
3 | A
4 | A
5 | D", delim = "|", trim_ws = TRUE)
Since you want both "A" and "B", you can use all:
library(dplyr)
df %>% group_by(VisitID) %>% filter(all(c("A", "B") %in% Item))
# VisitID Item
# <int> <chr>
#1 1 A
#2 1 B
#3 1 C
#4 1 D
#5 2 A
#6 2 D
#7 2 B
Or, if you want to use any, apply it to each condition separately:
df %>% group_by(VisitID) %>% filter(any(Item == 'A') && any(Item == 'B'))
An option with data.table:
library(data.table)
setDT(df)[, .SD[all(c("A", "B") %in% Item)], VisitID]
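The attempt in the question fails because Item == 'A' & Item == 'B' is evaluated element by element, and no single Item can equal both values at once. The base R sketch below shows this on one visit and then applies the group-level test with tapply; the data frame is rebuilt from the question's table, and ok and result are illustrative names.

```r
Item <- c("A", "B", "C", "D")

# Element-wise AND: every position is FALSE, so any() is FALSE
any(Item == "A" & Item == "B")

# Test each value against the whole group instead
any(Item == "A") && any(Item == "B")
all(c("A", "B") %in% Item)

# Same test applied per VisitID in base R
df <- data.frame(
  VisitID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
  Item    = c("A", "B", "C", "D", "A", "D", "B", "B", "C", "D", "C")
)
ok <- tapply(df$Item, df$VisitID, function(x) all(c("A", "B") %in% x))
result <- df[df$VisitID %in% names(ok)[ok], ]
result
```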
Hi I have df as below:
ID | Gender
1 | M
1 | F
2 | F
2 | F
2 | F
3 | M
3 | M
3 | F
4 | M
4 | M
4 | M
I'd like to keep the distinct rows for IDs that have more than one Gender (this is dirty data to be flagged, since a person can't have more than one Gender).
Results should be:
ID | Gender
1 | M
1 | F
3 | M
3 | F
How can I go about this in R using dplyr?
Using dplyr,
library(dplyr)
df %>%
group_by(ID) %>%
filter(n_distinct(Gender) > 1) %>%
distinct(Gender)
which gives,
# A tibble: 4 x 2
# Groups: ID [2]
Gender ID
<chr> <int>
1 M 1
2 F 1
3 M 3
4 F 3
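As a rough base R equivalent, n_distinct(Gender) per group can be written as length(unique(x)) inside ave, and distinct() as unique() on the filtered rows. This is only a sketch on the data from the question; n_gender and dirty are illustrative names.

```r
df <- data.frame(
  ID     = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
  Gender = c("M", "F", "F", "F", "F", "M", "M", "F", "M", "M", "M")
)

# Number of distinct Gender values per ID, repeated on every row of that ID
n_gender <- ave(as.integer(factor(df$Gender)), df$ID,
                FUN = function(x) length(unique(x)))

# Keep IDs with more than one Gender, then drop duplicate rows
dirty <- unique(df[n_gender > 1, ])
dirty
```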
I have the following table:
+----+------------+----------+
| ID | Date | Variable |
+----+------------+----------+
| a | 12/03/2017 | d |
| a | 15/04/2017 | d |
| a | 20/06/2017 | c |
| b | 14/05/2017 | c |
| b | 15/08/2017 | c |
| b | 16/09/2017 | c |
+----+------------+----------+
For each ID, I'd like to have a check in a separate column that tells whether there was a "c" value after the occurrence of a "d" value, like this:
+----+------------+----------+-------+------------+
| ID | Date | Variable | Check | Date |
+----+------------+----------+-------+------------+
| a | 12/03/2017 | d | 1 | 20/06/2017 |
| a | 15/04/2017 | d | 1 | 20/06/2017 |
| a | 20/06/2017 | c | 1 | 20/06/2017 |
| b | 14/05/2017 | c | 0 | 0 |
| b | 15/08/2017 | c | 0 | 0 |
| b | 16/09/2017 | c | 0 | 0 |
+----+------------+----------+-------+------------+
It's not just about finding the occurrence of "c", but about seeing whether "c" occurs after "d" or not. It would also help to have the corresponding date in a separate column. I tried removing the duplicates and then identifying the lead value (or n of rows > 1), but is there a simpler way to do this?
Any dplyr or data.table approach would be most helpful.
A solution using dplyr. There must be a better way than this, but I think this should work. unique(Variable[!is.na(Variable)]) is to get a vector with only c("c", "d"), c("d", "c"), "c", or "d". If you are sure there are no NA, you can remove !is.na. Date[Variable %in% "c"][1] is to select the first date.
library(dplyr)
dat2 <- dat %>%
  group_by(ID) %>%
  mutate(Check = ifelse(identical(unique(Variable[!is.na(Variable)]), c("d", "c")),
                        1L, 0L)) %>%
  mutate(Date2 = ifelse(Check == 1L, Date[Variable %in% "c"][1], "0")) %>%
  ungroup()
dat2
# # A tibble: 6 x 5
# ID Date Variable Check Date2
# <chr> <chr> <chr> <int> <chr>
# 1 a 12/03/2017 d 1 20/06/2017
# 2 a 15/04/2017 d 1 20/06/2017
# 3 a 20/06/2017 c 1 20/06/2017
# 4 b 14/05/2017 c 0 0
# 5 b 15/08/2017 c 0 0
# 6 b 16/09/2017 c 0 0
DATA
dat <- read.table(text = "ID Date Variable
a '12/03/2017' d
a '15/04/2017' d
a '20/06/2017' c
b '14/05/2017' c
b '15/08/2017' c
b '16/09/2017' c",
header = TRUE, stringsAsFactors = FALSE)
A data.table solution. As also suggested by @RYoda, you can use data.table::shift to test for your condition and then merge the results back into the original dataset.
check <- dat[, {
idx <- Variable =='d' & shift(Variable, type="lead") == "c"
list(MatchDate=ifelse(any(idx), shift(Date, type="lead", fill=NA_character_)[idx][1L], "0"),
Check=as.integer(any(idx)))
}, by=.(ID)]
dat[check, on=.(ID)]
# ID Date Variable MatchDate Check
# 1: a 12/03/2017 d 20/06/2017 1
# 2: a 15/04/2017 d 20/06/2017 1
# 3: a 20/06/2017 c 20/06/2017 1
# 4: b 14/05/2017 c 0 0
# 5: b 15/08/2017 c 0 0
# 6: b 16/09/2017 c 0 0
data:
library(data.table)
dat <- data.table(ID=rep(c('a','b'), each=3),
Date=c("12/03/2017","15/04/2017","20/06/2017","14/05/2017","15/08/2017","16/09/2017"),
Variable=c('d','d','c','c','c','c'))
One solution uses fill from the tidyr package. The approach is as follows:
First populate Check and c_Date for rows where Variable is c. Then fill the rows above using the fill function on both the Check and c_Date columns. This step populates the desired values in rows with the d value. Finally, replace the values of Check and c_Date for rows where Variable is c.
Note: OP suggested that Check for rows with Variable as c can be either 0 or 1. My solution has considered it to be 0.
# Data
df <- read.table(text = "ID Date Variable
a 12/03/2017 d
a 15/04/2017 d
a 20/06/2017 c
b 14/05/2017 c
b 15/08/2017 c
b 16/09/2017 c", header = T, stringsAsFactors = F)
df$Date <- as.POSIXct(df$Date, format = "%d/%m/%Y")
library(dplyr)
library(tidyr)
df %>% group_by(ID) %>%
arrange(ID, Date) %>%
mutate(Check = ifelse(Variable == "c", 1L, NA),
c_Date = ifelse(Variable == "c", as.character(Date), NA) ) %>%
fill(Check, .direction = "up") %>%
fill(c_Date, .direction = "up") %>%
mutate(Check = ifelse(Variable == "c", 0L, Check),
c_Date = ifelse(Variable == "c", NA, c_Date) )
# Result
# ID Date Variable Check c_Date
# <chr> <dttm> <chr> <int> <chr>
# 1 a 2017-03-12 00:00:00 d 1 2017-06-20
# 2 a 2017-04-15 00:00:00 d 1 2017-06-20
# 3 a 2017-06-20 00:00:00 c 0 <NA>
# 4 b 2017-05-14 00:00:00 c 0 <NA>
# 5 b 2017-08-15 00:00:00 c 0 <NA>
# 6 b 2017-09-16 00:00:00 c 0 <NA>
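The requirement "a c occurs somewhere after a d" can also be spelled out directly in base R. The sketch below is one possible reading of the question, not the method of any answer above: it assumes rows are already sorted by date within each ID, keeps Date as character, and uses the hypothetical helper name check_group. It looks for the first "d" in each ID and then for the first "c" at a later position.

```r
dat <- data.frame(
  ID = rep(c("a", "b"), each = 3),
  Date = c("12/03/2017", "15/04/2017", "20/06/2017",
           "14/05/2017", "15/08/2017", "16/09/2017"),
  Variable = c("d", "d", "c", "c", "c", "c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: flag a group if any "c" comes after its first "d"
check_group <- function(g) {
  first_d <- match("d", g$Variable)  # position of first "d", NA if none
  c_after <- if (is.na(first_d)) {
    integer(0)
  } else {
    which(g$Variable == "c" & seq_along(g$Variable) > first_d)
  }
  found <- length(c_after) > 0
  g$Check <- as.integer(found)
  g$Date2 <- if (found) g$Date[c_after[1]] else "0"  # date of that first "c"
  g
}

res <- do.call(rbind, lapply(split(dat, dat$ID), check_group))
rownames(res) <- NULL
res
```

Unlike a test on the immediately following row, this also flags a group where other values sit between the "d" and the later "c".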
Here is a reproducible example.
id <- c(1,1,1,1,2,2,1,1)
group <- c("a","b","c","d","a","b","c","d")
df <- data.frame(id, group)
I want something like this as end result.
+====+========+========+
| id | group1 | group2 |
+====+========+========+
| 1 | a | b |
+----+--------+--------+
| 1 | b | c |
+----+--------+--------+
| 1 | c | d |
+----+--------+--------+
| 1 | d | - |
+----+--------+--------+
| 2 | a | b |
+----+--------+--------+
| 2 | b | - |
+----+--------+--------+
| 1 | c | d |
+----+--------+--------+
| 1 | d | - |
+----+--------+--------+
Note that the order of the IDs matters. I have another column with a timestamp.
One solution with dplyr and rleid from data.table:
library(dplyr)
df %>%
mutate(id2 = data.table::rleid(id)) %>%
group_by(id2) %>%
mutate(group2 = lead(group))
# A tibble: 8 x 4
# Groups: id2 [3]
id group id2 group2
<dbl> <fct> <int> <fct>
1 1.00 a 1 b
2 1.00 b 1 c
3 1.00 c 1 d
4 1.00 d 1 NA
5 2.00 a 2 b
6 2.00 b 2 NA
7 1.00 c 3 d
8 1.00 d 3 NA
If I understood your question correctly, you can use the following function:
id <- c(1,1,1,1,2,2,1,1)
group <- c("a","b","c","d","a","b","c","d")
df <- data.frame(id, group)
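add_group2 <- function(df) {
  n <- nrow(df)  # use the data frame's own row count, not the global `group`
  group2 <- as.character(df$group[2:n])
  group2 <- c(group2, "-")
  # blank out the last row of each run, where the next id differs
  group2[which(c(df$id[-n] - c(df$id[2:n]), 0) != 0)] <- "-"
  return(data.frame(df, group2))
}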
add_group2(df)
Result should be:
id group group2
1 1 a b
2 1 b c
3 1 c d
4 1 d -
5 2 a b
6 2 b -
7 1 c d
8 1 d -
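Both answers rely on the same idea: number the consecutive runs of id, then take the next group value within each run. A minimal base R sketch of that idea, assuming the id and group vectors from the question (run and group2 are illustrative names, and NA marks the end of a run instead of "-"):

```r
id <- c(1, 1, 1, 1, 2, 2, 1, 1)
group <- c("a", "b", "c", "d", "a", "b", "c", "d")

# Run id: increases by one each time id changes from the previous row
run <- cumsum(c(TRUE, diff(id) != 0))

# Within each run, shift group up by one and pad the end with NA
group2 <- ave(group, run, FUN = function(x) c(x[-1], NA))

data.frame(id, group, group2)
```

Grouping by run rather than by id is what keeps the second block of id 1 separate from the first.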