I want to replace duplicated elements within a group
df <- data.frame(A=c("a", "a", "a", "b", "b", "c"), group = c(1, 1, 2, 2, 2, 3))
I want to keep the first element of the group, while replacing anything else with NA. Something like:
df <- df %>%
group_by(group) %>%
mutate(B = first(A))
This doesn't produce what I want. What I want instead is B <- c("a", NA, "a", NA, NA, "c").
Use replace with duplicated:
df %>% group_by(group) %>% mutate(B = replace(A, duplicated(A), NA))
# A tibble: 6 x 3
# Groups: group [3]
# A group B
# <fctr> <dbl> <fctr>
#1 a 1 a
#2 a 1 NA
#3 a 2 a
#4 b 2 b
#5 b 2 NA
#6 c 3 c
Or, to keep only the first element of each group:
df %>%
group_by(group) %>%
mutate(B = ifelse(row_number() == 1, as.character(A), NA))
# A tibble: 6 x 3
# Groups: group [3]
# A group B
# <fctr> <dbl> <chr>
#1 a 1 a
#2 a 1 <NA>
#3 a 2 a
#4 b 2 <NA>
#5 b 2 <NA>
#6 c 3 c
Or use replace:
df %>%
group_by(group) %>%
mutate(B = replace(A, row_number() > 1, NA))
# A tibble: 6 x 3
# Groups: group [3]
# A group B
# <fctr> <dbl> <fctr>
#1 a 1 a
#2 a 1 NA
#3 a 2 a
#4 b 2 NA
#5 b 2 NA
#6 c 3 c
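The duplicated() approach and the first-row approach only differ when a group holds more than one distinct value, as group 2 does here. The sketch below puts them side by side. It assumes A is stored as character (stringsAsFactors = FALSE), and the column names B_dup and B_first are made up for illustration.

```r
library(dplyr)

df <- data.frame(A = c("a", "a", "a", "b", "b", "c"),
                 group = c(1, 1, 2, 2, 2, 3),
                 stringsAsFactors = FALSE)

res <- df %>%
  group_by(group) %>%
  mutate(
    # NA only where a value repeats an earlier one in the same group
    B_dup = replace(A, duplicated(A), NA),
    # NA everywhere except the first row of the group
    B_first = if_else(row_number() == 1, A, NA_character_)
  ) %>%
  ungroup()

res
```

Only B_first matches the vector asked for in the question, because row 4 (the first "b" in group 2) is not a duplicate and so survives in B_dup.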
In data.table you could do:
library(data.table)
setDT(df)[, B := c(as.character(A[1]), rep(NA, .N - 1)), by = group]
Or the same logic in dplyr:
library(dplyr)
df %>% group_by(group) %>% mutate(B = c(as.character(A[1]), rep(NA, n() - 1)))
# A tibble: 6 x 3
# Groups: group [3]
# A group B
# <fctr> <dbl> <chr>
#1 a 1 a
#2 a 1 <NA>
#3 a 2 a
#4 b 2 <NA>
#5 b 2 <NA>
#6 c 3 c
I have a data frame containing a varying number of data points in the same column:
library(tidyverse)
df <- tribble(~id, ~data,
"A", "a;b;c",
"B", "e;f")
I want to obtain one row per data point, separating the content of column data and distributing it on rows. This code gives the expected result, but is clumsy:
df %>%
separate(data,
into = paste0("dat_",1:5),
sep = ";",
fill = "right") %>%
pivot_longer(starts_with("dat_"),
names_to = "data_number",
names_pattern = "dat_(\\d+)") %>%
filter(!is.na(value))
#> # A tibble: 5 x 3
#> id data_number value
#> <chr> <chr> <chr>
#> 1 A 1 a
#> 2 A 2 b
#> 3 A 3 c
#> 4 B 1 e
#> 5 B 2 f
Tidyverse solutions preferred.
Here is one way, using rowid() from data.table to number the rows within each id:
library(dplyr)
library(tidyr)
library(data.table)
df %>%
separate_rows(data) %>%
mutate(data_number = rowid(id), .before = 2)
Output:
# A tibble: 5 x 3
id data_number data
<chr> <int> <chr>
1 A 1 a
2 A 2 b
3 A 3 c
4 B 1 e
5 B 2 f
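If loading data.table only for rowid() feels heavy, the same numbering can be sketched with dplyr alone by grouping on id and using row_number(). This is a minimal variant of the answer above, not a different method; the explicit sep = ";" is an assumption that the delimiter is always a semicolon.

```r
library(dplyr)
library(tidyr)

df <- tibble::tribble(~id, ~data,
                      "A", "a;b;c",
                      "B", "e;f")

res <- df %>%
  # one row per data point
  separate_rows(data, sep = ";") %>%
  # number the data points within each id
  group_by(id) %>%
  mutate(data_number = row_number(), .before = data) %>%
  ungroup()

res
```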
If the position index is not needed, separate_rows() alone is enough:
library(dplyr)
library(tidyr)
df %>%
separate_rows(data)
output:
# A tibble: 5 x 2
id data
<chr> <chr>
1 A a
2 A b
3 A c
4 B e
5 B f
Using str_split and unnest:
library(tidyverse)
df %>%
mutate(data = str_split(data, ';'),
data_number = map(data, seq_along)) %>%
unnest(c(data, data_number))
# id data data_number
# <chr> <chr> <int>
#1 A a 1
#2 A b 2
#3 A c 3
#4 B e 1
#5 B f 2
Dear dplyr/tidyverse companions, I am looking for a clean solution to the following problem. So far I have only managed to solve it in base R with a loop. How would you solve this cleanly in the tidyverse?
I have a dataset called data, which has uninformative column names and uninformative integer values.
data <- tibble(var1 = rep(c(1:3), 2),
var2 = rep(c(1:3), 2))
# A tibble: 6 x 2
var1 var2
<int> <int>
1 1 1
2 2 2
3 3 3
4 1 1
5 2 2
6 3 3
Additionally, I have a coding table, which provides a better name for every column (var1 -> variable1) and a better value for every code (1 -> "a").
coding <- tibble(variable = c(rep("var1", 3),rep("var2", 3)),
name = c(rep("variable1", 3),rep("variable2", 3)),
code = rep(c(1:3), 2),
value = rep(c("a", "b", "c"), 2))
# A tibble: 6 x 4
variable name code value
<chr> <chr> <int> <chr>
1 var1 variable1 1 a
2 var1 variable1 2 b
3 var1 variable1 3 c
4 var2 variable2 1 a
5 var2 variable2 2 b
6 var2 variable2 3 c
I'm looking for a result in which the columns are renamed and the codes are replaced by the real values, stored as factors. Compare:
result <- tibble(variable1 = factor(rep(c("a", "b", "c"), 2)),
variable2 = factor(rep(c("a", "b", "c"), 2)))
# A tibble: 6 x 2
variable1 variable2
<fct> <fct>
1 a a
2 b b
3 c c
4 a a
5 b b
6 c c
Thank you for your commitment :) :) :) :)
library(dplyr)
library(tidyr)
data %>%
stack() %>%
left_join(coding, by = c(ind = "variable", values = "code")) %>%
group_by(name) %>%
mutate(j = row_number()) %>%
pivot_wider(id_cols = j, values_from = value) %>%
select(-j)
# # A tibble: 6 x 2
# variable1 variable2
# <chr> <chr>
# 1 a a
# 2 b b
# 3 c c
# 4 a a
# 5 b b
# 6 c c
A general solution for any number of columns:
1. create a row number column to identify each row
2. get the data in long format
3. join it with coding on both the variable name and the code
4. get it back in wide format.
library(dplyr)
library(tidyr)
data %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row, values_to = 'code') %>%
left_join(coding, by = c('name' = 'variable', 'code')) %>%
select(row, name = name.y, value) %>%
pivot_wider() %>%
select(-row)
# variable1 variable2
# <chr> <chr>
#1 a a
#2 b b
#3 c c
#4 a a
#5 b b
#6 c c
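Both answers above return character columns, while the question asks for factors. One possible way to get factors directly is to recode each column in place with across() and then rename. This is only a sketch: recode_column is a hypothetical helper written for this example, and it assumes each (variable, code) pair appears exactly once in coding.

```r
library(dplyr)
library(tibble)

data <- tibble(var1 = rep(1:3, 2),
               var2 = rep(1:3, 2))
coding <- tibble(variable = rep(c("var1", "var2"), each = 3),
                 name = rep(c("variable1", "variable2"), each = 3),
                 code = rep(1:3, 2),
                 value = rep(c("a", "b", "c"), 2))

# Hypothetical helper: look up the codes of one column in the coding table
recode_column <- function(x, col, coding) {
  key <- coding[coding$variable == col, ]
  factor(key$value[match(x, key$code)], levels = key$value)
}

result <- data %>%
  # replace every code by its value, column by column
  mutate(across(everything(), ~ recode_column(.x, cur_column(), coding))) %>%
  # swap the old column names for the names given in coding
  rename_with(~ coding$name[match(.x, coding$variable)])

result
```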
Consider the following:
library(dplyr)
df <- data.frame(group_1 = c("A", "B", "A", "C"),
group_2 = c("B", "C", "B", "A"))
> df
group_1 group_2
1 A B
2 B C
3 A B
4 C A
I would like to receive the following output, in pseudocode:
df %>%
group_by(group_1, group_2) %>%
summarize(rows = whichever_rows_contain_group_1_and_group_2, .groups = "keep")
group_1 group_2 rows
A B 1,3
B C 2
C A 4
I've tried playing around with rownames() with not much luck. What is the appropriate command with summarize() that I can use to get what I seek?
The value of rows, for each row, should be in ascending order.
Try using row_number() to create an id variable, then use summarise() with toString() to collapse the ids within each group. Here is the code:
library(dplyr)
#Code
dfnew <- df %>%
mutate(id=row_number()) %>%
group_by(group_1,group_2) %>%
summarise(Var=toString(id))
Output:
# A tibble: 3 x 3
# Groups: group_1 [3]
group_1 group_2 Var
<fct> <fct> <chr>
1 A B 1, 3
2 B C 2
3 C A 4
Another option (many thanks and all credit to @ThomasIsCoding):
#Code2
dfnew2 <- df %>%
mutate(id=row_number()) %>%
group_by(group_1,group_2) %>%
summarise_at("id",toString)
Same output:
# A tibble: 3 x 3
# Groups: group_1 [3]
group_1 group_2 id
<fct> <fct> <chr>
1 A B 1, 3
2 B C 2
3 C A 4
Try aggregate, as below:
aggregate(rows ~ ., cbind(df,rows = 1:nrow(df)),c)
which gives
group_1 group_2 rows
1 C A 4
2 A B 1, 3
3 B C 2
Here is a tidyverse way.
library(dplyr)
library(tibble)
df %>%
rowid_to_column() %>%
group_by(group_1, group_2) %>%
summarise(rows = paste0(rowid, collapse = ","))
# A tibble: 3 x 3
# Groups: group_1 [3]
group_1 group_2 rows
<chr> <chr> <chr>
1 A B 1,3
2 B C 2
3 C A 4
Using dplyr and paste0
library(dplyr)
df %>%
mutate(rn = row_number()) %>%
group_by(group_1, group_2) %>%
summarise(rows = paste0(rn, collapse = ','))
`summarise()` regrouping output by 'group_1' (override with `.groups` argument)
# A tibble: 3 x 3
# Groups: group_1 [3]
group_1 group_2 rows
<chr> <chr> <chr>
1 A B 1,3
2 B C 2
3 C A 4
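All of the answers above collapse the row numbers into a string. If the indices are needed for further computation, a list column may be more convenient. The sketch below is a variation on the same idea, not something the answers propose; the column name id is arbitrary.

```r
library(dplyr)

df <- data.frame(group_1 = c("A", "B", "A", "C"),
                 group_2 = c("B", "C", "B", "A"))

res <- df %>%
  mutate(id = row_number()) %>%
  group_by(group_1, group_2) %>%
  # keep the integer row numbers of each group as a list element
  summarise(rows = list(id), .groups = "drop")

res$rows[[1]]
```

Because row_number() follows the original row order and grouping preserves that order within each group, the indices in every list element come out in ascending order.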
I have data:
rowID incidentID participant.type
1 1 A
2 1 B
3 2 A
4 3 A
5 3 B
6 3 C
7 4 B
8 4 C
And I would like to end up with:
rowID incident participant.type participant.type.1 participant.type.2
1 1 A B
2 2 A
3 3 A B C
4 4 B C
I tried spread() but can't get one line per incident; I don't think I have a way of creating a key-value pair, so I wonder if there is some other method for doing this.
Before using spread(), you need to create a proper key argument.
df %>% select(-rowID) %>%
group_by(incidentID) %>%
mutate(id = 1:n()) %>%
spread(id, participant.type)
# incidentID `1` `2` `3`
# <int> <fct> <fct> <fct>
# 1 1 A B NA
# 2 2 A NA NA
# 3 3 A B C
# 4 4 B C NA
Since your grouping is based on the row order within the incidentID column, the following simple solution will also work.
It just filters the data frame and then merges the pieces at the end.
It is probably not the most efficient solution computationally, but it is easy to understand.
library(tidyverse)
df <-
tribble(
~rowID, ~incidentID, ~participant.type,
1, 1, "A",
2, 1, "B",
3, 2, "A",
4, 3, "A",
5, 3, "B",
6, 3, "C",
7, 4, "B",
8, 4, "C")
df_1 <- df %>%
select(-rowID) %>%
group_by(incidentID) %>%
filter(row_number()==1)
df_2 <- df %>%
select(-rowID) %>%
group_by(incidentID) %>%
filter(row_number()==2) %>%
rename(participant.type.1 = participant.type)
df_3 <- df %>%
select(-rowID) %>%
group_by(incidentID) %>%
filter(row_number()==3) %>%
rename(participant.type.2 = participant.type)
full_join(df_1, full_join(df_2, df_3))
Result:
Joining, by = "incidentID"
Joining, by = "incidentID"
# A tibble: 4 x 4
# Groups: incidentID [?]
incidentID participant.type participant.type.1 participant.type.2
<dbl> <chr> <chr> <chr>
1 1 A B NA
2 2 A NA NA
3 3 A B C
4 4 B C NA
Here's my solution:
df %>%
select(-rowID) %>%
group_by(incidentID) %>%
nest() %>%
mutate(data = map_chr(data, ~str_c(.x$participant.type, collapse = '_'))) %>%
separate(data, paste0('participant.type.', 0:2)) %>%
mutate_at(2:4, ~replace_na(.x, ''))
We can use reshape2::dcast for this
reshape2::dcast(df, insidentID ~ participant.type)
# insidentID A B C
# 1 1 <NA> B <NA>
# 2 8 <NA> B <NA>
# 3 12 <NA> <NA> C
# 4 16 A <NA> <NA>
# 5 24 <NA> B <NA>
# 6 27 <NA> B C
# 7 29 <NA> <NA> C
with the data
set.seed(123)
df <- data.frame(insidentID = sample(0:30, 8L, replace = TRUE),
participant.type = sample(LETTERS[1:3], 8L, replace = TRUE),
stringsAsFactors = FALSE)
df
# insidentID participant.type
# 1 8 B
# 2 24 B
# 3 12 C
# 4 27 B
# 5 29 C
# 6 1 B
# 7 16 A
# 8 27 C
The 'related question' link provided by @markus shows a variety of other solutions, including what appears to be the most concise in a tidyverse format:
df %>%
group_by(incidentID) %>%
mutate(rn = paste0("newcolumn",row_number())) %>%
spread(rn, participant.type)
gives:
incidentID newcolumn1 newcolumn2 newcolumn3
<int> <fct> <fct> <fct>
1 1 A B NA
2 2 A NA NA
3 3 A B C
4 4 B C NA
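In current tidyr, spread() is superseded by pivot_wider(), so the same reshaping could be sketched as below. The names_prefix value is chosen here only to resemble the column names requested in the question; the numbering starts at 1 rather than with an unnumbered first column.

```r
library(dplyr)
library(tidyr)

df <- tibble::tribble(
  ~rowID, ~incidentID, ~participant.type,
  1, 1, "A",
  2, 1, "B",
  3, 2, "A",
  4, 3, "A",
  5, 3, "B",
  6, 3, "C",
  7, 4, "B",
  8, 4, "C")

wide <- df %>%
  select(-rowID) %>%
  # number the participants within each incident to build the key
  group_by(incidentID) %>%
  mutate(rn = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = rn,
              values_from = participant.type,
              names_prefix = "participant.type.")

wide
```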
I have the following tibble:
library(tidyverse)
df <- tibble::tribble(
~gene, ~colB, ~colC,
"a", 1, 2,
"b", 2, 3,
"c", 3, 4,
"d", 1, 1
)
df
#> # A tibble: 4 x 3
#> gene colB colC
#> <chr> <dbl> <dbl>
#> 1 a 1 2
#> 2 b 2 3
#> 3 c 3 4
#> 4 d 1 1
What I want to do is filter every column after the gene column for values greater than or equal to 2 (>= 2), resulting in this:
gene, colB, colC
a NA 2
b 2 3
c 3 4
How can I achieve that?
The actual number of columns after gene is more than just 2.
One solution: convert from wide to long format, so you can filter on just one column, then convert back to wide at the end if required. Note that this will drop genes where no values meet the condition.
library(tidyverse)
df %>%
gather(name, value, -gene) %>%
filter(value >= 2) %>%
spread(name, value)
# A tibble: 3 x 3
gene colB colC
* <chr> <dbl> <dbl>
1 a NA 2
2 b 2 3
3 c 3 4
Recent versions of dplyr have filter_at, which can be used to filter to any rows that have a value greater than or equal to 2; values below 2 can then be replaced with NA through mutate_at, so
df %>%
filter_at(vars(-gene), any_vars(. >= 2)) %>%
mutate_at(vars(-gene), ~ replace(., . < 2, NA))
#> # A tibble: 3 x 3
#> gene colB colC
#> <chr> <dbl> <dbl>
#> 1 a NA 2
#> 2 b 2 3
#> 3 c 3 4
or similarly,
df %>%
mutate_at(vars(-gene), ~ replace(., . < 2, NA)) %>%
filter_at(vars(-gene), any_vars(!is.na(.)))
or, without filter_at:
df %>%
mutate_at(vars(-gene), ~ replace(., . < 2, NA)) %>%
filter(rowSums(is.na(.)) < (ncol(.) - 1))
All return the same thing.
We can use data.table
library(data.table)
setDT(df)[df[, Reduce(`|`, lapply(.SD, `>=`, 2)), .SDcols = colB:colC]
][, (2:3) := lapply(.SD, function(x) replace(x, x < 2, NA)), .SDcols = colB:colC][]
# gene colB colC
#1: a NA 2
#2: b 2 3
#3: c 3 4
Or with melt/dcast
dcast(melt(setDT(df), id.var = 'gene')[value>=2], gene ~variable)
# gene colB colC
#1: a NA 2
#2: b 2 3
#3: c 3 4
Alternatively, we could try the code below:
df %>% rowwise %>%
filter(any(c_across(starts_with('col'))>=2)) %>%
mutate(across(starts_with('col'), ~ifelse(!(.>=2), NA, .)))
Created on 2023-02-05 with reprex v2.0.2
# A tibble: 3 × 3
# Rowwise:
gene colB colC
<chr> <dbl> <dbl>
1 a NA 2
2 b 2 3
3 c 3 4
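The same two steps (keep rows with at least one value >= 2, then blank out the values below 2) can also be sketched without rowwise(), using if_any() and across(). This assumes dplyr 1.0.4 or later, where if_any() is available, and selects every column except gene rather than matching on the "col" prefix.

```r
library(dplyr)

df <- tibble::tribble(
  ~gene, ~colB, ~colC,
  "a", 1, 2,
  "b", 2, 3,
  "c", 3, 4,
  "d", 1, 1
)

res <- df %>%
  # keep rows where at least one non-gene column is >= 2
  filter(if_any(-gene, ~ .x >= 2)) %>%
  # set the remaining values below 2 to NA
  mutate(across(-gene, ~ replace(.x, .x < 2, NA)))

res
```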