How to create a unique identifier based on another column in R

I have a data frame with five thousand rows. I need to create a new column holding a unique identifier made of the value in column "gender", then the number 21, then a sequential number starting at 0001. The sequential number must restart for each distinct letter in column "gender" (gender + 21 + seq#).
library(tibble)
df <- tibble(
  name = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
  gender = c("F", "F", "F", "M", "M", "F", "M", "F", "F")
)
df
name gender
<chr> <chr>
1 A F
2 B F
3 C F
4 D M
5 E M
6 F F
7 G M
8 H F
9 I F
With unique identifier:
df
name gender id
1 A F F210001
2 B F F210002
3 C F F210003
4 D M M210001
5 E M M210002
6 F F F210004
7 G M M210003
8 H F F210005
9 I F F210006
Any help on how to achieve this would be much appreciated.

One option is to paste the pieces together, using rowid from data.table for the per-gender counter:
library(dplyr)
library(stringr)
library(data.table)
df1 <- df %>%
  mutate(id = str_c(gender, rowid(gender) + 210000))
Or use group_by with row_number:
df1 <- df %>%
  group_by(gender) %>%
  mutate(id = str_c(gender, row_number() + 210000)) %>%
  ungroup()
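The `+ 210000` shortcut only keeps the four-digit padding while each gender has at most 9999 rows. A minimal sketch of a variant that builds the string with sprintf instead, assuming the same `df` as in the question (the name `df_ids` is made up for illustration):

```r
library(dplyr)

# Example data from the question
df <- tibble(
  name = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
  gender = c("F", "F", "F", "M", "M", "F", "M", "F", "F")
)

df_ids <- df %>%
  group_by(gender) %>%
  # row_number() restarts at 1 within each gender; %04d zero-pads it to 4 digits
  mutate(id = sprintf("%s21%04d", gender, row_number())) %>%
  ungroup()

df_ids
```

With more than 9999 rows in one gender, the counter simply grows to five digits instead of spilling into the "21" part.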

In base R you could use ave:
transform(df, group = ave(gender, gender, FUN = function(x) sprintf("%s21%04d", x, seq_along(x))))
name gender group
1 A F F210001
2 B F F210002
3 C F F210003
4 D M M210001
5 E M M210002
6 F F F210004
7 G M M210003
8 H F F210005
9 I F F210006

Related

How to create a long dataset based on number in column indicating how many times they should correspond [duplicate]

For example, I have a dataset that looks like this:
structure(list(ID = c(1, 2, 3, 4, 5), COL1 = c("A", "B", "C",
"D", "E"), COL2 = c("F", "G", "H", "I", "J"), Paired = c(2, 3,
1, 2, 1)), row.names = c(NA, -5L), class = c("tbl_df", "tbl",
"data.frame"))
ID COL1 COL2 Paired
1 A F 2
2 B G 3
3 C H 1
4 D I 2
5 E J 1
I would like to create a dataset that looks like this. Note how the number in the Paired column controls the repetition:
Col Col2
A F
A F
F A
F A
B G
B G
B G
G B
G B
G B
C H
H C
D I
D I
I D
I D
E J
J E
Note that A and F are paired two times. I want the long format to show each pairing in both orders, so 2 pairs gives AF, AF, FA, FA.
We can use uncount to repeat the rows, then bind a copy with the two columns swapped:
library(dplyr)
library(tidyr)
df1 %>%
  uncount(Paired) -> tmp
tmp %>%
  rename(COL1 = COL2, COL2 = COL1) %>%
  bind_rows(tmp) %>%
  select(-ID)
Output:
# A tibble: 18 x 2
COL2 COL1
<chr> <chr>
1 A F
2 A F
3 B G
4 B G
5 B G
6 C H
7 D I
8 D I
9 E J
10 F A
11 F A
12 G B
13 G B
14 G B
15 H C
16 I D
17 I D
18 J E
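The output above contains the right rows but not in the order shown in the question, where each pair's reversed rows follow its forward rows. A rough sketch of one way to get that order, assuming the `df1` from the question; the helper names `expanded`, `swapped`, and `direction` are made up for illustration:

```r
library(dplyr)
library(tidyr)

# Example data from the question
df1 <- tibble(
  ID = c(1, 2, 3, 4, 5),
  COL1 = c("A", "B", "C", "D", "E"),
  COL2 = c("F", "G", "H", "I", "J"),
  Paired = c(2, 3, 1, 2, 1)
)

# Repeat each row Paired times (uncount drops the Paired column)
expanded <- uncount(df1, Paired)

# Same rows with the two value columns swapped
swapped <- tibble(ID = expanded$ID, COL1 = expanded$COL2, COL2 = expanded$COL1)

result <- bind_rows(forward = expanded, reversed = swapped, .id = "direction") %>%
  arrange(ID, direction) %>%   # "forward" sorts before "reversed" within each ID
  select(COL1, COL2)

result
```

Sorting on the label works here only because "forward" happens to come before "reversed" alphabetically; a numeric flag would be a more explicit choice.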

Calculate number of unique values in grouped matrix

I have a grouped data set that looks like this:
data = data.frame(group = c(1,1,1,1,2,2,2,2),
c1 = c("A", "E", "A", "J", "L", "M", "L", "J"),
c2 = c("B", "F", "F", "K", "B", "F", "T", "E"),
c3 = c("C", "G", "C", "L", "C", "X", "C", "V"),
c4 = c("D", "H", "I", "M", "D", "T", "I", "W"))
And I need to calculate the number of values in each row that are not duplicated within each group. For example, something that looks like this:
group c1 c2 c3 c4 uniq.vals
1 1 A B C D 2
2 1 E F G H 3
3 1 A F C I 1
4 1 J K L M 4
5 2 L B C D 2
6 2 M F X T 3
7 2 L T C I 1
8 2 J E V W 4
The count for row 1 would be 2, because B and D do not show up in any of the other rows within group 1.
I am familiar with using group_by and summarize but I am having trouble extending that to this particular situation, which requires that each value be checked across multiple columns and rows. For example, n_distinct on its own would not work because I'm looking for non-duplicated values, not unique values.
Ideally the solution would also ignore NAs and not count them as duplicated or non-duplicated values.
Here is an option with tidyverse. Reshape to 'long' format with pivot_longer and group by 'group'. Replace every duplicated 'value' with NA, then group by row number and count the remaining values with n_distinct (number of distinct elements). Finally, bind the counts to the original data.
library(dplyr)
library(tidyr)
data %>%
  mutate(rn = row_number()) %>%
  pivot_longer(cols = starts_with('c')) %>%
  group_by(group) %>%
  mutate(value = replace(value, duplicated(value) | duplicated(value,
    fromLast = TRUE), NA)) %>%
  group_by(rn) %>%
  summarise(uniq.vals = n_distinct(value, na.rm = TRUE), .groups = 'drop') %>%
  select(uniq.vals) %>%
  bind_cols(data, .)
Output:
# group c1 c2 c3 c4 uniq.vals
#1 1 A B C D 2
#2 1 E F G H 3
#3 1 A F C I 1
#4 1 J K L M 4
#5 2 L B C D 2
#6 2 M F X T 3
#7 2 L T C I 1
#8 2 J E V W 4
In base R you could do:
a <- tapply(unlist(data[-1]), data$group[row(data[-1])], table)
data$uniq.vals <- c(by(data, seq(nrow(data)),
  function(x) sum(a[[x[, 1]]][unlist(x[-1])] < 2)))
group c1 c2 c3 c4 uniq.vals
1 1 A B C D 2
2 1 E F G H 3
3 1 A F C I 1
4 1 J K L M 4
5 2 L B C D 2
6 2 M F X T 3
7 2 L T C I 1
8 2 J E V W 4
Note that in your case, row 3 should have 1 since only I is the unique value
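The question also asks that NAs be ignored. A minimal base R sketch of the same idea with that handled explicitly, assuming the `data` from the question; the helper `count_non_duplicated` and the vector `value_cols` are hypothetical names, not part of either answer:

```r
# Example data from the question
data <- data.frame(group = c(1, 1, 1, 1, 2, 2, 2, 2),
                   c1 = c("A", "E", "A", "J", "L", "M", "L", "J"),
                   c2 = c("B", "F", "F", "K", "B", "F", "T", "E"),
                   c3 = c("C", "G", "C", "L", "C", "X", "C", "V"),
                   c4 = c("D", "H", "I", "M", "D", "T", "I", "W"))

value_cols <- c("c1", "c2", "c3", "c4")

# For one group's rows, count per row the values that occur exactly once
# anywhere in that group
count_non_duplicated <- function(block) {
  vals <- as.vector(as.matrix(block[value_cols]))  # flattened column by column
  freq <- table(vals)                              # table() drops NA by default
  is_single <- as.vector(freq[vals]) == 1          # NA for NA cells
  rowSums(matrix(is_single, nrow = nrow(block)), na.rm = TRUE)
}

data$uniq.vals <- unsplit(
  lapply(split(data, data$group), count_non_duplicated),
  data$group
)

data
```

Because NA cells produce NA in `is_single` and `rowSums` uses `na.rm = TRUE`, they count neither as duplicated nor as non-duplicated.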

Mutating a character vector through case_when on specific values of a variable in a specific order

library(dplyr)
df <- tibble(Letters = c("A", "B", "C", "C", "C", "D", "D", "D", "E", "E", "E"))
meta <- c("foo", "bar", "baz")
Letters
<chr>
1 A
2 B
3 C
4 C
5 C
6 D
7 D
8 D
9 E
10 E
11 E
Here, I would like to assign the values of the character vector meta to the rows with letters C, D, and E, in the order of the vector.
I've tried something like:
df <- df %>% mutate(New = case_when(Letters %in% c("C", "D", "E") ~
meta %>% rep_len(nrow(df)),
TRUE ~ NA_character_))
However, this starts recycling meta at the top of the data frame, so C, D, and E are filled in the order baz, foo, bar instead of foo, bar, baz.
# A tibble: 11 x 2
Letters New
<chr> <chr>
1 A NA
2 B NA
3 C baz
4 C foo
5 C bar
6 D baz
7 D foo
8 D bar
9 E baz
10 E foo
11 E bar
Desired output:
Letters New
<chr> <chr>
1 A NA
2 B NA
3 C foo
4 C bar
5 C baz
6 D foo
7 D bar
8 D baz
9 E foo
10 E bar
11 E baz
Assuming each of these letters repeats in the 'Letters' column as many times as the length of 'meta', a simple assignment does it:
df$New <- NA_character_
df$New[df$Letters %in% c("C", "D", "E")] <- meta
Or using tidyverse:
library(tidyverse)
df %>%
  group_by(Letters) %>%
  nest() %>%
  mutate(New = case_when(Letters %in% c('C', 'D', 'E') ~ list(meta),
                         TRUE ~ list(NA_character_))) %>%
  unnest(cols = c(data, New))
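Both approaches rely on each target letter appearing exactly as often as `meta` is long. A sketch of a grouped variant that restarts `meta` within every letter, so the group sizes no longer have to match; the names `targets` and `result` are made up for illustration:

```r
library(dplyr)

df <- tibble(Letters = c("A", "B", "C", "C", "C", "D", "D", "D", "E", "E", "E"))
meta <- c("foo", "bar", "baz")
targets <- c("C", "D", "E")  # hypothetical name for the letters to fill

result <- df %>%
  group_by(Letters) %>%
  # Within each letter, cycle through meta from its first element;
  # other letters get a single NA that is recycled to the group size
  mutate(New = if (Letters[1] %in% targets) rep_len(meta, n()) else NA_character_) %>%
  ungroup()

result
```

If a letter had, say, four rows, the fourth would wrap around to "foo" again, which may or may not be what is wanted.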

RStudio dplyr group_by multiple columns

In RStudio, I have a data frame with 4 columns, and I need the list of every distinct triplet of the first 3 columns, sorted in decreasing order by the sum of the 4th column. For example, with:
A B C 2
D E F 5
A B C 4
G H I 5
D E F 3
I need as a result:
D E F 8
A B C 6
G H I 5
I've tried the following approaches, but I can't get exactly the result I need:
df_list<-df_raw_data %>%
group_by(param1, param2, param3) %>%
summarise_all(total = sum(param4))
arrange(df_list, desc(total))
and:
df_list<-unique(df_raw_data[, c('param1', 'param2', 'param3')])
cbind(df_list, total)
for(i in 1:nrow(df_raw_data))
{
filter ???????????
}
I would prefer to use the dplyr package since it's a more elegant solution.
EDIT: Okay, thanks for your working answers. I think that I've lost some time figuring out that the plyr package shouldn't be loaded after dplyr...
We can use group_by_at to select the columns to group.
library(dplyr)
dat2 <- dat %>%
group_by_at(vars(-V4)) %>%
summarise(V4 = sum(V4)) %>%
ungroup()
dat2
# # A tibble: 3 x 4
# V1 V2 V3 V4
# <chr> <chr> <chr> <int>
# 1 A B C 6
# 2 D E F 8
# 3 G H I 5
Or use group_by_if to select columns to group based on column types.
dat2 <- dat %>%
group_by_if(is.character) %>%
summarise(V4 = sum(V4)) %>%
ungroup()
dat2
# # A tibble: 3 x 4
# V1 V2 V3 V4
# <chr> <chr> <chr> <int>
# 1 A B C 6
# 2 D E F 8
# 3 G H I 5
DATA
dat <- read.table(text = "A B C 2
D E F 5
A B C 4
G H I 5
D E F 3",
header = FALSE, stringsAsFactors = FALSE)
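The outputs above are ordered by the grouping columns, not by the total, and since dplyr 1.0 the group_by_at/group_by_if helpers are superseded by across(). A sketch of the same summary in that style with the descending sort the question asks for, assuming the `dat` defined in the DATA block (the name `totals` is made up):

```r
library(dplyr)

dat <- read.table(text = "A B C 2
D E F 5
A B C 4
G H I 5
D E F 3",
  header = FALSE, stringsAsFactors = FALSE)

totals <- dat %>%
  group_by(across(-V4)) %>%                     # group by every column except V4
  summarise(V4 = sum(V4), .groups = "drop") %>% # one row per triplet
  arrange(desc(V4))                             # largest total first

totals
```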
Would this be what you are looking for?
library(dplyr)
df <- tibble(var1 = c("A", "D", "A", "G", "D"),
             var2 = c("B", "E", "B", "H", "E"),
             var3 = c("C", "F", "C", "I", "F"),
             var4 = c(2, 5, 4, 5, 3))
df %>% group_by(var1, var2, var3) %>%
  summarise(sum = sum(var4)) %>%
  arrange(desc(sum))

Group data frame by elements from a variable containing lists of elements

I would like to perform a non-trivial group_by, grouping and summarizing a data frame by the single elements of the lists found in one of its variables.
df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")
df
x y
1 1 A
2 2 A, B
3 3 C
4 4 B, D, C
5 5 E
Now grouping by y (and say counting no. of rows), which is a variable holding lists of elements, the required end results should be:
data.frame(group = c("A", "B", "C", "D", "E"), n = c(2,2,2,1,1))
group n
1 A 2
2 B 2
3 C 2
4 D 1
5 E 1
Because "A" appears in 2 rows, "B" in 2 rows, etc.
Note: the sum of n is not necessarily equal to number of rows in the data frame.
We can use a simple base R solution: unlist the list column, calculate the frequencies with table, and then create a data frame from that table object:
tbl <- table(unlist(df$y))
data.frame(group = names(tbl), n = as.vector(tbl))
# group n
#1 A 2
#2 B 2
#3 C 2
#4 D 1
#5 E 1
Or another option with tidyverse:
library(dplyr)
library(tidyr)
unnest(df, cols = y) %>%
  group_by(group = y) %>%
  summarise(n = n())
# group n
# <chr> <int>
#1 A 2
#2 B 2
#3 C 2
#4 D 1
#5 E 1
Or, as @alexis_laz mentioned in the comments, an alternative is as.data.frame.table:
as.data.frame(table(group = unlist(df$y)), responseName = "n")
A simple base R solution (this is actually a duplicate question, though I am unable to locate the original):
sapply(unique(unlist(df$y)), function(x) sum(grepl(x, df$y)))
# A B C D E
# 2 2 2 1 1
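The grepl approach matches patterns against the deparsed text of each list element, so it could miscount when one label is a substring of another (for example "A" and "AB"). A minimal sketch that tests exact membership with %in% instead, assuming the `df` from the question; `groups` and `counts` are made-up names:

```r
df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")

groups <- sort(unique(unlist(df$y)))

# For each label, count the rows whose list element contains it exactly
counts <- vapply(
  groups,
  function(g) sum(vapply(df$y, function(el) g %in% el, logical(1))),
  integer(1)
)

data.frame(group = names(counts), n = unname(counts))
```

This counts rows rather than occurrences, so a label repeated within one row is counted once, unlike table(unlist(df$y)).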
