Group by, summarize, spread in R not working - r

I have a data frame that looks like the following:
ID Code Desc
1 0A Red
1 NA Red
2 1A Blue
3 2B Green
I want to first create a new column where I concatenate the values in the Code column where the IDs are the same. So:
ID Combined_Code Desc
1 0A | NA Red
2 1A Blue
3 2B Green
Then I want to take the original Code column and spread it. The values in this case would be a count of how many times each Code shows up for a given ID. So:
ID Combined_Code 0A NA 1A 2B Desc
1 0A | NA 1 1 0 0 Red
2 1A 0 0 1 0 Blue
3 2B 0 0 0 1 Green
I've tried:
sample_data %>%
group_by(ID) %>%
summarise(Combined_Code = paste(unique(Combined_Code), collapse ='|'))
This works for creating the concatenation. However, I can't get this to work in tandem with spread:
sample_data %>%
group_by(ID) %>%
summarise(Combined_Code = paste(unique(Combined_Code), collapse ='|'))
sample_data <- spread(count(sample_data, ID, Combined_Code, Desc., Code), Code, n, fill = 0)
Doing this spreads, but drops the concatenation. I've also tried this with filter instead of summarise, which gives the same result. This results in:
ID Combined_Code 0A NA 1A 2B Desc
1 0A 1 0 0 0 Red
1 NA 0 1 0 0 Red
2 1A 0 0 1 0 Blue
3 2B 0 0 0 1 Green
Finally, I've tried piping spread through the summarise function:
sample_data %>%
group_by(ID) %>%
summarise(Combined_Code = paste(unique(Combined_Code), collapse ='|')) %>%
spread(count(sample_data, ID, Combined_Code, Desc., Code), Code, n, fill = 0)
This results in the error:
Error: `var` must evaluate to a single number or a column name, not a list
Run `rlang::last_error()` to see where the error occurred.
What can I do to solve these problems?

We can group by 'ID' and 'Desc' and paste the 'Code' values together:
library(dplyr)
library(stringr)
df1 %>%
group_by(ID, Desc) %>%
summarise(Combined_Code = str_c(Code, collapse="|"))
# A tibble: 3 x 3
# Groups: ID [3]
# ID Desc Combined_Code
# <int> <chr> <chr>
#1 1 Red 0A|0B
#2 2 Blue 1A
#3 3 Green 2B
For the second case, create a 'val' column of 1s, paste the 'Code' elements after grouping by 'ID' and 'Desc', then use pivot_wider from tidyr to reshape from 'long' to 'wide' format.
library(tidyr)
df1 %>%
mutate(val = 1) %>%
group_by(ID, Desc) %>%
mutate(Combined_Code = str_c(Code, collapse="|")) %>%
pivot_wider(names_from = Code, values_from = val, values_fill = list(val = 0))
# A tibble: 3 x 7
# Groups: ID, Desc [3]
# ID Desc Combined_Code `0A` `0B` `1A` `2B`
# <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 1 Red 0A|0B 1 1 0 0
#2 2 Blue 1A 0 0 1 0
#3 3 Green 2B 0 0 0 1
The OP's expected output is
ID Combined_Code 0A 0B 1A 2B Desc
1 0A | 0B 1 1 0 0 Red
2 1A 0 0 1 0 Blue
3 2B 0 0 0 1 Green
Update
For the updated dataset, there are NA elements in 'Code'. By default, str_c returns NA if any of its elements is NA, whereas paste converts NA to the string "NA" and keeps it alongside the other elements. So here we replace str_c with paste:
df2 %>%
mutate(val = 1) %>%
group_by(ID, Desc) %>%
mutate(Combined_Code = paste(Code, collapse="|")) %>%
pivot_wider(names_from = Code, values_from = val, values_fill = list(val = 0))
# A tibble: 3 x 7
# Groups: ID, Desc [3]
# ID Desc Combined_Code `0A` `NA` `1A` `2B`
# <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 1 Red 0A|NA 1 1 0 0
#2 2 Blue 1A 0 0 1 0
#3 3 Green 2B 0 0 0 1
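The difference between str_c and paste on missing values can be checked on a small vector. This is only a sketch; the vector name codes is made up for illustration, and the na.omit variant is an optional extra for the case where the NA should not appear in the combined string at all.

```r
library(stringr)

# Illustrative vector (hypothetical name) mirroring the 'Code' values for ID 1
codes <- c("0A", NA)

# str_c propagates the missing value: the whole result is NA
str_c(codes, collapse = "|")
# [1] NA

# paste turns NA into the string "NA" and keeps the other elements
paste(codes, collapse = "|")
# [1] "0A|NA"

# Optional: drop missing values before pasting
paste(na.omit(codes), collapse = "|")
# [1] "0A"
```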
data
df1 <- structure(list(ID = c(1L, 1L, 2L, 3L), Code = c("0A", "0B", "1A",
"2B"), Desc = c("Red", "Red", "Blue", "Green")),
class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(ID = c(1L, 1L, 2L, 3L), Code = c("0A", NA, "1A",
"2B"), Desc = c("Red", "Red", "Blue", "Green")), class = "data.frame",
row.names = c(NA,
-4L))
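The question asks for a count of how often each Code occurs per ID, while the val = 1 approach above produces 0/1 flags and assumes each ID/Code pair occurs only once. A possible alternative, sketched here under the assumption that real data may contain repeated pairs, is to build the concatenation and the counts separately and join them on ID. The object names combined, counts and result are illustrative only.

```r
library(dplyr)
library(tidyr)

# Same structure as df2 above
df2 <- data.frame(
  ID   = c(1L, 1L, 2L, 3L),
  Code = c("0A", NA, "1A", "2B"),
  Desc = c("Red", "Red", "Blue", "Green")
)

# Step 1: one row per ID with the concatenated codes
combined <- df2 %>%
  group_by(ID, Desc) %>%
  summarise(Combined_Code = paste(Code, collapse = " | "), .groups = "drop")

# Step 2: count each ID/Code pair, then spread the codes into columns
counts <- df2 %>%
  count(ID, Code) %>%
  pivot_wider(names_from = Code, values_from = n, values_fill = 0)

# Step 3: join the two pieces back together
result <- left_join(combined, counts, by = "ID")
result
```

Keeping the two steps separate avoids the situation in the question, where the summarised data and the counted data were mixed in a single spread call.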

Related

filter a column based on values of another column

I have a dataset that looks like this
data <- data.frame(ID = c("1a", "1b", "2a", "2b", "3a", "4b", "5a", "5b"),
Sex = c(1, 2, 2, 1, 1, 2, 1, 2))
ID Sex
1a 1
1b 2
2a 2
2b 1
3a 1
4b 2
5a 1
5b 2
I want to filter based on ID. Specifically, if the same number appears in more than one ID, such as 1a and 1b, 2a and 2b, and 5a and 5b, then I want to keep only the rows with Sex = 1. Additionally, I want to keep the rows with 3a and 4b regardless of their value in Sex, because they do not have counterparts (3b and 4a).
My final desired output is:
ID Sex
1a 1
2b 1
3a 1
4b 2
5a 1
Thank you for your help!
We may group by the numeric part of 'ID' and filter the rows where Sex is 1 or (|) the number of rows in the group is 1
library(dplyr)
data %>%
group_by(grp = readr::parse_number(ID)) %>%
filter(Sex == 1|n() ==1) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 5 × 2
ID Sex
<chr> <dbl>
1 1a 1
2 2b 1
3 3a 1
4 4b 2
5 5a 1
Idea:
split ID between number and letter to check whether there are several letters for each ID first number
group by number
keep when there's one letter max, or when ID_2 contains both "a" and "b" and Sex == 1
library(dplyr)
library(tidyr)
data <- data.frame(ID = c("1a", "1b", "2a", "2b", "3a", "4b", "5a", "5b"),
Sex = c(1, 2, 2, 1, 1, 2, 1, 2))
data %>%
separate(ID, into = c("ID_1", "ID_2"), sep = 1) %>%
group_by(ID_1) %>%
filter(n() <= 1 | (all(c("a", "b") %in% ID_2) & Sex == 1)) %>%
ungroup() %>%
unite(col = "ID", ID_1, ID_2, sep = "")
#> # A tibble: 5 × 2
#> ID Sex
#> <chr> <dbl>
#> 1 1a 1
#> 2 2b 1
#> 3 3a 1
#> 4 4b 2
#> 5 5a 1
Created on 2022-07-11 by the reprex package (v2.0.1)
Try this
library(dplyr)
data |> group_by(sub("\\D" , "" , ID)) |>
filter(n() == 1 | Sex == 1) |> ungroup() |>
select(ID , Sex)
output
# A tibble: 5 × 2
ID Sex
<chr> <dbl>
1 1a 1
2 2b 1
3 3a 1
4 4b 2
5 5a 1
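One caveat about the sub() call in this answer: sub() replaces only the first match, so it strips a single non-digit character. That is enough for IDs like "1a", but for a hypothetical ID with a longer suffix (the value "10ab" below is invented for illustration) gsub() would be needed to get the bare number.

```r
# Hypothetical ID with a two-letter suffix
id <- "10ab"

# sub() removes only the first non-digit character
sub("\\D", "", id)
# [1] "10b"

# gsub() removes every non-digit character
gsub("\\D", "", id)
# [1] "10"
```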
Another possible solution:
library(dplyr)
data %>%
add_count(gsub("[a-z]", "", ID)) %>%
filter(Sex == 1 | n == 1) %>%
select(ID, Sex)
#> ID Sex
#> 1 1a 1
#> 2 2b 1
#> 3 3a 1
#> 4 4b 2
#> 5 5a 1
One more, similar to @akrun's solution:
library(dplyr)
data %>%
group_by(group_id = as.numeric(gsub("\\D", "", ID))) %>%
arrange(Sex, .by_group = TRUE) %>%
slice(1) %>%
ungroup() %>%
select(-group_id)
ID Sex
<chr> <dbl>
1 1a 1
2 2b 1
3 3a 1
4 4b 2
5 5a 1

One-hot-encoding a R list of characters

I have the following R dataframe :
id color
001 blue
001 yellow
001 red
002 blue
003 blue
003 yellow
What's the general method to one-hot-encode such a dataframe into the following :
id blue yellow red
001 1 1 1
002 1 0 0
003 1 1 0
Thank you very much.
Try this. You can create a variable equal to one for the observations present in the data and then use pivot_wider() to reshape the values. As you will get NA for classes not present in the data, you can replace them with zero using replace(). Here is the code using tidyverse functions:
library(dplyr)
library(tidyr)
#Code
dfnew <- df %>% mutate(val=1) %>%
pivot_wider(names_from = color,values_from=val) %>%
replace(is.na(.),0)
Output:
# A tibble: 3 x 4
id blue yellow red
<int> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 1 0 0
3 3 1 1 0
Some data used:
#Data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 3L, 3L), color = c("blue",
"yellow", "red", "blue", "blue", "yellow")), class = "data.frame", row.names = c(NA,-6L))
There are many ways to do this in R. It depends on what packages you are using. Most of the modeling packages such as caret and tidymodels have functions to do this for you.
However, if you aren't using a modeling package the tidyverse has an easy way to do this.
library(dplyr)
library(tidyr)
df <- tribble(
~id, ~color,
'001', 'blue',
'001', 'yellow',
'001', 'red',
'002', 'blue',
'003', 'blue',
'003', 'yellow')
df_onehot <- df %>%
mutate(value = 1) %>%
pivot_wider(names_from = color,values_from = value,values_fill = 0)
# A tibble: 3 x 4
# id blue yellow red
# <chr> <dbl> <dbl> <dbl>
# 1 001 1 1 1
# 2 002 1 0 0
# 3 003 1 1 0
With data.table:
library(data.table)
dcast(setDT(df), id ~ color, fun.aggregate = length)
# id blue red yellow
# 1: 001 1 1 1
# 2: 002 1 0 0
# 3: 003 1 0 1
Same logic with tidyr:
library(tidyr)
pivot_wider(df, names_from=color, values_from=color, values_fn=length, values_fill=0)
# id blue yellow red
# <chr> <int> <int> <int>
# 1 001 1 1 1
# 2 002 1 0 0
# 3 003 1 1 0
Base R:
out <- as.data.frame.matrix(pmin(with(df, table(id, color)), 1))
out$id <- rownames(out)
out
# blue red yellow id
# 001 1 1 1 001
# 002 1 0 0 002
# 003 1 0 1 003
Reproducible data
df <- data.frame(
id = c("001", "001", "001", "002", "003", "003"),
color = c("blue", "yellow", "red", "blue", "blue", "yellow")
)

I want to dissociate lists in a data frame into single values to different columns in R

I have a data frame that looks like this:
d <- c(01, 02, 03, 04)
h <- c("19:00", "19:00", "07:00", "07:00")
p1 <- c(123, 321, 123, 123)
p2 <- c(321, 345, 567, 567)
df <- data.frame(date = d, hours = h, person1 = p1, person2 = p2)
I used this code to associate all the characteristics of each person1 in different columns:
EDITED: rn = rowid(person1, date, hours) is the actual code. Not rn = rowid(person1)
library(dplyr)
library(data.table)
library(tidyr)
df1 <- df %>%
mutate(rn = rowid(person1, date, hours)) %>%
pivot_wider(names_from = rn, values_from = c(date, hours, person2),
names_sep="")
But this code gives me this output:
# person1 date1 hours1 person21
# 123 c(1,3,4) c("19:00", "07:00", "07:00") c(321,567,567)
# 321 2 19:00 345
I don't want it to repeat values like 07:00 or 567. I want each distinct value in its own column, ignoring repeated values. If possible, organized like this:
# person1 date1 date2 date3 date4... hours1 hours2 ... person21 person22 person23 person24
# 123 01 NA 03 04 07:00 19:00 NA 321 NA 567
# 321 NA 02 NA NA NA 19:00 NA NA 345 NA
person21, 22, 23 and 24 being the first, second, third, fourth, and so on person of my df1$person1.
But the ideal output for me would be something like this:
# person1 d01 d02 d03 d04 ... h07:00 h19:00 ... p123 p321 p345 p567
# 123 1 0 1 1 ... 1 0 ... 1 0 0 1
# 321 0 1 0 0 ... 0 0 ... 1 0 0 1
How can I do this?
If we want to return a binary output, specify the values_fn and values_fill in pivot_wider
library(dplyr)
library(tidyr)
library(data.table)
df %>%
mutate(rn = rowid(person1)) %>%
pivot_wider(names_from = rn, values_from = c(date, hours, person2),
names_sep="", values_fn = length, values_fill = list(date = 0, hours = 0, person2 = 0))
# A tibble: 2 x 10
# person1 date1 date2 date3 hours1 hours2 hours3 person21 person22 person23
# <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 123 1 1 1 1 1 1 1 1 1
#2 321 1 0 0 1 0 0 1 0 0
If we want the values to be also column names, an option is to reshape into 'long' format first and then do the pivot_wider after transformation
df %>%
mutate(date = sprintf("%02d", date)) %>%
mutate(across(where(is.numeric), as.character)) %>%
pivot_longer(cols = -person1) %>%
mutate(name = substr(name, 1, 1)) %>%
unite(name, name, value, sep="") %>%
distinct(person1, name) %>%
mutate(n = 1) %>%
pivot_wider(names_from = name, values_from =n, values_fill = list(n = 0))
# A tibble: 2 x 10
# person1 d01 `h19:00` p321 d02 p345 d03 `h07:00` p567 d04
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 123 1 1 1 0 0 1 1 1 1
#2 321 0 1 0 1 1 0 0 0 0

R: Frequency count but every category in a separate column

With the following code I assign a quantile rank y (from 1 to 4) for every value of x.
df$y <- ntile(df$x, 4)
Then, I would like to have four separate columns for absolute frequency count of every quantile rank, grouped also by variable z. With the following code, it does the calculation but I get all calculations in the same column.
df <-
df %>%
group_by(z, y) %>%
mutate(Freq = n())
example:
z y(quartile) n_quartile_4 n_quartile_3 n_quartile_2
1 4 2 1 0
1 3 2 1 0
1 4 2 1 0
2 2 0 0 3
2 2 0 0 3
2 2 0 0 3
We could create the count column with add_count, then pivot to 'wide' format with pivot_wider, fill the NA elements with the non-NA value in the column for each group and finally replace the rest of the NAs with 0
library(dplyr)
library(tidyr)
library(stringr)
df %>%
add_count(z, y) %>%
mutate(new = str_c('n_quartile_', y), rn = row_number()) %>%
pivot_wider(names_from = new, values_from = n) %>%
group_by(z) %>%
fill(starts_with('n_quartile'), .direction = 'downup') %>%
ungroup %>%
select(-rn) %>%
mutate(across(starts_with('n_quartile'), ~ replace_na(.x, 0)))
# A tibble: 6 x 5
# z y n_quartile_4 n_quartile_3 n_quartile_2
# <int> <dbl> <dbl> <dbl> <dbl>
#1 1 4 2 1 0
#2 1 3 2 1 0
#3 1 4 2 1 0
#4 2 2 0 0 3
#5 2 2 0 0 3
#6 2 2 0 0 3
data
df <- structure(list(z = c(1L, 1L, 1L, 2L, 2L, 2L), y = c(4, 3, 4,
2, 2, 2)), class = "data.frame", row.names = c(NA, -6L))
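A shorter route to the same table, offered as a sketch rather than a replacement for the answer above, is to count the z/y pairs once, pivot the counts to wide format, and join them back onto the original rows. The object names freq and out are illustrative.

```r
library(dplyr)
library(tidyr)

# Same data as above
df <- data.frame(z = c(1L, 1L, 1L, 2L, 2L, 2L), y = c(4, 3, 4, 2, 2, 2))

# One row per z, one column per quartile, 0 where a quartile does not occur
freq <- df %>%
  count(z, y) %>%
  pivot_wider(names_from = y, values_from = n,
              names_prefix = "n_quartile_", values_fill = 0)

# Attach the per-group counts to every original row
out <- left_join(df, freq, by = "z")
out
```

Because the counts are computed per group before the join, no fill() step is needed.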

pivot_wider when there's no value column

I'm trying to reshape a dataset from long to wide. The following code works, but I'm curious if there's a way not to provide a value column and still use pivot_wider. In the following example, I have to create a temporary column "val" to use pivot_wider, but is there a way I can do it without it?
a <- data.frame(name = c("sam", "rob", "tom"),
type = c("a", "b", "c"))
a
name type
1 sam a
2 rob b
3 tom c
I want to convert it as the following.
name a b c
1 sam 1 0 0
2 rob 0 1 0
3 tom 0 0 1
This can be done by the following code, but can I do it without creating "val" column (and still using tidyverse language)?
a <- data.frame(name = c("sam", "rob", "tom"),
type = c("a", "b", "c"),
val = rep(1, 3)) %>%
pivot_wider(names_from = type, values_from = val, values_fill = list(val = 0))
You can use the values_fn argument to assign 1 and values_fill to assign 0:
library(tidyr)
pivot_wider(a, names_from = type, values_from = type, values_fn = ~1, values_fill = 0)
# A tibble: 3 x 4
name a b c
<fct> <dbl> <dbl> <dbl>
1 sam 1 0 0
2 rob 0 1 0
3 tom 0 0 1
We can mutate with a column of 1s and use that in pivot_wider
library(dplyr)
library(tidyr)
a %>%
mutate(n = 1) %>%
pivot_wider(names_from = type, values_from = n, values_fill = list(n = 0))
# A tibble: 3 x 4
# name a b c
# <fct> <dbl> <dbl> <dbl>
#1 sam 1 0 0
#2 rob 0 1 0
#3 tom 0 0 1
In base R, it would be easier:
table(a)
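Note that table(a) returns a contingency table, not a data frame. If a data frame with a name column is needed, one possible conversion is sketched below; the object names tab and wide are illustrative, and the rows come out in alphabetical order of name.

```r
a <- data.frame(name = c("sam", "rob", "tom"),
                type = c("a", "b", "c"))

# Cross-tabulate name against type
tab <- table(a)

# Convert the table to a data frame, keeping its wide layout
wide <- as.data.frame.matrix(tab)

# Move the row names into a regular column
wide$name <- rownames(wide)
wide
```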
Going older school, reshape2::dcast, or the thriving data.table::dcast, let you do this by specifying an aggregate function:
reshape2::dcast(a, name ~ type, fun.aggregate = length)
# name a b c
# 1 rob 0 1 0
# 2 sam 1 0 0
# 3 tom 0 0 1
data.table::dcast(data.table::setDT(a), name ~ type, fun.aggregate = length)
# name a b c
# 1: rob 0 1 0
# 2: sam 1 0 0
# 3: tom 0 0 1

Resources