Remove duplicate character strings from list column


I have this dataframe:
structure(list(class = c("Großbrittanien", "Rest Europa"), countries = list(
c("United Kingdom", "United Kingdom"), "Spain")), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"))
I want to turn the countries list column into a character column and remove duplicate entries, so that United Kingdom appears only once. I am not sure how to achieve that using dplyr syntax.

You can unnest countries and then remove duplicated rows.
library(tidyverse)
df %>%
unnest(countries) %>%
distinct()
# # A tibble: 2 × 2
# class countries
# <chr> <chr>
# 1 Großbrittanien United Kingdom
# 2 Rest Europa Spain

Or without unnest, using unique by class, before converting to a character string.
With grouping:
library(dplyr)
df |>
mutate(countries = toString(unique(unlist(countries))), .by = class)
# Note: If you're using dplyr < 1.1.0, use group_by()/ungroup() instead of .by.
With purrr:
library(dplyr)
library(purrr)
df |>
mutate(countries = map_chr(countries, ~ toString(unique(.))))
Output:
# A tibble: 2 × 2
class countries
<chr> <chr>
1 Großbrittanien United Kingdom
2 Rest Europa Spain, Portugal
Data (including an entry that is not duplicated, Portugal):
df <-
structure(list(class = c("Großbrittanien", "Rest Europa"), countries = list(
c("United Kingdom", "United Kingdom"), c("Spain", "Portugal"))), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"))

Here is a version using dplyr syntax (unnest comes from tidyr, so that package is loaded as well):
library(dplyr)
library(tidyr)
df %>%
unnest(countries) %>%
distinct(class, countries) %>%
group_by(class) %>%
summarise(countries = paste(countries, collapse = ", "))
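For comparison, the same idea can be sketched in base R with no tidyverse packages. This is only an illustration: the example data (with Portugal added) is rebuilt here as a plain data frame so the block runs on its own, and vapply is used so the result is guaranteed to be a character vector.

```r
# Rebuild the example data as a plain data.frame with a list column
df <- data.frame(class = c("Großbrittanien", "Rest Europa"))
df$countries <- list(c("United Kingdom", "United Kingdom"),
                     c("Spain", "Portugal"))

# Collapse each list element to one comma-separated string of unique values
df$countries <- vapply(df$countries,
                       function(x) toString(unique(x)),
                       character(1))
df$countries
```

After this step countries is an ordinary character column, so is.character(df$countries) returns TRUE.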

Related

Separating a column with multiple different entries with tidyr

I am trying to split one column in a data frame, which shows the active period(s) of several artists/bands, into two columns (start_of_career, end_of_career). The variable's class is character. I tried tidyr's separate function for this; when I run it, the split shows up in the console but not in the data frame itself, so I assume it isn't working properly.
Please see here a made up example of the data I want to split:
Column A    Column B
Artist A    1995-present
Artist B    1995-1997, 2008, 2010-present
As you can see, some rows consist only of a start and end date, while others have several dates.
All I actually need is the first number and the last, e.g. for Artist B I need only start_of_career 1995 and end_of_career "present". But I am somehow not able to solve this issue.
The code I used was:
library(tidyr)
df %>% separate(col = period_active, into = c('start_of_career', 'end_of_career'), sep = '-')
I also tried other separators(",", " "), but it didn't work either.
I also tried:
df$start_of_career = strsplit(df$period_active, split = '-')
But this didn't work either.

Using df, shown reproducibly in the Note at the end, remove everything except the first and last parts of Column B, then separate what is left.
library(dplyr)
library(tidyr)
df %>%
mutate(`Column B` = sub("-.*-", "-", `Column B`)) %>%
separate(`Column B`, c("start", "end"))
## Column A start end
## 1 Artist A 1995 present
## 2 Artist B 1995 present
Note
df <-
structure(list(`Column A` = c("Artist A", "Artist B"), `Column B` = c("1995-present",
"1995-1997, 2008, 2010-present")), class = "data.frame", row.names = c(NA,
-2L))

Using base R
df <- cbind(df[1], read.table(text = sub("-[0-9, ]+", "", df$`Column B`),
header = FALSE, col.names = c("start", "end"), sep = "-"))
-output
> df
Column A start end
1 Artist A 1995 present
2 Artist B 1995 present
We could do this with separate as well
library(tidyr)
separate(df, `Column B`, into = c("start", "end"), sep = "-[^A-Za-z]*")
Column A start end
1 Artist A 1995 present
2 Artist B 1995 present
data
df <- structure(list(`Column A` = c("Artist A", "Artist B"),
`Column B` = c("1995-present",
"1995-1997, 2008, 2010-present")), class = "data.frame",
row.names = c(NA,
-2L))

We could use separate_rows and then filter for first and last row of group:
library(tidyr)
library(dplyr)
df %>%
separate_rows(Column.B) %>%
group_by(Column.A) %>%
filter(row_number()==1 | row_number()==n()) %>%
mutate(Column.C = c("start", "end"))
Column.A Column.B Column.C
<chr> <chr> <chr>
1 Artist A 1995 start
2 Artist A present end
3 Artist B 1995 start
4 Artist B present end
data:
structure(list(Column.A = c("Artist A", "Artist B"), Column.B = c("1995-present",
"1995-1997, 2008, 2010-present")), class = "data.frame", row.names = c(NA,
-2L))

Using strsplit and then subsequently pick the first and the last entry.
library(dplyr)
df %>%
rowwise() %>%
mutate(splitrow = strsplit(`Column B`, "-"),
start_of_career = splitrow[1],
end_of_career = splitrow[length(splitrow)],
splitrow = NULL) %>%
ungroup()
# A tibble: 2 × 4
`Column A` `Column B` start_of_career end_of_career
<chr> <chr> <chr> <chr>
1 Artist A 1995-present 1995 present
2 Artist B 1995-1997, 2008, 2010-present 1995 present
Data
df <- structure(list(`Column A` = c("Artist A", "Artist B"), `Column B` = c("1995-present",
"1995-1997, 2008, 2010-present")), class = "data.frame", row.names = c(NA,
-2L))

Another option: use strsplit, and return the list of start and end values
library(dplyr)
library(tidyr)
# The \(v) lambda shorthand requires R >= 4.1
f <- \(v) {
v <- strsplit(v, "-|,| ")[[1]]
list(start = v[1], end = v[length(v)])
}
df %>%
mutate(`Column B` = lapply(`Column B`, f)) %>%
unnest_wider(`Column B`)
Output:
# A tibble: 2 × 3
`Column A` start end
<chr> <chr> <chr>
1 Artist A 1995 present
2 Artist B 1995 present

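The code below extracts the first word before the first dash and the last word after the last dash.
# Loop over rows: length(df) would give the number of columns, not rows
for (i in seq_len(nrow(df)))
{
df$start[i] <- sub("-.*", "", df$`Column B`[i])
df$end[i] <- sub("^.+-", "", df$`Column B`[i])
}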
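Because sub is vectorised over its input, the same two substitutions can also be applied to the whole column at once, with no explicit loop. A minimal sketch, rebuilding only the Column B example data so that it runs on its own:

```r
# Example data (only the column that is being split)
df <- data.frame(
  `Column B` = c("1995-present", "1995-1997, 2008, 2010-present"),
  check.names = FALSE
)

# Drop everything from the first dash onwards -> start of career
df$start <- sub("-.*", "", df$`Column B`)
# Drop everything up to and including the last dash -> end of career
df$end <- sub("^.*-", "", df$`Column B`)
df
```

Both rows come out with start "1995" and end "present".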

Categorizing variable with multiple values in one cell and tallying in R

Reposting this question to clarify my objective- I am trying to create a new categorical variable "Income" (3 levels) that categorizes a subset of predetermined countries (x, y, z) into the different levels. My issue is that the countries variable has multiple countries in each cell, so I don't know how to sort this.
What I'm hoping to get:
ID country **income**
1 Chad, USA, USA, USA LMIC, HMIC, HMIC, HMIC
2 USA HMIC
3 Ethiopia, USA, Chad LMIC, HMIC, LMIC
1 Albania, Canada UMIC, HMIC
so tally would result in LMIC = 2 (as ID 1 and 3 contain LMIC, and they would count one per entry rather than lmic = 3 total), HMIC = 4 (as ID 1-4 contain HMIC), and UMIC = 1 (ID 4).
This is the code I have, based on someone's recommendation. However, it turns n=134,086 observations (publications) into n=388,844, so when I tally the levels of Income, I get values like HMIC = 305k. My goal is to tally against the original publication count so that I can calculate what proportion of the n=134k were LMIC, HMIC, etc. This would mean that the Income cells (by ID) that have multiple values like "HMIC, HMIC, HMIC, LMIC" would count as one HMIC and one LMIC when I tally. Is there a way to do this?
data.set %>% separate_rows(country, sep = ",")
data.set %>% mutate(Income = case_when(country %in% c("USA", "Canada", "Japan") ~ "HMIC", country %in% c("Albania", "Argentina") ~ "UMIC", country %in% c("Chad", "Ethiopia") ~ "LMIC", TRUE ~ NA_character_))

You could apply separate_rows on both income and country at once, get rid of the duplicate income categories using e.g. distinct and use count:
Note: I adjusted your example data which contained a duplicated ID=1 in the fourth row which I replaced by ID=4.
library(tidyverse)
data.set <- structure(list(ID = c(1L, 2L, 3L, 4L), country = c(
"Chad, USA, USA, USA",
"USA", "Ethiopia, USA, Chad", "Albania, Canada"
), income = c(
"LMIC, HMIC, HMIC, HMIC",
"HMIC", "LMIC, HMIC, LMIC", "UMIC, HMIC"
)), class = "data.frame", row.names = c(
NA,
-4L
))
data.set %>%
separate_rows(country, income, sep = ", ") %>%
distinct(ID, income) %>%
count(income)
#> # A tibble: 3 × 2
#> income n
#> <chr> <int>
#> 1 HMIC 4
#> 2 LMIC 2
#> 3 UMIC 1
EDIT If I understand you right you want to do two things. First you want to create an income variable which assigns countries to income groups. Second you want to summarise your data. To solve the first task you could use recode or case_when as suggested in the answer on your former question or in the comments. As an alternative I would suggest to first create a lookup table which via a left_join could be used to add your income variable after applying separate_rows. After this step you could go on with the code I provided to get the counts per income group:
# Get rid of income column
data.set <- data.set %>%
select(-income)
# Create a recode or lookup table which assigns countries to income groups
recode_table <- structure(list(country = c(
"Chad", "USA", "Ethiopia", "Albania",
"Canada"
), income = c("LMIC", "HMIC", "LMIC", "UMIC", "HMIC")), row.names = c(
NA,
-5L
), class = c("tbl_df", "tbl", "data.frame"))
data.set %>%
separate_rows(country, sep = ", ") %>%
left_join(recode_table, by = "country") |>
distinct(ID, income) %>%
count(income)
#> # A tibble: 3 × 2
#> income n
#> <chr> <int>
#> 1 HMIC 4
#> 2 LMIC 2
#> 3 UMIC 1
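The question also asks what proportion of the original publications fall into each income group. One way to sketch this is to fix the denominator before reshaping and divide the counts by it. The names n_pubs, income_share and prop are illustrative and not part of the original answer; the data is the small example from above, rebuilt so that the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Example data from the answer above (without the income column)
data.set <- data.frame(
  ID = 1:4,
  country = c("Chad, USA, USA, USA", "USA",
              "Ethiopia, USA, Chad", "Albania, Canada")
)
recode_table <- data.frame(
  country = c("Chad", "USA", "Ethiopia", "Albania", "Canada"),
  income  = c("LMIC", "HMIC", "LMIC", "UMIC", "HMIC")
)

# Denominator: number of original publications, taken before any reshaping
n_pubs <- n_distinct(data.set$ID)

income_share <- data.set %>%
  separate_rows(country, sep = ", ") %>%
  left_join(recode_table, by = "country") %>%
  distinct(ID, income) %>%   # each income group counts once per publication
  count(income) %>%
  mutate(prop = n / n_pubs)
income_share
```

On this data the proportions are 1.00 for HMIC, 0.50 for LMIC and 0.25 for UMIC. They do not sum to 1, because one publication can belong to several income groups.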

how to build a new variable by extract a string from another variable

I have a df that looks like this, and I would like to build a new variable Main if Subject contains Math|ELA. The sample data and my code are:
df<- structure(list(Subject = c("Math", "Math,ELA", "Math,ELA, PE",
"PE, Math", "ART,ELA", "PE,ART")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
df <- df %>%
mutate(Main = case_when(grepl("Math|ELA", Subject) ~ paste0(str_extract_all(df$Subject, "Math|ELA"))))
However, the outcome is not what I want. What did I do wrong? I feel that my code complicates a simple step. Is there a better solution?

str_extract_all returns a list. We need to loop over the list and paste/str_c
library(dplyr)
library(stringr)
library(purrr)
df %>%
mutate(Main = case_when(grepl("Math|ELA", Subject)~
map_chr(str_extract_all(Subject, "Math|ELA"), toString)))
-output
# A tibble: 6 x 2
# Subject Main
# <chr> <chr>
#1 Math Math
#2 Math,ELA Math, ELA
#3 Math,ELA, PE Math, ELA
#4 PE, Math Math
#5 ART,ELA ELA
#6 PE,ART <NA>
Or another option is separate_rows from tidyr
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
separate_rows(Subject) %>%
group_by(rn) %>%
summarise(Main = toString(intersect(Subject, c("Math", "ELA"))),
.groups = 'drop') %>%
select(Main) %>%
bind_cols(df, .)
NOTE: paste by itself doesn't do anything here; because str_extract_all returns a list, we need to loop over the list elements.
Or another option is to use
trimws(gsub("(Math|ELA)(*SKIP)(*FAIL)|\\w+", "", df$Subject, perl = TRUE), whitespace = ",\\s*")
#[1] "Math" "Math,ELA" "Math,ELA" "Math" "ELA" ""

Here is a base R option using regmatches
transform(
df,
Main = sapply(
regmatches(Subject, gregexpr("Math|ELA", Subject)),
function(x) replace(toString(x), !length(x), NA)
)
)
which gives
Subject Main
1 Math Math
2 Math,ELA Math, ELA
3 Math,ELA, PE Math, ELA
4 PE, Math Math
5 ART,ELA ELA
6 PE,ART <NA>

Parsing a string with multiple brackets

I have a dataset dt with column "subject", that I need to parse. For example,
ID subject
1 USA(Texas)(Austin)
2 USA(California)(Sacramento)
As a result, I want to get the following table:
ID subject Country State Capital
1 USA(Texas)(Austin) USA Texas Austin
2 USA(California)(Sacramento) USA California Sacramento
How can I do it?

Since you have multiple brackets to extract data from you need to make your regex lazy.
library(dplyr)
library(tidyr)
extract(dt, subject, into = c("Country", "State", "Capital"),
regex = "(.*)\\((.*?)\\)\\((.*)\\)", remove = FALSE)
# ID subject Country State Capital
#1 1 USA(Texas)(Austin) USA Texas Austin
#2 2 USA(California)(Sacramento) USA California Sacramento
Another option with a simpler regex is to remove round brackets with gsub and use separate with sep argument as whitespace.
dt %>%
mutate(subject = trimws(gsub('[()]', ' ', subject))) %>%
separate(subject, into = c("Country", "State", "Capital"), sep = "\\s+")
data
dt <- structure(list(ID = 1:2, subject = structure(2:1,
.Label = c("USA(California)(Sacramento)", "USA(Texas)(Austin)"),
class = "factor")), class = "data.frame", row.names = c(NA, -2L))
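A base R alternative can be sketched with strsplit, splitting on runs of brackets. Unlike the whitespace-based separate above, this also keeps multi-word names such as "New York" intact. It assumes every row has exactly one country followed by two bracketed groups; the names parts, parsed and result are illustrative.

```r
# Example data, rebuilt with a character column so the block runs on its own
dt <- data.frame(
  ID = 1:2,
  subject = c("USA(Texas)(Austin)", "USA(California)(Sacramento)")
)

# Split on one or more brackets; strsplit drops the empty trailing piece
parts <- strsplit(as.character(dt$subject), "[()]+")

# Assumes three pieces per row: country, state, capital
parsed <- do.call(rbind, parts)
colnames(parsed) <- c("Country", "State", "Capital")

result <- cbind(dt, parsed)
result
```

The as.character call means the same code also works when subject is a factor, as in the data above.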

Meaning of error using . shorthand inside dplyr function

I'm getting a dplyr::bind_rows error. It's a very trivial problem, because I can easily get around it, but I'd like to understand the meaning of the error message.
I have the following data of some population groups for New England states, and I'd like to bind on a copy of these same values with the name changed to "New England," so that I can group by name and add them up, giving me values for the individual states, plus an overall value for the region.
df <- structure(list(name = c("CT", "MA", "ME", "NH", "RI", "VT"),
estimate = c(501074, 1057316, 47369, 76630, 141206, 27464)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))
I'm doing this as part of a much larger flow of piped steps, so I can't just do bind_rows(df, df %>% mutate(name = "New England")). dplyr gives the convenient . shorthand for a data frame being piped from one function to the next, but I can't use that to bind the data frame to itself in a way I'd like.
What does work and gets me the output I want:
library(tidyverse)
df %>%
# arbitrary piped operation
mutate(name = str_to_lower(name)) %>%
bind_rows(mutate(., name = "New England")) %>%
group_by(name) %>%
summarise(estimate = sum(estimate))
#> # A tibble: 7 x 2
#> name estimate
#> <chr> <dbl>
#> 1 ct 501074
#> 2 ma 1057316
#> 3 me 47369
#> 4 New England 1851059
#> 5 nh 76630
#> 6 ri 141206
#> 7 vt 27464
But when I try to do the same thing with the . shorthand, I get this error:
df %>%
mutate(name = str_to_lower(name)) %>%
bind_rows(. %>% mutate(name = "New England"))
#> Error in bind_rows_(x, .id): Argument 2 must be a data frame or a named atomic vector, not a fseq/function
Like I said, doing it the first way is fine, but I'd like to understand the error because I write a lot of multi-step piped code.

As @aosmith noted in the comments, this is due to the way magrittr parses the dot in this case.
From ?'%>%':
Using the dot-place holder as lhs
When the dot is used as lhs, the result will be a functional sequence, i.e. a function which applies the entire chain of right-hand sides in turn to its input.
To avoid triggering this, any modification of the expression on the lhs will do:
df %>%
mutate(name = str_to_lower(name)) %>%
bind_rows((.) %>% mutate(name = "New England"))
df %>%
mutate(name = str_to_lower(name)) %>%
bind_rows({.} %>% mutate(name = "New England"))
df %>%
mutate(name = str_to_lower(name)) %>%
bind_rows(identity(.) %>% mutate(name = "New England"))
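To see the functional-sequence behaviour in isolation, here is a small sketch outside of dplyr. The name f is arbitrary; the point is only that a pipeline starting with a bare dot is a function rather than a value, which is why bind_rows complained about receiving a "fseq/function".

```r
library(magrittr)

# A pipeline whose left-hand side is a bare dot defines a function
f <- . %>% sum() %>% sqrt()

is.function(f)   # TRUE: f is a functional sequence, not a result
f(c(9, 16))      # applies sum(), then sqrt(), to the input: 5
```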
Here's a suggestion that avoids the problem altogether:
df %>%
# arbitrary piped operation
mutate(name = str_to_lower(name)) %>%
replicate(2,.,simplify = FALSE) %>%
map_at(2,mutate_at,"name",~"New England") %>%
bind_rows
# # A tibble: 12 x 2
# name estimate
# <chr> <dbl>
# 1 ct 501074
# 2 ma 1057316
# 3 me 47369
# 4 nh 76630
# 5 ri 141206
# 6 vt 27464
# 7 New England 501074
# 8 New England 1057316
# 9 New England 47369
# 10 New England 76630
# 11 New England 141206
# 12 New England 27464
