Find all unique values in column separated by comma - r

I have multiple observations of one species with different observers / groups of observers and want to create a list of all unique observers. My data look like this:
data <- read.table(text="species observer
1 A,B
1 A,B
1 B,E
1 B,E
1 D,E,A,C,C
1 F" , header = TRUE, stringsAsFactors = FALSE)
My output should return a list of all unique observers - so:
A,B,C,D,E,F
I tried to split the values in the observer column using the following command, but that only returns the unique combinations of observers.
all_observers <- unique(strsplit(as.character(data$observer), ","))
all_observers
[[1]]
[1] "A" "B"
[[2]]
[1] "B" "E"
[[3]]
[1] "D" "E" "A" "C" "C"
[[4]]
[1] "F"

You're almost there; you just need to unlist before calling unique:
all_observers <- unique(unlist(strsplit(as.character(data$observer), ",")))
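To get from that vector to the comma-separated form shown in the question, one option is to sort the unique values and paste them back together. This is a minimal sketch that reuses the question's data; the `observers` name and the `trimws()` call (a guard against stray spaces around the commas) are additions for illustration, not part of the original answer.

```r
data <- read.table(text = "species observer
1 A,B
1 A,B
1 B,E
1 B,E
1 D,E,A,C,C
1 F", header = TRUE, stringsAsFactors = FALSE)

# split every cell on commas, flatten the list, then de-duplicate and sort
observers <- sort(unique(trimws(unlist(strsplit(data$observer, ",", fixed = TRUE)))))

# collapse back into a single comma-separated string
paste(observers, collapse = ",")
# [1] "A,B,C,D,E,F"
```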

We can use separate_rows on 'observer', keep the distinct rows, group by 'species', and paste the observers back together:
library(tidyverse)
data %>%
separate_rows(observer) %>%
distinct %>%
group_by(species) %>%
summarise(observer = toString(observer))

You could also use scan()
unique(scan(text=data$observer, what="", sep=","))
# Read 14 items
# [1] "A" "B" "E" "D" "C" "F"


Pass function through specific columns with lapply or for loop

I have created a function that reorganizes a data frame into a list. I want to pass the function through all of the columns in the data frame (excluding the first 2 columns); however, the lapply function is returning strange results.
Here is a reproducible example:
names <- c("A", "B", "C", "D")
titles <- c("P", "S", "S", "P")
day1 <- c(1,0,1,0)
day2 <- c(0,0,1,1)
day3 <- c(1,1,0,0)
df <- data.frame(names, titles, day1, day2, day3)
ids <-df[,1:2]
obs <- df[,3:5]
I create the function, which searches each "day column" for a 1 or a 0 and reports the "name" and "title" of each row with a 1 (it also removes duplicated values).
group_maker1 <- function(x){
g1 <- ids$names[obs[,x]> 0]
g2 <- ids$titles[obs[,x]> 0]
temp <- c(g1,g2)
temp <- temp[!duplicated(temp)]
paste(temp)
}
#test group_maker
> group_maker1(3)
[1] "A" "B" "P" "S"
In the actual data frame, there are many (>300) columns of "days". I want to pass this group_maker function through each column of "days" to the nth day.
I've tried running it through a for loop, but the output doesn't seem to be stored anywhere:
for(i in 1:nrow(df)) { # for-loop over columns
group_maker1 <- function(x){
g1 <- ids$names[obs[,x]> 0]
g2 <- ids$titles[obs[,x]> 0]
temp <- c(g1,g2)
temp <- temp[!duplicated(temp)]
paste(temp)
}
}
Alternatively, I tried lapply, which seems more promising as it gives an output; however, "NA"s are present and it's not reporting any of the "B" names:
lapply(obs[,1:3], group_maker1)
$day1
[1] "A" "C" "NA" "P" "S"
$day2
[1] "A" "C" "NA" "P" "S"
$day3
[1] "A" "C" "NA" "P" "S"
This is the desired output format, but the values in it are incorrect. I want it to return output like that of the group_maker1(3) line above, with the correct values for each column of days (i.e. no "NA"s and all of the values in that column).
Essentially, I want the loop/apply to pass the function through each column of "days" and provide an output of all the "names" and "titles" for each day in the form of a list.
Using your test data, we have
> group_maker1(1)
[1] "A" "C" "P" "S"
> group_maker1(2)
[1] "C" "D" "S" "P"
> group_maker1(3)
[1] "A" "B" "P" "S"
So, we can replicate using a for loop with
> for(i in 1:3) print(group_maker1(i))
[1] "A" "C" "P" "S"
[1] "C" "D" "S" "P"
[1] "A" "B" "P" "S"
or using lapply with
> lapply(1:3, group_maker1)
[[1]]
[1] "A" "C" "P" "S"
[[2]]
[1] "C" "D" "S" "P"
[[3]]
[1] "A" "B" "P" "S"
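In both cases, your attempt failed for a simple reason: group_maker1 expects a column index. The for loop only redefines group_maker1 on each pass without ever calling it (and it counts rows, not columns), while lapply(obs[,1:3], group_maker1) passes the column values themselves as x, so obs[,x] selects the wrong columns and produces the NAs.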
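If the results need to be stored rather than printed, one option is to iterate over the column positions and name the resulting list after the day columns. The sketch below rebuilds the question's example data; the `groups` name and the use of `unique()` in place of `!duplicated()` are illustrative choices, not taken from the original post.

```r
df <- data.frame(names  = c("A", "B", "C", "D"),
                 titles = c("P", "S", "S", "P"),
                 day1 = c(1, 0, 1, 0),
                 day2 = c(0, 0, 1, 1),
                 day3 = c(1, 1, 0, 0),
                 stringsAsFactors = FALSE)
ids <- df[, 1:2]
obs <- df[, 3:5]

# x is a column position in obs, not the column itself
group_maker1 <- function(x) {
  keep <- obs[, x] > 0
  unique(c(ids$names[keep], ids$titles[keep]))
}

# one list element per day column, labelled with the column name
groups <- setNames(lapply(seq_along(obs), group_maker1), names(obs))
groups$day3
# [1] "A" "B" "P" "S"
```

Because `seq_along(obs)` follows the number of columns, the same call should work unchanged for a data frame with several hundred day columns.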
Or, taking a completely different approach to avoid the explicit use of loops altogether
library(tidyverse)
df %>%
pivot_longer(
starts_with("day"),
names_to="col",
values_to="val"
) %>%
group_by(col) %>%
group_map(
function(.x, .y) {
z <- .x %>% filter(val > 0)
c(z %>% pull(names) %>% unique(), z %>% pull(titles) %>% unique())
}
)
[[1]]
[1] "A" "C" "P" "S"
[[2]]
[1] "C" "D" "S" "P"
[[3]]
[1] "A" "B" "P" "S"
This final option could be shorter if there were no need to deal with awkward input and output formats.

Replacing multiple numbers with string in a dataframe without regex in R

I have columns in a dataframe where I want to replace integers with their corresponding string values. The integers are often repeating in cells (separated by spaces, commas, /, or - etc.). For example my dataframe column is:
> df = data.frame(c1=c(1,2,3,23,c('11,21'),c('13-23')))
> df
c1
1 1
2 2
3 3
4 23
5 11,21
6 13-23
I have used both str_replace_all() and str_replace() methods but did not get the desired results.
> df[,1] %>% str_replace_all(c("1"="a","2"="b","3"="c","11"="d","13"="e","21"="f","23"="g"))
[1] "a" "b" "c" "bc" "aa,ba" "ac-bc"
> df[,1] %>% str_replace(c("1"="a","2"="b","3"="c","11"="d","13"="e","21"="f","23"="g"))
Error in fix_replacement(replacement) : argument "replacement" is missing, with no default
The desired result would be:
[1] "a" "b" "c" "g" "d,f" "e-g"
Because there are multiple values to replace, my first choice was str_replace_all(), since it accepts a named vector of original values and their replacements, but the method fails due to regex. Am I doing it wrong, or is there a better alternative to solve my problem?
Simply place the longest multi-character patterns first, so that "11" is matched before "1" gets a chance to replace each of its digits:
library(stringr)
str_replace_all(df[,1],
c("11"="d","13"="e","21"="f","23"="g","1"="a","2"="b","3"="c"))
#[1] "a" "b" "c" "g" "d,f" "e-g"
and for more complex cases, sort the patterns by length automatically:
x <- c("1"="a","2"="b","3"="c","11"="d","13"="e","21"="f","23"="g")
x <- x[order(nchar(names(x)), decreasing = TRUE)]
str_replace_all(df[,1], x)
#[1] "a" "b" "c" "g" "d,f" "e-g"
Using the ordering method in @GKi's answer, here's a base R version using Reduce/gsub instead of stringr::str_replace_all.
Starting vector
x <- as.character(df$c1)
Ordering as in @GKi's answer
repl_dict <- c("11"="d","13"="e","21"="f","23"="g","1"="a","2"="b","3"="c")
repl_dict <- repl_dict[order(nchar(names(repl_dict)), decreasing = TRUE)]
Replacement
Reduce(
function(x, n) gsub(n, repl_dict[n], x, fixed = TRUE),
names(repl_dict),
init = x)
# [1] "a" "b" "c" "g" "d,f" "e-g"
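A different technique, not from the answers above, avoids the ordering question by replacing whole numbers rather than individual patterns: each run of digits is extracted as one token and looked up in the dictionary. This is a minimal base R sketch; it does use one small regular expression (`[0-9]+`) to find the tokens, and the `dict` name and the choice to leave unknown numbers untouched are assumptions made for illustration.

```r
x <- c("1", "2", "3", "23", "11,21", "13-23")
dict <- c("1" = "a", "2" = "b", "3" = "c", "11" = "d",
          "13" = "e", "21" = "f", "23" = "g")

# locate every run of digits; separators such as "," or "-" are left alone
m <- gregexpr("[0-9]+", x)

# look each whole number up in dict; numbers without an entry stay as they are
regmatches(x, m) <- lapply(regmatches(x, m), function(tok) {
  out <- unname(dict[tok])
  out[is.na(out)] <- tok[is.na(out)]
  out
})
x
# [1] "a"   "b"   "c"   "g"   "d,f" "e-g"
```

Since "23" is looked up as a single token, it can never be rewritten as "bc", whatever order the dictionary is in.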

Cannot pipe variable to levels

I am working with a large data frame and, rather than write manipulations to memory, I've been trying to do as much as I can with pipes. In trying to check my factor levels in intermediate steps, I ran into a problem using the levels function and wondered if anyone might know what the problem is.
An example:
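library(dplyr)
# x is built first so length(x) resolves; factor() is explicit because
# data.frame() no longer converts strings to factors by default (R >= 4.0)
x <- factor(rep(LETTERS[1:5], 3))
Data <- data.frame(x = x,
y = sample(1:10, length(x), replace = TRUE))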
The usual way works:
levels(Data$x)
[1] "A" "B" "C" "D" "E"
It mostly works if I use sapply:
Data %>% select(x) %>% sapply(levels)
x
[1,] "A"
[2,] "B"
[3,] "C"
[4,] "D"
[5,] "E"
But piping does not work and returns NULL:
Data %>% select(x) %>% levels()
NULL
Why does Data %>% select(x) %>% levels() return NULL?
Is there a way to use levels with piped data?
select returns a data frame, but levels expects a vector as its argument, which is why the two don't work together. To use levels in a pipe:
You can either use .$x to extract the column inside the call to levels:
Data %>% select(x) %>% {levels(.$x)}
# [1] "A" "B" "C" "D" "E"
Or, as a better approach, use pull instead of select; pull returns the column as a vector (here, a factor):
Data %>% pull(x) %>% levels()
# [1] "A" "B" "C" "D" "E"
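The same distinction can be seen without dplyr. In this sketch, single-bracket indexing stands in for select (it returns a one-column data frame) and double-bracket indexing stands in for pull (it returns the factor itself); the `one_col` and `vec` names are illustrative only.

```r
Data <- data.frame(x = factor(rep(LETTERS[1:5], 3)), y = 1:15)

one_col <- Data["x"]    # one-column data frame, comparable to select(x)
vec     <- Data[["x"]]  # the factor itself, comparable to pull(x)

class(one_col)   # "data.frame"
levels(one_col)  # NULL, because a data frame carries no "levels" attribute
levels(vec)      # "A" "B" "C" "D" "E"
```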

Aggregate strings using c() in dplyr summarize or aggregate

I want to aggregate some strings using c() as aggregation function in dplyr. I first tried the following:
> InsectSprays$spray = as.character(InsectSprays$spray)
> dt = tbl_df(InsectSprays)
> dt %>% group_by(count) %>% summarize(c(spray))
Error: expecting a single value
But using c() function in aggregate() works:
> da = aggregate(spray ~ count, InsectSprays, c)
> head(da)
count spray
1 0 C, C
2 1 C, C, C, C, E, E
3 2 C, C, D, E
Searching Stack Overflow suggested that using paste() with collapse, instead of c(), would solve the problem:
dt %>% group_by(count) %>% summarize(s=paste(spray, collapse=","))
or
dt %>% group_by(count) %>% summarize(paste( c(spray), collapse=","))
My question is: Why does c() function work in aggregate() but not in dplyr summarize()?
If you take a closer look, you will find that c() actually does work (to a certain extent) when we use do(). To my understanding, though, dplyr does not currently print the contents of lists like this:
> InsectSprays$spray = as.character(InsectSprays$spray)
> dt = tbl_df(InsectSprays)
> doC <- dt %>% group_by(count) %>% do(s = c(.$spray))
> head(doC)
Source: local data frame [6 x 2]
count s
1 0 <chr[2]>
2 1 <chr[6]>
3 2 <chr[4]>
4 3 <chr[8]>
5 4 <chr[4]>
6 5 <chr[7]>
> head(doC)[[2]]
[[1]]
[1] "C" "C"
[[2]]
[1] "C" "C" "C" "C" "E" "E"
[[3]]
[1] "C" "C" "D" "E"
[[4]]
[1] "C" "C" "D" "D" "E" "E" "E" "E"
[[5]]
[1] "C" "D" "D" "E"
[[6]]
[1] "D" "D" "D" "D" "D" "E" "E"
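The underlying constraint is that summarize wanted exactly one value per group, while c(spray) returns one value per row in the group. A hedged sketch of a common workaround in more recent dplyr versions is to wrap the values in list(), which gives one list element per group; the `sprays` and `by_count` names are illustrative, and the exact printing of the list column depends on the installed dplyr and tibble versions.

```r
library(dplyr)

sprays <- InsectSprays
sprays$spray <- as.character(sprays$spray)

# list() wraps each group's values into a single list element,
# so the summary still has exactly one row per count
by_count <- sprays %>%
  group_by(count) %>%
  summarise(s = list(spray))

by_count$s[[1]]
# [1] "C" "C"
```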
