Aggregate strings using c() in dplyr summarize or aggregate - r

I want to aggregate some strings using c() as the aggregation function in dplyr. I first tried the following:
> InsectSprays$spray = as.character(InsectSprays$spray)
> dt = tbl_df(InsectSprays)
> dt %>% group_by(count) %>% summarize(c(spray))
Error: expecting a single value
But using c() function in aggregate() works:
> da = aggregate(spray ~ count, InsectSprays, c)
> head(da)
count spray
1 0 C, C
2 1 C, C, C, C, E, E
3 2 C, C, D, E
Searching Stack Overflow suggested that using paste() with collapse instead of c() would solve the problem:
dt %>% group_by(count) %>% summarize(s=paste(spray, collapse=","))
or
dt %>% group_by(count) %>% summarize(paste( c(spray), collapse=","))
My question is: Why does c() work in aggregate() but not in dplyr's summarize()?

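If you take a closer look, you can see that c() actually does work (to a certain extent) when we use do(). summarize() expects each expression to return a single value per group, whereas c(spray) returns one value per row of the group, which is why it raises "expecting a single value". To my understanding, dplyr does not currently print the contents of such list columns; it only shows the type and length of each element: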
> InsectSprays$spray = as.character(InsectSprays$spray)
> dt = tbl_df(InsectSprays)
> doC <- dt %>% group_by(count) %>% do(s = c(.$spray))
> head(doC)
Source: local data frame [6 x 2]
count s
1 0 <chr[2]>
2 1 <chr[6]>
3 2 <chr[4]>
4 3 <chr[8]>
5 4 <chr[4]>
6 5 <chr[7]>
> head(doC)[[2]]
[[1]]
[1] "C" "C"
[[2]]
[1] "C" "C" "C" "C" "E" "E"
[[3]]
[1] "C" "C" "D" "E"
[[4]]
[1] "C" "C" "D" "D" "E" "E" "E" "E"
[[5]]
[1] "C" "D" "D" "E"
[[6]]
[1] "D" "D" "D" "D" "D" "E" "E"
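The difference in shape can be seen without dplyr. The following is a minimal base R sketch, not the method used in the answer above: split() keeps one character vector per group (the list-like result that aggregate() and do() produce), while paste() with collapse reduces each group to the single value that summarize() expects. The names sprays, by_count and collapsed are made up for this illustration.

```r
# Work on a copy so the built-in dataset is left untouched.
sprays <- InsectSprays
sprays$spray <- as.character(sprays$spray)

# One character vector per distinct count: a list, like the do() result.
by_count <- split(sprays$spray, sprays$count)
by_count[["0"]]

# One string per distinct count: the single-value shape summarize() accepts.
collapsed <- vapply(by_count, paste, character(1), collapse = ",")
head(collapsed, 3)
```

With a reasonably recent dplyr, wrapping the vector as list(spray) inside summarize() should, as far as I know, give the equivalent list column; that is an assumption about the installed version, not something shown in the original thread.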

Related

Pass function through specific columns with lapply or for loop

I have created a function that reorganizes a data frame into a list. I want to apply the function to all of the columns in the data frame (excluding the first 2 columns); however, lapply is returning strange results.
Here is a reproducible example:
names <- c("A", "B", "C", "D")
titles <- c("P", "S", "S", "P")
day1 <- c(1,0,1,0)
day2 <- c(0,0,1,1)
day3 <- c(1,1,0,0)
df <- data.frame(names, titles, day1, day2, day3)
ids <-df[,1:2]
obs <- df[,3:5]
I create the function, which checks a given "day" column for values greater than 0 and reports the "names" and "titles" of the rows with a 1 (it also removes duplicated values).
group_maker1 <- function(x){
g1 <- ids$names[obs[,x]> 0]
g2 <- ids$titles[obs[,x]> 0]
temp <- c(g1,g2)
temp <- temp[!duplicated(temp)]
paste(temp)
}
#test group_maker
> group_maker1(3)
[1] "A" "B" "P" "S"
In the actual data frame, there are many (>300) columns of "days". I want to pass this group_maker function through each column of "days" to the nth day.
I've tried running it through a for loop, but the output doesn't seem to be stored anywhere:
for(i in 1:nrow(df)) { # for-loop over columns
group_maker1 <- function(x){
g1 <- ids$names[obs[,x]> 0]
g2 <- ids$titles[obs[,x]> 0]
temp <- c(g1,g2)
temp <- temp[!duplicated(temp)]
paste(temp)
}
}
Alternatively, I tried lapply, which seems more promising as it gives an output; however, "NA"s are present and it's not reporting any of the "B" names:
lapply(obs[,1:3], group_maker1)
$day1
[1] "A" "C" "NA" "P" "S"
$day2
[1] "A" "C" "NA" "P" "S"
$day3
[1] "A" "C" "NA" "P" "S"
This is the desired output format; however, the values within it are incorrect. I want it to return the output as seen above in the group_maker1(3) line, but with the correct values for each column of days (i.e. no "NA"s and all of the values in that column).
Essentially, I want the loop/apply to pass the function through each column of "days" and provide an output of all the "names" and "titles" for each day in the form of a list.
Using your test data, we have
> group_maker1(1)
[1] "A" "C" "P" "S"
> group_maker1(2)
[1] "C" "D" "S" "P"
> group_maker1(3)
[1] "A" "B" "P" "S"
So, we can replicate using a for loop with
> for(i in 1:3) print(group_maker1(i))
[1] "A" "C" "P" "S"
[1] "C" "D" "S" "P"
[1] "A" "B" "P" "S"
or using lapply with
> lapply(1:3, group_maker1)
[[1]]
[1] "A" "C" "P" "S"
[[2]]
[1] "C" "D" "S" "P"
[[3]]
[1] "A" "B" "P" "S"
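In both cases, your attempt failed for a simple reason: group_maker1 expects a column index. Your for loop only redefines the function on each iteration and never calls it, and lapply(obs[,1:3], group_maker1) passes the column values, not the column positions, to the function.
Or, taking a completely different approach to avoid the explicit use of loops altogether: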
library(tidyverse)
df %>%
pivot_longer(
starts_with("day"),
names_to="col",
values_to="val"
) %>%
group_by(col) %>%
group_map(
function(.x, .y) {
z <- .x %>% filter(val > 0)
c(z %>% pull(names) %>% unique(), z %>% pull(titles) %>% unique())
}
)
[[1]]
[1] "A" "C" "P" "S"
[[2]]
[1] "C" "D" "S" "P"
[[3]]
[1] "A" "B" "P" "S"
This final option could be shorter if there were no need to deal with awkward input and output formats.
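Since the question asks for the results to be stored as a list, one entry per day, here is a minimal base R sketch of that step. It assumes the same test data as the question; the helper below uses unique() in place of the original duplicated()/paste() combination, and the names groups and groups_loop are made up for this illustration.

```r
# Same test data as in the question.
df <- data.frame(
  names  = c("A", "B", "C", "D"),
  titles = c("P", "S", "S", "P"),
  day1 = c(1, 0, 1, 0),
  day2 = c(0, 0, 1, 1),
  day3 = c(1, 1, 0, 0),
  stringsAsFactors = FALSE
)
ids <- df[, 1:2]
obs <- df[, 3:5]

# Takes a column position in obs, as group_maker1 does in the question.
group_maker1 <- function(x) {
  keep <- obs[, x] > 0
  unique(c(ids$names[keep], ids$titles[keep]))
}

# lapply over column positions, then label each element with its day.
groups <- setNames(lapply(seq_along(obs), group_maker1), colnames(obs))

# Equivalent for loop: pre-allocate a list and fill it by index.
groups_loop <- vector("list", ncol(obs))
for (i in seq_len(ncol(obs))) groups_loop[[i]] <- group_maker1(i)
names(groups_loop) <- colnames(obs)
```

Iterating over seq_along(obs) rather than a hard-coded 1:3 means the same code should carry over to the 300+ day columns mentioned in the question.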

What's the R function used to find unique and distinct value in a column? [duplicate]

I have multiple observations of one species with different observers / groups of observers and want to create a list of all unique observers. My data look like this:
data <- read.table(text="species observer
1 A,B
1 A,B
1 B,E
1 B,E
1 D,E,A,C,C
1 F" , header = TRUE, stringsAsFactors = FALSE)
My output should return a list of all unique observers - so:
A,B,C,D,E,F
I tried to split the data in the observer column using the following command, but that only returns the unique combinations of observers.
all_observers <- unique(strsplit(as.character(data$observer), ","))
all_observers
[[1]]
[1] "A" "B"
[[2]]
[1] "B" "E"
[[3]]
[1] "D" "E" "A" "C" "C"
[[4]]
[1] "F"
You're almost there; you just need to unlist before you call unique:
all_observers <- unique(unlist(strsplit(as.character(data$observer), ",")))
We can use separate_rows() on 'observer' to get one observer per row, keep the distinct rows, group by 'species', and paste the observers back together with toString():
library(tidyverse)
data %>%
separate_rows(observer) %>%
distinct %>%
group_by(species) %>%
summarise(observer = toString(observer))
You could also use scan()
unique(scan(text=data$observer, what="", sep=","))
# Read 14 items
# [1] "A" "B" "E" "D" "C" "F"
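To get from the unique values to the single comma-separated string the question asks for, a sort and a paste() are enough. This is a minimal base R sketch built on the unlist approach above; the names observers and observer_string are made up for this illustration.

```r
data <- read.table(text = "species observer
1 A,B
1 A,B
1 B,E
1 B,E
1 D,E,A,C,C
1 F", header = TRUE, stringsAsFactors = FALSE)

# Split every row on commas, pool the pieces, then de-duplicate and sort.
observers <- sort(unique(unlist(strsplit(data$observer, ",", fixed = TRUE))))

# Collapse to one comma-separated string.
observer_string <- paste(observers, collapse = ",")
observer_string
```

Using fixed = TRUE treats the comma as a literal separator rather than a regular expression, which is a small safety choice if the separator is later changed to a character with special meaning.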

Extract colnames from a nested list of data.frames

I have a nested list of data.frames, what is the easiest way to get the column names of all data.frames?
Example:
d = data.frame(a = 1:3, b = 1:3, c = 1:3)
l = list(a = d, list(b = d, c = d))
Result:
$a
[1] "a" "b" "c"
$b
[1] "a" "b" "c"
$c
[1] "a" "b" "c"
There are already a couple of answers, but let me add another approach. I used rapply2() from the rawr package.
devtools::install_github('raredd/rawr')
library(rawr)
library(purrr)
rapply2(l = l, FUN = colnames) %>%
flatten
$a
[1] "a" "b" "c"
$b
[1] "a" "b" "c"
$c
[1] "a" "b" "c"
Here is a base R solution.
You can define a custom function to flatten your nested list (it can deal with nested lists of any depth, e.g., more than 2 levels), i.e.,
flatten <- function(x){
islist <- sapply(x, class) %in% "list"
r <- c(x[!islist], unlist(x[islist],recursive = F))
if(!sum(islist))return(r)
flatten(r)
}
and then use the following code to get the column names
out <- Map(colnames,flatten(l))
such that
> out
$a
[1] "a" "b" "c"
$b
[1] "a" "b" "c"
$c
[1] "a" "b" "c"
Example with a deeper nested list
l <- list(a = d, list(b = d, list(c = list(e = list(f= list(g = d))))))
> l
$a
a b c
1 1 1 1
2 2 2 2
3 3 3 3
[[2]]
[[2]]$b
a b c
1 1 1 1
2 2 2 2
3 3 3 3
[[2]][[2]]
[[2]][[2]]$c
[[2]][[2]]$c$e
[[2]][[2]]$c$e$f
[[2]][[2]]$c$e$f$g
a b c
1 1 1 1
2 2 2 2
3 3 3 3
and you will get
> out
$a
[1] "a" "b" "c"
$b
[1] "a" "b" "c"
$c.e.f.g
[1] "a" "b" "c"
Here is an attempt to do this in as vectorized a way as possible,
i1 <- names(unlist(l, TRUE, TRUE))
#[1] "a.a1" "a.a2" "a.a3" "a.b1" "a.b2" "a.b3" "a.c1" "a.c2" "a.c3" "b.a1" "b.a2" "b.a3" "b.b1" "b.b2" "b.b3" "b.c1" "b.c2" "b.c3" "c.a1" "c.a2" "c.a3" "c.b1" "c.b2" "c.b3" "c.c1" "c.c2" "c.c3"
i2 <- names(split(i1, gsub('\\d+', '', i1)))
#[1] "a.a" "a.b" "a.c" "b.a" "b.b" "b.c" "c.a" "c.b" "c.c"
We can now split i2 on everything before the dot, which will give,
split(i2, sub('\\..*', '', i2))
# $a
# [1] "a.a" "a.b" "a.c"
# $b
# [1] "b.a" "b.b" "b.c"
# $c
# [1] "c.a" "c.b" "c.c"
To get them fully cleaned, we need to loop over and apply a simple regex,
lapply(split(i2, sub('\\..*', '', i2)), function(i)sub('.*\\.', '', i))
which gives,
$a
[1] "a" "b" "c"
$b
[1] "a" "b" "c"
$c
[1] "a" "b" "c"
The code, compacted:
i1 <- names(unlist(l, TRUE, TRUE))
i2 <- names(split(i1, gsub('\\d+', '', i1)))
final_res <- lapply(split(i2, sub('\\..*', '', i2)), function(i)sub('.*\\.', '', i))
Try this
d = data.frame(a = 1:3, b = 1:3, c = 1:3)
l = list(a = d, list(b = d, c = d))
foo <- function(x, f){
if (is.data.frame(x)) return(f(x))
lapply(x, foo, f = f)
}
foo(l, names)
The crux here is that data.frames are actually special lists, so it matters what you test for.
Small explanation: what needs to be done here is a recursion. Each element you look at might be either a data frame or a list, so you need to decide whether to apply names to it or to go deeper into the recursion and call foo again.
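If a flat list like the one in the question is wanted rather than a result that mirrors the nesting, the same recursion can collect the names as it goes. This is a minimal base R sketch of that variant, not part of the answer above; collect_colnames is a hypothetical helper name, and the example list is slightly deeper than the one in the question.

```r
d <- data.frame(a = 1:3, b = 1:3, c = 1:3)
l <- list(a = d, list(b = d, list(c = d)))

# Hypothetical helper: walk the list and return a flat named list of
# column names, one entry per data frame found at any depth.
collect_colnames <- function(x) {
  out <- list()
  for (i in seq_along(x)) {
    el <- x[[i]]
    if (is.data.frame(el)) {
      # Test for data.frame first: a data frame is also a list.
      out <- c(out, setNames(list(colnames(el)), names(x)[i]))
    } else if (is.list(el)) {
      # A plain list: recurse and append whatever is found inside.
      out <- c(out, collect_colnames(el))
    }
  }
  out
}

res <- collect_colnames(l)
names(res)
```

Each data frame keeps the name it has in its immediate parent list, so intermediate unnamed levels do not show up in the result.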
First create l1, a nested list with only the colnames
l1 <- lapply(l, function(x) if(is.data.frame(x)){
list(colnames(x)) #necessary to list it for the unlist() step afterwards
}else{
lapply(x, colnames)
})
Then unlist l1
unlist(l1, recursive=F)
Here is one way using the purrr functions map_depth and vec_depth (as far as I know, vec_depth is deprecated in favor of pluck_depth in purrr 1.0.0 and later)
library(purrr)
return_names <- function(x) {
if(inherits(x, "list"))
return(map_depth(x, vec_depth(x) - 2, names))
else return(names(x))
}
map(l, return_names)
#$a
#[1] "a" "b" "c"
#[[2]]
#[[2]]$b
#[1] "a" "b" "c"
#[[2]]$c
#[1] "a" "b" "c"
Using an external package, this is also straightforward with rrapply() in the rrapply-package (and works for arbitrary levels of nesting):
library(rrapply)
rrapply(l, classes = "data.frame", f = colnames, how = "flatten")
#> $a
#> [1] "a" "b" "c"
#>
#> $b
#> [1] "a" "b" "c"
#>
#> $c
#> [1] "a" "b" "c"
## deeply nested list
l2 <- list(a = d, list(b = d, list(c = list(e = list(f = list(g = d))))))
rrapply(l2, classes = "data.frame", f = colnames, how = "flatten")
#> $a
#> [1] "a" "b" "c"
#>
#> $b
#> [1] "a" "b" "c"
#>
#> $g
#> [1] "a" "b" "c"

Reorder a vector with wrap around in R

Let's say I have a simple vector x in R. It is in the order 'a','b','c','d'. Is there a function that would take the vector and reorder it with wrap around? For example, how can I get x to be 'c','d','a','b'?
#Vector x
> x <- letters[1:4]
> x
[1] "a" "b" "c" "d"
#What I want:
> somefcn(x, 3)
[1] "c" "d" "a" "b"
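x <- letters[1:4]
shiftnum <- 3
# seq_len(shiftnum - 1) makes the intended indices 1..(shiftnum - 1) explicit;
# 1:shiftnum-1 only worked because it evaluates to (1:shiftnum) - 1 and index 0 is dropped
c(x[shiftnum:length(x)], x[seq_len(shiftnum - 1)])
[1] "c" "d" "a" "b"
This is a very rough way to do it, but it works.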
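The same idea can be wrapped in a small function using modular arithmetic, so that start positions beyond the length of the vector also wrap around. This is a minimal sketch; rotate_to is a hypothetical name standing in for the somefcn of the question, and start is assumed to be a positive whole number.

```r
# Hypothetical helper: reorder x so that element `start` comes first,
# wrapping around to the beginning of the vector.
rotate_to <- function(x, start) {
  n <- length(x)
  if (n == 0) return(x)
  # Shift positions 1..n by (start - 1), wrapping with modulo n.
  idx <- ((seq_len(n) + start - 2) %% n) + 1
  x[idx]
}

x <- letters[1:4]
rotate_to(x, 3)
```

Because the indices are computed modulo the length, rotate_to(x, 1) returns x unchanged and rotate_to(x, 5) behaves like rotate_to(x, 1).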
