Concat a list column in R - r

I'm trying to concatenate characters within a list column in R.
When I try this approach, the result is not 'abc' but a vector converted to character.
What is the right approach?
library(tidyverse)
tibble(b=list(letters[1:3])) %>%
mutate(b = paste(b))
#> # A tibble: 1 x 1
#> b
#> <chr>
#> 1 "c(\"a\", \"b\", \"c\")"
Created on 2020-10-14 by the reprex package (v0.3.0)

You probably want something like this, which applies paste() with collapse to each element of the list column:
library(tidyverse)
tibble(b=list(letters[1:3])) %>%
mutate(b = sapply(b,paste,collapse = ""))
giving
# A tibble: 1 x 1
b
<chr>
1 abc
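To see why the original attempt fails: paste() coerces its arguments with as.character(), and for a list that deparses each element instead of joining it. Below is a minimal base R sketch of the difference. The second list element is invented for illustration, and vapply() is used here only as a type-checked alternative to sapply().

```r
# Hypothetical list column with two elements (the second one is invented here)
b <- list(letters[1:3], c("x", "y"))

# as.character() on a list deparses each element; paste(b) does this internally
deparsed <- as.character(b)

# Collapse inside each element instead; vapply() checks that every
# result is a single string
joined <- vapply(b, paste, character(1), collapse = "")
joined
```

Inside mutate(), the same vapply() call can replace sapply() if you want an error when an element does not collapse to exactly one string.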

Try this. Keep in mind that the element in your tibble is a list. So you can use any of these approaches:
library(tidyverse)
tibble(b=list(letters[1:3])) %>%
mutate(b = lapply(b,function(x)paste0(x,collapse = '')))
Or this:
#Code 2
tibble(b=list(letters[1:3])) %>%
mutate(b = sapply(b,function(x)paste0(x,collapse = '')))
Output:
# A tibble: 1 x 1
b
<chr>
1 abc
In the first case the result is still a list column, whereas in the second one it is a plain character column.
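A quick way to check the difference between the two variants outside of a tibble is sketched below. The variable names as_list and as_chr are made up for this example.

```r
b <- list(letters[1:3])

# lapply() always returns a list, so the column would stay a list column
as_list <- lapply(b, paste0, collapse = "")

# sapply() simplifies to an atomic vector when every result has length one
as_chr <- sapply(b, paste0, collapse = "")

class(as_list)
class(as_chr)
```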

We can use tidyverse
library(dplyr)
library(stringr)
library(purrr)
tibble(b=list(letters[1:3])) %>%
mutate(b = map_chr(b, str_c, collapse=""))
# A tibble: 1 x 1
# b
# <chr>
#1 abc

Related

how to pass column names including space in R

Assume my column names are User ID and name.
How should I pass these column names to functions like the ones below?
df %>%
group_by(User ID) %>%
count(name)
Apparently, group_by() and similar functions do not accept column names that contain spaces.
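Wrap the column name in backticks. A tibble also helps here because it keeps the name as written, whereas data.frame() changes it to User.ID unless check.names = FALSE: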
library(tidyverse)
df <- tibble(`User ID` = 1:2, x = 5:6)
df %>%
group_by(`User ID`) %>%
summarise(total = sum(x))
#> # A tibble: 2 × 2
#> `User ID` total
#> <int> <int>
#> 1 1 5
#> 2 2 6
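If the data has to stay a base data.frame, the same backtick quoting works as long as the name was not altered on creation. The sketch below uses made-up data and assumes dplyr is installed.

```r
library(dplyr)

# check.names = FALSE stops data.frame() from rewriting "User ID" to "User.ID"
df <- data.frame(`User ID` = c(1, 1, 2),
                 name = c("x", "y", "x"),
                 check.names = FALSE)

# For comparison: the default behaviour changes the name
default_names <- names(data.frame(`User ID` = 1))

# Backticks let dplyr verbs refer to the non-syntactic name
counts <- df %>%
  group_by(`User ID`) %>%
  count(name)
counts
```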

How to 'summarize' variable which mixed by 'numeric' and 'character'

Here is a data.frame, data, as shown below. How can I transform it into wished_data? Thanks!
library(tidyverse)
data <- data.frame(category=c('a','b','a','b','a'),
values=c(1,'A','2','4','B'))
# the code below does not work
data %>% group_by(category ) %>%
summarize(sum=if_else(is.numeric(values)>0,sum(is.numeric(values)),paste0(values)))
# below is the desired result
wished_data <- data.frame(category=c('a','a','b','b'),
values=c('3','B','A','4'))
Mixing numeric and character variables in a column is not tidy. Consider giving each type their own column, for example:
data %>%
mutate(letters = str_extract(values, "[A-Z]"),
numbers = as.numeric(str_extract(values, "\\d"))) %>%
group_by(category) %>%
summarise(values = sum(numbers, na.rm = TRUE),
letters = na.omit(letters))
category values letters
<chr> <dbl> <chr>
1 a 3 B
2 b 4 A
In R, string math does not make sense: "1+1" is not "2", and is.numeric("1") returns FALSE. A workaround is to convert the column to a list, or to give each type its own column.
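The point about is.numeric() can be checked directly. This is a small base R sketch using the values column from the question; the names nums, is_num and total are illustrative only.

```r
values <- c("1", "A", "2", "4", "B")

# The vector is character, so is.numeric() is FALSE for the whole vector
is.numeric(values)

# as.numeric() turns non-numbers into NA (with a warning, suppressed here)
nums <- suppressWarnings(as.numeric(values))

# Entries that survived the conversion are the numeric-looking ones
is_num <- !is.na(nums)
total <- sum(nums[is_num])
total
```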
I'd create a separate column to group numeric values in a category separately from characters.
data %>%
mutate(num_check = grepl("[0-9]", values)) %>%
group_by(category, num_check) %>%
summarize(sum = ifelse(
unique(num_check),
as.character(sum(as.numeric(values))),
unique(values)
), .groups = "drop")
#> # A tibble: 4 × 3
#> category num_check sum
#> <chr> <lgl> <chr>
#> 1 a FALSE B
#> 2 a TRUE 3
#> 3 b FALSE A
#> 4 b TRUE 4
Here is a slightly messy answer (as.numeric() warns about the NAs it introduces for the letters, which is expected here):
library(dplyr)
bind_rows(data %>%
filter(is.na(as.numeric(values))),
data %>%
mutate(values = as.numeric(values)) %>%
group_by(category) %>%
summarise(values = as.character(sum(values, na.rm = TRUE)))) %>%
arrange(category)
category values
#1 a B
#2 a 3
#3 b A
#4 b 4

applying function to each group using dplyr and return specified dataframe

I used group_map for the first time and think I do it correctly. This is my code:
library(REAT)
df <- data.frame(value = c(1,1,1, 1,0.5,0.1, 0,0,0,1), group = c(1,1,1, 2,2,2, 3,3,3,3))
haves <- df %>%
group_by(group) %>%
group_map(~gini(.x$value, coefnorm = TRUE))
The thing is that haves is a list rather than a data frame. What would I have to do to obtain this data frame?
wants <- data.frame(group = c(1,2,3), gini = c(0,0.5625,1))
group gini
1 0.0000
2 0.5625
3 1.0000
Thanks!
You can use dplyr::summarize:
df %>%
group_by(group) %>%
summarize(gini = gini(value, coefnorm = TRUE))
#> # A tibble: 3 x 2
#> group gini
#> <dbl> <dbl>
#> 1 1 0
#> 2 2 0.562
#> 3 3 1
According to the documentation, group_map always produces a list. group_modify is an alternative that produces a tibble if the function does, but gini just outputs a vector. So, you could do something like this...
df %>%
group_by(group) %>%
group_modify(~tibble(gini = gini(.x$value, coefnorm = TRUE)))
# A tibble: 3 x 2
# Groups: group [3]
group gini
<dbl> <dbl>
1 1 0
2 2 0.562
3 3 1
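The difference between group_map and group_modify can be seen without installing REAT. In the sketch below, value_range is a made-up stand-in for gini (it is not the Gini coefficient), chosen only because it also returns a single number per group.

```r
library(dplyr)

df <- data.frame(value = c(1, 1, 1, 1, 0.5, 0.1, 0, 0, 0, 1),
                 group = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))

# Hypothetical stand-in for REAT::gini(): one number per group
value_range <- function(x) max(x) - min(x)

# group_map() returns a plain list with one element per group
as_list <- df %>%
  group_by(group) %>%
  group_map(~ value_range(.x$value))

# group_modify() returns a data frame if the function returns one
as_tbl <- df %>%
  group_by(group) %>%
  group_modify(~ tibble(range = value_range(.x$value))) %>%
  ungroup()
as_tbl
```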
Using data.table
library(data.table)
setDT(df)[, .(gini = gini(value, coefnorm = TRUE)), group]
On grouped data, we can also use the .data pronoun to refer to a column explicitly instead of by its bare name:
library(dplyr)
df %>%
group_by(group) %>%
summarize(gini = gini(.data$value, coefnorm = TRUE))
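The .data pronoun is most useful when the column name is held in a string. The sketch below uses sum() in place of gini() so that it runs without REAT; the variable col is an assumption made for this example.

```r
library(dplyr)

df <- data.frame(value = c(1, 1, 1, 1, 0.5, 0.1, 0, 0, 0, 1),
                 group = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))

# Column name stored as a string (hypothetical, e.g. a function argument)
col <- "value"

# .data[[col]] looks the column up by name within each group
totals <- df %>%
  group_by(group) %>%
  summarize(total = sum(.data[[col]]))
totals
```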

Apply a custom function over levels of a factor in a dataframe

I'm trying to apply a tidyverse-based approach, or at least a tidy solution, for applying custom functions over the levels of a factor in a dataframe.
Consider the following test dataset:
df <- tibble(LINE=rep(c(1,2),each=6), FOUND=c(1,1,1,0,1,1,0,0,1,0,0,1))
# LINE FOUND
# <dbl> <dbl>
# 1 1 1
# 2 1 1
# 3 1 1
# 4 1 0
# 5 1 1
# 6 1 1
# 7 2 0
# 8 2 0
# 9 2 1
#10 2 0
#11 2 0
#12 2 1
I want to know, for example, the proportion of found results (e.g. FOUND==1) by level of the LINE factor. Right now I'm working with the following code, but I'm trying to get to something cleaner.
# This is the function to calculate the proportion "found"
get_prop <- function (data) {
tot <- data %>% nrow()
found <- data %>% dplyr::filter(FOUND==1) %>% nrow
found / tot
}
# This is the code to generate the expected result
lines <- df$LINE %>% unique %>% sort
v_line <- vector()
v_prop <- vector()
for (i in 1:length(lines)) {
tot <- df %>% dplyr::filter(LINE==lines[i])
v_line[i] <- lines[i]
v_prop[i] <- get_prop(tot)
}
df_line = data.frame(LINE = v_line, CALL = v_prop)
I would expect the following to work, but it does not: it returns one row per level, but the value in each row is the proportion for the whole dataset rather than for that level:
df %>% dplyr::group_by(LINE) %>% dplyr::summarise(get_prop(.))
EDIT: Please note that what I am looking for is a solution for applying a custom function over the levels of a factor in a dataframe. It is not necessarily the number or the proportion of occurrences of a particular value, as in the example illustrated.
EDIT 2: That is, I'm looking for a solution that makes use of the get_prop function above. This is not because it is the best way of solving this particular issue, but because it is more generalizable
If you want to apply a custom function group-wise, you can use group_split. This splits your data frame into a list in which each element is the subset of df for one group. You can then use map to apply your function to each group (note that you can group_split and map in one step by using group_map). I added the last line to get back to the form of the original approach.
df %>%
group_by(LINE) %>%
group_split() %>%
map_dbl(get_prop) %>%
tibble(LINE = seq_along(.), CALL = .) # optional to get back to a df
#> # A tibble: 2 x 2
#> LINE CALL
#> <int> <dbl>
#> 1 1 0.833
#> 2 2 0.333
Created on 2020-01-20 by the reprex package (v0.3.0)
Now one thing I'm worried about with this solution is that group_split drops the grouping variable (I would have preferred if it was kept as the names of the list or an attribute). So if you want a tibble as the outcome it might make sense to save the grouping variable beforehand:
groups <- unique(df$LINE)
df %>%
group_by(LINE) %>%
group_split() %>%
map_dbl(get_prop) %>%
tibble(group = groups, result = .)
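One caveat with saving unique(df$LINE): it returns the levels in order of appearance, while group_by sorts the groups, so the two can disagree. A possible alternative is to take the labels from group_keys(), sketched below. The data is deliberately reordered ("b" before "a") to show the issue, get_prop is rewritten without pipes but with the same logic, and vapply() stands in for map_dbl() to avoid needing purrr.

```r
library(dplyr)

# Same logic as get_prop above, written without pipes
get_prop <- function(data) {
  tot <- nrow(data)
  found <- nrow(dplyr::filter(data, FOUND == 1))
  found / tot
}

# "b" appears first in the data, but groups are sorted as a, b
df <- tibble(LINE = rep(c("b", "a"), each = 6),
             FOUND = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1))

grouped <- group_by(df, LINE)

# group_keys() gives one row per group, in the same order as group_split()
result <- group_keys(grouped)
result$CALL <- vapply(group_split(grouped), get_prop, numeric(1))
result
```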
Update:
I think the overall cleanest approach would be this (using a more general example):
library(tidyverse)
df <- tibble(LINE=rep(c("a", "b"),each=6), FOUND=c(1,1,1,0,1,1,0,0,1,0,0,1))
lvls <- unique(df$LINE)
df %>%
group_by(LINE) %>%
group_map(~ get_prop(.x)) %>%
setNames(lvls) %>%
unlist() %>%
enframe()
#> # A tibble: 2 x 2
#> name value
#> <chr> <dbl>
#> 1 a 0.833
#> 2 b 0.333
Created on 2020-01-20 by the reprex package (v0.3.0)
Another option could be to use group_map and then tibble::enframe
library(dplyr)
df %>%
group_by(LINE) %>%
group_map(~get_prop(.)) %>%
unlist() %>%
tibble::enframe()
# name value
# <int> <dbl>
#1 1 0.833
#2 2 0.333
You could also use group_modify, which would keep the group names (using @JBGruber's data)
df %>%
group_by(LINE) %>%
group_modify(~ tibble::enframe(get_prop(.), name = NULL))
# LINE value
# <chr> <dbl>
#1 a 0.833
#2 b 0.333
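The reason summarise(get_prop(.)) returns the whole-dataset value is that the pipe's . stands for the entire grouped data frame, not for the current group. If you want to stay inside summarise, one possible approach is pick(everything()), which hands the current group's rows to the function. This sketch assumes dplyr 1.1.0 or later, where pick() is available.

```r
library(dplyr)

# Same logic as get_prop in the question, written without pipes
get_prop <- function(data) {
  tot <- nrow(data)
  found <- nrow(dplyr::filter(data, FOUND == 1))
  found / tot
}

df <- tibble(LINE = rep(c(1, 2), each = 6),
             FOUND = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1))

# pick(everything()) is the current group's data (without the grouping column)
out <- df %>%
  group_by(LINE) %>%
  summarise(CALL = get_prop(pick(everything())))
out
```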

tidyverse - preferred way to turn a named vector into a data.frame/tibble

Using the tidyverse a lot, I often face the challenge of turning named vectors into a data.frame/tibble with the columns being the names of the vector.
What is the preferred/tidyversey way of doing this?
EDIT: This is related to: this and this github-issue
So I want:
require(tidyverse)
vec <- c("a" = 1, "b" = 2)
to become this:
# A tibble: 1 × 2
a b
<dbl> <dbl>
1 1 2
I can do this via e.g.:
vec %>% enframe %>% spread(name, value)
vec %>% t %>% as_tibble
Usecase example:
require(tidyverse)
require(rvest)
txt <- c('<node a="1" b="2"></node>',
'<node a="1" c="3"></node>')
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(~t(.) %>% as_tibble)
Which gives
# A tibble: 2 × 3
a b c
<chr> <chr> <chr>
1 1 2 <NA>
2 1 <NA> 3
This is now directly supported using bind_rows (introduced in dplyr 0.7.0):
library(tidyverse)
vec <- c("a" = 1, "b" = 2)
bind_rows(vec)
#> # A tibble: 1 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
This quote from https://cran.r-project.org/web/packages/dplyr/news.html explains the change:
bind_rows() and bind_cols() now accept vectors. They are treated as rows by the former and columns by the latter. Rows require inner names like c(col1 = 1, col2 = 2), while columns require outer names: col1 = c(1, 2). Lists are still treated as data frames but can be spliced explicitly with !!!, e.g. bind_rows(!!! x) (#1676).
With this change, the following line in the use case example:
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(~t(.) %>% as_tibble)
can be rewritten as
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(bind_rows)
which is also equivalent to
txt %>% map(read_xml) %>% map(xml_attrs) %>% { bind_rows(!!! .) }
The equivalence of the different approaches is demonstrated in the following example:
library(tidyverse)
library(rvest)
txt <- c('<node a="1" b="2"></node>',
'<node a="1" c="3"></node>')
temp <- txt %>% map(read_xml) %>% map(xml_attrs)
# x, y, and z are identical
x <- temp %>% map_df(~t(.) %>% as_tibble)
y <- temp %>% map_df(bind_rows)
z <- bind_rows(!!! temp)
identical(x, y)
#> [1] TRUE
identical(y, z)
#> [1] TRUE
z
#> # A tibble: 2 x 3
#> a b c
#> <chr> <chr> <chr>
#> 1 1 2 <NA>
#> 2 1 <NA> 3
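The row-binding behaviour can also be tried without rvest. In this sketch the list attrs is written by hand to mimic what xml_attrs returns for the two nodes; that hand-written list is an assumption made for illustration.

```r
library(dplyr)

# Hand-written stand-in for the xml_attrs() output: named character vectors
attrs <- list(c(a = "1", b = "2"),
              c(a = "1", c = "3"))

# Each named vector becomes one row; names missing from a row are filled with NA
rows <- bind_rows(!!!attrs)
rows
```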
The idiomatic way would be to splice the vector with !!! within a tibble() call so the named vector elements become column definitions:
library(tibble)
vec <- c("a" = 1, "b" = 2)
tibble(!!!vec)
#> # A tibble: 1 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
Created on 2019-09-14 by the reprex package (v0.3.0)
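Another option that may be worth knowing is as_tibble_row(), which does the same conversion without splicing. This sketch assumes tibble 3.0.0 or later, where that function is available; enframe() is shown alongside it for the long form.

```r
library(tibble)

vec <- c(a = 1, b = 2)

# One row, one column per name
wide <- as_tibble_row(vec)

# One row per element, with name and value columns
long <- enframe(vec)

wide
long
```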
This works for me: c("a" = 1, "b" = 2) %>% t() %>% tbl_df() (note that tbl_df() has since been deprecated in favor of as_tibble())
Interestingly you can use the as_tibble() method for lists to do this in one call. Note that this isn't best practice since this isn't an exported method.
tibble:::as_tibble.list(vec)
as_tibble(as.list(c(a=1, b=2)))
