How to pass column names that include spaces in R

Assume my column names are User ID and name.
How should I pass such a column name to functions like the ones below?
df %>%
group_by(User ID) %>%
count(name)
Apparently, group_by() and similar functions do not accept column names that contain spaces.

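Wrap the column name in backticks. Also note that data.frame() would rename `User ID` to User.ID by default (check.names = TRUE), whereas tibble() keeps the name as written: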
library(tidyverse)
df <- tibble(`User ID` = 1:2, x = 5:6)
df %>%
group_by(`User ID`) %>%
summarise(total = sum(x))
#> # A tibble: 2 × 2
#> `User ID` total
#> <int> <int>
#> 1 1 5
#> 2 2 6
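As a minimal sketch of the same idea with a base data.frame, and with the column name held in a string: the variable id_col and the sample values are assumptions made up for illustration, and across(all_of()) assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# check.names = FALSE stops data.frame() from rewriting "User ID" to "User.ID"
df <- data.frame(`User ID` = c(1, 1, 2),
                 name = c("a", "a", "b"),
                 check.names = FALSE)

# Literal name: wrap it in backticks
by_backtick <- df %>%
  group_by(`User ID`) %>%
  count(name)

# Name stored in a string (id_col is a hypothetical variable)
id_col <- "User ID"
by_string <- df %>%
  group_by(across(all_of(id_col))) %>%
  count(name)
```

Both pipelines should give the same counts; the second form is handy when the column name arrives as function input.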

Related

Iterating name of a field with dplyr::summarise function

This is my first question here, so I'll try to explain my problem as clearly as possible.
I'm working on erosion data for farms, stored as pixels (e.g. 1 farm = 10 pixels, so 10 rows in my df). I have 4 data frames in a list, and I would like to calculate the mean erosion for each farm. I thought about looping over the name of the erosion field, but the field does not have the same name in every df (it is either ERO13 or ERO17). I don't want to rely on the position of the field, because it could change between data frames; I want to use only the name, which varies.
Here's an example:
df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
lst_df <- list(df1,df2)
for (df in lst_df){
cur_df <- df
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(current_name_of_erosion_field = mean(current_name_of_erosion_field))
}
I tried with
for (df in lst_df){
cur_df <- df
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(cur_camp = mean(cur_camp))
}
but that doesn't work, because cur_camp is a character string rather than a reference to the column it names, and it still relies on the column position.
How can I build current_name_of_erosion_field here?

We can convert the string to a symbol with sym() and unquote it with !!, or pass the string to across(). Because we are using a for loop, create a list to store the output. To use the string as the name of the new column, put !! on the left-hand side and assign with := instead of =.
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
cur_df <- lst_df[[i]]
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(!!cur_camp := mean(!! sym(cur_camp)))
out[[i]] <- cur_df
}
-output
> out
[[1]]
# A tibble: 2 × 2
ID ERO13
<dbl> <dbl>
1 1 3
2 2 6
[[2]]
# A tibble: 2 × 2
ID ERO17
<dbl> <dbl>
1 4 4.5
2 6 12
Alternatively, use across() with all_of():
out <- vector('list', length(lst_df))
for (i in seq_along(lst_df)){
cur_df <- lst_df[[i]]
cur_camp <- names(cur_df)[2]
cur_df <- cur_df %>%
group_by(ID) %>%
summarise(across(all_of(cur_camp), mean))
out[[i]] <- cur_df
}
-output
> out
[[1]]
# A tibble: 2 × 2
ID ERO13
<dbl> <dbl>
1 1 3
2 2 6
[[2]]
# A tibble: 2 × 2
ID ERO17
<dbl> <dbl>
1 4 4.5
2 6 12
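Since the question asks not to rely on the column position, here is a hedged variant of the loop above that looks the field up by name instead. It assumes every data frame has exactly one column starting with "ERO", and it uses the .data pronoun and glue-style names with :=, which need a reasonably recent dplyr/rlang.

```r
library(dplyr)

df1 <- data.frame(ID = c(1, 1, 2), ERO13 = c(2, 4, 6))
df2 <- data.frame(ID = c(4, 4, 6), ERO17 = c(4, 5, 12))
lst_df <- list(df1, df2)

out <- lapply(lst_df, function(cur_df) {
  # Find the erosion column by name pattern rather than by position
  # (assumption: exactly one column starts with "ERO")
  cur_camp <- grep("^ERO", names(cur_df), value = TRUE)
  cur_df %>%
    group_by(ID) %>%
    # "{cur_camp}" names the output column; .data[[cur_camp]] reads the input column
    summarise("{cur_camp}" := mean(.data[[cur_camp]]))
})
```

lapply() returns the list directly, so there is no need to pre-allocate out as in the for loop.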

A slightly different approach would be to bind the dataframes and use pivot_longer to separate the erosion name from the erosion value. Then you can take the mean of the values without having to specify the name.
library(tidyverse)
df1 <- data.frame(ID = c(1,1,2), ERO13 = c(2,4,6))
df2 <- data.frame(ID = c(4,4,6), ERO17 = c(4,5,12))
bind_rows(df1, df2) %>%
pivot_longer(starts_with('ERO'),
names_to = 'ERO',
values_drop_na = TRUE) %>%
group_by(ID, ERO) %>%
summarize(value = mean(value))
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups` argument.
#> # A tibble: 4 x 3
#> # Groups: ID [4]
#> ID ERO value
#> <dbl> <chr> <dbl>
#> 1 1 ERO13 3
#> 2 2 ERO13 6
#> 3 4 ERO17 4.5
#> 4 6 ERO17 12
Created on 2022-01-14 by the reprex package (v2.0.0)

How to 'summarize' a variable that mixes 'numeric' and 'character' values

Here is a data.frame, data, shown below. How can I transform it into wished_data? Thanks!
library(tidyverse)
data <- data.frame(category=c('a','b','a','b','a'),
values=c(1,'A','2','4','B'))
# the code below doesn't work
data %>% group_by(category ) %>%
summarize(sum=if_else(is.numeric(values)>0,sum(is.numeric(values)),paste0(values)))
# this is the desired result
wished_data <- data.frame(category=c('a','a','b','b'),
values=c('3','B','A','4'))

Mixing numeric and character variables in a column is not tidy. Consider giving each type their own column, for example:
data %>%
mutate(letters = str_extract(values, "[A-Z]"),
numbers = as.numeric(str_extract(values, "\\d+"))) %>%
group_by(category) %>%
summarise(values = sum(numbers, na.rm = TRUE),
letters = na.omit(letters))
category values letters
<chr> <dbl> <chr>
1 a 3 B
2 b 4 A
In R, arithmetic on strings does not make sense: "1+1" is not "2", and is.numeric("1") gives FALSE. A workaround is to convert the column to a list, or to give each type its own column.
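To make the is.numeric() point concrete, here is a small base R sketch using the values from the question. The names parsed, looks_numeric, total and leftovers are made up for illustration.

```r
values <- c("1", "A", "2", "4", "B")

# is.numeric() tests the type of the vector, not what the strings look like
is.numeric(values)

# as.numeric() returns NA (with a warning) for strings that are not numbers
parsed <- suppressWarnings(as.numeric(values))
looks_numeric <- !is.na(parsed)

total <- sum(parsed[looks_numeric])   # sum of the number-like entries
leftovers <- values[!looks_numeric]   # entries that stay as text
```

Checking for NA after as.numeric() is one way to tell the two kinds of entry apart before summarising them separately.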

I'd create a separate column to group numeric values in a category separately from characters.
data %>%
mutate(num_check = grepl("[0-9]", values)) %>%
group_by(category, num_check) %>%
summarize(sum = ifelse(
unique(num_check),
as.character(sum(as.numeric(values))),
unique(values)
), .groups = "drop")
#> # A tibble: 4 × 3
#> category num_check sum
#> <chr> <lgl> <chr>
#> 1 a FALSE B
#> 2 a TRUE 3
#> 3 b FALSE A
#> 4 b TRUE 4

Here is a bit of a messy answer,
library(dplyr)
bind_rows(data %>%
filter(is.na(as.numeric(values))),
data %>%
mutate(values = as.numeric(values)) %>%
group_by(category) %>%
summarise(values = as.character(sum(values, na.rm = TRUE)))) %>%
arrange(category)
category values
#1 a B
#2 a 3
#3 b A
#4 b 4

Concat a list column in R

I'm trying to concatenate characters within a list column in R.
When I try this approach, the result is not 'abc' but a vector converted to character.
What is the right approach?
library(tidyverse)
tibble(b=list(letters[1:3])) %>%
mutate(b = paste(b))
#> # A tibble: 1 x 1
#> b
#> <chr>
#> 1 "c(\"a\", \"b\", \"c\")"
Created on 2020-10-14 by the reprex package (v0.3.0)

Maybe you need something like below
library(tidyverse)
tibble(b=list(letters[1:3])) %>%
mutate(b = sapply(b,paste,collapse = ""))
giving
# A tibble: 1 x 1
b
<chr>
1 abc

Try this. Keep in mind that the element in your tibble is a list. So you can use any of these approaches:
library(tidyverse)
tibble(b=list(letters[1:3])) %>%
mutate(b = lapply(b,function(x)paste0(x,collapse = '')))
Or this:
#Code 2
tibble(b=list(letters[1:3])) %>%
mutate(b = sapply(b,function(x)paste0(x,collapse = '')))
Output:
# A tibble: 1 x 1
b
<chr>
1 abc
In the first case (lapply) the result stays a list column, whereas in the second (sapply) it is simplified to a character vector.
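For context on why the original paste(b) produced "c(\"a\", \"b\", \"c\")": paste() first converts the list with as.character(), which deparses each element instead of joining it. A minimal base R sketch, with a second list element added purely for illustration:

```r
b <- list(letters[1:3], c("x", "y"))

# What paste(b) effectively works with: one deparsed string per list element
deparsed <- as.character(b)

# Apply paste() inside each element instead; vapply() guarantees a character vector
joined <- vapply(b, paste, FUN.VALUE = character(1), collapse = "")
```

vapply() behaves like the sapply() version above but fails loudly if an element does not collapse to a single string.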

We can use tidyverse
library(dplyr)
library(stringr)
library(purrr)
tibble(b=list(letters[1:3])) %>%
mutate(b = map_chr(b, str_c, collapse=""))
# A tibble: 1 x 1
# b
# <chr>
#1 abc

How to use dplyr `rowwise()` column numbers instead of column names

library(tidyverse)
df <- tibble(col1 = c(5, 2), col2 = c(6, 4), col3 = c(9, 9))
df %>% rowwise() %>% mutate(col4 = sd(c(col1, col3)))
# # A tibble: 2 x 4
# col1 col2 col3 col4
# <dbl> <dbl> <dbl> <dbl>
# 1 5 6 9 2.83
# 2 2 4 9 4.95
After asking a series of questions I can finally calculate standard deviation across rows. See my code above.
But I can't use column names in my production code, because the database I pull from likes to change the column names periodically. Luckily for me, the relative column positions are always the same.
So I'll just use column numbers instead. And let's check to make sure I can just swap things in and out:
identical(df$col1, df[[1]])
# [1] TRUE
Yes, I can just swap df[[1]] in place of df$col1. I think I do it like this.
df %>% rowwise() %>% mutate(col4 = sd(c(.[[1]], .[[3]])))
# # A tibble: 2 x 4
# col1 col2 col3 col4
# <dbl> <dbl> <dbl> <dbl>
# 1 5 6 9 3.40
# 2 2 4 9 3.40
df %>% rowwise() %>% {mutate(col4 = sd(c(.[[1]], .[[3]])))}
# Error in mutate_(.data, .dots = compat_as_lazy_dots(...)) :
# argument ".data" is missing, with no default
Nope, it looks like these don't work, because the results differ from my original. And I can't use apply; if you really need to know why, I made a separate question.
df %>% mutate(col4 = apply(.[, c(1, 3)], 1, sd))
How do I apply dplyr rowwise() with column numbers instead of names?

The issue with using .[[1]] or .[[3]] after rowwise() (which groups by row, so each group has a single row) is that . refers to the whole data frame, so it ignores the row grouping and extracts the entire column. To avoid that, we can create a row_number() column before calling rowwise() and then subset the columns by that index:
library(dplyr)
df %>%
mutate(rn = row_number()) %>% # create a sequence of row index
rowwise %>%
mutate(col4 = sd(c(.[[1]][rn[1]], .[[3]][rn[1]]))) %>% #extract with index
select(-rn)
#Source: local data frame [2 x 4]
#Groups: <by row>
# A tibble: 2 x 4
# col1 col2 col3 col4
# <dbl> <dbl> <dbl> <dbl>
#1 5 6 9 2.83
#2 2 4 9 4.95
Or another option is map from purrr where we loop over the row_number() and do the subsetting of rows of dataset
library(purrr)
df %>%
mutate(col4 = map_dbl(row_number(), ~ sd(c(df[[1]][.x], df[[3]][.x]))))
# A tibble: 2 x 4
# col1 col2 col3 col4
# <dbl> <dbl> <dbl> <dbl>
#1 5 6 9 2.83
#2 2 4 9 4.95
Or another option is pmap (if we don't want to use row_number())
df %>%
mutate(col4 = pmap_dbl(.[c(1, 3)], ~ sd(c(...))))
# A tibble: 2 x 4
# col1 col2 col3 col4
# <dbl> <dbl> <dbl> <dbl>
#1 5 6 9 2.83
#2 2 4 9 4.95
Of course, the easiest way would be to use rowSds() from matrixStats, as described in the linked duplicate post.
NOTE: None of the above methods requires any reshaping.
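As a further option that is not part of the answer above, and assuming dplyr 1.0.0 or later, c_across() accepts column positions inside rowwise(). The sketch also reproduces the 3.40 from the question, which is simply the standard deviation of both whole columns pooled together.

```r
library(dplyr)

df <- tibble(col1 = c(5, 2), col2 = c(6, 4), col3 = c(9, 9))

# What .[[1]] and .[[3]] computed in the question: sd of all four values at once
pooled <- sd(c(df[[1]], df[[3]]))

# Row-by-row sd of the 1st and 3rd columns, selected by position
res <- df %>%
  rowwise() %>%
  mutate(col4 = sd(c_across(c(1, 3)))) %>%
  ungroup()
```

The positions are resolved against the data frame's columns, so this keeps working if the database renames them.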

Since you don't necessarily know the column names, but know the positions of the columns for which you need standard deviation, etc., I'd reshape into long data and add an ID column. You can gather by position instead of column name, either by giving the numbers of the column that should become the key, or the numbers of the columns to omit from the key. That way, you don't need to specify those values by column because you'll have them all in one column already. Then you can join those summary values back to your original wide-shaped data.
library(dplyr)
library(tidyr)
df <- tibble(col1 = c(5, 2), col2 = c(6, 4), col3 = c(9, 9)) %>%
mutate(id = row_number())
df %>%
mutate(id = row_number()) %>%
gather(key, value, 1, 3) %>%
group_by(id) %>%
summarise(sd = sd(value)) %>%
inner_join(df, by = "id")
#> # A tibble: 2 x 5
#> id sd col1 col2 col3
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2.83 5 6 9
#> 2 2 4.95 2 4 9
Rearrange columns by position as you need.
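gather() has been superseded in tidyr, so here is a hedged sketch of the same reshape-and-join idea with pivot_longer(), again selecting columns by position. The intermediate name row_sd is an assumption for illustration.

```r
library(dplyr)
library(tidyr)

df <- tibble(col1 = c(5, 2), col2 = c(6, 4), col3 = c(9, 9)) %>%
  mutate(id = row_number())

# Stack the 1st and 3rd columns into long form, then summarise per row id
row_sd <- df %>%
  pivot_longer(cols = c(1, 3), names_to = "key", values_to = "value") %>%
  group_by(id) %>%
  summarise(sd = sd(value))

# Join the per-row summary back onto the wide data
res <- inner_join(df, row_sd, by = "id")
```

Joining df on the left keeps the original column order and appends sd at the end.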

Another approach: transpose the data (which converts it to a matrix), compute the standard deviations, transpose back, and convert to a tibble.
df %>%
t %>%
rbind(col4 = c(sd(.[c(1, 3),1]), sd(.[c(1, 3),2]))) %>%
t %>%
as_tibble()

tidyverse - preferred way to turn a named vector into a data.frame/tibble

Using the tidyverse a lot, I often face the challenge of turning named vectors into a data.frame/tibble whose columns are the names of the vector.
What is the preferred/tidyversey way of doing this?
EDIT: This is related to: this and this github-issue
So i want:
require(tidyverse)
vec <- c("a" = 1, "b" = 2)
to become this:
# A tibble: 1 × 2
a b
<dbl> <dbl>
1 1 2
I can do this via e.g.:
vec %>% enframe %>% spread(name, value)
vec %>% t %>% as_tibble
Usecase example:
require(tidyverse)
require(rvest)
txt <- c('<node a="1" b="2"></node>',
'<node a="1" c="3"></node>')
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(~t(.) %>% as_tibble)
Which gives
# A tibble: 2 × 3
a b c
<chr> <chr> <chr>
1 1 2 <NA>
2 1 <NA> 3

This is now directly supported using bind_rows (introduced in dplyr 0.7.0):
library(tidyverse)
vec <- c("a" = 1, "b" = 2)
bind_rows(vec)
#> # A tibble: 1 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
This quote from https://cran.r-project.org/web/packages/dplyr/news.html explains the change:
bind_rows() and bind_cols() now accept vectors. They are treated as rows by the former and columns by the latter. Rows require inner names like c(col1 = 1, col2 = 2), while columns require outer names: col1 = c(1, 2). Lists are still treated as data frames but can be spliced explicitly with !!!, e.g. bind_rows(!!! x) (#1676).
With this change, it means that the following line in the use case example:
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(~t(.) %>% as_tibble)
can be rewritten as
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(bind_rows)
which is also equivalent to
txt %>% map(read_xml) %>% map(xml_attrs) %>% { bind_rows(!!! .) }
The equivalence of the different approaches is demonstrated in the following example:
library(tidyverse)
library(rvest)
txt <- c('<node a="1" b="2"></node>',
'<node a="1" c="3"></node>')
temp <- txt %>% map(read_xml) %>% map(xml_attrs)
# x, y, and z are identical
x <- temp %>% map_df(~t(.) %>% as_tibble)
y <- temp %>% map_df(bind_rows)
z <- bind_rows(!!! temp)
identical(x, y)
#> [1] TRUE
identical(y, z)
#> [1] TRUE
z
#> # A tibble: 2 x 3
#> a b c
#> <chr> <chr> <chr>
#> 1 1 2 <NA>
#> 2 1 <NA> 3
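The same behaviour can be seen without rvest. In this sketch the list attrs is a hand-written stand-in for the xml_attrs() output above, which is an assumption made for illustration.

```r
library(dplyr)

# Stand-in for the list of named character vectors returned by xml_attrs()
attrs <- list(c(a = "1", b = "2"),
              c(a = "1", c = "3"))

# Each named vector becomes one row; names missing from a vector are filled with NA
res <- bind_rows(!!!attrs)
```

This is the reason the a/b/c example ends up with NA in the b and c columns.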

The idiomatic way would be to splice the vector with !!! within a tibble() call so the named vector elements become column definitions:
library(tibble)
vec <- c("a" = 1, "b" = 2)
tibble(!!!vec)
#> # A tibble: 1 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
Created on 2019-09-14 by the reprex package (v0.3.0)
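A related option, assuming tibble 3.0.0 or later, is as_tibble_row(), which is exported and turns a named vector into a one-row tibble. The name one_row is made up for this sketch.

```r
library(tibble)

vec <- c("a" = 1, "b" = 2)

# One row, one column per vector name
one_row <- as_tibble_row(vec)
```

This avoids splicing when the vector is already in hand.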

This works for me: c("a" = 1, "b" = 2) %>% t() %>% tbl_df() (tbl_df() has since been deprecated in dplyr; as_tibble() does the same job here.)

Interestingly you can use the as_tibble() method for lists to do this in one call. Note that this isn't best practice since this isn't an exported method.
tibble:::as_tibble.list(vec)