dplyr `across()` function and data frame length while grouping - r

packageVersion("dplyr")
#[1] ‘0.8.99.9002’
Please note that this question uses dplyr's new across() function. To install the latest dev version of dplyr, run remotes::install_github("tidyverse/dplyr"). To restore the released version of dplyr, run install.packages("dplyr"). If you are reading this at some point in the future and are already on dplyr 1.0 or later, you can ignore this note.
library(tidyverse)
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3),
rep(as.Date("2020-02-01"), 2)),
Type = c("A", "A", "B", "C", "C"),
col1 = 1:5,
col2 = c(0, 8, 0, 3, 0),
col3 = c(25:29),
colX = rep(99, 5))
#> # A tibble: 5 x 6
#> Date Type col1 col2 col3 colX
#> <date> <chr> <int> <dbl> <int> <dbl>
#> 1 2020-01-01 A 1 0 25 99
#> 2 2020-01-01 A 2 8 26 99
#> 3 2020-01-01 B 3 0 27 99
#> 4 2020-02-01 C 4 3 28 99
#> 5 2020-02-01 C 5 0 29 99
I'd like to sum columns col1 through colX above, grouped by "Date" and "Type". I will always start at the third column (i.e. col1), but will never know the numerical value of X in colX. That's OK because I can use the length of the data frame to determine how far 'out' I need to go to capture all columns up to the end of the data frame. Here's my approach:
df %>%
group_by(Date, Type) %>%
summarize(across(3:length(.)), sum())
#> Error: Problem with `summarise()` input `..1`.
#> x Can't subset columns that don't exist.
#> x Locations 5 and 6 don't exist.
#> i There are only 4 columns.
#> i Input `..1` is `across(3:length(.))`.
#> i The error occured in group 1: Date = 2020-01-01, Type = "A".
#> Run `rlang::last_error()` to see where the error occurred.
But it seems my usage of the base R length(.) function is improper. Am I using dplyr's new across() function in the right manner? How can I get the length of the data frame in the portion of the pipe where I need it? I'll never know how many columns there are to the end, nor are the actual names nearly as clean as my example data frame.

packageVersion("dplyr")
#[1] ‘0.8.99.9002’
First, you have a small syntax problem: the column selection and the function both go inside the across() call.
df %>% summarize(across(3:length(.),sum))
## A tibble: 1 x 4
# col1 col2 col3 colX
# <int> <dbl> <int> <dbl>
#1 15 11 135 495
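The following code does not work because you cannot select columns that are currently being grouped on. Inside a grouped summarize(), across() only sees the 4 non-grouping columns, while length(.) still returns 6 for the full data frame, so 3:length(.) asks for locations 5 and 6, which don't exist.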
df %>%
group_by(Date, Type) %>%
summarize(across(3:length(.), sum))
#Error: Problem with `summarise()` input `..1`.
#x Can't subset columns that don't exist.
#x Locations 5 and 6 don't exist.
#ℹ There are only 4 columns.
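This becomes clear when you try the following, where everything() selects only the non-grouping columns and the grouping columns are carried along as keys: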
df %>%
group_by(Date, Type) %>%
summarize(across(everything(), sum))
## A tibble: 3 x 6
## Groups: Date [2]
# Date Type col1 col2 col3 colX
# <date> <chr> <int> <dbl> <int> <dbl>
#1 2020-01-01 A 3 8 51 198
#2 2020-01-01 B 3 0 27 99
#3 2020-02-01 C 9 3 57 198
Other options include the starts_with() tidy-select helper.
df %>%
group_by(Date, Type) %>%
summarize(across(starts_with("col"), sum))
## A tibble: 3 x 6
## Groups: Date [2]
# Date Type col1 col2 col3 colX
# <date> <chr> <int> <dbl> <int> <dbl>
#1 2020-01-01 A 3 8 51 198
#2 2020-01-01 B 3 0 27 99
#3 2020-02-01 C 9 3 57 198
The row-wise and column-wise vignettes are pretty good. The row-wise one actually discusses how group_by columns are subset.
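Putting the pieces above together, here is a minimal sketch of the grouped summary the question asks for, assuming dplyr 1.0 or later. Because across() only sees the non-grouping columns inside a grouped summarise(), everything() already covers "the third column to the end" without needing length(.). The .groups = "drop" argument and the name res are additions for illustration, not part of the original answer.

```r
library(dplyr)

# Same data as in the question
df <- tibble(
  Date = c(rep(as.Date("2020-01-01"), 3), rep(as.Date("2020-02-01"), 2)),
  Type = c("A", "A", "B", "C", "C"),
  col1 = 1:5,
  col2 = c(0, 8, 0, 3, 0),
  col3 = 25:29,
  colX = rep(99, 5)
)

# Grouping columns are excluded from the selection automatically,
# so everything() means "every non-grouping column"
res <- df %>%
  group_by(Date, Type) %>%
  summarise(across(everything(), sum), .groups = "drop")

res
```

If some of the remaining columns are not numeric, a predicate such as where(is.numeric) in place of everything() should restrict the sum to the numeric ones.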

Related

R script not rounding as intended when running in Terminal [duplicate]

I am having trouble getting the desired number of decimal places from summarise. Here is a simple example:
test2 <- data.frame(c("a","a","b","b"), c(245,246,247,248))
library(dplyr)
colnames(test2) <- c("V1","V2")
group_by(test2,V1) %>% summarise(mean(V2))
The dataframe is:
V1 V2
1 a 245
2 a 246
3 b 247
4 b 248
The output is:
V1 `mean(V2)`
<fctr> <dbl>
1 a 246
2 b 248
I would like it to give me the means including the decimal place (i.e. 245.5 and 247.5)
Because you are using dplyr tools, the resulting output is actually a tibble, which by default prints numbers with 3 significant digits (see option pillar.sigfig). This is not the same as the number of digits after the decimal point. To see the latter, simply convert the result to a data.frame with as.data.frame().
Note that tibble's concept of significant digits is somewhat complicated: it does not indicate how many digits are shown after the decimal point, but the minimum number of digits necessary to represent the number to a given accuracy (99.9%, I think).
This means the number of digits printed depends on the "size" of your number:
library(tibble)
packageVersion("tibble")
#> [1] '2.1.3'
packageVersion("pillar")
#> [1] '1.4.2'
tab <- tibble(x = c(0.1234, 1.1234, 10.1234, 100.1234, 1000.1234))
options(pillar.sigfig=3)
tab
#> # A tibble: 5 x 1
#> x
#> <dbl>
#> 1 0.123
#> 2 1.12
#> 3 10.1
#> 4 100.
#> 5 1000.
options(pillar.sigfig=4)
tab
#> # A tibble: 5 x 1
#> x
#> <dbl>
#> 1 0.1234
#> 2 1.123
#> 3 10.12
#> 4 100.1
#> 5 1000.
as.data.frame(tab)
#> x
#> 1 0.1234
#> 2 1.1234
#> 3 10.1234
#> 4 100.1234
#> 5 1000.1234
Created on 2019-08-21 by the reprex package (v0.3.0)
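To make the point concrete for the data in the question, here is a small sketch showing that only the printed display is rounded while the stored values keep full precision. The column name m and the object name res are hypothetical names chosen for this illustration.

```r
library(dplyr)

test2 <- data.frame(V1 = c("a", "a", "b", "b"),
                    V2 = c(245, 246, 247, 248))

res <- test2 %>%
  group_by(V1) %>%
  summarise(m = mean(V2))

# Pull the column out of the tibble: the values are not rounded
res$m

# Temporarily widen the tibble display, then restore the previous setting
old <- options(pillar.sigfig = 4)
print(res)
options(old)
```

So no rounding needs to be undone; the choice is only about how the result is displayed.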
Here is one solution:
test2 <- data.frame(c("a", "a", "b", "b"), c(245, 246, 247, 248))
library(dplyr)
colnames(test2) <- c("V1", "V2")
group_by(test2, V1) %>%
dplyr::summarise(mean(V2)) %>%
dplyr::mutate_if(is.numeric, format, nsmall = 1)
#> # A tibble: 2 x 2
#> V1 `mean(V2)`
#> <fct> <chr>
#> 1 a 245.5
#> 2 b 247.5
Created on 2018-01-20 by the reprex package (v0.1.1.9000).
EDIT:
If you want to keep it as numeric:
test2 <- data.frame(c("a", "a", "b", "b"), c(245, 246, 247, 248))
library(dplyr)
colnames(test2) <- c("V1", "V2")
group_by(test2, V1) %>%
dplyr::summarise(mean(V2)) %>%
as.data.frame(.) %>%
dplyr::mutate_if(is.numeric, round, 1)
Gives
V1 mean(V2)
1 a 245.5
2 b 247.5
And with another example (from @Matifou):
tab <- tibble(x = c(0.1234, 1.1234, 10.1234, 100.1234, 1000.1234))
tab %>%
as.data.frame(.) %>%
dplyr::mutate_if(is.numeric, round, 2)
Gives:
x
1 0.12
2 1.12
3 10.12
4 100.12
5 1000.12
I think the simplest solution is the following:
test2 <- data.frame(c("a","a","b","b"), c(245,246,247,248))
library(dplyr)
colnames(test2) <- c("V1","V2")
group_by(test2,V1) %>% summarise(`mean(V2)` = sprintf("%0.1f",mean(V2)))
# A tibble: 2 x 2
V1 `mean(V2)`
<fct> <chr>
1 a 245.5
2 b 247.5

Iterating over listed data frames within a piped purrr anonymous function call

Using purrr::map and the magrittr pipe, I am trying to generate a new column with values equal to a substring of the name of an existing column.
I can illustrate what I'm trying to do with the following toy dataset:
library(tidyverse)
library(purrr)
test <- list(tibble(geoid_1970 = c(123, 456),
name_1970 = c("here", "there"),
pop_1970 = c(1, 2)),
tibble(geoid_1980 = c(234, 567),
name_1980 = c("here", "there"),
pop_1970 = c(3, 4))
)
Within each listed data frame, I want a column equal to the relevant year. Without iterating, the code I have is:
data <- map(test, ~ .x %>% mutate(year = as.integer(str_sub(names(test[[1]][1]), -4))))
Of course, this returns a year of 1970 in both listed data frames, which I don't want. (I want 1970 in the first and 1980 in the second.)
In addition, it's not piped, and my attempt to pipe it throws an error:
data <- test %>% map(~ .x %>% mutate(year = as.integer(str_sub(names(.x[[1]][1]), -4))))
# > Error: Problem with `mutate()` input `year`.
# > x Input `year` can't be recycled to size 2.
# > ℹ Input `year` is `as.integer(str_sub(names(.x[[1]][1]), -4))`.
# > ℹ Input `year` must be size 2 or 1, not 0.
How can I iterate over each listed data frame using the pipe?
Try:
test %>% map(~.x %>% mutate(year = as.integer(str_sub(names(.x[1]), -4))))
[[1]]
# A tibble: 2 x 4
geoid_1970 name_1970 pop_1970 year
<dbl> <chr> <dbl> <int>
1 123 here 1 1970
2 456 there 2 1970
[[2]]
# A tibble: 2 x 4
geoid_1980 name_1980 pop_1970 year
<dbl> <chr> <dbl> <int>
1 234 here 3 1980
2 567 there 4 1980
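The reason this works while the attempt in the question fails comes down to `[[` versus `[`. The sketch below uses a plain data.frame with made-up object names (x, year) to keep it dependency-free; a tibble behaves the same way for these two calls.

```r
x <- data.frame(geoid_1970 = c(123, 456),
                name_1970  = c("here", "there"))

# x[[1]] extracts the column as a bare numeric vector, which has no names,
# so names() returns NULL and the year ends up with length 0
names(x[[1]][1])

# x[1] keeps the data frame structure, so the column name survives
names(x[1])

# Take the last four characters of that name and convert to integer
nm <- names(x[1])
year <- as.integer(substring(nm, nchar(nm) - 3))
year
```

A length-0 result is what triggers the "must be size 2 or 1, not 0" message from mutate().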
We can get the 'year' with parse_number
library(dplyr)
library(purrr)
map(test, ~ .x %>%
mutate(year = readr::parse_number(names(.)[1])))
-output
#[[1]]
# A tibble: 2 x 4
# geoid_1970 name_1970 pop_1970 year
# <dbl> <chr> <dbl> <dbl>
#1 123 here 1 1970
#2 456 there 2 1970
#[[2]]
# A tibble: 2 x 4
# geoid_1980 name_1980 pop_1970 year
# <dbl> <chr> <dbl> <dbl>
#1 234 here 3 1980
#2 567 there 4 1980

Grouping based on two variables, including their individual combinations (e.g. A - B is the same as B - A)

I got stuck on a coding problem at work. I have a data frame with three variables: var1, var2 and Length. The latter is the mutual length between var1 and var2, e.g. a shared boundary.
Ultimately I want to calculate the percentage that each combination of var1 - var2 (var2 - var1 is regarded as identical) contributes to the total length of each unique element in either var1 or var2. Because this sounds complicated, I have made some examples to show where I am stuck.
library(tidyverse)
df <- tibble(
var1 = c("A","B","A","D","A"),
var2 = c("B","A","D","A","B"),
Length = c(10,12,5,20,34))
# First I wanted the total length of each variable, irrespective of whether it occurs in var1 or var2
# I think that I figured this out. Let me know if it's a bit convoluted
var_unique <- unique(c(unique(df$var1),unique(df$var2)))
names(var_unique) <- var_unique
total_length <- map_df(var_unique, function(x){
df %>%
filter( var1 == x | var2 == x )%>%
summarise(var_total_length = sum(Length))
},.id = "var" )
total_length
#> # A tibble: 3 x 2
#> var var_total_length
#> <chr> <dbl>
#> 1 A 81
#> 2 B 56
#> 3 D 25
# Second, I need the length of each combination of var1 and var2.
# I would like "A" - "B" to be treated the same as "B" - "A"
# Grouping does not work in this case. This is where I am stuck
# Neither this
df %>% group_by(var1,var2) %>%
mutate(combination_length = sum(Length))
#> # A tibble: 5 x 4
#> # Groups: var1, var2 [4]
#> var1 var2 Length combination_length
#> <chr> <chr> <dbl> <dbl>
#> 1 A B 10 44
#> 2 B A 12 12
#> 3 A D 5 5
#> 4 D A 20 20
#> 5 A B 34 44
# nor this one does the job, because each looks at the ordered combinations of var1 and var2.
df %>% group_by(var1,var2) %>%
summarise(combination_length = sum(Length))
#> # A tibble: 4 x 3
#> # Groups: var1 [3]
#> var1 var2 combination_length
#> <chr> <chr> <dbl>
#> 1 A B 44
#> 2 A D 5
#> 3 B A 12
#> 4 D A 20
# this is the dataframe that I would like. Rows 1,2 and 5 of df should be considered the
# same group
tibble(
var1 = c("A","B","A","D","A"),
var2 = c("B","A","D","A","B"),
Length = c(10,12,5,20,34),
combination_length = c(56,56,25,25,56))
#> # A tibble: 5 x 4
#> var1 var2 Length combination_length
#> <chr> <chr> <dbl> <dbl>
#> 1 A B 10 56
#> 2 B A 12 56
#> 3 A D 5 25
#> 4 D A 20 25
#> 5 A B 34 56
# Ultimately I want to divide each combination by the total length of the variable
# occurring in the combination to obtain the percentage of each boundary for each unique variable
Created on 2019-11-27 by the reprex package (v0.3.0)
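I assume there are simpler ways to do this than what I am attempting.
We can group on an order-independent version of the pair: pmin(var1, var2) always returns the alphabetically smaller value and pmax(var1, var2) the larger one, so A - B and B - A end up in the same group.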
library(dplyr)
df %>%
group_by(group1 = pmin(var1, var2), group2 = pmax(var1, var2)) %>%
mutate(combination_length = sum(Length)) %>%
ungroup() %>%
select(-group1, -group2)
# var1 var2 Length combination_length
# <chr> <chr> <dbl> <dbl>
#1 A B 10 56
#2 B A 12 56
#3 A D 5 25
#4 D A 20 25
#5 A B 34 56
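To see why pmin() and pmax() produce an order-independent key, here is a small base R sketch on the same values. The names lo, hi and key are hypothetical helpers for this illustration, and ave() stands in for the grouped mutate().

```r
var1   <- c("A", "B", "A", "D", "A")
var2   <- c("B", "A", "D", "A", "B")
Length <- c(10, 12, 5, 20, 34)

# Element-wise minimum and maximum also work on character vectors
lo <- pmin(var1, var2)
hi <- pmax(var1, var2)

# "A-B" for both A - B and B - A, "A-D" for both A - D and D - A
key <- paste(lo, hi, sep = "-")

# Sum Length within each key and return a vector aligned with the input rows
combination_length <- ave(Length, key, FUN = sum)
combination_length
```

Character comparison follows the session's collation order, which should not matter for simple labels like these but may for mixed-case or accented names.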
Here is a solution for base R, where split() is used and it is assumed that df is a data frame, i.e.,
df <- data.frame(
var1 = c("A","B","A","D","A"),
var2 = c("B","A","D","A","B"),
Length = c(10,12,5,20,34))
then, using the following code
sp <- data.frame(t(apply(df[1:2], 1, sort)))
v <- split(df,sp)
res <- unsplit(lapply(v, function(x) data.frame(x,combination_length = sum(x[3]))),sp)
gives
> res
var1 var2 Length combination_length
1 A B 10 56
2 B A 12 56
3 A D 5 25
4 D A 20 25
5 A B 34 56

Replace infinite values in an R data frame [why doesn't `is.infinite()` behave like `is.na()`]

library(tidyverse)
df <- tibble(col1 = c("A", "B", "C"),
col2 = c(NA, Inf, 5))
#> # A tibble: 3 x 2
#> col1 col2
#> <chr> <dbl>
#> 1 A NA
#> 2 B Inf
#> 3 C 5
I can use the base R is.na() function to easily replace NAs with 0s, shown below:
df %>% replace(is.na(.), 0)
#> # A tibble: 3 x 2
#> col1 col2
#> <chr> <dbl>
#> 1 A 0
#> 2 B Inf
#> 3 C 5
If I try to duplicate this logic with is.infinite() things break:
df %>% replace(is.infinite(.), 1)
#> Error in is.infinite(.) : default method not implemented for type 'list'
Looking at this older answer about Inf and R data frames I can hack together the solution shown below. This takes my original data frame and turns all NA into 0 and all Inf into 1. Why doesn't is.infinite() behave like is.na() and what is (perhaps) a better way to do what I want?
df %>%
replace(is.na(.), 0) %>%
mutate_if(is.numeric, list(~na_if(abs(.), Inf))) %>% # line 3
replace(is.na(.), 1)
#> # A tibble: 3 x 2
#> col1 col2
#> <chr> <dbl>
#> 1 A 0
#> 2 B 1
#> 3 C 5
is.infinite() expects its input x to be an atomic vector. According to ?is.infinite:
x: object to be tested: the default methods handle atomic vectors.
whereas is.na() can also take a matrix or data.frame as input. According to ?is.na:
x: an R object to be tested: the default method for is.na and anyNA handle atomic vectors, lists, pairlists, and NULL.
The registered methods show the same difference (the exact list depends on which packages are loaded):
methods('is.na')
#[1] is.na.data.frame is.na.data.table* is.na.numeric_version is.na.POSIXlt is.na.raster* is.na.vctrs_vctr*
methods('is.infinite') # only for vectors
#[1] is.infinite.vctrs_vctr*
We can instead apply the replacement column by column, on the numeric columns only:
library(dplyr)
library(tidyr) # for replace_na()
df %>%
mutate_if(is.numeric, ~ replace_na(., 0) %>%
replace(., is.infinite(.), 1))
# A tibble: 3 x 2
# col1 col2
# <chr> <dbl>
#1 A 0
#2 B 1
#3 C 5
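The same column-by-column idea can be sketched in base R without any packages, which also makes the difference between the two functions visible. The helper name num is an assumption for this illustration.

```r
df <- data.frame(col1 = c("A", "B", "C"),
                 col2 = c(NA, Inf, 5))

# is.na() has a data.frame method, so this returns a logical matrix
is.na(df)

# is.infinite() has none, so it has to be applied to each column separately
num <- vapply(df, is.numeric, logical(1))
df[num] <- lapply(df[num], function(x) {
  x[is.na(x)] <- 0        # NA  -> 0
  x[is.infinite(x)] <- 1  # Inf and -Inf -> 1
  x
})

df
```

Note that is.infinite() is TRUE for both Inf and -Inf, so negative infinities are replaced by 1 as well.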

Mass changing columns of a data set to numeric

I've imported an Excel data set and want to convert nearly all columns (more than 90) to numeric, since they are initially read in as character. What is the best way to achieve this? Changing each column to numeric one by one isn't the most efficient approach.
This should do as you wish:
# Random data frame for illustration (100 columns wide)
df <- data.frame(replicate(100, sample(0:1, 1000, replace = TRUE)))
# Check column names and positions (just in case you want to verify them)
colnames(df)
# Specify columns; seq_along(df) keeps working if you add more columns at a later date
cols <- seq_along(df)
# Or, if you only want to specify particular column numbers:
# cols <- c(1:100)
# With the help of magrittr's compound assignment pipe (%<>%), change all to numeric
library(magrittr)
df[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))
# Check our columns are numeric
str(df)
Assuming your data is already imported with all character columns, you can convert the relevant columns to numeric using mutate_at by position or name:
suppressPackageStartupMessages(library(tidyverse))
# Assume the imported excel file has 5 columns a to e
df <- tibble(a = as.character(1:3),
b = as.character(5:7),
c = as.character(8:10),
d = as.character(2:4),
e = as.character(2:4))
# select the columns by position (convert all except 'b')
df %>% mutate_at(c(1, 3:5), as.numeric)
#> # A tibble: 3 x 5
#> a b c d e
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 5 8 2 2
#> 2 2 6 9 3 3
#> 3 3 7 10 4 4
# or drop the columns that shouldn't be used ('b' and 'd' should stay as chr)
df %>% mutate_at(-c(2, 4), as.numeric)
#> # A tibble: 3 x 5
#> a b c d e
#> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 1 5 8 2 2
#> 2 2 6 9 3 3
#> 3 3 7 10 4 4
# select the columns by name
df %>% mutate_at(c("a", "c", "d", "e"), as.numeric)
#> # A tibble: 3 x 5
#> a b c d e
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 5 8 2 2
#> 2 2 6 9 3 3
#> 3 3 7 10 4 4
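On dplyr 1.0 or later, the same conversions can be written with across(), which supersedes the mutate_at() family. This is a sketch under that version assumption, using a shortened version of the example data; the name out is hypothetical.

```r
library(dplyr)

df <- tibble(a = as.character(1:3),
             b = as.character(5:7),
             c = as.character(8:10))

# Convert every column except 'b'; selection by position or by name also works,
# e.g. across(c(1, 3), as.numeric) or across(c(a, c), as.numeric)
out <- df %>% mutate(across(-b, as.numeric))

str(out)
```

Values that cannot be parsed as numbers become NA with a warning, so it may be worth checking for unexpected NAs after the conversion.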
