Finding the mode using Dpylr - r

I have some code written using the dplyr package. I want to calculate the mode. Currently I get results back with a column which says "Character" all the way down. The mode will be the most reoccurring value, which in my case could be a letter, number of a symbol.
eth.data<-data.comb %>%
group_by(Ethnicity, `Qualification Title`, `Qualification Number`, `OutGrade`)%>%
summarise(`Number of Learners`=n(), `Mode` = mode(`OutGrade`)) %>%
group_by(`Qualification Number`)%>%
mutate(`Total Number of Learners`= sum(`Number of Learners`)) %>%
arrange(`Total Number of Learners`)

Take a look at ?mode. mode tells you the storage mode of an object (e.g. "character" for character vectors). If you want the statistical mode, write your own function, see this question.
Also, if you group_by OutGrade, then you will have precisely 1 unique OutGrade in the summarise function, so don't do that.
Let us set up an example (which you should do when you are asking a question!).
df <- data.frame(group=rep(LETTERS[1:5], each=20),
grade=sample(letters[1:15], 100, replace=T))
mymode <- function(x) {
t <- table(x)
names(t)[ which.max(t) ]
}
df %>% group_by(group) %>% summarise(mode=mymode(grade))
The result is what you want:
# A tibble: 5 x 2
group mode
<chr> <chr>
1 A l
2 B f
3 C g
4 D g
5 E c
Note that if you did group_by(group, grade), the summarise function would be called for each combination of group and grade, so the results would have been very different:
# A tibble: 55 x 3
# Groups: group [5]
group grade mode
<chr> <chr> <chr>
1 A a a
2 A b b
3 A f f
4 A h h
5 A i i
6 A k k
7 A l l
8 A m m
9 A n n
10 B a a
# … with 45 more rows

Related

Matching old and new column names in R [duplicate]

I have a tibble which has column names containing spaces & special characters which make it a hassle to work with. I want to change these column names to easier to use names while I'm working with the data, and then change them back to the original names at the end for display. Ideally, I want to be able to do this as part of a pipe, however I haven't figured out how to do it with rename_with().
Sample data:
df <- tibble(oldname1 = seq(1:10),
oldname2 = letters[seq(1:10)],
oldname3 = LETTERS[seq(1:10)])
cols_lookup <- tibble(old_names = c("oldname4", "oldname2", "oldname1"),
new_names = c("newname4", "newname2", "newname1"))
Desired output:
> head(df_renamed)
# A tibble: 6 x 3
newname1 newname2 oldname3
<int> <chr> <chr>
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
Some columns are removed & reordered during this work so when converting them back there will be entries in the cols_lookup table which are no longer in df. There are also new columns created in df which I want to remain named the same.
I am aware there are similar questions which have already been asked, however the answers either don't work well with tibbles or in a pipe (eg. those using match()), or don't work if the columns aren't all present in the same order in both tables.
We can use rename_at. From the master lookup table, filter the rows where the names of dataset have a match (filtered_lookup), then use that in rename_at where we specify the 'old_names' in vars and replace with the 'new_names'
library(dplyr)
filtered_lookup <- cols_lookup %>%
filter(old_names %in% names(df))
df %>%
rename_at(vars(filtered_lookup$old_names), ~ filtered_lookup$new_names)
Or using rename_with, use the same logic
df %>%
rename_with(.fn = ~filtered_lookup$new_names, .cols = filtered_lookup$old_names)
Or another option is rename with splicing (!!!) from a named vector
library(tibble)
df %>%
rename(!!! deframe(filtered_lookup[2:1]))
You can use rename_ with setnames
cols_lookup <- tibble(old_names = c("oldname3", "oldname2", "oldname1"),
new_names = c("newname3", "newname2", "newname1"))
df
rename_(df, .dots=setNames(cols_lookup$old_names, cols_lookup$new_names))
Output:
# A tibble: 10 x 3
newname1 newname2 newname3
<int> <chr> <chr>
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J

Rename columns using values from separate dataframe

I have a tibble which has column names containing spaces & special characters which make it a hassle to work with. I want to change these column names to easier to use names while I'm working with the data, and then change them back to the original names at the end for display. Ideally, I want to be able to do this as part of a pipe, however I haven't figured out how to do it with rename_with().
Sample data:
df <- tibble(oldname1 = seq(1:10),
oldname2 = letters[seq(1:10)],
oldname3 = LETTERS[seq(1:10)])
cols_lookup <- tibble(old_names = c("oldname4", "oldname2", "oldname1"),
new_names = c("newname4", "newname2", "newname1"))
Desired output:
> head(df_renamed)
# A tibble: 6 x 3
newname1 newname2 oldname3
<int> <chr> <chr>
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
Some columns are removed & reordered during this work so when converting them back there will be entries in the cols_lookup table which are no longer in df. There are also new columns created in df which I want to remain named the same.
I am aware there are similar questions which have already been asked, however the answers either don't work well with tibbles or in a pipe (eg. those using match()), or don't work if the columns aren't all present in the same order in both tables.
We can use rename_at. From the master lookup table, filter the rows where the names of dataset have a match (filtered_lookup), then use that in rename_at where we specify the 'old_names' in vars and replace with the 'new_names'
library(dplyr)
filtered_lookup <- cols_lookup %>%
filter(old_names %in% names(df))
df %>%
rename_at(vars(filtered_lookup$old_names), ~ filtered_lookup$new_names)
Or using rename_with, use the same logic
df %>%
rename_with(.fn = ~filtered_lookup$new_names, .cols = filtered_lookup$old_names)
Or another option is rename with splicing (!!!) from a named vector
library(tibble)
df %>%
rename(!!! deframe(filtered_lookup[2:1]))
You can use rename_ with setnames
cols_lookup <- tibble(old_names = c("oldname3", "oldname2", "oldname1"),
new_names = c("newname3", "newname2", "newname1"))
df
rename_(df, .dots=setNames(cols_lookup$old_names, cols_lookup$new_names))
Output:
# A tibble: 10 x 3
newname1 newname2 newname3
<int> <chr> <chr>
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J

How to apply functions sequentially with purrr and pipes

I am struggling with the purrr package.
I am trying to apply the function is.factor to a data frame, and then fct_count on those columns that are factors.
I have tried some variations of modify_if, and summarise_if. I guess I am using incorrectly the dots (.) when calling for the previous object.
(A guide about purrr, and dots would be really beneficial if you have a link).
For example,
df <- data.frame(f1 = c("men", "woman", "men", "men"),
f2 = c("high", "low", "low", "low"),
n1 = c(1, 3, 3, 6))
Then
map(df, is.factor)
If I use
map_if(df, is.factor, forcats::fct_count)
I got results for every variable, instead of only for the factors.
I think it is a pretty simple problem, and with a bit of understanding of the dots (.) can be solved.
Thanks in advance
:)
Issue is that map_if returns the unmodified columns as well. Hence, when the OP tries the code (repeating the same code as in the OP just to show)
map_if(df, is.factor, forcats::fct_count)
#$f1
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
#$n1
#[1] 1 3 3 6 ### it is the same column value unchanged
Here, we can specify the .else and discard the NULL elements. So, if we specify the other columns to return NULL and then use discard the NULL elements, it would be a list of factor counts.
library(tidyverse)
map_if(df, is.factor, forcats::fct_count, .else = ~ NULL) %>%
discard(is.null)
#$f1
## A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
Or another option is summarise_if and place the output in a list
df %>%
summarise_if(is.factor, list(~ list(fct_count(.)))) %>%
unclass
Or another option would be to gather into 'long' format and then count once
gather(df, key, val, f1:f2) %>%
dplyr::count(key, val)
Or this can be done with lapply from base R
lapply(df[sapply(df, is.factor)], fct_count)
Or using only base R
lapply(df[sapply(df, is.factor)], table)
Or the results can be represented in a different way
table(names(df)[1:2][col(df[1:2])], unlist(df[1:2]))
The issue with map_if/modify_if is it applies the function to only the columns which satisfy the predicate function and rest of them are returned as it is.
Hence, when you try
library(tidyverse)
map_if(df, is.factor, forcats::fct_count)
#$f1
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3
#$n1
#[1] 1 3 3 6
fct_count is applied to columns f1 and f2 which are factors and column n1 is returned as it is. If you want to get only factor columns in the output one way would be to select them first and then apply the function
df %>%
select_if(is.factor) %>%
map(forcats::fct_count)
#$f1
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 men 3
#2 woman 1
#$f2
# A tibble: 2 x 2
# f n
# <fct> <int>
#1 high 1
#2 low 3

Add multiple output variables using purrr and a predefined function

Take this simple dataset and function (representative of more complex problems):
x <- data.frame(a = 1:3, b = 2:4)
mult <- function(a,b,n) (a + b) * n
Using base R's Map I could do this to add 2 new columns in a vectorised fashion:
ns <- 1:2
x[paste0("new",seq_along(ns))] <- Map(mult, x["a"], x["b"], n=ns)
x
# a b new1 new2
#1 1 2 3 6
#2 2 3 5 10
#3 3 4 7 14
purrr attempt via pmap gets close with a list output:
library(purrr)
library(dplyr)
x %>% select(a,b) %>% pmap(mult, n=1:2)
#[[1]]
#[1] 3 6
#
#[[2]]
#[1] 5 10
#
#[[3]]
#[1] 7 14
My attempts from here with pmap_dfr etc all seem to error out in trying to map this back to new columns.
How do I end up making 2 further variables which match my current "new1"/"new2"? I'm sure there is a simple incantation, but I'm clearly overlooking it or using the wrong *map* function.
There is some useful discussion here - How to use map from purrr with dplyr::mutate to create multiple new columns based on column pairs - but it seems overly hacky and inflexible for what I imagined was a simple problem.
The best approach I've found (which is still not terribly elegant) is to pipe into bind_cols. To get pmap_dfr to work correctly, the function should return a named list (which may or may not be a data frame):
library(tidyverse)
x <- data.frame(a = 1:3, b = 2:4)
mult <- function(a,b,n) as.list(set_names((a + b) * n, paste0('new', n)))
x %>% bind_cols(pmap_dfr(., mult, n = 1:2))
#> a b new1 new2
#> 1 1 2 3 6
#> 2 2 3 5 10
#> 3 3 4 7 14
To avoid changing the definition of mult, you can wrap it in an anonymous function:
mult <- function(a,b,n) (a + b) * n
x %>% bind_cols(pmap_dfr(
.,
~as.list(set_names(
mult(...),
paste0('new', 1:2)
)),
n = 1:2
))
#> a b new1 new2
#> 1 1 2 3 6
#> 2 2 3 5 10
#> 3 3 4 7 14
In this particular case, it's not actually necessary to iterate over rows, though, because you can vectorize the inputs from x and instead iterate over n. The advantage is that usually n > p, so the number of iterations will be [potentially much] lower. To be clear, whether such an approach is possible depends on for which parameters the function can accept vector arguments.
mult still needs to be called on the variables of x. The simplest way to do this is to pass them explicitly:
x %>% bind_cols(map_dfc(1:2, ~mult(x$a, x$b, .x)))
#> a b V1 V2
#> 1 1 2 3 6
#> 2 2 3 5 10
#> 3 3 4 7 14
...but this loses the benefit of pmap that named variables will automatically get passed to the correct parameter. You can get that back by using purrr::lift, which is an adverb that changes the domain of a function so it accepts a list by wrapping it in do.call. The returned function can be called on x and the value of n for that iteration:
x %>% bind_cols(map_dfc(1:2, ~lift(mult)(x, n = .x)))
This is equivalent to
x %>% bind_cols(map_dfc(1:2, ~invoke(mult, x, n = .x)))
but the advantage of the former is that it returns a function which can be partially applied on x so it only has an n parameter left, and thus requires no explicit references to x and so pipes better:
x %>% bind_cols(map_dfc(1:2, partial(lift(mult), .)))
All return the same thing. Names can be fixed after the fact with %>% set_names(~sub('^V(\\d+)$', 'new\\1', .x)), if you like.
Here is one possibility.
library(purrr)
library(dplyr)
n <- 1:2
x %>%
mutate(val = pmap(., mult, n = n)) %>%
unnest() %>%
mutate(var = rep(paste0("new", n), nrow(.) / length(n))) %>%
spread(var, val)
# a b new1 new2
#1 1 2 3 6
#2 2 3 5 10
#3 3 4 7 14
Not pretty, so I'm also curious to see alternatives. A lot of excess comes about from unnesting the list column and spreading into new columns.
Here is another possibility using pmap_dfc plus an ugly as.data.frame(t(...)) call
bind_cols(x, as.data.frame(t(pmap_dfc(x, mult, n = n))))
# a b V1 V2
#1 1 2 3 6
#2 2 3 5 10
#3 3 4 7 14
Sample data
x <- data.frame(a = 1:3, b = 2:4)
mult <- function(a,b,n) (a + b) * n
To mimic the input format for Map, we could call pmap from purrr in this way:
x[paste0("new",seq_along(ns))] <- pmap(list(x['a'], x['b'], ns), mult)
To fit this in a pipe:
x %>%
{list(.['a'], .['b'], ns)} %>%
pmap(mult) %>%
setNames(paste0('new', seq_along(ns))) %>%
cbind(x)
# new1 new2 a b
# 1 3 6 1 2
# 2 5 10 2 3
# 3 7 14 3 4
Apparently, this looks ugly compared to the concise base R code. But I could not think of a better way.

Use regular expressions to detect range of alphanumeric string

I need help with regular expressions to do the following. I have a list of study subjects, named such as:
subject <- c('x-010', 'x-011', 'x-012', 'x-013', 'x-014', 'x-015', 'x-016', 'x-017', 'x-018', 'x-019', 'x-020', 'x-021', 'x-022', 'x-023', 'x-024', 'x-025', 'x-026', 'x-027', 'x-028', 'x-029', 'x-030')
df <- data.frame(subject)
I want to add a column to classify the subjects by group according to their number, such that 1 - 10 are in Group A, 11 - 20 are in Group B, 21 - 30 are in Group C, and so on. I don't know how to do this using regular expressions, only to start with:
df <- data.frame(subject) %>%
mutate(case_when(group = str_detect(subject,
but need to understand how to describe this pattern.
We can extract the numeric part and create the group with %/%
library(tidyverse)
df %>%
group_by(grp = paste0("Group ", LETTERS[(as.numeric(str_extract(subject,
"[0-9]+"))-1) %/% 10 + 1]))
# A tibble: 21 x 2
# Groups: grp [3]
# subject grp
# <fct> <chr>
# 1 x-010 Group A
# 2 x-011 Group B
# 3 x-012 Group B
# 4 x-013 Group B
# 5 x-014 Group B
# 6 x-015 Group B
# 7 x-016 Group B
# 8 x-017 Group B
# 9 x-018 Group B
#10 x-019 Group B
# ... with 11 more rows

Resources