I am attempting to create a multi-layered cross tab in R. Currently, when using this code:
NewMexico_DEM_xtab_ <- NewMexico_DEM_Voterfile %>%
group_by(Sex, CountyName) %>%
tally() %>%
spread(Sex, n)
I receive this output:
My goal is to add a layer for age using the Age column and for R to output a tab like this:
Is there a way I can do this with my current code or a package that would make this easier?
Do either of these approaches solve your problem?
library(tidyverse)
# Create sample data
iris_df <- iris
iris_df$Sample <- sample(c("M","F"), 150, replace = TRUE)
# crosstabs
iris_df %>%
group_by(Species, Sample) %>%
tally() %>%
spread(Sample, n)
#> # A tibble: 3 × 3
#> # Groups: Species [3]
#> Species F M
#> <fct> <int> <int>
#> 1 setosa 26 24
#> 2 versicolor 25 25
#> 3 virginica 27 23
# Add in 'Age'
iris_df$Age <- sample(c("18-24", "25-35", "36-45", "45+"), 150, replace = TRUE)
# crosstabs
iris_df %>%
group_by(Species, Sample, Age) %>%
tally() %>%
spread(Age, n)
#> # A tibble: 6 × 6
#> # Groups: Species, Sample [6]
#> Species Sample `18-24` `25-35` `36-45` `45+`
#> <fct> <chr> <int> <int> <int> <int>
#> 1 setosa F 2 4 14 6
#> 2 setosa M 11 4 5 4
#> 3 versicolor F 3 8 8 6
#> 4 versicolor M 5 8 2 10
#> 5 virginica F 5 8 7 7
#> 6 virginica M 6 10 3 4
# Using janitor::tabyl()
library(janitor)
#>
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>
#> chisq.test, fisher.test
iris_df %>%
tabyl(Species, Sample, Age)
#> $`18-24`
#> Species F M
#> setosa 2 11
#> versicolor 3 5
#> virginica 5 6
#>
#> $`25-35`
#> Species F M
#> setosa 4 4
#> versicolor 8 8
#> virginica 8 10
#>
#> $`36-45`
#> Species F M
#> setosa 14 5
#> versicolor 8 2
#> virginica 7 3
#>
#> $`45+`
#> Species F M
#> setosa 6 4
#> versicolor 6 10
#> virginica 7 4
Created on 2022-08-24 by the reprex package (v2.0.1)
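If sex and age should both appear as column layers in one wide table, another option is tidyr::pivot_wider(), which supersedes spread() and accepts several names_from columns. The sketch below uses made-up data: the voters data frame and its values are invented for illustration and only mirror the column names from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the voter file; only the column names follow the question
set.seed(42)
voters <- tibble(
  CountyName = sample(c("Bernalillo", "Santa Fe", "Taos"), 200, replace = TRUE),
  Sex        = sample(c("F", "M"), 200, replace = TRUE),
  Age        = sample(c("18-24", "25-35", "36-45", "45+"), 200, replace = TRUE)
)

xtab <- voters %>%
  count(CountyName, Sex, Age) %>%   # one row per county/sex/age combination
  pivot_wider(
    names_from  = c(Sex, Age),      # column names such as `F_18-24`, `M_45+`
    values_from = n,
    values_fill = 0                 # combinations with no rows become 0
  )

xtab
```

Each county stays on one row, and every Sex/Age pair gets its own count column.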
Please have a look at the snippet at the end of the post.
I am essentially looking for a cleaner way to obtain the same result.
I have a tibble where the x column is a character vector (I did not translate it into a factor, but this is actually what it is).
Each factor appears multiple times and it always has an associated numerical value (the w column in the tibble).
I would like to keep the top 4 factors according to the sum of their associated w values and change everything else into an "other" factor.
I achieve it below, but I wonder if there is a smarter way to do the same using e.g. forcats.
Any suggestion is appreciated
library(tidyverse)
df <- tibble(x=rep(letters[1:10], 10), w=seq(100))
df
#> # A tibble: 100 × 2
#> x w
#> <chr> <int>
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e 5
#> 6 f 6
#> 7 g 7
#> 8 h 8
#> 9 i 9
#> 10 j 10
#> # … with 90 more rows
###detect the first 4 factors based on the w column
ff <- df |>
group_by(x) |>
summarise(w_tot=sum(w)) |>
ungroup() |>
arrange(desc(w_tot)) |>
slice(1:4) |>
pull(x)
ff
#> [1] "j" "i" "h" "g"
## recode the data
df_new <- df |>
mutate(w=if_else(x %in% ff, x, "other"))
df_new
#> # A tibble: 100 × 2
#> x w
#> <chr> <chr>
#> 1 a other
#> 2 b other
#> 3 c other
#> 4 d other
#> 5 e other
#> 6 f other
#> 7 g g
#> 8 h h
#> 9 i i
#> 10 j j
#> # … with 90 more rows
Created on 2022-09-16 by the reprex package (v2.0.1)
It appears that I can pass a weight argument to fct_lump_n(), so this works:
library(tidyverse)
library(forcats)
df <- tibble(x=rep(letters[1:10], 10), w=seq(100))
df
#> # A tibble: 100 × 2
#> x w
#> <chr> <int>
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e 5
#> 6 f 6
#> 7 g 7
#> 8 h 8
#> 9 i 9
#> 10 j 10
#> # … with 90 more rows
###detect the first 4 factors based on the w column
ff <- df |>
group_by(x) |>
summarise(w_tot=sum(w)) |>
ungroup() |>
arrange(desc(w_tot)) |>
slice(1:4) |>
pull(x)
ff
#> [1] "j" "i" "h" "g"
## recode the data
df_new <- df |>
mutate(w=if_else(x %in% ff, x, "other"))
df_new
#> # A tibble: 100 × 2
#> x w
#> <chr> <chr>
#> 1 a other
#> 2 b other
#> 3 c other
#> 4 d other
#> 5 e other
#> 6 f other
#> 7 g g
#> 8 h h
#> 9 i i
#> 10 j j
#> # … with 90 more rows
df_new2 <- df |>
mutate(x2=fct_lump_n(x,4, w))
df_new2
#> # A tibble: 100 × 3
#> x w x2
#> <chr> <int> <fct>
#> 1 a 1 Other
#> 2 b 2 Other
#> 3 c 3 Other
#> 4 d 4 Other
#> 5 e 5 Other
#> 6 f 6 Other
#> 7 g 7 g
#> 8 h 8 h
#> 9 i 9 i
#> 10 j 10 j
#> # … with 90 more rows
Created on 2022-09-16 by the reprex package (v2.0.1)
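For reference, the same call can be written with named arguments, which makes the role of the weights explicit. This is a minimal sketch on the same toy data; other_level = "other" is an optional choice here to match the lower-case label used in the manual version, and df_lumped is just an illustrative name.

```r
library(dplyr)
library(forcats)

df <- tibble(x = rep(letters[1:10], 10), w = seq(100))

# Keep the 4 levels with the largest summed weight w; lump the rest together
df_lumped <- df |>
  mutate(x2 = fct_lump_n(x, n = 4, w = w, other_level = "other"))

# Weighted totals per level confirm which levels were kept
df_lumped |> count(x2, wt = w, name = "w_tot")
```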
I would like to proportionally split the data I have. For example, I have 100 rows and I want to randomly sample 1 row out of every two rows. Using tidymodels rsample, I assumed I would do the below.
dat <- as_tibble(seq(1:100))
split <- initial_split(dat, prop = 0.5, breaks = 50)
testing <- testing(split)
When I check the data, the split has not done what I thought it would. It seems close but not exact. I thought the breaks argument generates bins which are sampled from, so breaks = 50 would split the 100 rows into 50 bins, with two rows per bin. I have also tried strata = value to stratify across the rows, but I cannot get this to work either.
I am using this as an example, but I am also curious how this would work when sampling 1 row out of every four, etc.
Have I misunderstood the breaks argument?
You are running up against an argument that protects users from creating stratified splits that are too small; it's called pool:
library(rsample)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
dat <- tibble(value = seq(1:100), strat = as.factor(rep(1:50, each = 2)))
dat
#> # A tibble: 100 × 2
#> value strat
#> <int> <fct>
#> 1 1 1
#> 2 2 1
#> 3 3 2
#> 4 4 2
#> 5 5 3
#> 6 6 3
#> 7 7 4
#> 8 8 4
#> 9 9 5
#> 10 10 5
#> # … with 90 more rows
split <- initial_split(dat, prop = 0.5, strata = strat, pool = 0.0)
#> Warning: Stratifying groups that make up 0% of the data may be statistically risky.
#> • Consider increasing `pool` to at least 0.1
split
#> <Analysis/Assess/Total>
#> <50/50/100>
training(split) %>% arrange(strat)
#> # A tibble: 50 × 2
#> value strat
#> <int> <fct>
#> 1 1 1
#> 2 4 2
#> 3 5 3
#> 4 8 4
#> 5 10 5
#> 6 12 6
#> 7 13 7
#> 8 16 8
#> 9 17 9
#> 10 20 10
#> # … with 40 more rows
testing(split) %>% arrange(strat)
#> # A tibble: 50 × 2
#> value strat
#> <int> <fct>
#> 1 2 1
#> 2 3 2
#> 3 6 3
#> 4 7 4
#> 5 9 5
#> 6 11 6
#> 7 14 7
#> 8 15 8
#> 9 18 9
#> 10 19 10
#> # … with 40 more rows
Created on 2022-02-22 by the reprex package (v2.0.1)
We really don't recommend turning pool down to zero like this, but you can do it here to see how the strata and prop arguments work.
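For the "1 row out of every four" case, one alternative that sidesteps rsample entirely is to build the blocks by hand and draw one row per block with dplyr. This is only a sketch of that idea, not something rsample does for you; the names k, block, sampled and held_out are made up for the example.

```r
library(dplyr)

set.seed(123)
dat <- tibble(value = 1:100)
k <- 4  # block size: draw 1 row out of every k consecutive rows

sampled <- dat %>%
  mutate(block = (row_number() - 1) %/% k) %>%  # 0,0,0,0,1,1,1,1,...
  group_by(block) %>%
  slice_sample(n = 1) %>%                       # one random row per block
  ungroup()

# Rows that were not drawn
held_out <- anti_join(dat, sampled, by = "value")
```

Setting k <- 2 gives the one-in-two split from the question.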
In R, is there a function like the IFERROR formula in Excel?
I want to calculate a moving average using the 4 nearest figures, but if a group has fewer than 4 figures, I want to use the normal average instead.
For details, see the code below; IF_ERROR is just the function I wish existed, and it does not work.
library(tidyverse)
library(TTR)
test_data <- data.frame(category=c('a','a','a','b','b','b','b','b','b'),
amount=c(1,2,3,4,5,6,7,8,9))
test_data %>% group_by(category) %>% mutate(avg_amount=IF_ERROR(TTR::runMedian(amount,4),
median(amount),
TTR::runMedian(amount,4))
In general, input should only generate errors in exceptional circumstances. It can be computationally expensive to catch and handle errors where a simple if statement will suffice. The key here is realising that runMedian throws an error if the group size is less than 4. Remember we can check the group size inside mutate by using n(), so all you need do is:
test_data %>%
group_by(category) %>%
mutate(avg_amount = if(n() > 3) TTR::runMedian(amount, 4) else median(amount))
#> # A tibble: 9 x 3
#> # Groups: category [2]
#> category amount avg_amount
#> <chr> <dbl> <dbl>
#> 1 a 1 2
#> 2 a 2 2
#> 3 a 3 2
#> 4 b 4 NA
#> 5 b 5 NA
#> 6 b 6 NA
#> 7 b 7 5.5
#> 8 b 8 6.5
#> 9 b 9 7.5
Additionally, if you want to replace the NA values from the beginning of the running median, you could use ifelse:
test_data %>%
group_by(category) %>%
mutate(avg_amount = if(n() > 3) TTR::runMedian(amount, 4) else median(amount),
avg_amount = ifelse(is.na(avg_amount), median(amount), avg_amount))
#> # A tibble: 9 x 3
#> # Groups: category [2]
#> category amount avg_amount
#> <chr> <dbl> <dbl>
#> 1 a 1 2
#> 2 a 2 2
#> 3 a 3 2
#> 4 b 4 6.5
#> 5 b 5 6.5
#> 6 b 6 6.5
#> 7 b 7 5.5
#> 8 b 8 6.5
#> 9 b 9 7.5
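If you do want something closer to Excel's IFERROR, base R's tryCatch() can be wrapped in a small helper. The sketch below is illustrative only: if_error and strict_mean4 are hypothetical names, and strict_mean4 merely stands in for any function that throws an error on short input.

```r
# Hypothetical helper: evaluate `expr`; if it throws an error, return `fallback`
if_error <- function(expr, fallback) {
  tryCatch(expr, error = function(e) fallback)
}

# Stand-in for a function that fails when given fewer than 4 values
strict_mean4 <- function(x) {
  if (length(x) < 4) stop("need at least 4 values")
  mean(tail(x, 4))
}

short_result <- if_error(strict_mean4(c(1, 2, 3)), median(c(1, 2, 3)))  # error, so the fallback is used
long_result  <- if_error(strict_mean4(c(4, 5, 6, 7, 8, 9)), NA_real_)   # no error, so the result is returned
```

Note that this only catches errors, not warnings, and that the explicit n() check above remains the cheaper choice when the failure condition is known in advance.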
Consider the dataframe df at the end of the post.
I simply would like to swap the elements of columns x and y whenever x>y.
There may be other columns in the dataframe which I do not want to touch.
In a sense, I would like to sort row wise the columns x and y.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df<-tibble(x=1:10, y=10:1, extra=LETTERS[1:10])
df
#> # A tibble: 10 × 3
#> x y extra
#> <int> <int> <chr>
#> 1 1 10 A
#> 2 2 9 B
#> 3 3 8 C
#> 4 4 7 D
#> 5 5 6 E
#> 6 6 5 F
#> 7 7 4 G
#> 8 8 3 H
#> 9 9 2 I
#> 10 10 1 J
Created on 2021-10-06 by the reprex package (v2.0.1)
base solution:
use which(df$x > df$y) to determine row numbers you want to change, then use rev to swap values for these:
df[which(df$x > df$y), c("x", "y")] <- rev(df[which(df$x > df$y), c("x", "y")])
df
# x y extra
# <int> <int> <chr>
# 1 1 10 A
# 2 2 9 B
# 3 3 8 C
# 4 4 7 D
# 5 5 6 E
# 6 5 6 F
# 7 4 7 G
# 8 3 8 H
# 9 2 9 I
# 10 1 10 J
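For exactly two columns, the same swap can also be expressed with the vectorised pmin() and pmax(), which take the element-wise minimum and maximum. A minimal sketch, using a plain data.frame with the same values as in the question:

```r
df <- data.frame(x = 1:10, y = 10:1, extra = LETTERS[1:10])

# The right-hand side is evaluated in full before the assignment,
# so both new columns are computed from the original x and y
df[c("x", "y")] <- list(pmin(df$x, df$y), pmax(df$x, df$y))
df
```

Other columns, such as extra, are left untouched.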
Thanks everyone!
I wrote a small function which does what I need and generalizes to the case of multiple variables.
See the reprex
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(1234)
set_colnames <- `colnames<-`
df<-tibble(x=1:10, y=10:1, z=rnorm(10), extra=LETTERS[1:10]) %>%
rowwise()
df
#> # A tibble: 10 × 4
#> # Rowwise:
#> x y z extra
#> <int> <int> <dbl> <chr>
#> 1 1 10 -1.21 A
#> 2 2 9 0.277 B
#> 3 3 8 1.08 C
#> 4 4 7 -2.35 D
#> 5 5 6 0.429 E
#> 6 6 5 0.506 F
#> 7 7 4 -0.575 G
#> 8 8 3 -0.547 H
#> 9 9 2 -0.564 I
#> 10 10 1 -0.890 J
sort_rows <- function(df, col_names, dec=F){
temp <- df %>%
select(all_of(col_names))
extra_names <- setdiff(colnames(df), col_names)
temp2 <- df %>%
select(all_of(extra_names))
res <- t(apply(temp, 1, sort, decreasing=dec)) %>%
as_tibble %>%
set_colnames(col_names) %>%
bind_cols(temp2)
return(res)
}
col_names <- c("x", "y", "z")
df_s <- df %>%
sort_rows(col_names, dec=F)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
df_s
#> # A tibble: 10 × 4
#> x y z extra
#> <dbl> <dbl> <dbl> <chr>
#> 1 -1.21 1 10 A
#> 2 0.277 2 9 B
#> 3 1.08 3 8 C
#> 4 -2.35 4 7 D
#> 5 0.429 5 6 E
#> 6 0.506 5 6 F
#> 7 -0.575 4 7 G
#> 8 -0.547 3 8 H
#> 9 -0.564 2 9 I
#> 10 -0.890 1 10 J
Created on 2021-10-06 by the reprex package (v2.0.1)
This looks like sorting to me:
library(tidyverse)
df <- tibble(x=1:10, y=10:1, extra=LETTERS[1:10])
df
#> # A tibble: 10 x 3
#> x y extra
#> <int> <int> <chr>
#> 1 1 10 A
#> 2 2 9 B
#> 3 3 8 C
#> 4 4 7 D
#> 5 5 6 E
#> 6 6 5 F
#> 7 7 4 G
#> 8 8 3 H
#> 9 9 2 I
#> 10 10 1 J
extra_cols <- df %>% colnames() %>% setdiff(c("x", "y"))
extra_cols
#> [1] "extra"
df %>%
mutate(row = row_number()) %>%
pivot_longer(-c(row, extra_cols)) %>%
group_by_at(c("row", extra_cols)) %>%
transmute(
value = value %>% sort(),
name = c("x", "y"),
) %>%
pivot_wider() %>%
ungroup() %>%
select(-row)
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(extra_cols)` instead of `extra_cols` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
#> # A tibble: 10 x 3
#> extra x y
#> <chr> <int> <int>
#> 1 A 1 10
#> 2 B 2 9
#> 3 C 3 8
#> 4 D 4 7
#> 5 E 5 6
#> 6 F 5 6
#> 7 G 4 7
#> 8 H 3 8
#> 9 I 2 9
#> 10 J 1 10
Created on 2021-10-06 by the reprex package (v2.0.1)
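Try using apply over the rows (MARGIN = 1), transpose the result with t, then use as_tibble to convert it to a tibble. This assumes df holds only the numeric columns x and y; with the extra character column present, apply would coerce every value to character.
Then finally change the column names: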
> df <- as_tibble(t(apply(df, 1, sort)))
> names(df) <- c('x', 'y')
> df
# A tibble: 10 x 2
x y
<int> <int>
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 5 6
7 4 7
8 3 8
9 2 9
10 1 10
I want to create a function that takes a grouping argument, which can be a single variable or multiple variables. I want it to look like this:
wanted <- function(data, groups, other_params){
data %>% group_by( {{groups}} ) %>% count()
}
This works only when a single group is given but breaks when there are multiple groups. I know it's possible to use the ellipsis ... as follows (but I want the syntax groups = something):
not_wanted <- function(data, ..., other_params){
data %>% group_by( ... ) %>% count()
}
Here is the entire code:
library(dplyr)
library(magrittr)
iris$group2 <- rep(1:5, 30)
wanted <- function(data, groups, other_params){
data %>% group_by( {{groups}} ) %>% count()
}
not_wanted <- function(data, ..., other_params){
data %>% group_by( ... ) %>% count()
}
# works
wanted(iris, groups = Species )
not_wanted(iris, Species, group2)
# doesn't work
wanted(iris, groups = vars(Species, group2) )
wanted(iris, groups = c(Species, group2) )
wanted(iris, groups = vars("Species", "group2") )
# Error: Column `vars(Species, group2)` must be length 150 (the number of rows) or one, not 2
You guys are overcomplicating things; this works just fine:
library(tidyverse)
wanted <- function(data, groups){
data %>% count(!!!groups)
}
mtcars %>% wanted(groups = vars(mpg,disp,hp))
# A tibble: 31 x 4
mpg disp hp n
<dbl> <dbl> <dbl> <int>
1 10.4 460 215 1
2 10.4 472 205 1
3 13.3 350 245 1
4 14.3 360 245 1
5 14.7 440 230 1
6 15 301 335 1
7 15.2 276. 180 1
8 15.2 304 150 1
9 15.5 318 150 1
10 15.8 351 264 1
# … with 21 more rows
The triple bang operator and parse_quos from the rlang package will do the trick. For more info, see e.g. https://stackoverflow.com/a/49941635/6086135
library(dplyr)
library(magrittr)
iris$group2 <- rep(1:5, 30)
vec <- c("Species", "group2")
wanted <- function(data, groups){
data %>% count(!!!rlang::parse_quos(groups, rlang::current_env()))
}
wanted(iris, vec)
#> # A tibble: 15 x 3
#> Species group2 n
#> <fct> <int> <int>
#> 1 setosa 1 10
#> 2 setosa 2 10
#> 3 setosa 3 10
#> 4 setosa 4 10
#> 5 setosa 5 10
#> 6 versicolor 1 10
#> 7 versicolor 2 10
#> 8 versicolor 3 10
#> 9 versicolor 4 10
#> 10 versicolor 5 10
#> 11 virginica 1 10
#> 12 virginica 2 10
#> 13 virginica 3 10
#> 14 virginica 4 10
#> 15 virginica 5 10
Created on 2020-01-06 by the reprex package (v0.3.0)
Here is another option to avoid quotations in the function call. I admit it's not very pretty, though.
library(tidyverse)
wanted <- function(data, groups){
grouping <- gsub(x = rlang::quo_get_expr(enquo(groups)), pattern = "\\((.*)?\\)", replacement = "\\1")[-1]
data %>% group_by_at(grouping) %>% count()
}
iris$group2 <- rep(1:5, 30)
wanted(iris, groups = c(Species, group2) )
#> # A tibble: 15 x 3
#> # Groups: Species, group2 [15]
#> Species group2 n
#> <fct> <int> <int>
#> 1 setosa 1 10
#> 2 setosa 2 10
#> 3 setosa 3 10
#> 4 setosa 4 10
#> 5 setosa 5 10
#> 6 versicolor 1 10
#> 7 versicolor 2 10
#> 8 versicolor 3 10
#> 9 versicolor 4 10
#> 10 versicolor 5 10
#> 11 virginica 1 10
#> 12 virginica 2 10
#> 13 virginica 3 10
#> 14 virginica 4 10
#> 15 virginica 5 10
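Another possibility, assuming dplyr 1.0.0 or later, is to wrap the embraced argument in across(), which lets groups accept either a single bare column or a c(...) of columns. This is a sketch rather than a drop-in replacement for the answers above; count_by and iris2 are names made up for the example. In dplyr 1.1.0 and later, pick() can be used in place of across() for the same purpose.

```r
library(dplyr)

# `groups` may be one bare column name or several wrapped in c()
count_by <- function(data, groups) {
  data %>%
    group_by(across({{ groups }})) %>%
    summarise(n = n(), .groups = "drop")
}

iris2 <- iris
iris2$group2 <- rep(1:5, 30)

one_group  <- count_by(iris2, groups = Species)
two_groups <- count_by(iris2, groups = c(Species, group2))
two_groups
```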