r: dplyr function to make long format with multiple groups

I'm trying to take this code and turn it into a function:
mtcars %>% group_by(gear) %>% select(hp, disp) %>%
summarise_all(funs(n=sum(!is.na(.)), mean=mean(.,na.rm=T))) %>%
gather(variable, value, -gear) %>%
arrange(gear, sub('_.*', '', variable), sub('.*_', '', variable)) %>%
separate(variable, into = c('var', 'metric'), '_')
What it does is create a dataframe that I can easily import into Excel and create pivots with. I've tried the following, but it doesn't seem to work:
mean_func <- function(vars,groups) {
results <- test %>% group_by_at(vars(one_of(groups))) %>%
summarise_at(vars(starts_with(vars)), funs(n=sum(!is.na(.)), mean=mean(.,na.rm=T))) %>%
gather_(variable, value, -groups) %>%
arrange_(groups) %>%
separate_(variable, into = c('var', 'metric'), '_'); View(results)
}
Seems like the problem is somewhere along the gather statement, but I'm not sure what could be wrong here. Any thoughts?

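Capture the grouping variable with enquo() and unquote it with !! wherever it is used, including inside group_by(). The underscore verbs such as group_by_() are deprecated, so they are best avoided for programming.
library(tidyverse)
mean_func <- function(vars, groups) {
# vars is the data frame; groups is a bare (unquoted) grouping column
groups <- enquo(groups)
vars %>%
group_by(!!groups) %>%
dplyr::select(hp, disp) %>% # the grouping column is kept automatically
summarise_all(list(n = ~sum(!is.na(.)), mean = ~mean(., na.rm = TRUE))) %>%
gather(variable, value, -!!groups) %>%
arrange(!!groups, sub('_.*', '', variable), sub('.*_', '', variable)) %>%
separate(variable, into = c('var', 'metric'), '_')
}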
mean_func(mtcars, gear)
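On current releases the same reshaping can be written without the superseded verbs. What follows is only a sketch, assuming dplyr >= 1.0 and tidyr >= 1.0; the name mean_func2 and the use of ... for the measured columns are illustrative choices, not part of the answer above. It also assumes the measured column names contain no underscore, because names_sep = "_" splits on it.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: `group` is a bare grouping column,
# `...` are the bare columns to summarise.
mean_func2 <- function(data, group, ...) {
  data %>%
    group_by({{ group }}) %>%
    summarise(
      across(c(...), list(n = ~ sum(!is.na(.x)),
                          mean = ~ mean(.x, na.rm = TRUE))),
      .groups = "drop"
    ) %>%
    # Columns are named like hp_n, hp_mean; split them into var and metric
    pivot_longer(-{{ group }}, names_to = c("var", "metric"), names_sep = "_") %>%
    arrange({{ group }}, var, metric)
}

res <- mean_func2(mtcars, gear, hp, disp)
```

With three gear values, two variables, and two metrics, res should have one row per combination, in the long layout that is convenient for an Excel pivot.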

Related

Loop to create crosstabs of columns using tidyr

I would like to use a loop to create crosstabs of one column with every other column in a df. I started with this code (substituting in the iris df), which works nicely for two variables:
iris <- iris
tbl <- iris %>%
tabyl(Species, Sepal.Length, show_missing_levels = FALSE, show_na = FALSE) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 0) %>%
adorn_ns() %>%
adorn_title("combined") %>%
knitr::kable()
print(tbl)
My df contains ~200 columns. I thought I would write a for loop to print a crosstab for one variable with each of the other variables. Here's what I tried:
cols <- c('Sepal.Length', 'Sepal.Width')
for (c in cols){
tbl <- iris %>%
tabyl(Species, c, show_missing_levels = FALSE, show_na = FALSE) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 0) %>%
adorn_ns() %>%
adorn_title("combined") %>%
knitr::kable()
print(tbl)
}
This returns Column `c` is not found.
This seems like it should be simple, but I can't figure it out. Thanks for any help.
Change the c in your code to !!sym(c) (sym() comes from rlang). tabyl() uses non-standard evaluation, so a bare c is read as a column named c inside iris rather than as your loop variable. sym(c) turns the string stored in c into a symbol, and !! unquotes it so that tabyl() sees the actual column name.
Alternatively, you can use the .data pronoun when passing column names as strings.
cols <- c('Sepal.Length', 'Sepal.Width')
for (col in cols){
tbl <- iris %>%
tabyl(Species, .data[[col]],show_missing_levels = FALSE,show_na = FALSE) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 0) %>%
adorn_ns() %>%
adorn_title("combined") %>%
knitr::kable()
print(tbl)
}
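To see that the two suggestions are interchangeable, here is a small sketch that swaps tabyl() for dplyr::count() so it runs without janitor; that substitution is an assumption made purely for illustration. Both forms resolve the string in col to the column it names.

```r
library(dplyr)
library(rlang)

col <- "Species"  # column name held as a string, as in the loop

# Option 1: convert the string to a symbol and unquote it
by_sym <- iris %>% count(!!sym(col))

# Option 2: index the .data pronoun with the string
by_data <- iris %>% count(.data[[col]])
```

Both tables contain the same counts; the .data form has the advantage of needing no extra conversion step.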

Passing a string to an R function and using it as a column name within the function

I have a dataframe with students' scores for a few subjects (each subject is represented by a column). I want to do the calculation below for each subject (Math, Science and Reading):
avgdata_math <- data%>%
group_by(country) %>%
summarise(ci = list(bootstrap_ci(sex, Math, weight))) %>%
unnest_wider(ci) %>%
ungroup() %>%
mutate(country = fct_reorder(country, avg))
Since I have to repeat the same code twice I want to write a function to do the calculation (without pivoting the dataframe)
aus_nz <- function(df, subject = "Math") {
df %>%
group_by(country) %>%
summarise(ci = list(bootstrap_ci(sex, subject, weight))) %>%
unnest_wider(ci) %>%
ungroup() %>%
mutate(country = fct_reorder(country, avg))
}
This gives me an error: subject is passed as a string, so after grouping, bootstrap_ci receives the string itself, whereas it should receive the corresponding column of data (as it exists after the group operation).
Using !! rlang::ensym(subject) in your function should work.
aus_nz <- function(df, subject = "Math") {
df %>%
group_by(country) %>%
summarise(ci = list(bootstrap_ci(sex, !! rlang::ensym(subject), weight))) %>%
unnest_wider(ci) %>%
ungroup() %>%
mutate(country = fct_reorder(country, avg))
}
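One caveat worth knowing: ensym() captures whatever was typed as the argument, so it accepts a literal string or a bare name, but not a variable that holds the string. The sketch below illustrates this on mtcars; mean_by and mean_by_str are hypothetical helpers standing in for the bootstrap_ci call, which is not available here.

```r
library(dplyr)
library(rlang)

# ensym() turns the supplied argument (string literal or bare name) into a symbol
mean_by <- function(df, column) {
  col_sym <- ensym(column)
  df %>%
    group_by(cyl) %>%
    summarise(avg = mean(!!col_sym), .groups = "drop")
}

a <- mean_by(mtcars, "hp")  # string literal
b <- mean_by(mtcars, hp)    # bare name

# When the name lives in a variable, look it up through .data instead
mean_by_str <- function(df, column) {
  df %>%
    group_by(cyl) %>%
    summarise(avg = mean(.data[[column]]), .groups = "drop")
}

s <- "hp"
d <- mean_by_str(mtcars, s)
```

Calling mean_by(mtcars, s) would look for a column literally named s, which is why the .data variant is the safer choice when looping over a vector of subject names.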
Update
If you also want to pass the grouping variable into the function as a string, and you sometimes need to group by more than one variable, then !!!, rlang::ensyms() and the ellipsis argument ... would do the trick, were it not for the last line of your function: fct_reorder expects only one variable. With two grouping variables, what would you want to happen? Create two new variables and reorder each grouping variable by avg? It would also be helpful to see your data (for example with dput(head(...))).
aus_nz <- function(df, subject = "Math", ...) {
group_var <- rlang::ensyms(...)
df %>%
group_by(!!! group_var) %>%
summarise(ci = list(bootstrap_ci(sex, !! rlang::ensym(subject), weight))) %>%
unnest_wider(ci) %>%
ungroup() # %>% last line needs to be fixed
# mutate(grouped_by = fct_reorder(!!! group_var, avg))
}
If you do not want to use the ellipsis argument, you can use rlang::syms and a character vector (with one or multiple elements) instead:
aus_nz <- function(df, subject = "Math", group = "country") {
group_var <- rlang::syms(group)
df %>%
group_by(!!! group_var) %>%
summarise(ci = list(bootstrap_ci(sex, !! rlang::ensym(subject), weight))) %>%
unnest_wider(ci) %>%
ungroup() # %>% last line needs to be fixed
# mutate(grouped_by = fct_reorder(!!! group_var, avg))
}

Adding a list-column to a tibble in which each element is the first element of another list-column

I know how to do something analogous for non-list-columns. For example:
library(tidyverse)
mtcars %>% mutate(first_cyl = first(cyl))
I expected that for list-columns, this will work:
mtcars %>%
group_by(gear) %>%
nest(.key = "x") %>%
mutate(first_x = first(x))
but it produces an error.
I think I was missing the list function. Now, this works:
mtcars %>%
group_by(gear) %>%
nest(.key = "x") %>%
mutate(first_x = list(first(x)))
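The .key argument of nest() is deprecated in recent tidyr releases. A sketch of the same idea with the newer interface, assuming tidyr >= 1.0 and a recent dplyr, could look like this:

```r
library(dplyr)
library(tidyr)

res <- mtcars %>%
  nest(x = -gear) %>%                  # one row per gear, data in list-column x
  mutate(first_x = list(first(x)))     # wrap in list() so the tibble is recycled

# Every element of first_x is the first nested tibble
same_as_first <- vapply(res$first_x, identical, logical(1), res$x[[1]])
```

Wrapping in list() matters because first(x) returns a whole tibble; without it, mutate() tries to use that tibble as the new column and the sizes do not match.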

Mutating values of subset of columns into percentage format

I have generated this summary table based on the df below.
set.seed(1)
df <- data.frame(rep(
sample(c(2012,2016),10, replace = T)),
sample(c('Treat','Control'),10,replace = T),
runif(10,0,1),
runif(10,0,1),
runif(10,0,1))
colnames(df) <- c('Year','Group','V1','V2','V3')
summary.table = df %>%
group_by(Year, Group) %>%
group_by(N = n(), add = TRUE) %>%
summarise_all(funs(sd,median)) %>%
ungroup %>%
mutate(Year = ifelse(duplicated(Year),"",Year))
Is there a way I could display the values related to the median columns as percentages?
I did not know how to use mutate() and scales::percent() on only a subset of columns. I don't want to do it column by column, since the original dataset has more columns, which would make that approach impractical.
What should I have done instead if I wanted to mutate according to a subset of rows?
Thank you
EDIT:
And if it was like this?
summary.table = df %>%
group_by(Year, Group) %>%
summarise_all(funs(median,sd)) %>%
gather(key, value, -Year, -Group) %>%
separate(key, into=c("var", "stat")) %>%
unite(stat_Group, stat, Group) %>%
spread(stat_Group, value) %>%
ungroup %>%
mutate(Year = ifelse(duplicated(Year),"",Year))
Wrap the median in scales::percent() inside the summarise call:
summary.table <- df %>%
group_by(Year, Group) %>%
group_by(N = n(), add = TRUE) %>%
summarise_all(funs(sd=sd(.),median=scales::percent(median(.)))) %>%
ungroup %>%
mutate(Year = ifelse(duplicated(Year),"",Year))
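If the summary table already exists, the percent formatting can also be applied afterwards to just the median columns by selecting them by name. This is a sketch assuming dplyr >= 1.0 and the scales package; the accuracy value is an arbitrary choice for illustration.

```r
library(dplyr)

set.seed(1)
df <- data.frame(
  Year  = sample(c(2012, 2016), 10, replace = TRUE),
  Group = sample(c("Treat", "Control"), 10, replace = TRUE),
  V1 = runif(10), V2 = runif(10), V3 = runif(10)
)

summary_tbl <- df %>%
  group_by(Year, Group) %>%
  summarise(
    across(V1:V3, list(sd = sd, median = median)),  # V1_sd, V1_median, ...
    N = n(),
    .groups = "drop"
  ) %>%
  # Format only the columns whose names end in "_median"
  mutate(across(ends_with("_median"), ~ scales::percent(.x, accuracy = 0.1)))
```

The sd columns stay numeric, while the median columns become character strings such as percentages, so this step is best done last, just before display or export.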

Combine list of data frames with one column of characters

I am learning to fetch, clean, and combine data. I am confused about why rbind in a loop returns 10 items instead of the expected 30 that I get when I combine the pages manually (one i at a time).
library(XML)
mergeal <- NULL
tabnums <- 3
for (i in 1:length(tabnums)) {
bnn <- paste0("http://www.ngchanmau.com/listing_browse.php?cur_page=",
tabnums[i], "&&coming=22-Oct-2015&coming=22-Oct-2015")
tem <- readHTMLTable(bnn, header=T, stringsAsFactors=F)
#data cleaning
ff <- tem[8] #wanted data
ff1 <- as.data.frame(ff)
ff2 <- ff1[ , 1] #get 1st col data only
ff3 <- unique(ff2)
ff4 <- ff3[c(2,5:13)] #wanted list only
#merging dataset
mergeal <- rbind(mergeal, ff4)
}
I've tried the approach from "rbind list of data frames with one column of characters and numerics", but I still get the same result as above. I'd appreciate any help with what I missed, thanks.
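Your loop runs only once: tabnums <- 3 is a vector of length one, so 1:length(tabnums) is just 1 and only page 3 is read. Looping over 1:3 fetches all three pages. I also cleaned up the data because I was bored.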
library(plyr)
library(XML)
library(dplyr)
library(magrittr)
library(stringi)
library(tidyr)
library(lubridate)
answer =
data_frame(tabnums = 1:3) %>%
group_by(tabnums) %>%
do(.$tabnums %>%
paste0("http://www.ngchanmau.com/listing_browse.php?cur_page=",
., "&&coming=22-Oct-2015&coming=22-Oct-2015") %>%
readHTMLTable(header = T, stringsAsFactors = F) %>%
extract2(8)) %>%
ungroup %>%
select(V1) %>%
distinct %>%
mutate(V1 =
V1 %>%
stri_replace_all_fixed("Â", "\n") %>%
stri_replace_all_fixed("Type:", "\nType:") %>%
stri_replace_all_fixed("Time:", "\nTime:") %>%
stri_replace_all_fixed("Area:", "\nArea:") %>%
stri_split_fixed("\n")) %>%
unnest(V1) %>%
mutate(V1 = V1 %>% stri_trim) %>%
filter(V1 %>% stri_detect_regex("^There are currently") %>% `!`) %>%
filter(V1 != "") %>%
separate(V1, c("variable", "value"), sep = ":", fill = "left") %>%
mutate(variable = variable %>% mapvalues(NA, "Description"),
ID = variable %>% `==`("Description") %>% cumsum) %>%
spread(variable, value) %>%
mutate(Area = Area %>% extract_numeric,
Price = Price %>% extract_numeric,
Datetime =
Time %>%
stri_replace_all_fixed("a.m.", "am") %>%
stri_replace_all_fixed("p.m.", "pm") %>%
paste(Date, .) %>%
dmy_hm) %>%
select(-Date, -Time)
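The loop behaviour in the question can be reproduced without touching the website. In this sketch the scraped listings are replaced by placeholder strings (an assumption for illustration only), which is enough to show both the single iteration and the difference between rbind() and c() on character vectors.

```r
# With tabnums <- 3 the loop index has exactly one value
tabnums <- 3
n_iterations <- length(1:length(tabnums))

# Looping over the page numbers themselves visits all three pages
pages <- 1:3
merged <- character(0)
stacked <- NULL
for (p in pages) {
  chunk <- paste0("page", p, "_item", 1:10)  # placeholder for 10 scraped listings
  merged <- c(merged, chunk)                 # one long character vector
  stacked <- rbind(stacked, chunk)           # a matrix with one row per page
}
```

Here merged holds all 30 items in one vector, whereas stacked is a 3 x 10 character matrix, so c() is usually what you want when each page yields a plain character vector.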
