I would like to use a tidy approach to produce correlograms by group.
My attempt with iris and libraries dplyr and corrplot:
library(corrplot)
library(dplyr)
par(mfrow=c(2,2))
iris %>%
group_by(Species) %>%
group_map(~ corrplot::corrplot(cor(.x,use = "complete.obs"),tl.cex=0.7,title =""))
It works but I would like to add the Species name on each plot.
Also, any other tidy approaches or functions are very welcome!
We could use cur_group() to get the current group's key inside summarise():
library(dplyr)
library(corrplot)
out <- iris %>%
  group_by(Species) %>%
  # cur_data() is deprecated as of dplyr 1.1.0; pick(everything()) is its replacement
  summarise(outr = list(corrplot::corrplot(cor(cur_data(), use = "complete.obs"),
                                           tl.cex = 0.7, title = cur_group()[[1]])))
Or, if we are using group_map: .keep is FALSE by default, so the grouping column is dropped from .x. Set it to TRUE and take the group name from the Species column:
iris %>%
group_by(Species) %>%
group_map(~ corrplot::corrplot(cor(select(.x, where(is.numeric)),
use = "complete.obs"),tl.cex=0.7,title = first(.x$Species)), .keep = TRUE)
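As a variation on the group_map approach, the lambda also receives the group key as .y, a one-row tibble, so the species name can be read from it without setting .keep = TRUE. A minimal sketch, assuming dplyr >= 0.8.0; the list layout (title, mat) and the name cors are only for illustration:

```r
library(dplyr)

# .x holds the group's rows without the grouping column,
# .y is a one-row tibble with the group key (here: Species)
cors <- iris %>%
  group_by(Species) %>%
  group_map(~ list(title = as.character(.y$Species),
                   mat   = cor(.x, use = "complete.obs")))

# Each element can then be plotted, for example:
# corrplot::corrplot(cors[[1]]$mat, tl.cex = 0.7, title = cors[[1]]$title)
```

Keeping the correlation matrices in a list separates the computation from the plotting, which may be convenient if the plots need to be redrawn with different options.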
You can also use a split-and-map approach with imap, which passes each list element as .x and its name as .y:
library(dplyr)
library(purrr)
iris %>%
split(.$Species) %>%
imap(~corrplot::corrplot(cor(.x[-5],use ="complete.obs"),tl.cex=0.7,title =.y))
I'm trying to do a Wilcoxon test on long-formatted data. I want to use dplyr::group_by() to specify the subsets I'd like to do the test on.
The final result would be a new column with the p-value of the Wilcoxon test appended to the original data frame. All of the techniques I have seen require summarizing the data frame. I DO NOT want to summarize the data frame.
Please see an example reformatting the iris dataset to mimic my data, and finally my attempts to perform the task.
I am getting close, but I want to preserve all of my original data from before the Wilcoxon test.
library(tidyverse)
# Reformatting iris to mimic my data.
long_format <- iris %>%
gather(key = "attribute", value = "measurement", -Species) %>%
mutate(descriptor =
case_when(
str_extract(attribute, pattern = "\\.(.*)") == ".Width" ~ "Width",
str_extract(attribute, pattern = "\\.(.*)") == ".Length" ~ "Length")) %>%
mutate(Feature =
case_when(
str_extract(attribute, pattern = "^(.*?)\\.") == "Sepal." ~ "Sepal",
str_extract(attribute, pattern = "^(.*?)\\.") == "Petal." ~ "Petal"))
# Removing no longer necessary column.
cleaned_up <- long_format %>% select(-attribute)
# Attempt using do(), but I lose important info like "measurement"
cleaned_up %>%
group_by(Species, Feature) %>%
do(w = wilcox.test(measurement~descriptor, data=., paired=FALSE)) %>%
mutate(Wilcox = w$p.value)
# This is an attempt with the dplyr experimental group_map function. If only I could just make this a new column appended to the original df in one step.
cleaned_up %>%
group_by(Species, Feature) %>%
group_map(~ wilcox.test(measurement~descriptor, data=., paired=FALSE)$p.value)
Thanks for your help.
The model object can be wrapped in a list
library(tidyverse)
cleaned_up %>%
group_by(Species, Feature) %>%
nest %>%
mutate(model = map(data, ~
.x %>%
transmute(w = list(wilcox.test(measurement~descriptor,
data=., paired=FALSE)))))
Another option is to group_split the data into a list of data frames, then map over the list and create the 'pval' column in each element after running the test. map_dfr row-binds the results back into a single data frame:
cleaned_up %>%
group_split(Species, Feature) %>%
map_dfr(~ .x %>%
mutate(pval = wilcox.test(measurement~descriptor,
data=., paired=FALSE)$p.value))
Another option is to avoid the data argument entirely. The wilcox.test function only requires a data argument when the variables being tested aren't in the calling scope, but functions called within mutate have all the columns from the data frame in scope.
cleaned_up %>%
group_by(Species, Feature) %>%
mutate(pval = wilcox.test(measurement~descriptor, paired=FALSE)$p.value)
Same as akrun's output (thanks to his correction in the comments above)
akrun <-
cleaned_up %>%
group_split(Species, Feature) %>%
map_dfr(~ .x %>%
mutate(pval = wilcox.test(measurement~descriptor,
data=., paired=FALSE)$p.value))
me <-
cleaned_up %>%
group_by(Species, Feature) %>%
mutate(pval = wilcox.test(measurement~descriptor, paired=FALSE)$p.value)
all.equal(akrun, me)
# [1] TRUE
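As a side note on the reshaping step in the question, gather() plus the two case_when() calls can probably be collapsed into a single pivot_longer() call. This is only a sketch, assuming tidyr >= 1.0.0; it produces the same four columns as cleaned_up, but the column and row order differ from the original:

```r
library(dplyr)
library(tidyr)

# Split names such as "Sepal.Length" at the dot into
# Feature = "Sepal" and descriptor = "Length"
cleaned_up <- iris %>%
  pivot_longer(-Species,
               names_to  = c("Feature", "descriptor"),
               names_sep = "\\.",
               values_to = "measurement")
```

Any of the grouped approaches above should then work on this cleaned_up unchanged, since they only rely on the column names.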
Let's say I want to split mtcars into 3 csv files based on their cyl grouping. I can use mutate to do this, but it will create a NULL column in the output.
library(tidyverse)
by_cyl = mtcars %>%
group_by(cyl) %>%
nest()
by_cyl %>%
mutate(unused = map2(data, cyl, function(x, y) write.csv(x, paste0(y, '.csv'))))
Is there a way to do this on the by_cyl object without calling mutate?
Here is an option using purrr without mutate from dplyr.
library(tidyverse)
mtcars %>%
split(.$cyl) %>%
walk2(names(.), ~write_csv(.x, paste0(.y, '.csv')))
Update
This drops the cyl column before saving the output.
library(tidyverse)
mtcars %>%
split(.$cyl) %>%
map(~ .x %>% select(-cyl)) %>%
walk2(names(.), ~write_csv(.x, paste0(.y, '.csv')))
Update 2
This starts from the nested by_cyl object in the question.
library(tidyverse)
by_cyl <- mtcars %>%
group_by(cyl) %>%
nest()
by_cyl %>%
split(.$cyl) %>%
walk2(names(.), ~write_csv(.x[["data"]][[1]], paste0(.y, '.csv')))
Here's a solution with do and group_by, so if your data is already grouped as it should be, you save one line:
mtcars %>%
group_by(cyl) %>%
do(data.frame(write.csv(.,paste0(.$cyl[1],".csv"))))
data.frame is only used here because do needs to return a data.frame, so it's a little hack.
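For side effects like writing files, dplyr also offers group_walk(), which calls a function once per group and ignores its return value. A minimal sketch, assuming dplyr >= 0.8.0; out_dir is a hypothetical output location chosen for this example:

```r
library(dplyr)

out_dir <- tempdir()  # hypothetical output directory for this sketch

mtcars %>%
  group_by(cyl) %>%
  # .x: rows of the current group (cyl dropped), .y: one-row tibble with the key
  group_walk(~ write.csv(.x, file.path(out_dir, paste0(.y$cyl, ".csv")),
                         row.names = FALSE))
```

Note that the grouping column is not part of .x, so the cyl column is left out of each file, and the car names stored as row names of mtcars are not written either.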
I am trying to figure out how to mutate a single column of data by several functions using dplyr. I can do every column:
library(dplyr)
iris %>%
group_by(Species) %>%
mutate_all(funs(min, max))
But I don't know how to select one column. I can imagine something like this though this obviously does not run:
iris %>%
group_by(Species) %>%
mutate(Sepal.Length, funs(min, max))
I can sort of accomplish this task using do() and a custom function like this:
summary_func = function(x){
tibble(max_out = max(x),
min_out = min(x)
)
}
iris %>%
group_by(Species) %>%
do(summary_func(.$Sepal.Length))
However, this doesn't really do what I want either, because it isn't adding to the existing tibble a la mutate.
Any ideas?
Use mutate_at
iris %>%
group_by(Species) %>%
mutate_at("Sepal.Length", funs(min, max))
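mutate_at takes the column name as a character string, so mind the quotes. Note that funs() has been deprecated since dplyr 0.8.0; a named list of functions such as list(min = min, max = max) is the current form.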
Use mutate
iris %>%
group_by(Species) %>%
mutate(min = min(Sepal.Length),
max = max(Sepal.Length))
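In current dplyr (1.0.0 and later), across() covers the same case without funs() or mutate_at. This is a sketch rather than a drop-in replacement: with the default naming, the new columns are called Sepal.Length_min and Sepal.Length_max instead of min and max. The name out is only for illustration.

```r
library(dplyr)

# Apply several functions to one column, keeping all original rows
out <- iris %>%
  group_by(Species) %>%
  mutate(across(Sepal.Length, list(min = min, max = max))) %>%
  ungroup()
```

The .names argument of across() can be used if different column names are preferred.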
How can I make several sequential manipulations of the same variable using dplyr, but more elegantly than the code below?
Specifically, I would like to remove the multiple calls to car_names = without having to nest any of the functions.
mtcars2 <- mtcars %>%
  mutate(car_names = row.names(.)) %>%
  mutate(car_names = stri_extract_first_words(car_names)) %>%
  mutate(car_names = as.factor(car_names))
If you want to type less and not nest the functions, you can use the pipe inside the mutate call:
library(dplyr)
library(stringi)
# What you did
mtcars2 <- mtcars %>%
mutate(car_names = row.names(.)) %>%
mutate(car_names = stri_extract_first_words(car_names)) %>%
mutate(car_names = as.factor(car_names))
# Another way with less typing and no nesting
mtcars3 <- mtcars %>%
mutate(car_names = rownames(.) %>%
stri_extract_first_words(.) %>%
as.factor(.))
identical(mtcars2, mtcars3)
# [1] TRUE
I'm trying to bootstrap some model fits and then calculate statistics without having to rerun the models every time. I can do this fine if I calculate r2 inside the first do() but I'd like to know how to access the data.
library(dplyr)
library(tidyr)
library(modelr)
library(purrr)
allmdls <-
mtcars %>%
group_by(cyl) %>%
do({
datsplit=crossv_mc(.,10)
mdls=list(map(datsplit$train, ~glm(hp~disp,data=.,family=gaussian(link='identity'))))
data_frame(datsplit=list(datsplit),mdls)
})
and now something like:
allmdls %>%
by_slice(dmap,.f=map2_dbl(.$mdls,.$datsplit$test,rsquare))
but I get
Error: .y is not a vector (NULL)
or
allmdls %>%
group_by(cyl) %>%
do({
map2_df(.x=.$mdls, .y=.$datsplit, .f=map2_dbl(.x=.x,.y=.y$test,.f=rsquare))
})
Error in map2_dbl(.x = .x, .y = .y$test, .f = rsquare) : object '.x' not found
I can't seem to get the syntax right.
help?
Thanks
EDIT:
Thanks to @aosmith's comment, I created a somewhat simpler solution:
mtcars %>%
  group_by(cyl) %>%
  do({
    # 10 Monte Carlo train/test splits per cyl group
    crossv_mc(., 10) %>%
      mutate(mdls = map(train, ~glm(hp ~ disp, data = .)),
             r2 = map2_dbl(mdls, test, rsquare),
             pctmae = map2_dbl(mdls, test, function(model, data) {
               mae(model, data) / mean(model$model$hp, na.rm = TRUE) * 100
             }))
  })
One option is to use map2 within mutate. Because you are using lists of lists I ended up with nested map2s to get access to the innermost lists. I pulled the test data out via map(datsplit, "test"), as neither the dollar sign operator nor the extract brackets were working for me.
mutate(allmdls, rsq = map2(mdls, map(datsplit, "test"), ~map2_dbl(.x, .y, rsquare)))
Here is another option that avoids the nested lists altogether:
mtcars %>%
split(.$cyl) %>%
map_df(crossv_mc, 10, .id = "cyl") %>%
mutate(models = map(train, ~glm(hp ~ disp, data = .x)),
rsq = map2_dbl(models, test, rsquare))
@aosmith answered my question, but here is a simpler solution overall:
mtcars %>%
  group_by(cyl) %>%
  do({
    # 10 Monte Carlo train/test splits per cyl group
    crossv_mc(., 10) %>%
      mutate(mdls = map(train, ~glm(hp ~ disp, data = .)),
             r2 = map2_dbl(mdls, test, rsquare),
             pctmae = map2_dbl(mdls, test, function(model, data) {
               mae(model, data) / mean(model$model$hp, na.rm = TRUE) * 100
             }))
  })