I have the following code and need to rerun it while changing only var1 and var2, so I would like to wrap it in a function, but I could not get that to work. Can you turn this simple task into a function?
a12 = data %>% group_by(var1,var2)%>% tally
a12_1 <- data %>% group_by(var1) %>% tally
a12_2 = merge(a12,a12_1,by="var1")
a12_2$perc = a12_2[,3] / a12_2[,4]
The challenge for me is how to handle these arguments when creating the function.
a_fun <- function(data,var1,var2)
I'm guessing you're struggling with non-standard evaluation. If you append _ to the dplyr functions (e.g. group_by_() instead of group_by()), you can pass strings as arguments. I've not tested it, but you could try:
a_fun <- function(data, var1, var2) {
a12 <- data %>% group_by_(var1, var2) %>% tally()
a12_1 <- data %>% group_by_(var1) %>% tally()
a12_2 <- merge(a12, a12_1, by = var1)
a12_2$perc <- a12_2[, 3] / a12_2[, 4]
return(a12_2)
}
For example:
a_fun(data, "col1", "col2")
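The underscore verbs such as group_by_() have been deprecated since dplyr 0.7, so here is a minimal sketch of the same idea using across(all_of()), assuming dplyr 1.0 or later. The name a_fun2 and the toy data frame are made up for illustration, and the sketch computes the share within each var1 group directly instead of merging two tallies.

```r
library(dplyr)

# Hypothetical rewrite of a_fun(); var1 and var2 are column names given as strings.
a_fun2 <- function(data, var1, var2) {
  data %>%
    count(across(all_of(c(var1, var2)))) %>%  # rows per (var1, var2) pair
    group_by(across(all_of(var1))) %>%        # regroup by var1 only
    mutate(perc = n / sum(n)) %>%             # share of each pair within its var1 group
    ungroup()
}

# Toy data, invented for this example
toy <- data.frame(col1 = c("a", "a", "a", "b"),
                  col2 = c("x", "x", "y", "y"))
res <- a_fun2(toy, "col1", "col2")
res
```

Unlike the merge() version, this does not depend on column positions, so perc stays correct if extra columns are added.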
How can I hand over the argument ColName of my function foo to the R function count? ColName is the name of the column in the dataframe.
library(scales)
library(dplyr)
foo <- function(df, ColName, YearCol){
percentData <- df %>%
group_by(format(as.Date(df[,YearCol]),"%Y")) %>%
count(ColName) %>% # does not work like this, also df[,ColName] does not work
mutate(ratio=scales::percent(n/sum(n)))
}
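You can use select() to pick the columns you're interested in. Naming the selection .dots makes select() rename the two columns to .dots1 and .dots2, which the later steps then refer to.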
foo <- function(df, ColName, YearCol){
percentData <- df %>%
select(.dots = c(ColName, YearCol)) %>%
group_by(format(as.Date(.dots2), "%Y")) %>%
count(.dots1) %>%
mutate(ratio=scales::percent(n/sum(n)))
percentData
}
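As an alternative that avoids renaming columns, here is a minimal sketch using the .data pronoun, assuming a reasonably recent dplyr (0.8 or later) and that both arguments are column names passed as strings. The name foo2 and the toy data are hypothetical.

```r
library(dplyr)

# Hypothetical variant of foo(); ColName and YearCol are column names given as strings.
foo2 <- function(df, ColName, YearCol) {
  df %>%
    # Look the year column up by name and extract the four-digit year
    mutate(year = format(as.Date(.data[[YearCol]]), "%Y")) %>%
    # Count occurrences of each ColName value within each year
    count(year, .data[[ColName]]) %>%
    group_by(year) %>%
    mutate(ratio = scales::percent(n / sum(n))) %>%
    ungroup()
}

# Toy data, invented for this example
toy <- data.frame(date = c("2020-01-05", "2020-03-01", "2020-07-09", "2021-02-02"),
                  kind = c("a", "a", "b", "a"))
res <- foo2(toy, "kind", "date")
res
```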
I'm trying to create a function that gets me the mode, or mode-X (the 2nd to Xth most common value), and the associated counts for each column in a data frame.
I can't figure out what I'm missing and would appreciate some assistance. I believe it has to do with passing a variable into a dplyr function.
library(tidyverse)
myfunct_get_mode = function(x, rank=1){
mytable = dplyr::count(rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = table %>% dplyr::slice(rlang::sym(rank))
return(result)
}
mtcars %>% lapply(. %>% (function(x) myfunct_get_mode(x, rank=2)))
There are some problems with your function:
Your function call is not doing what you think. Check with mtcars %>% lapply(. %>% (function(x) print(x))): x is actually a whole column of mtcars, not its name. To get the column names, apply the function to names(mtcars). But then you also have to pass in the data frame you're working on, which count() needs as its first argument.
To evaluate the symbol returned by rlang::sym(x), you need to put !! in front of it.
rank is not a variable name, so there is no need for rlang::sym here.
table should be mytable in the second-to-last line of your function.
So how could it work (although there are probably better ways):
myfunct_get_mode = function(df, x, rank=1){
mytable = count(df, !!rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = mytable %>% slice(rank)
return(result)
}
names(mtcars) %>% lapply(function(x) myfunct_get_mode(mtcars, x, rank=2))
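If we need the results in a named list, we can use purrr::imap. Note that slice(seq_len(rank)) returns the top rank rows for each column rather than only the row at position rank.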
f1 <- function(dat, rank = 1) {
purrr::imap(dat, ~
dat %>%
count(!! rlang::sym(.y)) %>%
rename_all(~ c('variable', 'counts')) %>%
arrange(desc(counts)) %>%
slice(seq_len(rank))) #%>%
#bind_cols - convert to a data.frame
}
f1(mtcars, 2)
I'm working with nested dataframes and want to pass the name of the top level dataframe, and the name of a column containing lower level dataframes, to a function that uses purrr::map to iterate over the lower level data frames.
Here's a toy example.
library(dplyr)
library(purrr)
library(tibble)
library(tidyr)
df1 <- tibble(x = c("a","b","c", "a","b","c"), y = 1:6)
df1 <- df1 %>%
group_by(x) %>%
nest()
testfunc1 <- function(df) {
df <- df %>%
mutate(out = map(data, min))
tibble(min1 = df$out)
}
testfunc2 <- function(df, col_name) {
df <- df %>%
mutate(out = map(col_name, min))
tibble(min2 = df$out)
}
df1 <- bind_cols(df1, testfunc1(df1))
df1 <- bind_cols(df1, testfunc2(df1, "data"))
df1$min1
df1$min2
testfunc1 behaves as expected, in this case giving the minimum of each data column in a new column. In testfunc2, where I've tried to pass the column name, a string reading "data" is passed to the new column. I think I understand from the thread here (Pass a data.frame column name to a function) why this doesn't behave as I want, but I haven't been able to figure out how to make it work in this case. Any suggestions would be great.
This should work for you. It uses the tidy eval framework and assumes col_name is a string.
testfunc2 <- function(df, col_name) {
df <- df %>%
mutate(out = map(!! rlang::sym(col_name), min))
tibble(min2 = df$out)
}
EDIT:
If you'd rather pass a bare column name to the function, instead of a string, use enquo instead of sym.
testfunc2 <- function(df, col_name) {
col_quo = enquo(col_name)
df <- df %>%
mutate(out = map(!! col_quo, min))
tibble(min2 = df$out)
}
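For the bare-name case, the enquo() and !! pair can also be written with the embrace operator. The sketch below assumes rlang 0.4.0 or later (where {{ }} was introduced); the names testfunc3 and nested are hypothetical, and the toy data mirrors the example from the question.

```r
library(dplyr)
library(purrr)
library(tibble)
library(tidyr)

# Hypothetical variant; col_name is passed as a bare (unquoted) column name.
testfunc3 <- function(df, col_name) {
  df <- df %>%
    mutate(out = map({{ col_name }}, min))  # {{ }} quotes and unquotes in one step
  tibble(min3 = df$out)
}

nested <- tibble(x = c("a", "b", "c", "a", "b", "c"), y = 1:6) %>%
  group_by(x) %>%
  nest()

res <- testfunc3(nested, data)
res$min3
```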
I am trying to build a summary table of a data frame like DataProfile below.
The idea is to transform each column into a row and add variables for count, nulls, not nulls, unique, and add additional mutations of those variables.
It seems like there should be a better, faster way to do this. Is there an existing function that does it?
#trying to write the functions within dplyr & magrittr framework
library(tidyverse)
library(reshape2) # provides melt(), which tidyverse does not load
mtcars[2,2] <- NA # Add a null to test completeness
#
total <- mtcars %>% summarise_all(funs(n())) %>% melt
nulls <- mtcars %>% summarise_all(funs(sum(is.na(.)))) %>% melt
filled <- mtcars %>% summarise_all(funs(sum(!is.na(.)))) %>% melt
uniques <- mtcars %>% summarise_all(funs(length(unique(.)))) %>% melt
mtcars %>% summarise_all(funs(n_distinct(.))) %>% melt
#Build a Data Frame from names of mtcars and add variables with mutate
DataProfile <- as.data.frame(names(mtcars))
DataProfile <- DataProfile %>% mutate(Total = total$value,
Nulls = nulls$value,
Filled = filled $value,
Complete = Filled/Total,
Cardinality = uniques$value,
Uniqueness = Cardinality/Total,
Distinctness = Cardinality/Filled)
DataProfile
#These are other attempts with Base R, but they are harder to read and don't play well with summarise_all
sapply(mtcars, function(x) length(unique(x[!is.na(x)]))) %>% melt
rapply(mtcars,function(x)length(unique(x))) %>% melt
The summarise_all() function can process more than one function at a time, so you can consolidate code by doing it in one pass then formatting your data to get to the type of "profile" per variable that you want.
library(tidyverse)
library(reshape2) # provides melt(), which tidyverse does not load
mtcars[2,2] <- NA # Add a null to test completeness
DataProfile <- mtcars %>%
summarise_all(funs("Total" = n(),
"Nulls" = sum(is.na(.)),
"Filled" = sum(!is.na(.)),
"Cardinality" = length(unique(.)))) %>%
melt() %>%
separate(variable, into = c('variable', 'measure'), sep="_") %>%
spread(measure, value) %>%
mutate(Complete = Filled/Total,
Uniqueness = Cardinality/Total,
Distinctness = Cardinality/Filled)
DataProfile
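Since funs() has been deprecated since dplyr 0.8 and melt() comes from the separate reshape2 package, here is a minimal sketch of the same one-pass profile using across() and pivot_longer(), assuming dplyr 1.0 and tidyr 1.0 or later. The function name profile_data and the "__" separator are choices made for this example, and the sketch assumes no column name already contains "__".

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: one row per column of df with basic completeness statistics.
profile_data <- function(df) {
  df %>%
    summarise(across(
      everything(),
      list(
        Total       = ~ length(.x),
        Nulls       = ~ sum(is.na(.x)),
        Filled      = ~ sum(!is.na(.x)),
        Cardinality = ~ n_distinct(.x)   # counts NA as a distinct value, like length(unique(.))
      ),
      .names = "{.col}__{.fn}"           # e.g. "cyl__Nulls"
    )) %>%
    # Split "column__measure" names: the first part becomes a row label,
    # the second part (".value") becomes the output column name
    pivot_longer(everything(),
                 names_to = c("variable", ".value"),
                 names_sep = "__") %>%
    mutate(Complete     = Filled / Total,
           Uniqueness   = Cardinality / Total,
           Distinctness = Cardinality / Filled)
}

mt <- mtcars
mt[2, 2] <- NA  # add a missing value to test completeness
profile <- profile_data(mt)
profile
```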
I would like to be able to use more automation when creating SpatialLines objects from otherwise tidy data frames.
library(sp)
#create sample data
sample_data <- data.frame(group_id = rep(c("a", "b","c"), 10),
x = rnorm(10),
y = rnorm(10))
#How can I recreate this using dplyr?
a_list <- Lines(list(Line(sample_data %>% filter(group_id == "a") %>% select(x, y))), ID = 1)
b_list <- Lines(list(Line(sample_data %>% filter(group_id == "b") %>% select(x, y))), ID = 2)
c_list <- Lines(list(Line(sample_data %>% filter(group_id == "c") %>% select(x, y))), ID = 3)
SpatialLines(list(a_list, b_list, c_list))
You can see how using something like group_by would make the process pretty easy if you could understand how the data could be piped into a list.
Using your sample data, a wrapper function, and dplyr::do will give you what you want :)
wrapper <- function(df) {
df %>% select(x,y) %>% as.data.frame %>% Line %>% list %>% return
}
y <- sample_data %>% group_by(group_id) %>%
do(res = wrapper(.))
# and now assign IDs (since we can't do that inside dplyr easily)
ids = 1:dim(y)[1]
SpatialLines(
mapply(x = y$res, ids = ids, FUN = function(x,ids) {Lines(x,ID=ids)})
)
I don't use sp so there might be a better way to assign IDs.
For reference, consider reading Hadley's comments on returning non-dataframe from dplyr do calls
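If dplyr is not a hard requirement, the grouping step can also be done with base R's split() and Map(). The following is only a sketch under that assumption: the helper name lines_by_group is made up, it uses the group labels as IDs instead of 1, 2, 3, and the sample data draws 30 random points rather than recycling 10.

```r
library(sp)

# Hypothetical helper: build a SpatialLines object with one Lines entry per group.
lines_by_group <- function(df, group_col, x_col = "x", y_col = "y") {
  # One data frame of coordinates per group, named by group label
  pieces <- split(df[, c(x_col, y_col)], df[[group_col]])
  # Wrap each group's coordinates as Line -> Lines, using the group label as ID
  lines_list <- Map(
    function(coords, id) Lines(list(Line(as.matrix(coords))), ID = id),
    pieces, names(pieces)
  )
  SpatialLines(unname(lines_list))
}

set.seed(1)
sample_data <- data.frame(group_id = rep(c("a", "b", "c"), 10),
                          x = rnorm(30),
                          y = rnorm(30))
sl <- lines_by_group(sample_data, "group_id")
```

Using the group label as the ID keeps each line traceable back to its group without a separate lookup step.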