I'm learning R and have a tibble with some World Bank data. I used the apply() function in a slice of the columns and applied the standard deviation over the values, in this way: result <- apply(df[6:46],2,sd,na.rm=TRUE).
The result is an object with two columns with no header, one column is all the names of the tibble columns that were selected and the other one is the standard deviation for each column. When I use the typeof() command in the output, the result is 'double'. The R documentation says the output of apply() is a vector, an array or a list.
I need to know this because I want to extract all the row names, but rownames(result) returns NULL. What can I do to extract the row names of this object? Please help.
I tried rownames(result) and row.names(result), and neither worked.
We can use stack to convert the named vector returned by apply into a data frame.
temp <- stack(apply(df[6:46],2,sd,na.rm=TRUE))
Now, we can access all the column names with temp$ind and values of sd in temp$values.
Using mtcars as example,
temp <- stack(apply(mtcars, 2, sd, na.rm = TRUE))
temp
# values ind
#1 6.02695 mpg
#2 1.78592 cyl
#3 123.93869 disp
#4 68.56287 hp
#5 0.53468 drat
#6 0.97846 wt
#7 1.78694 qsec
#8 0.50402 vs
#9 0.49899 am
#10 0.73780 gear
#11 1.61520 carb
We can also use this with sapply and lapply
stack(sapply(mtcars,sd, na.rm = TRUE))
#and
stack(lapply(mtcars,sd, na.rm = TRUE))
Here, sd returns a single value per column, and since apply is called with MARGIN = 2 (i.e. column-wise), the result is a named vector rather than a matrix. That is why rownames(result) returns NULL: names(out) gets the names, not row.names. Using a reproducible example with the built-in dataset iris:
data(iris)
out <- apply(iris[1:4], 2, sd, na.rm = TRUE)
names(out)
#[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
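To see why rownames() comes back NULL, it can help to inspect the object directly. This is only a small sketch on iris (the name out mirrors the example above); it checks the type and shape of what apply returns:

```r
out <- apply(iris[1:4], 2, sd, na.rm = TRUE)

typeof(out)        # storage type of the elements: "double"
is.null(dim(out))  # TRUE: no dim attribute, so a plain vector, not a matrix
rownames(out)      # NULL, because a plain vector has no dimnames
names(out)         # the column names are stored as the vector's names
```

typeof() reports how the values are stored, not what kind of container they sit in, which is why it says "double" for a named numeric vector.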
Also, by wrapping the output of apply with data.frame, we can use the row.names
out1 <- data.frame(val = out)
row.names(out1)
#[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
If we need a data.frame as output, this can be created directly with a data.frame call
data.frame(names = names(out), values = out)
Also, this can be done in tidyverse
library(dplyr)
library(tidyr)
iris %>%
summarise_if(is.numeric, sd, na.rm = TRUE) %>%
gather
# key value
#1 Sepal.Length 0.8280661
#2 Sepal.Width 0.4358663
#3 Petal.Length 1.7652982
#4 Petal.Width 0.7622377
Or convert to a list and enframe
library(tibble)
iris %>%
summarise_if(is.numeric, sd, na.rm = TRUE) %>%
as.list %>%
enframe
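As a side note, summarise_if() and gather() still work but are superseded in current releases of dplyr and tidyr. Assuming dplyr >= 1.0.0 and tidyr >= 1.0.0 are installed, roughly the same result can be sketched with across() and pivot_longer() (the column names key and value are chosen only to match the output above):

```r
library(dplyr)
library(tidyr)

res <- iris %>%
  # standard deviation of every numeric column, one column per variable
  summarise(across(where(is.numeric), ~ sd(.x, na.rm = TRUE))) %>%
  # reshape the single wide row into one row per variable
  pivot_longer(everything(), names_to = "key", values_to = "value")

res
```

This returns a tibble with one row per numeric column, so the column names are available as res$key.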
I need to perform calculations based on inputs defined in a data frame; see RefDf below. It has 3 columns: Variables (the column name), Calculation, and NewVariable (the new variable name). When the Calculation column contains count, we should use the n_distinct() function.
RefDf <- read.table(text = "Variables Calculation NewVariable
Sepal.Length sum Sepal.Length2
Petal.Length count Petal.LengthNew
", header = T)
Manual Approach - Needs to be automated via inputs in RefDf. Species remains same for grouping.
library(dplyr)
iris %>% group_by_at("Species") %>%
summarise(Sepal.Length2 = sum(Sepal.Length,na.rm = T),
Petal.LengthNew = n_distinct(Petal.Length, na.rm = T)
)
I am looking for dplyr or base R based solution
Here's a solution with data.table package
library(data.table)
library(dplyr)
# using data.table
dt <- as.data.table(RefDf)
dt[Calculation == "count", Calculation := "n_distinct"]
# function for doing grouping calculation
inner.fun <- function(calc, data, column, group="Species"){
print(column)
data.dt <- as.data.table(data)
data.dt[, .(as.numeric(get(calc)(get(column)))), by=group][]
}
out <- dt[, inner.fun(calc=Calculation, data=iris, column=Variables), by=NewVariable]
# reshape from long to wide
out2 <- dcast(data=out, Species ~ NewVariable, value.var="V1")
# convert to data.frame
out_df <- as.data.frame(out2)
out_df
Species Petal.LengthNew Sepal.Length2
1 setosa 9 250.3
2 versicolor 19 296.8
3 virginica 20 329.4
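The question also asked for a dplyr or base R solution. Below is a base R sketch of one way the calculation could be driven from RefDf. The lookup list calc_funs and the names results and out are my own, hypothetical, choices, and RefDf is rebuilt with data.frame() only to keep the block self-contained:

```r
RefDf <- data.frame(
  Variables   = c("Sepal.Length", "Petal.Length"),
  Calculation = c("sum", "count"),
  NewVariable = c("Sepal.Length2", "Petal.LengthNew"),
  stringsAsFactors = FALSE
)

# Map each label in RefDf$Calculation to a function; "count" means distinct values.
calc_funs <- list(
  sum   = function(x) sum(x, na.rm = TRUE),
  count = function(x) length(unique(x[!is.na(x)]))
)

# For every row of RefDf, apply the requested function per Species.
results <- lapply(seq_len(nrow(RefDf)), function(i) {
  f <- calc_funs[[as.character(RefDf$Calculation[i])]]
  x <- iris[[as.character(RefDf$Variables[i])]]
  as.vector(tapply(x, iris$Species, f))  # one value per Species level
})
names(results) <- as.character(RefDf$NewVariable)

out <- data.frame(Species = levels(iris$Species), results)
out
```

Adding another calculation type then only means adding an entry to calc_funs and a row to RefDf.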
I'm a SAS programmer trying to learn R. In SAS, I would do this to save the results of descriptive stats into a dataset:
proc means data=abc;
var var1 var2 var3;
ods output summary=result1;
run;
I think in R, it would be this:
summary(abc)->result1
Someone told me to do this.
as.data.frame(unclass(summary(new_scales)))->new_table
But the result in this table is not very usable.
Is there a way to get a better structured result, like the one I would get from SAS PROC MEANS? I would like the columns to look like:
variable name, Mean, SD, min, max, etc.
with one row of results for each variable.
Consider sapply (hidden loop to return equal length object as input) to create a matrix of aggregation results:
# SINGLE AGGREGATE
stats_vector <- sapply(abc[c("var1", "var2", "var3")], function(x) mean(x, na.rm=TRUE))
# MULTIPLE AGGREGATES
stats_matrix <- sapply(abc[c("var1", "var2", "var3")],
    function(x) c(count=length(x), sum=sum(x), mean=mean(x), min=min(x),
                  q1=quantile(x)[2], median=median(x), q3=quantile(x)[4],
                  max=max(x), sd=sd(x))
)
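Note that sapply puts the variables in columns and the statistics in rows. To get closer to the PROC MEANS layout (one row per variable), the matrix can be transposed. A sketch, assuming three mtcars columns as a stand-in for abc since the original data isn't shown, and with a shorter set of statistics:

```r
abc <- mtcars[c("mpg", "hp", "wt")]  # stand-in data, not the asker's abc

stats_matrix <- sapply(abc, function(x) c(
  n    = sum(!is.na(x)),
  mean = mean(x, na.rm = TRUE),
  sd   = sd(x, na.rm = TRUE),
  min  = min(x, na.rm = TRUE),
  max  = max(x, na.rm = TRUE)
))

# Transpose so each variable becomes a row, then keep its name as a column.
result1 <- data.frame(variable = colnames(stats_matrix),
                      t(stats_matrix),
                      row.names = NULL)
result1
```

result1 is an ordinary data frame, so it can be written out or merged like any other dataset.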
If your proc means uses class for grouping, then use aggregate which returns a data frame:
# SINGLE AGGREGATE
mean_df <- aggregate(cbind(var1, var2, var3) ~ group, abc, function(x) mean(x, na.rm=TRUE))
# MULTIPLE AGGREGATES
agg_raw <- aggregate(cbind(var1, var2, var3) ~ group, abc,
    function(x) c(count=length(x), sum=sum(x), mean=mean(x), min=min(x),
                  q1=quantile(x)[2], median=median(x), q3=quantile(x)[4],
                  max=max(x), sd=sd(x))
)
agg_df <- do.call(data.frame, agg_raw)
Consider the tidyverse approach. The idea is to fit a model such as a linear regression on each group, map each model to its summary values, and finally store the summary in a data frame.
library(tidyverse)
library(broom)
summary_result<-mtcars %>%
nest(-carb) %>%
mutate(model = purrr::map(data, function(x) {
lm(gear ~ mpg+cyl, data = x)}),
values = purrr::map(model, glance),
r.squared = purrr::map_dbl(values, "r.squared"),
pvalue = purrr::map_dbl(values, "p.value")) %>%
select(-data, -model, -values)
summary_result
carb r.squared pvalue
1 4 0.4352 0.135445
2 1 0.7011 0.089325
3 2 0.8060 0.003218
4 3 0.5017 0.498921
5 6 0.0000 NA
6 8 0.0000 NA
I am currently trying to compare the column classes and names of various data frames in R prior to undertaking any transformations and calculations.
The code I have is shown below:
library(dplyr)
m1 <- mtcars
m2 <- mtcars %>% mutate(cyl = factor(cyl), xxxx1 = factor(cyl))
m3 <- mtcars %>% mutate(cyl = factor(cyl), xxxx2 = factor(cyl))
out <- cbind(sapply(m1, class), sapply(m2, class), sapply(m3, class))
If someone can solve this for dataframes stored in a list, that would be great. All my dataframes are currently stored in a list, for easier processing.
All.list <- list(m1,m2,m3)
I am expecting the output to be displayed in matrix form, as in the object "out". However, the output in "out" is not what I want because it is incorrect. I am expecting the output to be more along the following lines:
Try compare_df_cols() from the janitor package:
library(janitor)
compare_df_cols(All.list)
#> column_name All.list_1 All.list_2 All.list_3
#> 1 am numeric numeric numeric
#> 2 carb numeric numeric numeric
#> 3 cyl numeric factor factor
#> 4 disp numeric numeric numeric
#> 5 drat numeric numeric numeric
#> 6 gear numeric numeric numeric
#> 7 hp numeric numeric numeric
#> 8 mpg numeric numeric numeric
#> 9 qsec numeric numeric numeric
#> 10 vs numeric numeric numeric
#> 11 wt numeric numeric numeric
#> 12 xxxx1 <NA> factor <NA>
#> 13 xxxx2 <NA> <NA> factor
It accepts both a list and/or the individual named data.frames, i.e., compare_df_cols(m1, m2, m3).
Disclaimer: I maintain the janitor package to which this function was recently added - posting it here as it addresses exactly this use case.
I think the easiest way would be to define a function, and then use a combination of lapply and dplyr to obtain the result you want. Here is how I did it.
library(dplyr)
library(tidyr)  # for spread()
m1 <- mtcars
m2 <- mtcars %>% mutate(cyl = factor(cyl), xxxx1 = factor(cyl))
m3 <- mtcars %>% mutate(cyl = factor(cyl), xxxx2 = factor(cyl))
All.list <- list(m1,m2,m3)
##Define a function to get variable names and types
my_function <- function(data_frame){
require(dplyr)
x <- tibble(`var_name` = colnames(data_frame),
`var_type` = sapply(data_frame, class))
return(x)
}
target <- lapply(1:length(All.list),function(i)my_function(All.list[[i]]) %>%
mutate(element =i)) %>%
bind_rows() %>%
spread(element, var_type)
target
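If you'd rather avoid extra packages, the same comparison can be sketched in base R. This is only an illustration: class_matrix and all_cols are names I made up, transform() replaces mutate() to keep the block dependency-free, and only the first class of each column is kept:

```r
m1 <- mtcars
m2 <- transform(mtcars, cyl = factor(cyl), xxxx1 = factor(cyl))
m3 <- transform(mtcars, cyl = factor(cyl), xxxx2 = factor(cyl))
All.list <- list(m1, m2, m3)

# Every column name that occurs in at least one data frame.
all_cols <- unique(unlist(lapply(All.list, names)))

# One column of classes per data frame; columns a data frame lacks become NA.
class_matrix <- sapply(All.list, function(d) {
  cls <- vapply(d, function(col) class(col)[1], character(1))
  unname(cls[all_cols])
})
rownames(class_matrix) <- all_cols
class_matrix
```

Indexing the class vector by all_cols is what lines the rows up by column name, which the plain cbind() in the question does not do.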
I want to find the rank correlation of various columns in a data.frame using dplyr.
I am sure there is a simple solution to this problem, but I think the problem lies in me not being able to use two inputs in summarize_each_ in dplyr when using the cor function.
For the following df:
df <- data.frame(Universe=c(rep("A",5),rep("B",5)),AA.x=rnorm(10),BB.x=rnorm(10),CC.x=rnorm(10),AA.y=rnorm(10),BB.y=rnorm(10),CC.y=rnorm(10))
I want to get the rank correlations between all the .x and the .y combinations. My problem is in the function below, where you see ????
cor <- df %>% group_by(Universe) %>%
summarize_each_(funs(cor(.,method = 'spearman',use = "pairwise.complete.obs")),????)
I want cor to include just the correlation pairs (AA.x with AA.y, AA.x with BB.y, ...) for each Universe.
Please help!
An alternative approach is to just call the cor function once since this will calculate all required correlations. Repeated calls to cor might be a performance issue for a large data set. Code to do this and extract the correlation pairs with labels could look like:
library(dplyr)
library(reshape2)  # provides melt() for matrices
#
# calculate correlations and display in matrix format
#
cor_matrix <- df %>% group_by(Universe) %>%
do(as.data.frame(cor(.[,-1], method="spearman", use="pairwise.complete.obs")))
#
# to add row names
#
cor_matrix1 <- cor_matrix %>%
data.frame(row=rep(colnames(.)[-1], n_groups(.)))
#
# calculate correlations and display in column format
#
num_col=ncol(df[,-1])
out_indx <- which(upper.tri(diag(num_col)))
cor_cols <- df %>% group_by(Universe) %>%
do(melt(cor(.[,-1], method="spearman", use="pairwise.complete.obs"), value.name="cor")[out_indx,])
So here follows the winning (time-wise) solution to my problem:
d <- df %>% gather(R1,R1v,contains(".x")) %>% gather(R2,R2v,contains(".y"),-Universe) %>% group_by(Universe,R1,R2) %>%
summarize(ICAC = cor(x=R1v, y=R2v,method = 'spearman',use = "pairwise.complete.obs")) %>%
unite(Pair, R1, R2, sep="_")
It takes only about 0.005 milliseconds in this example, but the time grows as data is added.
Try this:
library(data.table) # needed for fast melt
library(dplyr)      # for the pipeline below
setDT(df) # sets by reference, fast
mdf <- melt(df[, id := 1:.N], id.vars = c('Universe','id'))
mdf %>%
mutate(obs_set = substr(variable, 4, 4) ) %>% # 4th character: "x" or "y" subgroup
full_join(.,., by=c('Universe', 'obs_set', 'id')) %>% # see notes
group_by(Universe, variable.x, variable.y) %>%
filter(variable.x != variable.y) %>%
dplyr::summarise(rank_corr = cor(value.x, value.y,
method='spearman', use='pairwise.complete.obs'))
Produces:
Universe variable.x variable.y rank_corr
(fctr) (fctr) (fctr) (dbl)
1 A AA.x BB.x -0.9
2 A AA.x CC.x -0.9
3 A BB.x AA.x -0.9
4 A BB.x CC.x 0.8
5 A CC.x AA.x -0.9
6 A CC.x BB.x 0.8
7 A AA.y BB.y -0.3
8 A AA.y CC.y 0.2
9 A BB.y AA.y -0.3
10 A BB.y CC.y -0.3
.. ... ... ... ...
Explanation:
Melt: converts table to long form, one row per observation. To do the melt in a dplyr chain, you would have to use tidyr::gather, I believe, so pick your dependency. Using data.table there is faster and not hard to understand. The step also creates an id for each observation, 1 to nrow(df). The rest is in dplyr like you wanted.
Full join: joins the melted table to itself to create paired observations from all variable pairings based on common Universe and observation id (edit: and now '.x' or '.y' subgroup).
Filter: we don't need to correlate observations paired to themselves, we know those correlations = 1. If you wanted to include them for a correlation matrix or something, comment out this step.
Summarize using Spearman correlation. Note you should use dplyr::summarise since if you have plyr also loaded you might accidentally call plyr::summarise.
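For completeness, if only the .x versus .y pairs are wanted, cor() can take two sets of columns and return just that cross block. A base R sketch under that assumption (x_cols, y_cols, by_universe and pairs are hypothetical names of mine, and the data is regenerated with a seed so the block runs on its own):

```r
set.seed(1)
df <- data.frame(Universe = rep(c("A", "B"), each = 5),
                 AA.x = rnorm(10), BB.x = rnorm(10), CC.x = rnorm(10),
                 AA.y = rnorm(10), BB.y = rnorm(10), CC.y = rnorm(10))

x_cols <- grep("\\.x$", names(df), value = TRUE)
y_cols <- grep("\\.y$", names(df), value = TRUE)

# One 3 x 3 matrix per Universe: rows are .x columns, columns are .y columns.
by_universe <- lapply(split(df, df$Universe), function(d) {
  cor(d[x_cols], d[y_cols], method = "spearman", use = "pairwise.complete.obs")
})

# Optional: flatten each matrix to long form, one row per (x, y) pair.
pairs <- do.call(rbind, lapply(names(by_universe), function(u) {
  long <- as.data.frame(as.table(by_universe[[u]]), responseName = "rank_corr")
  cbind(Universe = u, long)
}))
head(pairs)
```

This calls cor() once per group and never computes the x-with-x or y-with-y pairs.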
I'd like to apply a transformation to all columns via dplyr::mutate_each, e.g.
library(dplyr)
mult <- function(x,m) return(x*m)
mtcars %>% mutate_each(funs(mult(.,2))) # Multiply all columns by a factor of two
However, the transformation should have parameters depending on the column name. Therefore, the column name should be passed to the function as an additional argument
named.mult <- function(x,colname) return(x*param.A[[colname]])
Example: multiply every column by a different factor:
param.A <- c()
param.A[names(mtcars)] <- seq(length(names(mtcars)))
param.A
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 2 3 4 5 6 7 8 9 10 11
Since the column name gets lost during mutate_each, I currently work around this by passing a list with lazy evaluation to mutate_ (the SE version):
library(lazyeval)
named.mutate <- function(fun, cols) sapply(cols, function(n) interp(~fun(col, n), fun=fun, col=as.name(n)))
mtcars %>% mutate_(.dots=named.mutate(named.mult, names(.)))
Works, but is there some special variable like .name which contains the column name of . for each colwise execution? So I could do something like
mtcars %>% mutate_each(funs(named.mult(.,.name)))
I'd suggest taking a different approach. Instead of using mutate_each, a combination of dplyr::mutate with tidyr::gather and tidyr::spread can achieve the same result.
For example:
library(dplyr)
library(tidyr)
data(mtcars)
# Multiply each column by a different integer
mtcars %>%
dplyr::tbl_df() %>%
dplyr::mutate(make_and_model = rownames(mtcars)) %>%
tidyr::gather(key, value, -make_and_model) %>%
dplyr::mutate(m = as.integer(factor(key)), # a multiplication factor dependent on column name
value = value * m) %>%
dplyr::select(-m) %>%
tidyr::spread(key, value)
# compare to the original data
mtcars[order(rownames(mtcars)), order(names(mtcars))]
# the multiplicative values used.
mtcars %>%
tidyr::gather() %>%
dplyr::mutate(m = as.integer(factor(key))) %>%
dplyr::select(-value) %>%
dplyr::distinct()
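In newer dplyr (1.0.0 and later, which I'm assuming here), across() together with cur_column() gives access to the name of the column being processed, which is close to the .name variable the question asks about. A minimal sketch using the question's param.A:

```r
library(dplyr)

# One multiplier per column, named after the column (as in the question).
param.A <- setNames(seq_along(mtcars), names(mtcars))

# cur_column() returns the name of the column currently handled by across().
res <- mtcars %>%
  mutate(across(everything(), ~ .x * param.A[[cur_column()]]))

head(res)
```

Here mpg is multiplied by 1, cyl by 2, and so on up to carb by 11, without reshaping the data.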