How to save summary results into a dataset in R

I'm a SAS programmer trying to learn R. In SAS, I would do this to save the results of descriptive statistics into a dataset:
proc means data=abc;
  var var1 var2 var3;
  ods output summary=result1;
run;
I think in R, it would be this:
summary(abc)->result1
Someone told me to do this:
as.data.frame(unclass(summary(new_scales)))->new_table
But the result in this table is not very usable.
Is there a way to get a better-structured result, like I would get from SAS PROC MEANS? I would like the columns to be:
variable name, Mean, SD, min, max, etc.
with the rows carrying the results for each variable.

Consider sapply (a hidden loop that returns an object of the same length as its input) to create a matrix of aggregation results:
# SINGLE AGGREGATE
stats_vector <- sapply(abc[c("var1", "var2", "var3")], function(x) mean(x, na.rm=TRUE))
# MULTIPLE AGGREGATES
stats_matrix <- sapply(abc[c("var1", "var2", "var3")],
                       function(x) c(count=length(x), sum=sum(x), mean=mean(x), min=min(x),
                                     q1=quantile(x)[2], median=median(x), q3=quantile(x)[4],
                                     max=max(x), sd=sd(x)))
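The sapply call returns a matrix with one column per variable and one row per statistic. To get the PROC MEANS-style layout asked for (one row per variable), transpose it into a data frame; a minimal sketch:
# one row per variable, one column per statistic
stats_df <- data.frame(variable = colnames(stats_matrix), t(stats_matrix),
                       row.names = NULL)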
If your PROC MEANS uses a CLASS statement for grouping, then use aggregate, which returns a data frame:
# SINGLE AGGREGATE
mean_df <- aggregate(cbind(var1, var2, var3) ~ group, abc, function(x) mean(x, na.rm=TRUE))
# MULTIPLE AGGREGATES
agg_raw <- aggregate(cbind(var1, var2, var3) ~ group, abc,
                     function(x) c(count=length(x), sum=sum(x), mean=mean(x), min=min(x),
                                   q1=quantile(x)[2], median=median(x), q3=quantile(x)[4],
                                   max=max(x), sd=sd(x)))
# aggregate embeds the statistics as matrix columns; flatten them into regular columns
agg_df <- do.call(data.frame, agg_raw)

Consider the tidyverse approach. The idea is to nest the data by group, fit a model (here a linear regression) to each nested data frame, map the model results to summary values, and finally store the summaries in a data frame.
library(tidyverse)
library(broom)
summary_result <- mtcars %>%
  nest(-carb) %>%   # older tidyr syntax; in current tidyr this is nest(data = -carb)
  mutate(model = purrr::map(data, function(x) {
           lm(gear ~ mpg + cyl, data = x)
         }),
         values = purrr::map(model, glance),
         r.squared = purrr::map_dbl(values, "r.squared"),
         pvalue = purrr::map_dbl(values, "p.value")) %>%
  select(-data, -model, -values)
summary_result
  carb r.squared   pvalue
1    4    0.4352 0.135445
2    1    0.7011 0.089325
3    2    0.8060 0.003218
4    3    0.5017 0.498921
5    6    0.0000       NA
6    8    0.0000       NA
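If what you need is strictly PROC MEANS-style descriptive statistics rather than model summaries, here is a minimal sketch with dplyr and tidyr (assuming abc holds numeric columns var1, var2, var3) that returns one row per variable:
library(dplyr)
library(tidyr)
result1 <- abc %>%
  pivot_longer(c(var1, var2, var3), names_to = "variable") %>%
  group_by(variable) %>%
  summarise(n = sum(!is.na(value)),
            mean = mean(value, na.rm = TRUE),
            sd = sd(value, na.rm = TRUE),
            min = min(value, na.rm = TRUE),
            max = max(value, na.rm = TRUE))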

Related

Do regression analysis for variable X and response G, for all data frames obtained by splitting one data frame in R

I have a data frame (df) which looks like this:
group.no Amount Response
1 5 10
1 10 25
1 2 20
2 12 20
2 4 8
2 3 5
and I have split the data.frame into several data.frames based on their group number with
out <- split( df , f = df$group.no )
Now what I want is to do a regression analysis with lm between Amount and Response for all the new data.frames in "out".
Please consider this is an example; I actually have 500 split data.frames in "out".
Assume the data shown reproducibly in the Note at the end. Specify pool = FALSE as an lmList argument if you don't want to pool the standard errors.
# 1
library(nlme)
lmList(Response ~ Amount | group.no, DF)
An alternative is:
# 2
lm(Response ~ grp / (Amount + 1) - 1, transform(DF, grp = factor(group.no)))
or this which carries out completely separate regressions:
# 3
by(DF, DF$group.no, function(DF) lm(Response ~ Amount, DF))
This last line can also be written:
# 3a
by(DF, DF$group.no, lm, formula = Response ~ Amount)
R squared
We can compute R squared by group using any of these:
summary(lmList(Response ~ Amount | group.no, DF))$r.squared
c(by(DF, DF$group.no, function(x) summary(lm(Response ~ Amount, x))$r.squared))
reg.list <- by(DF, DF$group.no, lm, formula = Response ~ Amount)
sapply(reg.list, function(x) summary(x)$r.squared)
c(by(DF, DF$group.no, with, cor(Response, Amount)^2))
library(dplyr)
DF %>%
  group_by(group.no) %>%
  do(summarize(., r.squared = summary(lm(Response ~ Amount, .))$r.squared)) %>%
  ungroup
Note
Lines <- "group.no Amount Response
1 5 10
1 10 25
1 2 20
2 12 20
2 4 8
2 3 5"
DF <- read.table(text = Lines, header = TRUE)
Depending on what you want to do with the regression you could use either magrittr or dplyr to first split and then create a list of linear regressions:
library(magrittr) #alternative library(dplyr)
df %>% split(.$group.no) %>% lapply(function(x) lm(Amount ~ Response, data = x))
If you wish to avoid the dplyr syntax a single lapply call can be used as
lapply(split(df, df$group.no), function(x) lm(Amount ~ Response, data = x))
out is a list of dataframes so you can use lapply() to estimate your regression for each dataframe.
mods <- lapply(out, lm, formula = Response ~ Amount)
And then mods will be a list of the models.
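To pull individual results back out of that list (for example the slope of Amount or the R squared in each group), sapply works; a short sketch:
# slope of Amount for each group
sapply(mods, function(m) coef(m)[["Amount"]])
# R squared for each group
sapply(mods, function(m) summary(m)$r.squared)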

Use tidyverse to find time-series means of cross-sectional correlations

I am trying to find the time-series mean of annual cross-sectional correlations.
Before tidyverse, I would:
convert dat to a list of annual data frames
use lapply() to find the annual cross-sectional correlations
use Reduce() to find the means manually
This logic works, but is not tidy.
set.seed(2001)
dat <- data.frame(year = rep(2001:2003, each = 10),
                  x = runif(3*10))
dat <- transform(dat, y = 5*x + runif(3*10))
dat_list <- split(dat[c('x', 'y')], dat$year)
dat_list2 <- lapply(dat_list, cor)
dat2 <- Reduce('+', dat_list2) / length(dat_list2)
dat2
## x y
## x 1.0000000 0.9772068
## y 0.9772068 1.0000000
For a tidyverse solution, my best (and failed) attempt is to:
group_by() the year variable
use do() and cor() each year
use map() and mean() to find elementwise means
This logic fails and returns NULL.
library(tidyverse)
dat2 <- dat %>%
  group_by(year) %>%
  do(cormat = cor(.$x, .$y)) %>%
  map(.$cormat, mean)
dat2
## $year
## NULL
##
## $cormat
## NULL
Is there a tidyverse idiom to replace the Reduce() idiom in my non-tidyverse solution above?
dat %>%
  group_by(year) %>%
  do(correl = cor(.[c('x', 'y')])) %>%
  {reduce(.$correl, `+`) / nrow(.)}
x y
x 1.0000000 0.9772068
y 0.9772068 1.0000000
Note that this is exactly the same as cor(dat[c('x', 'y')]), so unless you need the matrices for each year individually there's no need to group by year and then reduce. This also holds for >2 variables.
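Since do() is superseded in recent dplyr releases, an equivalent sketch using split() and purrr, which yields the same mean correlation matrix:
library(purrr)
dat %>%
  split(.$year) %>%
  map(~ cor(.x[c("x", "y")])) %>%
  {reduce(., `+`) / length(.)}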

R: Kruskal-Wallis test in loop over specified columns in data frame

I would like to run a KW-test over certain numerical variables from a data frame, using one grouping variable. I'd prefer to do this in a loop instead of typing out all the tests, as there are many variables (more than in the example below).
Simulated data:
library(dplyr)
set.seed(123)
Data <- tbl_df(
  data.frame(
    muttype = as.factor(rep(c("missense", "frameshift", "nonsense"), each = 80)),
    ados.tsc = runif(240, 0, 10),
    ados.sa = runif(240, 0, 10),
    ados.rrb = runif(240, 0, 10))
) %>%
  group_by(muttype)
ados.sim <- as.data.frame(Data)
The following code works just fine outside of the loop.
kruskal.test(formula(paste((colnames(ados.sim)[2]), "~ muttype")), data = ados.sim)
But it doesn't inside the loop:
for(i in names(ados.sim[,2:4])){
  ados.mtp <- kruskal.test(formula(paste((colnames(ados.sim)[i]), "~ muttype")),
                           data = ados.sim)
}
I get the error:
Error in terms.formula(formula, data = data) :
invalid term in model formula
Anybody who knows how to solve this?
Much appreciated!!
Try:
results <- list()
for(i in names(ados.sim[,2:4])){
  results[[i]] <- kruskal.test(formula(paste(i, "~ muttype")), data = ados.sim)
}
This also saves your results in a list and avoids overwriting your results as ados.mtp in every iteration, which I think is not what you intended to do.
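Once the tests sit in a named list, the key numbers can be pulled out in one call, for example:
# collect the p-value from each htest object in the list
sapply(results, function(x) x$p.value)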
Note the following:
for(i in names(ados.sim[,2:4])){
  print(i)
}
[1] "ados.tsc"
[1] "ados.sa"
[1] "ados.rrb"
That is, i already gives you the name of the column. The problem in your code was that you tried to use it like an integer for subsetting, which turned the outcome into NA.
for(i in names(ados.sim[,2:4])){
  print(paste((colnames(ados.sim)[i]), "~ muttype"))
}
[1] "NA ~ muttype"
[1] "NA ~ muttype"
[1] "NA ~ muttype"
And just for reference, all of this could also be done in the following two ways, which I often prefer since they make subsequent analysis slightly easier:
First, store all test objects in a dataframe:
library(tidyr)
df <- ados.sim %>%
  gather(key, value, -muttype) %>%
  group_by(key) %>%
  do(test = kruskal.test(x = .$value, g = .$muttype))
You can then subset the dataframe to get the test outcomes:
df[df$key == "ados.rrb",]$test
[[1]]
Kruskal-Wallis rank sum test
data: .$value and .$muttype
Kruskal-Wallis chi-squared = 2.2205, df = 2, p-value = 0.3295
Alternatively, get all results directly in a dataframe, without storing the test objects:
library(broom)
df2 <- ados.sim %>%
  gather(key, value, -muttype) %>%
  group_by(key) %>%
  do(tidy(kruskal.test(x = .$value, g = .$muttype)))
df2
# A tibble: 3 x 5
# Groups:   key [3]
       key statistic   p.value parameter                       method
     <chr>     <dbl>     <dbl>     <int>                       <fctr>
1 ados.rrb 2.2205031 0.3294761         2 Kruskal-Wallis rank sum test
2  ados.sa 0.1319554 0.9361517         2 Kruskal-Wallis rank sum test
3 ados.tsc 0.3618102 0.8345146         2 Kruskal-Wallis rank sum test
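Note that gather() has since been superseded by pivot_longer() and do() by group_modify(); a sketch of the same pipeline in the newer idiom:
library(tidyr)
df2 <- ados.sim %>%
  pivot_longer(-muttype, names_to = "key", values_to = "value") %>%
  group_by(key) %>%
  group_modify(~ tidy(kruskal.test(x = .x$value, g = .x$muttype)))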

Split-apply-combine with function that returns multiple variables

I need to apply myfun to subsets of a dataframe and include the results as new columns in the dataframe returned. In the old days, I used ddply. But in dplyr, I believe summarise is used for that, like this:
myfun <- function(x, y) {
  df <- data.frame(a = mean(x) * mean(y), b = mean(x) - mean(y))
  return(df)
}
mtcars %>%
  group_by(cyl) %>%
  summarise(a = myfun(cyl, disp)$a, b = myfun(cyl, disp)$b)
The above code works, but the myfun I'll be using is computationally very expensive, so I want it to be called only once rather than separately for the a and b columns. Is there a way to do that in dplyr?
Since your function returns a data frame, you can call it within group_by %>% do, which applies the function to each individual group and rbinds the returned data frames together:
mtcars %>% group_by(cyl) %>% do(myfun(.$cyl, .$disp))
# A tibble: 3 x 3
# Groups:   cyl [3]
#    cyl         a         b
#  <dbl>     <dbl>     <dbl>
#1     4  420.5455 -101.1364
#2     6 1099.8857 -177.3143
#3     8 2824.8000 -345.1000
do is not necessarily going to improve speed. In this post, I introduce a way to design a function that performs the same task, and then run a benchmark to compare the performance of each method.
Here is an alternative way to define the function.
myfun2 <- function(dt, x, y){
  x <- enquo(x)
  y <- enquo(y)
  dt2 <- dt %>%
    summarise(a = mean(!!x) * mean(!!y), b = mean(!!x) - mean(!!y))
  return(dt2)
}
Notice that the first argument of myfun2 is dt, the input data frame. Because of this, myfun2 can be used directly as part of a pipe:
mtcars %>%
  group_by(cyl) %>%
  myfun2(x = cyl, y = disp)
# A tibble: 3 x 3
    cyl         a         b
  <dbl>     <dbl>     <dbl>
1     4  420.5455 -101.1364
2     6 1099.8857 -177.3143
3     8 2824.8000 -345.1000
By doing this, we don't have to call the function once per new column, so this method is probably more efficient than the original myfun approach.
Here is a comparison of the performance using the microbenchmark package. The methods I compared are listed below. I ran the simulation 1000 times.
m1: OP's original way to apply `myfun`
m2: Psidom's method, using `do` to apply `myfun`
m3: My approach, using `myfun2`
m4: Using `do` to apply `myfun2`
m5: Z.Lin's suggestion, directly calculating the values without defining a function
m6: akrun's `data.table` approach with `myfun`
Here is the code for benchmarking.
library(microbenchmark)
library(data.table)

microbenchmark(m1 = (mtcars %>%
                       group_by(cyl) %>%
                       summarise(a = myfun(cyl, disp)$a, b = myfun(cyl, disp)$b)),
               m2 = (mtcars %>%
                       group_by(cyl) %>%
                       do(myfun(.$cyl, .$disp))),
               m3 = (mtcars %>%
                       group_by(cyl) %>%
                       myfun2(x = cyl, y = disp)),
               m4 = (mtcars %>%
                       group_by(cyl) %>%
                       do(myfun2(., x = cyl, y = disp))),
               m5 = (mtcars %>%
                       group_by(cyl) %>%
                       summarise(a = mean(cyl) * mean(disp), b = mean(cyl) - mean(disp))),
               m6 = (as.data.table(mtcars)[, myfun(cyl, disp), cyl]),
               times = 1000)
And here is the result of benchmarking.
Unit: milliseconds
 expr       min        lq      mean    median        uq        max neval
   m1  7.058227  7.692654  9.429765  8.375190 10.570663  28.730059  1000
   m2  8.559296  9.381996 11.643645 10.500100 13.229285  27.585654  1000
   m3  6.817031  7.445683  9.423832  8.085241 10.415104 193.878337  1000
   m4 21.787298 23.995279 28.920262 26.922683 31.673820 177.004151  1000
   m5  5.337132  5.785528  7.120589  6.223339  7.810686  23.231274  1000
   m6  1.320812  1.540199  1.919222  1.640270  1.935352   7.622732  1000
The result shows that the do methods (m2 and m4) are actually slower than their counterparts (m1 and m3). In this situation, applying myfun (m1) and myfun2 (m3) is faster than using do, and myfun2 (m3) is slightly faster than myfun (m1). However, not defining any function at all (m5) is faster than all of the function-based methods (m1 to m4), suggesting that for this particular case there is no need to define a function. Finally, if there is no need to stay in the tidyverse, or the dataset is enormous, consider the data.table approach (m6), which is a lot faster than all the tidyverse solutions listed here.
We can use data.table
library(data.table)
setDT(mtcars)[, myfun(cyl, disp), cyl]
#   cyl         a         b
#1:   6 1099.8857 -177.3143
#2:   4  420.5455 -101.1364
#3:   8 2824.8000 -345.1000
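For completeness, in dplyr >= 1.1.0 this pattern is covered by reframe(), which evaluates the expression once per group and unpacks a returned data frame into columns; a sketch:
mtcars %>%
  group_by(cyl) %>%
  reframe(myfun(cyl, disp))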

How to loop set of commands with different variable each time in R?

I am quite new to R coding, thus I really need your help to run a looping command in R.
I have a big table ("variable_table.txt") with columns as below:
sample BMI var1_LRR var1_BAF var2_LRR var2_BAF var3_LRR var3_BAF ........ var200_LRR var200_BAF
AA 18.9 0.27 0.99 0.18 0.99 0.11 1 ........ 0.20 0.99
BB 27.1 0.23 1 0.13 0.99 0.17 1 ........ 0.23 0.99
I would like to run a regression command as below:
dataset <- read.table("variable_table.txt", na.strings="NA", header=TRUE)
linear_var1 <- lm(BMI ~ var1_LRR + var1_BAF, data=dataset)
summary(linear_var1)
confint_var1_CI <- confint(linear_var1, level=0.95)
confint_var1_CI
Question 1:
Can someone show me how to run the above commands and then repeat them with the next variable (from var1 to var2, then var3, up to var200) without having to run each one individually?
Question 2:
How can I compile the results of each run into one table?
The easiest way would be to subset your data.frame, e.g.
mydata <- data.frame(y = runif(100),
                     foo1 = runif(100), bar1 = runif(100),
                     foo2 = runif(100), bar2 = runif(100))
out <- list()
for (i in 1:2)
  out[[i]] <- lm(y ~ ., data = mydata[, c("y", paste(c("foo", "bar"), i, sep=""))])
As about saving output to a table, first you have to decide what part of output you want to save (e.g. coefficients)
mytab <- matrix(NA, 2, 3)
for (i in 1:2)
  mytab[i, ] <- out[[i]]$coefficients
You can also use broom library to extract "tidy" output from lm objects.
library(broom)
tidy(out[[1]])
## term estimate std.error statistic p.value
## 1 (Intercept) 0.5060922 0.07619095 6.642419 0.000000001794162
## 2 foo1 -0.1567166 0.10023700 -1.563461 0.121201059993118
## 3 bar1 0.1578192 0.10404012 1.516907 0.132542574934363
Next, you could combine those outputs using rbind.
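For example, a sketch that labels each model's tidy output with its index and stacks the pieces into one table:
# one table with all coefficients, tagged by model number
all_coefs <- do.call(rbind, lapply(seq_along(out), function(i)
  cbind(model = i, tidy(out[[i]]))))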
You might try something like this:
for (i in 1:200) {
  # build the formula
  form <- as.formula(paste("BMI ~ var", i, "_LRR + var", i, "_BAF", sep=""))
  # make a character string with the lm instruction, using the formula above
  code.lm <- paste("lm.V", i, " <- lm(form, data=dataset)", sep="")
  # dynamically execute the code in that string
  eval(parse(text=code.lm))
  # create a string with the summary code
  code.summ <- paste("summary(lm.V", i, ")", sep="")
  # dynamically execute the string
  eval(parse(text=code.summ))
}
I did it up to the 'summary' instruction, but the rest is similar: you 'paste' your code into a character string and then execute it with 'eval(parse(text=))'.
After this you can access the variables 'lm.V1', ..., 'lm.V200'.
You'll have a much easier time working with the data frame if you rearrange it first:
library(tidyr)
# gather all columns into a single column
tidied <- gather(dataset, var, value, -sample, -BMI)
# separate the "var" column into var (var1, var2...) and variable (LRR or BAF)
tidied <- separate(tidied, var, c("var", "variable"))
# now spread the two variables (BAF and LRR) back across columns
tidied <- spread(tidied, variable, value)
You'll end up with a tidied table that has five columns: sample, BMI, var (which is var1, var2, etc.), LRR, and BAF. It will have 200 times as many rows as your current table. Note that with the %>% operator, you can do the above steps as:
library(dplyr)
tidied <- dataset %>%
  gather(var, value, -sample, -BMI) %>%
  separate(var, c("var", "variable")) %>%
  spread(variable, value)
Once you've done that rearrangement, you can very easily perform a linear regression within each var using dplyr's group_by and do, along with broom:
library(broom)
coefs <- tidied %>%
  group_by(var) %>%
  do(tidy(lm(BMI ~ BAF + LRR, data = .), conf.int = TRUE))
For example, if your dataset were:
set.seed(1)
dataset <- data.frame(sample = 1:100, BMI = rnorm(100),
                      var1_LRR = rnorm(100), var1_BAF = runif(100),
                      var2_LRR = rnorm(100), var2_BAF = runif(100),
                      var3_LRR = rnorm(100), var3_BAF = runif(100))
The results of the above code would be:
Source: local data frame [9 x 8]
Groups: var
var term estimate std.error statistic p.value conf.low conf.high
1 var1 (Intercept) 0.1298513867 0.17715588 0.732978145 0.4653394 -0.22175399 0.4814568
2 var1 BAF -0.0415096698 0.30068830 -0.138048836 0.8904880 -0.63829271 0.5552734
3 var1 LRR 0.0001270982 0.09550805 0.001330759 0.9989409 -0.18942994 0.1896841
4 var2 (Intercept) 0.1064316834 0.18173583 0.585639517 0.5594779 -0.25426363 0.4671270
5 var2 BAF 0.0144181386 0.31656921 0.045544981 0.9637666 -0.61388410 0.6427204
6 var2 LRR -0.0470190629 0.09340229 -0.503403723 0.6158217 -0.23239676 0.1383586
7 var3 (Intercept) 0.0616288934 0.17865709 0.344956329 0.7308741 -0.29295597 0.4162138
8 var3 BAF 0.1045320710 0.31246736 0.334537572 0.7386962 -0.51562914 0.7246933
9 var3 LRR 0.1118595808 0.07714709 1.449952134 0.1502976 -0.04125603 0.2649752
