Subsetting datasets into a list

I am new to R. I've spent hours trying to figure this out and searching Google and SO, but can't find anything that's exactly what I'm looking for. Hopefully you can help?
I have a data set that looks like:
Site(factor) Species Date Mass GDD
1 cockerelli 0017-03-14 2.73 252.1
2 doddsii 0017-01-12 3.73 583.4
4 cockerelli 0017-03-14 2.71 385.4
4 doddsii 0018-05-16 2.22 783.2
1 infrequens 0018-05-16 2.89 583.0
etc.
I split() my data frame into a list of data frames that I can then pass into apply() functions.
splitdata = split(data, paste(data$Species,data$Site))
However, when I use code such as:
grmodel = lapply(splitdata, function(x){
grmodel = aov(x$Mass~x$GDD)
print(summary(grmodel))
})
I get a huge list of ANOVA summaries (like those below), but I don't know which species and site they belong to.
Df Sum Sq Mean Sq F value Pr(>F)
x$GDD 1 0.000022 0.0000216 0.044 0.838
Residuals 9 0.004396 0.0004884
1 observation deleted due to missingness
Df Sum Sq Mean Sq F value Pr(>F)
x$GDD 1 0.0002526 0.0002526 0.65 0.451
Residuals 6 0.0023319 0.0003887
1 observation deleted due to missingness
I was wondering if anyone knows how to alter the code to tell me which species and site the ANOVA table belong to? I've found some answers that talk about paste() and other functions, but nothing I've tried works.
Thanks so much in advance!

As far as I can tell the names should be visible. I'm not sure what exactly you're seeing, so a reprex would be useful.
You could also try broom::tidy() for clearer output:
lapply(split(iris,iris$Species),
function(x) aov(Petal.Length ~ Petal.Width,x))
# $`setosa`
# Call:
# aov(formula = Petal.Length ~ Petal.Width, data = x)
#
# Terms:
# Petal.Width Residuals
# Sum of Squares 0.1625262 1.3152738
# Deg. of Freedom 1 48
#
# Residual standard error: 0.1655341
# Estimated effects may be unbalanced
#
# $versicolor
# Call:
# aov(formula = Petal.Length ~ Petal.Width, data = x)
#
# Terms:
# Petal.Width Residuals
# Sum of Squares 6.695921 4.124079
# Deg. of Freedom 1 48
#
# Residual standard error: 0.2931183
# Estimated effects may be unbalanced
#
# $virginica
# Call:
# aov(formula = Petal.Length ~ Petal.Width, data = x)
#
# Terms:
# Petal.Width Residuals
# Sum of Squares 1.548503 13.376297
# Deg. of Freedom 1 48
#
# Residual standard error: 0.5278947
# Estimated effects may be unbalanced
With broom::tidy():
lapply(split(iris,iris$Species),
function(x) broom::tidy(aov(Petal.Length ~ Petal.Width,x)))
# $`setosa`
# term df sumsq meansq statistic p.value
# 1 Petal.Width 1 0.1625262 0.16252620 5.93128 0.01863892
# 2 Residuals 48 1.3152738 0.02740154 NA NA
#
# $versicolor
# term df sumsq meansq statistic p.value
# 1 Petal.Width 1 6.695921 6.69592109 77.93357 1.271916e-11
# 2 Residuals 48 4.124079 0.08591831 NA NA
#
# $virginica
# term df sumsq meansq statistic p.value
# 1 Petal.Width 1 1.548503 1.5485033 5.556707 0.02253577
# 2 Residuals 48 13.376297 0.2786728 NA NA

The names of the result of split are the values of the second argument coerced to character-class and lapply preserves these names, so you should not need to add back any names, but rather just look at:
names(grmodel)
Perhaps you want to do this for output:
for( i in names(grmodel) ){ cat(i);
cat(" : : :\n");
print(grmodel[[i]]);
cat("\n\n")}
.... which just prints the name and a bit of spacing for each item in the grmodel list.
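One way to make the labels explicit is sketched below. The data frame dat and its values are made up for illustration and are not the asker's data. Fitting with data = x keeps the term name readable (GDD rather than x$GDD), and returning the summary instead of printing it inside lapply() lets you print each table under its own header afterwards.

```r
# Hypothetical data standing in for the asker's data frame
set.seed(1)
dat <- data.frame(
  Site    = factor(rep(c(1, 4), each = 20)),
  Species = rep(c("cockerelli", "doddsii"), times = 20),
  GDD     = runif(40, 200, 800)
)
dat$Mass <- 2 + 0.001 * dat$GDD + rnorm(40, sd = 0.3)

# split() names each piece after its Species/Site combination
splitdata <- split(dat, paste(dat$Species, dat$Site))

# Return the summary instead of printing it; lapply() keeps the names
grmodel <- lapply(splitdata, function(x) summary(aov(Mass ~ GDD, data = x)))

# Print every table under a header that identifies the group
for (nm in names(grmodel)) {
  cat("==", nm, "==\n")
  print(grmodel[[nm]])
  cat("\n")
}
```

A single table can then be pulled out by name, for example grmodel[["doddsii 4"]].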

Related

R loop over linear regression

I have looked over the forum but couldn't find what I am looking for.
I want to run a simple linear regression a couple of times. Each time using a different column as my independent variable, the dependent variable stays the same. After running it I want to be able to extract the R squared from each of the regressions. My thought process was to use a simple for loop. However, I cannot make it work.
Assume I work with the following data:
num value person1 person2 person3
0 1 229 29 81 0
1 2 203 17 75 0
2 3 244 62 0 55
and that I want to run the regression on the value using three variables: person1, person2 and person3. Note that this is a minimal working example but I hope to generalize the idea.
And so my initial attempt was to:
column <- names(df)[-2]
for(i in 3:5){
temp <- df[,c("value", column[i])]
lm.test <- lm(value ~ ., data = temp)
i + 1
}
However, when I run summary(lm.test) I only get a summary of the last regression, i.e. lm(value ~ person3) which I think makes sense but when trying to rewrite it as: lm.test[i] <- lm(value ~ ., data = temp) I get the following error:
debug at #3: temp <- df[,c("value", column[i])]
suggesting that there's something wrong with line 3?
If possible I'd like to be able to capture the summary for each regression but what I am really after is the R squared for each one of the regressions.

You can create the formula inside a loop and then run lm. For instance, to regress mpg in mtcars on each of cyl, wt and hp in turn, I can use the following:
vars <- c("cyl", "wt", "hp")
lm_results <- lapply(vars, function(col){
lm_formula <- as.formula(paste0("mpg ~ ", col))
lm(lm_formula, data = mtcars)
})
You can then again iterate over lm_results to get the r.squared:
lapply(lm_results, function(x) summary(x)$r.squared)
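If you would rather get the R squared values back as a named vector, a compact variant is sketched below. It assumes the same mtcars setup; the name r2 is made up here, and reformulate() is used in place of paste0() plus as.formula().

```r
vars <- c("cyl", "wt", "hp")

# Name the input vector so the results carry the predictor names
r2 <- sapply(setNames(vars, vars), function(col) {
  fit <- lm(reformulate(col, response = "mpg"), data = mtcars)
  summary(fit)$r.squared
})
r2
```

For a model with a single predictor, each value equals the squared correlation between mpg and that predictor, which is a quick sanity check.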

Here’s an approach using broom::glance() and purrr::map_dfr() to collect model summary stats into a tidy tibble:
library(broom)
library(purrr)
lm.test <- map_dfr(
set_names(names(df)[-2]),
~ glance(lm(
as.formula(paste("value ~", .x)),
data = df
)),
.id = "predictor"
)
Result:
# A tibble: 4 x 13
predictor r.squared adj.r.squared sigma statistic p.value df logLik AIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 num 0.131 -0.739 27.4 0.150 0.765 1 -12.5 31.1
2 person1 0.836 0.672 11.9 5.10 0.265 1 -10.0 26.1
3 person2 0.542 0.0831 19.9 1.18 0.474 1 -11.6 29.2
4 person3 0.607 0.215 18.4 1.55 0.431 1 -11.3 28.7
# ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
# nobs <int>
NB, you can capture model coefficients with a similar approach using broom::tidy() instead of glance().

R run linear model by group in dataset [duplicate]

This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 2 years ago.
My dataset looks like this
df = data.frame(site=c(rep('A',95),rep('B',110),rep('C',250)),
nps_score=c(floor(runif(455, min=0, max=10))),
service_score=c(floor(runif(455, min=0, max=10))),
food_score=c(floor(runif(455, min=0, max=10))),
clean_score=c(floor(runif(455, min=0, max=10))))
I'd like to run a linear model on each group (i.e. for each site), and produce the coefficients for each group in a dataframe, along with the significance levels of each variable.
I am trying to group_by the site variable and then run the model for each site but it doesn't seem to be working. I've looked at some existing solutions on stack overflow but cannot seem to adapt the code to my solution.
#Trying to run this by group, and output the resulting coefficients per site in a separate df with their signficance levels.
library(MASS)
summary(ols <- rlm(nps_score ~ ., data = df))
Any help on this would be greatly appreciated

library(tidyverse)
library(broom)
library(MASS)
# We first create a formula object.
# MASS is attached after the tidyverse, so MASS::select() masks dplyr::select();
# calling dplyr::select() explicitly avoids that clash.
my_formula <- as.formula(paste("nps_score ~ ", paste(df %>% dplyr::select(-site, -nps_score) %>% names(), collapse= "+")))
# Now we can group by site and use the formula object within the pipe.
results <- df %>%
group_by(site) %>%
do(tidy(rlm(my_formula, data = .)))
which gives:
# A tibble: 12 x 5
# Groups: site [3]
site term estimate std.error statistic
<chr> <chr> <dbl> <dbl> <dbl>
1 A (Intercept) 5.16 0.961 5.37
2 A service_score -0.0656 0.110 -0.596
3 A food_score -0.0213 0.102 -0.209
4 A clean_score -0.0588 0.110 -0.536
5 B (Intercept) 2.22 0.852 2.60
6 B service_score 0.221 0.103 2.14
7 B food_score 0.163 0.104 1.56
8 B clean_score -0.0383 0.0928 -0.413
9 C (Intercept) 5.47 0.609 8.97
10 C service_score -0.0367 0.0721 -0.509
11 C food_score -0.0585 0.0724 -0.808
12 C clean_score -0.0922 0.0691 -1.33
Note: I'm not familiar with the rlm function and whether it provides p-values in the first place, but the tidy function at least doesn't return p-values for rlm. If an ordinary linear regression would suit your needs, you could replace rlm with lm, in which case a sixth column with p-values would be added.
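For the lm variant, a base R sketch without dplyr or broom could look like the following. The seed and the names per_site and results_lm are assumptions added for illustration, and the data are random as in the question, so the estimates themselves carry no meaning.

```r
set.seed(42)
df <- data.frame(
  site          = c(rep("A", 95), rep("B", 110), rep("C", 250)),
  nps_score     = floor(runif(455, min = 0, max = 10)),
  service_score = floor(runif(455, min = 0, max = 10)),
  food_score    = floor(runif(455, min = 0, max = 10)),
  clean_score   = floor(runif(455, min = 0, max = 10))
)

# One lm() per site; coef(summary()) is a matrix that includes Pr(>|t|)
per_site <- lapply(split(df, df$site), function(d) {
  fit <- lm(nps_score ~ service_score + food_score + clean_score, data = d)
  tab <- as.data.frame(coef(summary(fit)))
  data.frame(site = d$site[1], term = rownames(tab), tab,
             row.names = NULL, check.names = FALSE)
})

# Stack the per-site tables into one data frame
results_lm <- do.call(rbind, per_site)
rownames(results_lm) <- NULL
results_lm
```

The result has one row per site and term, with the p-value in the column named Pr(>|t|).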

Producing anova from already summarized data

I have a table that looks like this:
I'm trying to run aov() on the above table, but I'm only able to create a partial output. I'm not sure how to include the standard deviation in the calculation.
Right now I'm concatenating and repeating each group like so:
groups <- c(rep('LHS', 121), rep('HS', 546), rep('Jr', 97), rep('Bachelors', 253), rep('Graduate', 155))
And then doing the same for the means (since I don't have access to the original data sheet):
means <- c(rep(38.67, 121), rep(39.6, 546), rep(41.39, 97), rep(42.55, 253), rep(40.85, 155))
At this point I can create a data frame and then run aov on it:
df <- data.frame(groups, means)
groups.aov <- aov(means ~ groups, data = df)
Unfortunately summary(groups.aov) only gives me a partial result.
Df Sum Sq Mean Sq F value Pr(>F)
groups 4 2004 501 4.247e+27 <2e-16 ***
Residuals 1167 0 0
Any other way I can go, where I can factor in the SD?

We simulate some data so that we know the calculations are correct:
set.seed(100)
df = data.frame(
groups=rep(letters[1:4],times=seq(20,35,by=5)),
value=rnorm(110,rep(1:4,times=seq(20,35,by=5)),1))
We get back something like the table you see above:
library(dplyr)
res <- df %>% group_by(groups) %>% summarize_all(c(mean=mean,sd=sd,n=length))
total <- data.frame(groups="total",mean=mean(df$value),sd=sd(df$value),n=nrow(df))
rbind(res,total)
# A tibble: 5 x 4
groups mean sd n
<fct> <dbl> <dbl> <int>
1 a 0.937 1.14 20
2 b 1.91 0.851 25
3 c 3.01 0.780 30
4 d 4.01 0.741 35
5 total 2.70 1.42 110
ANOVA works with sums of squares. To get from a standard deviation back to a sum of squares, square it to get the variance and multiply by n-1; from the sums of squares you can derive the F value. The detailed calculations:
# number of groups
ngroups=nrow(res)
# total sum of squares
SST = (total$sd^2)*(total$n-1)
# error (within-group) sum of squares
SSE = sum((res$sd^2)*(res$n-1))
aovtable = data.frame(
# residual degrees of freedom are n minus the number of groups
Df = c(ngroups-1,total$n-ngroups),
SumSq = c(SST-SSE,SSE)
)
aovtable$MeanSq = aovtable$SumSq / aovtable$Df
aovtable$F = c(aovtable$MeanSq[1]/aovtable$MeanSq[2],NA)
aovtable$p = c(pf(aovtable$F[1],aovtable$Df[1],aovtable$Df[2],lower.tail=FALSE),NA)
And we can compare the result with aov. With these degrees of freedom, aovtable has Df 3 and 106 and sums of squares 140.56 and 78.55, which gives an F value of about 63.23 and a p-value far below 2e-16, the same as:
summary(aov(value~groups,data=df))
Df Sum Sq Mean Sq F value Pr(>F)
groups 3 140.56 46.85 63.23 <2e-16 ***
Residuals 106 78.55 0.74
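If the summary table has no total row, the between-group sum of squares can be computed from the group means and sizes instead. The helper below is a sketch of that variant; anova_from_summary is a made-up name, not an existing function, and the residual degrees of freedom are taken as N minus the number of groups. It is checked against aov() on the same kind of simulated data.

```r
# Hypothetical helper: one-way ANOVA from per-group means, sds and sizes
anova_from_summary <- function(means, sds, n) {
  k <- length(means)
  N <- sum(n)
  grand_mean <- sum(n * means) / N
  ss_between <- sum(n * (means - grand_mean)^2)  # between-group sum of squares
  ss_within  <- sum((n - 1) * sds^2)             # within-group sum of squares
  df <- c(k - 1, N - k)
  ms <- c(ss_between, ss_within) / df
  f  <- ms[1] / ms[2]
  data.frame(
    Df = df, SumSq = c(ss_between, ss_within), MeanSq = ms,
    F = c(f, NA), p = c(pf(f, df[1], df[2], lower.tail = FALSE), NA),
    row.names = c("groups", "Residuals")
  )
}

# Simulated raw data, reduced to per-group summary statistics
set.seed(100)
sim <- data.frame(
  groups = rep(letters[1:4], times = seq(20, 35, by = 5)),
  value  = rnorm(110, rep(1:4, times = seq(20, 35, by = 5)), 1)
)
stats <- aggregate(value ~ groups, data = sim,
                   FUN = function(v) c(mean = mean(v), sd = sd(v), n = length(v)))
tab <- anova_from_summary(stats$value[, "mean"],
                          stats$value[, "sd"],
                          stats$value[, "n"])
tab
```

With real summarized data you would pass the reported means, standard deviations and group sizes directly.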

Fitting linear model / ANOVA by group [duplicate]

This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 6 years ago.
I'm trying to run anova() in R and running into some difficulty. This is what I've done up to now to help shed some light on my question.
Here is the str() of my data to this point.
str(mhw)
'data.frame': 500 obs. of 5 variables:
$ r : int 1 2 3 4 5 6 7 8 9 10 ...
$ c : int 1 1 1 1 1 1 1 1 1 1 ...
$ grain: num 3.63 4.07 4.51 3.9 3.63 3.16 3.18 3.42 3.97 3.4 ...
$ straw: num 6.37 6.24 7.05 6.91 5.93 5.59 5.32 5.52 6.03 5.66 ...
$ Quad : Factor w/ 4 levels "NE","NW","SE",..: 2 2 2 2 2 2 2 2 2 2 ...
Column r is a numerical value indicating which row in the field an individual plot resides
Column c is a numerical value indicating which column an individual plot resides
Column Quad corresponds to the geographical location in the field to which each plot resides
Quad <- ifelse(mhw$c > 13 & mhw$r < 11, "NE",ifelse(mhw$c < 13 & mhw$r < 11,"NW", ifelse(mhw$c < 13 & mhw$r >= 11, "SW","SE")))
mhw <- cbind(mhw, Quad)
I have fit a lm() as follows
nov.model <-lm(mhw$grain ~ mhw$straw)
anova(nov.model)
This is an anova() for the entire field, which is testing grain yield against straw yield for each plot in the dataset.
My trouble is that I want to run a separate anova() for each level of the Quad column, to test grain yield against straw yield within each quadrant.
Perhaps with() might help here. I have never used it before and I am still learning R. Any help would be greatly appreciated.

I think you are looking for the by() facility in R.
fit <- with(mhw, by(mhw, Quad, function (dat) lm(grain ~ straw, data = dat)))
Since you have 4 levels in Quad, you end up with 4 linear models in fit, i.e., fit is a "by" class object (a type of "list") of length 4.
To get coefficient for each model, you can use
sapply(fit, coef)
To produce model summary, use
lapply(fit, summary)
To export ANOVA table, use
lapply(fit, anova)
As a reproducible example, I am taking the example from ?by:
tmp <- with(warpbreaks,
by(warpbreaks, tension,
function(x) lm(breaks ~ wool, data = x)))
class(tmp)
# [1] "by"
mode(tmp)
# [1] "list"
sapply(tmp, coef)
# L M H
#(Intercept) 44.55556 24.000000 24.555556
#woolB -16.33333 4.777778 -5.777778
lapply(tmp, anova)
#$L
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 1200.5 1200.50 5.6531 0.03023 *
#Residuals 16 3397.8 212.36
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#$M
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 102.72 102.722 1.2531 0.2795
#Residuals 16 1311.56 81.972
#
#$H
#Analysis of Variance Table
#
#Response: breaks
# Df Sum Sq Mean Sq F value Pr(>F)
#wool 1 150.22 150.222 2.3205 0.1472
#Residuals 16 1035.78 64.736
Another option, which I was aware of but am not familiar with, is lmList() from nlme. Thanks to @Roland for providing the code for the above reproducible example:
library(nlme)
lapply(lmList(breaks ~ wool | tension, data = warpbreaks), anova)
For your data I think it would be
fit <- lmList(grain ~ straw | Quad, data = mhw)
lapply(fit, anova)
You don't need to install nlme; it comes with R as one of the recommended packages.
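To collect the per-group ANOVA results in one data frame rather than reading a list of tables, something like the following sketch could work. It reuses the warpbreaks example; the names tabs and anova_rows are made up for illustration.

```r
# One lm() per tension level, as in the example above
fit <- by(warpbreaks, warpbreaks$tension,
          function(d) lm(breaks ~ wool, data = d))

# Named list of ANOVA tables: L, M, H
tabs <- lapply(fit, anova)

# Pull the wool row of each table into a single data frame
anova_rows <- do.call(rbind, Map(function(a, nm) {
  data.frame(tension = nm,
             Df = a["wool", "Df"],
             F  = a["wool", "F value"],
             p  = a["wool", "Pr(>F)"])
}, tabs, names(tabs)))
rownames(anova_rows) <- NULL
anova_rows
```

For mhw, the same pattern should apply with Quad as the grouping variable and straw as the term.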

Output t.test results to a data frame in R

I have a data frame of values from individuals linked to groups. I want to identify those groups who have mean values greater than the mean value plus one standard deviation for the whole data set. To do this, I'm calculating the mean value and standard deviation for the entire data frame and then running pairwise t-tests to compare to each group mean. I'm running into trouble outputting the results.
> head(df)
individual group value
1 11559638 75 0.371
2 11559641 75 0.367
3 11559648 75 0.410
4 11559650 75 0.417
5 11559652 75 0.440
6 11559654 75 0.395
> allvalues <- data.frame(mean=rep(mean(df$value), length(df$individual)), sd=rep(sd(df$value), length(df$individual)))
> valueplus <- with(df, by(df, df$individual, function(x) t.test(allvalues$mean + allvalues$sd, df$value, data=x)))
> tmpplus
--------------------------------------------------------------------------
df$individuals: 10
NULL
--------------------------------------------------------------------------
df$individuals: 20
NULL
--------------------------------------------------------------------------
df$individuals: 21
Welch Two Sample t-test
data: allvalues$mean + allvalues$sd and df$value
t = 84.5217, df = 4999, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.04676957 0.04899068
sample estimates:
mean of x mean of y
0.4719964 0.4241162
How do I get the results into a data frame? I'd expect the output to look something like this:
groups t df p-value mean.x mean.y
1 10 NULL NULL NULL NULL NULL
2 20 NULL NULL NULL NULL NULL
3 21 84.5217 4999 2.2e-16 0.4719964 0.4241162

From a purely programming perspective, you are asking how to get the output of t.test into a data.frame. Try the following, using mtcars:
library(broom)
tidy(t.test(mtcars$mpg))
estimate statistic p.value parameter conf.low conf.high
1 20.09062 18.85693 1.526151e-18 31 17.91768 22.26357
Or for multiple groups:
library(dplyr)
mtcars %>% group_by(vs) %>% do(tidy(t.test(.$mpg)))
# A tibble: 2 x 9
# Groups: vs [2]
vs estimate statistic p.value parameter conf.low conf.high method alternative
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 0 16.6 18.3 1.32e-12 17 14.7 18.5 One Sample t-test two.sided
2 1 24.6 17.1 2.75e-10 13 21.5 27.7 One Sample t-test two.sided
Needless to say, you'll need to adjust the code to fit your specific setting.
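If you prefer to avoid the extra packages, the same idea can be sketched in base R by picking the pieces out of each htest object. The names tt and tt_df are made up for this example.

```r
# One-sample t-test of mpg within each level of vs
tt <- lapply(split(mtcars$mpg, mtcars$vs), t.test)

# Collect the pieces of each htest object into one data frame
tt_df <- data.frame(
  vs        = names(tt),
  estimate  = sapply(tt, function(r) unname(r$estimate)),
  statistic = sapply(tt, function(r) unname(r$statistic)),
  df        = sapply(tt, function(r) unname(r$parameter)),
  p.value   = sapply(tt, function(r) r$p.value),
  row.names = NULL
)
tt_df
```

For a two-sample test, r$estimate has two elements (mean of x and mean of y), so you would extract them separately.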
