R calculate mean square for all sub groups in a subset - r

How do I calculate the mean square for each IDX group (2019_Preston_STD, 2019_Preston_V1, 2019_Preston_V2, etc.) using the Value column, and then the adjmth1 and adjmth3 columns?
structure(list(IDX = c("2019_Preston_STD", "2019_Preston_V1",
"2019_Preston_V2", "2019_Preston_V3", "2019_Preston_W1", "2019_Preston_W2"
), Value = c(3L, 2L, 3L, 2L, 3L, 5L), adjmth1 = c(2.87777777777778,
1.85555555555556, 2.01111111111111, 1.77777777777778, 3.62222222222222,
4.45555555555556), adjmth3 = c(2.9328763348507, 2.08651828334684,
2.80282946626847, 2.15028039284054, 2.68766916156347, 4.51425274916654
), adjmth13 = c(2.81065411262847, 1.82585524933201, 1.81394057737959,
1.40785681078568, 3.30989138378569, 4.7301083495049)), row.names = 29:34, class = "data.frame")

This task can be done in many ways, as shown in the link that @r2evans pointed out. My favorite is dplyr's summarize(across()), because its syntax is easy to understand and easy to apply to many columns. It also presents the resulting numbers in a nice format.
For example, from the iris data I want to get the arithmetic mean of Sepal.Length, Petal.Length, and Petal.Width for each species: setosa, versicolor, and virginica. Here is the head of the data:
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
And here is how to get the mean in each species:
iris %>% group_by(Species) %>%
summarize(across(c(Sepal.Length, Petal.Length, Petal.Width), mean))
# A tibble: 3 x 4
# Species Sepal.Length Petal.Length Petal.Width
# <fct> <dbl> <dbl> <dbl>
# 1 setosa 5.01 1.46 0.246
# 2 versicolor 5.94 4.26 1.33
# 3 virginica 6.59 5.55 2.03
As for your task, first you need to define the function for the mean square (because its definition slightly varies in some references). Then, you apply it to your data frame using summarize(across()).
For example, you define the mean square function as follows:
meansq <- function(x) sum((x-mean(x))^2)/(length(x)-1)
Note: This definition requires that length(x) doesn't equal 1, or otherwise NaN will be produced.
You can apply it to your data frame newdata as follows:
newdata %>% group_by(IDX) %>%
summarize(across(c(Value, adjmth1, adjmth3), meansq))
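As a minimal sketch of the approach above, the following uses a small hypothetical data frame `toy` (not the asker's data) with three rows per IDX, so that the mean square is defined in every group. With the definition given above, `meansq` is the same quantity as the sample variance returned by `var()`.

```r
library(dplyr)

# Mean square as defined in the answer (sum of squared deviations / (n - 1))
meansq <- function(x) sum((x - mean(x))^2) / (length(x) - 1)

# Hypothetical toy data: three rows per IDX (names A and B are made up)
toy <- data.frame(
  IDX     = rep(c("A", "B"), each = 3),
  Value   = c(1, 2, 3, 2, 4, 6),
  adjmth1 = c(1.5, 2.5, 3.5, 1, 1, 1)
)

res <- toy %>%
  group_by(IDX) %>%
  summarize(across(c(Value, adjmth1), meansq))

res
```

In the sample data from the question every IDX value appears only once, so each group has length 1 and the result there would be NaN, as the note above warns.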

Related

Mutate if variable name appears in a list

I would like to use dplyr to divide a subset of variables by the IQR. I am open to ideas that use a different approach than what I've tried before, which is a combination of mutate_if and %in%. I want to reference the list bin instead of indexing the data frame by position. Thanks for any thoughts!
contin <- c("age", "ct")
data %>%
mutate_if(%in% contin, function(x) x/IQR(x))
You should use:
data %>%
mutate(across(all_of(contin), ~.x/IQR(.x)))
Working example:
data <- head(iris)
contin <- c("Sepal.Length", "Sepal.Width")
data %>%
mutate(across(all_of(contin), ~.x/IQR(.x)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 15.69231 7.777778 1.4 0.2 setosa
2 15.07692 6.666667 1.4 0.2 setosa
3 14.46154 7.111111 1.3 0.2 setosa
4 14.15385 6.888889 1.5 0.2 setosa
5 15.38462 8.000000 1.4 0.2 setosa
6 16.61538 8.666667 1.7 0.4 setosa
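One design point worth a short sketch: all_of() raises an error when a name in the vector is not a column of the data, while any_of() silently skips it. The example below assumes a hypothetical vector `contin` that contains one name, "not_a_column", which is missing from the data.

```r
library(dplyr)

data <- head(iris)
# "not_a_column" is a made-up name that does not exist in the data
contin <- c("Sepal.Length", "not_a_column")

# any_of() ignores names that are missing, so only Sepal.Length is rescaled
res <- data %>%
  mutate(across(any_of(contin), ~ .x / IQR(.x)))

res
```

Use all_of() when a missing column should be treated as a mistake, and any_of() when the vector of names is only a wish list.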

Passing argument in a Subset using function in R

I want to write a function that accepts an argument, uses it in subset, and then plots a graph with multiple lines. I wrote the following code:
plot.new( )
rest_o_noise <- function(noise_level, color) {
rest_o_noise_level = subset(yelp_flat, attributes.RestaurantsPriceRange2!= "NA", eval(parse(text=noise_level)))
rest_o_noise_level <- rest_o_noise_level %>%
group_by(attributes.RestaurantsPriceRange2) %>%
summarise(n=mean(stars))
lines(rest_o_noise_level, stars, col=color)
}
rest_o_noise("attributes.NoiseLevel=='loud'", "green")
rest_o_noise("attributes.NoiseLevel=='low'", "green")
I am getting an error:
Error in grouped_df_impl(data, unname(vars), drop) : Column attributes.RestaurantsPriceRange2 is unknown
Just to be clear attributes.RestaurantsPriceRange2 is present in csv.
final output should look like:
Is this correct way to plot?
Please help!!
Getting a subset of rows from iris data using a conditional in a string
Species <- as.character(iris$Species)
noise_level <- "Species == \"setosa\""
subset(iris, Sepal.Length == 5.1 &
eval(parse(text=noise_level)))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 18 5.1 3.5 1.4 0.3 setosa
# 20 5.1 3.8 1.5 0.3 setosa
# 22 5.1 3.7 1.5 0.4 setosa
# 24 5.1 3.3 1.7 0.5 setosa
# 40 5.1 3.4 1.5 0.2 setosa
# 45 5.1 3.8 1.9 0.4 setosa
# 47 5.1 3.8 1.6 0.2 setosa
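In your case, it would be something like the code below, which combines conditions on the columns attributes.RestaurantsPriceRange2 and attributes.NoiseLevel into a single row filter. In your original call the string was passed as the third argument of subset, which selects columns rather than rows, so the price range column was dropped before group_by.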
plot.new( )
rest_o_noise <- function(noise_level, color) {
rest_o_noise_level = subset(yelp_flat,
(attributes.RestaurantsPriceRange2!= "NA") &
eval(parse(text=noise_level)))
rest_o_noise_level <- rest_o_noise_level %>%
group_by(attributes.RestaurantsPriceRange2) %>%
summarise(n=mean(stars))
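# after summarise() the mean is stored in column n; this assumes the price range is numeric
lines(rest_o_noise_level$attributes.RestaurantsPriceRange2, rest_o_noise_level$n, col=color)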
}
rest_o_noise("attributes.NoiseLevel==\"loud\"", "green")
You can, of course, select a single column, as below. However, why you would use the "==" for that is not clear, hence I assumed you are trying to select a subset of rows.
noise_level <- "Petal.Width"
subset(iris, Sepal.Length == 5.1,
eval(parse(text=noise_level)))
# Petal.Width
# 1 0.2
# 18 0.3
# 20 0.3
# 22 0.4
# 24 0.5
# 40 0.2
# 45 0.4
# 47 0.2
# 99 1.1
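If the function only ever needs to filter on one column, an alternative to eval(parse()) is to pass the value itself as an argument. The sketch below uses iris as a stand-in for yelp_flat; the function name `mean_length_for` and the choice of columns are assumptions made only for illustration.

```r
library(dplyr)

# Hypothetical helper: filter rows by a value passed as a plain argument,
# then summarise within groups (iris stands in for yelp_flat)
mean_length_for <- function(data, species_value) {
  data %>%
    filter(!is.na(Petal.Width), Species == species_value) %>%
    group_by(Petal.Width) %>%
    summarise(n = mean(Sepal.Length))
}

res <- mean_length_for(iris, "setosa")
res
```

Passing the value instead of a code string avoids quoting problems such as the escaped quotes needed in "attributes.NoiseLevel==\"loud\"".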

How can I replace various columns in a tibble using select?

I try to replace all columns selected using select by data of the same size.
A reproducible example is
library(tidyverse)
iris = as_data_frame(iris)
temp = cbind( runif(nrow(iris)), runif(nrow(iris)), runif(nrow(iris)), runif(nrow(iris)))
select(iris, -one_of("Petal.Length")) = temp
Then I get the error
Error in select(iris, -one_of("Petal.Length")) = temp : could not find
function "select"
Thanks for any comments.
You want to bind columns of two data frames, so you can simply use bind_cols():
library(tidyverse)
iris <- as_tibble(iris)
temp <- tibble(r1 = runif(nrow(iris)), r2 = runif(nrow(iris)), r3 = runif(nrow(iris)), r4 = runif(nrow(iris)))
select(iris, -Petal.Length) %>% bind_cols(temp)
# or use:
# bind_cols(iris, temp) %>% select(-Petal.Length)
which gives you:
# A tibble: 150 × 8
Sepal.Length Sepal.Width Petal.Width Species r1 r2 r3 r4
<dbl> <dbl> <dbl> <fctr> <dbl> <dbl> <dbl> <dbl>
1 5.1 3.5 0.2 setosa 0.7208566 0.1367070 0.04314771 0.4909396
2 4.9 3.0 0.2 setosa 0.4101884 0.4795735 0.75318182 0.1463689
3 4.7 3.2 0.2 setosa 0.6270065 0.5425814 0.26599432 0.1467248
4 4.6 3.1 0.2 setosa 0.8001282 0.4691908 0.73060637 0.0792256
5 5.0 3.6 0.2 setosa 0.5663895 0.4745482 0.65088630 0.5360953
6 5.4 3.9 0.4 setosa 0.8813042 0.1560600 0.41734507 0.2582568
7 4.6 3.4 0.3 setosa 0.5046977 0.9555570 0.22118401 0.9246906
8 5.0 3.4 0.2 setosa 0.5283764 0.4730212 0.24982471 0.6313071
9 4.4 2.9 0.2 setosa 0.5976045 0.4717439 0.14270551 0.2149888
10 4.9 3.1 0.1 setosa 0.3919660 0.5125420 0.95001067 0.5259598
# ... with 140 more rows
We can use -> to assign the output to 'temp'
select(iris, -one_of("Petal.Length")) -> temp
Using the tidyverse paradigm, you could use:
dplyr::mutate_at(iris, vars(-one_of("Petal.Length")), .funs = funs(runif))
Although the example above fills the columns with random numbers, it will probably not suit your needs; I suppose you want to match the features and rows to those in temp.
That can be done by transforming iris and temp into long format and then joining and replacing the data accordingly, for example with the *_join functions.
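If the goal is literally to overwrite the selected columns with same-sized data, base R column assignment may be enough. This is a minimal sketch on a plain data frame; the names `df` and `cols` are introduced here only for illustration, and the replacement columns are matched by position.

```r
# Work on a copy so the built-in iris data stays untouched
df <- iris

# Every column except Petal.Length (the columns the question wants to replace)
cols <- setdiff(names(df), "Petal.Length")

# Replacement data of the same size: one random column per selected column
temp <- as.data.frame(
  matrix(runif(nrow(df) * length(cols)), ncol = length(cols))
)

# Overwrite the selected columns in place; column names are kept
df[cols] <- temp

head(df)
```

Note that Species is overwritten with numbers as well, because it is among the selected columns.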

save residuals with `dplyr`

I want to use dplyr to group a data.frame, fit linear regressions and save the residuals as a column in the original, ungrouped data.frame.
Here's an example
iris %>%
select(Species, Sepal.Length, Sepal.Width) %>%
group_by(Species) %>%
do(mod = lm(Sepal.Length ~ Sepal.Width, data=.))
Returns:
Species mod
1 setosa <S3:lm>
2 versicolor <S3:lm>
3 virginica <S3:lm>
Instead, I would like the original data.frame with a new column containing residuals.
For example,
Sepal.Length Sepal.Width resid
1 5.1 3.5 0.04428474
2 4.9 3.0 0.18952960
3 4.7 3.2 -0.14856834
4 4.6 3.1 -0.17951937
5 5.0 3.6 -0.12476423
6 5.4 3.9 0.06808885
I adapted an example from http://jimhester.github.io/plyrToDplyr/.
r <- iris %>%
group_by(Species) %>%
do(model = lm(Sepal.Length ~ Sepal.Width, data=.)) %>%
do((function(mod) {
data.frame(resid = residuals(mod$model))
})(.))
corrected <- cbind(iris, r)
Update: another method is to use the augment function in the broom package:
library(broom)
r <- iris %>%
group_by(Species) %>%
do(augment(lm(Sepal.Length ~ Sepal.Width, data=.)))
Which returns:
Source: local data frame [150 x 10]
Groups: Species
Species Sepal.Length Sepal.Width .fitted .se.fit .resid .hat
1 setosa 5.1 3.5 5.055715 0.03435031 0.04428474 0.02073628
2 setosa 4.9 3.0 4.710470 0.05117134 0.18952960 0.04601750
3 setosa 4.7 3.2 4.848568 0.03947370 -0.14856834 0.02738325
4 setosa 4.6 3.1 4.779519 0.04480537 -0.17951937 0.03528008
5 setosa 5.0 3.6 5.124764 0.03710984 -0.12476423 0.02420180
...
A solution that seems to be easier than the ones proposed so far and closer to the code of the original question is :
iris %>%
group_by(Species) %>%
do(data.frame(., resid = residuals(lm(Sepal.Length ~ Sepal.Width, data=.))))
Result :
# A tibble: 150 x 6
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species resid
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 0.0443
2 4.9 3 1.4 0.2 setosa 0.190
3 4.7 3.2 1.3 0.2 setosa -0.149
4 4.6 3.1 1.5 0.2 setosa -0.180
5 5 3.6 1.4 0.2 setosa -0.125
6 5.4 3.9 1.7 0.4 setosa 0.0681
7 4.6 3.4 1.4 0.3 setosa -0.387
8 5 3.4 1.5 0.2 setosa 0.0133
9 4.4 2.9 1.4 0.2 setosa -0.241
10 4.9 3.1 1.5 0.1 setosa 0.120
Since you are running the exact same regression for each group, you might find it simpler to define your regression model as a function beforehand, and then execute it for each group using mutate.
model<- function(y,x){
a<- y + x
if( length(which(!is.na(a))) <= 2 ){
return( rep(NA, length(a)))
} else {
m<- lm( y ~ x, na.action = na.exclude)
return( residuals(m))
}
}
Note that the first part of this function guards against errors in case the regression is run on a group with too few non-NA observations to leave any residual degrees of freedom. This can happen if you have a data frame with several grouping variables with many levels, or numerous independent variables in your regression (for example lm(y ~ x1 + x2)), and can't afford to inspect each group for sufficient non-NA observations.
So your example can be rewritten as follows:
iris %>% group_by(Species) %>%
mutate(resid = model(Sepal.Length,Sepal.Width) ) %>%
select(Sepal.Length,Sepal.Width,resid)
Which should yield:
Species Sepal.Length Sepal.Width resid
<fctr> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 0.04428474
2 setosa 4.9 3.0 0.18952960
3 setosa 4.7 3.2 -0.14856834
4 setosa 4.6 3.1 -0.17951937
5 setosa 5.0 3.6 -0.12476423
6 setosa 5.4 3.9 0.06808885
This method should not be computationally much different from the one using augment(). (I've had to use both methods on data sets containing several hundred million observations, and believe there was no significant difference in speed compared to using the do() function.)
Also, please note that omitting na.action = na.exclude, or using m$residuals instead of residuals(m), will result in the exclusion of rows that have NAs (dropped prior to estimation) from the output vector of residuals. The corresponding vector will thus not have sufficient length() in order to be merged with the data set, and some error message might appear.
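The length difference described above can be checked on a small example. The data frame `d` below is hypothetical, with one missing response value.

```r
# Hypothetical data with one NA in the response
d <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, NA, 6.2, 7.9, 10.1)
)

# Default na.action (na.omit): the NA row is dropped from the residuals
m_omit <- lm(y ~ x, data = d)

# na.exclude: residuals() pads the result with NA at the dropped row
m_excl <- lm(y ~ x, data = d, na.action = na.exclude)

n_omit     <- length(residuals(m_omit))   # one shorter than nrow(d)
n_excl     <- length(residuals(m_excl))   # same length as nrow(d)
n_excl_raw <- length(m_excl$residuals)    # raw component is not padded
```

Only residuals() on the na.exclude fit has the same length as the data, which is what mutate needs.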

Convert all column headers into regressors in R

I'm completely new to R and would like to turn each column label (header?) in my data set into a regressor without having to define each regressor one at a time, i.e. date -> data$Date
Is there a way to do this all at once?
Thank you in advance!
Is this what you want?
R > data(iris)
R > head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
R > lm(Sepal.Length ~ ., data = iris)
Call:
lm(formula = Sepal.Length ~ ., data = iris)
Coefficients:
(Intercept) Sepal.Width Petal.Length Petal.Width
2.1713 0.4959 0.8292 -0.3152
Speciesversicolor Speciesvirginica
-0.7236 -1.0235
If you want to choose specific columns, you can build the formula from the column names. Here the data is sample.data, with the dependent variable in column 3 and the independent variables in columns 1, 2, and 4:8:
yy <- lm(as.formula(paste(colnames(sample.data)[3], "~", paste(colnames(sample.data)[c(1, 2, 4:8)], collapse = "+"), sep = "")), data = sample.data)
summary(yy)
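The same idea can be written with reformulate() from the stats package, which builds a formula from a character vector of term labels. The sketch below uses iris rather than sample.data, and the column positions are chosen only for illustration.

```r
# Response in column 1, predictors in columns 2 to 4 (illustrative choice)
response   <- colnames(iris)[1]
predictors <- colnames(iris)[c(2, 3, 4)]

# Build the formula Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width
f <- reformulate(predictors, response = response)

fit <- lm(f, data = iris)
summary(fit)
```

This avoids pasting the formula together by hand, though the paste() version above works just as well.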
