Draw only 2 ellipses in PCA plot (instead of 20) - r

I have a PCA plot created with ggplot/ggfortify
and the function autoplot(), such as in this question: Change point colors and color of frame/ellipse around points
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
df <- iris[c(1, 2, 3, 4)]
autoplot(prcomp(df))
autoplot(prcomp(df), data = iris, colour = 'Species')
autoplot(prcomp(df), data = iris, colour = 'Species', shape='Species', frame=T)
Is there a way to draw only 1 or 2 frames/ellipses, instead of all of them, in the PCA plot?

The problem with using autoplot is that, although it is great for producing nice visualizations of common data structures and models with little effort, it doesn't give you the full freedom to customize the plot. However, it is pretty straightforward to do the whole thing within ggplot. The following is a full reprex:
library(ggplot2)
pc <- prcomp(iris[1:4])
df <- cbind(pc$x[,1:2], iris)
ggplot(df, aes(PC1, PC2, color = Species)) +
geom_point() +
stat_ellipse(geom = "polygon", aes(fill = after_scale(alpha(colour, 0.3))),
data = df[df$Species != "versicolor",])
Created on 2022-02-21 by the reprex package (v2.0.1)

Related

way to customize zScore function with r

I am new to R and have a question.
Create a function, zScore, that will take a vector of numbers (x) and converts them to a vector of z-scaled numbers (see code below). (Don't worry about NA's)
#This creates the z-scaled numbers for sepal lengths
(iris$Sepal.Length - mean(iris$Sepal.Length))/sd(iris$Sepal.Length)
#This creates the z-scaled numbers for sepal widths
(iris$Sepal.Width - mean(iris$Sepal.Width))/sd(iris$Sepal.Width)
write a zScore function that is flexible.
thank you for any help you provide
You can use the following code:
# Z-score function
zscore <- function(x) {
(x - mean(x))/sd(x)
}
library(tidyverse)
iris %>%
mutate(zscore_sepal.length = zscore(Sepal.Length)) %>%
mutate(zscore_sepal.width = zscore(Sepal.Width))
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species zscore_sepal.length zscore_sepal.width
1 5.1 3.5 1.4 0.2 setosa -1.95660229 -3.514384
2 4.9 3.0 1.4 0.2 setosa -2.15660229 -4.014384
3 4.7 3.2 1.3 0.2 setosa -2.35660229 -3.814384
4 4.6 3.1 1.5 0.2 setosa -2.45660229 -3.914384
5 5.0 3.6 1.4 0.2 setosa -2.05660229 -3.414384

How to do a Leave One Out cross validation by group / subset?

This question is the second part of a previous question (Linear Regression prediction in R using Leave One out Approach).
I'm trying to build models for each country and generate linear regression predictions using the leave one out approach. In other words, in the code below when building model1 and model2 the "data" used should not be the entire data set. Instead it should be a subset of the dataset (country). Each country data should be evaluated using a model built with data specific to that country.
The code below returns an error. How can I modify/fix the code below to do that? Or is there a better way of doing that?
library(modelr)
install.packages("gapminder")
library(gapminder)
data(gapminder)
#CASE 1
model1 <- lm(lifeExp ~ pop, data = gapminder, subset = country)
model2 <- lm(lifeExp ~ pop + gdpPercap, data = gapminder, subset = country)
models <- list(fit_model1 = model1,fit_model2 = model2)
gapminder %>% nest_by(continent, country) %>%
bind_cols(
map(1:nrow(gapminder), function(i) {
map_dfc(models, function(model) {
training <- data[-i, ]
fit <- lm(model, data = training)
validation <- data[i, ]
predict(fit, newdata = validation)
})
}) %>%
bind_rows()
)
The most succinct and straightforward solution would be a nested for loop approach, where the outer loop is the two model formulae and the inner loop is the unity we want to leave out. This can also be done with outer, which I also show afterwards.
For sake of clarity I first show how to leave out one observation (i.e. one row) in each iteration (Part I). I show later how to leave out one cluster (e.g. country) (Part II). I also use the built-in iris data set, which is smaller and thus easier to handle. It contains a "Species" column that is meant to correspond to the "countries" in your data.
Part I
First, we put the two formulae into a list and name them as we would like them to appear in the resulting columns later.
FOAE <- list(fit1=Petal.Length ~ Sepal.Length,
fit2=Petal.Length ~ Sepal.Length + Petal.Width)
For the loop, we want to initialize a matrix im whose rows correspond to the number of rows we want to leave out, and columns to the number of model formulae.
im <- matrix(NA, nrow=nrow(iris), ncol=length(FOAE),
dimnames=list(NULL, names(FOAE)))
This would look like this:
head(im, n=3)
# fit1 fit2
# [1,] NA NA
# [2,] NA NA
# [3,] NA NA
Now we loop over formulas and rows as described above.
for (i in seq(FOAE)) {
for(j in seq(nrow(iris))) {
train <- iris[-j,]
test <- iris[j,]
fit <- lm(FOAE[[i]], data=train)
im[j, i] <- predict(fit, newdata=test)
}
}
im has now been filled, and we may cbind it to the original iris data set to get our result res1.
res1 <- cbind(iris, im)
head(res1)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# 1 5.1 3.5 1.4 0.2 setosa 2.388501 1.611976
# 2 4.9 3.0 1.4 0.2 setosa 2.014324 1.501389
# 3 4.7 3.2 1.3 0.2 setosa 1.639805 1.392955
# 4 4.6 3.1 1.5 0.2 setosa 1.446175 1.333199
# 5 5.0 3.6 1.4 0.2 setosa 2.201646 1.556620
# 6 5.4 3.9 1.7 0.4 setosa 2.944788 2.127184
To alternatively follow the outer approach, we put the code inside the for loop into a formula which we Vectorize so that it can handle matrix columns (i.e. vectors).
FUN1 <- Vectorize(function(x, y) {
train <- iris[-x,]
test <- iris[x,]
fit <- lm(y, data=train)
predict(fit, newdata=test)
})
Now we put FOAE and the rows 1:nrow(iris) to leave out subsequently, together with FUN1 into outer(). This already gives us the result that we can cbind to iris in the same way as above to get our result res2.
o1 <- outer(FOAE, 1:nrow(iris), FUN1)
res2 <- cbind(iris, o1)
head(res2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# 1 5.1 3.5 1.4 0.2 setosa 2.388501 1.611976
# 2 4.9 3.0 1.4 0.2 setosa 2.014324 1.501389
# 3 4.7 3.2 1.3 0.2 setosa 1.639805 1.392955
# 4 4.6 3.1 1.5 0.2 setosa 1.446175 1.333199
# 5 5.0 3.6 1.4 0.2 setosa 2.201646 1.556620
# 6 5.4 3.9 1.7 0.4 setosa 2.944788 2.127184
## test if results are different is negative
stopifnot(all.equal(res1, res2))
Part II
We may follow a similar approach when leaving out a cluster (i.e. species or countries). I show here the outer method. The thing we want to change is that we now want to leave out observations belonging to a specific cluster, here "Species" (in your case "countries"), which unique values we put into a vector Species.u . Since the values are in "character" or "factor" format we subset the data using data[!data$cluster %in% x, ] instead of data[-x, ]. Because predict would yield multiple values in the clusters, but we want the same value in the respective clusters, we might want to use a statistic, e.g. the mean prediction of each cluster. We use rownames according to the cluster.
FUN2 <- Vectorize(function(x, y) {
train <- iris[!iris$Species %in% x,]
test <- iris[iris$Species %in% x,]
fit <- lm(y, data=train)
mean(predict(fit, newdata=test))
})
Species.u <- unique(iris$Species)
o2 <- `rownames<-`(outer(Species.u, FOAE, FUN2), Species.u)
This now gives us a matrix which is smaller than our data set. Thanks to the rownames we may match the predictions tho the clusters to which they belong.
o2
# fit1 fit2
# setosa 3.609943 2.662609
# versicolor 3.785760 3.909919
# virginica 4.911009 5.976922
res3 <- cbind(iris, o2[match(iris$Species, rownames(o2)), ])
head(res3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# setosa 5.1 3.5 1.4 0.2 setosa 3.609943 2.662609
# setosa.1 4.9 3.0 1.4 0.2 setosa 3.609943 2.662609
# setosa.2 4.7 3.2 1.3 0.2 setosa 3.609943 2.662609
# setosa.3 4.6 3.1 1.5 0.2 setosa 3.609943 2.662609
# setosa.4 5.0 3.6 1.4 0.2 setosa 3.609943 2.662609
# setosa.5 5.4 3.9 1.7 0.4 setosa 3.609943 2.662609
tail(res3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# virginica.44 6.7 3.3 5.7 2.5 virginica 4.911009 5.976922
# virginica.45 6.7 3.0 5.2 2.3 virginica 4.911009 5.976922
# virginica.46 6.3 2.5 5.0 1.9 virginica 4.911009 5.976922
# virginica.47 6.5 3.0 5.2 2.0 virginica 4.911009 5.976922
# virginica.48 6.2 3.4 5.4 2.3 virginica 4.911009 5.976922
# virginica.49 5.9 3.0 5.1 1.8 virginica 4.911009 5.976922
Edit
In this version of FUN2, FUN3, the output of the models of each cluster are rbinded (in two columns of course, because of two models).
FUN3 <- Vectorize(function(x, y) {
train <- iris[!iris$Species %in% x,]
test <- iris[iris$Species %in% x,]
fit <- lm(y, data=train)
(predict(fit, newdata=test))
}, SIMPLIFY=F)
Species.u <- unique(iris$Species)
o3 <- `rownames<-`(outer(Species.u, FOAE, FUN3), Species.u)
res32 <- cbind(iris, apply(o3, 2, unlist))
head(res32)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# setosa.1 5.1 3.5 1.4 0.2 setosa 3.706940 2.678255
# setosa.2 4.9 3.0 1.4 0.2 setosa 3.500562 2.547587
# setosa.3 4.7 3.2 1.3 0.2 setosa 3.294183 2.416919
# setosa.4 4.6 3.1 1.5 0.2 setosa 3.190994 2.351586
# setosa.5 5.0 3.6 1.4 0.2 setosa 3.603751 2.612921
# setosa.6 5.4 3.9 1.7 0.4 setosa 4.016508 3.073249
Edit 2
As I learned in your comment you want 1. a subset of your data along clusters. This would be ss in FUN4 below. Then the ss is also subsetted by leaving out one row z over the rows of subset ss.
FUN4 <- Vectorize(function(x, y) {
## subsets first by cluster then by row
ss <- iris[iris$Species %in% x,] ## cluster subset
sapply(1:nrow(ss), function(z) { ## subset rows using `sapply`
train <- ss[-z,] ## train data w/o row z
test <- ss[z,] ## test data for `predict`, just row z
fit <- lm(y, data=train)
predict(fit, newdata=test)
})
}, SIMPLIFY=F)
## the two models
FOAE <- list(fit1=Petal.Length ~ Sepal.Length,
fit2=Petal.Length ~ Sepal.Length + Petal.Width)
## unique cluster names
Species.u <- unique(iris$Species)
## with the `outer` we iterate over all the permutations of clusters and models `FOAE`.
o4 <- `rownames<-`(outer(Species.u, FOAE, FUN4), Species.u)
## `unlist`ed result is directly `cbind`able to original data
res4 <- cbind(iris, apply(o4, 2, unlist))
## result
head(res4)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# setosa.1 5.1 3.5 1.4 0.2 setosa 1.476004 1.451029
# setosa.2 4.9 3.0 1.4 0.2 setosa 1.449120 1.431737
# setosa.3 4.7 3.2 1.3 0.2 setosa 1.426185 1.416492
# setosa.4 4.6 3.1 1.5 0.2 setosa 1.404040 1.398103
# setosa.5 5.0 3.6 1.4 0.2 setosa 1.462460 1.441295
# setosa.6 5.4 3.9 1.7 0.4 setosa 1.504990 1.559045

save residuals with `dplyr`

I want to use dplyr to group a data.frame, fit linear regressions and save the residuals as a column in the original, ungrouped data.frame.
Here's an example
> iris %>%
select(Sepal.Length, Sepal.Width) %>%
group_by(Species) %>%
do(mod = lm(Sepal.Length ~ Sepal.Width, data=.)) %>%
Returns:
Species mod
1 setosa <S3:lm>
2 versicolor <S3:lm>
3 virginica <S3:lm>
Instead, I would like the original data.frame with a new column containing residuals.
For example,
Sepal.Length Sepal.Width resid
1 5.1 3.5 0.04428474
2 4.9 3.0 0.18952960
3 4.7 3.2 -0.14856834
4 4.6 3.1 -0.17951937
5 5.0 3.6 -0.12476423
6 5.4 3.9 0.06808885
I adapted an example from http://jimhester.github.io/plyrToDplyr/.
r <- iris %>%
group_by(Species) %>%
do(model = lm(Sepal.Length ~ Sepal.Width, data=.)) %>%
do((function(mod) {
data.frame(resid = residuals(mod$model))
})(.))
corrected <- cbind(iris, r)
update Another method is to use the augment function in the broom package:
r <- iris %>%
group_by(Species) %>%
do(augment(lm(Sepal.Length ~ Sepal.Width, data=.))
Which returns:
Source: local data frame [150 x 10]
Groups: Species
Species Sepal.Length Sepal.Width .fitted .se.fit .resid .hat
1 setosa 5.1 3.5 5.055715 0.03435031 0.04428474 0.02073628
2 setosa 4.9 3.0 4.710470 0.05117134 0.18952960 0.04601750
3 setosa 4.7 3.2 4.848568 0.03947370 -0.14856834 0.02738325
4 setosa 4.6 3.1 4.779519 0.04480537 -0.17951937 0.03528008
5 setosa 5.0 3.6 5.124764 0.03710984 -0.12476423 0.02420180
...
A solution that seems to be easier than the ones proposed so far and closer to the code of the original question is :
iris %>%
group_by(Species) %>%
do(data.frame(., resid = residuals(lm(Sepal.Length ~ Sepal.Width, data=.))))
Result :
# A tibble: 150 x 6
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species resid
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 0.0443
2 4.9 3 1.4 0.2 setosa 0.190
3 4.7 3.2 1.3 0.2 setosa -0.149
4 4.6 3.1 1.5 0.2 setosa -0.180
5 5 3.6 1.4 0.2 setosa -0.125
6 5.4 3.9 1.7 0.4 setosa 0.0681
7 4.6 3.4 1.4 0.3 setosa -0.387
8 5 3.4 1.5 0.2 setosa 0.0133
9 4.4 2.9 1.4 0.2 setosa -0.241
10 4.9 3.1 1.5 0.1 setosa 0.120
Since you are be running the exact same regression for each group, you might find it simpler to just define your regression model as a function() beforehand, and then execute it for each group using mutate.
model<- function(y,x){
a<- y + x
if( length(which(!is.na(a))) <= 2 ){
return( rep(NA, length(a)))
} else {
m<- lm( y ~ x, na.action = na.exclude)
return( residuals(m))
}
}
Note, that the first part of this function is to insure against any error messages popping up in case your regression is run on a group with less than zero degrees of freedom (This might be the case if you have a dataframe with several grouping variables with many levels , or numerous independent variables for your regression (like for example lm(y~ x1 + x2)), and can't afford to inspect each of them for sufficient non-NA observations).
So your example can be rewritten as follows:
iris %>% group_by(Species) %>%
mutate(resid = model(Sepal.Length,Sepal.Width) ) %>%
select(Sepal.Length,Sepal.Width,resid)
Which should yield:
Species Sepal.Length Sepal.Width resid
<fctr> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 0.04428474
2 setosa 4.9 3.0 0.18952960
3 setosa 4.7 3.2 -0.14856834
4 setosa 4.6 3.1 -0.17951937
5 setosa 5.0 3.6 -0.12476423
6 setosa 5.4 3.9 0.06808885
This method should not be computationally much different from the one using augment().(I've had to use both methods on data sets containing several hundred million observations, and believe there was no significant difference in terms of speed compared to using the do() function).
Also, please note that omitting na.action = na.exclude, or using m$residuals instead of residuals(m), will result in the exclusion of rows that have NAs (dropped prior to estimation) from the output vector of residuals. The corresponding vector will thus not have sufficient length() in order to be merged with the data set, and some error message might appear.

Plotting a table in R

I want to plot a table of my data frame with lines and color formatting. So far the only solution I have come up with is through packages xtables and data.table but I seem to get the output in Latex code.
How do I output this code in R to see what it looks like?
Is there no simple solution for plotting a table with the base package?
Use plot function.
For example data set iris, which is available in R by default.
There are 5 columns there.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
You can plot it using
plot(Sepal.Length ~ Sepal.Width, data = iris)

Convert all column headers into regressors in R

I'm completely new to R and would like to turn each column label (header?) in my data set into a regressor without having to define each regressor one at a time, i.e. date -> data$Date
Is there a way to do this all at once?
Thank you in advance!
Is this what you want:
R > data(iris)
R > head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
R > lm(Sepal.Length ~ ., data = iris)
Call:
lm(formula = Sepal.Length ~ ., data = iris)
Coefficients:
(Intercept) Sepal.Width Petal.Length Petal.Width
2.1713 0.4959 0.8292 -0.3152
Speciesversicolor Speciesvirginica
-0.7236 -1.0235
If you want to choose specific column you can use this:
data is sample.data with dependent variable in col 3 and cols 1, 2, 4:8 are independent variables
yy<-lm(as.formula(paste(colnames(sample.data)[3], "~",paste(colnames(sample.data)[c(1, 2,4:8)], collapse = "+"), sep = "")), data=sample.data)
)
summary(yy)

Resources