Replicate multiple regression plot from Excel in R

I am struggling to reproduce in R a multiple linear regression plot that is rather easy to obtain in Excel.
Here is an example. Say I have the following data frame (called test) in R:
y x1 x2 x3
2 5 5 9
6 4 2 9
4 2 6 15
7 5 10 6
7 5 10 6
5 4 3 12
To generate a linear regression, I simply write:
reg=lm(y ~ x1 + x2 + x3, data = test)
Now I would like to create a plot of the actual values of the y variable, the predicted y and, on a secondary axis, the standardised residuals. I add a screenshot from Excel so you can see what I mean.
The plot is in Italian: "y" means observed y values, "Y prevista" means predicted Y values and "Residui standard" means standardized residuals. The standardized residuals are plotted on a secondary axis.
If anyone could show me how I can achieve the above in R, it would be much appreciated.

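Use something like
matplot(seq(nrow(test)), cbind(test$y, predict(reg), rstudent(reg)), type="l")
but you would have to set the axes yourself, since the residuals are on a different scale from y. Note that rstudent() returns studentized residuals; rstandard() returns standardized ones.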
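As a rough base-R sketch of "setting the axes", one option is to draw y and the fitted values on the left axis, then overlay the residuals with par(new=TRUE) and give them their own axis on the right. The data are the six rows from the question; the colours, symbols and labels are my own choices, and rstandard() is assumed to be close enough to what Excel calls "standard residuals".

```r
# Data from the question
test <- data.frame(y  = c(2, 6, 4, 7, 7, 5),
                   x1 = c(5, 4, 2, 5, 5, 4),
                   x2 = c(5, 2, 6, 10, 10, 3),
                   x3 = c(9, 9, 15, 6, 6, 12))
reg <- lm(y ~ x1 + x2 + x3, data = test)

obs     <- seq_len(nrow(test))   # observation index for the x-axis
y_hat   <- fitted(reg)           # predicted y
std_res <- rstandard(reg)        # standardized residuals

op <- par(mar = c(5, 4, 4, 4) + 0.1)   # leave room for the right-hand axis

# Left axis: observed and predicted y
plot(obs, test$y, type = "b", pch = 16, col = "blue",
     ylim = range(test$y, y_hat), xlab = "Observation", ylab = "y")
lines(obs, y_hat, type = "b", pch = 17, col = "red")

# Right axis: residuals drawn on top with their own scale
par(new = TRUE)
plot(obs, std_res, type = "b", pch = 15, col = "darkgreen",
     axes = FALSE, xlab = "", ylab = "")
axis(side = 4)
mtext("Standardized residuals", side = 4, line = 3)

legend("topleft", legend = c("y", "Predicted y", "Standardized residuals"),
       col = c("blue", "red", "darkgreen"), pch = c(16, 17, 15), lty = 1, bty = "n")
par(op)   # restore the previous margins
```

The two plot() calls share the same x values, so the series line up horizontally even though each has its own vertical scale.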

You could try something like this, which also makes the code easier to debug.
test <- data.frame(y=c(2,6,4,7,7,5), x1=c(5,4,2,5,5,4), x2=c(5,2,6,10,10,3),
                   x3=c(9,9,15,6,6,12))
reg=lm(y ~ x1 + x2 + x3, data = test)
# Add new columns to dataframe from regression
test$yhat <- reg$fitted.values
test$resid <- reg$residuals
# Create your x-variable column
test$X <-seq(nrow(test))
library(ggplot2)
library(reshape2)
# Columns to keep
keep = c("y", "yhat", "resid", "X")
# Drop columns not needed
test <-test[ , keep, drop=FALSE]
# Reshape for easy plotting
test <- melt(test, id.vars="X")
# Everything on the same plot
ggplot(test, aes(X,value, col=variable)) +
  geom_point() +
  geom_line()
For a different look, you could also replace geom_line() with geom_smooth().

Related

Plotting a facet grid in R using ggplot2 with only one variable

I have a data frame, called mouse.data, with 3 columns: Eigenvalues, DualEigenvalues and Experiment. This question does not concern the DualEigenvalues data, so that can be forgotten.
We ran 5 experiments and used the data from each experiment to calculate 14 eigenvalues. So the first 14 rows of this data frame are the 14 eigenvalues of the first experiment, with the experiment entry having value 1, the second 14 rows are the 14 eigenvalues of the second experiment with the experiment entry having value 2 etc.
I am then plotting the eigenvalues of each pairwise experiment against each other, here is an example of this code:
eigen.1 <- mouse.data$Eigenvalues[mouse.data$Experiment == 1]
eigen.2 <- mouse.data$Eigenvalues[mouse.data$Experiment == 2]
p.data <- data.frame(x = eigen.1, y = eigen.2)
ggplot(p.data, aes(x,y)) + geom_abline(slope = 1, colour = "red") + geom_point()
This gives me a graph like this one:
This is precisely what I want this graph to look like.
What I would like to do, but can't work out, is to plot a facet_grid so that the plot in the ith row and jth column plots the eigenvalues from the ith experiment on the y-axis and the eigenvalues from the jth experiment on the x-axis.
This is the closest I have got so far, I hope this makes it clearer what I mean.
This is tricky without a reproducible example of your data, but it sounds like we can roughly approximate the structure of your data frame like this:
library(ggplot2)
set.seed(1)
Eigen <- as.vector(sapply(runif(5, .5, 1.5),
function(x) sort(rgamma(14, 2, 0.02*x))))
mouse.data <- data.frame(Experiment = rep(seq(5), each = 14), Eigenvalue = Eigen)
head(mouse.data)
#> Experiment Eigenvalue
#> 1 1 39.61451
#> 2 1 44.48163
#> 3 1 54.57964
#> 4 1 75.06725
#> 5 1 75.50014
#> 6 1 94.41255
The key to getting the plot to work is to reshape your data into a long-format data frame that contains each combination of experiments. One way to do this is to split the data frame by Experiment, then use simple indexing of the resultant list (using rep) to get all 25 ordered pairs of data frames (each experiment paired with every experiment, including itself). Each pair is stuck together column-wise, then the resultant 25 data frames are all joined row-wise into the plotting data frame.
experiments <- split(mouse.data, mouse.data$Experiment)
experiments <- mapply(cbind,
                      experiments[rep(1:5, 5)],
                      experiments[rep(1:5, each = 5)],
                      SIMPLIFY = FALSE)
p.data <- do.call(rbind, lapply(experiments, setNames,
                                nm = c("Experiment1", "x",
                                       "Experiment2", "y")))
Once we have done this, we can use your plot code, with the addition of a facet_grid call:
ggplot(p.data, aes(x,y)) +
  geom_abline(slope = 1, colour = "red") +
  geom_point() +
  # rows = experiment on the y-axis (Experiment2), columns = experiment on the x-axis (Experiment1)
  facet_grid(Experiment2~Experiment1)
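For what it's worth, the same pairing can also be sketched with merge() instead of split/mapply. This is only an alternative under my own assumptions: the simulated mouse.data is a stand-in, and the Rank column (position of each eigenvalue within its experiment) and the names ExperimentX and ExperimentY are made up for this illustration.

```r
set.seed(1)
# Stand-in data: 5 experiments with 14 sorted "eigenvalues" each
mouse.data <- data.frame(
  Experiment = rep(1:5, each = 14),
  Eigenvalue = as.vector(sapply(1:5, function(i) sort(rgamma(14, shape = 2, rate = 0.02))))
)

# Position of each eigenvalue within its experiment (hypothetical helper column)
mouse.data$Rank <- rep(1:14, times = 5)

# Two renamed copies: one supplies the x values, the other the y values
xs <- setNames(mouse.data, c("ExperimentX", "x", "Rank"))
ys <- setNames(mouse.data, c("ExperimentY", "y", "Rank"))

# Joining on Rank pairs the k-th eigenvalue of every experiment with the
# k-th eigenvalue of every experiment: 5 * 5 * 14 rows in total
p.data <- merge(xs, ys, by = "Rank")

# Each of the 25 experiment pairs should contribute 14 points
table(p.data$ExperimentX, p.data$ExperimentY)
```

The resulting p.data can be passed to the same ggplot call, faceting with facet_grid(ExperimentY ~ ExperimentX) so that rows follow the y-axis experiment and columns the x-axis experiment.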

K-Means Clustering Method

Description of the data
I am trying to produce in R a suitable graphical display of the cluster means.
How can I place the attributes on the x-axis and treat the means for each cluster as trajectories over the items?
All the data is continuous.
What about the following approach: since your variables are on a similar measurement scale (e.g. Likert scale) you could show the distribution of each variable within each cluster (with e.g. boxplots) and visually compare their distribution by using the same axis limits on every cluster.
This can be accomplished by putting your data in a suitable format and using the ggplot2 package to generate the plot. This is shown below.
Step 1: Generate simulated data to mimic the numeric data you have
The generated data contains four non-negative integer variables and a cluster variable with 3 clusters.
library(ggplot2)
set.seed(1717) # make the simulated data repeatable
N = 100
nclusters = 3
cluster = as.numeric( cut(rnorm(N), breaks=nclusters, labels=seq_len(nclusters)) )
df = data.frame(cluster=cluster,
                x1=floor(cluster + runif(N)*5),
                x2=floor(runif(N)*5),
                x3=floor((nclusters-cluster) + runif(N)*5),
                x4=floor(cluster + runif(N)*5))
df$cluster = factor(df$cluster) # define cluster as factor to ease plotting code below
tail(df)
table(df$cluster)
whose output is:
    cluster x1 x2 x3 x4
95        2  5  2  5  2
96        3  5  4  0  3
97        3  3  3  1  7
98        2  5  4  3  3
99        3  6  1  1  7
100       3  5  1  2  5
 1  2  3
15 64 21
i.e., out of the 100 simulated cases, the data contains 15 cases in cluster 1, 64 cases in cluster 2 and 21 cases in cluster 3.
Step 2: Prepare the data for plotting
Here we use reshape() from the stats package to transpose the dataset from wide to long so that the four numeric variables (x1, x2, x3, x4) are placed into one single column, suitable for generating a boxplot for each of the four variables which are then grouped by the cluster variable.
vars2transpose = c("x1", "x2","x3", "x4")
df.long = reshape(df, direction="long", idvar="id",
                  varying=list(vars2transpose),
                  timevar="var", times=vars2transpose, v.names="value")
head(df.long)
table(df.long$cluster)
whose output is:
     cluster var value id
1.x1       1  x1     5  1
2.x1       1  x1     3  2
3.x1       3  x1     5  3
4.x1       1  x1     1  4
5.x1       2  x1     3  5
6.x1       1  x1     2  6
  1   2   3
 60 256  84
Note that the number of cases in each cluster has increased 4-fold (i.e. the number of numeric variables) since the data is now in transposed long format.
Step 3: Create the variable's boxplots by cluster with line-connected means
We plot horizontal boxplots for each variable x1, x2, x3, x4 that show their distribution in each cluster, and mark the mean values with connected red crosses (the trajectories you are after).
gg <- ggplot(df.long, aes(x=var, y=value))
gg + facet_grid(cluster ~ ., labeller=label_both) +
  geom_boxplot(aes(fill=cluster)) +
  # fun replaces fun.y, which is deprecated since ggplot2 3.3.0
  stat_summary(fun="mean", geom="point", color="red", pch="x", size=3) +
  stat_summary(fun="mean", geom="line", color="red", aes(group=1)) +
  coord_flip() # swap the x and y axis to make boxplots horizontal instead of vertical
which generates the following graph.
The graph might get packed with the many variables you have, so you may want to:
either show vertical boxplots by removing the last coord_flip() line
or remove the boxplots altogether and just show the connected red crosses by eliminating the geom_boxplot() line.
And if you want to compare each variable side by side among the different clusters, you can swap the grouping and x-axis variables as follows:
gg <- ggplot(df.long, aes(x=cluster, y=value))
gg + facet_grid(var ~ ., labeller=label_both) +
  geom_boxplot(aes(group=cluster, fill=cluster)) +
  # fun replaces fun.y, which is deprecated since ggplot2 3.3.0
  stat_summary(fun="mean", geom="point", color="red", pch="x", size=3) +
  stat_summary(fun="mean", geom="line", color="red", aes(group=1)) +
  coord_flip() # swap the x and y axis to make boxplots horizontal instead of vertical
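If only the trajectories of the means are wanted (attributes on the x-axis, one line per cluster), a base-R sketch could look like the following. The data are simulated along the lines of Step 1, with cluster labels drawn by sample() for brevity, so the numbers are purely illustrative.

```r
set.seed(1717)
N <- 100
cluster <- sample(1:3, N, replace = TRUE)   # simulated cluster labels (assumption)
df <- data.frame(cluster = factor(cluster),
                 x1 = floor(cluster + runif(N) * 5),
                 x2 = floor(runif(N) * 5),
                 x3 = floor((3 - cluster) + runif(N) * 5),
                 x4 = floor(cluster + runif(N) * 5))

# Mean of every attribute within each cluster
means <- aggregate(. ~ cluster, data = df, FUN = mean)

# Transpose so that rows are attributes and columns are clusters;
# matplot() then draws one line per column, i.e. one trajectory per cluster
m <- t(as.matrix(means[, -1]))

matplot(m, type = "b", pch = 1:3, lty = 1, col = 1:3,
        xaxt = "n", xlab = "Attribute", ylab = "Cluster mean")
axis(1, at = seq_len(nrow(m)), labels = rownames(m))
legend("topright", legend = paste("Cluster", means$cluster),
       col = 1:3, pch = 1:3, lty = 1, bty = "n")
```

With real k-means output, the cluster column would presumably come from the fitted object (for example the cluster component returned by kmeans()) rather than from sample().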

plotCI: how to overlay plots of two variables

I am trying to plot populations of predators and of prey over time, with confidence intervals. I can plot the two separately, but how do I plot them on the same graph?
#take mean, number, and create se of prey(d)
d.means=tapply(mydata$prey,mydata$week, mean)
d.n=tapply(mydata$prey,mydata$week, length)
d.se=tapply(mydata$prey,mydata$week, sd)/sqrt(d.n)
#plot with se using plotrix
plotCI(as.numeric(row.names(d.means)),d.means,d.se,ylim=c(0,400),pch=19,gap=0,xlab="Week",ylab="d, w population")
#take mean, number, and create se of predator(w)
w.means=tapply(mydata$pred,mydata$week, mean)
w.n=tapply(mydata$pred,mydata$week, length)
w.se=tapply(mydata$pred,mydata$week, sd)/sqrt(w.n)
#plot with se using plotrix
plotCI(as.numeric(row.names(w.means)),w.means,w.se,ylim=c(0,400),pch=19,gap=0,xlab="Week",ylab="d, w population")
After the first plot, run the code below before drawing the next plot:
par(new=TRUE)
Make sure that you set xlim and ylim to accommodate both plots. You will also need to pass axes=FALSE and ann=FALSE to the second plot, so that its axes and labels are not drawn over those of the first.
These graphical features are discussed in detail in the ebook "R Fundamentals & Graphics". You might want to use it as a desk reference.
#take mean, number, and create se of prey(d)
d.means=tapply(mydata$prey,mydata$week, mean)
d.n=tapply(mydata$prey,mydata$week, length)
d.se=tapply(mydata$prey,mydata$week, sd)/sqrt(d.n)
#take mean, number, and create se of predator(w)
w.means=tapply(mydata$pred,mydata$week, mean)
w.n=tapply(mydata$pred,mydata$week, length)
w.se=tapply(mydata$pred,mydata$week, sd)/sqrt(w.n)
Here you have created all the variables you need, but to plot them with ggplot they need to be in a tall dataset with a variable indicating whether each row is predator or prey. I also added a time variable; I think yours would be week.
x=data.frame(means=c(w.means,d.means),
             n=c(w.n,d.n),
             se=c(w.se,d.se),
             role=c(rep("pred",length(w.n)),rep("prey",length(d.n))),
             time=c(1:length(w.n),1:length(d.n))
)
I don't know exactly what your data look like so here is a fake one I cooked up just to illustrate the format.
      means  n         se role time
1 0.9874234 10 0.16200575 pred    1
2 1.4120207 12 0.08895026 pred    2
3 2.7352516  8 0.07991036 pred    3
4 1.1301248 11 0.05481813 prey    1
5 2.4810040 13 0.28682585 prey    2
6 3.1546947  9 0.22126054 prey    3
Once the data are in this nice format using ggplot is really pretty easy.
ggplot(x, aes(x=time, y=means, colour=role)) +
  geom_errorbar(aes(ymin=means-se, ymax=means+se), width=.1) +
  geom_line()
That gives this:
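For completeness, here is a base-R sketch that overlays both series without plotrix or ggplot2: the first series is drawn with plot(), the second is added with points(), and the error bars are drawn with arrows(). The mydata frame is simulated, since the original data were not posted, and summarise_by_week is a helper name I made up.

```r
set.seed(42)
# Simulated stand-in for the poster's data: 5 samples per week for 8 weeks
mydata <- data.frame(week = rep(1:8, each = 5),
                     prey = rpois(40, 200),
                     pred = rpois(40, 50))

# Mean and standard error per week (hypothetical helper)
summarise_by_week <- function(v, week) {
  m  <- tapply(v, week, mean)
  se <- tapply(v, week, sd) / sqrt(tapply(v, week, length))
  data.frame(week = as.numeric(names(m)), mean = as.vector(m), se = as.vector(se))
}
d <- summarise_by_week(mydata$prey, mydata$week)
w <- summarise_by_week(mydata$pred, mydata$week)

# One y range that accommodates both series including their error bars
ylim <- range(d$mean - d$se, d$mean + d$se, w$mean - w$se, w$mean + w$se)

plot(d$week, d$mean, type = "b", pch = 19, ylim = ylim,
     xlab = "Week", ylab = "d, w population")
arrows(d$week, d$mean - d$se, d$week, d$mean + d$se,
       angle = 90, code = 3, length = 0.03)

# Add the second series to the existing plot instead of starting a new one
points(w$week, w$mean, type = "b", pch = 1, col = "red")
arrows(w$week, w$mean - w$se, w$week, w$mean + w$se,
       angle = 90, code = 3, length = 0.03, col = "red")
legend("right", legend = c("prey (d)", "predator (w)"),
       col = c("black", "red"), pch = c(19, 1), lty = 1, bty = "n")
```

Because the second series is added with points() rather than a second plot() call, there is no need for par(new=TRUE) and both series share the same axes.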

R plot function - axes for a line chart

Assume the following frequency table in R, which comes out of a survey:
      1  2  3  4  5  8
m     5 16  3 16  5  0
f    12 25  3 10  3  1
NA    1  0  0  0  0  0
The rows stand for the gender of the survey respondent (male/female/no answer). The columns represent the answers to a question on a 5-point scale (let's say: 1 = agree fully, 2 = agree somewhat, 3 = neither agree nor disagree, 4 = disagree somewhat, 5 = disagree fully, 8 = no answer).
The data is stored in a dataframe called "slm", the gender variable is called "sex", the other variable is called "tv_serien".
My problem is that I cannot find a proper way (in my opinion) to create a line chart where the x-axis represents the 5-point scale (plus the "don't know" answers) and the y-axis represents the frequencies for every point on the scale. Furthermore, I want to draw two lines (one for males, one for females).
My solution so far is the following:
I create a plot without plotting the "content" and the x-axis:
plot(slm$tv_serien, xlim = c(1,6), ylim = c(0,100), type = "n", xaxt = "n")
The problem here is that it feels like cheating to specify xlim=c(1,6), because the raw scores of slm$tv_serien are 100 values. I also tried to plot the variable via plot(factor(slm$tv_serien)...), but then it would still create a metric scale from 1 to 8 (because the "don't know" answer is 8).
So my first question is how to tell R that it should take the six distinct values (1 to 5 and 8) and take that as the x-axis?
I create the new x axis with proper labels:
axis(1, 1:6, labels = c("1", "2", "3", "4", "5", "DK"))
At least that works pretty well. ;-)
Next I create the line for the males:
lines(1:5, table(slm$tv_serien[slm$sex == 1]), col = "blue")
The problem here is that there is no DK (=8) answer, so I manually have to specify x = 1:5 instead of 1:6 in the "normal" case. My question here is, how to tell R to also draw the line for nonexisting values? For example, what would have happened, if no male had answered with 3, but I want a continuous line?
At last I create the line for females, which works well:
lines(1:6, table(slm$tv_serien[slm$sex == 2]), col = "red")
To summarize:
How can I tell R to take the 6 distinct values of slm$tv_serien as the x axis?
How can I draw continuous lines even if the line contains "0"?
Thanks for your help!
PS: Attached you find the current plot for the above-mentioned functions.
PPS: I tried to make a list from "1." to "4." but it seems that every new list element started again with "1.". Sorry.
Edit: Response to OP's comment.
This directly creates a line chart of OP's data. Below this is the original answer using ggplot, which produces a far superior output.
Given the frequency table you provided,
# freqTable is assumed to hold the frequency table above as a matrix
df <- data.frame(t(freqTable)) # transpose (more suitable for plotting)
df <- cbind(Response=rownames(df),df) # add row names as first column
x <- seq_len(nrow(df)) # positions 1 to 6, so that answer "8" is drawn right after "5"
plot(x,df$f,type="b",col="red",
     xaxt="n", ylab="Count",xlab="Response")
lines(x,df$m,type="b",col="blue")
axis(1,at=x,labels=c("Str.Agr.","Sl.Agr","Neither","Sl.Disagr","Str.Disagr","NA"))
Produces this, which seems like what you were looking for.
Original Answer:
Not quite what you asked for, but converting your frequency table to a data frame, df
df <- data.frame(freqTable)
df <- cbind(Gender=rownames(df),df) # append rownames (Gender)
df <- df[-3,] # drop unknown gender
df
# Gender X1 X2 X3 X4 X5 X8
# m m 5 16 3 16 5 0
# f f 12 25 3 10 3 1
library(ggplot2)
library(reshape2)
gg=melt(df)
labels <- c("Agree\nFully","Somewhat\nAgree","Neither Agree\nnor Disagree","Somewhat\nDisagree","Disagree\nFully", "No Answer")
ggp <- ggplot(gg,aes(x=variable,y=value))
ggp <- ggp + geom_bar(aes(fill=Gender), position="dodge", stat="identity")
ggp <- ggp + scale_x_discrete(labels=labels)
ggp <- ggp + theme(axis.text.x = element_text(angle=90, vjust=0.5))
ggp <- ggp + labs(x="", y="Frequency")
ggp
Produces this:
Or, this, which is much better:
ggp + facet_grid(Gender~.)
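On the two questions as asked (six distinct values on the x-axis, and lines that continue through zero counts), one possible base-R sketch is to turn the answers into a factor with explicit levels before tabulating, so that missing categories are kept as zeros. The slm data below are simulated stand-ins with the same column names as in the question.

```r
set.seed(1)
# Simulated stand-in for the survey data (sex: 1 = male, 2 = female)
slm <- data.frame(
  sex = sample(1:2, 100, replace = TRUE),
  tv_serien = sample(c(1:5, 8), 100, replace = TRUE,
                     prob = c(0.20, 0.35, 0.10, 0.25, 0.08, 0.02))
)

# Explicit levels: table() then reports a count (possibly 0) for every level
answers <- factor(slm$tv_serien, levels = c(1:5, 8))
counts  <- table(slm$sex, answers)   # 2 x 6 matrix of frequencies

# Plot against positions 1..6 rather than the raw codes, so "8" sits next to "5"
x <- seq_len(ncol(counts))
plot(x, counts["1", ], type = "b", col = "blue", ylim = c(0, max(counts)),
     xaxt = "n", xlab = "Answer", ylab = "Frequency")
lines(x, counts["2", ], type = "b", col = "red")
axis(1, at = x, labels = c("1", "2", "3", "4", "5", "DK"))
legend("topright", legend = c("male", "female"),
       col = c("blue", "red"), lty = 1, pch = 1, bty = "n")
```

Because every level always has an entry, both lines can use the same x positions, and there is no need to special-case a group in which nobody chose a particular answer.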

Draw a lot of plots on the same canvas (clean way)

I want to draw a number of similar plots with a loop.
What I do is:
plot(0, 0, type="l", col="white", xlim=range(1,N), ylim=range(0.5, 2.5)) # provide axes, frame, ...
for(col in colors)
{
  X <- generate_X() # vector of random numbers
  lines(1:N, X, type="l", col=col)
}
The problem is that the random numbers sometimes fall outside range(0.5, 2.5), so I want to widen the ylim range. At the moment I am going to compute it with min and max before plotting, but there must be a much cleaner way that I can't find anywhere.
I think I'm missing something basic about plotting, but I couldn't find the solution.
Thanks
I think there are two quick answers to the OP's question:
calculate the plot range before initializing the plot (implied by OP), or
use a "cleaner" plotting wrapper function.
Setup: First we need to define the variables and functions the OP implies and then generate some data to work with.
# Initialize our N number of X points and
# colors vector.
N <- 20
colors <- c("yellow", "red", "blue", "green")
# Create function 'generate_X' to perform
# as implied by the OP.
generate_X <- function(.N){
  rnorm(n=.N, mean=0, sd=1)
}
# Generate the entire data frame
# using the 'matrix' function to shape
# the data quickly.
data <- data.frame(
  id=1:N,
  matrix(
    generate_X(N*length(colors)),
    ncol=length(colors)
  )
)
The above code simply initializes the variables, function, and data needed for the OP's example.
Method 1: Calculate the plot range and initialize the plot. This is pretty easy using the 'range' function. In the data frame we created, there is an "id" column for our x values, so we use the range of 'data$id' for our x. Then, we find the range of all the data across every column EXCEPT the first column (data[,-1]) to find the overall y range. We initialize with the color white, since our background is also white. Otherwise, we would have a point in the lower-left and upper-right corners. I added x and y labels just for looks.
plot(
  range(data$id),
  range(data[,-1]),
  col="white",
  xlab="x",
  ylab="y")
Next we just loop through and plot the lines.
for(i in 1:length(colors)){
  lines(data$id, data[, i + 1], type="l", col=colors[i])
}
This is essentially the same thing the OP demonstrated, but it's adapted slightly to accept a data frame as input. It's far easier to reference columns using an integer counter (i in this case) rather than the list of colors.
Method 2: There are a lot of plot wrapper packages out there, and one of the most popular is the 'ggplot2' package, and for good reason. You can avoid a lot of the looping hassle with plots by feeding shaped data into a 'ggplot' function. The code here is much "cleaner" from a reading perspective.
# Load packages for shaping data and plotting.
library(reshape2)
library(ggplot2)
First, we need the 'reshape2' package, because we want to use "melted" data in our plot. This just makes the 'ggplot' code WAY cleaner. Then, we load up the 'ggplot2' package for the plotting.
For our plot, we initialize a plot without any instructions, so we can specify them in the geometry layer. If we were creating multiple layers from the same data, we would specify the options in the base plot layer, but for this, we are only creating a single geometry layer with lines. The + allows us to add plot layers.
Next, we choose a geometry layer ('geom_line' in this case) and specify the data as melt(data, id.vars="id"). This shapes our data for the 'ggplot' function to use with minimal code. We use the "id" column as the ID variable, since that contains our x values. The shaped data now looks more like this:
# id variable value
# 1 1 X1 -0.280035386
# 2 2 X1 -0.371020958
# 3 3 X1 -0.239889784
# 4 4 X1 0.450357442
# 5 5 X1 -0.801697283
# 6 6 X1 -0.453057841
# 7 7 X1 -0.451321958
# 8 8 X1 0.948124835
# 9 9 X1 2.724205279
# 10 10 X1 -0.725622824
# 11 11 X1 0.475545293
# 12 12 X1 0.533060822
# 13 13 X1 -1.928335572
# 14 14 X1 -0.466790259
# 15 15 X1 -1.606005895
# 16 16 X1 0.005678344
# 17 17 X1 -1.719827853
# 18 18 X1 0.601011314
# 19 19 X1 -2.056315661
# 20 20 X1 1.006169713
# 21 1 X2 -1.591227194
# ...
# 80 20 X4 -1.045224561
You don't need to get too hung up on the shaping. Just understand that "melted" data works better with the 'ggplot' functions. We specify our melted data as the data for our geometry layer, and then we use the 'aes' function to tell the geometry layer how to deal with our data. Our x values are in the "id" column, and our y values are in the "value" column. The next part is what removes the loops: we specify the color to be differentiated based on the "variable" column. In our melted data, the "variable" column contains the name of the column that the data originally came from, and using it to specify the color will tell 'ggplot' to automatically change the color for each new "variable" value.
ggplot() +
  geom_line(
    data=melt(data, id.vars="id"),
    aes(
      x=id,
      y=value,
      col=variable
    ),
    lwd=1,
    alpha=0.7)
I specified the line width ("lwd") and alpha values just to make the graph a little more readable.
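If staying in base graphics is preferred, a third option worth sketching is matplot(), which takes a matrix with one column per line and works out the y range from all columns by itself. The generate_X below is a made-up stand-in for the OP's function, and the colour vector is only an example.

```r
set.seed(123)
N <- 20
colors <- c("orange", "red", "blue", "green")

# Stand-in for the OP's generator: a vector of N random numbers (assumption)
generate_X <- function(n) rnorm(n, mean = 1.5, sd = 0.5)

# One column per colour; sapply() binds the generated vectors into an N x 4 matrix
X <- sapply(colors, function(col) generate_X(N))

# matplot() draws every column as its own line and chooses ylim to fit all of them
matplot(seq_len(N), X, type = "l", lty = 1, col = colors,
        xlab = "x", ylab = "y")
```

This trades the loop for generating all series up front, which is what makes the automatic range possible: the plot cannot know its limits before every line has been generated.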

Resources