Adding a trend line to a scatterplot using R


I have a data set with the number of people at each age (ranging from 0 to 105+), recorded for each year in the period 1846-2014, and I am making a scatterplot of the total number of people by year; there's one data set for males and one for females. After that, I want to add a trend line, but I am having trouble figuring out how.
This is what I've got so far:
B <- as.matrix(read.table("clipboard"))
head(B)
age <- 0:105
y <- 1846:2014
plot(c(1846:2014), c(colSums(B)), col=3, xlab="Year", ylab="Summed age", main="Summed people")
This gives me the plot, but I am not sure how to add the trend line. Please help.
Plot looks like this: https://www.dropbox.com/s/5dono5bjrmqylcp/Plot.png?dl=0
Data available here:
https://www.ssb.no/statistikkbanken/SelectVarVal/Define.asp?subjectcode=01&ProductId=01&MainTable=FolkemEttAarig&SubTable=1&PLanguage=1&nvl=True&Qid=0&gruppe1=Hele&gruppe2=Hele&gruppe3=Hele&VS1=AlleAldre00B&VS2=Kjonn3&VS3=&mt=0&KortNavnWeb=folkemengde&CMSSubjectArea=befolkning&StatVariant=&checked=true

I downloaded your data file and posted it somewhere accessible.
urlsrc <- "http://www.math.mcmaster.ca/bolker/misc"
urlfn <- "201512516853914205393FolkemEttAarig.tsv"
d <- read.delim(url(paste(urlsrc,urlfn,sep="/")),header=TRUE,
check.names=FALSE)
dm <- d[,3:171]
y <- as.numeric(names(dm))
Now make the plot:
plot(y, colSums(dm),
col=3, xlab="Year", ylab="Summed age", main="Summed people")
abline(lm(colSums(dm) ~ y))
You can also do it like this:
library("tidyr")
library("ggplot2"); theme_set(theme_bw())
library("dplyr")
d2 <- gather(dm,year,pop,convert=TRUE)
d3 <- d2 %>% group_by(year) %>% summarise(pop=mean(pop))
ggplot(d3,aes(year,pop)) + geom_point() +
geom_smooth(method="lm")
There is a confidence interval around this trend line, but it's so narrow that it's hard to see.
Update: I accidentally used the mean instead of the sum in the second plot. To fix it, replace mean(pop) with sum(pop) in the summarise() call.
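For readers who cannot download the data, here is a minimal self-contained sketch of the base-R approach. The yearly totals are simulated (the slope and noise level are made-up assumptions, not the real population figures), and the lowess() curve is an extra added for comparison, not part of the answer above.

```r
# Made-up stand-in for the yearly totals (not the real population data)
set.seed(1)
year <- 1846:2014
pop  <- 1.3e6 + 2.2e4 * (year - 1846) + rnorm(length(year), sd = 5e4)

# Fit a straight line by ordinary least squares
fit <- lm(pop ~ year)

plot(year, pop, col = 3, xlab = "Year", ylab = "Population",
     main = "Summed people")
abline(fit)                                 # linear trend line
lines(lowess(year, pop), col = 2, lty = 2)  # smooth trend, for comparison

# Estimated change in population per year
coef(fit)["year"]
```

abline() accepts the fitted lm object directly, so there is no need to extract the intercept and slope by hand.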

Related

How to incorporate data into plot which was constructed in ggplot2 using data from another file (R)?

Using a dataset, I have created the following plot:
I'm trying to create the following plot:
Specifically, I am trying to incorporate Twitter names over the first image. To do this, I have a dataset with each name in and a value that corresponds to a point on the axes. A snippet looks something like:
Name Score
#tedcruz 0.108
#RealBenCarson 0.119
Does anyone know how I can plot this data (from one CSV file) over my original graph (which is constructed from data in a different CSV file)? The reason that I am confused is because in ggplot2, you specify the data you want to use at the start, so I am not sure how to incorporate other data.
Thank you.

Your question about combining several data sources in ggplot, each used to plot a different element, has been answered in another post.
I don't know for sure how this will apply to your specific data, so here is an example that might help you move forward.
Imagine we have two data.frames (see below) and we want to obtain a plot similar to the one you presented.
data1 <- data.frame(list(
x=seq(-4, 4, 0.1),
y=dnorm(x = seq(-4, 4, 0.1))))
data2 <- data.frame(list(
"name"=c("name1", "name2"),
"Score" = c(-1, 1)))
The first step is to find the "y" coordinates of the names in the second data.frame (data2). To do this, I added a y column to data2. Here y is a sequence of evenly spaced points running from the max value of y to the min value of y, with some padding at both ends for aesthetics.
range_y = max(data1$y) - min(data1$y)
space_y = range_y * 0.05
data2$y <- seq(from = max(data1$y)-space_y, to = min(data1$y)+space_y, length.out = nrow(data2))
Then we can use ggplot() to plot data1 and data2 following some plot designs. For the current example I did this:
library(ggplot2)
p <- ggplot(data=data1, aes(x=x, y=y)) +
geom_point() + # for the data1 just plot the points
geom_pointrange(data=data2, aes(x=Score, y=y, xmin=Score-0.5, xmax=Score+0.5)) +
geom_text(data = data2, aes(x = Score, y = y+(range_y*0.05), label=name))
p
which gave the following plot:
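Since the question is about data coming from two CSV files, here is a minimal sketch of that workflow. The two files are written to temporary paths with made-up contents so the example runs on its own; the column names Name and Score follow the snippet in the question, and placing each label on the curve via dnorm() is an assumption made purely for illustration.

```r
library(ggplot2)

# Write two small made-up CSV files so the example is self-contained
curve_file <- tempfile(fileext = ".csv")
names_file <- tempfile(fileext = ".csv")
x <- seq(-4, 4, 0.1)
write.csv(data.frame(x = x, y = dnorm(x)), curve_file, row.names = FALSE)
write.csv(data.frame(Name = c("name1", "name2"), Score = c(-1, 1)),
          names_file, row.names = FALSE)

# Read each file into its own data frame
curve_df <- read.csv(curve_file)
names_df <- read.csv(names_file)

# Place each label on the curve at its score (illustrative choice)
names_df$y <- dnorm(names_df$Score)

p <- ggplot(curve_df, aes(x = x, y = y)) +
  geom_line() +
  # this layer ignores curve_df and draws from names_df instead
  geom_text(data = names_df, aes(x = Score, y = y, label = Name),
            vjust = -0.5)
p
```

The data frame given to ggplot() is only a default; any layer can override it through its own data argument.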

how to plot only most abundant species in NMDS?

I need to make an ordination plot showing only, let's say, the 20 most abundant species.
I tried summing each species column and then selecting only the species whose sum exceeds a certain value:
abu <- colSums(dune)
abu
sol <- metaMDS(dune)
sol
plot(sol, type="text", display="species", select = abu > 40)
I get this error: select is not a graphical parameter
I would expect to see only small number of species but it does not happen,
how do you show only a small number of species in the NMDS plot?

This is not straightforward. You are getting an error because select is not a parameter of this plot function. Unfortunately, the result of the analysis is not a data.frame that could be handled easily (e.g. with tidyverse), and even more unfortunately, the plot() function being called is not your standard plot, but a method defined specifically for objects of this class. The authors of this method did not foresee your need, and therefore we must make the plot manually. But to do that, we need to understand what is being plotted and how.
Let us find out more about the object sol:
class(sol)
# [1] "metaMDS" "monoMDS"
methods(class="metaMDS")
# [1] goodness nobs plot points print scores sppscores<- text
Oh good, we have a plot method. After a moment of digging, we find it in the vegan package (not exported, so we need to access it via vegan:::plot.metaMDS). It appears to be a wrapper around a function called ordiplot. We open the function with edit() to figure out what it is doing. Once loads of incidental code are stripped away, it essentially boils down to the following:
Y <- scores(sol, display="species")
plot(Y, type="n")
text(Y[,1], Y[,2], rownames(Y), col="red")
This is, more or less, your plot. Choosing the species to show is now trivial, but first we must make sure that rows of Y are in the same order as columns of dune:
all(colnames(dune) == rownames(Y))
Y.sel <- Y[colSums(dune) > 40, ]
plot(Y.sel[,1], Y.sel[,2], type="n", xlim=c(-.8, .8), ylim=c(-.4, .4))
text(Y.sel[,1], Y.sel[,2], rownames(Y.sel), col="red")
We can of course make a much nicer plot, for example with ggplot (it is definitely possible to make a much nicer plot with base R as well). We could even show the abundance of the plants using the size aesthetic:
library(ggplot2)
library(ggrepel)
Y <- data.frame(Y)
Y$abundance <- colSums(dune)
Y$labels <- rownames(Y)
ggplot(Y, aes(x=NMDS1, y=NMDS2, size=abundance)) +
geom_point() + geom_text_repel(aes(label=labels)) +
theme_minimal()
To filter the species by abundance, we now can do the following:
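library(tidyverse)
# the filtered data frame is piped in as the first argument of ggplot(),
# so it must not be passed again inside the call
Y %>% filter(abundance > 40) %>%
ggplot(aes(x=NMDS1, y=NMDS2, size=abundance)) +
geom_point() + geom_text_repel(aes(label=labels)) +
theme_minimal()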
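The question asks for the 20 most abundant species, not for a fixed abundance threshold. A minimal sketch of that variant follows, assuming the dune data set shipped with vegan; the names n, top and Y.top are introduced here for illustration only.

```r
library(vegan)
data(dune)

sol <- metaMDS(dune, trace = FALSE)
Y <- scores(sol, display = "species")

# Names of the n most abundant species (n = 20 as in the question)
n <- 20
top <- names(sort(colSums(dune), decreasing = TRUE))[seq_len(n)]

# Index by name, so the row order of Y does not matter
Y.top <- Y[top, ]

plot(Y.top, type = "n")
text(Y.top[, 1], Y.top[, 2], rownames(Y.top), col = "red")
```

Selecting rows by species name sidesteps the need to check that the rows of Y are in the same order as the columns of dune.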

Differentiate missing values from main data in a plot using R

I create a dummy timeseries xts object with missing data on date 2-09-2015 as:
library(xts)
library(ggplot2)
library(scales)
set.seed(123)
seq <- seq(as.POSIXct("2015-09-01"),as.POSIXct("2015-09-02"), by = "1 hour")
ob1 <- xts(rnorm(length(seq),150,5),seq)
seq2 <- seq(as.POSIXct("2015-09-03"),as.POSIXct("2015-09-05"), by = "1 hour")
ob2 <- xts(rnorm(length(seq2),170,5),seq2)
final_ob <- rbind(ob1,ob2)
plot(final_ob)
# with ggplot
df <- data.frame(time = index(final_ob), val = coredata(final_ob) )
ggplot(df, aes(time, val)) + geom_line()+ scale_x_datetime(labels = date_format("%Y-%m-%d"))
After plotting my data looks like this:
The red coloured rectangular portion represents the date on which data is missing. How should I show that data was missing on this day in the main plot?
I think I should show this missing data with a different colour. But, I don't know how should I process data to reflect the missing data behaviour in the main plot.

Thanks for the great reproducible example.
I think you are best off omitting the line in your "missing" portion. A straight line there (even in a different colour) suggests that data was gathered in that interval and happened to fall on that straight line. If you omit the line in that interval, it is clear that there is no data there.
The problem is that you want the hourly data to be connected by lines, and no lines in the "missing data" section, so you need some way to detect that section.
You have not given a criterion for this in your question, so based on your example I will say that each line on the plot should consist of data at hourly intervals; if there is a break of more than an hour, a new line should start. You will have to adjust this criterion to your specific problem. All we're doing is splitting your data frame into pieces that are each drawn as one line.
So first create a variable that says which "group" (i.e. line) each data point is in:
df$grp <- factor(c(0, cumsum(diff(df$time) > 1)))
Then you can use the group= aesthetic which geom_line uses to split up lines:
ggplot(df, aes(time, val)) + geom_line(aes(group=grp)) + # <-- only change
scale_x_datetime(labels = date_format("%Y-%m-%d"))
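One caveat: diff() on date-times returns a difftime whose units are chosen automatically, so the comparison with 1 only means "one hour" when the smallest gap happens to be reported in hours. A hedged sketch of a variant that pins the units explicitly is below; the data are simulated to mimic the question (without xts), and gap_hours is a name introduced here.

```r
library(ggplot2)

# Hourly series with a one-day hole, mimicking the question (values made up)
set.seed(123)
time <- c(seq(as.POSIXct("2015-09-01", tz = "UTC"),
              as.POSIXct("2015-09-02", tz = "UTC"), by = "1 hour"),
          seq(as.POSIXct("2015-09-03", tz = "UTC"),
              as.POSIXct("2015-09-05", tz = "UTC"), by = "1 hour"))
df <- data.frame(time = time, val = rnorm(length(time), 160, 5))

# Gap to the previous observation, always measured in hours
gap_hours <- as.numeric(diff(df$time), units = "hours")

# Start a new group whenever the gap exceeds one hour
df$grp <- factor(c(0, cumsum(gap_hours > 1)))

p <- ggplot(df, aes(time, val, group = grp)) + geom_line()
p
```

With the units fixed, the same threshold keeps working if the sampling interval later changes to minutes or days.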

log-scaled density plot: ggplot2 and freqpoly, but with points instead of lines

What I really want to do is plot a histogram with the y-axis on a log scale. Obviously this is a problem with the ggplot2 geom_histogram, since the bottom of each bar is at zero, and the log of zero is undefined.
My workaround is to use the freqpoly geom, and that more or less does the job. The following code works just fine:
ggplot(zcoorddist) +
geom_freqpoly(aes(x=zcoord,y=..density..),binwidth = 0.001) +
scale_y_continuous(trans = 'log10')
The issue is that at the edges of my data, I get a couple of garish vertical lines that really throw you off visually when combining a bunch of these freqpoly curves in one plot. What I'd like to do is draw points at every vertex of the freqpoly curve, with no lines connecting them. Is there an easy way to do this?

The easiest way to get the desired plot is to just recast your data. Then you can use geom_point. Since you don't provide an example, I used the standard example for geom_histogram to show this:
# load packages
require(ggplot2)
require(reshape)
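# get data (the movies data set moved from ggplot2 to the ggplot2movies package in ggplot2 2.0.0)
data(movies, package = "ggplot2movies")
movies <- as.data.frame(movies[, c("title", "rating")])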
# here's the equivalent of your plot
ggplot(movies) + geom_freqpoly(aes(x=rating, y=..density..), binwidth=.001) +
scale_y_continuous(trans = 'log10')
# recast the data
df1 <- recast(movies, value~., measure.var="rating")
names(df1) <- c("rating", "number")
# alternative way to recast data
df2 <- as.data.frame(table(movies$rating))
names(df2) <- c("rating", "number")
# table() turns the ratings into factor levels; convert them back to numbers
df2$rating <- as.numeric(as.character(df2$rating))
# plot
p <- ggplot(df1, aes(x=rating)) + scale_y_continuous(trans="log10", name="density")
# with lines
p + geom_linerange(aes(ymax=number, ymin=.9))
# only points
p + geom_point(aes(y=number))
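If densities per bin are wanted (as in the question's freqpoly call) instead of raw counts per rating, one option is to do the binning with hist() and plot the bin midpoints as points. This is a sketch under assumptions: z is simulated data standing in for zcoorddist$zcoord, and the number of breaks is arbitrary.

```r
library(ggplot2)

# Made-up data standing in for zcoorddist$zcoord
set.seed(42)
z <- rnorm(5000)

# Bin the data without drawing anything
h <- hist(z, breaks = 50, plot = FALSE)
bins <- data.frame(mid = h$mids, density = h$density)

# Empty bins have density 0, which has no finite log; drop them
bins <- bins[bins$density > 0, ]

p <- ggplot(bins, aes(mid, density)) +
  geom_point() +
  scale_y_continuous(trans = "log10")
p
```

Dropping the empty bins up front is what removes the vertical lines at the edges: those came from the curve falling to zero, which a log scale cannot display.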

Boxplot in R showing the mean (again)

I saw Boxplot in R showing the mean
I'm interested in the ggplot solution. But what I am plotting are averages already so I don't want to do an average of an average. I do have the true mean stored in TrueAvgCPC.
Here is what I tried, but it's not working:
p <- qplot(Mydf$Network,Mydf$Avg.CPC,data=Mydf,geom='boxplot')
p <- p+stat_summary(TrueAvgCPC,shape=1,col='red',geom='point')
print(p)
Thanks!

As far as I see, you want to just add a true mean (or several?) to the box plot. If you have the value(s), why use stat_summary instead of just plotting the points?
#sample data
x <- rnorm(30)
y <- rep(letters[1:3],10)
TrueAVGCPC <- c(0.34,0.1,0.44)
#plot
p <- qplot(y,x,geom='boxplot')
p <- p+geom_point(aes(x=c(1,2,3),y=TrueAVGCPC),col="red")
print(p)
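The same idea can be written with ggplot() and a separate data frame for the known means, which avoids passing bare vectors inside aes(). This is a sketch with made-up data; obs and means are hypothetical names, and the column names grp and val are assumptions chosen so that the second layer can reuse the plot's aesthetics.

```r
library(ggplot2)

# Made-up observations, plus precomputed "true" means (hypothetical values)
set.seed(1)
obs   <- data.frame(grp = rep(letters[1:3], 10), val = rnorm(30))
means <- data.frame(grp = letters[1:3], val = c(0.34, 0.1, 0.44))

p <- ggplot(obs, aes(grp, val)) +
  geom_boxplot() +
  # one red point per group, taken from the second data frame
  geom_point(data = means, colour = "red", shape = 1, size = 3)
p
```

Because means uses the same column names as obs, the point layer inherits the x and y mappings and only needs its own data argument.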
