I would like to plot multiple lines on the same plot, without using ggplot.
I have scores for different individuals across a set time period and wish to plot a line between yearly scores for each individual. Data is organised with each row representing an individual and each column an observed value in a given year.
Currently I am using a for loop, but am aware that this is often not efficient in R, and am interested if there are any more suitable approaches available within base R.
I will be working with up to 100,000 individuals.
Thanks.
Code:
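df <- data.frame(runif(10, 0, 100), runif(10, 0, 100), runif(10, 0, 100), runif(10, 0, 100))
df <- data.frame(t(df))  # one row per individual, one column per year
Years <- seq(1, 10, 1)
plot(1, type = "n", xlab = "Year", ylab = "Score", xlim = c(1, 10), ylim = c(0, 100))
for (x in 1:4) lines(Years, unlist(df[x, ]))  # one line per individual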
Efficiency is not much of a consideration when plotting since plotting to a device is a slow operation in itself. You can use matplot (which uses a loop internally). It's basically a more sophisticated version of your code wrapped in a function.
matplot(Years, t(df), xlab="Year", ylab="Score", type = "l")
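If the number of individuals gets large, another base R option is to draw everything with a single lines() call, relying on the fact that lines() breaks the path wherever it meets an NA. This is only a sketch: the sizes, the object names (scores, x_all, y_all) and the semi-transparent colour are assumptions for illustration, and whether it is actually faster on your device is something you would have to time yourself.

```r
set.seed(1)
n_ind   <- 1000   # hypothetical number of individuals
n_years <- 10
scores  <- matrix(runif(n_ind * n_years, 0, 100), nrow = n_ind)
years   <- seq_len(n_years)

# Append an NA column so every individual's series ends with a break,
# then flatten row by row into one long vector.
y_all <- as.vector(t(cbind(scores, NA)))
x_all <- rep(c(years, NA), times = n_ind)

plot(1, type = "n", xlab = "Year", ylab = "Score",
     xlim = range(years), ylim = c(0, 100))
lines(x_all, y_all, col = rgb(0, 0, 0, 0.1))  # NAs separate the individuals
```

With 100,000 lines the plot will be heavily overplotted whichever way it is drawn, which is why the sketch uses a transparent colour.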
Related
I have a large 2-variable dataset that may be classified into 2 groups using a third variable. Overplotting is an issue, so I've resorted to visualizing my data using bin2d and other similar approaches. I would like to calculate the difference between the binned counts of the two groups and visualize that as well (e.g., subtract one 2d histogram from another).
example code:
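library(ggplot2)  # provides the diamonds dataset and geom_bin2d
library(dplyr)    # provides filter
df <- diamonds
df_color_H <- filter(df, color == "H")
df_color_E <- filter(df, color == "E")
ggplot(df_color_H) +
  geom_bin2d(aes(carat, price), bins = 40)
ggplot(df_color_E) +
  geom_bin2d(aes(carat, price), bins = 40)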
Ultimately, I want to visualize the difference between overlapping bins. I know the solution is likely a pre-processing step before bringing them into GGplot but I haven't found exactly what I'm looking for. I also don't need a sophisticated solution using KDEs or something like that.
Any suggestions would be welcome!
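One possible pre-processing step, sketched in base R with made-up data rather than diamonds: bin both groups on the same break points, tabulate, and subtract the two count matrices. The data frames g1 and g2, the helper bin_counts and the choice of 40 bins are all assumptions for illustration; the resulting matrix could then be reshaped and handed to whatever plotting tool you prefer.

```r
set.seed(42)
# Hypothetical stand-ins for the two groups
g1 <- data.frame(x = rnorm(5000, 0, 1),   y = rnorm(5000, 0, 1))
g2 <- data.frame(x = rnorm(5000, 0.5, 1), y = rnorm(5000, 0.5, 1))

# Both groups must share the same bin edges for the subtraction to make sense
n_bins   <- 40
x_breaks <- seq(min(g1$x, g2$x), max(g1$x, g2$x), length.out = n_bins + 1)
y_breaks <- seq(min(g1$y, g2$y), max(g1$y, g2$y), length.out = n_bins + 1)

# Count the points of one group falling into each 2d bin
bin_counts <- function(d) {
  table(cut(d$x, x_breaks, include.lowest = TRUE),
        cut(d$y, y_breaks, include.lowest = TRUE))
}

# Positive cells: more g1 points; negative cells: more g2 points
count_diff <- unclass(bin_counts(g1) - bin_counts(g2))

image(x_breaks, y_breaks, count_diff, xlab = "x", ylab = "y")
```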
I'm making an MDS plot with Salary and WAR data from baseball players. I have attached it here. Because there are so many data points, I only want to label the outlier points with the players' names. Any idea how to do so?
Well, first it depends on the method you want to use to detect the outliers. I created a reproducible example, since you did not provide a dataset. I'm using the mtcars dataset, which ships with R.
For outlier detection I'm using Cook's distance (you can check about that in a post here).
First I set up the data and calculate Cook's distance (cooksd) with cooks.distance from stats, for a linear model of the variables hp and qsec.
df<-cbind( mtcars[,c("hp", "qsec")], cooks.distance(lm(mtcars$hp~mtcars$qsec)))
names(df)[3]<-"cooksd"
plot(cooks.distance(lm(mtcars$hp~mtcars$qsec)))
We see that we have two outliers. I'm using a threshold of 4*mean(cooksd) and flagging every observation above it in my dataset as an outlier. This can be done in one vectorised step, without a loop.
# 1 if Cook's distance exceeds 4 times its mean, otherwise 0
df$outlier <- as.integer(df$cooksd > 4 * mean(df$cooksd))
The last step is to plot the values and label only those above the threshold. I'm using the row.names here, but you can use any column, as long as you filter it the same way as the data points.
plot(df$hp, df$qsec)
text(df[df$outlier==1,"hp"], df[df$outlier==1,"qsec"], row.names(df[df$outlier == 1,]), cex=0.6, pos=4, col="red")
A big problem with outlier detection is that there are many different methods, and the interpretation has to be adjusted for every case. Which method or threshold you should use won't be answered here, since that's more a discussion for stackexchange. You can use cooksd, stdev or anything else you deem most useful. But the solution stays the same:
use your method to find the outliers
flag them in your dataset
add text to your plot only for the points you flagged.
For your three clusters, you just have to fit an lm model for every cluster, flag the outliers and then join all three results.
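A rough sketch of that per-cluster version, using the three cyl groups of mtcars as stand-in clusters (that grouping, and the helper name flag_outliers, are my assumptions, not something from your data):

```r
# Flag outliers within one cluster using the same 4 * mean(cooksd) rule
flag_outliers <- function(d) {
  cd <- cooks.distance(lm(hp ~ qsec, data = d))
  d$outlier <- cd > 4 * mean(cd)
  d
}

# Split into clusters, flag each separately, then join the results again
clusters <- split(mtcars[, c("hp", "qsec", "cyl")], mtcars$cyl)
flagged  <- do.call(rbind, unname(lapply(clusters, flag_outliers)))

plot(flagged$hp, flagged$qsec, col = as.integer(factor(flagged$cyl)))
lab <- flagged[flagged$outlier, ]
if (nrow(lab) > 0) {  # text() fails on empty input, so guard against no outliers
  text(lab$hp, lab$qsec, row.names(lab), cex = 0.6, pos = 4, col = "red")
}
```

Passing unname() to rbind keeps the original row names (the car names) so they can still be used as labels.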
This is my first question here so please excuse any mistakes I (may or may not) make.
The premise:
I got a vegetational dataset containing paired data on different plots for old and new observations. I used the 'openxlsx'-package to load my data, and 'vegan'-package to execute an NMDS as follows:
mydata <- read.xlsx(mydata)
mydataMDS <- metaMDS(mydata, k=2, trymax=500)
The result is then used for a model via the "envfit()"-function, including environmental variables:
myenvdata <- read.xlsx(myenvdata)
mydataMDS_fit <- envfit(mydataMDS, myenvdata, perm=10000, na.rm=TRUE)
plot(mydataMDS, display="sites")
plot(mydataMDS_fit, p.max=0.01, axis=TRUE)
Now I have a plot of my "mydataMDS" ordination, including the vectors that R calculated for "mydataMDS_fit".
The problem:
I want to colour and connect certain points within this plot. As "mydata" consists of observations within the same plot at different times, I intend to colour all points of old observations in one colour, and all the new ones in a different one. I've read about adding columns in order to group old and new observations, but as I'm working with a model there are no columns. How can I edit my datasets ("mydata", "mydataMDS", "myenvdata", "mydataMDS_fit") in order to show old and new plots in 2 different colours (one colour for old, one colour for new), and connect the paired observations with lines? Or: Is there a possibility to directly re-colour the points within my graphical output via checking for old/new observations?
(Sorry, I feel like my explanation is quite complicated, but I still hope someone may be able to help)
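For what it's worth, the colouring and connecting can be done on the plotted coordinates rather than on the model object. Below is a base R sketch with made-up coordinates; the objects site_xy, period and pair_id are hypothetical, and with vegan the coordinates would presumably come from something like scores(mydataMDS, display = "sites"), with period and pair_id built from whatever identifies old/new and the plot in your own data.

```r
set.seed(7)
n_plots <- 6

# Hypothetical site coordinates standing in for the NMDS site scores
old_xy  <- matrix(rnorm(n_plots * 2), ncol = 2)
new_xy  <- old_xy + matrix(rnorm(n_plots * 2, sd = 0.3), ncol = 2)
site_xy <- rbind(old_xy, new_xy)

# One entry per row of site_xy: observation period and the plot it belongs to
period  <- factor(rep(c("old", "new"), each = n_plots), levels = c("old", "new"))
pair_id <- rep(seq_len(n_plots), times = 2)

# Colour the points by period
plot(site_xy, col = c("grey40", "forestgreen")[period], pch = 19,
     xlab = "NMDS1", ylab = "NMDS2")

# Connect each old observation with the new observation of the same plot
old_rows <- which(period == "old")
new_rows <- which(period == "new")[match(pair_id[old_rows], pair_id[period == "new"])]
segments(site_xy[old_rows, 1], site_xy[old_rows, 2],
         site_xy[new_rows, 1], site_xy[new_rows, 2])
```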
I have pre-calculated data with amount on the x axis and the count (as a proportion) which I'm using as the y axis.
What I would like to have is the functionality I would get if I had used stat="bin". I can't use rep to simply explode the data to its original form and then rebin it, because of the large size of the dataset.
For example:
I would like to be able to smooth the data, like I would have been able to by using binwidth.
Also, I'm plotting this data using geom_freqpoly. However, if I don't have a specific amount on the x axis I'd like to have it as a 0 value, instead of joining to the next point, which binning using ggplot does.
Since no one had a response for ggplot, I used rep to re-expand and sample the data.
So, if I had 18 million observations originally, I used 180,000 for the times argument of rep, and multiplied this by my previously calculated proportion of the data. I'm not sure what the threshold would then be for the times argument (if it's less than 1, will no data point be created?). This means I lose the less frequent observations altogether, but this is OK in my case.
Many of the ggplot stat functions will accept a weight as part of the aesthetic, e.g.: aes(x=X, y=Y, weight=n). Depending on your version, a couple even complain about the "unused aesthetic, 'weight'", but then go on to do the right thing! I've used this on geom_histogram, bin2d, and probably others.
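The same weighted binning idea can also be sketched outside ggplot, in base R: sum the pre-calculated proportions within wider bins instead of counting rows, and set bins with no data to 0. The amount and prop vectors, the bin width of 5 and the 0 to 20 range are made-up values for illustration only.

```r
# Hypothetical pre-aggregated data: one row per distinct amount
amount <- c(1, 2, 3, 7, 8, 15)
prop   <- c(0.10, 0.25, 0.15, 0.20, 0.20, 0.10)

binwidth <- 5
breaks   <- seq(0, 20, by = binwidth)
bins     <- cut(amount, breaks)  # intervals (0,5], (5,10], (10,15], (15,20]

# Sum the proportions (the weights) per bin rather than counting rows
binned <- tapply(prop, bins, sum)
binned[is.na(binned)] <- 0       # bins without any amount become 0

mids <- head(breaks, -1) + binwidth / 2
plot(mids, binned, type = "l", xlab = "amount", ylab = "proportion")
```

Because the weights are summed, nothing has to be expanded with rep, and rare amounts are not lost.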
Sequential portions of my time series are under different treatments, and I'd like to separately color a line connecting observations in each portion.
For example, in the series under treatment A I'd have a red line, and in the succeeding series under treatment B I'd have a blue line.
plot(response, type="l",col="treatment") failed - all observations were connected with a line the same color.
This listhost posting proposed just splitting the data by treatment and then separately plotting each subset on the same plot. (http://r.789695.n4.nabble.com/Can-R-plot-multicolor-lines-td791081.html).
Is there a more elegant way?
An alternative using Map that avoids manually plotting segments:
dat <- data.frame(treatment=factor(rep(LETTERS[1:2],3:4)),
response=c(6,5,2,1,5,6,7),time=1:7)
plot(response ~ time, data=dat, type="n")
# treatment is a factor, so its integer codes can serve as colours
invisible(Map(
function(x) lines(x$time, x$response, col=as.integer(x$treatment[1])),
split(dat, dat$treatment)
))
There are two popular more elegant ways. One is to use the ggplot2 package. Without more information it's hard to advise you other than look at help or examples in various places. The other is to check out the function matplot. That will require you to first restructure your data as a matrix but it can easily do what you want. Keep in mind that while it says in the help, "Plot the columns of one matrix against the columns of another", the x-axis matrix can be a vector the same length as one column of a matrix containing your line information. The function will just recycle the x vector.
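A small sketch of the matplot route, assuming a data frame with treatment, response and time columns (made up here): build a matrix with one column per treatment, holding NA wherever that treatment does not apply, and let matplot draw each column in its own colour. The name resp_mat and the colours are only illustrative.

```r
dat <- data.frame(treatment = factor(rep(LETTERS[1:2], 3:4)),
                  response  = c(6, 5, 2, 1, 5, 6, 7),
                  time      = 1:7)

# One column per treatment; NA where the observation belongs to another treatment
resp_mat <- sapply(levels(dat$treatment), function(tr) {
  ifelse(dat$treatment == tr, dat$response, NA)
})

# The single time vector is reused as the x values for every column
matplot(dat$time, resp_mat, type = "l", lty = 1, col = c("red", "blue"),
        xlab = "time", ylab = "response")
```

Note that the two coloured lines are not joined where the treatment changes, since each column only holds its own observations.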