I have a large dataframe.
Here is fake data of a similar structure:
dat = data.frame(id = 1:12,
                 variable = rep(c("p1", "p2", "p3"), times = 2),
                 value = c(runif(6), runif(6) + 1),
                 locus = c(rep("A", 6), rep("B", 6)),
                 replicate = rep(c(1, 2), 6),
                 TimesLocus = rep(2, times = 12))
I would like to plot the correlation between replicate 1 and replicate 2.
I have achieved this using:
Corr<-cor(dat[dat$replicate==1,]$value,dat[dat$replicate==2,]$value)
ggplot(dat,aes(x=dat[dat$replicate==1,]$value,y=dat[dat$replicate==2,]$value))+
geom_point()+xlab("replicate1")+ylab("replicate2")+
geom_smooth(method = "lm") +
annotate("text", x = 0.9*max(dat[dat$replicate==1,]$value),
y = 0.9*max(dat[dat$replicate==2,]$value),
label = paste("r =",round(Corr,digits=2),sep=" "),color="blue")
However, now I want to see if the correlations are different PER VARIABLE.
I can do this using:
ggplot(dat,aes(x=dat[dat$replicate==1,]$value,y=dat[dat$replicate==2,]$value))+
geom_point()+xlab("replicate1")+ylab("replicate2")+
geom_smooth(method = "lm") + facet_wrap(~variable)
If I want to have the correlation per variable I know that I should make a separate dataframe, but I am having problems with this.
r_df <- ddply(dat, .(variable), summarise,
rsq=round(summary(lm(dat[dat$replicate==2,]$value~
dat[dat$replicate==1,]$value))$r.squared, 2))
It gives the same r-squared for each variable.
What am I doing wrong? Can I do this without reshaping my data again?
Okay, I am now trying to use the info from @shadow, and have the following.
r_df_val <- ddply(df_mlt_loc_Dup, .(variable), summarise, rsq=round(summary(lm(value[replicate==2]~value[replicate==1]))$r.squared, 2))
Somehow the calculation isn't correct. All of the rsq values are 0.06 or so, when they should be near 0.8; you can see the correlation in the plot below. Is it somehow re-ordering the dataframe upon subsetting by variable?
In your ddply call, you used dat again. That refers to the original, unsplit data, so every group gets the same result. You should instead use value and replicate directly. Inside summarise they are then evaluated within each group, as intended.
library(plyr)
r_df <- ddply(dat, .(variable), summarise,
  rsq = round(summary(lm(value[replicate==2]~value[replicate==1]))$r.squared, 2))
This does not give a meaningful result for the data you provided, because the groups are too small: each variable has only two values per replicate, so the fitted line passes through both points. For your original data it should work. Here is a larger dataset (essentially the data you provided with some additional rows) for which it should work as desired.
dat = data.frame(id = 1:24,
                 variable = rep(c("p1", "p2", "p3"), times = 4),
                 value = c(runif(12), runif(12) + 1),
                 locus = c(rep("A", 12), rep("B", 12)),
                 replicate = rep(c(1, 2), 12),
                 TimesLocus = rep(2, times = 24))
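For readers without plyr, the same per-variable calculation can be sketched in base R. This is only an illustration on simulated data: the helper name rsq_one and the seed are my own assumptions, times = 8 just spells out the recycling explicitly, and the values are paired by row order within each variable, so the rows need to be sorted consistently for both replicates.

```r
set.seed(1)  # assumed seed, only to make the sketch reproducible
dat <- data.frame(
  id = 1:24,
  variable = rep(c("p1", "p2", "p3"), times = 8),
  value = c(runif(12), runif(12) + 1),
  locus = rep(c("A", "B"), each = 12),
  replicate = rep(c(1, 2), 12)
)

# R^2 of replicate 2 regressed on replicate 1 within one variable.
# Pairs are formed by position, so both replicates must be in the same order.
rsq_one <- function(d) {
  r1 <- d$value[d$replicate == 1]
  r2 <- d$value[d$replicate == 2]
  stopifnot(length(r1) == length(r2))
  round(summary(lm(r2 ~ r1))$r.squared, 2)
}

# Split by variable, apply the helper, and collect one row per variable
rsq <- sapply(split(dat, dat$variable), rsq_one)
r_df <- data.frame(variable = names(rsq), rsq = unname(rsq))
r_df
```

The length check inside rsq_one is there because unequal replicate counts would otherwise be silently mispaired or fail inside lm.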
I would like to use R to draw 100 random observations from a chi-square distribution with 5 degrees of freedom. After doing so, I want to calculate the mean of those observations and use ggplot2 to plot the distribution with a bar chart. The following is my code:
rm(list = ls())
library(ggplot2)
set.seed(9487)
###Step_1###
x_100 <-data.frame(rchisq(100, 5, ncp = FALSE))
###Step_2###
mean_x <- mean(x_100[,1])
class(x_100)
###Step_3###
plot_x_100 <- ggplot(data = x_100, aes(x = x_100)) +
geom_bar()
plot_x_100
Firstly, I construct a data frame of a random chi-square distribution with df = 5, obs = 100.
Secondly, I calculate the mean value of this chi-square distribution.
Finally, I plot the graph with the ggplot2 package.
However, I get the following result:
Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous.
Error in is.finite(x) : default method not implemented for type 'list'
I have been stuck on this problem for several hours and cannot find any list in my global environment. I would appreciate it if anyone could help me and give me some suggestions.
The problem is that inside the ggplot function you are passing the same data frame (x_100) as both the data and the x variable inside aes. A data frame is a list of columns, which is where the 'list' in the error message comes from. Remember that in ggplot, inside aes you should give the name of the column you wish to map. Additionally, if you want to plot the chi-square distribution, I think it would be a better idea to use geom_histogram instead of geom_bar, as the former groups the observations into bins.
library(ggplot2)
# Rename the only column of your data frame as "value"
colnames(x_100) <- "value"
plot_x_100 <- ggplot(data = x_100, aes(x = value)) +
geom_histogram(bins = 20)
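Putting the pieces together, here is a minimal sketch that names the column when the data frame is created, so no renaming is needed afterwards. The column name value follows the answer above; the dashed line marking the sample mean is my own addition and not part of the original question.

```r
library(ggplot2)

set.seed(9487)

# Name the column at creation time instead of renaming it later
x_100 <- data.frame(value = rchisq(100, df = 5))

# Sample mean of the 100 draws
mean_x <- mean(x_100$value)

# Map the column (not the data frame) to x and bin the observations
plot_x_100 <- ggplot(data = x_100, aes(x = value)) +
  geom_histogram(bins = 20) +
  geom_vline(xintercept = mean_x, linetype = "dashed")  # marks the mean

plot_x_100
```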
Here is what I am after:
Let's use the ToothGrowth dataset that comes with R as a simple example. In this dataset there are 3 columns: len (tooth length), supp (supplement), and dose. Both dose and supp are explanatory variables for len. It's easy enough to, say, plot dose against length and use the supplement as a factor. For instance, using qplot you would just do this:
qplot(x = ToothGrowth$dose , y = ToothGrowth$len, color = ToothGrowth$supp)
The next thing I'd want to do is see the trend of the average growth for each supplement as dose increased. I.e., construct a very similar plot, except I want the y variable to be the average of the values based on the dose and supplement.
I'm not sure how to do that in place with a call to qplot. It occurred to me that perhaps the thing to do was to compute a new column or something, but I'm also not sure how to use something like mutate to build a new column based on multiple explanatory variables.
I think this may be what you are looking for, but you may need to clarify. Here is how you can generate the averages using dplyr:
library(dplyr)
library(ggplot2)
# Average length for each supplement/dose combination
Avg_ToothGrowth <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(avg_len = mean(len)) %>%
  ungroup()
qplot(dose, avg_len, data = Avg_ToothGrowth, color = supp)
This should get you close but you may have to go through a dplyr tutorial to better understand the use of group_by and summarise. I used the ungroup to strip off the remaining groupings as they are not needed (there may be a better way to do this).
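On the "better way" to drop the grouping: in dplyr 1.0.0 and later, summarise() takes a .groups argument, so the separate ungroup step can be folded into the summary. A small sketch, assuming a dplyr version that has this argument:

```r
library(dplyr)

# .groups = "drop" returns an ungrouped result, so no ungroup() is needed
Avg_ToothGrowth <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(avg_len = mean(len), .groups = "drop")

Avg_ToothGrowth
```

The result has one row per supp/dose combination, just like the version with an explicit ungroup.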
EDIT:
You can also plot the original data with a trend line for each group:
# With confidence interval
qplot(dose, len, data = ToothGrowth, color = supp, geom = c('smooth', 'point'), method = 'lm')
# Without confidence interval
qplot(dose, len, data = ToothGrowth, color = supp, geom = c('smooth', 'point'), method = 'lm', se=FALSE)
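Since qplot() is deprecated in recent ggplot2 releases, here is a sketch of what I believe is the equivalent plot written with ggplot() directly. The explicit formula = y ~ x only spells out the default and avoids a message; treat this as an illustration rather than the answerer's own code.

```r
library(ggplot2)

# Points plus one linear trend line per supplement, without confidence bands
p <- ggplot(ToothGrowth, aes(x = dose, y = len, colour = supp)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p
```

Setting se = TRUE (the default) brings the confidence interval back.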
I personally prefer to use dplyr as steveb did, but in case you are not familiar with the package, a solution without it might be easier to understand. The function aggregate() can help you:
tg <- aggregate(len ~ dose + supp, mean, data = ToothGrowth)
The first argument is a formula that tells the function that it should aggregate the value of the column len for all rows that have the same values for dose and supp. The second argument gives the function to use for the aggregation, which is mean. So, what is actually done is the following:
Rows of the data frame are grouped together by dose and supp. All rows within a group have thus the same values for dose and supp.
Then, for each group, the function mean() is applied to the column len.
This is exactly what is happening in the dplyr solution, but there, the two steps are more clearly spelled out.
The resulting data frame can then be plotted:
qplot(dose, len, colour = supp, data = tg)
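To make the two steps described above concrete, here is a base R sketch that does the grouping and the averaging by hand and compares the result with aggregate(). The object names groups and manual_means are mine.

```r
# One-step version from the answer
tg <- aggregate(len ~ dose + supp, data = ToothGrowth, FUN = mean)

# Step 1: group the len values by every dose/supp combination
groups <- split(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp))

# Step 2: apply mean() to each group
manual_means <- sapply(groups, mean)

# Both approaches yield the same six means
all.equal(sort(unname(manual_means)), sort(tg$len))
```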
I am trying to find the best way to create barplots in R with standard errors displayed. I have seen other articles, but I cannot figure out the code to use with my own data. I have not used ggplot before, although it seems to be the most common approach, and barplot is not cooperating with dataframes. I need to use this in two cases, for which I have created two example dataframes:
Plot df1 so that the x-axis has sites a-c, with the y-axis displaying the mean value for V1 and the standard errors highlighted, similar to this example with a grey colour. Here, plant biomass should be the mean V1 value and treatments should be each of my sites.
Plot df2 in the same way, but so that before and after are located next to each other in a similar way to this, so pre-test and post-test equate to before and after in my example.
x <- factor(LETTERS[1:3])
site <- rep(x, each = 8)
values <- as.data.frame(matrix(sample(0:10, 3*8, replace=TRUE), ncol=1))
df1 <- cbind(site,values)
z <- factor(c("Before","After"))
when <- rep(z, each = 4)
df2 <- data.frame(when,df1)
Apologies for the simplicity to more experienced R users, particularly those who use ggplot, but I cannot apply snippets of code that I have found elsewhere to my data. I cannot even get enough code together to produce a start to a graph, so I hope my descriptions are sufficient. Thank you in advance.
Something like this?
library(ggplot2)
get.se <- function(y) {
se <- sd(y)/sqrt(length(y))
mu <- mean(y)
c(ymin=mu-se, ymax=mu+se)
}
# Note: the fun.y argument was renamed to fun in ggplot2 3.3.0
ggplot(df1, aes(x=site, y=V1)) +
  stat_summary(fun=mean, geom="bar", fill="lightgreen", color="grey70")+
  stat_summary(fun.data=get.se, geom="errorbar", width=0.1)
ggplot(df2, aes(x=site, y=V1, fill=when)) +
  stat_summary(fun=mean, geom="bar", position="dodge", color="grey70")+
  stat_summary(fun.data=get.se, geom="errorbar", width=0.1, position=position_dodge(width=0.9))
So this takes advantage of the stat_summary(...) function in ggplot to, first, summarize y for given x using mean(...) (for the bars), and then to summarize y for given x using the get.se(...) function for the error-bars. Another option would be to summarize your data prior to using ggplot, and then use geom_bar(...) and geom_errorbar(...).
Also, plotting +/- 1 se is not a great practice (although it's used often enough). You'd be better served plotting legitimate confidence limits, which you could do, for instance, using ggplot2's mean_cl_normal function (it relies on the Hmisc package being installed) instead of the contrived get.se(...). mean_cl_normal returns the 95% confidence limits based on the assumption that the data are normally distributed (or you can set the CL to something else; read the documentation).
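If you would rather not depend on an extra package, a t-based confidence interval can be computed by hand in the same style as get.se(...). The sketch below is an assumption-laden illustration: get.ci is a hypothetical helper of mine, the data are re-simulated with an arbitrary seed, and the interval assumes approximately normal data.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed for the example data
df1 <- data.frame(site = factor(rep(LETTERS[1:3], each = 8)),
                  V1 = sample(0:10, 24, replace = TRUE))

# Mean and t-based confidence limits for one group of values
get.ci <- function(y, level = 0.95) {
  n <- length(y)
  se <- sd(y) / sqrt(n)
  half <- qt(1 - (1 - level) / 2, df = n - 1) * se  # half-width of the interval
  mu <- mean(y)
  data.frame(y = mu, ymin = mu - half, ymax = mu + half)
}

p <- ggplot(df1, aes(x = site, y = V1)) +
  stat_summary(fun = mean, geom = "bar", fill = "lightgreen", color = "grey70") +
  stat_summary(fun.data = get.ci, geom = "errorbar", width = 0.1)

p
```

With 8 observations per site these bars come out noticeably wider than +/- 1 se, which is the point of the remark above.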
I used the group_by and summarise_each functions for this, together with the std.error function from the plotrix package.
library(plotrix) # for std error function
library(dplyr) # for group_by and summarise_each function
library(ggplot2) # for creating ggplot
For the df1 plot:
# Group data by site
grouped_df1<-group_by(df1,site)
#summarise grouped data and calculate mean and standard error using function mean and std.error(from plotrix)
summarised_df1<-summarise_each(grouped_df1,funs(mean=mean,std_error=std.error))
# Define the top and bottom of the errorbars
limits <- aes(ymax = mean + std_error, ymin=mean-std_error)
#Begin your ggplot
#Here we are plotting site vs mean (df1 has no second factor to fill by)
g<-ggplot(summarised_df1,aes(site,mean))
#Creating bar to show the factor variable position_dodge
#ensures side by side creation of factor bars
g<-g+geom_bar(stat = "identity",position = position_dodge())
#creation of error bar
g<-g+geom_errorbar(limits,width=0.25,position = position_dodge(width = 0.9))
#print graph
g
For the df2 plot:
# Group data by when and site
grouped_df2<-group_by(df2,when,site)
#summarise grouped data and calculate mean and standard error using function mean and std.error
summarised_df2<-summarise_each(grouped_df2,funs(mean=mean,std_error=std.error))
# Define the top and bottom of the errorbars
limits <- aes(ymax = mean + std_error, ymin=mean-std_error)
#Begin your ggplot
#Here we are plotting site vs mean and filling by another factor variable when
g<-ggplot(summarised_df2,aes(site,mean,fill=when))
#Creating bar to show the factor variable position_dodge
#ensures side by side creation of factor bars
g<-g+geom_bar(stat = "identity",position = position_dodge())
#creation of error bar
g<-g+geom_errorbar(limits,width=0.25,position = position_dodge(width = 0.9))
#print graph
g
I want to create a correlation matrix plot, i.e. a plot where each variable is plotted in a scatterplot against each other variable like with pairs() or splom(). I want to do this with ggplot2. See here for examples. The link mentions some code someone wrote for doing this in ggplot2, however, it is outdated and no longer works (even after you swap out the deprecated parts).
One could do this with a loop in a loop and then multiplot(), but there must be a better way. I tried melting the dataset to long, and copying the value and variable variables and then using facets. This almost gives you something correct.
d = data.frame(x1=rnorm(100),
x2=rnorm(100),
x3=rnorm(100),
x4=rnorm(100),
x5=rnorm(100))
library(reshape2)
d = melt(d)
d$value2 = d$value
d$variable2 = d$variable
library(ggplot2)
ggplot(data=d, aes(x=value, y=value2)) +
geom_point() +
facet_grid(variable ~ variable2)
This gets the general structure right, but only works for plotting each variable against itself. Is there some more clever way of doing this without resorting to 2 loops?
library(GGally)
set.seed(42)
d = data.frame(x1=rnorm(100),
x2=rnorm(100),
x3=rnorm(100),
x4=rnorm(100),
x5=rnorm(100))
# estimated density in diagonal
ggpairs(d)
# blank
ggpairs(d, diag = list("continuous"="blank"))
Using the PerformanceAnalytics library:
library("PerformanceAnalytics")
# df is your data frame of numeric columns
chart.Correlation(df, histogram = TRUE, pch = 19)
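For completeness, the melt-and-facet idea from the question can be made to work in plain ggplot2 by building every (x variable, y variable) combination explicitly, instead of copying the same column twice. This is only a sketch under my own naming (pairs_grid, long, xvar, yvar), with three variables to keep it small.

```r
library(ggplot2)

set.seed(42)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))

# All ordered pairs of variable names
vars <- names(d)
pairs_grid <- expand.grid(xvar = vars, yvar = vars, stringsAsFactors = FALSE)

# One block of rows per pair: the x column paired with the y column
long <- do.call(rbind, lapply(seq_len(nrow(pairs_grid)), function(i) {
  data.frame(xvar = pairs_grid$xvar[i],
             yvar = pairs_grid$yvar[i],
             x = d[[pairs_grid$xvar[i]]],
             y = d[[pairs_grid$yvar[i]]])
}))

# One facet per pair, rows indexed by the y variable, columns by the x variable
p <- ggplot(long, aes(x = x, y = y)) +
  geom_point(size = 0.5) +
  facet_grid(yvar ~ xvar)

p
```

The long data frame grows with the square of the number of variables, so for many columns ggpairs is likely the more practical choice.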
I am new to R. Forgive me if this question has an obvious answer, but I've not been able to find a solution. I have experience with SAS and may just be thinking of this problem in the wrong way.
I have a dataset with repeated measures from hundreds of subjects with each subject having multiple measurements across different ages. Each subject is identified by an ID variable. I'd like to plot each measurement (let's say body WEIGHT) by AGE for each individual subject (ID).
I've used ggplot2 to do something like this:
ggplot(data = dataset, aes(x = AGE, y = WEIGHT )) + geom_line() + facet_wrap(~ID)
This works well for a small number of subjects but won't work for the entire dataset.
I've also tried something like this:
ggplot(data = dataset, aes(x = AGE, y = WEIGHT, group = ID, colour = ID)) + geom_line()
This also works for a small number of subjects but is unreadable with hundreds of subjects.
I've tried to subset using code like this:
temp <- split(dataset,dataset$ID)
but I'm not sure how to work with the resulting dataset. Or perhaps there is a way to simply adjust the facet_wrap so that individual plots are created?
Thanks!
Because you want to split up the dataset and make a plot for each level of a factor, I would approach this with one of the split-apply-return tools from the plyr package.
Here is a toy example using the mtcars dataset. I first create the plot and name it p, then use dlply to split the dataset by a factor and return a plot for each level. I'm taking advantage of %+% from ggplot2 to replace the data.frame in a plot.
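library(ggplot2)
library(plyr)
# Template plot; its data is swapped out for each subset below
p = ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_line()
# Split mtcars by cyl and return one plot per level
dlply(mtcars, .(cyl), function(x) p %+% x)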
This returns all the plots, one after another. If you name the resulting list object you can also call one plot at a time.
plots = dlply(mtcars, .(cyl), function(x) p %+% x)
plots[[1]]
Edit
I started thinking about putting a title on each plot based on the factor, which seems like it would be useful.
dlply(mtcars, .(cyl), function(x) p %+% x + facet_wrap(~cyl))
Edit 2
Here is one way to save these in a single document, one plot per page, working with the list of plots named plots. I didn't change any of the defaults in pdf (so the file is written as Rplots.pdf in the working directory), but you can certainly explore the changes you can make.
pdf()
plots
dev.off()
Updated to use package dplyr instead of plyr. The plotting is done inside do, and the output has a named column that contains all the plots as a list.
library(dplyr)
plots = mtcars %>%
group_by(cyl) %>%
do(plots = p %+% . + facet_wrap(~cyl))
Source: local data frame [3 x 2]
Groups: <by row>

  cyl           plots
1   4 <S3:gg, ggplot>
2   6 <S3:gg, ggplot>
3   8 <S3:gg, ggplot>
To see the plots in R, just ask for the column that contains the plots.
plots$plots
And to save as a pdf
pdf()
plots$plots
dev.off()
A few years ago, I wanted to do something similar - plot individual trajectories for ~2500 participants with 1-7 measurements each. I did it like this, using plyr and ggplot2:
library(plyr)
library(ggplot2)
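# Make sure the output directory exists
dir.create("plots", showWarnings = FALSE)
d_ply(dat, .variables = "participant_id", .fun = function(x) {
  # Generate the desired plot
  p <- ggplot(x, aes(x = phase, y = result)) +
    geom_point() +
    geom_line()
  # Save it to a file named after the participant
  # Putting it in a subdirectory is prudent
  # (x$participant_id is constant within a subset, so take the first value)
  ggsave(file.path("plots", paste0(x$participant_id[1], ".png")), plot = p)
})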
A little slow, but it worked. If you want to get a sense of all participants' trajectories in one plot (like your second example - aka the spaghetti plot), you can tweak the transparency of the lines (forget coloring them, though):
ggplot(data = dat, aes(x = phase, y = result, group = participant_id)) +
geom_line(alpha = 0.3)
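To give a feel for the spaghetti plot, here is a sketch on simulated stand-in data. The column names follow the code above, but the data themselves are made up (50 participants, 5 phases each), and the red average line is my own addition.

```r
library(ggplot2)

set.seed(1)

# Simulated stand-in data (hypothetical): 50 participants, 5 phases each
dat <- data.frame(
  participant_id = rep(1:50, each = 5),
  phase = rep(1:5, times = 50)
)
# Participant-specific offset plus a common upward trend and noise
offsets <- rnorm(50, sd = 3)
dat$result <- 10 + 2 * dat$phase + offsets[dat$participant_id] + rnorm(nrow(dat))

# Semi-transparent individual lines, with the overall mean trajectory on top
p <- ggplot(dat, aes(x = phase, y = result, group = participant_id)) +
  geom_line(alpha = 0.3) +
  stat_summary(aes(group = 1), fun = mean, geom = "line", colour = "red")

p
```

With thousands of participants, a lower alpha (say 0.05) usually keeps the dense regions readable.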
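lapply(temp, function(X) ggplot(X, ...))
where X is one element of temp, i.e. the subset of your data for a single ID.
Keep in mind that you may have to print the ggplot object explicitly (print(ggplot(X, ...))), for example when the call runs inside a loop or another function, where results are not printed automatically.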
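Here is a runnable sketch of that idea using mtcars in place of your data, with cyl standing in for ID. The aesthetics, the titles, and the output file name are my own assumptions. Each plot is printed explicitly so that it lands on its own page of the pdf.

```r
library(ggplot2)

# Split the data by the grouping variable (cyl stands in for ID here)
temp <- split(mtcars, mtcars$cyl)

# Build one plot per subset, titled with the group it belongs to
plots <- lapply(names(temp), function(id) {
  ggplot(temp[[id]], aes(x = wt, y = mpg)) +
    geom_line() +
    ggtitle(paste("cyl =", id))
})

# Write all plots to a single pdf, one per page; print() is required here
out <- file.path(tempdir(), "plots_by_group.pdf")
pdf(out)
invisible(lapply(plots, print))
dev.off()
```

For your data you would replace mtcars$cyl with dataset$ID and map AGE and WEIGHT in aes.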