Subset x and y variables separately in ggplot - r

I have a dataframe, df, that looks like so:
group ID y1 y2
A 1 21 14
A 2 11 21
A 3 21 17
...
B 1 71 12
B 2 41 14
B 3 31 15
...
I would like to use ggplot() to plot variables in one group against variables in another, for example df$y1[df$group=="A"] against df$y2[df$group=="B"]. I naively thought the code might look something like this, but it's obviously not correct:
ggplot(df, aes(x = df$y1[df$group=="A"], y = df$y2[df$group=="B"])) + geom_point()
I know that if I wanted to subset the overall data, for example to plot only group A, I could do something like:
ggplot(subset(df, group=="A"), aes(x = y1, y = y2)) + geom_point()
I think I could solve this by reshaping my data so as to create variables y1.A, y1.B, y2.A, y2.B and so forth, but I have many variables and this seems like a long-winded approach.
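One way to avoid a full reshape is to pair the two subsets on ID and plot the result. This is only a sketch: it assumes that ID is what links a row in group A to a row in group B (the question does not say so explicitly), and the name paired is made up for illustration. The toy values are the six rows shown above.

```r
library(ggplot2)

# Toy data copied from the rows shown in the question
df <- data.frame(
  group = rep(c("A", "B"), each = 3),
  ID    = rep(1:3, times = 2),
  y1    = c(21, 11, 21, 71, 41, 31),
  y2    = c(14, 21, 17, 12, 14, 15)
)

# Pair group A's y1 with group B's y2, matching rows by ID
# (assumption: ID identifies the same unit in both groups)
paired <- merge(
  df[df$group == "A", c("ID", "y1")],
  df[df$group == "B", c("ID", "y2")],
  by = "ID"
)

p <- ggplot(paired, aes(x = y1, y = y2)) + geom_point()
p
```

With many variables, only the two column names in the subsetting step change, so nothing like y1.A, y1.B, ... has to be created up front.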

Related

box plots for two columns side by side using ggplot

I have a dataset in the following format
value1 value2 group
10 20 A
20 30 A
67 45 B
98 76 C
102 11 A
11 22 B
10 10 B
19 20 C
I am trying to make box plots for the three groups (A, B and C), with the box plots for the first two columns (value1 and value2) side by side. I can make two separate plots as follows, but I can't figure out how to combine them so that they sit side by side.
p1 <- ggplot(x, aes(x=group, y=value1)) + geom_boxplot()
p2 <- ggplot(x, aes(x=group, y=value2)) + geom_boxplot()
I would appreciate any help. I am a newbie in R and ggplot.
Here's an option using pivot_longer from tidyr
x_new <- tidyr::pivot_longer(x, c(value1, value2))
ggplot(x_new, aes(x = group, y = value, col = name, fill = name)) + geom_boxplot(alpha = .5)
The gridExtra package can do this too. Assign your plots to variables then just use grid.arrange(plot1,plot2). Look up the documentation with ?grid.arrange for extra options.
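A minimal sketch of the gridExtra route, using the data from the question. The ncol = 2 argument is my addition to get the plots next to each other rather than stacked; it assumes gridExtra is installed.

```r
library(ggplot2)
library(gridExtra)

# Data copied from the question
x <- data.frame(
  value1 = c(10, 20, 67, 98, 102, 11, 10, 19),
  value2 = c(20, 30, 45, 76, 11, 22, 10, 20),
  group  = c("A", "A", "B", "C", "A", "B", "B", "C")
)

p1 <- ggplot(x, aes(x = group, y = value1)) + geom_boxplot()
p2 <- ggplot(x, aes(x = group, y = value2)) + geom_boxplot()

# ncol = 2 puts the two plots in one row; grid.arrange() draws them
# and invisibly returns the combined layout
g <- grid.arrange(p1, p2, ncol = 2)
```

Note the difference from the pivot_longer approach: this gives two separate panels with their own y axes, whereas the long-format plot puts the value1 and value2 boxes next to each other within each group on a shared axis.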

Create a histogram filled using another variable in ggplot

I am working with a dataset that includes the age of some people. I am trying to create a histogram for the ages of the people with ggplot in which the colours of the bars of the histogram should depend on some predefined age intervals.
So for example imagine a dataset like this:
>X
Age Age2
10 Under 14
11 Under 14
10 Under 14
13 Under 14
20 Between 15 and 25
21 Between 15 and 25
35 Above 25
I have tried to do something like this:
ggplot(X, aes(x = Age)) + geom_histogram(aes(fill = Age2))
But it displays the following error message:
Error: StatBin requires a continuous x variable the x variable is discrete. Perhaps you want stat="count"?
What am I doing wrong?
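Here it is plotted with ggplot2, with age stored as a numeric column (I also lowercased the column names):
library(ggplot2)
age <- c(10, 11, 10, 13, 20, 21, 35)
age2 <- c(rep("Under 14", times = 4), rep("Between 15 and 25", times = 2), "Above 25")
# cbind() on a numeric and a character vector gives a character matrix,
# so the age column comes out as text and has to be made numeric again
X <- as.data.frame(cbind(age, age2))
X$age <- as.numeric(age)
X
names(X)
summary(X)
p <- ggplot(X, aes(x = age)) +
  geom_histogram(aes(fill = age2))
p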
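To check whether this is what is happening in your own data, look at the column's type before plotting. The sketch below assumes the original X stored Age as text, which the StatBin error suggests but the question does not actually show; the values are copied from the question.

```r
library(ggplot2)

# Assumed reconstruction of the question's X, with Age stored as text
X <- data.frame(
  Age  = c("10", "11", "10", "13", "20", "21", "35"),
  Age2 = c(rep("Under 14", 4), rep("Between 15 and 25", 2), "Above 25"),
  stringsAsFactors = FALSE
)

is.numeric(X$Age)   # FALSE: ggplot2 treats this column as discrete

# Going through as.character() first keeps this safe if Age is a factor
X$Age <- as.numeric(as.character(X$Age))

is.numeric(X$Age)   # TRUE: geom_histogram() can now bin it

p <- ggplot(X, aes(x = Age)) + geom_histogram(aes(fill = Age2), binwidth = 5)
```

The binwidth = 5 is an arbitrary choice for this small example, not something from the original answer.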

Creating a line chart in r for the average value of groups

I'm trying to create simple line charts with R that connect data points for the average of groups of respondents (it would also be nice to label them or distinguish them in different colours, etc.).
My data is in long format and sorted as shown (I also have it in wide format if that's of any value):
ID gender week class motivation
1 male 0 1 100
1 male 6 1 120
1 male 10 1 130
...
2 female 0 1 90
2 female 6 1 NA
2 female 10 1 117
...
3 male 0 2 89
3 male 6 2 112
3 male 10 2 NA
...
Basically, every respondent was measured a total of n times and the occasions (week) were the same for everyone. Some respondents were missing during one or more occasions. Let's say for motivation. Variables like gender, class and ID don't change, motivation does.
I tried to get a line chart using ggplot2
## define base for the graphs and store in object 'plot'
plot <- ggplot(data = DataRlong, aes(x = week, y = motivation, group = gender))
plot + geom_line()
As grouping variable, I want to use class or gender for example.
However, my approach does not lead to lines that connect the averages per group.
I also get vertical lines for each measurement occasion. What does this mean? The only way I could imagine fixing this is to create a new variable average.motivation, compute the average for every group per occasion, and then assign this average to all members of the group. However, this would mean that I would have to do this for every single grouping variable whenever I want to display group lines based on another factor.
Also, how does the plot handle missing data? (If one member of a group has a missing value, I still want the group average for this occasion to be calculated from the remaining members rather than omitting the whole occasion for that group.)
Edit:
Thank you, the solution with dplyr works great for all my categorical variables.
Now, I'm trying to figure out how I can distinguish between subgroups by colouring their lines based on a second/third factor.
For example, I plot 20 lines for the groups of "class2", but rather than having all of them in 20 different colors, I would like them to use the same colour, if they belong to the same type of class ("class_type", e.g. A, B or C =20 lines, three groups of colours).
I've added the second factor to "mean_data2". That works well. Next, I've tried to change the colour argument in ggplot, (also tried as in geom_line), but that way, I don't have 20 lines anymore.
mean_data2 <- group_by(DataRlong, class2, class_type, occ) %>%
  summarise(procras = mean(procras, na.rm = TRUE))
library(ggplot2)
ggplot(na.omit(mean_data2), aes(x = occ, y = procras, colour = class2)) +
  geom_point() + geom_line(aes(colour = class_type))
You can also use the dplyr package to aggregate the data:
library(dplyr)
mean_data <- group_by(data, gender, week) %>%
summarise(motivation = mean(motivation, na.rm = TRUE))
You can use na.omit() to get rid of the NA values as follows:
library(ggplot2)
ggplot(na.omit(mean_data), aes(x = week, y = motivation, colour = gender)) +
geom_point() + geom_line()
There is no need here to explicitly use the group aesthetic because ggplot will automatically group the lines by the categorical variables in your plot. And the only categorical variable you have is gender. (See this answer for more information).
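The explicit group aesthetic does become necessary in the situation described in the question's edit, where there should be one line per class2 but the colour should follow class_type. A sketch under assumptions: the data below are made-up toy values (four classes instead of twenty), and only the column names class2, class_type, occ and procras come from the question.

```r
library(ggplot2)

# Made-up toy means: 4 classes, 2 class types, 3 occasions
mean_data2 <- data.frame(
  class2     = rep(c("c1", "c2", "c3", "c4"), each = 3),
  class_type = rep(c("A", "A", "B", "B"), each = 3),
  occ        = rep(c(0, 6, 10), times = 4),
  procras    = c(3.1, 3.4, 3.6, 2.9, 3.0, 3.3, 4.0, 3.8, 3.7, 3.5, 3.6, 3.9)
)

# group decides which points are joined into a line;
# colour only decides how each line is painted
p <- ggplot(mean_data2,
            aes(x = occ, y = procras, group = class2, colour = class_type)) +
  geom_line() +
  geom_point()

# Inspect what ggplot2 computed for the line layer
line_data <- layer_data(p, 1)
n_lines   <- length(unique(line_data$group))
n_colours <- length(unique(line_data$colour))
```

Here n_lines equals the number of classes and n_colours the number of class types. Mapping colour = class_type alone, without group, makes ggplot2 draw one line per class type, which is why the 20 lines collapsed.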
Another possibility is using stat_summary, so you can do it only with ggplot.
# note: fun.y was deprecated in ggplot2 3.3.0 in favour of fun
ggplot(data = DataRlong, aes(x = week, y = motivation, group = gender)) +
  stat_summary(geom = "line", fun = mean)
You almost certainly have to make sure those grouping variables are factors.
I'm not quite sure what you want, but here's a shot...
library("ggplot2")
df <- read.table(textConnection("ID gender week class motivation
1 male 0 1 100
1 male 6 1 120
1 male 10 1 130
2 female 0 1 90
2 female 6 1 NA
2 female 10 1 117
3 male 0 2 89
3 male 6 2 112
3 male 10 2 NA"), header=TRUE, stringsAsFactors=FALSE)
df2 <- aggregate(df$motivation, by=list(df$gender, df$week),
function(x)mean(x, na.rm=TRUE))
names(df2) <- c("gender", "week", "avg")
df2$gender <- factor(df2$gender)
ggplot(data = df2[!is.na(df2$avg), ],
aes(x = week, y = avg, group=gender, color=gender)) +
geom_point()+geom_line()

Generating a histogram and density plot from binned data

I've binned some data and currently have a dataframe that consists of two columns, one that specifies a bin range and another that specifies the frequency like this:-
> head(data)
binRange Frequency
1 (0,0.025] 88
2 (0.025,0.05] 72
3 (0.05,0.075] 92
4 (0.075,0.1] 38
5 (0.1,0.125] 20
6 (0.125,0.15] 16
I want to plot a histogram and density plot using this but I can't seem to find a way of doing so without having to generate new bins etc. Using this solution here I tried to do the following:-
p <- ggplot(data, aes(x= binRange, y=Frequency)) + geom_histogram(stat="identity")
but it crashes. Anyone know of how to deal with this?
Thank you
The problem is that ggplot doesn't understand the data the way you input it; you need to reshape it like so (I am not a regex master, so surely there are better ways to do this):
df <- read.table(header = TRUE, text = "
binRange Frequency
1 (0,0.025] 88
2 (0.025,0.05] 72
3 (0.05,0.075] 92
4 (0.075,0.1] 38
5 (0.1,0.125] 20
6 (0.125,0.15] 16")
library(stringr)
library(splitstackshape)
library(ggplot2)
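# extract the numbers (this drops the surrounding brackets)
df$binRange <- str_extract(df$binRange, "[0-9].*[0-9]+")
# split the data on the "," into two columns:
# one for the start-point and one for the end-point
df <- cSplit(df, "binRange")
# plot it; you actually don't need the second column
# (width is a fixed bar width, so it goes outside aes(); geom_bar has no
# breaks argument, so the axis breaks are set on the x scale instead)
ggplot(df, aes(x = binRange_1, y = Frequency)) +
  geom_bar(stat = "identity", width = 0.025) +
  scale_x_continuous(breaks = seq(0, 0.125, by = 0.025))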
or if you don't want the data to be interpreted numerically, you can just simply do the following:
df <- read.table(header = TRUE, text = "
binRange Frequency
1 (0,0.025] 88
2 (0.025,0.05] 72
3 (0.05,0.075] 92
4 (0.075,0.1] 38
5 (0.1,0.125] 20
6 (0.125,0.15] 16")
library(ggplot2)
ggplot(df, aes(x = binRange, y = Frequency)) + geom_bar(stat = "identity")
You won't be able to plot a density plot with your data, given that it's not continuous but rather categorical. That's why I actually prefer the second way of showing it.
You can try
library(ggplot2)
ggplot(df, aes(x = binRange, y = Frequency)) + geom_col()
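If you do want a numeric x axis but would rather not load stringr and splitstackshape, the bin bounds can also be pulled out with base R. This is a sketch, not the method from the answers above; the column names lower, upper and mid are made up for illustration, and centring each bar on the bin midpoint is my own choice.

```r
library(ggplot2)

# Data copied from the question
df <- data.frame(
  binRange  = c("(0,0.025]", "(0.025,0.05]", "(0.05,0.075]",
                "(0.075,0.1]", "(0.1,0.125]", "(0.125,0.15]"),
  Frequency = c(88, 72, 92, 38, 20, 16),
  stringsAsFactors = FALSE
)

# Strip everything except digits, dots and commas, then split on the comma
bounds <- strsplit(gsub("[^0-9.,]", "", df$binRange), ",", fixed = TRUE)
df$lower <- as.numeric(sapply(bounds, `[`, 1))
df$upper <- as.numeric(sapply(bounds, `[`, 2))

# Centre each bar on its bin so the bars line up with the bin edges
df$mid <- (df$lower + df$upper) / 2

p <- ggplot(df, aes(x = mid, y = Frequency)) + geom_col(width = 0.025)
p
```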

plot lines using qplot

I want to plot multiple lines on the same plot using qplot in the ggplot2 package.
But I'm having some problem with it.
Using the old plot, and lines function I would do something like
m<-cbind(1:4,5:8,-(5:8))
colnames(m)<-c("time","y1","y2")
m<-as.data.frame(m)
> m
time y1 y2
1 1 5 -5
2 2 6 -6
3 3 7 -7
4 4 8 -8
plot(x=m$time,y=m$y1,type='l',ylim=range(m[,-1]))
lines(x=m$time,y=m$y2)
Thanks
Using the reshape package to melt m:
library(reshape)
library(ggplot2)
m2 <- melt(m, id = "time")
p <- ggplot(m2, aes(x = time, y = value, color = variable))
p + geom_line() + ylab("y")
You could rename the columns in the new data.frame to your liking. The trick here is to have a factor that denotes each of the lines you want to plot.
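If you'd rather not load reshape, the same long layout can be built in base R with stack(). A minimal sketch: the name long is made up here, and values and ind are simply the column names that stack() produces.

```r
library(ggplot2)

# Same data as in the question
m <- data.frame(time = 1:4, y1 = 5:8, y2 = -(5:8))

# stack() puts y1 and y2 into one column (values) and records which
# column each value came from in a factor (ind)
long <- cbind(time = rep(m$time, times = 2), stack(m[c("y1", "y2")]))

# One line per level of ind, i.e. one per original column
p <- ggplot(long, aes(x = time, y = values, colour = ind)) +
  geom_line() +
  ylab("y")
p
```

The ind column plays the role of the factor that denotes each line.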
