Imagine you have 3 variables:
gestation of the mom
height of the mom
weight of the baby at birth
My 2 x variables are:
gestation of the mom
height of the mom
and my y variable is:
weight of the baby at birth
I would like to get a matrix of plots that shows the weight of the baby at birth as a function of the gestation of the mom, and the weight of the baby at birth as a function of the height of the mom.
I tried this:
pairs((baby$bwt~baby$gestation+baby$age))
I obtain a plot matrix like the one in the picture: matrix_picture
But I would like to know how to get only y as a function of x1 and y as a function of x2, because my picture shows every pair. In other words, I would like to obtain only the first row of my picture.
Thanks for reading.
EDIT :
[matrix2_picture][2]
As you can see, the x-axis always has the same range (0 - 300), but I would like a range suited to each plot for better visualisation. For example, age can't be 200 or 300, so I would like its x-axis to go from about 10 min to 50 max.
Thanks
EDIT2:
[matrix3][3]
Just one last question: if I want to get the same thing as in the picture, how can I do it with ggplot?
The first is gestation of the mom as a function of the weight of the baby at birth, the second is age of the mom as a function of the weight of the baby at birth, and the last is height of the mom as a function of the weight of the baby at birth.
I tried this:
df3 <- reshape2::melt(baby, "bwt")
ggplot(df3, aes(x=bwt, y=value)) +
geom_point() + facet_grid(.~variable,scales="free")
But I obtain this:
[matrix3][4]
As you can see, my y-axis is always the same, unlike when I used pairs.
Thanks a lot!
[2]: https://i.stack.imgur.com/jppCJ.png
[3]: https://i.stack.imgur.com/TnEBe.png
[4]: https://i.stack.imgur.com/BPOUP.png
Last edit:
Do you know how to do the same thing, but only for the residuals against each variable?
A bit like the function pairs(), but with residuals.
reg=lm(formula=baby$bwt~baby$weight+baby$gestation+baby$age)
summary(reg)
plot(reg)
I would like to have the residuals of baby$bwt as a function of these 3 variables (weight, gestation, age).
As far as I know, there isn't a solution using pairs. There are several other options; the one I know uses ggplot2.
First generating some dummy data:
df <- data.frame(
`gestation of the mom` = rnorm(20,300,30),
`height of the mom` = rnorm(20,170,10),
`weight of the baby at birth` = rnorm(20,50,5))
>df
gestation.of.the.mom height.of.the.mom weight.of.the.baby.at.birth
1 304.9339 165.7853 52.92590
2 219.7718 185.3528 43.06043
3 310.6279 166.5677 56.19357
4 278.8190 179.8276 54.33385
5 247.4760 186.6949 51.95354
Then reshaping the data frame for ggplot:
df2 <- reshape2::melt(df, "weight.of.the.baby.at.birth")
>df2
weight.of.the.baby.at.birth variable value
1 52.92590 gestation.of.the.mom 304.9339
2 43.06043 gestation.of.the.mom 219.7718
3 56.19357 gestation.of.the.mom 310.6279
4 54.33385 gestation.of.the.mom 278.8190
5 51.95354 gestation.of.the.mom 247.4760
...
21 52.92590 height.of.the.mom 165.7853
22 43.06043 height.of.the.mom 185.3528
23 56.19357 height.of.the.mom 166.5677
24 54.33385 height.of.the.mom 179.8276
25 51.95354 height.of.the.mom 186.6949
Then plotting:
library(ggplot2)
ggplot(df2, aes(x=value, y=weight.of.the.baby.at.birth)) +
geom_point() + facet_grid(.~variable)
Output:
You can find other answers in: Pairs scatter plot; one vs many, and Plot one numeric variable against n numeric variables in n plots.
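If you would rather stay in base graphics, a minimal sketch of the same "one y against several x" layout is a loop over the predictors with par(mfrow). This is an alternative to the ggplot2 approach above, not the method from the linked answers; the data frame baby_demo and its column names are made up for illustration.

```r
# Hypothetical demo data (not the asker's baby data set)
set.seed(1)
baby_demo <- data.frame(
  gestation = rnorm(20, 300, 30),
  height    = rnorm(20, 170, 10),
  bwt       = rnorm(20, 50, 5)
)

predictors <- c("gestation", "height")

# One row of panels, one panel per predictor; each panel gets its own x range
old_par <- par(mfrow = c(1, length(predictors)))
for (v in predictors) {
  plot(baby_demo[[v]], baby_demo$bwt, xlab = v, ylab = "bwt")
}
par(old_par)  # restore the previous layout
```

Because every panel is drawn by its own plot() call, the axis ranges adapt to each variable without any extra argument.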
EDIT1:
To make the scales be different, add the scales="free" argument to facet_grid:
ggplot(df2, aes(x=value, y=weight.of.the.baby.at.birth)) +
geom_point() + facet_grid(.~variable, scales="free")
Output:
EDIT2:
As you want the fixed variable to be your x axis, you need to swap x and y in aes and move variable to the rows in facet_grid, so that each panel gets its own y scale:
ggplot(df2, aes(x=weight.of.the.baby.at.birth, y=value)) +
geom_point() + facet_grid(variable~., scales="free")
Output:
EDIT3:
Creating the model:
reg = lm(df$weight.of.the.baby.at.birth ~ df$gestation.of.the.mom + df$height.of.the.mom)
Adding a column with the residuals (before reshaping), and then reshaping:
df$resid = resid(reg)
df2 <- reshape2::melt(df, c("weight.of.the.baby.at.birth","resid"))
Plotting:
ggplot(df2, aes(x=value, y=resid)) +
geom_point() + facet_grid(.~variable, scales="free")
Output:
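The reason for adding the residuals before reshaping is that each residual belongs to one original row, so it has to travel with that row when the data is stacked. A minimal base-graphics sketch of the same idea, assuming made-up data and column names (demo, gestation, height, bwt), fits the model with the data argument and plots the residuals against each predictor:

```r
# Hypothetical demo data (not the asker's baby data set)
set.seed(1)
demo <- data.frame(
  gestation = rnorm(20, 300, 30),
  height    = rnorm(20, 170, 10),
  bwt       = rnorm(20, 50, 5)
)

# Fit with data = demo so the coefficient names stay short
fit <- lm(bwt ~ gestation + height, data = demo)
demo$resid <- resid(fit)  # one residual per row, in row order

predictors <- c("gestation", "height")
old_par <- par(mfrow = c(1, length(predictors)))
for (v in predictors) {
  plot(demo[[v]], demo$resid, xlab = v, ylab = "residual")
  abline(h = 0, lty = 2)  # reference line at zero
}
par(old_par)
```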
Related
Question 1:
Let's assume I have carried out a "before-after" (repeated measures with two points in time) experiment with 100 subjects. Each subject ticks a score on a 1 to 3 numerical scale in the 'before' condition (at T1) and again, after some treatment is applied, in the 'after' condition (at T2). The behavior of each subject in the experiment can be described as a 'transition from the score value at T1 to the score value at T2', e.g. from 2 to 3, or from 1 to 1, or from 3 to 1 and so on. The Cartesian product tells us that 9 different transition types are theoretically possible. For each transition type, I calculated (in an external program) the count of observations.
This gives the following dataframe:
MyData1 <- data.frame(TransitionTypeID=seq(1:9), T1=c(1,1,1,2,2,2,3,3,3), T2=c(1,2,3,1,2,3,1,2,3), Count=c(2,14,0,18,12,8,23,12,11))
MyData1
For each score value (on the y-axis) I would like to plot a point and a line between T1 and T2 (on the x-axis), where the thickness of the line between T1 and T2 should (somehow) correspond to the observed count. The plot should simply visualize which transitions occur more often than others. Any hints?
Question 2:
While some pre-calculation steps of the above example have been carried out outside of R (in MS Access), I believe there must be a way of reaching the desired result from within R, i.e. using a dataframe with individual records for each subject and each point in time (i.e. with two rows per subject, one for the score at T1 and one for T2, hence in the 'long' format).
In that case the dataframe is something like this:
MyData2 <- data.frame(SubjectID=seq(1:100), Condition = c(rep("T1",100), rep("T2",100)), Score=floor(runif(200,min=1, max=4)))
library(ggplot2)
ggplot(data = MyData2, aes(x = Condition, y = Score, group = SubjectID)) + geom_line()
I get a nice plot showing the observed transitions, but obviously the individual lines for each subject are just plotted on top of each other, i.e. the thicknesses between T1 and T2 do not reflect the count of observations for each type of transition. Again: hints on how to achieve meaningful line thicknesses would be highly appreciated.
Here is a possible solution for Question 1.
MyData1 <- data.frame(TransitionTypeID=seq(1:9),
T1=c(1,1,1,2,2,2,3,3,3),
T2=c(1,2,3,1,2,3,1,2,3),
Count=c(2,14,0,18,12,8,23,12,11))
MyData1
df <- data.frame(x=rep(c("T1","T2"), each=nrow(MyData1)),
y=c(MyData1$T1,MyData1$T2),
ID=rep(MyData1$TransitionTypeID,2),
cnt=rep(MyData1$Count,2)
)
df_lb <- data.frame(x=rep(c("T1","T2"), each=3),
y=rep(1:3,2),
hj=rep(c(2,-1),each=3))
library(ggplot2)
pal <- colorRampPalette(c("white","blue","red"))
ggplot(data=df, aes(x=x, y=y, group=ID, size=cnt, color=cnt)) +
geom_line() +
geom_point(show.legend=F) +
labs(x="", y="Score") +
scale_color_gradientn(colours=pal(10)) +
geom_text(data=df_lb, aes(x=x, y=y, label=y), size=7, inherit.aes=F, hjust=df_lb$hj) +
theme_void()
For question #1
You get to your goal very tidily with geom_segment:
ggplot( data=cbind( MyData1 ),
aes(x=1, y=T1,xend=2, yend=T2 ,size=Count))+
geom_segment()
I suspect you will need to change the 0 Counts to NA to get that 1->3 transition to go away. I think a size of 0 should make a segment disappear, but apparently Hadley thinks otherwise.
Yep:
is.na(MyData1 ) <- MyData1==0
ggplot( data=cbind( MyData1 ), aes(x=1, y=T1,xend=2, yend=T2 ,size=Count))+geom_segment()
The same code as above now delivers the correct plot.
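For Question 2, the transition counts can also be computed inside R from the long-format data, which gives a data frame shaped like MyData1 that the plots above can use. The following is only a sketch under assumptions: the scores are simulated with sample() instead of floor(runif()), and the intermediate name wide is made up.

```r
# Simulated long-format data: two rows per subject (T1 and T2)
set.seed(42)
MyData2 <- data.frame(
  SubjectID = rep(1:100, times = 2),
  Condition = rep(c("T1", "T2"), each = 100),
  Score     = sample(1:3, 200, replace = TRUE)
)

# One row per subject, with columns Score.T1 and Score.T2
wide <- reshape(MyData2, idvar = "SubjectID", timevar = "Condition",
                direction = "wide")

# Cross-tabulate T1 against T2; fixed levels keep all 9 transition types,
# including those that never occur
counts <- as.data.frame(
  table(T1 = factor(wide$Score.T1, levels = 1:3),
        T2 = factor(wide$Score.T2, levels = 1:3)),
  responseName = "Count"
)

# Turn the factor labels back into numbers for plotting
counts$T1 <- as.integer(as.character(counts$T1))
counts$T2 <- as.integer(as.character(counts$T2))
counts
```

The resulting counts data frame has the T1, T2 and Count columns that the geom_segment call expects.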
I am trying to show different growing season lengths by displaying crop planting and harvest dates at multiple regions.
My final goal is a graph that looks like this:
which was taken from an answer to this question. Note that the dates are in julian days (day of year).
My first attempt to reproduce a similar plot is:
library(data.table)
library(ggplot2)
mydat <- "Region\tCrop\tPlanting.Begin\tPlanting.End\tHarvest.Begin\tHarvest.End\nCenter-West\tSoybean\t245\t275\t1\t92\nCenter-West\tCorn\t245\t336\t32\t153\nSouth\tSoybean\t245\t1\t1\t122\nSouth\tCorn\t183\t336\t1\t153\nSoutheast\tSoybean\t275\t336\t1\t122\nSoutheast\tCorn\t214\t336\t32\t122"
# read data as data table
mydat <- setDT(read.table(textConnection(mydat), sep = "\t", header=T))
# melt data table
m <- melt(mydat, id.vars=c("Region","Crop"), variable.name="Period", value.name="value")
# plot stacked bars
ggplot(m, aes(x=Crop, y=value, fill=Period, colour=Period)) +
geom_bar(stat="identity") +
facet_wrap(~Region, nrow=3) +
coord_flip() +
theme_bw(base_size=18) +
scale_colour_manual(values = c("Planting.Begin" = "black", "Planting.End" = "black",
"Harvest.Begin" = "black", "Harvest.End" = "black"), guide = "none")
However, there are a few issues with this plot:
Because the bars are stacked, the values on the x-axis are aggregated and end up too high - out of the 1-365 scale that represents day of year.
I need to combine Planting.Begin and Planting.End in the same color, and do the same to Harvest.Begin and Harvest.End.
Also, a "void" (or a completely uncolored bar) needs to be created between Planting.Begin and Harvest.End.
Perhaps the graph could be achieved with geom_rect or geom_segment, but I really want to stick to geom_bar since it's more customizable (for example, it accepts scale_colour_manual in order to add black borders to the bars).
Any hints on how to create such graph?
I don't think this is something you can do with geom_bar or geom_col. A more general approach would be to use geom_rect to draw rectangles. To do this, we need to reshape the data a bit:
library(dplyr)  # provides the %>% pipe used below
plotdata <- mydat %>%
dplyr::mutate(Crop = factor(Crop)) %>%
tidyr::pivot_longer(Planting.Begin:Harvest.End, names_to="period") %>%
tidyr::separate(period, c("Type","Event")) %>%
tidyr::pivot_wider(names_from=Event, values_from=value)
# Region Crop Type Begin End
# <chr> <fct> <chr> <int> <int>
# 1 Center-West Soybean Planting 245 275
# 2 Center-West Soybean Harvest 1 92
# 3 Center-West Corn Planting 245 336
# 4 Center-West Corn Harvest 32 153
# 5 South Soybean Planting 245 1
# ...
We've used tidyr to reshape the data so we have one row per rectangle that we want to draw, and we've also made Crop a factor. We can then plot it like this:
ggplot(plotdata) +
aes(ymin=as.numeric(Crop)-.45, ymax=as.numeric(Crop)+.45, xmin=Begin, xmax=End, fill=Type) +
geom_rect(color="black") +
facet_wrap(~Region, nrow=3) +
theme_bw(base_size=18) +
scale_y_continuous(breaks=seq_along(levels(plotdata$Crop)), labels=levels(plotdata$Crop))
The part that's a bit messy here is that we want a discrete scale for y but geom_rect prefers numeric values. Since Crop is now a factor, we use its underlying numeric codes to create the ymin and ymax positions. Then we need to relabel the y axis with the names of the factor levels.
If you also wanted to get the month names on the x axis you could do something like
dateticks <- seq.Date(as.Date("2020-01-01"), as.Date("2020-12-01"),by="month")
# then add this to your plot
... +
scale_x_continuous(breaks=lubridate::yday(dateticks),
labels=lubridate::month(dateticks, label=TRUE, abbr=TRUE))
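One thing to watch in the reshaped data: the South/Soybean planting row has Begin 245 and End 1, so the period wraps past the end of the year, and a rectangle with those limits would presumably be drawn across days 1 to 245 rather than wrapping around. A minimal sketch of one way to handle this is to split such rows into two pieces before plotting. The helper split_wrapped and the 365-day year are assumptions for illustration, not part of the answer above.

```r
# Two rows in the reshaped layout; the second one wraps past the year end
plotdata_demo <- data.frame(
  Region = c("Center-West", "South"),
  Crop   = c("Soybean", "Soybean"),
  Type   = c("Planting", "Planting"),
  Begin  = c(245, 245),
  End    = c(275, 1)
)

# Hypothetical helper: split every interval with End < Begin into
# "Begin .. last_day" and "1 .. End"
split_wrapped <- function(d, last_day = 365) {
  wraps <- d$End < d$Begin
  late  <- d[wraps, ]
  late$End <- last_day
  early <- d[wraps, ]
  early$Begin <- 1
  rbind(d[!wraps, ], late, early)
}

fixed <- split_wrapped(plotdata_demo)
fixed
```

After the split, every row satisfies Begin <= End, so each row can be drawn as an ordinary rectangle.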
I am looking to make a plot of the most commonly occurring FINAL_CALL_TYPE in my dataset by BOROUGH in NYC. I have a dataset with over 3 million obs. I broke this down into a sample of 2000, and have refined it further to just the incident type and the borough it occurred in.
Essentially, I want to create a plot that will visualize the 5 most common call types in each borough, with the count of each call type in each borough.
Below is a brief look at my data with just Call Type and Borough
> head(df)
FINAL_CALL_TYPE BOROUGH
1804978 INJURY BRONX
1613888 INJMAJ BROOKLYN
294874 INJURY BROOKLYN
1028374 DRUG BROOKLYN
1974030 INJURY MANHATTAN
795815 CVAC BRONX
This shows how many unique values there are
> str(df)
'data.frame': 2000 obs. of 2 variables:
$ FINAL_CALL_TYPE: Factor w/ 139 levels "ABDPFC","ABDPFT",..: 50 48 50 34 50 25 17 138 28 28 ...
$ BOROUGH : Factor w/ 5 levels "BRONX","BROOKLYN",..: 1 2 2 2 3 1 4 2 4 4 ...
This is the code that I have tried
> ggplot(df, aes(x=BOROUGH, y=FINAL_CALL_TYPE)) +
+ geom_bar(stat = 'identity') +
+ facet_grid(~BOROUGH)
and below is the result
I have tried a few suggestions from across this community, but I have not found any that show how to do this with 2 columns.
It would be much appreciated if someone knows a solution for this.
Thanks!
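If I understand correctly, you can use the tidyverse to do something like:
library(dplyr)
library(ggplot2)
# count calls per borough and call type, then keep the 5 most frequent types in each borough
df <- df %>%
group_by(BOROUGH, FINAL_CALL_TYPE) %>%
summarise(count = n()) %>%
top_n(n = 5, wt = count)
then plot
ggplot(df, aes(x = FINAL_CALL_TYPE, y = count)) +
geom_col() +
facet_wrap(~BOROUGH, scales = "free")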
creating the barplot
The first part of your problem is to create the barplot. With geom_bar you only need to supply the x variable, as the y-axis is the count of observations of that variable. You can then use the facet option to separate that count into different panels for another grouping variable.
library(ggplot2)
ggplot(data = diamonds, aes(x = color)) +
geom_bar() +
facet_grid(.~cut)
filtering to top 5 observations
The second part of your problem, limiting the data to only the top five in each group is slightly more complex. An easy way to do this is to first tally the data which will create a column n that has the count of observations. By adding the sort option we can filter the data to the first five rows in each group. tally, like summarize, automatically removes the last group.
In the ggplot call I now use geom_col instead of geom_bar and I explicitly specify that the y-variable is n (n is created by tally).
geom_bar plots the count of observations per x-variable, geom_col plots a y-variable value for each value of the x-variable.
scales = "free_x" removes values from the x-axis that are present in one cut panel but not another.
library(tidyverse)
df <- diamonds %>%
group_by(cut, color) %>%
tally(sort = TRUE) %>%
filter(row_number() <= 5)
ggplot(data = df, aes(x = color, y = n)) +
geom_col() +
facet_grid(.~cut, scales = "free_x")
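The same count-then-keep-five idea can be sketched in base R with the column names from the question. This swaps tally and filter for table, order and head, and it runs on made-up data: the data frame calls, the three boroughs and the TYPE1 to TYPE12 labels are invented for illustration.

```r
# Made-up stand-in for the asker's data
set.seed(1)
calls <- data.frame(
  BOROUGH         = sample(c("BRONX", "BROOKLYN", "QUEENS"), 300, replace = TRUE),
  FINAL_CALL_TYPE = sample(paste0("TYPE", 1:12), 300, replace = TRUE)
)

# Count every borough / call type combination
counts <- as.data.frame(
  table(BOROUGH = calls$BOROUGH, FINAL_CALL_TYPE = calls$FINAL_CALL_TYPE),
  responseName = "count"
)

# Sort by borough, then by descending count, and keep the first 5 rows per borough
counts <- counts[order(counts$BOROUGH, -counts$count), ]
top5 <- do.call(rbind, lapply(split(counts, counts$BOROUGH), head, 5))
top5
```

Unlike top_n, head keeps exactly five rows per borough, so ties at the cutoff are broken by row order. The top5 data frame can then go into geom_col with y = count, as in the ggplot calls above.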
Trying to make some plots with ggplot2 and cannot figure out how colour works as defined in aes. Struggling with errors of aesthetic length.
I've tried defining colours in either main ggplot call aes to give legend, but also in geom_line aes.
# Define dataset:
number<-rnorm(8,mean=10,sd=3)
species<-rep(c("rose","daisy","sunflower","iris"),2)
year<-c("1995","1995","1995","1995","1996","1996","1996","1996")
d.flowers<-cbind(number,species,year)
d.flowers<-as.data.frame(d.flowers)
#Plot with no colours:
ggplot(data=d.flowers,aes(x=year,y=number))+
geom_line(group=species) # Works fine
#Adding colour:
#Defining aes in main ggplot call:
ggplot(data=d.flowers,aes(x=year,y=number,colour=factor(species)))+
geom_line(group=species)
# Doesn't work with data size 8, asks for data of size 4
ggplot(data=d.flowers,aes(x=year,y=number,colour=unique(species)))+
geom_line(group=species)
# doesn't work with data size 4, now asking for data size 8
The first plot gives
Error: Aesthetics must be either length 1 or the same as the data (4): group
The second gives
Error: Aesthetics must be either length 1 or the same as the data (8): x, y, colour
So I'm confused - when given aes of length either 4 or 8 it's not happy!
How could I think about this more clearly?
Here are @kath's comments as a solution. It's subtle to learn at first, but what goes inside or outside aes() is key. Some more info here: When does the aesthetic go inside or outside aes()? There are also lots of good, googleable "ggplot aesthetic" pages with examples to cut, paste and try.
library(ggplot2)
number <- rnorm(8,mean=10,sd=3)
species <- rep(c("rose","daisy","sunflower","iris"),2)
year <- c("1995","1995","1995","1995","1996","1996","1996","1996")
d.flowers <- data.frame(number, species, year)
head(d.flowers)
#number species year
#1 8.957372 rose 1995
#2 7.145144 daisy 1995
#3 9.864917 sunflower 1995
#4 7.645287 iris 1995
#5 4.996174 rose 1996
#6 8.859320 daisy 1996
ggplot(data = d.flowers, aes(x = year,y = number,
group = species,
colour = species)) + geom_line()
#note geom_point() doesn't need to be grouped - try:
ggplot(data = d.flowers, aes(x = year,y = number, colour = species)) + geom_point()
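Two details of the question's code help explain the length errors, and they can be checked without plotting anything. The sketch below uses only base R; the names via_cbind and direct are made up. It shows that cbind turns every column into text, and that unique(species) has 4 elements while the data has 8 rows, which is why neither length satisfies ggplot when vectors are passed from outside the data frame.

```r
set.seed(1)
number  <- rnorm(8, mean = 10, sd = 3)
species <- rep(c("rose", "daisy", "sunflower", "iris"), 2)
year    <- c(rep("1995", 4), rep("1996", 4))

# cbind() builds a character matrix, so 'number' stops being numeric
via_cbind <- as.data.frame(cbind(number, species, year))
is.numeric(via_cbind$number)   # FALSE

# data.frame() keeps each column's own type
direct <- data.frame(number, species, year)
is.numeric(direct$number)      # TRUE

# Anything mapped inside aes() needs one value per row of the data
nrow(direct)                   # 8
length(unique(species))        # 4, so it cannot be mapped row by row
```

Mapping the column name itself (colour = species, group = species) inside aes() avoids the mismatch, because ggplot then looks the values up in the data, one per row.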
I'd like to create a bar chart using factors and more than two variables! My data looks like this:
Var1 Var2 ... VarN Factor1 Factor2
Obs1 1-5 1-5 ... 1-5
Obs2 1-5 1-5 ... ...
Obs3 ... ... ... ...
Each datapoint is a likert item ranging from 1-5
Plotting total sums using a dichotomized version (every item above 4 is a one, else 0)
I converted the data using this
MyDataFrame = dichotomize(MyDataFrame,>=4)
p <- colSums(MyDataFrame)
p <- data.frame(names(p),p)
names(p) <- c("var","value")
ggplot(p,aes(var,value)) + geom_bar() + coord_flip()
Doing this I lose the information provided by Factor1 etc. I'd like to use stacking in order to visualize which group of people each rating came from.
Is there an elegant solution to this problem? I read about using reshape to melt the data and then applying ggplot?
I would suggest the following: use one of your factors for stacking, the other one for faceting. You can remove position="fill" from geom_bar() to use counts instead of standardized values.
my.df <- data.frame(replicate(10, sample(1:5, 100, rep=TRUE)),
F1=gl(4, 5, 100, labels=letters[1:4]),
F2=gl(2, 50, labels=c("+","-")))
my.df[,1:10] <- apply(my.df[,1:10], 2, function(x) ifelse(x>4, 1, 0))
library(reshape2)
my.df.melt <- melt(my.df)
library(plyr)
res <- ddply(my.df.melt, c("F1","F2","variable"), summarize, sum=sum(value))
library(ggplot2)
ggplot(res, aes(y=sum, x=variable, fill=F1)) +
geom_bar(stat="identity", position="fill") +
coord_flip() +
facet_grid(. ~ F2) +
ylab("Percent") + xlab("Item")
In the above picture, I displayed observed frequencies of '1' (value above 4 on the Likert scale) for each combination of F1 (four levels) and F2 (two levels), where there are either 10 or 15 observations:
> xtabs(~ F1 + F2, data=my.df)
F2
F1 + -
a 15 10
b 15 10
c 10 15
d 10 15
I then computed conditional item sum scores with ddply,† using a 'melted' version of the original data.frame. I believe the remaining graphical commands are highly configurable, depending on what kind of information you want to display.
† In this simplified case, the ddply instruction is equivalent to with(my.df.melt, aggregate(value, list(F1=F1, F2=F2, variable=variable), sum)).
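For readers without reshape2 and plyr, the melt and sum steps can be sketched in base R. This is a swapped-in equivalent rather than the code used for the picture: it uses 3 items instead of 10 to keep the output short, builds the long table by hand (the name long is made up), and uses the formula interface of aggregate.

```r
set.seed(1)
my.df <- data.frame(replicate(3, sample(1:5, 100, replace = TRUE)),
                    F1 = gl(4, 5, 100, labels = letters[1:4]),
                    F2 = gl(2, 50, labels = c("+", "-")))
items <- c("X1", "X2", "X3")  # default names data.frame() gives the item columns

# Dichotomize: 1 if the rating is above 4, else 0
my.df[items] <- lapply(my.df[items], function(x) as.integer(x > 4))

# Hand-made 'melt': stack the item columns, repeating the factors alongside
long <- data.frame(
  F1       = rep(my.df$F1, times = length(items)),
  F2       = rep(my.df$F2, times = length(items)),
  variable = rep(items, each = nrow(my.df)),
  value    = unlist(my.df[items], use.names = FALSE)
)

# Sum of 1s for each F1 x F2 x item combination
res <- aggregate(value ~ F1 + F2 + variable, data = long, FUN = sum)
head(res)
```

The result has one row per combination of F1, F2 and item, with the sum in the value column instead of the sum column that the ddply call produces.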