Errorbars in r of two groups ggplot2 - r

I'd like to plot standard deviations of the mean(z)/mean(b) which are grouped by two factors $angle and $treatment:
z= Tracer angle treatment
60 0 S
51 0 S
56.415 15 X
56.410 15 X
b=Tracer angle treatment
21 0 S
15 0 S
16.415 15 X
26.410 15 X
So far I've calculated the mean for each variable based on angle and treatment:
aggmeanz <-aggregate(z$Tracer, list(angle=z$angle,treatment=z$treatment), FUN=mean)
aggmeanb <-aggregate(b$Tracer, list(angle=b$angle,treatment=b$treatment), FUN=mean)
It now looks like this:
aggmeanz
angle treatment x
1 0 S 0.09088021
2 30 S 0.18463353
3 60 S 0.08784315
4 80 S 0.09127198
5 90 S 0.12679296
6 0 X 2.68670392
7 15 X 0.50440692
8 30 X 0.83564470
9 60 X 0.52856956
10 80 X 0.63220093
11 90 X 1.70123025
But when I come to plot it, I can't quite get what I'm after
ggplot(aggmeanz, aes(x=aggmeanz$angle,y=aggmeanz$x/aggmeanb$x, colour=treatment)) +
geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=0.1, ymax=1.15),
width=.2,
position=position_dodge(.9)) +
theme(panel.grid.minor = element_blank()) +
theme_bw()
EDIT:
dput(aggmeanz)
structure(list(time = structure(c(1L, 3L, 4L, 5L, 6L, 1L, 2L,
3L, 4L, 5L, 6L), .Label = c("0", "15", "30", "60", "80", "90"
), class = "factor"), treatment = structure(c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("S", "X"), class = "factor"),
x = c(56.0841582902523, 61.2014237854156, 42.9900742785269,
42.4688447229277, 41.3354173870287, 45.7164231791512, 55.3943182966382,
55.0574951462903, 48.1575625699563, 60.5527200655174, 45.8412287451211
)), .Names = c("time", "treatment", "x"), row.names = c(NA,
-11L), class = "data.frame")
> dput(aggmeanb)
structure(list(time = structure(c(1L, 3L, 4L, 5L, 6L, 1L, 2L,
3L, 4L, 5L, 6L), .Label = c("0", "15", "30", "60", "80", "90"
), class = "factor"), treatment = structure(c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("S", "X"), class = "factor"),
x = c(56.26325504249, 61.751655279608, 43.1687113436753,
43.4147408285209, 41.9113698082799, 46.2800894420131, 55.1550995335947,
54.7531592595068, 47.3280215294235, 62.4629068516043, 44.2590192583692
)), .Names = c("time", "treatment", "x"), row.names = c(NA,
-11L), class = "data.frame")
EDIT 2: I calculated the standard dev as follows:
aggstdevz <-aggregate(z$Tracer, list(angle=z$angle,treatment=z$treatment), FUN=std)
aggstdevb <-aggregate(b$Tracer, list(angle=b$angle,treatment=b$treatment), FUN=std)
Any thoughts would be much appreciated,
Cheers

As others have noted, you'll need to join the two dataframes together. There are also some little quirks in the dput data you showed, so I've renamed some columns to make sure that they join appropriately and match what you've attempted. NOTE: You'll need to name the two means differently so that they don't get merged together or cause conflicts.
names(aggmeanb)[names(aggmeanb) == "x"] = "mean_b"
names(aggmeanb)[names(aggmeanb) == "time"] = "angle"
names(aggmeanz)[names(aggmeanz) == "x"] = "mean_z"
names(aggmeanz)[names(aggmeanz) == "time"] = "angle"
library(plyr)  # join() comes from the plyr package
joined_data = join(aggmeanb, aggmeanz)
joined_data$divmean = joined_data$mean_b/joined_data$mean_z
> head(joined_data)
angle treatment mean_b mean_z divmean
1 0 S 56.26326 56.08416 1.003193
2 30 S 61.75166 61.20142 1.008991
3 60 S 43.16871 42.99007 1.004155
4 80 S 43.41474 42.46884 1.022273
5 90 S 41.91137 41.33542 1.013934
6 0 X 46.28009 45.71642 1.012330
ggplot(joined_data, aes(factor(angle), divmean)) +
geom_boxplot() +
theme_bw() +  # apply the complete theme first, otherwise it overwrites the tweak below
theme(panel.grid.minor = element_blank())
It might be that the data you've included is just a bit of your real data set, but as is there's only one data point per angle-treatment group. However, when you are using a fuller dataset, you can try something like:
ggplot(joined_data, aes(factor(angle), divmean, fill = treatment)) +
geom_boxplot() +
facet_grid(.~angle, scales = "free_x")
That will group the boxes by angle and fill them by treatment.
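If plyr is not available, the same join can presumably be done with base R's merge(). A minimal sketch, assuming the columns have already been renamed as above; the two-row data frames are just the first two S rows of the printed output, rebuilt by hand for illustration:

```r
# First two "S" rows of the printed output, typed in by hand (illustration only)
aggmeanz <- data.frame(angle = c("0", "30"), treatment = c("S", "S"),
                       mean_z = c(56.08416, 61.20142))
aggmeanb <- data.frame(angle = c("0", "30"), treatment = c("S", "S"),
                       mean_b = c(56.26326, 61.75166))

# Join on both grouping columns so each angle/treatment pair lines up
joined_data <- merge(aggmeanb, aggmeanz, by = c("angle", "treatment"))

# Ratio of the two group means, as in the answer
joined_data$divmean <- joined_data$mean_b / joined_data$mean_z
print(joined_data)
```

Naming the key columns explicitly in `by` avoids accidentally joining on a shared column such as x.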

Think about the problem in two steps:
1) Create a data frame (say data) which contains all the information you would like to visualize. In this case, this seems to be the two factors (angle, treatment), the mean group differences (say dif) and standard errors (say ste).
2) Visualize this information.
Step 2) will be easy. This should probably produce something very similar to your sketch.
ggplot(data, aes(x=angle, y=dif, colour=treatment)) +
geom_point(position=position_dodge(0.1)) +
geom_errorbar(aes(ymin=dif-ste, ymax=dif+ste), width=.1, position=position_dodge(0.1)) +
theme_bw()
However, at this point, you do not provide enough information to get help with Step 1. Try to include code which produces your original data (or the type of data you have) instead of copy-pasting chunks of your data output or pasting the aggregated data which lacks standard errors.
Combining your two aggregated data frames and generating random numbers for standard error produces the graph below:
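# I imported your two aggregated data frames from your dput output.
# rnorm(11) stands in for the missing standard errors; abs() keeps them non-negative.
data <- cbind(aggmeanb, aggmeanz$x, abs(rnorm(11)))
# Column 3 is aggmeanb$x and column 4 is aggmeanz$x, so name them in that order.
names(data) <- c("angle", "treatment", "meanb", "meanz", "ste")
data$dif <- data$meanz - data$meanb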
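For Step 1 with real data, something along these lines might work. This is only a sketch: the raw z and b data frames are simulated here (make_raw and summarise_groups are hypothetical helpers, not from the question), and the standard error of the difference assumes the two samples are independent.

```r
library(ggplot2)

# Hypothetical raw data: two replicate Tracer readings per angle/treatment cell
set.seed(1)
make_raw <- function(centre) {
  grid <- expand.grid(rep = 1:2, angle = c(0, 15), treatment = c("S", "X"))
  grid$Tracer <- rnorm(nrow(grid), mean = centre, sd = 3)
  grid[, c("Tracer", "angle", "treatment")]
}
z <- make_raw(55)
b <- make_raw(20)

# Standard error of the mean
se <- function(v) sd(v) / sqrt(length(v))

# Mean and standard error per angle/treatment group
summarise_groups <- function(d, suffix) {
  out <- aggregate(Tracer ~ angle + treatment, data = d,
                   FUN = function(v) c(mean = mean(v), se = se(v)))
  out <- do.call(data.frame, out)  # flatten the matrix column into two columns
  names(out)[3:4] <- paste0(c("mean", "se"), suffix)
  out
}

data <- merge(summarise_groups(z, "z"), summarise_groups(b, "b"),
              by = c("angle", "treatment"))
data$dif <- data$meanz - data$meanb
# Assumption: z and b are independent, so the variances of the two means add
data$ste <- sqrt(data$sez^2 + data$seb^2)

p <- ggplot(data, aes(x = factor(angle), y = dif, colour = treatment)) +
  geom_point(position = position_dodge(0.3)) +
  geom_errorbar(aes(ymin = dif - ste, ymax = dif + ste),
                width = 0.1, position = position_dodge(0.3)) +
  theme_bw()
print(p)
```

If you need the ratio mean(z)/mean(b) rather than the difference, the error bars would need a different formula, which I have not attempted here.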

Related

Plot one vs many actual-predicted values scatter plot using R

For a sample dataframe df, pred_value and real_value represent the monthly predicted and actual values of a variable, and acc_level represents the accuracy level of each prediction relative to the actual value for the corresponding month; the smaller the value, the more accurate the prediction:
df <- structure(list(date = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L), .Label = c("2022/3/31", "2022/4/30",
"2022/5/31"), class = "factor"), pred_value = c(2721.8, 2721.8,
2705.5, 2500, 2900.05, 2795.66, 2694.45, 2855.36, 2300, 2799.82,
2307.36, 2810.71, 3032.91), real_value = c(2736.2, 2736.2, 2736.2,
2736.2, 2736.2, 2759.98, 2759.98, 2759.98, 2759.98, 3000, 3000,
3000, 3000), acc_level = c(1L, 1L, 2L, 3L, 3L, 1L, 2L, 2L, 3L,
2L, 3L, 2L, 1L)), class = "data.frame", row.names = c(NA, -13L
))
Out:
date pred_value real_value acc_level
1 2022/3/31 2721.80 2736.20 1
2 2022/3/31 2721.80 2736.20 1
3 2022/3/31 2705.50 2736.20 2
4 2022/3/31 2500.00 2736.20 3
5 2022/3/31 2900.05 2736.20 3
6 2022/4/30 2795.66 2759.98 1
7 2022/4/30 2694.45 2759.98 2
8 2022/4/30 2855.36 2759.98 2
9 2022/4/30 2300.00 2759.98 3
10 2022/5/31 2799.82 3000.00 2
11 2022/5/31 2307.36 3000.00 3
12 2022/5/31 2810.71 3000.00 2
13 2022/5/31 3032.91 3000.00 1
I've plotted the predicted values with code below:
library(ggplot2)
ggplot(df, aes(x=date, y=pred_value, color=acc_level)) +
geom_point(size=2, alpha=0.7, position=position_jitter(w=0.1, h=0)) +
theme_bw()
Beyond what I've done above, if I hope to plot the actual values for each month with red line and red points, how could I do that? Thanks.
Reference:
How to add 4 groups to make Categorical scatter plot with mean segments?
We can add the actuals using additional layers. To make the line show up, we need to specify that the points belong to the same series (group = 1).
Because the x axis is discrete, ggplot by default puts each x value in its own group, so no line is drawn between them. Alternatively, we could convert the date variable to a Date type, for example with aes(x = as.Date(date), ...).
library(ggplot2)
ggplot(df, aes(x=date, y=pred_value, color=as.factor(acc_level))) +
geom_point(size=2, alpha=0.7, position=position_jitter(w=0.1, h=0)) +
geom_point(aes(y = real_value), size=2, color = "red") +
geom_line(aes(y = real_value, group = 1), color = "red") +
scale_color_manual(values = c("yellow", "magenta", "cyan"),
name = "Acc Level") +
theme_bw()
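The Date-based alternative could look roughly like this. It is a sketch on a six-row subset of the question's data (typed in by hand), and the actuals helper data frame is my own addition so that each month's real value is drawn once:

```r
library(ggplot2)

# Six rows taken from the question's data, with date as plain text
df <- data.frame(
  date = c("2022/3/31", "2022/3/31", "2022/4/30",
           "2022/4/30", "2022/5/31", "2022/5/31"),
  pred_value = c(2721.8, 2500, 2795.66, 2300, 2799.82, 3032.91),
  real_value = c(2736.2, 2736.2, 2759.98, 2759.98, 3000, 3000),
  acc_level = c(1L, 3L, 1L, 3L, 2L, 1L)
)

# With a real Date on the x axis the scale is continuous,
# so geom_line() connects the points without group = 1
df$date <- as.Date(df$date, format = "%Y/%m/%d")

# One actual value per month
actuals <- unique(df[, c("date", "real_value")])

p <- ggplot(df, aes(x = date, y = pred_value)) +
  geom_point(aes(colour = factor(acc_level)), size = 2, alpha = 0.7) +
  geom_line(data = actuals, aes(y = real_value), colour = "red") +
  geom_point(data = actuals, aes(y = real_value), colour = "red", size = 2) +
  scale_x_date(breaks = actuals$date, date_labels = "%Y/%m/%d") +
  labs(colour = "Acc Level") +
  theme_bw()
print(p)
```

One side effect of the Date axis is that the months are spaced by calendar distance instead of evenly.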

Create line graph with multiple lines in R

I want to plot census data to compare data for each race over multiple years.
My data frame has years 1950-2010 (every 10 years) as the rows and race as the columns. The data at the cross section is the percentage of that race in a given year.
I want my line graph to plot the years on the x axis and race on the y axis. So with my 5 "race" variables, there would be 5 lines of different colors all plotted on the same graph.
I have tried to watch videos and scoured all over here but nothing I find seems to work the way I want it to.
Edit:
I refactored the code and built my own dataframe instead of having it return a matrix.
However, I want the right side to say "Race" and then have my 5 lines. I am working on getting one line to show up at all before doing the other 4.
new dataframe
returned plot
Edit:
I have figured out thus far in my code - Allston <- ggplot(data = dataAllston, aes(Year, White.pct, group = 1)) + geom_point(aes(color = "orange")) + geom_line(aes(color = "orange"))
I want to scale the Y axis and from 0-1 in 0.2 increments and have the Y be "Race" instead of the individual labels. And more than just relabeling -- I want the graph to be representative of the actual increases/decreases as opposed to a straight line diagonally down as it is now.
I think it will take me longer to learn how to make the reproducible code than it will to make tweaks.
new returned plot
Edit:
dput(dataAllston)
returns
structure(list(Year = c(1950, 1960, 1970, 1980, 1990, 2000, 2010
), White.pct = structure(7:1, .Label = c("57.0", "59.0", "63.0",
"78.0", "90.8", "98.0", "98.3"), class = "factor"), BlackOrAA.pct =
structure(c(2L,
1L, 3L, 4L, 5L, 4L, 4L), .Label = c("1.20", "1.30", "2.60", "5.00",
"9.00"), class = "factor"), Hispanic.pct = structure(c(1L, 1L,
3L, 4L, 2L, 2L, 2L), .Label = c("0.00", "13.0", "3.10", "6.00"
), class = "factor"), AsianOrPI.pct = structure(c(1L, 1L, 5L,
6L, 2L, 3L, 4L), .Label = c("0.00", "14.0", "18.0", "20.0", "3.20",
"9.00"), class = "factor"), Other.pct = structure(c(2L, 1L, 3L,
4L, 5L, 4L, 4L), .Label = c("1.20", "1.30", "2.60", "5.00", "9.00"
), class = "factor")), class = "data.frame", row.names = c(NA,
-7L))
result from dput(data)
You first need to reshape your dataset into a longer format, for example with the pivot_longer function from tidyr. At the end, your data should look like this.
Because all columns except Year are stored as factors, the first line converts them to numeric, which is more appropriate for plotting.
library(dplyr)
library(tidyr)
# dataAllston is the data frame from the question's dput() output
Reshaped_DF <- dataAllston %>% mutate_at(vars(ends_with(".pct")), ~as.numeric(as.character(.))) %>%
pivot_longer(-Year, names_to = "Races", values_to = "values")
# A tibble: 35 x 3
Year Races values
<dbl> <chr> <dbl>
1 1950 White.pct 98.3
2 1950 BlackOrAA.pct 1.3
3 1950 Hispanic.pct 0
4 1950 AsianOrPI.pct 0
5 1950 Other.pct 1.3
6 1960 White.pct 98
7 1960 BlackOrAA.pct 1.2
8 1960 Hispanic.pct 0
9 1960 AsianOrPI.pct 0
10 1960 Other.pct 1.2
# … with 25 more rows
Then, you can plot it in ggplot2 by doing:
library(ggplot2)
ggplot(Reshaped_DF,aes(x = Year, y = values, color = Races, group = Races))+
geom_line()+
geom_point()+
ylab("Percentage")
Does it answer your question?
If not, please consider providing a reproducible example of your dataset that people can easily copy/paste. See this guide: How to make a great R reproducible example
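On the factor conversion: this is probably also why the original plot showed a straight diagonal line. A factor on the y axis is drawn at evenly spaced level positions, and calling as.numeric() directly on a factor returns those level codes, not the percentages. A small base R illustration using the White.pct values from the dput output:

```r
# White.pct from the question, stored as a factor (levels are sorted as text)
white <- factor(c("98.3", "98.0", "90.8", "78.0", "63.0", "59.0", "57.0"))

# Direct conversion returns the internal level codes, not the percentages
codes <- as.numeric(white)

# Going through character first recovers the actual numbers
values <- as.numeric(as.character(white))

print(codes)
print(values)
```

That is the reason for the as.numeric(as.character(.)) step in the mutate_at call.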

Assign a color code for every unique string in multiple files - R

I'm trying to create a rule to assign a specific color code for every unique string for graphing purposes in ggplot2 for different files. For example, if I have two tab delimited files, file1.txt and file2.txt that look like this:
file1.txt
Freq Seq
90 AAGTGT
3 AAGTGG
3 AAGTCC
2 AATTTT
2 TTTTTT
file2.txt
Freq Seq
91 AAGTGT
4 AAGTGG
2 AAGTCC
2 CCCCCC
1 TTTTTT
There are a total of 6 different colors that will be used for the above files for the 6 different sequences (AAGTGT, AAGTGG, AAGTCC, CCCCCC, TTTTTT, AATTTT). Across my many files, I have ~3000 colors that I've created a palette (pal) for using
pal<-c(randomColor(count=2951))
Is there a method to ensure that all sequences among my many files maintain the ordered pairs of the strings and corresponding hex color codes (i.e. that all files that show the AAGTGT sequence will have the same hex color code for that string)? Of note, not all 3000 colors are represented in each file.
Thanks!
Hope this helps!
library(ggplot2)
library(randomcoloR)
#build a palette mapping using the 'Seq' column's values in all available dataframes
set.seed(123)
pal <- c(randomColor(count=6))
pal_seq_mapping <- data.frame(sequence=unique(c(as.character(df1$Seq),as.character(df2$Seq))), color=pal)
#example plot on 'df1' dataframe
ggplot(df1, aes(x=Seq, y=Freq)) +
geom_bar(stat="identity", fill=pal_seq_mapping[match(df1$Seq, pal_seq_mapping$sequence),"color"]) +
theme_bw()
#example plot on 'df2' dataframe
ggplot(df2, aes(x=Seq, y=Freq)) +
geom_bar(stat="identity", fill=pal_seq_mapping[match(df2$Seq, pal_seq_mapping$sequence),"color"]) +
theme_bw()
Output Plot:
Note that a sequence that appears in both df1 and df2 gets the same colour in both plots.
#sample data
> dput(df1)
structure(list(Freq = c(90L, 3L, 3L, 2L, 2L), Seq = structure(c(3L,
2L, 1L, 4L, 5L), .Label = c("AAGTCC", "AAGTGG", "AAGTGT", "AATTTT",
"TTTTTT"), class = "factor")), .Names = c("Freq", "Seq"), class = "data.frame", row.names = c(NA,
-5L))
> dput(df2)
structure(list(Freq = c(91L, 4L, 2L, 2L, 1L), Seq = structure(c(3L,
2L, 1L, 4L, 5L), .Label = c("AAGTCC", "AAGTGG", "AAGTGT", "CCCCCC",
"TTTTTT"), class = "factor")), .Names = c("Freq", "Seq"), class = "data.frame", row.names = c(NA,
-5L))
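A variation that may scale better to many files is to keep the palette as a named vector and hand it to scale_fill_manual(), which matches colours to values by name. This is a sketch only: rainbow() stands in for randomColor() so the example has no extra dependency, and plot_seqs is a hypothetical helper.

```r
library(ggplot2)

df1 <- data.frame(Freq = c(90, 3, 3, 2, 2),
                  Seq = c("AAGTGT", "AAGTGG", "AAGTCC", "AATTTT", "TTTTTT"))
df2 <- data.frame(Freq = c(91, 4, 2, 2, 1),
                  Seq = c("AAGTGT", "AAGTGG", "AAGTCC", "CCCCCC", "TTTTTT"))

# Every sequence seen in any file, in a fixed (sorted) order
all_seqs <- sort(unique(c(as.character(df1$Seq), as.character(df2$Seq))))

# Named vector: the name is the sequence, the value is its hex colour
pal <- setNames(rainbow(length(all_seqs)), all_seqs)

# Each plot looks colours up by name, so sequences missing from a file are simply unused
plot_seqs <- function(d) {
  ggplot(d, aes(x = Seq, y = Freq, fill = Seq)) +
    geom_col() +
    scale_fill_manual(values = pal) +
    theme_bw()
}

p1 <- plot_seqs(df1)
p2 <- plot_seqs(df2)
print(p1)
print(p2)
```

Saving pal once (for example with saveRDS) and reloading it would keep the pairing stable across sessions.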

How to display separate rows in histogram in R?

I have a set of data that I've assigned to a variable named "data1". I know how to make a histogram of a certain column with hist(data1$RT). But the rows fall into "high", "med", and "low" levels of the Frequency factor, and I want to make 3 separate histograms of RT, one for each level, but can't figure out how to do this. Here's an example of the data:
Frequency Prime_type RT
1 high prime 450
2 high prime 460
3 med prime 520
4 med prime 430
5 low prime 450
6 low prime 420
I can display hist(data1$RT), but how would I just display RT's 'high' or 'med' factors for example? I've tried a lot of things and am still stumped.
You can do it by faceting the plot with ggplot2. First, we reorder the levels of df$Frequency so that the panels appear in the order high, med, low. Then we create the histogram, specifying the breaks and using facet_wrap to split the chart into panels. Note that we set closed = "right" (right-closed, left-open intervals) so that the bins match those computed by the hist function.
library(ggplot2)
df$Frequency <- factor(df$Frequency, levels=unique(df$Frequency))
h <- ggplot(df, aes(x=RT)) +
geom_histogram(breaks=seq(420, 520, by=20), col="white", closed="right") +
facet_wrap( ~ Frequency) +
scale_x_continuous(breaks=seq(420, 520, by=20))
h
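If you would rather stay with base hist(), subsetting RT by Frequency should also work. A minimal sketch, with data1 rebuilt by hand from the example rows in the question:

```r
# Example rows from the question
data1 <- data.frame(
  Frequency = c("high", "high", "med", "med", "low", "low"),
  Prime_type = "prime",
  RT = c(450, 460, 520, 430, 450, 420)
)

# One vector of RT values per Frequency level
rt_by_freq <- split(data1$RT, data1$Frequency)

# Draw the histograms side by side, using the same breaks for each
op <- par(mfrow = c(1, length(rt_by_freq)))
for (lvl in names(rt_by_freq)) {
  hist(rt_by_freq[[lvl]], breaks = seq(420, 520, by = 20),
       main = lvl, xlab = "RT")
}
par(op)  # restore the previous plotting layout
```

For a single level, hist(data1$RT[data1$Frequency == "high"]) does the same thing without the loop.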
Data:
df <- structure(list(Frequency = structure(c(1L, 1L, 3L, 3L, 2L, 2L
), .Label = c("high", "low", "med"), class = "factor"), Prime_type = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "prime", class = "factor"), RT = c(450L,
460L, 520L, 430L, 450L, 420L)), .Names = c("Frequency", "Prime_type",
"RT"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6"))

Ggmap-geompoint, how to make grouping?

Suppose I have this dataframe
latitude longitude category
42.39905 -72.93871 A
42.39905 -73.93871 B
43.37471 -73.36336 A
43.37471 -74.36336 B
44.28322 -74.31423 B
What I would like to do is to group the coordinates by their rounded integer values. Then, for each group, I would draw a bubble whose size is a function of the number of points in the group.
The colour should diverge from A to B, depending on how many more A than B points the group contains. So far, I've been doing this,
map = get_map(location="jk",zoom=6,source="stamen")
#Plot the point
ggmap(map)+
geom_point(data=zipmap,
aes(x=round(longitude),y=round(latitude),colour=category))+
scale_color_brewer(type='div')
But as you would expect, the colour is not diverging, and the size of the bubble is not implemented. How could I achieve this? I can't use scale_x_continuous, as it is already used somewhere in ggmap
Here is one direction to try.
dput(df)
structure(list(latitude = c(42.39905, 42.39905, 43.37471, 43.37471,
44.28322), longitude = c(-73, -74, -73, -74, -74), category = structure(c(1L,
2L, 1L, 2L, 2L), .Label = c("A", "B"), class = "factor"), latround = structure(c(1L,
1L, 2L, 2L, 3L), .Label = c("42", "43", "44"), class = "factor"),
longround = structure(c(2L, 1L, 2L, 1L, 1L), .Label = c("-74",
"-73"), class = "factor")), .Names = c("latitude", "longitude",
"category", "latround", "longround"), row.names = c(NA, -5L), class = "data.frame")
df$latround <- as.factor(round(df$latitude)) # round the coords
df$longround <- as.factor(round(df$longitude))
library(dplyr) # group by rounded coordinates and count the categories
df2 <- df %>% group_by(latround) %>% summarise(catnumber = n())
latround catnumber
1 42 2
2 43 2
3 44 1
library(ggmap)
From here on, since you don't specify the location jk, I have only outlined an approach to plotting.
map <- get_map(location="jk",zoom=6,source="stamen")
#Plot the point
ggmap(map)+
# all aesthetics go inside aes(); df2 would also need a numeric longround column, which the summary above drops
geom_point(data=df2, aes(x=longround, y=latround, size=catnumber, colour=catnumber))+
scale_color_brewer(type='div') # more is needed in the ggmap code
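To fill in the aggregation part, here is one possible sketch without the map layer. It counts points per rounded lat/long cell and computes the share of category A, which can then drive a diverging colour scale. The last two rows of zipmap are hypothetical, added only so that one cell holds more than one point.

```r
library(ggplot2)

zipmap <- data.frame(
  latitude  = c(42.39905, 42.39905, 43.37471, 43.37471, 44.28322, 42.1, 42.3),
  longitude = c(-72.93871, -73.93871, -73.36336, -74.36336, -74.31423, -73.2, -72.8),
  category  = c("A", "B", "A", "B", "B", "B", "A")  # last two rows are made up
)

# Keep the rounded coordinates numeric so they can be placed on a map axis
zipmap$latround  <- round(zipmap$latitude)
zipmap$longround <- round(zipmap$longitude)

# Count all points and the A points in each rounded cell
zipmap$n    <- 1
zipmap$is_A <- as.numeric(zipmap$category == "A")
cells <- aggregate(cbind(n, is_A) ~ latround + longround, data = zipmap, FUN = sum)
cells$share_A <- cells$is_A / cells$n

# Bubble size follows the count; colour diverges around an even A/B split
p <- ggplot(cells, aes(x = longround, y = latround, size = n, colour = share_A)) +
  geom_point() +
  scale_colour_gradient2(low = "blue", mid = "grey80", high = "red", midpoint = 0.5)
print(p)
```

The same geom_point() layer, with data = cells and its own aes(), could presumably be added to ggmap(map) once a valid location is available.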
