I want to plot census data to compare data for each race over multiple years.
My data frame has years 1950-2010 (every 10 years) as the rows and race as the columns. The data at the cross section is the percentage of that race in a given year.
I want my line graph to plot the years on the x axis and race on the y axis. So with my 5 "race" variables, there would be 5 lines of different colors all plotted on the same graph.
I have tried to watch videos and scoured all over here but nothing I find seems to work the way I want it to.
Edit:
I refactored the code and built my own data frame instead of having it return a matrix.
However, I want the right side to say "Race" and then have my 5 lines. I am working on getting one line to show up at all before doing the other 4.
new dataframe
returned plot
Edit:
Here is what I have figured out so far in my code:
Allston <- ggplot(data = dataAllston, aes(Year, White.pct, group = 1)) +
geom_point(aes(color = "orange")) +
geom_line(aes(color = "orange"))
I want to scale the Y axis from 0 to 1 in 0.2 increments and have the Y axis titled "Race" instead of the individual column labels. And more than just relabeling: I want the graph to reflect the actual increases and decreases, as opposed to the straight diagonal line it shows now.
I think it will take me longer to learn how to make the reproducible code than it will to make tweaks.
new returned plot
Edit:
dput(dataAllston)
returns
structure(list(Year = c(1950, 1960, 1970, 1980, 1990, 2000, 2010
), White.pct = structure(7:1, .Label = c("57.0", "59.0", "63.0",
"78.0", "90.8", "98.0", "98.3"), class = "factor"), BlackOrAA.pct =
structure(c(2L,
1L, 3L, 4L, 5L, 4L, 4L), .Label = c("1.20", "1.30", "2.60", "5.00",
"9.00"), class = "factor"), Hispanic.pct = structure(c(1L, 1L,
3L, 4L, 2L, 2L, 2L), .Label = c("0.00", "13.0", "3.10", "6.00"
), class = "factor"), AsianOrPI.pct = structure(c(1L, 1L, 5L,
6L, 2L, 3L, 4L), .Label = c("0.00", "14.0", "18.0", "20.0", "3.20",
"9.00"), class = "factor"), Other.pct = structure(c(2L, 1L, 3L,
4L, 5L, 4L, 4L), .Label = c("1.20", "1.30", "2.60", "5.00", "9.00"
), class = "factor")), class = "data.frame", row.names = c(NA,
-7L))
You first need to reshape your dataset into a longer format, for example with the pivot_longer function from tidyr. The reshaped data is shown below the code.
As your data are stored as factors (except the Year column), the mutate_at step first converts them to numeric, which is more appropriate for plotting. The code assumes your data frame is named df.
library(dplyr)
library(tidyr)
Reshaped_DF <- df %>% mutate_at(vars(ends_with(".pct")), ~as.numeric(as.character(.))) %>%
pivot_longer(-Year, names_to = "Races", values_to = "values")
# A tibble: 35 x 3
    Year Races         values
   <dbl> <chr>          <dbl>
 1  1950 White.pct       98.3
 2  1950 BlackOrAA.pct    1.3
 3  1950 Hispanic.pct     0
 4  1950 AsianOrPI.pct    0
 5  1950 Other.pct        1.3
 6  1960 White.pct       98
 7  1960 BlackOrAA.pct    1.2
 8  1960 Hispanic.pct     0
 9  1960 AsianOrPI.pct    0
10  1960 Other.pct        1.2
# … with 25 more rows
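The as.numeric(as.character(.)) step matters because calling as.numeric() directly on a factor returns the internal level codes rather than the printed numbers. That is probably why the earlier plot showed a straight diagonal line. A minimal sketch using the White.pct values from the dput() output (the object names white, codes and values are only for illustration):

```r
# White.pct as it arrives: a factor whose levels are sorted as text
white <- factor(c("98.3", "98.0", "90.8", "78.0", "63.0", "59.0", "57.0"))

# Direct conversion returns the level codes, not the percentages
codes <- as.numeric(white)
print(codes)   # 7 6 5 4 3 2 1

# Going through character first recovers the real numbers
values <- as.numeric(as.character(white))
print(values)  # 98.3 98.0 90.8 78.0 63.0 59.0 57.0
```

Plotting codes against Year gives an evenly spaced descending line, while values shows the actual changes.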
Then, you can plot it in ggplot2 by doing:
library(ggplot2)
ggplot(Reshaped_DF,aes(x = Year, y = values, color = Races, group = Races))+
geom_line()+
geom_point()+
ylab("Percentage")
Does this answer your question?
If not, please consider providing a reproducible example of your dataset that people can easily copy/paste. See this guide: How to make a great R reproducible example
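For the 0 to 1 axis in 0.2 steps and the legend titled "Race" asked about in the question, one possible sketch is below. It assumes the percentages are divided by 100 to get shares. The column names Race and share and the object names wide, long and p are my own choices, not from the original code.

```r
library(tidyr)
library(ggplot2)

# Percentages from the question's dput() output, already numeric
wide <- data.frame(
  Year          = seq(1950, 2010, by = 10),
  White.pct     = c(98.3, 98.0, 90.8, 78.0, 63.0, 59.0, 57.0),
  BlackOrAA.pct = c(1.3, 1.2, 2.6, 5.0, 9.0, 5.0, 5.0),
  Hispanic.pct  = c(0, 0, 3.1, 6.0, 13.0, 13.0, 13.0),
  AsianOrPI.pct = c(0, 0, 3.2, 9.0, 14.0, 18.0, 20.0),
  Other.pct     = c(1.3, 1.2, 2.6, 5.0, 9.0, 5.0, 5.0)
)

# One row per Year/Race pair
long <- pivot_longer(wide, -Year, names_to = "Race", values_to = "pct")
long$Race  <- sub("\\.pct$", "", long$Race)  # drop the ".pct" suffix
long$share <- long$pct / 100                 # rescale 0-100 to 0-1

p <- ggplot(long, aes(x = Year, y = share, color = Race)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = wide$Year) +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.2)) +
  labs(y = "Share of population", color = "Race")
print(p)
```

Mapping color to Race is what produces the legend with one entry per line, so no manual legend code is needed.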
Related
For a sample data frame df, pred_value and real_value are the monthly predicted and actual values of a variable, and acc_level is the accuracy level of each prediction compared with the actual value for the corresponding month. The smaller the value, the more accurate the prediction:
df <- structure(list(date = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L), .Label = c("2022/3/31", "2022/4/30",
"2022/5/31"), class = "factor"), pred_value = c(2721.8, 2721.8,
2705.5, 2500, 2900.05, 2795.66, 2694.45, 2855.36, 2300, 2799.82,
2307.36, 2810.71, 3032.91), real_value = c(2736.2, 2736.2, 2736.2,
2736.2, 2736.2, 2759.98, 2759.98, 2759.98, 2759.98, 3000, 3000,
3000, 3000), acc_level = c(1L, 1L, 2L, 3L, 3L, 1L, 2L, 2L, 3L,
2L, 3L, 2L, 1L)), class = "data.frame", row.names = c(NA, -13L
))
Out:
        date pred_value real_value acc_level
1  2022/3/31    2721.80    2736.20         1
2  2022/3/31    2721.80    2736.20         1
3  2022/3/31    2705.50    2736.20         2
4  2022/3/31    2500.00    2736.20         3
5  2022/3/31    2900.05    2736.20         3
6  2022/4/30    2795.66    2759.98         1
7  2022/4/30    2694.45    2759.98         2
8  2022/4/30    2855.36    2759.98         2
9  2022/4/30    2300.00    2759.98         3
10 2022/5/31    2799.82    3000.00         2
11 2022/5/31    2307.36    3000.00         3
12 2022/5/31    2810.71    3000.00         2
13 2022/5/31    3032.91    3000.00         1
I've plotted the predicted values with code below:
library(ggplot2)
ggplot(df, aes(x=date, y=pred_value, color=acc_level)) +
geom_point(size=2, alpha=0.7, position=position_jitter(w=0.1, h=0)) +
theme_bw()
Out:
Beyond what I've done above, how can I also plot the actual values for each month as a red line with red points? Thanks.
Reference:
How to add 4 groups to make Categorical scatter plot with mean segments?
We can add the actuals using additional layers. To make the line show up, we need to specify that the points should be part of the same series.
By default, ggplot treats each value on a discrete x axis as its own group, so the points are not connected unless we set group = 1. Alternatively, we could convert the date variable to a date type, for example with aes(x = as.Date(date), ...).
library(ggplot2)
ggplot(df, aes(x=date, y=pred_value, color=as.factor(acc_level))) +
geom_point(size=2, alpha=0.7, position=position_jitter(w=0.1, h=0)) +
geom_point(aes(y = real_value), size=2, color = "red") +
geom_line(aes(y = real_value, group = 1), color = "red") +
scale_color_manual(values = c("yellow", "magenta", "cyan"),
name = "Acc Level") +
theme_bw()
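The as.Date() alternative could look roughly like the sketch below. With a continuous date axis the red line connects without group = 1. The helper table actuals and the jitter width of 2 (days) are my own assumptions, not part of the original code.

```r
library(ggplot2)

# Same values as the sample data in the question
df <- data.frame(
  date = rep(c("2022/3/31", "2022/4/30", "2022/5/31"), times = c(5, 4, 4)),
  pred_value = c(2721.8, 2721.8, 2705.5, 2500, 2900.05, 2795.66, 2694.45,
                 2855.36, 2300, 2799.82, 2307.36, 2810.71, 3032.91),
  real_value = rep(c(2736.2, 2759.98, 3000), times = c(5, 4, 4)),
  acc_level = c(1, 1, 2, 3, 3, 1, 2, 2, 3, 2, 3, 2, 1)
)

# Convert the text dates to Date so the x axis becomes continuous
df$date <- as.Date(df$date, format = "%Y/%m/%d")

# One row per month for the actual values (hypothetical helper table)
actuals <- unique(df[, c("date", "real_value")])

p <- ggplot(df, aes(x = date, y = pred_value)) +
  geom_point(aes(color = as.factor(acc_level)), size = 2, alpha = 0.7,
             position = position_jitter(width = 2, height = 0)) +
  geom_line(data = actuals, aes(y = real_value), color = "red") +
  geom_point(data = actuals, aes(y = real_value), color = "red", size = 2) +
  labs(color = "Acc Level") +
  theme_bw()
print(p)
```

Drawing the actuals from a one-row-per-month table also avoids overplotting the same red point several times.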
Let's say I have a data frame like this:
id  password  year  length  Something
1   1234567   2001  7       good
2   pass4     2001  5       bad
3   angel3    2003  6       bad
4   pizza     2004  5       ok
I'm trying to write code that creates a geom_point plot with 3 variables, but I only want to highlight a single level of the factor "Something". I don't want any of the other levels of Something (like good or bad) to be colored, or at least they can stay black.
I was thinking of something like this:
graph <- dat %>%
ggplot(aes(x=(year), y=length, color=Something$ok)+
geom_point()
but I can't use $ there.
You can color just one point by setting all points to one color and changing the color of the point you want to change. To do this you can use scale_color_manual
Data:
dat <- structure(list(id = 1:4, password = structure(c(1L, 3L, 2L, 4L
), .Label = c("1234567", "angel3", "pass4", "pizza"), class = "factor"),
year = c(2001L, 2001L, 2003L, 2004L), length = c(7L, 5L,
6L, 5L), Something = structure(c(2L, 1L, 1L, 3L), .Label = c("bad",
"good", "ok"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
Plot:
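library(dplyr) # for %>%
library(ggplot2)
dat %>%
ggplot(aes(x = year, y = length, color = Something == "ok")) +
geom_point() +
# FALSE (all other levels) is drawn blue, TRUE ("ok") orange
scale_color_manual(values = c("blue", "orange"))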
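Since the question asked for the other levels to stay black, here is a variant sketch. It names the entries of values so the mapping does not depend on the order of FALSE and TRUE. The legend title and point size are my own choices.

```r
library(ggplot2)

# Same rows as the example data in the question
dat <- data.frame(
  id = 1:4,
  password = c("1234567", "pass4", "angel3", "pizza"),
  year = c(2001, 2001, 2003, 2004),
  length = c(7, 5, 6, 5),
  Something = factor(c("good", "bad", "bad", "ok"))
)

# The logical test Something == "ok" becomes a two-level colour variable
p <- ggplot(dat, aes(x = year, y = length, color = Something == "ok")) +
  geom_point(size = 3) +
  scale_color_manual(
    values = c("TRUE" = "orange", "FALSE" = "black"),
    name = "Something is ok"
  )
print(p)
```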
I have a dataset of genes grouped by region, and I measure their distances between each other in their regions.
Currently, to calculate the total distance across all regions I use unique() on my region distances. However, this doesn't account for the chance that 2 regions have exactly the same distance, in which case both should be kept when I sum the total distance.
I am not sure how to incorporate this condition into my code, and the other questions I find here do not have conditions based on other columns of data like I need.
For example my data looks like this:
Gene    region  region.distance
ACE     1       10
AGT     1       10
BRCA    2       20
DVL1    3       10
NOTCH3  4       40
I then use this code to gain the unique values in region.distance to sum the total distance:
total.distance <- sum(unique(df$region.distance))
However, this does not account for regions 1 and 3 both having a distance of 10. The total distance for my example data above should be 80, not 70.
Is it possible to incorporate a condition within unique(), for example using diff(df$region), so that a duplicate value in a different region is still kept?
You can remove the duplicates within the group and then sum
library(dplyr)
df %>%
group_by(region) %>%
filter(!duplicated(region.distance)) %>%
pull(region.distance) %>% sum
#[1] 80
Similarly, in base R we can do
sum(subset(df, !ave(region.distance, region, FUN = duplicated))$region.distance)
#[1] 80
data
df <- structure(list(Gene = structure(1:5, .Label = c("ACE", "AGT",
"BRCA", "DVL1", "NOTCH3"), class = "factor"), region = c(1L,
1L, 2L, 3L, 4L), region.distance = c(10L, 10L, 20L, 10L, 40L)),
class = "data.frame", row.names = c(NA, -5L))
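To see where the 70 comes from and why de-duplicating per region fixes it, here is a small base R sketch of the same idea. The names naive_total and per_region are only illustrative.

```r
# Example data from the question
df <- data.frame(
  Gene = c("ACE", "AGT", "BRCA", "DVL1", "NOTCH3"),
  region = c(1, 1, 2, 3, 4),
  region.distance = c(10, 10, 20, 10, 40)
)

# unique() on the values alone merges regions 1 and 3 (both 10)
naive_total <- sum(unique(df$region.distance))

# De-duplicate on the (region, distance) pair instead, then sum
per_region <- unique(df[, c("region", "region.distance")])
total.distance <- sum(per_region$region.distance)

print(naive_total)     # 70
print(total.distance)  # 80
```

This relies on every gene in a region carrying the same region.distance, as in the example data.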
We can use data.table
library(data.table)
unique(setDT(df),by = c("region", "region.distance"))[, sum(region.distance)]
#[1] 80
data
df <- structure(list(Gene = structure(1:5, .Label = c("ACE", "AGT",
"BRCA", "DVL1", "NOTCH3"), class = "factor"), region = c(1L,
1L, 2L, 3L, 4L), region.distance = c(10L, 10L, 20L, 10L, 40L)),
class = "data.frame", row.names = c(NA, -5L))
I'd like to plot the standard deviations of mean(z)/mean(b), grouped by two factors, angle and treatment:
z =
Tracer  angle  treatment
60      0      S
51      0      S
56.415  15     X
56.410  15     X
b =
Tracer  angle  treatment
21      0      S
15      0      S
16.415  15     X
26.410  15     X
So far I've calculated the mean for each variable based on angle and treatment:
aggmeanz <-aggregate(z$Tracer, list(angle=z$angle,treatment=z$treatment), FUN=mean)
aggmeanb <-aggregate(b$Tracer, list(angle=b$angle,treatment=b$treatment), FUN=mean)
It now looks like this:
aggmeanz
   angle treatment          x
1      0         S 0.09088021
2     30         S 0.18463353
3     60         S 0.08784315
4     80         S 0.09127198
5     90         S 0.12679296
6      0         X 2.68670392
7     15         X 0.50440692
8     30         X 0.83564470
9     60         X 0.52856956
10    80         X 0.63220093
11    90         X 1.70123025
But when I come to plot it, I can't quite get what I'm after
ggplot(aggmeanz, aes(x=aggmeanz$angle,y=aggmeanz$x/aggmeanb$x, colour=treatment)) +
geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=0.1, ymax=1.15),
width=.2,
position=position_dodge(.9)) +
theme(panel.grid.minor = element_blank()) +
theme_bw()
EDIT:
dput(aggmeanz)
structure(list(time = structure(c(1L, 3L, 4L, 5L, 6L, 1L, 2L,
3L, 4L, 5L, 6L), .Label = c("0", "15", "30", "60", "80", "90"
), class = "factor"), treatment = structure(c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("S", "X"), class = "factor"),
x = c(56.0841582902523, 61.2014237854156, 42.9900742785269,
42.4688447229277, 41.3354173870287, 45.7164231791512, 55.3943182966382,
55.0574951462903, 48.1575625699563, 60.5527200655174, 45.8412287451211
)), .Names = c("time", "treatment", "x"), row.names = c(NA,
-11L), class = "data.frame")
> dput(aggmeanb)
structure(list(time = structure(c(1L, 3L, 4L, 5L, 6L, 1L, 2L,
3L, 4L, 5L, 6L), .Label = c("0", "15", "30", "60", "80", "90"
), class = "factor"), treatment = structure(c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("S", "X"), class = "factor"),
x = c(56.26325504249, 61.751655279608, 43.1687113436753,
43.4147408285209, 41.9113698082799, 46.2800894420131, 55.1550995335947,
54.7531592595068, 47.3280215294235, 62.4629068516043, 44.2590192583692
)), .Names = c("time", "treatment", "x"), row.names = c(NA,
-11L), class = "data.frame")
EDIT 2: I calculated the standard deviation as follows:
aggstdevz <- aggregate(z$Tracer, list(angle=z$angle,treatment=z$treatment), FUN=sd)
aggstdevb <- aggregate(b$Tracer, list(angle=b$angle,treatment=b$treatment), FUN=sd)
Any thoughts would be much appreciated,
Cheers
As others have noted, you'll need to join the two data frames together. There are also some little quirks in the dput data you showed, so I've renamed some columns to make sure that they join appropriately and match what you've attempted. NOTE: You'll need to name the two means differently so that they don't get merged together or cause conflicts.
names(aggmeanb)[names(aggmeanb) == "x"] = "mean_b"
names(aggmeanb)[names(aggmeanb) == "time"] = "angle"
names(aggmeanz)[names(aggmeanz) == "x"] = "mean_z"
names(aggmeanz)[names(aggmeanz) == "time"] = "angle"
library(plyr) # assumption: join() here is plyr::join()
joined_data = join(aggmeanb, aggmeanz)
joined_data$divmean = joined_data$mean_b/joined_data$mean_z
> head(joined_data)
  angle treatment   mean_b   mean_z  divmean
1     0         S 56.26326 56.08416 1.003193
2    30         S 61.75166 61.20142 1.008991
3    60         S 43.16871 42.99007 1.004155
4    80         S 43.41474 42.46884 1.022273
5    90         S 41.91137 41.33542 1.013934
6     0         X 46.28009 45.71642 1.012330
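join() is not a base R function (it is presumably plyr::join()). If you would rather avoid the extra package, base R's merge() should give an equivalent table. A minimal sketch using three of the rows from the dput output, with the columns already renamed as above:

```r
# Three of the eleven rows, for brevity
aggmeanb <- data.frame(
  angle = c("0", "30", "0"),
  treatment = c("S", "S", "X"),
  mean_b = c(56.26325504249, 61.751655279608, 46.2800894420131)
)
aggmeanz <- data.frame(
  angle = c("0", "30", "0"),
  treatment = c("S", "S", "X"),
  mean_z = c(56.0841582902523, 61.2014237854156, 45.7164231791512)
)

# Match rows on both grouping columns, then take the ratio of the means
joined_data <- merge(aggmeanb, aggmeanz, by = c("angle", "treatment"))
joined_data$divmean <- joined_data$mean_b / joined_data$mean_z
print(joined_data)
```

Naming the by columns explicitly guards against accidental matches on any other shared column.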
ggplot(joined_data, aes(factor(angle), divmean)) +
geom_boxplot() +
theme_bw() + # apply the complete theme first
theme(panel.grid.minor = element_blank()) # so this tweak is not overwritten
It might be that the data you've included is just a bit of your real data set, but as is there's only one data point per angle-treatment group. However, when you are using a fuller dataset, you can try something like:
ggplot(joined_data, aes(factor(angle), divmean, group = treatment)) +
geom_boxplot() +
facet_grid(.~angle, scales = "free_x")
That will group the boxes by angle and then allow you to fill them by treatment.
Think about the problem in two steps:
1) Create a data frame (say data) which contains all the information you would like to visualize. In this case, this seems to be the two factors (angle, treatment), the mean group differences (say dif) and standard errors (say ste).
2) Visualize this information.
Step 2) will be easy. This should probably produce something very similar to your sketch.
ggplot(data, aes(x=angle, y=dif, colour=treatment)) +
geom_point(position=position_dodge(0.1)) +
geom_errorbar(aes(ymin=dif-ste, ymax=dif+ste), width=.1, position=position_dodge(0.1)) +
theme_bw()
However, at this point, you do not provide enough information to get help with Step 1. Try to include code which produces your original data (or the type of data you have) instead of copy-pasting chunks of your data output or pasting the aggregated data which lacks standard errors.
Combining your two aggregated data frames and generating random numbers for standard error produces the graph below:
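# I imported your two aggregated data frames from your dput output.
# rnorm(11) is only a random placeholder for the missing standard errors.
data <- cbind(aggmeanb, aggmeanz$x, rnorm(11))
# Column 3 comes from aggmeanb and column 4 from aggmeanz
names(data) <- c("angle", "treatment", "meanb", "meanz", "ste")
data$dif <- data$meanz - data$meanb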
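With the raw data, Step 1 could be sketched along the following lines. This uses only the four rows of z shown in the question, and it assumes the standard error of a group mean is sd / sqrt(n). The name stats_z and the derived column names are hypothetical.

```r
# The four rows of z shown in the question
z <- data.frame(
  Tracer = c(60, 51, 56.415, 56.410),
  angle = c(0, 0, 15, 15),
  treatment = c("S", "S", "X", "X")
)

# Mean, standard deviation and group size per angle/treatment group
stats_z <- aggregate(
  Tracer ~ angle + treatment, data = z,
  FUN = function(v) c(mean = mean(v), sd = sd(v), n = length(v))
)
# Flatten the matrix column into Tracer.mean, Tracer.sd, Tracer.n
stats_z <- do.call(data.frame, stats_z)

# Standard error of the mean for each group
stats_z$ste <- stats_z$Tracer.sd / sqrt(stats_z$Tracer.n)
print(stats_z)
```

The same summary for b can then be combined with this one on angle and treatment before plotting.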
This is what my dataframe looks like:
  Persnr  Date     AmountHolidays
1 55312   X201101  2
2 55312   X201102  4.5
3 55312   X201103  5
etc.
What I want is a graph that shows the amount of holidays (on the y-axis) for each period (Date on the x-axis) for a specific person (Persnr). Basically, it's a pivot chart in R. As far as I know, it is not possible to create such a graph.
Something like this is my desired result:
http://imgur.com/62VsYdJ
Is it possible in the first place to create such a model in R? If not, what is the best way for me to visualise such graph in R?
Thanks in advance.
Something like this could do the trick?
dat <- read.table(text="Persnr Date AmountHolidays
55312 2011-01-01 2
55312 2011-02-01 4.5
55312 2011-03-01 5
55313 2011-01-01 4
55313 2011-02-01 2.5
55313 2011-03-01 6", header=TRUE)
dat$Date <- as.POSIXct(dat$Date)
dat$Persnr <- as.factor(dat$Persnr)
# Build a primary graph
plot(AmountHolidays ~ Date, data = dat[dat$Persnr==55312,], type="l", col="red",
xlim = c(1293858000, 1299301200), ylim=c(0,8))
# Add additional lines to it
lines(AmountHolidays ~ Date, data = dat[dat$Persnr==55313,], type="l", col="blue")
# Build and place a legend
legend(x=as.POSIXct("2011-02-19"), y=2.2, legend = levels(dat$Persnr),fill = c("red", "blue"))
To set x coordinates, you can use either as.POSIXct("YYYY-MM-DD") or as.numeric(as.POSIXct("YYYY-MM-DD")), as I did for the xlim values.
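Instead of typing the numeric limits by hand, they can also be derived from the data. A sketch of that variant is below. The time zone "UTC", the + 2 padding on the y axis and the names x_limits, y_limits, ids and cols are my own assumptions.

```r
dat <- read.table(text = "Persnr Date AmountHolidays
55312 2011-01-01 2
55312 2011-02-01 4.5
55312 2011-03-01 5
55313 2011-01-01 4
55313 2011-02-01 2.5
55313 2011-03-01 6", header = TRUE)
dat$Date <- as.POSIXct(as.character(dat$Date), tz = "UTC")
dat$Persnr <- as.factor(dat$Persnr)

# Derive the axis limits from the data instead of typing them in
x_limits <- as.numeric(range(dat$Date))
y_limits <- c(0, max(dat$AmountHolidays) + 2)

ids <- levels(dat$Persnr)
cols <- c("red", "blue")  # one colour per person; extend for more people

# First person sets up the plot, the rest are added as lines
plot(AmountHolidays ~ Date, data = dat[dat$Persnr == ids[1], ], type = "l",
     col = cols[1], xlim = x_limits, ylim = y_limits)
for (i in seq_along(ids)[-1]) {
  lines(AmountHolidays ~ Date, data = dat[dat$Persnr == ids[i], ], col = cols[i])
}
legend("topleft", legend = ids, fill = cols)
```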
You can try with package ggplot2:
First option
library(ggplot2)
ggplot(dat, aes(x=Date, y=AmountHolidays, group=Persnr)) +
geom_line(aes(colour=Persnr)) + scale_colour_discrete()
or
Second option
ggplot(dat, aes(x=Date, y=AmountHolidays, group=Persnr)) +
geom_line() + facet_grid(~Persnr)
One of the advantages is that you don't need to have a line per Persnr or even to specify (to know) the name or number of Persnr.
example:
first option
second option
Data:
dat <- structure(list(Persnr = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("54000",
"55312"), class = "factor"), Date = structure(c(1L, 2L, 3L, 1L,
2L, 3L), .Label = c("2011-01-01", "2011-02-01", "2011-03-01"), class = "factor"),
AmountHolidays = c(5, 4.5, 2, 3, 6, 7)), .Names = c("Persnr",
"Date", "AmountHolidays"), row.names = c(3L, 5L, 6L, 1L, 2L,
4L), class = "data.frame")