I have panel data with ID=1,2,3... year=2007,2008,2009... and a factor foreign=0,1, and a variable X.
I would like to create a time series plot with x-axis=year, y-axis=values of X that compares the average (=mean) development of each factor over time. As there are 2 factors, there should be two lines, one solid and one dashed.
I assume that the first step involves the calculation of the means for each year and factor of X, i.e. in a panel setting. The second step should look something like this:
ggplot(data, aes(x=year, y=MEAN(X), group=Foreign, linetype=Foreign))+geom_line()+theme_bw()
Many thanks.
Using dplyr to calculate the means:
library(dplyr)
library(ggplot2)
# generate some data (because you didn't provide any, or any way of generating it...)
data = data.frame(ID = 1:200,
year = rep(1951:2000, each = 4),
foreign = rep(c(0, 1), 100),
x = rnorm(200))
# For each year, and separately for foreign or not, calculate mean x.
data.means <- data %>%
group_by(year, foreign) %>%
summarize(xmean = mean(x))
# plot. You don't need group = foreign
ggplot(data.means, aes(x = year, y = xmean, linetype = factor(foreign))) +
geom_line() +
theme_bw()
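If dplyr is not available, the same per-year, per-group means could also be computed in base R. The following is only a sketch under the same assumptions as the toy data above (column names `year`, `foreign`, `x`; the seed is added here purely for reproducibility):

```r
# Toy data mirroring the example above (set.seed is an addition for reproducibility)
set.seed(1)
data <- data.frame(ID = 1:200,
                   year = rep(1951:2000, each = 4),
                   foreign = rep(c(0, 1), 100),
                   x = rnorm(200))

# Mean of x for every (year, foreign) combination, using base R only
data.means <- aggregate(x ~ year + foreign, data = data, FUN = mean)

# Rename the summary column so it matches the dplyr version
names(data.means)[names(data.means) == "x"] <- "xmean"

head(data.means)
```

The resulting `data.means` has one row per year and group, so it can be passed to the same ggplot() call as above.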
I've got a dataset of different energies (eV) and related counts. I changed the detection wavelength throughout the measurement, which resulted in a first column containing all wavelengths, followed by further columns. In those columns, many rows are filled with NAs because no data was measured at the specific wavelength.
I would like to plot the spectra in R, but it doesn't work because the length of X and y values differs for each column.
It would be great, if someone could help me.
Thank you very much.
It would be better if we could work with (simulated) data you provided. Here's my attempt at trying to visualize your problem the way I see it.
library(ggplot2)
library(tidyr)
# create and fudge the data
xy <- data.frame(measurement = 1:20, red = rnorm(20), green = rnorm(20, mean = 10), uv = NA)
xy[16:20, "green"] <- NA
xy[16:20, "uv"] <- rnorm(5, mean = -3)
# flow it into "long" format
xy <- gather(xy, key = color, value = value, - measurement)
# plot
ggplot(xy, aes(x = measurement, y = value, group = color, colour = color)) +
theme_bw() +
geom_line()
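Because each series was only measured over part of the range, it can help to drop the NA rows after reshaping, so that every series keeps only the points that were actually measured. Here is a minimal base R sketch of that step, assuming the same simulated columns as above (`red`, `green`, `uv`); the seed and the `series` helper vector are additions for illustration:

```r
# Simulated wide data: one column per series, NA where nothing was measured
set.seed(1)
xy <- data.frame(measurement = 1:20,
                 red = rnorm(20),
                 green = rnorm(20, mean = 10),
                 uv = NA_real_)
xy[16:20, "green"] <- NA
xy[16:20, "uv"] <- rnorm(5, mean = -3)

# Reshape to long format with base R: stack() concatenates the series columns
series <- c("red", "green", "uv")
long <- data.frame(measurement = rep(xy$measurement, times = length(series)),
                   stack(xy[series]))
names(long) <- c("measurement", "value", "color")

# Keep only the rows that were actually measured
long <- long[!is.na(long$value), ]

table(long$color)
```

After this step the x and y values of each series have matching lengths, which is what the plotting functions need.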
I have data of the form
cvar1 cvar2 numvar
a x 0.1
a y 0.2
b x 0.15
b y 0.25
That is, two categorical variables, and one numerical variable.
Using ggplot2, I can get a nice scatter plot that plots the data for each combination of cvar1 and cvar2 by doing qplot(y=numvar, x=interaction(cvar1, cvar2)). This gives me several columns of points like this:
To each of these columns I would like to add a small horizontal line representing the mean of the data points in that column. And a similar small horizontal line for the mean + sd and the mean - sd. (Kind of a bastardized box plot, but with all points visible and using mean and sd rather than median and IQR.) Thanks in advance!
You can create a table that contains the mean and mean +/- sd for each group of points. Then you can plot lines using geom_segment().
First, I create some sample data:
set.seed(1245)
data <- data.frame(cvar1 = rep(letters[1:2], each = 12),
cvar2 = rep(letters[25:26], times = 12),
numvar = runif(2*12))
This creates the table with the values that you need using dplyr and tidyr:
library(dplyr)
library(tidyr)
summ <- group_by(data, cvar1, cvar2) %>%
summarise(mean = mean(numvar),
low = mean - sd(numvar),
high = mean + sd(numvar)) %>%
gather(variable, value, mean:high)
The three lines do the following: First, the data is split into the groups and then for each group the three required values are calculated. Finally, the data is converted to long format, which is needed for ggplot(). (Maybe you are more familiar with melt(), which does basically the same thing as gather().)
And finally, this creates the plot:
library(ggplot2)
ggplot(data) + geom_point(aes(x = interaction(cvar1, cvar2), y = numvar)) +
geom_segment(data = summ,
aes(x = as.numeric(interaction(cvar1, cvar2)) - .5,
xend = as.numeric(interaction(cvar1, cvar2)) + .5,
y = value, yend = value, colour = variable))
You probably won't want the colours. I just added them to make the example more clear.
geom_segment() needs the start and end coordinates of each line to be specified. Because interaction(cvar1, cvar2) is a factor, it needs to be converted to numeric before it is possible to do arithmetic with it. I added and subtracted 0.5 to interaction(cvar1, cvar2), which makes the lines quite wide. Choosing a smaller value will make the lines shorter.
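To make the position arithmetic concrete, here is a base R sketch that builds the segment endpoints by hand. It assumes the same sample data as above; `half_width` and `segs` are hypothetical names introduced only for this illustration, and 0.2 is an arbitrary choice for shorter lines:

```r
# Same sample data as in the answer
set.seed(1245)
data <- data.frame(cvar1 = rep(letters[1:2], each = 12),
                   cvar2 = rep(letters[25:26], times = 12),
                   numvar = runif(2 * 12))

# The factor whose levels define the columns of points on the x-axis
grp <- interaction(data$cvar1, data$cvar2)

# Half the length of each horizontal line (hypothetical value; 0.5 in the answer)
half_width <- 0.2

# Per-group mean and standard deviation, in the order of levels(grp)
grp_mean <- tapply(data$numvar, grp, mean)
grp_sd <- tapply(data$numvar, grp, sd)

# Level i of the factor sits at x = i, so each segment spans i -/+ half_width
segs <- data.frame(group = names(grp_mean),
                   x = seq_along(grp_mean) - half_width,
                   xend = seq_along(grp_mean) + half_width,
                   mean = as.numeric(grp_mean),
                   low = as.numeric(grp_mean - grp_sd),
                   high = as.numeric(grp_mean + grp_sd))
segs
```

Each row of `segs` describes the three horizontal lines for one column of points; a wide table like this would need one geom_segment() layer per statistic, or a reshape to long format as in the answer.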
I have this dataframe and this plot :
df <- data.frame(Groupe = rep(c("A","B"),4),
Period = gl(4,2,8,c("t0","t1","t2","t3","t4")),
rate = c(0.83,0.96,0.75,0.93,0.67,0.82,0.65,0.73))
ggplot(data = df, mapping = aes(y = rate, x = Period ,group = Groupe, colour=Groupe, shape=Groupe)) +
geom_line(size=1.2) +
geom_point(size=5)
How could I organize my data so that the points between t1 and t2 are not connected with a line? I'd like t0 and t1 to be connected (blue or red according to the group), t2 and t3 connected in the same way, but no lines between t1 and t2. I tried several things by looking at similar questions, but it always messes up my grouping colors :/
Creating a new grouping variable manually is mostly not the best way. So, a slightly different approach which requires less hardcoding:
# create new grouping variable
df$grp <- c(1,2)[df$Period %in% c("t2","t3","t4") + 1L]
# create the plot and use the interaction between 'Groupe' and 'grp' as group
ggplot(df, aes(x = Period, y = rate,
group = interaction(Groupe,grp),
colour = Groupe,
shape = Groupe)) +
geom_line(size=1.2) +
geom_point(size=5)
This gives the same plot as in the other answer:
The best way to handle a problem like this in ggplot is often to create an additional column in your data frame that indicates the grouping you want to work with in your data. For example, here I've added an extra column gp to your data frame:
df$gp <- c(1,2,1,2,3,4,3,4)
ggplot(data = df, aes(y = rate, x = Period, group = gp, colour=Groupe, shape=Groupe)) +
geom_line(size=1.2) +
geom_point(size=5)
The result is, I believe, what you are looking for:
If you make Period a numerical column rather than a character vector or factor, you can more easily generate a column like gp automatically rather than manually specifying it (perhaps using ifelse or cases to create it) - this would be useful if you wanted to do the same thing many times or with a large data frame.
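As a rough sketch of that idea, assuming Period is stored as the numbers 0 to 3 instead of the labels t0 to t3 (the `block` column is a hypothetical helper introduced here):

```r
# Same data as in the question, but with a numeric Period (0 = t0, ..., 3 = t3)
df <- data.frame(Groupe = rep(c("A", "B"), 4),
                 Period = rep(0:3, each = 2),
                 rate = c(0.83, 0.96, 0.75, 0.93, 0.67, 0.82, 0.65, 0.73))

# Periods 0-1 form the first block, periods 2-3 the second
df$block <- ifelse(df$Period <= 1, 1, 2)

# One group id per (Groupe, block) combination
df$gp <- as.integer(interaction(df$Groupe, df$block))

df$gp
# → 1 2 1 2 3 4 3 4
```

This reproduces the manually specified vector, and the same two lines keep working if more rows or groups are added.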
Would appreciate help with generating a 2D histogram of frequencies, where frequencies are calculated within a column. My main issue: converting from counts to column based frequency.
Here's my starting code:
# expected packages
library(ggplot2)
library(plyr)
# generate example data corresponding to expected data input
x_data = sample(101:200,10000, replace = TRUE)
y_data = sample(1:100,10000, replace = TRUE)
my_set = data.frame(x_data,y_data)
# define x and y interval cut points
x_seq = seq(100,200,10)
y_seq = seq(0,100,10)
# label samples as belonging within x and y intervals
my_set$x_interval = cut(my_set$x_data,x_seq)
my_set$y_interval = cut(my_set$y_data,y_seq)
# determine count for each x,y block
xy_df = ddply(my_set, c("x_interval","y_interval"),"nrow") # still need to convert for use with dplyr
# convert from count to frequency based on formula: freq = count/sum(count in given x interval)
################ TRYING TO FIGURE OUT #################
# plot results
fig_count <- ggplot(xy_df, aes(x = x_interval, y = y_interval)) + geom_tile(aes(fill = nrow)) # count
fig_freq <- ggplot(xy_df, aes(x = x_interval, y = y_interval)) + geom_tile(aes(fill = freq)) # frequency
I would appreciate any help in how to calculate the frequency within a column.
Thanks!
jac
EDIT: I think the solution will require the following steps
1) Calculate and store overall counts for each x-interval factor
2) Divide the individual bin count by its corresponding x-interval factor count to obtain frequency.
Not sure how to carry this out though.
If you want to normalize over the x_interval values, you can create a column with the total count per interval and then divide by that. I must admit I'm not a ddply wiz, so maybe there is an easier way, but I would do
xy_df$xnrows<-with(xy_df, ave(nrow, x_interval, FUN=sum))
then
fig_freq <- ggplot(xy_df, aes(x = x_interval, y = y_interval)) +
geom_tile(aes(fill = nrow/xnrows))
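To check that the normalization does what is intended, the frequencies within each x interval should sum to 1. Here is a self-contained base R sketch of that check; it counts with table() instead of ddply, which is a substitution made here only to avoid the plyr dependency, and the seed is an addition:

```r
# Example data as in the question (seed added for reproducibility)
set.seed(42)
my_set <- data.frame(x_data = sample(101:200, 10000, replace = TRUE),
                     y_data = sample(1:100, 10000, replace = TRUE))
my_set$x_interval <- cut(my_set$x_data, seq(100, 200, 10))
my_set$y_interval <- cut(my_set$y_data, seq(0, 100, 10))

# Count per (x_interval, y_interval) block; column named nrow as in the question
xy_df <- as.data.frame(table(x_interval = my_set$x_interval,
                             y_interval = my_set$y_interval),
                       responseName = "nrow")

# Total count within each x interval, repeated on every row of that interval
xy_df$xnrows <- ave(xy_df$nrow, xy_df$x_interval, FUN = sum)

# Column-based frequency: block count divided by its x-interval total
xy_df$freq <- xy_df$nrow / xy_df$xnrows

# Each x interval should now sum to 1
tapply(xy_df$freq, xy_df$x_interval, sum)
```

With the `freq` column stored in the data frame, the fig_freq call from the question (fill = freq) would work unchanged.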
I have a dataframe which is a history of runs. Some of the variables include a date (in POSIXct) and a value for that run (here = size). I want to produce various graphs showing a line based on the total of the size column for a particular date range. Ideally I'd like to use the same dataset and switch between totals per week, 2 weeks, month, or quarter.
Here's an example dataset;
require(ggplot2)
set.seed(666)
today <- Sys.time()  # reference time; the sampled dates cover the year before it
foo<-data.frame(Date=sample(seq(today-(365*24*60*60), today, by="day"),50, replace=FALSE),
value=rnorm(50, mean=100, sd=25),
type=sample(c("Red", "Blue", "Green"), 50, replace=TRUE))
I can create this plot which shows individual values;
ggplot(data=foo, aes(x=Date, y=value, colour=type))+stat_summary(fun.y=sum, geom="line")
Or I can do this to show a sum per Month;
ggplot(data=foo, aes(x=format(Date, "%m %y"), y=value, colour=type))+stat_summary(fun.y=sum, geom="line", aes(group=type))
However it gets more complicated to do sums per quarter / 2 weeks etc. Ideally I'd like something like the stat_bin and stat_summary combined so I could specify a binwidth (or have ggplot make a best guess based on the range)
Am I missing something obvious, or is this just not possible ?
It's pretty easy with plyr and lubridate to do all the calculations yourself:
library(plyr)
library(lubridate)
foo <- data.frame(
date = sample(today() + days(1:365), 50, replace = FALSE),
value = rnorm(50, mean = 100, sd = 25),
type = sample(c("Red", "Blue", "Green"), 50, replace = TRUE))
foo$date2 <- floor_date(foo$date, "week")
foosum <- ddply(foo, c("date2", "type"), summarise,
n = length(value),
mean = mean(value))
ggplot(foosum, aes(date2, mean, colour = type)) +
geom_point(aes(size = n)) +
geom_line()
The chron package could be very useful to convert dates in a way not covered in the "basic" format command. But the latter can also do smart things (like the strftime in PHP), e.g.:
Show given year and month of a date:
format(foo$Date, "%Y-%m")
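And showing the appropriate quarter of the year (quarters() is a base R generic with methods for Date and POSIXt objects, so chron is not strictly required for this step):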
quarters(foo$Date)
To compute the 2-week period, you might not find a ready-made function, but it can easily be computed from the week number, e.g.:
floor(as.numeric(format(foo$Date, "%V"))/2)+1
After computing the new variables in the dataframe, you could easily plot your data just like in your original example.
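Putting these pieces together, a base R sketch might look like the following. It assumes a Date column rather than POSIXct, uses a fixed start date so the example is reproducible, and counts the two-week bins from the first observed date instead of from the ISO week number; all of these are choices made for this illustration, not part of the answer above:

```r
# Example data with a fixed date range (assumption: Date class, year 2020)
set.seed(666)
foo <- data.frame(Date = as.Date("2020-01-01") + sample(0:364, 50),
                  value = rnorm(50, mean = 100, sd = 25),
                  type = sample(c("Red", "Blue", "Green"), 50, replace = TRUE))

# Period labels derived from the date
foo$month <- format(foo$Date, "%Y-%m")
foo$quarter <- paste(format(foo$Date, "%Y"), quarters(foo$Date))

# Two-week bins counted from the earliest date in the data
foo$fortnight <- as.numeric(foo$Date - min(foo$Date)) %/% 14 + 1

# Sum of value per period and type; swap 'quarter' for 'month' or 'fortnight'
quarter_sums <- aggregate(value ~ quarter + type, data = foo, FUN = sum)
quarter_sums
```

Counting fortnights from the first date avoids the jump that week numbers make at a year boundary, at the cost of bins that no longer line up with calendar weeks.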