Would appreciate help with generating a 2D histogram of frequencies, where frequencies are calculated within a column. My main issue: converting from counts to column based frequency.
Here's my starting code:
# expected packages
library(ggplot2)
library(plyr)
# generate example data corresponding to expected data input
x_data = sample(101:200,10000, replace = TRUE)
y_data = sample(1:100,10000, replace = TRUE)
my_set = data.frame(x_data,y_data)
# define x and y interval cut points
x_seq = seq(100,200,10)
y_seq = seq(0,100,10)
# label samples as belonging within x and y intervals
my_set$x_interval = cut(my_set$x_data,x_seq)
my_set$y_interval = cut(my_set$y_data,y_seq)
# determine count for each x,y block
xy_df = ddply(my_set, c("x_interval","y_interval"),"nrow") # still need to convert for use with dplyr
# convert from count to frequency based on formula: freq = count/sum(count in given x interval)
################ TRYING TO FIGURE OUT #################
# plot results
fig_count <- ggplot(xy_df, aes(x = x_interval, y = y_interval)) + geom_tile(aes(fill = nrow)) # count
fig_freq <- ggplot(xy_df, aes(x = x_interval, y = y_interval)) + geom_tile(aes(fill = freq)) # frequency
I would appreciate any help in how to calculate the frequency within a column.
Thanks!
jac
EDIT: I think the solution will require the following steps
1) Calculate and store overall counts for each x-interval factor
2) Divide the individual bin count by its corresponding x-interval factor count to obtain frequency.
Not sure how to carry this out though.
If you want to normalize over the x_interval values, you can create a column with the total count per interval and then divide by that. I must admit I'm not a ddply wiz, so maybe it has an easier way, but I would do
xy_df$xnrows<-with(xy_df, ave(nrow, x_interval, FUN=sum))
then
fig_freq <- ggplot(xy_df, aes(x = x_interval, y = y_interval)) +
geom_tile(aes(fill = nrow/xnrows))
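A minimal sketch of the two steps from the EDIT, in base R only. It assumes a fixed seed for reproducibility and uses table() as a stand-in for the ddply() count (unlike ddply, table() also keeps empty blocks with a count of 0). The result is stored in a freq column, which is the name the fig_freq call in the question expects.

```r
set.seed(1)  # assumed seed, only so the example is reproducible

# same example data and interval labels as in the question
my_set <- data.frame(x_data = sample(101:200, 10000, replace = TRUE),
                     y_data = sample(1:100, 10000, replace = TRUE))
my_set$x_interval <- cut(my_set$x_data, seq(100, 200, 10))
my_set$y_interval <- cut(my_set$y_data, seq(0, 100, 10))

# count per (x, y) block; base-R stand-in for the ddply() call
xy_df <- as.data.frame(table(x_interval = my_set$x_interval,
                             y_interval = my_set$y_interval),
                       responseName = "nrow")

# step 1 and 2: freq = count / total count in the same x interval
xy_df$freq <- xy_df$nrow / ave(xy_df$nrow, xy_df$x_interval, FUN = sum)

# sanity check: the frequencies within each x interval should sum to 1
col_sums <- tapply(xy_df$freq, xy_df$x_interval, sum)
print(col_sums)
```

With the freq column in place, geom_tile(aes(fill = freq)) should work as written in the question.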
I have this dataframe
df1 = data.frame(x_before = c(1,2,3,4,5,1,2,3,2,1),
x_after = c(2,1,2,3,6,7,2,2,2,3))
and I want to compare between the values of the two variables x_before and x_after using curves. I want them both on the same graph using ggplot. Thanks.
The following code will draw your variable against index, by status group (before/after)
# Import library
library(ggplot2)
# Data-management
df1 <- data.frame(y_before = c(1,2,3,4,5,1,2,3,2,1),
y_after = c(2,1,2,3,6,7,2,2,2,3))
df <- data.frame(y=c(df1$y_before, df1$y_after),
x=c(1:length(df1$y_before),
1:length(df1$y_after)),
group=c(
rep("before", length(df1$y_before)),
rep("after", length(df1$y_after))))
# Plot with y as the variable to plot, x being the index, group being the y variable status
ggplot(df) +
geom_line(aes(y=y, x = x, color = group))
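The same reshaping can probably be written more compactly with base R's stack(). This is only a sketch; it keeps the x_before/x_after column names from the question, and the index column name is an assumption.

```r
df1 <- data.frame(x_before = c(1, 2, 3, 4, 5, 1, 2, 3, 2, 1),
                  x_after  = c(2, 1, 2, 3, 6, 7, 2, 2, 2, 3))

# stack() puts all values in one column ("values") and the
# originating column name in a factor ("ind")
long <- stack(df1)

# stack() works column by column, so the row index simply repeats
long$index <- rep(seq_len(nrow(df1)), times = ncol(df1))

print(head(long))
```

The long data frame can then be plotted with geom_line(aes(x = index, y = values, color = ind)).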
I have data where sample counts have been pre-calculated for bins across a range, and the bins are overlapping and uneven sizes. Looks something like:
x2 <- data.frame("BinFrom" = c(1,1,2,2,4,4,4,5,5,5,8,8,8,9,11,14,17,18,19),
"BinTo" = c(3,6,4,8,5,8,6,10,12,6,7,15,11,10,20,20,18,19,20),
"Count" = c(1000,2400,15,2000,20,3800,10,6000,4200,10,25,3000,2800,10,1300,9000,10,5,40))
I wish to generate a histogram and density plot for these data. Is there a way to do this?
ggdensity etc expect the expanded data. I attempted to force that format by expanding on the mid-point of the bins, e.g.:
x2 <- x2 %>% mutate(MidBin = BinFrom + ((BinTo-BinFrom)/2))
xp <- x2 %>% expandRows(., "Count")
ggdensity(xp, "MidBin")
but this loses important data, and is not possible with my actual data frame as the row expansion exhausts the vector memory.
All help appreciated
Create a new data frame with one row per integer position, and for each position sum the counts of all bins that overlap it:
base=cbind.data.frame(base=c(min(x2$BinFrom):max(x2$BinTo)))
base$overlap=sapply(base$base,function(x) sum(x2$Count[x >= x2$BinFrom & x <= x2$BinTo ]))
Then plot:
ggplot(base,aes(x=base,y=overlap))+geom_bar(stat = "identity")
#or
ggplot(base,aes(x=base,y=overlap))+geom_area(alpha=0.25)
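Note that the overlap sum above adds a bin's full Count at every position it covers, so wide bins weigh more. A possible variant, sketched here under the assumption that each bin's count is spread evenly over its integer positions (the data do not tell us this), divides each Count by the bin width first. It also avoids expanding rows. The names valid, width and spread are made up for this example; one row of the example data runs from 8 to 7, and it is dropped here as an invalid bin.

```r
x2 <- data.frame(
  BinFrom = c(1,1,2,2,4,4,4,5,5,5,8,8,8,9,11,14,17,18,19),
  BinTo   = c(3,6,4,8,5,8,6,10,12,6,7,15,11,10,20,20,18,19,20),
  Count   = c(1000,2400,15,2000,20,3800,10,6000,4200,10,25,3000,2800,10,1300,9000,10,5,40))

# keep only bins whose end is not before their start
valid <- x2[x2$BinTo >= x2$BinFrom, ]

# inclusive width, matching the >= / <= test used for the overlap
width <- valid$BinTo - valid$BinFrom + 1
per_position <- valid$Count / width

grid <- min(valid$BinFrom):max(valid$BinTo)
spread <- data.frame(
  base = grid,
  weight = sapply(grid, function(g)
    sum(per_position[g >= valid$BinFrom & g <= valid$BinTo])))

# the spread weights should add up to the total count of the valid bins
print(all.equal(sum(spread$weight), sum(valid$Count)))
```

The spread data frame can be plotted the same way as base, with weight on the y axis.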
I've got a dataset of different energies (eV) and related counts. I changed the detection wavelength throughout the measurement, which resulted in a first column with all wavelengths and then further columns with the counts. Many rows in those columns are filled with NAs because no data was measured at that specific wavelength.
I would like to plot the spectra in R, but it doesn't work because the length of X and y values differs for each column.
It would be great, if someone could help me.
Thank you very much.
It would be better if we could work with (simulated) data you provided. Here's my attempt at trying to visualize your problem the way I see it.
library(ggplot2)
library(tidyr)
# create and fudge the data
xy <- data.frame(measurement = 1:20, red = rnorm(20), green = rnorm(20, mean = 10), uv = NA)
xy[16:20, "green"] <- NA
xy[16:20, "uv"] <- rnorm(5, mean = -3)
# flow it into "long" format
xy <- gather(xy, key = color, value = value, - measurement)
# plot
ggplot(xy, aes(x = measurement, y = value, group = color)) +
theme_bw() +
geom_line()
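If the NAs themselves are the problem, one option is to drop them after reshaping, so that each series keeps only the rows where it was actually measured. A base R sketch on the same simulated data (the seed and the name long are assumptions for this example):

```r
set.seed(42)  # assumed seed, only for reproducibility

# same simulated layout as above: one x column, several partly empty y columns
xy <- data.frame(measurement = 1:20,
                 red = rnorm(20),
                 green = rnorm(20, mean = 10),
                 uv = NA_real_)
xy[16:20, "green"] <- NA
xy[16:20, "uv"] <- rnorm(5, mean = -3)

# long format: stack() returns the columns "values" and "ind"
long <- data.frame(measurement = rep(xy$measurement, times = 3),
                   stack(xy[, c("red", "green", "uv")]))
names(long)[2:3] <- c("value", "color")

# drop the unmeasured points so every series has matching x and y lengths
long <- long[!is.na(long$value), ]

print(table(long$color))
```

Mapping colour = color inside aes() would additionally give each spectrum its own line colour.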
I have panel data with ID=1,2,3... year=2007,2008,2009... and a factor foreign=0,1, and a variable X.
I would like to create a time series plot with x-axis=year, y-axis=values of X that compares the average (=mean) development of each factor over time. As there are 2 factors, there should be two lines, one solid and one dashed.
I assume that the first step involves the calculation of the means for each year and factor of X, i.e. in a panel setting. The second step should look something like this:
ggplot(data, aes(x=year, y=MEAN(X), group=Foreign, linetype=Foreign))+geom_line()+theme_bw()
Many thanks.
Using dplyr to calculate the means:
library(dplyr)
library(ggplot2)
# generate some data (because you didn't provide any, or any way of generating it...)
data = data.frame(ID = 1:200,
year = rep(1951:2000, each = 4),
foreign = rep(c(0, 1), 100),
x = rnorm(200))
# For each year, and separately for foreign or not, calculate mean x.
data.means <- data %>%
group_by(year, foreign) %>%
summarize(xmean = mean(x))
# plot. You don't need group = foreign
ggplot(data.means, aes(x = year, y = xmean, linetype = factor(foreign))) +
geom_line() +
theme_bw()
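For comparison, the same group means can probably be computed without dplyr using base R's aggregate(). This is a sketch on the same generated data; the seed is an assumption.

```r
set.seed(1)  # assumed seed, only for reproducibility

data <- data.frame(ID = 1:200,
                   year = rep(1951:2000, each = 4),
                   foreign = rep(c(0, 1), 100),
                   x = rnorm(200))

# mean of x for every combination of year and foreign
data.means <- aggregate(x ~ year + foreign, data = data, FUN = mean)

# rename so the result matches the column name used in the plot above
names(data.means)[names(data.means) == "x"] <- "xmean"

print(head(data.means))
```

The resulting data.means has one row per year and foreign value and can be passed to the same ggplot call.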
I'd like to create a graph in R, with Date on the Y axis, and the total number of observations on that date / the number of specific observations on the X-axis. However, I'm not sure how to get the total number of observations per date.
ggplot(aes = (x = Date, y = (<number_of_observations> / (colour = 'Red'))),
data = cardata) +
geom_histogram()
How can I do this, so I can get a number of specific observations? (e.g. so I can compare the number of Red cars with the total number of cars)
I'm not sure I am following your question, but the dplyr package would suggest something like this. Without some sample data it is hard to be more precise:
library(dplyr)
# count the rows per Date, then express each count as a share of all rows
df <- data %>%
  group_by(Date) %>%
  summarise(DateObservations = n()) %>%
  mutate(DatePct = DateObservations / nrow(data))
Then you could ggplot it:
ggplot(df, aes(x = Date, y = DatePct)) + geom_bar(stat = "identity")
Try:
ggplot(data = cardata[cardata$colour == 'Red', ], aes(x = Date)) + geom_histogram()
So you filter on your condition when you define the data.
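To compare the red cars with the total per date, both counts can be computed side by side. A small base R sketch; the contents of cardata are made up here, since the question gives no sample data, and per_date and red_share are hypothetical names.

```r
# made-up stand-in for the asker's cardata
cardata <- data.frame(
  Date = as.Date("2024-01-01") + c(0, 0, 0, 1, 1, 2, 2, 2, 2),
  colour = c("Red", "Blue", "Red", "Blue", "Blue",
             "Red", "Red", "Blue", "Green"))

# total observations per date, and red observations per date
total <- tapply(cardata$colour, cardata$Date, length)
red <- tapply(cardata$colour == "Red", cardata$Date, sum)

per_date <- data.frame(Date = as.Date(names(total)),
                       total = as.vector(total),
                       red = as.vector(red),
                       red_share = as.vector(red / total))

print(per_date)
```

per_date can then be plotted with, for example, geom_bar(stat = "identity") and red_share on the y axis.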