R: how to bin weighted data

Hi, I'm trying to draw a histogram in ggplot, but my data doesn't list every observation; it only has the values and their number of occurrences.
value=c(1,2,3,4,5,6,7,8,9,10)
weight<-c(8976,10857,10770,14075,18075,20757,24770,14556,11235,8042)
df <- data.frame(value,weight)
df
value weight
1 1 8976
2 2 10857
3 3 10770
4 4 14075
5 5 18075
6 6 20757
7 7 24770
8 8 14556
9 9 11235
10 10 8042
Does anybody know either how to bin the values or how to plot a histogram of the binned values?
I want to get something that looks like:
bin weight
1 1-2 19833
2 3-4 24845
...

I would add another variable that designates the binning and then
df$group <- rep(c("1-2", "3-4", "5-6", "7-8", "9-10"), each = 2)
draw it using ggplot.
library(ggplot2)
# fun replaces fun.y, which was deprecated in ggplot2 3.3.0
ggplot(df, aes(y = weight, x = group)) + stat_summary(fun = "sum", geom = "bar")

Here's one method for binning the data up:
df$bin <- findInterval(df$value,seq(1,max(df$value),2))
result <- aggregate(df["weight"],df["bin"],sum)
# get your named bins automatically without specifying them individually
result$bin <- tapply(df$value,df$bin,function(x) paste0(x,collapse="-"))
# result
bin weight
1 1-2 19833
2 3-4 24845
3 5-6 38832
4 7-8 39326
5 9-10 19277
# barplot it (base example since Roman has covered ggplot)
with(result,barplot(weight,names.arg=bin))
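As a variation on the approach above, here is a minimal base R sketch that bins with cut() instead of findInterval(). The width-2 breaks and the hand-written labels are assumptions chosen to match the bins in the question; adjust them for other data.

```r
value  <- 1:10
weight <- c(8976, 10857, 10770, 14075, 18075, 20757, 24770, 14556, 11235, 8042)

# Assumed breaks: intervals (0,2], (2,4], ..., (8,10], labelled by hand
bin <- cut(value, breaks = seq(0, 10, by = 2),
           labels = c("1-2", "3-4", "5-6", "7-8", "9-10"))

# Sum the weights that fall into each bin
binned <- tapply(weight, bin, sum)
print(binned)
```

The resulting named vector can be passed straight to barplot(binned).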

Just expand your data:
value=c(1,2,3,4,5,6,7,8,9,10)
weight<-c(8976,10857,10770,14075,18075,20757,24770,14556,11235,8042)
dat = rep(value,weight)
# plot result
histres = hist(dat)
And histres contains some potentially useful information if you want details of the histogram data.
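To illustrate what the object returned by hist() holds, the sketch below requests explicit width-2 breaks (an assumption, to mirror the bins asked for in the question) and suppresses drawing with plot = FALSE so that only the computed bins are returned.

```r
value  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
weight <- c(8976, 10857, 10770, 14075, 18075, 20757, 24770, 14556, 11235, 8042)
dat <- rep(value, weight)   # one element per occurrence

# Assumed breaks: bins [0,2], (2,4], ..., (8,10]
h <- hist(dat, breaks = seq(0, 10, by = 2), plot = FALSE)

h$breaks   # bin edges
h$counts   # number of occurrences per bin
h$mids     # bin midpoints
```

With these breaks the counts equal the summed weights per bin, so the expanded data and the aggregated table describe the same histogram.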

Related

R: a plot to visualise correlation of two groups with one measurement?

I have two groups with one measurement variable.
I would like to plot them on one graph to see if they show a correlation or they overlap.
The measurement for both groups is on the same scale.
I thought of doing a scatter plot, but in this case, I thought it would just give me a straight line as I only have one measurement.
Could I get some ideas and suggestions please?
You can unstack the data.
set.seed(1234)
df <- data.frame(var = rnorm(200, 50, 10), gp = gl(2,100))
head(df)
var gp
1 37.92934 1
2 52.77429 1
3 60.84441 1
4 26.54302 1
5 54.29125 1
6 55.06056 1
unstack(df)
X1 X2
1 37.92934 54.14524
2 52.77429 45.25282
3 60.84441 50.65993
4 26.54302 44.97522
5 54.29125 41.74001
6 55.06056 51.66989
And then plot this.
library(ggplot2)
library(dplyr)
unstack(df) %>% ggplot(aes(x=X1, y=X2)) +
geom_point() +
geom_smooth(method="lm")
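If a single number summarising the relationship is wanted alongside the plot, cor() can be applied to the unstacked columns. This is only a sketch: it assumes the i-th value of group 1 is genuinely paired with the i-th value of group 2, which the question does not confirm. In the simulated data below the two groups are independent draws, so no strong correlation should be expected.

```r
set.seed(1234)
df <- data.frame(var = rnorm(200, 50, 10), gp = gl(2, 100))

wide <- unstack(df, var ~ gp)   # columns X1 and X2, one per group
r <- cor(wide$X1, wide$X2)      # Pearson correlation of the row-wise pairs
r
```

If the rows are not paired, comparing the two distributions (for example with boxplot(var ~ gp, data = df)) is probably more meaningful than a correlation.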

Approach for plotting means from a data frame

I'm trying to develop a flexible script to plot the mean of a continuous variable, 'score', as a function of discrete time points, 'day', from a data frame.
I can do this by creating subsets, but I have a big set of data with many factor vectors like 'day,' so would like to get vectors or a data frame for each factor and its corresponding mean.
I have a data frame structured like this:
subject day score
1 0 99.13
2 0 NA
3 0 86.87
1 7 73.71
2 7 82.42
3 7 84.45
1 14 66.88
2 14 83.73
3 14 NA
I tried tapply(), but couldn't get it to output vectors or tables with appropriate headers while also handling NAs.
Looking for a simple bit of code to get two vectors or a data frame with which to plot mean of 'score' as a function of factor 'day'.
So the plot will have point for average score on each day 0, 7, and 14.
I have seen a lot of posts for doing this directly with ggplot, but it seems useful to know how to do, and I need to see the output to make sure it is handling NAs correctly.
If you are able to help, please include explanatory annotations in your script. Thanks!
I think tapply should be able to handle this; you can amend the function to remove NAs:
df=data.frame("subject"=rep(1:3,3), "day"=as.factor(rep(c(0,7,14),each=3)),
"score"=c(99.13,NA,86.87,73.71,82.42,84.45,66.88,83.73,NA))
# mean of score within each level of day, ignoring NAs
res = with(df, tapply(score, day, function(x) mean(x,na.rm=TRUE)))
EDIT to get day and score as vectors
day=as.numeric(names(res))
day
0 7 14
score=as.numeric(res)
score
93.00000 80.19333 75.30500
Plot in base R:
plot(x=as.numeric(as.character(df$day)),y=df$score,type="p")
# names(res) is character, so convert it to numeric to match the x-axis
lines(x=as.numeric(names(res)),y=res, col="red")
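If a data frame with proper column headers is preferred over a named vector, aggregate() with a formula is one option. A minimal sketch, assuming day is kept numeric; with the formula interface, rows containing NA are dropped by default (na.action = na.omit) before the mean is taken.

```r
df <- data.frame(subject = rep(1:3, 3),
                 day     = rep(c(0, 7, 14), each = 3),
                 score   = c(99.13, NA, 86.87, 73.71, 82.42, 84.45, 66.88, 83.73, NA))

# One row per day, holding the mean of the non-missing scores
means <- aggregate(score ~ day, data = df, FUN = mean)
means
```

The columns means$day and means$score can then be plotted directly, for example with plot(means$day, means$score, type = "b").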
It is not entirely clear what you are trying to achieve. Here I will show how to use the ggplot2 package to create a point plot with the mean for each group, assuming that dt is your data frame.
library(ggplot2)
ggplot(dt, aes(x = day, y = score, color = factor(subject))) + # Specify x, y and color information
geom_point(size = 3) + # plot the point and specify the size is 3
scale_color_brewer(name = "Subject",
type = "qual",
palette = "Pastel1") + # Format the color of points and the legend using ColorBrewer
scale_x_continuous(breaks = c(0, 7, 14)) + # Set the breaks on x-axis
stat_summary(fun = "mean", # fun replaces fun.y, which was deprecated in ggplot2 3.3.0
color = "red",
geom = "point",
size = 5,
shape = 8) + # Compute mean of each group and plot it
theme_classic() # Specify the theme
If you run the above code, you will get a plot along with these warning messages:
Warning messages:
1: Removed 2 rows containing non-finite values (stat_summary).
2: Removed 2 rows containing missing values (geom_point).
The warnings mean that the NA values have been removed, so you don't need to remove them from the dataset yourself.
DATA
dt <- read.table(text = "subject day score
1 0 99.13
2 0 NA
3 0 86.87
1 7 73.71
2 7 82.42
3 7 84.45
1 14 66.88
2 14 83.73
3 14 NA",
header = TRUE, stringsAsFactors = FALSE)

Reordering legend while modifying one particular line for a line chart in ggplot

Let's say I have a simple data frame as shown below:
> A <- data.frame(x=1:10, a=rep(1,10), d=rep(2,10), b=rep(3,10))
> A
x a d b
1 1 1 2 3
2 2 1 2 3
3 3 1 2 3
4 4 1 2 3
5 5 1 2 3
6 6 1 2 3
7 7 1 2 3
8 8 1 2 3
9 9 1 2 3
10 10 1 2 3
I want to plot this with x on the x-axis and the other columns as lines on the y-axis. I want the line representing the final column to be a little thicker than the other lines. I can do this with the following code, which leads to the plot shown below it:
library(ggplot2)
#Plot that creates a thicker line for last column of data.
#However, order of legend is changed to alphabetical order.
p <- ggplot(A, aes(x))
for(i in 2:length(A)){
gg.data <- data.frame(x=A$x, value=A[,i], name=names(A)[i])
if(i==length(A)){
p <- p + geom_line(data=gg.data, aes(y=value, color=name), size=1.1)
} else{
p <- p + geom_line(data=gg.data, aes(y=value, color=name))
}
}
Now the problem with the plot above is that the order of the variables in the legend has changed to align with alphabetical order. I don't want that; instead I want the order to remain a,d,b.
I can keep the order as I wish by using melt and then plotting using the code below, but now I don't see how to increase the size of the line representing the last column in A.
library(reshape2) # provides melt()
Amelt <- melt(A, id.vars='x')
#Plot that orders legend according to order of columns in data frame.
#However, not sure how to thicken one particular line over the others.
pmelt <- ggplot(Amelt)+geom_line(aes(x=x, y=value, color=variable))
How can I get both things that I want?
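Have you tried using scale_color_discrete(breaks=c("a","d","b")) to specify the order of the legend entries? The lines are mapped to the color aesthetic, so it is the color scale, not the fill scale, that controls this legend.
Please have a look at this link:
http://www.cookbook-r.com/Graphs/Legends_(ggplot2)/
Hope this helps!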
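One way to get both the column-order legend and a thicker last line is to build the long data with a factor whose levels follow the column order, then draw the last series in its own layer. This is a sketch rather than the only approach: the names series, last and long are illustrative, the reshaping is done in base R instead of with melt(), and linewidth assumes ggplot2 3.4.0 or later (older versions use size).

```r
A <- data.frame(x = 1:10, a = rep(1, 10), d = rep(2, 10), b = rep(3, 10))

series <- names(A)[-1]             # "a" "d" "b", in column order
last   <- series[length(series)]   # the series to draw thicker

# Long format; the factor levels fix the legend order to the column order
long <- data.frame(
  x     = rep(A$x, times = length(series)),
  value = unlist(A[series], use.names = FALSE),
  name  = factor(rep(series, each = nrow(A)), levels = series)
)

# Build the plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = x, y = value, colour = name)) +
    geom_line(data = subset(long, name != last)) +
    geom_line(data = subset(long, name == last), linewidth = 1.1)
  # print(p) displays the plot
}
```

Because the colour scale is trained on a factor, the legend follows the factor levels (a, d, b) rather than alphabetical order, and the separate layer lets the last line take its own width.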

How to create a histogram for one variable, using another to determine its frequency?

I'm new to R, and am using histograms for the first time. I need to construct a histogram to show the frequency of income for all 50 United States + District of Columbia.
This is the data given to me:
> data
X.Income. X.No.States.
1 -22.024 5
2 -25.027 13
3 -28.030 16
4 -31.033 9
5 -34.036 4
6 -37.039 2
7 -40.042 2
> hist(data$X.Income, col="red")
But that only produces a histogram of how often each income amount appears in the data, not the number of states that have that level of income. How do I account for the number of states that have each level of income in the chart?
Use a bar plot instead of a histogram, as the histogram expects to calculate the frequencies for you:
library(ggplot2)
# make some data to exercise
income = c(-22.024, -25.027, -28.030, -31.033, -34.036, -37.039,-40.042)
freq = c(5,13,16,9,4,2,2)
df <- data.frame(income, freq)
names(df) <- c("income","freq")
# the graph object
p <- ggplot(data=df) +
aes(x=income, y=freq) +
geom_bar(stat="identity", fill="red")
# call the object to view
p
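The same chart can also be drawn without ggplot2. A minimal base R sketch, assuming the counts from the question; the check that they sum to 51 (50 states plus the District of Columbia) is an added sanity test, not part of the original answer.

```r
income <- c(-22.024, -25.027, -28.030, -31.033, -34.036, -37.039, -40.042)
freq   <- c(5, 13, 16, 9, 4, 2, 2)

stopifnot(sum(freq) == 51)   # 50 states + District of Columbia

# Bar heights are the given counts; bars are labelled with the income values
mids <- barplot(freq, names.arg = income, col = "red",
                xlab = "Income", ylab = "Number of states")
```

barplot() returns the x positions of the bar centres (stored here in mids), which is handy if labels need to be added above the bars later.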

Calculate the run length of a variable and plot with ggplot

I'm using ggplot to plot an ordered sequence of numbers that is colored by a factor. For example, given this fake data:
# Generate fake data
library(dplyr)
library(ggplot2)
set.seed(12345)
plot.data <- data.frame(fitted = rnorm(20),
actual = sample(0:1, 20, replace=TRUE)) %>%
arrange(fitted)
head(plot.data)
fitted actual
1 -1.8179560 0
2 -0.9193220 1
3 -0.8863575 1
4 -0.7505320 1
5 -0.4534972 1
6 -0.3315776 0
I can easily plot the actual column from rows 1–20 as colored lines:
# Plot with lines
ggplot(plot.data, aes(x=seq(length.out = length(actual)), colour=factor(actual))) +
geom_linerange(aes(ymin=0, ymax=1))
The gist of this plot is to show how often the actual numbers appear sequentially across the range of fitted values. As you can see in the image, sequential 0s and 1s are readily seen as sequential blue and red vertical lines.
However, I'd like to move away from the lines and use geom_rect instead to create bands for the sequential number. I can fake this with really thick lineranges:
# Fake rectangular regions with thick lines
ggplot(plot.data, aes(x=seq(length.out = length(actual)), colour=factor(actual))) +
geom_linerange(aes(ymin=0, ymax=1), size=10)
But the size of these lines is dependent on the number of observations—if they're too thick, they'll overlap. Additionally, doing this means that there are a bunch of extraneous graphical elements that are plotted (i.e. sequential rectangular sections are really just a bunch of line segments that bleed into each other). It would be better to use geom_rect instead.
However, geom_rect requires that data include minimum and maximum values for x, meaning that I need to reshape actual to look something like this instead:
xmin xmax colour
0 1 red
1 5 blue
I need to programmatically calculate the run length of each color to mark the beginning and end of that color. I know that R has the rle() function, which is likely the best option for calculating the run length, but I'm unsure about how to split the run length into two columns (xmin and xmax).
What's the best way to calculate the run length of a variable so that geom_rect can plot it correctly?
Thanks to @baptiste, it seems that the best way to go about this is to condense the data into just those rows where actual changes:
condensed <- plot.data %>%
mutate(x = seq_along(actual), change = c(0, diff(actual))) %>%
subset(change != 0 ) %>% select(-change)
first.row <- plot.data[1,] %>% mutate(x = 0)
condensed.plot.data <- rbind(first.row, condensed) %>%
mutate(xmax = lead(x),
xmax = ifelse(is.na(xmax), max(x) + 1, xmax)) %>%
rename(xmin = x)
condensed.plot.data
# fitted actual xmin xmax
# 1 -1.8179560 0 0 2
# 2 -0.9193220 1 2 6
# 3 -0.3315776 0 6 9
# 4 -0.1162478 1 9 11
# 5 0.2987237 0 11 14
# 6 0.5855288 1 14 15
# 7 0.6058875 0 15 20
# 8 1.8173120 1 20 21
ggplot(condensed.plot.data) +
geom_rect(aes(xmin=xmin, xmax=xmax, ymin=0, ymax=1, fill=factor(actual)))
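Since the question mentions rle(), here is an alternative sketch that derives xmin and xmax directly from the run lengths. It uses only the first six actual values shown in the question as a small example, and rects is an illustrative name.

```r
actual <- c(0, 1, 1, 1, 1, 0)

runs <- rle(actual)            # lengths and values of consecutive runs
xmax <- cumsum(runs$lengths)   # each run ends at the cumulative length
xmin <- xmax - runs$lengths    # and starts where the previous run ended

rects <- data.frame(xmin = xmin, xmax = xmax, actual = runs$values)
rects
```

The rects data frame has one row per run and can be passed to geom_rect(aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = 1, fill = factor(actual))). In this version each observation occupies exactly one unit of width.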
